A proteomics data “gold mine” at your
disposal: Now that the data is there, what
can we do with it?
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK
E-mail: juan@ebi.ac.uk
Juan A. Vizcaíno
juan@ebi.ac.uk
2nd International de.NBI Symposium
Bielefeld, 23 October 2017
Overview
• Short introduction to proteomics and PRIDE
• Reuse of public proteomics data
• Open analysis pipelines for proteomics data
• Challenges for public proteomics data reuse
New ELIXIR Proteomics “Community”
Vizcaíno et al., F1000Research, 2017
• PRIDE stores mass spectrometry (MS)-based
proteomics data:
• Peptide and protein expression data
(identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
• Any type of data can be stored
• Leading ProteomeXchange.
• From July 2017, an ELIXIR core resource
PRIDE (PRoteomics IDEntifications) Archive
http://www.ebi.ac.uk/pride/archive
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
Stats (1): Data submissions to PRIDE Archive continue to increase
Top months: 224 and 234 datasets submitted in July & August,
respectively (~2,000 in 2016).
> 400 TB of data
ProteomeXchange: A global, distributed proteomics database
• PRIDE (MS/MS data)
• MassIVE (MS/MS data)
• jPOST (MS/MS data)
• PASSEL (SRM data)
• Data types shown in the diagram: Raw, ID/Q, Meta
• Mandatory raw data deposition since July 2015
• Goal: Development of a framework to allow standard data submission and
dissemination pipelines between the main existing proteomics repositories.
http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017
ProteomeXchange: 7,475 datasets up until September 1st 2017
Origin:
1,229 USA
902 Germany
618 China
583 United Kingdom
319 France
250 Netherlands
213 Canada
208 Switzerland
200 Australia
179 Spain
172 Austria
168 Denmark
138 Sweden
133 India
115 Japan
115 Belgium
98 Norway
75 Italy
69 Taiwan
57 Brazil
51 Israel
51 Singapore
43 Finland
44 Ireland…
Type:
4,805 PRIDE partial
1,552 PRIDE complete
649 MassIVE
117 PeptideAtlas/PASSEL complete
109 jPOST
243 reprocessed datasets
Publicly Accessible:
4,051 datasets, 54% of all
89% PRIDE
6% MassIVE
3% PASSEL
2% jPOST
Top Species studied by at least 50 datasets:
2,787 Homo sapiens
958 Mus musculus
236 Saccharomyces cerevisiae
229 Arabidopsis thaliana
190 Rattus norvegicus
157 Escherichia coli
68 Bos taurus
62 Drosophila melanogaster
~ 1,100 species in total
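As a quick consistency check on the figures above, the per-type counts can be summed and the public share recomputed. This is only an illustrative sketch: the variable names are hypothetical, and the numbers are copied from the slide.

```python
# Dataset counts by type, copied from the slide (September 1st 2017).
datasets_by_type = {
    "PRIDE partial": 4805,
    "PRIDE complete": 1552,
    "MassIVE": 649,
    "PeptideAtlas/PASSEL complete": 117,
    "jPOST": 109,
    "reprocessed datasets": 243,
}

# The per-type counts should add up to the headline total.
total = sum(datasets_by_type.values())

# Share of datasets that are publicly accessible.
public = 4051
public_share = public / total

print(f"total datasets: {total}")
print(f"publicly accessible: {public_share:.0%}")
```

The per-type counts add up to the headline total, and the public share rounds to the 54% quoted on the slide.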
Stats (2): Data growth in EMBL-EBI resources
[Figure: data growth by resource type: Genomics, Transcriptomics, Metabolomics, Proteomics]
Data downloads are increasing
Data download volume for PRIDE in 2016: 241 TB
[Figure: PRIDE downloads in TB per year, 2013 to 2016]
Public data re-analysis -> Data repurposing
• Individual authors can re-analyze MS proteomics
raw data with new hypotheses in mind (not taken
into account by the original authors).
• Proteogenomics studies.
• Discovery of new PTMs/ variants.
• Meta-analysis studies.
Across-omics -> Proteogenomics approaches
• Proteomics data is combined with genomics and/or transcriptomics
information, typically by using sequence databases generated from DNA
sequencing efforts, RNA-Seq experiments, Ribo-Seq approaches, and
long non-coding RNAs.
Nesvizhskii, Nat Methods, 2014
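One common way to derive a protein sequence database from nucleotide data is six-frame translation. The sketch below is illustrative only and is not taken from the talk: the function names are hypothetical, it assumes the standard genetic code, and real proteogenomics pipelines add steps such as ORF filtering and splice-aware translation.

```python
# Standard genetic code (NCBI translation table 1), built in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate one reading frame; '*' marks a stop, 'X' an unknown codon."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame(dna):
    """Translate three forward frames, then three reverse-complement frames."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand, frame) for strand in strands for frame in range(3)]


# Toy input (hypothetical): a start codon, one alanine codon, a stop codon.
frames = six_frame("ATGGCCTAA")
for index, protein in enumerate(frames):
    print(index, protein)
```

The translated frames would then be searched against the mass spectra in place of, or alongside, a reference protein database.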
MS proteomics: Proteogenomics
[Figure: in vivo / in silico diagram; labels: DNA, RNASeq, RiboSeq; Proteogenomics]
Examples of repurposing datasets: proteogenomics
Data in public resources can be used for genome annotation purposes ->
Discovery of short ORFs, translated lncRNAs, etc
Examples of repurposing datasets: proteogenomics
Studies have also been performed in model organisms (mouse, rat,
Drosophila) and in microorganisms (Mycobacterium tuberculosis,
Helicobacter pylori).
MS proteomics: Proteogenomics
[Figure: in vivo / in silico diagram; labels: Personal genomes, Personal proteomes, Personalised medicine]
Public datasets from different omics: OmicsDI
http://www.omicsdi.org/
• Aims to integrate ‘omics’ datasets (proteomics,
transcriptomics, metabolomics and genomics at present).
PRIDE
MassIVE
jPOST
PASSEL
GPMDB
ArrayExpress
Expression Atlas
MetaboLights
Metabolomics Workbench
GNPS
EGA
Perez-Riverol et al., Nat Biotechnol, 2017
OmicsDI: Portal for omics datasets
The “dark” proteome
Sequence-based
search engines
The “dark” proteome
• Only ~25-30% of spectra in a typical proteomics experiment
are identified.
• What does that fraction of unidentified spectra correspond to?
• For sure, there will be artefacts (e.g. chimeric spectra).
• Undetected protein variants:
• What is not included in the searched database cannot be found.
• Peptides containing unexpected Post-Translational Modifications
(PTMs).
• Big potential to find novel biologically relevant “proteoforms”.
Concept of “proteoform”
Could any of these “undetected” proteoforms have an important biological
function?
Smith et al., Nat Methods, 2013
Introduction to Spectrum Clustering
[Figure: input mass spectra (unidentified spectra, spectra identified as peptide A, spectra identified as peptide B) are grouped by the spectra-cluster algorithm into consensus spectra (= data reduction)]
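The idea of grouping similar spectra and summarising each group by a consensus spectrum can be sketched in a few lines. Note that this is not the spectra-cluster algorithm itself: it is a deliberately simple greedy scheme using binned cosine similarity, and the function names, the bin width and the 0.8 threshold are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (sparse vector)."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return dict(binned)


def cosine(a, b):
    """Cosine similarity between two sparse binned spectra."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def consensus(members):
    """Average the binned spectra of one cluster."""
    total = defaultdict(float)
    for spectrum in members:
        for key, value in spectrum.items():
            total[key] += value
    return {key: value / len(members) for key, value in total.items()}


def cluster_spectra(spectra, threshold=0.8):
    """Greedy clustering: join the most similar cluster or start a new one."""
    clusters = []
    for peaks in spectra:
        binned = bin_spectrum(peaks)
        best = max(clusters, key=lambda c: cosine(binned, c["consensus"]),
                   default=None)
        if best is not None and cosine(binned, best["consensus"]) >= threshold:
            best["members"].append(binned)
            best["consensus"] = consensus(best["members"])
        else:
            clusters.append({"members": [binned], "consensus": binned})
    return clusters


# Toy spectra as (m/z, intensity) pairs; the first two are near-duplicates.
example = [
    [(100.1, 10.0), (200.2, 5.0)],
    [(100.3, 9.0), (200.4, 6.0)],
    [(300.0, 7.0), (450.5, 3.0)],
]
clusters = cluster_spectra(example)
print(len(clusters), [len(c["members"]) for c in clusters])
```

Replacing each cluster by its consensus spectrum is what gives the data reduction mentioned on the slide; production tools also take precursor m/z and charge into account, which this sketch ignores.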
PRIDE Cluster: Second Implementation
First implementation (Griss et al., Nat. Methods, 2013):
• Clustered all public, identified spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
Second implementation (Griss et al., Nat. Methods, 2016):
• Clustered all public spectra in PRIDE by summer 2015.
• Apache Hadoop.
• Starting with 256 M spectra.
• 190 M unidentified spectra (filtered to 111 M spectra that are
likely to represent a peptide).
• 66 M identified spectra
• Result: 28 M clusters
• 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
Consistently unidentified clusters
[Figure: originally submitted spectra, none of them identified, are grouped by spectrum clustering into a consensus spectrum]
Method to target recurrent unidentified spectra
Consistently unidentified clusters (Recurring Unidentified Spectra)
• 19 M clusters contain only unidentified spectra.
• Most of them are likely to be derived from peptides.
• They could correspond to PTMs or variant peptides -> Potentially relevant
for biology.
• With various methods, we found likely identifications for about 20%.
• Vast amount of data mining remains to be done.
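Selecting consistently unidentified clusters amounts to keeping the clusters in which no member spectrum carries an identification. The sketch below uses a hypothetical data layout (each cluster is a list of peptide assignments, with None for an unidentified spectrum). It also redoes the slide arithmetic, assuming the 19 M figure is counted out of the 28 M clusters reported for the second PRIDE Cluster implementation.

```python
def only_unidentified(clusters):
    """Keep clusters in which every member spectrum lacks an identification."""
    return [cluster for cluster in clusters
            if all(peptide is None for peptide in cluster)]


# Toy clusters (hypothetical): each entry is the peptide assigned to one
# member spectrum, or None when the spectrum was not identified.
clusters = [
    ["PEPTIDEA", None, "PEPTIDEA"],  # mixed: identified and unidentified
    [None, None, None],              # consistently unidentified
    [None, None],                    # consistently unidentified
    ["PEPTIDEB"],                    # identified
]
unidentified = only_unidentified(clusters)

# Rough arithmetic on the figures quoted in the slides.
total_clusters = 28_000_000
unidentified_clusters = 19_000_000
share = unidentified_clusters / total_clusters
likely_identified = round(0.20 * unidentified_clusters)

print(len(unidentified), f"{share:.0%}", likely_identified)
```

Under that assumption roughly two thirds of all clusters contain no identified spectrum, and the "about 20%" with likely identifications corresponds to roughly 3.8 M clusters.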
Consistently unidentified clusters-> Open modification searches
Meta-analysis: Protein-protein interactions/associations
Drew et al., Mol Systems Biol, 2017
Gupta et al., NAR, 2017, in press
Recent examples of meta-analysis studies (2)
Lund-Johansen et al., Nat Methods, 2016
Further reading…
Martens & Vizcaíno, Trends Biochem Sci, 2017
Vaudel et al., Proteomics, 2016
What’s a cloud environment?
Infrastructure as a Service:
• Compute power not necessarily local but remote
• Still from compute centres, but on a larger scale
The ‘service’ is:
• Customer gets infrastructure, but it’s virtualized
• This abstraction yields better utilisation and scalability (but...)
• Developer/Customer has to interface with these abstraction layers
ELIXIR Implementation Study
Purpose:
• Develop robust, fully reproducible and automated proteomics
analysis pipelines (DDA ‘shot-gun’ proteomics workflows).
• Deployed in a cloud environment and reusable by the scientific
community.
ELIXIR Implementation Study, started in February (1 year):
• EMBL-EBI (Proteomics & Cloud infrastructure Teams)
• ELIXIR-DE (Kohlbacher, EKUT; Eisenacher, RUB)
ELIXIR Implementation Study
Develop exemplary proteomics data analysis workflows and deploy
them in the EMBL-EBI “Embassy Cloud”:
(1) Standard identification workflow
(2) Identification workflow for PTMs
(3) Quantification (label-free/label-based approaches)
(4) Quality Control (to aid data set interpretation/reanalysis evaluation)
(5) Versions of quantification approaches (including PTMs)
-> Connected to public proteomics data from
Analysis software used
We are using the framework
Features:
• Tool modularisation
• Solutions for data handover between tools with standardised
(PSI) formats
• Adapters for integrating third-party software (search engines,
LuciPHOr, FIDO, Percolator, etc.)
• Integration into various workflow systems as a basis
Cloud deployment developments at EBI
Cloud workflow in metabolomics:
Enabling flexible workflow implementation
-> Kubernetes Galaxy runner development
Kubernetes Framework (k8s):
• Controls the virtual cloud infrastructure
• Manages request and provision of resources (VMs)
Summary figure
Open analysis pipelines -> In the near future…
• Recent 3-year BBSRC grant awarded to do the same for DIA
approaches (to start in December 2017).
• In collaboration with the Stoller Center (Manchester) (co-PIs Graham,
Hubbard & Townsend)
• Recent 4-year Wellcome Trust grant awarded to do (among other
things) pipelines for proteogenomics approaches (to start mid 2018).
• In collaboration with J. Choudhary (Institute of Cancer Research, London)
Purpose
Data dissemination into popular
bioinformatics resources
Main challenges (1)
• Lack of information/annotation about the experimental design:
• Need for manual curation unavoidable at present
• But wouldn’t it be needed anyway?
• Crowd-sourcing infrastructure?
• Computational infrastructure challenges:
• Datasets get larger and larger…
• Analysing many datasets together is challenging (in
terms of the IT infrastructure)
• Windows OS dependency (things are improving in this
context though)
Main challenges (2)
• Computational analysis challenges:
• Need to improve current computational analysis
methods for big datasets (e.g. FDR calculation)
• Size of DBs is very large (e.g. in proteogenomics)
• Open modification searches
• “Controlled access” for clinical proteomics data:
• Not enforced at present, but it is common practice not
to share these datasets just in case…
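For context on the FDR calculation mentioned above, a widely used estimate is the target-decoy approach: spectra are searched against real (target) and reversed or shuffled (decoy) sequences, and the FDR at a score threshold is approximated by the ratio of decoy to target matches above it. The sketch below is a minimal illustration with hypothetical names and toy scores; it is not the method of any particular tool and does not address the large-dataset issues raised in the slide.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as decoys / targets among PSMs scoring >= threshold.

    psms is a list of (score, is_decoy) pairs; higher scores are better.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def score_cutoff(psms, max_fdr=0.01):
    """Return the lowest score threshold whose estimated FDR is <= max_fdr."""
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, threshold) <= max_fdr:
            best = threshold
    return best


# Toy peptide-spectrum matches (hypothetical scores).
psms = [
    (9.0, False), (8.5, False), (8.0, False), (7.0, True),
    (6.5, False), (6.0, False), (5.0, True),
]
print(fdr_at_threshold(psms, 6.0))
print(score_cutoff(psms, max_fdr=0.01))
print(score_cutoff(psms, max_fdr=0.25))
```

When many datasets are analysed together, false positives accumulate across them, which is one reason a per-dataset threshold of this kind is not sufficient on its own.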
Main challenges (3): Across-omics analysis
• It can be very challenging to find “matching” datasets (e.g.
genomics, proteomics, metabolomics).
• Better connection between resources from different omics
approaches is needed.
Summary
• Public proteomics datasets are on the rise! (~250 new
datasets/month in PRIDE).
• A lot of possibilities open for reuse of this data:
• New purposes: proteogenomics approaches.
• Discovery of new proteoforms (e.g. new PTMs, new variants).
• Starting to work on open and reproducible analysis pipelines.
• ELIXIR Implementation Study (DDA data, collaboration EMBL-EBI &
ELIXIR-DE)
• In the near future: DIA and proteogenomics approaches.
Acknowledgements: People
Attila Csordas
Tobias Ternent
Mathias Walzer
Gerhard Mayer (de.NBI)
Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak
Former team members, especially
Rui Wang, Florian Reisinger, Noemi
del Toro, Jose A. Dianes & Henning
Hermjakob
Acknowledgements: The PRIDE Team
All data submitters!
@pride_ebi
@proteomexchange
Pablo Moreno
EBI Cloud infrastructure team

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Mining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsMining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasets
 
Museum collections as research data - October 2019
Museum collections as research data - October 2019Museum collections as research data - October 2019
Museum collections as research data - October 2019
 
GBIF towards 2030 (November 2018)
GBIF towards 2030 (November 2018)GBIF towards 2030 (November 2018)
GBIF towards 2030 (November 2018)
 
FAIR and open biodiversity collection data management
FAIR and open biodiversity collection data managementFAIR and open biodiversity collection data management
FAIR and open biodiversity collection data management
 
GBIF and Biodiversity informatics for museums, 15 March 2021
GBIF and Biodiversity informatics for museums, 15 March 2021GBIF and Biodiversity informatics for museums, 15 March 2021
GBIF and Biodiversity informatics for museums, 15 March 2021
 
2021-01-27--biodiversity-informatics-gbif-(52slides)
2021-01-27--biodiversity-informatics-gbif-(52slides)2021-01-27--biodiversity-informatics-gbif-(52slides)
2021-01-27--biodiversity-informatics-gbif-(52slides)
 
GBIF & GRScicoll, Høstseminar Norges museumsforbunds Seksjon for natur, 2021-...
GBIF & GRScicoll, Høstseminar Norges museumsforbunds Seksjon for natur, 2021-...GBIF & GRScicoll, Høstseminar Norges museumsforbunds Seksjon for natur, 2021-...
GBIF & GRScicoll, Høstseminar Norges museumsforbunds Seksjon for natur, 2021-...
 
GBIF and Open Science
GBIF and Open ScienceGBIF and Open Science
GBIF and Open Science
 
Storage and Analysis of Sensitive Large-Scale Biomedical Data in Sweden
Storage and Analysis of Sensitive Large-Scale Biomedical Data in SwedenStorage and Analysis of Sensitive Large-Scale Biomedical Data in Sweden
Storage and Analysis of Sensitive Large-Scale Biomedical Data in Sweden
 
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
 
EURISCO and GBIF IPT, at the Vavilov Institute in St Petersburg (27 April 2010)
EURISCO and GBIF IPT, at the Vavilov Institute in St Petersburg (27 April 2010)EURISCO and GBIF IPT, at the Vavilov Institute in St Petersburg (27 April 2010)
EURISCO and GBIF IPT, at the Vavilov Institute in St Petersburg (27 April 2010)
 
Introduction to GBIF. GBIF seminar in Bergen. 2016-12-14
Introduction to GBIF. GBIF seminar in Bergen. 2016-12-14Introduction to GBIF. GBIF seminar in Bergen. 2016-12-14
Introduction to GBIF. GBIF seminar in Bergen. 2016-12-14
 
Germplasm data exchange, CGIAR SINGER (2009)
Germplasm data exchange, CGIAR SINGER (2009)Germplasm data exchange, CGIAR SINGER (2009)
Germplasm data exchange, CGIAR SINGER (2009)
 
GBIF and reuse of research data, Bergen (2016-12-14)
GBIF and reuse of research data, Bergen (2016-12-14)GBIF and reuse of research data, Bergen (2016-12-14)
GBIF and reuse of research data, Bergen (2016-12-14)
 
Electron Microscopy Between OPIC, Oxford and eBIC
Electron Microscopy Between OPIC, Oxford and eBICElectron Microscopy Between OPIC, Oxford and eBIC
Electron Microscopy Between OPIC, Oxford and eBIC
 
TDWG VoMaG Vocabulary management workflow, 2013-10-31
TDWG VoMaG Vocabulary management workflow, 2013-10-31TDWG VoMaG Vocabulary management workflow, 2013-10-31
TDWG VoMaG Vocabulary management workflow, 2013-10-31
 
Data exchange alternatives, GIGA TAG (2009)
Data exchange alternatives, GIGA TAG (2009)Data exchange alternatives, GIGA TAG (2009)
Data exchange alternatives, GIGA TAG (2009)
 
Darwin Core extension for germplasm (11th December 2013)
Darwin Core extension for germplasm (11th December 2013)Darwin Core extension for germplasm (11th December 2013)
Darwin Core extension for germplasm (11th December 2013)
 
Mass spectrometry resources at the EBI
Mass spectrometry resources at the EBIMass spectrometry resources at the EBI
Mass spectrometry resources at the EBI
 
PRIDE-ProteomeXchange
PRIDE-ProteomeXchangePRIDE-ProteomeXchange
PRIDE-ProteomeXchange
 

Similar a A proteomics data “gold mine” at your disposal: Now that the data is there, what can we do with it?

Similar a A proteomics data “gold mine” at your disposal: Now that the data is there, what can we do with it? (20)

Is it feasible to identify novel biomarkers by mining public proteomics data?
Is it feasible to identify novel biomarkers by mining public proteomics data?Is it feasible to identify novel biomarkers by mining public proteomics data?
Is it feasible to identify novel biomarkers by mining public proteomics data?
 
The ELIXIR Proteomics Community
The ELIXIR Proteomics CommunityThe ELIXIR Proteomics Community
The ELIXIR Proteomics Community
 
Pride cluster presentation
Pride cluster presentation Pride cluster presentation
Pride cluster presentation
 
Reuse of public proteomics data
Reuse of public proteomics dataReuse of public proteomics data
Reuse of public proteomics data
 
Proteomics repositories
Proteomics repositoriesProteomics repositories
Proteomics repositories
 
Enabling automated processing and analysis of large-scale proteomics data
Enabling automated processing and analysis of large-scale proteomics dataEnabling automated processing and analysis of large-scale proteomics data
Enabling automated processing and analysis of large-scale proteomics data
 
ProteomeXchange update HUPO 2016
ProteomeXchange update HUPO 2016ProteomeXchange update HUPO 2016
ProteomeXchange update HUPO 2016
 
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
 
Reuse of public data in proteomics
Reuse of public data in proteomicsReuse of public data in proteomics
Reuse of public data in proteomics
 
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
 
Introduction to EBI for Proteomics in ELIXIR
Introduction to EBI for Proteomics in ELIXIRIntroduction to EBI for Proteomics in ELIXIR
Introduction to EBI for Proteomics in ELIXIR
 
Reuse of public proteomics data
Reuse of public proteomics dataReuse of public proteomics data
Reuse of public proteomics data
 
An overview of the PRIDE ecosystem of resources and computational tools for m...
An overview of the PRIDE ecosystem of resources and computational tools for m...An overview of the PRIDE ecosystem of resources and computational tools for m...
An overview of the PRIDE ecosystem of resources and computational tools for m...
 
Open data genomics_palermo_2017_ver03
Open data genomics_palermo_2017_ver03Open data genomics_palermo_2017_ver03
Open data genomics_palermo_2017_ver03
 
Reuse of public proteomics data
Reuse of public proteomics dataReuse of public proteomics data
Reuse of public proteomics data
 
The Proteomics Standards Initiative (PSI)
The Proteomics Standards Initiative (PSI)The Proteomics Standards Initiative (PSI)
The Proteomics Standards Initiative (PSI)
 
Proteomexchange
ProteomexchangeProteomexchange
Proteomexchange
 
The ELIXIR Proteomics community
The ELIXIR Proteomics community The ELIXIR Proteomics community
The ELIXIR Proteomics community
 
2017 02-22 Oxford Global Biomarker Congress, Manchester, Alain van Gool
2017 02-22 Oxford Global Biomarker Congress, Manchester, Alain van Gool2017 02-22 Oxford Global Biomarker Congress, Manchester, Alain van Gool
2017 02-22 Oxford Global Biomarker Congress, Manchester, Alain van Gool
 
Proteomics repositories
Proteomics repositoriesProteomics repositories
Proteomics repositories
 

Más de Juan Antonio Vizcaino

Más de Juan Antonio Vizcaino (12)

Reusing and integrating public proteomics data to improve our knowledge of th...
Reusing and integrating public proteomics data to improve our knowledge of th...Reusing and integrating public proteomics data to improve our knowledge of th...
Reusing and integrating public proteomics data to improve our knowledge of th...
 
Introduction to the PSI standard data formats
Introduction to the PSI standard data formatsIntroduction to the PSI standard data formats
Introduction to the PSI standard data formats
 
PRIDE resources and ProteomeXchange
PRIDE resources and ProteomeXchangePRIDE resources and ProteomeXchange
PRIDE resources and ProteomeXchange
 
Introduction to the Proteomics Bioinformatics Course 2018
Introduction to the Proteomics Bioinformatics Course 2018Introduction to the Proteomics Bioinformatics Course 2018
Introduction to the Proteomics Bioinformatics Course 2018
 
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
 
PSI-Proteome Informatics update
PSI-Proteome Informatics updatePSI-Proteome Informatics update
PSI-Proteome Informatics update
 
ProteomeXchange update
ProteomeXchange updateProteomeXchange update
ProteomeXchange update
 
Proteomics data standards
Proteomics data standardsProteomics data standards
Proteomics data standards
 
Introduction to the Proteomics Bioinformatics Course 2017
Introduction to the Proteomics Bioinformatics Course 2017Introduction to the Proteomics Bioinformatics Course 2017
Introduction to the Proteomics Bioinformatics Course 2017
 
ProteomeXchange update 2017
ProteomeXchange update 2017ProteomeXchange update 2017
ProteomeXchange update 2017
 
Introduction to the Proteomics Bioinformatics Course 2016
Introduction to the Proteomics Bioinformatics Course 2016Introduction to the Proteomics Bioinformatics Course 2016
Introduction to the Proteomics Bioinformatics Course 2016
 
Pride and ProteomeXchange
Pride and ProteomeXchangePride and ProteomeXchange
Pride and ProteomeXchange
 

Último

(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
Scintica Instrumentation
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
Silpa
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Sérgio Sacani
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
NazaninKarimi6
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
Silpa
 

Último (20)

Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical Science
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
 

A proteomics data “gold mine” at your disposal: Now that the data is there, what can we do with it?

  • 1. A proteomics data “gold mine” at your disposal: Now that the data is there, what can we do with it? Dr. Juan Antonio Vizcaíno Proteomics Team Leader EMBL-European Bioinformatics Institute Hinxton, Cambridge, UK E-mail: juan@ebi.ac.uk
  • 2. Overview (Juan A. Vizcaíno, juan@ebi.ac.uk; 2nd International de.NBI Symposium, Bielefeld, 23 October 2017)
      • Short introduction to proteomics and PRIDE
      • Reuse of public proteomics data
      • Open analysis pipelines for proteomics data
      • Challenges for public proteomics data reuse
  • 3. New ELIXIR Proteomics “Community” (Vizcaíno et al., F1000Research, 2017)
  • 4. PRIDE (PRoteomics IDEntifications) Archive, http://www.ebi.ac.uk/pride/archive
      • PRIDE stores mass spectrometry (MS)-based proteomics data:
          • Peptide and protein expression data (identification and quantification)
          • Post-translational modifications
          • Mass spectra (raw data and peak lists)
          • Technical and biological metadata
          • Any other related information
      • Full support for tandem MS approaches
      • Any type of data can be stored
      • Leading ProteomeXchange
      • From July 2017, an ELIXIR core resource
      • References: Martens et al., Proteomics, 2005; Vizcaíno et al., NAR, 2016
  • 5. Stats (1): Data submissions to PRIDE Archive continue to increase
      • Top months: 224 and 234 datasets submitted in July and August, respectively (~2,000 in 2016)
      • > 400 TB of data
  • 6. ProteomeXchange: A global, distributed proteomics database
      • Members: PASSEL (SRM data), PRIDE (MS/MS data), MassIVE (MS/MS data), jPOST (MS/MS data)
      • Figure labels: Raw, ID/Q, Meta
      • Mandatory raw data deposition since July 2015
      • Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories
      • http://www.proteomexchange.org
      • References: Vizcaíno et al., Nat Biotechnol, 2014; Deutsch et al., NAR, 2017
  • 7. ProteomeXchange: 7,475 datasets up until September 1st 2017
      • Origin: 1229 USA, 902 Germany, 618 China, 583 United Kingdom, 319 France, 250 Netherlands, 213 Canada, 208 Switzerland, 200 Australia, 179 Spain, 172 Austria, 168 Denmark, 138 Sweden, 133 India, 115 Japan, 115 Belgium, 98 Norway, 75 Italy, 69 Taiwan, 57 Brazil, 51 Israel, 51 Singapore, 43 Finland, 44 Ireland…
      • Type: 4805 PRIDE partial, 1552 PRIDE complete, 649 MassIVE, 117 PeptideAtlas/PASSEL complete, 109 jPOST, 243 reprocessed datasets
      • Publicly accessible: 4051 datasets, 54% of all
      • 89% PRIDE, 6% MassIVE, 3% PASSEL, 2% jPOST
      • Top species studied by at least 50 datasets: 2,787 Homo sapiens, 958 Mus musculus, 236 Saccharomyces cerevisiae, 229 Arabidopsis thaliana, 190 Rattus norvegicus, 157 Escherichia coli, 68 Bos taurus, 62 Drosophila melanogaster
      • ~1,100 species in total
  • 8. Stats (2): Data growth in EMBL-EBI resources (figure with series for Genomics, Transcriptomics, Metabolomics and Proteomics)
  • 9. Overview
      • Short introduction to proteomics and PRIDE
      • Reuse of public proteomics data
      • Open analysis pipelines for proteomics data
      • Challenges for public proteomics data reuse
  • 10. Data downloads are increasing
      • Data download volume for PRIDE in 2016: 241 TB
      • Figure: downloads in TB per year, 2013 to 2016
  • 11. Public data re-analysis -> Data repurposing
      • Individual authors can re-analyze MS proteomics raw data with new hypotheses in mind (not taken into account by the original authors).
      • Proteogenomics studies.
      • Discovery of new PTMs/variants.
      • Meta-analysis studies.
  • 12. Public data re-analysis -> Data repurposing
      • Individual authors can re-analyze MS proteomics raw data with new hypotheses in mind (not taken into account by the original authors).
      • Proteogenomics studies.
      • Discovery of new PTMs/variants.
      • Meta-analysis studies.
  • 13. Across-omics -> Proteogenomics approaches
      • Proteomics data is combined with genomics and/or transcriptomics information, typically by using sequence databases generated from DNA sequencing efforts, RNA-Seq experiments, Ribo-Seq approaches, and long non-coding RNAs.
      • Reference: Nesvizhskii, Nat Methods, 2014
  • 14. MS proteomics: Proteogenomics (figure labels: in vivo; in silico; DNA, RNASeq, RiboSeq; Proteogenomics)
  • 15. Examples of repurposing datasets: proteogenomics
      • Data in public resources can be used for genome annotation purposes -> Discovery of short ORFs, translated lncRNAs, etc.
  • 16. Examples of repurposing datasets: proteogenomics
      • Some studies have also been performed in model organisms (mouse, rat, Drosophila) and in microorganisms (Mycobacterium tuberculosis, Helicobacter pylori).
  • 17. MS proteomics: ProteoGenomics (figure labels: in vivo; in silico; Personal genomes; Personal proteomes; Personalised medicine)
  • 18. Public datasets from different omics: OmicsDI, http://www.omicsdi.org/
      • Aims to integrate ‘omics’ datasets (proteomics, transcriptomics, metabolomics and genomics at present).
      • Resources shown: PRIDE, MassIVE, jPOST, PASSEL, GPMDB, ArrayExpress, Expression Atlas, MetaboLights, Metabolomics Workbench, GNPS, EGA
      • Reference: Perez-Riverol et al., Nat Biotechnol, 2017
  • 19. OmicsDI: Portal for omics datasets
  • 20. Public data re-analysis -> Data repurposing
      • Individual authors can re-analyze MS proteomics raw data with new hypotheses in mind (not taken into account by the original authors).
      • Proteogenomics studies.
      • Discovery of new PTMs/variants.
      • Meta-analysis studies.
  • 21. The “dark” proteome (figure label: Sequence-based search engines)
  • 22. The “dark” proteome
      • Only ~25-30% of spectra in a typical proteomics experiment are identified.
      • What does the fraction of unidentified spectra correspond to?
      • For sure, there will be artefacts (e.g. chimeric spectra).
      • Undetected protein variants:
          • What is not included in the searched database cannot be found.
      • Peptides containing unexpected post-translational modifications (PTMs).
      • Big potential to find novel, biologically relevant “proteoforms”.
  • 23. Concept of “proteoform”
      • Could any of these “undetected” proteoforms have an important biological function?
      • Reference: Smith et al., Nat Methods, 2013
  • 24. Introduction to Spectrum Clustering
      • Figure: input mass spectra (unidentified spectrum, spectrum identified as peptide A, spectrum identified as peptide B) are grouped by the spectra-cluster algorithm into consensus spectra (= data reduction)
  • 25. PRIDE Cluster: Second Implementation
      • First implementation (Griss et al., Nat. Methods, 2013):
          • Clustered all public, identified spectra in PRIDE
          • EBI compute farm, LSF
          • 20.7 M identified spectra
          • 610 CPU days, two calendar weeks
          • Validation, calibration
          • Feedback into PRIDE datasets
      • Second implementation (Griss et al., Nat. Methods, 2016):
          • Clustered all public spectra in PRIDE by summer 2015
          • Apache Hadoop
          • Starting with 256 M spectra:
              • 190 M unidentified spectra (filtered to 111 M spectra that are likely to represent a peptide)
              • 66 M identified spectra
          • Result: 28 M clusters
          • 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
  • 26. Consistently unidentified clusters
      • Figure: originally submitted spectra, all labelled “Not identified”, are grouped by spectrum clustering into a consensus spectrum of unknown identity
      • Method to target recurrent unidentified spectra
  • 27. Consistently unidentified clusters (Recurring Unidentified Spectra)
      • 19 M clusters contain only unidentified spectra.
      • Most of them are likely to be derived from peptides.
      • They could correspond to PTMs or variant peptides -> Potentially relevant for biology.
      • With various methods, we found likely identifications for about 20%.
      • A vast amount of data mining remains to be done.
  • 28. Consistently unidentified clusters -> Open modification searches
  • 29. Public data re-analysis -> Data repurposing
      • Individual authors can re-analyze MS proteomics raw data with new hypotheses in mind (not taken into account by the original authors).
      • Proteogenomics studies.
      • Discovery of new PTMs/variants.
      • Meta-analysis studies.
  • 30. Meta-analysis: Protein-protein interactions/associations
      • References: Drew et al., Mol Systems Biol, 2017; Gupta et al., NAR, 2017, in press
  • 31. Recent examples of meta-analysis studies (2)
      • Reference: Lund-Johansen et al., Nat Methods, 2016
  • 32. Further reading…
      • Martens & Vizcaíno, Trends Biochem Sci, 2017
      • Vaudel et al., Proteomics, 2016
  • 33. Overview
      • Short introduction to proteomics and PRIDE
      • Reuse of public proteomics data
      • Open analysis pipelines for proteomics data
      • Challenges for public proteomics data reuse
  • 34. What’s a cloud environment?
      • Infrastructure as a Service:
          • Compute power is not necessarily local but remote
          • Still from compute centres, but on a larger scale
      • The ‘service’ is:
          • Customer gets infrastructure, but it is virtualized
          • This abstraction yields better utilisation and scalability (but...)
          • Developer/customer has to interface with these abstraction layers
  • 35. ELIXIR Implementation Study
      • Purpose:
          • Develop robust, fully reproducible and automated proteomics analysis pipelines (DDA ‘shot-gun’ proteomics workflows).
          • Deployed in a cloud environment and reusable by the scientific community.
      • Started in February (1 year), ELIXIR Implementation Study:
          • EMBL-EBI (Proteomics & Cloud infrastructure Teams)
          • ELIXIR-DE (Kohlbacher, EKUT; Eisenacher, RUB)
  • 36. ELIXIR Implementation Study
      • Develop exemplary proteomics data analysis workflows and deploy them in the EMBL-EBI “Embassy Cloud”:
          • (1) Standard identification workflow
          • (2) Identification workflow for PTMs
          • (3) Quantification (label-free/label-based approaches)
          • (4) Quality control (to aid data set interpretation/reanalysis evaluation)
          • (5) Versions of quantification approaches (including PTMs)
      • -> Connected to public proteomics data from
  • 37. Analysis software used
      • We are using the framework
      • Features:
          • Tool modularisation
          • Solutions for data handover between tools with standardised (PSI) formats
          • Adapters for integrating third-party software (search engines, LuciPHOr, FIDO, Percolator, etc.)
          • Integration into various workflow systems as a basis
  • 38. Cloud deployment developments at EBI
      • Cloud workflow in metabolomics: enabling flexible workflow implementation -> Kubernetes Galaxy runner development
      • Kubernetes Framework (k8s):
          • Controls the virtual cloud infrastructure
          • Manages request and provision of resources (VMs)
  • 39. Summary figure
  • 40. Open analysis pipelines -> In the near future…
      • Recent 3-year BBSRC grant awarded to do the same for DIA approaches (to start in December 2017).
          • In collaboration with the Stoller Center (Manchester) (co-PIs Graham, Hubbard & Townsend)
      • Recent 4-year Wellcome Trust grant awarded to do (among other things) pipelines for proteogenomics approaches (to start mid 2018).
          • In collaboration with J. Choudhary (Institute of Cancer Research, London)
  • 41. Purpose: Data dissemination into popular bioinformatics resources
  • 42. Overview
      • Short introduction to proteomics and PRIDE
      • Reuse of public proteomics data
      • Open analysis pipelines for proteomics data
      • Challenges for public proteomics data reuse
  • 43. Main challenges (1)
      • Lack of information/annotation about the experimental design:
          • Need for manual curation is unavoidable at present
          • But wouldn’t it be needed anyway?
          • Crowd-sourcing infrastructure?
      • Computational infrastructure challenges:
          • Datasets get larger and larger…
          • Analysing many datasets together is challenging (in terms of the IT infrastructure)
          • Windows OS dependency (things are improving in this context though)
  • 44. Main challenges (2)
      • Computational analysis challenges:
          • Need to improve current computational analysis methods for big datasets (e.g. FDR calculation)
          • Size of DBs is very large (e.g. in proteogenomics)
          • Open modification searches
      • “Controlled access” for clinical proteomics data:
          • Not enforced at present, but it is common practice not to share these datasets just in case…
  • 45. Main challenges (3): Across-omics analysis
      • It can be very challenging to find “matching” datasets (e.g. genomics, proteomics, metabolomics).
      • Better connection between resources from different omics approaches is needed.
  • 46. Summary
      • Public proteomics datasets are on the rise! (~250 new datasets/month in PRIDE).
      • A lot of possibilities open for reuse of this data:
          • New purposes: proteogenomics approaches.
          • Discovery of new proteoforms (e.g. new PTMs, new variants).
      • Starting to work on open and reproducible analysis pipelines.
          • ELIXIR Implementation Study (DDA data, collaboration EMBL-EBI & ELIXIR-DE)
          • In the near future: DIA and proteogenomics approaches.
  • 47. Acknowledgements: People
      • Attila Csordas, Tobias Ternent, Mathias Walzer, Gerhard Mayer (de.NBI), Johannes Griss, Yasset Perez-Riverol, Manuel Bernal-Llinares, Andrew Jarnuczak
      • Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob
      • Acknowledgements: The PRIDE Team
      • All data submitters!
      • Pablo Moreno; EBI Cloud infrastructure team
      • @pride_ebi @proteomexchange
  • 48. Juan A. Vizcaíno juan@ebi.ac.uk 2nd International de.NBI Symposium Bielefeld, 23 October 2017
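
Slides 24 to 27 describe spectrum clustering: similar spectra are grouped, each group is summarised by a consensus spectrum, and clusters containing only unidentified spectra become targets for further mining. The sketch below illustrates that idea only. It is not the spectra-cluster algorithm used by PRIDE Cluster; it assumes a simple greedy pass with binned cosine similarity, and every name, threshold and data value in it (bin_peaks, cluster_spectra, the 0.8 similarity cut-off, the 1.0 m/z tolerances, the demo spectra) is hypothetical.

```python
from collections import defaultdict
from math import sqrt


def bin_peaks(peaks, bin_width=1.0):
    """Bin (m/z, intensity) pairs into a sparse vector scaled to unit length."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    norm = sqrt(sum(v * v for v in binned.values()))
    return {k: v / norm for k, v in binned.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def cluster_spectra(spectra, threshold=0.8, precursor_tol=1.0):
    """Greedy single-pass clustering (an assumed stand-in, not PRIDE's method).

    Each spectrum is a dict with 'precursor_mz', 'peaks' and 'peptide'
    (None when the spectrum is unidentified).
    """
    clusters = []
    for spec in spectra:
        vec = bin_peaks(spec["peaks"])
        for cl in clusters:
            close = abs(cl["precursor_mz"] - spec["precursor_mz"]) <= precursor_tol
            if close and cosine(cl["consensus"], vec) >= threshold:
                # Update the consensus as a running mean, then rescale it.
                n = len(cl["members"])
                keys = set(cl["consensus"]) | set(vec)
                merged = {
                    k: (cl["consensus"].get(k, 0.0) * n + vec.get(k, 0.0)) / (n + 1)
                    for k in keys
                }
                norm = sqrt(sum(v * v for v in merged.values())) or 1.0
                cl["consensus"] = {k: v / norm for k, v in merged.items()}
                cl["members"].append(spec)
                break
        else:
            # No existing cluster matched, so this spectrum starts a new one.
            clusters.append(
                {"precursor_mz": spec["precursor_mz"], "consensus": vec, "members": [spec]}
            )
    return clusters


def consistently_unidentified(clusters, min_size=2):
    """Clusters of at least min_size spectra in which no spectrum is identified."""
    return [
        c
        for c in clusters
        if len(c["members"]) >= min_size
        and all(m["peptide"] is None for m in c["members"])
    ]


# Made-up demo data: one mixed cluster, one recurring unidentified cluster,
# and one unidentified singleton.
demo_spectra = [
    {"precursor_mz": 500.0, "peaks": [(100.2, 10), (200.3, 20), (300.1, 5)], "peptide": "PEPTIDEA"},
    {"precursor_mz": 500.3, "peaks": [(100.4, 11), (200.1, 19), (300.5, 6)], "peptide": None},
    {"precursor_mz": 650.0, "peaks": [(150.0, 8), (250.0, 30)], "peptide": None},
    {"precursor_mz": 650.2, "peaks": [(150.3, 9), (250.4, 28)], "peptide": None},
    {"precursor_mz": 650.1, "peaks": [(400.0, 10)], "peptide": None},
]
clusters = cluster_spectra(demo_spectra)
dark = consistently_unidentified(clusters)
print(len(clusters), "clusters,", len(dark), "consistently unidentified")
```

In this toy setting the unidentified spectrum that clusters with an identified one inherits a candidate peptide from its neighbour, while the cluster returned by consistently_unidentified is the kind of recurring unknown that slide 27 proposes to mine, for example with open modification searches.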