The document discusses the reuse of public proteomics data. It describes how data from the PRoteomics IDEntifications (PRIDE) Archive can be reanalyzed to conduct proteogenomics studies, discover new post-translational modifications and variants, and enable meta-analysis studies of protein-protein interactions and associations. It also examines challenges around analyzing the "dark proteome" of consistently unidentified spectra in public datasets and developing open analysis pipelines for proteomics data in cloud environments.
A proteomics data “gold mine” at your disposal: Now that the data is there, what can we do with it?
1. A proteomics data “gold mine” at your disposal: Now that the data is there, what can we do with it?
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK
E-mail: juan@ebi.ac.uk
2. Juan A. Vizcaíno
juan@ebi.ac.uk
2nd International de.NBI Symposium
Bielefeld, 23 October 2017
Overview
• Short introduction to proteomics and PRIDE
• Reuse of public proteomics data
• Open analysis pipelines for proteomics data
• Challenges for public proteomics data reuse
3. New ELIXIR Proteomics “Community”
Vizcaíno et al., F1000Research, 2017
4. PRIDE (PRoteomics IDEntifications) Archive
http://www.ebi.ac.uk/pride/archive
• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data (identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
• Any type of data can be stored
• Leads ProteomeXchange.
• An ELIXIR core resource since July 2017
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
5. Stats (1): Data submissions to PRIDE Archive continue to increase
Top months: 224 and 234 datasets submitted in July and August, respectively (~2,000 in 2016).
> 400 TB of data
6. ProteomeXchange: A global, distributed proteomics database
Figure: ProteomeXchange member repositories: PASSEL (SRM data), PRIDE (MS/MS data), MassIVE (MS/MS data) and jPOST (MS/MS data); data types shown: Raw, ID/Q, Meta.
Mandatory raw data deposition since July 2015
• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017
7. ProteomeXchange: 7,475 datasets up until September 1st 2017
Origin:
1229 USA
902 Germany
618 China
583 United Kingdom
319 France
250 Netherlands
213 Canada
208 Switzerland
200 Australia
179 Spain
172 Austria
168 Denmark
138 Sweden
133 India
115 Japan
115 Belgium
98 Norway
75 Italy
69 Taiwan
57 Brazil
51 Israel
51 Singapore
43 Finland
44 Ireland…
Type:
4805 PRIDE partial
1552 PRIDE complete
649 MassIVE
117 PeptideAtlas/PASSEL complete
109 jPOST
243 reprocessed datasets
Publicly Accessible:
4051 datasets, 54% of all
89% PRIDE
6% MassIVE
3% PASSEL
2% jPOST
Top species studied by at least 50 datasets:
2,787 Homo sapiens
958 Mus musculus
236 Saccharomyces cerevisiae
229 Arabidopsis thaliana
190 Rattus norvegicus
157 Escherichia coli
68 Bos taurus
62 Drosophila melanogaster
~ 1,100 species in total
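The figures on this slide can be cross-checked with a few lines of Python. The sketch below only re-adds the numbers quoted above; the variable names are illustrative and not part of any ProteomeXchange tool.

```python
# Dataset counts by type, copied from the slide (status: 1 September 2017).
datasets_by_type = {
    "PRIDE partial": 4805,
    "PRIDE complete": 1552,
    "MassIVE": 649,
    "PeptideAtlas/PASSEL complete": 117,
    "jPOST": 109,
    "reprocessed": 243,
}

total = sum(datasets_by_type.values())

# Publicly accessible datasets, as quoted on the slide.
public = 4051
public_share = round(100 * public / total)

print(total, public_share)  # 7475 54
```

The per-type counts add up to the quoted total of 7,475, which suggests that the 243 reprocessed datasets are counted within that total.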
8. Stats (2): Data growth in EMBL-EBI resources
Figure: data growth by data type (genomics, transcriptomics, metabolomics, proteomics).
9. Overview
• Short introduction to proteomics and PRIDE
• Reuse of public proteomics data
• Open analysis pipelines for proteomics data
• Challenges for public proteomics data reuse
10. Data downloads are increasing
Data download volume for PRIDE in 2016: 241 TB
Figure: PRIDE download volume per year, 2013 to 2016 (y-axis: downloads in TB).
11. Public data re-analysis -> Data repurposing
• Individual authors can re-analyze MS proteomics raw data with new hypotheses in mind (not taken into account by the original authors).
• Proteogenomics studies.
• Discovery of new PTMs/variants.
• Meta-analysis studies.
13. Across-omics -> Proteogenomics approaches
• Proteomics data is combined with genomics and/or transcriptomics information, typically by using sequence databases generated from DNA sequencing efforts, RNA-Seq experiments, Ribo-Seq approaches, and long non-coding RNAs.
Nesvizhskii, Nat Methods, 2014
14. MS proteomics: Proteogenomics
Figure labels: in vivo, in silico; DNA, RNA-Seq, Ribo-Seq; proteogenomics.
15. Examples of repurposing datasets: proteogenomics
Data in public resources can be used for genome annotation purposes -> discovery of short ORFs, translated lncRNAs, etc.
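To make the idea of searching for short ORFs concrete, here is a minimal sketch of how a nucleotide sequence could be translated in all six reading frames to produce candidate protein sequences for a search database. It is an illustration only, not the pipeline used in the studies mentioned: the function names and the `min_len` parameter are assumptions, and it uses the standard genetic code with no external libraries.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_orfs(dna, min_len=2):
    """Collect candidate ORF translations (starting at M) from all six frames.

    Simplification: a trailing segment without a stop codon is also reported.
    """
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    orfs = set()
    for strand in (dna, reverse_complement):
        for frame in range(3):
            for segment in translate(strand[frame:]).split("*"):
                start = segment.find("M")
                if start != -1 and len(segment) - start >= min_len:
                    orfs.add(segment[start:])
    return sorted(orfs)


print(translate("ATGGCCTAA"))       # MA*
print(six_frame_orfs("ATGGCCTAA"))  # ['MA']
```

In a proteogenomics setting the resulting sequences would be appended to the protein database searched by the identification software, which is also why such databases grow very large.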
16. Examples of repurposing datasets: proteogenomics
Some studies have also been performed in model organisms (mouse, rat, Drosophila) and in microorganisms (Mycobacterium tuberculosis, Helicobacter pylori).
17. MS proteomics: ProteoGenomics
Figure labels: in vivo, in silico; personal genomes, personal proteomes; personalised medicine.
18. Public datasets from different omics: OmicsDI
http://www.omicsdi.org/
• Aims to integrate ‘omics’ datasets (proteomics, transcriptomics, metabolomics and genomics at present).
PRIDE
MassIVE
jPOST
PASSEL
GPMDB
ArrayExpress
Expression Atlas
MetaboLights
Metabolomics Workbench
GNPS
EGA
Perez-Riverol et al., Nat Biotechnol, 2017
20. Public data re-analysis -> Data repurposing
• Individual authors can re-analyze MS proteomics raw data with new hypotheses in mind (not taken into account by the original authors).
• Proteogenomics studies.
• Discovery of new PTMs/variants.
• Meta-analysis studies.
21. The “dark” proteome
Sequence-based search engines
22. The “dark” proteome
• Only ~25-30% of spectra in a typical proteomics experiment are identified.
• What does the fraction of unidentified spectra correspond to?
• There will certainly be artefacts (e.g. chimeric spectra).
• Undetected protein variants:
• What is not included in the searched database cannot be found.
• Peptides containing unexpected post-translational modifications (PTMs).
• Big potential to find novel, biologically relevant “proteoforms”.
23. Concept of “proteoform”
Could any of these “undetected” proteoforms have an important biological function?
Smith et al., Nat Methods, 2013
24. Introduction to Spectrum Clustering
spectra-cluster algorithm
Figure: input mass spectra (unidentified spectra, spectra identified as peptide A, spectra identified as peptide B) are grouped into consensus spectra (= data reduction).
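The clustering idea can be illustrated with a deliberately simplified sketch. This is not the spectra-cluster algorithm itself: it assumes fixed-width m/z binning, cosine similarity, and a greedy single pass, and the function names, the `threshold` value, and the toy peak lists are all hypothetical.

```python
import math
from collections import defaultdict


def normalise(vec):
    """Scale a sparse vector to unit length (empty dict if all zero)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else {}


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins, then normalise."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return normalise(binned)


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def greedy_cluster(spectra, threshold=0.7):
    """Assign each spectrum to the most similar cluster, or open a new one."""
    clusters = []
    for name, peaks in spectra:
        vec = bin_spectrum(peaks)
        best = max(clusters, key=lambda c: cosine(vec, c["consensus"]), default=None)
        if best is not None and cosine(vec, best["consensus"]) >= threshold:
            best["members"].append(name)
            for k, v in vec.items():
                best["sum"][k] = best["sum"].get(k, 0.0) + v
            # The consensus spectrum is the normalised sum of the member spectra.
            best["consensus"] = normalise(best["sum"])
        else:
            clusters.append({"members": [name], "sum": dict(vec), "consensus": vec})
    return clusters


spectra = [
    ("s1_peptide_A", [(100.1, 10), (200.2, 50), (300.3, 20)]),
    ("s2_unidentified", [(100.2, 12), (200.1, 48), (300.4, 22)]),
    ("s3_peptide_B", [(150.0, 30), (250.0, 40)]),
]
clusters = greedy_cluster(spectra)
print([c["members"] for c in clusters])
```

In this toy input the unidentified spectrum lands in the same cluster as a spectrum identified as peptide A, which is how clustering can suggest identifications for spectra that a search engine missed.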
25. PRIDE Cluster: Second Implementation
First implementation (Griss et al., Nat. Methods, 2013):
• Clustered all public, identified spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
Second implementation (Griss et al., Nat. Methods, 2016):
• Clustered all public spectra in PRIDE by summer 2015.
• Apache Hadoop.
• Starting with 256 M spectra.
• 190 M unidentified spectra (filtered to 111 M spectra that are likely to represent a peptide).
• 66 M identified spectra
• Result: 28 M clusters
• 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
26. Consistently unidentified clusters
Figure: originally submitted spectra, none of them identified, are grouped by spectrum clustering into a consensus spectrum. This gives a method to target recurrent unidentified spectra.
27. Consistently unidentified clusters (Recurring Unidentified Spectra)
• 19 M clusters contain only unidentified spectra.
• Most of them are likely to be derived from peptides.
• They could correspond to PTMs or variant peptides -> potentially relevant for biology.
• With various methods, we found likely identifications for about 20%.
• Vast amount of data mining remains to be done.
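As a rough illustration of what "consistently unidentified" means, the sketch below filters a toy set of clusters for those in which no member spectrum carries an identification, then repeats the back-of-envelope arithmetic from the slides (28 M clusters in total, 19 M unidentified, about 20% with likely identifications). The data layout, with `None` marking an unidentified spectrum, and the `min_size` parameter are assumptions made for this example.

```python
def consistently_unidentified(clusters, min_size=2):
    """Keep clusters with at least min_size spectra and no identification at all."""
    return [
        cluster for cluster in clusters
        if len(cluster) >= min_size and all(peptide is None for peptide in cluster)
    ]


# Each cluster lists the peptide assigned to each member spectrum (None = unidentified).
toy_clusters = [
    ["PEPTIDEA", None, "PEPTIDEA"],   # partly identified
    [None, None, None, None],         # recurring unidentified spectra
    [None],                           # too small to count as recurring
    ["PEPTIDEB", "PEPTIDEB"],
]
recurring = consistently_unidentified(toy_clusters)
print(len(recurring))  # 1

# Arithmetic on the figures quoted in the talk.
total_clusters = 28_000_000
unidentified_clusters = 19_000_000
share_unidentified = round(100 * unidentified_clusters / total_clusters)
likely_identified = unidentified_clusters * 20 // 100
print(share_unidentified, likely_identified)  # 68 3800000
```

On these numbers roughly two thirds of all clusters contain no identified spectrum, and even after the reported 20% of likely identifications about 15 M clusters remain unexplained.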
28. Consistently unidentified clusters -> Open modification searches
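The core idea behind an open modification search can be sketched as follows: the precursor mass is allowed to differ from that of the unmodified peptide, and the observed mass difference is then compared with the mass shifts of known modifications. The example below shows only that last lookup step. Real open modification search tools also compare fragment spectra, and nothing here is taken from the tools used in the talk; the function name and tolerance are assumptions, while the mass shifts are standard monoisotopic values.

```python
# Monoisotopic mass shifts (Da) of a few common modifications.
KNOWN_MASS_SHIFTS = {
    "Methyl": 14.01565,
    "Oxidation": 15.99491,
    "Acetyl": 42.01057,
    "Phospho": 79.96633,
}


def explain_mass_shift(observed_mass, unmodified_mass, tolerance=0.01):
    """Return the known modifications whose mass shift matches the observed delta."""
    delta = observed_mass - unmodified_mass
    return [
        name for name, shift in KNOWN_MASS_SHIFTS.items()
        if abs(delta - shift) <= tolerance
    ]


print(explain_mass_shift(1079.9665, 1000.0))  # ['Phospho']
print(explain_mass_shift(1000.0, 1000.0))     # []
```

A mass difference that matches no known modification is still informative, since it may point to a sequence variant or an unexpected modification.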
29. Public data re-analysis -> Data repurposing
• Individual authors can re-analyze MS proteomics raw data with new hypotheses in mind (not taken into account by the original authors).
• Proteogenomics studies.
• Discovery of new PTMs/variants.
• Meta-analysis studies.
30. Meta-analysis: Protein-protein interactions/associations
Drew et al., Mol Systems Biol, 2017; Gupta et al., NAR, 2017, in press
31. Recent examples of meta-analysis studies (2)
Lund-Johansen et al., Nat Methods, 2016
32. Further reading…
Martens & Vizcaíno, Trends Biochem Sci, 2017; Vaudel et al., Proteomics, 2016
33. Overview
• Short introduction to proteomics and PRIDE
• Reuse of public proteomics data
• Open analysis pipelines for proteomics data
• Challenges for public proteomics data reuse
34. What’s a cloud environment?
Infrastructure as a Service:
• Compute power not necessarily local but remote
• Still from compute centres, but on a larger scale
The ‘service’ is:
• Customer gets infrastructure, but it’s virtualized
• This abstraction yields better utilisation and scalability (but...)
• Developer/customer has to interface with these abstraction layers
35. ELIXIR Implementation Study
Purpose:
• Develop robust, fully reproducible and automated proteomics analysis pipelines (DDA ‘shotgun’ proteomics workflows).
• Deployed in a cloud environment and reusable by the scientific community.
ELIXIR Implementation Study, started in February (1 year):
• EMBL-EBI (Proteomics & Cloud infrastructure Teams)
• ELIXIR-DE (Kohlbacher, EKUT; Eisenacher, RUB)
36. ELIXIR Implementation Study
Develop exemplary proteomics data analysis workflows and deploy them in the EMBL-EBI “Embassy Cloud”:
(1) Standard identification workflow
(2) Identification workflow for PTMs
(3) Quantification (label-free/label-based approaches)
(4) Quality Control (to aid data set interpretation/reanalysis evaluation)
(5) Versions of quantification approaches (including PTMs)
-> Connected to public proteomics data from
37. Analysis software used
We are using the framework
Features:
• Tool modularisation
• Solutions for data handover between tools with standardised (PSI) formats
• Adapters for integrating third-party software (search engines, LuciPHOr, FIDO, Percolator, etc.)
• Integration into various workflow systems as a basis
38. Cloud deployment developments at EBI
Cloud workflow in metabolomics:
Enabling flexible workflow implementation
-> Kubernetes Galaxy runner development
Kubernetes Framework (k8s):
• Controls the virtual cloud infrastructure
• Manages request and provision of resources (VMs)
40. Open analysis pipelines -> In the near future…
• Recent 3-year BBSRC grant awarded to do the same for DIA approaches (to start in December 2017).
• In collaboration with the Stoller Center (Manchester) (co-PIs Graham, Hubbard & Townsend)
• Recent 4-year Wellcome Trust grant awarded to develop (among other things) pipelines for proteogenomics approaches (to start mid 2018).
• In collaboration with J. Choudhary (Institute of Cancer Research, London)
41. Purpose
Data dissemination into popular bioinformatics resources
42. Overview
• Short introduction to proteomics and PRIDE
• Reuse of public proteomics data
• Open analysis pipelines for proteomics data
• Challenges for public proteomics data reuse
43. Main challenges (1)
• Lack of information/annotation about the experimental design:
• Need for manual curation unavoidable at present
• But wouldn’t it be needed anyway?
• Crowd-sourcing infrastructure?
• Computational infrastructure challenges:
• Datasets get larger and larger…
• Analysing many datasets together is challenging (in terms of the IT infrastructure)
• Windows OS dependency (though things are improving in this context)
44. Main challenges (2)
• Computational analysis challenges:
• Need to improve current computational analysis methods for big datasets (e.g. FDR calculation)
• Size of DBs is very large (e.g. in proteogenomics)
• Open modification searches
• “Controlled access” for clinical proteomics data:
• Not enforced at present, but it is common practice not to share these datasets just in case…
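For readers unfamiliar with the FDR calculation mentioned above, the following is a minimal sketch of the widely used target-decoy estimate, in which the number of decoy hits above a score threshold approximates the number of false target hits. It is a generic illustration, not the method of any specific tool named in this talk; the function names and the toy scores are hypothetical.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as decoys / targets among matches scoring >= threshold."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def score_cutoff_for_fdr(psms, max_fdr=0.01):
    """Return the lowest score threshold whose estimated FDR is within max_fdr."""
    cutoff = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, threshold) <= max_fdr:
            cutoff = threshold
    return cutoff


# Toy peptide-spectrum matches: (score, is_decoy)
psms = [
    (9.0, False), (8.5, False), (8.0, False), (7.0, True),
    (6.5, False), (6.0, False), (5.0, True),
]
print(fdr_at_threshold(psms, 6.0))       # 0.2
print(score_cutoff_for_fdr(psms, 0.25))  # 6.0
print(score_cutoff_for_fdr(psms, 0.01))  # 8.0
```

When many datasets are analysed together, or when the search database is very large as in proteogenomics, the decoy counts behave differently than in a single experiment, which is one reason the talk lists FDR calculation as an open challenge.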
45. Main challenges (3): Across-omics analysis
• It can be very challenging to find “matching” datasets (e.g. genomics, proteomics, metabolomics).
• Better connection between resources from different omics approaches is needed.
46. Summary
• Public proteomics datasets are on the rise! (~250 new datasets/month in PRIDE).
• Many possibilities are open for reuse of this data:
• New purposes: proteogenomics approaches.
• Discovery of new proteoforms (e.g. new PTMs, new variants).
• Starting to work on open and reproducible analysis pipelines.
• ELIXIR Implementation Study (DDA data, collaboration between EMBL-EBI & ELIXIR-DE)
• In the near future: DIA and proteogenomics approaches.
47. Acknowledgements: People
Attila Csordas
Tobias Ternent
Mathias Walzer
Gerhard Mayer (de.NBI)
Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak
Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob
Acknowledgements: The PRIDE Team
All data submitters!!!
@pride_ebi
@proteomexchange
Pablo Moreno
EBI Cloud infrastructure team