ELIXIR Proteomics Community Drives Standards and Tools
1. European Life Sciences Infrastructure for Biological Information
www.elixir-europe.org
The ELIXIR Proteomics Community
Dr. Juan Antonio Vizcaíno
European Bioinformatics Institute (EMBL-EBI)
juan@ebi.ac.uk
2. Juan A. Vizcaíno
juan@ebi.ac.uk
12th CeBiTec Symposium: “Big data in Medicine & Biotechnology”
Bielefeld, 20 March 2018
• One slide intro to proteomics
• The ELIXIR Proteomics Community
• Plans for the near future
Outline
3. Juan A. Vizcaíno
One slide intro to Mass Spectrometry proteomics
Hein et al., Handbook of Systems Biology, 2012
Proteins -> most drug targets
4. Juan A. Vizcaíno
• One slide intro to proteomics
• The ELIXIR Proteomics Community
• Plans for the near future
Outline
5. Juan A. Vizcaíno
• 11 ELIXIR nodes supported the application:
• Germany (co-lead) (O. Kohlbacher)
• Belgium (co-lead) (L. Martens)
• Czech Republic
• Denmark
• Ireland
• France
• Netherlands
• Spain
• Sweden
• United Kingdom
• EMBL-EBI (co-lead) (Juan A. Vizcaíno)
ELIXIR nodes supporting the new Community
6. Juan A. Vizcaíno
• The goal of the ELIXIR proteomics community is to
develop and maintain sustainable proteomics
tools and data resources
• An essential part of the development will also be the
‘FAIRification’ of the resources (i.e. making the
resources FAIR)
• Integrate proteomics bioinformatics activities in
ELIXIR
Overall objectives
7. Juan A. Vizcaíno
White paper as the basis for this Community
Vizcaíno et al., F1000Research, 2017
8. Juan A. Vizcaíno
Highlighting already existing resources and initiatives
Tools: Services and connectors to drive access and exploitation
Data: Sustaining Europe’s life science data infrastructure
Interoperability: Integration of data and services
Compute: Access, exchange and storage
Training: Professional skills for managing and exploiting data
10. Juan A. Vizcaíno
• PRIDE stores mass spectrometry (MS)-based
proteomics data:
• Peptide and protein expression data
(identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
• Any type of data can be stored
• Leading ProteomeXchange
• From July 2017, an ELIXIR core resource
European leadership: the world-leading PRIDE database
http://www.ebi.ac.uk/pride/archive
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
11. Juan A. Vizcaíno
ProteomeXchange: A Global, distributed proteomics database
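• Member repositories:
• PRIDE (MS/MS data)
• MassIVE (MS/MS data)
• jPOST (MS/MS data)
• iProX (MS/MS data)
• PASSEL (SRM data)
• Submitted data: raw files (Raw), identification/quantification results (ID/Q), metadata (Meta)
• Mandatory data deposition
http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017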
• Framework to allow standard data submission and dissemination
pipelines between the main existing proteomics repositories.
12. Juan A. Vizcaíno
PRIDE data submissions and data growth
> 2,400 datasets submitted in 2017
September, November and December
2017 were the record months in terms
of submitted datasets
[Charts: datasets submitted per month; datasets submitted per year]
13. Juan A. Vizcaíno
Stats: Data growth in EMBL-EBI resources
Sequence data
Micro-array
Metabolomics
Proteomics
14. Juan A. Vizcaíno
Data re-use in proteomics is increasing
Data download volume for PRIDE
Archive in 2017: 295 TB
[Bar chart: PRIDE Archive downloads in TB per year, 2013 to 2017]
15. Juan A. Vizcaíno
Do you want to learn more?
Martens & Vizcaíno, Trends Bioch Sci, 2017
Vaudel et al., Proteomics, 2016
16. Juan A. Vizcaíno
Tools: Services and connectors to drive access and exploitation
Data: Sustaining Europe’s life science data infrastructure
Interoperability: Integration of data and services
Compute: Access, exchange and storage
Training: Professional skills for managing and exploiting data
Highlighting already existing resources and initiatives
17. Juan A. Vizcaíno
• Develops open data standards for proteomics.
• Both data representation and annotation standards.
• Involves data producers, database providers, software producers,
publishers, everyone who wants to be involved…
• Active Workgroups: MI, MS, PI, Mod and the new QC.
• Inter-group activities: MIAPE and Controlled Vocabularies.
• Started in 2002, so some experience already…
• One annual meeting in March-April, regular phone calls.
• Closer interaction with the metabolomics community (MSI).
http://www.psidev.info
European leadership: HUPO Proteomics Standards Initiative
18. Juan A. Vizcaíno
Activities of the Proteomics Standards Initiative
Deutsch et al., J Proteome Res, 2017
19. Juan A. Vizcaíno
• One slide intro to proteomics
• The ELIXIR Proteomics Community
• Plans for the near future
Outline
21. Juan A. Vizcaíno
Moving to “Proteoform” centric approaches
Smith et al., Nat Methods, 2013
23. Juan A. Vizcaíno
Across-omics -> Proteogenomics approaches
• Proteomics data is combined with genomics and/or transcriptomics
information, typically by using sequence databases generated from
DNA sequencing efforts, RNA-Seq experiments, Ribo-Seq
approaches, and long non-coding RNAs.
• Increasingly important in personalised medicine studies.
Nesvizhskii, Nat Methods, 2014
24. Juan A. Vizcaíno
Data standards for proteogenomics: proBed and proBAM
• Same overall objective: to map identified peptides to genome
coordinates. Different level of detail:
• proBed is tab-delimited and simpler, based on the original BED format. Lower level of
detail.
• proBAM is based on the original SAM/BAM formats, widely used in genomics. Much
higher level of detail.
• They can be used as “Track Hubs” (e.g. integration with genome browsers)
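To make the idea of mapping peptides to genome coordinates concrete, here is a minimal Python sketch that reads only the six core BED columns (chrom, start, end, name, score, strand) from a tab-delimited line. It is an illustration under assumptions, not a proBed parser: proBed extends BED with further peptide-specific columns that are not modelled here, and the example line, the `PeptideMapping` class and the `parse_bed_core` function are hypothetical names made up for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideMapping:
    """Core BED fields for one peptide mapped to the genome (hypothetical helper)."""
    chrom: str
    start: int    # 0-based, inclusive (BED convention)
    end: int      # 0-based, exclusive (BED convention)
    name: str
    score: float
    strand: str

    @property
    def length(self) -> int:
        # Half-open interval, so the genomic span is simply end - start.
        return self.end - self.start


def parse_bed_core(line: str) -> PeptideMapping:
    """Parse the first six tab-separated BED columns; extra columns are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 tab-separated fields, got {len(fields)}")
    chrom, start, end, name, score, strand = fields[:6]
    if strand not in {"+", "-", "."}:
        raise ValueError(f"unexpected strand value: {strand!r}")
    return PeptideMapping(chrom, int(start), int(end), name, float(score), strand)


# Made-up example line: a peptide spanning 30 bases on the forward strand.
example = "chr1\t1000\t1030\tpeptide_1\t900\t+"
mapping = parse_bed_core(example)
print(mapping.chrom, mapping.start, mapping.end, mapping.length)
```

Because the interval is half-open, an unspliced 10-residue peptide covers exactly 30 bases, which is what `mapping.length` reports for the example line.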
25. Juan A. Vizcaíno
Data
Tools
Compute
Interoperability
Training
27. Juan A. Vizcaíno
• Title: “Mining the proteome: Enabling automated processing and analysis
of large-scale proteomics data”.
• Goal: Development of open, reproducible and modular pipelines based
on OpenMS for DDA (Data Dependent Acquisition) approaches.
• Deployment in the EMBL-EBI “Embassy Cloud”, with the goal that in the future
the pipelines can be deployed in other cloud infrastructures and reused by
anyone in the community (e.g. hospitals).
• Connected to PRIDE, bringing the tools closer to the data.
• Who is involved?
• EMBL-EBI (Vizcaíno & Newhouse).
• ELIXIR-DE (Kohlbacher, EKUT; Eisenacher, RUB).
ELIXIR Implementation Study (Feb 2017-June 2018)
28. Juan A. Vizcaíno
We opted for the OpenMS framework
Features:
• Tool modularisation
• Solutions for data handover between tools with standardised
(PSI) formats
• Adapters for integrating third-party software (search engines,
LuciPHOr, Fido, Percolator, etc.)
• Integration into various workflow systems as a basis
Software used
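The combination of tool modularisation and data handover through standardised formats can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the framework's API: the `Step` class, the `check_handover` function and the step names are hypothetical, and only the format names (mzML, mzIdentML, mzTab) refer to real PSI standards.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    """One modular tool in a pipeline (hypothetical model, not a real tool wrapper)."""
    name: str
    consumes: str   # file format the step reads
    produces: str   # file format the step writes


def check_handover(steps: List[Step]) -> List[str]:
    """Return a message for every adjacent pair whose formats do not line up."""
    problems = []
    for upstream, downstream in zip(steps, steps[1:]):
        if upstream.produces != downstream.consumes:
            problems.append(
                f"{upstream.name} produces {upstream.produces}, "
                f"but {downstream.name} expects {downstream.consumes}"
            )
    return problems


# Hypothetical step names; each step hands a PSI-format file to the next one.
pipeline = [
    Step("peak_picking", "mzML", "mzML"),
    Step("database_search", "mzML", "mzIdentML"),
    Step("fdr_filtering", "mzIdentML", "mzIdentML"),
    Step("export", "mzIdentML", "mzTab"),
]

print(check_handover(pipeline))  # empty list: every handover matches

# Dropping the search step breaks the chain: mzML is handed to a step expecting mzIdentML.
broken = [pipeline[0], pipeline[2]]
print(check_handover(broken))
```

Keeping each step's input and output formats explicit is what allows tools to be swapped or reordered, and lets a workflow system verify the chain before anything is run.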
29. Juan A. Vizcaíno
Summary figure of the developed infrastructure
Thanks to
30. Juan A. Vizcaíno
• Follow-up of the implementation study just mentioned.
• Title: "Extending open proteomics data analysis pipelines in the
cloud: Additional tools and focus on scalability, supporting the
dramatic growth of public proteomics data"
• It will start in August 2018 (1 year):
• Led by ELIXIR-Belgium (Martens).
• Participation of EMBL-EBI (Vizcaíno, Newhouse), ELIXIR-Germany
(Kohlbacher), ELIXIR-France (Bouyssie), ELIXIR-Spain (Sabidó).
• It will include other tools and additional pipelines (CompOmics tools,
QCloud, PROFI tools, etc.).
Just approved Implementation Study (2018-2019)
31. Juan A. Vizcaíno
Data
Tools
Compute
Interoperability
Training
33. Juan A. Vizcaíno
• Assigned to the Community (10 ELIXIR nodes involved). It will start
in June 2018 (1 year).
• Title: “Crowd-sourcing the annotation of public proteomics
datasets to improve data reusability”.
• Apply software developed in the different nodes to improve
automatic annotation pipelines linked to PRIDE (and QC
assessment).
• Improve re-usability of public data.
Just approved Implementation Study (2018-2019)
34. Juan A. Vizcaíno
Management of clinical proteomics data
Is proteomics data
patient identifiable?
A couple of papers on
this topic in 2016
Clear guidelines, policy
and resources
need to be developed
35. Juan A. Vizcaíno
Data
Tools
Compute
Interoperability
Training
37. Juan A. Vizcaíno
• Proteomics bioinformatics activities in Europe are
very prominent world-wide
• Plans for the future:
• Focus on ‘proteoform’-centric approaches.
• Data integration approaches with other ‘omics’
technologies (e.g. genomics, metabolomics, etc.).
• Development of open, reproducible and scalable data analysis
(and QC) workflows.
• Improve data management practices (metadata
annotation, management of clinical data, …)
Summary
38. Juan A. Vizcaíno
Acknowledgements
https://www.elixir-europe.org/communities/proteomics