This document summarizes a presentation about the ProteomeXchange (PX) consortium, which provides a framework for standard data submission and dissemination between the main proteomics repositories, including PRIDE, PeptideAtlas, and MassIVE. It describes how researchers can submit complete or partial datasets to PX via PRIDE using the PX submission tool. Complete submissions provide processed results as PRIDE XML or mzIdentML, while partial submissions store search engine output files in their original form. As of October 2014, 1,329 datasets had been submitted to PX from researchers worldwide.
ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets
1. ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets
Dr. Juan Antonio Vizcaíno
PRIDE Group Coordinator
Proteomics Services Team
EMBL-EBI
Hinxton, Cambridge, UK
2. Overview
Juan A. Vizcaíno
juan@ebi.ac.uk
13th HUPO World Congress
Madrid, 5 October 2014
• The ProteomeXchange (PX) consortium
• How to submit and access data in PX via PRIDE
• How to access PX data
• Some HPP related things
3. ProteomeXchange Consortium
• Goal: Development of a framework to allow
standard data submission and dissemination
pipelines between the main existing proteomics
repositories.
• Includes PeptideAtlas (ISB, Seattle), PRIDE
(Cambridge, UK) and (very recently) MassIVE
(UCSD, San Diego).
• Common identifier space (PXD identifiers)
• Two supported data workflows: MS/MS and SRM.
• Main objective: Make life easier for researchers
http://www.proteomexchange.org
4. ProteomeXchange data workflow
[Figure: data workflow diagram. Submitted items: results, raw data*, metadata / manuscript. Receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data). Portal: ProteomeCentral. Other resources shown: journals, UniProt/neXtProt, PeptideAtlas, GPMDB, other DBs. Legend: researcher's results, reprocessed results, raw data*, metadata.]
Vizcaíno et al., Nat Biotechnol, 2014
5. MassIVE (UCSD)
• Joined ProteomeXchange in June 2014
• Only partial submissions. A few datasets so far.
http://proteomics.ucsd.edu/service/massive/
6. PASSEL: repository for SRM data
• Suitable for SRM assays
• Part of PeptideAtlas set of resources.
http://www.peptideatlas.org/passel/
Farrah et al., Proteomics, 2012
8. Overview
• The ProteomeXchange (PX) consortium
• How to submit and access data in PX via PRIDE
• How to access PX data
• Some HPP related things
9. Manuscript just out detailing the process
http://www.proteomexchange.org/submission
Ternent et al., Proteomics, 2014
Example dataset:
PXD000764
- Title: “Discovery of new CSF biomarkers for meningitis in children”
- 12 runs: 4 controls and 8 infected samples
- Identification and quantification data
10. PX Data workflow for MS/MS data
1. Mass spectrometer output files: raw data (binary files) or
peak list spectra in a standardized format (mzML, mzXML).
2. Result files:
a. Complete submissions: Result files can be converted to
PRIDE XML or the mzIdentML data standard.
b. Partial submissions: For workflows not yet supported by
PRIDE, search engine output files will be stored and
provided in their original form.
3. Metadata: Sufficiently detailed description of sample origin,
workflow, instrumentation, submitter.
4. Other files: Optional files:
a. QUANT: Quantification related results
b. PEAK: Peak list files
c. GEL: Gel images
d. OTHER: Any other file type
e. FASTA
f. SP_LIBRARY
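The file categories above can be sketched as a small classifier. This is only an illustration: the extension mapping, the `RAW` and `SEARCH` labels, and the function names are assumptions made for this example and are not the rules used by the PX submission tool.

```python
from pathlib import PurePath

# Hypothetical extension-to-type mapping, for illustration only.
# The real PX submission tool may assign file types differently.
EXTENSION_TO_TYPE = {
    ".raw": "RAW",        # vendor binary output (label assumed)
    ".mzml": "PEAK",      # standardized peak list / spectra formats
    ".mzxml": "PEAK",
    ".mgf": "PEAK",
    ".mzid": "RESULT",    # mzIdentML processed results
    ".dat": "SEARCH",     # search engine output in its original form (label assumed)
    ".fasta": "FASTA",
}


def classify_file(name: str) -> str:
    """Return an assumed file type for a file name, defaulting to OTHER."""
    suffix = PurePath(name).suffix.lower()
    return EXTENSION_TO_TYPE.get(suffix, "OTHER")


def submission_kind(file_names) -> str:
    """Call a submission 'complete' if it has at least one RESULT file."""
    types = {classify_file(name) for name in file_names}
    return "complete" if "RESULT" in types else "partial"


if __name__ == "__main__":
    files = ["run01.raw", "run01.mzid", "run01.mgf", "proteins.fasta"]
    for name in files:
        print(name, "->", classify_file(name))
    print("submission kind:", submission_kind(files))
```

In this sketch a submission without a supported result file falls back to "partial", mirroring the distinction drawn in point 2.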
11. Complete vs Partial submissions: experimental metadata
Complete Partial
General experimental metadata about the projects is similar.
However, assay-level information in partial submissions is less detailed.
12. Complete vs Partial submissions: processed results
For complete submissions, the spectra can be connected to the processed identification results, and both can be visualized.
Complete
Partial
13. Complete submissions using mzIdentML
An increasing number of tools support export to mzIdentML 1.1
[Figure: search engines → search engine results + MS files → mzIdentML]
- Mascot
- MSGF+
- Myrimatch and related tools from D. Tabb’s lab
- OpenMS
- PEAKS
- ProCon (ProteomeDiscoverer, Sequest)
- Scaffold
- TPP via the idConvert tool (ProteoWizard)
- ProteinPilot (planned by the end of 2014)
- Others: library for X!Tandem conversion, lab
internal pipelines, …
- Referenced spectral files need to be submitted as well
(all open formats are supported).
Updated list: http://www.psidev.info/tools-implementing-mzIdentML#.
14. Tools, ‘RESULT’ file generation, final ‘RESULT’ file
[Figure: tools (Mascot, ProteinPilot, Scaffold, PEAKS, MSGF+, others) → native file export → mzIdentML ‘RESULT’ file, plus spectra files. Now: native file export.]
15. Available for complete submissions
Wang et al., Nat. Biotechnology, 2012
PRIDE Inspector 2.0
PRIDE Inspector 2.0 supports:
- PRIDE XML
- mzIdentML + all types of spectra files
- mzML
- mzTab Ident (work in progress)
http://code.google.com/p/pride-toolsuite/wiki/PRIDEInspector
16. PX submission tool: data submission
• Capture the mappings between the different types of files.
• Add the mandatory metadata annotation.
• Make the file upload process straightforward for the submitter (it transfers all the files using Aspera or FTP).
• Command line alternative: some scripting is needed.
http://www.proteomexchange.org/submission
17. Uploading large datasets: Aspera
- Aspera is the default file transfer protocol to PRIDE:
- PX Submission tool
- Command line
- Up to 50X faster than FTP
File transfer speed should not be a problem!!
18. ProteomeXchange: 1329 datasets up until October 2014
Origin:
271 USA
166 Germany
115 United Kingdom
73 Switzerland
70 China
68 Netherlands
67 France
55 Canada
44 Spain
42 Belgium
33 Sweden
31 Australia
31 Denmark
31 Japan
20 India
20 Norway
19 Taiwan
17 Ireland
16 Austria
14 Finland
14 Italy
12 Republic of Korea
11 Brazil
9 Russia
8 Israel
7 Singapore …
Type:
437 PRIDE complete
792 PRIDE partial
63 PeptideAtlas/PASSEL complete
14 MassIVE
23 reprocessed
Publicly Accessible:
691 datasets, 52% of all
86% PRIDE
12% PASSEL
2% MassIVE
Top Species studied by at least 10
datasets:
577 Homo sapiens
165 Mus musculus
56 Saccharomyces cerevisiae
53 Arabidopsis thaliana
29 Rattus norvegicus
22 Escherichia coli
17 Bos taurus
16 Mycobacterium tuberculosis
13 Oryza sativa
13 Drosophila melanogaster
13 Glycine max
~ 290 species in total
Data volume:
Total: ~55 TB
Number of all files: ~131,000
PXD000320-324: ~ 5 TB
PXD000065: ~ 1.4TB
Datasets/year:
2012: 102
2013: 527
2014: 700
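The figures on this slide are internally consistent, which a few lines of arithmetic can confirm. The sketch below only re-adds the numbers quoted above; the variable names are chosen for this example.

```python
# Numbers copied from the slide; variable names are illustrative.
TOTAL_DATASETS = 1329

datasets_by_type = {
    "PRIDE complete": 437,
    "PRIDE partial": 792,
    "PeptideAtlas/PASSEL complete": 63,
    "MassIVE": 14,
    "reprocessed": 23,
}

datasets_by_year = {2012: 102, 2013: 527, 2014: 700}

PUBLIC_DATASETS = 691

type_total = sum(datasets_by_type.values())
year_total = sum(datasets_by_year.values())
public_percent = round(100 * PUBLIC_DATASETS / TOTAL_DATASETS)

print("sum by type:", type_total)          # 1329
print("sum by year:", year_total)          # 1329
print("public share (%):", public_percent)  # 52
```

Both breakdowns add up to the 1329 datasets in the slide title, and 691 public datasets corresponds to the quoted 52%.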
19. Public data release: when does it happen?
• When the author tells us to do it (the authors can do it by themselves)
• When we find out that a dataset has been published
• We look for PXD identifiers in PubMed abstracts.
• If your PXD identifier is not in the abstract, a paper may have
been published and the data is still private. Let us know!
• New web form in the PRIDE web to facilitate the process
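Looking for PXD identifiers in abstract text can be sketched with a regular expression. This is a minimal illustration, not the PRIDE team's actual pipeline: the pattern assumes identifiers of the form `PXD` followed by six digits, as in the examples in these slides, and the sample abstract sentence is made up for the example.

```python
import re

# Assumed identifier shape: "PXD" plus six digits (e.g. PXD000764).
PXD_PATTERN = re.compile(r"\bPXD\d{6}\b")


def find_pxd_ids(text: str) -> list:
    """Return the distinct PXD identifiers in text, in order of first appearance."""
    found = []
    for match in PXD_PATTERN.findall(text):
        if match not in found:
            found.append(match)
    return found


if __name__ == "__main__":
    # Hypothetical abstract sentence, for illustration only.
    abstract = (
        "The data have been deposited to ProteomeXchange "
        "with identifier PXD000764."
    )
    print(find_pxd_ids(abstract))
```

An abstract that never mentions the identifier yields an empty list, which is exactly the case where the dataset can stay private after publication unless the authors get in touch.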
21. Partial submissions can be used to store other data types
• Everything can be stored, not only MS/MS data: very flexible
mechanism to be able to capture all types of datasets
• PRIDE does not store SRM data (it goes to PASSEL)
• Top down proteomics datasets.
• Mass Spectrometry Imaging datasets.
• Data independent acquisition techniques: e.g. SWATH-MS datasets.
22. Imaging MS datasets: partial submissions
From original publication [13]:
1. Thermo RAW data / UDP
2. Mirion Software (JLU)
Reconstructed ProteomeXchange data:
1. Thermo RAW data / UDP
2. Convert to imzML
3. Upload to PRIDE repository (EBI, Cambridge, UK)
4. Download from PRIDE
5. Display in MSiReader
- Vendor-independent data format
- Freely available software (open source)
- ‘open data’ – free to reuse
- Anybody can do this!
23. Overview
• The ProteomeXchange (PX) consortium
• How to submit and access data in PX via PRIDE
• How to access PX data
• Some HPP related things
25. ProteomeCentral: Portal for all PX datasets
http://proteomecentral.proteomexchange.org/cgi/GetDataset
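A link to a single dataset page can be built from a PXD identifier. The sketch below is an assumption-laden illustration: the `ID` query parameter and the six-digit identifier shape are not stated on the slide, and the helper name `dataset_url` is hypothetical. It only builds the URL and makes no network request.

```python
import re
from urllib.parse import urlencode

BASE_URL = "http://proteomecentral.proteomexchange.org/cgi/GetDataset"

# Assumed identifier shape: "PXD" plus six digits (e.g. PXD000764).
PXD_PATTERN = re.compile(r"PXD\d{6}")


def dataset_url(pxd_id: str) -> str:
    """Build a ProteomeCentral dataset URL; the 'ID' parameter is an assumption."""
    if not PXD_PATTERN.fullmatch(pxd_id):
        raise ValueError(f"not a PXD identifier: {pxd_id!r}")
    return f"{BASE_URL}?{urlencode({'ID': pxd_id})}"


if __name__ == "__main__":
    print(dataset_url("PXD000764"))
```

Validating the identifier before building the URL keeps malformed input from producing a link that looks plausible but points nowhere.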
26. Get notified about new PX datasets
- Subscribe to the RSS Feed to receive information about the new datasets:
http://groups.google.com/group/proteomexchange/feed/rss_v2_0_msgs.xml
27. Overview
• The ProteomeXchange (PX) consortium
• How to submit and access data in PX via PRIDE
• How to access PX data
• Some HPP related things
28. PX submission tool: HPP tags
29. HPP datasets are now tagged
The Projects are now tagged and can be browsed as a group of data sets.
Tags for: HPP, C-HPP and B/D-HPP
30. HPP PX datasets: some numbers
Since January 2014, we have been capturing the PI information
- 25 HPP datasets: 22 C-HPP and 3 B/D-HPP
- Countries represented in C-HPP:
- 5 Spain
- 4 South Korea
- 3 Brazil, China
Only a small proportion of the datasets have been made publicly available, at least through ProteomeXchange
31. Which are the most accessed datasets?
PXD Identifier  Hits    Dataset title                                            Publication
PXD000561       153512  A draft map of the human proteome                        Kim et al., Nature, 2014. PMID: 24870542
PXD000851       111587  Membrane proteomic analysis of colorectal cancer tissue  Kume et al., MCP, 2014. PMID: 24687888
PXD000865       51639   Mass spectrometry based draft of the human proteome      Wilhelm et al., Nature, 2014. PMID: 24870543
32. Which are the most accessed datasets?
Total Numbers
33. Conclusions
• ProteomeXchange is widely used.
• PRIDE contains most of the MS/MS datasets.
• It now has a new consortium member: MassIVE (UCSD).
• Around half of the datasets are already public.
• Different open source tools available to facilitate the process:
• File transfer speed should not be a problem (Aspera support)
• Data deposition enables and promotes data reuse.
• ProteomeXchange is open to new members.
34. Acknowledgements
PeptideAtlas Team (ISB, Seattle)
Eric Deutsch
Terry Farrah
Zhi Sun
Andrew R. Jones
Lennart Martens
Juan Pablo Albar
Martin Eisenacher
Gil Omenn
Nuno Bandeira
And many other PX partners and
stakeholders
PRIDE Team
Attila Csordas
Rui Wang
Florian Reisinger
Jose A. Dianes
Tobias Ternent
Yasset Perez-Riverol
Noemi del Toro
Henning Hermjakob
EU FP7 grant number 260558
35. Questions?
36. Connecting different data types
How to connect different data types (genomics, metabolomics, etc)?
It can be used for:
- ArrayExpress/GEO identifiers
- MetaboLights identifiers
- etc.
37. Pilot project started in the context of ELIXIR
[Figure: pilot architecture diagram. Labels: PRIDE, EMBL-EBI, CSC, BILS, Site B, Site C, B2SAFE, ELIXIR, EUDAT CDI.]