
ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets


Talk I gave in the Human Proteome Project session during HUPO 2014, devoted to ProteomeXchange. I summarized the updates from the last year.


  1. ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets. Dr. Juan Antonio Vizcaíno, PRIDE Group Coordinator, Proteomics Services Team, EMBL-EBI, Hinxton, Cambridge, UK
  2. Overview • The ProteomeXchange (PX) consortium • How to submit and access data in PX via PRIDE • How to access PX data • Some HPP related things (Juan A. Vizcaíno, juan@ebi.ac.uk; 13th HUPO World Congress, Madrid, 5 October 2014)
  3. ProteomeXchange Consortium • Goal: development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories. • Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and (very recently) MassIVE (UCSD, San Diego). • Common identifier space (PXD identifiers). • Two supported data workflows: MS/MS and SRM. • Main objective: make life easier for researchers. http://www.proteomexchange.org
  4. ProteomeXchange data workflow [Diagram labels: Researcher’s results; Raw data*; Metadata / Manuscript; Receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data); ProteomeCentral; Journals; UniProt/neXtProt; PeptideAtlas; GPMDB; Other DBs; Reprocessed results]. Vizcaíno et al., Nat Biotechnol, 2014
  5. MassIVE (UCSD) • Joined ProteomeXchange in June 2014. • Only partial submissions; a few datasets so far. http://proteomics.ucsd.edu/service/massive/
  6. PASSEL: repository for SRM data • Suitable for SRM assays. • Part of the PeptideAtlas set of resources. http://www.peptideatlas.org/passel/ Farrah et al., Proteomics, 2012
  7. ProteomeXchange data workflow (diagram repeated from slide 4). Vizcaíno et al., Nat Biotechnol, 2014
  8. Overview • The ProteomeXchange (PX) consortium • How to submit and access data in PX via PRIDE • How to access PX data • Some HPP related things
  9. Manuscript just out detailing the process: Ternent et al., Proteomics, 2014. http://www.proteomexchange.org/submission Example dataset: PXD000764 - Title: “Discovery of new CSF biomarkers for meningitis in children” - 12 runs: 4 controls and 8 infected samples - Identification and quantification data
  10. PX Data workflow for MS/MS data
      1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).
      2. Result files:
         a. Complete submissions: result files can be converted to PRIDE XML or the mzIdentML data standard.
         b. Partial submissions: for workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.
      3. Metadata: sufficiently detailed description of sample origin, workflow, instrumentation, submitter.
      4. Other files (optional): a. QUANT: quantification-related results; b. PEAK: peak list files; c. GEL: gel images; d. OTHER: any other file type; e. FASTA; f. SP_LIBRARY.
      [Figure labels: Published; Raw Files; Other files]
  11. Complete vs Partial submissions: experimental metadata. General experimental metadata about the projects is similar. However, at the assay level, the information in partial submissions is less detailed. [Screenshot labels: Complete; Partial]
  12. Complete vs Partial submissions: processed results. For complete submissions, it is possible to connect the spectra with the processed identification results and to visualize them. [Screenshot labels: Complete; Partial]
  13. Complete submissions using mzIdentML. An increasing number of tools support export to mzIdentML 1.1: - Mascot - MSGF+ - MyriMatch and related tools from D. Tabb’s lab - OpenMS - PEAKS - ProCon (ProteomeDiscoverer, Sequest) - Scaffold - TPP via the idConvert tool (ProteoWizard) - ProteinPilot (planned by the end of 2014) - Others: library for X!Tandem conversion, lab internal pipelines, … Referenced spectral files need to be submitted as well (all open formats are supported). Updated list: http://www.psidev.info/tools-implementing-mzIdentML#. [Diagram labels: Search engines; Search Engine Results + MS files; mzIdentML]
  14. Tools: ‘RESULT’ file generation [Diagram labels: Mascot, ProteinPilot, Scaffold, PEAKS, MSGF+, Others; Now: native file export; mzIdentML ‘RESULT’; Spectra files; Final ‘RESULT’ file]
  15. PRIDE Inspector 2.0 (available for complete submissions). Wang et al., Nat. Biotechnology, 2012. PRIDE Inspector 2.0 supports: - PRIDE XML - mzIdentML + all types of spectra files - mzML - mzTab Ident (work in progress). http://code.google.com/p/pride-toolsuite/wiki/PRIDEInspector
  16. PX submission tool: data submission • Captures the mappings between the different types of files. • Adds the mandatory metadata annotation. • Makes the file upload process straightforward for the submitter (it transfers all the files using Aspera or FTP). • Command line alternative: some scripting is needed. http://www.proteomexchange.org/submission [Figure labels: Published; Raw; Other files]
  17. Uploading large datasets: Aspera - Aspera is the default file transfer protocol to PRIDE, both in the PX Submission tool and on the command line - Up to 50X faster than FTP. File transfer speed should not be a problem!!
  18. ProteomeXchange: 1329 datasets up until October 2014
      Origin: 271 USA, 166 Germany, 115 United Kingdom, 73 Switzerland, 70 China, 68 Netherlands, 67 France, 55 Canada, 44 Spain, 42 Belgium, 33 Sweden, 31 Australia, 31 Denmark, 31 Japan, 20 India, 20 Norway, 19 Taiwan, 17 Ireland, 16 Austria, 14 Finland, 14 Italy, 12 Republic of Korea, 11 Brazil, 9 Russia, 8 Israel, 7 Singapore, …
      Type: 437 PRIDE complete, 792 PRIDE partial, 63 PeptideAtlas/PASSEL complete, 14 MassIVE, 23 reprocessed
      Publicly accessible: 691 datasets (52% of all): 86% PRIDE, 12% PASSEL, 2% MassIVE
      Top species (studied by at least 10 datasets): 577 Homo sapiens, 165 Mus musculus, 56 Saccharomyces cerevisiae, 53 Arabidopsis thaliana, 29 Rattus norvegicus, 22 Escherichia coli, 17 Bos taurus, 16 Mycobacterium tuberculosis, 13 Oryza sativa, 13 Drosophila melanogaster, 13 Glycine max; ~290 species in total
      Data volume: total ~55 TB; number of all files ~131,000; PXD000320-324: ~5 TB; PXD000065: ~1.4 TB
      Datasets/year: 2012: 102; 2013: 527; 2014: 700
  19. Public data release: when does it happen? • When the author tells us to do it (the authors can do it by themselves). • When we find out that a dataset has been published. • We look for PXD identifiers in PubMed abstracts. • If your PXD identifier is not in the abstract, a paper may have been published while the data is still private. Let us know! • New web form on the PRIDE website to facilitate the process.
  20. ProteomeXchange: 1329 datasets up until October 2014 (statistics repeated from slide 18)
  21. Partial submissions can be used to store other data types • Everything can be stored, not only MS/MS data: a very flexible mechanism for capturing all types of datasets. • PRIDE does not store SRM data (it goes to PASSEL). • Top down proteomics datasets. • Mass Spectrometry Imaging datasets. • Data independent acquisition techniques: e.g. SWATH-MS datasets.
  22. Imaging MS datasets: partial submissions. From original publication [13]: 1. Thermo RAW data / UDP 2. Mirion Software (JLU). Reconstructed ProteomeXchange data: 1. Thermo RAW data / UDP 2. Convert to imzML 3. Upload to PRIDE repository (EBI, Cambridge, UK) 4. Download from PRIDE 5. Display in MSiReader. - Vendor-independent data format - Freely available software (open source) - ‘Open data’: free to reuse - Anybody can do this!
  23. Overview • The ProteomeXchange (PX) consortium • How to submit and access data in PX via PRIDE • How to access PX data • Some HPP related things
  24. ProteomeXchange data workflow (diagram repeated from slide 4). Vizcaíno et al., Nat Biotechnol, 2014
  25. ProteomeCentral: Portal for all PX datasets http://proteomecentral.proteomexchange.org/cgi/GetDataset
  26. Get notified about new PX datasets - Subscribe to the RSS feed to receive information about new datasets: http://groups.google.com/group/proteomexchange/feed/rss_v2_0_msgs.xml [Diagram labels: ProteomeCentral; Researchers]
  27. Overview • The ProteomeXchange (PX) consortium • How to submit and access data in PX via PRIDE • How to access PX data • Some HPP related things
  28. PX submission tool: HPP tags
  29. HPP datasets are now tagged. The projects are now tagged and can be browsed as a group of datasets. Tags for: HPP, C-HPP and B/D-HPP
  30. HPP PX datasets: some numbers. Since January 2014, we have been capturing the PI information. - 25 HPP datasets: 22 C-HPP and 3 B/D-HPP - Countries represented in C-HPP: - 5 Spain - 4 South Korea - 3 Brazil, China. Only a small proportion of the datasets have been made publicly available, at least through ProteomeXchange.
  31. Which are the most accessed datasets?
      PXD Identifier | Hits | Dataset title | Publication
      PXD000561 | 153512 | A draft map of the human proteome | Kim et al., Nature, 2014. PMID: 24870542
      PXD000851 | 111587 | Membrane proteomic analysis of colorectal cancer tissue | Kume et al., MCP, 2014. PMID: 24687888
      PXD000865 | 51639 | Mass spectrometry based draft of the human proteome | Wilhelm et al., Nature, 2014. PMID: 24870543
  32. Which are the most accessed datasets? Total numbers
  33. Conclusions • ProteomeXchange is widely used. • PRIDE contains most of the MS/MS datasets. • It now has a new consortium member: MassIVE (UCSD). • Around half of the datasets are already public. • Different open source tools are available to facilitate the process. • File transfer speed should not be a problem (Aspera support). • Data deposition enables and promotes data reuse. • ProteomeXchange is open to new members.
  34. Acknowledgements. PRIDE Team: Attila Csordas, Rui Wang, Florian Reisinger, Jose A. Dianes, Tobias Ternent, Yasset Perez-Riverol, Noemi del Toro, Henning Hermjakob. PeptideAtlas Team (ISB, Seattle): Eric Deutsch, Terry Farrah, Zhi Sun. Andrew R. Jones, Lennart Martens, Juan Pablo Albar, Martin Eisenacher, Gil Omenn, Nuno Bandeira, and many other PX partners and stakeholders. EU FP7 grant number 260558
  35. Questions?
  36. Connecting different data types. How to connect different data types (genomics, metabolomics, etc.)? It can be used for: - ArrayExpress/GEO identifiers - MetaboLights identifiers - etc.
  37. Pilot project started in the context of ELIXIR [Diagram labels: PRIDE (EMBL-EBI); CSC; BILS; Site B; Site C; ELIXIR; EUDAT CDI; B2SAFE]
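
Slide 19 mentions that PXD identifiers are looked up in PubMed abstracts. A minimal sketch of such a lookup in Python, assuming identifiers consist of `PXD` followed by six digits (as in PXD000764). The function name and the sample abstract are made up for illustration; this is not PRIDE's actual pipeline.

```python
import re

# Assumption: a PXD identifier is "PXD" followed by exactly six digits,
# matching the examples in the slides (PXD000764, PXD000561, ...).
PXD_PATTERN = re.compile(r"\bPXD\d{6}\b")


def find_pxd_identifiers(text):
    """Return the unique PXD identifiers in text, in order of first appearance."""
    found = []
    for identifier in PXD_PATTERN.findall(text):
        if identifier not in found:
            found.append(identifier)
    return found


# Hypothetical abstract text, written only for this example.
abstract = (
    "The mass spectrometry data have been deposited to ProteomeXchange "
    "with the dataset identifier PXD000764."
)
print(find_pxd_identifiers(abstract))
```

An abstract that never mentions the identifier returns an empty list, which is the case where the slide asks authors to get in touch.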

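The figures on slide 18 can be cross-checked with a few lines of arithmetic. The sketch below only copies numbers from that slide; the variable names are illustrative.

```python
# Dataset counts by submission type, copied from slide 18.
datasets_by_type = {
    "PRIDE complete": 437,
    "PRIDE partial": 792,
    "PeptideAtlas/PASSEL complete": 63,
    "MassIVE": 14,
    "reprocessed": 23,
}

# Dataset counts per year, copied from slide 18.
datasets_by_year = {2012: 102, 2013: 527, 2014: 700}

# Number of publicly accessible datasets, copied from slide 18.
public_datasets = 691

total_by_type = sum(datasets_by_type.values())
total_by_year = sum(datasets_by_year.values())
public_percent = round(100 * public_datasets / total_by_type)

print(total_by_type)    # 1329
print(total_by_year)    # 1329
print(public_percent)   # 52
```

Both breakdowns add up to the 1329 datasets in the slide title, and 691 of 1329 rounds to the quoted 52%.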
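
Slides 10 to 12 distinguish complete from partial submissions by the format of the result files. A rough sketch of that rule in Python follows. It assumes a submission counts as complete only when every result file is PRIDE XML or mzIdentML; the format labels and the function name are hypothetical, and the real PX submission tool's validation is likely more involved.

```python
# Assumption: these labels stand for the two result formats that slide 10
# names for complete submissions.
COMPLETE_RESULT_FORMATS = {"PRIDE XML", "mzIdentML"}


def classify_submission(result_formats):
    """Return 'complete' if all result files use a supported format, else 'partial'."""
    formats = set(result_formats)
    # An empty set of result files cannot be a complete submission.
    if formats and formats <= COMPLETE_RESULT_FORMATS:
        return "complete"
    return "partial"


print(classify_submission(["mzIdentML"]))
print(classify_submission(["search engine output"]))
```

Treating anything outside the two supported formats as partial mirrors the slide's description of partial submissions as search engine output stored in its original form.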