PRIDE: Quality control in a proteomics
data repository
Attila Csordas
Proteomics Services Team
Biocuration Conference
April 2nd, 2012



1/23
Overview

              who are we?

             what are we dealing with?

              manual curation and submission

              quick detour: ProteomeXchange

              automated curation & submission pipeline

              conclusion


2/23
PRIDE: http://www.ebi.ac.uk/pride
       The PRoteomics IDEntifications (PRIDE) database is a centralised,
       primary, archival, public data repository for MS/MS proteomics data.
       It contains peptide and protein identifications, mass spectra,
       protein expression values and metadata.




3/23
Acknowledgements
                 colleagues at the PRIDE team




                             @pride_ebi

                         pride-ebi@ebi.ac.uk
                         pride-support@ebi.ac.uk


       http://code.google.com/p/pride-toolsuite/
       http://code.google.com/p/pride-converter-2/


4/23
Mass spectrometry
An analytical technique that measures the mass-to-charge (m/z) ratio of
charged particles to determine the masses of particles, the composition of
samples/molecules and the chemical structures of molecules




5/23
Shotgun/bottom-up proteomics

[Diagram, labelled PROTOCOL: proteins, peptides, MS analysis, fragmentation, MS/MS analysis, sequence database]



 6/23
What is a PRIDE submission?




7/23
Growth of core data types

[Chart; labelled values: 130 million, 23 million, 4.6 million]




  8/23
Manual curation and submission process
[Diagram: search engine output + spectra, e.g. Mascot (.dat), X!Tandem (.xml) + mgf -> PRIDE Converter -> PRIDE XML]




9/23
PRIDE Inspector

initial assessment
on data quality

visualise/check data

summary charts

support for submitters &
reviewers/editors

more flexible than web
interface




  10/23
Frequent Data Quality Issues

   1. syntactic problems                <SearchEngine>PeptideShaker</SearchEngine>, <PeptideItem>

   2a. core data missing                no protein/peptide identifications

   2b. or metadata missing              no species

   3. inconsistent/incorrect data       protein modifications
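
The first two categories of issue lend themselves to automated checks. A minimal Python sketch, assuming a simplified XML layout: the `PeptideItem` tag appears on the slide, while the `Experiment` root, the `Species` element and the function name `check_submission` are hypothetical and do not reflect the real PRIDE XML schema.

```python
import xml.etree.ElementTree as ET


def check_submission(xml_text):
    """Return a list of quality issues found in a (simplified) submission."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # 1. syntactic problems: the file is not well-formed XML
        return [f"syntactic problem: {err}"]

    issues = []
    # 2a. core data missing: no peptide identifications at all
    if root.find(".//PeptideItem") is None:
        issues.append("core data missing: no peptide identifications")
    # 2b. metadata missing: hypothetical <Species> element absent or empty
    species = root.find(".//Species")
    if species is None or not (species.text or "").strip():
        issues.append("metadata missing: no species")
    return issues


good = "<Experiment><Species>Homo sapiens</Species><PeptideItem/></Experiment>"
broken = "<Experiment><PeptideItem></Experiment>"  # unclosed tag
print(check_submission(good))
print(check_submission(broken))
```

Category 3 (inconsistent or incorrect data) needs domain-specific checks rather than structural ones, such as the delta m/z analysis.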




11/23
Delta m/z of detected peptide precursors


experimental precursor ion m/z - theoretical precursor ion m/z




   source of delta m/z outliers: incorrect or missing protein
   modifications and charge state misassignments
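
As a worked illustration of the definition above, the following Python sketch computes delta m/z and the share of outliers. The 0.5 m/z tolerance, the example peptide mass and the function names are assumptions for illustration, not values from the talk. It also shows why a missing modification produces an outlier: an unannotated mass shift of D appears as a delta m/z of D/z at charge z.

```python
PROTON_MASS = 1.007276  # mass of a proton in Da
OXIDATION = 15.994915   # monoisotopic mass shift of an oxidation in Da


def theoretical_mz(neutral_mass, charge):
    """m/z of a peptide with the given neutral mass, protonated `charge` times."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def delta_mz(experimental_mz, neutral_mass, charge):
    """Experimental precursor m/z minus theoretical precursor m/z."""
    return experimental_mz - theoretical_mz(neutral_mass, charge)


def outlier_percentage(deltas, tolerance=0.5):
    """Percentage of delta m/z values whose magnitude exceeds the tolerance."""
    if not deltas:
        return 0.0
    n_outliers = sum(1 for d in deltas if abs(d) > tolerance)
    return 100.0 * n_outliers / len(deltas)


# A doubly charged peptide that is oxidised, but reported without the modification:
neutral = 1000.0
observed = theoretical_mz(neutral + OXIDATION, 2)
shift = delta_mz(observed, neutral, 2)  # equals OXIDATION / 2
print(round(shift, 4))
print(outlier_percentage([0.01, -0.02, shift, 0.0]))
```

Annotating the missing modification adds its mass to the theoretical value and brings the delta back towards zero.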




 12/23
Fixing modifications based on delta m/z outliers




13/23
Fixing modifications based on delta m/z outliers




14/23
but the manual approach does not scale!




15/23
10 times as many, and bigger, submissions per day?




16/23
single point of submission of data to the main repositories to encourage data exchange

[Diagram labels: Published, Raw, Reprocessed; Individual submissions, Large-scale submissions; EBI PRIDE, Raw files archive, PeptideAtlas, UniProt, Other DBs (GPMDB, …); Users]



17/23
PX submission pipeline




[Diagram: PX Tool -> Validation -> Submission -> Publication -> ProteomeCentral. Files: raw files, PRIDE XML, summary]



18/23
Automated regular submission pipeline
         curation and submission take ~1/6 of the manual time

                            actionable curation summary

  Number of files: 3
  Project: Combined personal saliva proteome and microbioproteome
  XML generator software: PRIDE Converter Toolsuite 2.0-SNAPSHOT

  Filename   Size    Species       #Proteins  #Peptides  #Spectra  #Unidentified spectra  PTMs  % delta m/z outliers
  22143.xml  3.3 GB  Homo sapiens  4128       60544      184209    123665                 3     0.0
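
The summary row above can be modelled as a small record. A sketch in Python, assuming that the number of unidentified spectra is the total number of spectra minus the number of identified ones (the slide's figures are consistent with this: 184209 - 60544 = 123665). The class name, field names, action messages and the 5% outlier threshold are hypothetical and are not the pipeline's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class FileSummary:
    filename: str
    species: str
    proteins: int
    peptides: int
    spectra: int
    ptms: int
    delta_mz_outlier_pct: float

    @property
    def unidentified_spectra(self):
        # Assumption: each peptide identification accounts for one spectrum.
        return self.spectra - self.peptides


def curation_actions(summary, max_outlier_pct=5.0):
    """Turn a summary row into a list of follow-up actions for the curator."""
    actions = []
    if not summary.species:
        actions.append("ask submitter for species")
    if summary.proteins == 0 or summary.peptides == 0:
        actions.append("no identifications: check conversion")
    if summary.delta_mz_outlier_pct > max_outlier_pct:
        actions.append("check modifications and charge states")
    return actions


# Values taken from the summary row on the slide.
row = FileSummary("22143.xml", "Homo sapiens", 4128, 60544, 184209, 3, 0.0)
print(row.unidentified_spectra)
print(curation_actions(row))
```

An empty action list means the file passes these checks; anything else gives the curator something concrete to follow up with the submitter.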




 19/23
Conclusion

                growing amount of data


                increasingly complex data


                scalability issues


              overcoming them by automation
              and new, smarter curation strategies




20/23
21/23
Thanks for the attention!




22/23
Q&A
        acsordas@ebi.ac.uk          @attilacsordas

23/23