1. PRIDE: Quality control in a proteomics
data repository
Attila Csordas
Proteomics Services Team
Biocuration Conference
April 2nd, 2012
2. Overview
who are we?
what are we dealing with?
manual curation and submission
quick detour: ProteomeXchange
automated curation & submission pipeline
conclusion
3. PRIDE: http://www.ebi.ac.uk/pride
The PRoteomics IDEntifications database is
a centralised, primary, archival, public data
repository for MS/MS proteomics data
containing peptide ids, protein ids, mass
spectra, protein expression values,
metadata.
4. Acknowledgements
colleagues at the PRIDE team
@pride_ebi
pride-ebi@ebi.ac.uk
pride-support@ebi.ac.uk
http://code.google.com/p/pride-toolsuite/
http://code.google.com/p/pride-converter-2/
5. Mass spectrometry
analytical technique measuring the mass-to-charge (m/z) ratio of charged
particles to determine masses of particles, composition of
samples/molecules and chemical structures of molecules
6. Shotgun/bottom-up proteomics
[Figure: bottom-up proteomics protocol. Labels: proteins, peptides, MS analysis, fragmentation, MS/MS analysis, sequence database]
8. Growth of core data types
[Chart: growth of core data types. Values shown: 130 million, 23 million, 4.6 million]
9. Manual curation and submission process
search engine output + spectra, e.g. Mascot (.dat), X!Tandem (.xml) + mgf
PRIDE Converter
PRIDE XML
10. PRIDE Inspector
initial assessment of data quality
visualise/check data
summary charts
support for submitters & reviewers/editors
more flexible than web interface
11. Frequent Data Quality Issues
1. syntactic problems: e.g. <SearchEngine>PeptideShaker</SearchEngine>, <PeptideItem>
2a. core data missing: no protein/peptide identifications
2b. or metadata missing: no species
3. inconsistent/incorrect data: protein modifications
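The first two issue classes above lend themselves to simple automated checks. The following is a minimal sketch only, not the PRIDE validation code: the `PeptideItem` element name comes from the slide, while the `Species` element name and the function `find_quality_issues` are hypothetical and chosen for illustration.

```python
import xml.etree.ElementTree as ET


def find_quality_issues(xml_text):
    """Return a list of issue descriptions for one submission file."""
    # 1. Syntactic problems: the file must at least be well-formed XML.
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"syntactic problem: {err}"]

    issues = []
    # 2a. Core data missing: no peptide identifications at all.
    if root.find(".//PeptideItem") is None:
        issues.append("core data missing: no peptide identifications")
    # 2b. Metadata missing: no species annotation.
    #     "Species" is a hypothetical element name, not the real schema.
    species = root.find(".//Species")
    if species is None or not (species.text or "").strip():
        issues.append("metadata missing: no species")
    return issues
```

A file that is not well-formed is reported as a syntactic problem and not inspected further, since the other checks need a parsed document. Checks for inconsistent or incorrect data (item 3) need domain knowledge and are not attempted here.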
12. Delta m/z of detected peptide precursors
experimental precursor ion m/z - theoretical precursor ion m/z
source of delta m/z outliers: incorrect or missing protein
modifications and charge state misassignments
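The delta m/z definition above can be sketched in a few lines. This is an illustration under stated assumptions, not the PRIDE Inspector implementation: the theoretical precursor m/z is taken as (M + z × proton mass) / z for a peptide of neutral mass M and charge z, and the outlier `tolerance` is a hypothetical parameter that the slides do not specify.

```python
PROTON_MASS = 1.007276  # mass of a proton in Da


def theoretical_mz(neutral_mass, charge):
    """Theoretical precursor m/z for a peptide of given neutral mass and charge."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def delta_mz(experimental_mz, neutral_mass, charge):
    """Experimental precursor ion m/z minus theoretical precursor ion m/z."""
    return experimental_mz - theoretical_mz(neutral_mass, charge)


def outlier_percentage(deltas, tolerance):
    """Percentage of delta m/z values whose magnitude exceeds the tolerance.

    The tolerance is an assumed, user-chosen threshold.
    """
    if not deltas:
        return 0.0
    outliers = sum(1 for d in deltas if abs(d) > tolerance)
    return 100.0 * outliers / len(deltas)


# A peptide of neutral mass 1000.0 Da measured at m/z 501.01:
correct = delta_mz(501.01, 1000.0, charge=2)     # small delta
misassigned = delta_mz(501.01, 1000.0, charge=1)  # large negative delta
```

With the charge state misassigned as 1 instead of 2, the theoretical m/z roughly doubles and the delta becomes a large outlier. A missing modification shifts the assumed neutral mass and has a similar, smaller effect.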
15. but the manual approach does not scale!
16. 10 times as many & big submissions/day?
17. single point of submission of data to the main repositories to encourage data exchange
[Figure: data flow between repositories. Labels: individual submissions, large-scale submissions, PRIDE (EBI), PeptideAtlas, raw files archive, UniProt, other DBs (GPMDB, …), users; data types: published, raw, reprocessed]
18. PX submission pipeline
[Figure: PX submission pipeline. Labels: PX Tool, Validation, Submission, Publication, ProteomeCentral; files: raw files, PRIDE XML files, summary]
19. Automated regular submission pipeline
curation-submission time is ~1/6th of manual time
actionable curation summary:
number of files: 3
Project: Combined personal saliva proteome and microbioproteome
XML generator software: PRIDE Converter Toolsuite 2.0-SNAPSHOT
Filename: 22143.xml
Size: 3.3 GB
Species: Homo sapiens
#Proteins: 4128
#Peptides: 60544
#Spectra: 184209
#Unid-d spectra: 123665
PTMs: 3
% delta m/z outlier spectra: 0.0
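One way such a summary row could be held and sanity-checked in code is sketched below. The class `FileSummary` and its field names are hypothetical; only the numbers come from the summary above. The derived quantity (total spectra minus unidentified spectra) is an assumption about how the columns relate, although for this row it happens to equal the reported #Peptides.

```python
from dataclasses import dataclass


@dataclass
class FileSummary:
    """One row of a curation summary (hypothetical structure)."""
    filename: str
    proteins: int
    peptides: int
    spectra: int
    unidentified_spectra: int

    @property
    def identified_spectra(self):
        # Assumed relation: every spectrum is either identified or not.
        return self.spectra - self.unidentified_spectra

    @property
    def identified_fraction(self):
        # Share of spectra that led to an identification.
        if self.spectra == 0:
            return 0.0
        return self.identified_spectra / self.spectra


row = FileSummary("22143.xml", proteins=4128, peptides=60544,
                  spectra=184209, unidentified_spectra=123665)
print(row.identified_spectra)  # 60544
```

A curator could flag files where the identified fraction is unusually low, or where the counts are mutually inconsistent, before looking at the data in detail.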
20. Conclusion
growing amount of data
increasingly complex data
scalability issues
overcoming them by automation
and new, smarter curation strategies