PRIDE Cluster presentation
1. Update to the PRIDE Cluster project
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK
2. Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data (identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride/archive
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
3. PRIDE Cluster: Initial Motivation
• Provide a QC-filtered, peptide-centric view of PRIDE.
• Data are stored in PRIDE Archive as originally analysed by the submitters (no data reprocessing is done).
• Quality is heterogeneous, which makes the data difficult to compare.
• Enable assessment of (published) proteomics data, a prerequisite for data reuse (e.g. in UniProt).
4. PRIDE Cluster - Concept
Griss et al., Nat Methods, 2016
[Figure: originally submitted identified spectra (example peptide labels NMMAACDPR and PPECPDFDPPR) are grouped by spectrum clustering, and each cluster is represented by a consensus spectrum.]
• Threshold: at least 3 spectra in a cluster and ratio >70%.
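The threshold on this slide can be illustrated with a small Python sketch. It assumes that "ratio" means the fraction of spectra in a cluster that carry the most common peptide identification; the slide does not spell this out. The function names and the example clusters are hypothetical and are not part of the PRIDE Cluster code base.

```python
from collections import Counter


def max_identification_ratio(peptides):
    """Return (most common peptide, its fraction) for a non-empty list of peptide labels."""
    counts = Counter(peptides)
    peptide, n = counts.most_common(1)[0]
    return peptide, n / len(peptides)


def is_reliable(peptides, min_spectra=3, min_ratio=0.70):
    """Apply the slide's threshold: at least 3 spectra and ratio > 70%.

    Assumption: the ratio is computed over the peptide labels passed in.
    """
    if len(peptides) < min_spectra:
        return False
    _, ratio = max_identification_ratio(peptides)
    return ratio > min_ratio


# Illustrative clusters built from the two peptide labels shown in the figure.
mostly_consistent = ["NMMAACDPR", "NMMAACDPR", "NMMAACDPR", "PPECPDFDPPR"]
too_mixed = ["NMMAACDPR", "NMMAACDPR", "PPECPDFDPPR"]

print(is_reliable(mostly_consistent))  # ratio 3/4
print(is_reliable(too_mixed))          # ratio 2/3
```

Under these assumptions the first cluster passes (ratio 0.75) and the second does not (ratio about 0.67), even though both have three or more spectra.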
5. PRIDE Cluster: Second Implementation
First implementation:
• Griss et al., Nat. Methods, 2013
• Clustered all public, identified spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
Second implementation:
• Griss et al., Nat. Methods, 2016
• Clustered all public spectra in PRIDE by April 2015
• Apache Hadoop
• Starting with 256 M spectra:
• 190 M unidentified spectra (filtered to 111 M spectra that are likely to represent a peptide)
• 66 M identified spectra
• Result: 28 M clusters
• 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
6. Parallelizing Spectrum Clustering: Hadoop
• Hadoop is an open-source framework for parallel processing based on the MapReduce programming model introduced by Google.
• Optimizes work distribution among machines.
• Solves many general issues of large parallel jobs:
• Scheduling
• Inter-job communication
• Failure handling
https://hadoop.apache.org/
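To make the MapReduce idea concrete, here is a minimal plain-Python sketch of the map, shuffle and reduce phases. It does not use the Hadoop API, and the choice of key (a precursor m/z bin) is only an assumption made for illustration; the slides do not state how the clustering work is partitioned. The record fields `id` and `precursor_mz` are hypothetical.

```python
from collections import defaultdict


def map_spectrum(spectrum, bin_width=1.0):
    """Map step: emit a (key, value) pair; the key is an assumed precursor m/z bin."""
    key = int(spectrum["precursor_mz"] // bin_width)
    return key, spectrum["id"]


def shuffle(pairs):
    """Shuffle step: group all values that share a key (a framework does this for you)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_bin(key, spectrum_ids):
    """Reduce step: placeholder that returns the sorted members of one bin.

    A real job would compare and cluster the spectra within the bin here.
    """
    return key, sorted(spectrum_ids)


def run_job(spectra, bin_width=1.0):
    pairs = [map_spectrum(s, bin_width) for s in spectra]
    return dict(reduce_bin(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input records.
spectra = [
    {"id": "s1", "precursor_mz": 500.2},
    {"id": "s2", "precursor_mz": 500.7},
    {"id": "s3", "precursor_mz": 742.4},
]
print(run_job(spectra))
```

Because each bin is reduced independently, the bins can be processed on different machines; scheduling, grouping by key and recovery from failed tasks are what a framework such as Hadoop takes over.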
9. Examples: one perfect cluster
- 880 PSMs give the same peptide ID
- 4 species
- 28 datasets
- Same instruments
11. Output of the analysis
• 1. Inconsistent spectrum clusters
• 2. Clusters including identified and unidentified spectra.
• 3. Clusters just containing unidentified spectra.
13. Output 2: Inferring identifications for originally unidentified spectra
• 9.1 M unidentified spectra were contained in clusters with a reliable identification.
• These are candidate new identifications (that need to be confirmed), often missed due to search engine settings.
• Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
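The inference step described here can be sketched as follows. This is an illustration under stated assumptions, not the published method: cluster size is counted over all member spectra, the ratio is computed over the identified members only, and the thresholds (at least 3 spectra, ratio above 70%) are taken from the concept slide. The function name and the example data are hypothetical.

```python
from collections import Counter


def candidate_identifications(cluster, min_spectra=3, min_ratio=0.70):
    """Propose peptide IDs for unidentified spectra in a reliable cluster.

    cluster: list of (spectrum_id, peptide) tuples; peptide is None when the
    spectrum was not identified by the original submitter.
    Returns {spectrum_id: proposed_peptide}; empty if the cluster is not reliable.
    """
    if len(cluster) < min_spectra:
        return {}
    identified = [peptide for _, peptide in cluster if peptide is not None]
    if not identified:
        return {}
    peptide, count = Counter(identified).most_common(1)[0]
    if count / len(identified) <= min_ratio:
        return {}
    return {sid: peptide for sid, pep in cluster if pep is None}


# Hypothetical cluster: three consistent identifications and one unidentified spectrum.
cluster = [
    ("a", "ELVISLIVESK"),
    ("b", "ELVISLIVESK"),
    ("c", "ELVISLIVESK"),
    ("d", None),
]
print(candidate_identifications(cluster))
```

The returned identifications are candidates only; as the slide notes, they still need to be confirmed.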
15. Output 3: Consistently unidentified clusters
• 19 M clusters contain only unidentified spectra.
• 41,155 of these clusters contain more than 100 spectra each (= 12 M spectra in total).
• Most of them are likely to be derived from peptides.
• They could correspond to PTMs or variant peptides.
• With various methods, we found likely identifications for about 20%.
• A vast amount of data mining remains to be done.
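Selecting the large, consistently unidentified clusters amounts to a simple filter. The sketch below assumes a hypothetical in-memory representation (a dict from cluster ID to a list of peptide labels, with `None` for unidentified spectra); it is not the format of the actual clustering results.

```python
def large_unidentified_clusters(clusters, min_size=100):
    """Return {cluster_id: size} for clusters that contain only unidentified
    spectra and more than `min_size` spectra."""
    return {
        cid: len(members)
        for cid, members in clusters.items()
        if len(members) > min_size and all(p is None for p in members)
    }


def summarize(selected):
    """Return (number of clusters, total number of spectra)."""
    return len(selected), sum(selected.values())


# Toy data: only c1 and c4 are both large enough and fully unidentified.
clusters = {
    "c1": [None] * 150,
    "c2": [None] * 20,
    "c3": [None] * 149 + ["ELVISLIVESK"],
    "c4": [None] * 101,
}
selected = large_unidentified_clusters(clusters)
print(summarize(selected))
```

Applied to the real results, the same two numbers would correspond to the cluster count and spectrum total quoted on the slide.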
18. PRIDE Cluster as a Public Data Mining Resource
• http://www.ebi.ac.uk/pride/cluster
• Spectral libraries for 16 species.
• All clustering results, as well as specific subsets of interest, are available.
• Source code (open source) and Java API
19. Consistently unidentified clusters
• We provide the results split per species in MGF and mzML format.
• We are very interested in getting people to work with these data.
• Available for several species (at present, only the largest clusters).
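For readers who want to start working with the MGF files, a minimal reader for the general MGF layout (`BEGIN IONS` / `KEY=value` lines / peak lines / `END IONS`) might look like the sketch below. Which parameters the PRIDE Cluster files actually contain is not stated on the slide, so the example record (title, precursor mass, charge, peaks) is hypothetical.

```python
def parse_mgf(lines):
    """Parse MGF text lines into a list of {"params": {...}, "peaks": [(mz, intensity), ...]}."""
    spectra = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Parameter line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                # Peak line: m/z and intensity (any further columns are ignored)
                fields = line.split()
                if len(fields) >= 2:
                    current["peaks"].append((float(fields[0]), float(fields[1])))
    return spectra


# Hypothetical record, for illustration only.
example = """BEGIN IONS
TITLE=cluster-1 (hypothetical)
PEPMASS=500.25
CHARGE=2+
100.1 250.0
200.2 1000.0
END IONS
""".splitlines()

spectra = parse_mgf(example)
print(len(spectra), spectra[0]["params"]["PEPMASS"], spectra[0]["peaks"])
```

To read a downloaded file, pass an open file handle instead of the in-memory lines, since iterating over a text file yields its lines.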
21. Acknowledgements: People
Attila Csordas
Tobias Ternent
Gerhard Mayer (de.NBI)
Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak
Former team members,
especially Rui Wang, Florian
Reisinger, Noemi del Toro, Jose
A. Dianes & Henning Hermjakob
Acknowledgements: The PRIDE Team
All data submitters!