Pride cluster presentation

Update to the PRIDE Cluster project
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK

Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride/archive
• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data (identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016

PRIDE Cluster: Initial Motivation
• Provide a QC-filtered, peptide-centric view of PRIDE.
• Data are stored in PRIDE Archive as originally analysed by the submitters (no data reprocessing is done).
• Quality is heterogeneous, which makes the data difficult to compare.
• Enable assessment of (published) proteomics data, a prerequisite for data reuse (e.g. in UniProt).

PRIDE Cluster - Concept
Griss et al., Nat Methods, 2016
[Figure: originally submitted identified spectra (peptides NMMAACDPR and PPECPDFDPPR) are grouped by spectrum clustering; each cluster yields a consensus spectrum.]
Threshold: at least 3 spectra in a cluster and ratio >70%.
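The threshold on this slide can be illustrated with a small Python sketch. It assumes that "ratio" means the fraction of spectra in a cluster that carry the most common peptide identification; the slide does not define the term, so treat that reading, and the function names used here, as assumptions made for illustration.

```python
from collections import Counter

MIN_SPECTRA = 3    # "at least 3 spectra in a cluster"
MIN_RATIO = 0.70   # "ratio >70%"


def most_common_peptide(peptides):
    """Return (peptide, ratio) for the most frequent identification.

    ASSUMPTION: ratio = share of spectra carrying that identification.
    """
    peptide, count = Counter(peptides).most_common(1)[0]
    return peptide, count / len(peptides)


def is_reliable(peptides):
    """Apply both thresholds from the slide to one cluster."""
    if len(peptides) < MIN_SPECTRA:
        return False
    _, ratio = most_common_peptide(peptides)
    return ratio > MIN_RATIO


# Hypothetical clusters built from the two peptides shown in the figure.
cluster_a = ["NMMAACDPR"] * 4 + ["PPECPDFDPPR"]   # 4 of 5 spectra agree
cluster_b = ["NMMAACDPR", "PPECPDFDPPR"]          # fewer than 3 spectra

print(is_reliable(cluster_a), is_reliable(cluster_b))
```

Under this reading, a cluster dominated by one peptide passes, while a small or evenly mixed cluster does not.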

PRIDE Cluster: Second Implementation
• Griss et al., Nat. Methods, 2013
• Clustered all public, identified spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
• EBI farm, LSF
• Griss et al., Nat. Methods, 2016
• Clustered all public spectra in PRIDE by April 2015
• Apache Hadoop
• Starting with 256 M spectra:
• 190 M unidentified spectra (filtered to 111 M spectra that are likely to represent a peptide)
• 66 M identified spectra
• Result: 28 M clusters
• 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores

Parallelizing Spectrum Clustering: Hadoop
• Optimizes work distribution among machines.
• Hadoop is an open-source framework for parallel processing, based on the MapReduce programming model introduced by Google.
• Solves many general issues of large parallel jobs:
• Scheduling
• Inter-job communication
• Failure handling
https://hadoop.apache.org/
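To make the MapReduce idea concrete, here is a minimal pure-Python imitation of the map, shuffle and reduce steps. It does not use the Hadoop API. Binning spectra by precursor m/z, the `BIN_WIDTH` value and the dictionary layout of a spectrum are all assumptions chosen for illustration; the slides do not say how the work was partitioned.

```python
from collections import defaultdict

BIN_WIDTH = 1.0  # hypothetical precursor m/z bin width


def map_phase(spectrum):
    """Emit (key, value) pairs; here the key is a precursor m/z bin."""
    key = int(spectrum["precursor_mz"] // BIN_WIDTH)
    yield key, spectrum


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, spectra):
    """Placeholder reducer: a real job would cluster the spectra in the bin."""
    return key, [s["id"] for s in spectra]


# Hypothetical input records.
spectra = [
    {"id": "s1", "precursor_mz": 500.2},
    {"id": "s2", "precursor_mz": 500.7},
    {"id": "s3", "precursor_mz": 612.3},
]

pairs = [kv for s in spectra for kv in map_phase(s)]
results = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(results)
```

Because each bin is reduced independently, the bins can be processed on different machines, which is the property the framework exploits.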

PRIDE Cluster Home page
http://www.ebi.ac.uk/pride/cluster/#/

PRIDE Cluster: result of searches
http://www.ebi.ac.uk/pride/cluster/#/
A couple of examples …

Examples: one perfect cluster
- 880 PSMs give the same peptide ID
- 4 species
- 28 datasets
- Same instruments

Examples: one perfect cluster (2)

Output of the analysis
• 1. Inconsistent spectrum clusters
• 2. Clusters including identified and unidentified spectra.
• 3. Clusters just containing unidentified spectra.

2. Inferring identifications for originally unidentified spectra
• 9.1 M unidentified spectra were contained in clusters with a reliable
identification.
• These are candidate new identifications (that need to be confirmed), often missed due to search engine settings.
• Example: 49,263 reliable clusters (containing 560,000 identified and
130,000 unidentified spectra) contained phosphorylated peptides,
many of them from non-enriched studies.
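The inference step described on this slide can be sketched as follows: if the identified spectra in a cluster agree strongly enough, the dominant peptide becomes a candidate identification for the unidentified spectra in the same cluster. The function name, the data layout and the choice to apply the "at least 3 spectra, ratio >70%" threshold to the identified spectra only are assumptions for illustration, not the published procedure.

```python
from collections import Counter


def infer_identifications(cluster, min_identified=3, min_ratio=0.70):
    """Propose peptides for unidentified spectra in one cluster.

    cluster: list of (spectrum_id, peptide), with peptide None if unidentified.
    Returns {spectrum_id: candidate_peptide}; empty if the cluster is not
    considered reliable under the ASSUMED thresholds.
    """
    identified = [pep for _, pep in cluster if pep is not None]
    if len(identified) < min_identified:
        return {}
    peptide, count = Counter(identified).most_common(1)[0]
    if count / len(identified) <= min_ratio:
        return {}
    # Candidates only: they still need to be confirmed.
    return {sid: peptide for sid, pep in cluster if pep is None}


# Hypothetical clusters.
consistent = [("a", "PEPTIDEK"), ("b", "PEPTIDEK"), ("c", "PEPTIDEK"),
              ("d", None), ("e", None)]
mixed = [("a", "PEPTIDEK"), ("b", "PEPTIDEK"), ("c", "OTHERK"),
         ("d", "OTHERK"), ("e", None)]

print(infer_identifications(consistent))
print(infer_identifications(mixed))
```

The consistent cluster yields candidates for both unidentified spectra, while the mixed cluster yields none.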

3. Consistently unidentified clusters
• 19 M clusters contain only unidentified spectra.
• 41,155 of these clusters have more than 100 spectra (= 12 M spectra).
• Most of them are likely to be derived from peptides.
• They could correspond to PTMs or variant peptides.
• With various methods, we found likely identifications for about 20%.
• Vast amount of data mining remains to be done.

3. Consistently unidentified clusters

PRIDE Cluster as a Public Data Mining Resource
• http://www.ebi.ac.uk/pride/cluster
• Spectral libraries for 16 species.
• All clustering results, as well as specific subsets of interest, are available.
• Source code (open source) and Java API

Consistently unidentified clusters
• We provide the results split per species in MGF and mzML format.
• We are very interested in people working with these data.
• Available for several species (largest clusters at present).
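For anyone wanting to explore the MGF downloads, a minimal reader for the generic MGF layout (`BEGIN IONS`, `KEY=VALUE` lines, peak lines, `END IONS`) might look like the sketch below. The example record and its field values are invented for illustration; the exact headers used in the PRIDE Cluster files are not described in these slides and may differ.

```python
def read_mgf(lines):
    """Yield one dict per spectrum from an iterable of MGF text lines."""
    spectrum = None
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None and line:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                spectrum["params"][key] = value
            else:
                # Peak line: m/z and intensity separated by whitespace
                mz, intensity = line.split()[:2]
                spectrum["peaks"].append((float(mz), float(intensity)))


# Hypothetical record, not taken from a real PRIDE Cluster file.
example = """BEGIN IONS
TITLE=example_cluster_1
PEPMASS=500.25
CHARGE=2+
100.1 10.0
200.2 20.0
END IONS
"""

parsed = list(read_mgf(example.splitlines()))
print(parsed[0]["params"]["TITLE"], len(parsed[0]["peaks"]))
```

The same function accepts an open file handle, since it only iterates over lines.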

Acknowledgements: People
Attila Csordas
Tobias Ternent
Gerhard Mayer (de.NBI)
Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak
Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob
Acknowledgements: The PRIDE Team
All data submitters !!!

Questions?
