The document discusses various search engines and algorithms used for peptide identification from mass spectrometry data. It describes common pre-processing steps like noise thresholding, charge deconvolution, and centroiding. Popular search engines like SEQUEST, Mascot, X!Tandem, and OMSSA are explained. They use different scoring systems like cross-correlation, hyperscore, and E-value to match experimental spectra to theoretical spectra from a database. Comparative studies show these search engines can identify different but also overlapping sets of peptides. Combining results from multiple engines increases identification rates.
2. search engines
Lennart Martens
lennart.martens@ugent.be
Computational Omics and Systems Biology Group
Department of Medical Protein Research, VIB
Department of Biochemistry, Ghent University
Ghent, Belgium
Lennart Martens
lennart.martens@ebi.ac.uk
Proteomics Services Group
European Bioinformatics Institute
Hinxton, Cambridge, United Kingdom
www.ebi.ac.uk
Lennart Martens
BITS MS Data Processing – Search Engines
lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
3. THREE TYPICAL PRE-PROCESSING STEPS
4. Noise thresholding
[Figure: two MS/MS spectra with the precursor peak marked, one after global thresholding and one after local thresholding]
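The difference between global and local thresholding can be sketched as follows. This is a minimal illustration, not the method of any particular tool: the 10% cut-off, the 100 m/z window and the function names are assumptions chosen for the example.

```python
def global_threshold(peaks, fraction=0.1):
    """Keep peaks at or above `fraction` of the most intense peak in the whole spectrum."""
    if not peaks:
        return []
    cutoff = fraction * max(intensity for _, intensity in peaks)
    return [(mz, intensity) for mz, intensity in peaks if intensity >= cutoff]


def local_threshold(peaks, window=100.0, fraction=0.1):
    """Apply the same rule separately inside consecutive m/z windows."""
    buckets = {}
    for mz, intensity in peaks:
        buckets.setdefault(int(mz // window), []).append((mz, intensity))
    kept = []
    for key in sorted(buckets):
        kept.extend(global_threshold(buckets[key], fraction))
    return kept


# Hypothetical spectrum as (m/z, intensity) pairs; the 1000.0 peak dominates.
spectrum = [(150.0, 2.0), (180.0, 40.0), (420.0, 1000.0), (460.0, 30.0)]
print(global_threshold(spectrum))
print(local_threshold(spectrum))
```

With a single dominant peak, the global rule removes everything else, while the local rule still keeps the strongest peak of each window.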
5. Charge deconvolution (peptides)
From: http://www.purdue.edu/dp/bioscience/images/spectrum.jpg
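For peptides, the charge state can be estimated from the spacing between isotope peaks (roughly 1/z on the m/z axis) and the peak can then be converted to a neutral mass. The sketch below assumes positive-mode protonated ions; the constants are standard values, and the function names and the maximum charge of 6 are assumptions made for the example.

```python
PROTON = 1.007276           # mass of a proton in Da
ISOTOPE_SPACING = 1.003355  # approximate 13C - 12C mass difference in Da


def charge_from_spacing(mz_a, mz_b, max_charge=6):
    """Pick the charge whose expected isotope spacing best matches the observed one."""
    spacing = abs(mz_b - mz_a)
    return min(range(1, max_charge + 1),
               key=lambda z: abs(spacing - ISOTOPE_SPACING / z))


def neutral_mass(mz, z):
    """Neutral mass of an ion observed at `mz` carrying `z` protons."""
    return z * (mz - PROTON)


def singly_charged_mz(mz, z):
    """m/z the same ion would have with a single proton."""
    return neutral_mass(mz, z) + PROTON


z = charge_from_spacing(500.0, 500.5)
print(z, neutral_mass(500.0, z), singly_charged_mz(500.0, z))
```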
6. Charge deconvolution (proteins)
From: Gill et al, EMBO Journal, 2000
7. Centroiding (peak picking)
[Figure: isotope envelope with the monoisotopic mass and the average mass each marked by an x]
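One common way to centroid a profile peak is to take the intensity-weighted mean of its data points. The sketch below assumes the points belonging to one peak have already been isolated; the function name and the example values are hypothetical.

```python
def centroid(profile):
    """Collapse one profile peak, given as (m/z, intensity) points, into a single stick.

    Returns the intensity-weighted mean m/z and the summed intensity.
    """
    total = sum(intensity for _, intensity in profile)
    if total == 0:
        raise ValueError("profile peak has no intensity")
    weighted = sum(mz * intensity for mz, intensity in profile)
    return weighted / total, total


# Hypothetical symmetric profile peak around m/z 500.0.
mz, intensity = centroid([(499.9, 10.0), (500.0, 80.0), (500.1, 10.0)])
print(mz, intensity)
```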
8. Combined results
A total ion current chromatogram, corrected by
typical pre-processing steps.
From: Last et al, Nature Rev. Mol. Cell Bio., 2007
9. Data size reduction
[Bar chart: file size (MB, axis 0 to 60) by data type (RAW, RAW GZIPped, peak lists, peak lists GZIPped) for two instruments, Q-TOF II and Esquire HCT. Bar labels in extraction order: 51.4, 24.5, 25.8, 23.7 for the large bars and 0.7, 0.2, 0.3, 0.1 for the small bars.]
See: Martens et al., Proteomics, 2005
10. MS/MS IDENTIFICATION
PEPTIDE FRAGMENTATION FINGERPRINTING
11. Peptide sequences and MS/MS spectra
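[Figure: schematic MS/MS spectrum (intensity versus m/z) of the peptide LENNART, with peaks labelled by its fragments: L, LE, LEN, LENN, LENNA, LENNAR, LENNART from one end and T, RT, ART, NART, NNART, ENNART, LENNART from the other]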
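The fragment ladder for LENNART can be written out with a few lines of code. This sketch assumes the prefix fragments are singly charged b-type ions and the suffix fragments singly charged y-type ions (an interpretation, since the slide does not name the ion types), and it uses standard monoisotopic residue masses for the six residues involved.

```python
# Monoisotopic residue masses in Da (only the residues needed for LENNART).
RESIDUE_MASS = {
    "L": 113.08406, "E": 129.04259, "N": 114.04293,
    "A": 71.03711, "R": 156.10111, "T": 101.04768,
}
PROTON = 1.007276
WATER = 18.010565


def fragment_ladder(peptide):
    """Return (prefix, b m/z, suffix, y m/z) for every backbone cleavage site."""
    rows = []
    for k in range(1, len(peptide)):
        prefix, suffix = peptide[:k], peptide[k:]
        b_mz = sum(RESIDUE_MASS[aa] for aa in prefix) + PROTON
        y_mz = sum(RESIDUE_MASS[aa] for aa in suffix) + WATER + PROTON
        rows.append((prefix, b_mz, suffix, y_mz))
    return rows


ladder = fragment_ladder("LENNART")
for prefix, b_mz, suffix, y_mz in ladder:
    print(f"{prefix:<7} {b_mz:9.4f}   {suffix:<7} {y_mz:9.4f}")
```

The full-length peptide, which the slide shows at the end of both series, is left out here because it is not produced by a backbone cleavage.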
12. Peptide fragment fingerprinting (PFF)
protein sequence database → in silico digest → peptide sequences → in silico MS/MS → theoretical MS/MS spectra
Example peptide sequences, each with a theoretical spectrum (Int versus m/z): YSFVATAER, HETSINGK, MILQEESTVYYR, SEFASTPINK, …
experimental MS/MS spectrum → in silico matching against the theoretical spectra → peptide scores
1) YSFVATAER 34
2) YSFVSAIR 12
3) FFLIGGGGK 12
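The in silico digest step can be sketched as below. The slide does not name the enzyme; trypsin (cleave after K or R, but not before P) is assumed here because the example peptides all end in K or R. The protein string is a hypothetical concatenation of the slide's peptides, used only to exercise the function.

```python
def tryptic_digest(protein, missed_cleavages=0):
    """Cleave after K or R unless the next residue is P; allow some missed cleavages."""
    if not protein:
        return []
    sites = [0]
    for i in range(len(protein) - 1):
        if protein[i] in "KR" and protein[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(protein))
    peptides = []
    for start in range(len(sites) - 1):
        last = min(start + 2 + missed_cleavages, len(sites))
        for end in range(start + 1, last):
            peptides.append(protein[sites[start]:sites[end]])
    return peptides


# Hypothetical protein built from the peptides shown on the slide.
protein = "YSFVATAER" + "HETSINGK" + "MILQEESTVYYR" + "SEFASTPINK"
peptides = tryptic_digest(protein)
print(peptides)
```

Each resulting peptide would then be turned into a theoretical spectrum and scored against the experimental one.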
13. Three types of PFF identification
Spectral comparison: database sequence → theoretical spectrum, which is compared with the experimental spectrum
Sequential comparison: experimental spectrum → de novo sequence, which is compared with the database sequence
Threading comparison: the database sequence is threaded onto the experimental spectrum
From: Eidhammer, Flikka, Martens, Mikalsen – Wiley 2007
14. The most popular algorithms
• MASCOT (Matrix Science)
http://www.matrixscience.com
• SEQUEST (Scripps, Thermo Fisher Scientific)
http://fields.scripps.edu/sequest
• X!Tandem (The Global Proteome Machine Organization)
http://www.thegpm.org/TANDEM
• OMSSA (NCBI)
http://pubchem.ncbi.nlm.nih.gov/omssa/
15. Overall concept of scores and cut-offs
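[Figure: overlapping score distributions of incorrect and correct identifications, separated by a threshold score. Correct identifications scoring below the threshold are false negatives; incorrect identifications scoring above it are false positives.]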
Adapted from: www.proteomesoftware.com – Wiki pages
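The trade-off behind a threshold score can be made concrete by counting outcomes at different cut-offs. The sketch assumes the correct answer is known for every identification, which in practice it is not; the scores, labels and function name are hypothetical.

```python
def outcomes_at_threshold(scored, threshold):
    """Count outcomes for (score, is_correct) pairs; scores >= threshold are accepted."""
    counts = {"true_positives": 0, "false_positives": 0,
              "false_negatives": 0, "true_negatives": 0}
    for score, is_correct in scored:
        accepted = score >= threshold
        if accepted and is_correct:
            counts["true_positives"] += 1
        elif accepted:
            counts["false_positives"] += 1
        elif is_correct:
            counts["false_negatives"] += 1
        else:
            counts["true_negatives"] += 1
    return counts


# Hypothetical identifications with known correctness.
scored = [(10, False), (20, False), (25, True), (35, False), (40, True), (60, True)]
print(outcomes_at_threshold(scored, 30))
print(outcomes_at_threshold(scored, 50))
```

Raising the threshold from 30 to 50 removes the false positive in this toy set, but turns one more correct identification into a false negative.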
16. Playing with probabilistic cut-off scores
[Chart: identifications and false positives at increasingly stringent cut-offs (p=0.05, p=0.01, p=0.005, p=0.0005), with two vertical axes running from 0% to 6% and from 0% to 100%; stringency increases towards p=0.0005]
17. SEQUEST
• Very well established search engine
• Can be used for MS/MS (PFF) identifications
• Based on a cross-correlation score (includes peak height)
• Published core algorithm (patented, licensed to Thermo), Eng, JASMS 1994
• Provides preliminary (Sp) score, rank, cross-correlation score (XCorr),
and score difference between the top two ranks (deltaCn, ∆Cn)
• Thresholding is up to the user, and is commonly done per charge state
• Many extensions exist to perform a more automatic validation of results
$$R_\tau = \sum_{i=1}^{n} x_i \cdot y_{(i+\tau)}$$
$$\mathrm{XCorr} = R_0 - \frac{1}{151} \sum_{\tau=-75}^{+75} R_\tau$$
$$\mathrm{deltaCn} = \frac{\mathrm{XCorr}_1 - \mathrm{XCorr}_2}{\mathrm{XCorr}_1}$$
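The two scores on this slide can be sketched in plain Python. XCorr is taken here as the correlation at zero shift minus the mean correlation over shifts from -75 to +75 bins, and deltaCn as the relative difference between the two best XCorr values. The sketch assumes both spectra are already binned and normalised into equal-length intensity lists; that preprocessing, the function names and the one-peak test spectrum are assumptions, and this is not SEQUEST's actual implementation.

```python
def correlation_at_shift(x, y, tau):
    """R_tau = sum over i of x[i] * y[i + tau], ignoring bins that fall outside y."""
    n = len(x)
    return sum(x[i] * y[i + tau] for i in range(n) if 0 <= i + tau < n)


def xcorr(x, y, max_shift=75):
    """Correlation at zero shift minus the mean correlation over all shifts."""
    if len(x) != len(y):
        raise ValueError("spectra must be binned to the same length")
    shifts = range(-max_shift, max_shift + 1)
    background = sum(correlation_at_shift(x, y, t) for t in shifts) / len(shifts)
    return correlation_at_shift(x, y, 0) - background


def delta_cn(xcorr_top, xcorr_second):
    """Relative score difference between the best and second-best candidate."""
    return (xcorr_top - xcorr_second) / xcorr_top


# Hypothetical binned spectrum with a single peak, compared with itself.
spectrum = [0.0] * 200
spectrum[100] = 1.0
print(xcorr(spectrum, spectrum))
print(delta_cn(4.0, 3.0))
```

Subtracting the background term penalises spectra that correlate about as well at arbitrary offsets as they do when properly aligned.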
18. SEQUEST: some additional pictures
From: MacCoss et al., Anal. Chem. 2002
From: Peng et al., J. Prot. Res., 2002
19. Mascot
• Very well established search engine, Perkins, Electrophoresis 1999
• Can do MS (PMF) and MS/MS (PFF) identifications
• Based on the MOWSE score
• Unpublished core algorithm (trade secret)
• Predicts an a priori threshold score that identifications need to pass
• From version 2.2, Mascot allows integrated decoy searches
• Provides rank, score, threshold and expectation value per identification
• Customizable confidence level for the threshold score
20. Mascot: some additional pictures
[Chart: average identity threshold (axis 0 to 40) versus log10(number of AA) (axis 6.50 to 8.50), with linear fit y = 8.3761x - 34.089, R² = 0.9985]
[Chart: identifications and false positives at p=0.05, p=0.01, p=0.005, p=0.0005, with two vertical axes running from 0% to 6% and from 0% to 100%]
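The linear fit on the first chart can be evaluated directly to see how the average identity threshold grows with search space size. The slope and intercept are copied from the slide; reading "number of AA" as the number of amino acids in the searched database is an assumption, and the fit should only be trusted inside the plotted range (log10 values of 6.5 to 8.5).

```python
import math

SLOPE = 8.3761       # from the fit shown on the slide
INTERCEPT = -34.089  # from the fit shown on the slide


def fitted_identity_threshold(n_amino_acids):
    """Average identity threshold predicted by the slide's linear fit."""
    return SLOPE * math.log10(n_amino_acids) + INTERCEPT


for n in (1e7, 1e8):
    print(f"{n:.0e} amino acids: threshold of about {fitted_identity_threshold(n):.1f}")
```

A tenfold larger search space raises the fitted threshold by the slope, that is by roughly 8.4 score units.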
21. X!Tandem
• A successful open source search engine, Craig and Beavis, RCMS 2003
• Can be used for MS/MS (PFF) identifications
• Based on a hyperscore (P_i is either 0 or 1):
$$\mathrm{HyperScore} = \left(\sum_{i=0}^{n} I_i \cdot P_i\right) \cdot N_b! \cdot N_y!$$
• Relies on a hypergeometric distribution (hence hyperscore)
• Published core algorithm, and is freely available
• Provides hyperscore and expectancy score (the discriminating one)
• X!Tandem is fast and can handle modifications in an iterative fashion
• Has rapidly gained popularity as an (auxiliary) search engine
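The hyperscore formula translates almost literally into code. In this sketch I_i is the intensity of peak i, P_i is 1 if the peak matches a predicted fragment and 0 otherwise, and N_b and N_y are read as the numbers of matched b and y ions; that reading of N_b and N_y, the function name and the example values are assumptions.

```python
from math import factorial


def hyperscore(intensities, matched, n_b, n_y):
    """Sum of matched peak intensities multiplied by N_b! and N_y!."""
    if len(intensities) != len(matched):
        raise ValueError("intensities and matched flags must have the same length")
    matched_intensity = sum(i * p for i, p in zip(intensities, matched))
    return matched_intensity * factorial(n_b) * factorial(n_y)


# Hypothetical spectrum: three peaks, of which the first and third are matched.
print(hyperscore([10, 20, 30], [1, 0, 1], n_b=2, n_y=1))
```

The factorial terms reward candidates that match many ions, so the score rises steeply with the number of matches.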
23. A note on how the scores differ
            Accuracy Score    Relative Score
SEQUEST     XCorr             DeltaCn
X!Tandem    HyperScore        E-Value
Adapted from: Brian Searle, ProteomeSoftware
24. OMSSA
• A successful open source search engine, Geer, JPR 2004
• Can be used for MS/MS (PFF) identifications
• Relies on a Poisson distribution
• Published core algorithm, and is freely available
• Provides an expectancy score, similar to the BLAST E-value
• OMSSA was recently upgraded to take peak intensity into account
• Got really good marks in a recently published comparative study
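How a Poisson model leads to an expectancy score can be illustrated with a simplified calculation: if random candidates match on average mu peaks, the chance of matching at least k peaks is a Poisson tail probability, and multiplying it by the number of candidates gives an E-value in the BLAST sense. This is a textbook-style sketch, not OMSSA's actual scoring; the function names and the numbers are assumptions.

```python
from math import exp, factorial


def poisson_tail(k, mu):
    """P(X >= k) for X following a Poisson distribution with mean mu."""
    return 1.0 - sum(exp(-mu) * mu ** j / factorial(j) for j in range(k))


def expectation_value(k, mu, n_candidates):
    """Expected number of candidates matching at least k peaks by chance."""
    return n_candidates * poisson_tail(k, mu)


# Hypothetical search: random matches average 2 peaks, 1000 candidate peptides.
for k in (3, 6, 9):
    print(k, expectation_value(k, mu=2.0, n_candidates=1000))
```

Computing the tail as one minus a sum loses precision for very small probabilities, so a real implementation would sum the upper tail directly.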
25. OMSSA: some additional pictures
[Left panel] Yeast lysate spectrum, m/z matches of fragment peak matches versus all NCBI nr sequence library. Poisson distribution fitted.
[Right panel] Validation of the Poisson distribution model: mean number of modelled and measured matching peaks (against the NCBI nr database) for two mass tolerances.
Adapted from: Geer et al., J. Prot. Res., 2004
26. COMPARATIVE STUDIES
27. Kapp et al., Proteomics, 2005
28. Balgley et al., Mol. Cell. Proteomics, 2007
1.6x more?!
29. Combining the output of search algorithms
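[Four-way Venn diagram of the identifications made by each search engine]
Search engine    Total    Unique to this engine
Mascot           3229     212 (+4.2%)
SEQUEST          3792     486 (+9.6%)
ProteinSolver    3203     329 (+6.5%)
Phenyx           3186     380 (+7.5%)
Central region of the diagram: 1776
Remaining overlap regions (their assignment is not recoverable from the extraction): 179, 168, 40, 501, 348, 139, 96, 195, 77, 146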
Figure courtesy of Dr. Christian Stephan, Medizinisches Proteom-Center,
Ruhr-Universität Bochum; Human Brain Proteome Project
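Combining the output of several engines amounts to set operations on the identified peptides. The sketch below reports, per engine, the total, the number found by that engine alone, and that number as a percentage of the union; the percentages on the slide appear consistent with this definition (for example, 212 out of the 5072 identifications in the diagram is about 4.2%). Engine names and peptide identifiers in the example are hypothetical.

```python
def combine(results):
    """Summarise identifications from several engines.

    `results` maps an engine name to the set of peptides it identified.
    Returns the union, the peptides found by every engine, and per engine
    a tuple (total, unique to that engine, unique as % of the union).
    """
    union = set().union(*results.values())
    shared_by_all = set.intersection(*results.values()) if results else set()
    summary = {}
    for engine, ids in results.items():
        others = set().union(*(v for k, v in results.items() if k != engine))
        unique = ids - others
        percent = 100.0 * len(unique) / len(union) if union else 0.0
        summary[engine] = (len(ids), len(unique), percent)
    return union, shared_by_all, summary


# Hypothetical results from three engines.
results = {
    "engine_a": {"p1", "p2", "p3"},
    "engine_b": {"p2", "p3", "p4"},
    "engine_c": {"p3", "p5"},
}
union, shared_by_all, summary = combine(results)
print(len(union), shared_by_all, summary)
```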
30. SEQUENTIAL COMPARISON ALGORITHMS
31. Sequence tags
[Figure: MS/MS spectrum with a sequence tag marked]
The concept of sequence tags was introduced by Mann and Wilm
(Mann and Wilm, Anal. Chem. 1994, 66: 4390-4399).
Image from: Matthias Wilm, EMBL Heidelberg, Germany
http://www.narrador.embl-heidelberg.de/GroupPages/PageLink/activities/SeqTag.html
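A sequence tag can be read from a short ladder of peaks whose mass differences match amino acid residue masses, and is reported together with the masses flanking it. The sketch below is a deliberately simple illustration: it only follows adjacent peaks, assumes singly charged fragments, and uses a small subset of standard monoisotopic residue masses (isoleucine is omitted because it is isobaric with leucine). The function name, the tolerance and the example peaks are assumptions.

```python
# Monoisotopic residue masses in Da (subset for illustration).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "E": 129.04259, "R": 156.10111,
}


def read_tag(peaks, tolerance=0.02):
    """Read residues from consecutive peak differences until one fails to match.

    Returns (m/z where the tag starts, tag sequence, m/z where the tag ends).
    """
    mzs = sorted(peaks)
    if not mzs:
        raise ValueError("no peaks given")
    tag = []
    for low, high in zip(mzs, mzs[1:]):
        diff = high - low
        residue = next((aa for aa, mass in RESIDUE_MASS.items()
                        if abs(diff - mass) <= tolerance), None)
        if residue is None:
            break
        tag.append(residue)
    return mzs[0], "".join(tag), mzs[len(tag)]


# Hypothetical ladder: a start peak followed by A, R and T mass steps.
peaks = [300.0, 371.03711, 527.13822, 628.1859]
print(read_tag(peaks))
```

The tag and its two flanking masses can then be used to search a sequence database, which is far more selective than the tag sequence alone.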
32. GutenTag, DirecTag, TagRecon
• Tabb, Anal. Chem. 2003, Tabb, JPR 2008, Dasari, JPR 2010
• Recent implementations of the sequence tag approach
• Refine hits by peak mapping in a second stage to resolve ambiguities
• Rely on an empirical fragmentation model
• Published core algorithms, DirecTag and TagRecon freely available
• Most useful to retrieve unexpected peptides (modifications, variations)
• Entire workflows exist (e.g., combination with IDPicker)
33. GutenTag: some additional pictures
From: Tabb et al., Anal. Chem., 2003
34. De novo compared to sequence tags
Example of a manual de novo of an MS/MS spectrum
No more database necessary to extract a sequence!
Algorithms    References
Lutefisk      Dancik 1999, Taylor 2000
Sherenga      Fernandez-de-Cossio 2000
PEAKS         Ma 2003, Zhang 2004
PepNovo       Frank 2005, Grossmann 2005
…             …
35. Thank you!
Questions?