Proteomics is the large-scale study of proteins. It has become an important field due to developments in mass spectrometry and genomics. However, proteomics generates large amounts of complex data that requires bioinformatics analysis. The history of proteomics includes early pioneers in protein sequencing and mass spectrometry techniques. Current areas of focus include biomarker discovery, structural biology, and integrating proteomics with other omics data through systems biology approaches.
Introduction to the Proteomics Bioinformatics Course 2017
1. Proteomics: History and introduction to
the course
Dr. Juan Antonio Vizcaíno
EMBL-EBI
Hinxton, Cambridge, UK
2. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2017
Hinxton, 16 July 2017
Data resources at EMBL-EBI
• Genes, genomes & variation: Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal, European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive
• Gene & protein expression: ArrayExpress, Expression Atlas, PRIDE
• Protein sequences, families & motifs: InterPro, Pfam, UniProt
• Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
• Chemical biology: ChEMBL, ChEBI
• Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
• Systems: BioModels, Enzyme Portal, BioSamples
• Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
3. Juan A. Vizcaíno
• Useful definitions and concepts to start
• A little bit of history… and curiosities
• Importance of bioinformatics
Overview
4. Juan A. Vizcaíno
Proteomics is the large-scale study of proteins, particularly
their structures and functions
The proteome is the entire complement of proteins
including the modifications made to a particular set of
proteins, produced by an organism or system. This will vary
with time and distinct requirements, or stresses, that a cell
or organism undergoes
proteome = ‘protein’ + ‘genome’ (M. Wilkins, 1994)
Definitions
5. Juan A. Vizcaíno
Genomics
Transcriptomics
Proteomics
From the genome to the proteome
6. Juan A. Vizcaíno
Genome vs. proteome
•Genome
• Essentially static over time
• Non location specific
• Human genome mapped
(first draft in 2000)
• ~20,000 genes
• PCR is available to amplify
DNA
•Proteome
• Dynamic over time
• Location specific
• Human proteome not yet mapped:
• How many???
• No equivalent of PCR for
proteins
7. Juan A. Vizcaíno
• Large increase in protein diversity due to:
• Alternative splicing of pre-mRNA (introns and exons)
• Post-translational modifications of proteins
• Cell age and health/disease state
Genome -> Proteome
8. Juan A. Vizcaíno
20 naturally occurring
amino acids
Chirality
L-aa
Amino acids
9. Juan A. Vizcaíno
From: Molecular Biology of the Cell (4th Ed)
http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=mboc4&part=A388&rendertype=figure&id=A391
Individual amino acids
polypeptide
Peptide bond
Protein backbone
10. Juan A. Vizcaíno
• Useful definitions and concepts to start
• A little bit of history… and curiosities
• Importance of bioinformatics
Overview
11. Juan A. Vizcaíno
Sanger's principal conclusion was that the two polypeptide chains of the
protein insulin had precise amino acid sequences and, by extension, that
every protein had a unique sequence.
Nobel Prize in Chemistry in 1958
F. Sanger
Protein sequencing: the pioneers
12. Juan A. Vizcaíno
F. Sanger
In 1977, he introduced the “dideoxy”
method for sequencing DNA molecules,
also known as the Sanger method. He also
sequenced the first complete DNA genome,
that of bacteriophage ΦX174
Nobel Prize in Chemistry in 1980
Not only protein sequencing…
13. Juan A. Vizcaíno
MS is an analytical technique that measures the mass-to-charge (m/z)
ratio of charged particles. It is used for determining masses of particles,
for the determination of the elemental composition of a sample or
molecule, and for elucidating the chemical structures of molecules, such as
peptides and other chemical compounds.
Many applications…
one of them is proteomics
Mass spectrometry (MS)
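As a rough illustration of the m/z concept above, the sketch below computes the mass-to-charge ratio of a protonated peptide from its sequence. It is a minimal example, not part of the original slides: the function names are hypothetical, the masses are standard monoisotopic values rounded to five decimals, and it assumes an unmodified peptide whose charge comes only from added protons.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # one water per peptide (free N- and C-termini)
PROTON = 1.00728   # mass of the charge carrier


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide (hypothetical helper)."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def mz(sequence, charge):
    """m/z of the peptide carrying `charge` protons (assumes positive-ion mode)."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (peptide_mass(sequence) + charge * PROTON) / charge


if __name__ == "__main__":
    # The same peptide appears at a lower m/z when it carries more charges.
    for z in (1, 2, 3):
        print(z, round(mz("PEPTIDE", z), 4))
```

Because the instrument measures m/z rather than mass, the same peptide shows up at different positions in the spectrum depending on its charge state.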
14. Juan A. Vizcaíno
P. V. Edman
In 1950, he developed the Edman degradation
method.
A major drawback of this technique is that the peptides
being sequenced cannot be longer than around 30
residues.
Protein sequencing: the pioneers
15. Juan A. Vizcaíno
Wolfgang Paul / Hans G. Dehmelt developed the ion trap
technique (1950s and 1960s).
Nobel Prize in Physics (1989)
A commercial quadrupole ion trap
(Finnigan MAT) was introduced in 1983.
The ion trap quickly became the primary
instrument for conducting proteomics
because of its ability to conduct tandem
MS (MS/MS) analysis of complex mixtures
of peptides, generated by enzymatic
digestion of proteome samples such as cell
lysates.
History of Mass spectrometry
16. Juan A. Vizcaíno
John B. Fenn (Yale University) and co-workers used
electrospray ionization (ESI) to ionize biomolecules
(high-molecular-weight proteins).
Koichi Tanaka (Shimadzu Corp) used the “ultra fine metal
plus liquid matrix method” to ionize intact proteins (Soft
Laser Desorption): “With the proper combination of laser
wavelength and matrix, a protein can be ionized”.
Fenn and Tanaka: Nobel Prize in Chemistry (2002)
Ionization methods were too energetic to be used with biological molecules
F. Hillenkamp & M. Karas developed the MALDI technique:
use of organic matrices to obtain MS of large proteins
Mass spectrometry: Soft ionization methods
17. Juan A. Vizcaíno
Patrick H. O’Farrell
J. Klose
1D SDS gel
MW
MW
pI
2D SDS gel
2D gel image from: http://www.fixingproteomics.org/
Gel electrophoresis
18. Juan A. Vizcaíno
The rapid development of genomics allowed the
development of proteomics
Shotgun proteomics: a method of identifying
proteins in complex mixtures
HPLC
MS
[Figure: two mass spectra; x-axis: m/z (100 to 2100), y-axis: % (0 to 100)]
There are only 20 amino acids.
The physico-chemical properties of peptides are
more homogeneous and ‘manageable’ than those
of proteins
From protein centric to peptide centric
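The move from proteins to peptides relies on enzymatic digestion. The sketch below illustrates an in-silico digest under stated assumptions that are not on the slide: the enzyme is trypsin, modelled with the common rule "cleave after K or R unless the next residue is P", and the function name and `missed_cleavages` parameter are hypothetical.

```python
def tryptic_digest(protein, missed_cleavages=0):
    """Return peptides from a simplified in-silico tryptic digest.

    Assumed rule: cleave C-terminal to K or R, except when followed by P.
    `missed_cleavages` allows up to that many skipped cleavage sites per peptide.
    """
    if not protein:
        return []

    # Step 1: split the sequence at every cleavage site.
    fragments = []
    start = 0
    for i, residue in enumerate(protein):
        is_last = i == len(protein) - 1
        if residue in "KR" and not is_last and protein[i + 1] != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    fragments.append(protein[start:])

    # Step 2: join neighbouring fragments to model missed cleavages.
    peptides = []
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


if __name__ == "__main__":
    print(tryptic_digest("AAKRPGGRCC"))
    print(tryptic_digest("AAKRPGGRCC", missed_cleavages=1))
```

Search engines build their candidate peptide lists from a protein sequence database in a broadly similar way, although real tools also handle modifications, length limits and other enzymes.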
19. Juan A. Vizcaíno
Mass Spectrometry (MS)-based proteomics
• Many different workflows.
• Discovery mode:
• Bottom-up proteomics
• Data dependent acquisition (DDA)
• Data independent acquisition (DIA)
• Top down proteomics (intact proteins)
• Targeted mode:
• SRM/MRM (Selected Reaction
Monitoring/Multiple Reaction Monitoring)
20. Juan A. Vizcaíno
Not only identify, but also quantify the
amount of each protein in the sample
The current methods rely mainly on MS:
Vaudel et al., Proteomics 2010 Feb;10(4):650-670
Proteomics becomes quantitative
21. Juan A. Vizcaíno
The yeast two-hybrid method was developed by S. Fields in 1989.
Many more methods have been developed since then:
- Affinity electrophoresis
- Co-immunoprecipitation
- Tandem affinity purification (TAP)
Protein-protein interactions: yeast two-hybrid
22. Juan A. Vizcaíno
Proteomics in a clinical environment
• Biomarker discovery is a very active field of research.
• MS technology is slowly being incorporated into clinical practice.
• Used to identify microorganisms by
MALDI MS profiling.
• Approved in Europe. In August
2013 it became the first MS
diagnostic tool approved in the US.
J Rohn (2013) Nat Biotechnol, 31, 862
23. Juan A. Vizcaíno
http://thehpp.org/
The Human Proteome Project (HPP)
24. Juan A. Vizcaíno
Proteomics for structural biology
• Increased focus in recent years (a lot more to come).
• MS/MS cross-linking approaches
• HD-exchange mass spectrometry
Lössl et al., EMBO J, 2016
25. Juan A. Vizcaíno
• Useful definitions and concepts to start
• A little bit of history… and curiosities
• Importance of bioinformatics
Overview
26. Juan A. Vizcaíno
Atlas
what happens
where
Need for bioinformatics
Biology is changing:
• High-throughput
• More data produced
• New types of data
• Emphasis on systems biology
Bioinformatics enables new
applications:
• molecular medicine
• agriculture
• food
• environmental sciences
27. Juan A. Vizcaíno
On 21 July 1986, SWISS-PROT was created by
A. Bairoch (it contained around 3,900 protein
sequences)
In 1979, the first software was developed for 2DE image analysis (ELSIE)
Bioinformatics is very much needed in proteomics
29. Juan A. Vizcaíno
Mallick & Kuster, Nat. Biotechnol. 2010 Jul;28(7):695-709
Proteomics is a complex discipline
30. Juan A. Vizcaíno
MS based proteomics
Hein et al., Handbook of Systems Biology, 2012
31. Juan A. Vizcaíno
Genomics
Transcriptomics
Proteomics
More multi-omics studies…
Metabolomics
The slide shows the core data resources at EMBL-EBI and the range of data you can access through them.
This is important for the 3D view of the proteins
Sanger's first triumph was to determine the complete amino acid sequence of the two polypeptide chains of bovine insulin, A and B, in 1952 and 1951, respectively. Prior to this it was widely assumed that proteins were somewhat amorphous. In determining these sequences, Sanger proved that proteins have a defined chemical composition.
For this purpose he used the "Sanger Reagent", fluorodinitrobenzene (FDNB), to react with the exposed amino groups in the protein and in particular with the N-terminal amino group at one end of the polypeptide chain. He then partially hydrolysed the insulin into short peptides (either with hydrochloric acid or using an enzyme such as trypsin). The mixture of peptides was fractionated in two dimensions on a sheet of filter paper: first by electrophoresis in one dimension and then, perpendicular to that, by chromatography in the other. The different peptide fragments of insulin, detected with ninhydrin, moved to different positions on the paper, creating a distinct pattern which Sanger called "fingerprints".
The peptide from the N-terminus could be recognised by the yellow colour imparted by the FDNB label, and the identity of the labelled amino acid at the end of the peptide was determined by complete acid hydrolysis and identification of the resulting dinitrophenyl-amino acid. By repeating this type of procedure Sanger was able to determine the sequences of the many peptides generated using different methods for the initial partial hydrolysis. These could then be assembled into the longer sequences to deduce the complete structure of insulin. Sanger's principal conclusion was that the two polypeptide chains of the protein insulin had precise amino acid sequences and, by extension, that every protein had a unique sequence.
In 1958 he was awarded a Nobel prize in chemistry "for his work on the structure of proteins, especially that of insulin".
In 1980, Walter Gilbert and Sanger shared half of the chemistry prize "for their contributions concerning the determination of base sequences in nucleic acids".
Multiple Nobel Awardees:
At the time of this course (2017), four people had received two Nobel Prizes. Maria Skłodowska-Curie received the Physics Prize in 1903 for the discovery of radioactivity and the Chemistry Prize in 1911 for the isolation of pure radium. Linus Pauling won the 1954 Chemistry Prize for his research into the chemical bond and its application to the structure of complex substances. Pauling also won the Peace Prize in 1962 for his anti-nuclear activism, making him the only winner of two unshared prizes. John Bardeen received the Physics Prize twice: in 1956 for the invention of the transistor and in 1972 for the theory of superconductivity. Frederick Sanger received the Chemistry Prize twice: in 1958 for determining the structure of the insulin molecule and in 1980 for inventing a method of determining base sequences in DNA.
Phenylisothiocyanate is reacted with an uncharged terminal amino group, under mildly alkaline conditions, to form a cyclical phenylthiocarbamoyl derivative. Then, under acidic conditions, this derivative of the terminal amino acid is cleaved as a thiazolinone derivative. The thiazolinone amino acid is then selectively extracted into an organic solvent and treated with acid to form the more stable phenylthiohydantoin (PTH)-amino acid derivative, which can be identified by chromatography or electrophoresis. The procedure can then be repeated to identify the next amino acid.
A major drawback of this technique is that the peptides being sequenced cannot have more than 50 to 60 residues (and in practice, under 30). The peptide length is limited because the cyclical derivatization does not always go to completion. This problem can be worked around by cleaving large peptides into smaller peptides before proceeding with the reaction. Modern machines, capable of over 99% efficiency per amino acid, can accurately sequence up to 30 amino acids. An advantage of the Edman degradation is that it only uses 10 to 100 picomoles of peptide for the sequencing process. The Edman degradation reaction is automated to speed up the process.
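The length limit follows from how per-cycle losses compound. The sketch below is a simplified model rather than a description of any real sequencer: it assumes every cycle has the same efficiency (0.99 is taken from the "over 99%" figure above) and that the fraction of peptide molecules still in phase after n cycles is simply the efficiency raised to the power n. The function name is hypothetical.

```python
def in_phase_fraction(cycles, efficiency=0.99):
    """Fraction of molecules still in phase after `cycles` Edman cycles.

    Simplified assumption: each cycle succeeds independently with the
    same probability `efficiency`.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return efficiency ** cycles


if __name__ == "__main__":
    for n in (10, 30, 60):
        print(n, round(in_phase_fraction(n), 3))
```

Under these assumptions roughly three quarters of the molecules are still in phase after 30 cycles, but only a little over half after 60, which is consistent with reads becoming unreliable for longer peptides.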
Ion traps are almost ubiquitous in analytical laboratories worldwide and serve as both GC and LC downstream MS detectors.
Fenn had a big fight with Yale University because he did not want to retire. In fact he started the studies that led to the Nobel Prize when he was 70.
He joined the Yale University faculty in 1962. In 1987, he reached the mandatory retirement age. Fighting age discrimination and a University-mandated move to smaller laboratory space, Fenn remained at Yale and was 70 years old before he began work on what would in time become his Nobel Prize-winning discovery.
K. Tanaka is so far the only person without a postgraduate degree to win a Nobel Prize in a scientific discipline.
However, his award drew some criticism: some argued that the contribution of two German scientists, Franz Hillenkamp and Michael Karas, was too significant to be overlooked, and that they should also have been included as prize winners.
The premise behind the test is the activation of downstream reporter gene(s) by the binding of a transcription factor onto an upstream activating sequence (UAS). For two-hybrid screening, the transcription factor is split into two separate fragments, called the binding domain (BD) and activating domain (AD). The BD is the domain responsible for binding to the UAS and the AD is the domain responsible for the activation of transcription.
Overview of two-hybrid assay, checking for interactions between two proteins, called here Bait and Prey.
A. The Gal4 transcription factor gene produces a two-domain protein (BD and AD), which is essential for transcription of the reporter gene (LacZ).
B,C. Two fusion proteins are prepared: Gal4BD+Bait and Gal4AD+Prey. Neither is usually sufficient on its own to initiate transcription of the reporter gene.
D. When both fusion proteins are produced and the Bait part of the first interacts with the Prey part of the second, transcription of the reporter gene occurs.
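The logic of panels A to D can be sketched as a small boolean model. This is only an illustration of the reasoning, not a simulation of the assay: the `Fusion` class, the `reporter_active` function and the way interactions are encoded are all hypothetical, and the model ignores false positives and auto-activation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fusion:
    domain: str   # "BD" (binding domain) or "AD" (activating domain)
    partner: str  # name of the protein fused to that domain


def reporter_active(fusions, interactions):
    """Return True if a BD fusion and an AD fusion are bridged by an interaction.

    `interactions` is a set of frozensets, each holding two interacting
    protein names (an assumed, simplified encoding).
    """
    baits = [f.partner for f in fusions if f.domain == "BD"]
    preys = [f.partner for f in fusions if f.domain == "AD"]
    return any(frozenset((b, p)) in interactions for b in baits for p in preys)


if __name__ == "__main__":
    known = {frozenset(("Bait", "Prey"))}
    bd_bait = Fusion("BD", "Bait")
    ad_prey = Fusion("AD", "Prey")
    print(reporter_active([bd_bait], known))           # panel B: BD fusion alone
    print(reporter_active([ad_prey], known))           # panel C: AD fusion alone
    print(reporter_active([bd_bait, ad_prey], known))  # panel D: both present
```

The reporter is switched on only when both fusion proteins are present and their Bait and Prey parts interact, which is the readout the assay uses to infer an interaction.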