1. www.
.uni-rostock.de
Bioinformatics
Introduction to genomics and proteomics I
Ulf Schmitz
ulf.schmitz@informatik.uni-rostock.de
Bioinformatics and Systems Biology Group
www.sbi.informatik.uni-rostock.de
Ulf Schmitz, Introduction to genomics and proteomics I
1
3. Genomics - Definitions
Genetics:
is the science of genes, heredity, and the variation of organisms.
Humans began applying knowledge of genetics in prehistory with
the domestication and breeding of plants and animals.
In modern research, genetics provides tools in the investigation
of the function of a particular gene, e.g. analysis of genetic
interactions.
Genomics:
attempts the study of large-scale genetic patterns across the
genome for a given species. It deals with the systematic use of
genome information to provide answers in biology, medicine, and
industry.
Genomics has the potential of offering new therapeutic methods
for the treatment of some diseases, as well as new diagnostic
methods.
Major tools and methods related to genomics are bioinformatics,
genetic analysis, measurement of gene expression, and
determination of gene function.
4. Genes
• a gene coding for a protein corresponds to a sequence of
nucleotides along one or more regions of a molecule of DNA
• in species with double-stranded DNA (dsDNA), genes may appear
on either strand
• bacterial genes are continuous regions of DNA
bacterium:
• a string of 3N nucleotides encodes a string of N amino acids
• or a string of N nucleotides encodes a structural RNA molecule of N
residues
eukaryote:
• a gene may appear split into separated segments in the DNA
• an exon is a stretch of DNA retained in mRNA that the ribosomes translate
into protein
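The 3N-to-N relationship can be illustrated with a short sketch. It assumes the standard genetic code and a coding sequence that is already in frame; the names `CODON_TABLE` and `translate` are illustrative and not part of the lecture.

```python
# Standard genetic code, built from the conventional TCAG ordering of codons.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a coding DNA string codon by codon ('*' marks a stop codon)."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


coding = "ATGGCCAAGTTT"        # 3N = 12 nucleotides
protein = translate(coding)   # N = 4 amino acids
print(len(coding), len(protein), protein)
```

The toy sequence is made up; a real bacterial gene would end with a stop codon, which this sketch renders as `*`.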
5. Genomics
Genome size comparison
Species                                 Chrom.          Genes            Base pairs
Human (Homo sapiens)                    46 (23 pairs)   28,000-35,000    3.1 billion
Mouse (Mus musculus)                    40              22,500-30,000    2.7 billion
Puffer fish (Fugu rubripes)             44              31,000           365 million
Malaria mosquito (Anopheles gambiae)    6               14,000           289 million
Fruit fly (Drosophila melanogaster)     8               14,000           137 million
Roundworm (C. elegans)                  12              19,000           97 million
Bacterium (E. coli)                     1               5,000            4.1 million
6. Genes
exon:
A section of DNA which carries the coding
sequence for a protein or part of it. Exons
are separated by intervening, non-coding
sequences (called introns). In eukaryotes
most genes consist of a number of exons.
intron:
An intervening section of DNA which occurs
almost exclusively within a eukaryotic gene, but
which is not translated to amino-acid sequences in
the gene product.
The introns are removed from the precursor
mRNA (pre-mRNA) through a process called splicing, which
leaves the exons untouched, to form an active
mRNA.
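As a rough illustration of splicing, the sketch below joins exon segments of a toy pre-mRNA and discards the intron between them. The function name `splice`, the coordinates and the sequence are all made up for illustration; real splicing is carried out by the cell's splicing machinery, not by known coordinates.

```python
def splice(pre_mrna, exons):
    """Join exon segments of a pre-mRNA, dropping everything in between.

    `exons` is a list of (start, end) pairs, 0-based with `end` exclusive,
    given in 5'-to-3' order. The coordinates are assumed to be known already.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)


# Toy transcript: two exons (upper case) separated by one intron (lower case).
pre_mrna = "AUGGCC" + "guaaguuuucag" + "AAGUAA"
exons = [(0, 6), (18, 24)]
mature = splice(pre_mrna, exons)
print(mature)
```

The exon sequences pass through unchanged; only the 12-base intron is removed.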
7. Genes
Examples of the exon:intron mosaic of genes
[Figure: exon:intron mosaics; legend: exon, intron]
Globin gene - 1525 bp: 622 in exons, 893 in introns
Ovalbumin gene - ~7500 bp: 8 short exons comprising 1859 bp
Conalbumin gene - ~10,000 bp: 17 short exons comprising ~2,200 bp
8. Picking out genes in genomes
• Computer programs for genome analysis identify ORFs
(open reading frames)
• An ORF begins with an initiation codon ATG (AUG)
• An ORF is a potential protein-coding region
• There are two approaches to identify protein coding
regions…
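A minimal ORF scan might look like the sketch below. It assumes an ORF runs from an ATG to the first in-frame stop codon (TAA, TAG or TGA) and checks all three frames on both strands; `find_orfs` and its `min_codons` parameter are illustrative names, not a reference to any particular program.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=2):
    """Yield (strand, start, end, sequence) for each ATG...stop ORF.

    `start`/`end` are 0-based offsets on the strand being read and `end` is
    exclusive. `min_codons` counts the codons before the stop codon.
    """
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, start, i + 3, seq[start:i + 3]
                    start = None


for orf in find_orfs("CCATGAAATTTTGACC"):
    print(orf)
```

Real gene finders use a much larger minimum length, since short ORFs arise by chance in any sequence.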
9. Picking out genes in genomes
1. Detection of regions similar to known coding regions from other organisms
• Regions may encode amino acid sequences similar to known proteins
• Or may be similar to ESTs (which correspond to genes known to be
expressed)
• The first few hundred bases of a cDNA are sequenced to identify a gene
2. Ab initio methods seek to identify genes from the properties of the
DNA sequence itself
• Bacterial genes are easy to identify, because they are contiguous
• They have no introns and the space between genes is small
• Identifying exons in higher organisms is one problem, assembling
them another…
10. Picking out genes in genomes
Ab initio gene identification in eukaryotic genomes
• The initial (5´) exon starts with a transcription start
point, preceded by a core promoter site such as the
TATA box (~30 bp upstream), which binds and directs RNA polymerase
to the correct transcriptional start site
– Free of stop codons
– Ends immediately before a GT splice signal
11. Picking out genes in genomes
[Figure: 5' splice signal and 3' splice signal]
12. Picking out genes in genomes
Ab initio gene identification in eukaryotic genomes
• Internal exons are free of stop codons too
– Begin after an AG splice signal
– End before a GT splice signal
13. Picking out genes in genomes
Ab initio gene identification in eukaryotic genomes
• The final (3´) exon starts after an AG splice signal
– Ends with a stop codon (TAA,TAG,TGA)
– Followed by a polyadenylation signal sequence
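The signals listed for internal exons (preceded by AG, followed by GT, free of stop codons) can be combined into a simple candidate test. This is only a sketch of the idea: the function names are hypothetical, the reading frame is assumed unknown so any stop-free frame is accepted, and real ab initio programs score far more evidence than these three conditions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_stop(dna, frame):
    """True if reading `dna` in the given frame (0, 1 or 2) hits a stop codon."""
    return any(dna[i:i + 3] in STOP_CODONS for i in range(frame, len(dna) - 2, 3))


def is_internal_exon_candidate(genome, start, end):
    """Check the three conditions for genome[start:end].

    The region must follow an AG, precede a GT, and be free of stop codons
    in at least one reading frame (the true frame is unknown here).
    """
    if start < 2 or end + 2 > len(genome):
        return False
    after_ag = genome[start - 2:start] == "AG"
    before_gt = genome[end:end + 2] == "GT"
    open_frame = any(not has_stop(genome[start:end], f) for f in range(3))
    return after_ag and before_gt and open_frame


# Toy sequence: intron end (...AG), a 9-base exon, intron start (GT...).
genome = "TTTCAG" + "GCCAAGGAT" + "GTAAGT"
print(is_internal_exon_candidate(genome, 6, 15))
```

Because AG and GT dinucleotides are very common, a test like this produces many false candidates on its own.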
17. Genomics – Prokaryotes
• the genome of a prokaryote comes
as a single double-stranded DNA
molecule in ring form
– on average 2 mm long
– whereas the cell's diameter is only
0.001 mm
– < 5 Mb
• prokaryotic cells can have plasmids
as well (see next slide)
• protein-coding regions have no
introns
• little non-coding DNA compared to
eukaryotes
– in E. coli only 11%
18. Genomics - Plasmids
• Plasmids are circular double stranded DNA molecules that are separate
from the chromosomal DNA.
• They usually occur in bacteria, sometimes in eukaryotic organisms
• Their size varies from 1 to 250 kilo base pairs (kbp). There are from one
copy, for large plasmids, to hundreds of copies of the same plasmid
present in a single cell.
20. Genomics
• DNA of higher organisms is organized into chromosomes
(human – 23 chromosome pairs)
• not all DNA codes for proteins
• on the other hand some genes exist in multiple copies
• that’s why from the genome size you can’t easily estimate
the amount of protein sequence information
21. Genomes of eukaryotes
• majority of the DNA is in the nucleus, separated into
bundles (chromosomes)
– small amounts of DNA appear in organelles (mitochondria and
chloroplasts)
• within single chromosomes gene families are common
– some family members are paralogues (related)
• they have duplicated within the same genome
• often diverged to provide separate functions in descendants
• e.g. human α and β globin
– orthologous genes
• are homologues in different species
• often perform the same function
• e.g. human and horse myoglobin
– pseudogenes
• lost their function
• e.g. in the human globin gene cluster
22. Eukaryotic model organisms
• Saccharomyces cerevisiae (baker’s yeast)
• Caenorhabditis elegans (roundworm)
• Drosophila melanogaster (fruit fly)
• Arabidopsis thaliana (thale cress, a flowering plant)
• Homo sapiens (human)
23. The human genome
• ~3.2 x 10^9 bp (thirty times larger than C. elegans or D. melanogaster)
• coding sequences form only 5% of the human genome
• Repeat sequences make up over 50%
• Only ~32,000 genes
• The human genome is distributed over 22 chromosome pairs plus the X and
Y chromosomes
• Exons of protein-coding genes are relatively small compared to
those in other known eukaryotic genomes
• Introns are relatively long
• Protein-coding genes span long stretches of DNA (dystrophin,
coding for a 3,685 amino acid protein, is >2.4 Mbp long)
• Average gene length: ~8,000 bp
• Average of 5-6 exons/gene
• Average exon length: ~200 bp
• Average intron length: ~2,000 bp
• ~8% of genes have a single exon
• Some exons can be as small as 1 or 3 bp.
24. The human genome
Top categories in a function classification:
Function                            Number      %
Nucleic acid binding                  2207   14.0
DNA binding                           1656   10.5
DNA repair protein                      45    0.2
DNA replication factor                   7    0.0
Transcription factor                   986    6.2
RNA binding                            380    2.4
Structural protein of ribosome         137    0.8
Translation factor                      44    0.2
Transcription factor binding             6    0.0
Cell cycle regulator                    75    0.4
Chaperone                              154    0.9
Motor                                   85    0.5
Actin binding                          129    0.8
Defense/immunity protein               603    3.8
Enzyme                                3242   20.6
Peptidase                              457    2.9
Endopeptidase                          403    2.5
Protein kinase                         839    5.3
Protein phosphatase                    295    1.8
Enzyme activator                         3    0.0
Apoptosis inhibitor                    132    0.8
Signal transduction                   1790   11.4
Receptor                              1318    8.4
Transmembrane receptor                1202    7.6
G-protein-linked receptor              489    3.1
Olfactory receptor                      71    0.0
Storage protein                          7    0.0
Cell adhesion                          189    1.2
Structural protein                     714    4.5
Cytoskeletal structural protein        145    0.9
Transporter                            682    4.3
Ion channel                            269    1.7
Neurotransmitter transporter            19    0.1
Ligand binding or carrier             1536    9.7
Electron transfer                       33    0.2
Cytochrome P450                         50    0.3
Tumor suppressor                         5    0.0
Unclassified                          4813   30.6
Total                                15683  100.0
25. The human genome
• Repeated sequences comprise over 50% of the genome:
– Transposable elements, or interspersed repeats, include LINEs and
SINEs (almost 50%)
– Retroposed pseudogenes
– Simple ‘stutters’ - repeats of short oligomers (minisatellites and
microsatellites)
– Segment duplication, of blocks of ~10-300 kb
– Blocks of tandem repeats, including gene families
Element                                       Size (bp)        Copy number   Fraction of genome (%)
Short Interspersed Nuclear Elements (SINEs)   100-300          1,500,000     13
Long Interspersed Nuclear Elements (LINEs)    6000-8000        850,000       21
Long Terminal Repeats                         15,000-110,000   450,000       8
DNA transposon fossils                        80-3000          300,000       3
26. The human genome
• All people are different, but the DNA of different
people varies by only 0.2% or less.
• So, only up to 2 letters in 1000 are expected to be
different.
• Evidence from current genomics studies (Single
Nucleotide Polymorphisms or SNPs) implies that on
average only 1 letter out of 1400 is different
between individuals.
• This means that 2 to 3 million letters would differ
between individuals.
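The figures above can be checked with a few lines of arithmetic. The sketch assumes the ~3.2 x 10^9 bp genome size quoted earlier in the lecture; the variable names are illustrative.

```python
GENOME_SIZE_BP = 3.2e9  # approximate human genome size quoted in the lecture

upper_bound = GENOME_SIZE_BP * 0.002   # 0.2% of all positions
snp_estimate = GENOME_SIZE_BP / 1400   # one difference per 1400 letters

print(f"0.2% bound:   {upper_bound:,.0f} letters")
print(f"SNP estimate: {snp_estimate:,.0f} letters")
```

One difference per 1400 letters gives roughly 2.3 million differing letters, which falls in the stated range of 2 to 3 million and well below the 0.2% upper bound of about 6.4 million.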
27. Functional Genomics
From gene to function
[Figure labels: Genome, Expressome, Proteome, tertiary structure (fold), Metabolome]
28. DNA makes RNA makes Protein:
Expression data
• More copies of mRNA for a gene lead to more
protein
• mRNA can now be measured for all the genes in a
cell at once through microarray technology
• A single gene chip can have 60,000 spots (genes)
• Color change gives intensity of gene expression
(over- or under-expression)
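One common way to turn spot intensities into over- or under-expression calls is a log2 ratio. The sketch below assumes a two-channel design in which each spot yields a sample and a reference intensity; that design, the gene names and the intensity values are assumptions made for illustration and are not taken from the slides.

```python
import math


def log2_ratio(sample_intensity, reference_intensity):
    """log2 of sample over reference; > 0 means over-expression."""
    return math.log2(sample_intensity / reference_intensity)


# Hypothetical spot intensities (sample, reference) for three made-up genes.
spots = {
    "geneA": (4000.0, 1000.0),
    "geneB": (500.0, 2000.0),
    "geneC": (1200.0, 1200.0),
}

for gene, (sample, reference) in spots.items():
    ratio = log2_ratio(sample, reference)
    label = "over" if ratio > 0 else "under" if ratio < 0 else "unchanged"
    print(gene, round(ratio, 2), label)
```

The logarithm makes a four-fold increase (+2) and a four-fold decrease (-2) symmetric around zero.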
30. Genes and regulatory regions
regulatory mechanisms organize the
expression of genes
– genes may be turned on or off in response to
concentrations of nutrients or to stress
– control regions often lie near the segments
coding for proteins
– they can serve as binding sites for molecules
that transcribe the DNA
– or they bind regulatory molecules that can
block transcription
34. Bioinformatics
Introduction to genomics and proteomics II
ulf.schmitz@informatik.uni-rostock.de
Bioinformatics and Systems Biology Group
www.sbi.informatik.uni-rostock.de
Ulf Schmitz, Introduction to genomics and proteomics II
1
36. Proteomics
Proteomics:
• is the large-scale study of proteins, particularly their structures
and functions
• This term was coined to make an analogy with genomics, and
is often viewed as the "next step",
• but proteomics is much more complicated than genomics.
• Most importantly, while the genome is a rather constant entity,
the proteome is constantly changing through its biochemical
interactions with the genome.
• One organism will have radically different protein expression in
different parts of its body and in different stages of its life cycle.
Proteome:
The entirety of proteins in existence in an organism is
referred to as the proteome.
37. Proteomics
If the genome is a list of the instruments in an orchestra, the
proteome is the orchestra playing a symphony.
R.Simpson
38. Proteomics
• Describing all 3D structures of proteins in the cell is called Structural
Genomics
• Finding out what these proteins do is called Functional Genomics
[Figure labels: DNA Microarray, GENOME, Genetic Screens, PROTEOME, Protein-Protein Interactions, Protein-Ligand Interactions, Structure]
39. Proteomics
Motivation:
• What kind of data would we like to measure?
• What mature experimental techniques exist to
determine them?
• The basic goal is a spatio-temporal description of
the deployment of proteins in the organism.
40. Proteomics
Things to consider:
• the rates of synthesis of different proteins vary among
different tissues and different cell types and states of activity
• methods are available for efficient analysis of transcription
patterns of multiple genes
• because proteins ‘turn over’ at different rates, it is also
necessary to measure proteins directly
• the distribution of expressed protein levels is a kinetic
balance between rates of protein synthesis and degradation
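The kinetic balance in the last point can be made concrete with a toy model. The sketch assumes a constant synthesis rate and first-order degradation, dP/dt = k_syn - k_deg * P, so the steady-state level is k_syn / k_deg; this model, the function names and the rate values are assumptions for illustration and do not come from the slides.

```python
def steady_state_level(synthesis_rate, degradation_rate):
    """Level at which synthesis and first-order degradation balance."""
    return synthesis_rate / degradation_rate


def simulate(synthesis_rate, degradation_rate, hours, dt=0.01, level=0.0):
    """Integrate dP/dt = k_syn - k_deg * P with simple Euler steps."""
    steps = int(round(hours / dt))
    for _ in range(steps):
        level += (synthesis_rate - degradation_rate * level) * dt
    return level


# Two hypothetical proteins made at the same rate but turned over differently.
stable = steady_state_level(100.0, 0.1)     # slow degradation
unstable = steady_state_level(100.0, 0.5)   # fast degradation
print(stable, unstable, round(simulate(100.0, 0.5, 40.0), 3))
```

With identical synthesis rates the two proteins settle at very different levels, which is why transcript measurements alone cannot replace direct protein measurements.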
42. Why do Proteomics?
• are there differences between amino acid sequences determined
directly from proteins and those determined by translation from
DNA?
– pattern recognition programs addressing this question make the following
errors:
• a genuine protein sequence may be missed entirely
• an incomplete protein may be reported
• a gene may be incorrectly spliced
• genes for different proteins may overlap
• genes may be assembled from exons in different ways in different tissues
– often, molecules must be modified to make a mature protein that differs
significantly from the one suggested by translation
• in many cases the missing post-translational modifications are quite
important and have functional significance
• post-translational modifications include addition of ligands, glycosylation,
methylation, excision of peptides, etc.
– in some cases mRNA is edited before translation, creating changes in
the amino acid sequence that are not inferable from the genes
• a protein inferred from a genome sequence is a hypothetical object
until an experiment verifies its existence
43. Post-translational modification
• a protein is a polypeptide chain composed of 20 possible amino acids
• there are far fewer genes that code for proteins in the human genome than there
are proteins in the human proteome (~33,000 genes vs ~200,000 proteins).
• each gene encodes as many as six to eight different proteins
– due to post-translational modifications such as phosphorylation, glycosylation or cleavage
• post-translational modification extends the range of possible functions a protein can
have
– changes may alter the hydrophobicity of a protein and thus determine if the modified
protein is cytosolic or membrane-bound
– modifications like phosphorylation are part of common mechanisms for controlling the
behavior of a protein, for instance, activating or inactivating an enzyme.
44. Post-translational modification
Phosphorylation
• phosphorylation is the addition of a phosphate (PO4) group to a protein
or a small molecule (usually to serine, tyrosine, threonine or histidine)
• In eukaryotes, protein phosphorylation is probably the most important
regulatory event
• Many enzymes and receptors are switched "on" or "off" by
phosphorylation and dephosphorylation
• Phosphorylation is catalyzed by various specific protein kinases,
whereas phosphatases dephosphorylate.
Acetylation
• is the addition of an acetyl group, usually at the N-terminus of the protein
Farnesylation
• is the addition of a farnesyl group
Glycosylation
• is the addition of a glycosyl group to either asparagine, hydroxylysine,
serine, or threonine, resulting in a glycoprotein
46. Key technologies for proteomics
1. 1-D electrophoresis and 2-D electrophoresis
• are for the separation and visualization of proteins.
2. mass spectrometry, x-ray crystallography, and NMR
(nuclear magnetic resonance)
• are used to identify and characterize proteins
3. chromatography techniques, especially affinity
chromatography
• are used to characterize protein-protein interactions.
4. Protein expression systems like the yeast two-hybrid and FRET
(fluorescence resonance energy transfer)
• can also be used to characterize protein-protein interactions.
47. Key technologies for proteomics
High-resolution two-dimensional polyacrylamide gel
electrophoresis (2D PAGE) shows the pattern of
protein content in a sample.
Reference map of lymphoblastoid
cell line PRI, soluble proteins.
• 110 µg of proteins loaded
• Strip 17 cm, pH gradient 4-7, SDS
PAGE gels 20 x 25 cm, 8-18.5% T.
• Staining by silver nitrate method
(Rabilloud et al.)
• Identification by mass spectrometry.
The pink labels on the spots indicate
the ID in the Swiss-Prot database
Browse the SWISS-2DPAGE database for more 2D PAGE images
48. Proteomics
X-ray crystallography is a means to
determine the detailed molecular
structure of a protein, nucleic acid or
small molecule.
With a crystal structure we can explain the
mechanism of an enzyme, the binding of an
inhibitor, the packing of protein domains, the
tertiary structure of a nucleic acid molecule,
etc.
Typically, a sample is purified to
homogeneity, crystallized, and subjected to an
X-ray beam, and diffraction data are collected.
49. High-throughput Biological Data
• Enormous amounts of biological data are being
generated by high-throughput capabilities; even
more are coming
– genomic sequences
– gene expression data (microarrays)
– mass spec. data
– protein-protein interaction (chromatography)
– protein structures (x-ray crystallography)
– ......
50. Protein structural data explosion
Protein Data Bank (PDB): 33,367 structures (1 November 2005)
28,522 x-ray crystallography, 4,845 NMR
51. Maps of hereditary information
The following maps are used to find out how hereditary information is
stored, passed on, and implemented.
1. Linkage maps of
• genes
• mini- / microsatellites
2. Banding patterns of chromosomes
• physical objects with visible landmarks called banding patterns
3. DNA sequences
• Contig maps (contiguous clone maps)
• Sequence tagged sites (STS)
• SNPs (single nucleotide polymorphisms)
53. Maps of hereditary information
Variable number tandem repeats (VNTRs, also minisatellites)
• regions, 8-80bp long, repeated a variable number of times
• the distribution and the size of repeats is the marker
• inheritance of VNTRs can be followed in a family and
mapped to a pathological phenotype
• first genetic data used for personal identification
– Genetic fingerprints; in paternity and in criminal cases
Short tandem repeat polymorphisms (STRPs, also microsatellites)
• Regions of 2-7bp, repeated many times
– Usually 10-30 consecutive copies
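A microsatellite as described above (a 2-7 bp unit repeated at least 10 times in a row) can be located with a regular expression. This is a minimal sketch using Python's standard `re` module; `find_microsatellites` is an illustrative name and the test sequence is made up.

```python
import re

# A 2-7 bp unit followed by at least 9 more copies of itself (10+ in total).
STR_PATTERN = re.compile(r"([ACGT]{2,7}?)\1{9,}")


def find_microsatellites(dna):
    """Return (start, unit, copies) for each run of 10 or more tandem copies."""
    hits = []
    for match in STR_PATTERN.finditer(dna.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


dna = "GGATCC" + "CA" * 12 + "TTGACG"
print(find_microsatellites(dna))
```

Note that a long run of a single base would also be reported, with a two-base unit such as `AA`; a stricter scan would filter those out. The copy number found at a locus is what varies between individuals and serves as the marker.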
55. Maps of hereditary information
Banding patterns of chromosomes
[Figure: chromosome banding patterns]
56. Maps of hereditary information
Banding patterns of chromosomes
[Figure labels: p arm (petit, the short arm), centromere, q arm (queue, the long arm)]
57. Maps of hereditary information
Contig map (also contiguous clone map)
• Series of overlapping DNA clones of known
order along a chromosome from an organism
of interest, stored in yeast or bacterial cells as
YACs (Yeast Artificial Chromosomes) or
BACs (Bacterial Artificial Chromosomes)
• A contig map produces a fine mapping (high
resolution) of a genome
• A YAC can contain up to 10^6 bp, a BAC about
250,000 bp
Sequence tagged site (STS)
• Short, sequenced region of DNA, 200-600 bp
long, that appears in a unique location in the
genome
• One type arises from an EST (expressed
sequence tag), a piece of cDNA
58. Maps of hereditary information
Imagine we know that a disease results from a specific
defective protein:
1. if we know the protein involved, we can pursue
rational approaches to therapy
2. if we know the gene involved, we can devise
tests to identify sufferers or carriers
3. whereas knowledge of the chromosomal
location of the gene is unnecessary in many
cases for either therapy or detection;
• it is required only for identifying the gene, providing a
bridge between the patterns of inheritance and the
DNA sequence
59. Single nucleotide polymorphisms (SNPs)
• A SNP (pronounced ‘snip’) is a genetic
variation between individuals
• single base pairs that can be substituted,
deleted or inserted
• SNPs are distributed throughout the
genome
– on average every 2000 bp
• provide markers for mapping genes
• not all SNPs are linked to diseases
60. Single nucleotide polymorphisms (SNPs)
• nonsense mutations:
– codes for a stop, which can truncate the
protein
• missense mutations:
– codes for a different amino acid
• silent mutations:
– codes for the same amino acid, so has no
effect
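The three categories can be told apart by translating the codon before and after the substitution. The sketch below assumes the standard genetic code and a single-base substitution inside one known codon; `classify_substitution` is an illustrative name.

```python
# Standard genetic code, built from the conventional TCAG ordering of codons.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(codon, position, new_base):
    """Classify a single-base substitution within one codon.

    `position` is 0, 1 or 2 within the codon. Returns 'silent', 'missense'
    or 'nonsense'.
    """
    codon = codon.upper()
    mutated = codon[:position] + new_base.upper() + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if after == before:
        return "silent"
    if after == "*":
        return "nonsense"
    return "missense"


print(classify_substitution("GAA", 2, "G"))  # GAA (Glu) -> GAG (Glu)
print(classify_substitution("GAA", 1, "T"))  # GAA (Glu) -> GTA (Val)
print(classify_substitution("GAA", 0, "T"))  # GAA (Glu) -> TAA (stop)
```

A substitution that turns a stop codon into an amino acid codon is reported as missense here; a fuller classifier would treat that case separately.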
61. Outlook – coming lecture
• Bioinformatics Information Resources And Networks
– EMBnet – European Molecular Biology Network
• DBs and Tools
– NCBI – National Center For Biotechnology Information
• DBs and Tools
– Nucleic Acid Sequence Databases
– Protein Information Resources
– Metabolic Databases
– Mapping Databases
– Databases concerning Mutations
– Literature Databases