4. The Human Genome Project decided to sequence smaller
genomes as a warm-up for the human genome.
Resulted in sequencing:
Many bacteria
Model organism genomes
Yeast, C. elegans, Arabidopsis, Drosophila
Crop plants – Rice
Comparison of these genome sequences
provided basis for field of “Comparative
Genomics” S S Jena
5. What is comparative genomics?
Analyzing & comparing
genetic material from
different species to study
evolution, gene function,
and inherited disease.
To understand what makes
each species unique.
6. Comparative genomics prior to obtaining full
genome sequence
Genome size
• Compared DNA content among species.
(DAPI staining, FACS)
Single copy and repetitive DNA
• Used hybridization kinetics. (Cot curve)
• Found amount of repetitive DNA differed
greatly among species.
7. What is compared?
Gene location
Gene structure
Exon number
Exon lengths
Intron lengths
Sequence similarity
Gene characteristics
Splice sites
Codon usage
Conserved synteny
8. What can homology tell us?
1. Identity of genes:
- The identity of the gene in another organism
- The identity of nearby genes
- The function of the gene (if annotated)
2. Suggestions of how the gene might be
causing disease
3. Infer ancestral relationships
4. Discover principles of evolution
9. Negative selection: The removal of
deleterious mutations from a population; also
referred to as purifying selection.
Positive selection: The retention of
mutations that benefit an organism; also
referred to as Darwinian selection.
Homologs: Features (including DNA and
protein sequences) in species being
compared that are similar because they are
ancestrally related.
10. Orthologs and Paralogs
• When comparing sequences from different
genomes, we must distinguish between two
types of closely related sequences:
Orthologs are genes in two species that
descend from a single gene in their
common ancestor.
Paralogs are genes found in the same
species that were created through gene
duplication events.
11. Orthologs and Paralogs
[Figure: gene tree. A and B are orthologs; A’ and A’’ are paralogs, as are B’ and B’’.]
13. Synteny
Regions of two genomes that show
considerable similarity in terms of sequence
and conservation of the order of genes.
Genes that are in the same relative position
on two different chromosomes.
Closely related species generally have
similar order of genes on chromosomes.
Synteny can be used to identify genes in one
species based on map-position in another.
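The idea of conserved gene order can be made concrete with a small sketch. It assumes each chromosome is given as an ordered list of gene names in which orthologs share a name, and it only reports runs that are strictly adjacent and in the same orientation in both lists. The function name and the `min_genes` threshold are illustrative choices; real synteny tools also tolerate gaps and inversions.

```python
def collinear_blocks(order_a, order_b, min_genes=3):
    """Return runs of genes that are consecutive and in the same order in both lists."""
    position_b = {gene: index for index, gene in enumerate(order_b)}
    blocks = []
    current = []
    for gene in order_a:
        extends_run = (
            gene in position_b
            and current
            and position_b[gene] == position_b[current[-1]] + 1
        )
        if extends_run:
            current.append(gene)
            continue
        # The run is broken: keep it if it is long enough, then start over.
        if len(current) >= min_genes:
            blocks.append(current)
        current = [gene] if gene in position_b else []
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    species_1 = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
    species_2 = ["g5", "g6", "g7", "x", "g1", "g2", "g3"]
    print(collinear_blocks(species_1, species_2))
```

In this toy input the two chromosomes share two blocks whose internal order is conserved even though the blocks themselves have been rearranged.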
14. Synteny among crop genomes:
rice, maize and wheat
Maize: evolved from tetraploid ancestor
15. THE ORIGIN AND EVOLUTION OF MODEL ORGANISMS
Hedges SB (2002) Nature Reviews Genetics 3: 838 -849.
The colored blocks
represent
synteny blocks
16. Synteny of Mouse and Human genome
When sequences from the mouse and human
genomes were compared, regions of remarkable
synteny were found.
Genes are in almost identical order for long
stretches along the chromosome.
Human
Chr 14
Mouse
Chr 14
17. Synteny of Mouse & Human genome
Almost entirely
syntenic
18. Colinearity
The one-to-one linear correspondence
between the order of codons in a coding
sequence and the order of amino acids in the
encoded protein.
A linear map of mutation sites within a gene
corresponds to the linear location of amino
acid substitutions within the polypeptide
encoded by that gene.
Comparative genome analyses
demonstrated that gene orders among related
plant species remained largely conserved over
millions of years of evolution.
19. Comparative genomics exploits both similarities
and differences in the proteins, RNA and
regulatory regions of different organisms to infer
how selection has acted upon these elements.
Those elements that are responsible for
similarities between different species should be
conserved through time (stabilizing selection), while
those elements responsible for differences
among species should be divergent (positive
selection).
20. The DNA sequences encoding the proteins and
RNAs responsible for functions that were
conserved from the last common ancestor
should be preserved in contemporary genome
sequences.
Likewise, the DNA sequences controlling the
expression of genes that are regulated similarly
in two related species should also be conserved.
Conversely, sequences that encode (or control
the expression of) proteins and RNAs
responsible for differences between species will
themselves be divergent.
Principle cont.
21. Evolution and sequence conservation
If there are no constraints on a DNA sequence, random
mutations will accumulate.
Over tens of millions of years these random
mutations will make two related sequences
different.
e.g. Non-coding DNA that does not have a
regulatory function tends to diverge much more
rapidly than protein coding DNA.
22. Function and sequence conservation
However: if there are constraints, e.g.
• DNA codes for protein
• or transcription factor binds DNA
• Replication origin
Then there will be sequence similarity when
related sequences are compared
Basic rule when comparing two related
sequences:
Sequence conservation = functional importance
23. Why do we annotate genomes?
• If we find the gene in a model organism (like
the rat), then we need to know what the
homolog is in humans.
• If we find the gene in a model organism, we
need to know if it’s doing the same thing in
humans.
• If we DON’T know what gene is implicated in
a disease, we can annotate ALL the genes in
a region and find candidates for further study
24. Comparison of genomic sequences from different
species can help to identify:
• Gene structure
• Gene function
• Regulatory sequences
• Interactions between gene products
25. Comparative Genomics Tools
• Similarity search programs
– BLAST2 (Basic Local Alignment Search Tool)
– FASTA
– MUMmer (Maximal Unique Match)
(Comparisons and analyses at both Nucleic acid and
protein level)
• Other alignment programs
– DBA [DNA Block Aligner] (Jareborg et al)
– blastz (Schwartz et al.)
– BLAT/AVID,
– WABA [Wobble Aware Bulk Aligner]
– DIALIGN [Diagonal ALIGNment] (Morgenstern et al.)
– SSAHA [Sequence Search and Alignment by
Hashing Algorithm]
27. General Databases Useful for Comparative Genomics
• Locus Link/RefSeq:
http://www.ncbi.nih.gov/LocusLink/
• PEDANT-Protein Extraction Description ANalysis Tool
http://pedant.gsf.de
• COGs - Clusters of Orthologous Groups (of proteins)
http://www.ncbi.nih.gov/COG/
• KEGG- Kyoto Encyclopedia of Genes and Genomes
http://www.genome.ad.jp/kegg/
• MBGD - Microbial Genome Database
http://mbgd.genome.ad.jp/
• GOLD - Genome OnLine Database
http://wit.integratedgenomics.com/GOLD/
• TIGR – The Institute for Genomic Research
Comparative genomics of Parasites
28. Alignment of DNA sequences is the core
process in comparative genomics.
An alignment is a mapping of the nucleotides
in one sequence onto the nucleotides in the
other sequence, with gaps introduced into one
or the other sequence to increase the number
of positions with matching nucleotides.
Several powerful alignment algorithms have
been developed to align two or more
sequences.
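The definition above (map one sequence onto the other, inserting gaps to increase the number of matching positions) can be illustrated with a minimal global alignment by dynamic programming, in the style of Needleman and Wunsch. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration, not parameters taken from any tool named in these slides.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


if __name__ == "__main__":
    top, bottom, total = global_align("GATTACA", "GATACA")
    print(top)
    print(bottom)
    print("score:", total)
```

This fills a full table, so time and memory grow with the product of the two lengths; that cost is why database searches rely on faster heuristic programs.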
COMPARATIVE GENOMICS - PROCESS
29. • The most frequently performed type of sequence
comparison is the sequence similarity search.
• Sequence comparisons that implicate function
are widely used:
– To determine if newly sequenced cDNA or
genomic region encodes gene of known
function.
– Search for similar sequence in other species
(or in same species)
Sequence similarity search
30. • Search databases of DNA sequences
• Use computer algorithms to align sequences
– Don’t require perfect matches between
sequences
– Allow for insertions, deletions and base
changes
• Most commonly used algorithms:
– BLAST
– FASTA
Homology searches
37. Gramene (http://www.gramene.org) is a comparative
genome mapping database for grasses and a
community resource for rice (Oryza sativa).
It combines a semi-automatically generated
database of cereal genomic and expressed sequence
tag sequences, genetic maps, map relations, and
publications, with a curated database of rice mutants
(genes and alleles), molecular markers, and proteins.
Gramene curators read and extract detailed
information from published sources, summarize that
information in a structured format, and establish links
to related objects both inside and outside the
database.
41. Genome analyses
• Variation in
– Genome size
– GC content
– Codon usage
– Amino acid composition
– Genome organization
• Single circular chromosomes
• Linear chromosome + extra chromosomal elements
Genome size examples:
E. coli: 4.6 Mbp
M. pneumoniae: 0.81 Mbp
B. subtilis: 4.20 Mbp
GC content examples:
B. burgdorferi: 29%
M. tuberculosis: 68%
Amino acid composition:
G, A, P, R: GC rich
I, F, Y, M, D: AT rich
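GC content, one of the quantities listed on this slide, is straightforward to compute from a sequence. The sketch below is illustrative only: the function name is made up, and ignoring ambiguous bases such as N is a design choice rather than a convention stated in the slides.

```python
def gc_content(sequence):
    """Return the percentage of G and C among the unambiguous bases of a DNA sequence."""
    sequence = sequence.upper()
    counted = sum(sequence.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T")
    gc = sequence.count("G") + sequence.count("C")
    return 100.0 * gc / counted


if __name__ == "__main__":
    print(gc_content("ATGCGCGTTA"))
```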
42. CG: Comparisons between genomes
• The strains of the same species
• The closely related species
• The distantly related species
– List of Orthologs
– Evolution of individual genes
– Evolution of organisms
43. Comparison of the coding regions
• Begins with the
gene identification
algorithm:
Infer what portions
of the genomic
sequence actively
code for genes.
• There are four
basic approaches.
4 basic categories of gene identification programs
Category | Algorithm
1. Based on direct evidence of transcription | EST_GENOME, sim4
2. Based on homology with known genes | PROCRUSTES
3. Statistical or ab initio approaches | Genscan, FGENES, GeneMark, Glimmer
4. Using genome comparison | TwinScan, Rosetta
44. ‘Trait-to-gene’ approach
A bioinformatics approach to identify genes
involved in adaptive traits is called “Trait-to-gene”.
Assumption: New genes will be created to
perform tasks required for an adaptive response.
Underlying reasoning: organisms that share a
particular trait will share related genes.
Also used to identify two different genes that
serve the same function in different organisms.
45. Relating traits to genes
To compare genes among species the Trait-to-
gene approach uses “COG database”. (Eugene
Koonin)
It is a method for identifying likely orthologues
when making whole genome comparisons among
multiple species.
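One common and simple heuristic for proposing orthologues between two genomes is the reciprocal best hit: gene a and gene b are paired when each is the other's highest-scoring match. The sketch below illustrates that heuristic only and is not a description of how the COG database itself is built. The input format (nested dictionaries of similarity scores) and all gene names are hypothetical.

```python
def reciprocal_best_hits(scores_ab, scores_ba):
    """Pair genes that are each other's best-scoring hit.

    scores_ab maps each gene of genome A to {gene of genome B: score};
    scores_ba is the same in the opposite direction.
    """
    best_ab = {a: max(hits, key=hits.get) for a, hits in scores_ab.items() if hits}
    best_ba = {b: max(hits, key=hits.get) for b, hits in scores_ba.items() if hits}
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    a_vs_b = {"a1": {"b1": 200, "b2": 50}, "a2": {"b1": 180}}
    b_vs_a = {"b1": {"a1": 210, "a2": 170}, "b2": {"a1": 40}}
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In the toy data, a2 has b1 as its best hit but b1 prefers a1, so a2 is left unpaired. That is the pattern expected when a2 is a paralog of a1.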
46. Important observations with regard to
Gene Order
• Order is highly conserved in closely related
species but gets changed by rearrangements.
• With more evolutionary distance, no
correspondence between the gene order of
orthologous genes.
• Group of genes having similar biochemical
function tend to remain localized.
47. Finding regulatory regions
Called phylogenetic footprinting (by analogy with
DNase footprinting)
Functionally important regions accumulate fewer mutations.
These cis-regulatory motifs can be determined
by:
– Finding common motifs in orthologous sequences
– Aligning orthologous sequences first, then
identifying common regions
Previously known motifs might help
49. Visualization
Dot plot:
A graphical dot plot program for detailed comparison of
two sequences
Software
for dot plots:
DNA strider
Dotlet
Dotter
Dottup
50. Dot plot
– The X axis represents the
first sequence (PHO5),
– The Y axis represents the
second sequence (PHO3)
– A dot is plotted for each
match between two
residues of the sequences.
– Diagonal lines reveal
regions of identity between
the two sequences.
A dot plot is a simple graphical representation of identical
residues between two sequences.
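A dot plot as described here can be sketched in a few lines. The `window` parameter (number of consecutive identical residues required before a dot is drawn) is an added assumption that dot plot programs commonly offer to suppress noise; with `window=1` the code does exactly what the slide says. Function names are illustrative.

```python
def dot_plot(seq_x, seq_y, window=1):
    """Return the set of (x, y) start positions where `window` residues are identical."""
    dots = set()
    for x in range(len(seq_x) - window + 1):
        for y in range(len(seq_y) - window + 1):
            if seq_x[x:x + window] == seq_y[y:y + window]:
                dots.add((x, y))
    return dots


def render(seq_x, seq_y, dots):
    """Draw the plot as text: seq_x along the top, seq_y down the side."""
    lines = ["  " + seq_x]
    for y, residue in enumerate(seq_y):
        row = "".join("*" if (x, y) in dots else "." for x in range(len(seq_x)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    first, second = "GATTACA", "GATTTACA"
    print(render(first, second, dot_plot(first, second, window=2)))
```

Regions of identity show up as diagonal runs of `*`; an insertion in one sequence shifts the diagonal sideways.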
52. Whole genome map dot plots of
E. coli vs B. subtilis and E. coli vs S. typhimurium
E. coli vs B. subtilis: random distribution
E. coli vs S. typhimurium: diagonal pattern
53. Sequence logo
A very useful representation of the conservation
patterns is the so-called sequence logo.
This shows the conserved residues as larger
characters, where the total height of a column is
proportional to how conserved that position is.
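One standard way to quantify "how conserved" a column is uses its information content in bits: the maximum possible entropy for the alphabet minus the observed entropy of the column. The sketch below assumes DNA (four symbols, so at most 2 bits), omits the small-sample correction that logo software usually applies, and treats any gap character as just another symbol. Function names are illustrative.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information content (bits) of one alignment column."""
    total = len(column)
    entropy = -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(column).values()
    )
    return math.log2(alphabet_size) - entropy


def logo_heights(aligned_sequences, alphabet_size=4):
    """For each column, return {residue: letter height in bits}."""
    heights = []
    for column in zip(*aligned_sequences):
        info = column_information(column, alphabet_size)
        counts = Counter(column)
        heights.append(
            {residue: info * count / len(column) for residue, count in counts.items()}
        )
    return heights


if __name__ == "__main__":
    sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
    for position, letters in enumerate(logo_heights(sites), start=1):
        print(position, letters)
```

A perfectly conserved DNA column scores 2 bits and a column with all four bases at equal frequency scores 0, which matches the tall and short stacks seen in a logo.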
55. When comparing genomes of different species
– Genes normally have same exon/intron structure
(Neutral theory of evolution)
Look for ORFs that are conserved in both
genomes
Frequently permits accurate identification of genes
– Fugu/human comparison: found >1000 genes that
had been missed by annotation
– Mouse/human comparison indicates only 30,000
genes in genome
How genome comparisons help?
56. The comparison of fruit fly genome with the human
genome discovered that about 60 percent of genes
are conserved between fly and human.
Virtually all (99%) of the protein-coding genes in
humans align with homologs in mouse, and over
80% are clear 1:1 orthologs. In most cases, the
intron-exon structures are highly conserved.
The finding that the three wheat genomes have a
highly similar gene content and order was the first
demonstration of colinearity in the grasses and a
pivotal finding in the development of comparative
genomics.
How genome comparisons help? Cont.
57. Comparison of the human and mouse spermidine
synthase genes revealed an additional intron in
the human gene that is not found in the mouse
homologue.
Sequence comparison example
Human
Mouse
5,500 bp
59. CG helps genome annotations
In prokaryotes, finding genes is relatively
easy based on open reading frames (ORFs)
In eukaryotes, we have to look for ORFs,
exons, introns, splice sites, polyA sites
Difficulties:
• Predicted exons sometimes do not exist
• Pseudogenes
• Alternative splicing
Merit: In different species, the genes normally
have similar exon-intron structure
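The prokaryotic case mentioned on this slide, finding genes from open reading frames, can be illustrated with a minimal ORF scanner. It is a sketch under simplifying assumptions: forward strand only, ATG as the only start codon, the three standard stop codons, and an arbitrary minimum length. Real gene finders also scan the reverse strand and use statistical models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Return (start, end) pairs for forward-strand ORFs; end is exclusive and includes the stop codon."""
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the reading frame
            elif start is not None and codon in STOP_CODONS:
                # min_codons counts the start codon but not the stop codon
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    dna = "CCATGAAATTTGGGTAACC"
    for begin, end in find_orfs(dna):
        print(begin, end, dna[begin:end])
```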
60. Finding regulatory sequences
Regulatory sequences are difficult to identify
using computer programs.
Problems are:
Most enhancer sequences have yet to be
identified
They are usually short: 6-10 basepairs
Those that are known are usually degenerate
• They can differ in one or more basepairs
• Still bind the cognate transcription factor
61. Comparisons to identify regulatory elements
Comparisons of genomes of different species
can identify cis-regulatory elements. (Neutral
theory of evolution)
Changes in intergenic regions and introns are
usually more rapid than in coding regions
Nevertheless, regulatory elements tend to be
conserved (because these sequences bind TFs)
Conserved intergenic sequences identified by
aligning genomic regions of orthologs are
called “phylogenetic footprints” (by analogy with
DNase footprinting).
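The align-then-scan strategy can be sketched as a sliding window over a pairwise alignment that reports windows whose identity exceeds a threshold. The window length, the threshold and the function name are assumptions for illustration; the two input strings must already be aligned (equal length, `-` for gaps).

```python
def conserved_windows(aligned_a, aligned_b, window=8, min_identity=0.9):
    """Return start positions of alignment windows with identity >= min_identity."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    hits = []
    for start in range(len(aligned_a) - window + 1):
        pairs = zip(aligned_a[start:start + window], aligned_b[start:start + window])
        # A gap never counts as a match, even when both rows have one.
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        if matches / window >= min_identity:
            hits.append(start)
    return hits


if __name__ == "__main__":
    species_1 = "AAAATGACGTCATTTT"
    species_2 = "CCCCTGACGTCAGGGG"
    for start in conserved_windows(species_1, species_2, window=8, min_identity=1.0):
        print(start, species_1[start:start + 8])
```

In this made-up example only the central 8-base block is identical in both sequences, so it is the single candidate footprint reported.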
62. Interaction mapping
A remarkable use of comparative genomics is
to identify interacting proteins.
Protein-protein interactions are critical for
cellular functions like
Transfer of information in a genetic pathway
Scaffolding to tether other proteins
Enzymatic reactions (multi-subunit enzymes)
Large molecular machines such as motors
63. Rosetta Stone
Interacting proteins are encoded by a single gene
in some species, whereas in other species the
same proteins are encoded by two genes.
Systematic search through sequenced genomes
for these relationships should identify proteins
that interact.
This method is called “Rosetta Stone” approach
64. Rosetta Stone example
Equivalent of yeast protein
topoisomerase II, in E. coli is
encoded by two genes:
gyrase A and gyrase B.
Suggests that gyrase B and
gyrase A interact in E. coli
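The Rosetta Stone logic can be expressed as a small search over similarity hits: if two different proteins from one genome match non-overlapping regions of a single protein in another genome, the pair is predicted to interact. The sketch below assumes the hits have already been computed by some alignment program. The tuple layout, the function name and the coordinates in the example are invented for illustration and are not real alignment results.

```python
def rosetta_pairs(hits, max_overlap=0):
    """Predict interacting pairs from hits against fused proteins.

    hits: iterable of (fused_protein, query_protein, start, end), meaning the
    query protein aligns to residues [start, end) of the fused protein.
    """
    by_fused = {}
    for fused, query, start, end in hits:
        by_fused.setdefault(fused, []).append((query, start, end))
    pairs = set()
    for fused, regions in by_fused.items():
        for i, (q1, s1, e1) in enumerate(regions):
            for q2, s2, e2 in regions[i + 1:]:
                overlap = min(e1, e2) - max(s1, s2)
                # Two distinct queries covering separate parts of one protein.
                if q1 != q2 and overlap <= max_overlap:
                    pairs.add((fused, *sorted((q1, q2))))
    return pairs


if __name__ == "__main__":
    # Hypothetical coordinates on a topoisomerase II-like fused protein.
    example_hits = [
        ("TOP2", "gyrB", 0, 650),
        ("TOP2", "gyrA", 660, 1400),
    ]
    print(rosetta_pairs(example_hits))
```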
65. 1. Identification and mapping of
Leucine-rich repeat resistance gene analogs
in Bermuda grass
31 Bermuda grass (Cynodon spp.) disease
resistance gene analogs (BRGA) were
cloned and sequenced from diploid, triploid,
tetraploid and hexaploid bermuda grass using
degenerate primers to target Nucleotide
Binding site (NBS) of the NBS-Leucine Rich
Repeat (LRR) resistance family.
(Harris et al., (2010) J.Amer.Soc.135:74-82)
66. 2. Synteny between the centromeric regions
of wheat and rice
Recently discovered that rice centromeres contain
genes. This helps in studying centromere homologies
between wheat and rice chromosomes by mapping
rice centromeric regions on to wheat aneuploid stocks
Genome wide comparison of wheat ESTs that were
mapped to centromeric regions against rice genome
sequences revealed high conservation and one to one
correspondence of centromeric regions between
wheat and rice chromosome pairs W1-R5, W2-R7,
W3-R1, W5-R12, W6-R2 and W7-R8
(Qi et al., (2009) Genetics 183:1235-1247)
67. 3. Sequencing and comparative analysis of
conserved syntenic segment (CSS)
in the Solanaceae
Wang et al. reported the generation and analysis of
sequences for an unduplicated conserved syntenic
segment (CSS) in the genomes of five members of the
Solanaceae.
This analysis indicates 30 million years of plant
evolution in absence of polyploidization.
The sequenced segments of the potato, tomato,
pepper, eggplant, and petunia genomes are
shown alongside corresponding regions of the
Arabidopsis (At) genome.
(Wang et al., (2008) Genetics 180:391-408)
69. 4. Comparative physical mapping of Rice
A comparative physical map between Oryza sativa
(AA genome type) and Oryza punctata (BB genome
type) was constructed by aligning the physical map of O.
punctata onto the O. sativa genome sequence.
The level of conservation of each genome between two
species was determined.
The alignment suggests more divergence of intergenic
and repeat regions in comparison to gene rich regions.
Genome of O. punctata was 8% larger than O. sativa
with individual chr. differences of 1.5 to 16.5%.
(Kim et al., (2007) Genetics:379-390)
70. Alignment view of the comparative physical map of O. punctata
(BB genome type) and O. sativa (AA genome type) using SyMap
71. What is difference between man and ape?
Man and chimpanzee have a
genome wide similarity of greater
than 95%.
What accounts for the differences
between species?
Recent study suggests that it is due
to specific gene expression
differences.
– Striking differences found only in
brain
74. Objective of the research
Classification and phylogenetic analysis of
phytohormone related genes, from
metabolism enzymes to receptors and
signaling components, in different species.
75. Abstract of the work
Genetic and molecular studies in the model organism
Arabidopsis thaliana have revealed the individual
pathways of various plant hormone responses.
Selected 479 genes that were convincingly associated
with various hormone actions.
By using these 479 genes as queries, a genome-wide
search for their orthologues in several species
(microorganisms, plants and animals) was performed.
Meanwhile, a comparative analysis was conducted to
evaluate their evolutionary relationship.
76. Results and discussion
Phylogenetic tree generated by orthologue
genes, using orthologue gene similarity as
compared to A. thaliana hormone related genes
Protein sequence phylogenetic
tree
77. Distribution of orthologues in function category
of hormone related genes in different species
The height of each bar showing in different color represents the
percentage of orthologue genes in AHRG of selected plants as
compared to that of A. thaliana.
Blue - orthologues belonging to hormone metabolism related genes;
Purple - orthologues belonging to hormone transport genes;
light yellow – orthologues belonging to genes related to signal
transduction.
78. Comparison of the copy numbers of AHRG orthologues in cereals –
Rice, S. bicolor, P. trichocarpa, and A. thaliana.
Different colors represent ratios that are calculated by the number of
orthologue genes in selected species versus the number of AHRG in
A. thaliana.
79. Conclusion
The metabolisms and functions of plant hormones are
generally more sophisticated and diversified in higher
plant species.
In particular, several phytohormone receptors and key
signaling components were not present in lower plants
or animals.
Meanwhile, as the genome complexity increases, the
orthologue genes tend to have more copies and
probably gain more diverse functions.
81. Plant disease resistance (R) loci frequently lack
synteny between related species of cereals and
crucifers but appear to be positionally well conserved
in the Solanaceae.
In this report, a local RGA approach is adopted using
genomic information from the model Solanaceous
plant tomato to isolate R3a, a potato gene that confers
race-specific resistance to the late blight pathogen
Phytophthora infestans.
Abstract of the research
82. The genomic regions harboring the R3 late
blight resistance locus in potato and the I2
Fusarium wilt resistance locus in tomato are
colinear (Huang et al., 2004).
Identified a cluster of I2 gene analogues
(I2GAs) in potato.
This potato I2GA cluster positionally
corresponds to the SL8D cluster of the I2
complex locus in tomato and was therefore
named the St-I2 cluster.
Results and discussion
83. R3a candidates were
identified using a local
resistance gene
analogue (RGA) approach
The syntenic relationships
of R gene clusters are
highlighted using
gray rectangles.
To identify I2GAs
physically close to R3a, an
association analysis on
bacterial artificial
chromosome (BAC) pools
was conducted.
Results cont.
Comparative genetic maps of the
I2 complex locus in tomato and
the R3 complex locus in potato.
84. Results cont.
In vitro inoculation of the primary transformants of R3a. Massive
sporulation (S) and localized hypersensitive reactions (HR) are
observed on compatible and incompatible interactions, respectively.
85. In this study, genomic information from tomato was
used to isolate the potato late blight resistance gene
R3a from an ancient locus involved in plant innate
immunity in the Solanaceae.
Comparative analyses of the R3 complex locus with
the corresponding I2 complex locus in tomato suggest
that this is an ancient locus involved in plant innate
immunity against oomycetes and fungal pathogens.
However, the R3 complex locus has evolved after
divergence from tomato and the locus has experienced
a significant expansion in potato without disruption of
the flanking colinearity.
Conclusion
86. LIMITATIONS OF CG
Homologous genes are relatively well preserved
while noncoding regions tend to show varying
degrees of conservation.
Cross species comparative genomics is influenced
by the evolutionary distance of the compared
species.
Genetic drift: how can we tell which differences
result from selection and matter to organism function,
and which are simply a result of genetic drift?
Computationally intensive: large amounts of data
are being compared, and the tools to
process and compare genomes are still being developed.
For the comparisons to be statistically relevant,
many more genomes will need to be sequenced.
88. The goal of comparative genomics
Due to our current ability
to annotate genomes we
can precisely place a list
of genes on the
chromosome, resulting
in something similar to a
set of lights of uniform
color and intensity.
We will be able to tell
what each gene does
and when and where it is
expressed.
89. Issues for the future
• Faster/better algorithms for aligning vertebrate
genomes
• Multiple alignments
– Comparing several species can give clues to which
regulatory sequences are of a basic nature, and
which are lineage specific
• Cataloguing of comparative data
• Better visualisation
– Whole syntenic region <> nucleotide level
– Multiple genome sequences
The origins of the field of Comparative Genomics can be found in the strategy used by the Human Genome Project to start sequencing smaller genomes to gain experience prior to tackling the large and complex human genome. This resulted in the sequencing of several bacterial genomes as well as the genomes of some of the most widely studied model organisms including (in chronological order of sequence completion): brewer's yeast, S. cerevisiae; the nematode, C. elegans; the fruit fly, Drosophila melanogaster; and the model plant, Arabidopsis thaliana. As each genome sequence was completed, it was compared to those already in the databases, and the study of the comparison of genomes came to be known as “Comparative Genomics”.
Prior to the availability of fully-sequenced genomes and the acquisition of the name Comparative Genomics, genome-wide comparisons had been performed. These included determining relative genome size by comparing the DNA content of different species. One way of determining DNA content was to incubate cells with a DNA-binding dye such as DAPI and then analyze the fluorescence found in individual nuclei using a fluorescence activated cell sorter (FACS). Another parameter that could be roughly determined prior to sequencing was the relative abundance of single copy versus repetitive DNA in the genome. Genomic DNA was randomly sheared and denatured. It was then allowed to reanneal and the amount of double-stranded DNA was determined at successive time intervals. Because repetitive DNA finds a complementary strand more rapidly than single-copy DNA, the proportion of double-stranded DNA detected at early times was considered to be roughly equal to the amount of repetitive DNA in the genome. Graphs of the proportion of double stranded DNA versus time were called Cot curves. This use of hybridization kinetics was refined to identify highly repetitive DNA (the DNA that most rapidly reannealed), middle repetitive DNA (more slowly reannealing DNA) and single copy DNA (the slowest to reanneal).
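The reannealing measurement described above is usually summarized with ideal second-order kinetics, in which the fraction of DNA still single-stranded is 1 / (1 + k·C0t), where C0t is the product of initial DNA concentration and time. That formula is an assumption brought in here for illustration and is not stated in these notes. The rate constants in the usage example are arbitrary values, chosen only to show that a faster-reannealing (more repetitive) component reaches the halfway point at a lower C0t.

```python
def fraction_single_stranded(c0t, k):
    """Fraction of DNA still single-stranded at a given C0t, for rate constant k."""
    return 1.0 / (1.0 + k * c0t)


def fraction_reassociated(c0t, k):
    """Fraction of DNA that has reannealed into double strands."""
    return 1.0 - fraction_single_stranded(c0t, k)


def cot_half(k):
    """C0t value at which half of the DNA has reannealed."""
    return 1.0 / k


if __name__ == "__main__":
    # Arbitrary illustrative rate constants for two sequence classes.
    for label, k in [("repetitive", 100.0), ("single copy", 0.01)]:
        print(label, "C0t(1/2) =", cot_half(k))
```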
There is a trap that is easy to fall into when comparing two genomic sequences. For a comparison of two genes to be meaningful one must know the evolutionary relationship between them. The genes can be either orthologs or paralogs. Orthologs are genes found in two species that had a common ancestor prior to the divergence of the two species. Paralogs are genes found in the same species that were created through gene duplication events. The problem arises when one compares genes from two species that are similar in sequence but are not orthologs. These genes may have different structure, function and expression patterns in the two species.
This slide depicts the relationship between orthologs and paralogs. A species indicated by the green line evolves from a species represented by the red line. In the green species, gene A evolves from gene B in the red species. After the two species have diverged, gene duplication events take place in both species. In the green species these events produce A’ and A’’, while in the red species they produce B’ and B’’. A and B are the most closely related to the ancestral gene while A’, A’’, B’ and B’’ are more distantly related and have taken on new functions. Thus it is reasonable to compare A and B but not A’ with B’ or with B’’, nor should one compare B with A’ or A’’. The important question to ask is whether two genes are related by a speciation event, as is the case for orthologs, or by a gene duplication event, as is the case for paralogs.
Prior to the availability of sequenced genomes, genetic and physical maps of different species were compared. When a similar order of genes was found on the chromosomes of two different species it was called “synteny.” Sometimes, duplications of chromosomal segments lead to syntenic regions within the genome of the same species. In general, it has been found that closely related species have fairly similar gene order on their chromosomes, leading to large syntenic regions. An important practical application of synteny is to identify a gene that has been genetically mapped in a species that has a very large genome. If the gene is mapped between two known genes then synteny would suggest that the gene of interest is likely to lie between the same two genes in a species with a smaller genome.
One of the most dramatic examples of synteny between species whose genomes vary greatly in size is found in the crop plants, rice, maize (corn) and wheat. In this figure the chromosomes of each species are arranged in a circle. Rice has the smallest genome and is in the middle in red. Maize has a duplicated genome which is indicated by the two sets of yellow chromosomes. Wheat has a considerably larger genome than maize and is on the outside. The location of genes that control traits such as height or seed shattering are indicated on the chromosomes of the different species. Now that the rice genome is fully sequenced, wheat and maize breeders can use synteny to identify genes involved in these and other traits for which they have a genetic map position.
Once genomes of related species are fully sequenced, then the precise order of genes can be compared. For Mouse and Human there are stretches of remarkable synteny like the one shown on this slide between a portion of Human chromosome 14 and mouse chromosome 12. The blue lines linking the two chromosomal regions indicate genes that are found on both chromosomes. Note that the size of the intergenic regions varies considerably, but the order of genes is nearly identical.
Comparison of Mouse and Human sequences across the whole genome revealed a hodge-podge of chromosomal translocations as shown on this slide. At the bottom of the image are the 23 human chromosomes each with a different color code. Above them are the 20 mouse chromosomes showing the regions that are syntenic to a particular human chromosome through use of the human chromosome’s color. Some mouse chromosomes like the X chromosome are almost entirely syntenic to their human counterpart. Others have syntenic regions from many human chromosomes, like mouse chromosome 2 which has syntenic regions from at least seven human chromosomes.
Most uses of DNA sequence comparison are based on the observation that conservation of DNA sequence is usually for a reason. The idea is that if there are no constraints on the DNA sequence then mutations will occur randomly. The accumulation of these random mutations means that over tens of millions of years of evolutionary time, two sequences that started out the same will become so different as to be unrecognizable. For example, non-coding DNA that does not have a regulatory function tends to diverge much more rapidly than protein-coding DNA.
However, if there are constraints on the DNA such as coding for a protein or binding a transcription factor, then that portion of the DNA sequence will not randomly drift, but will remain relatively similar in sequence over the millions of years that two species evolve independently. Thus, the basic rule when comparing two related sequences is that sequence conservation indicates that there is functional importance. We will discuss how this rule has been used to identify various features of DNA sequence. It is important to point out that sequence conservation will also exist when the two related genomes are so closely related that insufficient time has passed to allow for sequence divergence.
Once DNA sequence is known for two genomes, comparisons between them can be used for many purposes including determining gene structure (e.g. exon/intron boundaries), specifying gene function (by similarity to a gene encoding a protein of known function) and identifying regulatory sequences (such as promoters and enhancers). Comparisons of DNA sequences have even been used to determine which gene products are likely to interact with each other.
By far the most frequently performed type of sequence comparison is the sequence similarity search. This is frequently called a homology search, although strictly speaking, homology refers to similarity due to a common origin and sequence similarity can also arise through convergent evolution. Whenever DNA is freshly sequenced the question naturally arises, “Does it encode something that is similar to a protein (or RNA) of known function?” To answer this question, researchers search for similar sequences in other species, or in the same species.
Finding out whether any sequences are similar to a newly acquired sequence requires searching the databases of DNA and protein sequences. This process is described in detail in the Bioinformatics chapter. Briefly, computer algorithms attempt to find the best alignment between the search sequence and each of the sequences in the databases. Because the sequences in other species are likely to have undergone some changes, the programs do not require perfect matches between the sequences. They allow for some base or amino acid changes as well as small insertions and deletions. The most commonly used program is BLAST because it performs a thorough search in a relatively small amount of time. Another program, FASTA, is able to detect matching segments over longer stretches of sequence but takes much longer to perform a thorough search of the databases.
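To make the idea of an alignment that tolerates base changes and small insertions or deletions concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. This is not how BLAST or FASTA work internally (both use faster heuristics), and the function name and the scoring values (match, mismatch, gap) are assumptions chosen only for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between sequences a and b.

    Scoring values are illustrative assumptions, not the defaults of any
    real search program. Only the score is returned, not the alignment.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend diagonally with a match or a mismatch (base change).
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A gap in either sequence models a small insertion or deletion.
            # The 0 lets a local alignment restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Two identical 8-base sequences score 16 under these values. A single base change or a single inserted base lowers the score but does not destroy the match, which is the behaviour a similarity search needs when comparing sequences from different species.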
A bioinformatics approach to identify genes involved in adaptive traits is called “Trait-to-gene”. This method is based on the assumption that in some cases at least, new genes will be created to perform tasks required for an adaptive response. Its underlying reasoning is that organisms that share a particular trait will share related genes. This method can also be used to identify two different genes that serve the same function in different organisms.
To compare genes among species, the Trait-to-gene approach uses a database of orthologues developed by Eugene Koonin’s group and known as the “COG database”. COG stands for “Clusters of Orthologous Groups”, and the database provides a way of identifying likely orthologues when making whole-genome comparisons among multiple species. When all species share a COG, little can be learned except that the COG is likely to play an important role in a basic housekeeping function. However, as is illustrated in this slide, if all the species that share a particular trait have a gene that is a member of a particular COG, and species that don’t share the trait also don’t have the COG, then this is evidence that the gene may play an important role in the particular adaptive trait.
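The filtering step of the Trait-to-gene approach can be sketched as a comparison of presence/absence patterns. The sketch below assumes we already have, for each COG, the set of species that carry a member gene. The function name, the COG identifiers and the species labels are all made up for illustration and are not taken from the real COG database.

```python
def trait_associated_cogs(cog_members, trait_species):
    """Return COG ids found in every species with the trait and in no other.

    cog_members:   dict mapping a COG id to the set of species that have
                   at least one gene in that COG (hypothetical input format).
    trait_species: set of species that show the trait of interest.
    """
    trait_species = set(trait_species)
    # Set equality means: present in all trait species, absent elsewhere.
    return sorted(cog for cog, species in cog_members.items()
                  if set(species) == trait_species)


# Hypothetical data: four species, of which sp1 and sp2 share the trait.
example_cogs = {
    "COG_A": {"sp1", "sp2", "sp3", "sp4"},  # universal, likely housekeeping
    "COG_B": {"sp1", "sp2"},                # matches the trait exactly
    "COG_C": {"sp1", "sp3"},                # unrelated pattern
}
candidates = trait_associated_cogs(example_cogs, {"sp1", "sp2"})
```

Here only COG_B survives the filter. Requiring an exact match is a deliberately strict choice; with real genomes one would probably want to tolerate a few exceptions caused by gene loss or incomplete annotation.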
Perfectly congruent maps produce a single diagonal set of dots. Maps that differ by simple chromosomal rearrangements show recognizable patterns corresponding to those rearrangements.
Whole genome map dot plots show essentially random distributions of homologous genes when the maps of E. coli and B. subtilis are compared. However, a similar plot of E. coli and S. typhimurium shows an obvious diagonal and gives some evidence of the well known inversion in the 30-40 minute region (Sanderson and Hall, 1970; Riley and Krawiec, 1987).
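A map dot plot of this kind can be sketched in a few lines. The code below assumes each map is given as a dictionary from gene name to map position, with homologous genes sharing a name. The function names and this input format are assumptions for illustration, and the sketch ignores the circularity of bacterial chromosomes.

```python
def dot_plot_points(map_a, map_b):
    """Return (position_in_a, position_in_b) for every gene on both maps.

    Plotting these points gives the dot plot: congruent maps fall on a
    single rising diagonal.
    """
    shared = map_a.keys() & map_b.keys()
    return sorted((map_a[g], map_b[g]) for g in shared)


def inverted_runs(points, min_len=3):
    """Find runs of consecutive points whose position in map B decreases.

    points must be sorted by position in map A. A falling run of at least
    min_len genes is a candidate inversion (an anti-diagonal segment).
    """
    runs, run = [], []
    for p in points:
        if run and p[1] < run[-1][1]:
            run.append(p)          # still falling: extend the current run
        else:
            if len(run) >= min_len:
                runs.append(run)   # keep the run that just ended
            run = [p]              # start a new run at this point
    if len(run) >= min_len:
        runs.append(run)
    return runs
```

For unrelated maps the points scatter and no long runs are expected, whereas a simple inversion shows up as a short falling segment interrupting the main diagonal.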
When the genomes of more than one species are compared, coding regions and exon/intron structures tend to be conserved, as the neutral theory of evolution leads us to expect for functionally constrained sequence. Thus, it becomes fairly straightforward to look for ORFs that are conserved in both genomes and found in similar positions relative to surrounding ORFs. Cross-genome comparisons have been shown to be an effective means of accurately annotating genes as well as identifying new genes. For example, in the comparison of the Fugu genome with the human genome, over 1000 putative genes were discovered that had been missed by the annotation programs. Comparison of the mouse genome with the human genome indicates that there are only about 30,000 genes in each genome. This is consistent with the predictions made from the draft human sequence but runs contrary to earlier pre-genome predictions that set the number of genes closer to 100,000.
When the human spermidine synthase gene, which is involved in the synthesis of polyamines, was compared with the homologous gene in the mouse genome, the human gene was found to contain an additional intron that interrupts the sequence corresponding to the fifth exon of the mouse gene. This is an example of a comparison that highlights how gene structure can change as species evolve.
Identifying regulatory sequences in fully sequenced genomes is very challenging for computer programs. There are several issues that compound the problem. The first is that relatively few transcription factor binding sites have been characterized by biochemical or genetic experiments. Thus, many have yet to be identified. Second, most enhancer and repressor cis-elements are short, between 6 and 10 basepairs. Third, and perhaps most problematic, is the fact that most cis-elements are degenerate, meaning that two sites can differ in one or more basepairs but still bind the same transcription factor. The combination of these characteristics means that it is very difficult to write a computer program that can reliably identify cis-elements in eukaryotic genomes.
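The difficulty caused by short, degenerate sites can be illustrated with a small sketch. It writes a degenerate site as a consensus in the standard IUPAC nucleotide codes, scans a sequence for matches, and estimates how many matches would be expected by chance alone. The function names are made up for illustration, the scan covers only the given strand, and the chance estimate assumes equal, independent base frequencies, which real genomes do not have.

```python
import re

# Standard IUPAC nucleotide ambiguity codes as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_sites(consensus, sequence):
    """Return 0-based start positions where the consensus matches.

    Only the given strand is scanned. The lookahead lets overlapping
    matches be reported.
    """
    pattern = "".join(IUPAC[c] for c in consensus.upper())
    return [m.start() for m in re.finditer(f"(?={pattern})", sequence.upper())]


def expected_random_hits(consensus, seq_length):
    """Expected chance matches on one strand of a random sequence.

    Assumes the four bases are equally frequent and independent.
    """
    p = 1.0
    for c in consensus.upper():
        p *= len(IUPAC[c].strip("[]")) / 4  # fraction of bases allowed here
    return p * (seq_length - len(consensus) + 1)
```

Taking the E-box consensus CANNTG as an example, a random sequence of about a million base pairs is expected to contain several thousand matches by chance, since only four of the six positions are specified. This is why a match to a short degenerate consensus is weak evidence on its own.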
The neutral theory of evolution can greatly aid in the search for cis-regulatory elements. Because these sequences bind transcription factors, they are constrained against accumulating random mutations. Thus, sequence conservation in intergenic regions and introns can be used to help identify cis-regulatory elements. Detecting conserved intergenic sequences usually involves aligning the genomic regions of orthologues from two or more species. The conserved sequences identified by this type of alignment are called “phylogenetic footprints.”
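A simple way to picture the search for phylogenetic footprints is a sliding window over a pairwise alignment. The sketch below assumes the two orthologous regions have already been aligned (gaps written as "-") and reports the stretches where identity stays above a threshold. The function name, window size and threshold are assumptions for illustration, not values from any published footprinting tool.

```python
def conserved_windows(aln_a, aln_b, window=10, min_identity=0.9):
    """Return (start, end) column ranges of high identity in an alignment.

    aln_a, aln_b: two rows of a pairwise alignment, equal length, with
                  '-' marking gaps. end is exclusive.
    Overlapping or adjacent qualifying windows are merged into one region.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    # A column counts as conserved only if both rows carry the same base.
    same = [x == y and x != "-" for x, y in zip(aln_a.upper(), aln_b.upper())]
    regions = []
    for start in range(len(same) - window + 1):
        if sum(same[start:start + window]) / window >= min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend previous region
            else:
                regions.append((start, end))
    return regions
```

The choice of species matters for this kind of scan. As noted earlier, two genomes that are very closely related will look conserved everywhere, so the window test only becomes informative once unconstrained sequence has had time to diverge.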
A remarkable use of comparative genomics has been to identify proteins that are likely to interact with each other. Protein-protein interactions are critical for various types of cellular functions, including the transfer of information in a genetic pathway, the tethering of other proteins to a scaffold, the assembly of multi-subunit enzymes, and the operation of large molecular machines such as dynein, which works as a molecular motor.
Normally, one would think that it would be necessary to work with proteins to be able to detect interacting partners. However, with remarkable insight, the group of Christos Ouzounis realized that proteins that interact in one species can be encoded by a single gene in another species. They used these criteria to systematically search through the sequenced genomes to identify genes encoding proteins that are likely to interact with each other. They called their method the “Rosetta Stone” approach.
An example of the Rosetta Stone approach comes from comparing the genome of the yeast S. cerevisiae with that of the bacterium E. coli: the equivalent of the yeast protein topoisomerase II is encoded by two genes in E. coli, gyrase A and gyrase B. This is taken as evidence that gyrase A and gyrase B probably interact in E. coli.
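The logic of the Rosetta Stone search can be sketched as follows. We assume a similarity search has already produced a table of hits, each recording which gene in one genome matched which region of a protein in the other genome. Two different genes that match separate, non-overlapping parts of the same protein are reported as a candidate interacting pair. The function name, the hit format and the coordinates in the example are hypothetical and are not taken from the published method.

```python
from collections import defaultdict
from itertools import combinations


def rosetta_stone_pairs(hits, max_overlap=0):
    """Find gene pairs that match different parts of one composite protein.

    hits: iterable of (query_gene, subject_protein, start, end), where
          start and end are 1-based inclusive coordinates on the subject.
    Returns a sorted list of (gene_a, gene_b, subject_protein).
    """
    by_subject = defaultdict(list)
    for query, subject, start, end in hits:
        by_subject[subject].append((query, start, end))

    pairs = set()
    for subject, segments in by_subject.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(segments, 2):
            if q1 == q2:
                continue  # the same gene hitting twice is not a pair
            # Number of subject residues covered by both hits
            # (zero or negative when the hits do not overlap).
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap <= max_overlap:
                a, b = sorted((q1, q2))
                pairs.add((a, b, subject))
    return sorted(pairs)


# Illustrative hits only: the coordinates are invented, not real data.
example_hits = [
    ("gyrB", "TOP2", 1, 650),
    ("gyrA", "TOP2", 700, 1200),
    ("gyrA", "OTHER", 1, 300),
]
predicted = rosetta_stone_pairs(example_hits)
```

With these invented hits the sketch reports gyrA and gyrB as a pair linked through TOP2. A real analysis would also need to guard against promiscuous domains that occur in many unrelated proteins, since those produce pairs that do not reflect a genuine interaction.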
Of particular concern to some people is the very small difference at the genome level between humans and apes. Based on hybridization between the two genomes there appeared to be a 98.4% similarity, but more recent analysis of sequence comparisons suggests that it is closer to 95%, which is still highly similar. This raises the question, “If the genomes are so similar, what accounts for the important differences between the species such as language, the ability to walk upright, etc.?” Pioneering work by Svante Pääbo’s group has begun to address this question. The initial data suggest that there are striking differences in gene expression patterns in the brain, but not in other organs.
In their 2002 study, Pääbo’s group used microarrays of 12,000 human genes to analyze expression patterns in rhesus monkeys, chimpanzees, and humans. The results from liver and blood showed little difference among the three primates. However, when tissue from the brain was used, it showed a 5.5-fold difference between humans and chimpanzees. The difference was calculated by using the absolute change in gene expression in all 12,000 genes and then doing the interspecies comparison. This study suggests that the most dramatic differences between humans and chimpanzees, namely the disparity in cognitive abilities, might be explained by a difference in gene expression in the brain.
Our current ability to annotate genomes results in a list of genes about whose function and regulation we still know very little. We can precisely place them on the chromosome, resulting in something similar to a set of lights of uniform color and intensity. With the approaches of comparative genomics, it is hoped that we will soon gain enough knowledge to tell what each gene does and when and where it is expressed. The genome will then resemble a multi-colored set of lights for which we know both their purpose and exactly how they are controlled.