Comparative genomics @ sid 2003 format

Siddhartha Swarup
Jena
RAD/10-30

Contents
S S Jena

The Human Genome Project decided to use smaller genomes as a warm-up for the human genome.
This resulted in the sequencing of:
• Many bacteria
• Model organism genomes: yeast, C. elegans, Arabidopsis, Drosophila
• Crop plants: rice
Comparison of these genome sequences provided the basis for the field of “Comparative Genomics”.

What is comparative genomics?
Analyzing and comparing genetic material from different species to study evolution, gene function, and inherited disease.
To understand what makes each species unique.

Comparative genomics prior to obtaining full
genome sequence
Genome size
•Compared DNA content among species.
(DAPI staining- FACS)
Single copy and repetitive DNA
•Used hybridization kinetics. (Cot curve)
•Found amount of repetitive DNA differed
greatly among species.

What is compared?
Gene location
Gene structure
Exon number
Exon lengths
Intron lengths
Sequence similarity
Gene characteristics
Splice sites
Codon usage
Conserved synteny

What can homology tell us?
1. Identity of genes:
- The identity of the gene in another organism
- The identity of nearby genes
- The function of the gene (if annotated)
2. Suggestions of how the gene might be
causing disease
3. Infer ancestral relationships
4. Discover principles of evolution

Negative selection: The removal of
deleterious mutations from a population; also
referred to as purifying selection.
Positive selection: The retention of
mutations that benefit an organism; also
referred to as Darwinian selection.
Homologs: Features (including DNA and
protein sequences) in species being
compared that are similar because they are
ancestrally related.

Orthologs and Paralogs
• When comparing sequences from different genomes, we must distinguish between two types of closely related sequences:
– Orthologs are genes in different species that descend from a single gene in their common ancestor.
– Paralogs are genes in the same species that were created through gene duplication events.

Orthologs and Paralogs
[Figure: genes A, A’, A’’ and B, B’, B’’. A & B: orthologs. A’ & A’’ and B’ & B’’: paralogs.]

Example
[Figure: tree of globin genes along a time axis t (0 to 3). Labels: early globin gene; first duplication event; alpha chain and beta chain; second duplication event; frog alpha, human alpha, human beta, frog beta. Brackets mark orthologs, paralogs and homologs.]

Synteny
Regions of two genomes that show
considerable similarity in terms of sequence
and conservation of the order of genes.
Genes that are in the same relative position
on two different chromosomes.
Closely related species generally have
similar order of genes on chromosomes.
Synteny can be used to identify genes in one
species based on map-position in another.
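
The idea of conserved gene order can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any synteny tool: it assumes orthologous genes already share the same identifier in both genomes, it only detects runs in the same orientation (inversions are ignored), and the function name, the `min_genes` threshold and the gene labels are all made up for the example.

```python
def collinear_blocks(order_a, order_b, min_genes=3):
    """Return runs of genes that are adjacent and in the same order in both genomes."""
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    blocks, current = [], []
    for gene in order_a:
        if gene in pos_b and current and pos_b[gene] == pos_b[current[-1]] + 1:
            # Gene directly follows the previous one in genome B as well: extend the run.
            current.append(gene)
            continue
        # Run is broken: keep it if it is long enough, then start a new one.
        if len(current) >= min_genes:
            blocks.append(current)
        current = [gene] if gene in pos_b else []
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks


# Hypothetical gene orders; "g4" has no ortholog in genome B and "x1" none in genome A.
genome_a = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
genome_b = ["g5", "g6", "g7", "x1", "g1", "g2", "g3"]
print(collinear_blocks(genome_a, genome_b))
```

Here the two genomes share two blocks of three genes each, even though the blocks sit in a different order along the chromosome.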

Synteny among crop genomes:
rice, maize and wheat
Maize: evolved from tetraploid ancestor

THE ORIGIN AND EVOLUTION OF MODEL ORGANISMS
Hedges SB (2002) Nature Reviews Genetics 3: 838 -849.
The colored blocks represent synteny blocks.

Synteny of Mouse and Human genome
When sequences from the mouse and human genomes were compared, regions of remarkable synteny were found.
Genes are in almost identical order for long stretches along the chromosome.
[Figure labels: Human Chr 14; Mouse Chr 14]

Synteny of Mouse & Human genome
Almost entirely syntenic

Colinearity
The one-to-one linear correspondence between the order of codons in a coding sequence and the order of amino acids in the protein encoded.
A linear map of mutation sites within a gene corresponds to the linear location of amino acid substitutions within the polypeptide encoded by that gene.
Comparative genome analyses demonstrated that gene orders among related plant species remained largely conserved over millions of years of evolution.

Comparative genomics exploits both similarities and differences in the proteins, RNA and regulatory regions of different organisms to infer how selection has acted upon these elements.
Elements responsible for similarities between species should be conserved through time (stabilizing, i.e. negative or purifying, selection), while elements responsible for differences among species should be divergent (positive selection).

Principle cont.
The DNA sequences encoding the proteins and RNAs responsible for functions that were conserved from the last common ancestor should be preserved in contemporary genome sequences.
Likewise, the DNA sequences controlling the expression of genes that are regulated similarly in two related species should also be conserved.
Conversely, sequences that encode (or control the expression of) proteins and RNAs responsible for differences between species will themselves be divergent.

Evolution and sequence conservation
If there are no constraints on a DNA sequence, random mutations will accumulate.
Over tens of millions of years these random mutations will make two related sequences different.
E.g. non-coding DNA that does not have a regulatory function tends to diverge much more rapidly than protein-coding DNA.

Function and sequence conservation
However, if there are constraints, e.g.
– DNA codes for protein
– or transcription factor binds DNA
– Replication origin
then there will be sequence similarity when related sequences are compared.
Basic rule when comparing two related
sequences:
Sequence conservation = functional importance
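
One simple way to put a number on conservation is percent identity over the aligned, non-gap positions of two sequences. The sketch below assumes the two sequences have already been aligned to equal length with "-" as the gap character; the function name and the test strings are illustrative only.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of non-gap aligned positions at which the two sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # skip gap columns
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


print(percent_identity("ACGT-A", "ACCTTA"))
```

Under the rule on this slide, a region with unusually high identity between two otherwise diverged genomes is a candidate for a functionally constrained element.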

Why do we annotate genomes?
• If we find the gene in a model organism (like
the rat), then we need to know what the
homolog is in humans.
• If we find the gene in a model organism, we
need to know if it’s doing the same thing in
humans.
• If we DON’T know what gene is implicated in
a disease, we can annotate ALL the genes in
a region and find candidates for further study

Comparison of genomic sequences from different species can help to identify:
– Gene structure
– Gene function
– Regulatory sequences
– Interactions between gene products

Comparative Genomics Tools
• Similarity search programs
– BLAST2 (Basic Local Alignment Search Tool)
– FASTA
– MUMmer (Maximal Unique Match)
(Comparisons and analyses at both nucleic acid and protein level)
• Other alignment programs
– DBA [DNA Block Aligner] (Jareborg et al.)
– blastz (Schwartz et al.)
– BLAT / AVID
– WABA [Wobble Aware Bulk Aligner]
– DIALIGN [Diagonal ALIGNment] (Morgenstern et al.)
– SSAHA [Sequence Search and Alignment by Hashing Algorithm]

Comparative Genomics Tools
• Comparative gene prediction programs
– Twinscan
– Doublescan
– SGP-1
• Regulatory region prediction
– Consite
• Visualization/ Sequence analysis programs
– Dot plot (e.g. Dotter)
– PIP maker (Percent Identity Plot)
– Alfresco
– VISTA (VISualization Tools for Alignments)
– ACT (Artemis comparison tool)

General Databases Useful for Comparative Genomics
• LocusLink/RefSeq:
http://www.ncbi.nih.gov/LocusLink/
• PEDANT - Protein Extraction Description ANalysis Tool
http://pedant.gsf.de
• COGs - Clusters of Orthologous Groups (of proteins)
http://www.ncbi.nih.gov/COG/
• KEGG - Kyoto Encyclopedia of Genes and Genomes
http://www.genome.ad.jp/kegg/
• MBGD - Microbial Genome Database
http://mbgd.genome.ad.jp/
• GOLD - Genomes OnLine Database
http://wit.integratedgenomics.com/GOLD/
• TIGR - The Institute for Genomic Research
Comparative genomics of parasites

COMPARATIVE GENOMICS - PROCESS
• Alignment of DNA sequences is the core process in comparative genomics.
• An alignment is a mapping of the nucleotides in one sequence onto the nucleotides in the other sequence, with gaps introduced into one or the other sequence to increase the number of positions with matching nucleotides.
• Several powerful alignment algorithms have been developed to align two or more sequences.
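
To make the definition of an alignment concrete, here is a minimal sketch of global pairwise alignment by dynamic programming (the Needleman-Wunsch scheme), which is one classic approach and not necessarily what the tools listed in this deck use. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for the example.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


top, bottom, total = global_align("ACGT", "AGT")
print(top)
print(bottom)
print(total)
```

The gap inserted opposite the unmatched C is exactly the kind of gap the slide describes: it raises the number of positions with matching nucleotides.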

Sequence similarity search
• The most frequently performed type of sequence comparison is the sequence similarity search.
• Sequence comparisons that implicate function are widely used:
– To determine if a newly sequenced cDNA or genomic region encodes a gene of known function.
– To search for similar sequences in other species (or in the same species).

Homology searches
• Search databases of DNA sequences
• Use computer algorithms to align sequences
– Don’t require perfect matches between sequences
– Allow for insertions, deletions and base changes
• Most commonly used algorithms:
– BLAST
– FASTA

Pairwise genome comparison of protein
homologs (symmetrical best hits)
http://www.ncbi.nlm.nih.gov/sutils/geneplot.cgi
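
Symmetrical (reciprocal) best hits can be sketched as follows. This is an illustration of the general idea, not the procedure used by the NCBI tool above: it assumes similarity search results are already available as (query, subject, score) tuples, and all protein names and scores are invented.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical search results between proteomes A and B.
a_vs_b = [("a1", "b1", 200), ("a1", "b2", 150), ("a2", "b2", 90), ("a3", "b1", 120)]
b_vs_a = [("b1", "a1", 198), ("b2", "a1", 140), ("b2", "a2", 95)]
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Only a1/b1 survives: a2 and a3 each have a best hit in B, but that hit prefers a different protein in A, so the relationship is not symmetrical.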

Genome databases
• Genomes at NCBI, EBI, TIGR

NCBI comparative maps

Ensembl
Genome databases

Ensembl synteny views

• Gramene (http://www.gramene.org) is a comparative genome mapping database for grasses and a community resource for rice (Oryza sativa).
• It combines a semi-automatically generated database of cereal genomic and expressed sequence tag sequences, genetic maps, map relations, and publications, with a curated database of rice mutants (genes and alleles), molecular markers, and proteins.
• Gramene curators read and extract detailed information from published sources, summarize that information in a structured format, and establish links to related objects both inside and outside the database.

Map Search: Comparative Maps

Gramene scope
• Rice
– whole genome sequence
• Other crop grasses
– Maize
– Sorghum
– Millet
– Sugarcane
– Wheat
– Oats
– Barley
Synteny in Gramene

Genome analyses
• Variation in
– Genome size (E. coli: 4.6 Mbp; M. pneumoniae: 0.81 Mbp; B. subtilis: 4.20 Mbp)
– GC content (B. burgdorferi: 29%; M. tuberculosis: 68%)
– Codon usage
– Amino acid composition (G, A, P, R: GC rich; I, F, Y, M, D: AT rich)
– Genome organization
• Single circular chromosomes
• Linear chromosome + extrachromosomal elements
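
Two of the quantities listed here, GC content and codon usage, are easy to compute directly from sequence. The sketch below uses made-up helper names and a toy sequence; it assumes DNA given as a plain string and ignores any character other than A, C, G and T when computing GC content.

```python
from collections import Counter


def gc_content(seq):
    """Percentage of G and C among the unambiguous bases of a DNA sequence."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


def codon_usage(cds):
    """Count codons in a coding sequence read in frame from its first base."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop an incomplete trailing codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


toy_cds = "ATGGCTGCTTAA"  # illustrative sequence, not from any genome
print(gc_content(toy_cds))
print(codon_usage(toy_cds))
```

Applied to whole genomes, the first function would give figures comparable to the 29% and 68% quoted on this slide.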

CG: Comparisons between genomes
• The strains of the same species
• The closely related species
• The distantly related species
– List of orthologs
– Evolution of individual genes
– Evolution of organisms

Comparison of the coding regions
• Begins with the gene identification algorithm: infer what portions of the genomic sequence actively code for genes.
• There are four basic approaches.

4 basic categories of gene identification programs
Category                                        Algorithm
1. Based on direct evidence of transcription    EST_GENOME, sim4
2. Based on homology with known genes           PROCRUSTES
3. Statistical or ab initio approaches          Genscan, FGENES, GeneMark, Glimmer
4. Using genome comparison                      TwinScan, Rosetta

‘Trait-to-gene’ approach
A bioinformatics approach to identify genes
involved in adaptive traits is called “Trait-to-gene”.
Assumption: New genes will be created to
perform tasks required for an adaptive response.
Underlying reasoning: organisms that share a
particular trait will share related genes.
Also used to identify two different genes that
serve the same function in different organisms.

Relating traits to genes
To compare genes among species, the trait-to-gene approach uses the COG database (Eugene Koonin).
COG is a method for identifying likely orthologues when making whole-genome comparisons among multiple species.

Important observations with regard to gene order
• Order is highly conserved in closely related species but gets changed by rearrangements.
• With more evolutionary distance, there is no correspondence between the gene order of orthologous genes.
• Groups of genes having similar biochemical function tend to remain localized.

Finding regulatory regions
• Called phylogenetic footprinting (by analogy with DNase footprinting)
• Functionally important regions are mutated less.
• These cis-regulatory motifs can be determined by:
– Finding common motifs in orthologous sequences
– Aligning orthologous sequences first, then identifying common regions
• Previously known motifs might help

Regulatory region prediction
Consite:
– Detection of TFBS conserved in corresponding genomic sequences from different species

Visualization
Dot plot:
A graphical dot plot program for detailed comparison of two sequences
Software for dot plots:
DNA Strider
Dotlet
Dotter
Dottup

Dot plot
A dot plot is a simple graphical representation of identical residues between two sequences.
– The X axis represents the first sequence (PHO5),
– The Y axis represents the second sequence (PHO3)
– A dot is plotted for each match between two residues of the sequences.
– Diagonal lines reveal regions of identity between the two sequences.
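
A text-mode dot plot takes only a few lines. This sketch is not how Dotter or Dotlet work internally; it simply marks every position where a word of length `window` is identical in both sequences. The function name, the `window` parameter and the toy sequences are assumptions for illustration.

```python
def dot_plot(seq_x, seq_y, window=1):
    """Return rows of a dot plot: '*' where seq_x and seq_y share an identical word."""
    rows = []
    for j in range(len(seq_y) - window + 1):          # Y axis: second sequence
        row = ""
        for i in range(len(seq_x) - window + 1):      # X axis: first sequence
            same = seq_x[i:i + window] == seq_y[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


for line in dot_plot("ACGTACGT", "ACGTTCGT", window=2):
    print(line)
```

Raising `window` removes most of the scattered chance matches, so that only the diagonal runs of real similarity stand out.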

Hypothetical whole genome map dot plots

Whole genome map dot plots of E. coli vs B. subtilis and E. coli vs S. typhimurium
[Figure panels: random distribution (E. coli vs B. subtilis); diagonal pattern (E. coli vs S. typhimurium)]

Sequence logo
A very useful representation of the conservation
patterns is the so-called sequence logo.
This shows the conserved residues as larger
characters, where the total height of a column is
proportional to how conserved that position is.
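
The column heights of a sequence logo are usually derived from information content. As a rough sketch, assuming DNA (four letters), no small-sample correction, and gap characters treated like any other symbol, the total height of a column is 2 bits minus the Shannon entropy of that column, and each letter gets a share proportional to its frequency. Function names and the toy alignment are illustrative.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information content of one alignment column, in bits."""
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in Counter(column).values())
    return math.log2(alphabet_size) - entropy


def logo_heights(aligned_seqs):
    """For each column, map each residue to its letter height in bits."""
    heights = []
    for column in zip(*aligned_seqs):
        info = column_information(column)
        total = len(column)
        heights.append({res: (n / total) * info
                        for res, n in Counter(column).items()})
    return heights


sites = ["ACGT", "ACGA", "ACGC", "ACGG"]  # toy aligned sites
for position, letters in enumerate(logo_heights(sites), start=1):
    print(position, letters)
```

The first three columns are perfectly conserved and reach the full 2 bits, while the last column, where every base occurs once, has zero height.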

Applications of comparative genomics
1. Gene prediction
2. Regulatory region prediction
3. Interaction mapping

How do genome comparisons help?
• When comparing genomes of different species
– Genes normally have the same exon/intron structure (Neutral theory of evolution)
• Look for ORFs that are conserved in both genomes
• Frequently permits accurate identification of genes
– Fugu/human comparison: found >1000 genes that had been missed by annotation
– Mouse/human comparison indicates only 30,000 genes in genome

How do genome comparisons help? Cont.
• Comparison of the fruit fly genome with the human genome showed that about 60 percent of genes are conserved between fly and human.
• Virtually all (99%) of the protein-coding genes in humans align with homologs in mouse, and over 80% are clear 1:1 orthologs. In most cases, the intron-exon structures are highly conserved.
• The finding that the three wheat genomes have a highly similar gene content and order was the first demonstration of colinearity in the grasses and a pivotal finding in the development of comparative genomics.

Sequence comparison example
Comparison of the human and mouse spermidine synthase genes revealed an additional intron in the human gene that is not found in the mouse homologue.
[Figure labels: Human; Mouse; 5,500 bp]

Relationships among the Genomes of
Rice, Foxtail Millet, and Pearl Millet

CG helps genome annotations
In prokaryotes, finding genes is relatively
easy based on open reading frames (ORFs)
In eukaryotes, we have to look for ORFs,
exons, introns, splice sites, polyA sites
Difficulties:
• Predicted exons sometimes do not exist
• Pseudogenes
• Alternative splicing
Merit: In different species, the genes normally
have similar exon-intron structure
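
For the prokaryotic case, where genes can be found largely from open reading frames, a bare-bones ORF scan looks like this. It is a simplified sketch: it searches only the forward strand, uses ATG as the only start codon and the three standard stop codons, and the `min_codons` cutoff and the test sequence are arbitrary choices for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs; end is exclusive and includes the stop."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF in this frame
    return orfs


toy = "CCATGAAACCCGGGTAGCC"  # illustrative sequence
print(find_orfs(toy))
```

In eukaryotes this is not enough, for the reasons listed on the slide: introns interrupt the reading frame, so exon and splice-site models, or comparison with a related genome, are needed as well.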

Finding regulatory sequences
Regulatory sequences are difficult to identify using computer programs.
Problems are:
• Most enhancer sequences have yet to be identified
• They are usually short: 6-10 base pairs
• Those that are known are usually degenerate
– They can differ in one or more base pairs
– Still bind the cognate transcription factor
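
Degenerate sites are commonly written with IUPAC ambiguity codes, and a short consensus can be searched for by translating it into a regular expression. The sketch below is illustrative: the consensus `CACRTG` and the test sequence are made-up examples, and only the forward strand is scanned.

```python
import re

# IUPAC nucleotide ambiguity codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_motif(seq, motif):
    """Return 0-based start positions of a degenerate motif, overlaps included."""
    pattern = "".join(IUPAC[ch] for ch in motif.upper())
    # A lookahead lets overlapping occurrences be reported as well.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", seq.upper())]


print(find_motif("TTCACGTGAACACATGTT", "CACRTG"))
```

Because such motifs are only 6-10 base pairs long, matches like these occur by chance all over a genome, which is why conservation across species is used as an additional filter.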

Comparisons to identify regulatory elements
Comparisons of genomes of different species can identify cis-regulatory elements. (Neutral theory of evolution)
Changes in intergenic regions and introns are usually more rapid than in coding regions.
Nevertheless, regulatory elements tend to be conserved (because these sequences bind TFs).
Conserved intergenic sequences identified by aligning genomic regions of orthologs are called “phylogenetic footprints” (by analogy with DNase footprinting).
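
A crude version of a footprint scan slides a fixed window along a pairwise alignment of orthologous non-coding regions and keeps the windows whose identity exceeds a threshold. This is only a sketch of the idea; the window length, the threshold, the function name and the toy alignment are all assumptions, and gap columns are simply counted as mismatches.

```python
def conserved_windows(aligned_a, aligned_b, window=10, min_identity=0.9):
    """Return (start, end, identity) for alignment windows above the identity threshold."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # True where both sequences carry the same base; gap columns never match.
    match = [x == y and x != "-"
             for x, y in zip(aligned_a.upper(), aligned_b.upper())]
    hits = []
    for start in range(len(match) - window + 1):
        identity = sum(match[start:start + window]) / window
        if identity >= min_identity:
            hits.append((start, start + window, identity))
    return hits


# Toy alignment: a conserved 10-bp block followed by fully diverged sequence.
species_1 = "ACGTACGTAC" + "TTTTTTTTTT"
species_2 = "ACGTACGTAC" + "GGGGGGGGGG"
print(conserved_windows(species_1, species_2, window=10, min_identity=1.0))
```

Overlapping windows that pass the threshold would normally be merged into a single candidate element before further analysis.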

Interaction mapping
• A remarkable use of comparative genomics is to identify interacting proteins.
• Protein-protein interactions are critical for cellular functions like
– Transfer of information in a genetic pathway
– Scaffolding to tether other proteins
– Enzymatic reactions (multi-subunit enzymes)
– Large molecular machines such as motors

Rosetta Stone
• Interacting proteins are encoded by a single gene in some species, whereas in other species the same proteins are encoded by two genes.
• A systematic search through sequenced genomes for these relationships should identify proteins that interact.
• This method is called the “Rosetta Stone” approach.
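
The search can be sketched as a scan for "fused" proteins whose separate, non-overlapping regions match two different proteins of another genome. This is a minimal illustration under stated assumptions: the similarity hits are given as (fused protein, start, end, partner) tuples, the coordinates below are invented for the example, and the function name is hypothetical.

```python
def rosetta_stone_pairs(hits, max_overlap=0):
    """Predict interacting partners from hits onto distinct regions of one fused protein."""
    by_fused = {}
    for fused, start, end, partner in hits:
        by_fused.setdefault(fused, []).append((start, end, partner))
    pairs = set()
    for fused, regions in by_fused.items():
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                s1, e1, p1 = regions[i]
                s2, e2, p2 = regions[j]
                overlap = min(e1, e2) - max(s1, s2)  # negative when regions are disjoint
                if p1 != p2 and overlap <= max_overlap:
                    pairs.add((fused,) + tuple(sorted((p1, p2))))
    return sorted(pairs)


# Illustrative coordinates only: two E. coli proteins matching different
# parts of a single yeast protein.
hits = [
    ("yeast_topoisomerase_II", 1, 650, "gyrB"),
    ("yeast_topoisomerase_II", 700, 1400, "gyrA"),
]
print(rosetta_stone_pairs(hits))
```

Requiring the matched regions not to overlap is what distinguishes a genuine fusion from two paralogs that both match the same domain.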

Rosetta Stone example
• The equivalent of the yeast protein topoisomerase II in E. coli is encoded by two genes: gyrase A and gyrase B.
• This suggests that gyrase B and gyrase A interact in E. coli.

1. Identification and mapping of leucine-rich repeat resistance gene analogs in Bermuda grass
• 31 Bermuda grass (Cynodon spp.) disease resistance gene analogs (BRGA) were cloned and sequenced from diploid, triploid, tetraploid and hexaploid Bermuda grass using degenerate primers to target the nucleotide binding site (NBS) of the NBS-leucine rich repeat (LRR) resistance family.
(Harris et al., (2010) J.Amer.Soc.135:74-82)

2. Synteny between the centromeric regions of wheat and rice
• It was recently discovered that rice centromeres contain genes. This helps in studying centromere homologies between wheat and rice chromosomes by mapping rice centromeric regions onto wheat aneuploid stocks.
• Genome-wide comparison of wheat ESTs that were mapped to centromeric regions against rice genome sequences revealed high conservation and one-to-one correspondence of centromeric regions between wheat and rice chromosome pairs W1-R5, W2-R7, W3-R1, W5-R12, W6-R2 and W7-R8.
(Qi et al., (2009) Genetics 183:1235-1247)

3. Sequencing and comparative analysis of a conserved syntenic segment (CSS) in the Solanaceae
• Wang et al. reported the generation and analysis of sequences for an unduplicated conserved syntenic segment (CSS) in the genomes of five members of the Solanaceae.
• This analysis indicates 30 million years of plant evolution in the absence of polyploidization.
• The sequenced segments of the potato, tomato, pepper, eggplant, and petunia genomes are shown alongside corresponding regions of the Arabidopsis (At) genome.
(Wang et al., (2008) Genetics 180:391-408)

Conserved syntenic segment (CSS) in five species of Solanaceae

4. Comparative physical mapping of rice
A comparative physical map between Oryza sativa (AA genome type) and Oryza punctata (BB genome type) was constructed by aligning the physical map of O. punctata onto the O. sativa genome sequence.
The level of conservation of each genome between the two species was determined.
The alignment suggests more divergence of intergenic and repeat regions in comparison to gene-rich regions.
The genome of O. punctata was 8% larger than that of O. sativa, with individual chromosome differences of 1.5 to 16.5%.
(Kim et al., (2007) Genetics:379-390)

Alignment view of the comparative physical map of O. punctata
(BB genome type) and O. sativa (AA genome type) using SyMap

What is the difference between man and ape?
Man and chimpanzee have a genome-wide similarity of greater than 95%.
What accounts for the differences between species?
A recent study suggests that it is due to specific gene expression differences.
– Striking differences found only in brain

Human/ape gene expression comparisons
[Figure panels: blood tissue; brain tissue; liver tissue]

CASE STUDY

Objective of the research
Classification and phylogenetic analysis of phytohormone-related genes, from metabolism enzymes to receptors and signaling components, in different species.

Abstract of the work
Genetic and molecular studies in the model organism Arabidopsis thaliana have revealed the individual pathways of various plant hormone responses.
The authors selected 479 genes that were convincingly associated with various hormone actions.
Using these 479 genes as queries, a genome-wide search for their orthologues in several species (microorganisms, plants and animals) was performed.
Meanwhile, a comparative analysis was conducted to evaluate their evolutionary relationship.

Result and discussion
Phylogenetic tree generated from orthologue genes, using orthologue gene similarity as compared to A. thaliana hormone-related genes
[Figure: protein sequence phylogenetic tree]

Distribution of orthologues in function category of hormone-related genes in different species
The height of each bar, shown in a different color, represents the percentage of orthologue genes in AHRG of selected plants as compared to that of A. thaliana.
Blue - orthologues belonging to hormone metabolism related genes;
Purple - orthologues belonging to hormone transport genes;
Light yellow - orthologues belonging to genes related to signal transduction.

Comparison of the copy numbers of AHRG orthologues in cereals –
Rice, S. bicolor, P. trichocarpa, and A. thaliana.
Different colors represent ratios that are calculated by the number of
orthologue genes in selected species versus the number of AHRG in
A. thaliana.

Conclusion
The metabolism and functions of plant hormones are generally more sophisticated and diversified in higher plant species.
In particular, several phytohormone receptors and key signaling components were not present in lower plants or animals.
Meanwhile, as genome complexity increases, the orthologue genes tend to have more copies and probably gain more diverse functions.

CASE STUDY - 2

Abstract of the research
Plant disease resistance (R) loci frequently lack synteny between related species of cereals and crucifers but appear to be positionally well conserved in the Solanaceae.
In this report, a local RGA approach is adopted using genomic information from the model Solanaceous plant tomato to isolate R3a, a potato gene that confers race-specific resistance to the late blight pathogen Phytophthora infestans.

Results and discussion
The genomic regions harboring the R3 late blight resistance locus in potato and the I2 Fusarium wilt resistance locus in tomato are colinear (Huang et al., 2004).
A cluster of I2 gene analogues (I2GAs) was identified in potato.
This potato I2GA cluster positionally corresponds to the SL8D cluster of the I2 complex locus in tomato and was therefore named the St-I2 cluster.

Results cont.
R3a candidates were identified using a local resistance gene analogue (RGA) approach.
The syntenic relationships of R gene clusters are highlighted using gray rectangles.
To identify I2GAs physically close to R3a, an association analysis on bacterial artificial chromosome (BAC) pools was conducted.
[Figure: Comparative genetic maps of the I2 complex locus in tomato and the R3 complex locus in potato.]
Results cont.
In vitro inoculation of the primary transformants of R3a. Massive sporulation (S) and localized hypersensitive reactions (HR) are observed on compatible and incompatible interactions, respectively.

Conclusion
In this study, genomic information from tomato was used to isolate the potato late blight resistance gene R3a from an ancient locus involved in plant innate immunity in the Solanaceae.
Comparative analyses of the R3 complex locus with the corresponding I2 complex locus in tomato suggest that this is an ancient locus involved in plant innate immunity against oomycetes and fungal pathogens.
However, the R3 complex locus has evolved after divergence from tomato and the locus has experienced a significant expansion in potato without disruption of the flanking colinearity.

LIMITATIONS OF CG
Homologous genes are relatively well preserved, while noncoding regions tend to show varying degrees of conservation.
Cross-species comparative genomics is influenced by the evolutionary distance of the compared species.
Genetic drift: how can we tell which differences are really due to selection and important to organism function, and which are a result of genetic drift?
Computationally intensive: large amounts of data are being compared, and tools to process and compare genomes are still being developed.
In order for the comparisons to be statistically relevant, many more genomes will need to be sequenced.

The goal of comparative genomics
Due to our current ability
to annotate genomes we
can precisely place a list
of genes on the
chromosome, resulting
in something similar to a
set of lights of uniform
color and intensity.
We will be able to tell
what each gene does
and when and where it is
expressed.

Issues for the future
• Faster/better algorithms for aligning vertebrate
genomes
• Multiple alignments
– Comparing several species can give clues to which
regulatory sequences are of a basic nature, and
which are lineage specific
• Cataloguing of comparative data
• Better visualisation
– Whole syntenic region <> nucleotide level
– Multiple genome sequences
