This document provides an overview of comparative genomics. It defines comparative genomics as combining genomic data and evolutionary biology to study genome structure, evolution and function. It discusses three levels of genome comparison: bulk properties like chromosome size and number, whole genome sequence similarity and organization, and functional genome features. The history of experimental comparative genomics is reviewed, noting that practical comparisons predated widespread genome sequencing.
2. Part 1
• What is comparative genomics?
• Levels of genome comparison
  - bulk, whole sequence, features
• A Brief History of Comparative Genomics
  - experimental comparative genomics
• Computational Comparative Genomics
• Bulk properties
• Whole genome comparisons
• Part 2
• Genome feature comparisons
3. What is Comparative Genomics?
The combination of genomic data and comparative and evolutionary biology to address questions of genome structure, evolution and function.
4. What is Comparative Genomics?
“Nothing in biology makes sense, except in the light of evolution” (Theodosius Dobzhansky)
5. Why Comparative Genomics?
• Genomes describe heritable characteristics
• Related organisms share ancestral genomes
• Functional elements encoded in genomes are common to related organisms
• Functional understanding of model systems (E. coli, A. thaliana, D. melanogaster) can be transferred to non-model systems on the basis of genome comparisons
• Genome comparisons can be informative, even for distantly-related organisms
6. Why Comparative Genomics?
• BUT:
  - Context: epigenetics, tissue differentiation, mesoscale systems, etc.
  - Phenotypic plasticity: responses to temperature, stress, environment, etc.
7. Why Comparative Genomics?
• Genomic differences can underpin phenotypic (morphological or physiological) differences.
• Where phenotypes or other organism-level properties are known, comparison of genomes may give mechanistic or functional insight into differences (e.g. GWAS).
• Genome comparisons aid identification of functional elements on the genome.
• Studying genomic changes reveals evolutionary processes and constraints.
8. Why Comparative Genomics?
Adapted from Hardison (2003) PLoS Biol. doi:10.1371/journal.pbio.0000058
[Figure labels: species, time, contemporary organisms]
• Comparison within species (e.g. isolate-level, or even within individuals): which genome features may account for unique characteristics of organisms/tumours? Epigenetics in an individual.
9. Why Comparative Genomics?
[Figure labels: genus, time, contemporary organisms]
• Comparison within genus (e.g. species-level): what genome features show evidence of selective pressure, and in which species?
10. Why Comparative Genomics?
[Figure labels: subgroup, time, contemporary organisms]
• Comparison within subgroup (e.g. genus-level): what is the core set of genome features that defines a subgroup or genus?
11. The E. coli long-term evolution experiment
• Run by the Lenski lab, Michigan State University, since 1988
• http://myxo.css.msu.edu/ecoli/
• 12 flasks, citrate usage selection
• 50,000 generations of Escherichia coli!
• Cultures propagated every day
• Every 500 generations (75 days), mixed-population samples stored
• Mean fitness estimated at 500 generation intervals
Jeong et al. (2009) J. Mol. Biol. doi:10.1016/j.jmb.2009.09.052
Barrick et al. (2009) Nature doi:10.1038/nature08480
Wiser et al. (2013) Science. doi:10.1126/science.1243357
12. Comparative Genomics in the News
Sankararaman et al. (2014) Nature. doi:10.1038/nature12961
• Neanderthal alleles:
  - Aid adaptation outwith Africa
  - Associated with disease risk
  - Reduce male fertility
13. Levels of Genome Comparison
Genomes are complex, and can be compared on a range of conceptual levels, both practically and in silico.
14. Three broad levels of comparison
• Bulk Properties
  - chromosome/plasmid counts and sizes,
  - nucleotide content, etc.
• Whole Genome Sequence
  - sequence similarity
  - organisation of genomic regions (synteny), etc.
• Genome Features/Functional Components
  - numbers and types of features (genes, ncRNA, regulatory elements, etc.)
  - organisation of features (synteny, operons, regulons, etc.)
  - complements of features
  - selection pressure, etc.
15. A Brief History of Experimental Comparative Genomics
You don’t have to sequence genomes to compare them (but it helps).
16. Genome Comparisons Predate NGS
• Sequence data was not always cheap and abundant
• Practical, experimental genome comparisons were needed
17. Bulk Genome Property Comparisons
Values calculated for individual genomes, and subsequently compared.
19. Chromosome Counts/Size
• The chromosome counts/ploidy of organisms can vary widely
  - Escherichia coli: 1 (but plasmids…)
  - Rice (Oryza sativa): 24 (but mitochondria, plastids etc…)
  - Human (Homo sapiens): 46, diploid
  - Adder's-tongue (Ophioglossum reticulatum): up to 1260
  - Domestic (but not wild) wheat: somatic cells hexaploid, gametes haploid
• Physical genome size (related to sequence length) can also vary greatly
• Genome size and chromosome count do not indicate organism ‘complexity’
• Still surprises to be found in physical study of chromosomes! (e.g. Hi-C)
Kamisugi et al. (1993) Chromosome Res. 1(3): 189-96
Wang et al. (2013) Nature Rev Genet. doi:10.1038/nrg3375
20. Nucleotide Content
• Experimental approaches for accurate measurement
  - e.g. use radiolabelled monophosphates, calculate proportions using chromatography
Karl (1980) Microbiol. Rev. 44(4) 739-796
Krane et al. (1991) Nucl. Acids Res. doi:10.1093/nar/19.19.5181
22. Whole Genome Comparisons
• Requires two genomes: “reference” and “comparator”
• Experiment produces a comparative result, dependent on the choice of genomes
• Methods mostly based around direct or indirect DNA hybridisation
  - DNA-DNA hybridisation
  - Comparative Genomic Hybridisation (CGH)
  - Array Comparative Genomic Hybridisation (aCGH)
23. DNA-DNA Hybridisation (DDH)
• Several methods based around the same principle
  1. Denature a mixture of genomic DNA from organisms A and B
  2. Allow to anneal: hybrids result (reassociation ≈ similarity)
Rosselló-Mora & Amann (2001) FEMS Microbiol. Rev. doi:10.1016/S0168-6445(00)00040-1
25. DNA-DNA Hybridisation (DDH)
• Used for taxonomic classification in prokaryotes from the 1960s
• Sibley & Ahlquist redefined bird and primate phylogeny with DDH in the 1980s: Homo shares a more recent common ancestor with Pan than with Gorilla (this was previously in dispute)
Sibley & Ahlquist (1984) J. Mol. Evol. doi:10.1007/BF02101980
26. Comparative Genomic Hybridisation
• Two genomes, “reference” and “test”, are labelled (red and green: a bad convention to choose, for visualisation), then hybridised against a third “normal” genome
• Differences in red/green intensity mapped by microscopy correspond to the relative relationship of reference and test to the “normal” genome
• Comparisons within species (or individual, for tumours); copy number variations (CNV)
• Labour-intensive, low-resolution
27. Comparative Genomic Hybridisation
• Image analysis required: intensity along medial axis.
Kallioniemi et al. (1992) Science doi:10.1126/science.1359641
Fraga et al. (2005) Proc. Natl. Acad. Sci. USA doi:10.1073/pnas.0500398102
Epigenetics: hybridising methylated DNA
28. Array Comparative Genomic Hybridisation
• Uses DNA microarrays: thousands of short DNA probes (genome fragments) immobilised on a surface
• gDNA, cDNA, etc. fluorescently labelled and hybridised to the array
• Smaller sample sizes cf. CGH, automatable, high-throughput, high-res
• Identifies copy number variation (CNV) and segmental duplication
Pollack et al. (1999) Nat. Genet. doi:10.1038/12640
30. Chromosomal Rearrangements
• Genomes are dynamic, and undergo large-scale changes
• Hybridisation used to map genome rearrangement/duplication
  - Separate chromosomes electrophoretically
  - Apply single gene hybridising probes
  - Reciprocal hybridisations indicate translocations
Fischer et al. (2000) Nature. doi:10.1038/35013058
31. Diagnostic PCR/MLST
• Define a set of regions (usually genes):
  - conserved enough that PCR primers can be designed to amplify the same region in multiple organisms
• and:
  - divergent enough that hybridising probes can distinguish between groups
• or:
  - sequence the amplification products
  - Sequence variants given numbers
  - Number profiles define groups
  - Track evolution by minimum spanning trees (MST)
• http://pubmlst.org/
Maiden et al. (2006) Ann. Rev. Microbiol. doi:10.1146/annurev.micro.59.030804.121325
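The "sequence variants given numbers, number profiles define groups" idea can be sketched in a few lines of Python. This is an illustrative toy, not the PubMLST scheme: the function names, locus names and sequences are invented for the example, and allele numbers are simply assigned in order of first appearance.

```python
from typing import Dict, List, Tuple


def allele_profiles(isolates: Dict[str, Dict[str, str]],
                    loci: List[str]) -> Dict[str, Tuple[int, ...]]:
    """Give each distinct sequence at each locus a number, then
    describe every isolate by its tuple of allele numbers."""
    allele_ids = {locus: {} for locus in loci}  # locus -> {sequence: number}
    profiles = {}
    for name, seqs in isolates.items():
        profile = []
        for locus in loci:
            known = allele_ids[locus]
            seq = seqs[locus]
            if seq not in known:
                # New variant: assign the next unused number (hypothetical scheme)
                known[seq] = len(known) + 1
            profile.append(known[seq])
        profiles[name] = tuple(profile)
    return profiles


def group_by_profile(profiles: Dict[str, Tuple[int, ...]]) -> Dict[Tuple[int, ...], List[str]]:
    """Isolates sharing an identical number profile fall in one group."""
    groups: Dict[Tuple[int, ...], List[str]] = {}
    for name, profile in profiles.items():
        groups.setdefault(profile, []).append(name)
    return groups


# Toy data: three isolates typed at two made-up loci
isolates = {
    "iso1": {"geneA": "ACGT", "geneB": "TTGA"},
    "iso2": {"geneA": "ACGT", "geneB": "TTGA"},
    "iso3": {"geneA": "ACGA", "geneB": "TTGA"},
}
profiles = allele_profiles(isolates, ["geneA", "geneB"])
groups = group_by_profile(profiles)
```

Here iso1 and iso2 share a profile, while iso3 differs at one locus; counting differing loci between profiles is the kind of distance a minimum spanning tree could then be built on.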
32. Array Comparative Genomic Hybridisation
• aCGH can also be applied across species for classification/diagnostics:
  - Microarray probes represent genes from one or more organisms
  - “Off-species” gDNA fragmented, labelled, and hybridised
  - Hybridisation ≈ sequence similarity ≈ gene presence
• Heatmap of 217 Staphylococcus aureus isolates on 7-strain array.
  - columns = isolates
  - yellow/red = gene present
  - blue/white/grey = gene absent
• Lower bars coloured by lineage and host (green = cattle, blue = horse, purple = human)
Sung et al. (2008) Microbiol. doi:10.1099/mic.0.2007/015289-0
34. …And Then It Rained Sequence Data
• Modern high-throughput sequencing (454, Illumina) completely changed the landscape.
• Complete, (mainly) accurate sequence data much cheaper, enabling:
  - more precise sequence comparison
  - novel analyses, insights and visualisations
• Genomic & exomic comparisons
• 19/2/2014 at GOLD:
  - 3,011 “finished” genomes
  - 9,891 “permanent draft” genomes
• 19/2/2014 at NCBI WGS:
  - 17,023 whole genome projects
35. …And Then It Rained Sequence Data
• In 2012, GOLD added 3736 genomes, NCBI added 4585
• Mostly prokaryotes (archaea and bacteria)
• We’re a little ahead of Su’s (Scripps, La Jolla) projections
Figures and code from: http://sulab.org/2013/06/sequenced-genomes-per-year/
37. Three broad levels of comparison
• Bulk Properties
  - chromosome/plasmid counts and sizes,
  - nucleotide content, etc.
• Whole Genome Sequence
  - sequence similarity
  - organisation of genomic regions (rearrangements), etc.
• Genome Features/Functional Components
  - numbers and types of features (genes, ncRNA, regulatory elements, etc.)
  - organisation of features (synteny, operons, regulons, etc.)
  - complements of features
  - selection pressure, etc.
38. Bulk Genome Property Comparisons
Values calculated for individual genomes, and subsequently compared.
39. Nucleotide Frequencies/Genome Size
• Very easy to calculate from complete or draft genome sequence
  - (or in a region of genome sequence)
• GC content/chromosome size can be characteristic of an organism
• [ACTIVITY]
  - bacteria_size_gc iPython notebook
  - ipython notebook --pylab inline in bacteria_size directory
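To show how little is needed to compute these bulk properties from sequence, here is a minimal pure-Python sketch. It is not the code from the bacteria_size_gc notebook; the function names and the tiny FASTA example are invented for illustration, and ambiguous bases (e.g. N) are simply excluded from the GC calculation.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def gc_fraction(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def genome_summary(records):
    """Total sequence length and overall GC fraction across all records."""
    total = sum(len(s) for s in records.values())
    return total, gc_fraction("".join(records.values()))


# Made-up two-record "genome" (one chromosome, one plasmid)
example = """>chromosome
GGCCAATT
GGCC
>plasmid
ATAT
"""
records = read_fasta(example)
size, gc = genome_summary(records)
```

Running the same two functions over many genomes gives the size and GC values that can then be plotted against each other.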
40. Blobology
• Metazoan sequence data can be contaminated by microbial symbionts.
• Host and symbiont DNA have different %GC (and are present in different amounts/coverage)
• Preliminary genome assembly, followed by read mapping
• Plot contig coverage against %GC = Blobology
• http://nematodes.org/bioinformatics/blobology/
Kumar & Blaxter (2011) Symbiosis doi:10.1007/s13199-012-0154-6
41. Nucleotide k-mers
• Sequence data is required to determine k-mers
• Nucleotide frequencies:
  - A, C, G, T
• Dinucleotide frequencies:
  - AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT
• Trinucleotide frequencies:
  - 64 trinucleotides
• k-nucleotide frequencies:
  - 4^k k-mers
• [ACTIVITY]
  - runApp("shiny/nucleotide_frequencies") in RStudio
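As a rough illustration of k-nucleotide frequencies (separate from the Shiny activity above), the following Python sketch counts overlapping k-mers with a sliding window and reports a frequency for each of the 4^k possible k-mers. The function names are invented for this example; windows containing anything other than A, C, G or T are skipped, which is an assumption rather than a standard.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count overlapping k-mers, ignoring windows with ambiguous bases."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_frequencies(seq, k):
    """Relative frequency of every one of the 4**k possible k-mers."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product("ACGT", repeat=k)
    }


# Dinucleotide frequencies of a short toy sequence (7 overlapping windows)
di = kmer_frequencies("ACGTACGT", 2)
# All 64 trinucleotides are listed, even those that never occur
tri = kmer_frequencies("ACGT", 3)
```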
42. k-mer Spectra
• k-mer spectrum:
  - Frequency distribution of observed k-mer counts
  - Most species have a unimodal k-mer spectrum
Chor et al. (2009) Genome Biol. doi:10.1186/gb-2009-10-10-r108
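A k-mer spectrum in the sense used above (how many distinct k-mers occur once, twice, three times, and so on) can be sketched as a count of counts. This is a minimal illustration with an invented function name, not the procedure used by Chor et al., and the toy sequence is far too short to show a meaningful mode.

```python
from collections import Counter


def kmer_spectrum(seq, k):
    """Map each abundance value to the number of distinct k-mers
    observed exactly that many times."""
    seq = seq.upper()
    # First pass: how often each k-mer occurs
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Second pass: how many k-mers share each occurrence count
    return Counter(counts.values())


# AC, CG and GT each occur twice, TA occurs once
spectrum = kmer_spectrum("ACGTACGT", 2)
for abundance in sorted(spectrum):
    print(abundance, spectrum[abundance])
```

Plotting abundance against number of k-mers for a whole genome is what reveals whether the spectrum is unimodal or multimodal.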
43. k-mer Spectra
• k-mer spectrum:
  - All mammals tested (and some other species) have a multimodal k-mer spectrum
  - Genomic regions differ in this property
Chor et al. (2009) Genome Biol. doi:10.1186/gb-2009-10-10-r108
44. Average Nucleotide Identity (ANI)
• ANI introduced as a substitute for DDH in 2007:
  - 70% identity (DDH) = “gold standard” prokaryotic species boundary
  - 70% identity (DDH) ≈ 95% identity (ANI)
Goris et al. (2007) Int. J. Syst. Evol. Microbiol. doi:10.1099/ijs.0.64483-0
45. Average Nucleotide Identity (ANI)
• ANI introduced as a substitute for DDH in 2007:
  - 70% identity (DDH) = “gold standard” prokaryotic species boundary
  - 70% identity (DDH) ≈ 95% identity (ANI)
• Original method emulates physical experiment:
  1. break genome into 1020 nt fragments
  2. align fragments using BLASTN
  3. ANI = mean identity of all BLASTN matches with >30% identity over 70% alignable length
Goris et al. (2007) Int. J. Syst. Evol. Microbiol. doi:10.1099/ijs.0.64483-0
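The three steps above can be sketched in Python, with the BLASTN step left out: the sketch assumes alignment results are already available as simple dictionaries. The field names (`identity`, `aligned_length`, `fragment_length`), the function names and the example hits are all hypothetical, and reading "over 70% alignable length" as "aligned length at least 70% of the fragment" is an interpretation of the slide, not a statement of the published method's exact filter.

```python
def fragment(seq, size=1020):
    """Step 1: cut a genome sequence into consecutive fragments."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def ani_from_hits(hits, min_identity=30.0, min_coverage=0.70):
    """Step 3: mean percentage identity over hits passing both filters.

    Each hit is a dict with hypothetical keys:
      identity        - percentage identity of the best match
      aligned_length  - number of aligned bases
      fragment_length - length of the query fragment
    Returns None when no hit passes the filters.
    """
    kept = [
        h["identity"]
        for h in hits
        if h["identity"] > min_identity
        and h["aligned_length"] / h["fragment_length"] >= min_coverage
    ]
    return sum(kept) / len(kept) if kept else None


# A 2500 nt toy sequence gives two full fragments and one short remainder
fragments = fragment("A" * 2500)

# Invented alignment results standing in for step 2 (BLASTN output)
hits = [
    {"identity": 98.0, "aligned_length": 1000, "fragment_length": 1020},
    {"identity": 96.0, "aligned_length": 1020, "fragment_length": 1020},
    {"identity": 99.0, "aligned_length": 300, "fragment_length": 1020},   # too short
    {"identity": 25.0, "aligned_length": 1020, "fragment_length": 1020},  # too divergent
]
ani = ani_from_hits(hits)
```

Only the first two hits survive the filters, so the short but near-identical match does not inflate the estimate.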
46. Average Nucleotide Identity (ANI)
• ANI introduced as a substitute for DDH in 2007:
  - 70% identity (DDH) = “gold standard” prokaryotic species boundary
  - 70% identity (DDH) ≈ 95% identity (ANI)
• ANIm and TETRA introduced (2009)
  - ANIm:
    1. Align sequences using NUCmer
    2. ANI = mean %identity of matches
  - TETRA:
    1. Calculate tetranucleotide frequencies
    2. Determine each tetramer's deviation from expectation (Z-score)
    3. TETRA = Pearson correlation coefficient of tetramer Z-scores
Richter & Rosselló-Móra (2009) Proc. Natl. Acad. Sci. USA doi:10.1073/pnas.0906412106
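The TETRA steps above can be sketched in pure Python. The slide does not say how the "expectation" is computed; this sketch assumes a Markov-chain estimate in which the expected count of a tetramer n1n2n3n4 is N(n1n2n3)·N(n2n3n4)/N(n2n3), with a matching variance approximation. Treat that formula, the function names and the single-strand counting as assumptions for illustration, not as the reference implementation.

```python
import math
import random
from collections import Counter
from itertools import product


def count_kmers(seq, k):
    """Count overlapping k-mers on one strand."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def tetra_zscores(seq):
    """Z-score for each of the 256 tetramers (assumed Markov expectation)."""
    seq = seq.upper()
    c2, c3, c4 = (count_kmers(seq, k) for k in (2, 3, 4))
    zscores = []
    for tet in ("".join(p) for p in product("ACGT", repeat=4)):
        n_mid = c2[tet[1:3]]
        n_left, n_right = c3[tet[:3]], c3[tet[1:]]
        if n_mid == 0:
            zscores.append(0.0)
            continue
        expected = n_left * n_right / n_mid
        variance = expected * (n_mid - n_left) * (n_mid - n_right) / n_mid ** 2
        zscores.append((c4[tet] - expected) / math.sqrt(variance) if variance > 0 else 0.0)
    return zscores


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def tetra(seq_a, seq_b):
    """TETRA score: correlation of the two genomes' tetramer Z-scores."""
    return pearson(tetra_zscores(seq_a), tetra_zscores(seq_b))


# Two random toy "genomes" (seeded so the example is repeatable)
rng = random.Random(0)
seq_a = "".join(rng.choice("ACGT") for _ in range(5000))
seq_b = "".join(rng.choice("ACGT") for _ in range(5000))
self_score = tetra(seq_a, seq_a)
cross_score = tetra(seq_a, seq_b)
```

A genome compared with itself scores 1; because the score depends only on composition, no alignment is needed.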
47. Average Nucleotide Identity (ANI)
• ANIb (the BLASTN-based method) discards useful information that ANIm retains
• TETRA reflects bulk genome properties rather than selection on sequence
• Data for Anaplasma marginale (3), A. phagocytophilum (4), A. centrale (1)
• TETRA scores are prone to false positives; ANIb scores are prone to false negatives
49. Diagnostic PCR/MLST
• PCR/MLST still cheap
  - (but for how much longer?)
• Use whole genomes to identify unique/diagnostic regions for PCR/MLST
Slezak et al. (2003) Brief. Bioinf. doi:10.1093/bib/4.2.133
Pritchard et al. (2012) PLoS One doi:10.1371/journal.pone.0034498
50. Whole Genome Sequence Comparisons
Comparisons of one whole or draft genome sequence with another (or many others)
52. Whole Genome Alignment
• Which genomes should you align? (or not bother aligning)
• For reasonable analysis, genomes should:
  - derive from a sufficiently recent common ancestor, so that homologous regions can be identified.
  - derive from a sufficiently distant common ancestor, so that sufficiently “interesting” changes are likely to have occurred
  - help answer your biological question:
    - is your question organism or phenotype specific?
    - are you investigating a process?
• This may be more involved for metazoans (vertebrates, arthropods, nematodes, etc.) than prokaryotes…
53. Whole Genome Alignment
• Naïve alignment algorithms (e.g. Needleman-Wunsch/Smith-Waterman) are not appropriate:
  - Do not handle rearrangements
  - Computationally expensive on large sequences
• Many whole-genome alignment algorithms proposed, including:
  - LASTZ (http://www.bx.psu.edu/~rsharris/lastz/)
  - BLAT (http://genome.ucsc.edu/goldenPath/help/blatSpec.html)
  - Mugsy (http://mugsy.sourceforge.net/)
  - megaBLAST (http://www.ncbi.nlm.nih.gov/blast/html/megablast.html)
  - MUMmer (http://mummer.sourceforge.net/)
  - LAGAN (http://lagan.stanford.edu/lagan_web/index.shtml)
  - WABA, etc…
54. Whole Genome Alignment
• BLAT
  - BLAT is broadly similar to BLAST
  - Main differences:
    - optimised to find only exact or near-exact matches, for speed
    - indexes the subject genome, retains the index and scans the query
    - connects homologous match regions into a single alignment (BLAST reports them separately)
    - reports mRNA match intron-exon boundaries exactly (BLAST tends to extend)
  - Advantages: fast; exact exon boundaries; UCSC integration
  - Disadvantages: does not find more remote/very divergent matches
Kent (2002) Genome Res. doi:10.1101/gr.229202
55. Whole Genome Alignment
• megaBLAST
  - Optimised for speed over BLASTN (see http://www.ncbi.nlm.nih.gov/blast/Why.shtml):
    - genome-level searches
    - queries on large sequence sets
    - long alignments of very similar sequence (sequencing errors/SNPs)
  - Uses Zhang et al. (2000) greedy algorithm
  - Concatenates queries to improve performance (“query packing”)
    - NOTE: this is good practice for large query sets!
  - Two modes: megaBLAST, and discontinuous megaBLAST (dc-megablast)
    - dc-megablast intended for more divergent sequences
Zhang et al. (2000) J. Comp. Biol. 7(1-2) 203-14
Korf et al. (2003) “BLAST”, O’Reilly & Associates, Sebastopol, CA
56. Whole Genome Alignment
• MUMmer
  - Uses suffix trees for pattern matching: very fast even for large sequences
    - Finds maximal exact matches
    - Memory use depends only on reference sequence size
Kurtz et al. (2004) Genome Biol. doi:10.1186/gb-2004-5-2-r12
57. Whole Genome Alignment
• MUMmer
  - Uses suffix trees for pattern matching: very fast even for large sequences
    - Finds maximal exact matches
    - Memory use depends only on reference sequence size
• Suffix Tree:
  - Can be constructed and searched in O(n) time
  - Useful algorithms are nontrivial
• BANANA$
  - B followed by ANANA$ only
  - A followed by $, NA$, NANA$
  - N followed by A$, ANA$
Kurtz et al. (2004) Genome Biol. doi:10.1186/gb-2004-5-2-r12
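The BANANA$ example can be reproduced with a deliberately naive Python sketch that lists every suffix and groups the continuations by first character. This is not a suffix tree and not how MUMmer works internally: it takes quadratic time and space, whereas the linear-time constructions mentioned above are considerably more involved. The function names are invented for illustration.

```python
def suffix_branches(text):
    """Group what follows each character across all suffixes of text.

    Equivalent to reading off the first level of branches below the
    root of a suffix tree, but built the slow, obvious way.
    """
    branches = {}
    for i in range(len(text)):
        branches.setdefault(text[i], []).append(text[i + 1:])
    return {first: sorted(rest) for first, rest in branches.items()}


def occurs(text, pattern):
    """A pattern occurs in text exactly when it is a prefix of some suffix."""
    return any(text[i:].startswith(pattern) for i in range(len(text)))


branches = suffix_branches("BANANA$")
for first in sorted(branches):
    print(repr(first), "followed by", branches[first])
```

The printed groups match the slide: B is followed only by ANANA$, A by $, NA$ or NANA$, and N by A$ or ANA$.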
58. Whole Genome Alignment
• MUMmer
  - Process:
    1) Identify a non-overlapping subset of maximal exact matches: often Maximal Unique Matches (MUMs, though not always unique)
    2) Cluster into alignment anchors
    3) Extend between anchors to produce a final gapped alignment
  - Very flexible approach: a suite of programs (mummer, nucmer, promer, …)
    - nucleotide and “conceptual protein” (more sensitive) alignments
    - used for genome comparisons, assembly scaffolding, repeat detection, etc.
    - forms the basis for other aligners/assemblers, e.g. Mugsy, AMOS
Kurtz et al. (2004) Genome Biol. doi:10.1186/gb-2004-5-2-r12
61. Multiple Genome Alignment
• LAGAN: rapid alignment of two homologous genome sequences
  - Generate local alignments (anchors, B)
  - Construct rough global map (maximal-scoring ordered subset, C)
    - Join anchors that lie within a threshold distance, the same way
  - Compute global alignment by dynamic programming (D)
Brudno et al. (2003) Genome Res. doi:10.1101/gr.926603
62. Multiple Genome Alignment
• MLAGAN: multiple genome alignment of k genomes in k-1 alignment steps, using a phylogenetic tree (CLUSTAL-like):
  - Make rough global maps between each pair of sequences (step C in LAGAN)
  - Progressive multiple alignment with anchors (iterated)
    1. Perform global alignment between closest pair of sequences with LAGAN: alignments are “multi-sequences”
    2. Find rough global maps of this multi-sequence to all other multi-sequences.
Brudno et al. (2003) Genome Res. doi:10.1101/gr.926603
63. Human-Mouse-Rat Alignment
• Three-way progressive alignment, identifying:
  - Homologous (H/M/R), rodent-only (M/R) and human-mouse or human-rat (H/M, H/R) homologous regions
• Three-way synteny mapped to rat genome
Brudno et al. (2004) Genome Res. doi:10.1101/gr.2067704
Initial alignments by BLAT
Syntenous regions aligned with LAGAN
65. Draft Genome Alignment
• Whole genome alignments useful for scaffolding assemblies
• High-throughput sequence assemblies come in fragments (contigs)
• Contigs can sometimes be ordered if paired reads or long read technologies are used
• Can also align to a known reference genome
• MUMmer
  - Can use NUCmer or, for more distant relations, PROmer
• Mauve/Progressive Mauve
  - http://gel.ahabs.wisc.edu/mauve/
Darling et al. (2003) Genome Res. doi:10.1101/gr.2289704
66. Mauve
• Mauve’s alignment algorithm
  1. Find local alignments (multi-MUMs: seed & extend)
  2. Construct phylogenetic guide tree from multi-MUMs
  3. Select subset of multi-MUMs as anchors.
     - Partition anchors into Local Collinear Blocks (LCBs): consistently-ordered subsets
  4. Perform recursive anchoring to identify further anchors
  5. Perform progressive alignment (similar to CLUSTAL), against guide tree
• Mauve Contig Mover (MCM) for ordering contigs
Darling et al. (2003) Genome Res. doi:10.1101/gr.2289704
67. Mauve
• Mauve alignment of LCBs in nine enterobacterial genomes
• Rearrangement of homologous backbone sequence
Darling et al. (2003) Genome Res. doi:10.1101/gr.2289704
68. Draft Genome Alignment
• [OPTIONAL ACTIVITY] (useful for exercise)
  - Alignment and reordering of draft genome contigs
  - whole_genome_alignments_B.md Markdown
  - https://github.com/widdowquinn/Teaching/blob/master/Comparative_Genomics_and_Visualisation/Part_1/whole_genome_alignment/whole_genome_alignments_B.md
• [ACTIVITY]
  - Visualisation of whole genome alignment with Biopython
  - biopython_visualisation iPython notebook
69. Collinearity and Synteny
• Rearrangements may occur post-speciation
• Different species still exhibit conservation of sequence similarity and order
• Two elements are collinear if they lie in the same linear sequence
• Two elements are syntenous (syntenic) if:
  - (orig.) they lie on the same chromosome
  - (mod.) conservation of blocks of order within the same chromosome
• Signs of evolutionary constraints, including synteny, may indicate functional genome regions
• More about this in Part 2, related to genome features
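One simple way to make "conservation of blocks of order" concrete is to walk along the gene order of one genome and record runs of genes that are also adjacent, in the same order, in the other genome. The Python sketch below does exactly that. It is a toy: the function name and gene labels are invented, each gene is assumed to occur once per genome, and inverted blocks are not detected.

```python
def collinear_blocks(order_a, order_b):
    """Split genome A's gene order into maximal runs that appear
    consecutively, and in the same orientation, in genome B."""
    position_b = {gene: i for i, gene in enumerate(order_b)}
    blocks, current, previous = [], [], None
    for gene in order_a:
        position = position_b.get(gene)  # None if the gene is absent from B
        if position is not None and previous is not None and position == previous + 1:
            current.append(gene)         # continues the current block
        else:
            if current:
                blocks.append(current)   # close the block that just ended
            current = [gene] if position is not None else []
        previous = position
    if current:
        blocks.append(current)
    return blocks


# Genome B carries the same genes as A, but the two halves are swapped
genome_a = ["g1", "g2", "g3", "g4", "g5", "g6"]
genome_b = ["g4", "g5", "g6", "g1", "g2", "g3"]
blocks = collinear_blocks(genome_a, genome_b)
```

Despite the rearrangement, two blocks of conserved order are recovered; real tools work on alignment anchors rather than gene names and also handle inversions and duplications.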
75. Conclusion
• Physical and computational genome comparisons:
  - Similar biological questions -> similar concepts
• Lots of sequence data in modern biology
• Conservation ≈ evolutionary constraint
• Many choices of algorithms/analysis software
• Many choices of visualisation software/tools
• Coming in Part 2: genomic functional elements
76. Credits
• This slideshow is shared under a Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/)
• Copyright is held by The James Hutton Institute http://www.hutton.ac.uk
• You may freely use this material in research, papers, and talks so long as acknowledgement is made.
77. Nucleotide Content
• A, C, G, T composition
• Varies between, and within genomes
  - staining varies across genomes, due to variation in GC content
• “isochores”: regions with little internal GC variation (homogeneous)
  - long a point of discussion, difficult to define
• In humans:
  - L1, L2 isochores: low GC (≲41%)
  - H1, H2, H3 isochores: high GC (≳41%)
• Imprecise bulk measurement
Sadoni et al. (1999) J. Cell Biol. doi:10.1083/jcb.146.6.1211
[Figure: hybridisation of H3 isochore to human genome]
78. DNA-DNA Hybridisation (DDH)
• Used for taxonomic classification in prokaryotes from the 1960s
• Sibley & Ahlquist redefined bird and primate phylogeny with DDH in the 1980s:
  - Not without controversy:
    - Suggestions of data manipulation
    - Close evolutionary relationships difficult to resolve due to paralogy (more on paralogy later…)
• Still hanging on as a de facto “gold standard” in microbiological taxonomic classification.
Sibley & Ahlquist (1987) J. Mol. Evol. doi:10.1007/BF02111285
79. Finding isochores
• Isochores: homogeneous regions of %GC content
• Easy to find with windowed (100 kbp) %GC calculation, from sequenced genomes.
• 3200 isochores characterised in the human genome, consistent with 5 levels (L1, L2, H1, H2, H3) found by staining/hybridisation.
Costantini et al. (2006) Genome Res. doi:10.1101/gr.4910606
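A windowed %GC calculation of the kind described above can be sketched in a few lines of Python. This is an illustration, not the method of Costantini et al.: the function names are invented, windows are non-overlapping, and the single 41% cut-off (taken from the low-GC/high-GC boundary quoted earlier for human isochores) only separates "L" from "H" rather than resolving the five families.

```python
def windowed_gc(seq, window=100_000):
    """%GC in consecutive non-overlapping windows (the last may be shorter).

    Windows with no unambiguous bases yield None.
    """
    seq = seq.upper()
    values = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        acgt = gc + chunk.count("A") + chunk.count("T")
        values.append(100.0 * gc / acgt if acgt else None)
    return values


def label_windows(values, threshold=41.0):
    """Coarse low/high GC label per window; '?' where %GC is undefined."""
    return ["?" if v is None else ("H" if v >= threshold else "L") for v in values]


# Toy sequence with a tiny window so the behaviour is easy to follow:
# ten AT-only bases followed by ten GC-only bases
toy = "AT" * 5 + "GC" * 5
values = windowed_gc(toy, window=10)
labels = label_windows(values)
```

On a real chromosome, runs of adjacent windows with similar %GC are the candidate isochores.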
80. Comparative Genomic Hybridisation
• Two genomes, “reference” and “test”, labelled (red and green), then hybridised against a “normal” genome
• semiquantitative:
  - Red: loss (<2 copies) in tumour
  - Green: gain (3-4 copies) in tumour
  - Amplifications (>4 copies) in BOLD
• Cases with the same Copy Number Aberration (CNA) are numbered
De Bortoli et al. (2006) BMC Cancer doi:10.1186/1471-2407-6-223
81. Array Comparative Genomic Hybridisation
• Early approaches took a threshold score (present/absent)
• Later approaches used known reference genome sequence context (HMMs, synteny) to improve presence/absence calls
• No hybridisation = “absent” or “divergent”?
• Not nearly as good as sequencing directly!
Pritchard et al. (2009) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1000473
82. k-mer Spectra
• k-mer spectrum:
  - CpG suppression (CGs are uncommon in vertebrate genomes) explains multimodality, but (by simulation) only in combination with a particular %GC
Chor et al. (2009) Genome Biol. doi:10.1186/gb-2009-10-10-r108