SlideShare una empresa de Scribd logo
1 de 82
Descargar para leer sin conexión
Compara've	
  Genomics	
  and	
  
Visualisa'on	
  –	
  Part	
  1	
  
Leighton	
  Pritchard	
  
Part	
  1	
  
l What	
  is	
  compara've	
  genomics?	
  
l Levels	
  of	
  genome	
  comparison	
  
l  bulk,	
  whole	
  sequence,	
  features	
  
l A	
  Brief	
  History	
  of	
  Compara've	
  Genomics	
  
l  experimental	
  compara;ve	
  genomics	
  
l Computa'onal	
  Compara've	
  Genomics	
  
l  Bulk	
  proper;es	
  
l  Whole	
  genome	
  comparisons	
  
l Part	
  2	
  
l  Genome	
  feature	
  comparisons	
  
What	
  is	
  Compara've	
  Genomics?	
  
The	
  combina'on	
  of	
  genomic	
  data	
  and	
  
compara've	
  and	
  evolu'onary	
  biology	
  to	
  
address	
  ques'ons	
  of	
  genome	
  structure,	
  
evolu'on	
  and	
  func'on.	
  
What	
  is	
  Compara've	
  Genomics?	
  	
  
“Nothing	
  in	
  biology	
  makes	
  sense,	
  except	
  
in	
  the	
  light	
  of	
  evolu9on”	
  
Theodosius	
  Dobzhansky	
  
Why	
  Compara've	
  Genomics?	
  
l Genomes	
  describe	
  heritable	
  characteris;cs	
  
l Related	
  organisms	
  share	
  ancestral	
  genomes	
  
l Func;onal	
  elements	
  encoded	
  in	
  genomes	
  
are	
  common	
  to	
  related	
  organisms	
  
l Func;onal	
  understanding	
  of	
  model	
  systems	
  
(E.	
  coli,	
  A.	
  thaliana,	
  D.	
  melanogaster)	
  can	
  be	
  
transferred	
  to	
  non-­‐model	
  systems	
  on	
  the	
  
basis	
  of	
  genome	
  comparisons	
  
l Genome	
  comparisons	
  can	
  be	
  informa;ve,	
  
even	
  for	
  distantly-­‐related	
  organisms	
  
Why	
  Compara've	
  Genomics?	
  
l BUT:	
  
l  Context:	
  epigene;cs,	
  ;ssue	
  
differen;a;on,	
  mesoscale	
  systems,	
  etc.	
  
l  Phenotypic	
  plas'city:	
  responses	
  to	
  
temperature,	
  stress,	
  environment,	
  etc.	
  
Why	
  Compara've	
  Genomics?	
  
l Genomic	
  differences	
  can	
  underpin	
  phenotypic	
  
(morphological	
  or	
  physiological)	
  differences.	
  
l Where	
  phenotypes	
  or	
  other	
  organism-­‐level	
  
proper;es	
  are	
  known,	
  comparison	
  of	
  
genomes	
  may	
  give	
  mechanis;c	
  or	
  func;onal	
  
insight	
  into	
  differences	
  (e.g.	
  GWAS).	
  
l Genome	
  comparisons	
  aid	
  iden;fica;on	
  of	
  
func;onal	
  elements	
  on	
  the	
  genome.	
  
l Studying	
  genomic	
  changes	
  reveals	
  
evolu;onary	
  processes	
  and	
  constraints.	
  
	
  
Why	
  Compara've	
  Genomics?	
  
Adapted	
  from	
  Hardison	
  (2003)	
  PLoS	
  Biol.	
  doi:10.1371/journal.pbio.0000058	
  
species	
  
'me	
  
contemporary	
  
organisms	
  
l  Comparison	
  within	
  species	
  (e.g.	
  isolate-­‐level	
  –	
  or	
  even	
  within	
  individuals):	
  
which	
  genome	
  features	
  may	
  account	
  for	
  unique	
  characteris;cs	
  of	
  organisms/
tumours?	
  Epigene;cs	
  in	
  an	
  individual.	
  
Why	
  Compara've	
  Genomics?	
  
genus	
  
'me	
  
contemporary	
  
organisms	
  
l Comparison	
  within	
  genus	
  (e.g.	
  species-­‐level):	
  what	
  genome	
  features	
  
show	
  evidence	
  of	
  selec;ve	
  pressure,	
  and	
  in	
  which	
  species?	
  
Why	
  Compara've	
  Genomics?	
  
subgroup	
  
'me	
  
contemporary	
  
organisms	
  
l Comparison	
  within	
  subgroup	
  (e.g.	
  genus-­‐level):	
  what	
  are	
  the	
  core	
  set	
  
of	
  genome	
  features	
  that	
  define	
  a	
  subgroup	
  or	
  genus?	
  
The	
  E.coli	
  long-­‐term	
  evolu'on	
  experiment	
  
l Run	
  by	
  the	
  Lenski	
  lab,	
  Michigan	
  State	
  University	
  since	
  1988	
  
l  hVp://myxo.css.msu.edu/ecoli/	
  
l 12	
  flasks,	
  citrate	
  usage	
  selec;on	
  
l 50,000	
  genera;ons	
  of	
  Escherichia	
  coli!	
  
l  Cultures	
  propagated	
  every	
  day	
  
l  Every	
  500	
  genera;ons	
  (75	
  days),	
  	
  
mixed-­‐popula;on	
  samples	
  stored	
  
l  Mean	
  fitness	
  es;mated	
  at	
  500	
  	
  
genera;on	
  intervals	
  
Jeong	
  et	
  al.	
  (2009)	
  J.	
  Mol.	
  Biol.	
  doi:10.1016/j.jmb.2009.09.052	
  
Barrick	
  et	
  al.	
  (2009)	
  Nature	
  doi:10.1038/nature08480	
  
Wiser	
  et	
  al.	
  (2013)	
  Science.	
  doi:10.1126/science.1243357	
  
Compara've	
  Genomics	
  in	
  the	
  News	
  
Sankaraman	
  et	
  al.	
  (2014)	
  Nature.	
  doi:10.1038/nature12961	
  
l Neanderthal	
  alleles:	
  
l  Aid	
  adapta;on	
  outwith	
  Africa	
  
l  Associated	
  with	
  disease	
  risk	
  
l  Reduce	
  male	
  fer;lity	
  
Levels	
  of	
  Genome	
  Comparison	
  
Genomes	
  are	
  complex,	
  and	
  can	
  be	
  
compared	
  on	
  a	
  range	
  of	
  conceptual	
  levels	
  
-­‐	
  both	
  prac'cally	
  and	
  in	
  silico.	
  
Three	
  broad	
  levels	
  of	
  comparison	
  
l Bulk	
  Proper;es	
  
l  chromosome/plasmid	
  counts	
  and	
  sizes,	
  	
  
l  nucleo;de	
  content,	
  etc.	
  
l Whole	
  Genome	
  Sequence	
  
l  sequence	
  similarity	
  
l  organisa;on	
  of	
  genomic	
  regions	
  (synteny),	
  etc.	
  
l Genome	
  Features/Func;onal	
  Components	
  
l  numbers	
  and	
  types	
  of	
  features	
  (genes,	
  ncRNA,	
  regulatory	
  
elements,	
  etc.)	
  
l  organisa;on	
  of	
  features	
  (synteny,	
  operons,	
  regulons,	
  etc.)	
  
l  complements	
  of	
  features	
  
l  selec;on	
  pressure,	
  etc.	
  
A	
  Brief	
  History	
  of	
  Experimental	
  
Compara've	
  Genomics	
  
You	
  don’t	
  have	
  to	
  sequence	
  genomes	
  to	
  
compare	
  them	
  (but	
  it	
  helps).	
  
Genome	
  Comparisons	
  Predate	
  NGS	
  
l Sequence	
  data	
  was	
  not	
  always	
  cheap	
  and	
  abundant	
  
l Prac;cal,	
  experimental	
  genome	
  comparisons	
  were	
  needed	
  
Bulk	
  Genome	
  Property	
  Comparisons	
  
Values	
  calculated	
  for	
  individual	
  genomes,	
  
and	
  subsequently	
  compared.	
  
Bulk	
  Genome	
  Proper'es	
  
l  Large-­‐scale	
  summary	
  measurements	
  
l  Measure	
  genomes	
  independently	
  –	
  compare	
  values	
  later	
  
l  Number	
  of	
  chromosomes	
  
l  Ploidy	
  
l  Chromosome	
  size	
  
l  Nucleo;de	
  (A,	
  C,	
  G,	
  T)	
  frequency/percentage	
  
Chromosome	
  Counts/Size	
  
l  The	
  chromosome	
  counts/ploidy	
  of	
  organisms	
  can	
  vary	
  widely	
  
l  Escherichia	
  coli:	
  1	
  (but	
  plasmids…)	
  
l  Rice	
  (Oryza	
  sa6va):	
  24	
  (but	
  mitochondria,	
  plas;ds	
  etc…)	
  
l  Human	
  (Homo	
  sapiens):	
  46,	
  diploid	
  
l  Adders-­‐tongue	
  (Ophioglossum	
  re6culatum):	
  up	
  to	
  1260	
  
l  Domes;c	
  (but	
  not	
  wild)	
  wheat	
  soma;c	
  cells	
  hexaploid,	
  gametes	
  haploid	
  
l  Physical	
  genome	
  size	
  (related	
  to	
  sequence	
  length)	
  	
  
can	
  also	
  vary	
  greatly	
  
l  Genome	
  size	
  and	
  chromosome	
  count	
  	
  
do	
  not	
  indicate	
  organism	
  ‘complexity’	
  
l  S;ll	
  surprises	
  to	
  be	
  found	
  in	
  physical	
  
study	
  of	
  chromosomes!	
  (e.g.	
  Hi-­‐C)	
  
Kamisugi	
  et	
  al.	
  (1993)	
  Chromosome	
  Res.	
  1(3):	
  189-­‐96	
  
Wang	
  et	
  al.	
  (2013)	
  Nature	
  Rev	
  Genet.	
  doi:10.1038/nrg3375	
  
Nucleo'de	
  Content	
  
l Experimental	
  approaches	
  for	
  accurate	
  measurement	
  
l  e.g.	
  use	
  radiolabelled	
  monophosphates,	
  calculate	
  propor;ons	
  using	
  
chromatography	
  
Karl	
  (1980)	
  Microbiol.	
  Rev.	
  44(4)	
  739-­‐796	
  
Krane	
  et	
  al.	
  (1991)	
  Nucl.	
  Acids	
  Res.	
  doi:10.1093/nar/19.19.5181	
  
Whole	
  Genome	
  Comparisons	
  
Comparisons	
  of	
  one	
  whole	
  or	
  drac	
  
genome	
  with	
  another	
  (or	
  many	
  others)	
  
Whole	
  Genome	
  Comparisons	
  
l  Requires	
  two	
  genomes:	
  “reference”	
  and	
  “comparator”	
  
l  Experiment	
  produces	
  a	
  compara;ve	
  result,	
  dependent	
  on	
  the	
  
choice	
  of	
  genomes	
  
l  Methods	
  mostly	
  based	
  around	
  direct	
  or	
  indirect	
  DNA	
  
hybridisa;on	
  
l  DNA-­‐DNA	
  hybridisa;on	
  
l  Compara;ve	
  Genomic	
  Hybridisa;on	
  (CGH)	
  
l  Array	
  Compara;ve	
  Genomic	
  Hybridisa;on	
  (aCGH)	
  
DNA-­‐DNA	
  Hybridisa'on	
  (DDH)	
  
l Several	
  methods	
  based	
  around	
  the	
  same	
  principle	
  
1.  Denature	
  organism	
  A,	
  B	
  	
  
genomic	
  DNA	
  mixture	
  
2.  Allow	
  to	
  anneal	
  –	
  hybrids	
  result	
  	
  
(reassocia;on	
  ≈	
  similarity)	
  
Morelló-­‐Mora	
  &	
  Amann	
  (2001)	
  FEMS	
  Microbiol.	
  Rev.	
  doi:10.1016/S0168-­‐6445(00)00040-­‐1	
  
DNA-­‐DNA	
  Hybridisa'on	
  (DDH)	
  
l  Several	
  methods	
  -­‐	
  same	
  principle	
  
1.  Find	
  homoduplex	
  Tm1	
  
2.  Denature	
  reference,	
  comparator	
  
gDNA	
  +	
  mix	
  
3.  Allow	
  to	
  anneal	
  –	
  hybrids	
  result	
  	
  
(reassocia;on	
  ≈	
  similarity),	
  find	
  	
  
heteroduplex	
  Tm2	
  
4.  ∆Tm	
  =	
  Tm1	
  –	
  Tm2	
  
5.  High	
  ∆T	
  implies	
  greater	
  genomic	
  
difference	
  (fewer	
  H-­‐bonds)	
  
l  Proxy	
  for	
  sequence	
  similarity	
  
Morelló-­‐Mora	
  &	
  Amann	
  (2001)	
  FEMS	
  Microbiol.	
  Rev.	
  doi:10.1016/S0168-­‐6445(00)00040-­‐1	
  
DNA-­‐DNA	
  Hybridisa'on	
  (DDH)	
  
l Used	
  for	
  taxonomic	
  classifica;on	
  in	
  prokaryotes	
  from	
  1960s	
  
l Sibley	
  &	
  Ahlquist	
  redefined	
  bird	
  and	
  primate	
  phylogeny	
  with	
  	
  
DDH	
  in	
  1980s:	
  Homo	
  shares	
  more	
  recent	
  common	
  ancestor	
  with	
  Pan	
  
than	
  with	
  Gorilla	
  (this	
  was	
  previously	
  in	
  dispute)	
  
Sibley	
  &	
  Ahlquist	
  (1984)	
  J.	
  Mol.	
  Evol.	
  doi:10.1007/BF02101980	
  
Compara've	
  Genomic	
  Hybridisa'on	
  
l  Two	
  genomes:	
  “reference”	
  and	
  “test”	
  are	
  labelled	
  (red	
  and	
  green	
  –	
  	
  
a	
  bad	
  conven6on	
  to	
  choose,	
  for	
  visualisa6on),	
  then	
  hybridised	
  against	
  a	
  
third	
  “normal”	
  genome	
  
l  Differences	
  in	
  red/green	
  intensity	
  mapped	
  by	
  microscopy	
  correspond	
  to	
  
rela;ve	
  rela;onship	
  of	
  reference	
  and	
  test	
  to	
  “normal”	
  genome	
  
l  Comparisons	
  within	
  species	
  (or	
  individual,	
  for	
  tumours);	
  copy	
  number	
  
varia'ons	
  (CNV)	
  
l  Labour-­‐intensive,	
  low-­‐resolu;on	
  
Compara've	
  Genomic	
  Hybridisa'on	
  
l Image	
  analysis	
  required	
  –	
  intensity	
  along	
  medial	
  axis.	
  
Kallioniemi	
  et	
  al.	
  (1992)	
  Science	
  doi:10.1126/science.1359641	
  
Fraga	
  et	
  al.	
  (2005)	
  Proc.	
  Natl.	
  Acad.	
  Sci.	
  USA	
  doi:10.1073/pnas.0500398102	
  
Epigene'cs:	
  hybridising	
  	
  
methylated	
  DNA	
  
Array	
  Compara've	
  Genomic	
  Hybridisa'on	
  
l  Uses	
  DNA	
  microarrays:	
  thousands	
  of	
  short	
  DNA	
  probes	
  (genome	
  	
  
fragments)	
  immobilised	
  on	
  a	
  surface	
  
l  gDNA,	
  cDNA,	
  etc.	
  fluorescently-­‐labelled	
  and	
  hybridised	
  to	
  the	
  array	
  
l  Smaller	
  sample	
  sizes	
  cf.	
  CGH,	
  	
  
automatable,	
  high-­‐throughput,	
  high-­‐res	
  
l  Iden'fies	
  copy	
  number	
  varia'on	
  (CNV)	
  
and	
  segmental	
  duplica'on	
  
Pollack	
  et	
  al.	
  (1999)	
  Nat.	
  Genet.	
  doi:10.1038/12640	
  
Genome	
  Feature	
  Comparisons	
  
Comparisons	
  on	
  the	
  basis	
  of	
  a	
  restricted	
  
set	
  of	
  genome	
  features	
  
Chromosomal	
  Rearrangements	
  
l  Genomes	
  are	
  dynamic,	
  and	
  undergo	
  large-­‐scale	
  changes	
  
l  Hybridisa;on	
  used	
  to	
  map	
  genome	
  rearrangement/duplica;on	
  
l  Separate	
  chromosomes	
  electrophore;cally	
  
l  Apply	
  single	
  gene	
  hybridising	
  probes	
  
l  Reciprocal	
  hybridisa;ons	
  indicate	
  transloca;ons	
  
Fischer	
  et	
  al.	
  (2000)	
  Nature.	
  doi:10.1038/35013058	
  
Diagnos'c	
  PCR/MLST	
  
l  Define	
  a	
  set	
  of	
  regions	
  (usually	
  genes):	
  
l  conserved	
  enough	
  that	
  PCR	
  primers	
  can	
  
be	
  designed	
  to	
  amplify	
  the	
  same	
  region	
  
in	
  mul;ple	
  organisms	
  
l  and:	
  
l  divergent	
  enough	
  that	
  hybridising	
  
probes	
  can	
  dis;nguish	
  between	
  groups	
  
l  or:	
  
l  sequence	
  the	
  amplifica;on	
  products	
  
l  Sequence	
  variants	
  given	
  numbers	
  
l  Number	
  profiles	
  define	
  groups	
  
l  Track	
  evolu;on	
  by	
  minimum	
  spanning	
  
trees	
  (MST)	
  
l  hVp://pubmlst.org/	
  
Maiden	
  et	
  al.	
  (2006)	
  Ann.	
  Rev.	
  Microbiol.	
  doi:10.1146/annurev.micro.59.030804.121325	
  
l  aCGH	
  can	
  also	
  be	
  applied	
  across	
  species	
  for	
  classifica'on/diagnos'cs:	
  
l  Microarray	
  probes	
  represent	
  genes	
  	
  
from	
  one	
  or	
  more	
  organisms	
  
l  “Off-­‐species”	
  gDNA	
  fragmented,	
  	
  
labelled,	
  and	
  hybridised	
  
l  Hybridisa;on	
  ≈	
  sequence	
  	
  
similarity	
  ≈	
  gene	
  presence	
  
l  Heatmap	
  of	
  217	
  Staphylococcus	
  
aureus	
  isolates	
  on	
  7-­‐strain	
  array.	
  
l  columns=isolates	
  
l  yellow/red=gene	
  present	
  
l  blue/white/grey=gene	
  absent	
  
l  Lower	
  bars	
  coloured	
  by	
  lineage	
  and	
  host	
  	
  
(green=caVle,	
  blue=horse,	
  purple=human)	
  
Array	
  Compara've	
  Genomic	
  Hybridisa'on	
  
Sung	
  et	
  al.	
  (2008)	
  Microbiol.	
  doi:10.1099/mic.0.2007/015289-­‐0	
  
But	
  This	
  Happened…	
  
l High-­‐throughput	
  sequencing	
  
…And	
  Then	
  It	
  Rained	
  Sequence	
  Data	
  
l  Modern	
  high-­‐throughput	
  sequencing	
  (454,	
  Illumina)	
  completely	
  
changed	
  the	
  landscape.	
  
l  Complete,	
  (mainly)	
  accurate	
  sequence	
  	
  
data	
  much	
  cheaper,	
  enabling:	
  
l  more	
  precise	
  sequence	
  comparison	
  
l  novel	
  analyses,	
  insights	
  and	
  	
  
visualisa;ons	
  
l  Genomic	
  &	
  exomic	
  comparisons	
  
l  19/2/2014	
  at	
  GOLD:	
  
l  3,011	
  “finished”	
  genomes	
  
l  9,891	
  “permanent	
  drar”	
  genomes	
  
l  19/2/2014	
  at	
  NCBI	
  WGS:	
  
l  17,023	
  whole	
  genome	
  projects	
  
…And	
  Then	
  It	
  Rained	
  Sequence	
  Data	
  
l In	
  2012,	
  GOLD	
  added	
  3736	
  genomes,	
  NCBI	
  added	
  4585	
  
l Mostly	
  prokaryotes	
  (archaea	
  and	
  bacteria)	
  
l We’re	
  a	
  liVle	
  ahead	
  of	
  Su’s	
  (Scripps,	
  La	
  Jolla)	
  projec;ons	
  
Figures	
  and	
  code	
  from:	
  hlp://sulab.org/2013/06/sequenced-­‐genomes-­‐per-­‐year/	
  	
  
Computa'onal	
  Compara've	
  Genomics	
  
Massively	
  enabled	
  by	
  high-­‐throughput	
  
sequencing,	
  much	
  more	
  powerful	
  and	
  
precise.	
  
Three	
  broad	
  levels	
  of	
  comparison	
  
l Bulk	
  Proper;es	
  
l  chromosome/plasmid	
  counts	
  and	
  sizes,	
  	
  
l  nucleo;de	
  content,	
  etc.	
  
l Whole	
  Genome	
  Sequence	
  
l  sequence	
  similarity	
  
l  organisa;on	
  of	
  genomic	
  regions	
  (rearrangements),	
  etc.	
  
l Genome	
  Features/Func;onal	
  Components	
  
l  numbers	
  and	
  types	
  of	
  features	
  (genes,	
  ncRNA,	
  regulatory	
  
elements,	
  etc.)	
  
l  organisa;on	
  of	
  features	
  (synteny,	
  operons,	
  regulons,	
  etc.)	
  
l  complements	
  of	
  features	
  
l  selec;on	
  pressure,	
  etc.	
  
Bulk	
  Genome	
  Property	
  Comparisons	
  
Values	
  calculated	
  for	
  individual	
  genomes,	
  
and	
  subsequently	
  compared.	
  
Nucleo'de	
  Frequencies/Genome	
  Size	
  
l Very	
  easy	
  to	
  calculate	
  from	
  complete	
  or	
  drar	
  genome	
  sequence	
  
l  (or	
  in	
  a	
  region	
  of	
  genome	
  sequence)	
  
l GC	
  content/chromosome	
  size	
  can	
  be	
  characteris;c	
  of	
  an	
  
organism	
  
l [ACTIVITY]	
  
l  bacteria_size_gc	
  iPython	
  notebook	
  
l  ipython notebook –-pylab inline	
  in	
  
bacteria_size	
  directory	
  
Blobology	
  
l Metazoan	
  sequence	
  data	
  can	
  be	
  contaminated	
  by	
  microbial	
  
symbionts.	
  
l  Host	
  and	
  symbiont	
  DNA	
  have	
  different	
  %GC	
  (and	
  are	
  present	
  in	
  
different	
  amounts/coverage)	
  
l  Preliminary	
  genome	
  assembly,	
  followed	
  
by	
  read	
  mapping	
  
l  Plot	
  con;g	
  coverage	
  against	
  	
  
%GC	
  =	
  Blobology	
  
l  hVp://nematodes.org/bioinforma;cs/blobology/	
  
Kumar	
  &	
  Blaxter	
  (2011)	
  Symbiosis	
  doi:10.1007/s13199-­‐012-­‐0154-­‐6	
  
Nucleo'de	
  k-­‐mers	
  
l  Sequence	
  data	
  is	
  required	
  to	
  determine	
  k-­‐mers	
  
l  Nucleo;de	
  frequencies:	
  	
  
l  A,	
  C,	
  G,	
  T	
  
l  Dinucleo;de	
  frequencies:	
  	
  
l  AA,	
  AC,	
  AG,	
  AT,	
  CA,	
  CC,	
  CG,	
  CT,	
  GA,	
  GC,	
  GG,	
  GT,	
  TA,	
  TC,	
  TG,	
  TT	
  
l  Trinucleo;de	
  frequencies:	
  
l  64	
  trinucleo;des	
  
l  k-­‐nucleo;de	
  frequencies:	
  
l  4k	
  k-­‐mers	
  
l  [ACTIVITY]	
  
l  runApp(“shiny/nucleotide_frequencies”)in	
  RStudio	
  
k-­‐mer	
  Spectra	
  
l k-­‐mer	
  spectrum:	
  
l  Frequency	
  distribu;on	
  of	
  observed	
  k-­‐mer	
  counts	
  
l  Most	
  species	
  have	
  a	
  unimodal	
  k-­‐mer	
  spectrum	
  
Chor	
  et	
  al.	
  (2009)	
  Genome	
  Biol.	
  doi:10.1186/gb-­‐2009-­‐10-­‐10-­‐r108	
  
k-­‐mer	
  Spectra	
  
l  k-­‐mer	
  spectrum:	
  
l  All	
  mammals	
  tested	
  (and	
  some	
  other)	
  species	
  have	
  a	
  mul;modal	
  k-­‐mer	
  
spectrum	
  
l  Genomic	
  regions	
  differ	
  in	
  this	
  property	
  
Chor	
  et	
  al.	
  (2009)	
  Genome	
  Biol.	
  doi:10.1186/gb-­‐2009-­‐10-­‐10-­‐r108	
  
Average	
  Nucleo'de	
  Iden'ty	
  (ANI)	
  
l ANI	
  introduced	
  as	
  a	
  subs;tute	
  for	
  DDH	
  in	
  2007:	
  
l  70%	
  iden;ty	
  (DDH)	
  =	
  “gold	
  standard”	
  	
  
prokaryo;c	
  species	
  boundary	
  
l  70%	
  iden;ty	
  (DDH)	
  ≈	
  95%	
  iden;ty	
  (ANI)	
  
Goris	
  et	
  al.	
  (2007)	
  Int.	
  J.	
  System.	
  Evol.	
  Biol.	
  doi:10.1099/ijs.0.64483-­‐0	
  
Average	
  Nucleo'de	
  Iden'ty	
  (ANI)	
  
l ANI	
  introduced	
  as	
  a	
  subs;tute	
  for	
  DDH	
  in	
  2007:	
  
l  70%	
  iden;ty	
  (DDH)	
  =	
  “gold	
  standard”	
  	
  
prokaryo;c	
  species	
  boundary	
  
l  70%	
  iden;ty	
  (DDH)	
  ≈	
  95%	
  iden;ty	
  (ANI)	
  
l Original	
  method	
  emulates	
  physical	
  
experiment:	
  
1.  break	
  genome	
  into	
  1020nt	
  fragments	
  
2.  align	
  fragments	
  using	
  BLASTN	
  
3.  ANI	
  =	
  mean	
  iden;ty	
  of	
  all	
  BLASTN	
  	
  
matches	
  with	
  >30%	
  iden;ty	
  over	
  70%	
  
alignable	
  length	
  
Goris	
  et	
  al.	
  (2007)	
  Int.	
  J.	
  System.	
  Evol.	
  Biol.	
  doi:10.1099/ijs.0.64483-­‐0	
  
Average	
  Nucleo'de	
  Iden'ty	
  (ANI)	
  
l ANI	
  introduced	
  as	
  a	
  subs;tute	
  for	
  DDH	
  in	
  2007:	
  
l  70%	
  iden;ty	
  (DDH)	
  =	
  “gold	
  standard”	
  prokaryo;c	
  species	
  boundary	
  
l  70%	
  iden;ty	
  (DDH)	
  ≈	
  95%	
  iden;ty	
  (ANI)	
  
l ANIm	
  and	
  TETRA	
  introduced	
  (2009)	
  
1.  Align	
  sequences	
  using	
  NUCmer	
  
2.  ANI	
  =	
  mean	
  %iden;ty	
  of	
  matches	
  
l TETRA:	
  
1.  Calculate	
  tetranucleo;de	
  frequencies	
  
2.  Determine	
  each	
  tetramer	
  devia;on	
  from	
  expecta;on	
  (Z-­‐score)	
  
3.  TETRA	
  =	
  Pearson	
  correla;on	
  coefficient	
  of	
  tetramer	
  Z-­‐scores	
  
Richter	
  &	
  Rosselló-­‐Móra	
  (2009)	
  Proc.	
  Natl.	
  Acad.	
  Sci.	
  USA	
  doi:10.1073/pnas.0906412106	
  
Average	
  Nucleo'de	
  Iden'ty	
  (ANI)	
  
l ANIb	
  discards	
  useful	
  informa;on	
  that	
  ANIm	
  retains	
  
l TETRA	
  reflects	
  bulk	
  genome	
  proper;es	
  rather	
  than	
  selec;on	
  on	
  
sequence	
  
l  Data	
  for	
  Anaplasma	
  marginale	
  (3),	
  A.phagocytophilum	
  (4),	
  A.centrale	
  (1)	
  
l TETRA	
  scores	
  are	
  prone	
  to	
  false	
  posi;ves;	
  ANIb	
  scores	
  are	
  prone	
  to	
  
false	
  nega;ves	
  
Average	
  Nucleo'de	
  Iden'ty	
  (ANI)	
  
l Jspecies	
  (hVp://www.imedea.uib.es/jspecies/)	
  	
  
l  WebStart	
  
l  java -jar -Xms1024m -Xmx1024m jspecies1.2.1.jar
l Python	
  script	
  
l  scripts/calculate_ani.py
l [ACTIVITY]	
  
l  average_nucleotide_identity/README.md	
  Markdown	
  
Richter	
  &	
  Rosselló-­‐Móra	
  (2009)	
  Proc.	
  Natl.	
  Acad.	
  Sci.	
  USA	
  doi:10.1073/pnas.0906412106	
  
Diagnos'c	
  PCR/MLST	
  
l PCR/MLST	
  s;ll	
  cheap	
  
l  (but	
  for	
  how	
  much	
  longer?)	
  
l Use	
  whole	
  genomes	
  to	
  iden;fy	
  unique/
diagnos;c	
  regions	
  for	
  PCR/MLST	
  
Slezak	
  et	
  al.	
  (2003)	
  Brief.	
  Bioinf.	
  doi:10.1093/bib/4.2.133	
  
Pritchard	
  et	
  al.	
  (2012)	
  PLoS	
  One	
  doi:10.1371/journal.pone.0034498	
  
Whole	
  Genome	
  Sequence	
  Comparisons	
  
Comparisons	
  of	
  one	
  whole	
  or	
  drac	
  
genome	
  sequence	
  with	
  another	
  (or	
  many	
  
others)	
  
Whole	
  Genome	
  Alignment	
  
Whole	
  Genome	
  Alignment	
  
l Which	
  genomes	
  should	
  you	
  align?	
  (or	
  not	
  bother	
  aligning)	
  
l For	
  reasonable	
  analysis,	
  genomes	
  should:	
  
l  derive	
  from	
  a	
  sufficiently	
  recent	
  common	
  ancestor:	
  so	
  that	
  
homologous	
  regions	
  can	
  be	
  iden;fied.	
  
l  derive	
  from	
  a	
  sufficiently	
  distant	
  common	
  ancestor:	
  so	
  that	
  
sufficiently	
  “interes;ng”	
  changes	
  are	
  likely	
  to	
  have	
  occurred	
  
l  help	
  answer	
  your	
  biological	
  ques;on:	
  
„ is	
  your	
  ques;on	
  organism	
  or	
  phenotype	
  specific?	
  
„ are	
  you	
  inves;ga;ng	
  a	
  process?	
  
l This	
  may	
  be	
  more	
  involved	
  for	
  metazoans	
  (vertebrates,	
  
arthropods,	
  nematodes,	
  etc.)	
  than	
  prokaryotes…	
  
Whole	
  Genome	
  Alignment	
  
l Naïve	
  alignment	
  algorithms	
  (e.g.	
  Needleman-­‐Wunsch/Smith-­‐
Waterman)	
  are	
  not	
  appropriate:	
  
l  Do	
  not	
  handle	
  rearrangements	
  
l  Computa;onally	
  expensive	
  on	
  large	
  sequences	
  
l Many	
  whole-­‐genome	
  alignment	
  algorithms	
  proposed,	
  including:	
  
l  LASTZ	
  (hVp://www.bx.psu.edu/~rsharris/lastz/)	
  
l  BLAT	
  (hVp://genome.ucsc.edu/goldenPath/help/blatSpec.html)	
  
l  Mugsy	
  (hVp://mugsy.sourceforge.net/)	
  
l  megaBLAST	
  (hVp://www.ncbi.nlm.nih.gov/blast/html/megablast.html)	
  
l  MUMmer	
  (hVp://mummer.sourceforge.net/)	
  
l  LAGAN	
  (hVp://lagan.stanford.edu/lagan_web/index.shtml)	
  
l  WABA,	
  etc…	
  
Whole	
  Genome	
  Alignment	
  
l BLAT	
  
l  BLAT	
  is	
  broadly	
  similar	
  to	
  BLAST	
  
l  Main	
  differences:	
  
„ op;mised	
  to	
  find	
  only	
  exact	
  or	
  near-­‐exact	
  matches,	
  for	
  
speed	
  
„ indexes	
  the	
  subject	
  genome,	
  retains	
  the	
  index	
  and	
  scans	
  
the	
  query	
  	
  
„ connects	
  homologous	
  match	
  regions	
  into	
  a	
  single	
  alignment	
  
(BLAST	
  reports	
  them	
  separately)	
  
„ reports	
  mRNA	
  match	
  intron-­‐exon	
  boundaries	
  exactly	
  
(BLAST	
  tends	
  to	
  extend)	
  
l  Advantages:	
  fast;	
  exact	
  exon	
  boundaries;	
  UCSC	
  integra;on	
  
l  Disadvantages:	
  does	
  not	
  find	
  more	
  remote/very	
  divergent	
  
matches	
  
Kent	
  (2002)	
  Genome	
  Res.	
  doi:10.1101/gr.229202	
  
Whole	
  Genome	
  Alignment	
  
l megaBLAST	
  
l  Op;mised	
  for	
  speed	
  over	
  BLASTN	
  	
  
(see	
  hVp://www.ncbi.nlm.nih.gov/blast/Why.shtml):	
  
„ genome-­‐level	
  searches	
  	
  
„ queries	
  on	
  large	
  sequence	
  sets	
  
„ long	
  alignments	
  of	
  very	
  similar	
  sequence	
  (sequencing	
  errors/SNPs)	
  
l  Uses	
  Zhang	
  et	
  al.	
  (2000)	
  greedy	
  algorithm	
  
l  Concatenates	
  queries	
  to	
  improve	
  performance	
  (“query	
  packing”)	
  
„ NOTE:	
  this	
  is	
  good	
  prac'ce	
  for	
  large	
  query	
  sets!	
  
l  Two	
  modes:	
  megaBLAST,	
  and	
  discon;nuous	
  megaBLAST	
  (dc-­‐megablast)	
  
„ dc-­‐megablast	
  intended	
  for	
  more	
  divergent	
  sequences	
  
Zhang	
  et	
  al.	
  (2000)	
  J.	
  Comp.	
  Biol.	
  7(1-­‐2)	
  203-­‐14	
  
Korf	
  et	
  al.	
  (2003)	
  “BLAST”,	
  O’Reilly	
  &	
  Associates,	
  Sebastopol,	
  CA	
  
Whole	
  Genome	
  Alignment	
  
l MUMmer	
  
l  Uses	
  suffix	
  trees	
  for	
  paVern	
  matching:	
  very	
  fast	
  even	
  for	
  large	
  sequences	
  
„ Finds	
  maximal	
  exact	
  matches	
  
„ Memory	
  use	
  depends	
  only	
  on	
  reference	
  sequence	
  size	
  
Kurtz	
  et	
  al.	
  (2004)	
  Genome	
  Biol.	
  doi:10.1186/gb-­‐2004-­‐5-­‐2-­‐r12	
  
Whole	
  Genome	
  Alignment	
  
l MUMmer	
  
l  Uses	
  suffix	
  trees	
  for	
  paVern	
  matching:	
  very	
  fast	
  even	
  for	
  large	
  sequences	
  
„ Finds	
  maximal	
  exact	
  matches	
  
„ Memory	
  use	
  depends	
  only	
  on	
  reference	
  sequence	
  size	
  
l  Suffix	
  Tree:	
  
l  Can	
  be	
  constructed	
  and	
  searched	
  in	
  O(n)	
  ;me	
  
l  Useful	
  algorithms	
  are	
  nontrivial	
  
l  BANANA$	
  
„  B	
  followed	
  by	
  ANANA$	
  only	
  
„  A	
  followed	
  by	
  $,	
  NA$,	
  NANA$	
  
„  N	
  followed	
  by	
  A$,	
  ANA$	
  
Kurtz	
  et	
  al.	
  (2004)	
  Genome	
  Biol.	
  doi:10.1186/gb-­‐2004-­‐5-­‐2-­‐r12	
  
Whole	
  Genome	
  Alignment	
  
l MUMmer	
  
l  Process:	
  
„ 1)	
  Iden;fy	
  a	
  non-­‐overlapping	
  subset	
  of	
  maximal	
  exact	
  matches:	
  
oren	
  Maximum	
  Unique	
  Matches	
  (MUMs	
  -­‐	
  though	
  not	
  always	
  
unique)	
  
„ 2)	
  Cluster	
  into	
  alignment	
  anchors	
  
„ 3)	
  Extend	
  between	
  anchors	
  to	
  produce	
  a	
  final	
  gapped	
  alignment	
  
l  Very	
  flexible	
  approach:	
  a	
  suite	
  of	
  programs	
  (mummer, nucmer,
promer,	
  …)	
  
„  nucleo;de	
  and	
  “conceptual	
  protein”	
  (more	
  sensi;ve)	
  alignments	
  
„  used	
  for	
  genome	
  comparisons,	
  assembly	
  scaffolding,	
  repeat	
  
detec;on,	
  etc.	
  
„  forms	
  the	
  basis	
  for	
  other	
  aligners/assemblers,	
  e.g.	
  Mugsy,	
  AMOS	
  
Kurtz	
  et	
  al.	
  (2004)	
  Genome	
  Biol.	
  doi:10.1186/gb-­‐2004-­‐5-­‐2-­‐r12	
  
Whole	
  Genome	
  Alignment	
  
l [ACTIVITY]	
  
l  whole_genome_alignments_A.md Markdown	
  
l  hVps://github.com/widdowquinn/Teaching/blob/master/
Compara;ve_Genomics_and_Visualisa;on/Part_1/
whole_genome_alignment/
whole_genome_alignments_A.md	
  
Mul'ple	
  Genome	
  Alignment	
  
l Several	
  tools:	
  
l  Mugsy	
  (hVp://mugsy.sourceforge.net/)	
  
l  MLAGAN	
  (hVp://lagan.stanford.edu/lagan_web/index.shtml)	
  
l  TBA/Mul'Z	
  (hVp://www.bx.psu.edu/miller_lab/)	
  
l  Mauve	
  (hVp://gel.ahabs.wisc.edu/mauve/)	
  
l Posi;onal	
  homology	
  vs.	
  glocal	
  
Mul'ple	
  Genome	
  Alignment	
  
l LAGAN:	
  rapid	
  alignment	
  of	
  two	
  homologous	
  
genome	
  sequences	
  
l  Generate	
  local	
  alignments	
  (anchors,	
  B)	
  
l  Construct	
  rough	
  global	
  map	
  	
  
(maximal-­‐scoring	
  ordered	
  subset,	
  C)	
  
„ Join	
  anchors	
  that	
  lie	
  within	
  a	
  	
  
threshold	
  distance,	
  the	
  same	
  way	
  
l  Compute	
  global	
  alignment	
  by	
  	
  
dynamic	
  programming	
  (D)	
  
Brudno	
  et	
  al.	
  (2003)	
  Genome	
  Res.	
  doi:10.1101/gr.926603	
  
Mul'ple	
  Genome	
  Alignment	
  
l MLAGAN:	
  mul;ple	
  genome	
  alignment	
  of	
  k	
  
genomes	
  in	
  k-­‐1	
  alignment	
  steps,	
  using	
  a	
  
phylogene;c	
  tree	
  (CLUSTAL-­‐like):	
  
l  Make	
  rough	
  global	
  maps	
  between	
  each	
  	
  
pair	
  of	
  sequences	
  (step	
  C	
  in	
  LAGAN)	
  
l  Progressive	
  mul;ple	
  alignment	
  with	
  	
  
anchors	
  (iterated)	
  
1.  Perform	
  global	
  alignment	
  between	
  	
  
closest	
  pair	
  of	
  sequences	
  with	
  	
  
LAGAN:	
  alignments	
  are	
  	
  
“mul6-­‐sequences”	
  
2.  Find	
  rough	
  global	
  maps	
  of	
  this	
  mul6-­‐
sequence	
  to	
  all	
  other	
  mul6-­‐sequences.	
  
Brudno	
  et	
  al.	
  (2003)	
  Genome	
  Res.	
  doi:10.1101/gr.926603	
  
Human-­‐Mouse-­‐Rat	
  Alignment	
  
l Three-­‐way	
  progressive	
  alignment,	
  iden;fying:	
  
l  Homologous	
  (H/M/R),	
  rodent-­‐only	
  (M/R)	
  and	
  human-­‐
mouse	
  or	
  human-­‐rat	
  (H/M,	
  H/R)	
  homologous	
  regions	
  
l Three-­‐way	
  synteny	
  
synteny	
  mapped	
  to	
  rat	
  genome	
  
Brudno	
  et	
  al.	
  (2004)	
  Genome	
  Res.	
  doi:10.1101/gr.2067704	
  
Ini'al	
  alignments	
  by	
  BLAT	
  
Syntenous	
  regions	
  aligned	
  with	
  LAGAN	
  
Drac	
  Genome	
  Alignment	
  
Drac	
  Genome	
  Alignment	
  
l Whole	
  genome	
  alignments	
  useful	
  for	
  scaffolding	
  assemblies	
  
l  High-­‐throughput	
  sequence	
  assemblies	
  come	
  in	
  fragments	
  (con;gs)	
  
l  Con;gs	
  can	
  some;mes	
  be	
  ordered	
  if	
  paired	
  reads	
  or	
  long	
  read	
  
technologies	
  are	
  used	
  
l  Can	
  also	
  align	
  to	
  a	
  known	
  reference	
  genome	
  
l MUMmer	
  
l  Can	
  use	
  NUCmer	
  or,	
  for	
  more	
  distant	
  rela;ons,	
  PROmer	
  
l Mauve/Progressive	
  Mauve	
  
l  hVp://gel.ahabs.wisc.edu/mauve/	
  
Darling	
  et	
  al.	
  (2003)	
  Genome	
  Res.	
  doi:10.1101/gr.2289704	
  
Mauve	
  
l  Mauve’s	
  alignment	
  algorithm	
  
1.  Find	
  local	
  alignments	
  (mul;-­‐MUMs	
  –	
  seed	
  &	
  
extend)	
  
2.  Construct	
  phylogene;c	
  guide	
  tree	
  from	
  mul;-­‐
MUMs	
  
3.  Select	
  subset	
  of	
  mul;-­‐MUMs	
  as	
  anchors.	
  
„  Par;;on	
  anchors	
  into	
  Local	
  Collinear	
  
Blocks	
  (LCBs)	
  –	
  consistently-­‐ordered	
  
subsets	
  
4.  Perform	
  recursive	
  anchoring	
  to	
  iden;fy	
  
further	
  anchors	
  
5.  Perform	
  progressive	
  alignment	
  (similar	
  to	
  
CLUSTAL),	
  against	
  guide	
  tree	
  
l  Mauve	
  Con;g	
  Mover	
  (MCM)	
  for	
  ordering	
  con;gs	
  
Darling	
  et	
  al.	
  (2003)	
  Genome	
  Res.	
  doi:10.1101/gr.2289704	
  
Mauve	
  
l  Mauve	
  alignment	
  of	
  LCBs	
  in	
  nine	
  enterobacterial	
  genomes	
  
l  Rearrangement	
  of	
  homologous	
  backbone	
  sequence	
  
Darling	
  et	
  al.	
  (2003)	
  Genome	
  Res.	
  doi:10.1101/gr.2289704	
  
Drac	
  Genome	
  Alignment	
  
l [OPTIONAL	
  ACTIVITY]	
  (useful	
  for	
  exercise)	
  
l  Alignment	
  and	
  reordering	
  of	
  drar	
  genome	
  con;gs	
  
l  whole_genome_alignments_B.md	
  Markdown	
  
l  hVps://github.com/widdowquinn/Teaching/blob/master/
Compara;ve_Genomics_and_Visualisa;on/Part_1/
whole_genome_alignment/
whole_genome_alignments_B.md	
  
l [ACTIVITY]	
  
l  Visualisa;on	
  of	
  whole	
  genome	
  alignment	
  with	
  Biopython	
  
l  biopython_visualisation	
  iPython	
  notebook	
  
Collinearity	
  and	
  Synteny	
  
l Rearrangements	
  may	
  occur	
  post-­‐specia;on	
  
l Different	
  species	
  s;ll	
  exhibit	
  conserva;on	
  of	
  sequence	
  
similarity	
  and	
  order	
  
l  Two	
  elements	
  are	
  collinear	
  if	
  they	
  lie	
  in	
  the	
  same	
  linear	
  
sequence	
  
l  Two	
  elements	
  are	
  syntenous	
  (syntenic)	
  if:	
  
„ (orig.)	
  they	
  lie	
  on	
  the	
  same	
  chromosome	
  
„ (mod.)	
  conserva;on	
  of	
  blocks	
  of	
  order	
  within	
  the	
  same	
  
chromosome	
  
l Signs	
  of	
  evolu;onary	
  constraints,	
  including	
  synteny,	
  may	
  
indicate	
  func;onal	
  genome	
  regions	
  
l More	
  about	
  this	
  in	
  Part	
  2,	
  related	
  to	
  genome	
  features	
  
Syntenous	
  
l example1.png	
  from	
  biopython_visualisation	
  
ac;vity	
  
Nonsyntenous	
  
l example2.png	
  from	
  biopython_visualisation	
  
ac;vity	
  
Whole	
  Genome	
  Duplica'on	
  
l Puffer	
  fish	
  Tetraodon	
  nigroviridis	
  (smallest	
  known	
  vertebrate	
  genome)	
  
l  Whole-­‐genome	
  duplica;on,	
  subsequent	
  to	
  divergence	
  from	
  mammals.	
  
l  Ancestral	
  vertebrate	
  genome	
  inferred	
  to	
  have	
  12	
  chromosomes.	
  
Duplicated	
  genes	
  (ExoFish)	
  on	
  21	
  chromosomes	
  
Jaillon	
  et	
  al.	
  (2004)	
  Nature	
  doi:10.1038/nature03025	
  
VISTA,	
  mVISTA,	
  VISTA-­‐Point	
  
l Alignment/visualisa;on	
  tools:	
  	
  
l  hVp://genome.lbl.gov/vista/index.shtml	
  
l mVISTA:	
  align	
  and	
  compare	
  submiVed	
  sequences	
  (up	
  to	
  2Mbp)	
  
l VISTA-­‐Point:	
  visualise	
  precomputed	
  alignments	
  
Frazer	
  et	
  al.	
  (2004)	
  Nucl.	
  Acids	
  Res.	
  doi:10.1093/nar/gkh458	
  
UCSC	
  
l hVp://genome.ucsc.edu/	
  
l Many	
  vertebrate/invertebrate	
  model	
  genomes	
  
Kent	
  et	
  al.	
  (2002)	
  Genome	
  Res.	
  doi:10.1101/gr.229102	
  
Conclusion	
  
l Physical	
  and	
  computa;onal	
  genome	
  comparisons:	
  
l  Similar	
  biological	
  ques;ons	
  -­‐>	
  similar	
  concepts	
  
l Lots	
  of	
  sequence	
  data	
  in	
  modern	
  biology	
  
l Conserva;on	
  ≈	
  evolu;onary	
  constraint	
  
l Many	
  choices	
  of	
  algorithms/analysis	
  sorware	
  
l Many	
  choices	
  of	
  visualisa;on	
  sorware/tools	
  
l Coming	
  in	
  Part	
  2:	
  genomic	
  func;onal	
  elements	
  
Credits	
  
l This	
  slideshow	
  is	
  shared	
  under	
  a	
  Crea;ve	
  Commons	
  
AVribu;on	
  4.0	
  License	
  
hVp://crea;vecommons.org/licenses/by/4.0/)	
  
l Copyright	
  is	
  held	
  by	
  The	
  James	
  HuVon	
  Ins;tute	
  
hVp://www.huVon.ac.uk	
  
l You	
  may	
  freely	
  use	
  this	
  material	
  in	
  research,	
  papers,	
  and	
  
talks	
  so	
  long	
  as	
  acknowledgement	
  is	
  made.	
  	
  
Nucleo'de	
  Content	
  
l A,	
  C,	
  G,	
  T	
  composi;on	
  
l  Varies	
  between,	
  and	
  within	
  genomes	
  
l  staining	
  varies	
  across	
  genomes,	
  due	
  to	
  	
  
varia;on	
  in	
  GC	
  content	
  
l “isochores”:	
  regions	
  with	
  liVle	
  
internal	
  GC	
  varia;on	
  (homogeneous)	
  
„  	
  long	
  a	
  point	
  of	
  discussion	
  	
  
–	
  difficult	
  to	
  define	
  
l In	
  humans:	
  
l  L1,	
  L2	
  isochores:	
  low	
  GC	
  (≲41%)	
  
l  H1,	
  H2,	
  H3	
  isochores:	
  high	
  GC	
  (≳41%)	
  
l  Imprecise	
  bulk	
  measurement	
  
Sadoni	
  et	
  al.	
  (1999)	
  J.	
  Cell	
  Biol.	
  doi:10.1083/jcb.146.6.1211	
  
hybridisa;on	
  of	
  H3	
  isochore	
  to	
  human	
  genome	
  
DNA-­‐DNA	
  Hybridisa'on	
  (DDH)	
  
l Used	
  for	
  taxonomic	
  classifica;on	
  in	
  prokaryotes	
  from	
  1960s	
  
l Sibley	
  &	
  Ahlquist	
  redefined	
  bird	
  and	
  primate	
  phylogeny	
  with	
  	
  
DDH	
  in	
  1980s:	
  	
  
l Not	
  without	
  controversy:	
  
„ Sugges;ons	
  of	
  data	
  manipula;on	
  	
  
(see	
  here)	
  
„ Close	
  evolu;onary	
  rela;onships	
  	
  
difficult	
  to	
  resolve	
  due	
  to	
  paralogy	
  	
  
(more	
  on	
  paralogy	
  later…)	
  
l S;ll	
  hanging	
  on	
  as	
  a	
  de	
  facto	
  “gold	
  	
  
standard”	
  in	
  microbiological	
  taxonomic	
  	
  
classifica;on.	
  
Sibley	
  &	
  Ahlquist	
  (1987)	
  J.	
  Mol.	
  Evol.	
  doi:10.1007/BF02111285	
  
Finding	
  isochores	
  
l Isochores:	
  homogeneous	
  regions	
  of	
  %GC	
  content	
  
l  Easy	
  to	
  find	
  with	
  windowed	
  (100kbp)	
  
%GC	
  calcula;on,	
  from	
  sequenced	
  	
  
genomes.	
  
l  3200	
  isochores	
  characterised	
  in	
  the	
  	
  
human	
  genome,	
  consistent	
  with	
  5	
  	
  
levels	
  (L1,	
  L2,	
  H1,	
  H2,	
  H3)	
  found	
  	
  
by	
  staining/hybridisa;on.	
  
	
  
Costan'ni	
  et	
  al.	
  (2006)	
  Genome	
  Res.	
  doi:10.1101/gr.4910606	
  
Compara've	
  Genomic	
  Hybridisa'on	
  
l  Two	
  genomes:	
  “reference”	
  and	
  “test”	
  labelled	
  (red	
  and	
  green),	
  
then	
  hybridised	
  against	
  a	
  “normal”	
  genome	
  
l  semiquan'ta've:	
  
l  Red:	
  loss	
  (<2	
  copies)	
  in	
  tumour	
  
l  Green:	
  gain	
  (3-­‐4	
  copies)	
  in	
  tumour	
  
l  Amplifica;ons	
  (>4	
  copies)	
  in	
  BOLD	
  
l  Cases	
  with	
  the	
  same	
  Copy	
  Number	
  	
  
Aberra;on	
  (CNA)	
  are	
  numbered	
  
De	
  Bortoli	
  et	
  al.	
  (2006)	
  BMC	
  Cancer	
  doi:10.1186/1471-­‐2407-­‐6-­‐223	
  
l Early	
  approaches	
  took	
  a	
  threshold	
  score	
  (present/absent)	
  
l Later	
  approaches	
  used	
  known	
  reference	
  genome	
  sequence	
  	
  
context	
  (HMMs,	
  synteny)	
  to	
  improve	
  presence/absence	
  calls	
  
l  No	
  hybridisa;on	
  =	
  “absent”	
  or“divergent”?	
  
l  Not	
  nearly	
  as	
  good	
  as	
  sequencing	
  directly!	
  
Array	
  Compara've	
  Genomic	
  Hybridisa'on	
  
Pritchard	
  et	
  al.	
  (2009)	
  PLoS	
  Comp.	
  Biol.	
  doi:10.1371/journal.pcbi.1000473	
  
k-­‐mer	
  Spectra	
  
l k-­‐mer	
  spectrum:	
  
l  CpG	
  suppression	
  (CGs	
  are	
  uncommon	
  in	
  vertebrate	
  genomes),	
  
but	
  (by	
  simula;on)	
  only	
  when	
  in	
  combina;on	
  with	
  a	
  par;cular	
  
%GC,	
  explains	
  mul;modality	
  
Chor	
  et	
  al.	
  (2009)	
  Genome	
  Biol.	
  doi:10.1186/gb-­‐2009-­‐10-­‐10-­‐r108	
  

Más contenido relacionado

La actualidad más candente (20)

Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
Comparative genomics presentation
Comparative genomics presentationComparative genomics presentation
Comparative genomics presentation
 
Gene order
Gene orderGene order
Gene order
 
NGS - QC & Dataformat
NGS - QC & Dataformat NGS - QC & Dataformat
NGS - QC & Dataformat
 
What is comparative genomics
What is comparative genomicsWhat is comparative genomics
What is comparative genomics
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
RNA-Seq
RNA-SeqRNA-Seq
RNA-Seq
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
BITS - Comparative genomics on the genome level
BITS - Comparative genomics on the genome levelBITS - Comparative genomics on the genome level
BITS - Comparative genomics on the genome level
 
Phylogenetics
PhylogeneticsPhylogenetics
Phylogenetics
 
Genome annotation 2013
Genome annotation 2013Genome annotation 2013
Genome annotation 2013
 
Basics of Genome Assembly
Basics of Genome Assembly Basics of Genome Assembly
Basics of Genome Assembly
 
Orthologs,Paralogs & Xenologs
 Orthologs,Paralogs & Xenologs  Orthologs,Paralogs & Xenologs
Orthologs,Paralogs & Xenologs
 
NGS: Mapping and de novo assembly
NGS: Mapping and de novo assemblyNGS: Mapping and de novo assembly
NGS: Mapping and de novo assembly
 
Introduction to sequence alignment
Introduction to sequence alignmentIntroduction to sequence alignment
Introduction to sequence alignment
 
Genotyping by Sequencing
Genotyping by SequencingGenotyping by Sequencing
Genotyping by Sequencing
 
Phylogenetic data analysis
Phylogenetic data analysisPhylogenetic data analysis
Phylogenetic data analysis
 
GWAS
GWASGWAS
GWAS
 
Functional genomics
Functional genomicsFunctional genomics
Functional genomics
 
genomic comparison
genomic comparison genomic comparison
genomic comparison
 

Destacado

Genome assembly: then and now — v1.2
Genome assembly: then and now — v1.2Genome assembly: then and now — v1.2
Genome assembly: then and now — v1.2Keith Bradnam
 
SyMAP Master's Thesis Presentation
SyMAP Master's Thesis PresentationSyMAP Master's Thesis Presentation
SyMAP Master's Thesis Presentationaustinps
 
Marker devt. workshop 27022012
Marker devt. workshop 27022012Marker devt. workshop 27022012
Marker devt. workshop 27022012Koppolu Ravi
 
Introduction to Bioinformatics
Introduction to BioinformaticsIntroduction to Bioinformatics
Introduction to BioinformaticsLeighton Pritchard
 
All kmers are not created equal: recognizing the signal from the noise in lar...
All kmers are not created equal: recognizing the signal from the noise in lar...All kmers are not created equal: recognizing the signal from the noise in lar...
All kmers are not created equal: recognizing the signal from the noise in lar...wltrimbl
 

Destacado (6)

Genome assembly: then and now — v1.2
Genome assembly: then and now — v1.2Genome assembly: then and now — v1.2
Genome assembly: then and now — v1.2
 
In a Different Class?
In a Different Class?In a Different Class?
In a Different Class?
 
SyMAP Master's Thesis Presentation
SyMAP Master's Thesis PresentationSyMAP Master's Thesis Presentation
SyMAP Master's Thesis Presentation
 
Marker devt. workshop 27022012
Marker devt. workshop 27022012Marker devt. workshop 27022012
Marker devt. workshop 27022012
 
Introduction to Bioinformatics
Introduction to BioinformaticsIntroduction to Bioinformatics
Introduction to Bioinformatics
 
All kmers are not created equal: recognizing the signal from the noise in lar...
All kmers are not created equal: recognizing the signal from the noise in lar...All kmers are not created equal: recognizing the signal from the noise in lar...
All kmers are not created equal: recognizing the signal from the noise in lar...
 

Similar a Compare Genomes and Visualize Evolutionary Relationships

Comparative genomics.pdf
Comparative genomics.pdfComparative genomics.pdf
Comparative genomics.pdfshinycthomas
 
DNA Sequencing in Phylogeny
DNA Sequencing in PhylogenyDNA Sequencing in Phylogeny
DNA Sequencing in PhylogenyBikash1489
 
Learning Objectives Define inheritance Define allele and trait Describe.docx
Learning Objectives Define inheritance Define allele and trait Describe.docxLearning Objectives Define inheritance Define allele and trait Describe.docx
Learning Objectives Define inheritance Define allele and trait Describe.docxwrite4
 
Learning Objectives Define inheritance Define allele and trait Describe.docx
Learning Objectives Define inheritance Define allele and trait Describe.docxLearning Objectives Define inheritance Define allele and trait Describe.docx
Learning Objectives Define inheritance Define allele and trait Describe.docxwrite5
 
Bls 303 l1.phylogenetics
Bls 303 l1.phylogeneticsBls 303 l1.phylogenetics
Bls 303 l1.phylogeneticsBruno Mmassy
 
Gene,chromosomes, ppt.pptx
Gene,chromosomes, ppt.pptxGene,chromosomes, ppt.pptx
Gene,chromosomes, ppt.pptxRITIKARana18
 
Hertweck AB3ACBS presentation
Hertweck AB3ACBS presentationHertweck AB3ACBS presentation
Hertweck AB3ACBS presentationKate Hertweck
 
Genomics Technologies
Genomics TechnologiesGenomics Technologies
Genomics TechnologiesSean Davis
 
Genetic Variation and Change.pptx
Genetic Variation and Change.pptxGenetic Variation and Change.pptx
Genetic Variation and Change.pptxssuser1107f4
 
Abraham B. Korol, Lecture presentation
Abraham B. Korol, Lecture presentationAbraham B. Korol, Lecture presentation
Abraham B. Korol, Lecture presentationMoshe Kenigshtein
 
Macromolecule evolution
Macromolecule  evolutionMacromolecule  evolution
Macromolecule evolutionPaula Mills
 
Chapter 20 ppt
Chapter 20 pptChapter 20 ppt
Chapter 20 pptrehman2009
 
Genetic fine str. analysis &amp; complementation
Genetic fine str. analysis &amp; complementationGenetic fine str. analysis &amp; complementation
Genetic fine str. analysis &amp; complementationAjay Kumar Chandra
 
How Can Ngs Forward Research Essay
How Can Ngs Forward Research EssayHow Can Ngs Forward Research Essay
How Can Ngs Forward Research EssayStefanie Yang
 
NAISTビッグデータシンポジウム - バイオ久保先生
NAISTビッグデータシンポジウム - バイオ久保先生NAISTビッグデータシンポジウム - バイオ久保先生
NAISTビッグデータシンポジウム - バイオ久保先生ysuzuki-naist
 
Molecular basis of evolution and softwares used in phylogenetic tree contruction
Molecular basis of evolution and softwares used in phylogenetic tree contructionMolecular basis of evolution and softwares used in phylogenetic tree contruction
Molecular basis of evolution and softwares used in phylogenetic tree contructionUdayBhanushali111
 

Similar a Compare Genomes and Visualize Evolutionary Relationships (20)

Comparitive genomics
Comparitive genomicsComparitive genomics
Comparitive genomics
 
U1 and U2 Exam Review from 28May
U1 and U2 Exam Review from 28MayU1 and U2 Exam Review from 28May
U1 and U2 Exam Review from 28May
 
Comparative genomics.pdf
Comparative genomics.pdfComparative genomics.pdf
Comparative genomics.pdf
 
DNA Sequencing in Phylogeny
DNA Sequencing in PhylogenyDNA Sequencing in Phylogeny
DNA Sequencing in Phylogeny
 
Gene structure and expression
Gene structure and expressionGene structure and expression
Gene structure and expression
 
Learning Objectives Define inheritance Define allele and trait Describe.docx
Learning Objectives Define inheritance Define allele and trait Describe.docxLearning Objectives Define inheritance Define allele and trait Describe.docx
Learning Objectives Define inheritance Define allele and trait Describe.docx
 
Learning Objectives Define inheritance Define allele and trait Describe.docx
Learning Objectives Define inheritance Define allele and trait Describe.docxLearning Objectives Define inheritance Define allele and trait Describe.docx
Learning Objectives Define inheritance Define allele and trait Describe.docx
 
Bls 303 l1.phylogenetics
Bls 303 l1.phylogeneticsBls 303 l1.phylogenetics
Bls 303 l1.phylogenetics
 
Gene,chromosomes, ppt.pptx
Gene,chromosomes, ppt.pptxGene,chromosomes, ppt.pptx
Gene,chromosomes, ppt.pptx
 
Hertweck AB3ACBS presentation
Hertweck AB3ACBS presentationHertweck AB3ACBS presentation
Hertweck AB3ACBS presentation
 
Genomics Technologies
Genomics TechnologiesGenomics Technologies
Genomics Technologies
 
Hertweck uva2012
Hertweck uva2012Hertweck uva2012
Hertweck uva2012
 
Genetic Variation and Change.pptx
Genetic Variation and Change.pptxGenetic Variation and Change.pptx
Genetic Variation and Change.pptx
 
Abraham B. Korol, Lecture presentation
Abraham B. Korol, Lecture presentationAbraham B. Korol, Lecture presentation
Abraham B. Korol, Lecture presentation
 
Macromolecule evolution
Macromolecule  evolutionMacromolecule  evolution
Macromolecule evolution
 
Chapter 20 ppt
Chapter 20 pptChapter 20 ppt
Chapter 20 ppt
 
Genetic fine str. analysis &amp; complementation
Genetic fine str. analysis &amp; complementationGenetic fine str. analysis &amp; complementation
Genetic fine str. analysis &amp; complementation
 
How Can Ngs Forward Research Essay
How Can Ngs Forward Research EssayHow Can Ngs Forward Research Essay
How Can Ngs Forward Research Essay
 
NAISTビッグデータシンポジウム - バイオ久保先生
NAISTビッグデータシンポジウム - バイオ久保先生NAISTビッグデータシンポジウム - バイオ久保先生
NAISTビッグデータシンポジウム - バイオ久保先生
 
Molecular basis of evolution and softwares used in phylogenetic tree contruction
Molecular basis of evolution and softwares used in phylogenetic tree contructionMolecular basis of evolution and softwares used in phylogenetic tree contruction
Molecular basis of evolution and softwares used in phylogenetic tree contruction
 

Más de Leighton Pritchard

Little Rotters: Adventures With Plant-Pathogenic Bacteria
Little Rotters: Adventures With Plant-Pathogenic BacteriaLittle Rotters: Adventures With Plant-Pathogenic Bacteria
Little Rotters: Adventures With Plant-Pathogenic BacteriaLeighton Pritchard
 
Reverse-and forward-engineering specificity of carbohydrate-processing enzymes
Reverse-and forward-engineering specificity of carbohydrate-processing enzymesReverse-and forward-engineering specificity of carbohydrate-processing enzymes
Reverse-and forward-engineering specificity of carbohydrate-processing enzymesLeighton Pritchard
 
Whole genome taxonomic classi cation for prokaryotic plant pathogens
Whole genome taxonomic classication for prokaryotic plant pathogensWhole genome taxonomic classication for prokaryotic plant pathogens
Whole genome taxonomic classi cation for prokaryotic plant pathogensLeighton Pritchard
 
Microbial Genomics and Bioinformatics: BM405 (2015)
Microbial Genomics and Bioinformatics: BM405 (2015)Microbial Genomics and Bioinformatics: BM405 (2015)
Microbial Genomics and Bioinformatics: BM405 (2015)Leighton Pritchard
 
Microbial Agrogenomics 4/2/2015, UK-MX Workshop
Microbial Agrogenomics 4/2/2015, UK-MX WorkshopMicrobial Agrogenomics 4/2/2015, UK-MX Workshop
Microbial Agrogenomics 4/2/2015, UK-MX WorkshopLeighton Pritchard
 
BM405 Lecture Slides 21/11/2014 University of Strathclyde
BM405 Lecture Slides 21/11/2014 University of StrathclydeBM405 Lecture Slides 21/11/2014 University of Strathclyde
BM405 Lecture Slides 21/11/2014 University of StrathclydeLeighton Pritchard
 
Highly Discriminatory Diagnostic Primer Design From Whole Genome Data
Highly Discriminatory Diagnostic Primer Design From Whole Genome DataHighly Discriminatory Diagnostic Primer Design From Whole Genome Data
Highly Discriminatory Diagnostic Primer Design From Whole Genome DataLeighton Pritchard
 
ICSB 2013 - Visits Abroad Report
ICSB 2013 - Visits Abroad ReportICSB 2013 - Visits Abroad Report
ICSB 2013 - Visits Abroad ReportLeighton Pritchard
 
Adventures in Bioinformatics (2012)
Adventures in Bioinformatics (2012)Adventures in Bioinformatics (2012)
Adventures in Bioinformatics (2012)Leighton Pritchard
 
Golden Rules of Bioinformatics
Golden Rules of BioinformaticsGolden Rules of Bioinformatics
Golden Rules of BioinformaticsLeighton Pritchard
 
Plant Pathogen Genome Data: My Life In Sequences
Plant Pathogen Genome Data: My Life In SequencesPlant Pathogen Genome Data: My Life In Sequences
Plant Pathogen Genome Data: My Life In SequencesLeighton Pritchard
 
Repeatable plant pathology bioinformatic analysis: Not everything is NGS data
Repeatable plant pathology bioinformatic analysis: Not everything is NGS dataRepeatable plant pathology bioinformatic analysis: Not everything is NGS data
Repeatable plant pathology bioinformatic analysis: Not everything is NGS dataLeighton Pritchard
 
What makes the enterobacterial plant pathogen Pectobacterium atrosepticum dif...
What makes the enterobacterial plant pathogen Pectobacterium atrosepticum dif...What makes the enterobacterial plant pathogen Pectobacterium atrosepticum dif...
What makes the enterobacterial plant pathogen Pectobacterium atrosepticum dif...Leighton Pritchard
 
Rapid generation of E.coli O104:H4 PCR diagnostics
Rapid generation of E.coli O104:H4 PCR diagnosticsRapid generation of E.coli O104:H4 PCR diagnostics
Rapid generation of E.coli O104:H4 PCR diagnosticsLeighton Pritchard
 
Mining Plant Pathogen Genomes for Effectors
Mining Plant Pathogen Genomes for EffectorsMining Plant Pathogen Genomes for Effectors
Mining Plant Pathogen Genomes for EffectorsLeighton Pritchard
 
Comparative Genomics and Visualisation - Part 2
Comparative Genomics and Visualisation - Part 2Comparative Genomics and Visualisation - Part 2
Comparative Genomics and Visualisation - Part 2Leighton Pritchard
 
A Systems Biology Perspective on Plant-Pathogen Interactions 2012-05-08, Turin
A Systems Biology Perspective on Plant-Pathogen Interactions 2012-05-08, TurinA Systems Biology Perspective on Plant-Pathogen Interactions 2012-05-08, Turin
A Systems Biology Perspective on Plant-Pathogen Interactions 2012-05-08, TurinLeighton Pritchard
 

Más de Leighton Pritchard (20)

RDVW Hands-on session: Python
RDVW Hands-on session: PythonRDVW Hands-on session: Python
RDVW Hands-on session: Python
 
Little Rotters: Adventures With Plant-Pathogenic Bacteria
Little Rotters: Adventures With Plant-Pathogenic BacteriaLittle Rotters: Adventures With Plant-Pathogenic Bacteria
Little Rotters: Adventures With Plant-Pathogenic Bacteria
 
Pathogen Genome Data
Pathogen Genome DataPathogen Genome Data
Pathogen Genome Data
 
Reverse-and forward-engineering specificity of carbohydrate-processing enzymes
Reverse-and forward-engineering specificity of carbohydrate-processing enzymesReverse-and forward-engineering specificity of carbohydrate-processing enzymes
Reverse-and forward-engineering specificity of carbohydrate-processing enzymes
 
Whole genome taxonomic classi cation for prokaryotic plant pathogens
Whole genome taxonomic classication for prokaryotic plant pathogensWhole genome taxonomic classication for prokaryotic plant pathogens
Whole genome taxonomic classi cation for prokaryotic plant pathogens
 
Microbial Genomics and Bioinformatics: BM405 (2015)
Microbial Genomics and Bioinformatics: BM405 (2015)Microbial Genomics and Bioinformatics: BM405 (2015)
Microbial Genomics and Bioinformatics: BM405 (2015)
 
Microbial Agrogenomics 4/2/2015, UK-MX Workshop
Microbial Agrogenomics 4/2/2015, UK-MX WorkshopMicrobial Agrogenomics 4/2/2015, UK-MX Workshop
Microbial Agrogenomics 4/2/2015, UK-MX Workshop
 
BM405 Lecture Slides 21/11/2014 University of Strathclyde
BM405 Lecture Slides 21/11/2014 University of StrathclydeBM405 Lecture Slides 21/11/2014 University of Strathclyde
BM405 Lecture Slides 21/11/2014 University of Strathclyde
 
Sequencing and Beyond?
Sequencing and Beyond?Sequencing and Beyond?
Sequencing and Beyond?
 
Highly Discriminatory Diagnostic Primer Design From Whole Genome Data
Highly Discriminatory Diagnostic Primer Design From Whole Genome DataHighly Discriminatory Diagnostic Primer Design From Whole Genome Data
Highly Discriminatory Diagnostic Primer Design From Whole Genome Data
 
ICSB 2013 - Visits Abroad Report
ICSB 2013 - Visits Abroad ReportICSB 2013 - Visits Abroad Report
ICSB 2013 - Visits Abroad Report
 
Adventures in Bioinformatics (2012)
Adventures in Bioinformatics (2012)Adventures in Bioinformatics (2012)
Adventures in Bioinformatics (2012)
 
Golden Rules of Bioinformatics
Golden Rules of BioinformaticsGolden Rules of Bioinformatics
Golden Rules of Bioinformatics
 
Plant Pathogen Genome Data: My Life In Sequences
Plant Pathogen Genome Data: My Life In SequencesPlant Pathogen Genome Data: My Life In Sequences
Plant Pathogen Genome Data: My Life In Sequences
 
Repeatable plant pathology bioinformatic analysis: Not everything is NGS data
Repeatable plant pathology bioinformatic analysis: Not everything is NGS dataRepeatable plant pathology bioinformatic analysis: Not everything is NGS data
Repeatable plant pathology bioinformatic analysis: Not everything is NGS data
 
What makes the enterobacterial plant pathogen Pectobacterium atrosepticum dif...
What makes the enterobacterial plant pathogen Pectobacterium atrosepticum dif...What makes the enterobacterial plant pathogen Pectobacterium atrosepticum dif...
What makes the enterobacterial plant pathogen Pectobacterium atrosepticum dif...
 
Rapid generation of E.coli O104:H4 PCR diagnostics
Rapid generation of E.coli O104:H4 PCR diagnosticsRapid generation of E.coli O104:H4 PCR diagnostics
Rapid generation of E.coli O104:H4 PCR diagnostics
 
Mining Plant Pathogen Genomes for Effectors
Mining Plant Pathogen Genomes for EffectorsMining Plant Pathogen Genomes for Effectors
Mining Plant Pathogen Genomes for Effectors
 
Comparative Genomics and Visualisation - Part 2
Comparative Genomics and Visualisation - Part 2Comparative Genomics and Visualisation - Part 2
Comparative Genomics and Visualisation - Part 2
 
A Systems Biology Perspective on Plant-Pathogen Interactions 2012-05-08, Turin
A Systems Biology Perspective on Plant-Pathogen Interactions 2012-05-08, TurinA Systems Biology Perspective on Plant-Pathogen Interactions 2012-05-08, Turin
A Systems Biology Perspective on Plant-Pathogen Interactions 2012-05-08, Turin
 

Último

Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfSumit Kumar yadav
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINsankalpkumarsahoo174
 

Último (20)

Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 

Compare Genomes and Visualize Evolutionary Relationships

  • 1. Compara've  Genomics  and   Visualisa'on  –  Part  1   Leighton  Pritchard  
  • 2. Part  1   l What  is  compara've  genomics?   l Levels  of  genome  comparison   l  bulk,  whole  sequence,  features   l A  Brief  History  of  Compara've  Genomics   l  experimental  compara;ve  genomics   l Computa'onal  Compara've  Genomics   l  Bulk  proper;es   l  Whole  genome  comparisons   l Part  2   l  Genome  feature  comparisons  
  • 3. What  is  Compara've  Genomics?   The  combina'on  of  genomic  data  and   compara've  and  evolu'onary  biology  to   address  ques'ons  of  genome  structure,   evolu'on  and  func'on.  
  • 4. What  is  Compara've  Genomics?     “Nothing  in  biology  makes  sense,  except   in  the  light  of  evolu9on”   Theodosius  Dobzhansky  
  • 5. Why  Compara've  Genomics?   l Genomes  describe  heritable  characteris;cs   l Related  organisms  share  ancestral  genomes   l Func;onal  elements  encoded  in  genomes   are  common  to  related  organisms   l Func;onal  understanding  of  model  systems   (E.  coli,  A.  thaliana,  D.  melanogaster)  can  be   transferred  to  non-­‐model  systems  on  the   basis  of  genome  comparisons   l Genome  comparisons  can  be  informa;ve,   even  for  distantly-­‐related  organisms  
  • 6. Why  Compara've  Genomics?   l BUT:   l  Context:  epigene;cs,  ;ssue   differen;a;on,  mesoscale  systems,  etc.   l  Phenotypic  plas'city:  responses  to   temperature,  stress,  environment,  etc.  
  • 7. Why  Compara've  Genomics?   l Genomic  differences  can  underpin  phenotypic   (morphological  or  physiological)  differences.   l Where  phenotypes  or  other  organism-­‐level   proper;es  are  known,  comparison  of   genomes  may  give  mechanis;c  or  func;onal   insight  into  differences  (e.g.  GWAS).   l Genome  comparisons  aid  iden;fica;on  of   func;onal  elements  on  the  genome.   l Studying  genomic  changes  reveals   evolu;onary  processes  and  constraints.    
  • 8. Why  Compara've  Genomics?   Adapted  from  Hardison  (2003)  PLoS  Biol.  doi:10.1371/journal.pbio.0000058   species   'me   contemporary   organisms   l  Comparison  within  species  (e.g.  isolate-­‐level  –  or  even  within  individuals):   which  genome  features  may  account  for  unique  characteris;cs  of  organisms/ tumours?  Epigene;cs  in  an  individual.  
  • 9. Why  Compara've  Genomics?   genus   'me   contemporary   organisms   l Comparison  within  genus  (e.g.  species-­‐level):  what  genome  features   show  evidence  of  selec;ve  pressure,  and  in  which  species?  
  • 10. Why  Compara've  Genomics?   subgroup   'me   contemporary   organisms   l Comparison  within  subgroup  (e.g.  genus-­‐level):  what  are  the  core  set   of  genome  features  that  define  a  subgroup  or  genus?  
  • 11. The  E.coli  long-­‐term  evolu'on  experiment   l Run  by  the  Lenski  lab,  Michigan  State  University  since  1988   l  hVp://myxo.css.msu.edu/ecoli/   l 12  flasks,  citrate  usage  selec;on   l 50,000  genera;ons  of  Escherichia  coli!   l  Cultures  propagated  every  day   l  Every  500  genera;ons  (75  days),     mixed-­‐popula;on  samples  stored   l  Mean  fitness  es;mated  at  500     genera;on  intervals   Jeong  et  al.  (2009)  J.  Mol.  Biol.  doi:10.1016/j.jmb.2009.09.052   Barrick  et  al.  (2009)  Nature  doi:10.1038/nature08480   Wiser  et  al.  (2013)  Science.  doi:10.1126/science.1243357  
  • 12. Compara've  Genomics  in  the  News   Sankaraman  et  al.  (2014)  Nature.  doi:10.1038/nature12961   l Neanderthal  alleles:   l  Aid  adapta;on  outwith  Africa   l  Associated  with  disease  risk   l  Reduce  male  fer;lity  
  • 13. Levels  of  Genome  Comparison   Genomes  are  complex,  and  can  be   compared  on  a  range  of  conceptual  levels   -­‐  both  prac'cally  and  in  silico.  
  • 14. Three  broad  levels  of  comparison   l Bulk  Proper;es   l  chromosome/plasmid  counts  and  sizes,     l  nucleo;de  content,  etc.   l Whole  Genome  Sequence   l  sequence  similarity   l  organisa;on  of  genomic  regions  (synteny),  etc.   l Genome  Features/Func;onal  Components   l  numbers  and  types  of  features  (genes,  ncRNA,  regulatory   elements,  etc.)   l  organisa;on  of  features  (synteny,  operons,  regulons,  etc.)   l  complements  of  features   l  selec;on  pressure,  etc.  
  • 15. A  Brief  History  of  Experimental   Compara've  Genomics   You  don’t  have  to  sequence  genomes  to   compare  them  (but  it  helps).  
  • 16. Genome  Comparisons  Predate  NGS   l Sequence  data  was  not  always  cheap  and  abundant   l Prac;cal,  experimental  genome  comparisons  were  needed  
  • 17. Bulk  Genome  Property  Comparisons   Values  calculated  for  individual  genomes,   and  subsequently  compared.  
  • 18. Bulk  Genome  Proper'es   l  Large-­‐scale  summary  measurements   l  Measure  genomes  independently  –  compare  values  later   l  Number  of  chromosomes   l  Ploidy   l  Chromosome  size   l  Nucleo;de  (A,  C,  G,  T)  frequency/percentage  
  • 19. Chromosome  Counts/Size   l  The  chromosome  counts/ploidy  of  organisms  can  vary  widely   l  Escherichia  coli:  1  (but  plasmids…)   l  Rice  (Oryza  sa6va):  24  (but  mitochondria,  plas;ds  etc…)   l  Human  (Homo  sapiens):  46,  diploid   l  Adders-­‐tongue  (Ophioglossum  re6culatum):  up  to  1260   l  Domes;c  (but  not  wild)  wheat  soma;c  cells  hexaploid,  gametes  haploid   l  Physical  genome  size  (related  to  sequence  length)     can  also  vary  greatly   l  Genome  size  and  chromosome  count     do  not  indicate  organism  ‘complexity’   l  S;ll  surprises  to  be  found  in  physical   study  of  chromosomes!  (e.g.  Hi-­‐C)   Kamisugi  et  al.  (1993)  Chromosome  Res.  1(3):  189-­‐96   Wang  et  al.  (2013)  Nature  Rev  Genet.  doi:10.1038/nrg3375  
  • 20. Nucleo'de  Content   l Experimental  approaches  for  accurate  measurement   l  e.g.  use  radiolabelled  monophosphates,  calculate  propor;ons  using   chromatography   Karl  (1980)  Microbiol.  Rev.  44(4)  739-­‐796   Krane  et  al.  (1991)  Nucl.  Acids  Res.  doi:10.1093/nar/19.19.5181  
  • 21. Whole  Genome  Comparisons   Comparisons  of  one  whole  or  drac   genome  with  another  (or  many  others)  
  • 22. Whole  Genome  Comparisons   l  Requires  two  genomes:  “reference”  and  “comparator”   l  Experiment  produces  a  compara;ve  result,  dependent  on  the   choice  of  genomes   l  Methods  mostly  based  around  direct  or  indirect  DNA   hybridisa;on   l  DNA-­‐DNA  hybridisa;on   l  Compara;ve  Genomic  Hybridisa;on  (CGH)   l  Array  Compara;ve  Genomic  Hybridisa;on  (aCGH)  
  • 23. DNA-­‐DNA  Hybridisa'on  (DDH)   l Several  methods  based  around  the  same  principle   1.  Denature  organism  A,  B     genomic  DNA  mixture   2.  Allow  to  anneal  –  hybrids  result     (reassocia;on  ≈  similarity)   Morelló-­‐Mora  &  Amann  (2001)  FEMS  Microbiol.  Rev.  doi:10.1016/S0168-­‐6445(00)00040-­‐1  
  • 24. DNA-­‐DNA  Hybridisa'on  (DDH)   l  Several  methods  -­‐  same  principle   1.  Find  homoduplex  Tm1   2.  Denature  reference,  comparator   gDNA  +  mix   3.  Allow  to  anneal  –  hybrids  result     (reassocia;on  ≈  similarity),  find     heteroduplex  Tm2   4.  ∆Tm  =  Tm1  –  Tm2   5.  High  ∆T  implies  greater  genomic   difference  (fewer  H-­‐bonds)   l  Proxy  for  sequence  similarity   Morelló-­‐Mora  &  Amann  (2001)  FEMS  Microbiol.  Rev.  doi:10.1016/S0168-­‐6445(00)00040-­‐1  
  • 25. DNA-­‐DNA  Hybridisa'on  (DDH)   l Used  for  taxonomic  classifica;on  in  prokaryotes  from  1960s   l Sibley  &  Ahlquist  redefined  bird  and  primate  phylogeny  with     DDH  in  1980s:  Homo  shares  more  recent  common  ancestor  with  Pan   than  with  Gorilla  (this  was  previously  in  dispute)   Sibley  &  Ahlquist  (1984)  J.  Mol.  Evol.  doi:10.1007/BF02101980  
  • 26. Compara've  Genomic  Hybridisa'on   l  Two  genomes:  “reference”  and  “test”  are  labelled  (red  and  green  –     a  bad  conven6on  to  choose,  for  visualisa6on),  then  hybridised  against  a   third  “normal”  genome   l  Differences  in  red/green  intensity  mapped  by  microscopy  correspond  to   rela;ve  rela;onship  of  reference  and  test  to  “normal”  genome   l  Comparisons  within  species  (or  individual,  for  tumours);  copy  number   varia'ons  (CNV)   l  Labour-­‐intensive,  low-­‐resolu;on  
  • 27. Compara've  Genomic  Hybridisa'on   l Image  analysis  required  –  intensity  along  medial  axis.   Kallioniemi  et  al.  (1992)  Science  doi:10.1126/science.1359641   Fraga  et  al.  (2005)  Proc.  Natl.  Acad.  Sci.  USA  doi:10.1073/pnas.0500398102   Epigene'cs:  hybridising     methylated  DNA  
  • 28. Array  Compara've  Genomic  Hybridisa'on   l  Uses  DNA  microarrays:  thousands  of  short  DNA  probes  (genome     fragments)  immobilised  on  a  surface   l  gDNA,  cDNA,  etc.  fluorescently-­‐labelled  and  hybridised  to  the  array   l  Smaller  sample  sizes  cf.  CGH,     automatable,  high-­‐throughput,  high-­‐res   l  Iden'fies  copy  number  varia'on  (CNV)   and  segmental  duplica'on   Pollack  et  al.  (1999)  Nat.  Genet.  doi:10.1038/12640  
  • 29. Genome  Feature  Comparisons   Comparisons  on  the  basis  of  a  restricted   set  of  genome  features  
  • 30. Chromosomal  Rearrangements   l  Genomes  are  dynamic,  and  undergo  large-­‐scale  changes   l  Hybridisa;on  used  to  map  genome  rearrangement/duplica;on   l  Separate  chromosomes  electrophore;cally   l  Apply  single  gene  hybridising  probes   l  Reciprocal  hybridisa;ons  indicate  transloca;ons   Fischer  et  al.  (2000)  Nature.  doi:10.1038/35013058  
  • 31. Diagnos'c  PCR/MLST   l  Define  a  set  of  regions  (usually  genes):   l  conserved  enough  that  PCR  primers  can   be  designed  to  amplify  the  same  region   in  mul;ple  organisms   l  and:   l  divergent  enough  that  hybridising   probes  can  dis;nguish  between  groups   l  or:   l  sequence  the  amplifica;on  products   l  Sequence  variants  given  numbers   l  Number  profiles  define  groups   l  Track  evolu;on  by  minimum  spanning   trees  (MST)   l  hVp://pubmlst.org/   Maiden  et  al.  (2006)  Ann.  Rev.  Microbiol.  doi:10.1146/annurev.micro.59.030804.121325  
  • 32. l  aCGH  can  also  be  applied  across  species  for  classifica'on/diagnos'cs:   l  Microarray  probes  represent  genes     from  one  or  more  organisms   l  “Off-­‐species”  gDNA  fragmented,     labelled,  and  hybridised   l  Hybridisa;on  ≈  sequence     similarity  ≈  gene  presence   l  Heatmap  of  217  Staphylococcus   aureus  isolates  on  7-­‐strain  array.   l  columns=isolates   l  yellow/red=gene  present   l  blue/white/grey=gene  absent   l  Lower  bars  coloured  by  lineage  and  host     (green=caVle,  blue=horse,  purple=human)   Array  Compara've  Genomic  Hybridisa'on   Sung  et  al.  (2008)  Microbiol.  doi:10.1099/mic.0.2007/015289-­‐0  
  • 33. But  This  Happened…   l High-­‐throughput  sequencing  
  • 34. …And  Then  It  Rained  Sequence  Data   l  Modern  high-­‐throughput  sequencing  (454,  Illumina)  completely   changed  the  landscape.   l  Complete,  (mainly)  accurate  sequence     data  much  cheaper,  enabling:   l  more  precise  sequence  comparison   l  novel  analyses,  insights  and     visualisa;ons   l  Genomic  &  exomic  comparisons   l  19/2/2014  at  GOLD:   l  3,011  “finished”  genomes   l  9,891  “permanent  drar”  genomes   l  19/2/2014  at  NCBI  WGS:   l  17,023  whole  genome  projects  
  • 35. …And  Then  It  Rained  Sequence  Data   l In  2012,  GOLD  added  3736  genomes,  NCBI  added  4585   l Mostly  prokaryotes  (archaea  and  bacteria)   l We’re  a  liVle  ahead  of  Su’s  (Scripps,  La  Jolla)  projec;ons   Figures  and  code  from:  hlp://sulab.org/2013/06/sequenced-­‐genomes-­‐per-­‐year/    
  • 36. Computa'onal  Compara've  Genomics   Massively  enabled  by  high-­‐throughput   sequencing,  much  more  powerful  and   precise.  
  • 37. Three  broad  levels  of  comparison   l Bulk  Proper;es   l  chromosome/plasmid  counts  and  sizes,     l  nucleo;de  content,  etc.   l Whole  Genome  Sequence   l  sequence  similarity   l  organisa;on  of  genomic  regions  (rearrangements),  etc.   l Genome  Features/Func;onal  Components   l  numbers  and  types  of  features  (genes,  ncRNA,  regulatory   elements,  etc.)   l  organisa;on  of  features  (synteny,  operons,  regulons,  etc.)   l  complements  of  features   l  selec;on  pressure,  etc.  
  • 38. Bulk  Genome  Property  Comparisons   Values  calculated  for  individual  genomes,   and  subsequently  compared.  
  • 39. Nucleo'de  Frequencies/Genome  Size   l Very  easy  to  calculate  from  complete  or  drar  genome  sequence   l  (or  in  a  region  of  genome  sequence)   l GC  content/chromosome  size  can  be  characteris;c  of  an   organism   l [ACTIVITY]   l  bacteria_size_gc  iPython  notebook   l  ipython notebook –-pylab inline  in   bacteria_size  directory  
  • 40. Blobology   l Metazoan  sequence  data  can  be  contaminated  by  microbial   symbionts.   l  Host  and  symbiont  DNA  have  different  %GC  (and  are  present  in   different  amounts/coverage)   l  Preliminary  genome  assembly,  followed   by  read  mapping   l  Plot  con;g  coverage  against     %GC  =  Blobology   l  hVp://nematodes.org/bioinforma;cs/blobology/   Kumar  &  Blaxter  (2011)  Symbiosis  doi:10.1007/s13199-­‐012-­‐0154-­‐6  
  • 41. Nucleo'de  k-­‐mers   l  Sequence  data  is  required  to  determine  k-­‐mers   l  Nucleo;de  frequencies:     l  A,  C,  G,  T   l  Dinucleo;de  frequencies:     l  AA,  AC,  AG,  AT,  CA,  CC,  CG,  CT,  GA,  GC,  GG,  GT,  TA,  TC,  TG,  TT   l  Trinucleo;de  frequencies:   l  64  trinucleo;des   l  k-­‐nucleo;de  frequencies:   l  4k  k-­‐mers   l  [ACTIVITY]   l  runApp(“shiny/nucleotide_frequencies”)in  RStudio  
  • 42. k-­‐mer  Spectra   l k-­‐mer  spectrum:   l  Frequency  distribu;on  of  observed  k-­‐mer  counts   l  Most  species  have  a  unimodal  k-­‐mer  spectrum   Chor  et  al.  (2009)  Genome  Biol.  doi:10.1186/gb-­‐2009-­‐10-­‐10-­‐r108  
  • 43. k-­‐mer  Spectra   l  k-­‐mer  spectrum:   l  All  mammals  tested  (and  some  other)  species  have  a  mul;modal  k-­‐mer   spectrum   l  Genomic  regions  differ  in  this  property   Chor  et  al.  (2009)  Genome  Biol.  doi:10.1186/gb-­‐2009-­‐10-­‐10-­‐r108  
  • 44. Average  Nucleo'de  Iden'ty  (ANI)   l ANI  introduced  as  a  subs;tute  for  DDH  in  2007:   l  70%  iden;ty  (DDH)  =  “gold  standard”     prokaryo;c  species  boundary   l  70%  iden;ty  (DDH)  ≈  95%  iden;ty  (ANI)   Goris  et  al.  (2007)  Int.  J.  System.  Evol.  Biol.  doi:10.1099/ijs.0.64483-­‐0  
  • 45. Average  Nucleo'de  Iden'ty  (ANI)   l ANI  introduced  as  a  subs;tute  for  DDH  in  2007:   l  70%  iden;ty  (DDH)  =  “gold  standard”     prokaryo;c  species  boundary   l  70%  iden;ty  (DDH)  ≈  95%  iden;ty  (ANI)   l Original  method  emulates  physical   experiment:   1.  break  genome  into  1020nt  fragments   2.  align  fragments  using  BLASTN   3.  ANI  =  mean  iden;ty  of  all  BLASTN     matches  with  >30%  iden;ty  over  70%   alignable  length   Goris  et  al.  (2007)  Int.  J.  System.  Evol.  Biol.  doi:10.1099/ijs.0.64483-­‐0  
  • 46. Average  Nucleo'de  Iden'ty  (ANI)   l ANI  introduced  as  a  subs;tute  for  DDH  in  2007:   l  70%  iden;ty  (DDH)  =  “gold  standard”  prokaryo;c  species  boundary   l  70%  iden;ty  (DDH)  ≈  95%  iden;ty  (ANI)   l ANIm  and  TETRA  introduced  (2009)   1.  Align  sequences  using  NUCmer   2.  ANI  =  mean  %iden;ty  of  matches   l TETRA:   1.  Calculate  tetranucleo;de  frequencies   2.  Determine  each  tetramer  devia;on  from  expecta;on  (Z-­‐score)   3.  TETRA  =  Pearson  correla;on  coefficient  of  tetramer  Z-­‐scores   Richter  &  Rosselló-­‐Móra  (2009)  Proc.  Natl.  Acad.  Sci.  USA  doi:10.1073/pnas.0906412106  
  • 47. Average  Nucleo'de  Iden'ty  (ANI)   l ANIb  discards  useful  informa;on  that  ANIm  retains   l TETRA  reflects  bulk  genome  proper;es  rather  than  selec;on  on   sequence   l  Data  for  Anaplasma  marginale  (3),  A.phagocytophilum  (4),  A.centrale  (1)   l TETRA  scores  are  prone  to  false  posi;ves;  ANIb  scores  are  prone  to   false  nega;ves  
  • 48. Average  Nucleo'de  Iden'ty  (ANI)   l Jspecies  (hVp://www.imedea.uib.es/jspecies/)     l  WebStart   l  java -jar -Xms1024m -Xmx1024m jspecies1.2.1.jar l Python  script   l  scripts/calculate_ani.py l [ACTIVITY]   l  average_nucleotide_identity/README.md  Markdown   Richter  &  Rosselló-­‐Móra  (2009)  Proc.  Natl.  Acad.  Sci.  USA  doi:10.1073/pnas.0906412106  
  • 49. Diagnos'c  PCR/MLST   l PCR/MLST  s;ll  cheap   l  (but  for  how  much  longer?)   l Use  whole  genomes  to  iden;fy  unique/ diagnos;c  regions  for  PCR/MLST   Slezak  et  al.  (2003)  Brief.  Bioinf.  doi:10.1093/bib/4.2.133   Pritchard  et  al.  (2012)  PLoS  One  doi:10.1371/journal.pone.0034498  
  • 50. Whole  Genome  Sequence  Comparisons   Comparisons  of  one  whole  or  drac   genome  sequence  with  another  (or  many   others)  
  • 52. Whole  Genome  Alignment   l Which  genomes  should  you  align?  (or  not  bother  aligning)   l For  reasonable  analysis,  genomes  should:   l  derive  from  a  sufficiently  recent  common  ancestor:  so  that   homologous  regions  can  be  iden;fied.   l  derive  from  a  sufficiently  distant  common  ancestor:  so  that   sufficiently  “interes;ng”  changes  are  likely  to  have  occurred   l  help  answer  your  biological  ques;on:   „ is  your  ques;on  organism  or  phenotype  specific?   „ are  you  inves;ga;ng  a  process?   l This  may  be  more  involved  for  metazoans  (vertebrates,   arthropods,  nematodes,  etc.)  than  prokaryotes…  
  • 53. Whole  Genome  Alignment   l Naïve  alignment  algorithms  (e.g.  Needleman-­‐Wunsch/Smith-­‐ Waterman)  are  not  appropriate:   l  Do  not  handle  rearrangements   l  Computa;onally  expensive  on  large  sequences   l Many  whole-­‐genome  alignment  algorithms  proposed,  including:   l  LASTZ  (hVp://www.bx.psu.edu/~rsharris/lastz/)   l  BLAT  (hVp://genome.ucsc.edu/goldenPath/help/blatSpec.html)   l  Mugsy  (hVp://mugsy.sourceforge.net/)   l  megaBLAST  (hVp://www.ncbi.nlm.nih.gov/blast/html/megablast.html)   l  MUMmer  (hVp://mummer.sourceforge.net/)   l  LAGAN  (hVp://lagan.stanford.edu/lagan_web/index.shtml)   l  WABA,  etc…  
  • 54. Whole  Genome  Alignment   l BLAT   l  BLAT  is  broadly  similar  to  BLAST   l  Main  differences:   „ op;mised  to  find  only  exact  or  near-­‐exact  matches,  for   speed   „ indexes  the  subject  genome,  retains  the  index  and  scans   the  query     „ connects  homologous  match  regions  into  a  single  alignment   (BLAST  reports  them  separately)   „ reports  mRNA  match  intron-­‐exon  boundaries  exactly   (BLAST  tends  to  extend)   l  Advantages:  fast;  exact  exon  boundaries;  UCSC  integra;on   l  Disadvantages:  does  not  find  more  remote/very  divergent   matches   Kent  (2002)  Genome  Res.  doi:10.1101/gr.229202  
  • 55. Whole  Genome  Alignment   l megaBLAST   l  Op;mised  for  speed  over  BLASTN     (see  hVp://www.ncbi.nlm.nih.gov/blast/Why.shtml):   „ genome-­‐level  searches     „ queries  on  large  sequence  sets   „ long  alignments  of  very  similar  sequence  (sequencing  errors/SNPs)   l  Uses  Zhang  et  al.  (2000)  greedy  algorithm   l  Concatenates  queries  to  improve  performance  (“query  packing”)   „ NOTE:  this  is  good  prac'ce  for  large  query  sets!   l  Two  modes:  megaBLAST,  and  discon;nuous  megaBLAST  (dc-­‐megablast)   „ dc-­‐megablast  intended  for  more  divergent  sequences   Zhang  et  al.  (2000)  J.  Comp.  Biol.  7(1-­‐2)  203-­‐14   Korf  et  al.  (2003)  “BLAST”,  O’Reilly  &  Associates,  Sebastopol,  CA  
  • 56. Whole  Genome  Alignment   l MUMmer   l  Uses  suffix  trees  for  paVern  matching:  very  fast  even  for  large  sequences   „ Finds  maximal  exact  matches   „ Memory  use  depends  only  on  reference  sequence  size   Kurtz  et  al.  (2004)  Genome  Biol.  doi:10.1186/gb-­‐2004-­‐5-­‐2-­‐r12  
  • 57. Whole  Genome  Alignment   l MUMmer   l  Uses  suffix  trees  for  paVern  matching:  very  fast  even  for  large  sequences   „ Finds  maximal  exact  matches   „ Memory  use  depends  only  on  reference  sequence  size   l  Suffix  Tree:   l  Can  be  constructed  and  searched  in  O(n)  ;me   l  Useful  algorithms  are  nontrivial   l  BANANA$   „  B  followed  by  ANANA$  only   „  A  followed  by  $,  NA$,  NANA$   „  N  followed  by  A$,  ANA$   Kurtz  et  al.  (2004)  Genome  Biol.  doi:10.1186/gb-­‐2004-­‐5-­‐2-­‐r12  
  • 58. Whole  Genome  Alignment   l MUMmer   l  Process:   „ 1)  Iden;fy  a  non-­‐overlapping  subset  of  maximal  exact  matches:   oren  Maximum  Unique  Matches  (MUMs  -­‐  though  not  always   unique)   „ 2)  Cluster  into  alignment  anchors   „ 3)  Extend  between  anchors  to  produce  a  final  gapped  alignment   l  Very  flexible  approach:  a  suite  of  programs  (mummer, nucmer, promer,  …)   „  nucleo;de  and  “conceptual  protein”  (more  sensi;ve)  alignments   „  used  for  genome  comparisons,  assembly  scaffolding,  repeat   detec;on,  etc.   „  forms  the  basis  for  other  aligners/assemblers,  e.g.  Mugsy,  AMOS   Kurtz  et  al.  (2004)  Genome  Biol.  doi:10.1186/gb-­‐2004-­‐5-­‐2-­‐r12  
  • 59. Whole  Genome  Alignment   l [ACTIVITY]   l  whole_genome_alignments_A.md Markdown   l  hVps://github.com/widdowquinn/Teaching/blob/master/ Compara;ve_Genomics_and_Visualisa;on/Part_1/ whole_genome_alignment/ whole_genome_alignments_A.md  
  • 60. Mul'ple  Genome  Alignment   l Several  tools:   l  Mugsy  (hVp://mugsy.sourceforge.net/)   l  MLAGAN  (hVp://lagan.stanford.edu/lagan_web/index.shtml)   l  TBA/Mul'Z  (hVp://www.bx.psu.edu/miller_lab/)   l  Mauve  (hVp://gel.ahabs.wisc.edu/mauve/)   l Posi;onal  homology  vs.  glocal  
  • 61. Mul'ple  Genome  Alignment   l LAGAN:  rapid  alignment  of  two  homologous   genome  sequences   l  Generate  local  alignments  (anchors,  B)   l  Construct  rough  global  map     (maximal-­‐scoring  ordered  subset,  C)   „ Join  anchors  that  lie  within  a     threshold  distance,  the  same  way   l  Compute  global  alignment  by     dynamic  programming  (D)   Brudno  et  al.  (2003)  Genome  Res.  doi:10.1101/gr.926603  
  • 62. Mul'ple  Genome  Alignment   l MLAGAN:  mul;ple  genome  alignment  of  k   genomes  in  k-­‐1  alignment  steps,  using  a   phylogene;c  tree  (CLUSTAL-­‐like):   l  Make  rough  global  maps  between  each     pair  of  sequences  (step  C  in  LAGAN)   l  Progressive  mul;ple  alignment  with     anchors  (iterated)   1.  Perform  global  alignment  between     closest  pair  of  sequences  with     LAGAN:  alignments  are     “mul6-­‐sequences”   2.  Find  rough  global  maps  of  this  mul6-­‐ sequence  to  all  other  mul6-­‐sequences.   Brudno  et  al.  (2003)  Genome  Res.  doi:10.1101/gr.926603  
  • 63. Human-­‐Mouse-­‐Rat  Alignment   l Three-­‐way  progressive  alignment,  iden;fying:   l  Homologous  (H/M/R),  rodent-­‐only  (M/R)  and  human-­‐ mouse  or  human-­‐rat  (H/M,  H/R)  homologous  regions   l Three-­‐way  synteny   synteny  mapped  to  rat  genome   Brudno  et  al.  (2004)  Genome  Res.  doi:10.1101/gr.2067704   Ini'al  alignments  by  BLAT   Syntenous  regions  aligned  with  LAGAN  
  • 65. Drac  Genome  Alignment   l Whole  genome  alignments  useful  for  scaffolding  assemblies   l  High-­‐throughput  sequence  assemblies  come  in  fragments  (con;gs)   l  Con;gs  can  some;mes  be  ordered  if  paired  reads  or  long  read   technologies  are  used   l  Can  also  align  to  a  known  reference  genome   l MUMmer   l  Can  use  NUCmer  or,  for  more  distant  rela;ons,  PROmer   l Mauve/Progressive  Mauve   l  hVp://gel.ahabs.wisc.edu/mauve/   Darling  et  al.  (2003)  Genome  Res.  doi:10.1101/gr.2289704  
  • 66. Mauve   l  Mauve’s  alignment  algorithm   1.  Find  local  alignments  (mul;-­‐MUMs  –  seed  &   extend)   2.  Construct  phylogene;c  guide  tree  from  mul;-­‐ MUMs   3.  Select  subset  of  mul;-­‐MUMs  as  anchors.   „  Par;;on  anchors  into  Local  Collinear   Blocks  (LCBs)  –  consistently-­‐ordered   subsets   4.  Perform  recursive  anchoring  to  iden;fy   further  anchors   5.  Perform  progressive  alignment  (similar  to   CLUSTAL),  against  guide  tree   l  Mauve  Con;g  Mover  (MCM)  for  ordering  con;gs   Darling  et  al.  (2003)  Genome  Res.  doi:10.1101/gr.2289704  
  • 67. Mauve   l  Mauve  alignment  of  LCBs  in  nine  enterobacterial  genomes   l  Rearrangement  of  homologous  backbone  sequence   Darling  et  al.  (2003)  Genome  Res.  doi:10.1101/gr.2289704  
  • 68. Drac  Genome  Alignment   l [OPTIONAL  ACTIVITY]  (useful  for  exercise)   l  Alignment  and  reordering  of  drar  genome  con;gs   l  whole_genome_alignments_B.md  Markdown   l  hVps://github.com/widdowquinn/Teaching/blob/master/ Compara;ve_Genomics_and_Visualisa;on/Part_1/ whole_genome_alignment/ whole_genome_alignments_B.md   l [ACTIVITY]   l  Visualisa;on  of  whole  genome  alignment  with  Biopython   l  biopython_visualisation  iPython  notebook  
  • 69. Collinearity  and  Synteny   l Rearrangements  may  occur  post-­‐specia;on   l Different  species  s;ll  exhibit  conserva;on  of  sequence   similarity  and  order   l  Two  elements  are  collinear  if  they  lie  in  the  same  linear   sequence   l  Two  elements  are  syntenous  (syntenic)  if:   „ (orig.)  they  lie  on  the  same  chromosome   „ (mod.)  conserva;on  of  blocks  of  order  within  the  same   chromosome   l Signs  of  evolu;onary  constraints,  including  synteny,  may   indicate  func;onal  genome  regions   l More  about  this  in  Part  2,  related  to  genome  features  
  • 70. Syntenous   l example1.png  from  biopython_visualisation   ac;vity  
  • 71. Nonsyntenous   l example2.png  from  biopython_visualisation   ac;vity  
  • 72. Whole  Genome  Duplica'on   l Puffer  fish  Tetraodon  nigroviridis  (smallest  known  vertebrate  genome)   l  Whole-­‐genome  duplica;on,  subsequent  to  divergence  from  mammals.   l  Ancestral  vertebrate  genome  inferred  to  have  12  chromosomes.   Duplicated  genes  (ExoFish)  on  21  chromosomes   Jaillon  et  al.  (2004)  Nature  doi:10.1038/nature03025  
  • 73. VISTA,  mVISTA,  VISTA-­‐Point   l Alignment/visualisa;on  tools:     l  hVp://genome.lbl.gov/vista/index.shtml   l mVISTA:  align  and  compare  submiVed  sequences  (up  to  2Mbp)   l VISTA-­‐Point:  visualise  precomputed  alignments   Frazer  et  al.  (2004)  Nucl.  Acids  Res.  doi:10.1093/nar/gkh458  
  • 74. UCSC   l hVp://genome.ucsc.edu/   l Many  vertebrate/invertebrate  model  genomes   Kent  et  al.  (2002)  Genome  Res.  doi:10.1101/gr.229102  
  • 75. Conclusion   l Physical  and  computa;onal  genome  comparisons:   l  Similar  biological  ques;ons  -­‐>  similar  concepts   l Lots  of  sequence  data  in  modern  biology   l Conserva;on  ≈  evolu;onary  constraint   l Many  choices  of  algorithms/analysis  sorware   l Many  choices  of  visualisa;on  sorware/tools   l Coming  in  Part  2:  genomic  func;onal  elements  
  • 76. Credits   l This  slideshow  is  shared  under  a  Crea;ve  Commons   AVribu;on  4.0  License   hVp://crea;vecommons.org/licenses/by/4.0/)   l Copyright  is  held  by  The  James  HuVon  Ins;tute   hVp://www.huVon.ac.uk   l You  may  freely  use  this  material  in  research,  papers,  and   talks  so  long  as  acknowledgement  is  made.    
  • 77. Nucleo'de  Content   l A,  C,  G,  T  composi;on   l  Varies  between,  and  within  genomes   l  staining  varies  across  genomes,  due  to     varia;on  in  GC  content   l “isochores”:  regions  with  liVle   internal  GC  varia;on  (homogeneous)   „   long  a  point  of  discussion     –  difficult  to  define   l In  humans:   l  L1,  L2  isochores:  low  GC  (≲41%)   l  H1,  H2,  H3  isochores:  high  GC  (≳41%)   l  Imprecise  bulk  measurement   Sadoni  et  al.  (1999)  J.  Cell  Biol.  doi:10.1083/jcb.146.6.1211   hybridisa;on  of  H3  isochore  to  human  genome  
  • 78. DNA-­‐DNA  Hybridisa'on  (DDH)   l Used  for  taxonomic  classifica;on  in  prokaryotes  from  1960s   l Sibley  &  Ahlquist  redefined  bird  and  primate  phylogeny  with     DDH  in  1980s:     l Not  without  controversy:   „ Sugges;ons  of  data  manipula;on     (see  here)   „ Close  evolu;onary  rela;onships     difficult  to  resolve  due  to  paralogy     (more  on  paralogy  later…)   l S;ll  hanging  on  as  a  de  facto  “gold     standard”  in  microbiological  taxonomic     classifica;on.   Sibley  &  Ahlquist  (1987)  J.  Mol.  Evol.  doi:10.1007/BF02111285  
  • 79. Finding  isochores   l Isochores:  homogeneous  regions  of  %GC  content   l  Easy  to  find  with  windowed  (100kbp)   %GC  calcula;on,  from  sequenced     genomes.   l  3200  isochores  characterised  in  the     human  genome,  consistent  with  5     levels  (L1,  L2,  H1,  H2,  H3)  found     by  staining/hybridisa;on.     Costan'ni  et  al.  (2006)  Genome  Res.  doi:10.1101/gr.4910606  
  • 80. Compara've  Genomic  Hybridisa'on   l  Two  genomes:  “reference”  and  “test”  labelled  (red  and  green),   then  hybridised  against  a  “normal”  genome   l  semiquan'ta've:   l  Red:  loss  (<2  copies)  in  tumour   l  Green:  gain  (3-­‐4  copies)  in  tumour   l  Amplifica;ons  (>4  copies)  in  BOLD   l  Cases  with  the  same  Copy  Number     Aberra;on  (CNA)  are  numbered   De  Bortoli  et  al.  (2006)  BMC  Cancer  doi:10.1186/1471-­‐2407-­‐6-­‐223  
  • 81. l Early  approaches  took  a  threshold  score  (present/absent)   l Later  approaches  used  known  reference  genome  sequence     context  (HMMs,  synteny)  to  improve  presence/absence  calls   l  No  hybridisa;on  =  “absent”  or“divergent”?   l  Not  nearly  as  good  as  sequencing  directly!   Array  Compara've  Genomic  Hybridisa'on   Pritchard  et  al.  (2009)  PLoS  Comp.  Biol.  doi:10.1371/journal.pcbi.1000473  
  • 82. k-­‐mer  Spectra   l k-­‐mer  spectrum:   l  CpG  suppression  (CGs  are  uncommon  in  vertebrate  genomes),   but  (by  simula;on)  only  when  in  combina;on  with  a  par;cular   %GC,  explains  mul;modality   Chor  et  al.  (2009)  Genome  Biol.  doi:10.1186/gb-­‐2009-­‐10-­‐10-­‐r108