3. Edit Distance Model (1)
Weighted sum of insertions, deletions &
mutations to transform one string into
another
AGGCACA--CA
|  ||||  ||
A--CACATTCA

or

AGGCACACA
|  ||  ||
ACACATTCA
4. Edit Distance Model (2)
Given: x, y
Define: F(i,j) = Score of best alignment of x_1…x_i to y_1…y_j
Recurrence: F(i,j) = max( F(i-1,j) - GAP_PENALTY,
                          F(i,j-1) - GAP_PENALTY,
                          F(i-1,j-1) + SCORE(x_i, y_j) )
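A minimal sketch of this recurrence in Python, for illustration only. The slide does not give boundary conditions or concrete scores, so the first row and column (a prefix aligned against nothing costs one gap per character) and the values of `gap_penalty`, `match` and `mismatch` are assumptions.

```python
def global_alignment_score(x, y, gap_penalty=2, match=1, mismatch=-1):
    """Fill the table F(i,j) and return F(len(x), len(y))."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary: aligning a prefix against the empty string is all gaps.
    for i in range(1, n + 1):
        F[i][0] = -i * gap_penalty
    for j in range(1, m + 1):
        F[0][j] = -j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # SCORE(x_i, y_j): a simple match/mismatch scheme (assumed).
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j] - gap_penalty,   # x_i aligned to a gap
                F[i][j - 1] - gap_penalty,   # y_j aligned to a gap
                F[i - 1][j - 1] + s,         # x_i aligned to y_j
            )
    return F[n][m]


if __name__ == "__main__":
    # The two strings from the first edit distance slide.
    print(global_alignment_score("AGGCACACA", "ACACATTCA"))
```

The table has (n+1) x (m+1) cells and each cell takes constant time to fill, which gives quadratic time for two sequences.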
5. Edit Distance Model (3)
F(i,j) = Score of best alignment ending at i,j
Time O(n^2) for two seqs, O(n^k) for k seqs
[Figure: dynamic programming matrix over the two sequences below; cell F(i,j) is computed from F(i-1,j), F(i,j-1) and F(i-1,j-1)]
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
6. Overview
• Local Alignment (CHAOS)
• Multiple Global Alignment (LAGAN)
- Whole Genome Alignment
• Glocal Alignment (Shuffle-LAGAN)
• Biological Story
9. CHAOS: CHAins Of Seeds
1. Find short matching words (seeds)
2. Chain them
3. Rescore chain
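Step 1 can be sketched as follows, assuming exact k-mer matches found through a dictionary index. This is an illustration only: the function name `find_seeds`, the default `k`, and the use of exact matching are assumptions, not a description of how CHAOS itself stores or matches words.

```python
from collections import defaultdict


def find_seeds(seq1, seq2, k=4):
    """Return (pos1, pos2) start positions of every exact k-mer shared by seq1 and seq2."""
    # Index every k-mer of seq2 by its start position.
    index = defaultdict(list)
    for j in range(len(seq2) - k + 1):
        index[seq2[j:j + k]].append(j)
    # Walk along seq1 and look each k-mer up in the index.
    seeds = []
    for i in range(len(seq1) - k + 1):
        for j in index.get(seq1[i:i + k], ()):
            seeds.append((i, j))
    return seeds


if __name__ == "__main__":
    print(find_seeds("ACGTAC", "TACGTT", k=4))
```

Seeds come out ordered by their position in seq1, which is the order in which the chaining step visits them.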
16. CHAOS: Chaining the Seeds
• Find seeds at current location in seq1
• Find the previous seeds that fall into the search box
• Do a range query: seeds are indexed by their diagonal
• Pick a previous seed that maximizes the score of the chain
[Figure: seeds plotted for seq1 against seq2, with labels: location in seq1, distance cutoff, gap cutoff, seed, search box, range of search]
Time O(n log n), where n is number of seeds.
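The chaining procedure can be sketched roughly as below. Seeds are `(pos1, pos2)` start positions of length-`k` matches, the diagonal of a seed is taken to be `pos1 - pos2`, and the chain score is assumed to be matching bases minus the diagonal shift between consecutive seeds. The function name, the exact shape of the search box, and the scoring are assumptions made for illustration. A sorted Python list stands in for the diagonal index, so insertion here is linear rather than logarithmic; a balanced tree or skip list would be needed to reach the stated bound.

```python
import bisect


def chain_seeds(seeds, k, distance_cutoff, gap_cutoff):
    """Return (best score, best chain) for seeds given as (pos1, pos2) pairs."""
    seeds = sorted(seeds)                 # visit seeds by location in seq1
    by_diag = []                          # previous seeds as sorted (diagonal, index)
    score = [k] * len(seeds)              # a chain of one seed scores k matches
    back = [None] * len(seeds)
    for idx, (p1, p2) in enumerate(seeds):
        diag = p1 - p2
        # Range query: previous seeds whose diagonal is within the gap cutoff.
        lo = bisect.bisect_left(by_diag, (diag - gap_cutoff, -1))
        hi = bisect.bisect_right(by_diag, (diag + gap_cutoff, len(seeds)))
        for prev_diag, prev in by_diag[lo:hi]:
            q1, q2 = seeds[prev]
            # Search box: previous seed ends before this one starts in both
            # sequences and lies within the distance cutoff in seq1.
            inside_box = (q1 + k <= p1 and q2 + k <= p2
                          and p1 - (q1 + k) <= distance_cutoff)
            if inside_box:
                cand = score[prev] + k - abs(diag - prev_diag)
                if cand > score[idx]:
                    score[idx], back[idx] = cand, prev
        bisect.insort(by_diag, (diag, idx))
    if not seeds:
        return 0, []
    # Trace back from the best-scoring seed.
    idx = max(range(len(seeds)), key=score.__getitem__)
    best = score[idx]
    chain = []
    while idx is not None:
        chain.append(seeds[idx])
        idx = back[idx]
    return best, chain[::-1]


if __name__ == "__main__":
    print(chain_seeds([(0, 0), (6, 6), (12, 14), (40, 3)], k=4,
                      distance_cutoff=10, gap_cutoff=3))
```

In the usage example the seed at (40, 3) lies far off the diagonal of the others, so the range query never returns any earlier seed for it and it stays a chain of its own.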
17. CHAOS Scoring
• Initial score = # matching bp - gaps
• Rapid rescoring: extend all seeds to find
optimal location for gaps
25. Cystic Fibrosis (CFTR), 12 species
• Human sequence length: 1.8 Mb
• Total genomic sequence: 13 Mb
• Species: Human, Baboon, Cat, Dog, Cow, Pig, Mouse, Rat, Chimp, Chicken, Fugufish, Zebrafish
27. Automatic computational system for
comparative analysis of pairs of genomes
http://pipeline.lbl.gov
Alignments (all pair combinations):
Human Genome (Golden Path Assembly)
Mouse assemblies: Arachne, Phusion (2001); MGSC v3 (2002)
Rat assemblies: January 2003, February 2003
----------------------------------------------------------
D. melanogaster vs D. pseudoobscura, February 2003
29. Progressive Alignment Scheme
[Flowchart; box labels in reading order, connected by yes/no decision branches]
• Human, Mouse and Rat genomes
• Pairwise M/R mapping
• Aligned M&R fragments / Unaligned M&R sequences
• Map to Human Genome: mapping aligned fragments by union of M&R local BLAT hits on the human genome
• H/M/R MLAGAN alignment
• M/R pairwise alignment
• M/H and R/H pairwise alignment
• Unassigned M&R DNA fragments
30. Computational Time
PC cluster of 23 nodes, each with dual 2.2 GHz Intel Xeon processors.
Pairwise rat/mouse – 4 hours
Pairwise rat/human and mouse/human – 2 hours
Multiple human/mouse/rat – 9 hours
Total wall time: ~ 15 hours
31. Distribution of Large Indels
[Figure: histogram of indel counts by length; x-axis: Indel length (100 to 550), y-axis: Count (0 to 200)]
34. Evolution at the DNA level
…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…
SEQUENCE EDITS: Mutation, Deletion
REARRANGEMENTS: Inversion, Translocation, Duplication
35. Local & Global Alignment
[Figure: the same pair of sequences shown twice, once aligned locally and once aligned globally]
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
36. Glocal Alignment Problem
Find least cost transformation of one sequence into another using new operations
• Sequence edits
• Inversions
• Translocations
• Duplications
• Combinations of above
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
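To make the rearrangement operations concrete, here is a small sketch that applies each one to a plain string. The function names, the half-open `[i, j)` index convention, modelling an inversion as a reverse complement in place, and modelling a duplication as a tandem copy are all assumptions for illustration; the slide does not define the operations at this level, and no costs are assigned here.

```python
# Complement table for the four DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse the string and complement every base."""
    return seq.translate(_COMPLEMENT)[::-1]


def invert(seq, i, j):
    """Replace seq[i:j] by its reverse complement, in place."""
    return seq[:i] + reverse_complement(seq[i:j]) + seq[j:]


def translocate(seq, i, j, k):
    """Cut seq[i:j] out and reinsert it at offset k of the remaining sequence."""
    block, rest = seq[i:j], seq[:i] + seq[j:]
    return rest[:k] + block + rest[k:]


def duplicate(seq, i, j):
    """Insert a second copy of seq[i:j] directly after the first (tandem copy)."""
    return seq[:j] + seq[i:j] + seq[j:]


if __name__ == "__main__":
    print(invert("AAACGGG", 2, 5))
    print(translocate("AACCGGTT", 2, 4, 4))
    print(duplicate("ACGT", 1, 3))
```

A glocal aligner has to explain one sequence as the other after some combination of such steps plus ordinary edits, which is why a single monotone path through the dynamic programming matrix is no longer enough.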
38. S-LAGAN: Find Local Alignments
1. Find Local Alignments
2. Build Rough Homology Map
3. Globally Align Consistent Parts
39. S-LAGAN: Build Homology Map
1. Find Local Alignments
2. Build Rough Homology Map
3. Globally Align Consistent Parts
40. Building the Homology Map
Chain (using Eppstein-Galil); each alignment gets a score which is MAX over 4 possible chains.
Penalties are affine (event and distance components)
Penalties:
a) regular
b) translocation
c) inversion
d) inverted translocation
[Figure: chain diagram with regions labeled a, b, c, d, one per penalty type above]
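A rough sketch of chaining local alignments under the four penalty types. It is not the Eppstein-Galil algorithm named on the slide: it uses a plain quadratic dynamic program for clarity. The `Local` record, the penalty values in `PENALTIES`, the rule used in `transition` to tell the four cases apart (strand change and order in the second sequence), and the distance measure are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Local:
    start1: int
    end1: int
    start2: int
    end2: int
    strand: str      # '+' or '-'
    score: float


# Hypothetical affine penalties: (event cost, cost per base of distance).
PENALTIES = {
    "regular": (0.0, 0.01),
    "translocation": (20.0, 0.01),
    "inversion": (15.0, 0.01),
    "inverted translocation": (30.0, 0.01),
}


def transition(prev, cur):
    """Classify the step from prev to cur (assumed rule, see lead-in)."""
    same_strand = prev.strand == cur.strand
    in_order = cur.start2 >= prev.end2
    if same_strand:
        return "regular" if in_order else "translocation"
    return "inversion" if in_order else "inverted translocation"


def penalty(prev, cur):
    """Affine penalty: event component plus distance component."""
    event, per_base = PENALTIES[transition(prev, cur)]
    if cur.start2 >= prev.end2:
        d2 = cur.start2 - prev.end2
    elif prev.start2 >= cur.end2:
        d2 = prev.start2 - cur.end2
    else:
        d2 = 0                       # the two alignments overlap in seq2
    return event + per_base * ((cur.start1 - prev.end1) + d2)


def build_homology_map(alignments):
    """Best chain that is monotone in seq1 only; returns (score, chain)."""
    order = sorted(alignments, key=lambda a: a.start1)
    best, back = [], []
    for j, cur in enumerate(order):
        best_j, back_j = cur.score, None
        for i in range(j):
            prev = order[i]
            if prev.end1 > cur.start1:
                continue             # must not overlap in seq1
            cand = best[i] + cur.score - penalty(prev, cur)
            if cand > best_j:
                best_j, back_j = cand, i
        best.append(best_j)
        back.append(back_j)
    if not order:
        return 0.0, []
    j = max(range(len(order)), key=best.__getitem__)
    total, chain = best[j], []
    while j is not None:
        chain.append(order[j])
        j = back[j]
    return total, chain[::-1]
```

Because the chain only has to be monotone in the first sequence, an alignment on the opposite strand or out of order in the second sequence can still join it, at the price of the matching event penalty.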
42. S-LAGAN: Global Alignment
1. Find Local Alignments
2. Build Rough Homology Map
3. Globally Align Consistent Parts
48. S-LAGAN Results (Chr 20)
• Human Chr 20 v. homologous Mouse Chr 2.
• 270 Segments of conserved synteny
• 70 Inversions
49. S-LAGAN Results (Whole Genome)
            LAGAN      S-LAGAN
Total       37%        38%
Exon        93%        96%
Ups200      78%        81%
CPU Time    350 Hrs    450 Hrs
• Used Berkeley Genome Pipeline
• % Human genome aligned with mouse sequence
• Evaluation criteria from Waterston, et al (Nature 2002)
50. Rearrangements in Human v. Mouse
Preliminary conclusions:
• Rearrangements come in all sizes
• Duplications are less well conserved than other
rearranged regions
• Simple inversions tend to be most
common and most conserved
51. What is next? (Shuffle)
• Better algorithm and scoring
• Whole genome synteny mapping
• Multiple Glocal Alignment(!?)
53. Biological Story
• Math1 (Mouse Atonal Homologue 1,
also ATOH) is a gene that is
responsible for nervous system
development
56. Can we align human & fly???
CGCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
GAGGTGTTGGATGGCCTGAGTGA-AGCACGCGCTGTCAGCTGGCGAGCGCTCGCG-AGTCCCTGCCGTGTCCCCG
Melan GCTACTCCAGCT-ACCACCTGCATGCAGCTGCACAGC
Pseudo GCCACTGAGACT-GCCACCTGCATGCAGCTGCACAGA
57. Putting it all together
59. Acknowledgments
Stanford:
Serafim Batzoglou
Arend Sidow
Matt Scott
Gregory Cooper
Chuong (Tom) Do
Sanket Malde
Kerrin Small
Mukund Sundararajan
Berkeley:
Inna Dubchak
Alexander Poliakov
Göttingen:
Burkhard Morgenstern
Rat Genome Sequencing Consortium
http://lagan.stanford.edu/