SlideShare una empresa de Scribd logo
1 de 69
Descargar para leer sin conexión
Multiple Alignment
Outline
1. Introduction to Multiple Alignment
2. Progressive Alignment
3. Scoring Multiple Alignments
4. Partial Order Alignment
5. A-Bruijn Approach to Multiple Alignment
Section 1:
Introduction to Multiple
Alignment
Multiple Sequence Alignment (MSA)
• Up until now we have only
tried to align two sequences.
Multiple Sequence Alignment (MSA)
• Up until now we have only
tried to align two sequences.
• What about aligning more
than two sequences?
Multiple Sequence Alignment (MSA)
• Up until now we have only
tried to align two sequences.
• What about aligning more
than two sequences?
• A faint similarity between
two sequences becomes
significant if it is present in
many other sequences.
Multiple Sequence Alignment (MSA)
• Up until now we have only
tried to align two sequences.
• What about aligning more
than two sequences?
• A faint similarity between
two sequences becomes
significant if it is present in
many other sequences.
• Therefore multiple
alignments can reveal subtle
similarities that pairwise
alignments do not reveal.
Generalizing Pairwise to Multiple Alignment
• Alignment of 2 sequences is represented as a 2-row matrix.
• In a similar way, we represent alignment of 3 sequences as a
3-row matrix
• Example:
A T - G C G -
A - C G T - A
A T C A C - A
• Our scoring function should score alignments with conserved
columns higher.
A A T -- C
A -- T G C
-- A T G C
Alignments = Paths in 3-Space
• Say we have 3 sequences to align: ATGC, AATC, ATGC
0 1 1 2 3 4
A A T -- C
A -- T G C
-- A T G C
x coordinate
Alignments = Paths in 3-Space
• Say we have 3 sequences to align: ATGC, AATC, ATGC
0 1 1 2 3 4
0 1 2 3 3 4
A A T -- C
A -- T G C
-- A T G C
y coordinate
Alignments = Paths in 3-Space
• Say we have 3 sequences to align: ATGC, AATC, ATGC
x coordinate
0 1 1 2 3 4
0 1 2 3 3 4
A A T -- C
A -- T G C
0 0 1 2 3 4
-- A T G C
x coordinate
y coordinate
z coordinate
Alignments = Paths in 3-Space
• Say we have 3 sequences to align: ATGC, AATC, ATGC
• Plotting the coordinates gives a path in 3-space:
0 1 1 2 3 4
0 1 2 3 3 4
A A T -- C
A -- T G C
0 0 1 2 3 4
-- A T G C
x coordinate
y coordinate
z coordinate
Alignments = Paths in 3-Space
• Say we have 3 sequences to align: ATGC, AATC, ATGC
• Plotting the coordinates gives a path in 3-space:
• (0,0,0)(1,1,0)(1,2,1) (2,3,2) (3,3,3) (4,4,4)
Source
Sink
Alignments = Paths in 3-Space
• Same strategy as aligning two
sequences.
• Use a 3-D “Manhattan Cube”,
with each axis representing a
sequence to align.
• For global alignments, go
from source to sink.
2-D Alignment Cell versus 3-D Alignment Cell
• In 2-D, 3 edges in
each unit square
• In 3-D, 7 edges in
each unit cube
3-D Unit Cube2-D Unit Square
(i-1,j-1,k-1)
(i,j-1,k-1)
(i,j-1,k)
(i-1,j-1,k) (i-1,j,k)
(i,j,k)
(i-1,j,k-1)
(i,j,k-1)
Architecture of 3-D Alignment Cell
• si,j,k = max
• (x, y, z) is an entry in the 3-D scoring matrix.
Cube diagonal: no indels
Face diagonal: one indel
Edge diagonal: two indels
Multiple Alignment: Dynamic Programming
si-1,j-1,k-1 + (vi, wj, uk)
si-1,j-1,k +  (vi, wj, _ )
si-1,j,k-1 +  (vi, _, uk)
si,j-1,k-1 +  (_, wj, uk)
si-1,j,k +  (vi, _ , _)
si,j-1,k +  (_, wj, _)
si,j,k-1 +  (_, _, uk)
Multiple Alignment: Running Time
• For 3 sequences of length n, the run time is 7n3 = O(n3)
• For generalization to k sequences, build a k-dimensional
Manhattan graph:
• There are nk vertices in this graph.
• Each vertex has 2k – 1 edges coming into it.
• Therefore, run time = (2k – 1)(nk) = O(2knk)
• Conclusion: The dynamic programming approach for
alignment between two sequences is easily extended to k
sequences but it is impractical due to a run time that is
exponential in the number of sequences.
Multiple Alignment Induces Pairwise
Alignments
• Every multiple alignment induces pairwise alignments
• Example: The alignment
x: A C - G C G G - C
y: A C - G C - G A G
z: G C C G C - G A G
induces the following three pairwise alignments:
x: ACGCGG-C x: AC-GCGG-C y: AC-GCGAG
y: ACGC-GAC z: GCCGC-GAG z: GCCGCGAG
Idea: Construct Multiple from Pairwise Alignments
• Given k arbitrary pairwise alignments, can we construct a
multiple alignment that induces them?
• Example: 3 sequence alignment
• x = ACGCTGGC, y = ACGCGAC, z = GCCGCAGAG
• Say we have optimal pairwise alignments as follows:
• Can we construct a multiple alignment that induces them?
x: ACGCTGG-C x: AC-GCTGG-C y: AC-GC-GAG
y: ACGC--GAC z: GCCGCA-GAG z: GCCGCAGAG
Idea: Construct Multiple from Pairwise Alignments
• Given k arbitrary pairwise alignments, can we construct a
multiple alignment that induces them?
• Example: 3 sequence alignment
• x = ACGCTGGC, y = ACGCGAC, z = GCCGCAGAG
• Say we have optimal pairwise alignments as follows:
• Can we construct a multiple alignment that induces them?
• Answer: Not always!
x: ACGCTGG-C x: AC-GCTGG-C y: AC-GC-GAG
y: ACGC--GAC z: GCCGCA-GAG z: GCCGCAGAG
Idea: Construct Multiple from Pairwise Alignments
• From an optimal multiple alignment, we can infer pairwise
alignments between all pairs of sequences, but they are not
necessarily optimal.
• Likewise, it is difficult to infer a “good” multiple alignment
from optimal pairwise alignments between all sequences.
Idea: Construct Multiple from Pairwise Alignments
• Example 1: Can combine
pairwise alignments into
optimal multiple alignment.
• Example 2: Can not combine
pairwise alignments into
optimal multiple alignment.
- A G G C T A T C A C C T G
T A G – C T A C C A - - - G
C A G – C T A C C A - - - G
C A G – C T A T C A C – G G
C A G – C T A T C G C – G G
A 1 1 .8
C .6 1 .4 1 .6 .2
G 1 .2 .2 .4 1
T .2 1 .6 .2
- .2 .8 .4 .8 .4
Profile Representation of Multiple Alignment
Profile Representation of Multiple Alignment
• In the past we were aligning a sequence against a sequence.
- A G G C T A T C A C C T G
T A G – C T A C C A - - - G
C A G – C T A C C A - - - G
C A G – C T A T C A C – G G
C A G – C T A T C G C – G G
A 1 1 .8
C .6 1 .4 1 .6 .2
G 1 .2 .2 .4 1
T .2 1 .6 .2
- .2 .8 .4 .8 .4
Profile Representation of Multiple Alignment
• In the past we were aligning a sequence against a sequence.
• Can we align a sequence against a profile?
- A G G C T A T C A C C T G
T A G – C T A C C A - - - G
C A G – C T A C C A - - - G
C A G – C T A T C A C – G G
C A G – C T A T C G C – G G
A 1 1 .8
C .6 1 .4 1 .6 .2
G 1 .2 .2 .4 1
T .2 1 .6 .2
- .2 .8 .4 .8 .4
Profile Representation of Multiple Alignment
• In the past we were aligning a sequence against a sequence.
• Can we align a sequence against a profile?
• Can we align a profile against a profile?
- A G G C T A T C A C C T G
T A G – C T A C C A - - - G
C A G – C T A C C A - - - G
C A G – C T A T C A C – G G
C A G – C T A T C G C – G G
A 1 1 .8
C .6 1 .4 1 .6 .2
G 1 .2 .2 .4 1
T .2 1 .6 .2
- .2 .8 .4 .8 .4
• Choose the most similar pair of strings and combine them into
a profile, thereby reducing alignment of k sequences to an
alignment of of k – 1 sequences/profiles.
• Then repeat.
• This is a heuristic (greedy) method.
u1= ACGTACGTACGT…
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…
…
uk = CCGGCCGGCCGG
u1= ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…
…
uk = CCGGCCGGCCGG…
k
k – 1
Multiple Alignment: Greedy Approach
Greedy Approach: Example
• Consider the 4 sequences: GATTCA, GTCTGA, GATATT,
GTCAGC
• Consider the 4 sequences: GATTCA, GTCTGA, GATATT,
GTCAGC.
• There are = 6 possible pairwise alignments:





2
4
s2 GTCTGA
s4 GTCAGC (score = 2)
s1 GAT-TCA
s2 G-TCTGA (score = 1)
s1 GAT-TCA
s3 GATAT-T (score = 1)
s1 GATTCA--
s4 G—T-CAGC (score = 0)
s2 G-TCTGA
s3 GATAT-T (score = -1)
s3 GAT-ATT
s4 G-TCAGC (score = -1)
Greedy Approach: Example
• s2 and s4 are closest, so we consolidate these sequences into
one by using the profile matrix:
• New set of 3 sequences to align:
• We can choose either of the nucleotides in question for s2, 4.
s2 GTCTGA
s4 GTCAGC
s2,4 = GTCt/aGa/cA
s1 GATTCA
s3 GATATT
s2,4 GTCt/aGa/c
Greedy Approach: Example
Section 2:
Progressive Alignment
Progressive Alignment
• Progressive alignment: A variation of the greedy algorithm
for multiple alignment with a somewhat more intelligent
strategy for choosing the order of alignments.
• Progressive alignment works well for close sequences, but
deteriorates for distant sequences.
• Gaps in consensus string are permanent.
• Use profiles to compare sequences.
ClustalW
• Popular multiple alignment tool today.
• „W‟ stands for „weighted‟ (different parts of alignment are
weighted differently).
• Three-step process:
1. Construct pairwise alignments.
2. Build guide tree.
3. Progressive alignment guided by the tree.
v1 v2 v3 v4
v1 -
v2 .17 -
v3 .87 .28 -
v4 .59 .33 .62 - (.17 means 17 % identical)
Step 1: Pairwise Alignment
• Aligns each sequence against each other, giving a similarity
matrix.
• Similarity = exact matches / sequence length (percent identity).
Step 2: Guide Tree
• Create guide tree using the similarity matrix.
• ClustalW uses the neighbor-joining method,
• Guide tree roughly reflects evolutionary relations.
v1 v2 v3 v4
v1 -
v2 .17 -
v3 .87 .28 -
v4 .59 .33 .62 -
v1
v3
v4
v2
FOS_RAT PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ
. . : ** . :.. *:.* * . * **:
Dots and stars show how well-conserved a column is
Step 3: Progressive Alignment
• Start by aligning the two most similar sequences.
• Following the guide tree, add in the next sequences, aligning
to the existing alignment.
• Insert gaps as necessary.
Section 3:
Scoring Multiple
Alignments
Multiple Alignments: Scoring
• We will discuss three possible scoring systems:
1. Number of matches (multiple longest common
subsequence score)
2. Entropy score
3. Sum of pairs (SP-Score)
• A column is a “match” if all the letters in the column are the
same.
• Example: Only the first column in the following matching
represents a “match:”
• The Multiple LCS score is the total number of matches.
• This score is good for very similar sequences.
A A A
A A A
A A T
A T C
Score # 1: Multiple LCS Score
Score # 2: Entropy Score
• Define frequencies px for the occurrence of each letter x in each
column of the multiple alignment.
• Then, compute “entropy” of each column.
• The entropy score is then given by the sum of the entropies of all
the columns.
Entropy of Column = 


CGTAX
XX
pp
,,,
log
Entropy: Example
• For our sequences {AAA, AAA, AAT, ATC}:
Entropy: Example
• For our sequences {AAA, AAA, AAT, ATC}:
• 1st Column: pA = 1, pT= pG= pC= 0
Entropy = 0
Entropy: Example
• For our sequences {AAA, AAA, AAT, ATC}:
• 1st Column: pA = 1, pT= pG= pC= 0
• 2nd Column: pA = 0.75, pT = 0.25, pG= pC= 0
Entropy = 0
Entropy = 0.56
Entropy: Example
• For our sequences {AAA, AAA, AAT, ATC}:
• 1st Column: pA = 1, pT= pG= pC= 0
• 2nd Column: pA = 0.75, pT = 0.25, pG= pC= 0
• 3rd Column: pA = 0.50, pT = 0.25, pC= 0.25, pG= 0
Entropy = 0
Entropy = 0.56
Entropy = 1.04
Entropy: Example
• For our sequences {AAA, AAA, AAT, ATC}:
• 1st Column: pA = 1, pT= pG= pC= 0
• 2nd Column: pA = 0.75, pT = 0.25, pG= pC= 0
• 3rd Column: pA = 0.50, pT = 0.25, pC= 0.25, pG= 0
• Entropy Score = 0 + 0.56 + 1.04 = 1.60
Entropy = 0
Entropy = 0.56
Entropy = 1.04
0














A
A
A
A
entropy
Entropy: Interpretation
• The more similar the members of a column, the lower the
entropy score.
• Example: Best and worst cases:
• Therefore, if we are searching for the best multiple alignment,
we will want to minimize the entropy score.
Inferring Pairwise from Multiple Alignments
• Recall: Every multiple alignment induces pairwise alignments.
• From a multiple alignment, we can infer pairwise alignments
between all sequences, but they are not necessarily optimal.
• We can view reducing multiple alignments to pairwise
alignments as projecting a 3-D multiple alignment path onto a
2-D face of the cube
• Our third scoring function for MSA will be based off the
projections.
Multiple Alignment Projections: Illustration
• A 3-D alignment can be
projected onto the 2-D plane
to represent an alignment
between a pair of sequences.
• Example: Figure at right.
• Solid line: represents a 3-D
alignment path.
• Dashed lines: represent the
three induced pairwise
alignments that are projected
onto the cube‟s faces.
• Consider the pairwise alignment of sequences ai and aj
implied from a multiple alignment of k sequences.
• Denote the score of this (not necessarily optimal) pairwise
alignment as s*(ai, aj).
• Sum of Pairs (SP) Score: Obtained by summing the pairwise
scores:
s(a1,…,ak) = Σi,j s*(ai, aj)
Score 3: Sum of Pairs Score (SP-Score)
Multiple Alignment: History
• 1975: Sankoff formulates multiple
alignment problem and gives the
dynamic programming solution.
• 1988: Carrillo and Lipman provide
branch and bound approach for Multiple
Alignment.
David Sankoff
David Lipman
Multiple Alignment: History
• 1990: Feng and Doolittle develop
progressive alignment.
• 1994: Thompson,
Higgins, and Gibson
create ClustalW, which
is the most popular
multiple alignment
program in the world. Des HigginsJulie Thompson Toby Gibson
Russell Doolittle
Multiple Alignment: History
• 1998: Morgenstern et al. create
DIALIGN, an algorithm for
segment-based multiple
alignment.
• 2000: Notredame, Des Higgins,
and Heringa develop T-Coffee,
which aligns multiple sequences
based off a library of pairwise
alignments.
Burkhard Morgenstern
Cedric Notredame JaapHeringa
Multiple Alignment: History
• 2004: Robert Edgar formulates
MUSCLE, a faster and more efficient
algorithm than ClustalW.
• 201X: What is next?
Problems with Multiple Alignment
• Multidomain proteins evolve not only through point mutations
but also through domain duplications and domain
recombinations.
• Although Multiple Alignment is a 30 year old problem, there
were no approaches for aligning rearranged sequences (i.e.,
multi-domain proteins with shuffled domains) prior to 2002.
• It is often impossible to align all protein sequences throughout
their entire length.
Section 4:
Partial Order Alignment
Alignment as a Graph
• Conventional Alignment
• Protein sequence as a path
• Two protein sequence paths
• Combination of two protein
graphs into one graph
Representing Sequences as Paths in a Graph
• Each protein sequence is
represented by a path.
• Dashed edges connect
“equivalent” positions.
• Vertices with identical
labels are fused.
• Two objectives:
1. Find a graph that represents domain structure
2. Find mapping of each sequence to this graph
• Partial Order Alignment (POA): A graph such that every
sequence in the given set is a path in G.
Partial Order Multiple Alignment
• Aligns sequences onto a directed acyclic graph (DAG)
• Outline:
1. Guide Tree Construction
2. Progressive Alignment Following Guide Tree
3. Dynamic Programming Algorithm to align two POAs
(POA-POA Alignment).
• We learned how to align one sequence (path) against
another sequence (path).
• We need to develop an algorithm for aligning a directed
graph against a directed graph.
POA Algorithm
Dynamic Programming for Aligning Two
Graphs
• S(n, o) = optimal score for n = node in G, o = node in G‟
• Match/mismatch: Aligning two nodes with score s(n,o)
• Deletion/insertion:
• Omitting node n from the alignment with score ∆(n)
• Omitting node o from the alignment with score ∆(o)
• Dynamic formula for S(n, o):
Row-Column AlignmentInput Sequences
Row-Column Alignment
• POA is more flexible: standard methods force sequences to
align linearly.
• POA representation handles gaps more naturally and retains
(and uses) all information in the MSA (unlike linear profiles).
POA Advantages
Section 5:
A-Bruijn Approach to
Multiple Alignment
A-Bruijn Alignment
• A-Bruijn Alignment (ABA): Represents alignment as
directed graph that may contains cycles.
• This is in contrast to POA, which represents alignment as an
acyclic directed graph.
ABA: How Is the Graph Created?
MSA vs. POA vs. ABA
MSAOriginal Sequences
ABAPOA
Advantages of ABA
1. More flexible than POA: allows larger class of evolutionary
relationships between aligned sequences
2. Can align proteins with shuffled and/or repeated domain
structure
3. Can align proteins with domains present in multiple copies in
some proteins
4. Handles:
• Domains not present in all proteins.
• Domains present in different orders in different proteins.
• Chris Lee, POA, UCLA
http://www.bioinformatics.ucla.edu/poa/Poa_Tutorial.html
Credits

Más contenido relacionado

La actualidad más candente

PROPERTIES OF LAPLACE TRANSFORM part 2
PROPERTIES OF LAPLACE TRANSFORM part 2PROPERTIES OF LAPLACE TRANSFORM part 2
PROPERTIES OF LAPLACE TRANSFORM part 2Ranit Sarkar
 
Partial fraction decomposition for inverse laplace transform
Partial fraction decomposition for inverse laplace transformPartial fraction decomposition for inverse laplace transform
Partial fraction decomposition for inverse laplace transformVishalsagar657
 
Inverse Laplace Transform
Inverse Laplace TransformInverse Laplace Transform
Inverse Laplace TransformVishnu V
 

La actualidad más candente (6)

Inverse laplace
Inverse laplaceInverse laplace
Inverse laplace
 
Merge sort algorithm
Merge sort algorithmMerge sort algorithm
Merge sort algorithm
 
PROPERTIES OF LAPLACE TRANSFORM part 2
PROPERTIES OF LAPLACE TRANSFORM part 2PROPERTIES OF LAPLACE TRANSFORM part 2
PROPERTIES OF LAPLACE TRANSFORM part 2
 
Partial fraction decomposition for inverse laplace transform
Partial fraction decomposition for inverse laplace transformPartial fraction decomposition for inverse laplace transform
Partial fraction decomposition for inverse laplace transform
 
Merge sort
Merge sortMerge sort
Merge sort
 
Inverse Laplace Transform
Inverse Laplace TransformInverse Laplace Transform
Inverse Laplace Transform
 

Destacado (8)

Bioinfo intro
Bioinfo introBioinfo intro
Bioinfo intro
 
090622_blast-clustalw
090622_blast-clustalw090622_blast-clustalw
090622_blast-clustalw
 
1 amino acids
1 amino acids1 amino acids
1 amino acids
 
Clustal X
Clustal XClustal X
Clustal X
 
Msa
MsaMsa
Msa
 
Alignments
AlignmentsAlignments
Alignments
 
Multiple alignment
Multiple alignmentMultiple alignment
Multiple alignment
 
Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignment
 

Similar a Ch06 multalign

Similar a Ch06 multalign (20)

Bioinformatics lesson
Bioinformatics lessonBioinformatics lesson
Bioinformatics lesson
 
Bioinformatics lesson
Bioinformatics lessonBioinformatics lesson
Bioinformatics lesson
 
02-alignment.pdf
02-alignment.pdf02-alignment.pdf
02-alignment.pdf
 
Ch06 alignment
Ch06 alignmentCh06 alignment
Ch06 alignment
 
Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignment
 
The Needleman-Wunsch Algorithm for Sequence Alignment
The Needleman-Wunsch Algorithm for Sequence Alignment The Needleman-Wunsch Algorithm for Sequence Alignment
The Needleman-Wunsch Algorithm for Sequence Alignment
 
Dot matrix seminar
Dot matrix seminarDot matrix seminar
Dot matrix seminar
 
Lecture 4
Lecture 4Lecture 4
Lecture 4
 
Msa & rooted/unrooted tree
Msa & rooted/unrooted treeMsa & rooted/unrooted tree
Msa & rooted/unrooted tree
 
Sequence alignment global vs. local
Sequence alignment  global vs. localSequence alignment  global vs. local
Sequence alignment global vs. local
 
20100515 bioinformatics kapushesky_lecture07
20100515 bioinformatics kapushesky_lecture0720100515 bioinformatics kapushesky_lecture07
20100515 bioinformatics kapushesky_lecture07
 
lecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadflecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadf
 
sequence alignment
sequence alignmentsequence alignment
sequence alignment
 
Pairwise sequence alignment
Pairwise sequence alignmentPairwise sequence alignment
Pairwise sequence alignment
 
Aligning seqeunces with W-curve and SQL.
Aligning seqeunces with W-curve and SQL.Aligning seqeunces with W-curve and SQL.
Aligning seqeunces with W-curve and SQL.
 
Ivd soda-2019
Ivd soda-2019Ivd soda-2019
Ivd soda-2019
 
seq alignment.ppt
seq alignment.pptseq alignment.ppt
seq alignment.ppt
 
Clustering.pptx
Clustering.pptxClustering.pptx
Clustering.pptx
 
CS176: Genome Assembly
CS176: Genome AssemblyCS176: Genome Assembly
CS176: Genome Assembly
 
Sequence-analysis-pairwise-alignment.pdf
Sequence-analysis-pairwise-alignment.pdfSequence-analysis-pairwise-alignment.pdf
Sequence-analysis-pairwise-alignment.pdf
 

Más de BioinformaticsInstitute

Comparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphsComparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphsBioinformaticsInstitute
 
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес... Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...BioinformaticsInstitute
 
Вперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днкВперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днкBioinformaticsInstitute
 
"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр ПредеусBioinformaticsInstitute
 
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...BioinformaticsInstitute
 
Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)BioinformaticsInstitute
 
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...BioinformaticsInstitute
 
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)BioinformaticsInstitute
 

Más de BioinformaticsInstitute (20)

Graph genome
Graph genome Graph genome
Graph genome
 
Nanopores sequencing
Nanopores sequencingNanopores sequencing
Nanopores sequencing
 
A superglue for string comparison
A superglue for string comparisonA superglue for string comparison
A superglue for string comparison
 
Comparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphsComparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphs
 
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес... Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 
Вперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днкВперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днк
 
Knime & bioinformatics
Knime & bioinformaticsKnime & bioinformatics
Knime & bioinformatics
 
"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус
 
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
 
Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)
 
Плюрипотентность 101
Плюрипотентность 101Плюрипотентность 101
Плюрипотентность 101
 
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
 
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
 
Biodb 2011-everything
Biodb 2011-everythingBiodb 2011-everything
Biodb 2011-everything
 
Biodb 2011-05
Biodb 2011-05Biodb 2011-05
Biodb 2011-05
 
Biodb 2011-04
Biodb 2011-04Biodb 2011-04
Biodb 2011-04
 
Biodb 2011-03
Biodb 2011-03Biodb 2011-03
Biodb 2011-03
 
Biodb 2011-01
Biodb 2011-01Biodb 2011-01
Biodb 2011-01
 
Biodb 2011-02
Biodb 2011-02Biodb 2011-02
Biodb 2011-02
 
Ngs 3 1
Ngs 3 1Ngs 3 1
Ngs 3 1
 

Último

Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 

Último (20)

Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 

Ch06 multalign

  • 2. Outline 1. Introduction to Multiple Alignment 2. Progressive Alignment 3. Scoring Multiple Alignments 4. Partial Order Alignment 5. A-Bruijn Approach to Multiple Alignment
  • 3. Section 1: Introduction to Multiple Alignment
  • 4. Multiple Sequence Alignment (MSA) • Up until now we have only tried to align two sequences.
  • 5. Multiple Sequence Alignment (MSA) • Up until now we have only tried to align two sequences. • What about aligning more than two sequences?
  • 6. Multiple Sequence Alignment (MSA) • Up until now we have only tried to align two sequences. • What about aligning more than two sequences? • A faint similarity between two sequences becomes significant if it is present in many other sequences.
  • 7. Multiple Sequence Alignment (MSA) • Up until now we have only tried to align two sequences. • What about aligning more than two sequences? • A faint similarity between two sequences becomes significant if it is present in many other sequences. • Therefore multiple alignments can reveal subtle similarities that pairwise alignments do not reveal.
  • 8. Generalizing Pairwise to Multiple Alignment • Alignment of 2 sequences is represented as a 2-row matrix. • In a similar way, we represent alignment of 3 sequences as a 3-row matrix • Example: A T - G C G - A - C G T - A A T C A C - A • Our scoring function should score alignments with conserved columns higher.
  • 9. A A T -- C A -- T G C -- A T G C Alignments = Paths in 3-Space • Say we have 3 sequences to align: ATGC, AATC, ATGC
  • 10. 0 1 1 2 3 4 A A T -- C A -- T G C -- A T G C x coordinate Alignments = Paths in 3-Space • Say we have 3 sequences to align: ATGC, AATC, ATGC
  • 11. 0 1 1 2 3 4 0 1 2 3 3 4 A A T -- C A -- T G C -- A T G C y coordinate Alignments = Paths in 3-Space • Say we have 3 sequences to align: ATGC, AATC, ATGC x coordinate
  • 12. 0 1 1 2 3 4 0 1 2 3 3 4 A A T -- C A -- T G C 0 0 1 2 3 4 -- A T G C x coordinate y coordinate z coordinate Alignments = Paths in 3-Space • Say we have 3 sequences to align: ATGC, AATC, ATGC • Plotting the coordinates gives a path in 3-space:
  • 13. 0 1 1 2 3 4 0 1 2 3 3 4 A A T -- C A -- T G C 0 0 1 2 3 4 -- A T G C x coordinate y coordinate z coordinate Alignments = Paths in 3-Space • Say we have 3 sequences to align: ATGC, AATC, ATGC • Plotting the coordinates gives a path in 3-space: • (0,0,0)(1,1,0)(1,2,1) (2,3,2) (3,3,3) (4,4,4)
  • 14. Source Sink Alignments = Paths in 3-Space • Same strategy as aligning two sequences. • Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align. • For global alignments, go from source to sink.
  • 15. 2-D Alignment Cell versus 3-D Alignment Cell • In 2-D, 3 edges in each unit square • In 3-D, 7 edges in each unit cube 3-D Unit Cube2-D Unit Square
  • 17. • si,j,k = max • (x, y, z) is an entry in the 3-D scoring matrix. Cube diagonal: no indels Face diagonal: one indel Edge diagonal: two indels Multiple Alignment: Dynamic Programming si-1,j-1,k-1 + (vi, wj, uk) si-1,j-1,k +  (vi, wj, _ ) si-1,j,k-1 +  (vi, _, uk) si,j-1,k-1 +  (_, wj, uk) si-1,j,k +  (vi, _ , _) si,j-1,k +  (_, wj, _) si,j,k-1 +  (_, _, uk)
  • 18. Multiple Alignment: Running Time • For 3 sequences of length n, the run time is 7n3 = O(n3) • For generalization to k sequences, build a k-dimensional Manhattan graph: • There are nk vertices in this graph. • Each vertex has 2k – 1 edges coming into it. • Therefore, run time = (2k – 1)(nk) = O(2knk) • Conclusion: The dynamic programming approach for alignment between two sequences is easily extended to k sequences but it is impractical due to a run time that is exponential in the number of sequences.
  • 19. Multiple Alignment Induces Pairwise Alignments • Every multiple alignment induces pairwise alignments • Example: The alignment x: A C - G C G G - C y: A C - G C - G A G z: G C C G C - G A G induces the following three pairwise alignments: x: ACGCGG-C x: AC-GCGG-C y: AC-GCGAG y: ACGC-GAC z: GCCGC-GAG z: GCCGCGAG
  • 20. Idea: Construct Multiple from Pairwise Alignments • Given k arbitrary pairwise alignments, can we construct a multiple alignment that induces them? • Example: 3 sequence alignment • x = ACGCTGGC, y = ACGCGAC, z = GCCGCAGAG • Say we have optimal pairwise alignments as follows: • Can we construct a multiple alignment that induces them? x: ACGCTGG-C x: AC-GCTGG-C y: AC-GC-GAG y: ACGC--GAC z: GCCGCA-GAG z: GCCGCAGAG
  • 21. Idea: Construct Multiple from Pairwise Alignments • Given k arbitrary pairwise alignments, can we construct a multiple alignment that induces them? • Example: 3 sequence alignment • x = ACGCTGGC, y = ACGCGAC, z = GCCGCAGAG • Say we have optimal pairwise alignments as follows: • Can we construct a multiple alignment that induces them? • Answer: Not always! x: ACGCTGG-C x: AC-GCTGG-C y: AC-GC-GAG y: ACGC--GAC z: GCCGCA-GAG z: GCCGCAGAG
  • 22. Idea: Construct Multiple from Pairwise Alignments • From an optimal multiple alignment, we can infer pairwise alignments between all pairs of sequences, but they are not necessarily optimal. • Likewise, it is difficult to infer a “good” multiple alignment from optimal pairwise alignments between all sequences.
  • 23. Idea: Construct Multiple from Pairwise Alignments • Example 1: Can combine pairwise alignments into optimal multiple alignment. • Example 2: Can not combine pairwise alignments into optimal multiple alignment.
  • 24. - A G G C T A T C A C C T G T A G – C T A C C A - - - G C A G – C T A C C A - - - G C A G – C T A T C A C – G G C A G – C T A T C G C – G G A 1 1 .8 C .6 1 .4 1 .6 .2 G 1 .2 .2 .4 1 T .2 1 .6 .2 - .2 .8 .4 .8 .4 Profile Representation of Multiple Alignment
  • 25. Profile Representation of Multiple Alignment • In the past we were aligning a sequence against a sequence. - A G G C T A T C A C C T G T A G – C T A C C A - - - G C A G – C T A C C A - - - G C A G – C T A T C A C – G G C A G – C T A T C G C – G G A 1 1 .8 C .6 1 .4 1 .6 .2 G 1 .2 .2 .4 1 T .2 1 .6 .2 - .2 .8 .4 .8 .4
  • 26. Profile Representation of Multiple Alignment • In the past we were aligning a sequence against a sequence. • Can we align a sequence against a profile? - A G G C T A T C A C C T G T A G – C T A C C A - - - G C A G – C T A C C A - - - G C A G – C T A T C A C – G G C A G – C T A T C G C – G G A 1 1 .8 C .6 1 .4 1 .6 .2 G 1 .2 .2 .4 1 T .2 1 .6 .2 - .2 .8 .4 .8 .4
  • 27. Profile Representation of Multiple Alignment • In the past we were aligning a sequence against a sequence. • Can we align a sequence against a profile? • Can we align a profile against a profile? - A G G C T A T C A C C T G T A G – C T A C C A - - - G C A G – C T A C C A - - - G C A G – C T A T C A C – G G C A G – C T A T C G C – G G A 1 1 .8 C .6 1 .4 1 .6 .2 G 1 .2 .2 .4 1 T .2 1 .6 .2 - .2 .8 .4 .8 .4
  • 28. • Choose the most similar pair of strings and combine them into a profile, thereby reducing alignment of k sequences to an alignment of of k – 1 sequences/profiles. • Then repeat. • This is a heuristic (greedy) method. u1= ACGTACGTACGT… u2 = TTAATTAATTAA… u3 = ACTACTACTACT… … uk = CCGGCCGGCCGG u1= ACg/tTACg/tTACg/cT… u2 = TTAATTAATTAA… … uk = CCGGCCGGCCGG… k k – 1 Multiple Alignment: Greedy Approach
  • 29. Greedy Approach: Example • Consider the 4 sequences: GATTCA, GTCTGA, GATATT, GTCAGC
  • 30. • Consider the 4 sequences: GATTCA, GTCTGA, GATATT, GTCAGC. • There are = 6 possible pairwise alignments:      2 4 s2 GTCTGA s4 GTCAGC (score = 2) s1 GAT-TCA s2 G-TCTGA (score = 1) s1 GAT-TCA s3 GATAT-T (score = 1) s1 GATTCA-- s4 G—T-CAGC (score = 0) s2 G-TCTGA s3 GATAT-T (score = -1) s3 GAT-ATT s4 G-TCAGC (score = -1) Greedy Approach: Example
  • 31. • s2 and s4 are closest, so we consolidate these sequences into one by using the profile matrix: • New set of 3 sequences to align: • We can choose either of the nucleotides in question for s2, 4. s2 GTCTGA s4 GTCAGC s2,4 = GTCt/aGa/cA s1 GATTCA s3 GATATT s2,4 GTCt/aGa/c Greedy Approach: Example
  • 33. Progressive Alignment • Progressive alignment: A variation of the greedy algorithm for multiple alignment with a somewhat more intelligent strategy for choosing the order of alignments. • Progressive alignment works well for close sequences, but deteriorates for distant sequences. • Gaps in consensus string are permanent. • Use profiles to compare sequences.
  • 34. ClustalW • Popular multiple alignment tool today. • „W‟ stands for „weighted‟ (different parts of alignment are weighted differently). • Three-step process: 1. Construct pairwise alignments. 2. Build guide tree. 3. Progressive alignment guided by the tree.
  • 35. v1 v2 v3 v4 v1 - v2 .17 - v3 .87 .28 - v4 .59 .33 .62 - (.17 means 17 % identical) Step 1: Pairwise Alignment • Aligns each sequence against each other, giving a similarity matrix. • Similarity = exact matches / sequence length (percent identity).
  • 36. Step 2: Guide Tree • Create guide tree using the similarity matrix. • ClustalW uses the neighbor-joining method, • Guide tree roughly reflects evolutionary relations. v1 v2 v3 v4 v1 - v2 .17 - v3 .87 .28 - v4 .59 .33 .62 - v1 v3 v4 v2
  • 37. FOS_RAT PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD FOS_MOUSE PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD FOS_CHICK SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD FOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ FOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ . . : ** . :.. *:.* * . * **: Dots and stars show how well-conserved a column is Step 3: Progressive Alignment • Start by aligning the two most similar sequences. • Following the guide tree, add in the next sequences, aligning to the existing alignment. • Insert gaps as necessary.
  • 39. Multiple Alignments: Scoring • We will discuss three possible scoring systems: 1. Number of matches (multiple longest common subsequence score) 2. Entropy score 3. Sum of pairs (SP-Score)
  • 40. • A column is a “match” if all the letters in the column are the same. • Example: Only the first column in the following matching represents a “match:” • The Multiple LCS score is the total number of matches. • This score is good for very similar sequences. A A A A A A A A T A T C Score # 1: Multiple LCS Score
  • 41. Score # 2: Entropy Score • Define frequencies px for the occurrence of each letter x in each column of the multiple alignment. • Then, compute “entropy” of each column. • The entropy score is then given by the sum of the entropies of all the columns. Entropy of Column =    CGTAX XX pp ,,, log
  • 42. Entropy: Example • For our sequences {AAA, AAA, AAT, ATC}:
  • 43. Entropy: Example • For our sequences {AAA, AAA, AAT, ATC}: • 1st Column: pA = 1, pT= pG= pC= 0 Entropy = 0
  • 44. Entropy: Example • For our sequences {AAA, AAA, AAT, ATC}: • 1st Column: pA = 1, pT= pG= pC= 0 • 2nd Column: pA = 0.75, pT = 0.25, pG= pC= 0 Entropy = 0 Entropy = 0.56
  • 45. Entropy: Example • For our sequences {AAA, AAA, AAT, ATC}: • 1st Column: pA = 1, pT= pG= pC= 0 • 2nd Column: pA = 0.75, pT = 0.25, pG= pC= 0 • 3rd Column: pA = 0.50, pT = 0.25, pC= 0.25, pG= 0 Entropy = 0 Entropy = 0.56 Entropy = 1.04
  • 46. Entropy: Example • For our sequences {AAA, AAA, AAT, ATC}: • 1st Column: pA = 1, pT= pG= pC= 0 • 2nd Column: pA = 0.75, pT = 0.25, pG= pC= 0 • 3rd Column: pA = 0.50, pT = 0.25, pC= 0.25, pG= 0 • Entropy Score = 0 + 0.56 + 1.04 = 1.60 Entropy = 0 Entropy = 0.56 Entropy = 1.04
  • 47. 0               A A A A entropy Entropy: Interpretation • The more similar the members of a column, the lower the entropy score. • Example: Best and worst cases: • Therefore, if we are searching for the best multiple alignment, we will want to minimize the entropy score.
  • 48. Inferring Pairwise from Multiple Alignments • Recall: Every multiple alignment induces pairwise alignments. • From a multiple alignment, we can infer pairwise alignments between all sequences, but they are not necessarily optimal. • We can view reducing multiple alignments to pairwise alignments as projecting a 3-D multiple alignment path onto a 2-D face of the cube • Our third scoring function for MSA will be based off the projections.
  • 49. Multiple Alignment Projections: Illustration • A 3-D alignment can be projected onto the 2-D plane to represent an alignment between a pair of sequences. • Example: Figure at right. • Solid line: represents a 3-D alignment path. • Dashed lines: represent the three induced pairwise alignments that are projected onto the cube‟s faces.
  • 50. • Consider the pairwise alignment of sequences ai and aj implied from a multiple alignment of k sequences. • Denote the score of this (not necessarily optimal) pairwise alignment as s*(ai, aj). • Sum of Pairs (SP) Score: Obtained by summing the pairwise scores: s(a1,…,ak) = Σi,j s*(ai, aj) Score 3: Sum of Pairs Score (SP-Score)
  • 51. Multiple Alignment: History • 1975: Sankoff formulates multiple alignment problem and gives the dynamic programming solution. • 1988: Carrillo and Lipman provide branch and bound approach for Multiple Alignment. David Sankoff David Lipman
  • 52. Multiple Alignment: History • 1990: Feng and Doolittle develop progressive alignment. • 1994: Thompson, Higgins, and Gibson create ClustalW, which is the most popular multiple alignment program in the world. Des HigginsJulie Thompson Toby Gibson Russell Doolittle
  • 53. Multiple Alignment: History • 1998: Morgenstern et al. create DIALIGN, an algorithm for segment-based multiple alignment. • 2000: Notredame, Des Higgins, and Heringa develop T-Coffee, which aligns multiple sequences based off a library of pairwise alignments. Burkhard Morgenstern Cedric Notredame JaapHeringa
  • 54. Multiple Alignment: History • 2004: Robert Edgar formulates MUSCLE, a faster and more efficient algorithm than ClustalW. • 201X: What is next?
  • 55. Problems with Multiple Alignment • Multidomain proteins evolve not only through point mutations but also through domain duplications and domain recombinations. • Although Multiple Alignment is a 30 year old problem, there were no approaches for aligning rearranged sequences (i.e., multi-domain proteins with shuffled domains) prior to 2002. • It is often impossible to align all protein sequences throughout their entire length.
  • 57. Alignment as a Graph • Conventional Alignment • Protein sequence as a path • Two protein sequence paths • Combination of two protein graphs into one graph
  • 58. Representing Sequences as Paths in a Graph • Each protein sequence is represented by a path. • Dashed edges connect “equivalent” positions. • Vertices with identical labels are fused.
  • 59. • Two objectives: 1. Find a graph that represents domain structure 2. Find mapping of each sequence to this graph • Partial Order Alignment (POA): A graph such that every sequence in the given set is a path in G. Partial Order Multiple Alignment
  • 60. • Aligns sequences onto a directed acyclic graph (DAG) • Outline: 1. Guide Tree Construction 2. Progressive Alignment Following Guide Tree 3. Dynamic Programming Algorithm to align two POAs (POA-POA Alignment). • We learned how to align one sequence (path) against another sequence (path). • We need to develop an algorithm for aligning a directed graph against a directed graph. POA Algorithm
  • 61. Dynamic Programming for Aligning Two Graphs • S(n, o) = optimal score for n = node in G, o = node in G‟ • Match/mismatch: Aligning two nodes with score s(n,o) • Deletion/insertion: • Omitting node n from the alignment with score ∆(n) • Omitting node o from the alignment with score ∆(o) • Dynamic formula for S(n, o):
  • 63. • POA is more flexible: standard methods force sequences to align linearly. • POA representation handles gaps more naturally and retains (and uses) all information in the MSA (unlike linear profiles). POA Advantages
  • 64. Section 5: A-Bruijn Approach to Multiple Alignment
  • 65. A-Bruijn Alignment • A-Bruijn Alignment (ABA): Represents alignment as directed graph that may contains cycles. • This is in contrast to POA, which represents alignment as an acyclic directed graph.
  • 66. ABA: How Is the Graph Created?
  • 67. MSA vs. POA vs. ABA MSAOriginal Sequences ABAPOA
  • 68. Advantages of ABA 1. More flexible than POA: allows larger class of evolutionary relationships between aligned sequences 2. Can align proteins with shuffled and/or repeated domain structure 3. Can align proteins with domains present in multiple copies in some proteins 4. Handles: • Domains not present in all proteins. • Domains present in different orders in different proteins.
  • 69. • Chris Lee, POA, UCLA http://www.bioinformatics.ucla.edu/poa/Poa_Tutorial.html Credits