1. GARY BENSON, YOZEN HERNANDEZ, & JOSHUA LOVING
BIOINFORMATICS PROGRAM
BOSTON UNIVERSITY
JLOVING@BU.EDU
A Bit-Parallel, General Integer-Scoring Sequence Alignment Algorithm
2. Introduction: Problem Description
Input:
• Sequences X and Y
• Integer weights M, I, G (match, mismatch, and indel or gap) that define a similarity or distance scoring function S
Output:
• Calculate the global alignment score for X and Y
3. Introduction
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6 -6 -11 -16 -21
T -15 -8 -6 1 -4 -9 -14 -19
C -20 -13 -6 -4 -2 -2 -7 -12
A -25 -18 -11 -9 -7 -5 0 -5
A -30 -23 -16 -14 -12 -10 -3 2
Match = 2, Mismatch = -3, Indel = -5
Global Alignment – Needleman-Wunsch
Alignment Scoring Matrix
Sequence X
Sequence Y
4. Introduction
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6 -6 -11 -16 -21
T -15 -8 -6 1 -4 -9 -14 -19
C -20 -13 -6 -4 -2 -2 -7 -12
A -25 -18 -11 -9 -7 -5 0 -5
A -30 -23 -16 -14 -12 -10 -3 2
Match = 2, Mismatch = -3, Indel = -5
Integer Scores
5. Introduction
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6 -6 -11 -16 -21
T -15 -8 -6 1 -4 -9 -14 -19
C -20 -13 -6 -4 -2 -2 -7 -12
A -25 -18 -11 -9 -7 -5 0 -5
A -30 -23 -16 -14 -12 -10 -3 2
Match = 2, Mismatch = -3, Indel = -5
Penalty from beginning
6. Introduction
- A C T G C A A
- 0 0 0 0 0 0 0 0
A 0 2 -3 -3 -3 -3 2 2
G 0 -3 -1 -6 -1 -6 -3 -1
T 0 -3 -6 1 -4 -4 -8 -6
C 0 -3 -1 -4 -2 -2 -7 -11
A 0 2 -3 -4 -7 -5 0 -5
A 0 2 -1 -6 -7 -10 -3 2
Match = 2, Mismatch = -3, Indel = -5
No initial Penalty
7. Needleman-Wunsch Alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5
G -10
T -15
C -20
A -25
A -30
Match = 2, Mismatch = -3, Indel = -5
8. Needleman-Wunsch Alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2
G -10
T -15
C -20
A -25
A -30
Match = 2, Mismatch = -3, Indel = -5
9. Needleman-Wunsch Alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3
G -10
T -15
C -20
A -25
A -30
Match = 2, Mismatch = -3, Indel = -5
10. Needleman-Wunsch Alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8
G -10
T -15
C -20
A -25
A -30
Match = 2, Mismatch = -3, Indel = -5
11. Needleman-Wunsch Alignment
Match = 2, Mismatch = -3, Indel = -5
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13
G -10
T -15
C -20
A -25
A -30
12. Needleman-Wunsch Alignment
Match = 2, Mismatch = -3, Indel = -5
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18
G -10
T -15
C -20
A -25
A -30
13. Needleman-Wunsch Alignment
Match = 2, Mismatch = -3, Indel = -5
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23
G -10
T -15
C -20
A -25
A -30
14. Needleman-Wunsch Alignment
Match = 2, Mismatch = -3, Indel = -5
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10
T -15
C -20
A -25
A -30
15. Needleman-Wunsch Alignment
Match = 2, Mismatch = -3, Indel = -5
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3
T -15
C -20
A -25
A -30
16. Needleman-Wunsch Alignment
Match = 2, Mismatch = -3, Indel = -5
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1
T -15
C -20
A -25
A -30
17. Needleman-Wunsch Alignment
Match = 2, Mismatch = -3, Indel = -5
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6
T -15
C -20
A -25
A -30
18. Needleman-Wunsch Alignment
Match = 2, Mismatch = -3, Indel = -5
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6 -6
T -15
C -20
A -25
A -30
19. Needleman-Wunsch Alignment
Match = 2, Mismatch = -3, Indel = -5
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6 -6 -11
T -15
C -20
A -25
A -30
20. Needleman-Wunsch Alignment
Match = 2, Mismatch = -3, Indel = -5
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6 -6 -11 -16
T -15
C -20
A -25
A -30
21. Needleman-Wunsch Alignment
Match = 2, Mismatch = -3, Indel = -5
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6 -6 -11 -16 -21
T -15
C -20
A -25
A -30
22. Bit-parallel alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5
G -10
T -15
C -20
A -25
A -30
Match = 2, Mismatch = -3, Indel = -5
Integer Scores
23. Bit-parallel alignment
Match = 2, Mismatch = -3, Indel = -5
Integer Scores
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10
T -15
C -20
A -25
A -30
24. Bit-parallel alignment
Match = 2, Mismatch = -3, Indel = -5
Integer Scores
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6 -6 -11 -16 -21
T -15
C -20
A -25
A -30
25. Bit-parallel alignment
Match = 2, Mismatch = -3, Indel = -5
Integer Scores
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6 -6 -11 -16 -21
T -15 -8 -6 1 -4 -9 -14 -19
C -20
A -25
A -30
26. Bit-parallel alignment
Match = 2, Mismatch = -3, Indel = -5
Integer Scores
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6 -6 -11 -16 -21
T -15 -8 -6 1 -4 -9 -14 -19
C -20 -13 -6 -4 -2 -2 -7 -12
A -25
A -30
27. Bit-parallel alignment
Match = 2, Mismatch = -3, Indel = -5
Integer Scores
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6 -6 -11 -16 -21
T -15 -8 -6 1 -4 -9 -14 -19
C -20 -13 -6 -4 -2 -2 -7 -12
A -25 -18 -11 -9 -7 -5 0 -5
A -30
28. Bit-parallel alignment
Match = 2, Mismatch = -3, Indel = -5
Integer Scores
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6 -6 -11 -16 -21
T -15 -8 -6 1 -4 -9 -14 -19
C -20 -13 -6 -4 -2 -2 -7 -12
A -25 -18 -11 -9 -7 -5 0 -5
A -30 -23 -16 -14 -12 -10 -3 2
29. Motivation
Cheaper sequencing of DNA means that larger datasets are being generated
Sequence analysis of such large datasets can be accelerated by faster alignment algorithms
Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). Available at: www.genome.gov/sequencingcosts. Accessed June 10, 2013.
42. Bit-parallel Representation
Bit-vectors: computer words 64 bits long
A bit-vector for each possible difference, both horizontally and vertically (∆H and ∆V)
A set of Match vectors (MatchA, MatchC, MatchG, MatchT in the DNA case)
We keep track of match positions because they are a special case in the function table.
46. Algorithm: Example
- A C T G C A A
C -20 -13 -6 -4 -2 -2 -7 -12
A -25 -18 -11 -9 -7 -5 0 -5
∆H values (row C): 7 7 2 2 0 -5 -5
∆V values (row C to row A): -5 -5 -5 -5 -5 -3 7 7
48. Implementation
Python script that generates C code based on input parameters (M, I, G)
Will eventually have web page for download of code
49. Experimental Analysis
Compared BHL with
Wu-Manber K-differences algorithm
Unit cost edit distance bit-parallel algorithm
Longest Common Subsequence bit-parallel algorithm
Needleman-Wunsch dynamic programming algorithm
Computed 25 million alignments with each program
Each DNA sequence was 63 bases long
All programs compiled using GCC, optimization level O3
Computation done on a typical desktop computer
52. Current and Future Work
Implementation for sequences longer than one word
Single Instruction Multiple Data (SIMD) implementation
BLOSUM and PAM type substitution matrix support
General Purpose Graphics Processing Unit (GPGPU) implementation
New bit-parallel representations for greater speed and compactness of data
53. Acknowledgements
My advisor, Dr. Gary Benson
Lab members
Yevgeniy Gelfand, Yozen Hernandez
Funded by the National Science Foundation (NSF)
In computational biology, sequence alignment forms an important part of biological sequence analysis. To formally state the problem we are trying to solve: given two input sequences, we want to find their global alignment score, using integer parameters to weight the alignment. We aren’t computing the alignment itself, which takes extra steps; we are solely interested in computing the alignment score.
The Needleman-Wunsch algorithm is the standard method of global alignment. Here is the alignment scoring matrix for an example alignment of two sequences with a particular set of alignment scoring weights.
We restrict the scores that we are using to be only integer valued.
In this alignment example, we want to align the strings from their beginnings, so we impose a gap penalty for starting from a place other than the beginning in either string.
However, if this constraint is not present, we can allow alignment to start at any point by initializing the first row and column to be all zero. Our method allows either case, but we will focus on the case using an initial penalty.
As a brief overview, the NW algorithm proceeds by initializing the scoring matrix’s first row and column.
To fill in the scoring matrix, we iterate over it, with each cell’s score being determined by three neighboring cells: the one to the left, the one diagonally up and to the left, and the one above.
To fill in the entire scoring matrix, each cell must be visited once.
Because cells depend on the cells that preceded them, they must be done in order.
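The fill order described above can be sketched in Python. This is a minimal illustration of the standard recurrence, not the code used in our experiments; the function name `nw_score` and the `initial_penalty` flag are assumptions made for this example. Since we only want the score, it keeps just two rows of the matrix.

```python
def nw_score(x, y, match, mismatch, indel, initial_penalty=True):
    """Return the bottom-right cell of the alignment scoring matrix.

    x labels the columns and y labels the rows (an assumption that
    mirrors the layout of the example matrix).
    """
    # First row: either cumulative gap penalties or all zeros.
    prev = [j * indel if initial_penalty else 0 for j in range(len(x) + 1)]
    for i, yc in enumerate(y, start=1):
        # First column of the current row.
        curr = [i * indel if initial_penalty else 0]
        for j, xc in enumerate(x, start=1):
            diagonal = prev[j - 1] + (match if xc == yc else mismatch)
            above = prev[j] + indel
            left = curr[j - 1] + indel
            curr.append(max(diagonal, above, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Example sequences and weights from the slides.
    print(nw_score("ACTGCAA", "AGTCAA", match=2, mismatch=-3, indel=-5))
```

With the example sequences and weights, this returns the bottom-right cell of the example matrix. Each cell still has to wait for its left neighbor, which is the ordering constraint that the bit-parallel approach works around.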
Bit-parallel alignment allows us to represent multiple cells by bits in a computer word. Rather than computing each cell individually, we compute values for entire rows at once. The benefit of bit-parallel algorithms lies in their speed and efficiency.
In this figure, we have cost, in dollars, shown logarithmically on the y-axis against time on the x-axis. The white line represents how Moore’s Law shrinks the cost of computation over time. In green, we have the actual cost of sequencing a human genome, so it is clear that the cost of sequencing is dropping much faster than the cost of computing. This makes implementing faster sequence alignment algorithms important, which is what we are doing.
Say that Hyyro achieved 4 bit operations per word for LCS, and 15 bit operations per word for unit-cost edit distance. Arbitrary-weights edit distance: does it solve the same problem? They gave no implementation and were only able to guess at the time complexity. We have an implementation that uses a different methodology.
As we recalled, each cell in the NW alignment matrix is derived from the 3 adjacent cells to its left and above.
As an example, we will consider a small block of the scoring matrix.
We know that the scores in the scoring matrix have unbounded values. However, the differences between adjacent cells are bounded. In our bit-parallel algorithm and others, these differences are used as the problem representation. This is similar to how the Four Russians technique uses differences between cells. (Possibly refer to the Four Russians technique if questions are asked.)
We will call these differences Delta H and Delta V.
Of course, it is obvious that Delta Vnext and Delta Hnext can be derived from Delta H and Delta V, but what was not obvious was the regular pattern that emerges if you look at all possible inputs and outputs.
In fact, the output values produced by input values of Delta H and Delta V follow a very regular pattern. This is shown here in this function table for Delta Vnext output values.
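One way to reproduce such a table is to subtract the cell above from the Needleman-Wunsch recurrence, which gives Delta Vnext = max(s - Delta H, G, Delta V + G - Delta H), where s is the match score for a match and the mismatch score otherwise. The sketch below tabulates that expression for the example weights. The function names and the printed layout are assumptions for illustration and may not match the table on the slide, and some input pairs in the table may never occur in a real alignment.

```python
def delta_v_next(delta_v, delta_h, is_match, match=2, mismatch=-3, indel=-5):
    """Vertical difference of the new cell, from the two input differences.

    Derived by subtracting the cell above from the Needleman-Wunsch
    recurrence; every term is taken relative to the diagonal cell.
    """
    s = match if is_match else mismatch
    return max(s - delta_h, indel, delta_v + indel - delta_h)


def function_table(is_match, match=2, mismatch=-3, indel=-5):
    """Map every (delta_v, delta_h) input pair to its delta_v_next output."""
    lo, hi = indel, match - indel  # range of possible differences
    return {
        (dv, dh): delta_v_next(dv, dh, is_match, match, mismatch, indel)
        for dv in range(lo, hi + 1)
        for dh in range(lo, hi + 1)
    }


if __name__ == "__main__":
    table = function_table(is_match=False)
    values = range(-5, 8)
    # Rows are Delta V inputs, columns are Delta H inputs.
    print("dV\\dH " + " ".join(f"{dh:3d}" for dh in values))
    for dv in values:
        print(f"{dv:5d} " + " ".join(f"{table[(dv, dh)]:3d}" for dh in values))
```

Delta Hnext follows by symmetry, swapping the roles of the two inputs.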
As an example, we can look at the function table shown before. The input Delta V and Delta H values range from -5 to positive 7.
These values derive from the input parameters of 2, -3, and -5.
The minimum value is equal to the indel penalty, and the maximum value is equal to the match score minus the indel penalty: 2 minus (negative 5), which is 7.
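These bounds can be checked against the example matrix. The sketch below rebuilds the full matrix, takes the horizontal and vertical differences between adjacent cells, and compares the observed extremes with G and M - G. The helper names `nw_matrix` and `differences` are hypothetical, and the check covers this one example rather than proving the bound in general.

```python
def nw_matrix(x, y, match, mismatch, indel):
    """Full Needleman-Wunsch scoring matrix with an initial gap penalty."""
    rows = [[j * indel for j in range(len(x) + 1)]]
    for i, yc in enumerate(y, start=1):
        row = [i * indel]
        for j, xc in enumerate(x, start=1):
            row.append(max(
                rows[i - 1][j - 1] + (match if xc == yc else mismatch),
                rows[i - 1][j] + indel,
                row[j - 1] + indel,
            ))
        rows.append(row)
    return rows


def differences(matrix):
    """Return (delta_h, delta_v): adjacent-cell differences by row and column."""
    delta_h = [[r[j] - r[j - 1] for j in range(1, len(r))] for r in matrix]
    delta_v = [[matrix[i][j] - matrix[i - 1][j] for j in range(len(matrix[i]))]
               for i in range(1, len(matrix))]
    return delta_h, delta_v


if __name__ == "__main__":
    M, I, G = 2, -3, -5
    dh, dv = differences(nw_matrix("ACTGCAA", "AGTCAA", M, I, G))
    seen = [d for row in dh + dv for d in row]
    print("observed:", min(seen), max(seen))
    print("bounds:  ", G, M - G)
```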
Similar relationships determine important boundaries between these zones in this function table. The algorithm needs to compute the values in these zones separately. Thus, using a set of scoring parameters, we can generate the corresponding function table.
Here are the same zones shown in the example function table we were looking at.
What are bit-vectors? They are computer words, in current machines, generally 64 bits long. We use a bit-vector to represent each possible difference horizontally and vertically. We end up with two sets, one that represents Delta H, and another set representing the same integer values of difference for Delta V. We also need a set of Match vectors. For each character in Sequence Y, we wish to know whether there are matches in Sequence X. In the case of DNA, this means that we must store 4 vectors for matches with A, C, G or T. These match vectors are necessary because matches represent a special case in the function table.
Here we have a single row in the scoring matrix. The bit-vectors that represent the horizontal changes are shown directly below the values they represent.
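To make this representation concrete, the sketch below encodes one row of Delta H values as a set of bit-vectors, one per possible difference value. The bit order (bit 0 is the first column), the dictionary layout, and the function names are assumptions for illustration only.

```python
def encode_differences(deltas, lo, hi):
    """Encode a row of differences as one bit-vector per possible value.

    Bit j of vectors[d] is set when deltas[j] == d (bit 0 = first position).
    """
    vectors = {d: 0 for d in range(lo, hi + 1)}
    for j, d in enumerate(deltas):
        vectors[d] |= 1 << j
    return vectors


def decode_differences(vectors, length):
    """Inverse of encode_differences: recover the list of differences."""
    deltas = []
    for j in range(length):
        # Exactly one vector has bit j set.
        deltas.append(next(d for d, bits in vectors.items() if (bits >> j) & 1))
    return deltas


if __name__ == "__main__":
    row_c = [7, 7, 2, 2, 0, -5, -5]  # Delta H values of row C in the example
    vectors = encode_differences(row_c, lo=-5, hi=7)
    for d, bits in vectors.items():
        if bits:
            # Binary strings print bit 0 (first column) on the right.
            print(f"{d:3d}: {bits:07b}")
```

Every column sets exactly one bit across the whole set, so the vectors partition the row positions.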
Similarly, this is an example set of Match vectors used to store the matches corresponding to each base in Sequence X.
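A small sketch of how such Match vectors could be built for the example Sequence X follows. The function name, the bit order (bit 0 is the first base), and the restriction to the four-letter DNA alphabet are assumptions made here; a symbol outside the alphabet would raise a KeyError.

```python
def match_vectors(x, alphabet="ACGT"):
    """One bit-vector per symbol; bit j is set when x[j] equals that symbol."""
    vectors = {symbol: 0 for symbol in alphabet}
    for j, symbol in enumerate(x):
        vectors[symbol] |= 1 << j
    return vectors


if __name__ == "__main__":
    vectors = match_vectors("ACTGCAA")
    for symbol, bits in vectors.items():
        # Binary strings print bit 0 (first base) on the right.
        print(f"Match{symbol}: {bits:07b}")
```

When processing a row, the character of Sequence Y for that row selects which of the four vectors to use, so the matches for the whole row are available in a single lookup.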
Point out the squared values, note that these tend to be small, and mention that we will be showing an actual runtime demonstration of how the parameters affect efficiency.
As the scoring parameters change, the code changes as well. Since it would be very difficult to write a program for each possible input parameter set, I wrote a script that generates the C code for each set of scores. We will host it on a website.
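To illustrate why the generated code depends on the parameters, here is a toy generator that emits only the C declarations for the difference bit-vectors, whose number is set by the range from G to M - G. It is a hypothetical sketch and not our actual script: the naming scheme is invented, and a real generator would also have to emit the update logic for each zone of the function table.

```python
def generate_declarations(match, indel, word_type="uint64_t"):
    """Emit C declarations for one bit-vector per possible difference value.

    Illustrative only: the identifier naming scheme is hypothetical.
    """
    lo, hi = indel, match - indel  # range of possible differences

    def name(prefix, d):
        # C identifiers cannot contain '-', so negative values use 'neg'.
        return f"{prefix}{'neg' if d < 0 else 'pos'}{abs(d)}"

    lines = ["#include <stdint.h>", ""]
    for prefix in ("DH", "DV"):
        names = ", ".join(name(prefix, d) for d in range(lo, hi + 1))
        lines.append(f"{word_type} {names};")
    return "\n".join(lines)


if __name__ == "__main__":
    print(generate_declarations(match=2, indel=-5))
```

For the example weights this declares 13 vectors per direction, one for each difference from -5 to 7; larger weights widen the range and so enlarge the generated program.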
For all experiments, we used human DNA and did a total of 25 million alignments. All sequences were 63 bases long, to allow the bit-parallel representations to fit in single computer words. (If they ask: we are working on the multiple-word implementation, but were unable to finish in time for this paper.) Single core of a 3 GHz computer (Intel Core 2 Duo).
We implemented our algorithm and compared it to several other pattern matching algorithms. This figure compares our method with Needleman-Wunsch. Say what the Y-axis is: minutes to compute 25 million alignments. Say what the X-axis is: number of bit operations that each of our versions of the algorithm uses. Say what the labels on the BHL algorithms mean. Note that as the parameters increase, the time increases. Note that NW takes over 7 minutes; our 2-3-5 version takes under 2.
Here, we compare the unit-cost edit distance bit-parallel algorithm, the LCS bit-parallel algorithm, the Wu-Manber K-differences algorithm, and the version of our algorithm equivalent to unit-cost edit distance. While our algorithm has slightly more operations, 23, than the unit-cost edit distance algorithm with 15, our algorithm is not optimized for any particular parameter set. Even though our algorithm is general purpose, we come quite close to the best known solution.
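For context, a bit-parallel unit-cost edit distance computation can be sketched as below. It follows the well-known bit-vector formulation for Levenshtein distance in the style of Myers and Hyyro. It is not the code we benchmarked, the function name is an assumption, and Python integers are masked to emulate a fixed-width machine word.

```python
def bit_parallel_edit_distance(pattern, text):
    """Unit-cost (Levenshtein) edit distance, one text character per step."""
    m = len(pattern)
    if m == 0:
        return len(text)
    mask = (1 << m) - 1          # emulates an m-bit machine word
    top = 1 << (m - 1)           # bit for the last row of the matrix
    peq = {}                     # match vectors for the pattern
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)

    pv, mv = mask, 0             # vertical +1 / -1 difference vectors
    score = m
    for ch in text:
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask   # horizontal +1 differences
        mh = pv & xh                    # horizontal -1 differences
        if ph & top:
            score += 1
        elif mh & top:
            score -= 1
        ph = ((ph << 1) | 1) & mask     # first row increases by 1 per column
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
    return score


if __name__ == "__main__":
    print(bit_parallel_edit_distance("ACTGCAA", "AGTCAA"))
```

Here the differences between adjacent cells are only -1, 0, or +1, so two vectors per direction are enough; with general integer weights the range of differences grows, which is one reason the general algorithm needs more operations.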
I’d like to thank my advisor, Dr. Gary Benson, and the other members of my lab, Yevgeniy Gelfand and Yozen Hernandez, as well as the NSF for funding this work.