3. Introduction
Suppose you have acquired a DNA/protein sequence derived from a sample of an environment such as a lake, a pond, or a plant.
[Figure: Cell Samples → Sequencing process → Your sequence (KLMNTRARLIVHISG LTRK…)]
Img Src: http://www.austincc.edu
4. Introduction
• Or you might get a DNA/Protein sequence from a database such as
NCBI/EMBL/Swiss-Prot. You might also find an interesting
gene/sequence from a journal.
[Figure: Your sequence (KLMNTRARLIVHISG LTRK…)]
5. Introduction
• In that case, you might want to know whether your sequence already exists in a database, or is similar to sequences in it, possibly narrowing the search down to a particular organism's database.
• Why do you want to know that?
• Because from similar sequences you can infer structural, functional, and evolutionary relationships for your query sequence.
[Figure: Your sequence compared against a database: Already in here? Similar?]
7. Introducing BLAST (Basic Local Alignment Search Tool)
The BLAST tool is used to compare a query sequence with a
library or database of sequences.
It uses a heuristic search algorithm based on statistical
methods. The algorithm was invented by Stephen
Altschul and his co-workers in 1990.
BLAST programs were designed for fast database
searching.
11. BLAST Algorithm (Protein)
Query Sequence: Y A N C L E H K M G S (Length 11)
Word size: W = 3
Slide a window of W = 3 along the query to list its words:
Y A N, A N C, ... L E H, E H K, H K M, ...
This generates 11 - 3 + 1 = 9 words
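The word-generation step can be sketched in a few lines of Python. This is only an illustration of the sliding window described on the slide; the function name `query_words` is made up for this example.

```python
def query_words(query, w=3):
    """Return every overlapping word of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


query = "YANCLEHKMGS"          # the 11-residue query from the slide
words = query_words(query, w=3)

print(len(words))              # 11 - 3 + 1 = 9
print(words)
```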
12. BLAST Algorithm Example
Query word: L E H
For each word from a window = 3, generate neighborhood words using
the BLOSUM62 matrix with score threshold = 11.
All 20^3 = 20 x 20 x 20 possible words of 3 amino acids are aligned with
LEH using BLOSUM62, then sorted by score:

Word     Score
L E H    17
L K H    13
C E H    12
Q E H    11
-------------- score threshold (cut off here)
L M H    10
D E H    9
L F H    9
L E R    9
. . .
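A rough Python sketch of the neighborhood step follows. To keep it self-contained, it embeds only the three BLOSUM62 rows needed for the word LEH (the rows for L, E and H, in the usual A R N D C Q E G H I L K M F P S T W Y V column order); a real implementation would load the whole matrix. The function names are invented for this sketch.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for L, E and H only (enough to score any word against LEH).
BLOSUM62_ROWS = {
    "L": [-1, -2, -3, -4, -1, -2, -3, -4, -3, 2, 4, -2, 2, 0, -3, -2, -1, -2, -1, 1],
    "E": [-1, 0, 0, 2, -4, 2, 5, -2, 0, -3, -3, 1, -2, -3, -1, 0, -1, -3, -2, -2],
    "H": [-2, 0, 1, -1, -3, 0, 0, -2, 8, -3, -3, -1, -2, -1, -2, -1, -2, -2, 2, -3],
}


def word_score(query_word, candidate):
    """Sum the per-position substitution scores of candidate against query_word."""
    return sum(
        BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)]
        for q, c in zip(query_word, candidate)
    )


def neighborhood(query_word, threshold):
    """Score all 20^W candidate words and keep those at or above the threshold."""
    hits = []
    for letters in product(AMINO_ACIDS, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score >= threshold:
            hits.append((candidate, score))
    return sorted(hits, key=lambda item: -item[1])


neighbors = dict(neighborhood("LEH", threshold=11))
for word in ["LEH", "LKH", "CEH", "QEH", "LMH", "DEH", "LFH", "LER"]:
    print(word, word_score("LEH", word), word in neighbors)
```

The slide lists only a handful of words; the full neighborhood at threshold 11 also contains others (for example IEH, which scores 2 + 5 + 8 = 15).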
13. BLAST Algorithm Example
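Word List (neighborhood words of L E H that passed the threshold):
L E H
C E H
L K H
Q E H
Database sequences (example): DAPCQEHKRGWPNDC
Each word in the word list is scanned against the database sequences.
Exact matches of words from the word list to the database sequences are recorded (here, Q E H matches DAPC[QEH]KRGWPNDC).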
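The exact-match scan can be pictured with plain string searching, as in the sketch below. Real BLAST uses an index or lookup table for speed; the helper name `find_word_hits` is hypothetical.

```python
def find_word_hits(word_list, db_sequence):
    """Return (word, start position) for every exact occurrence of each word."""
    hits = []
    for word in word_list:
        start = db_sequence.find(word)
        while start != -1:
            hits.append((word, start))
            start = db_sequence.find(word, start + 1)
    return hits


word_list = ["LEH", "CEH", "LKH", "QEH"]
db_sequence = "DAPCQEHKRGWPNDC"
hits = find_word_hits(word_list, db_sequence)
print(hits)  # [('QEH', 4)] (positions are 0-based)
```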
14. Word: Q E H
Database sequence: D A P C Q E H K R G W P N D C
Query = Y A N C L E H K M G S
For each exact word match, alignment is extended in both directions
to find high score segments.
Extended in the right direction, max drop off score X = 2

Pair    Score    Accumulated Score
Q-Q     5        5
E-E     5        10
H-H     8        18
K-K     5        23
M-R     -1       22    Score drop = 1 <= X
G-G     6        28
S-W     -3       25    Score drop = 3 > X
Trim to max (accumulated score 28, ending at G-G)
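The drop-off rule can be sketched as below, using the pair scores from the slide directly. One assumption to flag: the sketch measures the drop against the best accumulated score seen so far, which agrees with the numbers on the slide. The function name is invented for this example.

```python
def extend_with_x_drop(seed_score, pair_scores, x_drop):
    """Extend a seed in one direction.

    Stops once the accumulated score falls more than x_drop below the best
    score seen so far, and reports the best score plus how many extension
    pairs are kept after trimming back to that maximum.
    """
    running = best = seed_score
    kept = 0
    for i, score in enumerate(pair_scores, start=1):
        running += score
        if running > best:
            best, kept = running, i
        elif best - running > x_drop:
            break
    return best, kept


# Right extension from the slide: seed QEH scores 18, then K-K, M-R, G-G, S-W.
right_scores = [5, -1, 6, -3]
best_right, kept_right = extend_with_x_drop(18, right_scores, x_drop=2)
print(best_right, kept_right)  # 28 3
```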
15. Word: Q E H
Database sequence: D A P C Q E H K R G W P N D C
Query = Y A N C L E H K M G S
For each exact word match, alignment is extended in both directions
to find high score segments.
Extended in the left direction, max drop off score X = 2
(scores accumulate from right to left)

Pair    Score    Accumulated Score
H-H     8        8
E-E     5        13
Q-Q     5        18
C-C     9        27
N-P     -2       25    Score drop = 2 <= X
A-A     4        29
Y-D     -3       26    Score drop = 3 > X
Trim to max (accumulated score 29, ending at A-A)
16. BLAST Algorithm Example
Maximal Segment Pair (MSP)
Database: A  P  C  Q  E  H  K  R  G
Query:    A  N  C  Q  E  H  K  M  G
Score:    4 -2  9  5  5  8  5 -1  6    (BLOSUM62 Scoring Matrix)
Pair Score = 4-2+9+5+5+8+5-1+6 = 39
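As a quick check of the arithmetic, the MSP score can be recomputed from the per-pair scores shown on the slide, and also from the seed and the two extensions (the variable names are just for this sketch):

```python
# Per-position scores of the MSP, as listed on the slide.
pair_scores = [4, -2, 9, 5, 5, 8, 5, -1, 6]
msp_score = sum(pair_scores)

# Same total from the seed (QEH = 18) plus the gain of each extension.
seed = 5 + 5 + 8            # Q-Q, E-E, H-H
left_gain = 9 - 2 + 4       # C-C, N-P, A-A
right_gain = 5 - 1 + 6      # K-K, M-R, G-G

print(msp_score, seed + left_gain + right_gain)  # 39 39
```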
17. BLAST Algorithm Example
Database: A P C Q E H K R G
Query:    A N C Q E H K M G    Score: 39
Maximal Segment Pairs (MSPs) from other seeds (scores 42, 45, 35, 37, 51, 55, 33) are collected with this one and sorted by alignment score.
Each match has its own E-Value
18. BLAST Algorithm: Expect Value (E-Value)
E-Value: The number of MSPs with similar score or
higher that one can EXPECT to see by chance alone
when searching a database of a particular size.
19. BLAST Algorithm: Expect Value (E-Value)
For example: if the E-Value is equal to 10 for a
particular MSP with score S, about 10 MSPs with
score >= S can be expected to occur by chance alone
(for any query sequence).
So our MSP is most likely not a significant match
at all.
20. BLAST Algorithm: Expect Value (E-Value)
If the E-Value is very small, e.g. 10^-4 (very high score S),
it is almost impossible for any MSP with score >= S
to occur by chance alone.
Thus, our MSP is a significant match
(likely homologous).
21. BLAST Algorithm: E-Value Calculation
First: Calculate bit score
$$S' = \frac{\lambda S - \ln K}{\ln 2}$$
S = Score of the alignment (Raw Score)
$\lambda$, $K$ = values depend on the scoring scheme and
sequence composition of a database.
[log value is natural logarithm (log base e)]
22. BLAST Algorithm: Expect Value (E-Value)
The lower the E-Value, the better.
An E-Value threshold can be used to limit the number of hits
shown on the result page.
23. BLAST Algorithm: E-Value Calculation
Second: Calculate E-Value
$$E = m \cdot n \cdot 2^{-S'}$$
$S'$ = Bit Score
m = query length
n = length of database
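The two-step calculation can be sketched in Python as follows. The sketch assumes the standard Karlin-Altschul form, bit score S' = (λS − ln K) / ln 2 and E = m · n · 2^(−S'). The inputs below are illustrative only: the raw score 39 and query length 11 come from the worked example, while the database length and the λ and K values are assumptions chosen for this sketch.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score S into a bit score S' (natural logarithms)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_length, db_length, lam, k):
    """Expected number of chance MSPs scoring at least raw_score."""
    return query_length * db_length * 2 ** (-bit_score(raw_score, lam, k))


# Illustrative inputs (db_length, lam and k are assumed, not from the slides).
raw, m, n = 39, 11, 1_000_000
lam, k = 0.318, 0.134

print(round(bit_score(raw, lam, k), 2))
print(e_value(raw, m, n, lam, k))
```

Because 2^(−S') equals K · e^(−λS), the same E-Value can be written as K · m · n · e^(−λS), which is a handy cross-check.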
24. BLAST Algorithm: E-Value Interpretation
• E-values of 10^-4 and lower indicate significant
homology.
• E-values between 10^-4 and 10^-2 should be checked
(similar domains, maybe non-homologous).
• E-values between 10^-2 and 1 do not indicate good
homology.
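These rules of thumb translate into a small helper, sketched below. The slide does not say what to do with E-values above 1, so the last branch is an assumption of this sketch, and the function name is made up.

```python
def interpret_e_value(e):
    """Map an E-value onto the rule-of-thumb categories from the slide."""
    if e <= 1e-4:
        return "significant homology"
    if e <= 1e-2:
        return "check by hand (similar domains, maybe non-homologous)"
    if e <= 1:
        return "no good homology"
    # Not covered by the slide; treated here as not significant (assumption).
    return "outside the listed ranges"


for e in (1e-30, 5e-3, 0.5, 10):
    print(e, "->", interpret_e_value(e))
```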
25. Gapped BLAST
The Gapped BLAST algorithm allows gaps to be
introduced into the alignments. That means similar
regions are not broken into several segments.
This method reflects biological relationships much
better.
This results in different parameter values ($\lambda$, $K$) when
calculating the E-Value.
26. BLAST programs
Name       Description
Blastp     Amino acid query sequence against a protein database
Blastn     Nucleotide query sequence against a nucleotide sequence database
Blastx     Nucleotide query sequence translated in all reading frames against a protein database
Tblastn    Protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
Tblastx    Six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database
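The table can be read as a lookup from (query type, database type) to a program. A minimal sketch, in which the type labels and the `translate_both` flag are naming choices made for this example:

```python
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_program(query_type, db_type, translate_both=False):
    """Pick a BLAST program from the query and database sequence types.

    translate_both selects tblastx, which compares six-frame translations
    of a nucleotide query against those of a nucleotide database.
    """
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, db_type)]


print(choose_program("protein", "protein"))            # blastp
print(choose_program("nucleotide", "protein"))         # blastx
print(choose_program("nucleotide", "nucleotide", True))  # tblastx
```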
28. BLAST Suggestion
Where possible, use the translated sequence (protein).
Split a large query sequence (longer than 1000 for DNA, or
200 for protein) into smaller ones.
If the query has low-complexity regions or repeated
segments, remove them and repeat the search.
IVLKVALRPVLRPVLRPVWQARNGS
Repeated segments might prevent the program from finding
the ‘real’ significant matches in a database.
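Both suggestions can be tried out with a few lines of Python. This is a crude sketch: the splitter cuts into non-overlapping chunks, and the repeat finder only catches exact tandem repeats through a regular expression, which is much simpler than the low-complexity filters used by BLAST itself. The function names are invented for this example.

```python
import re


def split_query(sequence, max_length):
    """Cut a long query into consecutive chunks of at most max_length residues."""
    return [sequence[i:i + max_length] for i in range(0, len(sequence), max_length)]


def find_tandem_repeat(sequence, min_unit=2):
    """Return (unit, full repeated stretch) for the first exact tandem repeat."""
    match = re.search(r"(.{%d,}?)\1+" % min_unit, sequence)
    return (match.group(1), match.group(0)) if match else None


protein = "IVLKVALRPVLRPVLRPVWQARNGS"   # example from the slide
repeat = find_tandem_repeat(protein)
print(repeat)                            # ('LRPV', 'LRPVLRPVLRPV')

chunks = split_query("A" * 450, max_length=200)
print([len(c) for c in chunks])          # [200, 200, 50]
```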
29. Running BLAST
Find appropriate BLAST program
Enter query sequence
Select database to search
Run BLAST search
Analyze output
Interpret E-values
30. Documenting BLAST
Program (Blastp, Blastn,..)
Name of database
Word size
E-Value threshold
Substitution matrix
Gap penalty
BLAST results: Sequence Name, Bit Score, Raw
Score, E-Value, Identities, Positives, Gaps
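One way to keep this documentation together is to store the settings in a small record and derive the command line from it. The sketch below assumes a local NCBI BLAST+ installation; the class name, the database name `nr`, the file name, and the parameter values are placeholders chosen for illustration. It only builds the argument list and does not run anything. Note that `-matrix` applies to protein searches such as blastp, not to blastn.

```python
from dataclasses import dataclass, asdict


@dataclass
class BlastRunRecord:
    program: str
    database: str
    word_size: int
    evalue_threshold: float
    matrix: str
    gap_open: int
    gap_extend: int

    def to_command(self, query_path):
        """Build a BLAST+ argument list that reports the documented columns."""
        return [
            self.program,
            "-query", query_path,
            "-db", self.database,
            "-word_size", str(self.word_size),
            "-evalue", str(self.evalue_threshold),
            "-matrix", self.matrix,
            "-gapopen", str(self.gap_open),
            "-gapextend", str(self.gap_extend),
            # Tabular output: name, bit score, raw score, E-value,
            # identities, positives, gaps.
            "-outfmt", "6 sseqid bitscore score evalue nident positive gaps",
        ]


# Placeholder settings for illustration only.
run = BlastRunRecord("blastp", "nr", 3, 1e-4, "BLOSUM62", 11, 1)
print(asdict(run))
print(" ".join(run.to_command("query.fasta")))
```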
32. Homework 4A
Determine the common proteins in Domestic Cat
[Felis catus], Tiger [Panthera tigris] and Snow Leopard
[Uncia uncia] using this initiating sequence
>gi|145558804
MSMVYINMFLAFIMSLMGLLMYRSHLMSSLLCLEGMMLSLFIMMTVAILNNHFTLASMTPII
LLVFAACEAALGLSLLVMVSNTYGTDYVQNLNLLQC
Report for each protein match: Protein name, accession number,
bit score, raw score, E-Value, Identities, Positives and Gaps.
33. Homework 4B
H5N1 is a subtype of the Influenza A virus and is a
bird-adapted strain. This subtype can cause “avian
influenza” or “bird flu”, which can be fatal to humans.
Use the DNA sequence with GenBank Accession number
JX120150.1 as a seed sequence to search for TWO other
matching sequences, each belonging to a different
Influenza A virus subtype (HXNX). [Use Blastn]
Report for each subtype match: Subtype name, Organism origin,
Sequence name, accession number, bit score, raw score,
E-Value, Identities, Positives and Gaps
34. Homework 4C
Suppose you have acquired an unknown protein
sequence
FLWLWPYLSYIEAVPIRKVQDDTKTLIKTIVTRINDISHTQAVSSKQRVAGLDFIP
GLHPVLSLSRMDQTLAIYQQILTSLHSRNVVQISNDLENLRDLLHLLASSKS
(1) Use a BLAST program to find out which species this sequence most likely
belongs to.
(2) Report both the scientific and the common name of the species.
(3) This sequence matches a certain protein of that species. Report the E-Value,
protein accession number [GenBank], protein name, length, full sequence
and function.
35. Homework 4D
Calculate E-Value for an MSP with
Raw Score : 83
Query Length : 103
Length of database : 48,109,873
$\lambda$ : 0.316
$K$ : 0.135
36. Send me an email with the subject
“HW3_BINF_lastname_id” by 28 June before 5pm.
Late submission will NOT be accepted.