6. Needleman-Wunsch-edu.pl
The Score Matrix

j (Seq1)            1     2     3     4     5     6     7
i (Seq2)      *     C     K     H     V     F     C     R
    *         0    -1    -2    -3    -4    -5    -6    -7
1   C        -1  1(a)     0    -1    -2    -3    -4    -5
2   K        -2  0(c)  2(b)     1     0    -1    -2    -3
3   K        -3    -1     1     1     0    -1    -2    -3
4   C        -4    -2     0     0     0    -1     0    -1
5   F        -5    -3    -1    -1    -1     1     0    -1
6   C        -6    -4    -2    -2    -2     0     2     1
7   K        -7    -5    -3    -3    -3    -1     1     1
8   C        -8    -6    -4    -4    -4    -2     0     0
9   V        -9    -7    -5    -5    -3    -3    -1    -1

A: matrix(i,j) = matrix(i-1,j-1) + (MIS)MATCH
   if (substr(seq1,j-1,1) eq substr(seq2,i-1,1))
B: up_score = matrix(i-1,j) + GAP
C: left_score = matrix(i,j-1) + GAP
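A minimal sketch of how the matrix can be filled from the three candidates A, B and C. The scores (+1 match, -1 mismatch, -1 gap) are an assumption inferred from the table values, and the sub name nw_matrix is illustrative, not taken from Needleman-Wunsch-edu.pl.

```perl
use strict;
use warnings;

# Fill a Needleman-Wunsch score matrix; rows follow seq2 (i), columns seq1 (j).
sub nw_matrix {
    my ($seq1, $seq2, $match, $mismatch, $gap) = @_;
    my @m;
    $m[0][$_] = $_ * $gap for 0 .. length($seq1);    # first row: leading gaps
    $m[$_][0] = $_ * $gap for 0 .. length($seq2);    # first column: leading gaps
    for my $i (1 .. length($seq2)) {
        for my $j (1 .. length($seq1)) {
            my $same = substr($seq1, $j - 1, 1) eq substr($seq2, $i - 1, 1);
            my $diag = $m[$i - 1][$j - 1] + ($same ? $match : $mismatch);  # A
            my $up   = $m[$i - 1][$j] + $gap;                              # B
            my $left = $m[$i][$j - 1] + $gap;                              # C
            my $best = $diag;
            $best = $up   if $up   > $best;
            $best = $left if $left > $best;
            $m[$i][$j] = $best;
        }
    }
    return \@m;
}

# Assumed scoring: match +1, mismatch -1, gap -1
my $matrix = nw_matrix("CKHVFCR", "CKKCFCKCV", 1, -1, -1);
print join("\t", @$_), "\n" for @$matrix;
```

The bottom-right cell of the printed matrix is the global alignment score; the traceback that recovers the alignment itself is not shown here.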
7. Multiple Alignment Method
• The most practical and widely used
method in multiple sequence alignment
is the hierarchical extension of
pairwise alignment methods.
• The principle is that a multiple alignment
is achieved by successive application
of pairwise methods.
– First do all pairwise alignments (not just one
sequence with all others)
– Then combine pairwise alignments to generate
overall alignment
8. Database Searching
• Consider the task of searching
SWISS-PROT against a query
sequence:
– say our query sequence is 362
amino acids long
– SWISS-PROT release 38
contains 29,085,265 amino acids
– finding local alignments via
dynamic programming would
entail O(10^10) matrix operations
(362 × 29,085,265 ≈ 1.05 × 10^10)
• Given size of databases, more
efficient methods needed
9. Heuristic approaches to DP for database searching
FASTA (Pearson 1995)
BLAST (Altschul 1990, 1997)
Uses heuristics to avoid
calculating the full dynamic
programming matrix
Uses rapid word lookup
methods to completely skip
most of the database
entries
Speed up searches by an
order of magnitude
compared to full Smith-Waterman
The statistical side of FASTA is
still stronger than BLAST
Extremely fast
One order of magnitude
faster than FASTA
Two orders of magnitude
faster than Smith-Waterman
Almost as sensitive as FASTA
10. FASTA
« Hit and extend heuristic»
• Problem: Too many calculations
“wasted” by comparing regions
that have nothing in common
• Initial insight: Regions that are
similar between two sequences
are likely to share short
stretches that are identical
• Basic method: Look for similar
regions only near short
stretches that match exactly
11. FASTA-Stages
1. Find k-tups in the two sequences (k=1,2 for
proteins, 4-6 for DNA sequences)
2. Score and select top 10 scoring “local diagonals”
3. Rescan top 10 regions, score with PAM250
(proteins) or DNA scoring matrix. Trim off the
ends of the regions to achieve highest scores.
4. Try to join regions with gapped alignments. Join
if similarity score is one standard deviation above
average expected score
5. After finding the best initial region, FASTA
performs a global alignment of a 32 residue wide
region centered on the best initial region, and
uses the score as the optimized score.
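Stages 1 and 2 can be sketched as a k-tup lookup table plus a count of hits per diagonal. This is a simplified illustration, not the FASTA source: the sub name diagonal_hits and the two example peptides are hypothetical, and real FASTA scores diagonals with more than a raw hit count.

```perl
use strict;
use warnings;

# Count k-tup matches per diagonal (diagonal = query position - db position).
sub diagonal_hits {
    my ($query, $db, $k) = @_;
    my %pos;                                     # k-tup => positions in $db
    for my $j (0 .. length($db) - $k) {
        push @{ $pos{ substr($db, $j, $k) } }, $j;
    }
    my %diag;                                    # diagonal => number of hits
    for my $i (0 .. length($query) - $k) {
        my $hits = $pos{ substr($query, $i, $k) } or next;
        $diag{ $i - $_ }++ for @$hits;
    }
    return \%diag;
}

# Hypothetical peptides, k = 2 as for proteins
my $diag = diagonal_hits("HARFYAAQIVL", "VDMAAQIA", 2);
my @ranked = sort { $diag->{$b} <=> $diag->{$a} || $a <=> $b } keys %$diag;
splice(@ranked, 10) if @ranked > 10;             # keep the top 10 diagonals
print "diagonal $_: $diag->{$_} k-tup hits\n" for @ranked;
```

Several hits on the same diagonal mean that several exact words line up at the same offset, which is where the later rescoring and joining stages concentrate their work.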
14. FastA
• Sensitivity: the ability of a
program to identify weak but
biologically significant sequence
similarity.
• Selectivity: the ability of a
program to discriminate between
true matches and matches
occurring by chance alone.
– A decrease in selectivity results in
more false positives being reported.
15. FastA (http://www.ebi.ac.uk/fasta33/)
Gap opening penalty
-12, -16 by default
for fasta with
proteins and DNA,
respectively
Gap extension
penalty -2, -4 by
default for fasta
with proteins and
DNA, respectively
Max number of
scores and
alignments is 100
Blosum50
default.
Lower PAM
higher blosum
to detect close
sequences
Higher PAM and
lower blosum
to detect distant
sequences
The larger the
word-length the
less sensitive, but
faster the search
will be
16. FastA Output
Initn, init1, opt, zscore calculated
during run
E score: the expectation value, i.e. how many hits
are expected to be found by chance with such a
score while comparing this query to this database.
Database
code
hyperlinked
to the SRS
database at
EBI
Accession
number
Description
Length
E() does not
represent the
% similarity
17. FastA is a family of programs
FastA, TFastA, FastX, FastY
Query: DNA or Protein
Database: DNA or Protein
18. FASTA problems
FASTA can miss significant similarity
since
– For proteins, similar sequences do
not have to share identical residues
• Asp-Lys-Val is quite similar to
• Glu-Arg-Ile yet it is missed even with
ktuple size of 1 since no amino acid
matches
• Gly-Asp-Gly-Lys-Gly is quite similar
to Gly-Glu-Gly-Arg-Gly but there is
no match with ktuple size of 2
19. FASTA problems
FASTA can miss significant
similarity since
– For nucleic acids, due to codon
“wobble”, DNA sequences may
look like XXyXXyXXy where X’s
are conserved and y’s are not
• GGuUCuACgAAg and
GGcUCcACaAAa both code for
the same peptide sequence (Gly-Ser-Thr-Lys) but they don’t match with
ktuple size of 3 or higher
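The examples on these two slides can be checked with a few lines of Perl. The sketch below lists the words of length k that two sequences share; the sub name shared_ktups is illustrative, and the peptides are written in one-letter code (DKV = Asp-Lys-Val, ERI = Glu-Arg-Ile, and so on).

```perl
use strict;
use warnings;

# Return the sorted list of distinct k-tups that occur in both sequences.
sub shared_ktups {
    my ($s1, $s2, $k) = @_;
    my %seen;
    $seen{ substr($s1, $_, $k) } = 1 for 0 .. length($s1) - $k;
    my %shared;
    for my $j (0 .. length($s2) - $k) {
        my $word = substr($s2, $j, $k);
        $shared{$word} = 1 if $seen{$word};
    }
    my @words = sort keys %shared;
    return @words;
}

my @cases = (
    [ "DKV",          "ERI",          1 ],   # Asp-Lys-Val vs Glu-Arg-Ile
    [ "GDGKG",        "GEGRG",        2 ],   # Gly-Asp-Gly-Lys-Gly vs Gly-Glu-Gly-Arg-Gly
    [ "GGUUCUACGAAG", "GGCUCCACAAAA", 3 ],   # the codon wobble example
);
for my $case (@cases) {
    my ($s1, $s2, $k) = @$case;
    my @w = shared_ktups($s1, $s2, $k);
    print "$s1 vs $s2, k=$k: ", (@w ? "@w" : "no shared k-tups"), "\n";
}
```

With no shared k-tup there is nothing for the first FASTA stage to seed on, however similar the sequences are.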
22. What does BLAST do?
• Search a large target set of sequences...
• …for hits to a query sequence...
• …and return the alignments and scores from those
hits...
• Do it fast.
Show me those sequences that deserve a second look.
Blast programs were designed for fast database
searching, with minimal sacrifice of sensitivity to
distantly related sequences.
23. The big red button
Do My Job
It is dangerous to hide too much of the
underlying complexity from the scientists.
24. Overview
• Approach: find segment pairs
by first finding word pairs that
score above a threshold, i.e.,
find word pairs of fixed length
w with a score of at least T
• Key concept “Neighborhood”:
Seems similar to FASTA, but
we are searching for words
which score above T rather than
words that match exactly
• Calculate neighborhood (T) for
substrings of query (size W)
25. Overview
Compile a list of words which give a score
above T when paired with the query sequence.
– Example using PAM-120 for query sequence ACDE
(w=4, T=17):
A C D E
A C D E = +3 +9 +5 +5 = 22
• try all possibilities:
A A A A = +3 -3  0  0 =  0 no good
A A A C = +3 -3  0 -7 = -7 no good
• ...too slow, try directed change
26. Overview
A C D E
A C D E = +3 +9 +5 +5 = 22
• change 1st pos. to all acceptable substitutions
g C D E = +1 +9 +5 +5 = 20 ok
n C D E = +0 +9 +5 +5 = 19 ok
i C D E = -1 +9 +5 +5 = 18 ok
k C D E = -2 +9 +5 +5 = 17 ok
• change 2nd pos.: can't - all alternatives negative
and the other three positions only add up to 13
• change 3rd pos. in combination with first position
gCnE = 1 9 2 5 = 17 ok
• continue - use recursion
• For "best" values of w and T there are typically
about 50 words in the list for every residue in the
query sequence
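The "directed change" idea can be sketched as a recursion that abandons a partial word as soon as T is out of reach. This is an illustration only: the 4-letter alphabet and the matrix are a toy. The A/A, C/C, D/D, E/E, A/C, A/D, A/E and C/E scores follow the worked example above, while C/D and D/E are invented for the sketch.

```perl
use strict;
use warnings;

# Toy scoring matrix (partly from the worked example, partly invented).
my @ALPHA = qw(A C D E);
my %S = (
    A => { A => 3,  C => -3, D => 0,  E => 0  },
    C => { A => -3, C => 9,  D => -7, E => -7 },
    D => { A => 0,  C => -7, D => 5,  E => 3  },
    E => { A => 0,  C => -7, D => 3,  E => 5  },
);

# Return { word => score } for every word scoring >= $T against $query.
sub neighborhood {
    my ($query, $T) = @_;
    my @q = split //, $query;
    # $best[$p] = highest score still obtainable from position $p onwards
    my @best = (0) x (@q + 1);
    for my $p (reverse 0 .. $#q) {
        my ($max) = sort { $b <=> $a } values %{ $S{ $q[$p] } };
        $best[$p] = $best[$p + 1] + $max;
    }
    my %nh;
    my $extend;
    $extend = sub {
        my ($p, $prefix, $score) = @_;
        if ($p == @q) { $nh{$prefix} = $score; return; }
        for my $aa (@ALPHA) {
            my $s = $score + $S{ $q[$p] }{$aa};
            next if $s + $best[$p + 1] < $T;     # prune: T cannot be reached
            $extend->($p + 1, $prefix . $aa, $s);
        }
    };
    $extend->(0, '', 0);
    undef $extend;                               # break the closure cycle
    return \%nh;
}

my $nh = neighborhood("ACDE", 17);
print "$_ $nh->{$_}\n" for sort { $nh->{$b} <=> $nh->{$a} || $a cmp $b } keys %$nh;
```

The pruning test mirrors the slide's remark on the 2nd position: once the remaining positions cannot make up the deficit, the whole branch is skipped.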
27. Neighborhood.pl
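# Calculate neighborhood
# Assumes these are defined earlier in the script:
#   @A - alphabet (residue letters), %S - scoring matrix ($S{x}{y}),
#   @W - the three residues of the query word, $T - score threshold
my %NH;
for (my $i = 0; $i < @A; $i++) {
    my $s1 = $S{$W[0]}{$A[$i]};
    for (my $j = 0; $j < @A; $j++) {
        my $s2 = $S{$W[1]}{$A[$j]};
        for (my $k = 0; $k < @A; $k++) {
            my $s3 = $S{$W[2]}{$A[$k]};
            my $score = $s1 + $s2 + $s3;
            my $word = "$A[$i]$A[$j]$A[$k]";
            next if $word =~ /[BZX*]/;   # skip ambiguity codes and stop
            $NH{$word} = $score if $score >= $T;
        }
    }
}
# Output neighborhood, best score first, ties in alphabetical order
foreach my $word (sort {$NH{$b} <=> $NH{$a} or $a cmp $b} keys %NH) {
    print "$word $NH{$word}\n";
}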
32. The BLAST algorithm
• Break the search sequence into words
– W = 3 for proteins, W = 12 for DNA
MCGPFILGTYC
MCG, CGP, GPF, PFI, FIL,
ILG, LGT, GTY, TYC
• Include in the search all words that score
above a certain value (T) for any search word
MCG: MCT, MCN, …
CGP: MGP, CTP, …
This list can be
computed in linear
time
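Breaking the query into overlapping words is a one-liner in Perl. A minimal sketch (the sub name words is illustrative):

```perl
use strict;
use warnings;

# Return every overlapping word of length $w in $seq, in order.
sub words {
    my ($seq, $w) = @_;
    my @words = map { substr($seq, $_, $w) } 0 .. length($seq) - $w;
    return @words;
}

print join(", ", words("MCGPFILGTYC", 3)), "\n";
```

A query of length L yields L - W + 1 words, so the word list grows linearly with the query.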
33. The Blast Algorithm (2)
• Search for the words in the database
– Word locations can be precomputed and indexed
– Searching for a short string in a long string
• HSP (High Scoring Pair) = A match between
a query word and the database
• Find a “hit”: Two non-overlapping HSP’s on a
diagonal within distance A
• Extend the hit until the score falls below a
threshold value, S
35. BLAST parameters
• Lowering the neighborhood word threshold (T)
allows more distantly related sequences to be found,
at the expense of increased noise in the results set.
• Choosing a value for w
– small w: many matches to expand
– big w: many words to be generated
– w=4 is a good compromise
• Lowering the segment extension cutoff (S) returns
longer extensions for each hit.
• Changing the minimum E-value changes the
threshold for reporting a hit.
36. Critical parameters: T, W and scoring matrix
• The proper value of T depends on both the
values in the scoring matrix and the balance
between speed and sensitivity
• Higher values of T progressively remove
more word hits and reduce the search space.
• Word size (W) of 1 will produce more hits
than a word size of 10. In general, if T is
scaled uniformly with W, smaller word
sizes increase sensitivity and decrease
speed.
• The interplay between W, T and the scoring
matrix is critical and choosing them wisely
is the most effective way of controlling the
speed and sensitivity of BLAST
38. Database Searching
• How can we find a particular short sequence
in a database of sequences (or one HUGE
sequence)?
• Problem is identical to local sequence
alignment, but on a much larger scale.
• We must also have some idea of the
significance of a database hit.
– Databases always return some kind of hit, how
much attention should be paid to the result?
• How can we determine how “unusual” a
particular alignment score is?
39. Significance
Sentence 1:
“These algorithms are trying to find the best way to match up
two sequences”
Sentence 2:
“This does not mean that they will find anything profound”
ALIGNMENT:
THESEALGRITHMARETR--YINGTFINDTHEBESTWAYTMATCHPTWSEQENCES
THISDESNTMEANTHATTHEYWILLFINDAN-------YTHIN-GPRFND-----
12 exact matches (marked ":")
14 conservative substitutions (marked ".")
Is this a good alignment?
40. Overview
• A key to the utility of BLAST is
the ability to calculate expected
probabilities of occurrence of
Maximum Segment Pairs
(MSPs) given w and T
• This allows BLAST to rank
matching sequences in order of
“significance” and to cut off
listings at a user-specified
probability
41. Mathematical Basis of BLAST
• Model matches as a sequence of coin tosses
• Let p be the probability of a “head”
– For a “fair” coin, p = 0.5
• (Erdös-Rényi) If there are n throws, then the
expected length R of the longest run of heads is
$R = \log_{1/p}(n)$.
• Example: Suppose n = 20 for a “fair” coin
$R = \log_2(20) = 4.32$
• Trick is how to model DNA (or amino acid)
sequence alignments as coin tosses.
42. Mathematical Basis of BLAST
• To model random sequence alignments, replace a
match with a “head” and mismatch with a “tail”.
AATCAT
HTHHHT
ATTCAG
• For DNA, the probability of a “head” is 1/4
– What is it for amino acid sequences?
43. Mathematical Basis of BLAST
• So, for one particular alignment, the Erdös-Rényi
property can be applied
• What about for all possible alignments?
– Consider that sequences are being shifted back and
forth, dot matrix plot
• The expected length of the longest match is
$R = \log_{1/p}(mn)$
where m and n are the lengths of the two sequences.
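The two formulas are easy to evaluate; the sketch below does so for the coin example and for two DNA sequences. The sub name and the sequence lengths of 1000 are illustrative assumptions.

```perl
use strict;
use warnings;

# Expected length of the longest run of "heads" in $n trials with success
# probability $p: R = log base 1/p of n.
sub expected_longest_run {
    my ($n, $p) = @_;
    return log($n) / log(1 / $p);
}

# Fair coin, 20 throws (the example above)
printf "coin: R = %.2f\n", expected_longest_run(20, 0.5);

# Two DNA sequences of assumed lengths m and n, match probability 1/4;
# all shifts are covered by using m*n as the number of trials.
my ($m, $n) = (1000, 1000);
printf "DNA:  R = %.2f\n", expected_longest_run($m * $n, 0.25);
```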
45. Karlin-Altschul Statistics
$E = k\,m\,n\,e^{-\lambda S}$
This equation states that the number of alignments
expected by chance (E) during the sequence
database search is a function of the size of the
search space (m*n), the normalized score (λS)
and a minor constant (k mostly 0.1)
E-value grows linearly with the product of target and
query sizes. Doubling target set size and doubling
query length have the same effect on E-value
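A small numerical sketch of the E-value equation. The value k = 0.1 is the typical figure quoted above; λ = 0.3 and the raw score S = 60 are illustrative assumptions, and the query and database sizes are borrowed from the SWISS-PROT example earlier in the deck.

```perl
use strict;
use warnings;

# Karlin-Altschul expected number of chance alignments with score >= S.
sub e_value {
    my ($k, $lambda, $m, $n, $S) = @_;
    return $k * $m * $n * exp(-$lambda * $S);
}

my ($k, $lambda, $S) = (0.1, 0.3, 60);    # lambda and S are assumed values
my ($m, $n) = (362, 29_085_265);          # query length, database size

printf "E            = %.3g\n", e_value($k, $lambda, $m,     $n,     $S);
printf "E (2x query) = %.3g\n", e_value($k, $lambda, 2 * $m, $n,     $S);
printf "E (2x db)    = %.3g\n", e_value($k, $lambda, $m,     2 * $n, $S);
```

The last two lines print the same value, which is the "doubling" statement above in numbers.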
47. Scoring alignments
• Score: S (~R)
– $S = \sum_i M(q_i, t_i) - \sum \text{gap penalties}$
• Any alignment has a score
• Any two sequences have a(t least one)
optimal alignment
48. • For a particular scoring matrix and its
associated gap initiation and extension costs
one must calculate λ and k
• Unfortunately (for gapped alignments), you
can’t do this analytically and the values must
be estimated empirically
– The procedure involves aligning random
sequences (Monte Carlo approach) with a specific
scoring scheme and observing the alignment
properties (scores, target frequencies and
lengths)
49. Significance
“Monte Carlo” Approach:
• Compares result to randomized
result, similarly to results generated by a
roulette wheel at Monte Carlo
• Typical procedure for alignments
– Randomize sequence A
– Align to sequence B
– Repeat many times (hundreds)
– Keep track of optimal score
• Histogram of scores …
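The procedure can be sketched as follows. To keep the example short, the score is a stand-in: the number of identical residues in an ungapped, position-by-position comparison. That scoring sub, the sub names and the second peptide are assumptions made for illustration; a real run would use the optimal alignment score instead.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Stand-in score (an assumption): identities in an ungapped comparison.
sub identity_score {
    my ($x, $y) = @_;
    my $n = length($x) < length($y) ? length($x) : length($y);
    return scalar grep { substr($x, $_, 1) eq substr($y, $_, 1) } 0 .. $n - 1;
}

# Randomize sequence A, score it against sequence B, repeat $trials times.
sub shuffled_scores {
    my ($seq_a, $seq_b, $trials) = @_;
    my @scores;
    for (1 .. $trials) {
        my $random_a = join('', shuffle(split //, $seq_a));
        push @scores, identity_score($random_a, $seq_b);
    }
    return @scores;
}

my ($seq_a, $seq_b) = ("MCGPFILGTYC", "MCGPWILATYC");   # second one is made up
my @scores = shuffled_scores($seq_a, $seq_b, 500);

# Text histogram of the background scores, then the real score
my %hist;
$hist{$_}++ for @scores;
printf "%3d %s\n", $_, '*' x $hist{$_} for sort { $a <=> $b } keys %hist;
print "real score: ", identity_score($seq_a, $seq_b), "\n";
```

Shuffling keeps the composition of sequence A and destroys only the order, so the histogram shows what scores composition alone can produce.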
53. Significance
Normal Distribution does NOT Fit Alignment Scores !!
• In seeking optimal Alignments between two
sequences, one desires those that have the highest
score - i.e. one is seeking a distribution of maxima
• In seeking optimal Matches between an Input
Sequence and Sequence Entries in a Database, one
again desires the matches that have the highest
score, and these are obtained via examination of the
distribution of such scores for the entries in the
database - this is again a distribution of maxima.
“A Normal Distribution is a distribution of Sums of
independent variables rather than a sum of their
Maxima.”
55. Alignment scores follow extreme value distributions
Alignment of unrelated/random sequences results in scores
following an extreme value distribution
$P(x \ge S) = 1 - \exp(-k\,m\,n\,e^{-\lambda S})$
m, n: sequence lengths.
k, λ: free parameters.
$P = 1 - e^{-E}$
$E = -\ln(1 - P)$
This can be shown analytically for ungapped alignments and has
been found empirically to also hold for gapped alignments under
commonly used conditions.
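The two conversions between P and E are inverses of each other, and for small values the two numbers are nearly equal. A minimal sketch (the sub names are illustrative):

```perl
use strict;
use warnings;

# P-value from E-value: P = 1 - e^(-E)
sub p_from_e {
    my ($E) = @_;
    return 1 - exp(-$E);
}

# E-value from P-value: E = -ln(1 - P)
sub e_from_p {
    my ($P) = @_;
    return -log(1 - $P);
}

for my $E (10, 1, 0.1, 0.001) {
    printf "E = %-6g  P = %.6f\n", $E, p_from_e($E);
}
```

For E well below 1 the printed P is close to E, while for large E the P-value saturates towards 1.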
56. Alignment scores follow extreme value distributions
Alignment algorithms will always produce
alignments, regardless of whether they are meaningful or not
=> important to have way of selecting significant alignments
from large set of database hits.
Solution: fit distribution of scores from database search to
extreme value distribution; determine p-value of hit from this
fitted distribution.
Example: scores fitted to
extreme value distribution.
99.9% of this distribution is
located below score=112
=> hit with score = 112 has a
p-value of 0.1%
57. Significance
BLAST uses precomputed extreme
value distributions to calculate E-values from alignment scores
For this reason BLAST only allows
certain combinations of substitution
matrices and gap penalties
This also means that the fit is based on
a different data set than the one you
are working on
A word of caution: BLAST tends to overestimate the significance of its
matches
E-values from BLAST are fine for identifying sure hits
One should be careful using BLAST’s E-values to judge if a marginal hit
can be trusted (e.g., you may want to use E-values of 10^-4 to 10^-5).
58. Determining P-values
• If we can estimate the parameters of the extreme
value distribution, then we can
determine, for a given match score x, the
probability that a random match with score x
or greater would have occurred in the
database.
• For sequence matches, a scoring system and
database can be parameterized by two
parameters, k and λ, related to the distribution's parameters.
– It would be nice if we could compare hit
significance without regard to the scoring system
used!
59. Bit Scores
• The expected number of hits with score ≥ S
is:
$E = K\,m\,n\,e^{-\lambda S}$
– Where m and n are the sequence lengths
• Normalize the raw score using:
$S' = \dfrac{\lambda S - \ln K}{\ln 2}$
• Obtains a “bit score” S’, with a standard set of
units.
• The new E-value is: $E = m\,n\,2^{-S'}$
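A short check that the bit-score form gives the same E-value as the raw-score form. The parameter values (K = 0.1, λ = 0.3, S = 60, m = 362, n = 1,000,000) are illustrative assumptions, not values from any particular scoring system.

```perl
use strict;
use warnings;

# Raw-score E-value: E = K m n e^(-lambda S)
sub e_from_raw {
    my ($S, $lambda, $K, $m, $n) = @_;
    return $K * $m * $n * exp(-$lambda * $S);
}

# Bit score: S' = (lambda S - ln K) / ln 2
sub bit_score {
    my ($S, $lambda, $K) = @_;
    return ($lambda * $S - log($K)) / log(2);
}

# Bit-score E-value: E = m n 2^(-S')
sub e_from_bits {
    my ($bits, $m, $n) = @_;
    return $m * $n * 2 ** (-$bits);
}

my ($S, $lambda, $K, $m, $n) = (60, 0.3, 0.1, 362, 1_000_000);   # assumed values
my $bits = bit_score($S, $lambda, $K);
printf "bit score S' = %.2f\n", $bits;
printf "E from raw   = %.4g\n", e_from_raw($S, $lambda, $K, $m, $n);
printf "E from bits  = %.4g\n", e_from_bits($bits, $m, $n);
```

Because K and λ are absorbed into S', bit scores from searches with different scoring systems can be compared directly.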
61. FastA Output
• The distribution-of-scores graph shows the
frequency of observed scores
• expected curve (asterisks) according
to the extreme value distribution
–the theoretical curve should be
similar to the observed results
• deviations indicate that the fitting
parameters are wrong
–too weak gap penalties
–compositional biases
64. FastA Output
• A summary of the statistics and of the
program parameters follows the histogram.
– An important number in this summary is the
Kolmogorov-Smirnov statistic, which indicates
how well the actual data fit the theoretical
statistical distribution. The lower this value, the
better the fit, and the more reliable the statistical
estimates.
– In general, a Kolmogorov-Smirnov statistic under
0.1 indicates a good fit with the theoretical model.
If the statistic is higher than 0.2, the statistics may
not be valid, and it is recommended to repeat the
search, using more stringent (more negative)
values for the gap penalty parameters.
65. Statistics summary
• Optimal local alignment scores for pairs of random
amino acid sequences of the same length follow an
extreme-value distribution. For any score S, the
probability of observing a score >= S is given by the
Karlin-Altschul statistic: $P(\text{score} \ge S) = 1 - \exp(-k\,m\,n\,e^{-\lambda S})$
• k and λ are parameters related to the position
of the maximum and the width of the distribution.
• Note the long tail at the right. This means that a
score several standard deviations above the mean
has higher probability of arising by chance (that is, it
is less significant) than if the scores followed a
normal distribution.
66. P-values
• Many programs report P = the probability that the
alignment is no better than random. The relationship
between Z and P depends on the distribution of the
scores from the control population, which do NOT
follow a normal distribution
– P <= 10^-100 (exact match)
– P in range 10^-100 to 10^-50 (sequences nearly identical, e.g.
alleles or SNPs)
– P in range 10^-50 to 10^-10 (closely related
sequences, homology certain)
– P in range 10^-5 to 10^-1 (usually distant relatives)
– P > 10^-1 (match probably insignificant)
67. E
• For database searches, most programs report E-values. The
E-value of an alignment is the expected number of sequences
that give the same Z-score or better if the database is probed
with a random sequence. E is found by multiplying the value
of P by the size of the database probed. Note that E but not P
depends on the size of the database. Values of P are
between 0 and 1. Values of E are between 0 and the number
of sequences in the database searched:
– E <= 0.02: sequences probably homologous
– E between 0.02 and 1: homology cannot be ruled out
– E > 1: you would have to expect this good a match by chance alone
85. Tips
• Be aware of what options you
have selected when using
BLAST, or FASTA
implementations.
• Treat BLAST searches as
scientific experiments
• So you should try your searches
with the filters on and off to see
whether it makes any difference
to the output
86. Tips: Low-complexity and Gapped Blast Algorithm
• The common Web-based implementations often have
default settings that will affect the outcome
of your searches. By default all NCBI BLAST
implementations filter out biased sequence
composition from your query sequence (e.g.
signal peptide and transmembrane
sequences - beware!).
• The SEG program has been implemented
as part of the blast routine in order to mask
low-complexity regions
• Low-complexity regions are denoted by
strings of Xs in the query sequence
87. Tips
• The sequence databases contain a
wealth of information. They also
contain a lot of errors. Contaminants
…
• Annotation errors, frameshifts that
may result in erroneous conceptual
translations.
• Hypothetical proteins ?
• In the words of Fox Mulder, "Trust
no one."
88. Tips
• Once you get a match to things
in the databases, check whether
the match is to the entire
protein, or to a domain. Don't
immediately assume that a
match means that your protein
carries out the same function
(see above). Compare your
protein and the match protein(s)
along their entire lengths before
making this assumption.
89. Tips
• Domain matches can also cause problems
by hiding other informative matches. For
instance if your protein contains a common
domain you'll get significant matches to
every homologous sequence in the
database. BLAST only reports back a
limited number of matches, ordered by P
value.
• If this list consists only of matches to the
same domain, cut this bit out of your query
sequence and do the BLAST search again
with the edited sequence (e.g. NHR).
90. Tips
• Do controls wherever possible. In
particular when you use a particular
search software for the first time.
• Suitable positive controls would be protein
sequences known to have distant
homologues in the databases to check
how good the software is at detecting such
matches.
• Negative controls can be employed to
make sure the compositional bias of the
sequence isn't giving you false positives.
Shuffle your query sequence and see what
difference this makes to the matches that
are returned. A real match should be lost
upon shuffling of your sequence.
91. Tips
• Perform Controls
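#!/usr/bin/perl -w
# Shuffle the residues of a FASTA sequence read from STDIN or a file
use strict;
my ($def, @seq) = <>;          # first line is the FASTA definition line
print $def;
chomp @seq;
@seq = split(//, join("", @seq));
my $count = 0;
while (@seq) {
    my $index = int(rand(@seq));            # pick a random remaining residue
    my $base = splice(@seq, $index, 1);     # remove it from the pool
    print $base;
    print "\n" if ++$count % 60 == 0;       # wrap output at 60 columns
}
print "\n" unless $count % 60 == 0;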
92. Tips
• Read the footer first
• View results graphically
• Parse Blasts with Bioperl
93. FastA vs. Blast
• BLAST's major advantage is its speed.
– 2-3 minutes for BLAST versus several hours
for a sensitive FastA search of the whole of
GenBank.
• When both programs use their default
setting, BLAST is usually more sensitive
than FastA for detecting protein sequence
similarity.
– Since it doesn't require a perfect sequence
match in the first stage of the search.
94. FastA vs. Blast
Weakness of BLAST:
– The long word size it uses in the initial stage of DNA
sequence similarity searches was chosen for speed, and not
sensitivity.
– For a thorough DNA similarity search, FastA is the
program of choice, especially when run with a lowered
KTup value.
– FastA is also better suited to the specialised task of
detecting genomic DNA regions using a cDNA query
sequence, because it allows the use of a gap extension
penalty of 0. The original BLAST, which only creates ungapped
alignments, will usually detect only the longest exon, or fail
altogether.
• In general, a BLAST search using the default
parameters should be the first step in a database
similarity search strategy. In many cases, this is all
that may be required to yield all the information
needed, in a very short time.
96. PSI-Blast
1. Old (ungapped) BLAST
2. New BLAST (allows gaps)
3. Profile -> PSI Blast - Position Specific
Iterated
Strategy: Multiple alignment of the hits
Calculates a position-specific score matrix
Searches with this matrix
In many cases is much more sensitive to weak but
biologically relevant sequence similarities
PSSM !!!
97. PSI-Blast
• Patterns of conservation from the alignment of
related sequences can aid the recognition of
distant similarities.
– These patterns have been variously called motifs,
profiles, position-specific score matrices, and
Hidden Markov Models.
For each position in the derived pattern, every
amino acid is assigned a score.
(1) Highly conserved residue at a position: that
residue is assigned a high positive score, and
others are assigned high negative scores.
(2) Weakly conserved positions: all residues receive
scores near zero.
(3) Position-specific scores can also be assigned to
potential insertions and deletions.
98. Pattern
• a set of alternative
sequences, using
“regular expressions”
• Prosite
(http://www.expasy.org/prosite/)
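A PROSITE-style pattern maps almost directly onto a Perl regular expression. The converter below is a simplified sketch that handles only the common syntax elements (x, [..], {..}, repeat counts, and the < and > anchors); the sub name is illustrative. The example pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern, and the test peptides are made up.

```perl
use strict;
use warnings;

# Convert a PROSITE-style pattern into a compiled Perl regex.
sub prosite_to_regex {
    my ($pat) = @_;
    $pat =~ s/\.$//;                          # optional trailing period
    $pat =~ s/-//g;                           # element separators
    $pat =~ tr/x/./;                          # x = any residue
    $pat =~ s/\{([A-Z]+)\}/[^$1]/g;           # {P} = any residue except P
    $pat =~ s/\((\d+(?:,\d+)?)\)/{$1}/g;      # x(2) or x(2,4) = repeat counts
    $pat =~ s/^</^/;                          # < anchors at the N-terminus
    $pat =~ s/>$/\$/;                         # > anchors at the C-terminus
    return qr/$pat/;
}

my $re = prosite_to_regex("N-{P}-[ST]-{P}");
for my $peptide ("AANGSA", "AANPSA") {        # hypothetical peptides
    printf "%s: %s\n", $peptide, ($peptide =~ $re ? "match" : "no match");
}
```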
102. PSI-Blast
• The power of profile methods can be
further enhanced through iteration of
the search procedure.
– After a profile is run against a database,
new similar sequences can be detected. A
new multiple alignment, which includes
these sequences, can be constructed, a
new profile abstracted, and a new
database search performed.
– The procedure can be iterated as often as
desired or until convergence, when no new
statistically significant sequences are
detected.
103. PSI-Blast
(1) PSI-BLAST takes as an input a single protein sequence
and compares it to a protein database, using the gapped
BLAST program.
(2) The program constructs a multiple alignment, and then a
profile, from any significant local alignments found.
The original query sequence serves as a template for the multiple
alignment and profile, whose lengths are identical to that of the
query. Different numbers of sequences can be aligned in different
template positions.
(3) The profile is compared to the protein database, again
seeking local alignments using the BLAST algorithm.
(4) PSI-BLAST estimates the statistical significance of the local
alignments found.
Because profile substitution scores are constructed to a fixed
scale, and gap scores remain independent of position, the
statistical theory and parameters for gapped BLAST alignments
remain applicable to profile alignments.
(5) Finally, PSI-BLAST iterates, by returning to step (2), a
specified number of times or until convergence.
109. PSI-BLAST pitfalls
• Avoid too close sequences: overfit!
• Can include false homologues! Therefore check
the matches carefully: include or exclude
sequences based on biological knowledge.
• The E-value reflects the significance of the
match to the previous training set not to the
original sequence!
• Choose your query sequence carefully.
• Try reverse experiment to certify.
110. Reduce overfitting risk by Cobbler
• A single sequence is selected
from a set of blocks and enriched
by replacing the conserved
regions delineated by the blocks
by consensus residues derived
from the blocks.
• Embedding consensus residues
improves performance
• S. Henikoff and J.G. Henikoff;
Protein Science (1997) 6:698-705.
122. BLAT method
• Align sequence with BLAT, get alignment
info
• Per BLAT hit, pick up additional info from
connected databases:
– mRNAs
– ESTs
– RepeatMasker
– CpG Islands
– RefSeq Genes
124. Weblems
W5.1: Submit the amino acid sequence of papaya
papain to a BLAST (gapped and ungapped) and to a
PSI-BLAST search. What are the main differences in
results?
W5.2: Is there a relationship between Klebsiella
aerogenes urease, Pseudomonas diminuta
phosphotriesterase and mouse adenosine deaminase
? Also use DALI, ClustalW and T-coffee.
W5.3: Yeast two-hybrid typically yields DNA
sequences. How would you find the corresponding
protein ?
W5.4: When and why would you use tblastn ?
W5.5: How would you search a database if you want to
restrict the search space to those entries having a
secretion signal consisting of 4 consecutive (N-terminal) basic residues ?