Seq db searching

Computational Biology, Part 6
Sequence Database Searching

PUSHPENDRA TRIPATHI

Sequence Analysis Tasks

⇒ Given a query sequence, search for similar
sequences in a database

Global or Local?

Both local and global alignment methods may be
applied to database scanning, but local alignment
methods are more useful since they do not make
the assumption that the query protein and database
sequence are of similar length.

Efficient database searching
methods
Dynamic programming requires order N²L
computations (where N is size of the query
sequence and L is the size of the database)
Given size of databases, more efficient
methods needed

“Hit and extend” sequence
searching
Problem: Too many calculations “wasted”
by comparing regions that have nothing in
common
Initial insight: Regions that are similar
between two sequences are likely to share
short stretches that are identical
Basic method: Look for similar regions only
near short stretches that match exactly

“Hit and extend” sequence
searching
We define a word (or k-tuple) size that is
the minimum number of exact “letter”
matches that must occur before we do any
further comparison or alignment
How do we find all of the occurrences of
matching words between a sequence and a
database?
Could scan sequence a word at a time, but this
is order L (size of database)

Word searching - hashing
Solution: Use a precomputed table that lists
where in the database each possible word
occurs
Generation of the table is of order L (size of
database) but use of the table is of order N (size
of query sequence)
The computer science term for this
approach is hashing

Hashing
Hash table of size 10
Hash function H(x) = x mod 10
Applet:
http://www.engin.umd.umich.edu/CIS/course.des/cis
Insertion & Search

Demonstration: Hashing algorithm for sequence searching
Author: R.F. Murphy, Feb. 6, 1995 (revised Feb. 15, 1996)
This demonstration takes a piece of database sequence, calculates hash values for each
k-tuple, builds a hash table (listing the positions in the database of the occurrence of each
hash value), and uses a simplified version of the hash table to find the positions in the
database sequence of the first occurrence of each k-tuple in a query sequence.
Hashing (Demonstration A10)

This section converts each base to a number from 0 to 3 and combines those
numbers three at a time to form an integer from 0 to 63 that is unique for
each three-base sequence. Each three-base sequence is called a "k-tuple."

Database sequence:

 i   seq(i)    seq(i)   hash
     as char   as int   value
 1   a         0         6
 2   c         1        27
 3   g         2        47
 4   t         3        63
 5   t         3        63
 6   t         3        60
 7   t         3        48
 8   a         0         0
 9   a         0         0
10   a         0         1
11   a         0         6
12   c         1        24
13   g         2        33
14   a         0         4
15   c         1        17
16   a         0         5
17   c         1
18   c         1

Hash table for the database sequence:

hash                       positions      first
value   pos1  pos2  pos3   in database    hit
0       a     a     a      8, 9           8
1       a     a     c      10             10
2       a     a     g                     not found
3       a     a     t                     not found
4       a     c     a      14             14
5       a     c     c      16             16
6       a     c     g      1, 11          1
7       a     c     t                     not found
8       a     g     a                     not found
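
The demonstration can be reproduced with a short Python sketch. It is not the original demonstration's code: the function names and the query string are hypothetical, and the 18-base database sequence is the one read off the table (a c g t t t t a a a a c g a c a c c).

```python
from collections import defaultdict

BASE_TO_INT = {"a": 0, "c": 1, "g": 2, "t": 3}

def ktuple_hash(word):
    """Read a k-tuple as a base-4 number (for k=3 the range is 0 to 63)."""
    value = 0
    for base in word:
        value = value * 4 + BASE_TO_INT[base]
    return value

def build_hash_table(database, k=3):
    """Map each hash value to the 1-based positions where that k-tuple starts."""
    table = defaultdict(list)
    for i in range(len(database) - k + 1):
        table[ktuple_hash(database[i:i + k])].append(i + 1)
    return table

def first_hits(query, table, k=3):
    """For each k-tuple in the query, report its first database position (or None)."""
    hits = []
    for i in range(len(query) - k + 1):
        word = query[i:i + k]
        positions = table.get(ktuple_hash(word))
        hits.append((word, positions[0] if positions else None))
    return hits

database = "acgttttaaaacgacacc"    # sequence from the demonstration table
table = build_hash_table(database)
print(table.get(6))                # positions of "acg" in the database
print(first_hits("aacgt", table))  # "aacgt" is a made-up query
```

Building the table touches every database position once (order L), while each query k-tuple is then resolved with a single lookup (order N overall).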

FASTA
Heavily used for searching databases until
advent of BLAST (see below)
Inputs
k (word or k-tuple) size
similarity matrix
Compares query sequence pairwise with
each sequence in the database

FASTA method
The initial step in the algorithm is to
identify all exact matches of length k
(k-tuples) or greater between the two
sequences.

FASTA method
1. Find diagonals (paired pieces from each
sequence without gaps) that have the
highest density of common words
2. Rescore these using a scoring (similarity)
matrix and trim ends that do not contribute
to the highest score
Result: partial alignments without gaps
Reported as the “init1” score

FASTA method
3. Join regions together, including penalties
for gaps
Result: unoptimized alignment with gaps
Reported as the “initn” score
4. Use dynamic programming in a band 32
residues wide around the best “initn” score
Result: optimized alignment with gaps
Reported as the “opt” score
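
The banded dynamic programming of step 4 can be sketched as follows. This is a minimal illustration, not FASTA's code: it uses simple match, mismatch, and linear gap scores (all assumed values) instead of a similarity matrix, and the function name and parameters are hypothetical. Cells outside the band are treated as score 0, which is valid for local alignment.

```python
def banded_local_score(a, b, diagonal=0, band=32, match=1, mismatch=-1, gap=-2):
    """Best local alignment score using only cells within `band` of one diagonal.

    `diagonal` is i - j for the band centre (1-based positions i in a, j in b).
    """
    half = band // 2
    prev = {}          # scores of the previous row, keyed by column j
    best = 0
    for i in range(1, len(a) + 1):
        cur = {}
        centre = i - diagonal
        lo = max(1, centre - half)
        hi = min(len(b), centre + half)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score = max(0,
                        prev.get(j - 1, 0) + s,    # align a[i] with b[j]
                        prev.get(j, 0) + gap,      # gap in b
                        cur.get(j - 1, 0) + gap)   # gap in a
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best

print(banded_local_score("TTTTACGT", "ACGT", diagonal=4, band=2))
```

Restricting the calculation to a band makes the work proportional to the sequence length times the band width rather than to the product of the two sequence lengths.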

Comments on FASTA
A larger k-tuple size increases speed, since fewer
“hits” are found, but it also decreases
sensitivity for finding similar but not
identical sequences, since exact matches of
this length are required

Limitations of FASTA
FASTA can miss significant similarity since
For proteins, similar sequences do not have to
share identical residues
Asp-Lys-Val is quite similar to Glu-Arg-Ile yet
it is missed even with k-tuple size of 1 since no
amino acid matches
Gly-Asp-Gly-Lys-Gly is quite similar to
Gly-Glu-Gly-Arg-Gly but there is a match
only with a k-tuple size of 1

Limitations of FASTA
FASTA can miss significant similarity since
For nucleic acids, due to codon “wobble”, DNA
sequences may look like XXyXXyXXy where
X’s are conserved and y’s are not
GGuUCuACgAAg and GGcUCcACaAAa
both code for the same peptide sequence
(Gly-Ser-Thr-Lys) but they don’t match with
a k-tuple size of 3 or higher
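
The wobble example can be checked directly. The sketch below is illustrative only: the helper names are hypothetical, and the codon dictionary contains just the eight codons needed here, not the full genetic code.

```python
def shared_ktuples(a, b, k):
    """Return the set of k-tuples that occur in both sequences."""
    words_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    words_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return words_a & words_b

# Only the codons used in this example; not a complete codon table.
CODONS = {"GGU": "Gly", "GGC": "Gly", "UCU": "Ser", "UCC": "Ser",
          "ACG": "Thr", "ACA": "Thr", "AAG": "Lys", "AAA": "Lys"}

def translate(rna):
    """Translate an RNA string codon by codon using the partial table above."""
    return [CODONS[rna[i:i + 3]] for i in range(0, len(rna), 3)]

seq1 = "GGUUCUACGAAG"
seq2 = "GGCUCCACAAAA"
print(translate(seq1) == translate(seq2))     # same peptide?
print(shared_ktuples(seq1, seq2, 3))          # shared words with k = 3
print(sorted(shared_ktuples(seq1, seq2, 2)))  # shared words with k = 2
```

With k = 3 the two sequences share no words at all, so a hit-and-extend search with that word size never starts an extension, even though the encoded peptides are identical.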

BLAST (Basic Local Alignment
Search Tool)
Goal: find sequences from database similar
to query sequence
Previous tools use either
direct, theoretically sound but computationally
slow approach to examine all possible
alignments of query with database (dynamic
programming)
indirect, heuristic but computationally fast
approach to find similar sequences by first
finding identical stretches (FASTP, FASTA)

BLAST (Basic Local Alignment
Search Tool)
BLAST combines the best of both by using a
theoretically sound method that searches
for similar sequences directly yet is
computationally fast
Reference
S. F. Altschul, W. Gish, W. Miller, E. W.
Myers and D. J. Lipman. Basic Local
Alignment Search Tool. J. Mol. Biol.
215:403-410 (1990)

BLAST basics
Need similarity measure, as in dynamic
programming - use PAM-120 for proteins
Define maximal segment pair (MSP) to be
the highest scoring pair of identical length
segments chosen from 2 sequences (in
FASTA terms, highest init1 diagonal)

BLAST basics
Define a segment pair to be locally maximal
if its score cannot be improved either by
extending or by shortening both segments

BLAST basics
Approach: find segment pairs by first
finding word pairs that score above a
threshold, i.e., find word pairs of fixed
length w with a score of at least T
Key concept: Seems similar to FASTA, but
we are searching for words that score at
least T rather than words that match exactly

BLAST method for proteins
1. Compile a list of words which give a score
of at least T when paired with the query
sequence.
Example using PAM-120 for query sequence
ACDE (w=4, T=17):
A C D E
ACDE = +3 +9 +5 +5 = 22
try all possibilities:
AAAA = +3 -3 0 0 = 0 no good
AAAC = +3 -3 0 -7 = -7 no good
...too slow, try directed change

Generating word list
A C D E
ACDE = +3 +9 +5 +5 = 22
change 1st pos. to all acceptable substitutions
gCDE = 1 9 5 5 = 20 ok (=pCDE, sCDE, tCDE)
nCDE = 0 9 5 5 = 19 ok (=dCDE, eCDE, vCDE)
iCDE = -1 9 5 5 = 18 ok (=qCDE)
kCDE = -2 9 5 5 = 17 ok (=mCDE)
change 2nd pos.: can't - all alternatives negative and
the other three positions only add up to 13
change 3rd pos. in combination with first position
gCnE = 1 9 2 5 = 17 ok
continue - use recursion
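
The recursive, directed generation of the word list can be sketched in Python. Important assumptions: the score table below holds only the values quoted in the worked example and is NOT the full PAM-120 matrix, every pair not listed is given a large negative placeholder score, and all names are hypothetical. With the complete matrix the list would contain additional words.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Partial scores copied from the worked example; not the full PAM-120 matrix.
SLIDE_SCORES = {
    "A": {"A": 3, "G": 1, "P": 1, "S": 1, "T": 1, "N": 0, "D": 0, "E": 0,
          "V": 0, "I": -1, "Q": -1, "K": -2, "M": -2},
    "C": {"C": 9},
    "D": {"D": 5, "N": 2},
    "E": {"E": 5},
}
MISSING = -99  # assumption: unlisted pairs are treated as very unfavorable

def score(q, s):
    return SLIDE_SCORES.get(q, {}).get(s, MISSING)

def neighborhood(word, threshold):
    """Recursively list (word, score) pairs scoring at least `threshold`."""
    # best_rest[i] = best achievable score from position i to the end
    best_rest = [0] * (len(word) + 1)
    for i in range(len(word) - 1, -1, -1):
        best_rest[i] = best_rest[i + 1] + max(score(word[i], a) for a in AMINO_ACIDS)
    results = []

    def extend(prefix, total):
        i = len(prefix)
        if i == len(word):
            results.append((prefix, total))
            return
        for a in AMINO_ACIDS:
            new_total = total + score(word[i], a)
            # Prune: skip if even the best remaining scores cannot reach the threshold.
            if new_total + best_rest[i + 1] >= threshold:
                extend(prefix + a, new_total)

    extend("", 0)
    return results

words = dict(neighborhood("ACDE", 17))
print(len(words), words["ACDE"], words["GCNE"])
```

The pruning test is what makes the change "directed": a substitution is abandoned as soon as the remaining positions cannot make up the deficit, which is the reasoning used above for the second position.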

Generating word list
For "best" values of w and T there are
typically about 50 words in the list for every
residue in the query sequence

2. Scan the database for hits with the
compiled list of words. Two approaches:
Use index of all possible words (for w=4, need
array of size 20^4 = 160,000). Can compress this
index using pointers to save space.
Use finite state machine (actually used)
Calculate a state transition table that tells what state
to go to based on the next character in the sequence
3a. Extend hits to form HSPs (high-scoring
segment pairs)

3b. BLAST2 or gapped BLAST uses an
approach similar to FASTA to combine hits
before trying to extend them as in 3a.
4. Compare the score for each HSP to a
threshold S to decide whether to keep it
5. Proceed to estimating statistical
significance (see below)

BLAST Method for DNA
1. Make list of all contiguous w-mers in the
query sequence (often w=12)
2. Compress database by packing 4
nucleotides into a single byte (use auxiliary
table to tell you where sequences start and
stop within the compressed database) --
doesn't allow for unspecified bases
(wildcards)

3. Compress the w-mers from the query sequence
the same way.
4. Search the compressed database for matches
with the compressed w-mers
Since all frames of the query sequence are considered
separately, any match of length w>=11 must contain a
match of length 8 that lies on a byte boundary of one of
the w-mers from the query sequence. Thus can scan a
(packed) byte at a time, improving speed 4-fold over
comparing one nucleotide at a time.
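
The packing and the byte-boundary argument can be illustrated with a minimal sketch. The 2-bit encoding order (a=0, c=1, g=2, t=3) and the function names are assumptions for illustration, not BLAST's actual layout.

```python
CODE = {"a": 0, "c": 1, "g": 2, "t": 3}

def pack(dna):
    """Pack 4 nucleotides per byte, 2 bits each (whole bytes only in this sketch)."""
    if len(dna) % 4:
        raise ValueError("this sketch only packs whole bytes")
    out = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | CODE[base]   # KeyError for wildcards such as 'n'
        out.append(byte)
    return bytes(out)

def byte_aligned_probes(word):
    """For each of the 4 frames, pack the longest whole-byte run of the word."""
    probes = {}
    for frame in range(4):
        usable = (len(word) - frame) // 4 * 4
        probes[frame] = pack(word[frame:frame + usable])
    return probes

print(pack("acgt").hex())
print({f: p.hex() for f, p in byte_aligned_probes("acgtacgtacg").items()})
```

For an 11-base word, every one of the four frames still yields two full bytes (8 nucleotides), which is why the scan can proceed a packed byte at a time. The `KeyError` on unknown letters mirrors the limitation noted above: two bits leave no room for wildcards.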

Problem: if query sequence has a stretch of
unusual base composition (e.g., A-T rich)
or a repeated sequence element (e.g., Alu
sequence) there will be many hits with
"uninteresting" regions.

Solution:
During compression of the database, tabulate
frequencies of all 8-tuples.
Make a list of those occurring very frequently (much
more frequently than expected by chance).
Remove these words from the query list of w-mers
before searching database.
Remove words matching a sublibrary of repeated
sequences (but report the matches to that sublibrary
when done).
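
A rough sketch of the frequency filter follows. Two assumptions are made that the text does not spell out: the expected count of a k-tuple is computed for uniform base composition, and a query w-mer is dropped if it contains any over-represented k-tuple. The function names and the `factor` cutoff are hypothetical.

```python
from collections import Counter

def frequent_words(database, k=8, factor=10.0):
    """Return k-tuples seen more than `factor` times the uniform expectation."""
    counts = Counter(database[i:i + k] for i in range(len(database) - k + 1))
    total = sum(counts.values())
    expected = total / 4 ** k      # assumes all k-tuples are equally likely
    return {w for w, c in counts.items() if c > factor * expected}

def filter_query_words(query, w, overrepresented, k=8):
    """Keep only query w-mers that contain no over-represented k-tuple."""
    words = [query[i:i + w] for i in range(len(query) - w + 1)]
    return [x for x in words
            if not any(x[j:j + k] in overrepresented
                       for j in range(len(x) - k + 1))]

# Tiny made-up example with k=2 so the effect is visible by eye.
over = frequent_words("atatatatatatgc", k=2, factor=5.0)
print(sorted(over))
print(filter_query_words("gcatgc", 3, over, k=2))
```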

BLAST Statistical significance
A key to the utility of BLAST is the ability
to calculate expected probabilities of
occurrence of maximal segment pairs
(MSPs) given w and T
This allows BLAST to rank matching
sequences in order of “significance” and to
cut off listings at a user-specified
probability

From Karlin-Altschul formulation, the
expected value (mean) of the HSPs between
a query and a set of random sequences is
$u \approx \frac{\ln(Kmn)}{\lambda}$

BLAST uses a correction to this formulation
that takes into account the effective
sequence lengths of the query and the
database sequences
$u = \frac{\ln(K m' n')}{\lambda}$

The corrected lengths are given by
$m' = m - \frac{\ln(Kmn)}{H}$
$n' = n - \frac{\ln(Kmn)}{H}$
with
$H = \frac{\ln(Kmn)}{l}$
where l is the average length of the alignment that
can be achieved between random sequences of
length m and n

Given u, we can calculate the probability p of
observing a score S between a query sequence and
a given database sequence that is equal to or
greater than x
$p(S \ge x) = 1 - \exp\left(-e^{-\lambda (x - u)}\right)$

Lastly, we have to consider that we are searching
many database sequences and can expect even a
relatively rare score to occur with high chance
given enough comparisons
For a database of D sequences, this is

$E \approx 1 - e^{-p(S \ge x)\, D}$
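
The three formulas can be chained numerically. The sketch below uses the uncorrected lengths m and n, and the values of K, λ, m, n, and D are hypothetical numbers chosen only for illustration. They are not BLAST defaults.

```python
import math

def expected_score(K, lam, m, n):
    """u = ln(K m n) / lambda"""
    return math.log(K * m * n) / lam

def p_value(x, u, lam):
    """p(S >= x) = 1 - exp(-exp(-lambda (x - u)))"""
    return 1.0 - math.exp(-math.exp(-lam * (x - u)))

def database_expectation(p, D):
    """E ~ 1 - exp(-p D) for a database of D sequences"""
    return 1.0 - math.exp(-p * D)

# Hypothetical parameter values, for illustration only.
K, lam = 0.1, 0.3
m, n, D = 250, 300, 100_000
u = expected_score(K, lam, m, n)
for x in (30, 50, 70):
    p = p_value(x, u, lam)
    print(x, round(u, 2), p, database_expectation(p, D))
```

The loop shows how quickly p falls as the score x moves above u, and how a large D pushes E back up for moderate scores.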

Summary of Database Search
Methods
Authors (Program)            Description
Needleman & Wunsch           full alignment
Wilbur & Lipman              match k-tuple - form diag - NW
Lipman & Pearson (FASTP)     k-tuple - diag - rescore
Pearson & Lipman (FASTA)     FASTP - join diags - NW
Altschul et al. (BLAST)      word match list - statistics

Reading for next class
Paper by Grundy and Bailey
