Detection of genetic motifs


Promoters – Biology
Information theory
Random Projections
Composed motif detection

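DNA sequence
[Figure: a DNA sequence in which genes are separated by junk DNA. Each gene runs from UTR-5' to UTR-3' and consists of exons (e1 to e5) separated by introns.]

[Figure: promoter modules in front of a gene, divided into distal promoter, proximal promoter and core promoter, with the TFBS, TATA box, INR, TSS and DSE marked.]
INR = Initiator Region
DSE = DownStream region
TSS = Transcription Start Site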

TFBS: Transcription Factor Binding Site
 Short strings (12 to 20 nucleotides long), spread over up to 5 kb before the TSS
 The structure of the string selects the protein that will bind, on the basis of Van der Waals interactions
[Figure: example of a Transcription Factor Binding Site (TFBS), the string ACCGATTATCA, with the protein that is going to bind and the Van der Waals interactions between them.]

Assembly of the promoter protein complex of transcription
[Figure, 1st stage: transcription factors (TF); TFIID with TBP at the TATA box on the DNA; the transcription factor binding sites (TFBS), INR and TSS are marked.]
[Figure, 2nd stage: TFIID and TBP on the core promoter (TATA, INR, TSS).]
[Figure, DNA looping: the distal promoter/enhancer and the proximal promoter, carrying TF1 to TF5, loop onto the core promoter (TATA, INR, TSS), where TBP, TFIIA, TFIIB, TFIID, TFIIE, TFIIH and RNA Poly II assemble.]

Information based motif
detection

The set of all TFBS (for a certain class of genes, organism or other)
[Figure: the set of all TFBS, split into known and unknown sites. TFBS's with the same colour are correlated.]

Example
[Figure: two proteins of the promoter complex bound to the string A T G C T C A T C C T G.]

Entropy
 Given a probability distribution, we want a function
representing the quantity of information stored in the
distribution.
 We define the entropy (H) as:

\[ H = -\sum_{i} p(i) \log p(i) \]

or

\[ H = -\int p(x) \log p(x) \, dx \]

 For the sake of simplicity, we will use from now on
the discrete definition.

Observed entropy
 The real distribution is usually unknown, but
we can replace it by the observed distribution
f(x). The resulting entropy is:

\[ H(x) = -\sum_{x} f(x) \log f(x) \]

 For a multi-dimensional probability distribution it is:

\[ H(x, y) = -\sum_{x, y} f(x, y) \log f(x, y) = -\sum_{x} f(x) \sum_{y} f(y \mid x) \log f(x, y) \]

Mutual Information

\[ I(x, y) = \sum_{x, y} f(x, y) \log \frac{f(x, y)}{f(x) f(y)} \]

\[ = \sum_{x, y} f(x, y) \left[ \log f(x, y) - \log f(x) - \log f(y) \right] \]

\[ = H(x) + H(y) - H(x, y) \]
 X and Y are strings of equal length, S={A, C, G, T}, x
and y belong to S
 f(x,y) is the relative joint frequency of x,y in X and Y
 f(x) is the relative frequency of x in X
 f(y) is the relative frequency of y in Y

Information divergence
 Given two distributions P and Q (not for exam):

\[ D(P, Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = \sum_{x} p(x) \log p(x) - \sum_{x} p(x) \log q(x) \]

Example of calculation
X = A C A T T T A C C A T A G A C A A C T A
Y = A C T T T T A C G A T G G A A A C C T G

Joint counts f(x,y), with rows indexed by the letter x in X and columns by the letter y in Y:

x \ y    A   C   G   T   f(x)
A        5   1   2   1    9
C        1   3   1   0    5
G        0   0   1   0    1
T        0   0   0   5    5
f(y)     6   4   4   6   20

Divide by 20 to obtain relative frequencies.
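The calculation above can be reproduced with a short Python sketch. It counts the letters and letter pairs of X and Y and evaluates I(X,Y) = H(X) + H(Y) - H(X,Y). The function names and the choice of base-2 logarithms are assumptions of this sketch; the slides do not fix the base of the logarithm.

```python
from collections import Counter
from math import log2

def entropy(counts, total):
    """Entropy in bits of the relative frequencies counts/total."""
    return -sum(c / total * log2(c / total) for c in counts.values() if c)

def mutual_information(X, Y):
    """I(X,Y) = H(X) + H(Y) - H(X,Y), from observed letter frequencies."""
    if len(X) != len(Y):
        raise ValueError("X and Y must have equal length")
    n = len(X)
    fx, fy, fxy = Counter(X), Counter(Y), Counter(zip(X, Y))
    return entropy(fx, n) + entropy(fy, n) - entropy(fxy, n)

X = "ACATTTACCATAGACAACTA"
Y = "ACTTTTACGATGGAAACCTG"
joint = Counter(zip(X, Y))
print(joint[("A", "A")], joint[("T", "T")])  # joint counts from the table: 5 5
print(round(mutual_information(X, Y), 3))    # mutual information in bits
```

Comparing a string with itself gives I(X,X) = H(X), the largest value the string can reach.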

Algorithm for finding new
TFBS
1) select a true TFBS (for example ACATTTACCATAGACAACT) from a data bank such as IUPAC or TRANSFAC, and use it as a probe;

2) shift the probe over a non-coding zone;

3) evaluate step by step the mutual information I(P,S), where P is the probe and S is the string of the same length currently aligned with it on the sequence;

4) select the positions (and the corresponding aligned strings) for which
I(P,S) > threshold

5) the strings starting from these positions are candidate TFBS, which need to be validated in vitro.
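Steps 2 to 4 can be sketched as a sliding-window scan. This is a minimal illustration, not the authors' implementation: the function names, the base-2 logarithm, the toy sequence and the threshold value are all assumptions.

```python
from collections import Counter
from math import log2

def entropy(counts, total):
    """Entropy in bits of the relative frequencies counts/total."""
    return -sum(c / total * log2(c / total) for c in counts.values() if c)

def mutual_information(P, S):
    """I(P,S) = H(P) + H(S) - H(P,S) for two strings of equal length."""
    n = len(P)
    return (entropy(Counter(P), n) + entropy(Counter(S), n)
            - entropy(Counter(zip(P, S)), n))

def scan(probe, sequence, threshold):
    """Return (position, window, I) for every window with I(P,S) > threshold."""
    w = len(probe)
    hits = []
    for pos in range(len(sequence) - w + 1):
        window = sequence[pos:pos + w]
        score = mutual_information(probe, window)
        if score > threshold:
            hits.append((pos, window, score))
    return hits

probe = "ACATTTACCATAGACAACT"
sequence = "GGCT" + probe + "TTAGC"   # toy non-coding zone (hypothetical)
hits = scan(probe, sequence, threshold=1.5)
for pos, window, score in hits:
    print(pos, window, round(score, 3))
```

On short windows the observed mutual information of unrelated strings is biased upwards, which is one reason the threshold needs careful calibration.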

Example
the same string CACTGTGCGACTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT
TTCGGAACCGGCCTTAAGACGGTGAAGGCGCTACTCATTTAATTGTGTTC
1 error CACTGTGCGTCTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT
TACTATATAATTGATCGTGTTTTGGCCCGCTACTCATGAAGAGCCGTTCG
2 errors CACTGTGCGTCTGTCATTCGTCATCCACCGTTGTTAGCACAGGGGTCGAT
TAAGGGTATCCAAGTCTGAATACCCCCTGTATTACACTCTCGCTGTCAGT
5 errors CACTGTGCGTCTGTCATTCGTCATCCACCATTGTTAGCATAGGGGTCGAC
CATTATCGAGGACAGTGATTTGTGGAATGCTTGGCCTTAATACGTCTCTA
C<--> G GAGTCTCGCAGTCTGATTGATGATGGAGGCTTCTTACGAGACCCCTGCAT
TCAAAGTCAATTTACAGATTGGCGCCTCATGTAATAACGTTGGCATACTA
C <-- G GAGTGTGGGAGTGTGATTGATGATGGAGGGTTGTTAGGAGAGGGGTGGAT
CTTAAGATAACGGACACTTGATTGAGATACGCTCGACGCTATGTCCGGCT
some C<-> G CAGTGTCCGACTGTCATTGATCATCCACGCTTGTTACCAGACGCGTCGAT
ACTCGACATAAGGTTACAGCATGTGGAGTAATGCGGTCGCTAACTACGGG
complementary GTGACACGCTGACAGTAAGTAGTAGGTGGCAACAATCGTGTCCCCAGCTA
GCGTGGCGAGCTTAATCCCTGCTGCTCTGAGCAAGGAGGGCGTGTAGAAA
compl+1error GTGACACGCGGACAGTAAGTAGTAGGTGGCAACAATCGTGTCCCCAGCTA
CAAGGTGACAGAGTATTGAGTGAATCTACAATGTTCGCAGTGCTTTGTCG
compl+2errors GTGACACGCTGACAGTAAGAAGTAGGTGGCAACAATCGTGTCCCCAGCTA
GCGGTCGCCAATCGTCAAGGAAATGATAGGTCTGATTGGCGTGGCTTAAG
compl+5errors GTGACACGCTGACAGTAAGAAGTAGGTGGAAACAATCGTCTCCCCAGCTG
GGCGCTAACGAATACTTCAAGGCCCGAAGGATTGGTGTTGATACTAGCCG
1 letter more CACTGTGCGACTGTCATTCATCATCACACCGTTGTTAGCACAGGGGTCGAT
CGTGACCAGATGTCCTTACTCTGAATGTTATGGTATTAAGTGAGGTAGTG
2 letters more CACTGTGCGACTGTCATTCATCATCCACACCGTTGTTAGCACAGGGGTCGAT
GCCCATGAACATACATTCATGACTGTTCAAGCGCACTGGACCACTCGTTC
3 letters more CACTGTGCGACTGTCATTCATCATCCATCACCGTTGTTAGCACAGGGGTCGAT
probe CACTGTGCGACTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT

Detected values for I(P,S)
[Figure: plot of I(P,S), on a vertical axis with ticks at 0.2, 0.4, 0.6, 0.8 and 1, for each variant of the probe: the same string; C and G exchanged; complementary; 4 C become G and 5 G become C; 1 error; complementary+1error; 2 errors; complementary+2errors; C becomes G; 5 errors; complementary+5errors; 1 letter more; 2 letters more; 3 letters more.]

Conclusions:
 Use Mutual information as a tool to capture strings
that are correlated to a true TFBS used as a probe.
 validate in vitro the candidates so obtained
 This is more flexible than the use of Hamming or Levenshtein distance, since correlated strings can be very distant from one another
Drawbacks:
1. The method needs a precise calibration of the threshold
2. It does not handle gaps

Random Projection Approach to
Motif Finding

daf-19 Binding Sites in C. elegans
GTTGTCATGGTGAC
GTTTCCATGGAAAC
GCTACCATGGCAAC
GTTACCATAGTAAC
GTTTCCATGGTAAC
che-2
daf-19
osm-1
osm-6
F02D8.3

-150 -1

The (l,d) Planted Motif Problem
 Generate a random length l consensus
sequence C.
 Generate 20 instances, each differing from C
by d random mutations.
 Plant one at a random position in each of
N=20 random sequences of length n=600.
 Can you find the planted instances?

Planted Motifs
AGTTATCGCGGCACAGGCTCCTTCTTTATAGCC
ATGATAGCATCAACCTAACCCTAGATATGGGAT
TTTTGGGATATATCGCCCCTACACTGGATGACT
GGATATACATGAACACGGTGGGAAAACCCTGAC
 Each instance differs from ACAGGATCA by 2
mutations
 Remaining sequence random

Random Projection Algorithm
 Buhler and Tompa (2001)
 Guiding principle: Some instances of a motif
agree on a subset of positions.
 Use information from multiple motif instances
to construct model.

x(1) ...ccATCCGACca...
x(2) ...ttATGAGGCtc...
ATGCGTC =M
x(5) ...ctATAAGTCgc...
x(8) ...tcATGTGACac... (7,2) motif

k-Projections
 Choose k positions in string of length l.
 Concatenate nucleotides at chosen k
positions to form k-tuple.
 In l-dimensional Hamming space, projection
onto k dimensional subspace.

l = 15 k=7
P
ATGGCATTCAGATTC TGCTGAT

P = (2, 4, 5, 7, 11, 12, 13)
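The projection in the example can be written as a one-line helper. A minimal sketch: the function names are hypothetical, positions are 1-based as on the slide, and `random.sample` is used here as one way to pick k positions uniformly at random.

```python
import random

def project(lmer, positions):
    """Concatenate the letters of lmer at the given 1-based positions."""
    return "".join(lmer[p - 1] for p in positions)

def random_projection(l, k, rng=random):
    """Choose k of the l positions uniformly at random (sorted, 1-based)."""
    return tuple(sorted(rng.sample(range(1, l + 1), k)))

P = (2, 4, 5, 7, 11, 12, 13)
print(project("ATGGCATTCAGATTC", P))  # TGCTGAT, as in the example above
print(random_projection(15, 7))       # a different projection on each run
```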

Random Projection Algorithm
 Choose a projection by selecting k positions uniformly at random.
 For each l-tuple in the input sequences, hash it into a bucket based on the letters at the k selected positions.
 Recover the motif from a bucket containing multiple l-tuples.

[Figure: input sequence x(i) = …TCAATGCACCTAT...; its l-tuple TGCACCT is hashed into bucket TGCT.]

Example
 l = 7 (motif size) , k = 4 (projection size)
 Choose projection (1,2,5,7)

Input Sequence
...TAGACATCCGACTTGCCTTACTAC...

Buckets ATCCGAC

GCCTTAC

ATGC GCTC

Hashing and Buckets
 Hash function h(x) obtained from k positions
of projection.
 Buckets are labeled by values of h(x).
 Enriched buckets: contain at least s l-tuples,
for some parameter s.

ATGC GCTC CATC ATTC
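Hashing every l-tuple into its bucket, and keeping only the enriched buckets, can be sketched as follows. The names `hash_lmers` and `enriched` are hypothetical; the input string and the projection (1,2,5,7) come from the example above.

```python
from collections import defaultdict

def hash_lmers(sequences, l, positions):
    """Map each bucket label h(x) to the l-tuples (seq id, start, l-tuple) it holds."""
    buckets = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            key = "".join(lmer[p - 1] for p in positions)  # 1-based positions
            buckets[key].append((seq_id, start, lmer))
    return buckets

def enriched(buckets, s):
    """Keep the buckets that contain at least s l-tuples."""
    return {key: items for key, items in buckets.items() if len(items) >= s}

seq = "TAGACATCCGACTTGCCTTACTAC"
buckets = hash_lmers([seq], l=7, positions=(1, 2, 5, 7))
print([lmer for _, _, lmer in buckets["ATGC"]])  # includes ATCCGAC
print([lmer for _, _, lmer in buckets["GCTC"]])  # includes GCCTTAC
```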

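Frequency Matrix Model From Bucket

Bucket ATGC contains the l-tuples ATCCGAC, ATGAGGC, ATAAGTC and ATGTGAC. Frequency matrix W (rows: nucleotides, columns: positions 1 to 7):

     1    2    3    4    5    6    7
A    1    0   .25  .5    0   .5    0
C    0    0   .25  .25   0    0    1
G    0    0   .5    0    1   .25   0
T    0    1    0   .25   0   .25   0

Starting from W, the EM algorithm produces the refined matrix W*.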
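The frequency matrix is just the column-wise letter frequency of the l-tuples in the bucket. A minimal sketch, with a hypothetical function name, that reproduces the matrix W for bucket ATGC:

```python
def frequency_matrix(lmers, alphabet="ACGT"):
    """Return {letter: [frequency at position 1, ..., frequency at position l]}."""
    n = len(lmers)
    l = len(lmers[0])
    return {a: [sum(x[j] == a for x in lmers) / n for j in range(l)]
            for a in alphabet}

bucket = ["ATCCGAC", "ATGAGGC", "ATAAGTC", "ATGTGAC"]
W = frequency_matrix(bucket)
for a in "ACGT":
    print(a, W[a])
```

The EM refinement itself is not shown here; W only provides its starting point.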

Motif Refinement

 How do we recover the motif from the
sequences in the enriched buckets?
 k nucleotides are known from hash value of
bucket.
 Use information in other l-k positions as
starting point for local refinement scheme,
e.g. EM or Gibbs sampler
[Figure: the l-tuples ATCCGAC, ATGAGGC, ATAAGTC and ATGTGAC in bucket ATGC are passed to the local refinement algorithm, which returns the candidate motif ATGCGTC.]

Expectation Maximization (EM)
 S = { x(1), …, x(N)} : set of input sequences
 Given:
 W = An initial probabilistic motif model
 P0 = background probability distribution.
 Find the value Wmax that maximizes the likelihood ratio:

\[ \frac{\Pr(S \mid W_{max})}{\Pr(S \mid P_0)} \]

 EM is a local optimization scheme and requires a starting value W.

EM Motif Refinement

 For each bucket h containing at least s sequences, form the weight matrix Wh
 Use the EM algorithm with starting point Wh to obtain a refined weight matrix model Wh*
 For each input sequence x(i), return the l-tuple y(i) which maximizes the likelihood ratio:
Pr(y(i) | Wh*) / Pr(y(i) | P0).
 T = {y(1), y(2), …, y(N)}
 C(T ) = consensus string

What Is the Best Motif?
 Compute a score for each motif:
 Generate W, an initial PSSM, from the returned l-mers {y(1), y(2), …, y(N)}

\[ \text{Score} = \sum_{i} \log \frac{P(y(i) \mid W)}{P(y(i) \mid P_0)} \]

 Return the motif with the maximal score
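The score can be sketched as a sum of log-likelihood ratios over the returned l-mers. Two things here are assumptions, not part of the slides: a pseudocount is added to every cell of the PSSM so that no probability is zero, and the background P0 is taken to be uniform.

```python
from math import log

def pssm(lmers, pseudocount=1.0, alphabet="ACGT"):
    """Position-specific letter probabilities, smoothed by a pseudocount (assumption)."""
    n, l = len(lmers), len(lmers[0])
    total = n + pseudocount * len(alphabet)
    return [{a: (sum(x[j] == a for x in lmers) + pseudocount) / total
             for a in alphabet} for j in range(l)]

def score(lmers, W, background):
    """Sum over all l-mers of log( P(y | W) / P(y | P0) )."""
    return sum(log(W[j][a] / background[a])
               for y in lmers for j, a in enumerate(y))

P0 = {a: 0.25 for a in "ACGT"}  # uniform background (assumption)
conserved = ["ATGCGTC", "ATGCGTC", "ATGAGTC", "ATGCGAC"]
mixed = ["ATGCGTC", "CGTAAGT", "TTACCGA", "GACGTAC"]
print(round(score(conserved, pssm(conserved), P0), 3))
print(round(score(mixed, pssm(mixed), P0), 3))
```

A well-conserved set of l-mers scores higher than a set of unrelated ones, which is what the final selection step relies on.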

Iterations
 Single iteration.
 Choose a random k-projection.
 Hash each l-mer x in input sequence into bucket
labelled by h(x).
 From each bucket B with at least s sequences, form
weight matrix model, and perform EM/Gibbs sampler
refinement.
 Candidate motif is the best one found from refinement
of all enriched buckets.
 Multiple iterations.
 Repeat process for multiple projections.

Parameter Selection
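 Projection size k
 Choose k small so that several motif instances hash to the same bucket (k < l - d)
 Choose k large to avoid contamination by spurious l-mers: the expected number of random l-mers per bucket, N(n - l + 1) / 4^k, should stay below a chosen bound E, that is E > N(n - l + 1) / 4^k
 Bucket threshold s: (s = 3, s = 4)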

How Many Iterations?
 Planted bucket : bucket with hash value h(M),
where M is motif.
 Choose m = number of iterations, such that
Pr(planted bucket contains ≥ s sequences in
at least one of m iterations) ≥ 0.95.
 Probability is readily computable since
iterations form a sequence of independent
trials.
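The number of iterations can be estimated with a simple binomial model. The sketch below makes assumptions that are not stated on the slide: every one of the t planted instances carries exactly d mutations, an instance lands in the planted bucket only if none of the k projected positions is mutated, and random background l-mers are ignored. All function names are hypothetical.

```python
from math import ceil, comb, log

def hit_probability(l, d, k):
    """Chance that the k projected positions avoid all d mutated positions."""
    return comb(l - d, k) / comb(l, k)

def prob_fewer_than_s(l, d, k, s, t):
    """Chance that fewer than s of t instances hash to the planted bucket."""
    p = hit_probability(l, d, k)
    return sum(comb(t, i) * p**i * (1 - p)**(t - i) for i in range(s))

def iterations_needed(l, d, k, s, t, confidence=0.95):
    """Smallest m with 1 - B^m >= confidence, B = prob_fewer_than_s(...)."""
    B = prob_fewer_than_s(l, d, k, s, t)
    if B >= 1.0:
        raise ValueError("planted bucket can never be enriched; need k <= l - d")
    if B <= 0.0:
        return 1
    return max(1, ceil(log(1 - confidence) / log(B)))

# (15,4) motif, projection size 7, bucket threshold 4, 20 planted instances
m = iterations_needed(l=15, d=4, k=7, s=4, t=20)
print(m)
```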

Composite motifs
detection

Question
Monad detection
Mitra

monad patterns
 Short contiguous strings
 Appear surprisingly many times (in a statistically significant way)
 S=
AGTCTTGCTAGTCCGTAATATCCGGATAGAATAATGATC
AGTC AGTC
GTAGCATCGTACGTAGCTATCGATCTGAAGCTAGCAGC
AAGATGTACTAGAGTCACGTAGCTAGTCATCTATACGAG
AGTC AGTC
TCGATGTAGTAGCTATCGATCGTAGCTAGAGTCCGTAGC
TC AGTC
AGCTAGTATCGTAGTGAGCAACATGAGTCCAGTGCATA
AGTC
GTCAGCTCATGAGTCGCATAGTC
GTC AGTC
P = AGTC

Introduction
 However, many of the actual regulatory
signals are composite patterns.
 Groups of monad patterns
 Occur relatively near each other

 An example of a composite pattern is a dyad
signal.

Composite Pattern

 S=ACGTAAATCACGTTGACTAGCTAGCACGAG
CTAGCATAATCACACTTTGACGAGTCGACTGC
ATGCATTGACGCAGTGCATTGCTAGCATGGG
TAATCAAACGTTGGCTAGCTAGCATGCATCTG
AGCATGCTAGCTACGTACTAGCGCGATAGTC
TACTACAAATCACCCATTGCGAGCTACGTAG
CTAGCTAGCTAGCTAGCTAGTGATGCATGCTA
GAATCCGATCTTGCGATCGAT

CP = AATCxxxxTTG

Introduction
 A possible approach is to find each part of the
pattern separately and reconstruct the
composite pattern.
 However, such methods often fail to output composite regulatory patterns consisting of weak monad parts.

Introduction
 A better approach would be to detect both parts of
a composite pattern at the same time.
 Two steps in the proposed algorithm:
 Preprocess the sample to create a set of ‘virtual’ monads.
 Apply an exhaustive monad discovery algorithm to the ‘virtual’ monad problem.
 By preprocessing, original problem can be
transformed into a larger monad discovery problem.

Monad Pattern Discovery

 Canonical pattern: l-mer (example of a 3-mer: A C A)
 A continuous string of length l
 (l, d)-neighbourhood of an l-mer P
 All possible l-mers with up to d mismatches as compared to P
 The number of such l-mers is:

\[ \sum_{i=0}^{d} \binom{l}{i} 3^{i} \]

 (l,d)-k patterns
 Given a sequence S, find all l-mers that occur with up to d mismatches at least k times in the sample
 A variant: the sample is split into several sequences, and the goal is to find all l-mers that occur with up to d mismatches in at least k sequences
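The size formula for the (l, d)-neighbourhood can be checked against a direct enumeration. A small sketch with hypothetical function names, using the 3-mer ACA from above:

```python
from itertools import combinations, product
from math import comb

def neighbourhood_size(l, d):
    """Number of l-mers within d mismatches of a given l-mer."""
    return sum(comb(l, i) * 3**i for i in range(d + 1))

def neighbourhood(P, d, alphabet="ACGT"):
    """Enumerate every l-mer with at most d mismatches compared to P."""
    result = set()
    for i in range(d + 1):
        for positions in combinations(range(len(P)), i):
            # at each chosen position, substitute one of the 3 other letters
            choices = [[a for a in alphabet if a != P[p]] for p in positions]
            for letters in product(*choices):
                q = list(P)
                for p, a in zip(positions, letters):
                    q[p] = a
                result.add("".join(q))
    return result

print(neighbourhood_size(3, 1))          # 1 + 3*3 = 10
print(sorted(neighbourhood("ACA", 1)))
```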

Pattern Driven Approach(PDA)

 (Pevzner and Sze, 2000)
 Examines all 4^l patterns of fixed length l in lexical order, compares each pattern to every l-mer in the sample, and returns all (l, d)-k patterns
 (Waterman et al., 1984 and Galas et al., 1985)
 Bypass the excessive time requirement
 Most of the 4^l patterns are not worth examining, since neither these patterns nor their neighbours appear in the sample
 SDA was therefore designed to explore only the l-mers appearing in the sample and their neighbours.

Sample Driven Approach(SDA)
 First initializes a table of size 4^l
 Each table entry corresponds to a pattern
 For each l-mer in the sample, SDA generates its (l, d)-neighbourhood
 The table entry of every pattern in that neighbourhood is incremented by a certain amount
 After all l-mers are processed, SDA returns all patterns whose table entries have scores exceeding the threshold

Table of size 4^l (example entries):
AAAAA 3
AAAAC 1
AAACC 2
… ..
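A minimal sketch of the sample-driven approach follows. It departs from the description in two labelled ways: a dictionary that only stores the patterns actually reached stands in for the full table of size 4^l, and every table entry is incremented by 1 per neighbouring l-mer (the slide only says "a certain amount"). Function names are hypothetical.

```python
from collections import Counter
from itertools import combinations, product

def neighbourhood(P, d, alphabet="ACGT"):
    """All l-mers with at most d mismatches compared to P."""
    result = set()
    for i in range(d + 1):
        for positions in combinations(range(len(P)), i):
            choices = [[a for a in alphabet if a != P[p]] for p in positions]
            for letters in product(*choices):
                q = list(P)
                for p, a in zip(positions, letters):
                    q[p] = a
                result.add("".join(q))
    return result

def sda(sample, l, d, k):
    """Return the patterns whose table entry reaches the threshold k."""
    table = Counter()  # sparse stand-in for the 4^l table (assumption)
    for start in range(len(sample) - l + 1):
        for pattern in neighbourhood(sample[start:start + l], d):
            table[pattern] += 1  # increment of 1 per l-mer (assumption)
    return sorted(p for p, count in table.items() if count >= k)

patterns = sda("AGTATCAGTT", l=3, d=1, k=2)
print(len(patterns), patterns[:5])
```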

Sample Driven Approach(SDA)

 Faster, but still requires a large table of size 4^l
 Not practical for long patterns in the mid-1980s
 Not mainstream, and no tool was available
 (Today gigabytes of RAM are available, so l can be increased even without a memory-efficient algorithm)

SDA Iterations
 First, explore all neighbours of the first l-mer from the sample.
 Second, explore all neighbours of the second l-mer, and so on.
 If an l-mer P belongs to the neighbourhoods of the l-mers appearing at positions i1, …, ik in the sample, then information about P is collected at iterations i1, …, ik.
 So the Waterman approach updates the information about P k times, and the memory slot for P stays occupied throughout the run even if P is not an “interesting” l-mer
 Most of the l-mers explored are not interesting, so they waste memory slots

To improve SDA
 Better solution:
 Collect info about all P at the same time
 to remove the need to keep the info in memory
 but require a new approach to navigate the space of
all lmers
 MITRA runs faster than PDA and SDA, and uses
only a fraction of the memory of the SDA

Pattern-finding vs. profile-based
 Is profile-based more biologically relevant for finding motifs in biological samples?
 Probably the reason the Waterman algorithm has not been popular in the last decade
 Sagot and colleagues were the first to rebut this opinion
 They developed an efficient version of Waterman’s algorithm

Pattern-based vs. profile-based

 Similarities
 Pattern-based methods generate the profile
 Every profile of length l corresponds to a pattern of length l formed by the most frequent nucleotide in every position.
 Pattern-driven is at least as good as profile-based
 Even better on simulated samples with implanted patterns
 Though the profile-implantation model is somewhat limited
 Today there is little evidence that profile-based methods perform any better on either biological or simulated samples

Mitra

Mismatch Tree Algorithm

Mismatch Tree Algorithm
(MITRA)
 MITRA uses a mismatch tree data structure to
split the space of all possible patterns into disjoint
subspaces that start with a given prefix.
 This reduces pattern discovery to smaller sub-problems.
 MITRA also takes advantage of pair-wise
similarity between instances.

Splitting Pattern Space
 A pattern is called weak if it has less than k
( l ,d )-neighbours in the sample.

 A subspace is called weak if all patterns in this
subspace are weak.

Splitting pattern space

Sequence = AGTATCAGTT
P= GTC
Not weak

l = 3 …. sample = { AGT, GTA, TAT, ATC, TCA, CAG, AGT, GTT }
d =1 ; k =2
( l ,d )-neighbours in the sample = { GTA, ATC, GTT }


P= CAG
weak

l = 3 …. sample = { AGT, GTA, TAT, ATC, TCA, CAG, AGT, GTT }
d =1 ; k =2
( l ,d )-neighbours in the sample = { CAG }
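The two cases above can be checked mechanically. A minimal sketch with hypothetical function names; a pattern is weak when it has fewer than k (l, d)-neighbours in the sample:

```python
def hamming(a, b):
    """Number of mismatching positions between two strings of equal length."""
    return sum(x != y for x, y in zip(a, b))

def lmers(sequence, l):
    """All substrings of length l, in order of appearance."""
    return [sequence[i:i + l] for i in range(len(sequence) - l + 1)]

def neighbours_in_sample(P, sample, d):
    """The l-mers of the sample within d mismatches of P."""
    return [x for x in sample if hamming(P, x) <= d]

def is_weak(P, sample, d, k):
    return len(neighbours_in_sample(P, sample, d)) < k

sample = lmers("AGTATCAGTT", 3)
print(neighbours_in_sample("GTC", sample, 1), is_weak("GTC", sample, 1, 2))
print(neighbours_in_sample("CAG", sample, 1), is_weak("CAG", sample, 1, 2))
```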

 A subspace is called weak if all patterns in this
subspace are weak.


•subspaceA = { AAA, AAT, AAC, AAG ………..AGG }
•subspaceT = { TAA, TAT, TAC, TAG ………..TGG }
•subspaceC = { CAA, CAT, CAC, CAG ………..CGG }
•subspaceGG = { GGA, GGT, GGC, GGG}

Question
 Input:
 S, l, d, k
 Output:
 All l mers that occur with up to d mismatches
at least k times in the sample.

Solution
 Naïve :
 Test every l-mer in the space
 If it occurs with up to d mismatches at least k times in the sample, then output this l-mer.

space = { AAA, AAT, AAC, AAG ………..AGG
TAA, TAT, TAC, TAG ………..TGG
CAA, CAT, CAC, CAG ………..CGG
GAA, GAT, GAC, GAG ………..GGG }

sample = { AGT, GTA, TAT, ATC, TCA, CAG, AGT, GTT }

 if we are looking for patterns of length l we would first
split the space of all l mers into 4 disjoint subspaces.

 Subspace of all l mers starting with A,
 Subspace of all l mers starting with T,
 Subspace of all l mers starting with C,
 Subspace of all l mers starting with G,


Space:

A* T* C* G*

SubspaceA

 we further determine whether the subspace
contains a ( l ,d )-k pattern.

Space:
A*

Can’t rule out


Space:
AA* AC*
AT* AG*

Can rule out


Space:

Can’t rule out

 If we can rule out that this subspace contains such a pattern
 we stop searching in this subspace;
 release the memory slot;

 If we can’t rule out that this subspace contains such a pattern
 we split this subspace again on the next symbol;
 and repeat;

Mismatch tree data structure
 A mismatch tree is a rooted tree where each internal
node has 4 branches labeled with a symbol in
{A,C,T,G}
 The maximum depth of the tree is l.
 Each node in the mismatch tree corresponds to the
subspace of patterns P with a fixed prefix.
 Each node contains pointers to all l mers instances
from the sample that are within d mismatches from a
pattern p.

 MITRA starts by examining the root node of the mismatch tree, which corresponds to the space of all patterns.
 When examining a node, MITRA tries to prove that it corresponds to a weak subspace.
 If we can’t prove it, we expand the node’s children and examine each of them.
 Whenever we reach a node corresponding to a weak subspace, we backtrack.
 The intuition is that many of the nodes correspond to weak subspaces and can be ruled out.
 This allows us to avoid searching much of the pattern space.

 If we reach depth l and the number of instances is not less than k:
 the l-mer corresponding to the path from the root to the leaf is an (l,d)-k pattern;
 the pointers from this node correspond to the instances of this pattern.
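The traversal described above can be sketched as a recursive depth-first search. This is an illustration under assumptions, not the published MITRA code: the tree is never stored, each node is represented by the list of (l-mer index, mismatches so far) pairs that are still within d mismatches of the prefix, and a node is treated as weak as soon as fewer than k l-mers survive.

```python
def mitra(sample, l, d, k):
    """Return all (l,d)-k patterns of the sample, in lexical order."""
    lmers = [sample[i:i + l] for i in range(len(sample) - l + 1)]
    found = []

    def expand(prefix, alive):
        # alive: (index of l-mer, mismatches against prefix) with mismatches <= d
        if len(alive) < k:
            return                      # weak subspace: backtrack
        depth = len(prefix)
        if depth == l:
            found.append(prefix)        # leaf: prefix is an (l,d)-k pattern
            return
        for sym in "ACGT":
            child = []
            for idx, mm in alive:
                mm2 = mm + (lmers[idx][depth] != sym)
                if mm2 <= d:
                    child.append((idx, mm2))
            expand(prefix + sym, child)

    expand("", [(i, 0) for i in range(len(lmers))])
    return found

patterns = mitra("AGTATCAGTT", l=3, d=1, k=2)
print("GTC" in patterns, "CAG" in patterns)  # True False
```

Only the surviving l-mers along the current root-to-node path are held in memory at any time, in contrast to the full table of the sample-driven approach.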

Example

 Consider a very simple example (not for exam) of finding the patterns of length 4 that occur with up to 1 mismatch at least 2 times in the sample S = “AGTATCAGTT”.

 The substrings (4-mers) in S are { AGTA, GTAT, TATC, ATCA, TCAG, CAGT, AGTT }

[Figure: mismatch tree for S. Each node lists the 4-mers that are still within 1 mismatch of the node's prefix, the number of mismatches accumulated by each, and the number k of surviving instances (k = 7 at the root). Subtrees whose count drops below 2 are pruned. Output shown in the figure: AGTA, AGTC, AGTG, AGTT.]
Overall complexity
 Space = O(l² × |S|)
 Time = O(4^l × |S|)
 Number of nodes = O(4^l)
 Number of comparisons in each node = O(|S|)
[Figure: mismatch tree of depth l for the example, with O(|S|) l-mers attached to each node.]

Take a Closer Look
 In the mismatch tree algorithm, we cannot start ruling out a node until we traverse to depth d + 1.
[Figure: the example tree with d = 1. At the root and at node A all 7 4-mers survive (k = 7); the first 4-mers are dropped only at depth 2, where a node keeps k = 5 of them.]

MITRA Graph
 Information about pairwise similarities between instances of the pattern can significantly speed up the sample-driven approach.
 The graph that is constructed to model this pairwise similarity is called the MITRA-Graph.

MITRA Graph

 Given a pattern P and sample S we can construct
a graph G(P, S) where each vertex is an lmer in
the sample and there is an edge connecting two
lmers if P is within d mismatches from both
lmers.

S = TAACA
P = TAC
AAC (d=1)

TAA ACA
(d=1) (d=3)

MITRA Graph
 For an (l,d) – k pattern P the corresponding graph
contains a clique of size k.

S = TAACA
P = AAA
AAC (d=1)

TAA ACA
(d=1) (d=1)

MITRA Graph

 Given a set of patterns P and a sample S, define a
graph G(P , S) whose edge set is a union of edge
sets of graphs G(P, S) for P∈P .
 Each vertex of G(P , S) is an lmer in the sample
and there is an edge connecting two lmers if there
is a pattern P∈P that is within d mismatches
from both lmers.
 If for a subspace of patterns we can rule out the existence of a clique of size k, then the subspace contains no (l,d)-k pattern.

The WINNOWER Algorithm

 The WINNOWER algorithm by Pevzner and Sze (2000) constructs the following graph: each l-mer in the sample is a vertex, and an edge connects two vertices if the corresponding l-mers have at most 2d mismatches.
 Instances of an (l,d)-k pattern form a clique of size k in this graph.

The WINNOWER Algorithm
(con’t)
 Since cliques are difficult to find, WINNOWER takes the approach of trying to remove edges that do not correspond to a clique.

k=4

Improvements by MITRA-Graph

1. Construct a graph at each node in the mismatch tree.
2. Remove edges which are not part of a clique.
3. If no potential clique remains, rule out the subspace corresponding to the node and backtrack.
4. If we cannot rule out a clique, split the subspace of patterns and examine the child nodes.

[Figure: the four steps illustrated on node A of the example mismatch tree.]

MISMATCH TREE ALGORITHM —
Improvements over WINNOWER

 At each node of the tree, we remove edges
by computing the degree of each vertex.
 If the degree of the vertex is less than k-1,
we can remove all edges incident to it since
we know it is not part of a clique.
 We repeat this procedure until we cannot
remove any more edges.
 If the number of edges remaining is less than
the minimum number of edges in a clique,
we can rule out the existence of a clique and
backtrack.

 The problem with this approach is how to
efficiently construct the graph at each node in
the mismatch tree.
 Instead of constructing the graph from scratch,
we construct it based on the graph at the
parent node
 Consider an edge connecting two l-mers:
 the first l-mer matches the prefix of the pattern subspace with d1 mismatches;
 the second l-mer matches it with d2 mismatches;
 m denotes the number of mismatches between the tails of the first and the second l-mers.
 The edge between these two l-mers exists in the pattern subspace if and only if d1 <= d, d2 <= d and d1+d2+m <= 2d.

[Figure: the prefix of the pattern subspace aligned against the first l-mer and the second l-mer.]

Improvements over WINNOWER (cont’d)

 In the root node since d1 = d2 = 0, an edge exists only if
m <= 2d which is the equivalent graph to
WINNOWER.
 Moving down the tree, the condition becomes much stronger than in WINNOWER.
 We can compute the edges of a node based on the edges of the node’s parent by keeping track of the quantities d1, d2, and m for each edge.

 To summarize, the MITRA-Graph algorithm works as
follows
 We first compute the set of edges at the root node by
performing pairwise comparisons between all l mers
due to d1 = d2 = 0.
 We traverse the tree in a depth first order, passing on
the valid edges and keeping track of the quantities d1,
d2, and m for each of them.
 At each node, we prune the graph by eliminating any
edges incident to vertices that have degrees of less
than k-1.
 If there are less than the minimum number of edges for
a clique, we backtrack.
 If we reach a leaf of the tree (depth l), then we output
the corresponding pattern.
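The summary above can be turned into a compact sketch. It is an illustration under assumptions rather than the published implementation: edges are kept in a dictionary mapping a pair of l-mer indices to the tail mismatch count m, the per-vertex prefix mismatches are kept in a list, and k >= 2 is required so that a clique has at least one edge. All names are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def prune(edges, k):
    """Repeatedly drop edges touching a vertex of degree < k - 1."""
    edges = dict(edges)
    while True:
        degree = {}
        for i, j in edges:
            degree[i] = degree.get(i, 0) + 1
            degree[j] = degree.get(j, 0) + 1
        kept = {(i, j): m for (i, j), m in edges.items()
                if degree[i] >= k - 1 and degree[j] >= k - 1}
        if len(kept) == len(edges):
            return kept
        edges = kept

def mitra_graph(sample, l, d, k):
    """Return all (l,d)-k patterns of the sample; requires k >= 2."""
    lmers = [sample[i:i + l] for i in range(len(sample) - l + 1)]
    min_edges = k * (k - 1) // 2        # number of edges in a clique of size k
    found = []

    def visit(prefix, dist, edges):
        edges = prune(edges, k)
        if len(edges) < min_edges:
            return                      # no clique of size k can remain
        t = len(prefix)
        if t == l:
            found.append(prefix)
            return
        for sym in "ACGT":
            child_dist = [dv + (x[t] != sym) for dv, x in zip(dist, lmers)]
            child_edges = {}
            for (i, j), m in edges.items():
                m2 = m - (lmers[i][t] != lmers[j][t])   # tail gets shorter
                if (child_dist[i] <= d and child_dist[j] <= d
                        and child_dist[i] + child_dist[j] + m2 <= 2 * d):
                    child_edges[(i, j)] = m2
            visit(prefix + sym, child_dist, child_edges)

    # root: d1 = d2 = 0, so an edge exists iff the two l-mers differ in <= 2d places
    root = {(i, j): hamming(lmers[i], lmers[j])
            for i, j in combinations(range(len(lmers)), 2)}
    root = {e: m for e, m in root.items() if m <= 2 * d}
    visit("", [0] * len(lmers), root)
    return found

print(mitra_graph("AGTATCAGTT", l=3, d=1, k=2)[:5])
```

At a leaf the tails are empty, so the surviving graph is complete on the l-mers within d mismatches of the pattern, and any remaining edge implies at least k instances.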

DISCOVERING DYAD SIGNALS
 For dyad signals, we are interested in
discovering two monads that occur a certain
length apart
 We use the notation (l1-(s1,s2)-l2,d)-k pattern to
denote a dyad signal

l1 s l2 l1 s l2 l1 s l2


 The MITRA-Dyad algorithm casts the dyad
discovery problem into a monad discovery
problem by preprocessing the input and
creating a “virtual” sample to solve the
(l1+l2,d)-k monad pattern discovery problem in
this sample
 For each l1mer in the sample and for each s in
[s1,s2], we create an l1+l2 mer which is the l1mer
concatenated with the l2 mer upstream s
nucleotides of the l1mer.

 The number of elements in the “virtual” sample will be approximately (s2-s1+1) times larger.
 An (l1+l2,d)-k pattern in the “virtual” sample will
correspond to a (l1-(s1,s2)-l2,d)-k pattern in the
original sample, and we can easily map the
solution from the monad problem to the dyad
one.
 An important feature of MITRA-Dyad is an
ability to search for long patterns.
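The preprocessing step that builds the “virtual” sample can be sketched as follows. One assumption is made about orientation: following the l1 s l2 diagram, the l2-mer is taken to start s nucleotides after the end of the l1-mer. The function name and the toy string are hypothetical.

```python
def virtual_sample(sample, l1, l2, s1, s2):
    """Build the (l1+l2)-mers obtained by joining each l1-mer with the l2-mer
    found s nucleotides after it, for every s in [s1, s2]."""
    virtual = []
    for start in range(len(sample) - l1 + 1):
        for s in range(s1, s2 + 1):
            second = start + l1 + s          # start of the l2-mer (assumption)
            if second + l2 <= len(sample):
                virtual.append(sample[start:start + l1]
                               + sample[second:second + l2])
    return virtual

toy = "AATCGGGGTTGCC"
print(virtual_sample(toy, l1=4, l2=3, s1=4, s2=4))
# ['AATCTTG', 'ATCGTGC', 'TCGGGCC']
```

Any monad discovery routine for (l1+l2, d)-k patterns can then be run on the returned list.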


 If the range s2-s1+1 of acceptable distances between monad parts in a composite pattern is large, the MITRA-Dyad algorithm becomes inefficient
 A simple approach to detect these patterns is
to generate a long ranked list of candidate
monad patterns using MITRA.
 Then check each occurrence of each pair from
the list to see if they occur within the
acceptable distance.
