Detection of Genetic
       Motif.

    Promoters – Biology
     Information theory
    Random Projections
  Composed motif detection
Motifs and promoters
DNA sequence


                                                   gene
                                                   junk DNA




                            gene
  UTR-5'                                               UTR-3'
           e1     e2            e3        e4      e5

                                                       exon
                                                       intron
                                                                  INR = Initiator Region
                                                                  DSE = DownStream region
                                                                  TSS = Transcription
                                                                      Start Site
Promoter module             Promoter module
                                                  TSS
                                       TATA box
                                                INR
                                                 INR            DSE
                                                                DSE

    TFBS                     TFBS
Distal Promoter        Proximal promoter Core promoter
TFBS-Transcription Factor
           Binding Site
 Short strings (12 to 20 nucleotides long)
 Spread over up to 5 kb upstream of the TSS
 The sequence of the string selects the protein that will bind,
  on the basis of Van der Waals interactions
[Figure: a protein about to bind the string ACCGATTATCA, an example of a
 Transcription Factor Binding Site (TFBS); the contacts are Van der Waals interactions]
Assembly of the promoter protein
    complex of transcription
                                      1st stage
          Transcription factors TF


                                                  TFIID

                                            TBP



                                                                TSS

                                           TATA box

                                                                      DNA



 Transcription factor Binding Sites TFBS                  INR
Assembly of the promoter protein
    complex of transcription
            2nd stage


                               TFIID
                    TBP

            TATA



                                       TSS


                               INR
                   core
                    promoter
                                             DNA
Assembly of the promoter protein
    complex of transcription

                  DNA looping




                                                                       /Distal promoter
                                                                      enhancer


                                                      TF4
                        TF3

                                                                        TF5




 Proximal   TF2
 promoter                             TBP
                                            TATA          TFIIE
                              TFIIA                                    TFIIH
                                      TFIIB TFIID
                                      TFIIB
                                                    TSS


                  TF1                         INR                 RNA Poly II
                                 core
                                 promoter
                                                                                DNA
Information based motif
       detection
The set of all TFBS (for a certain
     class of genes, organism or other)
          Unknown                Known




                       known
                                    unknown




TFBS's with the same colour are correlated
Example



   Protein of the
   Promoter complex
                                Protein of the
                                Promoter complex




A T G C T C                A T C C T G
Entropy
 Given a probability distribution, we want a function
  representing the quantity of information stored in the
  distribution.
 We define the entropy (H) as:

        $H = -\sum_i p(i) \log p(i)$

        or

        $H = -\int p(x) \log p(x) \, dx$

 For the sake of simplicity, we will use from now on
  the discrete definition.
Observed entropy
 The real distribution is usually unknown, but
  we can replace it by the observed distribution
  f(x). The resulting entropy is:

           $H(x) = -\sum_x f(x) \log f(x)$

 For a multi-dimensional probability distribution
  it is:
           $H(x, y) = -\sum_{x,y} f(x, y) \log f(x, y)$

           $= -\sum_x f(x) \sum_y f(y \mid x) \log f(x, y)$
Mutual Information

$I(x, y) = \sum_{x,y} f(x, y) \log \frac{f(x, y)}{f(x) f(y)}$

$= \sum_{x,y} f(x, y) [\log f(x, y) - \log f(x) - \log f(y)]$

$= H(x) + H(y) - H(x, y)$
 X and Y are strings of equal length, S={A, C, G, T}, x
  and y belong to S
 f(x,y) is the relative joint frequency of x,y in X and Y
 f(x) is the relative frequency of x in X
 f(y) is the relative frequency of y in Y
Information divergence
 Given two distributions
  P and Q (not for exam)

  $D(P, Q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$

  $= \sum_x p(x) \log p(x) - \sum_x p(x) \log q(x)$
Example of calculation
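X      A C A T T T A C C A T A G A C A A C T A
Y      A C T T T T A C G A T G G A A A C C T G

Joint counts of (x, y); rows are x (letter in X), columns are y (letter in Y):

                y=A   y=C   y=G   y=T   count in X
       x=A       5     1     2     1        9
       x=C       1     3     1     0        5
       x=G       0     0     1     0        1
       x=T       0     0     0     5        5
  count in Y     6     4     4     6       20

    Divide by 20 to obtain the relative frequencies f(x,y), f(x) and f(y)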
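The counts in this example, and the resulting mutual information, can be recomputed with a short script. This is only a sketch: the function names are illustrative, the logarithm base (2, giving bits) is an assumption because the slides do not fix one, and the mutual information is taken in its usual non-negative form H(x) + H(y) − H(x, y).

```python
from collections import Counter
from math import log2

def joint_counts(X, Y):
    """Count how often letter x in X is aligned with letter y in Y."""
    if len(X) != len(Y):
        raise ValueError("X and Y must have equal length")
    return Counter(zip(X, Y))

def mutual_information(X, Y):
    """Empirical mutual information of two aligned strings, in bits (assumed base 2)."""
    n = len(X)
    joint = joint_counts(X, Y)
    fx, fy = Counter(X), Counter(Y)
    # Sum of f(x,y) * log( f(x,y) / (f(x) f(y)) ) over the observed pairs.
    return sum((c / n) * log2(c * n / (fx[x] * fy[y]))
               for (x, y), c in joint.items())

X = "ACATTTACCATAGACAACTA"
Y = "ACTTTTACGATGGAAACCTG"
counts = joint_counts(X, Y)
print(counts["A", "A"], counts["T", "T"])   # 5 5, as in the table
print(round(mutual_information(X, Y), 3))
```

Pairs that never occur contribute nothing to the sum, so iterating over the observed pairs only avoids evaluating log(0).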
Algorithm for finding new
               TFBS
1) select a true TFBS (for example ACATTTACCATAGACAACT)
   (from a data bank such as IUPAC or TRANSFAC) as a probe;

2) shift the probe over a non-coding zone;

3) at each shift, evaluate the mutual information I(P,S), where P is
   the probe and S is the string of the sequence currently facing it;

4) select the positions (and the corresponding strings) for
   which
                I(P,S) > threshold

5) the strings starting from these positions are candidate
   TFBS, which need to be validated in vitro.
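The five steps can be sketched as a sliding-window scan. The names `scan` and `mutual_information`, the base-2 logarithm and the threshold value are illustrative assumptions; the slides do not specify an implementation, and in practice the threshold needs calibration.

```python
from collections import Counter
from math import log2

def mutual_information(P, S):
    """Empirical mutual information (bits) between two aligned, equal-length strings."""
    n = len(P)
    joint = Counter(zip(P, S))
    fp, fs = Counter(P), Counter(S)
    return sum((c / n) * log2(c * n / (fp[x] * fs[y]))
               for (x, y), c in joint.items())

def scan(probe, sequence, threshold):
    """Shift the probe along the sequence; keep windows whose I(P,S) exceeds the threshold."""
    w = len(probe)
    hits = []
    for pos in range(len(sequence) - w + 1):
        window = sequence[pos:pos + w]
        score = mutual_information(probe, window)
        if score > threshold:
            hits.append((pos, window, score))
    return hits

# Probe taken from the example slide; the flanking letters are made up.
probe = "CACTGTGCGACTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT"
sequence = "TTTT" + probe + "GGGG"
for pos, window, score in scan(probe, sequence, threshold=1.0):
    print(pos, round(score, 3))
```

Every hit is only a candidate: as step 5 says, it still has to be validated in vitro.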
Example
the same string     CACTGTGCGACTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT
                    TTCGGAACCGGCCTTAAGACGGTGAAGGCGCTACTCATTTAATTGTGTTC
   1 error          CACTGTGCGTCTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT
                     TACTATATAATTGATCGTGTTTTGGCCCGCTACTCATGAAGAGCCGTTCG
   2 errors        CACTGTGCGTCTGTCATTCGTCATCCACCGTTGTTAGCACAGGGGTCGAT
                     TAAGGGTATCCAAGTCTGAATACCCCCTGTATTACACTCTCGCTGTCAGT
   5 errors         CACTGTGCGTCTGTCATTCGTCATCCACCATTGTTAGCATAGGGGTCGAC
                     CATTATCGAGGACAGTGATTTGTGGAATGCTTGGCCTTAATACGTCTCTA
   C<--> G         GAGTCTCGCAGTCTGATTGATGATGGAGGCTTCTTACGAGACCCCTGCAT
                     TCAAAGTCAATTTACAGATTGGCGCCTCATGTAATAACGTTGGCATACTA
   C <-- G        GAGTGTGGGAGTGTGATTGATGATGGAGGGTTGTTAGGAGAGGGGTGGAT
                    CTTAAGATAACGGACACTTGATTGAGATACGCTCGACGCTATGTCCGGCT
   some C<-> G CAGTGTCCGACTGTCATTGATCATCCACGCTTGTTACCAGACGCGTCGAT
                   ACTCGACATAAGGTTACAGCATGTGGAGTAATGCGGTCGCTAACTACGGG
    complementary    GTGACACGCTGACAGTAAGTAGTAGGTGGCAACAATCGTGTCCCCAGCTA
                     GCGTGGCGAGCTTAATCCCTGCTGCTCTGAGCAAGGAGGGCGTGTAGAAA
    compl+1error     GTGACACGCGGACAGTAAGTAGTAGGTGGCAACAATCGTGTCCCCAGCTA
                     CAAGGTGACAGAGTATTGAGTGAATCTACAATGTTCGCAGTGCTTTGTCG
    compl+2errors    GTGACACGCTGACAGTAAGAAGTAGGTGGCAACAATCGTGTCCCCAGCTA
                     GCGGTCGCCAATCGTCAAGGAAATGATAGGTCTGATTGGCGTGGCTTAAG
    compl+5errors    GTGACACGCTGACAGTAAGAAGTAGGTGGAAACAATCGTCTCCCCAGCTG
                     GGCGCTAACGAATACTTCAAGGCCCGAAGGATTGGTGTTGATACTAGCCG
    1 letter more    CACTGTGCGACTGTCATTCATCATCACACCGTTGTTAGCACAGGGGTCGAT
                     CGTGACCAGATGTCCTTACTCTGAATGTTATGGTATTAAGTGAGGTAGTG
    2 letters more   CACTGTGCGACTGTCATTCATCATCCACACCGTTGTTAGCACAGGGGTCGAT
                     GCCCATGAACATACATTCATGACTGTTCAAGCGCACTGGACCACTCGTTC
    3 letters more   CACTGTGCGACTGTCATTCATCATCCATCACCGTTGTTAGCACAGGGGTCGAT
   probe            CACTGTGCGACTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT
Detected values for I(P,S)
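[Figure: detected I(P,S) for each planted variant, on a vertical axis with
 ticks at 0.2, 0.4, 0.6, 0.8 and 1. Variants plotted: the same string;
 1 error; 2 errors; 5 errors; C and G exchanged; C becomes G;
 4 C become G and 5 G become C; complementary; complementary + 1 error;
 complementary + 2 errors; complementary + 5 errors; 1 letter more;
 2 letters more; 3 letters more.]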
Conclusions:
 Use mutual information as a tool to capture strings
   that are correlated to a true TFBS used as a probe.
 Validate in vitro the candidates so obtained.
 This is more flexible than the use of Hamming or
   Levenshtein distance, since correlated strings
   can be very distant from one another.
Drawbacks:
1. The method needs a precise calibration of the
   threshold.
2. It does not handle gaps.
Random Projection Approach to
       Motif Finding
daf-19 Binding Sites in C. elegans
                     GTTGTCATGGTGAC
                     GTTTCCATGGAAAC
                     GCTACCATGGCAAC
                     GTTACCATAGTAAC
                     GTTTCCATGGTAAC
                                      che-2
                                      daf-19
                                      osm-1
                                      osm-6
                                      F02D8.3

-150                          -1
The (l,d) Planted Motif Problem
 Generate a random length l consensus
  sequence C.
 Generate 20 instances, each differing from C
  by d random mutations.
 Plant one at a random position in each of
  N=20 random sequences of length n=600.
 Can you find the planted instances?
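A test instance of the problem can be generated as follows. This is a sketch under stated assumptions: each instance receives exactly d substitutions at distinct positions, the background is uniform over A, C, G, T, and the function names and the `seed` parameter are illustrative.

```python
import random

ALPHABET = "ACGT"

def mutate(consensus, d, rng):
    """Return a copy of consensus with exactly d substituted positions."""
    letters = list(consensus)
    for p in rng.sample(range(len(letters)), d):
        letters[p] = rng.choice([c for c in ALPHABET if c != letters[p]])
    return "".join(letters)

def planted_motif_problem(l, d, N=20, n=600, seed=0):
    """Build N random sequences of length n, each hiding one mutated copy of a random consensus."""
    rng = random.Random(seed)
    consensus = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences, positions = [], []
    for _ in range(N):
        background = [rng.choice(ALPHABET) for _ in range(n)]
        start = rng.randrange(n - l + 1)
        # Overwrite l background letters with a mutated instance of the motif.
        background[start:start + l] = mutate(consensus, d, rng)
        sequences.append("".join(background))
        positions.append(start)
    return consensus, sequences, positions

consensus, sequences, positions = planted_motif_problem(l=15, d=4)
print(consensus, positions[:3])
```

Keeping the planted positions makes it possible to check afterwards whether a motif finder recovered them.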
Planted Motifs
AGTTATCGCGGCACAGGCTCCTTCTTTATAGCC
ATGATAGCATCAACCTAACCCTAGATATGGGAT
TTTTGGGATATATCGCCCCTACACTGGATGACT
GGATATACATGAACACGGTGGGAAAACCCTGAC
 Each instance differs from ACAGGATCA by 2
  mutations
 Remaining sequence random
Random Projection Algorithm
 Buhler and Tompa (2001)
 Guiding principle: Some instances of a motif
  agree on a subset of positions.
 Use information from multiple motif instances
  to construct model.



x(1)   ...ccATCCGACca...
x(2)   ...ttATGAGGCtc...
                                 ATGCGTC        =M
x(5)   ...ctATAAGTCgc...
x(8)   ...tcATGTGACac...          (7,2) motif
k-Projections
 Choose k positions in string of length l.
 Concatenate nucleotides at chosen k
  positions to form k-tuple.
 In l-dimensional Hamming space, projection
  onto k dimensional subspace.

  l = 15                               k=7
                                 P
   ATGGCATTCAGATTC                     TGCTGAT

            P = (2, 4, 5, 7, 11, 12, 13)
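The projection on this slide can be reproduced in a few lines. The function names are illustrative, and positions are taken as 1-based to match the slide.

```python
import random

def project(lmer, positions):
    """Concatenate the letters of lmer at the given 1-based positions."""
    return "".join(lmer[p - 1] for p in positions)

def random_projection(l, k, rng=random):
    """Choose k distinct positions out of 1..l uniformly at random, in increasing order."""
    return tuple(sorted(rng.sample(range(1, l + 1), k)))

print(project("ATGGCATTCAGATTC", (2, 4, 5, 7, 11, 12, 13)))   # TGCTGAT
```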
Random Projection Algorithm
 Choose a projection by
                              Input sequence x(i):
  selecting k positions       …TCAATGCACCTAT...
  uniformly at random.
 For each l-tuple in input
  sequences, hash into
  bucket based on letters
  at k selected positions.
 Recover motif from
  bucket containing                 TGCACCT
  multiple l-tuples.


                                    Bucket TGCT
Example
  l = 7 (motif size) , k = 4 (projection size)
  Choose projection (1,2,5,7)


Input Sequence
       ...TAGACATCCGACTTGCCTTACTAC...




  Buckets     ATCCGAC

                                   GCCTTAC

              ATGC                 GCTC
Hashing and Buckets
 Hash function h(x) obtained from k positions
  of projection.
 Buckets are labeled by values of h(x).
 Enriched buckets: contain at least s l-tuples,
  for some parameter s.




  ATGC         GCTC         CATC         ATTC
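Hashing every l-tuple into a bucket labelled by its projection can be sketched with a dictionary. The function names are illustrative; the input sequence and the projection (1, 2, 5, 7) are the ones from the example slide.

```python
from collections import defaultdict

def hash_lmers(sequences, l, positions):
    """Group every l-mer of every sequence by the letters at the chosen 1-based positions."""
    buckets = defaultdict(list)
    for seq in sequences:
        for i in range(len(seq) - l + 1):
            lmer = seq[i:i + l]
            key = "".join(lmer[p - 1] for p in positions)   # h(x)
            buckets[key].append(lmer)
    return buckets

def enriched(buckets, s):
    """Keep only the buckets that contain at least s l-tuples."""
    return {key: lmers for key, lmers in buckets.items() if len(lmers) >= s}

buckets = hash_lmers(["TAGACATCCGACTTGCCTTACTAC"], l=7, positions=(1, 2, 5, 7))
print(buckets["ATGC"], buckets["GCTC"])
```

With a single short input sequence no bucket is enriched; enrichment appears when several sequences contribute instances of the same motif.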
Frequency Matrix Model From
          Bucket
          A1     0 .25 .5     0 .5    0
ATCCGAC
                                       
          C 0    0 .25 .25    0 0     1
          G 0                         0
ATGAGGC
ATAAGTC           0 .5 0       1 .25
                                       
ATGTGAC   T 0
                 1 0 .25      0 .25   0
                                        
 ATGC            Frequency matrix W


                              EM algorithm


                 Refined matrix W*
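The frequency matrix W of a bucket is a column-by-column count, as in this sketch (the function name is illustrative, and no pseudocounts are added, which is an assumption about how the slide's matrix was built).

```python
def frequency_matrix(lmers, alphabet="ACGT"):
    """Return {nucleotide: [relative frequency at each position]} for equal-length l-mers."""
    n = len(lmers)
    length = len(lmers[0])
    return {c: [sum(x[j] == c for x in lmers) / n for j in range(length)]
            for c in alphabet}

bucket = ["ATCCGAC", "ATGAGGC", "ATAAGTC", "ATGTGAC"]
W = frequency_matrix(bucket)
for c in "ACGT":
    print(c, W[c])
```

Each column sums to 1, so W can serve directly as the starting point for the refinement step.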
Motif Refinement

 How do we recover the motif from the
  sequences in the enriched buckets?
 k nucleotides are known from hash value of
  bucket.
 Use information in other l-k positions as
  starting point for local refinement scheme,
  e.g. EM or Gibbs sampler
    ATCCGAC
                Local refinement algorithm
    ATGAGGC                                  ATGCGTC
    ATAAGTC
                                             Candidate motif
    ATGTGAC

    ATGC
Expectation Maximization (EM)
 S = { x(1), …, x(N)} : set of input sequences
 Given:
         W = An initial probabilistic motif model
         P0 = background probability distribution.
 Find value Wmax that maximizes likelihood ratio:

        $\frac{\Pr(S \mid W_{max})}{\Pr(S \mid P_0)}$

 EM is a local optimization scheme; it requires a
  starting value W
EM Motif Refinement

 For each bucket h containing more than s
  sequences, form weight matrix Wh
 Use EM algorithm with starting point Wh to obtain
  refined weight matrix model Wh*
 For each input sequence x(i), return l tuple y(i) which
  maximizes likelihood ratio:
    Pr(y(i) | Wh* )/ Pr(y(i) | P0).
 T = {y(1), y(2), …, y(N)}
 C(T ) = consensus string
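The last three steps (pick the best l-tuple per sequence, then take the consensus) can be sketched as below. The EM refinement itself is not shown: the matrix here is simply the unrefined bucket matrix, P0 is assumed uniform, positions are assumed independent, and all function names are illustrative.

```python
from collections import Counter
from math import log

def log_ratio(lmer, W, P0):
    """log( Pr(lmer | W) / Pr(lmer | P0) ), assuming independent positions."""
    total = 0.0
    for j, c in enumerate(lmer):
        if W[c][j] == 0.0:
            return float("-inf")   # the model gives this l-mer probability zero
        total += log(W[c][j] / P0[c])
    return total

def best_lmer(seq, W, P0):
    """Return the l-tuple y of seq with the highest likelihood ratio."""
    l = len(W["A"])
    candidates = [seq[i:i + l] for i in range(len(seq) - l + 1)]
    return max(candidates, key=lambda y: log_ratio(y, W, P0))

def consensus(lmers):
    """Most frequent nucleotide at every position."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*lmers))

W = {"A": [1, 0, .25, .5, 0, .5, 0], "C": [0, 0, .25, .25, 0, 0, 1],
     "G": [0, 0, .5, 0, 1, .25, 0], "T": [0, 1, 0, .25, 0, .25, 0]}
P0 = {c: 0.25 for c in "ACGT"}
T = [best_lmer(x, W, P0) for x in ["CCATCCGACCA", "TTATGAGGCTC"]]
print(T, consensus(T))
```

A real implementation would usually add pseudocounts to W so that a single unseen letter does not rule out an l-tuple.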
What Is the Best Motif?
 Compute score S for each motif:
     Generate W, an initial PSSM from the returned
      l-mers {y(1), y(2), …, y(N)}

  $\mathrm{Score} = \sum_i \log \frac{P(y(i) \mid W)}{P(y(i) \mid P_0)}$

 Return motif with maximal score
Iterations
 Single iteration.
    Choose a random k-projection.
    Hash each l-mer x in input sequence into bucket
     labelled by h(x).
    From each bucket B with at least s sequences, form
     weight matrix model, and perform EM/Gibbs sampler
     refinement.
    Candidate motif is the best one found from refinement
     of all enriched buckets.
 Multiple iterations.
      Repeat process for multiple projections.
Parameter Selection
 Projection size k
 Choose k small so several motif instances
  hash to same bucket. (k < l - d)
 Choose k large to avoid contamination by
  spurious l-mers: E > N (n - l + 1) / 4^k, where the right-hand side is
  the expected number of background l-mers per bucket
 Bucket threshold s: (s = 3, s = 4)
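The two constraints on k can be checked numerically. In this sketch the function names are illustrative and E = 1 is an assumed default, meaning fewer than one background l-mer per bucket on average; N = 20 and n = 600 come from the planted motif slide, while l = 15 and d = 4 are chosen only as an example.

```python
def expected_background_per_bucket(N, n, l, k):
    """N sequences of length n contain N(n-l+1) l-mers, spread over 4^k buckets."""
    return N * (n - l + 1) / 4 ** k

def projection_sizes(N, n, l, d, E=1.0):
    """All k with k < l - d whose expected background load per bucket is below E."""
    return [k for k in range(1, l - d)
            if expected_background_per_bucket(N, n, l, k) < E]

print(projection_sizes(N=20, n=600, l=15, d=4))   # [7, 8, 9, 10]
```

An empty list means that no k satisfies both conditions for the chosen E, which signals a hard instance for this method.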
How Many Iterations?
 Planted bucket : bucket with hash value h(M),
  where M is motif.
 Choose m = number of iterations, such that
  Pr(planted bucket contains ≥ s sequences in
  at least one of m iterations) ≥ 0.95.
 Probability is readily computable since
  iterations form a sequence of independent
  trials.
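One way to carry out this computation is sketched below. It assumes that every instance carries exactly d mutations, that there are t instances (one per sequence), and it ignores background l-mers that fall into the planted bucket by chance; the function names are illustrative.

```python
from math import ceil, comb, log

def p_planted(l, d, k):
    """Probability that none of the d mutated positions is among the k projected ones."""
    return comb(l - d, k) / comb(l, k)

def iterations_needed(l, d, k, s, t, q=0.95):
    """Smallest m with Pr(planted bucket holds >= s instances in at least one of m trials) >= q."""
    p = p_planted(l, d, k)
    if p == 0.0:
        raise ValueError("k must not exceed l - d")
    # Probability that fewer than s of the t instances reach the planted bucket in one trial.
    fail = sum(comb(t, i) * p ** i * (1 - p) ** (t - i) for i in range(s))
    # Independent trials: fail**m <= 1 - q.
    return ceil(log(1 - q) / log(fail))

print(iterations_needed(l=15, d=4, k=7, s=4, t=20))
```

Lowering the bucket threshold s reduces the number of iterations, at the cost of refining more buckets that contain no motif.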
Composite motifs
   detection

    Question
 Monad detection
     Mitra
monad patterns
 Short contiguous strings
 Appear surprisingly many times (in a statistically significant
  way)
 S=
  AGTCTTGCTAGTCCGTAATATCCGGATAGAATAATGATC
  AGTC     AGTC
  GTAGCATCGTACGTAGCTATCGATCTGAAGCTAGCAGC
  AAGATGTACTAGAGTCACGTAGCTAGTCATCTATACGAG
               AGTC        AGTC
  TCGATGTAGTAGCTATCGATCGTAGCTAGAGTCCGTAGC
  TC                            AGTC
  AGCTAGTATCGTAGTGAGCAACATGAGTCCAGTGCATA
                            AGTC
  GTCAGCTCATGAGTCGCATAGTC
  GTC        AGTC
                           P = AGTC
Introduction
 However, many of the actual regulatory
  signals are composite patterns.
   Groups of monad patterns
   Occur relatively near each other

 An example of a composite pattern is a dyad
  signal.
Composite Pattern

 S=ACGTAAATCACGTTGACTAGCTAGCACGAG
 CTAGCATAATCACACTTTGACGAGTCGACTGC
 ATGCATTGACGCAGTGCATTGCTAGCATGGG
 TAATCAAACGTTGGCTAGCTAGCATGCATCTG
 AGCATGCTAGCTACGTACTAGCGCGATAGTC
 TACTACAAATCACCCATTGCGAGCTACGTAG
 CTAGCTAGCTAGCTAGCTAGTGATGCATGCTA
 GAATCCGATCTTGCGATCGAT


        CP = AATCxxxxTTG
Introduction
 A possible approach is to find each part of the
  pattern separately and reconstruct the
  composite pattern.
 However, such methods often fail to output composite
  regulatory patterns consisting of weak monad
  parts.
Introduction
 A better approach would be to detect both parts of
  a composite pattern at the same time.
 Two steps in the proposed algorithm:
      Preprocessing the sample creates a set of ‘virtual’
       monads.
      Apply an exhaustive monad discovery algorithm to
       the ’virtual’ monad problem.
 By preprocessing, original problem can be
  transformed into a larger monad discovery problem.
Monad Pattern Discovery

 Canonical pattern: an lmer (for example the 3mer A C A)
    A contiguous string of length l
 (l, d)-neighbourhood of an lmer P
       all possible lmers with up to d mismatches as compared to P
       The number of such lmers is:

           $\sum_{i=0}^{d} \binom{l}{i} 3^i$

 (l,d)-k patterns
    Given a sequence S, find all lmers that occur with up to d
      mismatches at least k times in the sample
    A variant: the sample is split into several sequences, and the task is to find all
      lmers that occur with up to d mismatches in at least k sequences
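The neighbourhood size formula can be checked against an explicit enumeration. This sketch uses illustrative function names and the DNA alphabet, so each mismatched position has 3 alternative letters.

```python
from itertools import combinations, product
from math import comb

def neighbourhood_size(l, d):
    """Number of l-mers within d mismatches of a fixed l-mer: sum of C(l,i) * 3^i."""
    return sum(comb(l, i) * 3 ** i for i in range(d + 1))

def neighbourhood(P, d, alphabet="ACGT"):
    """Enumerate all l-mers with up to d mismatches compared to P."""
    result = {P}
    for i in range(1, d + 1):
        for positions in combinations(range(len(P)), i):
            choices = [[c for c in alphabet if c != P[p]] for p in positions]
            for letters in product(*choices):
                q = list(P)
                for p, c in zip(positions, letters):
                    q[p] = c
                result.add("".join(q))
    return result

print(neighbourhood_size(3, 1), len(neighbourhood("ACA", 1)))   # 10 10
```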
Pattern Driven Approach(PDA)

 (Pevzner, 2000)
    Examines all 4^l patterns of fixed length l in lexical order,
     compares each pattern to every lmer in the sample, and
     returns all (l, d)-k patterns
 (Waterman et al., 1984 and Galas et al., 1985)
       Bypass the excessive time requirement
       Most of the 4^l patterns are not worth examining, since neither these
        patterns nor their neighbours appear in the sample
       SDA was therefore designed to explore only the lmers
        appearing in the sample and their neighbours.
Sample Driven Approach(SDA)
 First initializes a table of size 4^l
    Each table entry corresponds to a pattern
    For each lmer in the sample, SDA generates its
     (l, d)-neighbourhood
    The entry of every neighbour is incremented by a certain amount
    After all lmers are processed, SDA returns all patterns
     whose table entries have scores exceeding the
     threshold

   Table of 4^l entries:   AAAAA 3,   AAAAC 1,   AAACC 2,   …
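A compact sketch of this idea follows. Two assumptions differ from the slide and are made only for brevity: a dictionary replaces the full table of 4^l entries (only touched patterns are stored), and each neighbour is incremented by 1, since the slide leaves the increment unspecified. Function names are illustrative.

```python
from collections import Counter
from itertools import combinations, product

def neighbourhood(P, d, alphabet="ACGT"):
    """All l-mers with up to d mismatches compared to P."""
    result = {P}
    for i in range(1, d + 1):
        for positions in combinations(range(len(P)), i):
            choices = [[c for c in alphabet if c != P[p]] for p in positions]
            for letters in product(*choices):
                q = list(P)
                for p, c in zip(positions, letters):
                    q[p] = c
                result.add("".join(q))
    return result

def sda(sequence, l, d, k):
    """Score every neighbour of every sample l-mer; return patterns scoring at least k."""
    table = Counter()
    for i in range(len(sequence) - l + 1):
        for pattern in neighbourhood(sequence[i:i + l], d):
            table[pattern] += 1          # assumed increment of 1 per instance
    return {p: score for p, score in table.items() if score >= k}

found = sda("AGTATCAGTT", l=3, d=1, k=2)
print(found["GTC"])   # 3
```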
Sample Driven Approach(SDA)

 Faster, but still requires a large table of 4^l entries
    Not practical for long patterns in the mid 1980s
    Did not become mainstream, and no tool was built on it
    (Today gigabytes of RAM are available, so l can be
     increased even without a memory-efficient algorithm)
SDA Iterations
 First, explore all neighbours of the first lmer from the
    sample.
   Second, explore all neighbours of the second lmer.
   If an lmer P belongs to the neighbourhoods of the lmers
    appearing at positions i1, …, ik in the sample, then info about P is
    collected at iterations i1, …, ik.
   So the Waterman approach updates the info about P k times, and the
    memory slot for P stays occupied the whole time, even
    if P is not an “interesting” lmer.
   Most of the lmers explored are not interesting, so their memory
    slots are wasted.
To improve SDA
 Better solution:
    Collect all the info about a pattern P at the same time
    This removes the need to keep the info in memory
    But it requires a new approach to navigate the space of
     all lmers
 MITRA runs faster than PDA and SDA, and uses
  only a fraction of the memory of the SDA
Pattern-finding vs. profile-based
 Is the profile-based approach more biologically relevant for
  finding motifs in biological samples?
      This belief is probably the reason the Waterman algorithm has not been
       popular in the last decade
 Sagot and colleagues were the first to rebut this
  opinion
      They developed an efficient version of Waterman’s algorithm
Pattern-based vs. profile-based

 Similarities
    A pattern-based approach also generates a profile
    Every profile of length l corresponds to a pattern of length
     l formed by the most frequent nucleotides in every
     position.
    Pattern-driven is at least as good as profile-based
 Even better on simulated samples with implanted patterns
    Though the profile-implantation model is somewhat limited
    Today there is little evidence that profile-based approaches perform any better on
     either biological or simulated samples
Mitra


Mismatch Tree Algorithm
Mismatch Tree Algorithm
              (MITRA)
 MITRA uses a mismatch tree data structure to
  split the space of all possible patterns into disjoint
  subspaces that start with a given prefix.
    This reduces pattern discovery to smaller
     sub-problems.
    MITRA also takes advantage of pair-wise
     similarity between instances.
Splitting Pattern Space
   A pattern is called weak if it has less than k
    ( l ,d )-neighbours in the sample.

   A subspace is called weak if all patterns in this
    subspace are weak.
Splitting pattern space
        A pattern is called weak if it has less than k
         ( l ,d )-neighbours in the sample.

Sequence = AGTATCAGTT
l = 3 ; d = 1 ; k = 2
sample = { AGT, GTA, TAT, ATC, TCA, CAG, AGT, GTT }

P = GTC
( l ,d )-neighbours in the sample = { GTA, ATC, GTT }
3 neighbours, which is not less than k, so P is not weak
Splitting pattern space
        A pattern is called weak if it has less than k
         ( l ,d )-neighbours in the sample.

Sequence = AGTATCAGTT
l = 3 ; d = 1 ; k = 2
sample = { AGT, GTA, TAT, ATC, TCA, CAG, AGT, GTT }

P = CAG
( l ,d )-neighbours in the sample = { CAG }
1 neighbour, which is less than k, so P is weak
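The two examples (P = GTC and P = CAG) can be verified with a direct implementation of the definition. The function names are illustrative.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def lmers(sequence, l):
    """All substrings of length l, in order of appearance."""
    return [sequence[i:i + l] for i in range(len(sequence) - l + 1)]

def neighbours_in_sample(P, sample, d):
    """Sample l-mers within d mismatches of the pattern P."""
    return [x for x in sample if hamming(P, x) <= d]

def is_weak(P, sample, d, k):
    """A pattern is weak if it has less than k (l,d)-neighbours in the sample."""
    return len(neighbours_in_sample(P, sample, d)) < k

sample = lmers("AGTATCAGTT", 3)
print(neighbours_in_sample("GTC", sample, 1), is_weak("GTC", sample, 1, 2))
print(neighbours_in_sample("CAG", sample, 1), is_weak("CAG", sample, 1, 2))
```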
Splitting pattern space
       A subspace is called weak if all patterns in this
        subspace are weak.

Sequence = AGTATCAGTT




•subspaceA = { AAA, AAT, AAC, AAG ………..AGG }
•subspaceT = { TAA, TAT, TAC, TAG ………..TGG }
•subspaceC = { CAA, CAT, CAC, CAG ………..CGG }
•subspaceGG = { GGA, GGT, GGC, GGG}
Question
 Input:
     S, l, d, k
 Output:
     All l mers that occur with up to d mismatches
      at least k times in the sample.
Solution
 Naïve:
       Test every l mer in the space
            If it occurs with up to d mismatches at least k times
             in the sample, then output this l mer.


space = { AAA, AAT, AAC, AAG ………..AGG
         TAA, TAT, TAC, TAG ………..TGG
         CAA, CAT, CAC, CAG ………..CGG
         GAA, GAT, GAC, GAG ………..GGG }

sample = { AGT, GTA, TAT, ATC, TCA, CAG, AGT, GTT }
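The naïve solution is a double loop over the 4^l patterns of the space and the l mers of the sample, sketched here with illustrative function names.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def naive_search(sequence, l, d, k):
    """Test every l-mer over {A,C,G,T}; keep those with at least k instances within d mismatches."""
    sample = [sequence[i:i + l] for i in range(len(sequence) - l + 1)]
    found = []
    for letters in product("ACGT", repeat=l):
        pattern = "".join(letters)
        occurrences = sum(hamming(pattern, x) <= d for x in sample)
        if occurrences >= k:
            found.append(pattern)
    return found

patterns = naive_search("AGTATCAGTT", l=3, d=1, k=2)
print("GTC" in patterns, "CAG" in patterns)   # True False
```

The running time grows as 4^l times the sample size, which is the cost that splitting the pattern space tries to avoid.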
Splitting pattern space
 if we are looking for patterns of length l we would first
  split the space of all l mers into 4 disjoint subspaces.

          Subspace of all l mers starting with A,
          Subspace of all l mers starting with T,
          Subspace of all l mers starting with C,
          Subspace of all l mers starting with G,
Splitting pattern space
 if we are looking for patterns of length l we would first
  split the space of all l mers into 4 disjoint subspaces.

Space:

               A*          T*           C*          G*




           SubspaceA
Splitting pattern space
 we further determine whether the subspace
  contains a ( l ,d )-k pattern.

Space:
                 A*



         Can’t rule out
Splitting pattern space
 we further determine whether the subspace
  contains a ( l ,d )-k pattern.

Space:
          AA*   AC*
          AT*   AG*



         Can rule out
Splitting pattern space
 we further determine whether the subspace
  contains a ( l ,d )-k pattern.

Space:




             Can’t rule out
Splitting pattern space
 we further determine whether the subspace
  contains a ( l ,d )-k pattern.
   If we can rule out that this subspace contains such
    a pattern
       we stop searching in this subspace;
       release the memory slot;
   If we can’t rule out that this subspace contains
    such a pattern
       we split this subspace again on the next
        symbol;
       and repeat;
Mismatch tree data structure
 A mismatch tree is a rooted tree where each internal
  node has 4 branches labeled with a symbol in
  {A,C,T,G}
 The maximum depth of the tree is l.
 Each node in the mismatch tree corresponds to the
  subspace of patterns P with a fixed prefix.
 Each node contains pointers to all l mer instances
  from the sample that are within d mismatches of a
  pattern P in that subspace.
Mismatch tree data structure
 MITRA starts by examining the root node of the
  mismatch tree, which corresponds to the space of all
  patterns.
      When examining a node, MITRA tries to prove that it
       corresponds to a weak subspace.
      If we can’t prove it,
          we expand the node’s children and examine each of
           them.
      Whenever we reach a node corresponding to a weak
       subspace, we backtrack.
 The intuition is that many of the nodes correspond
  to weak subspaces and can be ruled out.
 This allows us to avoid searching much of the
  pattern space.
Mismatch tree data structure
 If we reach depth l and the number of instances is
  not less than k:
    we output the l mer corresponding to the path from the
     root to the leaf;
    the pointers from this node correspond to the
     instances of this pattern.
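The traversal described on these slides can be sketched as a depth-first search that carries, for each surviving instance, its mismatch count against the current prefix. This is a simplified illustration, not the published MITRA implementation: the tree is implicit in the recursion, a node is ruled out only by counting surviving instances, and the pairwise-similarity refinement is omitted. The function name is illustrative.

```python
def mitra_sketch(sequence, l, d, k):
    """Return {pattern: instances} for l-mers with at least k instances within d mismatches."""
    sample = [sequence[i:i + l] for i in range(len(sequence) - l + 1)]
    results = {}

    def examine(prefix, alive):
        # alive: (index into sample, mismatches against prefix) for surviving instances
        if len(alive) < k:
            return                      # weak subspace: backtrack
        depth = len(prefix)
        if depth == l:
            results[prefix] = [sample[i] for i, _ in alive]
            return
        for symbol in "ACGT":           # split the subspace on the next symbol
            child = []
            for i, mismatches in alive:
                m = mismatches + (sample[i][depth] != symbol)
                if m <= d:
                    child.append((i, m))
            examine(prefix + symbol, child)

    examine("", [(i, 0) for i in range(len(sample))])
    return results

found = mitra_sketch("AGTATCAGTT", l=4, d=1, k=2)
print(found["AGTA"])   # ['AGTA', 'AGTT']
```

Because the list of surviving instances is passed down the recursion and dropped on backtracking, memory is held only for the nodes on the current path.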
Example

 Consider a very simple example (not for exam) of finding the
  patterns of length 4 that occur with up to 1 mismatch
  at least 2 times in the sample S =
  “AGTATCAGTT”.

 The substrings (4mers) in S are
  { AGTA, GTAT, TATC, ATCA, TCAG,
   CAGT, AGTT }
[Figure: traversal of the mismatch tree for this example. Each node shows the
 seven 4mers of S as columns, the number of mismatches between each 4mer and the
 node's prefix (0 0 0 0 0 0 0 at the root, 0 1 1 0 1 1 0 after the branch A), and
 the number k of 4mers still within 1 mismatch, from k = 7 at the root down to
 k = 0 in the branches that are ruled out.]
[Figure: the path A, G, T in the mismatch tree. Mismatch counts of the seven 4mers
 (AGTA, GTAT, TATC, ATCA, TCAG, CAGT, AGTT): 0 0 0 0 0 0 0 at the root (k = 7);
 0 1 1 0 1 1 0 after A; 0 2 2 1 2 2 0 after AG (k = 3: AGTA, ATCA and AGTT remain);
 0 2 0 after AGT (k = 2: AGTA and AGTT remain); each of the four leaves below AGT
 keeps k = 2.]

Output from the AGT branch:   AGTA, AGTC, AGTG, AGTT
Overall complexity
 • Space = O(l^2 × |S|)
    – The traversal path has at most l levels, each storing O(|S|) lmer instances of size O(l)
 • Time = O(4^l × |S|)
    – Number of nodes = O(4^l)
    – Number of comparisons in each node = O(|S|)
Take a Closer Look
 • In the mismatch tree algorithm, we cannot start
   ruling out a node until we traverse to depth d + 1.

[Figure: the seven 4-mers of AGTATCAGTT (AGTA, GTAT, TATC, ATCA, TCAG, CAGT, AGTT) with their mismatch counts: all 0 at the root (k = 7); 0 1 1 0 1 1 0 after descending along A; 1 2 1 1 2 1 1 after descending along AA (k = 5)]
MITRA Graph
 • Information about pairwise similarities between
   instances of the pattern can significantly speed up
   the sample-driven approach.
 • The graph that is constructed to model this
   pairwise similarity is called the MITRA-Graph.
MITRA Graph

 • Given a pattern P and a sample S, we can construct
   a graph G(P, S) where each vertex is an lmer in
   the sample and there is an edge connecting two
   lmers if P is within d mismatches of both
   lmers.

   Example: S = TAACA, P = TAC. The lmers of S and their
   mismatches to P are TAA (1), AAC (1) and ACA (3).
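
The definition of G(P, S) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample is a single string, vertices are identified by their start position, and the helper names (hamming, lmers, mitra_graph) are hypothetical, not taken from the slides.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def lmers(sample, l):
    """All length-l substrings of the sample, in order of start position."""
    return [sample[i:i + l] for i in range(len(sample) - l + 1)]

def mitra_graph(pattern, sample, d):
    """Return (vertices, edges) of G(P, S).

    vertices: list of lmers, indexed by start position in the sample.
    edges: set of index pairs (i, j), i < j, such that the pattern is
           within d mismatches of both lmers.
    """
    vertices = lmers(sample, len(pattern))
    close = [i for i, v in enumerate(vertices) if hamming(pattern, v) <= d]
    return vertices, set(combinations(close, 2))

vertices, edges = mitra_graph("TAC", "TAACA", d=1)
print(vertices, edges)
```

With S = TAACA, P = TAC and d = 1, only TAA and AAC are close enough to P, so the graph has a single edge between them.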
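MITRA Graph
 • For an (l,d)-k pattern P the corresponding graph
   contains a clique of size k.

   Example: S = TAACA, P = AAA. The lmers of S and their
   mismatches to P are TAA (1), AAC (1) and ACA (1), so for
   d = 1 the three lmers form a clique of size 3.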
MITRA Graph

 • Given a set of patterns 𝒫 and a sample S, define a
   graph G(𝒫, S) whose edge set is the union of the edge
   sets of the graphs G(P, S) for P ∈ 𝒫.
 • Each vertex of G(𝒫, S) is an lmer in the sample,
   and there is an edge connecting two lmers if there
   is a pattern P ∈ 𝒫 that is within d mismatches
   of both lmers.
 • If for a subspace of patterns we can rule out the
   existence of a clique of size k, then the subspace
   has no (l,d)-k pattern.
The WINNOWER Algorithm

 • The WINNOWER algorithm by Pevzner and
   Sze (2000) constructs the following graph:
   each lmer in the sample is a vertex, and an edge
   connects two vertices if the corresponding lmers
   have at most 2d mismatches.
 • Instances of an (l,d)-k pattern form a clique of
   size k in this graph.
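
A minimal sketch of this graph construction, assuming a single-string sample and the edge threshold m <= 2d that the root-node discussion of the mismatch tree gives for the WINNOWER graph. The function name winnower_graph and the max_mismatches parameter are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def winnower_graph(sample, l, d, max_mismatches=None):
    """Return (vertices, edges) of the pairwise-similarity graph.

    vertices: lmers of the sample, indexed by start position.
    edges: index pairs (i, j), i < j, whose lmers differ in at most
           max_mismatches positions (assumed default: 2 * d).
    """
    if max_mismatches is None:
        max_mismatches = 2 * d
    vertices = [sample[i:i + l] for i in range(len(sample) - l + 1)]
    edges = {
        (i, j)
        for i, j in combinations(range(len(vertices)), 2)
        if hamming(vertices[i], vertices[j]) <= max_mismatches
    }
    return vertices, edges

print(winnower_graph("TAACA", l=3, d=1))
```

Unlike G(P, S), this graph does not depend on any particular pattern, which is why its edge condition is the weakest one used along the tree.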
The WINNOWER Algorithm
           (cont’d)
 • Since cliques are difficult to find, WINNOWER
   takes the approach of removing edges that
   do not correspond to a clique.

[Figure: example graph with k = 4]
Improvements by MITRA-Graph

1. Construct a graph at each node in the mismatch
   tree.
2. Remove edges which are not part of a clique.
3. If no potential clique remains, rule out the
   subspace corresponding to the node and
   backtrack.
4. If we cannot rule out a clique, split the subspace
   of patterns and examine the child nodes.
MISMATCH TREE ALGORITHM —
   Improvements over WINNOWER

 • At each node of the tree, we remove edges
   by computing the degree of each vertex.
 • If the degree of a vertex is less than k-1,
   we can remove all edges incident to it, since
   we know it is not part of a clique of size k.
 • We repeat this procedure until we cannot
   remove any more edges.
 • If the number of edges remaining is less than
   the minimum number of edges in a clique of size k,
   which is k(k-1)/2, we can rule out the existence
   of a clique and backtrack.
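
The pruning rule can be sketched as follows. This is an illustrative version that assumes edges are given as a set of vertex pairs; the names prune_low_degree and may_contain_clique are hypothetical.

```python
from collections import defaultdict

def prune_low_degree(edges, k):
    """Repeatedly drop every edge incident to a vertex of degree < k - 1."""
    edges = set(edges)
    while True:
        degree = defaultdict(int)
        for u, v in edges:
            degree[u] += 1
            degree[v] += 1
        kept = {
            (u, v) for u, v in edges
            if degree[u] >= k - 1 and degree[v] >= k - 1
        }
        if kept == edges:  # nothing more can be removed
            return edges
        edges = kept

def may_contain_clique(edges, k):
    """False when pruning leaves fewer than k(k-1)/2 edges."""
    return len(prune_low_degree(edges, k)) >= k * (k - 1) // 2

# A triangle (0, 1, 2) with one pendant vertex 3.
graph = {(0, 1), (0, 2), (1, 2), (2, 3)}
print(prune_low_degree(graph, 3), may_contain_clique(graph, 4))
```

The test is only a necessary condition: a graph that passes it may still contain no clique of size k, so passing nodes are explored further rather than reported.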
MISMATCH TREE ALGORITHM —
   Improvements over WINNOWER
 • The problem with this approach is how to
   efficiently construct the graph at each node in
   the mismatch tree.
 • Instead of constructing the graph from scratch,
   we construct it based on the graph at the
   parent node.
    – Consider an edge connecting two lmers.
    – The first lmer matches the prefix of the pattern
      subspace with d1 mismatches.
    – The second lmer matches it with d2 mismatches.
MISMATCH TREE ALGORITHM —
     Improvements over WINNOWER

    – Denote the number of mismatches between the tails
      (the positions beyond the prefix) of the first and
      the second lmers as m.
    – The edge between these two lmers exists in
      the pattern subspace if and only if d1 <= d,
      d2 <= d and d1 + d2 + m <= 2d.

[Figure: the prefix of the pattern subspace aligned above the first lmer and the second lmer]
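
The edge condition and its incremental update can be sketched as below. The function names and the (d1, d2, m) tuple layout are assumptions made for illustration; the condition itself is the one stated on the slide.

```python
def edge_survives(d1, d2, m, d):
    """Edge condition for a pattern subspace: d1 <= d, d2 <= d, d1 + d2 + m <= 2d."""
    return d1 <= d and d2 <= d and d1 + d2 + m <= 2 * d

def extend_edge(state, a, b, symbol):
    """Update (d1, d2, m) when the prefix grows by one symbol.

    a and b are the characters of the first and second lmer at the
    position that moves from the tail into the prefix.
    """
    d1, d2, m = state
    return (d1 + (a != symbol), d2 + (b != symbol), m - (a != b))

# Edge between TAA and AAC with d = 1; at the root m is their Hamming distance.
state = (0, 0, 2)
print(edge_survives(*state, d=1))            # root condition m <= 2d
state_t = extend_edge(state, "T", "A", "T")  # prefix "T"
state_g = extend_edge(state, "T", "A", "G")  # prefix "G"
print(state_t, edge_survives(*state_t, d=1))
print(state_g, edge_survives(*state_g, d=1))
```

Each update costs constant time per edge, which is what makes building a child's graph from its parent's graph cheaper than recomputing it.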
MISMATCH TREE ALGORITHM —
Improvements over WINNOWER (cont’d)

 • In the root node, since d1 = d2 = 0, an edge exists if and only if
   m <= 2d, which gives the same graph as
   WINNOWER.
 • As we move down the tree, the condition becomes
   much stronger than WINNOWER's.
 • We can compute the edges of a node based on the
   edges of the node’s parent by keeping track of the
   quantities d1, d2, and m for each edge.
MISMATCH TREE ALGORITHM —
   Improvements over WINNOWER
 • To summarize, the MITRA-Graph algorithm works as
   follows:
    – We first compute the set of edges at the root node by
      performing pairwise comparisons between all lmers
      (at the root, d1 = d2 = 0).
    – We traverse the tree in depth-first order, passing on
      the valid edges and keeping track of the quantities d1,
      d2, and m for each of them.
    – At each node, we prune the graph by eliminating any
      edges incident to vertices that have degree less
      than k-1.
    – If there are fewer than the minimum number of edges for
      a clique, we backtrack.
    – If we reach a leaf of the tree (depth l), then we output
      the corresponding pattern.
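
The summary above can be sketched end to end in Python. This is a simplified illustration, not the authors' implementation: it assumes a single-string sample over the alphabet ACGT, k >= 2, and treats every start position as a separate instance. All names (mitra_graph_search, prune, ALPHABET) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

ALPHABET = "ACGT"

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def prune(pairs, k):
    """Iteratively remove edges incident to vertices of degree < k - 1."""
    pairs = set(pairs)
    while True:
        degree = defaultdict(int)
        for u, v in pairs:
            degree[u] += 1
            degree[v] += 1
        kept = {(u, v) for u, v in pairs
                if degree[u] >= k - 1 and degree[v] >= k - 1}
        if kept == pairs:
            return pairs
        pairs = kept

def mitra_graph_search(sample, l, d, k):
    """Return all (l,d)-k patterns found by depth-first traversal (k >= 2)."""
    mers = [sample[i:i + l] for i in range(len(sample) - l + 1)]
    # Root: d1 = d2 = 0, so an edge needs only m <= 2d.
    root = {}
    for i, j in combinations(range(len(mers)), 2):
        m = hamming(mers[i], mers[j])
        if m <= 2 * d:
            root[(i, j)] = (0, 0, m)
    min_edges = k * (k - 1) // 2
    found = []

    def visit(prefix, edges):
        kept = prune(edges.keys(), k)
        if len(kept) < min_edges:
            return  # no clique of size k possible: backtrack
        depth = len(prefix)
        if depth == l:
            found.append(prefix)
            return
        for symbol in ALPHABET:
            child = {}
            for i, j in kept:
                d1, d2, m = edges[(i, j)]
                a, b = mers[i][depth], mers[j][depth]
                nd1, nd2, nm = d1 + (a != symbol), d2 + (b != symbol), m - (a != b)
                if nd1 <= d and nd2 <= d and nd1 + nd2 + nm <= 2 * d:
                    child[(i, j)] = (nd1, nd2, nm)
            visit(prefix + symbol, child)

    visit("", root)
    return found

print(mitra_graph_search("TAACA", l=3, d=1, k=3))
```

At a leaf the tail is empty (m = 0), so the surviving edges connect exactly the lmers within d mismatches of the full pattern, and the degree test then amounts to checking that at least k such lmers exist.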
Discovering dyad signals
DISCOVERING DYAD SIGNALS
 • For dyad signals, we are interested in
   discovering two monads that occur a certain
   distance apart.
    – We use the notation (l1-(s1,s2)-l2,d)-k pattern to
      denote a dyad signal.

[Figure: three dyad instances, each an l1-long monad and an l2-long monad separated by a gap s]
DISCOVERING DYAD SIGNALS

 • The MITRA-Dyad algorithm casts the dyad
   discovery problem into a monad discovery
   problem by preprocessing the input and
   creating a “virtual” sample, then solving the
   (l1+l2,d)-k monad pattern discovery problem in
   this sample.
    – For each l1mer in the sample and for each s in
      [s1,s2], we create an (l1+l2)-mer which is the l1mer
      concatenated with the l2mer located s
      nucleotides upstream of the l1mer.
DISCOVERING DYAD SIGNALS
    – The number of elements in the “virtual” sample
      will be approximately (s2-s1+1) times larger.
    – An (l1+l2,d)-k pattern in the “virtual” sample will
      correspond to an (l1-(s1,s2)-l2,d)-k pattern in the
      original sample, and we can easily map the
      solution from the monad problem to the dyad
      one.
 • An important feature of MITRA-Dyad is the
   ability to search for long patterns.
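
A minimal sketch of the “virtual” sample construction. It assumes a single-string sample and, following the l1 s l2 layout of the dyad figure, that the l2mer starts s positions after the end of the l1mer in string order; the function name virtual_sample and the returned tuple layout are hypothetical.

```python
def virtual_sample(sample, l1, l2, s1, s2):
    """Build the list of (l1+l2)-mers used by the monad search.

    Each entry is (start of l1mer, gap s, l1mer + l2mer). Keeping the
    start and the gap lets a monad hit be mapped back to a dyad.
    Assumption: the l2mer begins s positions after the end of the l1mer.
    """
    out = []
    for i in range(len(sample) - l1 + 1):
        first = sample[i:i + l1]
        for s in range(s1, s2 + 1):
            j = i + l1 + s
            if j + l2 <= len(sample):  # skip l2mers that run off the end
                out.append((i, s, first + sample[j:j + l2]))
    return out

for entry in virtual_sample("ACGTACGT", l1=2, l2=2, s1=1, s2=2):
    print(entry)
```

Each l1mer contributes at most s2-s1+1 entries, which is where the growth factor of the “virtual” sample comes from.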
DISCOVERING DYAD SIGNALS

 • If the range s2-s1+1 of acceptable distances
   between monad parts in a composite pattern
   is large, the MITRA-Dyad algorithm becomes
   inefficient.
    – A simple approach to detect these patterns is
      to generate a long ranked list of candidate
      monad patterns using MITRA.
    – Then check each occurrence of each pair from
      the list to see if they occur within the
      acceptable distance.
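
The pair check in the second step can be sketched as follows, assuming the candidate monads are already available (the MITRA ranking step is not shown). The sample is a single string, the gap is measured from the end of the first monad to the start of the second, and the function names are hypothetical.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def occurrences(pattern, sample, d):
    """Start positions where the pattern matches with at most d mismatches."""
    l = len(pattern)
    return [i for i in range(len(sample) - l + 1)
            if hamming(pattern, sample[i:i + l]) <= d]

def dyad_occurrences(p1, p2, sample, d, s1, s2):
    """Pairs (i, j): p1 occurs at i, p2 occurs at j, with a gap in [s1, s2]."""
    starts2 = set(occurrences(p2, sample, d))
    hits = []
    for i in occurrences(p1, sample, d):
        for s in range(s1, s2 + 1):
            j = i + len(p1) + s
            if j in starts2:
                hits.append((i, j))
    return hits

print(dyad_occurrences("ACG", "GCA", "ACGTTTGCA", d=0, s1=2, s2=4))
```

Running this for every pair from the ranked list avoids building the “virtual” sample, at the cost of depending on both monads being strong enough to appear in the list.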

Detection of genetic motifs

  • 1. Detection of Genetic Motif. Promoters – Biology Information theory Random Projections Composed motif detection
  • 3. DNA sequence gene junk DNA gene UTR-5' UTR-3' e1 e2 e3 e4 e5 exon intron INR = Initiator Region DSE = DownStream region TSS = Transcription Start Site Promoter module Promoter module TSS TATA box INR INR DSE DSE TFBS TFBS Distal Promoter Proximal promoter Core promoter
  • 4. TFBS-Transcription Factor Binding Site  Short strings (12 to 20 nucleotides long) protein that  spreaded over up to 5kb is going to bind before TSS  The string structure select the protein that will bind on the basis of Van der Waals interactions ACCGATTATCA  Van der Waals interactions example of a Transcription Factor - Binding Sites TFBS
  • 5. Assembly of the promoter protein complex of transcription 1st stage Transcription factors TF TFIID TBP TSS TATA box DNA Transcription factor Binding Sites TFBS INR
  • 6. Assembly of the promoter protein complex of transcription 2nd stage TFIID TBP TATA TSS INR core promoter DNA
  • 7. Assembly of the promoter protein complex of transcription DNA looping /Distal promoter enhancer TF4 TF3 TF5 Proximal TF2 promoter TBP TATA TFIIE TFIIA TFIIH TFIIB TFIID TFIIB TSS TF1 INR RNA Poly II core promoter DNA
  • 9. The set of all TFBS (for a certain class of genes, organism or other) Unknown Known known unknown TFBS's with the same colour are correlated
  • 10. Example Protein of the Promoter complex Protein of the Promoter complex A T G C T C A T C C T G
  • 11. Entropy  Given a probability distribution, we want a function representing the quantity of information stored in the distribution.  We define the entropy (H) as: H = −∑ p (i ) log( p (i )) i or H = − ∫ p ( x) log( p ( x))dx  For the sake of simplicity, we will use from now on the discrete definition.
  • 12. Observed entropy  The real distribution is usually unknown, but we can replace it by the observed distribution f(x). The resulting entropy is: H ( x) = −∑ f ( x) log( f ( x)) x  For a multi dimensional probability distribution it is: H ( x, y ) = −∑ f ( x, y ) log( f ( x, y )) x, y = ∑ f ( x)∑ f ( y | x) log( f ( x, y )) x, y y
  • 13. Mutual Information f ( x, y ) I ( x. y ) = −∑ f ( x, y ) log( ) x, y f ( x) f ( y ) = −∑ f ( x, y )[log( f ( x, y )) − log( f ( x)) − log( f ( y ))] x, y = H ( x, y ) − H ( x ) − H ( y )  X and Y are strings of equal length, S={A, C, G, T}, x and y belong to S  f(x,y) is the relative joint frequency of x,y in X and Y  f(x) is the relative frequency of x in X  f(y) is the relative frequency of y in Y
  • 14. Information divergence  Given two distributions P and Q p( x) D( P, Q) = ∑ p ( x) log( Not for exam x q( x) ) = ∑ p ( x) log( p ( x)) − ∑ p ( x) log(q ( x)) x x
  • 15. Example of calculation X A C A T T T A CC A T A G A C A A C T A Y A C T T T T A CG A T G G A A A C C T G f(x,y) 6 4 4 6 f(x,y) A C G T 9 A 5 1 2 1 f(x) 5 C 1 3 1 0 1 G 0 0 1 0 5 T 0 0 0 5 Divide by 20 to obtain relative frequencies
  • 16. Algorithm for finding new TFBS 1) select a true TFBS (for example ACATTTACCATAGACAACT) (from a data bank as IUPAC or TRANSFAC) as a probe; 2) shift the probe over a non-coding zone; 3) evaluate step-by-step mutual information I(P,S), where P is the probe and S is the current adjacent string on the sequence; 4) select the positions (and the corresponding adjacent strings) for which I(P,S)> threshold 5) the strings starting from these positions are candidate TFBS,which need to be validated in vitro.
  • 17. Example the same string CACTGTGCGACTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT TTCGGAACCGGCCTTAAGACGGTGAAGGCGCTACTCATTTAATTGTGTTC 1 error CACTGTGCGTCTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT TACTATATAATTGATCGTGTTTTGGCCCGCTACTCATGAAGAGCCGTTCG 2 errors CACTGTGCGTCTGTCATTCGTCATCCACCGTTGTTAGCACAGGGGTCGAT TAAGGGTATCCAAGTCTGAATACCCCCTGTATTACACTCTCGCTGTCAGT 5 errors CACTGTGCGTCTGTCATTCGTCATCCACCATTGTTAGCATAGGGGTCGAC CATTATCGAGGACAGTGATTTGTGGAATGCTTGGCCTTAATACGTCTCTA C<--> G GAGTCTCGCAGTCTGATTGATGATGGAGGCTTCTTACGAGACCCCTGCAT TCAAAGTCAATTTACAGATTGGCGCCTCATGTAATAACGTTGGCATACTA C <-- G GAGTGTGGGAGTGTGATTGATGATGGAGGGTTGTTAGGAGAGGGGTGGAT CTTAAGATAACGGACACTTGATTGAGATACGCTCGACGCTATGTCCGGCT some C<-> G CAGTGTCCGACTGTCATTGATCATCCACGCTTGTTACCAGACGCGTCGAT ACTCGACATAAGGTTACAGCATGTGGAGTAATGCGGTCGCTAACTACGGG complementarGTGACACGCTGACAGTAAGTAGTAGGTGGCAACAATCGTGTCCCCAGCTA GCGTGGCGAGCTTAATCCCTGCTGCTCTGAGCAAGGAGGGCGTGTAGAAA compl+1errorGTGACACGCGGACAGTAAGTAGTAGGTGGCAACAATCGTGTCCCCAGCTA CAAGGTGACAGAGTATTGAGTGAATCTACAATGTTCGCAGTGCTTTGTCG compl+2errorsGTGACACGCTGACAGTAAGAAGTAGGTGGCAACAATCGTGTCCCCAGCTA GCGGTCGCCAATCGTCAAGGAAATGATAGGTCTGATTGGCGTGGCTTAAG compl+5errorsGTGACACGCTGACAGTAAGAAGTAGGTGGAAACAATCGTCTCCCCAGCTG GGCGCTAACGAATACTTCAAGGCCCGAAGGATTGGTGTTGATACTAGCCG 1 letter moreCACTGTGCGACTGTCATTCATCATCACACCGTTGTTAGCACAGGGGTCGAT CGTGACCAGATGTCCTTACTCTGAATGTTATGGTATTAAGTGAGGTAGTG 2 letters moreCACTGTGCGACTGTCATTCATCATCCACACCGTTGTTAGCACAGGGGTCGAT GCCCATGAACATACATTCATGACTGTTCAAGCGCACTGGACCACTCGTTC 3 letters moreCACTGTGCGACTGTCATTCATCATCCATCACCGTTGTTAGCACAGGGGTCGAT probe CACTGTGCGACTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT
  • 18. Detected values for I(P,S) 4 C become G and 5 G become C the same string C and G exchanged complementary 1 1 error complementary+1error 2 errors complementary+2errors C becomes G complementary+5errors 08 . 5 errors 1 letter more 06 . 2 letters more 3 letters more 04 . 02 .
  • 19. Conclusions:  Use Mutual information as a tool to capture strings that are correlated to a true TFBS used as a probe.  validate in vitro the candidates so obtained  This is more flexible than the use of Hamming or Levenshtein distance, since correlated strings could be very distant one another Drawbacks: 1. the method need a precise calibration of the threshold 2. Does not include gaps
  • 20. Random Projection Approach to Motif Finding
  • 21. daf-19 Binding Sites in C. elegans GTTGTCATGGTGAC GTTTCCATGGAAAC GCTACCATGGCAAC GTTACCATAGTAAC GTTTCCATGGTAAC che-2 daf-19 osm-1 osm-6 F02D8.3 -150 -1
  • 22. The (l,d) Planted Motif Problem
      - Generate a random consensus sequence C of length l.
      - Generate 20 instances, each differing from C by d random mutations.
      - Plant one instance at a random position in each of N = 20 random sequences of length n = 600.
      - Can you find the planted instances?
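A minimal sketch of how such a benchmark could be generated. The function names, the fixed seed and the choice of exactly d substituted positions per instance are assumptions made for illustration; the slide only fixes N = 20 and n = 600.

```python
import random

ALPHABET = "ACGT"

def mutate(consensus, d, rng):
    """Return a copy of consensus with exactly d positions changed to another letter."""
    chars = list(consensus)
    for pos in rng.sample(range(len(chars)), d):
        chars[pos] = rng.choice([c for c in ALPHABET if c != chars[pos]])
    return "".join(chars)

def plant_motif(l, d, N=20, n=600, seed=0):
    """Build N random sequences of length n, each holding one mutated copy of a random consensus."""
    rng = random.Random(seed)
    consensus = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences, positions = [], []
    for _ in range(N):
        background = [rng.choice(ALPHABET) for _ in range(n)]
        start = rng.randrange(n - l + 1)          # random planting position
        background[start:start + l] = mutate(consensus, d, rng)
        sequences.append("".join(background))
        positions.append(start)
    return consensus, sequences, positions

consensus, sequences, positions = plant_motif(l=15, d=4)
```

The returned positions are only there so that a motif finder's answer can be checked afterwards.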
  • 24. Random Projection Algorithm
      - Buhler and Tompa (2001).
      - Guiding principle: some instances of a motif agree on a subset of positions.
      - Use information from multiple motif instances to construct a model.
      Example, a (7,2) motif M = ATGCGTC with instances:
        x(1) ...ccATCCGACca...
        x(2) ...ttATGAGGCtc...
        x(5) ...ctATAAGTCgc...
        x(8) ...tcATGTGACac...
  • 25. k-Projections
      - Choose k positions in a string of length l.
      - Concatenate the nucleotides at the chosen k positions to form a k-tuple.
      - In l-dimensional Hamming space, this is a projection onto a k-dimensional subspace.
      Example: l = 15, k = 7, P = (2, 4, 5, 7, 11, 12, 13) maps ATGGCATTCAGATTC to TGCTGAT.
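The projection on this slide can be written in a few lines. This is an illustrative sketch; `project` and `random_projection` are hypothetical helper names, and positions are 1-based to match the slide.

```python
import random

def project(lmer, positions):
    """Concatenate the letters of lmer found at the given 1-based positions."""
    return "".join(lmer[p - 1] for p in positions)

def random_projection(l, k, rng=random):
    """Pick k distinct positions out of 1..l uniformly at random, in increasing order."""
    return tuple(sorted(rng.sample(range(1, l + 1), k)))

# The slide's example: l = 15, k = 7.
projected = project("ATGGCATTCAGATTC", (2, 4, 5, 7, 11, 12, 13))  # "TGCTGAT"
```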
  • 26. Random Projection Algorithm
      - Choose a projection by selecting k positions uniformly at random.
      - For each l-tuple in the input sequences, hash it into a bucket based on the letters at the k selected positions.
      - Recover the motif from a bucket containing multiple l-tuples.
      [Figure: input sequence x(i) = …TCAATGCACCTAT...; the l-tuple TGCACCT is hashed to bucket TGCT.]
  • 27. Example
      - l = 7 (motif size), k = 4 (projection size).
      - Choose projection (1, 2, 5, 7).
      - Input sequence: ...TAGACATCCGACTTGCCTTACTAC...
      - Buckets: ATCCGAC goes to bucket ATGC; GCCTTAC goes to bucket GCTC.
  • 28. Hashing and Buckets
      - The hash function h(x) is obtained from the k positions of the projection.
      - Buckets are labelled by values of h(x).
      - Enriched buckets contain at least s l-tuples, for some parameter s.
      [Figure: buckets labelled ATGC, GCTC, CATC, ATTC.]
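The hashing step can be sketched as follows, using the projection (1, 2, 5, 7) and the input fragment from the example slide. The function names and the choice to store (sequence index, start, l-mer) triples are assumptions for illustration.

```python
from collections import defaultdict

def hash_lmers(sequences, l, positions):
    """Hash every l-mer of every sequence into the bucket named by its projected letters."""
    buckets = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            key = "".join(lmer[p - 1] for p in positions)   # h(x), positions are 1-based
            buckets[key].append((seq_id, start, lmer))
    return buckets

def enriched(buckets, s):
    """Keep only the buckets that hold at least s l-mers."""
    return {key: items for key, items in buckets.items() if len(items) >= s}

buckets = hash_lmers(["TAGACATCCGACTTGCCTTACTAC"], l=7, positions=(1, 2, 5, 7))
```

On this fragment, ATCCGAC lands in bucket ATGC and GCCTTAC in bucket GCTC, as on the slide.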
  • 29. Frequency Matrix Model From Bucket
      Bucket ATGC contains ATCCGAC, ATGAGGC, ATAAGTC, ATGTGAC. Frequency matrix W (rows: nucleotides, columns: positions 1 to 7):
          A    1    0   .25   .5    0   .5    0
          C    0    0   .25  .25    0    0    1
          G    0    0   .5     0    1  .25    0
          T    0    1    0   .25    0  .25    0
      The EM algorithm turns the frequency matrix W into a refined matrix W*.
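As a small check of the matrix above, the column frequencies can be computed directly from the four l-mers in the bucket. `frequency_matrix` is a hypothetical helper name; no pseudocounts are added, which is an assumption consistent with the zeros shown on the slide.

```python
def frequency_matrix(lmers, alphabet="ACGT"):
    """Return {letter: [frequency at position 1, ..., frequency at position l]}."""
    l = len(lmers[0])
    n = len(lmers)
    return {a: [sum(x[j] == a for x in lmers) / n for j in range(l)]
            for a in alphabet}

W = frequency_matrix(["ATCCGAC", "ATGAGGC", "ATAAGTC", "ATGTGAC"])
for letter, row in W.items():
    print(letter, row)
```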
  • 30. Motif Refinement
      - How do we recover the motif from the sequences in the enriched buckets?
      - k nucleotides are known from the hash value of the bucket.
      - Use the information in the other l-k positions as the starting point for a local refinement scheme, e.g. EM or a Gibbs sampler.
      [Figure: bucket ATGC with ATCCGAC, ATGAGGC, ATAAGTC, ATGTGAC is passed to the local refinement algorithm, which returns the candidate motif ATGCGTC.]
  • 31. Expectation Maximization (EM)
      - S = {x(1), …, x(N)}: set of input sequences.
      - Given: W, an initial probabilistic motif model, and P0, the background probability distribution.
      - Find the value Wmax that maximizes the likelihood ratio $\Pr(S \mid W_{max}) / \Pr(S \mid P_0)$.
      - EM is a local optimization scheme; it requires a starting value W.
  • 32. EM Motif Refinement
      - For each bucket h containing at least s sequences, form a weight matrix Wh.
      - Use the EM algorithm with starting point Wh to obtain a refined weight matrix model Wh*.
      - For each input sequence x(i), return the l-tuple y(i) that maximizes the likelihood ratio Pr(y(i) | Wh*) / Pr(y(i) | P0).
      - T = {y(1), y(2), …, y(N)}
      - C(T) = consensus string of T.
  • 33. What Is the Best Motif?
      - Compute a score S for each motif:
      - Generate W, an initial PSSM, from the returned l-mers {y(1), y(2), …, y(N)}.
      - $\mathrm{Score} = \sum_{i} \log \frac{P(y^{(i)} \mid W)}{P(y^{(i)} \mid P_0)}$
      - Return the motif with the maximal score.
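A hedged sketch of this score. It assumes W is the plain position-wise frequency matrix of the returned l-mers (no pseudocounts) and, unless another distribution is passed in, a uniform background P0; both are illustrative choices, not something the slide specifies.

```python
import math

def llr_score(lmers, background=None, alphabet="ACGT"):
    """Sum over l-mers of log( P(y | W) / P(y | P0) ), with W built from the l-mers themselves."""
    if background is None:
        background = {a: 1.0 / len(alphabet) for a in alphabet}   # assumed uniform P0
    n, l = len(lmers), len(lmers[0])
    # W[j][a] = frequency of letter a at position j among the l-mers.
    W = [{a: sum(x[j] == a for x in lmers) / n for a in alphabet} for j in range(l)]
    return sum(math.log(W[j][y[j]] / background[y[j]])
               for y in lmers for j in range(l))

score = llr_score(["ATCCGAC", "ATGAGGC", "ATAAGTC", "ATGTGAC"])
```

Because W is built from the same l-mers, every letter that is scored has non-zero frequency, so no logarithm of zero arises in this sketch.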
  • 34. Iterations
      - Single iteration:
          - Choose a random k-projection.
          - Hash each l-mer x in the input sequences into the bucket labelled h(x).
          - From each bucket B with at least s sequences, form a weight matrix model and perform EM/Gibbs sampler refinement.
          - The candidate motif is the best one found from the refinement of all enriched buckets.
      - Multiple iterations:
          - Repeat the process for multiple projections.
  • 35. Parameter Selection
      - Projection size k:
          - Choose k small so that several motif instances hash to the same bucket (k < l - d).
          - Choose k large to avoid contamination by spurious l-mers: E > N(n - l + 1) / 4^k, where N(n - l + 1) / 4^k is the average number of input l-mers per bucket.
      - Bucket threshold s: s = 3 or s = 4.
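The two constraints on k can be combined into a small search. This is a sketch under assumptions: E is read as an upper bound on the average bucket load, and taking the smallest k that satisfies both constraints is an illustrative policy rather than something stated on the slide.

```python
def expected_bucket_load(N, n, l, k):
    """Average number of input l-mers per bucket when 4**k buckets are equally likely."""
    return N * (n - l + 1) / 4 ** k

def choose_k(l, d, N, n, E=1.0):
    """Smallest k with k < l - d whose average bucket load is below E, or None."""
    for k in range(1, l - d):
        if expected_bucket_load(N, n, l, k) < E:
            return k
    return None

k = choose_k(l=15, d=4, N=20, n=600)   # 20 * 586 / 4**7 is about 0.72, so k = 7
```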
  • 36. How Many Iterations?
      - Planted bucket: the bucket with hash value h(M), where M is the motif.
      - Choose m, the number of iterations, such that Pr(planted bucket contains ≥ s sequences in at least one of m iterations) ≥ 0.95.
      - The probability is readily computable since the iterations form a sequence of independent trials.
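One way to carry out this computation, as a sketch under stated assumptions: each of the t motif instances differs from M in exactly d positions, so it reaches the planted bucket when none of the k projected positions is mutated, with probability C(l-d, k) / C(l, k); instances are treated as independent, and background l-mers that happen to fall in the planted bucket are ignored. The function names are hypothetical.

```python
from math import ceil, comb, log

def hit_probability(l, d, k):
    """Chance that a random k-projection avoids all d mutated positions of one instance."""
    return comb(l - d, k) / comb(l, k)

def miss_probability(t, p, s):
    """Chance that fewer than s of t independent instances reach the planted bucket."""
    return sum(comb(t, i) * p ** i * (1 - p) ** (t - i) for i in range(s))

def iterations_needed(l, d, k, t, s, q=0.95):
    """Smallest m with 1 - miss**m >= q, treating iterations as independent trials."""
    miss = miss_probability(t, hit_probability(l, d, k), s)
    if miss >= 1.0:
        raise ValueError("the planted bucket can never be enriched with these parameters")
    if miss <= 0.0:
        return 1
    return max(1, ceil(log(1 - q) / log(miss)))

m = iterations_needed(l=15, d=4, k=7, t=20, s=4)
```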
  • 37. Composite motif detection: Question, Monad detection, MITRA
  • 38. Monad patterns
      - Short contiguous strings.
      - They appear surprisingly many times (in a statistically significant way).
      P = AGTC in the sample S (occurrences of P are highlighted on the slide, including those that span a line break):
        AGTCTTGCTAGTCCGTAATATCCGGATAGAATAATGATC
        GTAGCATCGTACGTAGCTATCGATCTGAAGCTAGCAGC
        AAGATGTACTAGAGTCACGTAGCTAGTCATCTATACGAG
        TCGATGTAGTAGCTATCGATCGTAGCTAGAGTCCGTAGC
        AGCTAGTATCGTAGTGAGCAACATGAGTCCAGTGCATA
        GTCAGCTCATGAGTCGCATAGTC
  • 39. Introduction
      - However, many of the actual regulatory signals are composite patterns:
          - groups of monad patterns
          - that occur relatively near each other.
      - An example of a composite pattern is a dyad signal.
  • 40. Composite Pattern  S=ACGTAAATCACGTTGACTAGCTAGCACGAG CTAGCATAATCACACTTTGACGAGTCGACTGC ATGCATTGACGCAGTGCATTGCTAGCATGGG TAATCAAACGTTGGCTAGCTAGCATGCATCTG AGCATGCTAGCTACGTACTAGCGCGATAGTC TACTACAAATCACCCATTGCGAGCTACGTAG CTAGCTAGCTAGCTAGCTAGTGATGCATGCTA GAATCCGATCTTGCGATCGAT CP = AATCxxxxTTG
  • 41. Introduction
      - A possible approach is to find each part of the pattern separately and then reconstruct the composite pattern.
      - However, such approaches often fail to output composite regulatory patterns that consist of weak monad parts.
  • 42. Introduction
      - A better approach would be to detect both parts of a composite pattern at the same time.
      - Two steps in the proposed algorithm:
          - Preprocess the sample to create a set of "virtual" monads.
          - Apply an exhaustive monad discovery algorithm to the "virtual" monad problem.
      - By preprocessing, the original problem is transformed into a larger monad discovery problem.
  • 43. Monad Pattern Discovery
      - Canonical pattern: an l-mer, a contiguous string of length l (for example the 3-mer ACA).
      - (l,d)-neighbourhood of an l-mer P: all possible l-mers with up to d mismatches as compared to P.
      - The number of such l-mers is $\sum_{i=0}^{d} \binom{l}{i} 3^{i}$.
      - (l,d)-k patterns: given a sequence S, find all l-mers that occur with up to d mismatches at least k times in the sample.
      - A variant: the sample is split into several sequences, and the task is to find all l-mers that occur with up to d mismatches in at least k sequences.
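The neighbourhood count can be checked against an explicit enumeration. This is an illustrative sketch; the helper names are hypothetical. Each term of the sum chooses i mismatching positions and one of the 3 other nucleotides at each of them.

```python
from itertools import combinations, product
from math import comb

def neighbourhood_size(l, d):
    """Number of l-mers within d mismatches of a given l-mer over a 4-letter alphabet."""
    return sum(comb(l, i) * 3 ** i for i in range(d + 1))

def neighbourhood(lmer, d, alphabet="ACGT"):
    """Enumerate all l-mers with at most d mismatches to lmer."""
    result = set()
    for i in range(d + 1):
        for positions in combinations(range(len(lmer)), i):
            options = [[a for a in alphabet if a != lmer[p]] for p in positions]
            for letters in product(*options):
                chars = list(lmer)
                for p, a in zip(positions, letters):
                    chars[p] = a
                result.add("".join(chars))
    return result

near_aca = neighbourhood("ACA", 1)   # 1 + 3 * 3 = 10 strings
```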
  • 44. Pattern Driven Approach (PDA)
      - (Pevzner, 2000)
          - Examines all 4^l patterns of fixed length l in lexical order, compares each pattern to every l-mer in the sample, and returns all (l,d)-k patterns.
      - (Waterman et al., 1984 and Galas et al., 1985)
          - Bypasses the excessive time requirement.
          - Most of the 4^l patterns are not worth examining, since neither these patterns nor their neighbours appear in the sample.
          - SDA was therefore designed to explore only the l-mers appearing in the sample and their neighbours.
  • 45. Sample Driven Approach (SDA)
      - SDA first initializes a table of size 4^l.
      - Each table entry corresponds to a pattern.
      - For each l-mer in the sample, SDA generates its (l,d)-neighbourhood, and the table entry of every neighbour is incremented by a certain amount.
      - After all l-mers have been processed, SDA returns all patterns whose table entries have scores exceeding the threshold.
      [Figure: a table with 4^l entries, e.g. AAAAA 3, AAAAC 1, AAACC 2, …]
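A minimal sketch of the sample-driven idea. Two simplifications are assumptions of this example: a dictionary-backed counter stands in for the full table of 4^l entries, and every increment is 1, so an entry counts how many sample l-mers have the pattern in their neighbourhood.

```python
from collections import Counter
from itertools import combinations, product

def neighbours(lmer, d, alphabet="ACGT"):
    """All l-mers with at most d mismatches to lmer."""
    out = set()
    for i in range(d + 1):
        for positions in combinations(range(len(lmer)), i):
            options = [[a for a in alphabet if a != lmer[p]] for p in positions]
            for letters in product(*options):
                chars = list(lmer)
                for p, a in zip(positions, letters):
                    chars[p] = a
                out.add("".join(chars))
    return out

def sda(sequence, l, d, k):
    """Visit each sample l-mer once, bump the entries of its neighbours, keep entries >= k."""
    table = Counter()
    for start in range(len(sequence) - l + 1):
        for pattern in neighbours(sequence[start:start + l], d):
            table[pattern] += 1
    return {pattern: count for pattern, count in table.items() if count >= k}

hits = sda("AGTATCAGTT", l=3, d=1, k=2)
```

On the sequence used later in the deck (AGTATCAGTT, l = 3, d = 1, k = 2), GTC is kept with a count of 3 and CAG is dropped with a count of 1.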
  • 46. Sample Driven Approach (SDA)
      - Faster, but it still requires a large table of 4^l entries.
      - Not practical for long patterns in the mid-1980s.
      - It did not become mainstream, and no tool is available.
      - (Today gigabytes of RAM are available, so l can be increased even without a memory-efficient algorithm.)
  • 47. SDA Iterations
      - First, explore all neighbours of the first l-mer in the sample.
      - Second, explore all neighbours of the second l-mer, and so on.
      - If an l-mer P belongs to the neighbourhoods of the l-mers appearing at positions i1, …, ik in the sample, information about P is collected at iterations i1, …, ik.
      - So the Waterman approach updates the information about P k times.
      - The memory slot for P stays occupied for the whole run, even if P is not an "interesting" l-mer.
      - Most of the l-mers explored are not interesting, so their memory slots are wasted.
  • 48. To improve SDA
      - Better solution:
          - Collect the information about each P all at once,
          - which removes the need to keep that information in memory,
          - but requires a new approach to navigating the space of all l-mers.
      - MITRA runs faster than PDA and SDA, and uses only a fraction of the memory of SDA.
  • 49. Pattern-finding vs. profile-based
      - Is the profile-based approach more biologically relevant for finding motifs in biological samples?
      - That belief is probably why the Waterman algorithm was not popular in the last decade.
      - Sagot and colleagues were the first to rebut this opinion.
      - They developed an efficient version of Waterman's algorithm.
  • 50. Pattern-based vs. profile-based
      - Similarities:
          - A pattern-based method also generates a profile.
          - Every profile of length l corresponds to a pattern of length l formed by the most frequent nucleotide at each position.
      - Pattern-driven methods are at least as good as profile-based ones,
      - and even better on simulated samples with implanted patterns,
      - though the profile-implantation model is somewhat limited.
      - Today there is little evidence that profile-based methods perform any better on either biological or simulated samples.
  • 52. Mismatch Tree Algorithm (MITRA)
      - MITRA uses a mismatch tree data structure to split the space of all possible patterns into disjoint subspaces, each made of the patterns that start with a given prefix.
      - This reduces pattern discovery to smaller sub-problems.
      - MITRA also takes advantage of pairwise similarity between instances.
  • 53. Splitting Pattern Space
      - A pattern is called weak if it has fewer than k (l,d)-neighbours in the sample.
      - A subspace is called weak if all patterns in this subspace are weak.
  • 54. Splitting pattern space
      - A pattern is called weak if it has fewer than k (l,d)-neighbours in the sample.
      Example: sequence = AGTATCAGTT, l = 3, d = 1, k = 2.
        sample = { AGT, GTA, TAT, ATC, TCA, CAG, AGT, GTT }
        P = GTC: (l,d)-neighbours in the sample = { GTA, ATC, GTT }, so P is not weak.
  • 55. Splitting pattern space
      - A pattern is called weak if it has fewer than k (l,d)-neighbours in the sample.
      Example: sequence = AGTATCAGTT, l = 3, d = 1, k = 2.
        sample = { AGT, GTA, TAT, ATC, TCA, CAG, AGT, GTT }
        P = CAG: (l,d)-neighbours in the sample = { CAG }, so P is weak.
  • 56. Splitting pattern space
      - A subspace is called weak if all patterns in this subspace are weak.
      Sequence = AGTATCAGTT
        subspace A  = { AAA, AAT, AAC, AAG, …, AGG }
        subspace T  = { TAA, TAT, TAC, TAG, …, TGG }
        subspace C  = { CAA, CAT, CAC, CAG, …, CGG }
        subspace GG = { GGA, GGT, GGC, GGG }
  • 57. Question
      - Input: S, l, d, k.
      - Output: all l-mers that occur with up to d mismatches at least k times in the sample.
  • 58. Solution
      - Naïve:
          - Test every l-mer in the space.
          - If it occurs with up to d mismatches at least k times in the sample, then output this l-mer.
      space = { AAA, AAT, AAC, AAG, …, AGG, TAA, TAT, TAC, TAG, …, TGG, CAA, CAT, CAC, CAG, …, CGG, GAA, GAT, GAC, GAG, …, GGG }
      sample = { AGT, GTA, TAT, ATC, TCA, CAG, AGT, GTT }
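The naive solution fits in a few lines and reproduces the two weak/not-weak examples from the earlier slides. The function names are illustrative assumptions.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def naive_search(sequence, l, d, k, alphabet="ACGT"):
    """Test all 4**l patterns; keep those with at least k sample l-mers within d mismatches."""
    sample = [sequence[i:i + l] for i in range(len(sequence) - l + 1)]
    result = {}
    for letters in product(alphabet, repeat=l):
        pattern = "".join(letters)
        instances = [x for x in sample if hamming(pattern, x) <= d]
        if len(instances) >= k:          # the pattern is not weak
            result[pattern] = instances
    return result

patterns = naive_search("AGTATCAGTT", l=3, d=1, k=2)
```

The cost grows as 4^l times the sample size, which is what the subspace splitting on the following slides is meant to avoid.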
  • 59. Splitting pattern space
      - If we are looking for patterns of length l, we first split the space of all l-mers into 4 disjoint subspaces:
          - the subspace of all l-mers starting with A,
          - the subspace of all l-mers starting with T,
          - the subspace of all l-mers starting with C,
          - the subspace of all l-mers starting with G.
  • 60. Splitting pattern space
      - If we are looking for patterns of length l, we first split the space of all l-mers into 4 disjoint subspaces.
      [Figure: the space split into A*, T*, C* and G*, with subspace A marked.]
  • 61. Splitting pattern space
      - We then determine whether the subspace contains an (l,d)-k pattern.
      [Figure: subspace A*, labelled "can't rule out".]
  • 62. Splitting pattern space
      - We then determine whether the subspace contains an (l,d)-k pattern.
      [Figure: A* split into AA*, AC*, AT* and AG*, with the label "can rule out".]
  • 63. Splitting pattern space
      - We then determine whether the subspace contains an (l,d)-k pattern.
      [Figure: a subspace labelled "can't rule out".]
  • 64. Splitting pattern space
      - We then determine whether the subspace contains an (l,d)-k pattern.
      - If we can rule out that this subspace contains such a pattern:
          - we stop searching in this subspace;
          - we release the memory slot.
      - If we can't rule out that this subspace contains such a pattern:
          - we split this subspace again on the next symbol;
          - and repeat.
  • 65. Mismatch tree data structure
      - A mismatch tree is a rooted tree in which each internal node has 4 branches, each labelled with a symbol in {A, C, T, G}.
      - The maximum depth of the tree is l.
      - Each node in the mismatch tree corresponds to the subspace of patterns P with a fixed prefix.
      - Each node contains pointers to all l-mer instances from the sample that are within d mismatches of a pattern P in that subspace.
  • 66. Mismatch tree data structure
      - MITRA starts by examining the root node of the mismatch tree, which corresponds to the space of all patterns.
      - When examining a node, MITRA tries to prove that it corresponds to a weak subspace.
      - If we can't prove it, we expand the node's children and examine each of them.
      - Whenever we reach a node corresponding to a weak subspace, we backtrack.
      - The intuition is that many of the nodes correspond to weak subspaces and can be ruled out.
      - This allows us to avoid searching much of the pattern space.
  • 67. Mismatch tree data structure
      - If we reach depth l and the number of instances is not less than k:
          - the l-mer corresponding to the path from the root to the leaf is an (l,d)-k pattern;
          - the pointers from this node correspond to the instances of this pattern.
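The traversal described on these slides can be sketched as a depth-first search. This is a simplified illustration, not the authors' implementation: instances are stored as (l-mer, mismatches so far) pairs instead of pointers, and a node is ruled out only by the basic test that fewer than k instances survive.

```python
def mismatch_tree_search(sequence, l, d, k, alphabet="ACGT"):
    """Return {pattern: instances} for all (l,d)-k patterns found by the depth-first search."""
    sample = [sequence[i:i + l] for i in range(len(sequence) - l + 1)]
    found = {}

    def examine(prefix, alive):
        # alive holds (lmer, mismatches between the node's prefix and the l-mer's prefix).
        if len(alive) < k:
            return                        # weak subspace: backtrack
        depth = len(prefix)
        if depth == l:
            found[prefix] = [lmer for lmer, _ in alive]
            return
        for symbol in alphabet:           # split the subspace on the next symbol
            child = []
            for lmer, mismatches in alive:
                updated = mismatches + (lmer[depth] != symbol)
                if updated <= d:
                    child.append((lmer, updated))
            examine(prefix + symbol, child)

    examine("", [(lmer, 0) for lmer in sample])
    return found

found = mismatch_tree_search("AGTATCAGTT", l=4, d=1, k=2)
```

For S = AGTATCAGTT with l = 4, d = 1, k = 2, the result contains the four AGT* patterns listed in the deck's worked example; because the sketch explores every branch of the tree, it can also report patterns from other branches.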
  • 68. Example (slide note: not for exam)
      - Consider a very simple example: find the patterns of length 4 that occur with up to 1 mismatch at least 2 times in the sample S = "AGTATCAGTT".
      - The substrings (4-mers) in S are { AGTA, GTAT, TATC, ATCA, TCAG, CAGT, AGTT }.
  • 69. [Figure: mismatch tree for S = AGTATCAGTT with l = 4 and d = 1. Each node lists the surviving 4-mers with their mismatch counts against the node's prefix and the number k of surviving instances; k = 7 at the root and it decreases along the branches, down to k = 0 at pruned nodes.]
  • 70. [Figure: the path A, G, T of the same mismatch tree, with k = 7 at the root, then k = 3, then k = 2 at the leaves.] Output: AGTA, AGTC, AGTG, AGTT.
  • 71. Overall complexity
      - Space = O(l^2 × |S|)
      - Time = O(4^l × |S|)
      - Number of nodes = O(4^l); number of comparisons in each node = O(|S|).
      [Figure: a root-to-leaf path of depth l in the mismatch tree, annotated with O(|S|) per node.]
  • 72. Take a Closer Look
      - In the mismatch tree algorithm, we cannot start ruling out a node until we have traversed to depth d + 1.
      [Figure: the root node (k = 7) and the node for prefix A (k = 5) of the example tree.]
  • 73. MITRA Graph
      - Information about pairwise similarities between instances of the pattern can significantly speed up the sample-driven approach.
      - The graph constructed to model this pairwise similarity is called the MITRA-Graph.
  • 74. MITRA Graph
      - Given a pattern P and a sample S, we can construct a graph G(P, S) in which each vertex is an l-mer in the sample and an edge connects two l-mers if P is within d mismatches of both.
      [Figure: S = TAACA, P = TAC; vertices AAC (d=1), TAA (d=1), ACA (d=3).]
  • 75. MITRA Graph
      - For an (l,d)-k pattern P, the corresponding graph contains a clique of size k.
      [Figure: S = TAACA, P = AAA; vertices AAC, TAA, ACA, with the labels (d=1), (d=1), (d=3) as printed on the slide.]
  • 76. MITRA Graph
      - Given a set of patterns 𝒫 and a sample S, define a graph G(𝒫, S) whose edge set is the union of the edge sets of the graphs G(P, S) for P ∈ 𝒫.
      - Each vertex of G(𝒫, S) is an l-mer in the sample, and an edge connects two l-mers if there is a pattern P ∈ 𝒫 that is within d mismatches of both.
      - If, for a subspace of patterns, we can rule out the existence of a clique of size k, then the subspace contains no (l,d)-k pattern.
  • 77. The WINNOWER Algorithm
      - The WINNOWER algorithm by Pevzner and Sze (2000) constructs the following graph: each l-mer in the sample is a vertex, and an edge connects two vertices if the corresponding l-mers have at most 2d mismatches.
      - The instances of an (l,d)-k pattern form a clique of size k in this graph.
  • 78. The WINNOWER Algorithm (cont'd)
      - Since cliques are difficult to find, WINNOWER takes the approach of removing edges that do not belong to a clique.
      [Figure: example graph with k = 4.]
  • 79. Improvements by MITRA-Graph
      1. Construct a graph at each node in the mismatch tree.
      [Figure: the node for prefix A with its instances and mismatch counts.]
  • 80. Improvements by MITRA-Graph
      2. Remove edges which are not part of a clique.
  • 81. Improvements by MITRA-Graph
      3. If no potential clique remains, rule out the subspace corresponding to the node and backtrack.
  • 82. Improvements by MITRA-Graph
      4. If we cannot rule out a clique, split the subspace of patterns and examine the child nodes.
  • 83. MISMATCH TREE ALGORITHM: Improvements over WINNOWER
      - At each node of the tree, we remove edges by computing the degree of each vertex.
      - If the degree of a vertex is less than k-1, we can remove all edges incident to it, since we know it is not part of a clique of size k.
      - We repeat this procedure until we cannot remove any more edges.
      - If the number of edges remaining is less than the minimum number of edges in a clique of size k, which is k(k-1)/2, we can rule out the existence of a clique and backtrack.
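The degree-based pruning can be sketched on a plain edge list. The graph representation and the function name are assumptions for illustration; the threshold k(k-1)/2 is the number of edges in a clique on k vertices.

```python
def prune_for_clique(edges, k):
    """Repeatedly drop vertices of degree < k-1; return the surviving edges,
    or an empty set if too few edges remain to hold a clique of size k."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        for vertex in list(adjacency):
            if len(adjacency[vertex]) < k - 1:
                # Remove the vertex together with all edges incident to it.
                for neighbour in adjacency.pop(vertex):
                    adjacency[neighbour].discard(vertex)
                changed = True
    remaining = {frozenset((u, v)) for u in adjacency for v in adjacency[u]}
    if len(remaining) < k * (k - 1) // 2:
        return set()                      # no clique of size k can exist: backtrack
    return remaining

# A triangle a-b-c with an extra pendant vertex d attached to a.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")]
kept_for_3 = prune_for_clique(edges, 3)
kept_for_4 = prune_for_clique(edges, 4)
```

Surviving the pruning does not prove that a clique exists; it only means the subspace cannot be ruled out by this test.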
  • 84. MISMATCH TREE ALGORITHM: Improvements over WINNOWER
      - The problem with this approach is how to construct the graph efficiently at each node in the mismatch tree.
      - Instead of constructing the graph from scratch, we construct it from the graph at the parent node.
      - Consider an edge connecting two l-mers:
          - the prefix of the first l-mer matches the prefix of the pattern subspace with d1 mismatches;
          - the prefix of the second l-mer matches it with d2 mismatches.
  • 85. MISMATCH TREE ALGORITHM: Improvements over WINNOWER
      - Denote by m the number of mismatches between the tails of the first and the second l-mers.
      - The edge between these two l-mers exists in the pattern subspace if and only if d1 <= d, d2 <= d and d1 + d2 + m <= 2d.
      [Figure: the prefix of the pattern subspace aligned with the first l-mer and the second l-mer.]
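The edge condition can be written as a direct test. In this sketch d1, d2 and m are recomputed from scratch for clarity, whereas the slides describe tracking them incrementally down the tree; the function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def edge_exists(lmer1, lmer2, prefix, d):
    """Edge test at the tree node whose pattern subspace has the given prefix."""
    depth = len(prefix)
    d1 = hamming(lmer1[:depth], prefix)          # first l-mer's prefix vs subspace prefix
    d2 = hamming(lmer2[:depth], prefix)          # second l-mer's prefix vs subspace prefix
    m = hamming(lmer1[depth:], lmer2[depth:])    # mismatches between the two tails
    return d1 <= d and d2 <= d and d1 + d2 + m <= 2 * d

at_root = edge_exists("AGTA", "ATCA", "", 1)      # d1 = d2 = 0, m = 2
at_ag = edge_exists("AGTA", "ATCA", "AG", 1)      # d1 = 0, d2 = 1, m = 1
at_agt = edge_exists("AGTA", "ATCA", "AGT", 1)    # d2 = 2 exceeds d
```

With an empty prefix the test reduces to m <= 2d, and it can only become stricter as the prefix grows.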
  • 86. MISMATCH TREE ALGORITHM: Improvements over WINNOWER (cont'd)
      - At the root node d1 = d2 = 0, so an edge exists if and only if m <= 2d, which gives the same graph as WINNOWER.
      - Moving down the tree, the condition becomes much stronger than WINNOWER's.
      - We can compute the edges of a node from the edges of its parent by keeping track of the quantities d1, d2 and m for each edge.
  • 87. MISMATCH TREE ALGORITHM: Improvements over WINNOWER
      - To summarize, the MITRA-Graph algorithm works as follows:
      - We first compute the set of edges at the root node by performing pairwise comparisons between all l-mers (at the root, d1 = d2 = 0).
      - We traverse the tree in depth-first order, passing on the valid edges and keeping track of the quantities d1, d2 and m for each of them.
      - At each node, we prune the graph by eliminating any edges incident to vertices that have degree less than k-1.
      - If fewer edges remain than the minimum number of edges for a clique, we backtrack.
      - If we reach a leaf of the tree (depth l), we output the corresponding pattern.
  • 89. DISCOVERING DYAD SIGNALS
      - For dyad signals, we are interested in discovering two monads that occur a certain length apart.
      - We use the notation (l1-(s1,s2)-l2,d)-k pattern to denote a dyad signal.
      [Figure: three instances, each drawn as an l1 block, a spacer s and an l2 block.]
  • 90. DISCOVERING DYAD SIGNALS
      - The MITRA-Dyad algorithm casts the dyad discovery problem into a monad discovery problem by preprocessing the input and creating a "virtual" sample, then solving the (l1+l2,d)-k monad pattern discovery problem in this sample.
      - For each l1-mer in the sample and for each s in [s1,s2], we create an (l1+l2)-mer, which is the l1-mer concatenated with the l2-mer upstream s nucleotides of the l1-mer.
  • 91. DISCOVERING DYAD SIGNALS
      - The number of elements in the "virtual" sample will be approximately (s2-s1+1) times larger than in the original sample.
      - An (l1+l2,d)-k pattern in the "virtual" sample corresponds to an (l1-(s1,s2)-l2,d)-k pattern in the original sample, and we can easily map the solution of the monad problem back to the dyad one.
      - An important feature of MITRA-Dyad is its ability to search for long patterns.
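A sketch of the preprocessing step that builds the "virtual" sample. The orientation is an assumption: here the l2-mer is taken s nucleotides after the end of the l1-mer, following the l1, s, l2 layout drawn in the dyad figure; the offsets would need to be flipped for the opposite orientation. The function name and the toy sequence are hypothetical.

```python
def virtual_sample(sequence, l1, l2, s1, s2):
    """Return (start, spacer, virtual (l1+l2)-mer) triples for every l1-mer and every spacer in [s1, s2]."""
    virtual = []
    for start in range(len(sequence) - l1 + 1):
        first = sequence[start:start + l1]
        for s in range(s1, s2 + 1):
            second_start = start + l1 + s
            if second_start + l2 <= len(sequence):       # the l2-mer must fit in the sequence
                second = sequence[second_start:second_start + l2]
                virtual.append((start, s, first + second))
    return virtual

# Toy sequence holding the dyad AATC, a 4-letter spacer, then TTG.
toy = "AATC" + "GGGG" + "TTG"
virtual = virtual_sample(toy, l1=4, l2=3, s1=3, s2=5)
```

Keeping the start position and spacer with each virtual l-mer is what allows a monad found in the virtual sample to be mapped back to a dyad in the original sequence.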
  • 92. DISCOVERING DYAD SIGNALS
      - If the range s2-s1+1 of acceptable distances between the monad parts of a composite pattern is large, the MITRA-Dyad algorithm becomes inefficient.
      - A simple approach to detecting these patterns is to generate a long ranked list of candidate monad patterns using MITRA,
      - and then check each occurrence of each pair from the list to see whether the two monads occur within the acceptable distance.