Computational Biology, Part 6
Sequence Database Searching

    PUSHPENDRA TRIPATHI
Sequence Analysis Tasks

⇒ Given a query sequence, search for similar
 sequences in a database

 Global or Local?

 Both local and global alignment methods may be
 applied to database scanning, but local alignment
 methods are more useful since they do not make
 the assumption that the query protein and database
 sequence are of similar length.
Efficient database searching
methods
 Dynamic programming requires order N²L
 computations (where N is size of the query
 sequence and L is the size of the database)
 Given size of databases, more efficient
 methods needed
“Hit and extend” sequence
searching
 Problem: Too many calculations “wasted”
 by comparing regions that have nothing in
 common
 Initial insight: Regions that are similar
 between two sequences are likely to share
 short stretches that are identical
 Basic method: Look for similar regions only
 near short stretches that match exactly
“Hit and extend” sequence
searching
 We define a word (or k-tuple) size that is
 the minimum number of exact “letter”
 matches that must occur before we do any
 further comparison or alignment
 How do we find all of the occurrences of
 matching words between a sequence and a
 database?
   Could scan sequence a word at a time, but this
   is order L (size of database)
Word searching - hashing
 Solution: Use a precomputed table that lists
 where in the database each possible word
 occurs
   Generation of the table is of order L (size of
   database) but use of the table is of order N (size
   of query sequence)
 The computer science term for this
 approach is hashing
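
 The precomputed word table can be sketched as follows. This is a minimal illustration, not the code of any particular search program: the function names build_word_index and find_hits are hypothetical, and Python's dict (itself a hash table) stands in for the table. Building the index touches every database position once; looking up the query touches every query position once.

```python
from collections import defaultdict

def build_word_index(database, k):
    """Map each k-letter word to the 1-based positions where it starts."""
    index = defaultdict(list)
    for i in range(len(database) - k + 1):
        index[database[i:i + k]].append(i + 1)
    return index

def find_hits(query, index, k):
    """Return (query_pos, database_pos) pairs for every shared word."""
    hits = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], []):
            hits.append((j + 1, i))
    return hits

# Illustrative sequences (assumed for this sketch)
index = build_word_index("acgttttaaaacgacacc", 3)
hits = find_hits("aacg", index, 3)
print(hits)
```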
Hashing
 Hashing
   Hash table of size 10
   Hash function H(x) = x mod 10
   Applet:
 http://www.engin.umd.umich.edu/CIS/course.des/cis
   Insertion & Search
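
 A small sketch of insertion and search with H(x) = x mod 10. The slide does not say how collisions are handled; storing colliding keys in a per-slot list (chaining) is an assumption made here, and the keys are arbitrary examples.

```python
TABLE_SIZE = 10

def h(x):
    """Hash function from the slide: H(x) = x mod 10."""
    return x % TABLE_SIZE

def insert(table, x):
    """Add x to the bucket selected by its hash value (chaining, assumed)."""
    bucket = table[h(x)]
    if x not in bucket:
        bucket.append(x)

def search(table, x):
    """Only the one bucket that x hashes to needs to be examined."""
    return x in table[h(x)]

table = [[] for _ in range(TABLE_SIZE)]
for key in (12, 25, 32, 47):
    insert(table, key)

print(table[2])           # keys 12 and 32 collide in slot 2
print(search(table, 32))
print(search(table, 22))
```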
Hashing (Demonstration A10)
Demonstration: Hashing algorithm for sequence searching
      Author: R.F. Murphy, Feb. 6, 1995 (revised Feb. 15, 1996)
 This demonstration takes a piece of database sequence, calculates hash values for each
 ktuple, builds a hash table (listing the positions in the database of the occurrence of each
 hash value), and uses a simplified version of the hash table to find the positions in the
 database sequence of the first occurrence of each ktuple in a query sequence.

 This section converts each base to a number from 0 to 3 and combines those numbers three
 at a time to form an integer from 0 to 63 that is unique for each three base sequence.
 Each three base sequence is called a "ktuple."

 database sequence
    i   seq(i) as char   seq(i) as int   hash value
    1         a                0              6
    2         c                1             27
    3         g                2             47
    4         t                3             63
    5         t                3             63
    6         t                3             60
    7         t                3             48
    8         a                0              0
    9         a                0              0
   10         a                0              1
   11         a                0              6
   12         c                1             24
   13         g                2             33
   14         a                0              4
   15         c                1             17
   16         a                0              5
   17         c                1
   18         c                1

 hash table for the database sequence
   hash value   pos1 pos2 pos3   positions in database   first hit
        0         a    a    a          8, 9                  8
        1         a    a    c          10                    10
        2         a    a    g                                not found
        3         a    a    t                                not found
        4         a    c    a          14                    14
        5         a    c    c          16                    16
        6         a    c    g          1, 11                 1
        7         a    c    t                                not found
        8         a    g    a                                not found
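
 The hash values in the demonstration can be reproduced with a short sketch. It assumes the encoding described above (a=0, c=1, g=2, t=3, three bases read as a base-4 number) and the database sequence acgttttaaaacgacacc as read from the table; the function names are hypothetical.

```python
BASE_CODE = {"a": 0, "c": 1, "g": 2, "t": 3}

def ktuple_hashes(seq, k=3):
    """Hash value of the ktuple starting at each position (a base-4 number)."""
    values = []
    for i in range(len(seq) - k + 1):
        h = 0
        for base in seq[i:i + k]:
            h = h * 4 + BASE_CODE[base]
        values.append(h)
    return values

def first_hit_table(seq, k=3):
    """Simplified hash table: hash value -> 1-based position of first occurrence."""
    first = {}
    for pos, h in enumerate(ktuple_hashes(seq, k), start=1):
        first.setdefault(h, pos)
    return first

database = "acgttttaaaacgacacc"
hashes = ktuple_hashes(database)
first = first_hit_table(database)
print(hashes)
print(first.get(6), first.get(2, "not found"))
```

 The last two positions of the sequence have no hash value because no complete ktuple starts there.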
FASTA
Heavily used for searching databases until
advent of BLAST (see below)
Inputs
  k (word or k-tuple) size
  similarity matrix
Compares query sequence pairwise with
each sequence in the database
FASTA method
 The initial step in the algorithm is to
 identify all exact matches of length k
 (k-tuples) or greater between the two
 sequences.
FASTA method
1. Find diagonals (paired pieces from each
  sequence without gaps) that have the
  highest density of common words
2. Rescore these using a scoring (similarity)
  matrix and trim ends that do not contribute
  to the highest score
    Result: partial alignments without gaps
    Reported as the “init1” score
FASTA method
3. Join regions together, including penalties
  for gaps
    Result: unoptimized alignment with gaps
    Reported as the “initn” score
4. Use dynamic programming in a band 32
  residues wide around the best “initn” score
    Result: optimized alignment with gaps
    Reported as the “opt” score
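
 Step 1 above (finding the diagonals with the highest density of common words) can be sketched as follows. This is an illustration of the idea only, not FASTA's actual code: the function name and the example sequences are assumptions, and a diagonal is identified here by the offset j - i between subject and query positions.

```python
from collections import Counter, defaultdict

def diagonal_hit_counts(query, subject, k):
    """Count shared k-tuples on each diagonal (subject index minus query index)."""
    positions = defaultdict(list)
    for j in range(len(subject) - k + 1):
        positions[subject[j:j + k]].append(j)
    counts = Counter()
    for i in range(len(query) - k + 1):
        for j in positions.get(query[i:i + k], []):
            counts[j - i] += 1
    return counts

counts = diagonal_hit_counts("acgacg", "ttacgacgtt", 2)
best_diagonal, best_count = counts.most_common(1)[0]
print(best_diagonal, best_count)
```

 The diagonals with the most hits are the candidates that would then be rescored with the similarity matrix in step 2.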
Comments on FASTA
 Larger k-tuple increases speed since fewer
 “hits” are found but it also decreases
 sensitivity for finding similar but not
 identical sequences since exact matches of
 this length are required
Limitations of FASTA
 FASTA can miss significant similarity since
   For proteins, similar sequences do not have to
   share identical residues
      Asp-Lys-Val is quite similar to Glu-Arg-Ile yet
     it is missed even with k-tuple size of 1 since no
     amino acid matches
     Gly-Asp-Gly-Lys-Gly is quite similar to
     Gly-Glu-Gly-Arg-Gly but there is only a match
     with a k-tuple size of 1
Limitations of FASTA
 FASTA can miss significant similarity since
   For nucleic acids, due to codon “wobble”, DNA
   sequences may look like XXyXXyXXy where
   X’s are conserved and y’s are not
      GGuUCuACgAAg and GGcUCcACaAAa
     both code for the same peptide sequence
     (Gly-Ser-Thr-Lys) but they don’t match with k-tuple size of 3
     or higher
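
 The examples on the two "Limitations of FASTA" slides can be checked with a few lines. This is a sketch: shared_words is a hypothetical helper, and the peptides are written in one-letter code (Asp=D, Lys=K, Val=V, Glu=E, Arg=R, Ile=I, Gly=G).

```python
def shared_words(a, b, k):
    """Set of k-letter words that occur in both sequences."""
    words_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    words_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return words_a & words_b

# Protein examples
print(shared_words("DKV", "ERI", 1))        # no identical residue at all
print(shared_words("GDGKG", "GEGRG", 1))    # only Gly is shared
print(shared_words("GDGKG", "GEGRG", 2))    # nothing at k-tuple size 2

# Nucleic acid example: same peptide, different wobble bases
rna1 = "GGUUCUACGAAG"
rna2 = "GGCUCCACAAAA"
print(shared_words(rna1, rna2, 3))
print(sorted(shared_words(rna1, rna2, 2)))
```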
BLAST (Basic Local Alignment
Search Tool)
 Goal: find sequences from database similar
 to query sequence
 Previous tools use either
   direct, theoretically sound but computationally
   slow approach to examine all possible
   alignments of query with database (dynamic
   programming)
   indirect, heuristic but computationally fast
   approach to find similar sequences by first
   finding identical stretches (FASTP, FASTA)
BLAST (Basic Local Alignment
Search Tool)
 BLAST combines the best of both by using a
 theoretically sound method that searches
 for similar sequences directly but is still
 computationally fast
 Reference
   S. F. Altschul, W. Gish, W. Miller, E. W.
   Myers and D. J. Lipman. Basic Local
   Alignment Search Tool. J. Mol. Biol.
   215:403-410 (1990)
BLAST basics
 Need similarity measure, as in dynamic
 programming - use PAM-120 for proteins
 Define maximal segment pair (MSP) to be
 the highest scoring pair of identical length
 segments chosen from 2 sequences (in
 FASTA terms, highest init1 diagonal)
BLAST basics
 Define a segment pair to be locally maximal
 if its score cannot be improved either by
 extending or by shortening both segments
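
 The MSP definition can be made concrete with a brute-force sketch. Since segment pairs have no gaps, every pair lies on one diagonal, so the best pair is the best-scoring run along any diagonal. The code below finds it with a running-maximum scan; this is an illustration of the definition, not how BLAST searches, and the +1/-1 scoring used in the example is a stand-in (assumption) for a matrix such as PAM-120.

```python
def maximal_segment_pair(a, b, score):
    """Return (best_score, start_a, start_b, length) over all ungapped
    equal-length segment pairs, using 0-based start positions."""
    best = (0, 0, 0, 0)
    for d in range(-(len(a) - 1), len(b)):      # d = index in b minus index in a
        i, j = max(0, -d), max(0, d)
        run_score, run_start = 0, i
        while i < len(a) and j < len(b):
            if run_score <= 0:                  # a non-positive prefix never helps
                run_score, run_start = 0, i
            run_score += score(a[i], b[j])
            if run_score > best[0]:
                best = (run_score, run_start, run_start + d, i - run_start + 1)
            i += 1
            j += 1
    return best

def simple_score(x, y):
    """Placeholder scoring: +1 for a match, -1 for a mismatch (assumed)."""
    return 1 if x == y else -1

msp = maximal_segment_pair("TTACGT", "GGACGTAA", simple_score)
print(msp)
```

 The reported pair is also locally maximal: extending it in either direction or shortening it would lower the score.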
BLAST basics
 Approach: find segment pairs by first
 finding word pairs that score above a
 threshold, i.e., find word pairs of fixed
 length w with a score of at least T
 Key concept: Seems similar to FASTA, but
 we are searching for words that score
 at least T rather than words that match exactly
BLAST method for proteins
1. Compile a list of words which give a score
  of at least T when paired with the query
  sequence.
    Example using PAM-120 for query sequence
    ACDE (w=4, T=17):
              A C D E
  ACDE = +3 +9 +5 +5 = 22
      try all possibilities:
  AAAA = +3 -3 0 0 = 0 no good
  AAAC = +3 -3 0 -7 = -7 no good
      ...too slow, try directed change
Generating word list
            A C D E
 ACDE = +3 +9 +5 +5 = 22
     change 1st pos. to all acceptable substitutions
 gCDE =  1 9 5 5 = 20 ok (= pCDE, sCDE, tCDE)
 nCDE =  0 9 5 5 = 19 ok (= dCDE, eCDE, vCDE)
 iCDE = -1 9 5 5 = 18 ok (= qCDE)
 kCDE = -2 9 5 5 = 17 ok (= mCDE)
     change 2nd pos.: can't - all alternatives negative and
     the other three positions only add up to 13
     change 3rd pos. in combination with first position
 gCnE = 1 9 2 5 = 17 ok
     continue - use recursion
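
 The directed, recursive search can be sketched as below. The score table contains only the values quoted on this slide (it is not the full PAM-120 matrix), and treating every unlisted substitution as disallowed is an assumption made to keep the example small. The function name neighborhood is hypothetical. A branch is abandoned as soon as even the best possible scores for the remaining positions cannot reach T.

```python
def neighborhood(query, scores, T):
    """Return (word, score) for every word scoring at least T against query.
    scores[(q, r)] is the score of pairing query residue q with residue r."""
    options = [sorted(((s, r) for (q, r), s in scores.items() if q == c),
                      reverse=True) for c in query]
    # best_rest[i] = highest score obtainable from positions i onward
    best_rest = [0] * (len(query) + 1)
    for i in range(len(query) - 1, -1, -1):
        best_rest[i] = best_rest[i + 1] + options[i][0][0]
    words = []

    def extend(i, prefix, total):
        if i == len(query):
            words.append((prefix, total))
            return
        for s, r in options[i]:                 # sorted from best to worst
            if total + s + best_rest[i + 1] < T:
                break                           # no later choice can reach T
            extend(i + 1, prefix + r, total + s)

    extend(0, "", 0)
    return words

# Partial score table: only the values that appear on the slide
SLIDE_SCORES = {("A", "A"): 3, ("C", "C"): 9, ("D", "D"): 5, ("D", "N"): 2,
                ("E", "E"): 5}
SLIDE_SCORES.update({("A", r): 1 for r in "GPST"})
SLIDE_SCORES.update({("A", r): 0 for r in "NDEV"})
SLIDE_SCORES.update({("A", r): -1 for r in "IQ"})
SLIDE_SCORES.update({("A", r): -2 for r in "KM"})

words = dict(neighborhood("ACDE", SLIDE_SCORES, 17))
print(len(words), words["ACDE"], words["GCNE"])
```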
Generating word list
 For "best" values of w and T there are
 typically about 50 words in the list for every
 residue in the query sequence
BLAST method for proteins
2. Scan the database for hits with the
  compiled list of words. Two approaches:
    Use index of all possible words (for w=4, need
    array of size 20^4 = 160,000). Can compress this
    index using pointers to save space.
    Use finite state machine (actually used)
      Calculate a state transition table that tells what state
      to go to based on the next character in the sequence
3a. Extend hits to form HSPs (high-scoring
  segment pairs)
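
 The first approach in step 2 (an index over all possible words) can be sketched as follows. Each 4-letter word is read as a base-20 number, which gives its slot in an array of 20^4 entries. The alphabet ordering and the function names are assumptions, and this is not the finite state machine that the slide says is actually used.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"      # ordering is an arbitrary choice
CODE = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def word_to_index(word):
    """Read the word as a base-20 number."""
    index = 0
    for aa in word:
        index = index * 20 + CODE[aa]
    return index

def scan(database, word_list, w=4):
    """Return 0-based database positions whose w-letter word is in word_list."""
    in_list = bytearray(20 ** w)           # one flag per possible word
    for word in word_list:
        in_list[word_to_index(word)] = 1
    return [i for i in range(len(database) - w + 1)
            if in_list[word_to_index(database[i:i + w])]]

print(20 ** 4)
print(word_to_index("ACDE"))
print(scan("MKACDEGCNEW", ["ACDE", "GCNE"]))
```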
BLAST method for proteins
3b. BLAST2 or gapped BLAST uses an
  approach similar to FASTA to combine hits
  before trying to extend them as in 3a.
4. Compare the score for each HSP to a
  threshold S to decide whether to keep it
5. Proceed to estimating statistical
  significance (see below)
BLAST Method for DNA
 1. Make list of all contiguous w-mers in the
 query sequence (often w=12)
 2. Compress database by packing 4
 nucleotides into a single byte (use auxiliary
 table to tell you where sequences start and
 stop within the compressed database) --
 doesn't allow for unspecified bases
 (wildcards)
BLAST Method for DNA
 3. Compress the w-mers from the query sequence
 the same way.
 4. Search the compressed database for matches
 with the compressed w-mers
   Since all frames of the query sequence are considered
   separately, any match of length w>=11 must contain a
   match of length 8 that lies on a byte boundary of one of
   the w-mers from the query sequence. Thus can scan a
   (packed) byte at a time, improving speed 4-fold over
   comparing one nucleotide at a time.
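
 Packing 4 nucleotides into a byte can be sketched as follows. The 2-bit codes (A=0, C=1, G=2, T=3), the left alignment of a short final chunk, and the function name are assumptions made for illustration. The sequence length is returned alongside the bytes, playing the role of the auxiliary information about where a sequence stops.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq):
    """Pack 4 nucleotides per byte, 2 bits each. Returns (bytes, length)."""
    packed = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]    # KeyError for wildcards such as N
        byte <<= 2 * (4 - len(chunk))          # left-align a short final chunk
        packed.append(byte)
    return bytes(packed), len(seq)

print(pack("ACGT"))
print(pack("TTTTAC"))
```

 Because only four codes fit in 2 bits, an unspecified base has no representation, which is the limitation noted in step 2.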
BLAST Method for DNA
 Problem: if query sequence has a stretch of
 unusual base composition (e.g., A-T rich)
 or a repeated sequence element (e.g., Alu
 sequence) there will be many hits with
 "uninteresting" regions.
BLAST Method for DNA
 Solution:
   During compression of the database, tabulate
   frequencies of all 8-tuples.
   Make a list of those occurring very frequently (much
   more frequently than expected by chance).
   Remove these words from the query list of w-mers
   before searching database.
   Remove words matching a sublibrary of repeated
   sequences (but report the matches to that sublibrary
   when done).
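
 The frequency filter can be sketched as below. Several details are assumptions, since the slide does not give them: "very frequently" is taken to mean more than factor times the count expected for uniformly distributed bases, and a query w-mer is dropped if it contains any frequent k-tuple. The example uses small k and w so the output is easy to follow.

```python
from collections import Counter

def frequent_words(database, k=8, factor=10.0):
    """k-tuples seen more than factor times the uniform-chance expectation."""
    n_words = len(database) - k + 1
    counts = Counter(database[i:i + k] for i in range(n_words))
    expected = n_words / 4 ** k
    return {word for word, c in counts.items() if c > factor * expected}

def filter_query_words(query, frequent, w=12, k=8):
    """Keep only the query w-mers that contain no frequent k-tuple."""
    kept = []
    for i in range(len(query) - w + 1):
        word = query[i:i + w]
        if not any(word[j:j + k] in frequent for j in range(w - k + 1)):
            kept.append(word)
    return kept

frequent = frequent_words("ATATATATATATATATGC", k=2, factor=3)
kept = filter_query_words("GCATGC", frequent, w=3, k=2)
print(sorted(frequent), kept)
```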
BLAST Statistical significance
 A key to the utility of BLAST is the ability
 to calculate expected probabilities of
 occurrence of maximal segment pairs
 (MSPs) given w and T
 This allows BLAST to rank matching
 sequences in order of “significance” and to
 cut off listings at a user-specified
 probability
BLAST Statistical significance
 From Karlin-Altschul formulation, the
 expected value (mean) of the HSPs between
 a query and a set of random sequences is
       $u \cong \log_e(Kmn) / \lambda$
       or
       $u \cong \ln(Kmn) / \lambda$
BLAST Statistical significance
 BLAST uses a correction to this formulation
 that takes into account the effective
 sequence lengths of the query and the
 database sequences
       $u = \ln(K m' n') / \lambda$
BLAST Statistical significance
 The corrected lengths are given by
       $m' = m - \ln(Kmn) / H$
       $n' = n - \ln(Kmn) / H$
       with
       $H = \ln(Kmn) / l$
 where l is the average length of the alignment that
 can be achieved between random sequences of
 length m and n
BLAST Statistical significance
 Given u, we can calculate the probability p of
 observing a score S between a query sequence and
 a given database sequence that is equal to or
 greater than x
       $p(S \ge x) = 1 - \exp\left(-e^{-\lambda (x - u)}\right)$
BLAST Statistical significance
 Lastly, we have to consider that we are searching
 many database sequences and can expect even a
 relatively rare score to occur with high chance
 given enough comparisons
 For a database of D sequences, this is


       $E \approx 1 - e^{-p(S \ge x)\, D}$
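
 The formulas on the statistical significance slides can be put together in a short sketch. The function names are hypothetical, and the numeric values of K, lambda, H, m, n, x and D in the example are placeholders chosen only for illustration; real values depend on the scoring matrix and the database.

```python
import math

def expected_hsp_score(K, lam, m, n):
    """u = ln(K m n) / lambda."""
    return math.log(K * m * n) / lam

def effective_lengths(K, m, n, H):
    """m' = m - ln(K m n) / H and n' = n - ln(K m n) / H."""
    shift = math.log(K * m * n) / H
    return m - shift, n - shift

def p_value(x, u, lam):
    """p(S >= x) = 1 - exp(-e^(-lambda (x - u)))."""
    return 1.0 - math.exp(-math.exp(-lam * (x - u)))

def e_value(p, D):
    """E ~ 1 - e^(-p D) for a database of D sequences."""
    return 1.0 - math.exp(-p * D)

# Placeholder parameters (assumed, not taken from the slides)
K, lam, H = 0.13, 0.32, 0.4
m, n, D = 250, 300, 100000

m_eff, n_eff = effective_lengths(K, m, n, H)
u = expected_hsp_score(K, lam, m_eff, n_eff)    # corrected u uses m' and n'
p = p_value(60, u, lam)
print(u, p, e_value(p, D))
```

 Note how the last step matters: a score that is rare for one comparison (small p) can still be expected somewhere in a large database (E close to 1).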
Summary of Database Search
Methods
   Authors (Program)           Description
   Needleman & Wunsch          full alignment
   Wilbur & Lipman             match k-tuple - form diag - NW
   Lipman & Pearson (FASTP)    k-tuple - diag - rescore
   Pearson & Lipman (FASTA)    FASTP - join diags - NW
   Altschul et al. (BLAST)     word match list - statistics
Reading for next class
 Paper by Grundy and Bailey

  • 1. Computational Biology, Part 6 Sequence Database Searching PUSHPENDRA TRIPATHI
  • 2. Sequence Analysis Tasks ⇒ Given a query sequence, search for similar sequences in a database Global or Local? Both local and global alignment methods may be applied to database scanning, but local alignment methods are more useful since they do not make the assumption that the query protein and database sequence are of similar length.
  • 3. Efficient database searching methods Dynamic programming requires order N2L computations (where N is size of the query sequence and L is the size of the database) Given size of databases, more efficient methods needed
  • 4. “Hit and extend” sequence searching Problem: Too many calculations “wasted” by comparing regions that have nothing in common Initial insight: Regions that are similar between two sequences are likely to share short stretches that are identical Basic method: Look for similar regions only near short stretches that match exactly
  • 5. “Hit and extend” sequence searching We define a word (or k-tuple) size that is the minimum number of exact “letter” matches that must occur before we do any further comparison or alignment How do we find all of the occurences of matching words between a sequence and a database? Could scan sequence a word at a time, but this is order L (size of database)
  • 6. Word searching - hashing Solution: Use a precomputed table that lists where in the database each possible word occurs Generation of the table is of order L (size of database) but use of the table is of order N (size of query sequence) The computer science term for this approach is hashing
  • 7. Hashing Hashing Hashing Table of size 10 Hashing function H(x) = x mod 10 Applet: http://www.engin.umd.umich.edu/CIS/course.des/cis Insertion & Search
  • 8. Demonstration: Hashing algorithm for sequence searching Author: R.F. Murphy, Feb. 6, 1995 (revised Feb. 15, 1996) This demonstration takes a piece of database sequence, calculates hash values for each ktuple, builds a hash table (listing the positions in the database of the occurence of each hash value), and uses a simplified version of the hash table to find the positions in the database sequence of the first occurence of each ktuple in a query sequence. database sequence Hashing i 1 seq(i) seq(i) as char as int hash value a 0 6 2 c 1 27 This section converts each base to a number 3 g 2 47 from 0 to 3 and combines those numbers three 4 t 3 63 at a time to form an integer from 0 to 63 that (Demonstration A10) 5 6 t t 3 3 63 60 is unique for each three base sequence. Each three base sequence is called a "ktuple." 7 t 3 48 8 a 0 0 9 a 0 0 10 a 0 1 11 a 0 6 12 c 1 24 13 g 2 33 14 a 0 4 15 c 1 17 16 a 0 5 17 c 1 18 c 1 hash first hit value pos1 pos2 pos3 hash table for the database sequence hash table 0 a a a 8 9 8 1 a a c 10 10 2 a a g not found 3 a a t not found 4 a c a 14 14 5 a c c 16 16 6 a c g 1 11 1 7 a c t not found 8 a g a not found
  • 9. FASTA Heavily used for searching databases until advent of BLAST (see below) Inputs k (word or k-tuple) size similarity matrix Compares query sequence pairwise with each sequence in the database
  • 10. FASTA method The initial step in the algorithm is to identify all exact matches of length k (k– tuples) or greater between the two sequences.
  • 11. FASTA method 1. Find diagonals (paired pieces from each sequence without gaps) that have the highest density of common words 2. Rescore these using a scoring (similarity) matrix and trim ends that do not contribute to the highest score Result: partial alignments without gaps Reported as the “init1” score
  • 12. FASTA method 3. Join regions together, including penalties for gaps Result: unoptimized alignment with gaps Reported as the “initn” score 4. Use dynamic programming in a band 32 residues wide around the best “initn” score Result: optimized alignment with gaps Reported as the “opt” score
  • 13. Comments on FASTA Larger k-tuple increases speed since fewer “hits” are found but it also decreases sensitivity for finding similar but not identical sequences since exact matches of this length are required
  • 14. Limitations of FASTA FASTA can miss significant similarity since For proteins, similar sequences do not have to share identical residues Asp-Lys-Val is quite similar to Glu-Arg-Ile yet it is missed even with k-tuple size of 1 since no amino acid matches Gly-Asp-Gly-Lys-Gly is quite similar to Gly-Glu-Gly-Arg-Gly but there is only match with k-tuple size of 1
  • 15. Limitations of FASTA FASTA can miss significant similarity since For nucleic acids, due to codon “wobble”, DNA sequences may look like XXyXXyXXy where X’s are conserved and y’s are not GGuUCuACgAAg and GGcUCcACaAAA both code for the same peptide sequence (Gly-Ser- Thr-Lys) but they don’t match with k-tuple size of 3 or higher
  • 16. BLAST (Basic Local Alignment Search Tool) Goal: find sequences from the database similar to the query sequence. Previous tools use either a direct, theoretically sound but computationally slow approach that examines all possible alignments of the query with the database (dynamic programming), or an indirect, heuristic but computationally fast approach that finds similar sequences by first finding identical stretches (FASTP, FASTA)
  • 17. BLAST (Basic Local Alignment Search Tool) BLAST combines the best of both by using a theoretically sound method that searches for similar sequences directly but is computationally fast. Reference: S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman. Basic Local Alignment Search Tool. J. Mol. Biol. 215:403-410 (1990)
  • 18. BLAST basics Need similarity measure, as in dynamic programming - use PAM-120 for proteins Define maximal segment pair (MSP) to be the highest scoring pair of identical length segments chosen from 2 sequences (in FASTA terms, highest init1 diagonal)
  • 19. BLAST basics Define a segment pair to be locally maximal if its score cannot be improved either by extending or by shortening both segments
  • 20. BLAST basics Approach: find segment pairs by first finding word pairs that score above a threshold, i.e., find word pairs of fixed length w with a score of at least T. Key concept: this seems similar to FASTA, but we are searching for words that score at least T rather than words that match exactly
  • 21. BLAST method for proteins 1. Compile a list of words which give a score of at least T when paired with the query sequence. Example using PAM-120 for query sequence ACDE (w=4, T=17):
      ACDE = +3 +9 +5 +5 = 22
      try all possibilities:
      AAAA = +3 -3 0 0 = 0 no good
      AAAC = +3 -3 0 -7 = -7 no good
      ...too slow, try directed change
  • 22. Generating word list
      ACDE = +3 +9 +5 +5 = 22
      change 1st pos. to all acceptable substitutions:
      gCDE = 1 9 5 5 = 20 ok (=pCDE, sCDE, tCDE)
      nCDE = 0 9 5 5 = 19 ok (=dCDE, eCDE, nCDE, vCDE)
      iCDE = -1 9 5 5 = 18 ok (=qCDE)
      kCDE = -2 9 5 5 = 17 ok (=mCDE)
      change 2nd pos.: can't - all alternatives negative and the other three positions only add up to 13
      change 3rd pos. in combination with first position:
      gCnE = 1 9 2 5 = 17 ok
      continue - use recursion
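The "directed change" recursion can be sketched as a pruned search: a partial word is extended only if the best possible score of the remaining positions could still reach T. The sketch below is an assumption-laden illustration, not BLAST's code. In particular, `toy_score` (match +5, mismatch -1) and the four-letter alphabet are hypothetical stand-ins for PAM-120 and the 20 amino acids.

```python
def neighborhood_words(query_word, alphabet, score, threshold):
    """Return (word, score) for every word scoring >= threshold against query_word."""
    n = len(query_word)
    # best_rest[i] is the highest score obtainable from positions i..n-1.
    best_rest = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        best_rest[i] = best_rest[i + 1] + max(score(query_word[i], c) for c in alphabet)
    results = []

    def extend(pos, prefix, total):
        if pos == n:
            results.append(("".join(prefix), total))
            return
        for c in alphabet:
            s = total + score(query_word[pos], c)
            # Prune: skip this branch if even perfect remaining positions fall short.
            if s + best_rest[pos + 1] >= threshold:
                extend(pos + 1, prefix + [c], s)

    extend(0, [], 0)
    return results

def toy_score(a, b):
    """Hypothetical scoring function, NOT PAM-120."""
    return 5 if a == b else -1

words = neighborhood_words("ACDE", "ACDE", toy_score, threshold=14)
print(len(words))
```

With this toy scoring, the exact word scores 20 and any single substitution scores 14, so the list holds the query word plus its twelve single-substitution neighbors.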
  • 23. Generating word list For "best" values of w and T there are typically about 50 words in the list for every residue in the query sequence
  • 24. BLAST method for proteins 2. Scan the database for hits with the compiled list of words. Two approaches: Use an index of all possible words (for w=4, need an array of size 20^4 = 160,000). Can compress this index using pointers to save space. Use a finite state machine (actually used): calculate a state transition table that tells what state to go to based on the next character in the sequence. 3a. Extend hits to form HSPs (high-scoring segment pairs)
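The first approach (a direct index over all possible words) can be sketched as follows. This is only an illustration of the indexing idea, not the finite state machine that the slide says is actually used. The names `word_index`, `build_lookup` and `scan`, the residue ordering in `AMINO_ACIDS`, and the example sequence are assumptions for this sketch.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 residues; ordering is arbitrary here
CODE = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def word_index(word):
    """Map a word to an integer by reading it as a base-20 number."""
    idx = 0
    for aa in word:
        idx = idx * len(AMINO_ACIDS) + CODE[aa]
    return idx

def build_lookup(word_list, w=4):
    """Boolean array with one slot per possible word of length w."""
    table = [False] * (len(AMINO_ACIDS) ** w)
    for word in word_list:
        table[word_index(word)] = True
    return table

def scan(sequence, table, w=4):
    """Return start positions in sequence whose w-mer is in the word list."""
    return [j for j in range(len(sequence) - w + 1)
            if table[word_index(sequence[j:j + w])]]

table = build_lookup(["ACDE", "GCDE"])
print(len(table))                  # 160000 slots for w=4
print(scan("MKACDEGCDEW", table))  # hit positions in a made-up sequence
```

Each database position costs one table lookup, at the price of an array whose size grows as 20^w.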
  • 25. BLAST method for proteins 3b. BLAST2 or gapped BLAST uses an approach similar to FASTA to combine hits before trying to extend them as in 3a. 4. Compare the score for each HSP to a threshold S to decide whether to keep it 5. Proceed to estimating statistical significance (see below)
  • 26. BLAST Method for DNA 1. Make list of all contiguous w-mers in the query sequence (often w=12) 2. Compress database by packing 4 nucleotides into a single byte (use auxiliary table to tell you where sequences start and stop within the compressed database) -- doesn't allow for unspecified bases (wildcards)
  • 27. BLAST Method for DNA 3. Compress the w-mers from the query sequence the same way. 4. Search the compressed database for matches with the compressed w-mers Since all frames of the query sequence are considered separately, any match of length w>=11 must contain a match of length 8 that lies on a byte boundary of one of the w-mers from the query sequence. Thus can scan a (packed) byte at a time, improving speed 4-fold over comparing one nucleotide at a time.
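The packing and byte-at-a-time scan can be sketched as below, using 2 bits per base so that 8 bases fill two bytes. This is a simplified illustration under stated assumptions: the 2-bit codes in `BASE_CODE`, the function names, and the example sequences are invented here, and the sketch requires sequence lengths that are multiples of 4.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed encoding; no wildcard code exists

def pack(dna):
    """Pack DNA into bytes, 4 bases per byte (2 bits each)."""
    if len(dna) % 4:
        raise ValueError("length must be a multiple of 4 in this sketch")
    out = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASE_CODE[base]
        out.append(byte)
    return bytes(out)

def byte_aligned_hits(query, packed_db):
    """Find query 8-mers (every frame) that equal a byte-aligned 8-mer of the database."""
    seeds = {}
    for i in range(len(query) - 7):
        seeds.setdefault(pack(query[i:i + 8]), []).append(i)
    hits = []
    for b in range(len(packed_db) - 1):
        # Two consecutive bytes hold one byte-aligned 8-mer of the database.
        for i in seeds.get(packed_db[b:b + 2], []):
            hits.append((i, 4 * b))  # (query offset, database offset in bases)
    return hits

db = "TTTTACGTACGGCATGAAAA"  # made-up database
query = "GTACGGCATGA"        # 11-mer that occurs at database position 6
print(byte_aligned_hits(query, pack(db)))
```

The 11-base match starts off a byte boundary, yet one of its 8-mers lands on one, which is why scanning packed bytes still finds it. A base such as N raises a `KeyError` here, mirroring the note that wildcards are not allowed.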
  • 28. BLAST Method for DNA Problem: if query sequence has a stretch of unusual base composition (e.g., A-T rich) or a repeated sequence element (e.g., Alu sequence) there will be many hits with "uninteresting" regions.
  • 29. BLAST Method for DNA Solution: During compression of the database, tabulate frequencies of all 8-tuples. Make a list of those occurring very frequently (much more frequently than expected by chance). Remove these words from the query list of w-mers before searching database. Remove words matching a sublibrary of repeated sequences (but report the matches to that sublibrary when done).
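The frequent-word filter can be sketched as follows. This is a minimal illustration, not BLAST's procedure: the cutoff `factor`, the assumption of uniform base composition for the expected count, and the function names are all hypothetical, and the example uses 4-tuples instead of 8-tuples to keep the toy database small.

```python
from collections import Counter

def frequent_words(database, k=8, factor=10.0):
    """Return k-tuples seen more than `factor` times their expected count."""
    counts = Counter(database[i:i + k] for i in range(len(database) - k + 1))
    total = sum(counts.values())
    expected = total / 4 ** k  # assumes all 4**k words are equally likely
    return {word for word, c in counts.items() if c > factor * expected}

def filter_query_words(query, w, frequent, k=8):
    """Keep only query w-mers that contain no frequent k-tuple."""
    kept = []
    for i in range(len(query) - w + 1):
        word = query[i:i + w]
        if not any(word[j:j + k] in frequent for j in range(w - k + 1)):
            kept.append(word)
    return kept

db = "AT" * 50 + "ACGTGCA"  # made-up database with an A-T rich repeat
frequent = frequent_words(db, k=4)
print(sorted(frequent))
print(filter_query_words("ATATACGTG", 6, frequent, k=4))
```

Query words overlapping the repeat are dropped before the search, so they cannot generate hits in "uninteresting" regions.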
  • 30. BLAST Statistical significance A key to the utility of BLAST is the ability to calculate expected probabilities of occurrence of maximal segment pairs (MSPs) given w and T. This allows BLAST to rank matching sequences in order of “significance” and to cut off listings at a user-specified probability
  • 31. BLAST Statistical significance From Karlin-Altschul formulation, the expected value (mean) of the HSPs between a query and a set of random sequences is $u \cong [\log_e(Kmn)]/\lambda$ or $u \cong [\ln(Kmn)]/\lambda$
  • 32. BLAST Statistical significance BLAST uses a correction to this formulation that takes into account the effective sequence lengths of the query and the database sequences: $u = [\ln(K m' n')]/\lambda$
  • 33. BLAST Statistical significance The corrected lengths are given by $m' = m - (\ln Kmn)/H$ and $n' = n - (\ln Kmn)/H$, with $H = (\ln Kmn)/l$, where $l$ is the average length of the alignment that can be achieved between random sequences of length m and n
  • 34. BLAST Statistical significance Given u, we can calculate the probability p of observing a score S between a query sequence and a given database sequence that is equal to or greater than x: $p(S \ge x) = 1 - \exp\left(-e^{-\lambda(x-u)}\right)$
  • 35. BLAST Statistical significance Lastly, we have to consider that we are searching many database sequences and can expect even a relatively rare score to occur with high chance given enough comparisons. For a database of D sequences, this is $E \approx 1 - e^{-D\,p(S \ge x)}$
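The significance calculation can be sketched numerically. The sketch assumes the extreme-value form $p(S \ge x) = 1 - \exp(-e^{-\lambda(x-u)})$ with $u = \ln(Kmn)/\lambda$, and the database-level form $1 - e^{-D\,p}$. It uses the uncorrected lengths m and n (corrected lengths could be passed in instead). The function names and the values of K, λ, m, n and D are hypothetical, chosen only for illustration.

```python
import math

def expected_max_score(K, lam, m, n):
    """u = ln(K*m*n) / lambda."""
    return math.log(K * m * n) / lam

def p_score_at_least(x, K, lam, m, n):
    """Probability of a score >= x against one database sequence."""
    u = expected_max_score(K, lam, m, n)
    return 1.0 - math.exp(-math.exp(-lam * (x - u)))

def p_in_database(p, D):
    """Chance of seeing such a score at least once among D sequences."""
    return 1.0 - math.exp(-D * p)

# Hypothetical parameter values, not taken from any real scoring system.
K, lam, m, n, D = 0.1, 0.3, 250, 1000, 50000

u = expected_max_score(K, lam, m, n)
p = p_score_at_least(u + 30, K, lam, m, n)
print(u, p, p_in_database(p, D))
```

Because D multiplies p in the exponent, a score that is rare for a single comparison becomes much more likely somewhere in a large database, which is the point made on the slide.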
  • 36. Summary of Database Search Methods
      Authors (Program): Description
      Needleman & Wunsch: full alignment
      Wilbur & Lipman: match k-tuple - form diag - NW
      Lipman & Pearson (FASTP): k-tuple - diag - rescore
      Pearson & Lipman (FASTA): FASTP - join diags - NW
      Altschul et al (BLAST): word match list - statistics
  • 37. Reading for next class Paper by Grundy and Bailey