SlideShare una empresa de Scribd logo
1 de 55
Descargar para leer sin conexión
Gene Prediction:
Similarity-Based
Approaches
Outline
1. Introduction
2. Exon Chaining Problem
3. Spliced Alignment
4. Gene Prediction Tools
Section 1:
Introduction
Similarity-Based Approach to Gene Prediction
• Some genomes may be well-studied, with many genes having
been experimentally verified.
• Closely-related organisms may have similar genes.
• In order to determine the functions of unknown genes,
researchers may compare them to known genes in a closely-
related species.
• This is the idea behind the similarity-based approach to gene
prediction.
Similarity-Based Approach to Gene Prediction
• There  is  one  issue  with  comparison:  genes  are  often  “split.”
• The exons, or coding sections of a gene, are separated from
introns, or the non-coding sections.
• Example:
• Our Problem: Given a known gene and an unannotated
genome sequence, find a set of substrings in the genomic
sequence whose concatenation best matches the known gene.
• This concatenation will be our candidate gene.
…AGGGTCTCATTGTAGACAGTGGTACTGATCAACGCAGGACTT…
Coding Non-coding Coding Non-coding
• Small islands of similarity corresponding to similarities between
exons
Comparing Genes in Two Genomes
FrogGene(known)
Human Genome
Using Similarities to Find Exon Structure
• A (known) frog gene is aligned to different locations in the
human genome.
• Find  the  “best”  path  to  reveal  the  exon  structure  of  the  
corresponding human gene.
Finding Local Alignments
• A (known) frog gene is aligned to different locations in the
human genome.
• Find  the  “best”  path  to  reveal  the  exon  structure  of  the  
corresponding human gene.
• Use local
alignments to
find all islands
of similarity.
FrogGene(known)
Human Genome
genome
mRNA
Exon 3Exon 1 Exon 2
{
{
{
Intron 1 Intron 2
{
{
Reverse Translation
• Reverse Translation Problem: Given a known protein, find a
gene which codes for it.
• Inexact: amino acids map to > 1 codon
• This problem is essentially reduced to an alignment
problem.
• Example: Comparing Genomic DNA Against mRNA.
Portion of genome
mRNA
(codonsequence)
Exon 3Exon 1 Exon 2
{
{
{
Intron 1 Intron 2
{
{
Comparing Genomic DNA Against mRNA
Reverse Translation
• The reverse translation problem can be modeled as traveling in
Manhattan  grid  with  “free”  horizontal  jumps.
• Each horizontal jump models insertion of an intron.
• Complexity of Manhattan grid is O(n3).
• Issue: Would match nucleotides pointwise and use
horizontal jumps at every opportunity.
Section 2:
Exon Chaining Problem
Chaining Local Alignments
• Aim: Find candidate exons, or substrings that match a given
gene sequence.
• Define a candidate exon as (l, r, w):
• l = starting position of exon
• r = ending position of exon
• w = weight of exon, defined as score of local alignment or
some other weighting score
• Idea: Look for a chain of substrings with maximal score.
• Chain: a set of non-overlapping nonadjacent intervals.
• Locate the beginning and end of each interval (2n points).
• Find  the  “best”  concatenation  of  some  of  the  intervals.
3
4
11
9
15
5
5
0 2 3 5 6 11 13 16 20 25 27 28 30 32
Exon Chaining Problem: Illustration
• Locate the beginning and end of each interval (2n points).
• Find  the  “best”  concatenation  of  some  of  the  intervals.
3
4
11
9
15
5
5
0 2 3 5 6 11 13 16 20 25 27 28 30 32
Exon Chaining Problem: Illustration
• Locate the beginning and end of each interval (2n points).
• Find  the  “best”  concatenation  of  some  of  the  intervals.
3
4
11
9
15
5
5
0 2 3 5 6 11 13 16 20 25 27 28 30 32
Exon Chaining Problem: Illustration
Exon Chaining Problem: Formulation
• Goal: Given a set of putative exons, find a maximum set of
non-overlapping putative exons (chain).
• Input: A set of weighted intervals (putative exons).
• Output: A maximum chain of intervals from this set.
Exon Chaining Problem: Formulation
• Goal: Given a set of putative exons, find a maximum set of
non-overlapping putative exons (chain).
• Input: A set of weighted intervals (putative exons).
• Output: A maximum chain of intervals from this set.
• Question: Would a greedy algorithm solve this problem?
• This problem can be solved with dynamic programming in
O(n) time.
• Idea: Connect adjacent endpoints with weight zero edges,
connect ends of weight w exon with weight w edge.
Exon Chaining Problem: Graph Representation
ExonChaining (G, n) //Graph, number of intervals
1. for i ← 1 to 2n
2. si ← 0
3. for i ← 1 to 2n
4. if vertex vi in G corresponds to right end of the interval I
5. j ← index of vertex for left end of the interval I
6. w ← weight of the interval I
7. sj ← max {sj + w, si-1}
8. else
9. si ← si-1
10.return s2n
Exon Chaining Problem: Pseudocode
Exon Chaining: Deficiencies
1. Poor definition of the putative exon endpoints.
2. Optimal chain of intervals may not correspond to a valid
alignment.
• Example: First interval may correspond to a suffix, whereas
second interval may correspond to a prefix.
Human Genome
FrogGenes(known)
Infeasible Chains: Illustration
• Red local similarities form two non-overlapping intervals but
do not form a valid global alignment.
• The cell carries DNA as a blueprint for producing proteins,
like a manufacturer carries a blueprint for producing a car.
Gene Prediction Analogy
Gene Prediction Analogy: Using Blueprint
• Each protein has its own distinct blueprint for construction.
Gene Prediction Analogy: Assembling Exons
Gene  Prediction  Analogy:  Still  Assembling…
Section 3:
Spliced Alignment
Spliced Alignment
• Mikhail Gelfand and colleagues proposed a spliced alignment
approach of using a protein from one genome to reconstruct
the exon-intron structure of a (related) gene in another
genome.
• Spliced alignment begins by selecting either all putative exons
between potential acceptor and donor sites or by finding all
substrings similar to the target protein (as in the Exon
Chaining Problem).
• This set is further filtered in a such a way that attempts to
retain all true exons, with some false ones.
Spliced Alignment Problem: Formulation
• Goal: Find a chain of blocks in a genomic sequence that best
fits a target sequence.
• Input: Genomic sequences G, target sequence T, and a set of
candidate exons B.
• Output: A chain of exons C such that the global alignment
score between C* and T is maximum among all chains of
blocks from B.
• Here C* is the concatenation of all exons from chain C.
Example:  Lewis  Carroll’s  “Jabberwocky”
• Genomic  Sequence:  “It  was  a  brilliant  thrilling  morning  and  
the slimy, hellish, lithe doves gyrated and gamboled nimbly in
the  waves.”
• Target  Sequence:  “Twas  brillig,  and  the  slithy  toves  did  gyre  
and  gimble  in  the  wabe.”
• Alignment:  Next  Slide…    
Example:  Lewis  Carroll’s  “Jabberwocky”
Example:  Lewis  Carroll’s  “Jabberwocky”
Example:  Lewis  Carroll’s  “Jabberwocky”
Example:  Lewis  Carroll’s  “Jabberwocky”
Example:  Lewis  Carroll’s  “Jabberwocky”
Spliced Alignment: Idea
• Compute the best alignment between i-prefix of genomic
sequence G and j-prefix of target T: S(i,j)
• But what is “i-prefix”  of G?
• There may be a few i-prefixes of G, depending on which block
B we are in.
Spliced Alignment: Idea
• Compute the best alignment between i-prefix of genomic
sequence G and j-prefix of target T: S(i,j)
• But what is “i-prefix”  of G?
• There may be a few i-prefixes of G, depending on which block
B we are in.
• Compute the best alignment between i-prefix of genomic
sequence G and j-prefix of target T under the assumption that
the alignment uses the block B at position i: S(i, j, B).
Spliced Alignment Recurrence
• If i is not the starting vertex of block B:
• If i is the starting vertex of block B:
where p is  some  indel  penalty  and  δ  is  a  scoring  matrix.

S(i, j,B)  max
S(i  1, j,B)  p
S(i, j  1,B)  p
S(i  1, j  1,B)   (gi,t j )





S(i, j,B)  max
B ' preceding B
S(i, j  1,B)  p
S(end (B'), j,B')  p
S(end (B'), j  1,B')   (gi,t j )





Spliced Alignment Recurrence
• After computing the three-dimensional table S(i, j, B), the
score of the optimal spliced alignment is:

max
B
S end (B), length (T), B 
Spliced Alignment: Complications
• Considering multiple i-prefixes leads to slow down.
• Running time:
where m is the target length, n is the genomic sequence
length and |B| is the number of blocks.
• A mosaic effect: short exons are easily combined to fit any
target protein.

O mn
2
B 
Spliced Alignment: Speedup
Spliced Alignment: Speedup
Spliced Alignment: Speedup

P i, j   max
B preceding i
S end B , j, B 
Exon Chaining vs Spliced Alignment
• In Spliced Alignment, every path spells out the string obtained
by concatenation of labels of its edges.
• The weight of the path is defined as optimal alignment score
between concatenated labels (blocks) and target sequence.
• Defines weight of entire path in graph, but not the weights
for individual edges.
• Exon Chaining assumes the positions and weights of exons are
pre-defined.
Section 4:
Gene Prediction Tools
Gene Prediction: Aligning Genome vs. Genome
• Goal: Align entire human and mouse genomes.
• Predict genes in both sequences simultaneously as chains of
aligned blocks (exons).
• This approach does not assume any annotation of either human
or mouse genes.
Gene Prediction Tools
1. GENSCAN/Genome Scan
2. TwinScan
3. Glimmer
4. GenMark
The GENSCAN Algorithm
• Algorithm is based on probabilistic model of gene structure.
• GENSCAN  uses  a  “training  set,”  then  the  algorithm  returns  
the exon structure using maximum likelihood approach
standard to many HMM algorithms (Viterbi algorithm).
• Biological input: Codon bias in coding regions, gene structure
(start and stop codons, typical exon and intron length, presence
of promoters, presence of genes on both strands, etc).
• Benefit: Covers cases where input sequence contains no gene,
partial gene, complete gene, or multiple genes.
GENSCAN Limitations
• Does not use similarity search to predict genes.
• Does not address alternative splicing.
• Could combine two exons from consecutive genes together.
GenomeScan
• Incorporates similarity information into GENSCAN: predicts
gene structure which corresponds to maximum probability
conditional on similarity information.
• Algorithm is a combination of two sources of information:
1. Probabilistic models of exons-introns.
2. Sequence similarity information.
http://www.stanford.edu/class/cs262/Spring2003/Notes/ln10.pdf
TwinScan
• Aligns two sequences and marks each base as gap ( - ),
mismatch (:), or match (|).
• Results in a new alphabet of 12 letters
Σ = {A-, A:, A |, C-, C:, C |, G-, G:, G |, T-, T:, T|}.
• Then run Viterbi algorithm using emissions ek(b) where
b ∊ {A-,  A:,  A|,  …,  T|}.
TwinScan
• The emission probabilities are estimated from human/mouse
gene pairs.
• Example: eI(x|) < eE(x|) since matches are favored in exons,
and eI(x-) > eE(x-) since gaps (as well as mismatches) are
favored in introns.
• Benefit: Compensates for dominant occurrence of poly-A
region in introns.
Glimmer
• Glimmer: Stands for
Gene Locator and Interpolated Markov ModelER
• Finds genes in bacterial DNA
• Uses interpolated Markov Models
Glimmer
• Made of 2 programs:
1. BuildIMM:
• Takes sequences as input and outputs the Interpolated
Markov Models (IMMs).
2. Glimmer:
• Takes IMMs and outputs all candidate genes.
• Automatically resolves overlapping genes by choosing
one, hence limited.
• Marks  “suspected  to  truly  overlap”  genes  for  closer  
inspection by user.
GenMark
• Based on non-stationary Markov chain models.
• Results displayed graphically with coding vs. noncoding
probability dependent on position in nucleotide sequence.

Más contenido relacionado

La actualidad más candente

Apollo Introduction for i5K Groups 2015-10-07
Apollo Introduction for i5K Groups 2015-10-07Apollo Introduction for i5K Groups 2015-10-07
Apollo Introduction for i5K Groups 2015-10-07Monica Munoz-Torres
 
Apollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research CommunityApollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research CommunityMonica Munoz-Torres
 
Gene prediction and expression
Gene prediction and expressionGene prediction and expression
Gene prediction and expressionishi tandon
 
Functional annotation
Functional annotationFunctional annotation
Functional annotationRavi Gandham
 
Introduction to Apollo: A webinar for the i5K Research Community
Introduction to Apollo: A webinar for the i5K Research CommunityIntroduction to Apollo: A webinar for the i5K Research Community
Introduction to Apollo: A webinar for the i5K Research CommunityMonica Munoz-Torres
 
2 md2016 annotation
2 md2016 annotation2 md2016 annotation
2 md2016 annotationScott Dawson
 
Introduction to Apollo: i5K E affinis
Introduction to Apollo: i5K E affinisIntroduction to Apollo: i5K E affinis
Introduction to Apollo: i5K E affinisMonica Munoz-Torres
 
Genome Curation using Apollo - Workshop at UTK
Genome Curation using Apollo - Workshop at UTKGenome Curation using Apollo - Workshop at UTK
Genome Curation using Apollo - Workshop at UTKMonica Munoz-Torres
 
Open Reading Frames
Open Reading FramesOpen Reading Frames
Open Reading FramesOsama Zahid
 
gene prediction programs
gene prediction programsgene prediction programs
gene prediction programsMugdhaSharma11
 
Apollo - A webinar for the Phascolarctos cinereus research community
Apollo - A webinar for the Phascolarctos cinereus research communityApollo - A webinar for the Phascolarctos cinereus research community
Apollo - A webinar for the Phascolarctos cinereus research communityMonica Munoz-Torres
 
Apollo : A workshop for the Manakin Research Coordination Network
Apollo: A workshop for the Manakin Research Coordination NetworkApollo: A workshop for the Manakin Research Coordination Network
Apollo : A workshop for the Manakin Research Coordination NetworkMonica Munoz-Torres
 

La actualidad más candente (20)

Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Apollo Introduction for i5K Groups 2015-10-07
Apollo Introduction for i5K Groups 2015-10-07Apollo Introduction for i5K Groups 2015-10-07
Apollo Introduction for i5K Groups 2015-10-07
 
Genome analysis2
Genome analysis2Genome analysis2
Genome analysis2
 
Gene prediction strategies
Gene prediction strategies Gene prediction strategies
Gene prediction strategies
 
prediction methods for ORF
prediction methods for ORFprediction methods for ORF
prediction methods for ORF
 
Apollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research CommunityApollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research Community
 
Gene prediction and expression
Gene prediction and expressionGene prediction and expression
Gene prediction and expression
 
Functional annotation
Functional annotationFunctional annotation
Functional annotation
 
Introduction to Apollo: A webinar for the i5K Research Community
Introduction to Apollo: A webinar for the i5K Research CommunityIntroduction to Apollo: A webinar for the i5K Research Community
Introduction to Apollo: A webinar for the i5K Research Community
 
2 md2016 annotation
2 md2016 annotation2 md2016 annotation
2 md2016 annotation
 
Introduction to Apollo: i5K E affinis
Introduction to Apollo: i5K E affinisIntroduction to Apollo: i5K E affinis
Introduction to Apollo: i5K E affinis
 
Genome annotation
Genome annotationGenome annotation
Genome annotation
 
Introduction to Apollo for i5k
Introduction to Apollo for i5kIntroduction to Apollo for i5k
Introduction to Apollo for i5k
 
Genome Curation using Apollo
Genome Curation using ApolloGenome Curation using Apollo
Genome Curation using Apollo
 
Genome assembly
Genome assemblyGenome assembly
Genome assembly
 
Genome Curation using Apollo - Workshop at UTK
Genome Curation using Apollo - Workshop at UTKGenome Curation using Apollo - Workshop at UTK
Genome Curation using Apollo - Workshop at UTK
 
Open Reading Frames
Open Reading FramesOpen Reading Frames
Open Reading Frames
 
gene prediction programs
gene prediction programsgene prediction programs
gene prediction programs
 
Apollo - A webinar for the Phascolarctos cinereus research community
Apollo - A webinar for the Phascolarctos cinereus research communityApollo - A webinar for the Phascolarctos cinereus research community
Apollo - A webinar for the Phascolarctos cinereus research community
 
Apollo : A workshop for the Manakin Research Coordination Network
Apollo: A workshop for the Manakin Research Coordination NetworkApollo: A workshop for the Manakin Research Coordination Network
Apollo : A workshop for the Manakin Research Coordination Network
 

Destacado

Local vs. Global Models for Effort Estimation and Defect Prediction
Local vs. Global Models for Effort Estimation and Defect Prediction Local vs. Global Models for Effort Estimation and Defect Prediction
Local vs. Global Models for Effort Estimation and Defect Prediction CS, NcState
 
Global local alignment
Global local alignmentGlobal local alignment
Global local alignmentScott Hamilton
 
Micro array based comparative genomic hybridisation -Dr Yogesh D
Micro array based comparative genomic hybridisation -Dr Yogesh DMicro array based comparative genomic hybridisation -Dr Yogesh D
Micro array based comparative genomic hybridisation -Dr Yogesh DDr.Yogesh D
 
Prediction of protein function from sequence derived protein features
Prediction of protein function from sequence derived protein featuresPrediction of protein function from sequence derived protein features
Prediction of protein function from sequence derived protein featuresLars Juhl Jensen
 
Sequence comparison techniques
Sequence comparison techniquesSequence comparison techniques
Sequence comparison techniquesruchibioinfo
 
MATLAB IMPLEMENTATION OF SELF-ORGANIZING MAPS FOR CLUSTERING OF REMOTE SENSIN...
MATLAB IMPLEMENTATION OF SELF-ORGANIZING MAPS FOR CLUSTERING OF REMOTE SENSIN...MATLAB IMPLEMENTATION OF SELF-ORGANIZING MAPS FOR CLUSTERING OF REMOTE SENSIN...
MATLAB IMPLEMENTATION OF SELF-ORGANIZING MAPS FOR CLUSTERING OF REMOTE SENSIN...Daksh Raj Chopra
 
Global and local alignment (bioinformatics)
Global and local alignment (bioinformatics)Global and local alignment (bioinformatics)
Global and local alignment (bioinformatics)Pritom Chaki
 
The Needleman-Wunsch Algorithm for Sequence Alignment
The Needleman-Wunsch Algorithm for Sequence Alignment The Needleman-Wunsch Algorithm for Sequence Alignment
The Needleman-Wunsch Algorithm for Sequence Alignment Parinda Rajapaksha
 
gene regulation sdk 2013
gene regulation sdk 2013gene regulation sdk 2013
gene regulation sdk 2013Dr-HAMDAN
 

Destacado (20)

Local vs. Global Models for Effort Estimation and Defect Prediction
Local vs. Global Models for Effort Estimation and Defect Prediction Local vs. Global Models for Effort Estimation and Defect Prediction
Local vs. Global Models for Effort Estimation and Defect Prediction
 
Ch06 rna
Ch06 rnaCh06 rna
Ch06 rna
 
Blast
BlastBlast
Blast
 
Analyzing and integrating probabilistic and deterministic computational model...
Analyzing and integrating probabilistic and deterministic computational model...Analyzing and integrating probabilistic and deterministic computational model...
Analyzing and integrating probabilistic and deterministic computational model...
 
Global local alignment
Global local alignmentGlobal local alignment
Global local alignment
 
Micro array based comparative genomic hybridisation -Dr Yogesh D
Micro array based comparative genomic hybridisation -Dr Yogesh DMicro array based comparative genomic hybridisation -Dr Yogesh D
Micro array based comparative genomic hybridisation -Dr Yogesh D
 
Prediction of protein function from sequence derived protein features
Prediction of protein function from sequence derived protein featuresPrediction of protein function from sequence derived protein features
Prediction of protein function from sequence derived protein features
 
Sequence comparison techniques
Sequence comparison techniquesSequence comparison techniques
Sequence comparison techniques
 
IntelliGO semantic similarity measure for Gene Ontology annotations
IntelliGO semantic similarity measure for Gene Ontology annotationsIntelliGO semantic similarity measure for Gene Ontology annotations
IntelliGO semantic similarity measure for Gene Ontology annotations
 
Kishor Presentation
Kishor PresentationKishor Presentation
Kishor Presentation
 
Sequence alignment
Sequence alignmentSequence alignment
Sequence alignment
 
MATLAB IMPLEMENTATION OF SELF-ORGANIZING MAPS FOR CLUSTERING OF REMOTE SENSIN...
MATLAB IMPLEMENTATION OF SELF-ORGANIZING MAPS FOR CLUSTERING OF REMOTE SENSIN...MATLAB IMPLEMENTATION OF SELF-ORGANIZING MAPS FOR CLUSTERING OF REMOTE SENSIN...
MATLAB IMPLEMENTATION OF SELF-ORGANIZING MAPS FOR CLUSTERING OF REMOTE SENSIN...
 
SCoT and RAPD
SCoT and RAPDSCoT and RAPD
SCoT and RAPD
 
Global and local alignment (bioinformatics)
Global and local alignment (bioinformatics)Global and local alignment (bioinformatics)
Global and local alignment (bioinformatics)
 
The Needleman-Wunsch Algorithm for Sequence Alignment
The Needleman-Wunsch Algorithm for Sequence Alignment The Needleman-Wunsch Algorithm for Sequence Alignment
The Needleman-Wunsch Algorithm for Sequence Alignment
 
Ch06 alignment
Ch06 alignmentCh06 alignment
Ch06 alignment
 
Gene aswa (2)
Gene aswa (2)Gene aswa (2)
Gene aswa (2)
 
Research proposal
Research proposalResearch proposal
Research proposal
 
gene regulation sdk 2013
gene regulation sdk 2013gene regulation sdk 2013
gene regulation sdk 2013
 
Bioinformatica 08-12-2011-t8-go-hmm
Bioinformatica 08-12-2011-t8-go-hmmBioinformatica 08-12-2011-t8-go-hmm
Bioinformatica 08-12-2011-t8-go-hmm
 

Similar a Bioalgo 2012-01-gene-prediction-sim

So sánh cấu trúc protein_Protein structure comparison
So sánh cấu trúc protein_Protein structure comparisonSo sánh cấu trúc protein_Protein structure comparison
So sánh cấu trúc protein_Protein structure comparisonbomxuan868
 
local and global allignment
local and global allignmentlocal and global allignment
local and global allignmentSHOBI KHAKHI
 
Sergey Nikolenko and Elena Tutubalina - Constructing Aspect-Based Sentiment ...
Sergey Nikolenko and  Elena Tutubalina - Constructing Aspect-Based Sentiment ...Sergey Nikolenko and  Elena Tutubalina - Constructing Aspect-Based Sentiment ...
Sergey Nikolenko and Elena Tutubalina - Constructing Aspect-Based Sentiment ...AIST
 
Bio process
Bio processBio process
Bio processsun777
 
Bio process
Bio processBio process
Bio processsun777
 
AI 바이오 (4일차).pdf
AI 바이오 (4일차).pdfAI 바이오 (4일차).pdf
AI 바이오 (4일차).pdfH K Yoon
 
Sequence-analysis-pairwise-alignment.pdf
Sequence-analysis-pairwise-alignment.pdfSequence-analysis-pairwise-alignment.pdf
Sequence-analysis-pairwise-alignment.pdfsriaisvariyasundar
 
Clustering_Overview.pptx
Clustering_Overview.pptxClustering_Overview.pptx
Clustering_Overview.pptxnyomans1
 
Logic programming (1)
Logic programming (1)Logic programming (1)
Logic programming (1)Nitesh Singh
 
lecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadflecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadfalizain9604
 
Competitive Learning [Deep Learning And Nueral Networks].pptx
Competitive Learning [Deep Learning And Nueral Networks].pptxCompetitive Learning [Deep Learning And Nueral Networks].pptx
Competitive Learning [Deep Learning And Nueral Networks].pptxraghavaram5555
 
Introduction to sequence alignment partii
Introduction to sequence alignment partiiIntroduction to sequence alignment partii
Introduction to sequence alignment partiiSumatiHajela
 
Islamic University Pattern Recognition & Neural Network 2019
Islamic University Pattern Recognition & Neural Network 2019 Islamic University Pattern Recognition & Neural Network 2019
Islamic University Pattern Recognition & Neural Network 2019 Rakibul Hasan Pranto
 
Boosting probabilistic graphical model inference by incorporating prior knowl...
Boosting probabilistic graphical model inference by incorporating prior knowl...Boosting probabilistic graphical model inference by incorporating prior knowl...
Boosting probabilistic graphical model inference by incorporating prior knowl...Hakky St
 

Similar a Bioalgo 2012-01-gene-prediction-sim (20)

02-alignment.pdf
02-alignment.pdf02-alignment.pdf
02-alignment.pdf
 
So sánh cấu trúc protein_Protein structure comparison
So sánh cấu trúc protein_Protein structure comparisonSo sánh cấu trúc protein_Protein structure comparison
So sánh cấu trúc protein_Protein structure comparison
 
local and global allignment
local and global allignmentlocal and global allignment
local and global allignment
 
BioAlg02.ppt
BioAlg02.pptBioAlg02.ppt
BioAlg02.ppt
 
Sergey Nikolenko and Elena Tutubalina - Constructing Aspect-Based Sentiment ...
Sergey Nikolenko and  Elena Tutubalina - Constructing Aspect-Based Sentiment ...Sergey Nikolenko and  Elena Tutubalina - Constructing Aspect-Based Sentiment ...
Sergey Nikolenko and Elena Tutubalina - Constructing Aspect-Based Sentiment ...
 
Bio process
Bio processBio process
Bio process
 
Bio process
Bio processBio process
Bio process
 
AI 바이오 (4일차).pdf
AI 바이오 (4일차).pdfAI 바이오 (4일차).pdf
AI 바이오 (4일차).pdf
 
Sequence-analysis-pairwise-alignment.pdf
Sequence-analysis-pairwise-alignment.pdfSequence-analysis-pairwise-alignment.pdf
Sequence-analysis-pairwise-alignment.pdf
 
Introduction to Evolutionary Computations. Akira Imada
Introduction to Evolutionary Computations. Akira ImadaIntroduction to Evolutionary Computations. Akira Imada
Introduction to Evolutionary Computations. Akira Imada
 
Clustering_Overview.pptx
Clustering_Overview.pptxClustering_Overview.pptx
Clustering_Overview.pptx
 
F0422052058
F0422052058F0422052058
F0422052058
 
Logic programming (1)
Logic programming (1)Logic programming (1)
Logic programming (1)
 
Apolo Taller en BIOS
Apolo Taller en BIOS Apolo Taller en BIOS
Apolo Taller en BIOS
 
Bioinformatics t8-go-hmm v2014
Bioinformatics t8-go-hmm v2014Bioinformatics t8-go-hmm v2014
Bioinformatics t8-go-hmm v2014
 
lecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadflecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadf
 
Competitive Learning [Deep Learning And Nueral Networks].pptx
Competitive Learning [Deep Learning And Nueral Networks].pptxCompetitive Learning [Deep Learning And Nueral Networks].pptx
Competitive Learning [Deep Learning And Nueral Networks].pptx
 
Introduction to sequence alignment partii
Introduction to sequence alignment partiiIntroduction to sequence alignment partii
Introduction to sequence alignment partii
 
Islamic University Pattern Recognition & Neural Network 2019
Islamic University Pattern Recognition & Neural Network 2019 Islamic University Pattern Recognition & Neural Network 2019
Islamic University Pattern Recognition & Neural Network 2019
 
Boosting probabilistic graphical model inference by incorporating prior knowl...
Boosting probabilistic graphical model inference by incorporating prior knowl...Boosting probabilistic graphical model inference by incorporating prior knowl...
Boosting probabilistic graphical model inference by incorporating prior knowl...
 

Más de BioinformaticsInstitute

Comparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphsComparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphsBioinformaticsInstitute
 
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес... Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...BioinformaticsInstitute
 
Вперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днкВперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днкBioinformaticsInstitute
 
"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр ПредеусBioinformaticsInstitute
 
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...BioinformaticsInstitute
 
Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)BioinformaticsInstitute
 
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...BioinformaticsInstitute
 
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)BioinformaticsInstitute
 

Más de BioinformaticsInstitute (20)

Graph genome
Graph genome Graph genome
Graph genome
 
Nanopores sequencing
Nanopores sequencingNanopores sequencing
Nanopores sequencing
 
A superglue for string comparison
A superglue for string comparisonA superglue for string comparison
A superglue for string comparison
 
Comparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphsComparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphs
 
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес... Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 
Вперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днкВперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днк
 
Knime &amp; bioinformatics
Knime &amp; bioinformaticsKnime &amp; bioinformatics
Knime &amp; bioinformatics
 
"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус
 
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
 
Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)
 
Плюрипотентность 101
Плюрипотентность 101Плюрипотентность 101
Плюрипотентность 101
 
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
 
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
 
Biodb 2011-everything
Biodb 2011-everythingBiodb 2011-everything
Biodb 2011-everything
 
Biodb 2011-05
Biodb 2011-05Biodb 2011-05
Biodb 2011-05
 
Biodb 2011-04
Biodb 2011-04Biodb 2011-04
Biodb 2011-04
 
Biodb 2011-03
Biodb 2011-03Biodb 2011-03
Biodb 2011-03
 
Biodb 2011-01
Biodb 2011-01Biodb 2011-01
Biodb 2011-01
 
Biodb 2011-02
Biodb 2011-02Biodb 2011-02
Biodb 2011-02
 
Ngs 3 1
Ngs 3 1Ngs 3 1
Ngs 3 1
 

Bioalgo 2012-01-gene-prediction-sim

  • 2. Outline 1. Introduction 2. Exon Chaining Problem 3. Spliced Alignment 4. Gene Prediction Tools
  • 4. Similarity-Based Approach to Gene Prediction • Some genomes may be well-studied, with many genes having been experimentally verified. • Closely-related organisms may have similar genes. • In order to determine the functions of unknown genes, researchers may compare them to known genes in a closely- related species. • This is the idea behind the similarity-based approach to gene prediction.
  • 5. Similarity-Based Approach to Gene Prediction • There  is  one  issue  with  comparison:  genes  are  often  “split.” • The exons, or coding sections of a gene, are separated from introns, or the non-coding sections. • Example: • Our Problem: Given a known gene and an unannotated genome sequence, find a set of substrings in the genomic sequence whose concatenation best matches the known gene. • This concatenation will be our candidate gene. …AGGGTCTCATTGTAGACAGTGGTACTGATCAACGCAGGACTT… Coding Non-coding Coding Non-coding
  • 6. • Small islands of similarity corresponding to similarities between exons Comparing Genes in Two Genomes
  • 7. FrogGene(known) Human Genome Using Similarities to Find Exon Structure • A (known) frog gene is aligned to different locations in the human genome. • Find  the  “best”  path  to  reveal  the  exon  structure  of  the   corresponding human gene.
  • 8. Finding Local Alignments • A (known) frog gene is aligned to different locations in the human genome. • Find  the  “best”  path  to  reveal  the  exon  structure  of  the   corresponding human gene. • Use local alignments to find all islands of similarity. FrogGene(known) Human Genome
  • 9. genome mRNA Exon 3Exon 1 Exon 2 { { { Intron 1 Intron 2 { { Reverse Translation • Reverse Translation Problem: Given a known protein, find a gene which codes for it. • Inexact: amino acids map to > 1 codon • This problem is essentially reduced to an alignment problem. • Example: Comparing Genomic DNA Against mRNA.
  • 10. Portion of genome mRNA (codonsequence) Exon 3Exon 1 Exon 2 { { { Intron 1 Intron 2 { { Comparing Genomic DNA Against mRNA
  • 11. Reverse Translation • The reverse translation problem can be modeled as traveling in Manhattan  grid  with  “free”  horizontal  jumps. • Each horizontal jump models insertion of an intron. • Complexity of Manhattan grid is O(n3). • Issue: Would match nucleotides pointwise and use horizontal jumps at every opportunity.
  • 13. Chaining Local Alignments • Aim: Find candidate exons, or substrings that match a given gene sequence. • Define a candidate exon as (l, r, w): • l = starting position of exon • r = ending position of exon • w = weight of exon, defined as score of local alignment or some other weighting score • Idea: Look for a chain of substrings with maximal score. • Chain: a set of non-overlapping nonadjacent intervals.
  • 14. • Locate the beginning and end of each interval (2n points). • Find  the  “best”  concatenation  of  some  of  the  intervals. 3 4 11 9 15 5 5 0 2 3 5 6 11 13 16 20 25 27 28 30 32 Exon Chaining Problem: Illustration
  • 15. • Locate the beginning and end of each interval (2n points). • Find  the  “best”  concatenation  of  some  of  the  intervals. 3 4 11 9 15 5 5 0 2 3 5 6 11 13 16 20 25 27 28 30 32 Exon Chaining Problem: Illustration
  • 16. • Locate the beginning and end of each interval (2n points). • Find  the  “best”  concatenation  of  some  of  the  intervals. 3 4 11 9 15 5 5 0 2 3 5 6 11 13 16 20 25 27 28 30 32 Exon Chaining Problem: Illustration
  • 17. Exon Chaining Problem: Formulation • Goal: Given a set of putative exons, find a maximum set of non-overlapping putative exons (chain). • Input: A set of weighted intervals (putative exons). • Output: A maximum chain of intervals from this set.
  • 18. Exon Chaining Problem: Formulation • Goal: Given a set of putative exons, find a maximum set of non-overlapping putative exons (chain). • Input: A set of weighted intervals (putative exons). • Output: A maximum chain of intervals from this set. • Question: Would a greedy algorithm solve this problem?
  • 19. • This problem can be solved with dynamic programming in O(n) time. • Idea: Connect adjacent endpoints with weight zero edges, connect ends of weight w exon with weight w edge. Exon Chaining Problem: Graph Representation
  • 20. ExonChaining (G, n) //Graph, number of intervals 1. for i ← 1 to 2n 2. si ← 0 3. for i ← 1 to 2n 4. if vertex vi in G corresponds to right end of the interval I 5. j ← index of vertex for left end of the interval I 6. w ← weight of the interval I 7. sj ← max {sj + w, si-1} 8. else 9. si ← si-1 10.return s2n Exon Chaining Problem: Pseudocode
  • 21. Exon Chaining: Deficiencies 1. Poor definition of the putative exon endpoints. 2. Optimal chain of intervals may not correspond to a valid alignment. • Example: First interval may correspond to a suffix, whereas second interval may correspond to a prefix.
  • 22. Human Genome FrogGenes(known) Infeasible Chains: Illustration • Red local similarities form two non-overlapping intervals but do not form a valid global alignment.
  • 23. • The cell carries DNA as a blueprint for producing proteins, like a manufacturer carries a blueprint for producing a car. Gene Prediction Analogy
  • 24. Gene Prediction Analogy: Using Blueprint • Each protein has its own distinct blueprint for construction.
  • 25. Gene Prediction Analogy: Assembling Exons
  • 26. Gene  Prediction  Analogy:  Still  Assembling…
  • 28. Spliced Alignment • Mikhail Gelfand and colleagues proposed a spliced alignment approach of using a protein from one genome to reconstruct the exon-intron structure of a (related) gene in another genome. • Spliced alignment begins by selecting either all putative exons between potential acceptor and donor sites or by finding all substrings similar to the target protein (as in the Exon Chaining Problem). • This set is further filtered in a such a way that attempts to retain all true exons, with some false ones.
  • 29. Spliced Alignment Problem: Formulation • Goal: Find a chain of blocks in a genomic sequence that best fits a target sequence. • Input: Genomic sequences G, target sequence T, and a set of candidate exons B. • Output: A chain of exons C such that the global alignment score between C* and T is maximum among all chains of blocks from B. • Here C* is the concatenation of all exons from chain C.
  • 30. Example:  Lewis  Carroll’s  “Jabberwocky” • Genomic  Sequence:  “It  was  a  brilliant  thrilling  morning  and   the slimy, hellish, lithe doves gyrated and gamboled nimbly in the  waves.” • Target  Sequence:  “Twas  brillig,  and  the  slithy  toves  did  gyre   and  gimble  in  the  wabe.” • Alignment:  Next  Slide…    
  • 31. Example:  Lewis  Carroll’s  “Jabberwocky”
  • 32. Example:  Lewis  Carroll’s  “Jabberwocky”
  • 33. Example:  Lewis  Carroll’s  “Jabberwocky”
  • 34. Example:  Lewis  Carroll’s  “Jabberwocky”
  • 35. Example:  Lewis  Carroll’s  “Jabberwocky”
  • 36. Spliced Alignment: Idea • Compute the best alignment between i-prefix of genomic sequence G and j-prefix of target T: S(i,j) • But what is “i-prefix”  of G? • There may be a few i-prefixes of G, depending on which block B we are in.
  • 37. Spliced Alignment: Idea • Compute the best alignment between i-prefix of genomic sequence G and j-prefix of target T: S(i,j) • But what is “i-prefix”  of G? • There may be a few i-prefixes of G, depending on which block B we are in. • Compute the best alignment between i-prefix of genomic sequence G and j-prefix of target T under the assumption that the alignment uses the block B at position i: S(i, j, B).
  • 38. Spliced Alignment Recurrence • If i is not the starting vertex of block B: • If i is the starting vertex of block B: where p is  some  indel  penalty  and  δ  is  a  scoring  matrix.  S(i, j,B)  max S(i  1, j,B)  p S(i, j  1,B)  p S(i  1, j  1,B)   (gi,t j )      S(i, j,B)  max B ' preceding B S(i, j  1,B)  p S(end (B'), j,B')  p S(end (B'), j  1,B')   (gi,t j )     
  • 39. Spliced Alignment Recurrence • After computing the three-dimensional table S(i, j, B), the score of the optimal spliced alignment is:  max B S end (B), length (T), B 
  • 40. Spliced Alignment: Complications • Considering multiple i-prefixes leads to slow down. • Running time: where m is the target length, n is the genomic sequence length and |B| is the number of blocks. • A mosaic effect: short exons are easily combined to fit any target protein.  O mn 2 B 
  • 43. Spliced Alignment: Speedup  P i, j   max B preceding i S end B , j, B 
  • 44. Exon Chaining vs Spliced Alignment • In Spliced Alignment, every path spells out the string obtained by concatenation of labels of its edges. • The weight of the path is defined as optimal alignment score between concatenated labels (blocks) and target sequence. • Defines weight of entire path in graph, but not the weights for individual edges. • Exon Chaining assumes the positions and weights of exons are pre-defined.
  • 46. Gene Prediction: Aligning Genome vs. Genome • Goal: Align entire human and mouse genomes. • Predict genes in both sequences simultaneously as chains of aligned blocks (exons). • This approach does not assume any annotation of either human or mouse genes.
  • 47. Gene Prediction Tools 1. GENSCAN/Genome Scan 2. TwinScan 3. Glimmer 4. GenMark
  • 48. The GENSCAN Algorithm • Algorithm is based on probabilistic model of gene structure. • GENSCAN  uses  a  “training  set,”  then  the  algorithm  returns   the exon structure using maximum likelihood approach standard to many HMM algorithms (Viterbi algorithm). • Biological input: Codon bias in coding regions, gene structure (start and stop codons, typical exon and intron length, presence of promoters, presence of genes on both strands, etc). • Benefit: Covers cases where input sequence contains no gene, partial gene, complete gene, or multiple genes.
  • 49. GENSCAN Limitations • Does not use similarity search to predict genes. • Does not address alternative splicing. • Could combine two exons from consecutive genes together.
  • 50. GenomeScan • Incorporates similarity information into GENSCAN: predicts gene structure which corresponds to maximum probability conditional on similarity information. • Algorithm is a combination of two sources of information: 1. Probabilistic models of exons-introns. 2. Sequence similarity information.
  • 51. http://www.stanford.edu/class/cs262/Spring2003/Notes/ln10.pdf TwinScan • Aligns two sequences and marks each base as gap ( - ), mismatch (:), or match (|). • Results in a new alphabet of 12 letters Σ = {A-, A:, A |, C-, C:, C |, G-, G:, G |, T-, T:, T|}. • Then run Viterbi algorithm using emissions ek(b) where b ∊ {A-,  A:,  A|,  …,  T|}.
  • 52. TwinScan • The emission probabilities are estimated from human/mouse gene pairs. • Example: eI(x|) < eE(x|) since matches are favored in exons, and eI(x-) > eE(x-) since gaps (as well as mismatches) are favored in introns. • Benefit: Compensates for dominant occurrence of poly-A region in introns.
  • 53. Glimmer • Glimmer: Stands for Gene Locator and Interpolated Markov ModelER • Finds genes in bacterial DNA • Uses interpolated Markov Models
  • 54. Glimmer • Made of 2 programs: 1. BuildIMM: • Takes sequences as input and outputs the Interpolated Markov Models (IMMs). 2. Glimmer: • Takes IMMs and outputs all candidate genes. • Automatically resolves overlapping genes by choosing one, hence limited. • Marks  “suspected  to  truly  overlap”  genes  for  closer   inspection by user.
  • 55. GenMark • Based on non-stationary Markov chain models. • Results displayed graphically with coding vs. noncoding probability dependent on position in nucleotide sequence.