Applying Hidden Markov Models to Bioinformatics
Conor Buckley
Outline
What are Hidden Markov Models?
Why are they a good tool for Bioinformatics?
Applications in Bioinformatics
History of Hidden Markov Models
HMMs were first described in a series of statistical papers by Leonard E. Baum and other authors in the second half of the 1960s. One of the first applications of HMMs was speech recognition, starting in the mid-1970s. They are commonly used in speech recognition systems to help determine the words represented by the captured sound waveforms. In the second half of the 1980s, HMMs began to be applied to the analysis of biological sequences, in particular DNA. Since then, they have become ubiquitous in bioinformatics.
Source: http://en.wikipedia.org/wiki/Hidden_Markov_model#History
What are Hidden Markov Models?
HMM: a formal foundation for making probabilistic models of linear sequence 'labeling' problems. HMMs provide a conceptual toolkit for building complex models just by drawing an intuitive picture.
Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
What are Hidden Markov Models?
Machine learning approach in bioinformatics
Machine learning algorithms are presented with training data, which are used to derive important insights about the (often hidden) parameters. Once an algorithm has been trained, it can apply these insights to the analysis of a test sample. As the amount of training data increases, the accuracy of the machine learning algorithm typically increases as well.
Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
Hidden Markov Models
An HMM has N states, called S1, S2, ..., SN. There are discrete timesteps, t = 0, t = 1, and so on. At each timestep, the system is in exactly one of the available states.
[Figure: three states S1, S2, S3 (N = 3), shown at t = 0.]
Source: http://www.autonlab.org/tutorials/hmm.html
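To make the state-and-timestep picture concrete, here is a minimal sketch of a three-state Markov chain. The transition probabilities in `TRANS` and the helper `simulate` are hypothetical, chosen only for illustration; the slides do not give numbers for this diagram.

```python
import random

# Hypothetical transition probabilities for N = 3 states; each row sums to 1.
TRANS = {
    "S1": {"S1": 0.5, "S2": 0.3, "S3": 0.2},
    "S2": {"S1": 0.2, "S2": 0.6, "S3": 0.2},
    "S3": {"S1": 0.3, "S2": 0.3, "S3": 0.4},
}


def simulate(start, steps, rng):
    """Return the state occupied at t = 0, 1, ..., steps."""
    path = [start]
    for _ in range(steps):
        row = TRANS[path[-1]]
        # The next state depends only on the current state (Markov property).
        path.append(rng.choices(list(row), weights=list(row.values()), k=1)[0])
    return path


print(simulate("S1", 10, random.Random(0)))
```

At every timestep the list holds exactly one state, which is the property the slide describes.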
Hidden Markov Models
[Figure: an HMM over states S1, S2, S3 drawn as a Bayesian network.]
Image: http://en.wikipedia.org/wiki/File:Hmm_temporal_bayesian_net.svg
A Markov Chain
Bayes' theorem [definition not recoverable from the source]
Source: http://wordnetweb.princeton.edu/perl/webwn?s=bayes%27%20theorem
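For reference, Bayes' theorem in its standard form is given below. The symbols A and B are generic events rather than notation from the slides. The second line is the same identity written for a state path π and an observed sequence S, which is an assumption about how the theorem is used later in the deck.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) > 0

P(\pi \mid S, \Theta) = \frac{P(S, \pi \mid \Theta)}{\sum_{\pi'} P(S, \pi' \mid \Theta)}
```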
Building a Markov Chain
Concrete Example
Two friends, Alice and Bob, live far apart from each other and talk daily over the telephone about what they did that day. Bob is only interested in three activities: walking in the park, shopping, and cleaning his apartment. The choice of what to do is determined exclusively by the weather on a given day. Alice has no definite information about the weather where Bob lives, but she knows general trends. Based on what Bob tells her he did each day, Alice tries to guess what the weather must have been like.
Alice believes that the weather operates as a discrete Markov chain. There are two states, "Rainy" and "Sunny", but she cannot observe them directly; that is, they are hidden from her. On each day, there is a certain chance that Bob will perform one of the following activities, depending on the weather: "walk", "shop", or "clean". Since Bob tells Alice about his activities, those are the observations.
Source: Wikipedia.org
Hidden Markov Models
Building a Markov Chain
What now?
Find the most probable sequence of hidden states behind the observations.
Viterbi algorithm: a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path, that results in a sequence of observed events.
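A minimal sketch of the Viterbi recursion for the Alice and Bob example follows. The numbers in `START`, `TRANS`, and `EMIT` are assumptions: the slide that carried the actual probabilities was an image, so the values below are the ones commonly used with this textbook example. The function name `viterbi` is also illustrative.

```python
# Assumed, illustrative parameters (not given in the slide text).
STATES = ("Rainy", "Sunny")
START = {"Rainy": 0.6, "Sunny": 0.4}
TRANS = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
EMIT = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}


def viterbi(observations):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (START[s] * EMIT[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        step = {}
        for s in STATES:
            # Pick the predecessor that maximises the path probability into s.
            prob, prev = max((best[p][0] * TRANS[p][s], p) for p in STATES)
            step[s] = (prob * EMIT[s][obs], best[prev][1] + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])


prob, path = viterbi(["walk", "shop", "clean"])
print(path, prob)
```

Under these assumed numbers, the most likely weather for the observations walk, shop, clean is Sunny, Rainy, Rainy, with a path probability of about 0.0134. Storing the whole path per state keeps the sketch short; a production implementation would usually keep back-pointers and work in log space.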
Viterbi Results
http://pcarvalho.com/forward_viterbi/
Bioinformatics Example
Assume we are given a DNA sequence that begins in an exon, contains one 5' splice site, and ends in an intron. The task is to identify where the switch from exon to intron occurs: where is the splice site?
Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
Bioinformatics Example
In order for us to guess, the sequences of exons, splice sites and introns must have different statistical properties. Let's say:
Exons have a uniform base composition on average: A, C, G and T at 25% each.
Introns are A/T rich: A and T at 40% each, C and G at 10% each.
The 5' splice site consensus nucleotide is almost always a G: G 95%, A 5%.
Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
Bioinformatics Example
We can build a hidden Markov model with three states: "E" for exon, "5" for the 5' splice site (5' SS), and "I" for intron. Each state has its own emission probabilities, which model the base composition of exons, introns and the consensus G at the 5' SS. Each state also has transition probabilities (the arrows in the diagram).
Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
HMM: A Bioinformatics Visual
We can use HMMs to generate a sequence. When we visit a state, we emit a nucleotide based on that state's emission probability distribution. We also choose a state to visit next according to the state's transition probability distribution.
[Figure: the toy HMM, with the observed sequence shown above its underlying state path.]
Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
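The generative procedure described here can be sketched as follows. The emission probabilities are the ones stated in the slides. The transition probabilities in `TRANS` are an assumption: they were only in the lost figure, and the values below are believed to match the toy model in the cited article. The helper names `draw` and `generate` are illustrative.

```python
import random

# Emission probabilities as stated in the slides.
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "G": 0.95},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}
# ASSUMED transition probabilities (not in the slide text).
TRANS = {
    "Start": {"E": 1.0},
    "E": {"E": 0.9, "5": 0.1},
    "5": {"I": 1.0},
    "I": {"I": 0.9, "End": 0.1},
}


def draw(dist, rng):
    """Sample one key of `dist` with probability equal to its value."""
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


def generate(rng):
    """Walk the HMM from Start to End; return (sequence, state path)."""
    seq, path = [], []
    state = draw(TRANS["Start"], rng)
    while state != "End":
        path.append(state)
        seq.append(draw(EMIT[state], rng))  # emit a nucleotide in this state
        state = draw(TRANS[state], rng)     # then choose the next state
    return "".join(seq), "".join(path)


sequence, state_path = generate(random.Random(0))
print(sequence)
print(state_path)
```

Each run produces an observed sequence and a state path of the same length. In a real labeling problem only the first of the two would be visible.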
HMM: A Bioinformatics Visual
The state path is a Markov chain. Since we are given only the observed sequence, the underlying state path is hidden: it is a hidden Markov chain. Therefore we can apply Bayesian probability.
Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
HMM: A Bioinformatics Visual
S: observed sequence
π: state path
Θ: parameters
The probability P(S, π|HMM, Θ) is the product of all emission probabilities and transition probabilities along the path. Let's look at an example.
Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
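Written out, the product looks like this. The symbols a (transition probability), e (emission probability), s_i (the i-th observed symbol) and L (sequence length) are notation introduced here for illustration; the slides only name S, π and Θ.

```latex
P(S, \pi \mid \mathrm{HMM}, \Theta)
  = a_{\mathrm{Start},\,\pi_1} \prod_{i=1}^{L} e_{\pi_i}(s_i)\, a_{\pi_i,\,\pi_{i+1}},
  \qquad \pi_{L+1} = \mathrm{End}
```

A sequence of length L therefore contributes L emission terms and L + 1 transition terms.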
HMM: A Bioinformatics Visual
There are 27 transitions and 26 emissions. Multiply all 53 probabilities together (and take the log, since these are small numbers) and you get log P(S, π|HMM, Θ) = -41.22.
Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
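The calculation can be sketched in a few lines. Several inputs are assumptions because they appeared only in the lost figure: the 26-base sequence `SEQ`, the state path `PATH`, and the transition probabilities in `TRANS` are believed to be those of the cited article's toy model. The log is taken as the natural log. The function name `log_joint` is illustrative.

```python
import math

# Emission probabilities as stated in the slides.
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "G": 0.95},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}
# ASSUMED transition probabilities (not in the slide text).
TRANS = {
    "Start": {"E": 1.0},
    "E": {"E": 0.9, "5": 0.1},
    "5": {"I": 1.0},
    "I": {"I": 0.9, "End": 0.1},
}
# ASSUMED example sequence and state path (splice site on the fifth G).
SEQ = "CTTCATGTGAAAGCAGACGTAAGTCA"
PATH = "EEEEEEEEEEEEEEEEEE5IIIIIII"


def log_joint(seq, path):
    """Natural-log probability of emitting `seq` along `path`.

    Raises KeyError if the path uses a transition or emission that the
    model does not allow.
    """
    if len(seq) != len(path):
        raise ValueError("sequence and path must have the same length")
    states = ["Start", *path, "End"]
    logp = 0.0
    for prev, nxt in zip(states, states[1:]):  # len(seq) + 1 transitions
        logp += math.log(TRANS[prev][nxt])
    for state, base in zip(path, seq):         # len(seq) emissions
        logp += math.log(EMIT[state][base])
    return logp


print(round(log_joint(SEQ, PATH), 2))
```

Under these assumptions the sum covers 27 transition terms and 26 emission terms and rounds to -41.22, matching the figure quoted on the slide. Summing logs rather than multiplying raw probabilities avoids numerical underflow on longer sequences.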
HMM: A Bioinformatics Visual
The model parameters and overall sequence scores are all probabilities. Therefore we can use Bayesian probability theory to manipulate these numbers in standard, powerful ways, including optimizing parameters and interpreting the significance of scores.
Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
HMM: A Bioinformatics Visual
Posterior decoding: an alternative state path places the splice site on the sixth G instead of the fifth, and it scores only slightly worse (log probabilities of -41.71 versus -41.22). How confident are we that the fifth G is the right choice?
Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
HMM: A Bioinformatics Visual
We can calculate our confidence directly. The probability that nucleotide i was emitted by state k is the sum of the probabilities of all the state paths that use state k to generate i, normalized by the sum over all possible state paths.
Result: we get a probability of 46% that the best-scoring fifth G is correct and 28% that the sixth G position is correct.
Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
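For this toy model the sum over state paths can be done by brute force, because every valid path is fixed by where the single 5' splice site falls. The sketch below enumerates those paths. As before, `SEQ` and `TRANS` are assumptions believed to match the cited article's figure, and the helper names are illustrative. General HMMs would use the forward-backward algorithm instead of enumeration.

```python
# Emission probabilities as stated in the slides.
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "G": 0.95},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}
# ASSUMED transition probabilities and example sequence (not in the slide text).
TRANS = {
    "Start": {"E": 1.0},
    "E": {"E": 0.9, "5": 0.1},
    "5": {"I": 1.0},
    "I": {"I": 0.9, "End": 0.1},
}
SEQ = "CTTCATGTGAAAGCAGACGTAAGTCA"


def path_for_site(length, site):
    """State path with the 5' splice site at 0-based index `site`."""
    return "E" * site + "5" + "I" * (length - site - 1)


def joint(seq, path):
    """P(S, path): product of all transition and emission probabilities."""
    states = ["Start", *path, "End"]
    p = 1.0
    for prev, nxt in zip(states, states[1:]):
        p *= TRANS[prev].get(nxt, 0.0)
    for state, base in zip(path, seq):
        p *= EMIT[state].get(base, 0.0)  # 0.0 if the state cannot emit base
    return p


def splice_site_posterior(seq):
    """Posterior probability of each candidate splice-site position."""
    n = len(seq)
    # A valid site needs at least one exon base before it and one intron
    # base after it, so it ranges over indices 1 .. n - 2.
    weights = {i: joint(seq, path_for_site(n, i)) for i in range(1, n - 1)}
    total = sum(weights.values())  # sum over all possible state paths
    return {i: w / total for i, w in weights.items()}


posterior = splice_site_posterior(SEQ)
print(round(posterior[18], 2), round(posterior[22], 2))
```

Indices 18 and 22 are the 0-based positions of the fifth and sixth G in the assumed sequence. Under these assumptions the two posteriors round to 0.46 and 0.28, in line with the percentages on the slide.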
Further Possibilities
The toy model provided by the article is a simple example, but we can go further. We could add a more realistic consensus, GTRAGT, at the 5' splice site by putting a row of six HMM states in place of the '5' state to model a six-base ungapped consensus motif. The possibilities are not limited.
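One way such a row of states could be built is sketched below. Everything numeric here is hypothetical: the slides give no emission probabilities for the six motif positions, so `match=0.95` simply reuses the toy model's figure for the consensus base, and the state names `5_1` to `5_6` are invented for illustration. R is the IUPAC code for a purine (A or G).

```python
# IUPAC codes needed for the consensus GTRAGT.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG"}


def motif_states(consensus, match=0.95):
    """Build one state per consensus position, chained left to right.

    `match` is a HYPOTHETICAL probability mass given to the consensus
    base(s) at each position; the remainder is split over the other bases.
    """
    names = [f"5_{i + 1}" for i in range(len(consensus))]
    emit, trans = {}, {}
    for i, (name, symbol) in enumerate(zip(names, consensus)):
        allowed = IUPAC[symbol]
        others = [b for b in "ACGT" if b not in allowed]
        emit[name] = {b: match / len(allowed) for b in allowed}
        emit[name].update({b: (1.0 - match) / len(others) for b in others})
        # Ungapped motif: each state moves to the next with probability 1,
        # and the last one hands over to the intron state "I".
        nxt = names[i + 1] if i + 1 < len(names) else "I"
        trans[name] = {nxt: 1.0}
    return emit, trans


emit, trans = motif_states("GTRAGT")
for name in emit:
    print(name, emit[name], trans[name])
```

In the full model, the exon state would transition into `5_1` where it previously transitioned into the single '5' state.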
The catch
HMMs don't deal well with correlations between nucleotides, because they assume that each emitted nucleotide depends only on one underlying state. Example of a bad use for an HMM: conserved RNA base pairs, which induce long-range pairwise correlations; one position might be any nucleotide, but the base-paired partner must be complementary. An HMM state path has no way of 'remembering' what a distant state generated.
Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1

Credits
http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
http://en.wikipedia.org/wiki/Viterbi_algorithm
http://en.wikipedia.org/wiki/Hidden_Markov_model
http://en.wikipedia.org/wiki/Bayesian_network
http://www.daimi.au.dk/~bromille/PHM/Storm.pdf