2. Hidden Markov Models
What is a hidden Markov model (HMM)?
A machine learning technique and…
A discrete hill climb technique
Two for the price of one!
Where are HMMs used?
Speech recognition, malware
detection, IDS, etc., etc., etc.
Why is it useful?
Easy to apply and efficient algorithms
HMM 2
3. Markov Chain
Markov chain:
“memoryless random process”
Transitions depend only on current state
(Markov chain of order 1) and transition
probability matrix
Example?
See next slide…
4. Markov Chain
Suppose we’re interested in average annual temperature
Only consider Hot (H) and Cold (C)
From recorded history, we obtain the probabilities in the diagram:
P(H→H) = 0.7, P(H→C) = 0.3
P(C→H) = 0.4, P(C→C) = 0.6
5. Markov Chain
Transition probability matrix:
        H    C
   H  0.7  0.3
   C  0.4  0.6
Matrix is denoted as A
Note, A is “row stochastic”: each row sums to 1
6. Markov Chain
Can also include begin, end states
Begin state matrix is π
In this example, π = (0.6, 0.4)
That is, we begin in state H with probability 0.6 and in state C with probability 0.4
Note that π is also row stochastic
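The chain above is small enough to experiment with directly. A minimal Python sketch (matrix values from the diagram; iterating the distribution shows the chain's long-run behavior, which works out to (4/7, 3/7)):

```python
# The temperature Markov chain from the diagram.
A = [[0.7, 0.3],   # P(H->H), P(H->C)
     [0.4, 0.6]]   # P(C->H), P(C->C)
pi = [0.6, 0.4]    # initial distribution (H, C)

# Each row of A sums to 1 ("row stochastic").
assert all(abs(sum(row) - 1.0) < 1e-12 for row in A)

def step(dist, A):
    """One step of the chain: multiply the distribution by A."""
    return [sum(dist[i] * A[i][j] for i in range(len(A))) for j in range(len(A))]

# Iterate: the distribution converges to the stationary distribution (4/7, 3/7).
dist = pi
for _ in range(50):
    dist = step(dist, A)
print(dist)  # ~ [0.5714, 0.4286]
```

The stationary distribution follows from solving x = 0.7x + 0.4(1 - x), which gives x = 4/7 for H.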
7. Hidden Markov Model
HMM includes a Markov chain
But the Markov process is “hidden”
Cannot observe the Markov process
Instead, we observe something related (by
probabilities) to hidden states
It’s as if there is a “curtain” between
Markov chain and observations
Example on next few slides…
8. HMM Example
Consider H/C temperature example
Suppose we want to know H or C annual
temperature in distant past
Before thermometers (or humans) invented
We just want to decide between H and C
We assume transition between Hot and
Cold years is same as today
So, the A matrix is known
9. HMM Example
Temp in past determined by Markov process
But, we cannot observe temperature in past
We find that tree ring size is related to temperature
Look at historical data to see the connection
We consider 3 tree ring sizes
Small, Medium, Large (S, M, L, respectively)
Measure tree ring sizes and recorded
temperatures to determine relationship
10. HMM Example
We find that tree ring sizes and temperature are related by the probabilities below
This is known as the B matrix:
        S    M    L
   H  0.1  0.4  0.5
   C  0.7  0.2  0.1
Note that B is also row stochastic
11. HMM Example
Can we now find H/C temps in past?
We cannot measure (observe) temps
But we can measure tree ring sizes…
…and tree ring sizes related to temps
By the B matrix
We ought to be able to say something about average annual temperature
12. HMM Notation
A lot of notation is required
Notation may be the most difficult part
13. HMM Notation
To simplify notation, observations are
taken from the set {0,1,…,M-1}
That is, V = {0,1,…,M-1}, where M is the number of observation symbols
The matrix A = {aij} is N x N (N is the number of states), where
aij = P(state qj at t+1 | state qi at t)
The matrix B = {bj(k)} is N x M, where
bj(k) = P(observation k at time t | state qj at t)
14. HMM Example
Consider our temperature example…
What are the observations?
V = {0,1,2}, which corresponds to S,M,L
What are states of Markov process?
Q = {H,C}
What are A,B,π, and T?
A, B, π on previous slides
T is number of tree rings measured
What are N and M?
N = 2 and M = 3
15. Generic HMM
Generic view of HMM
HMM defined by A, B, and π
We denote HMM “model” as λ = (A,B,π)
16. HMM Example
Suppose that we observe tree ring sizes
For 4 year period of interest: S,M,S,L
Then O = (0, 1, 0, 2)
Most likely (hidden) state sequence?
We want most likely X = (x0, x1, x2, x3)
Let πx0 be prob. of starting in state x0
Note bx0(O0) is prob. of the initial observation O0
And ax0,x1 is prob. of transition x0 to x1
And so on…
17. HMM Example
Bottom line?
We can compute P(X) for any X
For X = (x0, x1, x2, x3) we have
P(X) = πx0 bx0(O0) ax0,x1 bx1(O1) ax1,x2 bx2(O2) ax2,x3 bx3(O3)
Suppose we observe (0,1,0,2), then what
is probability of, say, HHCC?
Plug into formula above to find P(HHCC) ≈ 0.000212
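This computation is easy to check in a few lines of Python (a sketch; A and π are from the earlier slides, and B is the tree ring matrix consistent with the deck's worked probabilities, e.g. 0.002822 for CCCH):

```python
A  = [[0.7, 0.3], [0.4, 0.6]]             # transitions, rows/cols = (H, C)
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]   # observation probs, cols = (S, M, L)
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]                         # observed tree rings: S, M, S, L
H, C = 0, 1

def prob(X, O):
    """P(X and O): pi_x0 b_x0(O0) a_x0,x1 b_x1(O1) ..."""
    p = pi[X[0]] * B[X[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[X[t-1]][X[t]] * B[X[t]][O[t]]
    return p

print(round(prob([H, H, C, C], O), 6))  # ~0.000212
print(round(prob([C, C, C, H], O), 6))  # ~0.002822
```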
18. HMM Example
Do same for all 4-state sequences
We find… (probabilities for all 2^4 = 16 state sequences)
The winner is? CCCH
Not so fast my friend…
19. HMM Example
The path CCCH scores the highest
In dynamic programming (DP), we find
highest scoring path
But, HMM maximizes expected number
of correct states
Sometimes called “EM algorithm”
For “Expectation Maximization”
How does HMM work in this example?
20. HMM Example
For first position…
Sum probabilities for all paths that have H
in 1st position, compare to sum of probs for
paths with C in 1st position --- biggest wins
Repeat for each position and we find
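For a toy problem this size, the position-by-position sums can be done by brute force over all 16 paths (a sketch, matrices as in the temperature example):

```python
from itertools import product

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
names = "HC"

def prob(X):
    """P(X and O) for one state sequence X."""
    p = pi[X[0]] * B[X[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[X[t-1]][X[t]] * B[X[t]][O[t]]
    return p

# For each position, sum the probabilities of all paths with H (resp. C)
# in that position; the bigger sum wins.
best = ""
for t in range(len(O)):
    sums = [0.0, 0.0]
    for X in product((0, 1), repeat=len(O)):
        sums[X[t]] += prob(X)
    best += names[max((0, 1), key=lambda i: sums[i])]
print(best)  # CHCH
```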
21. HMM Example
So, HMM solution gives us CHCH
While DP solution is CCCH
Which solution is better?
Neither!!!
They use different definitions of “best”
22. HMM Paradox?
HMM maximizes expected number of
correct states
Whereas DP chooses “best” overall path
Possible for HMM to choose a “path”
that is impossible
Could be a transition probability of 0
Cannot get impossible path with DP
Is this a flaw with HMM?
No, it’s a feature…
23. HMM Model
An HMM is defined by the three
matrices, A, B, and π
Note that M and N are implied, since
they are the dimensions of the matrices
So, we denote HMM “model” as
λ = (A,B,π)
24. The Three Problems
HMMs used to solve 3 problems
Problem 1: Given a model λ = (A,B,π) and
observation sequence O, find P(O|λ)
That is, we can score an observation sequence
to see how well it fits a given model
Problem 2: Given λ = (A,B,π) and O, find an
optimal state sequence
Uncover hidden part (like previous example)
Problem 3: Given O, N, and M, find the
model λ that maximizes probability of O
That is, train a model to fit observations
25. HMMs in Practice
Typically, HMMs used as follows:
Given an observation sequence…
Assume a (hidden) Markov process exists
Train a model based on observations
Problem 3 (find N by trial and error)
Then given a sequence of
observations, score it versus the model
Problem 1: high score implies it’s similar to
training data, low score implies it’s not
26. HMMs in Practice
Previous slide gives sense in which HMM
is a “machine learning” technique
To train model, we do not need to specify
anything except the parameter N
And “best” N found by trial and error
That is, we don’t have to think too much
Just train HMM and then use it
Best of all, efficient algorithms for HMMs
27. The Three Solutions
We give detailed solutions to the three
problems
Note: We must provide efficient solutions
Recall the three problems:
Problem 1: Score an observation sequence
versus a given model
Problem 2: Given a model, “uncover” hidden part
Problem 3: Given an observation sequence, train
a model
28. Solution 1
Score observations versus a given model
Given model λ = (A,B,π) and observation
sequence O=(O0,O1,…,OT-1), find P(O|λ)
Denote hidden states as
X = (x0, x1, . . . , xT-1)
Then from definition of B,
P(O|X,λ)=bx0(O0) bx1(O1) … bxT-1(OT-1)
And from definition of A and π,
P(X|λ)=πx0 ax0,x1 ax1,x2 … axT-2,xT-1
29. Solution 1
Elementary conditional probability fact:
P(O,X|λ) = P(O|X,λ) P(X|λ)
Sum over all possible state sequences X,
P(O|λ) = Σ P(O,X|λ) = Σ P(O|X,λ) P(X|λ)
= Σπx0bx0(O0)ax0,x1bx1(O1)…axT-2,xT-1bxT-1(OT-1)
This “works” but way too costly
Requires about 2T·N^T multiplications
Why? There are N^T possible state sequences,
and each term in the sum takes about 2T multiplications
There better be a better way…
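For T = 4 and N = 2 the brute-force sum is only 16 paths, so it makes a useful cross-check for the efficient algorithm on the next slide (a sketch, matrices as in the temperature example):

```python
from itertools import product

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]

# Sum P(O, X | lambda) over all N^T state sequences X.
total = 0.0
for X in product(range(2), repeat=len(O)):
    p = pi[X[0]] * B[X[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[X[t-1]][X[t]] * B[X[t]][O[t]]
    total += p
print(total)  # P(O|lambda) ~ 0.009630
```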
30. Forward Algorithm
Instead of brute force: forward algorithm
Or “alpha pass”
For t = 0,1,…,T-1 and i=0,1,…,N-1, let
αt(i) = P(O0,O1,…,Ot,xt=qi|λ)
Probability of the partial observation sequence up to time t,
with the Markov process in state qi at step t
What the?
Can be computed recursively, efficiently
31. Forward Algorithm
Let α0(i) = πibi(O0) for i = 0,1,…,N-1
For t = 1,2,…,T-1 and i=0,1,…,N-1, let
αt(i) = (Σαt-1(j)aji)bi(Ot)
Where the sum is from j = 0 to N-1
From definition of αt(i) we see
P(O|λ) = ΣαT-1(i)
Where the sum is from i = 0 to N-1
Note this requires only about N^2·T multiplications
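The alpha pass above in Python (a sketch using the temperature example; the result matches the brute-force sum over all 16 paths):

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = 2, len(O)

# alpha pass: alpha[t][i] = P(O_0..O_t, state q_i at t | lambda)
alpha = [[0.0] * N for _ in range(T)]
for i in range(N):
    alpha[0][i] = pi[i] * B[i][O[0]]
for t in range(1, T):
    for i in range(N):
        alpha[t][i] = sum(alpha[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]

p = sum(alpha[T-1])  # P(O | lambda)
print(p)  # same value as the brute-force sum, ~0.009630
```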
32. Solution 2
Given a model, find “most likely” hidden
states: Given λ = (A,B,π) and O, find an
optimal state sequence
Recall that optimal means “maximize expected
number of correct states”
In contrast, DP finds best scoring path
For temp/tree ring example, solved this
But hopelessly inefficient approach
A better way: backward algorithm
Or “beta pass”
33. Backward Algorithm
For t = 0,1,…,T-1 and i=0,1,…,N-1, let
βt(i) = P(Ot+1,Ot+2,…,OT-1|xt=qi,λ)
Probability of the partial observation sequence from t+1 to the end,
given that the Markov process is in state qi at step t
Analogous to the forward algorithm
As with forward algorithm, this can be
computed recursively and efficiently
34. Backward Algorithm
Let βT-1(i) = 1 for i = 0,1,…,N-1
For t = T-2,T-3,…,0 and i=0,1,…,N-1, let
βt(i) = Σaijbj(Ot+1)βt+1(j)
Where the sum is from j = 0 to N-1
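The beta pass in Python (a sketch, same example model as before):

```python
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
O = [0, 1, 0, 2]
N, T = 2, len(O)

# beta pass: beta[t][i] = P(O_{t+1}..O_{T-1} | state q_i at t, lambda)
beta = [[0.0] * N for _ in range(T)]
for i in range(N):
    beta[T-1][i] = 1.0
for t in range(T-2, -1, -1):
    for i in range(N):
        beta[t][i] = sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j] for j in range(N))
print(beta[0])  # ~[0.0302, 0.0279]
```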
35. Solution 2
For t = 0,1,…,T-1 and i=0,1,…,N-1 define
γt(i) = P(xt=qi|O,λ)
Most likely state at t is qi that maximizes γt(i)
Note that γt(i) = αt(i)βt(i)/P(O|λ)
And recall P(O|λ) = ΣαT-1(i)
The bottom line?
Forward algorithm solves Problem 1
Forward/backward algorithms solve Problem 2
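Putting the two passes together gives the efficient version of the temperature example's "sum the paths per position" computation (a sketch; it recovers CHCH without enumerating all 16 paths):

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = 2, len(O)

# forward (alpha) and backward (beta) passes
alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
                  for i in range(N)])
beta = [[1.0] * N]
for t in range(T-2, -1, -1):
    beta.insert(0, [sum(A[i][j] * B[j][O[t+1]] * beta[0][j] for j in range(N))
                    for i in range(N)])

pO = sum(alpha[T-1])  # P(O | lambda)
# gamma[t][i] = alpha[t][i] * beta[t][i] / P(O | lambda)
gamma = [[alpha[t][i] * beta[t][i] / pO for i in range(N)] for t in range(T)]
path = "".join("HC"[max(range(N), key=lambda i: gamma[t][i])] for t in range(T))
print(path)  # CHCH
```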
36. Solution 3
Train a model: Given O, N, and M, find λ
that maximizes probability of O
Here, we iteratively adjust λ = (A,B,π)
to better fit the given observations O
The size of matrices are fixed (N and M)
But elements of matrices can change
It is amazing that this works!
And even more amazing that it’s efficient
37. Solution 3
For t=0,1,…,T-2 and i,j in {0,1,…,N-1}, define “di-gammas” as
γt(i,j) = P(xt=qi, xt+1=qj|O,λ)
Note γt(i,j) is prob of being in state qi at
time t and transiting to state qj at t+1
Then γt(i,j) = αt(i)aijbj(Ot+1)βt+1(j)/P(O|λ)
And γt(i) = Σγt(i,j)
Where sum is from j = 0 to N – 1
38. Model Re-estimation
Given di-gammas and gammas…
For i = 0,1,…,N-1 let πi = γ0(i)
For i = 0,1,…,N-1 and j = 0,1,…,N-1
aij = Σγt(i,j)/Σγt(i)
Where both sums are from t = 0 to T-2
For j = 0,1,…,N-1 and k = 0,1,…,M-1
bj(k) = Σγt(j)/Σγt(j)
Both sums are from t = 0 to T-2, but only t for
which Ot = k are counted in the numerator
Why does this work?
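One re-estimation step in Python (a sketch using the temperature example; note that the B sums here run over all t = 0,…,T-1, a common variant of the update):

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, M, T = 2, 3, len(O)

# alpha and beta passes, as on the earlier slides
alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
                  for i in range(N)])
beta = [[1.0] * N]
for t in range(T-2, -1, -1):
    beta.insert(0, [sum(A[i][j] * B[j][O[t+1]] * beta[0][j] for j in range(N))
                    for i in range(N)])
pO = sum(alpha[T-1])

# di-gammas and gammas
dig = [[[alpha[t][i] * A[i][j] * B[j][O[t+1]] * beta[t+1][j] / pO
         for j in range(N)] for i in range(N)] for t in range(T-1)]
gam = [[alpha[t][i] * beta[t][i] / pO for i in range(N)] for t in range(T)]

# re-estimated model
new_pi = [gam[0][i] for i in range(N)]
new_A = [[sum(dig[t][i][j] for t in range(T-1)) /
          sum(gam[t][i] for t in range(T-1)) for j in range(N)] for i in range(N)]
new_B = [[sum(gam[t][j] for t in range(T) if O[t] == k) /
          sum(gam[t][j] for t in range(T)) for k in range(M)] for j in range(N)]
```

The re-estimated π, and every row of the new A and B, are again row stochastic.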
39. Solution 3
To summarize…
1. Initialize λ = (A,B,π)
2. Compute αt(i), βt(i), γt(i,j), γt(i)
3. Re-estimate the model λ = (A,B,π)
4. If P(O|λ) increases, goto 2
40. Solution 3
Some fine points…
Model initialization
If we have a good guess for λ = (A,B,π) then we
can use it for initialization
If not, let πi ≈ 1/N, aij ≈ 1/N, bj(k) ≈ 1/M
Subject to row stochastic conditions
But, do not initialize to uniform values
Stopping conditions
Stop after some number of iterations and/or…
Stop if increase in P(O|λ) is “small”
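A possible initialization helper (a sketch; the ±0.01 perturbation size is an arbitrary choice, the point is only "approximately uniform, row stochastic, not exactly uniform"):

```python
import random

def init_row(n, rng):
    """A row that is approximately, but not exactly, uniform (row stochastic)."""
    row = [1.0 / n + rng.uniform(-0.01, 0.01) for _ in range(n)]
    s = sum(row)
    return [x / s for x in row]   # renormalize so the row sums to 1

rng = random.Random(0)
N, M = 2, 27                      # e.g. the English text example later on
pi = init_row(N, rng)
A  = [init_row(N, rng) for _ in range(N)]
B  = [init_row(M, rng) for _ in range(N)]
assert abs(sum(pi) - 1) < 1e-12 and all(abs(sum(r) - 1) < 1e-12 for r in A + B)
```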
41. HMM as Discrete Hill Climb
Algorithm on previous slides shows that
HMM is a “discrete hill climb”
HMM consists of discrete parameters
Specifically, the elements of the matrices
And the re-estimation process improves the
model by modifying parameters
So, process “climbs” toward improved model
This happens in a high-dimensional space
42. Dynamic Programming
Brief detour…
For λ = (A,B,π) as above, it’s easy to
define a dynamic program (DP)
Executive summary:
DP is forward algorithm, with “sum”
replaced by “max”
Precise details on next few slides
43. Dynamic Programming
Let δ0(i) = πi bi(O0) for i=0,1,…,N-1
For t=1,2,…,T-1 and i=0,1,…,N-1 compute
δt(i) = max (δt-1(j)aji)bi(Ot)
Where the max is over j in {0,1,…,N-1}
Note that at each t, the DP computes best
path for each state, up to that point
So, probability of best path is max δT-1(j)
This max gives the best probability
Not the best path, for that, see next slide
44. Dynamic Programming
To determine optimal path
While computing deltas, keep track of pointers
to previous state
When finished, construct optimal path by
tracing back pointers
For example, consider temp example: recall
that we observe (0,1,0,2)
Probabilities for paths of length 1:
P(H) = 0.6(0.1) = 0.06 and P(C) = 0.4(0.7) = 0.28
These are the only “paths” of length 1
45. Dynamic Programming
Probabilities for each path of length 2
Best path of length 2 ending with H is CH, with probability 0.28(0.4)(0.4) = 0.0448
Best path of length 2 ending with C is CC, with probability 0.28(0.6)(0.2) = 0.0336
47. Dynamic Programming
Best final score is 0.002822
And, thanks to pointers, best path is CCCH
But what about underflow?
A serious problem in bigger cases
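The full DP, with backpointers, can be sketched in Python (same example model; it recovers the CCCH path and the 0.002822 score):

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = 2, len(O)

delta = [[pi[i] * B[i][O[0]] for i in range(N)]]
back = []                       # backpointers: best previous state
for t in range(1, T):
    row, ptrs = [], []
    for i in range(N):
        j = max(range(N), key=lambda j: delta[t-1][j] * A[j][i])
        row.append(delta[t-1][j] * A[j][i] * B[i][O[t]])
        ptrs.append(j)
    delta.append(row)
    back.append(ptrs)

# best final score, then trace the pointers back
last = max(range(N), key=lambda i: delta[T-1][i])
score = delta[T-1][last]
path = [last]
for ptrs in reversed(back):
    path.append(ptrs[path[-1]])
path = "".join("HC"[i] for i in reversed(path))
print(path, score)  # CCCH ~0.002822
```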
48. Underflow Resistant DP
Common trick to prevent underflow
Instead of multiplying probabilities…
…we add logarithms of probabilities
Why does this work?
Because log(xy) = log x + log y
Adding logs does not tend to 0
Note that we must avoid 0 probabilities
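A two-line demonstration of why the log trick matters (a sketch; 400 factors of 0.1 is an arbitrary illustration):

```python
import math

# Multiplying 400 probabilities of 0.1 underflows to 0.0 in double precision...
p = 1.0
for _ in range(400):
    p *= 0.1
print(p)  # 0.0

# ...but the sum of logs stays perfectly usable.
log_p = sum(math.log(0.1) for _ in range(400))
print(log_p)  # ~ -921.03
```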
49. Underflow Resistant DP
Underflow resistant DP algorithm:
Let δ0(i) = log(πi bi(O0)) for i=0,1,…,N-1
For t=1,2,…,T-1 and i=0,1,…,N-1 compute
δt(i) = max (δt-1(j) + log(aji) + log(bi(Ot)))
Where the max is over j in {0,1,…,N-1}
And score of best path is max δT-1(j)
As before, must also keep track of paths
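The log-space version of the DP sketch (same example; the path is unchanged, only the arithmetic differs):

```python
import math

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = 2, len(O)

delta = [[math.log(pi[i] * B[i][O[0]]) for i in range(N)]]
back = []
for t in range(1, T):
    row, ptrs = [], []
    for i in range(N):
        # sums of logs replace products of probabilities
        j = max(range(N), key=lambda j: delta[t-1][j] + math.log(A[j][i]))
        row.append(delta[t-1][j] + math.log(A[j][i]) + math.log(B[i][O[t]]))
        ptrs.append(j)
    delta.append(row)
    back.append(ptrs)

last = max(range(N), key=lambda i: delta[T-1][i])
best_score = delta[T-1][last]          # log of the best path probability
path = [last]
for ptrs in reversed(back):
    path.append(ptrs[path[-1]])
path_str = "".join("HC"[i] for i in reversed(path))
print(path_str)  # CCCH
```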
50. HMM Scaling
Trickier to prevent underflow in HMM
We consider solution 3
Since it includes solutions 1 and 2
Recall for t = 1,2,…,T-1, i=0,1,…,N-1,
αt(i) = (Σαt-1(j)aj,i)bi(Ot)
The idea is to normalize alphas so that
they sum to 1
Algorithm on next slide
51. HMM Scaling
Given αt(i) = (Σαt-1(j)aj,i)bi(Ot)
Let a0(i) = α0(i) for i=0,1,…,N-1
Let c0 = 1/Σa0(j)
For i = 0,1,…,N-1, let a0(i) = c0a0(i)
This takes care of t = 0 case
Algorithm continued on next slide…
52. HMM Scaling
For t = 1,2,…,T-1 do the following:
For i = 0,1,…,N-1,
at(i) = (Σat-1(j)aj,i)bi(Ot)
Let ct = 1/Σat(j)
For i = 0,1,…,N-1 let at(i) = ctat(i)
53. HMM Scaling
Easy to show at(i) = c0c1…ct αt(i) (♯)
Simple proof by induction
So, c0c1…ct is scaling factor at step t
Also, easy to show that
at(i) = αt(i)/Σαt(j)
Which implies ΣaT-1(i) = 1 (♯♯)
54. HMM Scaling
By combining (♯) and (♯♯), we have
1 = ΣaT-1(i) = c0c1…cT-1 ΣαT-1(i)
= c0c1…cT-1 P(O|λ)
Therefore, P(O|λ) = 1 / c0c1…cT-1
To avoid underflow, we compute
log P(O|λ) = -Σlog(cj)
Where sum is from j = 0 to T-1
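The scaled alpha pass and the log P(O|λ) formula, in Python (a sketch on the temperature example; the log-likelihood matches the unscaled forward algorithm):

```python
import math

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = 2, len(O)

# scaled alpha pass: normalize at each step so the alphas sum to 1
a = [pi[i] * B[i][O[0]] for i in range(N)]
c = [1.0 / sum(a)]
a = [c[0] * x for x in a]
for t in range(1, T):
    a = [sum(a[j] * A[j][i] for j in range(N)) * B[i][O[t]] for i in range(N)]
    c.append(1.0 / sum(a))
    a = [c[t] * x for x in a]

log_p = -sum(math.log(ct) for ct in c)   # log P(O | lambda)
print(log_p)  # ~ log(0.009630)
```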
55. HMM Scaling
Similarly, scale betas as ctβt(i)
For re-estimation,
Compute γt(i,j) and γt(i) using original formulas, but
with scaled alphas and betas
This gives us new values for λ = (A,B,π)
“Easy exercise” to show re-estimate is
exact when scaled alphas and betas used
Also, P(O|λ) cancels from formula
Use log P(O|λ) = -Σlog(cj) to decide if iterate
improves
56. All Together Now
Complete pseudo code for Solution 3
Given: (O0,O1,…,OT-1) and N and M
Initialize: λ = (A,B,π)
A is NxN, B is NxM and π is 1xN
πi ≈ 1/N, aij ≈ 1/N, bj(k) ≈ 1/M, each matrix row
stochastic, but not uniform
Initialize:
maxIters = max number of re-estimation steps
iters = 0
oldLogProb = -∞
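The complete Solution 3 loop, using the scaling from the preceding slides, can be sketched end to end in Python (a sketch with my own variable names and a toy training sequence, not the deck's exact pseudo code; the B update here sums over all t):

```python
import math

def forward_scaled(A, B, pi, O):
    N, T = len(A), len(O)
    alpha, c = [], []
    for t in range(T):
        if t == 0:
            a = [pi[i] * B[i][O[0]] for i in range(N)]
        else:
            a = [sum(alpha[-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
                 for i in range(N)]
        ct = 1.0 / sum(a)
        c.append(ct)
        alpha.append([ct * x for x in a])
    return alpha, c

def backward_scaled(A, B, O, c):
    N, T = len(A), len(O)
    beta = [[c[T - 1]] * N]
    for t in range(T - 2, -1, -1):
        beta.insert(0, [c[t] * sum(A[i][j] * B[j][O[t + 1]] * beta[0][j]
                                   for j in range(N)) for i in range(N)])
    return beta

def reestimate(A, B, pi, O):
    """One Baum-Welch iteration; returns new model and log P(O|old model)."""
    N, M, T = len(A), len(B[0]), len(O)
    alpha, c = forward_scaled(A, B, pi, O)
    beta = backward_scaled(A, B, O, c)
    # di-gammas and gammas; P(O|lambda) cancels thanks to the scaling
    dig = [[[alpha[t][i] * A[i][j] * B[j][O[t + 1]] * beta[t + 1][j]
             for j in range(N)] for i in range(N)] for t in range(T - 1)]
    gam = [[sum(dig[t][i]) for i in range(N)] for t in range(T - 1)]
    gam.append(alpha[T - 1][:])
    new_pi = gam[0][:]
    new_A = [[sum(dig[t][i][j] for t in range(T - 1)) /
              sum(gam[t][i] for t in range(T - 1)) for j in range(N)]
             for i in range(N)]
    new_B = [[sum(gam[t][j] for t in range(T) if O[t] == k) /
              sum(gam[t][j] for t in range(T)) for k in range(M)]
             for j in range(N)]
    return new_A, new_B, new_pi, -sum(math.log(ct) for ct in c)

# Train on a toy sequence, starting from near-uniform (not uniform) guesses.
O  = [0, 1, 0, 2] * 25
A  = [[0.52, 0.48], [0.47, 0.53]]
B  = [[0.34, 0.33, 0.33], [0.32, 0.33, 0.35]]
pi = [0.51, 0.49]
history = []
for _ in range(50):
    A, B, pi, log_p = reestimate(A, B, pi, O)
    history.append(log_p)
print(history[0], history[-1])  # log P(O|lambda) should never decrease
```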
61. Stopping Criteria
Check that probability increases
In practice, want logProb > oldLogProb + ε
And don’t exceed max iterations
62. English Text Example
Suppose Martian arrives on earth
Sees written English text
Wants to learn something about it
Martians know about HMMs
So, strip out all non-letters, make all
letters lower-case
27 symbols (letters, plus word-space)
Train HMM on long sequence of symbols
63. English Text
For first training case, initialize:
N = 2 and M = 27
Elements of A and π are about ½ each
Elements of B are each about 1/27
We use 50,000 symbols for training
After 1st iteration: log P(O|λ) ≈ -165097
After 100th iteration: log P(O|λ) ≈ -137305
64. English Text
Matrices A and π converge:
What does this tell us?
Started in hidden state 1 (not state 0)
And we know transition probabilities
between hidden states
Nothing too interesting here
We don’t care about hidden states
65. English Text
What about the B matrix?
This is much more interesting…
Why???
66. A Security Application
Suppose we want to detect metamorphic
computer viruses
Such viruses vary their internal structure
But function of malware stays same
If sufficiently variable, standard signature
detection will fail
Can we use HMM for detection?
What to use as observation sequence?
Is there really a “hidden” Markov process?
What about N, M, and T?
How many Os needed for training, scoring?
67. HMM for Metamorphic Detection
Split the set of “family” viruses into 2 subsets
Extract opcodes from each virus
Append opcodes from subset 1 to make one
long sequence
Train HMM on opcode sequence (problem 3)
Obtain a model λ = (A,B,π)
Set threshold: score opcodes from files in
subset 2 and “normal” files (problem 1)
Can you set a threshold that separates the sets?
If so, may have a viable detection method
68. HMM for Metamorphic Detection
Virus detection results from a recent paper
Note the separation
This is good!
69. HMM Generalizations
Here, assumed Markov process of order 1
Current state depends only on previous state
and transition matrix A
Can use higher order Markov process
Current state depends on n previous states
Higher order vs size of N ? “Depth” vs “width”
Can have A and B matrices depend on t
HMM often combined with other
techniques (e.g., neural nets)
70. Generalizations
In some cases, a limitation of HMM is
that position information is not used
In many applications this is OK/desirable
In some apps, this is a serious problem
Bioinformatics applications
DNA sequencing, protein alignment, etc.
Sequence alignment is crucial
They use “profile HMMs” instead of HMMs
71. References
A revealing introduction to hidden Markov models, by M. Stamp
http://www.cs.sjsu.edu/faculty/stamp/RUA/HMM.pdf
A tutorial on hidden Markov models and selected applications in speech recognition, by L.R. Rabiner
http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf
72. References
Hunting for metamorphic engines, W. Wong and M. Stamp
Journal in Computer Virology, Vol. 2, No. 3, December 2006, pp. 211-229
Hunting for undetectable metamorphic viruses, D. Lin and M. Stamp
Journal in Computer Virology, Vol. 7, No. 3, August 2011, pp. 201-214