2. Hidden Markov Models
What is a hidden Markov model (HMM)?
A machine learning technique and…
A discrete hill climb technique
Two for the price of one!
Where are HMMs used?
Speech recognition, malware
detection, IDS, etc., etc., etc.
Why is it useful?
Easy to apply and efficient algorithms
HMM 2
3. Markov Chain
Markov chain:
“memoryless random process”
Transitions depend only on current state
(Markov chain of order 1) and transition
probability matrix
Example?
See next slide…
4. Markov Chain
Suppose we’re interested in average annual temperature
Only consider Hot (H) and Cold (C)
From recorded history, we obtain the probabilities in the diagram:
P(H→H) = 0.7, P(H→C) = 0.3
P(C→H) = 0.4, P(C→C) = 0.6
5. Markov Chain
Transition probability matrix:
        H    C
   H  0.7  0.3
   C  0.4  0.6
Matrix is denoted as A
Note, A is “row stochastic”: each row sums to 1
6. Markov Chain
Can also include begin, end states
Begin state matrix is π
In this example, π = (0.6, 0.4)
That is, we begin in state H with probability 0.6 and in state C with probability 0.4
Note that π is also row stochastic
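The chain above is small enough to experiment with directly. A minimal Python sketch (matrix values from the diagram; iterating the distribution shows the chain's long-run behavior, which works out to (4/7, 3/7)):

```python
# The temperature Markov chain from the diagram.
A = [[0.7, 0.3],   # P(H->H), P(H->C)
     [0.4, 0.6]]   # P(C->H), P(C->C)
pi = [0.6, 0.4]    # initial distribution (H, C)

# Each row of A sums to 1 ("row stochastic").
assert all(abs(sum(row) - 1.0) < 1e-12 for row in A)

def step(dist, A):
    """One step of the chain: multiply the distribution by A."""
    return [sum(dist[i] * A[i][j] for i in range(len(A))) for j in range(len(A))]

# Iterate: the distribution converges to the stationary distribution (4/7, 3/7).
dist = pi
for _ in range(50):
    dist = step(dist, A)
print(dist)  # ~ [0.5714, 0.4286]
```

The stationary distribution follows from solving x = 0.7x + 0.4(1 - x), which gives x = 4/7 for H.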
7. Hidden Markov Model
HMM includes a Markov chain
But the Markov process is “hidden”
Cannot observe the Markov process
Instead, we observe something related (by
probabilities) to hidden states
It’s as if there is a “curtain” between
Markov chain and observations
Example on next few slides…
8. HMM Example
Consider H/C temperature example
Suppose we want to know H or C annual
temperature in distant past
Before thermometers (or humans) invented
We just want to decide between H and C
We assume transition between Hot and
Cold years is same as today
So, the A matrix is known
9. HMM Example
Temp in past determined by Markov process
But, we cannot observe temperature in past
We find that tree ring size is related to temperature
Look at historical data to see the connection
We consider 3 tree ring sizes
Small, Medium, Large (S, M, L, respectively)
Measure tree ring sizes and recorded
temperatures to determine relationship
10. HMM Example
We find that tree ring sizes and temperature are related by the probabilities below
This is known as the B matrix:
        S    M    L
   H  0.1  0.4  0.5
   C  0.7  0.2  0.1
Note that B is also row stochastic
11. HMM Example
Can we now find H/C temps in past?
We cannot measure (observe) temps
But we can measure tree ring sizes…
…and tree ring sizes related to temps
By the B matrix
We ought to be able to say something about average annual temperature
12. HMM Notation
A lot of notation is required
Notation may be the most difficult part
13. HMM Notation
To simplify notation, observations are
taken from the set {0,1,…,M-1}
That is, V = {0,1,…,M-1}, where M is the number of observation symbols
The matrix A = {aij} is N x N (N is the number of states), where
aij = P(state qj at t+1 | state qi at t)
The matrix B = {bj(k)} is N x M, where
bj(k) = P(observation k at time t | state qj at t)
14. HMM Example
Consider our temperature example…
What are the observations?
V = {0,1,2}, which corresponds to S,M,L
What are states of Markov process?
Q = {H,C}
What are A,B,π, and T?
A, B, π on previous slides
T is number of tree rings measured
What are N and M?
N = 2 and M = 3
15. Generic HMM
Generic view of HMM
HMM defined by A, B, and π
We denote HMM “model” as λ = (A,B,π)
16. HMM Example
Suppose that we observe tree ring sizes
For 4 year period of interest: S,M,S,L
Then O = (0, 1, 0, 2)
Most likely (hidden) state sequence?
We want most likely X = (x0, x1, x2, x3)
Let πx0 be prob. of starting in state x0
Note bx0(O0) is prob. of the initial observation O0
And ax0,x1 is prob. of transition x0 to x1
And so on…
17. HMM Example
Bottom line?
We can compute P(X) for any X
For X = (x0, x1, x2, x3) we have
P(X) = πx0 bx0(O0) ax0,x1 bx1(O1) ax1,x2 bx2(O2) ax2,x3 bx3(O3)
Suppose we observe (0,1,0,2), then what
is probability of, say, HHCC?
Plug into formula above to find P(HHCC) ≈ 0.000212
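This computation is easy to check in a few lines of Python (a sketch; A and π are from the earlier slides, and B is the tree ring matrix consistent with the deck's worked probabilities, e.g. 0.002822 for CCCH):

```python
A  = [[0.7, 0.3], [0.4, 0.6]]             # transitions, rows/cols = (H, C)
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]   # observation probs, cols = (S, M, L)
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]                         # observed tree rings: S, M, S, L
H, C = 0, 1

def prob(X, O):
    """P(X and O): pi_x0 b_x0(O0) a_x0,x1 b_x1(O1) ..."""
    p = pi[X[0]] * B[X[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[X[t-1]][X[t]] * B[X[t]][O[t]]
    return p

print(round(prob([H, H, C, C], O), 6))  # ~0.000212
print(round(prob([C, C, C, H], O), 6))  # ~0.002822
```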
18. HMM Example
Do same for all 4-state sequences
We find… (probabilities for all 2^4 = 16 state sequences)
The winner is? CCCH
Not so fast my friend…
19. HMM Example
The path CCCH scores the highest
In dynamic programming (DP), we find
highest scoring path
But, HMM maximizes expected number
of correct states
Sometimes called “EM algorithm”
For “Expectation Maximization”
How does HMM work in this example?
20. HMM Example
For first position…
Sum probabilities for all paths that have H
in 1st position, compare to sum of probs for
paths with C in 1st position --- biggest wins
Repeat for each position and we find
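For a toy problem this size, the position-by-position sums can be done by brute force over all 16 paths (a sketch, matrices as in the temperature example):

```python
from itertools import product

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
names = "HC"

def prob(X):
    """P(X and O) for one state sequence X."""
    p = pi[X[0]] * B[X[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[X[t-1]][X[t]] * B[X[t]][O[t]]
    return p

# For each position, sum the probabilities of all paths with H (resp. C)
# in that position; the bigger sum wins.
best = ""
for t in range(len(O)):
    sums = [0.0, 0.0]
    for X in product((0, 1), repeat=len(O)):
        sums[X[t]] += prob(X)
    best += names[max((0, 1), key=lambda i: sums[i])]
print(best)  # CHCH
```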
21. HMM Example
So, HMM solution gives us CHCH
While DP solution is CCCH
Which solution is better?
Neither!!!
They use different definitions of “best”
22. HMM Paradox?
HMM maximizes expected number of
correct states
Whereas DP chooses “best” overall path
Possible for HMM to choose a “path”
that is impossible
Could be a transition probability of 0
Cannot get impossible path with DP
Is this a flaw with HMM?
No, it’s a feature…
23. HMM Model
An HMM is defined by the three
matrices, A, B, and π
Note that M and N are implied, since
they are the dimensions of the matrices
So, we denote HMM “model” as
λ = (A,B,π)
24. The Three Problems
HMMs used to solve 3 problems
Problem 1: Given a model λ = (A,B,π) and
observation sequence O, find P(O|λ)
That is, we can score an observation sequence
to see how well it fits a given model
Problem 2: Given λ = (A,B,π) and O, find an
optimal state sequence
Uncover hidden part (like previous example)
Problem 3: Given O, N, and M, find the
model λ that maximizes probability of O
That is, train a model to fit observations
25. HMMs in Practice
Typically, HMMs used as follows:
Given an observation sequence…
Assume a (hidden) Markov process exists
Train a model based on observations
Problem 3 (find N by trial and error)
Then given a sequence of
observations, score it versus the model
Problem 1: high score implies it’s similar to
training data, low score implies it’s not
26. HMMs in Practice
Previous slide gives sense in which HMM
is a “machine learning” technique
To train model, we do not need to specify
anything except the parameter N
And “best” N found by trial and error
That is, we don’t have to think too much
Just train HMM and then use it
Best of all, efficient algorithms for HMMs
27. The Three Solutions
We give detailed solutions to the three
problems
Note: We must provide efficient solutions
Recall the three problems:
Problem 1: Score an observation sequence
versus a given model
Problem 2: Given a model, “uncover” hidden part
Problem 3: Given an observation sequence, train
a model
28. Solution 1
Score observations versus a given model
Given model λ = (A,B,π) and observation
sequence O=(O0,O1,…,OT-1), find P(O|λ)
Denote hidden states as
X = (x0, x1, . . . , xT-1)
Then from definition of B,
P(O|X,λ)=bx0(O0) bx1(O1) … bxT-1(OT-1)
And from definition of A and π,
P(X|λ)=πx0 ax0,x1 ax1,x2 … axT-2,xT-1
29. Solution 1
Elementary conditional probability fact:
P(O,X|λ) = P(O|X,λ) P(X|λ)
Sum over all possible state sequences X,
P(O|λ) = Σ P(O,X|λ) = Σ P(O|X,λ) P(X|λ)
= Σπx0bx0(O0)ax0,x1bx1(O1)…axT-2,xT-1bxT-1(OT-1)
This “works” but way too costly
Requires about 2T·N^T multiplications
Why? There are N^T possible state sequences,
and each term in the sum takes about 2T multiplications
There better be a better way…
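For T = 4 and N = 2 the brute-force sum is only 16 paths, so it makes a useful cross-check for the efficient algorithm on the next slide (a sketch, matrices as in the temperature example):

```python
from itertools import product

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]

# Sum P(O, X | lambda) over all N^T state sequences X.
total = 0.0
for X in product(range(2), repeat=len(O)):
    p = pi[X[0]] * B[X[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[X[t-1]][X[t]] * B[X[t]][O[t]]
    total += p
print(total)  # P(O|lambda) ~ 0.009630
```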
30. Forward Algorithm
Instead of brute force: forward algorithm
Or “alpha pass”
For t = 0,1,…,T-1 and i=0,1,…,N-1, let
αt(i) = P(O0,O1,…,Ot,xt=qi|λ)
Probability of the partial observation sequence up to time t,
with the Markov process in state qi at step t
What the?
Can be computed recursively, efficiently
31. Forward Algorithm
Let α0(i) = πibi(O0) for i = 0,1,…,N-1
For t = 1,2,…,T-1 and i=0,1,…,N-1, let
αt(i) = (Σαt-1(j)aji)bi(Ot)
Where the sum is from j = 0 to N-1
From definition of αt(i) we see
P(O|λ) = ΣαT-1(i)
Where the sum is from i = 0 to N-1
Note this requires only about N^2·T multiplications
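The alpha pass above in Python (a sketch using the temperature example; the result matches the brute-force sum over all 16 paths):

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = 2, len(O)

# alpha pass: alpha[t][i] = P(O_0..O_t, state q_i at t | lambda)
alpha = [[0.0] * N for _ in range(T)]
for i in range(N):
    alpha[0][i] = pi[i] * B[i][O[0]]
for t in range(1, T):
    for i in range(N):
        alpha[t][i] = sum(alpha[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]

p = sum(alpha[T-1])  # P(O | lambda)
print(p)  # same value as the brute-force sum, ~0.009630
```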
32. Solution 2
Given a model, find “most likely” hidden
states: Given λ = (A,B,π) and O, find an
optimal state sequence
Recall that optimal means “maximize expected
number of correct states”
In contrast, DP finds best scoring path
For temp/tree ring example, solved this
But hopelessly inefficient approach
A better way: backward algorithm
Or “beta pass”
33. Backward Algorithm
For t = 0,1,…,T-1 and i=0,1,…,N-1, let
βt(i) = P(Ot+1,Ot+2,…,OT-1|xt=qi,λ)
Probability of the partial observation sequence from t+1 to the end,
given that the Markov process is in state qi at step t
Analogous to the forward algorithm
As with forward algorithm, this can be
computed recursively and efficiently
34. Backward Algorithm
Let βT-1(i) = 1 for i = 0,1,…,N-1
For t = T-2,T-3,…,0 and i=0,1,…,N-1, let
βt(i) = Σaijbj(Ot+1)βt+1(j)
Where the sum is from j = 0 to N-1
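The beta pass in Python (a sketch, same example model as before):

```python
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
O = [0, 1, 0, 2]
N, T = 2, len(O)

# beta pass: beta[t][i] = P(O_{t+1}..O_{T-1} | state q_i at t, lambda)
beta = [[0.0] * N for _ in range(T)]
for i in range(N):
    beta[T-1][i] = 1.0
for t in range(T-2, -1, -1):
    for i in range(N):
        beta[t][i] = sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j] for j in range(N))
print(beta[0])  # ~[0.0302, 0.0279]
```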
35. Solution 2
For t = 0,1,…,T-1 and i=0,1,…,N-1 define
γt(i) = P(xt=qi|O,λ)
Most likely state at t is qi that maximizes γt(i)
Note that γt(i) = αt(i)βt(i)/P(O|λ)
And recall P(O|λ) = ΣαT-1(i)
The bottom line?
Forward algorithm solves Problem 1
Forward/backward algorithms solve Problem 2
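Putting the two passes together gives the efficient version of the temperature example's "sum the paths per position" computation (a sketch; it recovers CHCH without enumerating all 16 paths):

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = 2, len(O)

# forward (alpha) and backward (beta) passes
alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
                  for i in range(N)])
beta = [[1.0] * N]
for t in range(T-2, -1, -1):
    beta.insert(0, [sum(A[i][j] * B[j][O[t+1]] * beta[0][j] for j in range(N))
                    for i in range(N)])

pO = sum(alpha[T-1])  # P(O | lambda)
# gamma[t][i] = alpha[t][i] * beta[t][i] / P(O | lambda)
gamma = [[alpha[t][i] * beta[t][i] / pO for i in range(N)] for t in range(T)]
path = "".join("HC"[max(range(N), key=lambda i: gamma[t][i])] for t in range(T))
print(path)  # CHCH
```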
36. Solution 3
Train a model: Given O, N, and M, find λ
that maximizes probability of O
Here, we iteratively adjust λ = (A,B,π)
to better fit the given observations O
The size of matrices are fixed (N and M)
But elements of matrices can change
It is amazing that this works!
And even more amazing that it’s efficient
37. Solution 3
For t=0,1,…,T-2 and i,j in {0,1,…,N-1}, define “di-gammas” as
γt(i,j) = P(xt=qi, xt+1=qj|O,λ)
Note γt(i,j) is prob of being in state qi at
time t and transiting to state qj at t+1
Then γt(i,j) = αt(i)aijbj(Ot+1)βt+1(j)/P(O|λ)
And γt(i) = Σγt(i,j)
Where sum is from j = 0 to N – 1
38. Model Re-estimation
Given di-gammas and gammas…
For i = 0,1,…,N-1 let πi = γ0(i)
For i = 0,1,…,N-1 and j = 0,1,…,N-1
aij = Σγt(i,j)/Σγt(i)
Where both sums are from t = 0 to T-2
For j = 0,1,…,N-1 and k = 0,1,…,M-1
bj(k) = Σγt(j)/Σγt(j)
Both sums are from t = 0 to T-2, but only t for
which Ot = k are counted in the numerator
Why does this work?
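One re-estimation step in Python (a sketch using the temperature example; note that the B sums here run over all t = 0,…,T-1, a common variant of the update):

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, M, T = 2, 3, len(O)

# alpha and beta passes, as on the earlier slides
alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
                  for i in range(N)])
beta = [[1.0] * N]
for t in range(T-2, -1, -1):
    beta.insert(0, [sum(A[i][j] * B[j][O[t+1]] * beta[0][j] for j in range(N))
                    for i in range(N)])
pO = sum(alpha[T-1])

# di-gammas and gammas
dig = [[[alpha[t][i] * A[i][j] * B[j][O[t+1]] * beta[t+1][j] / pO
         for j in range(N)] for i in range(N)] for t in range(T-1)]
gam = [[alpha[t][i] * beta[t][i] / pO for i in range(N)] for t in range(T)]

# re-estimated model
new_pi = [gam[0][i] for i in range(N)]
new_A = [[sum(dig[t][i][j] for t in range(T-1)) /
          sum(gam[t][i] for t in range(T-1)) for j in range(N)] for i in range(N)]
new_B = [[sum(gam[t][j] for t in range(T) if O[t] == k) /
          sum(gam[t][j] for t in range(T)) for k in range(M)] for j in range(N)]
```

The re-estimated π, and every row of the new A and B, are again row stochastic.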
39. Solution 3
To summarize…
1. Initialize λ = (A,B,π)
2. Compute αt(i), βt(i), γt(i,j), γt(i)
3. Re-estimate the model λ = (A,B,π)
4. If P(O|λ) increases, goto 2
40. Solution 3
Some fine points…
Model initialization
If we have a good guess for λ = (A,B,π) then we
can use it for initialization
If not, let πi ≈ 1/N, aij ≈ 1/N, bj(k) ≈ 1/M
Subject to row stochastic conditions
But, do not initialize to uniform values
Stopping conditions
Stop after some number of iterations and/or…
Stop if increase in P(O|λ) is “small”
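A possible initialization helper (a sketch; the ±0.01 perturbation size is an arbitrary choice, the point is only "approximately uniform, row stochastic, not exactly uniform"):

```python
import random

def init_row(n, rng):
    """A row that is approximately, but not exactly, uniform (row stochastic)."""
    row = [1.0 / n + rng.uniform(-0.01, 0.01) for _ in range(n)]
    s = sum(row)
    return [x / s for x in row]   # renormalize so the row sums to 1

rng = random.Random(0)
N, M = 2, 27                      # e.g. the English text example later on
pi = init_row(N, rng)
A  = [init_row(N, rng) for _ in range(N)]
B  = [init_row(M, rng) for _ in range(N)]
assert abs(sum(pi) - 1) < 1e-12 and all(abs(sum(r) - 1) < 1e-12 for r in A + B)
```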
41. HMM as Discrete Hill Climb
Algorithm on previous slides shows that
HMM is a “discrete hill climb”
HMM consists of discrete parameters
Specifically, the elements of the matrices
And the re-estimation process improves the
model by modifying parameters
So, process “climbs” toward improved model
This happens in a high-dimensional space
42. Dynamic Programming
Brief detour…
For λ = (A,B,π) as above, it’s easy to
define a dynamic program (DP)
Executive summary:
DP is forward algorithm, with “sum”
replaced by “max”
Precise details on next few slides
43. Dynamic Programming
Let δ0(i) = πi bi(O0) for i=0,1,…,N-1
For t=1,2,…,T-1 and i=0,1,…,N-1 compute
δt(i) = max (δt-1(j)aji)bi(Ot)
Where the max is over j in {0,1,…,N-1}
Note that at each t, the DP computes best
path for each state, up to that point
So, probability of best path is max δT-1(j)
This max gives the best probability
Not the best path, for that, see next slide
44. Dynamic Programming
To determine optimal path
While computing deltas, keep track of pointers
to previous state
When finished, construct optimal path by
tracing back pointers
For example, consider temp example: recall
that we observe (0,1,0,2)
Probabilities for paths of length 1:
P(H) = 0.6(0.1) = 0.06 and P(C) = 0.4(0.7) = 0.28
These are the only “paths” of length 1
45. Dynamic Programming
Probabilities for each path of length 2
Best path of length 2 ending with H is CH, with probability 0.28(0.4)(0.4) = 0.0448
Best path of length 2 ending with C is CC, with probability 0.28(0.6)(0.2) = 0.0336
47. Dynamic Programming
Best final score is 0.002822
And, thanks to pointers, best path is CCCH
But what about underflow?
A serious problem in bigger cases
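The full DP, with backpointers, can be sketched in Python (same example model; it recovers the CCCH path and the 0.002822 score):

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = 2, len(O)

delta = [[pi[i] * B[i][O[0]] for i in range(N)]]
back = []                       # backpointers: best previous state
for t in range(1, T):
    row, ptrs = [], []
    for i in range(N):
        j = max(range(N), key=lambda j: delta[t-1][j] * A[j][i])
        row.append(delta[t-1][j] * A[j][i] * B[i][O[t]])
        ptrs.append(j)
    delta.append(row)
    back.append(ptrs)

# best final score, then trace the pointers back
last = max(range(N), key=lambda i: delta[T-1][i])
score = delta[T-1][last]
path = [last]
for ptrs in reversed(back):
    path.append(ptrs[path[-1]])
path = "".join("HC"[i] for i in reversed(path))
print(path, score)  # CCCH ~0.002822
```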
48. Underflow Resistant DP
Common trick to prevent underflow
Instead of multiplying probabilities…
…we add logarithms of probabilities
Why does this work?
Because log(xy) = log x + log y
Adding logs does not tend to 0
Note that we must avoid 0 probabilities
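A two-line demonstration of why the log trick matters (a sketch; 400 factors of 0.1 is an arbitrary illustration):

```python
import math

# Multiplying 400 probabilities of 0.1 underflows to 0.0 in double precision...
p = 1.0
for _ in range(400):
    p *= 0.1
print(p)  # 0.0

# ...but the sum of logs stays perfectly usable.
log_p = sum(math.log(0.1) for _ in range(400))
print(log_p)  # ~ -921.03
```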
49. Underflow Resistant DP
Underflow resistant DP algorithm:
Let δ0(i) = log(πi bi(O0)) for i=0,1,…,N-1
For t=1,2,…,T-1 and i=0,1,…,N-1 compute
δt(i) = max (δt-1(j) + log(aji) + log(bi(Ot)))
Where the max is over j in {0,1,…,N-1}
And score of best path is max δT-1(j)
As before, must also keep track of paths
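The log-space version of the DP sketch (same example; the path is unchanged, only the arithmetic differs):

```python
import math

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = 2, len(O)

delta = [[math.log(pi[i] * B[i][O[0]]) for i in range(N)]]
back = []
for t in range(1, T):
    row, ptrs = [], []
    for i in range(N):
        # sums of logs replace products of probabilities
        j = max(range(N), key=lambda j: delta[t-1][j] + math.log(A[j][i]))
        row.append(delta[t-1][j] + math.log(A[j][i]) + math.log(B[i][O[t]]))
        ptrs.append(j)
    delta.append(row)
    back.append(ptrs)

last = max(range(N), key=lambda i: delta[T-1][i])
best_score = delta[T-1][last]          # log of the best path probability
path = [last]
for ptrs in reversed(back):
    path.append(ptrs[path[-1]])
path_str = "".join("HC"[i] for i in reversed(path))
print(path_str)  # CCCH
```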
50. HMM Scaling
Trickier to prevent underflow in HMM
We consider solution 3
Since it includes solutions 1 and 2
Recall for t = 1,2,…,T-1, i=0,1,…,N-1,
αt(i) = (Σαt-1(j)aj,i)bi(Ot)
The idea is to normalize alphas so that
they sum to 1
Algorithm on next slide
51. HMM Scaling
Given αt(i) = (Σαt-1(j)aj,i)bi(Ot)
Let a0(i) = α0(i) for i=0,1,…,N-1
Let c0 = 1/Σa0(j)
For i = 0,1,…,N-1, let a0(i) = c0a0(i)
This takes care of t = 0 case
Algorithm continued on next slide…
52. HMM Scaling
For t = 1,2,…,T-1 do the following:
For i = 0,1,…,N-1,
at(i) = (Σat-1(j)aj,i)bi(Ot)
Let ct = 1/Σat(j)
For i = 0,1,…,N-1 let at(i) = ctat(i)
53. HMM Scaling
Easy to show at(i) = c0c1…ct αt(i) (♯)
Simple proof by induction
So, c0c1…ct is scaling factor at step t
Also, easy to show that
at(i) = αt(i)/Σαt(j)
Which implies ΣaT-1(i) = 1 (♯♯)
54. HMM Scaling
By combining (♯) and (♯♯), we have
1 = ΣaT-1(i) = c0c1…cT-1 ΣαT-1(i)
= c0c1…cT-1 P(O|λ)
Therefore, P(O|λ) = 1 / c0c1…cT-1
To avoid underflow, we compute
log P(O|λ) = -Σlog(cj)
Where sum is from j = 0 to T-1
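The scaled alpha pass and the log P(O|λ) formula, in Python (a sketch on the temperature example; the log-likelihood matches the unscaled forward algorithm):

```python
import math

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = 2, len(O)

# scaled alpha pass: normalize at each step so the alphas sum to 1
a = [pi[i] * B[i][O[0]] for i in range(N)]
c = [1.0 / sum(a)]
a = [c[0] * x for x in a]
for t in range(1, T):
    a = [sum(a[j] * A[j][i] for j in range(N)) * B[i][O[t]] for i in range(N)]
    c.append(1.0 / sum(a))
    a = [c[t] * x for x in a]

log_p = -sum(math.log(ct) for ct in c)   # log P(O | lambda)
print(log_p)  # ~ log(0.009630)
```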
55. HMM Scaling
Similarly, scale betas as ctβt(i)
For re-estimation,
Compute γt(i,j) and γt(i) using original formulas, but
with scaled alphas and betas
This gives us new values for λ = (A,B,π)
“Easy exercise” to show re-estimate is
exact when scaled alphas and betas used
Also, P(O|λ) cancels from formula
Use log P(O|λ) = -Σlog(cj) to decide if iterate
improves
56. All Together Now
Complete pseudo code for Solution 3
Given: (O0,O1,…,OT-1) and N and M
Initialize: λ = (A,B,π)
A is NxN, B is NxM and π is 1xN
πi ≈ 1/N, aij ≈ 1/N, bj(k) ≈ 1/M, each matrix row
stochastic, but not uniform
Initialize:
maxIters = max number of re-estimation steps
iters = 0
oldLogProb = -∞
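The complete Solution 3 loop, using the scaling from the preceding slides, can be sketched end to end in Python (a sketch with my own variable names and a toy training sequence, not the deck's exact pseudo code; the B update here sums over all t):

```python
import math

def forward_scaled(A, B, pi, O):
    N, T = len(A), len(O)
    alpha, c = [], []
    for t in range(T):
        if t == 0:
            a = [pi[i] * B[i][O[0]] for i in range(N)]
        else:
            a = [sum(alpha[-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
                 for i in range(N)]
        ct = 1.0 / sum(a)
        c.append(ct)
        alpha.append([ct * x for x in a])
    return alpha, c

def backward_scaled(A, B, O, c):
    N, T = len(A), len(O)
    beta = [[c[T - 1]] * N]
    for t in range(T - 2, -1, -1):
        beta.insert(0, [c[t] * sum(A[i][j] * B[j][O[t + 1]] * beta[0][j]
                                   for j in range(N)) for i in range(N)])
    return beta

def reestimate(A, B, pi, O):
    """One Baum-Welch iteration; returns new model and log P(O|old model)."""
    N, M, T = len(A), len(B[0]), len(O)
    alpha, c = forward_scaled(A, B, pi, O)
    beta = backward_scaled(A, B, O, c)
    # di-gammas and gammas; P(O|lambda) cancels thanks to the scaling
    dig = [[[alpha[t][i] * A[i][j] * B[j][O[t + 1]] * beta[t + 1][j]
             for j in range(N)] for i in range(N)] for t in range(T - 1)]
    gam = [[sum(dig[t][i]) for i in range(N)] for t in range(T - 1)]
    gam.append(alpha[T - 1][:])
    new_pi = gam[0][:]
    new_A = [[sum(dig[t][i][j] for t in range(T - 1)) /
              sum(gam[t][i] for t in range(T - 1)) for j in range(N)]
             for i in range(N)]
    new_B = [[sum(gam[t][j] for t in range(T) if O[t] == k) /
              sum(gam[t][j] for t in range(T)) for k in range(M)]
             for j in range(N)]
    return new_A, new_B, new_pi, -sum(math.log(ct) for ct in c)

# Train on a toy sequence, starting from near-uniform (not uniform) guesses.
O  = [0, 1, 0, 2] * 25
A  = [[0.52, 0.48], [0.47, 0.53]]
B  = [[0.34, 0.33, 0.33], [0.32, 0.33, 0.35]]
pi = [0.51, 0.49]
history = []
for _ in range(50):
    A, B, pi, log_p = reestimate(A, B, pi, O)
    history.append(log_p)
print(history[0], history[-1])  # log P(O|lambda) should never decrease
```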
61. Stopping Criteria
Check that probability increases
In practice, want logProb > oldLogProb + ε
And don’t exceed max iterations
62. English Text Example
Suppose Martian arrives on earth
Sees written English text
Wants to learn something about it
Martians know about HMMs
So, strip out all non-letters, make all
letters lower-case
27 symbols (letters, plus word-space)
Train HMM on long sequence of symbols
63. English Text
For first training case, initialize:
N = 2 and M = 27
Elements of A and π are about ½ each
Elements of B are each about 1/27
We use 50,000 symbols for training
After 1st iteration: log P(O|λ) ≈ -165097
After 100th iteration: log P(O|λ) ≈ -137305
64. English Text
Matrices A and π converge:
What does this tell us?
Started in hidden state 1 (not state 0)
And we know transition probabilities
between hidden states
Nothing too interesting here
We don’t care about hidden states
65. English Text
What about the B matrix?
This is much more interesting…
Why???
66. A Security Application
Suppose we want to detect metamorphic
computer viruses
Such viruses vary their internal structure
But function of malware stays same
If sufficiently variable, standard signature
detection will fail
Can we use HMM for detection?
What to use as observation sequence?
Is there really a “hidden” Markov process?
What about N, M, and T?
How many Os needed for training, scoring?
67. HMM for Metamorphic Detection
Split the set of “family” viruses into 2 subsets
Extract opcodes from each virus
Append opcodes from subset 1 to make one
long sequence
Train HMM on opcode sequence (problem 3)
Obtain a model λ = (A,B,π)
Set threshold: score opcodes from files in
subset 2 and “normal” files (problem 1)
Can you set a threshold that separates the sets?
If so, may have a viable detection method
68. HMM for Metamorphic Detection
Virus detection results from a recent paper
Note the separation
This is good!
69. HMM Generalizations
Here, assumed Markov process of order 1
Current state depends only on previous state
and transition matrix A
Can use higher order Markov process
Current state depends on n previous states
Higher order vs size of N ? “Depth” vs “width”
Can have A and B matrices depend on t
HMM often combined with other
techniques (e.g., neural nets)
70. Generalizations
In some cases, a limitation of HMM is
that position information is not used
In many applications this is OK/desirable
In some apps, this is a serious problem
Bioinformatics applications
DNA sequencing, protein alignment, etc.
Sequence alignment is crucial
They use “profile HMMs” instead of HMMs
71. References
A revealing introduction to hidden Markov models, by M. Stamp
http://www.cs.sjsu.edu/faculty/stamp/RUA/HMM.pdf
A tutorial on hidden Markov models and selected applications in speech recognition, by L.R. Rabiner
http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf
72. References
Hunting for metamorphic engines, W. Wong and M. Stamp
Journal in Computer Virology, Vol. 2, No. 3, December 2006, pp. 211-229
Hunting for undetectable metamorphic viruses, D. Lin and M. Stamp
Journal in Computer Virology, Vol. 7, No. 3, August 2011, pp. 201-214