Supervised Learning: Hidden Markov Models
Some of these slides were inspired by the tutorials of Andrew Moore.
A Markov System
Has N states, called s_1, s_2, ..., s_N.
There are discrete timesteps, t=0, t=1, ...
On the t-th timestep the system is in exactly one of the available states. Call it q_t. (Note: q_t ∈ {s_1, s_2, ..., s_N}.)
Between each timestep, the next state is chosen randomly.
The current state determines the probability distribution for the next state. Often notated with arcs between states.
Example (N=3):
P(q_{t+1}=s_1 | q_t=s_1) = 0    P(q_{t+1}=s_2 | q_t=s_1) = 0    P(q_{t+1}=s_3 | q_t=s_1) = 1
P(q_{t+1}=s_1 | q_t=s_2) = 1/2  P(q_{t+1}=s_2 | q_t=s_2) = 1/2  P(q_{t+1}=s_3 | q_t=s_2) = 0
P(q_{t+1}=s_1 | q_t=s_3) = 1/3  P(q_{t+1}=s_2 | q_t=s_3) = 2/3  P(q_{t+1}=s_3 | q_t=s_3) = 0
(Diagram: states S_1, S_2, S_3 with transition arcs labeled by these probabilities.)
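The dynamics above can be sketched in code. A minimal Python simulation (our own illustration, not from the slides; it uses the 3-state transition matrix from this slide, with 0-indexed states):

```python
import random

# Transition matrix from the slide: A[i][j] = P(q_{t+1} = s_{j+1} | q_t = s_{i+1}).
A = [
    [0.0, 0.0, 1.0],   # from s_1: always go to s_3
    [0.5, 0.5, 0.0],   # from s_2: s_1 or s_2, each with probability 1/2
    [1/3, 2/3, 0.0],   # from s_3: s_1 with 1/3, s_2 with 2/3
]

def simulate(A, start, steps, rng=random.Random(0)):
    """Sample a path q_0 .. q_steps (0-indexed states)."""
    path = [start]
    for _ in range(steps):
        # The next state depends only on the current state (Markov property).
        path.append(rng.choices(range(len(A)), weights=A[path[-1]])[0])
    return path

print(simulate(A, start=0, steps=5))
```

Each row of A is a probability distribution over next states, which is why `random.choices` can sample from it directly.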
Markov Property
q_{t+1} is conditionally independent of {q_{t-1}, q_{t-2}, ..., q_1, q_0} given q_t.
In other words:
P(q_{t+1} = s_j | q_t = s_i) = P(q_{t+1} = s_j | q_t = s_i, any earlier history)
The transition probabilities can be collected into a matrix:
i : P(q_{t+1}=s_1 | q_t=s_i)  ...  P(q_{t+1}=s_N | q_t=s_i)
1 : a_11  ...  a_1N
2 : a_21  ...  a_2N
3 : a_31  ...  a_3N
...
N : a_N1  ...  a_NN
Question: what would be the best Bayes Net structure to represent the Joint Distribution of (q_0, q_1, q_2, q_3, q_4)?
A Blind Robot
A human and a robot wander around randomly on a grid.
STATE q = {Location of Robot, Location of Human}. # states = 18 × 18 = 324.
(Diagram: grid with robot R and human H.)
Dynamics of System
Each timestep the human moves randomly to an adjacent cell, and the robot also moves randomly to an adjacent cell.
Typical questions:
What's the expected time until the human is crushed like a bug?
What's the probability that the robot will hit the left wall before it hits the human?
What's the probability the robot crushes the human on the next time step?
What is P(q_t = s)? Slow Answer
Step 1: Work out how to compute P(Q) for any path Q = q_1 q_2 q_3 ... q_t.
Given we know the start state q_1 (i.e. P(q_1) = 1):
P(q_1, q_2, ..., q_t) = P(q_1, q_2, ..., q_{t-1}) P(q_t | q_1, q_2, ..., q_{t-1})
                      = P(q_1, q_2, ..., q_{t-1}) P(q_t | q_{t-1})
                      = P(q_2 | q_1) P(q_3 | q_2) ... P(q_t | q_{t-1})
Step 2: Use this knowledge to get P(q_t = s):
P(q_t = s) = Σ_{Q ∈ paths of length t that end in s} P(Q)
What is P(q_t = s)? Clever Answer
For each state s_i, define p_t(i) = probability that the state is s_i at time t = P(q_t = s_i).
Easy to do inductive definition:
∀i: p_0(i) = 1 if s_i is the start state, 0 otherwise.
∀j: p_{t+1}(j) = P(q_{t+1} = s_j)
              = Σ_{i=1}^N P(q_{t+1} = s_j ∧ q_t = s_i)
              = Σ_{i=1}^N P(q_{t+1} = s_j | q_t = s_i) P(q_t = s_i)
              = Σ_{i=1}^N a_ij p_t(i)
Computation is simple: just fill in this table row by row, in order of increasing t, using p_{t+1}(j) = Σ_{i=1}^N a_ij p_t(i):
t       : p_t(1)  p_t(2)  ...  p_t(N)
0       : 1       0       ...  0        (start state)
1       :
...
t_final :
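The table-filling computation can be sketched as follows; a minimal Python version (our own illustration, reusing the earlier 3-state example with 0-indexed states). Each step costs O(N^2), so the whole table costs O(t N^2), instead of enumerating exponentially many paths:

```python
# A[i][j] = P(q_{t+1} = s_{j+1} | q_t = s_{i+1}), the 3-state example chain.
A = [
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
    [1/3, 2/3, 0.0],
]

def state_distribution(A, start, t):
    """Return [P(q_t = s_i)] for all i, given q_0 = start, via the induction."""
    N = len(A)
    p = [1.0 if i == start else 0.0 for i in range(N)]   # p_0
    for _ in range(t):
        # p_{t+1}(j) = sum_i a_ij * p_t(i): one row of the table.
        p = [sum(A[i][j] * p[i] for i in range(N)) for j in range(N)]
    return p

print(state_distribution(A, start=0, t=3))
```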
Hidden State
The previous example tried to estimate P(q_t = s_i) unconditionally (using no observed evidence).
Suppose we can observe something that's affected by the true state.
Example: Proximity sensors (tell us the contents of the 8 adjacent squares; W denotes walls).
(Diagram: the true state q_t vs. what the robot sees: observation O_t.)
Noisy Hidden State
Example: Noisy proximity sensors (unreliably tell us the contents of the 8 adjacent squares; W denotes walls).
(Diagram: the true state q_t, the uncorrupted observation, and the corrupted observation O_t that the robot actually sees.)
O_t is noisily determined depending on the current state.
Assume that O_t is conditionally independent of {q_{t-1}, q_{t-2}, ..., q_1, q_0, O_{t-1}, O_{t-2}, ..., O_1, O_0} given q_t:
P(O_t = X | q_t = s_i) = P(O_t = X | q_t = s_i, any earlier history)
Question: what'd be the best Bayes Net structure to represent the Joint Distribution of (q_0, q_1, q_2, q_3, q_4, O_0, O_1, O_2, O_3, O_4)?
Noisy Hidden State
The observation probabilities can be collected into a matrix:
i : P(O_t=1 | q_t=s_i)  ...  P(O_t=M | q_t=s_i)
1 : b_1(1)  ...  b_1(M)
2 : b_2(1)  ...  b_2(M)
3 : b_3(1)  ...  b_3(M)
...
N : b_N(1)  ...  b_N(M)
(Diagram: the example chain with transition arcs labeled 1/2, 1/3, 1.)
HMM Formal Definition
An HMM, λ, is a 5-tuple consisting of:
N: the number of states
M: the number of possible observations
{π_1, π_2, π_3, ..., π_N}: the starting state probabilities, P(q_0 = S_i) = π_i
{a_ij}: the state transition probabilities, P(q_{t+1} = S_j | q_t = S_i) = a_ij
{b_i(j)}: the observation probabilities, P(O_t = j | q_t = S_i) = b_i(j)
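Concretely, λ can be bundled into a small data structure. One possible representation (our own sketch; the field names and the example numbers are not from the slides):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HMM:
    """The 5-tuple lambda = (N, M, pi, a, b); N and M are implied by the lists."""
    pi: List[float]           # pi[i]   = P(q_0 = S_i)
    a: List[List[float]]      # a[i][j] = P(q_{t+1} = S_j | q_t = S_i)
    b: List[List[float]]      # b[i][k] = P(O_t = k | q_t = S_i)

    @property
    def N(self):              # number of states
        return len(self.pi)

    @property
    def M(self):              # number of possible observations
        return len(self.b[0])

# A made-up 2-state, 2-symbol example.
hmm = HMM(pi=[0.6, 0.4],
          a=[[0.7, 0.3], [0.4, 0.6]],
          b=[[0.9, 0.1], [0.2, 0.8]])
print(hmm.N, hmm.M)
```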
Probability of a Series of Observations
What is P(O) = P(O_1 O_2 O_3) = P(O_1=X ∧ O_2=X ∧ O_3=Z)?
Slow, stupid way:
P(O) = Σ_{Q ∈ paths of length 3} P(O ∧ Q) = Σ_{Q ∈ paths of length 3} P(O | Q) P(Q)
How do we compute P(Q) for an arbitrary path Q?
How do we compute P(O | Q) for an arbitrary path Q?
(Diagram: a 3-state HMM with transition probabilities 1/3 and 2/3 and possible outputs X, Y, Z.)
Probability of a Series of Observations
How do we compute P(Q) for an arbitrary path Q?
P(Q) = P(q_1, q_2, q_3) = P(q_1) P(q_2, q_3 | q_1)        (chain rule)
     = P(q_1) P(q_2 | q_1) P(q_3 | q_2, q_1)               (chain rule)
     = P(q_1) P(q_2 | q_1) P(q_3 | q_2)                    (Markov independence)
Example, in the case Q = S_1 S_3 S_3:
P(Q) = 1/2 × 2/3 × 1/3 = 1/9
Probability of a Series of Observations
How do we compute P(O | Q) for an arbitrary path Q?
P(O | Q) = P(O_1, O_2, O_3 | q_1, q_2, q_3) = P(O_1 | q_1) P(O_2 | q_2) P(O_3 | q_3)
Example, in the case Q = S_1 S_3 S_3:
P(O | Q) = P(X | S_1) P(X | S_3) P(Z | S_3) = 1/2 × 1/2 × 1/2 = 1/8
Probability of a Series of Observations
P(O) = Σ_{Q ∈ paths of length 3} P(O | Q) P(Q)
would need 27 P(Q) computations and 27 P(O | Q) computations.
A sequence of 20 observations would need 3^20 ≈ 3.5 billion P(Q) computations and 3.5 billion P(O | Q) computations.
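The slow, stupid way is at least easy to write down directly: enumerate every path of length T and sum P(O | Q) P(Q). A sketch (the parameters below are made up for illustration, since the example HMM's full numbers aren't reproduced here; states and observation symbols are 0-indexed):

```python
from itertools import product

# Hypothetical 2-state, 2-symbol HMM (made-up numbers).
pi = [0.6, 0.4]                  # pi[i]   = P(q_1 = S_i)
A  = [[0.7, 0.3], [0.4, 0.6]]    # A[i][j] = P(q_{t+1} = S_j | q_t = S_i)
B  = [[0.9, 0.1], [0.2, 0.8]]    # B[i][k] = P(O_t = k | q_t = S_i)

def brute_force_prob(obs):
    """P(O) = sum over all N^T paths Q of P(O | Q) P(Q)."""
    N, total = len(pi), 0.0
    for Q in product(range(N), repeat=len(obs)):
        p = pi[Q[0]] * B[Q[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[Q[t - 1]][Q[t]] * B[Q[t]][obs[t]]
        total += p
    return total

print(brute_force_prob([0, 0, 1]))  # enumerates 2^3 = 8 paths for T = 3
```

The loop body is exactly P(Q) P(O | Q) from the previous two slides; the exponential path count is what the α recursion below avoids.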
Non-Exponential-Cost Style
Given observations O_1 O_2 ... O_T, define
α_t(i) = P(O_1 O_2 ... O_t ∧ q_t = S_i), where 1 ≤ t ≤ T.
α_t(i) = probability that, in a random trial:
we'd have seen the first t observations, and
we'd have ended up in S_i as the t-th state visited.
In our example, what is α_2(3)?
α_t(i): easy to define recursively
α_1(j) = P(O_1 ∧ q_1 = S_j) = P(q_1 = S_j) P(O_1 | q_1 = S_j)
α_{t+1}(j) = P(O_1 O_2 ... O_t O_{t+1} ∧ q_{t+1} = S_j)
  = Σ_{i=1}^N P(O_1 O_2 ... O_t ∧ q_t = S_i ∧ O_{t+1} ∧ q_{t+1} = S_j)
  = Σ_{i=1}^N P(O_{t+1}, q_{t+1} = S_j | O_1 O_2 ... O_t ∧ q_t = S_i) P(O_1 O_2 ... O_t ∧ q_t = S_i)
  = Σ_{i=1}^N P(O_{t+1}, q_{t+1} = S_j | q_t = S_i) α_t(i)
  = Σ_{i=1}^N P(q_{t+1} = S_j | q_t = S_i) P(O_{t+1} | q_{t+1} = S_j) α_t(i)
  = Σ_{i=1}^N a_ij b_j(O_{t+1}) α_t(i)
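The α recursion translates directly into code. A minimal sketch of this forward pass (the parameters are the same made-up 2-state example as before; b_j(O) is stored as B[j][o] with 0-indexed symbols):

```python
# Hypothetical 2-state, 2-symbol HMM (made-up numbers, for illustration only).
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]

def forward(obs):
    """Return [alpha_T(i)] where alpha_t(i) = P(O_1..O_t, q_t = S_i)."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]           # alpha_1
    for o in obs[1:]:
        # alpha_{t+1}(j) = b_j(O_{t+1}) * sum_i a_ij * alpha_t(i)
        alpha = [B[j][o] * sum(A[i][j] * alpha[i] for i in range(N))
                 for j in range(N)]
    return alpha

alpha = forward([0, 0, 1])
print(sum(alpha))   # P(O_1 O_2 O_3), at cost O(T N^2) instead of O(N^T)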
Easy Questions
We can cheaply compute α_t(i) = P(O_1 O_2 ... O_t ∧ q_t = S_i).
How can we cheaply compute P(O_1 O_2 ... O_t)?
  P(O_1 O_2 ... O_t) = Σ_{i=1}^N α_t(i)
How can we cheaply compute P(q_t = S_i | O_1 O_2 ... O_t)?
  P(q_t = S_i | O_1 O_2 ... O_t) = α_t(i) / Σ_{j=1}^N α_t(j)
Most Probable Path Given Observations
What is the most probable path given O_1 O_2 ... O_T, i.e. what is argmax_Q P(Q | O_1 O_2 ... O_T)?
Slow answer:
argmax_Q P(Q | O_1 O_2 ... O_T)
  = argmax_Q P(O_1 O_2 ... O_T | Q) P(Q) / P(O_1 O_2 ... O_T)
  = argmax_Q P(O_1 O_2 ... O_T | Q) P(Q)
Efficient MPP Computation
We are going to compute the following variables:
δ_t(i) = max_{q_1 q_2 ... q_{t-1}} P(q_1 q_2 ... q_{t-1} ∧ q_t = S_i ∧ O_1 ... O_t)
= the probability of the path of length t-1 with the maximum chance of doing all these things:
occurring, and
ending up in state S_i, and
producing output O_1 ... O_t.
Define: mpp_t(i) = that path, so δ_t(i) = Prob(mpp_t(i)).
The Viterbi Algorithm
δ_t(i)   = max_{q_1 q_2 ... q_{t-1}} P(q_1 q_2 ... q_{t-1} ∧ q_t = S_i ∧ O_1 ... O_t)
mpp_t(i) = argmax_{q_1 q_2 ... q_{t-1}} P(q_1 q_2 ... q_{t-1} ∧ q_t = S_i ∧ O_1 ... O_t)
Base case: δ_1(i) = max P(q_1 = S_i ∧ O_1)   (one choice)
  = P(q_1 = S_i) P(O_1 | q_1 = S_i) = π_i b_i(O_1)
Now, suppose we have all the δ_t(i)'s and mpp_t(i)'s for all i. How do we get δ_{t+1}(j) and mpp_{t+1}(j)?
(Diagram: for each state S_1 ... S_N at time t, the most probable path mpp_t(i) with probability δ_t(i).)
The Viterbi Algorithm
(Diagram: time t to time t+1, transitions from S_1 ... S_N into S_j.)
The most probable path whose last two states are S_i S_j is: the most probable path to S_i, followed by the transition S_i → S_j.
What is the probability of that path?
δ_t(i) × P(S_i → S_j ∧ O_{t+1}) = δ_t(i) a_ij b_j(O_{t+1})
So the most probable path to S_j has S_i* as its penultimate state, where
i* = argmax_i δ_t(i) a_ij b_j(O_{t+1})
Summary (with i* defined above):
δ_{t+1}(j) = δ_t(i*) a_{i*j} b_j(O_{t+1})
mpp_{t+1}(j) = mpp_t(i*) S_j
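The recursion above, with backpointers to recover mpp, can be sketched as follows (our own illustration with the same made-up 2-state parameters as earlier; it returns the most probable state sequence, 0-indexed, together with its δ value):

```python
# Hypothetical 2-state, 2-symbol HMM (made-up numbers, for illustration only).
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]

def viterbi(obs):
    """Return (most probable path, its probability) via the delta recursion."""
    N = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]     # delta_1
    back = []                                            # backpointers i*
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(N):
            # i* = argmax_i delta_t(i) * a_ij (b_j(O_{t+1}) doesn't depend on i)
            i_star = max(range(N), key=lambda i: delta[i] * A[i][j])
            step.append(i_star)
            new_delta.append(delta[i_star] * A[i_star][j] * B[j][o])
        back.append(step)
        delta = new_delta
    # Backtrack from the best final state to recover mpp.
    path = [max(range(N), key=lambda j: delta[j])]
    for step in reversed(back):
        path.append(step[path[-1]])
    return list(reversed(path)), max(delta)

print(viterbi([0, 0, 1]))
```

Because each step only needs the previous δ row and one backpointer per state, the cost is O(T N^2), the same as the forward pass.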
What is the Viterbi Algorithm Used For?
Speech recognition: signal → words.
The HMM observable is the signal; the hidden state is part of word formation.
What is the most probable word given this signal?
HMMs are Used and Useful
But how do you design an HMM?
Occasionally (e.g. in our robot example) it is reasonable to deduce the HMM from first principles.
But usually, especially in Speech or Genetics, it is better to infer it from large amounts of data: O_1 O_2 ... O_T, with a big T.
Previously in the lecture: we computed with the observations O_1 O_2 ... O_T given known parameters.
In the next bit: we use the observations O_1 O_2 ... O_T to learn the parameters.
Inferring an HMM
We were computing probabilities based on some parameters: P(O_1 O_2 ... O_T | λ).
λ is the notation for our HMM parameters: λ = {a_ij}, {b_i(j)}, {π_i}.
We could use:
MAX LIKELIHOOD: λ* = argmax_λ P(O_1 ... O_T | λ)
BAYES: work out P(λ | O_1 ... O_T), and then take E[λ] or max_λ P(λ | O_1 ... O_T)
Max Likelihood HMM Estimation
Define:
γ_t(i) = P(q_t = S_i | O_1 O_2 ... O_T, λ)
ε_t(i, j) = P(q_t = S_i ∧ q_{t+1} = S_j | O_1 O_2 ... O_T, λ)
γ_t(i) and ε_t(i, j) can be computed efficiently for all i, j, t.
Σ_{t=1}^{T-1} γ_t(i) = expected number of transitions out of state i during the path
Σ_{t=1}^{T-1} ε_t(i, j) = expected number of transitions from state i to state j during the path
HMM Estimation
γ_t(i) = P(q_t = S_i | O_1 O_2 ... O_T, λ)
ε_t(i, j) = P(q_t = S_i ∧ q_{t+1} = S_j | O_1 O_2 ... O_T, λ)
Σ_{t=1}^{T-1} γ_t(i) = expected number of transitions out of state i during the path
Σ_{t=1}^{T-1} ε_t(i, j) = expected number of transitions from state i to state j during the path
Σ_{t=1}^{T-1} ε_t(i, j) / Σ_{t=1}^{T-1} γ_t(i) = (expected frequency i → j) / (expected frequency out of i)
  = estimate of Prob(next state S_j | this state S_i)
So we can re-estimate:
a_ij = Σ_{t=1}^{T-1} ε_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)
Expectation Maximization for HMMs
If we knew λ, we could estimate EXPECTATIONS of quantities such as:
expected number of times in state i
expected number of transitions i → j
If we knew the quantities such as:
expected number of times in state i
expected number of transitions i → j
we could compute the MAX LIKELIHOOD estimate of λ = {a_ij}, {b_i(j)}, {π_i}.
Expectation Maximization for HMMs
1. Get your observations O_1 ... O_T.
2. Guess your first λ estimate λ(0); set k = 0.
3. k = k + 1.
4. Given O_1 ... O_T and λ(k), compute γ_t(i), ε_t(i, j) for 1 ≤ t ≤ T, 1 ≤ i, j ≤ N.
5. Compute the expected frequency of state i and the expected frequency i → j.
6. Compute new estimates of a_ij, b_j(k), π_i accordingly. Call them λ(k+1).
7. Go to 3, unless converged.
Also known (for the HMM case) as the BAUM-WELCH algorithm.
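One pass through steps 4-6 can be sketched end-to-end. This is our own compact illustration with made-up parameters: it re-estimates only a_ij (full Baum-Welch also re-estimates b and π and iterates to convergence), and it uses a backward pass β_t(i) = P(O_{t+1} ... O_T | q_t = S_i), which these slides did not derive:

```python
# Made-up 2-state, 2-symbol HMM parameters (illustration only).
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 0, 1, 0]
N, T = len(pi), len(obs)

# Forward pass: alpha[t][i] = P(O_1..O_{t+1}, q_{t+1} = S_i), 0-indexed t.
alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
for t in range(1, T):
    alpha.append([B[j][obs[t]] * sum(A[i][j] * alpha[t - 1][i] for i in range(N))
                  for j in range(N)])

# Backward pass: beta[t][i] = P(O_{t+2}..O_T | q_{t+1} = S_i), beta[T-1] = 1.
beta = [[1.0] * N for _ in range(T)]
for t in range(T - 2, -1, -1):
    beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
               for i in range(N)]

prob_O = sum(alpha[T - 1])   # P(O_1..O_T | lambda)

# E-step: gamma_t(i) and epsilon_t(i, j) from alpha, beta.
gamma = [[alpha[t][i] * beta[t][i] / prob_O for i in range(N)] for t in range(T)]
eps = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / prob_O
         for j in range(N)] for i in range(N)] for t in range(T - 1)]

# M-step: a_ij = sum_t eps_t(i, j) / sum_t gamma_t(i).
a_new = [[sum(eps[t][i][j] for t in range(T - 1)) /
          sum(gamma[t][i] for t in range(T - 1))
          for j in range(N)] for i in range(N)]
print(a_new)
```

Because Σ_j ε_t(i, j) = γ_t(i), each row of the re-estimated matrix automatically sums to 1.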
Summary
An HMM is a dynamic Bayesian network with hidden states.
Efficient computation of the probability of a series of observations.
The Viterbi algorithm.
Outline of the EM algorithm.