Hidden Markov Models
"...99, 100! Markov, here I come!"
16.410/413 Principles of Autonomy and Decision-Making
Pedro Santana (psantana@mit.edu), October 7th, 2015.
Based on material by Brian Williams and Emilio Frazzoli.
Assignments
Problem Set 4: out last Wednesday, due at midnight tonight.
Problem Set 5: out today, due in a week.
Readings
Today: Probabilistic Reasoning over Time, [AIMA] Ch. 15.
Today's topics
1. Motivation
2. Probability recap: Bayes' rule, marginalization
3. Markov chains
4. Hidden Markov models
5. HMM algorithms: prediction, filtering, smoothing, decoding, and learning (Baum-Welch). Learning won't be covered today, as it is significantly more involved, but you might want to learn more about it.
1. Motivation
Why are we learning this?
Robot navigation
Robust sensor fusion (visual tracking)
Natural language processing (NLP)
"Li-ve long and pros-per"
2. Probability recap
"Probability is common sense reduced to calculation." (Pierre-Simon Laplace)
Bayes' rule
A, B: random variables.
Joint = conditional x marginal:
  Pr(A, B) = Pr(A | B) Pr(B)
  Pr(A, B) = Pr(B | A) Pr(A)
Equating the two factorizations gives Pr(A | B) Pr(B) = Pr(B | A) Pr(A), and therefore
  Pr(A | B) = Pr(B | A) Pr(A) / Pr(B)   (Bayes' rule!)
Marginalization & graphical models
Graphical model: A -> B (A causes B), with a prior Pr(A) on the cause and a conditional Pr(B | A) for the effect.
Marginalizing A out gives the distribution of the effect B:
  Pr(B) = sum_a Pr(A = a, B) = sum_a Pr(B | A = a) Pr(A = a)
Conditioning on the cause makes the computation easier.
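As a quick numeric sanity check, both identities can be run through in a few lines. The numbers below are made up purely for illustration: a cause A that is true with probability 0.2, and an effect B that occurs with probability 0.9 when A is true and 0.1 otherwise.

```python
# Toy numbers (assumed for illustration): a cause A and an effect B.
p_a = {True: 0.2, False: 0.8}           # prior on the cause, Pr(A)
p_b_given_a = {True: 0.9, False: 0.1}   # Pr(B = true | A = a)

# Marginalization: Pr(B) = sum_a Pr(B | A = a) Pr(A = a)
p_b = sum(p_b_given_a[a] * p_a[a] for a in (True, False))

# Bayes' rule: Pr(A | B) = Pr(B | A) Pr(A) / Pr(B)
p_a_given_b = p_b_given_a[True] * p_a[True] / p_b

print(round(p_b, 4), round(p_a_given_b, 4))   # 0.26 0.6923
```

Observing the effect raises the belief in the cause from the prior 0.2 to about 0.69, which is exactly the kind of inference HMM filtering performs over time.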
Our goal for today
How can we estimate the hidden state of a system from noisy sensor observations?
3. Markov chains
Andrey Markov
State transitions over time
S_t: state at time t (a random variable).
S_t = s: a particular value of S_t (not random), with s in S, where S is the state space.
Over time: S_0 -> S_1 -> ... -> S_t -> S_t+1 -> S_t+2 -> ...
State transitions over time
  Pr(S_0, S_1, ..., S_t, S_t+1) = Pr(S_0:t+1)
By the chain rule,
  Pr(S_0:t+1) = Pr(S_0) Pr(S_1 | S_0) Pr(S_2 | S_0:1) Pr(S_3 | S_0:2) ... Pr(S_t+1 | S_0:t)
The whole past influences the present: the models grow exponentially with time!
The Markov assumption
  Pr(S_t | S_0:t-1) = Pr(S_t | S_t-1)   (constant size!)
The path to S_t isn't relevant, given knowledge of S_t-1.
Definition: Markov chain
If a sequence of random variables S_0, S_1, ..., S_t+1 is such that
  Pr(S_0:t+1) = Pr(S_0) Pr(S_1 | S_0) Pr(S_2 | S_1) ... = Pr(S_0) prod_{i=1}^{t+1} Pr(S_i | S_i-1),
we say that S_0, S_1, ..., S_t+1 form a Markov chain.
Markov chains
S_0 -> S_1 -> ... -> S_t -> S_t+1 -> S_t+2 -> ...
S is a discrete set with d values. The transition model is a d x d matrix T_t:
  T_t[i, j] = Pr(S_t = i | S_t-1 = j)
If T_t does not depend on t, the Markov chain is called stationary:
  T[i, j] = Pr(S_t = i | S_t-1 = j), for all t.
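A stationary chain can be simulated by repeatedly multiplying the belief vector by T. The 2-state matrix below is an illustrative assumption (not one of the lecture's examples); the point is that p_t = T p_t-1 remains a valid probability distribution at every step.

```python
import numpy as np

# Column-stochastic transition matrix (assumed 2-state example):
# T[i, j] = Pr(S_t = i | S_t-1 = j); each column sums to 1.
T = np.array([[0.9, 0.3],
              [0.1, 0.7]])

p = np.array([1.0, 0.0])   # initial distribution Pr(S_0)
for _ in range(3):         # propagate three steps: p_t = T @ p_t-1
    p = T @ p

print(p)                   # ~[0.804, 0.196]; entries still sum to 1
```

Note the column-stochastic convention used in the slides (columns index the previous state), so the belief is a column vector multiplied on the left by T.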
(Very) Simple Wall Street
Stock price* states: H: High, R: Rising, F: Falling, L: Low, S: Steady.
Transition matrix T, with T[i, j] = Pr(S_k = i | S_k-1 = j); omitted entries are zero, and each column sums to 1:

        H_k-1   R_k-1   F_k-1   L_k-1   S_k-1
  H_k   0.1     0.05    0.2     0       0
  R_k   0       0.5     0.8     0       0.25
  F_k   0.9     0       0       0.6     0.25
  L_k   0       0       0       0.1     0
  S_k   0       0.45    0       0.3     0.5

*Pedagogical example. In no circumstance shall the author be responsible for financial losses due to decisions based on this model.
4. Hidden Markov models (HMMs)
Andrey Markov
Observing hidden Markov chains
Hidden: S_0 -> S_1 -> ... -> S_t -> S_t+1 -> S_t+2
Observable: O_1, ..., O_t, O_t+1, O_t+2
Definition: Hidden Markov Model (HMM)
A sequence of random variables O_1, O_2, ..., O_t, ... is an HMM if the distribution of O_t is completely defined by the current (hidden) state S_t according to Pr(O_t | S_t), where S_t is part of an underlying Markov chain.
Hidden Markov models
Hidden: S_0 -> S_1 -> ... -> S_t -> S_t+1 -> S_t+2
Observable: O_1, ..., O_t, O_t+1, O_t+2
O is a discrete set with m values. The observation model is a d x m matrix M:
  M[i, j] = Pr(O_t = j | S_t = i)
The dishonest casino
Hidden states: Fair (F) and Loaded (L). Observations: die faces 1-6.
Transition matrix T:

        F_k-1   L_k-1
  F_k   0.95    0.05
  L_k   0.05    0.95

Observation matrix M:

        1      2      3      4      5      6
  F_k   1/6    1/6    1/6    1/6    1/6    1/6
  L_k   1/10   1/10   1/10   1/10   1/10   1/2
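The two tables translate directly into a column-stochastic transition matrix and a row-stochastic observation matrix, for example in NumPy:

```python
import numpy as np

# States: 0 = Fair, 1 = Loaded. T[i, j] = Pr(S_k = i | S_k-1 = j).
T = np.array([[0.95, 0.05],
              [0.05, 0.95]])

# Observations: die faces 1..6 mapped to columns 0..5.
# M[i, o] = Pr(O_k = o + 1 | S_k = i).
M = np.array([[1/6] * 6,
              [1/10] * 5 + [1/2]])

# Sanity checks: T's columns and M's rows are probability distributions.
print(T.sum(axis=0), M.sum(axis=1))
```

These two arrays (plus an initial belief over Fair/Loaded) are all the HMM algorithms that follow will need.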
Queries
Lower case denotes known values, not random variables: o_1, o_2, ..., o_t = o_1:t.
Given the available history of observations, what's the belief about the current hidden state?
  Pr(S_t | o_1:t)   (Filtering)
Given the available history of observations, what's the belief about a past hidden state?
  Pr(S_k | o_1:t), k < t   (Smoothing)
Queries
Given the available history of observations, what's the belief about a future hidden state?
  Pr(S_k | o_1:t), k > t   (Prediction)
Given the available history of observations, what's the most likely sequence of hidden states?
  s*_0:t = argmax_{s_0:t} Pr(S_0:t = s_0:t | o_1:t)   (Decoding)
5. HMM algorithms
Where we'll learn how to compute answers to the previously seen HMM queries.
Notation
Pr(S_t): the probability distribution of the random variable S_t, a vector of d probability values.
Pr(S_t = s) = Pr(s_t): the probability of observing S_t = s according to Pr(S_t), a number in [0, 1].
Filtering (forward)
Given the available history of observations, what's the belief about the current hidden state? Pr(S_t | o_1:t) = p_t.
  Pr(S_t | o_1:t) = Pr(S_t | o_t, o_1:t-1)
                  ∝ Pr(o_t | S_t, o_1:t-1) Pr(S_t | o_1:t-1)   (Bayes)
                  = Pr(o_t | S_t) Pr(S_t | o_1:t-1)            (obs. model)
  Pr(S_t | o_1:t-1) = sum_{i=1}^{d} Pr(S_t | S_t-1 = i, o_1:t-1) Pr(S_t-1 = i | o_1:t-1)   (marg.)
                    = sum_{i=1}^{d} Pr(S_t | S_t-1 = i) Pr(S_t-1 = i | o_1:t-1)            (trans. model)
Recursion!
Filtering
Given the available history of observations, what's the belief about the current hidden state? Pr(S_t | o_1:t) = p_t.
1. One-step prediction:
   p̄_t = T p_t-1, i.e., Pr(S_t | o_1:t-1) = sum_{i=1}^{d} Pr(S_t | S_t-1 = i) Pr(S_t-1 = i | o_1:t-1).
2. Measurement update:
   p_t[i] = η Pr(o_t | S_t = i) p̄_t[i].
3. Normalize the belief (to get rid of η):
   p_t[i] <- p_t[i] / η, with η = sum_{j=1}^{d} p_t[j].
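The three steps map one-to-one onto a short NumPy routine. The sketch below runs it on the dishonest-casino model and observation sequence used in the worked example; the printed belief matches the 0.8480 reported for t=1 in the filtering table.

```python
import numpy as np

def hmm_filter(T, M, p0, obs):
    """Forward filtering: returns p_t = Pr(S_t | o_1:t) for t = 0..len(obs)."""
    beliefs = [p0]
    p = p0
    for o in obs:
        p = T @ p          # 1. one-step prediction
        p = M[:, o] * p    # 2. measurement update (unnormalized)
        p = p / p.sum()    # 3. normalize
        beliefs.append(p)
    return beliefs

# Dishonest-casino model: 0 = Fair, 1 = Loaded.
T = np.array([[0.95, 0.05], [0.05, 0.95]])
M = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])
p0 = np.array([0.8, 0.2])
obs = [0, 1, 3, 5, 5, 5, 2, 5]   # die faces 1,2,4,6,6,6,3,6, zero-indexed

beliefs = hmm_filter(T, M, p0, obs)
print(beliefs[1][0])             # ~0.8480: Pr(Fair | o_1 = 1)
```

The normalization in step 3 also keeps the recursion numerically stable, since the unnormalized beliefs shrink geometrically with t.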
Prediction
Given the available history of observations, what's the belief about a future hidden state? Pr(S_k | o_1:t), k > t.
  Pr(S_t+1 | o_1:t) = T p_t   (previous slide)
  Pr(S_t+2 | o_1:t) = sum_{i=1}^{d} Pr(S_t+2 | S_t+1 = i) Pr(S_t+1 = i | o_1:t) = T^2 p_t
  ...
  Pr(S_k | o_1:t) = T^(k-t) p_t
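A minimal sketch of k-step prediction, reusing the casino transition matrix and a filtered belief from the worked example. With no new observations, the predicted belief drifts toward the chain's stationary distribution.

```python
import numpy as np

T = np.array([[0.95, 0.05], [0.05, 0.95]])   # casino transition matrix
p_t = np.array([0.8480, 0.1520])             # filtered belief at time t (from the example)

# k-step prediction: Pr(S_t+k | o_1:t) = T^k p_t
k = 3
pred = np.linalg.matrix_power(T, k) @ p_t
print(pred)   # ~[0.7537, 0.2463]: the belief relaxes toward (0.5, 0.5)
```

For this symmetric T the stationary distribution is uniform, so predicting far ahead washes out everything the observations told us.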
Smoothing (forward-backward)
Given the available history of observations, what's the belief about a past hidden state? Pr(S_k | o_1:t), k < t.
  Pr(S_k | o_1:t) = Pr(S_k | o_1:k, o_k+1:t)
                  ∝ Pr(o_k+1:t | S_k, o_1:k) Pr(S_k | o_1:k)   (Bayes)
                  = Pr(o_k+1:t | S_k) Pr(S_k | o_1:k)          (filtering!)
  Pr(o_k+1:t | S_k) = sum_{i=1}^{d} Pr(o_k+1:t | S_k+1 = i, S_k) Pr(S_k+1 = i | S_k)   (marg.)
                    = sum_{i=1}^{d} Pr(o_k+1, o_k+2:t | S_k+1 = i) Pr(S_k+1 = i | S_k)
                    = sum_{i=1}^{d} Pr(o_k+2:t | S_k+1 = i) Pr(o_k+1 | S_k+1 = i) Pr(S_k+1 = i | S_k)   (obs. model)
Recursion!
Smoothing
Given the available history of observations, what's the belief about a past hidden state? Pr(S_k | o_1:t) = p_k,t, k < t.
1. Perform filtering from 0 to k (forward): Pr(S_k | o_1:k) = p_k.
2. Compute the backward recursion from t down to k: Pr(o_k+1:t | S_k) = b_k,t, with
   b_t,t = 1,
   b_m-1,t[i] = sum_{j=1}^{d} b_m,t[j] Pr(o_m | S_m = j) Pr(S_m = j | S_m-1 = i), for k+1 <= m <= t.
3. Combine the two results and normalize:
   p_k,t[i] = b_k,t[i] p_k[i], then p_k,t[i] <- p_k,t[i] / η, with η = sum_{j=1}^{d} p_k,t[j].
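The forward-backward recursion can be sketched as follows, again on the casino model. Smoothing at t=0 reproduces the 0.7382 reported in the worked example: hindsight from all eight rolls revises the initial 0.8 belief in a fair die downward.

```python
import numpy as np

def hmm_smooth(T, M, p0, obs):
    """Forward-backward smoothing: Pr(S_k | o_1:t) for k = 0..t."""
    # Forward pass (filtering), keeping every belief p_k.
    fwd = [p0]
    p = p0
    for o in obs:
        p = M[:, o] * (T @ p)
        p = p / p.sum()
        fwd.append(p)
    # Backward pass: b_t,t = 1, then recurse down to k = 0.
    b = np.ones(len(p0))
    smoothed = [None] * len(fwd)
    smoothed[-1] = fwd[-1]
    for m in range(len(obs) - 1, -1, -1):
        # b_m-1[i] = sum_j b_m[j] Pr(o_m | S_m = j) Pr(S_m = j | S_m-1 = i)
        b = T.T @ (M[:, obs[m]] * b)
        s = b * fwd[m]
        smoothed[m] = s / s.sum()
    return smoothed

T = np.array([[0.95, 0.05], [0.05, 0.95]])
M = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])
sm = hmm_smooth(T, M, np.array([0.8, 0.2]), [0, 1, 3, 5, 5, 5, 2, 5])
print(sm[0][0])   # ~0.7382: Pr(Fair | all eight observations)
```

Note that the backward message is applied through T transposed, because b conditions on the earlier state rather than predicting the later one.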
Decoding
Given the available history of observations, what's the most likely sequence of hidden states so far?
  s*_0:t = argmax_{s_0:t} Pr(S_0:t = s_0:t | o_1:t)
Decoding (simple algorithm)
Given the available history of observations, what's the most likely sequence of hidden states so far?
  s*_0:t = argmax_{s_0:t} Pr(S_0:t = s_0:t | o_1:t)
  Pr(s_0:t | o_1:t) ∝ Pr(o_1:t | s_0:t) Pr(s_0:t)   (Bayes)
                    = Pr(s_0) prod_{i=1}^{t} Pr(s_i | s_i-1) Pr(o_i | s_i)   (HMM model)
Decoding (simple algorithm)
Given the available history of observations, what's the most likely sequence of hidden states so far?
Compute all possible state trajectories from 0 to t:
  T_0:t = {s_0:t : s_i in S, i = 0, ..., t}
How big is T_0:t? It contains d^(t+1) trajectories.
Choose the most likely trajectory according to
  s*_0:t = argmax_{s_0:t in T_0:t} Pr(s_0) prod_{i=1}^{t} Pr(s_i | s_i-1) Pr(o_i | s_i)
Can we do better?
Decoding (the Viterbi algorithm)
Given the available history of observations, what's the most likely sequence of hidden states so far?
  Pr(S_0:t | o_1:t) = Pr(S_0:t | o_1:t-1, o_t)
                    ∝ Pr(o_t | S_t) Pr(S_t, S_0:t-1 | o_1:t-1)
                    = Pr(o_t | S_t) Pr(S_t | S_t-1) Pr(S_0:t-1 | o_1:t-1)   (recursion!)
Taking the maximum,
  max_{s_0:t} Pr(s_0:t | o_1:t) ∝ max_{s_t, s_t-1} Pr(o_t | s_t) Pr(s_t | s_t-1) max_{s_0:t-1} Pr(s_0:t-1 | o_1:t-1)
From all paths arriving at s_t-1, we only need to record the most likely one.
Decoding (the Viterbi algorithm)
Given the available history of observations, what's the most likely sequence of hidden states so far?
δ_k[s]: most likely path ending in S_k = s; l_k[s]: likelihood of δ_k[s] (an unnormalized probability).
Initialization: δ_0[s] = (s), l_0[s] = Pr(S_0 = s).
1. Expand the paths in δ_k according to the transition model:
   pred_k+1[s] = argmax_{s'} Pr(S_k+1 = s | S_k = s') l_k[s'], for s = 1, ..., d
   δ_k+1[s] = δ_k[pred_k+1[s]].append(s)
2. Update the likelihoods:
   l_k+1[s] = Pr(o_k+1 | S_k+1 = s) Pr(S_k+1 = s | S_k = pred_k+1[s]) l_k[pred_k+1[s]]
3. When k = t, choose the δ_t[s] with the highest l_t[s].
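A compact (non-optimized) sketch of the recursion above, storing for each state the best path ending there and its unnormalized likelihood. Run on the casino observations, it returns the all-Loaded path, matching the final decoding result in the worked example.

```python
import numpy as np

def viterbi(T, M, p0, obs):
    """Most likely hidden-state sequence s_0:t given observations o_1:t."""
    d = len(p0)
    paths = [[s] for s in range(d)]       # delta_0[s] = (s)
    lik = np.array(p0, dtype=float)       # l_0[s] = Pr(S_0 = s)
    for o in obs:
        scores = T * lik                  # scores[s, s'] = Pr(s | s') l_k[s']
        pred = scores.argmax(axis=1)      # best predecessor for each state s
        lik = M[:, o] * scores[np.arange(d), pred]
        paths = [paths[pred[s]] + [s] for s in range(d)]
    return paths[int(lik.argmax())]

# Dishonest casino: 0 = Fair, 1 = Loaded; observations 1,2,4,6,6,6,3,6.
T = np.array([[0.95, 0.05], [0.05, 0.95]])
M = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])
best = viterbi(T, M, [0.8, 0.2], [0, 1, 3, 5, 5, 5, 2, 5])
print(best)   # [1]*9, i.e. every state decoded as Loaded
```

In practice the likelihoods are kept in log space to avoid underflow on long sequences; the argmax structure is unchanged.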
Dishonest casino example
Hidden states: Fair (F) and Loaded (L). Observations: die faces 1-6.
Transition matrix T:

        F_k-1   L_k-1
  F_k   0.95    0.05
  L_k   0.05    0.95

Observation matrix M:

        1      2      3      4      5      6
  F_k   1/6    1/6    1/6    1/6    1/6    1/6
  L_k   1/10   1/10   1/10   1/10   1/10   1/2
Dishonest casino example
Pr(S_0) = (0.8, 0.2) over (Fair, Loaded). Observations = 1, 2, 4, 6, 6, 6, 3, 6.

        Filtering            Smoothing
        Fair     Loaded      Fair     Loaded
  t=0   0.8000   0.2000      0.7382   0.2618
  t=1   0.8480   0.1520      0.6940   0.3060
  t=2   0.8789   0.1211      0.6116   0.3884
  t=3   0.8981   0.1019      0.4679   0.5321
  t=4   0.6688   0.3312      0.2229   0.7771
  t=5   0.3843   0.6157      0.1444   0.8556
  t=6   0.1793   0.8207      0.1265   0.8735
  t=7   0.3088   0.6912      0.1449   0.8551
  t=8   0.1399   0.8601      0.1399   0.8601
At t=8, filtering and smoothing agree. Coincidence? (No: since b_t,t = 1, the smoothed belief at the final time is exactly the filtered one.)
Dishonest casino example
Pr(S_0) = (0.8, 0.2) over (Fair, Loaded). Observations = 1, 2, 4, 6, 6, 6, 3, 6.
Filtering (MAP): ['Fair', 'Fair', 'Fair', 'Fair', 'Fair', 'Loaded', 'Loaded', 'Loaded', 'Loaded']
Smoothing (MAP): ['Fair', 'Fair', 'Fair', 'Loaded', 'Loaded', 'Loaded', 'Loaded', 'Loaded', 'Loaded']
Decoding:
  t=0: ['Fair']
  t=1: ['Fair', 'Fair']
  t=2: ['Fair', 'Fair', 'Fair']
  t=3: ['Fair', 'Fair', 'Fair', 'Fair']
  t=4: ['Fair', 'Fair', 'Fair', 'Fair', 'Fair']
  t=5: ['Fair', 'Fair', 'Fair', 'Fair', 'Fair', 'Fair']
  t=6: ['Loaded', 'Loaded', 'Loaded', 'Loaded', 'Loaded', 'Loaded', 'Loaded']
  t=7: ['Fair', 'Fair', 'Fair', 'Fair', 'Fair', 'Fair', 'Fair', 'Fair']
  t=8: ['Loaded', 'Loaded', 'Loaded', 'Loaded', 'Loaded', 'Loaded', 'Loaded', 'Loaded', 'Loaded']
Borodovsky & Ekisheva (2006), pp. 80-81
DNA sequence: A G T C A T G ...
Hidden states: H: high genetic content (coding DNA); L: low genetic content (non-coding DNA).
Transition matrix T:

        H_k-1   L_k-1
  H_k   0.5     0.4
  L_k   0.5     0.6

Observation matrix M:

        A     C     G     T
  H_k   0.2   0.3   0.3   0.2
  L_k   0.3   0.2   0.2   0.3
Borodovsky & Ekisheva (2006), pp. 80-81
Pr(S_0) = (0.5, 0.5) over (High, Low). Observations = G, G, C, A, C, T, G, A, A. (T and M as on the previous slide.)

        Filtering          Smoothing
        H        L         H        L
  t=0   0.5000   0.5000    0.5113   0.4887
  t=1   0.5510   0.4490    0.5620   0.4380
  t=2   0.5561   0.4439    0.5653   0.4347
  t=3   0.5566   0.4434    0.5478   0.4522
  t=4   0.3582   0.6418    0.3668   0.6332
  t=5   0.5368   0.4632    0.5278   0.4722
  t=6   0.3563   0.6437    0.3648   0.6352
  t=7   0.5366   0.4634    0.5259   0.4741
  t=8   0.3563   0.6437    0.3474   0.6526
  t=9   0.3398   0.6602    0.3398   0.6602
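The t=1 row of the filtering table can be reproduced by hand with a single prediction-update-normalize cycle:

```python
# One filtering step for the DNA example: prior (0.5, 0.5), first observation G.
p_H, p_L = 0.5, 0.5

# 1. One-step prediction with T (columns sum to 1):
pred_H = 0.5 * p_H + 0.4 * p_L   # Pr(H_1) = Pr(H|H) Pr(H_0) + Pr(H|L) Pr(L_0)
pred_L = 0.5 * p_H + 0.6 * p_L

# 2. Measurement update with Pr(G|H) = 0.3 and Pr(G|L) = 0.2:
up_H, up_L = 0.3 * pred_H, 0.2 * pred_L

# 3. Normalize:
p_H = up_H / (up_H + up_L)
print(p_H)   # ~0.5510, the t=1 entry of the filtering table
```

The same cycle, iterated over the remaining eight symbols, produces the rest of the filtering column.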
Borodovsky & Ekisheva (2006), pp. 80-81
Pr(S_0) = (0.5, 0.5) over (High, Low). Observations = G, G, C, A, C, T, G, A, A.
Filtering (MAP): ['H/L', 'H', 'H', 'H', 'L', 'H', 'L', 'H', 'L', 'L']
Smoothing (MAP): ['H', 'H', 'H', 'H', 'L', 'H', 'L', 'H', 'L', 'L']
Decoding:
  t=0: ['H']
  t=1: ['H', 'H']
  t=2: ['H', 'H', 'H']
  t=3: ['H', 'H', 'H', 'H']
  t=4: ['H', 'H', 'H', 'H', 'L']
  t=5: ['H', 'H', 'H', 'H', 'L', 'L']
  t=6: ['H', 'H', 'H', 'H', 'L', 'L', 'L']
  t=7: ['H', 'H', 'H', 'H', 'L', 'L', 'L', 'H']
  t=8: ['H', 'H', 'H', 'H', 'L', 'L', 'L', 'L', 'L']
  t=9: ['H', 'H', 'H', 'H', 'L', 'L', 'L', 'L', 'L', 'L']