Hidden Markov models
Outline
- Time and uncertainty
- Markov processes
- Hidden Markov models
- Inference: filtering, prediction, smoothing
- Most likely explanation: Viterbi
Time and uncertainty
- The world changes; we need to track and predict it
  (weather forecasting, speech recognition, diabetes management)
- Basic idea: copy state and evidence variables for each time step
- X_t = set of unobservable state variables at time t,
  e.g., BloodSugar_t, StomachContents_t, etc.
- E_t = set of observable evidence variables at time t,
  e.g., MeasuredBloodSugar_t, PulseRate_t, FoodEaten_t
- This assumes discrete time; the step size depends on the problem
- Notation: X_{a:b} = X_a, X_{a+1}, ..., X_{b-1}, X_b
Markov processes (Markov chains)
- Construct a Bayes net from these variables: what are the parents?
- Markov assumption: X_t depends on a bounded subset of X_{0:t-1}
- First-order Markov process:  P(X_t | X_{1:t-1}) = P(X_t | X_{t-1})
- Second-order Markov process: P(X_t | X_{1:t-1}) = P(X_t | X_{t-2}, X_{t-1})
- Stationary process: transition model P(X_t | X_{t-1}) fixed for all t
[Figure: Bayes-net chains. First order: X_{t-2} -> X_{t-1} -> X_t -> X_{t+1} -> X_{t+2};
 second order: each X_t additionally has X_{t-2} as a parent]
Hidden Markov models - Example
[Figure: umbrella world. Hidden chain ... -> Rain_{t-1} -> Rain_t -> Rain_{t+1} -> ...;
 each evidence node Umbrella_t depends only on Rain_t]
- Transition model: P(R_t | R_{t-1} = t) = 0.7,  P(R_t | R_{t-1} = f) = 0.3
- Sensor model:     P(U_t | R_t = t) = 0.9,      P(U_t | R_t = f) = 0.2
Hidden Markov models - Definition
- X_t is a single, discrete variable with N states; domain of X_t is {1, ..., N}
- E_t is also a discrete variable with M symbols; domain of E_t is {1, ..., M}
- Transition matrix A_ij = P(X_t = j | X_{t-1} = i), e.g.,
    ( 0.7 0.3 )
    ( 0.3 0.7 )
- Emission probability distribution B_i(k) = P(E_t = k | X_t = i)
- Prior state occupation π_i = P(X_1 = i), e.g., (0.7 0.3)
- Hidden Markov model: λ = (A, B, π)
- First-order Markov assumption: P(X_t | X_{1:t-1}) = P(X_t | X_{t-1})
- Sensor Markov assumption: P(E_t | X_{1:t}, E_{1:t-1}) = P(E_t | X_t)
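The definition above can be sketched in a few lines of plain Python. The transition and sensor numbers are the umbrella-world values from these slides; the variable names (`A`, `B`, `pi`) follow the slide's λ = (A, B, π) notation, and the prior (0.5, 0.5) is the one used in the filtering example rather than the (0.7, 0.3) illustration.

```python
# A minimal sketch of an HMM as plain Python lists (no libraries assumed).

# Transition matrix A[i][j] = P(X_t = j | X_{t-1} = i); state 0 = rain, 1 = no rain
A = [[0.7, 0.3],
     [0.3, 0.7]]

# Emission distribution B[i][k] = P(E_t = k | X_t = i); symbol 0 = umbrella, 1 = no umbrella
B = [[0.9, 0.1],
     [0.2, 0.8]]

# Prior state occupation pi[i] = P(X_1 = i)
pi = [0.5, 0.5]

# Sanity check: every distribution in lambda = (A, B, pi) sums to 1
for row in A + B + [pi]:
    assert abs(sum(row) - 1.0) < 1e-9
```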
Hidden Markov model
- The first-order Markov assumption is not exactly true in the real world!
- Possible fixes:
  1. Increase the order of the Markov process
  2. Augment the state, e.g., add Temp_t, Pressure_t
Inference tasks
- Filtering: P(X_t | e_{1:t})
  the belief state; input to the decision process of a rational agent
- Prediction: P(X_{t+k} | e_{1:t}) for k > 0
  evaluation of possible action sequences; like filtering without the evidence
- Smoothing: P(X_k | e_{1:t}) for 0 <= k < t
  better estimate of past states; essential for learning
- Most likely explanation: arg max_{x_{1:t}} P(x_{1:t} | e_{1:t})
  speech recognition, decoding with a noisy channel
Filtering
Aim: devise a recursive state estimation algorithm:
  P(X_{t+1} | e_{1:t+1}) = f(e_{t+1}, P(X_t | e_{1:t}))

  P(X_{t+1} | e_{1:t+1}) = P(X_{t+1} | e_{1:t}, e_{t+1})
    = α P(e_{t+1} | X_{t+1}, e_{1:t}) P(X_{t+1} | e_{1:t})
    = α P(e_{t+1} | X_{t+1}) P(X_{t+1} | e_{1:t})

I.e., prediction + estimation. Prediction by summing out X_t:

  P(X_{t+1} | e_{1:t+1}) = α P(e_{t+1} | X_{t+1}) Σ_{x_t} P(X_{t+1} | x_t, e_{1:t}) P(x_t | e_{1:t})
    = α P(e_{t+1} | X_{t+1}) Σ_{x_t} P(X_{t+1} | x_t) P(x_t | e_{1:t})

  f_{1:t+1} = FORWARD(f_{1:t}, e_{t+1})   where f_{1:t} = P(X_t | e_{1:t})

Time and space requirements are constant (independent of t).
Filtering example
[Figure: umbrella world, prior P(Rain_0) = (0.5, 0.5), umbrella observed on days 1 and 2]
- Day 1: prediction (0.500, 0.500); after evidence U_1, P(Rain_1 | u_1) = (0.818, 0.182)
- Day 2: prediction (0.627, 0.373); after evidence U_2, P(Rain_2 | u_1, u_2) = (0.883, 0.117)
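The filtering example above can be reproduced with the forward recursion. This is a sketch under the umbrella model's numbers; the helper name `forward_step` is my own.

```python
# Forward (filtering) recursion for the umbrella world: predict by summing
# out x_t, weight by the evidence likelihood, then normalize (the alpha step).

A = [[0.7, 0.3], [0.3, 0.7]]   # P(X_t = j | X_{t-1} = i), state 0 = rain
B = [[0.9, 0.1], [0.2, 0.8]]   # P(E_t = k | X_t = i), symbol 0 = umbrella
f = [0.5, 0.5]                 # prior P(Rain_0)

def forward_step(f, e):
    """One filtering update: f_{1:t+1} = FORWARD(f_{1:t}, e_{t+1})."""
    pred = [sum(f[i] * A[i][j] for i in range(2)) for j in range(2)]
    new = [B[j][e] * pred[j] for j in range(2)]
    z = sum(new)
    return [p / z for p in new]

for e in [0, 0]:               # umbrella observed on days 1 and 2
    f = forward_step(f, e)
print([round(p, 3) for p in f])   # -> [0.883, 0.117]
```

Each update touches only the previous message, so time and space per step are constant, as the slide states.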
Prediction
  P(X_{t+k+1} | e_{1:t}) = Σ_{x_{t+k}} P(X_{t+k+1} | x_{t+k}) P(x_{t+k} | e_{1:t})
As k → ∞, P(x_{t+k} | e_{1:t}) tends to the stationary distribution of the Markov chain
Smoothing
[Figure: chain X_0, X_1, ..., X_k, ..., X_t with evidence E_1, ..., E_k, ..., E_t]
Divide evidence e_{1:t} into e_{1:k}, e_{k+1:t}:
  P(X_k | e_{1:t}) = P(X_k | e_{1:k}, e_{k+1:t})
    = α P(X_k | e_{1:k}) P(e_{k+1:t} | X_k, e_{1:k})
    = α P(X_k | e_{1:k}) P(e_{k+1:t} | X_k)
    = α f_{1:k} b_{k+1:t}
Backward message computed by a backwards recursion:
  P(e_{k+1:t} | X_k) = Σ_{x_{k+1}} P(e_{k+1:t} | X_k, x_{k+1}) P(x_{k+1} | X_k)
    = Σ_{x_{k+1}} P(e_{k+1:t} | x_{k+1}) P(x_{k+1} | X_k)
    = Σ_{x_{k+1}} P(e_{k+1} | x_{k+1}) P(e_{k+2:t} | x_{k+1}) P(x_{k+1} | X_k)
Smoothing example
[Figure: umbrella world with evidence u_1, u_2, messages annotated at each node]
- Forward messages: f_{1:1} = (0.818, 0.182), f_{1:2} = (0.883, 0.117)
- Backward messages: b_{3:2} = (1.000, 1.000), b_{2:2} = (0.690, 0.410)
- Smoothed estimate: P(Rain_1 | u_1, u_2) = (0.883, 0.117)
- Forward-backward algorithm: cache forward messages along the way
- Time linear in t (polytree inference), space O(t|f|)
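The smoothing example above can be checked with the forward and backward recursions combined. This is a sketch for the two-day case only; the function names are my own.

```python
# Forward-backward smoothing for P(Rain_1 | u_1, u_2) in the umbrella world:
# smoothed(X_k) = alpha * f_{1:k}(X_k) * b_{k+1:t}(X_k).

A = [[0.7, 0.3], [0.3, 0.7]]
B = [[0.9, 0.1], [0.2, 0.8]]

def forward(f, e):
    new = [B[j][e] * sum(f[i] * A[i][j] for i in range(2)) for j in range(2)]
    z = sum(new)
    return [p / z for p in new]

def backward(b, e):
    # b_{k+1:t}(i) = sum_j P(e_{k+1} | j) * b_{k+2:t}(j) * P(X_{k+1}=j | X_k=i)
    return [sum(B[j][e] * b[j] * A[i][j] for j in range(2)) for i in range(2)]

evidence = [0, 0]                       # umbrella on days 1 and 2
f1 = forward([0.5, 0.5], evidence[0])   # f_{1:1} = (0.818, 0.182)
b = backward([1.0, 1.0], evidence[1])   # b_{2:2} = (0.690, 0.410)
sm = [f1[i] * b[i] for i in range(2)]
z = sum(sm)
print([round(s / z, 3) for s in sm])    # smoothed P(Rain_1 | u_1, u_2) -> [0.883, 0.117]
```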
Most likely explanation
Most likely sequence ≠ sequence of most likely states!
Most likely path to each x_{t+1} = most likely path to some x_t plus one more step:
  max_{x_1 ... x_t} P(x_1, ..., x_t, X_{t+1} | e_{1:t+1})
    = P(e_{t+1} | X_{t+1}) max_{x_t} [ P(X_{t+1} | x_t) max_{x_1 ... x_{t-1}} P(x_1, ..., x_{t-1}, x_t | e_{1:t}) ]
Identical to filtering, except f_{1:t} is replaced by
  m_{1:t} = max_{x_1 ... x_{t-1}} P(x_1, ..., x_{t-1}, X_t | e_{1:t}),
i.e., m_{1:t}(i) gives the probability of the most likely path to state i.
The update has the sum replaced by a max, giving the Viterbi algorithm:
  m_{1:t+1} = P(e_{t+1} | X_{t+1}) max_{x_t} ( P(X_{t+1} | x_t) m_{1:t} )
Viterbi example
[Figure: state-space trellis over Rain_1 ... Rain_5 with most likely paths marked]
- Evidence: umbrella on days 1, 2, 4, 5; no umbrella on day 3
- Messages (first entry: rain, second: no rain):
  m_{1:1} = (.8182, .1818), m_{1:2} = (.5155, .0491), m_{1:3} = (.0361, .1237),
  m_{1:4} = (.0334, .0173), m_{1:5} = (.0210, .0024)
- Most likely path: rain, rain, no rain, rain, rain
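The trellis above can be reproduced with a short Viterbi sketch. The first message is normalized to match the slide's m_{1:1} = (.8182, .1818); helper names (`back`, `ptr`) are my own.

```python
# Viterbi on the five-day umbrella sequence: propagate max-product messages
# forward, record backpointers, then backtrack from the best final state.

A = [[0.7, 0.3], [0.3, 0.7]]
B = [[0.9, 0.1], [0.2, 0.8]]
evidence = [0, 0, 1, 0, 0]      # 0 = umbrella, 1 = no umbrella

m = [0.5 * B[i][evidence[0]] for i in range(2)]
z = sum(m)
m = [p / z for p in m]          # m_{1:1} = (0.8182, 0.1818)
back = []                       # backpointers for path recovery

for e in evidence[1:]:
    ptr, new = [], []
    for j in range(2):
        # max over x_t of P(X_{t+1} = j | x_t) * m_{1:t}(x_t)
        i_best = max(range(2), key=lambda i: A[i][j] * m[i])
        ptr.append(i_best)
        new.append(B[j][e] * A[i_best][j] * m[i_best])
    back.append(ptr)
    m = new

# Backtrack from the most likely final state
state = max(range(2), key=lambda i: m[i])
path = [state]
for ptr in reversed(back):
    state = ptr[state]
    path.append(state)
path.reverse()
print(path)                     # -> [0, 0, 1, 0, 0]  (rain, rain, no rain, rain, rain)
print(round(m[0], 4))           # -> 0.021, matching m_{1:5} on the slide
```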
Summary
- Temporal models use state and sensor variables replicated over time
- The Markov assumptions and the stationarity assumption mean we only need
  - a transition model P(X_t | X_{t-1})
  - a sensor model P(E_t | X_t)
- Tasks are filtering, prediction, smoothing, and finding the most likely sequence;
  all can be done recursively with constant cost per time step