Lecture 11: Hidden Markov Models
Cognitive Systems - Machine Learning
Cognitive Systems, Applied Computer Science, Bamberg University
Slides by Dr. Philip Jackson, Centre for Vision, Speech & Signal Processing, University of Surrey, UK
Last change: December 22, 2014
Michael Siebers (CogSys) ML Hidden Markov Models December 22, 2014 1 / 30
Outline

Introduction
  Markov Models
  Hidden Markov Models
Parameters of a Discrete HMM
Computing P(O | λ)
  Forward Procedure
  Backward Procedure
Finding the Best Path
  Viterbi Algorithm
Training the Models
  Viterbi Training
  Baum-Welch
Introduction to Markov models

We can model stochastic sequences using a Markov chain, e.g., the state topology of an ergodic Markov model. For 1st-order Markov chains, the probability of state occupation depends only on the previous step (Rabiner, 1989):

  P(x_t = j | x_{t-1} = i, x_{t-2} = h, ...) = P(x_t = j | x_{t-1} = i)

So, if we assume the RHS above is independent of time, we can write the state-transition probabilities as

  a_ij = P(x_t = j | x_{t-1} = i),  1 ≤ i, j ≤ N

with the properties a_ij ≥ 0 and Σ_{j=1}^N a_ij = 1 for 1 ≤ i ≤ N.
Weather prediction example

Let us represent the state of the weather by a 1st-order, ergodic Markov model, M:
  state 1: raining
  state 2: cloudy
  state 3: sunny
with state-transition probabilities expressed in matrix form:

              | 0.4  0.3  0.3 |
  A = {a_ij} = | 0.2  0.6  0.2 |
              | 0.1  0.1  0.8 |
Weather predictor probability calculation

Given today's weather (cloudy), what is the probability of directly observing the sequence of weather states rain-sun-sun with model M?

           rain  cloud  sun
    rain  | 0.4   0.3   0.3 |
  A = cloud | 0.2   0.6   0.2 |
    sun   | 0.1   0.1   0.8 |

  P(X | M) = P(X = 1, 3, 3 | M)
           = P(x_1 = rain | today) · P(x_2 = sun | x_1 = rain) · P(x_3 = sun | x_2 = sun)
           = a_21 · a_13 · a_33 = 0.2 · 0.3 · 0.8 = 0.048
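The chained-transition calculation above can be sketched in plain Python (not part of the original slides; state indices and the helper name `sequence_prob` are chosen here for illustration):

```python
# Weather model M from the slide: states 0 = rain, 1 = cloud, 2 = sun.
A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

def sequence_prob(A, today, states):
    """Probability of observing `states` in order, starting from `today`."""
    p, prev = 1.0, today
    for s in states:
        p *= A[prev][s]   # one transition probability per step
        prev = s
    return p

# Today is cloudy (1); sequence rain-sun-sun = (0, 2, 2):
print(sequence_prob(A, 1, [0, 2, 2]))  # 0.2 * 0.3 * 0.8 = 0.048
```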
Start and end of a stochastic state sequence

Null states deal with the start and end of sequences, as in the state topology of this left-right Markov model.

Entry probabilities at t = 1 are defined for each state i as
  π_i = P(x_1 = i),  1 ≤ i ≤ N
with the properties π_i ≥ 0 for 1 ≤ i ≤ N and Σ_{i=1}^N π_i = 1.

Exit probabilities at t = T are similarly defined as
  η_i = P(x_T = i),  1 ≤ i ≤ N
with the properties η_i ≥ 0 and η_i + Σ_{j=1}^N a_ij = 1 for 1 ≤ i ≤ N.
Example: probability of MM state sequence

Consider the state topology with state-transition probabilities

      | 0   1    0    0  |
  A = | 0  0.8  0.2   0  |
      | 0   0   0.6  0.4 |
      | 0   0    0    0  |

The probability of state sequence X = 1, 2, 2 is

  P(X | M) = π_1 · a_12 · a_22 · η_2 = 1 · 0.2 · 0.6 · 0.4 = 0.048
Summary of Markov Models

State topology diagram: entry probabilities π = {π_i} = [1 0 0]ᵀ and exit probabilities η = {η_i} = [0 0 0.2]ᵀ are combined with the state-transition probabilities in the complete A matrix:

      | 0   1    0    0    0  |
      | 0  0.6  0.3  0.1   0  |
  A = | 0   0   0.9  0.1   0  |
      | 0   0    0   0.8  0.2 |
      | 0   0    0    0    0  |

Probability of a given state sequence X:

  P(X | M) = π_{x_1} · ( ∏_{t=2}^T a_{x_{t-1} x_t} ) · η_{x_T}
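The formula above can be evaluated directly from the complete A matrix, treating row/column 0 as the entry (null) state and the last row/column as the exit state. A minimal sketch (not from the slides; the function name `path_prob` is an illustrative choice):

```python
# Complete transition matrix from the summary slide:
# index 0 = entry (null) state, last index = exit state, states 1..3 in between.
A = [[0, 1.0, 0,   0,   0  ],
     [0, 0.6, 0.3, 0.1, 0  ],
     [0, 0,   0.9, 0.1, 0  ],
     [0, 0,   0,   0.8, 0.2],
     [0, 0,   0,   0,   0  ]]

def path_prob(A, X):
    """P(X | M) = pi_{x1} * prod_t a_{x(t-1) x(t)} * eta_{xT},
    for a 1-based state sequence X."""
    exit_state = len(A) - 1
    p = A[0][X[0]]                 # entry probability pi_{x1}
    for prev, cur in zip(X, X[1:]):
        p *= A[prev][cur]          # transition a_{x(t-1) x(t)}
    return p * A[X[-1]][exit_state]  # exit probability eta_{xT}

print(path_prob(A, [1, 1, 2, 3]))  # pi_1 * a_11 * a_12 * a_23 * eta_3 = 0.0036
```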
Introduction to Hidden Markov Models

Hidden Markov Models (HMMs) use a Markov chain to model stochastic state sequences which emit stochastic observations, e.g., the state topology of an ergodic HMM.

The probability of state i generating a discrete observation o_t, which takes one of a finite set of values k ∈ {1, ..., K}, is

  b_i(k) = P(o_t = k | x_t = i).
Parameters of a discrete HMM, λ

State-transition probabilities,
  A = {π_j, a_ij, η_i} = {P(x_t = j | x_{t-1} = i)},  1 ≤ i, j ≤ N,
where N is the number of states.

Discrete output probabilities,
  B = {b_i(k)} = {P(o_t = k | x_t = i)},  1 ≤ i ≤ N, 1 ≤ k ≤ K,
where K is the number of distinct observations.
HMM Probability Calculation HMM Probability Calculation The joint likelihood of state and observation sequences is P(O, X λ) = P(X λ)p(o X, λ) The state sequence X = 1, 1, 2, 3, 3, 4 produces the observations O = o 1, o 2,..., o 6 : P(X λ) = π 1 a 11 a 12 a 23 a 33 a 34 η 4 P(O X, λ) = b 1 (o 1 )b 1 (o 2 )b 2 (o 3 )b 3 (o 4 )b 3 (o 5 )b 4 (o 6 ) ( T ) P(O, X λ) = π x1 b x1 (o 1 ) t=2 a x t 1 x t b xt (o t ) η xt Michael Siebers (CogSys) ML Hidden Markov Models December 22, 2014 11 / 30
Example: probability of HMM state sequence

Consider the state topology with state-transition matrix

      | 0   1    0    0  |
  A = | 0  0.8  0.2   0  |
      | 0   0   0.6  0.4 |
      | 0   0    0    0  |

and output probabilities

            R    G    B
  B = b_1 | 0.5  0.2  0.3 |
      b_2 |  0   0.9  0.1 |

What is the probability of observations O = R, G, G with state sequence X = 1, 2, 2?

  P(O, X | λ) = P(X | λ) P(O | X, λ)
              = π_1 b_1(o_1) a_12 b_2(o_2) a_22 b_2(o_3) η_2
              = 1 · 0.5 · 0.2 · 0.9 · 0.6 · 0.9 · 0.4 = 0.01944 ≈ 0.02
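The joint likelihood for this two-state model can be checked with a short Python sketch (not from the slides; the internal parameter layout and the name `joint_prob` are illustrative choices):

```python
# Two-state HMM from the slide: pi = [1, 0], eta = [0, 0.4],
# internal transitions a, outputs over symbols R = 0, G = 1, B = 2.
pi  = [1.0, 0.0]
a   = [[0.8, 0.2],
       [0.0, 0.6]]
eta = [0.0, 0.4]
b   = [[0.5, 0.2, 0.3],   # b_1(R), b_1(G), b_1(B)
       [0.0, 0.9, 0.1]]   # b_2(R), b_2(G), b_2(B)

def joint_prob(X, O):
    """P(O, X | lambda) for a 1-based state sequence X and observations O."""
    p = pi[X[0] - 1] * b[X[0] - 1][O[0]]          # pi_{x1} b_{x1}(o_1)
    for t in range(1, len(O)):
        p *= a[X[t - 1] - 1][X[t] - 1] * b[X[t] - 1][O[t]]
    return p * eta[X[-1] - 1]                     # eta_{xT}

R, G, B = 0, 1, 2
print(joint_prob([1, 2, 2], [R, G, G]))  # 0.5*0.2*0.9*0.6*0.9*0.4 = 0.01944
```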
HMM Recognition & Training

Three tasks within the HMM framework:
1. Compute the likelihood of a set of observations with a given model, P(O | λ)
2. Decode a test sequence by calculating the most likely path, X*
3. Optimise pattern templates by training the parameters of the models
Computing P(O | λ)

So far, we calculated the joint probability of observations and state sequence for a given model λ:

  P(O, X | λ) = P(X | λ) P(O | X, λ)

For the total probability of the observations, we marginalise out the state sequence by summing over all possible X:

  P(O | λ) = Σ_{all X} P(O, X | λ)

Alternative notation: writing O = o_1^T = o_1^t · o_{t+1}^T for 1 ≤ t < T,

  P(O | λ) = Σ_{all x_1^T} P(o_1^T, x_1^T | λ)
Computing P(O λ) Computing P(O λ) Now, we define forward likelihood for state j, as α t (j) = P(o t 1, x t = j λ) = P(o t 1, x t 1 λ) x t 1 1,x t =j considering the (unknown) previous state x t 1 this can be reformulated: α t (j) = P(o t 1, x t 1 λ) x t 2 1,x t 1,x t =j = t 1 P(o1, x t 1 1 λ)a xt 1 jb j (o t ) x t 2 1,x t 1 N = P(o t 1 1, x t 1 1 λ)a ij b j (o t ) = i=1 x t 2 1,x t 1 =i N α t 1 (i)a ij b j (o t ) i=1 Michael Siebers (CogSys) ML Hidden Markov Models December 22, 2014 15 / 30
Forward procedure

To calculate the forward likelihood α_t(i) = P(o_1^t, x_t = i | λ):
1. Initialise at t = 1:  α_1(i) = π_i b_i(o_1)  for 1 ≤ i ≤ N
2. Recur for t = 2, 3, ..., T:  α_t(j) = [ Σ_{i=1}^N α_{t-1}(i) a_ij ] b_j(o_t)  for 1 ≤ j ≤ N
3. Finalise:  P(O | λ) = Σ_{i=1}^N α_T(i) η_i

Thus, we can solve Task 1 efficiently by recursion.
Worked example of the forward procedure

  α_1(1) = 1 · 0.5 = 0.5                  α_1(2) = 0 · 0 = 0
  α_2(1) = 0.5 · 0.8 · 0.2 = 0.08         α_2(2) = 0.5 · 0.2 · 0.9 = 0.09
  α_3(1) = 0.08 · 0.8 · 0.2 = 0.0128      α_3(2) = (0.08 · 0.2 + 0.09 · 0.6) · 0.9 = 0.063

  P(O | λ) = 0.063 · 0.4 = 0.0252
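The three steps of the forward procedure reproduce this hand calculation. A minimal Python sketch for the two-state R/G/B model (not part of the slides; plain lists, no libraries):

```python
# Forward procedure for the two-state example HMM (symbols R=0, G=1, B=2).
pi  = [1.0, 0.0]
a   = [[0.8, 0.2],
       [0.0, 0.6]]
eta = [0.0, 0.4]
b   = [[0.5, 0.2, 0.3],
       [0.0, 0.9, 0.1]]
N = 2

def forward(O):
    """Return P(O | lambda) and the table of forward likelihoods alpha_t(i)."""
    alpha = [[pi[i] * b[i][O[0]] for i in range(N)]]          # 1. initialise
    for t in range(1, len(O)):                                # 2. recur
        alpha.append([sum(alpha[-1][i] * a[i][j] for i in range(N)) * b[j][O[t]]
                      for j in range(N)])
    total = sum(alpha[-1][i] * eta[i] for i in range(N))      # 3. finalise
    return total, alpha

R, G = 0, 1
p, alpha = forward([R, G, G])
print(p)  # 0.063 * 0.4 = 0.0252, as in the worked example
```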
Backward procedure

We define the backward likelihood β_t(i) = P(o_{t+1}^T | x_t = i, λ), and calculate:
1. Initialise at t = T:  β_T(i) = η_i  for 1 ≤ i ≤ N
2. Recur for t = T-1, T-2, ..., 1:  β_t(i) = Σ_{j=1}^N a_ij b_j(o_{t+1}) β_{t+1}(j)  for 1 ≤ i ≤ N
3. Finalise:  P(O | λ) = Σ_{i=1}^N π_i b_i(o_1) β_1(i)

This is an equivalent way of computing P(O | λ) recursively.
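A matching backward-procedure sketch for the same two-state model (not from the slides); it should give the same P(O | λ) = 0.0252 as the forward procedure:

```python
# Backward procedure for the two-state example HMM (symbols R=0, G=1, B=2).
pi  = [1.0, 0.0]
a   = [[0.8, 0.2],
       [0.0, 0.6]]
eta = [0.0, 0.4]
b   = [[0.5, 0.2, 0.3],
       [0.0, 0.9, 0.1]]
N = 2

def backward(O):
    """Return P(O | lambda) and the table of backward likelihoods beta_t(i)."""
    beta = [[eta[i] for i in range(N)]]               # 1. initialise at t = T
    for t in range(len(O) - 2, -1, -1):               # 2. recur backwards
        beta.insert(0, [sum(a[i][j] * b[j][O[t + 1]] * beta[0][j]
                            for j in range(N)) for i in range(N)])
    total = sum(pi[i] * b[i][O[0]] * beta[0][i] for i in range(N))  # 3. finalise
    return total, beta

R, G = 0, 1
p, beta = backward([R, G, G])
print(p)  # 0.0252, matching the forward procedure
```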
Finding the Best Path

Given observations O = o_1, ..., o_T, find the HMM state sequence X* = x_1*, ..., x_T* that has the greatest likelihood:

  X* = arg max_X P(O, X | λ),

where

  P(O, X | λ) = P(O | X, λ) P(X | λ) = π_{x_1} b_{x_1}(o_1) ( ∏_{t=2}^T a_{x_{t-1} x_t} b_{x_t}(o_t) ) η_{x_T}

The Viterbi algorithm is an inductive method to find the optimal state sequence X* efficiently, similar to the forward procedure. It computes the maximum cumulative likelihood δ_t(i) up to the current time t for each state i:

  δ_t(i) = max_{x_1^{t-1}, x_t = i} P(o_1^t, x_1^t | λ)
Viterbi algorithm

To compute the maximum cumulative likelihood δ_t(i):
1. Initialise at t = 1:  δ_1(i) = π_i b_i(o_1),  ψ_1(i) = 0  for 1 ≤ i ≤ N
2. Recur for t = 2, 3, ..., T:  δ_t(j) = max_i [δ_{t-1}(i) a_ij] b_j(o_t),  ψ_t(j) = arg max_i [δ_{t-1}(i) a_ij]  for 1 ≤ j ≤ N
3. Finalise:  P(O, X* | λ) = max_i [δ_T(i) η_i],  x_T* = arg max_i [δ_T(i) η_i]
4. Trace back, for t = T, T-1, ..., 2:  x_{t-1}* = ψ_t(x_t*),  giving X* = {x_1*, x_2*, ..., x_T*}
Illustration of the Viterbi algorithm

1. Initialise:  δ_1(i) = π_i b_i(o_1),  ψ_1(i) = 0
2. Recur for t = 2:  δ_2(j) = max_i [δ_1(i) a_ij] b_j(o_2),  ψ_2(j) = arg max_i [δ_1(i) a_ij]
   Recur for t = 3:  δ_3(j) = max_i [δ_2(i) a_ij] b_j(o_3),  ψ_3(j) = arg max_i [δ_2(i) a_ij]
3. Finalise:  P(O, X* | λ) = max_i [δ_3(i) η_i],  x_3* = arg max_i [δ_3(i) η_i]
4. Trace back for t = 3, 2:  x_2* = ψ_3(x_3*),  x_1* = ψ_2(x_2*),  X* = {x_1*, x_2*, x_3*}
Worked example of the Viterbi algorithm

  δ_1(1) = 1 · 0.5 = 0.5                               ψ_1(1) = 0
  δ_1(2) = 0 · 0 = 0                                   ψ_1(2) = 0
  δ_2(1) = 0.5 · 0.8 · 0.2 = 0.08                      ψ_2(1) = 1
  δ_2(2) = 0.5 · 0.2 · 0.9 = 0.09                      ψ_2(2) = 1
  δ_3(1) = 0.08 · 0.8 · 0.2 = 0.0128                   ψ_3(1) = 1
  δ_3(2) = max(0.08 · 0.2, 0.09 · 0.6) · 0.9 = 0.0486  ψ_3(2) = 2

  P(O, X* | λ) = 0.0486 · 0.4 = 0.01944
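The four Viterbi steps can be sketched in Python for the same two-state model (not part of the slides; the list-based layout is an illustrative choice). It recovers the worked numbers and the path X* = 1, 2, 2:

```python
# Viterbi decoding for the two-state example HMM (symbols R=0, G=1, B=2).
pi  = [1.0, 0.0]
a   = [[0.8, 0.2],
       [0.0, 0.6]]
eta = [0.0, 0.4]
b   = [[0.5, 0.2, 0.3],
       [0.0, 0.9, 0.1]]
N = 2

def viterbi(O):
    """Return (P(O, X* | lambda), X*) with 1-based states."""
    delta = [pi[i] * b[i][O[0]] for i in range(N)]   # 1. initialise
    psi = []
    for t in range(1, len(O)):                       # 2. recur
        step, back = [], []
        for j in range(N):
            best = max(range(N), key=lambda i: delta[i] * a[i][j])
            back.append(best)
            step.append(delta[best] * a[best][j] * b[j][O[t]])
        psi.append(back)
        delta = step
    last = max(range(N), key=lambda i: delta[i] * eta[i])   # 3. finalise
    p = delta[last] * eta[last]
    path = [last]
    for back in reversed(psi):                       # 4. trace back
        path.insert(0, back[path[0]])
    return p, [s + 1 for s in path]

R, G = 0, 1
p, path = viterbi([R, G, G])
print(path)  # [1, 2, 2], with P(O, X* | lambda) = 0.0486 * 0.4 = 0.01944
```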
Maximum likelihood training

In general, we want to find the value of some model parameter c that is most likely to give our set of training data O_train. This maximum likelihood (ML) estimate ĉ is obtained by setting the derivative of P(O_train | c) w.r.t. c equal to zero, which is equivalent to:

  ∂ ln P(O_train | ĉ) / ∂c = 0

Solving this likelihood equation tells us how to optimise the model parameters in training.
Re-estimating the parameters of the model λ

As a preliminary approach, let us use the optimal path X* computed by the Viterbi algorithm with some initial model parameters λ = {A, B}. In so doing, we approximate the total likelihood of the observations:

  P(O | λ) = Σ_{all X} P(O, X | λ) ≈ P(O, X* | λ)

Using X*, we make a hard binary decision about the state occupation, q_t(i) ∈ {0, 1}, and train the parameters of our model accordingly.
Viterbi training (hard state assignment)

Model parameters can be re-estimated using the Viterbi alignment to assign observations to states (a.k.a. segmental k-means training).

State-transition probabilities, for 1 ≤ i, j ≤ N:

  â_ij = Σ_{t=2}^T q_{t-1}(i) q_t(j) / Σ_{t=1}^T q_t(i)

where the state indicator q_t(i) = 1 if i = x_t*, and 0 otherwise.

Discrete output probabilities, for 1 ≤ j ≤ N and 1 ≤ k ≤ K:

  b̂_j(k) = Σ_{t=1}^T q_t(j) ω_t(k) / Σ_{t=1}^T q_t(j)

where the event indicator ω_t(k) = 1 if k = o_t, and 0 otherwise.
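Because q_t(i) and ω_t(k) are 0/1 indicators, these re-estimates reduce to counting along the alignment. A sketch implementing the slide's formulas as counts (not from the slides; the name `viterbi_reestimate` and the 1-based indexing convention are illustrative choices), applied to the alignment X* = 1, 2, 2 with O = R, G, G:

```python
def viterbi_reestimate(X, O, N, K):
    """Hard-assignment re-estimation from a Viterbi alignment X (1-based
    states) and observations O (0-based symbols), per the counting formulas."""
    a_num = [[0.0] * (N + 1) for _ in range(N + 1)]
    b_num = [[0.0] * K for _ in range(N + 1)]
    occ   = [0.0] * (N + 1)                       # sum_t q_t(i)
    for t, (s, o) in enumerate(zip(X, O)):
        occ[s] += 1
        b_num[s][o] += 1                          # sum_t q_t(j) w_t(k)
        if t > 0:
            a_num[X[t - 1]][s] += 1               # sum_t q_{t-1}(i) q_t(j)
    a_hat = [[a_num[i][j] / occ[i] if occ[i] else 0.0 for j in range(N + 1)]
             for i in range(N + 1)]
    b_hat = [[b_num[j][k] / occ[j] if occ[j] else 0.0 for k in range(K)]
             for j in range(N + 1)]
    return a_hat, b_hat

a_hat, b_hat = viterbi_reestimate([1, 2, 2], [0, 1, 1], N=2, K=3)
print(a_hat[1][2], a_hat[2][2])  # 1.0 0.5
print(b_hat[1][0], b_hat[2][1])  # 1.0 1.0
```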
Maximum likelihood training by EM: Baum-Welch re-estimation (occupation)

Yet, the hidden state occupation is not known with absolute certainty. So, the expectation maximisation (EM) method optimises the model parameters based on a soft assignment of observations to states via the occupation likelihood

  γ_t(i) = P(x_t = i | O, λ),

relaxing the hard-assignment assumption made two slides before.

We can rearrange, using Bayes' theorem, to obtain

  γ_t(i) = P(O, x_t = i | λ) / P(O | λ) = α_t(i) β_t(i) / P(O | λ)

where α_t, β_t and P(O | λ) are computed by the forward and backward procedures.
Baum-Welch re-estimation (transition)

Similarly, we also define a transition likelihood,

  ξ_t(i, j) = P(x_{t-1} = i, x_t = j | O, λ)
            = P(O, x_{t-1} = i, x_t = j | λ) / P(O | λ)
            = α_{t-1}(i) a_ij b_j(o_t) β_t(j) / P(O | λ)

Trellis depiction of (left) occupation and (right) transition likelihoods.
Baum-Welch training (soft state assignment)

State-transition probabilities, for 1 ≤ i, j ≤ N:

  â_ij = Σ_{t=2}^T ξ_t(i, j) / Σ_{t=1}^T γ_t(i)

Discrete output probabilities, for 1 ≤ j ≤ N and 1 ≤ k ≤ K:

  b̂_j(k) = Σ_{t=1}^T γ_t(j) ω_t(k) / Σ_{t=1}^T γ_t(j)

Re-estimation increases the likelihood of the training data under the new model λ̂,

  P(O_train | λ̂) ≥ P(O_train | λ)

although it does not guarantee a global maximum.
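The occupation and transition likelihoods feeding these updates come straight from the forward and backward tables. A sketch for the running two-state example (not from the slides; plain lists, no libraries), checking that γ_t sums to 1 over states at every t, as a distribution over the hidden state must:

```python
# Soft occupation/transition likelihoods for the two-state example HMM.
pi  = [1.0, 0.0]
a   = [[0.8, 0.2],
       [0.0, 0.6]]
eta = [0.0, 0.4]
b   = [[0.5, 0.2, 0.3],
       [0.0, 0.9, 0.1]]
N = 2

def forward(O):
    alpha = [[pi[i] * b[i][O[0]] for i in range(N)]]
    for t in range(1, len(O)):
        alpha.append([sum(alpha[-1][i] * a[i][j] for i in range(N)) * b[j][O[t]]
                      for j in range(N)])
    return alpha

def backward(O):
    beta = [[eta[i] for i in range(N)]]
    for t in range(len(O) - 2, -1, -1):
        beta.insert(0, [sum(a[i][j] * b[j][O[t + 1]] * beta[0][j]
                            for j in range(N)) for i in range(N)])
    return beta

O = [0, 1, 1]                                  # R, G, G
alpha, beta = forward(O), backward(O)
pO = sum(alpha[-1][i] * eta[i] for i in range(N))

# gamma_t(i) = alpha_t(i) beta_t(i) / P(O | lambda)
gamma = [[alpha[t][i] * beta[t][i] / pO for i in range(N)] for t in range(len(O))]
# xi_t(i, j) = alpha_{t-1}(i) a_ij b_j(o_t) beta_t(j) / P(O | lambda)
xi = [[[alpha[t - 1][i] * a[i][j] * b[j][O[t]] * beta[t][j] / pO
        for j in range(N)] for i in range(N)] for t in range(1, len(O))]

print([round(sum(g), 6) for g in gamma])  # [1.0, 1.0, 1.0]
```

The Baum-Welch updates â_ij and b̂_j(k) are then the sums over `xi` and `gamma` from the slide, in place of the hard counts used in Viterbi training.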
Use of HMMs with training and test data

(a) Training  (b) Recognition
Isolated word training and recognition (Young et al., 2009)
Learning Terminology

Hidden Markov Models
  Learning Strategy: learning from examples
  Approaches: supervised learning (concept / classification) vs. unsupervised learning vs. policy learning; symbolic vs. statistical / neuronal network; inductive vs. analytical
  Data: sequences of discrete observations
  Target Values: probabilities for classes