Hidden Markov Models
Dr. Naomi Harte
The Talk
- Hidden Markov Models: what are they? Why are they useful?
- The maths part: probability calculations
- Training: optimising parameters
- Viterbi: unseen sequences
- Real systems
Background
- Discrete Markov process
- The system can be in any of N states, S_1 … S_N
- The state changes at each time instant: t_1, t_2, t_3, etc.
- The actual state at time t is q_t
- For a first-order Markov process, P(q_t = S_j | q_{t-1} = S_i, q_{t-2} = S_k, …) simplifies to P(q_t = S_j | q_{t-1} = S_i)
Background
- P(q_t = S_j | q_{t-1} = S_i) is independent of time
- State transition probabilities: a_{ij} = P(q_t = S_j | q_{t-1} = S_i), with i, j = 1..N
- a_{ij} ≥ 0 and Σ_{j=1}^{N} a_{ij} = 1 (see the sketch below)
- This is an observable Markov model
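A quick numerical check of those two constraints on A. The 2-state values are made up for illustration; only the constraints themselves come from the slide:

```python
import numpy as np

# Hypothetical 2-state transition matrix; a_ij = P(q_t = S_j | q_t-1 = S_i).
A = np.array([[0.9, 0.1],
              [0.3, 0.7]])

assert (A >= 0).all()                   # a_ij >= 0
assert np.allclose(A.sum(axis=1), 1.0)  # sum over j of a_ij = 1, for each i
```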
Example (from Rabiner)
Hidden Markov Model
- So far, each state corresponds to an observable event; this is restrictive
- In an HMM, the observation is a probabilistic function of the state
- Hidden states, observable outputs [O_1, O_2, O_3, …, O_T]
Jack Ferguson's Urn and Ball Model (Rabiner)
- N urns, M colours
- URN 1: P(Red) = b_1(1), P(Blue) = b_1(2), P(Green) = b_1(3), …, P(Pink) = b_1(M)
- URN 2: P(Red) = b_2(1), P(Blue) = b_2(2), P(Green) = b_2(3), …, P(Pink) = b_2(M)
- URN N: P(Red) = b_N(1), P(Blue) = b_N(2), P(Green) = b_N(3), …, P(Pink) = b_N(M)
- Example output: O = {Red, Green, Green, Pink, Orange, Blue, Orange, Yellow}
Ball and Urn
- The simplest HMM
- The state is the urn
- Colour probabilities are defined for each state (urn)
- The state transition matrix governs the choice of urn
HMM elements
- N: the number of states
- A: the state transition probabilities, a_{ij} = P(q_t = S_j | q_{t-1} = S_i)
- B: the observation probabilities in state j, b_j(O_t) = P(O_t | q_t = S_j)
  - Discrete: O_t is one of v_k, k = 1..M
  - Continuous: Gaussian mixture
- π: the initial state distribution, π_i = P(q_1 = S_i)
- The model is λ = (A, B, π) (see the sketch below)
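As a concrete illustration of λ = (A, B, π), here is a minimal sketch of the urn-and-ball model in Python. All the numbers (3 urns, 3 colours, the probabilities) are made up; only the structure comes from the slides:

```python
import numpy as np

N, M = 3, 3                      # N urns (states), M colours (symbols)

A = np.array([[0.6, 0.3, 0.1],   # a_ij = P(q_t = S_j | q_t-1 = S_i)
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])

B = np.array([[0.7, 0.2, 0.1],   # b_j(k) = P(colour k | urn j); row j is urn j
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])

pi = np.array([0.5, 0.3, 0.2])   # pi_i = P(q_1 = S_i)

def sample(T, seed=0):
    """Generate a length-T observation sequence from lambda = (A, B, pi)."""
    rng = np.random.default_rng(seed)
    q = rng.choice(N, p=pi)                  # draw the initial urn
    obs = []
    for _ in range(T):
        obs.append(rng.choice(M, p=B[q]))    # draw a ball colour from urn q
        q = rng.choice(N, p=A[q])            # move to the next urn
    return obs

print(sample(8))   # e.g. [0, 2, 1, ...]: a sequence of colour indices
```

The later sketches reuse these conventions: states and symbols are integer indices, and the rows of A and B correspond to states.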
What are HMMs useful for?
- Modelling temporally evolving events that have a reproducible pattern, some reasonable level of variation, and features measurable at intervals
- Well structured? Left-to-right HMM
- More random? Fully connected (ergodic) HMM
- Applications in BOTH audio and video
HMM applications
- Need labelled training data!! This is the usual reason NOT to use HMMs
- Speech and audio-visual applications: research databases are labelled/transcribed
What might an HMM model?
- A sequence of events, with features sampled at intervals
- In speech recognition: a word, a phoneme, a syllable
- In speech analysis for home monitoring: normal speech, emotionally distressed speech, slurred speech
- In music, to transcribe scores: a violin, a piano, a trumpet, a mixture of instruments
- In sports video, to automatically extract highlights: a tennis serve, volley, rally, passing shot, etc.; in snooker: pot black, pot colour, pot red, foul
- In cell biology video, to flag specific events: nothing happening, fluorescence, cells growing, cells shrinking, cell death or division
Observations
- What is this observation sequence O = [O_1, O_2, O_3, …, O_T]?
- Pertinent features or measures, taken at regular time intervals, that compactly describe the events of interest
- Spectral features, pitch, speaking rate in speech
- Colour, shape, motion in video
Example
- Take the DCT of the log spectrum on 20 ms windows with 50% overlap
- Each observation O_t is a vector of cepstral coefficients c_1 … c_12, giving the sequence O_1, O_2, O_3, …, O_T (see the sketch below)
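A sketch of that feature extraction. The 20 ms window and 50% overlap come from the slide; the Hamming window, FFT-based log spectrum, and keeping exactly c_1..c_12 are assumptions:

```python
import numpy as np
from scipy.fftpack import dct

def cepstral_features(x, fs, n_coef=12):
    """DCT of the log magnitude spectrum on 20 ms windows, 50% overlap."""
    win = int(0.020 * fs)                        # 20 ms window
    hop = win // 2                               # 50% overlap
    frames = []
    for start in range(0, len(x) - win + 1, hop):
        frame = x[start:start + win] * np.hamming(win)
        spec = np.abs(np.fft.rfft(frame)) + 1e-10    # magnitude spectrum
        c = dct(np.log(spec), norm='ortho')          # cepstral coefficients
        frames.append(c[1:n_coef + 1])               # keep c_1 .. c_12
    return np.array(frames)                      # shape (T, 12): O_1 .. O_T

fs = 16000
x = np.random.randn(fs)            # 1 s of noise as a stand-in signal
O = cepstral_features(x, fs)
print(O.shape)                     # (T, 12)
```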
HMM problem 1
- Given O = [O_1, O_2, …, O_T] and model λ, how do we efficiently compute P(O | λ)?
- Evaluation: which model gives the best score?
- Forward-backward procedure
HMM problem 2
- Given O = [O_1, O_2, …, O_T] and model λ, how do we choose a state sequence Q = [q_1, q_2, …, q_T] that is optimal?
- Uncovers the hidden part
- There is no single "correct" sequence
- Viterbi algorithm
HMM problem 3
- How do we adjust the parameters of model λ = (A, B, π) to maximise P(O | λ)?
- Training: adapt the parameters to the observed training data
- Use Baum-Welch, an iterative solution (expectation maximisation)
Notation
- Follows the Rabiner tutorial
Back to Problem 1
- Given O = [O_1, O_2, …, O_T] and model λ, how do we efficiently compute P(O | λ)?
- Consider ALL possible state sequences
- Say one particular sequence is Q = [q_1, q_2, …, q_T]
- Probability of O given Q and λ:
  P(O | Q, λ) = ∏_{t=1}^{T} b_{q_t}(O_t) = b_{q_1}(O_1) b_{q_2}(O_2) ⋯ b_{q_T}(O_T)
Observation probability ctd.
- Probability of the state sequence:
  P(Q | λ) = π_{q_1} a_{q_1 q_2} a_{q_2 q_3} ⋯ a_{q_{T-1} q_T}
- JOINT probability of O and Q:
  P(O, Q | λ) = P(O | Q, λ) P(Q | λ)
- Probability of O over ALL possible Q:
  P(O | λ) = Σ_{all Q} P(O | Q, λ) P(Q | λ)
- Gets crazy as N and T increase!! (see the sketch below)
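To see why, here is the direct calculation written out: a sum over every one of the N^T state sequences. A sketch only, reusing the discrete (A, B, pi) conventions from the urn-and-ball sketch; it is feasible only for tiny N and T:

```python
import itertools
import numpy as np

def p_obs_brute_force(O, A, B, pi):
    """P(O | lambda) by enumerating ALL N^T state sequences Q."""
    N, T = len(pi), len(O)
    total = 0.0
    for Q in itertools.product(range(N), repeat=T):   # all N^T sequences
        p = pi[Q[0]] * B[Q[0], O[0]]                  # pi_q1 * b_q1(O_1)
        for t in range(1, T):
            p *= A[Q[t-1], Q[t]] * B[Q[t], O[t]]      # a_{q_t-1,q_t} b_qt(O_t)
        total += p
    return total
```

With N = 5 and T = 100 this is roughly 10^70 sequences, which is the point of the slide.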
Forward-Backward Procedure
- Be smart! We only have N states
- So any state at time t+1 can only be reached from the N states at time t
- Reuses calculations
Forward variable (Rabiner)
- α_t(i) = P(O_1 O_2 ⋯ O_t, q_t = S_i | λ)
- Initialisation: α_1(i) = π_i b_i(O_1)
- Induction: α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_{ij}] b_j(O_{t+1})
- Termination: P(O | λ) = Σ_{i=1}^{N} α_T(i) (see the sketch below)
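A minimal sketch of that recursion, again with the discrete (A, B, pi) conventions from earlier. There is no scaling, so it will underflow for long sequences; the Rabiner tutorial covers the scaled version:

```python
import numpy as np

def forward(O, A, B, pi):
    """Forward procedure: P(O | lambda) in O(N^2 T) operations."""
    N, T = len(pi), len(O)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                 # alpha_1(i) = pi_i b_i(O_1)
    for t in range(1, T):
        # alpha_t+1(j) = [sum_i alpha_t(i) a_ij] * b_j(O_t+1)
        alpha[t] = (alpha[t-1] @ A) * B[:, O[t]]
    return alpha[-1].sum(), alpha              # P(O|lambda) = sum_i alpha_T(i)
```

For small N and T this agrees with p_obs_brute_force above, but it costs on the order of N²·T operations instead of 2T·N^T.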
Exercise
- What is the corresponding backward variable?
- β_t(i): the probability of the partial observation sequence from t+1 to the end, given state i at time t and model λ
- Answer in the Rabiner paper!
Observation Probability
- Observation probability in state j: b_j(O_t) = P(O_t | q_t = S_j)
- Discrete: O_t is one of v_k, k = 1..M
- Continuous: a multivariate Gaussian mixture density is most common
- Are the features independent? (1st years: how does this affect the pdf?)
What if the features are not independent?
- Use full-covariance HMMs: slow, and they need more training data
- Or decorrelate the features: PCA, LDA, DCT (see the sketch below)
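A sketch of the decorrelation route using PCA. The motivation: a full covariance needs d(d+1)/2 parameters per Gaussian, while a diagonal one needs only d variances, so rotating the features into a basis where they are (approximately) uncorrelated lets diagonal models work. The details here are illustrative:

```python
import numpy as np

def pca_decorrelate(X):
    """Rotate features X (T x d) into the eigenbasis of their covariance."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)          # full d x d covariance
    _, eigvecs = np.linalg.eigh(cov)        # orthogonal eigenbasis
    return Xc @ eigvecs                     # cov of result is (near) diagonal

X = np.random.randn(1000, 2) @ np.random.randn(2, 12)   # correlated features
Y = pca_decorrelate(X)
off_diag = np.cov(Y, rowvar=False) - np.diag(np.var(Y, axis=0, ddof=1))
print(np.abs(off_diag).max())   # ~0: the off-diagonal covariance is gone
```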
Problem 2
- Given O = [O_1, O_2, …, O_T] and model λ, how do we choose a state sequence Q = [q_1, q_2, …, q_T] that is optimal?
- Well explained in the Rabiner paper
- Single best state sequence Q
- Best score along a single path at time t, accounting for the first t observations and ending in state i:
  δ_t(i) = max_{q_1, q_2, …, q_{t-1}} P[q_1 q_2 ⋯ q_t = i, O_1 O_2 ⋯ O_t | λ]
Viterbi Trellis
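A sketch of the Viterbi recursion over that trellis, in log probabilities to avoid underflow. It assumes no zero entries in A, B, or π (log of zero); real systems add a floor:

```python
import numpy as np

def viterbi(O, A, B, pi):
    """Single best state sequence Q* and its log score."""
    N, T = len(pi), len(O)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, N))             # delta_t(i): best log score ending in i
    psi = np.zeros((T, N), dtype=int)    # psi_t(i): backpointer (argmax)
    delta[0] = logpi + logB[:, O[0]]
    for t in range(1, T):
        # scores[i, j]: best path into state i, then transition i -> j
        scores = delta[t-1][:, None] + logA
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, O[t]]
    q = [delta[-1].argmax()]             # best final state
    for t in range(T - 1, 0, -1):        # backtrack through the trellis
        q.append(psi[t][q[-1]])
    return q[::-1], delta[-1].max()
```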
Back to Problem 3
- How do we adjust the parameters of model λ = (A, B, π) to maximise P(O | λ)?
- Training of models
- Baum-Welch: an implementation of the EM algorithm (tutorial from David)
- Start with a good estimate: clustering with k-means (see the sketch below)
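For concreteness, here is one Baum-Welch re-estimation pass for the discrete case, written against the forward() sketch from earlier. This is a minimal illustration of the EM update (single observation sequence, no scaling, no convergence test), not a production trainer:

```python
import numpy as np

def backward(O, A, B):
    """beta_t(i) = P(O_t+1 .. O_T | q_t = S_i, lambda)."""
    N, T = A.shape[0], len(O)
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(O, A, B, pi):
    """One EM re-estimation of (A, B, pi) from a single sequence O."""
    O = np.asarray(O)
    _, alpha = forward(O, A, B, pi)        # forward() sketch from earlier
    beta = backward(O, A, B)
    gamma = alpha * beta                   # gamma_t(i) ~ P(q_t = S_i | O)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # xi_t(i,j) ~ alpha_t(i) a_ij b_j(O_t+1) beta_t+1(j)
    xi = (alpha[:-1, :, None] * A[None] *
          (B[:, O[1:]].T * beta[1:])[:, None, :])
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.stack([gamma[O == k].sum(axis=0) for k in range(B.shape[1])],
                     axis=1) / gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi
```

Each pass is guaranteed not to decrease P(O | λ); iterate until the score converges.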
Training strategies
- Choice of the number of states
- Controlling transitions: fully connected, or left-right HMM
- Gradually increasing the number of mixtures per state
More Information
- HTK, the Hidden Markov Model Toolkit from Cambridge University: htk.eng.cam.ac.uk
- Rabiner, L. R., "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, Feb. 1989
- Speech recognition books