Hidden Markov Models
Dr. Naomi Harte
The Talk
- Hidden Markov Models: what are they? Why are they useful?
- The maths part: probability calculations
- Training: optimising parameters
- Viterbi: unseen sequences
- Real systems
Background
- Discrete Markov process
- The system can be in any of N states, S_1 … S_N
- The state changes at each time instant: t_1, t_2, t_3, etc.
- The actual state at time t is q_t
- For a first-order Markov process, P(q_t = S_j | q_{t-1} = S_i, q_{t-2} = S_k, …) simplifies to P(q_t = S_j | q_{t-1} = S_i)
Background
- P(q_t = S_j | q_{t-1} = S_i) is independent of time
- State transition probabilities: a_{ij} = P(q_t = S_j | q_{t-1} = S_i), with i, j = 1..N
- a_{ij} ≥ 0 and Σ_{j=1}^{N} a_{ij} = 1 (see the sketch below)
- This is an observable Markov model
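A quick numerical check of those two constraints on A. The 2-state values are made up for illustration; only the constraints themselves come from the slide:

```python
import numpy as np

# Hypothetical 2-state transition matrix; a_ij = P(q_t = S_j | q_t-1 = S_i).
A = np.array([[0.9, 0.1],
              [0.3, 0.7]])

assert (A >= 0).all()                   # a_ij >= 0
assert np.allclose(A.sum(axis=1), 1.0)  # sum over j of a_ij = 1, for each i
```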
Example (from Rabiner)
Hidden Markov Model
- So far, each state corresponds to an observable event; this is restrictive
- In an HMM, the observation is a probabilistic function of the state
- Hidden states, observable outputs [O_1, O_2, O_3, …, O_T]
Jack Ferguson's Urn and Ball Model (Rabiner)
- N urns, M colours
- URN 1: P(Red) = b_1(1), P(Blue) = b_1(2), P(Green) = b_1(3), …, P(Pink) = b_1(M)
- URN 2: P(Red) = b_2(1), P(Blue) = b_2(2), P(Green) = b_2(3), …, P(Pink) = b_2(M)
- URN N: P(Red) = b_N(1), P(Blue) = b_N(2), P(Green) = b_N(3), …, P(Pink) = b_N(M)
- Example output: O = {Red, Green, Green, Pink, Orange, Blue, Orange, Yellow}
Ball and Urn
- The simplest HMM
- The state is the urn
- Colour probabilities are defined for each state (urn)
- The state transition matrix governs the choice of urn
HMM elements
- N: the number of states
- A: the state transition probabilities, a_{ij} = P(q_t = S_j | q_{t-1} = S_i)
- B: the observation probabilities in state j, b_j(O_t) = P(O_t | q_t = S_j)
  - Discrete: O_t is one of v_k, k = 1..M
  - Continuous: Gaussian mixture
- π: the initial state distribution, π_i = P(q_1 = S_i)
- The model is λ = (A, B, π) (see the sketch below)
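As a concrete illustration of λ = (A, B, π), here is a minimal sketch of the urn-and-ball model in Python. All the numbers (3 urns, 3 colours, the probabilities) are made up; only the structure comes from the slides:

```python
import numpy as np

N, M = 3, 3                      # N urns (states), M colours (symbols)

A = np.array([[0.6, 0.3, 0.1],   # a_ij = P(q_t = S_j | q_t-1 = S_i)
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])

B = np.array([[0.7, 0.2, 0.1],   # b_j(k) = P(colour k | urn j); row j is urn j
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])

pi = np.array([0.5, 0.3, 0.2])   # pi_i = P(q_1 = S_i)

def sample(T, seed=0):
    """Generate a length-T observation sequence from lambda = (A, B, pi)."""
    rng = np.random.default_rng(seed)
    q = rng.choice(N, p=pi)                  # draw the initial urn
    obs = []
    for _ in range(T):
        obs.append(rng.choice(M, p=B[q]))    # draw a ball colour from urn q
        q = rng.choice(N, p=A[q])            # move to the next urn
    return obs

print(sample(8))   # e.g. [0, 2, 1, ...]: a sequence of colour indices
```

The later sketches reuse these conventions: states and symbols are integer indices, and the rows of A and B correspond to states.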
What are HMMs useful for?
- Modelling temporally evolving events that have a reproducible pattern, some reasonable level of variation, and features measurable at intervals
- Well structured? Left-to-right HMM
- More random? Fully connected (ergodic) HMM
- Applications in BOTH audio and video
HMM applications
- Need labelled training data!! This is the usual reason NOT to use HMMs
- Speech and audio-visual applications: research databases are labelled/transcribed
What might an HMM model?
- A sequence of events, with features sampled at intervals
- In speech recognition: a word, a phoneme, a syllable
- In speech analysis for home monitoring: normal speech, emotionally distressed speech, slurred speech
- In music, to transcribe scores: a violin, a piano, a trumpet, a mixture of instruments
- In sports video, to automatically extract highlights: a tennis serve, volley, rally, passing shot, etc.; in snooker: pot black, pot colour, pot red, foul
- In cell biology video, to flag specific events: nothing happening, fluorescence, cells growing, cells shrinking, cell death or division
Observations
- What is this observation sequence O = [O_1, O_2, O_3, …, O_T]?
- Pertinent features or measures, taken at regular time intervals, that compactly describe the events of interest
- Spectral features, pitch, speaking rate in speech
- Colour, shape, motion in video
Example
- Take the DCT of the log spectrum on 20 ms windows with 50% overlap
- Each observation O_t is a vector of cepstral coefficients c_1 … c_12, giving the sequence O_1, O_2, O_3, …, O_T (see the sketch below)
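A sketch of that feature extraction. The 20 ms window and 50% overlap come from the slide; the Hamming window, FFT-based log spectrum, and keeping exactly c_1..c_12 are assumptions:

```python
import numpy as np
from scipy.fftpack import dct

def cepstral_features(x, fs, n_coef=12):
    """DCT of the log magnitude spectrum on 20 ms windows, 50% overlap."""
    win = int(0.020 * fs)                        # 20 ms window
    hop = win // 2                               # 50% overlap
    frames = []
    for start in range(0, len(x) - win + 1, hop):
        frame = x[start:start + win] * np.hamming(win)
        spec = np.abs(np.fft.rfft(frame)) + 1e-10    # magnitude spectrum
        c = dct(np.log(spec), norm='ortho')          # cepstral coefficients
        frames.append(c[1:n_coef + 1])               # keep c_1 .. c_12
    return np.array(frames)                      # shape (T, 12): O_1 .. O_T

fs = 16000
x = np.random.randn(fs)            # 1 s of noise as a stand-in signal
O = cepstral_features(x, fs)
print(O.shape)                     # (T, 12)
```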
HMM problem 1
- Given O = [O_1, O_2, …, O_T] and model λ, how do we efficiently compute P(O | λ)?
- Evaluation: which model gives the best score?
- Forward-backward procedure
HMM problem 2
- Given O = [O_1, O_2, …, O_T] and model λ, how do we choose a state sequence Q = [q_1, q_2, …, q_T] that is optimal?
- Uncovers the hidden part
- There is no single "correct" sequence
- Viterbi algorithm
HMM problem 3
- How do we adjust the parameters of model λ = (A, B, π) to maximise P(O | λ)?
- Training: adapt the parameters to the observed training data
- Use Baum-Welch, an iterative solution (expectation maximisation)
Notation
- Follows the Rabiner tutorial
Back to Problem 1
- Given O = [O_1, O_2, …, O_T] and model λ, how do we efficiently compute P(O | λ)?
- Consider ALL possible state sequences
- Say one particular sequence is Q = [q_1, q_2, …, q_T]
- Probability of O given Q and λ:
  P(O | Q, λ) = ∏_{t=1}^{T} b_{q_t}(O_t) = b_{q_1}(O_1) b_{q_2}(O_2) ⋯ b_{q_T}(O_T)
Observation probability ctd.
- Probability of the state sequence:
  P(Q | λ) = π_{q_1} a_{q_1 q_2} a_{q_2 q_3} ⋯ a_{q_{T-1} q_T}
- JOINT probability of O and Q:
  P(O, Q | λ) = P(O | Q, λ) P(Q | λ)
- Probability of O over ALL possible Q:
  P(O | λ) = Σ_{all Q} P(O | Q, λ) P(Q | λ)
- Gets crazy as N and T increase!! (see the sketch below)
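To see why, here is the direct calculation written out: a sum over every one of the N^T state sequences. A sketch only, reusing the discrete (A, B, pi) conventions from the urn-and-ball sketch; it is feasible only for tiny N and T:

```python
import itertools
import numpy as np

def p_obs_brute_force(O, A, B, pi):
    """P(O | lambda) by enumerating ALL N^T state sequences Q."""
    N, T = len(pi), len(O)
    total = 0.0
    for Q in itertools.product(range(N), repeat=T):   # all N^T sequences
        p = pi[Q[0]] * B[Q[0], O[0]]                  # pi_q1 * b_q1(O_1)
        for t in range(1, T):
            p *= A[Q[t-1], Q[t]] * B[Q[t], O[t]]      # a_{q_t-1,q_t} b_qt(O_t)
        total += p
    return total
```

With N = 5 and T = 100 this is roughly 10^70 sequences, which is the point of the slide.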
Forward-Backward Procedure
- Be smart! We only have N states
- So any state at time t+1 can only be reached from the N states at time t
- Reuses calculations
Forward variable (Rabiner)
- α_t(i) = P(O_1 O_2 ⋯ O_t, q_t = S_i | λ)
- Initialisation: α_1(i) = π_i b_i(O_1)
- Induction: α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_{ij}] b_j(O_{t+1})
- Termination: P(O | λ) = Σ_{i=1}^{N} α_T(i) (see the sketch below)
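A minimal sketch of that recursion, again with the discrete (A, B, pi) conventions from earlier. There is no scaling, so it will underflow for long sequences; the Rabiner tutorial covers the scaled version:

```python
import numpy as np

def forward(O, A, B, pi):
    """Forward procedure: P(O | lambda) in O(N^2 T) operations."""
    N, T = len(pi), len(O)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                 # alpha_1(i) = pi_i b_i(O_1)
    for t in range(1, T):
        # alpha_t+1(j) = [sum_i alpha_t(i) a_ij] * b_j(O_t+1)
        alpha[t] = (alpha[t-1] @ A) * B[:, O[t]]
    return alpha[-1].sum(), alpha              # P(O|lambda) = sum_i alpha_T(i)
```

For small N and T this agrees with p_obs_brute_force above, but it costs on the order of N²·T operations instead of 2T·N^T.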
Exercise
- What is the corresponding backward variable?
- β_t(i): the probability of the partial observation sequence from t+1 to the end, given state i at time t and model λ
- Answer in the Rabiner paper!
Observation Probability
- Observation probability in state j: b_j(O_t) = P(O_t | q_t = S_j)
- Discrete: O_t is one of v_k, k = 1..M
- Continuous: a multivariate Gaussian mixture density is most common
- Are the features independent? (1st years: how does this affect the pdf?)
What if the features are not independent?
- Use full-covariance HMMs: slow, and they need more training data
- Or decorrelate the features: PCA, LDA, DCT (see the sketch below)
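A sketch of the decorrelation route using PCA. The motivation: a full covariance needs d(d+1)/2 parameters per Gaussian, while a diagonal one needs only d variances, so rotating the features into a basis where they are (approximately) uncorrelated lets diagonal models work. The details here are illustrative:

```python
import numpy as np

def pca_decorrelate(X):
    """Rotate features X (T x d) into the eigenbasis of their covariance."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)          # full d x d covariance
    _, eigvecs = np.linalg.eigh(cov)        # orthogonal eigenbasis
    return Xc @ eigvecs                     # cov of result is (near) diagonal

X = np.random.randn(1000, 2) @ np.random.randn(2, 12)   # correlated features
Y = pca_decorrelate(X)
off_diag = np.cov(Y, rowvar=False) - np.diag(np.var(Y, axis=0, ddof=1))
print(np.abs(off_diag).max())   # ~0: the off-diagonal covariance is gone
```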
Problem 2
- Given O = [O_1, O_2, …, O_T] and model λ, how do we choose a state sequence Q = [q_1, q_2, …, q_T] that is optimal?
- Well explained in the Rabiner paper
- Single best state sequence Q
- Best score along a single path at time t, accounting for the first t observations and ending in state i:
  δ_t(i) = max_{q_1, q_2, …, q_{t-1}} P[q_1 q_2 ⋯ q_t = i, O_1 O_2 ⋯ O_t | λ]
Viterbi Trellis
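A sketch of the Viterbi recursion over that trellis, in log probabilities to avoid underflow. It assumes no zero entries in A, B, or π (log of zero); real systems add a floor:

```python
import numpy as np

def viterbi(O, A, B, pi):
    """Single best state sequence Q* and its log score."""
    N, T = len(pi), len(O)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, N))             # delta_t(i): best log score ending in i
    psi = np.zeros((T, N), dtype=int)    # psi_t(i): backpointer (argmax)
    delta[0] = logpi + logB[:, O[0]]
    for t in range(1, T):
        # scores[i, j]: best path into state i, then transition i -> j
        scores = delta[t-1][:, None] + logA
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, O[t]]
    q = [delta[-1].argmax()]             # best final state
    for t in range(T - 1, 0, -1):        # backtrack through the trellis
        q.append(psi[t][q[-1]])
    return q[::-1], delta[-1].max()
```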
Back to Problem 3
- How do we adjust the parameters of model λ = (A, B, π) to maximise P(O | λ)?
- Training of models
- Baum-Welch: an implementation of the EM algorithm (tutorial from David)
- Start with a good estimate: clustering with k-means (see the sketch below)
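For concreteness, here is one Baum-Welch re-estimation pass for the discrete case, written against the forward() sketch from earlier. This is a minimal illustration of the EM update (single observation sequence, no scaling, no convergence test), not a production trainer:

```python
import numpy as np

def backward(O, A, B):
    """beta_t(i) = P(O_t+1 .. O_T | q_t = S_i, lambda)."""
    N, T = A.shape[0], len(O)
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(O, A, B, pi):
    """One EM re-estimation of (A, B, pi) from a single sequence O."""
    O = np.asarray(O)
    _, alpha = forward(O, A, B, pi)        # forward() sketch from earlier
    beta = backward(O, A, B)
    gamma = alpha * beta                   # gamma_t(i) ~ P(q_t = S_i | O)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # xi_t(i,j) ~ alpha_t(i) a_ij b_j(O_t+1) beta_t+1(j)
    xi = (alpha[:-1, :, None] * A[None] *
          (B[:, O[1:]].T * beta[1:])[:, None, :])
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.stack([gamma[O == k].sum(axis=0) for k in range(B.shape[1])],
                     axis=1) / gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi
```

Each pass is guaranteed not to decrease P(O | λ); iterate until the score converges.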
Training strategies
- Choice of the number of states
- Controlling transitions: fully connected, or left-right HMM
- Gradually increasing the number of mixtures per state
More Information
- HTK, the Hidden Markov Model Toolkit from Cambridge University: htk.eng.cam.ac.uk
- Rabiner, L. R., "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, Feb. 1989
- Speech recognition books