ASR using Hidden Markov Model : A tutorial

Samudravijaya K
samudravijaya@gmail.com
Tata Institute of Fundamental Research
Workshop on ASR @BAMU; 14-OCT-11

What is ASR?

[Figure; source: HTK book]

Dynamic Time Warping

[Figure: DTW grid. Horizontal axis: test feature vector sequence $x_1, x_2, \ldots, x_N$. Vertical axis: reference sequence $r_1, r_2, \ldots, r_M$. Each grid point $(n,m)$ carries a local distance $d(n,m)$, for example $d(4,3)$.]

Algorithm: Define
* $d(n,m)$: the (local) distance between the $n$-th test frame and the $m$-th reference frame.
* $D(n,m)$: the (accumulated) distance of the partial path starting from the grid point $(1,1)$ and ending at the grid point $(n,m)$.

Using the Principle of Optimality, $D(n,m)$ is the sum of the local cost and the cumulative cost of the cheapest path to a predecessor node.

[Figure: grid showing the node $(n,m)$ and neighbouring nodes such as $(n-1,m)$, $(n-1,m-1)$, $(n,m-1)$, $(n-2,m-1)$ and $(n-1,m-2)$.]

$$D(n,m) = d(n,m) + \min\{\, D(n-1,m),\; D(n-1,m-1),\; D(n,m-1) \,\}$$

* Compute $D(n,m)$ for each allowed pair $(n,m)$. Remember the best predecessor point.
* $D(N,M)$ is the cost of the optimal path.
* From $(N,M)$, start backtracing to identify the optimal path.
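The recursion and backtrace above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `dtw`, the scalar "feature vectors" and the absolute-difference local distance are choices made here, every grid point is treated as allowed, and indices are 0-based (so grid point $(1,1)$ of the slide is `(0, 0)`).

```python
import math


def dtw(test, ref, dist):
    """Return (D(N, M), optimal path) using the three predecessors of the slide."""
    N, M = len(test), len(ref)
    D = [[math.inf] * M for _ in range(N)]
    back = [[None] * M for _ in range(N)]  # best predecessor of each grid point
    for n in range(N):
        for m in range(M):
            d = dist(test[n], ref[m])      # local distance d(n, m)
            if n == 0 and m == 0:
                D[n][m] = d                # the path starts at the first grid point
                continue
            best, arg = math.inf, None
            # Predecessors (n-1, m), (n-1, m-1), (n, m-1), skipping off-grid ones.
            for pn, pm in ((n - 1, m), (n - 1, m - 1), (n, m - 1)):
                if pn >= 0 and pm >= 0 and D[pn][pm] < best:
                    best, arg = D[pn][pm], (pn, pm)
            D[n][m] = d + best
            back[n][m] = arg
    # Backtrace from the last grid point to the first one.
    path = [(N - 1, M - 1)]
    while back[path[-1][0]][path[-1][1]] is not None:
        path.append(back[path[-1][0]][path[-1][1]])
    path.reverse()
    return D[N - 1][M - 1], path


# Toy example: the test sequence repeats its last frame once.
cost, path = dtw([1, 2, 3, 3], [1, 2, 3], lambda x, r: abs(x - r))
print(cost, path)
```

In the toy example the last two test frames are both aligned to the last reference frame, which is exactly the time warping that a frame-by-frame comparison cannot do.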

DTW and HMM

[Figure: the DTW grid (test feature vector sequence against reference frames) shown next to an HMM whose states s1, s2, ... replace the reference frames.]

Q: How does DTW handle (speaker, channel, environment) variabilities of speech?
A: Reference template: average of multiple feature vector sequences.

Model   Sequence of          Statistics       Goal
DTW     N feature vectors    mean             minimise distance
HMM     M (<< N) states      mean, variance   maximise likelihood

[Figure: Spectrogram of "thiruvananthapuram", with phone labels t i r u w a n th p u r a m.]

[Figure: formant trajectories divided into states.]

Instead of representing the temporal variation of a phoneme as a sequence of feature vectors (deterministic model), represent it as a sequence of a smaller number of states (probabilistic model: mean and variance of vectors).

Vector sequence vs state sequence

Hidden Markov model (HMM)

[Figure: an HMM with transition probabilities such as $a_{11}$; each state has an output density $p(f)$ plotted against frequency $f$ (Hz).]

Parameters of an HMM: $A$, $B$, $\pi$
* $A$, $B$: model the duration and features of a phoneme
* $\pi$: allows skipping the initial part

What is hidden in a hidden Markov model?

HMM: definitions

Assumptions
* First order Markov assumption (finite history): $P(q_t = j \mid q_{t-1} = i, q_{t-2} = k, \ldots) = P(q_t = j \mid q_{t-1} = i)$
* Stationarity (parameters do not change with time): $P(q_t = j \mid q_{t-1} = i) = P(q_{t+l} = j \mid q_{t+l-1} = i)$
* Consequence: exponential duration distribution

Elements of HMM
* $N$: number of hidden states
* $Q$: set of states: $Q = \{q_1, q_2, q_3, \ldots, q_N\}$
* $B$: observation probability distribution: $B = \{b_j\}$, $1 \le j \le N$
* $A$: state transition probability matrix: $A = \{a_{ij}\}$, $a_{ij} = P(q_{t+1} = j \mid q_t = i)$, $1 \le i, j \le N$
* $\pi$: initial state distribution: $\pi_i = P(q_1 = i)$, $1 \le i \le N$
* $\lambda$: the entire model: $\lambda = (A, B, \pi)$
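The exponential duration distribution follows directly from the two assumptions. A short worked derivation is given below; the symbols $p_i(d)$ (probability of staying in state $i$ for exactly $d$ frames) and $\bar{d}_i$ (mean duration) are notation introduced here for illustration and do not appear in the slides.

```latex
% Stay in state i for d-1 self-transitions, then leave:
p_i(d) = a_{ii}^{\,d-1}\,(1 - a_{ii}), \qquad d = 1, 2, 3, \ldots

% Mean duration (mean of a geometric distribution):
\bar{d}_i = \sum_{d=1}^{\infty} d \, p_i(d) = \frac{1}{1 - a_{ii}}
```

For example, a self-transition probability $a_{ii} = 0.9$ gives a mean duration of 10 frames, and the most probable duration is always a single frame.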

3 problems in HMM

1. Matching: Given an observation sequence $O = o_1, o_2, o_3, \ldots, o_T$ and a trained model $\lambda = (A, B, \pi)$, how to efficiently compute the likelihood $P(O \mid \lambda)$ (the likelihood of the model $\lambda$ generating the observation sequence $O$)?
   Solution: forward algorithm (use recursion for computational efficiency)
   Use: Given two models $\lambda_1$ and $\lambda_2$, choose $\lambda_1$ if $P(O \mid \lambda_1) > P(O \mid \lambda_2)$
2. Optimal path: Given $O$ and $\lambda$, how to find the optimal state sequence ($Q = q_1, q_2, q_3, \ldots, q_T$)?
   Solution: Viterbi algorithm (similar to DTW)
   Use: Derive word/phone sequence
3. Training: How to estimate the parameters of the model $\lambda = (A, B, \pi)$ that maximise $P(O \mid \lambda)$?
   Solution: Forward-backward algorithm.

Match observation (speech vector) sequence with a model

Goal: To compute $P(o_1, o_2, o_3, \ldots, o_T \mid \lambda)$

Steps:
* There are many state sequences (paths). Consider one state sequence $q = q_1, q_2, q_3, \ldots, q_T$.
* If we assume that observations are independent,
  $$P(O \mid q, \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t, \lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2) \cdots b_{q_T}(o_T)$$
* Probability of a particular state sequence is:
  $$P(q \mid \lambda) = \pi_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3} \cdots a_{q_{T-1} q_T}$$
* Enumerate paths and sum probabilities:
  $$P(O \mid \lambda) = \sum_{q} P(O \mid q, \lambda)\, P(q \mid \lambda)$$
* $N^T$ state sequences and $O(T)$ calculations per sequence give $O(T N^T)$ computational complexity: exponential in length!
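The enumeration above can be written down literally, which is useful as a reference against which faster algorithms can be checked on tiny examples. This is a minimal sketch: the function name `brute_force_likelihood`, the discrete observation symbols (so that `B[j][k]` stands for $b_j(k)$) and the 2-state toy model are assumptions made here, not part of the tutorial.

```python
from itertools import product


def brute_force_likelihood(A, B, pi, obs):
    """Sum P(O | q, lambda) * P(q | lambda) over all N**T state sequences q."""
    N = len(pi)
    total = 0.0
    for q in product(range(N), repeat=len(obs)):
        # pi_{q1} * b_{q1}(o_1), then a_{q(t-1) q(t)} * b_{q(t)}(o_t) for each later frame.
        p = pi[q[0]] * B[q[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[q[t - 1]][q[t]] * B[q[t]][obs[t]]
        total += p
    return total


# Hypothetical toy model: 2 states, 2 observation symbols (0 and 1).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
likelihood = brute_force_likelihood(A, B, pi, [0, 1, 0])
print(likelihood)  # about 0.10893
```

With $N = 2$ and $T = 3$ there are only 8 paths, but the loop count grows as $N^T$, so this approach is not usable for real utterances.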

Forward Algorithm: Intuition

[Figure: trellis of states $1, \ldots, N$ against the observation sequence $o_1, o_2, o_3, \ldots, o_t, o_{t+1}, \ldots$; arcs $a_{1j}, a_{2j}, a_{3j}, \ldots, a_{ij}, \ldots, a_{Nj}$ lead from every state at time $t$ into state $j$ at time $t+1$.]

Let $\alpha_t(i) = P(o_1, o_2, \ldots, o_t, q_t = i \mid \lambda)$. Then

$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(o_{t+1})$$

Forward Algorithm

Define a forward variable $\alpha_t(i)$ as:

$$\alpha_t(i) = P(o_1, o_2, \ldots, o_t, q_t = i \mid \lambda)$$

$\alpha_t(i)$ is the probability of observing the partial sequence $(o_1, o_2, \ldots, o_t)$ and $o_t$ being generated by the $i$-th state (i.e., $q_t = i$).

Induction:
* Initialization: $\alpha_1(i) = \pi_i\, b_i(o_1)$
* Recursion: $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(o_{t+1})$
* Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$

Computational complexity: $O(N^2 T)$

Use: Match a test speech feature vector sequence with all models. Choose $\lambda_i$ if $P(O \mid \lambda_i) > P(O \mid \lambda_j)$ for all $j \ne i$.
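The three induction steps map directly onto code. The sketch below assumes discrete observation symbols (`B[j][k]` stands for $b_j(k)$), uses a hypothetical 2-state toy model, and stores `alpha` with a 0-based time index, so `alpha[0]` holds $\alpha_1$. No scaling is applied, so it is only suitable for short sequences.

```python
def forward(A, B, pi, obs):
    """Return (alpha, P(O | lambda)) for a discrete-observation HMM."""
    N = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    # Recursion: alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(o_{t+1})
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return alpha, sum(alpha[-1])


# Hypothetical toy model: 2 states, 2 observation symbols (0 and 1).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
alpha, likelihood = forward(A, B, pi, [0, 1, 0])
print(likelihood)  # about 0.10893
```

Each time step costs $N^2$ multiplications, which is where the $O(N^2 T)$ complexity comes from.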

Viterbi Algorithm: Intuition

Problem 2: Given $O$ and $\lambda$, how to find the optimal state sequence ($Q = q_1, q_2, q_3, \ldots, q_T$) (optimal path)?

Define $\delta_t(i)$ (the probability of the highest probability path ending at state $i$ at time $t$) as:

$$\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1, q_2, \ldots, q_{t-1}, q_t = i, o_1, o_2, \ldots, o_t \mid \lambda)$$

Viterbi recursion:

$$\delta_{t+1}(j) = \left[ \max_{i} \delta_t(i)\, a_{ij} \right] b_j(o_{t+1})$$

[Figure: trellis of states $1, \ldots, N$ at observations $o_t$ and $o_{t+1}$; arcs $a_{1j}, a_{2j}, a_{3j}, \ldots, a_{ij}, \ldots, a_{Nj}$ lead into state $j$ at time $t+1$.]

Contrast the above with the recursion in the Forward algorithm:

$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(o_{t+1})$$

Viterbi Algorithm

* Initialization: $\delta_1(i) = \pi_i\, b_i(o_1)$, $\psi_1(i) = 0$, $1 \le i \le N$
* Recursion: for $2 \le t \le T$, $1 \le j \le N$,
  $$\delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}]\; b_j(o_t)$$
  $$\psi_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}]$$
* Termination:
  $$P^* = \max_{1 \le i \le N} [\delta_T(i)], \qquad q_T^* = \arg\max_{1 \le i \le N} [\delta_T(i)]$$
* Path (optimal state sequence) backtracking:
  $$q_t^* = \psi_{t+1}(q_{t+1}^*), \qquad t = T-1, T-2, \ldots, 2, 1.$$
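The four steps can be sketched as follows. As before, this is an illustration under assumptions made here: discrete observation symbols (`B[j][k]` stands for $b_j(k)$), a hypothetical 2-state toy model, and 0-based indices for both time and states.

```python
def viterbi(A, B, pi, obs):
    """Return (P*, optimal state sequence) for a discrete-observation HMM."""
    N, T = len(pi), len(obs)
    # Initialization: delta_1(i) = pi_i * b_i(o_1), psi_1(i) = 0
    delta = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    psi = [[0] * N]
    # Recursion: keep the best predecessor of every state at every time.
    for t in range(1, T):
        d_row, p_row = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[t - 1][i] * A[i][j])
            p_row.append(best_i)
            d_row.append(delta[t - 1][best_i] * A[best_i][j] * B[j][obs[t]])
        delta.append(d_row)
        psi.append(p_row)
    # Termination: best final state.
    q_last = max(range(N), key=lambda i: delta[-1][i])
    # Backtracking: q_t = psi_{t+1}(q_{t+1}).
    path = [q_last]
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return delta[-1][q_last], path


# Hypothetical toy model: 2 states, 2 observation symbols (0 and 1).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
p_star, path = viterbi(A, B, pi, [0, 1, 0])
print(p_star, path)  # about 0.046656, [0, 1, 0]
```

The only difference from the forward recursion is that the sum over predecessors is replaced by a maximum, plus the bookkeeping needed for the backtrace.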

DP Technique: DTW and HMM

In case of template matching (DTW), we decided on the optimal path that minimised the distance between a test feature sequence and a reference template. The key optimisation equation was

$$D(n,m) = d(n,m) + \min\{\, D(n-1,m),\; D(n-1,m-1),\; D(n,m-1) \,\}$$

In case of a probabilistic model, we want to maximise the probability of a test feature sequence matching an HMM. The DP equation for matching a test sequence with the best HMM state sequence (Viterbi algorithm) is

$$\delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}]\; b_j(o_t)$$

Taking log on both sides of the equation, we get

$$\log(\delta_t(j)) = \log(b_j(o_t)) + \max_{1 \le i \le N} [\log(\delta_{t-1}(i)) + \log(a_{ij})]$$
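The log form has the same "local cost plus best predecessor" shape as the DTW recursion, and it also avoids numerical underflow on long sequences. A minimal sketch follows, assuming discrete observation symbols and a hypothetical 2-state toy model; `safe_log` and `log_viterbi` are names chosen here for illustration.

```python
import math


def safe_log(p):
    """log(p), with log(0) mapped to -infinity."""
    return math.log(p) if p > 0 else -math.inf


def log_viterbi(A, B, pi, obs):
    """Return (log P*, optimal state sequence) using only additions and max."""
    N = len(pi)
    delta = [safe_log(pi[i]) + safe_log(B[i][obs[0]]) for i in range(N)]
    back = []  # back[t-1][j] = best predecessor of state j at time t
    for o in obs[1:]:
        new_delta, psi = [], []
        for j in range(N):
            scores = [delta[i] + safe_log(A[i][j]) for i in range(N)]
            best_i = max(range(N), key=scores.__getitem__)
            psi.append(best_i)
            # log delta_t(j) = log b_j(o_t) + max_i [log delta_{t-1}(i) + log a_ij]
            new_delta.append(safe_log(B[j][o]) + scores[best_i])
        delta = new_delta
        back.append(psi)
    q = max(range(N), key=delta.__getitem__)
    best = delta[q]
    path = [q]
    for psi in reversed(back):
        path.append(psi[path[-1]])
    path.reverse()
    return best, path


# Hypothetical toy model: 2 states, 2 observation symbols (0 and 1).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
log_p, path = log_viterbi(A, B, pi, [0, 1, 0])
long_log_p, _ = log_viterbi(A, B, pi, [0, 1] * 1000)
print(math.exp(log_p), path, long_log_p)
```

For the 2000-frame sequence the path probability itself is far too small to be represented as a double-precision number, while its logarithm is an ordinary finite value.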

DP technique applied to HMM: Viterbi algorithm

The HMM can represent even a sentence!

[Figure; source: The HTK Book]

Training

Problem 3: Given training data and its transcription, how to estimate the parameters of the model, $\lambda = (A, B, \pi)$, that maximise the probability of representation of the training data by the model, $P(O \mid \lambda)$?

There is no analytic solution because of its complexity. So, we employ the Expectation-Maximisation (EM) algorithm, which is iterative.

1) Start with an initial (approximate) model, $\lambda_0$.
2) E-step: Using the current model ($\lambda_0$), compute the expectation of the likelihood of the training data: $P(O \mid \lambda_0) = \sum_{i=1}^{N} \alpha_T(i)$.
3) M-step: Re-estimate the parameters ($\lambda = (A, B, \pi)$) so as to maximise the probability $P(O \mid \lambda)$.
4) Stop if the improvement in log likelihood is insignificant: $\log P(O \mid \lambda) - \log P(O \mid \lambda_0) < \epsilon$.
5) Else, set $\lambda_0 \leftarrow \lambda$ and go to step 2.

The EM algorithm as applied to ASR is known as the Baum-Welch (B-W) algorithm; it is also known as the Forward-Backward algorithm.
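The five steps can be sketched as a complete training loop. The sketch below makes several simplifying assumptions that are not in the tutorial: observations are discrete symbols (so $B$ is a table re-estimated from expected counts, rather than the Gaussian densities treated later), there is a single training sequence, and no scaling is applied. The names `em_step` and `train` are hypothetical; the forward and backward variables and the re-estimation formulae are the ones defined in the remaining slides.

```python
import math


def forward(A, B, pi, obs):
    """alpha[t][i] with a 0-based time index."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    return alpha


def backward(A, B, obs):
    """beta[t][i]; the last row is beta_T(i) = 1."""
    N = len(A)
    beta = [[1.0] * N]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta


def em_step(A, B, pi, obs):
    """One E-step and M-step. Returns (new model, log P(O | current model))."""
    N, K, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    like = sum(alpha[-1])  # E-step: likelihood under the current model
    # State occupancy and transition posteriors.
    gamma = [[alpha[t][i] * beta[t][i] / like for i in range(N)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / like
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # M-step: expected counts turned into probabilities.
    new_pi = list(gamma[0])
    new_A = [[sum(x[i][j] for x in xi) / sum(g[i] for g in gamma[:-1])
              for j in range(N)] for i in range(N)]
    new_B = [[sum(g[j] for g, o in zip(gamma, obs) if o == k)
              / sum(g[j] for g in gamma)
              for k in range(K)] for j in range(N)]
    return (new_A, new_B, new_pi), math.log(like)


def train(A, B, pi, obs, eps=1e-6, max_iter=50):
    """Iterate until the gain in log likelihood drops below eps."""
    history, prev = [], -math.inf
    for _ in range(max_iter):
        (A, B, pi), loglik = em_step(A, B, pi, obs)
        history.append(loglik)
        if loglik - prev < eps:
            break
        prev = loglik
    return (A, B, pi), history


# Hypothetical initial model and training sequence.
A0 = [[0.7, 0.3], [0.4, 0.6]]
B0 = [[0.9, 0.1], [0.2, 0.8]]
pi0 = [0.6, 0.4]
obs = [0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1]
(A1, B1, pi1), history = train(A0, B0, pi0, obs)
print(history[0], history[-1])
```

The recorded log likelihoods should never decrease from one iteration to the next, which is the defining property of EM; the result is a local optimum that depends on the initial model.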

Forward-Backward Algorithm: $\beta_t(i)$

Define a backward variable

$$\beta_t(i) = p(o_{t+1}, \ldots, o_T \mid q_t = i, \lambda)$$

$\beta_t(i)$: given that we are at node $i$ at time $t$, the sum of probabilities of all paths such that the partial sequence $o_{t+1}, \ldots, o_T$ is observed.

Starting with the initial condition at the last speech vector ($t = T$):

$$\beta_T(i) = 1.0, \qquad 1 \le i \le N,$$

we can recursively compute $\beta_t(i)$ for every state $i = 1, 2, \ldots, N$ backwards in time ($t = T-1, T-2, \ldots, 2, 1$) as follows:

$$\beta_t(i) = \sum_{j=1}^{N} \underbrace{[a_{ij}\, b_j(o_{t+1})]}_{\text{going to each node from the } i\text{-th node}} \;\; \underbrace{\beta_{t+1}(j)}_{\text{prob. of observing } o_{t+2} \ldots o_T \text{ given that we are in the } j\text{-th node at } t+1}$$
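A direct transcription of the backward recursion is sketched below, again assuming discrete observation symbols (`B[j][k]` stands for $b_j(k)$), a hypothetical 2-state toy model and a 0-based time index. As a sanity check, the likelihood can also be obtained from the backward variables alone as $\sum_i \pi_i\, b_i(o_1)\, \beta_1(i)$.

```python
def backward(A, B, obs):
    """Return beta, where beta[t][i] uses a 0-based time index."""
    N, T = len(A), len(obs)
    # Initial condition at the last frame: beta_T(i) = 1
    beta = [[1.0] * N for _ in range(T)]
    # Recursion backwards in time: t = T-1, ..., 1 (here T-2, ..., 0)
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return beta


# Hypothetical toy model: 2 states, 2 observation symbols (0 and 1).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
obs = [0, 1, 0]
beta = backward(A, B, obs)
likelihood = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(len(pi)))
print(likelihood)  # about 0.10893, the same value the forward algorithm gives
```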

Joint event: state $i$ at time $t$ AND state $j$ at $t+1$

Define $\xi_t(i,j)$ as the probability of the system being in state $i$ at time $t$ and in state $j$ at time $t+1$:

$$\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}$$
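Since $\xi_t(i,j)$ is a posterior over pairs of states, its entries must sum to one at every $t$. The sketch below computes the whole table and lets that property be checked; it assumes discrete observation symbols and a hypothetical 2-state toy model, and `xi_table` is a name chosen here.

```python
def forward(A, B, pi, obs):
    """alpha[t][i] with a 0-based time index."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    return alpha


def backward(A, B, obs):
    """beta[t][i]; the last row is beta_T(i) = 1."""
    N = len(A)
    beta = [[1.0] * N]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta


def xi_table(A, B, pi, obs):
    """xi[t][i][j] for t = 0 .. T-2 (the slide's t = 1 .. T-1)."""
    N = len(pi)
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    like = sum(alpha[-1])  # P(O | lambda)
    return [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / like
              for j in range(N)] for i in range(N)]
            for t in range(len(obs) - 1)]


# Hypothetical toy model: 2 states, 2 observation symbols (0 and 1).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
xi = xi_table(A, B, pi, [0, 1, 0])
totals = [sum(sum(row) for row in x) for x in xi]
print(totals)  # each entry should be 1 up to rounding
```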

Re-estimation Formulae: $\pi_i$ and $\hat{a}_{ij}$

The revised estimate of the initial probability, $\pi_i$, is the expected frequency in state $i$ at time $t = 1$:

$$\pi_i^{new} = \sum_{j=1}^{N} \xi_1(i,j)$$

The revised estimate of the transition probability is
(Expected no. of transitions from state $i$ to state $j$) / (Expected no. of times the system is in state $i$):

$$\hat{a}_{ij}^{new} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{j=1}^{N} \underbrace{\sum_{t=1}^{T-1} \xi_t(i,j)}_{\text{all transitions out of } i \text{ to } j \text{ at all times}}}$$
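Given a table of $\xi_t(i,j)$ values, both formulae reduce to sums and one normalisation per row. The sketch below takes the table as input; the function name `reestimate_pi_A` and the numbers in the example are made up for illustration (they are internally consistent but not derived from real speech), and every state is assumed to have a non-zero expected occupancy.

```python
def reestimate_pi_A(xi):
    """Re-estimate pi and A from xi[t][i][j], t = 0 .. T-2 (0-based)."""
    N = len(xi[0])
    # pi_i = sum_j xi_1(i, j): expected frequency in state i at the first frame
    new_pi = [sum(xi[0][i]) for i in range(N)]
    new_A = []
    for i in range(N):
        # Numerators: expected number of transitions i -> j over all times.
        num = [sum(x[i][j] for x in xi) for j in range(N)]
        # Denominator: expected number of transitions out of i (assumed > 0).
        den = sum(num)
        new_A.append([n / den for n in num])
    return new_pi, new_A


# Illustrative posteriors for T = 3 frames and N = 2 states.
xi = [
    [[0.5, 0.25], [0.125, 0.125]],
    [[0.5, 0.125], [0.125, 0.25]],
]
new_pi, new_A = reestimate_pi_A(xi)
print(new_pi, new_A)
```

Because each row of the new transition matrix is a set of counts divided by their total, it is a valid probability distribution by construction.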

Re-estimation Formulae: $b_j(t)$

Parameters of State Probability Density Function

Let us assume that the state output distribution function is Gaussian. If there was just one state $j$, the maximum likelihood estimates of the parameters would be

$$\mu_j = \frac{1}{T} \sum_{t=1}^{T} o_t$$

$$\Sigma_j = \frac{1}{T} \sum_{t=1}^{T} (o_t - \mu_j)(o_t - \mu_j)^{\top}$$

* Difficulty: Speech HMMs have many states.
* The mapping from speech vectors to states is unknown because the state sequence itself is unknown.
* Solution: Assign each speech vector to every state in proportion to the likelihood of the system being in that state when the speech vector was observed.

Re-estimation Formulae: $b_j(t)$

Let $L_j(t)$ denote the probability of being in state $j$ at time $t$:

$$L_j(t) = p(q_t = j \mid O, \lambda) = \frac{p(q_t = j, O \mid \lambda)}{p(O \mid \lambda)} = \frac{\alpha_t(j)\, \beta_t(j)}{\sum_{i} \alpha_T(i)}$$

Revised estimates of the state pdf parameters are

$$\mu_j = \frac{\sum_{t=1}^{T} L_j(t)\, o_t}{\sum_{t=1}^{T} L_j(t)}$$

$$\Sigma_j = \frac{\sum_{t=1}^{T} L_j(t)\, (o_t - \mu_j)(o_t - \mu_j)^{\top}}{\sum_{t=1}^{T} L_j(t)}$$

The expected values (estimates) are weighted averages, the weights being the probability of being in state $j$ at time $t$.
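The weighted averages are easy to see with scalar observations. The sketch below assumes one-dimensional features (so the covariance reduces to a variance) and takes the occupation probabilities $L_j(t)$ of one state as given; `reestimate_gaussian` and the example numbers are hypothetical.

```python
def reestimate_gaussian(L_j, obs):
    """Weighted mean and variance for one state; weights are L_j(t)."""
    total = sum(L_j)  # expected number of frames spent in state j
    mu = sum(w * o for w, o in zip(L_j, obs)) / total
    var = sum(w * (o - mu) ** 2 for w, o in zip(L_j, obs)) / total
    return mu, var


obs = [1.0, 2.0, 3.0, 4.0]
# State j is certainly occupied during the first two frames only.
mu_a, var_a = reestimate_gaussian([1.0, 1.0, 0.0, 0.0], obs)
# Equal weights on all frames reproduce the single-state ML estimates.
mu_b, var_b = reestimate_gaussian([0.5, 0.5, 0.5, 0.5], obs)
print(mu_a, var_a, mu_b, var_b)  # 1.5 0.25 2.5 1.25
```

For vector observations the squared difference would be replaced by the outer product $(o_t - \mu_j)(o_t - \mu_j)^{\top}$ from the formula above.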

Some remarks

Types of HMM
* Ergodic vs left-to-right
* Semi-Markov (state duration)
* Discriminative models

Implementational Issues
* Number of states
* Initial parameters
* Scaling, addition of log likelihoods
* Multiple observations (tokens/repetitions)
* Discrete vs continuous probability functions (with GMMs)
* Concatenation of smaller HMMs into a larger HMM
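The item on scaling and addition of log likelihoods matters in practice because products of many probabilities underflow. One common remedy, sketched here, is to run the forward algorithm on log probabilities and replace each sum by a log-sum-exp. This is one possible approach rather than the one prescribed by the tutorial; it assumes discrete observation symbols and a hypothetical 2-state toy model, and `logsumexp` and `log_forward` are names chosen here.

```python
import math


def safe_log(p):
    """log(p), with log(0) mapped to -infinity."""
    return math.log(p) if p > 0 else -math.inf


def logsumexp(xs):
    """log(sum(exp(x) for x in xs)), computed without overflow or underflow."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def log_forward(A, B, pi, obs):
    """Return log P(O | lambda) using the forward recursion in the log domain."""
    N = len(pi)
    log_alpha = [safe_log(pi[i]) + safe_log(B[i][obs[0]]) for i in range(N)]
    for o in obs[1:]:
        log_alpha = [
            logsumexp([log_alpha[i] + safe_log(A[i][j]) for i in range(N)])
            + safe_log(B[j][o])
            for j in range(N)
        ]
    return logsumexp(log_alpha)


# Hypothetical toy model: 2 states, 2 observation symbols (0 and 1).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
short_ll = log_forward(A, B, pi, [0, 1, 0])
long_ll = log_forward(A, B, pi, [0, 1] * 3000)
print(math.exp(short_ll), long_ll)
```

For the 6000-frame sequence the likelihood itself is too small to be represented as a double-precision number, but the log likelihood is finite, so models can still be compared.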

References

Four online tutorials on HMM are listed at <http://speech.tifr.res.in/tutorials/index.html>

Books:
* Fundamentals of Speech Recognition, by Lawrence R. Rabiner, B. H. Juang and B. Yegnanarayana, Pearson Education India, 2008, Rs. 450.
* Spoken Language Processing: A Guide to Theory, Algorithm and System Development, by Xuedong Huang, Alex Acero and Hsiao-Wuen Hon, Prentice Hall PTR, 2001.
* Hidden Markov Models for Speech Recognition, by X. D. Huang, Y. Ariki and M. A. Jack, Edinburgh University Press, Edinburgh, c1990.
* Statistical Methods for Speech Recognition, by F. Jelinek, The MIT Press, Cambridge, MA.

HMM on MATLAB:
* HMM toolbox on MATLAB: Discrete HMMs: training and recognition, by Kevin Murphy, 2005; < http : // murphyk/software/hmm/hmm.html >
