Speech Recognition
Lecture 8: Expectation-Maximization Algorithm, Hidden Markov Models
Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.com
This Lecture
- Expectation-Maximization (EM) algorithm
- Hidden Markov Models
Latent Variables
Definition: unobserved or hidden variables, as opposed to those directly available at training and test time.
- Examples: mixture models, HMMs.
Why latent variables?
- naturally unavailable variables, e.g., economic variables.
- modeling: latent variables introduced to model dependencies.
ML with Latent Variables
Problem:
- with fully observed variables, ML is often straightforward.
- with latent variables, the log-likelihood contains a sum inside the logarithm, which makes finding the best parameter values harder:
$$L(\theta, x) = \log p_\theta[x] = \log \sum_z p_\theta[x, z] = \log \sum_z p_\theta[x \mid z]\, p_\theta[z].$$
Idea: use the current parameter values to estimate the latent variables, then use those estimates to re-estimate the parameter values: the EM algorithm.
EM Theorem
Theorem: for any x and any parameter values θ and θ′,
$$L(\theta', x) - L(\theta, x) \geq \sum_z p_\theta[z \mid x] \log p_{\theta'}[x, z] - \sum_z p_\theta[z \mid x] \log p_\theta[x, z].$$
Proof:
$$\begin{aligned}
L(\theta', x) - L(\theta, x) &= \log p_{\theta'}[x] - \log p_\theta[x] \\
&= \sum_z p_\theta[z \mid x] \log p_{\theta'}[x] - \sum_z p_\theta[z \mid x] \log p_\theta[x] \\
&= \sum_z p_\theta[z \mid x] \log \frac{p_{\theta'}[x, z]}{p_{\theta'}[z \mid x]} - \sum_z p_\theta[z \mid x] \log \frac{p_\theta[x, z]}{p_\theta[z \mid x]} \\
&= \sum_z p_\theta[z \mid x] \log \frac{p_\theta[z \mid x]}{p_{\theta'}[z \mid x]} + \sum_z p_\theta[z \mid x] \log p_{\theta'}[x, z] - \sum_z p_\theta[z \mid x] \log p_\theta[x, z] \\
&= D\big(p_\theta[z \mid x] \,\big\|\, p_{\theta'}[z \mid x]\big) + \sum_z p_\theta[z \mid x] \log p_{\theta'}[x, z] - \sum_z p_\theta[z \mid x] \log p_\theta[x, z] \\
&\geq \sum_z p_\theta[z \mid x] \log p_{\theta'}[x, z] - \sum_z p_\theta[z \mid x] \log p_\theta[x, z],
\end{aligned}$$
since the relative entropy D is non-negative.
EM Idea
Maximize the expectation:
$$\sum_z p_\theta[z \mid x] \log p_{\theta'}[x, z] = \mathop{\mathrm{E}}_{z \sim p_\theta[\cdot \mid x]}\big[\log p_{\theta'}[x, z]\big].$$
Iterations:
- compute $p_\theta[z \mid x]$ for the current value of θ.
- compute the expectation and find the maximizing θ′.
EM Algorithm (Dempster, Laird, and Rubin, 1977; Wu, 1983)
Algorithm: maximum-likelihood meta-algorithm for models with latent variables.
- E-step: $q_{t+1} \leftarrow p_{\theta_t}[z \mid x]$.
- M-step: $\theta_{t+1} \leftarrow \operatorname{argmax}_\theta \sum_z q_{t+1}(z \mid x) \log p_\theta[x, z]$.
Interpretation:
- E-step: posterior probability of the latent variables given the observed variables and the current parameter.
- M-step: maximum-likelihood parameter given all the data.
EM Algorithm - Notes
Applications:
- mixture models, e.g., Gaussian mixtures; latent variables: which component generated each point.
- HMMs (Baum-Welch algorithm).
Notes:
- positive: each iteration increases the likelihood; no parameter tuning.
- negative: can converge to a local optimum; the likelihood can converge while the parameters do not; dependence on the initial parameter values.
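For concreteness, the E- and M-steps above can be sketched for a two-component 1D Gaussian mixture. The function name, the initialization from the data extremes, and the synthetic demo data are illustrative choices of this sketch, not part of the lecture:

```python
import math
import random

def em_gmm_1d(data, n_iter=50):
    """EM for a two-component 1D Gaussian mixture (illustrative sketch).

    Latent variable z in {0, 1}: which component generated each point.
    Returns (weights, means, variances) after n_iter EM updates.
    """
    # crude initialization: components at the data extremes, unit variance
    w = [0.5, 0.5]
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    for _ in range(n_iter):
        # E-step: posterior p[z | x] for each point (responsibilities)
        resp = []
        for x in data:
            dens = [w[k] / math.sqrt(2 * math.pi * var[k])
                    * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                    for k in range(2)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: maximum-likelihood parameters given the responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
    return w, mu, var

# demo on synthetic data from two well-separated components
random.seed(0)
sample = ([random.gauss(-5.0, 1.0) for _ in range(200)]
          + [random.gauss(5.0, 1.0) for _ in range(200)])
weights, means, variances = em_gmm_1d(sample)
```

On this synthetic sample the recovered means are close to -5 and 5 and the mixture weights close to 1/2, illustrating the monotone improvement of the likelihood noted above.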
EM = Coordinate Ascent
Theorem: EM = coordinate ascent applied to
$$A(q, \theta) = \sum_z q[z \mid x] \log \frac{p_\theta[x, z]}{q[z \mid x]} = \sum_z q[z \mid x] \log p_\theta[x, z] - \sum_z q[z \mid x] \log q[z \mid x].$$
- E-step: $q_{t+1} \leftarrow \operatorname{argmax}_q A(q, \theta_t)$.
- M-step: $\theta_{t+1} \leftarrow \operatorname{argmax}_\theta A(q_{t+1}, \theta)$.
Proof: the E-step follows from the two identities
$$A(q, \theta) = \sum_z q[z \mid x] \log \frac{p_\theta[x, z]}{q[z \mid x]} \leq \log \sum_z q[z \mid x] \frac{p_\theta[x, z]}{q[z \mid x]} = \log p_\theta[x] = L(\theta, x),$$
$$A\big(p_\theta[z \mid x], \theta\big) = \sum_z p_\theta[z \mid x] \log \frac{p_\theta[x, z]}{p_\theta[z \mid x]} = \sum_z p_\theta[z \mid x] \log p_\theta[x] = \log p_\theta[x] = L(\theta, x),$$
the first by Jensen's inequality (concavity of log).
EM = Coordinate Ascent
Proof (cont.): finally, note that
$$\operatorname{argmax}_\theta A(q_{t+1}, \theta) = \operatorname{argmax}_\theta \sum_z q_{t+1}[z \mid x] \log p_\theta[x, z],$$
which coincides with the M-step of EM.
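The two identities behind the E-step can be checked numerically on a toy discrete joint distribution; the probability table below is made up purely for illustration:

```python
import math

# toy joint distribution p_theta[x, z] over x in {0, 1}, z in {0, 1}
# (hypothetical values chosen only to illustrate the identities)
p_xz = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}

def A(q, x):
    """Auxiliary function A(q, theta) = sum_z q[z|x] log(p_theta[x, z] / q[z|x])."""
    return sum(q[z] * math.log(p_xz[(x, z)] / q[z]) for z in (0, 1) if q[z] > 0)

x = 0
L = math.log(sum(p_xz[(x, z)] for z in (0, 1)))              # L(theta, x) = log p_theta[x]
posterior = {z: p_xz[(x, z)] / math.exp(L) for z in (0, 1)}  # p_theta[z | x]

# A(q, theta) <= L(theta, x) for every q, with equality at q = p_theta[. | x]
assert abs(A(posterior, x) - L) < 1e-12
assert all(A({0: q0, 1: 1 - q0}, x) <= L + 1e-12 for q0 in (0.1, 0.5, 0.9))
```

The assertions mirror the proof: the auxiliary function never exceeds the log-likelihood, and the posterior attains it, so the E-step's argmax is the posterior.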
EM - Extensions and Variants (Dempster et al., 1977; Jamshidian and Jennrich, 1993; Wu, 1983)
- Generalized EM (GEM): full maximization is not necessary at every step, since any (strict) increase of the auxiliary function guarantees an increase of the log-likelihood.
- Sparse EM: posterior probabilities computed only at some points in the E-step.
This Lecture
- Expectation-Maximization (EM) algorithm
- Hidden Markov Models
Motivation
Data: sample of m sequences over an alphabet Σ drawn i.i.d. according to some distribution D:
$$x_1^i, x_2^i, \ldots, x_{k_i}^i, \quad i = 1, \ldots, m.$$
Problem: find the sequence model that best estimates the distribution D.
Latent variables: the observed sequences may have been generated by a model whose states and transitions are hidden or unobserved.
Example
Observations: sequences of Heads and Tails.
- observed sequence: H, T, T, H, T, H, T, H, H, H, T, T, ..., H.
- unobserved paths: (0, H, 0), (0, T, 1), (1, T, 1), ..., (2, H, 2).
Model: [figure: three-state HMM (states 0, 1, 2) with weighted H/T transitions.]
Hidden Markov Models
Definition: probabilistic automata, generative view.
- discrete case: finite alphabet Σ.
- transition probability: Pr[transition e].
- emission probability: Pr[emission a | e].
- simplification: for us, each transition e carries the single weight $w[e] = \Pr[a \mid e] \Pr[e]$.
Illustration: [figure: weighted automaton with transitions such as a/.3, a/.2, b/.1, c/.2 and final weight /.5.]
Other Models
Outputs at states (instead of transitions): equivalent model, dual representation.
Non-discrete case:
- outputs in $\mathbb{R}^N$.
- fixed probabilities replaced by distributions, e.g., Gaussian mixtures.
- application to acoustic modeling.
Three Problems (Rabiner, 1989)
- Problem 1: given a sequence $x_1, \ldots, x_k$ and a hidden Markov model $p_\theta$, compute $p_\theta[x_1, \ldots, x_k]$.
- Problem 2: given a sequence $x_1, \ldots, x_k$ and a hidden Markov model $p_\theta$, find the most likely path $\pi = e_1, \ldots, e_r$ that generated that sequence.
- Problem 3: estimate the parameters θ of a hidden Markov model via ML: $\theta = \operatorname{argmax}_\theta p_\theta[x]$.
P1: Evaluation
Algorithm: let X be a finite automaton representing the sequence $x = x_1 \cdots x_k$ and let H denote the current hidden Markov model. Then,
$$p_\theta[x] = \sum_{\pi \in X \circ H} w[\pi].$$
Thus, it can be computed using composition and a shortest-distance algorithm over the $(+, \times)$ semiring, in time $O(|X \circ H|)$.
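As a sketch, the sum over accepting paths can be computed with a simple forward pass. The dictionary encoding of transitions and the toy weights below are assumptions of this example, not the lecture's automata library:

```python
def hmm_sequence_prob(transitions, init, final, x):
    """Compute p_theta[x] = sum of w[pi] over all accepting paths pi labeled with x.

    transitions: dict mapping (state, symbol) to a list of (next_state, weight)
    pairs, with w[e] = Pr[a | e] * Pr[e] as in the lecture's simplification.
    This is a shortest-distance computation in the (+, x) semiring.
    """
    alpha = {init: 1.0}
    for sym in x:
        new_alpha = {}
        for state, p in alpha.items():
            for nxt, w in transitions.get((state, sym), []):
                new_alpha[nxt] = new_alpha.get(nxt, 0.0) + p * w
        alpha = new_alpha
    return sum(p for state, p in alpha.items() if state in final)

# hypothetical two-state coin HMM (weights made up for illustration)
toy = {
    (0, 'H'): [(0, 0.4), (1, 0.1)],
    (0, 'T'): [(0, 0.2), (1, 0.3)],
    (1, 'H'): [(1, 0.5)],
    (1, 'T'): [(0, 0.2), (1, 0.3)],
}
```

For x = "HT" with initial state 0 and final state {1}, the two accepting paths 0→0→1 and 0→1→1 contribute 0.4·0.3 + 0.1·0.3 = 0.15.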
P2: Most Likely Path
Algorithm: shortest-path problem in the $(\max, \times)$ semiring.
- any shortest-path algorithm applied to the result of the composition $X \circ H$.
- in particular, since the automaton $X \circ H$ is acyclic, with a linear-time shortest-path algorithm, the total complexity is in $O(|X \circ H|) = O(|X||H|)$.
- a traditional solution is to use the Viterbi algorithm.
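A minimal Viterbi sketch: the same forward pass as for evaluation, with sums replaced by max and the best predecessor recorded. The (state, symbol) → [(next_state, weight)] encoding and the toy weights are again illustrative assumptions:

```python
def viterbi(transitions, init, final, x):
    """Most likely path: shortest-distance computation in the (max, x) semiring.

    transitions: dict mapping (state, symbol) to [(next_state, weight)] pairs.
    Returns (probability of the best accepting path, its state sequence).
    """
    best = {init: (1.0, [init])}
    for sym in x:
        new_best = {}
        for state, (p, path) in best.items():
            for nxt, w in transitions.get((state, sym), []):
                cand = p * w
                if nxt not in new_best or cand > new_best[nxt][0]:
                    new_best[nxt] = (cand, path + [nxt])
        best = new_best
    reachable_finals = {s: v for s, v in best.items() if s in final}
    if not reachable_finals:
        return 0.0, []
    s = max(reachable_finals, key=lambda q: reachable_finals[q][0])
    return reachable_finals[s]

# hypothetical two-state coin HMM (weights made up for illustration)
toy = {
    (0, 'H'): [(0, 0.4), (1, 0.1)],
    (0, 'T'): [(0, 0.2), (1, 0.3)],
    (1, 'H'): [(1, 0.5)],
    (1, 'T'): [(0, 0.2), (1, 0.3)],
}
```

For x = "HT", the best accepting path is 0→0→1 with probability 0.4·0.3 = 0.12, beating 0→1→1 at 0.1·0.3 = 0.03.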
P3: Parameter Estimation (Baum, 1972)
Baum-Welch algorithm: special case of EM applied to HMMs. Here, θ represents the transition weights w[e]; the latent or hidden variables are the paths π.
M-step:
$$\operatorname{argmax}_\theta \sum_\pi p_{\theta'}[\pi \mid x] \log p_\theta[x, \pi] \quad \text{subject to} \quad \forall q \in Q, \; \sum_{e \in E[q]} w_\theta[e] = 1,$$
where θ′ is the current parameter, Q the set of states, and E[q] the set of transitions leaving state q.
P3: Parameter Estimation
Using Lagrange multipliers $\lambda_q$, $q \in Q$, the problem consists of setting the following partial derivatives to zero:
$$\frac{\partial}{\partial w_\theta[e]} \Big[ \sum_\pi p_{\theta'}[\pi \mid x] \log p_\theta[x, \pi] - \sum_q \lambda_q \Big( \sum_{e \in E[q]} w_\theta[e] - 1 \Big) \Big] = 0.$$
Note that $p_\theta[x, \pi] \neq 0$ only for paths $\pi \in \Pi(x)$ labeled with x. In that case,
$$p_\theta[x, \pi] = \prod_e w_\theta[e]^{|\pi|_e},$$
where $|\pi|_e$ is the number of occurrences of e in π. Thus, the stationarity condition becomes
$$\sum_{\pi \in \Pi(x)} p_{\theta'}[\pi \mid x] \frac{|\pi|_e}{w_\theta[e]} - \lambda_{\mathrm{orig}[e]} = 0.$$
P3: Parameter Estimation
Thus, the equation with partial derivatives can be rewritten as
$$\begin{aligned}
\sum_{\pi \in \Pi(x)} p_{\theta'}[\pi \mid x] \frac{|\pi|_e}{w_\theta[e]} = \lambda_{\mathrm{orig}[e]}
&\iff w_\theta[e] = \frac{1}{\lambda_{\mathrm{orig}[e]}} \sum_{\pi \in \Pi(x)} p_{\theta'}[\pi \mid x] \, |\pi|_e \\
&\iff w_\theta[e] = \frac{1}{\lambda_{\mathrm{orig}[e]} \, p_{\theta'}[x]} \sum_{\pi \in \Pi(x)} p_{\theta'}[x, \pi] \, |\pi|_e.
\end{aligned}$$
P3: Parameter Estimation
But,
$$\sum_{\pi \in \Pi(x)} p_{\theta'}[x, \pi] \, |\pi|_e = \sum_{i=1}^{k} \alpha_{i-1}(\theta') \, w_{\theta'}[e] \, \beta_i(\theta'),$$
with
$$\alpha_i(\theta') = p_{\theta'}[x_1, \ldots, x_i, \, q = \mathrm{orig}[e]], \qquad \beta_i(\theta') = p_{\theta'}[x_{i+1}, \ldots, x_k \mid \mathrm{dest}[e]].$$
After division by $p_{\theta'}[x]$, this is the expected number of times through transition e. It can be computed using forward-backward algorithms, or, for us, shortest-distance algorithms.
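The expected transition counts above can be sketched with a forward-backward pass. The (state, symbol) → [(next_state, weight)] dictionary encoding and the toy weights are assumptions of this example, not the lecture's shortest-distance machinery:

```python
def expected_transition_counts(transitions, init, final, x):
    """Expected number of times through each transition e, given observation x.

    alpha[i][q] = p[x_1..x_i, state q]; beta[i][q] = p[x_{i+1}..x_k | state q];
    the count contributed at position i is alpha[i-1][orig[e]] * w[e]
    * beta[i][dest[e]] / p[x], as in the derivation above.
    """
    k = len(x)
    states = {init} | {nxt for arcs in transitions.values() for nxt, _ in arcs}
    # forward pass
    alpha = [{init: 1.0}]
    for sym in x:
        cur = {}
        for q, p in alpha[-1].items():
            for nxt, w in transitions.get((q, sym), []):
                cur[nxt] = cur.get(nxt, 0.0) + p * w
        alpha.append(cur)
    # backward pass
    beta = [dict() for _ in range(k + 1)]
    beta[k] = {q: (1.0 if q in final else 0.0) for q in states}
    for i in range(k - 1, -1, -1):
        for q in states:
            beta[i][q] = sum(w * beta[i + 1].get(nxt, 0.0)
                             for nxt, w in transitions.get((q, x[i]), []))
    p_x = sum(alpha[k].get(q, 0.0) for q in final)
    # accumulate posterior expected counts per transition
    counts = {}
    for i, sym in enumerate(x):
        for q, a in alpha[i].items():
            for nxt, w in transitions.get((q, sym), []):
                e = (q, sym, nxt)
                counts[e] = counts.get(e, 0.0) + a * w * beta[i + 1].get(nxt, 0.0) / p_x
    return counts

# hypothetical two-state coin HMM (weights made up for illustration)
toy = {
    (0, 'H'): [(0, 0.4), (1, 0.1)],
    (0, 'T'): [(0, 0.2), (1, 0.3)],
    (1, 'H'): [(1, 0.5)],
    (1, 'T'): [(0, 0.2), (1, 0.3)],
}
counts = expected_transition_counts(toy, 0, {1}, "HT")
```

The counts sum to the sequence length (each position uses exactly one transition), and renormalizing them per origin state gives the Baum-Welch update for $w_\theta[e]$.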
References
- L. E. Baum. An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes. Inequalities, 3:1-8, 1972.
- Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Vol. 39, No. 1 (1977), pp. 1-38.
- M. Jamshidian and R. I. Jennrich. Conjugate Gradient Acceleration of the EM Algorithm. Journal of the American Statistical Association, 88(421):221-228, 1993.
- Lawrence Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
- O'Sullivan. Alternating minimization algorithms: From Blahut-Arimoto to expectation-maximization. In Codes, Curves and Signals: Common Threads in Communications, A. Vardy (editor), Kluwer, 1998.
- C. F. Jeff Wu. On the Convergence Properties of the EM Algorithm. The Annals of Statistics, Vol. 11, No. 1 (Mar., 1983), pp. 95-103.