Weighted Finite-State Transducers in Computational Biology
Mehryar Mohri
Courant Institute of Mathematical Sciences
mohri@cims.nyu.edu
Joint work with Corinna Cortes (Google Research).
This Tutorial
Weighted finite-state transducers and software library
Pairwise alignments for bioinformatics
Learning weighted transducers
This Lecture
Expectation-Maximization (EM) algorithm
Hidden Markov Models
Problem
Data: sample $x_1, \ldots, x_m \in X$ drawn i.i.d. from a set $X$ according to some distribution $D$.
Problem: find a distribution $p$ out of a set $P$ that best estimates $D$.
Maximum Likelihood
Likelihood: probability of observing the sample under distribution $p \in P$, which, given the independence assumption, is
$\Pr[x_1, \ldots, x_m] = \prod_{i=1}^m p(x_i)$.
Principle: select the distribution maximizing the sample probability,
$p = \operatorname{argmax}_{p \in P} \prod_{i=1}^m p(x_i)$,
or
$p = \operatorname{argmax}_{p \in P} \sum_{i=1}^m \log p(x_i)$.
Example: Bernoulli Trials
Problem: find the most likely Bernoulli distribution, given a sequence of coin flips
H, T, T, H, T, H, T, H, H, H, T, T, ..., H.
Bernoulli distribution: $p(H) = \theta$, $p(T) = 1 - \theta$.
Likelihood: $l(p) = \log \theta^{N(H)} (1 - \theta)^{N(T)} = N(H) \log \theta + N(T) \log(1 - \theta)$.
Solution: $l$ is differentiable and concave;
$\frac{dl(p)}{d\theta} = \frac{N(H)}{\theta} - \frac{N(T)}{1 - \theta} = 0 \Longleftrightarrow \theta = \frac{N(H)}{N(H) + N(T)}$.
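As a quick numerical check of the closed-form solution, a minimal Python sketch (the flip sequence below is an illustrative assumption, not data from the slides):

# ML estimate for a Bernoulli model from a sequence of coin flips.
# The flip sequence is made up for illustration.
flips = ['H', 'T', 'T', 'H', 'T', 'H', 'T', 'H', 'H', 'H', 'T', 'T', 'H']

n_heads = flips.count('H')
n_tails = flips.count('T')

# Closed-form maximizer of N(H) log(theta) + N(T) log(1 - theta).
theta = n_heads / (n_heads + n_tails)
print(theta)  # 7/13, roughly 0.538, for this sequence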
Latent Variables
Definition: unobserved or hidden variables, as opposed to those directly available at training and test time.
example: mixture models, HMMs.
Why latent variables?
naturally unavailable variables: e.g., economic variables.
modeling: latent variables introduced to model dependencies.
ML with Latent Variables
Problem: with fully observed variables, ML is often straightforward, but with latent variables, the log-likelihood contains a sum that makes it harder to determine the best parameter values:
$L(\theta, x) = \log p_\theta[x] = \log \sum_z p_\theta[x, z] = \log \sum_z p_\theta[z \mid x]\, p_\theta[x]$.
Idea: use the current parameter values to estimate the latent variables, and use those to re-estimate the parameter values. This leads to the EM algorithm.
EM Theorem
Theorem: for any $x$ and parameter values $\theta$ and $\theta'$,
$L(\theta', x) - L(\theta, x) \geq \sum_z p_\theta[z \mid x] \log p_{\theta'}[x, z] - \sum_z p_\theta[z \mid x] \log p_\theta[x, z]$.
Proof:
$L(\theta', x) - L(\theta, x) = \log p_{\theta'}[x] - \log p_\theta[x]$
$= \sum_z p_\theta[z \mid x] \log p_{\theta'}[x] - \sum_z p_\theta[z \mid x] \log p_\theta[x]$
$= \sum_z p_\theta[z \mid x] \log \frac{p_{\theta'}[x, z]}{p_{\theta'}[z \mid x]} - \sum_z p_\theta[z \mid x] \log \frac{p_\theta[x, z]}{p_\theta[z \mid x]}$
$= \sum_z p_\theta[z \mid x] \log \frac{p_\theta[z \mid x]}{p_{\theta'}[z \mid x]} + \sum_z p_\theta[z \mid x] \log p_{\theta'}[x, z] - \sum_z p_\theta[z \mid x] \log p_\theta[x, z]$
$= D(p_\theta[z \mid x] \,\|\, p_{\theta'}[z \mid x]) + \sum_z p_\theta[z \mid x] \log p_{\theta'}[x, z] - \sum_z p_\theta[z \mid x] \log p_\theta[x, z]$
$\geq \sum_z p_\theta[z \mid x] \log p_{\theta'}[x, z] - \sum_z p_\theta[z \mid x] \log p_\theta[x, z]$.
EM Algorithm (Dempster, Laird, and Rubin, 1977; Wu, 1983)
Algorithm: maximum-likelihood meta-algorithm for models with latent variables.
E-step: $q_{t+1} \leftarrow p_{\theta_t}[z \mid x]$.
M-step: $\theta_{t+1} \leftarrow \operatorname{argmax}_\theta \sum_z q_{t+1}(z \mid x) \log p_\theta[x, z]$.
Interpretation:
E-step: posterior probability of the latent variables given the observed variables and the current parameter.
M-step: maximum-likelihood parameter given all data.
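A minimal sketch of this E-/M-step alternation for the classic two-biased-coins setting (the data, initialization, and iteration count are illustrative assumptions, not part of the slides): each observation is the number of heads in n tosses of one of two coins chosen uniformly at random, the latent variable z is which coin was used, and the M-step re-estimates the two biases.

import math
import numpy as np

# EM for a mixture of two biased coins: x_i = number of heads in n flips
# of a coin chosen uniformly at random (latent z in {0, 1}).
n = 10
x = np.array([9, 8, 2, 1, 7, 3, 8, 2, 9, 1])   # heads counts (made-up data)
theta = np.array([0.6, 0.4])                   # initial biases of the two coins

for t in range(50):
    # E-step: q[i, k] = p_theta[z = k | x_i]; the uniform prior over coins cancels.
    lik = np.array([[math.comb(n, xi) * th**xi * (1 - th)**(n - xi) for th in theta]
                    for xi in x])
    q = lik / lik.sum(axis=1, keepdims=True)
    # M-step: maximize sum_i sum_k q[i, k] log p_theta[x_i, z = k] over theta.
    theta = (q * x[:, None]).sum(axis=0) / (q * n).sum(axis=0)

print(theta)  # separates into roughly [0.8, 0.2] for this data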
EM Algorithm - Notes
Applications:
mixture models, e.g., Gaussian mixtures. Latent variables: which model generated the points.
HMMs (Baum-Welch algorithm).
Notes:
positive: each iteration increases the likelihood. No parameter tuning.
negative: can converge to a local optimum of the likelihood rather than the global maximum. Note that the likelihood may converge while the parameters do not. Dependence on the initial parameter values.
EM = Coordinate Ascent
Theorem: EM = coordinate ascent applied to
$A(q, \theta) = \sum_z q[z \mid x] \log \frac{p_\theta[x, z]}{q[z \mid x]} = \sum_z q[z \mid x] \log p_\theta[x, z] - \sum_z q[z \mid x] \log q[z \mid x]$.
E-step: $q_{t+1} \leftarrow \operatorname{argmax}_q A(q, \theta_t)$.
M-step: $\theta_{t+1} \leftarrow \operatorname{argmax}_\theta A(q_{t+1}, \theta)$.
Proof: the E-step follows from the two identities
$A(q, \theta) = \sum_z q[z \mid x] \log \frac{p_\theta[x, z]}{q[z \mid x]} \leq \log \sum_z q[z \mid x] \frac{p_\theta[x, z]}{q[z \mid x]} = \log \sum_z p_\theta[x, z] = \log p_\theta[x] = L(\theta, x)$ (by Jensen's inequality),
$A(p_\theta[z \mid x], \theta) = \sum_z p_\theta[z \mid x] \log p_\theta[x] = \log p_\theta[x] = L(\theta, x)$,
so the maximizing $q$ is $q = p_{\theta_t}[z \mid x]$, exactly the EM E-step.
EM = Coordinate Ascent
Proof (cont.): finally, note that
$\operatorname{argmax}_\theta A(q_{t+1}, \theta) = \operatorname{argmax}_\theta \sum_z q_{t+1}[z \mid x] \log p_\theta[x, z]$,
which coincides with the M-step of EM.
EM - Extensions and Variants (Dempster et al., 1977; Jamshidian and Jennrich, 1993; Wu, 1983)
Generalized EM (GEM): full maximization is not necessary at every step, since any (strict) increase of the auxiliary function guarantees an increase of the log-likelihood.
Sparse EM: posterior probabilities are computed only at some points in the E-step.
This Lecture
Expectation-Maximization (EM) algorithm
Hidden Markov Models
Motivation
Data: sample of $m$ sequences over an alphabet $\Sigma$ drawn i.i.d. according to some distribution $D$:
$x^i_1, x^i_2, \ldots, x^i_{k_i}$, $i = 1, \ldots, m$.
Problem: find the sequence model that best estimates the distribution $D$.
Latent variables: the observed sequences may have been generated by a model with states and transitions that are hidden or unobserved.
Example
Observations: sequences of Heads and Tails.
observed sequence: H, T, T, H, T, H, T, H, H, H, T, T, ..., H.
unobserved paths: (0, H, 0), (0, T, 1), (1, T, 1), ..., (2, H, 2).
Model: (figure: three-state HMM over states 0, 1, 2 with transitions labeled symbol/weight, e.g., H/.1, H/.3, H/.6, T/.4, T/.5, T/.7)
Hidden Markov Models
Definition: probabilistic automata, generative view.
discrete case: finite alphabet $\Sigma$.
transition probability: $\Pr[\text{transition } e]$.
emission probability: $\Pr[\text{emission } a \mid e]$.
simplification: for us, $w[e] = \Pr[a \mid e] \Pr[e]$.
Illustration: (figure: weighted automaton with transitions labeled a/.3, a/.2, a/.2, b/.1, c/.2 and final weight /.5)
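In this weighted-automaton view, such a model is just a finite set of weighted transitions. One possible plain encoding, mirroring the illustration above (a sketch; the tuple format and numbers are illustrative assumptions, not the actual library representation):

# (origin state, emitted symbol, weight, destination state),
# with w[e] = Pr[a | e] * Pr[e] folded into a single weight per transition.
transitions = [
    (0, 'a', 0.3, 0),
    (0, 'a', 0.2, 1),
    (1, 'a', 0.2, 0),
    (1, 'b', 0.1, 1),
    (1, 'c', 0.2, 1),
]
final_weight = {1: 0.5}   # weight of stopping in state 1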
Other Models
Outputs at states (instead of transitions): equivalent model, dual representation.
Non-discrete case: outputs in $\mathbb{R}^N$.
fixed probabilities replaced by distributions, e.g., Gaussian mixtures.
application to acoustic modeling.
Three Problems (Rabiner, 1989)
Problem 1: given a sequence $x_1, \ldots, x_k$ and a hidden Markov model $p_\theta$, compute $p_\theta[x_1, \ldots, x_k]$.
Problem 2: given a sequence $x_1, \ldots, x_k$ and a hidden Markov model $p_\theta$, find the most likely path $\pi = e_1, \ldots, e_r$ that generated that sequence.
Problem 3: estimate the parameters $\theta$ of a hidden Markov model via ML:
$\theta = \operatorname{argmax}_\theta p_\theta[x]$.
P1: Evaluation
Algorithm: let $X$ be a finite automaton representing the sequence $x = x_1 \cdots x_k$ and let $H$ denote the current hidden Markov model. Then,
$p_\theta[x] = \sum_{\pi \in X \circ H} w[\pi]$.
Thus, it can be computed using composition and a shortest-distance algorithm in time $O(|X \circ H|)$.
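The same quantity can also be computed with the textbook forward recursion over the states of $H$; a minimal sketch (the two-state toy model, its transition/emission matrices, and the symbol encoding are assumptions for illustration):

import numpy as np

# Forward algorithm: p_theta[x_1, ..., x_k] for a small discrete HMM.
start = np.array([0.5, 0.5])                 # initial state probabilities
trans = np.array([[0.7, 0.3],                # trans[q, q'] = Pr[q -> q']
                  [0.4, 0.6]])
emit  = np.array([[0.9, 0.1],                # emit[q, a] = Pr[a | q]
                  [0.2, 0.8]])
sym   = {'H': 0, 'T': 1}

def sequence_probability(x):
    # alpha[q] = p[x_1 .. x_i, state q] after reading i symbols.
    alpha = start * emit[:, sym[x[0]]]
    for a in x[1:]:
        alpha = (alpha @ trans) * emit[:, sym[a]]
    return alpha.sum()

print(sequence_probability("HTTH"))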
P2: Most Likely Path
Algorithm: shortest-path problem in the $(\max, \times)$ semiring.
any shortest-path algorithm applied to the result of the composition $X \circ H$. In particular, since the automaton is acyclic, with a linear-time shortest-path algorithm, the total complexity is in $O(|X \circ H|) = O(|X|\,|H|)$.
a traditional solution is to use the Viterbi algorithm.
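A corresponding sketch of the Viterbi recursion, i.e. the same dynamic program computed in the $(\max, \times)$ semiring, over the same kind of toy model (again, all numbers are illustrative assumptions):

import numpy as np

# Viterbi algorithm: most likely state path for a small discrete HMM.
start = np.array([0.5, 0.5])
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
emit  = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
sym   = {'H': 0, 'T': 1}

def viterbi(x):
    # v[q] = best probability of any path reading x_1 .. x_i and ending in q.
    v = start * emit[:, sym[x[0]]]
    back = []
    for a in x[1:]:
        scores = v[:, None] * trans              # scores[q, q'] over predecessors q
        back.append(scores.argmax(axis=0))       # best predecessor for each q'
        v = scores.max(axis=0) * emit[:, sym[a]]
    # Reconstruct the best path by following the back-pointers.
    path = [int(v.argmax())]
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    return list(reversed(path)), v.max()

print(viterbi("HTTH"))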
P3: Parameter Estimation (Baum, 1972)
Baum-Welch algorithm: special case of EM applied to HMMs. Here, $\theta$ represents the transition weights $w[e]$. The latent or hidden variables are the paths $\pi$.
M-step:
$\operatorname{argmax}_\theta \sum_\pi p_{\theta'}[\pi \mid x] \log p_\theta[x, \pi]$
subject to $\forall q \in Q, \ \sum_{e \in E[q]} w_\theta[e] = 1$,
where $\theta'$ denotes the current parameter values.
P3: Parameter Estimation
Using Lagrange multipliers $\lambda_q$, $q \in Q$, the problem consists of setting the following partial derivatives to zero:
$\frac{\partial}{\partial w_\theta[e]} \Big[ \sum_\pi p_{\theta'}[\pi \mid x] \log p_\theta[x, \pi] - \sum_q \lambda_q \big( \sum_{e \in E[q]} w_\theta[e] - 1 \big) \Big] = 0$
$\Longleftrightarrow \sum_\pi p_{\theta'}[\pi \mid x] \frac{\partial \log p_\theta[x, \pi]}{\partial w_\theta[e]} - \lambda_{\mathrm{orig}[e]} = 0$.
Note that $p_\theta[x, \pi] \neq 0$ only for paths $\pi$ labeled with $x$, i.e., $\pi \in \Pi(x)$. In that case,
$p_\theta[x, \pi] = \prod_e w_\theta[e]^{|\pi|_e}$,
where $|\pi|_e$ is the number of occurrences of $e$ in $\pi$.
P3: Parameter Estimation
Thus, the equation with partial derivatives can be rewritten as
$\sum_{\pi \in \Pi(x)} p_{\theta'}[\pi \mid x] \frac{|\pi|_e}{w_\theta[e]} = \lambda_{\mathrm{orig}(e)}$
$\Longleftrightarrow w_\theta[e] = \frac{1}{\lambda_{\mathrm{orig}(e)}} \sum_{\pi \in \Pi(x)} p_{\theta'}[\pi \mid x]\, |\pi|_e$
$\Longleftrightarrow w_\theta[e] = \frac{1}{\lambda_{\mathrm{orig}(e)}\, p_{\theta'}[x]} \sum_{\pi \in \Pi(x)} p_{\theta'}[x, \pi]\, |\pi|_e$.
P3: Parameter Estimation
But,
$\sum_{\pi \in \Pi(x)} p_{\theta'}[x, \pi]\, |\pi|_e = \sum_{i=1}^k \alpha_{i-1}(\theta')\, w_{\theta'}[e]\, \beta_i(\theta')$,
with
$\alpha_i(\theta') = p_{\theta'}[x_1, \ldots, x_i, q = \mathrm{orig}[e]]$
$\beta_i(\theta') = p_{\theta'}[x_i, \ldots, x_k \mid \mathrm{dest}[e]]$.
Expected number of times through transition $e$. Can be computed using forward-backward algorithms or, for us, shortest-distance algorithms.
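A minimal sketch of accumulating these expected transition counts with the forward-backward recursions, over the same illustrative two-state toy model as before (emissions are attached to the destination state here, one of several equivalent conventions; all numbers are assumptions):

import numpy as np

# Expected number of times through each transition (q, q'):
# sum_i alpha_{i-1}[q] * trans[q, q'] * emit[q', x_i] * beta_i[q'] / p[x].
start = np.array([0.5, 0.5])
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
emit  = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
obs = [0, 1, 1, 0]                      # encoded sequence, e.g. H T T H
k = len(obs)

alpha = np.zeros((k, 2))                # alpha[i, q]: prefix probability, ending in q
alpha[0] = start * emit[:, obs[0]]
for i in range(1, k):
    alpha[i] = (alpha[i - 1] @ trans) * emit[:, obs[i]]

beta = np.zeros((k, 2))                 # beta[i, q]: suffix probability from state q
beta[-1] = 1.0
for i in range(k - 2, -1, -1):
    beta[i] = trans @ (emit[:, obs[i + 1]] * beta[i + 1])

p_x = alpha[-1].sum()
counts = np.zeros((2, 2))               # expected counts per transition (q, q')
for i in range(1, k):
    counts += np.outer(alpha[i - 1], emit[:, obs[i]] * beta[i]) * trans / p_x

print(counts)  # re-estimated weights: counts normalized per origin state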
Continuous Models (Rabiner and Juang, 1993)
Graph topology: 3-state HMM model for each CD phone in speech.
(figure: linear three-state topology 0 -> 1 -> 2 -> 3 with self-loops, transitions labeled d0, d1, d2 and final output the CD phone ae_{b,d})
Interpretation: beginning, middle, and end of the CD phone.
Continuous case: transition weights based on distributions over feature vectors in $\mathbb{R}^N$, typically with $N = 39$.
Distributions
Simple cases: e.g., single speaker, single Gaussian distribution
$\mathcal{N}(x; \mu, \sigma) = \frac{1}{(2\pi)^{N/2} |\sigma|^{1/2}} \exp\big( -\tfrac{1}{2} (x - \mu)^\top \sigma^{-1} (x - \mu) \big)$,
covariance matrix $\sigma$ typically diagonal.
General case: mixtures of Gaussians
$\sum_{k=1}^M \lambda_k \mathcal{N}(x; \mu_k, \sigma_k)$,
with $\lambda_k \geq 0$ and $\sum_{k=1}^M \lambda_k = 1$. Typically, $M = 16$.
GMMs - Illustration
(figure)
Estimation Algorithm
Baum-Welch algorithm:
maximum likelihood of the joint distribution $\Pr[o, p]$.
requires segmentation to match feature vectors with HMM transitions.
generalizes to the continuous case with Gaussians and GMMs.
Questions: segmentation, model initialization.
Single Gaussian Distribution
Problem: find the most likely Gaussian distribution, given a sequence of real-valued observations
3.18, 2.35, .95, 1.175, ...
Normal distribution:
$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\big( -\frac{(x - \mu)^2}{2\sigma^2} \big)$.
Likelihood:
$l(p) = -\frac{m}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^m (x_i - \mu)^2$.
Solution: $l$ is differentiable and concave;
$\frac{\partial l(p)}{\partial \mu} = 0 \Longleftrightarrow \mu = \frac{1}{m} \sum_{i=1}^m x_i$
$\frac{\partial l(p)}{\partial \sigma^2} = 0 \Longleftrightarrow \sigma^2 = \frac{1}{m} \sum_{i=1}^m x_i^2 - \mu^2$.
Differentials
General identities:
log of determinant: $\frac{\partial \log |\sigma^{-1}|}{\partial \sigma^{-1}} = \sigma$.
bilinear form: $\frac{\partial (x^\top \sigma x)}{\partial \sigma} = x x^\top$.
Single Gaussian - ML Solution
Log-likelihood: for each $x_i$ in the sample $x_1, \ldots, x_m$,
$\Pr[x_i] = \frac{1}{(2\pi)^{N/2} |\sigma|^{1/2}} \exp\big( -\tfrac{1}{2} (x_i - \mu)^\top \sigma^{-1} (x_i - \mu) \big)$,
thus
$L = \sum_{i=1}^m \log \Pr[x_i] = \sum_{i=1}^m \Big[ -\frac{N}{2} \log(2\pi) + \frac{1}{2} \log |\sigma^{-1}| - \frac{1}{2} (x_i - \mu)^\top \sigma^{-1} (x_i - \mu) \Big]$.
ML solution:
$\frac{\partial L}{\partial \mu} = \sum_{i=1}^m \sigma^{-1} (x_i - \mu) = 0 \Longleftrightarrow \mu = \frac{1}{m} \sum_{i=1}^m x_i$.
$\frac{\partial L}{\partial \sigma^{-1}} = \sum_{i=1}^m \frac{1}{2} \big( \sigma - (x_i - \mu)(x_i - \mu)^\top \big) = 0 \Longleftrightarrow \sigma = \frac{1}{m} \sum_{i=1}^m (x_i - \mu)(x_i - \mu)^\top$.
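These closed-form solutions are simply the empirical mean and the (biased, $1/m$) empirical covariance; a quick numpy check on synthetic data (an illustrative sketch, not data from the slides):

import numpy as np

# ML estimates for a single multivariate Gaussian.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))          # m = 1000 samples in R^3, for illustration

mu = x.mean(axis=0)                     # mu = (1/m) sum_i x_i
centered = x - mu
sigma = centered.T @ centered / len(x)  # sigma = (1/m) sum_i (x_i - mu)(x_i - mu)^T

print(mu)
print(sigma)    # close to the 3x3 identity for this synthetic sample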
GMMs - EM Algorithm
Mixture of $M$ Gaussians: $p_\theta[x] = \sum_{k=1}^M \lambda_k \mathcal{N}(x; \mu_k, \sigma_k)$, with log-likelihood
$L = \sum_{i=1}^m \log \sum_{k=1}^M \lambda_k \mathcal{N}(x_i; \mu_k, \sigma_k)$.
EM algorithm: let $p^t_{i,k} = \mathcal{N}(x_i; \mu^t_k, \sigma^t_k)$.
E-step:
$q^{t+1}_{i,k} = p_{\theta^t}[z = k \mid x_i] = \frac{\lambda^t_k\, p^t_{i,k}}{\sum_{k'=1}^M \lambda^t_{k'}\, p^t_{i,k'}}$.
M-step:
$\mu^{t+1}_k = \frac{\sum_{i=1}^m q^{t+1}_{i,k}\, x_i}{\sum_{i=1}^m q^{t+1}_{i,k}}$
$\sigma^{t+1}_k = \frac{\sum_{i=1}^m q^{t+1}_{i,k}\, (x_i - \mu^{t+1}_k)(x_i - \mu^{t+1}_k)^\top}{\sum_{i=1}^m q^{t+1}_{i,k}}$
$\lambda^{t+1}_k = \frac{1}{m} \sum_{i=1}^m q^{t+1}_{i,k}$.
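A compact numpy sketch of these updates (the initialization scheme, diagonal regularization term, and iteration count are illustrative assumptions; a practical implementation would also work in log space):

import numpy as np

def gaussian_pdf(x, mu, sigma):
    # N(x; mu, sigma) evaluated at each row of x.
    d = x.shape[1]
    diff = x - mu
    inv = np.linalg.inv(sigma)
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)
    norm = (2 * np.pi) ** (d / 2) * np.linalg.det(sigma) ** 0.5
    return np.exp(-0.5 * quad) / norm

def gmm_em(x, M, iters=100):
    m, d = x.shape
    lam = np.full(M, 1.0 / M)                       # mixture weights
    mu = x[np.random.choice(m, M, replace=False)]   # initialize means at data points
    sigma = np.array([np.cov(x.T) + 1e-6 * np.eye(d) for _ in range(M)])
    for _ in range(iters):
        # E-step: q[i, k] proportional to lam_k N(x_i; mu_k, sigma_k).
        p = np.stack([lam[k] * gaussian_pdf(x, mu[k], sigma[k]) for k in range(M)], axis=1)
        q = p / p.sum(axis=1, keepdims=True)
        # M-step: weighted means, weighted covariances, mixture weights.
        nk = q.sum(axis=0)
        mu = (q.T @ x) / nk[:, None]
        for k in range(M):
            diff = x - mu[k]
            sigma[k] = (q[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(d)
        lam = nk / m
    return lam, mu, sigma

Calling gmm_em(features, M=16) on a matrix of feature vectors mirrors the typical M = 16 setting mentioned above.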
GMMs - EM Algorithm
Proof: M-step. Auxiliary function:
$l = \sum_{k=1}^M \sum_{i=1}^m p_{\theta'}[z = k \mid x_i] \log p_\theta[x_i, z = k] = \sum_{k=1}^M \sum_{i=1}^m q_{i,k} \log p_\theta[x_i, z = k]$
$= \sum_{k=1}^M \sum_{i=1}^m q_{i,k} \Big[ \log \lambda_k - \frac{1}{2} (x_i - \mu_k)^\top \sigma_k^{-1} (x_i - \mu_k) - \frac{N}{2} \log(2\pi) + \frac{1}{2} \log |\sigma_k^{-1}| \Big]$.
Optimization for fixed $q$ and $\sum_{k=1}^M \lambda_k = 1$ (with Lagrange multiplier $\beta$):
$\frac{\partial l}{\partial \mu_k} = \sum_{i=1}^m q_{i,k}\, \sigma_k^{-1} (x_i - \mu_k) = 0$
$\frac{\partial l}{\partial \sigma_k^{-1}} = \frac{1}{2} \sum_{i=1}^m q_{i,k} \big( \sigma_k - (x_i - \mu_k)(x_i - \mu_k)^\top \big) = 0$
$\frac{\partial l}{\partial \lambda_k} = \frac{1}{\lambda_k} \sum_{i=1}^m q_{i,k} - \beta = 0$.
References
L. E. Baum. An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes. Inequalities, 3:1-8, 1972.
Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, Vol. 39, No. 1, pp. 1-38, 1977.
Mortaza Jamshidian and Robert I. Jennrich. Conjugate Gradient Acceleration of the EM Algorithm. Journal of the American Statistical Association, 88(421):221-228, 1993.
Lawrence Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
O'Sullivan. Alternating Minimization Algorithms: From Blahut-Arimoto to Expectation-Maximization. In Codes, Curves and Signals: Common Threads in Communications, A. Vardy (editor), Kluwer, 1998.
C. F. Jeff Wu. On the Convergence Properties of the EM Algorithm. The Annals of Statistics, Vol. 11, No. 1, pp. 95-103, 1983.