Expectation-Maximization Algorithm.


Petr Pošík
Czech Technical University in Prague, Faculty of Electrical Engineering, Dept. of Cybernetics

Contents:
- Maximum likelihood estimation: Likelihood, Incomplete data, General EM
- K-means: Algorithm, Illustration, EM view
- EM for Mixtures: General mixture, EM for Mixtures, GMM, EM for GMM
- EM for HMM: HMM, HMM learning, Sufficient statistics, Baum-Welch
- Summary: Competencies

Maximum likelihood estimation

Likelihood maximization

Let's have a random variable X with probability distribution p_X(x | θ). This emphasizes that the distribution is parameterized by θ ∈ Θ, i.e. the distribution comes from a certain parametric family. Θ is the space of possible parameter values.

Learning task: assume the parameters θ are unknown, but we have an i.i.d. training dataset T = {x_1, ..., x_n} which can be used to estimate the unknown parameters.

The probability of observing dataset T given some parameter values θ is

  p(T | θ) = ∏_{j=1}^{n} p_X(x_j | θ) =: L(θ; T).

This probability can be interpreted as the degree to which the model parameters θ conform to the data T. It is thus called the likelihood of the parameters θ w.r.t. the data T. The optimal θ* is obtained by maximizing the likelihood:

  θ* = arg max_{θ ∈ Θ} L(θ; T) = arg max_{θ ∈ Θ} ∏_{j=1}^{n} p_X(x_j | θ).

Since arg max_x f(x) = arg max_x log f(x), we often maximize the log-likelihood l(θ; T) = log L(θ; T):

  θ* = arg max_{θ ∈ Θ} l(θ; T) = arg max_{θ ∈ Θ} log ∏_{j=1}^{n} p_X(x_j | θ) = arg max_{θ ∈ Θ} ∑_{j=1}^{n} log p_X(x_j | θ),

which is often easier than maximization of L.

Incomplete data

Assume we cannot observe the objects completely:
- r.v. X describes the observable part,
- r.v. K describes the unobservable, hidden part.
We assume there is an underlying distribution p_XK(x, k | θ) of objects (x, k).

Learning task: we want to estimate the model parameters θ, but the training set contains i.i.d. samples for the observable part only, i.e. T_X = {x_1, ..., x_n}. (Still, there also exists a hidden, unobservable dataset T_K = {k_1, ..., k_n}.)

If we had complete data (T_X, T_K), we could directly optimize l(θ; T_X, T_K) = log p(T_X, T_K | θ). But we do not have access to T_K.

If we would like to maximize l(θ; T_X) = log p(T_X | θ) = log ∑_{T_K} p(T_X, T_K | θ), the summation inside log() results in complicated expressions, or we would have to use numerical methods.

Our state of knowledge about T_K is given by p(T_K | T_X, θ). The complete-data likelihood L(θ; T_X, T_K) = p(T_X, T_K | θ) is a random variable, since T_K is unknown and random, but governed by the underlying distribution. Instead of optimizing it directly, we consider its expected value under the posterior distribution over the latent variables (E-step), and then maximize this expectation (M-step).
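The complete-data case above can be checked numerically in a few lines. The following sketch is not part of the original slides; the univariate Gaussian family, the synthetic dataset, and the parameter grid are all illustrative choices. It evaluates l(θ; T) and confirms that the closed-form ML estimates (the sample mean and the 1/n standard deviation) are not beaten by any nearby grid point:

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.normal(loc=2.0, scale=1.5, size=200)   # i.i.d. training data x_1..x_n (illustrative)

def log_likelihood(mu, sigma, data):
    """l(theta; T) = sum_j log p_X(x_j | theta) for a univariate Gaussian."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - mu)**2 / (2 * sigma**2))

# Closed-form maximum-likelihood estimates for the Gaussian family.
mu_hat = T.mean()
sigma_hat = T.std()          # np.std uses the 1/n form by default, which is the MLE

best = (mu_hat, sigma_hat)
# Sanity check: no point on a coarse grid around the MLE has a higher log-likelihood.
for mu in np.linspace(mu_hat - 1, mu_hat + 1, 21):
    for sigma in np.linspace(max(sigma_hat - 1, 0.1), sigma_hat + 1, 21):
        if log_likelihood(mu, sigma, T) > log_likelihood(*best, T):
            best = (mu, sigma)

print("closed-form MLE:", (mu_hat, sigma_hat), "grid maximum:", best)
```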

Expectation-Maximization algorithm

EM algorithm: a general method for finding the MLE of probability distribution parameters from a given dataset when the data are incomplete (hidden variables, or missing values).
Hidden variables: mixture models, Hidden Markov models, ...
It is a family of algorithms, or a recipe to derive an ML estimation algorithm for various kinds of probabilistic models.

1. Pretend that you know θ. (Use some initial guess θ^(0).) Set the iteration counter i = 1.
2. E-step: use the current parameter values θ^(i-1) to find the posterior distribution of the latent variables, p(T_K | T_X, θ^(i-1)). Use this posterior distribution to find the expectation of the complete-data log-likelihood evaluated for some general parameter values θ:
   Q(θ, θ^(i-1)) = ∑_{T_K} p(T_K | T_X, θ^(i-1)) log p(T_X, T_K | θ).
3. M-step: maximize the expectation, i.e. compute an updated estimate of θ as
   θ^(i) = arg max_{θ ∈ Θ} Q(θ, θ^(i-1)).
4. Check for convergence: finish, or advance the iteration counter, i = i + 1, and repeat from 2.

EM algorithm features

Pros:
- Among the possible optimization methods, EM exploits the structure of the model.
- For p_{X|K} from the exponential family: the M-step can be done analytically and there is a unique optimizer.
- The expected value in the E-step can be expressed as a function of θ without solving it explicitly for each θ.
- p_X(T_X | θ^(i+1)) ≥ p_X(T_X | θ^(i)), i.e. the process finds a local optimum.
- Works well in practice.

Cons:
- Not guaranteed to find the globally optimal estimate.
- MLE can overfit; use MAP instead (EM can be used as well).
- Convergence may be slow.
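To make the E-step/M-step recipe concrete, here is a toy sketch on a problem that is not covered in these slides: two biased coins with unknown head probabilities; in every trial one of them is picked with equal probability (the hidden variable), flipped m times, and only the number of heads is recorded. All numbers below are made up for illustration.

```python
import numpy as np
from math import comb

# Observed data: number of heads in each trial of m flips; which coin was used is hidden.
m = 10
heads = np.array([5, 9, 8, 4, 7])          # illustrative observations
theta = np.array([0.6, 0.5])               # initial guess theta^(0) = (theta_A, theta_B)

def binom_pmf(h, m, p):
    return comb(m, h) * p**h * (1 - p)**(m - h)

for i in range(50):
    # E-step: posterior p(coin | x_j, theta^(i-1)) for each trial (equal priors assumed).
    lik = np.array([[binom_pmf(h, m, p) for p in theta] for h in heads])
    resp = lik / lik.sum(axis=1, keepdims=True)      # shape (n_trials, 2)

    # M-step: maximize the expected complete-data log-likelihood.
    # For binomial components this is a responsibility-weighted fraction of heads per coin.
    theta = (resp * heads[:, None]).sum(axis=0) / (resp.sum(axis=0) * m)

print("estimated head probabilities:", theta)
```

Each pass through the loop performs one E-step (posterior over which coin produced each trial) and one M-step (weighted ML update of the two head probabilities), so the observed-data likelihood never decreases.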

K-means

K-means algorithm

Clustering is one of the tasks of unsupervised learning. The K-means algorithm for clustering [Mac67]: K is the a priori given number of clusters.

Algorithm:
1. Choose K centroids μ_k (in almost any way, but every cluster should have at least one example).
2. For all x, assign x to its closest μ_k.
3. Compute the new position of each centroid μ_k based on all examples x_i, i ∈ I_k, in cluster k.
4. If the positions of the centroids changed, repeat from 2.

Algorithm features:
- The algorithm minimizes the function (intra-cluster variance)

    J = ∑_{j=1}^{k} ∑_{i=1}^{n_j} ||x_{i,j} - c_j||^2,   (1)

  where x_{i,j} is the i-th example in cluster j and c_j is the centroid of cluster j.
- The algorithm is fast, but each time it can converge to a different local optimum of J.

[Mac67] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297, Berkeley, 1967. University of California Press.

Illustration

[Figure: K-means clustering, iteration 1]

[Figures: K-means clustering, iterations 2-5]

[Figure: K-means clustering, iteration 6]

K-means: EM view

Assume:
- An object can be in one of the K states with equal probabilities.
- All p_{X|K}(x | k) are isotropic Gaussians: p_{X|K}(x | k) = N(x | μ_k, σI).

Recognition (part of the E-step): the task is to decide the state k for each x, assuming all μ_k are known. The Bayesian strategy (which minimizes the probability of error) chooses the cluster whose center is closest to the observation x:

  q*(x) = arg min_{k ∈ K} (x - μ_k)^2.

If the μ_k, k ∈ K, are not known, it is a parameterized strategy q_Θ(x), where Θ = (μ_k)_{k=1}^{K}. Deciding the state k for each x assuming known μ_k is actually the computation of a degenerate probability distribution p(T_K | T_X, θ^(i-1)), i.e. the first part of the E-step.

Learning (the rest of the E-step and the M-step): find the maximum-likelihood estimates of μ_k based on the known (x_1, k_1), ..., (x_l, k_l):

  μ_k = (1 / |I_k|) ∑_{i ∈ I_k} x_i,

where I_k is the set of indices of the training examples (currently) belonging to state k. This completes the E-step and implements the M-step.
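A minimal numpy sketch of the K-means procedure described above; the synthetic data, the choice K = 3, and the initialization from random training points are illustrative assumptions, not part of the lecture:

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative 2-D data drawn around three centres (not the dataset from the figures).
X = np.vstack([rng.normal(c, 0.8, size=(100, 2)) for c in ([0, 0], [5, 5], [0, 6])])
K = 3

# 1. Choose K centroids (here: K distinct training points picked at random).
mu = X[rng.choice(len(X), size=K, replace=False)]

for _ in range(100):
    # 2. Assign every x to its closest centroid.
    dist = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)      # shape (n, K)
    labels = dist.argmin(axis=1)

    # 3. Recompute each centroid as the mean of its cluster
    #    (keeping the old centroid if a cluster happens to become empty).
    new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                       for k in range(K)])

    # 4. Stop when the centroid positions no longer change.
    if np.allclose(new_mu, mu):
        break
    mu = new_mu

# Intra-cluster variance J from Eq. (1).
J = sum(np.sum((X[labels == k] - mu[k]) ** 2) for k in range(K))
print("centroids:\n", mu, "\nJ =", J)
```

Step 2 is the degenerate E-step (hard assignment to the nearest centroid) and step 3 is the M-step (the ML estimate of each μ_k), in line with the EM view above.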

EM for Mixture Models

General mixture distributions

Assume the data are samples from a distribution factorized as

  p_XK(x, k) = p_K(k) p_{X|K}(x | k),  i.e.  p_X(x) = ∑_{k ∈ K} p_K(k) p_{X|K}(x | k),

and that the form of the distribution is known (except for the distribution parameters).

Recognition (part of the E-step): let's define the result of recognition not as a single decision for some state k (as done in K-means), but rather as a set of posterior probabilities (sometimes called responsibilities) for all k given x_i,

  γ_k(x_i) = p_{K|X}(k | x_i, θ^(t)) = p_{X|K}(x_i | k) p_K(k) / ∑_{k' ∈ K} p_{X|K}(x_i | k') p_K(k'),

that an object was in state k when observation x_i was made. The γ_k(x) functions can be viewed as discriminant functions.

General mixture distributions (cont.)

Learning (the rest of the E-step and the M-step): given the training multiset T = (x_i, k_i)_{i=1}^{n} (or the respective γ_k(x_i) instead of k_i), assume γ_k(x) is known, p_K(k) are not known, and p_{X|K}(x | k) are known except for the parameter values Θ_k, i.e. we shall write p_{X|K}(x | k, Θ_k). Let the object model m be the set of all unknown parameters, m = (p_K(k), Θ_k)_{k ∈ K}.

The log-likelihood of model m if we assume k_i is known:

  log L(m) = log ∏_{i=1}^{n} p_XK(x_i, k_i) = ∑_{i=1}^{n} log p_K(k_i) + ∑_{i=1}^{n} log p_{X|K}(x_i | k_i, Θ_{k_i}).

The log-likelihood of model m if we assume a distribution (γ) over k is known:

  log L(m) = ∑_{i=1}^{n} ∑_{k ∈ K} γ_k(x_i) log p_K(k) + ∑_{i=1}^{n} ∑_{k ∈ K} γ_k(x_i) log p_{X|K}(x_i | k, Θ_k).

We search for the optimal model using maximum likelihood:

  m* = (p*_K(k), Θ*_k) = arg max_m log L(m),

i.e. we compute

  p*_K(k) = (1/n) ∑_{i=1}^{n} γ_k(x_i)

and solve K independent tasks

  Θ*_k = arg max_{Θ_k} ∑_{i=1}^{n} γ_k(x_i) log p_{X|K}(x_i | k, Θ_k).
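One way to see where the update for p*_K(k) comes from (this intermediate step is not spelled out on the slide) is to maximize the first term of log L(m) subject to ∑_{k ∈ K} p_K(k) = 1 with a Lagrange multiplier λ:

  ∂/∂p_K(k) [ ∑_{i=1}^{n} ∑_{k' ∈ K} γ_{k'}(x_i) log p_K(k') + λ ( ∑_{k' ∈ K} p_K(k') - 1 ) ] = ∑_{i=1}^{n} γ_k(x_i) / p_K(k) + λ = 0,

so p_K(k) = -(1/λ) ∑_{i=1}^{n} γ_k(x_i); summing over k and using ∑_{k ∈ K} γ_k(x_i) = 1 gives λ = -n, and hence p*_K(k) = (1/n) ∑_{i=1}^{n} γ_k(x_i). The K independent tasks for Θ*_k follow directly from the second term of log L(m), which separates over k.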

EM for mixture distributions

Unsupervised learning algorithm [?] for general mixture distributions:
1. Initialize the model parameters m = (p_K(k), Θ_k)_{k ∈ K}.
2. Perform the recognition task, i.e. assuming m is known, compute
   γ_k(x_i) = p̂_{K|X}(k | x_i) = p_K(k) p_{X|K}(x_i | k, Θ_k) / ∑_{j ∈ K} p_K(j) p_{X|K}(x_i | j, Θ_j).
3. Perform the learning task, i.e. assuming the γ_k(x_i) are known, update the ML estimates of the model parameters p_K(k) and Θ_k for all k:
   p_K(k) = (1/n) ∑_{i=1}^{n} γ_k(x_i),
   Θ_k = arg max_{Θ_k} ∑_{i=1}^{n} γ_k(x_i) log p_{X|K}(x_i | k, Θ_k).
4. Iterate 2 and 3 until the model stabilizes.

Features:
- The algorithm does not specify how to update Θ_k in step 3; it depends on the chosen form of p_{X|K}.
- The model created in iteration t is always at least as good as the model from iteration t-1, i.e. L(m) = p(T | m) increases.

Special Case: Gaussian Mixture Model

Each k-th component is a Gaussian distribution:

  N(x | μ_k, Σ_k) = 1 / ((2π)^{D/2} |Σ_k|^{1/2}) · exp{ -(1/2) (x - μ_k)^T Σ_k^{-1} (x - μ_k) }.

Gaussian Mixture Model (GMM):

  p(x) = ∑_{k=1}^{K} p_K(k) p_{X|K}(x | k, Θ_k) = ∑_{k=1}^{K} α_k N(x | μ_k, Σ_k),

assuming ∑_{k=1}^{K} α_k = 1 and 0 ≤ α_k ≤ 1.

[Figure: example of a Gaussian mixture density]

EM for GMM

1. Initialize the model parameters m = (p_K(k), μ_k, Σ_k)_{k ∈ K}.
2. Perform the recognition task as in the general case, i.e. assuming m is known, compute
   γ_k(x_i) = p̂_{K|X}(k | x_i) = p_K(k) p_{X|K}(x_i | k, Θ_k) / ∑_{j ∈ K} p_K(j) p_{X|K}(x_i | j, Θ_j) = α_k N(x_i | μ_k, Σ_k) / ∑_{j ∈ K} α_j N(x_i | μ_j, Σ_j).
3. Perform the learning task, i.e. assuming the γ_k(x_i) are known, update the ML estimates of the model parameters α_k, μ_k and Σ_k for all k:
   α_k = p_K(k) = (1/n) ∑_{i=1}^{n} γ_k(x_i),
   μ_k = ∑_{i=1}^{n} γ_k(x_i) x_i / ∑_{i=1}^{n} γ_k(x_i),
   Σ_k = ∑_{i=1}^{n} γ_k(x_i) (x_i - μ_k)(x_i - μ_k)^T / ∑_{i=1}^{n} γ_k(x_i).
4. Iterate 2 and 3 until the model stabilizes.

Remarks:
- Each data point belongs to all components to a certain degree γ_k(x_i).
- The equation for μ_k is just a weighted average of the x_i's.
- The equation for Σ_k is just a weighted covariance matrix.

Example: Source data

[Figure: source data generated from 3 Gaussians]

Example: Input to EM algorithm

The data were given to the EM algorithm as an unlabeled dataset.

[Figure: the unlabeled input dataset]

Example: EM Iterations

[Figures: snapshots of successive EM iterations on the example dataset]


Example: Ground Truth and EM Estimate

[Figure: the ground-truth components (left) and the EM estimate (right)]

The ground truth (left) and the EM estimate (right) are very close because we have enough data, we know the right number of components, and we were lucky that EM converged to the right local optimum of the likelihood function.
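The GMM updates above translate almost line by line into numpy. The sketch below is illustrative and is not the code behind the figures: the synthetic dataset, the choice K = 3, the scipy dependency for the Gaussian density, and the small ridge added to Σ_k for numerical stability are all assumptions of this example.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
# Illustrative data from three 2-D Gaussians (stands in for the example dataset).
X = np.vstack([rng.multivariate_normal(m, np.eye(2), 150)
               for m in ([0, 0], [4, 4], [0, 5])])
n, D = X.shape
K = 3

# 1. Initialize alpha_k, mu_k, Sigma_k.
alpha = np.full(K, 1.0 / K)
mu = X[rng.choice(n, K, replace=False)]
Sigma = np.array([np.cov(X.T) for _ in range(K)])

prev_ll = -np.inf
for it in range(200):
    # 2. E-step: responsibilities gamma_k(x_i).
    dens = np.column_stack([multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)])
    weighted = alpha * dens                              # (n, K)
    gamma = weighted / weighted.sum(axis=1, keepdims=True)

    # 3. M-step: ML updates of alpha_k, mu_k, Sigma_k.
    Nk = gamma.sum(axis=0)                               # effective counts per component
    alpha = Nk / n
    mu = (gamma.T @ X) / Nk[:, None]
    for k in range(K):
        d = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(D)

    # 4. Iterate until the observed-data log-likelihood stops improving.
    ll = np.log(weighted.sum(axis=1)).sum()
    if ll - prev_ll < 1e-8:
        break
    prev_ll = ll

print("mixing weights:", alpha, "\nmeans:\n", mu)
```

The loop stops once the observed-data log-likelihood stops improving, which, per the EM guarantee above, it can only do monotonically.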

Baum-Welch Algorithm: EM for HMM

Hidden Markov Model

A 1st-order HMM is a generative probabilistic model formed by
- a sequence of hidden variables X_0, ..., X_t; the domain of each of them is the set of states {s_1, ..., s_N},
- a sequence of observed variables E_1, ..., E_t; the domain of each of them is the set of observations {v_1, ..., v_M},
- an initial distribution over hidden states P(X_0),
- a transition model P(X_t | X_{t-1}), and
- an emission model P(E_t | X_t).

Simulating an HMM:
1. Generate an initial state x_0 according to P(X_0). Set t = 1.
2. Generate a new current state x_t according to P(X_t | x_{t-1}).
3. Generate an observation e_t according to P(E_t | x_t).
4. Advance time, t = t + 1.
5. Finish, or repeat from step 2.

With an HMM, efficient algorithms exist for solving inference tasks; but we have no idea (so far) how to learn the HMM parameters from the observation sequence, because we do not have access to the hidden states.

Learning an HMM from data

Is it possible to learn an HMM from data?
- There is no known way to analytically solve for the model which maximizes the probability of the observations.
- There is no optimal way of estimating the model parameters from the observation sequences.
- We can, however, find model parameters such that the probability of the observations is maximized: the Baum-Welch algorithm (a special case of EM).

Let's use a slightly different notation to emphasize the model parameters:
- π = [π_i] = [P(X_1 = s_i)] ... the vector of initial probabilities of the states,
- A = [a_{i,j}] = [P(X_t = s_j | X_{t-1} = s_i)] ... the matrix of transition probabilities to the next state given the current state,
- B = [b_{i,k}] = [P(E_t = v_k | X_t = s_i)] ... the matrix of observation probabilities given the current state.
The whole set of HMM parameters is then θ = (π, A, B).

The algorithm (presented on the next slides) will compute the expected numbers of being in a state or taking a transition, given the observations and the current model parameters θ = (π, A, B), and then compute a new estimate of the model parameters θ' = (π', A', B') such that P(e_{1:T} | θ') ≥ P(e_{1:T} | θ).
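The five simulation steps above are easy to run directly. In the sketch below, the state set, the observation symbols, and the particular values of π, A and B are all made-up illustrations:

```python
import numpy as np

rng = np.random.default_rng(3)

states = ["Rainy", "Sunny"]          # s_1..s_N   (illustrative)
symbols = ["walk", "shop", "clean"]  # v_1..v_M   (illustrative)

pi = np.array([0.6, 0.4])                 # P(X_0)
A = np.array([[0.7, 0.3],                 # P(X_t = s_j | X_{t-1} = s_i), one row per current state
              [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],            # P(E_t = v_k | X_t = s_i)
              [0.6, 0.3, 0.1]])

def simulate(T):
    """Generate T (state, observation) steps from the HMM, following steps 1-5 above."""
    x = rng.choice(len(states), p=pi)                  # 1. x_0 ~ P(X_0)
    xs, es = [x], []
    for t in range(1, T + 1):
        x = rng.choice(len(states), p=A[x])            # 2. x_t ~ P(X_t | x_{t-1})
        es.append(rng.choice(len(symbols), p=B[x]))    # 3. e_t ~ P(E_t | x_t)
        xs.append(x)                                   # 4./5. advance and repeat
    return xs, es

hidden, observed = simulate(10)
print("hidden:  ", [states[i] for i in hidden])
print("observed:", [symbols[i] for i in observed])
```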

Sufficient statistics

Let's define the probability of a transition from state s_i at time t to state s_j at time t+1, given the model and the observation sequence e_{1:T}:

  ξ_t(i, j) = P(X_t = s_i, X_{t+1} = s_j | e_{1:T}, θ)
            = α_t(s_i) a_{ij} b_{j,e_{t+1}} β_{t+1}(s_j) / P(e_{1:T} | θ)
            = α_t(s_i) a_{ij} b_{j,e_{t+1}} β_{t+1}(s_j) / ∑_{i=1}^{N} ∑_{j=1}^{N} α_t(s_i) a_{ij} b_{j,e_{t+1}} β_{t+1}(s_j),

where α_t and β_t are the forward and backward messages computed by the forward-backward algorithm, and the probability of being in state s_i at time t, given the model and the observation sequence:

  γ_t(i) = ∑_{j=1}^{N} ξ_t(i, j).

Then we can interpret

  ∑_{t=1}^{T-1} γ_t(i) as the expected number of transitions from state s_i, and
  ∑_{t=1}^{T-1} ξ_t(i, j) as the expected number of transitions from s_i to s_j.

Baum-Welch algorithm

The re-estimation formulas are

  π'_i = expected frequency of being in state s_i at time t = 1 = γ_1(i),

  a'_{ij} = (expected number of transitions from s_i to s_j) / (expected number of transitions from s_i)
          = ∑_{t=1}^{T-1} ξ_t(i, j) / ∑_{t=1}^{T-1} γ_t(i),

  b'_{jk} = (expected number of times being in state s_j and observing v_k) / (expected number of times being in state s_j)
          = ∑_{t=1}^{T} I(e_t = v_k) γ_t(j) / ∑_{t=1}^{T} γ_t(j).

As with other EM variants, with the old model parameters θ = (π, A, B) and the new, re-estimated parameters θ' = (π', A', B'), the new model is at least as likely as the old one:

  P(e_{1:T} | θ') ≥ P(e_{1:T} | θ).

The above equations are used iteratively, with θ' taking the place of θ.
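A compact, unscaled numpy sketch of one Baum-Welch pass using the quantities ξ_t(i, j) and γ_t(i) defined above. The two-state, three-symbol model and the short observation sequence are made up for illustration; for long sequences the forward and backward messages would have to be rescaled (or kept in log space) to avoid underflow.

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One EM (Baum-Welch) update of (pi, A, B) from a single observation sequence."""
    N, T = A.shape[0], len(obs)

    # Forward messages alpha_t(i) and backward messages beta_t(i).
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    evidence = alpha[T - 1].sum()                      # P(e_{1:T} | theta)

    # gamma_t(i) = P(X_t = s_i | e_{1:T}); xi_t(i,j) = P(X_t = s_i, X_{t+1} = s_j | e_{1:T}).
    gamma = alpha * beta / evidence
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / evidence    # shape (T-1, N, N)

    # Re-estimation formulas for pi', A', B'.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B, evidence

# Illustrative 2-state, 3-symbol model and a short observation sequence.
pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])
obs = [0, 1, 2, 2, 1, 0, 0, 2]

for _ in range(10):
    pi, A, B, p = baum_welch_step(pi, A, B, obs)
print("P(obs | theta) after re-estimation:", p)
```

Repeating the step re-estimates (π, A, B) and, as stated above, P(e_{1:T} | θ) cannot decrease from one pass to the next.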

Summary

Competencies

After this lecture, a student shall be able to:
- define and explain the task of maximum likelihood estimation;
- explain why we can maximize the log-likelihood instead of the likelihood, and describe the advantages;
- describe the issues we face when trying to maximize the likelihood in the case of incomplete data;
- explain the general high-level principle of the Expectation-Maximization algorithm;
- describe the pros and cons of the EM algorithm, especially what happens with the likelihood in one EM iteration;
- describe the EM algorithm for mixture distributions, including the notion of responsibilities;
- explain the Baum-Welch algorithm, i.e. the application of EM to HMMs: what parameters are learned and how (conceptually).