Chapter 2

EM algorithms

The Expectation-Maximization (EM) algorithm is a maximum likelihood method for models that have hidden variables, e.g. Gaussian Mixture Models (GMMs), Linear Dynamical Systems (LDSs) and Hidden Markov Models (HMMs).

2.1 Gaussian Mixture Models

Say we have a variable which is multi-modal, i.e. it separates into distinct clusters. For such data the mean and variance are not very representative quantities.

In a one-dimensional Gaussian Mixture Model (GMM) with m components the likelihood of a data point x_n is given by

    p(x_n) = \sum_{k=1}^{m} p(x_n | k) p(s_n = k)    (2.1)

where s_n is an indicator variable indicating which component is selected for which data point. These are chosen probabilistically according to

    p(s_n = k) = \pi_k    (2.2)

and each component is a Gaussian

    p(x_n | k) = (2 \pi \sigma_k^2)^{-1/2} \exp\left( - \frac{(x_n - \mu_k)^2}{2 \sigma_k^2} \right)    (2.3)

To generate data from a GMM we pick a Gaussian at random (according to equation 2.2) and then sample from that Gaussian. To fit a GMM to a data set we need to estimate \pi_k, \mu_k and \sigma_k^2. This can be achieved in two steps. In the 'E-Step' we soft-partition the data among the different clusters. This amounts to calculating the probability that data point n belongs to cluster k which, from Bayes' rule, is

    \gamma_{nk} \equiv p(s_n = k | x_n) = \frac{p(x_n | k) p(s_n = k)}{\sum_{k'} p(x_n | k') p(s_n = k')}    (2.4)
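To make the E-Step concrete, here is a minimal numpy sketch of equation 2.4. The function name e_step and the array shapes are our own conventions, not part of the original text:

    import numpy as np

    def e_step(x, pi, mu, var):
        # Soft-partitioning (eq. 2.4): gamma[n, k] = p(s_n = k | x_n).
        # x: (N,) data; pi, mu, var: (m,) mixture parameters.
        lik = (np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2 / var[None, :])
               / np.sqrt(2 * np.pi * var[None, :]))      # p(x_n | k), eq. 2.3
        joint = lik * pi[None, :]                         # p(x_n | k) p(s_n = k)
        return joint / joint.sum(axis=1, keepdims=True)   # Bayes' rule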
In the 'M-Step' we re-estimate the parameters using Maximum Likelihood, but with the data points weighted according to the soft-partitioning:

    \pi_k = \frac{1}{N} \sum_n \gamma_{nk}    (2.5)

    \mu_k = \frac{\sum_n \gamma_{nk} x_n}{\sum_n \gamma_{nk}}

    \sigma_k^2 = \frac{\sum_n \gamma_{nk} (x_n - \mu_k)^2}{\sum_n \gamma_{nk}}

These two steps constitute an EM algorithm. Summarizing:

E-Step: Soft-partitioning.
M-Step: Parameter updating.

Figure 2.1: A variable with 3 modes. This can be accurately modelled with a 3-component Gaussian Mixture Model.

Application of a 3-component GMM to our example data gives for cluster (i) \pi_1 = 0.3, \mu_1 = 10, \sigma_1^2 = 10, (ii) \pi_2 = 0.35, \mu_2 = 40, \sigma_2^2 = 10, and (iii) \pi_3 = 0.35, \mu_3 = 50, \sigma_3^2 = 5.

GMMs are readily extended to multivariate data by replacing each univariate Gaussian in the mixture with a multivariate Gaussian. See e.g. chapter 3 in [3].
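Continuing the sketch above, the M-Step of equation 2.5 and a complete EM loop might look as follows. The synthetic data and the initial parameter values are our own illustrative choices, loosely mimicking the three-mode example:

    def m_step(x, gamma):
        # Responsibility-weighted ML re-estimation (eq. 2.5).
        Nk = gamma.sum(axis=0)                      # effective count per cluster
        pi = Nk / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk
        return pi, mu, var

    # Synthetic three-mode data and a rough initialisation.
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(10, np.sqrt(10), 300),
                        rng.normal(40, np.sqrt(10), 350),
                        rng.normal(50, np.sqrt(5), 350)])
    pi, mu, var = np.ones(3) / 3, np.array([5.0, 30.0, 55.0]), np.full(3, 20.0)
    for _ in range(100):
        gamma = e_step(x, pi, mu, var)              # E-Step from the sketch above
        pi, mu, var = m_step(x, gamma)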
2.2 General Approach

If V are visible variables, H are hidden variables and \theta are parameters then

1. E-Step: Get p(H | V, \theta)
2. M-Step: Change \theta so as to maximise

    Q = \langle \log p(V, H | \theta) \rangle    (2.6)

where the expectation is taken with respect to p(H | V, \theta).

Why does it work? Maximising Q maximises the likelihood p(V | \theta). This can be proved as follows. Firstly,

    p(V | \theta) = \frac{p(H, V | \theta)}{p(H | V, \theta)}    (2.7)

This means that the log-likelihood, L(\theta) \equiv \log p(V | \theta), can be written

    L(\theta) = \log p(H, V | \theta) - \log p(H | V, \theta)    (2.8)

If we now take expectations with respect to a distribution p'(H) then we get

    L(\theta) = \int p'(H) \log p(H, V | \theta) dH - \int p'(H) \log p(H | V, \theta) dH    (2.9)

The second term is minimised by setting p'(H) = p(H | V, \theta); the gap between this term and its minimum is exactly the KL divergence between p'(H) and the posterior, which is non-negative and zero only at the posterior (we can prove this from Jensen's inequality or the positivity of the KL divergence; see [2] or lecture 4). This takes place in the E-Step. After the E-Step the auxiliary function Q is then equal to the log-likelihood. Therefore, when we maximise Q in the M-Step we are maximising the likelihood.

2.3 Probabilistic Principal Component Analysis

In an earlier lecture, Principal Component Analysis (PCA) was viewed as a linear transform

    y = Q^T x    (2.10)

where the jth column of the matrix Q is the jth eigenvector, q_j, of the covariance matrix of the original d-dimensional data x. The jth projection

    y_j = q_j^T x    (2.11)

has a variance given by the jth eigenvalue \lambda_j. If the projections are ranked according to variance (i.e. eigenvalue) then the M variables that reconstruct the original data with minimum error (and are also linear functions of x) are given by y_1, y_2, ..., y_M. The remaining variables y_{M+1}, ..., y_d can be discarded with minimal loss of information (in the sense of least squares error). The reconstructed data is given by

    \hat{x} = Q_{1:M} y_{1:M}    (2.12)

where Q_{1:M} is a matrix formed from the first M columns of Q. Similarly, y_{1:M} = [y_1, y_2, ..., y_M]^T.
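The projection and reconstruction of equations 2.10-2.12 can be sketched in a few lines of numpy. The helper name pca_reconstruct is our own, and the data rows are assumed to have had their mean removed:

    import numpy as np

    def pca_reconstruct(X, M):
        # X: (N, d) zero-mean data; reconstruct from the top-M projections.
        S = np.cov(X, rowvar=False)           # (d, d) sample covariance
        lam, Q = np.linalg.eigh(S)            # eigenvalues in ascending order
        idx = np.argsort(lam)[::-1]           # rank projections by variance
        lam, Q = lam[idx], Q[:, idx]
        Y = X @ Q[:, :M]                      # y_{1:M} = Q_{1:M}^T x (eq. 2.11)
        return Y @ Q[:, :M].T, lam            # x_hat = Q_{1:M} y_{1:M} (eq. 2.12)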
In probabilistic PCA (pPCA) [60] the PCA transform is converted into a statistical model by explaining the 'discarded' variance as observation noise

    x = W y + e    (2.13)

where the noise e is drawn from a zero mean Gaussian distribution with isotropic covariance \sigma^2 I. The 'observations' x are generated by transforming the 'sources' y with the 'mixing matrix' W and then adding 'observation noise'. The pPCA model has M sources, where M < d. For a given M, we have W = Q_{1:M} and

    \sigma^2 = \frac{1}{d - M} \sum_{j=M+1}^{d} \lambda_j    (2.14)

which is the average variance of the discarded projections. There also exists an EM algorithm for finding the mixing matrix which is more efficient than SVD for high-dimensional data. This is because it only needs to invert an M-by-M matrix rather than a d-by-d matrix.

If we define S as the sample covariance matrix and

    C = W W^T + \sigma^2 I    (2.15)

then the log-likelihood of the data under a pPCA model is given by [60]

    \log p(X) = - \frac{Nd}{2} \log 2\pi - \frac{N}{2} \log |C| - \frac{N}{2} \mathrm{Tr}(C^{-1} S)    (2.16)

where N is the number of data points. We are now in a position to apply the MDL model order selection criterion. We have

    MDL(M) = - \log p(X) + \frac{Md}{2} \log N    (2.17)

This gives us a procedure for choosing the optimal number of sources.

Because pPCA is a probabilistic model (whereas PCA is a transform) it is readily incorporated in larger models. A useful model, for example, is the Mixtures of pPCA model. This is identical to the Gaussian Mixture Model except that each Gaussian is decomposed using pPCA (rather than keeping it as a full covariance Gaussian). This can greatly reduce the number of parameters in the model [6].
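The model order selection procedure can be sketched as follows. The helper name ppca_mdl is hypothetical, and we follow the text in taking W = Q_{1:M} (the exact ML solution of [60] additionally scales the retained columns, but that does not change the flavour of the computation):

    import numpy as np

    def ppca_mdl(X, M):
        # MDL score (eq. 2.17) of a pPCA model with M sources fitted to X (N, d).
        N, d = X.shape
        S = np.cov(X, rowvar=False)                   # sample covariance
        lam, Q = np.linalg.eigh(S)
        idx = np.argsort(lam)[::-1]
        lam, Q = lam[idx], Q[:, idx]
        sigma2 = lam[M:].mean()                       # eq. 2.14
        W = Q[:, :M]                                  # W = Q_{1:M}, as in the text
        C = W @ W.T + sigma2 * np.eye(d)              # eq. 2.15
        loglik = (-N * d / 2 * np.log(2 * np.pi)
                  - N / 2 * np.linalg.slogdet(C)[1]
                  - N / 2 * np.trace(np.linalg.solve(C, S)))   # eq. 2.16
        return -loglik + M * d / 2 * np.log(N)        # eq. 2.17

    # Choose the number of sources by minimising MDL:
    # M_opt = min(range(1, d), key=lambda M: ppca_mdl(X, M))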
2.4 Linear Dynamical Systems

A Linear Dynamical System (LDS) is given by the following 'state-space' equations

    x_{t+1} = A x_t + w_t    (2.18)
    y_t = C x_t + v_t

where the state noise w_t and observation noise v_t are zero mean Gaussian variables with covariances Q and R. Given A, C, Q and R the state can be updated using the Kalman filter; this is how the states are inferred in real-time applications. For retrospective/offline data analysis the state at time t can be determined using data both before t and after t. This is known as Kalman smoothing. See e.g. [2].

Moreover, we can also infer the other parameters, e.g. the state noise covariance Q, the observation matrix C, etc. See e.g. [22]. To do this, the state is regarded as a 'hidden variable' (we do not observe it) and we apply the EM algorithm [5]. For an LDS

    x_{t+1} = A x_t + w_t    (2.19)
    y_t = C x_t + v_t

the hidden variables are the states x_t and the observed variable is the time series y_t. If x_1^T = [x_1, x_2, ..., x_T] are the states and y_1^T = [y_1, y_2, ..., y_T] are the observations then the EM algorithm is as follows.

M-Step

In the M-Step we maximise

    Q = \langle \log p(y_1^T, x_1^T | \theta) \rangle    (2.20)

Because of the Markov property of an LDS (the current state only depends on the last one, and not on ones before that) we have

    p(y_1^T, x_1^T | \theta) = p(x_1) \prod_{t=2}^{T} p(x_t | x_{t-1}) \prod_{t=1}^{T} p(y_t | x_t)    (2.21)

and when we take logs we get

    \log p(y_1^T, x_1^T | \theta) = \log p(x_1) + \sum_{t=2}^{T} \log p(x_t | x_{t-1}) + \sum_{t=1}^{T} \log p(y_t | x_t)    (2.22)

where each PDF is a multivariate Gaussian. We now need to take expectations with respect to the distribution over hidden variables,

    p(x_t | y_1^T)    (2.23)

This gives

    \langle \log p(y_1^T, x_1^T | \theta) \rangle = \langle \log p(x_1) \rangle + \sum_{t=2}^{T} \langle \log p(x_t | x_{t-1}) \rangle + \sum_{t=1}^{T} \langle \log p(y_t | x_t) \rangle    (2.24)

By taking derivatives with respect to each of the parameters and setting them to zero we get update equations for A, Q, C and R. See [22] for details. The distribution over hidden variables is calculated in the E-Step.
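The Markov factorisation of equation 2.22 is easy to evaluate numerically. Here is a minimal sketch, assuming a Gaussian prior p(x_1) with hypothetical parameters mu0, V0; scipy's multivariate_normal supplies the Gaussian log-densities:

    import numpy as np
    from scipy.stats import multivariate_normal as mvn

    def lds_log_joint(x, y, A, C, Q, R, mu0, V0):
        # log p(y_1^T, x_1^T | theta) via eq. 2.22.
        # x: (T, dx) state sequence; y: (T, dy) observations.
        lp = mvn.logpdf(x[0], mu0, V0)                 # log p(x_1)
        for t in range(1, len(x)):
            lp += mvn.logpdf(x[t], A @ x[t - 1], Q)    # log p(x_t | x_{t-1})
        for t in range(len(x)):
            lp += mvn.logpdf(y[t], C @ x[t], R)        # log p(y_t | x_t)
        return lp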
E-Step

The E-Step consists of two parts. In the forward pass the joint probability

    \alpha_t \equiv p(x_t, y_1^t) = \int \alpha_{t-1} p(x_t | x_{t-1}) p(y_t | x_t) dx_{t-1}    (2.25)

is recursively evaluated using a Kalman filter. In the backward pass we estimate the conditional probability

    \beta_t \equiv p(y_{t+1}^T | x_t) = \int \beta_{t+1} p(x_{t+1} | x_t) p(y_{t+1} | x_{t+1}) dx_{t+1}    (2.26)

The two are then combined to produce a smoothed estimate

    p(x_t | y_1^T) \propto \alpha_t \beta_t    (2.27)

This E-Step constitutes a Kalman smoother.
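In practice the forward and backward recursions are implemented in moment (mean/covariance) form. The sketch below uses the Rauch-Tung-Striebel formulation of the smoother, an equivalent alternative to the alpha-beta combination of equation 2.27; all names are our own and the prior on x_1 (mu0, V0) is an assumed input:

    import numpy as np

    def kalman_smooth(y, A, C, Q, R, mu0, V0):
        # Forward pass: Kalman filter; backward pass: RTS smoother.
        T, dx = len(y), len(mu0)
        mu_p, V_p, mu_f, V_f = [], [], [], []
        for t in range(T):
            # Predict p(x_t | y_1^{t-1}); at t = 0 this is the prior.
            mu = mu0 if t == 0 else A @ mu_f[-1]
            V = V0 if t == 0 else A @ V_f[-1] @ A.T + Q
            mu_p.append(mu); V_p.append(V)
            # Correct with y_t to get the filtered p(x_t | y_1^t).
            K = V @ C.T @ np.linalg.inv(C @ V @ C.T + R)   # Kalman gain
            mu_f.append(mu + K @ (y[t] - C @ mu))
            V_f.append((np.eye(dx) - K @ C) @ V)
        # Backward pass: smoothed p(x_t | y_1^T).
        mu_s, V_s = [None] * T, [None] * T
        mu_s[-1], V_s[-1] = mu_f[-1], V_f[-1]
        for t in range(T - 2, -1, -1):
            J = V_f[t] @ A.T @ np.linalg.inv(V_p[t + 1])   # smoother gain
            mu_s[t] = mu_f[t] + J @ (mu_s[t + 1] - mu_p[t + 1])
            V_s[t] = V_f[t] + J @ (V_s[t + 1] - V_p[t + 1]) @ J.T
        return np.array(mu_s), np.array(V_s)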
Figure 2.2: (a) Kalman filtering and (b) Kalman smoothing (axes: seconds versus frequency).