CSE 527, Additional Notes on MLE & EM
Based on earlier notes by C. Grant & M. Narasimhan

Introduction

Last lecture we began an examination of model-based clustering. This lecture covers the technical background leading to the Expectation Maximization (EM) algorithm.

Do gene expression data fit a Gaussian model? The central limit theorem implies that the sum of a large number of independent, identically distributed random variables can be well approximated by a Normal distribution. While it is far from clear that expression data are a sum of independent variables, using the Normal distribution seems to work in practice. Besides, having a weak model is better than having no model at all.

Probability Basics

A random variable can be continuous or discrete (or both). A discrete random variable corresponds to a probability distribution on a discrete sample space, such as the roll of a die. A continuous random variable corresponds to a probability distribution on a continuous sample space such as $\mathbb{R}$. Shown in the table below are two examples of probability distributions, the first representing a roll of an unbiased die, and the second representing a Normal distribution.

                   Discrete                                Continuous
Sample space       $\{1, 2, \ldots, 6\}$                   $\mathbb{R}$
Distribution       $p_1, p_2, \ldots, p_6 \ge 0$,          $f(x) \ge 0$, $\int f(x)\,dx = 1$,
                   $\sum_{i=1}^{6} p_i = 1$,               $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$
                   $p_1 = p_2 = \cdots = p_6 = 1/6$

[Two figures appear here in the original: "Discrete Probability Distribution" (a bar plot of the die probabilities over 1-6) and "Continuous Probability Distribution" (a Normal density over roughly [-4, 4]).]
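To make the two normalization conditions concrete, here is a minimal Python sketch (the parameter values are illustrative, not from the notes) that checks that the die's probabilities sum to 1 and that the Normal density integrates to 1:

```python
import numpy as np
from scipy.integrate import quad

# Discrete: a fair die. The six probabilities are nonnegative and sum to 1.
p = np.full(6, 1 / 6)
assert np.all(p >= 0) and np.isclose(p.sum(), 1.0)

# Continuous: a Normal density with (illustrative) mu = 0, sigma = 1.
mu, sigma = 0.0, 1.0
f = lambda x: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Numerically integrate f over the real line; the result should be ~1.
total, _ = quad(f, -np.inf, np.inf)
print(total)  # ~1.0
```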
Parameter Estimation

Many distributions are parametrized. Typically, we have data $x_1, x_2, \ldots, x_n$ sampled from a parametric distribution $f(x \mid \theta)$. Often, the goal is to estimate the parameter $\theta$. The mean $\mu$ and variance $\sigma^2$ are often used as such parameters. Estimates of these quantities derived from the sampled data are called sample statistics, while the (true) parameters based on the entire sample space are called population statistics. The following table illustrates these two concepts.
                      Discrete                                   Continuous
Population mean       $\mu = \sum_i i\, p_i$                     $\mu = \int x\, f(x)\, dx$
Population variance   $\sigma^2 = \sum_i (i - \mu)^2\, p_i$      $\sigma^2 = \int (x - \mu)^2 f(x)\, dx$
Sample mean           $\bar{x} = \sum_{i=1}^{n} x_i / n$         (same)
Sample variance       $\bar{s}^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / n$    (same)

While the sample statistics can be used as estimates of these parameters, this is often not the preferred way of estimating these quantities. For example, the sample variance $\bar{s}^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / n$ is a biased estimate of the true variance because it underestimates that quantity. (An unbiased estimate of the variance is given by $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n-1)$.) Maximum Likelihood Estimation is one of many parameter estimation techniques (note that the MLE is not guaranteed to be unbiased either). Assuming the data are independent, the likelihood of the data $x_1, x_2, \ldots, x_n$ given the parameter $\theta$ is

$L(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$

where $f$ is the probability density function of the presumed distribution (which of course depends on $\theta$). Note that the $x_i$ are known constants, not variables; they are the values we observed. On the other hand, $\theta$ is unknown. We treat the likelihood $L$ as a function of $\theta$ and ask what value of $\theta$ maximizes it. The typical approach is to solve

$\frac{\partial}{\partial \theta} L(x_1, \ldots, x_n \mid \theta) = 0$

Since the likelihood function is always positive (and we may assume it to be strictly positive), the log likelihood

$\ln L(x_1, \ldots, x_n \mid \theta) = \ln \prod_{i=1}^{n} f(x_i \mid \theta) = \sum_{i=1}^{n} \ln f(x_i \mid \theta)$

is well defined, and by the monotonicity of the logarithm, the log likelihood is maximized exactly when the likelihood is maximized. Hence we can instead solve

$\frac{\partial}{\partial \theta} \ln L(x_1, \ldots, x_n \mid \theta) = 0$
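As an aside, the bias of the $n$-denominator variance estimate is easy to see by simulation. Here is a minimal sketch (sample size, trial count, and true variance are illustrative choices, not from the notes) that draws many small samples from a Normal with known $\sigma^2$ and averages the two estimators:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 100_000
sigma2 = 4.0  # true variance (illustrative)

x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
xbar = x.mean(axis=1, keepdims=True)
ss = ((x - xbar) ** 2).sum(axis=1)  # sum of squared deviations per sample

print(ss.mean() / n)        # biased estimate: ~ (n-1)/n * sigma2 = 3.2
print(ss.mean() / (n - 1))  # unbiased estimate: ~ sigma2 = 4.0
```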
Note that in general, these conditions are satisfied by maxima, minima and stationary points of the log-likelihood function. (A "stationary point" is a flat spot on a curve that otherwise tends upward or downward.) Further, if $\theta$ is restricted to some bounded range, the maximum might occur at a boundary, which need not satisfy this condition. Therefore, we need to check the boundaries separately. Here is an example which illustrates this procedure.

Example 1. Let $x_1, \ldots, x_n$ be coin flips, and let $\theta$ be the probability of getting heads. Suppose we observe $n_0$ tails and $n_1$ heads ($n_0 + n_1 = n$). Then the likelihood function is given by

$L(x_1, \ldots, x_n \mid \theta) = (1 - \theta)^{n_0}\, \theta^{n_1}$

Hence the log-likelihood function is

$\ln L(x_1, \ldots, x_n \mid \theta) = n_0 \ln(1 - \theta) + n_1 \ln \theta$

To find a value of $\theta$ that maximizes this function, we solve

$\frac{\partial}{\partial \theta} \ln L(x_1, \ldots, x_n \mid \theta) = -\frac{n_0}{1 - \theta} + \frac{n_1}{\theta} = 0$

This yields

$n_1 (1 - \theta) = n_0\, \theta \;\Rightarrow\; n_1 = (n_0 + n_1)\, \theta \;\Rightarrow\; \theta = \frac{n_1}{n_0 + n_1} = \frac{n_1}{n}$

(The sign of the second derivative can then be checked to guarantee that this is a maximum, not a minimum. Likewise, you can easily verify that the maximum is not attained at the boundaries of the parameter space, i.e. at $\theta = 0$ or $\theta = 1$.) This estimate for the parameter of the distribution matches our intuition: the MLE is the observed fraction of heads.
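A quick numerical check of Example 1, using a hypothetical count and a grid search rather than calculus:

```python
import numpy as np

n0, n1 = 37, 63  # hypothetical data: 37 tails, 63 heads out of n = 100 flips

theta = np.linspace(0.001, 0.999, 9999)
log_lik = n0 * np.log(1 - theta) + n1 * np.log(theta)

# The grid maximizer agrees with the closed form theta = n1 / (n0 + n1).
print(theta[np.argmax(log_lik)])  # ~0.63
print(n1 / (n0 + n1))             # 0.63
```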
Example 2. Suppose $x_i \sim N(\mu, \sigma^2)$ with $\sigma^2 = 1$ and $\mu$ unknown. Writing $\theta = \mu$:

$L(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-(x_i - \theta)^2 / 2}$

$\ln L(x_1, \ldots, x_n \mid \theta) = \sum_{i=1}^{n} \left( -\tfrac{1}{2} \ln 2\pi - \frac{(x_i - \theta)^2}{2} \right)$

$\frac{\partial}{\partial \theta} \ln L(x_1, \ldots, x_n \mid \theta) = \sum_{i=1}^{n} (x_i - \theta) = \sum_{i=1}^{n} x_i - n\theta = 0$

So the value of $\theta$ that maximizes the likelihood is

$\hat{\theta} = \sum_{i=1}^{n} x_i / n$

Again matching our intuition: the sample mean is the maximum likelihood estimator (MLE) for the population mean.

Example 3. Suppose $x_i \sim N(\mu, \sigma^2)$ with both $\sigma^2$ and $\mu$ unknown. Writing $\theta_1 = \mu$ and $\theta_2 = \sigma^2$:

$L(x_1, \ldots, x_n \mid \theta_1, \theta_2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\theta_2}}\, e^{-(x_i - \theta_1)^2 / (2\theta_2)}$

$\ln L(x_1, \ldots, x_n \mid \theta_1, \theta_2) = \sum_{i=1}^{n} \left( -\tfrac{1}{2} \ln 2\pi\theta_2 - \frac{(x_i - \theta_1)^2}{2\theta_2} \right)$

$\frac{\partial}{\partial \theta_1} \ln L = \sum_{i=1}^{n} \frac{x_i - \theta_1}{\theta_2} = 0 \;\Rightarrow\; \hat{\theta}_1 = \sum_{i=1}^{n} x_i / n$

$\frac{\partial}{\partial \theta_2} \ln L = \sum_{i=1}^{n} \left( -\frac{2\pi}{2 \cdot 2\pi\theta_2} + \frac{(x_i - \theta_1)^2}{2\theta_2^2} \right) = \sum_{i=1}^{n} \left( -\frac{1}{2\theta_2} + \frac{(x_i - \theta_1)^2}{2\theta_2^2} \right) = 0 \;\Rightarrow\; \hat{\theta}_2 = \sum_{i=1}^{n} (x_i - \hat{\theta}_1)^2 / n$

The MLE for the population variance is the sample variance. This is a biased estimator: it systematically underestimates the population variance, but it is nonetheless the MLE. The MLE is not promised to be unbiased, but it is a reasonable approach.
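Examples 2 and 3 can be checked against a library fit: scipy.stats.norm.fit returns maximum likelihood estimates, so its fitted scale is the square root of the biased ($n$-denominator) variance. A small sketch with simulated data (the true parameters below are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(10.0, 3.0, size=1000)  # illustrative sample

mu_hat = x.mean()                     # Example 2: MLE of the mean
var_hat = ((x - mu_hat) ** 2).mean()  # Example 3: MLE of the variance (biased)

loc, scale = norm.fit(x)              # scipy's maximum likelihood fit
print(mu_hat, loc)                    # these agree
print(var_hat, scale ** 2)            # these agree: the n-denominator variance
```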
Expectation Maximization

The MLE approach works well when we have relatively simple parametrized distributions. In more complicated situations, however, we may not be able to solve for the ML estimate because the complexity of the likelihood function precludes both analytical and numerical optimization. The EM algorithm can be thought of as an algorithm that provides a tractable approximation to the ML estimate.

Consider the following example. We have data corresponding to heights of individuals, as shown in the figures below. Is this distribution likely to be Normally distributed, as shown below?

[Figure: the height data (x's along the axis) with a single Normal density fit to it.]

Or is there some hidden variable, like gender, so that the distribution should be more like this:

[Figure: the same data with a mixture of two Normal densities, one per group.]

The clustering problem is essentially a parameter estimation problem: try to find whether there are hidden parameters that cause the data to fall into two distributions $f_1(x)$, $f_2(x)$. These distributions depend on some parameter $\theta$: $f_1(x, \theta)$, $f_2(x, \theta)$; and there are also mixing parameters $\tau_1$ and $\tau_2$, with $\tau_1 + \tau_2 = 1$, which describe the probability of sampling from each group. Can we estimate the parameters of this more complex model? Let's suppose that the two groups are Normal, but with different, unknown parameters. The likelihood is now given by

$L(x_1, x_2, \ldots, x_n \mid \tau_1, \tau_2, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2) = \prod_{i=1}^{n} \sum_{j=1}^{2} \tau_j\, f_j(x_i \mid \theta_j)$
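For concreteness, this mixture likelihood (or rather its log) is simple to evaluate for given parameters. A minimal sketch; the function name and the parameter values are mine, chosen for illustration:

```python
import numpy as np
from scipy.stats import norm

def mixture_log_likelihood(x, tau, mu, sigma):
    """ln L = sum_i ln( sum_j tau_j f_j(x_i | theta_j) )
    for a two-component Normal mixture."""
    dens = sum(tau[j] * norm.pdf(x, mu[j], sigma[j]) for j in range(2))
    return np.log(dens).sum()

# Illustrative heights and parameters: a 60/40 mix of N(160, 7^2) and N(175, 7^2).
x = np.array([158.0, 172.5, 181.2, 165.3])
print(mixture_log_likelihood(x, [0.6, 0.4], [160.0, 175.0], [7.0, 7.0]))
```

Note the product-of-sums structure: the sum over $j$ sits inside the log, which is exactly what makes the analytical approach messy below.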
If we try to work with this in our existing framework, it becomes messy and algebraically intractable due to the product-of-sums form, and it remains so even if we take the log of the likelihood. This leads us to introduce the Expectation Maximization (EM) algorithm as a heuristic for finding the MLE. It is particularly useful for problems containing a hidden variable, and it uses a hill-climbing strategy to find a local maximum of the likelihood.

Introduce new variables

$z_{ij} = \begin{cases} 1 & \text{if } x_i \text{ was sampled from distribution } j \\ 0 & \text{otherwise} \end{cases}$

These variables are introduced for mathematical convenience: they let us avoid a sum over $j$ in the expression for the likelihood. The full data table becomes

$x_1$   $z_{11}$   $z_{12}$
$x_2$   $z_{21}$   $z_{22}$
 ...
$x_n$   $z_{n1}$   $z_{n2}$

If the $z$'s were known, estimating $\tau_1, \tau_2$ would be easy, and estimation of the other parameters would become easy again. If we knew the parameters, estimation of the $z$'s would be easy. The EM algorithm iterates over these alternatives. It can be proved that the likelihood is monotonically increasing, and so will converge to a (local) maximum.

[There is a polynomial time algorithm for estimating Gaussian mixtures under the assumption that the components are "well-separated," but the method is not used much in practice. I don't know whether the complexity of the general problem is known; plausibly it's NP-hard. So, the EM algorithm is probably the method of choice.]

Expectation step

Assume fixed values for the $\tau_j$ and $\theta_j$. Let $A$ be the event that $x_i$ is drawn from the distribution $f_1$, let $B$ be the event that $x_i$ is drawn from $f_2$, and let $D$ be the event that $x_i$ is observed. We want $P(A \mid D)$, but it is easier to find $P(D \mid A)$. We use Bayes' rule:

$P(A \mid D) = \frac{P(D \mid A)\, P(A)}{P(D)}$

$P(D) = P(D \mid A)\, P(A) + P(D \mid B)\, P(B) = \tau_1 P(D \mid A) + \tau_2 P(D \mid B) = \tau_1 f_1(x_i \mid \theta_1) + \tau_2 f_2(x_i \mid \theta_2)$

$P(A \mid D)$ is the expected value of $z_{i1}$ given $\theta_1$ and $\theta_2$. This is the expectation step of the EM algorithm.
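In code, the E-step is just this Bayes computation applied to every observation. A minimal sketch for the two-component Normal case (function and variable names are mine, not from the notes):

```python
import numpy as np
from scipy.stats import norm

def e_step(x, tau, mu, sigma):
    """Return E(z_ij): an (n, 2) array of posterior membership probabilities.

    x is the (n,) data array; tau, mu, sigma are length-2 arrays holding the
    current mixing weights, means, and standard deviations."""
    # Numerators: tau_j f_j(x_i | theta_j), i.e. P(D|A)P(A) and P(D|B)P(B).
    weighted = np.column_stack([
        tau[j] * norm.pdf(x, mu[j], sigma[j]) for j in range(2)
    ])
    # Bayes' rule: divide by P(D) = tau_1 f_1(x_i) + tau_2 f_2(x_i).
    return weighted / weighted.sum(axis=1, keepdims=True)
```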
To be concrete, consider a sample of points taken from a mixture of two Gaussian distributions with unknown parameters and unknown mixing coefficients. The EM algorithm will give estimates of the parameters that raise the likelihood of the data. An easy heuristic to apply is

If $E(z_{i1}) \ge 1/2$ then set $z_{i1} = 1$
If $E(z_{i1}) < 1/2$ then set $z_{i1} = 0$

This gives rise to the so-called Classification EM algorithm: we classify each observation as coming from exactly one of the component distributions. The k-means clustering algorithm is an example. In this case, the maximization step is just like the simple Maximum Likelihood Estimation examples considered above. The more general M-step (below) accounts for the inherent uncertainty in these classifications, appropriately weighting the contributions of each observation to the parameter estimates for each mixture component.

Maximization step

The expression for the likelihood is

$L(x_1, z_{11}, z_{12}, x_2, z_{21}, z_{22}, \ldots \mid \theta, \tau)$

The $x_i$ are known. If the $z_{ij}$ were also known, finding the MLE of $\theta, \tau$ would be easy, but we don't know them. Instead we maximize the expected log likelihood of the visible data, $E(\ln L(x_1, \ldots, x_n \mid \theta, \tau))$, where the expectation is taken over the distribution of the hidden variables $z_{ij}$. Assuming $\sigma_1 = \sigma_2 = \sigma$ and $\tau_1 = \tau_2 = \tau = 1/2$:

$L(x, z \mid \theta, \tau) = \prod_{i=1}^{n} \frac{1}{2} \cdot \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2} \sum_{j=1}^{2} z_{ij} (x_i - \mu_j)^2}$

so

$E(\ln L(x, z \mid \theta, \tau)) = E\!\left( \sum_{i=1}^{n} \left[ \ln \tfrac{1}{2} - \tfrac{1}{2} \ln 2\pi\sigma^2 - \tfrac{1}{2\sigma^2} \sum_{j=1}^{2} z_{ij} (x_i - \mu_j)^2 \right] \right) = \sum_{i=1}^{n} \left[ \ln \tfrac{1}{2} - \tfrac{1}{2} \ln 2\pi\sigma^2 - \tfrac{1}{2\sigma^2} \sum_{j=1}^{2} E(z_{ij}) (x_i - \mu_j)^2 \right]$

The last step depends on the important fact that expectation is linear: if $c$ and $d$ are constants and $X$ and $Y$ are random variables, then $E(cX + dY) = c\,E(X) + d\,E(Y)$.
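The displayed expectation is straightforward to evaluate numerically. A small sketch under the same simplifying assumptions ($\tau_1 = \tau_2 = 1/2$, shared $\sigma$); the function and argument names are mine:

```python
import numpy as np

def expected_log_likelihood(x, ez, mu, sigma):
    """E(ln L(x, z | theta, tau)) for a two-component Normal mixture with
    tau_1 = tau_2 = 1/2 and a shared standard deviation sigma.

    x: (n,) data; ez: (n, 2) array of E(z_ij); mu: (2,) component means."""
    n = len(x)
    # sum_i [ ln(1/2) - 0.5 ln(2 pi sigma^2) ] does not involve the z's ...
    const = n * (np.log(0.5) - 0.5 * np.log(2 * np.pi * sigma ** 2))
    # ... and linearity of expectation replaces z_ij by E(z_ij) in the
    # quadratic term.
    quad = sum(ez[:, j] @ (x - mu[j]) ** 2 for j in range(2)) / (2 * sigma ** 2)
    return const - quad
```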
We calculated $E(z_{ij})$ in the previous (expectation) step. We can now solve for the $\mu_j$ that maximize the expectation by the methods given earlier: set derivatives to zero, etc. With a little more algebra you will see that the MLE for $\mu_j$ is the weighted average of the $x_i$'s, where the weights are the $E(z_{ij})$'s:

$\hat{\mu}_j = \frac{\sum_{i=1}^{n} E(z_{ij})\, x_i}{\sum_{i=1}^{n} E(z_{ij})}$

This makes sense intuitively: if a given point $x_i$ has a high probability of having been sampled from distribution 1, then it contributes strongly to our estimate of $\mu_1$ and weakly to our estimate of $\mu_2$. It can be shown that this procedure increases the likelihood at every iteration, hence is guaranteed to converge to a local maximum. Unfortunately, that is not guaranteed to be the global maximum, but empirically the method works well in many situations.
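Putting the pieces together, here is a compact end-to-end sketch of EM for this two-component model. All names, data, and parameter values are illustrative, not from the notes; to match the derivation above it re-estimates only the means, keeping $\tau = 1/2$ and $\sigma$ fixed (a partial M-step of this kind still never decreases the likelihood):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Synthetic "heights": a half/half mixture of N(-2, 1) and N(2, 1).
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(2, 1, 500)])

tau, sigma = np.array([0.5, 0.5]), np.array([1.0, 1.0])
mu = np.array([-1.0, 1.0])  # crude initial guess

def log_likelihood(x, tau, mu, sigma):
    dens = sum(tau[j] * norm.pdf(x, mu[j], sigma[j]) for j in range(2))
    return np.log(dens).sum()

prev = -np.inf
for it in range(100):
    # E-step: posterior membership probabilities E(z_ij) via Bayes' rule.
    w = np.column_stack([tau[j] * norm.pdf(x, mu[j], sigma[j]) for j in range(2)])
    ez = w / w.sum(axis=1, keepdims=True)
    # M-step: each mean is the E(z_ij)-weighted average of the data.
    mu = ez.T @ x / ez.sum(axis=0)
    # The likelihood never decreases; stop when the gain is negligible.
    cur = log_likelihood(x, tau, mu, sigma)
    assert cur >= prev - 1e-9
    if cur - prev < 1e-8:
        break
    prev = cur

print(mu)  # ~ [-2, 2]: a local (here also the intuitively right) maximum
```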