Chapter 7. The Expectation-Maximisation Algorithm

7.1 The EM algorithm - a method for maximising the likelihood

Let us suppose that we observe $Y = \{Y_i\}_{i=1}^n$. The joint density of $Y$ is $f(Y;\theta_0)$, where $\theta_0$ is an unknown parameter. Our objective is to estimate $\theta_0$. The log-likelihood of $Y$ is
$$\mathcal{L}_n(Y;\theta) = \log f(Y;\theta).$$
Observe that we have not specified that the $\{Y_i\}$ are iid random variables. This is because the procedure described below is very general and the observations need not be either independent or identically distributed (indeed, an interesting extension of this procedure is to time series with missing data, first proposed in Shumway and Stoffer (1982) and Engle and Watson (1981)). Our objective is to estimate $\theta_0$ in the situation where either evaluating the log-likelihood $\mathcal{L}_n$ or maximising $\mathcal{L}_n$ is difficult. Hence an alternative means of maximising $\mathcal{L}_n$ is required. Often there may exist unobserved data $U = \{U_i\}_{i=1}^m$ for which the likelihood of $(Y,U)$ can easily be evaluated. It is through these unobserved data that we find an alternative method for maximising $\mathcal{L}_n$. The EM algorithm was specified in its current form in Dempster, Laird and Rubin (1977), although it had previously been applied to several specific models.

Example 7.1.1

(i) Suppose that $\{f_j(\cdot;\theta)\}_{j=1}^m$ is a sequence of densities from $m$ exponential classes of densities. In earlier sections (on the exponential family) we showed that it is straightforward to maximise each of these densities separately. However, let us suppose that each $f_j(\cdot;\theta)$ corresponds to one subpopulation. All the populations are pooled together, and given an observation $X_i$ it is unknown which subpopulation it comes from. Let $\delta_i$ denote the subpopulation the individual $X_i$ comes from, i.e. $\delta_i \in \{1,\dots,m\}$ with $P(\delta_i = j) = p_j$. The density of this mixture of distributions is
$$f(x;\theta) = \sum_{j=1}^m f(x \mid \delta = j)\,P(\delta = j) = \sum_{j=1}^m p_j f_j(x;\theta),$$
where $\sum_{j=1}^m p_j = 1$. Thus the log-likelihood of $\{X_i\}$ is
$$\sum_{i=1}^n \log\Big(\sum_{j=1}^m p_j f_j(X_i;\theta)\Big).$$
Since we require that $\sum_{j=1}^m p_j = 1$, we include a Lagrange multiplier in the likelihood to ensure this holds:
$$\sum_{i=1}^n \log\Big(\sum_{j=1}^m p_j f_j(X_i;\theta)\Big) + \lambda\Big(\sum_{j=1}^m p_j - 1\Big).$$
It is straightforward to maximise the likelihood for each individual subpopulation; however, it is extremely difficult to maximise the likelihood of this mixture of distributions.

The data $\{X_i\}$ can be treated as incomplete, since the information $\{\delta_i\}$ about which subpopulation each individual belongs to is missing. If $\{X_i,\delta_i\}$ were known, the likelihood of $\{X_i,\delta_i\}$ would be
$$\prod_{i=1}^n\prod_{j=1}^m\big(p_j f_j(X_i;\theta)\big)^{I(\delta_i=j)} = \prod_{i=1}^n p_{\delta_i} f_{\delta_i}(X_i;\theta),$$
which leads to the log-likelihood of $\{X_i,\delta_i\}$,
$$\sum_{i=1}^n\log\big(p_{\delta_i} f_{\delta_i}(X_i;\theta)\big) = \sum_{i=1}^n\big(\log p_{\delta_i} + \log f_{\delta_i}(X_i;\theta)\big),$$

which is far easier to maximise. Again, to ensure that $\sum_{j=1}^m p_j = 1$, we include a Lagrange multiplier
$$\sum_{i=1}^n\big(\log p_{\delta_i} + \log f_{\delta_i}(X_i;\theta)\big) + \lambda\Big(\sum_{j=1}^m p_j - 1\Big).$$
It is easy to show that the resulting estimator of the mixture proportions is $\hat p_j = n^{-1}\sum_{i=1}^n I(\delta_i = j)$.

(ii) Let us suppose that $\{T_i\}_{i=1}^{n+m}$ are iid survival times with density $f(x;\theta_0)$. Some of these times are censored and we observe $\{Y_i\}_{i=1}^{n+m}$, where $Y_i = \min(T_i,c)$. To simplify notation we will suppose that $Y_i = T_i$ for $1\le i\le n$ (the survival time is observed), whereas $Y_i = c$ for $n+1\le i\le n+m$. Using the results on censored likelihoods from earlier in these notes, the log-likelihood of $Y$ is
$$\mathcal{L}_n(Y;\theta) = \sum_{i=1}^n\log f(Y_i;\theta) + \sum_{i=n+1}^{n+m}\log\mathcal{F}(Y_i;\theta),$$
where $\mathcal{F}(x;\theta) = P(T_i > x)$ is the survival function. The observations $\{Y_i\}_{i=n+1}^{n+m}$ can be treated as if they were missing. Define the complete observations $U = \{T_i\}_{i=n+1}^{n+m}$; hence $U$ contains the unobserved survival times. Then the log-likelihood of $(Y,U)$ is
$$\mathcal{L}_n(Y,U;\theta) = \sum_{i=1}^{n+m}\log f(T_i;\theta).$$
If no analytic expression exists for the survival function $\mathcal{F}$, it is easier to maximise $\mathcal{L}_n(Y,U)$ than $\mathcal{L}_n(Y)$.

We now formally describe the EM algorithm. As mentioned in the discussion above, it is often easier to maximise the joint likelihood of $(Y,U)$ than the likelihood of $Y$ itself. The EM algorithm is based on maximising an approximation of the complete likelihood of $(Y,U)$ using the data that is observed, $Y$. Let us suppose that the joint log-likelihood of $(Y,U)$ is
$$\mathcal{L}_n(Y,U;\theta) = \log f(Y,U;\theta).$$
This is often called the complete likelihood; we will assume that if $U$ were known, then this likelihood would be easy to obtain and differentiate.

We will also assume that the conditional density $f(u|Y;\theta)$ is known and easy to evaluate. By using Bayes' theorem it is straightforward to show that
$$\log f(Y,U;\theta) = \log f(Y;\theta) + \log f(U|Y;\theta),\qquad\text{hence}\qquad \mathcal{L}_n(Y,U;\theta) = \mathcal{L}_n(Y;\theta) + \log f(U|Y;\theta). \tag{7.1}$$
Of course, in reality $\log f(Y,U;\theta)$ is unknown, because $U$ is unobserved. However, let us consider the expected value of $\log f(Y,U;\theta)$ given what we observe, $Y$. That is,
$$Q(\theta',\theta) = \mathrm{E}\big[\log f(Y,U;\theta)\,\big|\,Y,\theta'\big] = \int\log f(Y,u;\theta)\,f(u|Y,\theta')\,du, \tag{7.2}$$
where $f(u|Y,\theta')$ is the conditional density of $U$ given $Y$ and the parameter $\theta'$. Hence if $f(u|Y,\theta')$ were known, then $Q(\theta',\theta)$ could be evaluated.

Remark 7.1.1 It is worth noting that $Q(\theta',\theta) = \mathrm{E}[\log f(Y,U;\theta)|Y,\theta']$ can be viewed as the best predictor of the complete likelihood (involving both the observed and unobserved data $(Y,U)$) given what is observed, $Y$. We recall that the conditional expectation is the best predictor of $U$ in terms of mean squared error, that is, it is the function of $Y$ which minimises the mean squared error: $\mathrm{E}(U|Y) = \arg\min_{g}\mathrm{E}\big(U - g(Y)\big)^2$.

The EM algorithm is based on iterating $Q(\cdot,\cdot)$ in such a way that at each step we obtain an estimator which gives a larger value of $Q(\cdot,\cdot)$ (and, as we will show later, a larger $\mathcal{L}_n(Y;\theta)$). We describe the EM algorithm below.

The EM algorithm:

(i) Define an initial value $\theta_1$. Let $\theta^* = \theta_1$.

(ii) The expectation step (E-step, at the $(k+1)$th iteration): for the fixed $\theta^* = \theta_k$ evaluate
$$Q(\theta^*,\theta) = \mathrm{E}\big[\log f(Y,U;\theta)\,\big|\,Y,\theta^*\big] = \int\log f(Y,u;\theta)\,f(u|Y,\theta^*)\,du$$
for all $\theta\in\Theta$.

(iii) The maximisation step (M-step): evaluate $\theta_{k+1} = \arg\max_{\theta\in\Theta}Q(\theta^*,\theta)$. We note that the maximisation can often be done by solving
$$\mathrm{E}\Big[\frac{\partial\log f(Y,U;\theta)}{\partial\theta}\,\Big|\,Y,\theta^*\Big] = 0.$$

(iv) If $\theta_k$ and $\theta_{k+1}$ are sufficiently close to each other, stop the algorithm and set $\hat\theta_n = \theta_{k+1}$. Else set $\theta^* = \theta_{k+1}$, go back and repeat steps (ii) and (iii).
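To make the iteration concrete, here is a minimal sketch of the generic EM loop in Python. It assumes the user supplies model-specific functions `e_step` and `m_step`; these names, and the whole interface, are illustrative rather than taken from any library.

```python
import numpy as np

def em(y, theta0, e_step, m_step, tol=1e-8, max_iter=500):
    """Generic EM iteration (illustrative sketch).

    y       : observed data
    theta0  : initial parameter value (array-like)
    e_step  : function (y, theta_star) -> statistics defining Q(theta_star, .)
    m_step  : function (y, stats) -> argmax_theta Q(theta_star, theta)
    """
    theta = np.asarray(theta0, dtype=float)
    for k in range(max_iter):
        stats = e_step(y, theta)        # E-step: expectations given Y and the current theta
        theta_new = m_step(y, stats)    # M-step: maximise Q(theta, .)
        if np.max(np.abs(theta_new - theta)) < tol:   # step (iv): stop when iterates are close
            return theta_new, k + 1
        theta = theta_new
    return theta, max_iter
```

Any of the specific models below (censored data, mixtures, HMMs) can be plugged into this loop by writing the corresponding `e_step` and `m_step`.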

We use $\hat\theta_n$ as an estimator of $\theta_0$. To understand why this iteration is connected to maximising $\mathcal{L}_n(Y;\theta)$ and, under certain conditions, gives a good estimator of $\theta_0$ (in the sense that $\hat\theta_n$ is close to the parameter which maximises $\mathcal{L}_n$), let us return to (7.1). Taking the expectation of $\log f(Y,U;\theta)$, conditioned on $Y$ and evaluated at $\theta'$, we have
$$Q(\theta',\theta) = \mathrm{E}\big[\log f(Y,U;\theta)\,|\,Y,\theta'\big] = \mathrm{E}\big[\log f(Y;\theta) + \log f(U|Y;\theta)\,|\,Y,\theta'\big] = \log f(Y;\theta) + \mathrm{E}\big[\log f(U|Y;\theta)\,|\,Y,\theta'\big]. \tag{7.3}$$
Define
$$D(\theta',\theta) = \mathrm{E}\big[\log f(U|Y;\theta)\,|\,Y,\theta'\big] = \int\log f(u|Y;\theta)\,f(u|Y,\theta')\,du.$$
Substituting $D(\theta',\theta)$ into (7.3) gives
$$Q(\theta',\theta) = \mathcal{L}_n(\theta) + D(\theta',\theta); \tag{7.4}$$
we use this expression in the proof below to show that $\mathcal{L}_n(\theta_{k+1}) \ge \mathcal{L}_n(\theta_k)$. First note that at the $(k+1)$th iteration of the EM algorithm $\theta_{k+1}$ maximises $Q(\theta_k,\theta)$ over all $\theta\in\Theta$, hence $Q(\theta_k,\theta_{k+1}) \ge Q(\theta_k,\theta_k)$ (which will also be used in the proof). In the lemma below we show that $\mathcal{L}_n(\theta_{k+1}) \ge \mathcal{L}_n(\theta_k)$; hence at each iteration of the EM algorithm we obtain a $\theta_{k+1}$ which increases the likelihood over the previous iteration.

Lemma 7.1.1 When running the EM algorithm the inequality $\mathcal{L}_n(\theta_{k+1}) \ge \mathcal{L}_n(\theta_k)$ always holds. Furthermore, if $\theta_k \to \hat\theta$ and at every iteration $\frac{\partial Q(\theta_1,\theta_2)}{\partial\theta_2}\big|_{(\theta_k,\theta_{k+1})} = 0$, then $\frac{\partial\mathcal{L}_n(\theta)}{\partial\theta}\big|_{\hat\theta} = 0$ (this point can be a saddle point, a local maximum or the sought-after global maximum).

PROOF. From (7.4) it is clear that
$$Q(\theta_k,\theta_{k+1}) - Q(\theta_k,\theta_k) = \big[\mathcal{L}_n(\theta_{k+1}) - \mathcal{L}_n(\theta_k)\big] + \big[D(\theta_k,\theta_{k+1}) - D(\theta_k,\theta_k)\big], \tag{7.5}$$
where we recall
$$D(\theta',\theta) = \mathrm{E}\big[\log f(U|Y;\theta)\,|\,Y,\theta'\big] = \int\log f(u|Y;\theta)\,f(u|Y,\theta')\,du.$$
We will show that $D(\theta_k,\theta_{k+1}) - D(\theta_k,\theta_k) \le 0$; the result follows from this. We observe that
$$D(\theta_k,\theta_{k+1}) - D(\theta_k,\theta_k) = \int\log\frac{f(u|Y,\theta_{k+1})}{f(u|Y,\theta_k)}\,f(u|Y,\theta_k)\,du.$$
By Jensen's inequality (which we have used several times previously),
$$D(\theta_k,\theta_{k+1}) - D(\theta_k,\theta_k) \le \log\int\frac{f(u|Y,\theta_{k+1})}{f(u|Y,\theta_k)}\,f(u|Y,\theta_k)\,du = \log\int f(u|Y,\theta_{k+1})\,du = 0.$$
Therefore $D(\theta_k,\theta_{k+1}) - D(\theta_k,\theta_k) \le 0$. Note that if $\theta$ uniquely identifies the distribution $f(u|Y;\theta)$, then equality only happens when $\theta_{k+1} = \theta_k$. Since $D(\theta_k,\theta_{k+1}) - D(\theta_k,\theta_k) \le 0$, by (7.5) we have
$$\mathcal{L}_n(\theta_{k+1}) - \mathcal{L}_n(\theta_k) \ge Q(\theta_k,\theta_{k+1}) - Q(\theta_k,\theta_k) \ge 0,$$
and we obtain the desired result ($\mathcal{L}_n(\theta_{k+1}) \ge \mathcal{L}_n(\theta_k)$).

To prove the second part of the result we will use the fact that for all $\theta'$
$$\frac{\partial D(\theta',\theta)}{\partial\theta}\Big|_{\theta=\theta'} = \int\frac{\partial\log f(u|Y;\theta)}{\partial\theta}\Big|_{\theta=\theta'}f(u|Y,\theta')\,du = \frac{\partial}{\partial\theta}\int f(u|Y,\theta)\,du\,\Big|_{\theta=\theta'} = 0. \tag{7.6}$$
We want to show that the derivative of the likelihood is zero at $\hat\theta$. To show this we use the identity
$$\mathcal{L}_n(\theta_{k+1}) = Q(\theta_k,\theta_{k+1}) - D(\theta_k,\theta_{k+1}).$$
Taking derivatives with respect to $\theta$ and evaluating at $\theta_{k+1}$ gives
$$\frac{\partial\mathcal{L}_n(\theta)}{\partial\theta}\Big|_{\theta_{k+1}} = \frac{\partial Q(\theta_k,\theta)}{\partial\theta}\Big|_{\theta_{k+1}} - \frac{\partial D(\theta_k,\theta)}{\partial\theta}\Big|_{\theta_{k+1}}.$$
By the definition of $\theta_{k+1}$ we have $\frac{\partial Q(\theta_k,\theta)}{\partial\theta}\big|_{\theta_{k+1}} = 0$, hence
$$\frac{\partial\mathcal{L}_n(\theta)}{\partial\theta}\Big|_{\theta_{k+1}} = -\frac{\partial D(\theta_k,\theta)}{\partial\theta}\Big|_{\theta_{k+1}}.$$

Furthermore, since by assumption $\theta_k \to \hat\theta$, this implies that as $k \to \infty$
$$\frac{\partial\mathcal{L}_n(\theta)}{\partial\theta}\Big|_{\hat\theta} = -\frac{\partial D(\theta_1,\theta_2)}{\partial\theta_2}\Big|_{(\hat\theta,\hat\theta)} = 0,$$
which follows from (7.6), thus giving the required result. Further information on convergence can be found in Boyles (1983) (jstor.org/stable/pdf/3456.pdf) and Wu (1983) (jstor.org/stable/pdf/40463.pdf).

Remark 7.1.2 Note that the EM algorithm will converge to a point $\hat\theta$ satisfying $\frac{\partial\mathcal{L}_n(\theta)}{\partial\theta}\big|_{\hat\theta} = 0$. The reason can be seen from the identity $Q(\theta',\theta) = \mathcal{L}_n(\theta) + D(\theta',\theta)$. The derivative of this identity with respect to $\theta$ is
$$\frac{\partial Q(\theta',\theta)}{\partial\theta} = \frac{\partial\mathcal{L}_n(\theta)}{\partial\theta} + \frac{\partial D(\theta',\theta)}{\partial\theta}. \tag{7.7}$$
Observe that $\theta\mapsto D(\theta',\theta)$ is maximised at $\theta = \theta'$ (for all $\theta'$; this is clear from the proof above), which for $\theta' = \hat\theta$ gives $\frac{\partial D(\hat\theta,\theta)}{\partial\theta}\big|_{\hat\theta} = 0$. Since at the limit $\hat\theta = \arg\max_\theta Q(\hat\theta,\theta)$, we also have $\frac{\partial Q(\hat\theta,\theta)}{\partial\theta}\big|_{\hat\theta} = 0$. Using (7.7), this gives $\frac{\partial\mathcal{L}_n(\theta)}{\partial\theta}\big|_{\hat\theta} = 0$.

In order to prove the results in the following section we use the following identities. Differentiating $Q(\theta',\theta) = \mathcal{L}_n(\theta) + D(\theta',\theta)$ twice with respect to $\theta$ and evaluating at $\theta = \theta'$ gives
$$-\frac{\partial^2\mathcal{L}_n(\theta)}{\partial\theta^2}\Big|_{\theta'} = -\frac{\partial^2 Q(\theta',\theta)}{\partial\theta^2}\Big|_{\theta=\theta'} + \frac{\partial^2 D(\theta',\theta)}{\partial\theta^2}\Big|_{\theta=\theta'}. \tag{7.8}$$
We observe that the left-hand side of the above is the observed Fisher information matrix $I(\theta|Y)$, that
$$I_C(\theta|Y) = -\frac{\partial^2 Q(\theta',\theta)}{\partial\theta^2}\Big|_{\theta=\theta'} = -\int\frac{\partial^2\log f(u,Y;\theta)}{\partial\theta^2}\,f(u|Y,\theta)\,du \tag{7.9}$$

is the complete Fisher information conditioned on what is observed, and that
$$I_M(\theta|Y) = -\frac{\partial^2 D(\theta',\theta)}{\partial\theta^2}\Big|_{\theta=\theta'} = -\int\frac{\partial^2\log f(u|Y;\theta)}{\partial\theta^2}\,f(u|Y,\theta)\,du \tag{7.10}$$
is the Fisher information matrix of the unobserved data conditioned on what is observed. Thus
$$I(\theta|Y) = I_C(\theta|Y) - I_M(\theta|Y).$$

7.1.1 Speed of convergence of $\theta_k$ to a stable point

When analysing an algorithm it is instructive to understand how long it takes to converge to the limiting point. In the case of the EM algorithm, this means asking what factors determine the rate at which $\theta_k$ converges to a stable point $\hat\theta$ (note that this has nothing to do with the rate of convergence of an estimator to the true parameter, and it is important to understand this distinction). The rate of convergence of an algorithm is usually measured by the ratio of the current iteration's error to the previous iteration's error:
$$R = \lim_{k\to\infty}\frac{\theta_{k+1}-\hat\theta}{\theta_k-\hat\theta};$$
if the algorithm converges to the limit in a finite number of iterations we set the above limit to zero. Thus the smaller $R$, the faster the rate of convergence (for example, if $\theta_k - \hat\theta = \rho^k$ with $|\rho| < 1$ then $R = \rho$, whereas if the error decays faster than geometrically then $R = 0$). Note that since $(\theta_{k+1}-\hat\theta) = \big[\prod_{j=1}^{k}\frac{\theta_{j+1}-\hat\theta}{\theta_j-\hat\theta}\big](\theta_1-\hat\theta)$, typically $R \le 1$.

To obtain an approximation of $R$ we make a Taylor expansion of $\frac{\partial Q(\theta_1,\theta_2)}{\partial\theta_2}$ around the limit $(\hat\theta,\hat\theta)$. To do this we recall that for a bivariate function $f:\mathbb{R}^2\to\mathbb{R}$ and $(x_0,y_0)$ close to $(x,y)$ we have the Taylor expansion
$$f(x,y) = f(x_0,y_0) + (x-x_0)\frac{\partial f}{\partial x}\Big|_{(x_0,y_0)} + (y-y_0)\frac{\partial f}{\partial y}\Big|_{(x_0,y_0)} + \text{lower order terms}.$$
Applying the above to $\frac{\partial Q(\theta_1,\theta_2)}{\partial\theta_2}$ gives
$$\frac{\partial Q(\theta_1,\theta_2)}{\partial\theta_2}\Big|_{(\theta_k,\theta_{k+1})} \approx \frac{\partial Q(\theta_1,\theta_2)}{\partial\theta_2}\Big|_{(\hat\theta,\hat\theta)} + (\theta_{k+1}-\hat\theta)\frac{\partial^2 Q(\theta_1,\theta_2)}{\partial\theta_2^2}\Big|_{(\hat\theta,\hat\theta)} + (\theta_k-\hat\theta)\frac{\partial^2 Q(\theta_1,\theta_2)}{\partial\theta_1\partial\theta_2}\Big|_{(\hat\theta,\hat\theta)}.$$

Since $\theta_{k+1}$ maximises $Q(\theta_k,\theta)$ and $\hat\theta$ maximises $Q(\hat\theta,\theta)$ within the interior of the parameter space, we have
$$\frac{\partial Q(\theta_1,\theta_2)}{\partial\theta_2}\Big|_{(\theta_k,\theta_{k+1})} = \frac{\partial Q(\theta_1,\theta_2)}{\partial\theta_2}\Big|_{(\hat\theta,\hat\theta)} = 0.$$
This implies that
$$(\theta_{k+1}-\hat\theta)\frac{\partial^2 Q(\theta_1,\theta_2)}{\partial\theta_2^2}\Big|_{(\hat\theta,\hat\theta)} + (\theta_k-\hat\theta)\frac{\partial^2 Q(\theta_1,\theta_2)}{\partial\theta_1\partial\theta_2}\Big|_{(\hat\theta,\hat\theta)} \approx 0.$$
Thus
$$\lim_{k\to\infty}\frac{\theta_{k+1}-\hat\theta}{\theta_k-\hat\theta} = -\Big[\frac{\partial^2 Q(\theta_1,\theta_2)}{\partial\theta_2^2}\Big|_{(\hat\theta,\hat\theta)}\Big]^{-1}\frac{\partial^2 Q(\theta_1,\theta_2)}{\partial\theta_1\partial\theta_2}\Big|_{(\hat\theta,\hat\theta)}. \tag{7.11}$$
This result shows that the rate of convergence depends on the ratio of curvatures of $Q(\theta_1,\theta_2)$ around $(\hat\theta,\hat\theta)$. Some further simplification can be made by noting that $Q(\theta_1,\theta_2) = \mathcal{L}_n(\theta_2) + D(\theta_1,\theta_2)$, hence
$$\frac{\partial^2 Q(\theta_1,\theta_2)}{\partial\theta_1\partial\theta_2} = \frac{\partial^2 D(\theta_1,\theta_2)}{\partial\theta_1\partial\theta_2}.$$
Substituting this into (7.11) gives
$$\lim_{k\to\infty}\frac{\theta_{k+1}-\hat\theta}{\theta_k-\hat\theta} = -\Big[\frac{\partial^2 Q(\theta_1,\theta_2)}{\partial\theta_2^2}\Big|_{(\hat\theta,\hat\theta)}\Big]^{-1}\frac{\partial^2 D(\theta_1,\theta_2)}{\partial\theta_1\partial\theta_2}\Big|_{(\hat\theta,\hat\theta)}. \tag{7.12}$$
To make one further simplification, we note that
$$\frac{\partial^2 D(\theta_1,\theta_2)}{\partial\theta_1\partial\theta_2}\Big|_{(\hat\theta,\hat\theta)} = \int\frac{\partial\log f(u|Y,\theta_2)}{\partial\theta_2}\Big|_{\hat\theta}\,\frac{\partial f(u|Y,\theta_1)}{\partial\theta_1}\Big|_{\hat\theta}\,du = -\int\frac{\partial^2\log f(u|Y,\theta)}{\partial\theta^2}\Big|_{\hat\theta}\,f(u|Y,\hat\theta)\,du = -\frac{\partial^2 D(\theta_1,\theta_2)}{\partial\theta_2^2}\Big|_{(\hat\theta,\hat\theta)}, \tag{7.13}$$
where the last equality follows from the information-matrix identity
$$\int\frac{\partial^2\log f(x;\theta)}{\partial\theta^2}f(x;\theta)\,dx + \int\Big(\frac{\partial\log f(x;\theta)}{\partial\theta}\Big)^2 f(x;\theta)\,dx = 0$$
(proved earlier in these notes). Substituting (7.13) into (7.12) gives
$$\lim_{k\to\infty}\frac{\theta_{k+1}-\hat\theta}{\theta_k-\hat\theta} = \Big[\frac{\partial^2 Q(\theta_1,\theta_2)}{\partial\theta_2^2}\Big|_{(\hat\theta,\hat\theta)}\Big]^{-1}\frac{\partial^2 D(\theta_1,\theta_2)}{\partial\theta_2^2}\Big|_{(\hat\theta,\hat\theta)}. \tag{7.14}$$

Substituting (7.9) and (7.10) into the above gives
$$\lim_{k\to\infty}\frac{\theta_{k+1}-\hat\theta}{\theta_k-\hat\theta} = I_C(\hat\theta|Y)^{-1}\,I_M(\hat\theta|Y). \tag{7.15}$$
Hence the rate of convergence of the algorithm depends on the ratio $I_C(\hat\theta|Y)^{-1}I_M(\hat\theta|Y)$. The closer the largest eigenvalue of $I_C(\hat\theta|Y)^{-1}I_M(\hat\theta|Y)$ is to one, the slower the rate of convergence, and the larger the number of iterations required. The heuristic behind this result is that if the missing information is a large proportion of the complete (total) information then this ratio will be close to one and the algorithm will converge slowly. Further details can be found in Dempster et al. (1977), pages 9-10, and Meng and Rubin (1994).

7.2 Applications of the EM algorithm

7.2.1 Censored data

Let us return to the example at the start of this chapter and construct the EM algorithm for censored data. We recall that the log-likelihoods for the censored data and the complete data are
$$\mathcal{L}_n(Y;\theta) = \sum_{i=1}^n\log f(Y_i;\theta) + \sum_{i=n+1}^{n+m}\log\mathcal{F}(Y_i;\theta)$$
and
$$\mathcal{L}_n(Y,U;\theta) = \sum_{i=1}^n\log f(Y_i;\theta) + \sum_{i=n+1}^{n+m}\log f(T_i;\theta).$$
To implement the EM algorithm we need to evaluate the expectation step $Q(\theta',\theta)$. It is easy to see that
$$Q(\theta',\theta) = \mathrm{E}\big[\mathcal{L}_n(Y,U;\theta)\,|\,Y,\theta'\big] = \sum_{i=1}^n\log f(Y_i;\theta) + \sum_{i=n+1}^{n+m}\mathrm{E}\big[\log f(T_i;\theta)\,|\,Y_i,\theta'\big].$$
To obtain $\mathrm{E}[\log f(T_i;\theta)|Y_i,\theta']$ (for $i\ge n+1$) we note that
$$\mathrm{E}\big[\log f(T_i;\theta)\,|\,Y_i,\theta'\big] = \mathrm{E}\big[\log f(T_i;\theta)\,|\,T_i > c,\theta'\big] = \frac{1}{\mathcal{F}(c;\theta')}\int_c^{\infty}\log f(u;\theta)\,f(u;\theta')\,du.$$

Therefore we have
$$Q(\theta',\theta) = \sum_{i=1}^n\log f(Y_i;\theta) + \frac{m}{\mathcal{F}(c;\theta')}\int_c^{\infty}\log f(u;\theta)\,f(u;\theta')\,du.$$
We also note that the derivative of $Q(\theta',\theta)$ with respect to $\theta$ is
$$\frac{\partial Q(\theta',\theta)}{\partial\theta} = \sum_{i=1}^n\frac{1}{f(Y_i;\theta)}\frac{\partial f(Y_i;\theta)}{\partial\theta} + \frac{m}{\mathcal{F}(c;\theta')}\int_c^{\infty}\frac{1}{f(u;\theta)}\frac{\partial f(u;\theta)}{\partial\theta}\,f(u;\theta')\,du.$$
Hence for this example the EM algorithm is:

(i) Define an initial value $\theta_1$. Let $\theta^* = \theta_1$.

(ii) The expectation step: for the fixed $\theta^*$ evaluate $Q(\theta^*,\theta)$ and $\frac{\partial Q(\theta^*,\theta)}{\partial\theta}$ as above.

(iii) The maximisation step: solve for $\theta_{k+1}$ such that $\frac{\partial Q(\theta^*,\theta)}{\partial\theta}\big|_{\theta_{k+1}} = 0$.

(iv) If $\theta_k$ and $\theta_{k+1}$ are sufficiently close to each other, stop the algorithm and set $\hat\theta_n = \theta_{k+1}$. Else set $\theta^* = \theta_{k+1}$, go back and repeat steps (ii) and (iii). A small numerical sketch of this scheme, for the special case of exponentially distributed survival times, is given below.
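As a concrete illustration (not given in the notes above), suppose the survival times are exponential with density $f(x;\theta) = \theta^{-1}e^{-x/\theta}$. By the memoryless property $\mathrm{E}[T_i\,|\,T_i > c,\theta^*] = c + \theta^*$, so the M-step has the closed form $\theta_{k+1} = \big(\sum_{i=1}^n Y_i + m(c+\theta_k)\big)/(n+m)$. The Python sketch below implements this special case; all function and variable names are ours.

```python
import numpy as np

def em_censored_exponential(y_obs, m, c, theta0=1.0, tol=1e-10, max_iter=1000):
    """EM for exponential survival times, with m observations censored at c.

    y_obs : fully observed survival times (length n)
    m     : number of censored observations (each observed as c)
    c     : common censoring time
    """
    n = len(y_obs)
    s_obs = np.sum(y_obs)
    theta = theta0
    for _ in range(max_iter):
        # E-step: E[T_i | T_i > c, theta] = c + theta (memoryless property)
        expected_censored = m * (c + theta)
        # M-step: closed-form maximiser of Q(theta_k, theta)
        theta_new = (s_obs + expected_censored) / (n + m)
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

# Example usage with simulated data (true theta = 2, censoring at c = 1.5)
rng = np.random.default_rng(0)
t = rng.exponential(scale=2.0, size=500)
c = 1.5
y_obs = t[t <= c]
m = int(np.sum(t > c))
print(em_censored_exponential(y_obs, m, c))
```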

7.2.2 Mixture distributions

We now consider a useful application of the EM algorithm, to the estimation of parameters in mixture distributions. Let us suppose that $\{Y_i\}_{i=1}^n$ are iid random variables with density
$$f(y;\theta) = p f_1(y;\theta_1) + (1-p) f_2(y;\theta_2),$$
where $\theta = (p,\theta_1,\theta_2)$ are unknown parameters. For the purpose of identifiability we will suppose that $\theta_1\ne\theta_2$, $p\ne 1$ and $p\ne 0$. The log-likelihood of $\{Y_i\}$ is
$$\mathcal{L}_n(Y;\theta) = \sum_{i=1}^n\log\big(p f_1(Y_i;\theta_1) + (1-p)f_2(Y_i;\theta_2)\big). \tag{7.16}$$
Maximising the above directly can be extremely difficult. As an illustration consider the example below.

Example 7.2.1 Let us suppose that $f_1(y;\theta_1)$ and $f_2(y;\theta_2)$ are normal densities; then the log-likelihood is
$$\mathcal{L}_n(Y;\theta) = \sum_{i=1}^n\log\bigg(p\frac{1}{\sqrt{2\pi\sigma_1^2}}\exp\Big(-\frac{(Y_i-\mu_1)^2}{2\sigma_1^2}\Big) + (1-p)\frac{1}{\sqrt{2\pi\sigma_2^2}}\exp\Big(-\frac{(Y_i-\mu_2)^2}{2\sigma_2^2}\Big)\bigg).$$
We observe this is extremely difficult to maximise. On the other hand, if the $Y_i$ were simply normally distributed then the log-likelihood would have the extremely simple form
$$\mathcal{L}_n(Y;\theta) \propto -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n(Y_i-\mu)^2. \tag{7.17}$$
In other words, the simplicity of maximising the log-likelihood of the exponential family of distributions (see the earlier sections on the exponential family) is lost for mixtures of distributions.

We use the EM algorithm as an indirect but simple method of maximising (7.16). In this example it is not immediately clear what observations are missing. However, let us consider one possible interpretation of the mixture distribution. Define the random variables $\delta_i$ and $Y_i$, where $\delta_i\in\{1,2\}$, $P(\delta_i=1) = p$ and $P(\delta_i=2) = 1-p$, and where the density of $Y_i$ given $\delta_i = 1$ is $f_1$ and the density of $Y_i$ given $\delta_i = 2$ is $f_2$. Based on this definition, it is clear that the density of $Y_i$ is
$$f(y;\theta) = f(y\,|\,\delta=1)P(\delta=1) + f(y\,|\,\delta=2)P(\delta=2) = p f_1(y;\theta_1) + (1-p)f_2(y;\theta_2).$$
Hence, one interpretation of the mixture model is that there is a hidden unobserved random variable which determines the state or distribution of $Y_i$. A simple example: $Y_i$ is the height of an individual and $\delta_i$ is the gender. However, $\delta_i$ is unobserved and only the height is observed. Often a mixture distribution has a physical interpretation, similar to the height example, but sometimes it can simply be used to parametrically model a wide class of densities.

Based on the discussion above, $U = \{\delta_i\}$ can be treated as the missing observations. The likelihood of $(Y_i,\delta_i)$ is
$$\big(p f_1(Y_i;\theta_1)\big)^{I(\delta_i=1)}\big((1-p)f_2(Y_i;\theta_2)\big)^{I(\delta_i=2)} = p_{\delta_i}f_{\delta_i}(Y_i;\theta_{\delta_i}),$$

where we set $p_1 = p$ and $p_2 = 1-p$. Therefore the log-likelihood of $\{(Y_i,\delta_i)\}$ is
$$\mathcal{L}_n(Y,U;\theta) = \sum_{i=1}^n\big(\log p_{\delta_i} + \log f_{\delta_i}(Y_i;\theta_{\delta_i})\big).$$
We now need to evaluate
$$Q(\theta',\theta) = \mathrm{E}\big[\mathcal{L}_n(Y,U;\theta)\,|\,Y,\theta'\big] = \sum_{i=1}^n\Big(\mathrm{E}\big[\log p_{\delta_i}\,|\,Y_i,\theta'\big] + \mathrm{E}\big[\log f_{\delta_i}(Y_i;\theta_{\delta_i})\,|\,Y_i,\theta'\big]\Big).$$
We see that the above expectation is taken with respect to the distribution of $\delta_i$ conditioned on $Y_i$ and the parameter $\theta'$. Thus, in general,
$$\mathrm{E}\big[A(Y,\delta)\,|\,Y,\theta'\big] = \sum_j A(Y,j)\,P(\delta=j\,|\,Y,\theta'),$$
which we apply to $Q(\theta',\theta)$ to give
$$Q(\theta',\theta) = \sum_{i=1}^n\sum_{j=1}^2\big[\log p_j + \log f_j(Y_i;\theta_j)\big]\,P(\delta_i = j\,|\,Y_i,\theta').$$
Therefore we need to obtain $P(\delta_i = j\,|\,Y_i,\theta')$. By using conditioning arguments it is easy to see that
$$P(\delta_i = 1\,|\,Y_i = y,\theta') = \frac{P(\delta_i = 1, Y_i = y;\theta')}{P(Y_i = y;\theta')} = \frac{p' f_1(y;\theta_1')}{p' f_1(y;\theta_1') + (1-p')f_2(y;\theta_2')} =: w_1(\theta',y),$$
$$P(\delta_i = 2\,|\,Y_i = y,\theta') = \frac{(1-p')f_2(y;\theta_2')}{p' f_1(y;\theta_1') + (1-p')f_2(y;\theta_2')} =: w_2(\theta',y) = 1 - w_1(\theta',y)$$
(to see why, note that $P(\delta_i = 1, Y_i\in[y-h/2,y+h/2]) \approx h\,p'f_1(y)$ while $P(Y_i\in[y-h/2,y+h/2]) \approx h\,(p'f_1(y)+(1-p')f_2(y))$). Therefore
$$Q(\theta',\theta) = \sum_{i=1}^n\big[\log p + \log f_1(Y_i;\theta_1)\big]w_1(\theta',Y_i) + \sum_{i=1}^n\big[\log(1-p) + \log f_2(Y_i;\theta_2)\big]w_2(\theta',Y_i).$$
Maximising the above with respect to $p$, $\theta_1$ and $\theta_2$ will in general be much easier than maximising $\mathcal{L}_n(Y;\theta)$. For this example the EM algorithm is:

(i) Define an initial value for $\theta = (p,\theta_1,\theta_2)$ and set it as $\theta^*$.

(ii) The expectation step: for the fixed $\theta^*$ evaluate
$$Q(\theta^*,\theta) = \sum_{i=1}^n\big[\log p + \log f_1(Y_i;\theta_1)\big]w_1(\theta^*,Y_i) + \sum_{i=1}^n\big[\log(1-p) + \log f_2(Y_i;\theta_2)\big]w_2(\theta^*,Y_i).$$

(iii) The maximisation step: evaluate $\theta_{k+1} = \arg\max_\theta Q(\theta^*,\theta)$ by differentiating $Q(\theta^*,\theta)$ with respect to $\theta$ and equating to zero. Since the parameter $p$ and the parameters $\theta_1,\theta_2$ appear in separate subfunctions, they can be maximised separately.

(iv) If $\theta_k$ and $\theta_{k+1}$ are sufficiently close to each other, stop the algorithm and set $\hat\theta_n = \theta_{k+1}$. Else set $\theta^* = \theta_{k+1}$, go back and repeat steps (ii) and (iii).

Example 7.2.2 (Normal mixtures and mixtures from the exponential family)

(i) We briefly outline the algorithm in the case of a mixture of two normal distributions. In this case
$$Q(\theta^*,\theta) = \sum_{i=1}^n\sum_{j=1}^2 w_j(\theta^*,Y_i)\Big(-\frac{(Y_i-\mu_j)^2}{2\sigma_j^2} - \frac{1}{2}\log\sigma_j^2\Big) + \sum_{i=1}^n\big(w_1(\theta^*,Y_i)\log p + w_2(\theta^*,Y_i)\log(1-p)\big).$$
By differentiating the above with respect to $\mu_j$, $\sigma_j^2$ (for $j = 1,2$) and $p$, it is straightforward to see that the maximisers are
$$\hat\mu_j = \frac{\sum_{i=1}^n w_j(\theta^*,Y_i)Y_i}{\sum_{i=1}^n w_j(\theta^*,Y_i)},\qquad \hat\sigma_j^2 = \frac{\sum_{i=1}^n w_j(\theta^*,Y_i)(Y_i-\hat\mu_j)^2}{\sum_{i=1}^n w_j(\theta^*,Y_i)},\qquad \hat p = \frac{1}{n}\sum_{i=1}^n w_1(\theta^*,Y_i).$$
Once these estimators are obtained we set $\theta^* = (\hat\mu_1,\hat\mu_2,\hat\sigma_1^2,\hat\sigma_2^2,\hat p)$, the quantities $w_j(\theta^*,Y_i)$ are re-evaluated and $Q(\theta^*,\theta)$ is maximised with respect to the new weights (a numerical sketch of this scheme is given below).

(ii) In general, if $Y$ is a mixture from the exponential family with density
$$f(y;\theta) = \sum_{j=1}^m p_j\exp\big(y\theta_j - \kappa_j(\theta_j) + c_j(y)\big),$$
the corresponding $Q(\theta^*,\theta)$ is
$$Q(\theta^*,\theta) = \sum_{i=1}^n\sum_{j=1}^m w_j(\theta^*,Y_i)\big[Y_i\theta_j - \kappa_j(\theta_j) + c_j(Y_i) + \log p_j\big],$$
where
$$w_j(\theta^*,Y_i) = \frac{p_j^*\exp\big(Y_i\theta_j^* - \kappa_j(\theta_j^*) + c_j(Y_i)\big)}{\sum_{k=1}^m p_k^*\exp\big(Y_i\theta_k^* - \kappa_k(\theta_k^*) + c_k(Y_i)\big)},$$
subject to the constraint that $\sum_{j=1}^m p_j = 1$. Thus for $1\le j\le m$, $Q(\theta^*,\theta)$ is maximised by
$$\hat\theta_j = \mu_j^{-1}\bigg(\frac{\sum_{i=1}^n w_j(\theta^*,Y_i)Y_i}{\sum_{i=1}^n w_j(\theta^*,Y_i)}\bigg),\qquad\text{where }\mu_j = \kappa_j'$$
(we assume the parameter space of each exponential mixture component is open), and
$$\hat p_j = \frac{1}{n}\sum_{i=1}^n w_j(\theta^*,Y_i).$$
Thus we set $\theta^* = \{\hat\theta_j,\hat p_j\}_{j=1}^m$ and re-evaluate the weights.

Remark 7.2.1 Once the algorithm has terminated, we can calculate the probability that any given observation $Y_i$ belongs to subpopulation $j$, since
$$\hat P(\delta_i = j\,|\,Y_i) = \frac{\hat p_j f_j(Y_i;\hat\theta_j)}{\sum_k\hat p_k f_k(Y_i;\hat\theta_k)}.$$
This allows us to obtain a classifier for each observation $Y_i$. It is straightforward to see that the arguments above generalise to the case where the density of $Y_i$ is a mixture of $m$ different densities. However, we observe that the selection of $m$ can be quite ad hoc. There are methods for choosing $m$; these include reversible jump MCMC methods.
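The following is a small numerical sketch of the two-component normal mixture EM of Example 7.2.2(i): the E-step computes the weights $w_j(\theta^*,Y_i)$ and the M-step applies the closed-form updates for $\hat\mu_j$, $\hat\sigma_j^2$ and $\hat p$. It is written directly from the formulas above, but the function names and the simulated data are ours.

```python
import numpy as np
from scipy.stats import norm

def em_two_normal_mixture(y, p, mu1, mu2, s1, s2, n_iter=200, tol=1e-8):
    """EM for f(y) = p N(mu1, s1^2) + (1-p) N(mu2, s2^2)."""
    y = np.asarray(y, dtype=float)
    for _ in range(n_iter):
        # E-step: posterior weights w_1(theta*, Y_i); w_2 = 1 - w_1
        d1 = p * norm.pdf(y, mu1, s1)
        d2 = (1 - p) * norm.pdf(y, mu2, s2)
        w1 = d1 / (d1 + d2)
        w2 = 1.0 - w1
        # M-step: weighted means, variances and mixing proportion
        mu1_new = np.sum(w1 * y) / np.sum(w1)
        mu2_new = np.sum(w2 * y) / np.sum(w2)
        s1_new = np.sqrt(np.sum(w1 * (y - mu1_new) ** 2) / np.sum(w1))
        s2_new = np.sqrt(np.sum(w2 * (y - mu2_new) ** 2) / np.sum(w2))
        p_new = np.mean(w1)
        change = max(abs(mu1_new - mu1), abs(mu2_new - mu2),
                     abs(s1_new - s1), abs(s2_new - s2), abs(p_new - p))
        p, mu1, mu2, s1, s2 = p_new, mu1_new, mu2_new, s1_new, s2_new
        if change < tol:
            break
    return p, mu1, mu2, s1, s2

# Example usage: heights drawn from two subpopulations
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(165, 6, 400), rng.normal(178, 7, 600)])
print(em_two_normal_mixture(y, 0.5, 160.0, 180.0, 5.0, 5.0))
```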

7.2.3 Problems

Example 7.2.3 Question: Suppose that the regressors $x_t$ are believed to influence the response variable $Y_t$. The distribution of $Y_t$ is a two-component Poisson mixture,
$$P(Y_t = y) = p\,\frac{\lambda_{1,t}^y e^{-\lambda_{1,t}}}{y!} + (1-p)\,\frac{\lambda_{2,t}^y e^{-\lambda_{2,t}}}{y!},$$
where $\lambda_{1,t} = \exp(\beta_1'x_t)$ and $\lambda_{2,t} = \exp(\beta_2'x_t)$.

(i) State minimum conditions on the parameters for the above model to be identifiable.

(ii) Carefully explain (giving details of $Q(\theta^*,\theta)$ and the EM stages) how the EM algorithm can be used to obtain estimators of $\beta_1$, $\beta_2$ and $p$.

(iii) Derive the derivative of $Q(\theta^*,\theta)$, and explain how the derivative may be useful in the maximisation stage of the EM algorithm.

(iv) Given an initial value, will the EM algorithm always find the maximum of the likelihood? Explain how one can check whether the parameter at which the EM algorithm terminates maximises the likelihood.

Solution

(i) $0 < p < 1$ and $\beta_1\ne\beta_2$ (these are minimum assumptions; there could be more, which are hard to account for given the regressors $x_t$).

(ii) We first observe that $P(Y_t = y)$ is a mixture of two Poisson distributions, each with the canonical link function. Define the unobserved variables $\{U_t\}$, which are iid with $P(U_t = 1) = p$ and $P(U_t = 2) = 1-p$, and where $P(Y_t = y\,|\,U_t = j) = \lambda_{j,t}^y e^{-\lambda_{j,t}}/y!$. Therefore the complete log-density of $(Y_t,U_t)$ is
$$\log f(Y_t,U_t;\theta) = Y_t\beta_{U_t}'x_t - \exp(\beta_{U_t}'x_t) - \log Y_t! + \log p_{U_t},$$
where $\theta = (\beta_1,\beta_2,p)$, $p_1 = p$ and $p_2 = 1-p$. Thus $\mathrm{E}[\log f(Y_t,U_t;\theta)\,|\,Y_t,\theta^*]$ is
$$\big[Y_t\beta_1'x_t - \exp(\beta_1'x_t) - \log Y_t! + \log p\big]\,w(\theta^*,Y_t) + \big[Y_t\beta_2'x_t - \exp(\beta_2'x_t) - \log Y_t! + \log(1-p)\big]\big(1 - w(\theta^*,Y_t)\big),$$
where $w(\theta^*,Y_t) = P(U_t = 1\,|\,Y_t,\theta^*)$ is evaluated as
$$w(\theta^*,Y_t) = \frac{p^* f_1(Y_t,\theta^*)}{p^* f_1(Y_t,\theta^*) + (1-p^*)f_2(Y_t,\theta^*)},$$
with
$$f_j(Y_t,\theta^*) = \frac{\exp\big(\beta_j^{*\prime}x_t\,Y_t\big)\exp\big(-\exp(\beta_j^{*\prime}x_t)\big)}{Y_t!},\qquad j = 1,2.$$

Thus $Q(\theta^*,\theta)$ is
$$Q(\theta^*,\theta) = \sum_{t=1}^T\Big\{\big[Y_t\beta_1'x_t - \exp(\beta_1'x_t) - \log Y_t! + \log p\big]w(\theta^*,Y_t) + \big[Y_t\beta_2'x_t - \exp(\beta_2'x_t) - \log Y_t! + \log(1-p)\big]\big(1-w(\theta^*,Y_t)\big)\Big\}.$$
Using the above, the EM algorithm is the following:

(a) Start with an initial value, which is an estimator of $\beta_1$, $\beta_2$ and $p$; denote this as $\theta^*$.

(b) For every $\theta$ evaluate $Q(\theta^*,\theta)$.

(c) Evaluate $\arg\max_\theta Q(\theta^*,\theta)$. Denote the maximiser as the new $\theta^*$ and return to step (b).

(d) Keep iterating until the maximisers are sufficiently close to each other.

(iii) The derivatives of $Q(\theta^*,\theta)$ are
$$\frac{\partial Q}{\partial\beta_1} = \sum_{t=1}^T\big[Y_t - \exp(\beta_1'x_t)\big]x_t\,w(\theta^*,Y_t),\qquad \frac{\partial Q}{\partial\beta_2} = \sum_{t=1}^T\big[Y_t - \exp(\beta_2'x_t)\big]x_t\,\big(1-w(\theta^*,Y_t)\big),$$
$$\frac{\partial Q}{\partial p} = \sum_{t=1}^T\Big[\frac{w(\theta^*,Y_t)}{p} - \frac{1-w(\theta^*,Y_t)}{1-p}\Big].$$
Thus maximisation of $Q(\theta^*,\theta)$ can be achieved by solving the above equations using iteratively reweighted least squares (a sketch of the weight calculation in the E-step is given below).

(iv) Depending on the initial value, the EM algorithm may only locate a local maximum. To check whether we have found the global maximum, we can start the EM algorithm with several different initial values and check where each of them converges.
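A minimal sketch of the E-step weights $w(\theta^*,Y_t) = P(U_t = 1\,|\,Y_t,\theta^*)$ for this Poisson regression mixture is given below; the M-step would then run iteratively reweighted least squares for each $\beta_j$ with these weights held fixed. The code assumes the standard Poisson mass function with means $\exp(\beta_j'x_t)$, and all names are illustrative.

```python
import numpy as np
from scipy.stats import poisson

def poisson_mixture_weights(y, X, beta1, beta2, p):
    """Posterior probabilities P(U_t = 1 | Y_t, theta*) for the Poisson regression mixture.

    y     : responses, shape (T,)
    X     : regressors, shape (T, d)
    beta1, beta2 : regression coefficients of the two components, shape (d,)
    p     : mixing probability of component 1
    """
    lam1 = np.exp(X @ beta1)          # component-1 means exp(beta1' x_t)
    lam2 = np.exp(X @ beta2)          # component-2 means exp(beta2' x_t)
    f1 = poisson.pmf(y, lam1)
    f2 = poisson.pmf(y, lam2)
    return p * f1 / (p * f1 + (1 - p) * f2)
```

These weights enter the score equations in part (iii) and are held fixed while $\beta_1$, $\beta_2$ and $p$ are updated in the M-step.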

Example 7.2.4 Question: Let us suppose that $\mathcal{F}_1(t)$ and $\mathcal{F}_2(t)$ are two survival functions. Let $x$ denote a univariate regressor.

(i) Show that $\mathcal{F}(t;x) = p\,\mathcal{F}_1(t)^{\exp(\beta_1 x)} + (1-p)\,\mathcal{F}_2(t)^{\exp(\beta_2 x)}$ is a valid survival function and obtain the corresponding density function.

(ii) Suppose that $T_i$ are survival times and $x_i$ is a univariate regressor which exerts an influence on $T_i$. Let $Y_i = \min(T_i,c)$, where $c$ is a common censoring time. The $\{T_i\}$ are independent random variables with survival function
$$\mathcal{F}(t;x_i) = p\,\mathcal{F}_1(t)^{\exp(\beta_1 x_i)} + (1-p)\,\mathcal{F}_2(t)^{\exp(\beta_2 x_i)},$$
where both $\mathcal{F}_1$ and $\mathcal{F}_2$ are known, but $p$, $\beta_1$ and $\beta_2$ are unknown. State the censored likelihood and show that the EM algorithm, together with iterative least squares in the maximisation step, can be used to maximise this likelihood (sufficient details need to be given so that your algorithm can easily be coded).

Solution

(i) Since $\mathcal{F}_1$ and $\mathcal{F}_2$ are monotonically decreasing positive functions with $\mathcal{F}_1(0) = \mathcal{F}_2(0) = 1$ and $\mathcal{F}_1(\infty) = \mathcal{F}_2(\infty) = 0$, it immediately follows that
$$\mathcal{F}(t;x) = p\,\mathcal{F}_1(t)^{e^{\beta_1 x}} + (1-p)\,\mathcal{F}_2(t)^{e^{\beta_2 x}}$$
satisfies the same conditions. To obtain the density we differentiate with respect to $t$:
$$\frac{\partial\mathcal{F}(t;x)}{\partial t} = -p\,e^{\beta_1 x}f_1(t)\mathcal{F}_1(t)^{e^{\beta_1 x}-1} - (1-p)\,e^{\beta_2 x}f_2(t)\mathcal{F}_2(t)^{e^{\beta_2 x}-1}$$
$$\Rightarrow\qquad f(t;x) = p\,e^{\beta_1 x}f_1(t)\mathcal{F}_1(t)^{e^{\beta_1 x}-1} + (1-p)\,e^{\beta_2 x}f_2(t)\mathcal{F}_2(t)^{e^{\beta_2 x}-1},$$
where we use that $\frac{d\mathcal{F}_j(t)}{dt} = -f_j(t)$.

(ii) The censored log-likelihood is
$$\mathcal{L}_n(\beta_1,\beta_2,p) = \sum_{i=1}^n\big[\delta_i\log f(Y_i;\beta_1,\beta_2,p) + (1-\delta_i)\log\mathcal{F}(Y_i;\beta_1,\beta_2,p)\big],$$
where $\delta_i$ indicates whether the survival time is observed. Clearly, directly maximising the above is extremely difficult. Thus we look for an alternative method via the EM algorithm.

We first define the indicator variable (which corresponds to the missing variable) which denotes the state:
$$I_i = \begin{cases}1 & \text{with } P(I_i = 1) = p = p_1,\\ 2 & \text{with } P(I_i = 2) = 1-p = p_2.\end{cases}$$
Then the joint density of $(Y_i,\delta_i,I_i)$ is
$$\Big(p_{I_i}\,e^{\beta_{I_i}x_i}f_{I_i}(Y_i)\,\mathcal{F}_{I_i}(Y_i)^{e^{\beta_{I_i}x_i}-1}\Big)^{\delta_i}\Big(p_{I_i}\,\mathcal{F}_{I_i}(Y_i)^{e^{\beta_{I_i}x_i}}\Big)^{1-\delta_i},$$
which gives the log-density
$$\delta_i\big[\log p_{I_i} + \beta_{I_i}x_i + \log f_{I_i}(Y_i) + (e^{\beta_{I_i}x_i}-1)\log\mathcal{F}_{I_i}(Y_i)\big] + (1-\delta_i)\big[\log p_{I_i} + e^{\beta_{I_i}x_i}\log\mathcal{F}_{I_i}(Y_i)\big].$$
Thus the complete log-likelihood of $\{(Y_i,\delta_i,I_i)\}$ is
$$\mathcal{L}_n(Y,\delta,I;\beta_1,\beta_2,p) = \sum_{i=1}^n\Big\{\delta_i\big[\log p_{I_i} + \beta_{I_i}x_i + \log f_{I_i}(Y_i) + (e^{\beta_{I_i}x_i}-1)\log\mathcal{F}_{I_i}(Y_i)\big] + (1-\delta_i)\big[\log p_{I_i} + e^{\beta_{I_i}x_i}\log\mathcal{F}_{I_i}(Y_i)\big]\Big\}.$$
Next we need to calculate $P(I_i = 1\,|\,Y_i,\delta_i = 1,\theta^*)$ and $P(I_i = 1\,|\,Y_i,\delta_i = 0,\theta^*)$:
$$\omega^{\delta=1}_{1,i}(\theta^*) = P(I_i = 1\,|\,Y_i,\delta_i = 1,\theta^*) = \frac{p^*e^{\beta_1^*x_i}f_1(Y_i)\mathcal{F}_1(Y_i)^{e^{\beta_1^*x_i}-1}}{p^*e^{\beta_1^*x_i}f_1(Y_i)\mathcal{F}_1(Y_i)^{e^{\beta_1^*x_i}-1} + (1-p^*)e^{\beta_2^*x_i}f_2(Y_i)\mathcal{F}_2(Y_i)^{e^{\beta_2^*x_i}-1}},$$
$$\omega^{\delta=0}_{1,i}(\theta^*) = P(I_i = 1\,|\,Y_i,\delta_i = 0,\theta^*) = \frac{p^*\mathcal{F}_1(Y_i)^{e^{\beta_1^*x_i}}}{p^*\mathcal{F}_1(Y_i)^{e^{\beta_1^*x_i}} + (1-p^*)\mathcal{F}_2(Y_i)^{e^{\beta_2^*x_i}}},$$
with $\omega^{\delta=1}_{2,i}(\theta^*) = 1 - \omega^{\delta=1}_{1,i}(\theta^*)$ and $\omega^{\delta=0}_{2,i}(\theta^*) = 1 - \omega^{\delta=0}_{1,i}(\theta^*)$.

Let $p_1 = p$ and $p_2 = 1-p$. Therefore the complete likelihood conditioned on what we observe is
$$Q(\theta^*,\theta) = \sum_{i=1}^n\sum_{s=1}^2\Big\{\delta_i\,\omega^{\delta=1}_{s,i}(\theta^*)\big[\log p_s + \beta_s x_i + \log f_s(Y_i) + (e^{\beta_s x_i}-1)\log\mathcal{F}_s(Y_i)\big] + (1-\delta_i)\,\omega^{\delta=0}_{s,i}(\theta^*)\big[\log p_s + e^{\beta_s x_i}\log\mathcal{F}_s(Y_i)\big]\Big\}$$
$$= Q_1(\theta^*,\beta_1) + Q_2(\theta^*,\beta_2) + Q_p(\theta^*,p_1,p_2),$$
say. The conditional likelihood above looks unwieldy; however, the parameter estimators separate. First, differentiating with respect to $p$ gives
$$\frac{\partial Q(\theta^*,\theta)}{\partial p} = \sum_{i=1}^n\bigg[\frac{\delta_i\,\omega^{\delta=1}_{1,i}(\theta^*) + (1-\delta_i)\,\omega^{\delta=0}_{1,i}(\theta^*)}{p} - \frac{\delta_i\,\omega^{\delta=1}_{2,i}(\theta^*) + (1-\delta_i)\,\omega^{\delta=0}_{2,i}(\theta^*)}{1-p}\bigg].$$
Equating the above to zero we obtain the estimator $\hat p = \frac{a}{a+b}$, where
$$a = \sum_{i=1}^n\big[\delta_i\,\omega^{\delta=1}_{1,i}(\theta^*) + (1-\delta_i)\,\omega^{\delta=0}_{1,i}(\theta^*)\big],\qquad b = \sum_{i=1}^n\big[\delta_i\,\omega^{\delta=1}_{2,i}(\theta^*) + (1-\delta_i)\,\omega^{\delta=0}_{2,i}(\theta^*)\big].$$
Next we consider the estimates of $\beta_1$ and $\beta_2$. Differentiating $Q$ with respect to $\beta_s$ gives, for $s = 1,2$,
$$\frac{\partial Q_s(\theta^*,\beta_s)}{\partial\beta_s} = \sum_{i=1}^n\Big\{\delta_i\,\omega^{\delta=1}_{s,i}(\theta^*)\big[1 + e^{\beta_s x_i}\log\mathcal{F}_s(Y_i)\big] + (1-\delta_i)\,\omega^{\delta=0}_{s,i}(\theta^*)\,e^{\beta_s x_i}\log\mathcal{F}_s(Y_i)\Big\}\,x_i.$$
Observe that on setting the first derivative to zero we cannot obtain an explicit expression for the estimators at each iteration. Thus we need to use the Newton-Raphson scheme, but in a very simple set-up. To estimate $(\beta_1,\beta_2)$ at the $(j+1)$th Newton-Raphson iteration within the maximisation step we use
$$\beta_s^{(j+1)} = \beta_s^{(j)} - \Big[\frac{\partial^2 Q_s}{\partial\beta_s^2}\Big|_{\beta_s^{(j)}}\Big]^{-1}\frac{\partial Q_s}{\partial\beta_s}\Big|_{\beta_s^{(j)}},\qquad s = 1,2.$$

We can rewrite this Newton-Raphson scheme as something that resembles weighted least squares. We recall that the weighted least squares estimator minimises the weighted least squares criterion $\sum_i W_{ii}(Y_i - x_i'\beta)^2$, and the minimiser is $\hat\beta = (X'WX)^{-1}X'WY$. The Newton-Raphson scheme can be written as
$$\beta_s^{(j+1)} = \beta_s^{(j)} + (X'W_sX)^{-1}X'S_s,$$
where $X' = (x_1,x_2,\dots,x_n)$, $W_s = \mathrm{diag}[\omega_{s,1},\dots,\omega_{s,n}]$ and $S_s = (S_{s,1},\dots,S_{s,n})'$, with
$$\omega_{s,i} = \big[\delta_i\,\omega^{\delta=1}_{s,i}(\theta^*) + (1-\delta_i)\,\omega^{\delta=0}_{s,i}(\theta^*)\big]\,e^{\beta_s x_i}\big(-\log\mathcal{F}_s(Y_i)\big),$$
$$S_{s,i} = \delta_i\,\omega^{\delta=1}_{s,i}(\theta^*)\big[1 + e^{\beta_s x_i}\log\mathcal{F}_s(Y_i)\big] + (1-\delta_i)\,\omega^{\delta=0}_{s,i}(\theta^*)\,e^{\beta_s x_i}\log\mathcal{F}_s(Y_i)$$
(note that $\log\mathcal{F}_s(Y_i)\le 0$, so the weights $\omega_{s,i}$ are non-negative). By using algebraic manipulations we can rewrite the iteration as an iterated weighted least squares step:
$$\beta_s^{(j+1)} = \beta_s^{(j)} + (X'W_sX)^{-1}X'S_s = (X'W_sX)^{-1}X'W_s\big[X\beta_s^{(j)} + W_s^{-1}S_s\big].$$

Now we rewrite the above in weighted least squares form. Define the pseudo $y$-variable $Z_s = X\beta_s^{(j)} + W_s^{-1}S_s$. Using this notation we have
$$\beta_s^{(j+1)} = (X'W_sX)^{-1}X'W_sZ_s.$$
Thus at each step of the Newton-Raphson iteration we minimise the weighted least squares criterion
$$\sum_{i=1}^n\omega_{s,i}\big(Z_{s,i} - x_i\beta_s\big)^2,\qquad s = 1,2.$$
Altogether, the EM algorithm is:

Start with initial values $\beta_{1,0}$, $\beta_{2,0}$, $p_0$.

Step 1. Set $(\beta_{1,r},\beta_{2,r},p_r) = (\beta_1^*,\beta_2^*,p^*)$. Evaluate the weights $\omega^{\delta=1}_{s,i}(\theta^*)$ and $\omega^{\delta=0}_{s,i}(\theta^*)$ (these probabilities/weights stay the same throughout the iterative least squares).

Step 2. Maximise $Q(\theta^*,\theta)$: set $p_{r+1} = a_r/(a_r + b_r)$, where $a_r$ and $b_r$ are defined previously, and, for $s = 1,2$, iterate
$$\beta_s^{(j+1)} = (X'W_sX)^{-1}X'W_sZ_s$$
until convergence of the parameters.

Step 3. Go back to Step 1 and continue until convergence of the EM algorithm.

7.2.4 Exercises

Exercise 7.1 Consider the linear regression model
$$Y_i = \alpha'x_i + \sigma_i\varepsilon_i,$$
where $\varepsilon_i$ follows a standard normal distribution (mean zero and variance one) and $\sigma_i^2$ follows a Gamma distribution with density
$$f(\sigma^2;\lambda) = \frac{\lambda^{\kappa}(\sigma^2)^{\kappa-1}\exp(-\lambda\sigma^2)}{\Gamma(\kappa)},\qquad \sigma^2\ge 0,$$

with $\kappa > 0$. Let us suppose that $\alpha$ and $\lambda$ are unknown parameters but $\kappa$ is a known parameter. We showed in an earlier exercise that directly maximising the log-likelihood is extremely difficult. Derive the EM algorithm for this model. In your derivation, explain which quantities will have to be evaluated numerically.

Exercise 7.2 Consider the following shifted exponential mixture distribution
$$f(x;\theta_1,\theta_2,p,a) = p\,\frac{1}{\theta_1}\exp(-x/\theta_1)I(x\ge 0) + (1-p)\,\frac{1}{\theta_2}\exp\big(-(x-a)/\theta_2\big)I(x\ge a),$$
where $p$, $\theta_1$, $\theta_2$ and $a$ are unknown.

(i) Make a plot of the above mixture density. Considering the cases $x\ge a$ and $x < a$ separately, calculate the probability of belonging to each of the mixture components given the observation $X_i$ (i.e. define the variable $\delta_i$, where $P(\delta_i = 1) = p$ and $f(x\,|\,\delta_i = 1) = \frac{1}{\theta_1}\exp(-x/\theta_1)$, and calculate $P(\delta_i = 1\,|\,X_i)$).

(ii) Show how the EM algorithm can be used to estimate $a$, $p$, $\theta_1$, $\theta_2$. At each iteration you should be able to obtain explicit solutions for most of the parameters; give as many details as you can. Hint: it may be beneficial for you to use profiling too.

(iii) From your knowledge of estimation of these parameters, what do you conjecture the rates of convergence to be? Will they all be the same, or possibly different?

Exercise 7.3 Suppose $\{Z_i\}_{i=1}^n$ are independent random variables, where $Z_i$ has the density
$$f_Z(z;\theta_0,\alpha,\mu,\phi,u_i) = p\,h(z;\theta_0,\alpha,u_i) + (1-p)\,g(z;\phi,\mu),$$
with
$$g(x;\phi,\mu) = \frac{\phi}{\mu}\Big(\frac{x}{\mu}\Big)^{\phi-1}\exp\big(-(x/\mu)^{\phi}\big)I_{(0,\infty)}(x)$$
(the Weibull distribution) and
$$h(x;\theta_0,\alpha,u_i) = \frac{1}{\lambda_i}\exp(-x/\lambda_i)I_{(0,\infty)}(x)$$
(the exponential distribution), where $\lambda_i = \theta_0\exp(\alpha u_i)$ and $\{u_i\}_{i=1}^n$ are observed regressors. The parameters $p$, $\theta_0$, $\alpha$, $\mu$ and $\phi$ are unknown and our objective in this question is to estimate them.

(a) What is the log-likelihood of $\{Z_i\}$? (Assume we also observe the deterministic regressors $\{u_i\}$.)

(b) By defining the correct dummy variable $\delta_i$, derive the steps of the EM algorithm for estimating the parameters $p$, $\theta_0$, $\alpha$, $\mu$, $\phi$ (using the method of profiling if necessary).

7.3 Hidden Markov Models

Finally, we consider applications of the EM algorithm to parameter estimation in Hidden Markov Models (HMMs). This is a setting where the EM algorithm pretty much surpasses any other likelihood maximisation methodology. It is worth mentioning that the EM algorithm in this setting is often called the Baum-Welch algorithm. Hidden Markov models are a generalisation of mixture distributions; however, unlike mixture distributions it is difficult to derive an explicit expression for the likelihood of a Hidden Markov Model. HMMs are a general class of models which are widely used in several applications (including speech recognition) and can easily be generalised to the Bayesian set-up. A nice description of them can be found on Wikipedia. In this section we will only briefly cover how the EM algorithm can be used for HMMs. We do not attempt to address any of the issues surrounding how the maximisation is done; interested readers should refer to the extensive literature on the subject.

The general HMM is described as follows. Let us suppose that we observe $\{Y_t\}$, where the random variables $Y_t$ satisfy the Markov property $P(Y_t|Y_{t-1},Y_{t-2},\dots) = P(Y_t|Y_{t-1})$. In addition to $\{Y_t\}$ there exists a hidden, unobserved discrete random variable $\{U_t\}$, where $\{U_t\}$ satisfies the Markov property $P(U_t|U_{t-1},U_{t-2},\dots) = P(U_t|U_{t-1})$ and drives the dependence in $\{Y_t\}$. In other words, $P(Y_t|U_t,Y_{t-1},U_{t-1},\dots) = P(Y_t|U_t)$. To summarise, the HMM is described by the following properties:

(i) We observe $\{Y_t\}$ (which can be either continuous or discrete random variables) but do not observe the hidden discrete random variables $\{U_t\}$.

(ii) Both $\{Y_t\}$ and $\{U_t\}$ are time-homogeneous Markov random variables, that is, $P(Y_t|Y_{t-1},Y_{t-2},\dots) = P(Y_t|Y_{t-1})$ and $P(U_t|U_{t-1},U_{t-2},\dots) = P(U_t|U_{t-1})$. The distributions $P(Y_t)$, $P(Y_t|Y_{t-1})$, $P(U_t)$ and $P(U_t|U_{t-1})$ do not depend on $t$.

(iii) The dependence between the $\{Y_t\}$ is driven by $\{U_t\}$, that is, $P(Y_t|U_t,Y_{t-1},U_{t-1},\dots) = P(Y_t|U_t)$.

There are several examples of HMMs, but to have a clear interpretation of them, in this section we shall only consider one classical example. Let us suppose that the hidden random variable $U_t$ can take $N$ possible values $\{1,\dots,N\}$, and let $p_i = P(U_t = i)$ and $p_{ij} = P(U_t = i\,|\,U_{t-1} = j)$. Moreover, let us suppose that the $Y_t$ are continuous random variables where $(Y_t|U_t = i) \sim N(\mu_i,\sigma_i^2)$ and the conditional random variables $Y_t|U_t$ and $Y_\tau|U_\tau$ ($t\ne\tau$) are independent of each other. Our objective is to estimate the parameters $\theta = \{p_i,p_{ij},\mu_i,\sigma_i^2\}$ given $\{Y_t\}$. Let $f_i(\cdot;\theta_i)$ denote the normal density $N(\mu_i,\sigma_i^2)$.

Remark 7.3.1 (HMMs and mixture models) Mixture models (described in the section above) are a particular example of an HMM. In that case the unobserved variables $\{U_t\}$ are iid, with $p_i = P(U_t = i\,|\,U_{t-1} = j) = P(U_t = i)$ for all $i$ and $j$.

Let us denote the log-likelihood of $\{Y_t\}$ by $\mathcal{L}_T(Y;\theta)$ (this is the observed likelihood). It is clear that constructing an explicit expression for $\mathcal{L}_T$ is difficult, thus maximising the likelihood is near impossible. In the remark below we derive the observed likelihood.

Remark 7.3.2 The likelihood of $Y = (Y_1,\dots,Y_T)$ is
$$L_T(Y;\theta) = f(Y_T|Y_{T-1},Y_{T-2},\dots;\theta)\cdots f(Y_2|Y_1;\theta)\,f(Y_1;\theta) = f(Y_T|Y_{T-1};\theta)\cdots f(Y_2|Y_1;\theta)\,f(Y_1;\theta).$$
Thus the log-likelihood is
$$\mathcal{L}_T(Y;\theta) = \sum_{t=2}^T\log f(Y_t|Y_{t-1};\theta) + \log f(Y_1;\theta).$$
The density $f(Y_1;\theta)$ is simply the mixture density
$$f(Y_1;\theta) = p_1 f_1(Y_1;\theta_1) + \dots + p_N f_N(Y_1;\theta_N),$$
where $p_i = P(U_t = i)$. The conditional density $f(Y_t|Y_{t-1};\theta)$ is more tricky. We start with
$$f(Y_t|Y_{t-1};\theta) = \frac{f(Y_t,Y_{t-1};\theta)}{f(Y_{t-1};\theta)}.$$
An expression for $f(Y_{t-1};\theta)$ is given above. To evaluate $f(Y_t,Y_{t-1};\theta)$ we condition on

$U_t, U_{t-1}$ to give (using the Markov and conditional independence properties)
$$f(Y_t,Y_{t-1};\theta) = \sum_{i,j}f(Y_t,Y_{t-1}\,|\,U_t = i,U_{t-1} = j)\,P(U_t = i,U_{t-1} = j)$$
$$= \sum_{i,j}f(Y_t|U_t = i)\,f(Y_{t-1}|U_{t-1} = j)\,P(U_t = i\,|\,U_{t-1} = j)\,P(U_{t-1} = j) = \sum_{i,j}f_i(Y_t;\theta_i)\,f_j(Y_{t-1};\theta_j)\,p_{ij}\,p_j.$$
Thus we have
$$f(Y_t|Y_{t-1};\theta) = \frac{\sum_{i,j}f_i(Y_t;\theta_i)f_j(Y_{t-1};\theta_j)p_{ij}p_j}{\sum_i p_i f_i(Y_{t-1};\theta_i)}.$$
We substitute the above into $\mathcal{L}_T(Y;\theta)$ to give the expression
$$\mathcal{L}_T(Y;\theta) = \sum_{t=2}^T\log\bigg(\frac{\sum_{i,j}f_i(Y_t;\theta_i)f_j(Y_{t-1};\theta_j)p_{ij}p_j}{\sum_i p_if_i(Y_{t-1};\theta_i)}\bigg) + \log\bigg(\sum_{i=1}^N p_i f_i(Y_1;\theta_i)\bigg).$$
Clearly, this is extremely difficult to maximise. Instead we seek an indirect method for maximising the likelihood. By using the EM algorithm we can maximise a likelihood which is a lot easier to evaluate.

Let us suppose that we observe $\{Y_t,U_t\}$. Since
$$P(Y|U) = P(Y_T|Y_{T-1},\dots,Y_1,U)\,P(Y_{T-1}|Y_{T-2},\dots,Y_1,U)\cdots P(Y_1|U) = \prod_{t=1}^T P(Y_t|U_t),$$
and the distribution of $Y_t|U_t$ is $N(\mu_{U_t},\sigma_{U_t}^2)$, the complete likelihood of $\{Y_t,U_t\}$ is
$$\prod_{t=1}^T f(Y_t|U_t;\theta)\;\times\;p_{U_1}\prod_{t=2}^T p_{U_tU_{t-1}}.$$
Thus the log-likelihood of the complete observations $\{Y_t,U_t\}$ is
$$\mathcal{L}_T(Y,U;\theta) = \sum_{t=1}^T\log f(Y_t|U_t;\theta) + \sum_{t=2}^T\log p_{U_tU_{t-1}} + \log p_{U_1}.$$
Of course, we do not observe the complete likelihood, but the above can be used in order to define the function $Q(\theta^*,\theta)$ which is maximised in the EM algorithm.

It is worth mentioning that, given the transition probabilities of a discrete Markov chain (that is $\{p_{ij}\}_{ij}$), the marginal/stationary probabilities $\{p_i\}$ can be obtained by solving $\pi = P\pi$, where $P$ is the transition matrix. Thus it is not necessary to estimate the marginal probabilities $\{p_i\}$ (note that excluding $\log p_{U_1}$ from the complete log-likelihood above gives the conditional complete log-likelihood).

We recall that maximising the observed likelihood $\mathcal{L}_T(Y;\theta)$ using the EM algorithm involves evaluating $Q(\theta^*,\theta)$, where
$$Q(\theta^*,\theta) = \mathrm{E}\Big[\sum_{t=1}^T\log f(Y_t|U_t;\theta) + \sum_{t=2}^T\log p_{U_tU_{t-1}} + \log p_{U_1}\,\Big|\,Y,\theta^*\Big]$$
$$= \sum_{U\in\{1,\dots,N\}^T}\Big[\sum_{t=1}^T\log f(Y_t|U_t;\theta) + \sum_{t=2}^T\log p_{U_tU_{t-1}} + \log p_{U_1}\Big]\,p(U|Y,\theta^*).$$
Note that at each step of the algorithm the probability $p(U|Y,\theta^*)$ needs to be evaluated. This is done by using conditioning:
$$p(U|Y,\theta^*) = \prod_{t=1}^T P(U_t|U_{t-1},\dots,U_1,Y;\theta^*) = \prod_{t=1}^T P(U_t|U_{t-1},Y;\theta^*),\quad\text{using the Markov property}.$$
Evaluation of the above is not simple (mainly because one is estimating the probability of being in state $U_t$ based on $U_{t-1}$ and the observed information $Y$ in the past, present and future). This is usually done using the so-called forward-backward algorithm (which is related to the idea of Kalman filtering). For this example the EM algorithm is:

(i) Define an initial value $\theta_1$. Let $\theta^* = \theta_1$.

(ii) The expectation step: for the fixed $\theta^*$ evaluate $P(U_t|Y,\theta^*)$, $P(U_t,U_{t-1}|Y,\theta^*)$ and $Q(\theta^*,\theta)$.

(iii) The maximisation step: evaluate $\theta_{k+1} = \arg\max_\theta Q(\theta^*,\theta)$ by differentiating $Q(\theta^*,\theta)$ with respect to $\theta$ and equating to zero.

(iv) If $\theta_k$ and $\theta_{k+1}$ are sufficiently close to each other, stop the algorithm and set $\hat\theta_n = \theta_{k+1}$. Else set $\theta^* = \theta_{k+1}$, go back and repeat steps (ii) and (iii).

Since $P(U|Y,\theta^*) = P(U,Y,\theta^*)/P(Y,\theta^*)$ and $P(U_t,U_{t-1}|Y,\theta^*) = P(U_t,U_{t-1},Y,\theta^*)/P(Y,\theta^*)$, and $P(Y,\theta^*)$ is common to all $U\in\{1,\dots,N\}^T$ and is independent of $\theta$, rather than maximising $Q(\theta^*,\theta)$ one can equivalently maximise
$$\widetilde Q(\theta^*,\theta) = \sum_{U\in\{1,\dots,N\}^T}\Big[\sum_{t=1}^T\log f(Y_t|U_t;\theta) + \sum_{t=2}^T\log p_{U_tU_{t-1}} + \log p_{U_1}\Big]\,p(U,Y,\theta^*),$$
noting that $\widetilde Q(\theta^*,\theta) \propto Q(\theta^*,\theta)$ and the maximum of $\widetilde Q(\theta^*,\theta)$ with respect to $\theta$ is the same as the maximum of $Q(\theta^*,\theta)$.
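To give a flavour of how the E-step quantities $P(U_t|Y,\theta^*)$ are obtained in practice, here is a minimal sketch of the scaled forward-backward recursions for the Gaussian-emission HMM above. It computes only the smoothed state probabilities; a full Baum-Welch implementation would also accumulate the pairwise probabilities $P(U_t,U_{t-1}|Y,\theta^*)$ and then perform the M-step updates. All function names are ours, and the transition convention follows the notes ($A[i,j] = P(U_t = i\,|\,U_{t-1} = j)$).

```python
import numpy as np
from scipy.stats import norm

def hmm_state_posteriors(y, A, pi, mu, sigma):
    """Scaled forward-backward: returns P(U_t = i | Y, theta*) for a Gaussian-emission HMM.

    y         : observations, shape (T,)
    A         : transition matrix, A[i, j] = P(U_t = i | U_{t-1} = j)
    pi        : initial/marginal state probabilities, shape (N,)
    mu, sigma : emission means and standard deviations, shape (N,)
    """
    T, N = len(y), len(pi)
    B = norm.pdf(y[:, None], mu[None, :], sigma[None, :])   # B[t, i] = f_i(y_t)

    alpha = np.zeros((T, N))
    c = np.zeros(T)                      # scaling constants to avoid numerical underflow
    alpha[0] = pi * B[0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):                # forward recursion
        alpha[t] = B[t] * (A @ alpha[t - 1])
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]

    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):       # backward recursion (same scaling)
        beta[t] = A.T @ (B[t + 1] * beta[t + 1]) / c[t + 1]

    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)          # P(U_t = i | Y)
```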


More information

Sensitivity Analysis in Markov Networks

Sensitivity Analysis in Markov Networks Sensitivity Analysis in Markov Networks Hei Chan and Adnan Darwihe Computer Siene Department University of California, Los Angeles Los Angeles, CA 90095 {hei,darwihe}@s.ula.edu Abstrat This paper explores

More information

Subject: Introduction to Component Matching and Off-Design Operation % % ( (1) R T % (

Subject: Introduction to Component Matching and Off-Design Operation % % ( (1) R T % ( 16.50 Leture 0 Subjet: Introdution to Component Mathing and Off-Design Operation At this point it is well to reflet on whih of the many parameters we have introdued (like M, τ, τ t, ϑ t, f, et.) are free

More information

Symplectic Projector and Physical Degrees of Freedom of The Classical Particle

Symplectic Projector and Physical Degrees of Freedom of The Classical Particle Sympleti Projetor and Physial Degrees of Freedom of The Classial Partile M. A. De Andrade a, M. A. Santos b and I. V. Vanea arxiv:hep-th/0308169v3 7 Sep 2003 a Grupo de Físia Teória, Universidade Católia

More information

Supplementary Materials

Supplementary Materials Supplementary Materials Neural population partitioning and a onurrent brain-mahine interfae for sequential motor funtion Maryam M. Shanehi, Rollin C. Hu, Marissa Powers, Gregory W. Wornell, Emery N. Brown

More information

Sensitivity analysis for linear optimization problem with fuzzy data in the objective function

Sensitivity analysis for linear optimization problem with fuzzy data in the objective function Sensitivity analysis for linear optimization problem with fuzzy data in the objetive funtion Stephan Dempe, Tatiana Starostina May 5, 2004 Abstrat Linear programming problems with fuzzy oeffiients in the

More information

The shape of a hanging chain. a project in the calculus of variations

The shape of a hanging chain. a project in the calculus of variations The shape of a hanging hain a projet in the alulus of variations April 15, 218 2 Contents 1 Introdution 5 2 Analysis 7 2.1 Model............................... 7 2.2 Extremal graphs.........................

More information

On Component Order Edge Reliability and the Existence of Uniformly Most Reliable Unicycles

On Component Order Edge Reliability and the Existence of Uniformly Most Reliable Unicycles Daniel Gross, Lakshmi Iswara, L. William Kazmierzak, Kristi Luttrell, John T. Saoman, Charles Suffel On Component Order Edge Reliability and the Existene of Uniformly Most Reliable Uniyles DANIEL GROSS

More information

QUANTUM MECHANICS II PHYS 517. Solutions to Problem Set # 1

QUANTUM MECHANICS II PHYS 517. Solutions to Problem Set # 1 QUANTUM MECHANICS II PHYS 57 Solutions to Problem Set #. The hamiltonian for a lassial harmoni osillator an be written in many different forms, suh as use ω = k/m H = p m + kx H = P + Q hω a. Find a anonial

More information

36-720: Log-Linear Models: Three-Way Tables

36-720: Log-Linear Models: Three-Way Tables 36-720: Log-Linear Models: Three-Way Tables Brian Junker September 10, 2007 Hierarhial Speifiation of Log-Linear Models Higher-Dimensional Tables Digression: Why Poisson If We Believe Multinomial? Produt

More information

A NETWORK SIMPLEX ALGORITHM FOR THE MINIMUM COST-BENEFIT NETWORK FLOW PROBLEM

A NETWORK SIMPLEX ALGORITHM FOR THE MINIMUM COST-BENEFIT NETWORK FLOW PROBLEM NETWORK SIMPLEX LGORITHM FOR THE MINIMUM COST-BENEFIT NETWORK FLOW PROBLEM Cen Çalışan, Utah Valley University, 800 W. University Parway, Orem, UT 84058, 801-863-6487, en.alisan@uvu.edu BSTRCT The minimum

More information

Taste for variety and optimum product diversity in an open economy

Taste for variety and optimum product diversity in an open economy Taste for variety and optimum produt diversity in an open eonomy Javier Coto-Martínez City University Paul Levine University of Surrey Otober 0, 005 María D.C. Garía-Alonso University of Kent Abstrat We

More information

Parallel disrete-event simulation is an attempt to speed-up the simulation proess through the use of multiple proessors. In some sense parallel disret

Parallel disrete-event simulation is an attempt to speed-up the simulation proess through the use of multiple proessors. In some sense parallel disret Exploiting intra-objet dependenies in parallel simulation Franeso Quaglia a;1 Roberto Baldoni a;2 a Dipartimento di Informatia e Sistemistia Universita \La Sapienza" Via Salaria 113, 198 Roma, Italy Abstrat

More information

Estimation of Density Function Parameters with Censored Data from Product Life Tests

Estimation of Density Function Parameters with Censored Data from Product Life Tests VŠB TU Ostrava Faulty of Mehanial Engineering Department of Control Systems and Instrumentation Estimation of Density Funtion Parameters with Censored Data from Produt Life Tests JiříTůma 7. listopadu

More information

Modeling of discrete/continuous optimization problems: characterization and formulation of disjunctions and their relaxations

Modeling of discrete/continuous optimization problems: characterization and formulation of disjunctions and their relaxations Computers and Chemial Engineering (00) 4/448 www.elsevier.om/loate/omphemeng Modeling of disrete/ontinuous optimization problems: haraterization and formulation of disjuntions and their relaxations Aldo

More information

Wavetech, LLC. Ultrafast Pulses and GVD. John O Hara Created: Dec. 6, 2013

Wavetech, LLC. Ultrafast Pulses and GVD. John O Hara Created: Dec. 6, 2013 Ultrafast Pulses and GVD John O Hara Created: De. 6, 3 Introdution This doument overs the basi onepts of group veloity dispersion (GVD) and ultrafast pulse propagation in an optial fiber. Neessarily, it

More information

Ordered fields and the ultrafilter theorem

Ordered fields and the ultrafilter theorem F U N D A M E N T A MATHEMATICAE 59 (999) Ordered fields and the ultrafilter theorem by R. B e r r (Dortmund), F. D e l o n (Paris) and J. S h m i d (Dortmund) Abstrat. We prove that on the basis of ZF

More information

Dirac s equation We construct relativistically covariant equation that takes into account also the spin. The kinetic energy operator is

Dirac s equation We construct relativistically covariant equation that takes into account also the spin. The kinetic energy operator is Dira s equation We onstrut relativistially ovariant equation that takes into aount also the spin The kineti energy operator is H KE p Previously we derived for Pauli spin matries the relation so we an

More information

RIEMANN S FIRST PROOF OF THE ANALYTIC CONTINUATION OF ζ(s) AND L(s, χ)

RIEMANN S FIRST PROOF OF THE ANALYTIC CONTINUATION OF ζ(s) AND L(s, χ) RIEMANN S FIRST PROOF OF THE ANALYTIC CONTINUATION OF ζ(s AND L(s, χ FELIX RUBIN SEMINAR ON MODULAR FORMS, WINTER TERM 6 Abstrat. In this hapter, we will see a proof of the analyti ontinuation of the Riemann

More information

Indian Institute of Technology Bombay. Department of Electrical Engineering. EE 325 Probability and Random Processes Lecture Notes 3 July 28, 2014

Indian Institute of Technology Bombay. Department of Electrical Engineering. EE 325 Probability and Random Processes Lecture Notes 3 July 28, 2014 Indian Institute of Tehnology Bombay Department of Eletrial Engineering Handout 5 EE 325 Probability and Random Proesses Leture Notes 3 July 28, 2014 1 Axiomati Probability We have learned some paradoxes

More information

A simple expression for radial distribution functions of pure fluids and mixtures

A simple expression for radial distribution functions of pure fluids and mixtures A simple expression for radial distribution funtions of pure fluids and mixtures Enrio Matteoli a) Istituto di Chimia Quantistia ed Energetia Moleolare, CNR, Via Risorgimento, 35, 56126 Pisa, Italy G.

More information

The Laws of Acceleration

The Laws of Acceleration The Laws of Aeleration The Relationships between Time, Veloity, and Rate of Aeleration Copyright 2001 Joseph A. Rybzyk Abstrat Presented is a theory in fundamental theoretial physis that establishes the

More information

10.2 The Occurrence of Critical Flow; Controls

10.2 The Occurrence of Critical Flow; Controls 10. The Ourrene of Critial Flow; Controls In addition to the type of problem in whih both q and E are initially presribed; there is a problem whih is of pratial interest: Given a value of q, what fators

More information

Relativistic Addition of Velocities *

Relativistic Addition of Velocities * OpenStax-CNX module: m42540 1 Relativisti Addition of Veloities * OpenStax This work is produed by OpenStax-CNX and liensed under the Creative Commons Attribution Liense 3.0 Abstrat Calulate relativisti

More information

Bäcklund Transformations: Some Old and New Perspectives

Bäcklund Transformations: Some Old and New Perspectives Bäklund Transformations: Some Old and New Perspetives C. J. Papahristou *, A. N. Magoulas ** * Department of Physial Sienes, Helleni Naval Aademy, Piraeus 18539, Greee E-mail: papahristou@snd.edu.gr **

More information

Volunteering and the strategic value of ignorance

Volunteering and the strategic value of ignorance Volunteering and the strategi value of ignorane Florian Morath Max Plank Institute for Tax Law and Publi Finane November 10, 011 Abstrat Private provision of publi goods often takes plae as a war of attrition:

More information

Discrete Generalized Burr-Type XII Distribution

Discrete Generalized Burr-Type XII Distribution Journal of Modern Applied Statistial Methods Volume 13 Issue 2 Artile 13 11-2014 Disrete Generalized Burr-Type XII Distribution B. A. Para University of Kashmir, Srinagar, India, parabilal@gmail.om T.

More information