4. Understand the key properties of maximum likelihood estimation. 6. Understand how to compute the uncertainty in maximum likelihood estimates.

Size: px

Start display at page:

Download "4. Understand the key properties of maximum likelihood estimation. 6. Understand how to compute the uncertainty in maximum likelihood estimates."

Tamsyn Porter
5 years ago
Views:

1 9.07 Itroductio to Statistics for Brai ad Cogitive Scieces Emery N. Brow Lecture 9: Likelihood Methods I. Objectives 1. Review the cocept of the joit distributio of radom variables.. Uderstad the cocept of the likelihood fuctio. 3. Uderstad the cocept of maximum likelihood estimatio. 4. Uderstad the key properties of maximum likelihood estimatio. 5. Uderstad the cocept of Fisher iformatio. 6. Uderstad how to compute the ucertaity i maximum likelihood estimates. II. Joit Probability Desity or Distributio of a Set of Radom Variables We recall that two evets E 1 ad E are idepedet if Ituitively, this statemet says that kowledge about E 1 geeral a set of evets E 1, K, E is idepedet if Pr( E 1 I E ) = Pr( E 1 ) Pr( E ) (9.1) Pr( E 1 I E K I gives o iformatio about E. I E ) = Pr( E i ) (9.) It follows from Lecture 1 ad Lecture 5 that if x= x 1, K, x is a sample of idepedet, idetically distributed observatios from a pmf or a pdf f( x i θ) the the joit probability desity of x= x 1, K,x is f( x θ ) = f( x i θ ). (9.3) Equatio 9.3 follows directly from the defiitio of a pmf for a discrete radom variable. Usig a cdf of a cotiuous radom variable which is differetiable, it is also easy to show. Equatio 9.3 is essetial for our defiitio of the likelihood fuctio of a sample of idepedet observatios. III. The Likelihood Fuctio A. Defiitio Defiitio 9.1 (Likelihood Fuctio). Give x= x1, K,x is a idepedet, idetically distributed sample from a pdf or pmf f( xi θ ). Let f( x θ ) deote the joit pdf or pmf of the sample x = ( x, K, x ) defied i Eq Give X = x is observed, the fuctio of θ defied as 1

2 page : 9.07 Lecture 9: Likelihood Methods L( θ ) = f ( x θ ) = f ( x k θ ) (9.4) k =1 is called the likelihood fuctio for the parameter θ. The likelihood fuctio or likelihood provides a objective meas of assessig the iformatio i a sample of data about the model parameter θ. We view the data as fixed ad ow study L( θ ) = f ( x θ ) as a fuctio ofθ. For a give model it summarizes all the iformatio i the sample about θ. Use of the likelihood will allow us to overcome some of the ambiguities we faced with the method-of-momets. B. Examples of Likelihoods Example.1 (cotiued). Biomial Likelihood Fuctio. I our learig experimet, we observed correct resposes from the aimal i 40 trials. We derived the pmf of the observatios i 40 trials as the biomial distributio f ( p) = p (1 p). (9.5) (Notice that this distributio does ot have exactly the same form as Eq. 9.3, because we allowed for all the possible ways the sample could be realized. It turs out that i our likelihood aalysis, this distictio will ot matter because the biomial coefficiet does ot alter the likelihood). Sice the ukow parameter is p we have the likelihood of p is Lp ( ) = f ( p) = p (1 p). (9.6) I geeral, for the biomial pmf with trials ad probability of a correct respose p, the likelihood x x Lp ( )= f( x p ) = p ( 1 p). (9.7) x

3 page 3: 9.07 Lecture 9: Likelihood Methods Figure 9A. Biomial Likelihood Fuctio L(p) for = 40 ad k = 5, 10,, 30, 35. Plottig these elemetary likelihood fuctios is useful to gai isight ito what type of iformatio they provide. We ca predict the shape of the biomial likelihood fuctio based o discussios of the beta distributio i Lecture 3 (Eq. 3.30, Figure 3J). Example 3. Quatile Release Hypothesis (Poisso Likelihood). We used a Poisso pmf to model the quatile release of acetylcholie at the syapse. If we record the umber of quata released i o-overlappig time itervals of a specified legth, the the likelihood fuctio for the Poisso parameter λ, which is the expected umber of quata released, is x λ i λ e L ( λ )= f ( x λ ) = x i! λ = λ x i (9.8) e x i! We see i this example, a very importat feature of the likelihood. Namely, that it has reduced or summarized the data from all observatios to simply S = x i, the sum of the observatios. For a large class of probability models, this is a typical feature.

4 page 4: 9.07 Lecture 9: Likelihood Methods Figure 9B Poisso Likelihood Fuctios L( λ ). The shape of the Poisso likelihood could be predicted based o results i Lecture 3. Note that the essetial features of this likelihood are described by S λ L( λ ) λ e (9.9) which is proportioal to a gamma probability desity (Eq. 3.6) with α = (S 1) ad β = (Figure 3I). Example 3. (cotiued) Magetoecephalogram Noise Data (Gaussia Likelihood Fuctio). If we view the MEG data as a radom sample from a Gaussia probability model the i this problem there are two ukow parameters: µ ad σ.

5 page 5: 9.07 Lecture 9: Likelihood Methods L ( µ, σ ) = f( x µ,σ ) = f( x µ,σ ) i i 1 1 (x µ) = exp πσ σ (9.10) 1 (x i µ) = exp πσ σ If we expad the sum i the expoet, we see that the data have bee compressed ito two distict terms: x i ad x i. That is, we ca rewrite Eq as x i + x i log(πσ ) + 1 µ µ L( µ, σ ) = exp σ σ σ A crucial feature is that the coefficiets o these two statistics ivolve the ukow parameters. This data reductio is a importat cosequece of the likelihood aalysis. These statistics are called sufficiet statistics i that it is possible to show that they cotai all the iformatio i the data about the parameters µ ad σ. We state this cocept formally. Defiitio 9. A statistic T( X ) is sufficiet for a parameter θ if the probability desity of the data coditioal o the statistic is idepedet of θ. That is, if f( X1,..,. X T( X ),θ) = h( X ), the T( X) is sufficiet. If there are two parameters, the the smallest umber of sufficiet statistics is two. Similarly, i the case of the Poisso likelihood above, the umber of sufficiet statistics was oe cosistet with the fact that there is oly oe parameter to estimate. There are very precise theorems that tell us how to idetify the sufficiet statistics for parameters i a probability model. We will ot discuss them here. Details ca be foud i Pawita (001), DeGroot ad Schervish (00) ad i Rice (007). Likelihood aalyses i geeral are based o the sufficiet statistics. Whe method-ofmomets ad likelihood aalyses differ, it is usually because the sample momets are ot the sufficiet statistics for the parameters to be estimated. We will illustrate this poit below whe we defie maximum likelihood estimatio. C. Maximum Likelihood Estimatio Ideally, we would like to aalyze the likelihood fuctio for every problem. This gets to be especially challegig whe the umber of parameters is large. Therefore, we would like to derive a summary that gives us a sese about the shape of the likelihood. This is what the maximum likelihood estimate ad the Fisher iformatio provide. We describe them i tur.

6 page 6: 9.07 Lecture 9: Likelihood Methods Defiitio 9.3. Maximum Likelihood Estimatio. For each sample poit x = ( x 1,..., x ) let θˆ( x ) be a parameter value of which L( θ ) = L( θ x ) attais a maximum as a fuctio of θ for fixed x. ˆ θ ( x ) is a maximum likelihood (ML) estimator (MLE) of the parameter θ. I particular, if L( θ ) is differetiable, we ca cosider L ( θ) θ = 0 ad check the coditios L( θ ) o to be sure that the estimate defies a maximum. Remember that this would mea θ that for a oe-dimesioal problem, verifyig that the secod derivative was egative ad for the d dimesioal problem verifyig the coditio that the determiat of the Hessia is egative defiite. I likelihood aalyses, it is usually easier to work with log L( θ ) istead of L ( θ ). log L( θ) The fuctio is called the score fuctio. Hece, computig the MLE ca ofte be θ formulated as fidig θˆ which solves the score equatio log L ( θ ) θ = 0. (9.11) Two challegig problems for fidig maximum likelihood estimates of parameters i complex high dimesioal models is fidig the solutio to Eq umerically ad establishig that this solutio achieves a global maximum. Example 9.1. Suppose that x 1, K, x are idepedet, idetically distributed observatios from a gamma distributio with parameters α ad β. If α is kow the we have the likelihood fuctio is β α α 1 β x L ( α, β ) = k x k e Γ( α ) k =1 α = β exp{( α 1) log x k β x k } Γ( ) α k =1 k =1 (9.1) ad the log likelihood fuctio is log L( αβ, ) = log Γ( ) α α x (9.13) α + log β + ( 1) log( x k ) β k k=1 k=1 Differetiatig with respect to β log L(α, β) α = x k (9.14) β β k =1 ad solvig yields

7 page 7: 9.07 Lecture 9: Likelihood Methods α 0 = x (9.15) β βˆ = α. (9.16) x Notice also that if α is kow, the the secod derivative of the log likelihood evaluated at βˆ is log L( αβ), α x = = ˆ β βˆ β α This is clearly egative siceα is positive ad x is positive. Hece, the log likelihood (ad the likelihood) is maximized at βˆ. If α = 1 we have βˆ = x 1 is the maximum likelihood estimate for a expoetial model. If α is ukow, the there is o closed form solutio for either α or β. The estimates must the be foud umerically. To see this we differetiate Eq with respect to α to obtai the log likelihood for α Substitutig for β from Eq we obtai log L( αβ, ) Γ ( α ) = + log β + log( x ) (9.17) α Γ( α) k k=1 log L( α ) Γ ( α ) α = + log + log( x k ) (9.18) α Γ( α ) x k =1 Settig the left had side of Eq equal to zero ad solvig for the MLE of α requires a umerical procedure such as Newto s method. Plausible startig values for computig the ML estimates ca be obtaied from the method-of-momets estimates. We ca see i this case that the method-of-momets estimates ad the maximum likelihood estimates are differet. The distictio betwee the maximum likelihood ad the method-of-momets estimates comes about because for the gamma distributio, the sufficiet statistics for α ad β are log( x k ) ad x k. This is suggested by how the likelihood summarizes the data sample as idicated by k =1 the terms i the log likelihood fuctio i Eq Give the defiitio of the sufficiet statistics, this shows that the simple method-of-momets estimates are ot efficiet (i.e., they must have some loss of iformatio). We will make this cocept more precise below. Example 3. (cotiued) Let,, x be a radom sample of MEG backgroud oise believed x 1 K to obey a Gaussia probability desity with ukow µ ad σ. It is easy to show that 1 1 = x k is the maximum likelihood (ML) estimate of µ ad ˆ σ = (x k x) is the x k=1 k =1 maximum likelihood estimate of σ. This is straightforward to show by differetiatig the k =1

8 page 8: 9.07 Lecture 9: Likelihood Methods Gaussia log likelihood, equatig the gradiet to zero ad solvig for µ ad σ. Here the method-of-momets ad the maximum likelihood estimates are the same. I this case, the sufficiet statistics are the first two sample momets. Example 9.. Iverse Gaussia Distributio. If x 1, K, x is a radom sample from a iverse Gaussia distributio with parameters µ ad λ, the the likelihood is 1 λ λ(x k µ) 3 exp L(µ,λ) = k =1 π x k x k µ k λ 1 λ(x µ) = exp (9.19) 3 k =1 π x k k =1 x k µ 1 λ 1 λ λ 3 µ π µ k =1 k =1 k =1 = exp x k + λ x k log + + log x k. Recall that for this probability model the mea is µ ad the variace is µ 3 / λ. What are the maximum likelihood estimates of µ ad λ? Are they the same as the method-of-momets estimate? The maximum likelihood estimates are µˆml = 1 x i (9.0) whereas the method-of-momets estimates are λˆml = ( x µˆ ) (9.1) i ML µˆmm = 1 x i (9.) ˆ 3 λ = x /σˆ. (9.3) MM Ca you veture a guess as to what are the sufficiet statistics for µ ad λ? (Remember the maximum likelihood estimates are always fuctios of the sufficiet statistics. If you ca derive the last lie i Eq. 9.19, the the extra credit problem o Homework Assigmet 7 is straightforward). Ivariace Property of the Maximum Likelihood Estimator. Suppose that a distributio is idexed by a parameter θ. Let T ( θ ) be a fuctio of θ, the if ˆ is the maximum likelihood estimator of θ the T(θˆML ) is the maximum likelihood estimate of T ( θ ). For example, if θ = λ, the scale parameter of the iverse Gaussia distributio the MLE ˆ λ ML is give i Eq If we are iterested i T( λ ) = co s( λ ) the the MLE of T ( λ ) is T ( ˆ λ ML ). This is oe of the most importat properties of maximum likelihood estimates. θ ML

9 page 9: 9.07 Lecture 9: Likelihood Methods D. Fisher Iformatio (Pawita, 001). The MLE provides a poit (or sigle umber summary) estimate. I geeral a sigle umber is ot sufficiet to represet a likelihood fuctio. If the log likelihood is well approximated by a quadratic fuctio, the we eed at least two quatities to represet it: the locatio of its maximum the (MLE) ad the curvature at the maximum. The curvature will be the Fisher iformatio. Whe this quadratic approximatio is valid we call the likelihood fuctio regular. Fortuately, whe sample sizes are sufficietly large, the likelihood fuctio geerally does become regular. To restate this critical requiremet, regular problems are those i which we ca approximate the log likelihood fuctio aroud the MLE by a quadratic fuctio. To defie the Fisher iformatio we first recall how we compute the Taylor series expasio of a fuctio. The Taylor series expasio of a fuctio f( x) about a poit x 0 is defied as f f ''( x 0 )( x x 0 ) ( x ) = f ( x 0 ) + f ' ( x 0 )( x x 0 ) (9.4)! Let us assume that our log likelihood fuctio is differetiable ad let us expad the log likelihood fuctio i a Taylor series about the ML estimate ˆ θ ML. This gives ˆ ˆ log L(θ ML )''(θ θ ML ) log L( θ ) = log L(θˆ ˆ ˆ ML ) + log L(θ ML )'(θ θ ML ) (9.5)! Because log L(θˆML )' = 0 sice θˆml is the ML estimate of θ, we obtai ˆ ˆ ˆ log L(θ ML )''(θ θ ML ) log L( θ ) log L( θ ML ) +. (9.6)! Therefore, we ca approximate L( θ ) for θ close to θˆ as L( θ ) L ( ˆML θ )e x p { 1 I(θˆML ) ( θ θˆml ) } (9.7) where I (θˆ ) = log L(θˆ )". (9.8) ML is the observed Fisher iformatio. It is a umber because it is evaluated at the MLE. I(θˆML ) measures the curvature of the likelihood aroud the MLE. Eq. 9.7 implies that ear the MLE the likelihood ca be approximated as a Gaussia distributio. We ca gai ituitio about the meaig of the Fisher iformatio by cosiderig the followig Gaussia example. Example 3. (cotiued). If we assume σ is kow, the igorig irrelevat costats we get ML ad 1 log L( µ ) = (x i µ) (9.9) σ

10 page 10: 9.07 Lecture 9: Likelihood Methods log L( µ ) 1 = (x i µ) (9.30) µ σ log L( µ ) = µ σ (9.31) 1 1 Thus I ( ˆ µ ) = Hece, µ ( ˆ) Thus a key poit is: A large amout. V ar( ˆ ) = σ = I µ. of σ iformatio implies smaller variace. Example.1 (cotiued). For the biomial model of the learig experimet, the log likelihood ad score fuctios (igorig irrelevat costats) are log Lp ( ) = k log p + ( k )log(1 p) (9.3) The MLE is log L( p) k k = (9.33) p p 1 p k pˆ = (9.34) ad so that at the MLE the observed Fisher iformatio is log L( p) k k I( p) = = + (9.35) p p (1 p) I( p ˆ ) =. (9.36) pˆ(1 pˆ) See the Addedum 1 to Lecture 9 to uderstad the details of the last step. Remark 9.1. I priciple, we ca judge the quadratic approximatio to the log likelihood by plottig the true log likelihood with the quadratic approximatio superimposed. Remark 9.. For the Gaussia distributio from Example 3., the quadratic approximatio is exact. A practical rule for early all likelihood applicatios: a reasoably regular likelihood meas ˆ is approximately Gaussia. θ ML Remark 9.3. The result i Eq. 9.7 suggests that a approximate 95% cofidece iterval for θ may be costructed as ˆ θ ± se(θˆ ) (9.37) ML ML

11 page 11: 9.07 Lecture 9: Likelihood Methods 1 where se( ˆML θ ) = [ I ( θˆml )]. Remark 9.4. Reportig θˆml ad I(θˆML ) is a practical alterative to plottig the likelihood fuctio i every problem. The quatity we derived above was the observed Fisher iformatio. It is a estimate of the Fisher iformatio which we defie below. Defiitio 9.4. If x= x 1,..., x is a sample with joit probability desity f( x θ), the the Fisher iformatio i x for the parameter θ is log f( x θ ) I( θ ) = E[ ] θ log f( x θ ) = [ ] f ( x θ )dx. θ (9.38) Equivaletly, log f( x θ ) I( θ ) = E[ ] θ log f( x θ ) = θ f( x θ ) d x. (9.39) Now that we have defied the Fisher iformatio we ca use it to make a importat geeral statemet about the performace of a estimator. Theorem 9.1 (Cramer-Rao Lower Boud). Suppose x= x 1, K, x is a sample from f( x θ ), ad T( x) is a estimator of θ ad E [( T( x))] is a differetiable fuctio of θ. Suppose also that θ d d f( x θ) hx ( ) f ( x θ ) d x = hx ( ) d x, (9.40) dθ dθ for all hx ( ) with E h()< x. The θ V ar( T( x )) de θ ( T ( x )) ( ) dθ. (9.41) log f( x θ ) E θ ( ) θ Equatio 9.41 is the Cramer-Rao Lower Boud (CRLB) for the estimator Tx. () For a proof see Casella ad Berger (1990). CRLB gives the lowest boud o the variace of a estimator. If the estimate is ubiased, the the umerator is 1 ad the deomiator is the Fisher iformatio. If θ is a d 1 vector the the Fisher iformatio is a d d matrix give by

12 page 1: 9.07 Lecture 9: Likelihood Methods log f( x θ) T log f( x θ) log f( x θ) I( θ ) = E θ [( ] = E θ [ ]. (9.4) θ θ θθ T We will make extesive use of the Fisher iformatio to derive cofidece itervals for our estimates ad to make ifereces i our problems. E. Criteria for Evaluatig a Estimator If T( x) is a estimator of θ, the there are several criteria that ca be used to evaluate its performace. Four commoly used criteria to evaluate the performace of estimators are: 1. Mea-Squared Error (MSE) E T x θ θ [ ( ) ] The smaller the MSE the better the estimator.. Ubiasedess E θ ( T( x))=θ The bias of a estimator is b T ( θ ) = E( T ( x )) θ. A estimator is ubiased if its expected value is the parameter it estimates. 3. Cosistecy p Tx ( ) θ as Cosistecy meas that as the sample size icreases the estimator coverges i mea-squared error, probability or almost surely to the true value of the parameter. 4. Efficiecy Achieves a miimum variace (Cramer-Rao Lower Boud). The efficiecy of a estimator is usually characterized i terms of its variace relative to aother estimator. The Cramer-Rao Lower boud gives a characterizatio of how small the variace of a estimator ca be. F. Summary of Key Properties of Maximum Likelihood Estimatio 1. Maximum likelihood estimates are geerally biased.. Maximum likelihood estimates are cosistet, hece, they are asymptotically ubiased. 3. Maximum likelihood estimates are asymptotically efficiet. 4. The variace of the maximum likelihood estimate may be approximated by the reciprocal (iverse) of the Fisher iformatio. 5. The maximum likelihood estimate θˆml is asymptotically Gaussia with mea θ ad variace I( θ) 1.

13 page 13: 9.07 Lecture 9: Likelihood Methods 6. If θˆml is the maximum likelihood estimate of θ the h(θˆml ) is the maximum likelihood estimate of h( θ ). 7. Give that θˆ ML : N( θ, I 1 ( ) ), expasio we have θ by the Ivariace Priciple for the MLE ad a Taylor Series ad a approximate 95% cofidece iterval for h( θ ) is 1 h(θˆ ML ) N( h ( θ ), h' ( θ ) [ I ( θ )] ) (9.43) h( ˆ θ ) ±1.96h ' θˆml )[ ( ˆML )] 1 ML ( I θ. (9.44) See Addedum to Lecture 9 for a explaatio ad illustratio of this result. III. Summary The likelihood fuctio ad maximum likelihood estimatio are two of the most importat theoretical ad practical cocepts i statistics. Ackowledgmets I am grateful to Uri Ede for makig the figures, to Jim Mutch for careful proofreadig ad commets ad to Julie Scott for techical assistace. Textbook Refereces Casella, G, Berger RL. Statistical Iferece. Belmot, CA: Duxbury, DeGroot MH, Schervish MJ. Probability ad Statistics, 3rd editio. Bosto, MA: Addiso Wesley, 00. Pawita Y. I All Likelihood: Statistical Modelig ad Iferece Usig Likelihood. Lodo: Oxford, 001. Rice JA. Mathematical Statistics ad Data Aalysis, 3 rd editio. Bosto, MA, 007. Literature Referece Kass RE, Vetura V, Brow EN. Statistical issues i the aalysis of euroal data. Joural of Neurophysiology 005, 94: 8-5.

14 MIT OpeCourseWare Statistics for Brai ad Cogitive Sciece Fall 016 For iformatio about citig these materials or our Terms of Use, visit:

Lecture Note 8 Point Estimators and Point Estimation Methods. MIT Spring 2006 Herman Bennett

Lecture Note 8 Point Estimators and Point Estimation Methods. MIT Spring 2006 Herman Bennett Lecture Note 8 Poit Estimators ad Poit Estimatio Methods MIT 14.30 Sprig 2006 Herma Beett Give a parameter with ukow value, the goal of poit estimatio is to use a sample to compute a umber that represets