Topic 15: Maximum Likelihood Estimation


November 1 and 3, 2011
© 2011 Joseph C. Watkins

1 Introduction

The principle of maximum likelihood is relatively straightforward. As before, we begin with a sample $X = (X_1, \ldots, X_n)$ of random variables chosen according to one of a family of probabilities $P_\theta$. In addition, $f(x|\theta)$, $x = (x_1, \ldots, x_n)$, will be used to denote the density function for the data when $\theta$ is the true state of nature. Then, the principle of maximum likelihood yields a choice of the estimator $\hat{\theta}$ as the value for the parameter that makes the observed data most probable.

Definition 1. The likelihood function is the density function regarded as a function of $\theta$,
$$L(\theta|x) = f(x|\theta), \quad \theta \in \Theta. \qquad (1)$$
The maximum likelihood estimator (MLE) is
$$\hat{\theta}(x) = \arg\max_{\theta} L(\theta|x). \qquad (2)$$

We will learn that, especially for large samples, maximum likelihood estimators have many desirable properties. However, especially for high dimensional data, the likelihood can have many local maxima. Thus, finding the global maximum can be a major computational challenge.

This class of estimators has an important property. If $\hat{\theta}(x)$ is a maximum likelihood estimate for $\theta$, then $g(\hat{\theta}(x))$ is a maximum likelihood estimate for $g(\theta)$. For example, if $\theta$ is a parameter for the variance and $\hat{\theta}$ is the maximum likelihood estimator, then $\sqrt{\hat{\theta}}$ is the maximum likelihood estimator for the standard deviation. This flexibility in estimation criterion seen here is not available in the case of unbiased estimators.

Typically, maximizing the score function, $\ln L(\theta|x)$, the logarithm of the likelihood, will be easier. Having the parameter values be the variable of interest is somewhat unusual, so we will next look at several examples of the likelihood function.

2 Examples

Example 2 (Bernoulli trials). If the experiment consists of $n$ Bernoulli trials with success probability $p$, then
$$L(p|x) = p^{x_1}(1-p)^{1-x_1} \cdots p^{x_n}(1-p)^{1-x_n} = p^{x_1 + \cdots + x_n}(1-p)^{n-(x_1 + \cdots + x_n)},$$
$$\ln L(p|x) = \ln p \left(\sum_{i=1}^n x_i\right) + \ln(1-p)\left(n - \sum_{i=1}^n x_i\right) = n\left(\bar{x} \ln p + (1-\bar{x}) \ln(1-p)\right),$$
$$\frac{\partial}{\partial p} \ln L(p|x) = n\left(\frac{\bar{x}}{p} - \frac{1-\bar{x}}{1-p}\right) = n\,\frac{\bar{x} - p}{p(1-p)}.$$
This equals zero when $p = \bar{x}$.

Exercise 3. Check that this is a maximum.

Thus, $\hat{p}(x) = \bar{x}$. In this case, the maximum likelihood estimator is also unbiased.

Example 4 (normal data). Maximum likelihood estimation can be applied to a vector valued parameter. For a simple random sample of $n$ normal random variables, we can use the properties of the exponential function to simplify the likelihood function:
$$L(\mu, \sigma^2|x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_1-\mu)^2}{2\sigma^2}\right) \cdots \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_n-\mu)^2}{2\sigma^2}\right) = \frac{1}{\sqrt{(2\pi\sigma^2)^n}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\right).$$
The score function is
$$\ln L(\mu, \sigma^2|x) = -\frac{n}{2}(\ln 2\pi + \ln \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2.$$
Then
$$\frac{\partial}{\partial \mu} \ln L(\mu, \sigma^2|x) = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) = \frac{n}{\sigma^2}(\bar{x} - \mu).$$
Because the second partial derivative with respect to $\mu$ is negative,
$$\hat{\mu}(x) = \bar{x}$$
is the maximum likelihood estimator. For the derivative of the score function with respect to the parameter $\sigma^2$,
$$\frac{\partial}{\partial \sigma^2} \ln L(\mu, \sigma^2|x) = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (x_i - \mu)^2 = -\frac{n}{2(\sigma^2)^2}\left(\sigma^2 - \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2\right).$$
Recalling that $\hat{\mu}(x) = \bar{x}$, we obtain
$$\hat{\sigma}^2(x) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2.$$
Note that the maximum likelihood estimator is a biased estimator.

Example 5 (Lincoln-Petersen method of mark and recapture). Let's recall the variables in mark and recapture: $t$ is the number captured and tagged, $k$ is the number in the second capture, $r$ is the number in the second capture that are tagged, and $N$ is the total population. Here $t$ and $k$ are set by the experimental design; $r$ is an observation that may vary. The total population $N$ is unknown. The likelihood function for $N$ is the hypergeometric distribution,
$$L(N|r) = \frac{\binom{t}{r}\binom{N-t}{k-r}}{\binom{N}{k}}.$$
We would like to maximize the likelihood given the number of recaptured individuals $r$. Because the domain for $N$ is the nonnegative integers, we cannot use calculus. However, we can look at the ratio of the likelihood values for successive values of the total population,
$$\frac{L(N|r)}{L(N-1|r)}.$$

Figure 1: Likelihood function (top row) and its logarithm, the score function, (bottom row) for Bernoulli trials. The left column is based on 20 trials having 8 and 11 successes. The right column is based on 40 trials having 16 and 22 successes. Notice that the maximum likelihood is approximately $10^{-6}$ for 20 trials and $10^{-12}$ for 40. In addition, note that the peaks are more narrow for 40 trials rather than 20. We shall later be able to associate this property to the variance of the maximum likelihood estimator.
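Curves like those in Figure 1 can be drawn with a few lines of R. This is a minimal sketch, assuming 20 trials with 8 successes as in the left column of the figure.

# likelihood and log-likelihood for n Bernoulli trials with s successes
n <- 20; s <- 8
p <- seq(0.1, 0.9, by = 0.001)     # grid of candidate values for p
L <- p^s * (1 - p)^(n - s)         # likelihood L(p|x)
par(mfrow = c(2, 1))
plot(p, L, type = "l", ylab = "L(p|x)")
plot(p, log(L), type = "l", ylab = "log L(p|x)")
abline(v = s/n, lty = 2)           # maximum at p-hat = x-bar = s/n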

$N$ is more likely than $N-1$ precisely when this ratio is larger than one. The computation below will show that this ratio is greater than one for small values of $N$ and less than one for large values. Thus, there is a place in the middle which has the maximum. We expand the binomial coefficients in the expression for $L(N|r)$ and simplify:
$$\frac{L(N|r)}{L(N-1|r)} = \frac{\binom{t}{r}\binom{N-t}{k-r}/\binom{N}{k}}{\binom{t}{r}\binom{N-1-t}{k-r}/\binom{N-1}{k}} = \frac{\binom{N-t}{k-r}\binom{N-1}{k}}{\binom{N-1-t}{k-r}\binom{N}{k}} = \frac{\frac{(N-t)!}{(k-r)!(N-t-k+r)!}\,\frac{(N-1)!}{k!(N-1-k)!}}{\frac{(N-1-t)!}{(k-r)!(N-1-t-k+r)!}\,\frac{N!}{k!(N-k)!}} = \frac{(N-t)(N-k)}{N(N-t-k+r)}.$$
Thus, the ratio exceeds one if and only if
$$(N-t)(N-k) > N(N-t-k+r)$$
$$N^2 - tN - kN + tk > N^2 - tN - kN + rN$$
$$tk > rN$$
$$\frac{tk}{r} > N.$$
Writing $[x]$ for the integer part of $x$, we see that $L(N|r) > L(N-1|r)$ for $N < [tk/r]$ and $L(N|r) \le L(N-1|r)$ for $N \ge [tk/r]$. This gives the maximum likelihood estimator
$$\hat{N} = \left[\frac{tk}{r}\right].$$
Thus, the maximum likelihood estimator is, in this case, obtained from the method of moments estimator by rounding down to the next integer.

Let's look at the example of mark and capture from the previous topic. There $N = 2000$, the number of fish in the population, is unknown to us. We tag $t = 200$ fish in the first capture event, and obtain $k = 400$ fish in the second capture.

> N <- 2000
> t <- 200
> fish <- c(rep(1, t), rep(0, N - t))
> k <- 400
> r <- sum(sample(fish, k))
> r
[1] 42

In this simulated example, we find $r = 42$ recaptured fish. For the likelihood function, we look at a range of values for $N$ that is symmetric about 2000. Here, $\hat{N} = [200 \cdot 400/42] = 1904$.

> N <- c(1800:2200)
> L <- dhyper(r, t, N - t, k)
> plot(N, L, type = "l", ylab = "L(N|42)")

Figure 2: Likelihood function $L(N|42)$ for mark and recapture with $t = 200$ tagged fish and $k = 400$ fish in the second capture, $r = 42$ of which have tags and thus are recaptured. Note that the maximum likelihood estimator for the total fish population is $\hat{N} = 1904$.

Example 6 (linear regression). Our data are $n$ observations with one explanatory variable and one response variable. The model is that
$$y_i = \alpha + \beta x_i + \epsilon_i$$
where the $\epsilon_i$ are independent mean 0 normal random variables. The (unknown) variance is $\sigma^2$. Thus, the joint density for the $\epsilon_i$ is
$$\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\epsilon_1^2}{2\sigma^2}\right) \cdot \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\epsilon_2^2}{2\sigma^2}\right) \cdots \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\epsilon_n^2}{2\sigma^2}\right) = \frac{1}{\sqrt{(2\pi\sigma^2)^n}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n \epsilon_i^2\right).$$
Since $\epsilon_i = y_i - (\alpha + \beta x_i)$, the likelihood function is
$$L(\alpha, \beta, \sigma^2|y, x) = \frac{1}{\sqrt{(2\pi\sigma^2)^n}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - (\alpha + \beta x_i))^2\right).$$
The score function is
$$\ln L(\alpha, \beta, \sigma^2|y, x) = -\frac{n}{2}(\ln 2\pi + \ln \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - (\alpha + \beta x_i))^2.$$
Consequently, maximizing the likelihood function for the parameters $\alpha$ and $\beta$ is equivalent to minimizing
$$SS(\alpha, \beta) = \sum_{i=1}^n (y_i - (\alpha + \beta x_i))^2.$$
Thus, the principle of maximum likelihood is equivalent to the least squares criterion for ordinary linear regression. The maximum likelihood estimators $\hat{\alpha}$ and $\hat{\beta}$ give the regression line
$$\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i.$$

Exercise 7. Show that the maximum likelihood estimator for $\sigma^2$ is
$$\hat{\sigma}^2_{MLE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2.$$

Frequently, software will report the unbiased estimator. For ordinary least squares procedures, this is
$$\hat{\sigma}^2_U = \frac{1}{n-2} \sum_{i=1}^n (y_i - \hat{y}_i)^2.$$
For the measurements on the lengths in centimeters of the femur and humerus for the five specimens of Archeopteryx, we have the following R output for linear regression.

> femur <- c(38, 56, 59, 64, 74)
> humerus <- c(41, 63, 70, 72, 84)
> summary(lm(humerus ~ femur))

Call:
lm(formula = humerus ~ femur)

Residuals:
      1       2       3       4       5
-0.8226 -0.3668  3.0425 -0.9420 -0.9110

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.65959    4.45896  -0.821 0.471944
femur        1.19690    0.07509   15.94 0.000537 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.982 on 3 degrees of freedom
Multiple R-squared: 0.9883, Adjusted R-squared: 0.9844
F-statistic: 254.1 on 1 and 3 DF, p-value: 0.0005368

The residual standard error of 1.982 centimeters is obtained by squaring the 5 residuals, summing, dividing by $3 = 5 - 2$, and taking a square root.

Example 8 (weighted least squares). If we know the relative size of the variances of the $\epsilon_i$, then we have the model
$$y_i = \alpha + \beta x_i + \gamma(x_i) \epsilon_i$$
where the $\epsilon_i$ are, again, independent mean 0 normal random variables with unknown variance $\sigma^2$. In this case,
$$\epsilon_i = \frac{1}{\gamma(x_i)}(y_i - \alpha - \beta x_i)$$
are independent normal random variables with mean 0 and (unknown) variance $\sigma^2$. The likelihood function is
$$L(\alpha, \beta, \sigma^2|y, x) = \frac{1}{\sqrt{(2\pi\sigma^2)^n}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n w(x_i)(y_i - (\alpha + \beta x_i))^2\right)$$
where $w(x) = 1/\gamma(x)^2$. In other words, the weights are inversely proportional to the variances. The log-likelihood is
$$\ln L(\alpha, \beta, \sigma^2|y, x) = -\frac{n}{2} \ln 2\pi\sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^n w(x_i)(y_i - (\alpha + \beta x_i))^2.$$
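Before deriving the weighted estimators by hand in the exercise below, note that R's lm fits this model through its weights argument. A minimal sketch, assuming simulated data with the arbitrary choices $\alpha = 2$, $\beta = 3$, and $\gamma(x) = \sqrt{x}$:

# weighted least squares: weights inversely proportional to the variances
x <- 1:40
gamma.x <- sqrt(x)                        # illustrative: noise sd grows like sqrt(x)
y <- 2 + 3*x + gamma.x*rnorm(40)          # y = alpha + beta*x + gamma(x)*epsilon
fit <- lm(y ~ x, weights = 1/gamma.x^2)   # w(x) = 1/gamma(x)^2
coef(fit)                                 # weighted estimates of alpha and beta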

Exercise 9. Show that the maximum likelihood estimators $\hat{\alpha}_w$ and $\hat{\beta}_w$ have formulas
$$\hat{\beta}_w = \frac{\mathrm{cov}_w(x, y)}{\mathrm{var}_w(x)}, \qquad \bar{y}_w = \hat{\alpha}_w + \hat{\beta}_w \bar{x}_w,$$
where $\bar{x}_w$ and $\bar{y}_w$ are the weighted means
$$\bar{x}_w = \frac{\sum_{i=1}^n w(x_i) x_i}{\sum_{i=1}^n w(x_i)}, \qquad \bar{y}_w = \frac{\sum_{i=1}^n w(x_i) y_i}{\sum_{i=1}^n w(x_i)}.$$
The weighted covariance and variance are, respectively,
$$\mathrm{cov}_w(x, y) = \frac{\sum_{i=1}^n w(x_i)(x_i - \bar{x}_w)(y_i - \bar{y}_w)}{\sum_{i=1}^n w(x_i)}, \qquad \mathrm{var}_w(x) = \frac{\sum_{i=1}^n w(x_i)(x_i - \bar{x}_w)^2}{\sum_{i=1}^n w(x_i)}.$$
The maximum likelihood estimator for $\sigma^2$ is
$$\hat{\sigma}^2_{MLE} = \frac{\sum_{i=1}^n w(x_i)(y_i - \hat{y}_i)^2}{\sum_{i=1}^n w(x_i)}.$$
In the case of weighted least squares, the predicted value for the response variable is $\hat{y}_i = \hat{\alpha}_w + \hat{\beta}_w x_i$.

Exercise 10. Show that $\hat{\alpha}_w$ and $\hat{\beta}_w$ are unbiased estimators of $\alpha$ and $\beta$. In particular, ordinary (unweighted) least squares estimators are unbiased.

In computing the optimal values using introductory differential calculus, the maximum can occur at either critical points or at the endpoints. The next example shows that the maximum value for the likelihood can occur at an endpoint of an interval.

Example 11 (uniform random variables). If our data $X = (X_1, \ldots, X_n)$ are a simple random sample drawn from a uniformly distributed random variable whose maximum value $\theta$ is unknown, then each random variable has density
$$f(x|\theta) = \begin{cases} 1/\theta & \text{if } 0 \le x \le \theta, \\ 0 & \text{otherwise.} \end{cases}$$
Therefore, the joint density, or the likelihood, is
$$f(x|\theta) = L(\theta|x) = \begin{cases} 1/\theta^n & \text{if } 0 \le x_i \le \theta \text{ for all } i, \\ 0 & \text{otherwise.} \end{cases}$$
Consequently, the joint density is 0 whenever any of the $x_i > \theta$. Restating this in terms of likelihood, no value of $\theta$ less than any of the $x_i$ is possible; any such value of $\theta$ has likelihood 0. Symbolically,
$$L(\theta|x) = \begin{cases} 0 & \text{for } \theta < \max_i x_i = x_{(n)}, \\ 1/\theta^n & \text{for } \theta \ge \max_i x_i = x_{(n)}. \end{cases}$$
Recall the notation $x_{(n)}$ for the top order statistic based on $n$ observations. The likelihood is 0 on the interval $(0, x_{(n)})$ and is positive and decreasing on the interval $[x_{(n)}, \infty)$. Thus, to maximize $L(\theta|x)$, we should take the minimum value of $\theta$ on this interval. In other words,
$$\hat{\theta}(x) = x_{(n)}.$$

Figure 3: Likelihood function for uniform random variables on the interval $[0, \theta]$. The likelihood is 0 up to $\max_i x_i$ and $1/\theta^n$ afterwards.

Because the estimator is always less than the parameter value it is meant to estimate, $\hat{\theta}(X) = X_{(n)} < \theta$, we suspect it is biased downwards, i.e.,
$$E_\theta X_{(n)} < \theta.$$
For $0 \le x \le \theta$, the distribution function for $X_{(n)} = \max_i X_i$ is
$$F_{X_{(n)}}(x) = P\{\max_i X_i \le x\} = P\{X_1 \le x, X_2 \le x, \ldots, X_n \le x\} = P\{X_1 \le x\} P\{X_2 \le x\} \cdots P\{X_n \le x\}$$
because these random variables are independent. Each has the same distribution function
$$P\{X_i \le x\} = \begin{cases} 0 & \text{for } x \le 0, \\ x/\theta & \text{for } 0 < x \le \theta, \\ 1 & \text{for } \theta < x. \end{cases}$$
Thus, the distribution function is
$$F_{X_{(n)}}(x) = \begin{cases} 0 & \text{for } x \le 0, \\ (x/\theta)^n & \text{for } 0 < x \le \theta, \\ 1 & \text{for } \theta < x. \end{cases}$$
Take the derivative to find the density,
$$f_{X_{(n)}}(x) = \begin{cases} 0 & \text{for } x \le 0, \\ n x^{n-1}/\theta^n & \text{for } 0 < x \le \theta, \\ 0 & \text{for } \theta < x. \end{cases}$$
The mean
$$E_\theta X_{(n)} = \int_0^\theta x \, \frac{n x^{n-1}}{\theta^n} \, dx = \frac{n}{\theta^n} \int_0^\theta x^n \, dx = \frac{n}{(n+1)\theta^n} x^{n+1} \Big|_0^\theta = \frac{n}{n+1}\theta.$$
This confirms the bias of the estimator $X_{(n)}$ and gives us a strategy to find an unbiased estimator. In particular, the choice
$$d(X) = \frac{n+1}{n} X_{(n)}$$
is an unbiased estimator of $\theta$.
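A brief simulation sketch (with the arbitrary choices $\theta = 1$ and $n = 10$) illustrates both the downward bias of $X_{(n)}$ and the effect of the $(n+1)/n$ correction:

# compare the MLE max(x) with the unbiased estimator (n+1)/n * max(x)
theta <- 1; n <- 10
mle <- replicate(10000, max(runif(n, 0, theta)))
mean(mle)               # near n/(n+1)*theta = 0.909, biased downward
mean((n + 1)/n * mle)   # near theta = 1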

3 Summary of Estimates

Look to the text above for the definition of variables.

Bernoulli trials:
  $p$: $\hat{p} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$, unbiased

mark and recapture:
  $N$: $\hat{N} = \left[\frac{kt}{r}\right]$, biased upward

normal observations:
  $\mu$: $\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$, unbiased
  $\sigma^2$: $\hat{\sigma}^2_{mle} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$, biased downward; $\hat{\sigma}^2_u = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$, unbiased
  $\sigma$: $\hat{\sigma}_{mle} = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2}$, biased downward

linear regression:
  $\beta$: $\hat{\beta} = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}$, unbiased
  $\alpha$: $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$, unbiased
  $\sigma^2$: $\hat{\sigma}^2_{mle} = \frac{1}{n}\sum_{i=1}^n (y_i - (\hat{\alpha} + \hat{\beta}x_i))^2$, biased downward; $\hat{\sigma}^2_u = \frac{1}{n-2}\sum_{i=1}^n (y_i - (\hat{\alpha} + \hat{\beta}x_i))^2$, unbiased
  $\sigma$: $\hat{\sigma}_{mle} = \sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - (\hat{\alpha} + \hat{\beta}x_i))^2}$, biased downward

uniform $[0, \theta]$:
  $\theta$: $\hat{\theta} = \max_i x_i$, biased downward; $\hat{\theta} = \frac{n+1}{n}\max_i x_i$, unbiased

4 Asymptotic Properties

Much of the attraction of maximum likelihood estimators is based on their properties for large sample sizes. We summarize some of the important properties below, saving a more technical discussion of these properties for later.

1. Consistency. If $\theta_0$ is the state of nature and $\hat{\theta}_n(X)$ is the maximum likelihood estimator based on $n$ observations from a simple random sample, then
$$\hat{\theta}_n(X) \to \theta_0 \quad \text{as } n \to \infty.$$
In words, as the number of observations increases, the distribution of the maximum likelihood estimator becomes more and more concentrated about the true state of nature.

2. Asymptotic normality and efficiency. Under some assumptions that allow, among several analytical properties, the use of the delta method, a central limit theorem holds. Here we have that
$$\sqrt{n}(\hat{\theta}_n(X) - \theta_0)$$
converges in distribution as $n \to \infty$ to a normal random variable with mean 0 and variance $1/I(\theta_0)$, the Fisher information for one observation. Thus,
$$\mathrm{Var}_{\theta_0}(\hat{\theta}_n(X)) \approx \frac{1}{n I(\theta_0)},$$

the lowest variance possible under the Cramér-Rao lower bound. This property is called asymptotic efficiency. We can write this in terms of the z-score. Let
$$Z_n = \frac{\hat{\theta}(X) - \theta_0}{\sqrt{1/(n I(\theta_0))}}.$$
Then, as with the central limit theorem, $Z_n$ converges in distribution to a standard normal random variable.

3. Properties of the log-likelihood surface. For large sample sizes, the variance of a maximum likelihood estimator of a single parameter is approximately the reciprocal of the Fisher information
$$I(\theta) = -E\left[\frac{\partial^2}{\partial \theta^2} \ln L(\theta|X)\right],$$
the negative of the expected second derivative, also known as the curvature, of the log-likelihood function. The Fisher information can be approximated by the observed information based on the data $x$,
$$J(\hat{\theta}) = -\frac{\partial^2}{\partial \theta^2} \ln L(\hat{\theta}(x)|x),$$
giving the curvature of the likelihood surface at the maximum likelihood estimate $\hat{\theta}(x)$. If the curvature is small near the maximum likelihood estimator, then the likelihood surface is nearly flat and the variance is large. If the curvature is large, and thus the variance is small, the likelihood is strongly curved at the maximum.

We now look at these properties in some detail by revisiting the example of the distribution of fitness effects. For this example, we have two parameters, $\alpha$ and $\beta$ for the gamma distribution, and so we will want to extend the properties above to circumstances in which we are looking to estimate more than one parameter.

5 Multidimensional Estimation

For a multidimensional parameter space $\theta = (\theta_1, \theta_2, \ldots, \theta_d)$, the Fisher information $I(\theta)$ is now a matrix. As with the one-dimensional case, the $ij$-th entry has two alternative expressions, namely,
$$I(\theta)_{ij} = E_\theta\left[\frac{\partial}{\partial \theta_i} \ln L(\theta|X) \, \frac{\partial}{\partial \theta_j} \ln L(\theta|X)\right] = -E_\theta\left[\frac{\partial^2}{\partial \theta_i \, \partial \theta_j} \ln L(\theta|X)\right].$$
Rather than taking reciprocals to obtain an estimate of the variance, we find the matrix inverse $I(\theta)^{-1}$. This inverse will provide estimates of both variances and covariances. To be precise, for $n$ observations, let $\hat{\theta}_{i,n}(X)$ be the maximum likelihood estimator of the $i$-th parameter. Then
$$\mathrm{Var}_\theta(\hat{\theta}_{i,n}(X)) \approx \frac{1}{n} I(\theta)^{-1}_{ii}, \qquad \mathrm{Cov}_\theta(\hat{\theta}_{i,n}(X), \hat{\theta}_{j,n}(X)) \approx \frac{1}{n} I(\theta)^{-1}_{ij}.$$
When the $i$-th parameter is $\theta_i$, the asymptotic normality and efficiency can be expressed by noting that the z-score
$$Z_{i,n} = \frac{\hat{\theta}_i(X) - \theta_i}{\sqrt{I(\theta)^{-1}_{ii}/n}}$$
is approximately a standard normal.

Example 12. To obtain the maximum likelihood estimate for the gamma family of random variables, write the likelihood
$$L(\alpha, \beta|x) = \left(\frac{\beta^\alpha}{\Gamma(\alpha)} x_1^{\alpha-1} e^{-\beta x_1}\right) \cdots \left(\frac{\beta^\alpha}{\Gamma(\alpha)} x_n^{\alpha-1} e^{-\beta x_n}\right) = \left(\frac{\beta^\alpha}{\Gamma(\alpha)}\right)^n (x_1 x_2 \cdots x_n)^{\alpha-1} e^{-\beta(x_1 + x_2 + \cdots + x_n)}$$

and the score function
$$\ln L(\alpha, \beta|x) = n(\alpha \ln \beta - \ln \Gamma(\alpha)) + (\alpha - 1) \sum_{i=1}^n \ln x_i - \beta \sum_{i=1}^n x_i.$$
To determine the parameters that maximize the likelihood, we solve the equations
$$\frac{\partial}{\partial \alpha} \ln L(\hat{\alpha}, \hat{\beta}|x) = n\left(\ln \hat{\beta} - \frac{d}{d\alpha} \ln \Gamma(\hat{\alpha})\right) + \sum_{i=1}^n \ln x_i = 0$$
and
$$\frac{\partial}{\partial \beta} \ln L(\hat{\alpha}, \hat{\beta}|x) = n \frac{\hat{\alpha}}{\hat{\beta}} - \sum_{i=1}^n x_i = 0, \quad \text{or} \quad \bar{x} = \frac{\hat{\alpha}}{\hat{\beta}}.$$
Substituting $\hat{\beta} = \hat{\alpha}/\bar{x}$ into the first equation results in the following relationship for $\hat{\alpha}$:
$$n(\ln \hat{\alpha} - \ln \bar{x}) - n \frac{d}{d\alpha} \ln \Gamma(\hat{\alpha}) + \sum_{i=1}^n \ln x_i = 0,$$
which can be solved numerically. The derivative of the logarithm of the gamma function
$$\psi(\alpha) = \frac{d}{d\alpha} \ln \Gamma(\alpha)$$
is known as the digamma function and is called in R with digamma.

For the example of the distribution of fitness effects, $\alpha = 0.23$ and $\beta = 5.35$ with $n = 100$; a simulated data set yields $\hat{\alpha} = 0.2376$ and $\hat{\beta} = 5.690$ for the maximum likelihood estimates. (See Figure 4.)

Figure 4: The graph of $n(\ln \alpha - \ln \bar{x}) - n \frac{d}{d\alpha} \ln \Gamma(\alpha) + \sum_{i=1}^n \ln x_i$ crosses the horizontal axis at $\hat{\alpha} = 0.2376$. The fact that the graph of the derivative is decreasing states that the score function moves from increasing to decreasing with $\alpha$ and thus $\hat{\alpha}$ is a maximum.

To determine the variance of these estimators, we first compute the Fisher information matrix. Taking the appropriate derivatives, we find that each of the second order derivatives is constant, and thus the expected values used to determine the entries of the Fisher information matrix are the negatives of these constants:
$$I(\alpha, \beta)_{11} = -\frac{\partial^2}{\partial \alpha^2} \ln L(\alpha, \beta|x) = n \frac{d^2}{d\alpha^2} \ln \Gamma(\alpha), \qquad I(\alpha, \beta)_{22} = -\frac{\partial^2}{\partial \beta^2} \ln L(\alpha, \beta|x) = n \frac{\alpha}{\beta^2},$$

$$I(\alpha, \beta)_{12} = -\frac{\partial^2}{\partial \alpha \, \partial \beta} \ln L(\alpha, \beta|x) = -\frac{n}{\beta}.$$
This gives the Fisher information matrix
$$I(\alpha, \beta) = n \begin{pmatrix} \frac{d^2}{d\alpha^2} \ln \Gamma(\alpha) & -\frac{1}{\beta} \\ -\frac{1}{\beta} & \frac{\alpha}{\beta^2} \end{pmatrix}.$$
The second derivative of the logarithm of the gamma function
$$\psi_1(\alpha) = \frac{d^2}{d\alpha^2} \ln \Gamma(\alpha)$$
is known as the trigamma function and is called in R with trigamma. The inverse is
$$I(\alpha, \beta)^{-1} = \frac{1}{n\left(\alpha \frac{d^2}{d\alpha^2} \ln \Gamma(\alpha) - 1\right)} \begin{pmatrix} \alpha & \beta \\ \beta & \beta^2 \frac{d^2}{d\alpha^2} \ln \Gamma(\alpha) \end{pmatrix}.$$
For the example of the distribution of fitness effects, $\alpha = 0.23$, $\beta = 5.35$ and $n = 100$, with $\frac{d^2}{d\alpha^2} \ln \Gamma(0.23) = 20.12804$, this evaluates to
$$I(0.23, 5.35)^{-1} = \begin{pmatrix} 0.0001202 & 0.01216 \\ 0.01216 & 1.3095 \end{pmatrix},$$
$$\mathrm{Var}_{(0.23, 5.35)}(\hat{\alpha}) \approx 0.0001202, \qquad \mathrm{Var}_{(0.23, 5.35)}(\hat{\beta}) \approx 1.3095,$$
$$\sigma_{(0.23, 5.35)}(\hat{\alpha}) \approx 0.0110, \qquad \sigma_{(0.23, 5.35)}(\hat{\beta}) \approx 1.1443.$$
Compare this to the empirical values of 0.0662 and 2.046 for the method of moments. This gives the following table of standard deviations for $n = 100$ observations:

method                $\hat{\alpha}$   $\hat{\beta}$
maximum likelihood    0.0110           1.1443
method of moments     0.0662           2.046
ratio                 0.166            0.559

Thus, the standard deviation for the maximum likelihood estimator is respectively 17% and 56% of that of the method of moments estimator. We will look at the impact as we move on to our next topic, interval estimation and confidence intervals.

Exercise 13. If the data are a simple random sample of 100 observations of a $\Gamma(0.23, 5.35)$ random variable, use the approximate normality of maximum likelihood estimators to estimate
$$P\{\hat{\alpha} \ge 0.2376\} \qquad \text{and} \qquad P\{\hat{\beta} \ge 5.690\}.$$
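The calculation in Example 12 can be sketched in R. This is a minimal illustration, assuming simulated $\Gamma(0.23, 5.35)$ data; the search interval handed to uniroot is an arbitrary choice.

# MLE for the gamma family: solve for alpha-hat numerically,
# then set beta-hat = alpha-hat/x-bar
x <- rgamma(100, 0.23, 5.35)        # simulated data for the example
dscore <- function(a)               # the alpha equation above, divided by n
  log(a) - log(mean(x)) - digamma(a) + mean(log(x))
alpha.hat <- uniroot(dscore, c(0.01, 2))$root
beta.hat <- alpha.hat/mean(x)
# approximate variances from the inverse of the Fisher information matrix
info <- 100 * matrix(c(trigamma(alpha.hat), -1/beta.hat,
                       -1/beta.hat, alpha.hat/beta.hat^2), 2, 2)
solve(info)                         # diagonal: Var(alpha-hat), Var(beta-hat)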

6 Choice of Estimators

With all of the desirable properties of the maximum likelihood estimator, the question arises as to why one would choose a method of moments estimator. One answer is that the use of maximum likelihood techniques relies on knowing the density function explicitly in order to be able to perform the necessary analysis to maximize the score function and find the Fisher information. However, much less about the experiment is needed in order to compute moments. Thus far, we have computed moments using the density
$$E_\theta X^m = \int_{-\infty}^\infty x^m f_X(x|\theta) \, dx.$$
We could determine, for example, the (random) number of a given protein in the cells of a tissue by giving the distribution of the number of cells and then the distribution of the number of the given protein per cell. This can be used to calculate the mean and variance for the number of proteins with some ease. However, an explicit expression for the density, and hence the likelihood function, is more difficult to obtain and can lead to quite intricate computations to carry out the desired analysis of the likelihood function.

7 Technical Aspects

We can use concepts previously introduced to obtain the properties of the maximum likelihood estimator. For example, $\theta_0$ is more likely than another parameter value $\theta$, $L(\theta_0|X) > L(\theta|X)$, if and only if
$$\frac{1}{n} \sum_{i=1}^n \ln \frac{f(X_i|\theta_0)}{f(X_i|\theta)} > 0.$$
By the strong law of large numbers, this sum converges to
$$E_{\theta_0}\left[\ln \frac{f(X_1|\theta_0)}{f(X_1|\theta)}\right],$$
which is greater than 0. Thus, for a large number of observations and a given value of $\theta$, with probability nearly one, $L(\theta_0|X) > L(\theta|X)$, and so the maximum likelihood estimator has a high probability of being very near $\theta_0$.

For the asymptotic normality and efficiency, we write the linear approximation of the score function
$$\frac{d}{d\theta} \ln L(\theta|X) \approx \frac{d}{d\theta} \ln L(\theta_0|X) + (\theta - \theta_0) \frac{d^2}{d\theta^2} \ln L(\theta_0|X).$$
Now substitute $\theta = \hat{\theta}$ and note that $\frac{d}{d\theta} \ln L(\hat{\theta}|X) = 0$. Then
$$\sqrt{n}(\hat{\theta}_n(X) - \theta_0) \approx -\sqrt{n} \, \frac{\frac{d}{d\theta} \ln L(\theta_0|X)}{\frac{d^2}{d\theta^2} \ln L(\theta_0|X)} = -\frac{\frac{1}{\sqrt{n}} \frac{d}{d\theta} \ln L(\theta_0|X)}{\frac{1}{n} \frac{d^2}{d\theta^2} \ln L(\theta_0|X)}.$$
Now assume that $\theta_0$ is the true state of nature. Then, the random variables $\frac{d}{d\theta} \ln f(X_i|\theta_0)$ are independent with mean 0 and variance $I(\theta_0)$. Thus, the distribution of the numerator
$$\frac{1}{\sqrt{n}} \frac{d}{d\theta} \ln L(\theta_0|X) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{d}{d\theta} \ln f(X_i|\theta_0)$$
converges, by the central limit theorem, to a normal random variable with mean 0 and variance $I(\theta_0)$. For the denominator, the $\frac{d^2}{d\theta^2} \ln f(X_i|\theta_0)$ are independent with mean $-I(\theta_0)$. Thus,
$$\frac{1}{n} \frac{d^2}{d\theta^2} \ln L(\theta_0|X) = \frac{1}{n} \sum_{i=1}^n \frac{d^2}{d\theta^2} \ln f(X_i|\theta_0)$$
converges, by the law of large numbers, to $-I(\theta_0)$. Thus, the distribution of the ratio, $\sqrt{n}(\hat{\theta}_n(X) - \theta_0)$, converges to a normal random variable with variance $I(\theta_0)/I(\theta_0)^2 = 1/I(\theta_0)$.

8 Answers to Selected Exercises

3. We have found that
$$\frac{\partial}{\partial p} \ln L(p|x) = n \, \frac{\bar{x} - p}{p(1-p)}.$$

Thus,
$$\frac{\partial}{\partial p} \ln L(p|x) > 0 \quad \text{if } p < \bar{x}, \qquad \text{and} \qquad \frac{\partial}{\partial p} \ln L(p|x) < 0 \quad \text{if } p > \bar{x}.$$
In words, the score function $\ln L(p|x)$ is increasing for $p < \bar{x}$ and decreasing for $p > \bar{x}$. Thus, $\hat{p}(x) = \bar{x}$ is a maximum.

7. The log-likelihood function
$$\ln L(\alpha, \beta, \sigma^2|y, x) = -\frac{n}{2}(\ln(2\pi) + \ln \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - (\alpha + \beta x_i))^2$$
leads to the ordinary least squares equations for the maximum likelihood estimates $\hat{\alpha}$ and $\hat{\beta}$. Take the partial derivative with respect to $\sigma^2$,
$$\frac{\partial}{\partial \sigma^2} \ln L(\alpha, \beta, \sigma^2|y, x) = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (y_i - (\alpha + \beta x_i))^2.$$
This partial derivative is 0 at the maximum likelihood estimates $\hat{\sigma}^2$, $\hat{\alpha}$ and $\hat{\beta}$:
$$0 = -\frac{n}{2\hat{\sigma}^2} + \frac{1}{2(\hat{\sigma}^2)^2} \sum_{i=1}^n (y_i - (\hat{\alpha} + \hat{\beta} x_i))^2,$$
or
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (y_i - (\hat{\alpha} + \hat{\beta} x_i))^2.$$

9. The maximum likelihood principle leads to a minimization problem for
$$SS_w(\alpha, \beta) = \sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n w(x_i)(y_i - (\alpha + \beta x_i))^2.$$
Following the steps to derive the equations for ordinary least squares, take partial derivatives to find that
$$\frac{\partial}{\partial \beta} SS_w(\alpha, \beta) = -2 \sum_{i=1}^n w(x_i) x_i (y_i - \alpha - \beta x_i), \qquad \frac{\partial}{\partial \alpha} SS_w(\alpha, \beta) = -2 \sum_{i=1}^n w(x_i)(y_i - \alpha - \beta x_i).$$
Set these two equations equal to 0 and call the solutions $\hat{\alpha}_w$ and $\hat{\beta}_w$:
$$0 = \sum_{i=1}^n w(x_i) x_i (y_i - \hat{\alpha}_w - \hat{\beta}_w x_i) = \sum_{i=1}^n w(x_i) x_i y_i - \hat{\alpha}_w \sum_{i=1}^n w(x_i) x_i - \hat{\beta}_w \sum_{i=1}^n w(x_i) x_i^2 \qquad (1)$$
$$0 = \sum_{i=1}^n w(x_i)(y_i - \hat{\alpha}_w - \hat{\beta}_w x_i) = \sum_{i=1}^n w(x_i) y_i - \hat{\alpha}_w \sum_{i=1}^n w(x_i) - \hat{\beta}_w \sum_{i=1}^n w(x_i) x_i \qquad (2)$$
Multiply these equations by the appropriate factors to obtain
$$0 = \left(\sum_{i=1}^n w(x_i)\right)\left(\sum_{i=1}^n w(x_i) x_i y_i\right) - \hat{\alpha}_w \left(\sum_{i=1}^n w(x_i)\right)\left(\sum_{i=1}^n w(x_i) x_i\right) - \hat{\beta}_w \left(\sum_{i=1}^n w(x_i)\right)\left(\sum_{i=1}^n w(x_i) x_i^2\right) \qquad (3)$$
$$0 = \left(\sum_{i=1}^n w(x_i) x_i\right)\left(\sum_{i=1}^n w(x_i) y_i\right) - \hat{\alpha}_w \left(\sum_{i=1}^n w(x_i)\right)\left(\sum_{i=1}^n w(x_i) x_i\right) - \hat{\beta}_w \left(\sum_{i=1}^n w(x_i) x_i\right)^2 \qquad (4)$$

Now subtract equation (4) from equation (3) and solve for $\hat{\beta}_w$:
$$\hat{\beta}_w = \frac{\left(\sum w(x_i)\right)\left(\sum w(x_i) x_i y_i\right) - \left(\sum w(x_i) x_i\right)\left(\sum w(x_i) y_i\right)}{\left(\sum w(x_i)\right)\left(\sum w(x_i) x_i^2\right) - \left(\sum w(x_i) x_i\right)^2} = \frac{\sum w(x_i)(x_i - \bar{x}_w)(y_i - \bar{y}_w)}{\sum w(x_i)(x_i - \bar{x}_w)^2} = \frac{\mathrm{cov}_w(x, y)}{\mathrm{var}_w(x)}.$$
Next, divide equation (2) by $\sum_{i=1}^n w(x_i)$ to obtain
$$\bar{y}_w = \hat{\alpha}_w + \hat{\beta}_w \bar{x}_w. \qquad (5)$$

10. Because the $\epsilon_i$ have mean zero,
$$E_{(\alpha,\beta)} y_i = E_{(\alpha,\beta)}[\alpha + \beta x_i + \gamma(x_i) \epsilon_i] = \alpha + \beta x_i + \gamma(x_i) E_{(\alpha,\beta)}[\epsilon_i] = \alpha + \beta x_i.$$
Next, use the linearity property of expectation to find the mean of $\bar{y}_w$:
$$E_{(\alpha,\beta)} \bar{y}_w = \frac{\sum w(x_i) E_{(\alpha,\beta)} y_i}{\sum w(x_i)} = \frac{\sum w(x_i)(\alpha + \beta x_i)}{\sum w(x_i)} = \alpha + \beta \bar{x}_w. \qquad (6)$$
Taken together, we have that $E_{(\alpha,\beta)}[y_i - \bar{y}_w] = (\alpha + \beta x_i) - (\alpha + \beta \bar{x}_w) = \beta(x_i - \bar{x}_w)$. To show that $\hat{\beta}_w$ is an unbiased estimator, we see that
$$E_{(\alpha,\beta)} \hat{\beta}_w = E_{(\alpha,\beta)}\left[\frac{\mathrm{cov}_w(x, y)}{\mathrm{var}_w(x)}\right] = \frac{E_{(\alpha,\beta)}[\mathrm{cov}_w(x, y)]}{\mathrm{var}_w(x)} = \frac{1}{\mathrm{var}_w(x)} E_{(\alpha,\beta)}\left[\frac{\sum w(x_i)(x_i - \bar{x}_w)(y_i - \bar{y}_w)}{\sum w(x_i)}\right] = \frac{1}{\mathrm{var}_w(x)} \, \frac{\sum w(x_i)(x_i - \bar{x}_w) E_{(\alpha,\beta)}[y_i - \bar{y}_w]}{\sum w(x_i)} = \frac{\beta}{\mathrm{var}_w(x)} \, \frac{\sum w(x_i)(x_i - \bar{x}_w)(x_i - \bar{x}_w)}{\sum w(x_i)} = \beta.$$
To show that $\hat{\alpha}_w$ is an unbiased estimator, recall that $\bar{y}_w = \hat{\alpha}_w + \hat{\beta}_w \bar{x}_w$. Thus
$$E_{(\alpha,\beta)} \hat{\alpha}_w = E_{(\alpha,\beta)}[\bar{y}_w - \hat{\beta}_w \bar{x}_w] = E_{(\alpha,\beta)} \bar{y}_w - E_{(\alpha,\beta)}[\hat{\beta}_w] \bar{x}_w = \alpha + \beta \bar{x}_w - \beta \bar{x}_w = \alpha,$$
using (6) and the fact that $\hat{\beta}_w$ is an unbiased estimator of $\beta$.

13. For $\hat{\alpha}$, we have the z-score
$$z_{\hat{\alpha}} = \frac{\hat{\alpha} - 0.23}{\sqrt{0.0001202}} = \frac{0.2376 - 0.23}{\sqrt{0.0001202}} = 0.684.$$
Thus, using the normal approximation,
$$P\{\hat{\alpha} \ge 0.2376\} = P\{z_{\hat{\alpha}} \ge 0.684\} = 0.2470.$$
For $\hat{\beta}$, we have the z-score
$$z_{\hat{\beta}} = \frac{\hat{\beta} - 5.35}{\sqrt{1.3095}} = \frac{5.690 - 5.35}{\sqrt{1.3095}} = 0.297.$$
Here, the normal approximation gives
$$P\{\hat{\beta} \ge 5.690\} = P\{z_{\hat{\beta}} \ge 0.297\} = 0.3832.$$
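These tail probabilities can be checked in R with pnorm:

# normal approximation for the tail probabilities in Exercise 13
pnorm(0.684, lower.tail = FALSE)   # P(alpha-hat >= 0.2376), about 0.247
pnorm(0.297, lower.tail = FALSE)   # P(beta-hat >= 5.690), about 0.383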

Figure 5: (top) The score function near the maximum likelihood estimators. The domain is $0.1 \le \alpha \le 0.4$ and $4 \le \beta \le 8$. (bottom) Graphs of vertical slices through the score function surface: (left) $\hat{\beta} = 5.690$ fixed while $0.1 \le \alpha \le 0.4$ varies; (right) $\hat{\alpha} = 0.2376$ fixed while $4 \le \beta \le 8$ varies. The variance of an estimator is approximately the negative reciprocal of the second derivative of the score function at the maximum likelihood estimators. Note that the score function is nearly flat as $\beta$ varies. This leads to the interpretation that a range of values for $\beta$ are nearly equally likely and that the variance for the estimator $\hat{\beta}$ will be high. On the other hand, the score function has a much greater curvature for the $\alpha$ parameter, and the estimator $\hat{\alpha}$ will have a much smaller variance than $\hat{\beta}$.
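A surface like the one in Figure 5 can be drawn with a short R sketch; the simulated data, grid ranges, and viewing angles below are illustrative choices.

# score (log-likelihood) surface for the gamma example
x <- rgamma(100, 0.23, 5.35)
alpha <- seq(0.1, 0.4, length = 50)
beta <- seq(4, 8, length = 50)
loglik <- function(a, b) sum(dgamma(x, a, b, log = TRUE))
score <- outer(alpha, beta, Vectorize(loglik))
persp(alpha, beta, score, theta = 30, phi = 20)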