Economics 241B

Relation to Method of Moments and Maximum Likelihood

OLSE as a Maximum Likelihood Estimator

Under Assumption 1.5 we have specified the distribution of the error, so we can estimate the model parameters θ = (β, σ²) with the principle of maximum likelihood. Under the assumption that the error is Gaussian, we will see that the OLS estimator B is equivalent to the MLE and the OLS estimator of σ² differs only slightly from its ML counterpart. Further, B achieves the Cramér-Rao lower bound.

ML Principle

The intuitive idea of the ML principle is to choose the value of the parameter that is most likely to have generated the data. Precisely, we assume that the probability distribution of a sample (Y_1, …, Y_n) is a member of a family of functions indexed by θ (this is described as parameterizing the distribution). This function, viewed as a function of the parameter vector, is called the likelihood function. In general, the likelihood function has the form of the joint density function

  L(θ | Y_1 = y_1, …, Y_n = y_n) = f_{Y_1⋯Y_n}(y_1, …, y_n; θ).

For an i.i.d. sample of a continuous random variable, we form the likelihood function as

  L(θ | Y_1 = y_1, …, Y_n = y_n) = ∏_t f_Y(y_t; θ).

Definition. The maximum likelihood estimator (MLE) of θ, A_ML, is the value of θ (in the parameter space) that maximizes L(θ | Y_1 = y_1, …, Y_n = y_n).

Conditional versus Unconditional Likelihood

For the regression model, we have a sample (Y, X), whose joint density we parameterize. Because the joint density is the product of a marginal density and a conditional density, we can write the joint density of the data as

  f(y, x; θ, ψ) = f(y | x; θ) · f(x; ψ).

The parameter vector of interest is θ. If we knew the parametric form of f(x; ψ), then we could maximize the joint likelihood function. We cannot do this, as the classic model does not specify f(x; ψ). However, if there is no functional relation between θ and ψ (such as
the value of an element of θ depending on an element of ψ), then maximizing the joint likelihood is achieved by separately maximizing the conditional and marginal likelihoods. In such a case, the ML estimate of θ is obtained by maximizing the conditional likelihood alone.

Log-Likelihood for the Regression Model

As we have already seen, Assumption 1.2 (strict exogeneity), Assumption 1.4 (spherical error variance) and Assumption 1.5 (Gaussian) together imply U | X ~ N(0, σ²I_n). Because Y = Xβ + U, we have

  Y | X ~ N(Xβ, σ²I_n).

The log-likelihood function, which is simpler to maximize, is

  ln L(β̃, σ̃² | (Y_1, X_1) = (y_1, x_1), …, (Y_n, X_n) = (y_n, x_n))
    = −(n/2) ln(2π) − (n/2) ln σ̃² − (1/(2σ̃²)) (Y − Xβ̃)′(Y − Xβ̃).

(Because the likelihood function has the form of a joint density function, the likelihood typically takes values on the unit interval, in which case the log-likelihood function is negative.)

ML via Concentrated Likelihood

We could maximize the log likelihood in two stages. First, maximize over β̃ for any given σ̃². The β̃ that maximizes the objective function could (but in this case, does not) depend on σ̃². Second, maximize over σ̃², taking into account that the β̃ from the first stage could depend on σ̃². The log-likelihood function in which β̃ is constrained to be the value from the first stage is called the concentrated log likelihood (concentrated with respect to β̃). Because the first stage for the Gaussian log-likelihood amounts to minimizing the sum of squares (Y − Xβ̃)′(Y − Xβ̃), the value of β̃ is simply the OLS estimator B (so B_ML and B_OLS are identical if the regression error is Gaussian). In consequence, the minimized sum of squares is Û′Û, so the concentrated log likelihood is

  ln L_C(σ̃² | (Y_1, X_1) = (y_1, x_1), …, (Y_n, X_n) = (y_n, x_n))
    = −(n/2) ln(2π) − (n/2) ln σ̃² − (1/(2σ̃²)) Û′Û.

This is a function of σ̃² alone and, because Û′Û is not a function of σ̃², one can simply take the derivative with respect to σ̃².
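The first-stage claim — that for any fixed σ̃² the Gaussian log-likelihood is maximized at the OLS coefficients, which do not depend on σ̃² — can be checked numerically. The following is a minimal sketch on synthetic data (all variable names and parameter values are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

# First stage: minimizing the sum of squares (Y - Xb)'(Y - Xb) gives OLS.
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

def loglik(b, sig2):
    """Gaussian log-likelihood ln L(b, sig2 | data)."""
    u = y - X @ b
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * n * np.log(sig2)
            - (u @ u) / (2 * sig2))

# For each fixed sig2, the log-likelihood at the OLS coefficients is no
# smaller than at randomly perturbed coefficients -- and the maximizing
# coefficients are the same for every sig2.
for sig2 in (0.5, 1.0, 4.0):
    for _ in range(100):
        b_alt = b_ols + rng.normal(scale=0.1, size=k)
        assert loglik(b_ols, sig2) >= loglik(b_alt, sig2)
```

The perturbation check works because the log-likelihood differs from the (scaled, sign-flipped) sum of squares only by terms that do not involve β̃, so any β̃ other than the OLS value lowers the likelihood for every σ̃².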
(Taking the derivative with respect to σ̃², rather than σ̃, can be tricky; it helps to replace σ̃² with a single symbol, say γ̃.) If we set this derivative equal to zero, we obtain

Proposition (ML Estimator of (β, σ²)). Suppose Assumptions 1.1-1.5 hold. Then the ML estimator of β is the OLS estimator B and the ML estimator of σ² is

  σ̂²_ML = Û′Û / n = ((n − K)/n) S².

As S² is an unbiased estimator of the variance, the ML estimator of σ² is biased, which indicates that a best estimator of the variance does not exist. The resultant maximized log likelihood is

  −(n/2) ln(2π) − (n/2) ln(Û′Û/n) − n/2.

Cramér-Rao Bound for the Classic Regression Model

Recall from 241A the Cramér-Rao inequality for the covariance matrix of any unbiased estimator. Let S(θ̃) be the score vector, which is the gradient (vector of partial derivatives) of the log likelihood

  S(θ̃) = ∂ ln L(θ̃) / ∂θ̃.

Cramér-Rao Inequality.
1. Let Z be a vector of random variables (not necessarily independent) with joint density f(z; θ).
2. Let θ be an m-dimensional vector of parameters, defined in a parameter space.
3. Let L(θ̃) be the likelihood and let θ̂(z) be an unbiased estimator of θ with finite covariance matrix.

Under certain regularity conditions on f(z; θ),

  Var[θ̂(z)] ≥ I(θ)⁻¹   (an m×m matrix inequality: the Cramér-Rao Lower Bound),
where I(θ) is the information matrix defined by

  I(θ) = E[S(θ) S(θ)′].

(Note that the score is evaluated at the true parameter value.) Also under the regularity conditions, the information matrix equals the negative of the expected value of the Hessian (matrix of second partial derivatives) of the log likelihood

  I(θ) = −E[∂² ln L(θ) / ∂θ̃ ∂θ̃′].

This is called the information matrix equality. The regularity conditions guarantee that the operations of differentiation and taking expectations can be interchanged:

  E[∂L(θ)/∂θ̃] = ∂E[L(θ)]/∂θ̃.

For the classic regression model, the Cramér-Rao bound is (derivation in Hayashi)

  I(θ)⁻¹ = [ σ²(X′X)⁻¹    0
             0            2σ⁴/n ].

Therefore the OLS estimator, which is equivalent to the MLE, achieves the Cramér-Rao bound and is the best unbiased estimator. What about the estimator of σ²? We have already seen that the MLE for σ² is biased, so the Cramér-Rao bound does not apply. But S² is unbiased; does it achieve the bound? It can be shown that

  Var(S² | X) = 2σ⁴ / (n − K),

so the estimator does not achieve the bound. However, it can also be shown that an unbiased estimator with lower variance does not exist, so the bound is not attainable.

Quasi-Maximum Likelihood

Of course, if the Gaussian assumption is incorrect, then the resultant estimator is not the MLE. Rather, as the likelihood is misspecified, the resultant estimator is the quasi-MLE. In many cases the Gaussian quasi-MLE performs well. Unfortunately, in general a quasi-MLE can perform quite poorly.
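The variance results above can be illustrated by simulation: with X held fixed and Gaussian errors redrawn many times, the sampling covariance of B should approach σ²(X′X)⁻¹ (the Cramér-Rao block), while the sampling variance of S² should approach 2σ⁴/(n − K), strictly above the bound 2σ⁴/n. A minimal sketch (synthetic data; all names and tolerances are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma = 50, 2, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # held fixed across draws
beta = np.array([1.0, 2.0])
XtX_inv = np.linalg.inv(X.T @ X)

reps = 20000
u = rng.normal(scale=sigma, size=(reps, n))   # fresh Gaussian errors per draw
y = X @ beta + u                              # (reps, n): one sample per row
b_draws = y @ X @ XtX_inv                     # row r holds (X'X)^{-1} X' y_r
resid = y - b_draws @ X.T
s2_draws = (resid ** 2).sum(axis=1) / (n - k)     # unbiased S^2 per draw
mle_draws = (n - k) / n * s2_draws                # biased ML estimator of sigma^2

# Monte Carlo moments versus theory
emp_var_b = np.cov(b_draws, rowvar=False)     # should approach sigma^2 (X'X)^{-1}
theo_var_b = sigma ** 2 * XtX_inv
emp_var_s2 = s2_draws.var()                   # should approach 2 sigma^4/(n-K)
theo_var_s2 = 2 * sigma ** 4 / (n - k)        # note: exceeds the bound 2 sigma^4/n
```

The gap between `theo_var_s2` and 2σ⁴/n is the sense in which S² fails to attain the bound even though, as stated above, no unbiased estimator does better.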
OLSE as a Method of Moments Estimator

The OLS estimators are constructed so that the population moments hold in the sample, and so are method of moments estimators. An assumption of the classic model is that each regressor is uncorrelated with the error term (captured in Assumption 2, where the regressors are assumed exogenous and measured without error). To understand the mathematical implications of the assumption, recall that two random variables are uncorrelated if they have zero covariance, which in turn implies

  Cov(X_t, U_t) = E(X_t U_t) − E(X_t) E(U_t) = 0.

Under Assumption 3, E(U_t) = 0, so a zero covariance implies E(X_t U_t) = 0. The two population moments used to construct the estimators are

  E(U_t) = 0, which can be viewed as E(X_{t,0} U_t) = 0, where X_{t,0} = 1 is the intercept regressor, and

  E(X_t U_t) = 0.

The method of moments sets sample moments equal to population moments. To construct sample analogs of these moments, we need a sample value of the unobserved error U_t. For a given estimator, the residual (prediction of U_t) is observed:

  U_t^P = Y_t − Y_t^P = Y_t − B_0 − B_1 X_t.

Equality of sample and population moments yields

  (1/n) Σ_t U_t^P = 0   and   (1/n) Σ_t X_t U_t^P = 0.

From the definition of U_t^P, Σ_t U_t^P = 0 implies Ȳ = Ȳ^P. One can readily verify that the OLS residuals do satisfy the population moments, as asserted above, by replacing the OLS estimators with their data formulae:

  (1/n) Σ_t U_t^P = (1/n) Σ_t (Y_t − B_0 − B_1 X_t) = Ȳ − (Ȳ − B_1 X̄) − B_1 X̄ = 0
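Both sample moment conditions are easy to confirm numerically. A minimal sketch (synthetic data; variable names hypothetical) computes B_0 and B_1 from their data formulae and checks that the residuals satisfy the sample analogs of E(U_t) = 0 and E(X_t U_t) = 0:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# OLS data formulae for the simple regression Y_t = B_0 + B_1 X_t + U_t^P
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
u = y - b0 - b1 * x  # residuals U_t^P

# Sample analogs of the two population moments hold exactly (up to rounding):
# (1/n) sum U_t^P = 0  and  (1/n) sum X_t U_t^P = 0
moment0 = u.mean()
moment1 = (x * u).mean()
```

Both `moment0` and `moment1` are zero up to floating-point rounding, whatever the data: the moments hold by construction of the OLS estimators, not because of any property of the sample.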
Similarly, for the second moment,

  (1/n) Σ_t X_t U_t^P = (1/n) Σ_t X_t (Y_t − B_0 − B_1 X_t)
    = (1/n) Σ_t X_t Y_t − (Ȳ − B_1 X̄)(1/n) Σ_t X_t − B_1 (1/n) Σ_t X_t²
    = (1/n) [Σ_t X_t Y_t − n Ȳ X̄ + B_1 n X̄² − B_1 Σ_t X_t²].

Because Σ_t (X_t − X̄)(Y_t − Ȳ) = Σ_t X_t Y_t − n Ȳ X̄ and Σ_t (X_t − X̄)² = Σ_t X_t² − n X̄², the above displayed equation becomes

  (1/n) Σ_t X_t U_t^P = (1/n) [Σ_t (X_t − X̄)(Y_t − Ȳ) − B_1 Σ_t (X_t − X̄)²].

Because B_1 = Σ_t (X_t − X̄)(Y_t − Ȳ) / Σ_t (X_t − X̄)², the above expression equals

  (1/n) [Σ_t (X_t − X̄)(Y_t − Ȳ) − Σ_t (X_t − X̄)(Y_t − Ȳ)] = 0.

Finally, note that orthogonality between U_t^P and the regressors implies orthogonality between U_t^P and Y_t^P, which is a linear combination of the regressors.

In detail

The conditional expectations used to define the model contain important information. If we treat the regressor as a random variable, then we must distinguish between conditional and unconditional expectations. For example, the conditional expectation of Y_t^P is

  E(Y_t^P | X_t) = E(A_OLS + B_OLS X_t | X_t) = α + β X_t = E(Y_t | X_t).

The unconditional expectation of Y_t^P is

  E(Y_t^P) = E(A_OLS + B_OLS X_t) = α + β E(X_t),

which is constant if the expectation of the regressor is constant across observations. While the conditional and unconditional expectations of Y_t^P differ, the conditional and unconditional expectations of U_t^P are the same:

  E(U_t^P | X_t) = E(Y_t − Y_t^P | X_t) = α + β X_t − E(Y_t^P | X_t) = 0,
and

  E(U_t^P) = E(Y_t − Y_t^P) = α + β E(X_t) − E(Y_t^P) = 0.

The (unconditional) covariance between Y_t^P and U_t^P is

  E[(Y_t^P − E(Y_t^P)) U_t^P] = E[(A_OLS + B_OLS X_t − α − β E(X_t)) U_t^P]
    = E[(A_OLS + B_OLS X_t) U_t^P]
    = 0,

where the second line follows because α + β E(X_t) is not random and E(U_t^P) = 0, and the third line follows because A_OLS + B_OLS X_t is uncorrelated with U_t^P by construction (recall, if X and Y are uncorrelated, then E(XY) = E(X) E(Y)). Clearly, if the predicted values of the dependent variable were correlated with the estimated residuals, then the predicted values could be improved, so we expect zero covariance. To show that the sample estimate is always zero, note that the sample estimate of the covariance between Y_t^P and U_t^P is

  (1/n) Σ_t (y_t^P − ȳ^P) u_t^P = (1/n) Σ_t b (x_t − x̄) u_t^P
    = b [(1/n) Σ_t x_t u_t^P − x̄ (1/n) Σ_t u_t^P]
    = 0,

where the third line follows from the normal equations, which state that Σ_t u_t^P = Σ_t x_t u_t^P = 0. Of course the normal equations ensure that the sample analogs equal the population moments. The relevant population moments are E(U_t | X_t) = 0 (the residuals are mean zero) and E(X_t U_t | X_t) = 0 (the residuals are uncorrelated with the regressors). Recall Assumption 2.

Issues of identification are in play here. To make the issues clear, consider the model

  Y_t = α_0 + X_t′β_0 + U_t,

in which X_t is the k×1 vector that does not include the intercept. We now ask, under what conditions are the coefficients identified? If the covariance matrix of X_t is nonsingular and X_t is independent of U_t, then β_0 is identified. An additional assumption is needed to identify α_0. Two alternative assumptions that identify
α_0 are E(U_t) = 0 and Med(U_t) = 0. The only difference is in the interpretation of α_0 + X_t′β_0, as discussed above. Alternatively, we could assume that U_t is symmetrically distributed around 0, conditional on X_t. Then α_0 and β_0 are identified and α_0 + X_t′β_0 is both the conditional mean and the conditional median, as well as being equal to other location measures. Both α_0 and β_0 are identified under a conditional location restriction that is weaker than either the assumption of independence (between the regressor and the error) or the assumption of conditional symmetry. Further, each conditional location restriction is associated with a conditional moment restriction

  E[f(U_t) | X_t] = 0

for some function f(U_t), from which an estimator is constructed. Consider the two location assumptions introduced earlier. If E(U_t | X_t) = 0, then f(U_t) = U_t and the resultant estimator is OLS (and, again, α_0 + X_t′β_0 is the conditional mean of Y_t). If Med(U_t | X_t) = 0, the corresponding moment condition is E[sgn(U_t) | X_t] = 0 and the resulting estimator is least absolute deviations (and, again, α_0 + X_t′β_0 is the conditional median of Y_t). To derive the moment condition for OLS, note that E(U_t | X_t) = 0 is clearly a moment condition that can be used for estimation. The OLSE B thus satisfies

  (1/n) Σ_t X_t U_t(B) = 0.

While Med(U_t | X_t) = 0 is a moment condition, it may not be as clear how it can be used to form an estimator. Consider first the case in which U_t is continuous. The assumption Med(U_t | X_t) = 0 implies

  P(U_t < 0 | X_t) = P(U_t > 0 | X_t) = 1/2,

which implies E[sgn(U_t) | X_t] = 0, which in turn implies E[X_t sgn(U_t)] = 0. The signum, or sign, function is defined as

  sgn(u) =  1 if u > 0,   0 if u = 0,   −1 if u < 0.
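A quick simulation sketch illustrates the moment condition (synthetic data; names hypothetical). The error below has median zero but a nonzero mean, so the mean restriction fails while the sign restriction holds: the sample analog of E[X_t sgn(U_t)] is still close to zero. numpy's `np.sign` implements the signum function just defined.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
x = rng.normal(loc=1.0, size=n)

# Skewed errors, independent of x: Med(u) = 0 but E(u) = 1 - ln 2 > 0,
# since the median of an Exponential(1) variable is ln 2.
u = rng.exponential(size=n) - np.log(2.0)

# Sample analog of E[X_t sgn(U_t)] = 0, implied by Med(U_t | X_t) = 0
m_sign = np.mean(x * np.sign(u))

# The mean-based moment E[X_t U_t] = E(X_t) E(U_t) is NOT zero here
m_mean = np.mean(x * u)
```

Here `m_sign` is near zero even though `m_mean` is not, which is exactly the sense in which the median restriction, rather than the mean restriction, underlies LAD.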
The LAD estimator B_L satisfies the sample analog

  (1/n) Σ_t X_t sgn(U_t(B_L)) = 0.

There are two problems here. First, it may not be apparent that the sample analog with B_L admits a unique solution. In fact, in Powell's symmetrically trimmed LAD paper in Econometrica, his conditional moment equation has many solutions. Also, if U_t is not distributed symmetrically, then the assumption Med(U_t | X_t) = 0 does not necessarily lead to a simple moment condition for estimation. The problem is, if U_t does not have a continuous distribution, then it is possible that there is positive point mass at the median, so it is possible that E[sgn(U_t) | X_t] ≠ 0. The alternative is to return to the loss function (also termed the objective function). The loss function approach solves both problems. First, there is clearly a unique solution (as Powell shows in the appendix to the above-mentioned paper). Second, the loss function approach works well even if U_t does not have a continuous distribution.
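The loss-function approach can be illustrated in the simplest case, the intercept-only model Y_t = α_0 + U_t with Med(U_t) = 0, where minimizing the LAD loss Σ_t |y_t − a| over a yields the sample median. A minimal sketch (synthetic data; names hypothetical) exploits the fact that the loss is piecewise linear in a, so a minimizer always occurs at one of the observations:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(size=101)  # odd n, so the LAD minimizer is unique

def lad_loss(a):
    """LAD objective: sum of absolute deviations from candidate intercept a."""
    return np.sum(np.abs(y - a))

# The piecewise-linear loss attains its minimum at an observation,
# so it suffices to evaluate the loss at each sample point.
candidates = np.sort(y)
losses = np.array([lad_loss(a) for a in candidates])
a_lad = candidates[np.argmin(losses)]
```

With regressors, the analogous loss Σ_t |y_t − a − x_t′b| is typically minimized by linear programming; the intercept-only case shown here reduces to the sample median, which exists and is well defined whether or not U_t has a continuous distribution.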