Economics 241B
Relation to Method of Moments and Maximum Likelihood

OLSE as a Maximum Likelihood Estimator

Under Assumption 1.5 we have specified the distribution of the error, so we can estimate the model parameters θ = (β, σ²) with the principle of maximum likelihood. Under the assumption that the error is Gaussian, we will see that the OLS estimator B is equivalent to the MLE and the OLS estimator of σ² differs only slightly from its ML counterpart. Further, B achieves the Cramer-Rao lower bound.

ML Principle

The intuitive idea of the ML principle is to choose the value of the parameter that is most likely to have generated the data. Precisely, we assume that the probability distribution of a sample (Y₁, ..., Yₙ) is a member of a family of functions indexed by θ (this is described as parameterizing the distribution). This function, viewed as a function of the parameter vector θ, is called the likelihood function. In general, the likelihood function has the form of the joint density function

    L(θ | Y₁ = y₁, ..., Yₙ = yₙ) = f_{Y₁···Yₙ}(y₁, ..., yₙ; θ)

For an i.i.d. sample of a continuous random variable, we form the likelihood function as

    L(θ | Y₁ = y₁, ..., Yₙ = yₙ) = ∏ₜ f_Y(yₜ; θ)

Definition. The maximum likelihood estimator (MLE) of θ, A_ML, is the value of θ (in the parameter space) that maximizes L(θ | Y₁ = y₁, ..., Yₙ = yₙ).

Conditional versus Unconditional Likelihood

For the regression model, we have a sample (Y, X), whose joint density we parameterize. Because the joint density is the product of a marginal density and a conditional density, we can write the joint density of the data as

    f(y, x; θ, ψ) = f(y | x; θ) f(x; ψ)

The parameter vector of interest is θ. If we knew the parametric form of f(x; ψ), then we could maximize the joint likelihood function. We cannot do this, as the classic model does not specify f(x; ψ). However, if there is no functional relation between θ and ψ (such as the value of an element of ψ depending on an element of θ), then maximizing the joint likelihood is achieved by separately maximizing the conditional and marginal likelihoods. In such a case, the ML estimate of θ is obtained by maximizing the conditional likelihood alone.
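
To fix ideas, here is a minimal numerical sketch of the ML principle (not part of the original notes; the sample size and parameter values are invented for illustration). It evaluates the log likelihood of an i.i.d. N(μ, 1) sample over a grid of candidate values of μ and confirms that the maximizer coincides, up to the grid spacing, with the sample mean, the closed-form MLE.

# Minimal sketch of the ML principle for an i.i.d. N(mu, 1) sample:
# evaluate the log likelihood over a grid of candidate mu values and
# confirm the maximizer is (essentially) the sample mean.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=200)   # hypothetical sample

def log_likelihood(mu, y):
    # log of prod_t f_Y(y_t; mu) with sigma^2 = 1 known
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y - mu) ** 2)

grid = np.linspace(0.0, 4.0, 2001)
loglik = np.array([log_likelihood(m, y) for m in grid])
mu_ml = grid[np.argmax(loglik)]

print(mu_ml, y.mean())   # the two agree up to the grid spacing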

Log-Likelihood for the Regression Model

As we have already seen, Assumption 1.2 (strict exogeneity), Assumption 1.4 (spherical error variance) and Assumption 1.5 (Gaussian errors) together imply U | X ~ N(0, σ²Iₙ). Because Y = Xβ + U, we have

    Y | X ~ N(Xβ, σ²Iₙ)

The log-likelihood function, which is simpler to maximize, is

    ln L(β̃, σ̃² | (Y₁, X₁) = (y₁, x₁), ..., (Yₙ, Xₙ) = (yₙ, xₙ))
        = −(n/2) ln(2π) − (n/2) ln σ̃² − (1/(2σ̃²)) (Y − Xβ̃)′(Y − Xβ̃)

(Because the likelihood function has the form of a joint density function, its value is typically far below one in a sample of any reasonable size, so the log-likelihood is negative.)

ML via Concentrated Likelihood

We could maximize the log likelihood in two stages. First, maximize over β̃ for any given σ̃². The β̃ that maximizes the objective function could (but in this case, does not) depend on σ̃². Second, maximize over σ̃², taking into account that the β̃ from the first stage could depend on σ̃². The log likelihood function in which β̃ is constrained to be the value from the first stage is called the concentrated log likelihood (concentrated with respect to β̃). Because the first stage for the Gaussian log-likelihood amounts to minimizing the sum of squares (Y − Xβ̃)′(Y − Xβ̃), the value of β̃ is simply the OLS estimator B (so B_ML and B_OLS are identical if the regression error is Gaussian). In consequence, the minimized sum of squares is Û′Û, so the concentrated log likelihood is

    ln L_C(σ̃² | (Y₁, X₁) = (y₁, x₁), ..., (Yₙ, Xₙ) = (yₙ, xₙ))
        = −(n/2) ln(2π) − (n/2) ln σ̃² − (1/(2σ̃²)) Û′Û

This is a function of σ̃² alone and, because Û′Û is not a function of σ̃², one can simply take the derivative with respect to σ̃² (taking the derivative with respect to σ̃², rather than σ̃, can be tricky; it helps to replace σ̃² by a single symbol, say γ̃).
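
As a numerical check on the concentration argument (a sketch on simulated data, not from the notes; the design and coefficient values are invented), the code below maximizes the Gaussian log likelihood directly over (β̃, σ̃²) and confirms that the maximizer over β̃ is the OLS estimator and the maximizer over σ̃² is Û′Û/n.

# Sketch: numerically maximize the Gaussian regression log likelihood and
# compare with the closed-form answer (OLS for beta, U'U/n for sigma^2).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

def neg_loglik(params):
    beta, log_sig2 = params[:K], params[K]
    sig2 = np.exp(log_sig2)               # parameterize by log sigma^2 > 0
    resid = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sig2) + resid @ resid / (2 * sig2)

opt = minimize(neg_loglik, np.zeros(K + 1), method="BFGS")

b_ols = np.linalg.solve(X.T @ X, X.T @ y)    # OLS = first-stage maximizer
u_hat = y - X @ b_ols
sig2_ml = u_hat @ u_hat / n                  # concentrated-likelihood answer

print(opt.x[:K], b_ols)                  # numerically identical
print(np.exp(opt.x[K]), sig2_ml)         # numerically identical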

If we set this derivative equal to zero, we obtain

Proposition (ML Estimator of (β, σ²)). Suppose Assumptions 1.1-1.5 hold. Then the ML estimator of β is the OLS estimator and the ML estimator of σ² is

    σ̂²_ML = (1/n) Û′Û = ((n − K)/n) S²

As S² is an unbiased estimator of the variance, the ML estimator of σ² is biased; as discussed below, a best unbiased estimator of the variance (one attaining the Cramer-Rao bound) does not exist. The resultant maximized log likelihood is

    −(n/2) ln(2π) − (n/2) ln(Û′Û/n) − n/2 = −(n/2)[1 + ln(2π/n)] − (n/2) ln(Û′Û)
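
A quick numerical illustration of the proposition (simulated data with invented values, not from the notes): the ML variance estimator and ((n − K)/n)S² are the same number.

# Sketch: check the relation sigma^2_ML = ((n - K)/n) * S^2 on simulated data.
# S^2 = U'U/(n - K) is the usual unbiased variance estimator.
import numpy as np

rng = np.random.default_rng(2)
n, K = 120, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([0.5, 1.0, -1.0, 2.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ b
ssr = u_hat @ u_hat

sig2_ml = ssr / n            # ML estimator of sigma^2
s2 = ssr / (n - K)           # unbiased estimator S^2

print(sig2_ml, (n - K) / n * s2)   # identical by construction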

Cramer-Rao Bound for the Classic Regression Model

Recall from 241A the Cramer-Rao inequality for the covariance matrix of any unbiased estimator. Let S(θ̃) be the score vector, which is the gradient (vector of partial derivatives) of the log likelihood

    S(θ̃) = ∂ ln L(θ̃)/∂θ̃

Cramer-Rao Inequality.
1. Let Z be a vector of random variables (not necessarily independent) with joint density f(z; θ).
2. Let θ be an m-dimensional vector of parameters, defined in a parameter space.
3. Let L(θ̃) be the likelihood and let θ̂(z) be an unbiased estimator of θ with finite covariance matrix.

Under certain regularity conditions on f(z; θ),

    Var[θ̂(z)] ≥ I(θ)⁻¹   (the Cramer-Rao lower bound, an m × m matrix),

where I(θ) is the information matrix defined by

    I(θ) = E[S(θ) S(θ)′]

(Note that the score is evaluated at the true parameter value.) Also under the regularity conditions, the information matrix equals the negative of the expected value of the Hessian (matrix of second partial derivatives) of the log likelihood

    I(θ) = −E[∂² ln L(θ)/∂θ̃ ∂θ̃′]

This is called the information matrix equality. The regularity conditions guarantee that the operations of differentiation and taking expectations can be interchanged:

    E[∂L(θ)/∂θ̃] = ∂E[L(θ)]/∂θ̃

For the classic regression model, the Cramer-Rao bound is (derivation in Hayashi)

    I(θ)⁻¹ = [ σ²(X′X)⁻¹     0
               0′            2σ⁴/n ]

Therefore the OLS estimator, which is equivalent to the MLE, achieves the Cramer-Rao bound and is the best unbiased estimator. What about the estimator of σ²? We have already seen that the MLE for σ² is biased, so the Cramer-Rao bound does not apply. But S² is unbiased; does it achieve the bound? It can be shown that

    Var(S² | X) = 2σ⁴/(n − K),

so the estimator does not achieve the bound. However, it can also be shown that an unbiased estimator with lower variance does not exist, so the bound is not attainable.

Quasi-Maximum Likelihood

Of course, if the Gaussian assumption is incorrect, then the resultant estimator is not the MLE. Rather, as the likelihood is misspecified, the resultant estimator is the quasi-MLE. In many cases the Gaussian quasi-MLE performs well. Unfortunately, in general a quasi-MLE performs quite poorly.
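
The claim about Var(S² | X) can be checked by simulation. The sketch below (invented design, not part of the notes) holds X fixed, redraws the errors many times, and compares the Monte Carlo variance of S² with 2σ⁴/(n − K) and with the Cramer-Rao bound 2σ⁴/n.

# Sketch: Monte Carlo check that Var(S^2 | X) is about 2*sigma^4/(n - K),
# strictly above the Cramer-Rao bound 2*sigma^4/n.
import numpy as np

rng = np.random.default_rng(3)
n, K, sigma, reps = 50, 3, 1.0, 20000
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])   # held fixed
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)                # residual maker

s2_draws = np.empty(reps)
for r in range(reps):
    u = rng.normal(scale=sigma, size=n)
    u_hat = M @ u                       # residuals do not depend on beta
    s2_draws[r] = u_hat @ u_hat / (n - K)

print(s2_draws.var())          # simulated Var(S^2 | X)
print(2 * sigma**4 / (n - K))  # theoretical variance
print(2 * sigma**4 / n)        # Cramer-Rao bound (not attained)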

OLSE as a Method of Moments Estimator

The OLS estimators are constructed so that the population moments hold in the sample, and so are method of moments estimators. An assumption of the classic model is that each regressor is uncorrelated with the error term (captured in Assumption 2, where the regressors are assumed exogenous and measured without error). To understand the mathematical implications of the assumption, recall that two random variables are uncorrelated if they have zero covariance, which in turn implies

    Cov(Xₜ, Uₜ) = E(XₜUₜ) − EXₜ · EUₜ = 0

Under Assumption 3, EUₜ = 0, so a zero covariance implies E(XₜUₜ) = 0. The two population moments used to construct the estimators are

    EUₜ = 0, which can be viewed as E(Xₜ,₀ Uₜ) = 0, where Xₜ,₀ = 1 is the intercept regressor, and

    E(XₜUₜ) = 0

The method of moments sets sample moments equal to population moments. To construct sample analogs of these moments, we need a sample value of the unobserved error Uₜ. For a given estimator, the residual (the prediction of Uₜ) is observed:

    Uₜᴾ = Yₜ − Yₜᴾ = Yₜ − B₀ − B₁Xₜ

Equality of sample and population moments yields

    (1/n) Σₜ Uₜᴾ = 0  and  (1/n) Σₜ XₜUₜᴾ = 0

From the definition of Uₜᴾ, Σₜ Uₜᴾ = 0 implies (1/n) Σₜ Yₜ = (1/n) Σₜ Yₜᴾ.

One can readily verify that the OLS residuals do satisfy the population moments, as asserted above, by replacing the OLS estimators with their data formulae (in particular B₀ = Ȳ − B₁X̄):

    (1/n) Σₜ Uₜᴾ = (1/n) Σₜ (Yₜ − B₀ − B₁Xₜ) = Ȳ − (Ȳ − B₁X̄) − B₁X̄ = 0

and

    (1/n) Σₜ XₜUₜᴾ = (1/n) Σₜ Xₜ(Yₜ − B₀ − B₁Xₜ)
                   = (1/n) Σₜ XₜYₜ − (Ȳ − B₁X̄)(1/n) Σₜ Xₜ − B₁ (1/n) Σₜ Xₜ²

Because Σₜ (Xₜ − X̄)(Yₜ − Ȳ) = Σₜ XₜYₜ − Ȳ Σₜ Xₜ and Σₜ (Xₜ − X̄)² = Σₜ Xₜ² − X̄ Σₜ Xₜ, the above displayed equation becomes

    (1/n) Σₜ XₜUₜᴾ = (1/n) [Σₜ (Xₜ − X̄)(Yₜ − Ȳ) − B₁ Σₜ (Xₜ − X̄)²]

Because B₁ = Σₜ (Xₜ − X̄)(Yₜ − Ȳ) / Σₜ (Xₜ − X̄)², the above expression equals

    (1/n) [Σₜ (Xₜ − X̄)(Yₜ − Ȳ) − Σₜ (Xₜ − X̄)(Yₜ − Ȳ)] = 0
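
The method-of-moments view can be made concrete: the two sample moment conditions are linear in (B₀, B₁), so they can be solved directly and compared with OLS. The sketch below uses simulated data with invented coefficient values (not from the notes).

# Sketch: solve the two sample moment conditions
#   (1/n) sum(u_t) = 0  and  (1/n) sum(x_t * u_t) = 0,  u_t = y_t - b0 - b1*x_t,
# directly for (b0, b1) and compare with OLS.
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# The moment conditions are linear in (b0, b1):
#   [ 1        mean(x)   ] [b0]   [ mean(y)   ]
#   [ mean(x)  mean(x^2) ] [b1] = [ mean(x*y) ]
A = np.array([[1.0, x.mean()], [x.mean(), (x * x).mean()]])
c = np.array([y.mean(), (x * y).mean()])
b_mm = np.linalg.solve(A, c)

b_ols, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)
print(b_mm, b_ols)   # the method-of-moments and OLS estimates coincide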

Finally, note that orthogonality between Uₜᴾ and the regressors implies orthogonality between Uₜᴾ and Yₜᴾ, which is a linear combination of the regressors. In detail:

The conditional expectations used to define the model contain important information. If we treat the regressor as a random variable, then we must distinguish between conditional and unconditional expectations. For example, the conditional expectation of Yₜᴾ is

    E(Yₜᴾ | Xₜ) = E(B₀ + B₁Xₜ | Xₜ) = β₀ + β₁Xₜ = E(Yₜ | Xₜ)

The unconditional expectation of Yₜᴾ is

    E(Yₜᴾ) = E(B₀ + B₁Xₜ) = β₀ + β₁EXₜ,

which is constant if the expectation of the regressor is constant across observations. While the conditional and unconditional expectations of Yₜᴾ differ, the conditional and unconditional expectations of Uₜᴾ are the same:

    E(Uₜᴾ | Xₜ) = E(Yₜ − Yₜᴾ | Xₜ) = β₀ + β₁Xₜ − E(Yₜᴾ | Xₜ) = 0

and

    E(Uₜᴾ) = E(Yₜ − Yₜᴾ) = β₀ + β₁EXₜ − (β₀ + β₁EXₜ) = 0
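
The distinction between conditional and unconditional expectations can be illustrated by Monte Carlo. In the sketch below (invented design and coefficients, not from the notes), the regressors are held fixed while the errors are redrawn, so averaging the fitted values and the residuals across replications approximates E(Yₜᴾ | X) and E(Uₜᴾ | X).

# Sketch: with the regressors held fixed, average fitted values and residuals
# across many replications to approximate E(Y^P_t | X) and E(U^P_t | X).
import numpy as np

rng = np.random.default_rng(5)
n, reps = 30, 20000
x = rng.normal(size=n)                     # fixed design
Xmat = np.column_stack([np.ones(n), x])
beta = np.array([1.0, 2.0])                # invented true coefficients

fitted_sum = np.zeros(n)
resid_sum = np.zeros(n)
for r in range(reps):
    y = Xmat @ beta + rng.normal(size=n)
    b = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ y)
    fitted = Xmat @ b
    fitted_sum += fitted
    resid_sum += y - fitted

print(np.max(np.abs(fitted_sum / reps - Xmat @ beta)))  # ~ 0: E(Y^P_t|X) = b0 + b1*x_t
print(np.max(np.abs(resid_sum / reps)))                 # ~ 0: E(U^P_t|X) = 0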

The (unconditional) covariance between Yₜᴾ and Uₜᴾ is

    E[(Yₜᴾ − EYₜᴾ)(Uₜᴾ − EUₜᴾ)] = E[(B₀ + B₁Xₜ − EYₜᴾ) Uₜᴾ]
                                = E[(B₀ + B₁Xₜ) Uₜᴾ]
                                = 0,

where the second line follows because EXₜ (and hence EYₜᴾ) is not random and EUₜᴾ = 0, and the third line follows because B₀ + B₁Xₜ is uncorrelated with Uₜᴾ by construction (recall, if X and Y are uncorrelated, then E(XY) = EX · EY). Clearly, if the predicted values of the dependent variable were correlated with the estimated residuals, then the predicted values could be improved, so we expect zero covariance. To show that the sample estimate is always zero, note that the sample estimate of the covariance between Yₜᴾ and Uₜᴾ is

    (1/n) Σₜ (Yₜᴾ − Ȳᴾ)(Uₜᴾ − Ūᴾ) = (1/n) Σₜ B₁(Xₜ − X̄) Uₜᴾ
                                  = B₁ [(1/n) Σₜ XₜUₜᴾ − X̄ (1/n) Σₜ Uₜᴾ]
                                  = 0,

where the second line uses Yₜᴾ − Ȳᴾ = B₁(Xₜ − X̄) and Ūᴾ = 0, and the third line follows from the normal equations, which state that Σₜ Uₜᴾ = Σₜ XₜUₜᴾ = 0. Of course the normal equations ensure that the sample analogs equal the population moments. The relevant population moments are E(Uₜ | Xₜ) = 0 (the residuals are mean zero) and E(XₜUₜ | Xₜ) = 0 (the residuals are uncorrelated with the regressors). Recall Assumption 2.
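
The sample covariance result is easy to verify numerically (simulated data, invented values): by the normal equations, the covariance between the fitted values and the residuals is zero up to floating-point error.

# Sketch: verify that the sample covariance between fitted values and
# residuals is zero (up to rounding) in any OLS fit.
import numpy as np

rng = np.random.default_rng(6)
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
fitted = X @ b
resid = y - fitted

sample_cov = np.mean((fitted - fitted.mean()) * (resid - resid.mean()))
print(sample_cov)   # ~ 1e-16: zero by the normal equations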

Issues of identification are in play here. To make the issues clear, consider the model

    Yₜ = α₀ + Xₜ′β₀ + Uₜ,

in which Xₜ is the k × 1 vector of regressors that does not include the intercept. We now ask: under what conditions are the coefficients identified? If the covariance matrix of Xₜ is nonsingular and Xₜ is independent of Uₜ, then β₀ is identified. An additional assumption is needed to identify α₀. Two alternative assumptions that identify α₀ are EUₜ = 0 and Med(Uₜ) = 0. The only difference is in the interpretation of α₀ + Xₜ′β₀, as discussed above. Alternatively, we could assume that Uₜ is symmetrically distributed around 0, conditional on Xₜ. Then α₀ and β₀ are identified and α₀ + Xₜ′β₀ is both the conditional mean and the conditional median, as well as being equal to other location measures. Both α₀ and β₀ are identified under a conditional location restriction that is weaker than either the assumption of independence (between the regressor and the error) or the assumption of conditional symmetry. Further, each conditional location restriction is associated with a conditional moment restriction

    E[f(Uₜ) | Xₜ] = 0

for some function f(Uₜ), from which an estimator is constructed. Consider the two location assumptions introduced earlier. If E(Uₜ | Xₜ) = 0, then f(Uₜ) = Uₜ and the resultant estimator is OLS (and, again, α₀ + Xₜ′β₀ is the conditional mean of Yₜ). If Med(Uₜ | Xₜ) = 0, the corresponding moment condition is E[sgn(Uₜ) | Xₜ] = 0 and the resulting estimator is least absolute deviations (and, again, α₀ + Xₜ′β₀ is the conditional median of Yₜ).

To derive the moment condition for OLS, note that E(Uₜ | Xₜ) = 0 is clearly a moment condition that can be used for estimation. The OLSE B thus satisfies

    (1/n) Σₜ Xₜ Uₜ(B) = 0

While Med(Uₜ | Xₜ) = 0 is a moment condition, it may not be as clear how it can be used to form an estimator. Consider first the case in which Uₜ is continuous. The assumption Med(Uₜ | Xₜ) = 0 implies

    P(Uₜ < 0 | Xₜ) = P(Uₜ > 0 | Xₜ) = 1/2,

which implies E[sgn(Uₜ) | Xₜ] = 0, which in turn implies E[Xₜ sgn(Uₜ)] = 0. The signum, or sign, function is defined as

    sgn(u) =  1   if u > 0
              0   if u = 0
             −1   if u < 0
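
The difference between the two location restrictions can be seen in a short simulation (the error distribution below is invented for illustration). An asymmetric error can have median zero, so the sign-based moment condition holds, while its mean, and hence the OLS-style moment, is nonzero.

# Sketch: an asymmetric error with median zero but nonzero mean satisfies
# E[sgn(U_t)] = 0 (the LAD moment) but not E[U_t] = 0 (the OLS moment),
# so the two restrictions pin down different intercepts.
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
u = rng.exponential(scale=1.0, size=n) - np.log(2.0)   # median 0, mean 1 - ln 2

print(np.mean(np.sign(u)))   # ~ 0: Med(U) = 0, the sign-based moment holds
print(np.mean(u))            # ~ 0.31: EU != 0, so the mean-based moment fails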

The LAD estimator B_L satisfies the sample analog

    (1/n) Σₜ Xₜ sgn(Uₜ(B_L)) = 0

There are two problems here. First, it may not be apparent that the sample analog with B_L admits a unique solution. In fact, in Powell's symmetrically trimmed LAD paper in Econometrica, his conditional moment equation has many solutions. Also, if Uₜ is not distributed symmetrically, then the assumption Med(Uₜ | Xₜ) = 0 does not necessarily lead to a simple moment condition for estimation. The problem is that, if Uₜ does not have a continuous distribution, then it is possible that there is positive point mass at the median, so it is possible that E[sgn(Uₜ) | Xₜ] ≠ 0. The alternative is to return to the loss function (also termed the objective function). The loss function approach solves both problems. First, there is clearly a unique solution (as Powell shows in the appendix to the above-mentioned paper). Second, the loss function approach works well even if Uₜ does not have a continuous distribution.
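
A minimal sketch of the loss-function approach (simulated data with invented values; in practice LAD is computed with a linear-programming or quantile-regression routine rather than a generic optimizer): minimize Σₜ |Yₜ − a − bXₜ| and check that the sign-based sample moments are approximately zero at the solution.

# Sketch: compute LAD by minimizing the loss function sum_t |y_t - a - b*x_t|.
# Nelder-Mead handles the non-smooth objective well enough for illustration.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
n = 400
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=n)   # heavy-tailed, median-zero error

def lad_loss(params):
    a, b = params
    return np.sum(np.abs(y - a - b * x))

res = minimize(lad_loss, np.array([0.0, 0.0]), method="Nelder-Mead")
a_lad, b_lad = res.x

u_hat = y - a_lad - b_lad * x
print(a_lad, b_lad)                      # LAD intercept and slope
print(np.mean(np.sign(u_hat)),           # sample analogs of the sign moments
      np.mean(x * np.sign(u_hat)))       # are close to zero at the LAD estimate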