Econometrics 2, Fall 2004
Maximum Likelihood (ML) Estimation
Heino Bohn Nielsen

Outline of the Lecture
(1) Introduction.
(2) ML estimation defined.
(3) Example I: Binomial trials.
(4) Example II: Linear regression.
(5) Pseudo-ML estimation.
(6) Test principles: the Wald test, the likelihood ratio (LR) test, and the Lagrange multiplier (LM) test or score test.
Introduction: GMM and ML

Consider stochastic variables $y_t$ ($1 \times 1$) and $x_t$ ($K \times 1$) and the observations $t = 1, 2, \ldots, T$. Assume that we have a conditional model for $y_t \mid x_t$ in mind.

GMM derives moment conditions $E[f(y_t, x_t, \theta)] = 0$. The sample moments
\[ g_T(\theta) = \frac{1}{T} \sum_{t=1}^{T} f(y_t, x_t, \theta) \]
link $\theta$ to the data. The GMM estimator, $\hat{\theta}$, minimizes $g_T(\theta)' W_T \, g_T(\theta)$.

Maximum likelihood (ML) assumes knowledge of the distribution (density) $f(y_t \mid x_t; \theta)$ up to the unknown parameters $\theta$. The ML estimator, $\hat{\theta}$, maximizes the probability of the data given the density.

The Likelihood Principle

The probability of observing $y_t$ given $x_t$ is given by the conditional density $f(y_t \mid x_t; \theta)$. If the observations are i.i.d., the probability of $y_1, \ldots, y_T$ (given $x_1, \ldots, x_T$) is
\[ f(y_1, y_2, \ldots, y_T \mid x_1, \ldots, x_T; \theta) = \prod_{t=1}^{T} f(y_t \mid x_t; \theta). \]
The likelihood function for the sample is defined as
\[ L(\theta) = f(y_1, y_2, \ldots, y_T \mid x_1, \ldots, x_T; \theta) = \prod_{t=1}^{T} f(y_t \mid x_t; \theta) = \prod_{t=1}^{T} L_t(\theta), \]
where $L_t(\theta)$ is called the likelihood contribution for observation $t$. The ML estimator $\hat{\theta}$ is chosen to maximize $L(\theta)$.
The ML Estimator

It is easier to maximize the log-likelihood function,
\[ \log L(\theta) = \sum_{t=1}^{T} \log L_t(\theta) = \sum_{t=1}^{T} \log f(y_t \mid x_t; \theta). \]
Define the score vector ($K \times 1$) as
\[ s(\theta) = \frac{\partial \log L(\theta)}{\partial \theta} = \sum_{t=1}^{T} \frac{\partial \log L_t(\theta)}{\partial \theta} = \sum_{t=1}^{T} s_t(\theta), \]
where $s_t(\theta)$ is the score for each observation. The first-order conditions, the so-called likelihood equations, state that
\[ s(\hat{\theta}) = \sum_{t=1}^{T} s_t(\hat{\theta}) = 0. \]
These might be difficult to solve in practice; use numerical optimization.

Properties of ML

Given correct specification (and weak regularity conditions):
- The ML estimator is consistent: $\operatorname{plim} \hat{\theta} = \theta_0$.
- The ML estimator is asymptotically normal,
\[ \sqrt{T}\,(\hat{\theta} - \theta_0) \to N\!\left(0, \ I(\theta_0)^{-1}\right), \quad \text{where } I(\theta) = -E\!\left[ \frac{\partial^2 \log L_t(\theta)}{\partial \theta \, \partial \theta'} \right]. \]
The negative expected Hessian, $I(\theta)$, is called the information matrix. The more curvature of the likelihood function, the more precision.
- The ML estimator is asymptotically efficient: all other consistent and asymptotically normal estimators will have an asymptotic variance at least as large as $I(\theta_0)^{-1}$. $I(\theta_0)^{-1}$ is denoted the Cramér-Rao lower bound.
Asymptotic Inference

Asymptotic inference can be based on
\[ \hat{\theta} \approx N\!\left(\theta_0, \ T^{-1} \hat{I}^{-1}\right), \]
where $\hat{I}$ is a consistent estimate of $I(\theta_0)$. One possibility is minus the sample average of second derivatives,
\[ \hat{I}_H = -\frac{1}{T} \sum_{t=1}^{T} \frac{\partial^2 \log L_t(\hat{\theta})}{\partial \theta \, \partial \theta'}. \]
It can be shown that $E[s_t(\theta_0)\, s_t(\theta_0)'] = I(\theta_0)$. An alternative estimator is therefore the outer product of the scores,
\[ \hat{I}_{OPG} = \frac{1}{T} \sum_{t=1}^{T} s_t(\hat{\theta})\, s_t(\hat{\theta})'. \]

Joint and Conditional Distributions

Above we considered a model for $y_t \mid x_t$, corresponding to the conditional distribution $f(y_t \mid x_t; \theta)$. From a statistical point of view, a natural alternative would be a model for all the data, $z_t = (y_t, x_t')'$, corresponding to the joint density $f(y_t, x_t; \psi)$. Recall the factorization
\[ f(y_t, x_t; \psi) = f(y_t \mid x_t; \theta) \, f(x_t; \phi). \]
If the two sets of parameters, $\theta$ and $\phi$, are not related (and $x_t$ is exogenous in a certain sense), we can estimate $\theta$ from the conditional model $f(y_t \mid x_t; \theta)$ alone.
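As a small numerical sketch (not part of the original slides), the two information estimators can be compared for a model where both have closed forms: an exponential density $f(y_t; \lambda) = \lambda e^{-\lambda y_t}$, with score $s_t(\lambda) = 1/\lambda - y_t$, Hessian $-1/\lambda^2$, and ML estimator $\hat{\lambda} = 1/\bar{y}$. The model and sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
lam_true = 2.0
y = rng.exponential(scale=1 / lam_true, size=5_000)

# ML estimator for an exponential sample: lam_hat = 1 / mean(y)
lam_hat = 1.0 / y.mean()

# Estimator 1: minus the average Hessian; here -d2/dlam2 log L_t = 1/lam^2
info_hessian = 1.0 / lam_hat**2

# Estimator 2: outer product of the scores, the average of s_t^2
scores = 1.0 / lam_hat - y
info_opg = np.mean(scores**2)

# Both are consistent for I(lam) = 1/lam^2 and agree closely in large samples
print(lam_hat, info_hessian, info_opg)
```

Under correct specification the two estimates converge to the same limit; under misspecification they can differ, which is the basis of so-called information-matrix tests.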
Time Series (Non-i.i.d.) Data and Factorization

The multiplicative form
\[ f(y_1, y_2, \ldots, y_T; \theta) = \prod_{t=1}^{T} f(y_t; \theta) \]
follows from the i.i.d. assumption, which cannot be made for many time series. For a time series, the object of interest is often $E[y_t \mid y_{t-1}, \ldots, y_1]$. This can be used to factorize the likelihood function. Recall again that $f(y_1, y_2) = f(y_2 \mid y_1)\, f(y_1)$; sequential conditioning gives
\[ f(y_1, y_2, \ldots, y_T; \theta) = f(y_1; \theta) \prod_{t=2}^{T} f(y_t \mid y_{t-1}, \ldots, y_1; \theta). \]
Conditioning on the first observation, $y_1$, gives a multiplicative structure:
\[ L(\theta) = \prod_{t=2}^{T} f(y_t \mid y_{t-1}, \ldots, y_1; \theta). \]

Example I: Binomial Trials

Consider random draws from a pool of red and blue balls. We are interested in the proportion of red balls, $p$. Let
\[ y_i = \begin{cases} 1 & \text{if ball } i \text{ is red} \\ 0 & \text{if ball } i \text{ is blue.} \end{cases} \]
Consider a data set of $N$ draws, $y_1, y_2, \ldots, y_N$, e.g. $1, 0, 0, 1, 0, 1$. The model implies that $\operatorname{prob}(y_i = 1) = p$ and $\operatorname{prob}(y_i = 0) = 1 - p$. The density function for $y_i$ is given by the binomial
\[ f(y_i; p) = p^{y_i} (1 - p)^{1 - y_i}. \]
Since the draws are independent, the likelihood function is given by
\[ L(p) = \prod_{i=1}^{N} f(y_i; p) = \prod_{i=1}^{N} p^{y_i} (1 - p)^{1 - y_i}, \]
and the log-likelihood function is
\[ \log L(p) = \sum_{i=1}^{N} \left[ y_i \log(p) + (1 - y_i) \log(1 - p) \right]. \]
The ML estimator, $\hat{p}$, is chosen to maximize this expression. The score for an individual observation is given by
\[ s_i(p) = \frac{\partial \log L_i(p)}{\partial p} = \frac{y_i}{p} - \frac{1 - y_i}{1 - p}. \]

The likelihood equation is given by the first-order condition
\[ s(\hat{p}) = \sum_{i=1}^{N} s_i(\hat{p}) = \sum_{i=1}^{N} \left( \frac{y_i}{\hat{p}} - \frac{1 - y_i}{1 - \hat{p}} \right) = 0. \]
This implies that
\[ (1 - \hat{p}) \sum_{i=1}^{N} y_i = \hat{p} \sum_{i=1}^{N} (1 - y_i) = \hat{p} \left( N - \sum_{i=1}^{N} y_i \right), \]
so that
\[ \hat{p} = \frac{\sum_{i=1}^{N} y_i}{N}. \]
The second derivative is
\[ \frac{\partial^2 \log L_i(p)}{\partial p^2} = -\frac{y_i}{p^2} - \frac{1 - y_i}{(1 - p)^2}. \]
Using that $E[y_i] = p$, the information is given by
\[ I(p) = -E\!\left[ \frac{\partial^2 \log L_i(p)}{\partial p^2} \right] = \frac{p}{p^2} + \frac{1 - p}{(1 - p)^2} = \frac{1}{p} + \frac{1}{1 - p} = \frac{(1 - p) + p}{p(1 - p)} = \frac{1}{p(1 - p)}. \]

Inference on $p$ can be based on
\[ \hat{p} \approx N\!\left( p, \ \frac{\hat{p}(1 - \hat{p})}{N} \right). \]

ML estimation can easily be done in PcGive using numerical optimization. Specify:
- actual: $y_i$.
- fitted: nothing in this case.
- loglik: $\log L_i(p) = y_i \log(p) + (1 - y_i) \log(1 - p)$.
- Initial values for the parameters, denoted &0, &1, ..., &k.

In our case (where the data series is denoted Bin):

    actual = Bin;
    fitted = 0;
    loglik = actual*log(&0)+(1-actual)*log(1-&0);
    &0 = 0.5;
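A Python analogue of this setup (a sketch, not part of the slides) computes $\hat{p}$ in closed form for the six example draws above, evaluates the standard error from $I(p)^{-1}$, and verifies numerically that $\hat{p}$ maximizes the log-likelihood, mirroring what PcGive does by numerical optimization:

```python
import numpy as np

y = np.array([1, 0, 0, 1, 0, 1])  # the example draws from the slides
N = len(y)

def loglik(p):
    # log L(p) = sum_i [ y_i log(p) + (1 - y_i) log(1 - p) ]
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

p_hat = y.sum() / N                    # closed-form ML estimator: 3/6 = 0.5
se = np.sqrt(p_hat * (1 - p_hat) / N)  # from I(p)^{-1}/N = p(1 - p)/N

# Numerical check: the log-likelihood is maximized at p_hat on a fine grid
grid = np.linspace(0.01, 0.99, 99)
p_num = grid[np.argmax([loglik(p) for p in grid])]
print(p_hat, p_num, se)
```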
Example II: Linear Regression

Consider a linear regression model,
\[ y_t = x_t'\beta + \epsilon_t, \quad t = 1, 2, \ldots, T. \]
Make the following assumptions:
(A) $E[x_t \epsilon_t] = 0$;
(B) $E[\epsilon \epsilon'] = \sigma^2 I_T$.
(A) is a moment condition; the model is the conditional expectation. (B) excludes heteroscedasticity and autocorrelation.

For ML estimation, we have to specify a distribution for $\epsilon_t$. Assume that
\[ \epsilon_t \sim N(0, \sigma^2). \]
This implies that $y_t \mid x_t \sim N(x_t'\beta, \sigma^2)$.

The probability of observing $y_t$ given the model is the density
\[ f(y_t \mid x_t; \beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y_t - x_t'\beta)^2}{2\sigma^2} \right). \]
Since the observations are assumed i.i.d., the probability of $y_1, \ldots, y_T$ is
\[ f(y_1, \ldots, y_T \mid x_1, \ldots, x_T; \beta, \sigma^2) = \prod_{t=1}^{T} f(y_t \mid x_t; \beta, \sigma^2) = \left( 2\pi\sigma^2 \right)^{-T/2} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{t=1}^{T} (y_t - x_t'\beta)^2 \right). \]
The likelihood function is given by
\[ L(\beta, \sigma^2) = \left( 2\pi\sigma^2 \right)^{-T/2} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{t=1}^{T} (y_t - x_t'\beta)^2 \right). \]
The log-likelihood function is
\[ \log L(\beta, \sigma^2) = -\frac{T}{2} \log\!\left( 2\pi\sigma^2 \right) - \frac{1}{2\sigma^2} \sum_{t=1}^{T} (y_t - x_t'\beta)^2. \]
The ML estimators $\hat{\beta}$ and $\hat{\sigma}^2$ are chosen to maximize this expression.

The likelihood contributions are
\[ \log L_t(\beta, \sigma^2) = -\frac{1}{2} \log\!\left( 2\pi\sigma^2 \right) - \frac{(y_t - x_t'\beta)^2}{2\sigma^2}. \]
The individual scores ($(K + 1) \times 1$) are given by the first derivatives:
\[ s_t(\beta, \sigma^2) = \begin{pmatrix} \partial \log L_t / \partial \beta \\ \partial \log L_t / \partial \sigma^2 \end{pmatrix} = \begin{pmatrix} \sigma^{-2} x_t (y_t - x_t'\beta) \\ -\dfrac{1}{2\sigma^2} + \dfrac{(y_t - x_t'\beta)^2}{2\sigma^4} \end{pmatrix}. \]
The first-order conditions are given by
\[ s(\hat{\beta}, \hat{\sigma}^2) = \sum_{t=1}^{T} s_t(\hat{\beta}, \hat{\sigma}^2) = \begin{pmatrix} \hat{\sigma}^{-2} \sum_{t} x_t (y_t - x_t'\hat{\beta}) \\ -\dfrac{T}{2\hat{\sigma}^2} + \dfrac{1}{2\hat{\sigma}^4} \sum_{t} (y_t - x_t'\hat{\beta})^2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}. \]

The first condition,
\[ \sum_{t=1}^{T} x_t (y_t - x_t'\hat{\beta}) = 0, \]
implies
\[ \hat{\beta} = \left( \sum_{t=1}^{T} x_t x_t' \right)^{-1} \sum_{t=1}^{T} x_t y_t, \]
which is identical to OLS. Letting $\hat{\epsilon}_t = y_t - x_t'\hat{\beta}$, the second condition yields
\[ \frac{T}{2\hat{\sigma}^2} = \frac{1}{2\hat{\sigma}^4} \sum_{t=1}^{T} \hat{\epsilon}_t^2 \quad \Longrightarrow \quad \hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^{T} \hat{\epsilon}_t^2, \]
which is slightly different from the OLS variance estimator. The ML estimator of the variance is biased but consistent.
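The coincidence of ML and OLS for $\beta$, and the $1/T$ versus $1/(T-K)$ scaling of the two variance estimators, can be checked numerically. The sketch below uses simulated data (sample size, coefficients, and noise level are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(42)
T, K = 200, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
beta_true = np.array([1.0, 0.5, -0.3])
y = X @ beta_true + rng.normal(scale=0.8, size=T)

# ML estimator of beta solves the first likelihood equation: identical to OLS
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

sigma2_ml = resid @ resid / T        # ML estimator: biased but consistent
s2_ols = resid @ resid / (T - K)     # OLS estimator: unbiased

print(beta_hat, sigma2_ml, s2_ols)
```

The two variance estimates differ only by the factor $T/(T-K)$, which vanishes as $T$ grows, illustrating the "biased but consistent" statement above.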
The first condition: X X X ( 0 ) 0 0 implies à X! 1 X b 0 b Letting b ( 0 b ), the second condition yields 1 2 X b 2 4 2 2 b 2 1 X b 2 which is slightly dierent from the OLS variance. The ML estimator of the variance is biased but consistent. 19 of 32 The second derivatives are given by 2 log ( 2 ) 0 ½ ( 0 ¾ ) 2 2 log ( 2 ) ½ ( 0 ¾ ) 2 2 ( 2 ) 2 log ( 2 ) 1 2 2 2 2 + 1 ( 0 ) 2 2 2 4 2 log ( 2 ) 2 0 ( 1 2 + 1 ( 0 ) 2 2 2 4 ) 0 2 ( 0 ) 4 1 2 4 2 6 0 ( 0 ) 4 4 0 4 That gives the information matrix à ( 2 ) 0 2 4 0 4 1 2 4 2 6 Note that the information matrix is block diagonal.! µ 2 [ 0 ] 0 0 1 2 4 20 of 32
Recall that
\[ \sqrt{T} \begin{pmatrix} \hat{\beta} - \beta \\ \hat{\sigma}^2 - \sigma^2 \end{pmatrix} \to N\!\left( 0, \ I(\beta, \sigma^2)^{-1} \right). \]
This implies
\[ \hat{\beta} \approx N\!\left( \beta, \ T^{-1} \sigma^2 \left( E[x_t x_t'] \right)^{-1} \right), \]
where the variance can be estimated by
\[ \hat{V}(\hat{\beta}) = T^{-1} \hat{\sigma}^2 \left( \frac{1}{T} \sum_{t=1}^{T} x_t x_t' \right)^{-1} = \hat{\sigma}^2 \left( \sum_{t=1}^{T} x_t x_t' \right)^{-1}. \]
Furthermore,
\[ \hat{\sigma}^2 \approx N\!\left( \sigma^2, \ \frac{2\sigma^4}{T} \right). \]

Pseudo-ML Estimation

The first-order conditions for ML estimation can be seen as a sample counterpart to a moment condition:
\[ \frac{1}{T} \sum_{t=1}^{T} s_t(\theta) = 0 \quad \text{corresponds to} \quad E[s_t(\theta)] = 0, \]
and ML becomes a special case of GMM. Consequently, $\hat{\theta}$ is consistent under weaker assumptions than those maintained by ML. The FOC for a normal regression model corresponds to
\[ E[x_t (y_t - x_t'\beta)] = 0, \]
which is weaker than the assumption that the entire distribution is correctly specified. OLS is consistent even if $\epsilon_t$ is not normal. An estimation that maximizes a likelihood function different from the true model's likelihood is referred to as a pseudo-ML or quasi-ML estimator.
Three Classical Test Principles

Consider a null hypothesis of interest, $H_0$, and an alternative, $H_A$, e.g.
\[ H_0: \theta = \theta_0 \quad \text{against} \quad H_A: \theta \neq \theta_0. \]
Let $\tilde{\theta}$ and $\hat{\theta}$ denote the ML estimates under $H_0$ and $H_A$, respectively.

- Wald test. Estimate the model only under $H_A$, and look at the distance $\hat{\theta} - \theta_0$, normalized by the covariance matrix.
- Likelihood ratio (LR) test. Estimate under $H_0$ and under $H_A$, and look at the loss in likelihood, $\log L(\hat{\theta}) - \log L(\tilde{\theta})$.
- Lagrange multiplier (LM) or score test. Estimate $\tilde{\theta}$ under $H_0$, and see if the first-order conditions $\sum_t s_t(\tilde{\theta}) = 0$ are significantly violated.

The tests are asymptotically equivalent; for the normal regression model the ordering $W \geq LR \geq LM$ holds in finite samples.

[Figure: the log-likelihood curve, illustrating the distances measured by the W, LR, and LM statistics.]
Wald Test

Recall that $\hat{\theta} \approx N(\theta_0, V(\hat{\theta}))$, so that
\[ \hat{\theta} - \theta_0 \approx N\!\left( 0, \ V(\hat{\theta}) \right) \]
if the null hypothesis is true, and a natural test statistic is the quadratic form
\[ W = \left( \hat{\theta} - \theta_0 \right)' V(\hat{\theta})^{-1} \left( \hat{\theta} - \theta_0 \right). \]
Under the null, this is distributed as $\chi^2(R)$, where $R$ is the number of restrictions. An example is the $t$-ratio for $H_0: \theta_j = 0$,
\[ t = \frac{\hat{\theta}_j - 0}{\sqrt{V(\hat{\theta}_j)}} \approx N(0, 1). \]
The Wald test requires estimation only under the alternative, $H_A$.

Likelihood Ratio (LR) Test

For the LR test we estimate both under $H_0$ and under $H_A$. The LR test statistic is given by
\[ LR = -2 \log\!\left( \frac{L(\tilde{\theta})}{L(\hat{\theta})} \right) = -2 \left( \log L(\tilde{\theta}) - \log L(\hat{\theta}) \right), \]
where $L(\tilde{\theta})$ and $L(\hat{\theta})$ are the two likelihood values. Under the null, this is asymptotically distributed as $\chi^2(R)$. The test is insensitive to how the model and restrictions are formulated. The test is only appropriate when the models are nested.
Lagrange Multiplier (LM) or Score Test

Recall that the average score is zero at the unrestricted estimate:
\[ s(\hat{\theta}) = \sum_{t=1}^{T} s_t(\hat{\theta}) = 0. \]
If the restriction is true, then also
\[ s(\tilde{\theta}) = \sum_{t=1}^{T} s_t(\tilde{\theta}) \approx 0. \]
This can be tested by the quadratic form
\[ LM = \left( \sum_{t=1}^{T} s_t(\tilde{\theta}) \right)' \left[ \operatorname{Variance}\!\left( \sum_{t=1}^{T} s_t(\tilde{\theta}) \right) \right]^{-1} \left( \sum_{t=1}^{T} s_t(\tilde{\theta}) \right), \]
which under the null is $\chi^2(R)$.

The variance of the individual score, $s_t(\theta)$, is the information matrix,
\[ I(\theta) = E\!\left[ s_t(\theta)\, s_t(\theta)' \right], \]
which can be estimated as
\[ \hat{I}(\tilde{\theta}) = \frac{1}{T} \sum_{t=1}^{T} s_t(\tilde{\theta})\, s_t(\tilde{\theta})'. \]
The estimated variance of $\sum_t s_t(\tilde{\theta})$ is therefore
\[ T \cdot \hat{I}(\tilde{\theta}) = \sum_{t=1}^{T} s_t(\tilde{\theta})\, s_t(\tilde{\theta})'. \]
And the LM test can be written as
\[ LM = \left( \sum_{t=1}^{T} s_t(\tilde{\theta}) \right)' \left( \sum_{t=1}^{T} s_t(\tilde{\theta})\, s_t(\tilde{\theta})' \right)^{-1} \left( \sum_{t=1}^{T} s_t(\tilde{\theta}) \right), \]
which will have a $\chi^2(R)$ distribution under the null.
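As an illustration of all three principles (a sketch, not from the slides), the statistics can be computed for the binomial model of Example I under $H_0: p = p_0$, where $V(\hat{p}) = \hat{p}(1-\hat{p})/N$ and the variance of the score at $p_0$ is $N \cdot I(p_0) = N/\{p_0(1-p_0)\}$. The simulated sample is illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
N, p0 = 1_000, 0.5
y = rng.binomial(1, 0.55, size=N)   # true p = 0.55, so H0: p = 0.5 is false
p_hat = y.mean()

def loglik(p):
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Wald: distance p_hat - p0, normalized by the unrestricted variance
W = (p_hat - p0) ** 2 / (p_hat * (1 - p_hat) / N)

# LR: minus twice the log-likelihood difference
LR = -2 * (loglik(p0) - loglik(p_hat))

# LM: score at the restricted estimate, normalized by its variance N * I(p0)
score0 = np.sum(y / p0 - (1 - y) / (1 - p0))
LM = score0 ** 2 / (N / (p0 * (1 - p0)))

print(W, LR, LM)   # nearly identical; each is chi^2(1) under H0
```

The three values differ only in which point the curvature is evaluated at, which is why they coincide asymptotically.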
Note that the quadratic form is of dimension $K$, but $K - R$ elements are unrestricted, and $\sum_t s_t(\tilde{\theta}) = 0$ for these elements. Hence the $\chi^2(R)$ distribution.

In the case where the null hypothesis is $H_0: \theta_2 = 0$, for
\[ \theta = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} \quad \text{and} \quad s_t(\theta) = \begin{pmatrix} s_{1t}(\theta) \\ s_{2t}(\theta) \end{pmatrix}, \]
the LM statistic only depends on the scores corresponding to $\theta_2$:
\[ LM = \left( \sum_{t=1}^{T} s_{2t}(\tilde{\theta}) \right)' \left[ \operatorname{Variance}\!\left( \sum_{t=1}^{T} s_{2t}(\tilde{\theta}) \right) \right]^{-1} \left( \sum_{t=1}^{T} s_{2t}(\tilde{\theta}) \right). \]
Note, however, that to calculate the variance we need the full score vector, except if the covariance matrix is block diagonal.

LM Tests by Auxiliary Regressions

Define the $T \times K$ matrix of scores, $S$, with row $t$ given by $s_t(\tilde{\theta})'$, and let $\iota$ denote a $T \times 1$ vector of ones. Then
\[ \sum_{t=1}^{T} s_t(\tilde{\theta}) = S'\iota \quad \text{and} \quad \sum_{t=1}^{T} s_t(\tilde{\theta})\, s_t(\tilde{\theta})' = S'S. \]
Therefore, we can write the LM statistic as
\[ LM = \left( \sum_{t} s_t(\tilde{\theta}) \right)' \left( \sum_{t} s_t(\tilde{\theta})\, s_t(\tilde{\theta})' \right)^{-1} \left( \sum_{t} s_t(\tilde{\theta}) \right) = \iota' S \left( S'S \right)^{-1} S' \iota. \]
Now consider the auxiliary regression model
\[ \iota = S\gamma + \text{residual}. \quad (*) \]
The OLS estimator and the predicted values are given by, respectively,
\[ \hat{\gamma} = (S'S)^{-1} S'\iota \quad \text{and} \quad \hat{\iota} = S\hat{\gamma} = S(S'S)^{-1} S'\iota. \]
The LM test can then be written as
\[ LM = \iota' S (S'S)^{-1} S'S (S'S)^{-1} S' \iota = \hat{\iota}'\hat{\iota} = T \cdot \frac{ESS}{TSS} = T \cdot R^2, \]
using that $TSS = \iota'\iota = T$. The regression $(*)$ is not always the most convenient; often, alternative auxiliary regressions are used.

Examples of LM tests in a linear regression, based on $T \cdot R^2$ from auxiliary regressions:
- Test for omitted variables, $z_t$. Run the regression $\hat{\epsilon}_t = x_t'\gamma_1 + z_t'\gamma_2 + \text{residual}$, or $y_t = x_t'\gamma_1 + z_t'\gamma_2 + \text{residual}$.
- Breusch-Godfrey test for no first-order autocorrelation. Run the regression $\hat{\epsilon}_t = \rho\, \hat{\epsilon}_{t-1} + x_t'\gamma + \text{residual}$.
- Breusch-Pagan test for no heteroscedasticity. Run the regression $\hat{\epsilon}_t^2 = z_t'\gamma + \text{residual}$.
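The Breusch-Godfrey version of the $T \cdot R^2$ recipe can be sketched as follows (simulated data with no autocorrelation, so the null is true; the sample size and coefficients are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 500
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=T)   # i.i.d. errors: H0 true

# Step 1: estimate the model under the null and save the residuals
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b

# Step 2: auxiliary regression of e_t on e_{t-1} and x_t
Z = np.column_stack([e[:-1], X[1:]])
g = np.linalg.solve(Z.T @ Z, Z.T @ e[1:])
u = e[1:] - Z @ g

# LM = (T - 1) * R^2, approximately chi^2(1) under the null
tss = np.sum((e[1:] - e[1:].mean()) ** 2)
R2 = 1.0 - (u @ u) / tss
LM = (T - 1) * R2
print(LM)   # compare with the chi^2(1) 5% critical value, 3.84
```

One observation is lost to the lag, so the statistic uses $T - 1$; packages differ in whether they pad the lagged residual with a zero instead.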