Maximum Likelihood Estimation

Econometrics II
Department of Economics, Universidad Carlos III de Madrid
Máster Universitario en Desarrollo y Crecimiento Económico
General Approaches to Parameter Estimation

Are there general approaches to estimation that produce estimators with good properties, such as consistency and efficiency?

- Least Squares: OLS, FGLS, FE
- Method of Moments: assume $\theta_0 = g(E[Y])$ and replace population moments by sample moments: $\hat{\theta} = g(\hat{E}_N[y_i])$. Examples: OLS, FGLS, IV, FE. (A sketch of the plug-in idea follows below.)
- Maximum Likelihood (ML): loosely speaking, it chooses the $\hat{\theta}$ that maximizes the estimated density of the observed sample.

Why Should We Use ML?

- Nice asymptotic results under mild conditions.
- Easy to implement for non-linear models: labor force participation decisions, employment decisions, marriage/divorce decisions, the number of children a couple wants, big investment project decisions, means of transport choice... that is, useful for discrete choices.
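To illustrate the method-of-moments plug-in idea, here is a minimal Python sketch (Python is used for all code examples in these notes; the exponential model and the simulated data are illustrative assumptions, not part of the slides). For an exponential population, $E[Y] = 1/\lambda$, so $g(m) = 1/m$ and the estimator simply replaces the population mean with the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=1000)  # true lambda = 0.5, so E[Y] = 2

# Method of moments: E[Y] = 1/lambda  =>  lambda_hat = g(sample mean) = 1/mean(y)
lambda_hat = 1.0 / y.mean()
print(lambda_hat)  # close to the true value 0.5
```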
Basic Setup

Let $\{y_1, y_2, \ldots, y_N\}$ be a sample from a population, each observation with density $f(y; \theta_0)$. We know $f$ but do not know $\theta_0$. We assume that the observations $\{y_1, y_2, \ldots, y_N\}$ are independent, so that

$f(y_1, y_2, \ldots, y_N; \theta_0) = f(y_1; \theta_0) f(y_2; \theta_0) \cdots f(y_N; \theta_0)$

Likelihood function: the function obtained for a given sample after replacing the true $\theta_0$ by any $\theta$:

$L(\theta) = f(y_1; \theta) f(y_2; \theta) \cdots f(y_N; \theta)$

$L(\theta)$ is a random variable because it depends on the sample.

Definition. The maximum likelihood estimator of $\theta_0$, $\hat{\theta}_{ML}$, is the value of $\theta$ that maximizes the likelihood function $L(\theta)$.

Usually, it is more convenient to work with the logarithm of the likelihood function:

$\ell(\theta) = \log L(\theta) = \sum_{i=1}^N \log f(y_i; \theta)$
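The definition translates directly into code. A minimal sketch, assuming a normal population with unknown mean and known unit variance (both the model and the data below are illustrative choices):

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(theta, y):
    # l(theta) = sum_i log f(y_i; theta), here with f the N(theta, 1) density
    return norm.logpdf(y, loc=theta, scale=1.0).sum()

y = np.array([0.3, -1.1, 0.8, 1.9, 0.2])
print(log_likelihood(0.0, y), log_likelihood(y.mean(), y))
# the sample mean attains the higher value: it is the MLE in this model
```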
Example: Bernoulli (1/4)

$Y$ is Bernoulli: it takes value 1 with probability $p_0$ and value 0 with probability $1 - p_0$. The likelihood for observation $i$ is $p_0$ if $y_i = 1$ and $1 - p_0$ if $y_i = 0$.

Let $n_1$ be the number of observations equal to 1. Then, under iid sampling,

$L(p) = p^{n_1} (1-p)^{n - n_1}$

Example: Bernoulli (2/4)

Each sample gives us one likelihood function. Suppose we observe $\{0,1,0,0,0\}$. Then we have this likelihood: $L(p) = p(1-p)^4$. Suppose we observed instead the sample $\{1,0,0,1,1\}$, where ones are more frequent. Now we have a different likelihood: $L(p) = p^3 (1-p)^2$.
Example: Bernoulli (3/4)

[Figure: the two likelihood functions $L(p)$ plotted over $p \in [0,1]$; the likelihood for the sample $\{0,1,0,0,0\}$ peaks at $p = 0.2$, the one for $\{1,0,0,1,1\}$ at $p = 0.6$.]

With the first sample, $\hat{p} = 0.2$; with the second sample, $\hat{p} = 0.6$.

Example: Bernoulli (4/4)

The maximum likelihood estimator is the value that maximizes $L(p) = p^{n_1}(1-p)^{n-n_1}$. The same $\hat{p}$ maximizes the logarithm of the likelihood function:

$\ell(p) = n_1 \log p + (n - n_1) \log(1-p)$

Setting the derivative to zero,

$\frac{\partial \ell(p)}{\partial p} = 0 \;\Rightarrow\; \frac{n_1}{\hat{p}} = \frac{n - n_1}{1 - \hat{p}} \;\Rightarrow\; \hat{p} = \frac{n_1}{n}$

With $\{0,1,0,0,0\}$, $\hat{p} = 1/5 = 0.2$; with $\{1,0,0,1,1\}$, $\hat{p} = 3/5 = 0.6$.
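A quick numerical check of the closed-form result $\hat{p} = n_1/n$ (a minimal sketch; the grid search is just an illustrative shortcut for the maximization):

```python
import numpy as np

def bernoulli_loglik(p, y):
    n1, n = y.sum(), y.size
    return n1 * np.log(p) + (n - n1) * np.log(1 - p)

for y in (np.array([0, 1, 0, 0, 0]), np.array([1, 0, 0, 1, 1])):
    grid = np.linspace(0.001, 0.999, 999)
    p_hat = grid[np.argmax(bernoulli_loglik(grid, y))]
    print(p_hat, y.mean())  # numerical maximizer vs the closed form n1/n
```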
Computing the MLE

ML estimates are sometimes easy to compute, as in the previous example; in the linear regression model with normal errors, ML coincides with OLS. Sometimes, however, there is no algebraic solution to the maximization problem, and it is necessary to use some sort of nonlinear maximization procedure: STATA will take care of this. (A sketch of what such a routine does is given below.)

Classical Assumptions

Gauss-Markov Assumptions:
- A1 Linearity: $y = \beta_0' x + \varepsilon$
- A2 Random Sampling
- A3 Conditional Mean Independence: $E[y \mid x] = \beta_0' x$
- A4 Invertibility of the Variance-Covariance Matrix
- A5 Homoskedasticity: $\mathrm{Var}[\varepsilon \mid x] = \sigma_0^2$

Normality:
- A6 Normality: $y \mid x \sim N(\beta_0' x, \sigma_0^2)$
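As a sketch of what a nonlinear maximization routine does (here in Python's scipy rather than STATA; the Poisson model and the data are illustrative assumptions): the optimizer searches for the parameter value that minimizes the negative log-likelihood.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

y = np.array([2, 0, 3, 1, 4, 2])

def neg_loglik(lam):
    # Poisson log-likelihood: sum_i [ y_i log(lam) - lam - log(y_i!) ]
    return -(y * np.log(lam) - lam - gammaln(y + 1)).sum()

res = minimize_scalar(neg_loglik, bounds=(1e-6, 20.0), method="bounded")
print(res.x, y.mean())  # numerical MLE vs the known analytical answer
```

For the Poisson model the MLE happens to be the sample mean, which makes it easy to verify that the optimizer found the right answer; in genuinely non-linear models no such closed form exists and the numerical answer is all we have.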
Basic Setup

Let $\{y_1, y_2, \ldots, y_N\}$ be an iid sample from a population with $y \mid x \sim N(\beta_0' x, \sigma_0^2)$. We aim to estimate $\theta_0 = (\beta_0, \sigma_0^2)$.

Because of the iid assumption, the joint distribution of $\{y_1, y_2, \ldots, y_N\}$ is simply the product of the densities:

$f(y_1, y_2, \ldots, y_N \mid x_1, \ldots, x_N; \theta_0) = f(y_1 \mid x_1; \theta_0) f(y_2 \mid x_2; \theta_0) \cdots f(y_N \mid x_N; \theta_0)$

Note that $y \mid x \sim N(\beta_0' x, \sigma_0^2)$ is equivalent to $\varepsilon \sim N(0, \sigma_0^2)$. This implies that

$f_{Y \mid X}(y_i \mid x_i; \theta_0) = f_\varepsilon(y_i - \beta_0' x_i; \theta_0)$

Density of the Error Term

We have that $\varepsilon \sim N(0, \sigma_0^2)$, so what is its density? We can use the following trick:

1. $\varepsilon \sim N(0, \sigma_0^2)$ implies that $\varepsilon / \sigma_0 \sim N(0, 1)$
2. $\varepsilon / \sigma_0 \sim N(0, 1)$ implies that $\mathrm{CDF}_\varepsilon(z) = \Pr(\varepsilon \le z) = \Phi(z / \sigma_0)$
3. Since the density of any continuous random variable is the first derivative of its CDF:

$f_\varepsilon(z; \theta_0) = \frac{1}{\sigma_0} \phi\!\left(\frac{z}{\sigma_0}\right)$
Density of the Sample

Since

$f_\varepsilon(z; \theta_0) = \frac{1}{\sigma_0} \phi\!\left(\frac{z}{\sigma_0}\right)$

and

$f_{Y \mid X}(y_i \mid x_i; \theta_0) = f_\varepsilon(y_i - \beta_0' x_i; \theta_0)$

and

$f(y_1, \ldots, y_N \mid x_1, \ldots, x_N; \theta_0) = f(y_1 \mid x_1; \theta_0) \cdots f(y_N \mid x_N; \theta_0)$,

then we have that

$f(y_1, \ldots, y_N \mid x_1, \ldots, x_N; \theta_0) = \prod_i \frac{1}{\sigma_0} \phi\!\left(\frac{y_i - \beta_0' x_i}{\sigma_0}\right)$

The Log-likelihood (1/2)

The likelihood replaces the true values of the parameters with free arguments:

$L(\beta, \sigma) = \prod_i \frac{1}{\sigma} \phi\!\left(\frac{y_i - \beta' x_i}{\sigma}\right)$

Taking the log makes the problem easier:

$\log L(\beta, \sigma) = \sum_i \left\{ \log\frac{1}{\sigma} + \log \phi\!\left(\frac{y_i - \beta' x_i}{\sigma}\right) \right\}$
The Log-likelihood (2/2)

The first term inside the sum is a constant for all observations:

$\log L(\beta, \sigma) = -N \log \sigma + \sum_i \log \phi\!\left(\frac{y_i - \beta' x_i}{\sigma}\right)$

and given that $\phi\!\left(\frac{y_i - \beta' x_i}{\sigma}\right) = \frac{1}{\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}\left(\frac{y_i - \beta' x_i}{\sigma}\right)^2\right]$, we have that

$\log L(\beta, \sigma) = N \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2} \sum_i \left(\frac{y_i - \beta' x_i}{\sigma}\right)^2$

The ML Estimator: FOC

The ML estimator is the value of $(\beta, \sigma^2)$ for which the log-likelihood is maximized. We obtain the maximum of the likelihood by setting the partial derivatives with respect to $(\beta, \sigma^2)$ to zero.

With respect to $\beta$, this implies

$\frac{1}{\hat{\sigma}^2} \sum_i x_i (y_i - \hat{\beta}' x_i) = 0$

which implies

$\hat{\beta} = \left(\sum_i x_i x_i'\right)^{-1} \sum_i x_i y_i$

With respect to $\sigma^2$, this implies

$\hat{\sigma}^2 = \frac{1}{N} \sum_i (y_i - \hat{\beta}' x_i)^2$
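A numerical check of these first-order conditions (a minimal sketch with simulated data; the regression design and true parameters are illustrative assumptions): maximizing the normal log-likelihood numerically reproduces the OLS coefficients and $\hat{\sigma}^2 = \frac{1}{N}\sum_i (y_i - \hat{\beta}' x_i)^2$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=N)

def neg_loglik(theta):
    beta, log_sigma = theta[:2], theta[2]
    sigma = np.exp(log_sigma)          # reparameterize so that sigma > 0
    resid = y - X @ beta
    return N * np.log(sigma) + 0.5 * np.sum((resid / sigma) ** 2)

res = minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(res.x[:2], beta_ols)                 # ML beta equals OLS beta
print(np.exp(res.x[2]) ** 2,               # ML sigma^2 ...
      np.mean((y - X @ beta_ols) ** 2))    # ... equals RSS / N
```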
Some Final Comments

The MLE for $\beta$ is exactly the same estimator as OLS. $\hat{\sigma}^2$ is not the same as the unbiased estimator $s^2 = \frac{1}{N-1} \sum_i (y_i - \hat{\beta}' x_i)^2$; instead, $\hat{\sigma}^2 = \frac{N-1}{N} s^2$. Hence $\hat{\sigma}^2$ is biased, but the bias disappears as $N$ increases. (A small simulation of this claim follows below.)

Consistency

Assumptions:
- Finite-sample identification: $\ell(\theta)$ takes different values for different $\theta$
- Sampling: a law of large numbers is satisfied by $\frac{1}{n} \sum_i \ell_i(\theta)$
- Asymptotic identification: maximizing $\ell(\theta)$ provides a unique way to determine the parameter in the limit as the sample size tends to infinity

Under these conditions, the ML estimator is consistent: $\mathrm{plim}\, \hat{\theta}_{ML} = \theta_0$.
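A small simulation of the bias claim (the design is an illustrative assumption: a regression on a constant only, so that $\hat{\beta} = \bar{y}$): across replications, the average $\hat{\sigma}^2$ falls short of the true $\sigma_0^2$ by the factor $(N-1)/N$, which vanishes as $N$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2_true = 4.0
for N in (10, 100, 1000):
    draws = []
    for _ in range(2000):
        y = rng.normal(scale=np.sqrt(sigma2_true), size=N)
        draws.append(np.mean((y - y.mean()) ** 2))  # sigma_hat^2 = RSS / N
    # average sigma_hat^2 vs its theoretical mean (N-1)/N * sigma2_true
    print(N, np.mean(draws), (N - 1) / N * sigma2_true)
```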
Asymptotic Normality

Assumptions:
- consistency
- $\ell(\theta)$ is differentiable and attains an interior maximum
- a Central Limit Theorem can be applied to the gradient

Under these conditions the ML estimator is asymptotically normal:

$n^{1/2}(\hat{\theta} - \theta_0) \to N(0, \Sigma)$ as $n \to \infty$, where $\Sigma = \mathrm{plim}\left[-\frac{1}{n}\sum_i H_i\right]^{-1}$

Asymptotic Efficiency and Variance Estimation

If $\ell(\theta)$ is differentiable and attains an interior maximum, the MLE must be at least as asymptotically efficient as any other consistent estimator that is asymptotically unbiased.

Consistent estimators of the variance-covariance matrix:
- empirical Hessian: $\widehat{\mathrm{Var}}_H(\hat{\theta}) = \left[-\frac{1}{n}\sum_i H_i(\hat{\theta})\right]^{-1}$
- BHHH: $\widehat{\mathrm{Var}}_{BHHH}(\hat{\theta}) = \left[\frac{1}{n}\sum_i g_i(\hat{\theta}) g_i(\hat{\theta})^T\right]^{-1}$
- the sandwich estimator: valid even if the model is misspecified (the robust option in STATA)
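A sketch of the empirical-Hessian variance estimator for the Bernoulli MLE from earlier (the data are made up; here the per-observation Hessian is available analytically, and the known binomial variance $p(1-p)/n$ serves as a sanity check):

```python
import numpy as np

y = np.array([1, 0, 0, 1, 1, 0, 1, 1, 0, 1])
n, n1 = y.size, y.sum()
p_hat = n1 / n

# per-observation Hessian of l_i(p) = y_i log p + (1 - y_i) log(1 - p):
#   H_i = -y_i / p^2 - (1 - y_i) / (1 - p)^2
H_sum = (-y / p_hat**2 - (1 - y) / (1 - p_hat) ** 2).sum()
var_hat = 1.0 / (-H_sum)  # [- sum_i H_i(p_hat)]^{-1}
print(var_hat, p_hat * (1 - p_hat) / n)  # the two coincide here
```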
Summary

- ML estimates are the values which maximize the likelihood function
- Under the Gauss-Markov assumptions plus normality of the error term, $\hat{\beta}_{ML}$ is exactly the same estimator as $\hat{\beta}_{OLS}$
- Under general assumptions, ML is consistent, asymptotically normal, and asymptotically efficient