Econometrics I: Estimation
Department of Economics, Stanford University
September 2008, Part I
Parameter, Estimator, Estimate
A parameter is a feature of the population. An estimator is a function of the data (the sample): a random variable, a statistic. An estimate is a particular realization of an estimator. Analog principle: replace the population distribution in the definition of the parameter with the empirical distribution. Population moments and sample moments. Consistency follows from the LLN.
Sample $X = \{X_1, X_2, \ldots, X_n\}$. Estimator $\hat\theta = \phi(X)$. Example: $\hat\theta = \frac{1}{n}\sum_{i=1}^n X_i$, or even $\hat\theta = X_1$. Measures of closeness: $X$ is better than $Y$ if
1. $P(|X - \theta| \le |Y - \theta|) = 1$.
2. $P(|X - \theta| > \epsilon) \le P(|Y - \theta| > \epsilon)$ for every $\epsilon > 0$.
3. $P(|X - \theta| < |Y - \theta|) \ge P(|X - \theta| > |Y - \theta|)$.
4. $E(X - \theta)^2 \le E(Y - \theta)^2$.
$X$ and $Y$ might not be rankable using the first two criteria, but they are always rankable using the last two. The last one, mean squared error, is most commonly used.
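A small Monte Carlo comparison of the two example estimators under criteria 2 and 4, assuming i.i.d. $N(\theta, 1)$ data; this is a sketch, not part of the notes, and all constants (true $\theta$, $n$, $\epsilon$, seed) are illustrative. Code examples throughout use Python with numpy:

import numpy as np

rng = np.random.default_rng(0)
theta, n, reps, eps = 2.0, 50, 10_000, 0.3

x = rng.normal(theta, 1.0, size=(reps, n))
xbar = x.mean(axis=1)      # sample mean estimator
x1 = x[:, 0]               # first-observation estimator

# Criterion 2: tail probability P(|estimator - theta| > eps)
print(np.mean(np.abs(xbar - theta) > eps))   # small for the sample mean
print(np.mean(np.abs(x1 - theta) > eps))     # large for X_1

# Criterion 4: mean squared error E(estimator - theta)^2
print(np.mean((xbar - theta) ** 2))          # ~ 1/n
print(np.mean((x1 - theta) ** 2))            # ~ 1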
Decomposing mean squared error:
$$E(\hat\theta - \theta)^2 = V(\hat\theta) + \big(E\hat\theta - \theta\big)^2.$$
Other loss functions $L(x)$, in addition to $L(x) = x^2$, can be used. Let $X$ and $Y$ be two estimators of $\theta$. We say $X$ is more efficient than $Y$ if $EL(X - \theta) \le EL(Y - \theta)$ for all $\theta \in \Theta$ and $EL(X - \theta) < EL(Y - \theta)$ for at least one $\theta$. $\hat\theta$ is inadmissible if there is another estimator that is more efficient in the sense of the above definition. Otherwise it is admissible. Bayes estimators, and limits of Bayes estimators, are admissible.
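A minimal Monte Carlo sketch of the decomposition, assuming i.i.d. $N(\theta, 1)$ data and a deliberately biased estimator $c\bar X$; the values of $\theta$, $n$, and $c$ are illustrative:

import numpy as np

rng = np.random.default_rng(1)
theta, n, reps, c = 2.0, 20, 100_000, 0.8

xbar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
est = c * xbar                        # biased estimator of theta

mse = np.mean((est - theta) ** 2)     # left-hand side
var = np.var(est)                     # variance term
bias_sq = (np.mean(est) - theta) ** 2 # squared-bias term
print(mse, var + bias_sq)             # the two agree up to simulation noise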
Making a choice among estimators is difficult. Possible strategies:
- Bayesian: put prior weights on the parameter space.
- Minimax estimator: $\hat\theta = \arg\min_{\tilde\theta} \max_{\theta\in\Theta} E(\tilde\theta - \theta)^2$.
- Restrict attention to a subclass of estimators: e.g. linear estimators, unbiased estimators, equivariant estimators. $\hat\theta$ is unbiased if $E\hat\theta = \theta$ for all $\theta\in\Theta$.
The sample mean is the best linear unbiased estimator (BLUE) of the population mean:
$$V(\bar X_n) \le V\Big(\sum_{t=1}^n a_t X_t\Big) \quad \text{for all } a_t \text{ satisfying } E\Big(\sum_{t=1}^n a_t X_t\Big) = \mu.$$
But the sample mean can be dominated by a biased linear estimator, an unbiased nonlinear estimator, or a biased nonlinear estimator. Asymptotic properties can also be used to select estimators; in particular, compare asymptotic variances.
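As a small illustration of how bias can reduce mean squared error, the sketch below compares $\bar X$ with the shrunken (biased) linear estimator $c\bar X$ at one particular $\mu$; it shows lower MSE at that point, not uniform dominance over all $\mu$. The model, constants, and seed are my own illustrative choices:

import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 0.5, 2.0, 10, 200_000

xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
c = 0.7                                   # shrink toward zero: adds bias, cuts variance
print(np.mean((xbar - mu) ** 2))          # MSE of the BLUE, ~ sigma^2/n = 0.4
print(np.mean((c * xbar - mu) ** 2))      # smaller MSE at this particular mu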
Asymptotic properties
Consistency: $\hat\theta \xrightarrow{p} \theta$. Asymptotic normality: $\sqrt n\,(\hat\theta - \theta) \xrightarrow{d} N(0, \sigma^2)$. The maximum likelihood estimator is usually consistent and asymptotically normal (CAN). Definition: for $X = \{X_1, \ldots, X_n\}$,
$$\hat\theta_{MLE} = \arg\max_{\theta\in\Theta} \log L(X|\theta),$$
where $L(X|\theta) = P(X|\theta)$ for a discrete sample and $L(X|\theta) = f(X|\theta)$ for a continuous sample.
Under the i.i.d. sampling assumption:
$$\log L(X|\theta) = \sum_{t=1}^n \log p(x_t|\theta) \quad \text{for discrete data.}$$
Example: $n$ tosses of a coin. $X_t = 1$ if the $t$-th toss is a head, and 0 otherwise.
$$L = \prod_{t=1}^n p^{x_t}(1-p)^{1-x_t}, \qquad \log L = \Big(\sum_{t=1}^n x_t\Big)\log p + \Big(n - \sum_{t=1}^n x_t\Big)\log(1-p).$$
Example 2: $X \sim B(n, p)$, observed $X = k$,
$$L = \binom{n}{k} p^k (1-p)^{n-k}, \qquad \log L = \log C_n^k + k\log p + (n-k)\log(1-p).$$
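A quick numerical check of the coin-toss example, assuming simulated Bernoulli data: maximizing the log likelihood over a grid recovers the closed-form MLE $\hat p = \bar x$. The true $p$, sample size, and grid are illustrative:

import numpy as np

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.3, size=200)        # simulated coin tosses, true p = 0.3

def loglik(p):
    return x.sum() * np.log(p) + (len(x) - x.sum()) * np.log(1 - p)

grid = np.linspace(0.001, 0.999, 9999)
p_hat = grid[np.argmax(loglik(grid))]
print(p_hat, x.mean())                    # numeric maximizer vs closed form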
For continuous data and an i.i.d. sample:
$$\log L(X|\theta) = \sum_{t=1}^n \log f(x_t|\theta).$$
Example: $\{X_t\}$, $t = 1, \ldots, n$, i.i.d., $X_t \sim N(\mu, \sigma^2)$.
$$L = \prod_{t=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{1}{2\sigma^2}(x_t - \mu)^2\Big),$$
$$\log L = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=1}^n (x_t - \mu)^2.$$
Score equations:
$$\frac{\partial\log L}{\partial\mu} = \frac{1}{\sigma^2}\sum_{t=1}^n (x_t - \mu) = 0, \qquad \frac{\partial\log L}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{t=1}^n (x_t - \mu)^2 = 0.$$
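One way to verify that $\hat\mu = \bar x$ and $\hat\sigma^2 = \frac{1}{n}\sum_{t}(x_t - \bar x)^2$ solve the score equations is to plug them in numerically; a minimal sketch with simulated normal data (true parameters illustrative):

import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(1.5, 2.0, size=500)

mu_hat = x.mean()                       # solves the first score equation
sig2_hat = np.mean((x - mu_hat) ** 2)   # solves the second (note: 1/n, not 1/(n-1))

score_mu = np.sum(x - mu_hat) / sig2_hat
score_sig2 = -len(x) / (2 * sig2_hat) + np.sum((x - mu_hat) ** 2) / (2 * sig2_hat ** 2)
print(score_mu, score_sig2)             # both zero up to floating point error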
Computation can be difficult. Newton-Raphson applies if the log likelihood is smooth (several times differentiable):
$$Q(\theta) \approx Q(\hat\theta_1) + Q'(\hat\theta_1)(\theta - \hat\theta_1) + \frac{1}{2}Q''(\hat\theta_1)(\theta - \hat\theta_1)^2,$$
$$\hat\theta_2 = \hat\theta_1 - \Big(Q''(\hat\theta_1)\Big)^{-1} Q'(\hat\theta_1).$$
Other gradient methods: BHHH, etc. Non-gradient methods: Nelder-Mead (Matlab), simulated annealing. Bayesian methods: Markov chain Monte Carlo.
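A sketch of the Newton-Raphson update $\hat\theta_2 = \hat\theta_1 - Q''(\hat\theta_1)^{-1}Q'(\hat\theta_1)$ on a likelihood with no closed-form maximizer. The Cauchy location model is my illustrative choice (it is not in the notes), and convergence is only local: the median starting value matters.

import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_cauchy(300) + 1.0     # Cauchy location model, true theta = 1

def Qp(t):                             # Q'(theta): score of the Cauchy log likelihood
    u = x - t
    return np.sum(2 * u / (1 + u ** 2))

def Qpp(t):                            # Q''(theta): second derivative (scalar Hessian)
    u = x - t
    return np.sum(2 * (u ** 2 - 1) / (1 + u ** 2) ** 2)

theta = np.median(x)                   # sensible starting value for this model
for _ in range(20):                    # Newton-Raphson iterations
    step = Qp(theta) / Qpp(theta)
    theta = theta - step
    if abs(step) < 1e-10:
        break
print(theta)                           # close to the true location, 1.0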
Information matrix equality: for $L(\theta) \equiv L(X|\theta)$,
$$-E\,\frac{\partial^2\log L(\theta_0)}{\partial\theta\,\partial\theta'} = E\left[\frac{\partial\log L(\theta_0)}{\partial\theta}\,\frac{\partial\log L(\theta_0)}{\partial\theta'}\right].$$
Note that the expectation is taken under $\theta_0$. Two-step proof. First, the mean of the score function is zero at the truth, as long as the support does not depend on $\theta$:
$$E\,\frac{\partial\log L(\theta)}{\partial\theta} = E\left[\frac{1}{L(\theta)}\frac{\partial L(\theta)}{\partial\theta}\right] = \int \frac{\partial L(\theta)}{\partial\theta}\,dy = \frac{\partial}{\partial\theta}\int L(\theta)\,dy = 0.$$
Second, differentiate this identity again:
$$0 = \frac{\partial}{\partial\theta'}\int \frac{\partial\log L(\theta)}{\partial\theta}\,L(\theta)\,dy = \int \frac{\partial^2\log L(\theta)}{\partial\theta\,\partial\theta'}\,L(\theta)\,dy + \int \frac{\partial\log L(\theta)}{\partial\theta}\,\frac{\partial\log L(\theta)}{\partial\theta'}\,L(\theta)\,dy$$
$$= E\,\frac{\partial^2\log L(\theta)}{\partial\theta\,\partial\theta'} + E\left[\frac{\partial\log L(\theta)}{\partial\theta}\,\frac{\partial\log L(\theta)}{\partial\theta'}\right].$$
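Both sides are expectations under $\theta_0$, so the equality can be checked by simulation. A minimal sketch for a single Bernoulli($p_0$) observation, where the score is $x/p - (1-x)/(1-p)$; the value of $p_0$ is illustrative:

import numpy as np

rng = np.random.default_rng(6)
p0 = 0.3
x = rng.binomial(1, p0, size=1_000_000)        # draws from the true model

score = x / p0 - (1 - x) / (1 - p0)            # d log f / dp at p0
hess = -x / p0 ** 2 - (1 - x) / (1 - p0) ** 2  # d^2 log f / dp^2 at p0

print(np.mean(score))        # ~ 0: the score has mean zero at the truth
print(np.mean(score ** 2))   # ~ 1 / (p0 (1 - p0))
print(-np.mean(hess))        # ~ the same number: information matrix equality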
Cramér-Rao Lower Bound
For any unbiased estimator, $E\hat\theta(X) = \theta$ for all $\theta$, and
$$V(\hat\theta) \ge \left(-E\,\frac{\partial^2\log L(\theta)}{\partial\theta\,\partial\theta'}\right)^{-1} = \left(V\Big(\frac{\partial\log L(x)}{\partial\theta}\Big)\right)^{-1}.$$
Proof by Cauchy-Schwarz. Since the score has mean zero,
$$\mathrm{Cov}\Big(\hat\theta(x), \frac{\partial\log L(x)}{\partial\theta}\Big) = E_\theta\left[\hat\theta(x)\frac{\partial\log L(x)}{\partial\theta}\right] = \int \hat\theta(x)\frac{1}{L(x)}\frac{\partial L(x)}{\partial\theta}L(x)\,dx = \frac{\partial}{\partial\theta}\int \hat\theta(x)L(x)\,dx = \frac{\partial}{\partial\theta}E\hat\theta = \frac{\partial\theta}{\partial\theta} = I.$$
Therefore
$$V(\hat\theta) \ge \frac{\mathrm{Cov}\Big(\hat\theta(x), \frac{\partial\log L(x)}{\partial\theta}\Big)^2}{V\Big(\frac{\partial\log L(x)}{\partial\theta}\Big)} = \left(V\Big(\frac{\partial\log L(x)}{\partial\theta}\Big)\right)^{-1}.$$
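For the Bernoulli model the sample mean is unbiased and attains the bound, which a small simulation can confirm; $p_0$, $n$, and the number of replications are illustrative:

import numpy as np

rng = np.random.default_rng(7)
p0, n, reps = 0.3, 50, 200_000

xbar = rng.binomial(1, p0, size=(reps, n)).mean(axis=1)
fisher_info = n / (p0 * (1 - p0))     # information in the whole sample

print(np.var(xbar))                   # simulated variance of the unbiased estimator
print(1 / fisher_info)                # Cramer-Rao bound; attained here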
Asymptotic properties
M (maximization or minimization) estimator theory: $\hat\theta = \arg\max_{\theta\in\Theta} Q_n(\theta)$. $Q_n(\theta)$ is random, for example, $Q_n(\theta) = \frac{1}{n}\log L_n(\theta) = \frac{1}{n}\sum_{t=1}^n \log f(X_t;\theta)$. Typically $Q_n(\theta) \xrightarrow{p} Q(\theta)$ for a deterministic $Q(\theta)$, in some uniform sense over $\theta\in\Theta$. Usually, $Q(\theta) = EQ_n(\theta)$ or $Q(\theta) = \lim_{n\to\infty} EQ_n(\theta)$. If $Q(\theta)$ is uniquely maximized at $\theta_0$, then we should expect $\hat\theta \xrightarrow{p} \theta_0$.
Consistency of the maximum likelihood estimator
(Uniform) law of large numbers:
$$Q_n(\theta) = \frac{1}{n}\sum_{t=1}^n \log f(X_t;\theta) \xrightarrow{p} Q(\theta) = E\log f(X;\theta).$$
$E\log f(X;\theta)$ is uniquely maximized at $\theta_0$ by Jensen's inequality:
$$Q(\theta) - Q(\theta_0) = E\log f(X;\theta) - E\log f(X;\theta_0) = E\log\frac{f(X;\theta)}{f(X;\theta_0)} < \log E\,\frac{f(X;\theta)}{f(X;\theta_0)} = \log\int_{x:\,f(x;\theta_0)>0} f(x;\theta)\,dx \le 0.$$
$E\log f_0(X) - E\log f(X;\theta)$ is also called the Kullback-Leibler information criterion (KLIC) between $f(X;\theta)$ and $f_0(X)$. Misspecification: it is possible that there is no $\theta_0$ such that $f_0(X) = f(X;\theta_0)$.
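A Monte Carlo sketch of the KLIC inequality, assuming $f_0 = N(0,1)$ and the candidate family $N(\mu, 1)$: the criterion is zero at the truth and positive elsewhere (for these normals the KLIC is $\mu^2/2$, a known closed form). All constants are illustrative:

import numpy as np

rng = np.random.default_rng(11)
x = rng.normal(0.0, 1.0, size=1_000_000)       # draws from f0 = N(0, 1)

def logf(x, mu):                               # log density of N(mu, 1)
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2

for mu in [0.0, 0.5, 1.0]:
    klic = np.mean(logf(x, 0.0) - logf(x, mu)) # E log f0 - E log f(.; mu)
    print(mu, klic)                            # ~ 0, 0.125, 0.5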
Asymptotic normality of the maximum likelihood estimator
Holds when $\theta_0 \in \mathrm{int}(\Theta)$, an interior point, and when the support of $X$ does not depend on $\theta$. Then $E\,\frac{\partial\log L(X|\theta_0)}{\partial\theta} = 0$. With probability converging to 1, a mean value expansion of the first order condition gives, for some $\bar\theta$ between $\hat\theta$ and $\theta_0$:
$$0 = \frac{\partial\log L}{\partial\theta}\bigg|_{\hat\theta} = \frac{\partial\log L}{\partial\theta}\bigg|_{\theta_0} + \frac{\partial^2\log L}{\partial\theta\,\partial\theta'}\bigg|_{\bar\theta}\,(\hat\theta - \theta_0),$$
$$\sqrt n\,(\hat\theta - \theta_0) = \left(-\frac{1}{n}\frac{\partial^2\log L}{\partial\theta\,\partial\theta'}\bigg|_{\bar\theta}\right)^{-1} \frac{1}{\sqrt n}\frac{\partial\log L}{\partial\theta}\bigg|_{\theta_0}.$$
Central limit theorem:
$$\frac{1}{\sqrt n}\frac{\partial\log L}{\partial\theta}\bigg|_{\theta_0} \xrightarrow{d} N(0, \Omega), \qquad \Omega = V\Big(\frac{\partial\log f(x;\theta_0)}{\partial\theta}\Big).$$
(Locally uniform) law of large numbers:
$$-\frac{1}{n}\frac{\partial^2\log L}{\partial\theta\,\partial\theta'} \xrightarrow{p} H = -E\,\frac{\partial^2\log f(x;\theta_0)}{\partial\theta\,\partial\theta'}.$$
By Slutsky,
$$\sqrt n\,(\hat\theta - \theta_0) \xrightarrow{d} N\big(0, H^{-1}\Omega H^{-1}\big).$$
By the information matrix equality, $\Omega = H$, so $H^{-1}\Omega H^{-1} = H^{-1} = \Omega^{-1}$.
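A simulation sketch of this limit for the Bernoulli MLE $\hat p = \bar x$, where $H = \Omega = 1/(p_0(1-p_0))$, so the limiting variance is $H^{-1} = p_0(1-p_0)$; all constants are illustrative:

import numpy as np

rng = np.random.default_rng(8)
p0, n, reps = 0.3, 400, 50_000

p_hat = rng.binomial(1, p0, size=(reps, n)).mean(axis=1)   # MLE is the sample mean
z = np.sqrt(n) * (p_hat - p0)

print(np.var(z))             # ~ p0 (1 - p0): the inverse information H^{-1}
print(p0 * (1 - p0))
# A crude normality check without plotting: empirical quantiles of z
q = np.quantile(z, [0.025, 0.5, 0.975])
print(q)                     # ~ (-1.96, 0, 1.96) * sqrt(p0 (1 - p0)) = (-0.90, 0, 0.90)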
Consistent estimation of the asymptotic variance. Average outer product of the score function:
$$\hat\Omega = \frac{1}{n}\sum_{t=1}^n \frac{\partial\log f(x_t;\hat\theta)}{\partial\theta}\,\frac{\partial\log f(x_t;\hat\theta)}{\partial\theta'}.$$
Average Hessian:
$$\hat H = -\frac{1}{n}\sum_{t=1}^n \frac{\partial^2\log f(x_t;\hat\theta)}{\partial\theta\,\partial\theta'}.$$
Sandwich formula: $\hat H^{-1}\hat\Omega\hat H^{-1}$. The sandwich formula remains correct even under misspecification (pseudo-likelihood).
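A minimal sketch of the sandwich computation for the (correctly specified) Bernoulli model, where $\hat H^{-1}\hat\Omega\hat H^{-1}/n$ should be close to the classical $\hat p(1-\hat p)/n$; data and seed are illustrative:

import numpy as np

rng = np.random.default_rng(9)
x = rng.binomial(1, 0.3, size=1_000)
p = x.mean()                                   # MLE

score = x / p - (1 - x) / (1 - p)              # per-observation score at p_hat
hess = -x / p ** 2 - (1 - x) / (1 - p) ** 2    # per-observation second derivative

Omega = np.mean(score ** 2)                    # average outer product (scalar case)
H = -np.mean(hess)                             # average Hessian with a sign flip

print(H ** -1 * Omega * H ** -1 / len(x))      # sandwich variance of p_hat
print(p * (1 - p) / len(x))                    # classical formula; close here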
The normal example continued: $X_t \sim N(\mu, \sigma^2)$.
$$\hat\mu = \frac{1}{n}\sum_{t=1}^n x_t = \bar x, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{t=1}^n (x_t - \bar x)^2.$$
$$\log f(x_t;\theta = (\mu, \sigma^2)) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(x_t - \mu)^2.$$
$$\frac{\partial\log f}{\partial\theta} = \begin{pmatrix} \frac{1}{\sigma^2}(x_t - \mu) \\ -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}(x_t - \mu)^2 \end{pmatrix}.$$
Note that
$$V\Big(\frac{\partial\log f}{\partial\theta}\Big) = \begin{pmatrix} \frac{1}{\sigma^2} & \frac{1}{2\sigma^4}E(x_t-\mu)^3 \\ \frac{1}{2\sigma^4}E(x_t-\mu)^3 & \frac{1}{4\sigma^8}V\big((x_t-\mu)^2\big) \end{pmatrix} = -E\,\frac{\partial^2\log f}{\partial\theta\,\partial\theta'} = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix},$$
using $E(x_t - \mu)^3 = 0$ and $V\big((x_t - \mu)^2\big) = E(x_t - \mu)^4 - \big(E(x_t - \mu)^2\big)^2 = 3\sigma^4 - \sigma^4 = 2\sigma^4$ under normality.
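The analytic matrix above can be checked against the simulated covariance of the per-observation score evaluated at the truth; $\mu$, $\sigma^2$, and the sample size are illustrative:

import numpy as np

rng = np.random.default_rng(10)
mu, sig2 = 1.0, 4.0
x = rng.normal(mu, np.sqrt(sig2), size=1_000_000)

# Per-observation score for theta = (mu, sigma^2), evaluated at the truth
s = np.column_stack([
    (x - mu) / sig2,
    -1 / (2 * sig2) + (x - mu) ** 2 / (2 * sig2 ** 2),
])

print(np.cov(s.T))                                # ~ diag(1/sig2, 1/(2 sig2^2))
print(np.diag([1 / sig2, 1 / (2 * sig2 ** 2)]))   # the analytic information matrix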