Erik Lindström
Centre for Mathematical Sciences, Lund University, LU/LTH & DTU
Overview
- Introduction: general properties, estimators
- Least squares
- Maximum likelihood: MLE asymptotics, two theorems, Fisher information, proofs
- Other estimators
General
We are interested in estimating a parameter vector $\theta_0$ from data $X$.
- Ad hoc or formal estimation?
- Properties
- Definition: $\hat\theta = T(X)$.
- Interpretation
Properties
- Bias: $b = \theta_0 - E[T(X)]$.
- Asymptotic bias: $\lim_{N\to\infty} \theta_0 - E[T(X)]$.
- Consistency: $\hat\theta \xrightarrow{p} \theta_0$, i.e. $P(|\hat\theta - \theta_0| > \varepsilon) \to 0$ as $N \to \infty$.
- Strong consistency: $\hat\theta \xrightarrow{a.s.} \theta_0$.
- Efficiency: $\mathrm{Var}[T(X)] \geq I^{-1}(\theta_0)$, where $I(\theta_0)$ is the Fisher information matrix.
- Asymptotic normality: convergence $\alpha_N(\hat\theta - \theta_0) \xrightarrow{d} N(\mu, \Sigma)$.
Popular estimators
How are the estimates computed?
- Optimization: least squares (LS), weighted least squares (WLS), prediction error methods (PEM), generalized method of moments (GMM), etc.
- Solving equations: estimation functions (EFs), method of moments (MM).
- Either: Bayesian methods (MCMC), maximum likelihood, quasi maximum likelihood.
Least squares
- Observations $y_1, \ldots, y_N$.
- Predictors $\sum_{j=1}^J B_j(x)\theta_j$.
- Vector form: $Y = Z\theta + \varepsilon$, $\varepsilon \sim (0, \Omega)$.
(Weighted) parameter estimation:
$$\hat\theta_{LS} = \arg\min_{\theta\in\Theta} \sum_{n=1}^N \lambda_n \Big(y_n - \sum_j B_j(x_n)\theta_j\Big)^2 \quad (1)$$
$$= \arg\min_{\theta\in\Theta} (Y - Z\theta)^T W (Y - Z\theta) \quad (2)$$
Generalization (penalized least squares):
$$\hat\theta_{PLS} = \arg\min_{\theta\in\Theta} (Y - Z\theta)^T W (Y - Z\theta) + (\theta - \theta_0)^T D (\theta - \theta_0) \quad (3)$$
Estimate
$$\hat\theta = (Z^T W Z)^{-1} Z^T W Y \quad (4)$$
Bias? No, not if the correct model is used:
$$E[\hat\theta] = (Z^T W Z)^{-1} Z^T W (Z\theta + E[\varepsilon]) \quad (5)$$
$$= \theta + (Z^T W Z)^{-1} Z^T W\, E[\varepsilon] = \theta \quad (6)$$
Variance:
$$\mathrm{Var}[\hat\theta] = (Z^T W Z)^{-1} (Z^T W \Omega W Z)(Z^T W Z)^{-1} \quad (7)$$
This simplifies if $W = \Omega^{-1}$ and $\Omega = \sigma^2 I$. We then get
$$\mathrm{Var}[\hat\theta] = \sigma^2 (Z^T Z)^{-1}. \quad (8)$$
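A minimal numerical sketch of (4) and (7), assuming simulated data (the matrix names mirror the slides; the specific numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, J = 200, 3
Z = rng.normal(size=(N, J))                      # design matrix
theta_true = np.array([1.0, -2.0, 0.5])
Omega = 0.25 * np.eye(N)                         # noise covariance
Y = Z @ theta_true + rng.multivariate_normal(np.zeros(N), Omega)

W = np.linalg.inv(Omega)                         # optimal weights W = Omega^{-1}
ZtWZ = Z.T @ W @ Z
theta_hat = np.linalg.solve(ZtWZ, Z.T @ W @ Y)   # eq. (4)

# Sandwich variance, eq. (7); collapses to sigma^2 (Z^T Z)^{-1} here, eq. (8)
var_hat = np.linalg.inv(ZtWZ) @ (Z.T @ W @ Omega @ W @ Z) @ np.linalg.inv(ZtWZ)
print(theta_hat, np.sqrt(np.diag(var_hat)))
```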
F-tests
We estimate $\sigma^2$ by
$$\hat\sigma^2 = \frac{(Y - Z\hat\theta)^T (Y - Z\hat\theta)}{N - J}.$$
Define $Q(\hat\theta) = (Y - Z\hat\theta)^T (Y - Z\hat\theta) = Y^T (I - P)^T (I - P) Y$, where $P$ is the projection matrix.
We approximate $Q$ by $E[Q] = \mathrm{tr}\big((I - P)^T (I - P)\,\mathrm{Cov}[Y, Y]\big)$.
For the standard model it holds that $\mathrm{Cov}[Y, Y] = \sigma^2 I$ and $(I - P)^T (I - P) = (I - P)$. It then follows that
$$\mathrm{tr}\big((I - P)^T (I - P)\,\mathrm{Cov}[Y, Y]\big) = \sigma^2\,\mathrm{tr}(I - P) \quad (9)$$
$$= \sigma^2\,\mathrm{tr}(I) - \sigma^2\,\mathrm{tr}(P) \quad (10)$$
$$= \sigma^2 N - \sigma^2\,\mathrm{tr}\big(Z(Z^T Z)^{-1} Z^T\big) \quad (11)$$
$$= \sigma^2 N - \sigma^2\,\mathrm{tr}\big((Z^T Z)^{-1} (Z^T Z)\big) = \sigma^2 (N - J) \quad (12)$$
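The step $\mathrm{tr}(I - P) = N - J$ in (9)-(12) is easy to verify numerically; a small sketch with a simulated design matrix (illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
N, J = 200, 3
Z = rng.normal(size=(N, J))
P = Z @ np.linalg.solve(Z.T @ Z, Z.T)            # projection Z (Z^T Z)^{-1} Z^T
I = np.eye(N)

print(np.trace(I - P))                           # ~ 197.0 = N - J
print(np.allclose((I - P) @ (I - P), I - P))     # (I - P) is idempotent
```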
What about the penalized asymptotics?
What if the estimate is given by
$$\hat\theta_{PLS} = \arg\min_{\theta\in\Theta} (Y - Z\theta)^T W (Y - Z\theta) + (\theta - \theta_0)^T D (\theta - \theta_0)? \quad (13)$$
Derivation on the blackboard. Another popular form is the adaptive LASSO:
$$\hat\theta_{\text{Adaptive LASSO}} = \arg\min_{\theta\in\Theta} (Y - Z\theta)^T W (Y - Z\theta) + \|D(\theta - \theta_0)\|_1 \quad (14)$$
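For the quadratic penalty in (13) the minimizer is still available in closed form, $\hat\theta_{PLS} = (Z^T W Z + D)^{-1}(Z^T W Y + D\theta_0)$; a minimal sketch of that formula (the derivation itself is left for the blackboard):

```python
import numpy as np

def penalized_ls(Z, Y, W, D, theta0):
    """Minimizer of (Y - Z t)^T W (Y - Z t) + (t - theta0)^T D (t - theta0)."""
    A = Z.T @ W @ Z + D
    b = Z.T @ W @ Y + D @ theta0
    return np.linalg.solve(A, b)

# D = 0 recovers the WLS estimate (4); a larger D shrinks towards theta0.
```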
Maximum likelihood estimators
Defined as
$$\hat\theta_{MLE} = \arg\max_{\theta\in\Theta} p_\theta(X_1, \ldots, X_N).$$
The MLE uses all information in the data and has nice properties:
- Ia. Consistency: $\hat\theta \xrightarrow{p} \theta_0$.
- Ib. Asymptotic normality: under general conditions, $\sqrt{N}(\hat\theta - \theta_0) \xrightarrow{d} N(0, I^{-1}(\theta_0))$.
- II. Efficiency: $\mathrm{Var}[T(X)] = I^{-1}(\theta_0)$ is attained (asymptotically).
- III. Invariance.
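As a concrete illustration of the definition (my own example, not from the slides), the numerical maximizer of an i.i.d. exponential log-likelihood coincides with the closed-form MLE $1/\bar x$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.exponential(scale=1 / 3.0, size=1000)    # true rate lambda_0 = 3

def neg_loglik(lam):
    # i.i.d. exponential: log p(x) = log(lam) - lam * x
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_loglik, bounds=(1e-6, 100.0), method="bounded")
print(res.x, 1 / x.mean())                       # numerical MLE vs closed form
```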
Two useful theorems
Assume $X_i$ i.i.d. with $E[X_i] = \mu$, $\mathrm{Var}[X_i] = \sigma^2$.
Law of large numbers:
$$\frac{1}{N}\sum_{i=1}^N X_i \xrightarrow{p/a.s.} \mu.$$
Central limit theorem:
$$\frac{1}{\sqrt{N}}\sum_{i=1}^N \frac{X_i - \mu}{\sigma} \xrightarrow{d} Z, \quad \text{where } Z \sim N(0, 1).$$
These theorems can be generalized further.
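A quick Monte Carlo illustration of both theorems (the simulation settings are my own, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, N, reps = 2.0, 1.5, 1000, 2000
X = rng.normal(mu, sigma, size=(reps, N))

# LLN: sample means concentrate around mu
print(X.mean(axis=1).mean())                     # ~ mu

# CLT: sqrt(N) * (mean - mu) / sigma is approximately N(0, 1)
Zn = np.sqrt(N) * (X.mean(axis=1) - mu) / sigma
print(Zn.mean(), Zn.std())                       # ~ 0 and ~ 1
```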
Fisher information
The Fisher information matrix is defined as
$$I(\theta) = \mathrm{Var}\Big[\frac{\partial}{\partial\theta}\log p_\theta(X)\Big],$$
which is equivalent to
$$E\Big[\Big(\frac{\partial}{\partial\theta}\log p_\theta(X)\Big)^2\Big] \quad \text{and} \quad -E\Big[\frac{\partial^2}{\partial\theta^2}\log p_\theta(X)\Big].$$
Note that $E\big[\frac{\partial}{\partial\theta}\log p_\theta(X)\big] = 0$.
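The equivalence of the three expressions is easy to check by simulation; a sketch for the Bernoulli($p$) model, where $I(p) = 1/(p(1-p))$ (the model choice is my own illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 0.3
x = rng.binomial(1, p, size=1_000_000).astype(float)

score = x / p - (1 - x) / (1 - p)                # d/dp of log p^x (1-p)^(1-x)
d2 = -x / p**2 - (1 - x) / (1 - p)**2            # second derivative

print(score.mean())                              # ~ 0 (the score has mean zero)
print(score.var(), (score**2).mean(), -d2.mean())  # all ~ 1 / (p (1 - p))
print(1 / (p * (1 - p)))                         # analytic Fisher information
```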
Proofs
- Ia. Kullback-Leibler divergence and the law of large numbers.
- Ib. Second order Taylor expansion of the likelihood.
- II. Cauchy-Schwarz inequality.
- III. Direct calculations.
Consistency
The log-likelihood function is defined as
$$l(\theta) = \sum_{n=1}^N \log p_\theta(x_n \mid x_{1:n-1}) \quad (15)$$
The estimate is given by
$$\hat\theta_{MLE} = \arg\max_{\theta\in\Theta} l(\theta) = \arg\max_{\theta\in\Theta} \frac{1}{N} l(\theta) \quad (16)$$
$$= \arg\max_{\theta\in\Theta} \frac{1}{N}\sum_{n=1}^N \log p_\theta(x_n \mid x_{1:n-1}) \quad (17)$$
Now we rewrite this, adding and subtracting the expected normalized log-likelihood $E[l(\theta)]$:
$$\frac{1}{N} l(\theta) = \frac{1}{N} l(\theta) \pm E[l(\theta)] \quad (18)$$
$$= E[l(\theta)] + \Big(\frac{1}{N} l(\theta) - E[l(\theta)]\Big) \quad (19)$$
Consistency, cont.
This decomposition shows that
$$\frac{1}{N} l(\theta) = \underbrace{E[l(\theta)]}_{\text{expected log-likelihood}} + \underbrace{\Big(\frac{1}{N} l(\theta) - E[l(\theta)]\Big)}_{\text{random error}} \quad (20)$$
It then follows that
$$\hat\theta_{MLE} = \arg\max_{\theta\in\Theta}\Big\{E[l(\theta)] + \Big(\frac{1}{N} l(\theta) - E[l(\theta)]\Big)\Big\} \approx \arg\max_{\theta\in\Theta} E[l(\theta)] \quad (21)$$
First we look at the random error, which is written so that we can apply the law of large numbers. If
$$\lim_{N\to\infty}\Big|\frac{1}{N} l(\theta) - E[l(\theta)]\Big| = 0 \quad (22)$$
uniformly over $\Theta$, then it follows that
$$\lim_{N\to\infty}\hat\theta_{MLE} = \arg\max_{\theta\in\Theta} E[l(\theta)] \quad (23)$$
Finally, we denote any feasible density by $q_\theta \in Q_\theta$ and assume that the true density $p_{\theta_0} \in Q_\theta$. The difference in expected log-likelihood between any other density and the true density is given by
$$E[\log q_\theta(X)] - E[\log p_{\theta_0}(X)] = \int \big(\log q_\theta(x) - \log p_{\theta_0}(x)\big)\, p_{\theta_0}(x)\,dx \quad (24)$$
$$= \int \log\Big(\frac{q_\theta(x)}{p_{\theta_0}(x)}\Big)\, p_{\theta_0}(x)\,dx \quad (25)$$
Log is a concave function. We know from Jensen's inequality that $E[\log(\xi)] \leq \log(E[\xi])$.
That means that
$$\int \log\Big(\frac{q_\theta(x)}{p_{\theta_0}(x)}\Big)\, p_{\theta_0}(x)\,dx \leq \log \int \frac{q_\theta(x)}{p_{\theta_0}(x)}\, p_{\theta_0}(x)\,dx = \log \int q_\theta(x)\,dx = 0 \quad (26)$$
with equality only when $q_\theta(x) = p_{\theta_0}(x)$ for all values of $x$. The conclusion is that
$$\theta_0 = \arg\max_{\theta\in\Theta} E[l(\theta)] \quad (27)$$
and hence that
$$\lim_{N\to\infty}\hat\theta_{MLE} = \theta_0 \quad (28)$$
as the random error vanishes as $N \to \infty$.
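The Jensen step $E[\log(q/p)] \leq 0$ can be checked by simulation; a sketch with $p = N(0,1)$ and shifted normal alternatives $q$ (an illustrative choice, not from the slides):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=1_000_000)         # draws from the true p = N(0,1)

for m in [0.0, 0.5, 2.0]:
    gap = norm.logpdf(x, loc=m) - norm.logpdf(x, loc=0.0)
    print(m, gap.mean())                         # <= 0, zero only for q = p (m = 0)
```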
Proof Ib
- Rewrite $L(X) = \exp\big(\log p_\theta(X_1) + \sum_{n=2}^N \log p_\theta(X_n \mid X_{1:n-1})\big)$.
- Second order Taylor expansion around $\theta_0$.
- Maximize and solve for $(\hat\theta - \theta_0)$...
- Consistency and asymptotic normality follow.
Proof I, ext.
Write $l(\theta) = \log L(X \mid \theta)$. This is a sum consisting of $N$ terms (think LLN and CLT!). Approximate $l(\theta)$ using a second order Taylor expansion around $\theta_0$. Thus
$$l(\theta) = l(\theta_0) + \partial_\theta l(\theta_0)(\theta - \theta_0) + \frac{1}{2}\partial_\theta^2 l(\theta_0)(\theta - \theta_0)^2 + R.$$
Ignore the remainder, multiply $l$ by $1/N$ and maximize w.r.t. $\theta$ to obtain $\hat\theta$. We obtain
$$\frac{1}{N}\partial_\theta l(\theta_0) + \frac{1}{N}\partial_\theta^2 l(\theta_0)\,(\hat\theta - \theta_0) := 0.$$
Proof I, ext.
We have from the LLN that
$$-\frac{1}{N}\partial_\theta^2 l(\theta_0) \to I(\theta_0),$$
and from the CLT that
$$\frac{1}{\sqrt{N}}\partial_\theta l(\theta_0) \xrightarrow{d} Z, \quad \text{where } Z \sim N(0, I(\theta_0)).$$
Rearranging gives
$$\sqrt{N}(\hat\theta - \theta_0) \xrightarrow{d} N(0, I^{-1}(\theta_0)).$$
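A Monte Carlo check of this limit for the exponential model, where $I(\lambda) = 1/\lambda^2$, so the asymptotic standard deviation of $\sqrt{N}(\hat\lambda - \lambda_0)$ is $\lambda_0$ (the model choice is my own illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
lam0, N, reps = 3.0, 2000, 5000
x = rng.exponential(scale=1 / lam0, size=(reps, N))
lam_hat = 1 / x.mean(axis=1)                     # MLE in each replication

z = np.sqrt(N) * (lam_hat - lam0)
print(z.std(), lam0)                             # empirical sd vs sqrt(I^{-1}) = lam0
```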
Proof II
Denote the score function $U = \partial_\theta \log p_\theta(X)$ and another (unbiased) estimator by $T$. Assume that $E[T] = \theta$ (to rule out anomalies). Calculate $\mathrm{Cov}(U, T)$:
$$\mathrm{Cov}(U, T) = E[UT] - E[U]E[T] = E[UT] - 0 \cdot \theta$$
$$E[UT] = \int T(x)\,\partial_\theta \log p_\theta(x)\, p_\theta(x)\,dx = \partial_\theta \int T(x)\, p_\theta(x)\,dx = \partial_\theta\, \theta = 1.$$
Cauchy-Schwarz states $\mathrm{Cov}(U, T)^2 \leq \mathrm{Var}[U]\,\mathrm{Var}[T]$. Thus
$$\mathrm{Var}[T] \geq \mathrm{Var}[\partial_\theta \log p_\theta(X)]^{-1} = I^{-1}(\theta),$$
i.e. $\mathrm{Var}[T] \geq \mathrm{Var}[\hat\theta_{MLE}]$ (asymptotically)!
Proof III
Original problem:
$$\hat\theta_{MLE} = \arg\max_{\theta\in\Theta} p_\theta(X_1, \ldots, X_N).$$
Define $Y = g(X)$. Calculations... OK.
Other estimators
We can derive the asymptotic distribution using the same arguments for any M-estimator, i.e. any estimator taking the estimate as the value that maximizes/minimizes a function of the data.
Similar arguments can also be used for Z-estimators, i.e. estimators taking the estimate as the value that solves a system of equations depending on the data.
M-estimators
Take any loss function $J(\theta)$ and define
$$\hat\theta = \arg\max_{\theta\in\Theta} J(\theta) \quad (29)$$
(or $\arg\min$; e.g. PEM: $J(\theta) = \sum_{i=1}^N \varepsilon_i(\theta)^2$). The corresponding first order condition is
$$\frac{1}{N}\partial_\theta J(\theta_0) + \frac{1}{N}\partial_\theta^2 J(\theta_0)\,(\hat\theta - \theta_0) := 0.$$
Assume that $E[\frac{1}{N}\partial_\theta J(\theta_0)] = 0$ (why?), $\mathrm{Var}[\frac{1}{\sqrt{N}}\partial_\theta J(\theta_0)] = W$ and $E[\frac{1}{N}\partial_\theta^2 J(\theta_0)] = V$. Applying the limit theorems as in the maximum likelihood setup gives the asymptotics. Result:
$$\sqrt{N}(\hat\theta - \theta_0) \xrightarrow{d} N(0, V^{-1} W (V^{-1})^T).$$
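A minimal sketch of the sandwich result for a scalar least-squares M-estimator with heteroskedastic noise, where $V^{-1} W (V^{-1})^T$ differs from the classical variance (the setup and per-observation score form are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
N, theta0 = 5000, 1.5
x = rng.uniform(0.5, 2.0, size=N)
y = theta0 * x + rng.normal(0.0, x)              # noise sd grows with x

theta_hat = (x * y).sum() / (x * x).sum()        # least-squares M-estimate

psi = x * (y - theta_hat * x)                    # per-observation score
V = (x * x).mean()                               # -E[d psi / d theta]
W = (psi * psi).mean()                           # Var of the score
sandwich_se = np.sqrt(W / V**2 / N)              # from V^{-1} W V^{-T} / N
naive_se = np.sqrt((y - theta_hat * x).var() / (x * x).sum())
print(theta_hat, sandwich_se, naive_se)          # the two standard errors differ
```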
Z-estimators
Take any function $G(\theta)$ and define $\hat\theta$ as the solution to
$$G(\hat\theta) = 0. \quad (30)$$
E.g. estimation functions: $G(\theta) = \sum_i g_i(\theta)(X_i - E[X_i])$.
Compute a first order Taylor expansion around $\theta_0$ and solve w.r.t. $\theta$. Thus
$$G(\theta_0) + \partial_\theta G(\theta_0)(\hat\theta - \theta_0) := 0.$$
Multiply by $1/N$ and apply the limit theorems. Let $E[\frac{1}{N} G(\theta_0)] = 0$ (why?), $\mathrm{Var}[\frac{1}{\sqrt{N}} G(\theta_0)] = W$ and $E[\frac{1}{N}\partial_\theta G(\theta_0)] = V$. Then
$$\sqrt{N}(\hat\theta - \theta_0) \xrightarrow{d} N(0, V^{-1} W (V^{-1})^T).$$
The likelihood framework can be used to test the fit of models:
- Confidence intervals
- Likelihood ratio tests
Confidence intervals
Assume $\sqrt{N}(\hat\theta - \theta_0) \xrightarrow{d} N(0, \Sigma)$. Then
$$I_{\theta_0} \approx \hat\theta \pm 2\sqrt{\mathrm{diag}(\Sigma)/N}.$$
Better accuracy: use the profile likelihood!
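A minimal sketch of the interval, assuming $\hat\theta$ and $\Sigma$ have already been computed (the numbers below are placeholders, not estimates from the slides):

```python
import numpy as np

theta_hat = np.array([1.02, -1.97])              # placeholder estimate
Sigma = np.array([[0.8, 0.1], [0.1, 1.2]])       # placeholder asymptotic covariance
N = 500

half = 2 * np.sqrt(np.diag(Sigma) / N)
print(np.c_[theta_hat - half, theta_hat + half]) # lower / upper interval ends
```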
Likelihood based tests
Likelihood ratio tests:
$$\Lambda = \frac{\sup\{L(X \mid \theta),\ \theta \in \Theta_F\}}{\sup\{L(X \mid \theta),\ \theta \in \Theta_R\}},$$
where $\Theta_R \subset \Theta_F$. Then
$$2\log(\Lambda) \to \chi^2(k),$$
$k$ being the degrees of freedom. LR tests are optimal, cf. Neyman-Pearson, and are closely related to AIC, BIC etc.
Likelihood ratios
Compare $L(X \mid \hat\theta)$ to $L(X \mid \theta_0)$. We have that
$$2\log(\Lambda) = 2\big(l(\hat\theta) - l(\theta_0)\big).$$
Taylor expand:
$$l(\theta) = l(\theta_0) + \partial_\theta l(\theta_0)(\theta - \theta_0) + \frac{1}{2}\partial_\theta^2 l(\theta_0)(\theta - \theta_0)^2 + R.$$
The estimate must solve
$$\partial_\theta l(\theta_0) + \partial_\theta^2 l(\theta_0)(\hat\theta - \theta_0) = 0.$$
Thus
$$2\log(\Lambda) = 2\Big({-\frac{1}{2}}\,\partial_\theta^2 l(\theta_0)(\hat\theta - \theta_0)^2\Big) = -\partial_\theta^2 l(\theta_0)(\hat\theta - \theta_0)^2.$$
Likelihood ratios, cont.
Plugging in $(\hat\theta - \theta_0) = -(\partial_\theta^2 l(\theta_0))^{-1}\partial_\theta l(\theta_0)$ gives
$$2\log(\Lambda) = \partial_\theta l(\theta_0)\big({-\partial_\theta^2 l(\theta_0)}\big)^{-1}\partial_\theta l(\theta_0),$$
or written more nicely,
$$2\log(\Lambda) = \frac{1}{\sqrt{N}}\partial_\theta l(\theta_0)\Big({-\frac{1}{N}}\partial_\theta^2 l(\theta_0)\Big)^{-1}\frac{1}{\sqrt{N}}\partial_\theta l(\theta_0).$$
This is a quadratic form, distributed as $\chi^2(k)$, $k$ being the degrees of freedom.
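A small numerical LR test sketch for the exponential model used earlier, testing $H_0: \lambda = \lambda_0$ against a free rate, so $k = 1$ (the model choice is my own illustration):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(8)
lam_null = 3.0
x = rng.exponential(scale=1 / lam_null, size=500)

def loglik(lam):
    return len(x) * np.log(lam) - lam * x.sum()

lam_hat = 1 / x.mean()                           # unrestricted MLE
lr = 2 * (loglik(lam_hat) - loglik(lam_null))    # 2 log(Lambda)
print(lr, chi2.sf(lr, df=1))                     # statistic and p-value, k = 1
```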
Feedback
Send feedback, questions and information about typos to erikl@maths.lth.se.