Maximum Likelihood Estimation
1 Maximum Likelihood Estimation
Eduardo Rossi, University of Pavia
2 Likelihood function

Maximum likelihood chooses the parameter values that make what one has observed more likely to occur than any other parameter values would.

Assumption (Distribution). The pair $(U, V)$ is a random variable and the $N$ pairs $\{(U_1, V_1), \ldots, (U_N, V_N)\}$ are an i.i.d. random sample of $(U, V)$. The conditional distribution $F_{U \mid V}(u \mid v; \theta_0)$ is completely known except for $\theta_0$ (the true value of the real-valued parameter vector), which is unknown, $\theta \in \mathbb{R}^K$. The support of $F_{U \mid V}$ is $S(\theta_0)$:

$$\int_{S(\theta_0)} dF_{U \mid V}(u \mid v; \theta_0) = 1 =
\begin{cases}
\sum_{u \in S(\theta_0)} f(u \mid v; \theta_0) & \text{if } U \text{ is discrete} \\
\int_{S(\theta_0)} f(u \mid v; \theta_0)\, du & \text{if } U \text{ is continuous}
\end{cases}$$

Eduardo Rossi © Econometria finanziaria
3 Likelihood function

Probability function for $(U_1, \ldots, U_N) \mid (V_1, \ldots, V_N)$:

$$\prod_{t=1}^{N} f(u_t \mid v_t; \theta_0)$$

Normal Linear Regression: $y_t = x_t'\beta_0 + \epsilon_t$, with $(y_t, x_t)$ i.i.d. normal, $u_t = y_t$, $v_t = x_t$:

$$f(u_t \mid v_t; \theta_0) = \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left[-\frac{(y_t - x_t'\beta_0)^2}{2\sigma_0^2}\right], \qquad S(\theta_0) = \mathbb{R}.$$

Since the observations are i.i.d. normal, the conditional p.d.f. of the sample is

$$\prod_{t=1}^{N} f(u_t \mid v_t; \theta_0) = \left(2\pi\sigma_0^2\right)^{-N/2} \exp\left[-\frac{(y - X\beta_0)'(y - X\beta_0)}{2\sigma_0^2}\right]$$
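Under the i.i.d. assumption the sample log-likelihood is just the sum of per-observation terms, which is easy to verify numerically. A minimal sketch in Python (simulated data; the function name `normal_loglik` is mine, not from the slides):

```python
import numpy as np

def normal_loglik(beta, sigma2, y, X):
    """Joint conditional log-likelihood: -N/2 log(2 pi sigma^2) - RSS / (2 sigma^2)."""
    resid = y - X @ beta
    return -0.5 * len(y) * np.log(2 * np.pi * sigma2) - (resid @ resid) / (2 * sigma2)

rng = np.random.default_rng(0)
N = 50
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta0, sigma2_0 = np.array([1.0, 2.0]), 0.5
y = X @ beta0 + rng.normal(scale=np.sqrt(sigma2_0), size=N)

# Because the observations are i.i.d., the sample log-likelihood is the sum
# of the per-observation terms log f(y_t | x_t; theta)
per_obs = -0.5 * np.log(2 * np.pi * sigma2_0) - (y - X @ beta0) ** 2 / (2 * sigma2_0)
total = per_obs.sum()
```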
4 Likelihood function

The marginal distribution of $x_t$ does not depend on $\theta_0$.

Student's t Linear Regression: $\dfrac{y_t - x_t'\beta_0}{\sigma_0} \,\Big|\, x_t \sim t_{\nu_0}$

$$f(u_t \mid v_t; \theta_0) = \frac{\Gamma[(\nu_0 + 1)/2]}{\Gamma(\nu_0/2)\sqrt{\pi\nu_0\sigma_0^2}} \left[1 + \frac{(y_t - x_t'\beta_0)^2}{\nu_0\sigma_0^2}\right]^{-(\nu_0 + 1)/2}$$
5 Likelihood function

Laplace Linear Regression:

$$f(u_t \mid v_t; \sigma_0^2) = \frac{1}{\sqrt{2\sigma_0^2}} \exp\left(-\sqrt{2}\,\frac{|y_t - x_t'\beta_0|}{\sigma_0}\right)$$

with $U = y_t$, $V = x_t$, $S(\theta_0) = \mathbb{R}$, $\theta_0 = [\beta_0', \sigma_0^2]'$.

We can obtain

$$h(\theta_0) \equiv E[g(U)] = \int g(u)\, dF(u; \theta_0)$$

$$h(v; \theta_0) \equiv E[g(U, V) \mid V = v] = \int g(u, v)\, dF(u \mid v; \theta_0)$$
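The $\sqrt{2}$ in the Laplace density above is a normalization that makes $\sigma_0^2$ the error variance, so the scale parameter is directly comparable across the normal, Student-t and Laplace specifications. A quick numerical sketch of this claim (grid, truncation point and tolerances are arbitrary choices of mine):

```python
import numpy as np

# Density from the slide: f(e) = (2 sigma^2)^(-1/2) exp(-sqrt(2) |e| / sigma)
sigma = 1.7
e = np.linspace(-40.0, 40.0, 400_001)
f = np.exp(-np.sqrt(2.0) * np.abs(e) / sigma) / np.sqrt(2.0 * sigma**2)

de = e[1] - e[0]
mass = f.sum() * de            # should integrate to one
var = (e**2 * f).sum() * de    # second moment should equal sigma^2
```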
6 The likelihood function

Unconditional specification: $f(u; \theta)$ describes the likely values of every r.v. $U_t$, $t = 1, 2, \ldots, N$, for a specific value of $\theta$. The sample likelihood function instead treats the argument $u$ as given and $\theta$ as variable: it describes the likely values of the unknown $\theta_0$ given the realizations of the r.v. $U$.

The likelihood function of $\theta$ for a random variable $U$ with p.f. $f(u; \theta_0)$ is defined to be

$$l(\theta; U) = f(U; \theta), \qquad L(\theta; U) = \log l(\theta; U)$$
7 The likelihood function

Likelihood function: we evaluate the p.f. at a random variable and consider the result as a function of the variable $\theta$:

$$L(\theta; U_1, \ldots, U_N) = \log\left[\prod_{t=1}^{N} f(U_t; \theta)\right] = \sum_{t=1}^{N} L(\theta; U_t)$$

The conditional likelihood function of $\theta$ for a r.v. $U$ with p.f. $f(u \mid v; \theta_0)$ given the r.v. $V$ is

$$l(\theta; U \mid V) = f(U \mid V; \theta), \qquad L(\theta; U \mid V) = \log l(\theta; U \mid V)$$

$\theta_0 \in \Theta$, where $\Theta$ is the parameter space, the set of parameter values permitted by the model.
8 Dominance condition

Assumption (Dominance condition).

$$E\left[\sup_{\theta \in \Theta} |L(\theta; U \mid V)|\right] \text{ exists.}$$

This means that $L(\theta; U \mid V)$ is dominated by

$$h(U, V) \equiv \sup_{\theta \in \Theta} |L(\theta; U \mid V)|$$

where $h(U, V)$ does not depend on $\theta$. The existence of $E[h(U, V)]$ implies the existence of $E[L(\theta; U \mid V)]$ for all $\theta \in \Theta$.

Lemma. If $L(\theta; U \mid V)$ is the conditional log-likelihood for $\theta$ and the Dominance condition holds, then

$$E[L(\theta; U \mid V) \mid V] \leq E[L(\theta_0; U \mid V) \mid V].$$
9 Expected log-likelihood inequality

Unconditional case:

$$E[L(\theta_0; U)] \geq E[L(\theta; U)]$$

The specification of the p.f. of $U$ determines the expected values of functions of $U$. Therefore define

$$Q(\theta, \theta_0) \equiv E[L(\theta; U)]$$

which depends on $\theta$ because $L$ does, and on $\theta_0$ because $Q$ is the expected value of a function of $U$. The expected log-likelihood inequality states that

$$Q(\theta_0, \theta_0) = \max_{\theta \in \Theta} Q(\theta, \theta_0)$$
10 Normal linear regression model

$$y_t \mid x_t \sim N(x_t'\beta_0, \sigma_0^2)$$

$$\begin{aligned}
E[L(\theta; y_t \mid x_t) \mid x_t] &= -\frac{1}{2}\log(2\pi\sigma^2) - \frac{E[(y_t - x_t'\beta)^2 \mid x_t]}{2\sigma^2} \\
&= -\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2}\,\frac{E[(y_t - x_t'\beta_0 + x_t'\beta_0 - x_t'\beta)^2 \mid x_t]}{\sigma^2} \\
&= -\frac{1}{2}\left[\log(2\pi\sigma^2) + \frac{\sigma_0^2 + (x_t'\beta - x_t'\beta_0)^2}{\sigma^2}\right]
\end{aligned}$$

which is uniquely maximized at $x_t'\beta = x_t'\beta_0$ and $\sigma^2 = \sigma_0^2$.
11 Normal linear regression model

The conditional expectation of the conditional log-likelihood of the entire sample is the sum of such terms:

$$E[L(\theta; y \mid X) \mid X] = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{N\sigma_0^2 + (\beta - \beta_0)'X'X(\beta - \beta_0)}{2\sigma^2}$$

which is uniquely maximized at $\beta = \beta_0$ (equivalently $X\beta = X\beta_0$) and $\sigma^2 = \sigma_0^2$ if $X$ has full column rank.
12 Student t Linear Regression

The expected log-likelihood is analytically intractable. We can nevertheless show that $E[L(\theta; U \mid V)]$ exists for $\nu_0 > 2$. By the concavity of the logarithm, $\log(1 + z^2) \leq z^2$, so that

$$E\left[\log\left[1 + \frac{(y_t - x_t'\beta)^2}{\nu\sigma^2}\right] \,\Big|\, x_t\right] \leq E\left[\frac{(y_t - x_t'\beta)^2}{\nu\sigma^2} \,\Big|\, x_t\right] = \frac{\nu_0\sigma_0^2/(\nu_0 - 2) + (x_t'\beta_0 - x_t'\beta)^2}{\nu\sigma^2}$$

Provided that $E[x_t x_t']$ exists, the expected log-likelihood exists.
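Both ingredients of the bound above can be sanity-checked numerically: the inequality $\log(1 + z^2) \leq z^2$, and the second moment $\nu_0/(\nu_0 - 2)$ of a $t_{\nu_0}$ variable that produces the first term on the right-hand side. A sketch (simulated; degrees of freedom, sample size and seed are arbitrary choices of mine):

```python
import numpy as np

# The bound used on the slide: log(1 + z^2) <= z^2 for all z
z = np.linspace(-10.0, 10.0, 2001)
bound_holds = bool(np.all(np.log1p(z**2) <= z**2 + 1e-12))

# Second moment of a t_nu variable is nu / (nu - 2) for nu > 2 (simulated check)
rng = np.random.default_rng(1)
nu = 6
t = rng.standard_t(nu, size=2_000_000)
second_moment = (t**2).mean()
```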
13 Unconditional inequality

The expected log-likelihood inequality implies the unconditional inequality

$$E[L(\theta; U \mid V)] \leq E[L(\theta_0; U \mid V)]$$

Starting from $E[L(\theta; U \mid V) \mid V] \leq E[L(\theta_0; U \mid V) \mid V]$, we can take the expectation over $V$:

$$E[L(\theta; U \mid V)] = E\big[E[L(\theta; U \mid V) \mid V]\big] \leq E\big[E[L(\theta_0; U \mid V) \mid V]\big] = E[L(\theta_0; U \mid V)]$$
14 The ML estimator

Because $\theta_0$ maximizes $E[L(\theta; U \mid V)]$, it is natural to construct an estimator of $\theta_0$ from the value of $\theta$ that maximizes the sample counterpart: the average log-likelihood function of the $N$ observations,

$$E_N[L(\theta; U \mid V)] \equiv \frac{1}{N}\sum_t L(\theta; U_t \mid V_t), \qquad \text{versus} \qquad E[L(\theta; U \mid V)] = \int L(\theta; u \mid v)\, dF(u \mid v; \theta_0)$$

ML estimator: the MLE is a value of the parameter vector that maximizes the sample average log-likelihood function

$$\hat{\theta}_N \in \arg\max_{\theta \in \Theta} E_N[L(\theta)]$$
15 Normal Linear Regression Model

The empirical expectation of the log-likelihood:

$$E_N[L(\theta)] = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{E_N[(y_t - x_t'\beta)^2]}{2\sigma^2} = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y - X\beta)'(y - X\beta)/N}{2\sigma^2}$$

The log-likelihood is differentiable. F.O.C.s:

$$E_N[L_\beta(\theta)] = \frac{1}{\sigma^2}\, E_N[x_t(y_t - x_t'\beta)] = \frac{1}{N\sigma^2}\, X'(y - X\beta)$$

$$E_N[L_{\sigma^2}(\theta)] = -\frac{1}{2\sigma^4}\left\{\sigma^2 - E_N[(y_t - x_t'\beta)^2]\right\} = -\frac{1}{2\sigma^4}\left[\sigma^2 - \frac{1}{N}(y - X\beta)'(y - X\beta)\right]$$
16 Normal Linear Regression Model

Solutions:

$$\frac{1}{N\hat{\sigma}^2}\, X'(y - X\hat{\beta}) = 0 \quad \Rightarrow \quad \hat{\beta} = (X'X)^{-1}X'y$$

The MLE of $\sigma^2$ is

$$\hat{\sigma}^2 = \frac{1}{N}(y - X\hat{\beta})'(y - X\hat{\beta}) = \frac{\hat{\epsilon}'\hat{\epsilon}}{N} = \frac{N - K}{N}\, s^2$$

The Hessian matrix:

$$E_N[L_{\theta\theta'}(\theta)] = \begin{bmatrix}
-\dfrac{X'X}{N\sigma^2} & -\dfrac{X'(y - X\beta)}{N\sigma^4} \\[1.5ex]
-\dfrac{(y - X\beta)'X}{N\sigma^4} & \dfrac{1}{2\sigma^4} - \dfrac{(y - X\beta)'(y - X\beta)}{N\sigma^6}
\end{bmatrix}$$
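The closed-form solutions above can be verified in a few lines. This sketch (simulated data; variable names are mine) computes the MLE and checks the relation $\hat{\sigma}^2 = \frac{N-K}{N}s^2$ and the first-order condition $X'(y - X\hat{\beta}) = 0$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 120, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ rng.normal(size=K) + rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # MLE coincides with OLS
resid = y - X @ beta_hat
sigma2_mle = resid @ resid / N                  # ML estimator: divides by N
s2 = resid @ resid / (N - K)                    # unbiased OLS variance estimator
```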
17 Normal Linear Regression Model

Evaluated at the MLE, the off-diagonal blocks vanish because $X'(y - X\hat{\beta}) = 0$, and since $(y - X\hat{\beta})'(y - X\hat{\beta}) = N\hat{\sigma}^2$ the bottom-right entry simplifies:

$$E_N[L_{\theta\theta'}(\hat{\theta})] = \begin{bmatrix}
-\dfrac{X'X}{N\hat{\sigma}^2} & 0 \\[1.5ex]
0 & \dfrac{1}{2\hat{\sigma}^4} - \dfrac{(y - X\hat{\beta})'(y - X\hat{\beta})}{N\hat{\sigma}^6}
\end{bmatrix} = \begin{bmatrix}
-\dfrac{X'X}{N\hat{\sigma}^2} & 0 \\[1.5ex]
0 & -\dfrac{1}{2\hat{\sigma}^4}
\end{bmatrix}$$

which is negative definite. The second-order necessary condition for a point to be a local maximum of a twice continuously differentiable function is that the Hessian be negative semidefinite at that point.
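A numerical check that the Hessian at the MLE is block diagonal and negative definite, a sketch on simulated data (all names and values are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 150, 2
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sig2 = resid @ resid / N

# Hessian of E_N[L] at the MLE; the off-diagonal block X'resid/(N sig^4) is zero
H = np.zeros((K + 1, K + 1))
H[:K, :K] = -X.T @ X / (N * sig2)
H[:K, K] = H[K, :K] = -X.T @ resid / (N * sig2**2)
H[K, K] = 1.0 / (2.0 * sig2**2) - (resid @ resid) / (N * sig2**3)

eigs = np.linalg.eigvalsh(H)   # all eigenvalues should be negative
```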
18 Identification

Is the DGP sufficiently informative about the parameters of the model? If $f(u \mid v; \theta_0) = f(u \mid v; \theta_1)$, data drawn from these two distributions will have the same sampling properties: there is no way to distinguish whether $\theta = \theta_0$ or $\theta = \theta_1$.
19 Global Identification

The parameter $\theta_0$ is globally identified in $\Theta$ if, for every $\theta_1 \in \Theta$, $\theta_1 \neq \theta_0$ implies that

$$\Pr\{f(U \mid V; \theta_0) \neq f(U \mid V; \theta_1)\} > 0$$

Assumption (Global identification). Every parameter vector $\theta_0 \in \Theta$ is globally identified.

Lemma (Strict expected log-likelihood inequality). Under the Distribution, Dominance and Global identification assumptions, $\theta \neq \theta_0$ implies

$$E[L(\theta)] < E[L(\theta_0)].$$
20 Example

Exact multicollinearity among the explanatory variables in a linear regression $E[y \mid X] = X\beta_0$ is a failure of global identification. If $\mathrm{rank}(X) < K$, the inequality $E[L(\theta)] \leq E[L(\theta_0)]$ still holds: the normal log-likelihood still attains its maximum in $\beta$ at $\beta_0$, because

$$(\beta - \beta_0)'X'X(\beta - \beta_0) \geq 0$$

but the inequality is not strict for all $\beta \neq \beta_0$. If $\mathrm{rank}(X) = K$, then $\beta_0$ is the unique maximum of $E[L(\theta)]$.
21 Example

Identification concerns $E[L(\theta)]$, not $E_N[L(\theta)]$. One can sometimes discover failures of identification in the sample log-likelihood, but if a sample log-likelihood function fails to have a unique global maximum, this does not always imply a failure of global identification.
22 Example

Exact multicollinearity among the explanatory variables in a LRM $E[y \mid X] = X\beta_0$ is a failure of global identification. Note that if $\mathrm{rank}(X) < K$, the expected log-likelihood inequality $E[L(\theta)] \leq E[L(\theta_0)]$ still holds.
23 Differentiability

When the support of the distribution depends on the unknown parameter values, the MLE cannot be found with simple calculus: in such cases the log-likelihood is not differentiable everywhere in the parameter space.

Assumption (Differentiability). The p.f. $f(u \mid v; \theta)$ is twice continuously differentiable in $\theta$ for all $\theta \in \Theta$. The support $S(\theta)$ does not depend on $\theta$, and differentiation and integration are interchangeable in the sense that

$$\frac{\partial}{\partial \theta} \int_S dF(u \mid v; \theta) = \int_S \frac{\partial}{\partial \theta}\, dF(u \mid v; \theta), \qquad \frac{\partial^2}{\partial \theta\, \partial \theta'} \int_S dF(u \mid v; \theta) = \int_S \frac{\partial^2}{\partial \theta\, \partial \theta'}\, dF(u \mid v; \theta)$$
24 Differentiability

$$\frac{\partial E[L(\theta) \mid V = v]}{\partial \theta} = E\left[\frac{\partial L(\theta)}{\partial \theta} \,\Big|\, V = v\right], \qquad \frac{\partial^2 E[L(\theta) \mid V = v]}{\partial \theta\, \partial \theta'} = E\left[\frac{\partial^2 L(\theta)}{\partial \theta\, \partial \theta'} \,\Big|\, V = v\right]$$

The interchange of differentiation and integration is ensured in part by $S(\theta) = S$. Then $\theta_0 = \arg\max_{\theta \in \Theta} E[L(\theta)]$ translates into the first-order conditions

$$\left.\frac{\partial E[L(\theta)]}{\partial \theta}\right|_{\theta = \theta_0} = 0$$

and the second-order condition that the Hessian matrix

$$\left.\frac{\partial^2 E[L(\theta)]}{\partial \theta\, \partial \theta'}\right|_{\theta = \theta_0}$$

is a negative definite matrix.
25 The score function

The MLE $\hat{\theta}$ is an implicit function of the data $u$:

$$\hat{\theta} = \arg\max_{\theta \in \Theta} E_N[L(\theta)] \quad \leftrightarrow \quad \text{a zero of } E_N[L_\theta(\theta)] \text{ in } \Theta$$

The F.O.C.s, called the normal equations or likelihood equations, are

$$E_N[L_\theta(\hat{\theta})] = 0$$

where the score function is

$$L_\theta \equiv \frac{\partial L(\theta)}{\partial \theta}$$

In general $\hat{\theta}$ has no closed form and must be calculated by numerical methods for maximizing differentiable functions.
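Because the normal log-likelihood is differentiable, the likelihood equations can be solved by a standard numerical method. The sketch below (illustrative code, not from the slides) runs Newton's method on the $\beta$-block of the score, re-profiling $\sigma^2$ at each step; since the score is linear in $\beta$, it reproduces the closed-form OLS/ML solution:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([0.3, -1.2]) + rng.normal(size=N)

def score_beta(beta, sigma2):
    # E_N[L_beta(theta)] = X'(y - X beta) / (N sigma^2)
    return X.T @ (y - X @ beta) / (N * sigma2)

def hess_beta(sigma2):
    # beta-block of E_N[L_{theta theta'}(theta)]: -X'X / (N sigma^2)
    return -X.T @ X / (N * sigma2)

# Newton iterations on the likelihood equations, with sigma^2 set to the
# mean squared residual at the current beta; the score is linear in beta,
# so the iteration converges immediately.
beta = np.zeros(2)
for _ in range(3):
    sigma2 = np.mean((y - X @ beta) ** 2)
    beta = beta - np.linalg.solve(hess_beta(sigma2), score_beta(beta, sigma2))

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
```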
26 Score Identity

Lemma (Score identity). Under the Distribution and Differentiability assumptions,

$$E[L_\theta(\theta_0) \mid V = v] = 0$$

Proof (continuous random variables case):

$$1 = \int_S dF(u \mid v; \theta) = \int_S f(u \mid v; \theta)\, du$$
27 Score Identity

We can differentiate both sides of this equality w.r.t. $\theta$:

$$0 = \frac{\partial}{\partial \theta} \int_S f(u \mid v; \theta)\, du = \int_S f_\theta(u \mid v; \theta)\, du = \int_S \frac{f_\theta(u \mid v; \theta)}{f(u \mid v; \theta)}\, f(u \mid v; \theta)\, du$$

Consider

$$L_\theta(\theta; U \mid V) = \frac{f_\theta(u \mid v; \theta)}{f(u \mid v; \theta)}$$

$$E[L_\theta(\theta; U \mid V) \mid V = v] = \int_S \frac{f_\theta(u \mid v; \theta)}{f(u \mid v; \theta)}\, f(u \mid v; \theta_0)\, du$$
28 Score Identity

The expectation $E[\,\cdot \mid V = v]$ is evaluated at $\theta = \theta_0$. For $\theta \neq \theta_0$, in general

$$E[L_\theta(\theta; U \mid V) \mid V = v] \neq 0$$

But if $\theta = \theta_0$, then

$$E[L_\theta(\theta_0; U \mid V) \mid V = v] = \int_S \frac{f_\theta(u \mid v; \theta_0)}{f(u \mid v; \theta_0)}\, f(u \mid v; \theta_0)\, du = \int_S f_\theta(u \mid v; \theta_0)\, du = 0.$$
29 Score Identity

In the Normal Linear Regression Model:

$$E[L_\beta(\theta)] = \frac{1}{\sigma^2}\, E[x_t x_t']\, (\beta_0 - \beta)$$

$$E[L_{\sigma^2}(\theta)] = -\frac{1}{2\sigma^4}\left(\sigma^2 - \left\{\sigma_0^2 + E[(x_t'\beta_0 - x_t'\beta)^2]\right\}\right)$$

At $\theta_0 = (\beta_0', \sigma_0^2)'$:

$$E[L_\beta(\theta_0)] = \frac{1}{\sigma_0^2}\, E[x_t x_t']\, (\beta_0 - \beta_0) = 0$$

$$E[L_{\sigma^2}(\theta_0)] = -\frac{1}{2\sigma_0^4}\left(\sigma_0^2 - \left\{\sigma_0^2 + E[(x_t'\beta_0 - x_t'\beta_0)^2]\right\}\right) = 0$$
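The expected score can be approximated by Monte Carlo, confirming that it is zero at $\theta_0$ but not at other parameter values. A sketch with one fixed conditioning value $x_t$ (all numbers, names and tolerances are arbitrary choices of mine; the tolerances reflect Monte Carlo error):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.array([1.0, 0.8])                 # a fixed conditioning value x_t
beta0, sig2_0 = np.array([0.4, 1.1]), 1.5
M = 500_000
y = x @ beta0 + rng.normal(scale=np.sqrt(sig2_0), size=M)

def mc_mean_score_beta(beta):
    # Monte Carlo estimate of E[L_beta(theta) | x] at sigma^2 = sig2_0
    return (np.outer(y - x @ beta, x) / sig2_0).mean(axis=0)

at_truth = mc_mean_score_beta(beta0)     # should be approximately 0
beta1 = beta0 + np.array([0.5, -0.3])
at_beta1 = mc_mean_score_beta(beta1)     # should match x x' (beta0 - beta1) / sigma^2
closed_form = np.outer(x, x) @ (beta0 - beta1) / sig2_0
```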
30 The Information Matrix

If there exists $\hat{\theta}_N$ such that $E_N[L_\theta(\hat{\theta}_N)] = 0$, we must check that we have a global maximum; otherwise our solution is not the MLE. In general, a sufficient condition for $\hat{\theta}_N$ to be a local maximum is that the Hessian matrix

$$E_N[L_{\theta\theta'}(\hat{\theta}_N)] \equiv \left.\frac{\partial^2 E_N[L(\theta)]}{\partial \theta\, \partial \theta'}\right|_{\theta = \hat{\theta}_N}$$

evaluated at $\hat{\theta}_N$ is negative definite:

$$\forall c \in \mathbb{R}^K,\ c \neq 0: \quad c'\, E_N[L_{\theta\theta'}(\hat{\theta}_N)]\, c < 0$$

This guarantees that $E_N[L(\theta)]$ is strictly concave in a neighborhood of $\hat{\theta}_N$.
31 Information Matrix

We investigate the second-order conditions for the maximum of $E[L(\theta)]$.

Assumption (Finite Information). $\mathrm{Var}[L_\theta(\theta_0)]$ exists.

Lemma (Information Identity). Under the Distribution, Differentiability and Finite Information assumptions,

$$E[L_{\theta\theta'}(\theta_0) \mid V = v] = -\mathrm{Var}[L_\theta(\theta_0) \mid V = v]$$

and this matrix is negative semidefinite.
32 Information Matrix

Proof:

$$0 = \int_S L_\theta(\theta; u \mid v)\, f(u \mid v; \theta)\, du$$

Differentiating both sides, and writing $f \equiv f(u \mid v; \theta)$:

$$\frac{\partial\, (L_\theta f)}{\partial \theta'} = L_{\theta\theta'}\, f + L_\theta\, f_{\theta'} = L_{\theta\theta'}\, f + L_\theta\, (L_{\theta'} f) = (L_{\theta\theta'} + L_\theta L_\theta')\, f$$

$$0 = \int_S \left[L_{\theta\theta'}(\theta; u \mid v) + L_\theta(\theta; u \mid v)\, L_\theta(\theta; u \mid v)'\right] dF(u \mid v; \theta)$$
33 Information Matrix

$$\int_S L_{\theta\theta'}(\theta; u \mid v)\, dF(u \mid v; \theta) = -\int_S L_\theta(\theta; u \mid v)\, L_\theta(\theta; u \mid v)'\, dF(u \mid v; \theta)$$

Setting $\theta = \theta_0$:

$$E[L_{\theta\theta'}(\theta_0; U \mid V) \mid V = v] = -E[L_\theta(\theta_0; U \mid V)\, L_\theta(\theta_0; U \mid V)' \mid V = v] = -\mathrm{Var}[L_\theta(\theta_0; U \mid V) \mid V = v]$$

because $E[L_\theta(\theta_0; U \mid V) \mid V] = 0$. The Hessian is negative semidefinite since it is the negative of a variance matrix.
34 Conditional Information

The conditional variance matrix of the score vector $L_\theta(\theta; U \mid V)$ given $V = v$ and evaluated at $\theta_0$ is

$$I(\theta_0 \mid v) \equiv E[L_\theta(\theta_0)\, L_\theta(\theta_0)' \mid V = v] = \mathrm{Var}[L_\theta(\theta_0) \mid V = v]$$

More generally, we can always form the conditional information matrix function

$$I(\theta \mid v) \equiv \int_S L_\theta(\theta; u \mid v)\, L_\theta(\theta; u \mid v)'\, dF(u \mid v; \theta)$$
35 Population Information

The marginal expectation

$$I(\theta_0) \equiv E[L_\theta(\theta_0; U \mid V)\, L_\theta(\theta_0; U \mid V)']$$

is the population information matrix. The population information matrix is the unconditional variance matrix of the conditional score vector: because $E[L_\theta(\theta_0; U \mid V) \mid V] = 0$,

$$\mathrm{Var}[L_\theta(\theta_0; U \mid V)] = E\big[\mathrm{Var}[L_\theta(\theta_0; U \mid V) \mid V]\big] + \mathrm{Var}\big[E[L_\theta(\theta_0; U \mid V) \mid V]\big] = E[I(\theta_0 \mid V)] = I(\theta_0)$$
36 Normal linear regression model

The conditional information matrix for the normal linear regression model:

$$I(\theta_0 \mid x_t) = \begin{bmatrix} \dfrac{x_t x_t'}{\sigma_0^2} & 0 \\[1.5ex] 0 & \dfrac{1}{2\sigma_0^4} \end{bmatrix}$$

The Hessian of the conditional normal regression log-likelihood function:

$$L_{\theta\theta'}(\theta; y_t \mid x_t) = \begin{bmatrix} -\dfrac{x_t x_t'}{\sigma^2} & -\dfrac{x_t (y_t - x_t'\beta)}{\sigma^4} \\[1.5ex] -\dfrac{(y_t - x_t'\beta)\, x_t'}{\sigma^4} & \dfrac{1}{2\sigma^4} - \dfrac{(y_t - x_t'\beta)^2}{\sigma^6} \end{bmatrix}$$

$$E[L_{\theta\theta'}(\theta_0; y_t \mid x_t) \mid x_t] = -I(\theta_0 \mid x_t)$$
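The information identity $E[-L_{\theta\theta'}(\theta_0) \mid x_t] = I(\theta_0 \mid x_t) = \mathrm{Var}[L_\theta(\theta_0) \mid x_t]$ can be checked by simulation for this model. A sketch with one fixed $x_t$ (illustrative values of mine; tolerances reflect Monte Carlo error):

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.array([1.0, 1.5])                      # a fixed conditioning value x_t
sig2 = 2.0
M = 400_000
e = rng.normal(scale=np.sqrt(sig2), size=M)   # y_t - x_t' beta0 given x_t

# Per-observation score at theta_0: [x e / sigma^2, (e^2 - sigma^2) / (2 sigma^4)]
scores = np.column_stack([np.outer(e, x) / sig2, (e**2 - sig2) / (2 * sig2**2)])
V = np.cov(scores.T)                          # Monte Carlo Var[score | x]

# Conditional information matrix from the slide (block diagonal)
I0 = np.zeros((3, 3))
I0[:2, :2] = np.outer(x, x) / sig2
I0[2, 2] = 1.0 / (2.0 * sig2**2)

# Monte Carlo E[-Hessian | x]: bottom-right entry is E[e^2/sigma^6 - 1/(2 sigma^4)]
neg_hess_22 = np.mean(e**2 / sig2**3 - 1.0 / (2.0 * sig2**2))
```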
37 Nonsingular information

The information matrix can be singular even when $\theta_0$ is globally identified and the expected log-likelihood is uniquely maximized at $\theta_0$: the second-order condition that the Hessian be negative definite is sufficient, but not necessary, for a local maximum. We therefore assume this condition explicitly.

Assumption (Nonsingular Information). The information matrix $I(\theta_0)$ is nonsingular for all possible $\theta_0 \in \Theta$.
38 The Cramér-Rao Lower Bound

Information matrix: a measure of how much we can learn about $\theta_0$ from the random sample $\{(U_1, V_1), \ldots, (U_N, V_N)\}$.

Theorem. Let $\tilde{\theta}$ be an unbiased estimator of $\theta_0$ with finite variance matrix, and suppose that differentiation and integration are interchangeable in

$$E[\tilde{\theta} \mid v_1, \ldots, v_N] = \int_S \tilde{\theta} \prod_{t=1}^{N} dF(u_t \mid v_t; \theta_0) = \theta_0$$

If the Distribution, Differentiability, Finite Information and Nonsingularity assumptions also hold, then for any $a \in \mathbb{R}^K$

$$a'\, \mathrm{Var}[\tilde{\theta} \mid v]\, a \geq a'\, \big(N\, E_N[I(\theta_0 \mid v_t)]\big)^{-1} a.$$
39 The Cramér-Rao Lower Bound

In some cases we can find estimators with variances equal to the Cramér-Rao lower bound. The OLS estimator $\hat{\beta}$ is efficient relative to all unbiased estimators of $\beta_0$.

Proof: using

$$I(\theta_0 \mid x_t) = \begin{bmatrix} \dfrac{x_t x_t'}{\sigma_0^2} & 0 \\[1.5ex] 0 & \dfrac{1}{2\sigma_0^4} \end{bmatrix}$$

we obtain

$$\big(N\, E_N[I(\theta_0 \mid x_t)]\big)^{-1} = \begin{bmatrix} \dfrac{X'X}{\sigma_0^2} & 0 \\[1.5ex] 0 & \dfrac{N}{2\sigma_0^4} \end{bmatrix}^{-1} = \begin{bmatrix} \sigma_0^2\, (X'X)^{-1} & 0 \\[1.5ex] 0 & \dfrac{2\sigma_0^4}{N} \end{bmatrix}$$

Because $\mathrm{Var}[\hat{\beta} \mid X] = \sigma_0^2 (X'X)^{-1}$, the OLS/MLE estimator attains the Cramér-Rao lower bound.
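The block-diagonal structure makes the bound easy to compute: inverting the sample information matrix yields exactly $\sigma_0^2 (X'X)^{-1}$ in the $\beta$-block, which is $\mathrm{Var}[\hat{\beta} \mid X]$ for OLS. A numerical sketch (simulated design matrix; names and values are mine):

```python
import numpy as np

rng = np.random.default_rng(7)
N, K = 80, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
sig2_0 = 0.9

# N * E_N[I(theta_0 | x_t)] is block diagonal in (beta, sigma^2)
info = np.zeros((K + 1, K + 1))
info[:K, :K] = X.T @ X / sig2_0
info[K, K] = N / (2.0 * sig2_0**2)

crb = np.linalg.inv(info)                  # Cramér-Rao lower bound

beta_block = crb[:K, :K]                   # should equal sigma_0^2 (X'X)^{-1}
ols_var = sig2_0 * np.linalg.inv(X.T @ X)  # Var[beta_hat | X] for OLS
```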
More information6. MAXIMUM LIKELIHOOD ESTIMATION
6 MAXIMUM LIKELIHOOD ESIMAION [1] Maximum Likelihood Estimator (1) Cases in which θ (unknown parameter) is scalar Notational Clarification: From now on, we denote the true value of θ as θ o hen, view θ
More informationLeast Squares Estimation-Finite-Sample Properties
Least Squares Estimation-Finite-Sample Properties Ping Yu School of Economics and Finance The University of Hong Kong Ping Yu (HKU) Finite-Sample 1 / 29 Terminology and Assumptions 1 Terminology and Assumptions
More informationParametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood
More informationMath 423/533: The Main Theoretical Topics
Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)
More informationMaximum Likelihood Estimation
Chapter 8 Maximum Likelihood Estimation 8. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function
More informationA Few Notes on Fisher Information (WIP)
A Few Notes on Fisher Information (WIP) David Meyer dmm@{-4-5.net,uoregon.edu} Last update: April 30, 208 Definitions There are so many interesting things about Fisher Information and its theoretical properties
More informationFisher Information & Efficiency
Fisher Information & Efficiency Robert L. Wolpert Department of Statistical Science Duke University, Durham, NC, USA 1 Introduction Let f(x θ) be the pdf of X for θ Θ; at times we will also consider a
More information2.6.3 Generalized likelihood ratio tests
26 HYPOTHESIS TESTING 113 263 Generalized likelihood ratio tests When a UMP test does not exist, we usually use a generalized likelihood ratio test to verify H 0 : θ Θ against H 1 : θ Θ\Θ It can be used
More informationSTAT 512 sp 2018 Summary Sheet
STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}
More informationAdvanced Quantitative Methods: maximum likelihood
Advanced Quantitative Methods: Maximum Likelihood University College Dublin 4 March 2014 1 2 3 4 5 6 Outline 1 2 3 4 5 6 of straight lines y = 1 2 x + 2 dy dx = 1 2 of curves y = x 2 4x + 5 of curves y
More informationLasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices
Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,
More informationStatistics GIDP Ph.D. Qualifying Exam Theory Jan 11, 2016, 9:00am-1:00pm
Statistics GIDP Ph.D. Qualifying Exam Theory Jan, 06, 9:00am-:00pm Instructions: Provide answers on the supplied pads of paper; write on only one side of each sheet. Complete exactly 5 of the 6 problems.
More informationSTATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN
Massimo Guidolin Massimo.Guidolin@unibocconi.it Dept. of Finance STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN SECOND PART, LECTURE 2: MODES OF CONVERGENCE AND POINT ESTIMATION Lecture 2:
More informationProblem Selected Scores
Statistics Ph.D. Qualifying Exam: Part II November 20, 2010 Student Name: 1. Answer 8 out of 12 problems. Mark the problems you selected in the following table. Problem 1 2 3 4 5 6 7 8 9 10 11 12 Selected
More informationSection 8: Asymptotic Properties of the MLE
2 Section 8: Asymptotic Properties of the MLE In this part of the course, we will consider the asymptotic properties of the maximum likelihood estimator. In particular, we will study issues of consistency,
More informationThe outline for Unit 3
The outline for Unit 3 Unit 1. Introduction: The regression model. Unit 2. Estimation principles. Unit 3: Hypothesis testing principles. 3.1 Wald test. 3.2 Lagrange Multiplier. 3.3 Likelihood Ratio Test.
More informationMaximum Likelihood Estimation
Maximum Likelihood Estimation Assume X P θ, θ Θ, with joint pdf (or pmf) f(x θ). Suppose we observe X = x. The Likelihood function is L(θ x) = f(x θ) as a function of θ (with the data x held fixed). The
More informationHT Introduction. P(X i = x i ) = e λ λ x i
MODS STATISTICS Introduction. HT 2012 Simon Myers, Department of Statistics (and The Wellcome Trust Centre for Human Genetics) myers@stats.ox.ac.uk We will be concerned with the mathematical framework
More informationEvaluating the Performance of Estimators (Section 7.3)
Evaluating the Performance of Estimators (Section 7.3) Example: Suppose we observe X 1,..., X n iid N(θ, σ 2 0 ), with σ2 0 known, and wish to estimate θ. Two possible estimators are: ˆθ = X sample mean
More informationMathematical statistics
October 1 st, 2018 Lecture 11: Sufficient statistic Where are we? Week 1 Week 2 Week 4 Week 7 Week 10 Week 14 Probability reviews Chapter 6: Statistics and Sampling Distributions Chapter 7: Point Estimation
More informationTime Series Analysis
Time Series Analysis hm@imm.dtu.dk Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby 1 Outline of the lecture Regression based methods, 1st part: Introduction (Sec.
More informationMaximum likelihood estimation
Maximum likelihood estimation Guillaume Obozinski Ecole des Ponts - ParisTech Master MVA Maximum likelihood estimation 1/26 Outline 1 Statistical concepts 2 A short review of convex analysis and optimization
More informationMaximum Likelihood Estimation
Chapter 7 Maximum Likelihood Estimation 7. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function
More informationStat 710: Mathematical Statistics Lecture 27
Stat 710: Mathematical Statistics Lecture 27 Jun Shao Department of Statistics University of Wisconsin Madison, WI 53706, USA Jun Shao (UW-Madison) Stat 710, Lecture 27 April 3, 2009 1 / 10 Lecture 27:
More informationParametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory
Statistical Inference Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory IP, José Bioucas Dias, IST, 2007
More informationComputing the MLE and the EM Algorithm
ECE 830 Fall 0 Statistical Signal Processing instructor: R. Nowak Computing the MLE and the EM Algorithm If X p(x θ), θ Θ, then the MLE is the solution to the equations logp(x θ) θ 0. Sometimes these equations
More informationEstimation Theory. as Θ = (Θ 1,Θ 2,...,Θ m ) T. An estimator
Estimation Theory Estimation theory deals with finding numerical values of interesting parameters from given set of data. We start with formulating a family of models that could describe how the data were
More information3.0.1 Multivariate version and tensor product of experiments
ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 3: Minimax risk of GLM and four extensions Lecturer: Yihong Wu Scribe: Ashok Vardhan, Jan 28, 2016 [Ed. Mar 24]
More informationExogeneity and Causality
Università di Pavia Exogeneity and Causality Eduardo Rossi University of Pavia Factorization of the density DGP: D t (x t χ t 1, d t ; Ψ) x t represent all the variables in the economy. The econometric
More informationQualifying Exam CS 661: System Simulation Summer 2013 Prof. Marvin K. Nakayama
Qualifying Exam CS 661: System Simulation Summer 2013 Prof. Marvin K. Nakayama Instructions This exam has 7 pages in total, numbered 1 to 7. Make sure your exam has all the pages. This exam will be 2 hours
More informationRegression Estimation Least Squares and Maximum Likelihood
Regression Estimation Least Squares and Maximum Likelihood Dr. Frank Wood Frank Wood, fwood@stat.columbia.edu Linear Regression Models Lecture 3, Slide 1 Least Squares Max(min)imization Function to minimize
More information