Chapter 3: Maximum Likelihood Theory
Florian Pelgrin (HEC)
September-December, 2010
Outline
1 Introduction
   Example
2 Maximum likelihood estimator
   Notation
   Likelihood and log-likelihood
   Maximum likelihood principle
   Equivariance principle
3 Fisher information
   Score vector
   Fisher information matrix
4 Asymptotic results
   Overview
   Consistency
   Asymptotic efficiency
   Large sample distribution
   Back to the equivariance...
1. Introduction

Example 1: Suppose that $Y_1, Y_2, \ldots, Y_n$ are i.i.d. random variables with $Y_i \sim \mathcal{B}(p)$:
$$Y_i = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1-p \end{cases}$$
where $p$ is an unknown parameter to estimate. The sample $(y_1, y_2, \ldots, y_n)$ is observed.

This is an explicit assumption regarding the distribution of $Y_i$. Can we find an estimate (estimator) of $p$?
Example 1 (cont'd)

The joint distribution of the sample is:
$$\mathbb{P}\left(\bigcap_{i=1}^{n} (Y_i = y_i)\right) = \prod_{i=1}^{n} \mathbb{P}(Y_i = y_i) = \prod_{i=1}^{n} p^{y_i}(1-p)^{1-y_i} = p^{\sum_i y_i}(1-p)^{n - \sum_i y_i}.$$

The likelihood function is the joint density of the data, except that we treat it as a function of the parameter:
$$L(p \mid y) \equiv L(y; p) = \prod_{i=1}^{n} p^{y_i}(1-p)^{1-y_i}.$$
... The likely values of the unknown parameter given the realizations of the random variables...
Example 1 (cont'd)

Suppose that two estimates of $p$, given by $\hat{p}_{1,n}(y)$ and $\hat{p}_{2,n}(y)$, are such that:
$$L_n(y; \hat{p}_{1,n}(y)) > L_n(y; \hat{p}_{2,n}(y)).$$
The sample we observe, $y = (y_1, \ldots, y_n)$, is more likely to have occurred if $p = \hat{p}_{1,n}(y)$ than if $p = \hat{p}_{2,n}(y)$: $\hat{p}_{1,n}(y)$ is a more plausible value.
Example 1 (cont'd)

Under suitable regularity conditions, the maximum likelihood estimate (estimator) is defined to be:
$$\hat{p} = \underset{p}{\operatorname{argmax}}\, L(y; p) = \underset{p}{\operatorname{argmax}}\, \ell(y; p)$$
where $\ell(y; p) = \log(L(y; p))$ is the log-likelihood function.

The maximum likelihood estimate is $\hat{p}(y) = \frac{1}{n}\sum_{i=1}^{n} y_i$; the maximum likelihood estimator is $\hat{p} = \frac{1}{n}\sum_{i=1}^{n} Y_i$.
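Numerical illustration (not in the original slides; the sample below is made up): a grid search confirms that the sample mean maximizes the Bernoulli log-likelihood.

```python
import math

def bernoulli_loglik(p, ys):
    # l(y; p) = sum_i [ y_i log p + (1 - y_i) log(1 - p) ]
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y in ys)

ys = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]   # hypothetical sample: 7 ones out of 10
p_hat = sum(ys) / len(ys)             # the ML estimate: the sample mean, 0.7

# Grid search: no other value of p on a fine grid beats p_hat.
grid = [k / 1000 for k in range(1, 1000)]
best = max(grid, key=lambda p: bernoulli_loglik(p, ys))
```

Since the log-likelihood is strictly concave in $p$, the grid maximizer coincides with the closed-form estimate.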
How do we apply the maximum likelihood principle to the multiple linear regression model? What are the main properties of the maximum likelihood estimator? Is it asymptotically unbiased? Is it asymptotically efficient? Under which condition(s)? Is it consistent? What is its asymptotic distribution? What are the main properties of a transformation of the estimator, say $\theta = g(p)$?
... All of these questions are answered in this lecture...
2. Maximum likelihood estimator
2.1 Notation

Consider the multiple linear regression model:
$$y_i = x_i' b + u_i$$
where the error terms are spherical and the observations $(y_i, x_i)$, $i = 1, \ldots, n$, are i.i.d.

The joint density function is given by:
$$f(y_i, x_i) \equiv L_i(y_i, x_i; \theta) \equiv L(y_i, x_i; \theta)$$
where $\theta = (b', \sigma^2)'$. By definition,
$$f(y_i, x_i) = f(y_i \mid x_i) f(x_i)$$
where $f(y_i \mid x_i)$ is the conditional density of $Y_i \mid X_i = x_i$ and $f(x_i)$ is the marginal density of $X_i$.
To get the (log-)likelihood function, one needs some parametric assumptions:
1. One can specify the conditional distribution of $u \mid X$, i.e. the conditional distribution of $Y \mid X$:
$$u \mid X \sim \mathcal{N}(0_{n \times 1}, \sigma^2 I_n), \quad \text{i.e.} \quad Y \mid X \sim \mathcal{N}(Xb, \sigma^2 I_n).$$
2. One can specify the joint (multivariate) distribution of $(X, Y)$ and the marginal (multivariate) distribution of $X$:
$$\begin{pmatrix} Y \\ X \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mathbb{E}Y \\ \mathbb{E}X \end{pmatrix}, \begin{pmatrix} \Sigma_{yy} & \Sigma_{yx} \\ \Sigma_{xy} & \Sigma_{xx} \end{pmatrix} \right), \quad \text{with} \quad X \sim \mathcal{N}(\mathbb{E}X, \Sigma_{xx}).$$
In the first case (conditional distribution), the estimator of $\theta$ can be obtained from the conditional (log-)likelihood function. In the second case (joint distribution), the estimator of $\theta$ can be derived from the joint (log-)likelihood function.

The joint likelihood function is the product of the conditional likelihood function and the marginal likelihood function (the information provided by the marginal distribution of $X$). The joint log-likelihood function is the sum of the conditional log-likelihood function and the marginal log-likelihood function.
The conditional and marginal (log-)likelihood functions (and thus the joint and conditional (log-)likelihood functions) are conceptually different, and so are the two corresponding estimators (especially in finite samples). Choosing one or the other depends on the empirical setting. For instance, the distribution of the sample data can be conditionally normal but not jointly normal (e.g., the variables $X$ are arbitrarily determined in some experimental settings).

In the sequel, we only consider the conditional maximum likelihood estimator (under the assumption of independent samples).
2.2 Likelihood and log-likelihood

Definition. The (conditional) likelihood function is defined to be:
$$L_n : \mathcal{Y} \times \Theta \to [0, +\infty), \quad ((y, x), \theta) \mapsto L_n(y \mid x; \theta) = \prod_{i=1}^{n} L_i(y_i \mid x_i; \theta).$$

Remark: The conditional likelihood function is the joint conditional density of the data, viewed as a function of the unknown parameter $\theta$.
Definition. The (conditional) log-likelihood function is defined to be:
$$\ell_n : \mathcal{Y} \times \Theta \to \mathbb{R}, \quad ((y, x), \theta) \mapsto \ell_n(y \mid x; \theta) = \sum_{i=1}^{n} \log L_i(y_i \mid x_i; \theta).$$
Application: the multiple linear regression model. Under the conditional normality assumption,
$$L_i(y_i \mid x_i; \theta) = (2\pi\sigma^2)^{-1/2} \exp\left( -\frac{(y_i - x_i'b)^2}{2\sigma^2} \right).$$
Therefore
$$L_n(y \mid x; \theta) = \prod_{i=1}^{n} L_i(y_i \mid x_i; \theta) = (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i'b)^2 \right)$$
and
$$\ell_n(y \mid x; \theta) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i'b)^2.$$
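Numerical sketch (not in the original slides; the data points are made up): for a scalar regressor, the conditional log-likelihood above equals the sum of the per-observation log-densities.

```python
import math

def normal_reg_loglik(b, sigma2, data):
    # l_n(y|x; b, sigma^2) for a scalar-regressor model y_i = x_i*b + u_i
    n = len(data)
    rss = sum((y - x * b) ** 2 for x, y in data)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * n * math.log(sigma2)
            - rss / (2 * sigma2))

def obs_logdensity(b, sigma2, x, y):
    # log L_i(y_i|x_i; theta): log of the conditional normal density
    return -0.5 * math.log(2 * math.pi * sigma2) - (y - x * b) ** 2 / (2 * sigma2)

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # hypothetical (x_i, y_i) pairs
total = normal_reg_loglik(1.9, 0.5, data)
per_obs = sum(obs_logdensity(1.9, 0.5, x, y) for x, y in data)
# total and per_obs agree: the joint conditional log-likelihood is additive.
```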
2.3 Maximum likelihood principle

Definition. A maximum likelihood estimator of $\theta \in \Theta \subseteq \mathbb{R}^k$ is a solution to the maximization problem:
$$\hat{\theta}_n = \underset{\theta \in \Theta}{\operatorname{argmax}}\, L_n(\theta) \quad \text{or equivalently} \quad \hat{\theta}_n = \underset{\theta \in \Theta}{\operatorname{argmax}}\, \ell_n(\theta).$$
2.3 The maximum likelihood principle: using the first-order conditions

Definition. Under suitable regularity conditions, a maximum likelihood estimator of $\theta \in \Theta \subseteq \mathbb{R}^k$ is defined to be the solution of the first-order conditions (the likelihood or log-likelihood equations):
$$\frac{\partial L_n}{\partial \theta}(y \mid x; \hat{\theta}_n) = 0_{k \times 1} \quad \text{or} \quad \frac{\partial \ell_n}{\partial \theta}(y \mid x; \hat{\theta}_n) = 0_{k \times 1}.$$

Remark: Regularity conditions are fundamental!
Application: the multiple linear regression model (cont'd)

Under suitable regularity conditions, the first-order conditions are given by:
$$\begin{cases} \dfrac{\partial \ell_n}{\partial b}(y \mid x; \hat{\theta}_n) = 0_{k \times 1} \\[4pt] \dfrac{\partial \ell_n}{\partial \sigma^2}(y \mid x; \hat{\theta}_n) = 0 \end{cases} \iff \begin{cases} \dfrac{1}{\hat{\sigma}_n^2} \sum_{i=1}^{n} x_i (y_i - x_i'\hat{b}_n) = 0_{k \times 1} \\[4pt] -\dfrac{n}{2\hat{\sigma}_n^2} + \dfrac{1}{2\hat{\sigma}_n^4} \sum_{i=1}^{n} (y_i - x_i'\hat{b}_n)^2 = 0. \end{cases}$$

The maximum likelihood estimate of $\theta$ is:
$$\hat{b}_n = \left( \sum_{i=1}^{n} x_i x_i' \right)^{-1} \sum_{i=1}^{n} x_i y_i, \qquad \hat{\sigma}_n^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i'\hat{b}_n)^2.$$
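Numerical sketch (not in the original slides; the toy dataset is made up): for a scalar regressor, the closed-form ML estimates make both first-order conditions vanish.

```python
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]  # hypothetical (x_i, y_i)
n = len(data)
sxx = sum(x * x for x, _ in data)
sxy = sum(x * y for x, y in data)
b_hat = sxy / sxx                              # (sum x_i x_i')^(-1) sum x_i y_i
resid = [y - x * b_hat for x, y in data]
sigma2_hat = sum(e * e for e in resid) / n     # (1/n) sum of squared residuals

# Both first-order conditions, evaluated at the ML estimates, are ~0:
foc_b = sum(x * e for (x, _), e in zip(data, resid)) / sigma2_hat
foc_s = -n / (2 * sigma2_hat) + sum(e * e for e in resid) / (2 * sigma2_hat ** 2)
```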
Second-order conditions: the Hessian matrix evaluated at $\theta = \hat{\theta}_n$ must be negative definite. The Hessian matrix is given by:
$$H = \begin{pmatrix} -\dfrac{1}{\sigma^2} \sum_i x_i x_i' & -\dfrac{1}{\sigma^4} \sum_i x_i (y_i - x_i'b) \\[4pt] -\dfrac{1}{\sigma^4} \sum_i x_i'(y_i - x_i'b) & \dfrac{n}{2\sigma^4} - \dfrac{1}{\sigma^6} \sum_i (y_i - x_i'b)^2 \end{pmatrix}$$
and
$$H\big|_{\theta = \hat{\theta}_n} = \begin{pmatrix} -\dfrac{1}{\hat{\sigma}_n^2} \sum_i x_i x_i' & 0_{k \times 1} \\ 0_{1 \times k} & -\dfrac{n}{2\hat{\sigma}_n^4} \end{pmatrix}.$$
Given that $X'X$ is positive definite and $\hat{\sigma}_n^2 > 0$, $H\big|_{\theta = \hat{\theta}_n}$ is negative definite and $\hat{\theta}_n$ is a maximum.
2.4 Equivariance principle

Definition. Under suitable regularity conditions, the maximum likelihood estimator of a function $g(\theta)$ of the parameter $\theta$ is $g(\hat{\theta}_n)$, where $\hat{\theta}_n$ is the maximum likelihood estimator of $\theta$.
Example: Suppose $Y_1, \ldots, Y_n$ are i.i.d. $\mathcal{E}(\theta)$. The likelihood function is:
$$L_n(y; \theta) = \prod_{i=1}^{n} \theta \exp(-\theta y_i) = \theta^n \exp\left( -\theta \sum_{i=1}^{n} y_i \right).$$
One gets (the second-order condition holds):
$$\hat{\theta}_n = \frac{1}{\bar{Y}_n}, \qquad \hat{\theta}_n(y) = \frac{1}{\bar{y}_n}.$$
Consider now the probability density function:
$$f_{Y_i}(y_i; \lambda) = \frac{1}{\lambda} \exp\left( -\frac{y_i}{\lambda} \right).$$
Example (cont'd): The log-likelihood function is:
$$\ell(y; \lambda) = -n \log(\lambda) - \frac{1}{\lambda} \sum_{i=1}^{n} y_i.$$
The first-order condition with respect to $\lambda$ is:
$$-\frac{n}{\lambda} + \frac{1}{\lambda^2} \sum_{i=1}^{n} y_i = 0.$$
Since the second-order condition holds, one gets (as is to be expected!):
$$\hat{\lambda}_n = \bar{Y}_n = \frac{1}{\hat{\theta}_n}, \qquad \hat{\lambda}_n(y) = \bar{y}_n = \frac{1}{\hat{\theta}_n(y)}.$$
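Numerical sketch of this equivariance example (not in the original slides; the sample is made up): maximizing the rate parametrization over a grid recovers $\hat{\theta}_n = 1/\bar{y}_n$, and the mean parametrization gives exactly the transformed estimate.

```python
import math

ys = [0.5, 1.2, 0.3, 2.0, 1.0]   # hypothetical i.i.d. exponential sample
n, s = len(ys), sum(ys)

def loglik_rate(theta):
    # l(y; theta) = n log(theta) - theta * sum(y_i)
    return n * math.log(theta) - theta * s

theta_hat = n / s                # closed-form MLE of the rate: 1 / ybar
lam_hat = s / n                  # MLE under the mean parametrization: ybar

# Grid check: theta_hat maximizes the rate log-likelihood, and the two
# parametrizations give exactly transformed estimates (equivariance).
grid = [k / 100 for k in range(1, 500)]
best = max(grid, key=loglik_rate)
```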
3. Fisher information
3.1 Score vector

Definition. The score vector, $s$, is the vector of first (partial) derivatives of the (conditional) log-likelihood with respect to the parameters $\theta \in \Theta \subseteq \mathbb{R}^k$:
$$s(\theta) \equiv \frac{\partial \ell_n}{\partial \theta}(Y \mid x; \theta) = \left( \frac{\partial \ell_n}{\partial \theta_i}(Y \mid x; \theta) \right)_{1 \le i \le k}.$$
It satisfies:
$$\mathbb{E}_\theta\left[ \frac{\partial \ell_n}{\partial \theta}(Y \mid x; \theta) \right] = 0_{k \times 1}, \quad \forall x, \theta.$$
Remark: $\mathbb{E}_\theta$ denotes the expectation with respect to the conditional distribution of $Y \mid X$.
Application: the multiple linear regression model. The score vector is given by:
$$s(\theta) = \begin{pmatrix} \dfrac{1}{\sigma^2} \sum_i x_i (Y_i - x_i'b) \\[4pt] -\dfrac{n}{2\sigma^2} + \dfrac{1}{2\sigma^4} \sum_i (Y_i - x_i'b)^2 \end{pmatrix}.$$
One has $\mathbb{E}_\theta[s(\theta)] = 0_{(k+1) \times 1}$ since:
$$\mathbb{E}_\theta\left[ \frac{\partial \ell_n}{\partial b}(Y \mid x; b, \sigma^2) \right] = \frac{1}{\sigma^2} \sum_i x_i \left( \mathbb{E}_\theta(Y_i \mid x_i) - x_i'b \right) = 0_{k \times 1}$$
and
$$\mathbb{E}_\theta\left[ -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_i (Y_i - x_i'b)^2 \right] = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_i \underbrace{\mathbb{E}_\theta\left[ (Y_i - \mathbb{E}(Y_i \mid x_i))^2 \right]}_{\mathbb{V}_\theta(Y_i \mid x_i) = \sigma^2} = 0.$$
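A Monte Carlo sketch of the zero-mean property of the score (not in the original slides; regressor values, true parameters, and replication count are all chosen for illustration): averaging both score components over many simulated samples at the true parameters gives values close to zero.

```python
import random

random.seed(0)
# Scalar-regressor model y_i = x_i * b0 + u_i, u_i ~ N(0, sigma0_sq),
# with the regressors held fixed (conditioned on).
b0, sigma0_sq = 2.0, 1.0
xs = [1.0, 2.0, 3.0]
reps = 20000
score_b = score_s = 0.0
for _ in range(reps):
    ys = [x * b0 + random.gauss(0.0, sigma0_sq ** 0.5) for x in xs]
    resid = [y - x * b0 for x, y in zip(xs, ys)]
    # score w.r.t. b and w.r.t. sigma^2, evaluated at the true parameters:
    score_b += sum(x * e for x, e in zip(xs, resid)) / sigma0_sq
    score_s += (-len(xs) / (2 * sigma0_sq)
                + sum(e * e for e in resid) / (2 * sigma0_sq ** 2))
score_b /= reps
score_s /= reps   # both averages should be close to zero
```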
3.2 Fisher information matrix

Definition. The Fisher information matrix at $x$ is the variance-covariance matrix of the score vector:
$$\mathcal{I}_F^x = \mathbb{V}_\theta\left[ \frac{\partial \ell_n}{\partial \theta}(Y \mid x; \theta) \right] = \mathbb{E}_\theta\left[ \frac{\partial \ell_n}{\partial \theta}(Y \mid x; \theta) \cdot \frac{\partial \ell_n}{\partial \theta'}(Y \mid x; \theta) \right].$$
Definition. The Fisher information matrix at $x$ is also given by:
$$\mathcal{I}_F^x = -\mathbb{E}_\theta\left[ \frac{\partial^2 \ell_n}{\partial \theta \partial \theta'}(Y \mid x; \theta) \right].$$
Remarks:
1. Three equivalent definitions of the Fisher information matrix lead to three different consistent estimates of the Fisher information matrix.
2. Their finite sample properties can be quite different!
3. $\mathcal{I}_F^x$ can be defined from the Fisher information matrix for observation $i$.
Definition. The Fisher information matrix for observation $i$ (or $x_i$) can be defined by:
$$\tilde{\mathcal{I}}_F^{x_i}(\theta) = \mathbb{V}_\theta\left[ \frac{\partial \ell}{\partial \theta}(Y_i \mid x_i; \theta) \right] = \mathbb{E}_\theta\left[ \frac{\partial \ell}{\partial \theta}(Y_i \mid x_i; \theta) \cdot \frac{\partial \ell}{\partial \theta'}(Y_i \mid x_i; \theta) \right] = -\mathbb{E}_\theta\left[ \frac{\partial^2 \ell}{\partial \theta \partial \theta'}(Y_i \mid x_i; \theta) \right].$$
Proposition. The Fisher information matrix at $x = (x_1, \ldots, x_n)$ (or for $n$ observations) is given by:
$$\mathcal{I}_F^x(\theta) = \sum_{i=1}^{n} \tilde{\mathcal{I}}_F^{x_i}(\theta).$$
Remark: In a sampling model (with i.i.d. observations), one has $\mathcal{I}_F^x(\theta) = n \tilde{\mathcal{I}}_F^{x_i}(\theta)$.
Definition. The average Fisher information matrix for one observation is defined by:
$$\tilde{\mathcal{I}}_F(\theta) = \operatorname{plim}_{n \to \infty} \frac{1}{n} \mathcal{I}_F^X(\theta).$$

Theorem.
(a) $\tilde{\mathcal{I}}_F(\theta) = \mathbb{E}_{X_i}\left[ \tilde{\mathcal{I}}_F^{X_i}(\theta) \right]$
(b) $\tilde{\mathcal{I}}_F(\theta) = \mathbb{E}\left[ \dfrac{\partial \ell}{\partial \theta}(Y_i \mid X_i; \theta) \cdot \dfrac{\partial \ell}{\partial \theta'}(Y_i \mid X_i; \theta) \right]$
(c) $\tilde{\mathcal{I}}_F(\theta) = -\mathbb{E}\left[ \dfrac{\partial^2 \ell}{\partial \theta \partial \theta'}(Y_i \mid X_i; \theta) \right]$
A consistent estimator of the Fisher information matrix?

Proposition. If $\hat{\theta}_n$ converges in probability to $\theta_0$, then:
$$\hat{\mathcal{I}}_F^{(1)}(\hat{\theta}_{n,ML}) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{I}_F^{x_i}(\hat{\theta}_{n,ML})$$
$$\hat{\mathcal{I}}_F^{(2)}(\hat{\theta}_{n,ML}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial \ell_i}{\partial \theta}(y_i \mid x_i; \hat{\theta}_{n,ML}) \cdot \frac{\partial \ell_i}{\partial \theta'}(y_i \mid x_i; \hat{\theta}_{n,ML})$$
$$\hat{\mathcal{I}}_F^{(3)}(\hat{\theta}_{n,ML}) = -\frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2 \ell_i}{\partial \theta \partial \theta'}(y_i \mid x_i; \hat{\theta}_{n,ML}) = -\frac{1}{n} \frac{\partial^2 \ell_n}{\partial \theta \partial \theta'}(y \mid x; \hat{\theta}_{n,ML})$$
are three consistent estimators of the Fisher information matrix.
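To see that these estimators can disagree in a finite sample, here is a sketch (not in the original slides; the sample is made up) for the exponential model with rate $\theta$, where $\ell_i = \log\theta - \theta y_i$, the score is $1/\theta - y_i$, and the second derivative is $-1/\theta^2$:

```python
ys = [0.5, 1.2, 0.3, 2.0, 1.0]    # hypothetical exponential sample (rate form)
n = len(ys)
theta_hat = n / sum(ys)           # MLE of the rate: 1 / ybar

# Outer-product estimator (2) vs. negative-Hessian estimator (3), at theta_hat:
i_hat_2 = sum((1 / theta_hat - y) ** 2 for y in ys) / n   # sample variance of y
i_hat_3 = 1 / theta_hat ** 2                              # ybar^2
# Both are consistent for 1/theta0^2, but they clearly differ in this sample.
```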
These three consistent estimators of the Fisher information matrix are asymptotically equivalent, and none of them is preferable to the others on statistical grounds. The main difficulty is that they can have very different finite sample properties (again!). This can lead to different statistical conclusions for the same problem!
Application: the multiple linear regression model. Computation of $\tilde{\mathcal{I}}_F(\theta)$:

The Hessian matrix of the log-likelihood function for observation $i$ is:
$$\frac{\partial^2 \ell_i}{\partial \theta \partial \theta'}(y_i \mid x_i; \theta) = \begin{pmatrix} -\dfrac{1}{\sigma^2} x_i x_i' & -\dfrac{1}{\sigma^4} x_i (y_i - x_i'b) \\[4pt] -\dfrac{1}{\sigma^4} x_i'(y_i - x_i'b) & \dfrac{1}{2\sigma^4} - \dfrac{1}{\sigma^6} (y_i - x_i'b)^2 \end{pmatrix}.$$
Taking the expectation with respect to the conditional distribution of $Y_i \mid X_i = x_i$:
$$\mathbb{E}_\theta\left[ \frac{\partial^2 \ell_i}{\partial \theta \partial \theta'}(\cdot) \,\Big|\, x_i \right] = \begin{pmatrix} -\dfrac{1}{\sigma^2} x_i x_i' & 0_{k \times 1} \\ 0_{1 \times k} & -\dfrac{1}{2\sigma^4} \end{pmatrix}.$$
Then, taking the expectation with respect to the distribution of $X_i$:
$$\tilde{\mathcal{I}}_F(\theta) = -\mathbb{E}_{X_i}\left[ \mathbb{E}_\theta\left[ \frac{\partial^2 \ell_i}{\partial \theta \partial \theta'}(\cdot) \,\Big|\, X_i \right] \right] = \begin{pmatrix} \dfrac{1}{\sigma^2} \mathbb{E}(X_i X_i') & 0_{k \times 1} \\ 0_{1 \times k} & \dfrac{1}{2\sigma^4} \end{pmatrix}.$$
4. Asymptotic results
4.1 Overview

Under certain regularity conditions, the maximum likelihood estimator $\hat{\theta}_n$ possesses many appealing properties:
1. The maximum likelihood estimator is consistent.
2. The maximum likelihood estimator is asymptotically normal: $\sqrt{n}(\hat{\theta}_n - \theta_0) \overset{d}{\to} \mathcal{N}(\cdot, \cdot)$.
3. The maximum likelihood estimator is asymptotically optimal or efficient.
4. The maximum likelihood estimator is equivariant: if $\hat{\theta}_n$ is an estimator of $\theta_0$, then $g(\hat{\theta}_n)$ is an estimator of $g(\theta_0)$.
At the same time:
- These properties depend on the explicit assumptions regarding $Y_1, \ldots, Y_n$.
- Finite sample properties can be very different from large sample properties:
  - The maximum likelihood estimator is consistent but can be severely biased in finite samples.
  - The estimation of the variance-covariance matrix can be seriously doubtful in finite samples.
4.2 Consistency

Theorem. Under suitable regularity conditions, $\hat{\theta}_{n,ML} \overset{a.s.}{\to} \theta_0$.

Remark: This implies that $\hat{\theta}_{n,ML} \overset{p}{\to} \theta_0$.
4.3 Asymptotic efficiency

Proposition. An unbiased maximum likelihood estimator of $\theta$ or $g(\theta)$ attains the FDCR lower bound and is thus (asymptotically) efficient.
4.4 Large sample distribution

Theorem. Under suitable regularity conditions,
$$\sqrt{n}(\hat{\theta}_{n,ML} - \theta_0) \overset{d}{\to} \mathcal{N}\left( 0, \tilde{\mathcal{I}}_F^{-1}(\theta_0) \right), \quad \text{i.e.} \quad \hat{\theta}_{n,ML} \overset{a}{\sim} \mathcal{N}\left( \theta_0, n^{-1} \tilde{\mathcal{I}}_F^{-1}(\theta_0) \right).$$
Remark: In a sampling model, $\mathcal{I}_F(\theta_0)$ does not depend on $x$ and is the Fisher information matrix for one observation:
$$\sqrt{n}(\hat{\theta}_{n,ML} - \theta_0) \overset{d}{\to} \mathcal{N}\left( 0, \mathcal{I}_1^{-1}(\theta_0) \right).$$
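A Monte Carlo sketch of this theorem (not in the original slides; sample size, true rate, and replication count are chosen for illustration): for the exponential model, the per-observation information is $1/\theta^2$, so the theorem predicts $\operatorname{Var}(\hat{\theta}_n) \approx \theta_0^2 / n$.

```python
import random

random.seed(1)
# Simulate the sampling distribution of the exponential-rate MLE and compare
# its variance with the asymptotic prediction I_1^{-1}(theta0)/n = theta0^2/n.
theta0, n, reps = 2.0, 200, 2000
ests = []
for _ in range(reps):
    ys = [random.expovariate(theta0) for _ in range(n)]
    ests.append(len(ys) / sum(ys))          # MLE of the rate: 1 / ybar
mean_est = sum(ests) / reps
var_est = sum((t - mean_est) ** 2 for t in ests) / reps
# Theory predicts mean near 2.0 and variance near 4/200 = 0.02.
```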
Interpretation: For large $n$, the distribution of $\hat{\theta}_n$ is approximately normal, with expectation the true unknown parameter and variance-covariance matrix the FDCR lower bound. The maximum likelihood estimator is asymptotically unbiased and asymptotically efficient.
4.5 Back to the equivariance...

Proposition. Assume H1, H2, and H3-H8 hold, and that $g$ is a continuously differentiable function of $\theta$ from $\mathbb{R}^k$ to $\mathbb{R}^p$. Then:
$$g(\hat{\theta}_n) \overset{a.s.}{\to} g(\theta_0)$$
$$\sqrt{n}\left( g(\hat{\theta}_n) - g(\theta_0) \right) \overset{d}{\to} \mathcal{N}\left( 0, \left[ \frac{\partial g}{\partial \theta'}(\theta_0) \right] \tilde{\mathcal{I}}_F^{-1}(\theta_0) \left[ \frac{\partial g}{\partial \theta'}(\theta_0) \right]' \right).$$
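A worked delta-method sketch (not in the original slides): in the exponential model, take $g(\theta) = 1/\theta$ (rate to mean). With per-observation information $1/\theta^2$, the sandwich formula gives asymptotic variance $(-1/\theta_0^2)\,\theta_0^2\,(-1/\theta_0^2) = 1/\theta_0^2$, which matches the CLT applied directly to $\hat{\lambda}_n = \bar{Y}_n$ since $\operatorname{Var}(Y_i) = 1/\theta_0^2$.

```python
# Delta-method variance for g(theta) = 1/theta in the exponential model.
theta0 = 2.0
avar_theta = theta0 ** 2                  # I_1^{-1}(theta0), asym. var. of sqrt(n)(theta_hat - theta0)
g_prime = -1.0 / theta0 ** 2              # dg/dtheta at theta0
avar_g = g_prime * avar_theta * g_prime   # sandwich formula: 1 / theta0^2

# Direct check: sqrt(n)(ybar - 1/theta0) -> N(0, Var(Y_i)) = N(0, 1/theta0^2),
# the same limit — the two routes to the asymptotic variance agree.
```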
Application: the multiple linear regression model. The inverse Fisher information matrix is given by:
$$\tilde{\mathcal{I}}_F^{-1}(\theta_0) = \begin{pmatrix} \sigma_0^2 \left( \mathbb{E}X_iX_i' \right)^{-1} & 0_{k \times 1} \\ 0_{1 \times k} & 2\sigma_0^4 \end{pmatrix}.$$
Therefore:
$$\sqrt{n}(\hat{b}_{n,ML} - b_0) \overset{d}{\to} \mathcal{N}\left( 0, \sigma_0^2 (\mathbb{E}X_iX_i')^{-1} \right), \qquad \sqrt{n}(\hat{\sigma}^2_{n,ML} - \sigma_0^2) \overset{d}{\to} \mathcal{N}(0, 2\sigma_0^4).$$
The two vectors $\sqrt{n}(\hat{b}_{n,ML} - b_0)$ and $\sqrt{n}(\hat{\sigma}^2_{n,ML} - \sigma_0^2)$ are asymptotically independent.
A consistent estimate of the Fisher information matrix is given by:
$$\hat{\tilde{\mathcal{I}}}_F^{(1)} = -\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_\theta\left[ \frac{\partial^2 \ell(y_i \mid x_i; \hat{\theta}_n)}{\partial \theta \partial \theta'} \,\Big|\, x_i \right] = \begin{pmatrix} \dfrac{1}{\hat{\sigma}_n^2} \left( \dfrac{1}{n} X'X \right) & 0_{k \times 1} \\ 0_{1 \times k} & \dfrac{1}{2\hat{\sigma}_n^4} \end{pmatrix}$$
so that:
$$\hat{b}_{n,ML} \overset{a}{\sim} \mathcal{N}\left( b_0, \hat{\sigma}_n^2 (X'X)^{-1} \right), \qquad \hat{\sigma}^2_{n,ML} \overset{a}{\sim} \mathcal{N}\left( \sigma_0^2, \frac{2\hat{\sigma}_n^4}{n} \right).$$