Maximum Likelihood Estimation
Merlise Clyde
STA721 Linear Models
Duke University
August 31, 2017
Outline

Topics:
- Likelihood Function
- Projections
- Maximum Likelihood Estimates

Readings: Christensen Chapters 1-2, Appendix A, and Appendix B
Models

Take a random vector $Y \in \mathbb{R}^n$ which is observable and decompose $Y = \mu + \epsilon$ into $\mu \in \mathbb{R}^n$ (unknown, fixed) and $\epsilon \in \mathbb{R}^n$, an unobservable error vector (random).

Usual assumptions:
- $E[\epsilon_i] = 0 \;\forall i \;\Leftrightarrow\; E[\epsilon] = 0 \;\Leftrightarrow\; E[Y] = \mu$ (mean vector)
- $\epsilon_i$ independent with $Var(\epsilon_i) = \sigma^2$ and $Cov(\epsilon_i, \epsilon_j) = 0$
- Matrix version: $Cov[\epsilon] \equiv \left[E\{(\epsilon_i - E[\epsilon_i])(\epsilon_j - E[\epsilon_j])\}\right]_{ij} = \sigma^2 I_n$, so $Cov[Y] = \sigma^2 I_n$ (errors are uncorrelated)
- $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$ implies that $Y_i \overset{ind}{\sim} N(\mu_i, \sigma^2)$
Likelihood Functions

The likelihood function for $(\mu, \sigma^2)$ is proportional to the sampling distribution of the data:
$$
\begin{aligned}
L(\mu, \sigma^2) &\propto \prod_{i=1}^n \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2}\frac{(y_i - \mu_i)^2}{\sigma^2}\right\} \\
&= (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2}\frac{\sum_i (Y_i - \mu_i)^2}{\sigma^2}\right\} \\
&\propto (\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2}\frac{(Y - \mu)^T(Y - \mu)}{\sigma^2}\right\} \\
&= (\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2}\frac{\|Y - \mu\|^2}{\sigma^2}\right\} \\
&\propto (2\pi)^{-n/2}\,|\sigma^2 I_n|^{-1/2} \exp\left\{-\frac{1}{2}\frac{\|Y - \mu\|^2}{\sigma^2}\right\}
\end{aligned}
$$
The last line is the density of $Y \sim N_n(\mu, \sigma^2 I_n)$.
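As a minimal numerical sketch (not from the slides, with made-up data), the product of the $n$ univariate $N(\mu_i, \sigma^2)$ log-densities equals the multivariate $N_n(\mu, \sigma^2 I_n)$ log-density written with the squared norm:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
mu = rng.normal(size=n)          # illustrative mean vector
sigma2 = 2.0                     # illustrative variance
y = mu + rng.normal(scale=np.sqrt(sigma2), size=n)

# Sum of univariate N(mu_i, sigma^2) log-densities
loglik_prod = np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                     - 0.5 * (y - mu) ** 2 / sigma2)

# Multivariate N_n(mu, sigma^2 I_n) log-density via ||y - mu||^2
loglik_mvn = (-0.5 * n * np.log(2 * np.pi * sigma2)
              - 0.5 * np.sum((y - mu) ** 2) / sigma2)

assert np.isclose(loglik_prod, loglik_mvn)
```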
MLEs

Find values $\hat\mu$ and $\hat\sigma^2$ that maximize the likelihood $L(\mu, \sigma^2)$ for $\mu \in \mathbb{R}^n$ and $\sigma^2 \in \mathbb{R}^+$:
$$L(\mu, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2}\frac{\|Y - \mu\|^2}{\sigma^2}\right\}$$
or equivalently the log likelihood
$$\log L(\mu, \sigma^2) \doteq -\frac{n}{2}\log(\sigma^2) - \frac{1}{2}\frac{\|Y - \mu\|^2}{\sigma^2}$$
Clearly $\hat\mu = Y$, but then $\hat\sigma^2 = 0$, which is outside the parameter space. Need restrictions on $\mu$: $\mu = X\beta$.
Column Space

Let $X_1, X_2, \ldots, X_p \in \mathbb{R}^n$. The set of all linear combinations of $X_1, \ldots, X_p$ is the space spanned by $X_1, \ldots, X_p$, written $S(X_1, \ldots, X_p)$.
- Let $X = [X_1\; X_2\; \cdots\; X_p]$ be the $n \times p$ matrix with columns $X_j$; then the column space of $X$, $C(X) = S(X_1, \ldots, X_p)$, is the space spanned by the (column) vectors of $X$
- $\mu \in C(X)$: $C(X) = \{\mu \in \mathbb{R}^n \text{ such that } X\beta = \mu \text{ for some } \beta \in \mathbb{R}^p\}$ (also called the range of $X$, $R(X)$)
- $\beta$ are the coordinates of $\mu$ in this space
- $C(X)$ is a subspace of $\mathbb{R}^n$
- Many equivalent ways to represent the same mean vector; inference should be independent of the coordinate system used
Projections

$\mu = X\beta$ with $X$ full rank, so $\mu \in C(X)$. Let
$$P_X = X(X^TX)^{-1}X^T$$
$P_X$ is the orthogonal projection operator onto the column space of $X$; i.e. $P = P^2$ (idempotent, a projection) and $P = P^T$ (symmetric, hence orthogonal):
- Idempotent: $P_X^2 = P_XP_X = X(X^TX)^{-1}X^TX(X^TX)^{-1}X^T = X(X^TX)^{-1}X^T = P_X$
- Symmetric: $P_X^T = (X(X^TX)^{-1}X^T)^T = (X^T)^T((X^TX)^{-1})^TX^T = X(X^TX)^{-1}X^T = P_X$
- $P_X\mu = P_XX\beta = X(X^TX)^{-1}X^TX\beta = X\beta = \mu$
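The three properties above can be checked numerically; this is an illustrative sketch with a made-up full-column-rank design matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))           # random X is full column rank (a.s.)
P = X @ np.linalg.inv(X.T @ X) @ X.T   # P_X = X (X^T X)^{-1} X^T

assert np.allclose(P @ P, P)           # idempotent: P^2 = P
assert np.allclose(P, P.T)             # symmetric: P = P^T

beta = rng.normal(size=3)
mu = X @ beta                          # any mu in C(X)
assert np.allclose(P @ mu, mu)         # P_X fixes vectors in C(X)
```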
Projections

Claim: $I - P_X$ is an orthogonal projection onto $C(X)^\perp$, the orthogonal complement of $C(X)$.
- Idempotent: $(I - P_X)^2 = (I - P_X)(I - P_X) = I - P_X - P_X + P_XP_X = I - P_X - P_X + P_X = I - P_X$
- Symmetric: $I - P_X = (I - P_X)^T$
- $u \in C(X)^\perp$ means that $u^Tv = 0$ for every $v \in C(X)$
- If $u \in C(X)^\perp$, then $(I - P_X)u = u$ (projection); if $v \in C(X)$, then $(I - P_X)v = v - v = 0$
Log Likelihood

$Y \sim N(\mu, \sigma^2 I_n)$ with $\mu = X\beta$ and $X$ full column rank.

Claim: the Maximum Likelihood Estimator (MLE) of $\mu$ is $P_XY$.

Log likelihood:
$$\log L(\mu, \sigma^2) = -\frac{n}{2}\log(\sigma^2) - \frac{1}{2}\frac{\|Y - \mu\|^2}{\sigma^2}$$
- Decompose $Y = P_XY + (I - P_X)Y$
- Use $P_X\mu = \mu$
- Simplify $\|Y - \mu\|^2$
Expand $\|Y - \mu\|^2$:
$$
\begin{aligned}
\|Y - \mu\|^2 &= \|(I - P_X)Y + P_XY - \mu\|^2 \\
&= \|(I - P_X)Y + P_XY - P_X\mu\|^2 \\
&= \|(I - P_X)Y + P_X(Y - \mu)\|^2 \\
&= \|(I - P_X)Y\|^2 + \|P_X(Y - \mu)\|^2 + 2(Y - \mu)^TP_X^T(I - P_X)Y \\
&= \|(I - P_X)Y\|^2 + \|P_X(Y - \mu)\|^2 + 0 \\
&= \|(I - P_X)Y\|^2 + \|P_XY - \mu\|^2
\end{aligned}
$$
The cross-product term is zero because $P_X^T(I - P_X) = P_X(I - P_X) = P_X - P_XP_X = P_X - P_X = 0$.
Likelihood

Substitute the decomposition into the log likelihood:
$$
\begin{aligned}
\log L(\mu, \sigma^2) &= -\frac{n}{2}\log(\sigma^2) - \frac{1}{2}\frac{\|Y - \mu\|^2}{\sigma^2} \\
&= \underbrace{-\frac{n}{2}\log(\sigma^2) - \frac{1}{2}\frac{\|(I - P_X)Y\|^2}{\sigma^2}}_{\text{constant with respect to } \mu} \underbrace{- \frac{1}{2}\frac{\|P_XY - \mu\|^2}{\sigma^2}}_{\le\, 0}
\end{aligned}
$$
- Maximize with respect to $\mu$ for each $\sigma^2$: the RHS is largest when $\mu = P_XY$, for any choice of $\sigma^2$
- $\hat\mu = P_XY$ is the MLE of $\mu$ (yields fitted values $\hat Y = P_XY$)
MLE of β
$$
\begin{aligned}
\log L(\mu, \sigma^2) &= -\frac{n}{2}\log(\sigma^2) - \frac{1}{2}\left(\frac{\|(I - P_X)Y\|^2}{\sigma^2} + \frac{\|P_XY - \mu\|^2}{\sigma^2}\right) \\
\log L(\beta, \sigma^2) &= -\frac{n}{2}\log(\sigma^2) - \frac{1}{2}\left(\frac{\|(I - P_X)Y\|^2}{\sigma^2} + \frac{\|P_XY - X\beta\|^2}{\sigma^2}\right)
\end{aligned}
$$
- A similar argument shows that the RHS is maximized by minimizing $\|P_XY - X\beta\|^2$
- Therefore $\hat\beta$ is an MLE of $\beta$ if and only if it satisfies $P_XY = X\hat\beta$
- If $X^TX$ is full rank, the MLE of $\beta$ is $\hat\beta = (X^TX)^{-1}X^TY$
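A short sketch (with made-up data) checking the characterization $P_XY = X\hat\beta$ and the full-rank closed form, against a least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # (X^T X)^{-1} X^T Y

assert np.allclose(X @ beta_hat, P @ Y)        # X beta_hat = P_X Y
# Same answer as a generic least-squares routine
assert np.allclose(beta_hat, np.linalg.lstsq(X, Y, rcond=None)[0])
```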
MLE of σ²

Plug in the MLE $\hat\mu$ for $\mu$ and differentiate with respect to $\sigma^2$:
$$\log L(\hat\mu, \sigma^2) = -\frac{n}{2}\log\sigma^2 - \frac{1}{2}\frac{\|(I - P_X)Y\|^2}{\sigma^2}$$
$$\frac{\partial \log L(\hat\mu, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2}\frac{1}{\sigma^2} + \frac{1}{2}\|(I - P_X)Y\|^2\left(\frac{1}{\sigma^2}\right)^2$$
Set the derivative to zero and solve for the MLE:
$$0 = -\frac{n}{2}\frac{1}{\hat\sigma^2} + \frac{1}{2}\|(I - P_X)Y\|^2\left(\frac{1}{\hat\sigma^2}\right)^2
\;\Longrightarrow\; \frac{n}{2}\hat\sigma^2 = \frac{1}{2}\|(I - P_X)Y\|^2
\;\Longrightarrow\; \hat\sigma^2 = \frac{\|(I - P_X)Y\|^2}{n}$$
Estimate of σ²

Maximum likelihood estimate of $\sigma^2$:
$$\hat\sigma^2 = \frac{\|(I - P_X)Y\|^2}{n} = \frac{Y^T(I - P_X)^T(I - P_X)Y}{n} = \frac{Y^T(I - P_X)Y}{n} = \frac{e^Te}{n}$$
where $e = (I - P_X)Y$ are the residuals from the regression of $Y$ on $X$.
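The chain of equalities above can be verified on made-up data; each expression for $\hat\sigma^2$ gives the same number:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 4
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - P
e = M @ Y                               # residuals e = (I - P_X) Y

sigma2_mle = (e @ e) / n                # e^T e / n
assert np.isclose(sigma2_mle, np.sum((Y - P @ Y) ** 2) / n)   # ||(I-P)Y||^2 / n
assert np.isclose(sigma2_mle, (Y @ M @ Y) / n)                # Y^T (I-P) Y / n
```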
Geometric View
- Fitted values: $\hat Y = P_XY = X\hat\beta$
- Residuals: $e = (I - P_X)Y$
- $Y = \hat Y + e$
- $\|Y\|^2 = \|P_XY\|^2 + \|(I - P_X)Y\|^2$
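A quick numerical illustration (made-up data) of the Pythagorean decomposition: fitted values and residuals are orthogonal, so their squared norms add:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 12
X = rng.normal(size=(n, 2))
Y = rng.normal(size=n)
P = X @ np.linalg.inv(X.T @ X) @ X.T

Y_hat = P @ Y                    # fitted values P_X Y
e = Y - Y_hat                    # residuals (I - P_X) Y
assert np.isclose(Y_hat @ e, 0.0, atol=1e-8)     # orthogonality
assert np.isclose(Y @ Y, Y_hat @ Y_hat + e @ e)  # ||Y||^2 = ||PY||^2 + ||(I-P)Y||^2
```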
Properties
- $\hat Y = \hat\mu$ is an unbiased estimate of $\mu = X\beta$: $E[\hat Y] = E[P_XY] = P_XE[Y] = P_X\mu = \mu$
- $E[e] = 0$ if $\mu \in C(X)$: $E[e] = E[(I - P_X)Y] = (I - P_X)E[Y] = (I - P_X)\mu = 0$
- $E[e]$ will not be $0$ if $\mu \notin C(X)$ (useful for model checking)
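Since both expectations reduce to deterministic statements about $(I - P_X)\mu$, they can be checked exactly; a sketch with made-up vectors:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 15
X = rng.normal(size=(n, 3))
P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - P

mu_in = X @ np.array([2.0, -1.0, 3.0])     # mu in C(X)
assert np.allclose(P @ mu_in, mu_in)       # E[Y_hat] = P_X mu = mu
assert np.allclose(M @ mu_in, 0.0)         # E[e] = (I - P_X) mu = 0

mu_out = mu_in + rng.normal(size=n)        # almost surely not in C(X)
assert not np.allclose(M @ mu_out, 0.0)    # E[e] != 0: model misspecification shows up
```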
Estimate of σ²

MLE of $\sigma^2$:
$$\hat\sigma^2 = \frac{e^Te}{n} = \frac{Y^T(I - P_X)Y}{n}$$
Is this an unbiased estimate of $\sigma^2$? Need expectations of quadratic forms $Y^TAY$, for $A$ an $n \times n$ matrix and $Y$ a random vector in $\mathbb{R}^n$.
Quadratic Forms

Without loss of generality we may take $A = A^T$:
- $Y^TAY$ is a scalar, so $Y^TAY = (Y^TAY)^T = Y^TA^TY$
- Therefore
$$Y^TAY = \frac{Y^TAY + Y^TA^TY}{2} = Y^T\left(\frac{A + A^T}{2}\right)Y$$
and we may replace $A$ by the symmetric matrix $(A + A^T)/2$.
Expectations of Quadratic Forms

Theorem. Let $Y$ be a random vector in $\mathbb{R}^n$ with $E[Y] = \mu$ and $Cov(Y) = \Sigma$. Then
$$E[Y^TAY] = tr(A\Sigma) + \mu^TA\mu$$
This result is useful for finding expected values of mean squares; no normality required!
Proof

Start with $(Y - \mu)^TA(Y - \mu)$, expand, and take expectations:
$$
\begin{aligned}
E[(Y - \mu)^TA(Y - \mu)] &= E[Y^TAY + \mu^TA\mu - \mu^TAY - Y^TA\mu] \\
&= E[Y^TAY] + \mu^TA\mu - \mu^TA\mu - \mu^TA\mu \\
&= E[Y^TAY] - \mu^TA\mu
\end{aligned}
$$
Rearrange, using $tr(A) \equiv \sum_{i=1}^n a_{ii}$ and the cyclic property of the trace:
$$
\begin{aligned}
E[Y^TAY] &= E[(Y - \mu)^TA(Y - \mu)] + \mu^TA\mu \\
&= E[tr\{(Y - \mu)^TA(Y - \mu)\}] + \mu^TA\mu \\
&= E[tr\{A(Y - \mu)(Y - \mu)^T\}] + \mu^TA\mu \\
&= tr\,E[A(Y - \mu)(Y - \mu)^T] + \mu^TA\mu \\
&= tr\{A\,E[(Y - \mu)(Y - \mu)^T]\} + \mu^TA\mu \\
&= tr(A\Sigma) + \mu^TA\mu
\end{aligned}
$$
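As an illustrative Monte Carlo sanity check of the theorem (made-up $\mu$, $A$, and $\Sigma$; normality is used only for convenient sampling, the identity itself does not require it):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4
mu = rng.normal(size=n)
A_raw = rng.normal(size=(n, n))
A = (A_raw + A_raw.T) / 2              # symmetrize (WLOG per previous slide)
L = rng.normal(size=(n, n))
Sigma = L @ L.T + n * np.eye(n)        # a positive-definite covariance

theory = np.trace(A @ Sigma) + mu @ A @ mu   # tr(A Sigma) + mu^T A mu

# Monte Carlo average of Y^T A Y with Y ~ N(mu, Sigma)
m = 200_000
Ys = rng.multivariate_normal(mu, Sigma, size=m)
vals = np.einsum("ij,jk,ik->i", Ys, A, Ys)   # row-wise quadratic forms
mc = vals.mean()
se = vals.std() / np.sqrt(m)                 # Monte Carlo standard error

assert abs(mc - theory) < 6 * se + 1e-8      # agreement within sampling error
```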
Expectation of σ̂²

Use the theorem with $A = I - P_X$, $\Sigma = \sigma^2 I$, and $\mu \in C(X)$ (so $\mu^T(I - P_X)\mu = 0$):
$$E[Y^T(I - P_X)Y] = tr\{(I - P_X)\sigma^2 I\} + \mu^T(I - P_X)\mu = \sigma^2\,tr(I - P_X) = \sigma^2\,r(I - P_X) = \sigma^2(n - r(X))$$
Therefore an unbiased estimate of $\sigma^2$ is
$$\frac{e^Te}{n - r(X)}$$
If $X$ is full rank ($r(X) = p$) and $P_X = X(X^TX)^{-1}X^T$, then
$$tr(P_X) = tr(X(X^TX)^{-1}X^T) = tr(X^TX(X^TX)^{-1}) = tr(I_p) = p$$
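The trace computations here are deterministic and easy to confirm; a sketch with a made-up full-rank design and an illustrative $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 25, 3
sigma2 = 2.5                            # illustrative true variance
X = rng.normal(size=(n, p))
P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - P

assert np.isclose(np.trace(P), p)       # tr(P_X) = p for full-rank X
assert np.isclose(np.trace(M), n - p)   # tr(I - P_X) = n - p

# E[Y^T (I-P_X) Y] = tr((I-P_X) sigma^2 I) + mu^T (I-P_X) mu = sigma^2 (n - p)
mu = X @ rng.normal(size=p)             # mu in C(X): second term vanishes
expected = sigma2 * np.trace(M) + mu @ M @ mu
assert np.isclose(expected, sigma2 * (n - p))
```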
Spectral Theorem

Theorem. If $A$ ($n \times n$) is a symmetric real matrix, then there exist a $U$ ($n \times n$) with $U^TU = UU^T = I_n$ and a diagonal matrix $\Lambda$ with elements $\lambda_i$ such that $A = U\Lambda U^T$.
- $U$ is an orthogonal matrix: $U^{-1} = U^T$
- The columns of $U$ form an orthonormal basis for $\mathbb{R}^n$
- The rank of $A$ equals the number of non-zero eigenvalues $\lambda_i$
- Columns of $U$ associated with non-zero eigenvalues form an ONB for $C(A)$ (eigenvectors of $A$)
- $A^p = U\Lambda^pU^T$ (matrix powers)
- A square root of $A > 0$ is $U\Lambda^{1/2}U^T$
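A numerical sketch of the theorem and its consequences, on a made-up symmetric matrix (using `numpy.linalg.eigh` for the symmetric eigendecomposition):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 5
B = rng.normal(size=(n, n))
A = (B + B.T) / 2                        # a symmetric real matrix

lam, U = np.linalg.eigh(A)               # A = U diag(lam) U^T
assert np.allclose(U @ U.T, np.eye(n))   # U orthogonal: U U^T = I
assert np.allclose(U @ np.diag(lam) @ U.T, A)

# Matrix powers: A^3 = U Lambda^3 U^T
assert np.allclose(U @ np.diag(lam ** 3) @ U.T,
                   np.linalg.matrix_power(A, 3))

# Square root of a positive-definite matrix: R = U Lambda^{1/2} U^T, R R = A
Apos = B @ B.T + np.eye(n)               # positive definite
lam2, U2 = np.linalg.eigh(Apos)
R = U2 @ np.diag(np.sqrt(lam2)) @ U2.T
assert np.allclose(R @ R, Apos)
```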
Projections

Projection Matrix: if $P$ is an orthogonal projection matrix, then its eigenvalues $\lambda_i$ are either zero or one, with $tr(P) = \sum_i \lambda_i = r(P)$.
- $P = U\Lambda U^T$ and $P = P^2$ imply $U\Lambda U^T = U\Lambda U^TU\Lambda U^T = U\Lambda^2U^T$, so $\Lambda = \Lambda^2$, which is true only for $\lambda_i = 1$ or $\lambda_i = 0$
- Since $r(P)$ is the number of non-zero eigenvalues, $r(P) = \sum_i \lambda_i = tr(P)$
- Partitioning $U = [U_P\; U_\perp]$,
$$P = [U_P\; U_\perp]\begin{bmatrix} I_r & 0 \\ 0 & 0_{n-r} \end{bmatrix}\begin{bmatrix} U_P^T \\ U_\perp^T \end{bmatrix} = U_PU_P^T = \sum_{i=1}^r u_iu_i^T$$
a sum of $r$ rank-one projections.
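These facts can be confirmed on the hat matrix of a made-up design; the eigenvalues are zero or one, the trace equals the rank, and $P$ is a sum of rank-one projections:

```python
import numpy as np

rng = np.random.default_rng(9)
n, r = 8, 3
X = rng.normal(size=(n, r))
P = X @ np.linalg.inv(X.T @ X) @ X.T     # orthogonal projection of rank r

lam, U = np.linalg.eigh(P)
# Eigenvalues are 0 (n - r of them) or 1 (r of them)
assert np.allclose(np.sort(lam), [0] * (n - r) + [1] * r, atol=1e-8)
assert np.isclose(np.trace(P), r)        # tr(P) = sum of eigenvalues = r(P)

# P = sum of r rank-one projections u_i u_i^T over the unit eigenvectors
U_P = U[:, lam > 0.5]
assert np.allclose(sum(np.outer(u, u) for u in U_P.T), P)
```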
Next Class
- Distribution theory
- Continue reading Chapters 1-2 and Appendices A & B in Christensen