Maximum Likelihood Estimation

Maximum Likelihood Estimation
Merlise Clyde
STA721 Linear Models, Duke University
August 31, 2017

Outline

Topics:
- Likelihood Function
- Projections
- Maximum Likelihood Estimates

Readings: Christensen Chapters 1-2, Appendix A, and Appendix B

Models

Take a random vector $Y \in \mathbb{R}^n$ which is observable and decompose it as $Y = \mu + \epsilon$, where $\mu \in \mathbb{R}^n$ is unknown and fixed and $\epsilon \in \mathbb{R}^n$ is an unobservable random error vector.

Usual assumptions:
- $E[\epsilon_i] = 0$ for all $i$, so $E[\epsilon] = 0$ and $E[Y] = \mu$ (mean vector)
- the $\epsilon_i$ are independent with $\mathrm{Var}(\epsilon_i) = \sigma^2$ and $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$
- matrix version: $\mathrm{Cov}[\epsilon] \equiv \left[ E\{(\epsilon_i - E[\epsilon_i])(\epsilon_j - E[\epsilon_j])\} \right]_{ij} = \sigma^2 I_n$, so $\mathrm{Cov}[Y] = \sigma^2 I_n$ (errors are uncorrelated)
- $\epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$ implies that $Y_i \overset{\text{ind}}{\sim} N(\mu_i, \sigma^2)$

Likelihood Functions

The likelihood function for $(\mu, \sigma^2)$ is proportional to the sampling distribution of the data:

$$
\begin{aligned}
L(\mu, \sigma^2) &\propto \prod_{i=1}^n \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2} \frac{(Y_i - \mu_i)^2}{\sigma^2} \right\} \\
&\propto (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2} \sum_i \frac{(Y_i - \mu_i)^2}{\sigma^2} \right\} \\
&\propto (\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2} \frac{(Y - \mu)^T (Y - \mu)}{\sigma^2} \right\} \\
&\propto (\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2} \frac{\|Y - \mu\|^2}{\sigma^2} \right\} \\
&\propto (2\pi)^{-n/2} |\sigma^2 I_n|^{-1/2} \exp\left\{ -\frac{1}{2} \frac{\|Y - \mu\|^2}{\sigma^2} \right\}
\end{aligned}
$$

The last line is the density of $Y \sim N_n(\mu, \sigma^2 I_n)$.
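
As a quick numerical sanity check (not from the slides), the following Python sketch verifies that the product of the univariate $N(\mu_i, \sigma^2)$ densities equals the joint $N_n(\mu, \sigma^2 I_n)$ density; the dimension, seed, and mean vector are arbitrary illustration choices, and numpy/scipy are assumed available.

    import numpy as np
    from scipy.stats import norm, multivariate_normal

    rng = np.random.default_rng(0)
    n, sigma2 = 5, 2.0
    mu = rng.normal(size=n)               # arbitrary mean vector for illustration
    y = rng.normal(mu, np.sqrt(sigma2))   # one draw of Y ~ N_n(mu, sigma^2 I)

    prod_of_marginals = np.prod(norm.pdf(y, loc=mu, scale=np.sqrt(sigma2)))
    joint_density = multivariate_normal.pdf(y, mean=mu, cov=sigma2 * np.eye(n))
    assert np.isclose(prod_of_marginals, joint_density)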

MLEs

Find the values $\hat\mu$ and $\hat\sigma^2$ that maximize the likelihood $L(\mu, \sigma^2)$ over $\mu \in \mathbb{R}^n$ and $\sigma^2 \in \mathbb{R}^+$,

$$L(\mu, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2} \frac{\|Y - \mu\|^2}{\sigma^2} \right\},$$

or equivalently the log likelihood

$$\log L(\mu, \sigma^2) = -\frac{n}{2}\log(\sigma^2) - \frac{1}{2}\frac{\|Y - \mu\|^2}{\sigma^2}.$$

Clearly $\hat\mu = Y$, but then $\hat\sigma^2 = 0$, which is outside the parameter space. We need restrictions on $\mu$: take $\mu = X\beta$.

Column Space

Let $X_1, X_2, \ldots, X_p \in \mathbb{R}^n$. The set of all linear combinations of $X_1, \ldots, X_p$ is the space spanned by them, written $S(X_1, \ldots, X_p)$.

Let $X = [X_1 \; X_2 \; \cdots \; X_p]$ be an $n \times p$ matrix with columns $X_j$. The column space of $X$ is $C(X) = S(X_1, \ldots, X_p)$, the space spanned by the (column) vectors of $X$:

$$C(X) = \{\mu \in \mathbb{R}^n : \mu = X\beta \text{ for some } \beta \in \mathbb{R}^p\}$$

(also called the range of $X$, $R(X)$). The vector $\beta$ gives the coordinates of $\mu$ in this space, and $C(X)$ is a subspace of $\mathbb{R}^n$. There are many equivalent ways to represent the same mean vector; inference should be independent of the coordinate system used.

Projections

Take $\mu = X\beta$ with $X$ of full column rank, so $\mu \in C(X)$, and define

$$P_X = X(X^T X)^{-1} X^T.$$

$P_X$ is the orthogonal projection operator onto the column space of $X$: it is idempotent ($P = P^2$, a projection) and symmetric ($P = P^T$, so the projection is orthogonal):

$$P_X^2 = P_X P_X = X(X^T X)^{-1} X^T X(X^T X)^{-1} X^T = X(X^T X)^{-1} X^T = P_X$$

$$P_X^T = \left(X(X^T X)^{-1} X^T\right)^T = (X^T)^T \left((X^T X)^{-1}\right)^T X^T = X(X^T X)^{-1} X^T = P_X$$

It leaves vectors in $C(X)$ unchanged:

$$P_X \mu = P_X X\beta = X(X^T X)^{-1} X^T X\beta = X\beta = \mu.$$
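
A minimal numpy sketch of the three facts above, using a randomly generated design matrix (full column rank with probability one); the sizes and seed are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 20, 3
    X = rng.normal(size=(n, p))              # illustrative design, full column rank w.p. 1
    P_X = X @ np.linalg.solve(X.T @ X, X.T)  # P_X = X (X^T X)^{-1} X^T

    assert np.allclose(P_X @ P_X, P_X)       # idempotent
    assert np.allclose(P_X, P_X.T)           # symmetric
    beta = rng.normal(size=p)
    assert np.allclose(P_X @ (X @ beta), X @ beta)  # P_X mu = mu for mu = X beta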

Projections

Claim: $I - P_X$ is the orthogonal projection onto $C(X)^\perp$, the orthogonal complement of $C(X)$.

Idempotent:
$$(I - P_X)^2 = (I - P_X)(I - P_X) = I - P_X - P_X + P_X P_X = I - P_X - P_X + P_X = I - P_X$$

Symmetric:
$$(I - P_X)^T = I - P_X^T = I - P_X$$

Let $u \in C(X)^\perp$, that is, $u^T v = 0$ for all $v \in C(X)$. Then $X^T u = 0$, so $P_X u = 0$ and $(I - P_X)u = u$ ($I - P_X$ acts as the identity on $C(X)^\perp$), while for $v \in C(X)$, $(I - P_X)v = v - v = 0$.

Log Likelihood

Assume $Y \sim N(\mu, \sigma^2 I_n)$ with $\mu = X\beta$ and $X$ of full column rank.

Claim: the maximum likelihood estimator (MLE) of $\mu$ is $P_X Y$.

Log likelihood:
$$\log L(\mu, \sigma^2) = -\frac{n}{2}\log(\sigma^2) - \frac{1}{2}\frac{\|Y - \mu\|^2}{\sigma^2}$$

Strategy:
- decompose $Y = P_X Y + (I - P_X)Y$
- use $P_X \mu = \mu$
- simplify $\|Y - \mu\|^2$

Expand:

$$
\begin{aligned}
\|Y - \mu\|^2 &= \|(I - P_X)Y + P_X Y - \mu\|^2 \\
&= \|(I - P_X)Y + P_X Y - P_X \mu\|^2 \\
&= \|(I - P_X)Y + P_X(Y - \mu)\|^2 \\
&= \|(I - P_X)Y\|^2 + \|P_X(Y - \mu)\|^2 + 2(Y - \mu)^T P_X^T (I - P_X)Y \\
&= \|(I - P_X)Y\|^2 + \|P_X(Y - \mu)\|^2 + 0 \\
&= \|(I - P_X)Y\|^2 + \|P_X Y - \mu\|^2
\end{aligned}
$$

The cross-product term is zero because
$$P_X^T (I - P_X) = P_X(I - P_X) = P_X - P_X P_X = P_X - P_X = 0.$$
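
The decomposition can be verified numerically; a small sketch with an illustrative design and a mean vector chosen inside $C(X)$ (all values arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 20, 3
    X = rng.normal(size=(n, p))
    P_X = X @ np.linalg.solve(X.T @ X, X.T)
    mu = X @ rng.normal(size=p)              # a mean vector in C(X)
    Y = mu + rng.normal(size=n)

    lhs = np.sum((Y - mu) ** 2)                                               # ||Y - mu||^2
    rhs = np.sum(((np.eye(n) - P_X) @ Y) ** 2) + np.sum((P_X @ Y - mu) ** 2)
    assert np.isclose(lhs, rhs)
    # the cross-product term vanishes (up to floating-point error)
    cross = (Y - mu) @ P_X.T @ (np.eye(n) - P_X) @ Y
    assert np.isclose(cross, 0.0)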

Likelihood

Substitute the decomposition into the log likelihood:

$$
\begin{aligned}
\log L(\mu, \sigma^2) &= -\frac{n}{2}\log(\sigma^2) - \frac{1}{2}\frac{\|Y - \mu\|^2}{\sigma^2} \\
&= -\frac{n}{2}\log(\sigma^2) - \frac{1}{2}\left( \frac{\|(I - P_X)Y\|^2}{\sigma^2} + \frac{\|P_X Y - \mu\|^2}{\sigma^2} \right) \\
&= \underbrace{-\frac{n}{2}\log(\sigma^2) - \frac{1}{2}\frac{\|(I - P_X)Y\|^2}{\sigma^2}}_{\text{constant with respect to } \mu} \; - \; \frac{1}{2}\underbrace{\frac{\|P_X Y - \mu\|^2}{\sigma^2}}_{\geq\, 0}
\end{aligned}
$$

Maximizing with respect to $\mu$ for each $\sigma^2$, the right-hand side is largest when $\mu = P_X Y$, for any choice of $\sigma^2$. Hence $\hat\mu = P_X Y$ is the MLE of $\mu$ (yielding the fitted values $\hat Y = P_X Y$).

MLE of β

$$\log L(\mu, \sigma^2) = -\frac{n}{2}\log(\sigma^2) - \frac{1}{2}\left( \frac{\|(I - P_X)Y\|^2}{\sigma^2} + \frac{\|P_X Y - \mu\|^2}{\sigma^2} \right)$$

$$\log L(\beta, \sigma^2) = -\frac{n}{2}\log(\sigma^2) - \frac{1}{2}\left( \frac{\|(I - P_X)Y\|^2}{\sigma^2} + \frac{\|P_X Y - X\beta\|^2}{\sigma^2} \right)$$

A similar argument shows that the right-hand side is maximized by minimizing $\|P_X Y - X\beta\|^2$. Therefore $\hat\beta$ is an MLE of $\beta$ if and only if it satisfies $P_X Y = X\hat\beta$. If $X^T X$ is of full rank, the MLE of $\beta$ is $\hat\beta = (X^T X)^{-1}X^T Y$.
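
A short sketch (illustrative data only) that computes $\hat\beta$ from the normal equations and confirms $X\hat\beta = P_X Y$; numpy's least-squares solver gives the same coefficients.

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 50, 4
    X = rng.normal(size=(n, p))
    Y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=n)   # hypothetical truth

    P_X = X @ np.linalg.solve(X.T @ X, X.T)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)        # (X^T X)^{-1} X^T Y
    assert np.allclose(X @ beta_hat, P_X @ Y)           # fitted values are the projection of Y
    assert np.allclose(beta_hat, np.linalg.lstsq(X, Y, rcond=None)[0])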

MLE of σ²

Plug the MLE $\hat\mu$ in for $\mu$ and differentiate with respect to $\sigma^2$:

$$\log L(\hat\mu, \sigma^2) = -\frac{n}{2}\log\sigma^2 - \frac{1}{2}\frac{\|(I - P_X)Y\|^2}{\sigma^2}$$

$$\frac{\partial \log L(\hat\mu, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2}\frac{1}{\sigma^2} + \frac{1}{2}\|(I - P_X)Y\|^2 \left(\frac{1}{\sigma^2}\right)^2$$

Set the derivative to zero and solve for the MLE:

$$
\begin{aligned}
0 &= -\frac{n}{2}\frac{1}{\hat\sigma^2} + \frac{1}{2}\|(I - P_X)Y\|^2 \left(\frac{1}{\hat\sigma^2}\right)^2 \\
\frac{n}{2}\hat\sigma^2 &= \frac{1}{2}\|(I - P_X)Y\|^2 \\
\hat\sigma^2 &= \frac{\|(I - P_X)Y\|^2}{n}
\end{aligned}
$$

Estimate of σ²

The maximum likelihood estimate of $\sigma^2$ is

$$\hat\sigma^2 = \frac{\|(I - P_X)Y\|^2}{n} = \frac{Y^T(I - P_X)^T(I - P_X)Y}{n} = \frac{Y^T(I - P_X)Y}{n} = \frac{e^T e}{n}$$

where $e = (I - P_X)Y$ are the residuals from the regression of $Y$ on $X$.
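
In code, the MLE of $\sigma^2$ is just the average squared residual; a sketch with simulated data (sizes and seed arbitrary):

    import numpy as np

    rng = np.random.default_rng(4)
    n, p = 50, 4
    X = rng.normal(size=(n, p))
    Y = X @ rng.normal(size=p) + rng.normal(size=n)

    P_X = X @ np.linalg.solve(X.T @ X, X.T)
    e = (np.eye(n) - P_X) @ Y                   # residuals e = (I - P_X) Y
    sigma2_mle = (e @ e) / n                    # ||(I - P_X) Y||^2 / n
    assert np.isclose(sigma2_mle, Y @ (np.eye(n) - P_X) @ Y / n)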

Geometric View

Fitted values: $\hat Y = P_X Y = X\hat\beta$

Residuals: $e = (I - P_X)Y$

$$Y = \hat Y + e, \qquad \|Y\|^2 = \|P_X Y\|^2 + \|(I - P_X)Y\|^2$$

Properties

$\hat Y = \hat\mu$ is an unbiased estimate of $\mu = X\beta$:
$$E[\hat Y] = E[P_X Y] = P_X E[Y] = P_X \mu = \mu$$

$E[e] = 0$ if $\mu \in C(X)$:
$$E[e] = E[(I - P_X)Y] = (I - P_X)E[Y] = (I - P_X)\mu = 0$$

The residual mean will not be $0$ if $\mu \notin C(X)$ (useful for model checking).

Estimate of σ²

MLE of $\sigma^2$:
$$\hat\sigma^2 = \frac{e^T e}{n} = \frac{Y^T(I - P_X)Y}{n}$$

Is this an unbiased estimate of $\sigma^2$? We need expectations of quadratic forms $Y^T A Y$, where $A$ is an $n \times n$ matrix and $Y$ is a random vector in $\mathbb{R}^n$.

Quadratic Forms

Without loss of generality we can assume that $A = A^T$: $Y^T A Y$ is a scalar, so $Y^T A Y = (Y^T A Y)^T = Y^T A^T Y$, and hence

$$Y^T A Y = \frac{Y^T A Y + Y^T A^T Y}{2} = Y^T \left(\frac{A + A^T}{2}\right) Y.$$

We may therefore replace $A$ by its symmetric part $(A + A^T)/2$.
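
A one-off numerical check (arbitrary $A$ and $Y$) that replacing $A$ by its symmetric part leaves the quadratic form unchanged:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 6
    A = rng.normal(size=(n, n))        # not symmetric in general
    Y = rng.normal(size=n)

    A_sym = (A + A.T) / 2              # symmetric part of A
    assert np.isclose(Y @ A @ Y, Y @ A_sym @ Y)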

Expectations of Quadratic Forms

Theorem. Let $Y$ be a random vector in $\mathbb{R}^n$ with $E[Y] = \mu$ and $\mathrm{Cov}(Y) = \Sigma$. Then
$$E[Y^T A Y] = \mathrm{tr}(A\Sigma) + \mu^T A \mu.$$

This result is useful for finding expected values of mean squares; no normality is required!

Proof

Start with $(Y - \mu)^T A (Y - \mu)$, expand, and take expectations:

$$
\begin{aligned}
E[(Y - \mu)^T A (Y - \mu)] &= E[Y^T A Y + \mu^T A \mu - \mu^T A Y - Y^T A \mu] \\
&= E[Y^T A Y] + \mu^T A \mu - \mu^T A \mu - \mu^T A \mu \\
&= E[Y^T A Y] - \mu^T A \mu
\end{aligned}
$$

Rearrange, writing $\mathrm{tr}(A) \equiv \sum_{i=1}^n a_{ii}$ and using the cyclic property $\mathrm{tr}(BC) = \mathrm{tr}(CB)$:

$$
\begin{aligned}
E[Y^T A Y] &= E[(Y - \mu)^T A (Y - \mu)] + \mu^T A \mu \\
&= E[\mathrm{tr}\{(Y - \mu)^T A (Y - \mu)\}] + \mu^T A \mu \\
&= E[\mathrm{tr}\{A (Y - \mu)(Y - \mu)^T\}] + \mu^T A \mu \\
&= \mathrm{tr}\, E[A (Y - \mu)(Y - \mu)^T] + \mu^T A \mu \\
&= \mathrm{tr}\{A\, E[(Y - \mu)(Y - \mu)^T]\} + \mu^T A \mu \\
&= \mathrm{tr}(A\Sigma) + \mu^T A \mu
\end{aligned}
$$
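
The theorem can also be checked by simulation; a rough Monte Carlo sketch with an arbitrary symmetric $A$, mean, and covariance (Gaussian draws are used for convenience, though the theorem does not require normality):

    import numpy as np

    rng = np.random.default_rng(6)
    n, reps = 4, 200_000
    A = rng.normal(size=(n, n)); A = (A + A.T) / 2     # arbitrary symmetric A
    mu = rng.normal(size=n)
    L = rng.normal(size=(n, n)); Sigma = L @ L.T       # arbitrary valid covariance

    Y = rng.multivariate_normal(mu, Sigma, size=reps)  # reps x n draws
    mc_mean = np.mean(np.einsum('ri,ij,rj->r', Y, A, Y))
    theory = np.trace(A @ Sigma) + mu @ A @ mu
    print(mc_mean, theory)   # the two should agree up to Monte Carlo error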

Expectation of σ̂²

Use the theorem:

$$
\begin{aligned}
E[Y^T (I - P_X) Y] &= \mathrm{tr}\{(I - P_X)\,\sigma^2 I\} + \mu^T (I - P_X)\mu \\
&= \sigma^2 \mathrm{tr}(I - P_X) \\
&= \sigma^2 r(I - P_X) \\
&= \sigma^2 (n - r(X))
\end{aligned}
$$

since $\mu^T(I - P_X)\mu = 0$ when $\mu \in C(X)$. Therefore an unbiased estimate of $\sigma^2$ is

$$\frac{e^T e}{n - r(X)}.$$

If $X$ is of full rank ($r(X) = p$) and $P_X = X(X^T X)^{-1}X^T$, then
$$\mathrm{tr}(P_X) = \mathrm{tr}(X(X^T X)^{-1}X^T) = \mathrm{tr}(X^T X(X^T X)^{-1}) = \mathrm{tr}(I_p) = p.$$
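
A small sketch confirming the trace and rank bookkeeping for an illustrative full-rank design:

    import numpy as np

    rng = np.random.default_rng(7)
    n, p = 30, 5
    X = rng.normal(size=(n, p))
    P_X = X @ np.linalg.solve(X.T @ X, X.T)

    assert np.isclose(np.trace(P_X), p)                   # tr(P_X) = p
    assert np.isclose(np.trace(np.eye(n) - P_X), n - p)   # tr(I - P_X) = n - r(X)
    assert np.linalg.matrix_rank(P_X) == p                # rank equals trace for projections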

Spectral Theorem

Theorem. If $A$ ($n \times n$) is a symmetric real matrix, then there exists an $n \times n$ matrix $U$ with $U^T U = U U^T = I_n$ and a diagonal matrix $\Lambda$ with elements $\lambda_i$ such that $A = U\Lambda U^T$.

- $U$ is an orthogonal matrix: $U^{-1} = U^T$
- the columns of $U$ form an orthonormal basis for $\mathbb{R}^n$
- the rank of $A$ equals the number of non-zero eigenvalues $\lambda_i$
- the columns of $U$ associated with non-zero eigenvalues (eigenvectors of $A$) form an orthonormal basis for $C(A)$
- $A^p = U\Lambda^p U^T$ (matrix powers)
- a square root of $A > 0$ is $U\Lambda^{1/2} U^T$
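
A sketch of the spectral decomposition using numpy's eigh on an arbitrary symmetric (here positive semidefinite) matrix:

    import numpy as np

    rng = np.random.default_rng(8)
    n = 5
    B = rng.normal(size=(n, n))
    A = B @ B.T                                   # symmetric (and positive semidefinite)

    lam, U = np.linalg.eigh(A)                    # A = U diag(lam) U^T
    assert np.allclose(U @ np.diag(lam) @ U.T, A)
    assert np.allclose(U.T @ U, np.eye(n))        # U is orthogonal
    lam = np.clip(lam, 0, None)                   # guard against tiny negative rounding
    A_half = U @ np.diag(np.sqrt(lam)) @ U.T      # a square root of A (valid since A >= 0)
    assert np.allclose(A_half @ A_half, A)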

Projections

If $P$ is an orthogonal projection matrix, then its eigenvalues $\lambda_i$ are either zero or one, with $\mathrm{tr}(P) = \sum_i \lambda_i = r(P)$.

Write $P = U\Lambda U^T$. Since $P = P^2$,
$$U\Lambda U^T = U\Lambda U^T U\Lambda U^T = U\Lambda^2 U^T,$$
so $\Lambda = \Lambda^2$, which holds only if each $\lambda_i$ is $1$ or $0$. Since $r(P)$ is the number of non-zero eigenvalues, $r(P) = \sum_i \lambda_i = \mathrm{tr}(P)$.

Partitioning $U = [U_P \; U_{P^\perp}]$ according to the unit and zero eigenvalues,
$$P = [U_P \; U_{P^\perp}] \begin{bmatrix} I_r & 0 \\ 0 & 0_{n-r} \end{bmatrix} \begin{bmatrix} U_P^T \\ U_{P^\perp}^T \end{bmatrix} = U_P U_P^T = \sum_{i=1}^r u_i u_i^T,$$
a sum of $r$ rank-one projections.
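
Applying the spectral theorem to an illustrative projection matrix: its eigenvalues are (numerically) zeros and ones, its trace equals its rank, and it is the sum of rank-one projections built from the unit-eigenvalue eigenvectors.

    import numpy as np

    rng = np.random.default_rng(9)
    n, p = 10, 3
    X = rng.normal(size=(n, p))
    P = X @ np.linalg.solve(X.T @ X, X.T)         # orthogonal projection onto C(X)

    lam, U = np.linalg.eigh(P)                    # eigenvalues in ascending order
    expected = np.concatenate([np.zeros(n - p), np.ones(p)])
    assert np.allclose(lam, expected)             # eigenvalues are 0 or 1
    assert np.isclose(np.trace(P), p)             # trace = rank
    U_P = U[:, lam > 0.5]                         # eigenvectors with eigenvalue 1
    assert np.allclose(U_P @ U_P.T, P)            # P = sum_i u_i u_i^T over unit eigenvalues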

Next Class

Distribution theory. Continue reading Chapters 1-2 and Appendices A & B in Christensen.