Lecture 20: Linear model, the LSE, and UMVUE


Linear Models
One of the most useful statistical models is
$$X_i = \beta^\tau Z_i + \varepsilon_i, \qquad i = 1,\ldots,n,$$
where
$X_i$ is the $i$th observation, often called the $i$th response;
$\beta$ is a $p$-vector of unknown parameters (the main parameters of interest), $p < n$;
$Z_i$ is the $i$th value of a $p$-vector of explanatory variables (or covariates);
$\varepsilon_1,\ldots,\varepsilon_n$ are random errors (not observed).
Data: $(X_1,Z_1),\ldots,(X_n,Z_n)$.
The $Z_i$'s are nonrandom, or are given values of a random $p$-vector, in which case our analysis is conditional on $Z_1,\ldots,Z_n$.
A matrix form of the model is
$$X = Z\beta + \varepsilon, \qquad (1)$$
where $X = (X_1,\ldots,X_n)$, $\varepsilon = (\varepsilon_1,\ldots,\varepsilon_n)$, and $Z$ is the $n\times p$ matrix whose $i$th row is the vector $Z_i^\tau$, $i = 1,\ldots,n$.
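
As a concrete illustration of the matrix form (1), the following sketch simulates a small dataset; the sample size, design, $\beta$, and error scale are arbitrary choices, not values from the lecture.

```python
import numpy as np

# Minimal simulation of X = Z*beta + eps (all numbers are illustrative).
rng = np.random.default_rng(0)
n, p = 20, 3
Z = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # n x p matrix with rows Z_i'
beta = np.array([1.0, -2.0, 0.5])                               # unknown p-vector (chosen here)
eps = rng.normal(scale=1.0, size=n)                             # unobserved errors, N(0, sigma^2 I_n)
X = Z @ beta + eps                                              # n-vector of responses
```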

Definition 3.4 (LSE)
Suppose that the range of $\beta$ in model (6) is $B \subset \mathbb{R}^p$. A least squares estimator (LSE) of $\beta$ is defined to be any $\hat\beta \in B$ such that
$$\|X - Z\hat\beta\|^2 = \min_{b\in B}\|X - Zb\|^2.$$
For any $l \in \mathbb{R}^p$, $l^\tau\hat\beta$ is called an LSE of $l^\tau\beta$.
Throughout this book, we consider $B = \mathbb{R}^p$ unless otherwise stated.
Differentiating $\|X - Zb\|^2$ w.r.t. $b$, we obtain that any solution of
$$Z^\tau Z b = Z^\tau X$$
is an LSE of $\beta$.
Full rank Z
If the rank of the matrix $Z$ is $p$, in which case $(Z^\tau Z)^{-1}$ exists and $Z$ is said to be of full rank, then there is a unique LSE, which is
$$\hat\beta = (Z^\tau Z)^{-1} Z^\tau X.$$
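
The normal equations translate directly into code. A minimal sketch for the full-rank case (assuming a design of rank $p$, e.g. the simulated Z and X from the previous snippet) solves $Z^\tau Z b = Z^\tau X$ rather than forming the inverse explicitly.

```python
import numpy as np

def lse_full_rank(Z, X):
    """LSE when rank(Z) = p: solve the normal equations Z'Z b = Z'X."""
    return np.linalg.solve(Z.T @ Z, Z.T @ X)

# Usage with the simulated data above (hypothetical):
# beta_hat = lse_full_rank(Z, X)
```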

Not full rank Z
If $Z$ is not of full rank, then there are infinitely many LSE's of $\beta$. Any LSE of $\beta$ is of the form
$$\hat\beta = (Z^\tau Z)^{-} Z^\tau X,$$
where $(Z^\tau Z)^{-}$ is called a generalized inverse of $Z^\tau Z$ and satisfies
$$Z^\tau Z (Z^\tau Z)^{-} Z^\tau Z = Z^\tau Z.$$
Generalized inverse matrices are not unique unless $Z$ is of full rank, in which case $(Z^\tau Z)^{-} = (Z^\tau Z)^{-1}$.
Assumptions
To study properties of LSE's of $\beta$, we need some assumptions on the distribution of $X$ or $\varepsilon$ (conditional on $Z$ if $Z$ is random and $\varepsilon$ and $Z$ are independent).
A1: $\varepsilon$ is distributed as $N_n(0,\sigma^2 I_n)$ with an unknown $\sigma^2 > 0$.
A2: $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2 I_n$ with an unknown $\sigma^2 > 0$.
A3: $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon)$ is an unknown matrix.
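
A sketch of the rank-deficient case, using the Moore-Penrose pseudoinverse as one particular choice of generalized inverse; the toy design below is hypothetical.

```python
import numpy as np

def lse_any_rank(Z, X):
    """One LSE beta_hat = (Z'Z)^- Z'X, using the Moore-Penrose pseudoinverse
    as a particular generalized inverse of Z'Z."""
    return np.linalg.pinv(Z.T @ Z) @ Z.T @ X

# Toy rank-deficient design: the third column duplicates the second.
rng = np.random.default_rng(1)
Z = np.column_stack([np.ones(6), [0., 0., 1., 1., 2., 2.], [0., 0., 1., 1., 2., 2.]])
X = rng.normal(size=6)
beta_hat = lse_any_rank(Z, X)
# Different generalized inverses give different beta_hat, but the fitted
# values Z @ beta_hat are the same for all of them.
```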

Remarks
Assumption A1 is the strongest and implies a parametric model.
We may adopt the slightly more general assumption that $\varepsilon$ has the $N_n(0,\sigma^2 D)$ distribution with unknown $\sigma^2$ but a known positive definite matrix $D$. Let $D^{-1/2}$ be the inverse of the square root matrix of $D$. Then model (6) with assumption A1 holds if we replace $X$, $Z$, and $\varepsilon$ by the transformed variables $\tilde X = D^{-1/2}X$, $\tilde Z = D^{-1/2}Z$, and $\tilde\varepsilon = D^{-1/2}\varepsilon$, respectively.
A similar conclusion can be made for assumption A2.
Under assumption A1, the distribution of $X$ is $N_n(Z\beta,\sigma^2 I_n)$, which is in an exponential family $\mathcal{P}$ with parameter $\theta = (\beta,\sigma^2) \in \mathbb{R}^p \times (0,\infty)$.
However, if the matrix $Z$ is not of full rank, then $\mathcal{P}$ is not identifiable (see §2.1.2), since $Z\beta_1 = Z\beta_2$ does not imply $\beta_1 = \beta_2$.
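
The transformation in the remark is a whitening step. A minimal sketch (assuming $D$ is known and positive definite) computes $D^{-1/2}$ through the eigendecomposition of $D$ and returns the transformed response and design.

```python
import numpy as np

def whiten(X, Z, D):
    """Transform a model with Var(eps) = sigma^2 D (D known, positive definite)
    into one with Var(eps) = sigma^2 I, using the symmetric square root of D."""
    w, V = np.linalg.eigh(D)                          # D = V diag(w) V'
    D_inv_half = V @ np.diag(1.0 / np.sqrt(w)) @ V.T  # D^{-1/2}
    return D_inv_half @ X, D_inv_half @ Z
```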

Remarks
Suppose that the rank of $Z$ is $r \le p$. Then there is an $n\times r$ submatrix $Z_*$ of $Z$ such that
$$Z = Z_* Q \qquad (2)$$
and $Z_*$ is of rank $r$, where $Q$ is a fixed $r\times p$ matrix, and $Z\beta = Z_* Q\beta$.
$\mathcal{P}$ is identifiable if we consider the reparameterization $\tilde\beta = Q\beta$.
The new parameter $\tilde\beta$ lies in a space of dimension $r$.
In many applications, we are interested in estimating some linear functions of $\beta$, i.e., $\vartheta = l^\tau\beta$ for some $l \in \mathbb{R}^p$.
From the previous discussion, however, estimation of $l^\tau\beta$ is meaningless unless $l = Q^\tau c$ for some $c \in \mathbb{R}^r$, so that $l^\tau\beta = c^\tau Q\beta = c^\tau\tilde\beta$.

The following result shows that $l^\tau\beta$ is estimable if $l = Q^\tau c$, which is also necessary for $l^\tau\beta$ to be estimable under assumption A1.
Theorem 3.6
Assume model (6) with assumption A3.
(i) A necessary and sufficient condition for $l \in \mathbb{R}^p$ being $Q^\tau c$ for some $c \in \mathbb{R}^r$ is $l \in \mathcal{R}(Z) = \mathcal{R}(Z^\tau Z)$, where $Q$ is given by (2) and $\mathcal{R}(A)$ is the smallest linear subspace containing all rows of $A$.
(ii) If $l \in \mathcal{R}(Z)$, then the LSE $l^\tau\hat\beta$ is unique and unbiased for $l^\tau\beta$.
(iii) If $l \notin \mathcal{R}(Z)$ and assumption A1 holds, then $l^\tau\beta$ is not estimable.
Proof
(i) Note that $a \in \mathcal{R}(A)$ iff $a = A^\tau b$ for some vector $b$.
If $l = Q^\tau c$, then
$$l = Q^\tau c = Q^\tau Z_*^\tau Z_*(Z_*^\tau Z_*)^{-1}c = Z^\tau[Z_*(Z_*^\tau Z_*)^{-1}c].$$
Hence $l \in \mathcal{R}(Z)$.

Proof (continued)
If $l \in \mathcal{R}(Z)$, then $l = Z^\tau\zeta$ for some $\zeta$ and
$$l = (Z_* Q)^\tau\zeta = Q^\tau c, \qquad c = Z_*^\tau\zeta.$$
(ii) If $l \in \mathcal{R}(Z) = \mathcal{R}(Z^\tau Z)$, then $l = Z^\tau Z\zeta$ for some $\zeta$ and, by $\hat\beta = (Z^\tau Z)^{-}Z^\tau X$,
$$E(l^\tau\hat\beta) = E[l^\tau(Z^\tau Z)^{-}Z^\tau X] = \zeta^\tau Z^\tau Z(Z^\tau Z)^{-}Z^\tau Z\beta = \zeta^\tau Z^\tau Z\beta = l^\tau\beta.$$
If $\tilde\beta$ is any other LSE of $\beta$, then, by $Z^\tau Z\tilde\beta = Z^\tau X$,
$$l^\tau\hat\beta - l^\tau\tilde\beta = \zeta^\tau(Z^\tau Z)(\hat\beta - \tilde\beta) = \zeta^\tau(Z^\tau X - Z^\tau X) = 0.$$
(iii) Under A1, if there is an estimator $h(X,Z)$ unbiased for $l^\tau\beta$, then
$$l^\tau\beta = \int_{\mathbb{R}^n} h(x,Z)(2\pi)^{-n/2}\sigma^{-n}\exp\Big\{-\tfrac{1}{2\sigma^2}\|x - Z\beta\|^2\Big\}dx.$$
Differentiating w.r.t. $\beta$ and applying Theorem 2.1 lead to
$$l^\tau = \int_{\mathbb{R}^n} h(x,Z)(2\pi)^{-n/2}\sigma^{-n-2}(x - Z\beta)^\tau Z\exp\Big\{-\tfrac{1}{2\sigma^2}\|x - Z\beta\|^2\Big\}dx,$$
which implies $l \in \mathcal{R}(Z)$.
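
The estimability condition $l \in \mathcal{R}(Z)$ of Theorem 3.6(i) can be checked numerically by projecting $l$ onto the row space of $Z$. A minimal sketch (the design and the vectors $l$ below are hypothetical, in the spirit of a one-way ANOVA with two groups):

```python
import numpy as np

def is_estimable(l, Z, tol=1e-8):
    """Check l in R(Z): project l onto the row space of Z and compare."""
    proj = Z.T @ np.linalg.pinv(Z.T) @ l   # orthogonal projection onto R(Z)
    return np.linalg.norm(proj - l) <= tol * max(1.0, np.linalg.norm(l))

# Columns correspond to (mu, alpha_1, alpha_2); Z is not of full rank.
Z = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
print(is_estimable(np.array([0., 1., -1.]), Z))  # alpha_1 - alpha_2: True
print(is_estimable(np.array([0., 1.,  0.]), Z))  # alpha_1 alone: False
```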

Example 3.12 (Simple linear regression)
Let $\beta = (\beta_0,\beta_1) \in \mathbb{R}^2$ and $Z_i = (1,t_i)$, $t_i \in \mathbb{R}$, $i = 1,\ldots,n$.
Then model (6) is called a simple linear regression model.
It turns out that
$$Z^\tau Z = \begin{pmatrix} n & \sum_{i=1}^n t_i \\ \sum_{i=1}^n t_i & \sum_{i=1}^n t_i^2 \end{pmatrix}.$$
This matrix is invertible iff some $t_i$'s are different.
Thus, if some $t_i$'s are different, then the unique unbiased LSE of $l^\tau\beta$ for any $l \in \mathbb{R}^2$ is $l^\tau(Z^\tau Z)^{-1}Z^\tau X$, which has a normal distribution if assumption A1 holds.
The result can easily be extended to the case of polynomial regression of order $p$, in which $\beta = (\beta_0,\beta_1,\ldots,\beta_{p-1})$ and $Z_i = (1,t_i,\ldots,t_i^{p-1})$.
Example 3.13 (One-way ANOVA)
Suppose that $n = \sum_{j=1}^m n_j$ with $m$ positive integers $n_1,\ldots,n_m$ and that
$$X_i = \mu_j + \varepsilon_i, \qquad i = k_{j-1}+1,\ldots,k_j,\ j = 1,\ldots,m,$$
where $k_0 = 0$, $k_j = \sum_{l=1}^j n_l$, $j = 1,\ldots,m$, and $\beta = (\mu_1,\ldots,\mu_m)$.
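
A sketch of Example 3.12 on hypothetical $(t_i, X_i)$ values, forming $Z^\tau Z$ and the unique LSE:

```python
import numpy as np

# Simple linear regression with Z_i = (1, t_i); the data below are hypothetical.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
Z = np.column_stack([np.ones_like(t), t])

ZtZ = Z.T @ Z                             # [[n, sum t_i], [sum t_i, sum t_i^2]]
beta_hat = np.linalg.solve(ZtZ, Z.T @ X)  # unique LSE since the t_i's are not all equal
print(beta_hat)                           # (intercept, slope)
```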

Let $J_m$ be the $m$-vector of ones. Then the matrix $Z$ in this case is a block diagonal matrix with $J_{n_j}$ as its $j$th diagonal block.
Consequently, $Z^\tau Z$ is an $m\times m$ diagonal matrix whose $j$th diagonal element is $n_j$.
Thus, $Z^\tau Z$ is invertible and the unique LSE of $\beta$ is the $m$-vector whose $j$th component is
$$\frac{1}{n_j}\sum_{i=k_{j-1}+1}^{k_j} X_i, \qquad j = 1,\ldots,m.$$
Sometimes it is more convenient to use the following notation:
$$X_{ij} = X_{k_{i-1}+j}, \qquad \varepsilon_{ij} = \varepsilon_{k_{i-1}+j}, \qquad j = 1,\ldots,n_i,\ i = 1,\ldots,m,$$
and $\mu_i = \mu + \alpha_i$, $i = 1,\ldots,m$.
Then our model becomes
$$X_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \qquad j = 1,\ldots,n_i,\ i = 1,\ldots,m, \qquad (3)$$
which is called a one-way analysis of variance (ANOVA) model.
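
A sketch with hypothetical group data, confirming that the LSE of $(\mu_1,\ldots,\mu_m)$ is the vector of group sample means, both directly and through the block-diagonal design matrix:

```python
import numpy as np

# One-way ANOVA with m = 3 hypothetical groups of sizes (3, 2, 4).
groups = [np.array([3.1, 2.8, 3.4]),
          np.array([5.0, 4.7]),
          np.array([1.9, 2.2, 2.0, 2.3])]
mu_hat = np.array([g.mean() for g in groups])   # group means = unique LSE of (mu_1,...,mu_m)

# Same answer via the design matrix with J_{n_j} as the jth diagonal block.
n, m = sum(len(g) for g in groups), len(groups)
Z = np.zeros((n, m))
row = 0
for j, g in enumerate(groups):
    Z[row:row + len(g), j] = 1.0
    row += len(g)
X = np.concatenate(groups)
print(mu_hat, np.linalg.solve(Z.T @ Z, Z.T @ X))
```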

Under model (3), $\beta = (\mu,\alpha_1,\ldots,\alpha_m) \in \mathbb{R}^{m+1}$.
The matrix $Z$ under model (3) is not of full rank.
An LSE of $\beta$ under model (3) is
$$\hat\beta = (\bar X,\ \bar X_{1\cdot} - \bar X,\ \ldots,\ \bar X_{m\cdot} - \bar X),$$
where $\bar X$ is still the sample mean of all $X_{ij}$'s and $\bar X_{i\cdot}$ is the sample mean of the $i$th group $\{X_{ij},\ j = 1,\ldots,n_i\}$.
The notation used in model (3) allows us to generalize the one-way ANOVA model to any $s$-way ANOVA model, with a positive integer $s$, under the so-called factorial experiments.
Example 3.14 (Two-way balanced ANOVA)
Suppose that
$$X_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}, \qquad i = 1,\ldots,a,\ j = 1,\ldots,b,\ k = 1,\ldots,c, \qquad (4)$$
where $a$, $b$, and $c$ are some positive integers.
Model (4) is called a two-way balanced ANOVA model.

If we view model (4) as a special case of model (6), then the parameter vector $\beta$ is
$$\beta = (\mu,\alpha_1,\ldots,\alpha_a,\beta_1,\ldots,\beta_b,\gamma_{11},\ldots,\gamma_{1b},\ldots,\gamma_{a1},\ldots,\gamma_{ab}). \qquad (5)$$
One can obtain the matrix $Z$ and show that it is $n\times p$, where $n = abc$ and $p = 1 + a + b + ab$, and that it is of rank $ab < p$.
It can also be shown that an LSE of $\beta$ is given by the right-hand side of (5) with $\mu$, $\alpha_i$, $\beta_j$, and $\gamma_{ij}$ replaced by $\hat\mu$, $\hat\alpha_i$, $\hat\beta_j$, and $\hat\gamma_{ij}$, respectively, where
$$\hat\mu = \bar X_{\cdot\cdot\cdot}, \qquad \hat\alpha_i = \bar X_{i\cdot\cdot} - \bar X_{\cdot\cdot\cdot}, \qquad \hat\beta_j = \bar X_{\cdot j\cdot} - \bar X_{\cdot\cdot\cdot}, \qquad \hat\gamma_{ij} = \bar X_{ij\cdot} - \bar X_{i\cdot\cdot} - \bar X_{\cdot j\cdot} + \bar X_{\cdot\cdot\cdot},$$
and a dot is used to denote averaging over the indicated subscript, e.g., with a fixed $j$,
$$\bar X_{\cdot j\cdot} = \frac{1}{ac}\sum_{i=1}^a\sum_{k=1}^c X_{ijk}.$$
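
A sketch of the two-way balanced ANOVA estimates on a hypothetical $a\times b\times c$ array, computing the dot-notation averages directly:

```python
import numpy as np

# Two-way balanced ANOVA estimates from an (a, b, c) array of hypothetical data.
rng = np.random.default_rng(2)
a, b, c = 2, 3, 4
X = rng.normal(size=(a, b, c))

mu_hat    = X.mean()                              # X-bar_...
alpha_hat = X.mean(axis=(1, 2)) - mu_hat          # X-bar_i.. - X-bar_...
beta_hat  = X.mean(axis=(0, 2)) - mu_hat          # X-bar_.j. - X-bar_...
gamma_hat = (X.mean(axis=2)
             - X.mean(axis=(1, 2))[:, None]
             - X.mean(axis=(0, 2))[None, :]
             + mu_hat)                            # X-bar_ij. - X-bar_i.. - X-bar_.j. + X-bar_...
```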

Theorem 3.7 (UMVUE)
Consider model
$$X = Z\beta + \varepsilon \qquad (6)$$
with assumption A1 ($\varepsilon$ is distributed as $N_n(0,\sigma^2 I_n)$ with an unknown $\sigma^2 > 0$). Then
(i) the LSE $l^\tau\hat\beta$ is the UMVUE of $l^\tau\beta$ for any estimable $l^\tau\beta$;
(ii) the UMVUE of $\sigma^2$ is $\hat\sigma^2 = (n-r)^{-1}\|X - Z\hat\beta\|^2$, where $r$ is the rank of $Z$.
Proof of (i)
Let $\hat\beta$ be an LSE of $\beta$. By $Z^\tau Z\hat\beta = Z^\tau X$,
$$(X - Z\hat\beta)^\tau Z(\hat\beta - \beta) = (X^\tau Z - X^\tau Z)(\hat\beta - \beta) = 0$$
and, hence,

$$\|X - Z\beta\|^2 = \|X - Z\hat\beta + Z\hat\beta - Z\beta\|^2 = \|X - Z\hat\beta\|^2 + \|Z\hat\beta - Z\beta\|^2 = \|X - Z\hat\beta\|^2 - 2\beta^\tau Z^\tau X + \|Z\hat\beta\|^2 + \|Z\beta\|^2.$$
Using this result and assumption A1, we obtain the following joint Lebesgue p.d.f. of $X$:
$$(2\pi\sigma^2)^{-n/2}\exp\Big\{\frac{\beta^\tau Z^\tau x}{\sigma^2} - \frac{\|x - Z\hat\beta\|^2 + \|Z\hat\beta\|^2}{2\sigma^2} - \frac{\|Z\beta\|^2}{2\sigma^2}\Big\}.$$
By Proposition 2.1 and the fact that $Z\hat\beta = Z(Z^\tau Z)^{-}Z^\tau X$ is a function of $Z^\tau X$, the statistic $(Z^\tau X, \|X - Z\hat\beta\|^2)$ is complete and sufficient for $\theta = (\beta,\sigma^2)$.
Note that $\hat\beta$ is a function of $Z^\tau X$ and, hence, a function of the complete sufficient statistic.
If $l^\tau\beta$ is estimable, then $l^\tau\hat\beta$ is unbiased for $l^\tau\beta$ (Theorem 3.6) and, hence, $l^\tau\hat\beta$ is the UMVUE of $l^\tau\beta$.

Proof of (ii)
From $\|X - Z\beta\|^2 = \|X - Z\hat\beta\|^2 + \|Z\hat\beta - Z\beta\|^2$ and $E(Z\hat\beta) = Z\beta$ (Theorem 3.6),
$$E\|X - Z\hat\beta\|^2 = E(X - Z\beta)^\tau(X - Z\beta) - E(\hat\beta - \beta)^\tau Z^\tau Z(\hat\beta - \beta) = \mathrm{tr}\big(\mathrm{Var}(X) - \mathrm{Var}(Z\hat\beta)\big) = \sigma^2\big[n - \mathrm{tr}\big(Z(Z^\tau Z)^{-}Z^\tau Z(Z^\tau Z)^{-}Z^\tau\big)\big] = \sigma^2\big[n - \mathrm{tr}\big((Z^\tau Z)^{-}Z^\tau Z\big)\big].$$
Since each row of $Z$ is in $\mathcal{R}(Z)$, $Z\hat\beta$ does not depend on the choice of $(Z^\tau Z)^{-}$ in $\hat\beta = (Z^\tau Z)^{-}Z^\tau X$ (Theorem 3.6).
Hence, we can evaluate $\mathrm{tr}\big((Z^\tau Z)^{-}Z^\tau Z\big)$ using a particular $(Z^\tau Z)^{-}$.
From the theory of linear algebra, there exists a $p\times p$ matrix $C$ such that $CC^\tau = I_p$ and
$$C^\tau(Z^\tau Z)C = \begin{pmatrix}\Lambda & 0\\ 0 & 0\end{pmatrix},$$
where $\Lambda$ is an $r\times r$ diagonal matrix whose diagonal elements are the positive eigenvalues of $Z^\tau Z$.

Then, a particular choice of $(Z^\tau Z)^{-}$ is
$$(Z^\tau Z)^{-} = C\begin{pmatrix}\Lambda^{-1} & 0\\ 0 & 0\end{pmatrix}C^\tau \qquad (7)$$
and
$$(Z^\tau Z)^{-}Z^\tau Z = C\begin{pmatrix}I_r & 0\\ 0 & 0\end{pmatrix}C^\tau,$$
whose trace is $r$.
Hence $\hat\sigma^2$ is the UMVUE of $\sigma^2$, since it is a function of the complete sufficient statistic and
$$E\hat\sigma^2 = (n-r)^{-1}E\|X - Z\hat\beta\|^2 = \sigma^2.$$
Residual vector
The vector $X - Z\hat\beta$ is called the residual vector, and $\|X - Z\hat\beta\|^2$ is called the sum of squared residuals and is denoted by SSR.
The estimator $\hat\sigma^2$ is then equal to $\mathrm{SSR}/(n-r)$.
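
A minimal sketch of the estimator $\hat\sigma^2 = \mathrm{SSR}/(n-r)$, valid whether or not $Z$ has full rank:

```python
import numpy as np

def sigma2_umvue(Z, X):
    """sigma^2_hat = SSR / (n - r), r = rank(Z); works for any rank of Z."""
    n = Z.shape[0]
    r = np.linalg.matrix_rank(Z)
    beta_hat = np.linalg.pinv(Z.T @ Z) @ Z.T @ X  # any LSE; Z @ beta_hat is unique
    resid = X - Z @ beta_hat                      # residual vector
    return resid @ resid / (n - r)                # SSR / (n - r)
```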

The Fisher information matrix is
$$\frac{1}{\sigma^2}\begin{pmatrix} Z^\tau Z & 0\\ 0 & \frac{n}{2\sigma^2}\end{pmatrix}.$$
The UMVUE $l^\tau\hat\beta$ attains the information lower bound, but $\hat\sigma^2$ does not.
Since
$$X - Z\hat\beta = [I_n - Z(Z^\tau Z)^{-}Z^\tau]X \qquad\text{and}\qquad l^\tau\hat\beta = l^\tau(Z^\tau Z)^{-}Z^\tau X$$
are linear in $X$, they are normally distributed under assumption A1.
Also, using the generalized inverse matrix in (7), we obtain that
$$[I_n - Z(Z^\tau Z)^{-}Z^\tau]Z(Z^\tau Z)^{-} = Z(Z^\tau Z)^{-} - Z(Z^\tau Z)^{-}Z^\tau Z(Z^\tau Z)^{-} = 0,$$
which implies that $\hat\sigma^2$ and $l^\tau\hat\beta$ are independent (Exercise 58 in §1.6) for any estimable $l^\tau\beta$.
$Z(Z^\tau Z)^{-}Z^\tau$ is a projection matrix, $[Z(Z^\tau Z)^{-}Z^\tau]^2 = Z(Z^\tau Z)^{-}Z^\tau$; hence
$$\mathrm{SSR} = X^\tau[I_n - Z(Z^\tau Z)^{-}Z^\tau]X.$$
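
A quick numerical check (on a hypothetical design) of the projection facts used here: $H = Z(Z^\tau Z)^{-}Z^\tau$ is idempotent and $\mathrm{SSR} = X^\tau(I_n - H)X$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
Z = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
X = rng.normal(size=n)

H = Z @ np.linalg.pinv(Z.T @ Z) @ Z.T             # projection onto the column space of Z
beta_hat = np.linalg.pinv(Z.T @ Z) @ Z.T @ X
print(np.allclose(H @ H, H))                      # idempotent: H^2 = H
print(np.allclose(np.sum((X - Z @ beta_hat)**2),  # SSR computed from residuals
                  X @ (np.eye(n) - H) @ X))       # equals the quadratic form X'(I - H)X
```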

The rank of $Z(Z^\tau Z)^{-}Z^\tau$ is $\mathrm{tr}\big(Z(Z^\tau Z)^{-}Z^\tau\big) = r$.
Similarly, the rank of the projection matrix $I_n - Z(Z^\tau Z)^{-}Z^\tau$ is $n - r$.
From
$$X^\tau X = X^\tau[Z(Z^\tau Z)^{-}Z^\tau]X + X^\tau[I_n - Z(Z^\tau Z)^{-}Z^\tau]X$$
and Theorem 1.5 (Cochran's theorem), $\mathrm{SSR}/\sigma^2$ has the chi-square distribution $\chi^2_{n-r}(\delta)$ with
$$\delta = \sigma^{-2}\beta^\tau Z^\tau[I_n - Z(Z^\tau Z)^{-}Z^\tau]Z\beta = 0.$$
Thus, we have proved the following result.
Theorem 3.8
Consider model (6) with assumption A1. For any estimable parameter $l^\tau\beta$,
the UMVUE's $l^\tau\hat\beta$ and $\hat\sigma^2$ are independent;
the distribution of $l^\tau\hat\beta$ is $N\big(l^\tau\beta,\ \sigma^2 l^\tau(Z^\tau Z)^{-}l\big)$;
and $(n-r)\hat\sigma^2/\sigma^2$ has the chi-square distribution $\chi^2_{n-r}$.
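
A Monte Carlo sketch (with arbitrary illustrative settings) of Theorem 3.8: the simulated values of $(n-r)\hat\sigma^2/\sigma^2$ should have mean about $n-r$ and variance about $2(n-r)$, and the sample variance of $l^\tau\hat\beta$ should be close to $\sigma^2 l^\tau(Z^\tau Z)^{-}l$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 30, 1.5
Z = np.column_stack([np.ones(n), rng.normal(size=n)])   # full-rank design, r = 2
beta = np.array([2.0, -1.0])
l = np.array([1.0, 1.0])
r = np.linalg.matrix_rank(Z)
G = np.linalg.pinv(Z.T @ Z)                              # a generalized inverse (Z'Z)^-

chi2_vals, lb_vals = [], []
for _ in range(5000):
    X = Z @ beta + rng.normal(scale=sigma, size=n)       # assumption A1
    beta_hat = G @ Z.T @ X
    ssr = np.sum((X - Z @ beta_hat) ** 2)
    chi2_vals.append(ssr / sigma**2)                     # (n - r) sigma2_hat / sigma^2
    lb_vals.append(l @ beta_hat)

print(np.mean(chi2_vals), n - r)                         # ~ n - r
print(np.var(chi2_vals), 2 * (n - r))                    # ~ 2(n - r)
print(np.var(lb_vals), sigma**2 * (l @ G @ l))           # ~ sigma^2 l'(Z'Z)^- l
```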