STAT 100C: Linear models


1 STAT 100C: Linear models Arash A. Amini June 9, 2018 1 / 56

2 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix Simple linear regression (as special case) Statistical properties of estimators and related vectors Gauss Markov Theorem Generalized Least-Squares 2 / 56

3 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix Simple linear regression (as special case) Statistical properties of estimators and related vectors Gauss Markov Theorem Generalized Least-Squares 3 / 56

4 Linear model A multiple linear regression (MLR) model, or simply a linear model, is, in its population version: y = β_0 + β_1 x_1 + ⋯ + β_p x_p + ε, where µ := β_0 + β_1 x_1 + ⋯ + β_p x_p denotes the non-random part. y is the response, or dependent variable (observed). x_j, j = 1, ..., p are independent, or explanatory, or regressor variables, or predictors or covariates. β_j are fixed unknown parameters (unobserved). x_j are fixed, i.e., deterministic for now (observed); alternatively, we work conditioned on {x_j}. ε is random: the noise, random error, unexplained variation, and the only source of randomness in the model. Assumption: E[ε] = 0, so that E[y] = µ + E[ε] = µ. The goal is to estimate the β_j. 4 / 56

5 Examples Give an example: Explaining 100C grades. Gas consumption data. Oral contraceptives. 5 / 56

6 Sampled version Our data or observation is a collection of n i.i.d. samples from this model: we observe {y_i, x_{i1}, ..., x_{ip}}, i = 1, ..., n, and y_i = β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip} + ε_i = β_0 + Σ_{j=1}^p β_j x_{ij} + ε_i. Matrix-vector form: Let x_j = (x_{1j}, ..., x_{nj}) ∈ R^n, x_0 := 1 := (1, ..., 1) ∈ R^n, ε = (ε_1, ..., ε_n), y = (y_1, ..., y_n) ∈ R^n. Then, y = β_0 1 + β_1 x_1 + ⋯ + β_p x_p + ε = Σ_{j=0}^p β_j x_j + ε, where everything is now a vector except the β_j. 6 / 56

7 Let β = (β_0, β_1, ..., β_p) ∈ R^{p+1}, and let X = (x_0 x_1 ⋯ x_p) ∈ R^{n×(p+1)}. Since Xβ = Σ_{j=0}^p β_j x_j, we have y = Xβ + ε, with mean vector µ = Xβ. This is called the multiple linear regression (MLR) model. 7 / 56

8 Assumptions Assumptions on the noise in y = Xβ + ε (with µ = Xβ): (E1) E[ε] = 0, cov(ε) = σ² I_n, and {ε_i} independent. (E2) Distributional assumption: ε ∼ N(0, σ² I_n). (E1) implies that E[y] = µ = Xβ and cov(y) = σ² I_n (why?). (E2) implies y ∼ N(Xβ, σ² I_n). Assumptions on the design matrix: (X1) X ∈ R^{n×(p+1)} is fixed (nonrandom) and has full column rank, that is, p + 1 ≤ n and rank(X) = p + 1. 8 / 56

9 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix Simple linear regression (as special case) Statistical properties of estimators and related vectors Gauss Markov Theorem Generalized Least-Squares 9 / 56

10 Maximum likelihood estimation (MLE) We are interested in estimating both β and σ² (the noise variance). The MLE requires a distributional assumption; here we assume (E2). The PDF of y ∼ N(µ, σ² I) is f(y) = (2π)^{−n/2} |σ² I_n|^{−1/2} exp( −(1/2) (y − µ)^T (σ² I_n)^{−1} (y − µ) ) = (2πσ²)^{−n/2} exp( −‖y − µ‖² / (2σ²) ). Viewed as a function of β and σ², this is the likelihood L(β, σ² | y) = (2πσ²)^{−n/2} exp( −‖y − Xβ‖² / (2σ²) ). Let us estimate β first: maximizing the likelihood is equivalent to minimizing S(β) := ‖y − Xβ‖² = Σ_{i=1}^n [y_i − (Xβ)_i]². The problem min_β S(β) is called the least-squares (LS) problem. 10 / 56

11 Use the chain rule to compute the gradient of S(β) and set it to zero: ∂S(β)/∂β_k = −2 Σ_{i=1}^n [∂(Xβ)_i/∂β_k] [y_i − (Xβ)_i], with ∂(Xβ)_i/∂β_k = X_{ik}. Hence ∂S(β)/∂β_k = −2 [X^T (y − Xβ)]_k, or ∇S(β) = −2 X^T (y − Xβ). Setting to zero gives the normal equations: ∇S(β̂) = 0 ⟺ (X^T X) β̂ = X^T y. If p + 1 ≤ n and if X ∈ R^{n×(p+1)} is full-rank, i.e., rank(X) = p + 1, then X^T X is invertible and we get β̂ = (X^T X)^{−1} X^T y. 11 / 56
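
A minimal numerical sketch of the normal equations on simulated data (the design, coefficients, and noise level below are made up for illustration; numpy's generic least-squares routine is used only as a cross-check):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # n x (p+1) design with intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)           # y = X beta + eps

# Solve the normal equations (X^T X) beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against numpy's generic least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))                    # True
```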

12 Remark 1 We have shown that the maximum likelihood estimate of β under the Gaussian assumption is the same as the least-squares (LS) estimate. The LS estimate makes sense in general, even if we make no distributional assumption. 12 / 56

13 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix Simple linear regression (as special case) Statistical properties of estimators and related vectors Gauss Markov Theorem Generalized Least-Squares 13 / 56

14 Geometric interpretation of LS. (Section 4.2.1) The least-squares problem is equivalent to min_{β ∈ R^{p+1}} ‖y − Xβ‖² ⟺ min_{µ ∈ Im(X)} ‖y − µ‖², since, we recall, Im(X) = {Xβ : β ∈ R^{p+1}}. That is, we are trying to find the projection µ̂ of y onto Im(X): µ̂ = argmin_{µ ∈ Im(X)} ‖y − µ‖². By the orthogonality principle, the residual e = y − µ̂ is perpendicular to Im(X), i.e., y − µ̂ ∈ [Im(X)]^⟂ = ker(X^T) ⟺ X^T (y − µ̂) = 0. µ̂ ∈ Im(X) means there is at least one β̂ ∈ R^{p+1} such that µ̂ = Xβ̂, and X^T (y − Xβ̂) = 0, which are the same normal equations for β̂. 14 / 56

15 [Figure: projection of y onto Im(X), showing the fitted value µ̂, the residual e = y − µ̂, and the error ε = y − µ.] Remark 2 The projection µ̂ is unique in general, but β̂ need not be. If X has full (column) rank, then β̂ is unique. 15 / 56

16 Consequences of geometric interpretation The residual vector: e = y − µ̂ = y − Xβ̂ ∈ R^n. The vector of fitted values: µ̂ = Xβ̂. Sometimes µ̂ is also referred to as ŷ. The following hold (recall that x_j is the jth column of X): e ⟂ Im(X). Since 1, x_j ∈ Im(X), we have e ⟂ 1 and e ⟂ x_j for j = 1, ..., p, that is, Σ_{i=1}^n e_i = 0 and Σ_{i=1}^n e_i x_{ij} = 0 for j = 1, ..., p. Residuals are orthogonal to all the covariate vectors. Since µ̂ ∈ Im(X), we have e ⟂ µ̂, i.e., Σ_i e_i µ̂_i = 0. S(β̂) = ‖e‖² = min_β S(β). 16 / 56
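
These orthogonality facts are easy to verify numerically; a small sketch on simulated data (all quantities below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat                       # residual vector
mu_hat = X @ beta_hat                      # fitted values

print(np.allclose(X.T @ e, 0))             # e is orthogonal to every column of X
print(np.isclose(e.sum(), 0))              # in particular, residuals sum to zero (intercept column)
print(np.isclose(e @ mu_hat, 0))           # and e is orthogonal to the fitted values
```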

17 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix Simple linear regression (as special case) Statistical properties of estimators and related vectors Gauss Markov Theorem Generalized Least-Squares 17 / 56

18 Estimation of σ² Back to the likelihood, substituting β̂ for β: L(β̂, σ² | y) = (2πσ²)^{−n/2} exp( −S(β̂) / (2σ²) ), which we want to maximize over σ². This is equivalent to maximizing the log-likelihood log L(β̂, σ² | y) = −(n/2) log(2πσ²) − S(β̂)/(2σ²) over σ², or maximizing ℓ(β̂, v | y) := log L(β̂, v | y) = const. − (1/2) [n log v + v^{−1} S(β̂)] over v (change of variable v = σ²). The problem reduces to v̂ := argmax_{v>0} ℓ(β̂, v | y) = argmin_{v>0} [n log v + v^{−1} S(β̂)]. Setting the derivative to zero gives σ̂² = v̂ = S(β̂)/n. (Check that this is indeed the maximizer.) 18 / 56

19 For reasons that will become clear later, we often use the following modified estimator: s² = S(β̂) / (n − (p + 1)). Compare with the MLE for σ²: σ̂² = S(β̂) / n. We will see that s² is unbiased for σ² while σ̂² is not. Note that S(β̂) is the sum of squares of residuals: S(β̂) = Σ_{i=1}^n (y_i − (Xβ̂)_i)² = Σ_{i=1}^n e_i² = ‖e‖². 19 / 56
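
A quick sketch comparing the two variance estimators on simulated data (the coefficients and σ below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 30, 2, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(scale=sigma, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sse = np.sum((y - X @ beta_hat) ** 2)   # S(beta_hat) = ||e||^2

sigma2_mle = sse / n                    # MLE of sigma^2 (biased downward)
s2 = sse / (n - (p + 1))                # modified, unbiased estimator
print(sigma2_mle, s2)                   # both should be in the vicinity of sigma^2 = 4
```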

20 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix Simple linear regression (as special case) Statistical properties of estimators and related vectors Gauss Markov Theorem Generalized Least-Squares 20 / 56

21 The hat matrix Recall that β̂ = (X^T X)^{−1} X^T y. We have µ̂ = Xβ̂ = X (X^T X)^{−1} X^T y = H y, where H := X (X^T X)^{−1} X^T. H is called the hat matrix. It is an example of an orthogonal projection matrix. These matrices have a lot of properties. 21 / 56

22 Properties of projection matrices These matrices have a lot of properties: Lemma 1 For any vector y ∈ R^n, Hy ∈ R^n is the orthogonal projection of y onto Im(X). H is symmetric: H^T = H. (Exercise: use (BCD)^T = D^T C^T B^T.) H is idempotent: H² = H. (Exercise.) This is what we expect from a projection matrix: if we project Hy onto Im(X), we should get back the same thing, H(Hy) = Hy for all y. A matrix is an orthogonal projection matrix if and only if it is symmetric and idempotent. I − H is also a projection matrix (symmetric and idempotent): for every y ∈ R^n, (I − H)y is the projection of y onto [Im(X)]^⟂. Note that (I − H)y = e. To summarize, we can decompose every y ∈ R^n as y = µ̂ + e = Hy + (I − H)y, with Hy ∈ Im(X) and (I − H)y ∈ [Im(X)]^⟂. 22 / 56
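
A short numerical check of these properties, with a simulated design matrix (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix

print(np.allclose(H, H.T))                  # symmetric
print(np.allclose(H @ H, H))                # idempotent
print(np.isclose(np.trace(H), p + 1))       # trace (= rank) equals p + 1

y = rng.normal(size=n)
mu_hat, e = H @ y, (np.eye(n) - H) @ y
print(np.allclose(mu_hat + e, y))           # y decomposes as mu_hat + e
print(np.isclose(mu_hat @ e, 0))            # and the two pieces are orthogonal
```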

23 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix Simple linear regression (as special case) Statistical properties of estimators and related vectors Gauss Markov Theorem Generalized Least-Squares 23 / 56

24 Gram matrix Write X = (x_0 x_1 x_2 ⋯ x_p) ∈ R^{n×(p+1)}. Here x_j is the jth column of X. Verify the following useful result: (X^T X)_{ij} = x_i^T x_j = ⟨x_i, x_j⟩ for i, j = 0, 1, ..., p. Thus, the entries of X^T X are the pairwise inner products of the columns of X. For example, (X^T X)_{ij} = 0 means x_i and x_j are orthogonal. X^T X is called the Gram matrix of {x_0, x_1, ..., x_p}. Exercise: Show that X^T X is always PSD. 24 / 56

25 Gram matrix (Optional) Theorem 1 (Rank of Gram matrix) Let G = X^T X be the Gram matrix of X. Then rank(G) = rank(X). If X ∈ R^{n×(p+1)}, then G ∈ R^{(p+1)×(p+1)} and the theorem says that rank(X) = p + 1 if and only if rank(G) = p + 1, i.e., G is invertible. Proof: Recall that if A ∈ R^{n×(p+1)} then rank(A) + null(A) = p + 1. Since G and X have the same input dimension, it is enough to show null(G) = null(X). Assume that u ∈ ker(G), that is, X^T X u = 0. Then u^T X^T X u = 0 ⟹ ‖Xu‖² = 0 ⟹ Xu = 0, i.e., u ∈ ker(X). Now assume u ∈ ker(X), i.e., Xu = 0. Then X^T X u = 0, i.e., u ∈ ker(G). Thus, we have shown ker(G) = ker(X). Recalling that nullity is the dimension of the kernel, we are done. An alternative simple proof is based on the SVD of X. 25 / 56
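
A small sketch verifying the inner-product structure, positive semidefiniteness, and the rank identity on a simulated design (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(10), rng.normal(size=(10, 2))])
G = X.T @ X                                     # Gram matrix

print(np.isclose(G[1, 2], X[:, 1] @ X[:, 2]))   # entries are pairwise inner products of columns
print(np.all(np.linalg.eigvalsh(G) >= -1e-10))  # all eigenvalues nonnegative: G is PSD
print(np.linalg.matrix_rank(G) == np.linalg.matrix_rank(X))  # rank(G) = rank(X)
```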

26 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix Simple linear regression (as special case) Statistical properties of estimators and related vectors Gauss Markov Theorem Generalized Least-Squares 26 / 56

27 Example 1 (Simple linear regression) This is the case p = 1, and the model is y = β_0 1 + β_1 x_1 + ε. We have X = [1 x_1] ∈ R^{n×2}. Assumption (X1) is satisfied if n ≥ 2 and x_1 is not a constant multiple of 1. We have β̂ = ((1/n) X^T X)^{−1} (1/n) X^T y, where (1/n) X^T X = [1, x̄; x̄, \overline{x^2}] and (1/n) X^T y = (ȳ, \overline{xy})^T, so that β̂ = 1/(\overline{x^2} − (x̄)²) · [\overline{x^2}, −x̄; −x̄, 1] (ȳ, \overline{xy})^T. From which it follows that β̂_1 = (\overline{xy} − x̄ ȳ)/(\overline{x^2} − (x̄)²) = ρ_xy / ρ_xx, and β̂_0 = ȳ − β̂_1 x̄, where ρ_xy and ρ_xx are defined on the next slide. 27 / 56

28 Example (Simple linear regression (cont'd)) From which it follows that β̂_1 = (\overline{xy} − x̄ ȳ)/(\overline{x^2} − (x̄)²) = ρ_xy / ρ_xx, and β̂_0 = ȳ − β̂_1 x̄, where ρ_xy = (1/n) Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) = \overline{xy} − x̄ ȳ, and ρ_xx = (1/n) Σ_i (x_i − x̄)² = \overline{x^2} − (x̄)². The formula for β̂_0 is easier to see if one writes down the normal equations as ((1/n) X^T X) β̂ = (1/n) X^T y and solves for β̂_0 in terms of β̂_1. Note also that var(β̂_1) = (σ²/n) [((1/n) X^T X)^{−1}]_{22} = σ² / (n ρ_xx). 28 / 56
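
A small sketch checking the closed-form β̂_0, β̂_1 against the general formula, on simulated data (the intercept, slope, and noise level are made up):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)

rho_xy = np.mean(x * y) - x.mean() * y.mean()   # sample covariance (1/n convention)
rho_xx = np.mean(x * x) - x.mean() ** 2         # sample variance (1/n convention)
b1 = rho_xy / rho_xx
b0 = y.mean() - b1 * x.mean()

# Same answer from the general formula beta_hat = (X^T X)^{-1} X^T y
X = np.column_stack([np.ones(n), x])
print(np.allclose([b0, b1], np.linalg.solve(X.T @ X, X.T @ y)))
```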

29 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix Simple linear regression (as special case) Statistical properties of estimators and related vectors Gauss Markov Theorem Generalized Least-Squares 29 / 56

30 Sampling distribution Both β̂ and σ̂² are random quantities (due to the randomness in ε). The distribution of an estimate, called its sampling distribution, allows us to quantify the uncertainty in the estimate. Basic properties like the mean, the variance, and covariances can be determined under our basic assumption (E1). To determine the full sampling distribution, we need to assume some distribution for the noise vector ε, e.g. (E2). Recall that E[y] = µ = Xβ and cov(y) = σ² I. 30 / 56

31 Properties of β̂ Let us write A = (X^T X)^{−1} X^T ∈ R^{(p+1)×n}, so that β̂ = Ay. Note that AX = I_{p+1}. Proposition 1 Under the linear model y = Xβ + ε, 1. β̂ is an unbiased estimate of β, i.e., E[β̂] = β; 2. with covariance matrix cov(β̂) = σ² (X^T X)^{−1}. Exercise: Prove this. For the covariance, note AA^T = (X^T X)^{−1}. 31 / 56
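
Proposition 1 can be illustrated by simulation: repeatedly redraw the noise for a fixed design and compare the empirical mean and covariance of β̂ with β and σ² (X^T X)^{−1}. A sketch (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma = 50, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # fixed design, reused across repetitions
beta = np.array([1.0, 2.0, -1.0])
A = np.linalg.inv(X.T @ X) @ X.T                             # beta_hat = A y

draws = np.array([A @ (X @ beta + rng.normal(scale=sigma, size=n))
                  for _ in range(20000)])                    # beta_hat over many noise draws

print(draws.mean(axis=0))                        # close to beta (unbiasedness)
print(np.cov(draws.T))                           # close to sigma^2 (X^T X)^{-1}
print(sigma ** 2 * np.linalg.inv(X.T @ X))
```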

32 Properties of a^T β̂: Consider a^T β̂ for some nonrandom a ∈ R^{p+1}. This is interesting since, e.g., with a = (0, 1, −1, 0, ..., 0) we have a^T β̂ = β̂_1 − β̂_2. In general, E[a^T β̂] = a^T E[β̂] = a^T β, and var(a^T β̂) = a^T cov(β̂) a = σ² a^T (X^T X)^{−1} a. The first equation shows that a^T β̂ is an unbiased estimate of a^T β. 32 / 56

33 Fitted values The vector of fitted values is µ̂ = Xβ̂ = Hy. µ̂ is an unbiased estimate of µ: E[µ̂] = X E[β̂] = Xβ = µ. Approach 2: µ = Xβ, so that µ ∈ Im(X) and hence Hµ = µ (verify directly!): E[µ̂] = H E[y] = Hµ = µ. The covariance matrix is cov(µ̂) = cov(Hy) = H (σ² I_n) H^T = σ² HH = σ² H. 33 / 56

34 Residuals The residual is e = y − Xβ̂ = y − µ̂ = (I − H)y ∈ R^n. We have E[e] = E[y − µ̂] = E[y] − E[µ̂] = µ − µ = 0, and cov(e) = (I − H)(σ² I_n)(I − H)^T = σ² (I − H)² = σ² (I − H). 34 / 56

35 Joint behavior of (β̂, e) ∈ R^{(p+1+n)×1} Recall A = (X^T X)^{−1} X^T. Stack β̂ on top of e. We have β̂ = Ay and e = (I − H)y. Since H = X (X^T X)^{−1} X^T = XA, the stacked vector satisfies (β̂; e) = (A; I − H) y =: P y, and E[(β̂; e)] = (E[β̂]; E[e]) = (β; 0). (1) 35 / 56

36 The covariance matrix is (recall that I − H is symmetric and idempotent): cov((β̂; e)) = P cov(y) P^T = σ² P P^T = σ² (A; I − H)(A^T, (I − H)^T) = σ² [AA^T, A(I − H); (I − H)A^T, I − H] = [σ² (X^T X)^{−1}, 0; 0, σ² (I − H)], using A(I − H) = 0 (see the next slide). Note that the first diagonal block matches the covariance of β̂, as expected. 36 / 56

37 Sidenote Why is A(I − H) = 0? Algebraic calculation: AH = (X^T X)^{−1} X^T [XA] = A. Geometric interpretation: A^T = X (X^T X)^{−1}, hence Im(A^T) ⊆ Im(X) (check!). This means that H leaves A^T intact: HA^T = A^T. 37 / 56

38 Important consequences Proposition 2 Under the linear regression model y = Xβ + ε and assumption (E1), cov((β̂; e)) = [σ² (X^T X)^{−1}, 0; 0, σ² (I − H)], (2) where β̂ is the LS estimate of the regression coefficient and e is the residual. Under (E1), β̂ and e are uncorrelated. Under (E2), we have: (β̂, e) has a multivariate normal (MVN) distribution (why?) with mean vector (β, 0) and covariance matrix (2). β̂ and e are independent (why?). S(β̂) = ‖e‖² is independent of β̂ (why?). Similarly, s² is independent of β̂. e and µ̂ are independent (µ̂ = Xβ̂ is a function of β̂). 38 / 56

39 Distribution of s²? To answer this question, we need to talk about quadratic forms (QF). For now, let us just accept the following result, assuming (E2): S(β̂)/σ² = ‖e‖²/σ² ∼ χ²_{n−(p+1)}. (3) Recall that s² = S(β̂)/(n − (p + 1)) = ‖e‖²/(n − (p + 1)). We know that the mean of χ²_{n−(p+1)} is n − (p + 1). Hence, E[s²/σ²] = 1, showing that s² is unbiased for σ². 39 / 56
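
A simulation sketch of (3), assuming (E2) and using a scipy goodness-of-fit test as a rough check (all values below are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, p, sigma = 25, 2, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, -1.0, 0.5])
H = X @ np.linalg.inv(X.T @ X) @ X.T
df = n - (p + 1)

sse = np.array([np.sum(((np.eye(n) - H) @ (X @ beta + rng.normal(scale=sigma, size=n))) ** 2)
                for _ in range(5000)])            # S(beta_hat) over repeated noise draws

# S(beta_hat) / sigma^2 should behave like chi-square with n - (p+1) degrees of freedom
print(stats.kstest(sse / sigma ** 2, "chi2", args=(df,)).pvalue)  # typically not small
print((sse / df).mean())                          # ~ sigma^2 = 4, i.e., s^2 is unbiased
```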

40 Sidenote (optional) We could have shown the unbiasedness of s² directly without knowing (3). Under assumption (E1), E‖e‖² = Σ_{i=1}^n E e_i² = Σ_i var(e_i) = tr(cov(e)) = σ² tr(I − H) = σ² (n − p − 1)... which needs some arguing using the rank of I − H, which is n − p − 1. Remark 3 The rank of an orthogonal projection matrix is equal to the dimension of the space it is projecting onto: rank(H) = dim(Im(H)). I − H is projecting onto [Im(X)]^⟂, hence rank(I − H) = dim([Im X]^⟂) = n − dim(Im X) = n − rank(X) = n − p − 1. 40 / 56

41 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix Simple linear regression (as special case) Statistical properties of estimators and related vectors Gauss Markov Theorem Generalized Least-Squares 41 / 56

42 Gauss-Markov Theorem It says that in the linear regression model, the LSE of β is BLUE. LSE = least-squares estimator. BLUE = best linear unbiased estimator. Among all linear unbiased estimators of β, β̂ = (X^T X)^{−1} X^T y has the smallest variance (in fact, it says more). It shows that the LSE is in some sense optimal. 42 / 56

43 PSD matrices revisited A matrix A ∈ R^{n×n} is PSD if x^T A x ≥ 0 for all x ∈ R^n. We write A ⪰ 0 when A is PSD. We also write A ⪰ B iff A − B ⪰ 0. Lemma 2 A matrix A ∈ R^{n×n} is PSD if and only if A = ZZ^T for some matrix Z ∈ R^{n×m}. If A ⪰ 0 then A = A^{1/2} A^{1/2}; take Z = A^{1/2}. In the other direction, assume A = ZZ^T for some Z. Then x^T A x = x^T Z Z^T x = (Z^T x)^T (Z^T x) = ‖Z^T x‖² ≥ 0. Exercise: If A is PSD, any principal submatrix of A is PSD. In particular A_{ii} ≥ 0 (diagonal elements are nonnegative). 43 / 56

44 Theorem 2 (Gauss-Markov) Consider the linear regression model y = Xβ + ε with E[ε] = 0 and cov(ε) = σ² I_n. Let β̂ = (X^T X)^{−1} X^T y be the LSE of β, and let b̂ be any other unbiased linear estimator of β. Then cov(b̂) ⪰ cov(β̂), that is, cov(b̂) − cov(β̂) is PSD. This implies var(b̂_j) − var(β̂_j) ≥ 0 for all j. 44 / 56

45 Proof We have b̂ = My for some matrix M ∈ R^{(p+1)×n}, and E[b̂] = β for all β ∈ R^{p+1} (why?). Expanding the above, conclude that MX = I. We also recall that β̂ = Ay where A = (X^T X)^{−1} X^T (and E[β̂] = β). What are the covariance matrices? cov(b̂) = σ² MM^T and cov(β̂) = σ² AA^T = σ² (X^T X)^{−1}. Let M̃ := M − A. Then M̃ M̃^T = MM^T − MA^T − AM^T + AA^T, and MA^T = MX (X^T X)^{−1} = (X^T X)^{−1} = AA^T (similarly AM^T = AA^T). It follows that M̃ M̃^T = MM^T − AA^T. Hence, cov(b̂) − cov(β̂) = σ² M̃ M̃^T ⪰ 0. 45 / 56
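
A numerical illustration of the theorem: build some other linear unbiased estimator b̂ = My with MX = I and check that cov(b̂) − cov(β̂) is PSD. A sketch (the particular M below is an arbitrary perturbation of A, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
A = np.linalg.inv(X.T @ X) @ X.T            # LSE: beta_hat = A y

# Another linear unbiased estimator b_hat = M y: any M = A + N with N X = 0 satisfies M X = I
N = rng.normal(size=(p + 1, n))
N = N - (N @ X) @ A                         # enforce N X = 0 (using A X = I)
M = A + N
print(np.allclose(M @ X, np.eye(p + 1)))    # unbiasedness condition M X = I holds

diff = M @ M.T - A @ A.T                    # (cov(b_hat) - cov(beta_hat)) / sigma^2
print(np.all(np.linalg.eigvalsh(diff) >= -1e-10))  # PSD, as the theorem predicts
```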

46 Remark 4 Gauss-Markov only says that β̂ is the best within the class of linear unbiased estimates. It might be that a nonlinear unbiased estimator has a smaller variance. However, if we assume normality, i.e., ε ∼ N(0, σ² I), then β̂ is the best unbiased estimator among all estimators, linear or nonlinear. This is a consequence of the Cramér-Rao bound: cov(β̂) = [I(β)]^{−1}, where I(β) is the Fisher information matrix. Remark 5 A consequence of Gauss-Markov is that a^T β̂ is BLUE for a^T β (for any a). 46 / 56

47 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix Simple linear regression (as special case) Statistical properties of estimators and related vectors Gauss Markov Theorem Generalized Least-Squares 47 / 56

48 Generalized least-squares (GLS) (Section 4.6) Now relax the assumption of uncorrelated noise with equal variance in y = Xβ + ε, cov(ε) = σ² I. Some applications where this assumption is not reasonable: Case 1: Each y_i is obtained from averaging m_i independent measurements. If the noise in each measurement has variance σ², then var(y_i) = var(ε_i) = σ²/m_i. The m_i could be different for i = 1, ..., n: cov(ε) = σ² diag(1/m_1, 1/m_2, ..., 1/m_n). 48 / 56

49 Case 2: Serial correlation among noise variables. This often happens when dealing with time series. For example: y_t = β_0 + β_1 x_{1t} + β_2 x_{2t} + ε_t, where y_t is the sales quantity, x_{1t} the price, x_{2t} how much is spent on ads, and ε_t the error, which could contain hidden variables not included in the model, like economic factors. It is reasonable to assume that y_t depends on y_{t−1} in some form. One way to model this is to assume dependence among {ε_t}, for example, a first-order autoregressive, a.k.a. AR(1), process: ε_t = φ ε_{t−1} + z_t, z_t iid ∼ N(0, σ²), for t = 1, ..., n and ε_0 = 0. 49 / 56

50 For example, a first-order autoregressive, a.k.a. AR(1), process: ε_t = φ ε_{t−1} + z_t, z_t iid ∼ N(0, σ²), for t = 1, ..., n and ε_0 = 0. Exercise: Show that cov(ε_t, ε_{t−r}) = σ² φ^r for r ≤ t − 1. The covariance matrix has Toeplitz structure, with (t, s) entry proportional to φ^{|t−s|}:
cov(ε) = σ² ×
[ 1        φ        φ²       φ³      ⋯  φ^{n−1}
  φ        1        φ        φ²      ⋯  φ^{n−2}
  φ²       φ        1        φ       ⋯  φ^{n−3}
  φ³       φ²       φ        1       ⋯
  ⋮                                      ⋮
  φ^{n−1}                            ⋯  1 ]
Case 3: Spatial observations. 50 / 56
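
This Toeplitz matrix is straightforward to build; a one-liner sketch using scipy (the values of φ, σ², and n are arbitrary):

```python
import numpy as np
from scipy.linalg import toeplitz

phi, sigma2, n = 0.6, 1.0, 6
V = sigma2 * toeplitz(phi ** np.arange(n))   # (i, j) entry: sigma^2 * phi^{|i - j|}
print(V.round(3))
```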

51 Generalized least-squares (GLS) Consider y = Xβ + ε but with a more general noise covariance: cov(ε) = σ² V. We assume V ∈ R^{n×n} is known but σ² is unknown. Assume that V is nonsingular (i.e., invertible). Since V ⪰ 0, it makes sense to talk about V^{1/2}; because of nonsingularity, V^{−1/2} also makes sense. Consider ỹ := V^{−1/2} y = V^{−1/2} X β + V^{−1/2} ε =: X̃ β + ε̃. Note that cov(ε̃) = V^{−1/2} cov(ε) V^{−1/2} = σ² V^{−1/2} V V^{−1/2} = σ² I. So, by transforming the measurements and the design matrix, we get back the standard regression setup (i.e., an uncorrelated noise vector). 51 / 56

52 Why transform? Gauss-Markov only applies to the uncorrelated case. By Gauss-Markov, the following is the BLUE estimator of β: β̂_GLS = (X̃^T X̃)^{−1} X̃^T ỹ = (X^T V^{−1} X)^{−1} X^T V^{−1} y. (Do the plug-in as an exercise.) We have E[β̂_GLS] = β and cov(β̂_GLS) = σ² (X̃^T X̃)^{−1} = σ² (X^T V^{−1} X)^{−1}. 52 / 56
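
A simulation sketch of GLS versus OLS under correlated noise (the AR(1)-style V, the coefficients, and the sample size below are hypothetical):

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(9)
n, phi, sigma = 200, 0.7, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -1.0])
V = toeplitz(phi ** np.arange(n))                 # assumed-known correlation structure

L = np.linalg.cholesky(V)                         # draw noise with covariance sigma^2 V
y = X @ beta + sigma * (L @ rng.normal(size=n))

Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_gls)                                   # both are near beta, but GLS has the smaller
print(beta_ols)                                   # covariance, sigma^2 (X^T V^{-1} X)^{-1}
```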

53 The SSE will be S(β̂_GLS) = ‖ỹ − X̃ β̂_GLS‖² = e_GLS^T V^{−1} e_GLS, where e_GLS = y − X β̂_GLS. Since the transformed model is a standard regression, we have S(β̂_GLS)/σ² ∼ χ²_{n−p−1}, and the unbiased estimate of the variance is s²_GLS = S(β̂_GLS)/(n − p − 1). 53 / 56

54 What about the ordinary least-squares estimate β̂_OLS, which ignores the noise dependence (i.e., is derived under the wrong assumptions)? β̂_OLS = (X^T X)^{−1} X^T y is still unbiased. However, Gauss-Markov implies that β̂_OLS has higher variance than β̂_GLS: cov(β̂_OLS) ⪰ cov(β̂_GLS). Exercise: Show that β̂_GLS = (X^T V^{−1} X)^{−1} X^T V^{−1} y is the solution of the following (quadratic) optimization problem: min_β Q(β), where Q(β) = (y − Xβ)^T V^{−1} (y − Xβ) = ‖V^{−1/2} (y − Xβ)‖². 54 / 56

55 Weighted least squares (WLS) Assume that we want to weigh the samples differently (i.e., discount or emphasize the effect of some samples). Given weights w_1, ..., w_n we can solve min_β S_w(β) = Σ_{i=1}^n w_i (y_i − x_i^T β)², where x_i^T is the ith row of X. The objective can be written as S_w(β) = (y − Xβ)^T W (y − Xβ), where W = diag(w_1, w_2, ..., w_n). Thus WLS can be considered a special case of GLS with V^{−1} = W. 55 / 56

56 The solution simplifies to β̂_WLS = (Σ_{i=1}^n w_i x_i x_i^T)^{−1} (Σ_{i=1}^n w_i x_i y_i), with covariance matrix cov(β̂_WLS) = σ² [Σ_{i=1}^n w_i x_i x_i^T]^{−1}. 56 / 56
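
A sketch of WLS in the averaged-measurements setting of Case 1 (the weights and data below are simulated; taking w_i = m_i corresponds to V^{−1} = W):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 100
m = rng.integers(1, 10, size=n)                   # y_i is an average of m_i raw measurements
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])
y = X @ beta + rng.normal(size=n) / np.sqrt(m)    # var(eps_i) = sigma^2 / m_i with sigma = 1

w = m.astype(float)                               # weights w_i proportional to 1 / var(eps_i)
W = np.diag(w)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Equivalent summation form: (sum_i w_i x_i x_i^T)^{-1} (sum_i w_i x_i y_i)
lhs = sum(wi * np.outer(xi, xi) for wi, xi in zip(w, X))
rhs = sum(wi * xi * yi for wi, xi, yi in zip(w, X, y))
print(np.allclose(beta_wls, np.linalg.solve(lhs, rhs)))   # True
```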
