STAT 540: Data Analysis and Regression
Wen Zhou
http://www.stat.colostate.edu/~riczw/
Email: riczw@stat.colostate.edu
Department of Statistics, Colorado State University
Fall 2015
W. Zhou (Colorado State University), STAT 540, July 6th, 2015
Contents
1 Multiple Linear Regression Model
2 Inference on Multiple Regression
3 Inference about Regression Parameters
4 Estimation and Prediction
5 Geometric View of Regression and Linear Models
6 Estimating Estimable Functions of Coefficients
Multiple Linear Regression
1 Multiple linear regression model
  1 Multiple linear regression model in matrix terms
  2 Estimation of regression coefficients
2 Inference
  1 ANOVA results
  2 Inference about regression parameters
  3 Estimation of mean response and prediction of new observation
3 Inference about regression parameters
4 Estimation and prediction
5 Geometric interpretation of linear model and regression
6 Estimating estimable functions of regression or linear coefficient β
1 Multiple Linear Regression Model
2 Inference on Multiple Regression
3 Inference about Regression Parameters
4 Estimation and Prediction
5 Geometric View of Regression and Linear Models
6 Estimating Estimable Functions of Coefficients
Multiple Linear Regression

Example: # of predictor variables = 2.

Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + ε_i,  ε_i iid N(0, σ²), for i = 1, ..., n.

Response surface: E(Y_i) = β_0 + β_1 X_i1 + β_2 X_i2.

Example: Y = pine bark beetle density, X_1 = temperature, X_2 = tree species.
Interpretation of Coefficients

β_0: intercept. When the model scope includes X_1 = X_2 = 0, β_0 is interpreted as the mean response E(Y) at X_1 = X_2 = 0.

β_j: slope in the direction of X_j (effect of X_j):

∂E(Y)/∂X_1 = E_{Y | X = (X_1 + 1, X_2)}(Y) − E_{Y | X = (X_1, X_2)}(Y) = β_1.

Interpreted as the change in the mean response E(Y) per unit increase in X_j, when the other predictors X_k (k ≠ j) are held constant.

What if X_j is qualitative?
Multiple Linear Regression

A general linear regression model is, for i = 1, ..., n,

Y_i = β_0 + Σ_{j=1}^{p−1} X_ij β_j + ε_i,  ε_i iid N(0, σ²).

Response surface: E(Y_i) = β_0 + Σ_{j=1}^{p−1} X_ij β_j.

Regression coefficients: β_0, β_1, ..., β_{p−2}, β_{p−1}.
Predictor variables: X_1, ..., X_{p−1} are known constants/values.
The model is linear in the parameters, not necessarily in the shape of the response surface.
Response Surface Examples

Polynomial regression: E(Y) = β_0 + β_1 X + β_2 X² + β_3 X³.

Transformed variables: E(log(Y)) = β_0 + β_1 X_1 + β_2 X_2.

Interaction effects: E(Y) = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_1 X_2. The change in the mean response corresponding to a unit change in X_1 depends on X_2, and vice versa.

Testing whether β_3 = 0 or not is very challenging in the high-dimensional setting (n = o(p)).
Qualitative Predictor Variables

Example: let Y = length of hospital stay, X_1 = age, and X_2 = gender: 0 for male and 1 for female. An additive model is

E(Y) = β_0 + β_1 X_1 + β_2 X_2.

Thus the response surface for males is E(Y) = β_0 + β_1 X_1, and for females it is E(Y) = (β_0 + β_2) + β_1 X_1. β_2 is the difference in mean response between females and males at any fixed age.

This kind of model is sometimes called an ANCOVA model.
Qualitative Predictor Variables

Interaction: the relationship between X_1 and Y for a fixed value of X_2 = x_2 depends on x_2. An interaction model is

E(Y) = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_1 X_2.

Thus the response surface for males is E(Y) = β_0 + β_1 X_1, and for females it is E(Y) = (β_0 + β_2) + (β_1 + β_3) X_1.
Notation

n observations, 1 response variable, p β's (so p − 1 predictors; β_0 is the pth coefficient).

Response variable: Y_{n×1} = (Y_1, Y_2, ..., Y_n)^T.

The predictors are arranged in the design matrix

X_{n×p} =
[ 1  X_11  X_12  ...  X_1,p−1
  1  X_21  X_22  ...  X_2,p−1
  ...
  1  X_n1  X_n2  ...  X_n,p−1 ]

Random error: ε_{n×1} = (ε_1, ε_2, ..., ε_n)^T.

Regression coefficients: β_{p×1} = (β_0, β_1, ..., β_{p−1})^T.
Multiple Linear Regression Model in Matrix Terms

The multiple linear regression model can be written as

Y = Xβ + ε,

where, as we have seen before, E(ε) = 0_n and Var{ε} = σ² I_{n×n}. Thus

E(Y) = Xβ and Y ~ N(Xβ, σ² I_{n×n}).
Least Squares Estimation

Consider the criterion

Q = Σ_{i=1}^n (Y_i − β_0 − Σ_{j=1}^{p−1} β_j X_ij)² = (Y − Xβ)^T (Y − Xβ).

The least squares estimate of β is

β̂ = (X^T X)^{−1} X^T Y,

assuming that X^T X is invertible. This is also the MLE.

What condition on X do we need to have X^T X invertible? What if X^T X is not invertible?
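The normal-equations formula above is easy to check numerically. A minimal sketch in Python/NumPy; the sample size, predictor distributions, and coefficient values below are illustrative assumptions, not from the slides:

```python
import numpy as np

# Simulate a small multiple regression data set (illustrative values).
rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n),             # intercept column
                     rng.uniform(0, 10, n),  # X_1
                     rng.uniform(0, 5, n)])  # X_2
beta = np.array([2.0, 1.5, -0.7])
Y = X @ beta + rng.normal(0, 1.0, n)

# Least squares estimate: beta_hat = (X^T X)^{-1} X^T Y.
# solve() avoids forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# np.linalg.lstsq minimizes ||Y - X b|| directly and agrees with the formula
# (and also handles rank-deficient X).
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

Solving the linear system X^T X b = X^T Y is preferred in practice over inverting X^T X, both for speed and numerical stability.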
Fitted Values and Residuals

Fitted values: Ŷ = Xβ̂ = HY, where the hat matrix is H = X(X^T X)^{−1} X^T.

Residuals: e = Y − Ŷ = (I − H)Y.
1 Multiple Linear Regression Model
2 Inference on Multiple Regression
3 Inference about Regression Parameters
4 Estimation and Prediction
5 Geometric View of Regression and Linear Models
6 Estimating Estimable Functions of Coefficients
Sums of Squares

We have sums of squares in matrix form:

SSR = Σ_{i=1}^n (Ŷ_i − Ȳ)² = Y^T (H − (1/n) J) Y
SSE = Σ_{i=1}^n (Y_i − Ŷ_i)² = Y^T (I − H) Y = e^T e
SSTO = Σ_{i=1}^n (Y_i − Ȳ)² = Y^T (I − (1/n) J) Y

where J is the n × n matrix of ones. Partitioning of the total sum of squares, and particularly the df:

SSTO (df = n − 1) = SSR (df = p − 1) + SSE (df = n − p).
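The decomposition SSTO = SSR + SSE can be verified numerically. A sketch on made-up data (the design and coefficients are illustrative assumptions):

```python
import numpy as np

# Verify the sum-of-squares partition on simulated data.
rng = np.random.default_rng(2)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)  # hat matrix H = X (X^T X)^{-1} X^T
Y_hat = H @ Y
Y_bar = Y.mean()

SSR = np.sum((Y_hat - Y_bar) ** 2)     # df = p - 1
SSE = np.sum((Y - Y_hat) ** 2)         # df = n - p
SSTO = np.sum((Y - Y_bar) ** 2)        # df = n - 1
```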
Mean Squares

Define mean squares

MSR = SSR / (p − 1),  MSE = SSE / (n − p).

It can be shown that E(MSE) = σ². It can also be shown that

E(MSR) = σ² if β_j = 0 for all j ≥ 1; E(MSR) > σ² otherwise.
ANOVA Table

The ANOVA table is

Source      SS    df     MS   F
Regression  SSR   p − 1  MSR  F = MSR/MSE
Error       SSE   n − p  MSE
Total       SSTO  n − 1

If β_1 = ... = β_{p−1} = 0, then E(MSE) = E(MSR) = σ², in which case MSR/MSE ≈ 1.
Overall F Test for Regression Relation

Test H_0: β_1 = ... = β_{p−1} = 0 vs. H_a: not all β_j (j = 1, ..., p − 1) equal zero. It can be shown that under H_0,

F = MSR/MSE ~ F(p − 1, n − p).

Thus we can perform an F-test at level α by the decision rule: reject H_0 if F > F(1 − α; p − 1, n − p).

Conditional on H_0 being rejected, we may want to find (or a.s.)

S = {j : β_j ≠ 0}

— identification/selection.
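The overall F test can be carried out directly from the sums of squares. A sketch with simulated data, assuming SciPy is available for the F distribution:

```python
import numpy as np
from scipy import stats

# Overall F test for H0: beta_1 = ... = beta_{p-1} = 0 on simulated data.
rng = np.random.default_rng(3)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
Y_hat = H @ Y
SSR = np.sum((Y_hat - Y.mean()) ** 2)
SSE = np.sum((Y - Y_hat) ** 2)
MSR, MSE = SSR / (p - 1), SSE / (n - p)

F = MSR / MSE
p_value = stats.f.sf(F, p - 1, n - p)              # P(F(p-1, n-p) > F)
reject = F > stats.f.ppf(0.95, p - 1, n - p)       # level alpha = 0.05
```

Since the simulated slopes are far from zero, the test rejects H_0 here.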
Coefficient of Multiple Determination, R²

The coefficient of multiple determination is denoted by R² and is defined as

R² = SSR/SSTO = 1 − SSE/SSTO.

Interpretation: the proportion of variation in the Y_i's explained by the regression relation.
More on R²

As more predictors are added to the model (p ↑), R² must increase. Why?

Recall SSTO = SSR + SSE. SSTO is fixed for Y, while SSE is the minimum of the unconstrained convex optimization problem

β̂ = arg min SSE(β_0, ..., β_{p−1}).

Suppose we consider an extra predictor and thus consider SSE(β_0, ..., β_p). The β that minimizes this SSE cannot be inferior to the previous minimizer, because β_p = 0 is a special case within the new minimization problem, which incorporates the previous one.
Adjusted R²

R² depends on p (even for p ≪ n); how do we remove that dependence? The adjusted coefficient of multiple determination is denoted by R_a² and is defined as

R_a² = 1 − (SSE/(n − p)) / (SSTO/(n − 1)) = 1 − ((n − 1)/(n − p)) · SSE/SSTO.

The adjusted coefficient of multiple determination R_a² may decrease when more predictors are in the model.

Many other statistics, such as AIC, BIC, Mallow's C_p, etc., will be discussed; they are superior to R_a².
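The contrast between R² and R_a² can be seen by adding a pure-noise predictor. A sketch on made-up data (all values illustrative):

```python
import numpy as np

# Adding a noise predictor cannot decrease R^2, but R_a^2 may drop.
rng = np.random.default_rng(4)
n = 30
x1 = rng.normal(size=n)
Y = 1.0 + 2.0 * x1 + rng.normal(size=n)

def r2_stats(X, Y):
    """Return (R^2, adjusted R^2) for design matrix X."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)
    SSE = np.sum((Y - H @ Y) ** 2)
    SSTO = np.sum((Y - Y.mean()) ** 2)
    R2 = 1 - SSE / SSTO
    R2_adj = 1 - (n - 1) / (n - p) * SSE / SSTO
    return R2, R2_adj

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([X_small, rng.normal(size=n)])  # + noise predictor

R2_small, R2a_small = r2_stats(X_small, Y)
R2_big, R2a_big = r2_stats(X_big, Y)
```

R2_big is never below R2_small, which is the monotonicity argument from the previous slide; the adjusted version penalizes the extra parameter.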
1 Multiple Linear Regression Model
2 Inference on Multiple Regression
3 Inference about Regression Parameters
4 Estimation and Prediction
5 Geometric View of Regression and Linear Models
6 Estimating Estimable Functions of Coefficients
Estimation of Regression Coefficients

Mean: E(β̂) = β. That is, the LS estimate β̂ is an unbiased estimate of β.

Variance-covariance matrix: Σ_β̂ := Var{β̂} = σ² (X^T X)^{−1}.

(Σ_β̂)_kk = Var{β̂_k} and (Σ_β̂)_kl = Cov{β̂_k, β̂_l}, indexing from k = 0.
Inference about Regression Coefficients

The estimated variance-covariance matrix is

Σ̂_β̂ := s²{β̂} = MSE · (X^T X)^{−1} =
[ s²{β̂_0}          s{β̂_0, β̂_1}     ...  s{β̂_0, β̂_{p−1}}
  s{β̂_1, β̂_0}      s²{β̂_1}         ...  s{β̂_1, β̂_{p−1}}
  ...
  s{β̂_{p−1}, β̂_0}  s{β̂_{p−1}, β̂_1} ...  s²{β̂_{p−1}} ]

Under the multiple linear regression model, we have

(β̂_k − β_k) / s{β̂_k} ~ t(n − p), for k = 0, 1, ..., p − 1.
Inference about Regression Coefficients

Thus the (1 − α) confidence interval for β_k is

β̂_k ± t(1 − α/2; n − p) s{β̂_k}.

Test H_0: β_k = β_k0 versus H_a: β_k ≠ β_k0. Under H_0, we have

t = (β̂_k − β_k0) / s{β̂_k} ~ t(n − p).

Thus we can perform a t-test at level α by the decision rule: reject H_0 if |t| > t(1 − α/2; n − p).
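The coefficient-wise confidence intervals and t tests above can be sketched as follows, on simulated data (illustrative values), assuming SciPy for t quantiles:

```python
import numpy as np
from scipy import stats

# 95% CIs and a t test for a single coefficient on simulated data.
rng = np.random.default_rng(5)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
MSE = np.sum((Y - X @ beta_hat) ** 2) / (n - p)
se = np.sqrt(MSE * np.diag(XtX_inv))           # s{beta_hat_k}

t_crit = stats.t.ppf(0.975, n - p)             # t(1 - alpha/2; n - p)
ci = np.column_stack([beta_hat - t_crit * se,
                      beta_hat + t_crit * se])

# t statistic and decision for H0: beta_1 = 0
t1 = beta_hat[1] / se[1]
reject = abs(t1) > t_crit
```

With the simulated slope β_1 = 2, the test rejects H_0 here.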
1 Multiple Linear Regression Model
2 Inference on Multiple Regression
3 Inference about Regression Parameters
4 Estimation and Prediction
5 Geometric View of Regression and Linear Models
6 Estimating Estimable Functions of Coefficients
Estimation of Mean Response: Hidden Extrapolation

Define X_h = (1, X_h1, ..., X_h,p−1)^T. Caution about hidden extrapolation: the region (with respect to X_0) defined by

d(X_0) = X_0^T (X^T X)^{−1} X_0 ≤ h_max,

where h_max = max_i h_ii, is an ellipsoid enclosing all data points inside the regressor variable hull (RVH). Prediction at any X_0 outside the RVH (i.e., d(X_0) > h_max) is hidden extrapolation, at least to some degree.
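The RVH check above is simple to implement. A sketch with made-up data (design and query points are illustrative assumptions):

```python
import numpy as np

# Flag hidden extrapolation: X0 is inside the RVH ellipsoid when
# d(X0) = X0^T (X^T X)^{-1} X0 <= h_max = max_i h_ii.
rng = np.random.default_rng(6)
n = 30
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n), rng.uniform(0, 5, n)])

XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # leverages h_ii = diag(H)
h_max = h.max()

def hidden_extrapolation(x0):
    """True if predicting at x0 extrapolates beyond the RVH."""
    return x0 @ XtX_inv @ x0 > h_max

inside = hidden_extrapolation(X[0])                      # a training point
outside = hidden_extrapolation(np.array([1.0, 100.0, -50.0]))  # far away
```

A training point can never be flagged, since its d-value is exactly its own leverage h_ii ≤ h_max; the sum of the leverages equals p, the number of columns of X.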
Estimation of Mean Response

The estimated mean response corresponding to X_h is Ŷ_h = X_h^T β̂.

Mean: E(Ŷ_h) = X_h^T β = E(Y_h).

Variance: Var{Ŷ_h} = σ² X_h^T (X^T X)^{−1} X_h.

Estimated variance: s²{Ŷ_h} = MSE · X_h^T (X^T X)^{−1} X_h.
Confidence Intervals for Mean Response

The (1 − α) confidence interval for E(Y_h) is

Ŷ_h ± t(1 − α/2; n − p) s{Ŷ_h}.

The Working-Hotelling (1 − α) confidence band for the regression surface is Ŷ_h ± W s{Ŷ_h}, where W² = p F(1 − α; p, n − p).

The Bonferroni (1 − α) joint confidence intervals for g mean responses are Ŷ_h ± B s{Ŷ_h}, where B = t(1 − α/(2g); n − p).
Prediction of New Observation

The predicted new observation corresponding to X_h is Ŷ_h = X_h^T β̂, and

Mean: E(Ŷ_h) = X_h^T β = E(Y_h(new)).

Prediction error variance: σ²_pred = Var(Ŷ_h − Y_h(new)) = σ² (1 + X_h^T (X^T X)^{−1} X_h).

Estimated prediction error variance: s²{pred} = MSE (1 + X_h^T (X^T X)^{−1} X_h).
Prediction Intervals for New Observation

The (1 − α) prediction interval for Y_h(new) is

Ŷ_h ± t(1 − α/2; n − p) s{pred}.

The Scheffé (1 − α) joint prediction intervals for g new observations are Ŷ_h ± S s{pred}, where S² = g F(1 − α; g, n − p).

The Bonferroni (1 − α) joint prediction intervals for g new observations are Ŷ_h ± B s{pred}, where B = t(1 − α/(2g); n − p).
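The difference between the interval for a mean response and the interval for a new observation comes down to the extra "1 +" in the prediction variance. A sketch on simulated data (illustrative values), assuming SciPy:

```python
import numpy as np
from scipy import stats

# CI for E(Y_h) vs PI for Y_h(new) at the same X_h: the PI is wider.
rng = np.random.default_rng(7)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
MSE = np.sum((Y - X @ beta_hat) ** 2) / (n - p)

X_h = np.array([1.0, 0.5, -0.5])
Y_h = X_h @ beta_hat
s_mean = np.sqrt(MSE * X_h @ XtX_inv @ X_h)        # s{Y_hat_h}
s_pred = np.sqrt(MSE * (1 + X_h @ XtX_inv @ X_h))  # s{pred}

t_crit = stats.t.ppf(0.975, n - p)
ci = (Y_h - t_crit * s_mean, Y_h + t_crit * s_mean)
pi = (Y_h - t_crit * s_pred, Y_h + t_crit * s_pred)
```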
1 Multiple Linear Regression Model
2 Inference on Multiple Regression
3 Inference about Regression Parameters
4 Estimation and Prediction
5 Geometric View of Regression and Linear Models
6 Estimating Estimable Functions of Coefficients
Geometric Viewpoint: The Column Space of the Design Matrix

Xβ is a linear combination of the columns of X:

Xβ = [x_1, ..., x_p] (β_1, ..., β_p)^T = β_1 x_1 + ... + β_p x_p.

The set of all possible linear combinations of the columns of X is called the column space of X and is denoted by

C(X) = {Xa : a ∈ R^p}.

The Gauss-Markov linear model says y is a random vector whose mean is in the column space of X and whose variance is σ² I for some positive real number σ², i.e.,

E(y) ∈ C(X) and Var(y) = σ² I, σ² ∈ R₊.
An Example Column Space

X = (1, 1)^T.

C(X) = {Xa : a ∈ R}
     = {(1, 1)^T a_1 : a_1 ∈ R}
     = {a_1 (1, 1)^T : a_1 ∈ R}
     = {(a_1, a_1)^T : a_1 ∈ R}.
Another Example Column Space

X =
[ 1 0
  1 0
  0 1
  0 1 ]

C(X) = {X (a_1, a_2)^T : a_1, a_2 ∈ R}
     = {a_1 (1, 1, 0, 0)^T + a_2 (0, 0, 1, 1)^T : a_1, a_2 ∈ R}
     = {(a_1, a_1, a_2, a_2)^T : a_1, a_2 ∈ R}.
Another Example Column Space

X_1 =
[ 1 0
  1 0
  0 1
  0 1 ],  X_2 =
[ 1 1 0
  1 1 0
  1 0 1
  1 0 1 ]

x ∈ C(X_1) ⟹ x = X_1 a for some a ∈ R²
⟹ x = X_2 (0, a_1, a_2)^T for some a ∈ R²
⟹ x = X_2 b for some b ∈ R³
⟹ x ∈ C(X_2).

Thus C(X_1) ⊆ C(X_2).
Another Example Column Space (continued)

x ∈ C(X_2) ⟹ x = X_2 a for some a ∈ R³
⟹ x = a_1 (1, 1, 1, 1)^T + a_2 (1, 1, 0, 0)^T + a_3 (0, 0, 1, 1)^T for some a ∈ R³
⟹ x = (a_1 + a_2, a_1 + a_2, a_1 + a_3, a_1 + a_3)^T for some a_1, a_2, a_3 ∈ R
⟹ x = X_1 (a_1 + a_2, a_1 + a_3)^T for some a_1, a_2, a_3 ∈ R.
Another Example Column Space (continued)

x = X_1 (a_1 + a_2, a_1 + a_3)^T for some a_1, a_2, a_3 ∈ R
⟹ x = X_1 b for some b ∈ R²
⟹ x ∈ C(X_1).

Thus C(X_2) ⊆ C(X_1), and as we have shown C(X_1) ⊆ C(X_2), it follows that C(X_1) = C(X_2).
Estimation of E(y)

A fundamental goal of linear model analysis is to estimate E(y). We could, of course, use y to estimate E(y): y is obviously an unbiased estimator of E(y), but it is often not a very sensible estimator. For example, suppose

(y_1, y_2)^T = (1, 1)^T μ + (ε_1, ε_2)^T, and we observe y = (6.1, 2.3)^T.

Should we estimate E(y) = (μ, μ)^T by y = (6.1, 2.3)^T?
Estimation of E(y)

The Gauss-Markov linear model says that E(y) ∈ C(X), so we should use that information when estimating E(y). Consider estimating E(y) by the point in C(X) that is closest to y (as measured by the usual Euclidean distance). This unique point is called the orthogonal projection of y onto C(X) and denoted by ŷ (although it could be argued that Ê(y) might be better notation). By definition,

‖y − ŷ‖ = min_{z ∈ C(X)} ‖y − z‖, where ‖a‖ = (Σ_{i=1}^n a_i²)^{1/2}.
Geometric Viewpoint on Multiple Regression (and LM)

Geometrically, how do we minimize the distance between Y and C(X)? The closest point is the orthogonal projection Ŷ = HY. The vector between Y and Xβ̂ is the residual e = Y − Ŷ, and the distance is ‖e‖.

For R²: if we add another predictor, C(X) gains a dimension, so ‖e‖ can only decrease.

Note: if dim(C(X)) = n, then Ŷ = Y.
Orthogonal Projection Matrices

It can be shown that, as we did for least squares estimators, for any y ∈ R^n, ŷ = P_X y is the optimal one, i.e., ŷ = P_X y is the best estimator of E(y) in the class of linear unbiased estimators, for the unique matrix P_X = H, the hat matrix, called the orthogonal projection matrix.

- HH = H (idempotent)
- H^T = H (symmetric)
- HX = X and X^T H = X^T (why? Intuitively, the columns of X are already in C(X), so projecting them changes nothing.)
- If X^T X is not invertible, we use a generalized inverse (X^T X)^−, where A A^− A = A. H is invariant to the choice of (X^T X)^−, which is itself not unique.
- ŷ and y − ŷ are orthogonal (why?)
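These properties can be checked numerically even when X is rank deficient. A sketch using the treatment-effects design that appears later in these slides; NumPy's pinv supplies one particular generalized inverse of X^T X:

```python
import numpy as np

# Orthogonal projection matrix properties on a rank-deficient design.
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)

G = np.linalg.pinv(X.T @ X)      # a generalized inverse (X^T X)^-
H = X @ G @ X.T                  # orthogonal projection onto C(X)

idempotent = np.allclose(H @ H, H)
symmetric = np.allclose(H, H.T)
projects_X = np.allclose(H @ X, X)

# y_hat and the residual y - y_hat are orthogonal
y = np.arange(6.0)
orthogonal = np.isclose((H @ y) @ (y - H @ y), 0)
```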
An Example Orthogonal Projection

Suppose (y_1, y_2)^T = (1, 1)^T μ + (ε_1, ε_2)^T, and we observe y = (6.1, 2.3)^T. Then, with X = (1, 1)^T,

X(X^T X)^{−1} X^T = (1, 1)^T [2]^{−1} (1, 1)
= (1/2) [ 1 1
          1 1 ]
= [ 1/2 1/2
    1/2 1/2 ].
An Example Orthogonal Projection

Thus, the orthogonal projection of y = (6.1, 2.3)^T onto the column space of X = (1, 1)^T is

P_X y = Hy = [ 1/2 1/2
               1/2 1/2 ] (6.1, 2.3)^T = (4.2, 4.2)^T.
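The two-observation example above, verified numerically (the projection onto the span of (1, 1)^T simply averages the coordinates):

```python
import numpy as np

# Orthogonal projection of y onto C(X) for X = (1, 1)^T.
X = np.array([[1.0], [1.0]])
y = np.array([6.1, 2.3])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # = [[1/2, 1/2], [1/2, 1/2]]
y_hat = H @ y                          # both entries equal mean(y) = 4.2
```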
Geometric illustration

Suppose X = (1, 2)^T and y = (2, 3/4)^T.
Geometric illustration

The angle between ŷ and the residual y − ŷ is 90°. Hence: orthogonal projection.
1 Multiple Linear Regression Model
2 Inference on Multiple Regression
3 Inference about Regression Parameters
4 Estimation and Prediction
5 Geometric View of Regression and Linear Models
6 Estimating Estimable Functions of Coefficients
What if X is not of full column rank?

X^T X is not invertible; then (X^T X)^− has to be defined via a generalized inverse matrix.

If X is not of full column rank, then there are infinitely many vectors in the set {b : Xb = Xβ} for any fixed value of β. Thus, no matter what the value of E(y), there will be infinitely many vectors b such that Xb = E(y) when X is not of full column rank.

Our response vector y can help us learn about E(y) = Xβ, but when X is NOT of full column rank, there is NO hope of learning about β alone unless additional information about β is available.

However, we can estimate estimable functions of β.
Treatment Effects Model

Researchers randomly assigned a total of six experimental units to two treatments and measured a response of interest:

y_ij = μ + τ_i + ε_ij,  i = 1, 2;  j = 1, 2, 3.

(y_11, y_12, y_13, y_21, y_22, y_23)^T = (μ+τ_1, μ+τ_1, μ+τ_1, μ+τ_2, μ+τ_2, μ+τ_2)^T + (ε_11, ε_12, ε_13, ε_21, ε_22, ε_23)^T.

Question: what are X and β?
Treatment Effects Model (continued)

In this case, it makes no sense to estimate β = (μ, τ_1, τ_2)^T, because there are multiple (infinitely many, in fact) choices of β that define the same mean for y. For example,

(μ, τ_1, τ_2)^T = (5, −1, 1)^T, (0, 4, 6)^T, (999, −995, −993)^T

all yield the same Xβ = E(y). When multiple values of β define the same E(y), we say that β is non-estimable.
Estimable Functions of β

A linear function of β, Cβ, is said to be estimable if there is a linear function of y, say Ay, that is an unbiased estimator of Cβ. Otherwise (if no such linear function exists), Cβ is non-estimable.

Note that Ay is an unbiased estimator of Cβ if and only if

E(Ay) = Cβ for all β ∈ R^p ⟺ AXβ = Cβ for all β ∈ R^p ⟺ AX = C.

This says that we can estimate Cβ as long as Cβ = AXβ = A E(y) for some A, i.e., as long as Cβ is a linear function of E(y).

The bottom line is that we can always estimate E(y) and all linear functions of E(y); all other linear functions of β are non-estimable.
Treatment Effects Model (continued)

Xβ =
[ 1 1 0
  1 1 0
  1 1 0
  1 0 1
  1 0 1
  1 0 1 ] (μ, τ_1, τ_2)^T = (μ+τ_1, μ+τ_1, μ+τ_1, μ+τ_2, μ+τ_2, μ+τ_2)^T,

so that

[1, 0, 0, 0, 0, 0] Xβ = [1, 1, 0] β = μ + τ_1
[0, 0, 0, 1, 0, 0] Xβ = [1, 0, 1] β = μ + τ_2
[1, 0, 0, −1, 0, 0] Xβ = [0, 1, −1] β = τ_1 − τ_2

are estimable functions of β.
Estimating Estimable Functions of β

If Cβ is estimable, then there exists a matrix A such that C = AX and Cβ = AXβ = A E(y) for any β ∈ R^p. It makes sense to estimate Cβ by

A Ê(y) = A ŷ = A P_X y = A X (X^T X)^− X^T y = A X (X^T X)^− X^T X β̂ = A P_X X β̂ = A X β̂ = C β̂.

Cβ̂ is called an Ordinary Least Squares (OLS) estimator of Cβ. Note that although the hat is on β, it is Cβ that we are estimating.

Invariance of Cβ̂ to the choice of β̂: although there are infinitely many solutions to the normal equations when X is not of full column rank, Cβ̂ is the same for all normal-equation solutions β̂ whenever Cβ is estimable (STAT 640).
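The invariance claim can be demonstrated directly on the treatment-effects design. A sketch with made-up response values; the second solution is built by adding a vector from the null space of X, which leaves Xb (and hence every estimable Cb) unchanged:

```python
import numpy as np

# Invariance of C beta_hat across normal-equation solutions.
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)
y = np.array([5.0, 6.0, 7.0, 2.0, 3.0, 4.0])   # made-up responses
C = np.array([0.0, 1.0, -1.0])                 # C beta = tau_1 - tau_2

# One solution via the Moore-Penrose generalized inverse ...
b1 = np.linalg.pinv(X.T @ X) @ X.T @ y
# ... and another: add a null-space vector of X, here (1, -1, -1)^T.
b2 = b1 + 10.0 * np.array([1.0, -1.0, -1.0])

same_fit = np.allclose(X @ b1, X @ b2)   # identical fitted means
tau_diff_1 = C @ b1
tau_diff_2 = C @ b2
ybar1, ybar2 = y[:3].mean(), y[3:].mean()
```

Both solutions give C β̂ = ȳ_1. − ȳ_2., exactly as the next slides derive algebraically.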
Treatment Effects Model (continued)

Suppose our aim is to estimate τ_1 − τ_2. As noted before,

Xβ =
[ 1 1 0
  1 1 0
  1 1 0
  1 0 1
  1 0 1
  1 0 1 ] (μ, τ_1, τ_2)^T = (μ+τ_1, μ+τ_1, μ+τ_1, μ+τ_2, μ+τ_2, μ+τ_2)^T,

so that [1, 0, 0, −1, 0, 0] Xβ = [0, 1, −1] β = τ_1 − τ_2. Thus, we can compute the OLS estimator of τ_1 − τ_2 as

[1, 0, 0, −1, 0, 0] ŷ = [0, 1, −1] β̂,

where β̂ is any solution to the normal equations.
Treatment Effects Model (continued)

The normal equations X^T X b = X^T y in this case are

[ 1 1 1 1 1 1
  1 1 1 0 0 0
  0 0 0 1 1 1 ]
[ 1 1 0
  1 1 0
  1 1 0
  1 0 1
  1 0 1
  1 0 1 ] (b_1, b_2, b_3)^T =
[ 1 1 1 1 1 1
  1 1 1 0 0 0
  0 0 0 1 1 1 ] (y_11, y_12, y_13, y_21, y_22, y_23)^T,

so that

[ 6 3 3
  3 3 0
  3 0 3 ] (b_1, b_2, b_3)^T = (y_.., y_1., y_2.)^T.
Treatment Effects Model (continued)

β̂_1 = (ȳ_.., ȳ_1. − ȳ_.., ȳ_2. − ȳ_..)^T and β̂_2 = (0, ȳ_1., ȳ_2.)^T are both solutions to the normal equations (check this). Thus, the OLS estimator of Cβ = [0, 1, −1] β = τ_1 − τ_2 is

C β̂_1 = [0, 1, −1] (ȳ_.., ȳ_1. − ȳ_.., ȳ_2. − ȳ_..)^T = ȳ_1. − ȳ_2. = [0, 1, −1] (0, ȳ_1., ȳ_2.)^T = C β̂_2.

HW: Can you find two different generalized inverses of X^T X, A_1 and A_2, with (X^T X) A_i (X^T X) = X^T X (so that A_i = (X^T X)^− for each i), that give you β̂_1 and β̂_2, respectively?
The Gauss-Markov Theorem

Under the Gauss-Markov linear model, the OLS estimator c^T β̂ of an estimable linear function c^T β is the unique Best Linear Unbiased Estimator (BLUE), in the sense that Var(c^T β̂) is strictly less than the variance of any other linear unbiased estimator of c^T β, for all β ∈ R^p and all σ² ∈ R₊.

The Gauss-Markov theorem says that if we want to estimate an estimable linear function c^T β using a linear estimator that is unbiased, we should always use the OLS estimator.

In our simple example of the treatment effects model, we could have used y_11 − y_21 to estimate τ_1 − τ_2. It is easy to see that y_11 − y_21 is a linear estimator that is unbiased for τ_1 − τ_2, but its variance is clearly larger than the variance of the OLS estimator ȳ_1. − ȳ_2. (as guaranteed by the Gauss-Markov theorem).