Math 423/533: The Main Theoretical Topics
Notation

- sample size $n$, data index $i$
- number of predictors $p$ ($p = 2$ for simple linear regression)
- $y_i$: response for individual $i$
- $x_i = (x_{i1}, \dots, x_{ip})$: $(1 \times p)$ row vector of predictors for individual $i$
- $X$: $(n \times p)$ matrix containing all predictors for all individuals $i = 1, \dots, n$
- $y = (y_1, \dots, y_n)^\top$: $(n \times 1)$ column vector of responses
- $Y_i$ and $Y$: random variables corresponding to the responses
Linear model assumptions

For $i = 1, \dots, n$,
$$E_{Y_i|x_i}[Y_i \mid x_i] = x_i\beta = \sum_{j=1}^{p} \beta_j x_{ij}, \qquad Var_{Y_i|x_i}[Y_i \mid x_i] = \sigma^2,$$
where $\beta = (\beta_1, \dots, \beta_p)^\top$ is the $(p \times 1)$ vector of regression coefficients, and $\sigma^2 > 0$ is the error variance. We assume also that $Y_1, \dots, Y_n$ are independent given $x_1, \dots, x_n$.
Linear model assumptions

In vector form,
$$E_{Y|X}[Y \mid X] = X\beta \quad (n \times 1) \qquad \text{and} \qquad Var_{Y|X}[Y \mid X] = \sigma^2 I_n \quad (n \times n).$$
This is equivalent to a model specification of
$$Y = X\beta + \epsilon,$$
where
$$E_{\epsilon|X}[\epsilon \mid X] = 0_n, \qquad Var_{\epsilon|X}[\epsilon \mid X] = \sigma^2 I_n.$$
The intercept

We usually consider including the special predictor $x_{i0} \equiv 1$, $i = 1, \dots, n$, and specify the model
$$E_{Y_i|x_i}[Y_i \mid x_i] = x_i\beta = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} = \sum_{j=0}^{p} \beta_j x_{ij}.$$
This model has $p + 1$ $\beta$ parameters. We will let $p$ count the total number of predictors, including the intercept term.
Simple linear regression

We specify
$$E_{Y_i|x_i}[Y_i \mid x_i] = \beta_0 + \beta_1 x_{i1} = [1 \;\; x_{i1}] \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = x_i\beta,$$
with $p = 2$ parameters in the regression model. This model posits a straight-line relationship between $x$ and $y$.
Least squares estimation in simple linear regression

On the basis of data $(x_{i1}, y_i)$, $i = 1, \dots, n$, we choose the line of best fit according to the least squares principle. We estimate the parameters $\beta = (\beta_0, \beta_1)^\top$ by $\widehat\beta$, where
$$\widehat\beta = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1})^2 = \arg\min_{\beta} S(\beta),$$
where we may also write, in vector form,
$$S(\beta) = (y - X\beta)^\top (y - X\beta).$$
We achieve the minimization by calculus.
The Normal Equations

We solve
$$\frac{\partial S(\beta)}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1}) = 0, \qquad \frac{\partial S(\beta)}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_{i1}(y_i - \beta_0 - \beta_1 x_{i1}) = 0.$$
These two equations can be written
$$n\beta_0 + \beta_1 \sum_{i=1}^{n} x_{i1} = \sum_{i=1}^{n} y_i, \qquad \beta_0 \sum_{i=1}^{n} x_{i1} + \beta_1 \sum_{i=1}^{n} x_{i1}^2 = \sum_{i=1}^{n} x_{i1} y_i,$$
or, in matrix form,
$$(X^\top X)\beta = X^\top y.$$
The Normal Equations

These equations are the Normal Equations. If the symmetric $p \times p = 2 \times 2$ matrix $X^\top X$ is non-singular, then we may write the solution
$$\widehat\beta = (X^\top X)^{-1} X^\top y,$$
which yields a $p \times 1 = 2 \times 1$ vector of least squares estimates. Explicitly, we have
$$\begin{bmatrix} \widehat\beta_0 \\ \widehat\beta_1 \end{bmatrix} = \begin{bmatrix} n & \sum x_{i1} \\ \sum x_{i1} & \sum x_{i1}^2 \end{bmatrix}^{-1} \begin{bmatrix} \sum y_i \\ \sum x_{i1} y_i \end{bmatrix}.$$
Estimates

By direct inversion of the $2 \times 2$ matrix,
$$\begin{bmatrix} n & \sum x_{i1} \\ \sum x_{i1} & \sum x_{i1}^2 \end{bmatrix}^{-1} = \frac{1}{n \sum x_{i1}^2 - \left\{ \sum x_{i1} \right\}^2} \begin{bmatrix} \sum x_{i1}^2 & -\sum x_{i1} \\ -\sum x_{i1} & n \end{bmatrix}.$$
We write
$$S_{xx} = \sum_{i=1}^{n} x_{i1}^2 - \frac{1}{n}\left\{ \sum_{i=1}^{n} x_{i1} \right\}^2 = \sum_{i=1}^{n} (x_{i1} - \bar{x}_1)^2, \qquad S_{xy} = \sum_{i=1}^{n} x_{i1} y_i - \frac{1}{n}\left\{ \sum_{i=1}^{n} x_{i1} \right\}\left\{ \sum_{i=1}^{n} y_i \right\} = \sum_{i=1}^{n} y_i (x_{i1} - \bar{x}_1).$$
Estimates (cont.)

Thus, after some algebra,
$$\widehat\beta_0 = \bar{y} - \widehat\beta_1 \bar{x}_1, \qquad \widehat\beta_1 = \frac{S_{xy}}{S_{xx}}.$$
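The closed-form estimates above are easy to check numerically. A minimal sketch, using a small hypothetical dataset (the values of `x` and `y` are invented for illustration):

```python
# Closed-form least squares estimates for simple linear regression,
# computed from S_xx and S_xy on a small hypothetical dataset.

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 7.0]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# S_xx = sum of (x_i - xbar)^2,  S_xy = sum of y_i * (x_i - xbar)
S_xx = sum((xi - xbar) ** 2 for xi in x)
S_xy = sum(yi * (xi - xbar) for xi, yi in zip(x, y))

beta1_hat = S_xy / S_xx              # slope estimate
beta0_hat = ybar - beta1_hat * xbar  # intercept estimate

print(beta0_hat, beta1_hat)  # 0.5 1.6
```

For this dataset $S_{xx} = 5$ and $S_{xy} = 8$, so the fitted line is $\widehat{y} = 0.5 + 1.6x$.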
Residuals and fitted values

Define, for $i = 1, \dots, n$,
$$e_i = y_i - (\widehat\beta_0 + \widehat\beta_1 x_{i1}) = y_i - \widehat{y}_i,$$
where $e_i$ is the $i$th residual and $\widehat{y}_i$ is the $i$th fitted value.
Statistical properties of least squares estimators

It is evident from the formula
$$\widehat\beta = (X^\top X)^{-1} X^\top y = Ay, \quad \text{say, where } A = (X^\top X)^{-1} X^\top,$$
that the least squares estimates are merely linear combinations of the observed responses $y = (y_1, \dots, y_n)^\top$. Specifically, in the simple linear regression,
$$\widehat\beta_0 = \sum_{i=1}^{n} \left( \frac{1}{n} - \bar{x}_1 c_i \right) y_i, \qquad \widehat\beta_1 = \sum_{i=1}^{n} c_i y_i,$$
where, for $i = 1, \dots, n$,
$$c_i = \frac{x_{i1} - \bar{x}_1}{S_{xx}}.$$
Statistical properties of the estimators

In random variable form, we have the estimators
$$\widehat\beta = (X^\top X)^{-1} X^\top Y = AY,$$
and thus, under the model assumptions
$$E_{Y|X}[Y \mid X] = X\beta \qquad \text{and} \qquad Var_{Y|X}[Y \mid X] = \sigma^2 I_n,$$
we can study distributional properties of the estimators.
Statistical properties of the estimators (cont.)

We have, using elementary properties of expectation and variance,
$$E_{Y|X}[\widehat\beta \mid X] = \beta \quad (p \times 1), \qquad Var_{Y|X}[\widehat\beta \mid X] = \sigma^2 (X^\top X)^{-1} \quad (p \times p),$$
with $p = 2$. Explicitly,
$$Var_{Y|X}[\widehat\beta_0 \mid X] = \sigma^2 \frac{\sum x_{i1}^2}{n S_{xx}} = \sigma^2 \left( \frac{1}{n} + \frac{(\bar{x}_1)^2}{S_{xx}} \right), \qquad Var_{Y|X}[\widehat\beta_1 \mid X] = \frac{\sigma^2}{S_{xx}}.$$
Residuals and Fitted values

For the simple linear regression:

1. $\sum_{i=1}^{n} e_i = 0$, so that $\bar{e} = 0$;
2. $\sum_{i=1}^{n} x_{i1} e_i = 0$;
3. $\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \widehat{y}_i$;
4. $\sum_{i=1}^{n} \widehat{y}_i e_i = 0$;

that is, the observed residual vector $e$ is orthogonal to the observed $n \times 1$ vectors $1_n$, $x = (x_{11}, \dots, x_{n1})^\top$, and $\widehat{y} = (\widehat{y}_1, \dots, \widehat{y}_n)^\top$.
Residuals and Fitted values (cont.)

These results arise as $\widehat\beta$ solves
$$X^\top (y - X\widehat\beta) = 0_p,$$
and the results above follow immediately.
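The orthogonality properties can be verified directly. A sketch on the same kind of hypothetical dataset used earlier; the three printed sums should all be zero up to floating-point error:

```python
# Numerical check that least squares residuals are orthogonal to the
# intercept column, the predictor, and the fitted values (hypothetical data).

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 7.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
S_xx = sum((xi - xbar) ** 2 for xi in x)
S_xy = sum(yi * (xi - xbar) for xi, yi in zip(x, y))
b1 = S_xy / S_xx
b0 = ybar - b1 * xbar

yhat = [b0 + b1 * xi for xi in x]          # fitted values
e = [yi - yh for yi, yh in zip(y, yhat)]   # residuals

# sum e_i, sum x_i e_i, sum yhat_i e_i: all zero up to rounding
print(sum(e),
      sum(xi * ei for xi, ei in zip(x, e)),
      sum(yh * ei for yh, ei in zip(yhat, e)))
```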
Estimating $\sigma^2$

Let
$$SS_{Res} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \widehat{y}_i)^2 = \sum_{i=1}^{n} y_i^2 - n(\bar{y})^2 - \widehat\beta_1 S_{xy} = SS_T - \widehat\beta_1 S_{xy},$$
say, where
$$SS_T = \sum_{i=1}^{n} y_i^2 - n(\bar{y})^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2.$$
We study the statistical properties of the random variable
$$\sum_{i=1}^{n} (Y_i - \widehat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - x_i\widehat\beta)^2 = (Y - X\widehat\beta)^\top (Y - X\widehat\beta),$$
where $\widehat\beta = (X^\top X)^{-1} X^\top Y$ is the vector of least squares estimators.
Estimating $\sigma^2$ (cont.)

But
$$\widehat{Y} = X\widehat\beta = X(X^\top X)^{-1} X^\top Y = HY,$$
say, where $H = X(X^\top X)^{-1} X^\top$ is the hat matrix. We can show that $H$ is symmetric, and that $HH = H$, so $H$ is idempotent.
Estimating $\sigma^2$ (cont.)

Now consider the simpler model where dependence on $x_{i1}$ is omitted, and we merely have an intercept term. Predictions in this model use the $(n \times 1)$ matrix
$$X = (1, 1, \dots, 1)^\top = 1_n,$$
yielding the corresponding hat matrix
$$H_1 = 1_n (1_n^\top 1_n)^{-1} 1_n^\top = \frac{1}{n} 1_n 1_n^\top,$$
which is merely the $(n \times n)$ matrix with all elements equal to $1/n$.
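The claimed properties of $H$ and $H_1$ are easy to confirm numerically. A minimal sketch on a hypothetical design:

```python
import numpy as np

# Build the hat matrix H = X (X'X)^{-1} X' for a hypothetical simple linear
# regression design, plus the intercept-only hat matrix H_1, and verify
# symmetry and idempotency.

x1 = np.array([1.0, 2.0, 3.0, 4.0])
n = len(x1)
X = np.column_stack([np.ones(n), x1])   # design matrix with intercept

H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix
H1 = np.full((n, n), 1.0 / n)           # intercept-only hat matrix: all 1/n

print(np.allclose(H, H.T), np.allclose(H @ H, H))  # symmetric, idempotent
```

Since the intercept column $1_n$ lies in the column space of $X$, we also have $HH_1 = H_1H = H_1$, which is what makes $H - H_1$ idempotent as claimed on the next slide.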
Estimating $\sigma^2$ (cont.)

We have that
$$SS_{Res} = (Y - X\widehat\beta)^\top (Y - X\widehat\beta) = Y^\top (I_n - H) Y,$$
where the $(n \times n)$ matrix $(I_n - H)$ is symmetric and idempotent. Now, we have the sum of squares decomposition
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \widehat{y}_i)^2 + \sum_{i=1}^{n} (\widehat{y}_i - \bar{y})^2,$$
or
$$SS_T = SS_{Res} + SS_R.$$
Estimating $\sigma^2$ (cont.)

Similarly to the previous result, we have
$$SS_T = Y^\top (I_n - H_1) Y \qquad \text{and} \qquad SS_R = Y^\top (H - H_1) Y,$$
yielding the representation
$$Y^\top (I_n - H_1) Y = Y^\top (I_n - H) Y + Y^\top (H - H_1) Y,$$
where the $(n \times n)$ matrices $(I_n - H_1)$ and $(H - H_1)$ are also symmetric and idempotent.
Estimating $\sigma^2$ (cont.)

Using the result for the expectation of a quadratic form: if $V$ is a $k$-dimensional random vector with
$$E[V] = \mu, \qquad Var[V] = \Sigma,$$
then for a $k \times k$ matrix $A$ we have
$$E[V^\top A V] = \text{trace}(A\Sigma) + \mu^\top A \mu.$$
It follows that
$$E_{Y|X}[Y^\top (I_n - H) Y] = (n - p)\sigma^2.$$
Hence an unbiased estimator of $\sigma^2$ is
$$\widehat\sigma^2 = \frac{SS_{Res}}{n - p} = MS_{Res},$$
with $p = 2$.
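The $(n - p)\sigma^2$ result follows because $(I_n - H)X\beta = 0$ kills the mean term and $\text{trace}(I_n - H) = n - p$. A sketch checking the trace identity on a hypothetical design with $p = 2$ columns:

```python
import numpy as np

# Why SS_Res/(n - p) is unbiased: E[Y'(I-H)Y] = sigma^2 * trace(I - H),
# since (I - H)X = 0 removes the mean term, and trace(I - H) = n - p.
# The design below is hypothetical, with p = 2 columns (intercept, slope).

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
n, p = len(x1), 2
X = np.column_stack([np.ones(n), x1])
H = X @ np.linalg.inv(X.T @ X) @ X.T

M = np.eye(n) - H
print(round(float(np.trace(M)), 6))  # 4.0, i.e. n - p
```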
Estimating $\sigma^2$ (cont.)

Using similar methods,
$$E_{Y|X}[SS_R \mid X] = E_{Y|X}[Y^\top (H - H_1) Y \mid X] = (p - 1)\sigma^2 + \beta_1^2 S_{xx}$$
and
$$E_{Y|X}[SS_T \mid X] = E_{Y|X}[Y^\top (I_n - H_1) Y \mid X] = (n - 1)\sigma^2 + \beta_1^2 S_{xx}.$$
Standard errors for the estimators

We have that
$$Var_{Y|X}[\widehat\beta \mid X] = \sigma^2 (X^\top X)^{-1},$$
which is estimated by $\widehat\sigma^2 (X^\top X)^{-1}$. The standard errors of the estimators are estimated by the square roots of the diagonal elements of this matrix; denote them by $\text{e.s.e}(\widehat\beta_j)$, $j = 0, 1$.
Hypothesis Testing

We can formulate hypothesis tests for the parameters provided we make the normality assumption
$$\epsilon \mid X \sim \text{Normal}(0_n, \sigma^2 I_n).$$
For $j = 0, 1$, to test
$$H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0,$$
we use the test statistic
$$t_j = \frac{\widehat\beta_j}{\text{e.s.e}(\widehat\beta_j)}.$$
Hypothesis Testing (cont.)

If $H_0$ is true, we have by standard distributional results that the corresponding statistic
$$T_j \sim \text{Student}(n - p),$$
with $p = 2$. We reject $H_0$ at significance level $\alpha$ if
$$|t_j| > t_{\alpha/2, n-p},$$
where $t_{\alpha, \nu}$ is the $1 - \alpha$ quantile of the Student-t distribution with $\nu$ degrees of freedom.
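A worked example of the $t$ statistic for the slope, on a small hypothetical dataset, using the closed-form quantities derived above:

```python
# t statistic for H0: beta_1 = 0 in simple linear regression,
# computed in pure Python on hypothetical data.

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 7.0]
n, p = len(x), 2
xbar, ybar = sum(x) / n, sum(y) / n
S_xx = sum((xi - xbar) ** 2 for xi in x)
S_xy = sum(yi * (xi - xbar) for xi, yi in zip(x, y))
b1 = S_xy / S_xx
b0 = ybar - b1 * xbar

SS_Res = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
sigma2_hat = SS_Res / (n - p)            # MS_Res, unbiased for sigma^2
ese_b1 = (sigma2_hat / S_xx) ** 0.5      # estimated standard error of b1

t1 = b1 / ese_b1
print(round(t1, 4))  # 11.3137; compare with t_{alpha/2, n-p} quantile
```

Here $\widehat\sigma^2 = 0.2/2 = 0.1$, so $t_1 = 1.6/\sqrt{0.1/5} = 8\sqrt{2} \approx 11.31$, far beyond any conventional critical value with $n - p = 2$ degrees of freedom.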
Confidence Intervals

A $(1 - \alpha) \times 100\%$ confidence interval for $\beta_j$ is
$$\widehat\beta_j \pm t_{\alpha/2, n-p} \, \text{e.s.e}(\widehat\beta_j), \quad j = 0, 1.$$
Global Model Adequacy

The $R^2$ statistic is defined by
$$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_{Res}}{SS_T}$$
and is a measure of the global adequacy of $x$ as a predictor of $y$. The adjusted $R^2$ statistic is defined by
$$R^2_{Adj} = 1 - \frac{SS_{Res}/(n - p)}{SS_T/(n - 1)}$$
and is a measure that acknowledges that $SS_{Res}$ decreases in expectation as $p$ increases.
Local Model Adequacy

Residual plots are used to assess local model adequacy. If the model assumptions are correct, then the residual plots

- $e_i$ vs $i$
- $e_i$ vs $x_{i1}$
- $e_i$ vs $\widehat{y}_i$

should be patternless; that is, they should not exhibit systematic patterns in either mean-level or variability. The residuals should form a horizontal band around zero, with equal variability everywhere.
Prediction

Predictions from the model at a value of $x$ are formed by using the estimated regression coefficients; at $x = x_{i1}$ observed in the sample, the prediction equals the fitted value
$$\widehat{y}_i = \widehat\beta_0 + \widehat\beta_1 x_{i1}.$$
In vector form, we have $\widehat{y} = X\widehat\beta$. At $x = x_1^{new}$, we have the prediction
$$\widehat{y}^{new} = \widehat\beta_0 + \widehat\beta_1 x_1^{new}.$$
Confidence and Prediction Intervals

In the random variable form we have predictions
$$\widehat{Y} = X\widehat\beta = HY,$$
so that
$$E_{Y|X}[\widehat{Y} \mid X] = X\beta \qquad \text{and} \qquad Var_{Y|X}[\widehat{Y} \mid X] = \sigma^2 X(X^\top X)^{-1} X^\top = \sigma^2 H.$$
Therefore a $(1 - \alpha) \times 100\%$ confidence interval for the prediction at $x = x_{i1}$ is
$$\widehat{y}_i \pm t_{\alpha/2, n-p} \sqrt{\widehat\sigma^2 h_{ii}},$$
where $h_{ii}$ is the $(i, i)$th diagonal element of $H$.
Confidence and Prediction Intervals (cont.)

For a prediction at $x = x_1^{new}$, we have that
$$Var_{Y|X}[\widehat{Y}^{new} \mid X] = \sigma^2 x^{new} (X^\top X)^{-1} (x^{new})^\top = \sigma^2 h^{new},$$
and a $(1 - \alpha) \times 100\%$ confidence interval for the prediction at $x = x_1^{new}$ is
$$\widehat{y}^{new} \pm t_{\alpha/2, n-p} \sqrt{\widehat\sigma^2 h^{new}}.$$
Confidence and Prediction Intervals (cont.)

A prediction interval at $x = x_1^{new}$ incorporates the random variation that is present in the observations. Let
$$\widehat{Y}^{new}_O = \widehat{Y}^{new} + \epsilon^{new},$$
where $\epsilon^{new}$ is a zero mean, variance $\sigma^2$ random residual error, independent of all other random quantities. Then
$$Var_{Y|X}[\widehat{Y}^{new}_O \mid X] = Var_{Y|X}[\widehat{Y}^{new} \mid X] + Var_{Y|X}[\epsilon^{new} \mid X] = \sigma^2 h^{new} + \sigma^2 = \sigma^2 (1 + h^{new}).$$
Thus a $(1 - \alpha) \times 100\%$ prediction interval at $x = x_1^{new}$ is
$$\widehat{y}^{new} \pm t_{\alpha/2, n-p} \sqrt{\widehat\sigma^2 (1 + h^{new})}.$$
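For simple linear regression the leverage at a new point has the closed form $h^{new} = 1/n + (x^{new} - \bar{x}_1)^2 / S_{xx}$, which makes the width comparison concrete. A sketch with a hypothetical design and new point:

```python
# Half-width factors of confidence vs prediction intervals at a new point,
# using h_new = 1/n + (x_new - xbar)^2 / S_xx (simple linear regression).
# The data and x_new are hypothetical.

x = [1.0, 2.0, 3.0, 4.0]
n = len(x)
xbar = sum(x) / n
S_xx = sum((xi - xbar) ** 2 for xi in x)

x_new = 5.0
h_new = 1.0 / n + (x_new - xbar) ** 2 / S_xx

# The CI half-width is proportional to sqrt(h_new); the PI half-width to
# sqrt(1 + h_new), so the prediction interval is always the wider one.
print(h_new, 1 + h_new)  # 1.5 2.5
```

Note how $h^{new}$ grows quadratically as $x^{new}$ moves away from $\bar{x}_1$: extrapolated predictions carry wider intervals.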
The Analysis of Variance

The sums-of-squares decomposition
$$SS_T = SS_{Res} + SS_R$$
forms the basic component of the Analysis of Variance (ANOVA), as it describes how the overall observed variability in the response $y$ ($SS_T$) is decomposed into a component corresponding to the residual errors ($SS_{Res}$) and a component corresponding to the regression ($SS_R$).
The Analysis of Variance (cont.)

Under the assumption of Normality of the residual errors, $\epsilon \mid X \sim \text{Normal}(0_n, \sigma^2 I_n)$, and the hypothesis that $\beta_1 = 0$, we have the following results for the sums-of-squares random variables:
$$\frac{SS_T}{\sigma^2} = \frac{Y^\top (I_n - H_1) Y}{\sigma^2} \sim \chi^2_{n-1}, \qquad \frac{SS_{Res}}{\sigma^2} = \frac{Y^\top (I_n - H) Y}{\sigma^2} \sim \chi^2_{n-p}, \qquad \frac{SS_R}{\sigma^2} = \frac{Y^\top (H - H_1) Y}{\sigma^2} \sim \chi^2_{p-1},$$
with $SS_{Res}$ and $SS_R$ independent.
The Analysis of Variance (cont.)

Consequently, we can show that under the hypothesis, the random variable
$$F = \frac{SS_R/(p - 1)}{SS_{Res}/(n - p)}$$
has a Fisher-F distribution with $p - 1$ and $n - p$ degrees of freedom,
$$F \sim \text{Fisher}(p - 1, n - p).$$
We can construct a test of $H_0: \beta_1 = 0$ based on this result: we reject $H_0$ at significance level $\alpha$ if
$$F > F_{\alpha, p-1, n-p},$$
where $F_{\alpha, \nu_1, \nu_2}$ is the $(1 - \alpha)$ quantile of the Fisher-F distribution with $\nu_1$ and $\nu_2$ degrees of freedom.
The Analysis of Variance (cont.)

This test is equivalent to the test of $H_0: \beta_1 = 0$ based on the t-statistic; we have that
$$t_1^2 = \left\{ \frac{\widehat\beta_1}{\text{e.s.e}(\widehat\beta_1)} \right\}^2 = F,$$
and the two-tailed test based on $t_1$ is equivalent to the one-tailed test based on $F$.
The Analysis of Variance (cont.)

The ANOVA table arranges the required information in tabular form:

Source       SS        df       MS                F
Regression   SS_R      p - 1    SS_R/(p - 1)      F
Residual     SS_Res    n - p    SS_Res/(n - p)
Total        SS_T      n - 1
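The ANOVA quantities above can be computed by hand on a hypothetical dataset, checking both the decomposition $SS_T = SS_{Res} + SS_R$ and the identity $F = t_1^2$:

```python
# Worked ANOVA decomposition for simple linear regression on hypothetical
# data, verifying SS_T = SS_Res + SS_R and F = t_1^2.

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 7.0]
n, p = len(x), 2
xbar, ybar = sum(x) / n, sum(y) / n
S_xx = sum((xi - xbar) ** 2 for xi in x)
S_xy = sum(yi * (xi - xbar) for xi, yi in zip(x, y))
b1 = S_xy / S_xx
b0 = ybar - b1 * xbar

SS_T = sum((yi - ybar) ** 2 for yi in y)
SS_Res = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
SS_R = sum((b0 + b1 * xi - ybar) ** 2 for xi in x)  # computed independently

F = (SS_R / (p - 1)) / (SS_Res / (n - p))
t1 = b1 / ((SS_Res / (n - p)) / S_xx) ** 0.5
print(round(F, 6), round(t1 ** 2, 6))  # 128.0 128.0
```

Here $SS_T = 13 = 0.2 + 12.8 = SS_{Res} + SS_R$, and the F statistic $128 = (8\sqrt{2})^2$ equals the squared slope $t$ statistic, as claimed.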
Maximum Likelihood Estimation

Under the assumption of Normality of the residual errors, $\epsilon \mid X \sim \text{Normal}(0_n, \sigma^2 I_n)$, so that
$$Y \mid X \sim \text{Normal}(X\beta, \sigma^2 I_n),$$
we may consider using a maximum likelihood (ML) procedure to estimate the model parameters $(\beta, \sigma^2)$. The likelihood function for data $y$ is
$$L(\beta, \sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left\{ -\frac{1}{2\sigma^2} (y - X\beta)^\top (y - X\beta) \right\}.$$
Maximum Likelihood Estimation (cont.)

The log-likelihood is
$$\ell(\beta, \sigma^2) = -\frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} (y - X\beta)^\top (y - X\beta) + \text{constant} = -\frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} S(\beta) + \text{constant}.$$
We seek to maximize this function with respect to $\beta$ and $\sigma^2$. It is evident that, in terms of $\beta$, the maximum value of $\ell(\beta, \sigma^2)$ is attained (for any $\sigma^2$) when $S(\beta)$ is minimized. Thus the maximum likelihood estimate of $\beta$ is identical to the least squares estimate.
Maximum Likelihood Estimation (cont.)

To maximize over $\sigma^2$, note that
$$\frac{\partial \ell(\beta, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} S(\beta).$$
Equating to zero, and setting $\beta = \widehat\beta$, we have the solution
$$\widehat\sigma^2_{ML} = \frac{1}{n} S(\widehat\beta) = \frac{1}{n} (y - X\widehat\beta)^\top (y - X\widehat\beta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \widehat{y}_i)^2 = \frac{SS_{Res}}{n}.$$
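The ML variance estimate divides by $n$ rather than $n - p$, so it is biased downward by the factor $(n - p)/n$. A sketch contrasting the two on hypothetical data:

```python
# ML estimate of sigma^2 (divide by n) vs the unbiased estimate
# (divide by n - p), on hypothetical data.

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 7.0]
n, p = len(x), 2
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum(yi * (xi - xbar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
SS_Res = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

sigma2_ml = SS_Res / n           # maximum likelihood estimate
sigma2_unb = SS_Res / (n - p)    # unbiased estimate, MS_Res

# The ML estimate is smaller by exactly the factor (n - p)/n.
print(sigma2_ml, sigma2_unb)
```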
Multiple Linear Regression

The simple linear regression model extends to include multiple predictors,
$$E_{Y|X}[Y \mid X] = X\beta,$$
in a straightforward way. Least squares estimation, inference, testing, prediction, etc. proceed in precisely the same way. Model checking using residuals and $R^2$ also proceeds as for simple linear regression. F-testing procedures for assessing the utility of including all the predictors proceed as before, and there is a natural extension to the use of ANOVA tables in regression.
Hypothesis Testing

However, now we may also consider individual t-tests for the coefficients, from a single fit. Furthermore, we may consider simultaneous tests for collections of parameters using the F-test. We split
$$X = [X^{(1)} \; X^{(2)}] \qquad \text{and} \qquad \beta = [\beta^{(1)} \; \beta^{(2)}]$$
and consider the model
$$Y = X\beta + \epsilon = X^{(1)}\beta^{(1)} + X^{(2)}\beta^{(2)} + \epsilon.$$
We compare a simpler model with a more complex model, where the simpler model is obtained by hypothesizing that some $\beta$s are zero, that is,
$$\text{Full Model}: \; Y = X\beta + \epsilon \qquad \text{Reduced Model}: \; Y = X^{(1)}\beta^{(1)} + \epsilon;$$
that is, under the reduced model we assume $\beta^{(2)} = 0_r$.
Hypothesis Testing (cont.)

In general the parameter estimates arising from the two models will be different; that is, the $\beta^{(1)}$ estimates from the Full model will in general not be equal to the $\beta^{(1)}$ estimates from the Reduced model, as in the Full model
$$\widehat\beta = \begin{bmatrix} \widehat\beta^{(1)} \\ \widehat\beta^{(2)} \end{bmatrix} = \begin{bmatrix} \{X^{(1)}\}^\top X^{(1)} & \{X^{(1)}\}^\top X^{(2)} \\ \{X^{(2)}\}^\top X^{(1)} & \{X^{(2)}\}^\top X^{(2)} \end{bmatrix}^{-1} \begin{bmatrix} \{X^{(1)}\}^\top y \\ \{X^{(2)}\}^\top y \end{bmatrix},$$
and in the Reduced model
$$\widehat\beta^{(1)} = (\{X^{(1)}\}^\top X^{(1)})^{-1} \{X^{(1)}\}^\top y.$$
Hypothesis Testing (cont.)

We modify the earlier sums of squares decomposition
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \widehat{y}_i)^2 + \sum_{i=1}^{n} (\widehat{y}_i - \bar{y})^2, \qquad SS_T = SS_{Res} + SS_R,$$
to the uncorrected form
$$\sum_{i=1}^{n} y_i^2 = \sum_{i=1}^{n} (y_i - \widehat{y}_i)^2 + \sum_{i=1}^{n} \widehat{y}_i^2, \qquad SS_T^* = SS_{Res} + SS_R^*.$$
Note that the corrected and uncorrected versions are related by
$$SS_T = SS_T^* - n\{\bar{y}\}^2 \qquad \text{and} \qquad SS_R = SS_R^* - n\{\bar{y}\}^2.$$
Hypothesis Testing (cont.)

If $SS_R(\beta)$ and $SS_R(\beta^{(1)})$ denote the regression sums of squares from the Full and Reduced models respectively, the extra sum of squares due to $\beta^{(2)}$ in the presence of $\beta^{(1)}$ is
$$SS_R(\beta^{(2)} \mid \beta^{(1)}) = SS_R(\beta) - SS_R(\beta^{(1)}).$$
This facilitates the F-test of the null hypothesis
$$H_0: \beta^{(2)} = 0_r,$$
that is, that the Reduced model is an adequate simplification of the Full model.
Hypothesis Testing (cont.)

We perform a partial F-test using
$$F = \frac{SS_R(\beta^{(2)} \mid \beta^{(1)})/r}{MS_{Res}(\beta)} = \frac{(SS_{Res}(\beta^{(1)}) - SS_{Res}(\beta))/r}{SS_{Res}(\beta)/(n - p)},$$
which is distributed as $\text{Fisher}(r, n - p)$ if $H_0$ is true.
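The partial F-test is simple to carry out by fitting both models and comparing residual sums of squares. A sketch on wholly hypothetical data, comparing a Full model (intercept, $x_1$, $x_2$) to a Reduced model (intercept, $x_1$):

```python
import numpy as np

# Partial F-test comparing a Full model (intercept, x1, x2) against a
# Reduced model (intercept, x1). All data below are hypothetical.

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.0, 11.9, 15.1, 15.8])
n = len(y)

X_full = np.column_stack([np.ones(n), x1, x2])   # p = 3 columns
X_red  = np.column_stack([np.ones(n), x1])       # beta_(2) = 0 under H0
p, r = X_full.shape[1], X_full.shape[1] - X_red.shape[1]

def ss_res(X, y):
    """Residual sum of squares from the least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return float(e @ e)

ss_full, ss_red = ss_res(X_full, y), ss_res(X_red, y)
F = ((ss_red - ss_full) / r) / (ss_full / (n - p))
print(F)  # compare with the Fisher(r, n - p) upper-alpha quantile
```

Dropping a predictor can never decrease the residual sum of squares, so the numerator is always non-negative.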
Hypothesis Testing (cont.)

The test may be used sequentially: for example, if the parameter is $\beta = (\beta_0, \beta_1, \beta_2, \beta_3)$, then we have
$$SS_R(\beta_0, \beta_1, \beta_2, \beta_3) = SS_R(\beta_0) + SS_R(\beta_1 \mid \beta_0) + SS_R(\beta_2 \mid \beta_0, \beta_1) + SS_R(\beta_3 \mid \beta_0, \beta_1, \beta_2)$$
and also
$$SS_R(\beta_1, \beta_2, \beta_3 \mid \beta_0) = SS_R(\beta_1 \mid \beta_0) + SS_R(\beta_2 \mid \beta_0, \beta_1) + SS_R(\beta_3 \mid \beta_0, \beta_1, \beta_2).$$
We also have that, in general,
$$SS_R(\beta_1, \beta_2, \beta_3 \mid \beta_0) \neq SS_R(\beta_1, \beta_2, \beta_3).$$
Hypothesis Testing (cont.)

This latter decomposition allows us to test

- whether $X_1$ is worth including in the model;
- then whether $X_2$ is worth including, when $X_1$ is already included;
- then whether $X_3$ is worth including, when $X_1$ and $X_2$ are already included.
Hypothesis Testing (cont.)

Note: if the $X^{(1)}$ and $X^{(2)}$ blocks are orthogonal, that is,
$$\{X^{(1)}\}^\top X^{(2)} = 0_{(p-r) \times r},$$
then the $\beta$ estimates from the Full and Reduced models are equal, and we have the identity
$$SS_R(\beta^{(1)} \mid \beta^{(2)}) = SS_R(\beta^{(1)}).$$
Multicollinearity

Multicollinearity corresponds to dependence/correlation amongst the predictors. It

- can lead to inflation of the variance of estimators if present;
- can be measured by variance inflation factors, which are equivalent to $R^2$ measures constructed for the predictors;
- can be computed from the correlation matrix of the predictors.
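The variance inflation factor for predictor $j$ is $\text{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the remaining predictors. A sketch on hypothetical data in which two predictors are deliberately near-collinear:

```python
import numpy as np

# Variance inflation factors via the regress-each-predictor-on-the-others
# definition: VIF_j = 1 / (1 - R_j^2). All data below are hypothetical.

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([1.1, 2.1, 2.9, 4.2, 4.8, 6.1])  # nearly collinear with x1
x3 = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])  # unrelated pattern
Z = np.column_stack([x1, x2, x3])
n, k = Z.shape

def vif(j):
    """VIF of predictor j from its regression on the other predictors."""
    target = Z[:, j]
    others = np.column_stack([np.ones(n), np.delete(Z, j, axis=1)])
    beta, *_ = np.linalg.lstsq(others, target, rcond=None)
    e = target - others @ beta
    ss_res = float(e @ e)
    ss_tot = float(((target - target.mean()) ** 2).sum())
    return ss_tot / ss_res   # equals 1 / (1 - R_j^2)

vifs = [vif(j) for j in range(k)]
print(vifs)  # x1 and x2 show large VIFs; x3 stays near 1
```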
Special Types of Predictors

- polynomial terms;
- factor predictors: discrete predictors measured on a nominal (non-numerical) scale;
- interactions: an interaction is a modification of the effect of one predictor in the presence of another;
- higher-order interactions.
Special Types of Predictors (cont.)

Important issues include

- how to represent factor predictors using indicator functions;
- how to count parameters;
- the interpretation of interactions;
- removing the intercept.
Model Selection Strategies

Our goal is to find the simplest possible model that adequately explains the observed response data.

- Forward selection;
- Backward elimination;
- Stepwise selection.
Model Selection Criteria

- $R^2$-based methods: largest $R^2_{Adj}$, equivalent to minimum $MS_{Res}$;
- Mallows's $C_p$;
- AIC;
- BIC.
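AIC and BIC trade fit against complexity. One common convention for the Gaussian linear model (an assumption on my part; software packages differ by additive constants) is $\text{AIC} = n\log(SS_{Res}/n) + 2k$ and $\text{BIC} = n\log(SS_{Res}/n) + k\log n$, where $k$ counts the estimated coefficients. A sketch with hypothetical fit results:

```python
import math

# AIC/BIC under one common convention for Gaussian linear models
# (an assumed form; implementations differ by additive constants):
#   AIC = n*log(SS_Res/n) + 2k,  BIC = n*log(SS_Res/n) + k*log(n).

def aic(ss_res, n, k):
    return n * math.log(ss_res / n) + 2 * k

def bic(ss_res, n, k):
    return n * math.log(ss_res / n) + k * math.log(n)

# Hypothetical fits: the bigger model reduces SS_Res only slightly.
n = 20
aic_small, aic_big = aic(10.0, n, 2), aic(9.8, n, 3)
bic_small, bic_big = bic(10.0, n, 2), bic(9.8, n, 3)
print(aic_small < aic_big, bic_small < bic_big)  # True True
```

Both criteria prefer the smaller model here; since $\log n > 2$ for $n \geq 8$, BIC penalizes the extra parameter more heavily than AIC.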
Why the best model matters

Over-simplified model: if the chosen model omits important predictors, then

- the resulting estimates will be biased, have lower variance, and can have higher mean-squared error, depending on the magnitude of the effect of the omitted predictors;
- the estimate of $\sigma^2$ will be too high;
- predictions will have higher mean squared error.

If the omitted predictors are orthogonal to the included ones, then the bias is removed.
Why the best model matters (cont.)

Over-complex model: if the chosen model includes unnecessary predictors, then

- the resulting estimates will be unbiased, but have higher variance and higher mean-squared error;
- the estimate of $\sigma^2$ will be unbiased;
- predictions will have no bias, but higher variance and higher mean squared error.
Assessing the model fit

- PRESS residuals/statistic;
- leave-one-out/deletion residuals;
- leverage;
- influence in estimation and prediction;
- outliers;
- influence diagnostics.
Generalizing Least Squares

We may relax the variance assumption $Var_{Y|X}[Y \mid X] = \sigma^2 I_n$ to
$$Var_{Y|X}[Y \mid X] = \sigma^2 V,$$
where $V$ is a square, symmetric, non-singular matrix. This allows for

- unequal variances in the conditional response distribution;
- correlation amongst the residual errors $\epsilon$.

This leads to the amended least squares objective function
$$(y - X\beta)^\top V^{-1} (y - X\beta).$$
Generalizing Least Squares (cont.)

This leads to the estimates
$$\widehat\beta = (X^\top V^{-1} X)^{-1} X^\top V^{-1} y,$$
and the properties of the estimators, predictions, etc. follow in a straightforward fashion. A special case is weighted least squares, where
$$V = \text{diag}(1/w_1, 1/w_2, \dots, 1/w_n),$$
which accommodates unequal variances.
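In the weighted case the GLS estimate can equivalently be obtained by rescaling each row by $\sqrt{w_i}$ and running ordinary least squares. A sketch with hypothetical data and weights:

```python
import numpy as np

# Weighted least squares: with V = diag(1/w_i), the GLS estimate
# (X' V^{-1} X)^{-1} X' V^{-1} y equals OLS applied to rows rescaled
# by sqrt(w_i). Data and weights below are hypothetical.

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y  = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
w  = np.array([1.0, 2.0, 1.0, 0.5, 1.0])   # precision weights
n = len(y)

X = np.column_stack([np.ones(n), x1])
Vinv = np.diag(w)                           # V^{-1} = diag(w_i)

# Direct GLS / WLS formula.
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

# Equivalent route: rescale rows by sqrt(w_i), then ordinary least squares.
sw = np.sqrt(w)
beta_ols, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

print(beta_gls, beta_ols)  # the two estimates agree
```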
Transforming the response

We may transform the response $y_i$ to a different form to make the linear regression model assumptions more appropriate:

- log;
- square root;
- Box-Cox transform.
More informationLecture 15 Multiple regression I Chapter 6 Set 2 Least Square Estimation The quadratic form to be minimized is
Lecture 15 Multiple regression I Chapter 6 Set 2 Least Square Estimation The quadratic form to be minimized is Q = (Y i β 0 β 1 X i1 β 2 X i2 β p 1 X i.p 1 ) 2, which in matrix notation is Q = (Y Xβ) (Y
More informationLecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012
Lecture 3: Linear Models Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector of observed
More informationSimple Linear Regression
Simple Linear Regression ST 370 Regression models are used to study the relationship of a response variable and one or more predictors. The response is also called the dependent variable, and the predictors
More informationDirection: This test is worth 250 points and each problem worth points. DO ANY SIX
Term Test 3 December 5, 2003 Name Math 52 Student Number Direction: This test is worth 250 points and each problem worth 4 points DO ANY SIX PROBLEMS You are required to complete this test within 50 minutes
More informationContents. 1 Review of Residuals. 2 Detecting Outliers. 3 Influential Observations. 4 Multicollinearity and its Effects
Contents 1 Review of Residuals 2 Detecting Outliers 3 Influential Observations 4 Multicollinearity and its Effects W. Zhou (Colorado State University) STAT 540 July 6th, 2015 1 / 32 Model Diagnostics:
More informationChapter 12: Multiple Linear Regression
Chapter 12: Multiple Linear Regression Seungchul Baek Department of Statistics, University of South Carolina STAT 509: Statistics for Engineers 1 / 55 Introduction A regression model can be expressed as
More informationRegression, Ridge Regression, Lasso
Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.
More informationMultiple Linear Regression
Andrew Lonardelli December 20, 2013 Multiple Linear Regression 1 Table Of Contents Introduction: p.3 Multiple Linear Regression Model: p.3 Least Squares Estimation of the Parameters: p.4-5 The matrix approach
More informationLinear Regression In God we trust, all others bring data. William Edwards Deming
Linear Regression ddebarr@uw.edu 2017-01-19 In God we trust, all others bring data. William Edwards Deming Course Outline 1. Introduction to Statistical Learning 2. Linear Regression 3. Classification
More informationSTATISTICS 479 Exam II (100 points)
Name STATISTICS 79 Exam II (1 points) 1. A SAS data set was created using the following input statement: Answer parts(a) to (e) below. input State $ City $ Pop199 Income Housing Electric; (a) () Give the
More information6. Multiple Linear Regression
6. Multiple Linear Regression SLR: 1 predictor X, MLR: more than 1 predictor Example data set: Y i = #points scored by UF football team in game i X i1 = #games won by opponent in their last 10 games X
More informationSMA 6304 / MIT / MIT Manufacturing Systems. Lecture 10: Data and Regression Analysis. Lecturer: Prof. Duane S. Boning
SMA 6304 / MIT 2.853 / MIT 2.854 Manufacturing Systems Lecture 10: Data and Regression Analysis Lecturer: Prof. Duane S. Boning 1 Agenda 1. Comparison of Treatments (One Variable) Analysis of Variance
More information401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.
401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis
More informationRegression Model Specification in R/Splus and Model Diagnostics. Daniel B. Carr
Regression Model Specification in R/Splus and Model Diagnostics By Daniel B. Carr Note 1: See 10 for a summary of diagnostics 2: Books have been written on model diagnostics. These discuss diagnostics
More informationRestricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model
Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Xiuming Zhang zhangxiuming@u.nus.edu A*STAR-NUS Clinical Imaging Research Center October, 015 Summary This report derives
More informationLecture 10 Multiple Linear Regression
Lecture 10 Multiple Linear Regression STAT 512 Spring 2011 Background Reading KNNL: 6.1-6.5 10-1 Topic Overview Multiple Linear Regression Model 10-2 Data for Multiple Regression Y i is the response variable
More informationApplied linear statistical models: An overview
Applied linear statistical models: An overview Gunnar Stefansson 1 Dept. of Mathematics Univ. Iceland August 27, 2010 Outline Some basics Course: Applied linear statistical models This lecture: A description
More informationSTATISTICS 174: APPLIED STATISTICS TAKE-HOME FINAL EXAM POSTED ON WEBPAGE: 6:00 pm, DECEMBER 6, 2004 HAND IN BY: 6:00 pm, DECEMBER 7, 2004 This is a
STATISTICS 174: APPLIED STATISTICS TAKE-HOME FINAL EXAM POSTED ON WEBPAGE: 6:00 pm, DECEMBER 6, 2004 HAND IN BY: 6:00 pm, DECEMBER 7, 2004 This is a take-home exam. You are expected to work on it by yourself
More informationVariable Selection and Model Building
LINEAR REGRESSION ANALYSIS MODULE XIII Lecture - 39 Variable Selection and Model Building Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur 5. Akaike s information
More informationy ˆ i = ˆ " T u i ( i th fitted value or i th fit)
1 2 INFERENCE FOR MULTIPLE LINEAR REGRESSION Recall Terminology: p predictors x 1, x 2,, x p Some might be indicator variables for categorical variables) k-1 non-constant terms u 1, u 2,, u k-1 Each u
More informationProbability and Statistics Notes
Probability and Statistics Notes Chapter Seven Jesse Crawford Department of Mathematics Tarleton State University Spring 2011 (Tarleton State University) Chapter Seven Notes Spring 2011 1 / 42 Outline
More information13 Simple Linear Regression
B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 3 Simple Linear Regression 3. An industrial example A study was undertaken to determine the effect of stirring rate on the amount of impurity
More informationDiagnostics and Remedial Measures
Diagnostics and Remedial Measures Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Diagnostics and Remedial Measures 1 / 72 Remedial Measures How do we know that the regression
More informationLinear Regression Model. Badr Missaoui
Linear Regression Model Badr Missaoui Introduction What is this course about? It is a course on applied statistics. It comprises 2 hours lectures each week and 1 hour lab sessions/tutorials. We will focus
More informationChapter 1: Linear Regression with One Predictor Variable also known as: Simple Linear Regression Bivariate Linear Regression
BSTT523: Kutner et al., Chapter 1 1 Chapter 1: Linear Regression with One Predictor Variable also known as: Simple Linear Regression Bivariate Linear Regression Introduction: Functional relation between
More informationCentral Limit Theorem ( 5.3)
Central Limit Theorem ( 5.3) Let X 1, X 2,... be a sequence of independent random variables, each having n mean µ and variance σ 2. Then the distribution of the partial sum S n = X i i=1 becomes approximately
More informationMatrix Approach to Simple Linear Regression: An Overview
Matrix Approach to Simple Linear Regression: An Overview Aspects of matrices that you should know: Definition of a matrix Addition/subtraction/multiplication of matrices Symmetric/diagonal/identity matrix
More informationLecture 6 Multiple Linear Regression, cont.
Lecture 6 Multiple Linear Regression, cont. BIOST 515 January 22, 2004 BIOST 515, Lecture 6 Testing general linear hypotheses Suppose we are interested in testing linear combinations of the regression
More informationInferences for Regression
Inferences for Regression An Example: Body Fat and Waist Size Looking at the relationship between % body fat and waist size (in inches). Here is a scatterplot of our data set: Remembering Regression In
More informationIntroduction to Statistical Data Analysis Lecture 8: Correlation and Simple Regression
Introduction to Statistical Data Analysis Lecture 8: and James V. Lambers Department of Mathematics The University of Southern Mississippi James V. Lambers Statistical Data Analysis 1 / 40 Introduction
More informationTopic 22 Analysis of Variance
Topic 22 Analysis of Variance Comparing Multiple Populations 1 / 14 Outline Overview One Way Analysis of Variance Sample Means Sums of Squares The F Statistic Confidence Intervals 2 / 14 Overview Two-sample
More informationSTATISTICS 110/201 PRACTICE FINAL EXAM
STATISTICS 110/201 PRACTICE FINAL EXAM Questions 1 to 5: There is a downloadable Stata package that produces sequential sums of squares for regression. In other words, the SS is built up as each variable
More informationMath 533 Extra Hour Material
Math 533 Extra Hour Material A Justification for Regression The Justification for Regression It is well-known that if we want to predict a random quantity Y using some quantity m according to a mean-squared
More informationStandard Errors & Confidence Intervals. N(0, I( β) 1 ), I( β) = [ 2 l(β, φ; y) β i β β= β j
Standard Errors & Confidence Intervals β β asy N(0, I( β) 1 ), where I( β) = [ 2 l(β, φ; y) ] β i β β= β j We can obtain asymptotic 100(1 α)% confidence intervals for β j using: β j ± Z 1 α/2 se( β j )
More informationLinear model selection and regularization
Linear model selection and regularization Problems with linear regression with least square 1. Prediction Accuracy: linear regression has low bias but suffer from high variance, especially when n p. It
More informationTentative solutions TMA4255 Applied Statistics 16 May, 2015
Norwegian University of Science and Technology Department of Mathematical Sciences Page of 9 Tentative solutions TMA455 Applied Statistics 6 May, 05 Problem Manufacturer of fertilizers a) Are these independent
More informationCorrelation and Regression
Correlation and Regression October 25, 2017 STAT 151 Class 9 Slide 1 Outline of Topics 1 Associations 2 Scatter plot 3 Correlation 4 Regression 5 Testing and estimation 6 Goodness-of-fit STAT 151 Class
More informationCorrelation and regression
1 Correlation and regression Yongjua Laosiritaworn Introductory on Field Epidemiology 6 July 2015, Thailand Data 2 Illustrative data (Doll, 1955) 3 Scatter plot 4 Doll, 1955 5 6 Correlation coefficient,
More informationMATH Notebook 4 Spring 2018
MATH448001 Notebook 4 Spring 2018 prepared by Professor Jenny Baglivo c Copyright 2010 2018 by Jenny A. Baglivo. All Rights Reserved. 4 MATH448001 Notebook 4 3 4.1 Simple Linear Model.................................
More informationy response variable x 1, x 2,, x k -- a set of explanatory variables
11. Multiple Regression and Correlation y response variable x 1, x 2,, x k -- a set of explanatory variables In this chapter, all variables are assumed to be quantitative. Chapters 12-14 show how to incorporate
More informationMultivariate Linear Regression Models
Multivariate Linear Regression Models Regression analysis is used to predict the value of one or more responses from a set of predictors. It can also be used to estimate the linear association between
More informationREGRESSION ANALYSIS AND INDICATOR VARIABLES
REGRESSION ANALYSIS AND INDICATOR VARIABLES Thesis Submitted in partial fulfillment of the requirements for the award of degree of Masters of Science in Mathematics and Computing Submitted by Sweety Arora
More information