Multiple Linear Regression

University of California, San Diego
Instructor: Ery Arias-Castro

Passenger car mileage

Consider the carmpg dataset. Variables:

  make.model  make and model
  vol         cubic feet of cab space
  hp          engine horsepower
  mpg         average miles per gallon
  sp          top speed (mph)
  wt          vehicle weight (100 lb)

Goal: predict a car's gas consumption based on its characteristics.
Graphics: pairwise scatterplots and possibly individual boxplots. To [R].

Scatterplot highlights

- hp, sp and wt seem correlated.
- vol and wt seem correlated. Highly correlated predictors can lead to a misinterpretation of the fit.
- mpg seems polynomial in hp and sp.
- mpg versus wt shows a fan shape, indicating unequal variances.

Least squares regression

We fit a linear model:

  mpg = β₀ + β₁ vol + β₂ hp + β₃ sp + β₄ wt

With data {(x_i, y_i) : i = 1, …, n}, where x_i = (x_{i,1}, …, x_{i,p}), we fit

  y_i = β₀ + β₁ x_{i,1} + ⋯ + β_p x_{i,p} + ε_i

Standard assumptions: the measurement errors are i.i.d. normal with mean zero, i.e.

  ε₁, …, ε_n ~ iid N(0, σ²)

The least squares fit minimizes the error sum of squares

  SSE(β₀, β₁, …, β_p) = Σ_{i=1}^n (y_i − β₀ − β₁ x_{i,1} − ⋯ − β_p x_{i,p})²

Under the standard assumptions, this fit corresponds to the MLE.

Fitted values, residuals and standard error

The fitted (predicted) values are defined as

  ŷ_i = β̂₀ + β̂₁ x_{i,1} + ⋯ + β̂_p x_{i,p}

The residuals are defined as

  e_i = y_i − ŷ_i

Our estimate of σ² is the mean squared error of the fit:

  σ̂² = (1/(n − p − 1)) Σ_{i=1}^n (y_i − ŷ_i)² = (1/(n − p − 1)) Σ_{i=1}^n e_i²

Matrix interpretation

Let X be the n × (p + 1) matrix with row vectors {(1, x_i) : i = 1, …, n}. Define y = (y₁, …, y_n), ε = (ε₁, …, ε_n) and β = (β₀, β₁, …, β_p). The model is

  y = Xβ + ε

The least squares estimate β̂ minimizes ‖y − Xβ‖². If the columns of X are linearly independent, then

  β̂ = (XᵀX)⁻¹ Xᵀ y

Note that the estimate is linear in the response. Define the hat matrix

  H = X (XᵀX)⁻¹ Xᵀ

H is the orthogonal projection onto span(X). The fitted values may be expressed as

  ŷ = X (XᵀX)⁻¹ Xᵀ y = H y

and the residuals as

  e = y − ŷ = (I − H) y
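As a concrete illustration of these formulas (not part of the original slides), here is a minimal numpy sketch on made-up data; the variable names and toy numbers are my own. It computes β̂ = (XᵀX)⁻¹Xᵀy and the hat matrix H.

```python
import numpy as np

# Toy data (made up): n = 8 observations, p = 2 predictors plus intercept.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
X = np.column_stack([np.ones(8), x1, x2])   # n x (p+1) design matrix
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 7.9])

# Least squares estimate: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T, the orthogonal projection onto span(X)
H = X @ np.linalg.solve(X.T @ X, X.T)

y_hat = H @ y   # fitted values  y_hat = H y
e = y - y_hat   # residuals      e = (I - H) y
```

Since H is an orthogonal projection, H² = H = Hᵀ and the residuals are orthogonal to every column of X; those identities make convenient sanity checks for any hand-rolled implementation.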

Distributions

Assume that the ε_i's are i.i.d. normal with mean zero. Define M = (XᵀX)⁻¹ ∈ ℝ^{(p+1)×(p+1)}.

β̂ = (β̂₀, …, β̂_p) has the multivariate normal distribution with mean β (unbiased) and covariance matrix σ²M. That is, the β̂_j's are jointly normal with

  E(β̂_j) = β_j,   cov(β̂_j, β̂_k) = σ² M_{jk}

For σ̂², we have

  (n − p − 1) σ̂²/σ² ~ χ²_{n−p−1}

β̂ and σ̂² are independent.

t-tests

Consequently,

  (β̂_j − β_j) / √(σ̂² M_{jj}) ~ t_{n−p−1}

In general, if c = (c₀, c₁, …, c_p)ᵀ ∈ ℝ^{p+1}, then

  (cᵀβ̂ − cᵀβ) / ŝe(cᵀβ̂) ~ t_{n−p−1},   where ŝe(cᵀβ̂) = σ̂ √(cᵀMc)

In particular, the following is a level-(1 − α) confidence interval for β₀ + β₁x₁ + ⋯ + β_p x_p = βᵀx, where x = (1, x₁, …, x_p):

  β̂ᵀx ± t^{α/2}_{n−p−1} ŝe(β̂ᵀx)

To [R].
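To make the t-interval concrete, here is a small numpy sketch (toy data of my own, not from the slides). The quantile t^{0.025}_5 ≈ 2.571 is hard-coded from a t table, since this example has n − p − 1 = 5 degrees of freedom.

```python
import numpy as np

# Toy design (made up): intercept plus two predictors, n = 8.
X = np.column_stack([np.ones(8),
                     [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                     [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 7.9])
n, k = X.shape                  # k = p + 1

M = np.linalg.inv(X.T @ X)
beta_hat = M @ X.T @ y
e = y - X @ beta_hat
sigma2_hat = e @ e / (n - k)    # sigma^2 estimate, on n - p - 1 df

# Confidence interval for beta^T x at a point x = (1, x_1, ..., x_p)
x_new = np.array([1.0, 4.5, 4.5])
se = np.sqrt(sigma2_hat * x_new @ M @ x_new)   # se = sigma_hat * sqrt(x^T M x)
t_quant = 2.571                 # upper 0.025 quantile of t_5, from a t table
ci = (beta_hat @ x_new - t_quant * se, beta_hat @ x_new + t_quant * se)
```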

Confidence bands / regions

Using the fact that

  (β̂ − β)ᵀ XᵀX (β̂ − β) / ((p + 1) σ̂²) ~ F_{p+1,n−p−1}

the following defines a level-(1 − α) confidence region (an ellipsoid) for β:

  (β̂ − β)ᵀ XᵀX (β̂ − β) / ((p + 1) σ̂²) ≤ F^α_{p+1,n−p−1}

Using the same fact, Scheffé's S-method gives that

  cᵀβ̂ ± ((p + 1) F^α_{p+1,n−p−1})^{1/2} ŝe(cᵀβ̂)

is a level-(1 − α) confidence interval for cᵀβ simultaneously for all c ∈ ℝ^{p+1}. As a special case, we obtain the following confidence band for βᵀx:

  β̂ᵀx ± ((p + 1) F^α_{p+1,n−p−1})^{1/2} ŝe(β̂ᵀx)

Comparing models

Say we want to test

  H₀: β₁ = ⋯ = β_p = 0   versus   H₁: β_j ≠ 0 for some j = 1, …, p

This is the same as comparing

  H₀: y_i = β₀ + ε_i
  H₁: y_i = β₀ + β₁ x_{i,1} + ⋯ + β_p x_{i,p} + ε_i

The model under H₀ is a submodel of the model under H₁.

Analysis of variance

The residual sum of squares under H₀ is the total sum of squares

  SS_Y = Σ_{i=1}^n (y_i − ȳ)²

It has n − 1 degrees of freedom. The residual sum of squares under H₁ is

  SSE = Σ_{i=1}^n (y_i − ŷ_i)²

It has n − p − 1 degrees of freedom. The sum of squares due to regression is

  SSreg = SS_Y − SSE = Σ_{i=1}^n (ŷ_i − ȳ)²

It has p degrees of freedom. The ANOVA F-test rejects for large values of

  F = (SSreg/p) / (SSE/(n − p − 1))

Under H₀, F ~ F_{p,n−p−1}.

Coefficient of (multiple) determination

Defined as

  R² = 1 − SSE/SS_Y

Interpreted as the fraction of variation in y explained by x. R² always increases when variables are added. The adjusted R² incorporates the number of predictors:

  R²_a = 1 − (SSE/(n − p − 1)) / (SS_Y/(n − 1))
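Continuing the toy example (my own made-up numbers), this sketch computes the ANOVA decomposition and the F statistic for H₀: β₁ = ⋯ = β_p = 0; because the model contains an intercept, SS_Y = SSreg + SSE exactly.

```python
import numpy as np

X = np.column_stack([np.ones(8),
                     [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                     [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 7.9])
n, k = X.shape
p = k - 1

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

SS_Y = np.sum((y - y.mean()) ** 2)      # total sum of squares, n - 1 df
SSE = np.sum((y - y_hat) ** 2)          # residual sum of squares, n - p - 1 df
SSreg = SS_Y - SSE                      # regression sum of squares, p df

F = (SSreg / p) / (SSE / (n - p - 1))   # compare to F_{p, n-p-1}
R2 = 1.0 - SSE / SS_Y
R2_adj = 1.0 - (SSE / (n - p - 1)) / (SS_Y / (n - 1))
```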

Testing whether a subset of the variables is zero

Consider a subset of variables J ⊂ {1, …, p}. We want to test

  H₀: β_j = 0 for all j ∈ J   versus   H₁: β_j ≠ 0 for some j ∈ J

This is the same as comparing

  H₀: y_i = β₀ + Σ_{j∉J} β_j x_{i,j} + ε_i
  H₁: y_i = β₀ + Σ_{j=1}^p β_j x_{i,j} + ε_i

The model under H₀ is a submodel of the model under H₁.

Analysis of variance

Let SSE_J be the residual sum of squares for the model under H₀, and SSE the residual sum of squares for the model under H₁. The ANOVA F-test rejects for large values of

  F = ((SSE_J − SSE)/|J|) / (SSE/(n − p − 1))

Under the null, F ~ F_{|J|,n−p−1}. In particular, testing β_j = 0 versus β_j ≠ 0 using the F-test above is equivalent to using the t-test described earlier (F = t²).
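The equivalence between the partial F-test and the t-test can be checked numerically. In this sketch (toy data of my own) we drop the single predictor x₂, so J = {2}, and verify F = t².

```python
import numpy as np

X = np.column_stack([np.ones(8),
                     [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                     [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 7.9])
n, k = X.shape

def rss(Z, y):
    """Residual sum of squares of the least squares fit of y on Z."""
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.sum((y - Z @ b) ** 2)

SSE = rss(X, y)            # full model
SSE_J = rss(X[:, :2], y)   # submodel without the last predictor (|J| = 1)

F = ((SSE_J - SSE) / 1) / (SSE / (n - k))   # F_{1, n-p-1} under the null

# t statistic for the same coefficient in the full model
M = np.linalg.inv(X.T @ X)
beta_hat = M @ X.T @ y
sigma2_hat = SSE / (n - k)
t_stat = beta_hat[2] / np.sqrt(sigma2_hat * M[2, 2])
```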

Testing whether a given set of linear combinations is zero

Consider a full-rank q × (p + 1) matrix A. We want to test

  H₀: Aβ = 0   versus   H₁: Aβ ≠ 0

Assuming A = (A₁, A₂) with A₂ q × q invertible, we are comparing the two models

  H₀: y = (X₁ − X₂A₂⁻¹A₁) β₁ + ε
  H₁: y = Xβ + ε

The model under H₀ is a submodel of the model under H₁.

Analysis of variance

Let SSE_A be the residual sum of squares for the model under H₀, and SSE the residual sum of squares for the model under H₁. The ANOVA F-test rejects for large values of

  F = ((SSE_A − SSE)/rank(A)) / (SSE/(n − p − 1))

Under the null, F ~ F_{rank(A),n−p−1}.

Spreading

Spreading the variables often equalizes the leverage of the data points and improves the fit, both numerically (e.g. R²) and visually. A typical situation is an accumulation of data points near 0, in which case applying a logarithm or a square root helps spread the data more evenly. Example: mammals dataset.

Standardized variables

Standardizing the variables removes the units and makes comparing the magnitudes of the coefficients easier. One way to do so is to make all the variables mean 0 and unit length:

  y ← (y − ȳ)/√(SS_Y)   and   x_j ← (x_j − x̄_j)/√(SS_{X_j})

An intercept is no longer needed. Standardizing the variables changes the estimates β̂_j, but not σ̂², the multiple R² or the p-values of the tests we saw.

Checking assumptions

Can we rely on those p-values? In other words, are the assumptions correct?

1. Mean zero, i.e. model accuracy
2. Equal variances, i.e. homoscedasticity
3. Independence
4. Normally distributed

Residuals

We look at the residuals as substitutes for the errors. Remember that e = (e₁, …, e_n), with

  e = y − ŷ = y − Xβ̂ = (I − H) y

Under the standard assumptions, e is multivariate normal with mean zero and covariance σ²(I − H). In particular,

  cov(e_i, e_j) = −σ² h_{ij} for i ≠ j,   var(e_i) = σ²(1 − h_{ii})

Standardized residuals

In practice, the residuals are often corrected for unequal variances.

Internally studentized residuals (what R uses):

  r_i = e_i / (σ̂ √(1 − h_{ii}))

Note that

  r_i²/(n − p − 1) ~ Beta(1/2, (n − p − 2)/2),   so that r_i ≈ t_{n−p−1}

Externally studentized residuals:

  t_i = e_i / (σ̂_{(i)} √(1 − h_{ii}))

where σ̂_{(i)} comes from the fit with the ith observation removed. Note that

  t_i ~ t_{n−p−2}

Checking model accuracy

With several predictors, the residuals-vs-fitted-values plot may be misleading. Instead, we look at partial residual plots, which focus on one variable at a time. Say we want to look at the influence of predictor X_j = (x_{1,j}, …, x_{n,j}) (the jth column of X) on the response y.

Added variable plots take into account the influence of the other predictors:
1. Regress y on all the predictors excluding X_j, obtaining residuals e^{(j)}.
2. Regress X_j on all the predictors excluding X_j, obtaining residuals X^{(j)}.
3. Plot e^{(j)} versus X^{(j)}.

Component plus residual plots are another alternative:
1. Plot e + β̂_j X_j versus X_j.

Checking homoscedasticity

Partial residual plots may help assess whether the errors have equal variances or not. A fan shape in the plot is an indication that the errors do not have equal variances.
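A small numpy check (toy data and variable names of my own) of the two kinds of studentized residuals: t_i is computed by actually refitting without case i, and compared against the standard shortcut t_i = r_i √((n − p − 2)/(n − p − 1 − r_i²)).

```python
import numpy as np

X = np.column_stack([np.ones(8),
                     [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                     [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 7.9])
n, k = X.shape                  # k = p + 1

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)                  # leverages h_ii
e = y - H @ y
sigma2 = e @ e / (n - k)

# Internally studentized residuals (R's rstandard)
r = e / np.sqrt(sigma2 * (1.0 - h))

# Externally studentized residuals (R's rstudent), via leave-one-out refits
t_res = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    Xi, yi = X[keep], y[keep]
    bi = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)
    s2_i = np.sum((yi - Xi @ bi) ** 2) / (n - 1 - k)   # sigma_(i)^2
    t_res[i] = e[i] / np.sqrt(s2_i * (1.0 - h[i]))
```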

Checking normality

A q-q plot of the standardized residuals is used.

Checking for independence

Checking for independence often requires defining a dependency structure explicitly and testing against it. (Otherwise we would need a number of replicates of the data, i.e. multiple samples of the same size.) If dependency is expected, then it is preferable to fit a time series model (for the errors). In that context, we could use something like the Durbin-Watson test to assess whether the errors are autocorrelated.

Outliers

Outliers are points that are unusual compared to the bulk of the observations.

An outlier in predictor is a data point (x_i, y_i) such that x_i is away from the bulk of the predictor vectors. Such a point has high leverage.

An outlier in response is a data point (x_i, y_i) such that y_i is away from the trend implied by the other observations.

A point that is both an outlier in predictor and in response is said to be influential.

Detecting outliers

To detect outliers in predictor, plotting the hat values h_{ii} may be useful (h_{ii} > 2p/n is considered suspect). To detect outliers in response, plotting the residuals may be useful.

Leave-one-out diagnostics

If β̂_{(i)} is the least squares coefficient vector computed with the ith case deleted, then

  β̂ − β̂_{(i)} = (XᵀX)⁻¹ x_i e_i / (1 − h_{ii})

Define

  ŷ_{(i)} = X β̂_{(i)}

In particular, ŷ_{(i)i} is the value at x_i predicted by the model fitted without the observation (x_i, y_i).

Cook's distances

Cook's distances are

  D_i = ‖ŷ − ŷ_{(i)}‖² / ((p + 1) σ̂²)   (change in fitted values)
      = (β̂ − β̂_{(i)})ᵀ XᵀX (β̂ − β̂_{(i)}) / ((p + 1) σ̂²)   (change in coefficients)
      = r_i² h_{ii} / ((p + 1)(1 − h_{ii}))   (combination of residuals and hat values)

D_i > F^{0.05}_{p+1,n−p−1} is considered suspect. To [R].

DFBETAS and DFFITS

DFBETAS:

  DFBETAS_{j(i)} = (β̂_j − β̂_{j(i)}) / √(σ̂²_{(i)} (XᵀX)⁻¹_{jj})

|DFBETAS_{j(i)}| > 2/√n is considered suspect.

DFFITS:

  DFFITS_i = (ŷ_i − ŷ_{(i)i}) / (σ̂_{(i)} √(h_{ii})) = t_i √(h_{ii}/(1 − h_{ii}))

|DFFITS_i| > 2√(p/n) is considered suspect.
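The three expressions for Cook's distance are algebraically identical; the following sketch (again with made-up data) computes D_i both from the change in fitted values under leave-one-out refits and from the shortcut in residuals and hat values.

```python
import numpy as np

X = np.column_stack([np.ones(8),
                     [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                     [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 7.9])
n, k = X.shape                        # k = p + 1

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)
H = X @ np.linalg.solve(XtX, X.T)
h = np.diag(H)
e = y - H @ y
sigma2 = e @ e / (n - k)
r = e / np.sqrt(sigma2 * (1.0 - h))   # internally studentized residuals

# Shortcut: D_i = r_i^2 h_ii / ((p+1)(1 - h_ii))
D_short = r ** 2 * h / (k * (1.0 - h))

# Direct: refit without case i and measure the change in fitted values
D_direct = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
    D_direct[i] = np.sum((X @ (beta_hat - b_i)) ** 2) / (k * sigma2)
```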

Multicollinearity

When the predictors are nearly linearly dependent, several issues arise:

1. Interpretation is difficult.
2. The estimates have large variances:

  var(β̂_j) = σ² / ((1 − Q_j²) SS_{X_j})

where Q_j² is the R² when regressing X_j on the X_k, k ≠ j.

3. The fit is numerically unstable, since XᵀX is almost singular.

Correlation between predictors

To detect multicollinearity, we look for large correlations between predictors. For that we compute the correlation matrix of the predictors, with coefficients cor(X_j, X_k). To [R].

Variance inflation factors

Alternatively, one can look at the variance inflation factors:

  VIF_j = 1 / (1 − Q_j²)

If the predictors are standardized to unit norm, XᵀX is a correlation matrix and we have

  VIF = diag((XᵀX)⁻¹)

VIF_j > 10 (same as Q_j² > 90%) is considered suspect. To [R].
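VIFs can be computed either from the auxiliary-regression Q_j² or from the inverse correlation matrix of the predictors; the sketch below (my own toy numbers) does both, and the two routes agree.

```python
import numpy as np

# Two correlated predictors (made-up data); the intercept is handled by centering.
Z = np.column_stack([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                     [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]])
n, p = Z.shape

# Route 1: VIF_j = 1 / (1 - Q_j^2), Q_j^2 from regressing X_j on the others
vif_reg = np.empty(p)
for j in range(p):
    others = np.column_stack([np.ones(n), np.delete(Z, j, axis=1)])
    b = np.linalg.lstsq(others, Z[:, j], rcond=None)[0]
    sse_j = np.sum((Z[:, j] - others @ b) ** 2)
    ss_xj = np.sum((Z[:, j] - Z[:, j].mean()) ** 2)
    vif_reg[j] = ss_xj / sse_j          # = 1 / (1 - Q_j^2)

# Route 2: diagonal of the inverse correlation matrix of the predictors
R = np.corrcoef(Z, rowvar=False)
vif_corr = np.diag(np.linalg.inv(R))
```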

Condition indices

Another option is to examine the condition indices of XᵀX:

  κ_j = λ_max / λ_j

where the λ_j's are the eigenvalues of XᵀX. κ_j > 1000 is considered suspect. λ_max/λ_min is called the condition number of the matrix XᵀX. To [R].

Dealing with questionable assumptions

If some of the assumptions are doubtful, it is important to correct the model, for otherwise we cannot rely on our results (estimates + tests).

Dealing with curvature

When the model accuracy is questionable, we can:

1. Transform the variables and/or the response, which often involves guessing. This is particularly useful for spreading the variables.
2. Augment the model by adding other terms, perhaps with interactions. This is often used in conjunction with model selection so as to avoid overfitting.

Dealing with heteroscedasticity

When homoscedasticity is questionable, we can:

1. Apply a variance stabilizing transformation to the response.
2. Fit the model by weighted least squares.

Both involve guessing at the dependency of the variance as a function of the variables.

Variance stabilizing transformations

Here are a few common transformations of the response to stabilize the variance:

  σ² ∝ E(y_i)              (Poisson)    change to √y
  σ² ∝ E(y_i)(1 − E(y_i))  (binomial)   change to sin⁻¹(√y)
  σ² ∝ E(y_i)²                          change to log(y)
  σ² ∝ E(y_i)³                          change to 1/√y
  σ² ∝ E(y_i)⁴                          change to 1/y

In general, we want a transformation ψ such that

  ψ′(E y_i)² var(y_i) ≈ 1

Weighted least squares

Suppose

  y_i = β₀ + βᵀx_i + ε_i,   ε_i ~ N(0, σ_i²)

Maximum likelihood estimation of β₀, β corresponds to the weighted least squares solution that minimizes

  SSE(β₀, β) = Σ_{i=1}^n w_i (y_i − β₀ − βᵀx_i)²,   w_i = 1/σ_i²

To determine appropriate weights, different approaches are used:
- Guess how σ_i² varies with x_i, i.e. the shape of var(y_i | x_i).
- Assume a parametric model for the weights and use maximum likelihood estimation, solved by iteratively reweighted least squares.

Using weights is helpful in situations where some observations are more reliable than others, mostly because of variability.
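A sketch of weighted least squares (toy data and weights of my own): the WLS solution (XᵀWX)⁻¹XᵀWy is the same as OLS after rescaling each row by √w_i, which is a convenient way both to compute it and to test it.

```python
import numpy as np

X = np.column_stack([np.ones(8),
                     [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 7.9])
w = np.array([1.0, 1.0, 0.5, 0.5, 0.25, 0.25, 0.1, 0.1])   # w_i = 1/sigma_i^2 (guessed)

# Normal equations with weight matrix W = diag(w)
W = np.diag(w)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Equivalent: ordinary least squares on sqrt(w)-rescaled data
sw = np.sqrt(w)
beta_ols = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
```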

Generalized least squares

Weighted least squares assumes that the errors are uncorrelated. Generalized least squares assumes a more general form for the covariance of the errors, namely σ²V, where V is usually known. Assuming both the design matrix X and the covariance matrix V are full rank, the maximum likelihood estimate is

  β̂ = (XᵀV⁻¹X)⁻¹ XᵀV⁻¹ y

Dealing with outliers

We have identified an outlier. Was it recorded properly?
- No: correct it, or remove it from the fit.
- Yes: decide whether to include it in the fit.

Genuine outliers carry information, so simply discarding them amounts to losing that information. However, strong outliers can compromise the quality of the overall fit. A possible option is to model outliers and non-outliers separately.

Dealing with multicollinearity

If interpretation is the main goal, drop a few redundant predictors. If prediction is the main goal, use a model selection procedure.
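Generalized least squares can be implemented by whitening: if V = LLᵀ (Cholesky), regressing L⁻¹y on L⁻¹X by ordinary least squares gives (XᵀV⁻¹X)⁻¹XᵀV⁻¹y. A minimal sketch, with an AR(1)-style V of my own choosing:

```python
import numpy as np

n = 8
X = np.column_stack([np.ones(n), np.arange(1.0, n + 1.0)])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 7.9])

# Assumed error covariance (up to sigma^2): AR(1)-like, V_ij = 0.5^|i-j|
idx = np.arange(n)
V = 0.5 ** np.abs(idx[:, None] - idx[None, :])

# Direct formula: beta = (X^T V^{-1} X)^{-1} X^T V^{-1} y
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

# Whitening: V = L L^T, then OLS on (L^{-1} X, L^{-1} y)
L = np.linalg.cholesky(V)
Xw = np.linalg.solve(L, X)
yw = np.linalg.solve(L, y)
beta_white = np.linalg.lstsq(Xw, yw, rcond=None)[0]
```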

Chapter 10 Building the Regression Model II: Diagnostics Chapter 10 Building the Regression Model II: Diagnostics 許湘伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 41 10.1 Model Adequacy for a Predictor Variable-Added

More information

Leverage. the response is in line with the other values, or the high leverage has caused the fitted model to be pulled toward the observed response.

Leverage. the response is in line with the other values, or the high leverage has caused the fitted model to be pulled toward the observed response. Leverage Some cases have high leverage, the potential to greatly affect the fit. These cases are outliers in the space of predictors. Often the residuals for these cases are not large because the response

More information

Outline. Remedial Measures) Extra Sums of Squares Standardized Version of the Multiple Regression Model

Outline. Remedial Measures) Extra Sums of Squares Standardized Version of the Multiple Regression Model Outline 1 Multiple Linear Regression (Estimation, Inference, Diagnostics and Remedial Measures) 2 Special Topics for Multiple Regression Extra Sums of Squares Standardized Version of the Multiple Regression

More information

Review: General Approach to Hypothesis Testing. 1. Define the research question and formulate the appropriate null and alternative hypotheses.

Review: General Approach to Hypothesis Testing. 1. Define the research question and formulate the appropriate null and alternative hypotheses. 1 Review: Let X 1, X,..., X n denote n independent random variables sampled from some distribution might not be normal!) with mean µ) and standard deviation σ). Then X µ σ n In other words, X is approximately

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7

MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7 MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7 1 Random Vectors Let a 0 and y be n 1 vectors, and let A be an n n matrix. Here, a 0 and A are non-random, whereas y is

More information

x 21 x 22 x 23 f X 1 X 2 X 3 ε

x 21 x 22 x 23 f X 1 X 2 X 3 ε Chapter 2 Estimation 2.1 Example Let s start with an example. Suppose that Y is the fuel consumption of a particular model of car in m.p.g. Suppose that the predictors are 1. X 1 the weight of the car

More information

Simple Linear Regression for the MPG Data

Simple Linear Regression for the MPG Data Simple Linear Regression for the MPG Data 2000 2500 3000 3500 15 20 25 30 35 40 45 Wgt MPG What do we do with the data? y i = MPG of i th car x i = Weight of i th car i =1,...,n n = Sample Size Exploratory

More information

Chapter 16. Simple Linear Regression and Correlation

Chapter 16. Simple Linear Regression and Correlation Chapter 16 Simple Linear Regression and Correlation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

1 The classical linear regression model

1 The classical linear regression model THE UNIVERSITY OF CHICAGO Booth School of Business Business 41912, Spring Quarter 2012, Mr Ruey S Tsay Lecture 4: Multivariate Linear Regression Linear regression analysis is one of the most widely used

More information

A Modern Look at Classical Multivariate Techniques

A Modern Look at Classical Multivariate Techniques A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico

More information

Multiple Linear Regression

Multiple Linear Regression Andrew Lonardelli December 20, 2013 Multiple Linear Regression 1 Table Of Contents Introduction: p.3 Multiple Linear Regression Model: p.3 Least Squares Estimation of the Parameters: p.4-5 The matrix approach

More information

Remedial Measures for Multiple Linear Regression Models

Remedial Measures for Multiple Linear Regression Models Remedial Measures for Multiple Linear Regression Models Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 1 / 25 Outline

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n

More information

CAS MA575 Linear Models

CAS MA575 Linear Models CAS MA575 Linear Models Boston University, Fall 2013 Midterm Exam (Correction) Instructor: Cedric Ginestet Date: 22 Oct 2013. Maximal Score: 200pts. Please Note: You will only be graded on work and answers

More information

Need for Several Predictor Variables

Need for Several Predictor Variables Multiple regression One of the most widely used tools in statistical analysis Matrix expressions for multiple regression are the same as for simple linear regression Need for Several Predictor Variables

More information

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 Lecture 2: Linear Models Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector

More information

2. Regression Review

2. Regression Review 2. Regression Review 2.1 The Regression Model The general form of the regression model y t = f(x t, β) + ε t where x t = (x t1,, x tp ), β = (β 1,..., β m ). ε t is a random variable, Eε t = 0, Var(ε t

More information

Chapter 6 Multiple Regression

Chapter 6 Multiple Regression STAT 525 FALL 2018 Chapter 6 Multiple Regression Professor Min Zhang The Data and Model Still have single response variable Y Now have multiple explanatory variables Examples: Blood Pressure vs Age, Weight,

More information

Applied linear statistical models: An overview

Applied linear statistical models: An overview Applied linear statistical models: An overview Gunnar Stefansson 1 Dept. of Mathematics Univ. Iceland August 27, 2010 Outline Some basics Course: Applied linear statistical models This lecture: A description

More information

Any of 27 linear and nonlinear models may be fit. The output parallels that of the Simple Regression procedure.

Any of 27 linear and nonlinear models may be fit. The output parallels that of the Simple Regression procedure. STATGRAPHICS Rev. 9/13/213 Calibration Models Summary... 1 Data Input... 3 Analysis Summary... 5 Analysis Options... 7 Plot of Fitted Model... 9 Predicted Values... 1 Confidence Intervals... 11 Observed

More information

Prepared by: Prof. Dr Bahaman Abu Samah Department of Professional Development and Continuing Education Faculty of Educational Studies Universiti

Prepared by: Prof. Dr Bahaman Abu Samah Department of Professional Development and Continuing Education Faculty of Educational Studies Universiti Prepared by: Prof Dr Bahaman Abu Samah Department of Professional Development and Continuing Education Faculty of Educational Studies Universiti Putra Malaysia Serdang M L Regression is an extension to

More information

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij =

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij = K. Model Diagnostics We ve already seen how to check model assumptions prior to fitting a one-way ANOVA. Diagnostics carried out after model fitting by using residuals are more informative for assessing

More information

Section 4.6 Simple Linear Regression

Section 4.6 Simple Linear Regression Section 4.6 Simple Linear Regression Objectives ˆ Basic philosophy of SLR and the regression assumptions ˆ Point & interval estimation of the model parameters, and how to make predictions ˆ Point and interval

More information

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 Lecture 3: Linear Models Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector of observed

More information

MATH 644: Regression Analysis Methods

MATH 644: Regression Analysis Methods MATH 644: Regression Analysis Methods FINAL EXAM Fall, 2012 INSTRUCTIONS TO STUDENTS: 1. This test contains SIX questions. It comprises ELEVEN printed pages. 2. Answer ALL questions for a total of 100

More information

Multivariate Linear Regression Models

Multivariate Linear Regression Models Multivariate Linear Regression Models Regression analysis is used to predict the value of one or more responses from a set of predictors. It can also be used to estimate the linear association between

More information

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that Linear Regression For (X, Y ) a pair of random variables with values in R p R we assume that E(Y X) = β 0 + with β R p+1. p X j β j = (1, X T )β j=1 This model of the conditional expectation is linear

More information

Multiple Sample Numerical Data

Multiple Sample Numerical Data Multiple Sample Numerical Data Analysis of Variance, Kruskal-Wallis test, Friedman test University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 /

More information

Data Mining Stat 588

Data Mining Stat 588 Data Mining Stat 588 Lecture 02: Linear Methods for Regression Department of Statistics & Biostatistics Rutgers University September 13 2011 Regression Problem Quantitative generic output variable Y. Generic

More information

Ch 2: Simple Linear Regression

Ch 2: Simple Linear Regression Ch 2: Simple Linear Regression 1. Simple Linear Regression Model A simple regression model with a single regressor x is y = β 0 + β 1 x + ɛ, where we assume that the error ɛ is independent random component

More information

Bivariate Paired Numerical Data

Bivariate Paired Numerical Data Bivariate Paired Numerical Data Pearson s correlation, Spearman s ρ and Kendall s τ, tests of independence University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html

More information

Linear Models, Problems

Linear Models, Problems Linear Models, Problems John Fox McMaster University Draft: Please do not quote without permission Revised January 2003 Copyright c 2002, 2003 by John Fox I. The Normal Linear Model: Structure and Assumptions

More information

Formal Statement of Simple Linear Regression Model

Formal Statement of Simple Linear Regression Model Formal Statement of Simple Linear Regression Model Y i = β 0 + β 1 X i + ɛ i Y i value of the response variable in the i th trial β 0 and β 1 are parameters X i is a known constant, the value of the predictor

More information

Well-developed and understood properties

Well-developed and understood properties 1 INTRODUCTION TO LINEAR MODELS 1 THE CLASSICAL LINEAR MODEL Most commonly used statistical models Flexible models Well-developed and understood properties Ease of interpretation Building block for more

More information

Diagnostics and Remedial Measures

Diagnostics and Remedial Measures Diagnostics and Remedial Measures Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Diagnostics and Remedial Measures 1 / 72 Remedial Measures How do we know that the regression

More information

Introduction to Linear regression analysis. Part 2. Model comparisons

Introduction to Linear regression analysis. Part 2. Model comparisons Introduction to Linear regression analysis Part Model comparisons 1 ANOVA for regression Total variation in Y SS Total = Variation explained by regression with X SS Regression + Residual variation SS Residual

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice

The Model Building Process Part I: Checking Model Assumptions Best Practice The Model Building Process Part I: Checking Model Assumptions Best Practice Authored by: Sarah Burke, PhD 31 July 2017 The goal of the STAT T&E COE is to assist in developing rigorous, defensible test

More information

18.S096 Problem Set 3 Fall 2013 Regression Analysis Due Date: 10/8/2013

18.S096 Problem Set 3 Fall 2013 Regression Analysis Due Date: 10/8/2013 18.S096 Problem Set 3 Fall 013 Regression Analysis Due Date: 10/8/013 he Projection( Hat ) Matrix and Case Influence/Leverage Recall the setup for a linear regression model y = Xβ + ɛ where y and ɛ are

More information

Problems. Suppose both models are fitted to the same data. Show that SS Res, A SS Res, B

Problems. Suppose both models are fitted to the same data. Show that SS Res, A SS Res, B Simple Linear Regression 35 Problems 1 Consider a set of data (x i, y i ), i =1, 2,,n, and the following two regression models: y i = β 0 + β 1 x i + ε, (i =1, 2,,n), Model A y i = γ 0 + γ 1 x i + γ 2

More information

Statistics - Lecture Three. Linear Models. Charlotte Wickham 1.

Statistics - Lecture Three. Linear Models. Charlotte Wickham   1. Statistics - Lecture Three Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Linear Models 1. The Theory 2. Practical Use 3. How to do it in R 4. An example 5. Extensions

More information