Contents

1 Review of Residuals
2 Detecting Outliers
3 Influential Observations
4 Multicollinearity and its Effects

W. Zhou (Colorado State University), STAT 540, July 6th, 2015
Model Diagnostics: An Overview

- Basic diagnostics, review
- Model adequacy for a predictor variable: added-variable plots
- Outlying Y observations and studentized/deleted residuals
- Outlying X observations and the hat matrix/leverage values
- Influential cases
- Multicollinearity diagnostics and the variance inflation factor
Model Assumptions

Recall the multiple linear regression model: for i = 1, ..., n,

    Y_i = β_0 + Σ_{j=1}^{p−1} β_j X_ij + ε_i,   ε_i iid N(0, σ²).

- Relationship between Y and X: E(Y_i) = β_0 + Σ_{j=1}^{p−1} β_j X_ij.
- Homogeneous variance: Var(Y_i) = Var(ε_i) = σ².
- Independence: Cov(ε_i, ε_j) = Cov(Y_i, Y_j) = 0 for i ≠ j.
- Normality: Y_i ~ N(β_0 + Σ_{j=1}^{p−1} β_j X_ij, σ²).
Basic Diagnostics

Exploratory Data Analysis
- Same as before: scatterplots, boxplots, histograms, summaries
- New: scatterplot matrices, split boxplots, brush/spin, coplots

Linearity, Homoscedasticity, Normality
- Same as before: (externally studentized) residuals vs. each X, vs. Ŷ, and vs. time (also note: ACF plot); QQ plot.
- Tests: e.g., F test for lack of fit, Breusch-Pagan, etc. (see Chapter 6.8, KNNL)

Outliers, Influence, and Correlated Predictors
- Major focus of this set of notes
1 Review of Residuals
2 Detecting Outliers
  - Outlying Response
  - Outlying Predictor
3 Influential Observations
4 Multicollinearity and its Effects
Residuals Review

Recall that the residuals e = (e_1, ..., e_n)^T = Y − Ŷ = (I − H)Y, where H is the hat/projection matrix.

- The mean of the residuals is E{e} = 0 (and when the model contains an intercept, 1^T e = 0).
- The variance-covariance matrix of the residuals is Var{e} = σ²(I − H), estimated by s²{e} = MSE (I − H).
Residuals Review

Denote H = [h_ij], i, j = 1, ..., n. Then the variance of e_i is

    Var{e_i} = σ²(1 − h_ii),   estimated by s²{e_i} = MSE (1 − h_ii).

The covariance of e_i and e_j (i ≠ j) is

    Cov{e_i, e_j} = σ²(0 − h_ij) = −σ² h_ij,   estimated by s{e_i, e_j} = −MSE h_ij.
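These formulas are easy to check numerically. The sketch below (simulated data; all names and the chosen dimensions are illustrative, not from the notes) computes the residuals via (I − H)Y and the estimated variances MSE(1 − h_ii):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3                      # n observations, p parameters (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])
Y = X @ beta + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat/projection matrix
e = (np.eye(n) - H) @ Y                   # residuals e = (I - H)Y
Y_hat = H @ Y

# With an intercept, residuals sum to zero, and s^2{e_i} = MSE(1 - h_ii)
MSE = e @ e / (n - p)
s2_e = MSE * (1 - np.diag(H))
```

Note that e = (I − H)Y and Y − Ŷ agree exactly, as the identity on the slide says.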
Studentized Residuals Review

- The variance of e_i is not constant, and the covariance of e_i and e_j is not zero.
- An observation whose residual is large relative to its standard deviation may be outlying.
- To compare the n residuals, standardize them so they are on the same scale.

Studentized residuals (a.k.a. internally studentized) are defined as

    r_i = e_i / s{e_i} = e_i / √(MSE(1 − h_ii)).

If the model is appropriate, the studentized residuals {r_i} have constant variance, while the ordinary residuals {e_i} do not.
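A minimal sketch of the computation (simulated data, illustrative names). Since Σ e_i² = SSE = (n − p) MSE, the r_i satisfy Σ r_i²(1 − h_ii) = n − p, which makes a convenient sanity check:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
h = np.diag(H)                          # leverages h_ii
e = Y - H @ Y                           # ordinary residuals
MSE = e @ e / (n - p)
r = e / np.sqrt(MSE * (1 - h))          # internally studentized residuals
```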
Deleted Residuals Review

- Influence: if the ith point is highly influential, it can pull the fitted response surface strongly toward itself. This masks the point's influence.
- Strategy: define the residual for the ith point as the prediction error for that point under the model fit to the data with that point omitted.

Deleted residuals are defined as

    d_i = Y_i − Ŷ_{i(i)}.

It can be shown that

    d_i = e_i / (1 − h_ii) = (Y_i − Ŷ_i) / (1 − h_ii).
Deleted Residuals Review

Let X_i = (1, X_i1, ..., X_{i,p−1}) (a row vector). Let X_{(i)} and MSE_{(i)} denote the design matrix and the MSE with the ith row (observation) deleted. Recalling that s²{pred} = MSE(1 + X_h (X^T X)^{−1} X_h^T), one can show

    s²{d_i} = MSE_{(i)} (1 + X_i (X_{(i)}^T X_{(i)})^{−1} X_i^T) = MSE_{(i)} / (1 − h_ii).
Studentized Deleted Residuals Review

The studentized deleted residuals (a.k.a. externally studentized) are defined as, for i = 1, ..., n,

    t_i = d_i / s{d_i} = e_i / √(MSE_{(i)} (1 − h_ii)).

Note that (n − p) MSE = (n − p − 1) MSE_{(i)} + e_i² / (1 − h_ii), so

    t_i = e_i √( (n − p − 1) / (SSE(1 − h_ii) − e_i²) ),

and there is no need to fit n separate regressions.
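To see that the closed form really does avoid n refits, the sketch below (simulated data; names illustrative) computes t_i from the full fit and compares it against brute-force leave-one-out regressions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 15, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = Y - H @ Y
SSE = e @ e

# Closed form: t_i = e_i * sqrt((n - p - 1) / (SSE(1 - h_ii) - e_i^2))
t = e * np.sqrt((n - p - 1) / (SSE * (1 - h) - e**2))

# Brute force: refit without observation i, then studentize its prediction error
t_brute = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    Xi, Yi = X[keep], Y[keep]
    bi = np.linalg.solve(Xi.T @ Xi, Xi.T @ Yi)
    ei = Yi - Xi @ bi
    MSEi = ei @ ei / (n - 1 - p)
    d = Y[i] - X[i] @ bi                      # deleted residual d_i
    s2_pred = MSEi * (1 + X[i] @ np.linalg.solve(Xi.T @ Xi, X[i]))
    t_brute[i] = d / np.sqrt(s2_pred)
```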
Outlying Y Observations

Outlying observations are well separated from the remainder of the data. Consider three types of outlying observations:

1 Outlying in Y|X but not in X: usually not influential.
2 Outlying in X but not in Y|X: usually not influential.
3 Outlying in both X and Y|X: can be very influential.

Goal: identify outlying and influential observations. The task is relatively straightforward for 1-2 predictor variables but becomes more challenging for more than 2 predictor variables (cf. hidden extrapolation).

Basic idea: outlying observations may have large residuals and often have a large impact on the model fit.
Identifying Outlying Y Observations

Basic idea: the ith observation is outlying in Y if |t_i| is large.

Under H_0: observation i is not outlying in Y,

    t_i = d_i / s{d_i} ~ t_{n−p−1}.

The decision rule needs a Bonferroni adjustment. Why? Because n comparisons are made. Declare observation i outlying if |t_i| > t_{1−α/(2n); n−p−1}. For most n and p, this critical value at the α = 5% level is greater than 3, so in practice: if |t_i| > 3, then observation i is a possible outlier.
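A sketch of the practical |t_i| > 3 screen, on simulated data with one deliberately planted Y outlier (the data and the planted shift of +8 are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([0.0, 1.0, 1.0]) + rng.normal(size=n)
Y[0] += 8.0                     # plant an outlier in Y at observation 0

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = Y - H @ Y
SSE = e @ e
# externally studentized residuals, computed without refitting
t = e * np.sqrt((n - p - 1) / (SSE * (1 - h) - e**2))

flagged = np.where(np.abs(t) > 3)[0]    # practical cutoff from the slide
```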
Hat Matrix and Leverages

Basic idea: use the hat matrix to identify outliers in X. Recall that H = [h_ij] and h_ii = X_i (X^T X)^{−1} X_i^T. The diagonal elements h_ii are called leverages.

Properties of leverages h_ii:
1 0 ≤ h_ii ≤ 1 (can you show this?)
2 Σ_{i=1}^n h_ii = p, so the average leverage is h̄ = p/n (show it).
3 h_ii is a measure of the distance between the X values of the ith observation and the means of the X values for all n observations (show: h_ii = 1/n + (x_i − x̄)^T (X_c^T X_c)^{−1} (x_i − x̄), where X_c is the centered design matrix).
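These properties are easy to verify numerically. A sketch with simulated predictors (names illustrative), checking Σ h_ii = p, the 0-1 bounds, and the centered-distance identity:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 20, 4                       # intercept + 3 predictors
Z = rng.normal(size=(n, p - 1))    # predictor columns only
X = np.column_stack([np.ones(n), Z])

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)

# Property 3: h_ii = 1/n + (z_i - z̄)' (Zc' Zc)^{-1} (z_i - z̄), Zc centered
Zc = Z - Z.mean(axis=0)
G = np.linalg.inv(Zc.T @ Zc)
h_alt = 1 / n + np.einsum('ij,jk,ik->i', Zc, G, Zc)
```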
Identifying Outlying X Observations

Effects of hat values: if the ith data point is outlying in X with a high leverage h_ii, it can strongly influence the fitted value Ŷ_i.
- A higher leverage h_ii gives Y_i more weight in determining Ŷ_i (since Ŷ = HY).
- A higher leverage h_ii results in a smaller s{e_i}, as Ŷ_i is pulled closer to Y_i.
- Connections to nonparametric smoothing.

What is a bad hat value?
1 If h_ii > 2p/n, then observation i is considered outlying in X.
2 Moderate leverage if h_ii ∈ [0.2, 0.5) and high leverage if h_ii ∈ [0.5, 1].
3 Draw a histogram, stem-and-leaf, or other plot of the h_ii. Outlying observations tend to have large leverages, and there tends to be a gap between the outlying group and the other leverage values.
Hidden Extrapolation

H can be used to identify hidden extrapolation when p is large. It is possible for a point X_new to have each coordinate X_new,i (i = 1, ..., p) within the corresponding marginal range of X, yet for the p-dimensional point X_new to lie outside the support region of the empirical joint distribution of X. This can be very difficult to detect, especially if no 2-way scatterplot or 3-way brush/spin reveals it.

Consider h_new,new = X_new (X^T X)^{−1} X_new^T. If h_new,new ≤ max_i h_ii, then it is fine to make predictions at X_new; if h_new,new is much larger, prediction at X_new involves extrapolation.
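A sketch of the idea with two highly correlated predictors (all values are illustrative assumptions): the hypothetical new point lies inside both marginal ranges but far from the joint cloud, and its hat value gives it away:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
# Two strongly correlated predictors: the joint support is a narrow band
x1 = rng.uniform(0, 1, n)
x2 = x1 + rng.normal(scale=0.05, size=n)
X = np.column_stack([np.ones(n), x1, x2])

XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # in-sample leverages

# New point: x1 = 0.1 and x2 = 0.9 are each inside their marginal ranges,
# but the pair is far off the x2 ≈ x1 band
x_new = np.array([1.0, 0.1, 0.9])
h_new = x_new @ XtX_inv @ x_new
```

The hat value of the new point far exceeds every in-sample leverage, flagging hidden extrapolation that neither marginal range reveals.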
Identifying Influential Observations

An observation is influential if its deletion leads to major changes in the fitted regression. Not all outlying observations are influential.

Main idea: a leave-one-out approach, like the deleted residuals. Consider 3 measures:
1 DFFITS
2 Cook's distance
3 DFBETAS

No diagnostic identifies all possible problems. For example, leave-one-out methods do not address multiple influential observations. More complicated methods are possible: bootstrap, high-dimensional situations.
DFFITS

DFFITS measures the effect of the ith case on the fitted value of Y_i:

    DFFITS_i = (Ŷ_i − Ŷ_{i(i)}) / √(MSE_{(i)} h_ii),

and we can show

    DFFITS_i = t_i √(h_ii / (1 − h_ii)),

where t_i is the ith studentized deleted residual.

- For small to medium data sets, |DFFITS_i| > 1 implies that the ith observation may be influential.
- For large data sets, |DFFITS_i| > 2√(p/n) implies that the ith observation may be influential.
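The two expressions for DFFITS can be checked against each other; a sketch on simulated data (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 15, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = Y - H @ Y
SSE = e @ e
t = e * np.sqrt((n - p - 1) / (SSE * (1 - h) - e**2))   # externally studentized
dffits = t * np.sqrt(h / (1 - h))                        # closed form

# Brute force: refit without observation i
dffits_brute = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    bi = np.linalg.lstsq(X[keep], Y[keep], rcond=None)[0]
    ei = Y[keep] - X[keep] @ bi
    MSEi = ei @ ei / (n - 1 - p)
    dffits_brute[i] = (H[i] @ Y - X[i] @ bi) / np.sqrt(MSEi * h[i])
```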
Cook's Distance

Cook's distance measures the influence of the ith observation on all n fitted values:

    D_i = Σ_{j=1}^n (Ŷ_j − Ŷ_{j(i)})² / (p MSE),

and one can show

    D_i = (r_i² / p) (h_ii / (1 − h_ii)),

where r_i is the (internally) studentized residual.
Cook's Distance

- Cook's D_i is large when r_i is large and h_ii is large.
- D_i < F_{p, n−p, 0.2} (the 20th percentile): no concern.
- D_i > F_{p, n−p, 0.5} (the median): substantial influence.
- What about in between? Crude rule of thumb: if D_i > 1, investigate the ith observation as possibly influential.
- If p is large, what happens?
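The leave-one-out definition and the r_i-based closed form agree exactly; a sketch verifying this on simulated data (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 15, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([0.0, 1.0, 1.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = Y - H @ Y
MSE = e @ e / (n - p)
r = e / np.sqrt(MSE * (1 - h))          # internally studentized residuals
D = r**2 / p * h / (1 - h)              # closed form for Cook's distance

# Brute force: D_i = sum_j (Yhat_j - Yhat_j(i))^2 / (p * MSE)
Y_hat = H @ Y
D_brute = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    bi = np.linalg.lstsq(X[keep], Y[keep], rcond=None)[0]
    D_brute[i] = np.sum((Y_hat - X @ bi) ** 2) / (p * MSE)
```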
DFBETAS

DFBETAS measures the influence of the ith observation on a single coefficient β_k:

    DFBETAS_k(i) = (β̂_k − β̂_k(i)) / √(MSE_{(i)} c_kk),   where c_kk = [(X^T X)^{−1}]_kk.

Recall that Var(β̂) = σ²(X^T X)^{−1}. A larger |DFBETAS_k(i)| indicates a larger impact of observation i on β̂_k.

- For small to medium data sets, if |DFBETAS_k(i)| > 1, then the ith observation may be influential.
- For large data sets, if |DFBETAS_k(i)| > 2/√n, then the ith observation may be influential.
- The sign of DFBETAS_k(i) tells whether inclusion of observation i leads to an increase (+) or decrease (−) in β̂_k.
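A minimal leave-one-out sketch of DFBETAS (simulated data; names illustrative), producing one value per observation and per coefficient:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 15, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
c = np.diag(XtX_inv)                    # c_kk = [(X'X)^{-1}]_kk

dfbetas = np.empty((n, p))
for i in range(n):
    keep = np.arange(n) != i
    bi = np.linalg.lstsq(X[keep], Y[keep], rcond=None)[0]
    ei = Y[keep] - X[keep] @ bi
    MSEi = ei @ ei / (n - 1 - p)
    dfbetas[i] = (b - bi) / np.sqrt(MSEi * c)   # DFBETAS_k(i), k = 0..p-1
```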
Multicollinearity

When the predictor variables are correlated among themselves, multicollinearity among them is said to exist. Consider two extreme cases:
- Uncorrelated predictor variables.
- Perfectly correlated predictor variables.
Linearly Independent Predictor Variables

Consider Y = β_0 + β_1 X_1 + β_2 X_2 + ε. Suppose X_1 ⊥ X_2, i.e., Corr̂(X_1, X_2) = 0. We can show

    β̂_1 = Σ_i (Y_i − Ȳ)(X_i1 − X̄_1) / Σ_i (X_i1 − X̄_1)²,   β̂_2 = Σ_i (Y_i − Ȳ)(X_i2 − X̄_2) / Σ_i (X_i2 − X̄_2)².

- The LS estimate of β_1 is not affected by X_2, and vice versa.
- The order in which the predictor variables enter the model is inconsequential.
- Interpretation of the regression coefficients is clear: β_1 is the expected change in Y for a one-unit increase in X_1 with X_2 held constant.
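This can be checked directly: with exactly uncorrelated (here, orthogonalized) predictors, the multiple-regression coefficients coincide with the simple-regression slopes. A sketch with simulated data (names and the Gram-Schmidt step are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 30
x1 = rng.normal(size=n); x1 -= x1.mean()
x2 = rng.normal(size=n); x2 -= x2.mean()
x2 -= (x1 @ x2) / (x1 @ x1) * x1        # make x2 exactly uncorrelated with x1
Y = 3 + 2 * x1 + 5 * x2 + rng.normal(size=n)

# Multiple regression of Y on (1, x1, x2)
X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.lstsq(X, Y, rcond=None)[0]

# Simple-regression slopes from the formulas on the slide
b1 = ((Y - Y.mean()) @ x1) / (x1 @ x1)
b2 = ((Y - Y.mean()) @ x2) / (x2 @ x2)
```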
Predictor Variables are Linearly Dependent

Again, suppose Y = β_0 + β_1 X_1 + β_2 X_2 + ε, but X_2 = 2X_1 + 1. Suppose β_0 = 3, β_1 = 2, β_2 = 5. Then all of the following models give the same fit for Y:
- Y = 3 + 2X_1 + 5X_2 + ε.
- Y = 8 + 12X_1 + ε.
- Y = 2 + 6X_2 + ε.
What is still fine:
- Prediction of Y is fine within the model/data scope, but unreliable outside it.

What is not:
- The β's are not unique, because X is rank-deficient (why?) and X^T X is not invertible.
- Interpreting the effect of the jth predictor "holding all other variables constant" is difficult. A regression coefficient may no longer reflect the effect of its corresponding predictor variable.
- Even worse: multicollinearity does not violate any model assumptions!
Concerns with Multicollinearity

Multicollinearity can involve 3 or more variables, rather than just a correlated pair; that is harder to detect.

Effects of multicollinearity on inference for the regression coefficients:
- Large changes in the fitted β̂_k when another X is added or deleted.
- Small changes in the data lead to very large changes in β̂.
- Large s{β̂_k}, which makes the β̂_k seem non-significant even though the predictors are jointly significant and R² is large.
- It is more difficult to interpret β̂_k as the effect of X_k on Y, because the other X's cannot be held constant.
- Estimated coefficients may have the wrong sign or implausible magnitudes.
Some Diagnostics for Multicollinearity

Multicollinearity is harmless for estimation of the mean response and prediction of a new observation at X_h — assuming no extrapolation!

Diagnosing multicollinearity:
- Large changes in the β̂'s when a predictor (or an observation) is added or deleted.
- Important predictors are not statistically significant (large p-values) in individual tests.
- Wide confidence intervals for the β's corresponding to important predictor variables.
- The sign of a β̂ is counter-intuitive.
- Predictors are highly correlated.
Variance Inflation Factor (VIF)

The variance inflation factor (VIF) for β̂_k is

    VIF_k = 1 / (1 − R_k²),   k = 1, ..., p − 1,

where R_k² is the R² for a regression of X_k on the other predictor variables.

- VIF_k measures the factor by which the variance of β̂_k is inflated by the presence of the other variables.
- If max_k VIF_k > 10, multicollinearity may have a large impact on the inference.
- If Σ_{j=1}^{p−1} VIF_j > p − 1, i.e., the average VIF exceeds 1, there may be serious multicollinearity problems (for large p).
Variance Inflation Factor (VIF)

Here R_k² is the coefficient of multiple determination R² of the model

    X_ik = β_0 + Σ_{j≠k} β_j X_ij + ε,

and σ²{β̂_k} ∝ VIF_k = 1/(1 − R_k²):
1 When R_k² decreases, σ²{β̂_k} decreases.
2 When R_k² increases, σ²{β̂_k} increases.

In fact, VIF_k = (n − 1) ((X_c^T X_c)^{−1})_kk, where X_c is the scaled design matrix. (Can you show this?)
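A sketch computing VIFs both ways on simulated data (one deliberately collinear pair; all names illustrative): via the auxiliary regressions 1/(1 − R_k²), and equivalently via the diagonal of the inverse sample correlation matrix of the predictors:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=n)   # collinear with x1
x3 = rng.normal(size=n)                         # independent predictor
Z = np.column_stack([x1, x2, x3])

def vif(Z, k):
    """VIF_k = 1 / (1 - R^2_k), R^2_k from regressing X_k on the others."""
    y = Z[:, k]
    others = np.delete(Z, k, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return 1 / (1 - r2)

vifs = np.array([vif(Z, k) for k in range(Z.shape[1])])

# Equivalent: diagonal of the inverse correlation matrix of the predictors
R = np.corrcoef(Z, rowvar=False)
vifs_alt = np.diag(np.linalg.inv(R))
```

The collinear pair (x1, x2) gets large VIFs, while the independent x3 stays near 1.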
Some Remedial Measures for Multicollinearity

Classical methods:
- Drop one or more predictor variables from the model (variable selection, a frontier of statistics).
- For polynomial or interaction regression models, use centered predictor variables X_ik − X̄_k to reduce multicollinearity (Gram-Schmidt transformation, why?).

Modern methods:
- Create new predictor variables: principal component regression, partial least squares regression (PLSR), dimension reduction.
- Use shrinkage regression such as ridge, LASSO, SCAD, group LASSO, adaptive LASSO; e.g., ridge:

      β̂_R = (X^T X + λI)^{−1} X^T Y.

  Although β̂_R has a smaller variance, it is a biased estimator of β.

These move into the frontier of statistical machine learning and high-dimensional inference.
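A minimal sketch of the ridge formula above on nearly collinear simulated data (names and λ = 1 are illustrative; for simplicity the intercept is penalized too, matching the formula as written). Ridge always shrinks the coefficient vector relative to OLS:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])
Y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=n)

lam = 1.0                                    # ridge penalty (illustrative)
p = X.shape[1]
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
beta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
```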