Final Review
Yang Feng (Columbia University)
http://www.stat.columbia.edu/~yangfeng

Outline
1 Multiple Linear Regression (Estimation, Inference)
2 Special Topics for Multiple Regression
  Extra Sums of Squares
  Standardized Version of the Multiple Regression Model
3 Polynomial and Interaction Regression Models
4 Model Selection
5 Remedial Measures for Multiple Linear Regression Models


General Regression Model in Matrix Terms

$$\mathbf{Y} = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}, \qquad
\mathbf{X} = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1,p-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{n,p-1} \end{bmatrix},$$

$$\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_{p-1} \end{bmatrix}, \qquad
\boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$

General Linear Regression in Matrix Terms

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

with $E(\boldsymbol{\varepsilon}) = \mathbf{0}$ and

$$\sigma^2\{\boldsymbol{\varepsilon}\} = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix}$$

We have $E(\mathbf{Y}) = \mathbf{X}\boldsymbol{\beta}$ and $\sigma^2\{\mathbf{Y}\} = \sigma^2\mathbf{I}$.

Least Squares Solution

The matrix normal equations can be derived directly from the minimization of
$$Q(\boldsymbol{\beta}) = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})$$
with respect to $\boldsymbol{\beta}$:
$$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}, \qquad \hat{\mathbf{Y}} = \mathbf{X}\mathbf{b}$$
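Not part of the original slides: a minimal numpy sketch of the normal-equations solution. The sample size, design matrix, and true coefficients are synthetic, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3                          # n cases, p parameters (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Normal equations: b = (X'X)^{-1} X'Y; solve() is numerically safer than inv()
b = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ b                         # fitted values
print(b)                              # close to beta_true
```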

Hat Matrix: Puts the Hat on Y

We can also express the fitted values directly in terms of the X and Y matrices,
$$\hat{\mathbf{Y}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$
and we can further define H, the hat matrix:
$$\hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}, \qquad \mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$$
The hat matrix plays an important role in diagnostics for regression analysis.

Hat Matrix Properties

1. The hat matrix is symmetric.
2. The hat matrix is idempotent, i.e. $\mathbf{H}\mathbf{H} = \mathbf{H}$.

Important idempotent matrix property: for a symmetric and idempotent matrix A, rank(A) = trace(A), the number of non-zero eigenvalues of A.
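An editor's numerical check (not from the slides) of the three hat-matrix properties above; the design matrix is arbitrary synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X'X)^{-1} X'
print(np.allclose(H, H.T))              # True: symmetric
print(np.allclose(H @ H, H))            # True: idempotent, HH = H
print(np.trace(H))                      # = rank(H) = p (here 3.0)
```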

Residuals

The residuals, like the fitted values $\hat{\mathbf{Y}}$, can be expressed as linear combinations of the response observations $Y_i$:
$$\mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}} = \mathbf{Y} - \mathbf{H}\mathbf{Y} = (\mathbf{I} - \mathbf{H})\mathbf{Y}$$
Also remember $\mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}} = \mathbf{Y} - \mathbf{X}\mathbf{b}$; these are equivalent.

Covariance of Residuals

Starting with $\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{Y}$, we see that
$$\sigma^2\{\mathbf{e}\} = (\mathbf{I} - \mathbf{H})\,\sigma^2\{\mathbf{Y}\}\,(\mathbf{I} - \mathbf{H})'$$
but
$$\sigma^2\{\mathbf{Y}\} = \sigma^2\{\boldsymbol{\varepsilon}\} = \sigma^2\mathbf{I}$$
which means that
$$\sigma^2\{\mathbf{e}\} = \sigma^2(\mathbf{I} - \mathbf{H})\mathbf{I}(\mathbf{I} - \mathbf{H})' = \sigma^2(\mathbf{I} - \mathbf{H})(\mathbf{I} - \mathbf{H})$$
and since $\mathbf{I} - \mathbf{H}$ is symmetric and idempotent, we have
$$\sigma^2\{\mathbf{e}\} = \sigma^2(\mathbf{I} - \mathbf{H})$$

Quadratic Forms

In general, a quadratic form is defined by
$$\mathbf{Y}'\mathbf{A}\mathbf{Y} = \sum_i \sum_j a_{ij} Y_i Y_j, \qquad a_{ij} = a_{ji}$$
with A the matrix of the quadratic form. The ANOVA sums SSTO, SSE and SSR can all be arranged into quadratic forms:
$$SSTO = \mathbf{Y}'\Big(\mathbf{I} - \tfrac{1}{n}\mathbf{J}\Big)\mathbf{Y}, \qquad
SSE = \mathbf{Y}'(\mathbf{I} - \mathbf{H})\mathbf{Y}, \qquad
SSR = \mathbf{Y}'\Big(\mathbf{H} - \tfrac{1}{n}\mathbf{J}\Big)\mathbf{Y}$$
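A small sketch (added for this review, with synthetic data) that evaluates the three quadratic forms directly and verifies the ANOVA identity SSTO = SSE + SSR.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
J = np.ones((n, n))                   # matrix of all ones
I = np.eye(n)

SSTO = Y @ (I - J / n) @ Y
SSE = Y @ (I - H) @ Y
SSR = Y @ (H - J / n) @ Y
print(np.isclose(SSTO, SSE + SSR))    # True: the decomposition holds
```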

Inference

Since $\sigma^2\{\mathbf{Y}\} = \sigma^2\mathbf{I}$, we can write
$$\sigma^2\{\mathbf{b}\} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\sigma^2\mathbf{I}\,\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$$
And
$$E(\mathbf{b}) = E\big((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\big) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\mathbf{Y}) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}$$

Inference

The estimated variance-covariance matrix is
$$s^2\{\mathbf{b}\} = MSE\,(\mathbf{X}'\mathbf{X})^{-1}$$
Then we have
$$\frac{b_k - \beta_k}{s\{b_k\}} \sim t(n - p), \qquad k = 0, 1, \ldots, p-1$$
$1-\alpha$ confidence intervals:
$$b_k \pm t(1 - \alpha/2;\, n - p)\, s\{b_k\}$$
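An illustrative sketch (not from the slides) of these confidence intervals with numpy and scipy; the data are synthetic and the 95% level is an arbitrary choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
e = Y - X @ b
MSE = e @ e / (n - p)
se_b = np.sqrt(MSE * np.diag(XtX_inv))          # s{b_k}
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - p)    # t(1 - alpha/2; n - p)
for bk, se in zip(b, se_b):
    print(f"{bk:7.3f}  [{bk - t_crit * se:7.3f}, {bk + t_crit * se:7.3f}]")
```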

t Test

Tests for $\beta_k$:
$$H_0: \beta_k = 0 \quad \text{vs.} \quad H_a: \beta_k \neq 0$$
Test statistic:
$$t^* = \frac{b_k}{s\{b_k\}}$$
Decision rule: if $|t^*| \le t(1 - \alpha/2;\, n - p)$, conclude $H_0$; otherwise, conclude $H_a$.

F Test for Regression

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0$$
$$H_a: \text{not all } \beta_k\ (k = 1, \ldots, p-1) \text{ equal zero}$$
Test statistic:
$$F^* = \frac{MSR}{MSE}$$
Decision rule: if $F^* \le F(1 - \alpha;\, p-1,\, n-p)$, conclude $H_0$; if $F^* > F(1 - \alpha;\, p-1,\, n-p)$, conclude $H_a$.
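A companion sketch (synthetic data, added for this review) computing the overall F statistic and its p-value via the F distribution's survival function.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

b = np.linalg.solve(X.T @ X, X.T @ Y)
SSE = np.sum((Y - X @ b) ** 2)
SSTO = np.sum((Y - Y.mean()) ** 2)
SSR = SSTO - SSE

F = (SSR / (p - 1)) / (SSE / (n - p))   # F* = MSR / MSE
p_value = stats.f.sf(F, p - 1, n - p)   # P(F(p-1, n-p) > F*)
print(F, p_value)
```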

R² and Adjusted R²

The coefficient of multiple determination R² is defined as
$$R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}, \qquad 0 \le R^2 \le 1$$
R² never decreases when more variables are added. Therefore, the adjusted R²:
$$R_a^2 = 1 - \frac{SSE/(n-p)}{SSTO/(n-1)} = 1 - \frac{n-1}{n-p}\cdot\frac{SSE}{SSTO}$$
$R_a^2$ may decrease when p is large.
Coefficient of multiple correlation: $R = \sqrt{R^2}$, always the positive square root!
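For concreteness, a short sketch (not from the slides; synthetic data) computing both quantities from SSE and SSTO.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

b = np.linalg.solve(X.T @ X, X.T @ Y)
SSE = np.sum((Y - X @ b) ** 2)
SSTO = np.sum((Y - Y.mean()) ** 2)

R2 = 1 - SSE / SSTO
R2_adj = 1 - (n - 1) / (n - p) * SSE / SSTO   # penalizes extra parameters
print(R2, R2_adj)
```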


Extra Sums of Squares

Definition: the marginal decrease in SSE when one or several predictor variables are added to the regression model, given that the other variables are already in the model. Examples:
$$SSR(X_1 \mid X_2) = SSE(X_2) - SSE(X_1, X_2) = SSR(X_1, X_2) - SSR(X_2)$$
$$SSR(X_3 \mid X_1, X_2) = SSE(X_1, X_2) - SSE(X_1, X_2, X_3) = SSR(X_1, X_2, X_3) - SSR(X_1, X_2)$$
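A minimal sketch (added here, with a made-up two-predictor data set) computing SSR(X1 | X2) as the difference of two SSE values, exactly as in the definition above.

```python
import numpy as np

def sse(X, Y):
    """SSE from an OLS fit of Y on X (X includes the intercept column)."""
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    e = Y - X @ b
    return e @ e

rng = np.random.default_rng(0)
n = 80
X1, X2 = rng.normal(size=n), rng.normal(size=n)
Y = 1 + 2 * X1 + 0.5 * X2 + rng.normal(scale=0.4, size=n)
ones = np.ones(n)

sse_2 = sse(np.column_stack([ones, X2]), Y)        # SSE(X2)
sse_12 = sse(np.column_stack([ones, X1, X2]), Y)   # SSE(X1, X2)
print("SSR(X1|X2) =", sse_2 - sse_12)              # extra sum of squares
```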

ANOVA Table

Various software packages can provide extra sums of squares for regression analysis. These are usually given in the order in which the input variables are provided to the system.
[Figure: ANOVA table with sequential extra sums of squares]

Summary of Tests Concerning Regression Coefficients

Test whether all $\beta_k = 0$.
Test whether a single $\beta_k = 0$.
Test whether some $\beta_k = 0$.
Tests involving relationships among coefficients, for example:
$$H_0: \beta_1 = \beta_2 \quad \text{vs.} \quad H_a: \beta_1 \neq \beta_2$$
$$H_0: \beta_1 = 3,\ \beta_2 = 5 \quad \text{vs.} \quad H_a: \text{otherwise}$$
Key point in all tests: form the full model and the reduced model.

Coefficients of Partial Determination

Recall the coefficient of determination: R² measures the proportionate reduction in the variation of Y achieved by introducing the entire set of X variables.
Partial determination: measures the marginal contribution of one X variable when all the others are already in the model.

Two Predictor Variables

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$$
The coefficient of partial determination between Y and X₁, given X₂ in the model, is denoted $R^2_{Y1|2}$:
$$R^2_{Y1|2} = \frac{SSE(X_2) - SSE(X_1, X_2)}{SSE(X_2)} = \frac{SSR(X_1 \mid X_2)}{SSE(X_2)}$$
Likewise:
$$R^2_{Y2|1} = \frac{SSE(X_1) - SSE(X_1, X_2)}{SSE(X_1)} = \frac{SSR(X_2 \mid X_1)}{SSE(X_1)}$$

General Case

$$R^2_{Y1|23} = \frac{SSR(X_1 \mid X_2, X_3)}{SSE(X_2, X_3)}, \qquad
R^2_{Y4|123} = \frac{SSR(X_4 \mid X_1, X_2, X_3)}{SSE(X_1, X_2, X_3)}$$

Coefficients of Partial Correlation

A coefficient of partial correlation is the square root of the corresponding coefficient of partial determination, taking the same sign as the regression coefficient!


Standardized Multiple Regression

Transformed variables:
$$Y_i^* = \frac{1}{\sqrt{n-1}}\left(\frac{Y_i - \bar{Y}}{s_y}\right), \qquad
X_{ik}^* = \frac{1}{\sqrt{n-1}}\left(\frac{X_{ik} - \bar{X}_k}{s_k}\right), \quad k = 1, \ldots, p-1$$

Standardized Regression Model

The regression model using the transformed variables:
$$Y_i^* = \beta_1^* X_{i1}^* + \cdots + \beta_{p-1}^* X_{i,p-1}^* + \varepsilon_i^*$$
Notice that there is no need for an intercept; the model reduces to a standard linear regression problem.

Standardized Regression Model

The solution $\mathbf{b}^* = (b_1^*, b_2^*, \ldots, b_{p-1}^*)'$ can be related to the solution of the untransformed regression problem through
$$b_k = \left(\frac{s_y}{s_k}\right) b_k^*, \quad k = 1, \ldots, p-1$$
$$b_0 = \bar{Y} - b_1\bar{X}_1 - \cdots - b_{p-1}\bar{X}_{p-1}$$
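A round-trip sketch (not part of the slides; synthetic data) applying the correlation transformation, fitting without an intercept, and recovering the original-scale coefficients via the relations above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
X = rng.normal(size=(n, 2))
Y = 1 + 2 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

# Correlation transformation
sy, sk = Y.std(ddof=1), X.std(axis=0, ddof=1)
Ys = (Y - Y.mean()) / (np.sqrt(n - 1) * sy)
Xs = (X - X.mean(axis=0)) / (np.sqrt(n - 1) * sk)

# No-intercept fit on the transformed variables
b_star = np.linalg.lstsq(Xs, Ys, rcond=None)[0]

# Back-transform: b_k = (s_y / s_k) b*_k, and b_0 from the means
b = (sy / sk) * b_star
b0 = Y.mean() - b @ X.mean(axis=0)
print(b0, b)    # approximately 1, [2, -0.5]
```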

Multicollinearity

Usually we still have a good fit to the data, and prediction can remain good. However:
The estimated regression coefficients tend to have large sampling variability when the predictor variables are highly correlated.
Some regression coefficients may not be statistically significant even though a definite statistical relation exists.
The common interpretation of a regression coefficient is NOT fully applicable any more.
Example: regress Y on both X₁ and X₂. It is possible that when individual t-tests are performed, neither β₁ nor β₂ is significant, yet the F-test for β₁ and β₂ jointly is still significant.


One Predictor Variable: Second Order

$$Y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \varepsilon_i, \qquad x_i = X_i - \bar{X}$$
X is centered because of the possible high correlation between X and X².
Regression function: $E\{Y\} = \beta_0 + \beta_1 x + \beta_{11} x^2$, a quadratic response function.
β₀ is the mean response when x = 0, i.e., X = X̄. β₁ is called the linear effect; β₁₁ is called the quadratic effect.

One Predictor Variable: Third Order

$$Y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \beta_{111} x_i^3 + \varepsilon_i, \qquad x_i = X_i - \bar{X}$$

One Predictor Variable: Higher Orders

Employed with special caution: such models tend to overfit and predict poorly.

Two Predictors: Second Order

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_{11} x_{i1}^2 + \beta_{22} x_{i2}^2 + \beta_{12} x_{i1} x_{i2} + \varepsilon_i$$
where $x_{i1} = X_{i1} - \bar{X}_1$ and $x_{i2} = X_{i2} - \bar{X}_2$. The coefficient β₁₂ is called the interaction effect coefficient; more on interaction later. The three-predictor second-order model is similar.

Implementation of Polynomial Regression Models

Fitting: easy, since a polynomial model can be seen as a multiple regression model, so ordinary least squares applies.
Determining the order: a very important step! For
$$Y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \beta_{111} x_i^3 + \varepsilon_i$$
we naturally want to test whether or not β₁₁₁ = 0, or whether or not both β₁₁ = 0 and β₁₁₁ = 0. How to do the test?

Extra Sum of Squares

Decompose SSR into SSR(x), SSR(x² | x) and SSR(x³ | x, x²).
Test whether β₁₁₁ = 0: use SSR(x³ | x, x²).
Test whether both β₁₁ = 0 and β₁₁₁ = 0: use SSR(x², x³ | x).

Interpretation of Regression Models with Interactions

$$E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$$
The change in mean response with a unit increase in X₁ when X₂ is held constant is β₁ + β₃X₂. Similarly, for a unit increase in X₂ with X₁ held constant: β₂ + β₃X₁.

Implementation of Interaction Regression Models

Center the predictor variables to avoid high multicollinearity: $x_{ik} = X_{ik} - \bar{X}_k$.
Use prior knowledge to reduce the number of interactions: with 8 predictors there are already 28 pairwise interaction terms; for p predictors the number is p(p−1)/2.

Qualitative Predictors

Examples: gender (male or female), purchase status (yes or no), disability status (not disabled, partly disabled, fully disabled).
A qualitative variable with c classes is represented by c − 1 indicator variables, each taking the values 0 and 1.


Six Criteria

$R_p^2$, $R_{a,p}^2$, $C_p$, $AIC_p$, $BIC_p$ ($SBC_p$), $PRESS_p$.
Denote the total number of X variables by P − 1, so there are P parameters in total; here $1 \le p \le P$.
Coefficient of multiple determination:
$$R_p^2 = 1 - \frac{SSE_p}{SSTO}$$
Adjusted coefficient of multiple determination:
$$R_{a,p}^2 = 1 - \frac{n-1}{n-p}\cdot\frac{SSE_p}{SSTO} = 1 - \frac{MSE_p}{SSTO/(n-1)}$$

Mallows' $C_p$ Criterion

Concerned with the total mean squared error of the fitted values. Writing
$$(\hat{Y}_i - \mu_i)^2 = \big[(E\{\hat{Y}_i\} - \mu_i) + (\hat{Y}_i - E\{\hat{Y}_i\})\big]^2$$
we have
$$E(\hat{Y}_i - \mu_i)^2 = (E\{\hat{Y}_i\} - \mu_i)^2 + \sigma^2\{\hat{Y}_i\}$$
Criterion measure:
$$\Gamma_p = \frac{1}{\sigma^2}\left[\sum_{i=1}^n \big(E\{\hat{Y}_i\} - \mu_i\big)^2 + \sum_{i=1}^n \sigma^2\{\hat{Y}_i\}\right]$$
A good estimator of $\Gamma_p$ is
$$C_p = \frac{SSE_p}{MSE(X_1, \ldots, X_{P-1})} - (n - 2p)$$
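A hedged sketch (synthetic data, added for this review) of the $C_p$ estimator: the full model supplies the MSE in the denominator, and a submodel containing all the truly active predictors should give $C_p$ near p.

```python
import numpy as np

def sse(X, Y):
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    return np.sum((Y - X @ b) ** 2)

rng = np.random.default_rng(0)
n = 80
Xfull = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])  # P - 1 = 4 predictors
Y = Xfull @ np.array([1, 2, -1, 0, 0]) + rng.normal(scale=0.5, size=n)

P = Xfull.shape[1]
mse_full = sse(Xfull, Y) / (n - P)     # MSE(X1, ..., X_{P-1})

cols = [0, 1, 2]                       # intercept + X1 + X2, so p = 3
p = len(cols)
Cp = sse(Xfull[:, cols], Y) / mse_full - (n - 2 * p)
print("C_p =", Cp)                     # close to p when the submodel is unbiased
```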

AIC and BIC

The Akaike information criterion (AIC) and the Bayesian information criterion (BIC, also called the Schwarz criterion, SBC in the book) are two criteria that penalize model complexity. In the linear regression setting,
$$AIC_p = n\log SSE_p - n\log n + 2p$$
$$BIC_p = n\log SSE_p - n\log n + (\log n)\,p$$
Roughly, you can think of these two criteria as penalizing models with many parameters (p, in the case of linear regression).
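A direct transcription of the two formulas into a helper (an editor's sketch; the two compared submodels are synthetic).

```python
import numpy as np

def aic_bic(X, Y):
    """AIC_p and BIC_p for an OLS fit; X includes the intercept column."""
    n, p = X.shape
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    sse = np.sum((Y - X @ b) ** 2)
    aic = n * np.log(sse) - n * np.log(n) + 2 * p
    bic = n * np.log(sse) - n * np.log(n) + np.log(n) * p
    return aic, bic

rng = np.random.default_rng(0)
n = 80
X1, X2 = rng.normal(size=n), rng.normal(size=n)
Y = 1 + 2 * X1 + rng.normal(scale=0.4, size=n)   # X2 is irrelevant
ones = np.ones(n)

print(aic_bic(np.column_stack([ones, X1]), Y))       # smaller is better
print(aic_bic(np.column_stack([ones, X1, X2]), Y))   # extra parameter penalized
```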

$PRESS_p$, or Leave-One-Out Cross-Validation

The $PRESS_p$ (prediction sum of squares) criterion measures how well a subset model can predict the observed responses $Y_i$. Let $\hat{Y}_{i(i)}$ be the fitted value for case i from a model fit with case i left out. The $PRESS_p$ criterion is then given by summing over all n cases:
$$PRESS_p = \sum_{i=1}^n \big(Y_i - \hat{Y}_{i(i)}\big)^2$$
$PRESS_p$ values can be calculated without doing n separate regression runs.

$PRESS_p$, or Leave-One-Out Cross-Validation

If we let $d_i$ be the deleted residual for the i-th case, then
$$d_i = Y_i - \hat{Y}_{i(i)} = \frac{e_i}{1 - h_{ii}}$$
where $e_i$ is the ordinary residual for the i-th case and $h_{ii}$ is the i-th diagonal element of the hat matrix, obtained directly from
$$h_{ii} = \mathbf{X}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_i$$
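A sketch (not from the slides; synthetic data) showing the one-fit shortcut and confirming it against n brute-force leave-one-out fits.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.4, size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
e = Y - H @ Y
press = np.sum((e / (1 - np.diag(H))) ** 2)   # PRESS without refitting
print(press)

# Brute-force check: n separate leave-one-out fits give the same value
loo = 0.0
for i in range(n):
    mask = np.arange(n) != i
    bi = np.linalg.lstsq(X[mask], Y[mask], rcond=None)[0]
    loo += (Y[i] - X[i] @ bi) ** 2
print(np.isclose(press, loo))                 # True
```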

Stepwise Regression Methods

An automatic search procedure that identifies a single "best" model; it comes in several different formats.

Forward Stepwise Regression

A (greedy) procedure for identifying variables to include in the regression model. Repeat until finished:
1 Fit a simple linear regression model for each of the P − 1 X variables considered for inclusion. For each, compute the t* statistic for testing whether or not the slope is zero: $t_k^* = b_k / s\{b_k\}$. (Remember $b_k$ is the estimate of $\beta_k$ and $s\{b_k\}$ is its estimated standard deviation.)
2 Pick the largest of the P − 1 $t_k^*$'s and include the corresponding X variable in the regression model if $t_k^*$ exceeds some significance level.
3 If the number of X variables included in the regression model is greater than one, check whether the model would be improved by dropping variables (using the t-test and a threshold again).

Forward Stepwise Regression (cont.)

Other criteria can be used to decide which variables to add or delete, such as the F-test (full model vs. reduced model), AIC (the default option in R), BIC, or $C_p$.
Usually much more efficient than best subset regression.

Forward Regression

A simplified version of forward stepwise regression with no deletion step: once a variable is in, it stays in from then on.
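A minimal sketch of forward regression (no deletion step), added for this review, using AIC as the add criterion rather than the t-test; the data and the set of candidate predictors are synthetic.

```python
import numpy as np

def aic(X, Y):
    """AIC_p = n log(SSE_p) - n log(n) + 2p for an OLS fit."""
    n, p = X.shape
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    sse = np.sum((Y - X @ b) ** 2)
    return n * np.log(sse) - n * np.log(n) + 2 * p

rng = np.random.default_rng(0)
n, k = 80, 5
Xs = rng.normal(size=(n, k))
Y = 1 + 2 * Xs[:, 0] - 1.5 * Xs[:, 2] + rng.normal(scale=0.5, size=n)

ones = np.ones((n, 1))
selected, remaining = [], list(range(k))
best = aic(ones, Y)                   # start from the intercept-only model
improved = True
while improved and remaining:
    improved = False
    # AIC of each candidate model that adds one more variable
    scores = {j: aic(np.column_stack([ones, Xs[:, selected + [j]]]), Y)
              for j in remaining}
    j_star = min(scores, key=scores.get)
    if scores[j_star] < best:         # add only if AIC improves; never delete
        best = scores[j_star]
        selected.append(j_star)
        remaining.remove(j_star)
        improved = True
print("selected columns:", selected)  # expect [0, 2]
```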

Backward Elimination

Start from the full model with P − 1 variables and iteratively check whether any variable should be deleted from the model, by some given criterion. This time, there is no addition step!


Unequal Error Variance

$$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i$$
Here the $\varepsilon_i$ are independent $N(0, \sigma_i^2)$. (Originally: the $\varepsilon_i$ are independent $N(0, \sigma^2)$.) In matrix form:
$$\sigma^2\{\boldsymbol{\varepsilon}\} = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{bmatrix}$$

Known Error Variance

Define weights $w_i = 1/\sigma_i^2$ and denote
$$\mathbf{W} = \begin{bmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_n \end{bmatrix}$$
The weighted least squares (and maximum likelihood) estimator is
$$\mathbf{b}_w = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{Y}$$
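A weighted least squares sketch (added here; the error standard deviations are synthetic and taken as known, matching this slide's assumption).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
X = np.column_stack([np.ones(n), rng.uniform(1, 5, size=n)])
sigma_i = 0.2 * X[:, 1]                 # known, unequal error SDs
Y = X @ np.array([1.0, 2.0]) + rng.normal(scale=sigma_i)

W = np.diag(1 / sigma_i**2)             # weights w_i = 1 / sigma_i^2
b_w = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
print(b_w)                              # close to [1, 2]
```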

Error Variance Known up to a Proportionality Constant

Using $w_i = k \cdot \tfrac{1}{\sigma_i^2}$ for any constant k > 0 yields the same estimator.

Unknown Error Variances

In reality, one rarely knows the variances $\sigma_i^2$. Two approaches:
Estimation of the variance function or standard deviation function.
Use of replicates or near replicates.

Estimation of the Variance Function or Standard Deviation Function

Four steps (which can be iterated several times to reach convergence):
1 Fit the regression model by unweighted least squares and analyze the residuals.
2 Estimate the variance function or the standard deviation function by regressing either the squared residuals or the absolute residuals on the appropriate predictor(s). (We know that the variance of $\varepsilon_i$ is $\sigma_i^2 = E(\varepsilon_i^2) - (E(\varepsilon_i))^2 = E(\varepsilon_i^2)$, hence the squared residual $e_i^2$ is an estimator of $\sigma_i^2$.)
3 Use the fitted values from the estimated variance or standard deviation function to obtain the weights $w_i$.
4 Estimate the regression coefficients using these weights.
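A sketch of these four steps (not from the slides): for the toy data below the variance is assumed roughly linear in the predictor, so step 2 regresses $e_i^2$ on X; that linear variance form is an assumption made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(1, 5, size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.2 * X[:, 1])  # SD grows with X

b = np.linalg.lstsq(X, Y, rcond=None)[0]           # step 1: unweighted OLS
for _ in range(3):                                 # iterate steps 2-4
    e2 = (Y - X @ b) ** 2
    g = np.linalg.lstsq(X, e2, rcond=None)[0]      # step 2: regress e_i^2 on X
    var_hat = np.clip(X @ g, 1e-8, None)           # fitted variance function
    W = np.diag(1 / var_hat)                       # step 3: weights w_i
    b = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)  # step 4: WLS estimate
print(b)
```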

Ridge Estimators (Multicollinearity)

OLS: $(\mathbf{X}'\mathbf{X})\mathbf{b} = \mathbf{X}'\mathbf{Y}$. Transformed by the correlation transformation: $\mathbf{r}_{XX}\mathbf{b} = \mathbf{r}_{YX}$.
Ridge estimator, for a constant $c \ge 0$:
$$(\mathbf{r}_{XX} + c\mathbf{I})\,\mathbf{b}^R = \mathbf{r}_{YX}$$
With c = 0 this is OLS; for c > 0 the estimator is biased, but much more stable.
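To close, a ridge sketch (added for this review) on deliberately near-collinear synthetic data, using the correlation transformation from the standardized-regression slides; the grid of c values is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
X1 = rng.normal(size=n)
X2 = X1 + rng.normal(scale=0.05, size=n)   # nearly collinear with X1
X = np.column_stack([X1, X2])
Y = 1 + 2 * X1 + 2 * X2 + rng.normal(scale=0.3, size=n)

# Correlation transformation (see the standardized-regression slides)
Xs = (X - X.mean(axis=0)) / (np.sqrt(n - 1) * X.std(axis=0, ddof=1))
Ys = (Y - Y.mean()) / (np.sqrt(n - 1) * Y.std(ddof=1))

r_XX = Xs.T @ Xs                            # correlation matrix of the X's
r_YX = Xs.T @ Ys

for c in [0.0, 0.01, 0.1]:                  # c = 0 is OLS; c > 0 stabilizes
    b_R = np.linalg.solve(r_XX + c * np.eye(2), r_YX)
    print(c, b_R)
```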