Multiple Regression: Chapter 13 July 24, 2015
Multiple Regression (MR)
Response variable: Y - only one response variable (quantitative)
Several predictor variables: X_1, X_2, X_3, ..., X_p (p = # predictors)
Note: the predictors can be quantitative, categorical, quadratic terms, or interaction terms.
Concentrate on:
- Reading computer output
- Interpreting the coefficient of each predictor
- Deciding which terms to test, and in what order, to pick the simplest model that does a good job of predicting Y
The Basic MR Model: Y = α + β_1 X_1 + β_2 X_2 + ... + β_p X_p + ɛ (predictors: X_1, X_2, ..., X_p; # predictors: p)
Assumptions: ɛ iid N(0, σ)
Parameters: coefficients β_1, β_2, ..., β_p; constant α
Reading the computer output:
1. Fitted equation: ŷ = a + b_1 X_1 + b_2 X_2 + ... + b_p X_p
2. ANOVA test:
H_0: β_1 = β_2 = ... = β_p = 0 (nothing good in model)
H_a: at least one β_i ≠ 0 (something good)
Test statistic: F = MSR / MSE

ANOVA table for regression:
Source      df         SS     MS    F         P-value
Regression  p          SSReg  MSR   MSR/MSE   (from F table with df_num = p, df_denom = n - p - 1)
Error       n - p - 1  SSE    MSE
Total       n - 1      SST
3. t test for individual predictors:
H_0: β_i = 0 vs H_a: β_i ≠ 0
Test statistic: t = (b_i - 0) / (standard error of b_i)
p-value computed from the t table with df = n - p - 1 (error df)
Interpretation: if the p-value is small, reject H_0 and conclude that X_i is a GOOD predictor of Y (X_i provides significant information about Y) AFTER all other predictors in the model are accounted for.
Important Issues in Multiple Regression
Don't just add predictors to the model - think!
For p = n we have an oversaturated model with R² = 100% (not useful for predicting in the larger population, only for this particular dataset).
Adjusted R² increases only if the new predictor added to the model is good, whereas R² goes up or stays the same even if the new predictor is bad.
Remember to look at the p-value for each predictor.
Multicollinearity: when several predictors are correlated with each other, the ANOVA p-value may be small even though all the individual t-test p-values are large. Correlated predictors give overlapping or redundant information. (Don't throw out all the correlated predictors at once - remove them from the model one at a time.)
The sample size should be at least 5 to 20 times the number of predictors.
Example: The following dataset contains Blood Alcohol Content (BAC) and the Number of Beers consumed (NOB), along with two more variables, Weight and Sex. We fit different regression models and compare the output.

BAC    NOB  Weight  Sex  M_1  F_1
0.100  5    132     f    0    1
0.030  2    128     f    0    1
0.190  9    110     f    0    1
0.120  8    192     m    1    0
0.040  3    172     m    1    0
0.095  7    250     f    0    1
0.070  3    125     f    0    1
0.060  5    175     m    1    0
0.020  3    175     f    0    1
0.050  5    275     m    1    0
0.070  4    130     f    0    1
0.100  6    168     m    1    0
0.085  5    128     f    0    1
0.090  7    246     m    1    0
0.010  1    164     m    1    0
0.050  4    175     m    1    0
Regression Analysis: BAC versus NOB

The regression equation is
BAC = - 0.0127 + 0.0180 NOB

Predictor     Coef       SE Coef    T      P
Constant    -0.01270     0.01264   -1.00   0.332
NOB          0.017964    0.002402   7.48   0.000

S = 0.0204410   R-Sq = 80.0%   R-Sq(adj) = 78.6%

Analysis of Variance
Source          DF   SS        MS        F       P
Regression       1   0.023375  0.023375  55.94   0.000
Residual Error  14   0.005850  0.000418
Total           15   0.029225
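As a sanity check on how the pieces of this output fit together, the printed statistics can be recomputed from the sums of squares and the coefficient table alone. A quick sketch in Python (numbers copied from the output above; n = 16, p = 1):

```python
# Sums of squares and coefficient info from the BAC versus NOB output
n, p = 16, 1
ss_reg, ss_error, ss_total = 0.023375, 0.005850, 0.029225
b_nob, se_nob = 0.017964, 0.002402

ms_reg = ss_reg / p                      # MSR = SSReg / df_regression
ms_error = ss_error / (n - p - 1)        # MSE = SSE / df_error
f_stat = ms_reg / ms_error               # F = MSR / MSE
t_stat = b_nob / se_nob                  # t = coef / SE(coef)
r_sq = ss_reg / ss_total                 # R-Sq
adj_r_sq = 1 - (ss_error / (n - p - 1)) / (ss_total / (n - 1))  # R-Sq(adj)

print(round(f_stat, 2), round(t_stat, 2), round(r_sq, 3), round(adj_r_sq, 3))
```

Each value matches the output: F = 55.94, t = 7.48, R² = 80.0%, adjusted R² = 78.6%. Note also that with one predictor, F = t² up to rounding.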
Regression Analysis: BAC versus NOB, M_1

The regression equation is
BAC = - 0.0035 + 0.0181 NOB - 0.0198 M_1

Predictor     Coef       SE Coef    T      P
Constant    -0.00348     0.01200   -0.29   0.777
NOB          0.018100    0.002135   8.48   0.000
M_1         -0.019763    0.009086  -2.18   0.049

S = 0.0181633   R-Sq = 85.3%   R-Sq(adj) = 83.1%

Analysis of Variance
Source          DF   SS        MS        F       P
Regression       2   0.024936  0.012468  37.79   0.000
Residual Error  13   0.004289  0.000330
Total           15   0.029225
Regression with Dummy Variables:
Dummy variable: a categorical variable coded as 0 or 1.
Example: Let X_2 = Gender = 0 if female, 1 if male (the baseline group has zero for the dummy variable).
Model (no interaction): Y = α + β_1 X_1 + β_2 X_2 + ɛ
Note: This model gives two lines - one for females and one for males - with the same slope but different intercepts.
F (X_2 = 0): Y = α + β_1 X_1 + ɛ
M (X_2 = 1): Y = (α + β_2) + β_1 X_1 + ɛ
Interpret Coefficients:
α    y-intercept for baseline group (F)
β_1  slope for both groups
β_2  change in intercept for males compared to females
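To see the two parallel lines concretely, the fitted equation from the BAC versus NOB, M_1 output above can be evaluated for each group. A small sketch (coefficients copied from that output):

```python
# Fitted equation with a dummy: BAC-hat = a + b1*NOB + b2*M_1
a, b1, b2 = -0.0035, 0.0181, -0.0198

def bac_hat(nob, male):
    # male is the dummy M_1: 0 for females (baseline), 1 for males
    return a + b1 * nob + b2 * male

# Same slope b1 for both groups; the intercepts differ by b2
female_5 = bac_hat(5, male=0)   # a + 5*b1
male_5 = bac_hat(5, male=1)     # (a + b2) + 5*b1
print(round(female_5, 4), round(male_5, 4), round(male_5 - female_5, 4))
```

At NOB = 5 the predicted BAC is 0.087 for a female and 0.0672 for a male; the gap between the two lines is b2 = -0.0198 at every NOB.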
Regression Analysis: BAC versus NOB, Weight

The regression equation is
BAC = 0.0399 + 0.0200 NOB - 0.000363 Weight

Predictor     Coef          SE Coef      T      P
Constant      0.03986       0.01043      3.82   0.002
NOB           0.019976      0.001263    15.82   0.000
Weight       -0.00036282    0.00005668  -6.40   0.000

S = 0.0104104   R-Sq = 95.2%   R-Sq(adj) = 94.4%

Analysis of Variance
Source          DF   SS        MS        F        P
Regression       2   0.027816  0.013908  128.33   0.000
Residual Error  13   0.001409  0.000108
Total           15   0.029225
Regression Analysis: BAC versus NOB, Weight, M_1

The regression equation is
BAC = 0.0387 + 0.0199 NOB - 0.000344 Weight - 0.00324 M_1

Predictor     Coef          SE Coef      T      P
Constant      0.03871       0.01097      3.53   0.004
NOB           0.019896      0.001309    15.20   0.000
Weight       -0.00034440    0.00006842  -5.03   0.000
M_1          -0.003240      0.006286    -0.52   0.616

S = 0.0107174   R-Sq = 95.3%   R-Sq(adj) = 94.1%

Analysis of Variance
Source          DF   SS         MS         F       P
Regression       3   0.0278466  0.0092822  80.81   0.000
Residual Error  12   0.0013784  0.0001149
Total           15   0.0292250
Question: What if gender is coded the other way?

Regression Analysis: BAC versus NOB, Weight, F_1

The regression equation is
BAC = 0.0355 + 0.0199 NOB - 0.000344 Weight + 0.00324 F_1

Predictor     Coef          SE Coef      T      P
Constant      0.03547       0.01371      2.59   0.024
NOB           0.019896      0.001309    15.20   0.000
Weight       -0.00034440    0.00006842  -5.03   0.000
F_1           0.003240      0.006286     0.52   0.616

S = 0.0107174   R-Sq = 95.3%   R-Sq(adj) = 94.1%

Analysis of Variance
Source          DF   SS         MS         F       P
Regression       3   0.0278466  0.0092822  80.81   0.000
Residual Error  12   0.0013784  0.0001149
Total           15   0.0292250
Interaction Model (with dummy): Y = α + β_1 X_1 + β_2 X_2 + β_3 X_1 X_2 + ɛ
Note: This model gives two lines - one for females and one for males - with different slopes and different intercepts.
F (X_2 = 0): Y = α + β_1 X_1 + ɛ
M (X_2 = 1): Y = (α + β_2) + (β_1 + β_3) X_1 + ɛ
Interpret Coefficients:
α    y-intercept for baseline group (F)
β_1  slope for baseline group (F)
β_2  change in intercept for males compared to females
β_3  change in slope for males compared to females
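The algebra above can be verified numerically: with a dummy X_2, the interaction model collapses into two separate lines whose slopes differ by exactly β_3. A sketch with made-up coefficient values (placeholders, not fitted output):

```python
# Interaction model: y-hat = alpha + b1*X1 + b2*X2 + b3*X1*X2
# Hypothetical coefficient values, for illustration only:
alpha, b1, b2, b3 = 0.040, 0.020, -0.020, 0.005

def y_hat(x1, x2):
    return alpha + b1 * x1 + b2 * x2 + b3 * x1 * x2

# Baseline group (X2 = 0): intercept alpha,      slope b1
# Other group    (X2 = 1): intercept alpha + b2, slope b1 + b3
slope_f = y_hat(1, 0) - y_hat(0, 0)
slope_m = y_hat(1, 1) - y_hat(0, 1)
print(round(slope_f, 3), round(slope_m, 3), round(slope_m - slope_f, 3))
```

The difference in slopes comes out to b3 = 0.005, matching the interpretation table: β_3 is the change in slope for males compared to females.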
Regression Analysis: BAC versus NOB, Weight, M_1, Weight*M_1

The regression equation is
BAC = 0.0460 + 0.0198 NOB - 0.000390 Weight - 0.0215 M_1 + 0.000104 Weight*M_1

Predictor     Coef          SE Coef      T      P
Constant      0.04604       0.01467      3.14   0.009
NOB           0.019762      0.001343    14.71   0.000
Weight       -0.00038990    0.00009130  -4.27   0.001
M_1          -0.02148       0.02453     -0.88   0.400
Weight*M_1    0.0001045     0.0001357    0.77   0.457

S = 0.0109039   R-Sq = 95.5%   R-Sq(adj) = 93.9%

Analysis of Variance
Source          DF   SS         MS         F       P
Regression       4   0.0279172  0.0069793  58.70   0.000
Residual Error  11   0.0013078  0.0001189
Total           15   0.0292250
What if we had 3 groups? Suppose we want to predict BAC from NOB and Race: white, black, hispanic.
We need 2 dummy variables for 3 categories.
Let X_2 = 1 if black, 0 otherwise; X_3 = 1 if hispanic, 0 otherwise.
Note: Race = white is the baseline - zero for both dummy variables.
No-Interaction Model with 2 Dummies: Y = α + β_1 X_1 + β_2 X_2 + β_3 X_3 + ɛ
which gives the following 3 equations:
X_2 = 0, X_3 = 0 (W): Y = α + β_1 X_1 + ɛ
X_2 = 1, X_3 = 0 (B): Y = (α + β_2) + β_1 X_1 + ɛ
X_2 = 0, X_3 = 1 (H): Y = (α + β_3) + β_1 X_1 + ɛ
Interpret Coefficients:
α    intercept for baseline group (W)
β_1  slope for all 3 groups
β_2  change in intercept for blacks compared to whites
β_3  change in intercept for hispanics compared to whites
Interaction Model: add interactions between the quantitative variable (X_1) and the dummy variables (X_2, X_3):
Y = α + β_1 X_1 + β_2 X_2 + β_3 X_3 + β_4 X_1 X_2 + β_5 X_1 X_3 + ɛ
which gives the following 3 equations:
X_2 = 0, X_3 = 0 (W): Y = α + β_1 X_1 + ɛ
X_2 = 1, X_3 = 0 (B): Y = (α + β_2) + (β_1 + β_4) X_1 + ɛ
X_2 = 0, X_3 = 1 (H): Y = (α + β_3) + (β_1 + β_5) X_1 + ɛ
Interpret Coefficients:
α    intercept for baseline group (W)
β_1  slope for W
β_2  change in intercept for B compared to W
β_3  change in intercept for H compared to W
β_4  change in slope for B compared to W
β_5  change in slope for H compared to W
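The three-group coding and the group-specific lines can be sketched in a few lines of code. Both the `dummies` helper and the coefficient values below are hypothetical, for illustration only:

```python
# Two dummies encode three race groups; the baseline (white) is (0, 0).
def dummies(race):
    return {"white": (0, 0), "black": (1, 0), "hispanic": (0, 1)}[race]

# Interaction-model coefficients (placeholder values, not fitted output):
alpha, b1, b2, b3, b4, b5 = 0.01, 0.018, 0.004, -0.003, 0.002, -0.001

def y_hat(x1, race):
    x2, x3 = dummies(race)
    return alpha + b1*x1 + b2*x2 + b3*x3 + b4*x1*x2 + b5*x1*x3

# Each group gets its own intercept and slope:
for race in ("white", "black", "hispanic"):
    intercept = y_hat(0, race)
    slope = y_hat(1, race) - intercept
    print(race, round(intercept, 3), round(slope, 3))
```

The printed lines show intercepts α, α + β_2, α + β_3 and slopes β_1, β_1 + β_4, β_1 + β_5, exactly as in the three equations above.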
In regression, if we have only one categorical predictor:
REGRESSION ⟺ ONE-WAY ANOVA
Revisit the one-way ANOVA example: Compare average weight loss for three diets.
Data: weight loss under 3 diets

low FAT  low CAL  low CARB
22       24       28
18       21       27
21       26       30
25       27       32
ANOVA results (output):

One-way ANOVA: lowfat, lowcal, lowcarb

Source  DF   SS      MS     F     P
Factor   2  122.17  61.08  9.05  0.007
Error    9   60.75   6.75
Total   11  182.92

S = 2.598   R-Sq = 66.79%   R-Sq(adj) = 59.41%

Level    N   Mean    StDev
lowfat   4   21.500  2.887
lowcal   4   24.500  2.646
lowcarb  4   29.250  2.217

Pooled StDev = 2.598
(Individual 95% CIs for the means, based on the pooled StDev, are shown graphically in the original output.)
Now, let's set up the problem as regression with dummy variables.
Y = weight loss (response)
Let X_1 = 1 if lowcal, 0 otherwise; X_2 = 1 if lowcarb, 0 otherwise.
Model: Y = α + β_1 X_1 + β_2 X_2 + ɛ
Interpret Coefficients:
α    intercept for baseline group (lowfat)
β_1  change in intercept for lowcal compared to lowfat
β_2  change in intercept for lowcarb compared to lowfat
REGRESSION results (output):

Regression Analysis: Y versus x1, x2

The regression equation is
Y = 21.5 + 3.00 x1 + 7.75 x2

Predictor   Coef     SE Coef   T       P
Constant    21.500   1.299     16.55   0.000
x1           3.000   1.837      1.63   0.137
x2           7.750   1.837      4.22   0.002

S = 2.59808   R-Sq = 66.8%   R-Sq(adj) = 59.4%

Analysis of Variance
Source          DF   SS       MS      F     P
Regression       2  122.167  61.083  9.05  0.007
Residual Error   9   60.750   6.750
Total           11  182.917
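The equivalence is easy to verify by hand: the dummy-regression coefficients are exactly the baseline group mean and the differences of group means. A quick check against the output above:

```python
# Weight-loss data from the one-way ANOVA example
lowfat  = [22, 18, 21, 25]
lowcal  = [24, 21, 26, 27]
lowcarb = [28, 27, 30, 32]

def mean(xs):
    return sum(xs) / len(xs)

# Dummy-regression coefficients from the group means:
a  = mean(lowfat)                  # intercept = baseline (lowfat) mean
b1 = mean(lowcal) - mean(lowfat)   # shift for lowcal
b2 = mean(lowcarb) - mean(lowfat)  # shift for lowcarb
print(a, b1, b2)
```

This reproduces the fitted equation Y = 21.5 + 3.00 x1 + 7.75 x2 exactly, and the ANOVA tables (F = 9.05, P = 0.007) match as well.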
More about RESIDUALS:
A plot of residuals vs fitted values exaggerates any pattern present in the data other than the linear trend. How do we judge non-constant variance in the response from a residual vs fitted plot? (example in class)
Recall: residual = y - ŷ (i.e., the linear trend has been removed).
Any pattern (or trend) still present in the residual vs fitted plot suggests that linear regression was not enough - we need to add quadratic (or other polynomial) terms to the equation (examples in class).
QUADRATIC REGRESSION
Model: Y = α + β_1 X + β_2 X² + ɛ; note that p = 2 predictors (X, X²)
Assumptions: ɛ iid N(0, σ)
Fitted equation (output): ŷ = a + b_1 X + b_2 X²
Interpreting coefficients: only interpret the coefficient of the quadratic term. Is β_2 significantly different from zero?
- If yes: keep the quadratic term; look at the sign of b_2 (it determines whether the curvature opens up or down).
- If no: drop X² and do SLR.
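A quick way to see what the sign of b_2 means: with b_2 < 0 the fitted parabola opens downward and peaks at X = -b_1 / (2 b_2). A sketch with made-up coefficients (not fitted output):

```python
# Fitted quadratic: y-hat = a + b1*x + b2*x**2
# Hypothetical coefficients, for illustration only:
a, b1, b2 = 1.0, 4.0, -0.5

def y_hat(x):
    return a + b1 * x + b2 * x ** 2

# b2 < 0: curve opens downward, with its peak at x = -b1 / (2*b2)
x_peak = -b1 / (2 * b2)
print(x_peak, y_hat(x_peak))
```

With b_2 > 0 the same formula gives the minimum instead: the curve opens upward.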
Example: Suppose we are interested in predicting the GPA of students in college (CGPA) using 16 different predictor variables. Data were collected from a random sample of 59 college students.
- What is the response variable in this problem?
- What are the values of n and p?
- What are H_0 and H_a that you can test using the ANOVA table?
- What is your decision, based on the following ANOVA table? What is your conclusion?
Regression Analysis: CGPA versus Height, Gender, ...

The regression equation is
CGPA = 0.53 + 0.0194 Height + 0.047 Gender - 0.00163 Haircut - 0.042 Job
       + 0.0004 Studytime - 0.375 Smokecig + 0.0488 Dated + 0.546 HSGPA
       + 0.00315 HomeDist + 0.00069 BrowseInternet - 0.00128 WatchTV
       - 0.0117 Exercise + 0.0140 ReadNewsP + 0.039 Vegan
       - 0.0139 PoliticalDegree - 0.0801 PoliticalAff

Predictor          Coef        SE Coef     T      P
Constant           0.532       1.496       0.36   0.724
Height             0.01942     0.01637     1.19   0.242
Gender             0.0468      0.1429      0.33   0.745
Haircut           -0.001633    0.001697   -0.96   0.341
Job               -0.0418      0.1024     -0.41   0.685
Studytime          0.00043     0.01921     0.02   0.982
Smokecig          -0.3746      0.2249     -1.67   0.103
Dated              0.04881     0.07111     0.69   0.496
HSGPA              0.5457      0.1776      3.07   0.004
HomeDist           0.003147    0.003400    0.93   0.360
BrowseInternet     0.000689    0.001163    0.59   0.557
WatchTV           -0.0012840   0.0009710  -1.32   0.193
Exercise          -0.011657    0.005934   -1.96   0.056
ReadNewsP          0.01395     0.02272     0.61   0.543
Vegan              0.0392      0.1578      0.25   0.805
PoliticalDegree   -0.01390     0.03185    -0.44   0.665
PoliticalAff      -0.08006     0.07741    -1.03   0.307
S = 0.322198   R-Sq = 43.2%   R-Sq(adj) = 21.5%

Analysis of Variance
Source          DF   SS      MS      F     P
Regression      16   3.3135  0.2071  1.99  0.037
Residual Error  42   4.3601  0.1038
Total           58   7.6736
Best Subsets Regression: CGPA versus Height, Gender, ...

Response is CGPA. The predictors are, in order: Height, Gender, Haircut, Job, Studytime, Smokecig, Dated, HSGPA, HomeDist, BrowseInternet, WatchTV, Exercise, ReadNewsP, Vegan, PoliticalDegree, PoliticalAff (printed as vertical column headings in the original output). Each X marks a predictor included in that model; the column positions of the X marks, identifying which predictors, are not preserved in this transcription.

Vars  R-Sq  R-Sq(adj)  Mallows Cp  S
1     25.5  24.2         0.1       0.31667  X
1     13.0  11.5         9.3       0.34217  X
2     31.6  29.2        -2.4       0.30613  X X
2     29.4  26.9        -0.8       0.31109  X X
3     33.8  30.2        -2.1       0.30389  X X X
3     33.7  30.0        -2.0       0.30423  X X X
4     35.7  31.0        -1.5       0.30223  X X X X
4     35.3  30.5        -1.2       0.30320  X X X X
5     37.3  31.4        -0.6       0.30132  X X X X X
5     37.0  31.1        -0.4       0.30198  X X X X X
6     38.3  31.2         0.6       0.30163  X X X X X X
6     38.3  31.2         0.6       0.30164  X X X X X X
7     39.6  31.3         1.7       0.30150  X X X X X X X
7     39.3  30.9         1.9       0.30231  X X X X X X X
8     40.4  30.8         3.1       0.30249  X X X X X X X X
8     40.4  30.8         3.1       0.30256  X X X X X X X X
9     41.5  30.8         4.2       0.30266  X X X X X X X X X
9     41.0  30.2         4.6       0.30395  X X X X X X X X X
10    41.9  29.8         6.0       0.30478  X X X X X X X X X X
10    41.8  29.7         6.0       0.30492  X X X X X X X X X X
11    42.2  28.7         7.7       0.30712  X X X X X X X X X X X
11    42.2  28.7         7.7       0.30715  X X X X X X X X X X X
12    42.6  27.6         9.4       0.30945  X X X X X X X X X X X X
12    42.6  27.6         9.5       0.30954  X X X X X X X X X X X X
13    42.9  26.4        11.2       0.31205  X X X X X X X X X X X X X
13    42.8  26.3        11.3       0.31229  X X X X X X X X X X X X X
14    43.1  25.0        13.1       0.31502  X X X X X X X X X X X X X X
14    43.0  24.9        13.1       0.31526  X X X X X X X X X X X X X X
15    43.2  23.4        15.0       0.31843  X X X X X X X X X X X X X X X
15    43.1  23.2        15.1       0.31866  X X X X X X X X X X X X X X X
16    43.2  21.5        17.0       0.32220  X X X X X X X X X X X X X X X X
Regression Analysis: CGPA versus HSGPA, Exercise

The regression equation is
CGPA = 1.55 + 0.560 HSGPA - 0.0111 Exercise

Predictor    Coef        SE Coef    T      P
Constant     1.5489      0.5551     2.79   0.007
HSGPA        0.5599      0.1436     3.90   0.000
Exercise    -0.011138    0.004985  -2.23   0.029

S = 0.306126   R-Sq = 31.6%   R-Sq(adj) = 29.2%

Analysis of Variance
Source          DF   SS      MS      F       P
Regression       2   2.4256  1.2128  12.94   0.000
Residual Error  56   5.2479  0.0937
Total           58   7.6736
LOGISTIC REGRESSION
Y = categorical response (Yes/No) or binary response (1 or 0)
Example: Predict the probability that a person pays bills on time based on past credit history, income, employment, age, etc.
Example: Predict the probability that a person gets lung cancer based on smoking, family history, asthma, age, gender, race, eating habits, exercise habits, etc.
Logistic Regression Model (with 1 predictor variable):
p = exp(α + βX) / (1 + exp(α + βX))
Example: Whether a person has a travel credit card.
X = annual income (in thousand euros), Y = 1 if yes, 0 if no.
(partial dataset)

income  y
12      0
13      0
14      1
14      0
14      0
14      1
Link Function: Logit

Response Information
Variable  Value  Count
y         1       31  (Event)
          0       69
          Total  100

Logistic Regression Table
Predictor   Coef      SE Coef    Z      P
Constant   -3.51795   0.710336  -4.95   0.000
income      0.105409  0.0261574  4.03   0.000
Interpretations: Annual income is a good predictor of the probability of having a travel credit card; the probability of having a travel credit card increases with higher annual income (because of the positive sign of the coefficient).
Prediction Equation:
p̂ = exp(-3.52 + 0.105X) / (1 + exp(-3.52 + 0.105X)), i.e. a = -3.52, b = 0.105
- Predict the probability that a person with annual income 12K (euros) has a travel credit card (answer: p̂ = 0.09).
- Predict the probability that a person with annual income 65K (euros) has a travel credit card (answer: p̂ = 0.97).
- The probability of having a travel credit card is 50% when X = -a/b = 3.52/0.105 = 33.524 (why?)
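The predictions above can be reproduced directly from the prediction equation (a sketch using the rounded coefficients a = -3.52, b = 0.105 from the text):

```python
from math import exp

# Prediction equation from the travel-credit-card output
a, b = -3.52, 0.105

def p_hat(x):
    return exp(a + b * x) / (1 + exp(a + b * x))

print(p_hat(12))   # annual income 12K euros
print(p_hat(65))   # annual income 65K euros
# At X = -a/b the exponent a + b*X is 0, so p-hat = e^0 / (1 + e^0) = 1/2,
# which answers the "why?" above.
print(-a / b)
```

The exact values differ slightly from the text's answers depending on how many decimal places of the coefficients are used.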
Multiple Logistic Regression:
Example: Predict marijuana use (Y/N) based on alcohol use (Y/N) and cigarette smoking (Y/N) for HS seniors.
Data: 2276 HS seniors in a non-urban area outside Dayton, Ohio.

Marijuana  Cigarette  Alcohol  Frequency
1          1          1        911
1          0          1         44
1          1          0          3
1          0          0          2
0          1          1        538
0          0          1        456
0          1          0         43
0          0          0        279
Binary Logistic Regression: Marijuana versus Alcohol, Cigarette

Link Function: Logit

Response Information
Variable   Value  Count
Marijuana  1       960  (Event)
           0      1316
           Total  2276
Frequency: Frequency

Logistic Regression Table
Predictor   Coef      SE Coef     Z       P
Constant   -5.30904   0.475190  -11.17    0.000
Alcohol     2.98601   0.464671    6.43    0.000
Cigarette   2.84789   0.163839   17.38    0.000
Predict the probability of using marijuana if:
Alcohol use = Yes and Cigarette smoking = Yes:
p̂ = exp(-5.30904 + 2.98601 + 2.84789) / (1 + exp(-5.30904 + 2.98601 + 2.84789)) = 0.628
Alcohol use = No and Cigarette smoking = Yes:
p̂ = exp(-5.30904 + 2.84789) / (1 + exp(-5.30904 + 2.84789)) = 0.079
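These calculations can be scripted directly (a sketch; coefficients copied from the logistic regression table above):

```python
from math import exp

# Coefficients from the marijuana-use logistic output
const, b_alcohol, b_cig = -5.30904, 2.98601, 2.84789

def p_hat(alcohol, cigarette):
    # alcohol and cigarette are 0/1 indicators
    z = const + b_alcohol * alcohol + b_cig * cigarette
    return exp(z) / (1 + exp(z))

print(round(p_hat(1, 1), 3))   # alcohol yes, cigarettes yes -> 0.628
print(round(p_hat(0, 1), 3))   # alcohol no,  cigarettes yes -> 0.079
print(p_hat(0, 0))             # neither: a very small probability
```

The same function handles the remaining combinations (e.g. alcohol yes, cigarettes no) by plugging in the corresponding 0/1 values.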