Lecture 20: Multiple linear regression


Statistics 101
Mine Çetinkaya-Rundel
April 5, 2012

Announcements

Project proposals are due Sunday at midnight:
- Response variable: numerical
- Explanatory variables: at least one numerical and at least one categorical variable
- Total of at least 3 variables; the more the merrier

You might want to make a few quick plots of your dataset to see whether the relationships between your response and explanatory variables are linear (or could be made linear using transformations). If a relationship is not linear, you may want to consider another dataset; otherwise your findings will be inconclusive.

Recap: Review question

One waiter recorded information about each tip he received over a period of a few months. Below is the regression model for predicting tip amount from total bill amount (both in $). Which of the following is correct?

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)       0.92        0.16     5.76      0.00
total_bill        0.11        0.01    14.26      0.00

(a) For each $1 increase in total bill amount, we would expect the tip to increase on average by 92 cents.
(b) The regression model is $\widehat{tip} = 0.92 + 0.11 \times total\_bill$.
(c) For a bill amount of $0 we would expect the tip amount to be 11 cents on average.
(d) The explanatory variable is tip amount and the response variable is total bill amount.
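
As a quick check of these options in R, one can rebuild the fitted line from the Estimate column and plug in a bill amount (tip_hat is a name we introduce here, not something from the slides):

tip_hat = function(total_bill) 0.92 + 0.11 * total_bill   # fitted line read off the output above
tip_hat(20)   # predicted tip for a $20 bill: 0.92 + 0.11 * 20 = 3.12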

1. Multiple regression
   - Categorical variables with two levels
   - Many variables in a model
   - Adjusted R²

Multiple regression

Simple linear regression: bivariate, i.e. two variables: y and x
Multiple linear regression: multiple variables: y and x_1, x_2, ...
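
In lm() this is just extra terms on the right-hand side of the model formula; a minimal sketch with hypothetical names y, x1, x2 in a data frame d:

m_slr = lm(y ~ x1, data = d)        # simple linear regression: one explanatory variable
m_mlr = lm(y ~ x1 + x2, data = d)   # multiple linear regression: two or more explanatory variables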

Categorical variables with two levels: GPA vs. Greek

Relationship between belonging to a Greek organization or an SLG and GPA, based on the class survey:

[Figure: side-by-side boxplots of GPA (roughly 3.0 to 4.0) for greek = no (n = 118) and greek = yes (n = 87).]

Categorical variables with two levels: GPA vs. Greek - linear model

gpa_greek = lm(gpa ~ greek, data = survey)
summary(gpa_greek)

Call:
lm(formula = gpa ~ greek, data = survey)

Residuals:
     Min       1Q   Median       3Q      Max
-1.03891 -0.23028  0.06109  0.21109  0.66109

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.53028    0.03039 116.160  < 2e-16 ***
greekyes     0.10863    0.04006   2.712  0.00726 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2835 on 203 degrees of freedom
  (13 observations deleted due to missingness)
Multiple R-squared: 0.03496, Adjusted R-squared: 0.03021
F-statistic: 7.354 on 1 and 203 DF, p-value: 0.007265

Categorical variables with two levels: GPA vs. Greek - linear model (cont.)

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)       3.53        0.03   116.16      0.00
greek:yes         0.11        0.04     2.71      0.01

- The variable greek has two levels: yes and no.
- no is the reference level, hence it is not shown in the output.
- Linear model: $\widehat{gpa} = 3.53 + 0.11 \times greek{:}yes$
- Intercept: The estimated mean GPA of students who do not belong to a Greek organization is 3.53. This is the value we get if we plug in 0 for the explanatory variable.
- Slope: The estimated mean GPA of students who belong to a Greek organization is 0.11 higher than that of those who do not. The estimated mean GPA of students who do belong to a Greek organization is therefore 3.53 + 0.11 = 3.64. This is the value we get if we plug in 1 for the explanatory variable.

Categorical variables with two levels: GPA vs. Greek - inference

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)       3.53        0.03   116.16      0.00
greek:yes         0.11        0.04     2.71      0.01

Hypotheses: $H_0: \beta_1 = 0$ vs. $H_A: \beta_1 \neq 0$

p-value = 0.01. The data provide convincing evidence that the true slope parameter is different than 0; hence there appears to be a statistically significant relationship between belonging to a Greek organization or an SLG and GPA.

Categorical variables with two levels: Another approach?

Clicker question: What other approach could we use to evaluate the relationship between belonging to a Greek organization or an SLG and GPA? Remember: 205 students reported their GPA; 87 belong to a Greek organization or an SLG and 118 do not.

(a) chi-squared test of independence
(b) chi-squared test of goodness-of-fit
(c) test for comparing means of dependent groups
(d) test for comparing means of independent groups
(e) test for comparing proportions of independent groups
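
For what it's worth, a regression of a numerical response on a single two-level categorical predictor and a pooled two-sample t test are equivalent ways of comparing the two group means; a minimal sketch, assuming the survey data frame from the earlier slides:

t.test(gpa ~ greek, data = survey, var.equal = TRUE)   # pooled t test; its t statistic matches the regression slope's t value (2.71) up to sign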

Categorical variables with two levels: Another example...

Relationship between how politically active students are (on a scale of 1 to 5) and whether they are pro-life or pro-choice:

[Figure: side-by-side boxplots of politic_active (1 to 5) for life_choice = pro-choice and life_choice = pro-life.]

Categorical variables with two levels: Identifying the reference level

Clicker question: Based on the regression output given below, what is the reference level of the variable life_choice?

                       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                2.64        0.09    30.23      0.00
life_choice:pro-life      -0.06        0.21    -0.29      0.78

(a) pro-choice
(b) pro-life
(c) neither
(d) cannot tell from the information given

Categorical variables with two levels: Identifying the reference level

Clicker question: Based on the regression output, which of the following is correct?

                       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                2.64        0.09    30.23      0.00
life_choice:pro-life      -0.06        0.21    -0.29      0.78

The estimated mean political activity level of students who are
(a) pro-choice is 0.06 higher than those who are pro-life
(b) pro-choice is 0.06 lower than those who are pro-life
(c) pro-life is 0.06 lower than those who are pro-choice
(d) pro-life is 0.06 higher than those who are pro-choice

Categorical variables with two levels: Identifying the reference level

Clicker question: Does whether a student is pro-life or pro-choice have a statistically significant relationship with their political activity level?

                       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                2.64        0.09    30.23      0.00
life_choice:pro-life      -0.06        0.21    -0.29      0.78

(a) no
(b) yes
(c) cannot tell from the information given

Many variables in a model: Weights of books

    weight (g)  volume (cm³)  cover
 1         800           885  hc
 2         950          1016  hc
 3        1050          1125  hc
 4         350           239  hc
 5         750           701  hc
 6         600           641  hc
 7        1075          1228  hc
 8         250           412  pb
 9         700           953  pb
10         650           929  pb
11         975          1492  pb
12         350           419  pb
13         950          1010  pb
14         425           595  pb
15         725          1034  pb

[Figure: a book with its dimensions labeled l, w, h.]

Many variables in a model: Weights of books (cont.)

Clicker question: The scatterplot shows the relationship between the weights and volumes of books, along with the regression output. Which of the following is correct?

[Figure: scatterplot of weight (g) vs. volume (cm³) with the fitted line weight-hat = 108 + 0.7 volume; R² = 80%.]

(a) Weights of 80% of the books can be predicted accurately using this model.
(b) Books that are 10 cm³ over average are expected to weigh 7 g over average.
(c) The correlation between weight and volume is R = 0.80² = 0.64.
(d) The model underestimates the weight of the book with the highest volume.

Many variables in a model: If you'd like to replicate the analysis in R...

install.packages("DAAG")
library(DAAG)
data(allbacks)

From: Maindonald, J.H. and Braun, W.J. (2007) Data Analysis and Graphics Using R, 2nd ed.

Many variables in a model: Modeling weights of books using volume

(somewhat abbreviated output...)

m1 = lm(weight ~ volume, data = allbacks)
summary(m1)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 107.67931   88.37758   1.218    0.245
volume        0.70864    0.09746   7.271 6.26e-06

Residual standard error: 123.9 on 13 degrees of freedom
Multiple R-squared: 0.8026, Adjusted R-squared: 0.7875
F-statistic: 52.87 on 1 and 13 DF, p-value: 6.262e-06

Many variables in a model: Weights of hardcover and paperback books

Can you identify a trend in the relationship between volume and weight of hardcover and paperback books?

[Figure: scatterplot of weight (g) vs. volume (cm³), with hardcover and paperback books marked separately.]

Many variables in a model: Modeling weights of books using volume and cover type

m2 = lm(weight ~ volume + cover, data = allbacks)
summary(m2)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  197.96284   59.19274   3.344 0.005841 **
volume         0.71795    0.06153  11.669  6.6e-08 ***
cover:pb    -184.04727   40.49420  -4.545 0.000672 ***

Residual standard error: 78.2 on 12 degrees of freedom
Multiple R-squared: 0.9275, Adjusted R-squared: 0.9154
F-statistic: 76.73 on 2 and 12 DF, p-value: 1.455e-07

Many variables in a model: Determining the reference level

Clicker question: Based on the regression output below, which level of cover is the reference level? Note that pb means paperback.

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  197.9628     59.1927     3.34    0.0058
volume         0.7180      0.0615    11.67    0.0000
cover:pb    -184.0473     40.4942    -4.55    0.0007

(a) paperback
(b) hardcover

Many variables in a model: Determining the reference level

Clicker question: Which of the following correctly describes the roles of variables in this regression model?

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  197.9628     59.1927     3.34    0.0058
volume         0.7180      0.0615    11.67    0.0000
cover:pb    -184.0473     40.4942    -4.55    0.0007

(a) response: weight; explanatory: volume, paperback cover
(b) response: weight; explanatory: volume, hardcover cover
(c) response: volume; explanatory: weight, cover type
(d) response: weight; explanatory: volume, cover type

Many variables in a model: Linear model

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    197.96       59.19     3.34      0.01
volume           0.72        0.06    11.67      0.00
cover:pb      -184.05       40.49    -4.55      0.00

$\widehat{weight} = 197.96 + 0.72 \times volume - 184.05 \times cover{:}pb$

1. For hardcover books, plug in 0 for cover:
   $\widehat{weight} = 197.96 + 0.72 \times volume - 184.05 \times 0 = 197.96 + 0.72 \times volume$
2. For paperback books, plug in 1 for cover:
   $\widehat{weight} = 197.96 + 0.72 \times volume - 184.05 \times 1 = 13.91 + 0.72 \times volume$
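
These two lines can also be rebuilt in R from the fitted model: coef() returns the estimates, and R names the paperback indicator coverpb. A minimal sketch:

b = coef(m2)                                         # (Intercept), volume, coverpb
b["(Intercept)"] + b["volume"] * 600                 # hardcover line evaluated at 600 cm^3
b["(Intercept)"] + b["coverpb"] + b["volume"] * 600  # paperback line evaluated at 600 cm^3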

Many variables in a model: Visualising the linear model

[Figure: scatterplot of weight (g) vs. volume (cm³), with separate fitted lines for hardcover and paperback books.]

Many variables in a model: Interpretation of the regression coefficients

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    197.96       59.19     3.34      0.01
volume           0.72        0.06    11.67      0.00
cover:pb      -184.05       40.49    -4.55      0.00

- Slope of volume: All else held constant, for each 1 cm³ increase in volume we would expect the weight to increase on average by 0.72 grams.
- Slope of cover: All else held constant, the model predicts that paperback books weigh 184 grams less than hardcover books.
- Intercept: Hardcover books with no volume are expected on average to weigh 198 grams. Obviously, the intercept does not make sense in context; it only serves to adjust the height of the line.

Many variables in a model: Prediction

Clicker question: Which of the following is the correct calculation for the predicted weight of a paperback book that is 600 cm³?

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    197.96       59.19     3.34      0.01
volume           0.72        0.06    11.67      0.00
cover:pb      -184.05       40.49    -4.55      0.00

(a) 197.96 + 0.72 * 600 - 184.05 * 1
(b) 184.05 + 0.72 * 600 - 197.96 * 1
(c) 197.96 + 0.72 * 600 - 184.05 * 0
(d) 197.96 + 0.72 * 1 - 184.05 * 600
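
The same calculation can be left to predict(), which handles the 0/1 coding of cover automatically:

predict(m2, newdata = data.frame(volume = 600, cover = "pb"))   # about 445 g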

Many variables in a model: Another example: Modeling kids' test scores

Predicting cognitive test scores of three- and four-year-old children using characteristics of their mothers. Data are from a survey of adult American women and their children, a subsample from the National Longitudinal Survey of Youth.

      kid_score  mom_hs  mom_iq  mom_work  mom_age
   1         65     yes  121.12       yes       27
   .          .       .       .         .        .
   5        115     yes   92.75       yes       27
   6         98      no  107.90        no       18
   .          .       .       .         .        .
 434         70     yes   91.25       yes       25

Gelman, Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. (2007) Cambridge University Press.
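
The model discussed on the next few slides can be fit in one line; a sketch assuming the data frame is named cognitive (our assumption, not shown on the slides) and uses the column names from the table above:

m_kid = lm(kid_score ~ mom_hs + mom_iq + mom_work + mom_age, data = cognitive)  # cognitive is an assumed name
summary(m_kid)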

Many variables in a model: Interpreting the slope

What is the correct interpretation of the slope for mom's IQ?

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)      19.59        9.22     2.13      0.03
mom_hs:yes        5.09        2.31     2.20      0.03
mom_iq            0.56        0.06     9.26      0.00
mom_work:yes      2.54        2.35     1.08      0.28
mom_age           0.22        0.33     0.66      0.51

Many variables in a model: Interpreting the intercept

What is the correct interpretation of the intercept?

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)      19.59        9.22     2.13      0.03
mom_hs:yes        5.09        2.31     2.20      0.03
mom_iq            0.56        0.06     9.26      0.00
mom_work:yes      2.54        2.35     1.08      0.28
mom_age           0.22        0.33     0.66      0.51

Many variables in a model: Interpreting the slope

Clicker question: What is the correct interpretation of the slope for mom_work?

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)      19.59        9.22     2.13      0.03
mom_hs:yes        5.09        2.31     2.20      0.03
mom_iq            0.56        0.06     9.26      0.00
mom_work:yes      2.54        2.35     1.08      0.28
mom_age           0.22        0.33     0.66      0.51

All else being equal, kids whose moms worked during the first three years of the kid's life
(a) are estimated to score 2.54 points lower
(b) are estimated to score 2.54 points higher
than those whose moms did not work.

Adjusted R²: Revisit: Modeling poverty

[Figure: scatterplot matrix (pairs plot) of poverty, metro_res, white, hs_grad, and female_house, with pairwise correlations; e.g., the correlation between poverty and female_house is 0.53.]

Adjusted R²: Predicting poverty using % female householder

pov_slr = lm(poverty ~ female_house, data = poverty)

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)       3.31        1.90     1.74      0.09
female_house      0.69        0.16     4.32      0.00

[Figure: scatterplot of % in poverty vs. % female householder with the fitted line; R = 0.53, R² = 0.53² = 0.28.]

Adjusted R²: Another look at R²

R² can be calculated in three ways:

1. Square the correlation coefficient of x and y (how we have been calculating it):
   > cor(poverty$poverty, poverty$female_house)^2
   [1] 0.276042

2. Square the correlation coefficient of y and ŷ:
   > cor(poverty$poverty, pov_slr$fitted.values)^2
   [1] 0.276042

3. Based on the definition:
   $R^2 = \frac{\text{explained variability in } y}{\text{total variability in } y}$
   Using ANOVA we can calculate the explained variability and total variability in y.

Adjusted R²: Sum of squares

anova(pov_slr)

              Df  Sum Sq  Mean Sq  F value  Pr(>F)
female_house   1  132.57   132.57    18.68    0.00
Residuals     49  347.68     7.10
Total         50  480.25

Sum of squares of y: $SS_{Total} = \sum (y - \bar{y})^2 = 480.25$ (total variability)
Sum of squares of residuals: $SS_{Error} = \sum e_i^2 = 347.68$ (unexplained variability)
Sum of squares of the model: $SS_{Model} = SS_{Total} - SS_{Error} = 480.25 - 347.68 = 132.57$ (explained variability)

$R^2 = \frac{\text{explained variability}}{\text{total variability}} = \frac{132.57}{480.25} = 0.28$
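
The third calculation can be read straight off the ANOVA table in R; a minimal sketch using pov_slr from before:

ss = anova(pov_slr)[["Sum Sq"]]   # sums of squares: model term, then residuals
1 - ss[length(ss)] / sum(ss)      # R^2 = 1 - SS_Error / SS_Total = 132.57 / 480.25 = 0.28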

Adjusted R²: Why bother?

Why bother with another approach for calculating R² when we had a perfectly good way to calculate it as the correlation coefficient squared?

Adjusted R²: Predicting poverty using % female householder + % white

Linear model:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)      -2.58        5.78    -0.45      0.66
female_house      0.89        0.24     3.67      0.00
white             0.04        0.04     1.08      0.29

ANOVA:
              Df  Sum Sq  Mean Sq  F value  Pr(>F)
female_house   1  132.57   132.57    18.74    0.00
white          1    8.21     8.21     1.16    0.29
Residuals     48  339.47     7.07
Total         50  480.25

$R^2 = \frac{\text{explained variability}}{\text{total variability}} = \frac{132.57 + 8.21}{480.25} = 0.29$

Adjusted R²: Does adding the variable white to the model add valuable information that wasn't provided by female_house?

[Figure: the same scatterplot matrix of poverty, metro_res, white, hs_grad, and female_house as before.]

Adjusted R²: Collinearity between explanatory variables

poverty vs. % female head of household:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)       3.31        1.90     1.74      0.09
female_house      0.69        0.16     4.32      0.00

poverty vs. % female head of household and % white:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)      -2.58        5.78    -0.45      0.66
female_house      0.89        0.24     3.67      0.00
white             0.04        0.04     1.08      0.29

Note that the estimated slope of female_house changes (from 0.69 to 0.89) when white is added to the model.

Adjusted R²: Collinearity between explanatory variables (cont.)

- Two predictor variables are said to be collinear when they are correlated, and this collinearity complicates model estimation. Remember: predictors are also called explanatory or independent variables, so ideally they should be independent of each other.
- We don't like adding predictors that are associated with each other to the model, because often the addition of such a variable brings nothing new to the table. Instead, we prefer the simplest best model, i.e. the parsimonious model.
- While it's impossible to prevent collinearity from arising in observational data, experiments are usually designed to control for correlated predictors.
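
A quick first check for collinearity is simply the correlation between the two predictors; a sketch assuming the poverty data frame uses the column names shown in the scatterplot matrix:

cor(poverty$female_house, poverty$white)   # a large correlation here would signal collinearity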

Adjusted R²: R² vs. adjusted R²

                 R²    Adjusted R²
Model 1 (SLR)    0.28  0.26
Model 2 (MLR)    0.29  0.26

- When any variable is added to the model, R² increases.
- But if the added variable doesn't really provide any new information, or is completely unrelated, adjusted R² does not increase.
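
Both quantities are part of R's standard model summary; for any fitted lm object m they can be extracted directly:

summary(m)$r.squared       # R^2: never decreases when a predictor is added
summary(m)$adj.r.squared   # adjusted R^2: penalizes extra predictors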

Adjusted R²: Adjusted R²

$R^2_{adj} = 1 - \left( \frac{SS_{Error}}{SS_{Total}} \times \frac{n-1}{n-p-1} \right)$

where n is the number of cases and p is the number of predictors (explanatory variables) in the model.

- Because p is never negative, $R^2_{adj}$ will always be smaller than $R^2$.
- $R^2_{adj}$ applies a penalty for the number of predictors included in the model.
- Therefore, we choose models with higher $R^2_{adj}$ over others.

Adjusted R²: Calculate adjusted R²

ANOVA:
              Df  Sum Sq  Mean Sq  F value  Pr(>F)
female_house   1  132.57   132.57    18.74  0.0001
white          1    8.21     8.21     1.16  0.2868
Residuals     48  339.47     7.07
Total         50  480.25

$R^2_{adj} = 1 - \left( \frac{SS_{Error}}{SS_{Total}} \times \frac{n-1}{n-p-1} \right) = 1 - \left( \frac{339.47}{480.25} \times \frac{51-1}{51-2-1} \right) = 1 - \left( \frac{339.47}{480.25} \times \frac{50}{48} \right) = 1 - 0.74 = 0.26$
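
The same arithmetic in R, assuming the two-predictor fit is stored as pov_mlr (a name we are making up; the slides don't show the fitting call):

ss = anova(pov_mlr)[["Sum Sq"]]             # 132.57, 8.21, 339.47
sse = ss[length(ss)]; sst = sum(ss)         # SS_Error and SS_Total
n = 51; p = 2
1 - (sse / sst) * ((n - 1) / (n - p - 1))   # 0.26, matching summary(pov_mlr)$adj.r.squared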