Introduction to Linear Regression Rebecca C. Steorts September 15, 2015

Size: px

Start display at page:

Download "Introduction to Linear Regression Rebecca C. Steorts September 15, 2015"

Elfreda Shields
5 years ago
Views:

1 Introduction to Linear Regression Rebecca C. Steorts September 15, 2015 Today (Re-)Introduction to linear models and the model space What is linear regression Basic properties of linear regression Using data frames for statistical purposes Manipulation of data into more convenient forms How do we do exploratory analysis to see if linear regression is appropriate? Linear regression Let Y n 1 be the response (poverty rate) Let X n p represent a matrix of covariates (age, education level, state you live in, etc.) Thus, for each observation, there are p covariates. Let β be an unknown parameter that can help up estimate future poverty rates where ɛ N(0, σ 2 I). Y n 1 = X n p β p 1 + ɛ n 1 Estimation and Prediction We seek to estimate ˆβ which is the Let f(y ) = Y Xβ 2 and solve for arg min Y Xβ 2. f(y ) β = β [(Y Xβ)T (Y Xβ)] =... We thus, find that ˆβ = (X T X) 1 X T Y. Predicting Ŷ We can now predict new observations via, where H is often called the hat matrix. Ŷ = X(X T X) 1 X T Y = HY 1

2 How can we evaluate an linear model? Residual plots Outliers Colliearity Residual plots Graphical tool for identifying non-linearity. For each observation i, the residual is e i = y i ŷ i. Plotting the residuals is plotting e i versus one covariate x i for all i. Residuals for multiple covariates Translating to multiple covariates, we plot the residuals versus the fitted values. That is, we plot e i versus ŷ i. (Think about why on your own). A strong pattern in the residuals indicates non-linearity. Instead: non-linear transformation of the covariates. Outliers An outlier is a point for which y i is far from the value predicted by the model. Can identify these by residual plots but often it s hard to know how large a residual should be to consider a point an outlier. Instead, we can plot the studentized residuals, computed by dividing each residual e i by its estimated standard error. Observations whose studentized residuals are greater than 3 in absolute value are possible outliers. What to do if you think you have an outliner, remove it. High-leverage points Recall that outliers are observations for which the response y i is unusual given the predictor x i Observations with high leverage have an unusual value for x i Correlation It s important that the error terms e are all uncorrelated. e are uncorrelated, means that if e i > 0 provides little or no information about the sign of e i+1 The fitted values (or standard errors) are computed assumed the e are uncorrleated. If there is correlated, then the estimated std errors will tend to understimate the true std errors. This implies confidence and prediction intervals will be too narrow. 2

3 Figure 1: Observation 41 is a high leverage point, while 20 is not. The red line is the fit to all the data, and the dotted line is the fit with observation 41 removed. Center: The red observation is not unusual in terms of its X1 value or its X2 value, but still falls outside the bulk of the data, and hence has high leverage. Right: Observation 41 has a high leverage and a high residual. Similarly, p-value will be lower than they should (and this could cause a parameter to be thought to be statistically significant when it s not). Where does this happen often time series data (things are correlated). Collinearity Collinearity refers to the situation in which two or more predictor variables are closely related to each other. Suppose we are predicting balance of credit card. Credit limit and age when plotted appear to have no obvious relation. However, credit limit and credit rating do (and this is also intuitive is you have credit and own a credit card) Credit limit and rating are high correlated. Collinearity (continued) The presence of collinearity can pose problems in the regression context, since it can be difficult to separate out the individual effects of collinear variables on the response. Collinearity reduces accuracy of estimates of regression coefficients and causes std error for beta ˆ j to grow. Elements of this matrix that are large in absolute value indicated pair of highly correlated variables. Another way to check is computing the variance inflation factor (VIF). VIF of 1 says no colllinearity. VIF above 5 problematic amount. How to deal with collinearity? The first is to drop one of the problematic variables from the regression. The second solution is to combine the collinear variables together into a single predictor. (better solution since not throwing away data) Example: For take the average of standardized versions of limit and rating in order to create a new variable that measures credit worthiness. 3

4 So You ve Got A Data Frame What can we do with it? Plot it: examine multiple variables and distributions Test it: compare groups of individuals to each other Check it: does it conform to what we d like for our needs? Test Case: Birth weight data Included in R already: library(mass) data(birthwt) summary(birthwt) low age lwt race Min. : Min. :14.00 Min. : 80.0 Min. : st Qu.: st Qu.: st Qu.: st Qu.:1.000 Median : Median :23.00 Median :121.0 Median :1.000 Mean : Mean :23.24 Mean :129.8 Mean : rd Qu.: rd Qu.: rd Qu.: rd Qu.:3.000 Max. : Max. :45.00 Max. :250.0 Max. :3.000 smoke ptl ht ui Min. : Min. : Min. : Min. : st Qu.: st Qu.: st Qu.: st Qu.: Median : Median : Median : Median : Mean : Mean : Mean : Mean : rd Qu.: rd Qu.: rd Qu.: rd Qu.: Max. : Max. : Max. : Max. : ftv bwt Min. : Min. : 709 1st Qu.: st Qu.:2414 Median : Median :2977 Mean : Mean :2945 3rd Qu.: rd Qu.:3487 Max. : Max. :4990 From R help Go to R help for more info, because someone documented this (thanks, someone!) help(birthwt) Make it readable! 4

5 colnames(birthwt) [1] "low" "age" "lwt" "race" "smoke" "ptl" "ht" "ui" [9] "ftv" "bwt" colnames(birthwt) <- c("birthwt.below.2500", "mother.age", "mother.weight", "race", "mother.smokes", "previous.prem.labor", "hypertension", "uterine.irr", "physician.visits", "birthwt.grams") Make it readable, again! Let s make all the factors more descriptive. birthwt$race <- factor(c("white", "black", "other")[birthwt$race]) birthwt$mother.smokes <- factor(c("no", "Yes")[birthwt$mother.smokes + 1]) birthwt$uterine.irr <- factor(c("no", "Yes")[birthwt$uterine.irr + 1]) birthwt$hypertension <- factor(c("no", "Yes")[birthwt$hypertension + 1]) Make it readable, again! summary(birthwt) birthwt.below.2500 mother.age mother.weight race Min. : Min. :14.00 Min. : 80.0 black:26 1st Qu.: st Qu.: st Qu.:110.0 other:67 Median : Median :23.00 Median :121.0 white:96 Mean : Mean :23.24 Mean : rd Qu.: rd Qu.: rd Qu.:140.0 Max. : Max. :45.00 Max. :250.0 mother.smokes previous.prem.labor hypertension uterine.irr No :115 Min. : No :177 No :161 Yes: 74 1st Qu.: Yes: 12 Yes: 28 Median : Mean : rd Qu.: Max. : physician.visits birthwt.grams Min. : Min. : 709 1st Qu.: st Qu.:2414 Median : Median :2977 Mean : Mean :2945 3rd Qu.: rd Qu.:3487 Max. : Max. :4990 5

6 Explore it! R s basic plotting functions go a long way. plot (birthwt$race) title (main = "Count of Mother's Race in Springfield MA, 1986") Count of Mother's Race in Springfield MA, black other white Explore it! R s basic plotting functions go a long way. plot (birthwt$mother.age) title (main = "Mother's Ages in Springfield MA, 1986") 6

7 Mother's Ages in Springfield MA, 1986 birthwt$mother.age Index Explore it! R s basic plotting functions go a long way. plot (sort(birthwt$mother.age)) title (main = "(Sorted) Mother's Ages in Springfield MA, 1986") 7

8 (Sorted) Mother's Ages in Springfield MA, 1986 sort(birthwt$mother.age) Index Explore it! R s basic plotting functions go a long way. plot (birthwt$mother.age, birthwt$birthwt.grams) title (main = "Birth Weight by Mother's Age in Springfield MA, 1986") 8

9 Birth Weight by Mother's Age in Springfield MA, 1986 birthwt$birthwt.grams birthwt$mother.age Basic statistical testing Let s fit some models to the data pertaining to our outcome(s) of interest. plot (birthwt$mother.smokes, birthwt$birthwt.grams, main="birth Weight by Mother's Smoking Habit", ylab 9

10 Birth Weight by Mother's Smoking Habit Birth Weight (g) No Yes Mother Smokes Basic statistical testing Tough to tell! Simple two-sample t-test. We re testing: H o : µ 1 = µ 2 versus H a : µ 1 µ 2 where µ 1 is the mean birth weight when the mother smokes and µ 2 is the mean birth weight when the mother doesn t smoke. t.test (birthwt$birthwt.grams[birthwt$mother.smokes == "Yes"], birthwt$birthwt.grams[birthwt$mother.smokes == "No"]) Welch Two Sample t-test data: birthwt$birthwt.grams[birthwt$mother.smokes == "Yes"] and birthwt$birthwt.grams[birthwt$mother t = , df = 170.1, p-value = alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: sample estimates: mean of x mean of y We reject the null in favor that the two means are not the same. 10

11 Basic statistical testing Does this difference match the linear model? linear.model.1 <- lm (birthwt.grams ~ mother.smokes, data=birthwt) linear.model.1 Call: lm(formula = birthwt.grams ~ mother.smokes, data = birthwt) Coefficients: (Intercept) mother.smokesyes Basic statistical testing Does this difference match the linear model? summary(linear.model.1) Call: lm(formula = birthwt.grams ~ mother.smokes, data = birthwt) Residuals: Min 1Q Median 3Q Max Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) < 2e-16 *** mother.smokesyes ** --- Signif. codes: 0 '***' '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: on 187 degrees of freedom Multiple R-squared: , Adjusted R-squared: F-statistic: on 1 and 187 DF, p-value: Basic statistical testing Does this difference match the linear model? linear.model.2 <- lm (birthwt.grams ~ mother.age, data=birthwt) linear.model.2 Call: 11

12 lm(formula = birthwt.grams ~ mother.age, data = birthwt) Coefficients: (Intercept) mother.age Basic statistical testing summary(linear.model.2) Call: lm(formula = birthwt.grams ~ mother.age, data = birthwt) Residuals: Min 1Q Median 3Q Max Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) <2e-16 *** mother.age Signif. codes: 0 '***' '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: on 187 degrees of freedom Multiple R-squared: , Adjusted R-squared: F-statistic: on 1 and 187 DF, p-value: Basic statistical testing Diagnostics: R tries to make it as easy as possible (but no easier). Try in R proper: plot(linear.model.2) 12

13 Residuals Residuals vs Fitted Fitted values lm(birthwt.grams ~ mother.age) Normal Q Q Standardized residuals Theoretical Quantiles lm(birthwt.grams ~ mother.age) 13

14 Scale Location Standardized residuals Fitted values lm(birthwt.grams ~ mother.age) Residuals vs Leverage Standardized residuals Cook's distance Leverage lm(birthwt.grams ~ mother.age) Detecting Outliers These are the default diagnostic plots for the analysis. Note that our oldest mother and her heaviest child are greatly skewing this analysis as we suspected. 14

15 birthwt.noout <- birthwt[birthwt$mother.age <= 40,] linear.model.3 <- lm (birthwt.grams ~ mother.age, data=birthwt.noout) linear.model.3 Call: lm(formula = birthwt.grams ~ mother.age, data = birthwt.noout) Coefficients: (Intercept) mother.age Detecting Outliers summary(linear.model.3) Call: lm(formula = birthwt.grams ~ mother.age, data = birthwt.noout) Residuals: Min 1Q Median 3Q Max Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) <2e-16 *** mother.age Signif. codes: 0 '***' '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: on 186 degrees of freedom Multiple R-squared: , Adjusted R-squared: F-statistic: on 1 and 186 DF, p-value: More complex models Add in smoking behavior: linear.model.3b <- lm (birthwt.grams ~ mother.age + mother.smokes*race, data=birthwt.noout) summary(linear.model.3b) Call: lm(formula = birthwt.grams ~ mother.age + mother.smokes * race, data = birthwt.noout) Residuals: 15

16 Min 1Q Median 3Q Max Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) < 2e-16 *** mother.age mother.smokesyes raceother racewhite ** mother.smokesyes:raceother mother.smokesyes:racewhite Signif. codes: 0 '***' '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: on 181 degrees of freedom Multiple R-squared: , Adjusted R-squared: F-statistic: on 6 and 181 DF, p-value: More complex models plot(linear.model.3b) Residuals vs Fitted Residuals Fitted values lm(birthwt.grams ~ mother.age + mother.smokes * race) 16

17 Normal Q Q Standardized residuals Theoretical Quantiles lm(birthwt.grams ~ mother.age + mother.smokes * race) Scale Location 10 4 Standardized residuals Fitted values lm(birthwt.grams ~ mother.age + mother.smokes * race) 17

18 Residuals vs Leverage Standardized residuals Cook's distance Leverage lm(birthwt.grams ~ mother.age + mother.smokes * race) Everything Must Go (In) Let s do a kitchen sink model on this new data set: linear.model.4 <- lm (birthwt.grams ~., data=birthwt.noout) linear.model.4 Call: lm(formula = birthwt.grams ~., data = birthwt.noout) Coefficients: (Intercept) birthwt.below.2500 mother.age mother.weight raceother racewhite mother.smokesyes previous.prem.labor hypertensionyes uterine.irryes physician.visits Everything Must Go (In), Except What Must Not Whoops! One of those variables was birthwt.below.2500 which is a function of the outcome. 18

19 linear.model.4a <- lm (birthwt.grams ~. - birthwt.below.2500, data=birthwt.noout) summary(linear.model.4a) Call: lm(formula = birthwt.grams ~. - birthwt.below.2500, data = birthwt.noout) Residuals: Min 1Q Median 3Q Max Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) e-13 *** mother.age mother.weight ** raceother racewhite *** mother.smokesyes ** previous.prem.labor hypertensionyes ** uterine.irryes *** physician.visits Signif. codes: 0 '***' '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 638 on 178 degrees of freedom Multiple R-squared: , Adjusted R-squared: F-statistic: on 9 and 178 DF, p-value: 8.255e-08 Everything Must Go (In), Except What Must Not Whoops! One of those variables was birthwt.below.2500 which is a function of the outcome. plot(linear.model.4a) 19

20 Residuals vs Fitted Residuals Fitted values lm(birthwt.grams ~. birthwt.below.2500) Normal Q Q Standardized residuals Theoretical Quantiles lm(birthwt.grams ~. birthwt.below.2500) 20

21 Standardized residuals Scale Location Fitted values lm(birthwt.grams ~. birthwt.below.2500) Residuals vs Leverage Standardized residuals Cook's distance Leverage lm(birthwt.grams ~. birthwt.below.2500) 21

UNIVERSITY OF MASSACHUSETTS. Department of Mathematics and Statistics. Basic Exam - Applied Statistics. Tuesday, January 17, 2017

UNIVERSITY OF MASSACHUSETTS. Department of Mathematics and Statistics. Basic Exam - Applied Statistics. Tuesday, January 17, 2017 UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Basic Exam - Applied Statistics Tuesday, January 17, 2017 Work all problems 60 points are needed to pass at the Masters Level and 75