Simple Linear Regression
CHAPTER 7

Important Concepts
- Correlation (r or R) and coefficient of determination (r²)
- Interpreting y-intercept and slope coefficients
- Inference (hypothesis testing and confidence intervals) on the y-intercept (β₀), the slope (β₁), and the mean response (μ_y = β₀ + β₁x)
- Prediction for a new value of the response, y = β₀ + β₁x + ε
- Extrapolation = using the model to make predictions for x-values outside the range of observed data
- Checking regression assumptions
- Detecting outliers and influential points: leverage (hat values) and Cook's distance

Example: Manatees
[Figure: Scatterplot of Manatee Deaths vs. Powerboat Registrations (in thousands), with the least-squares regression line ŷ = −41.43 + 0.12x. The R output shows the estimated regression coefficients.]

Interpret Regression Coefficients
ŷ = −41.43 + 0.12x

How do we interpret the intercept?
If a year has no powerboat registrations (x = 0), we would expect about −41 manatee deaths (not meaningful). This is not a valid prediction: the number of powerboat registrations in the data set only ranged from about 450,000 to 700,000, so predicting for x = 0 is extrapolating outside the range of observed data.

How do we interpret the slope?
For every additional 1000 powerboat registrations, we estimate that the average number of manatee deaths would increase by 0.12. That is, for every additional eight thousand powerboat registrations, we predict approximately one (0.12 × 8 ≈ 1) additional manatee death.
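The slides fit this line with R's lm function. As a supplement, here is a minimal plain-Python sketch of how the least-squares coefficients come from the textbook formulas b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and b₀ = ȳ − b₁x̄; the tiny data set is made up purely for illustration.

```python
def least_squares(x, y):
    """Least-squares slope and intercept for simple linear regression."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx          # slope
    b0 = ybar - b1 * xbar   # intercept
    return b0, b1

# Hypothetical data lying exactly on y = 3 + 2x, so the fit recovers it.
b0, b1 = least_squares([1, 2, 3, 4], [5, 7, 9, 11])
print(b0, b1)  # 3.0 2.0
```

For real data the same two formulas produce the manatee line ŷ = −41.43 + 0.12x.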
Predictions
Let x = 500. Then ŷ = −41.43 + 0.12(500) = 18.57. Interpretation?
1. An estimate of the average number of manatee deaths across all years with 500,000 powerboat registrations is about 19 deaths.
2. A prediction for the number of manatee deaths in one particular year with 500,000 powerboat registrations is about 19 deaths.

Residuals (Prediction Errors)
Individual y values can be written as:
  y = predicted value + prediction error, or
  y = fitted value + residual, or
  y = ŷ + residual
For each observation, residual = observed − predicted = y − ŷ.

Ex: Manatees
ŷ = −41.4304 + 0.1249x
Residual = vertical distance from the observed point to the regression line.
[Figure: Scatterplot of Manatee Deaths vs. Powerboat Registrations (in thousands), with the regression line and two residuals marked.]
Observation (526, 15):
  Observed response = 15
  Predicted response = −41.4304 + 0.1249(526) = 24.3
  Residual = 15 − 24.3 = −9.3
Observation (559, 34):
  Observed response = 34
  Predicted response = −41.4304 + 0.1249(559) = 28.4
  Residual = 34 − 28.4 = 5.6

What do we mean by least squares?
Basic idea: Minimize how far off we are when we use the line to predict y, based on x, by comparing to the actual y.
Definition: The least-squares regression line is the line that minimizes the sum of the squared residuals over all points in the data set. The sum of squared errors (SSE) is that minimum sum:

  SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σ (residual)²

Ex: Manatees
Least-squares regression line: ŷ = −41.4304 + 0.1249x

  x     y    ŷ                               Residual
  447   13   −41.43 + 0.1249(447) = 14.4     13 − 14.4 = −1.4
  460   21   −41.43 + 0.1249(460) = 16.0     21 − 16.0 = 5.0
  481   24   −41.43 + 0.1249(481) = 18.6     24 − 18.6 = 5.4

We can compute the residuals for all 14 observations.
Positive residual => observed value higher than predicted.
Negative residual => observed value lower than predicted.
The least-squares regression line is such that SSE = (−1.4)² + (5.0)² + (5.4)² + … is as small as possible.
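The residual calculations for the three manatee observations in the table can be reproduced directly. This plain-Python sketch (the course itself uses R) plugs the fitted manatee line ŷ = −41.4304 + 0.1249x into the definition residual = y − ŷ:

```python
# Fitted manatee regression line from the slides.
b0, b1 = -41.4304, 0.1249

# Three (x, y) observations from the table above.
x = [447, 460, 481]
y = [13, 21, 24]

yhat = [b0 + b1 * xi for xi in x]               # fitted values
resid = [yi - yh for yi, yh in zip(y, yhat)]    # observed - predicted
sse_part = sum(r ** 2 for r in resid)           # these points' part of SSE

print([round(v, 1) for v in yhat])   # [14.4, 16.0, 18.6]
print([round(r, 1) for r in resid])  # [-1.4, 5.0, 5.4]
```

Summing the squared residuals over all 14 observations in the same way gives the full SSE that the least-squares line minimizes.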
Properties of Correlation: r
The magnitude of r indicates the strength of the linear relationship (how close are the points to the regression line?).
- Values close to 1 or close to −1 → strong linear relationship
- Values close to 0 → no or weak linear relationship
The sign of r indicates the direction of the linear association (when one variable increases, does the other generally increase (positive association) or generally decrease (negative association)?).
- r > 0 → positive linear association
- r < 0 → negative linear association
General Guidelines for Describing Strength

  Value of r                      Strength of linear relationship
  −1.0 to −0.5 or 0.5 to 1.0      Strong linear relationship
  −0.5 to −0.3 or 0.3 to 0.5      Moderate linear relationship
  −0.3 to −0.1 or 0.1 to 0.3      Weak linear relationship
  −0.1 to 0.1                     No or very weak linear relationship

The table above serves only as a rule of thumb (many experts may somewhat disagree on the choice of boundaries).
Source: http://www.experiment-resources.com/statistical-correlation.html

Manatees
[Figure: Scatterplot of Number of Manatee Deaths vs. Powerboat Registrations (in thousands).]
r = 0.941 → very strong positive linear association.

Measuring Strength of Linear Association
The following scatterplots are arranged from strongest positive linear association (on the left) to those with virtually no linear association (in the middle) to those with the strongest negative linear association (on the right). Take a couple minutes and try to guess the correlation for each plot.
[Figure: Seven scatterplots with correlations 0.994, 0.889, 0.510, −0.081, −0.450, −0.721, −0.907. Positive/strong on the left, none/weak in the middle, negative/strong on the right.]
Get more practice at guessing the correlation and impress your friends here:
http://www.rossmanchance.com/applets/guesscorrelation.html

Formula for r (p. 338 footnote)

  r = [1/(n − 1)] Σ [(xᵢ − x̄)/sₓ] [(yᵢ − ȳ)/s_y]

We will have R calculate this for us! But we can still learn from the formula: we are looking at an (almost) average of the products of the standardized (z) scores for x with each respective standardized (z) score for y. What does this mean? (Draw picture.)

The formula also shows us that:
- The order of x and y doesn't matter. It doesn't matter which of the two variables is called the x variable and which is called the y variable; the correlation doesn't care.
- Correlation is unitless.
- Correlation doesn't change when the measurement units are changed (since it uses standardized observations in its calculation).
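The two properties above can be checked numerically. This plain-Python sketch (the course uses R; the small data set is made up for illustration) computes r as the near-average of products of z-scores, then verifies that a change of units in x (a positive linear rescaling, like °C to °F) and a swap of x and y both leave r unchanged:

```python
from math import sqrt

def corr(x, y):
    """r = (1/(n-1)) * sum of products of z-scores of x and y."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    sy = sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    return sum(((xi - xbar) / sx) * ((yi - ybar) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

# Made-up illustrative data.
x = [447, 460, 481, 498, 513]
y = [13, 21, 24, 16, 24]

r = corr(x, y)
r_units = corr([1.8 * xi + 32 for xi in x], y)  # change units of x
r_swap = corr(y, x)                             # swap the roles of x and y
print(abs(r - r_units) < 1e-9 and abs(r - r_swap) < 1e-9)  # True
```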
Warning!
The correlation coefficient only measures the strength and direction of a linear association.
  x = Month (January = 1, February = 2, etc.)
  y = Raleigh's average monthly temperature
  r = 0.257
Even though the relationship between month and temperature is very strong, its correlation is only 0.257, indicating a weak linear relationship.

R-squared (coefficient of determination): r²
The squared correlation r² is between 0 and 1 and indicates the proportion of variation in the response (y) explained by knowing x.
- SSTO = sum of squares total = sum of squared differences between the observed y values and ȳ.
- We will break SSTO into two pieces, SSE + SSR:
  - SSE = sum of squared residuals (error): unexplained.
  - SSR = sum of squares due to regression: explained = sum of squared differences between the fitted values ŷ and ȳ.
Copyright 2004 Brooks/Cole, a division of Thomson Learning, Inc., modified by S. Hancock Oct. 2012

New interpretation: r²
SSTO = SSR + SSE
Question: How much of the total variability in the y values (SSTO) is explained by the regression model (SSR)? How much better can we predict y when we know x than when we don't?

  r² = SSR/(SSR + SSE) = SSR/SSTO = 1 − SSE/SSTO

Example: Chug Times
x = body weight (lbs); y = time to chug a 12-oz drink (sec)
[Figure: Scatterplot of ChugTime vs. Weight, with the horizontal line at the mean ȳ = 5.108.]
- Total variation summed over all points = SSTO = 36.6
- Unexplained part summed over all points = SSE = 13.9
- Explained by knowing x = SSR = 22.7

  r² = 1 − SSE/SSTO = SSR/SSTO = 22.7/36.6 = 0.62

Interpretation: 62% of the variability in chug times is explained by knowing the weight of the person.
Breakdown of Calculations (continuing the chug-times example)
ȳ = 66.4/13 = 5.11;  ŷ = 13.298 − 0.046x

  x     y     (y − ȳ)²                ŷ      (ŷ − ȳ)²                 (y − ŷ)²
  153   5.6   (5.6 − 5.11)² = 0.24    6.29   (6.29 − 5.11)² = 1.40    (5.6 − 6.29)² = 0.48
  169   6.1   (6.1 − 5.11)² = 0.98    5.56   (5.56 − 5.11)² = 0.20    (6.1 − 5.56)² = 0.29
  178   3.3   (3.3 − 5.11)² = 3.27    5.15   (5.15 − 5.11)² = 0.002   (3.3 − 5.15)² = 3.41
  158   6.7   (6.7 − 5.11)² = 2.53    6.06   (6.06 − 5.11)² = 0.91    (6.7 − 6.06)² = 0.41
  ⋮
  SUM:  66.4  SSTO = 36.61                   SSR = 22.73              SSE = 13.88

SSTO = SSR + SSE  =>  36.61 = 22.73 + 13.88
Note: SSTO does not have anything to do with the fitted model nor with x; it's just a measure of the variability in y (it is used in calculating the standard deviation of the y-values).

Example: Manatees (r² in R output)
r = 0.94 → r² = 0.89
Interpretation: About 89% of the variability in the number of manatee deaths (response variable) can be explained by the number of powerboat registrations (explanatory variable).
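The decomposition SSTO = SSR + SSE holds exactly whenever ŷ is the least-squares fit and ȳ is the mean of the same data. As an illustration (in plain Python; the course uses R), this sketch refits the line to just the four chug-time observations shown in the table, so the identity holds exactly for them (the table's own sums use the full 13-observation data set):

```python
def fit_and_decompose(x, y):
    """Least-squares fit plus the SSTO / SSR / SSE breakdown."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    yhat = [b0 + b1 * xi for xi in x]
    ssto = sum((yi - ybar) ** 2 for yi in y)           # total
    ssr = sum((yh - ybar) ** 2 for yh in yhat)         # explained
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # unexplained
    return ssto, ssr, sse

# The four chug-time observations from the table above.
ssto, ssr, sse = fit_and_decompose([153, 169, 178, 158], [5.6, 6.1, 3.3, 6.7])
print(abs(ssto - (ssr + sse)) < 1e-9)                 # True: SSTO = SSR + SSE
print(abs(ssr / ssto - (1 - sse / ssto)) < 1e-9)      # True: two forms of r²
```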
REGRESSION DIAGNOSTICS

What to check for?
Simple linear regression model assumptions:
1. Linearity
2. Constant variance
3. Normality
4. Independence
Also: Outliers? Influential points? Other predictors?

Scatterplots with lowess curve
The best plot to start with is a scatterplot of Y vs. X.
- Add the least-squares regression line.
- Add a smoothed curve (not restricted to a line; it follows the general pattern of the data).
  - lowess curve = locally weighted regression scatterplot smoothing
  - Similar to a moving average: it fits least-squares lines around each neighborhood of points, makes predictions, and smooths the predictions.
  - R function: lines(lowess(x, y))

Residuals vs. Fitted Values (or residuals vs. X)
Check the linearity and constant-variance assumptions, and look for outliers.
- Good: No pattern; random scatter. Equal spread around the horizontal line at zero (the mean of the residuals).
- Bad: A pattern (curved, etc.) → indicates linearity is not met. Funneling (spread increasing or decreasing with X) → indicates non-constant variance.

Residuals vs. Fitted Values - GOOD
[Figure: Points look randomly scattered around zero. No evidence of a nonlinear pattern or unequal variances.]

Residuals vs. Fitted Values - BAD
[Figure: The plot on the left shows evidence of non-constant variance; the plot on the right shows evidence of a nonlinear relationship.]
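To make the "similar to a moving average" analogy concrete, here is a toy plain-Python smoother. It is not lowess (lowess fits weighted local regressions rather than averaging raw values), but it shows the same neighborhood idea: each smoothed value summarizes the points near it along the x-axis.

```python
def moving_average_smooth(y, window=3):
    """Crude smoother: average each value with its neighbors.

    Assumes y is already ordered by x. At the edges the window
    is truncated to whatever neighbors exist.
    """
    half = window // 2
    smoothed = []
    for i in range(len(y)):
        lo, hi = max(0, i - half), min(len(y), i + half + 1)
        smoothed.append(sum(y[lo:hi]) / (hi - lo))
    return smoothed

print(moving_average_smooth([1, 2, 3, 4, 5]))  # [1.5, 2.0, 3.0, 4.0, 4.5]
```

In practice you would let R's lowess do this properly and overlay the result on the scatterplot with lines(lowess(x, y)).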
Residuals vs. Time or Observation Number
If data are available on the time at which observations were collected, a plot of residuals vs. time can help us check the independence assumption.
- Good: No pattern over time.
- Bad: A pattern (e.g., increasing/decreasing or cyclical) over time → indicates dependence in the data points (points closer together in time are more similar than points further apart in time).

Independence Assumption
In general, the independence assumption is checked by knowledge of how the data were collected.
Example: Suppose we want to predict blood pressure from the amount of fat consumed in the diet. Incorrect sampling, such as sampling entire families, results in dependent observations. Why? Use a random sample to ensure independence of observations.

Residuals vs. Other Predictor
Check whether another predictor helps explain the "leftovers" from our model. If there is a pattern (e.g., residuals increasing with values of the other predictor), it may indicate that this other predictor helps predict Y in addition to the original X; try adding it to the model (multiple linear regression, introduced in Chapter 6).

Normal Probability Plot (Normal Quantile-Quantile Plot) of Residuals
Plots the sample quantiles (y-axis) vs. the quantiles we would expect from a standard normal distribution (x-axis). This is the best plot to use for checking the normality assumption.
- Good: Points follow a straight line.
- Bad: Points deviate in a systematic fashion from a straight line → indicates a violation of the normality assumption.
[Figure: Examples of bad and good normal probability plots of residuals (see also Figure 3.9 on p. 112).]

What to do if assumptions are not met?
Two choices:
1. Abandon the simple linear regression model and use a more appropriate model (future courses).
2. Employ some transformation on the data so that the simple linear regression model is appropriate for the transformed data. We can transform X or Y or both.
Make sure to run regression diagnostics on the model fit with the transformed data to check the assumptions. Transformations can make interpretations more complex.
Transformations
Nonlinear pattern with constant error variance → try transforming X.
- Residuals form an inverted U → use X′ = √X or X′ = log(X).
- Residuals are U-shaped and the association between X and Y is positive → use X′ = X² or X′ = exp(X).
- Residuals are U-shaped and the association between X and Y is negative → use X′ = 1/X or X′ = exp(−X).
[Figures: In each case, the upper plots show the data and residual plot before the transformation; the lower plots show them after.]

Transformations
Non-constant error variance → try transforming Y. It may also help to transform X in addition to Y.

Box-Cox Transformations
Power transformation on Y → transformed Y′ = Y^λ. Most commonly used:

  λ = 2      Y′ = Y²
  λ = 0.5    Y′ = √Y
  λ = 0      Y′ = log(Y)  (by definition)
  λ = −0.5   Y′ = 1/√Y
  λ = −1     Y′ = 1/Y

Remember: the notation log means log base e.
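The table of power transformations can be written as one small function. This plain-Python sketch follows the slides' simple-power form Y′ = Y^λ, with the λ = 0 case defined as the natural log (some references instead use the shifted form (Y^λ − 1)/λ, which has the same effect on model fit):

```python
from math import log

def box_cox(y, lam):
    """Power transformation from the table: y**lam, or log(y) when lam = 0.

    Assumes y > 0, as required for logs and negative powers.
    """
    return log(y) if lam == 0 else y ** lam

print(box_cox(4, 0.5))   # 2.0   (square root)
print(box_cox(2, -1))    # 0.5   (reciprocal)
print(box_cox(1, 0))     # 0.0   (natural log)
print(box_cox(3, 2))     # 9     (square)
```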
Box-Cox Transformations
Find the maximum likelihood estimate of λ.
- R function to find the best λ: boxcox (you need to load the MASS R library).
- It plots the likelihood function → look at the value of λ where the likelihood is highest.
- It is best to choose a value of λ that is easy to interpret (e.g., choose λ = 0 rather than λ = 0.12).

Detecting Outliers and Influential Points
Outlier = an observation that does not follow the overall pattern of the data set → easy to see in scatterplots for SLR; harder to find in multiple linear regression.
Influential point = an observation that, when removed, changes the model fit substantially (e.g., coefficient estimates, correlation).
Tools for detecting outliers and influential points:
- Residual plots
- Leverage = measures how far away an observation is from the mean of the predictors → high leverage = potential to be an influential point
- Cook's distance = measures how much of an influence an individual observation has on the fitted coefficients (how much they change when the observation is removed) → high Cook's distance = influential point
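For simple linear regression, leverage and Cook's distance have closed forms: hᵢ = 1/n + (xᵢ − x̄)²/Σ(xⱼ − x̄)² and Dᵢ = eᵢ²/(p·MSE) · hᵢ/(1 − hᵢ)², with p = 2 parameters. In R these come from hatvalues() and cooks.distance(); the plain-Python sketch below (with made-up data, including one extreme x value) just shows the formulas at work:

```python
def leverage(x):
    """Hat values for SLR: distance of each x from the mean of x."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

def cooks_distance(x, y):
    """Cook's distance for each observation in an SLR fit (p = 2)."""
    n, p = len(x), 2
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]   # residuals
    mse = sum(ei ** 2 for ei in e) / (n - p)
    h = leverage(x)
    return [ei ** 2 / (p * mse) * hi / (1 - hi) ** 2
            for ei, hi in zip(e, h)]

# Made-up data; x = 10 is far from the mean of x, so it has high leverage.
x = [1, 2, 3, 4, 10]
y = [2.1, 3.9, 6.2, 8.0, 13.0]
h = leverage(x)
d = cooks_distance(x, y)
print(abs(sum(h) - 2) < 1e-9)   # True: hat values sum to p = 2 in SLR
print(max(h) == h[-1])          # True: the extreme x has the most leverage
```

High leverage flags only the *potential* for influence; Cook's distance combines leverage with the size of the residual to measure the actual effect on the fitted coefficients.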