Simple Linear Regression

Simple Linear Regression (OI Chapter 7)

Important Concepts
- Correlation (r) and coefficient of determination (r²)
- Interpreting the y-intercept and slope coefficients
- Inference (hypothesis testing and confidence intervals) on the y-intercept (β₀), the slope (β₁), and the mean response (μ_y = β₀ + β₁x)
- Prediction for a new value of the response, y = β₀ + β₁x + ε
- Extrapolation: using the model to make predictions for x-values outside the range of observed data
- Checking regression assumptions
- Detecting outliers and influential points: leverage (hat values) and Cook's distance

Example: Manatees
[Scatterplot: Manatee Deaths (about 20 to 50) vs. Powerboat Registrations (about 450 to 700 thousand), with the least-squares regression line.]
Least-squares regression line: ŷ = -41.43 + 0.12x

Linear Regression in R
[R output showing the estimated regression coefficients.]

Interpret Regression Coefficients: ŷ = -41.43 + 0.12x
How do we interpret the intercept? If a year had no powerboat registrations (x = 0), we would expect about -41 manatee deaths, which is not meaningful. It is also not a valid prediction: powerboat registrations in the data set only ranged from about 450,000 to 700,000, so predicting at x = 0 is extrapolating outside the range of observed data.
How do we interpret the slope? For every additional 1,000 powerboat registrations, we estimate that the average number of manatee deaths increases by 0.12. That is, for every additional eight thousand powerboat registrations, we predict approximately one (0.12 × 8 ≈ 1) additional manatee death.
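As a minimal sketch of how output like this is produced, the snippet below fits a least-squares line with R's lm(). The data vectors are hypothetical: they hold only the five (x, y) pairs quoted on these slides, not the full 14-observation manatee data set, so the fitted coefficients will not match ŷ = -41.43 + 0.12x exactly.

```r
# Minimal sketch of fitting a simple linear regression in R.
# NOTE: these vectors contain only the five (x, y) pairs quoted on the
# slides, not the full 14-observation data set, so the coefficients
# will differ from the slides' yhat = -41.43 + 0.12x.
registrations <- c(447, 460, 481, 526, 559)  # powerboats (thousands)
deaths        <- c(13, 21, 24, 15, 34)       # manatee deaths

fit <- lm(deaths ~ registrations)  # least-squares fit
coef(fit)                          # estimated intercept and slope
summary(fit)                       # standard errors, t-tests, R-squared
```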

Predictions
Let x = 500. Then ŷ = -41.43 + 0.12(500) = 18.57. Interpretation?
1. An estimate of the average number of manatee deaths across all years with 500,000 powerboat registrations is about 19 deaths.
2. A prediction for the number of manatee deaths in one particular year with 500,000 powerboat registrations is about 19 deaths.

Residuals (Prediction Errors)
Individual y values can be written as
y = predicted value + prediction error, or
y = fitted value + residual, or
y = ŷ + residual.
For each observation, residual = observed - predicted = y - ŷ.

Ex: Manatees, ŷ = -41.4304 + 0.1249x
A residual is the vertical distance from the observed point to the regression line.
Observation (526, 15): observed response = 15; predicted response = -41.4304 + 0.1249(526) = 24.3; residual = 15 - 24.3 = -9.3.
Observation (559, 34): observed response = 34; predicted response = -41.4304 + 0.1249(559) = 28.4; residual = 34 - 28.4 = 5.6.

What do we mean by least squares?
Basic idea: minimize how far off we are when we use the line to predict y from x, as judged against the actual y. Definition: the least-squares regression line is the line that minimizes the sum of the squared residuals over all points in the data set. The sum of squared errors (SSE) is that minimum sum:

SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum (\text{residual})^2

Ex: Manatees
Least-squares regression line: ŷ = -41.4304 + 0.1249x

x   | y  | ŷ                           | Residual
447 | 13 | -41.43 + 0.1249(447) = 14.4 | 13 - 14.4 = -1.4
460 | 21 | -41.43 + 0.1249(460) = 16.0 | 21 - 16.0 = 5.0
481 | 24 | -41.43 + 0.1249(481) = 18.6 | 24 - 18.6 = 5.4

We can compute the residuals for all 14 observations in the same way. A positive residual means the observed value is higher than predicted; a negative residual means it is lower than predicted. The least-squares regression line is the line for which SSE = (-1.4)² + (5.0)² + (5.4)² + … is as small as possible.

Properties of Correlation: r
The magnitude of r indicates the strength of the linear relationship (how close are the points to the regression line?): values close to 1 or -1 indicate a strong linear relationship, while values close to 0 indicate no or a weak linear relationship. The sign of r indicates the direction of the linear association (when one variable increases, does the other generally increase or generally decrease?): r > 0 indicates a positive linear association, and r < 0 a negative one.
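Continuing the hypothetical `fit` from the sketch above, R's predict() and resid() carry out these calculations; the interval argument distinguishes the two interpretations of ŷ (mean response across years vs. one particular new year).

```r
# Sketch, continuing the hypothetical `fit` from the earlier block.
new_x <- data.frame(registrations = 500)
predict(fit, new_x)                           # point prediction at x = 500
predict(fit, new_x, interval = "confidence")  # interval for the MEAN response
predict(fit, new_x, interval = "prediction")  # interval for ONE new year

resid(fit)         # residuals: observed - predicted
sum(resid(fit)^2)  # SSE, the quantity least squares minimizes
```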

General Guidelines for Describing Strength

Value of r                 | Strength of linear relationship
-1.0 to -0.5 or 0.5 to 1.0 | Strong
-0.5 to -0.3 or 0.3 to 0.5 | Moderate
-0.3 to -0.1 or 0.1 to 0.3 | Weak
-0.1 to 0.1                | None or very weak

The table above serves only as a rule of thumb; many experts may somewhat disagree on the choice of boundaries. (Source: http://www.experiment-resources.com/statistical-correlation.html)

Manatees
[Scatterplot: Manatee Deaths vs. Powerboat Registrations (in thousands).]
r = 0.941: a very strong positive linear association.

Measuring Strength of Linear Association
[Seven scatterplots, arranged from the strongest positive linear association (left), through virtually no linear association (middle), to the strongest negative linear association (right). Their correlations: 0.994, 0.889, 0.510, -0.081, -0.450, -0.721, -0.907.]
Take a couple of minutes and try to guess the correlation for each plot. Get more practice at guessing the correlation, and impress your friends, here: http://www.rossmanchance.com/applets/guesscorrelation.html

Formula for r (p. 338 footnote)

r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)

We will have R calculate r for us, but we can still learn from the formula: r is an (almost) average of the products of the standardized (z) score for each x with the corresponding standardized (z) score for y. The formula also shows that:
- The order of x and y doesn't matter. It doesn't matter which of the two variables is called x and which is called y; the correlation doesn't care.
- Correlation is unitless. It doesn't change when the measurement units are changed, since it uses standardized observations in its calculation.
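A small sketch, again using the hypothetical vectors from the first code block, verifying that cor() matches the standardized-scores formula above:

```r
# Sketch: computing r two ways with the hypothetical vectors from above.
r_builtin <- cor(registrations, deaths)

zx <- (registrations - mean(registrations)) / sd(registrations)
zy <- (deaths - mean(deaths)) / sd(deaths)
r_formula <- sum(zx * zy) / (length(deaths) - 1)  # "almost average" of zx*zy

all.equal(r_builtin, r_formula)  # TRUE: the two agree exactly
```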

Warning!
The correlation coefficient only measures the strength and direction of a linear association. Example: let x = month (January = 1, February = 2, etc.) and y = Raleigh's average monthly temperature. Even though the relationship between month and temperature is very strong, it is curved (seasonal), so its correlation is only r = 0.257, indicating a weak linear relationship.

R-squared (Coefficient of Determination): r²
The squared correlation r² is between 0 and 1 and indicates the proportion of variation in the response (y) explained by knowing x.
SSTO = total sum of squares = the sum of squared differences between the observed y values and ȳ.
We break SSTO into two pieces, SSTO = SSE + SSR:
- SSE = sum of squared residuals (error), the unexplained part.
- SSR = sum of squares due to regression, the explained part = the sum of squared differences between the fitted values ŷ and ȳ.

New Interpretation: r²
Question: how much of the total variability in the y values (SSTO) is explained by the regression model (SSR)? That is, how much better can we predict y when we know x than when we don't?

r^2 = \frac{SSR}{SSR + SSE} = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}

(Copyright 2004 Brooks/Cole, a division of Thomson Learning, Inc.; modified by S. Hancock, Oct. 2012.)

Example: Chug Times
x = body weight (lbs); y = time to chug a 12-oz drink (sec).
[Scatterplot of ChugTime vs. Weight, with the fitted line and the horizontal line at ȳ = 5.108.]
Total variation summed over all points: SSTO = 36.6. Unexplained part summed over all points: SSE = 13.9. Explained by knowing x: SSR = 22.7.

r² = 1 - SSE/SSTO = SSR/SSTO = 22.7/36.6 = 0.62

Interpretation: 62% of the variability in chug times is explained by knowing the weight of the person.

Breakdown of Calculations
ȳ = 66.4/13 = 5.11; fitted line ŷ = 13.298 - 0.046x. First four of the 13 observations:

x   | y   | (y - ȳ)²             | ŷ    | (ŷ - ȳ)²               | (y - ŷ)²
153 | 5.6 | (5.6 - 5.11)² = 0.24 | 6.29 | (6.29 - 5.11)² = 1.40  | (5.6 - 6.29)² = 0.48
169 | 6.1 | (6.1 - 5.11)² = 0.98 | 5.56 | (5.56 - 5.11)² = 0.20  | (6.1 - 5.56)² = 0.29
178 | 3.3 | (3.3 - 5.11)² = 3.27 | 5.15 | (5.15 - 5.11)² = 0.002 | (3.3 - 5.15)² = 3.41
158 | 6.7 | (6.7 - 5.11)² = 2.53 | 6.06 | (6.06 - 5.11)² = 0.91  | (6.7 - 6.06)² = 0.41

Sums over all 13 observations: Σy = 66.4; SSTO = 36.61; SSR = 22.73; SSE = 13.88. So SSTO = SSR + SSE: 36.61 = 22.73 + 13.88.
Note: SSTO has nothing to do with the fitted model or with x; it is just a measure of the variability in y (the same quantity used in calculating the standard deviation of the y values).

Example: Manatees, r² in R output
r = 0.94, so r² = 0.89. Interpretation: about 89% of the variability in the number of manatee deaths (response variable) can be explained by the number of powerboat registrations (explanatory variable).
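The decomposition is easy to verify for any lm fit; a sketch using the hypothetical `fit` from earlier:

```r
# Sketch: the sum-of-squares decomposition for the hypothetical `fit`.
y    <- deaths
ssto <- sum((y - mean(y))^2)            # total variation in y
sse  <- sum(resid(fit)^2)               # unexplained (error)
ssr  <- sum((fitted(fit) - mean(y))^2)  # explained by the regression
c(SSTO = ssto, "SSR+SSE" = ssr + sse)   # the two agree
ssr / ssto                              # r-squared; matches summary(fit)$r.squared
```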

REGRESSION DIAGNOSTICS

What to check for? The simple linear regression model assumptions:
1. Linearity
2. Constant variance
3. Normality
4. Independence
Also: outliers? Influential points? Other predictors?

Scatterplots with a Lowess Curve
The best plot to start with is a scatterplot of Y vs. X. Add the least-squares regression line, then add a smoothed curve that is not restricted to a line and follows the general pattern of the data. The lowess curve (locally weighted regression scatterplot smoothing) is similar to a moving average: it fits least-squares lines around each neighborhood of points, makes predictions, and smooths the predictions. R function: lines(lowess(y ~ x)).

Residuals vs. Fitted Values (or Residuals vs. X)
Use this plot to check the linearity and constant-variance assumptions and to look for outliers.
Good: no pattern; random scatter with equal spread around the horizontal line at zero (the mean of the residuals).
Bad: a pattern (curvature, etc.) indicates linearity is not met; funneling (spread increasing or decreasing with X) indicates non-constant variance.
[Example plots: in the good residual plot, points look randomly scattered around zero, with no evidence of a nonlinear pattern or unequal variances. In the bad examples, the left plot shows evidence of non-constant variance and the right plot shows evidence of a nonlinear relationship.]
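A sketch of these starting plots in base R, again with the hypothetical vectors and `fit` from earlier:

```r
# Sketch: first diagnostic plots for the hypothetical `fit`.
plot(registrations, deaths)            # scatterplot of Y vs. X
abline(fit)                            # least-squares regression line
lines(lowess(deaths ~ registrations))  # smoothed lowess curve

plot(fitted(fit), resid(fit))          # residuals vs. fitted values
abline(h = 0)                          # horizontal reference line at zero
```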

Residuals vs. Time or Observation Number
If data are available on the times at which the observations were collected, a plot of residuals vs. time can help check the independence assumption.
Good: no pattern over time.
Bad: a pattern over time (e.g., increasing, decreasing, or cyclical) indicates dependence among the data points: points closer together in time are more similar than points further apart in time.

Independence Assumption
In general, the independence assumption is checked by knowledge of how the data were collected. Example: suppose we want to predict blood pressure from the amount of fat consumed in the diet. Incorrect sampling, such as sampling entire families, results in dependent observations. Why? Members of the same household tend to share diets, so their observations carry overlapping information. Use a random sample to ensure independence of observations.

Residuals vs. Another Predictor
Check whether another predictor helps explain the leftovers from our model. If there is a pattern (e.g., residuals increasing with the values of the other predictor), it may indicate that this other predictor helps predict Y in addition to the original X; try adding it to the model (multiple linear regression, introduced in Chapter 6).

Normal Probability Plot (Normal Quantile-Quantile Plot) of Residuals
This plot shows the sample quantiles (y-axis) vs. the quantiles we would expect from a standard normal distribution (x-axis). It is the best plot to use for checking the normality assumption.
Good: points follow a straight line.
Bad: points deviate in a systematic fashion from a straight line, indicating a violation of the normality assumption.
[Examples of bad and good normal probability plots; see also Figure 3.9 on p. 112.]

What to do if assumptions are not met? Two choices:
1. Abandon the simple linear regression model and use a more appropriate model (future courses).
2. Apply some transformation to the data so that the simple linear regression model is appropriate for the transformed data. We can transform X or Y or both. Make sure to run the regression diagnostics on the model fit to the transformed data to re-check the assumptions. Transformations can make interpretations more complex.
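In R, the normal probability plot of the residuals takes two lines; a sketch for the hypothetical `fit`:

```r
# Sketch: normal probability plot of the residuals for the hypothetical `fit`.
qqnorm(resid(fit))  # sample quantiles vs. standard normal quantiles
qqline(resid(fit))  # reference line; systematic deviation suggests non-normality
```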

Transformations
Nonlinear pattern with constant error variance: try transforming X. (In the illustrations, the upper plots show the data and residual plot before transformation; the lower plots show them after.)
- Residuals form an inverted U: use X' = √X or X' = log(X).
- Residuals are U-shaped and the association between X and Y is positive: use X' = X² or X' = exp(X).
- Residuals are U-shaped and the association between X and Y is negative: use X' = 1/X or X' = exp(-X).

Transformations
Non-constant error variance: try transforming Y. It may also help to transform X in addition to Y.

Box-Cox Transformations
A power transformation on Y: the transformed response is Y' = Y^λ. The most commonly used values:

λ = 2:    Y' = Y²
λ = 0.5:  Y' = √Y
λ = 0:    Y' = log(Y) (by definition)
λ = -0.5: Y' = 1/√Y
λ = -1:   Y' = 1/Y

Remember the notation: log means log base e.
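A sketch of refitting after a transformation; the particular transformations here are hypothetical choices for illustration, since the right one depends on the diagnostics above:

```r
# Sketch: refitting with transformed variables (hypothetical choices).
fit_logy  <- lm(log(deaths) ~ registrations)   # lambda = 0 applied to Y
fit_sqrtx <- lm(deaths ~ sqrt(registrations))  # square-root transform on X
# Re-run the same diagnostics (residual and QQ plots) on each new fit.
```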

Box-Cox Transformations (continued)
Find the maximum likelihood estimate of λ. The R function to find the best λ is boxcox (you need to load the MASS library). It plots the likelihood function; look at the value of λ where the likelihood is highest. It is best to choose a value of λ that is easy to interpret (e.g., choose λ = 0 rather than λ = 0.12).

Detecting Outliers and Influential Points
Outlier = an observation that does not follow the overall pattern of the data set. Outliers are easy to see in scatterplots for simple linear regression, but harder to find in multiple linear regression.
Influential point = an observation that, when removed, changes the model fit substantially (e.g., the coefficient estimates or the correlation).
Tools for detecting outliers and influential points:
- Residual plots.
- Leverage: measures how far an observation is from the mean of the predictors. High leverage = potential to be an influential point.
- Cook's distance: measures how much influence an individual observation has on the fitted coefficients (how much they change when the observation is removed). High Cook's distance = influential point.
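A sketch of these tools in R, applied to the hypothetical `fit` from earlier:

```r
# Sketch: Box-Cox and influence diagnostics for the hypothetical `fit`.
library(MASS)          # provides boxcox()
bc <- boxcox(fit)      # plots the profile log-likelihood over lambda
bc$x[which.max(bc$y)]  # lambda with the highest likelihood

hatvalues(fit)         # leverage (hat values) for each observation
cooks.distance(fit)    # Cook's distance; large values flag influential points
```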