Lecture 11: Simple Linear Regression


Readings: Sections 3.1-3.3, 11.1-11.3. Apr 17, 2009

In linear regression, we examine the association between two quantitative variables, for example, the number of beers that you drink and your blood alcohol level, or homework score and test score.

Response variable Y: the dependent variable; measures an outcome of a study.
Explanatory variable X: the independent (predictor) variable; explains or is related to changes in the response variable.

We will have pairs of observations: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.

General Procedure for Analyzing Two Quantitative Variables
1. Make a scatter plot of the data. Describe the form, direction, and strength. Look for outliers.
2. Look at the correlation to get a numerical value for the direction and strength.
3. If the data are reasonably linear, get an equation of the line using the least squares technique.
4. Look at the residual plot to see whether the assumptions of linear regression hold.
5. Perform formal inference procedures for the correlation, intercept, and slope.
The SAS sketch below shows how these steps map onto the procedures used in this lecture.
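As a roadmap, here is one way these steps translate into SAS (a minimal sketch; the dataset name mydata and variable names x and y are placeholders, and each procedure is demonstrated in detail later in the lecture):

* Steps 1-2: scatter plot and correlation;
proc gplot data=mydata;
   plot y * x;
run; quit;

proc corr data=mydata;
   var x y;
run;

* Steps 3-4: least squares fit, with predicted values
  and residuals saved for diagnostic plots;
proc reg data=mydata;
   model y = x;
   output out=fitted p=pred r=resid;
run; quit;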

Example 1: We want to examine whether the amount of rainfall per year increases or decreases corn yield (in bushels). A sample of 10 observations was taken; the amount of rainfall (in inches) was measured, as was the subsequent corn yield.

Obs.   x (Rainfall)   y (Corn Yield)
 1         3.03             80
 2         3.47             84
 3         4.21             90
 4         4.44             95
 5         4.95             97
 6         5.11            102
 7         5.63            105
 8         6.34            112
 9         6.56            115
10         6.82            115

What can we see from the scatter plot?
Form: linear, nonlinear, or no obvious pattern?
Direction: positive association, negative association, or no association?
Strength: how closely do the points follow a clear form? Strong, moderate, or weak?
Look for OUTLIERS!

Form and direction of an association: [scatter plot panels showing a linear relationship, a nonlinear relationship, and no relationship]

Strength of an association: [scatter plot panels showing a strong positive linear association and a weak positive linear association]

Note: Association (or correlation) is NOT the same thing as causation. Just because two variables are associated doesn't mean that a change in one variable causes a change in the other. The relationship between two variables might not tell the whole story; other variables may affect the relationship. These other variables are called lurking variables.

Correlation
Pearson's sample correlation r: a numerical quantity that measures the direction and strength of the linear relationship between two quantitative variables.
$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} $$
where
$$ SS_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} $$
$$ SS_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = (n-1)s_x^2 $$
$$ SS_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 = (n-1)s_y^2 $$

Example 1 (cont'd): a. What is the correlation between amount of rainfall and corn yield?

Obs.   x (Rainfall)   y (Corn Yield)      x^2       y^2        xy
 1         3.03             80          9.1809     6400     242.40
 2         3.47             84         12.0409     7056     291.48
 3         4.21             90         17.7241     8100     378.90
 4         4.44             95         19.7136     9025     421.80
 5         4.95             97         24.5025     9409     480.15
 6         5.11            102         26.1121    10404     521.22
 7         5.63            105         31.6969    11025     591.15
 8         6.34            112         40.1956    12544     710.08
 9         6.56            115         43.0336    13225     754.40
10         6.82            115         46.5124    13225     784.30
Sum       50.56            995        270.7126   100413    5175.88

$SS_{xy} = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} = 5175.88 - 10(5.056)(99.5) = 145.16$
$SS_{xx} = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = 270.7126 - 10(5.056)^2 = 15.08124$
$SS_{yy} = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 = 100413 - 10(99.5)^2 = 1410.5$
$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \frac{145.16}{\sqrt{15.08124 \times 1410.5}} = 0.99527$

Correlation in SAS

data yield;
   input rainfall yield @@;
   datalines;
3.03 80  3.47 84  4.21 90  4.44 95  4.95 97
5.11 102 5.63 105 6.34 112 6.56 115 6.82 115
;
run;

proc corr data=yield;
   var rainfall yield;
run;

The CORR Procedure

2 Variables: rainfall yield

Simple Statistics
Variable    N       Mean    Std Dev        Sum    Minimum    Maximum
rainfall   10    5.05600    1.29449   50.56000    3.03000    6.82000
yield      10   99.50000   12.51887  995.00000   80.00000  115.00000

Pearson Correlation Coefficients, N = 10
Prob > |r| under H0: Rho=0

            rainfall      yield
rainfall     1.00000    0.99527
                         <.0001
yield        0.99527    1.00000
              <.0001

Properties of Correlation
Correlation measures the strength of only a linear relationship (i.e., correlation is meaningless if the scatter plot shows a curved relationship).
The correlation r does not change if we change the units of measurement of X or Y.

The correlation r is always between -1 and 1, i.e., $-1 \le r \le 1$.
A positive r corresponds to a positive association between the variables: as X increases, Y increases.
A negative r corresponds to a negative association between the variables: as X increases, Y decreases.
Values near 0 indicate a weak linear relationship. Values close to 1 or -1 indicate a strong linear relationship.
r = 1 only when all points lie exactly on a line with positive slope; r = -1 only when all points lie exactly on a line with negative slope.

[Scatter plot panels illustrating r = 0, r = 0.3, r = 0.5, r = 0.7, r = 0.9, and r = 0.99]

If a scatter plot shows that a relationship is linear and we want to use one variable to help explain or predict the other, we can summarize the relationship between the two variables by using a regression line. In linear regression, the regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.

Example 1 (cont'd): [scatter plot of corn yield versus rainfall with the regression line]

Least Squares Regression
Least squares regression fits a straight line through the data points that minimizes the sum of the squared vertical distances of the data points from the line.
Least Squares Regression Line: $\hat{y} = b_0 + b_1 x$
$$ b_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = r\,\frac{s_y}{s_x} $$
$$ b_0 = \bar{y} - b_1 \bar{x} $$

Least Squares Regression
Example 1 (cont'd): b. What is the equation of the least squares regression line?

Obs.   x (Rainfall)   y (Corn Yield)      x^2       y^2        xy
 1         3.03             80          9.1809     6400     242.40
 2         3.47             84         12.0409     7056     291.48
 3         4.21             90         17.7241     8100     378.90
 4         4.44             95         19.7136     9025     421.80
 5         4.95             97         24.5025     9409     480.15
 6         5.11            102         26.1121    10404     521.22
 7         5.63            105         31.6969    11025     591.15
 8         6.34            112         40.1956    12544     710.08
 9         6.56            115         43.0336    13225     754.40
10         6.82            115         46.5124    13225     784.30
Sum       50.56            995        270.7126   100413    5175.88

Least Squares Regression
$b_1 = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} = \frac{145.16}{15.08124} = 9.6252$
$b_0 = \bar{y} - b_1\bar{x} = 99.5 - 9.6252(5.056) = 50.835$
The regression equation is: $\hat{y} = 50.835 + 9.6252x$.

Prediction and Residual
Prediction: We can use a regression line to predict the value of the response variable y for a specific value of the explanatory variable x. This value is called a predicted value or fitted value.
Be careful about extrapolation. While our data may provide evidence of a linear relationship between y and x, this relationship may not hold outside the range of x values actually observed. Predictions of y for values of x far outside the range you actually observed are therefore often inaccurate.

Prediction and Residual
Example 1 (cont'd): c. Predict the corn yield for
i. 5 inches of rain
ii. 0 inches of rain
iii. 100 inches of rain
iv. For which amounts of rainfall above do you think the line does a good job of predicting actual corn yield? Why? (A sketch of the computations follows.)
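As a sketch, plugging into the fitted line $\hat{y} = 50.835 + 9.6252x$:
$\hat{y}(5) = 50.835 + 9.6252(5) = 98.96$ bushels
$\hat{y}(0) = 50.835$ bushels
$\hat{y}(100) = 50.835 + 9.6252(100) = 1013.4$ bushels
Only x = 5 lies within the observed range of rainfall (3.03 to 6.82 inches), so only prediction (i) should be trusted; (ii) and (iii) are extrapolations.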

Prediction and Residual
A residual is the difference between an observed value of the response variable and the value predicted by the regression line:
$$ \text{residual} = e_i = y_i - \hat{y}_i. $$
[Figure: a data point $(x_i, y_i)$ and its fitted value $\hat{y}_i$ on the regression line]

Prediction and Residual
Example 1 (cont'd): d. Find the predicted value and residual for every observation.

Obs.   x (Rainfall)   y (Corn Yield)   ŷ (Predicted)   e_i (Residual)
 1         3.03             80             79.999            0.001
 2         3.47             84             84.234           -0.234
 3         4.21             90             91.357           -1.357
 4         4.44             95             93.571            1.429
 5         4.95             97             98.480           -1.480
 6         5.11            102            100.020            1.980
 7         5.63            105            105.025           -0.025
 8         6.34            112            111.859            0.141
 9         6.56            115            113.976            1.024
10         6.82            115            116.479           -1.479
Sum       50.56            995            995                0.000

Assessing Model Fit
Regression sum of squares (SSR): a measure of the variation in y that is explained by the linear regression of y on x.
$$ SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = b_1^2\,SS_{xx} $$
Residual/error sum of squares (SSE): a measure of the variation in y that is not explained by the linear regression of y on x.
$$ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2 $$

Assessing Model Fit
Total sum of squares (SST): a measure of the total variation in y.
$$ SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = SS_{yy} $$
$$ SST = SSR + SSE \qquad \text{(Note: } (y_i - \bar{y}) = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})\text{)} $$
[Figure: decomposition of $y_i - \bar{y}$ into $y_i - \hat{y}_i$ and $\hat{y}_i - \bar{y}$ at a point $x_i$]

Assessing Model Fit
Coefficient of determination $r^2$:
$$ r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} $$
$r^2$ is the fraction of the variation in y that can be explained by the linear regression of y on x; it measures how successfully the linear regression explains the response. $r^2$ is the square of the Pearson correlation r. (A sketch of the computation for Example 1 follows.)
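As a sketch for Example 1, using the quantities computed earlier:
$SSR = b_1^2\,SS_{xx} = (9.6252)^2(15.08124) = 1397.19$
$SSE = SST - SSR = 1410.5 - 1397.19 = 13.31$
$r^2 = \frac{1397.19}{1410.5} = 0.9906$
These match the ANOVA table and R-Square in the SAS output below, and $0.9906 = (0.99527)^2 = r^2$, as claimed.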

Assessing Model Fit
[Scatter plot panels illustrating fits with $R^2 = 0.25$, $R^2 = 0.7$, and $R^2 = 0.95$]

Assessing Model Fit
Regression in SAS

data yield;
   input rainfall yield @@;
   datalines;
3.03 80  3.47 84  4.21 90  4.44 95  4.95 97
5.11 102 5.63 105 6.34 112 6.56 115 6.82 115
;
run;

proc reg data=yield;
   model yield = rainfall;
   plot yield * rainfall;
   output out=yield1 p=pred r=resid;
run; quit;

proc print data=yield1;
run;

Assessing Model Fit

The REG Procedure
Model: MODEL1
Dependent Variable: yield

Number of Observations Read  10
Number of Observations Used  10

Analysis of Variance
                               Sum of        Mean
Source             DF        Squares      Square    F Value    Pr > F
Model               1     1397.19450  1397.19450     840.07    <.0001
Error               8       13.30550     1.66319
Corrected Total     9     1410.50000

Root MSE           1.28965    R-Square    0.9906
Dependent Mean    99.50000    Adj R-Sq    0.9894
Coeff Var          1.29613

Assessing Model Fit

Parameter Estimates
                      Parameter    Standard
Variable      DF       Estimate       Error    t Value    Pr > |t|
Intercept      1       50.83497     1.72785      29.42      <.0001
rainfall       1        9.62520     0.33209      28.98      <.0001

Assessing Model Fit
[Figure: fitted regression line over the scatter plot of yield versus rainfall]

Assessing Model Fit

Obs    rainfall    yield       pred       resid
  1        3.03       80     79.999     0.00066
  2        3.47       84     84.234    -0.23443
  3        4.21       90     91.357    -1.35708
  4        4.44       95     93.571     1.42913
  5        4.95       97     98.480    -1.47973
  6        5.11      102    100.020     1.98024
  7        5.63      105    105.025    -0.02487
  8        6.34      112    111.859     0.14124
  9        6.56      115    113.976     1.02369
 10        6.82      115    116.479    -1.47886

Statistical Model and Assumption
The model: $y = \beta_0 + \beta_1 x + \epsilon$
For any fixed value of x, the error term $\epsilon$ is assumed to follow a normal distribution with mean 0 and standard deviation $\sigma$ (Normality).
The standard deviation $\sigma$ does not vary for different values of x (Constant Variability).
The random errors $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$ associated with different observations are independent of each other (Independence).

Statistical Model and Assumption
How do we check the regression assumptions?
Normality: normal quantile plot of the residuals.
Constant variability: residual plot.
Independence: examine the way in which subjects/units were selected in the study.
Linearity: scatter plot or residual plot.
Note: It is always important to check that the assumptions of the regression model have been met to determine whether your results are valid. This is also important to do before you proceed with inference.

Residual Analysis
A residual plot is a scatter plot of the regression residuals against the explanatory variable x. The mean of the least-squares residuals is always zero: $\bar{e} = 0$.
Good plot: total randomness, no pattern, and approximately the same number of points above and below the e = 0 line.
Bad plot: an obvious pattern, such as a funnel shape or a parabola, or more points above 0 than below (or vice versa).

Residual Analysis
Example 1 (cont'd): [residual plot of resid versus rainfall]

Residual Analysis

Residual Analysis
SAS Code for Residual Analysis

proc reg data=yield;
   model yield = rainfall;
   output out=yield1 p=pred r=resid;
run; quit;

proc gplot data=yield1;
   plot resid * rainfall / vref=0 cvref=red lvref=2;
run; quit;

proc univariate data=yield1;
   qqplot resid / normal(l=1 mu=est sigma=est);
run;

Residual Analysis: nonlinear relationship [scatter plot and residual plot]

Residual Analysis: nonconstant variance [scatter plot and residual plot]

Residual Analysis: non-normal error [scatter plot, residual plot, and normal quantile plot]

Population parameters in linear regression:
$\rho$: population correlation, estimated by the Pearson correlation r.
$\beta_0$: population intercept, estimated by $b_0$.
$\beta_1$: population slope, estimated by $b_1$.
$\sigma$: population standard deviation of the random errors, estimated by
$$ s = \sqrt{\frac{SSE}{n-2}}. $$
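As a check for Example 1 (a sketch using the ANOVA table above):
$$ s = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{13.3055}{8}} = \sqrt{1.66319} = 1.28965, $$
which matches the Root MSE reported by PROC REG.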

Inference about $\beta_1$
Sampling distribution of $b_1$: $b_1$ is normally distributed with
mean $\mu_{b_1} = \beta_1$
standard deviation $\sigma_{b_1} = \frac{\sigma}{\sqrt{SS_{xx}}}$, estimated by $s_{b_1} = \frac{s}{\sqrt{SS_{xx}}}$.
The standardized variable $t = \frac{b_1 - \beta_1}{s_{b_1}}$ has a t distribution with degrees of freedom df = n - 2.

Inference about $\beta_1$
Confidence interval for $\beta_1$: $b_1 \pm t_{\alpha/2,\,n-2}\,s_{b_1}$
Hypothesis test about $\beta_1$:
Hypotheses: $H_0: \beta_1 = 0$ versus $H_a: \beta_1 > 0$, $H_a: \beta_1 < 0$, or $H_a: \beta_1 \neq 0$.
The t test statistic is $t = \frac{b_1}{s_{b_1}}$.
The test statistic has a t distribution with n - 2 degrees of freedom if $H_0$ is true. The P-value and rejection region can be computed as with previous t-tests.
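In SAS, confidence limits for the regression coefficients can be requested directly in PROC REG (a minimal sketch; the CLB option prints confidence limits for the parameter estimates, and ALPHA= sets the level):

proc reg data=yield;
   model yield = rainfall / clb alpha=0.05;
run; quit;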

Inference about $\beta_1$
Example 1 (cont'd): e. Construct the 95% confidence interval for $\beta_1$. (We have previously found that $\hat{y} = 50.835 + 9.6252x$, $SS_{xx} = 15.08124$, and $s = 1.28965$.) A sketch of the computation follows.
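As a sketch:
$$ s_{b_1} = \frac{s}{\sqrt{SS_{xx}}} = \frac{1.28965}{\sqrt{15.08124}} = 0.33209 $$
With $t_{0.025,\,8} = 2.306$, the 95% confidence interval is
$$ 9.6252 \pm 2.306(0.33209) = 9.6252 \pm 0.7658 = (8.859,\ 10.391). $$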

Inference about $\beta_1$
f. Does the amount of rainfall have a linear relationship with corn yield? Perform a hypothesis test using $\alpha = 0.05$.

Inference about $\beta_1$
g. Does the amount of rainfall have a positive linear relationship with corn yield? Perform a hypothesis test using $\alpha = 0.05$.
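A sketch of both tests, using the values from the Parameter Estimates table:
For (f), test $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$:
$$ t = \frac{b_1}{s_{b_1}} = \frac{9.6252}{0.33209} = 28.98 $$
Since $|28.98| > t_{0.025,\,8} = 2.306$ (P-value < .0001), reject $H_0$: rainfall has a significant linear relationship with corn yield.
For (g), test $H_0: \beta_1 = 0$ versus $H_a: \beta_1 > 0$: the same statistic $t = 28.98 > t_{0.05,\,8} = 1.860$, so reject $H_0$ and conclude a positive linear relationship.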

Inference about $\rho$
Hypothesis test about $\rho$:
Hypotheses: $H_0: \rho = 0$ versus $H_a: \rho > 0$, $H_a: \rho < 0$, or $H_a: \rho \neq 0$.
The t test statistic is
$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} $$
The test statistic has a t distribution with n - 2 degrees of freedom if $H_0$ is true.
Note: The test statistic for correlation is numerically identical to the test statistic used to test the slope. The P-value and rejection region can be computed as with previous t-tests.

Inference about $\rho$
Example 1 (cont'd): h. Do amount of rainfall and corn yield have a positive correlation? Perform a hypothesis test using $\alpha = 0.05$. (We have previously found that $r = 0.99527$.)
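As a sketch, testing $H_0: \rho = 0$ versus $H_a: \rho > 0$:
$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.99527\sqrt{8}}{\sqrt{1-0.99056}} = 28.98, $$
identical to the slope test statistic, as noted above. Since $28.98 > t_{0.05,\,8} = 1.860$, reject $H_0$: rainfall and corn yield have a significant positive correlation.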

Example 2: Twenty plots, each 10 × 4 meters, were randomly chosen in a large field of corn. For each plot, the plant density (number of plants in the plot) and the mean cob weight (gm of grain per cob) were observed. The results are given in the table.

Plant Density   Cob Weight     Plant Density   Cob Weight
     137            212             173            235
     107            241             124            241
     132            215             157            196
     135            225             184            193
     115            250             112            224
     103            241              80            257
     102            237             165            200
      65            282             160            190
     149            206             157            208
      85            246             119            224

Preliminary calculations yield the following results:
$\bar{x} = 128.05$, $\bar{y} = 224.1$, $SS_{xx} = 20208.95$, $SS_{yy} = 11831.8$, $SS_{xy} = -14563.1$, $SSE = 1337.3$

a. Calculate the linear regression line of cob weight on plant density.
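A sketch of the computation:
$$ b_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{-14563.1}{20208.95} = -0.7206 $$
$$ b_0 = \bar{y} - b_1\bar{x} = 224.1 - (-0.7206)(128.05) = 316.38 $$
so the fitted line is $\hat{y} = 316.38 - 0.7206x$.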

b. Plot the data and draw the regression line on the graph.

c. What percent of variation in y can be explained by the linear regression line?
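As a sketch: $r^2 = 1 - \frac{SSE}{SST} = 1 - \frac{1337.3}{11831.8} = 0.887$, so about 88.7% of the variation in cob weight is explained by the linear regression on plant density.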

d. What is the correlation between plant density and cob weight?
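As a sketch: $r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \frac{-14563.1}{\sqrt{(20208.95)(11831.8)}} = -0.942$ (equivalently, $-\sqrt{0.887}$; negative because the slope is negative).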

e. If there is an additional plot with plant density 125, how much do you expect the cobs to weigh?
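As a sketch: $\hat{y}(125) = 316.38 - 0.7206(125) = 226.3$ gm of grain per cob.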

f. Construct a 99% confidence interval for the population regression slope β 1.
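A sketch of the computation:
$$ s = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{1337.3}{18}} = 8.619, \qquad s_{b_1} = \frac{s}{\sqrt{SS_{xx}}} = \frac{8.619}{\sqrt{20208.95}} = 0.0606 $$
With $t_{0.005,\,18} = 2.878$, the 99% confidence interval is
$$ -0.7206 \pm 2.878(0.0606) = (-0.895,\ -0.546). $$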

g. Is there a linear association between plant density and cob weight? Test this hypothesis using α = 0.01.
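As a sketch, testing $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$:
$$ t = \frac{b_1}{s_{b_1}} = \frac{-0.7206}{0.0606} = -11.89 $$
Since $|-11.89| > t_{0.005,\,18} = 2.878$, reject $H_0$ at $\alpha = 0.01$: there is a significant linear association between plant density and cob weight.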