Simple Linear Regression. (Chs 12.1, 12.2, 12.4, 12.5)


Simple Linear Regression

[Scatter plot, repeated across three slides: Rating (y-axis, 20 to 80) versus Sugar (x-axis, 0 to 15); the third slide marks a fixed value x on the Sugar axis.]

The Simple Linear Regression Model

The simplest mathematical relationship between two variables x and y is a linear relationship: y = β₀ + β₁x. The objective of this section is to develop an equivalent linear probabilistic model. What does that mean? If two random variables are probabilistically related, then for a fixed value of x there is still uncertainty in the value of the second variable. So we assume

Y = β₀ + β₁x + ε,

where ε is a random variable.

The Simple Linear Regression Model

Definition (The Simple Linear Regression Model): There are parameters β₀, β₁, and σ² such that for any fixed value of the independent variable x, the dependent variable is a random variable related to x through the model equation

Y = β₀ + β₁x + ε

The quantity ε in the model equation is the error term, a random variable assumed to be distributed as ε ~ N(0, σ²).

The Simple Linear Regression Model

X: the independent, predictor, or explanatory variable (usually known). NOT RANDOM.

Y: the dependent or response variable. For fixed x, Y IS RANDOM.

ε: the random deviation or random error term. For fixed x, ε IS RANDOM.

What exactly does ε do?

The Simple Linear Regression Model

The points (x₁, y₁), …, (xₙ, yₙ) resulting from n independent observations will then be scattered about the true regression line.

[Figure: observed points scattered about the true regression line.]
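To make the roles of x, ε, and Y concrete, here is a small R simulation of the model; the parameter values are illustrative, not from the text:

```r
set.seed(1)
n     <- 50
x     <- runif(n, 0, 15)          # predictor values, treated as fixed
beta0 <- 60; beta1 <- -2          # illustrative true parameters
sigma <- 5
eps   <- rnorm(n, 0, sigma)       # random deviation, eps ~ N(0, sigma^2)
y     <- beta0 + beta1 * x + eps  # Y is random only through eps

plot(x, y)                        # points scatter about the true line
abline(beta0, beta1)              # the true regression line
```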

The Simple Linear Regression Model

How do we know simple linear regression is appropriate?
- Theoretical considerations
- Scatterplots

The Simple Linear Regression Model

Interpreting the parameters:

β₀ (the intercept of the true regression line): the average value of Y when x is zero, usually called the baseline average.

β₁ (the slope of the true regression line): the average change in Y associated with a 1-unit increase in the value of x.

The Error Term

[Figure: (a) distribution of ε; (b) distribution of Y for different values of x.]

Homoscedasticity: εᵢ ~ N(0, σ²). We assume the variance (amount of variability) of the distribution of Y values to be the same at each different fixed value of x (i.e., the homogeneity-of-variance assumption). The variance parameter σ² determines the extent to which each normal curve spreads out about the regression line.

Estimating Model Parameters

The estimated variance of our model, s², will be smallest when the differences between the estimated line and each observed point are smallest. This is our goal: minimize s². We use our sample data, which consist of n observed pairs (x₁, y₁), …, (xₙ, yₙ), to estimate the regression line.

Important: the data pairs are assumed to have been obtained independently of one another.

Estimating Model Parameters

The best fit line is motivated by the principle of least squares, which can be traced back to the German mathematician Gauss (1777–1855): a line provides the best fit to the data if the sum of the squared vertical distances (deviations) from the observed points to that line is as small as it can be.

Estimating Model Parameters

The sum of squared vertical deviations from the points (x₁, y₁), …, (xₙ, yₙ) to a line y = b₀ + b₁x is

$$Q(b_0, b_1) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left[ y_i - (b_0 + b_1 x_i) \right]^2$$

The point estimates of β₀ and β₁, denoted by β̂₀ and β̂₁, are called the least squares estimates: they are the values that minimize Q(b₀, b₁).

Estimating Model Parameters

The fitted regression line or least squares line is then the line whose equation is ŷ = β̂₀ + β̂₁x. The minimizing values of b₀ and b₁ are found by taking partial derivatives of Q with respect to both b₀ and b₁, equating them both to zero [this is regular old calculus!], and solving the two equations in two unknowns. Let's do that now.
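Carrying that out (the standard derivation, written out here for completeness) gives the two normal equations:

$$\frac{\partial Q}{\partial b_0} = -2\sum_{i=1}^{n}\left[y_i - (b_0 + b_1 x_i)\right] = 0,
\qquad
\frac{\partial Q}{\partial b_1} = -2\sum_{i=1}^{n} x_i \left[y_i - (b_0 + b_1 x_i)\right] = 0$$

Solving the first equation for b₀ gives b₀ = ȳ − b₁x̄; substituting that into the second equation and solving for b₁ yields

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$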

Estimating Model Parameters

There are shortcut notations we use to express the estimates of the parameters:

$$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2$$

$$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)$$

So we frequently write:

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}$$

Your book doesn't use this notation, but we're going to because it's very convenient and commonly used.

Example

The cetane number is a critical property in specifying the ignition quality of a fuel used in a diesel engine. Determination of this number for a biodiesel fuel is expensive and time-consuming. The article "Relating the Cetane Number of Biodiesel Fuels to Their Fatty Acid Composition: A Critical Study" (J. of Automobile Engr., 2009: 565–583) included data on x = iodine value (g) and y = cetane number for a sample of 14 biofuels (see next slide).

Example

The iodine value (x) is the amount of iodine necessary to saturate a sample of 100 g of oil. The article's authors fit the simple linear regression model to these data, so let's do the same. Calculating the relevant statistics gives

Σxᵢ = 1307.5, Σyᵢ = 779.2, Σxᵢ² = 128,913.93, Σyᵢ² = 43,745.22, Σxᵢyᵢ = 71,347.30,

from which

Sxx = 128,913.93 − (1307.5)²/14 = 6802.7693
Sxy = 71,347.30 − (1307.5)(779.2)/14 = −1424.41429
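From these sums the least squares estimates follow directly; a quick R check of the arithmetic (all numbers taken from the slide):

```r
# Least squares estimates for the cetane example
n  <- 14
Sx <- 1307.5; Sy <- 779.2
Sxx <- 128913.93 - Sx^2 / n      # 6802.7693
Sxy <- 71347.30 - Sx * Sy / n    # -1424.41429
b1  <- Sxy / Sxx                 # slope: about -0.209
b0  <- Sy / n - b1 * Sx / n      # intercept: about 75.21
c(b0 = b0, b1 = b1)
```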

Example

[Figure: scatter plot of the data with the least squares line superimposed.]

Fitted Values

Fitted values: the fitted (or predicted) values ŷ₁, …, ŷₙ are obtained by substituting x₁, …, xₙ into the equation of the estimated regression line: ŷᵢ = β̂₀ + β̂₁xᵢ.

Residuals: the differences êᵢ = yᵢ − ŷᵢ between the observed and fitted y values.

Residuals are estimates of the true errors. WHY?

Example

We now calculate the first three fitted values and residuals from the cetane example.

Estimating σ² and σ

The parameter σ² determines the amount of spread about the true regression line.

[Figure: two separate examples with different amounts of spread about the true regression line.]

Estimating σ² and σ

An estimate of σ² will be used in the confidence interval (CI) formulas and hypothesis-testing procedures presented in the next two sections. Many large deviations (residuals) suggest a large value of σ², whereas deviations that are all small in magnitude suggest that σ² is small.

Estimating σ² and σ

The error sum of squares (equivalently, residual sum of squares), denoted by SSE, is

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

and the estimate of σ² is

$$\hat{\sigma}^2 = s^2 = \frac{SSE}{n-2} = \frac{\sum (y_i - \hat{y}_i)^2}{n-2} = \frac{1}{n-2}\sum_{i=1}^{n} \hat{e}_i^2$$

How does the homoscedasticity assumption come into play here?

Estimating σ² and σ

The divisor n − 2 in s² is the number of degrees of freedom (df) associated with SSE and the estimate s². This is because to obtain s², the two parameters β₀ and β₁ must first be estimated, which results in a loss of 2 df (just as µ had to be estimated in one-sample problems, resulting in an estimated variance based on n − 1 df in our previous t tests). Replacing each yᵢ in the formula for s² by the rv Yᵢ gives the estimator S². It can be shown that the rv S² is an unbiased estimator for σ².

Example

Suppose we have the following data on filtration rate (x) versus moisture content (y). Relevant summary quantities (summary statistics) are

Σxᵢ = 2817.9, Σyᵢ = 1574.8, Σxᵢ² = 415,949.85, Σxᵢyᵢ = 222,657.88, Σyᵢ² = 124,039.58,

from which Sxx = 18,921.8295 and Sxy = 776.434.

Example

The corresponding error sum of squares (SSE) is

SSE = (.200)² + (.188)² + … + (1.099)² = 7.968

The estimate of σ² is then s² = 7.968/(20 − 2) = .4427, and the estimated standard deviation is s = √.4427 = .665.

How do we interpret s? Roughly speaking, 0.665 is the magnitude of a typical deviation from the estimated regression line: some points are closer to the line than this and others are farther away.

Estimating σ² and σ

Computing SSE from the definition requires calculating every fitted value and residual, which is tedious and would realistically produce calculation errors if done by hand. The following shortcut formula does not require these quantities:

SSE = Syy − Sxy²/Sxx

This expression results from substituting ŷᵢ = β̂₀ + β̂₁xᵢ into SSE, squaring the summand, carrying the sum through the resulting three terms, and simplifying.
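Applying the shortcut to the filtration example (sums from the previous slides), a quick R check:

```r
# Shortcut SSE, s^2, and s for the filtration example
n   <- 20
Sy  <- 1574.8
Syy <- 124039.58 - Sy^2 / n       # about 39.83
Sxx <- 18921.8295; Sxy <- 776.434
SSE <- Syy - Sxy^2 / Sxx          # about 7.97, matching the direct sum
s2  <- SSE / (n - 2)              # about 0.443
s   <- sqrt(s2)                   # about 0.665
c(SSE = SSE, s2 = s2, s = s)
```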

The Coefficient of Determination

[Figure: different variability in observed y values. Using the linear model to explain y variation: (a) data for which all variation is explained; (b) data for which most variation is explained; (c) data for which little variation is explained.]

The Coefficient of Determination

The error sum of squares SSE can be interpreted as a measure of how much variation in y is left unexplained by the model, that is, how much cannot be attributed to a linear relationship. In plot (a), SSE = 0 and there is no unexplained variation; unexplained variation is small for plot (b) and large for plot (c). A quantitative measure of the total amount of variation in the observed y values is given by the total sum of squares

$$SST = S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

The Coefficient of Determination

The total sum of squares is the sum of squared deviations about the sample mean of the observed y values when no predictors are taken into account. Thus the same number ȳ is subtracted from each yᵢ in SST, whereas SSE involves subtracting each different predicted value ŷᵢ from the corresponding observed yᵢ. SST is, in some sense, as bad as SSE can get: if there is no regression model (i.e., the slope is 0), then

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\;\Rightarrow\;
\hat{y} = \hat{\beta}_0 + \underbrace{\hat{\beta}_1}_{=0} x = \hat{\beta}_0 = \bar{y}$$

which motivates the definition of SST.

The Coefficient of Determination

The sum of squared deviations about the least squares line is smaller than the sum of squared deviations about any other line, so SSE < SST unless the horizontal line itself is the least squares line. The ratio SSE/SST is the proportion of total variation that cannot be explained by the simple linear regression model, and

r² = 1 − SSE/SST

(a number between 0 and 1) is the proportion of observed y variation explained by the model. The higher the value of r², the more successful the simple linear regression model is in explaining y variation.

Example

Let's calculate the coefficient of determination r² for the two previous examples we've done.

Iodine Example

The iodine value (x) is the amount of iodine necessary to saturate a sample of 100 g of oil.

Σxᵢ = 1307.5, Σyᵢ = 779.2, Σxᵢ² = 128,913.93, Σyᵢ² = 43,745.22, Σxᵢyᵢ = 71,347.30,
Sxx = 6802.7693, Sxy = −1424.41429

Filtration Example

Filtration rate (x) versus moisture content (y). Relevant summary quantities (summary statistics) are

Σxᵢ = 2817.9, Σyᵢ = 1574.8, Σxᵢ² = 415,949.85, Σxᵢyᵢ = 222,657.88, Σyᵢ² = 124,039.58,
from which Sxx = 18,921.8295 and Sxy = 776.434.
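Using the shortcut SSE = Syy − Sxy²/Sxx, both r² values can be computed in a few lines of R (sums from the slides above):

```r
# Coefficient of determination via the shortcut formula
r_squared <- function(Syy, Sxy, Sxx) 1 - (Syy - Sxy^2 / Sxx) / Syy

# Iodine example: Syy = 43745.22 - 779.2^2 / 14
r_squared(43745.22 - 779.2^2 / 14, -1424.41429, 6802.7693)  # about 0.79

# Filtration example: Syy = 124039.58 - 1574.8^2 / 20
r_squared(124039.58 - 1574.8^2 / 20, 776.434, 18921.8295)   # about 0.80
```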

Inferences About the Slope Parameter β₁

In virtually all of our inferential work thus far, the notion of sampling variability has been pervasive. Properties of sampling distributions of various statistics have been the basis for developing confidence interval formulas and hypothesis-testing methods. The same idea applies here: the value of any quantity calculated from sample data (which is random) will vary from one sample to another.

Inferences About the Slope Parameter β₁

The estimators are

$$\hat{\beta}_1 = \frac{S_{xY}}{S_{xx}} = \frac{\sum (x_i - \bar{x})(Y_i - \bar{Y})}{S_{xx}} = \sum_{i=1}^{n} \frac{x_i - \bar{x}}{S_{xx}}\, Y_i$$

That is, β̂₁ is a linear function of the independent rvs Y₁, Y₂, …, Yₙ, each of which is normally distributed. Similarly, we have the estimator

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$$

and

$$\hat{\sigma}^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{n-2}$$

Inferences About the Slope Parameter β₁

Invoking the properties of linear functions of random variables discussed earlier leads to the following results:

1. E(β̂₁) = β₁, so β̂₁ is an unbiased estimator of β₁.

2. The variance and standard deviation of β̂₁ are V(β̂₁) = σ²/Sxx and σ_β̂₁ = σ/√Sxx.

3. Replacing σ by its estimate s gives the estimated standard deviation s_β̂₁ = s/√Sxx.

4. The estimator β̂₁ has a normal distribution (why?).

How is all this information helpful to us?
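These facts can be checked by simulation; a small R sketch with illustrative parameter values (not from the text):

```r
# Simulate the sampling distribution of the slope estimator
set.seed(2)
n     <- 30
x     <- runif(n, 0, 10)              # fixed design points
Sxx   <- sum((x - mean(x))^2)
beta0 <- 2; beta1 <- 0.5; sigma <- 1  # illustrative true values

b1hat <- replicate(5000, {
  y <- beta0 + beta1 * x + rnorm(n, 0, sigma)
  sum((x - mean(x)) * (y - mean(y))) / Sxx
})

c(mean = mean(b1hat),          # close to beta1 (unbiasedness)
  sd = sd(b1hat),              # close to the theoretical value
  theory = sigma / sqrt(Sxx))
```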

Inference About β₁

Theorem: The assumptions of the simple linear regression model imply that the standardized variable

$$T = \frac{\hat{\beta}_1 - \beta_1}{S_{\hat{\beta}_1}} = \frac{\hat{\beta}_1 - \beta_1}{S/\sqrt{S_{xx}}}$$

has a t distribution with n − 2 df (since s estimates σ).

A Confidence Interval for β₁

As in the derivation of previous CIs, we begin with a probability statement:

$$P\left(-t_{\alpha/2,\,n-2} < \frac{\hat{\beta}_1 - \beta_1}{S_{\hat{\beta}_1}} < t_{\alpha/2,\,n-2}\right) = 1 - \alpha$$

Manipulation of the inequalities inside the parentheses to isolate β₁, and substitution of estimates in place of the estimators, gives the CI formula. A 100(1 − α)% CI for the slope β₁ of the true regression line is

$$\hat{\beta}_1 \pm t_{\alpha/2,\,n-2} \cdot s_{\hat{\beta}_1}$$

Example

Variations in clay brick masonry weight have implications not only for structural and acoustical design but also for design of heating, ventilating, and air conditioning systems. The article "Clay Brick Masonry Weight Variation" (J. of Architectural Engr., 1996: 135–137) gave a scatter plot of y = mortar dry density (lb/ft³) versus x = mortar air content (%) for a sample of mortar specimens, from which the following representative data was read:

Example, cont'd

The scatter plot of these data in Figure 12.14 certainly suggests the appropriateness of the simple linear regression model; there appears to be a substantial negative linear relationship between air content and density, one in which density tends to decrease as air content increases.

[Figure 12.14: scatter plot of the data from this example.]

Example, cont'd

The values of the summary statistics required for calculation of the least squares estimates are

Σxᵢ = 218.1, Σyᵢ = 1693.6, Σxᵢyᵢ = 24,252.54, Σxᵢ² = 3577.01, Σyᵢ² = 191,672.90; n = 15

What is r² and how is it interpreted? What is the 95% confidence interval for the slope?
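One way to answer both questions is to compute everything from the sums above; a sketch in R (the numerical comments are approximate):

```r
# Mortar example: r-squared and 95% CI for the slope
n   <- 15
Sx  <- 218.1; Sy <- 1693.6
Sxx <- 3577.01   - Sx^2 / n         # about 405.84
Sxy <- 24252.54  - Sx * Sy / n      # about -372.40
Syy <- 191672.90 - Sy^2 / n         # about 454.17

b1  <- Sxy / Sxx                    # slope: about -0.918
b0  <- Sy / n - b1 * Sx / n         # intercept: about 126.25
SSE <- Syy - Sxy^2 / Sxx
r2  <- 1 - SSE / Syy                # about 0.75

s     <- sqrt(SSE / (n - 2))        # about 2.94
se_b1 <- s / sqrt(Sxx)              # about 0.146
b1 + c(-1, 1) * qt(0.975, n - 2) * se_b1   # CI: about (-1.23, -0.60)
```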

Hypothesis-Testing Procedures

The most commonly encountered pair of hypotheses about β₁ is H₀: β₁ = 0 versus Hₐ: β₁ ≠ 0. When this null hypothesis is true, µ_{Y·x} = β₀, independent of x. Then knowledge of x gives no information about the value of the dependent variable.

Null hypothesis: H₀: β₁ = β₁₀

Test statistic value: t = (β̂₁ − β₁₀)/s_β̂₁ (the "t ratio")

Hypothesis-Testing Procedures

Alternative Hypothesis | Rejection Region for a Level α Test
Hₐ: β₁ > β₁₀ | t ≥ t_{α, n−2}
Hₐ: β₁ < β₁₀ | t ≤ −t_{α, n−2}
Hₐ: β₁ ≠ β₁₀ | either t ≥ t_{α/2, n−2} or t ≤ −t_{α/2, n−2}

A P-value based on n − 2 df can be calculated just as was done previously for t tests. If H₀: β₁ = 0, then the test statistic is the t ratio t = β̂₁/s_β̂₁.
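Continuing the mortar example, the t ratio and two-sided P-value for H₀: β₁ = 0 follow directly (reusing b1, se_b1, and n from the earlier sketch):

```r
# t ratio and two-sided P-value for H0: beta1 = 0 (mortar example)
t_ratio <- b1 / se_b1                                    # about -6.3
p_value <- 2 * pt(abs(t_ratio), df = n - 2, lower.tail = FALSE)
c(t = t_ratio, p = p_value)   # tiny P-value: reject H0 at usual levels
```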

Regression in R
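The R output on the original slide did not survive transcription; here is a minimal sketch of the standard workflow, with illustrative data (the variable names and values are assumptions, not from the text):

```r
# Illustrative data; any (x, y) pairs would do
set.seed(3)
x <- runif(25, 0, 15)
y <- 75 - 0.2 * x + rnorm(25, 0, 2)

fit <- lm(y ~ x)   # least squares fit

summary(fit)       # estimates, s (residual standard error), r^2, t tests
confint(fit)       # 95% CIs for beta0 and beta1
coef(fit)          # least squares estimates b0hat, b1hat
fitted(fit)        # fitted values
resid(fit)         # residuals
rstandard(fit)     # standardized residuals

plot(x, y)
abline(fit)        # scatter plot with the least squares line
```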

Residuals and Standardized Residuals

The standardized residuals are given by

$$e_i^{*} = \frac{y_i - \hat{y}_i}{s\sqrt{1 - \dfrac{1}{n} - \dfrac{(x_i - \bar{x})^2}{S_{xx}}}}$$

If, for example, a particular standardized residual is 1.5, then the residual itself is 1.5 (estimated) standard deviations larger than what would be expected from fitting the correct model.

Diagnostic Plots

The basic plots that many statisticians recommend for an assessment of model validity and usefulness are the following:

1. eᵢ* (or eᵢ) on the vertical axis versus xᵢ on the horizontal axis

2. eᵢ* (or eᵢ) on the vertical axis versus ŷᵢ on the horizontal axis

3. ŷᵢ on the vertical axis versus yᵢ on the horizontal axis

4. A histogram of the standardized residuals

Diagnostic Plots

Plots 1 and 2 are called residual plots (against the independent variable and the fitted values, respectively), whereas Plot 3 plots fitted against observed values. Provided that the model is correct, neither residual plot should exhibit distinct patterns. The residuals should be randomly distributed about 0 according to a normal distribution, so all but a very few standardized residuals should lie between −2 and +2 (i.e., all but a few residuals within 2 standard deviations of their expected value 0). If Plot 3 yields points close to the 45° line [slope +1 through (0, 0)], then the estimated regression function gives accurate predictions of the values actually observed.
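The four plots are easy to produce in R; a sketch assuming a fitted model fit and data x, y as in the earlier code:

```r
e_std <- rstandard(fit)  # standardized residuals
y_hat <- fitted(fit)     # fitted values

par(mfrow = c(2, 2))
plot(x, e_std)                 # 1. standardized residuals vs x
plot(y_hat, e_std)             # 2. standardized residuals vs fitted values
plot(y, y_hat); abline(0, 1)   # 3. fitted vs observed, with 45-degree line
hist(e_std)                    # 4. histogram of standardized residuals
```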

Example (Plot Types #2 and #3)

[Figures: standardized residuals versus fitted values, and fitted versus observed values.]

Heteroscedasticity

The residual plot below suggests that, although a straight-line relationship may be reasonable, the assumption that V(Yᵢ) = σ² for each i is of doubtful validity. Using advanced methods like weighted least squares (WLS), or more advanced models, is recommended for inference.

[Figure: residual plot whose spread changes with x, indicating nonconstant variance.]
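For completeness, a minimal sketch of weighted least squares in R, reusing x and y from the earlier code; the weights are an illustrative assumption (variance taken to grow proportionally with x), not a prescription from the text:

```r
# Weighted least squares: downweight observations with larger variance.
# Assume (illustratively) Var(Y_i) is proportional to x_i, so weight by 1/x_i.
fit_wls <- lm(y ~ x, weights = 1 / x)
summary(fit_wls)
```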