STAT 511 Lecture: Simple Linear Regression. Devore: Sections 12.1-12.4. Prof. Michael Levine. December 3, 2018.



Simple linear regression investigates a relationship between two variables that is not deterministic. The variable whose value is fixed by the experimenter is called the independent, predictor, or explanatory variable. For fixed x, the second variable is random; it is referred to as the dependent or response variable. The data are usually given as n pairs (x_1, y_1), ..., (x_n, y_n). A scatter plot gives a good indication of the nature of the relationship between the two variables.

A Linear Model The usual linear regression model is Y = β0 + β1x + ε, where ε ~ N(0, σ²). Here ε is the random error term, or random deviation, while the line y = β0 + β1x is called the true (or population) regression line. This model assumes that E(Y) = β0 + β1x, whereas the deterministic model assumes that y = β0 + β1x exactly.

Illustration

Appropriateness of the linear regression model Sometimes, such a model is suggested by physical considerations. More commonly, it is simply suggested by an inspection of a scatter plot.

Implications of the linear regression model Let x* be a specific value of x; the corresponding conditional mean and variance are E(Y | x*) and V(Y | x*). E.g., x* is the age of a child and Y is the size of the child's vocabulary. The meaning: E(Y | x*) = β0 + β1x* and V(Y | x*) = σ².

Implications of the linear regression model Thus, the line y = β0 + β1x is the line of mean values. The slope β1 is the expected change in E(Y) for a one-unit change in x. σ² does not depend on x, so the amount of variability in Y stays the same for all x.

Illustration

Example The relationship between applied stress x and time-to-failure y is described by the simple linear regression model with true regression line y = 65 − 1.2x and σ = 8. For any fixed value x* of stress, time-to-failure has a normal distribution with mean value 65 − 1.2x* and standard deviation 8. For x* = 20, E(Y | x* = 20) = 65 − 1.2 · 20 = 41. Thus, e.g., P(Y > 50 | x* = 20) = P(Z > (50 − 41)/8) = 1 − Φ(1.13) = .1292.
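This probability computation can be checked with a short script. The line y = 65 − 1.2x and σ = 8 are taken from the example; Φ is evaluated through the standard library's error function rather than a normal table, so the answer differs slightly from the table-based .1292.

```python
import math

def normal_cdf(z):
    # Standard normal CDF via the error function (standard library only)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# True regression line y = 65 - 1.2x with sigma = 8, from the example
beta0, beta1, sigma = 65.0, -1.2, 8.0
x = 20
mean_y = beta0 + beta1 * x                  # E(Y | x = 20) = 41
p = 1 - normal_cdf((50 - mean_y) / sigma)   # P(Y > 50 | x = 20), about .13
print(mean_y, round(p, 4))
```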

Illustration

Example Suppose that Y1 denotes an observation on time-to-failure made with x = 25 and Y2 denotes an independent observation made with x = 24. Then Y1 − Y2 is normally distributed with E(Y1 − Y2) = β1(25 − 24) = β1 = −1.2 and V(Y1 − Y2) = 2σ² = 128, so its standard deviation is √128 = 11.314. Thus, P(Y1 − Y2 > 0) = P(Z > (0 − (−1.2))/11.314) = P(Z > .11) = .4562. Even though the slope is negative, it is not inconceivable that Y1 > Y2.

Estimating Model Parameters The usual way to estimate the parameters of a linear regression model is the least squares approach suggested by Gauss. The vertical deviation of the point (x_i, y_i) from a line y = b0 + b1x is y_i − (b0 + b1x_i). The estimates β̂0 and β̂1 are the values of b0 and b1 that minimize Σ_{i=1}^n [y_i − (b0 + b1x_i)]². β̂0 and β̂1 are called the least squares estimates. The estimated regression line is y = β̂0 + β̂1x. Example: dependence between the cetane number and the iodine value of biodiesel fuel.

System of normal equations nb0 + b1(Σ x_i) = Σ y_i and b0(Σ x_i) + b1(Σ x_i²) = Σ x_iy_i. If not all x_i are identical, there is a unique solution: the least squares estimates.

Solutions b1 = β̂1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² = Sxy/Sxx and b0 = β̂0 = ȳ − β̂1x̄.
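As an illustration (the data below are made up, not from the slides), the closed-form solution can be coded directly:

```python
def least_squares(xs, ys):
    # Closed-form least squares estimates: b1 = Sxy/Sxx, b0 = ybar - b1*xbar
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# On exactly linear data the estimates recover the line y = 1 + 2x
b0, b1 = least_squares([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(b0, b1)  # 1.0 2.0
```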

Example The cetane number is a critical property in specifying the ignition quality of a fuel used in a diesel engine. Determination of this number for a biodiesel fuel is expensive and time-consuming. The iodine value is the amount of iodine necessary to saturate a sample of 100 g of oil. We have x = iodine value (g) and y = cetane number for a sample of 14 biofuels.

Example β̂1 = −.20938742. Thus, the expected change in true average cetane number associated with a 1 g decrease in iodine value is about .209. The estimated intercept is β̂0 = ȳ − β̂1x̄ = 75.212432.

Example

Estimating σ² Estimating σ² is needed to obtain confidence intervals and/or test hypotheses about the coefficients of the regression model. The fitted values are ŷ1 = β̂0 + β̂1x_1, ..., ŷn = β̂0 + β̂1x_n, and the residuals are y_1 − ŷ1, ..., y_n − ŷn. The residuals are needed to estimate the variance of the errors; specifically, σ̂² = SSE/(n − 2), where the error sum of squares is SSE = Σ_{i=1}^n (y_i − ŷ_i)².

Computational formula for SSE The direct computation of SSE is rather involved. A better option is to use the computational formula SSE = Syy − β̂1Sxy. This formula does not involve computation of predicted values and residuals. It is, however, very sensitive to rounding in β̂0 and β̂1.
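A quick numerical check (on made-up data, not from the slides) confirms that the shortcut formula agrees with the direct sum of squared residuals:

```python
# Made-up data for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.1, 2.9, 4.2, 4.8]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
syy = sum((y - ybar) ** 2 for y in ys)
b1 = sxy / sxx
b0 = ybar - b1 * xbar
sse_direct = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
sse_shortcut = syy - b1 * sxy      # SSE = Syy - beta1_hat * Sxy
sigma2_hat = sse_direct / (n - 2)  # estimate of sigma^2
print(sse_direct, sse_shortcut, sigma2_hat)
```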

Variation in the data

R² - coefficient of determination I How much of the total variation can the linear regression model explain? The total variation is described by the total sum of squares SST = Σ_{i=1}^n (y_i − ȳ)². It is always true that SSE ≤ SST, so we can define R² = 1 − SSE/SST, a number between 0 and 1 that indicates how much of the total variation is explained by the regression model.

R² - coefficient of determination II An alternative form is R² = SSR/SST, where SSR = Σ_{i=1}^n (ŷ_i − ȳ)² is the regression sum of squares. The same identity as in ANOVA holds: SST = SSR + SSE. In the cetane number-iodine value example, the value of R² is high.
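Continuing the same kind of sketch on made-up data, the identity SST = SSR + SSE and the agreement of the two forms of R² can be verified numerically:

```python
# Made-up data for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.1, 2.9, 4.2, 4.8]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
b0 = ybar - b1 * xbar
fitted = [b0 + b1 * x for x in xs]
sst = sum((y - ybar) ** 2 for y in ys)               # total sum of squares
ssr = sum((f - ybar) ** 2 for f in fitted)           # regression sum of squares
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # error sum of squares
r2 = 1 - sse / sst  # equals ssr / sst
print(round(r2, 4))
```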

Parameter estimators An estimator of β1 is β̂1 = Σ_{i=1}^n (x_i − x̄)(Y_i − Ȳ) / Σ_{i=1}^n (x_i − x̄)². An estimator of β0 is β̂0 = Ȳ − β̂1x̄. An estimator of σ² is σ̂² = S² = [Σ Y_i² − β̂0 Σ Y_i − β̂1 Σ x_iY_i]/(n − 2).

β̂1 as a linear estimator of the slope Verify that β̂1 = Σ_{i=1}^n c_iY_i, where c_i = (x_i − x̄)/Sxx. Consequently, β̂1 ~ N(β1, σ²/Sxx). The variance of β̂1 is estimated by S²/Sxx.

A confidence interval for β1 Note that T = (β̂1 − β1)/(S/√Sxx) ~ t_{n−2}. Thus, the 100(1 − α)% CI for β1 is β̂1 ± t_{α/2,n−2} · S/√Sxx.

Example When damage to a timber structure occurs, it may be more economical to repair the damaged area rather than replace the entire structure. The dependent variable is y = rupture load (N) and the independent variable is x = anchorage length (the additional length of material used to bond at the junction), in mm.

Example The main quantities are β̂1 = 123.64, Sxx = 18,000, error df = 10 − 2 = 8, and s = 2661.33. The estimated standard error is s/√Sxx = 19.836. The 95% confidence interval is 123.64 ± (2.306)(19.836) = (77.90, 169.38).
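The arithmetic of this interval can be verified in a few lines, using the quantities quoted on the slide (the critical value 2.306 for 8 df is taken from a t table):

```python
beta1_hat = 123.64     # slope estimate from the slide
s = 2661.33            # residual standard deviation
sxx = 18000.0
t_crit = 2.306         # t_{.025, 8} from a t table
se = s / sxx ** 0.5    # estimated standard error of the slope
lo, hi = beta1_hat - t_crit * se, beta1_hat + t_crit * se
print(round(se, 3), round(lo, 2), round(hi, 2))  # 19.836 77.9 169.38
```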

Hypothesis testing The most common test is the model utility test H0: β1 = 0 vs. Ha: β1 ≠ 0. More generally, for H0: β1 = β10 the test statistic value is t = (β̂1 − β10)/s_{β̂1}. Then, if e.g. Ha: β1 > β10, the P-value is the area under the t_{n−2} curve to the right of t.
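A sketch of the model utility test statistic on made-up data. The resulting t would be compared with a t_{n−2} table; computing an exact P-value needs a t CDF (e.g. from scipy), which is omitted here to keep the sketch standard-library only.

```python
import math

def utility_t(xs, ys):
    # t = beta1_hat / (s / sqrt(Sxx)) for testing H0: beta1 = 0
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))      # residual standard deviation
    return b1 / (s / math.sqrt(sxx))  # estimated-standard-error denominator

t = utility_t([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.1, 2.9, 4.2, 4.8])
print(round(t, 2))  # a large |t| argues against H0: beta1 = 0
```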

Example Mopeds are very popular in Europe because of their cost and ease of operation. They can be dangerous if performance characteristics are modified. One of the features commonly manipulated is the maximum speed. Consider a simple linear regression analysis of the variables x = test track speed (km/h) and y = rolling test speed.

Regression and ANOVA Note that t² = f for the test of H0: β1 = 0 vs. Ha: β1 ≠ 0; the square of the t statistic equals the ANOVA F statistic.

Inference about the mean µ_{Y·x*} For a given value x*, the estimated average value of Y is β̂0 + β̂1x*. It can also be viewed as the prediction at the given point x*. It is possible to represent the estimated average value of Y as β̂0 + β̂1x* = Σ_{i=1}^n d_iY_i, where d_i = 1/n + (x* − x̄)(x_i − x̄)/Σ_{i=1}^n (x_i − x̄)².

Summary E(Ŷ) = β0 + β1x*, and the variance is V(Ŷ) = σ²[1/n + (x* − x̄)²/Sxx]. The estimated variance results from the above by replacing σ² with s²; Ŷ is also normally distributed. To construct a confidence interval or to test a hypothesis, just note that T = (Ŷ − (β0 + β1x*))/S_Ŷ ~ t_{n−2}.

Inferences concerning µ_{Y·x*} The variable T = (β̂0 + β̂1x* − (β0 + β1x*))/S_{β̂0+β̂1x*} = (Ŷ − (β0 + β1x*))/S_Ŷ has a t distribution with n − 2 df. The 100(1 − α)% CI for E(Y | x*) = µ_{Y·x*} is β̂0 + β̂1x* ± t_{α/2,n−2} · s_{β̂0+β̂1x*} = ŷ ± t_{α/2,n−2} · s_ŷ.

Example Corrosion of steel reinforcing bars is the most important durability problem for reinforced concrete structures. We have representative data on x = carbonation depth (mm) and y = strength (MPa) for a sample of core specimens from a building in Singapore. The scatter plot supports the use of simple linear regression; thus, let us obtain a 95% CI for β0 + β1 · 45, the mean strength at x = 45 mm.

Example First, β̂1 = −.297561 and β̂0 = 27.182936, so the estimate is ŷ = 27.182936 − .297561 · 45 = 13.79. The estimated standard error is s_ŷ = 2.8640 · √(1/18 + (45 − 36.6111)²/4840.7778) = .7582. The 16 df t critical value is 2.120, and so the interval is 13.79 ± (2.120)(.7582) = (12.18, 15.40).
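The numbers in this example can be reproduced directly; all quantities below, including the critical value 2.120 for 16 df, are taken from the example itself.

```python
import math

b0, b1 = 27.182936, -0.297561   # estimates from the example
n, xbar, sxx, s = 18, 36.6111, 4840.7778, 2.8640
x_star = 45.0
y_hat = b0 + b1 * x_star
se_y_hat = s * math.sqrt(1 / n + (x_star - xbar) ** 2 / sxx)
t_crit = 2.120                  # t_{.025, 16}
lo, hi = y_hat - t_crit * se_y_hat, y_hat + t_crit * se_y_hat
print(round(y_hat, 2), round(se_y_hat, 4), round(lo, 2), round(hi, 2))
```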

Example The following output results from a request to fit the simple linear regression model and calculate confidence intervals for the mean value of strength at depths of 45 mm and 35 mm

CIs for multiple values of x In some situations, a CI is desired not just for a single x value but for two or more x values. Suppose an investigator wishes a CI both for µ_{Y·v} and for µ_{Y·w}, where v and w are two different values of the independent variable. The intervals are not independent because the same β̂0, β̂1, and S are used in each. We therefore cannot assert that the joint confidence level for the two intervals is exactly 90% even if we select α = 0.05 for each. It can be shown, though, that if the 100(1 − α)% CI is computed both for x = v and x = w to obtain joint CIs for µ_{Y·v} and µ_{Y·w}, then the joint confidence level of the resulting pair of intervals is at least 100(1 − 2α)%.

A prediction interval for a future value of Y Sometimes, an investigator may wish to obtain an interval of plausible values for the value of Y associated with some future observation when the independent variable has value x*. For instance, we may want to relate vocabulary size y to the age of a child x. The CI with x* = 6 would provide an estimate of the true average vocabulary size for all 6-year-old children. Alternatively, we might wish an interval of plausible values for the vocabulary size of a particular 6-year-old child.

The error of prediction The error of prediction is Y − (β̂0 + β̂1x*). Its expected value is E(Y − (β̂0 + β̂1x*)) = 0, and its variance is V(Y − (β̂0 + β̂1x*)) = σ²[1 + 1/n + (x* − x̄)²/Sxx]. Moreover, T = (Y − (β̂0 + β̂1x*)) / (S√(1 + 1/n + (x* − x̄)²/Sxx)) ~ t_{n−2}.

Prediction interval The prediction interval is β̂0 + β̂1x* ± t_{α/2,n−2} · s · √(1 + 1/n + (x* − x̄)²/Sxx). This interval is always wider than the corresponding confidence interval.

Example Let's return to the carbonation depth-strength data and calculate a 95% PI for the strength value that would result from selecting a single core specimen whose carbonation depth is 45 mm. The relevant quantities are ŷ = 13.79, s_ŷ = .7582, and s = 2.8640. For a prediction level of 95% based on n − 2 = 16 df, the critical value is 2.120. The prediction interval is then 13.79 ± 2.120 · √((2.8640)² + (.7582)²) = (7.51, 20.07).
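Again the arithmetic checks out in a few lines; the quantities are those quoted in the example, and √(s² + s_ŷ²) is the prediction standard error.

```python
import math

y_hat, se_y_hat, s = 13.79, 0.7582, 2.8640  # quantities from the example
t_crit = 2.120                              # t_{.025, 16}
half_width = t_crit * math.sqrt(s ** 2 + se_y_hat ** 2)
lo, hi = y_hat - half_width, y_hat + half_width
print(round(lo, 2), round(hi, 2))  # 7.51 20.07
```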