MAT2377. Rafał Kulik. Version 2015/November/26.


Bivariate data and scatterplot

Data: Hydrocarbon level (x) and Oxygen level (y):

x: 0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40, 1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95
y: 90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65, 93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33

[Figure: scatterplot of Oxygen (y) against Hydrocarbon level (x).]
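
To reproduce this plot, here is a minimal R sketch (base graphics only), using the data from the previous slide:

    # Hydrocarbon (x) and oxygen (y) levels from the table above
    x <- c(0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
           1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95)
    y <- c(90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
           93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33)
    plot(x, y, xlab = "Hydrocarbon level", ylab = "Oxygen")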

We want to describe the relationship between these two variables using regression analysis. We assume that our model is given by

    Y = β_0 + β_1 x + ε,

where ε is a random error and β_0, β_1 are regression coefficients. The variable x is called the regressor (predictor) variable and Y is called the dependent or response variable. It is assumed that E(ε) = 0 and Var(ε) = σ². In particular, E(Y | x) = β_0 + β_1 x.

Suppose now that we have observations (x_i, y_i) from our model, so that

    y_i = β_0 + β_1 x_i + ε_i,  i = 1, ..., n.

Our aim is to find β̂_0, β̂_1, estimators of the unknown parameters β_0, β_1. Consequently, we will obtain the estimated (fitted) regression line, or line of best fit:

    ŷ = β̂_0 + β̂_1 x.

This line is obtained using the method of least squares. Given the observations y_i, i = 1, ..., n, their deviations e_i from the line β_0 + β_1 x are

    e_i = y_i − (β_0 + β_1 x_i),  i = 1, ..., n,

so

    L(β_0, β_1) := Σ_{i=1}^n e_i² = Σ_{i=1}^n (y_i − β_0 − β_1 x_i)².

Consequently,

    ∂L/∂β_0 = −2 Σ_{i=1}^n (y_i − β_0 − β_1 x_i)

and

    ∂L/∂β_1 = −2 Σ_{i=1}^n (y_i − β_0 − β_1 x_i) x_i.

Solving ∂L/∂β_0 = 0 and ∂L/∂β_1 = 0, we obtain the least squares estimators of β_0 and β_1:

    β̂_1 = S_xy / S_xx,   β̂_0 = ȳ − β̂_1 x̄,

where

    S_xy = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ),
    S_xx = Σ_{i=1}^n (x_i − x̄)²,
    S_yy = Σ_{i=1}^n (y_i − ȳ)².

Equivalently,

    S_xy = Σ_{i=1}^n x_i y_i − (1/n)(Σ_{i=1}^n x_i)(Σ_{i=1}^n y_i),
    S_xx = Σ_{i=1}^n x_i² − (1/n)(Σ_{i=1}^n x_i)²,
    S_yy = Σ_{i=1}^n y_i² − (1/n)(Σ_{i=1}^n y_i)².

Example. For the hydrocarbon data we find

    Σ x_i = 23.92,  Σ y_i = 1843.21,  Σ x_i² = 29.2892,  Σ y_i² = 170044.5,  Σ x_i y_i = 2214.657,

so that S_xy = 10.17744, S_xx = 0.68088, S_yy = 173.3769. Therefore β̂_0 = 74.28 and β̂_1 = 14.95, and the fitted regression line is

    ŷ = 74.28 + 14.95x.

[Figure: scatterplot of Oxygen against Hydrocarbon level with the fitted line.]
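
The hand computation can be checked in R. A minimal sketch, continuing with the x and y vectors defined earlier; coef(lm(y ~ x)) should agree up to rounding:

    n   <- length(x)                          # 20
    Sxy <- sum(x * y) - sum(x) * sum(y) / n   # 10.17744
    Sxx <- sum(x^2) - sum(x)^2 / n            # 0.68088
    b1  <- Sxy / Sxx                          # about 14.95
    b0  <- mean(y) - b1 * mean(x)             # about 74.28
    coef(lm(y ~ x))                           # built-in least squares fit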

Sample correlation coefficient. For data (x_i, y_i) we define the sample correlation coefficient as

    R = S_xy / √(S_xx S_yy)    (1)

(R is defined only if S_xx and S_yy are positive).

Example: For the hydrocarbon data the sample correlation coefficient is R = 0.93.

Properties of R

- R is unaffected by a change of scale or origin: adding a constant to x does not change x_i − x̄, and multiplying x and y by constants changes the numerator and denominator equally.
- R is symmetric in x and y.
- −1 ≤ R ≤ 1.
- If R = 1 (R = −1), then the observations (x_i, y_i) all lie on a straight line with positive (negative) slope. The sign of R reflects the trend of the points.
- Remember: a high correlation coefficient does not necessarily imply any causal relationship between the two variables.
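
These properties are easy to check numerically. A small sketch, again with the hydrocarbon data in x and y:

    cor(x, y)           # sample correlation R
    cor(10 * x + 5, y)  # unchanged: R is invariant to scale and origin of x
    cor(y, x)           # unchanged: R is symmetric in x and y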

Note: x and y can have a very strong non-linear relationship and yet R ≈ 0. The graph shows plots of a vector x against y = (x − x̄)² and z = (x − x̄)³.

[Figure: two panels, y versus x and z versus x.]

The correlations for the above data are −0.12 and 0.93, respectively.

[Figure: four scatterplot panels illustrating R = 0, R = 0.74, R = 0.78 and R = −0.78.]

Estimating σ²

Recall that Var(ε) = σ². To estimate it we use the sum of squares of the residuals:

    SS_E = Σ_{i=1}^n e_i² = Σ_{i=1}^n (y_i − ŷ_i)².

The estimator σ̂² of σ² is given by

    σ̂² = SS_E / (n − 2) = (S_yy − β̂_1 S_xy) / (n − 2).    (2)

For the oxygen data we have σ̂² = 1.18.
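
In R, σ̂² can be obtained from the residuals or read off the fitted model. A sketch, continuing from the earlier code:

    fit <- lm(y ~ x)
    sum(resid(fit)^2) / (n - 2)   # SS_E / (n - 2), about 1.18
    summary(fit)$sigma^2          # the same estimate, as reported by R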

Properties of the LSE

Recall that we consider the model Y = β_0 + β_1 x + ε, where E(ε) = 0 and Var(ε) = σ². Thus, given x, Y is a random variable with mean β_0 + β_1 x and variance σ². Note that β̂_0 and β̂_1 depend on the observed y's, which are realizations of the random variable Y. Consequently, the estimators are themselves random variables. We have

    E(β̂_1) = β_1,  Var(β̂_1) = σ²/S_xx,
    E(β̂_0) = β_0,  Var(β̂_0) = σ² [1/n + x̄²/S_xx]

(β̂_0 and β̂_1 are unbiased estimators of β_0 and β_1, respectively). Consequently, the estimated standard errors are

    se(β̂_1) = √(σ̂²/S_xx),  se(β̂_0) = √(σ̂² [1/n + x̄²/S_xx]).

Hypothesis tests in linear regression

We want to test whether the slope equals β_{1,0}, i.e.

    H_0: β_1 = β_{1,0},  H_1: β_1 ≠ β_{1,0}.

Since β̂_1 is approximately normal, N(β_1, σ²/S_xx), we may use the statistic

    (β̂_1 − β_{1,0}) / √(σ²/S_xx) ~ N(0, 1).

However, σ² is not known, so we plug in its estimator σ̂² given in (2):

    T_0 = (β̂_1 − β_{1,0}) / √(σ̂²/S_xx) ~ t_{n−2}.

If t_0 is the observed value, we reject H_0 when |t_0| > t_{α/2,n−2}.
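
For the usual case β_{1,0} = 0, R reports this t statistic directly. A sketch, reusing b1, Sxx, n and fit from the earlier code:

    s2 <- summary(fit)$sigma^2
    t0 <- (b1 - 0) / sqrt(s2 / Sxx)   # observed test statistic
    2 * pt(-abs(t0), df = n - 2)      # two-sided p-value
    summary(fit)$coefficients         # row "x": the same t value and p-value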

Significance of regression

For each bivariate data set we may fit a regression line, which aims to describe a linear relationship between x and Y. However, does it always make sense?

[Figure: scatterplot of an uncorrelated point cloud.]

Of course, we may fit the regression line, ŷ = 0.01 − 0.04x, but this line does not describe the bivariate data set at all.

Having computed the regression line, we want to test whether it is significant. The test for significance of regression is

    H_0: β_1 = 0,  H_1: β_1 ≠ 0.

If we reject H_0, then there is a linear relationship between x and Y.

Example: Hydrocarbon data: β̂_1 = 14.95, n = 20, S_xx = 0.68, σ̂² = 1.18. Consequently,

    t_0 = (β̂_1 − 0) / √(σ̂²/S_xx) = 11.35 > 2.88 = t_{0.005,18}.

We reject H_0: there is a linear relationship between x and Y.

Confidence intervals - slope and intercept

The 100(1 − α)% confidence intervals for β_1 and β_0 are:

    β̂_1 − t_{α/2,n−2} √(σ̂²/S_xx) ≤ β_1 ≤ β̂_1 + t_{α/2,n−2} √(σ̂²/S_xx),
    β̂_0 − t_{α/2,n−2} √(σ̂² [1/n + x̄²/S_xx]) ≤ β_0 ≤ β̂_0 + t_{α/2,n−2} √(σ̂² [1/n + x̄²/S_xx]).

Example: Hydrocarbon data: 12.181 ≤ β_1 ≤ 17.713.
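
These are exactly the intervals confint() returns. A sketch, continuing with the hydrocarbon fit; the x row should reproduce 12.181 ≤ β_1 ≤ 17.713:

    qt(0.975, df = n - 2)                             # t_{0.025,18} = 2.101
    b1 + c(-1, 1) * qt(0.975, n - 2) * sqrt(s2/Sxx)   # slope CI by hand
    confint(fit, level = 0.95)                        # both intervals at once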

Confidence intervals - mean response

We want to estimate μ_{Y|x_0} = E(Y | x_0), the mean response at x_0 (typically at an observed x_0). Of course, it can be read directly from the regression line:

    μ̂_{Y|x_0} = β̂_0 + β̂_1 x_0.

The distance (at x_0) between the estimated and the true regression line is

    μ̂_{Y|x_0} − μ_{Y|x_0} = (β̂_0 − β_0) + (β̂_1 − β_1) x_0.

Now, E(μ̂_{Y|x_0}) = μ_{Y|x_0} and

    Var(μ̂_{Y|x_0}) = σ² [1/n + (x_0 − x̄)²/S_xx].

Note that

    Var(μ̂_{Y|x_0}) = Var(β̂_0 + β̂_1 x_0) ≠ Var(β̂_0) + Var(β̂_1 x_0),

since β̂_0 and β̂_1 are dependent.

The confidence interval for μ_{Y|x_0} (the mean response, i.e. the regression line) is

    μ̂_{Y|x_0} ± t_{α/2,n−2} √(σ̂² [1/n + (x_0 − x̄)²/S_xx]).

Example: Hydrocarbon data:

    μ̂_{Y|x_0} ± 2.101 √(1.18 [1/20 + (x_0 − 1.196)²/0.68]).
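
In R this band comes from predict() with interval = "confidence". A sketch at a hypothetical point x_0 = 1.25 (any value in the data range could be used):

    predict(fit, newdata = data.frame(x = 1.25),
            interval = "confidence", level = 0.95)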

[Figure: scatterplot of the hydrocarbon data with the confidence band for the regression line.]

A lot of observations lie outside the confidence interval. This is expected: the interval is for the mean response (the regression line), not for individual observations; prediction intervals for new observations are discussed next.

Prediction of new observations

If x_0 is the value of the regressor variable of interest, then the estimated value of the response variable Y is

    ŷ_0 = Ŷ_0 = β̂_0 + β̂_1 x_0.

If Y_0 is the true future observation at x = x_0 (so Y_0 = β_0 + β_1 x_0 + ε) and Ŷ_0 is the predicted value given by the above equation, then the prediction error

    ê_p = Y_0 − Ŷ_0 = β_0 + β_1 x_0 + ε − (β̂_0 + β̂_1 x_0) = (β_0 − β̂_0) + (β_1 − β̂_1) x_0 + ε

is normally distributed, N(0, σ² [1 + 1/n + (x_0 − x̄)²/S_xx]).

Now, we plug in the estimator of σ² to get the following prediction interval for Y_0:

    ŷ_0 ± t_{α/2,n−2} √(σ̂² [1 + 1/n + (x_0 − x̄)²/S_xx]),

where ŷ_0 = β̂_0 + β̂_1 x_0.
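
The same predict() call with interval = "prediction" gives this wider interval. A sketch at the hypothetical x_0 = 1.25, with the hand computation alongside:

    x0 <- 1.25
    y0 <- b0 + b1 * x0
    me <- qt(0.975, n - 2) * sqrt(s2 * (1 + 1/n + (x0 - mean(x))^2 / Sxx))
    c(y0 - me, y0 + me)                                        # by hand
    predict(fit, data.frame(x = x0), interval = "prediction")  # built-in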

Residuals

    e_i = y_i − ŷ_i,

where y_i, i = 1, ..., n, are the observed values and ŷ_i, i = 1, ..., n, are the values obtained from the regression line, i.e. ŷ_i = β̂_0 + β̂_1 x_i.

[Figure: residuals lm(y ~ x)$res plotted against observation index.]

Analysis - summary

1. Draw a scatterplot
2. Find the regression line
3. Check the appropriateness of a linear fit (correlation coefficient, significance-of-regression test)
4. Check goodness of fit (confidence interval for the regression line)
5. Check model assumptions (residuals)
6. Do prediction, if appropriate
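
A compact end-to-end sketch of this recipe in R, assuming the data are in vectors x and y (the x_0 value 1.25 is arbitrary):

    plot(x, y)                               # 1. scatterplot
    fit <- lm(y ~ x)                         # 2. regression line
    abline(fit)
    cor(x, y)                                # 3. correlation ...
    summary(fit)                             #    ... and significance test
    predict(fit, interval = "confidence")    # 4. CI for the regression line
    plot(resid(fit)); abline(h = 0)          # 5. residuals
    predict(fit, data.frame(x = 1.25),       # 6. prediction at a new x_0
            interval = "prediction")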

Set #1

This data set contains statistics, in arrests per 100,000 residents, for assault, murder and rape in each of the 50 US states in 1973.

1. x: number of murders; y: number of assaults.

[Figure: scatterplot of assaults (y) against murders (x).]
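
This description matches R's built-in USArrests data set, so the example can presumably be reproduced as follows:

    x <- USArrests$Murder    # murder arrests per 100,000
    y <- USArrests$Assault   # assault arrests per 100,000
    plot(x, y)
    coef(lm(y ~ x))          # should be close to 51.27 and 15.34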

2. We compute

    Σ x_i = 389.4,  Σ y_i = 8538,  Σ x_i² = 3962.2,  Σ y_i² = 1798262,  Σ x_i y_i = 80756.

Thus ŷ = 51.27 + 15.34x.

[Figure: scatterplot with the fitted regression line.]

3. Correlation: R = 0.802. Significance of regression: H_0: β_1 = 0, H_1: β_1 ≠ 0. Test statistic T_0 = (β̂_1 − 0)/√(σ̂²/S_xx) ~ t_{n−2}. We have σ̂² = 2531.73 and S_xx = 929.55. The observed value is t_0 = 9.30 and t_{0.05/2,48} ≈ 2.01, so we reject H_0: there is a linear relationship between x and y.

4. Confidence interval for the regression line.

[Figure: scatterplot with the fitted line and its confidence band.]

5. [Figure: residuals lm(y ~ x)$res plotted against observation index.]

6. Predict the number of assaults if the number of murders is x_0 = 20:

    ŷ_0 = 51.27 + 15.34 · 20 = 358.07.

Equivalent statement: give a point estimate of the number of assaults if the number of murders is...

Compute the prediction interval:

    ŷ_0 ± t_{α/2,n−2} √(σ̂² [1 + 1/n + (x_0 − x̄)²/S_xx]),
    358.07 ± 2.01 √(2531.73 [1 + 1/50 + (20 − 7.78)²/929.55]),
    358.07 ± 40.64.

What change in the mean number of assaults would be expected for a change of 1 in the number of murders? Compute the slope.

Set #2

The classic airline data: monthly totals of international airline passengers, 1949 to 1960.

1. x: time, x = (1, 2, ..., 144).

[Figure: passenger totals (y) plotted against time (x).]

2. We compute

    Σ x_i = 10440,  Σ y_i = 40363,  Σ x_i² = 1005720,  Σ y_i² = 13371737,  Σ x_i y_i = 3587478.

Thus ŷ = 87.653 + 2.657x.

[Figure: scatterplot with the fitted regression line.]
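
This is the classic series shipped with R as AirPassengers, so the fit can presumably be reproduced with:

    y <- as.numeric(AirPassengers)   # monthly totals, 1949-1960 (n = 144)
    x <- seq_along(y)                # time index 1, ..., 144
    coef(lm(y ~ x))                  # should be close to 87.653 and 2.657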

3. Correlation: R = 0.924. Significance of regression: H_0: β_1 = 0, H_1: β_1 ≠ 0. Test statistic T_0 = (β̂_1 − 0)/√(σ̂²/S_xx) ~ t_{n−2}. We have σ̂² = 2121.261 and S_xx = 248820. The observed value is t_0 = 28.77644 and t_{0.05/2,142} ≈ 1.97, so we reject H_0: there is a linear relationship between x and y.

4. Confidence interval for the regression line:

    μ̂_{Y|x_0} ± t_{α/2,n−2} √(σ̂² [1/n + (x_0 − x̄)²/S_xx]).

[Figure: scatterplot with the fitted line and its confidence band.]

5. [Figure: residuals lm(y ~ x)$res plotted against observation index.]

[Figure: passenger totals (y) plotted against time (x) with the fitted regression line.]