MAT2377, Rafał Kulik, Version 2015/November/26
Bivariate data and scatterplot

Data: Hydrocarbon level (x) and Oxygen level (y):

x: 0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40, 1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95
y: 90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65, 93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33
[Scatterplot: Oxygen level (88 to 98) against Hydrocarbon level (0.9 to 1.5).]
We want to describe the relationship between these two variables. We will use regression analysis. We will assume that our model is given by

Y = β_0 + β_1 x + ε,

where ε is a random error and β_0, β_1 are the regression coefficients. The variable x is called the regressor (predictor) variable and Y is called the dependent or response variable. It is assumed that E(ε) = 0 and Var(ε) = σ². In particular, E(Y | x) = β_0 + β_1 x.
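As a quick sketch of what this model means, one can simulate observations from it and check that the errors average out to zero, so that Y fluctuates around the line β_0 + β_1 x. The parameter values below (b0 = 74, b1 = 15, sigma = 1) are illustrative choices, not estimates from the data.

```python
import random

random.seed(1)
b0, b1, sigma, n = 74.0, 15.0, 1.0, 10_000  # illustrative parameter values

# Draw regressor values and generate Y = b0 + b1*x + eps with E(eps) = 0.
xs = [random.uniform(0.9, 1.6) for _ in range(n)]
ys = [b0 + b1 * x + random.gauss(0.0, sigma) for x in xs]

# Since E(eps) = 0, the average deviation of Y from the true line is near 0,
# reflecting E(Y | x) = b0 + b1*x.
avg_dev = sum(y - (b0 + b1 * x) for x, y in zip(xs, ys)) / n
```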
Suppose now that we have observations (x_i, y_i) from our model, so

y_i = β_0 + β_1 x_i + ε_i,  i = 1, ..., n.

Our aim is to find β̂_0, β̂_1, estimators of the unknown parameters β_0, β_1. Consequently, we will find the estimated (fitted) regression line, or line of best fit:

ŷ = β̂_0 + β̂_1 x.

This line is obtained using the method of least squares. Having the observations y_i, i = 1, ..., n, their deviations e_i from the line β_0 + β_1 x are

e_i = y_i − β_0 − β_1 x_i,  i = 1, ..., n.

So,

L(β_0, β_1) := Σ_{i=1}^n e_i² = Σ_{i=1}^n (y_i − β_0 − β_1 x_i)².
Consequently,

∂L/∂β_0 = −2 Σ_{i=1}^n (y_i − β_0 − β_1 x_i)

and

∂L/∂β_1 = −2 Σ_{i=1}^n (y_i − β_0 − β_1 x_i) x_i.

Solving ∂L/∂β_0 = 0 and ∂L/∂β_1 = 0 we obtain the least squares estimators of β_0 and β_1:
β̂_1 = S_xy / S_xx,  β̂_0 = ȳ − β̂_1 x̄,

where

S_xy = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ),  S_xx = Σ_{i=1}^n (x_i − x̄)²,  S_yy = Σ_{i=1}^n (y_i − ȳ)².

Equivalently,

S_xy = Σ x_i y_i − (1/n)(Σ x_i)(Σ y_i)
S_xx = Σ x_i² − (1/n)(Σ x_i)²
S_yy = Σ y_i² − (1/n)(Σ y_i)²
Example. For the hydrocarbon data we find

Σ x_i = 23.92,  Σ y_i = 1843.21,  Σ x_i² = 29.2892,  Σ y_i² = 170044.5,  Σ x_i y_i = 2214.657,

so that S_xy = 10.17744, S_xx = 0.68088, S_yy = 173.3769. Therefore, β̂_0 = 74.28 and β̂_1 = 14.95. Consequently, the fitted regression line is

ŷ = 74.28 + 14.95x.

[Scatterplot of Oxygen against Hydrocarbon level with the fitted line.]
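The computation above can be reproduced directly from the data with the computational formulas; a minimal sketch:

```python
# Hydrocarbon (x) and oxygen (y) data from the slides.
x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Computational (shortcut) formulas for the sums of squares.
Sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
Sxx = sum(a ** 2 for a in x) - sum(x) ** 2 / n

# Least squares estimators.
b1 = Sxy / Sxx          # slope, about 14.95
b0 = ybar - b1 * xbar   # intercept, about 74.28
```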
Sample correlation coefficient. For data (x_i, y_i) we define the sample correlation coefficient as

R = S_xy / √(S_xx S_yy)     (1)

(R is defined only if S_xx and S_yy are positive).

Example. For the hydrocarbon data the sample correlation coefficient is R = 0.93.
Properties of R

- R is unaffected by a change of scale or origin: adding constants to x does not change x − x̄, and multiplying x and y by constants changes the numerator and denominator equally.
- R is symmetric in x and y.
- −1 ≤ R ≤ 1.
- If R = 1 (R = −1) then the observations (x_i, y_i) all lie on a straight line with positive (negative) slope.
- The sign of R reflects the trend of the points.
- Remember: a high correlation coefficient does not necessarily imply any causal relationship between the two variables.
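The invariance and symmetry properties are easy to check numerically; a small sketch using the first five hydrocarbon observations (the shift and scale constants are arbitrary):

```python
import math

def corr(x, y):
    # Sample correlation coefficient R = Sxy / sqrt(Sxx * Syy), as in (1).
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [0.99, 1.02, 1.15, 1.29, 1.46]
y = [90.01, 89.05, 91.43, 93.74, 96.73]

r = corr(x, y)
# Shifting the origin and rescaling by a positive constant leaves R unchanged.
r_transformed = corr([10 * xi + 3 for xi in x], y)
# R is symmetric in x and y.
r_swapped = corr(y, x)
```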
Note: x and y can have a very strong non-linear relationship and yet R ≈ 0.

[Plots of a vector x against y = (x − x̄)² and z = (x − x̄)³.]

The correlations for the above data are −0.12 and 0.93, respectively.
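An extreme version of this effect can be constructed by hand: with x symmetric about its mean and y a perfect quadratic function of x, the sample correlation is exactly zero. The tiny data set below is chosen for illustration:

```python
import math

def corr(x, y):
    # Sample correlation coefficient R = Sxy / sqrt(Sxx * Syy).
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# x symmetric about its mean; y determined exactly by x.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [xi ** 2 for xi in x]

r = corr(x, y)  # zero, despite the perfect (non-linear) relationship
```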
[Four scatterplots of simulated data with R = 0, R = 0.74, R = 0.78 and R = −0.78.]
Estimating σ²

Recall that Var(ε) = σ². To estimate it we use the sum of squares of the residuals:

SS_E = Σ_{i=1}^n e_i² = Σ_{i=1}^n (y_i − ŷ_i)².

The estimator σ̂² of σ² is given by

σ̂² = SS_E / (n − 2) = (S_yy − β̂_1 S_xy) / (n − 2).     (2)

For the oxygen data we have σ̂² = 1.18.
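A short sketch checking this value from the summary numbers quoted in the hydrocarbon example, using the shortcut form of (2):

```python
# Summary numbers from the hydrocarbon example.
n = 20
Sxy, Sxx, Syy = 10.17744, 0.68088, 173.3769

b1 = Sxy / Sxx
sigma2 = (Syy - b1 * Sxy) / (n - 2)  # shortcut form of SS_E / (n - 2)
```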
Properties of the LSE

Recall that we consider the model Y = β_0 + β_1 x + ε, where E(ε) = 0 and Var(ε) = σ². Thus, given x, Y is a random variable with mean β_0 + β_1 x and variance σ². Note that β̂_0 and β̂_1 depend on the observed y's, which are realizations of the random variable Y. Consequently, the estimators are themselves random variables. We have

E(β̂_1) = β_1,  Var(β̂_1) = σ²/S_xx,
E(β̂_0) = β_0,  Var(β̂_0) = σ² [1/n + x̄²/S_xx]

(β̂_0 and β̂_1 are unbiased estimators of β_0 and β_1, respectively). Consequently, the estimated standard errors are

se(β̂_1) = √(σ̂²/S_xx),  se(β̂_0) = √(σ̂² [1/n + x̄²/S_xx]).
Hypothesis tests in linear regression

We want to test that the slope equals β_{1,0}, i.e.

H_0: β_1 = β_{1,0},  H_1: β_1 ≠ β_{1,0}.

Since β̂_1 is approximately normal N(β_1, σ²/S_xx), we may use the statistic

(β̂_1 − β_{1,0}) / √(σ²/S_xx) ~ N(0, 1).

However, σ² is not known, so we plug in its estimator σ̂² given in (2):

T_0 = (β̂_1 − β_{1,0}) / √(σ̂²/S_xx) ~ t_{n−2}.

If t_0 is the observed value, we reject H_0 if |t_0| > t_{α/2,n−2}.
Significance of regression For each bivariate data set we may fit a regression line, which aims to describe a linear relationship between x and Y. However, does it always make sense? 3 2 1 0 1 2 3 2 1 0 1 2 3 x y Of course, we may fit the regression line, ŷ = 0.01 0.04x, but this line does not describe at all the bivariate data set. Rafa l Kulik 15
Having computed the regression line, we want to test whether this line is significant. The test for significance of regression is

H_0: β_1 = 0,  H_1: β_1 ≠ 0.

If we reject H_0, then there is a linear relationship between x and Y.

Example. Hydrocarbon data: β̂_1 = 14.95, n = 20, S_xx = 0.68, σ̂² = 1.18. Consequently,

t_0 = (β̂_1 − 0) / √(σ̂²/S_xx) = 11.35 > 2.88 = t_{0.005,18}.

We reject H_0: there is a linear relationship between x and Y.
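The observed statistic can be checked in a couple of lines from the quoted summary numbers:

```python
import math

# Hydrocarbon example: test H0: beta_1 = 0.
b1, n = 14.947, 20
Sxx, sigma2 = 0.68088, 1.18

t0 = (b1 - 0) / math.sqrt(sigma2 / Sxx)
# |t0| is then compared with the critical value t_{alpha/2, n-2};
# here t0 (about 11.35) far exceeds t_{0.005,18} = 2.88, so H0 is rejected.
```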
Confidence intervals - slope and intercept

The 100(1 − α)% confidence intervals for β_1 and β_0 are:

β̂_1 − t_{α/2,n−2} √(σ̂²/S_xx) ≤ β_1 ≤ β̂_1 + t_{α/2,n−2} √(σ̂²/S_xx),

β̂_0 − t_{α/2,n−2} √(σ̂² [1/n + x̄²/S_xx]) ≤ β_0 ≤ β̂_0 + t_{α/2,n−2} √(σ̂² [1/n + x̄²/S_xx]).

Example. Hydrocarbon data: 12.181 ≤ β_1 ≤ 17.713.
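A sketch of the slope interval, using the tabulated critical value t_{0.025,18} = 2.101 rather than computing the quantile:

```python
import math

b1, n = 14.947, 20
Sxx, sigma2 = 0.68088, 1.18
t = 2.101  # tabulated critical value t_{0.025, 18}

se = math.sqrt(sigma2 / Sxx)        # estimated standard error of b1
lo, hi = b1 - t * se, b1 + t * se   # 95% CI, about (12.181, 17.713)
```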
Confidence intervals - mean response

We want to estimate μ_{Y|x_0} = E(Y | x_0), the mean response at x_0 (typically at an observed x_0). It can be read off the estimated regression line:

μ̂_{Y|x_0} = β̂_0 + β̂_1 x_0.

The distance (at x_0) between the estimated and the true regression line is

μ̂_{Y|x_0} − μ_{Y|x_0} = (β̂_0 − β_0) + (β̂_1 − β_1) x_0.

Now, E(μ̂_{Y|x_0}) = μ_{Y|x_0} and

Var(μ̂_{Y|x_0}) = σ² [1/n + (x_0 − x̄)²/S_xx].
Note that Var(μ̂_{Y|x_0}) = Var(β̂_0 + β̂_1 x_0) ≠ Var(β̂_0) + Var(β̂_1 x_0), since β̂_0 and β̂_1 are dependent.

The confidence interval for μ_{Y|x_0} (the mean response, i.e. the regression line) is

μ̂_{Y|x_0} ± t_{α/2,n−2} √(σ̂² [1/n + (x_0 − x̄)²/S_xx]).

Example. Hydrocarbon data:

μ̂_{Y|x_0} ± 2.101 √(1.18 [1/20 + (x_0 − 1.196)²/0.68]).
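A sketch evaluating this interval at one point, x0 = 1.00 (an arbitrary choice inside the observed range, not a value singled out in the example), with the summary numbers from the hydrocarbon data:

```python
import math

# Hydrocarbon summary numbers; x0 = 1.00 is an arbitrary illustration point.
b0, b1 = 74.283, 14.947
n, xbar, Sxx, sigma2 = 20, 1.196, 0.68088, 1.18
t = 2.101  # tabulated t_{0.025, 18}

x0 = 1.00
mu_hat = b0 + b1 * x0  # estimated mean response at x0
margin = t * math.sqrt(sigma2 * (1 / n + (x0 - xbar) ** 2 / Sxx))
# The 95% CI for the mean response at x0 is (mu_hat - margin, mu_hat + margin).
```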
[Scatterplot with the fitted line and the confidence band for the mean response.]

A lot of observations lie outside the confidence interval. This is expected: it is a confidence band for the mean response (the regression line itself), not for individual observations.
Prediction of new observations

If x_0 is the value of the regressor variable of interest, then the estimated value of the response variable Y is

ŷ_0 = Ŷ_0 = β̂_0 + β̂_1 x_0.

If Y_0 is the true future observation at x = x_0 (so Y_0 = β_0 + β_1 x_0 + ε) and Ŷ_0 is the predicted value given by the above equation, then the prediction error

ê_p = Y_0 − Ŷ_0 = β_0 + β_1 x_0 + ε − (β̂_0 + β̂_1 x_0) = (β_0 − β̂_0) + (β_1 − β̂_1) x_0 + ε
is normally distributed N(0, σ² [1 + 1/n + (x_0 − x̄)²/S_xx]).

Now, we plug in the estimator of σ² to get the following prediction interval for Y_0:

ŷ_0 ± t_{α/2,n−2} √(σ̂² [1 + 1/n + (x_0 − x̄)²/S_xx]),

where ŷ_0 = β̂_0 + β̂_1 x_0.
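A sketch of the prediction interval at x0 = 1.00 (again an arbitrary illustration point) for the hydrocarbon data; the extra "1 +" term makes it wider than the corresponding interval for the mean response:

```python
import math

# Hydrocarbon summary numbers; x0 = 1.00 is an arbitrary illustration point.
b0, b1 = 74.283, 14.947
n, xbar, Sxx, sigma2 = 20, 1.196, 0.68088, 1.18
t = 2.101  # tabulated t_{0.025, 18}

x0 = 1.00
y0_hat = b0 + b1 * x0
# Note the leading "1 +" term: uncertainty about a single new observation,
# not just about the position of the regression line.
margin = t * math.sqrt(sigma2 * (1 + 1 / n + (x0 - xbar) ** 2 / Sxx))
```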
Residuals

e_i = y_i − ŷ_i, where y_i, i = 1, ..., n, are the observed values and ŷ_i, i = 1, ..., n, are the values obtained from the regression line, i.e. ŷ_i = β̂_0 + β̂_1 x_i.

[Plot of the residuals lm(y ~ x)$res against the observation index, 1 to 20.]
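A sketch computing the residuals for the hydrocarbon data. One useful sanity check: when the model includes an intercept, the least squares residuals always sum to (numerically) zero, because β̂_0 = ȳ − β̂_1 x̄.

```python
x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
      / sum((a - xbar) ** 2 for a in x))
b0 = ybar - b1 * xbar

# Residuals e_i = y_i - (b0 + b1 * x_i); these are what one would plot.
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
```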
Analysis - summary

1. Draw a scatterplot.
2. Find the regression line.
3. Check the appropriateness of a linear fit (correlation coefficient, significance of regression test).
4. Check goodness of fit (confidence interval for the regression line).
5. Check model assumptions (residuals).
6. Do prediction, if appropriate.
Set #1

This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973.

1. x: number of murders, y: number of assaults.

[Scatterplot of y against x.]
2. Σ x_i = 389.4,  Σ y_i = 8538,  Σ x_i² = 3962.2,  Σ y_i² = 1798262,  Σ x_i y_i = 80756.

Thus, ŷ = 51.27 + 15.34x.

[Scatterplot of y against x with the fitted line.]
3. Correlation: R = 0.802.

Significance of regression: H_0: β_1 = 0, H_1: β_1 ≠ 0. Test statistic: T_0 = (β̂_1 − 0)/√(σ̂²/S_xx) ~ t_{n−2}. We have σ̂² = 2531.73 and S_xx = 929.55.

The observed value: t_0 = 9.30 > 2.01 ≈ t_{0.05/2,48}. We reject H_0: there is a linear relationship between x and y.
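The whole chain of computations for this data set can be reproduced from the quoted sums alone; a sketch:

```python
import math

# Set #1 summary sums from the slides (n = 50 states).
n = 50
Sx, Sy, Sx2, Sy2, Sxy_raw = 389.4, 8538.0, 3962.2, 1798262.0, 80756.0

# Shortcut formulas for the sums of squares.
Sxx = Sx2 - Sx ** 2 / n
Syy = Sy2 - Sy ** 2 / n
Sxy = Sxy_raw - Sx * Sy / n

b1 = Sxy / Sxx                        # slope, about 15.34
sigma2 = (Syy - b1 * Sxy) / (n - 2)   # about 2531
t0 = b1 / math.sqrt(sigma2 / Sxx)     # observed test statistic, about 9.30
```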
4. Confidence interval for the regression line.

[Scatterplot of y against x with the fitted line and confidence band.]
5. [Residual plot: lm(y ~ x)$res against the observation index, 1 to 50.]
6. Predict the number of assaults if the number of murders is x_0 = 20:

ŷ_0 = 51.27 + 15.34 · 20 = 358.07.

Equivalent statement: give a point estimate of the number of assaults if the number of murders is ...
Compute the prediction interval:

ŷ_0 ± t_{α/2,n−2} √(σ̂² [1 + 1/n + (x_0 − x̄)²/S_xx]),

358.07 ± 2.01 √(2531.73 [1 + 1/50 + (20 − 7.78)²/929.55]),

358.07 ± 109.9.

What change in the mean number of assaults would be expected for a change of 1 in the number of murders? Compute the slope.
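Recomputing the margin from the quoted summary numbers (with the approximate critical value t_{0.025,48} ≈ 2.01) gives roughly ±110; a sketch:

```python
import math

# Set #1 summary numbers; prediction at x0 = 20 murders.
n = 50
xbar = 389.4 / n
Sxx, sigma2 = 929.55, 2531.73
t = 2.01  # approximate t_{0.025, 48}

x0 = 20.0
y0_hat = 51.27 + 15.34 * x0
margin = t * math.sqrt(sigma2 * (1 + 1 / n + (x0 - xbar) ** 2 / Sxx))
```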
Set #2

The classic airline data: monthly totals of international airline passengers, 1949 to 1960.

1. x: time, x = (1, 2, ..., 144).

[Time series plot of y against x.]
2. Σ x_i = 10440,  Σ y_i = 40363,  Σ x_i² = 1005720,  Σ y_i² = 13371737,  Σ x_i y_i = 3587478.

Thus, ŷ = 87.653 + 2.657x.

[Plot of y against x with the fitted line.]
3. Correlation: R = 0.924.

Significance of regression: H_0: β_1 = 0, H_1: β_1 ≠ 0. Test statistic: T_0 = (β̂_1 − 0)/√(σ̂²/S_xx) ~ t_{n−2}. We have σ̂² = 2121.261 and S_xx = 248820.

The observed value: t_0 = 28.78 > 1.97 ≈ t_{0.05/2,142}. We reject H_0: there is a linear relationship between x and y.
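As for Set #1, everything here follows from the quoted sums; a sketch:

```python
import math

# Set #2 summary sums from the slides (n = 144 monthly totals).
n = 144
Sx, Sy, Sx2, Sy2, Sxy_raw = 10440.0, 40363.0, 1005720.0, 13371737.0, 3587478.0

Sxx = Sx2 - Sx ** 2 / n
Syy = Sy2 - Sy ** 2 / n
Sxy = Sxy_raw - Sx * Sy / n

b1 = Sxy / Sxx                        # slope, about 2.657
sigma2 = (Syy - b1 * Sxy) / (n - 2)   # about 2121
t0 = b1 / math.sqrt(sigma2 / Sxx)     # observed test statistic, about 28.78
```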
4. Confidence interval for the regression line:

μ̂_{Y|x_0} ± t_{α/2,n−2} √(σ̂² [1/n + (x_0 − x̄)²/S_xx]).

[Plot of y against x with the fitted line and confidence band.]
5. [Residual plot: lm(y ~ x)$res against x, 0 to 140.]
[Plot of y against x for the airline data.]