STAT2012 Statistical Tests 23 Regression analysis: method of least squares


23.1 Introduction

The main purpose of regression is to explore the dependence of one variable (Y) on another variable (X). Suppose we are interested in studying the relation between two variables X and Y, where X is regarded as an explanatory (controlled; independent) variable that may be measured without error, and Y is a response (outcome; dependent) variable. We may have Y = α + βX, or more generally, Y = f(X). This is called a deterministic mathematical relation because it does not allow for any error in predicting Y as a function of X. In this course, we are interested in studying relationships between X and Y that are subject to random errors.

Example: (Predicting height) Parents are often interested in predicting the eventual heights of their children. The following is a portion of the data taken from a study of heights of boys.

Height (inches) at age two (x_i):   39 30 32 34 35 36 36 30 33 37 33 38 32 35
Height (inches) as an adult (y_i):  71 63 63 67 68 68 70 64 65 68 66 70 64 69

[Figure: scatter plot of adult height against height at age two.]

It is clear from the figure that the expected value of Y increases as a linear function of X, but with errors. Hence a deterministic relation is NOT suitable.

SydU STAT2012 (2015) Second semester, Dr. J. Chan

23.2 Linear regression

With repeated experiments, one would find that values of Y often vary in a random manner. Hence the probabilistic model

Y = α + βX + ε,

where ε is a random variable which follows a probability distribution with mean zero, provides a more accurate description of the relationship between X and Y in most cases. The model is generally called a (simple linear) regression model, or the regression of Y on X. The constants α and β are called parameters; they are the intercept and slope (coefficient) of the linear regression model respectively.

[Figure: the regression line Y = α + βX, with y-intercept α, slope β, and residuals e_i = Y_i − Ŷ_i shown as vertical distances from the observed points to the line.]

In practice, the regression function f(x) may be more complex than a simple linear function. For example, we may have

f(x) = α + β_1 x + β_2 x²,  or  f(x) = α + β x^{1/2},  etc.

When we say we have a linear regression for Y, we mean that f(x) is a linear function of the unknown parameters α, β, etc., not necessarily a linear function of x.

23.3 Fitting a straight line by least squares

Suppose we have a set of observed bivariate data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). A scatter plot is drawn and indicates that a straight line model is appropriate. We need to choose a line (regression line) y = α̂ + β̂x which fits the data most closely. In general, we call y = α̂ + β̂x the regression line or fitted line. Many techniques, such as least squares, robust methods, etc., can be used to produce reasonable estimates α̂, β̂ for the unknown regression function f(x).

Note that the expected value (mean) of Y_i given X = x_i under the model is

E(Y_i | X = x_i) = α + β x_i,

and the error, or residual, of the model when Y_i = y_i is

ε_i = y_i − α − β x_i.

By the method of least squares, the values α̂ and β̂ are estimated such that the sum of squared residuals (SSR) is a minimum. That is,

S(α̂, β̂) = min_{α,β} S(α, β),

where the objective function is

S(α, β) = SSR = Σ_{i=1}^n ε_i² = Σ_{i=1}^n (y_i − α − β x_i)².

Then we solve ∂SSR/∂α = 0 and ∂SSR/∂β = 0 for α̂ and β̂.
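Differentiating S(α, β) and setting the two partial derivatives to zero gives the so-called normal equations; this is a standard derivation, sketched here for completeness:

```latex
\frac{\partial S}{\partial \alpha} = -2\sum_{i=1}^n (y_i - \alpha - \beta x_i) = 0,
\qquad
\frac{\partial S}{\partial \beta} = -2\sum_{i=1}^n x_i\,(y_i - \alpha - \beta x_i) = 0
\;\Longrightarrow\;
n\alpha + \beta\sum_{i=1}^n x_i = \sum_{i=1}^n y_i,
\qquad
\alpha\sum_{i=1}^n x_i + \beta\sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i y_i .
```

Solving these two equations simultaneously yields the closed-form solutions for α̂ and β̂ quoted in the notes.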

The solutions are

β̂ = S_xy/S_xx = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)² = Σ (x_i − x̄) y_i / Σ (x_i − x̄)²

and

α̂ = ȳ − β̂ x̄.

Note:
1. Σ (x_i − x̄) ȳ = ȳ Σ (x_i − x̄) = ȳ (n x̄ − n x̄) = 0, which is why y_i can replace (y_i − ȳ) in the numerator of β̂.
2. α̂ and β̂ are random variables, as they both depend on the ε_i through the Y_i. The x_i are fixed numbers.

23.4 Calculations of α̂ and β̂

Example: (Predicting height) Note that

x̄ = 34.286, ȳ = 66.857, Σ x_i² = 16558, Σ y_i² = 62674, Σ x_i y_i = 32183.

Solution:

S_xx = Σ x_i² − n x̄² = 16558 − 14(34.286)² = 100.86,
S_xy = Σ x_i y_i − n x̄ ȳ = 32183 − 14(34.286)(66.857) = 91.57,
β̂ = S_xy/S_xx = 91.57/100.86 = 0.9079,
α̂ = ȳ − β̂ x̄ = 66.857 − 0.9079(34.286) = 35.73.

Hence the fitted least squares line is

ŷ = α̂ + β̂ x = 35.73 + 0.9079 x.

The R function to compute the least squares line, including the residuals of the fit, is lsfit(x,y):

> x=c(39,30,32,34,35,36,36,30,33,37,33,38,32,35)
> y=c(71,63,63,67,68,68,70,64,65,68,66,70,64,69)

> ls.height=lsfit(x,y)
> ls.height$coef
Intercept         X
  35.7280    0.9079
> ls.height$residuals
 [1] -0.1374  0.0340 -1.7819  0.4023  0.4943 -0.4136  1.5864  1.0340
 [9] -0.6898 -1.3215  0.3102 -0.2295 -0.7819  1.4943
> par(mfrow=c(2,2))                  #plots
> plot(x,y)
> abline(ls.height)
> title("Fitted line plot")
> plot(x,ls.height$resid)
> abline(h=0)
> title("Residual plot")
> qqnorm(ls.height$resid)
> qqline(ls.height$resid)
> xm=mean(x)                         #check
> ym=mean(y)
> c(xm,ym)
[1] 34.286 66.857
> n=length(x)
> Sxx=sum(x^2)-sum(x)^2/n
> Sxy=sum(x*y)-sum(x)*sum(y)/n
> c(Sxx,Sxy)
[1] 100.857  91.571

> beta=Sxy/Sxx
> alpha=mean(y)-beta*mean(x)
> c(alpha,beta)
[1] 35.7280  0.9079

[Figure: fitted line plot of y against x, residual plot of ls.height$residuals against x, and normal Q-Q plot of the residuals for the height data.]
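For readers without R at hand, the same least squares arithmetic can be checked in plain Python (a minimal sketch, not part of the course code; the variable names are mine):

```python
# Heights at age two (x) and as an adult (y), from the example above.
x = [39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35]
y = [71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Corrected sums of squares and cross-products.
Sxx = sum(xi * xi for xi in x) - n * xbar ** 2
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar

beta = Sxy / Sxx             # slope:     about 0.9079
alpha = ybar - beta * xbar   # intercept: about 35.73
```

The values agree with the hand calculation and the lsfit output above.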

23.5 Checking residuals

The regression model above does not specify the distribution for Y_i or ε_i. In fact, we usually assume that Y_i ~ N(α + βx_i, σ²) or, equivalently, ε_i ~ iid N(0, σ²), where iid stands for independent and identically distributed.

Model assumptions:
1. Linearity of data, i.e. y_i = α + βx_i + ε_i;
2. Equality of variance, i.e. a common σ² independent of x_i; and
3. Normality of residuals, i.e. ε_i ~ N(0, σ²).

These assumptions may be checked by using
1. Linearity of data: the fitted line plot of y_i;
2. Equality of variance: the scatter plot of the residuals ε̂_i = y_i − ŷ_i against the fitted values ŷ_i;
3. Normality of residuals: the normal qq-plot of the residuals ε̂_i.

The residual plot should be a random scatter around 0 with no particular pattern.

Violations of the assumptions on residuals:
1. If E(ε_i) = a, say, we can just increase α by a and all ε_i will drop by a. Then E(ε_i) = 0.
2. If σ² = σ_i² depends on x_i (e.g. the variability of a stock market index usually increases with time), we may consider models that allow σ_i² to change across x_i.
3. If ε_i does not follow a normal distribution, we may consider other distributions, say t, with wider spread.

4. If ε_i and ε_j for some i ≠ j are dependent, we may consider models that allow a very general variance-covariance structure.

[Figure: residual plots (ε̂ vs ŷ) and fitted line plots (y vs x) for three cases: a random scatter, indicating the model assumptions are OK; a nonconstant variance σ² which increases with x; and a wrong functional form f(x), which should be, say, f(x) = α + β_1 x + β_2 x² + β_3 x³.]

The least squares regression line is sensitive to model assumptions and outliers. If there are patterns in the residuals, one needs to modify the fitted model using transformed data.

24 Regression analysis: distribution of parameters

24.1 Properties of least squares estimates

Result 1. The intercept α̂ and slope β̂ provide unbiased estimates of the true intercept α and slope β, i.e.

E(α̂) = α and E(β̂) = β.

Proof:

E(β̂) = E[ Σ (x_i − x̄) Y_i / S_xx ]
      = Σ (x_i − x̄) E(Y_i) / S_xx
      = Σ (x_i − x̄)(α + βx_i) / S_xx
      = α Σ (x_i − x̄)/S_xx + β Σ (x_i − x̄) x_i / S_xx
      = 0 + β Σ (x_i − x̄)(x_i − x̄)/S_xx       (since Σ (x_i − x̄) = 0)
      = β S_xx/S_xx = β,

E(α̂) = E(Ȳ) − E(β̂) x̄ = (1/n) Σ E(Y_i) − β x̄ = (1/n) Σ (α + βx_i) − β x̄
      = α + β x̄ − β x̄ = α,

since the x_i are fixed, Σ (x_i − x̄) = 0 and E(Y_i) = α + βx_i + E(ε_i) = α + βx_i.

Result 2. Under the assumptions of the regression model, we have

Var(α̂) = σ² (1/n + x̄²/S_xx) = σ² Σ x_i² / (n S_xx),
Var(β̂) = σ²/S_xx,
Cov(α̂, β̂) = −x̄ σ²/S_xx.

Proof:

Var(β̂) = Var( Σ (x_i − x̄) Y_i / S_xx ) = Σ (x_i − x̄)² Var(Y_i) / S_xx² = σ² S_xx/S_xx² = σ²/S_xx,

since Var(aY) = a² Var(Y) and the Y_i are independent.

Var(α̂) = Var(Ȳ − β̂ x̄) = Var(Ȳ) + x̄² Var(β̂) = σ²/n + x̄² σ²/S_xx
        = σ² (1/n + x̄²/S_xx) = σ² (S_xx + n x̄²)/(n S_xx)
        = σ² (Σ x_i² − n x̄² + n x̄²)/(n S_xx) = σ² Σ x_i²/(n S_xx),

since Ȳ and β̂ are uncorrelated, i.e. Cov(Ȳ, β̂) = 0. The proof for Cov(α̂, β̂) is more complicated and hence is omitted.

Note that both variances are divided by S_xx. Hence Var(β̂) increases with σ², the variability of the residuals ε_i, and decreases with S_xx, which measures the variability of the x_i. The fit of a model depends on how big SSR is relative to the total spread of the data.
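The omitted covariance proof is in fact short; a sketch, using Cov(Ȳ, β̂) = (σ²/(n S_xx)) Σ(x_i − x̄) = 0:

```latex
\operatorname{Cov}(\hat\alpha,\hat\beta)
  = \operatorname{Cov}(\bar Y - \hat\beta\bar x,\ \hat\beta)
  = \operatorname{Cov}(\bar Y,\hat\beta) - \bar x\,\operatorname{Var}(\hat\beta)
  = 0 - \bar x\,\frac{\sigma^2}{S_{xx}}
  = -\,\frac{\bar x\,\sigma^2}{S_{xx}} .
```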

[Figure: four sketches illustrating good and bad fits. Small SSR (σ²) with small total spread; small SSR with large total spread (the best β); large SSR with small total spread (the worst β); large SSR with large total spread.]

25 Regression analysis: estimation theory and inference

25.1 Estimate of σ²

For the simple linear regression model

Y_i = α + β x_i + ε_i,  ε_i ~ iid N(0, σ²),

the variances of α̂ and β̂ depend on the residual variance σ². Hence we need to estimate σ². Since σ² is the variance of the residuals, an estimate S² of σ² is based on the average of the squared residuals, whose sum is called the Residual Sum of Squares (SSR):

SSR = Σ ε̂_i² = Σ (y_i − α̂ − β̂ x_i)².

It can be shown that

S² = SSR/(n − 2)

is an unbiased estimate of σ², i.e. E(S²) = σ².

[Figure: the fitted line ŷ_i = α̂ + β̂x_i with the residuals e_i = y_i − ŷ_i shown as vertical distances.]

Note that

SSR = Σ (y_i − α̂ − β̂ x_i)² = Σ [(y_i − ȳ) − β̂(x_i − x̄)]²
    = Σ (y_i − ȳ)² − 2β̂ Σ (x_i − x̄)(y_i − ȳ) + β̂² Σ (x_i − x̄)²
    = S_yy − 2β̂ S_xy + β̂ S_xy
    = S_yy − β̂ S_xy = SSTo − SST
    = S_yy − S_xy²/S_xx,

since α̂ = ȳ − β̂ x̄ and β̂ = S_xy/S_xx (so that β̂² S_xx = β̂ S_xy).

25.2 Distributions of the estimators α̂, β̂ and S²

Assume that we have a linear regression model:

Y_i = α + β x_i + ε_i,  ε_i ~ iid N(0, σ²),  i = 1, 2, ..., n.

Under this model,

(α̂ − α) / [σ √(1/n + x̄²/S_xx)] ~ N(0, 1),
(β̂ − β) / (σ/√S_xx) ~ N(0, 1).

Furthermore, (α̂, β̂) and S² are independent and

(n − 2)S²/σ² ~ χ²_{n−2},
(α̂ − α) / [S √(1/n + x̄²/S_xx)] ~ t_{n−2},
(β̂ − β) / (S/√S_xx) ~ t_{n−2},  or  (β̂ − β)² S_xx/S² ~ F_{1,n−2}.

Note that these results can be used to construct 100(1 − α)% confidence intervals for α and β, which are given by

α: ( α̂ − t_{n−2,α/2} s √(1/n + x̄²/S_xx),  α̂ + t_{n−2,α/2} s √(1/n + x̄²/S_xx) );
β: ( β̂ − t_{n−2,α/2} s/√S_xx,  β̂ + t_{n−2,α/2} s/√S_xx ),

where t_{n−2,α/2} satisfies Pr(t_{n−2} ≥ t_{n−2,α/2}) = α/2.

Example: (Predicting height) Find σ̂² and the CIs for β and α.

Solution: We have

x̄ = 34.286, ȳ = 66.857, n = 14, Σ x_i² = 16558, Σ y_i² = 62674, Σ x_i y_i = 32183,
S_xx = 100.86, S_xy = 91.57, β̂ = 0.9079, α̂ = 35.73.

Hence

S_yy = Σ y_i² − n ȳ² = 62674 − 14(66.857)² = 95.71,
s² = (S_yy − β̂ S_xy)/(n − 2) = (95.71 − 0.9079 × 91.57)/12 = 1.0478,

CI for β = ( β̂ − t_{12,0.975} √(s²/S_xx),  β̂ + t_{12,0.975} √(s²/S_xx) )
         = ( 0.9079 − 2.1788 √(1.0478/100.86),  0.9079 + 2.1788 √(1.0478/100.86) )
         = (0.6859, 1.1300),

CI for α = ( α̂ − t_{12,0.975} √(s²(1/n + x̄²/S_xx)),  α̂ + t_{12,0.975} √(s²(1/n + x̄²/S_xx)) )
         = ( 35.73 − 2.1788 √(1.0478(1/14 + 34.286²/100.86)),  35.73 + 2.1788 √(1.0478(1/14 + 34.286²/100.86)) )
         = (28.09, 43.37).

Since the CI for β does not contain 0, β is significantly different from 0 (indeed greater than 0).
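A quick numeric check of these intervals in plain Python (a sketch with my own variable names; the 97.5% t quantile with 12 degrees of freedom, 2.1788, is taken from tables rather than computed):

```python
import math

x = [39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35]
y = [71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum(xi * xi for xi in x) - n * xbar ** 2
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
Syy = sum(yi * yi for yi in y) - n * ybar ** 2

beta = Sxy / Sxx
alpha = ybar - beta * xbar
s2 = (Syy - beta * Sxy) / (n - 2)   # unbiased estimate of sigma^2

t = 2.1788                          # t_{12, 0.975}, from tables
half_b = t * math.sqrt(s2 / Sxx)
ci_beta = (beta - half_b, beta + half_b)                 # about (0.686, 1.130)
half_a = t * math.sqrt(s2 * (1 / n + xbar ** 2 / Sxx))
ci_alpha = (alpha - half_a, alpha + half_a)
```

The β interval excludes 0, matching the conclusion above.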

In R,

> Syy=sum(y^2)-sum(y)^2/n
> Syy
[1] 95.71429
> SSR=Syy-beta*Sxy
> SSR
[1] 12.5737
> s2=SSR/(n-2)
> s2
[1] 1.0478
> CIb.lower=beta-qt(0.975,n-2)*sqrt(s2/Sxx)
> CIb.upper=beta+qt(0.975,n-2)*sqrt(s2/Sxx)
> c(CIb.lower,CIb.upper)
[1] 0.6859 1.1300
> CIa.lower=alpha-qt(0.975,n-2)*sqrt(s2*(1/n+mean(x)^2/Sxx))
> CIa.upper=alpha+qt(0.975,n-2)*sqrt(s2*(1/n+mean(x)^2/Sxx))
> c(CIa.lower,CIa.upper)
[1] 28.0906 43.3655

25.3 Test on β

Suppose that Y_i = α + β x_i + ε_i, ε_i ~ N(0, σ²). If Y does not change with X, then β = 0. Thus one may wish to test

H_0: β = 0 vs H_1: β ≠ 0

in order to assess whether the explanatory variable X has an influence on the dependent variable Y. The five steps to test the significance of β (i.e. β ≠ 0), and hence of the regression model, are:

1. Hypotheses: H_0: β = β_0 vs H_1: β > β_0, β < β_0 or β ≠ β_0.
2. Test statistic: t_0 = (β̂ − β_0)/(s/√S_xx), where s² = (S_yy − β̂ S_xy)/(n − 2) and β̂ = S_xy/S_xx.
3. Assumptions: Y_i ~ N(α + βx_i, σ²); the Y_i are independent.
4. P-value: Pr(t_{n−2} ≥ t_0) for H_1: β > β_0; Pr(t_{n−2} ≤ t_0) for H_1: β < β_0; 2 Pr(t_{n−2} ≥ |t_0|) for H_1: β ≠ β_0.
5. Decision: Reject H_0 if the p-value < α.

Remarks

1. t_0 can be calculated by

t_0 = (β̂ − β_0)/√(s²/S_xx) = √(n − 2) (S_xy/S_xx − β_0) / √[(S_yy − S_xy²/S_xx)/S_xx].

In particular, if β_0 = 0,

t_0 = √(n − 2) S_xy / √(S_xx S_yy − S_xy²)
    = √(n − 2) [S_xy/√(S_xx S_yy)] / √(1 − S_xy²/(S_xx S_yy))
    = √(n − 2) r / √(1 − r²),

where r is the correlation coefficient in the next section.

2. If H_0: β = 0 is accepted, it implies that X has no linear effect in describing the behaviour of the response Y. Note that this is not equivalent to saying there is no relationship between X and Y. The true model may involve quadratic, cubic, or other more complex functions of X. The following examples illustrate the idea. The p-value indicated on top of each plot corresponds to the 2-sided t-test for a zero slope. Case (b) shows an insignificant result, but there is a clear quadratic relationship between X and Y.

[Figure: four scatter plots (a)-(d) of y against x; p-values for the zero-slope t-test: (a) P = 0.771, (b) P = 0.226, (c) P = 10e-05, (d) P = ....]

3. If H_0: β = 0 is rejected, it implies that the data indicate a linear relationship between X and Y. However, a linear regression might not be the best model to describe the relationship. Case (d) shows a significant result, but it is clear that a quadratic model describes the relationship between X and Y better.
4. The strength of a linear relationship may be described by the sample correlation coefficient.
5. Similar ideas can be applied to construct a test for the intercept α. However, there is more interest in testing β than α, since β carries the information about whether a linear relationship exists between Y and X.
6. The regression line is sensitive to outliers, as will be demonstrated in the computer practical. A nonparametric version of the regression model is kernel smoothing, which is more robust to outliers, but it is not included in this course.

Example: (Air pollution) The data air give four environmental variables: ozone, radiation, temperature and wind. We would like to assess whether the variable temperature (X) has an influence on the dependent variable ozone (Y).

> air
    ozone radiation temperature wind
    ...

Solution: First of all, we look at a scatter plot of the data to see whether a linear model is appropriate.

> plot(air[,3], air[,1])

[Figure: scatter plot of ozone against temperature.]

The scatter plot indicates that a linear model is appropriate.

We fit a simple linear model

Y_i = α + βx_i + ε_i,  ε_i ~ N(0, σ²),

to the data to assess whether the temperature (X) has an influence on ozone (Y). We have obtained the following values from R:

n = 111, S_xx = ..., S_yy = 87.21, S_xy = 702.95.

The test for the significance of β is:

1. Hypotheses: H_0: β = 0 vs H_1: β ≠ 0.
2. Test statistic:
   β̂ = S_xy/S_xx = ...,
   s² = (S_yy − β̂ S_xy)/(n − 2) = ...,
   t_0 = β̂/√(s²/S_xx),  or  t_0 = √(n − 2) S_xy / √(S_xx S_yy − S_xy²) = ....
3. Assumptions: Y_i ~ N(α + βx_i, σ²); the Y_i are independent.
4. P-value: p-value = 2 Pr(t_{109} ≥ t_0) ≈ 0.
5. Decision: There is very strong evidence in the data of a linear relationship between temperature and ozone.

In R,

> x=air[,3]
> y=air[,1]
> c(mean(x),mean(y))
> n=length(x)
> Sxx=sum((x-mean(x))*(x-mean(x)))
> Sxy=sum((x-mean(x))*(y-mean(y)))
> Syy=sum((y-mean(y))*(y-mean(y)))
> c(Sxx, Sxy, Syy)
> beta=Sxy/Sxx                            #check beta
> s2=(Syy-beta*Sxy)/(n-2)
> c(beta,s2)
> t0=beta/sqrt(s2/Sxx)
> t0=sqrt(n-2)*Sxy/sqrt(Sxx*Syy-Sxy^2)    #either way
> p=2*(1-pt(t0,n-2))
> p
[1] 0

The assumptions of the model may be checked as follows in R:

> lsfit(x,y)$coef
Intercept         X
> air.fitted=y-lsfit(x,y)$res             #fitted values

> par(mfrow=c(2,2))
> plot(x,y, xlab="Temperature",ylab="Ozone")   #fitted line plot
> abline(lsfit(x,y))
> plot(air.fitted, lsfit(x,y)$res)             #residual plot
> abline(h=0)
> boxplot(lsfit(x,y)$res)                      #boxplot of residuals
> qqnorm(lsfit(x,y)$res)                       #normal qq plot
> qqline(lsfit(x,y)$res)

[Figure: fitted line plot of ozone against temperature, residual plot of lsfit(x,y)$res against air.fitted, boxplot of the residuals, and normal Q-Q plot of the residuals.]

Comments: The scatter plot and the plot of residuals versus fitted values indicate that a linear model is appropriate. The boxplot and the normal qq-plot of the residuals indicate that the distribution of the residuals is slightly long-tailed; there are three rather large residuals in particular. This might indicate that a robust regression should be used.

26 Regression analysis: ANOVA and prediction

26.1 ANOVA for regression

The linear regression model Y_i = α + β x_i + ε_i, i = 1, 2, ..., n, where ε_i ~ iid N(0, σ²), is significant if the slope parameter β that describes the effect of X on Y is non-zero. Note that

y_i − ȳ = (ŷ_i − ȳ) + (y_i − ŷ_i),

where ŷ_i = α̂ + β̂x_i. We have

Σ (y_i − ȳ)²  =  Σ (ŷ_i − ȳ)²  +  Σ (y_i − ŷ_i)²,
    SSTo             SST              SSR

where

SST = Σ (ŷ_i − ȳ)² = Σ (α̂ + β̂x_i − ȳ)² = Σ (β̂x_i − β̂x̄)² = β̂² Σ (x_i − x̄)² = β̂² S_xx = β̂ S_xy = S_xy²/S_xx,
SSR = Σ (y_i − ŷ_i)² = (n − 2)s² = S_yy − β̂ S_xy,

since α̂ = ȳ − β̂x̄, i.e. α̂ − ȳ = −β̂x̄.

[Figure: decomposition of the TOTAL variation in y_i, SSTo = Σ(y_i − ȳ)², into the variation explained by the regression, SST = Σ(ŷ_i − ȳ)², and the unexplained error, SSR = Σ(y_i − ŷ_i)².]

On the other hand, the F test statistic for testing H_0: β = β_0 = 0 is

f_0 = t_0² = β̂²/(s²/S_xx) = β̂² S_xx/s² = β̂ S_xy/s² = (SST/1)/(SSR/(n − 2)).

Since t²_{m,1−α/2} = F_{1,m,1−α} for any degrees of freedom m, the

p-value = Pr(F_{1,n−2} ≥ f_0) = Pr(t²_{n−2} ≥ t_0²) = Pr(|t_{n−2}| ≥ |t_0|) = 2 Pr(t_{n−2} ≥ |t_0|)

is the same as that in the t-test. These calculations are summarized in the regression ANOVA table, which is like the ANOVA tables obtained in the ANOVA test.

Regression ANOVA table
Source      df      SS                  MS        F
Regression  1       S_xy²/S_xx          MST       MST/s²
Residuals   n − 2   S_yy − S_xy²/S_xx   MSR = s²
Total       n − 1   S_yy

Example: (Air pollution) Based on the previous calculations, we have S_xx = ..., S_xy = ..., S_yy = ..., α̂ = ..., β̂ = .... Complete the regression ANOVA table and hence test for the significance of the regression model.

Solution: We have

SST = β̂ S_xy = ...,
SSR = S_yy − SST = ....

Hence:

Regression ANOVA table
Source      df    SS    MS    F
Regression  1     ...   ...   ...
Residuals   109   ...   ...
Total       110   ...

The test for the significance of the regression model is:

1. Hypotheses: H_0: β = 0 vs H_1: β ≠ 0.
2. Test statistic: f_0 = (SST/1)/(SSR/(n − 2)) = ... = t_0².
3. Assumptions: Y_i ~ N(α + βx_i, σ²); the Y_i are independent.
4. P-value: p-value = Pr(F_{1,109} > f_0) ≈ 0.
5. Decision: Since the p-value < 0.05, we reject H_0. There is strong evidence of a linear relationship between temperature (X) and ozone (Y).

In R,

> summary(aov(air[,1]~air[,3]))
            Df Sum of Sq Mean Sq F Value Pr(F)
air[, 3]     1    ...
Residuals  109    ...

Remarks

The test of the hypothesis β = 0 in regression is similar to the ANOVA test for the completely randomized design (one-way data) that all treatments are equal. Under the null hypothesis, we have

ANOVA test: μ_1 = ⋯ = μ_g = μ, so Y_i ~ N(μ, σ²);
Regression test: β = 0, so Y_i ~ N(α, σ²),

stating that the treatments (by groups in ANOVA, or by the explanatory variable X in regression) are unrelated to the response Y in any way.

26.2 Estimation at X = x_0

Suppose we have fitted a simple linear regression model: Y_i = α + β x_i + ε_i, where ε_i ~ iid N(0, σ²). Obviously, E(Y_i) = α + βx_i. It is often of interest to estimate the expected value E(Y_0) = E(Y | X = x_0) at a specified value X = x_0. An obvious choice for the estimate is

Ê(Y_0) = α̂ + β̂ x_0.

[Figure: estimation of Ŷ_0 at X = x_0 on the line Y = α + βX.]

For this estimate, we have the following results (see Section 24.1):

E(α̂ + β̂x_0) = α + β x_0,

Var(α̂ + β̂x_0) = Var(α̂) + x_0² Var(β̂) + 2x_0 Cov(α̂, β̂)
  = σ² ( Σx_i²/(n S_xx) + x_0²/S_xx − 2x_0 x̄/S_xx )
  = σ² ( Σx_i² − n x̄² + n x̄² + n x_0² − 2n x_0 x̄ ) / (n S_xx)
  = σ² ( S_xx + n(x̄² − 2x̄x_0 + x_0²) ) / (n S_xx)
  = σ² [ 1/n + (x_0 − x̄)²/S_xx ].

Hence

[Ê(Y_0) − E(Y_0)] / se[Ê(Y_0)] = [α̂ + β̂x_0 − (α + βx_0)] / [s √(1/n + (x_0 − x̄)²/S_xx)] ~ t_{n−2},

and a 100(1 − α)% confidence interval for E(Y_0) = α + βx_0 is

( α̂ + β̂x_0 − t_{n−2,1−α/2} s √(1/n + (x_0 − x̄)²/S_xx),  α̂ + β̂x_0 + t_{n−2,1−α/2} s √(1/n + (x_0 − x̄)²/S_xx) ),

where t_{n−2,1−α/2} satisfies Pr(t_{n−2} ≤ t_{n−2,1−α/2}) = 1 − α/2.

Note: When x_0 = 0, ŷ_0 = α̂ + β̂·0 = α̂ and the estimation interval becomes the CI for α, given by

( α̂ − t_{n−2,1−α/2} s √(1/n + x̄²/S_xx),  α̂ + t_{n−2,1−α/2} s √(1/n + x̄²/S_xx) ).

26.3 Prediction at X = x_0

Now, instead of estimating the expected value E(Y_0) at x_0, we wish to predict the response Y_0 that we will observe in the future when X = x_0. Under the regression model Y_i = α + βx_i + ε_i, we can employ

Ŷ_0 = α̂ + β̂x_0 + ε̂_0 = α̂ + β̂x_0

as a predictor of Y_0 = α + βx_0 + ε_0, since ε̂_0 = 0. This is the same as Ê(Y_0) = α̂ + β̂x_0. Note that Ŷ_0 is unbiased, since

E(Ŷ_0) = E(α̂) + E(β̂)x_0 + E(ε̂_0) = α + βx_0 = E(Y_0), with E(ε̂_0) = 0.

However, the variance of Ŷ_0 as a predictor of Y_0 is greater than that of Ê(Y_0) as an estimator of E(Y_0), because Y_0 contains the additional prediction error ε_0:

Var(Ŷ_0 − Y_0) = Var(α̂ + β̂x_0) + Var(ε_0)
              = σ² [1/n + (x_0 − x̄)²/S_xx] + σ²
              = σ² [1 + 1/n + (x_0 − x̄)²/S_xx],

since α̂ + β̂x_0 and ε_0 are independent. Hence

(Ŷ_0 − Y_0)/se(Ŷ_0 − Y_0) = [α̂ + β̂x_0 − (α + βx_0 + ε_0)] / [s √(1 + 1/n + (x_0 − x̄)²/S_xx)] ~ t_{n−2},

and a 100(1 − α)% prediction interval for Y_0 is given by

( α̂ + β̂x_0 − t_{n−2,1−α/2} s √(1 + 1/n + (x_0 − x̄)²/S_xx),  α̂ + β̂x_0 + t_{n−2,1−α/2} s √(1 + 1/n + (x_0 − x̄)²/S_xx) ),

where t_{n−2,1−α/2} satisfies Pr(t_{n−2} ≤ t_{n−2,1−α/2}) = 1 − α/2.

26.4 Effects on confidence intervals

Note: Var(α̂) = σ² Σx_i²/(n S_xx), Var(β̂) = σ²/S_xx, Var[Ê(Y_0)] = σ²(1/n + (x_0 − x̄)²/S_xx) and Var(Ŷ_0 − Y_0) = σ²(1 + 1/n + (x_0 − x̄)²/S_xx) all depend on summary statistics Σ Y_i, Σ x_i, Σ Y_i², Σ x_i Y_i, which vary from sample to sample. Hence α̂, β̂, Ê(Y_0) and Ŷ_0 are random variables which vary from sample to sample. Their variances depend on the following factors:

1. The smaller σ² (with estimate S², which measures the variability of the residuals ε_i), the better the fit and hence the smaller the variances of α̂, β̂, Ê(Y_0) and Ŷ_0.
2. The larger the spread of the x_i, as measured by S_xx, the more information about the Y_i and hence the smaller the variances of α̂, β̂, Ê(Y_0) and Ŷ_0. If all x are the same, it would not be possible to draw any conclusion about the dependence of Y on x; in this case S_xx = 0, giving Var(β̂) = ∞.
3. The larger the sample size n, giving more information about the Y_i, the smaller the variances of α̂, Ê(Y_0) and Ŷ_0.
4. The closer x_0 is to x̄, the smaller the squared distance (x_0 − x̄)², and hence the smaller the variances of Ê(Y_0) and Ŷ_0.

[Figure: estimation and prediction intervals for adult height against height at age two. The interval is shorter in the middle, and the estimation interval is shorter than the prediction interval; the error bands widen away from (x̄, Ȳ).]

Why does the error increase away from x̄? Because if you wiggle the regression line, it makes more of a difference the further you are from the mean x̄. Note that the line always passes through (x̄, Ȳ): when X = x̄,

Y = α̂ + β̂ x̄ = (Ȳ − β̂ x̄) + β̂ x̄ = Ȳ.

Hence (x̄, Ȳ) lies on the line Y = α̂ + β̂X.

Example: (Predicting height) Parents are often interested in predicting the eventual heights of their children. The following is a portion of the data taken from a study of heights of boys.

Height (inches) at age two (x_i):   39 30 32 34 35 36 36 30 33 37 33 38 32 35
Height (inches) as an adult (y_i):  71 63 63 67 68 68 70 64 65 68 66 70 64 69

1. Indicate whether a linear relationship between the heights of boys at age two and as an adult is significant.
2. Estimate the average height of adults who were 40 inches tall at the age of two and give a 90% estimation interval for this estimate.
3. Predict the height of an adult who was 40 inches tall at the age of two. Give a 90% prediction interval for this prediction.

Solution: From previous working, we have obtained

x̄ = 34.286, S_xx = 100.86, S_yy = 95.71, S_xy = 91.57,

and the least squares line is

ŷ = α̂ + β̂x = 35.73 + 0.9079x.

1. Then

SST = S_xy²/S_xx = 91.57²/100.86 = 83.14,
SSR = S_yy − S_xy²/S_xx = 95.71 − 83.14 = 12.57.

Hence the regression ANOVA table is obtained as follows:

Source      df   SS      MS      F
Regression  1    83.14   83.14   79.35
Residuals   12   12.57   1.048
Total       13   95.71

The test for the significance of the regression model is:

1. Hypotheses: H_0: β = 0 vs H_1: β ≠ 0.
2. Test statistic: f_0 = (SST/1)/(SSR/(n − 2)) = 83.14/1.048 = 79.35.
3. Assumptions: Y_i ~ N(α + βx_i, σ²); the Y_i are independent.
4. P-value: Pr(F_{1,12} > 79.35) ≈ 0 (F_{1,12,0.999} = 18.6, from R).
5. Decision: Since the p-value < 0.05, we reject H_0. There is strong evidence of a linear relationship between the heights of boys at age two and their heights as adults.

In R,

> Syy=sum(y^2)-sum(y)^2/n
> Syy
[1] 95.71429
> SST=beta*Sxy
> SST
[1] 83.1406
> SSR=Syy-SST
> SSR
[1] 12.5737
> s2=SSR/(n-2)
> s2
[1] 1.0478
> f0=SST/s2
> f0

[1] 79.3475
> t0=sqrt(f0)
> t0
[1] 8.9077
> p.value=1-pf(f0,1,n-2)
> p.value
[1] ...e-06

2. According to the fitted line, the expected average height E(Y_0) of an adult corresponding to a height of 40 inches at age two is

Ê(Y_0) = 35.73 + 0.9079 × 40 = 72.04.

Note that n = 14, t_{12,0.95} = 1.782, s² = 1.0478, s = 1.0236. Hence the 90% estimation interval for the average height of adults whose height at age two is x_0 = 40 is

( α̂ + β̂x_0 − t_{n−2,α/2} s √(1/n + (x_0 − x̄)²/S_xx),  α̂ + β̂x_0 + t_{n−2,α/2} s √(1/n + (x_0 − x̄)²/S_xx) )
= ( 72.04 − 1.782(1.0236) √(1/14 + (40 − 34.286)²/100.86),  72.04 + 1.782(1.0236) √(1/14 + (40 − 34.286)²/100.86) )
= (70.90, 73.19).

3. According to the fitted line, the predicted height of a boy with height 40 inches at age two is still 72.04 inches. The 90% prediction interval for the height of an adult whose height at age two is x_0 = 40 is

( α̂ + β̂x_0 − t_{n−2,α/2} s √(1 + 1/n + (x_0 − x̄)²/S_xx),  α̂ + β̂x_0 + t_{n−2,α/2} s √(1 + 1/n + (x_0 − x̄)²/S_xx) )
= ( 72.04 − 1.782(1.0236) √(1 + 1/14 + (40 − 34.286)²/100.86),  72.04 + 1.782(1.0236) √(1 + 1/14 + (40 − 34.286)²/100.86) )
= (69.89, 74.20).

In R,

> x0=40
> y0=alpha+beta*x0
> y0
[1] 72.04533
> alp=0.1
> t=qt(1-alp/2,n-2)
> t
[1] 1.782288
> se.est=sqrt(s2*(1/n+(x0-mean(x))^2/Sxx))
> se.est
[1] 0.6434936

> se.pre=sqrt(s2*(1+1/n+(x0-mean(x))^2/Sxx))
> se.pre
[1] 1.209089
> CIest.low=y0-t*se.est
> CIest.up=y0+t*se.est
> c(CIest.low,CIest.up)
[1] 70.89843 73.19222
> CIpre.low=y0-t*se.pre
> CIpre.up=y0+t*se.pre
> c(CIpre.low,CIpre.up)
[1] 69.89038 74.20027

Note: x_0 = 40 is outside the range of X in the data. If x_0 is well outside the range of X, the estimation and prediction may not be reliable, because we are not sure whether the fitted relationship can be extended to x_0.
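The estimation and prediction intervals above can be checked in plain Python (a sketch with my own variable names; t_{12,0.95} = 1.782 is taken from tables):

```python
import math

x = [39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35]
y = [71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum(xi * xi for xi in x) - n * xbar ** 2
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
Syy = sum(yi * yi for yi in y) - n * ybar ** 2
beta = Sxy / Sxx
alpha = ybar - beta * xbar
s = math.sqrt((Syy - beta * Sxy) / (n - 2))

x0, t = 40, 1.782                    # t_{12, 0.95}, from tables
y0 = alpha + beta * x0               # point estimate, about 72.04
se_est = s * math.sqrt(1 / n + (x0 - xbar) ** 2 / Sxx)      # for E(Y0)
se_pre = s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / Sxx)  # for Y0
est_int = (y0 - t * se_est, y0 + t * se_est)   # about (70.90, 73.19)
pre_int = (y0 - t * se_pre, y0 + t * se_pre)   # about (69.89, 74.20)
```

As expected, the prediction interval is wider than the estimation interval, since it carries the extra error term ε_0.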

27 Regression and correlation

27.1 Correlation coefficient

In practice, one may be interested in the strength of the linear relationship between X and Y. Suppose X and Y are two random variables. The correlation coefficient ρ, as a measure of the strength of the linear relationship between X and Y, is defined as

ρ = E{[X − E(X)][Y − E(Y)]} / √(Var(X) Var(Y)) = [E(XY) − E(X)E(Y)] / √(Var(X) Var(Y)) = Cov(X, Y) / √(Var(X) Var(Y)).

Note: If ρ = 0, then X and Y are said to be uncorrelated.

Definition: The sample correlation coefficient r for the bivariate data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) is defined as

r = S_xy/√(S_xx S_yy) = Σ (x_i − x̄)(y_i − ȳ) / √[ Σ (x_i − x̄)² · Σ (y_i − ȳ)² ]

or

r = ( Σ x_i y_i − n x̄ȳ ) / √[ ( Σ x_i² − n x̄² )( Σ y_i² − n ȳ² ) ].

Hence r gives a point estimator of ρ, and measures the strength of the linear relationship between X and Y based on the sample.

27.2 Interpreting r and r²

Given bivariate data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), the square of the correlation coefficient, r², called the coefficient of determination, measures the proportion of the total variation in Y explained by the linear regression model.

[Figure: a fitted line y = a + bx showing, for each point, the distance to the line r_i = y_i − ŷ_i (residual) and the distance to the mean d_i = y_i − ȳ, with the horizontal line y = ȳ.]

It is one minus the proportion of variation not explained by the model:

r² = 1 − Σ r_i²/Σ d_i² = 1 − Σ (y_i − ŷ_i)²/Σ (y_i − ȳ)² = 1 − SSR/SSTo,

where

Σ (y_i − ȳ)² = Σ (ŷ_i − ȳ)² + Σ (y_i − ŷ_i)²,  or  S_yy = S_xy²/S_xx + (S_yy − S_xy²/S_xx),
    SSTo           SST             SSR

and SSTo is the total variation in Y, SST is the sum of squares explained by the regression line, and SSR = SSTo − SST is the variation in Y remaining unexplained.
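As a concrete check with the height data used earlier (a small Python sketch; the notes do this in R):

```python
x = [39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35]
y = [71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum(xi * xi for xi in x) - n * xbar ** 2
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
Syy = sum(yi * yi for yi in y) - n * ybar ** 2

r2 = Sxy ** 2 / (Sxx * Syy)   # coefficient of determination
# Equivalently, r2 = 1 - SSR/SSTo:
beta = Sxy / Sxx
SSR = Syy - beta * Sxy
assert abs(r2 - (1 - SSR / Syy)) < 1e-12
```

Here r² is about 0.87, i.e. roughly 87% of the variation in adult height is explained by the linear regression on height at age two.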

Note that

r = S_xy/√(S_xx S_yy),  r² = S_xy²/(S_xx S_yy) = (S_xy²/S_xx)/S_yy = SST/SSTo,  and  1 − r² = (S_yy − S_xy²/S_xx)/S_yy = SSR/SSTo.

Hence the coefficient of determination r² measures the strength of the linear relationship between X and Y by the percentage of variation in Y explained by the linear regression model in X.

Interpreting r and r²:

1. −1 ≤ r ≤ 1.
2. r² = 1 when all data in the scatter plot lie on the fitted least squares line (SSR = 0).
3. r² = 0 when there is no linear relationship between the points in the scatter plot: SST = Σ(ŷ_i − ȳ)² = 0, so ŷ_i = Ȳ for all i (β̂ = 0 and α̂ = Ȳ − β̂x̄ = Ȳ).

[Figure: the correlation scale from −1.00 (perfect negative correlation) through strong, moderate and weak negative correlation, 0 (no correlation), and weak, moderate and strong positive correlation, to +1.00 (perfect positive correlation).]

[Figure: four scatter plots illustrating different values of r.]

27.3 Test for ρ

By using the sample correlation coefficient r, we may construct a test for ρ = 0. The five steps for the test of the correlation coefficient ρ are:

1. Hypotheses: H_0: ρ = 0 vs H_1: ρ > 0, ρ < 0 or ρ ≠ 0.
2. Test statistic: t_0 = r √[(n − 2)/(1 − r²)].
3. Assumption: The data are taken from a bivariate normal population.
4. P-value: Pr(t_{n−2} ≥ t_0) for H_1: ρ > 0; Pr(t_{n−2} ≤ t_0) for H_1: ρ < 0; 2 Pr(t_{n−2} ≥ |t_0|) for H_1: ρ ≠ 0.
5. Decision: Reject H_0 if the p-value < α.

Note: The tests for ρ = 0 and for β = 0 are equivalent, since

t_0² = r²(n − 2)/(1 − r²)
     = [S_xy²/(S_xx S_yy)](n − 2) / [1 − S_xy²/(S_xx S_yy)]
     = S_xy²(n − 2)/(S_xx S_yy − S_xy²)
     = β̂ S_xy (n − 2)/(S_yy − β̂ S_xy)
     = SST/(SSR/(n − 2)) = f_0.

Example: (Automobile) The data in the following table give the miles per gallon obtained by a test automobile when using gasolines of varying octane levels.

Octane (x):            89   93   87   90   89   95  100   98
Miles per gallon (y):  13 13.2   13 13.6 13.3 13.8 14.1   14

[Figure: scatter plot of miles per gallon (y) vs octane (x).]

Given Σ x_i = 741, Σ y_i = 108, Σ x_i y_i = 10016.3, Σ x_i² = 68789, Σ y_i² = 1459.34,

(a) calculate the value of r;
(b) do the data provide sufficient evidence to indicate that the octane level and miles per gallon are dependent?

Solution: We have

    Sxx = Σxi² − (Σxi)²/n = 68789 − 741²/8 = 153.875,
    Syy = Σyi² − (Σyi)²/n = 1459.34 − 108²/8 = 1.34,
    Sxy = Σxiyi − (Σxi)(Σyi)/n = 10016.3 − (741)(108)/8 = 12.8.

(a) r = Sxy / √(Sxx Syy) = 12.8 / √(153.875 × 1.34) = 0.8914.

(b) The test for the correlation coefficient ρ between the octane level and miles per gallon is

1. Hypotheses: H0: ρ = 0 vs H1: ρ ≠ 0.
2. Test statistic: t0 = r √(n − 2) / √(1 − r²) = 0.8914 × √6 / √(1 − 0.7946) = 4.818.
3. Assumption: The data are taken from a bivariate normal population.
4. P-value: p-value = 2 Pr(t6 ≥ 4.818) ∈ (0.002, 0.01) (t6,0.995 = 3.707, t6,0.999 = 5.208; from R).
5. Decision: Since the p-value < 0.05, we reject H0. There is strong evidence in the data that the octane level and miles per gallon are dependent.

In R,

> x = c(89, 93, 87, 90, 89, 95, 100, 98)
> y = c(13, 13.2, 13, 13.6, 13.3, 13.8, 14.1, 14)
> cor.test(x, y, alt = "two.sided")

        Pearson's product-moment correlation

data:  x and y
t = 4.8178, df = 6, p-value = 0.0029
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5022 0.9803
sample estimates:
   cor
0.8914

> n = length(x)    # checking
> Sxx = sum(x^2) - sum(x)^2/n
> Sxy = sum(x*y) - sum(x)*sum(y)/n
> Syy = sum(y^2) - sum(y)^2/n
> c(Sxx, Sxy, Syy)
[1] 153.875  12.800   1.340
> beta = Sxy/Sxx
> alpha = mean(y) - beta*mean(x)
> c(alpha, beta)
[1] 5.7950 0.0832
> SST = beta*Sxy
> r2 = SST/Syy
> r = sqrt(r2)
> t0 = r*sqrt((n-2)/(1-r^2))
> p.value = 2*(1 - pt(abs(t0), n-2))
> c(SST, r2, r, t0, p.value)
[1] 1.0648 0.7946 0.8914 4.8178 0.0029


More information

Regression Analysis: Basic Concepts

Regression Analysis: Basic Concepts The simple linear model Regression Analysis: Basic Concepts Allin Cottrell Represents the dependent variable, y i, as a linear function of one independent variable, x i, subject to a random disturbance

More information

Simple Linear Regression Analysis

Simple Linear Regression Analysis LINEAR REGRESSION ANALYSIS MODULE II Lecture - 6 Simple Linear Regression Analysis Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur Prediction of values of study

More information

y ˆ i = ˆ " T u i ( i th fitted value or i th fit)

y ˆ i = ˆ  T u i ( i th fitted value or i th fit) 1 2 INFERENCE FOR MULTIPLE LINEAR REGRESSION Recall Terminology: p predictors x 1, x 2,, x p Some might be indicator variables for categorical variables) k-1 non-constant terms u 1, u 2,, u k-1 Each u

More information

Lecture notes on Regression & SAS example demonstration

Lecture notes on Regression & SAS example demonstration Regression & Correlation (p. 215) When two variables are measured on a single experimental unit, the resulting data are called bivariate data. You can describe each variable individually, and you can also

More information