STAT2012 Statistical Tests 23 Regression analysis: method of least squares


23.1 Introduction

The main purpose of regression is to explore the dependence of one variable (Y) on another variable (X). Suppose we are interested in studying the relation between two variables X and Y, where X is regarded as an explanatory (controlled; independent) variable that may be measured without error, and Y is a response (outcome; dependent) variable. We may have Y = α + βX, or more generally, Y = f(X). This is called a deterministic mathematical relation because it does not allow for any error in predicting Y as a function of X. In this course, we are interested in studying relationships between X and Y that are subject to random errors.

Example: (Predicting height) Parents are often interested in predicting the eventual heights of their children. The following is a portion of the data taken from a study of heights of boys.

Height (inches) at age two (x_i):   39 30 32 34 35 36 36 30 33 37 33 38 32 35
Height (inches) as an adult (y_i):  71 63 63 67 68 68 70 64 65 68 66 70 64 69

[Figure: scatter plot of adult height against height at age two.]

It is clear from the figure that the expected value of Y increases as a linear function of X, but with errors. Hence a deterministic relation is NOT suitable.

SydU STAT2012 (2015) Second semester, Dr. J. Chan

23.2 Linear regression

With repeated experiments, one would find that values of Y often vary in a random manner. Hence the probabilistic model

Y = α + βX + ε,

where ε is a random variable which follows a probability distribution with mean zero, provides a more accurate description of the relationship between X and Y in most cases. The model is generally called a (simple linear) regression model, or the regression of Y on X. The constants α and β are called parameters; they are the intercept and slope (coefficient) of the linear regression model respectively.

[Figure: the regression line Y = α + βX, with y-intercept α, slope β, and residuals e_i = Y_i − Ŷ_i shown as vertical distances from the observed points to the line.]

In practice, the regression function f(x) may be more complex than a simple linear function. For example, we may have

f(x) = α + β_1 x + β_2 x²,  or  f(x) = α + β x^{1/2},  etc.

When we say we have a linear regression for Y, we mean that f(x) is a linear function of the unknown parameters α, β, etc., not necessarily a linear function of x.

23.3 Fitting a straight line by least squares

Suppose we have a set of observed bivariate data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). A scatter plot is drawn and indicates that a straight line model is appropriate. We need to choose a line (regression line) y = α̂ + β̂x which fits the data most closely. In general, we call y = α̂ + β̂x the regression line or fitted line. Many techniques, such as least squares, robust methods, etc., can be used to produce reasonable estimates α̂, β̂ for the unknown regression function f(x).

Note that the expected value (mean) of Y_i given X = x_i under the model is

E(Y_i | X = x_i) = α + β x_i,

and the error, or residual, of the model when Y_i = y_i is

ε_i = y_i − α − β x_i.

By the method of least squares, the values α̂ and β̂ are estimated such that the sum of squared residuals (SSR) is a minimum. That is,

S(α̂, β̂) = min_{α,β} S(α, β),

where the objective function is

S(α, β) = SSR = Σ_{i=1}^n ε_i² = Σ_{i=1}^n (y_i − α − β x_i)².

Then we solve ∂SSR/∂α = 0 and ∂SSR/∂β = 0 for α̂ and β̂.
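Differentiating S(α, β) and setting the two partial derivatives to zero gives the so-called normal equations; this is a standard derivation, sketched here for completeness:

```latex
\frac{\partial S}{\partial \alpha} = -2\sum_{i=1}^n (y_i - \alpha - \beta x_i) = 0,
\qquad
\frac{\partial S}{\partial \beta} = -2\sum_{i=1}^n x_i\,(y_i - \alpha - \beta x_i) = 0
\;\Longrightarrow\;
n\alpha + \beta\sum_{i=1}^n x_i = \sum_{i=1}^n y_i,
\qquad
\alpha\sum_{i=1}^n x_i + \beta\sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i y_i .
```

Solving these two equations simultaneously yields the closed-form solutions for α̂ and β̂ quoted in the notes.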

The solutions are

β̂ = S_xy/S_xx = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)² = Σ (x_i − x̄) y_i / Σ (x_i − x̄)²

and

α̂ = ȳ − β̂ x̄.

Note:
1. Σ (x_i − x̄) ȳ = ȳ Σ (x_i − x̄) = ȳ (n x̄ − n x̄) = 0, which is why y_i can replace (y_i − ȳ) in the numerator of β̂.
2. α̂ and β̂ are random variables, as they both depend on the ε_i through the Y_i. The x_i are fixed numbers.

23.4 Calculations of α̂ and β̂

Example: (Predicting height) Note that

x̄ = 34.286, ȳ = 66.857, Σ x_i² = 16558, Σ y_i² = 62674, Σ x_i y_i = 32183.

Solution:

S_xx = Σ x_i² − n x̄² = 16558 − 14(34.286)² = 100.86,
S_xy = Σ x_i y_i − n x̄ ȳ = 32183 − 14(34.286)(66.857) = 91.57,
β̂ = S_xy/S_xx = 91.57/100.86 = 0.9079,
α̂ = ȳ − β̂ x̄ = 66.857 − 0.9079(34.286) = 35.73.

Hence the fitted least squares line is

ŷ = α̂ + β̂ x = 35.73 + 0.9079 x.

The R function to compute the least squares line, including the residuals of the fit, is lsfit(x,y):

> x=c(39,30,32,34,35,36,36,30,33,37,33,38,32,35)
> y=c(71,63,63,67,68,68,70,64,65,68,66,70,64,69)

> ls.height=lsfit(x,y)
> ls.height$coef
Intercept         X
  35.7280    0.9079
> ls.height$residuals
 [1] -0.1374  0.0340 -1.7819  0.4023  0.4943 -0.4136  1.5864  1.0340
 [9] -0.6898 -1.3215  0.3102 -0.2295 -0.7819  1.4943
> par(mfrow=c(2,2))                  #plots
> plot(x,y)
> abline(ls.height)
> title("Fitted line plot")
> plot(x,ls.height$resid)
> abline(h=0)
> title("Residual plot")
> qqnorm(ls.height$resid)
> qqline(ls.height$resid)
> xm=mean(x)                         #check
> ym=mean(y)
> c(xm,ym)
[1] 34.286 66.857
> n=length(x)
> Sxx=sum(x^2)-sum(x)^2/n
> Sxy=sum(x*y)-sum(x)*sum(y)/n
> c(Sxx,Sxy)
[1] 100.857  91.571

> beta=Sxy/Sxx
> alpha=mean(y)-beta*mean(x)
> c(alpha,beta)
[1] 35.7280  0.9079

[Figure: fitted line plot of y against x, residual plot of ls.height$residuals against x, and normal Q-Q plot of the residuals for the height data.]
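For readers without R at hand, the same least squares arithmetic can be checked in plain Python (a minimal sketch, not part of the course code; the variable names are mine):

```python
# Heights at age two (x) and as an adult (y), from the example above.
x = [39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35]
y = [71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Corrected sums of squares and cross-products.
Sxx = sum(xi * xi for xi in x) - n * xbar ** 2
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar

beta = Sxy / Sxx             # slope:     about 0.9079
alpha = ybar - beta * xbar   # intercept: about 35.73
```

The values agree with the hand calculation and the lsfit output above.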

23.5 Checking residuals

The regression model above does not specify the distribution for Y_i or ε_i. In fact, we usually assume that Y_i ~ N(α + βx_i, σ²) or, equivalently, ε_i ~ iid N(0, σ²), where iid stands for independent and identically distributed.

Model assumptions:
1. Linearity of data, i.e. y_i = α + βx_i + ε_i;
2. Equality of variance, i.e. a common σ² independent of x_i; and
3. Normality of residuals, i.e. ε_i ~ N(0, σ²).

These assumptions may be checked by using
1. Linearity of data: the fitted line plot of y_i;
2. Equality of variance: the scatter plot of the residuals ε̂_i = y_i − ŷ_i against the fitted values ŷ_i;
3. Normality of residuals: the normal qq-plot of the residuals ε̂_i.

The residual plot should be a random scatter around 0 with no particular pattern.

Violations of the assumptions on residuals:
1. If E(ε_i) = a, say, we can just increase α by a and all ε_i will drop by a. Then E(ε_i) = 0.
2. If σ² = σ_i² depends on x_i (e.g. the variability of a stock market index usually increases with time), we may consider models that allow σ_i² to change across x_i.
3. If ε_i does not follow a normal distribution, we may consider other distributions, say t, with wider spread.

4. If ε_i and ε_j for some i ≠ j are dependent, we may consider models that allow a very general variance-covariance structure.

[Figure: residual plots (ε̂ vs ŷ) and fitted line plots (y vs x) for three cases: a random scatter, indicating the model assumptions are OK; a nonconstant variance σ² which increases with x; and a wrong functional form f(x), which should be, say, f(x) = α + β_1 x + β_2 x² + β_3 x³.]

The least squares regression line is sensitive to model assumptions and outliers. If there are patterns in the residuals, one needs to modify the fitted model using transformed data.

24 Regression analysis: distribution of parameters

24.1 Properties of least squares estimates

Result 1. The intercept α̂ and slope β̂ provide unbiased estimates of the true intercept α and slope β, i.e.

E(α̂) = α and E(β̂) = β.

Proof:

E(β̂) = E[ Σ (x_i − x̄) Y_i / S_xx ]
      = Σ (x_i − x̄) E(Y_i) / S_xx
      = Σ (x_i − x̄)(α + βx_i) / S_xx
      = α Σ (x_i − x̄)/S_xx + β Σ (x_i − x̄) x_i / S_xx
      = 0 + β Σ (x_i − x̄)(x_i − x̄)/S_xx       (since Σ (x_i − x̄) = 0)
      = β S_xx/S_xx = β,

E(α̂) = E(Ȳ) − E(β̂) x̄ = (1/n) Σ E(Y_i) − β x̄ = (1/n) Σ (α + βx_i) − β x̄
      = α + β x̄ − β x̄ = α,

since the x_i are fixed, Σ (x_i − x̄) = 0 and E(Y_i) = α + βx_i + E(ε_i) = α + βx_i.

Result 2. Under the assumptions of the regression model, we have

Var(α̂) = σ² (1/n + x̄²/S_xx) = σ² Σ x_i² / (n S_xx),
Var(β̂) = σ²/S_xx,
Cov(α̂, β̂) = −x̄ σ²/S_xx.

Proof:

Var(β̂) = Var( Σ (x_i − x̄) Y_i / S_xx ) = Σ (x_i − x̄)² Var(Y_i) / S_xx² = σ² S_xx/S_xx² = σ²/S_xx,

since Var(aY) = a² Var(Y) and the Y_i are independent.

Var(α̂) = Var(Ȳ − β̂ x̄) = Var(Ȳ) + x̄² Var(β̂) = σ²/n + x̄² σ²/S_xx
        = σ² (1/n + x̄²/S_xx) = σ² (S_xx + n x̄²)/(n S_xx)
        = σ² (Σ x_i² − n x̄² + n x̄²)/(n S_xx) = σ² Σ x_i²/(n S_xx),

since Ȳ and β̂ are uncorrelated, i.e. Cov(Ȳ, β̂) = 0. The proof for Cov(α̂, β̂) is more complicated and hence is omitted.

Note that both variances are divided by S_xx. Hence Var(β̂) increases with σ², the variability of the residuals ε_i, and decreases with S_xx, which measures the variability of the x_i. The fit of a model depends on how big SSR is relative to the total spread of the data.
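The omitted covariance proof is in fact short; a sketch, using Cov(Ȳ, β̂) = (σ²/(n S_xx)) Σ(x_i − x̄) = 0:

```latex
\operatorname{Cov}(\hat\alpha,\hat\beta)
  = \operatorname{Cov}(\bar Y - \hat\beta\bar x,\ \hat\beta)
  = \operatorname{Cov}(\bar Y,\hat\beta) - \bar x\,\operatorname{Var}(\hat\beta)
  = 0 - \bar x\,\frac{\sigma^2}{S_{xx}}
  = -\,\frac{\bar x\,\sigma^2}{S_{xx}} .
```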

[Figure: four sketches illustrating good and bad fits. Small SSR (σ²) with small total spread; small SSR with large total spread (the best β); large SSR with small total spread (the worst β); large SSR with large total spread.]

25 Regression analysis: estimation theory and inference

25.1 Estimate of σ²

For the simple linear regression model

Y_i = α + β x_i + ε_i,  ε_i ~ iid N(0, σ²),

the variances of α̂ and β̂ depend on the residual variance σ². Hence we need to estimate σ². Since σ² is the variance of the residuals, an estimate S² of σ² is based on the average of the squared residuals, whose sum is called the Residual Sum of Squares (SSR):

SSR = Σ ε̂_i² = Σ (y_i − α̂ − β̂ x_i)².

It can be shown that

S² = SSR/(n − 2)

is an unbiased estimate of σ², i.e. E(S²) = σ².

[Figure: the fitted line ŷ_i = α̂ + β̂x_i with the residuals e_i = y_i − ŷ_i shown as vertical distances.]

Note that

SSR = Σ (y_i − α̂ − β̂ x_i)² = Σ [(y_i − ȳ) − β̂(x_i − x̄)]²
    = Σ (y_i − ȳ)² − 2β̂ Σ (x_i − x̄)(y_i − ȳ) + β̂² Σ (x_i − x̄)²
    = S_yy − 2β̂ S_xy + β̂ S_xy
    = S_yy − β̂ S_xy = SSTo − SST
    = S_yy − S_xy²/S_xx,

since α̂ = ȳ − β̂ x̄ and β̂ = S_xy/S_xx (so that β̂² S_xx = β̂ S_xy).

25.2 Distributions of the estimators α̂, β̂ and S²

Assume that we have a linear regression model:

Y_i = α + β x_i + ε_i,  ε_i ~ iid N(0, σ²),  i = 1, 2, ..., n.

Under this model,

(α̂ − α) / [σ √(1/n + x̄²/S_xx)] ~ N(0, 1),
(β̂ − β) / (σ/√S_xx) ~ N(0, 1).

Furthermore, (α̂, β̂) and S² are independent and

(n − 2)S²/σ² ~ χ²_{n−2},
(α̂ − α) / [S √(1/n + x̄²/S_xx)] ~ t_{n−2},
(β̂ − β) / (S/√S_xx) ~ t_{n−2},  or  (β̂ − β)² S_xx/S² ~ F_{1,n−2}.

Note that these results can be used to construct 100(1 − α)% confidence intervals for α and β, which are given by

α: ( α̂ − t_{n−2,α/2} s √(1/n + x̄²/S_xx),  α̂ + t_{n−2,α/2} s √(1/n + x̄²/S_xx) );
β: ( β̂ − t_{n−2,α/2} s/√S_xx,  β̂ + t_{n−2,α/2} s/√S_xx ),

where t_{n−2,α/2} satisfies Pr(t_{n−2} ≥ t_{n−2,α/2}) = α/2.

Example: (Predicting height) Find σ̂² and the CIs for β and α.

Solution: We have

x̄ = 34.286, ȳ = 66.857, n = 14, Σ x_i² = 16558, Σ y_i² = 62674, Σ x_i y_i = 32183,
S_xx = 100.86, S_xy = 91.57, β̂ = 0.9079, α̂ = 35.73.

Hence

S_yy = Σ y_i² − n ȳ² = 62674 − 14(66.857)² = 95.71,
s² = (S_yy − β̂ S_xy)/(n − 2) = (95.71 − 0.9079 × 91.57)/12 = 1.0478,

CI for β = ( β̂ − t_{12,0.975} √(s²/S_xx),  β̂ + t_{12,0.975} √(s²/S_xx) )
         = ( 0.9079 − 2.1788 √(1.0478/100.86),  0.9079 + 2.1788 √(1.0478/100.86) )
         = (0.6859, 1.1300),

CI for α = ( α̂ − t_{12,0.975} √(s²(1/n + x̄²/S_xx)),  α̂ + t_{12,0.975} √(s²(1/n + x̄²/S_xx)) )
         = ( 35.73 − 2.1788 √(1.0478(1/14 + 34.286²/100.86)),  35.73 + 2.1788 √(1.0478(1/14 + 34.286²/100.86)) )
         = (28.09, 43.37).

Since the CI for β does not contain 0, β is significantly different from 0 (indeed greater than 0).
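A quick numeric check of these intervals in plain Python (a sketch with my own variable names; the 97.5% t quantile with 12 degrees of freedom, 2.1788, is taken from tables rather than computed):

```python
import math

x = [39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35]
y = [71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum(xi * xi for xi in x) - n * xbar ** 2
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
Syy = sum(yi * yi for yi in y) - n * ybar ** 2

beta = Sxy / Sxx
alpha = ybar - beta * xbar
s2 = (Syy - beta * Sxy) / (n - 2)   # unbiased estimate of sigma^2

t = 2.1788                          # t_{12, 0.975}, from tables
half_b = t * math.sqrt(s2 / Sxx)
ci_beta = (beta - half_b, beta + half_b)                 # about (0.686, 1.130)
half_a = t * math.sqrt(s2 * (1 / n + xbar ** 2 / Sxx))
ci_alpha = (alpha - half_a, alpha + half_a)
```

The β interval excludes 0, matching the conclusion above.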

In R,

> Syy=sum(y^2)-sum(y)^2/n
> Syy
[1] 95.71429
> SSR=Syy-beta*Sxy
> SSR
[1] 12.5737
> s2=SSR/(n-2)
> s2
[1] 1.0478
> CIb.lower=beta-qt(0.975,n-2)*sqrt(s2/Sxx)
> CIb.upper=beta+qt(0.975,n-2)*sqrt(s2/Sxx)
> c(CIb.lower,CIb.upper)
[1] 0.6859 1.1300
> CIa.lower=alpha-qt(0.975,n-2)*sqrt(s2*(1/n+mean(x)^2/Sxx))
> CIa.upper=alpha+qt(0.975,n-2)*sqrt(s2*(1/n+mean(x)^2/Sxx))
> c(CIa.lower,CIa.upper)
[1] 28.0906 43.3655

25.3 Test on β

Suppose that Y_i = α + β x_i + ε_i, ε_i ~ N(0, σ²). If Y does not change with X, then β = 0. Thus one may wish to test

H_0: β = 0 vs H_1: β ≠ 0

in order to assess whether the explanatory variable X has an influence on the dependent variable Y. The five steps to test the significance of β (i.e. β ≠ 0), and hence of the regression model, are:

1. Hypotheses: H_0: β = β_0 vs H_1: β > β_0, β < β_0 or β ≠ β_0.
2. Test statistic: t_0 = (β̂ − β_0)/(s/√S_xx), where s² = (S_yy − β̂ S_xy)/(n − 2) and β̂ = S_xy/S_xx.
3. Assumptions: Y_i ~ N(α + βx_i, σ²); the Y_i are independent.
4. P-value: Pr(t_{n−2} ≥ t_0) for H_1: β > β_0; Pr(t_{n−2} ≤ t_0) for H_1: β < β_0; 2 Pr(t_{n−2} ≥ |t_0|) for H_1: β ≠ β_0.
5. Decision: Reject H_0 if the p-value < α.

Remarks

1. t_0 can be calculated by

t_0 = (β̂ − β_0)/√(s²/S_xx) = √(n − 2) (S_xy/S_xx − β_0) / √[(S_yy − S_xy²/S_xx)/S_xx].

In particular, if β_0 = 0,

t_0 = √(n − 2) S_xy / √(S_xx S_yy − S_xy²)
    = √(n − 2) [S_xy/√(S_xx S_yy)] / √(1 − S_xy²/(S_xx S_yy))
    = √(n − 2) r / √(1 − r²),

where r is the correlation coefficient in the next section.

2. If H_0: β = 0 is accepted, it implies that X has no linear effect in describing the behaviour of the response Y. Note that this is not equivalent to saying there is no relationship between X and Y. The true model may involve quadratic, cubic, or other more complex functions of X. The following examples illustrate the idea. The p-value indicated on top of each plot corresponds to the 2-sided t-test for a zero slope. Case (b) shows an insignificant result, but there is a clear quadratic relationship between X and Y.

[Figure: four scatter plots (a)-(d) of y against x; p-values for the zero-slope t-test: (a) P = 0.771, (b) P = 0.226, (c) P = 10e-05, (d) P = ....]

3. If H_0: β = 0 is rejected, it implies that the data indicate a linear relationship between X and Y. However, a linear regression might not be the best model to describe the relationship. Case (d) shows a significant result, but it is clear that a quadratic model describes the relationship between X and Y better.
4. The strength of a linear relationship may be described by the sample correlation coefficient.
5. Similar ideas can be applied to construct a test for the intercept α. However, there is more interest in testing β than α, since β carries the information about whether a linear relationship exists between Y and X.
6. The regression line is sensitive to outliers, as will be demonstrated in the computer practical. A nonparametric version of the regression model is kernel smoothing, which is more robust to outliers, but it is not included in this course.

Example: (Air pollution) The data air give four environmental variables: ozone, radiation, temperature and wind. We would like to assess whether the variable temperature (X) has an influence on the dependent variable ozone (Y).

> air
    ozone radiation temperature wind
    ...

Solution: First of all, we look at a scatter plot of the data to see whether a linear model is appropriate.

> plot(air[,3], air[,1])

[Figure: scatter plot of ozone against temperature.]

The scatter plot indicates that a linear model is appropriate.

We fit a simple linear model

Y_i = α + βx_i + ε_i,  ε_i ~ N(0, σ²),

to the data to assess whether the temperature (X) has an influence on ozone (Y). We have obtained the following values from R:

n = 111, S_xx = ..., S_yy = 87.21, S_xy = 702.95.

The test for the significance of β is:

1. Hypotheses: H_0: β = 0 vs H_1: β ≠ 0.
2. Test statistic:
   β̂ = S_xy/S_xx = ...,
   s² = (S_yy − β̂ S_xy)/(n − 2) = ...,
   t_0 = β̂/√(s²/S_xx),  or  t_0 = √(n − 2) S_xy / √(S_xx S_yy − S_xy²) = ....
3. Assumptions: Y_i ~ N(α + βx_i, σ²); the Y_i are independent.
4. P-value: p-value = 2 Pr(t_{109} ≥ t_0) ≈ 0.
5. Decision: There is very strong evidence in the data of a linear relationship between temperature and ozone.

In R,

> x=air[,3]
> y=air[,1]
> c(mean(x),mean(y))
> n=length(x)
> Sxx=sum((x-mean(x))*(x-mean(x)))
> Sxy=sum((x-mean(x))*(y-mean(y)))
> Syy=sum((y-mean(y))*(y-mean(y)))
> c(Sxx, Sxy, Syy)
> beta=Sxy/Sxx                            #check beta
> s2=(Syy-beta*Sxy)/(n-2)
> c(beta,s2)
> t0=beta/sqrt(s2/Sxx)
> t0=sqrt(n-2)*Sxy/sqrt(Sxx*Syy-Sxy^2)    #either way
> p=2*(1-pt(t0,n-2))
> p
[1] 0

The assumptions of the model may be checked as follows in R:

> lsfit(x,y)$coef
Intercept         X
> air.fitted=y-lsfit(x,y)$res             #fitted values

> par(mfrow=c(2,2))
> plot(x,y, xlab="Temperature",ylab="Ozone")   #fitted line plot
> abline(lsfit(x,y))
> plot(air.fitted, lsfit(x,y)$res)             #residual plot
> abline(h=0)
> boxplot(lsfit(x,y)$res)                      #boxplot of residuals
> qqnorm(lsfit(x,y)$res)                       #normal qq plot
> qqline(lsfit(x,y)$res)

[Figure: fitted line plot of ozone against temperature, residual plot of lsfit(x,y)$res against air.fitted, boxplot of the residuals, and normal Q-Q plot of the residuals.]

Comments: The scatter plot and the plot of residuals versus fitted values indicate that a linear model is appropriate. The boxplot and the normal qq-plot of the residuals indicate that the distribution of the residuals is slightly long-tailed; there are three rather large residuals in particular. This might indicate that a robust regression should be used.

26 Regression analysis: ANOVA and prediction

26.1 ANOVA for regression

The linear regression model Y_i = α + β x_i + ε_i, i = 1, 2, ..., n, where ε_i ~ iid N(0, σ²), is significant if the slope parameter β that describes the effect of X on Y is non-zero. Note that

y_i − ȳ = (ŷ_i − ȳ) + (y_i − ŷ_i),

where ŷ_i = α̂ + β̂x_i. We have

Σ (y_i − ȳ)²  =  Σ (ŷ_i − ȳ)²  +  Σ (y_i − ŷ_i)²,
    SSTo             SST              SSR

where

SST = Σ (ŷ_i − ȳ)² = Σ (α̂ + β̂x_i − ȳ)² = Σ (β̂x_i − β̂x̄)² = β̂² Σ (x_i − x̄)² = β̂² S_xx = β̂ S_xy = S_xy²/S_xx,
SSR = Σ (y_i − ŷ_i)² = (n − 2)s² = S_yy − β̂ S_xy,

since α̂ = ȳ − β̂x̄, i.e. α̂ − ȳ = −β̂x̄.

[Figure: decomposition of the TOTAL variation in y_i, SSTo = Σ(y_i − ȳ)², into the variation explained by the regression, SST = Σ(ŷ_i − ȳ)², and the unexplained error, SSR = Σ(y_i − ŷ_i)².]

On the other hand, the F test statistic for testing H_0: β = β_0 = 0 is

f_0 = t_0² = β̂²/(s²/S_xx) = β̂² S_xx/s² = β̂ S_xy/s² = (SST/1)/(SSR/(n − 2)).

Since t²_{m,1−α/2} = F_{1,m,1−α} for any degrees of freedom m, the

p-value = Pr(F_{1,n−2} ≥ f_0) = Pr(t²_{n−2} ≥ t_0²) = Pr(|t_{n−2}| ≥ |t_0|) = 2 Pr(t_{n−2} ≥ |t_0|)

is the same as that in the t-test. These calculations are summarized in the regression ANOVA table, which is like the ANOVA tables obtained in the ANOVA test.

Regression ANOVA table
Source      df      SS                  MS        F
Regression  1       S_xy²/S_xx          MST       MST/s²
Residuals   n − 2   S_yy − S_xy²/S_xx   MSR = s²
Total       n − 1   S_yy

Example: (Air pollution) Based on the previous calculations, we have S_xx = ..., S_xy = ..., S_yy = ..., α̂ = ..., β̂ = .... Complete the regression ANOVA table and hence test for the significance of the regression model.

Solution: We have

SST = β̂ S_xy = ...,
SSR = S_yy − SST = ....

Hence:

Regression ANOVA table
Source      df    SS    MS    F
Regression  1     ...   ...   ...
Residuals   109   ...   ...
Total       110   ...

The test for the significance of the regression model is:

1. Hypotheses: H_0: β = 0 vs H_1: β ≠ 0.
2. Test statistic: f_0 = (SST/1)/(SSR/(n − 2)) = ... = t_0².
3. Assumptions: Y_i ~ N(α + βx_i, σ²); the Y_i are independent.
4. P-value: p-value = Pr(F_{1,109} > f_0) ≈ 0.
5. Decision: Since the p-value < 0.05, we reject H_0. There is strong evidence of a linear relationship between temperature (X) and ozone (Y).

In R,

> summary(aov(air[,1]~air[,3]))
            Df Sum of Sq Mean Sq F Value Pr(F)
air[, 3]     1    ...
Residuals  109    ...

Remarks

The test of the hypothesis β = 0 in regression is similar to the ANOVA test for the completely randomized design (one-way data) that all treatments are equal. Under the null hypothesis, we have

ANOVA test: μ_1 = ⋯ = μ_g = μ, so Y_i ~ N(μ, σ²);
Regression test: β = 0, so Y_i ~ N(α, σ²),

stating that the treatments (by groups in ANOVA, or by the explanatory variable X in regression) are unrelated to the response Y in any way.

26.2 Estimation at X = x_0

Suppose we have fitted a simple linear regression model: Y_i = α + β x_i + ε_i, where ε_i ~ iid N(0, σ²). Obviously, E(Y_i) = α + βx_i. It is often of interest to estimate the expected value E(Y_0) = E(Y | X = x_0) at a specified value X = x_0. An obvious choice for the estimate is

Ê(Y_0) = α̂ + β̂ x_0.

[Figure: estimation of Ŷ_0 at X = x_0 on the line Y = α + βX.]

For this estimate, we have the following results (see Section 24.1):

E(α̂ + β̂x_0) = α + β x_0,

Var(α̂ + β̂x_0) = Var(α̂) + x_0² Var(β̂) + 2x_0 Cov(α̂, β̂)
  = σ² ( Σx_i²/(n S_xx) + x_0²/S_xx − 2x_0 x̄/S_xx )
  = σ² ( Σx_i² − n x̄² + n x̄² + n x_0² − 2n x_0 x̄ ) / (n S_xx)
  = σ² ( S_xx + n(x̄² − 2x̄x_0 + x_0²) ) / (n S_xx)
  = σ² [ 1/n + (x_0 − x̄)²/S_xx ].

Hence

[Ê(Y_0) − E(Y_0)] / se[Ê(Y_0)] = [α̂ + β̂x_0 − (α + βx_0)] / [s √(1/n + (x_0 − x̄)²/S_xx)] ~ t_{n−2},

and a 100(1 − α)% confidence interval for E(Y_0) = α + βx_0 is

( α̂ + β̂x_0 − t_{n−2,1−α/2} s √(1/n + (x_0 − x̄)²/S_xx),  α̂ + β̂x_0 + t_{n−2,1−α/2} s √(1/n + (x_0 − x̄)²/S_xx) ),

where t_{n−2,1−α/2} satisfies Pr(t_{n−2} ≤ t_{n−2,1−α/2}) = 1 − α/2.

Note: When x_0 = 0, ŷ_0 = α̂ + β̂·0 = α̂ and the estimation interval becomes the CI for α, given by

( α̂ − t_{n−2,1−α/2} s √(1/n + x̄²/S_xx),  α̂ + t_{n−2,1−α/2} s √(1/n + x̄²/S_xx) ).

26.3 Prediction at X = x_0

Now, instead of estimating the expected value E(Y_0) at x_0, we wish to predict the response Y_0 that we will observe in the future when X = x_0. Under the regression model Y_i = α + βx_i + ε_i, we can employ

Ŷ_0 = α̂ + β̂x_0 + ε̂_0 = α̂ + β̂x_0

as a predictor of Y_0 = α + βx_0 + ε_0, since ε̂_0 = 0. This is the same as Ê(Y_0) = α̂ + β̂x_0. Note that Ŷ_0 is unbiased, since

E(Ŷ_0) = E(α̂) + E(β̂)x_0 + E(ε̂_0) = α + βx_0 = E(Y_0), with E(ε̂_0) = 0.

However, the variance of Ŷ_0 as a predictor of Y_0 is greater than that of Ê(Y_0) as an estimator of E(Y_0), because Y_0 contains the additional prediction error ε_0:

Var(Ŷ_0 − Y_0) = Var(α̂ + β̂x_0) + Var(ε_0)
              = σ² [1/n + (x_0 − x̄)²/S_xx] + σ²
              = σ² [1 + 1/n + (x_0 − x̄)²/S_xx],

since α̂ + β̂x_0 and ε_0 are independent. Hence

(Ŷ_0 − Y_0)/se(Ŷ_0 − Y_0) = [α̂ + β̂x_0 − (α + βx_0 + ε_0)] / [s √(1 + 1/n + (x_0 − x̄)²/S_xx)] ~ t_{n−2},

and a 100(1 − α)% prediction interval for Y_0 is given by

( α̂ + β̂x_0 − t_{n−2,1−α/2} s √(1 + 1/n + (x_0 − x̄)²/S_xx),  α̂ + β̂x_0 + t_{n−2,1−α/2} s √(1 + 1/n + (x_0 − x̄)²/S_xx) ),

where t_{n−2,1−α/2} satisfies Pr(t_{n−2} ≤ t_{n−2,1−α/2}) = 1 − α/2.

26.4 Effects on confidence intervals

Note: Var(α̂) = σ² Σx_i²/(n S_xx), Var(β̂) = σ²/S_xx, Var[Ê(Y_0)] = σ²(1/n + (x_0 − x̄)²/S_xx) and Var(Ŷ_0 − Y_0) = σ²(1 + 1/n + (x_0 − x̄)²/S_xx) all depend on summary statistics Σ Y_i, Σ x_i, Σ Y_i², Σ x_i Y_i, which vary from sample to sample. Hence α̂, β̂, Ê(Y_0) and Ŷ_0 are random variables which vary from sample to sample. Their variances depend on the following factors:

1. The smaller σ² (with estimate S², which measures the variability of the residuals ε_i), the better the fit and hence the smaller the variances of α̂, β̂, Ê(Y_0) and Ŷ_0.
2. The larger the spread of the x_i, as measured by S_xx, the more information about the Y_i and hence the smaller the variances of α̂, β̂, Ê(Y_0) and Ŷ_0. If all x are the same, it would not be possible to draw any conclusion about the dependence of Y on x; in this case S_xx = 0, giving Var(β̂) = ∞.
3. The larger the sample size n, giving more information about the Y_i, the smaller the variances of α̂, Ê(Y_0) and Ŷ_0.
4. The closer x_0 is to x̄, the smaller the squared distance (x_0 − x̄)², and hence the smaller the variances of Ê(Y_0) and Ŷ_0.

[Figure: estimation and prediction intervals for adult height against height at age two. The interval is shorter in the middle, and the estimation interval is shorter than the prediction interval; the error bands widen away from (x̄, Ȳ).]

Why does the error increase away from x̄? Because if you wiggle the regression line, it makes more of a difference the further you are from the mean x̄. Note that the line always passes through (x̄, Ȳ): when X = x̄,

Y = α̂ + β̂ x̄ = (Ȳ − β̂ x̄) + β̂ x̄ = Ȳ.

Hence (x̄, Ȳ) lies on the line Y = α̂ + β̂X.

Example: (Predicting height) Parents are often interested in predicting the eventual heights of their children. The following is a portion of the data taken from a study of heights of boys.

Height (inches) at age two (x_i):   39 30 32 34 35 36 36 30 33 37 33 38 32 35
Height (inches) as an adult (y_i):  71 63 63 67 68 68 70 64 65 68 66 70 64 69

1. Indicate whether a linear relationship between the heights of boys at age two and as an adult is significant.
2. Estimate the average height of adults who were 40 inches tall at the age of two and give a 90% estimation interval for this estimate.
3. Predict the height of an adult who was 40 inches tall at the age of two. Give a 90% prediction interval for this prediction.

Solution: From previous working, we have obtained

x̄ = 34.286, S_xx = 100.86, S_yy = 95.71, S_xy = 91.57,

and the least squares line is

ŷ = α̂ + β̂x = 35.73 + 0.9079x.

1. Then

SST = S_xy²/S_xx = 91.57²/100.86 = 83.14,
SSR = S_yy − S_xy²/S_xx = 95.71 − 83.14 = 12.57.

Hence the regression ANOVA table is obtained as follows:

Source      df   SS      MS      F
Regression  1    83.14   83.14   79.35
Residuals   12   12.57   1.048
Total       13   95.71

The test for the significance of the regression model is:

1. Hypotheses: H_0: β = 0 vs H_1: β ≠ 0.
2. Test statistic: f_0 = (SST/1)/(SSR/(n − 2)) = 83.14/1.048 = 79.35.
3. Assumptions: Y_i ~ N(α + βx_i, σ²); the Y_i are independent.
4. P-value: Pr(F_{1,12} > 79.35) ≈ 0 (F_{1,12,0.999} = 18.6, from R).
5. Decision: Since the p-value < 0.05, we reject H_0. There is strong evidence of a linear relationship between the heights of boys at age two and their heights as adults.

In R,

> Syy=sum(y^2)-sum(y)^2/n
> Syy
[1] 95.71429
> SST=beta*Sxy
> SST
[1] 83.1406
> SSR=Syy-SST
> SSR
[1] 12.5737
> s2=SSR/(n-2)
> s2
[1] 1.0478
> f0=SST/s2
> f0

[1] 79.3475
> t0=sqrt(f0)
> t0
[1] 8.9077
> p.value=1-pf(f0,1,n-2)
> p.value
[1] ...e-06

2. According to the fitted line, the expected average height E(Y_0) of an adult corresponding to a height of 40 inches at age two is

Ê(Y_0) = 35.73 + 0.9079 × 40 = 72.04.

Note that n = 14, t_{12,0.95} = 1.782, s² = 1.0478, s = 1.0236. Hence the 90% estimation interval for the average height of adults whose height at age two is x_0 = 40 is

( α̂ + β̂x_0 − t_{n−2,α/2} s √(1/n + (x_0 − x̄)²/S_xx),  α̂ + β̂x_0 + t_{n−2,α/2} s √(1/n + (x_0 − x̄)²/S_xx) )
= ( 72.04 − 1.782(1.0236) √(1/14 + (40 − 34.286)²/100.86),  72.04 + 1.782(1.0236) √(1/14 + (40 − 34.286)²/100.86) )
= (70.90, 73.19).

3. According to the fitted line, the predicted height of a boy with height 40 inches at age two is still 72.04 inches. The 90% prediction interval for the height of an adult whose height at age two is x_0 = 40 is

( α̂ + β̂x_0 − t_{n−2,α/2} s √(1 + 1/n + (x_0 − x̄)²/S_xx),  α̂ + β̂x_0 + t_{n−2,α/2} s √(1 + 1/n + (x_0 − x̄)²/S_xx) )
= ( 72.04 − 1.782(1.0236) √(1 + 1/14 + (40 − 34.286)²/100.86),  72.04 + 1.782(1.0236) √(1 + 1/14 + (40 − 34.286)²/100.86) )
= (69.89, 74.20).

In R,

> x0=40
> y0=alpha+beta*x0
> y0
[1] 72.04533
> alp=0.1
> t=qt(1-alp/2,n-2)
> t
[1] 1.782288
> se.est=sqrt(s2*(1/n+(x0-mean(x))^2/Sxx))
> se.est
[1] 0.6434936

> se.pre=sqrt(s2*(1+1/n+(x0-mean(x))^2/Sxx))
> se.pre
[1] 1.209089
> CIest.low=y0-t*se.est
> CIest.up=y0+t*se.est
> c(CIest.low,CIest.up)
[1] 70.89843 73.19222
> CIpre.low=y0-t*se.pre
> CIpre.up=y0+t*se.pre
> c(CIpre.low,CIpre.up)
[1] 69.89038 74.20027

Note: x_0 = 40 is outside the range of X in the data. If x_0 is well outside the range of X, the estimation and prediction may not be reliable, because we are not sure whether the fitted relationship can be extended to x_0.
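The estimation and prediction intervals above can be checked in plain Python (a sketch with my own variable names; t_{12,0.95} = 1.782 is taken from tables):

```python
import math

x = [39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35]
y = [71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum(xi * xi for xi in x) - n * xbar ** 2
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
Syy = sum(yi * yi for yi in y) - n * ybar ** 2
beta = Sxy / Sxx
alpha = ybar - beta * xbar
s = math.sqrt((Syy - beta * Sxy) / (n - 2))

x0, t = 40, 1.782                    # t_{12, 0.95}, from tables
y0 = alpha + beta * x0               # point estimate, about 72.04
se_est = s * math.sqrt(1 / n + (x0 - xbar) ** 2 / Sxx)      # for E(Y0)
se_pre = s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / Sxx)  # for Y0
est_int = (y0 - t * se_est, y0 + t * se_est)   # about (70.90, 73.19)
pre_int = (y0 - t * se_pre, y0 + t * se_pre)   # about (69.89, 74.20)
```

As expected, the prediction interval is wider than the estimation interval, since it carries the extra error term ε_0.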

27 Regression and correlation

27.1 Correlation coefficient

In practice, one may be interested in the strength of the linear relationship between X and Y. Suppose X and Y are two random variables. The correlation coefficient ρ, as a measure of the strength of the linear relationship between X and Y, is defined as

ρ = E{[X − E(X)][Y − E(Y)]} / √(Var(X) Var(Y)) = [E(XY) − E(X)E(Y)] / √(Var(X) Var(Y)) = Cov(X, Y) / √(Var(X) Var(Y)).

Note: If ρ = 0, then X and Y are said to be uncorrelated.

Definition: The sample correlation coefficient r for the bivariate data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) is defined as

r = S_xy/√(S_xx S_yy) = Σ (x_i − x̄)(y_i − ȳ) / √[ Σ (x_i − x̄)² · Σ (y_i − ȳ)² ]

or

r = ( Σ x_i y_i − n x̄ȳ ) / √[ ( Σ x_i² − n x̄² )( Σ y_i² − n ȳ² ) ].

Hence r gives a point estimator of ρ, and measures the strength of the linear relationship between X and Y based on the sample.

27.2 Interpreting r and r²

Given bivariate data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), the square of the correlation coefficient, r², called the coefficient of determination, measures the proportion of the total variation in Y explained by the linear regression model.

[Figure: a fitted line y = a + bx showing, for each point, the distance to the line r_i = y_i − ŷ_i (residual) and the distance to the mean d_i = y_i − ȳ, with the horizontal line y = ȳ.]

It is one minus the proportion of variation not explained by the model:

r² = 1 − Σ r_i²/Σ d_i² = 1 − Σ (y_i − ŷ_i)²/Σ (y_i − ȳ)² = 1 − SSR/SSTo,

where

Σ (y_i − ȳ)² = Σ (ŷ_i − ȳ)² + Σ (y_i − ŷ_i)²,  or  S_yy = S_xy²/S_xx + (S_yy − S_xy²/S_xx),
    SSTo           SST             SSR

and SSTo is the total variation in Y, SST is the sum of squares explained by the regression line, and SSR = SSTo − SST is the variation in Y remaining unexplained.
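As a concrete check with the height data used earlier (a small Python sketch; the notes do this in R):

```python
x = [39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35]
y = [71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum(xi * xi for xi in x) - n * xbar ** 2
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
Syy = sum(yi * yi for yi in y) - n * ybar ** 2

r2 = Sxy ** 2 / (Sxx * Syy)   # coefficient of determination
# Equivalently, r2 = 1 - SSR/SSTo:
beta = Sxy / Sxx
SSR = Syy - beta * Sxy
assert abs(r2 - (1 - SSR / Syy)) < 1e-12
```

Here r² is about 0.87, i.e. roughly 87% of the variation in adult height is explained by the linear regression on height at age two.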

Note that

r = S_xy/√(S_xx S_yy),  r² = S_xy²/(S_xx S_yy) = (S_xy²/S_xx)/S_yy = SST/SSTo,  and  1 − r² = (S_yy − S_xy²/S_xx)/S_yy = SSR/SSTo.

Hence the coefficient of determination r² measures the strength of the linear relationship between X and Y by the percentage of variation in Y explained by the linear regression model in X.

Interpreting r and r²:

1. −1 ≤ r ≤ 1.
2. r² = 1 when all data in the scatter plot lie on the fitted least squares line (SSR = 0).
3. r² = 0 when there is no linear relationship between the points in the scatter plot: SST = Σ(ŷ_i − ȳ)² = 0, so ŷ_i = Ȳ for all i (β̂ = 0 and α̂ = Ȳ − β̂x̄ = Ȳ).

[Figure: the correlation scale from −1.00 (perfect negative correlation) through strong, moderate and weak negative correlation, 0 (no correlation), and weak, moderate and strong positive correlation, to +1.00 (perfect positive correlation).]

[Figure: four scatter plots illustrating different values of r.]

27.3 Test for ρ

By using the sample correlation coefficient r, we may construct a test for ρ = 0. The five steps for the test of the correlation coefficient ρ are:

1. Hypotheses: H_0: ρ = 0 vs H_1: ρ > 0, ρ < 0 or ρ ≠ 0.
2. Test statistic: t_0 = r √[(n − 2)/(1 − r²)].
3. Assumption: The data are taken from a bivariate normal population.
4. P-value: Pr(t_{n−2} ≥ t_0) for H_1: ρ > 0; Pr(t_{n−2} ≤ t_0) for H_1: ρ < 0; 2 Pr(t_{n−2} ≥ |t_0|) for H_1: ρ ≠ 0.
5. Decision: Reject H_0 if the p-value < α.

Note: The tests for ρ = 0 and for β = 0 are equivalent, since

t_0² = r²(n − 2)/(1 − r²)
     = [S_xy²/(S_xx S_yy)](n − 2) / [1 − S_xy²/(S_xx S_yy)]
     = S_xy²(n − 2)/(S_xx S_yy − S_xy²)
     = β̂ S_xy (n − 2)/(S_yy − β̂ S_xy)
     = SST/(SSR/(n − 2)) = f_0.

Example: (Automobile) The data in the following table give the miles per gallon obtained by a test automobile when using gasolines of varying octane levels.

Octane (x):            89   93   87   90   89   95  100   98
Miles per gallon (y):  13 13.2   13 13.6 13.3 13.8 14.1   14

[Figure: scatter plot of miles per gallon (y) vs octane (x).]

Given Σ x_i = 741, Σ y_i = 108, Σ x_i y_i = 10016.3, Σ x_i² = 68789, Σ y_i² = 1459.34,

(a) calculate the value of r;
(b) do the data provide sufficient evidence to indicate that the octane level and miles per gallon are dependent?

Solution: We have

    Sxx = Σxi² − (Σxi)²/n = 68789 − 741²/8 = 153.875,
    Syy = Σyi² − (Σyi)²/n = 1459.34 − 108²/8 = 1.34,
    Sxy = Σxiyi − (Σxi)(Σyi)/n = 10016.3 − (741)(108)/8 = 12.8.

(a) r = Sxy / √(Sxx Syy) = 12.8 / √(153.875 × 1.34) = 0.8914.

(b) The test for the correlation coefficient ρ between the octane level and miles per gallon is

1. Hypotheses: H0: ρ = 0 vs H1: ρ ≠ 0.
2. Test statistic: t0 = r √(n − 2) / √(1 − r²) = 0.8914 × √6 / √(1 − 0.7946) = 4.818.
3. Assumption: The data are taken from a bivariate normal population.
4. P-value: p-value = 2 Pr(t6 ≥ 4.818) ∈ (0.002, 0.01) (t6,0.995 = 3.707, t6,0.999 = 5.208; from R).
5. Decision: Since the p-value < 0.05, we reject H0. There is strong evidence in the data that the octane level and miles per gallon are dependent.

In R,

> x = c(89, 93, 87, 90, 89, 95, 100, 98)
> y = c(13, 13.2, 13, 13.6, 13.3, 13.8, 14.1, 14)
> cor.test(x, y, alt = "two.sided")

        Pearson's product-moment correlation

data:  x and y
t = 4.8178, df = 6, p-value = 0.0029
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5022 0.9803
sample estimates:
   cor
0.8914

> n = length(x)    # checking
> Sxx = sum(x^2) - sum(x)^2/n
> Sxy = sum(x*y) - sum(x)*sum(y)/n
> Syy = sum(y^2) - sum(y)^2/n
> c(Sxx, Sxy, Syy)
[1] 153.875  12.800   1.340
> beta = Sxy/Sxx
> alpha = mean(y) - beta*mean(x)
> c(alpha, beta)
[1] 5.7950 0.0832
> SST = beta*Sxy
> r2 = SST/Syy
> r = sqrt(r2)
> t0 = r*sqrt((n-2)/(1-r^2))
> p.value = 2*(1 - pt(abs(t0), n-2))
> c(SST, r2, r, t0, p.value)
[1] 1.0648 0.7946 0.8914 4.8178 0.0029


More information

Regression Analysis: Basic Concepts

Regression Analysis: Basic Concepts The simple linear model Regression Analysis: Basic Concepts Allin Cottrell Represents the dependent variable, y i, as a linear function of one independent variable, x i, subject to a random disturbance

More information

Simple Linear Regression Analysis

Simple Linear Regression Analysis LINEAR REGRESSION ANALYSIS MODULE II Lecture - 6 Simple Linear Regression Analysis Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur Prediction of values of study

More information

y ˆ i = ˆ " T u i ( i th fitted value or i th fit)

y ˆ i = ˆ  T u i ( i th fitted value or i th fit) 1 2 INFERENCE FOR MULTIPLE LINEAR REGRESSION Recall Terminology: p predictors x 1, x 2,, x p Some might be indicator variables for categorical variables) k-1 non-constant terms u 1, u 2,, u k-1 Each u

More information

Lecture notes on Regression & SAS example demonstration

Lecture notes on Regression & SAS example demonstration Regression & Correlation (p. 215) When two variables are measured on a single experimental unit, the resulting data are called bivariate data. You can describe each variable individually, and you can also

More information