Chapter 10. Simple Linear Regression and Correlation

In the two-sample problems discussed in Ch. 9, we were interested in comparing the values of parameters for two distributions. Regression analysis is the part of statistics that deals with the investigation of the relationships between two or more variables. In this chapter, we generalize the deterministic linear relation y = α + βx to the linear probabilistic model y = α + βx + e, develop procedures for making inferences about the parameters of the model, and obtain a quantitative measure (the correlation coefficient) of the extent to which the two variables are related.

1. Simple Linear Regression Model

y = α + βx + e, where e ~ N(0, σ²)

There exist parameters α, β, and σ² such that for any fixed value of the independent variable x, the dependent variable y is related to x through the model equation. The quantity e in the model equation is a random variable, assumed to be normally distributed with E(e) = 0 and Var(e) = σ².
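As a quick illustration of the model, one can simulate data from it. This is only a sketch; the parameter values below are made up for demonstration and are not from the text.

```python
import numpy as np

# Simulate y = alpha + beta*x + e, with e ~ N(0, sigma^2).
# Parameter values are hypothetical, chosen only for illustration.
rng = np.random.default_rng(0)
alpha, beta, sigma = 2.0, 0.5, 1.0

x = np.linspace(0.0, 10.0, 50)            # fixed values of the independent variable
e = rng.normal(0.0, sigma, size=x.size)   # random error with E(e) = 0, Var(e) = sigma^2
y = alpha + beta * x + e                  # the model equation
```

Each simulated y is the deterministic part α + βx plus a normally distributed error, which is exactly the structure the inference procedures below assume.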
True slope, β

β measures the change in the mean of the response variable y for every unit change in the explanatory variable x.

1. β > 0: the mean response increases as x increases.
2. β < 0: the mean response decreases as x increases.
3. β = 0: the mean response does not depend on x.

True intercept, α

α is the mean of the response variable when the explanatory variable x = 0.
2. Estimating Model Parameters (α, β, and σ²)

The parameters α (intercept) and β (slope) will almost never be known to an investigator. Instead, sample data consisting of n observed pairs (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) will be available, from which the model parameters can be estimated.

Notation: a = α̂ = estimate of α, b = β̂ = estimate of β, and the estimated linear regression equation is ŷ = a + bx.

Principle of Least Squares: find a and b that minimize

Σ e_i² = Σ [y_i − ŷ_i]² = Σ [y_i − (a + bx_i)]²

The least squares estimates are

b = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)² = r (s_y / s_x)

a = ȳ − b x̄

Here, r is the correlation between y and x, s_y is the standard deviation of y, and s_x is the standard deviation of x.
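The least squares formulas above can be computed directly. The following sketch uses a small made-up sample (the data are not from the text) and checks the equivalent form b = r·s_y/s_x:

```python
import numpy as np

# Least squares estimates from their defining formulas; the data are illustrative.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9])

xbar, ybar = x.mean(), y.mean()
b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)  # slope estimate
a = ybar - b * xbar                                            # intercept estimate

# Equivalent form of the slope: b = r * s_y / s_x
r = np.corrcoef(x, y)[0, 1]
assert np.isclose(b, r * y.std(ddof=1) / x.std(ddof=1))
```

The final assertion verifies numerically that the two expressions for the slope agree, since r = S_xy/√(S_xx·S_yy) and s_y/s_x = √(S_yy/S_xx).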
Measuring the Variability

The estimate of σ² is

s_e² = Σ (y_i − ŷ_i)² / (n − 2) = SS_resid / (n − 2)

and the estimate of σ, the estimated standard deviation, is s_e = √(s_e²). SS_resid and SS_total represent the sum of squares of the residuals and the sum of squares of the y_i about ȳ, respectively. The sum of squares due to regression is SS_reg = SS_total − SS_resid.

3. Inferences about the Slope β

1. E(b) = β
2. Var(b) = σ_b² = σ² / Σ (x_i − x̄)²; replacing σ² by its estimate s_e² gives the estimator s_b² = s_e² / Σ (x_i − x̄)².
3. The estimator b has a normal distribution.

Confidence Interval for β

The standardized variable

t = (b − β) / s_b

has a t distribution with d.f. = n − 2. A 100(1 − α)% confidence interval for β, the slope of the regression line, is

(b − t_{α/2, n−2} · s_b, b + t_{α/2, n−2} · s_b)
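Putting the pieces together, s_e, s_b, and the confidence interval for β can be sketched as follows. The data are made up for illustration, and SciPy's t quantile function supplies t_{α/2, n−2}:

```python
import numpy as np
from scipy import stats

# Sketch: s_e, s_b, and a 95% confidence interval for the slope (illustrative data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.8, 3.1, 3.9, 5.2, 5.8, 7.1])
n = x.size

xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
b = np.sum((x - xbar) * (y - ybar)) / Sxx
a = ybar - b * xbar
yhat = a + b * x

SS_resid = np.sum((y - yhat) ** 2)
s_e = np.sqrt(SS_resid / (n - 2))      # estimated standard deviation
s_b = s_e / np.sqrt(Sxx)               # estimated standard error of the slope

t_crit = stats.t.ppf(0.975, df=n - 2)        # t_{alpha/2, n-2} with alpha = 0.05
ci = (b - t_crit * s_b, b + t_crit * s_b)    # 95% CI for beta
```

Note the divisor n − 2 in s_e²: two degrees of freedom are used up estimating a and b.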
Hypothesis Testing for β

Null hypothesis (H_0): β = b_0
Test statistic: t = (b − b_0) / s_b

Alternative Hypothesis | Rejection Region
H_a: β > b_0           | t > t_{α, n−2}
H_a: β < b_0           | t < −t_{α, n−2}
H_a: β ≠ b_0           | t < −t_{α/2, n−2} or t > t_{α/2, n−2}

H_0 should be rejected if the p-value is less than α and not rejected if the p-value is greater than α.

< Example > handout (in class)
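The two-sided test of H_0: β = b_0 can be sketched as below. The data are made up (not from the class handout), with b_0 = 0 as the hypothesized slope:

```python
import numpy as np
from scipy import stats

# Two-sided t test of H0: beta = b0 on illustrative (made-up) data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.9, 4.1, 4.4, 5.6, 6.2, 6.8, 8.1])
n = x.size

xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)
b = np.sum((x - xbar) * (y - y.mean())) / Sxx
a = y.mean() - b * xbar
s_e = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))
s_b = s_e / np.sqrt(Sxx)

b0 = 0.0                                    # hypothesized slope under H0
t = (b - b0) / s_b                          # test statistic
p_value = 2 * stats.t.sf(abs(t), df=n - 2)  # two-sided p-value, n-2 d.f.
reject = p_value < 0.05                     # decision at alpha = 0.05
```

Testing β = 0 asks whether x is of any use in predicting y under the linear model.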
4. Inferences Based on the Estimated Regression Line

Let x_0 denote a particular value of x. For a + bx_0:

1. E(a + bx_0) = α + βx_0
2. σ_{a+bx_0} = σ √( 1/n + (x_0 − x̄)² / Σ (x_i − x̄)² ); replacing σ by its estimate s_e gives the estimated standard error s_{a+bx_0}.
3. a + bx_0 ~ N(α + βx_0, σ²_{a+bx_0})

Confidence Interval for α + βx_0

The standardized variable

t = (a + bx_0 − (α + βx_0)) / s_{a+bx_0}

has a t distribution with d.f. = n − 2. Hence, a 100(1 − α)% confidence interval for α + βx_0, the mean value of y when x = x_0, has the form

((a + bx_0) − t_{α/2, n−2} · s_{a+bx_0}, (a + bx_0) + t_{α/2, n−2} · s_{a+bx_0})
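A confidence interval for the mean response at a chosen x_0 follows the same pattern as the slope interval, with the standard error formula above. A minimal sketch on made-up data:

```python
import numpy as np
from scipy import stats

# 95% CI for the mean response alpha + beta*x0 (illustrative data and x0).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 2.8, 4.1, 4.9, 6.2, 6.8])
n = x.size

xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)
b = np.sum((x - xbar) * (y - y.mean())) / Sxx
a = y.mean() - b * xbar
s_e = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))

x0 = 3.5                          # a particular x value of interest (hypothetical)
fit = a + b * x0                  # point estimate of alpha + beta*x0
se_fit = s_e * np.sqrt(1.0 / n + (x0 - xbar) ** 2 / Sxx)
t_crit = stats.t.ppf(0.975, df=n - 2)
ci = (fit - t_crit * se_fit, fit + t_crit * se_fit)
```

The (x_0 − x̄)² term shows why the interval is narrowest at x_0 = x̄ and widens as x_0 moves away from the center of the data.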
5. Inferences about the Population Correlation Coefficient

The correlation coefficient r is a measure of how strongly related x and y are in the observed sample.

Population correlation coefficient: ρ = ρ(X, Y) = Cov(X, Y) / (σ_X σ_Y)

Sample correlation coefficient: r = ρ̂ = Σ (x_i − x̄)(y_i − ȳ) / √( Σ (x_i − x̄)² · Σ (y_i − ȳ)² )

Properties of r
1. The value of r is independent of the units in which x and y are measured.
2. −1 ≤ r ≤ 1
3. r = +1 (−1) if and only if all (x_i, y_i) pairs lie on a straight line with positive (negative) slope, respectively.
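The sample correlation can be computed straight from its definition; the sketch below uses made-up data and checks the result against the library routine:

```python
import numpy as np

# Sample correlation coefficient from its defining formula (illustrative data).
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([1.5, 3.2, 4.8, 6.1, 8.0])

xd, yd = x - x.mean(), y - y.mean()
r = np.sum(xd * yd) / np.sqrt(np.sum(xd ** 2) * np.sum(yd ** 2))

# Agrees with NumPy's built-in correlation matrix entry.
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```

Because both numerator and denominator scale identically under changes of units, r is unit-free, illustrating property 1.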
Hypothesis Testing for ρ

Null hypothesis (H_0): ρ = 0 (x and y are independent)
Test statistic: t = r √(n − 2) / √(1 − r²)

Alternative Hypothesis | Rejection Region
H_a: ρ > 0             | t > t_{α, n−2}
H_a: ρ < 0             | t < −t_{α, n−2}
H_a: ρ ≠ 0             | t < −t_{α/2, n−2} or t > t_{α/2, n−2}

The t critical value is based on n − 2 d.f. H_0 should be rejected if the p-value is less than α and not rejected if the p-value is greater than α.

< Example > Here are the golf scores of 12 members of a college women's golf team in two rounds of tournament play. Plot the data, find the correlation between the two scores, and then test the null hypothesis that ρ = 0.

Player  |  1   2   3   4   5   6    7    8   9  10  11  12
Round 1 | 89  90  87  95  86  81  102  105  83  88  91  79
Round 2 | 94  85  89  89  81  76  107   89  87  91  88  80
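The computational part of this example can be sketched as follows, using the golf scores from the table and the test statistic defined above (the plot is omitted):

```python
import numpy as np
from scipy import stats

# Golf scores of the 12 players in the example above.
round1 = np.array([89, 90, 87, 95, 86, 81, 102, 105, 83, 88, 91, 79], dtype=float)
round2 = np.array([94, 85, 89, 89, 81, 76, 107, 89, 87, 91, 88, 80], dtype=float)
n = round1.size

r = np.corrcoef(round1, round2)[0, 1]            # sample correlation
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)     # test statistic for H0: rho = 0
p_value = 2 * stats.t.sf(abs(t), df=n - 2)       # two-sided p-value, n-2 = 10 d.f.
```

With a moderately strong positive correlation between the two rounds, the p-value falls below α = 0.05, so H_0: ρ = 0 is rejected for the two-sided alternative.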