1. Simple Linear Regression

Size: px

Start display at page:

Download "1. Simple Linear Regression"

Charla Atkins
6 years ago
Views:

1 1. Simple Linear Regression Suppose that we are interested in the average height of male undergrads at UF. We put each male student s name (population) in a hat and randomly select 100 (sample). Then their height is measured in cm. We denote the selected measurements by: Y 1, Y 2,..., Y 100. In addition, suppose we also measure the selected student s weights in pounds and the number of cats owned by their parents. We denmote these by: X 1, X 2,..., X 100 and C 1, C 2,..., C 100, respectively. Questions: 1. How would you use this data to estimate the average height of a male undergrad?

2 height Answers: weight height #cats 100 i=1 Y i = 174cm 5 8, the sam- 1. Ê[Y ] = Ȳ = 1 ple mean Average the Y i s for guys whose X i s are between

3 height weight height #cats 3. Average the Y i s for guys whose C i s are 3.

4 Intuitive description of regression: Height: Y = variable of interest = response variable = dependent variable Weight: X = explanatory variable = predictor variable = independent variable Continuous vs. Discrete: Fundamental assumption of regression The response variable Y is a random variable whose mean (expected value) depends on X. The expected value of Y, E(Y ), can be written as a deterministic function of X, f(x). A special case is when f(x) = constant.

5 Example: E(height i ) = f(weight i ) where β 0, β 1, and β 2 are unknown parameters! Scatter plot weight versus height and weight versus E(height): height weight E(height) weight

6 Simple Linear Regression (SLR) A scatter plot of 100 (X i, Y i ) pairs (weight, height) shows that there is a linear trend. Equation of a line: Y = β 0 + β 1 X (β 0 : intercept and β 1 : slope) height b 1 Y=b+mX m weight X* X*+1

7 Is: height = β 0 + β 1 weight? (functional relation) No! The relationship is far from perfect (it s a statistical relation)! Distribution of height for a person who is 180lbs, i.e. Mean E(height) = β 0 + β b+m*180 height

9 Formal Statement of the SLR Model Data: (X 1, Y 1 ), (X 2, Y 2 ),..., (X n, Y n ) Equation: Y i = β 0 + β 1 X i + ǫ i, i = 1, 2,..., n Assumptions:

10 The Summation Operator: (see also Appendix A of the textbook) Define n i=1 a i = a 1 + a a n n i=1 K = K + K K = nk n i=1 (a i + b i ) = n i=1 a i + n i=1 b i n i=1 Ka i = Ka 1 + Ka Ka n = K(a 1 + a a n ) = K( n i=1 a i)

11 The Expectation Operator: (see also Appendix A of the textbook) Let Y denote a discrete randlom variable which takes on values: y 1, y 2,... y n with probabilities p 1, p 2,... p n respectively. then E[g(Y )] = n i=1 g(y i)p i E[K] = K, where K is a constant. E[KY ] = KE[Y ]

12 Define V ar[y ] = E[(Y µ Y ) 2 ]. V ar[k] = 0. V ar[ky ] = K 2 V ar(y ). V ar(x + Y ) = V ar(x) + V ar(y ), if X and Y s are independent. Define Cov[X,Y ] = E[(X µ X )(Y µ Y )]. Cov[Y, K] = Cov(X, Y ) = 0, V ar(x + Y ) = V ar(x) + V ar(y ) + 2Cov(X, Y ). The above descriptions of expected value properties assume that all the expected values exist.

13 Consequences of the SLR Model Y i = β 0 + β 1 X i + ǫ i, i = 1, 2,..., n The response Y i is the sum of the constant term (β 0 + β 1 X i ) and the random term (ǫ i ). Hence, Y i is a random variable. Because the ǫ i s are independent and each Y i involves only one ǫ i, the Y i s are independent as well. E(Y i ) = E(β 0 + β 1 X i + ǫ i ) = β 0 + β 1 X i. Regression function (it relates the mean of Y to X) is

14 Why is it called SLR? Simple: only one predictor X i Linear: regression function, E(Y ) = β 0 +β 1 X, is linear in the parameters. Why do we care about the regression model? If the model is realistic and we have reasonable estimates of β 0 and β 1 we have:

15 Least Squares Estimation of regression parameters β 0 and β 1 X i = #math classes taken by ith student in spring Y i = #hours student i spends writing papers in spring Randomly select 4 students: (X 1, Y 1 ) = (1, 60), (X 2, Y 2 ) = (2, 70), (X 3, Y 3 ) = (3, 40), (X 4, Y 4 ) = (5, 20) #hours #math classes

16 If we assume a SLR model for these data, we are assuming that at each X, there is a distribution of #hours and that the means (expected values) of these responses all lie on a line. We need estimates of the unknown parameters β 0, β 1, and σ 2. Let s focus on β 0 and β 1 for now. Every (β 0, β 1 ) pair defines a line β 0 +β 1 X. The Least Squares Criterion says choose the line that minimizes the sum of the squares of the vertical distances from the data points (X i, Y i ) to the line (X i, β 0 + β 1 X i )

17 Instead of evaluating Q for every possible line β 0 +β 1 X, we can find the best β 0 and β 1 using calculus. We will minimize the function Q with respect to β 0 and β 1 Q β 0 = Q β 1 = n 2(Y i (β 0 + β 1 X i ))( 1) i=1 n 2(Y i (β 0 + β 1 X i ))( X i ) i=1 Set it to 0 (and change notation) yields

18 Solving these equations simultaneously yields b 1 = n i=1 (X i X)(Y i Ȳ ) n i=1 (X i X) 2 b 0 = Ȳ b 1 X This result is even more important! Use second derivative to show that a minimum is attained. A more efficient formula for the calculation of b 1 is n i=1 b 1 = X iy i 1 n ( n i=1 X i)( n i=1 Y i) n i=1 X2 i 1 n ( n i=1 X i) 2 n i=1 = X iy i n XȲ S XX where S XX = n i=1 (X i X) 2. Or b 1 = = n i=1 (X i X)(Y i Ȳ ) n i=1 (X i X) 2 n i=1 (X i X)Y i n i=1 (X i X) 2

19 Example: Let us calculate the estimates of slope and intercept of our example: (X 1, Y 1 ) = (1, 60), (X 2, Y 2 ) = (2, 70), (X 3, Y 3 ) = (3, 40), (X 4, Y 4 ) = (5, 20). i X iy i = = i X i = i Y i = i X2 i = b 1 = n i=1 X iy i 1 n ( n i=1 X i)( n i=1 Y i) n i=1 X2 i 1 n ( n i=1 X i) 2 = (11)(190) (11)2 = = b 0 = Ȳ b 1 X = ( 11.7)(1 4 11) =

20 Estimated regression function Ê(Y ) = X At X = 1: Ê(Y ) = (1) = 68.3 At X = 5: Ê(Y ) = (5) = 21.5 At X = 2.5: Ê(Y ) = (2.5) = #hours #math classes

21 Properties of Least Squares Estimators An important theorem, called the Gauss Markov Theorem, states that the Least Squares Estimators are Point Estimation of the Mean Response: Under the SLR model, the regression function is E(Y ) = β 0 + β 1 X. We use our estimates of β 0 and β 1 to construct the estimated regresion function

22 Fitted Values: Define Ŷ i is the fitted value at X i. Residuals: Define e i is called the ith residual. The vertical distance beween the ith Y value and the line. #hours #math classes

23 Properties of Fitted Regression Line The sum of the residuals is zero: The sum of the squared residuals, n i=1 e2 i, is a minimum. The sum of the observed values equals the sum of the fitted values: The sum of the residuals weighted by X i is zero:

24 The sum of the residuals weighted by Ŷi is zero: The regression line always goes through the point Errors versus Residuals e i = = ǫ i = So e i is an attempt to estimate ǫ i, but ǫ i is not a parameter, it is a random quantity!

25 Estimation of σ 2 in SLR: Motivation from iid (independent & identically distributed) case, where Y 1,..., Y n iid with E(Y i ) = µ and V ar(y i ) = σ 2. Sample variance (two steps) 1. Find n (Y i Ê(Y i)) 2 = i=1 n (Y i Ȳ )2. i=1 2. Divide by degrees of freedom, n 1 in this case Lost 1 degree of freedom, because we estimated 1 parameter, µ.

26 In the SLR model with E(Y i ) = β 0 +β 1 X i and V ar(y i ) = σ 2, the Y i s are independent but not identically distributed. Let s do the same two steps. 1. Find n (Y i Ê(Y i)) 2 = i=1 n (Y i (b 0 + b 1 X i )) 2 = SSE. i=1 2. Divide by the degrees of freedom, n 2 in this case: s 2 = 1 n (Y i (b 0 + b 1 X i )) 2 = MSE. n 2 i=1

27 Properties of the point estimator of σ 2 : s 2 = = = 1 n 2 1 n 2 1 n 2 n (Y i (b 0 + b 1 X i )) 2 i=1 n (Y i Ŷi) 2 i=1 n i=1 e 2 i

28 Normal Error Regression Model No matter what may be the form of the distribution of the error terms ǫ i the least squares method provides unbiased point estimators of β 0 and β 1 that have minimum variance among all unbiased linear estimators. To set up interval estimates and make tests, however, we need to make assumptions about the distribution of the ǫ i.

29 The normal error regression model is as follows: Y i = β 0 + β 1 X i + ǫ i, i = 1, 2,..., n Assumptions: Y i is the value of the X i s are ǫ i s are independent N(0, σ 2 ) β 0, β 1, and σ 2 are

30 Maximum likelihood estimate is equivalent to Y i N(β 0 + β 1 X i, σ 2 ). f(y i X i ) = 1 2πσ e (Y i (β 0 +β 1 X i ))2 2σ 2. The likelihood function is the product of probability density functions (PDF) when random variables are independent: = L(β 0, β 1, σ 2 Y i ) n f(y i X i ) = i=1 n i=1 1 e (Y i (β 0 +β 1 X i ))2 2σ 2 2πσ The maximum likelihood estimate is the parameter values that maximize L(β 0, β 1, σ 2 Y i ). In other words, the solutions of L(β 0,β 1,σ 2 Y i ) β 0 = 0, L(β 0,β 1,σ 2 Y i ) β 1 = 0 and L(β 0,β 1,σ 2 Y i ) = 0 σ 2

31 Motivate Inference in SLR Models Let X i = #siblings and Y i = #hours spent on papers. Data (1, 20), (2, 50), (3, 30), (5, 30) gives Ê(Y ) = X Conclusion: b 1 is not zero, so #siblings is linearly related to #hours, right?

32 Scenario 1: Highly variable Histogram of bvar Scenario 2: Highly concentrated Histogram of bcon Think about H 0 : β 1 = 0 Is H 0 false? Scenario 1: not sure Scenario 2: definitely If we know the exact dist n of b 1, we can formally decide if H 0 is true. We need formal statistical test of H 0 : β 1 = 0 (no linear relationship) H A : β 1 0 (there is a linear relationship between E(Y ) and X)

Chapter 1. Linear Regression with One Predictor Variable

Chapter 1. Linear Regression with One Predictor Variable 1.1 Statistical Relation Between Two Variables To motivate statistical relationships, let us consider a mathematical relation between two mathematical