STAT 111 Recitation 7
Xin Lu Tan (xtan@wharton.upenn.edu)
October 25, 2013
Miscellaneous
Please turn in Homework 6, and pick up Homework 7 and the graded Homework 5. Please check your grade and let me know at the next recitation if there are any grade discrepancies (please bring your graded homework with you). Note: Homework 7 is printed two-sided.
Midterm: Z-chart
The following Z chart gives "less than" probabilities for positive values of z, i.e. P(Z ≤ z) for Z ∼ N(0, 1). What are P(Z ≤ 2), P(Z ≥ 2), P(Z ≤ −2), and P(Z ≥ −2)?
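As a check on the chart-reading exercise, here is a small sketch that computes the standard normal CDF from the error function (a standard identity, Φ(z) = ½(1 + erf(z/√2))) and derives the other three probabilities from P(Z ≤ 2) by complement and symmetry:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF: phi(z) = P(Z <= z) for Z ~ N(0, 1)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# The chart gives P(Z <= z) only for z >= 0; the rest follow from it.
p_le_2 = phi(2.0)         # read directly from the chart, ~0.9772
p_ge_2 = 1.0 - p_le_2     # complement rule
p_le_m2 = p_ge_2          # symmetry: P(Z <= -2) = P(Z >= 2)
p_ge_m2 = p_le_2          # symmetry: P(Z >= -2) = P(Z <= 2)
```

Only one chart lookup (z = 2) is needed; the symmetry of the standard normal density about 0 supplies the other three values.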
Midterm: Problem 5
Let X₁ be a random variable having a normal distribution with mean 12 and variance 9. Also, let X₂ be a random variable having a normal distribution with mean 14 and variance 16. Calculate the probability that X₂ > X₁. [Hint: Think of the difference X₁ − X₂.]
Recall: If X ∼ N(µ, σ²), then aX ∼ N(aµ, a²σ²). If X₁ ∼ N(µ₁, σ₁²), X₂ ∼ N(µ₂, σ₂²) and X₁, X₂ are independent, then X₁ + X₂ ∼ N(µ₁ + µ₂, σ₁² + σ₂²). In particular, we have X₁ − X₂ ∼ N(µ₁ − µ₂, σ₁² + σ₂²).
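The hint works out as follows: D = X₁ − X₂ ∼ N(12 − 14, 9 + 16) = N(−2, 25), and P(X₂ > X₁) = P(D < 0) = Φ((0 − (−2))/5) = Φ(0.4). A quick numerical check of that computation:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# X1 ~ N(12, 9), X2 ~ N(14, 16), assumed independent.
# D = X1 - X2 ~ N(12 - 14, 9 + 16) = N(-2, 25).
mu_d = 12 - 14
sd_d = sqrt(9 + 16)

# P(X2 > X1) = P(D < 0): standardize 0 against D's mean and SD.
p = phi((0 - mu_d) / sd_d)   # = phi(0.4), roughly 0.6554
```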
Simple Linear Regression
Suppose you observe n data points (xᵢ, yᵢ), i = 1, 2, ..., n.
[Scatterplot of y versus x]
It seems like there is some kind of linear relationship between the random variables Xᵢ and Yᵢ, i = 1, 2, ..., n, i.e. Yᵢ = α + βxᵢ + εᵢ, where εᵢ denotes the noise term (we assume that each yᵢ is observed with noise εᵢ that has mean 0 and variance σ²).
Simple Linear Regression
The goal of simple linear regression is to find the straight line y = a + bx that provides the best fit to the data points. Question: How do we define "best"?
Simple Linear Regression
Intuitively, we would expect a good line to stay close to the data points; the best line is then the one that stays closest to them. How do we measure closeness? Intuitively, we want the distance between each individual data point and the line to be small.
[Scatterplot of y versus x]
Simple Linear Regression
This leads us to consider the line y = a + bx that has the minimum sum of absolute residuals, Σᵢ |yᵢ − a − bxᵢ|. So all we need to do in order to estimate the parameters α and β from the data is to find the line y = a + bx that minimizes this quantity. The a and b found this way then serve as estimates for α and β.
Note: We SHOULD NOT consider minimizing Σᵢ (yᵢ − a − bxᵢ), since the individual terms can be positive or negative and may cancel each other out.
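The cancellation in the note is easy to see numerically. In this sketch (data values made up for illustration), every line through the point of means (x̄, ȳ), i.e. with intercept a = ȳ − bx̄, has a signed residual sum of exactly zero regardless of its slope, while the sum of absolute residuals still distinguishes good lines from bad ones:

```python
xs = [0, 1, 2, 3]
ys = [5, -1, 4, 0]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

signed_sums, abs_sums = [], []
for b in (0.0, 1.0, 2.0):
    a = ybar - b * xbar              # force the line through (xbar, ybar)
    resid = [y - a - b * x for x, y in zip(xs, ys)]
    signed_sums.append(sum(resid))               # ~0 for every slope b
    abs_sums.append(sum(abs(r) for r in resid))  # varies with b, so it can rank lines
```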
The Least-Squares Approach
But we don't like to deal with absolute values, since it can be a little inconvenient for us to compute the values of a and b that minimize Σᵢ |yᵢ − a − bxᵢ|. So instead, we consider minimizing the sum of squared residuals, Σᵢ (yᵢ − a − bxᵢ)², for which the minimizing values of a and b can be computed easily using calculus. The a and b found this way then serve as estimates for α and β.
Estimation of α and β
It can be shown (by taking partial derivatives of the sum of squared residuals) that the best (in the least-squares sense) straight-line fit y = a + bx has
b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = (Σ xᵢyᵢ − n x̄ȳ) / (Σ xᵢ² − n x̄²),  a = ȳ − b x̄.
If we denote s_xx = Σ(xᵢ − x̄)², s_xy = Σ(xᵢ − x̄)(yᵢ − ȳ), s_yy = Σ(yᵢ − ȳ)², then the formula for b can be rewritten as
b = s_xy / s_xx.
We also estimate σ² by
s_r² = (s_yy − b² s_xx) / (n − 2).
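These formulas translate directly into code. A minimal sketch (the function name and test data are made up for illustration) that computes b = s_xy / s_xx, a = ȳ − bx̄, and s_r² = (s_yy − b²s_xx)/(n − 2):

```python
def least_squares(xs, ys):
    """Fit y = a + bx by least squares; return (a, b, sr2)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    syy = sum((y - ybar) ** 2 for y in ys)
    b = sxy / sxx                        # slope estimate
    a = ybar - b * xbar                  # intercept estimate
    sr2 = (syy - b ** 2 * sxx) / (n - 2)  # estimate of the noise variance sigma^2
    return a, b, sr2

# Sanity check: points lying exactly on y = 1 + 2x should recover a=1, b=2, sr2=0.
a, b, sr2 = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
```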
Back to the previous graph
Back to our previous graph: do you see that the least-squares line is, overall, much closer to most data points than the other line?
[Scatterplot of y versus x with two candidate lines; the least-squares fit is shown in green]
In fact, the fitted green line y = a + bx obtained using the formulas on the previous slide has the minimum Σ(yᵢ − a − bxᵢ)² among all straight lines of the form y = a + bx!
Estimates and Estimators
Our estimates a and b of the parameters α and β are of the form
b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²,  a = ȳ − b x̄,
i.e. they are functions of our data (xᵢ, yᵢ), i = 1, 2, ..., n. But (xᵢ, yᵢ), i = 1, 2, ..., n are themselves the realized values of the random variables (Xᵢ, Yᵢ), i = 1, 2, ..., n, so our estimators of α and β,
b = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²,  a = Ȳ − b X̄,
are themselves random variables. We can therefore construct confidence intervals for α and β, to give us an idea of the precision of our estimates!
Confidence Interval of b
It can be shown (the math is too difficult to give here) that a is an unbiased estimator of α, that b is an unbiased estimator of β, and that s_r² is an unbiased estimator of σ². We may want to ask: how accurate is the estimate b of β? An approximate 95% confidence interval for β is given by
b − 2 s_r / √s_xx  to  b + 2 s_r / √s_xx.
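Putting the slides together, the interval b ± 2 s_r / √s_xx can be computed in a few lines. A sketch (the function name and test data are made up for illustration):

```python
from math import sqrt

def slope_ci(xs, ys):
    """Approximate 95% CI for the slope beta: b +/- 2 * s_r / sqrt(s_xx)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    syy = sum((y - ybar) ** 2 for y in ys)
    b = sxy / sxx
    sr = sqrt((syy - b ** 2 * sxx) / (n - 2))  # residual standard deviation s_r
    half = 2 * sr / sqrt(sxx)                  # half-width of the interval
    return b - half, b + half

lo, hi = slope_ci([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

For this toy data the point estimate is b = 0.6, and the interval straddles it symmetrically; a wide interval signals an imprecise slope estimate, which is exactly the "precision" question the slide raises.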