Section 4.6 Simple Linear Regression


Section 4.6 Simple Linear Regression

Objectives

- Basic philosophy of SLR and the regression assumptions
- Point & interval estimation of the model parameters, and how to make predictions
- Point and interval estimation of future observations from the model
- Regression diagnostics, including R² and basic residual analysis

Basic Philosophy

We have two variables X and Y. Here X is not random (so we will write x), but Y is random. We believe that Y depends in some way on x. Some typical examples of (x, Y) pairs are

- x = study time and Y = score on a test
- x = height and Y = weight
- x = father's height and Y = son's height

We focus our efforts on estimating the two parameters β₀ and β₁ in the simple linear regression model

    Y_i = β₀ + β₁ x_i + ε_i,   where ε_i ~ N(0, σ²).

- Y_i is the (random) response for the ith case.
- β₀ and β₁ are the unknown parameters we want to estimate: β₀ is the (unknown) intercept and β₁ the (unknown) slope.
- x_i is the value of the predictor variable for the ith case.
- ε_i is the (random) error term for the ith case; the errors have mean 0, the same variance σ² for all cases, and covariance 0 between the ith and jth cases.

Least Squares Estimates

We begin with the likelihood function

    L(β₀, β₁, σ²) = ∏_{i=1}^n f(y_i; β₀, β₁, σ²)
                  = ∏_{i=1}^n (2πσ²)^(−1/2) exp[ −(y_i − β₀ − β₁x_i)² / (2σ²) ]
                  = (2πσ²)^(−n/2) exp[ −Σ_{i=1}^n (y_i − β₀ − β₁x_i)² / (2σ²) ],

so that

    ln L(β₀, β₁, σ²) = −(n/2) ln(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (y_i − β₀ − β₁x_i)².

To maximize the log-likelihood, we minimize the sum in the exponent, i.e.,

    H = Σ_{i=1}^n (y_i − β₀ − β₁x_i)².

That is, we find the β₀ and β₁ that minimize H. Because there are two parameters, we differentiate H with respect to

β₀ and β₁, set the derivatives equal to zero, and obtain

    ∂H/∂β₀ = −2 Σ (y_i − β₀ − β₁x_i) = 0   ⇒   nβ₀ + β₁ Σ x_i = Σ y_i
    ∂H/∂β₁ = −2 Σ x_i (y_i − β₀ − β₁x_i) = 0   ⇒   β₀ Σ x_i + β₁ Σ x_i² = Σ x_i y_i

Solving these two normal equations gives

    β̂₁ = [ Σ x_i y_i − (Σ x_i)(Σ y_i)/n ] / [ Σ x_i² − (Σ x_i)²/n ]
        = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)²

    β̂₀ = ȳ − β̂₁ x̄

Shown below are the second derivatives:

    ∂²H/∂β₀² = 2n,   ∂²H/∂β₀∂β₁ = ∂²H/∂β₁∂β₀ = 2 Σ x_i,   ∂²H/∂β₁² = 2 Σ x_i²

The 2×2 matrix of these second derivatives is positive definite because its (1,1) element 2n > 0 and its determinant is also positive:

    det [ 2n       2 Σ x_i
          2 Σ x_i  2 Σ x_i² ] = 4 [ n Σ x_i² − (Σ x_i)² ] > 0.

The conclusion? The line ŷ_i = β̂₀ + β̂₁ x_i fits the (x, y) pattern best, in the sense that it leaves the smallest total squared gap between the observed y's and the estimated line. For this reason β̂₀ and β̂₁ are also called the least squares estimates.
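As a numerical cross-check of the normal-equation solutions above, here is a short sketch (Python for illustration, while the worked examples in these notes use R; the data are the midterm/final scores from the regression example later in this section):

```python
import numpy as np

# Midterm (x) and final (y) scores from the example later in this section
x = np.array([70, 74, 80, 84, 80, 67, 70, 64, 74, 82], dtype=float)
y = np.array([87, 79, 88, 98, 96, 73, 83, 79, 91, 94], dtype=float)

# Least squares estimates from the normal equations
Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = Sxy / Sxx                  # slope estimate
b0 = y.mean() - b1 * x.mean()   # intercept estimate

# Cross-check against numpy's built-in degree-1 polynomial fit
slope, intercept = np.polyfit(x, y, 1)
print(b1, b0)   # about 1.016 and 11.13
```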

Next, let's find the MLE of σ²:

    ∂/∂σ² { ln L(β₀, β₁, σ²) } = −n/(2σ²) + Σ (y_i − β₀ − β₁x_i)² / (2(σ²)²) = 0

We get

    σ̂² = (1/n) Σ (y_i − β̂₀ − β̂₁x_i)².

Two notes:

- In statistics, the gap between the observed value y_i and the expected (or predicted) value ŷ_i is called the residual. So Σ (y_i − β̂₀ − β̂₁x_i)² = Σ (y_i − ŷ_i)² is the sum of squared residuals, commonly called SS_E. For a point estimate of σ² we use SS_E/(n − 2), i.e.,

      s = σ̂ = √( SS_E / (n − 2) ).

- There are many equivalent formulas for β̂₁ that are more intuitive, or at least easier to remember. One of the popular ones is

      β̂₁ = r · SD_y / SD_x,

  where r is the correlation coefficient between x and y, SD_y is the sd of y, and SD_x is the sd of x.

Inferences about the Parameters

Let's introduce some more notation:

    b₁ = β̂₁ = S_xy / S_xx = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)² = Σ (x_i − x̄) y_i / Σ (x_i − x̄)²
    b₀ = β̂₀ = ȳ − b₁ x̄
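The equivalent slope formula β̂₁ = r · SD_y/SD_x and the estimate s = √(SS_E/(n − 2)) can be verified numerically; a minimal Python sketch (again using the midterm/final data from the example below):

```python
import numpy as np

x = np.array([70, 74, 80, 84, 80, 67, 70, 64, 74, 82], dtype=float)
y = np.array([87, 79, 88, 98, 96, 73, 83, 79, 91, 94], dtype=float)
n = len(x)

# Slope via S_xy / S_xx
Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()

# Equivalent form: b1 = r * SD_y / SD_x (sample SDs, ddof = 1)
r = np.corrcoef(x, y)[0, 1]
b1_alt = r * np.std(y, ddof=1) / np.std(x, ddof=1)

# Residuals, SS_E, and the point estimate s of sigma
resid = y - (b0 + b1 * x)
SSE = np.sum(resid ** 2)
s = np.sqrt(SSE / (n - 2))
```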

Here is how to derive the expectation and the variance of the estimates, using S_xy = Σ (x_i − x̄) y_i:

    E(b₁) = E(S_xy / S_xx) = (1/S_xx) Σ (x_i − x̄) E(y_i)
          = (1/S_xx) Σ (x_i − x̄)(β₀ + β₁x_i)
          = (1/S_xx) [ β₀ Σ (x_i − x̄) + β₁ Σ (x_i − x̄) x_i ]
          = (1/S_xx) [ 0 + β₁ S_xx ] = β₁

    E(b₀) = E(ȳ − b₁ x̄) = (β₀ + β₁ x̄) − x̄ E(b₁) = β₀

    Var(b₁) = Var( Σ (x_i − x̄) y_i / S_xx ) = (1/S_xx²) Σ (x_i − x̄)² σ² = σ² / S_xx

    Var(b₀) = Var(ȳ − b₁ x̄) = σ² ( 1/n + x̄²/S_xx )

Furthermore, it can be shown that b₁ ~ N(mean = β₁, sd = σ_b₁), where

    σ_b₁ = σ / √S_xx.

σ_b₁ is also called the standard error of b₁, and we can estimate σ from the previous description by s = √(SS_E/(n − 2)), so the estimated SE of b₁ becomes

    s_b₁ = s / √S_xx.

It can be shown (see the textbook) that

    SS_E / σ² = n σ̂² / σ² ~ χ²(n − 2),

and it turns out that the estimators (b₀, b₁) are independent of SS_E. Therefore we have the following t-distribution:

    T₁ = [ (b₁ − β₁)/σ_b₁ ] / √[ (SS_E/σ²) / (n − 2) ] = (b₁ − β₁) / (s/√S_xx) = (b₁ − β₁) / s_b₁ ~ t(df = n − 2)

Therefore, a 100(1 − α)% confidence interval for β₁ is given by

    b₁ ± t_{α/2}(df = n − 2) · s_b₁
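Putting the pieces together, the t-based interval for the slope can be computed directly; a sketch assuming `scipy` is available (same midterm/final data as the example below):

```python
import numpy as np
from scipy import stats

x = np.array([70, 74, 80, 84, 80, 67, 70, 64, 74, 82], dtype=float)
y = np.array([87, 79, 88, 98, 96, 73, 83, 79, 91, 94], dtype=float)
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
SSE = np.sum((y - b0 - b1 * x) ** 2)
s = np.sqrt(SSE / (n - 2))

# Standard error of b1 and the t-based 95% confidence interval
sb1 = s / np.sqrt(Sxx)
tcrit = stats.t.ppf(0.975, df=n - 2)        # t_{0.025} with n - 2 = 8 df
ci = (b1 - tcrit * sb1, b1 + tcrit * sb1)

# The t statistic for H0: beta1 = 0, as reported by R's summary()
tstat = b1 / sb1
```

The interval here excludes 0, which matches rejecting H₀: β₁ = 0 at the 5% level.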

It can also be shown in a similar way that b₀ ~ N(mean = β₀, sd = σ_b₀), where

    σ_b₀ = σ √( 1/n + x̄²/S_xx )

is the standard error of b₀, and its estimate is

    s_b₀ = s √( 1/n + x̄²/S_xx ).

Therefore we have another t-distribution:

    T₀ = [ (b₀ − β₀)/σ_b₀ ] / √[ (SS_E/σ²) / (n − 2) ] = (b₀ − β₀) / s_b₀ ~ t(df = n − 2)

Therefore, a 100(1 − α)% confidence interval for β₀ is given by

    b₀ ± t_{α/2}(df = n − 2) · s_b₀

We have seen how to estimate the coefficients of a regression line with both point estimates and confidence intervals. We have also learned how to compute the fitted value ŷ on the regression line for a given value of x, say x = x₀. But how good is our estimate ŷ at x = x₀? How much confidence do we have in it? Furthermore, suppose we were going to observe another value of y at x = x₀. What can we say?

Intuitively, it should be easier to get bounds on the mean (average) value of y at x₀ (called a confidence interval for the mean value of y at x₀) than it is to get bounds on a future observation of y (called a prediction interval for y at x₀). It turns out the intervals are narrower for the mean value and wider for the individual value.

Our point estimate of y at x₀ is, of course, ŷ at x₀, so for a confidence interval we will need the sampling distribution of ŷ. It turns out that

    ŷ|x₀ ~ N( mean = E(y|x₀), sd = σ_ŷ|x₀ ),   where   σ_ŷ|x₀ = σ √( 1/n + (x₀ − x̄)²/S_xx ).

σ_ŷ|x₀ is the standard error of ŷ at x₀, and its estimate is

    s_ŷ|x₀ = s √( 1/n + (x₀ − x̄)²/S_xx ).

Therefore we have the following t-distribution:

    T₂ = ( ŷ|x₀ − E(y|x₀) ) / s_ŷ|x₀ ~ t(df = n − 2)

Therefore, a 100(1 − α)% confidence interval (C.I.) for E(y) at x₀ is given by

    ŷ|x₀ ± t_{α/2}(df = n − 2) · s_ŷ|x₀

Next, prediction intervals are slightly different. In order to find confidence bounds for a new observation of y (we will denote it y_future) we use the fact that

    ŷ_future ~ N( mean = E(y_future), sd = σ_ŷfuture ),   where   σ_ŷfuture = σ √( 1 + 1/n + (x₀ − x̄)²/S_xx ).

Of course σ is unknown and we estimate it with s. Therefore, a 100(1 − α)% prediction interval (P.I.) for a future value of y at x₀ is given by

    ŷ|x₀ ± t_{α/2}(df = n − 2) · s_ŷfuture

Take note that the prediction interval is wider than the confidence interval, as its SE is greater.

Ex 1. Consider the following sample data (midterm and final exam scores for n = 10 students) and carry out all the inferences involved.

    Midterm (X): 70 74 80 84 80 67 70 64 74 82
    Final (Y):   87 79 88 98 96 73 83 79 91 94

> x <- c(70,74,80,84,80,67,70,64,74,82)
> y <- c(87,79,88,98,96,73,83,79,91,94)
> model <- lm(y ~ x)
> summary(model)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  11.1317    17.4220   0.639   0.5409
x             1.0157     0.2330   4.359   0.0024 **
---
Residual standard error: 4.743 on 8 degrees of freedom
Multiple R-squared: 0.7038, Adjusted R-squared: 0.6667
F-statistic: 19 on 1 and 8 DF, p-value: 0.0024

> plot(y ~ x, pch = 16, col = 2)
> abline(model, col = 4)
> predict(model, interval = "confidence")
       fit      lwr      upr
> predict(model, interval = "prediction")
       fit      lwr      upr
> newx <- seq(60, 95, 0.2)
> ci <- predict(model, list(x = newx), interval = "confidence")
> pi <- predict(model, list(x = newx), interval = "prediction")
> plot(x, y, pch = 16, col = 2)
> matplot(newx, ci, type = "l", lty = c(1,2,2), col = c(1,2,2), add = T)
> matplot(newx, pi, type = "l", lty = c(1,3,3), col = c(1,4,4), add = T)
> legend(locator(1), c("regression line","95% CI","95% PI"), cex = 0.8, lty = 1:3, col = c(1,2,4))
> # The following command creates four diagnostic plots.
> par(mfrow = c(1,4))

> plot(model)

Figure 1: Regression line, 95% CI & 95% PI

Figure 2: Diagnostic plots of a regression model
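The intervals produced by `predict()` above can be reproduced from the formulas in this section; a Python sketch assuming `scipy` (x₀ = 75 is an arbitrary illustration point, not one used in the notes):

```python
import numpy as np
from scipy import stats

x = np.array([70, 74, 80, 84, 80, 67, 70, 64, 74, 82], dtype=float)
y = np.array([87, 79, 88, 98, 96, 73, 83, 79, 91, 94], dtype=float)
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

x0 = 75.0                       # arbitrary point at which to predict
yhat = b0 + b1 * x0
tcrit = stats.t.ppf(0.975, df=n - 2)

# CI for the mean response E(y | x0) and PI for a future observation at x0
se_mean = s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / Sxx)
se_pred = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx)
ci = (yhat - tcrit * se_mean, yhat + tcrit * se_mean)
pi = (yhat - tcrit * se_pred, yhat + tcrit * se_pred)
```

The prediction interval comes out wider at every x₀, since its standard error carries the extra "1 +" term.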

Section 4.8 One-Factor ANOVA

One-Factor Samples

Suppose you have collected n_i observations (i = 1, 2, ..., m) from each of m groups, with n = n₁ + ⋯ + n_m in total:

    Group 1:  Y₁₁  Y₁₂  ...  Y₁ₙ₁    mean Ȳ₁.
    Group 2:  Y₂₁  Y₂₂  ...  Y₂ₙ₂    mean Ȳ₂.
    ⋮
    Group m:  Yₘ₁  Yₘ₂  ...  Yₘₙₘ    mean Ȳₘ.
                            Grand mean: Ȳ..

The hypotheses we want to test are:

    H₀: µ₁ = µ₂ = ⋯ = µₘ   (i.e., all group means are the same)
    H₁: not H₀             (i.e., some group means are significantly different)

In the end, all will be summarized in the following ANOVA table:

    source     SS      df     MS                       F-value       p-value
    Treatment  SS_trt  m − 1  MS_trt = SS_trt/(m − 1)  MS_trt/MS_E
    Error      SS_E    n − m  MS_E = SS_E/(n − m)
    Total      SS_tot  n − 1

Here are all the SS (sum of squares) quantities and how SS_tot is partitioned:

    SS_tot = Σᵢ Σⱼ (Y_ij − Ȳ..)²
           = Σᵢ Σⱼ (Y_ij − Ȳᵢ. + Ȳᵢ. − Ȳ..)²
           = Σᵢ Σⱼ (Y_ij − Ȳᵢ.)² + Σᵢ nᵢ (Ȳᵢ. − Ȳ..)²     (the cross-product term = 0)
           = SS_E + SS_trt

Under H₀ we also have

    SS_trt/σ² ~ χ²(m − 1),   SS_E/σ² ~ χ²(n − m),

and the two are independent, so

    F = [ SS_trt/(m − 1) ] / [ SS_E/(n − m) ] = MS_trt / MS_E ~ F(m − 1, n − m).
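The sum-of-squares partition and the F ratio can be checked numerically; a Python sketch with small made-up data (three groups, chosen only for illustration), using scipy's `f_oneway` as the reference:

```python
import numpy as np
from scipy import stats

# Small made-up example: m = 3 groups (unequal sizes are allowed)
groups = [np.array([92., 90., 87., 105.]),
          np.array([100., 108., 98., 97., 94.]),
          np.array([143., 149., 138., 136.])]
m = len(groups)
n = sum(len(g) for g in groups)
grand = np.concatenate(groups).mean()

# Partition SS_tot = SS_trt + SS_E
SStrt = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
SSE = sum(np.sum((g - g.mean()) ** 2) for g in groups)
SStot = np.sum((np.concatenate(groups) - grand) ** 2)

F = (SStrt / (m - 1)) / (SSE / (n - m))
p = stats.f.sf(F, m - 1, n - m)

# Reference: scipy's one-way ANOVA gives the same F and p-value
F_ref, p_ref = stats.f_oneway(*groups)
```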

Ex 2. Consider the following sample data (5 groups, 7 observations each; the values are shown in the R vector y below) and carry out all the inferences involved.

> grp <- c(rep(1,7), rep(2,7), rep(3,7), rep(4,7), rep(5,7))
> y <- c(92,90,87,105,86,83,102,100,108,98,0,4,97,94,
+        143,149,138,136,139,120,145,147,144,160,149,152,3,134,
+        142,155,9,134,133,146,152)
> data <- data.frame(grp, y)
> head(data)
> attach(data)
> grp <- factor(grp)
> boxplot(y ~ grp, col = "pink")
> model <- lm(y ~ grp)
> summary(model)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                             < 2e-16 ***
grp2                                          *
grp3                                   e-10 ***
grp4                                   e-11 ***
grp5                                   e-10 ***
---
Residual standard error: 9.7 on 30 degrees of freedom
Multiple R-squared:        , Adjusted R-squared:
F-statistic: 44.2 on 4 and 30 DF, p-value: 3.664e-12

> anova(model)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)
grp        4                   44.2 3.664e-12 ***
Residuals 30

> summary(aov(y ~ grp))
          Df Sum Sq Mean Sq F value    Pr(>F)
grp        4                   44.2 3.664e-12 ***
Residuals 30

> boxplot(y ~ grp, col = "pink")
> plot(TukeyHSD(aov(y ~ grp)))

> par(mfrow = c(1,4))
> plot(model)

Figure 3: Boxplot & Tukey's pairwise comparison

Figure 4: Diagnostic plots of ANOVA

Section 4.10 χ² Tests

Review: Facts about the χ²-Distribution

For X ~ χ²(df), the degrees of freedom df is the only parameter, and it uniquely determines the shape. The (theoretical) population mean is µ = df and the (theoretical) population standard deviation is σ = √(2·df).

- If you square a random variable that has the standard normal distribution, the result has the χ²(df = 1) distribution. This is often written Z² ~ χ²(1).
- A random variable with a χ² distribution with k degrees of freedom is the sum of k independent squared standard normal variables, i.e., χ²(df = k) = Z₁² + Z₂² + ⋯ + Z_k², where Zᵢ ~ N(0, 1).
- The curve is nonsymmetrical and skewed to the right.
- The mean µ is always located just to the right of the peak.
- The χ² test statistic is always greater than or equal to zero.
- When df > 90, the χ² curve is well approximated by the normal distribution. For example, if X ~ χ²(df = 1000), then X ≈ N(µ = 1000, σ = √2000).

χ² Goodness-of-Fit Test

We test whether the data fit a particular distribution or not. For example, we can test whether the color distribution of M&M bags fits what the company claims on its webpage, or, after flipping a coin many times, whether the counts fit a binomial distribution. We use a χ² test statistic to determine whether there is a good fit.

Why χ²? Demo for a binomial case. Let Y₁ ~ binomial(n, p₁); then

    Z = (Y₁ − np₁) / √( np₁(1 − p₁) )

has an approximate N(0, 1) distribution due to the CLT. Consider the following, writing Y₂ = n − Y₁ and p₂ = 1 − p₁:

    Q₁ = Z² = (Y₁ − np₁)² / [ np₁(1 − p₁) ]
       = (Y₁ − np₁)²/(np₁) + (Y₁ − np₁)²/(n(1 − p₁))      (since 1/(p₁(1 − p₁)) = 1/p₁ + 1/(1 − p₁))
       = (Y₁ − np₁)²/(np₁) + (Y₂ − np₂)²/(np₂)            (because Y₁ − np₁ = −(Y₂ − np₂), so the squares are equal)
       = Σ_{i=1}^{2} (Yᵢ − npᵢ)²/(npᵢ) ~ χ²(df = 1)
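The algebraic identity behind the demo — 1/(p(1−p)) = 1/p + 1/(1−p), so the single squared z-score equals the two-category χ² sum — is easy to verify numerically; a Python sketch with arbitrary illustrative numbers:

```python
# Arbitrary illustrative values (not from the notes' examples)
n, p1 = 100, 0.3
y1 = 24                     # observed successes
y2 = n - y1                 # observed failures
p2 = 1 - p1

# Single squared z-score from the CLT
q_z = (y1 - n * p1) ** 2 / (n * p1 * (1 - p1))

# Two-category chi-square sum; note (y1 - n*p1)^2 == (y2 - n*p2)^2
q_chi = (y1 - n * p1) ** 2 / (n * p1) + (y2 - n * p2) ** 2 / (n * p2)

print(q_z, q_chi)   # identical
```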

This will be generalized to the case of k categories. It can be shown that

    Q_{k−1} = Σ_{i=1}^{k} (Yᵢ − npᵢ)² / (npᵢ) ~ χ²(df = k − 1).

The null and alternative hypotheses for the goodness-of-fit test can be written as:

    H₀: pᵢ = pᵢ₀ for i = 1, 2, ..., k   (i.e., the data fit the hypothesized distribution)
    H₁: pᵢ ≠ pᵢ₀ for some i             (i.e., at least in some cases, the data do NOT fit the hypothesized distribution)

Ex 1. People were asked to write down a bunch of random digits. If these digits are truly random, the next digit is the same as the preceding one with probability 1/10, one away from the preceding one with probability 2/10, and neither of those with probability 7/10. We want to test whether the data fit this thinking (i.e., whether the sequence is random as examined by this idea):

    H₀: p₁ = 1/10, p₂ = 2/10, p₃ = 7/10
    H₁: at least one of the proportions is significantly different from the hypothesized one.

Here is the summary (n = 51 digit transitions):

                      observed freq   expected freq
    same digit             10         51 (1/10) = 5.1
    one-away digit         18         51 (2/10) = 10.2
    others                 23         51 (7/10) = 35.7

Test statistic:

    χ² = Σ_{i=1}^{3} (observed − expected)² / expected
       = (10 − 5.1)²/5.1 + (18 − 10.2)²/10.2 + (23 − 35.7)²/35.7 ≈ 15.19 ~ χ²(df = 2)

The p-value ≈ 0.0005, so we reject H₀ and conclude that the data didn't follow the hypothesized proportions, i.e., the digits don't seem random.

The whole thing can be done in R as shown below (note that the probabilities must be passed as the named argument p =; writing p <- c(...) inside the call would assign a variable and be passed as a second sample instead):

> x <- c(10, 18, 23)
> chisq.test(x, p = c(0.1, 0.2, 0.7))

        Chi-squared test for given probabilities

data:  x
X-squared = 15.19, df = 2, p-value = 0.0005
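The digits example can be cross-checked with `scipy.stats.chisquare`; a sketch assuming observed counts (10, 18, 23) for n = 51 transitions with claimed probabilities (0.1, 0.2, 0.7) — the counts in the extracted notes are partly garbled, so treat them as illustrative:

```python
import numpy as np
from scipy import stats

obs = np.array([10, 18, 23])          # same digit, one-away, others (assumed counts)
p0 = np.array([0.1, 0.2, 0.7])        # hypothesized probabilities
exp = obs.sum() * p0                  # 5.1, 10.2, 35.7

stat, pval = stats.chisquare(f_obs=obs, f_exp=exp)
# df = k - 1 = 2; a small p-value rejects the hypothesized distribution
reject = pval < 0.05
```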

Ex 2. You flipped a coin 4 times a day and counted the total number of H's each day. You did this for 100 days. Test whether the results agree with X (the daily number of H's) being binomial(4, 1/2).

Answer: With 100 days, the expected frequencies are 100 · C(4,k)(1/2)⁴:

    Number of H's      0     1      2     3     4
    observed freq      7    18                  4
    expected freq    6.25   25   37.5    25   6.25

Test statistic:

    χ² = Σ_{k=0}^{4} (obs − exp)²/exp = (7 − 6.25)²/6.25 + (18 − 25)²/25 + ⋯ + (4 − 6.25)²/6.25 ~ χ²(df = 4)

The p-value exceeds 0.05, so we do not reject H₀ and conclude that the data support the hypothesis of binomial(4, 0.5).

Ex 3. You lose one more df by estimating another parameter! Shown below are values of X, the number of α particles emitted by barium-133 in 1/10 of a second, counted by a Geiger counter. Test H₀: X ~ Poisson.

Answer: We first have to estimate the Poisson parameter λ by the mean of the data, i.e., λ̂ = x̄ = 5.4. Then we calculate the expected probability for each case and the expected frequencies (n = 50 observations, expected = 50 · p̂ᵢ):

    Cases           observed freq   expected freq
    {0, 1, 2, 3}         13             10.65
    {4}                                  8.00
    {5}                                  8.64
    {6}                                  7.78
    {7}                                  6.00
    {8, 9, ...}          10              8.90

Test statistic:

    χ² = Σ_{i=1}^{6} (obs − exp)²/exp = (13 − 10.65)²/10.65 + ⋯ + (10 − 8.90)²/8.90 ~ χ²(df = 6 − 1 − 1 = 4)

The p-value is 0.408, so we do not reject H₀ and conclude that the data cannot reject the hypothesis that the counts form a Poisson distribution.

χ² Test for Homogeneity

The goodness-of-fit test can be used to decide whether data fit a given distribution, but it will not suffice to decide whether two populations follow the same unknown distribution. A different test, called the test for homogeneity, can be used to draw a conclusion about whether two populations have the same distribution. Here we are concerned with:

    H₀: the distributions of the two populations are the same.
    H₁: the distributions of the two populations are NOT the same.

Ex 4. Shown below are the grade distributions of two groups of students.

    observed freq    A    B    C    D    F   total
    Group I          8   13   16   10    3     50
    Group II         4    9   14   16    7     50
    total           12   22   30   26   10    100

Test H₀: the grade distributions of the two groups are the same.

Answer: Under H₀, the probability of each grade is common to both groups, and the respective pooled estimates of the probabilities are 12/100 = 0.12, 22/100 = 0.22, 30/100 = 0.30, 26/100 = 0.26, and 10/100 = 0.10. Note also, since we have estimated these probabilities, the χ² test statistic will have df = (5 − 1) + (5 − 1) − 4 = 4. Here are the expected frequencies for each case (50 · p̂ per group):

    expected freq    A    B    C    D    F
    Group I          6   11   15   13    5
    Group II         6   11   15   13    5

Test statistic:

    χ² = Σ_{i=1}^{2} Σ_{j=1}^{5} (obs − exp)²/exp = (8 − 6)²/6 + ⋯ + (7 − 5)²/5 ≈ 5.18 ~ χ²(df = 4)

The p-value ≈ 0.27, so we do not reject H₀ and conclude that we cannot say there is a significant difference in grade distribution between the two groups.
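The homogeneity computation maps directly onto scipy's contingency-table test; a Python sketch (the grade counts below are as restored in the example above, so treat them as assumed):

```python
import numpy as np
from scipy import stats

# Grades A..F for the two groups (rows)
obs = np.array([[8, 13, 16, 10, 3],
                [4,  9, 14, 16, 7]])

stat, pval, df, expected = stats.chi2_contingency(obs)
# Expected counts pool the two groups: each row comes out (6, 11, 15, 13, 5)
```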

> data <- matrix(c(8,4,13,9,16,14,10,16,3,7), nrow = 2, ncol = 5)
> chisq.test(as.table(data))$observed
   A  B  C  D  E
A  8 13 16 10  3
B  4  9 14 16  7
> chisq.test(as.table(data))$expected
   A  B  C  D  E
A  6 11 15 13  5
B  6 11 15 13  5
> chisq.test(as.table(data))$residual
        A       B       C       D       E
A  0.8165  0.6030  0.2582 -0.8321 -0.8944
B -0.8165 -0.6030 -0.2582  0.8321  0.8944
> chisq.test(as.table(data))

        Pearson's Chi-squared test

data:  as.table(data)
X-squared = 5.1786, df = 4, p-value = 0.2695

χ² Test for Independence

A test of independence involves a contingency table of observed (data) values. The test statistic for a test of independence is similar to that of a goodness-of-fit test:

    χ² = Σ_{j=1}^{c} Σ_{i=1}^{r} (obs − exp)²/exp ~ χ²(df = (r − 1)(c − 1)),

where r = number of rows and c = number of columns.

Ex 5. A random sample of 400 students at the University of Iowa shows the following breakdown of gender and the colleges where they study (Business, Engineering, Liberal Arts, Nursing, Pharmacy; row counts for males and females, total 400).

Test H₀: p_ij = p_i · p_j (i.e., the college where a student studies is independent of gender).

Answer:

> data2 <- matrix(c(2,4,6,4,45,75,2,3,6,4), nrow = 2, ncol = 5)
> chisq.test(as.table(data2))$observed
> chisq.test(as.table(data2))$expected

> chisq.test(as.table(data2))$residual
> chisq.test(as.table(data2))

        Pearson's Chi-squared test

data:  as.table(data2)
X-squared =        , df = 4, p-value =

We do reject H₀ and conclude that the number of students in each college is highly dependent on gender, i.e., the two variables (gender and college) are NOT independent.

Section 4.9 Distribution-Free CI & TI

Basics

Let Y₁ < Y₂ < Y₃ < Y₄ < Y₅ be the order statistics of a random sample of size n = 5 from any continuous distribution. Also, let m = π₀.₅ (i.e., the 50th percentile) be the median. For example, we can find the following probability:

    P(Y₁ < m < Y₅) = Σ_{k=1}^{4} C(5,k) (1/2)^k (1/2)^{5−k}
                   = 1 − P(X = 0) − P(X = 5),   where X ~ binomial(5, 1/2)
                   = 1 − (1/2)⁵ − (1/2)⁵ = 0.9375

Why is P(Y₁ < m < Y₅) calculated like this? Any individual observation, say X₁, has P(X₁ < m) = 0.5, and in order for Y₁ to be less than m and Y₅ to be greater than m, we must have 1, 2, 3, or 4 observations below m. And so we say (y₁, y₅) is a 94% (distribution-free) confidence interval for m.

In a similar way, when there are n independent observations, we calculate

    P(Yᵢ < m < Yⱼ) = Σ_{k=i}^{j−1} C(n,k) (1/2)^k (1/2)^{n−k} = 1 − α

and (yᵢ, yⱼ) is a 100(1 − α)% (distribution-free) confidence interval for the median m.

Ex 1. Suppose we have an ordered data set with n = 9. Let's calculate:

    P(Y₂ < m < Y₈) = Σ_{k=2}^{7} C(9,k) (1/2)^k (1/2)^{9−k} = 0.9609

so (y₂, y₈) = (9.0, 30.1) is a 96.1% (distribution-free) confidence interval for the median m.

It turns out we can argue the same thing for any percentile π_p. In this case, any individual observation X has P(X < π_p) = p, so when there are n independent observations we calculate

    P(Yᵢ < π_p < Yⱼ) = Σ_{k=i}^{j−1} C(n,k) p^k (1 − p)^{n−k} = 1 − α,

and (yᵢ, yⱼ) is a 100(1 − α)% (distribution-free) confidence interval for the percentile π_p.

Ex 2. Suppose we have an ordered data set with n = 27. First, note that for π₀.₂₅ (i.e., the first quartile), (n + 1)p = (27 + 1)(0.25) = 7, so we have π̂₀.₂₅ = y₇. Now, let's see how much confidence we can have in (y₄, y₁₀) as a confidence interval for π₀.₂₅:

    P(Y₄ < π₀.₂₅ < Y₁₀) = Σ_{k=4}^{9} C(27,k) (0.25)^k (0.75)^{27−k} = 0.820

i.e., (y₄, y₁₀) = (74, 87) is an 82.0% (distribution-free) confidence interval for the 25th percentile π₀.₂₅.

One note: for some of these binomial probability calculations it is OK to use the normal approximation. For example, in the last problem we calculated P(4 ≤ X ≤ 9), where X ~ binomial(n = 27, p = 1/4), so approximately X ~ N(µ = 27/4 = 6.75, σ = √(27 (1/4)(3/4)) = 2.25). Finding the same probability by the normal approximation with continuity correction:

    P(4 ≤ X ≤ 9) ≈ P(3.5 ≤ X ≤ 9.5) = P( (3.5 − 6.75)/2.25 ≤ Z ≤ (9.5 − 6.75)/2.25 )
                 = P(−1.44 ≤ Z ≤ 1.22) ≈ 0.815
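Both coverage sums can be evaluated with `scipy.stats.binom`; a sketch reproducing the 94% and 82.0% figures:

```python
from scipy import stats

# P(Y1 < m < Y5) for n = 5: between 1 and 4 observations fall below the median
cov_median = stats.binom.cdf(4, 5, 0.5) - stats.binom.cdf(0, 5, 0.5)

# P(Y4 < pi_0.25 < Y10) for n = 27: between 4 and 9 observations below the quartile
cov_quartile = stats.binom.cdf(9, 27, 0.25) - stats.binom.cdf(3, 27, 0.25)

print(cov_median, cov_quartile)   # about 0.9375 and 0.820
```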

i.e., the normal approximation works rather well for such a case.

Theorem 1. Let Y₁ < Y₂ < ⋯ < Yₙ be the order statistics based on a random sample X₁, X₂, ..., Xₙ. Then the pdf of Y_k is

    g_k(y) = n! / [ (k−1)!(n−k)! ] · [F(y)]^{k−1} [1 − F(y)]^{n−k} f(y),

where f(·) and F(·) are the pdf and cdf of X.

Theorem 2. Let U₍₁₎ < U₍₂₎ < ⋯ < U₍ₙ₎ be the order statistics of Uᵢ ~ uniform(0, 1). Then U₍ₖ₎ has a beta distribution with the two parameters k and n − k + 1.

Proof. From Theorem 1, with f(y) = 1 and F(y) = y, we have

    g_k(y) = n! / [ (k−1)!(n−k)! ] · y^{k−1} (1 − y)^{n−k},   0 < y < 1,

which is the pdf of β(k, n − k + 1).

Theorem 3. Let X₁, X₂, ..., Xₙ be random variables with continuous cdf F(·). Then F(X₍ₖ₎) has a beta distribution with the two parameters k and n − k + 1.

Proof. First, note that Uᵢ = F(Xᵢ) is iid uniform(0, 1) due to the probability integral transformation. Furthermore, F(·) is a nondecreasing function, i.e., F(·) preserves order. So U₍ᵢ₎ = F(X₍ᵢ₎). That is,

    { U₍₁₎, U₍₂₎, ..., U₍ₙ₎ } = { F(X₍₁₎), F(X₍₂₎), ..., F(X₍ₙ₎) },

and therefore F(X₍ₖ₎) ~ β(k, n − k + 1).

Application: let Y_k be the kth order statistic of the X's, i.e., Y_k = X₍ₖ₎. Consider the following n + 1

random variables:

    W₁ = F(Y₁)
    W₂ = F(Y₂) − F(Y₁)
    W₃ = F(Y₃) − F(Y₂)
    ⋮
    Wₙ = F(Yₙ) − F(Yₙ₋₁)
    Wₙ₊₁ = 1 − F(Yₙ)

- These W₁, W₂, ..., Wₙ₊₁ are called the coverages of the intervals such as (Yᵢ₋₁, Yᵢ].
- Note that the sum of the first k of these coverages is W₁ + ⋯ + W_k = F(Y_k) ~ β(k, n − k + 1).
- F(Yⱼ) − F(Yᵢ), i < j, is the sum of k = j − i coverages, so it has a β(j − i, n − j + i + 1) distribution, i.e.,

      γ = P{ F(Yⱼ) − F(Yᵢ) ≥ p } = ∫_p^1 [ Γ(n+1) / (Γ(j−i) Γ(n−j+i+1)) ] v^{j−i−1} (1 − v)^{n−j+i} dv,

  and (yᵢ, yⱼ) is called a 100γ% tolerance interval for 100p% of the distribution.

Ex 3. Let Y₁ < Y₂ < ⋯ < Y₆ be the order statistics of a random sample of size n = 6 from any continuous distribution. Also, let p = 0.8. Then

    γ = P{ F(Y₆) − F(Y₁) ≥ 0.8 } = ∫_{0.8}^1 [ Γ(7) / (Γ(5)Γ(2)) ] v⁴ (1 − v) dv ≈ 0.34,

i.e., (y₁, y₆) is a 34% (distribution-free) tolerance interval for 80% of the distribution.
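The tolerance-interval probability in Ex 3 is just an upper tail of a Beta(5, 2) distribution; a one-line check with `scipy.stats.beta`:

```python
from scipy import stats

# gamma = P{ F(Y6) - F(Y1) >= 0.8 }, where the coverage ~ Beta(j-i, n-j+i+1) = Beta(5, 2)
gamma = stats.beta.sf(0.8, 5, 2)   # survival function = 1 - cdf
print(gamma)   # about 0.345
```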

Ch 2: Simple Linear Regression

Ch 2: Simple Linear Regression Ch 2: Simple Linear Regression 1. Simple Linear Regression Model A simple regression model with a single regressor x is y = β 0 + β 1 x + ɛ, where we assume that the error ɛ is independent random component

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression In simple linear regression we are concerned about the relationship between two variables, X and Y. There are two components to such a relationship. 1. The strength of the relationship.

More information

Homework 9 Sample Solution

Homework 9 Sample Solution Homework 9 Sample Solution # 1 (Ex 9.12, Ex 9.23) Ex 9.12 (a) Let p vitamin denote the probability of having cold when a person had taken vitamin C, and p placebo denote the probability of having cold

More information

ANOVA Situation The F Statistic Multiple Comparisons. 1-Way ANOVA MATH 143. Department of Mathematics and Statistics Calvin College

ANOVA Situation The F Statistic Multiple Comparisons. 1-Way ANOVA MATH 143. Department of Mathematics and Statistics Calvin College 1-Way ANOVA MATH 143 Department of Mathematics and Statistics Calvin College An example ANOVA situation Example (Treating Blisters) Subjects: 25 patients with blisters Treatments: Treatment A, Treatment

More information

Lectures on Simple Linear Regression Stat 431, Summer 2012

Lectures on Simple Linear Regression Stat 431, Summer 2012 Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population

More information

Simple and Multiple Linear Regression

Simple and Multiple Linear Regression Sta. 113 Chapter 12 and 13 of Devore March 12, 2010 Table of contents 1 Simple Linear Regression 2 Model Simple Linear Regression A simple linear regression model is given by Y = β 0 + β 1 x + ɛ where

More information

Inference for Regression

Inference for Regression Inference for Regression Section 9.4 Cathy Poliak, Ph.D. cathy@math.uh.edu Office in Fleming 11c Department of Mathematics University of Houston Lecture 13b - 3339 Cathy Poliak, Ph.D. cathy@math.uh.edu

More information

Lecture 15. Hypothesis testing in the linear model

Lecture 15. Hypothesis testing in the linear model 14. Lecture 15. Hypothesis testing in the linear model Lecture 15. Hypothesis testing in the linear model 1 (1 1) Preliminary lemma 15. Hypothesis testing in the linear model 15.1. Preliminary lemma Lemma

More information

Chapter 12 - Lecture 2 Inferences about regression coefficient

Chapter 12 - Lecture 2 Inferences about regression coefficient Chapter 12 - Lecture 2 Inferences about regression coefficient April 19th, 2010 Facts about slope Test Statistic Confidence interval Hypothesis testing Test using ANOVA Table Facts about slope In previous

More information

1-Way ANOVA MATH 143. Spring Department of Mathematics and Statistics Calvin College

1-Way ANOVA MATH 143. Spring Department of Mathematics and Statistics Calvin College 1-Way ANOVA MATH 143 Department of Mathematics and Statistics Calvin College Spring 2010 The basic ANOVA situation Two variables: 1 Categorical, 1 Quantitative Main Question: Do the (means of) the quantitative

More information

22s:152 Applied Linear Regression. Take random samples from each of m populations.

22s:152 Applied Linear Regression. Take random samples from each of m populations. 22s:152 Applied Linear Regression Chapter 8: ANOVA NOTE: We will meet in the lab on Monday October 10. One-way ANOVA Focuses on testing for differences among group means. Take random samples from each

More information

22s:152 Applied Linear Regression. There are a couple commonly used models for a one-way ANOVA with m groups. Chapter 8: ANOVA

22s:152 Applied Linear Regression. There are a couple commonly used models for a one-way ANOVA with m groups. Chapter 8: ANOVA 22s:152 Applied Linear Regression Chapter 8: ANOVA NOTE: We will meet in the lab on Monday October 10. One-way ANOVA Focuses on testing for differences among group means. Take random samples from each

More information

GROUPED DATA E.G. FOR SAMPLE OF RAW DATA (E.G. 4, 12, 7, 5, MEAN G x / n STANDARD DEVIATION MEDIAN AND QUARTILES STANDARD DEVIATION

GROUPED DATA E.G. FOR SAMPLE OF RAW DATA (E.G. 4, 12, 7, 5, MEAN G x / n STANDARD DEVIATION MEDIAN AND QUARTILES STANDARD DEVIATION FOR SAMPLE OF RAW DATA (E.G. 4, 1, 7, 5, 11, 6, 9, 7, 11, 5, 4, 7) BE ABLE TO COMPUTE MEAN G / STANDARD DEVIATION MEDIAN AND QUARTILES Σ ( Σ) / 1 GROUPED DATA E.G. AGE FREQ. 0-9 53 10-19 4...... 80-89

More information

CAS MA575 Linear Models

CAS MA575 Linear Models CAS MA575 Linear Models Boston University, Fall 2013 Midterm Exam (Correction) Instructor: Cedric Ginestet Date: 22 Oct 2013. Maximal Score: 200pts. Please Note: You will only be graded on work and answers

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

I i=1 1 I(J 1) j=1 (Y ij Ȳi ) 2. j=1 (Y j Ȳ )2 ] = 2n( is the two-sample t-test statistic.

I i=1 1 I(J 1) j=1 (Y ij Ȳi ) 2. j=1 (Y j Ȳ )2 ] = 2n( is the two-sample t-test statistic. Serik Sagitov, Chalmers and GU, February, 08 Solutions chapter Matlab commands: x = data matrix boxplot(x) anova(x) anova(x) Problem.3 Consider one-way ANOVA test statistic For I = and = n, put F = MS

More information

Confidence Intervals, Testing and ANOVA Summary

Confidence Intervals, Testing and ANOVA Summary Confidence Intervals, Testing and ANOVA Summary 1 One Sample Tests 1.1 One Sample z test: Mean (σ known) Let X 1,, X n a r.s. from N(µ, σ) or n > 30. Let The test statistic is H 0 : µ = µ 0. z = x µ 0

More information

STAT 215 Confidence and Prediction Intervals in Regression

STAT 215 Confidence and Prediction Intervals in Regression STAT 215 Confidence and Prediction Intervals in Regression Colin Reimer Dawson Oberlin College 24 October 2016 Outline Regression Slope Inference Partitioning Variability Prediction Intervals Reminder:

More information

Chapter 26: Comparing Counts (Chi Square)

Chapter 26: Comparing Counts (Chi Square) Chapter 6: Comparing Counts (Chi Square) We ve seen that you can turn a qualitative variable into a quantitative one (by counting the number of successes and failures), but that s a compromise it forces

More information

Business Statistics. Lecture 10: Course Review

Business Statistics. Lecture 10: Course Review Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,

More information

Chapter 1: Linear Regression with One Predictor Variable also known as: Simple Linear Regression Bivariate Linear Regression

Chapter 1: Linear Regression with One Predictor Variable also known as: Simple Linear Regression Bivariate Linear Regression BSTT523: Kutner et al., Chapter 1 1 Chapter 1: Linear Regression with One Predictor Variable also known as: Simple Linear Regression Bivariate Linear Regression Introduction: Functional relation between

More information

Table of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z).

Table of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z). Table of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z). For example P(X 1.04) =.8508. For z < 0 subtract the value from

More information

Log-linear Models for Contingency Tables

Log-linear Models for Contingency Tables Log-linear Models for Contingency Tables Statistics 149 Spring 2006 Copyright 2006 by Mark E. Irwin Log-linear Models for Two-way Contingency Tables Example: Business Administration Majors and Gender A

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Simple linear regression tries to fit a simple line between two variables Y and X. If X is linearly related to Y this explains some of the variability in Y. In most cases, there

More information

Introduction and Single Predictor Regression. Correlation

Introduction and Single Predictor Regression. Correlation Introduction and Single Predictor Regression Dr. J. Kyle Roberts Southern Methodist University Simmons School of Education and Human Development Department of Teaching and Learning Correlation A correlation

More information

Homework 2: Simple Linear Regression

Homework 2: Simple Linear Regression STAT 4385 Applied Regression Analysis Homework : Simple Linear Regression (Simple Linear Regression) Thirty (n = 30) College graduates who have recently entered the job market. For each student, the CGPA

More information

Statistics 135 Fall 2008 Final Exam

Statistics 135 Fall 2008 Final Exam Name: SID: Statistics 135 Fall 2008 Final Exam Show your work. The number of points each question is worth is shown at the beginning of the question. There are 10 problems. 1. [2] The normal equations

More information

Exercise I.1 I.2 I.3 I.4 II.1 II.2 III.1 III.2 III.3 IV.1 Question (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) Answer

Exercise I.1 I.2 I.3 I.4 II.1 II.2 III.1 III.2 III.3 IV.1 Question (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) Answer Solutions to Exam in 02402 December 2012 Exercise I.1 I.2 I.3 I.4 II.1 II.2 III.1 III.2 III.3 IV.1 Question (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) Answer 3 1 5 2 5 2 3 5 1 3 Exercise IV.2 IV.3 IV.4 V.1

More information

Review for Final. Chapter 1 Type of studies: anecdotal, observational, experimental Random sampling

Review for Final. Chapter 1 Type of studies: anecdotal, observational, experimental Random sampling Review for Final For a detailed review of Chapters 1 7, please see the review sheets for exam 1 and. The following only briefly covers these sections. The final exam could contain problems that are included

More information

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference.

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference. Understanding regression output from software Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals In 1966 Cyril Burt published a paper called The genetic determination of differences

More information

Summary of Chapter 7 (Sections ) and Chapter 8 (Section 8.1)

Summary of Chapter 7 (Sections ) and Chapter 8 (Section 8.1) Summary of Chapter 7 (Sections 7.2-7.5) and Chapter 8 (Section 8.1) Chapter 7. Tests of Statistical Hypotheses 7.2. Tests about One Mean (1) Test about One Mean Case 1: σ is known. Assume that X N(µ, σ

More information

Biostatistics for physicists fall Correlation Linear regression Analysis of variance

Biostatistics for physicists fall Correlation Linear regression Analysis of variance Biostatistics for physicists fall 2015 Correlation Linear regression Analysis of variance Correlation Example: Antibody level on 38 newborns and their mothers There is a positive correlation in antibody

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression MATH 282A Introduction to Computational Statistics University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/ eariasca/math282a.html MATH 282A University

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7

MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7 MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7 1 Random Vectors Let a 0 and y be n 1 vectors, and let A be an n n matrix. Here, a 0 and A are non-random, whereas y is

More information

Mathematical Notation Math Introduction to Applied Statistics

Mathematical Notation Math Introduction to Applied Statistics Mathematical Notation Math 113 - Introduction to Applied Statistics Name : Use Word or WordPerfect to recreate the following documents. Each article is worth 10 points and should be emailed to the instructor

More information

Table of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z).

Table of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z). Table of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z). For example P(X.04) =.8508. For z < 0 subtract the value from,

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Correlation & Simple Regression

Correlation & Simple Regression Chapter 11 Correlation & Simple Regression The previous chapter dealt with inference for two categorical variables. In this chapter, we would like to examine the relationship between two quantitative variables.

More information

Central Limit Theorem ( 5.3)

Central Limit Theorem ( 5.3) Central Limit Theorem ( 5.3) Let X 1, X 2,... be a sequence of independent random variables, each having n mean µ and variance σ 2. Then the distribution of the partial sum S n = X i i=1 becomes approximately

More information

STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression

STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression Rebecca Barter April 20, 2015 Fisher s Exact Test Fisher s Exact Test

More information

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model 1 Linear Regression 2 Linear Regression In this lecture we will study a particular type of regression model: the linear regression model We will first consider the case of the model with one predictor

More information

Master s Written Examination - Solution

Master s Written Examination - Solution Master s Written Examination - Solution Spring 204 Problem Stat 40 Suppose X and X 2 have the joint pdf f X,X 2 (x, x 2 ) = 2e (x +x 2 ), 0 < x < x 2

More information

" M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2

 M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2 Notation and Equations for Final Exam Symbol Definition X The variable we measure in a scientific study n The size of the sample N The size of the population M The mean of the sample µ The mean of the

More information

Chapter 8: Correlation & Regression

Chapter 8: Correlation & Regression Chapter 8: Correlation & Regression We can think of ANOVA and the two-sample t-test as applicable to situations where there is a response variable which is quantitative, and another variable that indicates

More information

13 Simple Linear Regression

13 Simple Linear Regression B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 3 Simple Linear Regression 3. An industrial example A study was undertaken to determine the effect of stirring rate on the amount of impurity

More information

ANOVA: Analysis of Variation

ANOVA: Analysis of Variation ANOVA: Analysis of Variation The basic ANOVA situation Two variables: 1 Categorical, 1 Quantitative Main Question: Do the (means of) the quantitative variables depend on which group (given by categorical

More information

Review of Statistics

Review of Statistics Review of Statistics Topics Descriptive Statistics Mean, Variance Probability Union event, joint event Random Variables Discrete and Continuous Distributions, Moments Two Random Variables Covariance and

More information

Ch. 1: Data and Distributions

Ch. 1: Data and Distributions Ch. 1: Data and Distributions Populations vs. Samples How to graphically display data Histograms, dot plots, stem plots, etc Helps to show how samples are distributed Distributions of both continuous and

More information

Final Exam - Solutions

Final Exam - Solutions Ecn 102 - Analysis of Economic Data University of California - Davis March 19, 2010 Instructor: John Parman Final Exam - Solutions You have until 5:30pm to complete this exam. Please remember to put your

More information

Chapter 16: Understanding Relationships Numerical Data

Chapter 16: Understanding Relationships Numerical Data Chapter 16: Understanding Relationships Numerical Data These notes reflect material from our text, Statistics, Learning from Data, First Edition, by Roxy Peck, published by CENGAGE Learning, 2015. Linear

More information

Problems. Suppose both models are fitted to the same data. Show that SS Res, A SS Res, B

Problems. Suppose both models are fitted to the same data. Show that SS Res, A SS Res, B Simple Linear Regression 35 Problems 1 Consider a set of data (x i, y i ), i =1, 2,,n, and the following two regression models: y i = β 0 + β 1 x i + ε, (i =1, 2,,n), Model A y i = γ 0 + γ 1 x i + γ 2

More information

This is a multiple choice and short answer practice exam. It does not count towards your grade. You may use the tables in your book.

This is a multiple choice and short answer practice exam. It does not count towards your grade. You may use the tables in your book. NAME (Please Print): HONOR PLEDGE (Please Sign): statistics 101 Practice Final Key This is a multiple choice and short answer practice exam. It does not count towards your grade. You may use the tables

More information

Introduction to Statistical Data Analysis Lecture 7: The Chi-Square Distribution

Introduction to Statistical Data Analysis Lecture 7: The Chi-Square Distribution Introduction to Statistical Data Analysis Lecture 7: The Chi-Square Distribution James V. Lambers Department of Mathematics The University of Southern Mississippi James V. Lambers Statistical Data Analysis

More information

Intro to Linear Regression

Intro to Linear Regression Intro to Linear Regression Introduction to Regression Regression is a statistical procedure for modeling the relationship among variables to predict the value of a dependent variable from one or more predictor

More information

Sociology 6Z03 Review II

Sociology 6Z03 Review II Sociology 6Z03 Review II John Fox McMaster University Fall 2016 John Fox (McMaster University) Sociology 6Z03 Review II Fall 2016 1 / 35 Outline: Review II Probability Part I Sampling Distributions Probability

More information

Ch 3: Multiple Linear Regression

Ch 3: Multiple Linear Regression Ch 3: Multiple Linear Regression 1. Multiple Linear Regression Model Multiple regression model has more than one regressor. For example, we have one response variable and two regressor variables: 1. delivery

More information

UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences Midterm Test, October 2013

UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences Midterm Test, October 2013 UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences Midterm Test, October 2013 STAC67H3 Regression Analysis Duration: One hour and fifty minutes Last Name: First Name: Student

More information

Intro to Linear Regression

Intro to Linear Regression Intro to Linear Regression Introduction to Regression Regression is a statistical procedure for modeling the relationship among variables to predict the value of a dependent variable from one or more predictor

More information

1. Simple Linear Regression

1. Simple Linear Regression 1. Simple Linear Regression Suppose that we are interested in the average height of male undergrads at UF. We put each male student s name (population) in a hat and randomly select 100 (sample). Then their

More information

Chapter 1. Linear Regression with One Predictor Variable

Chapter 1. Linear Regression with One Predictor Variable Chapter 1. Linear Regression with One Predictor Variable 1.1 Statistical Relation Between Two Variables To motivate statistical relationships, let us consider a mathematical relation between two mathematical

More information

No other aids are allowed. For example you are not allowed to have any other textbook or past exams.

No other aids are allowed. For example you are not allowed to have any other textbook or past exams. UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences Sample Exam Note: This is one of our past exams, In fact the only past exam with R. Before that we were using SAS. In

More information

Analysis of Variance

Analysis of Variance Analysis of Variance Blood coagulation time T avg A 62 60 63 59 61 B 63 67 71 64 65 66 66 C 68 66 71 67 68 68 68 D 56 62 60 61 63 64 63 59 61 64 Blood coagulation time A B C D Combined 56 57 58 59 60 61

More information

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models SCHOOL OF MATHEMATICS AND STATISTICS Linear and Generalised Linear Models Autumn Semester 2017 18 2 hours Attempt all the questions. The allocation of marks is shown in brackets. RESTRICTED OPEN BOOK EXAMINATION

More information

Correlation and regression

Correlation and regression 1 Correlation and regression Yongjua Laosiritaworn Introductory on Field Epidemiology 6 July 2015, Thailand Data 2 Illustrative data (Doll, 1955) 3 Scatter plot 4 Doll, 1955 5 6 Correlation coefficient,

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 15: Examples of hypothesis tests (v5) Ramesh Johari ramesh.johari@stanford.edu 1 / 32 The recipe 2 / 32 The hypothesis testing recipe In this lecture we repeatedly apply the

More information

Statistics. Statistics

Statistics. Statistics The main aims of statistics 1 1 Choosing a model 2 Estimating its parameter(s) 1 point estimates 2 interval estimates 3 Testing hypotheses Distributions used in statistics: χ 2 n-distribution 2 Let X 1,

More information

: The model hypothesizes a relationship between the variables. The simplest probabilistic model: or.

: The model hypothesizes a relationship between the variables. The simplest probabilistic model: or. Chapter Simple Linear Regression : comparing means across groups : presenting relationships among numeric variables. Probabilistic Model : The model hypothesizes an relationship between the variables.

More information

BIOS 2083 Linear Models c Abdus S. Wahed

BIOS 2083 Linear Models c Abdus S. Wahed Chapter 5 206 Chapter 6 General Linear Model: Statistical Inference 6.1 Introduction So far we have discussed formulation of linear models (Chapter 1), estimability of parameters in a linear model (Chapter

More information

STAT763: Applied Regression Analysis. Multiple linear regression. 4.4 Hypothesis testing

STAT763: Applied Regression Analysis. Multiple linear regression. 4.4 Hypothesis testing STAT763: Applied Regression Analysis Multiple linear regression 4.4 Hypothesis testing Chunsheng Ma E-mail: cma@math.wichita.edu 4.4.1 Significance of regression Null hypothesis (Test whether all β j =

More information

Institute of Actuaries of India

Institute of Actuaries of India Institute of Actuaries of India Subject CT3 Probability & Mathematical Statistics May 2011 Examinations INDICATIVE SOLUTION Introduction The indicative solution has been written by the Examiners with the

More information

Inferences for Regression

Inferences for Regression Inferences for Regression An Example: Body Fat and Waist Size Looking at the relationship between % body fat and waist size (in inches). Here is a scatterplot of our data set: Remembering Regression In

More information

STAT2912: Statistical Tests. Solution week 12

STAT2912: Statistical Tests. Solution week 12 STAT2912: Statistical Tests Solution week 12 1. A behavioural biologist believes that performance of a laboratory rat on an intelligence test depends, to a large extent, on the amount of protein in the

More information

STAT5044: Regression and Anova. Inyoung Kim

STAT5044: Regression and Anova. Inyoung Kim STAT5044: Regression and Anova Inyoung Kim 2 / 47 Outline 1 Regression 2 Simple Linear regression 3 Basic concepts in regression 4 How to estimate unknown parameters 5 Properties of Least Squares Estimators:

More information

MATH 644: Regression Analysis Methods

MATH 644: Regression Analysis Methods MATH 644: Regression Analysis Methods FINAL EXAM Fall, 2012 INSTRUCTIONS TO STUDENTS: 1. This test contains SIX questions. It comprises ELEVEN printed pages. 2. Answer ALL questions for a total of 100

More information

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing 1 In most statistics problems, we assume that the data have been generated from some unknown probability distribution. We desire

More information

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator UNIVERSITY OF TORONTO Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS Duration - 3 hours Aids Allowed: Calculator LAST NAME: FIRST NAME: STUDENT NUMBER: There are 27 pages

More information

Multivariate Linear Regression Models

Multivariate Linear Regression Models Multivariate Linear Regression Models Regression analysis is used to predict the value of one or more responses from a set of predictors. It can also be used to estimate the linear association between

More information

Statistics for Engineers Lecture 9 Linear Regression

Statistics for Engineers Lecture 9 Linear Regression Statistics for Engineers Lecture 9 Linear Regression Chong Ma Department of Statistics University of South Carolina chongm@email.sc.edu April 17, 2017 Chong Ma (Statistics, USC) STAT 509 Spring 2017 April

More information

1 Introduction to Minitab

1 Introduction to Minitab 1 Introduction to Minitab Minitab is a statistical analysis software package. The software is freely available to all students and is downloadable through the Technology Tab at my.calpoly.edu. When you

More information

One-Way ANOVA. Some examples of when ANOVA would be appropriate include:

One-Way ANOVA. Some examples of when ANOVA would be appropriate include: One-Way ANOVA 1. Purpose Analysis of variance (ANOVA) is used when one wishes to determine whether two or more groups (e.g., classes A, B, and C) differ on some outcome of interest (e.g., an achievement

More information

Measuring the fit of the model - SSR

Measuring the fit of the model - SSR Measuring the fit of the model - SSR Once we ve determined our estimated regression line, we d like to know how well the model fits. How far/close are the observations to the fitted line? One way to do

More information

Can you tell the relationship between students SAT scores and their college grades?

Can you tell the relationship between students SAT scores and their college grades? Correlation One Challenge Can you tell the relationship between students SAT scores and their college grades? A: The higher SAT scores are, the better GPA may be. B: The higher SAT scores are, the lower

More information

6. Multiple Linear Regression

6. Multiple Linear Regression 6. Multiple Linear Regression SLR: 1 predictor X, MLR: more than 1 predictor Example data set: Y i = #points scored by UF football team in game i X i1 = #games won by opponent in their last 10 games X

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Topic 20: Single Factor Analysis of Variance

Topic 20: Single Factor Analysis of Variance Topic 20: Single Factor Analysis of Variance Outline Single factor Analysis of Variance One set of treatments Cell means model Factor effects model Link to linear regression using indicator explanatory

More information

z and t tests for the mean of a normal distribution Confidence intervals for the mean Binomial tests

z and t tests for the mean of a normal distribution Confidence intervals for the mean Binomial tests z and t tests for the mean of a normal distribution Confidence intervals for the mean Binomial tests Chapters 3.5.1 3.5.2, 3.3.2 Prof. Tesler Math 283 Fall 2018 Prof. Tesler z and t tests for mean Math

More information

http://www.statsoft.it/out.php?loc=http://www.statsoft.com/textbook/ Group comparison test for independent samples The purpose of the Analysis of Variance (ANOVA) is to test for significant differences

More information

Estimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X.

Estimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X. Estimating σ 2 We can do simple prediction of Y and estimation of the mean of Y at any value of X. To perform inferences about our regression line, we must estimate σ 2, the variance of the error term.

More information

Business Statistics. Lecture 9: Simple Regression

Business Statistics. Lecture 9: Simple Regression Business Statistics Lecture 9: Simple Regression 1 On to Model Building! Up to now, class was about descriptive and inferential statistics Numerical and graphical summaries of data Confidence intervals

More information

Formal Statement of Simple Linear Regression Model

Formal Statement of Simple Linear Regression Model Formal Statement of Simple Linear Regression Model Y i = β 0 + β 1 X i + ɛ i Y i value of the response variable in the i th trial β 0 and β 1 are parameters X i is a known constant, the value of the predictor

More information

STAT 525 Fall Final exam. Tuesday December 14, 2010

STAT 525 Fall Final exam. Tuesday December 14, 2010 STAT 525 Fall 2010 Final exam Tuesday December 14, 2010 Time: 2 hours Name (please print): Show all your work and calculations. Partial credit will be given for work that is partially correct. Points will

More information

Regression Estimation - Least Squares and Maximum Likelihood. Dr. Frank Wood

Regression Estimation - Least Squares and Maximum Likelihood. Dr. Frank Wood Regression Estimation - Least Squares and Maximum Likelihood Dr. Frank Wood Least Squares Max(min)imization Function to minimize w.r.t. β 0, β 1 Q = n (Y i (β 0 + β 1 X i )) 2 i=1 Minimize this by maximizing

More information

Lecture 10 Multiple Linear Regression

Lecture 10 Multiple Linear Regression Lecture 10 Multiple Linear Regression STAT 512 Spring 2011 Background Reading KNNL: 6.1-6.5 10-1 Topic Overview Multiple Linear Regression Model 10-2 Data for Multiple Regression Y i is the response variable

More information

ST430 Exam 1 with Answers

ST430 Exam 1 with Answers ST430 Exam 1 with Answers Date: October 5, 2015 Name: Guideline: You may use one-page (front and back of a standard A4 paper) of notes. No laptop or textook are permitted but you may use a calculator.

More information

Figure 1: The fitted line using the shipment route-number of ampules data. STAT5044: Regression and ANOVA The Solution of Homework #2 Inyoung Kim

Figure 1: The fitted line using the shipment route-number of ampules data. STAT5044: Regression and ANOVA The Solution of Homework #2 Inyoung Kim 0.0 1.0 1.5 2.0 2.5 3.0 8 10 12 14 16 18 20 22 y x Figure 1: The fitted line using the shipment route-number of ampules data STAT5044: Regression and ANOVA The Solution of Homework #2 Inyoung Kim Problem#

More information

Objectives Simple linear regression. Statistical model for linear regression. Estimating the regression parameters

Objectives Simple linear regression. Statistical model for linear regression. Estimating the regression parameters Objectives 10.1 Simple linear regression Statistical model for linear regression Estimating the regression parameters Confidence interval for regression parameters Significance test for the slope Confidence

More information

STAT 4385 Topic 01: Introduction & Review

STAT 4385 Topic 01: Introduction & Review STAT 4385 Topic 01: Introduction & Review Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Spring, 2016 Outline Welcome What is Regression Analysis? Basics

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

FinalExamReview. Sta Fall Provided: Z, t and χ 2 tables

FinalExamReview. Sta Fall Provided: Z, t and χ 2 tables Final Exam FinalExamReview Sta 101 - Fall 2017 Duke University, Department of Statistical Science When: Wednesday, December 13 from 9:00am-12:00pm What to bring: Scientific calculator (graphing calculator

More information

Lecture 6 Multiple Linear Regression, cont.

Lecture 6 Multiple Linear Regression, cont. Lecture 6 Multiple Linear Regression, cont. BIOST 515 January 22, 2004 BIOST 515, Lecture 6 Testing general linear hypotheses Suppose we are interested in testing linear combinations of the regression

More information