Two-Sample Inference for Proportions and Inference for Linear Regression

Kwonsang Lee
University of Pennsylvania
kwonlee@wharton.upenn.edu
STAT111, April 24, 2015

Announcement: Review Session

There is no mandatory recitation next Friday, but there is an optional review session next Friday at the same time and location. The professor will go over the material next Tuesday in class, so I'll focus on problem solving. I'll also hold office hours next Tuesday 1-2pm and Wednesday 3-4pm.

Announcement: Returning HW

Homework 6 and 7 are not yet graded. I'll return them during next Friday's review session. For those who are not able to come to the review session, I'll set up a return box at the entrance of the Statistics Department on the 4th floor of JMHH. I'll let you know via email when the box is ready.

Hypothesis Test for Two Proportions

Assume that we have two independent samples and want to test whether the two population proportions are the same.

$$H_0: p_1 = p_2 \ (\text{or } p_1 - p_2 = 0), \qquad H_a: p_1 \neq p_2$$

The test statistic $Z_0$ is given by

$$Z_0 = \frac{\hat{p}_1 - \hat{p}_2}{SE(\hat{p}_1 - \hat{p}_2)} = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_p (1 - \hat{p}_p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

where $\hat{p}_1 = Y_1 / n_1$, $\hat{p}_2 = Y_2 / n_2$, and the pooled proportion is $\hat{p}_p = (Y_1 + Y_2)/(n_1 + n_2)$. Then we compute the P-value.

Confidence Interval for Difference

This confidence interval is for the difference between Population 1 and Population 2, i.e., a CI for $p_1 - p_2$. The point estimate of $p_1 - p_2$ is $\hat{p}_1 - \hat{p}_2$, and a confidence interval is

$$\hat{p}_1 - \hat{p}_2 \pm Z^* \sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}}$$

where $Z^*$ is the critical value for the chosen confidence level (for example, 1.96 for 95%).

Example: Smoking

Time magazine reported the result of a telephone poll of 800 adult Americans. The question posed to the Americans who were surveyed was: "Should the federal tax on cigarettes be raised to pay for health care reform?" The results of the survey were:

Non-Smokers: $n_1 = 605$, $y_1 = 351$ said yes, $\hat{p}_1 = 351/605 = 0.58$
Smokers: $n_2 = 195$, $y_2 = 41$ said yes, $\hat{p}_2 = 41/195 = 0.21$

Confidence Interval

1. What is the 95% confidence interval for $p_1$?

$$\hat{p}_1 \pm 1.96 \sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1}} = (0.54, 0.62)$$

2. What is the 95% confidence interval for $p_2$?

$$\hat{p}_2 \pm 1.96 \sqrt{\frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}} = (0.15, 0.27)$$

3. What is the 95% confidence interval for $p_1 - p_2$?

$$(\hat{p}_1 - \hat{p}_2) \pm 1.96 \sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}} = (0.30, 0.44)$$
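
The three intervals can be checked numerically. The following is a minimal Python sketch, assuming the large-sample (normal approximation) formulas from the slides; the function names `proportion_ci` and `difference_ci` are hypothetical and chosen only for illustration.

```python
from math import sqrt

Z_95 = 1.96  # critical value for a 95% interval, as used on the slide


def proportion_ci(y, n, z=Z_95):
    """Large-sample CI for one proportion: p_hat +/- z * sqrt(p_hat(1 - p_hat)/n)."""
    p_hat = y / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def difference_ci(y1, n1, y2, n2, z=Z_95):
    """Large-sample CI for p1 - p2 using the unpooled standard error."""
    p1, p2 = y1 / n1, y2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


# Survey counts from the smoking example: 351 of 605 non-smokers
# and 41 of 195 smokers said yes.
ci_p1 = proportion_ci(351, 605)
ci_p2 = proportion_ci(41, 195)
ci_diff = difference_ci(351, 605, 41, 195)

for label, (lo, hi) in [("p1", ci_p1), ("p2", ci_p2), ("p1 - p2", ci_diff)]:
    print(f"95% CI for {label}: ({lo:.2f}, {hi:.2f})")
```

Note that the interval for the difference uses the unpooled standard error, while the hypothesis test uses the pooled proportion because it is computed under the assumption $p_1 = p_2$.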

Hypothesis Test

We want to test whether smokers and non-smokers have significantly different opinions. The null and alternative hypotheses are

$$H_0: p_1 = p_2, \qquad H_a: p_1 \neq p_2$$

Here, $\hat{p}_p = \frac{351 + 41}{605 + 195} = 0.49$. The test statistic $Z_0$ is

$$Z_0 = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}_p (1 - \hat{p}_p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{0.58 - 0.21}{\sqrt{0.49\,(1 - 0.49)\left(\frac{1}{605} + \frac{1}{195}\right)}} = 8.99$$

The P-value is $2 \times P(Z > 8.99) \approx 0$. Therefore, we reject the null hypothesis.
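
As a check on the arithmetic, here is a short Python sketch of the pooled two-proportion Z test. The function name `two_proportion_z_test` is hypothetical, and the two-sided P-value is computed with the standard identity $2\,P(Z > |z|) = \operatorname{erfc}(|z|/\sqrt{2})$ rather than read from a table.

```python
from math import erfc, sqrt


def two_proportion_z_test(y1, n1, y2, n2):
    """Return (Z0, two-sided P-value) for H0: p1 = p2 using the pooled proportion."""
    p1, p2 = y1 / n1, y2 / n2
    p_pooled = (y1 + y2) / (n1 + n2)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z0 = (p1 - p2) / se
    # For a standard normal Z, 2 * P(Z > |z0|) equals erfc(|z0| / sqrt(2)).
    p_value = erfc(abs(z0) / sqrt(2))
    return z0, p_value


# Smoking example: non-smokers (351 of 605) versus smokers (41 of 195)
z0, p_value = two_proportion_z_test(351, 605, 41, 195)
print(f"Z0 = {z0:.2f}, P-value = {p_value:.3g}")
```

The P-value comes out far below any conventional significance level, which matches the slide's conclusion to reject the null.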

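Linear Regression

Simple Linear Regression Model:

$$Y_i = \alpha + \beta X_i + e_i$$

where $\alpha$ is the intercept and $\beta$ is the slope. We estimate $\alpha$ and $\beta$, not $e_i$.

The best-fit line minimizes the sum of squared residuals (residual $= Y_i - (\alpha + \beta X_i)$):

$$SSR = \sum_{i=1}^{n} \left(Y_i - (\alpha + \beta X_i)\right)^2$$

The best estimators of $\alpha$ and $\beta$ are $a$ and $b$. Remember this formula!

$$b = r \frac{S_y}{S_x}, \qquad a = \bar{Y} - b \bar{X}$$

Here $r$ is the sample correlation between $X$ and $Y$, and $S_x$, $S_y$ are the sample standard deviations.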
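
The formulas for $a$ and $b$ can be tried on a small data set. The sketch below uses made-up numbers (the `xs` and `ys` values are hypothetical, not from the course), and the helper names `least_squares_line` and `ssr` are chosen only for illustration. It also compares the SSR of the fitted line with that of a slightly shifted line.

```python
from statistics import mean, stdev


def least_squares_line(xs, ys):
    """Return (a, b) using b = r * Sy / Sx and a = Y_bar - b * X_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    # Sample correlation coefficient r
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b


def ssr(a, b, xs, ys):
    """Sum of squared residuals for the line a + b * x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(xs, ys)
print(f"a = {a:.2f}, b = {b:.2f}")
print("SSR at (a, b):      ", ssr(a, b, xs, ys))
print("SSR at (a + 0.1, b):", ssr(a + 0.1, b, xs, ys))
```

Any other choice of intercept or slope gives an SSR at least as large as the one at $(a, b)$, which is what "best fit" means here.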

Prediction Using Linear Regression

For a new value of $X$, we can predict $Y$ using the least-squares line:

$$Y_{\text{predicted}} = a + b X_{\text{new}}$$

The regression equation is obtained from the data $((X_1, Y_1), \ldots, (X_n, Y_n))$. This equation is good for predicting $Y$ for a new $X$ between $X_{\min}$ and $X_{\max}$, where $X_{\min} = \min(X_i)$ and $X_{\max} = \max(X_i)$. However, if the new $X$ is outside this interval, the prediction might not work well; that is, extrapolation can be unreliable.
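
One way to keep the extrapolation warning in mind is to have the prediction step report whether the new $X$ lies inside the observed range. This is only a sketch: the function name `predict` and the values of `a`, `b`, and the observed range are hypothetical.

```python
def predict(a, b, x_new, x_min, x_max):
    """Return (Y_predicted, is_extrapolation) for the line a + b * x."""
    y_pred = a + b * x_new
    # The fitted line is only supported by data with X in [x_min, x_max].
    is_extrapolation = not (x_min <= x_new <= x_max)
    return y_pred, is_extrapolation


# Hypothetical fitted line and observed X range, for illustration only
a, b = 0.05, 1.99
x_min, x_max = 1, 5

inside = predict(a, b, 3.5, x_min, x_max)
outside = predict(a, b, 8, x_min, x_max)

for x_new, (y_pred, flag) in [(3.5, inside), (8, outside)]:
    note = "extrapolation, use with caution" if flag else "within observed range"
    print(f"X_new = {x_new}: Y_predicted = {y_pred:.3f} ({note})")
```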

Inference for Linear Regression

We want to see if there is a linear relationship between $Y$ and $X$. This is equivalent to testing whether the (population) slope $\beta$ is zero.

$$H_0: \beta = 0, \qquad H_a: \beta \neq 0$$

If the null hypothesis is rejected, then we can say that there is a linear relationship. We can also test whether the intercept $\alpha$ is zero, but usually we are not interested in this test.

Test for Slope β = 0

From the sample $(X_i, Y_i)$, we have the estimates $a$ and $b$.

a. State the appropriate hypotheses

$$H_0: \beta = 0, \qquad H_a: \beta \neq 0$$

b. Test statistic $T_0$

$$T_0 = \frac{b - 0}{SE(b)}$$

$SE(b)$ is usually given or can be found in the output of JMP regression.

c. P-value

We can find the range of the P-value from the t-table with $n - 2$ degrees of freedom, or find the P-value in the output of JMP regression.
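
When $SE(b)$ is not given, it can be computed from the data. The sketch below assumes the standard formula $SE(b) = \sqrt{SSR/(n-2)} \,/\, \sqrt{\sum (X_i - \bar{X})^2}$, which is not stated on the slides; the data and the function name `slope_t_statistic` are hypothetical. The critical value 3.182 is the two-sided 5% value from a t-table with 3 degrees of freedom.

```python
from math import sqrt
from statistics import mean


def slope_t_statistic(xs, ys):
    """Return (b, SE(b), T0) for testing H0: beta = 0."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx  # least-squares slope (same value as r * Sy / Sx)
    a = y_bar - b * x_bar
    ssr = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Assumed standard formula: residual standard deviation over sqrt(Sxx)
    se_b = sqrt(ssr / (n - 2)) / sqrt(s_xx)
    t0 = (b - 0) / se_b
    return b, se_b, t0


# Hypothetical data, for illustration only (n = 5, so n - 2 = 3 degrees of freedom)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

T_CRIT = 3.182  # t-table value: two-sided 5% level, 3 degrees of freedom

b, se_b, t0 = slope_t_statistic(xs, ys)
reject = abs(t0) > T_CRIT
print(f"b = {b:.3f}, SE(b) = {se_b:.4f}, T0 = {t0:.2f}, reject H0: {reject}")
```

Comparing $|T_0|$ with a table value only brackets the P-value, as the slide notes; software output such as JMP reports the P-value directly.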

Confidence Interval for β

The slope $\beta$ is interpreted as the average change in $Y$ for a one-unit increase in $X$. We might be interested in a confidence interval for this average change, i.e., for $\beta$. A confidence interval for $\beta$ is

$$b \pm t^* \, SE(b)$$

where $t^*$ is the critical value with $n - 2$ degrees of freedom.
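
A brief sketch of the interval computation follows. The values of `b`, `se_b`, and the sample size are hypothetical, the function name `slope_confidence_interval` is chosen only for illustration, and the critical value is assumed to be read from a t-table (3.182 for 95% confidence with 3 degrees of freedom).

```python
def slope_confidence_interval(b, se_b, t_star):
    """Return the interval b +/- t_star * SE(b)."""
    margin = t_star * se_b
    return b - margin, b + margin


# Hypothetical regression output with n = 5, so n - 2 = 3 degrees of freedom
b = 1.99
se_b = 0.0597
t_star = 3.182  # t-table value for 95% confidence, 3 degrees of freedom

lo, hi = slope_confidence_interval(b, se_b, t_star)
print(f"95% CI for beta: ({lo:.3f}, {hi:.3f})")
```

If the interval excludes 0, as it does for these made-up numbers, that agrees with rejecting $H_0: \beta = 0$ in the two-sided test at the matching significance level.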