Homework 9 Sample Solution

# 1 (Ex 9.12, Ex 9.23)

Ex 9.12

(a) Let p_vitamin denote the probability of catching a cold when a person took vitamin C, and p_placebo the probability of catching a cold when a person took only a placebo.

H_0: p_vitamin = p_placebo
H_A: p_vitamin ≠ p_placebo

Note that p̂_vitamin = 17/139 ≈ 0.122 and p̂_placebo = 31/140 ≈ 0.221.

z = (p̂_vitamin − p̂_placebo) / sqrt( p̂_vitamin(1 − p̂_vitamin)/n_vitamin + p̂_placebo(1 − p̂_placebo)/n_placebo )
  = (0.122 − 0.221) / sqrt( (0.122)(0.878)/139 + (0.221)(0.779)/140 )
  ≈ −2.21

The P-value is 2(0.0136) = 0.027 < 0.05 = α. Thus, reject H_0 and conclude that vitamin C significantly changes (reduces) the incidence rate of colds (the probability of catching a cold).

(b) We will reject H_0 if

χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (n_ij − e_ij)² / e_ij > χ²_{(r−1)(c−1), α}.

Note that the chi-square test for two-way data can test both the hypothesis of independence and the hypothesis of homogeneity (p. 323).
The following three ways of stating the hypotheses are all acceptable and equivalent.

(1) H_0: P(cold | VC) = P(cold | placebo) = P(cold), P(no cold | VC) = P(no cold | placebo) = P(no cold)
    H_A: at least one is different.

(2) H_0: The chance of catching a cold is homogeneous (equal) in the VC and placebo groups.
    H_A: The chance of catching a cold is heterogeneous (NOT equal) in the VC and placebo groups.

(3) H_0: Catching a cold is independent of whether a person took vitamin C or a placebo.
    H_A: Catching a cold is NOT independent of whether a person took vitamin C or a placebo.

Observed values:

Group          Cold: Yes   Cold: No   Row Total
Vitamin C             17        122         139
Placebo               31        109         140
Column Total          48        231         279

Expected values (e_ij = row total × column total / 279):

Group          Cold: Yes   Cold: No   Row Total
Vitamin C          23.91     115.09         139
Placebo            24.09     115.91         140
Column Total          48        231         279

χ² = (17 − 23.91)²/23.91 + (31 − 24.09)²/24.09 + (122 − 115.09)²/115.09 + (109 − 115.91)²/115.91
   ≈ 4.81 > 3.841 = χ²_{1, 0.05}

Thus, reject H_0 and conclude that vitamin C reduces the incidence rate of colds.
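Both tests above can be checked numerically. The following Python sketch (standard library only; the variable names are my own) recomputes the two-proportion z statistic from part (a) and the 2×2 chi-square statistic from part (b):

```python
import math

# Ex 9.12(a): two-proportion z test for cold incidence, vitamin C vs. placebo.
# Counts are taken from the table in the solution.
x1, n1 = 17, 139          # colds among vitamin C takers
x2, n2 = 31, 140          # colds among placebo takers
p1, p2 = x1 / n1, x2 / n2
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = (p1 - p2) / se        # about -2.21, so |z| > 1.96

# Ex 9.12(b): chi-square statistic for the same 2x2 table,
# with expected counts e_ij = (row total)(column total)/n.
table = [[17, 122], [31, 109]]
row = [sum(r) for r in table]
col = [sum(c) for c in zip(*table)]
n = sum(row)
chi2 = sum((table[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in range(2) for j in range(2))
# chi2 is about 4.81 > 3.841, agreeing with the hand computation
```

Note that the z test and the chi-square test give slightly different statistics here (z² ≈ 4.90 vs. χ² ≈ 4.81) because the z statistic above uses unpooled variances; the pooled-variance z statistic satisfies z² = χ² exactly, as Ex 9.23 shows for one sample.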
Ex 9.23

(a) χ² = Σ_{i=1}^{2} (n_i − e_i)² / e_i
       = (x − np_0)² / (np_0) + ((n − x) − n(1 − p_0))² / (n(1 − p_0))
       = [ (x − np_0)²(1 − p_0) + (np_0 − x)² p_0 ] / [ np_0(1 − p_0) ]
       = (x − np_0)² / [ np_0(1 − p_0) ]
       = z²

We reject H_0 if |z| > z_{α/2}, or equivalently if z² > z²_{α/2} = χ²_{1, α}. It is evident that the two tests are, indeed, equivalent.

# 2 (Ex 9.20)

(a) H_0: p_1 = 9/16, p_2 = 3/16, p_3 = 3/16, p_4 = 1/16
    H_A: not H_0

Note that e_i = n p_i = 1611 p_i.

Phenotype             n_i       e_i    (n_i − e_i)²/e_i
Tall, cut-leaf        926   906.188       0.433
Dwarf, cut-leaf       293   302.063       0.272
Tall, potato-leaf     288   302.063       0.655
Dwarf, potato-leaf    104   100.688       0.109
Total                1611              χ² = 1.469

Note that χ²_{3, 0.05} = 7.815. Since χ² = 1.469 < 7.815, we fail to reject H_0.
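The 9:3:3:1 goodness-of-fit computation above is short enough to verify directly. A minimal Python sketch (counts from the table above; variable names are my own):

```python
# Ex 9.20(a): chi-square goodness-of-fit test for the 9:3:3:1 Mendelian ratio.
counts = [926, 293, 288, 104]   # tall/cut, dwarf/cut, tall/potato, dwarf/potato
probs = [9 / 16, 3 / 16, 3 / 16, 1 / 16]
n = sum(counts)                 # 1611
expected = [n * p for p in probs]
chi2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
# chi2 is about 1.469 < 7.815 = chi-square critical value with 3 df,
# so H_0 is not rejected
```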
# 3 (Ex 9.22)

(a) Note that λ̂ = 0.519. Then p̂_i = e^{−0.519}(0.519)^i / i! and e_i = n p̂_i.

Passengers    n_i     p̂_i       e_i     (n_i − e_i)²/e_i
0             678   0.595    601.66       9.686
1             227   0.309    312.26      23.281
2              56   0.080     81.03       7.733
3              28   0.014     14.02      13.944
4 or more      22   0.002      2.03     196.998
Total        1011             χ² = 251.64

Note that the "5 or more" cell (n = 14, e = 0.21) was combined with the "4" cell, to satisfy the requirement that no cell have e_i < 1 and that no more than 1/5 of the e_i be < 5. Since χ² = 251.64 > 7.815 = χ²_{3, 0.05}, reject H_0 and conclude that the Poisson distribution is not a plausible distribution for the number of passengers.

(b) Since p̂ = 1/(1 + 0.519) = 0.658, we have p̂_i = (1 − p̂)^{i−1} p̂ = (0.342)^{i−1}(0.658) and e_i = n p̂_i.

Occupants     n_i     p̂_i       e_i     (n_i − e_i)²/e_i
1             678   0.658    665.57       0.232
2             227   0.225    227.41       0.001
3              56   0.077     77.70       6.060
4              28   0.026     26.55       0.079
5               8   0.009      9.07       0.126
6 or more      14   0.005      4.71      18.342
Total        1011             χ² = 24.84

Since χ² = 24.84 > 9.488 = χ²_{4, 0.05}, reject H_0 and conclude that the geometric distribution is not a plausible distribution for the number of occupants.
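The Poisson fit in part (a), including the cell pooling, can be sketched in Python (standard library only; counts and λ̂ = 0.519 are from the solution, variable names are my own):

```python
import math

# Ex 9.22(a): Poisson goodness of fit with lambda-hat = 0.519.
# The "4" and "5 or more" cells are pooled because their expected
# counts (about 1.82 and 0.21) are too small.
counts = [678, 227, 56, 28, 8, 14]    # passengers 0, 1, 2, 3, 4, >= 5
n = sum(counts)                       # 1011
lam = 0.519
p = [math.exp(-lam) * lam ** i / math.factorial(i) for i in range(5)]
p.append(1 - sum(p))                  # P(X >= 5), the remaining tail mass
e = [n * pi for pi in p]

obs = counts[:4] + [counts[4] + counts[5]]   # pooled observed cell: 22
exp = e[:4] + [e[4] + e[5]]                  # pooled expected cell: about 2.03
chi2 = sum((o - x) ** 2 / x for o, x in zip(obs, exp))
# chi2 is about 251, far above the critical value 7.815, so H_0 is rejected
```

Replacing the Poisson probabilities with the geometric probabilities (0.342)^{i−1}(0.658) reproduces the part (b) statistic in the same way.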
(c) While neither distribution is plausible for these data, the geometric distribution fits much better, since its χ² value is much smaller. Note also that the lack of fit of the geometric distribution comes primarily from the tail category (6 or more).
# 4 (Ex 10.14, Ex 10.33)

Ex 10.14

First, note that Ȳ = (1/n) Σ_i Y_i and β̂_1 = Σ_i c_i Y_i, where c_i = (x_i − x̄)/S_xx and Σ_i c_i = 0. Then, since the Y_i are independent (Cov(Y_i, Y_j) = 0 for i ≠ j),

Cov(Ȳ, β̂_1) = (1/n) Σ_i Σ_j c_j Cov(Y_i, Y_j)
            = (1/n) Σ_i c_i Var(Y_i)
            = (σ²/n) Σ_i c_i
            = 0.

Since Ȳ and β̂_1 are both normally distributed (as linear functions of normal random variables), zero covariance implies that they are independent.

Ex 10.33

ŷ_i = β̂_0 + β̂_1 x_i = ȳ + β̂_1(x_i − x̄). Then

Σ (y_i − ŷ_i)(ŷ_i − ȳ) = Σ (y_i − ȳ − β̂_1(x_i − x̄)) (ȳ + β̂_1(x_i − x̄) − ȳ)
                        = Σ (y_i − ȳ − β̂_1(x_i − x̄)) β̂_1(x_i − x̄)
                        = β̂_1 Σ (y_i − ȳ)(x_i − x̄) − β̂_1² Σ (x_i − x̄)²
                        = β̂_1 S_xy − β̂_1² S_xx
                        = S_xy²/S_xx − (S_xy²/S_xx²) S_xx
                        = 0.
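The Ex 10.33 identity can be checked numerically. In the sketch below the x and y values are arbitrary made-up numbers (not from the homework), chosen only to show that the cross-product is zero for any data:

```python
# Ex 10.33 checked numerically: the residuals y_i - yhat_i are
# orthogonal to the deviations yhat_i - ybar for any data set.
x = [1.0, 2.0, 3.0, 4.0, 5.0]   # arbitrary illustrative values
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx                                     # least-squares slope
fitted = [ybar + b1 * (xi - xbar) for xi in x]     # yhat_i
cross = sum((yi - fi) * (fi - ybar) for yi, fi in zip(y, fitted))
# cross is zero up to floating-point rounding
```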
# 5 (Coding Assignment: Ex 9.32)

(a) Note that the total for each row or column is not fixed. Thus, this is an example of multinomial sampling (refer to page 322 of the textbook for the explanation).

        Pearson's Chi-squared test

data:  data
X-squared = 138.29, df = 9, p-value < 2.2e-16

# 6 (Coding Assignment: Ex 10.4, Ex 10.11)

(a)

> summary(model)

Call:
lm(formula = NEXT ~ LAST)

Residuals:
     Min       1Q   Median       3Q      Max
-12.2364  -4.2364  -0.6352   5.5327   9.9316

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   31.013      4.417   7.022 1.10e-06 ***
LAST           9.790      1.300   7.531 4.06e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.129 on 19 degrees of freedom
Multiple R-squared:  0.7491, Adjusted R-squared:  0.7359
F-statistic: 56.72 on 1 and 19 DF,  p-value: 4.059e-07
(b) ŷ = 31.013 + 9.79 × 3 = 60.383

(c) The R-squared value is 0.7491 from the regression output. Alternatively, you can run an ANOVA, get SSR and SST, and compute R² = SSR/SST = 2130.60/2844.29 = 0.749.

Analysis of Variance Table

Response: NEXT
          Df  Sum Sq Mean Sq F value    Pr(>F)
LAST       1 2130.60 2130.60  56.721 4.059e-07 ***
Residuals 19  713.69   37.56
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(d) You can read σ̂ = 6.129 from the regression output. MSE appears in the ANOVA table as 37.56. You can also compute it directly: MSE = SSE / (residual degrees of freedom) = 713.69/19 = 37.563, so σ̂ = √37.563 ≈ 6.129.

(e) In the regression output above, note that the p-value of the LAST coefficient corresponds to the t-test of

H_0: β_1 = 0
H_A: β_1 ≠ 0

Since the p-value 4.06e-07 < 0.05 and the coefficient is positive (9.79), we conclude that the time to the next eruption significantly increases as the duration of the last eruption increases.

(f) We can use the cor function in R to find the sample correlation r = 0.865, and the cor.test function to find the confidence interval for the correlation, [0.692, 0.944]. Another way to find a confidence interval for the correlation is to follow the steps on p. 383. Define
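The quantities in parts (a)–(d) can be recomputed from the raw data with the standard least-squares formulas. A Python sketch (data copied from the "Codes Used" section; variable names are my own):

```python
import math

# Ex 10.4: recomputing the summary/ANOVA quantities by hand.
LAST = [2, 1.8, 3.7, 2.2, 2.1, 2.4, 2.6, 2.8, 3.3, 3.5, 3.7,
        3.8, 4.5, 4.7, 4, 4, 1.7, 1.8, 4.9, 4.2, 4.3]
NEXT = [50, 57, 55, 47, 53, 50, 62, 57, 72, 62, 63,
        70, 85, 75, 77, 70, 43, 48, 70, 79, 72]
n = len(LAST)
xbar, ybar = sum(LAST) / n, sum(NEXT) / n
sxx = sum((x - xbar) ** 2 for x in LAST)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(LAST, NEXT))
sst = sum((y - ybar) ** 2 for y in NEXT)    # total sum of squares, ~2844.29
b1 = sxy / sxx                              # slope, ~9.790
b0 = ybar - b1 * xbar                       # intercept, ~31.013
sse = sst - b1 * sxy                        # residual sum of squares, ~713.69
mse = sse / (n - 2)                         # ~37.56
r2 = 1 - sse / sst                          # ~0.7491
sigma_hat = math.sqrt(mse)                  # residual standard error, ~6.129
```

These agree with the R summary and ANOVA output above.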
ψ̂ = (1/2) ln((1 + r)/(1 − r)) = (1/2) ln((1 + 0.865)/(1 − 0.865)) = 1.3129.

Since z = √(n − 3)(ψ̂ − ψ_0) is approximately standard normal, a 95% confidence interval for ψ is

ψ̂ − z_{0.025}/√(n − 3) ≤ ψ ≤ ψ̂ + z_{0.025}/√(n − 3)
1.3129 − 1.96/√18 ≤ ψ ≤ 1.3129 + 1.96/√18
0.851 ≤ ψ ≤ 1.775.

Lastly, we transform back to the correlation scale with ρ = (e^{2ψ} − 1)/(e^{2ψ} + 1):

(e^{2(0.851)} − 1)/(e^{2(0.851)} + 1) ≤ ρ ≤ (e^{2(1.775)} − 1)/(e^{2(1.775)} + 1)
0.692 ≤ ρ ≤ 0.944,

which matches the cor.test interval.

Ex 10.11

(a)

       fit      lwr      upr
1 60.38332 47.23770 73.52893

Prediction interval = [47.238, 73.529]

       fit      lwr      upr
1 60.38332 57.51009 63.25654

Confidence interval = [57.510, 63.257]

Note that the confidence interval is narrower than the prediction interval.

(c)

       fit      lwr      upr
1 40.80318 26.33021 55.27614
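The Fisher z interval can be reproduced in a few lines of Python (r = 0.865 and n = 21 are from the solution; z_{0.025} = 1.96 is the large-sample multiplier, and the variable names are my own):

```python
import math

# Fisher z confidence interval for the correlation in Ex 10.4(f).
r, n = 0.865, 21
psi = 0.5 * math.log((1 + r) / (1 - r))   # about 1.3129
half = 1.96 / math.sqrt(n - 3)            # z_{0.025} / sqrt(n - 3)
lo, hi = psi - half, psi + half           # about (0.851, 1.775)
# tanh inverts the Fisher transform: tanh(psi) = (e^{2 psi} - 1)/(e^{2 psi} + 1)
ci = (math.tanh(lo), math.tanh(hi))       # about (0.692, 0.944)
```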
Prediction interval = [26.330, 55.276]

This prediction interval is not reliable because it extrapolates beyond the domain of the data (the observed LAST values range from 1.7 to 4.9, while the prediction is at LAST = 1).

Codes Used

# Copy the given data table into R
data <- matrix(c(68, 20, 15, 5, 119, 84, 54, 29, 26, 17, 14, 14, 7, 94, 10, 16), ncol = 4)
rownames(data) <- c("Brown", "Blue", "Hazel", "Green")
colnames(data) <- c("Black", "Brown", "Red", "Blond")

# Perform the chi-square test
chisq.test(data)
# Note that you don't need to compute the expected value for each cell;
# R does all the work for you.

# Copy the data into R
LAST <- c(2, 1.8, 3.7, 2.2, 2.1, 2.4, 2.6, 2.8, 3.3, 3.5, 3.7, 3.8, 4.5, 4.7, 4, 4, 1.7, 1.8, 4.9, 4.2, 4.3)
NEXT <- c(50, 57, 55, 47, 53, 50, 62, 57, 72, 62, 63, 70, 85, 75, 77, 70, 43, 48, 70, 79, 72)

plot(LAST, NEXT, main = "Scatter Plot of NEXT vs LAST")
model <- lm(NEXT ~ LAST)
abline(model)
summary(model)

# ANOVA table
anova(model)

# Correlation between LAST and NEXT, with a confidence interval
cor(LAST, NEXT)
cor.test(LAST, NEXT)

# Prediction interval at LAST = 3
newdata <- data.frame(LAST = 3)
predict(lm(NEXT ~ LAST), newdata, interval = "prediction", level = 0.95)

# Confidence interval at LAST = 3
predict(lm(NEXT ~ LAST), newdata, interval = "confidence", level = 0.95)

# Prediction interval at LAST = 1
newdata <- data.frame(LAST = 1)
predict(lm(NEXT ~ LAST), newdata, interval = "prediction", level = 0.95)