STAT 526 Spring Midterm 1. Wednesday February 2, 2011



STAT 526 Spring 2011
Midterm 1
Wednesday February 2, 2011
Time: 2 hours

Name (please print):

Show all your work and calculations. Partial credit will be given for work that is partially correct. Points will be deducted for false statements, even if the final answer is correct. Please circle your final answer where appropriate.

This exam is closed-book. You may consult two pages of hand-written notes. Calculators are permitted.

Honor code: I promise not to cheat on this exam. I will neither give nor receive any unauthorized assistance. I will not share information about the exam with anyone who may be taking it at a different time. I have not been told anything about the exam by someone who has taken it earlier.

Signature:                    Date:

Question    Possible Points    Actual Points

1. Researchers study the performance of nurse practitioners in three specialities (pediatrics, obstetrics and diabetes). They randomly selected 3 cities, and recorded competency scores of 4 nurses randomly selected within each speciality and each city. The scores are on a continuous scale, and the values are summarized below.

             City 1   City 2   City 3   Mean
Diabetes
Obstetrics
Pediatrics

(a) (6 pts) State the ANOVA model that is appropriate for these data, and the assumptions.

$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$, where

$y_{ijk}$ is the score from speciality $i = 1, \ldots, 3$, city $j = 1, \ldots, 3$ and replicate $k = 1, \ldots, 4$
$\mu$ is the overall expected value
$\alpha_i$ is the deviation of the expected score of speciality $i$ from the overall mean, $\sum_{i=1}^{3} \alpha_i = 0$
$\beta_j$ is the deviation of the expected score of city $j$ from the overall mean, $\beta_j \overset{iid}{\sim} N(0, \sigma^2_\beta)$
$(\alpha\beta)_{ij}$ is the non-additive deviation of speciality $i$ and city $j$, $(\alpha\beta)_{ij} \overset{iid}{\sim} N(0, \sigma^2_{\alpha\beta})$
$\varepsilon_{ijk}$ is the random error, $\varepsilon_{ijk} \overset{iid}{\sim} N(0, \sigma^2)$
$\beta_j$, $(\alpha\beta)_{ij}$, $\varepsilon_{ijk}$ are independent

(b) (6 pts) Provide the estimates of the fixed effects of the model in the zero-sum model parametrization.

$\hat\alpha_1 = \bar{y}_{1\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot}$
$\hat\alpha_2 = \bar{y}_{2\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot}$
$\hat\alpha_3 = \bar{y}_{3\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot}$
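The zero-sum estimates in (b) can be sketched in R as follows. The speciality means used here are hypothetical placeholders, not the exam data, and the names `spec_means` and `alpha_zero_sum` are introduced only for illustration. The sketch assumes a balanced design, so the grand mean is the plain average of the speciality means.

```r
# Hypothetical speciality means (NOT the exam data).
spec_means <- c(diabetes = 80, obstetrics = 75, pediatrics = 85)

# Balanced design: the grand mean is the average of the speciality means.
grand_mean <- mean(spec_means)

# Zero-sum parametrization: each effect is the deviation of the
# speciality mean from the grand mean, so the effects sum to zero.
alpha_zero_sum <- spec_means - grand_mean

print(alpha_zero_sum)
print(sum(alpha_zero_sum))
```

By construction the printed effects sum to zero, which is the constraint stated in (a).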

(c) (6 pts) Provide the estimates of the fixed effects of the model in the baseline model parametrization.

$\hat\alpha_1 = 0$
$\hat\alpha_2 = \bar{y}_{2\cdot\cdot} - \bar{y}_{1\cdot\cdot}$
$\hat\alpha_3 = \bar{y}_{3\cdot\cdot} - \bar{y}_{1\cdot\cdot}$

(d) (6 pts) Based on the R output below, estimate and interpret the variance components of the model.

> aov(score ~ spec*city, data=X)
Call:
   aov(formula = score ~ spec * city, data = X)

                 spec   city   spec:city   Residuals
Sum of Squares

The ANOVA table is

Source   df   MS   EMS
A                  $nb \frac{\sum_i \alpha_i^2}{a-1} + n\sigma^2_{\alpha\beta} + \sigma^2$
B                  $na\,\sigma^2_\beta + n\sigma^2_{\alpha\beta} + \sigma^2$
AB                 $n\sigma^2_{\alpha\beta} + \sigma^2$
Error              $\sigma^2$

Therefore

$\hat\sigma^2 = MSE$
$\hat\sigma^2_{\alpha\beta} = \frac{MS(AB) - MSE}{n}$, with $n = 4$
$\hat\sigma^2_\beta = \frac{MS(B) - MS(AB)}{na}$, with $na = 4 \cdot 3$

The estimate $\hat\sigma^2_{\alpha\beta}$ is negative, therefore we set it to zero. The estimate $\hat\sigma^2_\beta$ is much smaller than the MSE. Therefore the between-city variation does not contribute substantially to the overall variation.
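The method-of-moments step in (d) can be sketched as a small R function. The function name `variance_components` and the mean squares passed to it are hypothetical (the exam's numeric output is not reproduced here); only the formulas follow the expected mean squares in the ANOVA table.

```r
# Method-of-moments estimates of the variance components in the
# two-way mixed model, with negative estimates truncated at zero.
variance_components <- function(ms_b, ms_ab, mse, n, a) {
  sigma2    <- mse
  sigma2_ab <- (ms_ab - mse) / n          # from EMS(AB) - EMS(Error)
  sigma2_b  <- (ms_b - ms_ab) / (n * a)   # from EMS(B) - EMS(AB)
  c(sigma2    = sigma2,
    sigma2_ab = max(sigma2_ab, 0),
    sigma2_b  = max(sigma2_b, 0))
}

# Hypothetical mean squares, chosen so that the interaction estimate
# comes out negative before truncation.
vc <- variance_components(ms_b = 30, ms_ab = 8, mse = 10, n = 4, a = 3)
print(vc)
```

Truncating at zero is the convention used in the solution; other estimation methods (for example REML) handle the boundary differently.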

(e) (6 pts) The researchers decided to exclude city (both main effects and interactions) from the model. Use the new model to test whether there is a difference between the specialities. Use a confidence level of 95%.

The model is

$y_{ijk} = \mu + \alpha_i + \varepsilon_{ijk}$

The ANOVA table is

Source   df                MS
A        2
Error    2 + 4 + 27 = 33   (SS(city) + SS(spec:city) + SS(Residuals)) / 33

We test $H_0: \alpha_i = 0$ for all $i$, against $H_a: \alpha_i \neq 0$ for some $i$.

$F = \frac{MS(A)}{MS(E)} > F_{2,33}(1 - 0.05) \approx 3.28$

We reject $H_0$, and conclude that there is a difference between the specialities.

(f) (6 pts) Use the new model to provide the 95% CI for the difference of the expected scores of pediatrics and diabetes.

$(\bar{y}_{\text{pediatrics}} - \bar{y}_{\text{diabetes}}) \pm t_{1-0.05/2,\,33} \sqrt{MS(E) \left( \frac{1}{12} + \frac{1}{12} \right)}$

where each speciality mean is based on $3 \times 4 = 12$ scores.
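A minimal R sketch of the F test and the confidence interval in (e) and (f). The mean squares and the difference of means are hypothetical placeholders; the degrees of freedom and group sizes follow the design (3 specialities, 12 scores each).

```r
a <- 3              # number of specialities
n_per_group <- 12   # 3 cities x 4 nurses per speciality
df_a <- a - 1
df_e <- a * n_per_group - a   # 36 - 3 = 33

# Hypothetical mean squares (NOT the exam values).
ms_a <- 120
ms_e <- 15

# F test of H0: all alpha_i = 0.
f_stat <- ms_a / ms_e
f_crit <- qf(0.95, df_a, df_e)
reject <- f_stat > f_crit

# 95% CI for the difference of two speciality means.
diff_means <- 5     # hypothetical difference of sample means
half_width <- qt(0.975, df_e) * sqrt(ms_e * (1 / n_per_group + 1 / n_per_group))
ci <- diff_means + c(-1, 1) * half_width

print(c(f_stat = f_stat, f_crit = f_crit))
print(ci)
```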

2. (6 pts) In genetics, when a gene has two different alleles A and a, each individual in a population must have one of three possible genotypes: AA, Aa and aa. If the alleles are passed independently from the two parents, and every parent has the same probability $\theta$ of passing the first allele to each offspring, then the probability distribution of the three genotypes is

Genotype      AA                   Aa                              aa
Probability   $\pi_1 = \theta^2$   $\pi_2 = 2\theta(1-\theta)$     $\pi_3 = (1-\theta)^2$

where $0 < \theta < 1$ and $\sum_{i=1}^{3} \pi_i = 1$. A random sample of $n = 100$ individuals is taken from this population, resulting in the following counts of individuals with each genotype:

Genotype          AA           Aa           aa          Total
Observed counts   $n_1 = 70$   $n_2 = 25$   $n_3 = 5$   100

Conduct a deviance goodness-of-fit test to determine whether the hypothesized form of $\pi_1$, $\pi_2$ and $\pi_3$ is appropriate for these data. State the null and the alternative hypotheses, the test statistic, and your conclusion at the confidence level of 95%.

The likelihood and the log-likelihood are

$L(\theta) = C_1 [\theta^2]^{70} [2\theta(1-\theta)]^{25} [(1-\theta)^2]^{5}$
$l(\theta) = C_2 + 140 \log(\theta) + 25 \log(\theta) + 25 \log(1-\theta) + 10 \log(1-\theta)$

The derivative is

$u(\theta) = \frac{\partial l(\theta)}{\partial \theta} = \frac{140}{\theta} + \frac{25}{\theta} - \frac{25}{1-\theta} - \frac{10}{1-\theta}$

Solving $u(\theta) = 0$:

$\frac{165 - 200\theta}{\theta(1-\theta)} = 0, \quad \hat\theta = \frac{165}{200} = 0.825$

Testing $H_0$: $\pi_1, \pi_2, \pi_3$ are as specified vs $H_a$: $\pi_1, \pi_2, \pi_3$ unrelated, $\sum_{i=1}^{3} \pi_i = 1$, using the deviance test:

$G^2 = 2 \sum_{i=1}^{3} n_i \log(n_i / \hat\mu_i) = 2 \left[ 70 \log \frac{70}{100\,\hat\theta^2} + 25 \log \frac{25}{100 \cdot 2\hat\theta(1-\hat\theta)} + 5 \log \frac{5}{100\,(1-\hat\theta)^2} \right] \approx 1.63 < \chi^2_{2-1}(1 - 0.05) = 3.84$

Therefore we fail to reject $H_0$, and conclude that the specified model has a good fit.
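The calculation in Question 2 can be checked with a short R sketch. The counts come from the problem statement (with $n_3 = 5$ implied by the likelihood and the total of 100); the variable names are introduced here for illustration, and the closed-form maximum likelihood estimate is the root of the score equation above.

```r
# Observed genotype counts from the problem statement.
n <- c(AA = 70, Aa = 25, aa = 5)
N <- sum(n)

# MLE of theta: root of u(theta) = 0, i.e. (2*n1 + n2) / (2*N).
theta_hat <- (2 * n[["AA"]] + n[["Aa"]]) / (2 * N)

# Expected counts under the hypothesized model.
mu_hat <- N * c(theta_hat^2,
                2 * theta_hat * (1 - theta_hat),
                (1 - theta_hat)^2)

# Deviance statistic and its reference distribution:
# (3 - 1) free cell probabilities minus 1 estimated parameter = 1 df.
G2 <- 2 * sum(n * log(n / mu_hat))
df <- length(n) - 1 - 1
crit <- qchisq(0.95, df)
p_value <- pchisq(G2, df, lower.tail = FALSE)

print(c(theta_hat = theta_hat, G2 = G2, crit = crit, p_value = p_value))
```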

3. Investigators would like to establish whether a genetic fingerprint technique (called polymerase chain reaction, PCR) can be used as a tool for diagnostics of relapse status of acute lymphoblastic leukemia. PCR was performed on the bone marrow of 178 children who were currently in remission. Results of the study are tabulated as follows:

                    Relapse status
PCR status          Yes    No     Total
Traces of cancer    30     45     75
Cancer free         8      95     103
Total               38     140    178

(a) (6 pts) Test whether the probability of relapse is different among children with and without traces of cancer, by comparing proportions. State the null and the alternative hypotheses, the non-pooled test statistic, and your conclusion at the confidence level of 95%.

Denote $\pi_1 = P\{\text{Relapse} \mid \text{Traces of cancer}\}$ and $\pi_2 = P\{\text{Relapse} \mid \text{Cancer free}\}$. We test $H_0: \pi_1 - \pi_2 = 0$ vs $H_a: \pi_1 - \pi_2 \neq 0$.

$\hat\pi_1 = 30/75 = 0.4, \quad \hat\pi_2 = 8/103 \approx 0.078$

The test statistic is

$T = \frac{\hat\pi_1 - \hat\pi_2}{\sqrt{\frac{\hat\pi_1(1-\hat\pi_1)}{75} + \frac{\hat\pi_2(1-\hat\pi_2)}{103}}} \approx 5.16 > z_{1-0.05/2} = 1.96$

We reject $H_0$ and conclude that the relapse rate is significantly different for the two outcomes of PCR.

(b) (6 pts) Estimate the odds ratio of relapse and its 95% confidence interval, and interpret the result.

The odds ratio $\theta$, and the estimated SE of $\log(\hat\theta)$ are

$\hat\theta = \frac{n_{11} n_{22}}{n_{12} n_{21}} = \frac{30 \cdot 95}{45 \cdot 8} \approx 7.92$

$s(\log(\hat\theta)) = \left[ \frac{1}{30} + \frac{1}{45} + \frac{1}{8} + \frac{1}{95} \right]^{1/2} \approx 0.437$

The 95% CI for $\log(\theta)$ is $\log(\hat\theta) \pm z_{1-0.05/2}\, s(\log(\hat\theta))$, i.e. approximately $(1.21, 2.93)$. On the scale of the odds ratio, the CI is approximately $(e^{1.21}, e^{2.93}) = (3.36, 18.65)$. The CI does not contain 1, i.e. at the confidence level of 95%, the odds of relapse are significantly higher for patients with traces of cancer.
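The two calculations in (a) and (b) can be reproduced with a short R sketch. The cell counts are those implied by the solution (30/75, 8/103, 178 children in total); the object names are introduced here for illustration.

```r
# 2x2 table implied by 30/75, 8/103 and the total of 178.
tab <- matrix(c(30, 45,
                 8, 95),
              nrow = 2, byrow = TRUE,
              dimnames = list(PCR = c("Traces", "Free"),
                              Relapse = c("Yes", "No")))

# (a) Non-pooled test of H0: pi_1 - pi_2 = 0.
row_tot <- rowSums(tab)
p <- tab[, "Yes"] / row_tot
se_diff <- sqrt(sum(p * (1 - p) / row_tot))
z_stat <- (p[["Traces"]] - p[["Free"]]) / se_diff

# (b) Odds ratio and Wald CI on the log scale.
or_hat <- tab[1, 1] * tab[2, 2] / (tab[1, 2] * tab[2, 1])
se_log_or <- sqrt(sum(1 / tab))
ci_or <- exp(log(or_hat) + c(-1, 1) * qnorm(0.975) * se_log_or)

print(c(z_stat = z_stat, or_hat = or_hat))
print(ci_or)
```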

(c) (6 pts) The Pearson standardized residuals are given in the table below. What are your conclusions from this table?

$r_{ij}$            Relapse status
PCR status          Yes    No
Traces of cancer
Cancer free

The residuals are larger in absolute value than $z_{1-0.05/2} = 1.96$. Therefore the cells show a greater discrepancy between the observed cell counts and the cell counts predicted under independence than would be expected by chance. This indicates that the hypothesis of independence is not appropriate.

(d) (6 pts) Estimate the sensitivity and the specificity of the PCR test.

Sensitivity $= P\{PCR = \text{Yes} \mid \text{Relapse} = \text{Yes}\} = 30/38 \approx 0.789$

Specificity $= P\{PCR = \text{No} \mid \text{Relapse} = \text{No}\} = 95/140 \approx 0.679$
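A sketch of (c) and (d) in R, assuming the cell counts implied by part (a) (30/75, 8/103, 178 children in total). The standardized residuals are taken from `chisq.test()`, which returns them in the `stdres` component; "PCR = Yes" is read here as "traces of cancer".

```r
tab <- matrix(c(30, 45,
                 8, 95),
              nrow = 2, byrow = TRUE,
              dimnames = list(PCR = c("Traces", "Free"),
                              Relapse = c("Yes", "No")))

# (c) Pearson standardized residuals under independence.
indep_test <- chisq.test(tab, correct = FALSE)
std_res <- indep_test$stdres
print(std_res)

# (d) Sensitivity and specificity of PCR for relapse.
sensitivity <- tab["Traces", "Yes"] / sum(tab[, "Yes"])
specificity <- tab["Free", "No"] / sum(tab[, "No"])
print(c(sensitivity = sensitivity, specificity = specificity))
```

In a 2x2 table all four standardized residuals share the same absolute value, the square root of the Pearson chi-square statistic, so they lead to the same conclusion.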

4. Researchers conduct a retrospective case-control study of lung cancer, comparing the smoking habits (# of cigarettes/day) of individuals with and without the disease. The data are summarized as follows:

# cigarettes per day    # cases    # controls    Total

Total

The R output at the end of the exam presents the results of three models fit to these data.

(a) (6 pts) Consider Model 1. State the model and the assumptions.

Denote $X$ the number of cigarettes, and $Y$ the disease status. Then

$Y_i \sim \text{Binomial}(\pi_i)$, where $\log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 I_{\{X = 1\text{-}14\}} + \beta_2 I_{\{X = 15\text{-}24\}} + \beta_3 I_{\{X = 25\text{-}49\}} + \beta_4 I_{\{X = 50+\}}$

(b) (6 pts) Based on the output of Model 1, obtain the estimated odds ratio of lung cancer of subjects who smoke more than 50 cigarettes a day, relative to those who smoke 1-14 cigarettes a day.

$\log(OR) = \log \frac{P\{Y = 1 \mid X = 50+\} / P\{Y = 0 \mid X = 50+\}}{P\{Y = 1 \mid X = 1\text{-}14\} / P\{Y = 0 \mid X = 1\text{-}14\}} = (\beta_0 + \beta_4) - (\beta_0 + \beta_1) = \beta_4 - \beta_1$

$\log(\widehat{OR}) = \hat\beta_4 - \hat\beta_1 = 1.4033$

Therefore $\widehat{OR} = \exp(\hat\beta_4 - \hat\beta_1) = \exp(1.4033) \approx 4.07$

(c) (6 pts) Based on the output of Model 1, obtain a 95% CI for the odds ratio above.

On the $\log(OR)$ scale:

$Var\{\log(\widehat{OR})\} = Var\{\hat\beta_4 - \hat\beta_1\} = Var\{\hat\beta_4\} + Var\{\hat\beta_1\} - 2\,Cov\{\hat\beta_4, \hat\beta_1\}$

The CI for $\log(OR)$ is

$1.4033 \pm z_{1-0.05/2} \sqrt{Var\{\log(\widehat{OR})\}}$

The CI for the OR is obtained by exponentiating the two endpoints.

(d) (6 pts) Consider Model 2. State the model and the assumption. State whether you prefer Model 1 or Model 2, and why.

Denote $X$ the score indicating the number of cigarettes per day. Then

$Y_i \sim \text{Binomial}(\pi_i)$, where $\log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 X_i$

Based on AIC, Model 1 works best. However, Model 1 is a saturated model which is likely to overfit the data, and it does not account for the ordinal nature of the predictor. Therefore we prefer Model 2. The residual deviance of Model 2 is much larger than its 3 degrees of freedom, indicating either an insufficient quality of fit for the expected value, or the presence of overdispersion.

(e) (6 pts) Consider Model 3. State the model and the assumption. State whether you prefer Model 2 or Model 3, and why.

$Y_i \sim \text{Quasibinomial}(\pi_i)$, where $E\{Y_i\} = \pi_i$, $Var\{Y_i\} = \sigma^2 \pi_i (1 - \pi_i)$ and

$\log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 X_i$

The model accounts for the insufficient quality of fit through overdispersion. If we want to specify a linear relationship between $\text{logit}(\pi)$ and $X$, then Model 3 is preferred.
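The variance formula in (c) can be wrapped in a small R helper. This is a sketch only: the function `contrast_or_ci`, the coefficient values and the covariance matrix below are hypothetical and are not the Model 1 output. With a fitted `glm` object, `coef(fit)` and `vcov(fit)` would supply the two inputs.

```r
# Odds ratio and Wald CI for the contrast beta_i - beta_j.
contrast_or_ci <- function(beta, V, i, j, level = 0.95) {
  log_or <- beta[[i]] - beta[[j]]
  # Var(b_i - b_j) = Var(b_i) + Var(b_j) - 2 Cov(b_i, b_j)
  var_log_or <- V[i, i] + V[j, j] - 2 * V[i, j]
  z <- qnorm(1 - (1 - level) / 2)
  ci_log <- log_or + c(-1, 1) * z * sqrt(var_log_or)
  list(or = exp(log_or), ci = exp(ci_log))
}

# Hypothetical estimates and covariance matrix (NOT the exam output).
beta <- c(b1 = 1.0, b4 = 2.4)
V <- matrix(c(0.05, 0.03,
              0.03, 0.10),
            nrow = 2, byrow = TRUE,
            dimnames = list(names(beta), names(beta)))

res <- contrast_or_ci(beta, V, "b4", "b1")
print(res)
```

The covariance term matters here because both coefficients are measured against the same baseline group, so their estimates are positively correlated.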

5. Consider a generalized linear model for the expected value of a binomial response $Y$, as a function of the predictor variable $X$. For each question below, circle TRUE or FALSE, and provide the rationale.

(a) (6 pts) Suppose that we would like to use the identity link function. Then the least squares estimates of model parameters will be identical to the maximum likelihood estimates.

TRUE FALSE

False. The least squares parameter estimates are identical to the maximum likelihood estimates when $Y$ follows the Normal distribution. However, in this case the distribution of $Y$ is Binomial, and the likelihood and the resulting parameter estimates will differ.

(b) (6 pts) Suppose that we would like to model the data from a retrospective case-control type study design. The logistic link is the only link function that yields the same estimate and the same interpretation of the parameter associated with $X$ as in the prospective study.

TRUE FALSE

True. The logistic link function is the only function that allows us to interpret the parameters as log odds ratios, and the odds ratio is the same for both prospective and retrospective designs.
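The invariance of the odds ratio used in 5(b) can be illustrated numerically. The 2x2 table below is hypothetical; the sketch computes the odds ratio once by conditioning on exposure (prospective view) and once by conditioning on disease status (retrospective view), and contrasts it with a ratio of proportions, which does change.

```r
# Hypothetical table: rows = exposure, columns = disease status.
tab <- matrix(c(40, 60,
                10, 90),
              nrow = 2, byrow = TRUE,
              dimnames = list(X = c("exposed", "unexposed"),
                              Y = c("case", "control")))

odds <- function(p) p / (1 - p)

# Prospective view: P(case | exposure).
p_case <- tab[, "case"] / rowSums(tab)
or_prospective <- odds(p_case[["exposed"]]) / odds(p_case[["unexposed"]])

# Retrospective view: P(exposed | disease status).
p_exp <- tab["exposed", ] / colSums(tab)
or_retrospective <- odds(p_exp[["case"]]) / odds(p_exp[["control"]])

# Ratios of proportions differ between the two views.
rr_prospective <- p_case[["exposed"]] / p_case[["unexposed"]]
rr_retrospective <- p_exp[["case"]] / p_exp[["control"]]

print(c(or_prospective = or_prospective, or_retrospective = or_retrospective))
print(c(rr_prospective = rr_prospective, rr_retrospective = rr_retrospective))
```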

(c) (6 pts) Suppose that we have ungrouped data, and would like to evaluate the quality of model fit using the deviance test. Under the null hypothesis that the model of interest holds, the deviance test statistic approaches a $\chi^2$ distribution as the sample size increases.

TRUE FALSE

False. The number of parameters under the alternative model grows with the sample size, and therefore the asymptotic theory does not hold.

(d) (6 pts) The value reported for the deviance of the model depends on whether the data are entered as grouped (i.e. reporting the number of successes and the number of failures for each value of $X$), or as individual Bernoulli observations. However, the difference between the deviances of two unsaturated models does not depend on the form of the data entry.

TRUE FALSE

True. The log-likelihood of unsaturated models does not depend on the form of the data. The log-likelihood of the saturated model depends on the form of the data, however it cancels out when we compare two unsaturated models.
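The statement in 5(d) can be checked on a small example. The data below are hypothetical; the sketch fits the same two nested logistic models to the grouped and to the ungrouped form of the data and compares the deviances.

```r
# Hypothetical grouped data: successes and failures at each value of x.
grouped <- data.frame(x = c(0, 1, 2, 3),
                      successes = c(2, 4, 6, 9),
                      failures  = c(8, 6, 4, 1))

# The same data as individual Bernoulli observations:
# all successes first (y = 1), then all failures (y = 0).
ungrouped <- data.frame(
  x = rep(rep(grouped$x, 2), c(grouped$successes, grouped$failures)),
  y = rep(c(1, 0), c(sum(grouped$successes), sum(grouped$failures)))
)

g0 <- glm(cbind(successes, failures) ~ 1, family = binomial(), data = grouped)
g1 <- glm(cbind(successes, failures) ~ x, family = binomial(), data = grouped)
u0 <- glm(y ~ 1, family = binomial(), data = ungrouped)
u1 <- glm(y ~ x, family = binomial(), data = ungrouped)

# The deviances differ between the two forms of data entry ...
print(c(grouped = deviance(g1), ungrouped = deviance(u1)))

# ... but the difference between the nested models does not.
diff_grouped   <- deviance(g0) - deviance(g1)
diff_ungrouped <- deviance(u0) - deviance(u1)
print(c(diff_grouped = diff_grouped, diff_ungrouped = diff_ungrouped))
```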

Problem 4

Model 1

> lc
  cigarettes score cases controls
> lc$cigarettesf <- factor(lc$cigarettes, levels=levels(lc$cigarettes))
> lc.fit1 <- glm(cbind(cases,controls) ~ cigarettesf, data=lc, family=binomial())
> summary(lc.fit1)

Call:
glm(formula = cbind(cases, controls) ~ cigarettesf, family = binomial(),
    data = lc)

Deviance Residuals:
[1]

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)                                       e-08 ***
cigarettesf1-14                                   e-06 ***
cigarettesf15-24                                  e-08 ***
cigarettesf25-49                                  e-12 ***
cigarettesf50+                                    e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:  e+02  on 4  degrees of freedom
Residual deviance:  e-15  on 0  degrees of freedom
AIC:

> summary(lc.fit1)$cov.unscaled
                  (Intercept) cigarettesf1-14 cigarettesf15-24
(Intercept)
cigarettesf1-14
cigarettesf15-24
cigarettesf25-49
cigarettesf50+
                  cigarettesf25-49 cigarettesf50+
(Intercept)
cigarettesf1-14
cigarettesf15-24
cigarettesf25-49
cigarettesf50+
> predict(lc.fit1, type="link")
> predict(lc.fit1, type="response")

Model 2

> fit2 <- glm(cbind(cases,controls) ~ score, data=lc, family=binomial())
> summary(fit2)

Call:
glm(formula = cbind(cases, controls) ~ score, family = binomial(),
    data = lc)

Deviance Residuals:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                               <2e-16 ***
score                                     <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:    on 4  degrees of freedom
Residual deviance:    on 3  degrees of freedom
AIC:

Model 3

> fit3 <- glm(cbind(cases,controls) ~ score, data=lc, family=quasibinomial())
> summary(fit3)

Call:
glm(formula = cbind(cases, controls) ~ score, family = quasibinomial(),
    data = lc)

Deviance Residuals:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                      *
score                                            *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be )

    Null deviance:    on 4  degrees of freedom
Residual deviance:    on 3  degrees of freedom
AIC: NA
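The relation between Model 2 and Model 3 can be illustrated on hypothetical data (the case and control counts below are invented for the sketch and are not the study data). The binomial and quasibinomial fits give the same coefficient estimates; the quasibinomial standard errors are the binomial ones multiplied by the square root of the estimated dispersion, which is the Pearson chi-square statistic divided by the residual degrees of freedom.

```r
# Hypothetical grouped case-control counts by smoking score.
dat <- data.frame(score    = 0:4,
                  cases    = c(5, 20, 25, 60, 50),
                  controls = c(95, 80, 75, 40, 50))

fit_b <- glm(cbind(cases, controls) ~ score, family = binomial(), data = dat)
fit_q <- glm(cbind(cases, controls) ~ score, family = quasibinomial(), data = dat)

# Dispersion estimate: Pearson chi-square / residual df.
phi <- sum(residuals(fit_b, type = "pearson")^2) / df.residual(fit_b)

se_b <- summary(fit_b)$coefficients[, "Std. Error"]
se_q <- summary(fit_q)$coefficients[, "Std. Error"]

print(rbind(binomial = coef(fit_b), quasibinomial = coef(fit_q)))
print(c(phi = phi, reported = summary(fit_q)$dispersion))
print(rbind(se_binomial_scaled = se_b * sqrt(phi), se_quasibinomial = se_q))
```

This also shows why the significance stars weaken from Model 2 to Model 3: the estimates are unchanged, but the standard errors are inflated and the reference distribution switches from z to t.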
