STOCKHOLMS UNIVERSITET, MATEMATISKA INSTITUTIONEN
Avd. Matematisk statistik, Frank Miller
MT 5006, SOLUTIONS, March 21, 2013

Solutions for Examination Categorical Data Analysis, March 21, 2013

Problem 1

a. Fisher's exact test uses the distribution of $n_{11}$ under $H_0$, conditional on the observed marginal frequencies:

$$P_{H_0}(n_{11} = t \mid n_{1+}, n_{2+}, n_{+1}, n_{+2}) = \frac{\binom{n_{1+}}{t}\binom{n - n_{1+}}{n_{+1} - t}}{\binom{n}{n_{+1}}}.$$

b. One-sided p-value for Fisher's exact test:

$$P(n_{11} \le t_o \mid n_{i+}, n_{+j}) = \sum_{t=0}^{4} P(n_{11} = t \mid n_{i+}, n_{+j}) = 0 + 0 + 0.001 + 0.004 + 0.019 = 0.024,$$

where $t_o = 4$ is the observed value of $n_{11}$. The one-sided p-value is below 5% (even below 2.5%), and therefore the study shows a hypertension-preventing effect of aspirin compared to placebo.

c. The two-sided p-value can be defined as

$$P(n_{11} \le t_o \mid n_{i+}, n_{+j}) + P(n_{11} \ge t^{*} \mid n_{i+}, n_{+j})$$

with $t^{*} = \min\{t : P(n_{11} = s \mid n_{i+}, n_{+j}) \le P(n_{11} = t_o \mid n_{i+}, n_{+j}) \text{ for all } s \ge t\}$. As the hypergeometric distribution is unimodal, this definition is equal to

$$\sum_{t \in T} P(n_{11} = t \mid n_{i+}, n_{+j}) \quad\text{with}\quad T = \{s : P(n_{11} = s \mid n_{i+}, n_{+j}) \le P(n_{11} = t_o \mid n_{i+}, n_{+j})\}.$$

For the dataset, the two-sided p-value is

$$P(n_{11} \le 4 \mid n_{i+}, n_{+j}) + P(n_{11} \ge 12 \mid n_{i+}, n_{+j}) = 0.024 + 0.012 + 0.002 = 0.038.$$

The null hypothesis is rejected by the two-sided test at the 5% level, showing that aspirin has some effect compared to placebo.

d. Mid p-value:

$$P(n_{11} < t_o \mid n_{i+}, n_{+j}) + \tfrac{1}{2} P(n_{11} = t_o \mid n_{i+}, n_{+j}) = 0.005 + 0.019/2 = 0.0145.$$

Advantage: The Type I error of a test based on the mid p-value is closer to the desired level $\alpha$.
Disadvantage: The Type I error might in some cases be slightly above $\alpha$ (i.e. the test is anti-conservative).
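The p-values in Problem 1 can be checked numerically. The following is a minimal sketch, not part of the original solution; it assumes that the table is the one whose cell counts (4, 11, 30, 20) are quoted in Problem 2, so that $n_{1+} = 15$, $n_{+1} = 34$ and $n = 65$. The names `pmf`, `one_sided`, `two_sided` and `mid_p` are chosen here for illustration only.

```python
from math import comb

# Cell counts n11, n12, n21, n22 (assumption: same table as in Problem 2).
n11, n12, n21, n22 = 4, 11, 30, 20
row1 = n11 + n12               # n_{1+}
col1 = n11 + n21               # n_{+1}
n = n11 + n12 + n21 + n22      # total sample size


def pmf(t):
    """Hypergeometric probability P(n11 = t | margins) under H0."""
    return comb(row1, t) * comb(n - row1, col1 - t) / comb(n, col1)


# All values of n11 that are compatible with the fixed margins.
support = range(max(0, row1 + col1 - n), min(row1, col1) + 1)

p_obs = pmf(n11)

# One-sided p-value: lower tail up to and including the observed value.
one_sided = sum(pmf(t) for t in support if t <= n11)

# Two-sided p-value: all outcomes at most as likely as the observed one
# (small relative tolerance guards against floating-point ties).
two_sided = sum(pmf(t) for t in support if pmf(t) <= p_obs * (1 + 1e-9))

# Mid p-value: only half of the observed outcome's probability is counted.
mid_p = one_sided - 0.5 * p_obs

print(round(one_sided, 3), round(two_sided, 3), round(mid_p, 4))
```

Up to rounding, the one-sided and two-sided values should agree with the 0.024 and 0.038 given above. The mid p-value computed this way can differ slightly in the fourth decimal from 0.0145, because the solution adds terms that were already rounded.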
Problem 2

a. Estimate for the odds ratio:

$$\hat\theta = \frac{n_{11} n_{22}}{n_{12} n_{21}} = \frac{4 \cdot 20}{11 \cdot 30} = \frac{8}{33} = 0.242.$$

The odds of getting hypertension after treatment with aspirin were 0.242 times the odds of getting hypertension after placebo treatment.

b. $\mathrm{SE} = \sqrt{\hat V(\log\hat\theta)} = \sqrt{1/n_{11} + 1/n_{12} + 1/n_{21} + 1/n_{22}} = \sqrt{1/4 + 1/11 + 1/30 + 1/20} = 0.651$.
$\log\hat\theta = \log 0.242 = -1.417$.
95% Wald CI for $\log\theta$: $-1.417 \pm 1.96 \cdot 0.651$; CI $= [-2.693, -0.141]$.
95% Wald CI for $\theta$: $[e^{-2.693}, e^{-0.141}] = [0.068, 0.868]$.

c. For example, likelihood ratio test:

$$G^2 = 2 \sum_{i=1}^{2} \sum_{j=1}^{2} n_{ij} \log\frac{n_{ij}}{\hat\mu_{ij}},$$

where $\hat\mu_{ij} = n_{i+} n_{+j}/n$ is the ML estimate in the independence model.

$$G^2 = 2\,\bigl(4 \log(4/7.85) + 11 \log(11/7.15) + 30 \log(30/26.15) + 20 \log(20/23.85)\bigr) = 5.28.$$

The full model has 4 parameters, the independence model 3 parameters. Therefore the asymptotic distribution of $G^2$ is the $\chi^2$-distribution with 1 ($= 4 - 3$) degree of freedom. The null hypothesis of independence can be rejected at the 5% level by the two-sided test, since $5.28 > 3.84$.

The problem should state one-sided instead of two-sided. For the one-sided test at the 5% level, one has to compare $G^2 = 5.28$ with $\chi^2_{1;0.90} = 2.71$ and check the direction of the effect. Since $\hat\theta = 0.242 < 1$ and $G^2 > \chi^2_{1;0.90}$, one can conclude that the one-sided test of independence versus the alternative of a hypertension-preventing effect of aspirin rejects $H_0$.

Shorter alternative instead of the LR test: Wald test (here: one-sided):

$$\frac{\log\hat\theta}{\mathrm{SE}} = \frac{-1.42}{0.651} = -2.18 < -1.64.$$

Therefore the Wald test at the 5% level rejects the null hypothesis.

Problem 3

a. $P(Y_i = y_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}$, $y_i \in \{0, 1\}$, and $p_i = \alpha + \beta x_i$, $i = 1, \dots, n$. The likelihood function is

$$\prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}.$$

The loglikelihood is

$$L(\alpha, \beta) = \sum_{i=1}^{n} \bigl[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \bigr] = \sum_{i=1}^{n} \bigl[ y_i \log(\alpha + \beta x_i) + (1 - y_i) \log(1 - \alpha - \beta x_i) \bigr].$$
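The loglikelihood in Problem 3 a. can be written down directly as a function of $(\alpha, \beta)$. The sketch below is an illustration only: the function name `loglik` and the toy data are hypothetical, and returning minus infinity whenever some $p_i$ falls outside the open interval $(0, 1)$ is a simplification made here, not something stated in the solution.

```python
from math import inf, log


def loglik(alpha, beta, x, y):
    """Loglikelihood of the linear probability model p_i = alpha + beta * x_i."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = alpha + beta * xi
        if not 0.0 < p < 1.0:
            # Simplification: treat parameter values that give p_i outside
            # (0, 1) as inadmissible, since log(p) or log(1 - p) is undefined.
            return -inf
        total += yi * log(p) + (1 - yi) * log(1 - p)
    return total


# Hypothetical toy data, not taken from the examination.
x = [0.0, 1.0, 2.0, 3.0]
y = [0, 0, 1, 1]

value = loglik(0.2, 0.2, x, y)    # p_i = 0.2, 0.4, 0.6, 0.8
outside = loglik(0.5, 0.2, x, y)  # p_4 = 1.1, so this returns -inf
print(value, outside)
```

A numerical optimiser applied to such a function has to respect the restriction $0 < \alpha + \beta x_i < 1$ for all $i$, which is why the guard is included.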
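b. The likelihood equations are obtained by setting the derivatives of the loglikelihood with respect to $\alpha$ and $\beta$ to 0. Derivatives:

$$\frac{\partial L(\alpha, \beta)}{\partial \alpha} = \sum_{i=1}^{n} \left( \frac{y_i}{\alpha + \beta x_i} - \frac{1 - y_i}{1 - \alpha - \beta x_i} \right) = \sum_{i=1}^{n} \frac{y_i - \alpha y_i - \beta x_i y_i - \alpha - \beta x_i + y_i \alpha + y_i \beta x_i}{(\alpha + \beta x_i)(1 - \alpha - \beta x_i)} = \sum_{i=1}^{n} \frac{y_i - \alpha - \beta x_i}{(\alpha + \beta x_i)(1 - \alpha - \beta x_i)},$$

$$\frac{\partial L(\alpha, \beta)}{\partial \beta} = \sum_{i=1}^{n} \left( \frac{y_i x_i}{\alpha + \beta x_i} - \frac{(1 - y_i) x_i}{1 - \alpha - \beta x_i} \right) = \sum_{i=1}^{n} \frac{x_i (y_i - \alpha - \beta x_i)}{(\alpha + \beta x_i)(1 - \alpha - \beta x_i)}.$$

The likelihood equations are then $\partial L(\alpha, \beta)/\partial \alpha = 0$ and $\partial L(\alpha, \beta)/\partial \beta = 0$. Their solution (e.g. obtained numerically with an algorithm) is then the ML estimate for $\alpha$ and $\beta$.

c. Asymptotic variance matrix for the parameter estimates: The vector of ML estimates $(\hat\alpha, \hat\beta)$ is asymptotically normally distributed with mean $(\alpha, \beta)$ and covariance matrix $I(\alpha, \beta)^{-1}$, where $I(\alpha, \beta)$ is the information matrix

$$I(\alpha, \beta) = -\begin{pmatrix} E\,\dfrac{\partial^2 L(\alpha, \beta)}{\partial \alpha^2} & E\,\dfrac{\partial^2 L(\alpha, \beta)}{\partial \alpha\,\partial \beta} \\[2ex] E\,\dfrac{\partial^2 L(\alpha, \beta)}{\partial \alpha\,\partial \beta} & E\,\dfrac{\partial^2 L(\alpha, \beta)}{\partial \beta^2} \end{pmatrix}.$$

The information matrix, and therefore the asymptotic covariance matrix, contains the unknown parameters $(\alpha, \beta)$. Replacing them by their ML estimates $(\hat\alpha, \hat\beta)$ leads to an estimated covariance matrix which can be used for constructing CIs.

d. Advantage: Results from the analysis are on the probability scale and hence directly interpretable in terms of probabilities or differences of probabilities without further transformation.
Disadvantage: Estimated/predicted probabilities are not necessarily between 0 and 1.

Problem 4

a. Maximum likelihood estimates: $\hat\alpha = 4.0343$ and $\hat\beta_W = -0.0042$. Further, $e^{\hat\beta_W} = 0.996$. With each gram of increased weight, the odds of developing BPD are reduced by 0.4% ($= 1 - 0.996$). (A better interpretation: with each 100 gram of increased weight, the odds of BPD are decreased by 34%, since $e^{100 \hat\beta_W} = 0.66$.)

b. Let $\pi_0$ be the proportion of interest.

$$\operatorname{logit}(\hat\pi_0) = \hat\alpha + \hat\beta_W \cdot 1000 = 4.0343 - 4.2 \approx -0.2, \qquad \hat\pi_0 = \frac{e^{-0.2}}{1 + e^{-0.2}} = 0.45.$$

The predicted probability that a newborn baby with a weight of 1000 gram develops BPD until day 28 of life is 45%.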
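The arithmetic in Problem 4 a. and b. can be retraced with a few lines of Python. This is a sketch that uses only the rounded estimates quoted above; the helper `expit` and the variable names are introduced here for illustration. Because $\hat\beta_W$ is rounded to four decimals, the logit comes out as about $-0.17$ rather than the $-0.2$ used in the solution, so the predicted probability may differ in the second decimal.

```python
from math import exp

alpha_hat = 4.0343     # intercept estimate quoted in part a.
beta_w_hat = -0.0042   # weight coefficient per gram, as quoted (rounded)


def expit(z):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + exp(-z))


# Multiplicative change in the odds per 1 g and per 100 g of extra weight.
odds_ratio_per_gram = exp(beta_w_hat)
odds_ratio_per_100g = exp(100 * beta_w_hat)

# Predicted probability of BPD for a birth weight of 1000 g.
logit_1000 = alpha_hat + beta_w_hat * 1000
prob_1000 = expit(logit_1000)

print(round(odds_ratio_per_gram, 3), round(odds_ratio_per_100g, 2))
print(round(logit_1000, 2), round(prob_1000, 2))
```

Scaling the coefficient before exponentiating, as in `odds_ratio_per_100g`, is what produces the more readable per-100-gram interpretation.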
c. $V(\operatorname{logit}(\hat\pi_0)) = V(\hat\alpha + \hat\beta_W \cdot 1000) = V(\hat\alpha) + 2\operatorname{Cov}(\hat\alpha, \hat\beta_W) \cdot 1000 + V(\hat\beta_W) \cdot 10^6 = 0.484 - 0.866 + 0.411 = 0.029$.
95% Wald CI for $\operatorname{logit}(\pi_0)$: $-0.2 \pm 1.96 \cdot \sqrt{0.029}$; CI $= [-0.53, 0.13]$.
95% Wald CI for $\pi_0$: $[e^{-0.53}/(1 + e^{-0.53}),\; e^{0.13}/(1 + e^{0.13})] = [0.37, 0.53]$.

d. Estimates: $\hat\beta_W = -0.0024$, $\hat\beta_A = -0.398$ and $e^{\hat\beta_W} = 0.9976$, $e^{\hat\beta_A} = 0.672$. With each gram of increased weight, the odds of developing BPD are reduced by 0.24% ($1 - e^{\hat\beta_W} = 0.0024$) if age is kept constant. With each week of increased age, the odds of developing BPD are reduced by 33% if weight is kept constant. The estimate for $\beta_W$ in model (W+A) is around half of that in model (W). Since weight and age are correlated, the addition of age as explanatory variable reduces the estimate for $\beta_W$. However, weight is still important in this model (significant variable in model (W+A)).

e. Definition of AIC: AIC $= -2\,$(maximized loglikelihood $-$ number of parameters in the model). Based on AIC, (W+A) is preferable over (W), since AIC $= 215.2$ for (W+A), which is smaller than AIC $= 227.7$ for (W).
LR test statistic: $-2(L_0 - L_1)$, where $L_i$ are the maximized loglikelihoods for the two models. $-2(L_0 - L_1) = -2(-111.86 + 104.61) = 14.46$ (14.50 when computed from the rounded loglikelihoods shown). The test statistic is approximately $\chi^2$-distributed with 1 degree of freedom (1 = difference in the number of parameters between the two models). Since $14.46 > \chi^2_{1;0.95} = 3.84$, we can reject $H_0$ that (W) holds. Therefore it is justified to use (W+A).

Problem 5

Denote school by S, gender by G and age group by A, and the frequencies in the table by $n_{sga}$, $s = 1, 2$, $g = 1, 2$, $a = 1, 2, 3$.

a. Saturated loglinear model for the expected number of students:

$$\log\mu_{sga} = \lambda + \lambda^S_s + \lambda^G_g + \lambda^A_a + \lambda^{SG}_{sg} + \lambda^{SA}_{sa} + \lambda^{GA}_{ga} + \lambda^{SGA}_{sga}.$$

The parameters with at least one of $s = 2$, $g = 2$ or $a = 3$ are set to 0 to avoid over-parametrisation. The remaining parameters not set to 0 are: $\lambda$, $\lambda^S_1$, $\lambda^G_1$, $\lambda^A_1$, $\lambda^A_2$, $\lambda^{SG}_{11}$, $\lambda^{SA}_{11}$, $\lambda^{SA}_{12}$, $\lambda^{GA}_{11}$, $\lambda^{GA}_{12}$, $\lambda^{SGA}_{111}$, $\lambda^{SGA}_{112}$. These are 12 parameters (= number of cells in the $2 \times 2 \times 3$ contingency table).

b. Age jointly independent of both gender and school.
Model (SG, A): $\log\mu_{sga} = \lambda + \lambda^S_s + \lambda^G_g + \lambda^A_a + \lambda^{SG}_{sg}$.
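The number of free parameters in such a loglinear model follows from the corner-point constraints: the intercept counts once, and each model term contributes the product of (number of levels $-$ 1) over the factors it involves. The following sketch applies this counting rule to the saturated model and to (SG, A); the function name `n_params` and the way a model is encoded as a list of strings are conventions invented here for illustration.

```python
from math import prod

# Number of levels of school (S), gender (G) and age group (A).
levels = {"S": 2, "G": 2, "A": 3}


def n_params(terms):
    """Count free parameters: intercept plus prod(levels - 1) for each term."""
    return 1 + sum(prod(levels[f] - 1 for f in term) for term in terms)


saturated = ["S", "G", "A", "SG", "SA", "GA", "SGA"]
sg_a = ["S", "G", "A", "SG"]

print(n_params(saturated))  # number of parameters in the saturated model
print(n_params(sg_a))       # number of parameters in model (SG, A)
```

For the saturated model the count equals the number of cells, 12, and for (SG, A) it is 6; the difference between two nested models gives the degrees of freedom of a likelihood ratio test comparing them.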
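c. In the saturated model, the ML estimate is the observation, i.e. $\hat\mu_{111} = n_{111} = 42$. In model (SG, A),

$$\hat\mu_{111} = \frac{n_{11+}\, n_{++1}}{n} = \frac{259 \cdot 184}{1373} = 34.7$$

(see also Table 5 in Appendix B).

d. Likelihood ratio test:

$$G^2 = 2 \sum_{s=1}^{2} \sum_{g=1}^{2} \sum_{a=1}^{3} n_{sga} \log(n_{sga}/\hat\mu_{sga}) \quad\text{with}\quad \hat\mu_{sga} = \frac{n_{sg+}\, n_{++a}}{n},$$

which is available in Table 5. Consequently,

$$G^2 = 2 \left( 42 \log\frac{42}{34.7} + \dots \right) = 2 \cdot 9.34 = 18.68.$$

Asymptotically, $G^2$ is $\chi^2$-distributed with $12 - 6 = 6$ degrees of freedom (the saturated model has 12 parameters; the (SG, A) model has 6 parameters). According to the table of the $\chi^2$-distribution, $G^2 > 12.59 = \chi^2_{6;0.95}$, and therefore the hypothesis that the (SG, A) model holds can be rejected. Age group can therefore not be considered as jointly independent of school and gender.

e. Consider now $a = 1, 2$ only. The equivalent logistic regression models have age group as dependent variable and school and gender as explanatory variables. For the loglinear model in a. (saturated model):

$$\log\frac{P(A = 1)}{P(A = 2)} = \log\frac{\mu_{sg1}}{\mu_{sg2}} = \bigl(\lambda + \lambda^S_s + \lambda^G_g + \lambda^A_1 + \lambda^{SG}_{sg} + \lambda^{SA}_{s1} + \lambda^{GA}_{g1} + \lambda^{SGA}_{sg1}\bigr) - \bigl(\lambda + \lambda^S_s + \lambda^G_g + \lambda^A_2 + \lambda^{SG}_{sg} + \lambda^{SA}_{s2} + \lambda^{GA}_{g2} + \lambda^{SGA}_{sg2}\bigr)$$

$$= \underbrace{(\lambda^A_1 - \lambda^A_2)}_{\beta} + \underbrace{(\lambda^{SA}_{s1} - \lambda^{SA}_{s2})}_{\beta^S_s} + \underbrace{(\lambda^{GA}_{g1} - \lambda^{GA}_{g2})}_{\beta^G_g} + \underbrace{(\lambda^{SGA}_{sg1} - \lambda^{SGA}_{sg2})}_{\beta^{SG}_{sg}} = \beta + \beta^S_s + \beta^G_g + \beta^{SG}_{sg},$$

and for the model in b. (gender and school jointly independent of age):

$$\log\frac{P(A = 1)}{P(A = 2)} = \log\frac{\mu_{sg1}}{\mu_{sg2}} = \bigl(\lambda + \lambda^S_s + \lambda^G_g + \lambda^A_1 + \lambda^{SG}_{sg}\bigr) - \bigl(\lambda + \lambda^S_s + \lambda^G_g + \lambda^A_2 + \lambda^{SG}_{sg}\bigr) = \underbrace{(\lambda^A_1 - \lambda^A_2)}_{\beta} = \beta.$$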
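The fitted count in Problem 5 c. and the form of the $G^2$ statistic in d. can be illustrated as follows. Since Table 5 is not reproduced in these solutions, the sketch only recomputes $\hat\mu_{111}$ from the margins quoted in c., and it demonstrates the generic statistic on the $2 \times 2$ table of Problem 2 (cell counts 4, 11, 30, 20) instead. The function name `g2` and the flat list layout of the cells are choices made here, not part of the original solution.

```python
from math import log


def g2(observed, fitted):
    """Likelihood ratio statistic 2 * sum n * log(n / mu_hat) over all cells."""
    # Cells with n = 0 contribute 0 and are skipped to avoid log(0).
    return 2 * sum(n * log(n / m) for n, m in zip(observed, fitted) if n > 0)


# Problem 5 c.: fitted count for cell (1, 1, 1) under model (SG, A),
# using n_{11+} = 259, n_{++1} = 184 and n = 1373 as quoted above.
mu_111 = 259 * 184 / 1373

# Problem 2 c.: independence fit for the 2x2 table, cells in the order
# n11, n12, n21, n22.
obs = [4, 11, 30, 20]
n = sum(obs)
rows = [obs[0] + obs[1], obs[2] + obs[3]]
cols = [obs[0] + obs[2], obs[1] + obs[3]]
fit = [r * c / n for r in rows for c in cols]

g2_2x2 = g2(obs, fit)
print(round(mu_111, 1), round(g2_2x2, 2))
```

With the full $2 \times 2 \times 3$ table and the fitted values $n_{sg+} n_{++a}/n$, the same function would give the statistic compared with $\chi^2_{6;0.95}$ in d. For the $2 \times 2$ table the result can differ from 5.28 in the second decimal, because the solution uses fitted values rounded to two decimals.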