Stockholm University, Department of Mathematics, Division of Mathematical Statistics
Frank Miller, MT 5006
SOLUTIONS, 21 March 2013

Solutions for Examination Categorical Data Analysis, March 21, 2013

Problem 1

a. Fisher's exact test uses the distribution of $n_{11}$ under $H_0$, conditional on the observed marginal frequencies:

$$P_{H_0}(n_{11} = t \mid n_{1+}, n_{2+}, n_{+1}, n_{+2}) = \frac{\binom{n_{1+}}{t}\binom{n - n_{1+}}{n_{+1} - t}}{\binom{n}{n_{+1}}}.$$

b. One-sided p-value for Fisher's exact test:

$$P(n_{11} \le t_o \mid n_{i+}, n_{+j}) = \sum_{t=0}^{4} P(n_{11} = t \mid n_{i+}, n_{+j}) = 0 + 0 + 0.001 + 0.004 + 0.019 = 0.024,$$

where $t_o = 4$ is the observed value of $n_{11}$. The one-sided p-value is below 5% (even below 2.5%), and therefore the study shows a hypertension-preventing effect of aspirin compared to placebo.

c. The two-sided p-value can be defined as

$$P(n_{11} \le t_o \mid n_{i+}, n_{+j}) + P(n_{11} \ge t^* \mid n_{i+}, n_{+j})$$

with $t^* = \min\{t : P(n_{11} = s \mid n_{i+}, n_{+j}) \le P(n_{11} = t_o \mid n_{i+}, n_{+j}) \text{ for all } s \ge t\}$. As the hypergeometric distribution is unimodal, this definition is equal to

$$\sum_{t \in T} P(n_{11} = t \mid n_{i+}, n_{+j}) \quad\text{with}\quad T = \{s : P(n_{11} = s \mid n_{i+}, n_{+j}) \le P(n_{11} = t_o \mid n_{i+}, n_{+j})\}.$$

For the dataset, the two-sided p-value is

$$P(n_{11} \le 4 \mid n_{i+}, n_{+j}) + P(n_{11} \ge 12 \mid n_{i+}, n_{+j}) = 0.024 + 0.012 + 0.002 = 0.038.$$

The two-sided test rejects the null hypothesis at the 5% level, indicating that aspirin has some effect compared to placebo.

d. Mid p-value:

$$P(n_{11} < t_o \mid n_{i+}, n_{+j}) + \tfrac{1}{2} P(n_{11} = t_o \mid n_{i+}, n_{+j}) = 0.005 + 0.019/2 = 0.0145.$$

Advantage: The type I error of a test based on the mid p-value is closer to the desired level $\alpha$.
Disadvantage: The type I error might in some cases be slightly above $\alpha$ (the test is anti-conservative).
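The three p-values of Problem 1 can be cross-checked numerically. The sketch below is an illustration only: it assumes the 2x2 cell counts 4, 11, 30, 20 that appear in the odds-ratio calculation of Problem 2, and the function and variable names are made up for this example.

```python
from math import comb

def hypergeom_pmf(t, n1p, np1, n):
    """P(n11 = t | margins) under H0 for a 2x2 table."""
    if t < max(0, n1p + np1 - n) or t > min(n1p, np1):
        return 0.0
    return comb(n1p, t) * comb(n - n1p, np1 - t) / comb(n, np1)

# Cell counts taken from the odds-ratio calculation in Problem 2 (assumption).
n11, n12, n21, n22 = 4, 11, 30, 20
n1p = n11 + n12            # first row total
np1 = n11 + n21            # first column total
n = n11 + n12 + n21 + n22  # grand total

support = range(max(0, n1p + np1 - n), min(n1p, np1) + 1)
pmf = {t: hypergeom_pmf(t, n1p, np1, n) for t in support}
p_obs = pmf[n11]

# One-sided: all tables with n11 at most the observed value.
p_one_sided = sum(p for t, p in pmf.items() if t <= n11)
# Two-sided: all tables at most as probable as the observed one
# (small relative tolerance guards against floating-point ties).
p_two_sided = sum(p for p in pmf.values() if p <= p_obs * (1 + 1e-12))
# Mid p-value: only half of the observed table's probability is counted.
p_mid = sum(p for t, p in pmf.items() if t < n11) + 0.5 * p_obs

print(f"one-sided {p_one_sided:.4f}, two-sided {p_two_sided:.4f}, mid-p {p_mid:.4f}")
```

The unrounded results differ from the hand calculation in the third decimal at most, because the solution sums probabilities already rounded to three decimals.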

Problem 2

a. Estimate for the odds ratio:

$$\hat\theta = \frac{n_{11} n_{22}}{n_{12} n_{21}} = \frac{4 \cdot 20}{11 \cdot 30} = \frac{8}{33} = 0.242.$$

The odds of getting hypertension after treatment with aspirin were 0.242 times the odds of getting hypertension after placebo treatment.

b. $SE = \sqrt{V(\log\hat\theta)} = \sqrt{1/n_{11} + 1/n_{12} + 1/n_{21} + 1/n_{22}} = \sqrt{1/4 + 1/11 + 1/30 + 1/20} = 0.651$, and $\log\hat\theta = \log 0.242 = -1.417$.
95% Wald CI for $\log\theta$: $-1.417 \pm 1.96 \cdot 0.651$; CI $[-2.693, -0.141]$.
95% Wald CI for $\theta$: $[e^{-2.693}, e^{-0.141}] = [0.068, 0.868]$.

c. For example, the likelihood ratio test:

$$G^2 = 2 \sum_{i=1}^{2} \sum_{j=1}^{2} n_{ij} \log\frac{n_{ij}}{\hat\mu_{ij}},$$

where $\hat\mu_{ij} = n_{i+} n_{+j}/n$ is the ML estimate under the independence model.

$$G^2 = 2\,\bigl(4 \log(4/7.85) + 11 \log(11/7.15) + 30 \log(30/26.15) + 20 \log(20/23.85)\bigr) = 5.28.$$

The full model has 4 parameters, the independence model 3 parameters. Therefore the asymptotic distribution of $G^2$ is the $\chi^2$-distribution with $1\ (= 4 - 3)$ degree of freedom. The null hypothesis of independence can be rejected at the 5% level by the two-sided test, since $5.28 > 3.84$.

The problem should state one-sided instead of two-sided. For the one-sided test at the 5% level, one has to compare $G^2 = 5.28$ with $\chi^2_{1;0.90} = 2.71$ and check the direction of the effect. Since $\hat\theta = 0.242 < 1$ and $G^2 > \chi^2_{1;0.90}$, the one-sided test of independence versus the alternative of a hypertension-preventing effect of aspirin rejects $H_0$.

Shorter alternative to the LR test: Wald test (here: one-sided):

$$\frac{\log\hat\theta}{SE} = \frac{-1.42}{0.651} = -2.18 < -1.64.$$

Therefore the Wald test at the 5% level rejects the null hypothesis.

Problem 3

a. $P(Y_i = y_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}$, $y_i \in \{0, 1\}$, and $p_i = \alpha + \beta x_i$, $i = 1, \dots, n$.
The likelihood function is

$$\prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}.$$

The log-likelihood is

$$L(\alpha, \beta) = \sum_{i=1}^{n} \bigl[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \bigr] = \sum_{i=1}^{n} \bigl[ y_i \log(\alpha + \beta x_i) + (1 - y_i) \log(1 - \alpha - \beta x_i) \bigr].$$
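The log-likelihood of this identity-link model can be written down directly. The following sketch evaluates it together with its partial derivatives (the quantities set to zero in part b) and compares them with finite differences. The data and parameter values are hypothetical toy numbers, not taken from the exam.

```python
import math

def loglik(alpha, beta, x, y):
    """Log-likelihood of the identity-link model p_i = alpha + beta * x_i."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = alpha + beta * xi
        if not 0.0 < p < 1.0:
            return -math.inf  # (alpha, beta) outside the admissible region
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

def score(alpha, beta, x, y):
    """Analytic partial derivatives of the log-likelihood w.r.t. alpha and beta."""
    d_alpha = d_beta = 0.0
    for xi, yi in zip(x, y):
        p = alpha + beta * xi
        w = (yi - p) / (p * (1 - p))
        d_alpha += w
        d_beta += xi * w
    return d_alpha, d_beta

# Hypothetical toy data and parameter values (assumptions for illustration).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0, 0, 1, 0, 1]
alpha, beta = 0.2, 0.1

# Central finite differences as an independent check of the analytic score.
h = 1e-6
num_alpha = (loglik(alpha + h, beta, x, y) - loglik(alpha - h, beta, x, y)) / (2 * h)
num_beta = (loglik(alpha, beta + h, x, y) - loglik(alpha, beta - h, x, y)) / (2 * h)
ana_alpha, ana_beta = score(alpha, beta, x, y)

print(ana_alpha, num_alpha)
print(ana_beta, num_beta)
```

Returning minus infinity outside the region where all $p_i$ lie in $(0, 1)$ is one possible way to handle the constraint that the identity link does not enforce by itself.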

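b. The likelihood equations are the derivatives of the log-likelihood with respect to $\alpha$ and $\beta$ set to 0. Derivatives:

$$\frac{\partial L(\alpha, \beta)}{\partial \alpha} = \sum_{i=1}^{n} \left[ \frac{y_i}{\alpha + \beta x_i} - \frac{1 - y_i}{1 - \alpha - \beta x_i} \right] = \sum_{i=1}^{n} \frac{y_i - \alpha y_i - \beta x_i y_i - \alpha - \beta x_i + y_i \alpha + y_i \beta x_i}{(\alpha + \beta x_i)(1 - \alpha - \beta x_i)} = \sum_{i=1}^{n} \frac{y_i - \alpha - \beta x_i}{(\alpha + \beta x_i)(1 - \alpha - \beta x_i)},$$

$$\frac{\partial L(\alpha, \beta)}{\partial \beta} = \sum_{i=1}^{n} \left[ \frac{y_i x_i}{\alpha + \beta x_i} - \frac{(1 - y_i) x_i}{1 - \alpha - \beta x_i} \right] = \sum_{i=1}^{n} \frac{x_i (y_i - \alpha - \beta x_i)}{(\alpha + \beta x_i)(1 - \alpha - \beta x_i)}.$$

The likelihood equations are then $\partial L(\alpha, \beta)/\partial \alpha = 0$ and $\partial L(\alpha, \beta)/\partial \beta = 0$. Their solution (e.g. obtained numerically with an algorithm) is the ML estimate for $\alpha$ and $\beta$.

c. Asymptotic variance matrix for the parameter estimates: The vector of ML estimates $(\hat\alpha, \hat\beta)$ is asymptotically normally distributed with mean $(\alpha, \beta)$ and covariance matrix $I(\alpha, \beta)^{-1}$, where $I(\alpha, \beta)$ is the information matrix

$$I(\alpha, \beta) = - \begin{pmatrix} E\,\dfrac{\partial^2 L(\alpha, \beta)}{\partial \alpha^2} & E\,\dfrac{\partial^2 L(\alpha, \beta)}{\partial \alpha\, \partial \beta} \\[2ex] E\,\dfrac{\partial^2 L(\alpha, \beta)}{\partial \alpha\, \partial \beta} & E\,\dfrac{\partial^2 L(\alpha, \beta)}{\partial \beta^2} \end{pmatrix}.$$

The information matrix, and therefore the asymptotic covariance matrix, contains the unknown parameters $(\alpha, \beta)$. Replacing them by their ML estimates $(\hat\alpha, \hat\beta)$ leads to an estimated covariance matrix which can be used for constructing CIs.

d. Advantage: Results from the analysis are on the probability scale and hence directly interpretable in terms of probabilities or differences of probabilities without further transformation.
Disadvantage: Estimated/predicted probabilities are not necessarily between 0 and 1.

Problem 4

a. Maximum likelihood estimates: $\hat\alpha = 4.0343$ and $\hat\beta_W = -0.0042$. Further, $e^{\hat\beta_W} = 0.996$. For each additional gram of weight, the odds of developing BPD are reduced by 0.4% ($1 - 0.996$). (A better interpretation: for each additional 100 grams of weight, the odds of BPD decrease by 34%, since $e^{100 \hat\beta_W} = 0.66$.)

b. Let $\pi_0$ be the proportion of interest.

$$\operatorname{logit}(\hat\pi_0) = \hat\alpha + \hat\beta_W \cdot 1000 = 4.0343 - 4.2 \approx -0.2, \qquad \hat\pi_0 = e^{-0.2}/(1 + e^{-0.2}) = 0.45.$$

The predicted probability that a newborn baby weighing 1000 grams develops BPD by day 28 of life is 45%.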
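The odds multipliers and the predicted probability in parts a and b of Problem 4 follow from a few exponentials. The sketch below recomputes them from the reported coefficients; the variable names are illustrative. Because the reported coefficients are rounded, the logit evaluates to about -0.17 rather than the -0.2 used above, so the probability comes out near 0.46 instead of 0.45.

```python
import math

def expit(z):
    """Inverse logit: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

alpha_hat = 4.0343     # intercept as reported in part a
beta_w_hat = -0.0042   # weight coefficient per gram, as reported (rounded)

# Multiplicative change in the odds of BPD per gram and per 100 grams.
or_per_gram = math.exp(beta_w_hat)
or_per_100g = math.exp(100 * beta_w_hat)

# Predicted log-odds and probability at a birth weight of 1000 grams.
logit_1000 = alpha_hat + beta_w_hat * 1000
pi_1000 = expit(logit_1000)

print(f"odds ratio per gram:  {or_per_gram:.4f}")
print(f"odds ratio per 100 g: {or_per_100g:.3f}")
print(f"logit at 1000 g: {logit_1000:.3f}, probability: {pi_1000:.3f}")
print(f"probability at logit -0.2: {expit(-0.2):.3f}")
```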

c. $V(\operatorname{logit}(\hat\pi_0)) = V(\hat\alpha + \hat\beta_W \cdot 1000) = V(\hat\alpha) + 2\operatorname{Cov}(\hat\alpha, \hat\beta_W) \cdot 1000 + V(\hat\beta_W) \cdot 10^6 = 0.484 - 0.866 + 0.411 = 0.029$.
95% Wald CI for $\operatorname{logit}(\pi_0)$: $-0.2 \pm 1.96 \sqrt{0.029}$; CI $[-0.53, 0.13]$.
95% Wald CI for $\pi_0$: $[e^{-0.53}/(1 + e^{-0.53}),\ e^{0.13}/(1 + e^{0.13})] = [0.37, 0.53]$.

d. Estimates: $\hat\beta_W = -0.0024$, $\hat\beta_A = -0.398$ and $e^{\hat\beta_W} = 0.9976$, $e^{\hat\beta_A} = 0.672$. For each additional gram of weight, the odds of developing BPD are reduced by 0.24% ($1 - e^{\hat\beta_W} = 0.0024$) if age is kept constant. For each additional week of age, the odds of developing BPD are reduced by 33% if weight is kept constant.
The estimate for $\beta_W$ in model (W+A) is around half of that in model (W). Since weight and age are correlated, adding age as an explanatory variable reduces the estimate for $\beta_W$. However, weight is still important in this model (it is a significant variable in model (W+A)).

e. Definition of AIC: AIC $= -2\,(\text{maximized log-likelihood} - \text{number of parameters in the model})$. Based on AIC, (W+A) is preferable to (W), since AIC $= 215.2$ for (W+A), which is smaller than AIC $= 227.7$ for (W).
The LR test statistic is $-2(L_0 - L_1)$, where $L_0$ and $L_1$ are the maximized log-likelihoods of models (W) and (W+A), respectively: $-2(L_0 - L_1) = -2(-111.86 + 104.61) \approx 14.46$. The test statistic is approximately $\chi^2$-distributed with 1 degree of freedom (1 = difference in the number of parameters between the two models). Since $14.46 > \chi^2_{1;0.95} = 3.84$, we can reject $H_0$ that (W) holds. Therefore it is justified to use (W+A).

Problem 5

Denote school by S, gender by G and age group by A, and the frequencies in the table by $n_{sga}$, $s = 1, 2$, $g = 1, 2$, $a = 1, 2, 3$.

a. Saturated loglinear model for the expected number of students:

$$\log \mu_{sga} = \lambda + \lambda^S_s + \lambda^G_g + \lambda^A_a + \lambda^{SG}_{sg} + \lambda^{SA}_{sa} + \lambda^{GA}_{ga} + \lambda^{SGA}_{sga}.$$

The parameters with at least one of $s = 2$, $g = 2$ or $a = 3$ are set to 0 to avoid over-parametrisation. The remaining parameters not set to 0 are: $\lambda, \lambda^S_1, \lambda^G_1, \lambda^A_1, \lambda^A_2, \lambda^{SG}_{11}, \lambda^{SA}_{11}, \lambda^{SA}_{12}, \lambda^{GA}_{11}, \lambda^{GA}_{12}, \lambda^{SGA}_{111}, \lambda^{SGA}_{112}$. These are 12 parameters (= number of cells in the 2x2x3 contingency table).

b. Age is jointly independent of both gender and school.
Model (SG, A): $\log \mu_{sga} = \lambda + \lambda^S_s + \lambda^G_g + \lambda^A_a + \lambda^{SG}_{sg}$.
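The parameter counts of the two loglinear models can be verified mechanically: under the constraint that every parameter with $s = 2$, $g = 2$ or $a = 3$ is zero, a term contributes the product of (number of levels minus 1) over its factors. The sketch below uses an ad hoc encoding of terms as strings of factor letters, which is an assumption made for this illustration.

```python
from itertools import combinations
from math import prod

# Number of levels per factor: school, gender, age group.
levels = {"S": 2, "G": 2, "A": 3}

def n_params(terms):
    """Count free parameters; '' denotes the intercept, 'SG' the S-by-G interaction."""
    return sum(prod(levels[f] - 1 for f in term) for term in terms)

# Saturated model: intercept plus every main effect and interaction.
saturated = [""] + ["".join(c) for r in (1, 2, 3) for c in combinations("SGA", r)]
# Model (SG, A): intercept, main effects, and the S-by-G interaction only.
sg_a = ["", "S", "G", "A", "SG"]

print(n_params(saturated))                   # saturated model
print(n_params(sg_a))                        # model (SG, A)
print(n_params(saturated) - n_params(sg_a))  # difference in parameter counts
```

The difference between the two counts is the number of degrees of freedom for testing model (SG, A) against the saturated model.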

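c. In the saturated model, the ML estimate is the observation, i.e. $\hat\mu_{111} = n_{111} = 42$. In model (SG, A),

$$\hat\mu_{111} = \frac{n_{11+}\, n_{++1}}{n} = \frac{259 \cdot 184}{1373} = 34.7$$

(see also Table 5 in Appendix B).

d. Likelihood ratio test:

$$G^2 = 2 \sum_{s=1}^{2} \sum_{g=1}^{2} \sum_{a=1}^{3} n_{sga} \log(n_{sga}/\hat\mu_{sga}), \quad\text{with}\quad \hat\mu_{sga} = \frac{n_{sg+}\, n_{++a}}{n},$$

which is available in Table 5. Consequently,

$$G^2 = 2 \left( 42 \log\frac{42}{34.7} + \dots \right) = 2 \cdot 9.34 = 18.68.$$

Asymptotically, $G^2$ is $\chi^2$-distributed with $12 - 6 = 6$ degrees of freedom (the saturated model has 12 parameters; the (SG, A) model has 6 parameters). According to the table of the $\chi^2$-distribution, $G^2 > 12.59 = \chi^2_{6;0.95}$, and therefore the hypothesis that the (SG, A) model holds can be rejected. Age group can therefore not be considered jointly independent of school and gender.

e. Consider now $a = 1, 2$ only. The equivalent logistic regression models have age group as the dependent variable and school and gender as explanatory variables. For the loglinear model in a. (saturated model):

$$\log\frac{P(A = 1)}{P(A = 2)} = \log\frac{\mu_{sg1}}{\mu_{sg2}} = \bigl(\lambda + \lambda^S_s + \lambda^G_g + \lambda^A_1 + \lambda^{SG}_{sg} + \lambda^{SA}_{s1} + \lambda^{GA}_{g1} + \lambda^{SGA}_{sg1}\bigr) - \bigl(\lambda + \lambda^S_s + \lambda^G_g + \lambda^A_2 + \lambda^{SG}_{sg} + \lambda^{SA}_{s2} + \lambda^{GA}_{g2} + \lambda^{SGA}_{sg2}\bigr)$$

$$= \underbrace{(\lambda^A_1 - \lambda^A_2)}_{\beta} + \underbrace{(\lambda^{SA}_{s1} - \lambda^{SA}_{s2})}_{\beta^S_s} + \underbrace{(\lambda^{GA}_{g1} - \lambda^{GA}_{g2})}_{\beta^G_g} + \underbrace{(\lambda^{SGA}_{sg1} - \lambda^{SGA}_{sg2})}_{\beta^{SG}_{sg}} = \beta + \beta^S_s + \beta^G_g + \beta^{SG}_{sg},$$

and for the model in b. (gender and school jointly independent of age):

$$\log\frac{P(A = 1)}{P(A = 2)} = \log\frac{\mu_{sg1}}{\mu_{sg2}} = \bigl(\lambda + \lambda^S_s + \lambda^G_g + \lambda^A_1 + \lambda^{SG}_{sg}\bigr) - \bigl(\lambda + \lambda^S_s + \lambda^G_g + \lambda^A_2 + \lambda^{SG}_{sg}\bigr) = \underbrace{(\lambda^A_1 - \lambda^A_2)}_{\beta} = \beta.$$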
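The fitted value in part c and the test decision in part d of Problem 5 can be reproduced with a few lines of arithmetic. This sketch uses only the numbers quoted above (the full table is in the appendix and is not reproduced here); the variable names are illustrative.

```python
import math

# Margins quoted in part c: school 1 and gender 1 over all ages, age group 1 overall, grand total.
n_11plus, n_plusplus1, n_total = 259, 184, 1373
n_111 = 42  # observed count in cell (s=1, g=1, a=1)

# Fitted count under model (SG, A): product of the two margins divided by the grand total.
mu_111 = n_11plus * n_plusplus1 / n_total

# Contribution of this single cell to the sum inside G^2.
contrib_111 = n_111 * math.log(n_111 / mu_111)

# Test decision with the values quoted in part d.
g2 = 18.68             # likelihood ratio statistic
df = 12 - 6            # parameters of saturated model minus parameters of (SG, A)
chi2_crit_6_95 = 12.59 # 95% quantile of the chi-square distribution with 6 df, from the table
reject = g2 > chi2_crit_6_95

print(f"mu_111 = {mu_111:.1f}, cell contribution = {contrib_111:.2f}")
print(f"df = {df}, reject (SG, A): {reject}")
```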