Solutions for Examination Categorical Data Analysis, March 21, 2013


Stockholm University (Stockholms universitet), Department of Mathematics, Division of Mathematical Statistics
Frank Miller, MT 5006
Solutions, March 21, 2013

Problem 1

a. Fisher's exact test evaluates the distribution of $n_{11}$ under $H_0$, conditional on the observed marginal frequencies:

$$P_{H_0}(n_{11} = t \mid n_{1+}, n_{2+}, n_{+1}, n_{+2}) = \frac{\binom{n_{1+}}{t}\binom{n - n_{1+}}{n_{+1} - t}}{\binom{n}{n_{+1}}}.$$

b. One-sided p-value for Fisher's exact test:

$$P(n_{11} \le t_o \mid n_{i+}, n_{+j}) = \sum_{t=0}^{4} P(n_{11} = t \mid n_{i+}, n_{+j}),$$

where $t_o = 4$ is the observed value of $n_{11}$. The one-sided p-value is below 5% (even below 2.5%), and therefore the study shows a hypertension-preventing effect of aspirin compared to placebo.

c. The two-sided p-value can be defined as

$$P(n_{11} \le t_o \mid n_{i+}, n_{+j}) + P(n_{11} \ge t \mid n_{i+}, n_{+j})$$

with $t = \min\{t : P(n_{11} = s \mid n_{i+}, n_{+j}) \le P(n_{11} = t_o \mid n_{i+}, n_{+j}) \text{ for all } s \ge t\}$. As the hypergeometric distribution is unimodal, this definition is equal to

$$\sum_{t \in T} P(n_{11} = t \mid n_{i+}, n_{+j}) \quad \text{with} \quad T = \{s : P(n_{11} = s \mid n_{i+}, n_{+j}) \le P(n_{11} = t_o \mid n_{i+}, n_{+j})\}.$$

For the dataset, the two-sided p-value is $P(n_{11} \le 4 \mid n_{i+}, n_{+j}) + P(n_{11} \ge t \mid n_{i+}, n_{+j})$ with $t$ as defined above. The null hypothesis is rejected by the two-sided test at the 5% level, showing that aspirin has some effect compared to placebo.

d. Mid p-value:

$$P(n_{11} < t_o \mid n_{i+}, n_{+j}) + P(n_{11} = t_o \mid n_{i+}, n_{+j})/2.$$

Advantage: the Type I error of a test based on the mid p-value is closer to the desired level $\alpha$.
Disadvantage: the Type I error might in some cases be slightly above $\alpha$ (the test is anti-conservative).
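The three p-values in Problem 1 can be reproduced with a short sketch. It assumes the 2x2 table has cell counts 4, 11, 30 and 20, which are read off the standard-error expression in Problem 2; the row and column labelling is an assumption, but the conditional distribution of $n_{11}$ does not depend on it. Only the standard library is used.

```python
from math import comb

# Cell counts as they appear in the solution to Problem 2 (labelling assumed).
n11, n12, n21, n22 = 4, 11, 30, 20
row1 = n11 + n12            # n_{1+}
col1 = n11 + n21            # n_{+1}
n = n11 + n12 + n21 + n22   # total sample size

def pmf(t):
    """Hypergeometric P(n11 = t | margins) under H0."""
    return comb(row1, t) * comb(n - row1, col1 - t) / comb(n, col1)

support = range(max(0, row1 + col1 - n), min(row1, col1) + 1)
probs = {t: pmf(t) for t in support}

t_obs = n11
# One-sided: lower tail up to and including the observed value.
one_sided = sum(p for t, p in probs.items() if t <= t_obs)
# Two-sided: all outcomes that are no more probable than the observed one
# (the tiny relative tolerance guards against floating-point ties).
two_sided = sum(p for p in probs.values() if p <= probs[t_obs] * (1 + 1e-9))
# Mid p-value: only half of the observed outcome's probability is counted.
mid_p = sum(p for t, p in probs.items() if t < t_obs) + probs[t_obs] / 2

print(f"one-sided {one_sided:.4f}, two-sided {two_sided:.4f}, mid-p {mid_p:.4f}")
```

Under these assumptions the one-sided value falls below 2.5% and the two-sided value below 5%, in line with the conclusions stated in b. and c.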

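Problem 2

a. Estimate for the odds ratio:

$$\hat\theta = \frac{n_{11} n_{22}}{n_{12} n_{21}} = \frac{4 \cdot 20}{11 \cdot 30} \approx 0.242.$$

The odds of getting hypertension after treatment with aspirin were about 0.24 times the odds of getting hypertension after placebo treatment.

b. Standard error:

$$SE = \sqrt{V(\log\hat\theta)} = \sqrt{1/n_{11} + 1/n_{12} + 1/n_{21} + 1/n_{22}} = \sqrt{1/4 + 1/11 + 1/30 + 1/20} \approx 0.651,$$

and $\log\hat\theta = \log 0.242 \approx -1.417$.
95% Wald CI for $\log\theta$: $-1.417 \pm 1.96 \cdot 0.651$; CI $[-2.693, -0.141]$.
95% Wald CI for $\theta$: $[e^{-2.693}, e^{-0.141}] = [0.068, 0.868]$.

c. For example, the likelihood ratio test:

$$G^2 = 2\sum_{i=1}^{2}\sum_{j=1}^{2} n_{ij}\log\frac{n_{ij}}{\hat\mu_{ij}},$$

where $\hat\mu_{ij} = n_{i+} n_{+j}/n$ is the ML estimate in the independence model. Here

$$G^2 = 2\left(4\log\frac{4}{\hat\mu_{11}} + 11\log\frac{11}{\hat\mu_{12}} + 30\log\frac{30}{\hat\mu_{21}} + 20\log\frac{20}{\hat\mu_{22}}\right) = 5.28.$$

The full model has 4 parameters, the independence model 3 parameters. Therefore the asymptotic distribution of $G^2$ is the $\chi^2$-distribution with $1\ (= 4 - 3)$ degree of freedom. The null hypothesis of independence can be rejected at the 5% level by the two-sided test, since $5.28 > \chi^2_{1;0.95} = 3.84$.

The problem should state one-sided instead of two-sided. For the one-sided test at the 5% level, one has to compare $G^2$ with $\chi^2_{1;0.90} = 2.71$ and check the direction of the effect. Since $\hat\theta < 1$ and $G^2 > \chi^2_{1;0.90}$, one can conclude that the one-sided test of independence versus the alternative of a hypertension-preventing effect of aspirin rejects $H_0$.

Shorter alternative to the LR test: the Wald test (here: one-sided), $\log\hat\theta / SE \approx -2.18 < -1.645$. Therefore the Wald test at the 5% level rejects the null hypothesis.

Problem 3

a. $P(Y_i = y_i) = p_i^{y_i}(1 - p_i)^{1 - y_i}$, $y_i \in \{0, 1\}$, and $p_i = \alpha + \beta x_i$, $i = 1, \ldots, n$. The likelihood function is

$$\prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i}.$$

The loglikelihood is

$$L(\alpha, \beta) = \sum_{i=1}^{n}\left[y_i\log(p_i) + (1 - y_i)\log(1 - p_i)\right] = \sum_{i=1}^{n}\left[y_i\log(\alpha + \beta x_i) + (1 - y_i)\log(1 - \alpha - \beta x_i)\right].$$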
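The numbers in Problem 2 can be checked with a minimal sketch, assuming the cell counts 4, 11, 30 and 20 from the standard-error expression. The critical values 1.96 and 1.645 are the usual rounded normal quantiles, hard-coded here rather than computed.

```python
from math import exp, log, sqrt

n = [[4, 11], [30, 20]]  # cell counts n_ij (row/column labelling assumed)

# Odds ratio and its Wald interval on the log scale.
theta = n[0][0] * n[1][1] / (n[0][1] * n[1][0])
se = sqrt(sum(1 / c for row in n for c in row))
z = 1.96  # rounded 0.975 quantile of the standard normal
lo, hi = log(theta) - z * se, log(theta) + z * se
ci_theta = (exp(lo), exp(hi))

# Likelihood ratio statistic against the independence model.
total = sum(map(sum, n))
rows = [sum(r) for r in n]
cols = [sum(c) for c in zip(*n)]
g2 = 2 * sum(
    n[i][j] * log(n[i][j] * total / (rows[i] * cols[j]))
    for i in range(2)
    for j in range(2)
)

# One-sided Wald statistic; reject if it is below -1.645.
wald_z = log(theta) / se

print(f"theta={theta:.3f} CI=({ci_theta[0]:.3f}, {ci_theta[1]:.3f})")
print(f"G2={g2:.2f} Wald z={wald_z:.2f}")
```

Computed without intermediate rounding, the interval limits and $G^2$ can differ from the values quoted in the solution in the last printed digit.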

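b. The likelihood equations are the derivatives of the loglikelihood with respect to $\alpha$ and $\beta$ set to 0. Derivatives:

$$\frac{\partial L(\alpha, \beta)}{\partial\alpha} = \sum_{i=1}^{n}\left[\frac{y_i}{\alpha + \beta x_i} - \frac{1 - y_i}{1 - \alpha - \beta x_i}\right] = \sum_{i=1}^{n}\frac{y_i - \alpha y_i - \beta x_i y_i - \alpha - \beta x_i + y_i\alpha + y_i\beta x_i}{(\alpha + \beta x_i)(1 - \alpha - \beta x_i)} = \sum_{i=1}^{n}\frac{y_i - \alpha - \beta x_i}{(\alpha + \beta x_i)(1 - \alpha - \beta x_i)},$$

$$\frac{\partial L(\alpha, \beta)}{\partial\beta} = \sum_{i=1}^{n}\left[\frac{y_i x_i}{\alpha + \beta x_i} - \frac{(1 - y_i)x_i}{1 - \alpha - \beta x_i}\right] = \sum_{i=1}^{n}\frac{x_i(y_i - \alpha - \beta x_i)}{(\alpha + \beta x_i)(1 - \alpha - \beta x_i)}.$$

The likelihood equations are then $\partial L(\alpha, \beta)/\partial\alpha = 0$ and $\partial L(\alpha, \beta)/\partial\beta = 0$. Their solution (e.g. obtained numerically with an algorithm) is the ML estimate for $\alpha$ and $\beta$.

c. Asymptotic variance matrix for the parameter estimates: the vector of ML estimates $(\hat\alpha, \hat\beta)$ is asymptotically normally distributed with mean $(\alpha, \beta)$ and covariance matrix $I(\alpha, \beta)^{-1}$, where $I(\alpha, \beta)$ is the information matrix

$$I(\alpha, \beta) = -\begin{pmatrix} E\,\dfrac{\partial^2 L(\alpha, \beta)}{\partial\alpha^2} & E\,\dfrac{\partial^2 L(\alpha, \beta)}{\partial\alpha\,\partial\beta} \\[2ex] E\,\dfrac{\partial^2 L(\alpha, \beta)}{\partial\alpha\,\partial\beta} & E\,\dfrac{\partial^2 L(\alpha, \beta)}{\partial\beta^2} \end{pmatrix}.$$

The information matrix, and therefore the asymptotic covariance matrix, contains the unknown parameters $(\alpha, \beta)$. Replacing them by their ML estimates $(\hat\alpha, \hat\beta)$ leads to an estimated covariance matrix which can be used for constructing CIs.

d. Advantage: results from the analysis are on the probability scale and hence directly interpretable in terms of probabilities or differences of probabilities without further transformation.
Disadvantage: estimated/predicted probabilities are not necessarily between 0 and 1.

Problem 4

a. Maximum likelihood estimates: $\hat\alpha = \ldots$ and $\hat\beta_W = \ldots$. Further, $e^{\hat\beta_W} \approx 0.996$. With each gram of increased weight, the odds of developing BPD are reduced by 0.4%. (A better interpretation: with each 100 gram increase in weight, the odds of BPD decrease by 34%, since $e^{100\hat\beta_W} \approx 0.66$.)

b. Let $\pi_0$ be the proportion of interest. Then

$$\operatorname{logit}(\hat\pi_0) = \hat\alpha + \hat\beta_W \cdot 1000 \approx -0.2, \qquad \hat\pi_0 = \frac{e^{-0.2}}{1 + e^{-0.2}} \approx 0.45.$$

The predicted probability that a newborn baby with weight 1000 gram develops BPD by day 28 of life is 45%.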
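The closed-form derivatives in Problem 3 b can be sanity-checked against central finite differences of the loglikelihood. The data and the evaluation point below are hypothetical and chosen only so that $0 < \alpha + \beta x_i < 1$ holds for every observation; the sketch checks the algebra and does not solve the likelihood equations.

```python
from math import log

# Hypothetical data; with a0 = 0.2 and b0 = 0.5 every p_i lies in (0, 1).
x = [0.1, 0.4, 0.5, 0.8]
y = [0, 1, 0, 1]

def loglik(a, b):
    """Loglikelihood of the linear probability model p_i = a + b * x_i."""
    return sum(
        yi * log(a + b * xi) + (1 - yi) * log(1 - a - b * xi)
        for xi, yi in zip(x, y)
    )

def score(a, b):
    """Closed-form partial derivatives with respect to a and b."""
    terms = [
        (yi - a - b * xi) / ((a + b * xi) * (1 - a - b * xi))
        for xi, yi in zip(x, y)
    ]
    return sum(terms), sum(xi * t for xi, t in zip(x, terms))

def numeric_score(a, b, h=1e-6):
    """Central finite-difference approximation of the same derivatives."""
    da = (loglik(a + h, b) - loglik(a - h, b)) / (2 * h)
    db = (loglik(a, b + h) - loglik(a, b - h)) / (2 * h)
    return da, db

a0, b0 = 0.2, 0.5
analytic = score(a0, b0)
numeric = numeric_score(a0, b0)
print(analytic, numeric)
```

Agreement between the two pairs confirms that the simplified numerator $y_i - \alpha - \beta x_i$ is consistent with the unsimplified form.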

c. Variance of the linear predictor:

$$V(\operatorname{logit}(\hat\pi_0)) = V(\hat\alpha + \hat\beta_W \cdot 1000) = V(\hat\alpha) + 2 \cdot 1000 \cdot \mathrm{Cov}(\hat\alpha, \hat\beta_W) + 1000^2\,V(\hat\beta_W) = \ldots$$

95% Wald CI for $\operatorname{logit}(\pi_0)$: $-0.2 \pm 1.96\sqrt{V(\operatorname{logit}(\hat\pi_0))}$; CI $[-0.53, 0.13]$.
95% Wald CI for $\pi_0$: $[e^{-0.53}/(1 + e^{-0.53}),\ e^{0.13}/(1 + e^{0.13})] = [0.37, 0.53]$.

d. Estimates: $\hat\beta_W = \ldots$, $\hat\beta_A = \ldots$ and $e^{\hat\beta_W} \approx 0.9976$, $e^{\hat\beta_A} \approx 0.67$. With each gram of increased weight, the odds of developing BPD are reduced by 0.24% ($1 - e^{\hat\beta_W}$) if age is kept constant. With each week of increased age, the odds of developing BPD are reduced by 33% if weight is kept constant. The estimate for $\beta_W$ in model (W+A) is around half of the estimate in model (W). Since weight and age are correlated, the addition of age as an explanatory variable reduces the estimate for $\beta_W$. However, weight is still important in this model (it is a significant variable in model (W+A)).

e. Definition of AIC: AIC $= -2\,$(maximized loglikelihood $-$ number of parameters in the model). Based on AIC, (W+A) is preferable over (W), since AIC $= 215.2$ for (W+A), which is smaller than AIC $= 227.7$ for (W).

LR test statistic: $-2(L_0 - L_1)$, where $L_i$ are the maximized loglikelihoods of the two models. The test statistic is approximately $\chi^2$-distributed with 1 degree of freedom (1 = difference in the number of parameters between the two models). Since the statistic exceeds $\chi^2_{1;0.95} = 3.84$, we can reject $H_0$ that (W) holds. Therefore it is justified to use (W+A).

Problem 5

Denote school by S, gender by G and age group by A, and the frequencies in the table by $n_{sga}$, $s = 1, 2$, $g = 1, 2$, $a = 1, 2, 3$.

a. Saturated loglinear model for the expected number of students:

$$\log\mu_{sga} = \lambda + \lambda^S_s + \lambda^G_g + \lambda^A_a + \lambda^{SG}_{sg} + \lambda^{SA}_{sa} + \lambda^{GA}_{ga} + \lambda^{SGA}_{sga}.$$

The parameters with at least one of $s = 2$, $g = 2$ or $a = 3$ are set to 0 to avoid over-parametrisation. The remaining parameters not set to 0 are: $\lambda$, $\lambda^S_1$, $\lambda^G_1$, $\lambda^A_1$, $\lambda^A_2$, $\lambda^{SG}_{11}$, $\lambda^{SA}_{11}$, $\lambda^{SA}_{12}$, $\lambda^{GA}_{11}$, $\lambda^{GA}_{12}$, $\lambda^{SGA}_{111}$, $\lambda^{SGA}_{112}$. These are 12 parameters (= number of cells in the 2x2x3 contingency table).

b. Age is jointly independent of both gender and school. Model (SG, A):

$$\log\mu_{sga} = \lambda + \lambda^S_s + \lambda^G_g + \lambda^A_a + \lambda^{SG}_{sg}.$$
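The arithmetic in Problem 4 b, c and e can be reproduced with a brief sketch. The back-transformation uses only the logit values quoted in the solution. The likelihood ratio statistic is recovered from the two AIC values under the assumption, not stated explicitly in the solution, that model (W) has 2 parameters (intercept and weight) and model (W+A) has 3.

```python
from math import exp

def expit(eta):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1 / (1 + exp(-eta))

# Problem 4 b and c: point estimate and back-transformed Wald limits.
pi_hat = expit(-0.2)
ci_pi = (expit(-0.53), expit(0.13))

# Problem 4 e: AIC = -2 * L + 2 * k, so -2 * L = AIC - 2 * k.
aic_w, aic_wa = 227.7, 215.2
k_w, k_wa = 2, 3  # ASSUMED parameter counts of models (W) and (W+A)
lr_stat = (aic_w - 2 * k_w) - (aic_wa - 2 * k_wa)  # equals -2 * (L0 - L1)

chi2_1_095 = 3.841  # 0.95 quantile of chi-square with 1 degree of freedom
reject_w = lr_stat > chi2_1_095

print(f"pi_hat={pi_hat:.2f} CI=({ci_pi[0]:.2f}, {ci_pi[1]:.2f}) LR={lr_stat:.1f}")
```

Because the AIC values are rounded to one decimal, the recovered statistic is only accurate to about that precision.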

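c. In the saturated model, the ML estimate is the observation, i.e. $\hat\mu_{111} = n_{111} = \ldots$. In model (SG, A), $\hat\mu_{111} = n_{11+}\,n_{++1}/n = \ldots$ (see also Table 5 in Appendix B).

d. Likelihood ratio test:

$$G^2 = 2\sum_{s=1}^{2}\sum_{g=1}^{2}\sum_{a=1}^{3} n_{sga}\log(n_{sga}/\hat\mu_{sga}),$$

with $\hat\mu_{sga} = n_{sg+}\,n_{++a}/n$, which is available in Table 5. Consequently, $G^2 = \ldots$. Asymptotically, $G^2$ is $\chi^2$-distributed with $12 - 6 = 6$ degrees of freedom (the saturated model has 12 parameters; the (SG, A) model has 6 parameters). According to the table of the $\chi^2$-distribution, $G^2 > \chi^2_{6;0.95} = 12.59$, and therefore the hypothesis that the (SG, A) model holds can be rejected. Age group can therefore not be considered jointly independent of school and gender.

e. Consider now $a = 1, 2$ only. The equivalent logistic regression models have age group as the dependent variable and school and gender as explanatory variables. For the loglinear model in a. (saturated model):

$$\log\frac{P(A = 1)}{P(A = 2)} = \log\frac{\mu_{sg1}}{\mu_{sg2}} = \left(\lambda + \lambda^S_s + \lambda^G_g + \lambda^A_1 + \lambda^{SG}_{sg} + \lambda^{SA}_{s1} + \lambda^{GA}_{g1} + \lambda^{SGA}_{sg1}\right) - \left(\lambda + \lambda^S_s + \lambda^G_g + \lambda^A_2 + \lambda^{SG}_{sg} + \lambda^{SA}_{s2} + \lambda^{GA}_{g2} + \lambda^{SGA}_{sg2}\right)$$

$$= \underbrace{(\lambda^A_1 - \lambda^A_2)}_{\beta} + \underbrace{(\lambda^{SA}_{s1} - \lambda^{SA}_{s2})}_{\beta^S_s} + \underbrace{(\lambda^{GA}_{g1} - \lambda^{GA}_{g2})}_{\beta^G_g} + \underbrace{(\lambda^{SGA}_{sg1} - \lambda^{SGA}_{sg2})}_{\beta^{SG}_{sg}} = \beta + \beta^S_s + \beta^G_g + \beta^{SG}_{sg},$$

and for the model in b. (gender and school jointly independent of age):

$$\log\frac{P(A = 1)}{P(A = 2)} = \log\frac{\mu_{sg1}}{\mu_{sg2}} = \left(\lambda + \lambda^S_s + \lambda^G_g + \lambda^A_1 + \lambda^{SG}_{sg}\right) - \left(\lambda + \lambda^S_s + \lambda^G_g + \lambda^A_2 + \lambda^{SG}_{sg}\right) = \underbrace{(\lambda^A_1 - \lambda^A_2)}_{\beta} = \beta.$$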
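The fitted values and the $G^2$ statistic for the (SG, A) model in Problem 5 c and d can be sketched as follows. The exam's frequency table is not reproduced in this text, so the counts below are hypothetical placeholders; only the formulas $\hat\mu_{sga} = n_{sg+}\,n_{++a}/n$ and $G^2 = 2\sum n_{sga}\log(n_{sga}/\hat\mu_{sga})$ are taken from the solution.

```python
from math import log

# HYPOTHETICAL counts n[s][g][a] for a 2 x 2 x 3 table (school, gender, age).
n = [
    [[20, 15, 10], [18, 12, 14]],
    [[25, 11, 9], [16, 20, 30]],
]
S, G, A = range(2), range(2), range(3)

total = sum(n[s][g][a] for s in S for g in G for a in A)
n_sg = [[sum(n[s][g]) for g in G] for s in S]               # n_{sg+}
n_a = [sum(n[s][g][a] for s in S for g in G) for a in A]    # n_{++a}

# Fitted values under (SG, A): mu_sga = n_{sg+} * n_{++a} / n.
mu = [[[n_sg[s][g] * n_a[a] / total for a in A] for g in G] for s in S]

# Likelihood ratio statistic against the saturated model.
g2 = 2 * sum(
    n[s][g][a] * log(n[s][g][a] / mu[s][g][a])
    for s in S for g in G for a in A
)
df = 12 - 6              # saturated parameters minus (SG, A) parameters
chi2_6_095 = 12.592      # 0.95 quantile of chi-square with 6 degrees of freedom
reject = g2 > chi2_6_095

print(f"G2={g2:.2f} on {df} df, reject (SG, A): {reject}")
```

Replacing the placeholder counts with the table from the exam would give the statistic referred to in d.; the sketch assumes all cell counts are positive so that the logarithms are defined.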
