NATIONAL UNIVERSITY OF SINGAPORE

EXAMINATION

Categorical Data Analysis
(Semester II: 2010-2011)

April/May, 2011          Time Allowed: 2 Hours

Matriculation No:                    Seat No:

Grade Table

Question        1    2    3    4    5    6    Total
Full marks     10   28   15   20   15   12     100
Earned marks

INSTRUCTIONS TO CANDIDATES

1. This examination paper contains SIX (6) questions and comprises TWELVE (12) printed pages.
2. Answer ALL the questions for a TOTAL of 100 marks.
3. Read the questions CAREFULLY.
4. All NOTATIONS used here are the same as those used in the lecture notes.
5. Write your answers NEATLY following the associated questions.
6. This is a Closed textbook, Closed notes examination, but calculators are allowed.
7. Candidates may bring in TWO A4 size (210 × 297 mm) help sheets.
Page 2

1. [10 pts, each 1 pt] Circle T or F for each of the statements.

(1) [T, F] To test for independence in two-way contingency tables, likelihood ratio tests and Pearson's $\chi^2$ tests are equivalent for small sample sizes.

(2) [T, F] Fisher's exact test uses the negative binomial distribution to compute p-values.

(3) [T, F] Diagnosis of type of mental illness (schizophrenia, neurosis, depression) is an ordinal variable.

(4) [T, F] If the odds of success in a binary response are 0.5, the probability of success is 0.25.

(5) [T, F] Suppose that $P(Y_i = 1) = 1 - P(Y_i = 0) = 0.2$, $i = 1, \ldots, n$, where the $Y_i$'s are independent. Let $Y = \sum_{i=1}^{50} Y_i$. Then the distribution of $Y$ is Binomial with mean 10.

(6) [T, F] The test of independence for a linear trend alternative cannot be used for nominal categorical data.

(7) [T, F] In a logistic regression model, $\text{logit}[\pi(x)] = \alpha + \beta x$, $e^{\alpha}$ equals the odds of success when $x = 1$.

(8) [T, F] In a logit model $\text{logit}[\pi(x)] = \alpha + \beta x$, the probability increases at the rate of $0.16\beta$ when $\pi(x) = 0.4$.

(9) [T, F] A classical linear regression model with errors having a normal distribution is a special case of a generalized linear model with probit link.

(10) [T, F] Fitting a saturated model often results in nonzero residual deviance.
Page 3

2. [28 pts] For a study using logistic regression to examine the data on rheumatoid arthritis, we consider the age of the patient as the predictor variable. The response Y measures whether the patient showed any improvement at all (1 = yes). The following computer output reports the fit of a logistic regression model using age to predict the probability of improvement.

Model Fit Statistics

                 Intercept    Intercept and
Criterion        Only         Covariates
-2 Log L         116.449      109.164

                          Standard    Wald
Parameter   DF  Estimate  Error       Chi-Square   Pr > ChiSq
Intercept    1  -2.6421   1.0732      6.0611       0.0138
age          1   0.0492   0.0194      6.4733       0.0110

Odds Ratio Estimates

          Point       95% Wald
Effect    Estimate    Confidence Limits
age       1.050       1.011    1.091

Estimated Covariance Matrix

Parameter    Intercept    age
Intercept     1.15169    -0.02030
age          -0.02030     0.00038

The 90th, 95th, 97.5th and 99.5th percentiles of the standard normal distribution are 1.28, 1.645, 1.96, and 2.576 respectively.

(a) Find the rates of change in the predicted probability of improvement when age = 25 and when the estimated probability of improvement is 0.3, respectively. [8 pts]
Page 4

(b) Find the age at which the estimated probability of improvement is 0.3. [4 pts]

(c) Obtain a 95% confidence interval for the true odds ratio of improvement for a half-year increase in age. [6 pts]

(d) Obtain a 95% confidence interval for the probability of improvement at age = 25. [10 pts]
Page 5

3. [15 pts] The following table was taken from the 1991 General Social Survey.

                  Party Identification
Race      Democrat   Independent   Republican   Total
White       341         105           405        851
Black       103          15            11        129
Total       444         120           416        980

## Final Examination Q3 R code and output
Racew <- c(1,1,1,0,0,0)    # White=1; Black=0
PartyD <- c(1,0,0,1,0,0)   # Democrat=1; others 0
PartyI <- c(0,1,0,0,1,0)   # Independent=1; others 0
count <- c(341,105,405,103,15,11)
RacewPartyD <- Racew*PartyD
RacewPartyI <- Racew*PartyI
fit <- glm(count ~ Racew + PartyD + RacewPartyD + RacewPartyI, family = poisson(link = "log"))
summary(fit)

#### R outputs
Call:
glm(formula = count ~ Racew + PartyD + RacewPartyD + RacewPartyI,
    family = poisson(link = "log"))

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)    2.5649     0.1961  13.079   <2e-16 ***
Racew          3.4389     0.2023  16.998   <2e-16 ***
PartyD         2.0698     0.2195   9.430   <2e-16 ***
RacewPartyD   -2.2418     0.2315  -9.686   <2e-16 ***
RacewPartyI   -1.3499     0.1095 -12.327   <2e-16 ***

Null deviance: 918.8
Residual deviance: 0.61784
Page 6

Let X and Y denote the race and party respectively. The 95th percentiles of the $\chi^2$ distribution with 1, 2, 3, 4, 5, 6 degrees of freedom are 3.841, 5.99, 7.81, 9.49, 11.07, and 12.59 respectively. Based on the R code and R output,

(a) Write down the loglinear regression model and identify the associated estimates. [3 pts]

(b) Compute all the estimated cell counts. [6 pts]

(c) Comment on whether the intercept-only model fits the data well. [3 pts]

(d) Comment on whether the loglinear model fits the data well. [3 pts]
Page 7

4. [20 pts] The following table is taken from Lecture 8.

Alcohol, Cigarette and Marijuana Use for High School Seniors

                              Marijuana Use
Alcohol Use   Cigarette Use    Yes     No
Yes           Yes              911     538
              No                44     456
No            Yes                3      43
              No                 2     279

## Final Examination Q4 R code and output
A <- c(1,1,1,1,0,0,0,0)   ## 1--Alcohol use; 0--otherwise
C <- c(1,1,0,0,1,1,0,0)   ## 1--Cigarette use; 0--otherwise
M <- c(1,0,1,0,1,0,1,0)   ## 1--Marijuana use; 0--otherwise
count <- c(911,538,44,456,3,43,2,279)
AC <- A*C
AM <- A*M
CM <- C*M
ACM <- A*C*M
## Model (AM, CM, AC) fit
drug.log <- glm(count ~ A + C + M + AM + CM + AC, family = poisson(link = "log"))
summary(drug.log)

## output
Call:
glm(formula = count ~ A + C + M + AM + CM + AC, family = poisson(link = "log"))

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   5.63342    0.05970  94.361  < 2e-16 ***
A             0.48772    0.07577   6.437 1.22e-10 ***
C            -1.88667    0.16270 -11.596  < 2e-16 ***
M            -5.30904    0.47520 -11.172  < 2e-16 ***
AM            2.98601    0.46468   6.426 1.31e-10 ***
CM            2.84789    0.16384  17.382  < 2e-16 ***
AC            2.05453    0.17406  11.803  < 2e-16 ***

Null deviance: 2851.46098
Residual deviance: 0.37399

## Estimated covariance matrix between AM and CM
              AM            CM
AM   0.215925578  -0.004968391
CM  -0.004968391   0.026843349
Page 8

Let X, Y and Z denote the variables Alcohol, Cigarette and Marijuana use respectively. The 90th, 95th, 97.5th and 99.5th percentiles of the standard normal distribution are 1.28, 1.645, 1.96, and 2.576 respectively. Based on the R code and R output,

(a) Write down the loglinear regression model and identify the associated estimates. [5 pts]

(b) Compute the estimated odds ratio between any two variables of Alcohol, Cigarette, and Marijuana use, controlling for the third variable. [3 pts]
Page 9

(c) Construct the 95% confidence interval for the true odds ratio between Alcohol and Cigarette use, controlling for Marijuana use. [4 pts]

(d) Test whether the true odds ratio between Alcohol and Marijuana use, controlling for Cigarette use, equals the true odds ratio between Cigarette and Marijuana use, controlling for Alcohol use, at the $\alpha = 5\%$ level. [8 pts]
Page 10

5. [15 pts] Consider a three-way contingency table with categorical variables X having 2 categories, Y having 2 categories and Z having $K \geq 2$ categories. (Hint: To show A if and only if B, you need to show both that A implies B and that B implies A.)

(a) Show that the loglinear model $(XY, XZ, YZ)$ holds if and only if X and Y have homogeneous association controlling for Z. [9 pts]
Page 11

(b) Show that the loglinear model $(XZ, YZ)$ holds if and only if X and Y are conditionally independent controlling for Z. [6 pts]
Page 12

6. [12 pts]

(a) Let $P(Y = 1) = 1 - P(Y = 0) = p$. For the population of subjects having $Y = j$, X has a probability density function

$$f_j(x) = \lambda_j \exp(-\lambda_j x), \quad x \geq 0, \quad j = 0, 1.$$

Show that $\pi(x) = P(Y = 1 \mid x)$ satisfies the logistic regression model with some $\alpha$ and $\beta$. [7 pts]

(b) For known $n \geq 2$, show that the negative binomial distribution with probability mass function

$$f(y \mid n, \mu) = \binom{y + n - 1}{n - 1} \left( \frac{n}{\mu + n} \right)^{n} \left( 1 - \frac{n}{\mu + n} \right)^{y}, \quad y = 0, 1, 2, \ldots$$

belongs to the exponential family of distributions. Find the natural parameter for this distribution. [5 pts]

- End of the Paper -