NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION (SOLUTIONS) ST3241 Categorical Data Analysis (Semester II: 2010/2011)


NATIONAL UNIVERSITY OF SINGAPORE
EXAMINATION (SOLUTIONS)

Categorical Data Analysis
(Semester II: 2010/2011)
April/May, 2011
Time Allowed: 2 Hours

Matriculation No:          Seat No:

Grade Table

Question        1    2    3    4    5    6   Total
Full marks     10   28   15   20   15   12     100
Earned marks

INSTRUCTIONS TO CANDIDATES

1. This examination paper contains SIX (6) questions and comprises TWELVE (12) printed pages.
2. Answer ALL the questions for a TOTAL of 100 marks.
3. Read the questions CAREFULLY.
4. All NOTATIONS used here are the same as those used in the lecture notes.
5. Write your answers NEATLY following the associated questions.
6. This is a closed-textbook, closed-notes examination, but calculators are allowed.
7. Candidates may bring in TWO A4-size (210 × 297 mm) help sheets.

1. [10 pts, 1 pt each] Circle T or F for each of the statements.

(1) [F] To test for independence in two-way contingency tables, likelihood ratio tests and Pearson's $\chi^2$ tests are equivalent for small sample sizes.

(2) [F] Fisher's exact test uses the negative binomial distribution to compute p-values.

(3) [F] Diagnosis of type of mental illness (schizophrenia, neurosis, depression) is an ordinal variable.

(4) [F] If the odds of success in a binary response is 0.5, the probability of success is

(5) [T] Suppose that $P(Y_i = 1) = 1 - P(Y_i = 0) = 0.2$, $i = 1, \ldots, n$, where the $Y_i$'s are independent. Let $Y = \sum_{i=1}^{50} Y_i$. Then the distribution of $Y$ is Binomial with mean 10.

(6) [T] The test of independence against a linear trend alternative cannot be used for nominal categorical data.

(7) [F] In a logistic regression model, $\mathrm{logit}[\pi(x)] = \alpha + \beta x$, $e^{\alpha}$ equals the odds of success when $x = 1$.

(8) [F] In a logit model $\mathrm{logit}[\pi(x)] = \alpha + \beta x$, the probability increases at the rate of $0.16\beta$ when $\pi(x) = 0.4$.

(9) [F] A classical linear regression model with normally distributed errors is a special case of a generalized linear model with probit link.

(10) [F] Fitting a saturated model often results in nonzero residual deviance.
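The numerical statements (4), (5) and (8) can be checked with a few lines of arithmetic. The following is only a sketch of why the stated answers are plausible; the variable names are illustrative and are not part of the exam.

```r
# Statement (4): odds of 0.5 correspond to probability odds / (1 + odds).
odds <- 0.5
prob_from_odds <- odds / (1 + odds)      # equals 1/3

# Statement (5): a sum of 50 independent Bernoulli(0.2) variables is
# Binomial(50, 0.2), whose mean is n * p.
binom_mean <- 50 * 0.2                   # equals 10

# Statement (8): for logit[pi(x)] = alpha + beta * x, the slope of pi(x)
# is beta * pi(x) * (1 - pi(x)); this is the multiplier of beta at pi = 0.4.
slope_multiplier <- 0.4 * (1 - 0.4)      # equals 0.24, not 0.16

print(c(prob_from_odds, binom_mean, slope_multiplier))
```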

2. [28 pts] For a study using logistic regression to examine data on rheumatoid arthritis, we consider the age of the patient as the predictor variable. The response $Y$ records whether the patient showed any improvement at all (1 = yes). The following computer output is for a logistic regression model using age to predict the probability of improvement.

Model Fit Statistics

Criterion     Intercept Only     Intercept and Covariates
-2 Log L

Parameter     DF     Estimate     Standard Error     Wald Chi-Square     Pr > ChiSq
Intercept
age

Odds Ratio Estimates

Effect     Point Estimate     95% Wald Confidence Limits
age

Estimated Covariance Matrix

Parameter     Intercept     age
Intercept
age

The 90, 95, 97.5 and 99.5-th percentiles of the standard normal distribution are 1.28, 1.645, 1.96 and 2.576 respectively.

(a) Find the rates of change in the predicted probability of improvement when age = 25 and when the estimated probability of improvement is 0.3, respectively. [8 pts]

Solution: The estimated probability of improvement at age = 25 is

$$\hat\pi(25) = \frac{\exp(\ )}{1 + \exp(\ )} = \qquad \text{(2 pts)}$$

Then at age 25, the rate of change in the estimated probability is

$$\hat\beta\,\hat\pi(25)\,(1 - \hat\pi(25)) = \ (\ ) = \qquad \text{(3 pts)}$$

When the estimated probability of improvement is 0.3, the rate of change in the estimated probability is

$$\hat\beta \times 0.3 \times (1 - 0.3) = 0.0492 \times 0.3 \times (1 - 0.3) = 0.0103. \qquad \text{(3 pts)}$$
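The rate-of-change rule used in part (a) can be sketched in R as follows. Only $\hat\beta = 0.0492$ comes from the solutions (it is quoted in part (c)); the function name `logistic_slope` and the intercept `alpha_hat` are hypothetical placeholders, since the intercept estimate did not survive in the output table above.

```r
# Slope of a fitted logistic curve at a given fitted probability:
# d pi / d x = beta * pi * (1 - pi).
logistic_slope <- function(beta_hat, pi_hat) {
  beta_hat * pi_hat * (1 - pi_hat)
}

beta_hat <- 0.0492                    # age coefficient quoted in part (c)

# Rate of change when the estimated probability is 0.3.
slope_at_p03 <- logistic_slope(beta_hat, 0.3)
print(round(slope_at_p03, 4))

# Rate of change at age 25. alpha_hat is a HYPOTHETICAL placeholder,
# not the intercept from the exam output.
alpha_hat <- -2.5
pi_25 <- plogis(alpha_hat + beta_hat * 25)
slope_at_age25 <- logistic_slope(beta_hat, pi_25)
print(slope_at_age25)
```

Replacing `alpha_hat` with the reported intercept would give the first answer requested in part (a).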

(b) Find the age at which the estimated probability of improvement is 0.3. [4 pts]

Solution: Setting

$$\hat\pi(\text{Age}) = \frac{\exp(\hat\alpha + \hat\beta x)}{1 + \exp(\hat\alpha + \hat\beta x)} = 0.3 \qquad \text{(2 pts)}$$

gives

$$\text{Age} = \frac{\log(0.3/0.7) - \hat\alpha}{\hat\beta} \ \text{(1 pt)} \ = (\ )/\ \ = \qquad \text{(1 pt)}$$

(c) Obtain a 95% confidence interval for the true odds ratio of improvement for a half-year increase in age. [6 pts]

Solution: The 95% confidence interval for $0.5\beta$ (1 pt) is

$$0.5\,(\hat\beta \pm z_{0.025}\,\mathrm{ASE}(\hat\beta)) = 0.5\,(0.0492 \pm \ ) = (\ ,\ ). \qquad \text{(3 pts)}$$

Thus, the 95% confidence interval for the true odds ratio $\exp(0.5\beta)$ is

$$(\exp(\ ),\ \exp(\ )) = (\ ,\ ). \qquad \text{(2 pts)}$$

(d) Obtain a 95% confidence interval for the probability of improvement at age = 25. [10 pts]

Solution: The estimated linear predictor at age 25 is

$$\hat\alpha + 25\hat\beta = \qquad \text{(1 pt)}$$

and its estimated variance is

$$\widehat{\mathrm{Var}}(\hat\alpha) + 25^2\,\widehat{\mathrm{Var}}(\hat\beta) + 2 \times 25\,\widehat{\mathrm{Cov}}(\hat\alpha, \hat\beta) \ \text{(1 pt)} \ = \ (\ ) = 0.3742. \qquad \text{(2 pts)}$$

Therefore, the estimated ASE of the linear predictor is $\sqrt{0.3742} = 0.6117$ (1 pt). So a 95% confidence interval for the true linear predictor is

$$\ \pm 1.96 \times 0.6117 = (\ ,\ ). \qquad \text{(2 pts)}$$

Therefore, a 95% confidence interval for the true probability at age 25 is

$$\left( \frac{\exp(\ )}{1 + \exp(\ )},\ \frac{\exp(\ )}{1 + \exp(\ )} \right) = (0.0684,\ ). \qquad \text{(3 pts)}$$
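The three calculations in parts (b) to (d) follow a common pattern: work on the logit scale, then transform back. A minimal sketch in R is given below. The helper names (`age_at_prob`, `or_ci`, `prob_ci`) are illustrative, and apart from $\hat\beta = 0.0492$ every numeric input is a hypothetical placeholder, because the estimates and covariance matrix from the output are not available here.

```r
# (b) Age at which the fitted probability equals a target value.
age_at_prob <- function(alpha_hat, beta_hat, target) {
  (qlogis(target) - alpha_hat) / beta_hat
}

# (c) Wald interval for the odds ratio for a change of `delta` in x.
or_ci <- function(beta_hat, se_beta, delta = 1, z = 1.96) {
  exp(delta * (beta_hat + c(-1, 1) * z * se_beta))
}

# (d) Wald interval for pi(x0): interval on the logit scale, back-transformed.
prob_ci <- function(alpha_hat, beta_hat, x0, var_alpha, var_beta, cov_ab,
                    z = 1.96) {
  eta <- alpha_hat + beta_hat * x0
  se  <- sqrt(var_alpha + x0^2 * var_beta + 2 * x0 * cov_ab)
  plogis(eta + c(-1, 1) * z * se)
}

# HYPOTHETICAL inputs (only beta_hat = 0.0492 is taken from the solutions).
alpha_hat <- -2.5
beta_hat  <- 0.0492
age_03 <- age_at_prob(alpha_hat, beta_hat, 0.3)
ci_or  <- or_ci(beta_hat, se_beta = 0.02, delta = 0.5)
ci_pi  <- prob_ci(alpha_hat, beta_hat, x0 = 25,
                  var_alpha = 1.0, var_beta = 4e-04, cov_ab = -0.018)
print(age_03); print(ci_or); print(ci_pi)
```

Back-transforming the endpoints keeps the interval for the probability inside (0, 1), which a direct Wald interval on the probability scale would not guarantee.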

3. [15 pts] The following table was taken from the 1991 General Social Survey.

                   Party Identification
Race      Democrat   Independent   Republican   Total
White        341         105           405        851
Black        103          15            11        129
Total        444         120           416        980

Final Examination Q3: R code and output

    Racew  <- c(1,1,1,0,0,0)    # White = 1; Black = 0
    PartyD <- c(1,0,0,1,0,0)    # Democrat = 1; others 0
    PartyI <- c(0,1,0,0,1,0)    # Independent = 1; others 0
    count  <- c(341,105,405,103,15,11)
    RacewPartyD <- Racew*PartyD
    RacewPartyI <- Racew*PartyI
    fit <- glm(count ~ Racew + PartyD + RacewPartyD + RacewPartyI,
               family = poisson(link = "log"))
    summary(fit)

R output:

    Call:
    glm(formula = count ~ Racew + PartyD + RacewPartyD + RacewPartyI,
        family = poisson(link = "log"))

    Coefficients:
                 Estimate   Std. Error   z value   Pr(>|z|)
    (Intercept)                                     <2e-16 ***
    Racew                                           <2e-16 ***
    PartyD                                          <2e-16 ***
    RacewPartyD                                     <2e-16 ***
    RacewPartyI                                     <2e-16 ***

    Null deviance:
    Residual deviance:

Let $X$ and $Y$ denote race and party respectively. The 95-th percentiles of the $\chi^2$-distribution with 1, 2, 3, 4, 5, 6 degrees of freedom are 3.841, 5.99, 7.81, 9.49, 11.07 and 12.59 respectively. Based on the R code and R output,

(a) Write down the loglinear regression model and identify the associated estimates. [3 pts]

Solution: The loglinear regression model can be written as

$$\log(\mu_{ij}) = \lambda + \lambda^X_i + \lambda^Y_j + \lambda^{XY}_{ij}, \quad i = 1, 2;\ j = 1, 2, 3. \qquad \text{(1 pt)}$$

Based on the R code, the first scheme of constraints was used. The estimated parameters are

$\hat\lambda = 2.5649$, $\hat\lambda^X_1 = \ $, $\hat\lambda^X_2 = 0$,
$\hat\lambda^Y_1 = \ $, $\hat\lambda^Y_2 = 0$, $\hat\lambda^Y_3 = 0$,
$\hat\lambda^{XY}_{11} = \ $, $\hat\lambda^{XY}_{12} = \ $, $\hat\lambda^{XY}_{13} = \hat\lambda^{XY}_{21} = \hat\lambda^{XY}_{22} = \hat\lambda^{XY}_{23} = 0$. (2 pts)

(b) Compute all the estimated cell counts. [6 pts]

Solution:

$\hat\mu_{11} = \exp(\hat\lambda + \hat\lambda^X_1 + \hat\lambda^Y_1 + \hat\lambda^{XY}_{11}) = \exp(\ ) = $
$\hat\mu_{12} = \exp(\hat\lambda + \hat\lambda^X_1 + \hat\lambda^Y_2 + \hat\lambda^{XY}_{12}) = \exp(\ ) = $
$\hat\mu_{13} = \exp(\hat\lambda + \hat\lambda^X_1 + \hat\lambda^Y_3 + \hat\lambda^{XY}_{13}) = \exp(\ ) = $
$\hat\mu_{21} = \exp(\hat\lambda + \hat\lambda^X_2 + \hat\lambda^Y_1 + \hat\lambda^{XY}_{21}) = \exp(\ ) = $
$\hat\mu_{22} = \exp(\hat\lambda + \hat\lambda^X_2 + \hat\lambda^Y_2 + \hat\lambda^{XY}_{22}) = \exp(2.5649) = 13.00$
$\hat\mu_{23} = \exp(\hat\lambda + \hat\lambda^X_2 + \hat\lambda^Y_3 + \hat\lambda^{XY}_{23}) = \exp(2.5649) = 13.00$
(1 pt each)

(c) Comment on whether the intercept-only model fits the data well. [3 pts]

Solution: From the R output, the non-intercept coefficients are highly significant and hence are unlikely to be 0 (2 pts). Thus, the intercept-only model, which assumes that the non-intercept coefficients are 0, cannot fit the data well. (1 pt)

OR

From the R output, the null deviance is   (1 pt). The null deviance follows a $\chi^2$-distribution with $6 - 1 = 5$ degrees of freedom (1 pt). The 95-th percentile of $\chi^2_5$ is 11.07, which is much smaller than the null deviance. Thus, the intercept-only model does not fit the data well (1 pt).

(d) Comment on whether the loglinear model fits the data well. [3 pts]

Solution: From the R output, the residual deviance is   (1 pt). The residual deviance follows a $\chi^2$-distribution with $6 - 5 = 1$ degree of freedom (1 pt). The 95-th percentile of $\chi^2_1$ is 3.841, which is much larger than the residual deviance. Thus, the loglinear regression model fits the data well. (1 pt)
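The fit in Question 3 can be reproduced from the counts given in the R code, which also recovers the fitted cell counts and the two deviance comparisons. The sketch below uses only base R; the object names other than those in the exam's own code (`fitted_counts`, `res_dev`, `crit_res`, and so on) are illustrative.

```r
# Data and indicator coding exactly as in the Q3 R code.
count  <- c(341, 105, 405, 103, 15, 11)
Racew  <- c(1, 1, 1, 0, 0, 0)   # White = 1; Black = 0
PartyD <- c(1, 0, 0, 1, 0, 0)   # Democrat = 1; others 0
PartyI <- c(0, 1, 0, 0, 1, 0)   # Independent = 1; others 0
RacewPartyD <- Racew * PartyD
RacewPartyI <- Racew * PartyI

fit <- glm(count ~ Racew + PartyD + RacewPartyD + RacewPartyI,
           family = poisson(link = "log"))

# Part (b): fitted cell counts, in the order of `count`.
fitted_counts <- fitted(fit)
print(round(fitted_counts, 2))

# Parts (c) and (d): compare each deviance with the 95th chi-square percentile.
null_dev <- fit$null.deviance; null_df <- fit$df.null
res_dev  <- deviance(fit);     res_df  <- df.residual(fit)
crit_null <- qchisq(0.95, null_df)
crit_res  <- qchisq(0.95, res_df)
print(c(null_dev = null_dev, crit_null = crit_null))
print(c(res_dev = res_dev, crit_res = crit_res))
```

Because the model has no separate Independent effect for Black respondents, the two Black non-Democrat cells share one fitted mean, which is why the model has one residual degree of freedom.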

4. [20 pts] The following table is taken from Lecture 8.

Alcohol, Cigarette and Marijuana Use for High School Seniors

                                 Marijuana Use
Alcohol Use    Cigarette Use      Yes     No
Yes            Yes                911    538
               No                  44    456
No             Yes                  3     43
               No                   2    279

Final Examination Q4: R code and output

    A <- c(1,1,1,1,0,0,0,0)    ## 1 = Alcohol use; 0 = otherwise
    C <- c(1,1,0,0,1,1,0,0)    ## 1 = Cigarette use; 0 = otherwise
    M <- c(1,0,1,0,1,0,1,0)    ## 1 = Marijuana use; 0 = otherwise
    count <- c(911,538,44,456,3,43,2,279)
    AC <- A*C; AM <- A*M; CM <- C*M; ACM <- A*C*M
    ## Model (AM, CM, AC) fit
    drug.log <- glm(count ~ A + C + M + AM + CM + AC,
                    family = poisson(link = "log"))
    summary(drug.log)

Output:

    Call:
    glm(formula = count ~ A + C + M + AM + CM + AC,
        family = poisson(link = "log"))

    Coefficients:
                 Estimate   Std. Error   z value   Pr(>|z|)
    (Intercept)                                     < 2e-16 ***
    A                                                 e-10 ***
    C                                               < 2e-16 ***
    M                                               < 2e-16 ***
    AM                                              <1.31e-10 ***
    CM                                              < 2e-16 ***
    AC                                              < 2e-16 ***

    Null deviance:        Residual deviance:

    ## Estimated covariance matrix between AM and CM
          AM    CM
    AM
    CM

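Let $X$, $Y$ and $Z$ denote the variables Alcohol, Cigarette and Marijuana use respectively. The 90, 95, 97.5 and 99.5-th percentiles of the standard normal distribution are 1.28, 1.645, 1.96 and 2.576 respectively. Based on the R code and R output,

(a) Write down the loglinear regression model and identify the associated estimates. [5 pts]

Solution: The loglinear regression model can be written as

$$\log(\mu_{ijk}) = \lambda + \lambda^X_i + \lambda^Y_j + \lambda^Z_k + \lambda^{XY}_{ij} + \lambda^{XZ}_{ik} + \lambda^{YZ}_{jk}, \quad i = 1, 2;\ j = 1, 2;\ k = 1, 2. \qquad \text{(1 pt)}$$

Based on the R code, the first scheme of constraints was used. The estimated parameters are (4 pts)

$\hat\lambda = \ $,
$\hat\lambda^X_1 = \ $, $\hat\lambda^X_2 = 0$, $\hat\lambda^Y_1 = \ $, $\hat\lambda^Y_2 = 0$, $\hat\lambda^Z_1 = \ $, $\hat\lambda^Z_2 = 0$,
$\hat\lambda^{XY}_{11} = \ $, $\hat\lambda^{XZ}_{11} = 2.9860$, $\hat\lambda^{YZ}_{11} = \ $,
$\hat\lambda^{XY}_{12} = \hat\lambda^{XY}_{21} = \hat\lambda^{XY}_{22} = 0$, $\hat\lambda^{XZ}_{12} = \hat\lambda^{XZ}_{21} = \hat\lambda^{XZ}_{22} = 0$, $\hat\lambda^{YZ}_{12} = \hat\lambda^{YZ}_{21} = \hat\lambda^{YZ}_{22} = 0$.

(b) Compute the estimated odds ratio between any two of the variables Alcohol, Cigarette and Marijuana use, controlling for the third variable. [3 pts]

Solution: The loglinear model (XY, XZ, YZ) is the homogeneous association model, so the conditional odds ratio between any two variables is the same at every level of the third variable.

The estimated odds ratio between Alcohol and Cigarette use controlling for Marijuana use is

$$\exp(\hat\lambda^{XY}_{11} + \hat\lambda^{XY}_{22} - \hat\lambda^{XY}_{12} - \hat\lambda^{XY}_{21}) = \exp(\hat\lambda^{XY}_{11}) = \exp(\ ) = \qquad \text{(1 pt)}$$

The estimated odds ratio between Alcohol and Marijuana use controlling for Cigarette use is

$$\exp(\hat\lambda^{XZ}_{11} + \hat\lambda^{XZ}_{22} - \hat\lambda^{XZ}_{12} - \hat\lambda^{XZ}_{21}) = \exp(\hat\lambda^{XZ}_{11}) = \exp(2.9860) = \qquad \text{(1 pt)}$$

The estimated odds ratio between Cigarette and Marijuana use controlling for Alcohol use is

$$\exp(\hat\lambda^{YZ}_{11} + \hat\lambda^{YZ}_{22} - \hat\lambda^{YZ}_{12} - \hat\lambda^{YZ}_{21}) = \exp(\hat\lambda^{YZ}_{11}) = \exp(\ ) = \qquad \text{(1 pt)}$$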

(c) Construct the 95% confidence interval for the true odds ratio between Alcohol and Cigarette use controlling for Marijuana use. [4 pts]

Solution: The 95% confidence interval for the true log odds ratio between Alcohol and Cigarette use controlling for Marijuana use is

$$\hat\lambda^{XY}_{11} \pm 1.96\,\mathrm{ASE} = \ \pm \ = (1.7134,\ 2.3957). \qquad \text{(2 pts)}$$

Thus, the 95% confidence interval for the true odds ratio between Alcohol and Cigarette use controlling for Marijuana use is

$$(\exp(1.7134),\ \exp(2.3957)) = (5.5476,\ ). \qquad \text{(2 pts)}$$

(d) Test whether the true odds ratio between Alcohol and Marijuana use controlling for Cigarette use equals the true odds ratio between Cigarette and Marijuana use controlling for Alcohol use, at $\alpha = 5\%$. [8 pts]

Solution: Set $T = \lambda^{XZ}_{11} - \lambda^{YZ}_{11}$. The question is equivalent to testing $H_0: T = 0$ vs $H_1: T \neq 0$ (1 pt). The observed value is

$$\hat T = \hat\lambda^{XZ}_{11} - \hat\lambda^{YZ}_{11} = 0.13812. \qquad \text{(1 pt)}$$

In addition,

$$\widehat{\mathrm{Var}}(\hat T) = \widehat{\mathrm{Var}}(\hat\lambda^{XZ}_{11}) + \widehat{\mathrm{Var}}(\hat\lambda^{YZ}_{11}) - 2\,\widehat{\mathrm{Cov}}(\hat\lambda^{XZ}_{11}, \hat\lambda^{YZ}_{11}) \ \text{(1 pt)} \ = \ (\ ) = 0.2527. \qquad \text{(2 pts)}$$

Therefore, the estimated ASE of $\hat T$ is $\sqrt{0.2527} = 0.5027$ (1 pt). It follows that $\hat T / \mathrm{ASE} = 0.13812 / 0.5027 = 0.2747$ (1 pt), which is smaller than 1.96, the 97.5-th percentile of the standard normal distribution and hence the two-sided critical value at $\alpha = 5\%$. We therefore do not reject $H_0$: the data are consistent with the true odds ratio between Alcohol and Marijuana use controlling for Cigarette use being equal to the true odds ratio between Cigarette and Marijuana use controlling for Alcohol use. (1 pt)
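The quantities in Question 4 can be recomputed from the counts in the R code. The sketch below refits the (AC, AM, CM) model with interaction terms written as `A:C`, `A:M` and `C:M` rather than the pre-multiplied columns used in the exam code; this is an equivalent parameterization for 0/1 indicators, and the object names (`drug`, `cond_or`, `T_hat`, and so on) are illustrative.

```r
drug <- data.frame(
  A = c(1, 1, 1, 1, 0, 0, 0, 0),   # 1 = alcohol use
  C = c(1, 1, 0, 0, 1, 1, 0, 0),   # 1 = cigarette use
  M = c(1, 0, 1, 0, 1, 0, 1, 0),   # 1 = marijuana use
  count = c(911, 538, 44, 456, 3, 43, 2, 279)
)

fit <- glm(count ~ A + C + M + A:C + A:M + C:M,
           family = poisson(link = "log"), data = drug)
b <- coef(fit)
V <- vcov(fit)

# Part (b): conditional odds ratios are exp() of the two-factor terms.
cond_or <- exp(b[c("A:C", "A:M", "C:M")])

# Part (c): Wald interval for the A-C log odds ratio, then exponentiate.
ci_log_ac <- unname(b["A:C"] + c(-1, 1) * qnorm(0.975) * sqrt(V["A:C", "A:C"]))
ci_or_ac  <- exp(ci_log_ac)

# Part (d): Wald test of T = lambda_AM - lambda_CM = 0.
T_hat <- unname(b["A:M"] - b["C:M"])
se_T  <- sqrt(V["A:M", "A:M"] + V["C:M", "C:M"] - 2 * V["A:M", "C:M"])
z_T   <- T_hat / se_T
p_two_sided <- 2 * pnorm(-abs(z_T))

print(cond_or); print(ci_or_ac); print(c(T_hat = T_hat, z = z_T, p = p_two_sided))
```

The covariance term matters in part (d): the two interaction estimates come from the same fit, so their difference cannot be treated as a difference of independent estimates.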

5. [15 pts] Consider a three-way contingency table with categorical variables $X$ having 2 categories, $Y$ having 2 categories and $Z$ having $K \geq 2$ categories. (Hint: To show "A if and only if B", you need to show both "A implies B" and "B implies A".)

(a) Show that the loglinear model (XY, XZ, YZ) holds if and only if $X$ and $Y$ have homogeneous association controlling for $Z$. [9 pts]

Proof: If the loglinear model (XY, XZ, YZ) holds, then we have

$$\log(\theta_{XY(k)}) = \lambda^{XY}_{11} + \lambda^{XY}_{22} - \lambda^{XY}_{12} - \lambda^{XY}_{21},$$

which does not depend on $k$, the level of $Z$. Thus, $X$ and $Y$ have homogeneous association controlling for $Z$. (3 pts)

Conversely, under the first scheme of constraints, the only possibly nonzero 3-factor terms are $\lambda^{XYZ}_{11k}$, $k = 1, 2, \ldots, K-1$; all other 3-factor terms are 0. Then under the saturated loglinear model (XYZ), we can show that

$$\log(\theta_{XY(k)}) = \lambda^{XY}_{11} + \lambda^{XY}_{22} - \lambda^{XY}_{12} - \lambda^{XY}_{21} + \lambda^{XYZ}_{11k} + \lambda^{XYZ}_{22k} - \lambda^{XYZ}_{12k} - \lambda^{XYZ}_{21k} = \lambda^{XY}_{11} + \lambda^{XY}_{22} - \lambda^{XY}_{12} - \lambda^{XY}_{21} + \lambda^{XYZ}_{11k}, \quad k = 1, 2, \ldots, K-1,$$

and

$$\log(\theta_{XY(K)}) = \lambda^{XY}_{11} + \lambda^{XY}_{22} - \lambda^{XY}_{12} - \lambda^{XY}_{21}. \qquad \text{(3 pts)}$$

If $X$ and $Y$ have homogeneous association controlling for $Z$, then $\log(\theta_{XY(k)}) = \log(\theta_{XY(K)}) = \lambda^{XY}_{11} + \lambda^{XY}_{22} - \lambda^{XY}_{12} - \lambda^{XY}_{21}$ for $k = 1, 2, \ldots, K-1$. It follows that $\lambda^{XYZ}_{11k} = \lambda^{XYZ}_{11K} = 0$, $k = 1, 2, \ldots, K-1$. Therefore, in this case, the saturated model (XYZ) reduces to the homogeneous association model (XY, XZ, YZ). (3 pts)
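The forward direction of part (a) can be illustrated numerically: build expected counts from a model with no three-factor term and check that the conditional log odds ratio is the same at every level of $Z$. The parameter values below are arbitrary, hypothetical choices under the first scheme of constraints (terms involving the last category set to 0), not values from the exam.

```r
K <- 4                                # number of Z categories (hypothetical)

# HYPOTHETICAL parameter values; entries for the last category are 0.
lam   <- 1.0
lamX  <- 0.5                          # lambda^X_1
lamY  <- -0.3                         # lambda^Y_1
lamZ  <- c(0.2, -0.4, 0.6, 0)         # lambda^Z_k
lamXY <- 0.8                          # lambda^XY_11
lamXZ <- c(0.7, -0.2, 0.1, 0)         # lambda^XZ_1k
lamYZ <- c(-0.5, 0.9, 0.3, 0)         # lambda^YZ_1k

log_mu <- array(0, dim = c(2, 2, K))
for (i in 1:2) for (j in 1:2) for (k in 1:K) {
  x <- as.numeric(i == 1)             # indicator of the first X category
  y <- as.numeric(j == 1)             # indicator of the first Y category
  log_mu[i, j, k] <- lam + lamX * x + lamY * y + lamZ[k] +
    lamXY * x * y + lamXZ[k] * x + lamYZ[k] * y
}

# Conditional log odds ratio of X and Y at each level k of Z.
log_theta <- log_mu[1, 1, ] + log_mu[2, 2, ] - log_mu[1, 2, ] - log_mu[2, 1, ]
print(log_theta)                      # every entry equals lamXY
```

Adding a nonzero three-factor term for some $k < K$ would shift that level's log odds ratio by exactly that term, which is the content of the converse argument.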

(b) Show that the loglinear model (XZ, YZ) holds if and only if $X$ and $Y$ are conditionally independent controlling for $Z$. [6 pts]

Proof: If the loglinear model (XZ, YZ) holds, then we have

$$\log(\theta_{XY(k)}) = \lambda^{XY}_{11} + \lambda^{XY}_{22} - \lambda^{XY}_{12} - \lambda^{XY}_{21} = 0.$$

It follows that $\theta_{XY(k)} = 1$, $k = 1, 2, \ldots, K$. Thus, $X$ and $Y$ are conditionally independent controlling for $Z$. (3 pts)

Conversely, if $X$ and $Y$ are conditionally independent controlling for $Z$, then the association is in particular homogeneous, so by Part (a) the model (XY, XZ, YZ) holds and

$$0 = \log(\theta_{XY(k)}) = \lambda^{XY}_{11} + \lambda^{XY}_{22} - \lambda^{XY}_{12} - \lambda^{XY}_{21}, \quad k = 1, 2, \ldots, K.$$

Under the first scheme of constraints, the only possibly nonzero XY 2-factor term is $\lambda^{XY}_{11}$; the other XY terms are 0. Hence $\lambda^{XY}_{11} = 0$. It follows that in this case, the homogeneous association model (XY, XZ, YZ) reduces to the conditional independence model (XZ, YZ). (3 pts)
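A small numerical check of part (b): with expected counts generated from a model containing only the XZ and YZ two-factor terms, every conditional odds ratio equals 1 and each partial table factorizes as $\mu_{ijk} = \mu_{i+k}\,\mu_{+jk} / \mu_{++k}$. All parameter values are hypothetical and chosen only for illustration.

```r
K <- 3                                # number of Z categories (hypothetical)

# HYPOTHETICAL parameter values; no XY term is included.
lam   <- 1.0
lamX  <- 0.5
lamY  <- -0.3
lamZ  <- c(0.2, -0.4, 0)
lamXZ <- c(0.7, -0.2, 0)
lamYZ <- c(-0.5, 0.9, 0)

mu <- array(0, dim = c(2, 2, K))
for (i in 1:2) for (j in 1:2) for (k in 1:K) {
  x <- as.numeric(i == 1)
  y <- as.numeric(j == 1)
  mu[i, j, k] <- exp(lam + lamX * x + lamY * y + lamZ[k] +
                       lamXZ[k] * x + lamYZ[k] * y)
}

# Conditional odds ratios: all equal to 1 under model (XZ, YZ).
theta <- mu[1, 1, ] * mu[2, 2, ] / (mu[1, 2, ] * mu[2, 1, ])

# Conditional independence: mu_ijk = mu_i+k * mu_+jk / mu_++k.
mu_ipk <- apply(mu, c(1, 3), sum)
mu_pjk <- apply(mu, c(2, 3), sum)
mu_ppk <- apply(mu, 3, sum)
recon <- array(0, dim = c(2, 2, K))
for (i in 1:2) for (j in 1:2) for (k in 1:K) {
  recon[i, j, k] <- mu_ipk[i, k] * mu_pjk[j, k] / mu_ppk[k]
}
max_gap <- max(abs(mu - recon))
print(theta); print(max_gap)
```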

6. [12 pts]

(a) Let $P(Y = 1) = 1 - P(Y = 0) = p$. For the population of subjects having $Y = j$, $X$ has probability density function

$$f_j(x) = \lambda_j \exp(-\lambda_j x), \quad x \geq 0,\ j = 0, 1.$$

Show that $\pi(x) = P(Y = 1 \mid x)$ satisfies the logistic regression model with some $\alpha$ and $\beta$. [7 pts]

Proof: Since $P(Y = 1) = 1 - P(Y = 0) = p$ and the conditional probability density functions of $X$ given $Y = 0$ and $Y = 1$ are

$$f_0(x) = \lambda_0 \exp(-\lambda_0 x), \quad x \geq 0, \qquad \text{(1 pt)}$$

and

$$f_1(x) = \lambda_1 \exp(-\lambda_1 x), \quad x \geq 0, \qquad \text{(1 pt)}$$

by Bayes' theorem, we have

$$\pi(x) \equiv P(Y = 1 \mid x) = \frac{f_1(x)\,P(Y = 1)}{f_0(x)\,P(Y = 0) + f_1(x)\,P(Y = 1)}. \qquad \text{(1 pt)}$$

Therefore,

$$\mathrm{logit}(\pi(x)) = \log\frac{p f_1(x)}{(1 - p) f_0(x)} = \log\left\{ \frac{p\lambda_1}{(1 - p)\lambda_0} \exp\left[(\lambda_0 - \lambda_1)x\right] \right\} = \log\frac{p\lambda_1}{(1 - p)\lambda_0} + (\lambda_0 - \lambda_1)x = \alpha + \beta x, \qquad \text{(2 pts)}$$

where

$$\alpha = \log\left(\frac{p\lambda_1}{(1 - p)\lambda_0}\right) \quad \text{and} \quad \beta = \lambda_0 - \lambda_1. \qquad \text{(2 pts)}$$

(b) For known $n \geq 2$, show that the negative binomial distribution with probability mass function

$$f(y \mid n, \mu) = \binom{y + n - 1}{n - 1} \left(\frac{n}{\mu + n}\right)^{n} \left(1 - \frac{n}{\mu + n}\right)^{y}, \quad y = 0, 1, 2, \ldots,$$

belongs to the exponential family of distributions. Find the natural parameter for this distribution. [5 pts]

Proof: The probability mass function of the negative binomial distribution can be written as

$$f(y \mid n, \mu) = \binom{y + n - 1}{n - 1} \left(\frac{n}{\mu + n}\right)^{n} \left(1 - \frac{n}{\mu + n}\right)^{y} = \exp\left[ y \log\left(\frac{\mu}{\mu + n}\right) + n \log\left(\frac{n}{\mu + n}\right) + \log\binom{y + n - 1}{n - 1} \right]. \qquad \text{(2 pts)}$$

This belongs to the exponential family of distributions with $\theta = \log(\mu/(\mu + n))$ (1 pt) and $b(\theta) = -n \log(1 - e^{\theta})$ (1 pt). Here $\phi = 1$, $a(\phi) = 1$ and $c(y; \phi) = \log\binom{y + n - 1}{n - 1}$. The natural parameter for this distribution is $\theta = \log(\mu/(\mu + n))$. (1 pt)

-End of the Paper-
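Both results in Question 6 can be checked numerically. The sketch below compares the Bayes posterior in part (a) with the logistic form $\alpha + \beta x$, and compares the exponential-family expression in part (b) with R's built-in `dnbinom` in its `size`/`mu` parameterization. The values of `p`, `lam0`, `lam1`, `n` and `mu` are hypothetical and chosen only for illustration.

```r
# Part (a): posterior P(Y = 1 | x) with exponential class densities.
p <- 0.3; lam0 <- 2; lam1 <- 0.5      # HYPOTHETICAL values
x <- seq(0, 5, by = 0.5)
post <- p * dexp(x, rate = lam1) /
  ((1 - p) * dexp(x, rate = lam0) + p * dexp(x, rate = lam1))
alpha <- log(p * lam1 / ((1 - p) * lam0))
beta  <- lam0 - lam1
logit_gap <- max(abs(qlogis(post) - (alpha + beta * x)))

# Part (b): negative binomial pmf in exponential-family form,
# exp(y * theta - b(theta) + c(y)), with a(phi) = 1.
n <- 3; mu <- 4                       # HYPOTHETICAL values
y <- 0:20
theta   <- log(mu / (mu + n))         # natural parameter
b_theta <- -n * log(1 - exp(theta))   # cumulant function b(theta)
c_y     <- lchoose(y + n - 1, n - 1)  # log of the binomial coefficient
pmf_expfam <- exp(y * theta - b_theta + c_y)
pmf_gap <- max(abs(pmf_expfam - dnbinom(y, size = n, mu = mu)))

print(c(logit_gap = logit_gap, pmf_gap = pmf_gap))
```

Both gaps should be at the level of floating-point rounding error if the derivations hold for these inputs.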
