Homework 10 - Solution


STAT Spring 2011
Homework 10 - Solution
Olga Vitek

Each part of each problem is worth 5 points.

1. Faraway Ch. 4 problem 1 (page 93): The dataset parstum contains cross-classified data on marijuana usage by college students as it relates to the alcohol and drug usage of the parents. Analyze the data as if both factors were nominal. Redo the analysis treating both factors as ordinal. Contrast the results.

Answer:

(a) As nominal: The test statistic is the residual deviance of the independence model, with 4 degrees of freedom. Its p-value gives strong evidence that the fitted model is insufficient.

    glm(formula = count ~ parent + student, family = poisson, data = parstum)

                      Estimate Std. Error z value Pr(>|z|)
    (Intercept)                                    < 2e-16 ***
    parentneither                                  < 2e-16 ***
    parentone                                         e-14 ***
    studentoccasional                                 e-10 ***
    studentregular                                    e-10 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for poisson family taken to be 1)

    Null deviance:     on 8 degrees of freedom
    Residual deviance: on 4 degrees of freedom
    AIC:

(b) As ordinal: The scores are Parent (Neither = 0, One = 1, Both = 2) and Student (Never = 0, Occasional = 1, Regular = 2). The test statistic is again the residual deviance, now with 3 degrees of freedom. Its p-value does not give enough evidence that the fitted model is insufficient to describe the data.

    glm(formula = count ~ parent + student + I(oparent * ostudent), family = poisson, data = parstum)

                          Estimate Std. Error z value Pr(>|z|)
    (Intercept)                                           e-08 ***
    parentneither                                         e-16 ***
    parentone                                             e-15 ***
    studentoccasional                                     e-14 ***
    studentregular                                        e-13 ***
    I(oparent * ostudent)                                 e-06 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for poisson family taken to be 1)

    Null deviance:     on 8 degrees of freedom
    Residual deviance: on 3 degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 4

(c) Contrast the results: A comparison of the two fitted models indicates that the model with ordinal factors fits significantly better than the one with nominal factors. The p-value for the interaction term is of the order of 1e-06.

    Analysis of Deviance Table

    Model 1: count ~ parent + student
    Model 2: count ~ parent + student + I(oparent * ostudent)
      Resid. Df Resid. Dev Df Deviance

R code

    library(faraway)  # provides the parstum dataset

    ## nominal
    nominal <- glm(count ~ parent + student, family = poisson, data = parstum)
    summary(nominal)
    pchisq(nominal$deviance, nominal$df.residual, lower.tail = FALSE)

    ## ordinal
    # unclass() returns the integer level codes; a constant shift of the scores
    # is absorbed by the main effects and does not change the fit
    parstum$oparent <- unclass(parstum$parent)
    parstum$ostudent <- unclass(parstum$student)
    parstum$oparent[parstum$parent == "Neither"] <- 0
    parstum$oparent[parstum$parent == "One"] <- 1
    parstum$oparent[parstum$parent == "Both"] <- 2
    parstum
    ordinal <- glm(count ~ parent + student + I(oparent * ostudent), family = poisson, data = parstum)
    summary(ordinal)
    pchisq(ordinal$deviance, ordinal$df.residual, lower.tail = FALSE)

    ### contrast
    anova(nominal, ordinal)
    pchisq(20.809, 1, lower.tail = FALSE)
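The nominal-versus-ordinal comparison in problem 1 can be tried without the faraway package. The sketch below is illustrative only: the 3 x 3 counts are hypothetical (they are not the parstum values), and the names `d`, `p_nominal` and `p_ordinal` are introduced here for the example.

```r
# Hypothetical 3 x 3 counts, for illustration only (NOT the parstum data)
d <- expand.grid(parent  = c("Neither", "One", "Both"),
                 student = c("Never", "Occasional", "Regular"))
d$count <- c(140, 70, 20,
              55, 65, 30,
              40, 70, 45)

# Integer scores 0, 1, 2 following the level order given above
d$oparent  <- as.integer(d$parent) - 1
d$ostudent <- as.integer(d$student) - 1

# Independence (nominal) model and linear-by-linear (ordinal) model
nominal <- glm(count ~ parent + student, family = poisson, data = d)
ordinal <- update(nominal, . ~ . + I(oparent * ostudent))

# Goodness of fit of each model against the saturated model
p_nominal <- pchisq(deviance(nominal), df.residual(nominal), lower.tail = FALSE)
p_ordinal <- pchisq(deviance(ordinal), df.residual(ordinal), lower.tail = FALSE)

# Likelihood ratio test of the single linear-by-linear term (1 df)
anova(nominal, ordinal, test = "Chisq")
```

The score-product term uses one parameter, so the residual degrees of freedom drop from 4 to 3, matching the outputs in parts (a) and (b).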

2. [Methods qualifying exam, August 2010: use paper and pencil.] You conduct a survey of students in the university according to their book reading level and notes reading level, in order to study similarities and differences between these two types of study. You acquire the following data.

                         Book Reading
    Notes Reading   Very Good   Good   Bad   Very Bad   Total
    Very Good
    Good
    Bad
    Very Bad
    Total

(a) It is possible to consider up to 5 loglinear models, which express various levels of dependence between these two studies: the independence model, the linear-by-linear model, the row-effect model, the column-effect model, and the saturated model. Display all these models in mathematical notation, define the model terms and the assumptions, and provide the associated residual degrees of freedom.

Let $Y_{ij}$ denote the count in row $i$ and column $j$. We assume that $Y_{ij} \sim \mathrm{Poisson}(\lambda_{ij})$, or, conditional on the total, $Y_{ij} \sim \mathrm{Multinomial}(\pi_{ij})$. Let $\alpha_i$ and $\beta_j$ denote row and column effects respectively. Let $x_i$, $y_j$ denote the scores of rows and columns of the independent variables, defined on a continuous scale. The loglinear models are

    Model              Formula                                                            Residual df
    Independence       $\log(\lambda_{ij}) = \mu + \alpha_i + \beta_j$                      9
    Linear-by-linear   $\log(\lambda_{ij}) = \mu + \alpha_i + \beta_j + \omega x_i y_j$     8
    Row-effect         $\log(\lambda_{ij}) = \mu + \alpha_i + \beta_j + \omega_i y_j$       6
    Column-effect      $\log(\lambda_{ij}) = \mu + \alpha_i + \beta_j + \omega_j x_i$       6
    Saturated          $\log(\lambda_{ij}) = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$  0

subject to the constraints

$\sum_{i=1}^{4} \alpha_i = \sum_{j=1}^{4} \beta_j = \sum_{i=1}^{4} \omega_i = \sum_{j=1}^{4} \omega_j = \sum_{i=1}^{4} (\alpha\beta)_{ij} = \sum_{j=1}^{4} (\alpha\beta)_{ij} = 0.$

(b) Consider two extreme cases of the models above: the independence and the saturated models. Test both models for goodness of fit using the deviance test, at the confidence level of 95%. State the null and the alternative hypotheses, the test statistic, and the conclusion.

For the independence model, we test $H_0$: the independence model is appropriate, against $H_a$: the saturated model is appropriate.

The deviance statistic is

$G^2 = 2 \sum_{i=1}^{4} \sum_{j=1}^{4} n_{ij} \log \frac{n_{ij}}{\hat{n}_{ij}},$

where $n_{ij}$ is the observed count and $\hat{n}_{ij}$ is the predicted count. Based on the independence model,

$\hat{n}_{ij} = \frac{\left(\sum_{j'=1}^{4} n_{ij'}\right) \left(\sum_{i'=1}^{4} n_{i'j}\right)}{\sum_{i'=1}^{4} \sum_{j'=1}^{4} n_{i'j'}},$

and we have the following predicted table

                         Book Reading
    Notes Reading   Very Good   Good   Bad   Very Bad
    Very Good
    Good
    Bad
    Very Bad

Therefore $G^2 > \chi^2_9(0.95) = 16.92$, and we reject $H_0$.

For the saturated model, the same model is specified by both $H_0$ and $H_a$. Therefore $G^2 = 0$, and we cannot conduct the test.

(c) Given the residual deviance of the linear-by-linear model, test whether the linear-by-linear model is more appropriate for the data than the independence model. Use the confidence level of 95%.

The deviance of the linear-by-linear term is the difference between the residual deviances of the independence model and the linear-by-linear model, on $9 - 8 = 1$ degree of freedom. This $G^2$ exceeds $\chi^2_1(0.95) = 3.84$. Therefore, the linear-by-linear model is more appropriate for the data.

3. [Adapted from methods qualifying exam, January 2008: use paper and pencil.] A sample of 1083 voters was surveyed for their political ideology and political party affiliation in a presidential primary in Wisconsin. See the table below.

                           Political Ideology
    Party Affiliation   Liberal   Moderate   Conservative   Total
    Democrat
    Independent
    Republican
    Total

(a) The following commands were executed in R to check the independence between political ideology and party affiliation.

    > mod1 <- glm(y ~ paff + polid, family = poisson, data = pidata)
    > summary(mod1)
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)                              < 2e-16 ***
    paffind                                          *
    paffrep                                     e-13 ***
    polidmod                                    e-09 ***
    polidcon                                         ***

    Null deviance:     on 8 degrees of freedom
    Residual deviance: on 4 degrees of freedom
    AIC:

Please write down the fitted model and report a test of the independence hypothesis (include the hypotheses).

Answer: Let $Y_{ij}$ denote the count in the $i$th row and $j$th column of the table, with $i = 1, 2, 3$ = {Dem, Ind, Rep} and $j = 1, 2, 3$ = {Liberal, Moderate, Conservative}. We assume that $Y_{ij} \sim \mathrm{Poisson}(\lambda_{ij})$, or equivalently $Y_{j|i} \sim \mathrm{Multinomial}(\pi_{1|i}, \pi_{2|i}, \pi_{3|i})$.

The fitted model assumes the independence of rows and columns, i.e. $\log(\lambda_{ij}) = \mu + \alpha_i + \beta_j$, or equivalently $Y_{j|i} \sim \mathrm{Multinomial}(\pi_1, \pi_2, \pi_3)$. The fitted model is

$\log \hat{\lambda}_{ij} = \hat{\mu} + \hat{\alpha}_{\mathrm{Ind}} I_{\mathrm{Ind}} + \hat{\alpha}_{\mathrm{Rep}} I_{\mathrm{Rep}} + \hat{\beta}_{\mathrm{Mod}} I_{\mathrm{Moderate}} + \hat{\beta}_{\mathrm{Con}} I_{\mathrm{Conservative}},$

where the estimates are the coefficients reported by summary(mod1) for (Intercept), paffind, paffrep, polidmod and polidcon.

Hypotheses:

$H_0$: The independence model $\log(\lambda_{ij}) = \mu + \alpha_i + \beta_j$, or equivalently $Y_{j|i} \sim \mathrm{Multinomial}(\pi_1, \pi_2, \pi_3)$.

$H_a$: The saturated model $\log(\lambda_{ij}) = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$, or equivalently $Y_{j|i} \sim \mathrm{Multinomial}(\pi_{1|i}, \pi_{2|i}, \pi_{3|i})$.

Test statistic: the residual deviance $G^2$, which under $H_0$ follows $\chi^2_4$.
Critical value: $\chi^2_{4,0.95} = 9.49$.
p-value = 6.126e-22, which is approximately 0.
Reject $H_0$. The independence assumption is rejected.

(b) As both political ideology and party affiliation may be considered ordinal, the following models are explored.

    > ct1 <- xtabs(y ~ polid + paff, pidata)
    > pid1 <- data.frame(dem = ct1[,1], ind = ct1[,2], rep = ct1[,3], polid = c(1,2,3))
    > mod2 <- multinom(cbind(dem, ind, rep) ~ polid, data = pid1)
    > summary(mod2)
        (Intercept) polid
    ind
    rep

    > ct2 <- xtabs(y ~ paff + polid, pidata)
    > pid2 <- data.frame(lib = ct2[,1], mod = ct2[,2], con = ct2[,3], paff = c(1,2,3))
    > mod3 <- multinom(cbind(lib, mod, con) ~ paff, data = pid2)
    > summary(mod3)
        (Intercept) paff
    mod
    con

State the models that are fitted by mod2 and mod3. State the equivalent log-linear models.

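i. mod2: Political ideology is assigned a continuous score $v_j$. This is the row-effect model

$Y_{i|j} \sim \mathrm{Multinomial}(\pi_{1|j}, \pi_{2|j}, \pi_{3|j}), \qquad \log \frac{\pi_{i|j}}{\pi_{1|j}} = \alpha_i + \beta_i v_j,$

where $\pi_{1|j} = P(\mathrm{Democrat} \mid v_j)$, $\pi_{2|j} = P(\mathrm{Independent} \mid v_j)$, and $\pi_{3|j} = P(\mathrm{Republican} \mid v_j)$.

The fitted multinomial model is

$\log \frac{\hat{\pi}_{2|j}}{\hat{\pi}_{1|j}} = \hat{\alpha}_2 + \hat{\beta}_2 v_j, \qquad \log \frac{\hat{\pi}_{3|j}}{\hat{\pi}_{1|j}} = \hat{\alpha}_3 + \hat{\beta}_3 v_j,$

where the estimates are the (Intercept) and polid entries of the ind and rep rows of summary(mod2).

The equivalent log-linear model is $\log(\lambda_{ij}) = \mu + \alpha_i + \beta_j + \gamma_i v_j$, where
    $\gamma_i$: separate parameter of $v_j$ for each row
    $\alpha_i$: deviation from Democrat party affiliation
    $\beta_j$: deviation from liberal political ideology
    $\alpha_1 = \beta_1 = \gamma_3 = 0$

ii. mod3: Party affiliation is assigned a continuous score $u_i$. This is the column-effect model

$Y_{j|i} \sim \mathrm{Multinomial}(\pi_{1|i}, \pi_{2|i}, \pi_{3|i}), \qquad \log \frac{\pi_{j|i}}{\pi_{1|i}} = \alpha_j + \beta_j u_i,$

where $\pi_{1|i} = P(\mathrm{Liberal} \mid u_i)$, $\pi_{2|i} = P(\mathrm{Moderate} \mid u_i)$, and $\pi_{3|i} = P(\mathrm{Conservative} \mid u_i)$.

The fitted multinomial model is

$\log \frac{\hat{\pi}_{2|i}}{\hat{\pi}_{1|i}} = \hat{\alpha}_2 + \hat{\beta}_2 u_i, \qquad \log \frac{\hat{\pi}_{3|i}}{\hat{\pi}_{1|i}} = \hat{\alpha}_3 + \hat{\beta}_3 u_i,$

where the estimates are the (Intercept) and paff entries of the mod and con rows of summary(mod3).

The equivalent log-linear model is $\log(\lambda_{ij}) = \mu + \alpha_i + \beta_j + \gamma_j u_i$, where
    $\gamma_j$: separate parameter of $u_i$ for each column
    $\alpha_i$: deviation from Democrat party affiliation
    $\beta_j$: deviation from liberal political ideology
    $\alpha_1 = \beta_1 = \gamma_3 = 0$

(c) What other model(s) could be potentially appropriate for these data if the models above do not fit well?

Linear-by-linear model: $\log E(Y_{ij}) = \mu + \alpha_i + \beta_j + \gamma\, u_i v_j$

Saturated model: $\log E(Y_{ij}) = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$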
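The equivalence claimed in 3(b) follows from conditioning the Poisson counts on the column totals. The derivation below is a standard argument that the solution does not spell out; it uses the log-linear symbols of part i, and the summation index $k$ is introduced here.

```latex
\begin{align*}
\pi_{i|j} &= \frac{\lambda_{ij}}{\sum_{k=1}^{3} \lambda_{kj}}, \\
\log \frac{\pi_{i|j}}{\pi_{1|j}}
  &= \log \lambda_{ij} - \log \lambda_{1j} \\
  &= (\mu + \alpha_i + \beta_j + \gamma_i v_j) - (\mu + \alpha_1 + \beta_j + \gamma_1 v_j) \\
  &= (\alpha_i - \alpha_1) + (\gamma_i - \gamma_1)\, v_j .
\end{align*}
```

The terms $\mu$ and $\beta_j$ cancel, so the intercept of each baseline-category logit is a difference of log-linear row effects and its slope is a difference of the row-specific score coefficients. The same argument with rows and columns exchanged gives the column-effect model of part ii.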

4. [Methods qualifying exam, August 2010: use paper and pencil.] As part of housing market research, you collected data on the number of apartments sold for various price ranges (low, medium, high), in different locations (district A, B and C), building types (high-rise, low-rise, townhouse), and for three buyer incomes (low, medium, high). You'd like to model the probability of selling an apartment in a specific price range, as a function of the remaining characteristics. You consider two models: (1) price range is independent of all other variables, and (2) log of price range is determined by an additive contribution of location, building type, and buyer income.

(a) You only have access to software which fits Poisson regression. Write models (1) and (2) that you'd fit in terms of the Poisson distribution. Clearly define the notation and state the assumptions.

Surrogate log-linear model 1: $Y_{ijkl} \overset{ind}{\sim} \mathrm{Poisson}(\lambda_{ijkl})$, where

$\log(\lambda_{ijkl}) = \mu + loc_i + build_j + buy_k + (loc \cdot build)_{ij} + (loc \cdot buy)_{ik} + (build \cdot buy)_{jk} + (loc \cdot build \cdot buy)_{ijk} + price_l$

Surrogate log-linear model 2: $Y_{ijkl} \overset{ind}{\sim} \mathrm{Poisson}(\lambda_{ijkl})$, where

$\log(\lambda_{ijkl}) = \mu + loc_i + build_j + buy_k + (loc \cdot build)_{ij} + (loc \cdot buy)_{ik} + (build \cdot buy)_{jk} + (loc \cdot build \cdot buy)_{ijk} + price_l + (loc \cdot price)_{il} + (build \cdot price)_{jl} + (buy \cdot price)_{kl}$

(b) What test statistic would you use to compare the two models, and what is the reference distribution of the test statistic under $H_0$ (including degrees of freedom)?

We use the $G^2$ statistic, which is the difference between the deviances of the two models. Under $H_0$ it follows a $\chi^2$ distribution with $(3-1)(3-1) + (3-1)(3-1) + (3-1)(3-1) = 12$ degrees of freedom: each of the three two-way interactions with price that model 2 adds contributes $(3-1)(3-1) = 4$ parameters.

5. Download the dataset minn38 available in the package MASS, and check the R documentation for further details. The goal of this problem is to use surrogate log-linear models to determine how phs is affected by the other three factors.

(a) Analysis. Use variable selection to find an appropriate surrogate log-linear model for the dataset. You may use step() to find a candidate model, then check the nearby models locally with drop1() and add1(); these would have larger AIC but may be worth attention due to other considerations, such as interpretation, likelihood ratio tests, etc. Report the basic facts such as the minimum and maximum models you are considering, the meaning of the final model(s), etc., but be selective with the computer outputs you include.

Answer:

i. Initial model (minimum model): no effect of covariates on phs. A surrogate log-linear model is used for the analysis. We therefore model the response as a Poisson random variable, conditional on the total number of observations for each covariate pattern hs*sex*fol. This is done by including in the model a full set of main-effect and interaction terms involving hs*sex*fol. We start from the minimal model, which expresses a constant probability of phs status across all covariate patterns. In other words, the probability of phs is independent of the covariate pattern.

    > library(MASS)  # provides minn38
    > fit <- glm(f ~ hs*fol*sex + phs, family = poisson, data = minn38)

ii. Additive effect of covariates on phs. We consider the dependence of phs status on one covariate at a time, by adding interaction terms covariate:phs.

    > fit.linear <- glm(f ~ hs*fol*sex + phs + phs:(hs+fol+sex), family = poisson, data = minn38)
    > add1(fit, ~ . + phs:(hs+sex+fol), test = "Chisq")
    Model: f ~ hs * sex * fol + phs
            Df Deviance AIC LRT  Pr(Chi)
    <none>
    hs:phs                      < 2.2e-16 ***
    sex:phs                     < 2.2e-16 ***
    fol:phs                     < 2.2e-16 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All the interaction terms are significantly different from 0, so we add all three interaction terms to the model.

iii. Select variables using step.

    > fullformula <- f ~ hs*fol*sex + phs + phs:(hs+fol+sex)^3
    > stepresult <- step(fit, fullformula)
    > stepresult
    Call: glm(formula = f ~ hs + fol + sex + phs + hs:fol + hs:sex + fol:sex +
        fol:phs + hs:phs + sex:phs + hs:fol:sex + fol:sex:phs + hs:fol:phs,
        family = poisson, data = minn38)

iv. Non-additive effect of covariates on phs. We further attempt to express the dependence of the probability of phs status on the covariates, while accounting for possible interactions between the covariates.

    > fit2 <- update(fit, . ~ . + phs:(hs+fol+sex)^2)

Compare the four models:

    > anova(fit, fit.linear, stepresult, fit2)
    Analysis of Deviance Table

    Model 1: f ~ hs * fol * sex + phs
    Model 2: f ~ hs * fol * sex + phs + phs:(hs + fol + sex)
    Model 3: f ~ hs + fol + sex + phs + hs:fol + hs:sex + fol:sex + fol:phs +
        hs:phs + sex:phs + hs:fol:sex + fol:sex:phs + hs:fol:phs
    Model 4: f ~ hs + fol + sex + phs + hs:fol + hs:sex + fol:sex + hs:phs +
        fol:phs + sex:phs + hs:fol:sex + hs:fol:phs + hs:sex:phs + fol:sex:phs
      Resid. Df Resid. Dev Df Deviance AIC

Models 2, 3 and 4 have quite similar AIC. We therefore also perform likelihood ratio tests comparing the additive model with the models that include interactions, to see which interactions improve the fit. The test comparing model 2 with model 3 rejects the hypothesis of no interaction. Hence we retain model 3 as the final model.

(b) Presentation. Present your final fit in the form of estimated cell probabilities.

We use the final model above to predict the probability of an observation falling in each cell.

R code

    > mnames <- lapply(minn38[,-5], levels)
    > # stepresult is the final model (model 3) selected above
    > p <- predict(stepresult, expand.grid(mnames), type = "response")
    > p <- matrix(p, ncol = 3, byrow = TRUE, dimnames = list(NULL, mnames[[1]]))
    > pr <- p/drop(p %*% rep(1,3))
    > cbind(expand.grid(mnames), prob = round(pr,2))
      hs phs fol sex prob.l prob.m prob.u
    1 L  C   F1  F
      M  C   F1  F
      U  C   F1  F
      L  E   F1  F
      M  E   F1  F
      U  E   F1  F
      L  N   F1  F
      M  N   F1  F
      U  N   F1  F
      L  O   F1  F
      M  O   F1  F
      U  O   F1  F
      L  C   F2  F
      M  C   F2  F
      U  C   F2  F
      L  E   F2  F
      M  E   F2  F
      U  E   F2  F
      L  N   F2  F
      M  N   F2  F
      ...
      M  N   F7  M
      U  N   F7  M
      L  O   F7  M
      M  O   F7  M
      U  O   F7  M

(c) Discussion. Briefly discuss your findings in words.

The model expresses the probability of phs status as a function of the three covariates. The model contains the additive terms for all three covariates. In addition, hs:fol and sex:fol have non-additive effects on the probability of phs. In mathematical terms, the model can be written as

$phs \sim \mathrm{Multinomial}(\pi_1, \pi_2, \pi_3, \pi_4),$

where

$\log(\pi_j) = \beta_{0j} + \beta_{1j}\, hs + \beta_{2j}\, fol + \beta_{3j}\, sex + \beta_{4j}\, hs{:}fol + \beta_{5j}\, sex{:}fol$
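As a hedged illustration of how a surrogate Poisson fit is turned into conditional probabilities of phs, the sketch below normalises the fitted counts within each hs x fol x sex covariate pattern. It uses the simpler additive model (model 2) rather than the final model, and the names `fit_add`, `pattern` and `prob` are introduced here for the example.

```r
library(MASS)  # provides minn38 (columns hs, phs, fol, sex, f)

# Surrogate Poisson model: saturated hs*fol*sex margin plus additive
# covariate effects on phs (model 2 above, used here for illustration)
fit_add <- glm(f ~ hs * fol * sex + phs + phs:(hs + fol + sex),
               family = poisson, data = minn38)

# Fitted cell counts
mu <- fitted(fit_add)

# One label per covariate pattern (combination of hs, fol and sex)
pattern <- interaction(minn38$hs, minn38$fol, minn38$sex, drop = TRUE)

# Conditional probability of each phs level given the covariate pattern:
# fitted count divided by the fitted total of its pattern
prob <- mu / ave(mu, pattern, FUN = sum)

# Show the first few cells with their estimated probabilities
head(cbind(minn38[, c("hs", "fol", "sex", "phs")], prob = round(prob, 2)))
```

Because the hs*fol*sex margin is saturated, the fitted totals within each pattern match the observed totals, so the ratios are the multinomial probabilities implied by the surrogate model.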
