Matched Pair Data. Stat 557 Heike Hofmann


Outline
- Marginal homogeneity (review)
- Binary response with covariates
- Ordinal response
- Symmetric models
- Subject-specific vs. marginal model
- Conditional logistic regression

Matched Pair Data

                 2nd Rating
1st Rating       Approve   Disapprove
Approve              794          150
Disapprove            86          570

Assumptions
- The diagonal is heavily loaded
- Association is usually strongly positive (most people don't change their opinion)
- Distinguish between movers and stayers

Marginal Homogeneity
logit P(Yt = 1 | xt) = α + β xt
- xt is a dummy variable for the time points: x1 = 0, x2 = 1
- Then β is the log odds ratio based on the overall population
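To make the marginal interpretation of β concrete, here is a minimal sketch that reads β off the two margins of the approval table. It assumes that rows are the 1st rating and columns the 2nd rating; the object names (`ratings`, `beta_marginal`) are chosen for illustration only.

```r
# Approval ratings; rows = 1st rating, columns = 2nd rating
# (this orientation is an assumption read off the flattened slide).
ratings <- matrix(c(794, 150,
                     86, 570),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(first  = c("Approve", "Disapprove"),
                                  second = c("Approve", "Disapprove")))

n  <- sum(ratings)
p1 <- sum(ratings["Approve", ]) / n   # P(Y1 = 1): approve at 1st rating
p2 <- sum(ratings[, "Approve"]) / n   # P(Y2 = 1): approve at 2nd rating

logit <- function(p) log(p / (1 - p))

# Marginal log odds ratio: difference of the two marginal logits
beta_marginal <- logit(p2) - logit(p1)
beta_marginal
```

Only the row and column totals enter this estimate; the pairing of the two ratings within a subject is not used.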

RAND - American Life Panel
https://mmicdata.rand.org/alp/?page=election#electionforecast
- Panel of 3500 US citizens above 18, tracked since July
- Data isn't published on an individual basis, but from change and overall margins we can (almost) work out the change pattern

                           before 1st debate
1 week after 1st debate    Obama   Romney
Obama                       1585      121
Romney                       162     1432
Total                                         3300

> mswitch <- glm(I(candidate == "Obama") ~ time, data = votem, family = binomial(), weights = votes)
> summary(mswitch)

Call:
glm(formula = I(candidate == "Obama") ~ time, family = binomial(),
    data = votem, weights = votes)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-46.462  -22.929   -0.435   21.992   45.733

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.11771    0.03488   3.375 0.000738 ***
timevote2   -0.04981    0.04929  -1.010 0.312299
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9135.4  on 7  degrees of freedom
Residual deviance: 9134.3  on 6  degrees of freedom
AIC: 9138.3

Number of Fisher Scoring iterations: 3
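The data frame `votem` is not shown on the slides. The sketch below is one hypothetical way to rebuild it from the four cell counts so that the marginal fit can be rerun. The column layout, the level name `vote1`, and the reading of 162 as "Obama before, Romney after" are assumptions, chosen because they give 8 rows and reproduce the printed coefficients.

```r
# Hypothetical reconstruction of `votem`: each of the four cells appears
# once per time point, weighted by its count.
cells <- data.frame(
  before = c("Obama", "Obama",  "Romney", "Romney"),
  after  = c("Obama", "Romney", "Obama",  "Romney"),
  votes  = c(1585, 162, 121, 1432)   # assumed orientation of the table
)

votem <- rbind(
  data.frame(candidate = cells$before, time = "vote1", votes = cells$votes),
  data.frame(candidate = cells$after,  time = "vote2", votes = cells$votes)
)
votem$time <- factor(votem$time, levels = c("vote1", "vote2"))

# Marginal model: the pairing of the two votes is ignored
mswitch <- glm(I(candidate == "Obama") ~ time, data = votem,
               family = binomial(), weights = votes)
round(coef(mswitch), 5)
```

Under these assumptions the intercept is the logit of Obama's share before the debate and the `timevote2` coefficient is the change in that logit one week after.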

Subject-Specific Model
link P(Yit = 1) = αi + β xt
- xt is a dummy variable for the time points: x1 = 0, x2 = 1
- Then αi = link P(Yi1 = 1) and β = link P(Yi2 = 1) - link P(Yi1 = 1)
- Painful to fit ...

Marginal vs. Subject-Specific Model
Estimates for β
- are identical for the marginal model and the subject-specific model in the case of the identity link
- are different for the logit link
Marginal model: β = logit P(Y2 = 1 | x2) - logit P(Y1 = 1 | x1)
Subject-specific, for all i: β = logit P(Yi2 = 1 | x2) - logit P(Yi1 = 1 | x1)

Subject-Specific Model
logit P(Yit = 1) = αi + β xt
Assumptions
- generally: responses from different subjects are independent
- (for all i) responses at different time points are independent

Subject-Specific Model
Violation of independence is taken care of by the model structure:
- Generally, |αi| >> |β|
- For large |αi|, P(Yit = 1) is close to either 0 or 1 (largest dependence in the data)
- When αi is small, we have the most variability between responses of the same individual, i.e. the least dependence. Those are the records on which the estimation of β is based.

Subject-Specific Model
link P(Yit = 1) = αi + β xt
- But: estimation of αi becomes problematic for large numbers of subjects
- Idea: condition on the sufficient statistic for αi
- This leads to conditional (logistic) regression

Likelihood for αi
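As a sketch of what this likelihood looks like, assume the logit link, x1 = 0, x2 = 1, and independence of the two responses of subject i given αi. Under those assumptions (which are taken from the neighbouring slides, not from this one), the contribution of subject i is:

```latex
\begin{align*}
P(Y_{i1}=y_{i1},\, Y_{i2}=y_{i2})
  &= \frac{\exp(\alpha_i\, y_{i1})}{1+\exp(\alpha_i)} \cdot
     \frac{\exp\{(\alpha_i+\beta)\, y_{i2}\}}{1+\exp(\alpha_i+\beta)} \\[4pt]
  &= \frac{\exp\{\alpha_i\,(y_{i1}+y_{i2}) + \beta\, y_{i2}\}}
          {\{1+\exp(\alpha_i)\}\,\{1+\exp(\alpha_i+\beta)\}}
\end{align*}
```

In this form αi meets the data only through the sum y_{i1} + y_{i2}, which is why that sum can serve as a sufficient statistic for αi.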

Fitting the Subject-Specific Model
- Let Si = yi1 + yi2; then Si ∈ {0, 1, 2}
- Si are sufficient statistics for αi
- Only values of Si = 1 contribute to the estimation of β
logit P(Yit = 1 | Si = 1) = αi + β xt

Estimating β
- The MLE for β is log(n21/n12), where n12 and n21 are the off-diagonal (discordant) counts of the 2x2 table
- The standard error of the estimate is then sqrt(1/n12 + 1/n21)
- Use clogit from the survival package to fit the model
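The two closed-form expressions can be evaluated directly. The sketch below uses the discordant counts of the approval table (150 and 86), assuming rows are the 1st rating, columns the 2nd rating, and category 1 is "Approve". The Wald interval at the end is an addition for illustration and is not part of the slides.

```r
# Discordant counts from the approval table (assumed orientation:
# rows = 1st rating, columns = 2nd rating, category 1 = Approve).
n12 <- 150  # Approve at 1st rating, Disapprove at 2nd
n21 <- 86   # Disapprove at 1st rating, Approve at 2nd

# Conditional MLE and its standard error
beta_hat <- log(n21 / n12)
se_hat   <- sqrt(1 / n12 + 1 / n21)

# Approximate 95% Wald interval (illustrative addition)
ci <- beta_hat + c(-1, 1) * qnorm(0.975) * se_hat

c(beta = beta_hat, se = se_hat, lower = ci[1], upper = ci[2])
```

The concordant cells (794 and 570) do not enter the estimate at all, which matches the statement that only subjects with Si = 1 contribute.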

Navajo Indians
144 victims of myocardial infarction (MI cases) are matched with 144 disease-free control subjects according to gender and age. All participants of the study were asked whether they had ever been diagnosed with diabetes:

                 Cases
Controls         Diabetes   No diabetes
Diabetes                9            16
No diabetes            37            82

> myo.ml <- clogit(MI ~ diabetes + strata(pair), data = t103)
> summary(myo.ml)
Call:
coxph(formula = Surv(rep(1, 288L), MI) ~ diabetes + strata(pair),
    data = t103, method = "exact")

  n= 288

           coef exp(coef) se(coef)     z Pr(>|z|)
diabetes 0.8383    2.3125   0.2992 2.802  0.00508 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

         exp(coef) exp(-coef) lower .95 upper .95
diabetes     2.312     0.4324     1.286     4.157

Rsquare= 0.029   (max possible= 0.5 )
Likelihood ratio test= 8.55  on 1 df,   p=0.003449
Wald test            = 7.85  on 1 df,   p=0.005082
Score (logrank) test = 8.32  on 1 df,   p=0.003919
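The data frame `t103` is not listed on the slides. A hypothetical reconstruction from the four pair counts (9, 16, 37, 82) is sketched below, assuming one row per person with columns `pair`, `MI` and `diabetes`, and that the `survival` package is installed. It refits the model and compares the coefficient with the closed form log(n21/n12).

```r
library(survival)

# Pair-level table: diabetes status of the control and of the case,
# with the number of pairs in each cell (layout is an assumption).
pair_counts <- data.frame(
  control = c(1, 1, 0, 0),
  case    = c(1, 0, 1, 0),
  n       = c(9, 16, 37, 82)
)

# One row per pair
expanded <- pair_counts[rep(seq_len(nrow(pair_counts)), pair_counts$n),
                        c("control", "case")]
n_pairs <- nrow(expanded)

# Hypothetical `t103`: case rows first, then the matched control rows
t103 <- data.frame(
  pair     = rep(seq_len(n_pairs), times = 2),
  MI       = rep(c(1, 0), each = n_pairs),
  diabetes = c(expanded$case, expanded$control)
)

myo.ml <- clogit(MI ~ diabetes + strata(pair), data = t103)

# Fitted coefficient next to the closed-form conditional MLE
c(clogit = unname(coef(myo.ml)), closed_form = log(37 / 16))
```

Pairs in which case and control agree on diabetes status carry no information about the coefficient, so only the 37 and 16 discordant pairs drive the fit.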

Conditional Logistic Regression as GLM
Model:
logit(P(Yit = 1)) = αi + β1 x1it + β2 x2it + ... + βp xpit
Conditioned on one success:
P(Yi1 = 1, Yi2 = 0 | Si = 1) = 1 / (1 + exp((xi2 - xi1)' β))
P(Yi1 = 0, Yi2 = 1 | Si = 1) = exp((xi2 - xi1)' β) / (1 + exp((xi2 - xi1)' β))

Conditional Logistic Regression as GLM
Rewrite:
Y*i = 1 if Yi1 = 0, Yi2 = 1,
Y*i = 0 if Yi1 = 1, Yi2 = 0,
and X*i = Xi2 - Xi1 for all i.
Then
logit(P(Y*i = 1)) = β1 x*1i + β2 x*2i + ... + βp x*pi
is a logistic regression with no intercept.

> table(ystar)
ystar
  1
144
> table(xstar)
xstar
-1  0  1
16 91 37

glm(formula = ystar ~ xstar - 1, family = binomial(logit))

Deviance Residuals:
   Min      1Q  Median      3Q     Max
0.8478  0.8478  1.1774  1.1774  1.5477

Coefficients:
      Estimate Std. Error z value Pr(>|z|)
xstar   0.8383     0.2992   2.802  0.00508 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 199.63  on 144  degrees of freedom
Residual deviance: 191.07  on 143  degrees of freedom
AIC: 193.07

Number of Fisher Scoring iterations: 4
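The construction of `ystar` and `xstar` is not shown. A minimal sketch that rebuilds them from the printed tables follows, assuming `xstar` is diabetes of the case minus diabetes of the control and `ystar` is 1 for every pair because each pair contains exactly one MI case.

```r
# Differences in diabetes status (case minus control), one value per pair,
# rebuilt from the printed table: -1 (16 pairs), 0 (91 pairs), 1 (37 pairs).
xstar <- rep(c(-1, 0, 1), times = c(16, 91, 37))

# Every pair has the case as its single "success", so ystar is constant
ystar <- rep(1, length(xstar))

# No-intercept logistic regression on the within-pair difference
fit <- glm(ystar ~ xstar - 1, family = binomial(link = "logit"))

beta_hat <- unname(coef(fit))
se_hat   <- unname(sqrt(diag(vcov(fit))))
c(beta = beta_hat, se = se_hat)
```

Dropping the intercept is what makes this equivalent to the conditional fit: the 91 pairs with `xstar` equal to 0 get fitted probability 0.5 whatever β is, so they affect the deviance but not the estimate.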

Matched Pairs: Ordinal
- Y1 and Y2 are ordinal variables with J > 2 categories
- POLR model (marginal): logit(P(Yt ≤ j)) = αj + β xt
- Cumulative odds ratios are constant for all j:
  log θj = log { [P(Y2 ≤ j) / P(Y2 > j)] / [P(Y1 ≤ j) / P(Y1 > j)] } = β (x2 - x1) = β

Marginal Homogeneity
Marginal homogeneity is equivalent to a zero log odds ratio, β = 0:
logit(P(Y1 ≤ j)) = logit(P(Y2 ≤ j)) for all j
⇔ P(Y1 ≤ j) = P(Y2 ≤ j) for all j
⇔ πj+ = π+j for all j
- Model fit is based on 1 + (J-1) parameters
- The model has J-2 degrees of freedom
- Overall we have 2(J-1) degrees of freedom

Matched Pairs: Nominal
Baseline logistic regression:
log [P(Yt = j) / P(Yt = J)] = αj + βj xt
Then βj = 0 (for all j) is the test for marginal homogeneity.

Models for Square Contingency Tables
- For nominal Y with J ≥ 3 categories, use J as baseline
- Baseline logistic regression: log [P(Yt = j) / P(Yt = J)] = αj + βj xt
- Then βj = 0 is the test for marginal homogeneity

Migration Data

                     Residence in 1985
Residence in 1980       NE      MW       S       W    Total
NE                   11607     100     366     124    12197
MW                      87   13677     515     302    14581
S                      172     225   17819     270    18486
W                       63     176     286   10192    10717
Total                11929   14178   18986   10888    55981

- 95% of the data is on the diagonal
- Marginal homogeneity seems given; is the data even symmetric?
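The two descriptive claims can be checked in a few lines. This is a sketch that retypes the table as a matrix, with rows taken as residence in 1980 and columns as residence in 1985; the object names are illustrative.

```r
regions <- c("NE", "MW", "S", "W")

# Rows = residence in 1980, columns = residence in 1985
migration <- matrix(c(11607,   100,   366,   124,
                         87, 13677,   515,   302,
                        172,   225, 17819,   270,
                         63,   176,   286, 10192),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(res1980 = regions, res1985 = regions))

# Share of people who did not change region
diag_share <- sum(diag(migration)) / sum(migration)

# Row and column margins side by side, as proportions
margins <- rbind(res1980 = rowSums(migration),
                 res1985 = colSums(migration))
round(margins / sum(migration), 3)
diag_share
```

Comparing the two rows of the margin table gives a first, informal look at marginal homogeneity before any model is fitted.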

Symmetry Model
H0: πab = πba for all a, b
- as logistic regression: log(πab / πba) = 0
- as loglinear model: log mab = µ + µa^X + µb^Y + µab^XY with µa^X = µa^Y and µab^XY = µba^XY

Migration Data
Symmetry seems to be violated: e.g. far more people moved MW -> S (515) than S -> MW (225)
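One common way to quantify this is a Pearson-type statistic for H0: πab = πba, which compares each pair of off-diagonal cells with their average. The sketch below is an illustration under that choice of statistic, not necessarily the fit used in the lecture. Rows are taken as residence in 1980 and columns as residence in 1985.

```r
regions <- c("NE", "MW", "S", "W")

# Rows = residence in 1980, columns = residence in 1985
migration <- matrix(c(11607,   100,   366,   124,
                         87, 13677,   515,   302,
                        172,   225, 17819,   270,
                         63,   176,   286, 10192),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(res1980 = regions, res1985 = regions))

# n_ab for a < b, and the mirrored counts n_ba in the same order
upper <- migration[upper.tri(migration)]
lower <- t(migration)[upper.tri(migration)]

# Under symmetry the fitted value for both cells is (n_ab + n_ba) / 2,
# which gives X^2 = sum (n_ab - n_ba)^2 / (n_ab + n_ba)
X2 <- sum((upper - lower)^2 / (upper + lower))
df <- length(upper)                      # J (J - 1) / 2 off-diagonal pairs
p_value <- pchisq(X2, df = df, lower.tail = FALSE)

c(X2 = X2, df = df, p_value = p_value)
```

The diagonal cells drop out of the statistic entirely, so the 95% of stayers have no influence on the symmetry test.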