Categorical data analysis Chapter 5
1 Categorical data analysis Chapter 5
2 Interpreting parameters in logistic regression The sign of β determines whether π(x) is increasing or decreasing as x increases. The rate of climb or descent increases as |β| increases. When β = 0, the curve flattens to a horizontal straight line, and Y is independent of X. π(x) approaches 1 at the same rate that it approaches 0. The odds multiply by e^β for every 1-unit increase in x. In other words, e^β is an odds ratio: the odds at X = x + 1 divided by the odds at X = x.
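The odds-ratio interpretation of e^β can be checked numerically. A minimal sketch (the parameter values and function names here are illustrative, not from any fitted model):

```python
import math

def logistic(alpha, beta, x):
    """pi(x) = exp(alpha + beta*x) / (1 + exp(alpha + beta*x))."""
    eta = alpha + beta * x
    return math.exp(eta) / (1 + math.exp(eta))

def odds(p):
    return p / (1 - p)

# illustrative parameter values
alpha, beta = -2.0, 0.7
for x in [0.0, 1.5, 3.0]:
    ratio = odds(logistic(alpha, beta, x + 1)) / odds(logistic(alpha, beta, x))
    # the odds ratio for a 1-unit increase is exp(beta), at every x
    assert abs(ratio - math.exp(beta)) < 1e-9
```

The loop confirms that the odds ratio for a 1-unit increase does not depend on where x sits on the curve, which is exactly what makes e^β a clean one-number summary.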
3 Linear approximation
4 Linear approximation π(x) ≈ a + bx, where b = βπ(x)(1 − π(x)). When x = −α/β, π(x) = 1/2 and the slope at this x is β(1/2)(1/2) = β/4. This value x = −α/β, which makes π(x) = 1/2, is called the median effective level. In toxicology studies it is called LD50 (LD = lethal dose), the dose with a 50% chance of a lethal result. From this linear approximation, near the x where π(x) = 1/2, a change in x of 1/β corresponds to a change in π(x) of roughly (1/β)(β/4) = 1/4; that is, 1/β approximates the distance between the x values where π(x) = 0.5 and where π(x) = 0.25 or 0.75.
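The β/4 slope at the median effective level is easy to verify numerically. The sketch below uses the crab width fit logit = −12.351 + 0.497x that appears later in these notes:

```python
import math

def logistic(alpha, beta, x):
    eta = alpha + beta * x
    return math.exp(eta) / (1 + math.exp(eta))

alpha, beta = -12.351, 0.497   # crab width fit used later in the notes
x50 = -alpha / beta            # median effective level: pi(x50) = 1/2
assert abs(logistic(alpha, beta, x50) - 0.5) < 1e-9

# numerical slope at x50 vs. the beta/4 approximation
h = 1e-6
slope = (logistic(alpha, beta, x50 + h) - logistic(alpha, beta, x50 - h)) / (2 * h)
assert abs(slope - beta / 4) < 1e-6
```

The central-difference slope matches β/4 = 0.497/4 ≈ 0.124, the maximum rate of change of the fitted curve.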
5 Interpreting parameters An alternative way to interpret the effect reports the values of π(x) at certain x values, such as the minimum, maximum, and quartiles. The change in π(x) over the middle half of x values, from the lower quartile to the upper quartile, is a useful summary of the effect. It can be compared with the corresponding change over the middle half of values of other quantitative predictors. The intercept parameter α is not usually of particular interest. However, by centering the predictor about 0 [i.e., replacing x by (x − x̄)], α becomes the logit at x = x̄, and thus e^α/(1 + e^α) = π(x̄). As in ordinary regression, centering is also helpful in complex models containing quadratic or interaction terms, to reduce correlations among model parameter estimates.
6 Looking at the data before fitting Plot sample proportions or logits against x. When x is categorical, group the data at each setting of x and plot the sample logit against x. A small adjustment is needed when the sample success proportion is 0 or 1: use log[(y_i + 1/2)/(n_i − y_i + 1/2)]. When x is continuous, we could group the data with nearby x values into categories and then plot. Alternatively, fit a generalized additive model (GAM) to smooth the trend. A GAM replaces the linear predictor of a GLM by a smooth function. A plot of this fit reveals whether severe discrepancies occur from the S-shaped trend predicted by logistic regression.
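The 1/2 adjustment can be sketched as a small helper (`adjusted_logit` is a hypothetical name, not a library function):

```python
import math

def adjusted_logit(y, n):
    """Empirical logit log[(y + 1/2)/(n - y + 1/2)]; finite even when y = 0 or y = n."""
    return math.log((y + 0.5) / (n - y + 0.5))

# the plain logit log[y/(n - y)] is undefined at y = 0 or y = n;
# the adjusted version stays finite and is antisymmetric in y vs. n - y
assert abs(adjusted_logit(0, 10) + adjusted_logit(10, 10)) < 1e-12
assert abs(adjusted_logit(5, 10)) < 1e-12   # y/n = 1/2 gives logit 0
```

This is only for plotting; the model itself is still fitted to the raw counts.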
7 Example: horseshoe crab mating revisited Define for crab i, y i = 1 if she has at least one satellite and y i = 0 otherwise.
8 Example: horseshoe crab mating revisited The fitted model is logit[π̂(x)] = −12.351 + 0.497x. When x = 26.3 cm, the mean width level in this sample, π̂(x) = 0.674. π̂(x) = 0.5 when x = −α̂/β̂ = 12.351/0.497 = 24.8, which is the median effective level. The estimated odds of a satellite multiply by exp(β̂) = exp(0.497) = 1.64 for each 1-cm increase in width; that is, there is a 64% increase.
9 Example: horseshoe crab mating revisited At the mean width, π̂(x) = 0.674, and π̂(x) increases by about β̂[π̂(x)(1 − π̂(x))] = 0.497(0.674)(0.326) = 0.11 for a 1-cm increase in width. The lower quartile, median, and upper quartile for width are 24.9, 26.1, and 27.7; π̂(x) at those values equals 0.51, 0.65, and 0.81, increasing by 0.3 over the x values for the middle half of the sample. With the female crab's weight as the predictor, logit[π̂(x)] = −3.695 + 1.815x. A 1-kg increase in weight is not comparable to a 1-cm increase in width, so comparing the β coefficients does not make sense. The quartiles for weight are 2.00, 2.35, and 2.85; π̂(x) at those values equals 0.48, 0.64, and 0.81, increasing by 0.33 over the middle half of the sampled weights. The effect is similar to that of width, which is not surprising as these predictors are very highly correlated.
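The quartile-based summary above can be reproduced from the fitted width model (coefficients −12.351 and 0.497 as reported in these notes):

```python
import math

def pi_hat(alpha, beta, x):
    eta = alpha + beta * x
    return math.exp(eta) / (1 + math.exp(eta))

# width fit reported in the notes: logit = -12.351 + 0.497 x
a, b = -12.351, 0.497
quartiles = [24.9, 26.1, 27.7]   # lower quartile, median, upper quartile of width
probs = [pi_hat(a, b, x) for x in quartiles]

# roughly 0.51, 0.65, 0.81: a rise of about 0.3 over the middle half of widths
assert abs(probs[0] - 0.51) < 0.01
assert abs(probs[2] - probs[0] - 0.30) < 0.02
```

The same three-line computation applied to the weight quartiles gives the 0.48 to 0.81 rise quoted above, which is what makes the two predictors comparable on the probability scale.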
10 Logistic regression with retrospective studies In case-control studies, the explanatory variable X rather than the response variable Y is random. Applying logistic regression to case-control data effectively models P(Y = 1 | Z = 1, x), where Z is an indicator of whether a subject is sampled (1 = yes, 0 = no). Assume a logistic model for P(Y = 1 | x), that is, logit[P(Y = 1 | x)] = α + βx. It can be shown that logit[P(Y = 1 | Z = 1, x)] = α* + βx, where α* = α + log(ρ_1/ρ_0) with ρ_1 = P(Z = 1 | y = 1) and ρ_0 = P(Z = 1 | y = 0) representing the probabilities of sampling a case and a control, respectively.
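The intercept-shift result can be verified by applying Bayes' rule directly. The population model and sampling rates below are made-up values for illustration:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

alpha, beta = -3.0, 0.8          # hypothetical population model logit P(Y=1|x)
rho1, rho0 = 0.9, 0.001          # case / control sampling probabilities

def p_sampled(x):
    """P(Y = 1 | Z = 1, x) by Bayes' rule under outcome-dependent sampling."""
    pi = math.exp(alpha + beta * x) / (1 + math.exp(alpha + beta * x))
    return rho1 * pi / (rho1 * pi + rho0 * (1 - pi))

# the retrospective logit has the same slope beta, only a shifted intercept
for x in [0.0, 1.0, 2.5]:
    expected = alpha + math.log(rho1 / rho0) + beta * x
    assert abs(logit(p_sampled(x)) - expected) < 1e-9
```

The check makes the key point concrete: heavily oversampling cases (ρ_1 ≫ ρ_0) moves the intercept but leaves β untouched.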
11 Logistic regression with retrospective studies When Y is random, as under multinomial, Poisson, or independent multinomial (row totals fixed) sampling, ρ_1 = ρ_0; that is, the sampling rate for cases is the same as for controls. For most case-control studies, ρ_1 > ρ_0, so the estimated intercept is larger than it would be if the study were prospective. With case-control studies, it is not possible to estimate β in binary-response models with links other than the logit. This is an important advantage of the logit link and is one reason why logistic regression models are so popular in biomedical studies.
12 Inference about model parameters and probabilities For H_0: β = 0: Wald test: z = β̂/SE. Likelihood-ratio test: −2[loglikelihood(β = 0) − loglikelihood(β̂)]. Score test: based on the standardized derivative of the log-likelihood at the null value β = 0. A 95% confidence interval for the linear predictor is α̂ + β̂x_0 ± 1.96(SE), where SE is the square root of the estimate of var(α̂ + β̂x_0) = var(α̂) + x_0² var(β̂) + 2x_0 cov(α̂, β̂). A 95% CI for π(x_0) is then obtained by substituting each endpoint into the inverse transformation π(x_0) = exp(logit)/[1 + exp(logit)].
13 Example: inference for horseshoe crab mating data
14 Example: inference for horseshoe crab mating data At width x = 26.5, the estimated logit is −12.351 + 0.497(26.5) = 0.826 and π̂(x) = 0.695. Software reports the estimated covariance matrix, from which var(α̂) = 6.91, var(β̂) = 0.01, and cov(α̂, β̂) = −0.262. At x = 26.5 the estimated variance of the logit is 0.0356, so the 95% CI for logit[π(26.5)] equals 0.826 ± (1.96)√0.0356, or (0.456, 1.196). This translates to the interval (0.61, 0.77) for the probability of satellites (e.g., exp(0.456)/[1 + exp(0.456)] = 0.61). Since corr(α̂, β̂) is near −1.0, for better computational precision fit the model using the centered predictor x* = x − 26.5, so that α̂ and its SE are then the estimated logit and its SE at x = 26.5.
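The interval arithmetic for this slide can be reproduced directly from the two reported quantities (the logit 0.826 and its variance 0.0356 at x = 26.5):

```python
import math

def expit(eta):
    return math.exp(eta) / (1 + math.exp(eta))

# values reported in the notes for the crab width fit at x = 26.5
logit_hat = 0.826
var_logit = 0.0356            # estimated var(a + b*26.5)

half_width = 1.96 * math.sqrt(var_logit)
lo, hi = logit_hat - half_width, logit_hat + half_width
assert (round(lo, 3), round(hi, 3)) == (0.456, 1.196)

# transform the endpoints to the probability scale
assert (round(expit(lo), 2), round(expit(hi), 2)) == (0.61, 0.77)
```

Note that the interval is built on the logit scale and only then mapped through expit; building it directly on the probability scale would not respect the [0, 1] bounds.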
15 Example: inference for horseshoe crab mating data We could ignore the model fit and simply use sample proportions (i.e., the saturated model) to estimate such probabilities. Six female crabs in the sample had x = 26.5, and four of them had satellites. The sample proportion estimate at x = 26.5 is π̂ = 4/6 = 0.67. The 95% score CI based on these six observations alone equals (0.3, 0.9). If the logistic model approximates the true probabilities decently, its estimator tends to be closer than the sample proportion to the true value, unless each sample proportion is based on an extremely large sample.
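The (0.3, 0.9) score interval can be reproduced with the Wilson (score) formula for a binomial proportion; `score_ci` is a hypothetical helper name:

```python
import math

def score_ci(y, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion."""
    p = y / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

lo, hi = score_ci(4, 6)
assert (round(lo, 1), round(hi, 1)) == (0.3, 0.9)
```

The width of this interval, based on only six crabs, illustrates why the model-based estimate is usually preferable.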
16 Checking goodness of fit: grouped and ungrouped data With ungrouped data, or with continuous or nearly continuous predictors, X² and G² do not have limiting chi-squared distributions. Two popular alternatives for checking goodness of fit in this case: group the observed and fitted values for a partition of the space of x values, or group observed and fitted values according to the estimated probabilities of success from the model fitted to the original ungrouped data.
17 Partition x space
18 Partition x space In each width category, the fitted value for a yes response is the sum of the estimated probabilities π̂(x) for all crabs in that category. X² = 5.3 and G² = 6.2 with df = 8 − 2 = 6. Neither X² nor G² shows evidence of lack of fit (P > 0.4). As the number of explanatory variables increases, this strategy loses effectiveness: simultaneous grouping of values for each variable can produce a contingency table with a large number of cells, most of which have very small counts (the curse of dimensionality).
19 Partition according to estimated probabilities One common approach forms the groups in the partition so they have approximately equal size. With 10 groups, the first pair of observed counts and corresponding fitted counts refers to the n/10 observations having the highest estimated probabilities, the next pair refers to the n/10 observations having the second decile of estimated probabilities, and so on. The Hosmer-Lemeshow goodness-of-fit statistic is X² = Σ_{j=1}^{g} (Σ_i y_ij − Σ_i π̂_ij)² / [(Σ_i π̂_ij)(1 − Σ_i π̂_ij/n_j)], where g is the number of partitions and π̂_ij denotes the fitted probability, for observation i in group j, from the model fitted to the ungrouped data.
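A sketch of the grouping-and-comparison idea is below. This follows one common textbook form of the statistic; grouping conventions (ties, group sizes) vary across software, and the function name is hypothetical:

```python
def hosmer_lemeshow(y, pi, g=10):
    """HL statistic: sort by fitted probability, split into g groups,
    and compare observed vs. fitted success counts within each group."""
    pairs = sorted(zip(pi, y))           # (fitted prob, observed 0/1)
    n = len(pairs)
    stat = 0.0
    for j in range(g):
        chunk = pairs[j * n // g:(j + 1) * n // g]
        if not chunk:
            continue
        obs = sum(yi for _, yi in chunk)          # observed successes
        fit = sum(pi_i for pi_i, _ in chunk)      # sum of fitted probabilities
        m = len(chunk)
        stat += (obs - fit) ** 2 / (fit * (1 - fit / m))
    return stat

# toy check: when observed group totals equal the fitted totals the statistic is 0
pi = [0.2] * 5 + [0.8] * 5
y_match = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]   # 1 success in low group, 4 in high
assert abs(hosmer_lemeshow(y_match, pi, g=2)) < 1e-12
```

Discrepant group totals make each squared term, and hence the statistic, grow; large values signal lack of fit.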
20 Partition according to estimated probabilities When the number of distinct patterns of covariate values equals the sample size, the null distribution of the above statistic is approximately chi-squared with df = g − 2. For the horseshoe crab data with continuous width predictor, the Hosmer-Lemeshow statistic with g = 10 groups equals 3.5 with df = 8, indicating a decent fit.
21 Wald inference can be suboptimal Its results depend on the scale of the parameterization. For example, for the model logit(π) = α, the hypothesis α = 0 is equivalent to π = 0.5, but the Wald test statistics differ for the two parameterizations. Evaluations reveal that the Wald test for α = 0 tends to be too conservative and the one for π = 0.5 tends to be too liberal. When a true effect is relatively large, the Wald test is not as powerful as the likelihood-ratio and score tests. For the single binomial case, for example, suppose n = 25: we would regard y = 24 as stronger evidence than y = 23 against H_0: α = 0, yet the Wald statistic equals 9.7 when y = 24 and 11.0 when y = 23. For comparison, the likelihood-ratio statistics are 26.3 and 20.7.
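The aberrant Wald behavior is easy to reproduce for the single-binomial case; a sketch with hypothetical helper names:

```python
import math

def wald_sq(y, n):
    """Squared Wald statistic for H0: alpha = 0 (i.e. pi = 1/2) on the logit scale."""
    alpha_hat = math.log(y / (n - y))
    se = math.sqrt(1 / y + 1 / (n - y))
    return (alpha_hat / se) ** 2

def lr_stat(y, n):
    """Likelihood-ratio statistic for H0: pi = 1/2."""
    p = y / n
    return 2 * (y * math.log(p / 0.5) + (n - y) * math.log((1 - p) / 0.5))

# y = 24 is stronger evidence than y = 23, yet the Wald statistic shrinks...
assert wald_sq(24, 25) < wald_sq(23, 25)
assert round(wald_sq(24, 25), 1) == 9.7
# ...while the likelihood-ratio statistic behaves sensibly
assert lr_stat(24, 25) > lr_stat(23, 25)
assert round(lr_stat(24, 25), 1) == 26.3
```

The culprit is the SE in the Wald denominator, which blows up as p̂ approaches 1, deflating the statistic exactly when the evidence is strongest.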
22 Logistic models with categorical (factor) predictors Consider a single-factor logistic model where X has I categories, and for each value of X let y_i be the number of successes out of n_i trials. There are two equivalent ways to represent factors. ANOVA-type representation: log[π_i/(1 − π_i)] = α + β_i, i = 1, ..., I; the factor has as many parameters as categories. Indicator-variable representation: logit(π_i) = α + β_1 x_1 + β_2 x_2 + ... + β_I x_I, where x_i = 1 for observations in row i and x_i = 0 otherwise.
23 Effect coding With I groups of data, the model can have only I free parameters. However, there are I + 1 parameters in the model, so a constraint is needed to remove one. Set one of the β_i to zero, for example β_I = 0. With this constraint, α is the main effect of category I, which we call the baseline category. All the other β_i represent effect differences between category i and the baseline category; that is, β_i is the difference between the logits in rows i and I. The main effect of category i is then α + β_i.
24 Effect coding Alternatively, set Σ_{i=1}^{I} β_i = 0. Then α represents the average effect of the categories, and the β_i represent deviations from the average effect; that is, β_i is the log odds ratio between row i and the average of all rows. For different constraints, the estimates of the β's are different, but the estimates of the mean response and of contrasts between β's remain the same.
25 Example: Alcohol and Infant Malformation revisited logit(π_i) = α + β_i is a saturated model, and the estimated linear predictors α̂ + β̂_i are the sample logits. Table 5.3 shows that, except for the slight reversal between the first and second categories of alcohol consumption, the sample logits and hence the sample proportions of malformation cases increase as alcohol consumption increases.
26 Example: Alcohol and Infant Malformation revisited The test of independence is equivalent to testing H_0: β_i = 0, i = 1, ..., I. The Pearson statistic is X² = 12.1 (P = 0.02) and the likelihood-ratio statistic is G² = 6.2 (P = 0.19). The P-values using the exact conditional distributions of X² and G² are 0.03 and 0.13.
27 Linear logit model for I × 2 contingency tables The near-monotone increase in the sample logits in Table 5.3 suggests that a linear logit model may fit well. For ordered factor categories, we may assign scores that describe distances between categories of X and fit the linear (or a more complex) logit model logit(π_i) = α + βx_i. With scores (x_1 = 0, x_2 = 0.5, x_3 = 1.5, x_4 = 4.0, x_5 = 7.0), Table 5.4 shows the results.
28 Linear logit model for I × 2 contingency tables The linear logit model fits nearly as well as the saturated model: X² = 2.05 and G² = 1.95 with df = 3.
29 Alcohol and infant malformation revisited Pearson test of independence: X²(I) = 12.1 with P-value 0.02. With scores (0, 0.5, 1.5, 4.0, 7.0), the score test, also called the Cochran-Armitage trend test, has z² = 6.57 with P-value 0.01. The test suggests strong evidence of a positive slope. The Wald statistic for the linear logit model equals (β̂/SE)² = (0.3166/0.1254)² = 6.37 (P = 0.012), and the likelihood-ratio statistic equals 4.25 (P = 0.039). With highly unbalanced counts, it is best not to use the Wald approach. The asymptotics for the Cochran-Armitage trend test, however, work well even for quite small n when the n_i are equal and the x_i are equally spaced.
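The Cochran-Armitage statistic can be computed from the I × 2 table directly. The sketch below uses the malformation counts from the Table 5.3 example (cases 48, 38, 5, 1, 1 out of row totals 17114, 14502, 793, 127, 38) and reproduces z² = 6.57:

```python
def cochran_armitage_z2(scores, cases, totals):
    """Cochran-Armitage trend test statistic z^2 for an I x 2 table."""
    N = sum(totals)
    p = sum(cases) / N                                    # overall success rate
    sxy = sum(x * y for x, y in zip(scores, cases))
    sxn = sum(x * n for x, n in zip(scores, totals))
    num = sxy - p * sxn                                   # covariance-type numerator
    xbar = sxn / N
    ss = sum(n * (x - xbar) ** 2 for x, n in zip(scores, totals))
    return num ** 2 / (p * (1 - p) * ss)

scores = [0, 0.5, 1.5, 4.0, 7.0]               # scores used in the notes
cases = [48, 38, 5, 1, 1]                      # malformations present
totals = [17114, 14502, 793, 127, 38]          # row totals
assert round(cochran_armitage_z2(scores, cases, totals), 2) == 6.57
```

Referred to a chi-squared distribution with df = 1, 6.57 gives the P-value of about 0.01 quoted above.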
30 Model smoothing improves precision of estimation and test power Example: Skin damage and Leprosy
31 Example: Skin damage and Leprosy G²(I) = 7.28 (df = 4) does not show much evidence of association (P = 0.12). G²(I | L) = 6.65 with df = 1 (P = 0.01) gives strong evidence of more positive clinical change at the higher level of infiltration. G²(L) = 0.63 (df = 3) suggests that the linear logit model fits well.
32 Multiple logistic regression logit[π(x)] = α + β_1 x_1 + β_2 x_2 + ... + β_p x_p. For qualitative predictors, we use indicator variables for their categories. The parameter β_j refers to the effect of x_j on the log odds that Y = 1, adjusting for the other x_k. For instance, exp(β_j) is the multiplicative effect on the odds of a 1-unit increase in x_j, holding the other x_k fixed.
33 Logistic models for multiway contingency tables
34 Logistic models for multiway contingency tables Let X be the indicator for AZT treatment (x = 1 for immediate AZT use, x = 0 otherwise) and Z be the indicator for race (z = 1 for whites, z = 0 for blacks). logit[P(Y = 1)] = α + β_1 x + β_2 z. The model assumes a homogeneous XY association; that is, the conditional odds ratio between X and Y is the same at each level of Z. Conditional independence between X and Y given Z is equivalent to β_1 = 0. Adding the interaction between X and Z, the model has as many parameters as there are X-Z combinations (one per logit) and therefore becomes a saturated model.
35 Logistic models for multiway contingency tables
36 Logistic models for multiway contingency tables α is the log odds of developing AIDS symptoms for black subjects without immediate AZT use. β_1 is the increment to the log odds for those with immediate AZT use. β_2 is the increment to the log odds for white subjects. For each race, the estimated odds ratio between immediate AZT use and development of AIDS symptoms equals exp(−0.720) = 0.49. The 95% Wald confidence interval for this effect is exp[−0.720 ± 1.96(0.279)] = (0.28, 0.84).
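The odds ratio and its Wald interval follow directly from the slope estimate; the sketch below takes β̂_1 ≈ −0.720, the center implied by the reported interval exp[· ± 1.96(0.279)] = (0.28, 0.84):

```python
import math

# AZT-effect estimate and its standard error, as reported in the notes
beta1_hat, se = -0.720, 0.279

odds_ratio = math.exp(beta1_hat)
ci = (math.exp(beta1_hat - 1.96 * se), math.exp(beta1_hat + 1.96 * se))
assert round(odds_ratio, 2) == 0.49
assert (round(ci[0], 2), round(ci[1], 2)) == (0.28, 0.84)
```

As with the probability CI earlier, the interval is symmetric on the log-odds scale and only then exponentiated.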
37 Different coding schemes
38 Different coding schemes For each coding scheme, at a given combination of AZT use and race, the estimated probability of developing AIDS symptoms is the same. For instance, the intercept estimate plus the estimate for immediate AZT use plus the estimate for being white is −1.738 for each scheme, so the estimated probability that white veterans with immediate AZT use develop AIDS symptoms equals exp(−1.738)/[1 + exp(−1.738)] = 0.15.
39 Example: Horseshoe Crab Satellites revisited
40 Example: Horseshoe Crab Satellites revisited
41 Example: Horseshoe Crab Satellites revisited For dark crabs, logit[P̂(Y = 1)] = −12.715 + 0.468x; by contrast, for medium-light crabs, logit[P̂(Y = 1)] = (−12.715 + 1.330) + 0.468x = −11.385 + 0.468x. At the average width of 26.3 cm, P̂(Y = 1) = 0.399 for dark crabs and 0.715 for medium-light crabs. At any given width, the estimated odds that a medium-light crab has a satellite are exp(1.330) = 3.8 times the estimated odds for a dark crab. At width x = 26.3, the odds equal 0.715/0.285 = 2.51 for a medium-light crab and 0.399/0.601 = 0.66 for a dark crab.
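The parallel-logit structure on this slide can be reproduced with the coefficients as given in the notes (dark is the baseline color; medium-light gets the increment 1.330):

```python
import math

def expit(eta):
    return math.exp(eta) / (1 + math.exp(eta))

def logit_hat(x, medium_light):
    """Fitted logit for the color + width model; dark is the baseline."""
    return -12.715 + (1.330 if medium_light else 0.0) + 0.468 * x

x = 26.3   # average width
p_dark = expit(logit_hat(x, medium_light=False))
p_ml = expit(logit_hat(x, medium_light=True))
assert abs(p_dark - 0.399) < 0.002
assert abs(p_ml - 0.715) < 0.002

# at any width the odds ratio is exp(1.330), since the logits are parallel
ratio = (p_ml / (1 - p_ml)) / (p_dark / (1 - p_dark))
assert round(ratio, 1) == round(math.exp(1.330), 1) == 3.8
```

Because the two fitted logits differ only in the intercept, the 3.8 odds ratio holds at every width, not just at 26.3 cm.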
42 Example: Horseshoe Crab Satellites revisited To test the color effect, we test H_0: β_1 = β_2 = β_3 = 0. Comparing the models with and without the color covariate, the difference in deviance is −2(L_0 − L_1) = 7.0 with df = 3. The P-value is 0.07, which provides slight evidence of a color effect. The model assumes a lack of interaction between color and width in their effects. Comparing the models with and without the interaction, the difference in deviance is 4.4 with df = 3. The evidence of interaction is weak (P = 0.22).
43 Example: Horseshoe Crab Satellites revisited
44 Quantitative treatment of ordinal predictor Color has ordered categories, from lightest to darkest. Assigning scores c = (1, 2, 3, 4) to the color categories, the model treats the color predictor as quantitative with a linear effect: logit[P(Y = 1)] = α + β_1 c + β_2 x. The fitted parameters are α̂ = −10.071, β̂_1 = −0.509 (SE = 0.224), and β̂_2 = 0.458 (SE = 0.104).
45 Quantitative treatment of ordinal predictor The likelihood-ratio statistic comparing this fit to the more complex model having a separate parameter for each color equals 1.66 (df = 2). With P = 0.44, the simpler linear model seems adequate. Note that in the qualitative-color model the color parameter estimates are (1.33, 1.40, 1.11, 0); the first three colors are quite similar, so another potential scoring is (1, 1, 1, 0). The model fit is then logit[P̂(Y = 1)] = −12.980 + 1.300c + 0.478x. The likelihood-ratio statistic comparing the model with color scores (1, 1, 1, 0) to the model with a separate parameter for each color equals 0.5 (df = 2), showing that this simpler model is also adequate.
46 More on interpretations Instantaneous rate of change in probability: adjusting for the other predictors, as a function of a quantitative predictor x_j, π̂ has instantaneous rate of change β̂_j π̂(1 − π̂). For example, at predictor settings at which π̂ = 0.5, the approximate effect of a 1-cm increase in width is (0.478)(0.5)(0.5) = 0.12. We can summarize the effect of x_j on the probability scale by averaging the instantaneous rates over the sample: (1/n) Σ_{i=1}^{n} β̂_j π̂(x_i1, ..., x_ip)[1 − π̂(x_i1, ..., x_ip)].
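The averaging formula is a one-liner given the fitted probabilities. A sketch with a hypothetical helper name and made-up fitted probabilities (only the 0.478 width coefficient is from the notes):

```python
def avg_marginal_effect(beta_j, fitted_probs):
    """Average of the instantaneous rates beta_j * pi_i * (1 - pi_i) over the sample."""
    return sum(beta_j * p * (1 - p) for p in fitted_probs) / len(fitted_probs)

# hypothetical fitted probabilities for four observations
probs = [0.2, 0.5, 0.65, 0.8]
ame = avg_marginal_effect(0.478, probs)

# each term is at most beta_j / 4 (attained at pi = 1/2), so the average is too
assert 0 < ame < 0.478 / 4
```

This average marginal effect answers, on the probability scale, the question the odds ratio answers on the odds scale.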
47 More on interpretation Alternatively, describe the effect of x_j by setting the other predictors at their sample means and computing the estimated probabilities at the upper and lower quartiles of x_j.
48 More on interpretation Standardized coefficients: coefficients are standardized in order to compare predictors having different units. Standardize the predictors to (x_j − x̄_j)/s_{x_j}; the standardized coefficients are then β̂_j s_{x_j}. With binary color, the standard deviation of width is 2.109 cm, so the standardized coefficient for width is 0.478(2.109) = 1.01. When width is replaced by weight, the standardized coefficient is 1.729(0.577) = 1.00. The unstandardized estimates 0.478 and 1.729 are quite different, but width and weight have similar effects, conditional on whether or not a crab is dark.
STAT 7030: Categorical Data Analysis, 5. Logistic Regression. Peng Zeng, Department of Mathematics and Statistics, Auburn University, Fall 2012.
More informationLOGISTIC REGRESSION. Lalmohan Bhar Indian Agricultural Statistics Research Institute, New Delhi
LOGISTIC REGRESSION Lalmohan Bhar Indian Agricultural Statistics Research Institute, New Delhi- lmbhar@gmail.com. Introduction Regression analysis is a method for investigating functional relationships
More informationLecture 14: Introduction to Poisson Regression
Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu 8 May 2007 1 / 52 Overview Modelling counts Contingency tables Poisson regression models 2 / 52 Modelling counts I Why
More informationModelling counts. Lecture 14: Introduction to Poisson Regression. Overview
Modelling counts I Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu Why count data? Number of traffic accidents per day Mortality counts in a given neighborhood, per week
More informationLogistic Regression. Continued Psy 524 Ainsworth
Logistic Regression Continued Psy 524 Ainsworth Equations Regression Equation Y e = 1 + A+ B X + B X + B X 1 1 2 2 3 3 i A+ B X + B X + B X e 1 1 2 2 3 3 Equations The linear part of the logistic regression
More informationCohen s s Kappa and Log-linear Models
Cohen s s Kappa and Log-linear Models HRP 261 03/03/03 10-11 11 am 1. Cohen s Kappa Actual agreement = sum of the proportions found on the diagonals. π ii Cohen: Compare the actual agreement with the chance
More informationLogistic Regression Models for Multinomial and Ordinal Outcomes
CHAPTER 8 Logistic Regression Models for Multinomial and Ordinal Outcomes 8.1 THE MULTINOMIAL LOGISTIC REGRESSION MODEL 8.1.1 Introduction to the Model and Estimation of Model Parameters In the previous
More informationINTRODUCING LINEAR REGRESSION MODELS Response or Dependent variable y
INTRODUCING LINEAR REGRESSION MODELS Response or Dependent variable y Predictor or Independent variable x Model with error: for i = 1,..., n, y i = α + βx i + ε i ε i : independent errors (sampling, measurement,
More information8 Nominal and Ordinal Logistic Regression
8 Nominal and Ordinal Logistic Regression 8.1 Introduction If the response variable is categorical, with more then two categories, then there are two options for generalized linear models. One relies on
More informationStat 642, Lecture notes for 04/12/05 96
Stat 642, Lecture notes for 04/12/05 96 Hosmer-Lemeshow Statistic The Hosmer-Lemeshow Statistic is another measure of lack of fit. Hosmer and Lemeshow recommend partitioning the observations into 10 equal
More informationSection 4.6 Simple Linear Regression
Section 4.6 Simple Linear Regression Objectives ˆ Basic philosophy of SLR and the regression assumptions ˆ Point & interval estimation of the model parameters, and how to make predictions ˆ Point and interval
More informationLOGISTIC REGRESSION Joseph M. Hilbe
LOGISTIC REGRESSION Joseph M. Hilbe Arizona State University Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of
More informationMultiple Logistic Regression for Dichotomous Response Variables
Multiple Logistic Regression for Dichotomous Response Variables Edps/Psych/Soc 589 Carolyn J. Anderson Department of Educational Psychology c Board of Trustees, University of Illinois Fall 2018 Outline
More informationBIOS 625 Fall 2015 Homework Set 3 Solutions
BIOS 65 Fall 015 Homework Set 3 Solutions 1. Agresti.0 Table.1 is from an early study on the death penalty in Florida. Analyze these data and show that Simpson s Paradox occurs. Death Penalty Victim's
More informationGlossary. The ISI glossary of statistical terms provides definitions in a number of different languages:
Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the
More informationChapter 14 Logistic regression
Chapter 14 Logistic regression Adapted from Timothy Hanson Department of Statistics, University of South Carolina Stat 705: Data Analysis II 1 / 62 Generalized linear models Generalize regular regression
More information11. Generalized Linear Models: An Introduction
Sociology 740 John Fox Lecture Notes 11. Generalized Linear Models: An Introduction Copyright 2014 by John Fox Generalized Linear Models: An Introduction 1 1. Introduction I A synthesis due to Nelder and
More informationA Generalized Linear Model for Binomial Response Data. Copyright c 2017 Dan Nettleton (Iowa State University) Statistics / 46
A Generalized Linear Model for Binomial Response Data Copyright c 2017 Dan Nettleton (Iowa State University) Statistics 510 1 / 46 Now suppose that instead of a Bernoulli response, we have a binomial response
More informationSections 3.4, 3.5. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis
Sections 3.4, 3.5 Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1 / 22 3.4 I J tables with ordinal outcomes Tests that take advantage of ordinal
More informationLogistic regression. 11 Nov Logistic regression (EPFL) Applied Statistics 11 Nov / 20
Logistic regression 11 Nov 2010 Logistic regression (EPFL) Applied Statistics 11 Nov 2010 1 / 20 Modeling overview Want to capture important features of the relationship between a (set of) variable(s)
More information12 Modelling Binomial Response Data
c 2005, Anthony C. Brooms Statistical Modelling and Data Analysis 12 Modelling Binomial Response Data 12.1 Examples of Binary Response Data Binary response data arise when an observation on an individual
More informationSTAT 525 Fall Final exam. Tuesday December 14, 2010
STAT 525 Fall 2010 Final exam Tuesday December 14, 2010 Time: 2 hours Name (please print): Show all your work and calculations. Partial credit will be given for work that is partially correct. Points will
More informationChapter 14 Logistic Regression, Poisson Regression, and Generalized Linear Models
Chapter 14 Logistic Regression, Poisson Regression, and Generalized Linear Models 許湘伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 29 14.1 Regression Models
More informationBIOS 2083 Linear Models c Abdus S. Wahed
Chapter 5 206 Chapter 6 General Linear Model: Statistical Inference 6.1 Introduction So far we have discussed formulation of linear models (Chapter 1), estimability of parameters in a linear model (Chapter
More informationIntroducing Generalized Linear Models: Logistic Regression
Ron Heck, Summer 2012 Seminars 1 Multilevel Regression Models and Their Applications Seminar Introducing Generalized Linear Models: Logistic Regression The generalized linear model (GLM) represents and
More informationExam Applied Statistical Regression. Good Luck!
Dr. M. Dettling Summer 2011 Exam Applied Statistical Regression Approved: Tables: Note: Any written material, calculator (without communication facility). Attached. All tests have to be done at the 5%-level.
More informationCOMPLEMENTARY LOG-LOG MODEL
COMPLEMENTARY LOG-LOG MODEL Under the assumption of binary response, there are two alternatives to logit model: probit model and complementary-log-log model. They all follow the same form π ( x) =Φ ( α
More informationRon Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011)
Ron Heck, Fall 2011 1 EDEP 768E: Seminar in Multilevel Modeling rev. January 3, 2012 (see footnote) Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October
More informationHomework 5: Answer Key. Plausible Model: E(y) = µt. The expected number of arrests arrests equals a constant times the number who attend the game.
EdPsych/Psych/Soc 589 C.J. Anderson Homework 5: Answer Key 1. Probelm 3.18 (page 96 of Agresti). (a) Y assume Poisson random variable. Plausible Model: E(y) = µt. The expected number of arrests arrests
More informationUNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Applied Statistics Friday, January 15, 2016
UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Applied Statistics Friday, January 15, 2016 Work all problems. 60 points are needed to pass at the Masters Level and 75 to pass at the
More informationSTAT 526 Spring Midterm 1. Wednesday February 2, 2011
STAT 526 Spring 2011 Midterm 1 Wednesday February 2, 2011 Time: 2 hours Name (please print): Show all your work and calculations. Partial credit will be given for work that is partially correct. Points
More informationGeneralized linear models
Generalized linear models Douglas Bates November 01, 2010 Contents 1 Definition 1 2 Links 2 3 Estimating parameters 5 4 Example 6 5 Model building 8 6 Conclusions 8 7 Summary 9 1 Generalized Linear Models
More informationLogistic Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University
Logistic Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) Logistic Regression 1 / 38 Logistic Regression 1 Introduction
More informationGeneralized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science.
Texts in Statistical Science Generalized Linear Mixed Models Modern Concepts, Methods and Applications Walter W. Stroup CRC Press Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint
More informationPubHlth Intermediate Biostatistics Spring 2015 Exam 2 (Units 3, 4 & 5) Study Guide
PubHlth 640 - Intermediate Biostatistics Spring 2015 Exam 2 (Units 3, 4 & 5) Study Guide Unit 3 (Discrete Distributions) Take care to know how to do the following! Learning Objective See: 1. Write down
More informationChapter 22: Log-linear regression for Poisson counts
Chapter 22: Log-linear regression for Poisson counts Exposure to ionizing radiation is recognized as a cancer risk. In the United States, EPA sets guidelines specifying upper limits on the amount of exposure
More informationScatter plot of data from the study. Linear Regression
1 2 Linear Regression Scatter plot of data from the study. Consider a study to relate birthweight to the estriol level of pregnant women. The data is below. i Weight (g / 100) i Weight (g / 100) 1 7 25
More informationLecture 14 Simple Linear Regression
Lecture 4 Simple Linear Regression Ordinary Least Squares (OLS) Consider the following simple linear regression model where, for each unit i, Y i is the dependent variable (response). X i is the independent
More informationSTA102 Class Notes Chapter Logistic Regression
STA0 Class Notes Chapter 0 0. Logistic Regression We continue to study the relationship between a response variable and one or more eplanatory variables. For SLR and MLR (Chapters 8 and 9), our response
More informationDescribing Contingency tables
Today s topics: Describing Contingency tables 1. Probability structure for contingency tables (distributions, sensitivity/specificity, sampling schemes). 2. Comparing two proportions (relative risk, odds
More informationIntroduction to General and Generalized Linear Models
Introduction to General and Generalized Linear Models Generalized Linear Models - part III Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs.
More informationStat 704: Data Analysis I, Fall 2010
Stat 704: Data Analysis I, Fall 2010 Generalized linear models Generalize regular regression to non-normal data {(Y i,x i )} N i=1, most often Bernoulli or Poisson Y i. The general theory of GLMs has been
More informationy response variable x 1, x 2,, x k -- a set of explanatory variables
11. Multiple Regression and Correlation y response variable x 1, x 2,, x k -- a set of explanatory variables In this chapter, all variables are assumed to be quantitative. Chapters 12-14 show how to incorporate
More informationSCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models
SCHOOL OF MATHEMATICS AND STATISTICS Linear and Generalised Linear Models Autumn Semester 2017 18 2 hours Attempt all the questions. The allocation of marks is shown in brackets. RESTRICTED OPEN BOOK EXAMINATION
More informationGeneralized Linear Modeling - Logistic Regression
1 Generalized Linear Modeling - Logistic Regression Binary outcomes The logit and inverse logit interpreting coefficients and odds ratios Maximum likelihood estimation Problem of separation Evaluating
More informationLinear Regression. In this lecture we will study a particular type of regression model: the linear regression model
1 Linear Regression 2 Linear Regression In this lecture we will study a particular type of regression model: the linear regression model We will first consider the case of the model with one predictor
More informationCategorical Predictor Variables
Categorical Predictor Variables We often wish to use categorical (or qualitative) variables as covariates in a regression model. For binary variables (taking on only 2 values, e.g. sex), it is relatively
More information