Multinomial Regression Models

Size: px

Start display at page:

Download "Multinomial Regression Models"

Shona McGee
5 years ago
Views:

1 Multinomial Regression Models Objectives: Multinomial distribution and likelihood Ordinal data: Cumulative link models (POM). Ordinal data: Continuation models (CRM). 84 Heagerty, Bio/Stat 571

2 Models for Multinomial Data Example Data: Wisconsin Study of Diabetic Retinopathy (WESDR). Diabetic retinopathy is one of the leading causes of blindness in people aged years in the US. Disease characterized by appearance of small hemorrhages in the retina which progress and lead to severe visual loss. 85 Heagerty, Bio/Stat 571

3 Disease severity: None Mild Moderate Proliferative Covariates: duration of diabetes glycosolated hemoglobin diastolic blood pressure age at diagnosis 86 Heagerty, Bio/Stat 571

4 Models for Ordinal Data Example Data: Q: How does the distribution of disease severity vary as a function of covariates? Q: Possible approaches to analysis? Ideas? 87 Heagerty, Bio/Stat 571

5 87-1 Heagerty, Bio/Stat 571 Score versus GlycHemo y glyc.hemo Score versus logduration y log.duration

6 age.diag y Score versus AgeDiag dia.bp y Score versus DiaBP 87-2 Heagerty, Bio/Stat 571

7 Multinomial Response Models Common categorical outcomes take more than two levels: Pain severity = low, medium, high Conception trials = 1, 2 if not 1, 3 if not 1-2 The basic probability model is the multi-category extension of the Bernoulli (Binomial) distribution multinomial. Univariate outcome with multivariate representation: Let O i [1, 2, 3,... C] be a categorical outcome. Let Y ij be the indicator Y ij = 1(O i = j), j = 1, 2,... (C 1). 88 Heagerty, Bio/Stat 571

8 O i = j Y ij = 1 and Y ik = 0 k j O i Y i = Y i1 Y i2.. Y ic 89 Heagerty, Bio/Stat 571

9 Probability given by P (O i ) = P (Y i ) = π Y i1 i1 πy i2 i2 C 1... (1 k=1 π ik ) (1 C 1 k=1 Y ik) E[Y ij ] = π ij cov[y ij, Y ik ] = π ij π ik π ij (1 π ij ) j k j = k 90 Heagerty, Bio/Stat 571

10 What is Ordinal Data? Categories (fixed number) that are ordered much like the ordinal numbers: first, second,... It doesn t make sense to talk about a distance between the categories: HIGH, MEDIUM, LOW NEVER, RARELY, SOMETIMES, ALWAYS e.g. Drachman Class 91 Heagerty, Bio/Stat 571

11 Ordinal Data Q: Is this type of data really that common? A: Lincoln Moses (Stanford) surveyed vol. 306 of NEJM and found that 32/168 articles contained ordered categorical data! We found no indication that any of the authors in vol. 306 used an analysis that takes account of the ordering. (p. 447) 92 Heagerty, Bio/Stat 571

12 Ordinal? Binary? Continuous? Q: Do we lose much information by GROUPING (categorizing) a measurement? Example: rather than analysis of BMI as a continuous measurement we recode into: non-obese; obese; severely obsese. Q: Do we gain much information by using the ordered categories over just DICHOTOMIZING? Example: rather than analysis of BMI categories we use the indicator of severely obsese. 93 Heagerty, Bio/Stat 571

13 Ordinal? Binary? Continuous? Evaluation: 1) Assume an underlying Gaussian: O i N (µ, 1). 2) Define CUTPOINTS α R C. Define: Y ij = 1(O i (α j 1, α j ]) E(Y ij ) = P (O i (α j 1, α j ]) = P (O i α j ) P (O i α j 1 ) π ij = Φ(α j µ) Φ(α j 1 µ) Y i represents a multinomial random variable with m = 1, π i. 94 Heagerty, Bio/Stat 571

14 94-1 Heagerty, Bio/Stat 571 Gaussian with cut-points (mu=0) fz alpha1 alpha2 alpha3 z Gaussian with cut-points (mu=3/4) fz alpha1 alpha2 alpha3 z

15 Ordinal versus Continuous? Given a sample of Y 1, Y 2,..., Y n we can obtain the MLE for µ (and we ll just assume α is known). We obtain (derive this yourself!): I n = i j π ij µ 1 π ij π ij µ If we assume that the Y i s are i.i.d then we can drop the subscript i and we can compare the information from the categorized, or grouped, data Y i to the information in the continuous O i : 95 Heagerty, Bio/Stat 571

16 Asymptotic relative efficiency: var( µ O ) var( µ Y ) = 1/n [ n j π j µ 1 π j π j µ ] 1 = I 1 Q: How does the efficiency depend on C, the number of categories? Q: How does the efficiency depend on the locations α j? 96 Heagerty, Bio/Stat 571

17 Efficiency versus C For this I actually used the MLE s variance based on estimating the location difference µ 1 µ 2 = and the cut-point parameters. For the cutpoints I used the 1/(C + 1)-tiles of N (0, 1) shifted by /2 (ie. cut-points are between the two locations). 97 Heagerty, Bio/Stat 571

18 C ARE Heagerty, Bio/Stat 571

19 98-1 Heagerty, Bio/Stat 571 ARE using different thresholds, C=5 ARE for delta cuts = c( -2, -1, 1, 2 ) cuts = c( -1, -0.5, 0.5, 1 ) cuts = ( -0.75, -0.25, 0.25, 0.75 ) delta

20 Some references on grouping data : Connor R.J. (1972), Grouping for testing trends in categorical data, Journal of the American Statistical Association, 67: binary & continuous 2 k Efficiency from 65% - 97% for testing (k=2,...,6) Armstrong B.G. and Sloan M. (1989), Ordinal regression models for epidemiologic data, American Journal of Epidemiology, 129: POM versus logistic regression. A smart dichotomization yields 75-80% efficiency. Sankey S.S and Weissfeld L.A. (1998), A study of the effect of dichotomizing ordinal data upon modeling, Communications in Statistics, Part B Sim & Comp, 27: Heagerty, Bio/Stat 571

21 Probability Model = Multinomial Fixed number of categories (C). Fixed sample size (m). Insect species 1, 2,..., C. Sites s 1, s 2,..., s m. Scheme 1 When first insect enters trap it shuts. All traps left until catch 1. Scheme 2 If insect enters it can t get out. All traps left for 24 hours. Y 1,i = (0, 0, 0, 1, 0) Y 2,i = (0, 4, 1, 0, 2) 100 Heagerty, Bio/Stat 571

22 Probability Model = Multinomial Given Y ij = number of type j out of m i ; j Y ij = m i P (Y i1 = y i1, Y i2 = y i2,..., Y ic = y ic ) = m i! y i1!y i2!... y ic! In this representation the parameter π ij can be defined via: C j=1 π y ij ij Given m i = 1: Given m i > 1: 101 Heagerty, Bio/Stat 571

23 Some properties of multinomial 1) w 1, w 2,..., w C independent w i P(λ i ). Then (w 1, w 2,..., w C j w j = m) mult(m, 2) If Y mult(m, π) λ 1 j λ, j λ 2 j λ,..., j = Y j binomial(m, π j ) 3) cov(y j, Y k ) = mπ j π k for j k. λ C j λ ) j 102 Heagerty, Bio/Stat 571

24 4) REPRODUCIBLE: Let Y = (Y 1, Y 2 ) then (Y 1, sum(y 2 ) ) mult(m, π 1, sum(π 2 ) ) 5) REPRODUCIBLE for CONDITIONALS: Condition on Y j : (Y 1, Y 2,..., ( j),..., Y C Y j = t) mult(m t, π 1 1 π j, π 2 1 π j,..., ( j),..., π C 1 π j ) 6) Define cumulative totals Z j = j k=1 Y k, γ j = j k=1 π k: Z j binomial(m, γ j ) Z i Z j = z j, i < j binomial(z j, γ i γ j ) 103 Heagerty, Bio/Stat 571

25 Multinomial as Vector Exponential Family P (Y i ) = C 1 j=1 π Y ij ij (1 C 1 C 1 = exp Y ij log(π ij ) + (1 j C 1 k k Y ik ) log(1 π ik ) (1 C 1 k Y ik ) C 1 k π ik ) ] 104 Heagerty, Bio/Stat 571

26 Multinomial as Vector Exponential Family C 1 P (Y i ) = exp Y ij log(π ij /π ic ) + log(π ic ) = exp j [ ] Y T i θ i b(θ i ) Note: we have dropped the last category. 105 Heagerty, Bio/Stat 571

27 where θ ij = log(π ij /π ic ) b(θ i ) = log(π ic ) = log[1 + b(θ i ) = π ij (verify!) θ ij 2 π ij π ik k j b(θ i ) = θ ij θ ik π ij (1 π ij ) k = j C 1 k exp(θ ik )] (verify!) 106 Heagerty, Bio/Stat 571

28 Models for Ordinal Data dia.bp80 y RowTotl ColTotl Test for independence of all factors Chi^2 = d.f.= 3 (p= e-10) 107 Heagerty, Bio/Stat 571

29 107-1 Heagerty, Bio/Stat 571 Call: crosstabs( ~ dia.bp80 + (y > 1)) dia.bp80 (y > 1) FALSE TRUE RowTotl ColTotl Test for independence of all factors Chi^2 = d.f.= 1 (p= e-10) odds ratio = 2.74, ( 1.983, ) log-odds ratio = 1.008, s.e.= 0.165

30 107-2 Heagerty, Bio/Stat 571 Call: crosstabs( ~ dia.bp80 + (y > 2)) dia.bp80 (y > 2) FALSE TRUE RowTotl ColTotl Test for independence of all factors Chi^2 = d.f.= 1 (p= e-06) Yates correction not used odds ratio = 2.276, ( 1.611, ) log-odds ratio = 0.822, s.e.= 0.176

31 107-3 Heagerty, Bio/Stat 571 Call: crosstabs( ~ dia.bp80 + (y > 3)) dia.bp80 (y > 3) FALSE TRUE RowTotl ColTotl Test for independence of all factors Chi^2 = d.f.= 1 (p= ) odds ratio = 2.848, ( 1.539, ) log-odds ratio = 1.047, s.e.= 0.314

32 Proportional Odds Model model the cumulative indicators: Z ij = 1(O i j) Define γ ij = P (O i j) = E(Z ij ) GLM g(γ ij ) = β (0,j) X i β For a single j this is equivalent to logistic regression when we use a logit link. 108 Heagerty, Bio/Stat 571

33 The model with the logit link is called the Proportional odds model. Another common link is the complementary log-log link, g(γ) = log( log(1 γ)), which is the discrete time proportional hazards model. (more on this later) There is an ordering constraint: β (0,1) β (0,2)... β (0,C 1) 109 Heagerty, Bio/Stat 571

34 Score equations The underlying model is the multinomial which yields log L β Note: L = = i ( πi β ) T Σ 1 i (Y i π i ) Heagerty, Bio/Stat 571

35 Score equations Properties of L : LY i = Z i Lπ i = γ i log L β log L β = i = i ( ) T Lπi ( LΣi L T ) 1 L (Y i π i ) β ( γi β ) T [cov(z i )] 1 (Z i γ i ) 111 Heagerty, Bio/Stat 571

36 Reparameterization of POM There are several ways this model can be parameterized. We have used indicators for O ij j and: logit(γ ij ) = β (0,j) X i β Alternatively we may model Zij = 1(O ij > j) and: E(Z ij) = 1 γ ij = µ ij logit(µ ij) = β (0,j) + X i β = β (0,j) + X iβ 112 Heagerty, Bio/Stat 571

37 112-1 Heagerty, Bio/Stat 571 Ordinal Regression Cumulative Link GLM link = logit -2*maxL = Regression Coefficients and S.E. s estimate S.E. z c c c dia.bp

38 112-2 Heagerty, Bio/Stat 571 Ordinal Regression Cumulative Link GLM link = logit -2*maxL = Regression Coefficients and S.E. s estimate S.E. z c c c dia.bp

39 112-3 Heagerty, Bio/Stat 571 Ordinal Regression Cumulative Link GLM link = logit -2*maxL = Regression Coefficients and S.E. s estimate S.E. z c c c glyc.hemo log.duration age.diag dia.bp

40 Ordinal Data Regression Models for ordinal data can be viewed as simultaneous regressions for the cumulative indicators Z ij = 1(O i j). Probability model is the multinomial. Models for ordinal data use the order information but do not assign numbers to the outcome levels. Models are also invariant to collapsing the categories (ie. combine 4th and 5th levels). 113 Heagerty, Bio/Stat 571

41 Ordinal Data Regression Models for ordinal data are useful as something intermediate to just dichotomizing the data, or to assigning scores and using linear regression. Fact For a single binary covariate X = 0/1 the score test using the POM model is the Wilcoxon! (see MN exercise 5.10). Thus, these models can be thought of as models for the rank of the data. 114 Heagerty, Bio/Stat 571

42 Models for Multinomial Data Another approach to the analysis of ordered categorical data is to model conditional expectations. Infection in 0-6 months. Infection in 6-12 months given uninfected at 6. Infection in months given uninfected at 12. Salmon dies before point #1. Salmon dies before point #2 given past #1. Salmon dies before point #3 given past # Heagerty, Bio/Stat 571

43 Multinomial Framework Define: Y ij = 1(O i = j) Y i = vec(y ij ) H ij = 1 j 1 k=1 Y ik H i1 = Heagerty, Bio/Stat 571

44 Multinomial Framework Model: E[Y ij H ij = 1] = µ ij = π ij 1 γ i(j 1) g(µ ij ) = α j + X ij β j j = 1, 2,..., C 1 Note : E[Y ij H ij ] = µ ij H ij 117 Heagerty, Bio/Stat 571

45 Example: j = 1, 2,..., Y ij H ij µ ij µ i1 µ i2 µ i3 µ i4 µ i5 E[Y ij H ij ] µ i1 µ i2 µ i Heagerty, Bio/Stat 571

46 Multinomial CRM Models Models for E(Y ij H ij ) are sometimes called continuation ratio models (Agresti 1990). These conditional models are also known as discrete-time survival models, in particular, if we assume that the ordered categories are time intervals, G j = (t j 1, t j ], and O i records the interval that the time, T i falls in: O i = j T i G j 119 Heagerty, Bio/Stat 571

47 Multinomial CRM Models PH Assumption : P (T i > t) = exp[ Λ(t)] = exp[ Λ 0 (t) exp(x i β)] then E(Y ij H ij = 1) = µ ij log( log(1 µ ij )) = α j + X i β α j = log[λ 0 (t j ) Λ 0 (t j 1 )] Note: The PH model uses a common β rather than a coefficient specific to each time interval, β j. 120 Heagerty, Bio/Stat 571

48 CRM Model The multinomial probability factors: P (Y i = y i ) = C j=1 π y ij ij = P (y i1 )P (y i2 y i1 )P (y i3 y i1, y i2 ) P (y ic y ik k < C) 121 Heagerty, Bio/Stat 571

49 CRM Model P (Y i = y i ) = = C 1 j=1 C 1 j=1 ( ( π ij 1 γ i(j 1) π ij 1 γ i(j 1) ) yij ( 1 ) yij ( 1 γij π ij 1 γ i(j 1) 1 γ i(j 1) ) hij y ij ) hij y ij Note: π ij π ij logit( ) = log( ) 1 γ i(j 1) 1 γ ij 122 Heagerty, Bio/Stat 571

50 Comments: This likelihood factorization implies that we obtain the MLE for this model using the pairs (Y ij, H ij = 1) and treating them as independent Bernoulli random variables. For any (j, k) E [(Y ij µ ij H ij )(Y ik µ ik H ik )] = 0 Q: justification for this? 123 Heagerty, Bio/Stat 571

51 123-1 Heagerty, Bio/Stat 571 WESDR Example: # nsubjects <- nrow( data ) # wisc.data <- data.frame( y = wisc.all.data[,"r.level"], glyc.hemo = wisc.all.data[,"glyc.hemo"], log.duration = log( wisc.all.data[,"duration"] ), age.diag = wisc.all.data[,"age.diag"], dia.bp = wisc.all.data[,"dia.bp"] ) print( wisc.data[1:5,] ) # ### ### Construct CRM data... ### # ##### (1) construct the pairs (Y,H) #

52 ncuts <- max(wisc.data$y) - 1 print( paste("ncuts =", ncuts, "\n\n") ) y.crm <- NULL h.crm <- NULL id <- NULL for( j in 1:nsubjects ){ yj <- rep( 0, ncuts ) if( wisc.data$y[j] <= ncuts ) yj[ wisc.data$y[j] ] <- 1 hj <- 1 - c(0,cumsum(yj)[1:(ncuts-1)]) y.crm <- c( y.crm, yj ) h.crm <- c( h.crm, hj ) id <- c( id, rep(j,ncuts) ) } Heagerty, Bio/Stat 571

53 ##### (2) construct the intercepts # level <- factor( rep(1:ncuts, nsubjects ) ) int.mat <- NULL for( j in 1:ncuts ){ intj <- rep( 0, ncuts ) intj[ j ] <- 1 int.mat <- cbind( int.mat, rep( intj, nsubjects) ) } dimnames(int.mat) <- list( NULL, paste("int",c(1:ncuts),sep="") ) # ##### (3) expand the X s # glyc.hemo <- rep( wisc.data$glyc.hemo, rep(ncuts,nsubjects) ) log.duration <- rep( wisc.data$log.duration, rep(ncuts,nsubjects) ) age.diag <- rep( wisc.data$age.diag, rep(ncuts,nsubjects) ) dia.bp <- rep( wisc.data$dia.bp, rep(ncuts,nsubjects) ) # print( cbind( y.crm, h.crm, level, int.mat, glyc.hemo, log.duration, age.diag, dia.bp)[1:15,] ) # Heagerty, Bio/Stat 571

54 ##### (4) drop the H=0 and build dataframe # keep <- h.crm==1 wisc.crm.data <- data.frame( id = id[keep], y = y.crm[keep], level = level[keep], glyc.hemo = glyc.hemo[keep], log.duration = log.duration[keep], age.diag = age.diag[keep], dia.bp = dia.bp[keep] ) # print( wisc.crm.data[1:15,] ) Heagerty, Bio/Stat 571

55 123-5 Heagerty, Bio/Stat 571 Original Data: y glyc.hemo log.duration age.diag dia.bp

56 123-6 Heagerty, Bio/Stat 571 Expanded Data: id y.crm h.crm level Int1 Int2 Int3 glyc.hemo [1,] [2,] [3,] [4,] [5,] [6,] [7,] [8,] [9,] [10,] [11,] [12,] [13,] [14,] [15,]

57 id y level glyc.hemo log.duration age.diag dia.bp Heagerty, Bio/Stat 571

58 WESDR Models ##### ##### Model fitting... ##### # options( contasts="contr.treatment" ) # table( wisc.crm.data$level ) # fit1 <- glm( y ~ glyc.hemo, family=binomial, subset=(as.integer(level)==1), data = wisc.crm.data ) summary( fit1, cor=f ) fit2 <- glm( y ~ glyc.hemo, family=binomial, subset=(as.integer(level)==2), data = wisc.crm.data ) summary(fit2, cor=f ) fit3 <- glm( y ~ glyc.hemo, family=binomial, 124 Heagerty, Bio/Stat 571

59 subset=(as.integer(level)==3), data = wisc.crm.data ) summary(fit3, cor=f ) # fit4 <- glm( y ~ level + glyc.hemo, family=binomial, data = wisc.crm.data ) summary( fit4, cor=f ) fit5 <- glm( y ~ level * glyc.hemo, family=binomial, data = wisc.crm.data ) # anova( fit4, fit5 ) # 125 Heagerty, Bio/Stat 571

60 125-1 Heagerty, Bio/Stat 571 Separate fits: table(level) : 1 : 2 : ***LEVEL=1 Coefficients: Value Std. Error t value (Intercept) glyc.hemo

61 125-2 Heagerty, Bio/Stat 571 ***LEVEL=2 Coefficients: Value Std. Error t value (Intercept) glyc.hemo ***LEVEL=3 Value Std. Error t value (Intercept) glyc.hemo

62 125-3 Heagerty, Bio/Stat 571 CRM fits: Value Std. Error t value (Intercept) level: level: glyc.hemo (Dispersion Parameter for Binomial family taken to be 1 ) Null Deviance: on 1339 degrees of freedom Residual Deviance: on 1336 degrees of freedom Number of Fisher Scoring Iterations: 3

63 125-4 Heagerty, Bio/Stat 571 CRM fits: Analysis of Deviance Table Response: y Terms Resid. Df Resid. Dev Test Df Deviance 1 level + glyc.hemo level * glyc.hemo level:glyc.hemo

64 Summary Multinomial models use a multivariate response vector to represent the outcome. For these models the mean (π i ) determines the covariance. Proportional odds models consider all possible dichotomizations. Continuation ratio models consider an ordered sequence of conditional expectations. 126 Heagerty, Bio/Stat 571

65 References : McCullagh & Nelder, Chapter 5 Polytomous Regression Ananth C.V. and Kleinbaum D.G. (1997), Regression models for ordinal responses: A review of methods and applications, Int J of Epi, 26: Fahrmeir & Tutz, Chapter 3 Multicategorical responses Agresti A. (1999), Modelling ordered categorical data: recent advances and future challenges, Statistics in Medicine, 18: Heagerty, Bio/Stat 571

Lecture 11. Interval Censored and. Discrete-Time Data. Statistics Survival Analysis. Presented March 3, 2016

Lecture 11. Interval Censored and. Discrete-Time Data. Statistics Survival Analysis. Presented March 3, 2016 Statistics 255 - Survival Analysis Presented March 3, 2016 Motivating Dan Gillen Department of Statistics University of California, Irvine 11.1 First question: Are the data truly discrete? : Number of