6 Applying Logistic Regression Models

Anne West
5 years ago
I Model Selection and Diagnostics

I.1 Model Selection

Number of x's that can be entered in the model. Rule of thumb: # of events (both [Y = 1] and [Y = 0]) per x ≥ 10.

Need to be aware of collinearity among the x's.

Use traditional model selection procedures (used when p << n):
1. Forward selection (simple one + variant)
2. Backward elimination

Use modern model selection procedures, usually in the form of penalized likelihood (can handle p > n); a new research area.

Slide 344

Use the LRT for nested models (e.g., Table 6.2).

Use AIC (Akaike information criterion) or BIC (Bayesian information criterion) for model selection (models not necessarily nested). The smaller the AIC/BIC, the better:

$$\mathrm{AIC} = -2\{l_{\max} - p\}, \qquad \mathrm{BIC} = -2\{l_{\max} - 0.5\log(n)\,p\},$$

where $l_{\max}$ is the maximized log-likelihood, $p$ the number of parameters and $n$ the sample size.

Note: BIC tends to yield a simpler model than AIC.

Use common sense in model building (e.g., time ordering; see Table 6.3).

Slide 345
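The AIC and BIC definitions can be checked with a few lines of Python. This is only a sketch: the two candidate fits and the sample size below are hypothetical numbers chosen for illustration, not values from the course examples.

```python
import math

def aic(loglik_max, p):
    """AIC = -2 * (l_max - p); smaller is better."""
    return -2.0 * (loglik_max - p)

def bic(loglik_max, p, n):
    """BIC = -2 * (l_max - 0.5 * log(n) * p); smaller is better."""
    return -2.0 * (loglik_max - 0.5 * math.log(n) * p)

# Hypothetical fits: name -> (maximized log-likelihood, number of parameters).
candidates = {"small": (-98.0, 2), "large": (-95.5, 5)}
n = 173  # hypothetical sample size

for name, (ll, p) in candidates.items():
    print(name, round(aic(ll, p), 2), round(bic(ll, p, n), 2))
```

Because the BIC penalty per parameter is log(n) rather than 2, it penalizes the larger model more heavily whenever n > e^2 ≈ 7.4, which is why BIC tends to pick the simpler model.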

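I.2 Model Diagnostics

Use standardized residuals to check model fit and identify outliers:

$$y_i \mid x_i \overset{ind}{\sim} \mathrm{Bin}(n_i, \pi_i), \qquad \mathrm{logit}(\pi_i) = x_i^T\beta, \qquad \hat\pi_i = \frac{e^{x_i^T\hat\beta}}{1 + e^{x_i^T\hat\beta}}.$$

1. Standardized Pearson residual:

$$e_i = \frac{y_i - n_i\hat\pi_i}{\sqrt{n_i\hat\pi_i(1 - \hat\pi_i)}}, \qquad e_i^{st} = \frac{e_i}{\sqrt{1 - \hat h_i}},$$

where $\hat h_i$ is the leverage of observation $i$.

Slide 346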

2. Standardized deviance residual:

$$d_i = 2\left\{ y_i \log\frac{y_i}{n_i\hat\pi_i} + (n_i - y_i)\log\frac{n_i - y_i}{n_i - n_i\hat\pi_i} \right\}, \qquad \tilde d_i = \sqrt{d_i}\,\mathrm{sign}(y_i - n_i\hat\pi_i), \qquad d_i^{st} = \frac{\tilde d_i}{\sqrt{1 - \hat h_i}}.$$

If $|e_i^{st}|$ (or $|d_i^{st}|$) > 2 or 3 ⇒ outliers.

Plots of $e_i^{st}$ (or $d_i^{st}$) vs. $x_i$ or $x_i^T\hat\beta$ may detect lack of fit.

When $n_i = 1$, $e_i^{st}$ (or $d_i^{st}$) is not very informative.

Note: Proc Logistic does not report $e_i^{st}$ and $d_i^{st}$. Need to use Proc GenMod to get $e_i^{st}$ and $d_i^{st}$.

Slide 347
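As a rough illustration of the two residual definitions, the following Python sketch computes Pearson and deviance residuals for a single binomial observation. The fitted probability and leverage are assumed to come from a model fit elsewhere; the function names are hypothetical, not part of SAS or any library.

```python
import math

def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def pearson_residual(y, n, pi_hat):
    """(y - n*pi_hat) / sqrt(n*pi_hat*(1 - pi_hat)) for y ~ Bin(n, pi)."""
    return (y - n * pi_hat) / math.sqrt(n * pi_hat * (1.0 - pi_hat))

def deviance_residual(y, n, pi_hat):
    """Signed square root of the deviance contribution d_i."""
    fitted = n * pi_hat
    d = 2.0 * (_xlogy(y, y / fitted) + _xlogy(n - y, (n - y) / (n - fitted)))
    # max(d, 0) guards against tiny negative values from rounding.
    return math.copysign(math.sqrt(max(d, 0.0)), y - fitted)

def standardize(residual, leverage):
    """Divide by sqrt(1 - h_i), where h_i is the leverage."""
    return residual / math.sqrt(1.0 - leverage)

# Hypothetical observation: 9 successes in 10 trials, fitted pi = 0.5, leverage 0.2.
e = pearson_residual(9, 10, 0.5)
d = deviance_residual(9, 10, 0.5)
print(standardize(e, 0.2), standardize(d, 0.2))
```

Both standardized values exceed 2 for this made-up observation, so it would be flagged as a potential outlier under the rule above.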

Example 1: Residual plot for the crab data

Model: $\mathrm{logit}(P[Y = 1 \mid x, c]) = \beta_0 + \beta_1 c_1 + \beta_2 c_2 + \beta_3 c_3 + \beta_4 x$

    data crab;
      input color spine width satell weight;
      weight=weight/1000;
      color=color-1;
      satbin=(satell>0);
      c1 = (color=1); c2 = (color=2); c3 = (color=3); c4 = (color=4);
      s1 = (spine=1); s2 = (spine=2);
      datalines;

    proc genmod data=crab descending;
      model satbin = width c1 c2 c3 / dist=bin link=logit;
      output out=resid ResRaw=ResRaw ResChi=ResChi StdReschi=StdReschi;
    run;

    data _null_;
      set resid;
      file "crab_res";
      put stdreschi width;
    run;

Slide 348

[Figure: Standardized Pearson residual plot for the crab data; y-axis: standardized Pearson residual, x-axis: carapace width]

Slide 349

Example 2: Heart disease and blood pressure (Table 6.5, P. 217)

    data HD;
      input bp $ n y;
      if bp="<117" then x=111.5;
      else if bp=" " then x=121.5;
      else if bp=" " then x=131.5;
      else if bp=" " then x=141.5;
      else if bp=" " then x=151.5;
      else if bp=" " then x=161.5;
      else if bp=" " then x=176.5;
      else x=191.5;
      cards;
    < > ;

    proc genmod;
      model y/n = x / dist=bin link=logit residual;
    run;

Slide 350

8 Criteria For Assessing Goodness Of Fit Criterion DF Value Value/DF Deviance Scaled Deviance Pearson Chi-Square Scaled Pearson X Analysis Of Maximum Likelihood Parameter Estimates Standard Wald 95% Confidence Wald Parameter DF Estimate Error Limits Chi-Square Intercept x Raw Pearson Deviance Observation Residual Residual Residual Std Deviance Std Pearson Likelihood Residual Residual Residual Slide 351

Example 3: Admission to Graduate School at UF (Table 6.6)

Let $\pi(k, g) = P[\text{admission} \mid D = k, G = g]$ for department D = k and gender G = g. We consider three models:

1. $\pi(k, g) = D_k$: Admission is independent of gender at each department.
2. $\pi(k, g) = D_k + G_g$: The Admission-Gender association is the same across departments.
3. $\pi(k, g) = G_g$: Gives the marginal Admission-Gender association collapsed over departments.

    options ls=75 ps=100;
    data admit;
      input dept $ gender y yno;
      n = y+yno;
      male=gender-1;
      cards;
    anth anth astr astr

Slide 352

    chem chem

    title "Model 1: Logistic model assuming gender and admission are";
    title2 "conditional independent given department";
    proc genmod;
      class dept;
      model y/n = dept / dist=bin link=logit;
      output out=resid Resraw=Resraw Reschi=Reschi StdReschi=StdReschi;
    run;

    data resid;
      set resid;
      keep dept male Resraw Reschi StdReschi;
    run;

    title "Residuals from Model 1";
    proc print data=resid;
    run;

    title "Model 2: Logistic model with homogeneous GA and DA association";
    proc genmod data=admit;
      class dept;
      model y/n = dept male;
    run;

    title "Model 3: Logistic model for marginal GA association";
    proc genmod data=admit;
      model y/n = male;
    run;

Slide 353

11 Part of the output: Model 1: Logistic model assuming gender and admission are 1 conditional independent given department Criteria For Assessing Goodness Of Fit Criterion DF Value Value/DF Deviance Scaled Deviance Pearson Chi-Square Scaled Pearson X Std Obs dept male Reschi Resraw Reschi 1 anth anth astr astr chem chem clas clas comm comm comp comp engl engl geog geog geol geol germ germ Slide 354

12 21 hist hist lati lati ling ling math math phil phil phys phys poli poli psyc psyc reli reli roma roma soci soci stat stat zool zool Model 2: Logistic model with homogeneous GA and DA association 4 Criteria For Assessing Goodness Of Fit Criterion DF Value Value/DF Deviance Scaled Deviance Pearson Chi-Square Scaled Pearson X Slide 355

Analysis Of Maximum Likelihood Parameter Estimates
Parameter DF Estimate Standard Error Wald 95% Confidence Limits Wald Chi-Square
Intercept
dept anth
dept astr
male

Model 3: Logistic model for marginal GA association

Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance
Scaled Deviance
Pearson Chi-Square
Scaled Pearson X

Analysis Of Maximum Likelihood Parameter Estimates
Parameter DF Estimate Standard Error Wald 95% Confidence Limits Wald Chi-Square
Intercept
male

Models 2 & 3 show Simpson's Paradox.

Slide 356

II Inference on The Conditional Association in 2 × 2 × K Tables

Example: Multi-center clinical trial evaluating a cream in curing skin infection (Table 6.9, P. 226)

[Table: one 2 × 2 table per center Z = 1, ..., 8, with rows trt / control and columns S (success) / F (failure)]

What we observed: There is a lot of variation in success probabilities among centers.

Slide 357

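If we collapse the tables over centers, we get a single 2 × 2 table of X (trt, control) by Y (S, F), with marginal odds-ratio estimate $\hat\theta_{XY}$.

The above estimate $\hat\theta_{XY}$ may not be very useful since this is not a random sample, so we cannot use the famous formula for calculating the variance of $\log\hat\theta_{XY}$:

$$\widehat{\mathrm{var}}(\log\hat\theta_{XY}) \approx \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}.$$

Should focus on conditional association!

Slide 358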

II.1 Testing Conditional Independence between X and Y Given Z ($H_0: X \perp Y \mid Z$)

1. Method 1: Use a logistic model with ML inference (good when K is fixed, small to moderate)

Let
Y = 1 for success, 0 for failure
x = 1 for treatment, 0 for control
z = 1, 2, ..., 8 for centers
$\pi(x, z) = P[Y = 1 \mid x, z]$

and consider the (homogeneous) model:

$$\mathrm{logit}\,\pi(x, z = k) = \beta x + \beta_k^z \qquad (*)$$

⇒ common odds-ratio model:

$$\frac{\pi(x = 1, z = k)/\{1 - \pi(x = 1, z = k)\}}{\pi(x = 0, z = k)/\{1 - \pi(x = 0, z = k)\}} = e^{\beta}$$

Slide 359

$$\pi(x = 0, z = k)/\{1 - \pi(x = 0, z = k)\} = e^{\beta_k^z}$$

Under this model, $H_0: \beta = 0 \iff H_0: X \perp Y \mid Z$.

    data table6_9;
      input center trt y y0;
      n=y+y0;
      cards;

    title "Use homogeneous model to test no treatment effect at each center";
    proc logistic;
      class center / param=ref;
      model y/n = center trt / selection=f include=1 slentry=1;
    run;

Use homogeneous model to test no treatment effect at each center

The LOGISTIC Procedure
The following effects will be included in each model: Intercept center
Step 0. The INCLUDE effects were entered.
Model Fit Statistics (columns: Intercept Only; Intercept and Covariates)

Slide 360

18 Step 1. Effect trt entered: Criterion Only Covariates -2 Log L Residual Chi-Square Test Chi-Square DF Pr > ChiSq Model Fit Statistics Intercept Criterion Intercept Only and Covariates -2 Log L Analysis of Maximum Likelihood Estimates Standard Wald Parameter DF Estimate Error Chi-Square Pr > ChiSq Intercept center center center center <.0001 center center center trt Slide 361

Three tests for $H_0: \beta = 0$:
(a) Score test: $\chi^2$, df = 1
(b) LRT: $G^2 = 6.669$, df = 1
(c) Wald test: $\chi^2$

Strong evidence to reject $H_0: \beta = 0$. $e^{\hat\beta} = 2.17$.

At each center, the odds of success (infection is cured) for treated patients are 2.17 times the odds of success for untreated patients.

Note 1: The above test results are based on the homogeneous model (*). When β = 0, model (*) reduces to

$$\mathrm{logit}\,\pi(x, z = k) = \beta_k^z,$$

so $H_0: X \perp Y \mid Z$ can also be tested by conducting the GOF test for this reduced model.

Slide 362
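For a statistic with 1 degree of freedom, the chi-square tail probability has a closed form through the complementary error function, so the LRT p-value can be sketched in plain Python. The inputs 6.669 and 2.17 are the values quoted on the slide; the helper name `chi2_sf_df1` is a hypothetical choice.

```python
import math

def chi2_sf_df1(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 1 df."""
    # With 1 df, X = Z^2 for a standard normal Z, so P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))

g2 = 6.669          # LRT statistic quoted on the slide
odds_ratio = 2.17   # e^{beta-hat} quoted on the slide

p_value = chi2_sf_df1(g2)
beta_hat = math.log(odds_ratio)  # log odds ratio implied by the rounded 2.17

print(f"LRT p-value: {p_value:.4f}")
print(f"implied beta-hat: {beta_hat:.3f}")
```

The p-value falls just below 0.01, in line with the slide's "strong evidence" conclusion. Because 2.17 is rounded, the implied log odds ratio is only approximate.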

    title "Use goodness-of-fit statistics to test conditional independence";
    proc genmod;
      class center;
      model y/n = center;
    run;

Use goodness-of-fit statistics to test conditional independence

The GENMOD Procedure
Response Profile
Ordered Value  Binary Outcome  Total Frequency
1              Event
2              Nonevent        171

Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance
Scaled Deviance
Pearson Chi-Square
Scaled Pearson X

$\chi^2 = 13.71$, df = 16 − 8 = 8; $G^2 = 16.42$, df = 8. Less powerful.

Slide 363

Note 2: We can also test the adequacy of the homogeneous model (*) using its GOF statistics:

    title "Use goodness-of-fit statistics to test homogeneity";
    proc genmod;
      class center;
      model y/n = center trt;
    run;

Use goodness-of-fit statistics to test homogeneity

The GENMOD Procedure
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance
Scaled Deviance
Pearson Chi-Square
Scaled Pearson X

Pearson $\chi^2$: df = 7, P = 0.33; deviance $G^2$: df = 7, P = 0.20; adequate fit.

Slide 364

2. Method 2: Use the Cochran-Mantel-Haenszel (CMH) test for $H_0: X \perp Y \mid Z$ (good when $K \to \infty$, or when K is fixed but $n_{++k} \to \infty$)

The above analysis, which assumes that the number of binomial observations N = 2K = 2 × 8 = 16 is fixed, may be problematic in many situations. One way to test $X \perp Y \mid Z$ is to use the CMH test. Data from center Z = k:

                        Y
                    S        F
    X   trt       n_11k    n_12k    n_1+k
        control   n_21k    n_22k    n_2+k
                  n_+1k    n_+2k

Slide 365

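Under $H_0: X \perp Y \mid Z$, $n_{11k} \mid n_{1+k}, n_{+1k} \sim$ hypergeometric distribution:

$$E(n_{11k} \mid H_0, n_{1+k}, n_{+1k}) = \frac{n_{1+k}\,n_{+1k}}{n_{++k}} = \mu_{11k}, \qquad \mathrm{var}(n_{11k} \mid H_0, n_{1+k}, n_{+1k}) = \frac{n_{1+k}\,n_{2+k}\,n_{+1k}\,n_{+2k}}{n_{++k}^2\,(n_{++k} - 1)}.$$

$$\chi^2 = \frac{\left[\sum_{k=1}^K (n_{11k} - \mu_{11k})\right]^2}{\sum_{k=1}^K \mathrm{var}(n_{11k} \mid H_0, n_{1+k}, n_{+1k})} \overset{H_0}{\sim} \chi^2_1.$$

CMH with continuity correction:

$$\chi^2_c = \frac{\left\{\left|\sum_{k=1}^K (n_{11k} - \mu_{11k})\right| - 0.5\right\}^2}{\sum_{k=1}^K \mathrm{var}(n_{11k} \mid H_0, n_{1+k}, n_{+1k})} \overset{H_0}{\sim} \chi^2_1.$$

The CMH test does not require the homogeneous model.

Slide 366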
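The CMH statistic is simple enough to compute by hand, which the following Python sketch does directly from the hypergeometric mean and variance. The two strata are hypothetical counts, not the Table 6.9 data, and the function name is an illustrative choice.

```python
def cmh_statistic(tables, correction=False):
    """CMH chi-square for a list of 2x2 tables given as (n11, n12, n21, n22)."""
    diff = 0.0  # sum over strata of n11k - mu11k
    var = 0.0   # sum over strata of the hypergeometric variance
    for n11, n12, n21, n22 in tables:
        row1, row2 = n11 + n12, n21 + n22
        col1, col2 = n11 + n21, n12 + n22
        n = row1 + row2
        diff += n11 - row1 * col1 / n
        var += row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))
    numerator = abs(diff) - 0.5 if correction else abs(diff)
    return numerator ** 2 / var

# Hypothetical data: two centers with identical tables.
tables = [(8, 2, 2, 8), (8, 2, 2, 8)]
print(cmh_statistic(tables))                   # uncorrected
print(cmh_statistic(tables, correction=True))  # with the 0.5 correction
```

Each stratum contributes only its deviation from the null expectation, so the statistic stays valid even when success rates differ widely across centers, as long as the treatment effect points in a consistent direction.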

    data y1; set table6_9;
      count=y; drop y0; y=1;
    run;
    data y0; set table6_9;
      count=y0; drop y0; y=0;
    run;
    data new; set y1 y0; run;

    title "MH test for conditional independence and MH common OR";
    proc freq data=new order=data;
      weight count;
      tables center*trt*y / nopercent norow nocol cmh;
    run;

MH test for conditional independence and MH common OR

The FREQ Procedure
Summary Statistics for trt by y, Controlling for center
Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Statistic  Alternative Hypothesis  DF  Value  Prob
           Nonzero Correlation
           Row Mean Scores Differ
           General Association

Slide 367

25 Estimates of the Common Relative Risk (Row1/Row2) Type of Study Method Value 95% Confidence Limits Case-Control Mantel-Haenszel (Odds Ratio) Logit ** Cohort Mantel-Haenszel (Col1 Risk) Logit ** Cohort Mantel-Haenszel (Col2 Risk) Logit ** These logit estimators use a correction of 0.5 in every cell of those tables that contain a zero. Breslow-Day Test for Homogeneity of the Odds Ratios Chi-Square DF 7 Pr > ChiSq CMH χ 2 = , df = 1, P = MH Common odds-ratio estimate θ MH = with 95% CI [1.1776, ]. Breslow-Day Test for common odds-ratio: χ 2 = , df = 7, P = , similar to the GOF test. Slide 368

3. Method 3: Use a conditional logistic regression under the homogeneous model (*) (good even when $K \to \infty$):

$$\mathrm{logit}\,\pi(x, k) = x\beta + \beta_k.$$

Problem: the # of $\beta_k$'s may $\to \infty$; want to get rid of them.

Idea: find the sufficient statistics for the $\beta_k$'s and conduct inference on β based on the conditional distribution of the data given those sufficient statistics.

Data from center Z = k:

                        Y
                    S        F
    X   trt       n_11k    n_12k    n_1+k
        control   n_21k    n_22k    n_2+k

Slide 369

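Given $n_{11k} \mid n_{1+k} \sim \mathrm{Bin}(n_{1+k}, \pi(1, k))$ and $n_{21k} \mid n_{2+k} \sim \mathrm{Bin}(n_{2+k}, \pi(0, k))$, we get the likelihood function of β and $(\beta_1, \ldots, \beta_K)$:

$$L(\beta, \beta_1, \ldots, \beta_K) = \prod_{k=1}^K L_k(\beta, \beta_k),$$

where $L_k(\beta, \beta_k)$ is the likelihood contributed by the data from center Z = k:

$$L_k(\beta, \beta_k) = \{\pi(1, k)\}^{n_{11k}} \{1 - \pi(1, k)\}^{n_{12k}} \{\pi(0, k)\}^{n_{21k}} \{1 - \pi(0, k)\}^{n_{22k}},$$

$$\pi(1, k) = \frac{e^{\beta + \beta_k}}{1 + e^{\beta + \beta_k}}, \qquad \pi(0, k) = \frac{e^{\beta_k}}{1 + e^{\beta_k}}.$$

Slide 370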

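$$L_k(\beta, \beta_k) = \left(\frac{e^{\beta + \beta_k}}{1 + e^{\beta + \beta_k}}\right)^{n_{11k}} \left(\frac{1}{1 + e^{\beta + \beta_k}}\right)^{n_{12k}} \left(\frac{e^{\beta_k}}{1 + e^{\beta_k}}\right)^{n_{21k}} \left(\frac{1}{1 + e^{\beta_k}}\right)^{n_{22k}}$$

$$= \frac{e^{\beta n_{11k} + \beta_k (n_{11k} + n_{21k})}}{(1 + e^{\beta + \beta_k})^{n_{11k} + n_{12k}} (1 + e^{\beta_k})^{n_{21k} + n_{22k}}} = \frac{e^{\beta n_{11k} + \beta_k n_{+1k}}}{(1 + e^{\beta + \beta_k})^{n_{1+k}} (1 + e^{\beta_k})^{n_{2+k}}}$$

Since $n_{1+k}$ and $n_{2+k}$ are already fixed, $n_{+1k} = n_{11k} + n_{21k}$ (total # of successes in center k) is a sufficient statistic for $\beta_k$.

$L_k(\beta, \beta_k \mid n_{+1k})$ should be free of $\beta_k$ ⇒ noncentral hypergeometric distribution.

Slide 371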

The conditional logistic inference (on β) is based on the conditional likelihood:

$$L_c(\beta \mid \{n_{+1k}\}) = \prod_{k=1}^K L_k(\beta, \beta_k \mid n_{+1k}),$$

which has only one parameter, β, no matter how large K is!

Treating this as a regular likelihood function, we can estimate β by maximizing $L_c(\beta \mid \{n_{+1k}\})$. We can also conduct the Wald, score and LR tests of $H_0: \beta = 0$.

Slide 372
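To make the conditional likelihood concrete, the sketch below evaluates the noncentral hypergeometric probability of $n_{11k}$ given the margins and maximizes the resulting $L_c(\beta)$ by a crude grid search. The counts are hypothetical (not Table 6.9), the function names are illustrative, and the grid search stands in for the Newton-Raphson optimizer that SAS uses.

```python
import math

def noncentral_hypergeom_pmf(t, n1, n2, m, beta):
    """P(n11 = t | n1+ = n1, n2+ = n2, n+1 = m) when the log odds ratio is beta."""
    lo, hi = max(0, m - n2), min(n1, m)
    if t < lo or t > hi:
        return 0.0

    def weight(u):
        return math.comb(n1, u) * math.comb(n2, m - u) * math.exp(beta * u)

    return weight(t) / sum(weight(u) for u in range(lo, hi + 1))

def conditional_loglik(beta, tables):
    """log L_c(beta) for a list of 2x2 tables given as (n11, n12, n21, n22)."""
    total = 0.0
    for n11, n12, n21, n22 in tables:
        n1, n2, m = n11 + n12, n21 + n22, n11 + n21
        total += math.log(noncentral_hypergeom_pmf(n11, n1, n2, m, beta))
    return total

# Hypothetical counts for two centers.
tables = [(8, 2, 2, 8), (5, 5, 3, 7)]
grid = [i / 100.0 for i in range(-300, 501)]
beta_hat = max(grid, key=lambda b: conditional_loglik(b, tables))
print("grid-search conditional MLE of beta:", beta_hat)
```

No center-specific intercept appears anywhere in the code: conditioning on the success totals removes every $\beta_k$, which is the point of the method. At β = 0 the probability reduces to the ordinary hypergeometric distribution used by the CMH test.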

SAS program and output:

    title "Use a conditional logistic regression to assess treatment effect";
    proc logistic;
      class center;
      model y/n = trt;
      strata center;
    run;

Use a conditional logistic regression to assess treatment effect

The LOGISTIC Procedure
Conditional Analysis

Model Information
Data Set                     WORK.TABLE6_9
Response Variable (Events)   y
Response Variable (Trials)   n
Number of Strata             8
Model                        binary logit
Optimization Technique       Newton-Raphson ridge

Testing Global Null Hypothesis: BETA=0
Test  Chi-Square  DF  Pr > ChiSq
Likelihood Ratio
Score
Wald

Slide 373

Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq
trt

$e^{\hat\beta} = 2.13$, similar to before since K = 8 is small.

LRT $G^2$, score $\chi^2$ and Wald $\chi^2$: reject $H_0: \beta = 0$.

Note 1: The score $\chi^2$ statistic based on $L_c(\beta \mid \{n_{+1k}\})$ is equivalent to the CMH $\chi^2$.

Note 2: We can make exact conditional inference for a regression coefficient in a regular regression model using the same idea. Let $Y_i = 1/0$ for success/failure, with covariates $x_{i1}, x_{i2}, \ldots, x_{ip}$ and $\pi(x_i) = P[Y_i = 1 \mid x_i]$.

Slide 374

Model:

$$\mathrm{logit}\{\pi(x_i)\} = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}$$

We can find the sufficient statistic for each $\beta_k$, denoted by $T_k$. Suppose we would like to make exact conditional inference on $\beta_p$, say; then the exact inference can be based on

$$f(y_1, y_2, \ldots, y_n \mid T_1, T_2, \ldots, T_{p-1}) = L(\beta_p).$$

For the exact test of $H_0: \beta_p = 0$, the conditional distribution of the data $(Y_1, Y_2, \ldots, Y_n)$ given $T_1, T_2, \ldots, T_{p-1}$ is completely known. We can do an exact score test based on $L(\beta_p)$. We can also construct an exact CI for $\beta_p$ based on $L(\beta_p)$.

Software:

    proc logistic descending;
      model y = x1 x2 x3 / link=logit;
      exact x3;
    run;

Slide 375

Warning: It is usually very time consuming to conduct the exact inference, especially for non-sparse data, in which case no exact inference is needed.

Note 3: If we apply the above procedure to our homogeneous model (*), $\mathrm{logit}\,\pi(x, k) = x\beta + \beta_k$, we can make exact conditional inference on the treatment effect β. In this case $L(\beta)$ is the conditional likelihood we obtained before using the conditional logistic approach. Therefore, we will get the exact CMH test for $H_0: \beta = 0$.

    title "Exact p-value for MH test of no treatment effect at each center";
    proc logistic data=table6_9;
      class center / param=ref;
      model y/n = center trt;
      exact trt;
    run;

Exact p-value for MH test of no treatment effect at each center

The LOGISTIC Procedure

Slide 376

34 Analysis of Maximum Likelihood Estimates Standard Wald Parameter DF Estimate Error Chi-Square Pr > ChiSq Intercept center center center center <.0001 center center center trt p-value --- Effect Test Statistic Exact Mid trt Score Probability We can see that is the CMH χ 2, which is the score stat. based on L(β) (row 1). We can also conduct Fisher exact test on H 0 : β = 0 using table prob. (row 2). Slide 377

4. Method 4: Use a mixed model approach (good when $K \to \infty$ as $n \to \infty$):

$$\mathrm{logit}\,\pi(x, k) = x\beta + \beta_k.$$

Data from center Z = k:

                  Y = 1    Y = 0
    X = 1         n_11k    n_12k    n_1+k
    X = 0         n_21k    n_22k    n_2+k

Here the 8 centers are probably a random sample of centers drawn from a large population of centers. The analysis should then take this into account ⇒ clustered data.

$\beta_k$ is the log odds of success for patients in center k if they all receive the control treatment. It reflects the general health status of patients in center k.

Slide 378

Since center k is randomly sampled, it is reasonable to assume that $\beta_k$ is a random variable and has a distribution. A commonly used distribution is $\beta_k \sim N(\mu, \sigma^2)$.

Let $b_k = \beta_k - \mu$; then $b_k \sim N(0, \sigma^2)$ and our model becomes:

$$\mathrm{logit}\,\pi(x, k) = \mu + x\beta + b_k.$$

Only 3 model parameters: µ, β and $\sigma^2$. The likelihood function of $(\mu, \beta, \sigma^2)$:

$$L(\mu, \beta, \sigma^2) = \prod_{k=1}^K \int f(n_{11k} \mid b_k)\, f(n_{21k} \mid b_k)\, f(b_k)\, db_k.$$

The inference on β is based on $L(\mu, \beta, \sigma^2)$.

Slide 379
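The integral in the marginal likelihood has no closed form, so software approximates it numerically. The sketch below uses plain trapezoidal integration over a grid of $b_k$ values, which is a simpler stand-in for the Gauss-Hermite quadrature that `proc glimmix method=quad` uses. The data and parameter values are hypothetical, and each center is given as (successes, trials) for treatment followed by (successes, trials) for control.

```python
import math

def binom_pmf(y, n, p):
    return math.comb(n, y) * p ** y * (1.0 - p) ** (n - y)

def expit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def center_marginal_lik(n11, n1, n21, n2, mu, beta, sigma2, grid=2001, width=8.0):
    """Integrate f(n11 | b) f(n21 | b) f(b) over b by the trapezoid rule."""
    sigma = math.sqrt(sigma2)
    lo, hi = -width * sigma, width * sigma
    step = (hi - lo) / (grid - 1)
    total = 0.0
    for j in range(grid):
        b = lo + j * step
        dens = math.exp(-0.5 * (b / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        val = (binom_pmf(n11, n1, expit(mu + beta + b))
               * binom_pmf(n21, n2, expit(mu + b)) * dens)
        total += (0.5 if j in (0, grid - 1) else 1.0) * val
    return total * step

def log_lik(tables, mu, beta, sigma2):
    """log L(mu, beta, sigma^2): sum of log marginal likelihoods over centers."""
    return sum(math.log(center_marginal_lik(*t, mu, beta, sigma2)) for t in tables)

# Hypothetical centers: (n11, n1+, n21, n2+).
tables = [(8, 10, 2, 10), (5, 10, 3, 10), (1, 12, 0, 11)]
print(log_lik(tables, mu=-1.0, beta=0.7, sigma2=2.0))
```

In practice `log_lik` would be handed to a numerical optimizer over (µ, β, σ²); only three parameters are involved regardless of the number of centers.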

37 SAS program and output: title "Proc glimmix treating center effect as random"; proc glimmix method=quad data=table6_9; class center; model y/n = trt / s dist=bin; random int / subject=center type=vc; run; ****************************************************************** Proc glimmix treating center effect as random 12 Data Set Response Variable (Events) Response Variable (Trials) Response Distribution Link Function Variance Function Variance Matrix Blocked By Estimation Technique Likelihood Approximation Degrees of Freedom Method The GLIMMIX Procedure Model Information Class Level Information Class Levels Values WORK.TABLE6_9 y n Binomial Logit Default center Maximum Likelihood Gauss-Hermite Quadrature Containment center Number of Observations Read 16 Slide 380

38 Number of Observations Used 16 Number of Events 102 Number of Trials 273 Iteration History Objective Max Iteration Restarts Evaluations Function Change Gradient E-6 Convergence criterion (GCONV=1E-8) satisfied. Fit Statistics -2 Log Likelihood AIC (smaller is better) AICC (smaller is better) BIC (smaller is better) CAIC (smaller is better) HQIC (smaller is better) Fit Statistics for Conditional Distribution -2 log L(y r. effects) Slide 381

Pearson Chi-Square        8.37
Pearson Chi-Square / DF   0.52

Covariance Parameter Estimates
Cov Parm  Subject  Estimate  Standard Error
Intercept center

Solutions for Fixed Effects
Effect  Estimate  Standard Error  DF  t Value  Pr > |t|
Intercept
trt

From the output, we get $\hat\mu$ and $\hat\beta$ (with its SE), and $e^{\hat\beta} = 2.1$.

$\hat\sigma^2 = 1.9591$, the variation in the log odds of success among centers. Huge variation.

Since the success probability for patients receiving control at center k is

$$\pi_k^0 = \pi(0, k) = \frac{e^{\mu + b_k}}{1 + e^{\mu + b_k}}$$

Slide 382

and the success probability for patients receiving treatment at center k is

$$\pi_k^1 = \pi(1, k) = \frac{e^{\mu + \beta + b_k}}{1 + e^{\mu + \beta + b_k}},$$

we can generate a random sample of $b_k$'s to get a feeling for the distributions of $\pi_k^0$ and $\pi_k^1$:

$$\bar\pi^0 = \hat E(\pi_k^0) = 0.29, \qquad \bar\pi^1 = \hat E(\pi_k^1) = 0.42.$$

R function:

    postscript(file="cream-prob.ps", horizontal = F)
    par(mfrow=c(1,2), pty="s")
    b <- rnorm(10000, 0, sqrt(1.9591))
    expeta0 <- exp( b)
    expeta1 <- exp( b)
    pi0 <- expeta0/(1+expeta0)
    pi1 <- expeta1/(1+expeta1)
    mean0 <- mean(pi0)
    mean1 <- mean(pi1)
    hist(pi0, main="histogram of pi_0")
    hist(pi1, main="histogram of pi_1")
    dev.off()

Slide 383

[Figure: two panels, "Histogram of pi_0" and "Histogram of pi_1"; y-axis: frequency, x-axes: pi0 and pi1]

Slide 384

II.2 Estimation of The Common Odds-ratio in 2 × 2 × K Tables

Each of the above methods provides an estimate of the common odds-ratio in 2 × 2 × K tables, except the CMH method (Method 2).

There is also an MH estimate of the common odds-ratio:

$$\hat\theta_{MH} = \frac{\sum_{k=1}^K n_{11k}\,n_{22k}/n_{++k}}{\sum_{k=1}^K n_{12k}\,n_{21k}/n_{++k}}$$

Motivation of $\hat\theta_{MH}$: We could estimate θ using the data from the kth table as:

$$\hat\theta = \frac{n_{11k}\,n_{22k}}{n_{12k}\,n_{21k}}$$

Slide 385

Estimating equation:

$$\theta\, n_{12k} n_{21k} = n_{11k} n_{22k}$$

$$\theta\, n_{12k} n_{21k}/n_{++k} = n_{11k} n_{22k}/n_{++k}$$

$$\theta \sum_{k=1}^K n_{12k} n_{21k}/n_{++k} = \sum_{k=1}^K n_{11k} n_{22k}/n_{++k}$$

$$\Rightarrow\quad \hat\theta_{MH} = \frac{\sum_{k=1}^K n_{11k}\,n_{22k}/n_{++k}}{\sum_{k=1}^K n_{12k}\,n_{21k}/n_{++k}}.$$

CDA provides a variance formula for $\log(\hat\theta_{MH})$ on P. 229, which can be used to construct CIs for the common odds-ratio θ.

Slide 386
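The MH estimator is a ratio of two weighted sums, which a short Python sketch makes explicit. The tables are hypothetical counts chosen so the result can be checked by hand; the function name is illustrative.

```python
def mantel_haenszel_or(tables):
    """MH common odds-ratio estimate for 2x2 tables given as (n11, n12, n21, n22)."""
    numerator = 0.0
    denominator = 0.0
    for n11, n12, n21, n22 in tables:
        n = n11 + n12 + n21 + n22
        numerator += n11 * n22 / n
        denominator += n12 * n21 / n
    return numerator / denominator

# Hypothetical strata with stratum-specific odds ratios 1 and 4.
tables = [(1, 1, 1, 1), (2, 1, 1, 2)]
print(mantel_haenszel_or(tables))
```

For these two strata the sums are 11/12 and 5/12, so the estimate is 2.2, which lies between the stratum-specific odds ratios. Unlike averaging the per-table estimates, the MH form stays finite when an individual table has a zero in $n_{12k}$ or $n_{21k}$.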

44 For our cream example, we have θ MH = = See Method 2 in the previous section for SAS program and output. Slide 387

III Summarizing Predictive Power, Classification Tables and ROC Curves (P. 223)

Suppose we have a binary response $Y_i = 1/0$ (success/failure) and a vector of covariates $x_i$:

$$\pi(x_i) = P[Y_i = 1 \mid x_i], \qquad \mathrm{logit}\{\pi(x_i)\} = x_i^T\beta.$$

After we fit the model, we get $\hat\beta$ and hence $\hat\pi_i$ as

$$\hat\pi_i = \frac{e^{x_i^T\hat\beta}}{1 + e^{x_i^T\hat\beta}}.$$

Choose a known value $\pi_0$ (e.g., $\pi_0 = 0.5$), and predict

$$\hat Y_i = \begin{cases} 1 & \text{if } \hat\pi_i > \pi_0 \\ 0 & \text{otherwise} \end{cases}$$

Slide 388

and then construct the classification table

                Ŷ = 1    Ŷ = 0
    Y = 1       n_11     n_12
    Y = 0       n_21     n_22

The following two quantities tell us how good the prediction is:

$$\text{sensitivity} = \frac{n_{11}}{n_{11} + n_{12}}, \qquad \text{specificity} = \frac{n_{22}}{n_{21} + n_{22}}.$$

Using only one table with one $\pi_0$ loses information.

Solution: use many different values of $\pi_0$ ⇒ many classification tables ⇒ many pairs of sensitivity and specificity ⇒ plot sensitivity vs. 1 − specificity ⇒ ROC (receiver operating characteristic) curve.

The area under the ROC curve (AUC) summarizes the predictive power of the model and is often called the c-index.

Slide 389
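The construction of ROC points from classification tables can be sketched in Python as follows. The six fitted probabilities are hypothetical values, three for Y = 1 and three for Y = 0, and the function name is an illustrative choice.

```python
def sensitivity_specificity(y, pi_hat, cutoff):
    """Classify with Y-hat = 1 if pi_hat > cutoff, then summarize the 2x2 table."""
    y_hat = [1 if p > cutoff else 0 for p in pi_hat]
    n11 = sum(1 for a, b in zip(y, y_hat) if a == 1 and b == 1)
    n12 = sum(1 for a, b in zip(y, y_hat) if a == 1 and b == 0)
    n21 = sum(1 for a, b in zip(y, y_hat) if a == 0 and b == 1)
    n22 = sum(1 for a, b in zip(y, y_hat) if a == 0 and b == 0)
    return n11 / (n11 + n12), n22 / (n21 + n22)

# Hypothetical outcomes and fitted probabilities.
y = [1, 1, 1, 0, 0, 0]
pi_hat = [0.45, 0.65, 0.85, 0.35, 0.55, 0.75]

# One ROC point (1 - specificity, sensitivity) per cutoff.
for cutoff in [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    se, sp = sensitivity_specificity(y, pi_hat, cutoff)
    print(cutoff, round(se, 3), round(sp, 3), (round(1 - sp, 3), round(se, 3)))
```

Raising the cutoff trades sensitivity for specificity, so the printed points move from the top-right corner of the ROC plot toward the bottom-left.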

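An example: for each observation, record $Y$, $\hat\pi$ and the predictions $\hat Y_{0.3}, \hat Y_{0.4}, \ldots, \hat Y_{0.9}$ at cutoffs $\pi_0 = 0.3, \ldots, 0.9$. The resulting sensitivity (se) and specificity (sp):

    cutoff π_0    0.3    0.4    0.5    0.6    0.7    0.8    0.9
    se            3/3    3/3    2/3    2/3    1/3    1/3    0/3
    sp            0/3    1/3    1/3    2/3    2/3    3/3    3/3

Slide 390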

[Figure: ROC curve for the example; y-axis: sensitivity, x-axis: 1 − specificity]

Slide 391

The AUC for the above ROC curve: AUC = 6/9 = 2/3 = proportion of concordant pairs in $(Y_i, \hat\pi_i)$ among all pairs with different outcomes $Y_i$.

# of pairs with different outcomes: 3 × 3 = 9.

# of concordant pairs: 6.

If there are ties in the $\hat\pi_i$'s, we need to make an adjustment. For example, suppose the $\hat\pi_i$ for one $Y_i = 1$ and one $Y_i = 0$ are the same (0.4):

Slide 392

[Table columns: $Y$, $\hat\pi$, $\hat Y_{0.4}, \hat Y_{0.5}, \ldots, \hat Y_{0.9}$]

The corresponding classification tables give:

    cutoff π_0    0.4    0.5    0.6    0.7    0.8    0.9
    se            3/3    2/3    2/3    1/3    1/3    0/3
    sp            0/3    1/3    2/3    2/3    3/3    3/3

Slide 393

[Figure: ROC curve when there are tied predictive probabilities; y-axis: sensitivity, x-axis: 1 − specificity]

Slide 394

AUC = 5.5/9, where 9 = # of pairs with different outcomes and 5.5 = # of concordant pairs (5) + 0.5 × # of ties in the $\hat\pi_i$'s with different outcomes (1).

Slide 395
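The concordance interpretation of the AUC, including the half-credit for ties, can be sketched in a few lines of Python. The fitted probabilities below are hypothetical values chosen to reproduce the pair counts quoted above (6 concordant pairs without ties; 5 concordant pairs plus 1 tie with ties).

```python
def auc_concordance(y, pi_hat):
    """AUC = (# concordant pairs + 0.5 * # tied pairs) / # pairs with different outcomes."""
    pos = [p for yi, p in zip(y, pi_hat) if yi == 1]
    neg = [p for yi, p in zip(y, pi_hat) if yi == 0]
    concordant = sum(1 for a in pos for b in neg if a > b)
    ties = sum(1 for a in pos for b in neg if a == b)
    return (concordant + 0.5 * ties) / (len(pos) * len(neg))

y = [1, 1, 1, 0, 0, 0]

# Hypothetical fitted probabilities without ties: 6 concordant pairs out of 9.
no_ties = [0.45, 0.65, 0.85, 0.35, 0.55, 0.75]
# Hypothetical fitted probabilities with one tie between a Y = 1 and a Y = 0.
with_tie = [0.45, 0.65, 0.85, 0.45, 0.55, 0.75]

print(auc_concordance(y, no_ties))
print(auc_concordance(y, with_tie))
```

Counting pairs directly avoids building the ROC curve at all, and it gives the same value as the trapezoidal area under the curve.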
