Comparing MLE, MUE and Firth Estimates for Logistic Regression


Comparing MLE, MUE and Firth Estimates for Logistic Regression
Nitin R Patel, Chairman & Co-founder, Cytel Inc.
Research Affiliate, MIT
nitin@cytel.com

Acknowledgements

This presentation is based on joint work with:
Pralay Senchaudhuri, Cytel Inc.
Hrishikesh Kulkarni, Cytel Inc.

Outline

Separation and Maximum Likelihood Estimates
Firth's Method of Maximum Penalized Likelihood Estimation
Numerical experiments comparing MUE with FirthE when there is separation
Near separation and problems with MLE
Numerical experiments comparing MLE with FirthE when there is near separation
Conclusions

Maximum Likelihood Estimation

Maximum likelihood is the almost universally used estimation method for logistic regression models. ML estimates are asymptotically unbiased and have minimum asymptotic variance, but these properties do not carry over to finite samples. MLEs can have serious shortcomings when applied to datasets with the following characteristics:
Small or moderate sample size
Unbalanced responses (rare outcomes)
Unequally spaced covariate values
Many parameters relative to the number of observations

Example 1

seq#   x1   x2   y
  1    10   10   1
  2    11   11   1
  3    12   12   1
  4    13   13   1
  5    14   14   1
  6    15   15   1
  7    16   16   1
  8    17   17   1
  9    19   19   1
 10    10   16   0
 11    11   17   0
 12    12   18   0
 13    13   19   0
 14    14   20   0
 15    15   21   0
 16    16   22   0
 17    17   23   0
 18    18   18   0
 19    18   24   0
 20    19   25   0

[Figure: covariate plot of the data, x2 versus x1, illustrating the separation between the y = 1 and y = 0 points.]

MLEs and Separation

When separation occurs, one or more MLEs do not exist. In other words, one or more MLEs are unbounded (and so are their standard errors). This means that the maximum likelihood method fails to provide either point or interval estimates.

A useful characterization of separation

Separation occurs if and only if the observed vector of sufficient statistics lies on the boundary of the convex hull of the (finite) set of possible sufficient statistic vectors.

Example 2: Simple Logistic Regression (one covariate, two parameters)

Response $Y_i$ and covariate $x_i$ for observation $i$. Model:
$$\pi_i = P(Y_i = 1), \qquad \mathrm{logit}(\pi_i) = \beta_0 + \beta_1 x_i.$$
The sufficient statistic vector is $(T_0, T_1)$, where $T_0 = \sum_i Y_i$ and $T_1 = \sum_i x_i Y_i$.
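For completeness (this derivation is standard and is not spelled out on the slide), writing out the log-likelihood shows why $(T_0, T_1)$ is sufficient and anticipates the form $\ell(\beta) = t^{\prime}\beta - K(\beta)$ used later:

```latex
\ell(\beta_0,\beta_1)
  = \sum_i \Bigl[ y_i(\beta_0 + \beta_1 x_i) - \log\bigl(1 + e^{\beta_0 + \beta_1 x_i}\bigr) \Bigr]
  = \beta_0 T_0 + \beta_1 T_1 - \underbrace{\sum_i \log\bigl(1 + e^{\beta_0 + \beta_1 x_i}\bigr)}_{K(\beta)} .
```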

Example 2: Simple Logistic Regression (contd.)

The covariate takes the values $x = 5, 10, 15, \ldots, 100$.

[Figure: the set of possible sufficient statistic vectors, with $t_0 = \sum_i Y_i$ (sufficient statistic for $\beta_0$) on the horizontal axis and $t_1 = \sum_i x_i Y_i$ (sufficient statistic for $\beta_1$) on the vertical axis.]

Example 2 (contd.)

Suppose we observe:
$y_i = 0$ for $x_i = 5, 10, 15, 20, 25, 30, 35, 40, 45$
$y_i = 1$ for $x_i = 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100$.
The observed sufficient statistic vector is $(t_0, t_1) = (11, 825)$. The MLE for $\beta_1$ does not exist, since $(11, 825)$ lies on the boundary of the $(T_0, T_1)$ space.
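To make the failure concrete, here is a minimal numerical sketch (not part of the original slides) in Python/NumPy. It evaluates the logistic log-likelihood along the ray $\beta_0 = -47.5c$, $\beta_1 = c$, where 47.5 is simply the midpoint of the gap between $x = 45$ and $x = 50$; the log-likelihood increases monotonically toward its supremum of 0, so no finite maximizer exists.

```python
import numpy as np

# Example 2 data: y = 0 for x = 5,...,45 and y = 1 for x = 50,...,100.
x = np.arange(5, 105, 5).astype(float)
y = (x >= 50).astype(float)

def loglik(beta0, beta1):
    """Logistic log-likelihood: sum_i [y_i * eta_i - log(1 + exp(eta_i))]."""
    eta = beta0 + beta1 * x
    return np.sum(y * eta - np.logaddexp(0.0, eta))

# Walk out along the separating direction eta = c * (x - 47.5): the likelihood
# keeps improving as c grows, so there is no finite MLE for beta1.
for c in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(f"c = {c:4.1f}   log-likelihood = {loglik(-47.5 * c, c):.6f}")
```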

Firth's Penalized Likelihood Method

The MLE is the root obtained when the score function (the derivative of the log-likelihood) is set to zero. Firth's method removes the $O(n^{-1})$ term from the bias of the MLE by adding a correction term to the score function; equivalently, it maximizes the log-likelihood penalized by half the log-determinant of the Fisher information (the Jeffreys prior). The root obtained when this modified score function is set to zero is Firth's Penalized Likelihood Estimate (FirthE).

Logistic Regression

The log-likelihood has the form
$$\ell(\beta) = t^{\prime}\beta - K(\beta),$$
where $t$ is the observed sufficient statistic vector. The score function is therefore
$$U(\beta) = \ell^{\prime}(\beta) = t - K^{\prime}(\beta).$$
Firth's modified score function is
$$U^{*}(\beta_j) = U(\beta_j) + \tfrac{1}{2}\,\mathrm{trace}\!\left[ I(\beta)^{-1}\,\frac{\partial I(\beta)}{\partial \beta_j} \right],$$
where $I(\beta)$ is Fisher's information matrix. Firth's modification shrinks the estimate towards zero.
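To illustrate how the modified score is used in practice, the sketch below (not from the original slides) implements Firth-penalized logistic regression using the familiar hat-matrix form of the modified score, $U^{*}(\beta) = X^{\prime}\{y - \pi + h \odot (1/2 - \pi)\}$, where $h$ holds the diagonal of the hat matrix $H = W^{1/2} X (X^{\prime} W X)^{-1} X^{\prime} W^{1/2}$. It takes plain Newton steps using the unpenalized Fisher information and omits the step-halving and convergence safeguards a production implementation (for example the logistf package in R, or LogXact itself) would use.

```python
import numpy as np
from scipy.special import expit

def firth_logistic_fit(X, y, n_iter=100, tol=1e-8):
    """Firth-penalized logistic regression via Newton steps on the modified score."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        pi = expit(X @ beta)
        W = pi * (1.0 - pi)                           # IRLS weights
        XtWX = X.T @ (W[:, None] * X)                 # Fisher information I(beta)
        XtWX_inv = np.linalg.inv(XtWX)
        # Diagonal of the hat matrix H = W^1/2 X (X'WX)^-1 X' W^1/2.
        h = np.einsum("ij,jk,ik->i", X, XtWX_inv, X) * W
        score = X.T @ (y - pi + h * (0.5 - pi))       # Firth-modified score
        step = XtWX_inv @ score
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Example 2 data, for which the ordinary MLE of beta1 does not exist:
x = np.arange(5, 105, 5).astype(float)
y = (x >= 50).astype(float)
X = np.column_stack([np.ones_like(x), x])
print(firth_logistic_fit(X, y))   # finite FirthE estimates despite separation
```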

Boundary points of the sufficient statistic space

t_0   min t_1   max t_1
 0        0         0
 1        5       100
 2       15       195
 3       30       285
 4       50       370
 5       75       450
 6      105       525
 7      140       595
 8      180       660
 9      225       720
10      275       775
11      330       825
12      390       870
13      455       910
14      525       945
15      600       975
16      680      1000
17      765      1020
18      855      1035
19      950      1045
20     1050      1050

[Figure: boundary points in the space of sufficient statistics, t_1 versus t_0.]

There are 40 points on the boundary of the set of possible values of (t_0, t_1).
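The boundary table above is easy to reproduce directly: for each value $t_0 = k$, the smallest and largest attainable $t_1$ are the sums of the $k$ smallest and the $k$ largest covariate values. A short sketch (not from the slides):

```python
import numpy as np

x = np.sort(np.arange(5, 105, 5))       # covariate values 5, 10, ..., 100

# For t0 = k successes, t1 = sum of the x's attached to the k responses equal
# to 1, so its extremes are the sums of the k smallest and k largest x values.
for k in range(len(x) + 1):
    t1_min = x[:k].sum()
    t1_max = x[len(x) - k:].sum()
    print(f"t0 = {k:2d}   t1 in [{t1_min:4d}, {t1_max:4d}]")
```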

Comparison of MUE with FirthE when the MLE does not exist

We ran several numerical experiments with one-covariate models and a limited number with two-covariate models, using exhaustive enumeration of t-vectors as well as Monte Carlo simulations with sample sizes of 1000. We will illustrate with the Example 2 data.

Bias Comparison of MUE with FirthE for ED50 = 52.5 (based on complete enumeration)

MSE Comparison of MUE with FirthE for ED50 = 52.5 (based on complete enumeration)

Findings from numerical experiments

Our numerical experiments, several with one covariate and some with two covariates, suggest that in terms of both bias and mean square error Firth's method gives better estimates than the MUE when there is complete separation. Additional advantages of Firth's method are:
Unlike the MUE, it does not depend on the conditional distribution of the sufficient statistic, so it does not have problems associated with having few support points (e.g. with continuous covariates).
It is much faster to compute.

A real dataset

Two hundred rats were treated with a toxic substance at four dose levels; the binary response examined was development of an intestinal tumor. The covariates were the dose levels (as factor variables) and a binary survival variable to control for death. (Data from US Toxicology Program Tech Report 405, 1991; the LogXact manual gives details.) There was separation in this dataset. The output below is from the current beta version of LogXact, which provides Firth's method as an option.

LogXact Results

                     Point Estimate                95% Conf. Interval          2*1-sided
Model Term  Type     Beta      SE(Beta)   Type         Lower      Upper       P-Value
%Const      FirthE   -3.861    2.108      Asymptotic   -7.993     0.2713      0.0671
dose_0      FirthE   -2.873    1.937      Asymptotic   -6.67      0.9241      0.1381
            MUE      -1.053    NA         Exact        -INF       1.909       0.4824
dose_150    FirthE   -1.24     1.438      Asymptotic   -4.057     1.578       0.3886
            CMLE     -1.444    1.667      Exact        -6.437     2.471       0.9367
dose_300    FirthE   -2.733    1.656      Asymptotic   -5.978     0.5116      0.0988
            MUE      -1.677    NA         Exact        -INF       0.869       0.2068
survival    FirthE   0.09387   0.1402     Asymptotic   -0.1808    0.3686      0.5030
            CMLE     0.1246    0.174      Exact        -0.2128    0.5058      0.5345

Near Separation

The MLE is unstable: a small shift in the data leads to a huge change in the ML estimates of the coefficients. The data are those of Example 1, except that the x2 value of observation 18 (seq# 18, x1 = 18) is replaced by a value k that is varied.

[Figure: covariate plot of the data, x2 versus x1, with the x2 value of observation 18 marked as k.]

MLE and Near Separation: Example 1 (contd.)

[Figure: ML estimates of beta1 and beta2 plotted against k, for k from 0 to 20.]
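A rough way to reproduce this behaviour numerically (not from the original slides): refit the ordinary MLE for the Example 1 data while the x2 value of observation 18 is moved through a few values of k below the separation threshold. The `mle` helper and the particular k values are illustrative choices; at k = 18 and above the data exhibit (quasi-)separation, as in the original Example 1, and the MLE ceases to exist.

```python
import numpy as np
from scipy.optimize import minimize

# Example 1 data; index 17 (seq# 18) carries the x2 value that is varied as k.
x1 = np.array([10, 11, 12, 13, 14, 15, 16, 17, 19,
               10, 11, 12, 13, 14, 15, 16, 17, 18, 18, 19], float)
x2 = np.array([10, 11, 12, 13, 14, 15, 16, 17, 19,
               16, 17, 18, 19, 20, 21, 22, 23,  0, 24, 25], float)
y = np.array([1] * 9 + [0] * 11, float)

def mle(k):
    """Ordinary (unpenalized) ML fit with observation 18's x2 set to k."""
    x2k = x2.copy()
    x2k[17] = k
    X = np.column_stack([np.ones_like(x1), x1, x2k])
    negll = lambda b: -np.sum(y * (X @ b) - np.logaddexp(0.0, X @ b))
    return minimize(negll, np.zeros(3), method="BFGS").x

for k in (14, 15, 16, 17):          # approaching the separation threshold k = 18
    print(k, np.round(mle(k), 2))
```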

Interior Points Grouped into Layers by Closeness to the Boundary

[Figure: interior points of the (t_0, t_1) space, with layers 1, 5, 10, 20, 40 and 50 marked by their distance from the boundary.]

Bias Comparison of MLE to FirthE, ED50 = 52.5 (based on complete enumeration)

Bias Comparison of MLE to FirthE, ED50 = 5 (based on complete enumeration)

Bias Comparison of MLE to FirthE, ED50 = 100 (based on complete enumeration)

Significant Models (p-value < 0.05): Bias Comparison of MLE to FirthE, ED50 = 52.5 (based on complete enumeration)

MSE Comparison of MLE to FirthE, ED50 = 52.5 (based on complete enumeration)

MSE Comparison of MLE to FirthE, ED50 = 5 (based on complete enumeration)

MSE Comparison of MLE to FirthE, ED50 = 100 (based on complete enumeration)

Significant Models (p-value < 0.05): MSE Comparison of MLE to FirthE, ED50 = 52.5 (based on complete enumeration)

Conclusions from Experiments

Our numerical experiments and simulations suggest that FirthE reduces bias as well as mean square error in comparison to the MLE when the maximum slope of the logistic curve is not very high. However, when the maximum slope is high, the FirthE correction for bias produces excessive shrinkage and the MLE is superior. In many data sets that arise in practice we do not expect large changes in response for small changes in the covariate values, so FirthE will be superior. We conjecture that this conclusion will also hold when we compare conditional MLE and conditional FirthE.

Detecting Near Separation in Data Sets

We have a research project to create an index that signals near separation in data sets, to alert LogXact users to the bias in the MLE. Please let us know if you have datasets you can share that seem to exhibit near separation. Experiments suggest that we can use confidence intervals based on the Firth profile likelihood to detect near separation: the ratio of the upper CI width to the lower CI width appears to have promise as an index of near separation.

Example 2: Simple Logistic Regression (contd.)

[Figure: the set of possible sufficient statistic vectors $(t_0, t_1)$, with $t_0 = \sum_i Y_i$ and $t_1 = \sum_i x_i Y_i$, repeated from the earlier Example 2 slide.]

Interior Points Grouped into Layers by Closeness to the Boundary

Ratios were calculated for each interior point.

Ratio of Firth Profile Likelihood 95% CI Widths

Ratio = upper CI width / lower CI width

[Figure: the ratio plotted against the number of layers from the boundary, with a fitted polynomial curve.]
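As a sketch of how such an index might be computed (this is a reconstruction, not code from the slides), the following profiles Firth's penalized log-likelihood $\ell(\beta) + \tfrac{1}{2}\log\det I(\beta)$ over the intercept on a grid of $\beta_1$ values for the Example 2 data, inverts the likelihood-ratio statistic to get a 95% profile-likelihood CI for $\beta_1$, and reports the upper-to-lower CI-width ratio. The covariate is centred (which only reparameterizes the intercept), and the grid and intercept bounds are assumptions chosen to cover this example.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import expit
from scipy.stats import chi2

# Example 2 data; centring x only reparameterizes the intercept, so the slope
# beta1 and its profile-likelihood CI are unchanged.
x = np.arange(5, 105, 5).astype(float)
y = (x >= 50).astype(float)
xc = x - x.mean()
X = np.column_stack([np.ones_like(xc), xc])

def penalized_loglik(b0, b1):
    """Firth's penalized log-likelihood l(beta) + 0.5 * log det I(beta)."""
    eta = b0 + b1 * xc
    ll = np.sum(y * eta - np.logaddexp(0.0, eta))
    W = expit(eta) * (1.0 - expit(eta))
    _, logdet = np.linalg.slogdet(X.T @ (W[:, None] * X))
    return ll + 0.5 * logdet

def profile(b1):
    """Maximize the penalized log-likelihood over the intercept for fixed beta1."""
    res = minimize_scalar(lambda b0: -penalized_loglik(b0, b1),
                          bounds=(-25.0, 25.0), method="bounded")
    return -res.fun

# Grid-invert the likelihood-ratio statistic for a 95% profile-likelihood CI.
grid = np.linspace(0.0, 2.0, 801)          # assumed wide enough for this example
prof = np.array([profile(b1) for b1 in grid])
b1_hat = grid[np.argmax(prof)]
cutoff = prof.max() - 0.5 * chi2.ppf(0.95, df=1)
inside = grid[prof >= cutoff]
lower, upper = inside.min(), inside.max()

print(f"FirthE of beta1 (grid): {b1_hat:.3f}   95% profile CI: ({lower:.3f}, {upper:.3f})")
print(f"Upper/lower CI-width ratio: {(upper - b1_hat) / (b1_hat - lower):.2f}")
```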

Thank you!
nitin@cytel.com