Comparing MLE, MUE and Firth Estimates for Logistic Regression

1 Comparing MLE, MUE and Firth Estimates for Logistic Regression
Nitin R Patel, Chairman & Co-founder, Cytel Inc.; Research Affiliate, MIT
nitin@cytel.com

2 Acknowledgements
This presentation is based on joint work with:
- Pralay Senchaudhuri, Cytel Inc.
- Hrishikesh Kulkarni, Cytel Inc.

3 Outline
- Separation and Maximum Likelihood Estimates
- Firth's Method of Maximum Penalized Likelihood Estimation
- Numerical experiments comparing MUE with FirthE when there is separation
- Near separation and problems with MLE
- Numerical experiments comparing MLE with FirthE when there is near separation
- Conclusions

4 Maximum Likelihood Estimation
Maximum likelihood is the almost universally used method for fitting logistic regression models. ML estimates are asymptotically unbiased with minimum variance, but these properties need not hold in finite samples.
MLEs can have serious shortcomings when applied to datasets with the following characteristics:
- Small or moderate size
- Unbalanced responses (rare outcomes)
- Unequally spaced covariate values
- Many parameters relative to the number of observations

5 Example 1: Separation
[Table: Example 1 data with columns seq#, x1, x2, y. Figure: covariate plot of data, x2 against x1.]

6 MLEs and Separation
When separation occurs, one or more MLEs do not exist. In other words, one or more MLEs are unbounded (and so are their standard errors). The maximum likelihood method therefore fails to provide either point or interval estimates.

7 A useful characterization of separation
Separation occurs if and only if the observed vector of sufficient statistics is on the boundary of the convex hull of the (finite) set of possible sufficient statistics vectors.

8 Example 2: Simple Logistic Regression (one covariate, two parameters)
Response $Y_i$, covariate $x_i$ for observation $i$.
Model: $\pi_i = P(Y_i = 1)$, $\operatorname{logit}(\pi_i) = \beta_0 + \beta_1 x_i$.
Sufficient statistics vector is $(T_0, T_1)$, where $T_0 = \sum_i Y_i$ and $T_1 = \sum_i x_i Y_i$.

9 Example 2: Simple Logistic Regression (contd.)
Sufficient statistics vector is $(T_0, T_1)$, where $T_0 = \sum_i Y_i$ and $T_1 = \sum_i x_i Y_i$.
[Figure: possible values of the sufficient statistics; axes t0 (sufficient stat. for beta0) and t1 (sufficient stat. for beta1).]

10 Example 2 (contd.)
Suppose we observe:
- $y_i = 0$ for $x_i$ = 5, 10, 15, 20, 25, 30, 35, 40, 45
- $y_i = 1$ for $x_i$ = 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100
The observed sufficient statistics vector is $(t_0 = 11, t_1 = 825)$. The MLE for $\beta_1$ does not exist, since (11, 825) is on the boundary of the $(T_0, T_1)$ space.
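As a minimal check of the numbers on this slide, the sketch below recomputes the sufficient statistics for the Example 2 data and evaluates the log-likelihood along one direction in which the data are separated. The direction c * (-47.5, 1) and the scale values are illustrative choices, not taken from the presentation; 47.5 is simply a cut point between the largest x with y = 0 and the smallest x with y = 1.

```python
import numpy as np

# Example 2 data as listed on the slide: y = 0 for x = 5..45, y = 1 for x = 50..100.
x = np.arange(5, 105, 5)
y = (x >= 50).astype(int)

# Sufficient statistics for (beta0, beta1).
t0 = int(y.sum())
t1 = int((x * y).sum())

def loglik(beta0, beta1):
    """Logistic log-likelihood, written as sum of log sigmoid(s * eta) for stability."""
    eta = beta0 + beta1 * x
    s = 2 * y - 1  # +1 where y = 1, -1 where y = 0
    return -np.logaddexp(0.0, -s * eta).sum()

# Illustrative direction (an assumption for this sketch): (beta0, beta1) = c * (-47.5, 1).
scales = [0.1, 1.0, 10.0]
values = [loglik(-47.5 * c, c) for c in scales]

print(t0, t1)   # 11 825
print(values)   # increases towards 0 as c grows, so no finite maximizer exists
```

Because every observation is classified correctly along this direction, each term of the log-likelihood moves towards zero as c grows, which is why the slope estimate is unbounded.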

11 Firth's Penalized Likelihood Method
- The MLE is the root obtained when the score function (the derivative of the loglikelihood) is equated to zero.
- Firth's method removes the $O(n^{-1})$ term from the bias of the MLE by modifying the score function with a penalty term.
- The root obtained when the modified score function is set to zero is Firth's Penalized Likelihood Estimate (FirthE).

12 Logistic Regression
The loglikelihood has the form
$l(\beta) = t^{\top}\beta - K(\beta)$,
where $t$ is the observed sufficient statistic vector.
The score function is therefore
$U(\beta) = l'(\beta) = t - K'(\beta)$.
Firth's modified score function is
$U^{*}(\beta_j) = U(\beta_j) + \tfrac{1}{2}\,\operatorname{trace}\!\left[ I(\beta)^{-1}\, \dfrac{\partial I(\beta)}{\partial \beta_j} \right]$,
where $I(\beta)$ is Fisher's information matrix.
Firth's modification shrinks the ML estimate towards zero.
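The modified score can be solved numerically. The following is a minimal sketch, not the LogXact implementation: it assumes the standard simplification of the trace term for logistic regression, X'(y - pi + h(1/2 - pi)), where h holds the diagonal of the weighted hat matrix, and it uses a Fisher-scoring iteration with step-halving on the penalized loglikelihood. The function name `firth_logistic`, the iteration limits and the tolerances are all illustrative assumptions.

```python
import numpy as np

def firth_logistic(X, y, max_iter=500, tol=1e-8):
    """Solve Firth's modified score equations U*(beta) = 0 (illustrative sketch)."""
    def pieces(b):
        eta = X @ b
        pi = 1.0 / (1.0 + np.exp(-eta))
        w = pi * (1.0 - pi)
        info = X.T @ (w[:, None] * X)               # Fisher information I(beta)
        info_inv = np.linalg.inv(info)
        # h_i = w_i * x_i' I^{-1} x_i, the diagonal of the weighted hat matrix
        h = w * np.einsum("ij,jk,ik->i", X, info_inv, X)
        # U + 1/2 trace(I^{-1} dI/dbeta_j) reduces to X'(y - pi + h (1/2 - pi))
        score = X.T @ (y - pi + h * (0.5 - pi))
        # Penalized loglikelihood: l(beta) + 1/2 log det I(beta)
        pll = (y * eta - np.logaddexp(0.0, eta)).sum() + 0.5 * np.linalg.slogdet(info)[1]
        return score, info_inv, pll

    beta = np.zeros(X.shape[1])
    score, info_inv, pll = pieces(beta)
    for _ in range(max_iter):
        if np.max(np.abs(score)) < tol:
            break
        step = info_inv @ score
        new_beta = beta + step
        new_score, new_inv, new_pll = pieces(new_beta)
        halvings = 0
        # Halve the step while the penalized loglikelihood clearly decreases.
        while new_pll < pll - 1e-10 and halvings < 30:
            step = step / 2.0
            new_beta = beta + step
            new_score, new_inv, new_pll = pieces(new_beta)
            halvings += 1
        beta, score, info_inv, pll = new_beta, new_score, new_inv, new_pll
    return beta, score

# Example 2 data (completely separated), with an intercept column.
x = np.arange(5, 105, 5).astype(float)
y = (x >= 50).astype(float)
X = np.column_stack([np.ones_like(x), x])
beta_firth, final_score = firth_logistic(X, y)
print(beta_firth)
```

On these separated data the ordinary MLE does not exist, while the penalized fit is expected to return finite coefficients. Step-halving is a safeguard chosen for this sketch rather than a required part of the method.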

13 Boundary points of Sufficient Statistics space
[Figure: boundary points in the space of sufficient statistics; axes t0 and t1.]
There are 40 points on the boundary of the set of possible values of $(t_0, t_1)$.
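The count of 40 can be reproduced for the Example 2 covariate values. The sketch below relies on a property of the one-covariate case: for each value of t0 = k, the boundary points are the smallest attainable t1 (responses on the k smallest x) and the largest attainable t1 (responses on the k largest x). Variable names are illustrative.

```python
import numpy as np

x = np.sort(np.arange(5, 105, 5))              # covariate values 5, 10, ..., 100
n = len(x)
prefix = np.concatenate([[0], np.cumsum(x)])   # prefix[k] = sum of the k smallest x
total = int(x.sum())

boundary = set()
for k in range(n + 1):
    t1_min = int(prefix[k])                    # all k responses on the smallest x
    t1_max = total - int(prefix[n - k])        # all k responses on the largest x
    boundary.add((k, t1_min))
    boundary.add((k, t1_max))

print(len(boundary))            # 40
print((11, 825) in boundary)    # True: the observed vector from Example 2
```

The two end cases, t0 = 0 and t0 = 20, contribute one point each, and the 19 values in between contribute two each, giving 2 + 38 = 40.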

14 Comparison of MUE with FirthE when MLE does not exist
- Several numerical experiments with one-covariate models and a limited number with two-covariate models.
- Used exhaustive enumeration of t-vectors as well as Monte Carlo simulations with sample sizes of [...]
- We will illustrate with the Example 2 data.

Bias Comparison for MUE with FirthE for ED50 = 52.

MSE Comparison for MUE with FirthE for ED50 = 52.

17 Findings from numerical experiments
Our numerical experiments with one-covariate models, and some with two covariates, suggest that Firth's method gives better estimates than MUE, in terms of both bias and Mean Square Error, when there is complete separation.
Additional advantages of Firth's method are:
- Unlike MUE, it does not depend on the conditional distribution of the sufficient statistic, so it does not have the problems associated with having few support points (e.g. with continuous covariates).
- It is much faster to compute.

18 A real dataset
- Two hundred rats were treated with a toxic substance at four dose levels; the binary response examined was development of an intestinal tumor.
- The covariates were the dose levels (as factor variables) and a binary survival variable to control for death. (Data from US Toxicology Program Tech Report 405, 1991; the LogXact manual gives details.)
- There was separation in this dataset.
- The output is from the current beta version of LogXact, which provides Firth's method as an option.

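19 LogXact Results
Columns: Model Term; Point Estimate (Type, Beta, SE(Beta)); 95% Conf. Interval (Type, Lower, Upper); 2*1-sided P-Value.
- %Const: FirthE, Asymptotic interval
- dose_0: FirthE, Asymptotic interval; MUE (SE NA), Exact interval with lower limit -INF
- dose_150: FirthE, Asymptotic interval; CMLE, Exact interval
- dose_300: FirthE, Asymptotic interval; MUE (SE NA), Exact interval with lower limit -INF
- survival: FirthE, Asymptotic interval; CMLE, Exact interval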

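20 Near Separation
MLE is unstable: a small shift in the data leads to a huge change in the ML estimates of the coefficients.
[Table: Example 1 data with columns seq#, x1, x2, y, with one covariate value replaced by k. Figure: covariate plot of data, x2 against x1.]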

21 MLE and Near separation: Example 1 (contd.)
[Figure: coefficients vs k; MLE beta on the vertical axis, k on the horizontal axis; series beta1 and beta2.]

22 Interior Points grouped into Layers by closeness to the boundary
[Figure: Interior Point Layers; axes t0 and t1; legend: Layer 1, Layer 5, Layer 10, Layer 20, Layer 40, Layer 50.]

24 Bias Comparison of MLE to FirthE, ED50 = 52.5. Based on complete enumeration.

25 Bias Comparison of MLE to FirthE, ED50 = 5. Based on complete enumeration.

26 Bias Comparison of MLE to FirthE, ED50 = 100. Based on complete enumeration.

27 Significant Models (pval < 0.05): Bias Comparison of MLE to FirthE, ED50 = 52.5. Based on complete enumeration.

28 MSE Comparison of MLE to FirthE, ED50 = 52.5. Based on complete enumeration.

29 MSE Comparison of MLE to FirthE, ED50 = 5. Based on complete enumeration.

30 MSE Comparison of MLE to FirthE, ED50 = 100. Based on complete enumeration.

31 Significant Models (pval < 0.05): MSE Comparison of MLE to FirthE, ED50 = 52.5. Based on complete enumeration.

32 Conclusions from Experiments
- Our numerical experiments and simulations suggest that FirthE reduces both bias and Mean Square Error in comparison to MLE when the maximum slope of the logistic curve is not very high.
- However, when the maximum slope is high, the FirthE correction for bias produces excessive shrinkage and the MLE is superior.
- In many data sets that arise in practice we don't expect large changes in response for small changes in the covariate values, so FirthE will be superior.
- We conjecture that this conclusion will also hold when we compare conditional MLE and conditional FirthE.
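For reference, the maximum slope of the logistic curve in the one-covariate model can be written out as below. The derivation is standard; reading ED50 as the covariate value at which the response probability is 1/2 is an assumption here, since the slides do not define it.

```latex
\[
\pi(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}, \qquad
\frac{d\pi}{dx} = \beta_1\,\pi(x)\,\{1 - \pi(x)\},
\]
\[
\max_x \frac{d\pi}{dx} = \frac{\beta_1}{4}
\quad \text{at} \quad x = -\frac{\beta_0}{\beta_1} \;(\text{where } \pi = 1/2,\ \beta_1 > 0).
\]
```

A large $\beta_1$ therefore corresponds to a steep curve, which is the case where the conclusions above favor the MLE.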

33 Detecting near separation in data sets
- We have a research project to create an index that signals near separation in data sets, to alert LogXact users to the bias in the MLE. Please let us know if you have datasets you can share that seem to exhibit near separation.
- Experiments suggest that confidence intervals based on the Firth profile likelihood can be used to detect near separation.
- The ratio of the upper CI width to the lower CI width appears to have promise as an index of near separation.
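The ratio index is simple to compute once a point estimate and interval limits are available. The helper below is a hypothetical sketch: it assumes the "upper width" is the distance from the estimate to the upper limit and the "lower width" is the distance from the lower limit to the estimate, and the numbers in the example are made up for illustration only.

```python
def ci_width_ratio(estimate, lower, upper):
    """Ratio of upper to lower CI width (assumed definition; see the note above)."""
    upper_width = upper - estimate
    lower_width = estimate - lower
    if upper_width <= 0 or lower_width <= 0:
        raise ValueError("expected lower < estimate < upper")
    return upper_width / lower_width

# Made-up values: a symmetric interval and a right-skewed interval.
symmetric = ci_width_ratio(1.0, 0.5, 1.5)
skewed = ci_width_ratio(1.0, 0.5, 2.5)
print(symmetric, skewed)   # 1.0 3.0
```

A ratio near 1 indicates a roughly symmetric interval. Under the assumption above, a ratio well above 1 corresponds to an upper limit that stretches far from the estimate.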

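34 Example 2: Simple Logistic Regression (contd.)
Sufficient statistics vector is $(T_0, T_1)$, where $T_0 = \sum_i Y_i$ and $T_1 = \sum_i x_i Y_i$.
[Figure repeated from slide 9: possible values of the sufficient statistics; axes t0 (sufficient stat. for beta0) and t1 (sufficient stat. for beta1).]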

35 Interior Points grouped into Layers by closeness to the boundary
Ratios were calculated for each interior point.

36 Ratio of Firth Profile Likelihood 95% CI widths
Ratio = UCIwidth / LCIwidth
[Figure: Ratio against # Layers from boundary, with a fitted polynomial.]

37 Thank you!
