Comparing MLE, MUE and Firth Estimates for Logistic Regression


Comparing MLE, MUE and Firth Estimates for Logistic Regression
Nitin R Patel, Chairman & Co-founder, Cytel Inc.
Research Affiliate, MIT
nitin@cytel.com

Acknowledgements

This presentation is based on joint work with:
Pralay Senchaudhuri, Cytel Inc.
Hrishikesh Kulkarni, Cytel Inc.

Outline

Separation and Maximum Likelihood Estimates
Firth's Method of Maximum Penalized Likelihood Estimation
Numerical experiments comparing MUE with FirthE when there is separation
Near separation and problems with MLE
Numerical experiments comparing MLE with FirthE when there is near separation
Conclusions

Maximum Likelihood Estimation

Maximum likelihood is the almost universally used estimation method for logistic regression models. ML estimates are asymptotically unbiased and have minimum asymptotic variance, but these properties do not carry over to finite samples. MLEs can have serious shortcomings when applied to datasets with the following characteristics:
Small or moderate sample size
Unbalanced responses (rare outcomes)
Unequally spaced covariate values
Many parameters relative to the number of observations

Example 1

seq#   x1   x2   y
  1    10   10   1
  2    11   11   1
  3    12   12   1
  4    13   13   1
  5    14   14   1
  6    15   15   1
  7    16   16   1
  8    17   17   1
  9    19   19   1
 10    10   16   0
 11    11   17   0
 12    12   18   0
 13    13   19   0
 14    14   20   0
 15    15   21   0
 16    16   22   0
 17    17   23   0
 18    18   18   0
 19    18   24   0
 20    19   25   0

[Figure: covariate plot of the data, x2 versus x1, illustrating the separation between the y = 1 and y = 0 points.]

MLEs and Separation

When separation occurs, one or more MLEs do not exist. In other words, one or more MLEs are unbounded (and so are their standard errors). This means that the maximum likelihood method fails to provide either point or interval estimates.

A useful characterization of separation

Separation occurs if and only if the observed vector of sufficient statistics lies on the boundary of the convex hull of the (finite) set of possible sufficient statistic vectors.

Example 2: Simple Logistic Regression (one covariate, two parameters)

Response $Y_i$ and covariate $x_i$ for observation $i$. Model:
$$\pi_i = P(Y_i = 1), \qquad \mathrm{logit}(\pi_i) = \beta_0 + \beta_1 x_i.$$
The sufficient statistic vector is $(T_0, T_1)$, where $T_0 = \sum_i Y_i$ and $T_1 = \sum_i x_i Y_i$.
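For completeness (this derivation is standard and is not spelled out on the slide), writing out the log-likelihood shows why $(T_0, T_1)$ is sufficient and anticipates the form $\ell(\beta) = t^{\prime}\beta - K(\beta)$ used later:

```latex
\ell(\beta_0,\beta_1)
  = \sum_i \Bigl[ y_i(\beta_0 + \beta_1 x_i) - \log\bigl(1 + e^{\beta_0 + \beta_1 x_i}\bigr) \Bigr]
  = \beta_0 T_0 + \beta_1 T_1 - \underbrace{\sum_i \log\bigl(1 + e^{\beta_0 + \beta_1 x_i}\bigr)}_{K(\beta)} .
```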

Example 2: Simple Logistic Regression (contd.)

The covariate takes the values $x = 5, 10, 15, \ldots, 100$.

[Figure: the set of possible sufficient statistic vectors, with $t_0 = \sum_i Y_i$ (sufficient statistic for $\beta_0$) on the horizontal axis and $t_1 = \sum_i x_i Y_i$ (sufficient statistic for $\beta_1$) on the vertical axis.]

Example 2 (contd.)

Suppose we observe:
$y_i = 0$ for $x_i = 5, 10, 15, 20, 25, 30, 35, 40, 45$
$y_i = 1$ for $x_i = 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100$.
The observed sufficient statistic vector is $(t_0, t_1) = (11, 825)$. The MLE for $\beta_1$ does not exist, since $(11, 825)$ lies on the boundary of the $(T_0, T_1)$ space.
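To make the failure concrete, here is a minimal numerical sketch (not part of the original slides) in Python/NumPy. It evaluates the logistic log-likelihood along the ray $\beta_0 = -47.5c$, $\beta_1 = c$, where 47.5 is simply the midpoint of the gap between $x = 45$ and $x = 50$; the log-likelihood increases monotonically toward its supremum of 0, so no finite maximizer exists.

```python
import numpy as np

# Example 2 data: y = 0 for x = 5,...,45 and y = 1 for x = 50,...,100.
x = np.arange(5, 105, 5).astype(float)
y = (x >= 50).astype(float)

def loglik(beta0, beta1):
    """Logistic log-likelihood: sum_i [y_i * eta_i - log(1 + exp(eta_i))]."""
    eta = beta0 + beta1 * x
    return np.sum(y * eta - np.logaddexp(0.0, eta))

# Walk out along the separating direction eta = c * (x - 47.5): the likelihood
# keeps improving as c grows, so there is no finite MLE for beta1.
for c in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(f"c = {c:4.1f}   log-likelihood = {loglik(-47.5 * c, c):.6f}")
```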

Firth's Penalized Likelihood Method

The MLE is the root obtained when the score function (the derivative of the log-likelihood) is set to zero. Firth's method removes the $O(n^{-1})$ term from the bias of the MLE by adding a correction term to the score function; equivalently, it maximizes the log-likelihood penalized by half the log-determinant of the Fisher information (the Jeffreys prior). The root obtained when this modified score function is set to zero is Firth's Penalized Likelihood Estimate (FirthE).

Logistic Regression

The log-likelihood has the form
$$\ell(\beta) = t^{\prime}\beta - K(\beta),$$
where $t$ is the observed sufficient statistic vector. The score function is therefore
$$U(\beta) = \ell^{\prime}(\beta) = t - K^{\prime}(\beta).$$
Firth's modified score function is
$$U^{*}(\beta_j) = U(\beta_j) + \tfrac{1}{2}\,\mathrm{trace}\!\left[ I(\beta)^{-1}\,\frac{\partial I(\beta)}{\partial \beta_j} \right],$$
where $I(\beta)$ is Fisher's information matrix. Firth's modification shrinks the estimate towards zero.
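To illustrate how the modified score is used in practice, the sketch below (not from the original slides) implements Firth-penalized logistic regression using the familiar hat-matrix form of the modified score, $U^{*}(\beta) = X^{\prime}\{y - \pi + h \odot (1/2 - \pi)\}$, where $h$ holds the diagonal of the hat matrix $H = W^{1/2} X (X^{\prime} W X)^{-1} X^{\prime} W^{1/2}$. It takes plain Newton steps using the unpenalized Fisher information and omits the step-halving and convergence safeguards a production implementation (for example the logistf package in R, or LogXact itself) would use.

```python
import numpy as np
from scipy.special import expit

def firth_logistic_fit(X, y, n_iter=100, tol=1e-8):
    """Firth-penalized logistic regression via Newton steps on the modified score."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        pi = expit(X @ beta)
        W = pi * (1.0 - pi)                           # IRLS weights
        XtWX = X.T @ (W[:, None] * X)                 # Fisher information I(beta)
        XtWX_inv = np.linalg.inv(XtWX)
        # Diagonal of the hat matrix H = W^1/2 X (X'WX)^-1 X' W^1/2.
        h = np.einsum("ij,jk,ik->i", X, XtWX_inv, X) * W
        score = X.T @ (y - pi + h * (0.5 - pi))       # Firth-modified score
        step = XtWX_inv @ score
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Example 2 data, for which the ordinary MLE of beta1 does not exist:
x = np.arange(5, 105, 5).astype(float)
y = (x >= 50).astype(float)
X = np.column_stack([np.ones_like(x), x])
print(firth_logistic_fit(X, y))   # finite FirthE estimates despite separation
```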

Boundary points of the sufficient statistic space

t_0   min t_1   max t_1
 0        0         0
 1        5       100
 2       15       195
 3       30       285
 4       50       370
 5       75       450
 6      105       525
 7      140       595
 8      180       660
 9      225       720
10      275       775
11      330       825
12      390       870
13      455       910
14      525       945
15      600       975
16      680      1000
17      765      1020
18      855      1035
19      950      1045
20     1050      1050

[Figure: boundary points in the space of sufficient statistics, t_1 versus t_0.]

There are 40 points on the boundary of the set of possible values of (t_0, t_1).
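The boundary table above is easy to reproduce directly: for each value $t_0 = k$, the smallest and largest attainable $t_1$ are the sums of the $k$ smallest and the $k$ largest covariate values. A short sketch (not from the slides):

```python
import numpy as np

x = np.sort(np.arange(5, 105, 5))       # covariate values 5, 10, ..., 100

# For t0 = k successes, t1 = sum of the x's attached to the k responses equal
# to 1, so its extremes are the sums of the k smallest and k largest x values.
for k in range(len(x) + 1):
    t1_min = x[:k].sum()
    t1_max = x[len(x) - k:].sum()
    print(f"t0 = {k:2d}   t1 in [{t1_min:4d}, {t1_max:4d}]")
```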

Comparison of MUE with FirthE when the MLE does not exist

We ran several numerical experiments with one-covariate models and a limited number with two-covariate models, using exhaustive enumeration of t-vectors as well as Monte Carlo simulations with sample sizes of 1000. We will illustrate with the Example 2 data.

Bias Comparison of MUE with FirthE for ED50 = 52.5 (based on complete enumeration)

MSE Comparison of MUE with FirthE for ED50 = 52.5 (based on complete enumeration)

Findings from numerical experiments

Our numerical experiments, several with one covariate and some with two covariates, suggest that in terms of both bias and mean square error Firth's method gives better estimates than the MUE when there is complete separation. Additional advantages of Firth's method are:
Unlike the MUE, it does not depend on the conditional distribution of the sufficient statistic, so it does not have problems associated with having few support points (e.g. with continuous covariates).
It is much faster to compute.

A real dataset

Two hundred rats were treated with a toxic substance at four dose levels; the binary response examined was development of an intestinal tumor. The covariates were the dose levels (as factor variables) and a binary survival variable to control for death. (Data from US Toxicology Program Tech Report 405, 1991; the LogXact manual gives details.) There was separation in this dataset. The output below is from the current beta version of LogXact, which provides Firth's method as an option.

LogXact Results

                     Point Estimate                95% Conf. Interval          2*1-sided
Model Term  Type     Beta      SE(Beta)   Type         Lower      Upper       P-Value
%Const      FirthE   -3.861    2.108      Asymptotic   -7.993     0.2713      0.0671
dose_0      FirthE   -2.873    1.937      Asymptotic   -6.67      0.9241      0.1381
            MUE      -1.053    NA         Exact        -INF       1.909       0.4824
dose_150    FirthE   -1.24     1.438      Asymptotic   -4.057     1.578       0.3886
            CMLE     -1.444    1.667      Exact        -6.437     2.471       0.9367
dose_300    FirthE   -2.733    1.656      Asymptotic   -5.978     0.5116      0.0988
            MUE      -1.677    NA         Exact        -INF       0.869       0.2068
survival    FirthE   0.09387   0.1402     Asymptotic   -0.1808    0.3686      0.5030
            CMLE     0.1246    0.174      Exact        -0.2128    0.5058      0.5345

Near Separation

The MLE is unstable: a small shift in the data leads to a huge change in the ML estimates of the coefficients. The data are those of Example 1, except that the x2 value of observation 18 (seq# 18, x1 = 18) is replaced by a value k that is varied.

[Figure: covariate plot of the data, x2 versus x1, with the x2 value of observation 18 marked as k.]

MLE and Near Separation: Example 1 (contd.)

[Figure: ML estimates of beta1 and beta2 plotted against k, for k from 0 to 20.]
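A rough way to reproduce this behaviour numerically (not from the original slides): refit the ordinary MLE for the Example 1 data while the x2 value of observation 18 is moved through a few values of k below the separation threshold. The `mle` helper and the particular k values are illustrative choices; at k = 18 and above the data exhibit (quasi-)separation, as in the original Example 1, and the MLE ceases to exist.

```python
import numpy as np
from scipy.optimize import minimize

# Example 1 data; index 17 (seq# 18) carries the x2 value that is varied as k.
x1 = np.array([10, 11, 12, 13, 14, 15, 16, 17, 19,
               10, 11, 12, 13, 14, 15, 16, 17, 18, 18, 19], float)
x2 = np.array([10, 11, 12, 13, 14, 15, 16, 17, 19,
               16, 17, 18, 19, 20, 21, 22, 23,  0, 24, 25], float)
y = np.array([1] * 9 + [0] * 11, float)

def mle(k):
    """Ordinary (unpenalized) ML fit with observation 18's x2 set to k."""
    x2k = x2.copy()
    x2k[17] = k
    X = np.column_stack([np.ones_like(x1), x1, x2k])
    negll = lambda b: -np.sum(y * (X @ b) - np.logaddexp(0.0, X @ b))
    return minimize(negll, np.zeros(3), method="BFGS").x

for k in (14, 15, 16, 17):          # approaching the separation threshold k = 18
    print(k, np.round(mle(k), 2))
```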

Interior Points Grouped into Layers by Closeness to the Boundary

[Figure: interior points of the (t_0, t_1) space, with layers 1, 5, 10, 20, 40 and 50 marked by their distance from the boundary.]

Bias Comparison of MLE to FirthE, ED50 = 52.5 (based on complete enumeration)

Bias Comparison of MLE to FirthE, ED50 = 5 (based on complete enumeration)

Bias Comparison of MLE to FirthE, ED50 = 100 (based on complete enumeration)

Significant Models (p-value < 0.05): Bias Comparison of MLE to FirthE, ED50 = 52.5 (based on complete enumeration)

MSE Comparison of MLE to FirthE, ED50 = 52.5 (based on complete enumeration)

MSE Comparison of MLE to FirthE, ED50 = 5 (based on complete enumeration)

MSE Comparison of MLE to FirthE, ED50 = 100 (based on complete enumeration)

Significant Models (p-value < 0.05): MSE Comparison of MLE to FirthE, ED50 = 52.5 (based on complete enumeration)

Conclusions from Experiments

Our numerical experiments and simulations suggest that FirthE reduces bias as well as mean square error in comparison to the MLE when the maximum slope of the logistic curve is not very high. However, when the maximum slope is high, the FirthE correction for bias produces excessive shrinkage and the MLE is superior. In many data sets that arise in practice we do not expect large changes in response for small changes in the covariate values, so FirthE will be superior. We conjecture that this conclusion will also hold when we compare conditional MLE and conditional FirthE.

Detecting Near Separation in Data Sets

We have a research project to create an index that signals near separation in data sets, to alert LogXact users to the bias in the MLE. Please let us know if you have datasets you can share that seem to exhibit near separation. Experiments suggest that we can use confidence intervals based on the Firth profile likelihood to detect near separation: the ratio of the upper CI width to the lower CI width appears to have promise as an index of near separation.

Example 2: Simple Logistic Regression (contd.)

[Figure: the set of possible sufficient statistic vectors $(t_0, t_1)$, with $t_0 = \sum_i Y_i$ and $t_1 = \sum_i x_i Y_i$, repeated from the earlier Example 2 slide.]

Interior Points Grouped into Layers by Closeness to the Boundary

Ratios were calculated for each interior point.

Ratio of Firth Profile Likelihood 95% CI Widths

Ratio = upper CI width / lower CI width

[Figure: the ratio plotted against the number of layers from the boundary, with a fitted polynomial curve.]
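As a sketch of how such an index might be computed (this is a reconstruction, not code from the slides), the following profiles Firth's penalized log-likelihood $\ell(\beta) + \tfrac{1}{2}\log\det I(\beta)$ over the intercept on a grid of $\beta_1$ values for the Example 2 data, inverts the likelihood-ratio statistic to get a 95% profile-likelihood CI for $\beta_1$, and reports the upper-to-lower CI-width ratio. The covariate is centred (which only reparameterizes the intercept), and the grid and intercept bounds are assumptions chosen to cover this example.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import expit
from scipy.stats import chi2

# Example 2 data; centring x only reparameterizes the intercept, so the slope
# beta1 and its profile-likelihood CI are unchanged.
x = np.arange(5, 105, 5).astype(float)
y = (x >= 50).astype(float)
xc = x - x.mean()
X = np.column_stack([np.ones_like(xc), xc])

def penalized_loglik(b0, b1):
    """Firth's penalized log-likelihood l(beta) + 0.5 * log det I(beta)."""
    eta = b0 + b1 * xc
    ll = np.sum(y * eta - np.logaddexp(0.0, eta))
    W = expit(eta) * (1.0 - expit(eta))
    _, logdet = np.linalg.slogdet(X.T @ (W[:, None] * X))
    return ll + 0.5 * logdet

def profile(b1):
    """Maximize the penalized log-likelihood over the intercept for fixed beta1."""
    res = minimize_scalar(lambda b0: -penalized_loglik(b0, b1),
                          bounds=(-25.0, 25.0), method="bounded")
    return -res.fun

# Grid-invert the likelihood-ratio statistic for a 95% profile-likelihood CI.
grid = np.linspace(0.0, 2.0, 801)          # assumed wide enough for this example
prof = np.array([profile(b1) for b1 in grid])
b1_hat = grid[np.argmax(prof)]
cutoff = prof.max() - 0.5 * chi2.ppf(0.95, df=1)
inside = grid[prof >= cutoff]
lower, upper = inside.min(), inside.max()

print(f"FirthE of beta1 (grid): {b1_hat:.3f}   95% profile CI: ({lower:.3f}, {upper:.3f})")
print(f"Upper/lower CI-width ratio: {(upper - b1_hat) / (b1_hat - lower):.2f}")
```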

Thank you!
nitin@cytel.com