Penalized likelihood logistic regression with rare events


Georg Heinze 1, Angelika Geroldinger 1, Rainer Puhr 2, Mariana Nold 3, Lara Lusa 4
1 Medical University of Vienna, CeMSIIS, Section for Clinical Biometrics, Austria
2 University of New South Wales, The Kirby Institute, Australia
3 Universitätsklinikum Jena, Institute for Medical Statistics, Computer Sciences and Documentation, Germany
4 University of Ljubljana, Institute for Biostatistics and Medical Informatics, Slovenia
Funded by the Austrian Science Fund (FWF) and the Slovenian Research Agency (ARRS)

Rare events: examples
Medicine: side effects of treatment (1/1000s to fairly common); hospital-acquired infections (9.8/1000 pd); epidemiologic studies of rare diseases (1/1000 to 1/200,000)
Engineering: rare failures of systems (0.1–1/year)
Economy: e-commerce click rates (1–2/1000 impressions)
Political science: wars, election surprises, vetoes (1/dozens to 1/1000s)
13.11.2015 Georg Heinze 2

Problems with rare events
Big studies are needed to observe enough events, and it is difficult to attribute events to risk factors: the absolute number of events is low, and so is the event rate.

Our interest
Models for the prediction of binary outcomes that are also interpretable, i.e., whose betas have a meaning (explanatory models).

Logistic regression
Pr(Y=1 | x) = 1 / (1 + exp(−x'β))
This leads to the odds-ratio interpretation of exp(β_j):
exp(β_j) = odds(Y=1 | x_j = t+1) / odds(Y=1 | x_j = t)
Likelihood: L(β) = ∏_i π_i^{y_i} (1 − π_i)^{1 − y_i}, with π_i = Pr(Y_i=1 | x_i)
Its n-th root is the geometric-mean probability of a correct prediction.
How well can we estimate β if events (Y=1) are rare?
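The quantities on this slide can be sketched in a few lines of numpy. This is an illustration with made-up numbers, not code from the talk: it evaluates the logistic model, the n-th root of the likelihood (the geometric-mean probability of a correct prediction), and the odds-ratio reading of exp(β).

```python
import numpy as np

# Illustrative sketch (all numbers made up): logistic model probabilities,
# the n-th root of the likelihood, and the odds-ratio interpretation.
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
beta0, beta1 = -3.0, 1.0        # a strongly negative intercept makes events rare
pi = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
y = rng.binomial(1, pi)

# log-likelihood of the true parameters, and its n-th root
loglik = np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))
geo_mean_correct = np.exp(loglik / n)          # geometric-mean Pr(correct prediction)

# increasing x by 1 multiplies the odds by exp(beta1)
odds = lambda p: p / (1 - p)
p_at = lambda v: 1.0 / (1.0 + np.exp(-(beta0 + beta1 * v)))
or_check = odds(p_at(1.0)) / odds(p_at(0.0))   # equals exp(beta1)
```

The ratio `or_check` matches exp(β₁) exactly, because the odds at x = v are exp(β₀ + β₁v) in closed form.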

Rare event problems
Logistic regression with 5 variables: the estimates are unstable (large MSE) because of the few events; removing some of the non-events does not affect precision, so extra non-events bring not much gain!

Penalized likelihood regression
log L*(β) = log L(β) + P(β)
imposes a prior on the model coefficients, e.g.
P(β) = −λ Σ_j β_j² (ridge: normal prior)
P(β) = −λ Σ_j |β_j| (LASSO: double-exponential prior)
P(β) = ½ log det I(β) (Firth-type: Jeffreys prior)
in order to
avoid extreme estimates and stabilize the variance (ridge),
perform variable selection (LASSO),
correct the small-sample bias in β (Firth-type).

Firth-type penalization
In exponential family models with canonical parametrization, the Firth-type penalized likelihood is given by
L*(β) = L(β) det I(β)^{1/2},
where I(β) is the Fisher information matrix; det I(β)^{1/2} is the Jeffreys invariant prior. This removes the first-order bias of the ML estimates.
Software: logistic regression: R (logistf, brglm, pmlr), SAS, Stata; Cox regression: R (coxphf), SAS
Firth, 1993; Heinze and Schemper, 2002; Heinze, 2006; Heinze, Beyea and Ploner, 2013

Firth-type penalization
We are interested in logistic regression. Here the penalized log-likelihood is given by
log L*(β) = log L(β) + ½ log det(X'WX)
with W = diag( expit(x_i'β) (1 − expit(x_i'β)) ).
The penalty term is maximised at β = 0, i.e., the ML estimates are shrunken towards zero. For a 2×2 table (logistic regression with one binary regressor), Firth's bias correction amounts to adding 1/2 to each cell.
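The penalized log-likelihood above is easy to evaluate directly. The following numpy sketch (an assumption for illustration, not the authors' code) computes log L*(β) and shows that the Jeffreys penalty term alone is largest at β = 0, which is why the estimates are shrunken towards zero: each weight p(1 − p) peaks at p = 1/2.

```python
import numpy as np

# Sketch (assumption): Firth-type penalized log-likelihood for logistic
# regression, log L*(beta) = log L(beta) + 0.5 * log det(X' W X),
# with W = diag(p_i (1 - p_i)) and p_i = expit(x_i' beta).
def jeffreys_penalty(beta, X):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    info = X.T @ (X * (p * (1 - p))[:, None])   # Fisher information X'WX
    return 0.5 * np.linalg.slogdet(info)[1]

def firth_penalized_loglik(beta, X, y):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return loglik + jeffreys_penalty(beta, X)

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])  # intercept + 1 covariate
y = rng.binomial(1, 0.1, size=200)                         # rare events

# the penalty alone is maximised at beta = 0
pen_at_0 = jeffreys_penalty(np.zeros(2), X)
pen_away = jeffreys_penalty(np.array([1.0, 1.0]), X)
pll = firth_penalized_loglik(np.zeros(2), X, y)
```

The inequality `pen_at_0 > pen_away` holds because every weight p_i(1 − p_i) is maximal at β = 0, which makes X'WX (and hence its determinant) largest there.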

Separation
(Complete) separation: a combination of the explanatory variables (nearly) perfectly predicts the outcome. It is frequently encountered with small samples and leads to a monotone likelihood: some of the ML estimates are infinite, but the Firth estimates do exist!
Example (original counts, and after adding 1/2 to each cell):
complete separation:
      A     B               A      B
Y=0   0    10             0.5   10.5
Y=1  10     0            10.5    0.5
quasi-complete separation:
      A     B               A      B
Y=0   0     7             0.5    7.5
Y=1  10     3            10.5    3.5
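The 2×2 shortcut can be verified directly. This sketch (an assumption for illustration) computes the log odds ratio for the complete-separation table: the ML estimate is infinite, while adding 1/2 to each cell, the Firth correction in this special case, keeps it finite.

```python
import numpy as np

# Sketch (assumption): with one binary regressor, Firth's correction is
# equivalent to adding 1/2 to each cell of the 2x2 table, so the log odds
# ratio stays finite even under complete separation.
def log_odds_ratio(table, add=0.0):
    # table rows: Y=0, Y=1; columns: groups A, B
    t = np.asarray(table, dtype=float) + add
    return np.log((t[1, 0] * t[0, 1]) / (t[1, 1] * t[0, 0]))

complete = [[0, 10], [10, 0]]                   # the complete-separation example
with np.errstate(divide="ignore"):
    ml_est = log_odds_ratio(complete)           # ML estimate: infinite
firth_est = log_odds_ratio(complete, add=0.5)   # log(10.5^2 / 0.5^2): finite
```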

Impact of Firth correction on predictions
Example from Greenland & Mansournia, 2015:
no separation (event rate 0.56%):     quasi-complete separation (event rate 0.53%):
      A    B                                A    B
Y=0   9  2966                         Y=0  10  2966
Y=1   1    16                         Y=1   0    16
ML predictions of Pr(Y=1):
A: 10%, B: 0.53%; mean: 0.56%         A: 0%, B: 0.53%; mean: 0.53%
Firth predictions of Pr(Y=1):
A: 13.6%, B: 0.55%; mean: 0.59%       A: 4.5%, B: 0.55%; mean: 0.56%

Example: bias in logistic regression
Consider a model containing only an intercept, no regressors: logit Pr(Y=1) = β₀.
With n observations and k events, the ML estimator of β₀ is given by β̂₀ = logit(k/n).
By Jensen's inequality, since k/n is unbiased for π = Pr(Y=1), β̂₀ = logit(k/n) is biased for β₀! (Conversely, if β̂₀ were unbiased, then expit(β̂₀) would be biased.)
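The Jensen's-inequality argument can be checked exactly rather than by simulation. This stdlib-only sketch (illustrative numbers, an assumption, not from the slides) averages logit(k/n) over the binomial distribution of k, restricted to 0 < k < n where the estimate is finite, and compares it with logit(π).

```python
import math

# Sketch (assumption, numbers made up): exact expectation of the ML intercept
# estimate logit(k/n) under k ~ Binomial(n, pi), restricted to 0 < k < n.
def logit(p):
    return math.log(p / (1 - p))

n, pi = 50, 0.1
pmf = [math.comb(n, k) * pi**k * (1 - pi) ** (n - k) for k in range(n + 1)]
inner = range(1, n)                        # k = 0 and k = n give infinite estimates
mass = sum(pmf[k] for k in inner)
e_logit = sum(pmf[k] * logit(k / n) for k in inner) / mass

true_logit = logit(pi)
# logit is concave for p < 1/2, so E[logit(k/n)] < logit(pi):
# the intercept estimate is on average more extreme (further from 0) than beta0.
bias = e_logit - true_logit
```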

Penalized logistic regression: ridge
The penalized log-likelihood is given by
log L*(β) = log L(β) − λ Σ_j β_j²
where λ is an unknown tuning parameter and the β_j's should be suitably standardized.
Usually, the X's are standardized to unit variance before application, and λ is tuned by cross-validation. After application, β can be back-transformed to the original scale.
See, e.g., Friedman et al, 2010.
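A minimal ridge fit following this recipe can be written as a Newton-Raphson loop. This is a sketch under stated assumptions (fixed λ instead of cross-validation, unpenalized intercept, made-up data), not the authors' implementation.

```python
import numpy as np

# Sketch (assumption): ridge-penalized logistic regression by Newton-Raphson.
# Covariates are standardized to unit variance, the intercept is unpenalized,
# and the slopes are back-transformed to the original scale at the end.
def ridge_logistic(X, y, lam, n_iter=50):
    m, s = X.mean(axis=0), X.std(axis=0)
    Z = np.column_stack([np.ones(len(y)), (X - m) / s])
    pen = 2.0 * lam * np.ones(Z.shape[1])
    pen[0] = 0.0                                   # do not penalize the intercept
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        grad = Z.T @ (y - p) - pen * beta          # d/dbeta of log L - lam*sum(beta_j^2)
        hess = Z.T @ (Z * (p * (1 - p))[:, None]) + np.diag(pen)
        beta = beta + np.linalg.solve(hess, grad)
    slopes = beta[1:] / s                          # back-transform to original scale
    return beta[0] - slopes @ m, slopes

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))
true_beta = np.array([0.5, 0.5, 0.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-2.0 + X @ true_beta))))
_, b_small = ridge_logistic(X, y, lam=0.1)     # mild penalty: close to ML
_, b_big = ridge_logistic(X, y, lam=1000.0)    # heavy penalty: strong shrinkage
```

Increasing λ shrinks the coefficient vector towards zero, which is the usual reading of the ridge penalty; the next slides question whether this bias is always towards the null.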

Known and unknown features of ridge
It reduces the RMSE of predictions.
It reduces the RMSE of the beta coefficients.
It introduces bias in the beta coefficients.
The bias is towards the null?

The "we won't let you down" effect of ridge
Logistic regression with 15 correlated covariates, N=3000, marginal event rate 1%. True effects: 0 for X1, 0.25 for X2–X15. (Plot: the simulated beta estimates for X1–X15.)

For comparison: Firth
Logistic regression with 15 correlated covariates, N=3000, marginal event rate 1%. True effects: 0 for X1, 0.25 for X2–X15. (Plot: the simulated beta estimates for x1–x15.)

Recent criticisms of Firth for prediction
Elgmati et al (Lifetime Data Analysis 2015): upwards bias in predictions for rare events.
Greenland and Mansournia (Statistics in Medicine 2015): "Bayesian non-collapsibility", caused by including correlations in the Jeffreys prior.

Generalized Firth correction
Elgmati et al studied a generalized Firth correction:
log L*(β) = log L(β) + τ log det I(β)
with τ ∈ (0, 0.5]. (In the two-group case, this corresponds to adding τ to each cell.)
They derived formulae for the bias and MSE of the predicted probabilities in the two-group case, and evaluated the impact of τ.
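In the two-group case the bias and MSE of the corrected predictions can be computed exactly from the binomial distribution. The following stdlib sketch (an assumption with illustrative numbers in the range used by Elgmati et al) compares the full Firth correction (τ = 0.5) with a weak correction (τ = 0.1) for one group.

```python
import math

# Sketch (assumption): in the two-group case, the generalized Firth correction
# with parameter tau estimates a group's event probability as
#   (k + tau) / (n + 2*tau),  k ~ Binomial(n, pi).
# Exact bias and MSE follow by summing over the binomial pmf.
def bias_and_mse(n, pi, tau):
    bias = mse = 0.0
    for k in range(n + 1):
        pmf = math.comb(n, k) * pi**k * (1 - pi) ** (n - k)
        est = (k + tau) / (n + 2 * tau)
        bias += pmf * (est - pi)
        mse += pmf * (est - pi) ** 2
    return bias, mse

# Firth (tau = 0.5) biases rare-event predictions upwards;
# a weak correction (tau = 0.1) trades most of that bias for a lower MSE.
b_firth, m_firth = bias_and_mse(50, 0.03, 0.5)
b_weak, m_weak = bias_and_mse(50, 0.03, 0.1)
```

Closed form for the bias, τ(1 − 2π)/(n + 2τ), confirms that it is positive for π < 1/2 and grows with τ.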

Generalized Firth correction
From Elgmati et al, 2015: two-group case, n = 50, event probabilities 0.03 and 0.06. (Plot: bias and MSE of the predictions as functions of τ.)
They suggest a weak Firth correction with τ = 0.1 to minimise the MSE of the predictions.

Problems (of Bayesians) working with the Jeffreys prior
Greenland and Mansournia (Statistics in Medicine 2015):
The Jeffreys prior (= Firth correction) is data-dependent and includes the correlation between covariates. (This correlation is needed, as the MLE bias to be corrected also involves the correlations.)
The marginal prior for a given coefficient can change in opaque ways as model covariates are added or deleted.
It may give surprising results in sparse data sets.
It is not clear in general how the penalty translates into prior probabilities for odds ratios.

Bayesian non-collapsibility
In their data example, G&M show that the Firth-corrected estimate is further away from 0 than the ML estimate:
       X=1   X=0
Y=0      9  2966
Y=1      1    16
ML estimate of β: 3.03 (0.08, 4.8)
Firth estimate of β: 3.35 (1.07, 4.9)

The logF(1,1) prior
Greenland and Mansournia (SiM 2015) suggest the logF(1,1) prior, leading to the penalized likelihood representation
log L*(β) = log L(β) + Σ_j ( β_j/2 − log(1 + exp β_j) ).
They show that this prior coincides with the Jeffreys prior in a one-parameter model (e.g., matched-pairs case-control).
They strongly argue against imposing a prior on the intercept.
Implementation of the logF(1,1) prior is amazingly simple with standard software: just augment the original data by adding two pseudo-observations per variable, with a value of 1 for that variable and 0 for all other variables (including the constant). The pseudo-observations have y=0 and y=1, with weights 0.5 and 0.5, e.g.:
Constant  X  Y  Weight
0         1  0  0.5
0         1  1  0.5
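The augmentation trick can be sketched end to end on the G&M 2×2 example. This is an illustrative implementation under stated assumptions (a hand-rolled weighted Newton-Raphson fit, not the authors' code): the two pseudo-observations, with the constant set to 0, contribute exactly β/2 − log(1 + exp β) to the log-likelihood.

```python
import numpy as np

# Sketch (assumption): logF(1,1) prior via data augmentation, fitted by
# weighted ML. Columns of Z are (constant, X).
def weighted_ml_logistic(Z, y, w, n_iter=50):
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):                     # Newton-Raphson
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        grad = Z.T @ (w * (y - p))
        hess = Z.T @ (Z * (w * p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# the 2x2 example expanded into individual records
Z = np.array([[1.0, 1.0]] * 10 + [[1.0, 0.0]] * 2982)
y = np.array([1.0] + [0.0] * 9 + [1.0] * 16 + [0.0] * 2966)
w = np.ones(len(y))
beta_ml = weighted_ml_logistic(Z, y, w)         # beta_ml[1] close to 3.03

# augmentation: two pseudo-observations for X, constant = 0, weight 0.5 each
Z_aug = np.vstack([Z, [0.0, 1.0], [0.0, 1.0]])
y_aug = np.concatenate([y, [0.0, 1.0]])
w_aug = np.concatenate([w, [0.5, 0.5]])
beta_logF = weighted_ml_logistic(Z_aug, y_aug, w_aug)  # beta_logF[1] close to 2.41
```

The fitted slope reproduces the logF(1,1) estimate reported on the next slide, shrunken towards zero relative to ML.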

Example, again
       X=1   X=0
Y=0      9  2966
Y=1      1    16
logF(1,1) estimate of β: 2.41 (−0.64, 4.4)
ML estimate of β: 3.03 (0.08, 4.8)
Firth estimate of β: 3.35 (1.07, 4.9)

Solving the issue by simulation
By simulating 1000 2×2 tables with the expected cell frequencies (conditional on the marginals of X), we obtain:
Method     Bias    RMSE
ML         n.a.    n.a.
Firth      0.18    0.81
logF(1,1) −1.10    1.63
True value β = 3.03

Summary so far
The original Firth penalty yields good bias and MSE properties for the betas, but there is an upwards bias (towards 0.5) in the predictions.
The weak Firth penalty optimizes the MSE of the predictions, but induces bias in the betas.
logF is simple to implement and yields unbiased mean predictions, but possibly applies too much correction to the betas.
We would like to keep the good properties for the betas, but improve the performance for prediction.

We called it "flic"
Austrian flic ("Kibara"): unknown outside Austria. French flic: very popular in Austria.

FLIC: Firth Logistic regression with Intercept Correction
FLIC is the Firth model, but with the intercept adjusted to fulfill Σ_i π̂_i = Σ_i y_i, i.e., the mean predicted probability equals the observed event rate.
Two-stage estimation:
first fit the Firth-corrected model;
then hold the betas fixed, but re-estimate the intercept without penalization.
This corrects the bias in the mean prediction.
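The second stage is a one-dimensional problem and can be sketched directly. This illustration (an assumption: the slopes below are made-up stand-ins for Firth estimates, and the data are simulated) solves the unpenalized ML score equation for the intercept, which forces the mean prediction to equal the observed event rate.

```python
import numpy as np

# Sketch (assumption): FLIC's second stage. With the slopes held fixed, the
# intercept solves the ML score equation sum(y_i - p_i) = 0 by 1-D Newton.
def flic_intercept(X, y, slopes, n_iter=100):
    eta = X @ slopes
    a = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a + eta)))
        a = a + np.sum(y - p) / np.sum(p * (1 - p))   # Newton step
    return a

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-3.0 + X @ np.array([0.5, -0.5])))))
b = np.array([0.45, -0.45])        # stand-in for Firth-corrected slopes
a = flic_intercept(X, y, b)
p_hat = 1.0 / (1.0 + np.exp(-(a + X @ b)))
# after the correction, mean(p_hat) equals mean(y)
```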

Re-simulating the example of Elgmati
From Elgmati et al, 2015: two-group case, n = 50. (Plots of the predictions by method.)
At true probability 0.06, RMSE×100: Firth 3.43, weak Firth 3.49, FLIC 3.29.
At true probability 0.03, RMSE×100: Firth 2.47, weak Firth 2.51, FLIC 2.27.

Other approaches for rare events
Considering the uncertainty in the estimated coefficients, King and Zeng (2001) propose a correction of the estimated probabilities:
Pr(Y=1 | x) = ∫ Pr(Y=1 | x, β*) P(β*) dβ*,
where β̃ is a bias-corrected estimate of β, and P(β*) is the posterior distribution of β*. This can be approximated by
Pr(Y=1 | x) ≈ π̃ + (0.5 − π̃) π̃ (1 − π̃) x V(β̃) x',
where π̃ = expit(x'β̃) and V(β̃) is the variance matrix of β̃.

A simulation study
X matrix: N ∈ {500, 1400, 3000}; Y: event rates 1–10%.
Puhr, 2015; Binder et al, 2011

Scenario with N=3000, 1% event rate: bias of the beta estimates. (Plot.)

N=3000, 1% event rate: bias of the predictions. (Plot.)

N=3000, 1% event rate: bias of the predictions, rescaled by the expected standard error of the predictions. (Plot.)

Conclusions
Prediction and effect estimation:
Ridge models perform best for prediction (in RMSE, though not in bias), but should be seen as black boxes; "always on the safe side", they are sometimes overly pessimistic.
Among the less conservative methods, FLIC performed well, and it does not sacrifice the bias-preventive properties of Firth.

References
Binder H, Sauerbrei W, Royston P. Multivariable model-building with continuous covariates: performance measures and simulation design. Unpublished manuscript, 2011.
Elgmati E, Fiaccone RL, Henderson R, Matthews JNS. Penalised logistic regression and dynamic prediction for discrete-time recurrent event data. Lifetime Data Analysis 21:542–560, 2015.
Firth D. Bias reduction of maximum likelihood estimates. Biometrika 80:27–38, 1993.
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33:1, 2010.
Greenland S, Mansournia M. Penalization, bias reduction, and default priors in logistic and related categorical and survival regressions. Statistics in Medicine 34:3133–3143, 2015.
Heinze G, Schemper M. A solution to the problem of separation in logistic regression. Statistics in Medicine 21:2409–2419, 2002.
Heinze G. A comparative investigation of methods for logistic regression with separated or nearly separated data. Statistics in Medicine 25:4216–4226, 2006.
Heinze G, Ploner M, Beyea J. Confidence intervals after multiple imputation: combining profile likelihood information from logistic regressions. Statistics in Medicine 32:5062–5076, 2013.
King G, Zeng L. Logistic regression in rare events data. Political Analysis 9:137–163, 2001.
Puhr R. Vorhersage von seltenen Ereignissen mit pönalisierter logistischer Regression [Prediction of rare events with penalized logistic regression]. Master's thesis, University of Vienna, 2015.