Logistic Regression with the Nonnegative Garrote


Logistic Regression with the Nonnegative Garrote
Enes Makalic and Daniel F. Schmidt
Centre for MEGA Epidemiology, The University of Melbourne
24th Australasian Joint Conference on Artificial Intelligence, 2011

Outline
1 Introduction: Problem Description, Motivation
2 Non-negative Garrote
3

Problem Description (1)
We have a binary classification problem (n samples, p predictors):
Data y = (y_1, \ldots, y_n) with y_i \in \{-1, +1\}
Matrix of p covariate vectors X = (x_1, \ldots, x_p), x_j \in \mathbb{R}^n

y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad
X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}

We model these data with a logistic regression model.

Problem Description (2)
Logistic regression model for explaining the data y:

p(y = \pm 1 \mid x, \beta) = \frac{1}{1 + \exp(-y\, x^\top \beta)}

where \beta \in \mathbb{R}^p is the vector of logistic regression coefficients. Log-likelihood for n data points:

l(\beta) = -\sum_{i=1}^{n} \log\bigl(1 + \exp(-y_i\, x_i^\top \beta)\bigr)

Task: estimate the parameters and select the significant regressors.
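
As a minimal sketch (not from the slides), the log-likelihood under the ±1 coding can be computed as follows, assuming X is an n x p NumPy array with rows x_i:

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Logistic log-likelihood with labels y_i in {-1, +1}."""
    margins = y * (X @ beta)                 # y_i * x_i' beta
    # log(1 + exp(-m)) evaluated stably via logaddexp(0, -m)
    return -np.sum(np.logaddexp(0.0, -margins))
```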

Motivation
Maximum likelihood and stepwise regression suffer from many problems. Ideally, we want a method that:
- consistently selects the true predictors
- automatically shrinks parameters
- selects important variables
- can be applied when p >> n
- has the oracle property (asymptotically)


Non-negative Garrote (1)
Requires an initial parameter estimate \tilde{\beta}, obtained for example by maximum likelihood, ridge regression, etc. The non-negative garrote (NNG) estimate is

\hat{\beta}_{NG} = \arg\max \; l(\beta_1, \ldots, \beta_p)
\quad \text{s.t.} \quad c_j \geq 0, \;\; \sum_{j=1}^{p} c_j \leq t,

where \beta_j = c_j \tilde{\beta}_j for j = 1, \ldots, p, and the maximisation is over the scale factors c = (c_1, \ldots, c_p).
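
The constrained problem can also be handed to a generic solver. The sketch below (an illustration under assumed names, not the authors' NNG_OPT algorithm) optimises the Lagrangian form of the NNG criterion over nonnegative scale factors c, given an initial estimate beta_init and a penalty weight lam:

```python
import numpy as np
from scipy.optimize import minimize

def nng_generic(X, y, beta_init, lam):
    """Logistic nonnegative garrote via a box-constrained quasi-Newton solver.

    Maximises l(c_1 beta_init_1, ..., c_p beta_init_p) - lam * sum(c)
    subject to c_j >= 0 (the Lagrangian form of the sum(c) <= t constraint).
    """
    Xs = X * beta_init                        # absorb the initial estimate into the design
    def neg_objective(c):
        margins = y * (Xs @ c)
        return np.sum(np.logaddexp(0.0, -margins)) + lam * np.sum(c)
    c0 = np.ones(X.shape[1])                  # start from c = (1, ..., 1)
    res = minimize(neg_objective, c0, method="L-BFGS-B",
                   bounds=[(0.0, None)] * X.shape[1])
    return res.x * beta_init                  # final coefficients beta_j = c_j * beta_init_j
```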

Non-negative Garrote (2)
Properties:
- Consistent in terms of both parameter estimation and variable selection (linear regression)
- This remains true even if \tilde{\beta} is inconsistent (with caveats)
- Oracle property: it performs as well as if the true underlying model were given in advance

Non-negative Garrote (3)
How do we choose the initial estimate \tilde{\beta}?
- Breiman originally advocated maximum likelihood, which has disadvantages
Solving the optimisation problem:
- For linear regression it is a constrained least-squares problem
- Standard convex programming solutions are not feasible for large p
- Our algorithm, NNG_OPT, is based on cyclic coordinate descent

Non-negative Garrote (4)
input : data matrix X ∈ R^(n×p), target vector y ∈ {−1,+1}^n, initial estimate β̃ ∈ R^p, regularization parameter λ > 0
output: NNG estimate β ∈ R^p
1  initialize Δ_j ← 1 for j = 1, ..., p and r_i ← 0 for i = 1, ..., n
2  r ← y ∘ (X β̃)   (∘ denotes the element-wise product)
3  x_i ← x_i ∘ β̃ for i = 1, ..., n   (rescale the data)
4  β ← (1, ..., 1)   (start the search from β̃)

Non-negative Garrote (5)
1   for t ← 1, 2, ... until convergence do
2     for j ← 1, 2, ..., p do
3       F_i ← min(0.25, 1 / (2 + exp(r_i − Δ_j x_ij) + exp(Δ_j x_ij − r_i)))   (i = 1, ..., n)
4       v_j ← ( ∑_{i=1}^n x_ij y_i / (1 + exp(r_i)) − λ ) / ( ∑_{i=1}^n x_ij^2 F_i )   (Newton-Raphson update)
5       if β_j = 0 then
6         if v_j ≤ 0 then
7           v_j ← 0
8         end
9       else
10        if β_j + v_j < 0 then
11          v_j ← −β_j   (a sign change sets β_j to zero)
12        end
13      end
14      Δβ_j ← min(max(v_j, −Δ_j), Δ_j)   (limit the step size to the trust region)
15      Δr_i ← Δβ_j x_ij y_i,  r_i ← r_i + Δr_i   (i = 1, ..., n)
16      β_j ← β_j + Δβ_j
17      Δ_j ← max(2|Δβ_j|, Δ_j / 2)   (update the trust-region size)
18    end
19  end
20  β ← β ∘ β̃   (return to the original scale)
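
The following Python sketch is one reading of the NNG_OPT pseudocode above; the curvature bound F_i is reconstructed from the slide, so treat the details as an assumption rather than the authors' exact implementation:

```python
import numpy as np

def nng_opt(X, y, beta_init, lam, n_iter=100):
    """Cyclic coordinate descent for the logistic nonnegative garrote (sketch).

    X: (n, p) design matrix; y: length-n vector in {-1, +1};
    beta_init: initial estimate (e.g. ridge); lam: regularisation parameter.
    """
    n, p = X.shape
    Xs = X * beta_init                       # rescale data: x_ij <- x_ij * beta_init_j
    beta = np.ones(p)                        # garrote factors, search starts from beta_init
    delta = np.ones(p)                       # per-coordinate trust-region sizes
    r = y * (Xs @ beta)                      # margins r_i

    for _ in range(n_iter):
        for j in range(p):
            xj = Xs[:, j]
            # curvature bound within the trust region (reconstructed from the slide)
            F = np.minimum(0.25, 1.0 / (2.0 + np.exp(r - delta[j] * xj)
                                            + np.exp(delta[j] * xj - r)))
            denom = np.sum(xj ** 2 * F)
            if denom <= 0.0:                 # column eliminated by a zero initial estimate
                continue
            # Newton-Raphson style step for coordinate j with penalty lam
            v = (np.sum(xj * y / (1.0 + np.exp(r))) - lam) / denom
            if beta[j] == 0.0:
                if v <= 0.0:
                    v = 0.0                  # coefficient stays at zero
            elif beta[j] + v < 0.0:
                v = -beta[j]                 # a sign change sets the coefficient to zero
            d_beta = np.clip(v, -delta[j], delta[j])       # limit step to the trust region
            r = r + d_beta * xj * y          # update margins
            beta[j] += d_beta
            delta[j] = max(2.0 * abs(d_beta), delta[j] / 2.0)   # adapt the trust region
    return beta * beta_init                  # return to the original scale
```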


Summary
Stepwise forward selection:
- Models generalize poorly
- Poor predictive performance
Nonnegative garrote:
- Ridge regression is recommended for the initial estimate (see the sketch below)
- Excellent performance in comparison to the LASSO
- Performed well for (highly) sparse models
- Somewhat worse performance for dense models
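
As an illustration of that recommendation (hypothetical glue code, not the authors' implementation), one could obtain the initial estimate from an L2-penalised (ridge) logistic regression and then apply the nng_opt sketch given earlier, assuming a data set (X, y) with y coded in {-1, +1}:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Ridge (L2-penalised) logistic regression supplies the initial estimate,
# which the nonnegative garrote then shrinks and thresholds.
ridge = LogisticRegression(penalty="l2", C=1.0, fit_intercept=False).fit(X, y)
beta_init = ridge.coef_.ravel()
beta_nng = nng_opt(X, y, beta_init, lam=1.0)        # lam chosen e.g. by cross-validation
selected = np.flatnonzero(np.abs(beta_nng) > 1e-8)  # indices of the retained regressors
```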

Simulations
In all simulations:
- Sample size n ∈ {20, 50, 100}
- Regressor correlation corr(x_i, x_j) = 0.5^|i − j|
- Example 1: β = (3, 2, 1.5, 0, 0, 0, 0, 0)
- Example 2: β_j = 0.85 for all j
- Example 3: β = (5, 0.5, 0.5, 0.5, 0, 0, 0, 0)
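
A sketch of one way to generate such a data set (Gaussian regressors are an assumption; the slide does not state the regressor distribution):

```python
import numpy as np

def simulate(n, beta, rho=0.5, seed=None):
    """Draw X with corr(x_i, x_j) = rho**|i - j| and labels y in {-1, +1} from a logistic model."""
    rng = np.random.default_rng(seed)
    p = len(beta)
    cov = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))          # P(y = +1 | x)
    y = np.where(rng.random(n) < prob, 1.0, -1.0)
    return X, y

# Example 1 with n = 50: beta = (3, 2, 1.5, 0, 0, 0, 0, 0)
X, y = simulate(50, np.array([3.0, 2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0]), seed=0)
```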

Example 1: median negative log-likelihood (NLL), mean model size (Size), mean number of false positive regressors (FP) and mean number of false negative regressors (FN) in the selected model. Results are based on 1000 iterations; standard errors are in parentheses.

                          n = 20                                                n = 50
Method   NLL           Size         FP           FN            NLL          Size         FP           FN
fwd      30.70 (0.80)  1.24 (0.04)  1.96 (0.03)  0.20 (0.02)   7.12 (0.07)  2.79 (0.04)  0.63 (0.03)  0.42 (0.03)
gfwd     26.64 (0.41)  1.18 (0.04)  1.97 (0.03)  0.15 (0.02)   6.74 (0.05)  2.75 (0.04)  0.64 (0.02)  0.38 (0.03)
lasso    21.13 (0.15)  4.53 (0.05)  0.50 (0.02)  2.03 (0.04)   6.50 (0.04)  5.78 (0.04)  0.05 (0.01)  2.83 (0.04)
glasso   22.40 (0.18)  2.89 (0.04)  0.93 (0.03)  0.82 (0.03)   6.44 (0.05)  3.97 (0.04)  0.23 (0.01)  1.20 (0.04)
rr       21.13 (0.14)  8.00 (0.00)  0.00 (0.00)  5.00 (0.00)   6.76 (0.03)  8.00 (0.00)  0.00 (0.00)  5.00 (0.00)
grr      21.86 (0.21)  3.37 (0.05)  0.77 (0.02)  1.14 (0.03)   6.45 (0.03)  4.32 (0.04)  0.17 (0.01)  1.49 (0.04)
enet     20.64 (0.12)  6.10 (0.05)  0.21 (0.01)  3.31 (0.05)   6.50 (0.03)  6.40 (0.04)  0.02 (0.00)  3.42 (0.04)
genet    22.20 (0.22)  3.07 (0.04)  0.85 (0.02)  0.92 (0.03)   6.42 (0.04)  4.06 (0.04)  0.20 (0.01)  1.26 (0.04)
ilasso   22.41 (0.19)  2.90 (0.04)  0.93 (0.02)  0.83 (0.03)   6.42 (0.05)  3.98 (0.04)  0.22 (0.01)  1.20 (0.04)
nng      21.34 (0.25)  3.50 (0.04)  0.66 (0.02)  1.16 (0.03)   6.34 (0.03)  4.35 (0.04)  0.11 (0.01)  1.46 (0.04)

Example 2: median negative log-likelihood (NLL), mean model size (Size), mean number of false positive regressors (FP) and mean number of false negative regressors (FN) in the selected model. Results are based on 1000 iterations; standard errors are in parentheses.

                          n = 20                                                n = 50
Method   NLL           Size         FP           FN            NLL           Size         FP           FN
fwd      35.10 (0.08)  1.17 (0.05)  6.83 (0.05)  0.00 (0.00)   10.20 (0.09)  4.27 (0.07)  3.73 (0.07)  0.00 (0.00)
gfwd     34.66 (0.03)  1.11 (0.04)  6.89 (0.04)  0.00 (0.00)    9.19 (0.08)  4.05 (0.07)  3.94 (0.07)  0.00 (0.00)
lasso    24.83 (0.19)  4.95 (0.05)  3.06 (0.05)  0.00 (0.00)    7.76 (0.04)  6.94 (0.03)  1.06 (0.03)  0.00 (0.00)
glasso   28.24 (0.19)  3.14 (0.05)  4.86 (0.05)  0.00 (0.00)    8.44 (0.05)  5.66 (0.05)  2.34 (0.05)  0.00 (0.00)
rr       21.23 (0.11)  8.00 (0.00)  0.00 (0.00)  0.00 (0.00)    7.18 (0.03)  8.00 (0.00)  0.00 (0.00)  0.00 (0.00)
grr      26.97 (0.19)  3.80 (0.05)  4.20 (0.05)  0.00 (0.00)    8.23 (0.04)  6.10 (0.04)  1.90 (0.04)  0.00 (0.00)
enet     21.59 (0.12)  7.40 (0.04)  0.60 (0.04)  0.00 (0.00)    7.24 (0.02)  7.88 (0.01)  0.12 (0.01)  0.00 (0.00)
genet    27.20 (0.22)  3.65 (0.05)  4.35 (0.05)  0.00 (0.00)    8.24 (0.04)  6.06 (0.04)  1.94 (0.04)  0.00 (0.00)
ilasso   28.23 (0.21)  3.15 (0.05)  4.85 (0.05)  0.00 (0.00)    8.44 (0.05)  5.67 (0.05)  2.33 (0.05)  0.00 (0.00)
nng      26.49 (0.23)  4.08 (0.05)  3.92 (0.05)  0.00 (0.00)    8.05 (0.04)  6.33 (0.04)  1.67 (0.04)  0.00 (0.00)

Example 3: median negative log-likelihood (NLL), mean model size (Size), mean number of false positive regressors (FP) and mean number of false negative regressors (FN) in the selected model. Results are based on 1000 iterations; standard errors are in parentheses.

                          n = 20                                                n = 50
Method   NLL           Size         FP           FN            NLL          Size         FP           FN
fwd      19.42 (0.78)  1.17 (0.04)  2.98 (0.03)  0.15 (0.02)   5.57 (0.03)  1.68 (0.04)  2.54 (0.02)  0.22 (0.02)
gfwd     16.55 (0.35)  0.99 (0.03)  3.09 (0.02)  0.08 (0.01)   5.40 (0.02)  1.63 (0.04)  2.57 (0.02)  0.19 (0.02)
lasso    16.70 (0.14)  4.08 (0.05)  1.49 (0.03)  1.57 (0.03)   5.45 (0.03)  5.08 (0.05)  0.98 (0.02)  2.06 (0.04)
glasso   15.63 (0.15)  2.13 (0.03)  2.35 (0.02)  0.48 (0.02)   5.35 (0.02)  2.86 (0.05)  1.90 (0.03)  0.76 (0.03)
rr       20.35 (0.18)  8.00 (0.00)  0.00 (0.00)  4.00 (0.00)   6.02 (0.03)  8.00 (0.00)  0.00 (0.00)  4.00 (0.00)
grr      15.74 (0.11)  2.59 (0.04)  2.12 (0.03)  0.71 (0.02)   5.36 (0.02)  3.21 (0.05)  1.76 (0.03)  0.97 (0.03)
enet     16.70 (0.14)  4.84 (0.06)  1.19 (0.03)  2.03 (0.04)   5.49 (0.03)  5.50 (0.05)  0.83 (0.02)  2.34 (0.04)
genet    15.64 (0.12)  2.21 (0.04)  2.31 (0.02)  0.52 (0.02)   5.34 (0.02)  2.94 (0.05)  1.86 (0.03)  0.80 (0.03)
ilasso   15.63 (0.15)  2.13 (0.04)  2.35 (0.02)  0.48 (0.02)   5.35 (0.02)  2.87 (0.05)  1.89 (0.03)  0.76 (0.03)
nng      15.83 (0.11)  2.76 (0.04)  1.99 (0.03)  0.75 (0.03)   5.30 (0.02)  3.46 (0.05)  1.49 (0.03)  0.95 (0.03)

Results on real data sets: median classification accuracy (in percent) with bootstrap estimates of the standard error; mean model size is shown in parentheses. Results are based on 100 iterations.

Method   pima                  wdbc                  spambase               ionosphere             transfusion
lasso    74.82 ± 0.30 (6.52)   95.53 ± 0.18 (7.78)   91.62 ± 0.09 (48.09)   79.68 ± 0.69 (7.73)    78.46 ± 0.29 (3.68)
glasso   74.82 ± 0.28 (4.61)   95.12 ± 0.20 (4.64)   91.79 ± 0.09 (35.22)   79.68 ± 0.58 (3.67)    78.35 ± 0.29 (3.42)
rr       74.30 ± 0.32 (8.00)   95.93 ± 0.16 (30.00)  91.45 ± 0.12 (57.00)   81.27 ± 0.53 (32.00)   78.57 ± 0.20 (4.00)
grr      75.18 ± 0.29 (4.77)   95.39 ± 0.15 (6.21)   91.75 ± 0.11 (39.04)   79.88 ± 0.32 (5.53)    78.79 ± 0.33 (3.53)
enet     74.65 ± 0.31 (7.05)   96.21 ± 0.16 (23.03)  91.48 ± 0.11 (51.61)   81.27 ± 0.40 (22.27)   78.57 ± 0.19 (3.92)
genet    75.00 ± 0.23 (4.70)   95.12 ± 0.16 (5.82)   91.77 ± 0.09 (37.47)   80.28 ± 0.41 (5.10)    78.79 ± 0.35 (3.52)
ilasso   74.82 ± 0.27 (4.61)   95.12 ± 0.20 (4.82)   91.77 ± 0.08 (35.28)   79.88 ± 0.55 (3.79)    78.35 ± 0.29 (3.44)
nng      75.35 ± 0.21 (4.80)   95.66 ± 0.19 (6.63)   91.77 ± 0.08 (40.49)   80.48 ± 0.36 (6.53)    78.35 ± 0.37 (3.49)