Logistic Regression with the Nonnegative Garrote


Enes Makalic, Daniel F. Schmidt
Centre for MEGA Epidemiology, The University of Melbourne
24th Australasian Joint Conference on Artificial Intelligence, 2011

Outline
1. Introduction
   - Problem Description
   - Motivation
2. Non-negative Garrote
3.


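Problem Description (1)
We have a binary classification problem.
- Data $y = (y_1, \ldots, y_n)$, with $y_i \in \{-1, +1\}$
- Matrix of $p$ covariate vectors $X = (x_1, \ldots, x_p)$, with $x_j \in \mathbb{R}^n$

$$
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix},
\qquad
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
$$

Use a logistic regression model ($n$ samples, $p$ predictors).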

Problem Description (2)
Logistic regression model for explaining data $y$:

$$ p(y = \pm 1 \mid x, \beta) = \frac{1}{1 + \exp(-y\, x^\top \beta)} $$

$\beta \in \mathbb{R}^p$ is the vector of logistic regression coefficients.

Log-likelihood for $n$ data points:

$$ l(\beta) = -\sum_{i=1}^{n} \log\left( 1 + \exp(-y_i\, x_i^\top \beta) \right) $$

Task:
- Estimate parameters
- Select significant regressors
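As a small illustration of the model and log-likelihood on this slide, the following Python sketch evaluates both for labels coded as -1/+1. The function names (`log1p_exp`, `prob`, `log_likelihood`) and the list-of-rows data layout are assumptions made for the example, not part of the talk.

```python
import math

def log1p_exp(z):
    """Numerically stable log(1 + exp(z))."""
    if z > 0:
        return z + math.log1p(math.exp(-z))
    return math.log1p(math.exp(z))

def prob(x, y, beta):
    """p(y | x, beta) = 1 / (1 + exp(-y * x'beta)), with y in {-1, +1}."""
    margin = y * sum(x_j * b_j for x_j, b_j in zip(x, beta))
    return math.exp(-log1p_exp(-margin))

def log_likelihood(X, y, beta):
    """l(beta) = -sum_i log(1 + exp(-y_i * x_i'beta)); X is a list of rows."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        margin = y_i * sum(x_ij * b_j for x_ij, b_j in zip(x_i, beta))
        total -= log1p_exp(-margin)
    return total
```

With all coefficients set to zero, every observation has probability 1/2, so the log-likelihood is -n log 2; this is a convenient sanity check.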


Motivation
Many problems with maximum likelihood and stepwise regression.
Ideally, we want a method that:
- consistently selects true predictors
- automatically shrinks parameters
- selects important variables
- can be applied when $p \gg n$
- has the Oracle property (asymptotically)


Non-negative Garrote (1)
Requires an initial parameter estimate $\tilde\beta$
- For example, maximum likelihood, ridge regression, etc.

Non-negative Garrote (NNG) estimate:

$$ \hat\beta_{NG} = \arg\max_{\beta} \left\{ l(\beta_1, \ldots, \beta_p) \right\} \quad \text{s.t.} \quad c_j \ge 0, \quad \sum_{j=1}^{p} c_j \le t $$

where $\beta_j = c_j \tilde\beta_j$, $j = 1, \ldots, p$.
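The algorithm slides later in the talk take a regularization parameter $\lambda > 0$ rather than the bound $t$. One way to connect the two, assuming the usual Lagrangian (penalized) form of the constraint and writing $\tilde\beta$ for the initial estimate, is the following; the exact correspondence between $t$ and $\lambda$ is not stated in the slides.

```latex
\hat{c} = \arg\max_{c_1, \ldots, c_p \,\ge\, 0}
  \left\{ l(c_1 \tilde\beta_1, \ldots, c_p \tilde\beta_p) - \lambda \sum_{j=1}^{p} c_j \right\},
\qquad
\hat\beta_{NG,j} = \hat{c}_j \, \tilde\beta_j , \quad j = 1, \ldots, p .
```

Under this reading, a larger $\lambda$ plays the role of a smaller $t$: more of the scaling factors $c_j$ are driven to exactly zero, which removes the corresponding regressors.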

Non-negative Garrote (2)
Properties:
- Consistent in terms of parameter estimation and variable selection (linear regression)
  - Remains true even if the initial estimate is inconsistent (with caveats)
- Oracle property
  - It performs as well as if the true underlying model were given in advance

Non-negative Garrote (3)
How do we choose the initial estimate?
- Breiman originally advocated maximum likelihood, which has disadvantages

Solving the optimisation problem:
- Constrained least squares (linear regression)
- Standard convex programming solution not feasible for large $p$
- Our algorithm, NNG OPT, is based on cyclic coordinate descent
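The summary of the talk recommends ridge regression for the initial estimate. A minimal sketch of one way to obtain such an estimate is shown here, using plain gradient ascent on the ridge-penalized log-likelihood. The function name `ridge_logistic`, the penalty weight `alpha`, the fixed step size, and the iteration count are all assumptions for illustration; the slides do not say how the ridge estimate was computed.

```python
import math

def ridge_logistic(X, y, alpha=1.0, n_iter=500):
    """Maximize l(beta) - (alpha / 2) * ||beta||^2 by gradient ascent.

    X is a list of n rows, y holds labels in {-1, +1}.
    """
    p = len(X[0])
    # Step size 1/L, where L is an upper bound on the curvature:
    # 0.25 * trace(X'X) for the likelihood plus alpha for the penalty.
    L = 0.25 * sum(x * x for row in X for x in row) + alpha
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [-alpha * b for b in beta]
        for row, y_i in zip(X, y):
            margin = y_i * sum(x * b for x, b in zip(row, beta))
            # d/dbeta of -log(1 + exp(-margin)) is y_i * x_i / (1 + exp(margin));
            # the clamp only guards against overflow in math.exp.
            w = y_i / (1.0 + math.exp(min(margin, 700.0)))
            for j in range(p):
                grad[j] += w * row[j]
        beta = [b + g / L for b, g in zip(beta, grad)]
    return beta
```

Unlike maximum likelihood, the penalized objective has a unique finite maximizer even when the classes are perfectly separable, which is one practical reason a ridge estimate is convenient as a starting point.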


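Non-negative Garrote (4)
input : data matrix $X \in \mathbb{R}^{n \times p}$, target vector $y \in \{-1, +1\}^n$, initial estimate $\tilde\beta \in \mathbb{R}^p$, regularization parameter $\lambda > 0$
output: NNG estimate $\beta \in \mathbb{R}^p$

initialize $\Delta_j \leftarrow 1$ for $j = 1, \ldots, p$, $r_i \leftarrow 0$ for $i = 1, \ldots, n$
$r \leftarrow y \circ X\tilde\beta$   ($\circ$ denotes element-wise product)
$x_i \leftarrow x_i \circ \tilde\beta$ ($i = 1, \ldots, n$)   (rescale data)
$\beta \leftarrow (1, \ldots, 1)$   (start search from $\tilde\beta$)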

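Non-negative Garrote (5)
for $t \leftarrow 1, 2, \ldots$ to convergence do
    for $j \leftarrow 1, 2, \ldots$ to $p$ do
        $F_i \leftarrow \min\left(0.25,\ 1 / \left(2\exp(-\Delta_j |x_{ij}|) + \exp(r_i - \Delta_j |x_{ij}|) + \exp(-\Delta_j |x_{ij}| - r_i)\right)\right)$   ($i = 1, \ldots, n$)
        $\Delta v_j \leftarrow \left( \sum_{i=1}^{n} x_{ij} y_i / (1 + \exp(r_i)) - \lambda \right) / \left( \sum_{i=1}^{n} x_{ij}^2 F_i \right)$   (Newton-Raphson update)
        if $\beta_j = 0$ then
            if $\Delta v_j \le 0$ then
                $\Delta v_j = 0$
            end
        else
            if $\beta_j + \Delta v_j < 0$ then
                $\Delta v_j = -\beta_j$   (if sign change, set $\beta_j$ to zero)
            end
        end
        $\Delta\beta_j \leftarrow \min(\max(\Delta v_j, -\Delta_j), \Delta_j)$   (limit step size to trust region)
        $\Delta r_i \leftarrow \Delta\beta_j\, x_{ij}\, y_i$, $r_i \leftarrow r_i + \Delta r_i$   ($i = 1, \ldots, n$)
        $\beta_j \leftarrow \beta_j + \Delta\beta_j$
        $\Delta_j \leftarrow \max(2|\Delta\beta_j|, \Delta_j / 2)$   (update trust region size)
    end
end
$\beta \leftarrow \beta \circ \tilde\beta$   (use original scale)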
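The two algorithm slides can be read as the following Python sketch of cyclic coordinate descent with a trust region. It is one interpretation of the pseudocode, not the authors' NNG OPT implementation. The function name `nng_logistic`, the stopping rule (largest step below `tol`, at most `max_sweeps` sweeps), the use of the absolute value of the rescaled covariate inside the curvature bound, and the overflow clamp in `_exp` are assumptions that do not appear in the slides.

```python
import math

def _exp(z):
    """exp with a clamp to avoid OverflowError (an assumption, not in the slides)."""
    return math.exp(min(z, 700.0))

def nng_logistic(X, y, beta_init, lam, max_sweeps=200, tol=1e-8):
    """Sketch of the NNG coordinate-descent update for logistic regression.

    X: list of n rows; y: labels in {-1, +1}; beta_init: initial estimate;
    lam: regularization parameter lambda > 0.
    """
    n, p = len(X), len(beta_init)
    # Rescale each column by the initial estimate, so the search is over
    # the nonnegative scaling factors c (called beta in the slides).
    Z = [[X[i][j] * beta_init[j] for j in range(p)] for i in range(n)]
    c = [1.0] * p                 # start the search at the initial estimate
    delta = [1.0] * p             # trust-region half-widths
    r = [y[i] * sum(Z[i]) for i in range(n)]   # r_i = y_i * x_i'beta
    for _ in range(max_sweeps):
        largest_step = 0.0
        for j in range(p):
            grad = sum(Z[i][j] * y[i] / (1.0 + _exp(r[i])) for i in range(n))
            curv = 0.0
            for i in range(n):
                d = delta[j] * abs(Z[i][j])
                F = min(0.25, 1.0 / (2.0 * _exp(-d) + _exp(r[i] - d) + _exp(-d - r[i])))
                curv += Z[i][j] ** 2 * F
            if curv == 0.0:
                continue          # all-zero column, e.g. beta_init[j] == 0
            dv = (grad - lam) / curv          # Newton-Raphson style step
            if c[j] == 0.0:
                dv = max(dv, 0.0)             # stay at zero unless the step is positive
            elif c[j] + dv < 0.0:
                dv = -c[j]                    # would cross zero: stop at zero
            step = min(max(dv, -delta[j]), delta[j])   # trust-region limit
            for i in range(n):
                r[i] += step * Z[i][j] * y[i]
            c[j] += step
            delta[j] = max(2.0 * abs(step), delta[j] / 2.0)
            largest_step = max(largest_step, abs(step))
        if largest_step < tol:
            break
    return [c[j] * beta_init[j] for j in range(p)]   # back to the original scale
```

Because the scaling factors never go negative, each returned coefficient either keeps the sign of the initial estimate or is exactly zero. A sufficiently large `lam` sets every coefficient to zero.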


Summary
Stepwise forward selection
- Models generalize poorly
- Poor predictive performance

Nonnegative Garrote
- Ridge regression recommended for initial estimate
- Excellent performance in comparison to LASSO
- Performed well for (highly) sparse models
- Somewhat worse performance for dense models


In all simulations:
- Sample size $n \in \{20, 50, 100\}$
- Regressor correlation $\mathrm{corr}(i, j) = 0.5^{|i - j|}$
- Example 1: $\beta = (3, 2, 1.5, 0, 0, 0, 0, 0)$
- Example 2: $\beta_j = 0.85$ for all $j$
- Example 3: $\beta = (5, 0.5, 0.5, 0.5, 0, 0, 0, 0)$
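A rough sketch of how data with this correlation structure could be generated is given below. The slides do not state the distribution of the regressors; standard normal regressors with unit variance are assumed here, as is the choice of eight regressors for Example 2 (taken from the length of the other two coefficient vectors). The names `simulate`, `BETA_EX1`, `BETA_EX2`, and `BETA_EX3` are hypothetical.

```python
import math
import random

BETA_EX1 = [3, 2, 1.5, 0, 0, 0, 0, 0]
BETA_EX2 = [0.85] * 8            # p = 8 is an assumption for Example 2
BETA_EX3 = [5, 0.5, 0.5, 0.5, 0, 0, 0, 0]

def simulate(n, beta, rho=0.5, seed=0):
    """Draw n samples with corr(x_i, x_j) = rho**|i - j| and labels in {-1, +1}."""
    rng = random.Random(seed)
    p = len(beta)
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0)]
        for _ in range(1, p):
            # AR(1) recursion: keeps unit variance and gives
            # correlation rho**k between regressors k positions apart.
            noise = rng.gauss(0.0, 1.0)
            row.append(rho * row[-1] + math.sqrt(1.0 - rho ** 2) * noise)
        eta = sum(x * b for x, b in zip(row, beta))
        p_pos = 1.0 / (1.0 + math.exp(-eta))   # P(y = +1 | x, beta)
        X.append(row)
        y.append(1 if rng.random() < p_pos else -1)
    return X, y
```

For example, `simulate(20, BETA_EX1)` would produce one data set of the smallest sample size for Example 1.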

                  n = 20                               n = 50
Methods   NLL     Size    FP      FN        NLL     Size    FP      FN
fwd       (0.80)  (0.04)  (0.03)  (0.02)    (0.07)  (0.04)  (0.03)  (0.03)
gfwd      (0.41)  (0.04)  (0.03)  (0.02)    (0.05)  (0.04)  (0.02)  (0.03)
lasso     (0.15)  (0.05)  (0.02)  (0.04)    (0.04)  (0.04)  (0.01)  (0.04)
glasso    (0.18)  (0.04)  (0.03)  (0.03)    (0.05)  (0.04)  (0.01)  (0.04)
rr        (0.14)  (0.00)  (0.00)  (0.00)    (0.03)  (0.00)  (0.00)  (0.00)
grr       (0.21)  (0.05)  (0.02)  (0.03)    (0.03)  (0.04)  (0.01)  (0.04)
enet      (0.12)  (0.05)  (0.01)  (0.05)    (0.03)  (0.04)  (0.00)  (0.04)
genet     (0.22)  (0.04)  (0.02)  (0.03)    (0.04)  (0.04)  (0.01)  (0.04)
ilasso    (0.19)  (0.04)  (0.02)  (0.03)    (0.05)  (0.04)  (0.01)  (0.04)
nng       (0.25)  (0.04)  (0.02)  (0.03)    (0.03)  (0.04)  (0.01)  (0.04)

Example 1: median negative log-likelihood (NLL), mean model size (Size), mean number of false positive regressors (FP) and mean number of false negative regressors (FN) included in the selected model. Tests are based on 1000 iterations with standard errors included in parentheses.

                  n = 20                               n = 50
Methods   NLL     Size    FP      FN        NLL     Size    FP      FN
fwd       (0.08)  (0.05)  (0.05)  (0.00)    (0.09)  (0.07)  (0.07)  (0.00)
gfwd      (0.03)  (0.04)  (0.04)  (0.00)    (0.08)  (0.07)  (0.07)  (0.00)
lasso     (0.19)  (0.05)  (0.05)  (0.00)    (0.04)  (0.03)  (0.03)  (0.00)
glasso    (0.19)  (0.05)  (0.05)  (0.00)    (0.05)  (0.05)  (0.05)  (0.00)
rr        (0.11)  (0.00)  (0.00)  (0.00)    (0.03)  (0.00)  (0.00)  (0.00)
grr       (0.19)  (0.05)  (0.05)  (0.00)    (0.04)  (0.04)  (0.04)  (0.00)
enet      (0.12)  (0.04)  (0.04)  (0.00)    (0.02)  (0.01)  (0.01)  (0.00)
genet     (0.22)  (0.05)  (0.05)  (0.00)    (0.04)  (0.04)  (0.04)  (0.00)
ilasso    (0.21)  (0.05)  (0.05)  (0.00)    (0.05)  (0.05)  (0.05)  (0.00)
nng       (0.23)  (0.05)  (0.05)  (0.00)    (0.04)  (0.04)  (0.04)  (0.00)

Example 2: median negative log-likelihood (NLL), mean model size (Size), mean number of false positive regressors (FP) and mean number of false negative regressors (FN) included in the selected model. Tests are based on 1000 iterations with standard errors included in parentheses.

                  n = 20                               n = 50
Methods   NLL     Size    FP      FN        NLL     Size    FP      FN
fwd       (0.78)  (0.04)  (0.03)  (0.02)    (0.03)  (0.04)  (0.02)  (0.02)
gfwd      (0.35)  (0.03)  (0.02)  (0.01)    (0.02)  (0.04)  (0.02)  (0.02)
lasso     (0.14)  (0.05)  (0.03)  (0.03)    (0.03)  (0.05)  (0.02)  (0.04)
glasso    (0.15)  (0.03)  (0.02)  (0.02)    (0.02)  (0.05)  (0.03)  (0.03)
rr        (0.18)  (0.00)  (0.00)  (0.00)    (0.03)  (0.00)  (0.00)  (0.00)
grr       (0.11)  (0.04)  (0.03)  (0.02)    (0.02)  (0.05)  (0.03)  (0.03)
enet      (0.14)  (0.06)  (0.03)  (0.04)    (0.03)  (0.05)  (0.02)  (0.04)
genet     (0.12)  (0.04)  (0.02)  (0.02)    (0.02)  (0.05)  (0.03)  (0.03)
ilasso    (0.15)  (0.04)  (0.02)  (0.02)    (0.02)  (0.05)  (0.03)  (0.03)
nng       (0.11)  (0.04)  (0.03)  (0.03)    (0.02)  (0.05)  (0.03)  (0.03)

Example 3: median negative log-likelihood (NLL), mean model size (Size), mean number of false positive regressors (FP) and mean number of false negative regressors (FN) included in the selected model. Tests are based on 1000 iterations with standard errors included in parentheses.
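The Size, FP, and FN columns in these tables can be computed from an estimated coefficient vector and the true one. The sketch below uses the usual reading of the terms (a false positive is a selected regressor whose true coefficient is zero; a false negative is a regressor with a nonzero true coefficient that was not selected), which is an assumption about how the authors counted them. The function name `selection_metrics` and the threshold `tol` are hypothetical.

```python
def selection_metrics(beta_hat, beta_true, tol=0.0):
    """Return (size, false_positives, false_negatives) for one fitted model."""
    selected = [abs(b) > tol for b in beta_hat]
    truth = [b != 0 for b in beta_true]
    size = sum(selected)
    false_pos = sum(s and not t for s, t in zip(selected, truth))
    false_neg = sum(t and not s for s, t in zip(selected, truth))
    return size, false_pos, false_neg

# Example 1 truth against a made-up estimate (illustrative numbers only).
beta_true = [3, 2, 1.5, 0, 0, 0, 0, 0]
beta_hat = [2.1, 0.0, 0.9, 0.0, 0.3, 0.0, 0.0, 0.0]
print(selection_metrics(beta_hat, beta_true))  # (3, 1, 1)
```

Averaging these three counts over repeated simulated data sets would give per-method means of the kind the table captions describe.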

                                  Datasets
Methods   pima        wdbc        spambase     ionosphere   transfusion
lasso     ± (6.52)    ± (7.78)    ± (48.09)    ± (7.73)     ± 0.29 (3.68)
glasso    ± (4.61)    ± (4.64)    ± (35.22)    ± (3.67)     ± 0.29 (3.42)
rr        ± (8.00)    ± (30.00)   ± (57.00)    ± (32.00)    ± 0.20 (4.00)
grr       ± (4.77)    ± (6.21)    ± (39.04)    ± (5.53)     ± 0.33 (3.53)
enet      ± (7.05)    ± (23.03)   ± (51.61)    ± (22.27)    ± 0.19 (3.92)
genet     ± (4.70)    ± (5.82)    ± (37.47)    ± (5.10)     ± 0.35 (3.52)
ilasso    ± (4.61)    ± (4.82)    ± (35.28)    ± (3.79)     ± 0.29 (3.44)
nng       ± (4.80)    ± (6.63)    ± (40.49)    ± (6.53)     ± 0.37 (3.49)

Simulation results for real data: median classification accuracy (in percent) is shown along with bootstrap estimates of standard error. Mean model size is included in parentheses. Tests are based on 100 iterations.
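The caption mentions bootstrap estimates of the standard error of a median. A minimal sketch of that idea follows: resample the per-iteration accuracies with replacement, take the median of each resample, and report the standard deviation of those medians. The function name `bootstrap_se_median`, the number of resamples, and the seed are assumptions; the slides do not give the authors' bootstrap settings.

```python
import random
import statistics

def bootstrap_se_median(values, n_boot=1000, seed=0):
    """Bootstrap estimate of the standard error of the median of `values`."""
    rng = random.Random(seed)
    n = len(values)
    medians = [
        statistics.median(rng.choices(values, k=n))  # resample with replacement
        for _ in range(n_boot)
    ]
    return statistics.stdev(medians)
```

In this setting `values` would hold the classification accuracies from the 100 test iterations for one method on one data set.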
