Discussion of Least Angle Regression

David Madigan
Rutgers University & Avaya Labs Research, Piscataway, NJ 08855
madigan@stat.rutgers.edu

Greg Ridgeway
RAND Statistics Group, Santa Monica, CA 90407-2138
gregr@rand.org

Algorithms for simultaneous shrinkage and selection in regression and classification provide attractive solutions to knotty old statistical challenges. Nevertheless, as far as we can tell, Tibshirani's Lasso algorithm has had little impact on statistical practice. Two particular reasons for this may be the relative inefficiency of the original Lasso algorithm and the relative complexity of more recent Lasso algorithms (e.g., Osborne et al., 2000). Efron, Hastie, Johnstone, and Tibshirani have provided an efficient, simple algorithm for the Lasso as well as algorithms for stagewise regression and the new least angle regression. As such, this paper is an important contribution to statistical computing.

Predictive Performance

The authors say little about predictive performance issues. In our work, however, the relative out-of-sample predictive performance of LARS, Lasso, and forward stagewise (and variants thereof) takes center stage. Interesting connections exist between boosting and stagewise algorithms, so predictive comparisons with boosting are also of interest. The authors present a simple Cp statistic for LARS. In practice, a cross-validation-type approach for selecting the degree of shrinkage, while computationally more expensive, may lead to better predictions. We considered this using the LARS software. Here we report results for the authors' diabetes data, the Boston housing data, and the Servo data from the UCI Machine Learning Repository. Specifically, we held out 10% of the data and chose the shrinkage level using either Cp or nine-fold cross-validation on the remaining 90% of the data. We then estimated mean square error on the 10% hold-out sample.
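As a rough illustration of this procedure, the sketch below uses the lars package in R (which ships the diabetes data) to pick the shrinkage level by either Cp or nine-fold cross-validation on a 90% training split and then score the 10% hold-out sample. The argument names follow current versions of lars and may differ from the software available at the time; the seed and the split are arbitrary choices of ours.

```r
library(lars)
data(diabetes)                                  # x (442 x 10) and y, as used in the paper

set.seed(1)
n <- nrow(diabetes$x)
holdout <- sample(n, round(0.1 * n))            # 10% hold-out sample
x.tr <- diabetes$x[-holdout, ]; y.tr <- diabetes$y[-holdout]
x.te <- diabetes$x[holdout, ];  y.te <- diabetes$y[holdout]

fit <- lars(x.tr, y.tr, type = "lar")           # also "lasso" or "forward.stagewise"

# Shrinkage level (as a fraction of the full |beta| norm) minimizing Cp along the path
frac <- rowSums(abs(coef(fit))) / max(rowSums(abs(coef(fit))))
s.cp <- frac[which.min(fit$Cp)]

# Shrinkage level minimizing 9-fold cross-validated error on the training set
cv   <- cv.lars(x.tr, y.tr, K = 9, type = "lar", plot.it = FALSE)
s.cv <- cv$index[which.min(cv$cv)]

# Mean square error on the 10% hold-out sample under each choice
mse <- function(s)
  mean((y.te - predict(fit, x.te, s = s, mode = "fraction")$fit)^2)
c(Cp = mse(s.cp), CV = mse(s.cv))
```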

Table 1 shows the results for main-effects models.

               Diabetes        Boston          Servo
               Cp      CV      Cp      CV      Cp      CV
  Stagewise    3083    3082    25.7    25.8    1.33    1.32
  LAR          3080    3083    25.5    25.4    1.33    1.30
  Lasso        3083    3082    25.8    25.7    1.34    1.31

Table 1: Stagewise, LAR, and Lasso mean square error predictive performance, comparing cross-validation with Cp.

Table 1 exhibits two particular characteristics. First, as expected, Stagewise, LAR, and Lasso perform similarly. Second, Cp performs as well as cross-validation; if this holds up more generally, larger-scale applications will want to use Cp to select the degree of shrinkage.

Table 2 presents a re-analysis of the same three datasets, now considering five different models: least squares; LARS using cross-validation to select the coefficients; LARS using Cp to select the coefficients and allowing for two-way interactions; least squares boosting fitting only main effects; and least squares boosting allowing for two-way interactions. Again we used the authors' LARS software and, for the boosting results, the gbm package in R (Ridgeway, 2003). We evaluated all the models using the same cross-validation group assignments.

                 Diabetes        Boston          Servo
                 MSE     MAD     MSE     MAD     MSE     MAD
  LM             3000    44.2    23.8    3.40    1.28    0.91
  LAR            3087    45.4    24.7    3.53    1.33    0.95
  LAR two-way    3090    45.1    14.2    2.58    0.93    0.60
  GBM additive   3198    46.7    16.5    2.75    0.90    0.65
  GBM two-way    3185    46.8    14.1    2.52    0.80    0.60

Table 2: Predictive performance of competing methods. LM is a main-effects linear model with least squares fitting. LAR is least angle regression with main effects and CV shrinkage selection. LAR two-way is least angle regression with main effects and all two-way interactions, shrinkage selection via Cp. GBM additive and GBM two-way use least squares boosting, the former with main effects only, the latter with main effects and all two-way interactions. MSE is mean square error on a 10% hold-out sample; MAD is mean absolute deviation.

A plain linear model provides the best out-of-sample predictive performance for the diabetes dataset. By contrast, the Boston housing and Servo data exhibit more complex structure, and models incorporating higher-order structure do a better job. While no general conclusions can emerge from such a limited analysis, LARS seems to be competitive with these particular alternatives. We note, however, that for the Boston housing and Servo datasets, Breiman (2001) reports substantially better predictive performance using random forests.
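For the boosting columns of Table 2, a fit along the following lines can be made with the gbm package. The data frames train and test, the number of trees, and the shrinkage value are placeholders rather than the settings actually used; interaction.depth distinguishes the additive (main-effects) fit from the fit allowing two-way interactions.

```r
library(gbm)

# Least squares boosting: interaction.depth = 1 gives an additive (main-effects)
# model; interaction.depth = 2 allows two-way interactions.
boost.additive <- gbm(y ~ ., data = train, distribution = "gaussian",
                      n.trees = 2000, interaction.depth = 1,
                      shrinkage = 0.01, cv.folds = 9)
boost.twoway   <- gbm(y ~ ., data = train, distribution = "gaussian",
                      n.trees = 2000, interaction.depth = 2,
                      shrinkage = 0.01, cv.folds = 9)

# Number of trees chosen by cross-validation, then hold-out predictions
best.add <- gbm.perf(boost.additive, method = "cv", plot.it = FALSE)
best.two <- gbm.perf(boost.twoway,   method = "cv", plot.it = FALSE)
pred.add <- predict(boost.additive, newdata = test, n.trees = best.add)
pred.two <- predict(boost.twoway,   newdata = test, n.trees = best.two)

c(MSE.additive = mean((test$y - pred.add)^2),
  MAD.additive = mean(abs(test$y - pred.add)),
  MSE.twoway   = mean((test$y - pred.two)^2),
  MAD.twoway   = mean(abs(test$y - pred.two)))
```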

Extensions to Generalized Linear Models

The minimal computational complexity of LARS derives largely from the squared error loss function. Applying LARS-type strategies to models with non-linear loss functions will require some form of approximation. Here we consider LARS-type algorithms for logistic regression.

Consider the logistic log-likelihood for a regression function $f(x)$, which will be linear in $x$:

    \ell(f) = \sum_{i=1}^{n} \big[ y_i f(x_i) - \log(1 + \exp(f(x_i))) \big].    (1)

We can initialize $f(x) \equiv 0$. For some small $\alpha > 0$ we wish to find the covariate $x_j$ that offers the greatest improvement in the logistic log-likelihood, $\ell(f(x) + \alpha x_j)$. To find this we can compute the directional derivative for each $j$ and choose the maximum,

    j_1 = \arg\max_j \left| \tfrac{d}{d\alpha}\, \ell(f(x) + \alpha x_j) \right|_{\alpha = 0}    (2)
        = \arg\max_j \left| x_j^T (y - p(x)) \right|,    (3)

where $p(x) = 1/(1 + \exp(-f(x)))$ is the vector of fitted probabilities. Note that, as with LARS, this is the covariate that is most highly correlated with the residuals. The selected covariate is the first member of the active set, $\mathcal{A}$. For $\alpha$ small enough, (3) implies

    s_1 x_{j_1}^T \big( y - p(f + \alpha s_1 x_{j_1}) \big) \ge \left| x_j^T \big( y - p(f + \alpha s_1 x_{j_1}) \big) \right|    (4)

for all $x_j \notin \mathcal{A}$, where $s_1$ indicates the sign of the correlation as in the LARS development. Choosing $\alpha$ to have the largest magnitude while maintaining the constraint in (4) involves a non-linear optimization. However, linearizing (4) yields a fairly simple approximate solution. If $x_{j_2}$ is the variable with the second largest correlation with the residual, then

    \hat{\alpha} = \frac{(s_1 x_{j_1} - s_2 x_{j_2})^T (y - p(x))}{(s_1 x_{j_1} - s_2 x_{j_2})^T W\, s_1 x_{j_1}},    (5)

where $W = \mathrm{diag}\big(p_i(x)(1 - p_i(x))\big)$ arises from the first-order expansion of $p$. The algorithm may need to iterate (5) to obtain the exact optimal $\hat{\alpha}$. Similar logic yields an algorithm for the full solution.
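To make the selection and step size concrete, here is a minimal sketch in R of a single approximate least-angle step for logistic regression following (2)-(5) as reconstructed above. It is an illustration of the idea rather than the authors' implementation, and it assumes standardized covariates.

```r
# One approximate least-angle step for logistic regression.
# X: n x p matrix of standardized covariates; y: 0/1 outcomes; f: current fit (n-vector).
lar.logistic.step <- function(X, y, f) {
  p <- 1 / (1 + exp(-f))                         # fitted probabilities
  corr <- drop(t(X) %*% (y - p))                 # directional derivatives, eqs. (2)-(3)
  j1 <- which.max(abs(corr))                     # covariate entering the active set
  s1 <- sign(corr[j1])
  j2 <- order(abs(corr), decreasing = TRUE)[2]   # runner-up covariate
  s2 <- sign(corr[j2])
  w <- p * (1 - p)                               # weights from linearizing p(x)
  d <- s1 * X[, j1] - s2 * X[, j2]
  alpha <- sum(d * (y - p)) / sum(d * w * (s1 * X[, j1]))   # linearized step, eq. (5)
  list(j = j1, alpha = alpha, f = f + alpha * s1 * X[, j1])
}
```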

We simulated data with 10 independent normal covariates $x_i$ and binary outcomes drawn from a logistic model, $\mathrm{logit}\,\Pr(Y_i = 1) = x_i^T \beta$. Figure 1 shows a comparison of the coefficient estimates using forward stagewise and the least angle method of estimating coefficients, with the final estimates arriving at the MLE.

[Figure 1: Comparison of coefficient estimation for forward stagewise (left panel) and least angle (right panel) logistic regression; coefficient paths plotted against $\sum_j |\beta_j| / \max \sum_j |\beta_j|$.]

While the paper presents LARS for squared error problems, the least angle approach seems applicable to a wider family of models. However, an out-of-sample evaluation of predictive performance is essential to assess the utility of such a modeling strategy.

Specifically for the Lasso, one alternative strategy for logistic regression is to use a quadratic approximation for the log-likelihood. Consider the Bayesian version of the Lasso with hyperparameter $\gamma$ (i.e., the penalized rather than constrained version of the Lasso):

    \log f(\beta \mid y) = \sum_{i=1}^{n} \big[ y_i \log \mu(x_i^T \beta) + (1 - y_i) \log(1 - \mu(x_i^T \beta)) \big] - \gamma \sum_{j=1}^{d} |\beta_j|
                  \approx \sum_{i=1}^{n} \big[ a_i + b_i\, x_i^T \beta + c_i (x_i^T \beta)^2 \big] - \gamma \sum_{j=1}^{d} |\beta_j|,

where $\mu$ denotes the logistic link, $d$ is the dimension of $\beta$, and $a_i$, $b_i$, and $c_i$ are Taylor coefficients. Fu's elegant coordinate-wise "shooting" algorithm (Fu, 1998) can optimize this target starting from either the least squares solution or from zero. In our experience the shooting algorithm converges rapidly.
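For reference, here is a minimal sketch of the coordinate-wise shooting update, written for a lasso-penalized least squares criterion, the form to which a quadratic approximation such as the one above can be reduced. The function name, tolerance, and iteration limit are our own choices; each coordinate is soft-thresholded in turn until the coefficients stop changing.

```r
# Shooting algorithm for  min_beta  0.5 * ||y - X beta||^2 + gamma * sum(|beta|).
# X: n x p design matrix; y: response; gamma: lasso penalty; beta0: starting value
# (e.g., the least squares solution or zero, as discussed in the text).
shooting <- function(X, y, gamma, beta0 = rep(0, ncol(X)), tol = 1e-8, maxit = 1000) {
  beta <- beta0
  xtx <- colSums(X^2)                              # x_j' x_j for each coordinate
  for (it in seq_len(maxit)) {
    beta.old <- beta
    for (j in seq_along(beta)) {
      r <- y - X[, -j, drop = FALSE] %*% beta[-j]  # partial residual excluding x_j
      z <- sum(X[, j] * r)
      # Soft-threshold the univariate least squares solution for coordinate j
      beta[j] <- sign(z) * max(abs(z) - gamma, 0) / xtx[j]
    }
    if (max(abs(beta - beta.old)) < tol) break     # stop once coordinates stabilize
  }
  beta
}
```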

References

Breiman, L. (2001). Random forests. ftp://ftp.stat.berkeley.edu/pub/users/breiman/randomforest2001.pdf

Fu, W. J. (1998). Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7, 397-418.

Osborne, M. R., Presnell, B., and Turlach, B. A. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20, 389-404.

Ridgeway, G. (2003). GBM 0.7-2 package manual. http://cran.r-project.org/doc/packages/gbm.pdf