A Survey of L1 Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013)

1 A Survey of L1 Regression. Vidaurre, Bielza and Larranaga (2013). Céline Cunen, 20/10/2014

2 Outline of article
1. Introduction
2. The Lasso for Linear Regression: a) Notation and Main Concepts, b) Statistical Properties, c) Computational Algorithms, d) A Bayesian Interpretation, e) Connection to Boosting
3. Improving the Lasso's Properties
4. Adapting the Lasso to Particular Problems
5. Generalized Linear Models: a) Logistic Regression, b) Poisson Regression, c) Cox Proportional-Hazards Regression
6. Time Series: a) Wavelet Analysis, b) Autoregressive Models, c) Other Regression Models, d) Change Point Analysis
7. Discussion

3 1 Introduction. L1-penalized linear regression = Lasso (Least Absolute Shrinkage and Selection Operator). The Lasso shrinks the coefficients (like Ridge), but also performs variable selection (unlike Ridge). It is useful for sparse models, where one believes that the true model contains only a small subset of the explanatory variables. Main advantage: the lasso offers interpretable, stable models and efficient prediction at a reasonable cost. Key themes: variable selection; solutions when p > n; bias/variance trade-off.

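4 2.1 Main Concepts
Linear model: $y = X\beta + \varepsilon$.
Ordinary Least Squares, minimizing the RSS (residual sum of squares): $\hat\beta = \arg\min_\beta \|y - X\beta\|_2^2$, where $\|a\|_2^2 = a^T a = a_1^2 + a_2^2 + \dots$ LS-solution: $\hat\beta = (X^T X)^{-1} X^T y$.
Ridge: $\hat\beta = \arg\min_\beta \left( \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \right)$. Ridge-solution: $\hat\beta = (X^T X + \lambda I)^{-1} X^T y$.
Lasso: $\hat\beta = \arg\min_\beta \left( \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right)$, where $\|a\|_1 = |a_1| + |a_2| + \dots$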
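The two closed-form solutions can be checked numerically. The following is a minimal sketch, assuming simulated data; the function names `ols` and `ridge`, the sample sizes and the value of λ are illustrative choices, not part of the slides.

```python
import numpy as np

def ols(X, y):
    """Least-squares solution (X^T X)^{-1} X^T y, via a linear solve."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam):
    """Ridge solution (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated data (assumption: 50 observations, 3 predictors, small noise)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + 0.1 * rng.normal(size=50)

beta_ls = ols(X, y)
beta_ridge = ridge(X, y, lam=10.0)
print("LS:   ", beta_ls)
print("Ridge:", beta_ridge)   # shrunk towards zero, but no exact zeros
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a numerical convenience; the result is the same estimator.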

5 Orthonormal input matrix X. [Figure: estimate $\hat\beta_1$ plotted against the least-squares estimate $\hat\beta_1^{ls}$; from Hastie, Tibshirani and Friedman (2008).]
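For an orthonormal input matrix ($X^T X = I$) both penalized estimates are simple functions of the least-squares estimate. A minimal sketch, assuming the objectives $\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2$ (Ridge) and $\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ (Lasso): Ridge rescales every coefficient by $1/(1+\lambda)$, while the Lasso soft-thresholds at $\lambda/2$. The function names and example values are illustrative.

```python
import numpy as np

def ridge_orthonormal(beta_ls, lam):
    """Ridge under X^T X = I: proportional shrinkage of each coefficient."""
    return beta_ls / (1.0 + lam)

def lasso_orthonormal(beta_ls, lam):
    """Lasso under X^T X = I: soft-thresholding at lam / 2."""
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam / 2.0, 0.0)

beta_ls = np.array([3.0, -0.4, 1.0])   # hypothetical least-squares estimates
print(ridge_orthonormal(beta_ls, lam=1.0))
print(lasso_orthonormal(beta_ls, lam=1.0))
```

The small coefficient is set exactly to zero by the Lasso but only halved by Ridge, which is the variable-selection effect described in the introduction.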

6 Optimization problem
Ridge constraint: $\|\beta\|_2^2 = \beta_1^2 + \beta_2^2 \le s$.
Lasso constraint: $\|\beta\|_1 = |\beta_1| + |\beta_2| \le s$.
[Figure from Hastie, Tibshirani and Friedman (2008).]

7 General $L_q$ penalty. Penalty: $\lambda \|\beta\|_q^q = \lambda \left( |\beta_1|^q + |\beta_2|^q + \dots \right)$. q < 1: less bias than the Lasso; non-convex; variable selection. q > 1: convex; no variable selection. q = 1: Lasso, a compromise. [Figure from Hastie, Tibshirani and Friedman (2008).]

8 Lasso vs Ridge. Lasso is better if the true model is sparse, because it shrinks the remaining coefficients less than Ridge does. Ridge is better when the predictors are highly correlated: Ridge keeps the redundant variables but shrinks them, whereas Lasso discards all but one.

9 2.2 Statistical Properties. Two measures: prediction accuracy (for new data) and recovery of the true model. Prediction consistency: Expected squared prediction error:

10 2.2 Statistical Properties (2). Recovering the true explanatory variables and giving consistent estimates. Irrepresentable condition: the true model can be recovered <=> there are no high correlations between relevant and irrelevant predictors. In addition, the beta-min condition: the true non-zero coefficients must be sufficiently large. Variable screening instead: select the true variables, but allow some false positives. Optimal λ: adaptively chosen to minimize the expected prediction error. Optimal λ for prediction > optimal λ for variable selection.

11 2.3 Computational Algorithms. Algorithms solving the Lasso optimization problem: LARS (least-angle regression) and pathwise coordinate descent. LARS computes the entire regularization path (for all λs) in p steps (same cost as LS).

12 [Figure: regularization path for the Lasso, running from the least squares solution towards more shrinking.]

13 LARS Algorithm, from Hastie, Tibshirani and Friedman (2008). Modification to get the Lasso solution:

14 When can LARS be used? The regularization path is piecewise linear when the loss function is quadratic as a function of β and the penalty function is piecewise linear as a function of β. When these conditions are not fulfilled (and/or we have a very large number of explanatory variables), coordinate descent optimization algorithms may be used (the problem must still be convex).
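To illustrate the coordinate descent idea, here is a minimal sketch of cyclic coordinate descent for the Lasso objective $\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$. It is not the pathwise algorithm from the survey: it solves for a single λ, uses a fixed number of sweeps instead of a convergence test, and the names `lasso_cd` and `soft_threshold` as well as the simulated data are assumptions made for the example. Each update minimizes the objective over one coefficient with the others held fixed, which reduces to soft-thresholding.

```python
import numpy as np

def soft_threshold(a, t):
    """Shrink a towards zero by t, returning 0 when |a| <= t."""
    return np.sign(a) * max(abs(a) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for ||y - X b||_2^2 + lam * ||b||_1."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)          # x_j^T x_j for each column
    resid = y - X @ beta
    for _ in range(n_sweeps):
        for j in range(p):
            resid = resid + X[:, j] * beta[j]   # partial residual without x_j
            rho = X[:, j] @ resid
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            resid = resid - X[:, j] * beta[j]   # put the updated x_j back
    return beta

# Simulated sparse problem (assumption: 2 of 5 coefficients are non-zero)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=100)
beta_hat = lasso_cd(X, y, lam=20.0)
print(beta_hat)
```

For any λ of at least $2\max_j |x_j^T y|$ every update returns zero, so the whole coefficient vector stays at zero; this gives a natural starting point when a path over decreasing λ is wanted.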

15 2.4 A Bayesian Interpretation. The Lasso can be considered a Bayes estimate with a Laplace prior on β (the Ridge solution can be obtained with another prior). Only the mode of the posterior (MAP) gives a sparse estimate. Advantages: reliable SE(β); reveals information about dependence between explanatory variables. Drawback: more computationally expensive.
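The link between the Laplace prior and the L1 penalty can be sketched in a few lines. The derivation below assumes Gaussian noise with variance $\sigma^2$ and independent Laplace priors with scale $\tau$ on each coefficient; the symbols $\sigma$ and $\tau$ are introduced here for illustration and do not appear on the slides.

```latex
\begin{align*}
y \mid \beta &\sim \mathcal{N}(X\beta, \sigma^2 I), \qquad
p(\beta_j) = \frac{1}{2\tau}\exp\!\left(-\frac{|\beta_j|}{\tau}\right), \\
-\log p(\beta \mid y) &= \frac{1}{2\sigma^2}\,\|y - X\beta\|_2^2
  + \frac{1}{\tau}\,\|\beta\|_1 + \text{const}, \\
\hat\beta_{\mathrm{MAP}} &= \arg\min_\beta \left( \|y - X\beta\|_2^2
  + \lambda \|\beta\|_1 \right), \qquad \lambda = \frac{2\sigma^2}{\tau}.
\end{align*}
```

Under these assumptions the posterior mode coincides with the Lasso estimate, whereas the posterior mean or median would in general not contain exact zeros.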

16 2.5 Connection to Boosting. Reminder: ordinary Lasso = squared loss + L1-penalization. Boosting regression with squared loss ≈ lasso (when there is low correlation between explanatory variables). Boosting can give methods for computing L1-penalized regression (with losses other than squared loss). BLasso: generalization of the lasso to any convex loss function.

17 3 Improving the Lasso's Properties. Bias/variance trade-off (reminder). Reminder: the regularization parameter λ is generally chosen by cross-validation (adaptively, to minimize the expected prediction error). Problem 1: the right variables may be identified, but their coefficient estimates are biased. Proposed remedies: relaxed lasso; Variable Inclusion and Shrinkage Algorithm (VISA); adaptive lasso; (Dantzig selector, LAD-lasso).

18 3 Improving the Lasso's Properties (2)
Problem 2: Ridge is better than the lasso when there are strong correlations between explanatory variables.
Elastic net: $\hat\beta = \arg\min_\beta \left( \|y - X\beta\|_2^2 + \lambda \left( \alpha \|\beta\|_1 + (1 - \alpha) \|\beta\|_2^2 \right) \right)$.
Popular. If p > N, it can select more than N variables. It can select groups of redundant variables.
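The grouping behaviour can be illustrated with two perfectly redundant predictors. This is a minimal sketch, assuming the elastic net objective $\|y - X\beta\|_2^2 + \lambda(\alpha\|\beta\|_1 + (1-\alpha)\|\beta\|_2^2)$ and a plain coordinate descent solver; the name `enet_cd`, the data and the values of λ and α are illustrative assumptions. With α = 1 (pure lasso) the solution is not unique for duplicated columns, and this particular sweep order happens to put the weight on the first copy.

```python
import numpy as np

def enet_cd(X, y, lam, alpha, n_sweeps=200):
    """Coordinate descent for
    ||y - X b||_2^2 + lam * (alpha * ||b||_1 + (1 - alpha) * ||b||_2^2)."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residual that leaves out predictor j
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            shrunk = np.sign(rho) * max(abs(rho) - lam * alpha / 2.0, 0.0)
            beta[j] = shrunk / (col_sq[j] + lam * (1.0 - alpha))
    return beta

rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([x, x])            # two identical predictors
y = 2.0 * x + 0.1 * rng.normal(size=100)

beta_enet = enet_cd(X, y, lam=100.0, alpha=0.5)
beta_lasso = enet_cd(X, y, lam=100.0, alpha=1.0)
print("elastic net:", beta_enet)   # weight shared between the two copies
print("lasso:      ", beta_lasso)  # weight concentrated on one copy
```

The ridge part of the penalty makes the objective strictly convex, which is why the elastic net spreads the coefficient evenly over the redundant group.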

19 Elastic net. [Figure panels: α = 0.2, more like Ridge; α = 0.8, more like Lasso.]

20 4 Adapting the Lasso to Particular Problems. Group lasso: when we are interested in including entire groups of explanatory variables. Sparsity on both group and individual level. Composite Absolute Penalties (CAP). Fused Lasso: adjacent coefficients get similar values. Multiresponse regression.

21 5 Generalized Linear Models. More general. The link function $g(\cdot)$ defines how the linear contribution of the explanatory variables affects the expectation: $g(E(y \mid x)) = \beta_0 + x^T \beta$.

22 5.1 Logistic Regression
Y categorical (can belong to 2 classes). Link function = logit function: $g(a) = \log\left( \frac{a}{1 - a} \right)$.
Solution with L1 penalty: $\hat\beta = \arg\min_\beta \left( -\sum_{i=1}^{N} \left( y_i x_i^T \beta - \log\left(1 + e^{x_i^T \beta}\right) \right) + \lambda \|\beta\|_1 \right)$.
Can be solved by several algorithms. There are also methods for L1-regularized multinomial logistic regression (more than 2 classes).
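One of the simpler algorithms for this problem is proximal gradient descent (ISTA), which alternates a gradient step on the negative log-likelihood with soft-thresholding. The sketch below is an illustration under stated assumptions, not the method recommended in the survey: it uses no intercept (as in the formula above), a fixed step size $1/L$ with $L = \|X\|_2^2 / 4$, simulated data, and hypothetical function names.

```python
import numpy as np

def objective(beta, X, y, lam):
    """L1-penalized negative log-likelihood of the logistic model."""
    z = X @ beta
    return -np.sum(y * z - np.logaddexp(0.0, z)) + lam * np.sum(np.abs(beta))

def l1_logistic_ista(X, y, lam, n_iter=500):
    """Proximal gradient descent with a fixed step size 1 / L."""
    L = np.linalg.norm(X, 2) ** 2 / 4.0      # bound on the Hessian's largest eigenvalue
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        prob = np.exp(-np.logaddexp(0.0, -(X @ beta)))   # stable sigmoid
        grad = X.T @ (prob - y)                           # gradient of the smooth part
        step = beta - grad / L
        beta = np.sign(step) * np.maximum(np.abs(step) - lam / L, 0.0)
    return beta

# Simulated two-class data (assumption: 2 of 4 coefficients are non-zero)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([1.5, 0.0, -2.0, 0.0]))))
y = (rng.random(200) < p_true).astype(float)

beta_hat = l1_logistic_ista(X, y, lam=5.0)
print(beta_hat)
```

With this step size each iteration cannot increase the objective, which makes the method easy to reason about even though faster solvers exist.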

23 5.2 Poisson Regression
Y: non-negative integer. Modeling count data. Link function: $g(a) = \log(a)$. Model: $\log(E(y \mid x)) = \beta_0 + x^T \beta$.
Solution with L1 penalty: $\hat\beta = \arg\min_\beta \left( \sum_{i=1}^{N} \left( -y_i (\beta_0 + x_i^T \beta) + e^{\beta_0 + x_i^T \beta} \right) + \lambda \|\beta\|_1 \right)$.
Gives sparse estimates of β.
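The L1-penalized Poisson objective can be minimized in the same spirit. The following is a minimal sketch using proximal gradient descent with a backtracking line search, chosen here because the exponential term has no global Lipschitz bound; the intercept $\beta_0$ is left unpenalized, and the function names, data and settings are assumptions made for the example.

```python
import numpy as np

def smooth_part(theta, X, y):
    """Negative Poisson log-likelihood (up to a constant); theta = (beta_0, beta)."""
    eta = theta[0] + X @ theta[1:]
    return np.sum(-y * eta + np.exp(eta))

def gradient(theta, X, y):
    eta = theta[0] + X @ theta[1:]
    r = np.exp(eta) - y
    return np.concatenate(([r.sum()], X.T @ r))

def prox(theta, t):
    """Soft-threshold the slopes by t; the intercept is not penalized."""
    out = theta.copy()
    out[1:] = np.sign(theta[1:]) * np.maximum(np.abs(theta[1:]) - t, 0.0)
    return out

def penalized(theta, X, y, lam):
    return smooth_part(theta, X, y) + lam * np.sum(np.abs(theta[1:]))

def l1_poisson(X, y, lam, n_iter=300):
    theta = np.zeros(X.shape[1] + 1)
    for _ in range(n_iter):
        g = gradient(theta, X, y)
        f_old = smooth_part(theta, X, y)
        t = 1.0
        while True:                      # backtracking: halve t until the step is safe
            new = prox(theta - t * g, lam * t)
            d = new - theta
            with np.errstate(over="ignore"):
                f_new = smooth_part(new, X, y)
            if f_new <= f_old + g @ d + d @ d / (2.0 * t) or t < 1e-12:
                break
            t *= 0.5
        theta = new
    return theta

# Simulated count data (assumption: intercept 0.3, one irrelevant predictor)
rng = np.random.default_rng(2)
X = 0.5 * rng.normal(size=(200, 3))
y = rng.poisson(np.exp(0.3 + X @ np.array([0.8, 0.0, -0.5]))).astype(float)

theta_hat = l1_poisson(X, y, lam=5.0)
print(theta_hat)   # first entry is the intercept
```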

24 6.1 Wavelet Analysis
A method for representing a complicated function in a simpler manner. W: an orthonormal basis matrix. It gives a basis for all square-integrable functions, i.e. every square-integrable function can be represented as a (possibly infinite) linear combination of these basis functions. y: data vector.
Find the optimal z such that $y \approx Wz$: $\hat z = \arg\min_z \left( \|y - Wz\|_2^2 + \lambda \|z\|_1 \right)$.
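Because W is orthonormal, this problem has a closed-form solution: soft-thresholding of the wavelet coefficients $W^T y$ at $\lambda/2$. A minimal sketch, assuming the squared-error objective $\|y - Wz\|_2^2 + \lambda\|z\|_1$, a Haar basis of length 8 built by hand, and a simulated piecewise-constant signal; the function names are illustrative.

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar transform matrix (rows are basis vectors); n a power of 2."""
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    averages = np.kron(h, np.array([[1.0, 1.0]]))
    details = np.kron(np.eye(n // 2), np.array([[1.0, -1.0]]))
    return np.vstack([averages, details]) / np.sqrt(2.0)

def wavelet_lasso(y, W, lam):
    """Minimizer of ||y - W z||_2^2 + lam * ||z||_1 for orthonormal W."""
    coef = W.T @ y
    return np.sign(coef) * np.maximum(np.abs(coef) - lam / 2.0, 0.0)

n = 8
W = haar_matrix(n).T                       # columns are the basis vectors
rng = np.random.default_rng(0)
signal = np.array([4.0] * 4 + [1.0] * 4)   # one jump in the middle
y = signal + 0.1 * rng.normal(size=n)

z_hat = wavelet_lasso(y, W, lam=1.0)
y_smooth = W @ z_hat                       # denoised reconstruction
print(np.count_nonzero(z_hat), "non-zero wavelet coefficients")
```

Only the coefficients carrying the overall level and the jump survive the threshold, so the noisy signal is represented by a few basis functions.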

25 6.2 Autoregressive Models. A model where the output depends linearly on its previous values. L1 penalty for a sparse solution. Also a group penalty.

26 7 Discussion. Lasso: minimizing the RSS subject to an L1 constraint = minimizing the L1-penalized negative log-likelihood function. Wide applications. Advantages: variable selection; bias/variance trade-off.

27 7 Discussion (2). Extension to non-linear models: use complex models for the entire data domain, with a dictionary of functions; or simple (linear) models for different parts of the data domain. Remaining work: techniques for making rigorous inference.

28 Sources Hastie, Tibshirani and Friedman (2008): The Elements of Statistical Learning
