LARS/lasso
Statistics 312/612, fall 2010
Homework #9. Due: Monday 29 November

The questions on this sheet all refer to (a slight modification of) the lasso modification of the LARS algorithm, based on the papers by Efron, Hastie, Johnstone, and Tibshirani (2004) and by Rosset and Zhu (2007). I have made a few small changes to the algorithm described in the first paper. I hope I haven't broken anything. The homework problems are embedded into an explanation of how and why the algorithm works. Look for the numbers [1], [2], ... in the left margin and the text enclosed in a box.

1 The Lasso problem

The problem is: given an n × 1 vector y and an n × p matrix X, find the b(λ) that minimizes

    L_λ(b) = ||y - Xb||^2 + 2λ Σ_j |b_j|

for each λ. [The extra factor of 2 eliminates many factors of 2 in what follows.] The columns of X will be assumed to be standardized to have zero means and ||X_j|| = 1.

Remark. I thought the vector y was also supposed to have a zero mean. That is not the case for the diabetes data set in R, the data set used as an illustration in the LARS paper (Efron, Hastie, Johnstone, and Tibshirani 2004), where the LARS procedure was first analyzed.

The solution will turn out to be piecewise linear and continuous in λ. To test my modification I also used the diabetes data set:

    library(lars)
    data(diabetes)                # load the data set
    # ?diabetes for a description
    dd = diabetes
    LL = lars(diabetes$x, diabetes$y, type = "lasso")
    plot(LL, xvar = "step")       # the usual plot

My modified algorithm gets the same coefficients as lars() for this data set.
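As a quick numerical check of the standardization assumed above, and of the Remark about y, the short sketch below (my own check, not part of the assignment) verifies that the columns of diabetes$x have essentially zero means and unit lengths while y itself is not centered:

    # Sketch: verify the standardization assumption for the diabetes predictors.
    library(lars)
    data(diabetes)
    x <- diabetes$x                  # 442 x 10 matrix of predictors
    y <- diabetes$y
    max(abs(colMeans(x)))            # column means: should be essentially 0
    range(colSums(x^2))              # squared column lengths: should be essentially 1
    mean(y)                          # not zero, as noted in the Remark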

Remark. For the sake of exposition, the discussion below assumes that the predictors enter the active set in a way slightly different from what happens for diabetes. I will consider only the "one at a time" case (page 417 of the LARS paper), for which the active set of predictors X_j changes only by either addition or deletion of a single predictor.

2 Directional derivative

The function L_λ has derivative at b in the direction u defined by

<1>    lim_{t↓0} [L_λ(b + tu) - L_λ(b)] / t = 2 Σ_j [λ f(b_j, u_j) - (y - Xb)'X_j u_j]

where f(b_j, u_j) = u_j{b_j > 0} - u_j{b_j < 0} + |u_j|{b_j = 0}.

By convexity, a vector b minimizes L_λ if and only if the directional derivative <1> is ≥ 0 for every u. Equivalently, for every j and every u_j the jth summand in <1> must be nonnegative. [Consider u vectors with only one nonzero component to establish this equivalence.] That is, b minimizes L_λ if and only if

    λ f(b_j, u_j) ≥ (y - Xb)'X_j u_j    for every j and every u_j.

When b_j ≠ 0 the inequalities for u_j = ±1 imply an equality; for b_j = 0 we get only the inequality. Thus b minimizes L_λ if and only if

    λ = X_j'R      if b_j > 0
    λ = -X_j'R     if b_j < 0        where R := y - Xb.
    λ ≥ |X_j'R|    if b_j = 0

The LARS/lasso algorithm recursively calculates a sequence of breakpoints ∞ > λ_1 > λ_2 > ... with b(λ) linear on each interval λ_{k+1} ≤ λ ≤ λ_k. Define the residual vector and correlations

    R(λ) := y - X b(λ)    and    C_j(λ) := X_j'R(λ).

Remark. To get a true correlation we would have to divide by ||R(λ)||, which would complicate the constraints.
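As a numerical aside (my own sketch, not part of the assignment), the minimizing conditions above can be checked directly for any candidate b and λ by computing R and the C_j = X_j'R; the function name and tolerance here are my own choices:

    # Sketch: check the minimizing conditions for a given b and lambda.
    check_lasso_conditions <- function(X, y, b, lambda, tol = 1e-6) {
      R <- y - X %*% b
      C <- drop(crossprod(X, R))                             # C_j = X_j' R
      list(
        active_plus  = all(abs(C[b > 0] - lambda) < tol),    # lambda =  C_j when b_j > 0
        active_minus = all(abs(C[b < 0] + lambda) < tol),    # lambda = -C_j when b_j < 0
        inactive     = all(abs(C[b == 0]) <= lambda + tol)   # lambda >= |C_j| when b_j = 0
      )
    }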

<2> The algorithm will ensure that

    λ = C_j(λ)      if b_j(λ) > 0    (constraint +)
    λ = -C_j(λ)     if b_j(λ) < 0    (constraint -)
    λ ≥ |C_j(λ)|    if b_j(λ) = 0    (constraint 0)

That is, for the minimizing b(λ) each (λ, C_j(λ)) needs to stay inside the region R = {(λ, c) ∈ R_+ × R : |c| ≤ λ}, moving along the top boundary (c = λ) when b_j(λ) > 0 (constraint +), along the lower boundary (c = -λ) when b_j(λ) < 0 (constraint -), and being anywhere in R when b_j(λ) = 0 (constraint 0).

[Figure: "Modified lasso algorithm for diabetes data". Two panels trace the coefficients b_j(λ) and the correlations C_j(λ) for the ten diabetes predictors (age, sex, bmi, map, tc, ldl, hdl, tch, ltg, glu) as λ decreases. See the file lassofull.pdf in the Handouts directory for higher resolution.]

3 The algorithm

The solution b(λ) is to be constructed in a sequence of steps, starting with large λ and working towards λ = 0.

First step. Define λ_1 = max_j |X_j'y|. For λ ≥ λ_1 take b(λ) = 0, so that |C_j(λ)| ≤ λ_1. Constraint 0 would be violated if we kept b(λ) equal to zero for λ < λ_1; the b(λ) must move away from zero as λ decreases below λ_1. We must have |C_j(λ_1)| = λ_1 for at least one j. For convenience of exposition, suppose C_1(λ_1) = λ_1 > |C_j(λ_1)| for all j ≥ 2. The active set is now A = {1}.
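For the diabetes data, the first breakpoint and the first active predictor follow directly from this definition; a minimal sketch (my own illustration, with the expected answer hedged in the comments):

    # Sketch: the first step for the diabetes data.
    library(lars)
    data(diabetes)
    x <- diabetes$x; y <- diabetes$y
    C0 <- drop(crossprod(x, y))         # C_j(lambda) for lambda >= lambda_1, since b(lambda) = 0
    lambda1 <- max(abs(C0))             # the first breakpoint lambda_1
    colnames(x)[which.max(abs(C0))]     # the first predictor to enter; should be "bmi"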

For λ_2 ≤ λ < λ_1, with λ_2 to be specified soon, keep b_j = 0 for j ≥ 2 but let b_1(λ) = 0 + v_1(λ_1 - λ) for some constant v_1. We need C_1(λ) = λ for a while, which forces v_1 = 1. That choice gives

    R(λ) = y - X_1(λ_1 - λ)    and    C_j(λ) = C_j(λ_1) - X_j'X_1(λ_1 - λ).

Notice that b_1(λ) > 0, so that constraint + is the relevant constraint for b_1. Let λ_2 be the largest λ less than λ_1 for which max_{j≥2} |C_j(λ)| = λ.

*[1] Find the value λ_2. That is, express λ_2 as a function of the C_j's. Hint: Consider Efron, Hastie, Johnstone, and Tibshirani (2004, equation 2.13), with appropriate changes of notation.

Second step. We have C_1(λ_2) = λ_2, by construction. For convenience of exposition, suppose C_2(λ_2) = λ_2 > |C_j(λ_2)| for all j ≥ 3. The active set is now A = {1, 2}. For λ_3 ≤ λ < λ_2 choose a new v_1 and a v_2, then define

    b_1(λ) = b_1(λ_2) + (λ_2 - λ)v_1
    b_2(λ) = 0 + (λ_2 - λ)v_2

with all other b_j's still zero. Write Z for [X_1, X_2]. The new C_j's become

    C_j(λ) = X_j'(y - X_1 b_1(λ) - X_2 b_2(λ)) = C_j(λ_2) - (λ_2 - λ)X_j'Zv    where v = (v_1, v_2)'.

*[2] Show that C_1(λ) = C_2(λ) = λ if we choose v to make Z'Zv = (1, 1)'. That is, v = (Z'Z)^{-1}1 with 1 = (1, 1)'. Of course we must assume that X_1 and X_2 are linearly independent for Z'Z to have an inverse.

*[3] Show that v_1 and v_2 are both positive, so that constraint + is the relevant constraint for both b_1 and b_2.

Let λ_3 be the largest λ less than λ_2 for which max_{j≥3} |C_j(λ)| = λ.

Third step. We have C_1(λ_3) = C_2(λ_3) = λ_3 and max_{j≥3} |C_j(λ_3)| = λ_3. Consider now what happens if the last equality holds because a C_j has hit the lower boundary of the region R: suppose -C_3(λ_3) = λ_3 > |C_j(λ_3)| for all j ≥ 4. The active set then becomes A = {1, 2, 3}. For λ_4 ≤ λ < λ_3 choose a new v_1, v_2 and a v_3 to make

    b_1(λ) = b_1(λ_3) + (λ_3 - λ)v_1
    b_2(λ) = b_2(λ_3) + (λ_3 - λ)v_2
    b_3(λ) = 0 - (λ_3 - λ)v_3

with all other b_j's still zero. More concisely, with s_j = sign(C_j(λ_3)),

    b_j(λ) = b_j(λ_3) + (λ_3 - λ)v_j s_j    for j ∈ A.

For λ_3 - λ small enough, we still have b_1(λ) > 0 and b_2(λ) > 0, keeping constraint + as the relevant constraint for b_1 and b_2. To ensure that b_3(λ) < 0 we need v_3 > 0. Define Z = [X_1, X_2, -X_3].

*[4] For λ ≤ λ_3 show that

    R(λ) - R(λ_3) = -Σ_{j∈A} Z_j v_j (λ_3 - λ)

so that

    C_j(λ) = C_j(λ_3) - (λ_3 - λ)X_j'Zv.

[5] Show that v = (Z'Z)^{-1}1 keeps b_1, b_2, and b_3 on the correct boundaries. How should we choose λ_4? Is it possible for some active b_j to switch sign? Prove that v_3 > 0. I am not yet clear on why this must be true. Efron, Hastie, Johnstone, and Tibshirani (2004, Lemma 4) seems to contain the relevant argument.

The general step. Just before λ = λ_{k+1} suppose the active set is A, so that |C_j(λ)| = λ for each j in A. For each j in A let s_j = sign(C_j(λ_k)). For λ_{k+1} ≤ λ ≤ λ_k define

    b_j(λ) = b_j(λ_k) + (λ_k - λ)v_j s_j    for j ∈ A.

*[6] Define Z as the matrix with jth column equal to s_j X_j for each j ∈ A. For λ_{k+1} ≤ λ ≤ λ_k show that

    C_j(λ) = C_j(λ_k) - (λ_k - λ)X_j'Zv,

so that the choice v = (Z'Z)^{-1}1 keeps C_j(λ) = s_j λ for j ∈ A.
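The algebra of the general step translates directly into a few lines of R. The sketch below is my own illustration (the function and argument names are mine, not part of the assignment): given the data, the current coefficient vector b(λ_k), the active set, its signs s_j, and λ_k, it computes v = (Z'Z)^{-1}1 and returns the piecewise-linear maps λ ↦ b(λ) and λ ↦ C(λ).

    # Sketch of one general step (illustrative only, not the assignment's solution).
    #   x, y   : data;  bk : b(lambda_k), a length-p vector (zero off the active set)
    #   active : indices of A;  s : signs s_j = sign(C_j(lambda_k));  lam_k : current breakpoint
    general_step_direction <- function(x, y, bk, active, s, lam_k) {
      Z <- sweep(x[, active, drop = FALSE], 2, s, "*")          # columns s_j X_j
      v <- drop(solve(crossprod(Z), rep(1, length(active))))    # v = (Z'Z)^{-1} 1
      Ck <- drop(crossprod(x, y - x %*% bk))                    # C_j(lambda_k)
      b_at <- function(lambda) {                                # b(lambda) on [lambda_{k+1}, lambda_k]
        b <- bk
        b[active] <- bk[active] + (lam_k - lambda) * v * s
        b
      }
      C_at <- function(lambda) Ck - (lam_k - lambda) * drop(crossprod(x, Z %*% v))
      list(v = v, b_at = b_at, C_at = C_at)
    }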

Define λ_{k+1} to be the largest λ less than λ_k for which either of these conditions holds:

(i) max_{j∉A} |C_j(λ)| = λ. In that case add the new j ∈ A^c for which |C_j(λ_{k+1})| = λ_{k+1} to the active set, then carry out another general step.

(ii) b_j(λ) = 0 for some j ∈ A. In that case, remove j from the active set, then carry out another general step.

For diabetes, the second alternative caused the behavior shown below [see lassoending.pdf for higher resolution] for 1.3 ≤ λ ≤ 2.2.

[Figure: "Modified lasso algorithm for diabetes data", detail of the end of the path. The upper panel traces the b_j's and the lower panel the C_j's over this range of λ (bmi, map, ldl, tch, ltg, glu active; age, sex, tc, hdl not).]

*[7] Find λ_{k+1}. That is, write out what an R program would have to calculate. Note that this step is slightly tricky. If predictor j is removed from the active set because b_j = 0 then the corresponding C_j will lie on the boundary of R. You will need to interpret (i) with care to avoid λ_{k+1} = λ_k.

[8] Show that the coefficient b_j(λ), if j enters the active set, has the correct sign. Again Efron, Hastie, Johnstone, and Tibshirani (2004, Lemma 4) seems to contain the relevant argument.

If you think you understand the modified algorithm, as described above, try to submit your solution to the next problem to me by Friday 3 December.

[9] Write an R program (a script or a function) that implements the modified algorithm. Test it out on the diabetes data set.
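For problem [9], one convenient test is to compare the whole coefficient path with the lars() output from the first section. A minimal sketch, assuming your implementation is a function getlasso(x, y) (a hypothetical name) returning the matrix of coefficient vectors at the breakpoints, one row per breakpoint in the same order as the lars() steps:

    # Sketch of a test against lars(); getlasso() stands for your own implementation.
    library(lars)
    data(diabetes)
    LL <- lars(diabetes$x, diabetes$y, type = "lasso")
    B  <- getlasso(diabetes$x, diabetes$y)   # one row of coefficients per breakpoint (hypothetical)
    max(abs(B - coef(LL)))                   # should be negligible if the rows line up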

References

Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. The Annals of Statistics 32(2), 407-451.

Rosset, S. and J. Zhu (2007). Piecewise linear regularized solution paths. The Annals of Statistics 35(3), 1012-1030.