LARS/lasso — Statistics 312/612, Fall 2010

Homework 9 stepped you through the lasso modification of the LARS algorithm, based on the papers by Efron, Hastie, Johnstone, and Tibshirani (2004) (= the LARS paper) and by Rosset and Zhu (2007). I made a few small changes to the algorithm in the LARS paper. I used the diabetes data set in R (the data used as an illustration in the LARS paper) to test my modification:

    library(lars)
    data(diabetes)                          # load the data set
    ll = lars(diabetes$x, diabetes$y, type="lasso")
    plot(ll, xvar="step")                   # one variation on the usual plot

[Figure: LASSO standardized coefficients plotted against step number.]

My modified algorithm getlasso() produces the same sequence of coefficients as the R function lars() when run on the diabetes data set:

    jj = getlasso(diabetes$y, diabetes$x)
    round(ll$beta[-1,] - t(jj$output["coef.end",,]), 6)   # gives all zeros

I am fairly confident that my changes lead to the same sequences of lasso fits. I find my plots, which are slightly different from those produced by the lars library, more helpful for understanding the algorithm:
    show2(jj)   # my plot

[Figure: "Modified lasso algorithm for diabetes data" — the upper panel traces the $\hat b_j$'s and the lower panel the $C_j$'s against $\lambda$, labeled by predictor (age, sex, bmi, map, tc, ldl, hdl, tch, ltg, glu). For higher resolution, see the pdf files attached to this handout.]

1 The Lasso problem

The problem is: given an $n \times 1$ vector $y$ and an $n \times p$ matrix $X$, find the $\hat b(\lambda)$ that minimizes
$$L_\lambda(b) = \|y - Xb\|^2 + 2\lambda \sum_j |b_j|$$
for each $\lambda \ge 0$. [The extra factor of 2 eliminates many factors of 2 in what follows.] The columns of $X$ will be assumed to be standardized to have zero means and $\|X_j\| = 1$. I also seem to need linear independence of various subsets of columns of $X$, which would be awkward if $p$ were larger than $n$.

Remark. I thought the vector $y$ was also supposed to have a zero mean. That is not the case for the diabetes data set in R. It does not seem to be needed for the algorithm to work.

I will consider only the "one at a time" case (page 417 of the LARS paper), for which the active set of predictors $X_j$ changes only by either addition or deletion of a single predictor.
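The handout's own code is in R. As a quick numeric companion to this section (a sketch of mine, not part of the handout; the helper names `standardize` and `lasso_objective` are invented), the objective and the assumed column standardization can be written in NumPy:

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit Euclidean norm,
    matching the handout's assumptions on the columns of X."""
    Xc = X - X.mean(axis=0)
    return Xc / np.linalg.norm(Xc, axis=0)

def lasso_objective(b, X, y, lam):
    """L_lambda(b) = ||y - Xb||^2 + 2*lambda*sum_j |b_j|."""
    r = y - X @ b
    return r @ r + 2.0 * lam * np.abs(b).sum()

rng = np.random.default_rng(0)
X = standardize(rng.normal(size=(50, 3)))
y = rng.normal(size=50)
print(np.abs(X.mean(axis=0)).max())    # ~0: columns have zero mean
print(np.linalg.norm(X, axis=0))       # ~[1 1 1]: columns have unit norm
```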
2 Directional derivative

<1> The function $L_\lambda$ has derivative at $b$ in the direction $u$ defined by
$$L'_\lambda(b, u) = \lim_{t \downarrow 0} \frac{L_\lambda(b + tu) - L_\lambda(b)}{t} = 2\sum_j \left[\lambda D(b_j, u_j) - (y - Xb)'X_j u_j\right]$$
where $D(b_j, u_j) = u_j\{b_j > 0\} - u_j\{b_j < 0\} + |u_j|\{b_j = 0\}$.

By convexity, a vector $b$ minimizes $L_\lambda$ if and only if $L'_\lambda(b, u) \ge 0$ for every $u$. Equivalently, for every $j$ and every $u_j$ the $j$th summand in <1> must be nonnegative. [Consider $u$ vectors with only one nonzero component to establish this equivalence.] That is, $b$ minimizes $L_\lambda$ if and only if
$$\lambda D(b_j, u_j) \ge (y - Xb)'X_j u_j \qquad \text{for every } j \text{ and every } u_j.$$
When $b_j \ne 0$ the inequalities for $u_j = \pm 1$ imply an equality; for $b_j = 0$ we get only the inequality. Thus $b$ minimizes $L_\lambda$ if and only if
$$\begin{aligned}
\lambda &= X_j'R  && \text{if } b_j > 0\\
-\lambda &= X_j'R && \text{if } b_j < 0\\
\lambda &\ge |X_j'R| && \text{if } b_j = 0
\end{aligned}
\qquad \text{where } R := y - Xb.$$

The LARS/lasso algorithm recursively calculates a sequence of breakpoints $\infty > \lambda_1 > \lambda_2 > \dots$ with $\hat b(\lambda)$ linear on each interval $\lambda_{k+1} \le \lambda \le \lambda_k$. Define the residual vector and correlations
$$R(\lambda) := y - X\hat b(\lambda) \qquad \text{and} \qquad C_j(\lambda) := X_j'R(\lambda).$$

Remark. To get a true correlation we would have to divide by $\|R(\lambda)\|$, which would complicate the constraints.

<2> The algorithm will ensure that
$$\begin{aligned}
\lambda &= C_j(\lambda)  && \text{if } \hat b_j(\lambda) > 0 && \text{(constraint +)}\\
-\lambda &= C_j(\lambda) && \text{if } \hat b_j(\lambda) < 0 && \text{(constraint −)}\\
\lambda &\ge |C_j(\lambda)| && \text{if } \hat b_j(\lambda) = 0 && \text{(constraint 0)}
\end{aligned}$$
That is, for the minimizing $\hat b(\lambda)$ each $(\lambda, C_j(\lambda))$ needs to stay inside the region $\mathcal{R} = \{(\lambda, c) \in \mathbb{R}^+ \times \mathbb{R} : |c| \le \lambda\}$, moving along the top boundary ($c = \lambda$) when $\hat b_j(\lambda) > 0$ (constraint +), along the lower boundary ($c = -\lambda$) when $\hat b_j(\lambda) < 0$ (constraint −), and being anywhere in $\mathcal{R}$ when $\hat b_j(\lambda) = 0$ (constraint 0).
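With a single standardized predictor the minimizing conditions can be checked directly: they are solved by soft thresholding, $\hat b = \operatorname{sign}(c)\max(|c| - \lambda, 0)$ with $c = X_1'y$. The following NumPy check (mine, not from the handout) verifies that the resulting correlation sits exactly on the $\operatorname{sign}(\hat b)\,\lambda$ boundary:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
x = x - x.mean()
x /= np.linalg.norm(x)                 # standardized: zero mean, unit norm
y = rng.normal(size=100) + 3 * x

c = x @ y                              # X_1'y
lam = 0.5 * abs(c)                     # small enough that b_hat != 0
b_hat = np.sign(c) * max(abs(c) - lam, 0.0)   # soft thresholding

R = y - x * b_hat                      # residual at the minimizer
print(x @ R, np.sign(b_hat) * lam)     # equal: lambda = X_1'R when b_hat > 0
```

For $\lambda \ge |c|$ the minimizer is $\hat b = 0$ and only the inequality $\lambda \ge |X_1'R|$ holds, matching the third line of the display.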
3 The algorithm

The solution $\hat b(\lambda)$ is continuous in $\lambda$ and linear on intervals defined by change points $\infty = \lambda_0 > \lambda_1 > \lambda_2 > \dots > 0$. The construction proceeds in steps, starting with large $\lambda$ and working towards $\lambda = 0$. The vector of fitted values $\hat f(\lambda) = X\hat b(\lambda)$ is also piecewise linear. Within each interval $(\lambda_{k+1}, \lambda_k)$ only the active subset $A = A_k = \{j : \hat b_j(\lambda) \ne 0\}$ of the coefficients changes; the inactive coefficients stay fixed at zero.

3.1 Some illuminating special cases

It helped me to work explicitly through the first few steps before thinking about the equations that define a general step in the algorithm.

Start with $A = \emptyset$ and $\hat b(\lambda) = 0$ for $\lambda \ge \lambda_1 := \max_j |X_j'y|$. Constraint 0 is satisfied on $[\lambda_1, \infty)$.

Step 1. Constraint 0 would be violated if we kept $\hat b(\lambda)$ equal to zero for $\lambda < \lambda_1$, because we would have $\max_j |C_j(\lambda)| > \lambda$. The $\hat b(\lambda)$ must move away from zero as $\lambda$ decreases below $\lambda_1$.

We must have $|C_j(\lambda_1)| = \lambda_1$ for at least one $j$. For convenience of exposition, suppose $C_1(\lambda_1) = \lambda_1 > |C_j(\lambda_1)|$ for all $j \ge 2$. The active set now becomes $A = \{1\}$. For $\lambda_2 \le \lambda < \lambda_1$, with $\lambda_2$ to be specified soon, keep $\hat b_j(\lambda) = 0$ for $j \ge 2$ but let $\hat b_1(\lambda) = 0 + v_1(\lambda_1 - \lambda)$ for some constant $v_1$. To maintain the equality
$$\lambda = C_1(\lambda) = X_1'(y - X_1\hat b_1(\lambda)) = C_1(\lambda_1) - X_1'X_1\,v_1(\lambda_1 - \lambda) = \lambda_1 - v_1(\lambda_1 - \lambda)$$
we need $v_1 = 1$. This choice also ensures that $\hat b_1(\lambda) > 0$ for a while, so that + is the relevant constraint for $\hat b_1$.

For $\lambda < \lambda_1$, with $v_1 = 1$ we have $R(\lambda) = y - X_1(\lambda_1 - \lambda)$ and
$$C_j(\lambda) = C_j(\lambda_1) - a_j(\lambda_1 - \lambda) \qquad \text{where } a_j := X_j'X_1.$$
Notice that $|a_j| < 1$ unless $X_j = \pm X_1$. Also, as long as $\max_{j \ge 2} |C_j(\lambda)| \le \lambda$ the other $\hat b_j$'s still satisfy constraint 0.
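Numerically (my own check, not the handout's R code), the choice $v_1 = 1$ does keep the leading correlation on the boundary as $\lambda$ decreases; with a general sign $s_1 = \operatorname{sign}(C_{j_1}(\lambda_1))$ the step-1 coefficient is $\hat b_{j_1}(\lambda) = s_1(\lambda_1 - \lambda)$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 4))
X = (X - X.mean(axis=0)) / np.linalg.norm(X - X.mean(axis=0), axis=0)
y = rng.normal(size=60) + 2 * X[:, 0]

C0 = X.T @ y
lam1 = np.abs(C0).max()                 # lambda_1 = max_j |X_j'y|
j1 = int(np.argmax(np.abs(C0)))         # the first active predictor
s1 = np.sign(C0[j1])

lam = 0.9 * lam1                        # some lambda just below lambda_1
b = np.zeros(4)
b[j1] = s1 * (lam1 - lam)               # v_1 = 1
C = X.T @ (y - X @ b)
print(C[j1], s1 * lam)                  # equal: C_{j1}(lambda) tracks the boundary
```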
We need to end the first step at $\lambda_2$, the largest $\lambda$ less than $\lambda_1$ for which $\max_{j \ge 2} |C_j(\lambda)| = \lambda$. Solve $C_j(\lambda) = \pm\lambda$ for each fixed $j \ge 2$:
$$\begin{aligned}
\lambda = C_j(\lambda):\quad & \lambda_1 - (\lambda_1 - \lambda) = C_j(\lambda_1) - a_j(\lambda_1 - \lambda)
&&\iff \lambda_1 - \lambda = \frac{\lambda_1 - C_j(\lambda_1)}{1 - a_j}\\
-\lambda = C_j(\lambda):\quad & -\lambda_1 + (\lambda_1 - \lambda) = C_j(\lambda_1) - a_j(\lambda_1 - \lambda)
&&\iff \lambda_1 - \lambda = \frac{\lambda_1 + C_j(\lambda_1)}{1 + a_j}
\end{aligned}$$

<3> Both right-hand sides are strictly positive. Thus $\lambda_2 = \lambda_1 - \delta\lambda$ where
$$\delta\lambda := \min_{j \ge 2} \min\left(\frac{\lambda_1 - C_j(\lambda_1)}{1 - a_j},\; \frac{\lambda_1 + C_j(\lambda_1)}{1 + a_j}\right).$$

Second step. We have $C_1(\lambda_2) = \lambda_2 = \max_{j \ge 2} |C_j(\lambda_2)|$, by construction. For convenience of exposition, suppose $|C_2(\lambda_2)| = \lambda_2 > |C_j(\lambda_2)|$ for all $j \ge 3$. The active set now becomes $A = \{1, 2\}$.

To emphasize a subtle point it helps to consider separately two cases. Write $s_2$ for $\operatorname{sign}(C_2(\lambda_2))$, so that $C_2(\lambda_2) = s_2\lambda_2$.

Case $s_2 = +1$: For $\lambda_3 \le \lambda < \lambda_2$ and a new $v_1$ and $v_2$ (note the recycling of notation), define
$$\hat b_1(\lambda) = \hat b_1(\lambda_2) + (\lambda_2 - \lambda)v_1, \qquad \hat b_2(\lambda) = 0 + (\lambda_2 - \lambda)v_2,$$
with all other $\hat b_j$'s still zero. Write $Z$ for $[X_1, X_2]$. The new $C_j$'s become
$$C_j(\lambda) = X_j'\left(y - X_1\hat b_1(\lambda) - X_2\hat b_2(\lambda)\right) = C_j(\lambda_2) - (\lambda_2 - \lambda)X_j'Zv \qquad \text{where } v = (v_1, v_2)'.$$
We keep $C_1(\lambda) = C_2(\lambda) = \lambda$ if we choose $v$ to make $X_1'Zv = 1 = X_2'Zv$. That is, we need $v = (Z'Z)^{-1}\mathbf{1}$ with $\mathbf{1} = (1, 1)'$. Of course we must assume that $X_1$ and $X_2$ are linearly independent for $Z'Z$ to have an inverse.
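The breakpoint formula in <3> can be checked numerically (my own sketch; to cover a general leading sign $s_1$, the inner products here are $\tilde a_j = X_j'(s_1 X_{j_1})$). At $\lambda_2 = \lambda_1 - \delta\lambda$ a second correlation hits the $\pm\lambda$ boundary exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))
X = (X - X.mean(axis=0)) / np.linalg.norm(X - X.mean(axis=0), axis=0)
y = rng.normal(size=60) + 2 * X[:, 0]

C0 = X.T @ y
j1 = int(np.argmax(np.abs(C0)))
s1 = np.sign(C0[j1])
lam1 = abs(C0[j1])

a = X.T @ (s1 * X[:, j1])              # a_j = X_j'(s1 X_{j1})
others = [j for j in range(4) if j != j1]
# the two candidate decreases from <3>, for C_j = +lambda and C_j = -lambda
dlam = min(min((lam1 - C0[j]) / (1 - a[j]),
               (lam1 + C0[j]) / (1 + a[j])) for j in others)
lam2 = lam1 - dlam

b = np.zeros(4)
b[j1] = s1 * dlam                      # b_{j1}(lambda_2), with v_1 = 1
C = X.T @ (y - X @ b)
print(np.abs(C[others]).max(), lam2)   # equal: the next predictor hits the boundary
```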
Case $s_2 = -1$: For $\lambda_3 \le \lambda < \lambda_2$ and a new $v_1$ and $v_2$, define
$$\hat b_1(\lambda) = \hat b_1(\lambda_2) + (\lambda_2 - \lambda)v_1, \qquad \hat b_2(\lambda) = -(\lambda_2 - \lambda)v_2 \quad \text{(note the change of sign)},$$
with all other $\hat b_j$'s still zero. Write $Z$ for $[X_1, -X_2]$. The tricky business with the signs ensures that
$$X\hat b(\lambda) - X\hat b(\lambda_2) = X_1(\lambda_2 - \lambda)v_1 - X_2(\lambda_2 - \lambda)v_2 = (\lambda_2 - \lambda)Zv.$$
The new $C_j$'s become
$$C_j(\lambda) = X_j'\left(y - X_1\hat b_1(\lambda) - X_2\hat b_2(\lambda)\right) = C_j(\lambda_2) - (\lambda_2 - \lambda)X_j'Zv.$$
We keep $C_1(\lambda) = -C_2(\lambda) = \lambda$ if we choose $v$ to make $X_1'Zv = 1 = -X_2'Zv$. That is, again we need $v = (Z'Z)^{-1}\mathbf{1}$.

Remark. If we had assumed $C_1(\lambda_1) = -\lambda_1$ then $X_1$ would be replaced by $-X_1$ in the $Z$ matrix; that is, $Z = [s_1X_1, s_2X_2]$ with $s_1 = \operatorname{sign}(C_1(\lambda_1))$ and $s_2 = \operatorname{sign}(C_2(\lambda_2))$.

Because $\hat b_1(\lambda_2) > 0$, the correlation $C_1(\lambda)$ stays on the correct boundary for the + constraint. If $s_2 = +1$ we need $v_2 > 0$ to keep $\hat b_2(\lambda) > 0$ and $C_2(\lambda) = \lambda$, satisfying +. If $s_2 = -1$ we also need $v_2 > 0$ to keep $\hat b_2(\lambda) < 0$ and $C_2(\lambda) = -\lambda$, satisfying −. That is, in both cases we need $v_2 > 0$.

Why do we get a strictly positive $v_2$? Write $\rho$ for $s_2 X_2'X_1$. As the $Z'Z$ matrix is nonsingular we must have $|\rho| < 1$, so that
$$v = \begin{pmatrix} 1 & \rho\\ \rho & 1 \end{pmatrix}^{-1}\begin{pmatrix}1\\1\end{pmatrix}
= (1 - \rho^2)^{-1}\begin{pmatrix}1 & -\rho\\ -\rho & 1\end{pmatrix}\begin{pmatrix}1\\1\end{pmatrix}$$
and $v_2 = v_1 = (1 - \rho)/(1 - \rho^2) > 0$.

If no further $C_j(\lambda)$'s were to hit the $\pm\lambda$ boundary, step 2 could continue all the way to $\lambda = 0$. More typically, we would need to create a new active set at the largest $\lambda_3$ strictly smaller than $\lambda_2$ for which $\max_{j \ge 3} |C_j(\lambda)| = \lambda$. For the general step there is another possible event that would require a change to the active set: one of the $\hat b_j(\lambda)$'s in the active set might hit zero, threatening to change sign and leave the corresponding $C_j(\lambda)$ on the wrong boundary. I could pursue these special cases further, but it is better to start again for the generic step in the algorithm.
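A small numeric check (mine) of the $2 \times 2$ computation: with unit-norm columns, $Z'Z$ has the stated form and both coordinates of $v = (Z'Z)^{-1}\mathbf{1}$ equal $(1 - \rho)/(1 - \rho^2) > 0$, for either sign $s_2$:

```python
import numpy as np

rng = np.random.default_rng(4)
x1 = rng.normal(size=50); x1 = x1 - x1.mean(); x1 /= np.linalg.norm(x1)
x2 = rng.normal(size=50); x2 = x2 - x2.mean(); x2 /= np.linalg.norm(x2)

checks = []
for s2 in (+1.0, -1.0):
    Z = np.column_stack([x1, s2 * x2])         # Z = [s1 X1, s2 X2], with s1 = +1
    v = np.linalg.solve(Z.T @ Z, np.ones(2))   # v = (Z'Z)^{-1} 1
    rho = s2 * (x2 @ x1)
    print(v, (1 - rho) / (1 - rho ** 2))       # v1 = v2 = (1-rho)/(1-rho^2) > 0
    checks.append(np.allclose(v, (1 - rho) / (1 - rho ** 2)) and v.min() > 0)
```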
3.2 The general algorithm

Once again, start with $A = \emptyset$ and $\hat b(\lambda) = 0$ for $\lambda \ge \lambda_1 := \max_j |X_j'y|$. Constraint 0 is satisfied on $[\lambda_1, \infty)$.

At each $\lambda_k$ a new active set $A_k$ is defined. During the $k$th step the parameter $\lambda$ decreases from $\lambda_k$ to $\lambda_{k+1}$. For all $j$'s in the active set $A_k$, the coefficients $\hat b_j(\lambda)$ change linearly and the $C_j(\lambda)$'s move along one of the boundaries of the feasible region: $C_j(\lambda) = \lambda$ if $\hat b_j(\lambda) > 0$ and $C_j(\lambda) = -\lambda$ if $\hat b_j(\lambda) < 0$. For each inactive $j$ the coefficient $\hat b_j(\lambda)$ remains zero throughout $[\lambda_{k+1}, \lambda_k]$. Step $k$ ends when either an inactive $C_j(\lambda)$ hits a $\pm\lambda$ boundary or an active $\hat b_j(\lambda)$ becomes zero: $\lambda_{k+1}$ is defined as the largest $\lambda$ less than $\lambda_k$ for which either of these conditions holds:

(i) $\max_{j \notin A_k} |C_j(\lambda)| = \lambda$. In that case add the new $j \in A_k^c$ for which $|C_j(\lambda_{k+1})| = \lambda_{k+1}$ to the active set, then proceed to step $k+1$.

(ii) $\hat b_j(\lambda) = 0$ for some $j \in A_k$. In that case, remove $j$ from the active set, then proceed to step $k+1$.

For the diabetes data, alternative (ii) caused the behavior shown below for $1.3 \le \lambda \le 2.2$.

[Figure: detail of the "Modified lasso algorithm for diabetes data" plot, showing an active coefficient hitting zero and its predictor being dropped from the active set.]

I will show how the $C_j(\lambda)$'s and the $\hat b_j(\lambda)$'s can be chosen so that the conditions <2> are always satisfied.
At the start of step $k$ (with $\lambda = \lambda_k$), define $Z = [s_jX_j : j \in A_k]$, where $s_j := \operatorname{sign}(C_j(\lambda_k))$. That is, $Z$ has a column for each active predictor $X_j$, with the sign flipped if the corresponding $C_j(\lambda)$ is moving along the $-\lambda$ boundary. Define $v := (Z'Z)^{-1}\mathbf{1}$. For $\lambda_{k+1} \le \lambda \le \lambda_k$ the active coefficients change linearly,

<4> $\hat b_j(\lambda) = \hat b_j(\lambda_k) + (\lambda_k - \lambda)v_js_j$ for $j \in A_k$.

The fitted vector,
$$\hat f(\lambda) := X\hat b(\lambda) = \hat f(\lambda_k) + \sum_{j \in A_k} X_j\left(\hat b_j(\lambda) - \hat b_j(\lambda_k)\right) = \hat f(\lambda_k) + (\lambda_k - \lambda)Zv,$$
the residuals,
$$R(\lambda) := y - \hat f(\lambda) = R(\lambda_k) - (\lambda_k - \lambda)Zv,$$
and the correlations,

<5> $C_j(\lambda) := X_j'R(\lambda) = C_j(\lambda_k) - (\lambda_k - \lambda)a_j$ where $a_j := X_j'Zv$,

also change linearly. In particular, for $j \in A_k$,
$$C_j(\lambda) = s_j\lambda_k - (\lambda_k - \lambda)s_jZ_j'Zv = s_j\lambda \qquad \text{because } Z_j'Z(Z'Z)^{-1}\mathbf{1} = 1,$$
where $Z_j = s_jX_j$ denotes the corresponding column of $Z$. The active $C_j(\lambda)$'s move along one of the boundaries $\pm\lambda$ of $\mathcal{R}$.

Remember that I am assuming the "one at a time" condition: the active set changes only by addition of one new predictor in case (i) or by dropping one predictor in case (ii). Suppose index $\alpha$ enters the active set via (i) when $\lambda$ equals $\lambda_{k+1}$ then leaves it via (ii) when $\lambda$ equals $\lambda_{l+1}$. (If $\alpha$ never leaves the active set put $\lambda_{l+1}$ equal to 0.) Throughout the interval $(\lambda_{l+1}, \lambda_{k+1})$ both $\operatorname{sign}(C_\alpha(\lambda))$ and $\operatorname{sign}(\hat b_\alpha(\lambda))$ stay constant. To ensure that all constraints are satisfied it is enough to prove:

(a) For $\lambda$ only slightly smaller than $\lambda_{k+1}$, both $C_\alpha(\lambda)$ and $\hat b_\alpha(\lambda)$ have the same sign.

(b) For $\lambda$ only slightly smaller than $\lambda_{l+1}$, constraint 0 is satisfied; that is, $|C_\alpha(\lambda)| \le \lambda$.
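The handout's implementation, getlasso(), is in R and is not reproduced here. As an independent sketch (mine: the name `lasso_path` and all details are my own translation of <4>–<5> and the stopping events (i)/(ii) into NumPy, assuming standardized columns and the one-at-a-time condition), the general step can be iterated down to $\lambda = 0$, where $\hat b$ should agree with least squares:

```python
import numpy as np

def lasso_path(X, y, tol=1e-10, max_steps=100):
    """Trace the LARS/lasso breakpoints (lambda_k, b_hat(lambda_k)),
    following the general step of the handout."""
    n, p = X.shape
    b = np.zeros(p)
    C = X.T @ y
    lam = np.abs(C).max()
    active = [int(np.argmax(np.abs(C)))]
    path = [(lam, b.copy())]
    for _ in range(max_steps):
        if lam <= tol:
            break
        s = np.sign(C[active])
        Z = X[:, active] * s                       # columns s_j X_j
        v = np.linalg.solve(Z.T @ Z, np.ones(len(active)))
        a = X.T @ (Z @ v)                          # a_j = X_j' Z v, as in <5>
        cand, events = [lam], [("end", -1)]        # lambda stops at 0
        for j in set(range(p)) - set(active):      # (i): C_j hits a boundary
            for d in ((lam - C[j]) / (1 - a[j]),   # solves C_j(lam - d) = +(lam - d)
                      (lam + C[j]) / (1 + a[j])):  # solves C_j(lam - d) = -(lam - d)
                if tol < d:
                    cand.append(d); events.append(("add", j))
        for idx, j in enumerate(active):           # (ii): b_j hits zero
            d = -b[j] / (v[idx] * s[idx])
            if tol < d:
                cand.append(d); events.append(("drop", j))
        k = int(np.argmin(cand))
        d, (kind, j) = cand[k], events[k]
        b[active] += d * v * s                     # <4>: linear move in lambda
        lam -= d
        C = X.T @ (y - X @ b)
        path.append((lam, b.copy()))
        if kind == "add":
            active.append(j)
        elif kind == "drop":
            active.remove(j)
            b[j] = 0.0
        else:
            break
    return path

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 5))
X = (X - X.mean(axis=0)) / np.linalg.norm(X - X.mean(axis=0), axis=0)
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + 0.5 * rng.normal(size=40)

lam0, b0 = lasso_path(X, y)[-1]
print(lam0)   # ~0: the path reaches lambda = 0, where b0 is the least-squares fit
```

At every breakpoint the constraints <2> hold: active correlations sit on $\pm\lambda$ and inactive ones lie strictly inside.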
The analyses are similar for the two cases. They both depend on a neat formula for inversion of symmetric block matrices. Suppose $A$ is an $m \times m$ nonsingular, symmetric matrix and $d$ is an $m \times 1$ vector for which $\kappa := 1 - d'A^{-1}d \ne 0$. Then

<6> $$\begin{pmatrix} A & d\\ d' & 1 \end{pmatrix}^{-1} = \begin{pmatrix} A^{-1} + ww'/\kappa & -w/\kappa\\ -w'/\kappa & 1/\kappa \end{pmatrix} \qquad \text{where } w := A^{-1}d.$$

This formula captures the ideas involved in Lemma 4 of the LARS paper.

For simplicity of notation, I will act in both cases (a) and (b) as if $\alpha$ is larger than all the other $j$'s that were in the active set. (More formally, I could replace $X$ in what follows by $XE$ for a suitable permutation matrix $E$.) The old and new $Z$ matrices then have the block form shown in <6>.

Case (a): $\alpha$ enters the active set at $\lambda_{k+1}$. As in the analysis that led to the expression in <3>, the solutions of $\pm\lambda = C_j(\lambda)$ with $j \in A_k^c$ are given by
$$\begin{aligned}
(\lambda_k - \lambda)(1 - a_j) &= \lambda_k - C_j(\lambda_k) && \text{to get } \lambda = C_j(\lambda),\\
(\lambda_k - \lambda)(1 + a_j) &= \lambda_k + C_j(\lambda_k) && \text{to get } -\lambda = C_j(\lambda).
\end{aligned}$$
Index $\alpha$ is brought into the active set because $|C_\alpha(\lambda_{k+1})| = \lambda_{k+1}$. Thus
$$(\lambda_k - \lambda_{k+1})(1 - s_\alpha a_\alpha) = \lambda_k - s_\alpha C_\alpha(\lambda_k) \qquad \text{where } s_\alpha := \operatorname{sign}(C_\alpha(\lambda_{k+1})).$$
Note that $|C_\alpha(\lambda_k)| < \lambda_k$ because $\alpha$ was not active during step $k$. It follows that the right-hand side of the last equality is strictly positive, which implies

<7> $1 - s_\alpha a_\alpha > 0$.

Throughout a small neighborhood of $\lambda_{k+1}$ the sign $s_\alpha$ of $C_\alpha(\lambda)$ stays the same.

Continue to write $Z$ for the active matrix $[s_jX_j : j \in A_k]$ for $\lambda$ slightly larger than $\lambda_{k+1}$ and denote by $\widetilde Z = [s_jX_j : j \in A_{k+1}] = [Z, s_\alpha X_\alpha]$ the new active matrix for $\lambda$ slightly smaller than $\lambda_{k+1}$. Then
$$\widetilde Z'\widetilde Z = \begin{pmatrix} Z'Z & d\\ d' & 1 \end{pmatrix} \qquad \text{where } d := s_\alpha Z'X_\alpha.$$
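The block-inversion formula <6> is easy to verify numerically (my own check, for a random symmetric nonsingular $A$ and vector $d$):

```python
import numpy as np

rng = np.random.default_rng(6)
m = 4
A = rng.normal(size=(m, m))
A = A @ A.T + np.eye(m)                # symmetric, nonsingular
d = rng.normal(size=m)

w = np.linalg.solve(A, d)              # w = A^{-1} d
kappa = 1.0 - d @ w                    # kappa = 1 - d'A^{-1}d
M = np.block([[A, d[:, None]], [d[None, :], np.ones((1, 1))]])
Minv = np.block([[np.linalg.inv(A) + np.outer(w, w) / kappa, -w[:, None] / kappa],
                 [-w[None, :] / kappa, np.ones((1, 1)) / kappa]])
print(np.abs(M @ Minv - np.eye(m + 1)).max())   # ~0: Minv really is the inverse
```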
Notice that

<8> $1 - \kappa = d'A^{-1}d = X_\alpha'Z(Z'Z)^{-1}Z'X_\alpha = \|HX_\alpha\|^2$,

where $H = Z(Z'Z)^{-1}Z'$ is the matrix that projects vectors orthogonally onto $\operatorname{span}(Z)$. If $X_\alpha \notin \operatorname{span}(Z)$ then $\|HX_\alpha\| < \|X_\alpha\| = 1$, so that $\kappa > 0$. From <6>,
$$(\widetilde Z'\widetilde Z)^{-1} = \begin{pmatrix} (Z'Z)^{-1} + ww'/\kappa & -w/\kappa\\ -w'/\kappa & 1/\kappa \end{pmatrix} \qquad \text{where } w := s_\alpha(Z'Z)^{-1}Z'X_\alpha.$$
The $\alpha$th coordinate of the new $\widetilde v = (\widetilde Z'\widetilde Z)^{-1}\mathbf{1}$ equals $\widetilde v_\alpha = \kappa^{-1}(1 - w'\mathbf{1})$, which is strictly positive because
$$1 - w'\mathbf{1} = 1 - s_\alpha X_\alpha'Z(Z'Z)^{-1}\mathbf{1} = 1 - s_\alpha X_\alpha'Zv = 1 - s_\alpha a_\alpha \quad \text{by <5>,}$$
which is $> 0$ by <7>. By <4>, the new $\hat b_\alpha(\lambda) = 0 + (\lambda_{k+1} - \lambda)\widetilde v_\alpha s_\alpha$ has the same sign, $s_\alpha$, as $C_\alpha(\lambda)$.

Case (b): $\alpha$ leaves the active set at $\lambda = \lambda_{l+1}$. The roles of $Z = [s_jX_j : 1 \le j < \alpha]$ and $\widetilde Z = [Z, s_\alpha X_\alpha]$ are now reversed. For $\lambda$ slightly larger than $\lambda_{l+1}$ the active matrix is $\widetilde Z$, and both $C_\alpha(\lambda)$ and $\hat b_\alpha(\lambda)$ have sign $s_\alpha$. Index $\alpha$ leaves the active set because $\hat b_\alpha(\lambda_{l+1}) = 0$. Thus
$$0 = \hat b_\alpha(\lambda_l) + (\lambda_l - \lambda_{l+1})\widetilde v_\alpha s_\alpha \qquad \text{where } s_\alpha := \operatorname{sign}(C_\alpha(\lambda_l)) \text{ and } \widetilde v = (\widetilde Z'\widetilde Z)^{-1}\mathbf{1}.$$
The active $\hat b_\alpha(\lambda_l)$ also had sign $s_\alpha$. Consequently, we must have

<9> $\widetilde v_\alpha < 0$.

For $\lambda$ slightly smaller than $\lambda_{l+1}$ the active matrix is $Z$, and, by <5>,
$$s_\alpha C_\alpha(\lambda) = \lambda_{l+1} - (\lambda_{l+1} - \lambda)s_\alpha a_\alpha = \lambda + (\lambda_{l+1} - \lambda)(1 - s_\alpha a_\alpha) \qquad \text{where } a_\alpha := X_\alpha'Zv.$$
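A numerical check (mine) of the key identity for the entering coordinate: the last coordinate of $\widetilde v = (\widetilde Z'\widetilde Z)^{-1}\mathbf{1}$ equals $(1 - s_\alpha a_\alpha)/\kappa$, with $\kappa = 1 - \|HX_\alpha\|^2$ as in <8>:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
Z = rng.normal(size=(n, 3))                  # current active matrix
x = rng.normal(size=n)
x /= np.linalg.norm(x)                       # entering predictor, ||X_alpha|| = 1
s_a = 1.0                                    # sign of C_alpha at entry

Zt = np.column_stack([Z, s_a * x])           # new active matrix [Z, s_alpha X_alpha]
v = np.linalg.solve(Z.T @ Z, np.ones(3))     # old v = (Z'Z)^{-1} 1
vt = np.linalg.solve(Zt.T @ Zt, np.ones(4))  # new v-tilde

a_a = x @ (Z @ v)                            # a_alpha = X_alpha' Z v, as in <5>
Hx = Z @ np.linalg.solve(Z.T @ Z, Z.T @ x)   # projection of X_alpha onto span(Z)
kappa = 1.0 - Hx @ Hx                        # = 1 - ||H X_alpha||^2, as in <8>
print(vt[-1], (1.0 - s_a * a_a) / kappa)     # equal
```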
We also know that $0 > \kappa\widetilde v_\alpha = 1 - w'\mathbf{1} = 1 - s_\alpha a_\alpha$. Thus $s_\alpha C_\alpha(\lambda) < \lambda$, and hence $|C_\alpha(\lambda)| < \lambda$, for $\lambda$ slightly less than $\lambda_{l+1}$. The new $C_\alpha(\lambda)$ satisfies constraint 0 as it heads off towards the other boundary.

Remark. Clearly there is some sort of duality accounting for the similarities in the arguments for cases (a) and (b), with <6> as the shared mechanism. If we think of $\lambda$ as time, case (b) is a time reversal of case (a). For case (a) the defining condition ($C_\alpha$ hits the boundary) at $\lambda_{k+1}$ gives $1 - s_\alpha a_\alpha > 0$, which implies $\widetilde v_\alpha > 0$. For case (b), the defining condition ($\hat b_\alpha$ hits zero) at $\lambda_{l+1}$ gives $\widetilde v_\alpha < 0$, which implies $1 - s_\alpha a_\alpha < 0$. Is there some clever way to handle both cases by a duality argument? Maybe I should read more of the LARS paper.

4 My getlasso() function

I wrote the R code in the file lasso.r to help me understand the algorithm described in the LARS paper. The function is not particularly elegant or efficient. I wrote it in a way that made it easy to examine the output from each step. If verbose=T, lots of messages get written to the console and the function pauses after each step. I also used some calls to browser() while tracking down some annoying bugs related to case (b).

d.p. 1 December 2010

References

Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. The Annals of Statistics 32(2), 407-499.

Rosset, S. and J. Zhu (2007). Piecewise linear regularized solution paths. The Annals of Statistics 35(3), 1012-1030.
[Attached: higher-resolution versions of the "Modified lasso algorithm for diabetes data" plots, one page per stage of the path; each upper panel traces the $\hat b_j$'s and each lower panel the $C_j$'s, labeled by predictor.]
More informationSUPPLEMENT TO PARAMETRIC OR NONPARAMETRIC? A PARAMETRICNESS INDEX FOR MODEL SELECTION. University of Minnesota
Submitted to the Annals of Statistics arxiv: math.pr/0000000 SUPPLEMENT TO PARAMETRIC OR NONPARAMETRIC? A PARAMETRICNESS INDEX FOR MODEL SELECTION By Wei Liu and Yuhong Yang University of Minnesota In
More informationMS&E 318 (CME 338) Large-Scale Numerical Optimization. A Lasso Solver
Stanford University, Dept of Management Science and Engineering MS&E 318 (CME 338) Large-Scale Numerical Optimization Instructor: Michael Saunders Spring 2011 Final Project Due Friday June 10 A Lasso Solver
More information9. Numerical linear algebra background
Convex Optimization Boyd & Vandenberghe 9. Numerical linear algebra background matrix structure and algorithm complexity solving linear equations with factored matrices LU, Cholesky, LDL T factorization
More informationMissing dependent variables in panel data models
Missing dependent variables in panel data models Jason Abrevaya Abstract This paper considers estimation of a fixed-effects model in which the dependent variable may be missing. For cross-sectional units
More informationStandard forms for writing numbers
Standard forms for writing numbers In order to relate the abstract mathematical descriptions of familiar number systems to the everyday descriptions of numbers by decimal expansions and similar means,
More informationEXAMPLES OF PROOFS BY INDUCTION
EXAMPLES OF PROOFS BY INDUCTION KEITH CONRAD 1. Introduction In this handout we illustrate proofs by induction from several areas of mathematics: linear algebra, polynomial algebra, and calculus. Becoming
More informationPathwise coordinate optimization
Stanford University 1 Pathwise coordinate optimization Jerome Friedman, Trevor Hastie, Holger Hoefling, Robert Tibshirani Stanford University Acknowledgements: Thanks to Stephen Boyd, Michael Saunders,
More informationCS261: A Second Course in Algorithms Lecture #12: Applications of Multiplicative Weights to Games and Linear Programs
CS26: A Second Course in Algorithms Lecture #2: Applications of Multiplicative Weights to Games and Linear Programs Tim Roughgarden February, 206 Extensions of the Multiplicative Weights Guarantee Last
More informationMachine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression
Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Due: Monday, February 13, 2017, at 10pm (Submit via Gradescope) Instructions: Your answers to the questions below,
More informationKarush-Kuhn-Tucker Conditions. Lecturer: Ryan Tibshirani Convex Optimization /36-725
Karush-Kuhn-Tucker Conditions Lecturer: Ryan Tibshirani Convex Optimization 10-725/36-725 1 Given a minimization problem Last time: duality min x subject to f(x) h i (x) 0, i = 1,... m l j (x) = 0, j =
More information1 Maintaining a Dictionary
15-451/651: Design & Analysis of Algorithms February 1, 2016 Lecture #7: Hashing last changed: January 29, 2016 Hashing is a great practical tool, with an interesting and subtle theory too. In addition
More information5 + 9(10) + 3(100) + 0(1000) + 2(10000) =
Chapter 5 Analyzing Algorithms So far we have been proving statements about databases, mathematics and arithmetic, or sequences of numbers. Though these types of statements are common in computer science,
More informationReview Solutions, Exam 2, Operations Research
Review Solutions, Exam 2, Operations Research 1. Prove the weak duality theorem: For any x feasible for the primal and y feasible for the dual, then... HINT: Consider the quantity y T Ax. SOLUTION: To
More informationFinal Review Sheet. B = (1, 1 + 3x, 1 + x 2 ) then 2 + 3x + 6x 2
Final Review Sheet The final will cover Sections Chapters 1,2,3 and 4, as well as sections 5.1-5.4, 6.1-6.2 and 7.1-7.3 from chapters 5,6 and 7. This is essentially all material covered this term. Watch
More informationLinear Methods for Regression. Lijun Zhang
Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived
More informationLAGRANGE MULTIPLIERS
LAGRANGE MULTIPLIERS MATH 195, SECTION 59 (VIPUL NAIK) Corresponding material in the book: Section 14.8 What students should definitely get: The Lagrange multiplier condition (one constraint, two constraints
More informationCHAPTER 1: Functions
CHAPTER 1: Functions 1.1: Functions 1.2: Graphs of Functions 1.3: Basic Graphs and Symmetry 1.4: Transformations 1.5: Piecewise-Defined Functions; Limits and Continuity in Calculus 1.6: Combining Functions
More informationHere we used the multiindex notation:
Mathematics Department Stanford University Math 51H Distributions Distributions first arose in solving partial differential equations by duality arguments; a later related benefit was that one could always
More information18.9 SUPPORT VECTOR MACHINES
744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the
More informationCS 6820 Fall 2014 Lectures, October 3-20, 2014
Analysis of Algorithms Linear Programming Notes CS 6820 Fall 2014 Lectures, October 3-20, 2014 1 Linear programming The linear programming (LP) problem is the following optimization problem. We are given
More informationIn particular, if A is a square matrix and λ is one of its eigenvalues, then we can find a non-zero column vector X with
Appendix: Matrix Estimates and the Perron-Frobenius Theorem. This Appendix will first present some well known estimates. For any m n matrix A = [a ij ] over the real or complex numbers, it will be convenient
More informationSergey Norin Department of Mathematics and Statistics McGill University Montreal, Quebec H3A 2K6, Canada. and
NON-PLANAR EXTENSIONS OF SUBDIVISIONS OF PLANAR GRAPHS Sergey Norin Department of Mathematics and Statistics McGill University Montreal, Quebec H3A 2K6, Canada and Robin Thomas 1 School of Mathematics
More informationFunctional Analysis MATH and MATH M6202
Functional Analysis MATH 36202 and MATH M6202 1 Inner Product Spaces and Normed Spaces Inner Product Spaces Functional analysis involves studying vector spaces where we additionally have the notion of
More informationGeneralized Elastic Net Regression
Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1
More informationSparsity and the Lasso
Sparsity and the Lasso Statistical Machine Learning, Spring 205 Ryan Tibshirani (with Larry Wasserman Regularization and the lasso. A bit of background If l 2 was the norm of the 20th century, then l is
More informationLecture 3: Sizes of Infinity
Math/CS 20: Intro. to Math Professor: Padraic Bartlett Lecture 3: Sizes of Infinity Week 2 UCSB 204 Sizes of Infinity On one hand, we know that the real numbers contain more elements than the rational
More informationLecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016
Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,
More informationON THE MATRIX EQUATION XA AX = X P
ON THE MATRIX EQUATION XA AX = X P DIETRICH BURDE Abstract We study the matrix equation XA AX = X p in M n (K) for 1 < p < n It is shown that every matrix solution X is nilpotent and that the generalized
More informationsatisfying ( i ; j ) = ij Here ij = if i = j and 0 otherwise The idea to use lattices is the following Suppose we are given a lattice L and a point ~x
Dual Vectors and Lower Bounds for the Nearest Lattice Point Problem Johan Hastad* MIT Abstract: We prove that given a point ~z outside a given lattice L then there is a dual vector which gives a fairly
More informationPage 52. Lecture 3: Inner Product Spaces Dual Spaces, Dirac Notation, and Adjoints Date Revised: 2008/10/03 Date Given: 2008/10/03
Page 5 Lecture : Inner Product Spaces Dual Spaces, Dirac Notation, and Adjoints Date Revised: 008/10/0 Date Given: 008/10/0 Inner Product Spaces: Definitions Section. Mathematical Preliminaries: Inner
More informationLASSO Review, Fused LASSO, Parallel LASSO Solvers
Case Study 3: fmri Prediction LASSO Review, Fused LASSO, Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 3, 2016 Sham Kakade 2016 1 Variable
More informationChapter 3, Operations Research (OR)
Chapter 3, Operations Research (OR) Kent Andersen February 7, 2007 1 Linear Programs (continued) In the last chapter, we introduced the general form of a linear program, which we denote (P) Minimize Z
More informationA Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013)
A Survey of L 1 Regression Vidaurre, Bielza and Larranaga (2013) Céline Cunen, 20/10/2014 Outline of article 1.Introduction 2.The Lasso for Linear Regression a) Notation and Main Concepts b) Statistical
More informationHomework 5. Convex Optimization /36-725
Homework 5 Convex Optimization 10-725/36-725 Due Tuesday November 22 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)
More informationWelsh s problem on the number of bases of matroids
Welsh s problem on the number of bases of matroids Edward S. T. Fan 1 and Tony W. H. Wong 2 1 Department of Mathematics, California Institute of Technology 2 Department of Mathematics, Kutztown University
More informationA NEW FACE METHOD FOR LINEAR PROGRAMMING. 1. Introduction We are concerned with the standard linear programming (LP) problem
A NEW FACE METHOD FOR LINEAR PROGRAMMING PING-QI PAN Abstract. An attractive feature of the face method [9 for solving LP problems is that it uses the orthogonal projection of the negative objective gradient
More information1 Matrices and Systems of Linear Equations. a 1n a 2n
March 31, 2013 16-1 16. Systems of Linear Equations 1 Matrices and Systems of Linear Equations An m n matrix is an array A = (a ij ) of the form a 11 a 21 a m1 a 1n a 2n... a mn where each a ij is a real
More informationSpring 2017 CO 250 Course Notes TABLE OF CONTENTS. richardwu.ca. CO 250 Course Notes. Introduction to Optimization
Spring 2017 CO 250 Course Notes TABLE OF CONTENTS richardwu.ca CO 250 Course Notes Introduction to Optimization Kanstantsin Pashkovich Spring 2017 University of Waterloo Last Revision: March 4, 2018 Table
More information