The deterministic Lasso

Sara van de Geer
Seminar für Statistik, ETH Zürich

Abstract. We study high-dimensional generalized linear models and empirical risk minimization using the Lasso. An oracle inequality is presented, under a so-called compatibility condition. Our aim is threefold: to prove a result announced in van de Geer (2007), to provide a simple proof with simple constants, and to separate the stochastic problem from the deterministic one.

KEY WORDS: coherence, Lasso, oracle, sparsity

1 Introduction

We consider the Lasso for high-dimensional generalized linear models, introducing three new elements. Firstly, we complete the proof of an oracle inequality which was announced in van de Geer (2007). The result is shown to hold under a so-called compatibility condition for certain norms. This condition is a weaker version of Assumption C* in van de Geer (2007), and is related to coherence conditions in e.g. Candes and Tao (2007) and Bunea et al. (2006, 2007). Secondly, we state the result with simple constants, and give a new and rather straightforward proof. Thirdly, in Subsection 1.3, we separate the stochastic part of the problem from the deterministic part. In the subsequent sections, we then concentrate only on the deterministic part, whence the title of this paper. The advantage of this approach is that the behavior of the Lasso under various sampling schemes (e.g., dependent observations) can immediately be deduced once the stochastic problem (a probability inequality for the empirical process) is settled.

Consider data $\{Z_i\}_{i=1}^n$, with $Z_i$ in some space $\mathcal{Z}$ for $i = 1, \dots, n$. Let $(\bar{\mathcal{F}}, \|\cdot\|)$ be a normed real vector space and, for each $f \in \bar{\mathcal{F}}$, let $\rho_f : \mathcal{Z} \to \mathbb{R}$ be a loss function. We assume that the map $f \mapsto \rho_f(z)$ is convex for all $z \in \mathcal{Z}$. For example, in a regression setup one has $Z_i = (X_i, Y_i)$ for $i = 1, \dots, n$, with covariables $X_i \in \mathcal{X}$ and response variable $Y_i \in \mathcal{Y} \subset \mathbb{R}$, and $f : \mathcal{X} \to \mathbb{R}$ is a regression function. Examples are quadratic loss
$$\rho_f(x, y) = (y - f(x))^2,$$
or logistic loss
$$\rho_f(x, y) = -y f(x) + \log(1 + \exp[f(x)]),$$
etc.

1.1 The target

We denote, for a function $\rho : \mathcal{Z} \to \mathbb{R}$, the theoretical measure by
$$P\rho := \sum_{i=1}^n \mathbb{E}\rho(Z_i)/n,$$
and the empirical measure by
$$P_n\rho := \sum_{i=1}^n \rho(Z_i)/n.$$
Define the target
$$f^0 := \arg\min_{f \in \bar{\mathcal{F}}} P\rho_f.$$
We assume for simplicity that the minimum is attained and that it is unique. For $f \in \bar{\mathcal{F}}$, the excess risk is
$$\mathcal{E}(f) := P(\rho_f - \rho_{f^0}).$$

1.2 The Lasso

Consider a given linear subspace $\mathcal{F} := \{f_\beta : \beta \in \mathbb{R}^p\} \subset \bar{\mathcal{F}}$, where $\beta \mapsto f_\beta$ is linear. The collection $\mathcal{F}$ will be the model class over which we perform empirical risk minimization. When $\mathcal{F}$ is high-dimensional, a complexity penalty can prevent overfitting. The Lasso is
$$\hat{\beta} = \arg\min_{\beta} \left\{ P_n \rho_{f_\beta} + \lambda \|\beta\|_1 \right\}, \quad (1)$$
where
$$\|\beta\|_1 := \sum_{j=1}^p |\beta_j|,$$
i.e., $\|\cdot\|_1$ is the $\ell_1$-norm. The parameter $\lambda$ is a smoothing (or tuning) parameter. We examine the excess risk $\mathcal{E}(f_{\hat{\beta}})$ of the Lasso, and show that it behaves as if it knew which variables are relevant for a linear approximation of the target $f^0$. Section 2 does this under a so-called compatibility condition. Section 3 examines under what circumstances the compatibility condition is met.

In (1), the penalty weights all coefficients equally. In practice, certain terms (e.g., the constant term) will not be penalized, and a weighted version of the $\ell_1$-norm is used, with, e.g., more weight assigned to variables with large sample variance. In essence, (possibly random) weights do not alter the main issues in the theory. We therefore consider the non-weighted $\ell_1$ penalty to clarify these issues. More details on the case of empirical weights are in Bunea et al. (2006) and van de Geer (2007).
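To make the estimator (1) concrete in the special case of quadratic loss, where it reduces to $\ell_1$-penalized least squares, here is a minimal numerical sketch. The proximal-gradient (ISTA) solver, the simulated data, and the choice of tuning parameter below are illustrative assumptions and not part of the paper.

import numpy as np

def soft_threshold(z, a):
    """Proximal operator of a * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - a, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Illustrative sketch: minimize (1/n)||y - X beta||_2^2 + lam * ||beta||_1 by proximal gradient."""
    n, p = X.shape
    beta = np.zeros(p)
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of the quadratic part's gradient
    t = 1.0 / L
    for _ in range(n_iter):
        grad = 2.0 / n * X.T @ (X @ beta - y)
        beta = soft_threshold(beta - t * grad, t * lam)
    return beta

# Arbitrary simulated example with a sparse ground truth.
rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.standard_normal((n, p))
beta0 = np.zeros(p); beta0[:5] = 1.0
y = X @ beta0 + 0.5 * rng.standard_normal(n)
lam = 2.0 * np.sqrt(np.log(2 * p) / n)        # order suggested by the asymptotics in Section 2.3
beta_hat = lasso_ista(X, y, lam)
print("number of nonzero coefficients:", np.sum(beta_hat != 0))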

1.3 The empirical process

Throughout, the study of the stochastic part of the problem (involving the empirical process) is set aside. It can be handled using empirical process theory. In the current paper, we simply restrict ourselves to a set $\mathcal{S}$ where the empirical process behaves well (see (6) in Lemma 2.1). This allows us to proceed with purely deterministic, and relatively straightforward, arguments.

More precisely, write for $M > 0$,
$$\mathbf{Z}_M := \sup_{\|\beta - \beta^*\|_1 \le M} \left| (P_n - P)(\rho_{f_\beta} - \rho_{f_{\beta^*}}) \right|. \quad (2)$$
Here, $\beta^*$ is some fixed parameter value, which will be specified in Section 2. Our results hold on the set $\mathcal{S} = \{\mathbf{Z}_M \le \varepsilon\}$, with $M$ and $\varepsilon$ appropriately chosen. The ratio $\varepsilon/M$ will play a major role: it has to be chosen large enough so that $\mathcal{S}$ has large probability. On the other hand, $\varepsilon/M$ will form a lower bound for the smoothing parameter $\lambda$ (see (5) in Lemma 2.1).

It is shown in van de Geer (2007) that, for independent observations $Z_i$, under mild conditions, and with $\varepsilon/M \asymp \sqrt{\log(2p)/n}$, the set $\mathcal{S}$ has large probability in the case of Lipschitz loss functions, that is, when $f \mapsto \rho_f(z)$ is Lipschitz for all $z$. For other loss functions (e.g., quadratic loss), similar behavior can be obtained, but only for small enough values of $M$.

In the result of Lemma 2.1, one may replace $P_n$ by any other random probability measure, say $\hat{P}$, in the definition (1) of $\hat{\beta}$, provided one also adjusts the definition (2) of $\mathbf{Z}_M$. The result for the Lasso then goes through on $\mathcal{S}$. One should ensure that $\mathcal{S}$ contains most of the probability mass by choosing $\varepsilon/M$ appropriately, and this will then put lower bounds on the smoothing parameter.

1.4 The margin behavior

We will make use of the so-called margin condition (see below), which is assumed on a neighborhood, denoted by $\mathcal{F}_\eta$, of the target $f^0$. Some of the effort goes into proving that the estimator is indeed in this neighborhood. Here, we use arguments that rely on the assumed convexity of $f \mapsto \rho_f$. These arguments also show that the stochastic part of the problem needs only to be studied locally, i.e., (2) for small values of $M$.

The margin condition requires that, in the neighborhood $\mathcal{F}_\eta \subset \bar{\mathcal{F}}$ of $f^0$, the excess risk $\mathcal{E}$ is bounded from below by a strictly convex function. This is true in many particular cases for an $L_\infty$ neighborhood $\mathcal{F}_\eta = \{\|f - f^0\|_\infty \le \eta\}$ when $\bar{\mathcal{F}}$ is a class of functions.

Definition. We say that the margin condition holds with strictly convex function $G$ if, for all $f \in \mathcal{F}_\eta$, we have
$$\mathcal{E}(f) \ge G(\|f - f^0\|).$$

Indeed, in typical cases, the margin condition holds with a quadratic function $G$, that is, $G(u) = cu^2$, $u \ge 0$, where $c$ is a positive constant.

We also need the notion of the convex conjugate of a function.

Definition. Let $G$ be a strictly convex function on $[0, \infty)$, with $G(0) = 0$. The convex conjugate $H$ of $G$ is defined as
$$H(v) = \sup_{u} \{uv - G(u)\}, \quad v \ge 0.$$
Note that when $G$ is quadratic, its convex conjugate $H$ is quadratic as well. The function $H$ will appear in the estimation error term.
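As a concrete illustration of the last remark (an immediate computation, not spelled out in the text): for the quadratic margin function $G(u) = cu^2$, the map $u \mapsto uv - cu^2$ is maximized at $u = v/(2c)$, so
$$H(v) = \sup_{u \ge 0}\{uv - cu^2\} = \frac{v^2}{4c}, \qquad v \ge 0.$$
Hence a quadratic margin yields a quadratic estimation error term, of order $\lambda^2|J|/\phi^2$ when evaluated at arguments of the form $2\lambda\sqrt{|J|}/(\phi\delta)$ encountered below.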
νj > 0, if for all β R p that satisfy j / J β j 3 β j, it holds that β j J f β /φj + j / J β j / + νj 4 More details on this condition are in Section 3 22 The result Let 0 < δ < be fixed but otherwise arbitrary For β R p, set J β := {j : β j 0} and J β := J β The oracle β is defined as { } 2λ β Jβ := arg min + δef β + 2δH, β B φj δ Here, the minimum is over the set B of all β, for which the compatibility condition holds for J β Let J := {j : βj 0}, and let φ = φj and = νj Set 2λ ε Jβ := + δef β + 2δH φ δ

Lemma 2.1 (Lasso under the compatibility condition). Consider a constant $\lambda_0$ satisfying the inequality
$$\lambda \ge \frac{8(2+\nu^*)}{\nu^*}\,\lambda_0, \quad (5)$$
and let $M^* = \varepsilon^*/\lambda_0$. Assume the margin condition with strictly convex function $G$, and that the oracle $f_{\beta^*}$ is in the neighborhood $\mathcal{F}_\eta$ of the target, as well as $f_\beta \in \mathcal{F}_\eta$ for all $\|\beta - \beta^*\|_1 \le M^*$. Then on the set
$$\mathcal{S} := \{\mathbf{Z}_{M^*} \le \varepsilon^*\}, \quad (6)$$
we have either
$$(1-\delta)\mathcal{E}(f_{\hat\beta}) + \frac{\nu^*}{2+\nu^*}\,\lambda\|\hat\beta - \beta^*\|_1 \le 2\varepsilon^*,$$
or
$$\mathcal{E}(f_{\hat\beta}) + \lambda\|\hat\beta - \beta^*\|_1 \le \frac{4(2+\nu^*)}{\nu^*}\,\varepsilon^*.$$

2.3 Asymptotics in n

In the case of independent observations, the set $\mathcal{S}$ has large probability under general conditions, for $\lambda_0 \asymp \sqrt{\log(2p)/n}$ (see van de Geer 2007). Taking $\lambda$ also of this order, assuming a quadratic margin, and assuming in addition that $\phi^*$ stays away from zero, the estimation error $H(2\lambda\sqrt{|J_{\beta^*}|}/(\phi^*\delta))$ behaves like $|J_{\beta^*}|\log(2p)/n$.

2.4 The case $\nu^* = \infty$

We observe that often $\lambda_0$ can be taken independently of oracle values, whereas (5) requires a lower bound on the smoothing parameter $\lambda$ which depends on the unknown oracle value $\nu^*$. The best possible situation is when $\nu^*$ is arbitrarily large. Now, the case $\nu(J) > 2$ corresponds, after a reparametrization, to the case $\nu(J) = \infty$. Indeed, if $\nu(J) > 2$ and $\sum_{j \notin J}|\beta_j| \le 3\sum_{j \in J}|\beta_j|$, then (4) implies
$$\frac{\nu(J) - 2}{1 + \nu(J)}\sum_{j \in J}|\beta_j| \le \sqrt{|J|}\,\|f_\beta\|/\phi(J),$$
which is the compatibility condition with $\nu(J)$ replaced by $\infty$ and $\phi(J)$ replaced by $\phi(J)(\nu(J)-2)/(1+\nu(J))$. With $\nu^* = \infty$, Lemma 2.1 reads as follows.

Corollary 2.1. Let $\lambda_0 \le \lambda/8$, and $M^* := \varepsilon^*/\lambda_0$. Assume the conditions of Lemma 2.1 with $\nu^* = \infty$. Then on the set $\mathcal{S}$ given in (6), we have either
$$(1-\delta)\mathcal{E}(f_{\hat\beta}) + \lambda\|\hat\beta - \beta^*\|_1 \le 2\varepsilon^*,$$
or
$$\mathcal{E}(f_{\hat\beta}) + \lambda\|\hat\beta - \beta^*\|_1 \le 4\varepsilon^*.$$

3 On the compatibility condition

Let $\|f_\beta\|^2 := \beta^T\Sigma\beta$, with $\Sigma$ a $p \times p$ matrix. The entries of $\Sigma$ are denoted by $\sigma_{j,k}$ ($j, k \in \{1, \dots, p\}$). We first present a simple lemma.

Lemma 3.1. Suppose that $\Sigma$ has smallest eigenvalue $\phi^2 > 0$. Then the compatibility condition holds for all index sets $J$, with $\phi(J) = \phi$ and $\nu(J) = \infty$.

3.1 The coherence lemma

For a vector $v$, and for $1 \le q < \infty$, we shall write $\|v\|_q = (\sum_j |v_j|^q)^{1/q}$, with the usual extension for $q = \infty$. We now fix an $L \in \{1, \dots, p\}$. Consider a partition of the form
$$\Sigma = \begin{pmatrix} \Sigma_{1,1} & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_{2,2} \end{pmatrix},$$
where $\Sigma_{1,1}$ is an $L \times L$ matrix, $\Sigma_{2,1}$ is a $(p-L) \times L$ matrix and $\Sigma_{1,2} = \Sigma_{2,1}^T$ is its transpose, and where $\Sigma_{2,2}$ is a $(p-L) \times (p-L)$ matrix. Then for any $b_1 \in \mathbb{R}^L$ and $b_2 \in \mathbb{R}^{p-L}$, and for $b^T := (b_1^T, b_2^T)$, one has
$$b_1^T\Sigma_{1,1}b_1 \le b^T\Sigma b + 2\|\Sigma_{2,1}\|_q\,\|b_1\|_2\,\|b_2\|_r,$$
with $1/q + 1/r = 1$, and
$$\|\Sigma_{2,1}\|_q := \sup_{a \in \mathbb{R}^L,\ \|a\|_2 = 1}\|\Sigma_{2,1}a\|_q.$$

Consider now a subset $\mathcal{L} \subset \{1, \dots, p\}$. We introduce the $|\mathcal{L}| \times |\mathcal{L}|$ matrix
$$\Sigma_{1,1}(\mathcal{L}) := (\sigma_{j,k})_{j,k \in \mathcal{L}},$$
and the $(p - |\mathcal{L}|) \times |\mathcal{L}|$ matrix
$$\Sigma_{2,1}(\mathcal{L}) := (\sigma_{j,k})_{j \notin \mathcal{L},\,k \in \mathcal{L}}.$$
We let $\rho^2(\mathcal{L})$ be the smallest eigenvalue of $\Sigma_{1,1}(\mathcal{L})$, and take $\vartheta_q^2(\mathcal{L}) := \|\Sigma_{2,1}(\mathcal{L})\|_q$. We moreover define
$$\phi(J) := \min_{\mathcal{L} \supseteq J:\ |\mathcal{L}| = L}\rho(\mathcal{L}) \quad \text{and} \quad \theta_q(J) := \max_{\mathcal{L} \supseteq J:\ |\mathcal{L}| = L}\vartheta_q(\mathcal{L}).$$

Lemma 3.2 (Coherence lemma). Fix $J \subset \{1, \dots, p\}$, with cardinality $|J| =: J \le L$. Then we have, for any $\beta \in \mathbb{R}^p$,
$$\sum_{j \in J}|\beta_j| \le \sqrt{J}\,\|f_\beta\|/\phi(J) + \sum_{j \notin J}|\beta_j|/(1 + \nu(J)),$$
with, for $1/q + 1/r = 1$,
$$1 + \nu(J) = \begin{cases} \dfrac{(L + 1 - J)^{1/q}\,\phi^2(J)}{2\sqrt{J}\,q^{1/r}\,\theta_q^2(J)}, & q < \infty, \\[2mm] \dfrac{\phi^2(J)}{2\sqrt{J}\,\theta_\infty^2(J)}, & q = \infty. \end{cases}$$
The coherence lemma uses an auxiliary lemma (Lemma 4.1 below), which is based on ideas in Candes and Tao (2007).
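As a small numerical illustration of the simplest sufficient condition above (Lemma 3.1), the following sketch checks inequality (4) with $\nu(J) = \infty$ for a randomly generated positive definite $\Sigma$; the matrix, the index set, and the way cone vectors are sampled are arbitrary choices made here for illustration, not part of the paper.

import numpy as np

rng = np.random.default_rng(1)
p = 30
A = rng.standard_normal((2 * p, p))
Sigma = A.T @ A / (2 * p)                    # an arbitrary positive definite p x p Gram matrix
phi = np.sqrt(np.linalg.eigvalsh(Sigma)[0])  # phi^2 = smallest eigenvalue, as in Lemma 3.1

J = np.arange(5)                             # an arbitrary index set J of size 5
mask_J = np.zeros(p, dtype=bool); mask_J[J] = True

def compatibility_ok(beta):
    """Check (4) with nu(J) = infinity: sum_J |beta_j| <= sqrt(|J|) * ||f_beta|| / phi."""
    lhs = np.sum(np.abs(beta[mask_J]))
    rhs = np.sqrt(mask_J.sum()) * np.sqrt(beta @ Sigma @ beta) / phi
    return lhs <= rhs + 1e-12

# Sample random vectors in the cone sum_{j not in J} |beta_j| <= 3 sum_{j in J} |beta_j|.
ok = True
for _ in range(1000):
    beta = rng.standard_normal(p)
    budget = 3 * np.sum(np.abs(beta[mask_J]))
    out = np.abs(beta[~mask_J]).sum()
    if out > budget:                         # rescale the off-J part into the cone
        beta[~mask_J] *= budget / out
    ok &= compatibility_ok(beta)
print("compatibility condition (4) held on all sampled cone vectors:", ok)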

3.2 Relation with other coherence conditions

In the coherence lemma, the choice of $L \ge |J|$ and of $q$ is open. They are allowed to depend on $J$, and for our purposes should be taken in such a way that $\phi(J)$ and $\nu(J)$ are as large as possible. To gain insight into what the coherence lemma can bring us, let us again consider the partition
$$\Sigma = \begin{pmatrix} \Sigma_{1,1} & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_{2,2} \end{pmatrix}.$$
Note first that $\|\Sigma_{2,1}\|_2^2$ is the largest eigenvalue of the matrix $\Sigma_{1,2}\Sigma_{2,1}$. The coherence lemma with $q = 2$ is along the lines of the coherence condition in Candes and Tao (2007). They choose $L$ not larger than $3|J|$.

It is easy to see that
$$\|\Sigma_{2,1}\|_q \le \Big( \sum_{j=L+1}^p \Big( \sum_{k=1}^L \sigma_{j,k}^2 \Big)^{q/2} \Big)^{1/q}. \quad (7)$$
Thus,
$$\|\Sigma_{2,1}\|_q \le \sqrt{L}\,\Big( \sum_{j=L+1}^p \max_{1 \le k \le L} |\sigma_{j,k}|^q \Big)^{1/q}.$$
For $q = \infty$, this reads
$$\|\Sigma_{2,1}\|_\infty \le \sqrt{L}\,\max_{L+1 \le j \le p}\,\max_{1 \le k \le L} |\sigma_{j,k}|.$$
Note also that with $q = \infty$, the choice $L = |J|$ in the coherence lemma gives the best result. With this choice of $L$, the coherence lemma with $q = \infty$ gives a positive value for $\nu(J)$, as required in the compatibility condition, if
$$2|J|\,\max_{j \notin J}\,\max_{k \in J} |\sigma_{j,k}| / \phi^2(J) < 1.$$
This is in the spirit of the maximal local coherence condition in Bunea et al. (2007). It also follows from (7) that
$$\|\Sigma_{2,1}\|_q \le \Big( \sum_{j=L+1}^p \Big( \sum_{k=1}^L |\sigma_{j,k}| \Big)^q \Big)^{1/q}.$$
Invoking this bound with $q = 1$ and $L = |J|$ in the coherence lemma leads to a condition which is similar to the cumulative local coherence condition in Bunea et al. (2007).
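The maximal-local-coherence type condition above can be evaluated directly from $\Sigma$. The sketch below (illustrative only; the equicorrelated matrix and the index set are arbitrary choices) computes the ratio $2|J|\max_{j\notin J, k\in J}|\sigma_{j,k}|/\phi^2(J)$ for the choice $\mathcal{L} = J$, $q = \infty$, taking $\phi^2(J)$ as the smallest eigenvalue of $\Sigma_{1,1}(J)$; a value below one yields a positive $\nu(J)$ in the coherence lemma.

import numpy as np

def max_local_coherence_ratio(Sigma, J):
    """Return 2|J| * max_{j not in J, k in J} |sigma_{j,k}| / phi^2(J),
    with phi^2(J) the smallest eigenvalue of Sigma_{1,1}(J) (choice L = J, q = infinity)."""
    J = np.asarray(J)
    notJ = np.setdiff1d(np.arange(Sigma.shape[0]), J)
    phi2 = np.linalg.eigvalsh(Sigma[np.ix_(J, J)])[0]
    max_coh = np.max(np.abs(Sigma[np.ix_(notJ, J)]))
    return 2 * len(J) * max_coh / phi2

# Arbitrary example: equicorrelated design with correlation rho.
p, rho = 50, 0.05
Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
print(max_local_coherence_ratio(Sigma, np.arange(3)))   # small rho and small |J| keep the ratio below 1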

4 Proofs

Proof of Lemma 2.1. For $\beta \in \mathbb{R}^p$, we write $\beta = \beta_{\rm in} + \beta_{\rm out}$, with $\beta_{j,{\rm in}} := \beta_j 1\{j \in J^*\}$ and $\beta_{j,{\rm out}} := \beta_j 1\{j \notin J^*\}$, $j = 1, \dots, p$. Note thus that $\beta^*_{\rm in} = \beta^*$ and $\beta^*_{\rm out} \equiv 0$. Throughout the proof, we assume we are on the set $\mathcal{S}$.

Let
$$s := \frac{M^*}{M^* + \|\hat\beta - \beta^*\|_1}, \qquad \tilde\beta := s\hat\beta + (1-s)\beta^*,$$
and write $\tilde f := f_{\tilde\beta}$, $\tilde{\mathcal E} := \mathcal{E}(\tilde f)$, as well as $f^* := f_{\beta^*}$, $\mathcal{E}^* := \mathcal{E}(f^*)$. The proof is structured as follows. We first note that $\|\tilde\beta - \beta^*\|_1 = s\|\hat\beta - \beta^*\|_1 \le M^*$, and prove the result with $\hat\beta$ replaced by $\tilde\beta$. As a byproduct we obtain that $\|\hat\beta - \beta^*\|_1 \le M^*$. The latter allows us to repeat the proof with $\tilde\beta$ replaced by $\hat\beta$.

By the convexity of $f \mapsto \rho_f$ and of $\|\cdot\|_1$, and since $\hat\beta$ minimizes (1),
$$P_n\rho_{\tilde f} + \lambda\|\tilde\beta\|_1 \le s\big[P_n\rho_{f_{\hat\beta}} + \lambda\|\hat\beta\|_1\big] + (1-s)\big[P_n\rho_{f^*} + \lambda\|\beta^*\|_1\big] \le P_n\rho_{f^*} + \lambda\|\beta^*\|_1. \quad (8)$$
Thus,
$$\tilde{\mathcal E} + \lambda\|\tilde\beta\|_1 = -(P_n - P)(\rho_{\tilde f} - \rho_{f^*}) + P_n(\rho_{\tilde f} - \rho_{f^*}) + \mathcal{E}^* + \lambda\|\tilde\beta\|_1 \le -(P_n - P)(\rho_{\tilde f} - \rho_{f^*}) + \mathcal{E}^* + \lambda\|\beta^*\|_1 \le \mathbf{Z}_{M^*} + \mathcal{E}^* + \lambda\|\beta^*\|_1.$$
So on $\mathcal{S}$,
$$\tilde{\mathcal E} + \lambda\|\tilde\beta\|_1 \le \varepsilon^* + \mathcal{E}^* + \lambda\|\beta^*\|_1. \quad (9)$$
Since $\|\tilde\beta\|_1 = \|\tilde\beta_{\rm in}\|_1 + \|\tilde\beta_{\rm out}\|_1 \ge \|\beta^*\|_1 - \|\tilde\beta_{\rm in} - \beta^*\|_1 + \|\tilde\beta_{\rm out}\|_1$, we therefore have
$$\tilde{\mathcal E} + \lambda\|\tilde\beta_{\rm out}\|_1 \le \varepsilon^* + \mathcal{E}^* + \lambda\|\tilde\beta_{\rm in} - \beta^*\|_1 \le 2\varepsilon^* + \lambda\|\tilde\beta_{\rm in} - \beta^*\|_1. \quad (10)$$

Case 1. If $\lambda\|\tilde\beta_{\rm in} - \beta^*\|_1 \le (2+\nu^*)\varepsilon^*/\nu^*$, then (10) implies
$$\tilde{\mathcal E} + \lambda\|\tilde\beta_{\rm out}\|_1 \le 2\varepsilon^* + \frac{2+\nu^*}{\nu^*}\,\varepsilon^*,$$
and hence, using $2\varepsilon^* \le 2(2+\nu^*)\varepsilon^*/\nu^*$,
$$\tilde{\mathcal E} + \lambda\|\tilde\beta - \beta^*\|_1 \le \frac{4(2+\nu^*)}{\nu^*}\,\varepsilon^*.$$
But then
$$\|\tilde\beta - \beta^*\|_1 \le \frac{4(2+\nu^*)\varepsilon^*}{\nu^*\lambda} = \frac{4(2+\nu^*)\lambda_0}{\nu^*\lambda}\,M^* \le \frac{M^*}{2},$$
since $\lambda \ge 8(2+\nu^*)\lambda_0/\nu^*$. This implies that $\|\hat\beta - \beta^*\|_1 \le M^*$.

Case 2. If $\lambda\|\tilde\beta_{\rm in} - \beta^*\|_1 \ge (2+\nu^*)\varepsilon^*/\nu^* \ge \varepsilon^*$, we get from (10) that
$$\lambda\|\tilde\beta_{\rm out}\|_1 \le 2\varepsilon^* + \lambda\|\tilde\beta_{\rm in} - \beta^*\|_1 \le 3\lambda\|\tilde\beta_{\rm in} - \beta^*\|_1.$$
This means that we can apply the compatibility condition to the vector $\tilde\beta - \beta^*$, with index set $J^*$. Recall that the first inequality in (10) states
$$\tilde{\mathcal E} + \lambda\|\tilde\beta_{\rm out}\|_1 \le \varepsilon^* + \mathcal{E}^* + \lambda\|\tilde\beta_{\rm in} - \beta^*\|_1.$$
We have
$$\|\tilde\beta_{\rm out}\|_1 = \frac{2}{2+\nu^*}\|\tilde\beta_{\rm out}\|_1 + \frac{\nu^*}{2+\nu^*}\|\tilde\beta_{\rm out}\|_1$$
and
$$\|\tilde\beta_{\rm out}\|_1 = \|\tilde\beta - \beta^*\|_1 - \|\tilde\beta_{\rm in} - \beta^*\|_1.$$
So
$$\tilde{\mathcal E} + \frac{2}{2+\nu^*}\,\lambda\|\tilde\beta_{\rm out}\|_1 + \frac{\nu^*}{2+\nu^*}\,\lambda\|\tilde\beta - \beta^*\|_1 \le \varepsilon^* + \mathcal{E}^* + \frac{2+2\nu^*}{2+\nu^*}\,\lambda\|\tilde\beta_{\rm in} - \beta^*\|_1.$$
With the compatibility condition on $J^*$, we find
$$\|\tilde\beta_{\rm in} - \beta^*\|_1 \le \sqrt{|J^*|}\,\|\tilde f - f^*\|/\phi^* + \|\tilde\beta_{\rm out}\|_1/(1+\nu^*).$$
Thus
$$\tilde{\mathcal E} + \frac{2}{2+\nu^*}\,\lambda\|\tilde\beta_{\rm out}\|_1 + \frac{\nu^*}{2+\nu^*}\,\lambda\|\tilde\beta - \beta^*\|_1 \le \varepsilon^* + \mathcal{E}^* + \frac{2+2\nu^*}{2+\nu^*}\,\lambda\sqrt{|J^*|}\,\|\tilde f - f^*\|/\phi^* + \frac{2}{2+\nu^*}\,\lambda\|\tilde\beta_{\rm out}\|_1.$$
The term with $\|\tilde\beta_{\rm out}\|_1$ cancels out. Moreover, we may use the bound $(2+2\nu^*)/(2+\nu^*) \le 2$. This allows us to conclude that
$$\tilde{\mathcal E} + \frac{\nu^*}{2+\nu^*}\,\lambda\|\tilde\beta - \beta^*\|_1 \le \varepsilon^* + \mathcal{E}^* + 2\lambda\sqrt{|J^*|}\,\|\tilde f - f^*\|/\phi^*.$$
Now, because we assumed $f_{\beta^*} \in \mathcal{F}_\eta$, and since also $f_{\tilde\beta} \in \mathcal{F}_\eta$, we can invoke the margin condition (together with the conjugate inequality $uv \le G(u) + H(v)$) to arrive at
$$2\lambda\sqrt{|J^*|}\,\|\tilde f - f^*\|/\phi^* \le 2\lambda\sqrt{|J^*|}\,\big(\|\tilde f - f^0\| + \|f^* - f^0\|\big)/\phi^* \le 2\delta H\!\left(\frac{2\lambda\sqrt{|J^*|}}{\phi^*\delta}\right) + \delta\tilde{\mathcal E} + \delta\mathcal{E}^*.$$
It follows that
$$\tilde{\mathcal E} + \frac{\nu^*}{2+\nu^*}\,\lambda\|\tilde\beta - \beta^*\|_1 \le \varepsilon^* + (1+\delta)\mathcal{E}^* + 2\delta H\!\left(\frac{2\lambda\sqrt{|J^*|}}{\phi^*\delta}\right) + \delta\tilde{\mathcal E} = \varepsilon^* + \varepsilon^* + \delta\tilde{\mathcal E} = 2\varepsilon^* + \delta\tilde{\mathcal E},$$
or
$$(1-\delta)\tilde{\mathcal E} + \frac{\nu^*}{2+\nu^*}\,\lambda\|\tilde\beta - \beta^*\|_1 \le 2\varepsilon^*. \quad (11)$$
This yields
$$\|\tilde\beta - \beta^*\|_1 \le \frac{2(2+\nu^*)\varepsilon^*}{\nu^*\lambda} = \frac{2(2+\nu^*)\lambda_0}{\nu^*\lambda}\,M^* \le \frac{M^*}{4},$$
where the last inequality is ensured by the assumption $\lambda \ge 8(2+\nu^*)\lambda_0/\nu^*$. But $\|\tilde\beta - \beta^*\|_1 \le M^*/4$ implies $\|\hat\beta - \beta^*\|_1 \le M^*/3 \le M^*$.

So in both Case 1 and Case 2, we arrive at $\|\hat\beta - \beta^*\|_1 \le M^*$. Now, repeat the above arguments with $\tilde\beta$ replaced by $\hat\beta$, and let $\hat{\mathcal E} := \mathcal{E}(f_{\hat\beta})$. Then, either from Case 1 with this replacement,
$$\hat{\mathcal E} + \lambda\|\hat\beta - \beta^*\|_1 \le \frac{4(2+\nu^*)}{\nu^*}\,\varepsilon^*,$$
or we are in Case 2 and can apply the compatibility condition, and then we arrive at (11) with this replacement:
$$(1-\delta)\hat{\mathcal E} + \frac{\nu^*}{2+\nu^*}\,\lambda\|\hat\beta - \beta^*\|_1 \le 2\varepsilon^*.$$

Proof of Lemma 3.1. Let $\beta \in \mathbb{R}^p$ be arbitrary. It is clear that $\|\beta\|_2^2 \le \|f_\beta\|^2/\phi^2$. Hence,
$$\sum_{j \in J}|\beta_j| \le \sqrt{|J|}\,\Big(\sum_{j \in J}\beta_j^2\Big)^{1/2} \le \sqrt{|J|}\,\|\beta\|_2 \le \sqrt{|J|}\,\|f_\beta\|/\phi.$$

Proof of Lemma 3.2. Let $\beta = \beta_{\rm in} + \beta_{\rm out}$, with $\beta_{j,{\rm in}} := \beta_j 1\{j \in J\}$ and $\beta_{j,{\rm out}} := \beta_j 1\{j \notin J\}$, $j = 1, \dots, p$. Let $\mathcal{L} \supseteq J$ with $|\mathcal{L}| = L$, where $\mathcal{L}\backslash J$ is the set of indices of the $L - J$ largest $|\beta_j|$ with $j \notin J$. Define $b_1 := (\beta_j)_{j \in \mathcal{L}}$ and $b_2 := (\beta_j)_{j \notin \mathcal{L}}$. By the inequality of Subsection 3.1 (applied after permuting the coordinates so that $\mathcal{L}$ comes first),
$$b_1^T\Sigma_{1,1}(\mathcal{L})\,b_1 \le \|f_\beta\|^2 + 2\theta_q^2(J)\,\|b_1\|_2\,\|b_2\|_r.$$
By the auxiliary lemma below, for $q < \infty$,
$$\|b_2\|_r \le q^{1/r}\,\|\beta_{\rm out}\|_1/(L + 1 - J)^{1/q}. \quad (12)$$
This yields
$$b_1^T\Sigma_{1,1}(\mathcal{L})\,b_1 \le \|f_\beta\|^2 + 2\theta_q^2(J)\,q^{1/r}\,\|b_1\|_2\,\|\beta_{\rm out}\|_1/(L + 1 - J)^{1/q}.$$
It follows that
$$\|b_1\|_2^2 \le \frac{b_1^T\Sigma_{1,1}(\mathcal{L})\,b_1}{\phi^2(J)} \le \frac{\|f_\beta\|^2}{\phi^2(J)} + 2\,\frac{\theta_q^2(J)}{\phi^2(J)}\,q^{1/r}\,\|b_1\|_2\,\|\beta_{\rm out}\|_1/(L + 1 - J)^{1/q}.$$
This implies
$$\|b_1\|_2 \le \frac{\|f_\beta\|}{\phi(J)} + 2\,\frac{\theta_q^2(J)}{\phi^2(J)}\,q^{1/r}\,\|\beta_{\rm out}\|_1/(L + 1 - J)^{1/q}.$$
We thus arrive at
$$\sum_{j \in J}|\beta_j| \le \sqrt{J}\,\|\beta_{\rm in}\|_2 \le \sqrt{J}\,\|b_1\|_2 \le \frac{\sqrt{J}\,\|f_\beta\|}{\phi(J)} + 2\,\frac{\sqrt{J}\,\theta_q^2(J)}{\phi^2(J)}\,q^{1/r}\,\|\beta_{\rm out}\|_1/(L + 1 - J)^{1/q}.$$
The result for $q = \infty$ follows in the same manner, using instead of (12) the trivial bound $\|b_2\|_1 \le \|\beta_{\rm out}\|_1$.

Lemma 4.1 (Auxiliary lemma). Let $b_1 \ge b_2 \ge \cdots \ge 0$. For $1 < r \le \infty$, $1/q + 1/r = 1$, and $L \in \{0, 1, 2, \dots\}$, we have
$$\Big(\sum_{j=L+1}^{\infty} b_j^r\Big)^{1/r} \le q^{1/r}(L+1)^{-1/q}\sum_{j=1}^{\infty} b_j.$$

Proof. We have
$$\sum_{j=L+1}^{\infty} j^{-r} \le (L+1)^{-r} + \int_{L+1}^{\infty} x^{-r}\,dx = (L+1)^{-r} + \frac{(L+1)^{1-r}}{r-1} \le \frac{r}{r-1}\,(L+1)^{1-r} = q\,(L+1)^{1-r}.$$
Moreover, for all $j$, $j\,b_j \le \sum_{l=1}^{j} b_l \le \|b\|_1$, so that $b_j \le \|b\|_1/j$. Hence
$$\sum_{j=L+1}^{\infty} b_j^r \le \|b\|_1^r \sum_{j=L+1}^{\infty} j^{-r} \le q\,\|b\|_1^r\,(L+1)^{1-r},$$
and the result follows since $(r-1)/r = 1/q$. For $r = \infty$, the bound is immediate from $b_{L+1} \le \|b\|_1/(L+1)$.
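As a quick numerical sanity check of Lemma 4.1 (illustrative only; the vector $b$ and the grid of $r$ and $L$ values are arbitrary), one can compare the two sides of the bound:

import numpy as np

rng = np.random.default_rng(2)
b = np.sort(np.abs(rng.standard_normal(200)))[::-1]   # b_1 >= b_2 >= ... >= 0
for r in [1.5, 2.0, 4.0]:
    q = r / (r - 1.0)                                  # 1/q + 1/r = 1
    for L in [0, 5, 20]:
        lhs = np.sum(b[L:] ** r) ** (1.0 / r)
        rhs = q ** (1.0 / r) * (L + 1) ** (-1.0 / q) * np.sum(b)
        assert lhs <= rhs
print("Lemma 4.1 bound held in all tested cases")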

REFERENCES

Bunea, F., Tsybakov, A.B. and Wegkamp, M.H. (2006). Sparsity oracle inequalities for the Lasso. Submitted.

Bunea, F., Tsybakov, A.B. and Wegkamp, M.H. (2007). Sparse density estimation and aggregation with $\ell_1$ penalties. COLT 2007, 530-543.

Candes, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. To appear in Ann. Statist.

van de Geer, S.A. (2007). High-dimensional generalized linear models and the Lasso. To appear in Ann. Statist.