High-dimensional statistics (Part 1), May 2012

Theory for the Lasso

Recall the linear model
$$Y_i = \sum_{j=1}^p \beta_j X_i^{(j)} + \epsilon_i, \quad i = 1, \dots, n,$$
or, in matrix notation, $Y = X\beta + \epsilon$. To simplify, we assume that the design $X$ is fixed and that $\epsilon$ is $\mathcal{N}(0, \sigma^2 I)$-distributed. We moreover assume that the linear model holds exactly, with some true parameter value $\beta^0$.

What is an oracle inequality?

Suppose for the moment that $p \le n$ and that $X$ has full rank $p$. Consider the least squares estimator in the linear model
$$\hat\beta_{LM} := (X^T X)^{-1} X^T Y.$$
Then the prediction error $\|X(\hat\beta_{LM} - \beta^0)\|_2^2 / \sigma^2$ is $\chi^2_p$-distributed. In particular, this means that
$$\frac{\mathbb{E}\|X(\hat\beta_{LM} - \beta^0)\|_2^2}{n} = \frac{\sigma^2}{n}\, p.$$
In words: each parameter $\beta_j^0$ is estimated with squared accuracy $\sigma^2/n$, $j = 1, \dots, p$. The overall squared accuracy is then $(\sigma^2/n)\, p$.
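A quick Monte Carlo check of this $\chi^2_p$ behaviour; a minimal NumPy sketch in which $n$, $p$, $\sigma$ and the number of replications are arbitrary choices, not from the slides:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, sigma = 200, 10, 1.0
    X = rng.standard_normal((n, p))          # fixed design, full rank p
    beta0 = rng.standard_normal(p)           # true parameter

    errs = []
    for _ in range(2000):                    # Monte Carlo replications
        eps = sigma * rng.standard_normal(n)
        Y = X @ beta0 + eps
        beta_lm = np.linalg.lstsq(X, Y, rcond=None)[0]   # least squares estimator
        errs.append(np.sum((X @ (beta_lm - beta0)) ** 2))

    # E ||X(beta_lm - beta0)||_2^2 should be close to sigma^2 * p
    print(np.mean(errs), sigma**2 * p)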

Sparsity

We now turn to the situation where possibly $p > n$. The philosophy that will generally rescue us is to believe that in fact only a few, say $s_0$, of the $\beta_j^0$ are non-zero. We use the notation
$$S_0 := \{j : \beta_j^0 \neq 0\},$$
so that $s_0 = |S_0|$. We call $S_0$ the active set, and $s_0$ the sparsity index of $\beta^0$.

Notation

$$\beta_{j,S_0} := \beta_j \, 1\{j \in S_0\}, \qquad \beta_{j,S_0^c} := \beta_j \, 1\{j \notin S_0\}.$$
Clearly, $\beta = \beta_{S_0} + \beta_{S_0^c}$, and $\beta^0_{S_0^c} = 0$.

If we knew $S_0$, we could simply neglect all variables $X^{(j)}$ with $j \notin S_0$. Then, by the above argument, the overall squared accuracy would be $(\sigma^2/n)\, s_0$. Since $S_0$ is unknown, we apply the $\ell_1$-penalty, i.e., the Lasso
$$\hat\beta := \arg\min_\beta \left\{ \|Y - X\beta\|_2^2/n + \lambda \|\beta\|_1 \right\}.$$

Definition: Sparsity oracle inequality. The sparsity constant $\phi_0$ is the largest value $\phi_0 > 0$ such that the Lasso $\hat\beta$ satisfies the $\phi_0$-sparsity oracle inequality
$$\|X(\hat\beta - \beta^0)\|_2^2/n + \lambda \|\hat\beta_{S_0^c}\|_1 \le \frac{\lambda^2 s_0}{\phi_0^2}.$$
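The Lasso above can be computed numerically, e.g. with scikit-learn; a minimal sketch on synthetic data. Note that sklearn's Lasso minimizes $\|Y - X\beta\|_2^2/(2n) + \alpha\|\beta\|_1$, so $\alpha = \lambda/2$ corresponds to the $\lambda$ in the display; the data and the choice of $\lambda$ are illustrative, not from the slides:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(1)
    n, p, s0 = 100, 200, 5                    # p > n, sparse truth
    X = rng.standard_normal((n, p))
    beta0 = np.zeros(p); beta0[:s0] = 2.0     # active set S0 = {0,...,s0-1}
    Y = X @ beta0 + rng.standard_normal(n)

    lam = 2 * np.sqrt(2 * np.log(p) / n)      # a common choice of order sqrt(log p / n)
    # sklearn minimizes ||Y - X b||_2^2 / (2n) + alpha * ||b||_1, so alpha = lam / 2
    fit = Lasso(alpha=lam / 2, fit_intercept=False).fit(X, Y)
    print("estimated active set:", np.flatnonzero(fit.coef_))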

A digression: the noiseless case

Let $\mathcal{X}$ be some measurable space, $Q$ a probability measure on $\mathcal{X}$, and $\|\cdot\|$ the $L_2(Q)$ norm. Consider a fixed dictionary of functions $\{\psi_j\}_{j=1}^p \subset L_2(Q)$, and the linear functions
$$f_\beta(\cdot) = \sum_{j=1}^p \beta_j \psi_j(\cdot), \quad \beta \in \mathbb{R}^p.$$
Consider moreover a fixed target
$$f^0 := \sum_{j=1}^p \beta_j^0 \psi_j.$$
We let $S_0 := \{j : \beta_j^0 \neq 0\}$ be its active set, and $s_0 := |S_0|$ the sparsity index of $f^0$.

For some fixed $\lambda > 0$, the Lasso for the noiseless problem is
$$\beta^* := \arg\min_\beta \left\{ \|f_\beta - f^0\|^2 + \lambda \|\beta\|_1 \right\},$$
where $\|\cdot\|_1$ is the $\ell_1$-norm. We write $f^* := f_{\beta^*}$ and let $S_*$ be the active set of the Lasso. The Gram matrix is
$$\Sigma := \int \psi^T \psi \, dQ.$$

We will need certain conditions on the Gram matrix to make the theory work. We require a certain compatibility of $\ell_1$-norms with $\ell_2$-norms.

Compatibility. Let $L > 0$ be some constant. The compatibility constant is
$$\phi^2_\Sigma(L, S_0) := \phi^2(L, S_0) := \min\{ s_0\, \beta^T \Sigma \beta : \|\beta_{S_0}\|_1 = 1, \ \|\beta_{S_0^c}\|_1 \le L \}.$$
We say that the $(L, S_0)$-compatibility condition is met if $\phi(L, S_0) > 0$.

Back to the noisy case

Lemma (Basic Inequality). We have
$$\|X(\hat\beta - \beta^0)\|_2^2/n + 2\lambda \|\hat\beta\|_1 \le 2\epsilon^T X(\hat\beta - \beta^0)/n + 2\lambda \|\beta^0\|_1.$$

We introduce the set
$$\mathcal{T} := \left\{ \max_{1 \le j \le p} |\epsilon^T X^{(j)}|/n \le \lambda_0 \right\}.$$
We assume that $\lambda > \lambda_0$, to make sure that on $\mathcal{T}$ we can get rid of the random part of the problem.

Let us denote the diagonal elements of the Gram matrix $\hat\Sigma := X^T X/n$ by $\hat\sigma_j^2 := \hat\Sigma_{j,j}$, $j = 1, \dots, p$.

Lemma. Suppose that $\sigma^2 = \hat\sigma_j^2 = 1$ for all $j$. Then for all $t > 0$ and for
$$\lambda_0 := \sqrt{\frac{2t + 2\log p}{n}},$$
we have $P(\mathcal{T}) \ge 1 - 2\exp[-t]$.
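A small simulation of this lemma; a sketch in which $n$, $p$, $t$ and the number of repetitions are arbitrary, and the columns are rescaled so that $\hat\sigma_j = 1$:

    import numpy as np

    rng = np.random.default_rng(2)
    n, p, t = 400, 1000, 2.0
    lam0 = np.sqrt((2 * t + 2 * np.log(p)) / n)

    # columns normalized so that sigma_hat_j^2 = ||X_j||_2^2 / n = 1
    X = rng.standard_normal((n, p))
    X = X / np.sqrt((X ** 2).mean(axis=0))

    hits, reps = 0, 500
    for _ in range(reps):
        eps = rng.standard_normal(n)                  # sigma^2 = 1
        if np.max(np.abs(X.T @ eps) / n) <= lam0:     # the event T
            hits += 1
    print(hits / reps, ">=", 1 - 2 * np.exp(-t))      # empirical P(T) vs. the bound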

Compatibility condition (noisy case)

Let $L > 0$ be some constant. The compatibility constant is
$$\phi^2_{\hat\Sigma}(L, S_0) := \phi^2(L, S_0) := \min\{ s_0\, \beta^T \hat\Sigma \beta : \|\beta_{S_0}\|_1 = 1, \ \|\beta_{S_0^c}\|_1 \le L \}.$$
We say that the $(L, S_0)$-compatibility condition is met if $\phi(L, S_0) > 0$.

Theorem. Suppose $\lambda > \lambda_0$ and that the compatibility condition holds for $S_0$, with
$$L = \frac{\lambda + \lambda_0}{\lambda - \lambda_0}.$$
Then on
$$\mathcal{T} := \left\{ \max_{1 \le j \le p} |\epsilon^T X^{(j)}|/n \le \lambda_0 \right\},$$
we have
$$\|X(\hat\beta - \beta^0)\|_2^2/n \le 4(\lambda + \lambda_0)^2 s_0 / \phi^2(L, S_0),$$
$$\|\hat\beta_{S_0} - \beta^0\|_1 \le 2(\lambda + \lambda_0) s_0 / \phi^2(L, S_0),$$
$$\|\hat\beta_{S_0^c}\|_1 \le 2L(\lambda + \lambda_0) s_0 / \phi^2(L, S_0).$$

When does the compatibility condition hold?

[Diagram: implications among conditions yielding oracle inequalities for prediction and estimation: RIP, weak (S,2s)-RIP, coherence, adaptive and weak restricted regression, (S,2s)- and (S,s)-restricted eigenvalue, (S,2s)-, weak (S,2s)- and (S,s)-uniform irrepresentable conditions, and S-compatibility.]

If $\Sigma$ is non-singular, the compatibility condition holds, with $\phi^2(S_0) \ge \Lambda^2_{\min}$, the latter being the smallest eigenvalue of $\Sigma$.

Example. Consider the matrix
$$\Sigma := (1 - \rho)I + \rho \iota \iota^T = \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix},$$
with $0 < \rho < 1$ and $\iota := (1, \dots, 1)^T$ a vector of 1's. Then the smallest eigenvalue of $\Sigma$ is $\Lambda^2_{\min} = 1 - \rho$, so the compatibility condition holds with $\phi(S_0) \ge \sqrt{1 - \rho}$. (The uniform $S_0$-irrepresentable condition is met as well.)
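A quick numerical check of the eigenvalue claim (sketch; $p$ and $\rho$ are arbitrary):

    import numpy as np

    p, rho = 8, 0.6
    Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
    # eigenvalues: 1 - rho (multiplicity p - 1) and 1 + (p - 1) * rho (multiplicity 1)
    print(np.round(np.linalg.eigvalsh(Sigma), 6))
    print("Lambda_min^2 =", 1 - rho, " -> phi(S0) >= sqrt(1 - rho) =", np.sqrt(1 - rho))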


Geometric interpretation

Let $X_j \in \mathbb{R}^n$ denote the $j$-th column of $X$ ($j = 1, \dots, p$). The set
$$\mathcal{A} := \{X\beta_S : \|\beta_S\|_1 = 1\}$$
is the convex hull of the vectors $\{\pm X_j\}_{j \in S}$ in $\mathbb{R}^n$. Likewise, the set
$$\mathcal{B} := \{X\beta_{S^c} : \|\beta_{S^c}\|_1 \le L\}$$
is the convex hull, including its interior, of the vectors $\{\pm L X_j\}_{j \in S^c}$. The $\ell_1$-eigenvalue $\delta(L, S)$ is the distance between these two sets.

[Figure: the two convex sets $\mathcal{A}$ and $\mathcal{B}$, with $\delta(L, S)$ the distance between them.]

We note that:
- if $L$ is large, the $\ell_1$-eigenvalue will be small;
- it will also be small if the vectors in $S$ exhibit strong correlation with those in $S^c$;
- when the vectors in $\{X_j\}_{j \in S}$ are linearly dependent, it holds that $\{X\beta_S : \|\beta_S\|_1 = 1\} = \{X\beta_S : \|\beta_S\|_1 \le 1\}$, and hence then $\delta(L, S) = 0$.

The difference between the compatibility constant and the squared $\ell_1$-eigenvalue lies only in the normalization by the size $|S|$ of the set $S$. This normalization is inspired by the orthogonal case, which we detail in the following example.

Example. Suppose that the columns of $X$ are all orthogonal: $X_j^T X_k = 0$ for all $j \neq k$. Then $\delta(L, S) = 1/\sqrt{|S|}$ and $\phi(L, S) = 1$.

Let $S_\beta := \{j : \beta_j \neq 0\}$. We call $|S_\beta|$ the sparsity index of $\beta$. More generally, we call $|S|$ the sparsity index of the set $S$.

Definition. For a set $S$ and constant $L > 0$, the effective sparsity $\Gamma^2(L, S)$ is the inverse of the squared $\ell_1$-eigenvalue, that is,
$$\Gamma^2(L, S) = \frac{1}{\delta^2(L, S)}.$$

Example. As a simple numerical example, let us suppose $n = 2$, $p = 3$, $S = \{3\}$, and
$$X = \sqrt{n}\begin{pmatrix} 5/13 & 0 & 1 \\ 12/13 & 1 & 0 \end{pmatrix}.$$
The $\ell_1$-eigenvalue $\delta(L, S)$ is equal to the distance of $X_3$ to the set $\mathcal{B}$, the convex hull of $\{\pm L X_1, \pm L X_2\}$; that is,
$$\delta(L, S) = \max\{(5 - L)/\sqrt{26},\, 0\}.$$
Hence, for example, for $L = 3$ the effective sparsity is $\Gamma^2(3, S) = 13/2$. Alternatively, when
$$X = \sqrt{n}\begin{pmatrix} 12/13 & 0 & 1 \\ 5/13 & 1 & 0 \end{pmatrix},$$
then for example $\delta(3, S) = 0$ and hence $\Gamma^2(3, S) = \infty$. This is due to the sharper angle between $X_1$ and $X_3$.
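These distances are easy to verify numerically; a sketch that computes the distance from $X_3$ to the set $\mathcal{B}$, working with the columns divided by $\sqrt{n}$ as in the Gram-matrix normalization (the helper dist_to_hull is ad hoc, not from the slides):

    import numpy as np
    from scipy.optimize import minimize

    def dist_to_hull(point, vertices):
        """Euclidean distance from `point` to conv(vertices); a small QP via SLSQP."""
        V = np.asarray(vertices, dtype=float)
        k = len(V)
        obj = lambda w: np.sum((V.T @ w - point) ** 2)
        cons = ({'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0},)
        res = minimize(obj, np.ones(k) / k, bounds=[(0, None)] * k, constraints=cons)
        return np.sqrt(res.fun)

    L = 3.0
    x1, x2, x3 = np.array([5/13, 12/13]), np.array([0.0, 1.0]), np.array([1.0, 0.0])
    B = [L * x1, -L * x1, L * x2, -L * x2]          # extreme points of the set B
    delta = dist_to_hull(x3, B)                     # distance from A = {+-X_3} to B
    print(delta, (5 - L) / np.sqrt(26))             # should agree (about 0.392)
    print("effective sparsity:", 1 / delta ** 2)    # about 13/2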

The compatibility condition is slightly weaker than the restricted eigenvalue condition of Bickel et al. [2009]. The restricted isometry property of Candes [2005] implies the restricted eigenvalue condition.


Approximating the Gram matrix

For two (positive semi-definite) matrices $\Sigma_0$ and $\Sigma_1$, we define the supremum distance
$$\|\Sigma_1 - \Sigma_0\|_\infty := \max_{j,k} |(\Sigma_1)_{j,k} - (\Sigma_0)_{j,k}|.$$

Lemma. Assume $\|\Sigma_1 - \Sigma_0\|_\infty \le \lambda$ and $\|\beta_{S_0^c}\|_1 \le 3\|\beta_{S_0}\|_1$. Then
$$\left| \frac{\|f_\beta\|^2_{\Sigma_1}}{\|f_\beta\|^2_{\Sigma_0}} - 1 \right| \le \frac{16\,\lambda\, s_0}{\phi^2_{\mathrm{compatible}}(\Sigma_0, S_0)}.$$

Corollary. We have
$$\phi_{\Sigma_1}(3, S_0) \ge \phi_{\Sigma_0}(3, S_0) - 4\sqrt{\|\Sigma_0 - \Sigma_1\|_\infty\, s_0}.$$

Example. Suppose we have a Gaussian random matrix $\hat\Sigma := X^T X/n = (\hat\sigma_{j,k})$, where $X = (X_{i,j})$ is an $n \times p$ matrix with i.i.d. $\mathcal{N}(0,1)$-distributed entries in each column. For all $t > 0$, and for
$$\lambda(t) := \sqrt{\frac{4t + 8\log p}{n}} + \frac{4t + 8\log p}{n},$$
one has the inequality
$$P\left( \|\hat\Sigma - \Sigma\|_\infty \ge \lambda(t) \right) \le 2\exp[-t].$$

Example (continued). Hence, we know for example that with probability at least $1 - 2\exp[-t]$,
$$\phi_{\mathrm{compatible}}(\hat\Sigma, S_0) \ge \Lambda_{\min}(\Sigma) - 4\sqrt{\lambda(t)\, s_0}.$$
This leads to a bound on the sparsity of the form $s_0 = o(1/\lambda(t))$, which roughly says that $s_0$ should be of small order $\sqrt{n/\log p}$.
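The deviation bound of the previous example can be checked by simulation; a sketch with $\Sigma = I$ (the case of i.i.d. standard normal entries) and arbitrary $n$, $p$, $t$:

    import numpy as np

    rng = np.random.default_rng(3)
    n, p, t = 500, 50, 2.0
    a = (4 * t + 8 * np.log(p)) / n
    lam_t = np.sqrt(a) + a                      # lambda(t) from the example

    exceed, reps = 0, 200
    for _ in range(reps):
        X = rng.standard_normal((n, p))         # i.i.d. N(0,1) entries, so Sigma = I
        Sigma_hat = X.T @ X / n
        if np.max(np.abs(Sigma_hat - np.eye(p))) >= lam_t:
            exceed += 1
    print(exceed / reps, "<=", 2 * np.exp(-t))  # empirical exceedance vs. the bound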

Definition. We call a random variable $X$ sub-Gaussian if for some constants $K$ and $\sigma_0^2$,
$$\mathbb{E}\exp[X^2/K^2] \le \sigma_0^2.$$

Theorem. Suppose $X_1, \dots, X_n$ are uniformly sub-Gaussian with constants $K$ and $\sigma_0^2$. Then for a constant $\eta = \eta(K, \sigma_0^2)$, it holds that
$$\beta^T \hat\Sigma \beta \ge \frac{1}{3}\,\beta^T \Sigma \beta - \frac{t + \log p}{n}\,\frac{\|\beta\|_1^2}{\eta^2},$$
with probability at least $1 - 2\exp[-t]$. See Raskutti, Wainwright and Yu [2010].

General convex loss

Consider data $\{Z_i\}_{i=1}^n$, with $Z_i$ in some space $\mathcal{Z}$. Consider a linear space
$$\mathcal{F} := \{f_\beta(\cdot) = \textstyle\sum_{j=1}^p \beta_j \psi_j(\cdot) : \beta \in \mathbb{R}^p\}.$$
For each $f \in \mathcal{F}$, let $\rho_f : \mathcal{Z} \to \mathbb{R}$ be a loss function. We assume that the map $f \mapsto \rho_f(z)$ is convex for all $z \in \mathcal{Z}$. For example, $Z_i = (X_i, Y_i)$, and $\rho$ is quadratic loss
$$\rho_f(\cdot, y) = (y - f(\cdot))^2,$$
or logistic loss
$$\rho_f(\cdot, y) = -y f(\cdot) + \log(1 + \exp[f(\cdot)]),$$
or minus log-likelihood loss $\rho_f = -f + \log \int \exp[f]$, etc.

We denote, for a function $\rho : \mathcal{Z} \to \mathbb{R}$, the empirical average by
$$P_n \rho := \sum_{i=1}^n \rho(Z_i)/n,$$
and the theoretical mean by
$$P\rho := \sum_{i=1}^n \mathbb{E}\rho(Z_i)/n.$$
The Lasso is
$$\hat\beta = \arg\min_\beta \left\{ P_n \rho_{f_\beta} + \lambda \|\beta\|_1 \right\}. \tag{1}$$
We write $\hat f = f_{\hat\beta}$.
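For logistic loss, (1) is $\ell_1$-penalized logistic regression; a hedged scikit-learn sketch. Sklearn's LogisticRegression minimizes $\|\beta\|_1 + C \sum_i \rho_{f_\beta}(Z_i)$ for the l1 penalty, so $C \approx 1/(n\lambda)$ corresponds to the $\lambda$ in (1); the data and the choice of $\lambda$ are illustrative:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(4)
    n, p, s0 = 200, 500, 4
    X = rng.standard_normal((n, p))
    beta0 = np.zeros(p); beta0[:s0] = 1.5
    prob = 1 / (1 + np.exp(-X @ beta0))
    Y = rng.binomial(1, prob)

    lam = 2 * np.sqrt(np.log(p) / n)
    # sklearn minimizes ||b||_1 + C * sum_i logistic losses, so C ~ 1/(n * lam)
    fit = LogisticRegression(penalty='l1', C=1 / (n * lam), solver='liblinear',
                             fit_intercept=False).fit(X, Y)
    print("selected variables:", np.flatnonzero(fit.coef_.ravel()))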

We furthermore define the target as the minimizer of the theoretical risk,
$$f^0 := \arg\min_{f \in \mathcal{F}} P\rho_f.$$
The excess risk is
$$\mathcal{E}(f) := P(\rho_f - \rho_{f^0}).$$
Note that by definition, $\mathcal{E}(f) \ge 0$ for all $f \in \mathcal{F}$. We will mainly examine the excess risk $\mathcal{E}(\hat f)$ of the Lasso.

Definition. We say that the margin condition holds with strictly convex function $G$ if
$$\mathcal{E}(f) \ge G(\|f - f^0\|).$$
In typical cases, the margin condition holds with quadratic function $G$, that is, $G(u) = cu^2$, $u \ge 0$, where $c$ is a positive constant.

[Figure: the functions $uv$, $uv - G(u)$ and $-G(u)$, illustrating the conjugate value $H(v)$.]

Definition. Let $G$ be a strictly convex function on $[0, \infty)$. Its convex conjugate $H$ is defined as
$$H(v) = \sup_u \{uv - G(u)\}, \quad v \ge 0.$$
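For the quadratic margin $G(u) = cu^2$ the conjugate can be computed explicitly; a short calculation (the corollary below uses the case $c = 1$):
$$H(v) = \sup_{u \ge 0}\{uv - cu^2\}, \qquad \frac{\partial}{\partial u}(uv - cu^2) = v - 2cu = 0 \ \Rightarrow\ u^* = \frac{v}{2c}, \qquad H(v) = \frac{v^2}{2c} - \frac{v^2}{4c} = \frac{v^2}{4c}.$$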

Set
$$Z_M := \sup_{\|\beta - \beta^0\|_1 \le M} (P_n - P)(\rho_{f_\beta} - \rho_{f_{\beta^0}}), \tag{2}$$
and
$$M_0 := H\!\left( \frac{4\lambda \sqrt{s_0}}{\phi(S_0)} \right)\Big/\lambda_0, \tag{3}$$
where $\phi(S_0) = \phi_{\mathrm{compatible}}(S_0)$. Set
$$\mathcal{T} := \{Z_{M_0} \le \lambda_0 M_0\}. \tag{4}$$

Theorem (Oracle inequality for the Lasso). Assume the compatibility condition and the margin condition with strictly convex function $G$. Take $\lambda \ge 8\lambda_0$. Then on the set $\mathcal{T}$ given in (4), we have
$$\mathcal{E}(\hat f) + \lambda \|\hat\beta - \beta^0\|_1 \le 4 H\!\left( \frac{4\lambda \sqrt{s_0}}{\phi(S_0)} \right).$$

Corollary. Assume quadratic margin behavior, i.e., $G(u) = u^2$. Then $H(v) = v^2/4$, and we obtain on $\mathcal{T}$,
$$\mathcal{E}(\hat f) + \lambda \|\hat\beta - \beta^0\|_1 \le \frac{4\lambda^2 s_0}{\phi^2(S_0)}.$$

$\ell_2$-rates

To derive rates for $\|\hat\beta - \beta^0\|_2$, we need a stronger compatibility condition.

Definition. We say that the $(S_0, 2s_0)$-restricted eigenvalue condition is satisfied, with constant $\phi = \phi(S_0, 2s_0) > 0$, if for all $N \supseteq S_0$ with $|N| = 2s_0$, and all $\beta \in \mathbb{R}^p$ that satisfy $\|\beta_{S_0^c}\|_1 \le 3\|\beta_{S_0}\|_1$ and $|\beta_j| \le \min_{k \in N \setminus S_0} |\beta_k|$ for all $j \notin N$, it holds that
$$\|\beta_N\|_2 \le \|f_\beta\|/\phi.$$

Lemma. Suppose the conditions of the previous theorem are met, but now with the stronger $(S_0, 2s_0)$-restricted eigenvalue condition. On $\mathcal{T}$,
$$\|\hat\beta - \beta^0\|_2^2 \le 16\left( H\!\left(\frac{4\lambda\sqrt{s_0}}{\phi}\right)\right)^2\!\Big/(\lambda^2 s_0) + \frac{\lambda^2 s_0}{4\phi^4}.$$
In the case of quadratic margin behavior, with $G(u) = u^2$, we then get on $\mathcal{T}$,
$$\|\hat\beta - \beta^0\|_2^2 \le \frac{16\lambda^2 s_0}{\phi^4}.$$

Theory for $\ell_1/\ell_2$-penalties

Group Lasso:
$$Y_i = \sum_{j=1}^p \left( \sum_{t=1}^T X^{(j)}_{i,t} \beta^0_{j,t} \right) + \epsilon_i, \quad i = 1, \dots, n,$$
where the $\beta^0_j := (\beta^0_{j,1}, \dots, \beta^0_{j,T})^T$ have the sparsity property $\beta^0_j \equiv 0$ for most $j$.

$\ell_1/\ell_2$-penalty:
$$\|\beta\|_{2,1} := \sum_{j=1}^p \|\beta_j\|_2.$$

Multivariate linear model:
$$Y_{i,t} = \sum_{j=1}^p X^{(j)}_{i,t} \beta^0_{j,t} + \epsilon_{i,t}, \quad i = 1, \dots, n, \ t = 1, \dots, T,$$
with, for $\beta^0_j := (\beta^0_{j,1}, \dots, \beta^0_{j,T})^T$, the sparsity property $\beta^0_j \equiv 0$ for most $j$.

Linear model with time-varying coefficients:
$$Y_i(t) = \sum_{j=1}^p X^{(j)}_i(t)\, \beta^0_j(t) + \epsilon_i(t), \quad i = 1, \dots, n, \ t = 1, \dots, T,$$
where the coefficients $\beta^0_j(\cdot)$ are smooth functions, with the sparsity property that most of the $\beta^0_j \equiv 0$.

High-dimensional additive model:
$$Y_i = \sum_{j=1}^p f^0_j(X^{(j)}_i) + \epsilon_i, \quad i = 1, \dots, n,$$
where the $f^0_j$ are (non-parametric) smooth functions, with the sparsity property $f^0_j \equiv 0$ for most $j$.

Theorem. Consider the group Lasso
$$\hat\beta = \arg\min_\beta \left\{ \|Y - X\beta\|_2^2/n + \lambda \sqrt{T}\, \|\beta\|_{2,1} \right\},$$
where $\lambda \ge 4\lambda_0$, with
$$\lambda_0 = \frac{2}{\sqrt{n}} \sqrt{1 + \sqrt{\frac{4x + 4\log p}{T}} + \frac{4x + 4\log p}{T}}.$$
Then with probability at least $1 - \exp[-x]$, we have
$$\|X\hat\beta - f^0\|_2^2/n + \lambda\sqrt{T}\, \|\hat\beta - \beta^0\|_{2,1} \le \frac{24\lambda^2 T s_0}{\phi_0^2}, \qquad s_0 = |S_0|.$$
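scikit-learn has no group Lasso, but an estimator of this form can be computed with a plain proximal-gradient (ISTA) loop; a minimal sketch, with the step size, iteration count, groups and $\lambda$ chosen ad hoc:

    import numpy as np

    def group_lasso(X, Y, groups, lam, n_iter=500):
        """Minimize ||Y - X b||_2^2 / n + lam * sum_g ||b_g||_2 by proximal gradient."""
        n, p = X.shape
        b = np.zeros(p)
        step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2 / n)   # 1 / Lipschitz constant
        for _ in range(n_iter):
            grad = -2 * X.T @ (Y - X @ b) / n
            z = b - step * grad
            for g in groups:                                # group soft-thresholding
                norm_g = np.linalg.norm(z[g])
                b[g] = 0.0 if norm_g == 0 else max(0.0, 1 - step * lam / norm_g) * z[g]
        return b

    rng = np.random.default_rng(5)
    n, p, T = 100, 40, 4                                    # 10 groups of size T
    groups = [list(range(j, j + T)) for j in range(0, p, T)]
    X = rng.standard_normal((n, p))
    beta0 = np.zeros(p); beta0[:T] = 1.0                    # only the first group active
    Y = X @ beta0 + rng.standard_normal(n)

    b_hat = group_lasso(X, Y, groups, lam=0.5 * np.sqrt(T)) # sqrt(T) absorbed into lam
    print([j for j, g in enumerate(groups) if np.linalg.norm(b_hat[g]) > 1e-6])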

Theorem. Consider the smoothed group Lasso
$$\hat\beta := \arg\min_\beta \left\{ \|Y - X\beta\|_2^2/n + \lambda \|\beta\|_{2,1} + \lambda^2 \|B\beta\|_{2,1} \right\},$$
where $\lambda \ge 4\lambda_0$. Then on
$$\mathcal{T} := \left\{ 2|\epsilon^T X\beta|/n \le \lambda_0 \|\beta\|_{2,1} + \lambda_0^2 \|B\beta\|_{2,1} \ \ \forall \beta \right\},$$
we have
$$\|\hat f - f^0\|_n^2 + \lambda\, \mathrm{pen}(\hat\beta - \beta^0)/2 \le 3\left\{ 16\lambda^2 s_0/\phi_0^2 + 2\lambda^2 \|B\beta^0\|_{2,1} \right\}.$$

etc.