STAT 200C: High-dimensional Statistics


STAT 200C: High-dimensional Statistics
Arash A. Amini
May 30, 2018

Table of Contents
1 Sparse linear models
  Basis pursuit and the restricted null space property
  Sufficient conditions for RNS

Linear regression setup
The data is (y, X) where y ∈ R^n and X ∈ R^{n×d}, and the model is
  y = Xθ* + w,
where θ* ∈ R^d is an unknown parameter and w ∈ R^n is the vector of noise variables.
Equivalently, y_i = ⟨θ*, x_i⟩ + w_i, i = 1, ..., n, where x_i ∈ R^d is the ith row of X:
  X = [x_1^T; x_2^T; ...; x_n^T]  (an n × d matrix).
Recall ⟨θ*, x_i⟩ = ∑_{j=1}^d θ*_j x_{ij}.

Sparsity models
When n < d, there is no hope of estimating θ* unless we impose some sort of low-dimensional model on θ*.
Support of θ* (recall [d] = {1, ..., d}):
  supp(θ*) := S(θ*) = { j ∈ [d] : θ*_j ≠ 0 }.
Hard sparsity assumption: s = |S(θ*)| ≪ d.
Weaker sparsity assumption via ℓ_q balls for q ∈ [0, 1]:
  B_q(R_q) = { θ ∈ R^d : ∑_{j=1}^d |θ_j|^q ≤ R_q }.
q = 1 gives the ℓ_1 ball; q = 0 gives the ℓ_0 ball, the same as hard sparsity:
  ‖θ‖_0 := |S(θ)| = #{ j : θ_j ≠ 0 }.
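
A tiny numeric sketch of these two sparsity notions (the example vector is my own, not from the slides): ‖θ‖_0 counts nonzero entries, while the ℓ_q "norm" sums |θ_j|^q.

    # Hard sparsity (l0) versus the weaker l_q-ball notion, q in (0, 1].
    import numpy as np

    theta = np.array([1.0, -0.5, 0.0, 0.02, 0.0])
    print("||theta||_0 =", np.count_nonzero(theta))        # support size
    for q in (1.0, 0.5):
        # theta lies in B_q(R_q) whenever this sum is <= R_q
        print(f"sum |theta_j|^{q} =", np.sum(np.abs(theta) ** q))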

[Figure from the HDS book.]

Basis pursuit
Consider the noiseless case y = Xθ*. We assume that ‖θ*‖_0 is small.
Ideal program to solve:
  min_{θ ∈ R^d} ‖θ‖_0  subject to  y = Xθ.
‖·‖_0 is highly non-convex, so relax it to ‖·‖_1:
  min_{θ ∈ R^d} ‖θ‖_1  subject to  y = Xθ.   (1)
This is called basis pursuit (regression).
(1) is a convex program; in fact, it can be written as a linear program.*
Global solutions can be obtained very efficiently.
*Exercise: Introduce auxiliary variables s_j ∈ R and note that minimizing ∑_j s_j subject to |θ_j| ≤ s_j gives the ℓ_1 norm of θ.
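
A minimal sketch of the LP reformulation from the footnote (problem sizes, seed, and variable names are my own choices; with these sizes exact recovery typically, but not always, succeeds):

    # Basis pursuit as a linear program: variables [theta; t], minimize sum(t)
    # subject to |theta_j| <= t_j and X theta = y.
    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    n, d, s = 40, 100, 5
    X = rng.standard_normal((n, d))
    theta_star = np.zeros(d)
    theta_star[:s] = rng.standard_normal(s)
    y = X @ theta_star                       # noiseless observations

    c = np.concatenate([np.zeros(d), np.ones(d)])        # objective: sum_j t_j
    A_ub = np.block([[np.eye(d), -np.eye(d)],            # theta - t <= 0
                     [-np.eye(d), -np.eye(d)]])          # -theta - t <= 0
    b_ub = np.zeros(2 * d)
    A_eq = np.hstack([X, np.zeros((n, d))])              # X theta = y
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * (2 * d))
    theta_hat = res.x[:d]
    print("recovery error:", np.linalg.norm(theta_hat - theta_star))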

Table of Contents
1 Sparse linear models
  Basis pursuit and the restricted null space property
  Sufficient conditions for RNS

Define
  C(S) = { Δ ∈ R^d : ‖Δ_{S^c}‖_1 ≤ ‖Δ_S‖_1 }.   (2)

Theorem 1
The following two are equivalent:
(a) For any θ* ∈ R^d with support S, the basis pursuit program (1) applied to the data (y = Xθ*, X) has unique solution θ̂ = θ*.
(b) The restricted null space (RNS) property holds, i.e.,
  C(S) ∩ ker(X) = {0}.   (3)

Proof
Consider the tangent cone to the ℓ_1 ball (of radius ‖θ*‖_1) at θ*:
  T(θ*) = { Δ ∈ R^d : ‖θ* + tΔ‖_1 ≤ ‖θ*‖_1 for some t > 0 },
i.e., the set of descent directions for the ℓ_1 norm at the point θ*.
The feasible set is θ* + ker(X), i.e., ker(X) is the set of feasible directions Δ = θ − θ*.
Hence, there is a minimizer other than θ* if and only if
  T(θ*) ∩ ker(X) ≠ {0}.   (4)
It is enough to show that
  C(S) = ⋃_{θ* ∈ R^d : supp(θ*) ⊆ S} T(θ*).

[Figure: the ℓ_1 ball B_1, the tangent cones T(θ^(1)) and T(θ^(2)), ker(X), and the cone C(S), for d = 2, [d] = {1, 2}, S = {2}, θ^(1) = (0, 1), θ^(2) = (0, −1). Here C(S) = { (Δ_1, Δ_2) : |Δ_1| ≤ |Δ_2| }.]

It is enough to show that
  C(S) = ⋃_{θ* ∈ R^d : supp(θ*) ⊆ S} T(θ*).   (5)
We have Δ ∈ T_1(θ*)* iff
  ‖Δ_{S^c}‖_1 ≤ ‖θ*_S‖_1 − ‖θ*_S + Δ_S‖_1.
We have Δ ∈ T_1(θ*) for some θ* ∈ R^d such that supp(θ*) ⊆ S iff
  ‖Δ_{S^c}‖_1 ≤ sup_{θ*_S ∈ R^s} [ ‖θ*_S‖_1 − ‖θ*_S + Δ_S‖_1 ] = ‖Δ_S‖_1.
*Let T_1(θ*) be the subset of T(θ*) where t = 1, and argue that w.l.o.g. we can work with this subset.

Table of Contents
1 Sparse linear models
  Basis pursuit and the restricted null space property
  Sufficient conditions for RNS

Sufficient conditions for restricted null space
[d] := {1, ..., d}. For a matrix X ∈ R^{n×d}, let X_j be its jth column (for j ∈ [d]).
The pairwise incoherence of X is defined as
  δ_PW(X) := max_{i,j ∈ [d]} | ⟨X_i, X_j⟩/n − 1{i = j} |.
Alternative form: X^T X is the Gram matrix of X, with (X^T X)_{ij} = ⟨X_i, X_j⟩, so
  δ_PW(X) = ‖ X^T X / n − I_d ‖_∞,
where ‖·‖_∞ is the vector ℓ_∞ norm of the matrix (maximum absolute entry).
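
A small numeric sketch of this definition (sizes and seed are my own): compute the Gram matrix and take the largest absolute deviation from the identity.

    # Pairwise incoherence delta_PW(X) = max_{i,j} | <X_i, X_j>/n - 1{i=j} |.
    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 200, 50
    X = rng.standard_normal((n, d))
    G = X.T @ X / n                          # Gram matrix X^T X / n
    delta_pw = np.max(np.abs(G - np.eye(d)))
    print("pairwise incoherence:", delta_pw)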

Proposition 1 (HDS Prop. 7.1)
(Uniform) restricted null space holds for all S with |S| ≤ s if
  δ_PW(X) ≤ 1/(3s).
Proof: Exercise 7.3.

A more relaxed condition:
Definition 1 (RIP)
X ∈ R^{n×d} satisfies a restricted isometry property (RIP) of order s with constant δ_s(X) > 0 if
  ‖ X_S^T X_S / n − I_s ‖_op ≤ δ_s(X),  for all S with |S| ≤ s.
PW incoherence is close to RIP with s = 2; for example, when ‖X_j/√n‖_2 = 1 for all j, we have δ_2(X) = δ_PW(X).
In general, for any s ≥ 2 (Exercise 7.4),
  δ_PW(X) ≤ δ_s(X) ≤ s δ_PW(X).

Definition (RIP, restated)
X ∈ R^{n×d} satisfies a restricted isometry property (RIP) of order s with constant δ_s(X) > 0 if
  ‖ X_S^T X_S / n − I_s ‖_op ≤ δ_s(X),  for all S with |S| ≤ s.
Let x_i^T be the ith row of X. Consider the sample covariance matrix:
  Σ̂ := (1/n) X^T X = (1/n) ∑_{i=1}^n x_i x_i^T ∈ R^{d×d}.
Then Σ̂_SS = (1/n) X_S^T X_S; hence RIP says ‖Σ̂_SS − I‖_op ≤ δ < 1, i.e., Σ̂_SS ≈ I_s.
More precisely,
  (1 − δ) ‖u‖_2^2 ≤ u^T Σ̂_SS u ≤ (1 + δ) ‖u‖_2^2,  for all u ∈ R^s.
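
Computing δ_s(X) exactly requires scanning all (d choose s) supports, so here is a rough Monte Carlo lower bound over random supports (the sampling scheme, sizes, and seed are my own illustration, not from the slides):

    # Lower-bound delta_s(X) by checking || X_S^T X_S / n - I ||_op on random supports.
    import numpy as np

    rng = np.random.default_rng(2)
    n, d, s = 200, 50, 5
    X = rng.standard_normal((n, d))

    def rip_lower_bound(X, s, n_trials=2000, rng=rng):
        n = X.shape[0]
        best = 0.0
        for _ in range(n_trials):
            S = rng.choice(X.shape[1], size=s, replace=False)
            M = X[:, S].T @ X[:, S] / n - np.eye(s)
            best = max(best, np.linalg.norm(M, 2))   # operator (spectral) norm
        return best

    print("estimated delta_s(X) >=", rip_lower_bound(X, s))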

RIP gives sufficient conditions:
Proposition 2 (HDS Prop. 7.2)
(Uniform) restricted null space holds for all S with |S| ≤ s if
  δ_2s(X) ≤ 1/3.
Consider a sub-Gaussian matrix X with i.i.d. entries (Exercise 7.7). We have, w.h.p.,
  n ≳ s^2 log d  ⟹  δ_PW(X) < 1/(3s),
  n ≳ s log(ed/s)  ⟹  δ_2s(X) < 1/3.
The sample complexity requirement for RIP is milder.
The above corresponds to Σ = αI. For more general covariance Σ, it is harder to satisfy either PW or RIP.

Neither RIP nor PW is necessary
Consider X ∈ R^{n×d} with i.i.d. rows X_i ∼ N(0, Σ). Letting 1 ∈ R^d be the all-ones vector, take
  Σ := (1 − µ) I_d + µ 1 1^T  for µ ∈ [0, 1)
(a spiked covariance matrix). We have γ_max(Σ_SS) = 1 + µ(s − 1) → ∞ as s → ∞.
Exercise 7.8:
(a) PW is violated w.h.p. unless µ ≲ 1/s.
(b) RIP is violated w.h.p. unless µ ≲ 1/√s. In fact δ_2s grows like µ√s for any fixed µ ∈ (0, 1).
However, for any µ ∈ [0, 1), basis pursuit succeeds w.h.p. if
  n ≳ s log(ed/s).
(A later result shows this.)
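
A quick numeric sketch of the eigenvalue growth that breaks these conditions in the spiked model (the value of µ and the sizes are my own choices):

    # gamma_max(Sigma_SS) = 1 + mu(s - 1) grows linearly in s for the spiked model.
    import numpy as np

    mu = 0.3
    for s in [5, 20, 80]:
        Sigma_SS = (1 - mu) * np.eye(s) + mu * np.ones((s, s))
        print(s, np.linalg.eigvalsh(Sigma_SS).max(), 1 + mu * (s - 1))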


Noisy sparse regression
A very popular estimator is the ℓ_1-regularized least squares (the Lagrangian lasso):
  θ̂ ∈ argmin_{θ ∈ R^d} [ (1/2n) ‖y − Xθ‖_2^2 + λ ‖θ‖_1 ].   (6)
The idea: minimizing the ℓ_1 norm leads to sparse solutions.
(6) is a convex program; a global solution can be obtained efficiently.
Other options: the constrained form of the lasso and relaxed basis pursuit:
  min_{‖θ‖_1 ≤ R} (1/2n) ‖y − Xθ‖_2^2,   (7)
  min_{θ ∈ R^d} ‖θ‖_1  s.t.  (1/2n) ‖y − Xθ‖_2^2 ≤ b^2.   (8)
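
A sketch of solving (6) numerically (my own illustration; the data-generating sizes, seed, and the particular choice of λ are assumptions, not from the slides). The objective used by scikit-learn's Lasso, (1/(2n))‖y − Xθ‖_2^2 + α‖θ‖_1, matches (6) with α playing the role of λ.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(3)
    n, d, s, sigma = 200, 500, 10, 0.5
    X = rng.standard_normal((n, d))
    theta_star = np.zeros(d); theta_star[:s] = 1.0
    y = X @ theta_star + sigma * rng.standard_normal(n)

    lam = 2 * sigma * np.sqrt(2 * np.log(d) / n)      # order suggested by the theory below
    theta_hat = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(X, y).coef_
    print("l2 error:", np.linalg.norm(theta_hat - theta_star))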

A strengthening of RNS is the following. For a constant α ≥ 1, define
  C_α(S) := { Δ ∈ R^d : ‖Δ_{S^c}‖_1 ≤ α ‖Δ_S‖_1 }.

Definition 2 (RE condition)
A matrix X satisfies the restricted eigenvalue (RE) condition over S with parameters (κ, α) if
  (1/n) ‖XΔ‖_2^2 ≥ κ ‖Δ‖_2^2  for all Δ ∈ C_α(S).
Intuition: θ̂ minimizes L(θ) := (1/2n) ‖Xθ − y‖_2^2. Ideally, δL := L(θ̂) − L(θ*) is small.
We want to translate deviations in the loss into deviations in the parameter, θ̂ − θ*.
This is controlled by the curvature of the loss, captured by the Hessian ∇²L(θ) = (1/n) X^T X.

Ideally we would like strong convexity (in all directions):
  ⟨Δ, ∇²L(θ) Δ⟩ ≥ κ ‖Δ‖_2^2,  for all Δ ∈ R^d \ {0},
or, in the context of regression,
  (1/n) ‖XΔ‖_2^2 ≥ κ ‖Δ‖_2^2,  for all Δ ∈ R^d \ {0}.

In high dimensions, we cannot guarantee this in all directions: the loss is flat over ker(X).

Side note: strong convexity
A twice-differentiable function L is strongly convex if ∇²L(θ) ⪰ κI for all θ, in other words if ∇²L(θ) − κI ⪰ 0 for all θ: the Hessian is uniformly bounded below (in all directions).
By Taylor expansion, the function then has a quadratic lower bound:
  L(θ* + Δ) ≥ L(θ*) + ⟨∇L(θ*), Δ⟩ + (κ/2) ‖Δ‖_2^2.
Alternatively, L(θ) is strongly convex if L(θ) − (κ/2) ‖θ‖_2^2 is convex.
In contrast, assuming smoothness, L is strictly convex iff ∇²L(θ) ≻ 0, not necessarily uniformly lower bounded.
Example: f(x) = e^x on R is strictly convex but not strongly convex: f''(x) > 0 for all x but f''(x) → 0 as x → −∞. Similarly: f(x) = 1/x over (0, ∞).

Theorem 2
Assume that y = Xθ* + w, where X ∈ R^{n×d} and θ* ∈ R^d, and that
  θ* is supported on S ⊆ [d] with |S| ≤ s,
  X satisfies RE(κ, 3) over S.
Let us define z = X^T w / n and γ^2 := ‖w‖_2^2 / (2n). Then, we have the following:
(a) Any solution of the Lagrangian lasso (6) with λ ≥ 2 ‖z‖_∞ satisfies
  ‖θ̂ − θ*‖_2 ≤ (3/κ) √s λ.
(b) Any solution of the constrained lasso (7) with R = ‖θ*‖_1 satisfies
  ‖θ̂ − θ*‖_2 ≤ (4/κ) √s ‖z‖_∞.
(c) Any solution of relaxed basis pursuit (8) with b^2 ≥ γ^2 satisfies
  ‖θ̂ − θ*‖_2 ≤ (4/κ) √s ‖z‖_∞ + 2 √( (b^2 − γ^2)/κ ).

Example (fixed design regression)
Assume y = Xθ* + w where w ∼ N(0, σ² I_n), and X ∈ R^{n×d} is fixed, satisfying the RE condition and the column normalization
  max_{j=1,...,d} ‖X_j‖_2 / √n ≤ C,
where X_j is the jth column of X. Recall z = X^T w / n.
It is easy to show that, w.p. at least 1 − 2e^{−nδ²/2},
  ‖z‖_∞ ≤ Cσ ( √(2 log d / n) + δ ).
Thus, setting λ = 2Cσ ( √(2 log d / n) + δ ), the lasso solution satisfies
  ‖θ̂ − θ*‖_2 ≤ (6Cσ/κ) √s ( √(2 log d / n) + δ )
w.p. at least 1 − 2e^{−nδ²/2}.
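
A quick numeric check (my own sketch; sizes and seed are assumptions) of the key quantity used to set λ: with column-normalized X and Gaussian noise, ‖z‖_∞ = ‖X^T w / n‖_∞ should concentrate around Cσ√(2 log d / n).

    import numpy as np

    rng = np.random.default_rng(4)
    n, d, sigma, C = 400, 1000, 1.0, 1.0
    X = rng.standard_normal((n, d))
    X = X / np.linalg.norm(X, axis=0) * np.sqrt(n)    # enforce ||X_j||_2 / sqrt(n) = C = 1
    z_inf = [np.max(np.abs(X.T @ (sigma * rng.standard_normal(n)) / n))
             for _ in range(200)]
    print("mean ||z||_inf:", np.mean(z_inf),
          " theory scale:", C * sigma * np.sqrt(2 * np.log(d) / n))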

Taking δ = √(2 log d / n), we have w.h.p. (i.e., with probability at least 1 − 2d^{−1})
  ‖θ̂ − θ*‖_2 ≲ σ √(s log d / n).
This is the typical high-dimensional scaling in sparse problems.
Had we known the support S in advance, our rate would be (w.h.p.)
  ‖θ̂ − θ*‖_2 ≲ σ √(s/n).
The log d factor is the price for not knowing the support; roughly the price for searching over the (d choose s) ≈ d^s collection of candidate supports.

Proof of Theorem 2
Let us simplify the loss L(θ) := (1/2n) ‖Xθ − y‖_2^2. Setting Δ = θ − θ*,
  L(θ) = (1/2n) ‖X(θ − θ*) − w‖_2^2
       = (1/2n) ‖XΔ − w‖_2^2
       = (1/2n) ‖XΔ‖_2^2 − (1/n) ⟨XΔ, w⟩ + const.
       = (1/2n) ‖XΔ‖_2^2 − (1/n) ⟨Δ, X^T w⟩ + const.
       = (1/2n) ‖XΔ‖_2^2 − ⟨Δ, z⟩ + const.,
where z = X^T w / n. Hence,
  L(θ) − L(θ*) = (1/2n) ‖XΔ‖_2^2 − ⟨Δ, z⟩.   (9)
Exercise: Show that (9) is the Taylor expansion of L around θ*.

Proof (constrained version)
By optimality of θ̂ and feasibility of θ*: L(θ̂) ≤ L(θ*).
The error vector Δ̂ := θ̂ − θ* therefore satisfies the basic inequality
  (1/2n) ‖XΔ̂‖_2^2 ≤ ⟨z, Δ̂⟩.
Using Hölder's inequality,
  (1/2n) ‖XΔ̂‖_2^2 ≤ ‖z‖_∞ ‖Δ̂‖_1.
Since ‖θ̂‖_1 ≤ ‖θ*‖_1, we have Δ̂ = θ̂ − θ* ∈ C_1(S), hence
  ‖Δ̂‖_1 = ‖Δ̂_S‖_1 + ‖Δ̂_{S^c}‖_1 ≤ 2 ‖Δ̂_S‖_1 ≤ 2 √s ‖Δ̂‖_2.
Combined with the RE condition (Δ̂ ∈ C_3(S) as well),
  (κ/2) ‖Δ̂‖_2^2 ≤ 2 √s ‖z‖_∞ ‖Δ̂‖_2,
which gives the desired result.

Proof (Lagrangian version)
Let L̃(θ) := L(θ) + λ ‖θ‖_1 be the regularized loss. The basic inequality is
  L(θ̂) + λ ‖θ̂‖_1 ≤ L(θ*) + λ ‖θ*‖_1.
Rearranging,
  (1/2n) ‖XΔ̂‖_2^2 ≤ ⟨z, Δ̂⟩ + λ ( ‖θ*‖_1 − ‖θ̂‖_1 ).
We have
  ‖θ*‖_1 − ‖θ̂‖_1 = ‖θ*_S‖_1 − ‖θ*_S + Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1 ≤ ‖Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1.
Since λ ≥ 2 ‖z‖_∞,
  (1/n) ‖XΔ̂‖_2^2 ≤ λ ‖Δ̂‖_1 + 2λ ( ‖Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1 ) ≤ λ ( 3 ‖Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1 ).
It follows that Δ̂ ∈ C_3(S), and the rest of the proof follows.

RE condition for anisotropic design
For a PSD matrix Σ, let ρ²(Σ) = max_i Σ_ii.

Theorem 3
Let X ∈ R^{n×d} have rows i.i.d. from N(0, Σ). Then, there exist universal constants c_1 < 1 < c_2 such that
  ‖Xθ‖_2^2 / n ≥ c_1 ‖√Σ θ‖_2^2 − c_2 ρ²(Σ) (log d / n) ‖θ‖_1^2,  for all θ ∈ R^d,   (10)
with probability at least 1 − e^{−n/32} / (1 − e^{−n/32}).

Exercise 7.11: (10) implies the RE condition over C_3(S) uniformly over all subsets of cardinality
  |S| ≤ ( c_1 γ_min(Σ) / (32 c_2 ρ²(Σ)) ) · n / log d.
In other words, n ≳ s log d  ⟹  RE condition over C_3(S) for all |S| ≤ s.

Examples
Toeplitz family: Σ_ij = ν^{|i−j|}, ρ²(Σ) = 1, γ_min(Σ) ≥ (1 − ν)² > 0.
Spiked model: Σ := (1 − µ) I_d + µ 1 1^T, ρ²(Σ) = 1, γ_min(Σ) = 1 − µ.
For future applications, note that (10) implies
  ‖Xθ‖_2^2 / n ≥ α_1 ‖θ‖_2^2 − α_2 ‖θ‖_1^2,  for all θ ∈ R^d,
where α_1 = c_1 γ_min(Σ) and α_2 = c_2 ρ²(Σ) log d / n.
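
A small numeric sketch of the two covariance families above (the values of ν, µ, and d are my own choices), computing ρ²(Σ) = max_i Σ_ii and γ_min(Σ):

    import numpy as np

    d, nu, mu = 200, 0.5, 0.3
    toeplitz = nu ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
    spiked = (1 - mu) * np.eye(d) + mu * np.ones((d, d))
    for name, S in [("Toeplitz", toeplitz), ("spiked", spiked)]:
        print(name, "rho^2 =", S.diagonal().max(),
              "gamma_min =", np.linalg.eigvalsh(S).min())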

Lasso oracle inequality
For simplicity, let κ = γ_min(Σ) and ρ² = ρ²(Σ) = max_i Σ_ii.

Theorem 4
Under condition (10), consider the Lagrangian lasso with regularization parameter λ ≥ 2 ‖z‖_∞ where z = X^T w / n. For any θ* ∈ R^d, any optimal solution θ̂ satisfies the bound
  ‖θ̂ − θ*‖_2^2 ≤ (144 λ² / (c_1² κ²)) |S|  +  (16 λ / (c_1 κ)) ‖θ*_{S^c}‖_1 + (32 c_2 ρ² / (c_1 κ)) (log d / n) ‖θ*_{S^c}‖_1^2,   (11)
where the first term is the estimation error and the last two form the approximation error, valid for any subset S with cardinality
  |S| ≤ ( c_1 κ / (64 c_2 ρ²) ) · n / log d.

Simplifying the bound:
  ‖θ̂ − θ*‖_2^2 ≲ κ_1 λ² |S| + κ_2 λ ‖θ*_{S^c}‖_1 + κ_3 (log d / n) ‖θ*_{S^c}‖_1^2,
where κ_1, κ_2, κ_3 are constants depending on Σ.
Assume σ = 1 (noise variance) for simplicity. Since ‖z‖_∞ ≲ √(log d / n) w.h.p., we can take λ of this order:
  ‖θ̂ − θ*‖_2^2 ≲ (log d / n) |S| + √(log d / n) ‖θ*_{S^c}‖_1 + (log d / n) ‖θ*_{S^c}‖_1^2.
Optimizing the bound:
  ‖θ̂ − θ*‖_2^2 ≲ inf_{|S| ≲ n / log d} [ (log d / n) |S| + √(log d / n) ‖θ*_{S^c}‖_1 + (log d / n) ‖θ*_{S^c}‖_1^2 ].
An oracle that knows θ* can choose the optimal S.

Example: ℓ_q-ball sparsity
Assume that θ* ∈ B_q, i.e., ∑_{j=1}^d |θ*_j|^q ≤ 1, for some q ∈ [0, 1]. Then, assuming σ² = 1, we have the rate (Exercise 7.12)
  ‖θ̂ − θ*‖_2^2 ≲ (log d / n)^{1 − q/2}.
Sketch: The trick is to take S = { i : |θ*_i| > τ } and find a good threshold τ later.
Show that ‖θ*_{S^c}‖_1 ≤ τ^{1−q} and |S| ≤ τ^{−q}.
The bound would then be of the form (with ε := √(log d / n))
  ε² τ^{−q} + ε τ^{1−q} + (ε τ^{1−q})².
Ignore the last term (assuming ε τ^{1−q} ≲ 1, it is not dominant); balancing the first two terms gives τ ≍ ε, hence the rate ε^{2−q} = (log d / n)^{1 − q/2}.

Proof of Theorem 4
Let L̃(θ) := L(θ) + λ ‖θ‖_1 be the regularized loss. The basic inequality is
  L(θ̂) + λ ‖θ̂‖_1 ≤ L(θ*) + λ ‖θ*‖_1.
Rearranging,
  (1/2n) ‖XΔ̂‖_2^2 ≤ ⟨z, Δ̂⟩ + λ ( ‖θ*‖_1 − ‖θ̂‖_1 ).
We have (θ* is no longer assumed supported on S)
  ‖θ*‖_1 − ‖θ̂‖_1 ≤ ‖θ*_S‖_1 − ‖θ*_S + Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1 + ‖θ*_{S^c}‖_1 ≤ ‖Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1 + ‖θ*_{S^c}‖_1.
Since λ ≥ 2 ‖z‖_∞,
  (1/n) ‖XΔ̂‖_2^2 ≤ λ ‖Δ̂‖_1 + 2λ ( ‖Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1 + ‖θ*_{S^c}‖_1 ) ≤ λ ( 3 ‖Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1 + 2 ‖θ*_{S^c}‖_1 ).

Let s = |S| and b = 2 ‖θ*_{S^c}‖_1. Then, the error satisfies ‖Δ̂_{S^c}‖_1 ≤ 3 ‖Δ̂_S‖_1 + b. That is, ‖Δ̂‖_1 ≤ 4 ‖Δ̂_S‖_1 + b, hence
  ‖Δ̂‖_1^2 ≤ (4 ‖Δ̂_S‖_1 + b)^2 ≤ 32 ‖Δ̂_S‖_1^2 + 2b^2 ≤ 32 s ‖Δ̂_S‖_2^2 + 2b^2 ≤ 32 s ‖Δ̂‖_2^2 + 2b^2.
Bound (10) can be written as (for some α_1, α_2 > 0)
  ‖Xθ‖_2^2 / n ≥ α_1 ‖θ‖_2^2 − α_2 ‖θ‖_1^2,  for all θ ∈ R^d.
Applying this to Δ̂, we have
  ‖XΔ̂‖_2^2 / n ≥ α_1 ‖Δ̂‖_2^2 − α_2 ( 32 s ‖Δ̂‖_2^2 + 2b^2 ) = (α_1 − 32 α_2 s) ‖Δ̂‖_2^2 − 2 α_2 b^2.

We want α_1 − 32 α_2 s to be strictly positive. Assume that α_1/2 ≥ 32 α_2 s, so that α_1 − 32 α_2 s ≥ α_1/2. We obtain
  (α_1/2) ‖Δ̂‖_2^2 − 2 α_2 b^2 ≤ λ ( 3 ‖Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1 + 2 ‖θ*_{S^c}‖_1 ).
Drop −‖Δ̂_{S^c}‖_1, use ‖Δ̂_S‖_1 ≤ √s ‖Δ̂‖_2, and rearrange:
  (α_1/2) ‖Δ̂‖_2^2 ≤ λ ( 3 √s ‖Δ̂‖_2 + 2 ‖θ*_{S^c}‖_1 ) + 2 α_2 b^2.
This is a quadratic inequality in ‖Δ̂‖_2. Using the inequality on the next slide,
  ‖Δ̂‖_2^2 ≤ 2 (3λ√s)^2 / (α_1^2/4) + 2 ( 2λ ‖θ*_{S^c}‖_1 + 2 α_2 b^2 ) / (α_1/2)
          = 72 λ^2 s / α_1^2 + 8 λ ‖θ*_{S^c}‖_1 / α_1 + 32 α_2 ‖θ*_{S^c}‖_1^2 / α_1,
where we used b = 2 ‖θ*_{S^c}‖_1. This proves the theorem with better constants!

In general, ax² ≤ bx + c and x ≥ 0 imply x ≤ b/a + √(c/a), which itself implies x² ≤ 2b²/a² + 2c/a.

Bounds on prediction error
The following can be thought of as the mean-squared prediction error:
  (1/n) ‖X(θ̂ − θ*)‖_2^2 = (1/n) ∑_{i=1}^n ⟨x_i, θ̂ − θ*⟩².
Letting f_θ(x) = ⟨x, θ⟩ be the regression function, in prediction we are interested in estimating the function f_{θ*}(·).
Defining the empirical norm
  ‖f‖_n = ( (1/n) ∑_{i=1}^n f²(x_i) )^{1/2},
we can write
  (1/n) ‖X(θ̂ − θ*)‖_2^2 = ‖f_{θ̂} − f_{θ*}‖_n^2.   (12)
For sufficiently regular points, ‖f‖_n^2 ≈ ‖f‖_{L_2}^2 = ∫ f²(x) dx. (There is another explanation in HDS.)
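
A tiny sketch (my own, with arbitrary sizes and seed) checking identity (12) numerically: the scaled squared prediction error equals the squared empirical norm of f_{θ1} − f_{θ2} over the rows x_i.

    import numpy as np

    rng = np.random.default_rng(5)
    n, d = 50, 20
    X = rng.standard_normal((n, d))
    t1, t2 = rng.standard_normal(d), rng.standard_normal(d)
    lhs = np.linalg.norm(X @ (t1 - t2)) ** 2 / n
    rhs = np.mean([(x @ t1 - x @ t2) ** 2 for x in X])
    print(lhs, rhs)   # identical up to floating point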

Prediction error bounds
Theorem 5
Consider the Lagrangian lasso with λ ≥ 2 ‖z‖_∞ where z = X^T w / n.
(a) Any optimal solution θ̂ satisfies
  (1/n) ‖X(θ̂ − θ*)‖_2^2 ≤ 12 ‖θ*‖_1 λ.
(b) If θ* is supported on S with |S| ≤ s and X satisfies the (κ, 3)-RE condition over S, then any optimal solution satisfies
  (1/n) ‖X(θ̂ − θ*)‖_2^2 ≤ (9/κ) s λ².

Example (fixed design regression, no RE)
Assume y = Xθ* + w where w has i.i.d. σ-sub-Gaussian entries, and X ∈ R^{n×d} is fixed and satisfies the C-column normalization
  max_{j=1,...,d} ‖X_j‖_2 / √n ≤ C,
where X_j is the jth column of X. Recalling z = X^T w / n, w.p. at least 1 − 2e^{−nδ²/2},
  ‖z‖_∞ ≤ Cσ ( √(2 log d / n) + δ ).
Thus, setting λ = 2Cσ ( √(2 log d / n) + δ ), the lasso solution satisfies
  (1/n) ‖X(θ̂ − θ*)‖_2^2 ≤ 24 ‖θ*‖_1 Cσ ( √(2 log d / n) + δ )
w.p. at least 1 − 2e^{−nδ²/2}.

Example (fixed design regression, with RE)
Assume y = Xθ* + w where w has i.i.d. σ-sub-Gaussian entries, and X ∈ R^{n×d} is fixed and satisfies the RE condition and the C-column normalization
  max_{j=1,...,d} ‖X_j‖_2 / √n ≤ C,
where X_j is the jth column of X. Recalling z = X^T w / n, with probability at least 1 − 2e^{−nδ²/2},
  ‖z‖_∞ ≤ Cσ ( √(2 log d / n) + δ ).
Thus, setting λ = 2Cσ ( √(2 log d / n) + δ ), the lasso solution satisfies
  (1/n) ‖X(θ̂ − θ*)‖_2^2 ≤ (72/κ) C² σ² ( 2s log d / n + δ² s )
w.p. at least 1 − 2e^{−nδ²/2}.

Under very mild assumptions (no RE), the slow rate:
  (1/n) ‖X(θ̂ − θ*)‖_2^2 ≤ 24 ‖θ*‖_1 Cσ ( √(2 log d / n) + δ ).
Under stronger assumptions (e.g., the RE condition), the fast rate:
  (1/n) ‖X(θ̂ − θ*)‖_2^2 ≤ (72/κ) C² σ² ( 2s log d / n + δ² s ).
Is RE needed for fast rates?

Proof of part (a)
Recall the basic inequality, where Δ̂ = θ̂ − θ*:
  (1/2n) ‖XΔ̂‖_2^2 ≤ ⟨z, Δ̂⟩ + λ ( ‖θ*‖_1 − ‖θ̂‖_1 ).
Using λ ≥ 2 ‖z‖_∞ and Hölder's inequality,
  ⟨z, Δ̂⟩ ≤ ‖z‖_∞ ‖Δ̂‖_1 ≤ (λ/2) ‖Δ̂‖_1 ≤ (λ/2) ( ‖θ̂‖_1 + ‖θ*‖_1 ).
Putting the pieces together, we conclude ‖θ̂‖_1 ≤ 3 ‖θ*‖_1. The triangle inequality then gives ‖Δ̂‖_1 ≤ 4 ‖θ*‖_1.
Since ‖θ*‖_1 − ‖θ̂‖_1 ≤ ‖Δ̂‖_1,
  (1/2n) ‖XΔ̂‖_2^2 ≤ (λ/2) ‖Δ̂‖_1 + λ ‖Δ̂‖_1 ≤ (3λ/2) · 4 ‖θ*‖_1 = 6λ ‖θ*‖_1,
using the upper bound ‖Δ̂‖_1 ≤ 4 ‖θ*‖_1 above; multiplying by 2 gives the claim.

Proof of part (b)
As before, we obtain
  (1/n) ‖XΔ̂‖_2^2 ≤ 3λ √s ‖Δ̂‖_2,
and that Δ̂ ∈ C_3(S). We now apply the RE condition to the other side:
  ‖XΔ̂‖_2^2 / n ≤ 3λ √s ‖Δ̂‖_2 ≤ 3λ √s · (1/√κ) ‖XΔ̂‖_2 / √n,
which gives the desired result.

Variable selection using the lasso
Can we recover the exact support of θ*? This needs the most stringent conditions:
Lower eigenvalue:
  γ_min( X_S^T X_S / n ) ≥ c_min > 0.   (LowEig)
Mutual incoherence: there exists some α ∈ [0, 1) such that
  max_{j ∈ S^c} ‖ (X_S^T X_S)^{-1} X_S^T X_j ‖_1 ≤ α.   (MuI)
The expression is the ℓ_1 norm of ω̂, where ω̂ = argmin_{ω ∈ R^s} ‖ X_j − X_S ω ‖_2^2.
Letting Σ̂ = X^T X / n be the sample covariance, we can write (MuI) as
  ‖ Σ̂_{SS}^{-1} Σ̂_{S S^c} ‖_1 ≤ α  (maximum ℓ_1 norm over columns).
Consider the projection matrix, projecting onto [Im(X_S)]^⊥:
  Π_S = I_n − X_S (X_S^T X_S)^{-1} X_S^T.
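
A sketch (my own illustration; sizes, seed, and support are assumptions) computing the mutual incoherence quantity and the projector Π_S for a random Gaussian design:

    import numpy as np

    rng = np.random.default_rng(6)
    n, d, s = 300, 100, 5
    X = rng.standard_normal((n, d))
    S = np.arange(s)
    Sc = np.arange(s, d)
    XS = X[:, S]
    G_inv = np.linalg.inv(XS.T @ XS)
    # max_{j in S^c} || (X_S^T X_S)^{-1} X_S^T X_j ||_1
    incoherence = np.max(np.abs(G_inv @ XS.T @ X[:, Sc]).sum(axis=0))
    Pi_S = np.eye(n) - XS @ G_inv @ XS.T     # projection onto [Im(X_S)]^perp
    print("mutual incoherence:", incoherence)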

Theorem 6
Consider the S-sparse linear model with design X satisfying (LowEig) and (MuI). Let w̃ := w/n and assume that
  λ ≥ (2/(1 − α)) ‖ X_{S^c}^T Π_S w̃ ‖_∞.   (13)
Then the Lagrangian lasso has the following properties:
(a) There is a unique optimal solution θ̂.
(b) No false inclusion: Ŝ := supp(θ̂) ⊆ S.
(c) ℓ_∞ bounds: The error θ̂ − θ* satisfies
  ‖θ̂_S − θ*_S‖_∞ ≤ ‖ Σ̂_{SS}^{-1} X_S^T w̃ ‖_∞ + ‖ Σ̂_{SS}^{-1} ‖_∞ λ =: τ(λ; X).   (14)
(d) No false exclusion: The lasso includes all indices i ∈ S such that |θ*_i| > τ(λ; X), hence is variable selection consistent if min_{i ∈ S} |θ*_i| > τ(λ; X).

Theorem 7
Consider the S-sparse linear model with design X satisfying (LowEig) and (MuI). Assume that the noise vector w is zero mean with i.i.d. σ-sub-Gaussian entries, and that X is a C-column-normalized deterministic design. Take the regularization parameter to be
  λ = (2Cσ / (1 − α)) ( √(2 log(d − s) / n) + δ )
for some δ > 0. Then the optimal solution θ̂ is unique, has support contained in S, and satisfies the ℓ_∞ error bound
  ‖θ̂_S − θ*_S‖_∞ ≤ (σ/√c_min) ( √(2 log s / n) + δ ) + ‖Σ̂_{SS}^{-1}‖_∞ λ,
all with probability at least 1 − 4e^{−nδ²/2}.

Need to verify (13): it is enough to control Z_j := X_j^T Π_S w̃, for j ∈ S^c.
We have ‖Π_S X_j‖_2 ≤ ‖X_j‖_2 ≤ C√n (projections are nonexpansive).
Hence, Z_j is sub-Gaussian with squared parameter C²σ²/n. (Exercise.) It follows that
  P[ max_{j ∈ S^c} |Z_j| ≥ t ] ≤ 2(d − s) e^{−n t² / (2C²σ²)}.
The choice of λ therefore satisfies (13) with high probability.
Define Z̃_S = Σ̂_{SS}^{-1} X_S^T w̃. Each Z̃_i = e_i^T Σ̂_{SS}^{-1} X_S^T w̃ is sub-Gaussian with parameter at most
  (σ²/n) ‖Σ̂_{SS}^{-1}‖_op ≤ σ² / (c_min n).
It follows that
  P[ max_{i=1,...,s} |Z̃_i| > (σ/√c_min) ( √(2 log s / n) + δ ) ] ≤ 2 e^{−nδ²/2}.

Exercise
Assume that w ∈ R^d has independent sub-Gaussian entries, with sub-Gaussian squared parameter σ². Let x ∈ R^d be a deterministic vector. Then, x^T w is sub-Gaussian with squared parameter σ² ‖x‖_2^2.

The corollary applies to fixed designs. A similar result holds for a Gaussian random design (rows i.i.d. from N(0, Σ)), assuming that the population covariance Σ satisfies α-incoherence (Exercise 7.19): the sample covariance Σ̂ then satisfies α-incoherence w.h.p. if n ≳ s log(d − s).

Detour: subgradients
Consider a convex function f : R^d → R. A vector z ∈ R^d is a subgradient of f at θ, denoted z ∈ ∂f(θ), if
  f(θ + Δ) ≥ f(θ) + ⟨z, Δ⟩,  for all Δ ∈ R^d.
For a convex function, "θ minimizes f" is equivalent to 0 ∈ ∂f(θ).
For the ℓ_1 norm, i.e., f(θ) = ‖θ‖_1,
  z ∈ ∂‖θ‖_1  ⟺  z_j ∈ sign(θ_j),
where sign(·) is the generalized sign, i.e., sign(0) = [−1, 1].
For the lasso, (θ̂, ẑ) is primal-dual optimal if θ̂ is a minimizer and ẑ ∈ ∂‖θ̂‖_1. Equivalently, the primal-dual optimality conditions can be written as
  (1/n) X^T (Xθ̂ − y) + λẑ = 0,   (15)
  ẑ ∈ ∂‖θ̂‖_1,   (16)
where (15) is the zero-subgradient condition.
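
A numeric sketch (my own; the solver, sizes, and λ are assumptions) checking the zero-subgradient conditions (15)-(16) for a lasso solution: solving (15) for ẑ, it should lie in [−1, 1] coordinate-wise, with ẑ_j = sign(θ̂_j) on the support, up to solver tolerance.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(7)
    n, d, s, sigma, lam = 200, 50, 5, 0.1, 0.1
    X = rng.standard_normal((n, d))
    theta_star = np.zeros(d); theta_star[:s] = 1.0
    y = X @ theta_star + sigma * rng.standard_normal(n)
    theta_hat = Lasso(alpha=lam, fit_intercept=False,
                      tol=1e-10, max_iter=100000).fit(X, y).coef_

    z_hat = X.T @ (y - X @ theta_hat) / (n * lam)   # from (15): z = X^T(y - X theta) / (n lam)
    print("max |z_hat|:", np.max(np.abs(z_hat)))            # <= 1 up to tolerance
    print("on support:", z_hat[np.abs(theta_hat) > 1e-8])   # ~ signs of theta_hat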

Proof of Theorem 6
Primal-dual witness (PDW) construction:
1. Set θ̂_{S^c} = 0.
2. Determine (θ̂_S, ẑ_S) ∈ R^s × R^s by solving the oracle subproblem
  θ̂_S ∈ argmin_{θ_S ∈ R^s} (1/2n) ‖y − X_S θ_S‖_2^2 + λ ‖θ_S‖_1,
and choosing ẑ_S ∈ ∂‖θ̂_S‖_1 such that ∇f(θ_S)|_{θ_S = θ̂_S} + λ ẑ_S = 0.
3. Solve for ẑ_{S^c} ∈ R^{d−s} via the zero-subgradient equation and check for strict dual feasibility, ‖ẑ_{S^c}‖_∞ < 1.

Lemma 1
Under condition (LowEig), the success of the PDW construction implies that (θ̂_S, 0) is the unique optimal solution of the lasso.
Proof: We only need to show uniqueness, which follows from this fact: under strong duality, the set of saddle points of the Lagrangian forms a Cartesian product, i.e., we can mix and match the primal and dual parts of two primal-dual pairs to also get primal-dual pairs.

Using y = Xθ* + w, θ̂_{S^c} = 0 (by construction), and θ*_{S^c} = 0 (by assumption), we can write the zero-subgradient condition in block form:
  [ Σ̂_SS  Σ̂_{SS^c} ; Σ̂_{S^cS}  Σ̂_{S^cS^c} ] [ θ̂_S − θ*_S ; 0 ] − [ u_S ; u_{S^c} ] + λ [ ẑ_S ; ẑ_{S^c} ] = [ 0 ; 0 ],
where u = X^T w / n, so that u_S = X_S^T w / n, and so on.
The top equation is satisfied since (θ̂_S, ẑ_S) is chosen to solve the oracle lasso. We only need to satisfy the bottom equation; we do so by choosing ẑ_{S^c} as needed:
  ẑ_{S^c} = −(1/λ) Σ̂_{S^cS} (θ̂_S − θ*_S) + u_{S^c} / λ.
Since by assumption Σ̂_SS is invertible, we can solve for θ̂_S − θ*_S from the first equation:
  θ̂_S − θ*_S = Σ̂_{SS}^{-1} (u_S − λ ẑ_S).
Combining,
  ẑ_{S^c} = Σ̂_{S^cS} Σ̂_{SS}^{-1} ẑ_S + (1/λ) ( u_{S^c} − Σ̂_{S^cS} Σ̂_{SS}^{-1} u_S ).

We had
  ẑ_{S^c} = Σ̂_{S^cS} Σ̂_{SS}^{-1} ẑ_S + (1/λ) ( u_{S^c} − Σ̂_{S^cS} Σ̂_{SS}^{-1} u_S ).
Note that (with w̃ = w/n)
  u_{S^c} − Σ̂_{S^cS} Σ̂_{SS}^{-1} u_S = X_{S^c}^T w̃ − Σ̂_{S^cS} Σ̂_{SS}^{-1} X_S^T w̃
                                       = X_{S^c}^T [ I − X_S (X_S^T X_S)^{-1} X_S^T ] w̃
                                       = X_{S^c}^T Π_S w̃.
Thus, we have
  ẑ_{S^c} = Σ̂_{S^cS} Σ̂_{SS}^{-1} ẑ_S  [=: µ]  +  X_{S^c}^T Π_S w̃ / λ  [=: v].
By (MuI) we have ‖µ‖_∞ ≤ α, and by our choice of λ, ‖v‖_∞ < 1 − α. This verifies strict dual feasibility, ‖ẑ_{S^c}‖_∞ < 1; hence the constructed pair is primal-dual feasible and the primal solution is unique.

It remains to show the ℓ_∞ bound, which follows from applying the triangle inequality to
  θ̂_S − θ*_S = Σ̂_{SS}^{-1} (u_S − λ ẑ_S),
leading to
  ‖θ̂_S − θ*_S‖_∞ ≤ ‖Σ̂_{SS}^{-1} u_S‖_∞ + λ ‖Σ̂_{SS}^{-1}‖_∞,
using the sub-multiplicative property of operator norms and ‖ẑ_S‖_∞ ≤ 1.