High-dimensional statistics: Some progress and challenges ahead


High-dimensional statistics: Some progress and challenges ahead
Martin Wainwright, UC Berkeley, Departments of Statistics and EECS
Master Class, Lecture 1, University College London
Joint work with: Alekh Agarwal, Arash Amini, Po-Ling Loh, Sahand Negahban, Garvesh Raskutti, Pradeep Ravikumar, Bin Yu.

Introduction

Classical asymptotic theory: sample size n tends to infinity with the number of parameters p fixed
- law of large numbers, central limit theory
- consistency of maximum likelihood estimation

Modern applications in science and engineering:
- large-scale problems: both p and n may be large (possibly p >> n)
- need for a high-dimensional theory that provides non-asymptotic results for (n, p)

Curses and blessings of high dimensionality:
- exponential explosions in computational complexity
- statistical curses (sample complexity)
- concentration of measure

Key ideas: What embedded low-dimensional structures are present in data? How can they be exploited algorithmically?

Vignette I: High-dimensional matrix estimation

Goal: estimate a covariance matrix Sigma in R^{p x p} given i.i.d. samples X_i ~ N(0, Sigma), for i = 1, 2, ..., n.

Classical approach: estimate Sigma via the sample covariance matrix

\widehat{\Sigma}_n := \frac{1}{n} \sum_{i=1}^n X_i X_i^T,

an average of p x p rank-one matrices.

Reasonable properties (p fixed, n increasing):
- Unbiased: E[\widehat{\Sigma}_n] = \Sigma
- Consistent: \widehat{\Sigma}_n \to \Sigma almost surely as n tends to infinity
- Asymptotic distributional properties available

An alternative experiment: fix some alpha > 0 and study behavior over sequences with p/n = alpha. Does \widehat{\Sigma}_{n(p)} converge to anything reasonable?

[Figure: rescaled empirical eigenvalue density of the sample covariance vs. the Marchenko-Pastur law, alpha = 0.5 (Marchenko & Pastur, 1967).]

[Figure: rescaled empirical eigenvalue density of the sample covariance vs. the Marchenko-Pastur law, alpha = 0.2 (Marchenko & Pastur, 1967).]
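To make this concrete, here is a minimal simulation sketch (illustrative, not part of the lecture; the dimensions below are assumptions): with p/n = alpha fixed and Sigma = I, the sample eigenvalues spread over the Marchenko-Pastur support rather than concentrating near 1.

import numpy as np

rng = np.random.default_rng(0)
n, alpha = 2000, 0.5
p = int(alpha * n)

X = rng.standard_normal((n, p))          # rows X_i ~ N(0, I_p)
Sigma_hat = X.T @ X / n                  # sample covariance matrix
eigs = np.linalg.eigvalsh(Sigma_hat)

# Marchenko-Pastur support for Sigma = I: [(1 - sqrt(alpha))^2, (1 + sqrt(alpha))^2]
lo, hi = (1 - np.sqrt(alpha)) ** 2, (1 + np.sqrt(alpha)) ** 2
print(f"empirical eigenvalue range: [{eigs.min():.2f}, {eigs.max():.2f}]")
print(f"Marchenko-Pastur support:   [{lo:.2f}, {hi:.2f}]")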

Low-dimensional structure: Gaussian graphical models

[Figure: a 5-node graph and the zero pattern of its inverse covariance matrix.]

\mathbb{P}(x_1, x_2, \dots, x_p) \propto \exp\Big( -\tfrac{1}{2} x^T \Theta x \Big).

Maximum likelihood with l1-regularization

[Figure: a 5-node graph and the zero pattern of its inverse covariance matrix.]

Set-up: samples from a random vector with sparse covariance Sigma or sparse inverse covariance Theta in R^{p x p}.

Estimator (for the inverse covariance):

\widehat{\Theta} \in \arg\min_{\Theta} \Big\{ \Big\langle \frac{1}{n} \sum_{i=1}^n x_i x_i^T, \; \Theta \Big\rangle - \log\det(\Theta) + \lambda_n \sum_{j \neq k} |\Theta_{jk}| \Big\}

Some past work: Yuan & Lin, 2006; d'Aspremont et al., 2007; Bickel & Levina, 2007; El Karoui, 2007; Rothman et al., 2007; Zhou et al., 2007; Friedman et al., 2008; Lam & Fan, 2008; Ravikumar et al., 2008; Zhou, Cai & Huang, 2009.
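A hedged sketch of this l1-regularized log-determinant program, using scikit-learn's GraphicalLasso solver (one of several possible solvers; the slides do not prescribe one, and the dimensions, sparsity level, and regularization weight below are illustrative assumptions):

import numpy as np
from sklearn.covariance import GraphicalLasso
from sklearn.datasets import make_sparse_spd_matrix

rng = np.random.default_rng(0)
p, n = 20, 500
Theta_star = make_sparse_spd_matrix(p, alpha=0.9, random_state=0)   # sparse precision matrix
Sigma_star = np.linalg.inv(Theta_star)
X = rng.multivariate_normal(np.zeros(p), Sigma_star, size=n)

model = GraphicalLasso(alpha=0.1).fit(X)     # alpha plays the role of lambda_n
Theta_hat = model.precision_
nonzeros = np.sum(np.abs(Theta_hat[np.triu_indices(p, k=1)]) > 1e-3)
print("estimated nonzero off-diagonal entries:", nonzeros)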

Gauss-Markov models with hidden variables

[Figure: a latent variable Z connected to observed variables X_1, X_2, X_3, X_4.]

Problems with hidden variables: conditioned on the hidden Z, the vector X = (X_1, X_2, X_3, X_4) is Gauss-Markov.

The inverse covariance of X then satisfies a {sparse, low-rank} decomposition, e.g.

\begin{bmatrix} 1+\mu & \mu & \mu & \mu \\ \mu & 1+\mu & \mu & \mu \\ \mu & \mu & 1+\mu & \mu \\ \mu & \mu & \mu & 1+\mu \end{bmatrix} = I_{4 \times 4} + \mu \, \mathbf{1}\mathbf{1}^T.

(Chandrasekaran, Parrilo & Willsky, 2010)

Vignette II: High-dimensional sparse linear regression

[Figure: observation model y = X theta* + w, with design X in R^{n x p} and theta* supported on S, zero on S^c.]

Set-up: noisy observations y = X theta* + w with sparse theta*.

Estimator: Lasso program

\widehat{\theta} \in \arg\min_{\theta} \Big\{ \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T \theta)^2 + \lambda_n \sum_{j=1}^p |\theta_j| \Big\}

Some past work: Tibshirani, 1996; Chen et al., 1998; Donoho & Huo, 2001; Tropp, 2004; Fuchs, 2004; Efron et al., 2004; Meinshausen & Buhlmann, 2005; Candes & Tao, 2005; Donoho, 2005; Haupt & Nowak, 2005; Zhao & Yu, 2006; Zou, 2006; Koltchinskii, 2007; van ...
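A minimal sketch of this set-up (dimensions, noise level, and regularization below are assumptions, not taken from the slides); note that scikit-learn's Lasso minimizes (1/(2n))||y - X theta||_2^2 + alpha ||theta||_1, which matches the program above up to a rescaling of lambda_n.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, s = 200, 1000, 10
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:s] = 1.0                               # support S = {0, ..., s-1}
y = X @ theta_star + 0.5 * rng.standard_normal(n)  # noisy observations

lam = 2 * np.sqrt(np.log(p) / n)                   # lambda_n of order sqrt(log p / n)
theta_hat = Lasso(alpha=lam).fit(X, y).coef_
print("estimated support:", np.flatnonzero(np.abs(theta_hat) > 1e-3))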

Application A: Compressed sensing (Donoho, 2005; Candes & Tao, 2005)

[Figure: measurement model y = X beta*, with y in R^n, X in R^{n x p}, and n << p.]

(a) Image: vectorize to beta* in R^p. (b) Compute n random projections y = X beta*.

Application A: Compressed sensing (Donoho, 2005; Candes & Tao, 2005)

In practice, signals are sparse in a transform domain: theta* := Psi beta* is a sparse signal, where Psi is an orthonormal matrix.

[Figure: y = X Psi^T theta*, with theta* an s-sparse vector in R^p.]

Reconstruct theta* (and hence the image beta* = Psi^T theta*) by finding a sparse solution to the under-constrained linear system y = X' theta, where X' = X Psi^T is another random matrix.

Noiseless l1 recovery: Unrescaled sample size

[Figure: probability of exact recovery versus raw sample size n, for p = 128, 256, 512 (mu = 0).]

Application B: Graph structure estimation

Let G = (V, E) be an undirected graph on p = |V| vertices. A pairwise graphical model factorizes over the edges of the graph:

\mathbb{P}(x_1, \dots, x_p; \theta) \propto \exp\Big\{ \sum_{(s,t) \in E} \theta_{st}(x_s, x_t) \Big\}.

Given n independent and identically distributed (i.i.d.) samples of X = (X_1, ..., X_p), identify the underlying graph structure.
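As a tiny illustration of this factorization (a toy model I am assuming, not an example from the lecture), the following brute-force sketch evaluates the pairwise exponential-family distribution on a 4-node cycle by enumerating all states:

import itertools
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # a 4-cycle graph G = (V, E)
theta = {e: 0.8 for e in edges}            # edge potentials theta_st(x_s, x_t) = theta_st * x_s * x_t

def unnormalized(x):
    return np.exp(sum(theta[(s, t)] * x[s] * x[t] for (s, t) in edges))

states = list(itertools.product([-1, +1], repeat=4))
Z = sum(unnormalized(x) for x in states)   # partition function by enumeration
probs = {x: unnormalized(x) / Z for x in states}
print("two most probable states:", sorted(probs, key=probs.get, reverse=True)[:2])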

Pseudolikelihood and neighborhood regression

Markov properties encode neighborhood structure:

(X_s \mid X_{V \setminus \{s\}}) \;\stackrel{d}{=}\; (X_s \mid X_{N(s)})

where the left side conditions on the full graph and the right side conditions only on the Markov blanket N(s).

[Figure: node X_s with its Markov blanket N(s) = {t, u, v, w}.]

This is the basis of the pseudolikelihood method (Besag, 1974) and of many graph-learning algorithms (Friedman et al., 1999; Csiszar & Talata, 2005; Abbeel et al., 2006; Meinshausen & Buhlmann, 2006).

Graph selection via neighborhood regression

[Figure: binary data matrix with the column X_s separated from the remaining columns X_{\setminus s}.]

Predict X_s based on X_{\setminus s} := \{X_t, \; t \neq s\}. (A code sketch of this procedure follows below.)

1. For each node s in V, compute a (regularized) maximum likelihood estimate:

\widehat{\theta}[s] := \arg\min_{\theta \in \mathbb{R}^{p-1}} \Big\{ \underbrace{\frac{1}{n} \sum_{i=1}^n L(\theta; X_{i, \setminus s})}_{\text{local log-likelihood}} + \underbrace{\lambda_n \|\theta\|_1}_{\text{regularization}} \Big\}

2. Estimate the local neighborhood \widehat{N}(s) as the support of the regression vector \widehat{\theta}[s].
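A sketch of this procedure for the Gaussian case, in the spirit of Meinshausen & Buhlmann (2006): regress each X_s on the remaining coordinates with an l1 penalty and read off the neighborhood from the support. The threshold and OR-rule symmetrization below are my own illustrative choices.

import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_selection(X, lam, tol=1e-3):
    """Estimate the graph by running one l1-regularized regression per node."""
    n, p = X.shape
    adj = np.zeros((p, p), dtype=bool)
    for s in range(p):
        rest = [t for t in range(p) if t != s]                 # indices of X_{\s}
        coef = Lasso(alpha=lam).fit(X[:, rest], X[:, s]).coef_
        adj[s, rest] = np.abs(coef) > tol                      # N_hat(s) = support of theta_hat[s]
    return adj | adj.T                                         # OR-rule symmetrization

One would then call neighborhood_selection(X, lam=0.1) on an n x p data matrix to obtain an estimated adjacency matrix.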

US Senate network (2004-2006 voting)

Outline

1. Lecture 1 (Today): Basics of sparse recovery
   - Sparse linear systems: l0/l1 equivalence
   - Noisy case: Lasso, l2-bounds and variable selection

2. Lecture 2 (Tuesday): A more general theory
   - A range of structured regularizers
   - Group sparsity
   - Low-rank matrices and nuclear norm regularization
   - Matrix decomposition and robust PCA
   - Ingredients of a general understanding

3. Lecture 3 (Wednesday): High-dimensional kernel methods
   - Curse of dimensionality for non-parametric regression
   - Reproducing kernel Hilbert spaces
   - A simple but optimal estimator

Noiseless linear models and basis pursuit

[Figure: under-determined linear system y = X theta*, with theta* supported on S, zero on S^c.]

The under-determined linear system is unidentifiable without constraints; say theta* in R^p is sparse, supported on S, a subset of {1, 2, ..., p}.

l0-optimization:

\widehat{\theta} = \arg\min_{\theta \in \mathbb{R}^p} \|\theta\|_0 \quad \text{such that } X\theta = y

Computationally intractable: NP-hard.

l1-relaxation (basis pursuit):

\widehat{\theta} \in \arg\min_{\theta \in \mathbb{R}^p} \|\theta\|_1 \quad \text{such that } X\theta = y

A linear program (easy to solve).
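Since the l1-relaxation is a linear program, any generic LP solver applies. The sketch below (illustrative; the slides do not prescribe a solver, and the dimensions are assumptions) rewrites min ||theta||_1 subject to X theta = y with auxiliary variables t >= |theta| and calls scipy's linprog:

import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """Solve min ||theta||_1 subject to X theta = y via an LP in z = (theta, t)."""
    n, p = X.shape
    c = np.concatenate([np.zeros(p), np.ones(p)])            # objective: sum of t_j
    # theta - t <= 0 and -theta - t <= 0 encode |theta_j| <= t_j
    A_ub = np.block([[np.eye(p), -np.eye(p)], [-np.eye(p), -np.eye(p)]])
    b_ub = np.zeros(2 * p)
    A_eq = np.hstack([X, np.zeros((n, p))])                   # X theta = y
    bounds = [(None, None)] * p + [(0, None)] * p             # theta free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y, bounds=bounds)
    return res.x[:p]

# example: exact recovery of a 5-sparse vector from n = 80 random projections
rng = np.random.default_rng(0)
n, p = 80, 200
theta_star = np.zeros(p)
theta_star[:5] = rng.standard_normal(5)
X = rng.standard_normal((n, p))
theta_hat = basis_pursuit(X, X @ theta_star)
print("recovery error:", np.linalg.norm(theta_hat - theta_star))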

Noiseless l1 recovery: Unrescaled sample size

[Figure: probability of exact recovery versus raw sample size n, for p = 128, 256, 512 (mu = 0).]

Noiseless l1 recovery: Rescaled

[Figure: probability of exact recovery versus rescaled sample size alpha := n / (s log(p/s)), for p = 128, 256, 512 (mu = 0).]

Restricted nullspace: necessary and sufficient

Definition. For a fixed S, a subset of {1, 2, ..., p}, the matrix X in R^{n x p} satisfies the restricted nullspace property with respect to S, or RN(S) for short, if

\underbrace{\{\Delta \in \mathbb{R}^p \mid X\Delta = 0\}}_{\mathcal{N}(X)} \;\cap\; \underbrace{\{\Delta \in \mathbb{R}^p \mid \|\Delta_{S^c}\|_1 \leq \|\Delta_S\|_1\}}_{\mathcal{C}(S)} \;=\; \{0\}.

(Donoho & Huo, 2001; Feuer & Nemirovski, 2003; Cohen et al., 2009)

Proposition. Basis pursuit (the l1-relaxation) is exact for all S-sparse vectors if and only if X satisfies RN(S).

Proof (sufficiency):
(1) The error vector \Delta = \widehat{\theta} - \theta^* satisfies X\Delta = 0, and hence \Delta \in \mathcal{N}(X).
(2) Show that \Delta \in \mathcal{C}(S):
    Optimality of \widehat{\theta}: \|\widehat{\theta}\|_1 \leq \|\theta^*\|_1 = \|\theta^*_S\|_1.
    Sparsity of \theta^*: \|\widehat{\theta}\|_1 = \|\theta^* + \Delta\|_1 = \|\theta^*_S + \Delta_S\|_1 + \|\Delta_{S^c}\|_1.
    Triangle inequality: \|\theta^*_S + \Delta_S\|_1 + \|\Delta_{S^c}\|_1 \geq \|\theta^*_S\|_1 - \|\Delta_S\|_1 + \|\Delta_{S^c}\|_1.
    Combining the three displays gives \|\Delta_{S^c}\|_1 \leq \|\Delta_S\|_1, i.e. \Delta \in \mathcal{C}(S).
(3) Hence \Delta \in \mathcal{N}(X) \cap \mathcal{C}(S), and (RN) implies \Delta = 0.

Illustration of restricted nullspace property

[Figure: the cone C(S; 1) in R^3 for S = {3}, intersected with nullspaces of different orientations.]

Consider theta* = (0, 0, theta*_3), so that S = {3}. The error vector \Delta = \widehat{\theta} - \theta^* belongs to the set

\mathcal{C}(S; 1) := \{(\Delta_1, \Delta_2, \Delta_3) \in \mathbb{R}^3 \mid |\Delta_1| + |\Delta_2| \leq |\Delta_3|\}.
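A toy numerical companion to this picture (my own illustrative check, not from the lecture): for random 2 x 3 Gaussian designs the nullspace is a line, and one can count how often that line intersects the cone C(S; 1) for S = {3}; whenever it does not, RN(S) holds and basis pursuit recovers every vector supported on {3}.

import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
trials, hits = 10_000, 0
for _ in range(trials):
    X = rng.standard_normal((2, 3))
    d = null_space(X)[:, 0]                   # direction spanning N(X)
    if abs(d[0]) + abs(d[1]) <= abs(d[2]):    # d lies in C(S) for S = {3}: RN(S) fails
        hits += 1
print("fraction of designs where RN({3}) fails:", hits / trials)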

Some sufficient conditions

How does one verify the RN property for a given sparsity s? Two classical routes, with a numerical check sketched below the list:

1. Elementwise incoherence condition (Donoho & Huo, 2001; Feuer & Nemirovski, 2003):

\max_{j,k = 1,\dots,p} \Big| \Big( \frac{X^T X}{n} - I_{p \times p} \Big)_{jk} \Big| \leq \frac{\delta}{s}

For matrices with i.i.d. sub-Gaussian entries: holds w.h.p. for n = \Omega(s^2 \log p).

2. Restricted isometry, or submatrix incoherence (Candes & Tao, 2005):

\max_{|U| \leq 2s} \Big\| \Big( \frac{X^T X}{n} - I_{p \times p} \Big)_{UU} \Big\|_{\mathrm{op}} \leq \delta_{2s}.

For matrices with i.i.d. sub-Gaussian entries: holds w.h.p. for n = \Omega(s \log \frac{p}{s}).
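A quick numerical check of the elementwise quantity (illustrative dimensions assumed): for an i.i.d. standard Gaussian design, max_{j,k} |(X^T X / n - I)_{jk}| is of order sqrt(log p / n), so the requirement delta/s forces a sample size of order s^2 log p.

import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 100
X = rng.standard_normal((n, p))
D = X.T @ X / n - np.eye(p)
print("max_{j,k} |(X^T X / n - I)_{jk}| =", np.abs(D).max())
print("sqrt(log p / n)                  =", np.sqrt(np.log(p) / n))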

Violating matrix incoherence (elementwise/RIP)

Important: incoherence/RIP conditions imply RN, but they are far from necessary, and it is very easy to violate them...

Form a random design matrix X = [x_1 \; x_2 \; \cdots \; x_p] \in \mathbb{R}^{n \times p} with rows X_i \sim N(0, \Sigma), i.i.d.

Example: for some mu in (0, 1), consider the covariance matrix

\Sigma = (1 - \mu) I_{p \times p} + \mu \, \mathbf{1}\mathbf{1}^T.

Elementwise incoherence is violated: for any j \neq k,

\mathbb{P}\Big[ \Big| \frac{\langle x_j, x_k \rangle}{n} - \mu \Big| \geq \epsilon \Big] \leq c_1 \exp(-c_2 n \epsilon^2).

RIP constants tend to infinity as (n, |S|) increases:

\mathbb{P}\Big[ \Big\| \frac{X_S^T X_S}{n} - I_{s \times s} \Big\|_2 \geq \mu(s-1) - \epsilon \Big] \geq 1 - c_1 \exp(-c_2 n \epsilon^2).
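A small simulation of this example (parameters are assumptions): with Sigma = (1 - mu) I + mu 1 1^T the off-diagonal entries of X^T X / n concentrate around mu, so the elementwise bound delta/s fails for any fixed delta once s is moderately large, even though (as the next plot indicates) l1 recovery still succeeds.

import numpy as np

rng = np.random.default_rng(0)
n, p, mu = 1000, 200, 0.5
Sigma = (1 - mu) * np.eye(p) + mu * np.ones((p, p))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
G = X.T @ X / n
offdiag = G[~np.eye(p, dtype=bool)]
print("mean off-diagonal of X^T X / n:", offdiag.mean())   # concentrates near mu = 0.5
print("max  off-diagonal of X^T X / n:", offdiag.max())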

Noiseless l1 recovery for mu = 0.5

[Figure: probability of exact recovery versus rescaled sample size alpha := n / (s log(p/s)), for p = 128, 256, 512 (mu = 0.5).]

Direct result for restricted nullspace/eigenvalues

Theorem (Raskutti, Wainwright & Yu, 2010). Consider a random design X in R^{n x p} with rows X_i ~ N(0, Sigma), i.i.d., and define \kappa(\Sigma) = \max_{j=1,2,\dots,p} \Sigma_{jj}. Then for universal constants c_1, c_2,

\frac{\|X\theta\|_2}{\sqrt{n}} \;\geq\; \frac{1}{2} \|\Sigma^{1/2}\theta\|_2 \;-\; 9\,\kappa(\Sigma) \sqrt{\frac{\log p}{n}}\, \|\theta\|_1 \qquad \text{for all } \theta \in \mathbb{R}^p,

with probability greater than 1 - c_1 \exp(-c_2 n).

Remarks:
- much less restrictive than incoherence/RIP conditions
- many interesting matrix families are covered: Toeplitz dependency, constant mu-correlation (previous example), and the covariance matrix Sigma can even be degenerate
- extensions to sub-Gaussian matrices (Rudelson & Zhou, 2012)
- related results hold for generalized linear models

Easy verification of restricted nullspace

For any \Delta \in \mathcal{C}(S), we have

\|\Delta\|_1 = \|\Delta_S\|_1 + \|\Delta_{S^c}\|_1 \leq 2\|\Delta_S\|_1 \leq 2\sqrt{s}\,\|\Delta\|_2.

Applying the previous result:

\frac{\|X\Delta\|_2}{\sqrt{n}} \;\geq\; \underbrace{\Big\{ \frac{\lambda_{\min}(\Sigma^{1/2})}{2} - 18\,\kappa(\Sigma)\sqrt{\frac{s\log p}{n}} \Big\}}_{\gamma(\Sigma)} \, \|\Delta\|_2.

We have actually proven much more than restricted nullspace...

Definition. A design matrix X in R^{n x p} satisfies the restricted eigenvalue (RE) condition over S (denoted RE(S)) with parameters alpha and gamma > 0 if

\frac{\|X\Delta\|_2^2}{n} \;\geq\; \gamma\, \|\Delta\|_2^2 \qquad \text{for all } \Delta \in \mathbb{R}^p \text{ such that } \|\Delta_{S^c}\|_1 \leq \alpha \|\Delta_S\|_1.

(van de Geer, 2007; Bickel, Ritov & Tsybakov, 2008)

Lasso and restricted eigenvalues

Turning to noisy observations...

[Figure: observation model y = X theta* + w with theta* supported on S, zero on S^c.]

Estimator: Lasso program

\widehat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \frac{1}{2n} \|y - X\theta\|_2^2 + \lambda_n \|\theta\|_1 \Big\}.

Goal: obtain bounds on \|\widehat{\theta}_{\lambda_n} - \theta^*\|_2 that hold with high probability.

Lasso bounds: Four simple steps

Let's analyze the constrained version:

\min_{\theta \in \mathbb{R}^p} \frac{1}{2n} \|y - X\theta\|_2^2 \quad \text{such that } \|\theta\|_1 \leq R = \|\theta^*\|_1.

(1) By optimality of \widehat{\theta} and feasibility of \theta^*:

\frac{1}{2n} \|y - X\widehat{\theta}\|_2^2 \;\leq\; \frac{1}{2n} \|y - X\theta^*\|_2^2.

(2) Derive a basic inequality by re-arranging in terms of \widehat{\Delta} = \widehat{\theta} - \theta^*:

\frac{1}{n} \|X\widehat{\Delta}\|_2^2 \;\leq\; \frac{2}{n} \langle \widehat{\Delta}, X^T w \rangle.

(3) Restricted eigenvalue for the left-hand side; Hölder's inequality for the right-hand side:

\gamma \|\widehat{\Delta}\|_2^2 \;\leq\; \frac{1}{n} \|X\widehat{\Delta}\|_2^2 \;\leq\; \frac{2}{n} \langle \widehat{\Delta}, X^T w \rangle \;\leq\; 2 \|\widehat{\Delta}\|_1 \Big\| \frac{X^T w}{n} \Big\|_\infty.

(4) As before, \widehat{\Delta} \in \mathcal{C}(S), so that \|\widehat{\Delta}\|_1 \leq 2\sqrt{s}\,\|\widehat{\Delta}\|_2, and hence

\|\widehat{\Delta}\|_2 \;\leq\; \frac{4}{\gamma} \sqrt{s}\, \Big\| \frac{X^T w}{n} \Big\|_\infty.

Lasso error bounds for different models

Proposition. Suppose that the vector theta* has support S, with cardinality s, and that the design matrix X satisfies RE(S) with parameter gamma > 0. For the constrained Lasso with R = \|\theta^*\|_1, or the regularized Lasso with \lambda_n = 2\|X^T w\|_\infty / n, any optimal solution \widehat{\theta} satisfies the bound

\|\widehat{\theta} - \theta^*\|_2 \;\leq\; \frac{4\sqrt{s}}{\gamma}\, \Big\| \frac{X^T w}{n} \Big\|_\infty.

Remarks:
- this is a deterministic result on the set of optimizers
- various corollaries follow for specific statistical models

Corollaries:
- Compressed sensing: X_{ij} ~ N(0, 1) and bounded noise \|w\|_2 \leq \sigma \sqrt{n}
- Deterministic design: X with bounded columns and w_i ~ N(0, \sigma^2)

In both cases,

\Big\| \frac{X^T w}{n} \Big\|_\infty \;\leq\; \sqrt{\frac{3\sigma^2 \log p}{n}} \quad \text{w.h.p.} \qquad \Longrightarrow \qquad \|\widehat{\theta} - \theta^*\|_2 \;\leq\; \frac{4\sigma}{\gamma} \sqrt{\frac{3\, s \log p}{n}}.
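A hedged numerical companion to the corollary (the design, noise level, and regularization choices are my own assumptions): the Lasso error should track sigma * sqrt(s log p / n) as n grows.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
p, s, sigma = 500, 5, 1.0
theta_star = np.zeros(p)
theta_star[:s] = 1.0

for n in [200, 400, 800, 1600]:
    X = rng.standard_normal((n, p))
    y = X @ theta_star + sigma * rng.standard_normal(n)
    lam = 2 * sigma * np.sqrt(np.log(p) / n)        # of the order of ||X^T w / n||_inf
    err = np.linalg.norm(Lasso(alpha=lam).fit(X, y).coef_ - theta_star)
    rate = sigma * np.sqrt(s * np.log(p) / n)
    print(f"n = {n:4d}   error = {err:.3f}   sigma * sqrt(s log p / n) = {rate:.3f}")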

Look-ahead to Lecture 2: A more general theory

Recap: thus far we have
- derived error bounds for basis pursuit and the Lasso (l1-relaxation)
- seen the importance of restricted nullspace and restricted eigenvalues

The big picture: many other estimators have the same basic form:

\underbrace{\widehat{\theta}_{\lambda_n}}_{\text{Estimate}} \;\in\; \arg\min_{\theta \in \Omega} \Big\{ \underbrace{\mathcal{L}(\theta; Z_1^n)}_{\text{Loss function}} + \lambda_n \underbrace{\mathcal{R}(\theta)}_{\text{Regularizer}} \Big\}.

Past years have witnessed an explosion of results (compressed sensing, covariance estimation, block-sparsity, graphical models, matrix completion, ...).

Question: is there a common set of underlying principles?

Some papers (www.eecs.berkeley.edu/~wainwrig)

1. M. J. Wainwright (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso). IEEE Trans. Information Theory, May 2009.
2. S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, December 2012.
3. G. Raskutti, M. J. Wainwright, and B. Yu (2011). Minimax rates for linear regression over lq-balls. IEEE Trans. Information Theory, October 2011.
4. G. Raskutti, M. J. Wainwright, and B. Yu (2010). Restricted nullspace and eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, August 2010.