High-dimensional statistics: Some progress and challenges ahead
Martin Wainwright, UC Berkeley, Departments of Statistics and EECS
University College London Master Class: Lecture 1
Joint work with: Alekh Agarwal, Arash Amini, Po-Ling Loh, Sahand Negahban, Garvesh Raskutti, Pradeep Ravikumar, Bin Yu
Introduction
classical asymptotic theory: sample size n → +∞ with number of parameters p fixed
   law of large numbers, central limit theory
   consistency of maximum likelihood estimation
modern applications in science and engineering: large-scale problems, where both p and n may be large (possibly p ≫ n)
   need for high-dimensional theory that provides non-asymptotic results for (n, p)
curses and blessings of high dimensionality
   exponential explosions in computational complexity
   statistical curses (sample complexity)
   concentration of measure
Key ideas: what embedded low-dimensional structures are present in data? how can they be exploited algorithmically?
Vignette I: High-dimensional matrix estimation
want to estimate a covariance matrix Σ ∈ R^{p×p} given i.i.d. samples X_i ∼ N(0, Σ), for i = 1, 2, ..., n
Classical approach: estimate Σ via the sample covariance matrix
   Σ̂_n := (1/n) ∑_{i=1}^n X_i X_i^T   (average of p × p rank-one matrices)
Reasonable properties (p fixed, n increasing):
   Unbiased: E[Σ̂_n] = Σ
   Consistent: Σ̂_n → Σ almost surely as n → +∞
   Asymptotic distributional properties available
An alternative experiment:
   Fix some α > 0
   Study behavior over sequences with p/n = α
   Does Σ̂_{n(p)} converge to anything reasonable?
[Figure: empirical eigenvalue density (rescaled) of the sample covariance matrix versus the Marchenko-Pastur law, for α = 0.5 and α = 0.2 (Marchenko & Pastur, 1967)]
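As a quick illustration of this phenomenon, the following sketch (my own addition, not from the slides) compares the eigenvalues of the sample covariance matrix to the Marchenko-Pastur density for Σ = I; the sizes n, p and the helper mp_density are illustrative choices.

```python
# A sketch (not from the slides): eigenvalues of the sample covariance matrix
# versus the Marchenko-Pastur density, for Sigma = I and aspect ratio alpha = p/n.
import numpy as np

rng = np.random.default_rng(0)
n, p = 4000, 2000                      # illustrative sizes, alpha = p/n = 0.5
X = rng.standard_normal((n, p))        # rows X_i ~ N(0, I_p)
Sigma_hat = X.T @ X / n                # sample covariance matrix
eigs = np.linalg.eigvalsh(Sigma_hat)

alpha = p / n
lam_minus, lam_plus = (1 - np.sqrt(alpha)) ** 2, (1 + np.sqrt(alpha)) ** 2

def mp_density(x, alpha):
    """Marchenko-Pastur density (unit variance), supported on [lam_minus, lam_plus]."""
    return np.sqrt(np.maximum((lam_plus - x) * (x - lam_minus), 0.0)) / (2 * np.pi * alpha * x)

# The empirical spectrum spreads over [lam_minus, lam_plus] instead of
# concentrating around 1, even though E[Sigma_hat] = I.
hist, edges = np.histogram(eigs, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - mp_density(centers, alpha))))   # small for large n, p
```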
Low-dimensional structure: Gaussian graphical models
[Figure: zero pattern of the inverse covariance matrix (indices 1 through 5) and the corresponding 5-node graph]
P(x_1, x_2, ..., x_p) ∝ exp(−½ x^T Θ x).
Maximum likelihood with ℓ1-regularization
[Figure: zero pattern of the inverse covariance matrix and the corresponding 5-node graph]
Set-up: samples from a random vector with sparse covariance Σ or sparse inverse covariance Θ ∈ R^{p×p}.
Estimator (for inverse covariance):
   Θ̂ ∈ argmin_Θ { ⟨(1/n) ∑_{i=1}^n x_i x_i^T, Θ⟩ − log det(Θ) + λ_n ∑_{j≠k} |Θ_{jk}| }
Some past work: Yuan & Lin, 2006; d'Aspremont et al., 2007; Bickel & Levina, 2007; El Karoui, 2007; Rothman et al., 2007; Zhou et al., 2007; Friedman et al., 2008; Lam & Fan, 2008; Ravikumar et al., 2008; Zhou, Cai & Huang, 2009
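A minimal sketch of this ℓ1-regularized Gaussian MLE on synthetic data; the slides do not prescribe a solver, so this uses scikit-learn's GraphicalLasso as one off-the-shelf implementation, and the chain-graph precision matrix, sample size, and regularization level are illustrative assumptions.

```python
# A sketch (not from the slides) of the l1-regularized Gaussian MLE, using
# scikit-learn's GraphicalLasso as one off-the-shelf solver.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(1)
p, n = 10, 500
Theta_star = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))   # sparse: chain graph
Sigma_star = np.linalg.inv(Theta_star)
X = rng.multivariate_normal(np.zeros(p), Sigma_star, size=n)

model = GraphicalLasso(alpha=0.1).fit(X)        # alpha plays the role of lambda_n
Theta_hat = model.precision_
# Estimated edge set: large off-diagonal entries of the estimated precision matrix
edges = np.argwhere(np.triu(np.abs(Theta_hat) > 1e-3, k=1))
print(edges)                                     # ideally the chain edges (j, j+1)
```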
Gauss-Markov models with hidden variables
[Figure: hidden variable Z connected to observed X_1, X_2, X_3, X_4]
Problems with hidden variables: conditioned on the hidden Z, the vector X = (X_1, X_2, X_3, X_4) is Gauss-Markov.
Inverse covariance of X satisfies a {sparse, low-rank} decomposition: a sparse term (here the identity I_{4×4}) combined with a rank-one term µ 1 1^T that arises from marginalizing out Z.
(Chandrasekaran, Parrilo & Willsky, 2010)
Vignette II: High-dimensional sparse linear regression
[Figure: observation model y = Xθ* + w with n × p design X and θ* supported on S ⊂ {1, ..., p}]
Set-up: noisy observations y = Xθ* + w with sparse θ*
Estimator: Lasso program
   θ̂ ∈ argmin_θ { (1/n) ∑_{i=1}^n (y_i − x_i^T θ)^2 + λ_n ∑_{j=1}^p |θ_j| }
Some past work: Tibshirani, 1996; Chen et al., 1998; Donoho & Huo, 2001; Tropp, 2004; Fuchs, 2004; Efron et al., 2004; Meinshausen & Buhlmann, 2005; Candes & Tao, 2005; Donoho, 2005; Haupt & Nowak, 2005; Zhao & Yu, 2006; Zou, 2006; Koltchinskii, 2007; van de Geer, 2007; ...
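A minimal sketch of the Lasso program above on synthetic data, using scikit-learn's coordinate-descent solver; the dimensions, sparsity level, noise level, and choice of λ_n are illustrative assumptions, not values from the slides.

```python
# A sketch (not from the slides) of the Lasso program on synthetic data.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p, s, sigma = 200, 500, 10, 0.5
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:s] = 1.0                              # s-sparse truth, support S = {1,...,s}
y = X @ theta_star + sigma * rng.standard_normal(n)

lam = 2 * sigma * np.sqrt(np.log(p) / n)          # theory-motivated regularization level
# scikit-learn minimizes (1/(2n)) ||y - X theta||_2^2 + alpha * ||theta||_1
theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.linalg.norm(theta_hat - theta_star))     # l2-error, small relative to ||theta*||_2
```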
Application A: Compressed sensing (Donoho, 2005; Candes & Tao, 2005)
[Figure: y = Xβ* with random design X ∈ R^{n×p} and image vector β* ∈ R^p]
(a) Image: vectorize to β* ∈ R^p
(b) Compute n random projections y = Xβ*
Application A: Compressed sensing (Donoho, 2005; Candes & Tao, 2005)
In practice, signals are sparse in a transform domain: θ* := Ψβ* is a sparse signal, where Ψ is an orthonormal matrix.
[Figure: y = XΨ^T θ* with s-sparse θ* ∈ R^p]
Reconstruct θ* (and hence the image β* = Ψ^T θ*) by finding a sparse solution to the under-constrained linear system y = X̃θ, where X̃ = XΨ^T is another random matrix.
Noiseless ℓ1 recovery: Unrescaled sample size
[Figure: probability of exact recovery versus raw sample size n, for p = 128, 256, 512 (µ = 0)]
Application B: Graph structure estimation
let G = (V, E) be an undirected graph on p = |V| vertices
pairwise graphical model factorizes over edges of the graph:
   P(x_1, ..., x_p; θ) ∝ exp{ ∑_{(s,t)∈E} θ_st(x_s, x_t) }
given n independent and identically distributed (i.i.d.) samples of X = (X_1, ..., X_p), identify the underlying graph structure
Pseudolikelihood and neighborhood regression
Markov properties encode neighborhood structure:
   (X_s | X_{V∖s})  [condition on full graph]  =_d  (X_s | X_{N(s)})  [condition on Markov blanket, here N(s) = {t, u, v, w}]
[Figure: node X_s with neighbors X_t, X_u, X_v, X_w]
basis of the pseudolikelihood method (Besag, 1974)
basis of many graph learning algorithms (Friedman et al., 1999; Csiszar & Talata, 2005; Abbeel et al., 2006; Meinshausen & Buhlmann, 2006)
Graph selection via neighborhood regression
[Figure: n × p binary data matrix, with column X_s highlighted and remaining columns X_{∖s}]
Predict X_s based on X_{∖s} := {X_t, t ≠ s}.
1. For each node s ∈ V, compute the (regularized) maximum likelihood estimate:
   θ̂[s] := argmin_{θ ∈ R^p} { (1/n) ∑_{i=1}^n L(θ; X_{i∖s}) + λ_n ‖θ‖_1 }   (local log-likelihood + regularization)
2. Estimate the local neighborhood N̂(s) as the support of the regression vector θ̂[s] ∈ R^p.
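A minimal sketch of this procedure in the Gaussian case, where the local likelihood reduces to least squares and each node-wise fit is a Lasso regression; the support threshold and the "OR" rule for combining the two endpoints' estimates are illustrative assumptions, as are the chain-graph example and sample size.

```python
# A sketch (not from the slides) of neighborhood regression for graph selection:
# each node is Lasso-regressed on the remaining nodes, and its neighborhood is
# estimated by the support of the fitted coefficients.
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_select(X, lam, tol=1e-6):
    """Return an estimated edge set from node-wise Lasso regressions."""
    n, p = X.shape
    edges = set()
    for s in range(p):
        others = [t for t in range(p) if t != s]
        coef = Lasso(alpha=lam, fit_intercept=False).fit(X[:, others], X[:, s]).coef_
        for t, c in zip(others, coef):
            if abs(c) > tol:
                edges.add(frozenset((s, t)))      # edge kept if either endpoint selects it
    return edges

rng = np.random.default_rng(3)
p = 10
Theta = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))      # chain graph
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Theta), size=1000)
print(sorted(tuple(sorted(e)) for e in neighborhood_select(X, lam=0.1)))
```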
US Senate network (2004-2006 voting)
Outline
1. Lecture 1 (Today): Basics of sparse recovery
   Sparse linear systems: ℓ0/ℓ1 equivalence
   Noisy case: Lasso, ℓ2-bounds and variable selection
2. Lecture 2 (Tuesday): A more general theory
   A range of structured regularizers
   Group sparsity
   Low-rank matrices and nuclear norm regularization
   Matrix decomposition and robust PCA
   Ingredients of a general understanding
3. Lecture 3 (Wednesday): High-dimensional kernel methods
   Curse-of-dimensionality for non-parametric regression
   Reproducing kernel Hilbert spaces
   A simple but optimal estimator
Noiseless linear models and basis pursuit
[Figure: under-determined linear system y = Xθ* with n × p design X and θ* supported on S]
under-determined linear system: unidentifiable without constraints
say θ* ∈ R^p is sparse: supported on S ⊂ {1, 2, ..., p}
ℓ0-optimization: θ̂ = argmin_{θ∈R^p} ‖θ‖_0 subject to Xθ = y. Computationally intractable (NP-hard).
ℓ1-relaxation: θ̂ ∈ argmin_{θ∈R^p} ‖θ‖_1 subject to Xθ = y. A linear program (easy to solve): the basis pursuit relaxation.
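The ℓ1-relaxation really is a linear program: one standard reformulation splits θ = u − v with u, v ≥ 0 and minimizes 1^T(u + v) subject to X(u − v) = y. A minimal sketch using SciPy's LP solver; the variable split, dimensions, and random ensemble are illustrative choices, not prescriptions from the slides.

```python
# A sketch (not from the slides) of the basis pursuit linear program.
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    n, p = X.shape
    c = np.ones(2 * p)                       # objective: sum of u and v entries
    A_eq = np.hstack([X, -X])                # enforces X u - X v = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * p))
    u, v = res.x[:p], res.x[p:]
    return u - v                             # recovered theta

rng = np.random.default_rng(4)
n, p, s = 80, 200, 5
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:s] = 1.0
theta_hat = basis_pursuit(X, X @ theta_star)
print(np.linalg.norm(theta_hat - theta_star))   # near zero: exact recovery w.h.p. here
```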
Noiseless ℓ1 recovery: Unrescaled sample size
[Figure: probability of exact recovery versus raw sample size n, for p = 128, 256, 512 (µ = 0)]
Noiseless ℓ1 recovery: Rescaled sample size
[Figure: probability of exact recovery versus rescaled sample size α := n/(s log(p/s)), for p = 128, 256, 512 (µ = 0)]
Restricted nullspace: necessary and sufficient
Definition: For a fixed S ⊂ {1, 2, ..., p}, the matrix X ∈ R^{n×p} satisfies the restricted nullspace property w.r.t. S, or RN(S) for short, if
   {Δ ∈ R^p | XΔ = 0}  ∩  {Δ ∈ R^p | ‖Δ_{S^c}‖_1 ≤ ‖Δ_S‖_1}  =  {0},
that is, N(X) ∩ C(S) = {0}.
(Donoho & Huo, 2001; Feuer & Nemirovski, 2003; Cohen et al., 2009)
Proposition: Basis pursuit ℓ1-relaxation is exact for all S-sparse vectors ⟺ X satisfies RN(S).
Proof (sufficiency):
(1) The error vector Δ̂ = θ̂ − θ* satisfies XΔ̂ = 0, and hence Δ̂ ∈ N(X).
(2) Show that Δ̂ ∈ C(S):
    Optimality of θ̂:  ‖θ̂‖_1 ≤ ‖θ*‖_1 = ‖θ*_S‖_1.
    Sparsity of θ*:  ‖θ̂‖_1 = ‖θ* + Δ̂‖_1 = ‖θ*_S + Δ̂_S‖_1 + ‖Δ̂_{S^c}‖_1.
    Triangle inequality:  ‖θ*_S + Δ̂_S‖_1 + ‖Δ̂_{S^c}‖_1 ≥ ‖θ*_S‖_1 − ‖Δ̂_S‖_1 + ‖Δ̂_{S^c}‖_1.
    Combining the three displays gives ‖Δ̂_{S^c}‖_1 ≤ ‖Δ̂_S‖_1, i.e., Δ̂ ∈ C(S).
(3) Hence Δ̂ ∈ N(X) ∩ C(S), and (RN) ⟹ Δ̂ = 0.
Illustration of restricted nullspace property
[Figure: ℓ1-ball geometry in R^3 in the (Δ_1, Δ_2, Δ_3) coordinates]
consider θ* = (0, 0, θ*_3), so that S = {3}
error vector Δ = θ̂ − θ* belongs to the set C(S; 1) := {(Δ_1, Δ_2, Δ_3) ∈ R^3 | |Δ_1| + |Δ_2| ≤ |Δ_3|}
Some sufficient conditions
How to verify the RN property for a given sparsity s?
1. Elementwise incoherence condition (Donoho & Huo, 2001; Feuer & Nemirovski, 2003):
   max_{j,k=1,...,p} |(X^T X/n − I_{p×p})_{jk}| ≤ δ/s
   Matrices with i.i.d. sub-Gaussian entries: holds w.h.p. for n = Ω(s² log p).
2. Restricted isometry, or submatrix incoherence (Candes & Tao, 2005):
   max_{|U| ≤ 2s} |||(X^T X/n − I_{p×p})_{UU}|||_op ≤ δ_{2s}
   Matrices with i.i.d. sub-Gaussian entries: holds w.h.p. for n = Ω(s log(p/s)).
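A minimal sketch of checking the elementwise incoherence quantity numerically for an i.i.d. Gaussian design; the sizes n and p are illustrative choices.

```python
# A sketch (not from the slides): numerically checking the elementwise
# incoherence max_{j,k} |(X^T X / n - I)_{jk}| for an i.i.d. Gaussian design.
import numpy as np

rng = np.random.default_rng(5)
n, p = 1000, 200
X = rng.standard_normal((n, p))
G = X.T @ X / n
incoherence = np.max(np.abs(G - np.eye(p)))
print(incoherence)           # on the order of sqrt(log p / n) for this ensemble
```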
Violating matrix incoherence (elementwise/RIP)
Important: incoherence/RIP conditions imply RN, but are far from necessary. Very easy to violate them...
Form a random design matrix X = [x_1 x_2 ... x_p] ∈ R^{n×p} (p columns), whose rows X_1^T, ..., X_n^T are drawn i.i.d. as X_i ∼ N(0, Σ).
Example: for some µ ∈ (0, 1), consider the covariance matrix Σ = (1 − µ) I_{p×p} + µ 1 1^T.
Elementwise incoherence violated: for any j ≠ k,
   P[ |⟨x_j, x_k⟩/n − µ| ≥ ε ] ≤ c_1 exp(−c_2 n ε²).
RIP constants tend to infinity as (n, |S|) increases:
   P[ |||X_S^T X_S/n − I_{s×s}|||_2 ≥ µ(s − 1) − ε ] ≥ 1 − c_1 exp(−c_2 n ε²).
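The same numerical check on this µ-correlated design shows the violation directly; a minimal sketch (the values of µ, n, p are illustrative assumptions).

```python
# A sketch (not from the slides) of the mu-correlated example: with
# Sigma = (1 - mu) I + mu 11^T, the off-diagonal Gram entries concentrate
# around mu rather than 0, so elementwise incoherence fails.
import numpy as np

rng = np.random.default_rng(6)
n, p, mu = 1000, 200, 0.5
Sigma = (1 - mu) * np.eye(p) + mu * np.ones((p, p))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
G = X.T @ X / n
off_diag = G[~np.eye(p, dtype=bool)]
print(off_diag.mean(), np.max(np.abs(G - np.eye(p))))   # mean close to mu = 0.5, far from 0
```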
Noiseless ℓ1 recovery for µ = 0.5
[Figure: probability of exact recovery versus rescaled sample size α := n/(s log(p/s)), for p = 128, 256, 512 (µ = 0.5)]
Direct result for restricted nullspace/eigenvalues
Theorem (Raskutti, W., & Yu, 2010): Consider a random design X ∈ R^{n×p} with each row X_i ∼ N(0, Σ) i.i.d., and define κ(Σ) = max_{j=1,2,...,p} Σ_jj. Then for universal constants c_1, c_2,
   ‖Xθ‖_2/√n ≥ (1/2) ‖Σ^{1/2} θ‖_2 − 9 κ(Σ) √(log p / n) ‖θ‖_1   for all θ ∈ R^p,
with probability greater than 1 − c_1 exp(−c_2 n).
much less restrictive than incoherence/RIP conditions
many interesting matrix families are covered:
   Toeplitz dependency
   constant µ-correlation (previous example)
   covariance matrix Σ can even be degenerate
extensions to sub-Gaussian matrices (Rudelson & Zhou, 2012)
related results hold for generalized linear models
Easy verification of restricted nullspace
for any Δ ∈ C(S), we have ‖Δ‖_1 = ‖Δ_S‖_1 + ‖Δ_{S^c}‖_1 ≤ 2‖Δ_S‖_1 ≤ 2√s ‖Δ‖_2
applying the previous result:
   ‖XΔ‖_2/√n ≥ γ(Σ) ‖Δ‖_2,   where γ(Σ) := (1/2) √λ_min(Σ) − 18 κ(Σ) √(s log p / n)
have actually proven much more than restricted nullspace...
Definition: A design matrix X ∈ R^{n×p} satisfies the restricted eigenvalue (RE) condition over S (denoted RE(S)) with parameters α and γ > 0 if
   ‖XΔ‖²_2/n ≥ γ ‖Δ‖²_2   for all Δ ∈ R^p such that ‖Δ_{S^c}‖_1 ≤ α ‖Δ_S‖_1.
(van de Geer, 2007; Bickel, Ritov & Tsybakov, 2008)
Lasso and restricted eigenvalues
Turning to noisy observations...
[Figure: observation model y = Xθ* + w with n × p design X and θ* supported on S]
Estimator: Lasso program
   θ̂_{λ_n} ∈ argmin_{θ∈R^p} { (1/2n) ‖y − Xθ‖²_2 + λ_n ‖θ‖_1 }
Goal: obtain bounds on ‖θ̂_{λ_n} − θ*‖_2 that hold with high probability.
Lasso bounds: Four simple steps
Let's analyze the constrained version: min_{θ∈R^p} (1/2n) ‖y − Xθ‖²_2 such that ‖θ‖_1 ≤ R = ‖θ*‖_1.
(1) By optimality of θ̂ and feasibility of θ*:
    (1/2n) ‖y − Xθ̂‖²_2 ≤ (1/2n) ‖y − Xθ*‖²_2.
(2) Derive a basic inequality: re-arranging in terms of Δ̂ = θ̂ − θ*:
    ‖XΔ̂‖²_2/n ≤ 2 ⟨Δ̂, X^T w⟩/n.
(3) Restricted eigenvalue for the LHS; Hölder's inequality for the RHS:
    γ ‖Δ̂‖²_2 ≤ ‖XΔ̂‖²_2/n ≤ 2 ⟨Δ̂, X^T w⟩/n ≤ 2 ‖Δ̂‖_1 ‖X^T w‖_∞/n.
(4) As before, Δ̂ ∈ C(S), so that ‖Δ̂‖_1 ≤ 2√s ‖Δ̂‖_2, and hence
    ‖Δ̂‖_2 ≤ (4/γ) √s ‖X^T w‖_∞/n.
Lasso error bounds for different models
Proposition: Suppose that the vector θ* has support S, with cardinality s, and the design matrix X satisfies RE(S) with parameter γ > 0. For the constrained Lasso with R = ‖θ*‖_1, or the regularized Lasso with λ_n = 2 ‖X^T w‖_∞/n, any optimal solution θ̂ satisfies the bound
   ‖θ̂ − θ*‖_2 ≤ (4 √s / γ) ‖X^T w‖_∞/n.
this is a deterministic result on the set of optimizers
various corollaries for specific statistical models:
   Compressed sensing: X_ij ∼ N(0, 1) and bounded noise ‖w‖_2 ≤ σ √n
   Deterministic design: X with bounded columns and w_i ∼ N(0, σ²)
   In either case, ‖X^T w‖_∞/n ≤ √(3σ² log p / n) w.h.p., whence ‖θ̂ − θ*‖_2 ≤ (4σ/γ) √(3 s log p / n).
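A minimal sketch that probes the √(s log p / n) error scaling of the Proposition by simulation, using scikit-learn's Lasso with a theory-motivated λ_n; all parameter choices (p, s, σ, the Gaussian design, the grid of n) are illustrative assumptions.

```python
# A sketch (not from the slides) probing the sqrt(s log p / n) scaling of the
# Lasso l2-error by simulation.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
p, s, sigma = 512, 8, 1.0
theta_star = np.zeros(p)
theta_star[:s] = 1.0
for n in [200, 400, 800, 1600]:
    X = rng.standard_normal((n, p))
    y = X @ theta_star + sigma * rng.standard_normal(n)
    lam = 2 * sigma * np.sqrt(np.log(p) / n)
    theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    ratio = np.linalg.norm(theta_hat - theta_star) / np.sqrt(s * np.log(p) / n)
    print(n, round(ratio, 2))    # the ratio stays roughly of the same order as n grows
```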
Look-ahead to Lecture 2: A more general theory
Recap: thus far...
   derived error bounds for basis pursuit and the Lasso (ℓ1-relaxation)
   seen the importance of restricted nullspace and restricted eigenvalues
The big picture: lots of other estimators have the same basic form:
   θ̂_{λ_n} ∈ argmin_{θ∈Ω} { L(θ; Z_1^n) + λ_n R(θ) }   (loss function + regularizer)
Past years have witnessed an explosion of results (compressed sensing, covariance estimation, block-sparsity, graphical models, matrix completion, ...)
Question: Is there a common set of underlying principles?
Some papers (www.eecs.berkeley.edu/~wainwrig)
1. M. J. Wainwright (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Trans. Information Theory, May 2009.
2. S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, December 2012.
3. G. Raskutti, M. J. Wainwright, and B. Yu (2011). Minimax rates for linear regression over ℓ_q-balls. IEEE Trans. Information Theory, October 2011.
4. G. Raskutti, M. J. Wainwright, and B. Yu (2010). Restricted nullspace and eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, August 2010.