High-dimensional statistics: Some progress and challenges ahead
1 High-dimensional statistics: Some progress and challenges ahead

Martin Wainwright, UC Berkeley, Departments of Statistics and EECS
University College London, Master Class: Lecture 1

Joint work with: Alekh Agarwal, Arash Amini, Po-Ling Loh, Sahand Negahban, Garvesh Raskutti, Pradeep Ravikumar, Bin Yu.
2-5 Introduction

Classical asymptotic theory: sample size n → +∞ with the number of parameters p fixed
- law of large numbers, central limit theory
- consistency of maximum likelihood estimation

Modern applications in science and engineering:
- large-scale problems: both p and n may be large (possibly p ≫ n)
- need for a high-dimensional theory that provides non-asymptotic results in (n, p)

Curses and blessings of high dimensionality:
- exponential explosions in computational complexity
- statistical curses (sample complexity)
- concentration of measure

Key ideas: what embedded low-dimensional structures are present in data? how can they be exploited algorithmically?
6-9 Vignette I: High-dimensional matrix estimation

Want to estimate a covariance matrix Σ ∈ R^{p×p}, given i.i.d. samples X_i ~ N(0, Σ), for i = 1, 2, ..., n.

Classical approach: estimate Σ via the sample covariance matrix

    Σ̂_n := (1/n) Σ_{i=1}^n X_i X_i^T,

an average of p × p rank-one matrices.

Reasonable properties (p fixed, n increasing):
- Unbiased: E[Σ̂_n] = Σ
- Consistent: Σ̂_n → Σ almost surely as n → +∞
- Asymptotic distributional properties available

An alternative experiment: fix some α > 0 and study behavior over sequences with p/n = α. Does Σ̂_{n(p)} converge to anything reasonable?
10 [Figure: rescaled empirical eigenvalue density of the sample covariance vs. the Marchenko-Pastur law, α = 0.5 (Marchenko & Pastur, 1967).]

11 [Figure: rescaled empirical eigenvalue density of the sample covariance vs. the Marchenko-Pastur law, α = 0.2 (Marchenko & Pastur, 1967).]
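The phenomenon in these figures is easy to reproduce numerically. A minimal sketch, with arbitrary sizes and seed (not taken from the slides): draw X with i.i.d. N(0, 1) entries so that Σ = I_p, hold the aspect ratio α = p/n fixed, and inspect the eigenvalues of the sample covariance.

```python
import numpy as np

# Sketch: with Sigma = I_p and aspect ratio alpha = p/n held fixed, the
# eigenvalues of the sample covariance do NOT concentrate around 1; they
# spread over the Marchenko-Pastur support [(1-sqrt(alpha))^2, (1+sqrt(alpha))^2].
rng = np.random.default_rng(0)
n, alpha = 1000, 0.5
p = int(alpha * n)                       # p = 500
X = rng.standard_normal((n, p))          # rows X_i ~ N(0, I_p)
S = (X.T @ X) / n                        # sample covariance matrix
evals = np.linalg.eigvalsh(S)

lower = (1 - np.sqrt(alpha)) ** 2        # ~ 0.086
upper = (1 + np.sqrt(alpha)) ** 2        # ~ 2.914
print(evals.min(), evals.max())          # near the MP edges, far from 1
```

Even though every population eigenvalue equals 1, the sample eigenvalues fill out the entire Marchenko-Pastur interval in this regime.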
12 Low-dimensional structure: Gaussian graphical models

Zero pattern of the inverse covariance:

    P(x_1, x_2, ..., x_p) ∝ exp( -(1/2) x^T Θ x ).

Martin Wainwright (UC Berkeley), High-dimensional statistics, February.
13-14 Maximum likelihood with ℓ1-regularization

Zero pattern of the inverse covariance.

Set-up: samples from a random vector with sparse covariance Σ or sparse inverse covariance Θ ∈ R^{p×p}.

Estimator (for the inverse covariance):

    Θ̂ ∈ argmin_Θ { ⟨ (1/n) Σ_{i=1}^n x_i x_i^T, Θ ⟩ − log det(Θ) + λ_n Σ_{j≠k} |Θ_jk| }

Some past work: Yuan & Lin, 2006; d'Aspremont et al., 2007; Bickel & Levina, 2007; El Karoui, 2007; Rothman et al., 2007; Zhou et al., 2007; Friedman et al., 2008; Lam & Fan, 2008; Ravikumar et al., 2008; Zhou, Cai & Huang, 2009.
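One off-the-shelf solver for this ℓ1-regularized MLE is scikit-learn's GraphicalLasso. A hedged sketch (the chain-structured Θ, the sample size, and the regularization level below are illustrative choices, not from the slides):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Illustrative sparse precision matrix: a chain graph with edges (j, j+1).
rng = np.random.default_rng(1)
p, n = 10, 2000
Theta = np.eye(p)
for j in range(p - 1):
    Theta[j, j + 1] = Theta[j + 1, j] = 0.4
Sigma = np.linalg.inv(Theta)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# `alpha` plays the role of lambda_n in the estimator above.
est = GraphicalLasso(alpha=0.05).fit(X)
Theta_hat = est.precision_
support = np.abs(Theta_hat) > 1e-2       # estimated edge set (off-diagonal)
```

With enough samples the estimated precision matrix recovers the chain's zero pattern; the true edges (j, j+1) correspond to clearly nonzero entries of Theta_hat.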
15-16 Gauss-Markov models with hidden variables

Z
X_1   X_2   X_3   X_4

Problems with hidden variables: conditioned on the hidden variable Z, the vector X = (X_1, X_2, X_3, X_4) is Gauss-Markov.

The inverse covariance of X satisfies a {sparse, low-rank} decomposition:

    Θ_X = I_{4×4} − μ μ^T,   where μ = (μ, μ, μ, μ)^T,

i.e. a sparse (here diagonal) matrix minus a rank-one matrix. (Chandrasekaran, Parrilo & Willsky, 2010)
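A quick numerical check of this decomposition, assuming a joint precision with unit diagonal and edge weight −μ between the hub Z and each X_i (μ = 0.3 is an arbitrary illustrative value): marginalizing out Z via the Schur complement leaves exactly a sparse-minus-rank-one inverse covariance.

```python
import numpy as np

# Joint precision over (Z, X1..X4): Z connected to every X_i, no X-X edges.
mu = 0.3
K = np.eye(5)
K[0, 1:] = K[1:, 0] = -mu

Sigma_X = np.linalg.inv(K)[1:, 1:]        # marginal covariance of X alone
Theta_X = np.linalg.inv(Sigma_X)          # = K_XX - K_XZ K_ZZ^{-1} K_ZX (Schur)

# Sparse (identity) minus rank-one, as in the decomposition above.
expected = np.eye(4) - mu**2 * np.ones((4, 4))
assert np.allclose(Theta_X, expected)
```

The marginal precision is dense (every pair of X_i's appears conditionally dependent), yet it is only a rank-one perturbation away from the sparse conditional model.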
17 Vignette II: High-dimensional sparse linear regression

y = X θ* + w,  with y ∈ R^n, X ∈ R^{n×p}, and θ* supported on S (zero on S^c).

Set-up: noisy observations y = Xθ* + w with sparse θ*.

Estimator: Lasso program

    θ̂ ∈ argmin_{θ ∈ R^p} { (1/n) Σ_{i=1}^n (y_i − x_i^T θ)^2 + λ_n Σ_{j=1}^p |θ_j| }

Some past work: Tibshirani, 1996; Chen et al., 1998; Donoho & Huo, 2001; Tropp, 2004; Fuchs, 2004; Efron et al., 2004; Meinshausen & Bühlmann, 2005; Candes & Tao, 2005; Donoho, 2005; Haupt & Nowak, 2005; Zhao & Yu, 2006; Zou, 2006; Koltchinskii, 2007; van de Geer, 2007.
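The set-up above is easy to simulate; a hedged sketch with illustrative sizes (n, p, sparsity level, and noise scale are arbitrary choices, not from the slides), using scikit-learn's coordinate-descent Lasso:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Sparse linear model y = X theta* + w with p > n.
rng = np.random.default_rng(2)
n, p, s = 100, 200, 5
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:s] = 1.0                      # s-sparse truth, support S = {0..4}
w = 0.1 * rng.standard_normal(n)
y = X @ theta_star + w

# sklearn's `alpha` is lambda_n; theory suggests lambda ~ sigma*sqrt(log p / n).
lam = 0.1 * np.sqrt(np.log(p) / n)
theta_hat = Lasso(alpha=lam, max_iter=50_000).fit(X, y).coef_
err = np.linalg.norm(theta_hat - theta_star)
```

Despite p = 2n unknowns, the ℓ2-error is small and the five active coefficients are estimated near their true value of 1.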
18 Application A: Compressed sensing (Donoho, 2005; Candes & Tao, 2005)

y = X β*,  with y ∈ R^n, X ∈ R^{n×p}, β* ∈ R^p.

(a) Image: vectorize to β* ∈ R^p.
(b) Compute n random projections.
19 Application A: Compressed sensing (Donoho, 2005; Candes & Tao, 2005)

In practice, signals are sparse in a transform domain: θ* := Ψβ* is an s-sparse signal, where Ψ is an orthonormal matrix, so that

    y = X Ψ^T θ* = X̃ θ*.

Reconstruct θ* (and hence the image β* = Ψ^T θ*) by finding a sparse solution to the under-constrained linear system y = X̃θ*, where X̃ = XΨ^T is another random matrix.
20 Noiseless ℓ1 recovery: unrescaled sample size

[Figure: probability of exact recovery versus raw sample size n (μ = 0), for p = 128, 256, and larger.]
21 Application B: Graph structure estimation

Let G = (V, E) be an undirected graph on p = |V| vertices. A pairwise graphical model factorizes over the edges of the graph:

    P(x_1, ..., x_p; θ) ∝ exp{ Σ_{(s,t) ∈ E} θ_st(x_s, x_t) }.

Given n independent and identically distributed (i.i.d.) samples of X = (X_1, ..., X_p), identify the underlying graph structure.
22 Pseudolikelihood and neighborhood regression

Markov properties encode neighborhood structure:

    (X_s | X_{V\s})  =_d  (X_s | X_{N(s)})

i.e. conditioning on the full graph is equivalent in distribution to conditioning on the Markov blanket N(s) (e.g. N(s) = {t, u, v, w}).

- basis of the pseudolikelihood method (Besag, 1974)
- basis of many graph learning algorithms (Friedman et al., 1999; Csiszar & Talata, 2005; Abbeel et al., 2006; Meinshausen & Bühlmann, 2006)
23-25 Graph selection via neighborhood regression

Predict X_s based on X_{\s} := {X_t, t ≠ s}.

1. For each node s ∈ V, compute the regularized maximum likelihood estimate

    θ̂[s] := argmin_{θ ∈ R^p} { (1/n) Σ_{i=1}^n L(θ; X_{i,\s}) + λ_n ‖θ‖_1 }

   (local log-likelihood plus regularization).

2. Estimate the local neighborhood N̂(s) as the support of the regression vector θ̂[s] ∈ R^p.
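For Gaussian data this recipe reduces to running the Lasso at each node; a hedged sketch in the style of Meinshausen & Bühlmann (the 4-node chain, sample size, and penalty below are illustrative choices, not from the slides):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative Gaussian chain graph 0 - 1 - 2 - 3 via its precision matrix.
rng = np.random.default_rng(3)
p, n = 4, 5000
Theta = np.eye(p)
for j in range(p - 1):
    Theta[j, j + 1] = Theta[j + 1, j] = 0.35
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Theta), size=n)

# Neighborhood regression: lasso-regress each X_s on all other variables,
# then read the edge set off the supports ("OR" rule over neighborhoods).
edges = set()
for s in range(p):
    rest = [t for t in range(p) if t != s]
    coef = Lasso(alpha=0.05).fit(X[:, rest], X[:, s]).coef_
    for c, t in zip(coef, rest):
        if abs(c) > 1e-3:
            edges.add(frozenset((s, t)))
```

With this sample size the recovered edge set is exactly the chain: each node's lasso support picks out its Markov blanket and zeroes out the conditionally independent variables.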
26 US Senate network (voting)
27 Outline

1. Lecture 1 (Today): Basics of sparse recovery
   - Sparse linear systems: ℓ0/ℓ1 equivalence
   - Noisy case: Lasso, ℓ2-bounds and variable selection
2. Lecture 2 (Tuesday): A more general theory
   - A range of structured regularizers: group sparsity; low-rank matrices and nuclear norm regularization; matrix decomposition and robust PCA
   - Ingredients of a general understanding
3. Lecture 3 (Wednesday): High-dimensional kernel methods
   - Curse of dimensionality for non-parametric regression
   - Reproducing kernel Hilbert spaces
   - A simple but optimal estimator
28 Noiseless linear models and basis pursuit

y = Xθ*, with y ∈ R^n, X ∈ R^{n×p}, and θ* supported on S.

- Under-determined linear system: unidentifiable without constraints.
- Say θ* ∈ R^p is sparse: supported on S ⊂ {1, 2, ..., p}.

ℓ0-optimization:

    θ̂ = argmin_{θ ∈ R^p} ‖θ‖_0   such that Xθ = y

(computationally intractable: NP-hard).

ℓ1-relaxation (basis pursuit):

    θ̂ ∈ argmin_{θ ∈ R^p} ‖θ‖_1   such that Xθ = y

(a linear program, easy to solve).
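The ℓ1-relaxation really is a linear program: splitting θ = u − v with u, v ≥ 0 turns min ‖θ‖_1 subject to Xθ = y into a standard-form LP. A sketch using scipy (sizes and the sparse vector are illustrative choices):

```python
import numpy as np
from scipy.optimize import linprog

# Noiseless observations of a 3-sparse vector, with n well above s*log(p/s).
rng = np.random.default_rng(4)
n, p, s = 40, 80, 3
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:s] = [1.0, -2.0, 1.5]
y = X @ theta_star

# LP: minimize sum(u) + sum(v) subject to X(u - v) = y, u >= 0, v >= 0.
c = np.ones(2 * p)
A_eq = np.hstack([X, -X])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
theta_hat = res.x[:p] - res.x[p:]
```

For this Gaussian design the LP recovers theta_star exactly, as the restricted nullspace theory below predicts.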
29 Noiseless ℓ1 recovery: unrescaled sample size

[Figure: probability of exact recovery versus raw sample size n (μ = 0), for p = 128, 256, and larger.]

30 Noiseless ℓ1 recovery: rescaled

[Figure: probability of exact recovery versus rescaled sample size α := n / (s log(p/s)) (μ = 0), for p = 128, 256, and larger.]
31-32 Restricted nullspace: necessary and sufficient

Definition. For a fixed S ⊆ {1, 2, ..., p}, the matrix X ∈ R^{n×p} satisfies the restricted nullspace property w.r.t. S, or RN(S) for short, if

    N(X) ∩ C(S) = {0},

where N(X) := {Δ ∈ R^p : XΔ = 0} is the nullspace and C(S) := {Δ ∈ R^p : ‖Δ_{S^c}‖_1 ≤ ‖Δ_S‖_1}.

(Donoho & Huo, 2001; Feuer & Nemirovski, 2003; Cohen et al., 2009)

Proposition. The basis pursuit ℓ1-relaxation is exact for all S-sparse vectors ⟺ X satisfies RN(S).
33 Restricted nullspace: necessary and sufficient

Proof (sufficiency):

(1) The error vector Δ̂ = θ̂ − θ* satisfies XΔ̂ = 0, and hence Δ̂ ∈ N(X).

(2) Show that Δ̂ ∈ C(S):
- Optimality of θ̂: ‖θ̂‖_1 ≤ ‖θ*‖_1 = ‖θ*_S‖_1.
- Sparsity of θ*: ‖θ̂‖_1 = ‖θ* + Δ̂‖_1 = ‖θ*_S + Δ̂_S‖_1 + ‖Δ̂_{S^c}‖_1.
- Triangle inequality: ‖θ*_S + Δ̂_S‖_1 + ‖Δ̂_{S^c}‖_1 ≥ ‖θ*_S‖_1 − ‖Δ̂_S‖_1 + ‖Δ̂_{S^c}‖_1.
  Combining the three displays gives ‖Δ̂_{S^c}‖_1 ≤ ‖Δ̂_S‖_1, i.e. Δ̂ ∈ C(S).

(3) Hence Δ̂ ∈ N(X) ∩ C(S), and (RN) ⟹ Δ̂ = 0.
34 Illustration of restricted nullspace property

Consider θ* = (0, 0, θ*_3), so that S = {3}. The error vector Δ = θ̂ − θ* belongs to the set

    C(S) := { (Δ_1, Δ_2, Δ_3) ∈ R^3 : |Δ_1| + |Δ_2| ≤ |Δ_3| }.
35-38 Some sufficient conditions

How to verify the RN property for a given sparsity s?

1. Elementwise incoherence condition (Donoho & Huo, 2001; Feuer & Nemirovski, 2003):

    max_{j,k = 1,...,p} | (X^T X / n − I_{p×p})_{jk} | ≤ δ / s.

   For matrices with i.i.d. sub-Gaussian entries: holds w.h.p. for n = Ω(s² log p).

2. Restricted isometry, or submatrix incoherence (Candes & Tao, 2005):

    max_{|U| ≤ 2s} ‖ (X^T X / n − I_{p×p})_{UU} ‖_op ≤ δ_{2s}.

   For matrices with i.i.d. sub-Gaussian entries: holds w.h.p. for n = Ω(s log(p/s)).
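The elementwise incoherence quantity is cheap to compute, which makes the scaling above easy to sanity-check. A sketch for an i.i.d. Gaussian design (sizes are illustrative choices):

```python
import numpy as np

# Elementwise incoherence of a standard Gaussian design: the largest entry
# of X^T X / n - I behaves like sqrt(log p / n); the condition above then
# only certifies sparsity levels s up to roughly delta / mu_max.
rng = np.random.default_rng(5)
n, p = 2000, 50
X = rng.standard_normal((n, p))
G = X.T @ X / n - np.eye(p)
mu_max = np.abs(G).max()                  # elementwise incoherence

print(mu_max, np.sqrt(np.log(p) / n))     # both on the order of a few percent
```

This is why the elementwise condition needs n = Ω(s² log p): the deviation mu_max must be pushed below δ/s, not just below a constant δ.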
39 Violating matrix incoherence (elementwise/RIP)

Important: incoherence/RIP conditions imply RN, but they are far from necessary. It is very easy to violate them...
40-42 Violating matrix incoherence (elementwise/RIP)

Form a random design matrix X = [x_1 x_2 ... x_p] ∈ R^{n×p} (p columns, n rows), with rows X_i ~ N(0, Σ), i.i.d.

Example: for some μ ∈ (0, 1), consider the covariance matrix

    Σ = (1 − μ) I_{p×p} + μ 𝟙𝟙^T.

Elementwise incoherence is violated: for any j ≠ k,

    P[ | ⟨x_j, x_k⟩ / n − μ | ≥ ε ] ≤ c_1 exp(−c_2 n ε²),

so the off-diagonal entries concentrate around the constant μ rather than decaying.

RIP constants tend to infinity as (n, |S|) increases:

    ‖ X_S^T X_S / n − I_{s×s} ‖_2 ≥ μ(s − 1) − ε   with probability at least 1 − c_1 exp(−c_2 n ε²).
43 Noiseless ℓ1 recovery for μ = 0.5

[Figure: probability of exact recovery versus rescaled sample size α := n / (s log(p/s)), for μ = 0.5 and p = 128, 256, and larger.]
44-45 Direct result for restricted nullspace/eigenvalues

Theorem (Raskutti, W., & Yu, 2010). Consider a random design X ∈ R^{n×p} with rows X_i ~ N(0, Σ), i.i.d., and define κ(Σ) = max_{j=1,2,...,p} Σ_jj. Then for universal constants c_1, c_2,

    ‖Xθ‖_2 / √n ≥ (1/2) ‖Σ^{1/2} θ‖_2 − 9 κ(Σ) √(log p / n) ‖θ‖_1   for all θ ∈ R^p,

with probability greater than 1 − c_1 exp(−c_2 n).

- much less restrictive than incoherence/RIP conditions
- many interesting matrix families are covered: Toeplitz dependency; constant μ-correlation (previous example); the covariance matrix Σ can even be degenerate
- extensions to sub-Gaussian matrices (Rudelson & Zhou, 2012)
- related results hold for generalized linear models
46-48 Easy verification of restricted nullspace

For any Δ ∈ C(S), we have

    ‖Δ‖_1 = ‖Δ_S‖_1 + ‖Δ_{S^c}‖_1 ≤ 2 ‖Δ_S‖_1 ≤ 2 √s ‖Δ‖_2.

Applying the previous result:

    ‖XΔ‖_2 / √n ≥ { (1/2) λ_min(Σ^{1/2}) − 18 κ(Σ) √(s log p / n) } ‖Δ‖_2 =: γ(Σ) ‖Δ‖_2.

We have actually proven much more than restricted nullspace...

Definition. A design matrix X ∈ R^{n×p} satisfies the restricted eigenvalue (RE) condition over S (denoted RE(S)) with parameters (α, γ), γ > 0, if

    ‖XΔ‖_2² / n ≥ γ² ‖Δ‖_2²   for all Δ ∈ R^p such that ‖Δ_{S^c}‖_1 ≤ α ‖Δ_S‖_1.

(van de Geer, 2007; Bickel, Ritov & Tsybakov, 2008)
49 Lasso and restricted eigenvalues

Turning to noisy observations: y = Xθ* + w, with θ* supported on S.

Estimator: Lasso program

    θ̂_{λ_n} ∈ argmin_{θ ∈ R^p} { (1/2n) ‖y − Xθ‖_2² + λ_n ‖θ‖_1 }.

Goal: obtain bounds on ‖θ̂_{λ_n} − θ*‖_2 that hold with high probability.
50-54 Lasso bounds: Four simple steps

Let's analyze the constrained version:

    min_{θ ∈ R^p} (1/2n) ‖y − Xθ‖_2²   such that ‖θ‖_1 ≤ R = ‖θ*‖_1.

(1) By optimality of θ̂ and feasibility of θ*:

    (1/2n) ‖y − Xθ̂‖_2² ≤ (1/2n) ‖y − Xθ*‖_2².

(2) Derive a basic inequality: re-arranging in terms of Δ̂ = θ̂ − θ*,

    (1/2n) ‖XΔ̂‖_2² ≤ (1/n) ⟨Δ̂, X^T w⟩.

(3) Restricted eigenvalue for the LHS; Hölder's inequality for the RHS:

    (γ/2) ‖Δ̂‖_2² ≤ (1/2n) ‖XΔ̂‖_2² ≤ (1/n) ⟨Δ̂, X^T w⟩ ≤ ‖Δ̂‖_1 ‖X^T w / n‖_∞.

(4) As before, Δ̂ ∈ C(S), so that ‖Δ̂‖_1 ≤ 2√s ‖Δ̂‖_2, and hence

    ‖Δ̂‖_2 ≤ (4√s / γ) ‖X^T w / n‖_∞.
55-57 Lasso error bounds for different models

Proposition. Suppose that the vector θ* has support S, with cardinality s, and the design matrix X satisfies RE(S) with parameter γ > 0. For the constrained Lasso with R = ‖θ*‖_1, or the regularized Lasso with λ_n = 2 ‖X^T w / n‖_∞, any optimal solution θ̂ satisfies the bound

    ‖θ̂ − θ*‖_2 ≤ (4√s / γ) ‖X^T w / n‖_∞.

- this is a deterministic result on the set of optimizers
- various corollaries for specific statistical models:
  - Compressed sensing: X_ij ~ N(0, 1) and bounded noise ‖w‖_2 ≤ σ√n
  - Deterministic design: X with bounded columns and w_i ~ N(0, σ²)

In both cases, ‖X^T w / n‖_∞ ≤ √(3σ² log p / n) w.h.p., so that

    ‖θ̂ − θ*‖_2 ≤ (4σ / γ) √(3 s log p / n).
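The √(s log p / n) rate in this corollary can be eyeballed numerically. A rough sketch (constants, sizes, and the choice λ_n = 2σ√(log p / n) are illustrative, not from the slides):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustration: lasso l2-error should shrink roughly like sqrt(s log p / n)
# as the sample size n grows, with (p, s, sigma) held fixed.
rng = np.random.default_rng(6)
p, s, sigma = 200, 5, 0.5

def lasso_error(n):
    X = rng.standard_normal((n, p))
    theta_star = np.zeros(p)
    theta_star[:s] = 1.0
    y = X @ theta_star + sigma * rng.standard_normal(n)
    lam = 2 * sigma * np.sqrt(np.log(p) / n)      # theory-driven lambda_n
    theta_hat = Lasso(alpha=lam, max_iter=50_000).fit(X, y).coef_
    return np.linalg.norm(theta_hat - theta_star)

errs = [lasso_error(n) for n in (100, 400, 1600)]
# Quadrupling n should roughly halve the error.
```

The errors decrease monotonically across the three sample sizes, consistent with the √(1/n) dependence in the bound.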
58-60 Look-ahead to Lecture 2: A more general theory

Recap: thus far we have
- derived error bounds for basis pursuit and the Lasso (ℓ1-relaxation)
- seen the importance of restricted nullspace and restricted eigenvalues

The big picture: lots of other estimators have the same basic form:

    θ̂_{λ_n} ∈ argmin_{θ ∈ Ω} { L(θ; Z_1^n) + λ_n R(θ) }

(estimate = loss function + regularizer).

Past years have witnessed an explosion of results (compressed sensing, covariance estimation, block-sparsity, graphical models, matrix completion, ...).

Question: is there a common set of underlying principles?
61 Some papers

- M. J. Wainwright (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Trans. Information Theory, May 2009.
- S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, December 2012.
- G. Raskutti, M. J. Wainwright and B. Yu (2011). Minimax rates for linear regression over ℓ_q-balls. IEEE Trans. Information Theory, October 2011.
- G. Raskutti, M. J. Wainwright and B. Yu (2010). Restricted nullspace and eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, August 2010.
Compressive Sensing and Beyond Sohail Bahmani Gerorgia Tech. Signal Processing Compressed Sensing Signal Models Classics: bandlimited The Sampling Theorem Any signal with bandwidth B can be recovered
More informationSimultaneous Support Recovery in High Dimensions: Benefits and Perils of Block `1=` -Regularization
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 6, JUNE 2011 3841 Simultaneous Support Recovery in High Dimensions: Benefits and Perils of Block `1=` -Regularization Sahand N. Negahban and Martin
More informationECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis
ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis Lecture 7: Matrix completion Yuejie Chi The Ohio State University Page 1 Reference Guaranteed Minimum-Rank Solutions of Linear
More informationTractable performance bounds for compressed sensing.
Tractable performance bounds for compressed sensing. Alex d Aspremont, Francis Bach, Laurent El Ghaoui Princeton University, École Normale Supérieure/INRIA, U.C. Berkeley. Support from NSF, DHS and Google.
More informationCompressed Sensing and Linear Codes over Real Numbers
Compressed Sensing and Linear Codes over Real Numbers Henry D. Pfister (joint with Fan Zhang) Texas A&M University College Station Information Theory and Applications Workshop UC San Diego January 31st,
More informationCompressed Sensing and Sparse Recovery
ELE 538B: Sparsity, Structure and Inference Compressed Sensing and Sparse Recovery Yuxin Chen Princeton University, Spring 217 Outline Restricted isometry property (RIP) A RIPless theory Compressed sensing
More informationMethods for sparse analysis of high-dimensional data, II
Methods for sparse analysis of high-dimensional data, II Rachel Ward May 23, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 47 High dimensional
More information(Part 1) High-dimensional statistics May / 41
Theory for the Lasso Recall the linear model Y i = p j=1 β j X (j) i + ɛ i, i = 1,..., n, or, in matrix notation, Y = Xβ + ɛ, To simplify, we assume that the design X is fixed, and that ɛ is N (0, σ 2
More informationarxiv: v1 [stat.ml] 3 Nov 2010
Preprint The Lasso under Heteroscedasticity Jinzhu Jia, Karl Rohe and Bin Yu, arxiv:0.06v stat.ml 3 Nov 00 Department of Statistics and Department of EECS University of California, Berkeley Abstract: The
More informationNew Theory and Algorithms for Scalable Data Fusion
AFRL-OSR-VA-TR-23-357 New Theory and Algorithms for Scalable Data Fusion Martin Wainwright UC Berkeley JULY 23 Final Report DISTRIBUTION A: Approved for public release. AIR FORCE RESEARCH LABORATORY AF
More informationRisk and Noise Estimation in High Dimensional Statistics via State Evolution
Risk and Noise Estimation in High Dimensional Statistics via State Evolution Mohsen Bayati Stanford University Joint work with Jose Bento, Murat Erdogdu, Marc Lelarge, and Andrea Montanari Statistical
More informationRestricted Strong Convexity Implies Weak Submodularity
Restricted Strong Convexity Implies Weak Submodularity Ethan R. Elenberg Rajiv Khanna Alexandros G. Dimakis Department of Electrical and Computer Engineering The University of Texas at Austin {elenberg,rajivak}@utexas.edu
More informationCausal Inference: Discussion
Causal Inference: Discussion Mladen Kolar The University of Chicago Booth School of Business Sept 23, 2016 Types of machine learning problems Based on the information available: Supervised learning Reinforcement
More informationAn efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss
An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss arxiv:1811.04545v1 [stat.co] 12 Nov 2018 Cheng Wang School of Mathematical Sciences, Shanghai Jiao
More informationCompressed Sensing and Neural Networks
and Jan Vybíral (Charles University & Czech Technical University Prague, Czech Republic) NOMAD Summer Berlin, September 25-29, 2017 1 / 31 Outline Lasso & Introduction Notation Training the network Applications
More informationProbabilistic Graphical Models
School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 5, 06 Reading: See class website Eric Xing @ CMU, 005-06
More informationA Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models
A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los
More informationSparse Covariance Selection using Semidefinite Programming
Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support
More informationOWL to the rescue of LASSO
OWL to the rescue of LASSO IISc IBM day 2018 Joint Work R. Sankaran and Francis Bach AISTATS 17 Chiranjib Bhattacharyya Professor, Department of Computer Science and Automation Indian Institute of Science,
More informationCompressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery
Compressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery Jorge F. Silva and Eduardo Pavez Department of Electrical Engineering Information and Decision Systems Group Universidad
More informationAdaptive estimation of the copula correlation matrix for semiparametric elliptical copulas
Adaptive estimation of the copula correlation matrix for semiparametric elliptical copulas Department of Mathematics Department of Statistical Science Cornell University London, January 7, 2016 Joint work
More informationSensing systems limited by constraints: physical size, time, cost, energy
Rebecca Willett Sensing systems limited by constraints: physical size, time, cost, energy Reduce the number of measurements needed for reconstruction Higher accuracy data subject to constraints Original
More informationSparse signal recovery using sparse random projections
Sparse signal recovery using sparse random projections Wei Wang Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2009-69 http://www.eecs.berkeley.edu/pubs/techrpts/2009/eecs-2009-69.html
More informationLinear Regression with Strongly Correlated Designs Using Ordered Weigthed l 1
Linear Regression with Strongly Correlated Designs Using Ordered Weigthed l 1 ( OWL ) Regularization Mário A. T. Figueiredo Instituto de Telecomunicações and Instituto Superior Técnico, Universidade de
More informationNew Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit
New Coherence and RIP Analysis for Wea 1 Orthogonal Matching Pursuit Mingrui Yang, Member, IEEE, and Fran de Hoog arxiv:1405.3354v1 [cs.it] 14 May 2014 Abstract In this paper we define a new coherence
More informationA New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables
A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of
More informationSparse and Low Rank Recovery via Null Space Properties
Sparse and Low Rank Recovery via Null Space Properties Holger Rauhut Lehrstuhl C für Mathematik (Analysis), RWTH Aachen Convexity, probability and discrete structures, a geometric viewpoint Marne-la-Vallée,
More informationLecture Notes 9: Constrained Optimization
Optimization-based data analysis Fall 017 Lecture Notes 9: Constrained Optimization 1 Compressed sensing 1.1 Underdetermined linear inverse problems Linear inverse problems model measurements of the form
More informationThe Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA
The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA Presented by Dongjun Chung March 12, 2010 Introduction Definition Oracle Properties Computations Relationship: Nonnegative Garrote Extensions:
More informationShifting Inequality and Recovery of Sparse Signals
Shifting Inequality and Recovery of Sparse Signals T. Tony Cai Lie Wang and Guangwu Xu Astract In this paper we present a concise and coherent analysis of the constrained l 1 minimization method for stale
More informationSparse Graph Learning via Markov Random Fields
Sparse Graph Learning via Markov Random Fields Xin Sui, Shao Tang Sep 23, 2016 Xin Sui, Shao Tang Sparse Graph Learning via Markov Random Fields Sep 23, 2016 1 / 36 Outline 1 Introduction to graph learning
More informationStepwise Searching for Feature Variables in High-Dimensional Linear Regression
Stepwise Searching for Feature Variables in High-Dimensional Linear Regression Qiwei Yao Department of Statistics, London School of Economics q.yao@lse.ac.uk Joint work with: Hongzhi An, Chinese Academy
More information19.1 Problem setup: Sparse linear regression
ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 19: Minimax rates for sparse linear regression Lecturer: Yihong Wu Scribe: Subhadeep Paul, April 13/14, 2016 In
More information1.1 Basis of Statistical Decision Theory
ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 1: Introduction Lecturer: Yihong Wu Scribe: AmirEmad Ghassami, Jan 21, 2016 [Ed. Jan 31] Outline: Introduction of
More informationSparse and Robust Optimization and Applications
Sparse and and Statistical Learning Workshop Les Houches, 2013 Robust Laurent El Ghaoui with Mert Pilanci, Anh Pham EECS Dept., UC Berkeley January 7, 2013 1 / 36 Outline Sparse Sparse Sparse Probability
More informationThresholds for the Recovery of Sparse Solutions via L1 Minimization
Thresholds for the Recovery of Sparse Solutions via L Minimization David L. Donoho Department of Statistics Stanford University 39 Serra Mall, Sequoia Hall Stanford, CA 9435-465 Email: donoho@stanford.edu
More informationIntroduction to Compressed Sensing
Introduction to Compressed Sensing Alejandro Parada, Gonzalo Arce University of Delaware August 25, 2016 Motivation: Classical Sampling 1 Motivation: Classical Sampling Issues Some applications Radar Spectral
More informationRandom projections. 1 Introduction. 2 Dimensionality reduction. Lecture notes 5 February 29, 2016
Lecture notes 5 February 9, 016 1 Introduction Random projections Random projections are a useful tool in the analysis and processing of high-dimensional data. We will analyze two applications that use
More informationLecture: Introduction to Compressed Sensing Sparse Recovery Guarantees
Lecture: Introduction to Compressed Sensing Sparse Recovery Guarantees http://bicmr.pku.edu.cn/~wenzw/bigdata2018.html Acknowledgement: this slides is based on Prof. Emmanuel Candes and Prof. Wotao Yin
More informationGreedy Signal Recovery and Uniform Uncertainty Principles
Greedy Signal Recovery and Uniform Uncertainty Principles SPIE - IE 2008 Deanna Needell Joint work with Roman Vershynin UC Davis, January 2008 Greedy Signal Recovery and Uniform Uncertainty Principles
More informationInterpolation via weighted l 1 minimization
Interpolation via weighted l 1 minimization Rachel Ward University of Texas at Austin December 12, 2014 Joint work with Holger Rauhut (Aachen University) Function interpolation Given a function f : D C
More informationConstructing Explicit RIP Matrices and the Square-Root Bottleneck
Constructing Explicit RIP Matrices and the Square-Root Bottleneck Ryan Cinoman July 18, 2018 Ryan Cinoman Constructing Explicit RIP Matrices July 18, 2018 1 / 36 Outline 1 Introduction 2 Restricted Isometry
More informationHigh-dimensional Covariance Estimation Based On Gaussian Graphical Models
Journal of Machine Learning Research 12 (211) 2975-326 Submitted 9/1; Revised 6/11; Published 1/11 High-dimensional Covariance Estimation Based On Gaussian Graphical Models Shuheng Zhou Department of Statistics
More informationNon-convex Robust PCA: Provable Bounds
Non-convex Robust PCA: Provable Bounds Anima Anandkumar U.C. Irvine Joint work with Praneeth Netrapalli, U.N. Niranjan, Prateek Jain and Sujay Sanghavi. Learning with Big Data High Dimensional Regime Missing
More informationMathematical Methods for Data Analysis
Mathematical Methods for Data Analysis Massimiliano Pontil Istituto Italiano di Tecnologia and Department of Computer Science University College London Massimiliano Pontil Mathematical Methods for Data
More informationHierarchical kernel learning
Hierarchical kernel learning Francis Bach Willow project, INRIA - Ecole Normale Supérieure May 2010 Outline Supervised learning and regularization Kernel methods vs. sparse methods MKL: Multiple kernel
More informationMethods for sparse analysis of high-dimensional data, II
Methods for sparse analysis of high-dimensional data, II Rachel Ward May 26, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 55 High dimensional
More informationAnalysis of Greedy Algorithms
Analysis of Greedy Algorithms Jiahui Shen Florida State University Oct.26th Outline Introduction Regularity condition Analysis on orthogonal matching pursuit Analysis on forward-backward greedy algorithm
More informationPosterior convergence rates for estimating large precision. matrices using graphical models
Biometrika (2013), xx, x, pp. 1 27 C 2007 Biometrika Trust Printed in Great Britain Posterior convergence rates for estimating large precision matrices using graphical models BY SAYANTAN BANERJEE Department
More information