Composite Loss Functions and Multivariate Regression; Sparse PCA
1 Composite Loss Functions and Multivariate Regression; Sparse PCA

References:
- G. Obozinski, B. Taskar, and M. I. Jordan (2009). Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, to appear.
- G. Obozinski, M. J. Wainwright, and M. I. Jordan (2009). Union support recovery in multivariate regression. Annals of Statistics, under review.
- A. Amini and M. J. Wainwright (2009). High-dimensional analysis of semidefinite relaxations for sparse PCA. Annals of Statistics, to appear.
2 Introduction

- classical asymptotic theory of statistical inference: number of observations n → +∞, model dimension p stays fixed
- not suitable for many modern applications, where { images, signals, systems, networks } are frequently large (p → ∞)
- interesting consequences: might have p = Θ(n) or even p ≫ n
- curse of dimensionality: frequently impossible to obtain consistent procedures unless p/n → 0
- can be saved by a lower effective dimensionality, due to some form of complexity constraint
3 Example: Sparse linear regression

[Figure: observation model y = Xβ + w, with the rows of β partitioned into support S and complement S^c.]

- vector β ∈ R^p with at most k ≪ p non-zero entries
- observation model: noisy linear observations y = Xβ + w, where X ∈ R^{n×p} is the design matrix and w ∈ R^{n×1} is the noise vector
- various applications (database sketching, imaging, genetic testing, ...)
4 Example: Graphical model selection

- consider an m-dimensional random vector Z = (Z_1, ..., Z_m):
    P(Z_1, ..., Z_m; β) ∝ exp( Σ_{(i,j) ∈ E} β_ij Z_i Z_j )
- given n independent and identically distributed (i.i.d.) samples of Z, identify the underlying graph G = (V, E)
- lower effective dimensionality: graphs with k ≪ p := C(m, 2) edges
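To make the model concrete, here is a brute-force sanity check that the exponential family above defines a valid distribution once normalized. The 3-node chain and its edge weights are hypothetical, chosen only for illustration:

```python
import numpy as np
from itertools import product

# Ising model P(z) proportional to exp( sum_{(i,j) in E} beta_ij z_i z_j )
# on a tiny 3-node chain 0-1-2 with spins z_i in {-1, +1}.
edges = {(0, 1): 0.8, (1, 2): -0.5}   # hypothetical edge weights; (0, 2) absent

def score(z):
    """Unnormalized log-probability of a spin configuration z."""
    return sum(b * z[i] * z[j] for (i, j), b in edges.items())

states = list(product([-1, 1], repeat=3))
Z_norm = sum(np.exp(score(z)) for z in states)      # partition function
probs = {z: np.exp(score(z)) / Z_norm for z in states}
print(abs(sum(probs.values()) - 1.0) < 1e-12)       # normalized to 1
```

Even at this toy size, the partition function sums over 2^m states, which is one reason graph selection methods work with the structure of E rather than exact normalization.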
5 Example: Sparse principal components analysis

[Figure: decomposition Σ = ZZ^T + D.]

Set-up: covariance matrix Σ = ZZ^T + D, where the leading eigenspace Z has sparse columns.
Goal: produce an estimate Ẑ based on samples X(i) with covariance matrix Σ.
6 Some issues in high-dimensional inference

Consider some fixed loss function, and a fixed level δ of error.
- Given particular (polynomial-time) algorithms, for what sample sizes n do they succeed/fail to achieve error δ?
- When does more computation reduce the minimum number of samples needed?
7 Outline

1. Multivariate regression in high dimensions
   (a) Practical limitations: scaling laws for second-order cone programs
   (b) SOCP vs. Lasso: when does more computation reduce statistical error?
2. Sparse principal component analysis in high dimensions
   (a) Thresholding methods
   (b) Semidefinite programming
8 Optimization-based estimators in (sparse) regression

Regularized QP:
    β̂ ∈ arg min_{β ∈ R^p} { (1/2n) ||y − Xβ||_2^2 + ρ_n R(β) },
where the first term is the data term and R(β) is the regularizer.

- R(β) = ||β||_2^2: ridge regression (Tik43, HoeKen70)
- R(β) = ||β||_1: convex, ℓ1-constrained QP (CheDonSau96; Tibs96)
- R(β) = ||β||_0: subset selection, combinatorial and NP-hard (Nat95)
- R(β) = ||β||_a, a ∈ (0, 1): non-convex ℓ_a regularization
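For the ℓ1 case, the regularized QP can be solved by proximal gradient descent (ISTA), whose prox step is elementwise soft-thresholding. A minimal numpy sketch on hypothetical toy data follows; the problem sizes and the choice ρ_n = 0.1 are illustrative, not tuned:

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft-thresholding: the prox operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, rho, n_iter=500):
    """Minimize (1/2n)||y - X b||_2^2 + rho ||b||_1 by proximal gradient."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - grad / L, rho / L)
    return beta

# Toy instance: k = 2 non-zero entries out of p = 20, mild noise.
rng = np.random.default_rng(0)
n, p = 100, 20
beta_star = np.zeros(p)
beta_star[:2] = [3.0, -2.0]
X = rng.standard_normal((n, p))
y = X @ beta_star + 0.1 * rng.standard_normal(n)
beta_hat = lasso_ista(X, y, rho=0.1)
print(np.nonzero(np.abs(beta_hat) > 1e-3)[0])   # estimated support
```

The soft-thresholding step is what produces exact zeros in β̂, which is why the ℓ1 estimator is useful for the support-recovery criterion discussed next.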
9 Different loss functions

Given an estimate β̂, how do we assess its performance?

1. Predictive loss: compute expected error E[||ỹ − Xβ̂||_2^2]
   - goal is to construct a model with good predictive power
   - β̂ itself of secondary interest (need not be uniquely determined)
2. ℓ2-loss: E[||β̂ − β*||_2^2]
   - appropriate when β* itself is of primary interest (signal recovery, compressed sensing, denoising, etc.)
3. Support recovery criterion: define the estimated support Ŝ(β̂) = { i = 1, ..., p | β̂_i ≠ 0 }, and measure the probability P[Ŝ(β̂) ≠ S(β*)]
   - useful for feature selection, dimensionality reduction, model selection
   - can be used as a pre-processing step for estimation in ℓ2-norm
10 1. Multivariate regression in high dimensions

[Figure: observation model Y = XB* + W, with the rows of the p × r matrix B* partitioned into support S and complement S^c.]

- signal B* is a p × r matrix, partitioned into non-zero rows S and zero rows S^c
- observe n noisy projections Y = XB* + W, defined via design matrix X ∈ R^{n×p} and noise matrix W ∈ R^{n×r}
- matrix Y ∈ R^{n×r} of observations
- high-dimensional scaling: allow parameters (n, p, r, |S|) to scale
11 Block regularization and second-order cone programs
(Obozinski, Taskar & Jordan, 2009)

For a fixed parameter q ∈ [1, ∞], estimate B* via:
    B̂ ∈ arg min_{B ∈ R^{p×r}} { (1/2n) ||Y − XB||_F^2 + ρ_n ||B||_{1,q} },
where the data term is ||Y − XB||_F^2 = Σ_{j=1}^n Σ_{l=1}^r (Y_jl − (XB)_jl)^2, the block regularizer is ||B||_{1,q} = Σ_{i=1}^p ||(B_i1, ..., B_ir)||_q, and the regularization constant ρ_n > 0 is to be chosen by the user.

Different cases:
- q = 1: elementwise ℓ1 norm (constrained QP)
- q = 2: second-order cone program (SOCP)
- q = ∞: block ℓ1/ℓ∞ max-norm (constrained QP)

In all cases, efficiently solvable (e.g., by interior point methods).

Generalization of the Lasso (Tibshirani, 1996; Chen et al., 1998); special case of the CAP family (Zhao, Rocha & Yu, 2006); see also (Turlach et al., 2005; Yuan & Lin, 2006; Nardi & Rinaldo, 2008).
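The three block norms are easy to compute directly; a small numpy sketch (the matrix values are arbitrary, for illustration):

```python
import numpy as np

def block_norm(B, q):
    """||B||_{1,q}: sum over the p rows of the lq-norm of each length-r row."""
    return np.linalg.norm(B, ord=q, axis=1).sum()

B = np.array([[1.0, -1.0],
              [0.0,  0.0],
              [3.0,  4.0]])
print(block_norm(B, 1))        # q = 1, elementwise l1: 9.0
print(block_norm(B, 2))        # q = 2: sqrt(2) + 0 + 5
print(block_norm(B, np.inf))   # q = inf: 1 + 0 + 4 = 5.0
```

Note that for q = 2 and q = ∞ a row is charged essentially once no matter how many columns use it, which is what couples the columns' supports.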
12 Two strategies

Goal: model selection consistency, i.e., recover the union of supports
    S(B*) := { i ∈ {1, 2, ..., p} | ||(B*_i1, ..., B*_ir)||_2 ≠ 0 }.

Lasso-based recovery:
1. Solve a separate Lasso (ℓ1-constrained QP) for each column l = 1, ..., r, yielding column vector β̂_l ∈ R^p.
2. Estimate row support Ŝ_Lasso = { i ∈ {1, 2, ..., p} | β̂_il ≠ 0 for some l }.

SOCP-based recovery:
1. Solve a single SOCP, obtaining matrix estimate B̂ ∈ R^{p×r}.
2. Estimate support Ŝ_SOCP = { i ∈ {1, ..., p} | ||(B̂_i1, ..., B̂_ir)||_2 ≠ 0 }.

Trade-offs:
- Lasso (QP) cheap to solve, but the method ignores coupling among columns
- SOCP more expensive, but the block regularizer is better tailored to matrix structure
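The two support estimators differ only in how the row pattern is read off the estimate; a sketch with a hypothetical 3 × 2 estimate:

```python
import numpy as np

def union_support(B_cols, tol=1e-8):
    """S_Lasso: union over columns l of the support of column estimate l."""
    return set(np.nonzero(np.any(np.abs(B_cols) > tol, axis=1))[0])

def row_support(B_hat, tol=1e-8):
    """S_SOCP: rows of the joint estimate with non-zero l2 norm."""
    return set(np.nonzero(np.linalg.norm(B_hat, axis=1) > tol)[0])

B = np.array([[0.5, 0.0],
              [0.0, 0.0],
              [0.0, 2.0]])
print(union_support(B))   # {0, 2}
print(row_support(B))     # {0, 2}
```

On the same matrix the two read-outs coincide; the statistical difference lies in how the estimates themselves are produced (separate QPs versus one SOCP).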
13 Scaling law for high-dimensional SOCP recovery
(Obozinski, Wainwright & Jordan, 2009)

SOCP method:
    B̂ ∈ arg min_{B ∈ R^{p×r}} (1/2n) ||Y − XB||_F^2 + ρ_n ||B||_{1,2}.

Parameters: problem dimension p; number of non-zero rows k. Design matrix X: i.i.d. rows from a sub-Gaussian distribution, with suitable covariance Σ.

Theorem: If the rescaled sample size
    θ_SOCP(n, p, k, B*) := n / ( Ψ(B*_S; Σ_SS) log(p − k) )
is greater than a critical threshold θ_ℓ(Σ; σ^2), then for suitable ρ_n we have, with probability greater than 1 − 2 exp(−c_2 log k):
(a) the SOCP has a unique solution B̂ such that S(B̂) ⊆ S(B*), and
(b) it includes all rows i with ||B*_i||_2 ≥ c_3 √( max{k, log(p − k)} / n ).
14 Assumptions on design covariance

- support set S = { i | β*_i ≠ 0 }, complement S^c := {1, ..., p} \ S
- random design matrix X ∈ R^{n×p}: rows drawn i.i.d. with covariance Σ, sub-Gaussian; partition Σ into blocks Σ_SS, Σ_SS^c, Σ_S^cS, Σ_S^cS^c

1. Bounded eigenspectrum: λ(Σ_SS) ∈ [C_min, C_max].
2. Mutual incoherence/irrepresentability: there exists ν ∈ (0, 1] such that
    ||Σ_S^cS (Σ_SS)^{-1}||_{∞,∞} ≤ 1 − ν.

Example: if Σ_SS = I, then we require max_{j ∈ S^c} Σ_{i ∈ S} |Σ_ji| ≤ 1 − ν.
15 Order parameter captures threshold (angle 0°)

[Figure: probability of success versus rescaled sample size θ_SOCP(n, p, k, B*) = n / (Ψ(B*_S; Σ_SS) log(p − k)).]

16 Order parameter captures threshold (angle 60°)

[Figure: probability of success versus rescaled sample size θ_SOCP(n, p, k, B*) = n / (Ψ(B*_S; Σ_SS) log(p − k)).]
17 Sparsity overlap function Ψ

- form the gradient matrix Z(B*_S) := ∂||B_S||_{1,2}/∂B_S evaluated at B_S = B*_S, a matrix in R^{k×r}
  (equivalent to renormalizing B*_S to have unit ℓ2-norm rows)
- form the r × r Gram matrix G = Z^T (Σ_SS)^{-1} Z, with entries G_ab = ⟨Z_a, Z_b⟩_{(Σ_SS)^{-1}}
- the sparsity overlap function is the maximum eigenvalue of G:
    Ψ(B*_S; Σ_SS) = ||G||_2
- measures relative alignments of the renormalized columns of B*

Special case: univariate regression (r = 1): Ψ(β*_S; I) = k for any vector β*_S.
18 Concrete examples (k = 4, r = 2)

- Aligned columns: each renormalized row of Z(B*_S) equals (1/√2)(1, 1), so G = [[2, 2], [2, 2]] and Ψ = ||G||_2 = 4.
- Orthogonal columns: the two columns have disjoint supports of size 2, so each row of Z(B*_S) is (±1, 0) or (0, ±1), giving G = [[2, 0], [0, 2]] and Ψ = ||G||_2 = 2.
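These two cases can be checked numerically; a minimal sketch of the sparsity-overlap computation (with Σ_SS = I, as in the example):

```python
import numpy as np

def sparsity_overlap(B_S, Sigma_SS):
    """Psi(B_S; Sigma_SS) = ||Z^T Sigma_SS^{-1} Z||_2, where Z is B_S with
    each row rescaled to unit l2-norm."""
    Z = B_S / np.linalg.norm(B_S, axis=1, keepdims=True)
    G = Z.T @ np.linalg.solve(Sigma_SS, Z)   # r x r Gram matrix
    return np.linalg.norm(G, 2)              # maximum eigenvalue (G is PSD)

k = 4
I = np.eye(k)
aligned = np.ones((k, 2))                       # both columns identical
orthogonal = np.vstack([np.eye(2), np.eye(2)])  # disjoint supports, 2 rows each
print(sparsity_overlap(aligned, I))     # 4.0
print(sparsity_overlap(orthogonal, I))  # 2.0
```

The aligned case attains the worst value Ψ = k, the orthogonal case the best value Ψ = k/r, matching the efficiency comparison on the next slides.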
19 Empirical illustration of sparsity-overlap Ψ

[Figure: three panels comparing the SOCP to the ordinary Lasso (problems solved separately): orthogonal regression (columns Z_1 ⊥ Z_2), intermediate angle (columns at 60°), and aligned regression (columns parallel).]
20 SOCP versus ordinary QP

Corollary: If Σ_SS = I_{k×k}, the SOCP always dominates the ordinary QP, with relative statistical efficiency
    1 ≤ (QP sample size)/(SOCP sample size) = max_{l=1,...,r} [ k_l log(p − k_l) ] / [ Ψ(B*_S; I) log(p − k) ] ≤ r.

- increased statistical efficiency of SOCP depends on orthogonality properties of the rescaled columns of B*_S
- up to a factor 1/r reduction in the number of samples required
- most pessimistic case: no gain for disjoint supports
- SOCP can be worse in some cases (if Σ_SS ≠ I)
21 Proof sketch of sufficient conditions

Direct analysis: given n noisy linear observations of β* ∈ R^p with |S(β*)| = k, the oracle decoder performs the following two steps:
1. For each subset S of size k, solve the quadratic program
    f(S) = min_{β_S ∈ R^k} ||y − X_S β_S||_2^2.
2. Output the subset Ŝ = arg min_{|S|=k} f(S).

- by symmetry of the ensemble, may assume that a fixed subset S is the true support
- for sets U different from the true set S, consider the range of non-overlaps t := |U \ S| ∈ {1, ..., k}
- number of subsets with non-overlap t given by N(t) = C(k, k − t) · C(p − k, t)
22 Error exponents for random projections

- the union bound yields an upper bound on the error probability P[error | S true]:
    Σ_{t=1}^k C(k, k − t) C(p − k, t) P[error on a subset with non-overlap t]
- orthogonal projection Π_U := I_{n×n} − X_U (X_U^T X_U)^{-1} X_U^T
- the optimal decoder chooses U incorrectly over S if and only if
    Δ(U) := ||Π_U (X_{S\U} β*_{S\U} + W)||_2^2 − ||Π_S W||_2^2 < 0,
  comparing the effective noise in U against the effective noise in S
- use large deviations to establish that P[Δ(U) < 0] ≤ exp(−n F(||β*_{S\U}||_2; t))
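The projection Π_U used above can be formed explicitly; a minimal numpy check of its two defining properties (it annihilates the columns of X_U and is idempotent), on arbitrary random data:

```python
import numpy as np

def residual_projection(X_U):
    """Pi_U = I - X_U (X_U^T X_U)^{-1} X_U^T: orthogonal projection onto the
    complement of the column span of X_U."""
    n = X_U.shape[0]
    return np.eye(n) - X_U @ np.linalg.solve(X_U.T @ X_U, X_U.T)

rng = np.random.default_rng(0)
X_U = rng.standard_normal((10, 3))
P = residual_projection(X_U)
print(np.allclose(P @ X_U, 0))   # annihilates the columns of X_U
print(np.allclose(P @ P, P))     # idempotent
```

Because Π_U X_U = 0, the residual f(U) = ||Π_U y||_2^2 retains only the part of the signal outside U plus projected noise, which is exactly the decomposition used in Δ(U).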
23 Proof sketch of necessary conditions

Fano's inequality applied to a restricted ensemble, assuming a fixed choice of β*:
    β_i[U] = β_min if i ∈ U, and 0 otherwise.

- by Fano's inequality, the probability of success is upper bounded as
    1 − P[error] ≤ ( I(Y; β*) + o(1) ) / log(M − 1),
  where I(Y; β*) is the mutual information between β* and the observation vector Y, and M = C(p, k) is the number of competing models
- some work to establish that the following upper bound holds w.h.p. over X:
    I(Y; β* | X) ≤ (n/2) log( 1 + (1 − k/p) k β_min^2 )
24 2. High-dimensional analysis of sparse PCA

- principal components analysis (PCA): classical method for dimensionality reduction
- high-dimensional version: eigenvectors from sample covariance Σ̂ based on n samples in p dimensions
- in general, high-dimensional PCA inconsistent unless p/n → 0 (Joh01, JohLu04)
- natural to investigate more structured ensembles for which consistency is still possible even with p/n → +∞:
  - sparse eigenvector recovery (JolEtal03, JohLu04, ZouEtAl06)
  - sparse covariance matrices (LevBic06, ElKar07)
25 Spiked covariance ensembles

- sequences {Σ_p} of spiked population covariance matrices:
    Σ_p = Σ_{i=1}^M α_i β_i β_i^T + Γ_p,
  with leading eigenvectors (β_i, i = 1, ..., M)
- past work on identity spiked ensembles (Γ_p = I_p) (Joh01; JohLu04)
- different sparsity models:
  - hard sparsity model: β* has exactly k non-zero coefficients
  - weak ℓq-sparsity: β* belongs to the ℓq-ball B_q(R_q) = { z ∈ R^p | Σ_{i=1}^p |z_i|^q ≤ R_q }
- given n i.i.d. samples {X_i}_{i=1}^n with E[X_i] = 0 and cov(X_i) = Σ_p
26 SDP relaxation of sparse PCA
(d'Aspremont, El Ghaoui, Jordan & Lanckriet, 2006)

- Courant–Fischer variational principle for the maximum eigenvalue/eigenvector (PCA):
    λ_max(Q) = max_{||z||_2 = 1} z^T Q z
- equivalent/exact semidefinite program (SDP) for the maximum eigenvector:
    λ_max(Q) = max_{Z ⪰ 0, trace(Z) = 1} trace(ZQ)
- SDP relaxation of sparse PCA:
    Ẑ = arg max_{Z ⪰ 0, trace(Z) = 1} { trace(ZQ) − ρ_n Σ_{i,j} |Z_ij| },
  with regularization parameter ρ_n > 0 chosen by the user
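The exactness of the unrelaxed SDP (ρ_n = 0) is easy to verify numerically: the feasible point Z = vv^T built from the leading eigenvector attains trace(ZQ) = λ_max(Q). A sketch on an arbitrary symmetric matrix:

```python
import numpy as np

# Check the exact SDP characterization lambda_max(Q) = max over PSD Z with
# trace(Z) = 1 of trace(ZQ): the maximum is attained at Z = v v^T for the
# leading eigenvector v, since trace(v v^T Q) = v^T Q v.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
Q = A + A.T                      # symmetric test matrix
evals, evecs = np.linalg.eigh(Q)
v = evecs[:, -1]                 # leading (unit-norm) eigenvector
Z = np.outer(v, v)               # feasible: PSD, trace(Z) = ||v||^2 = 1
print(np.allclose(np.trace(Z @ Q), evals[-1]))
```

The ℓ1 penalty ρ_n Σ|Z_ij| then biases the maximizer toward matrices with few non-zero entries, which is the relaxation of eigenvector sparsity.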
27 Rates in spectral norm

- given n samples from the spiked identity model Σ_p = α zz^T + σ^2 I_p
- eigenvector z in the weak ℓq-ball B_q(R_q)
- SDP relaxation: Ẑ ∈ arg max_{Z ⪰ 0, trace(Z)=1} { trace(Z Σ̂) − ρ_n Σ_{i,j} |Z_ij| }

Theorem (AmiWai08b): Suppose that we apply the SDP to the sample covariance Σ̂ with regularization parameter ρ_n = f(α, σ^2) √(log p / n). Then with probability greater than 1 − c_1 exp(−c_2 log p) → 1, we have:
    ||Ẑ − zz^T||_2 ≤ C R_q ( log p / n )^{ 1/(2(1+q)) }.

Example (hard sparsity): q = 0 and radius R_q = k (number of non-zeros):
    ||Ẑ − zz^T||_2 ≤ C √( k^2 log p / n ).
28 Comparison to some known results

Estimating sparse covariance matrices (BicLev07): the thresholding estimator T_{λn}(Σ̂) achieves rate
    ||T_{λn}(Σ̂) − Σ||_2 ≤ C R_q ( log p / n )^{ (1−q)/2 }.
- by matrix perturbation results, for well-separated eigenvalues, the same rate applies to the leading eigenvector
- agrees with the SDP result for q = 0, but slower rate for q > 0

Minimax rates for q ∈ (0, 2) (PauJoh08): with the sign convention ⟨ẑ, z⟩ ≥ 0,
    min_ẑ max_{z ∈ B_q(R_q)} E[||ẑ − z||_2^2] ≍ C R_q ( log p / n )^{ 1 − q/2 }.
- same rate as the normal sequence model (DonJoh94)
- SDP rate is slower, but approaches the minimax rate as q → 0
29 Model selection consistency for hard sparsity (q = 0)

Goal: given a spiked model with k-sparse eigenvector (z_i = ±1/√k), recover the support set
    S(z) = { i ∈ {1, 2, ..., p} | z_i ≠ 0 }
exactly.

Methods:
1. Diagonal thresholding: complexity O(np + p log p) (JohLu04)
   (a) Form the sample covariance Σ̂ = (1/n) Σ_{i=1}^n X_i X_i^T.
   (b) Extract the top k order statistics Σ̂_(11), ..., Σ̂_(kk) of the diagonal, and estimate the support Ŝ(D) by their rank indices.
2. SDP-based recovery: complexity O(np + p^4 log p) (AspLanGhaJor08)
   (a) Solve the SDP with ρ_n = α/(2σ^2 k).
   (b) Given the solution Ẑ, estimate the support Ŝ := { i ∈ {1, ..., p} | Ẑ_ij ≠ 0 for some j }.
30 Sharp threshold for diagonal thresholding

Model: Σ_p = α zz^T + σ^2 I_p.
Parameters: p, the model dimension; k, the number of non-zeroes in the spiked eigenvector.

Proposition (AmiWai08a): If k = O(p^{1−δ}) for any δ ∈ (0, 1), diagonal thresholding for support recovery is controlled by the rescaled sample size
    θ_thr(n, p, k) := n / ( k^2 log(p − k) ).
That is, there are constants 0 < τ_ℓ(α, σ^2) ≤ τ_u(α, σ^2) < ∞ such that
(a) Success: if n > τ_u k^2 log(p − k), then
    P[Ŝ(D) = S(z)] ≥ 1 − c_1 exp(−c_2 k^2 log(p − k)) → 1.
(b) Failure: if n ≤ τ_ℓ k^2 log(p − k), then
    P[Ŝ(D) = S(z)] ≤ c_1 exp(−c_2 log(p − k)) → 0.
31 Performance of diagonal thresholding

[Figure: two panels, (a) logarithmic sparsity k = O(log p) and (b) square-root sparsity k = O(√p): probability of success P[Ŝ(D) = S(z)] versus the rescaled sample size θ_thr(n, p, k) = n / (k^2 log(p − k)), for several problem sizes p (100, 300, 600, ...).]
32 Eigenvector support recovery via SDP relaxation

- spiked identity model Σ_p = α zz^T + σ^2 I_p with k-sparse eigenvector z
- SDP relaxation: Ẑ ∈ arg max_{Z ⪰ 0, trace(Z)=1} trace(Z Σ̂) − ρ_n Σ_{i,j} |Z_ij|

Theorem (AmiWai08a): Suppose that we solve the SDP with ρ_n = α/(2σ^2 k). Then there are constants θ_wr and θ_crit such that
(a) for sample sizes such that θ_thr(n, p, k) = n / (k^2 log(p − k)) > θ_wr, the SDP has a rank-one solution w.h.p., and
(b) for problem sequences such that k = O(log p) and
    θ_sdp(n, p, k) := n / (k log(p − k)) > θ_crit,
a rank-one solution (when it exists) specifies the correct support w.h.p.

Remark: the technical condition k = O(log p) is likely an artifact.
33 Performance of SDP relaxation

[Figure: two panels, (a) logarithmic sparsity k = O(log p) and (b) linear sparsity k = 0.1p: probability of success P[Ŝ = S(z)] versus the rescaled sample size θ_sdp(n, p, k) = n / (k log(p − k)), for p ∈ {100, 200, 300}.]
34 Summary and open directions

1. When does more computation yield greater statistical accuracy?
   - Multivariate regression: second-order cone programming versus quadratic programming (Lasso)
   - Sparse PCA: diagonal thresholding versus SDP relaxation
2. When are polynomial-time algorithms as good as optimal algorithms?
   - Multivariate regression: Lasso/SOCP order-optimal for k = o(p)
   - Sparse PCA: SDP relaxation order-optimal for k = O(log p)
More informationESL Chap3. Some extensions of lasso
ESL Chap3 Some extensions of lasso 1 Outline Consistency of lasso for model selection Adaptive lasso Elastic net Group lasso 2 Consistency of lasso for model selection A number of authors have studied
More informationWithin Group Variable Selection through the Exclusive Lasso
Within Group Variable Selection through the Exclusive Lasso arxiv:1505.07517v1 [stat.me] 28 May 2015 Frederick Campbell Department of Statistics, Rice University and Genevera Allen Department of Statistics,
More informationAn Introduction to Sparse Approximation
An Introduction to Sparse Approximation Anna C. Gilbert Department of Mathematics University of Michigan Basic image/signal/data compression: transform coding Approximate signals sparsely Compress images,
More informationLearning gradients: prescriptive models
Department of Statistical Science Institute for Genome Sciences & Policy Department of Computer Science Duke University May 11, 2007 Relevant papers Learning Coordinate Covariances via Gradients. Sayan
More informationProperties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation
Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana
More informationA direct formulation for sparse PCA using semidefinite programming
A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/~aspremon
More informationChapter 3 Transformations
Chapter 3 Transformations An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Linear Transformations A function is called a linear transformation if 1. for every and 2. for every If we fix the bases
More informationA direct formulation for sparse PCA using semidefinite programming
A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley A. d Aspremont, INFORMS, Denver,
More informationSTATS 306B: Unsupervised Learning Spring Lecture 13 May 12
STATS 306B: Unsupervised Learning Spring 2014 Lecture 13 May 12 Lecturer: Lester Mackey Scribe: Jessy Hwang, Minzhe Wang 13.1 Canonical correlation analysis 13.1.1 Recap CCA is a linear dimensionality
More informationCompressed Sensing and Neural Networks
and Jan Vybíral (Charles University & Czech Technical University Prague, Czech Republic) NOMAD Summer Berlin, September 25-29, 2017 1 / 31 Outline Lasso & Introduction Notation Training the network Applications
More informationESTIMATION OF (NEAR) LOW-RANK MATRICES WITH NOISE AND HIGH-DIMENSIONAL SCALING
Submitted to the Annals of Statistics ESTIMATION OF (NEAR) LOW-RANK MATRICES WITH NOISE AND HIGH-DIMENSIONAL SCALING By Sahand Negahban and Martin J. Wainwright University of California, Berkeley We study
More informationA lower bound for the Laplacian eigenvalues of a graph proof of a conjecture by Guo
A lower bound for the Laplacian eigenvalues of a graph proof of a conjecture by Guo A. E. Brouwer & W. H. Haemers 2008-02-28 Abstract We show that if µ j is the j-th largest Laplacian eigenvalue, and d
More informationInformation-theoretically Optimal Sparse PCA
Information-theoretically Optimal Sparse PCA Yash Deshpande Department of Electrical Engineering Stanford, CA. Andrea Montanari Departments of Electrical Engineering and Statistics Stanford, CA. Abstract
More informationSketching for Large-Scale Learning of Mixture Models
Sketching for Large-Scale Learning of Mixture Models Nicolas Keriven Université Rennes 1, Inria Rennes Bretagne-atlantique Adv. Rémi Gribonval Outline Introduction Practical Approach Results Theoretical
More informationIntroduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin
1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)
More informationGaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature
More informationRobust Sparse Principal Component Regression under the High Dimensional Elliptical Model
Robust Sparse Principal Component Regression under the High Dimensional Elliptical Model Fang Han Department of Biostatistics Johns Hopkins University Baltimore, MD 21210 fhan@jhsph.edu Han Liu Department
More informationHigh Dimensional Covariance and Precision Matrix Estimation
High Dimensional Covariance and Precision Matrix Estimation Wei Wang Washington University in St. Louis Thursday 23 rd February, 2017 Wei Wang (Washington University in St. Louis) High Dimensional Covariance
More informationSpectral Graph Theory Lecture 3. Fundamental Graphs. Daniel A. Spielman September 5, 2018
Spectral Graph Theory Lecture 3 Fundamental Graphs Daniel A. Spielman September 5, 2018 3.1 Overview We will bound and derive the eigenvalues of the Laplacian matrices of some fundamental graphs, including
More informationIntroduction to Compressed Sensing
Introduction to Compressed Sensing Alejandro Parada, Gonzalo Arce University of Delaware August 25, 2016 Motivation: Classical Sampling 1 Motivation: Classical Sampling Issues Some applications Radar Spectral
More informationECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis
ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis Lecture 7: Matrix completion Yuejie Chi The Ohio State University Page 1 Reference Guaranteed Minimum-Rank Solutions of Linear
More informationL 0 methods. H.J. Kappen Donders Institute for Neuroscience Radboud University, Nijmegen, the Netherlands. December 5, 2011.
L methods H.J. Kappen Donders Institute for Neuroscience Radboud University, Nijmegen, the Netherlands December 5, 2 Bert Kappen Outline George McCullochs model The Variational Garrote Bert Kappen L methods
More informationInformation-Theoretic Methods in Data Science
Information-Theoretic Methods in Data Science Information-theoretic bounds on sketching Mert Pilanci Department of Electrical Engineering Stanford University Contents Information-theoretic bounds on sketching
More informationhttps://goo.gl/kfxweg KYOTO UNIVERSITY Statistical Machine Learning Theory Sparsity Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT OF INTELLIGENCE SCIENCE AND TECHNOLOGY 1 KYOTO UNIVERSITY Topics:
More informationGeneralized Power Method for Sparse Principal Component Analysis
Generalized Power Method for Sparse Principal Component Analysis Peter Richtárik CORE/INMA Catholic University of Louvain Belgium VOCAL 2008, Veszprém, Hungary CORE Discussion Paper #2008/70 joint work
More informationDimension reduction for semidefinite programming
1 / 22 Dimension reduction for semidefinite programming Pablo A. Parrilo Laboratory for Information and Decision Systems Electrical Engineering and Computer Science Massachusetts Institute of Technology
More informationhigh-dimensional inference robust to the lack of model sparsity
high-dimensional inference robust to the lack of model sparsity Jelena Bradic (joint with a PhD student Yinchu Zhu) www.jelenabradic.net Assistant Professor Department of Mathematics University of California,
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationApproximate Principal Components Analysis of Large Data Sets
Approximate Principal Components Analysis of Large Data Sets Daniel J. McDonald Department of Statistics Indiana University mypage.iu.edu/ dajmcdon April 27, 2016 Approximation-Regularization for Analysis
More informationBasic Concepts in Matrix Algebra
Basic Concepts in Matrix Algebra An column array of p elements is called a vector of dimension p and is written as x p 1 = x 1 x 2. x p. The transpose of the column vector x p 1 is row vector x = [x 1
More informationInformation-Theoretic Limits of Group Testing: Phase Transitions, Noisy Tests, and Partial Recovery
Information-Theoretic Limits of Group Testing: Phase Transitions, Noisy Tests, and Partial Recovery Jonathan Scarlett jonathan.scarlett@epfl.ch Laboratory for Information and Inference Systems (LIONS)
More informationCS281 Section 4: Factor Analysis and PCA
CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we
More informationGraph Partitioning Using Random Walks
Graph Partitioning Using Random Walks A Convex Optimization Perspective Lorenzo Orecchia Computer Science Why Spectral Algorithms for Graph Problems in practice? Simple to implement Can exploit very efficient
More information