A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers


Sahand Negahban, Department of EECS, UC Berkeley
Martin J. Wainwright, Department of Statistics and Department of EECS, UC Berkeley
Pradeep Ravikumar, Department of Computer Sciences, UT Austin
Bin Yu, Department of Statistics and Department of EECS, UC Berkeley

Abstract

High-dimensional statistical inference deals with models in which the number of parameters $p$ is comparable to or larger than the sample size $n$. Since it is usually impossible to obtain consistent procedures unless $p/n \to 0$, a line of recent work has studied models with various types of structure (e.g., sparse vectors; block-structured matrices; low-rank matrices; Markov assumptions). In such settings, a general approach to estimation is to solve a regularized convex program (known as a regularized M-estimator) which combines a loss function (measuring goodness-of-fit of the models to the data) with some regularization function that encourages the assumed structure. The goal of this paper is to provide a unified framework for establishing consistency and convergence rates for such regularized M-estimation procedures under high-dimensional scaling. We state one main theorem and show how it can be used to re-derive several existing results, and also to obtain several new results on consistency and convergence rates. Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure the corresponding regularized M-estimators have fast convergence rates.

1 Introduction

In many fields of science and engineering, such as genomics and natural language processing, it is of great interest to relate predictor variables (e.g., gene levels) to a response variable (e.g., cancer status). Due to the exploding size of problems, we often find ourselves in the "large $p$, small $n$" regime, that is, the number of predictor variables $p$ is comparable to or even larger than the number of observations $n$. For such high-dimensional data, successful statistical modeling is possible only if the data follow models with restrictions. For instance, the data might be sparse in a suitably chosen basis, could lie on some manifold, or the dependencies among the variables might have Markov structure specified by a graphical model. In such settings, a common approach is to use regularized M-estimators, where some loss function (e.g., the negative log-likelihood of the data) is regularized by a function appropriate to the assumed structure. Such estimators may also be interpreted from a Bayesian perspective as the maximum a posteriori (MAP) estimator, with the regularizer reflecting prior information. In this paper, we study such regularized M-estimation procedures, and attempt to provide a unifying framework that both

recovers some existing results and provides new results on consistency and convergence rates under high-dimensional scaling. As an illustration of the applications of our analysis, we work with three running examples of constrained parametric structures. The first are sparse models, both where the number of model parameters that are non-zero is small (hard-sparse), or, more generally, where the number of parameters above a certain threshold is limited (weak-sparse). The second are so-called block-sparse models, where the parameters are matrix-structured, and entire rows are either zero or not. Our third class is the estimation of low-rank matrices, which arises in system identification, collaborative filtering, and other types of matrix completion problems.

To motivate the need for a unified analysis, let us provide a brief (and hence necessarily incomplete) overview of the broad range of work on high-dimensional models. For the case of sparse regression, a popular regularizer is the $\ell_1$ norm of the parameter vector, which is the sum of the absolute values of the parameters. A number of researchers have studied the Lasso [15, 3] as well as the closely related Dantzig selector [2] and provided conditions on various aspects of its behavior, including $\ell_2$-error bounds [1, 6, 20, 21] and model selection consistency [21, 19, 15, 16]. For generalized linear models (GLMs) and exponential family models, estimators based on $\ell_1$-regularized maximum likelihood have also been studied, including results on risk consistency [18] and model selection consistency [11]. A body of work has focused on the case of estimating Gaussian graphical models, including convergence rates in Frobenius and operator norm [14], and results on operator norm and model selection consistency [12]. Motivated by inference problems involving block-sparse matrices, other researchers have proposed block-structured regularizers [17, 22], and more recently, high-dimensional consistency results have been obtained for model selection [7, 8] and parameter consistency [4]. In this paper, we derive a single main theorem, and show how we are able to rederive a wide range of known results on high-dimensional consistency, as well as some novel ones, such as estimation error rates for low-rank matrices, sparse matrices, and weakly sparse vectors.

2 Problem formulation and some key properties

In this section, we begin with a precise formulation of the problem, and then develop some key properties of the regularizer and loss function. In particular, we define a notion of decomposability for regularizing functions $r$, and then prove that when it is satisfied, the error $\Delta = \hat{\theta} - \theta^*$ of the regularized M-estimator must satisfy certain constraints. We use these constraints to define a notion of restricted strong convexity that the loss function must satisfy.

2.1 Problem set-up

Consider a random variable $Z$ with distribution $\mathbb{P}$ taking values in a set $\mathcal{Z}$. Let $Z_1^n := \{Z_1, \ldots, Z_n\}$ denote $n$ observations drawn in an i.i.d. manner from $\mathbb{P}$, and suppose $\theta^* \in \mathbb{R}^p$ is some parameter of this distribution. We consider the problem of estimating $\theta^*$ from the data $Z_1^n$. In order to do so, we consider the following class of regularized M-estimators. Let $L : \mathbb{R}^p \times \mathcal{Z}^n \to \mathbb{R}$ be some loss function that assigns a cost to any parameter $\theta \in \mathbb{R}^p$ given a set of $n$ observations. Let $r : \mathbb{R}^p \to \mathbb{R}$ denote a regularization function. We then consider the regularized M-estimator given by

$$\hat{\theta} \in \arg\min_{\theta \in \mathbb{R}^p} \bigl\{ L(\theta; Z_1^n) + \lambda_n\, r(\theta) \bigr\}, \qquad (1)$$

where $\lambda_n > 0$ is a regularization penalty. For ease of notation, in the sequel, we adopt the shorthand $L(\theta)$ for $L(\theta; Z_1^n)$. Throughout the paper, we assume that the loss function $L$ is convex and differentiable, and that the regularizer $r$ is a norm.
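The estimator (1) is stated abstractly; as a concrete illustration (not part of the paper's analysis), the following is a minimal sketch of computing one instance, the Lasso of Section 3.1, by proximal gradient descent. The function names and the step-size rule are our own illustrative choices; any convex solver for (1) would do.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (coordinatewise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_prox_grad(X, y, lam, n_iter=1000):
    """Minimize (1/(2n))||y - X theta||_2^2 + lam * ||theta||_1,
    i.e., problem (1) with the least-squares loss and the l1 regularizer."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the loss gradient
    theta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n   # gradient of the least-squares loss
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta
```

The same template applies to the other running examples by swapping in the appropriate loss and the proximal operator of the chosen regularizer (group soft-thresholding for the $\ell_{1,q}$ norm, singular-value soft-thresholding for the nuclear norm).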
Our goal is to provide general techniques for deriving bounds on the error $\hat{\theta} - \theta^*$ in some error metric $d$. A common example is the $\ell_2$-norm $d(\hat{\theta} - \theta^*) := \|\hat{\theta} - \theta^*\|_2$. As discussed earlier, high-dimensional parameter estimation is made possible by structural constraints on $\theta^*$ such as sparsity, and we will see that the behavior of the error is determined by how well these constraints are captured by the regularization function $r(\cdot)$. We now turn to the properties of the regularizer $r$ and the loss function $L$ that underlie our analysis.

2.2 Decomposability

Our first condition requires that the regularization function $r$ be decomposable, in a sense to be defined precisely, with respect to a family of subspaces. This notion is a formalization of the manner in which the regularization function imposes constraints on possible parameter vectors $\theta^* \in \mathbb{R}^p$. We begin with some abstract definitions, which we then illustrate with a number of concrete examples. Take some arbitrary inner product space $\mathcal{H}$, and let $\|\cdot\|_2$ denote the norm induced by the inner product. Consider a pair $(A, B)$ of subspaces of $\mathcal{H}$ such that $A \subseteq B^\perp$. For a given subspace $A$ and vector $u \in \mathcal{H}$, we let $\pi_A(u) := \arg\min_{v \in A} \|u - v\|_2$ denote the orthogonal projection of $u$ onto $A$. We let $\mathcal{V} = \{(A, B) \mid A \subseteq B^\perp\}$ denote a collection of such subspace pairs. For a given statistical model, our goal is to construct subspace collections $\mathcal{V}$ such that for any given $\theta^*$ from our model class, there exists a pair $(A, B) \in \mathcal{V}$ with $\|\pi_A(\theta^*)\|_2 \approx \|\theta^*\|_2$ and $\|\pi_B(\theta^*)\|_2 \approx 0$. Of most interest to us are subspace pairs $(A, B)$ in which this property holds but the subspace $A$ is relatively small and $B$ is relatively large. Note that $A$ represents the constraints underlying our model class, and imposed by our regularizer. In the remainder of this paper we assume that $\mathcal{H} = \mathbb{R}^p$ and use the standard Euclidean inner product, unless otherwise specified.

As a first concrete (but toy) example, consider the model class of all vectors $\theta^* \in \mathbb{R}^p$, and the subspace collection $\mathcal{T}$ that consists of a single subspace pair $(A, B) = (\mathbb{R}^p, \{0\})$. We refer to this choice ($\mathcal{V} = \mathcal{T}$) as the trivial subspace collection. In this case, for any $\theta^* \in \mathbb{R}^p$, we have $\pi_A(\theta^*) = \theta^*$ and $\pi_B(\theta^*) = 0$. Although this collection satisfies our desired property, it is not so useful since $A = \mathbb{R}^p$ is a very large subspace. As a second example, consider the class of $s$-sparse parameter vectors $\theta^* \in \mathbb{R}^p$, meaning that $\theta^*_i \neq 0$ only if $i \in S$, where $S$ is some $s$-sized subset of $\{1, 2, \ldots, p\}$. For any given subset $S$ and its complement $S^c$, let us define the subspaces $A(S) = \{\theta \in \mathbb{R}^p \mid \theta_{S^c} = 0\}$ and $B(S) = \{\theta \in \mathbb{R}^p \mid \theta_S = 0\}$, and the $s$-sparse subspace collection $\mathcal{S} = \{(A(S), B(S)) \mid S \subseteq \{1, \ldots, p\},\ |S| = s\}$. With this set-up, for any $s$-sparse parameter vector $\theta^*$, we are guaranteed that there exists some $(A, B) \in \mathcal{S}$ such that $\pi_A(\theta^*) = \theta^*$ and $\pi_B(\theta^*) = 0$. In this case, the property is more interesting, since the subspaces $A(S)$ are relatively small as long as $|S| = s \ll p$.

With this set-up, we say that the regularizer $r$ is decomposable with respect to a given subspace pair $(A, B)$ if

$$r(u + z) = r(u) + r(z) \quad \text{for all } u \in A \text{ and } z \in B. \qquad (2)$$

In our subsequent analysis, we impose the following condition on the regularizer:

Definition 1. The regularizer $r$ is decomposable with respect to a given subspace collection $\mathcal{V}$, meaning that it is decomposable for each subspace pair $(A, B) \in \mathcal{V}$.

Note that any regularizer is decomposable with respect to the trivial subspace collection $\mathcal{T} = \{(\mathbb{R}^p, \{0\})\}$. It will be of more interest to us when the regularizer decomposes with respect to a larger collection $\mathcal{V}$ that includes subspace pairs $(A, B)$ in which $A$ is relatively small and $B$ is relatively large. Let us illustrate with some examples.

Sparse vectors and $\ell_1$ norm regularization. Consider a model involving $s$-sparse regression vectors $\theta^* \in \mathbb{R}^p$, and recall the definition of the $s$-sparse subspace collection $\mathcal{S}$ discussed above. We claim that the $\ell_1$-norm regularizer $r(u) = \|u\|_1$ is decomposable with respect to $\mathcal{S}$. Indeed, for any $s$-sized subset $S$ and vectors $u \in A(S)$ and $v \in B(S)$, we have $\|u + v\|_1 = \|u\|_1 + \|v\|_1$, as required.

Group-structured sparse matrices and $\ell_{1,q}$ matrix norms. Various statistical problems involve matrix-valued parameters $\Theta \in \mathbb{R}^{k \times m}$; examples include multivariate regression problems or (inverse) covariance matrix estimation.
We can define an inner product on such matrices via $\langle \Theta, \Sigma \rangle = \mathrm{trace}(\Theta^T \Sigma)$, and the induced (Frobenius) norm $\|\Theta\|_F = \sqrt{\sum_{i=1}^k \sum_{j=1}^m \Theta_{ij}^2}$. Let us suppose that $\Theta^*$ satisfies a group sparsity condition, meaning that its $i$-th row, denoted $\Theta^*_i$, is non-zero only if $i \in S \subseteq \{1, \ldots, k\}$, and the cardinality of $S$ is controlled. For a given subset $S$, we can define the subspace pair $A(S) = \{\Theta \in \mathbb{R}^{k \times m} \mid \Theta_i = 0 \text{ for all } i \in S^c\}$ and $B(S) = (A(S))^\perp = \{\Theta \in \mathbb{R}^{k \times m} \mid \Theta_i = 0 \text{ for all } i \in S\}$. For some fixed $s \leq k$, we then consider the collection $\mathcal{V} = \{(A(S), B(S)) \mid S \subseteq \{1, \ldots, k\},\ |S| = s\}$,

which is a group-structured analog of the $s$-sparse collection $\mathcal{S}$ for vectors. For any $q \in [1, \infty]$, now suppose that the regularizer is the $\ell_1/\ell_q$ matrix norm, given by $r(\Theta) = \sum_{i=1}^k \bigl[\sum_{j=1}^m |\Theta_{ij}|^q\bigr]^{1/q}$, corresponding to applying the $\ell_q$ norm to each row and then taking the $\ell_1$-norm of the result. It can be seen that the regularizer $r(\Theta) = \|\Theta\|_{1,q}$ is decomposable with respect to the collection $\mathcal{V}$.

Low-rank matrices and nuclear norm. The estimation of low-rank matrices arises in various contexts, including principal component analysis, spectral clustering, collaborative filtering, and matrix completion. In particular, consider the class of matrices $\Theta \in \mathbb{R}^{k \times m}$ that have rank $r \ll \min\{k, m\}$. For any given matrix $\Theta$, we let $\mathrm{row}(\Theta) \subseteq \mathbb{R}^m$ and $\mathrm{col}(\Theta) \subseteq \mathbb{R}^k$ denote its row space and column space, respectively. For a given pair of $r$-dimensional subspaces $U \subseteq \mathbb{R}^k$ and $V \subseteq \mathbb{R}^m$, we define a pair of subspaces $A(U, V)$ and $B(U, V)$ of $\mathbb{R}^{k \times m}$ as follows:

$$A(U, V) := \{\Theta \in \mathbb{R}^{k \times m} \mid \mathrm{row}(\Theta) \subseteq V,\ \mathrm{col}(\Theta) \subseteq U\}, \qquad (3a)$$
$$B(U, V) := \{\Theta \in \mathbb{R}^{k \times m} \mid \mathrm{row}(\Theta) \subseteq V^\perp,\ \mathrm{col}(\Theta) \subseteq U^\perp\}. \qquad (3b)$$

Note that $A(U, V) \subseteq B^\perp(U, V)$, as is required by our construction. We then consider the collection $\mathcal{V} = \{(A(U, V), B(U, V))\}$, where $(U, V)$ range over all pairs of $r$-dimensional subspaces of $\mathbb{R}^k$ and $\mathbb{R}^m$. Now suppose that we regularize with the nuclear norm $r(\Theta) = \|\Theta\|_*$, corresponding to the sum of the singular values of the matrix $\Theta$. It can be shown that the nuclear norm is decomposable with respect to $\mathcal{V}$. Indeed, since any pair of matrices $M \in A(U, V)$ and $M' \in B(U, V)$ have orthogonal row and column spaces, we have $\|M + M'\|_* = \|M\|_* + \|M'\|_*$ (e.g., see the paper [13]).

Thus, we have demonstrated various models and regularizers in which decomposability is satisfied with interesting subspace collections $\mathcal{V}$. We now show that decomposability has important consequences for the error $\Delta = \hat{\theta} - \theta^*$, where $\hat{\theta} \in \mathbb{R}^p$ is any optimal solution of the regularized M-estimation procedure (1). In order to state a lemma that captures this fact, we need to define the dual norm of the regularizer, given by $r^*(v) := \sup_{u \in \mathbb{R}^p \setminus \{0\}} \frac{u^T v}{r(u)}$. For the regularizers of interest, the dual norm can be obtained via some easy calculations. For instance, given a vector $\theta \in \mathbb{R}^p$ and $r(\theta) = \|\theta\|_1$, we have $r^*(\theta) = \|\theta\|_\infty$. Similarly, given a matrix $\Theta \in \mathbb{R}^{k \times m}$ and the nuclear norm regularizer $r(\Theta) = \|\Theta\|_*$, we have $r^*(\Theta) = \|\Theta\|_2$, corresponding to the operator norm (or maximal singular value).

Lemma 1. Suppose $\hat{\theta}$ is an optimal solution of the regularized M-estimation procedure (1), with associated error $\Delta = \hat{\theta} - \theta^*$. Furthermore, suppose that the regularization penalty is strictly positive with $\lambda_n \geq 2\, r^*(\nabla L(\theta^*))$. Then for any $(A, B) \in \mathcal{V}$,

$$r(\pi_B(\Delta)) \leq 3\, r(\pi_{B^\perp}(\Delta)) + 4\, r(\pi_{A^\perp}(\theta^*)).$$

This property plays an essential role in our definition of restricted strong convexity and subsequent analysis.

2.3 Restricted Strong Convexity

Next we state our assumption on the loss function $L$. In general, guaranteeing that $L(\hat{\theta}) - L(\theta^*)$ is small is not sufficient to show that $\hat{\theta}$ and $\theta^*$ are close. (As a trivial example, consider a loss function that is identically zero.) The standard way to ensure that a function is not too flat is via the notion of strong convexity; in particular, by requiring that there exist some constant $\gamma > 0$ such that $L(\theta^* + \Delta) - L(\theta^*) - \nabla L(\theta^*)^T \Delta \geq \gamma\, d^2(\Delta)$ for all $\Delta \in \mathbb{R}^p$. In the high-dimensional setting, where the number of parameters $p$ may be much larger than the sample size $n$, the strong convexity assumption need not be satisfied. As a simple example, consider the usual linear regression model $y = X\theta^* + w$, where $y \in \mathbb{R}^n$ is the response vector, $\theta^* \in \mathbb{R}^p$ is the unknown parameter vector, $X \in \mathbb{R}^{n \times p}$ is the design matrix, and $w \in \mathbb{R}^n$ is a noise vector with i.i.d. zero-mean elements. The least-squares loss is given by $L(\theta) = \frac{1}{2n}\|y - X\theta\|_2^2$, and has the Hessian $H(\theta) = \frac{1}{n} X^T X$.
It is easy to check that the $p \times p$ matrix $H(\theta)$ will be rank-deficient whenever $p > n$, showing that the least-squares loss cannot be strongly convex (with respect to $d(\cdot) = \|\cdot\|_2$) when $p > n$. Herein lies the utility of Lemma 1: it guarantees that the error $\Delta$ must lie within a restricted set, so that we only need the loss function to be strongly convex for a limited set of directions.
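A quick numerical illustration of this rank deficiency (our own sketch, with a generic Gaussian design) follows: the Hessian has rank at most $n$, and the loss is exactly flat along any null-space direction of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                          # high-dimensional regime: p > n
X = rng.standard_normal((n, p))
H = X.T @ X / n                         # Hessian of the least-squares loss

print(np.linalg.matrix_rank(H))         # at most n = 50 < p, so H is rank-deficient

_, _, Vt = np.linalg.svd(X)             # rows n..p-1 of Vt span the null space of X
delta = Vt[-1]
print(np.allclose(X @ delta, 0.0))      # True: zero curvature along delta
```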

More precisely, we have:

Definition 2. Given some subset $C \subseteq \mathbb{R}^p$ and error norm $d(\cdot)$, we say that the loss function $L$ satisfies restricted strong convexity (RSC) with respect to $d(\cdot)$ with parameter $\gamma > 0$ over $C$ if

$$L(\theta^* + \Delta) - L(\theta^*) - \nabla L(\theta^*)^T \Delta \geq \gamma\, d^2(\Delta) \quad \text{for all } \Delta \in C. \qquad (4)$$

In the statement of our results, we will be interested in loss functions that satisfy RSC over sets $C(A, B, \epsilon)$ that are indexed by a subspace pair $(A, B)$ and a tolerance $\epsilon \geq 0$ as follows:

$$C(A, B, \epsilon) := \bigl\{\Delta \in \mathbb{R}^p \mid r(\pi_B(\Delta)) \leq 3\, r(\pi_{B^\perp}(\Delta)) + 4\, r(\pi_{A^\perp}(\theta^*)),\ d(\Delta) \geq \epsilon\bigr\}. \qquad (5)$$

In the special case of least-squares regression with hard sparsity constraints, the RSC condition corresponds to a lower bound on the sparse eigenvalues of the Hessian matrix $X^T X$, and is essentially equivalent to a restricted eigenvalue condition introduced by Bickel et al. [1].

3 Convergence rates

We are now ready to state a general result that provides bounds, and hence convergence rates, for the error $d(\hat{\theta} - \theta^*)$. Although it may appear somewhat abstract at first sight, we illustrate that this result has a number of concrete consequences for specific models. In particular, we recover some known results about estimation in $s$-sparse models [1, 6], as well as a number of new results, including convergence rates for estimation under $\ell_q$-sparsity constraints, estimation in sparse generalized linear models, estimation of block-structured sparse matrices, and estimation of low-rank matrices. In addition to the regularization parameter $\lambda_n$ and the RSC constant $\gamma$ of the loss function, our general result involves a quantity that relates the error metric $d$ to the regularizer $r$; in particular, for any set $A \subseteq \mathbb{R}^p$, we define

$$\Psi(A) := \sup_{u \in A,\ d(u) = 1} r(u), \qquad (6)$$

so that $r(u) \leq \Psi(A)\, d(u)$ for $u \in A$.

Theorem 1 (Bounds for general models). For a given subspace collection $\mathcal{V}$, suppose that the regularizer $r$ is decomposable, and consider the regularized M-estimator (1) with $\lambda_n \geq 2\, r^*(\nabla L(\theta^*))$. Then, for any pair of subspaces $(A, B) \in \mathcal{V}$ and tolerance $\epsilon \geq 0$ such that the loss function $L$ satisfies restricted strong convexity over $C(A, B, \epsilon)$, we have

$$d(\hat{\theta} - \theta^*) \leq \max\Bigl\{\epsilon,\ \frac{1}{\gamma}\Bigl[2\, \Psi(B^\perp)\, \lambda_n + 2\sqrt{\gamma\, \lambda_n\, r(\pi_{A^\perp}(\theta^*))}\Bigr]\Bigr\}. \qquad (7)$$

The proof is motivated by arguments used in past work on high-dimensional estimation (e.g., [9, 14]); we provide the details in the full-length version. In the remainder of this paper, we illustrate the consequences of Theorem 1 for specific models. The parameter $\lambda_n$ will be selected as small as possible while satisfying the lower bound $\lambda_n \geq 2\, r^*(\nabla L(\theta^*))$. For the sake of clarity, the error metric $d(\cdot)$ is taken to be $\|\cdot\|_2$. For all models $\epsilon = 0$, apart from the weak-sparse model in Section 3.1.2.

3.1 Bounds for linear regression

Consider the standard linear regression model $y = X\theta^* + w$, where $\theta^* \in \mathbb{R}^p$ is the regression vector, $X \in \mathbb{R}^{n \times p}$ is the design matrix, and $w \in \mathbb{R}^n$ is a noise vector. Given the observations $(y, X)$, our goal is to estimate the regression vector $\theta^*$. Without any structural constraints on $\theta^*$, we can apply Theorem 1 with the trivial subspace collection $\mathcal{T} = \{(\mathbb{R}^p, \{0\})\}$ to establish a rate $\|\hat{\theta} - \theta^*\|_2 = O(\sigma\sqrt{p/n})$ for ridge regression (see Appendix A). Note that the RSC condition requires that $X$ be full-rank, so that $n > p$. Here we consider bounds for linear regression where $\theta^*$ is an $s$-sparse vector.

3.1.1 Lasso estimates of hard sparse models

More precisely, let us consider estimating an $s$-sparse regression vector $\theta^*$ by solving the Lasso program

$$\hat{\theta} \in \arg\min_{\theta \in \mathbb{R}^p} \Bigl\{\frac{1}{2n}\|y - X\theta\|_2^2 + \lambda_n \|\theta\|_1\Bigr\}. \qquad (8)$$

The Lasso is a special case of our M-estimator (1) with $r(\theta) = \|\theta\|_1$ and $L(\theta) = \frac{1}{2n}\|y - X\theta\|_2^2$. Recall the definition of the $s$-sparse subspace collection $\mathcal{S}$ from Section 2.2. For this problem, we set $\epsilon = 0$, so that the restricted strong convexity set (5) becomes $C(A, B, 0) = \{\Delta \in \mathbb{R}^p \mid \|\Delta_{S^c}\|_1 \leq 3\|\Delta_S\|_1\}$. Establishing restricted strong convexity for the least-squares loss is equivalent to ensuring the following bound on the design matrix:

$$\frac{\|X\theta\|_2^2}{n} \geq \gamma\, \|\theta\|_2^2 \quad \text{for all } \theta \in \mathbb{R}^p \text{ such that } \|\theta_{S^c}\|_1 \leq 3\|\theta_S\|_1. \qquad (9)$$

As mentioned previously, this condition is essentially the same as the restricted eigenvalue condition developed by Bickel et al. [1]. Moreover, we note that Raskutti et al. [10] have shown that condition (9) will hold with high probability for various random ensembles of Gaussian matrices. Each column $X_i$ of $X$ is assumed to satisfy the constraint $\|X_i\|_2 \leq \sqrt{n}$. Finally, we assume that the elements of $w$ are zero-mean and have sub-Gaussian tails, meaning that there exists some constant $\sigma > 0$ such that $\mathbb{P}[|w_i| > t] \leq \exp(-t^2/(2\sigma^2))$ for all $t > 0$. Under these conditions, we recover as a corollary of Theorem 1 the following known result [1, 6].

Corollary 1. Suppose that the true vector $\theta^* \in \mathbb{R}^p$ is exactly $s$-sparse with support $S$, and that the design matrix $X$ satisfies condition (9). If we solve the Lasso with $\lambda_n^2 = \frac{16\sigma^2 \log p}{n}$, then with probability at least $1 - c_1\exp(-c_2 n \lambda_n^2)$, the solution satisfies

$$\|\hat{\theta} - \theta^*\|_2 \leq \frac{8\sigma}{\gamma}\sqrt{\frac{s \log p}{n}}. \qquad (10)$$

Proof. As noted previously, the $\ell_1$-regularizer is decomposable for the sparse subspace collection $\mathcal{S}$, while condition (9) ensures that RSC holds for all sets $C(A, B, 0)$ with $(A, B) \in \mathcal{S}$. We must verify that the given choice of regularization satisfies $\lambda_n \geq 2\, r^*(\nabla L(\theta^*))$. Note that $r^*(\cdot) = \|\cdot\|_\infty$, and moreover that $\nabla L(\theta^*) = -X^T w / n$. Under the column normalization condition on the design matrix $X$ and the sub-Gaussian nature of the noise, it follows that $\|X^T w / n\|_\infty \leq 2\sigma\sqrt{(\log p)/n}$ with high probability. The bound in Theorem 1 is thus applicable, and it remains to compute the form that its different terms take in this special case. For the $\ell_1$-regularizer and the $\ell_2$ error metric, we have $\Psi(A(S)) = \sqrt{s}$. Given the hard sparsity assumption, $r(\theta^*_{S^c}) = 0$, so that Theorem 1 implies that $\|\hat{\theta} - \theta^*\|_2 \leq \frac{2\sqrt{s}\,\lambda_n}{\gamma} = \frac{8\sigma}{\gamma}\sqrt{\frac{s \log p}{n}}$, as claimed.

3.1.2 Lasso estimates of weak sparse models

We now consider models that satisfy a weak sparsity assumption. More concretely, suppose that $\theta^*$ lies in the $\ell_q$-ball of radius $R_q$, namely, the set $B_q(R_q) := \{\theta \in \mathbb{R}^p \mid \sum_{i=1}^p |\theta_i|^q \leq R_q\}$ for some $q \in (0, 1]$. Our analysis exploits the fact that any $\theta^* \in B_q(R_q)$ can be well approximated by an $s$-sparse vector (for an appropriately chosen sparsity index $s$). It is natural to approximate $\theta^*$ by a vector supported on the set $S = \{i \mid |\theta^*_i| \geq \tau\}$. For any choice of threshold $\tau > 0$, it can be shown that $|S| \leq R_q \tau^{-q}$, and as shown in the full-length version, the optimal choice is to set $\tau = \lambda_n$, using the same regularization parameter as in Corollary 1. Accordingly, we consider the $s$-sparse subspace collection $\mathcal{S}$ with subsets of size $s = R_q \lambda_n^{-q}$. We assume that the noise vector $w \in \mathbb{R}^n$ is as defined above and that the columns are normalized as in the previous section. We also assume that the matrix $X$ satisfies the condition

$$\frac{\|Xv\|_2}{\sqrt{n}} \geq \kappa_1 \|v\|_2 - \kappa_2 \sqrt{\frac{\log p}{n}}\, \|v\|_1 \qquad (11)$$

for constants $\kappa_1, \kappa_2 > 0$. Raskutti et al. [10] show that this property holds with high probability for suitable Gaussian random matrices. Under this condition, it can be verified that RSC holds, with a constant $\gamma$ depending only on $\kappa_1$, over the set $C(A(S), B(S), \epsilon)$, where the tolerance $\epsilon$ is of the order $\sqrt{R_q}\,\bigl(\frac{16\sigma^2 \log p}{n}\bigr)^{\frac{1}{2} - \frac{q}{4}}$. The following result, which we obtain by applying Theorem 1 in this setting, is new to the best of our knowledge:

Corollary 2. Suppose that the true vector $\theta^* \in B_q(R_q)$, and the design matrix $X$ satisfies condition (11).
If we solve the Lasso with $\lambda_n^2 = \frac{16\sigma^2 \log p}{n}$, then with probability at least $1 - c_1\exp(-c_2 n \lambda_n^2)$, the solution satisfies

$$\|\hat{\theta} - \theta^*\|_2^2 \leq c\, R_q \Bigl(\frac{16\sigma^2 \log p}{n}\Bigr)^{1 - \frac{q}{2}}, \qquad (12)$$

where the constant $c$ depends only on $\kappa_1$.
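As a purely illustrative companion to Corollary 1 (this simulation is our own and not part of the paper; it uses scikit-learn's Lasso solver, and the constants are not tuned to match the corollary), the $\sqrt{s \log p / n}$ scaling of the $\ell_2$ error can be checked empirically:

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_l2_error(n, p, s, sigma=1.0, seed=0):
    """One draw of the hard-sparse linear model and the resulting l2 error,
    with lambda_n of the order sigma * sqrt(log p / n)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    theta_star = np.zeros(p)
    theta_star[:s] = 1.0
    y = X @ theta_star + sigma * rng.standard_normal(n)
    lam = 2.0 * sigma * np.sqrt(np.log(p) / n)
    # sklearn's objective is (1/(2n))||y - X theta||_2^2 + alpha * ||theta||_1
    theta_hat = Lasso(alpha=lam, fit_intercept=False, max_iter=5000).fit(X, y).coef_
    return np.linalg.norm(theta_hat - theta_star)

for n in [200, 400, 800]:
    print(n, lasso_l2_error(n, p=1000, s=10),
          np.sqrt(10 * np.log(1000) / n))   # observed error vs. predicted scaling
```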

We note that both of the rates, for hard sparsity in Corollary 1 and weak sparsity in Corollary 2, are known to be optimal in a minimax sense [10]. In [10], the authors also show that (12) is achievable by solving the computationally intractable problem of minimizing $L(\theta)$ over the $\ell_q$-ball.

3.2 Bounds for generalized linear models

Next, consider any generalized linear model with canonical link function, where the distribution of the response $y \in \mathcal{Y}$, given a predictor $X \in \mathbb{R}^p$, is given by

$$p(y \mid X; \theta^*) = \exp\bigl(y\, \theta^{*T} X - a(\theta^{*T} X) + d(y)\bigr),$$

for some fixed functions $a : \mathbb{R} \to \mathbb{R}$ and $d : \mathcal{Y} \to \mathbb{R}$, where $\|X\|_\infty \leq A$ and $|y| \leq B$. We consider estimating $\theta^*$ from observations $\{(X_i, y_i)\}_{i=1}^n$ by $\ell_1$-regularized maximum likelihood:

$$\hat{\theta} \in \arg\min_{\theta \in \mathbb{R}^p} \Bigl\{-\frac{1}{n}\,\theta^T\Bigl(\sum_{i=1}^n y_i X_i\Bigr) + \frac{1}{n}\sum_{i=1}^n a(\theta^T X_i) + \lambda_n\|\theta\|_1\Bigr\}, \qquad (13)$$

so that $L(\theta) = -\frac{1}{n}\theta^T\bigl(\sum_{i=1}^n y_i X_i\bigr) + \frac{1}{n}\sum_{i=1}^n a(\theta^T X_i)$ and $r(\theta) = \|\theta\|_1$. Let $X \in \mathbb{R}^{n \times p}$ denote the matrix with $X_i$ as its $i$-th row. Again we use the $s$-sparse subspace collection $\mathcal{S}$ and $\epsilon = 0$; it can then be verified that it suffices for the restricted strong convexity condition to hold that, for some $c > 0$, $\ddot{a}(\theta^T x) > c$ for all covariate vectors $x$ and all $\theta \in \{\theta^* + \Delta \mid \|\Delta\|_2 \leq \frac{16AB}{\gamma}\sqrt{\frac{s \log p}{n}}\}$, and that the design matrix $X$ satisfies the restricted eigenvalue bound

$$\frac{\|X\theta\|_2^2}{n} \geq \frac{\gamma}{c}\, \|\theta\|_2^2 \quad \text{for all } \theta \in \mathbb{R}^p \text{ such that } \|\theta_{S^c}\|_1 \leq 3\|\theta_S\|_1. \qquad (14)$$

Corollary 3. Suppose that the true vector $\theta^* \in \mathbb{R}^p$ is exactly $s$-sparse with support $S$, and the design matrix $X$ satisfies condition (14). Suppose that we solve the $\ell_1$-regularized M-estimator (13) with $\lambda_n^2 = \frac{32 A^2 B^2 \log p}{n}$. Then with probability at least $1 - c_1\exp(-c_2 n \lambda_n^2)$, the solution satisfies

$$\|\hat{\theta} - \theta^*\|_2 \leq \frac{16AB}{\gamma}\sqrt{\frac{s \log p}{n}}. \qquad (15)$$

We defer the proof to the full-length version due to space constraints.

3.3 Bounds for sparse matrices

In this section, we consider some extensions of our results to estimation of regression matrices. Various authors have proposed extensions of the Lasso based on regularizers that have more structure than the $\ell_1$ norm [17, 22]. Such regularizers allow one to impose various types of block-sparsity constraints, in which groups of parameters are assumed to be active (or inactive) simultaneously. We assume that the observation model takes the form $Y = X\Theta^* + W$, where $\Theta^* \in \mathbb{R}^{k \times m}$ is the unknown fixed set of parameters, $X \in \mathbb{R}^{n \times k}$ is the design matrix, and $W \in \mathbb{R}^{n \times m}$ is the noise matrix. As a loss function, we use the squared Frobenius norm $L(\Theta) = \frac{1}{2n}\|Y - X\Theta\|_F^2$, and as a regularizer, we use the $\ell_{1,q}$ matrix norm for some $q \geq 1$, which takes the form $\|\Theta\|_{1,q} = \sum_{i=1}^k \|(\Theta_{i1}, \ldots, \Theta_{im})\|_q$. We refer to the resulting estimator as the $q$-group Lasso. We define the quantity $\eta(m; q) = 1$ if $q \in (1, 2]$, and $\eta(m; q) = m^{1/2 - 1/q}$ if $q > 2$. We then set the regularization parameter as follows:

$$\lambda_n = \begin{cases} 4\sigma\Bigl[\eta(m; q)\sqrt{\frac{\log k}{n}} + C_q\, \frac{m^{1 - 1/q}}{\sqrt{n}}\Bigr] & \text{if } q > 1, \\[4pt] 4\sigma\sqrt{\frac{\log(km)}{n}} & \text{for } q = 1. \end{cases}$$

Corollary 4. Suppose that the true parameter matrix $\Theta^*$ has non-zero rows only for indices $i \in S \subseteq \{1, \ldots, k\}$ where $|S| = s$, and that the design matrix $X \in \mathbb{R}^{n \times k}$ satisfies condition (9). Then with probability at least $1 - c_1\exp(-c_2 n \lambda_n^2)$, the $q$-group Lasso solution satisfies

$$\|\hat{\Theta} - \Theta^*\|_F \leq \frac{2}{\gamma}\, \Psi(S)\, \lambda_n. \qquad (16)$$

Proof. We simply need to establish that the regularization parameter satisfies $\lambda_n \geq 2\, r^*(\nabla L(\Theta^*))$. We note that for a matrix $U$, $r^*(U) = \max_{i = 1, \ldots, k} \|U_i\|_{q'}$, where $1/q + 1/q' = 1$. Moreover, we have $\nabla L(\Theta^*) = -\frac{1}{n}X^T W$. Concentration results on the $\ell_{q'}$ norm and the union bound yield that $r^*\bigl(\frac{1}{n}X^T W\bigr) \leq 2\sigma\bigl[\eta(m; q)\sqrt{\frac{\log k}{n}} + C_q\, \frac{m^{1 - 1/q}}{\sqrt{n}}\bigr]$, as required.
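To make the norms in this proof concrete, here is a short sketch (our own notation, not from the paper) of the $\ell_{1,q}$ block norm, its dual $\max_i \|U_i\|_{q'}$, and a numerical check of the Hölder-type inequality $|\langle U, \Theta\rangle| \leq \|\Theta\|_{1,q}\, r^*(U)$ that underlies the choice of $\lambda_n$:

```python
import numpy as np

def l1q_norm(Theta, q):
    """l_{1,q} block norm: sum over rows of the l_q norm of each row."""
    return np.sum(np.linalg.norm(Theta, ord=q, axis=1))

def l1q_dual_norm(U, q):
    """Dual norm: max over rows of the l_{q'} norm, with 1/q + 1/q' = 1."""
    q_dual = np.inf if q == 1 else q / (q - 1.0)
    return np.max(np.linalg.norm(U, ord=q_dual, axis=1))

rng = np.random.default_rng(1)
Theta = rng.standard_normal((5, 3))
U = rng.standard_normal((5, 3))
for q in [1, 2, 3]:
    lhs = abs(np.sum(U * Theta))                    # |<U, Theta>| (trace inner product)
    rhs = l1q_norm(Theta, q) * l1q_dual_norm(U, q)
    print(q, lhs <= rhs + 1e-12)                    # Hölder-type duality holds
```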

We will now consider three special cases of the above result. A simple argument shows that $\Psi(S) = \sqrt{s}$ if $q \geq 2$, and $\Psi(S) = m^{1/q - 1/2}\sqrt{s}$ if $q \in [1, 2]$. First, we consider $q = 1$, and note that solving the group Lasso with $q = 1$ is identical to solving a Lasso problem with sparsity $sm$ and ambient dimension $km$. The resulting upper bound on the Frobenius norm reflects this fact: more specifically, for $q = 1$, the bound is $\frac{8\sigma}{\gamma}\sqrt{\frac{sm\log(km)}{n}}$. For the case $q = 2$, Corollary 4 implies that the Frobenius error $\|\hat{\Theta} - \Theta^*\|_F$ is upper bounded as $\frac{8\sigma}{\gamma}\bigl[\sqrt{\frac{s\log k}{n}} + \sqrt{\frac{sm}{n}}\bigr]$. This is also a very natural result: the term $\sqrt{\frac{s\log k}{n}}$ captures the difficulty of finding the $s$ non-zero rows out of the total $k$, whereas the term $\sqrt{\frac{sm}{n}}$ captures the difficulty of estimating the $sm$ free parameters in the matrix (once the non-zero rows have been determined). We note that recent work by Lounici et al. [4] established a bound of order $O\bigl(\frac{\sigma}{c}\bigl[\sqrt{\frac{ms\log k}{n}} + \sqrt{\frac{sm}{n}}\bigr]\bigr)$, which is equivalent apart from a factor $\sqrt{m}$ in the first term. Finally, for $q = \infty$, we obtain an upper bound of the order $\frac{\sigma}{\gamma}\bigl[\sqrt{\frac{s\log k}{n}} + m\sqrt{\frac{s}{n}}\bigr]$, which is a novel result.

3.4 Bounds for estimating low rank matrices

Finally, we consider the implications of our main result for the problem of estimating low-rank matrices. This structural assumption is a natural generalization of sparsity, and has been studied by various authors (see the paper [13] and references therein). To illustrate our main theorem in this context, let us consider the following instance of low-rank matrix learning. Given a low-rank matrix $\Theta^* \in \mathbb{R}^{k \times m}$, suppose that we are given $n$ noisy observations of the form $Y_i = \langle X_i, \Theta^*\rangle + W_i$, where $W_i \sim N(0, 1)$. Such an observation model arises, for instance, in system identification settings in control theory [13]. The following regularized M-estimator can be considered in order to estimate the desired low-rank matrix $\Theta^*$:

$$\min_{\Theta \in \mathbb{R}^{k \times m}}\ \frac{1}{2n}\sum_{i=1}^n \bigl(Y_i - \langle X_i, \Theta\rangle\bigr)^2 + \lambda_n\|\Theta\|_*, \qquad (17)$$

where the regularizer $\|\Theta\|_*$ is the nuclear norm, or the sum of the singular values of $\Theta$. Recall the rank-$r$ collection $\mathcal{V}$ defined for low-rank matrices in Section 2.2. Let $\Theta^* = U\Sigma W^T$ be the singular value decomposition (SVD) of $\Theta^*$, so that $U \in \mathbb{R}^{k \times r}$ and $W \in \mathbb{R}^{m \times r}$ are orthogonal, and $\Sigma \in \mathbb{R}^{r \times r}$ is a diagonal matrix. If we let $A = A(U, W)$ and $B = B(U, W)$, then $\pi_{A^\perp}(\Theta^*) = 0$, so that by Lemma 1 we have $\|\pi_B(\Delta)\|_* \leq 3\|\pi_{B^\perp}(\Delta)\|_*$. Thus, for restricted strong convexity to hold, it can be shown that the design matrices $X_i$ must satisfy

$$\frac{1}{n}\sum_{i=1}^n \langle X_i, \Delta\rangle^2 \geq \gamma\, \|\Delta\|_F^2 \quad \text{for all } \Delta \text{ such that } \|\pi_B(\Delta)\|_* \leq 3\|\pi_{B^\perp}(\Delta)\|_*. \qquad (18)$$

As with the analogous conditions for sparse vectors and linear regression, this condition can be shown to hold with high probability for Gaussian random matrices.

Corollary 5. Suppose that the true matrix $\Theta^*$ has rank $r \ll \min(k, m)$, and that the design matrices $\{X_i\}$ satisfy condition (18). If we solve the regularized M-estimator (17) with $\lambda_n = \frac{4(\sqrt{k} + \sqrt{m})}{\sqrt{n}}$, then with probability at least $1 - c_1\exp(-c_2(k + m))$, we have

$$\|\hat{\Theta} - \Theta^*\|_F \leq \frac{16}{\gamma}\Bigl[\sqrt{\frac{rk}{n}} + \sqrt{\frac{rm}{n}}\Bigr]. \qquad (19)$$

Proof. Note that if $\mathrm{rank}(\Theta) = r$, then $\|\Theta\|_* \leq \sqrt{r}\,\|\Theta\|_F$, so that $\Psi(B^\perp) = \sqrt{2r}$, since the subspace $B^\perp(U, W)$ consists of matrices with rank at most $2r$. All that remains is to show that $\lambda_n \geq 2\, r^*(\nabla L(\Theta^*))$. Standard analysis gives that the dual norm to $\|\cdot\|_*$ is the operator norm $\|\cdot\|_2$. Applying this observation and the fact that $\nabla L(\Theta^*) = -\frac{1}{n}\sum_{i=1}^n X_i W_i$, we can construct a bound on the operator norm of $\frac{1}{n}\sum_{i=1}^n X_i W_i$. We assume that the entries of each $X_i$ are i.i.d. $N(0, 1)$. Then, conditioned on $W$, the entries of the matrix $\frac{1}{n}\sum_{i=1}^n X_i W_i$ are i.i.d. $N(0, \|W\|_2^2/n^2)$, from which it can be shown that, with probability at least $1 - c_1\exp(-c_2 n)$, $\|W\|_2^2/n \leq 2$. Coupled with results from random matrix theory, we have that $\bigl\|\frac{1}{n}\sum_{i=1}^n X_i W_i\bigr\|_2 \leq \frac{2(\sqrt{k} + \sqrt{m})}{\sqrt{n}}$ with probability at least $1 - c_1\exp(-c_2(k + m))$, verifying that $\lambda_n \geq 2\, r^*(\nabla L(\Theta^*))$.
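The two facts used in this proof, decomposability of the nuclear norm over matrices with orthogonal row and column spaces and the operator norm as its dual, can be checked numerically; the following small sketch (our own, with generic random matrices) does so:

```python
import numpy as np

def nuclear_norm(M):
    return np.linalg.norm(M, ord='nuc')        # sum of singular values

rng = np.random.default_rng(2)
k, m, r = 6, 5, 2
U = np.linalg.qr(rng.standard_normal((k, r)))[0]   # orthonormal basis for the column space
V = np.linalg.qr(rng.standard_normal((m, r)))[0]   # orthonormal basis for the row space

# M1 lies in A(U, V); M2 has row/column spaces in the orthogonal complements, i.e. B(U, V)
M1 = U @ rng.standard_normal((r, r)) @ V.T
M2 = (np.eye(k) - U @ U.T) @ rng.standard_normal((k, m)) @ (np.eye(m) - V @ V.T)

print(np.isclose(nuclear_norm(M1 + M2), nuclear_norm(M1) + nuclear_norm(M2)))   # decomposability
print(np.isclose(np.linalg.norm(M1, ord=2),
                 np.linalg.svd(M1, compute_uv=False)[0]))                        # dual norm = top singular value
```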

References

[1] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Submitted to Annals of Statistics.
[2] E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6):2313–2351, 2007.
[3] S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
[4] K. Lounici, M. Pontil, A. B. Tsybakov, and S. van de Geer. Taking advantage of sparsity in multi-task learning. Arxiv, 2009.
[5] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436–1462, 2006.
[6] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics, 37(1):246–270, 2009.
[7] S. Negahban and M. J. Wainwright. Simultaneous support recovery in high-dimensional regression: Benefits and perils of $\ell_{1,\infty}$-regularization. Technical report, Department of Statistics, UC Berkeley, April 2008.
[8] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Union support recovery in high-dimensional multivariate regression. Technical report, Department of Statistics, UC Berkeley, August 2008.
[9] S. Portnoy. Asymptotic behavior of M-estimators of p regression parameters when p²/n is large: I. Consistency. Annals of Statistics, 12(4):1298–1309, 1984.
[10] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. Technical report, Department of Statistics, UC Berkeley, 2009.
[11] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using $\ell_1$-regularized logistic regression. Annals of Statistics, to appear.
[12] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing $\ell_1$-penalized log-determinant divergence. Technical Report 767, Department of Statistics, UC Berkeley, September 2008.
[13] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. Allerton Conference, Allerton House, Illinois, 2007.
[14] A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515, 2008.
[15] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[16] J. Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051, March 2006.
[17] B. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47(3):349–363, 2005.
[18] S. van de Geer. High-dimensional generalized linear models and the lasso. Annals of Statistics, 36(2):614–645, 2008.
[19] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55:2183–2202, May 2009.
[20] C. Zhang and J. Huang. Model selection consistency of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36:1567–1594, 2008.
[21] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
[22] P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties. Annals of Statistics, 37(6A):3468–3497, 2009.

A Ridge-Regression

In this section, we apply Theorem 1 to ridge-regression. Consider solving the program

$$\hat{\theta} \in \arg\min_{\theta \in \mathbb{R}^p} \Bigl\{\frac{1}{2n}\|y - X\theta\|_2^2 + \lambda_n\|\theta\|_2\Bigr\}.$$

Assume that the underlying structure enforces $\|\theta^*\|_2 \leq M$ for some constant $M > 0$. As a result, the restricted strong convexity assumption reduces to $\lambda_{\min}\bigl(\frac{1}{n}X^T X\bigr) \geq \gamma > 0$. We may now present the following trivial corollary to Theorem 1. Note that the result is not new, and provides exactly the same bound as the ordinary least-squares solution to the problem.

Corollary 6. Suppose that the true vector $\theta^* \in \mathbb{R}^p$, and that the matrix $\frac{1}{n}X^T X$ has its smallest eigenvalue bounded below by $\gamma$. Suppose that we solve the ridge-regression program with $\lambda_n^2 = \frac{16\sigma^2 p}{n}$. Then, with probability at least $1 - c_1\exp(-c_2 n\lambda_n^2)$, the solution satisfies

$$\|\hat{\theta} - \theta^*\|_2 \leq \frac{8\sigma}{\gamma}\sqrt{\frac{p}{n}}. \qquad (20)$$

Proof. The restricted strong convexity condition clearly holds. Furthermore, let $\mathcal{V}$ be the collection of all subspace pairs. Therefore, we can apply the bound in Theorem 1. First note that $\Psi(A) = 1$ for any set $A$, since $d(v) = r(v)$ for all $v \in \mathbb{R}^p$. The dual norm $r^*(\cdot)$ is $r(\cdot)$ itself (the $\ell_2$ norm is self-dual). Thus, we must bound the $\ell_2$ norm of $\nabla L(\theta^*) = -X^T w/n$. The column normalization bound yields that $\|X^T w/n\|_2 \leq 2\sigma\sqrt{p/n}$ with probability $1 - c_1\exp(-c_2 p)$. Therefore, letting $\lambda_n = 2\|X^T w/n\|_2$, we have by Theorem 1 that

$$d(\hat{\theta} - \theta^*) \leq \frac{1}{\gamma}\Bigl[8\sigma\sqrt{\frac{p}{n}} + 2\sqrt{\gamma\,\lambda_n\, r(\pi_{A^\perp}(\theta^*))}\Bigr].$$

The bound is clearly minimized as long as $r(\pi_{A^\perp}(\theta^*)) = 0$, which is the case if we let $A = \mathbb{R}^p$, verifying the result.

B Proof of Theorem 1

The argument is motivated by the methods of Rothman et al. [14], in their analysis of an $\ell_1$-regularized log-determinant program. Consider the function

$$g(\Delta) := L(\theta^* + \Delta) - L(\theta^*) + \lambda_n\bigl\{r(\theta^* + \Delta) - r(\theta^*)\bigr\}. \qquad (21)$$

The convexity of $L(\cdot)$ and $r(\cdot)$ implies that $g$ is a convex function. Here, we have that $\hat{\Delta} = \hat{\theta} - \theta^*$. Observe that $g(0) = 0$, so that $g(\hat{\Delta}) \leq 0$. From Lemma 1, we know that $\hat{\Delta} \in C$, where $C := \{\Delta \in \mathbb{R}^p : r(\pi_B(\Delta)) \leq 3\, r(\pi_{B^\perp}(\Delta)) + 4\, r(\pi_{A^\perp}(\theta^*))\}$. We also have that if $\Delta \in C$, then $t\Delta \in C$ for any $t \in [0, 1]$. Now suppose that $d(\hat{\Delta}) > M$. Then there exists a $t \in (0, 1)$ such that $d(t\hat{\Delta}) = M$ and $t\hat{\Delta} \in C$. Now suppose that $g(t\hat{\Delta}) > 0$. Then, by the convexity of $g$,

$$g\bigl((1 - t)\cdot 0 + t\hat{\Delta}\bigr) \leq (1 - t)\, g(0) + t\, g(\hat{\Delta}).$$

We know $g(0) = 0$ and $t > 0$. Thus, $g(\hat{\Delta}) > 0$, which is a contradiction. Therefore, $d(\hat{\Delta}) \leq M$. Hence, it suffices to show that for any $\Delta \in C$ such that $d(\Delta) = M$, $g(\Delta) > 0$, which we now prove.

Proof. Fix any arbitrary vector $\Delta \in \mathbb{R}^p$ such that $\Delta \in C$ and $d(\Delta) = M$. We assume that restricted strong convexity holds for all such vectors. Therefore,

$$g(\Delta) = L(\theta^* + \Delta) - L(\theta^*) + \lambda_n\bigl\{r(\theta^* + \Delta) - r(\theta^*)\bigr\} \geq \nabla L(\theta^*)^T\Delta + \gamma\, d(\Delta)^2 + \lambda_n\bigl\{r(\theta^* + \Delta) - r(\theta^*)\bigr\}. \qquad (22)$$

Recall that $\lambda_n \geq 2\, r^*(\nabla L(\theta^*))$, so that by the bounds established in the proof of Lemma 1,

$$\nabla L(\theta^*)^T\Delta + \lambda_n\bigl\{r(\theta^* + \Delta) - r(\theta^*)\bigr\} \geq \frac{\lambda_n}{2}\bigl\{r(\pi_B(\Delta)) - 3r(\pi_{B^\perp}(\Delta)) - 4r(\pi_{A^\perp}(\theta^*))\bigr\} \geq -\frac{\lambda_n}{2}\bigl\{3r(\pi_{B^\perp}(\Delta)) + 4r(\pi_{A^\perp}(\theta^*))\bigr\}.$$

Substituting the latter inequality into equation (22) yields

$$g(\Delta) \geq \gamma\, d(\Delta)^2 - \frac{\lambda_n}{2}\bigl\{3\, r(\pi_{B^\perp}(\Delta)) + 4\, r(\pi_{A^\perp}(\theta^*))\bigr\}.$$

Noting that $r(\pi_{B^\perp}(\Delta)) \leq \Psi(B^\perp)\, d(\pi_{B^\perp}(\Delta)) \leq \Psi(B^\perp)\, d(\Delta)$, this establishes that

$$g(\Delta) \geq \gamma\, d(\Delta)^2 - \frac{\lambda_n}{2}\bigl\{3\, \Psi(B^\perp)\, d(\Delta) + 4\, r(\pi_{A^\perp}(\theta^*))\bigr\}.$$

Finally, substituting $d(\Delta) = M = \frac{1}{\gamma}\bigl[2\, \Psi(B^\perp)\, \lambda_n + 2\sqrt{\gamma\,\lambda_n\, r(\pi_{A^\perp}(\theta^*))}\bigr]$ proves that $g(\Delta) > 0$.

C Proofs and Auxiliary Results

Proof of Lemma 1. Recall the function

$$g(\Delta) := L(\theta^* + \Delta) - L(\theta^*) + \lambda_n\bigl\{r(\theta^* + \Delta) - r(\theta^*)\bigr\}. \qquad (23)$$

We will start off by obtaining a lower bound for this function.

Loss deviation: Using the convexity of the loss function $L$, we have

$$L(\theta^* + \hat{\Delta}) - L(\theta^*) \geq \nabla L(\theta^*)^T \hat{\Delta}. \qquad (24)$$

By the Cauchy-Schwarz inequality (applied with the norm $r$ and its dual), we have

$$\bigl|\nabla L(\theta^*)^T \hat{\Delta}\bigr| \leq r^*(\nabla L(\theta^*))\, r(\hat{\Delta}) \leq \frac{\lambda_n}{2}\bigl[r(\pi_{B^\perp}(\hat{\Delta})) + r(\pi_B(\hat{\Delta}))\bigr],$$

where we have used the assumption on $r^*(\nabla L(\theta^*))$ and the triangle inequality. Substituting in (24),

$$L(\theta^* + \hat{\Delta}) - L(\theta^*) \geq -\frac{\lambda_n}{2}\bigl[r(\pi_{B^\perp}(\hat{\Delta})) + r(\pi_B(\hat{\Delta}))\bigr]. \qquad (25)$$

Regularization deviation: By the triangle inequality,

$$r(\theta^* + \hat{\Delta}) \geq r(\pi_A(\theta^*) + \pi_B(\hat{\Delta})) - r(\pi_{A^\perp}(\theta^*)) - r(\pi_{B^\perp}(\hat{\Delta})).$$

By the decomposition property,

$$r(\pi_A(\theta^*) + \pi_B(\hat{\Delta})) = r(\pi_A(\theta^*)) + r(\pi_B(\hat{\Delta})),$$

so that by another application of the triangle inequality,

$$r(\theta^* + \hat{\Delta}) - r(\theta^*) \geq r(\pi_B(\hat{\Delta})) - r(\pi_{B^\perp}(\hat{\Delta})) - 2\, r(\pi_{A^\perp}(\theta^*)). \qquad (26)$$

Substituting the lower bounds (25) and (26) for the loss and regularization function deviations in (23),

$$g(\hat{\Delta}) \geq \frac{\lambda_n}{2}\bigl[r(\pi_B(\hat{\Delta})) - 3\, r(\pi_{B^\perp}(\hat{\Delta})) - 4\, r(\pi_{A^\perp}(\theta^*))\bigr]. \qquad (27)$$

By construction $g(0) = 0$, and hence the deviation at the optimum satisfies $g(\hat{\Delta}) \leq 0$. Using this in (27) and dividing by $\lambda_n/2 > 0$ yields

$$r(\pi_B(\hat{\Delta})) \leq 3\, r(\pi_{B^\perp}(\hat{\Delta})) + 4\, r(\pi_{A^\perp}(\theta^*)),$$

as required.
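To make the conclusion of Lemma 1 tangible in the hard-sparse Lasso setting (where $\pi_{A^\perp}(\theta^*) = 0$, so the lemma reduces to the cone condition $\|\hat{\Delta}_{S^c}\|_1 \leq 3\|\hat{\Delta}_S\|_1$), the following sketch checks it numerically. It is our own illustration: it uses scikit-learn's Lasso and sets $\lambda_n = 2\|X^T w/n\|_\infty$ using the true noise vector, which would not be available in practice.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p, s, sigma = 200, 500, 5, 0.5
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:s] = 1.0
w = sigma * rng.standard_normal(n)
y = X @ theta_star + w

lam = 2.0 * np.max(np.abs(X.T @ w)) / n       # lambda_n >= 2 r*(grad L(theta*)) holds exactly
theta_hat = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(X, y).coef_
delta = theta_hat - theta_star

# Cone condition from Lemma 1: ||delta_{S^c}||_1 <= 3 ||delta_S||_1
print(np.sum(np.abs(delta[s:])), "<=", 3 * np.sum(np.abs(delta[:s])))
```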

D Proof of Corollary 2

Proof. The subset of the $s$-sparse subspace collection that we use in this corollary consists of the pairs $(A(S), B(S))$ for sets $S \in \mathbb{S} := \{S \mid |S| \leq R_q\, \lambda_n^{-q}\}$. As in the proof of Corollary 1, the assumptions of Theorem 1 are satisfied, so that we can use the bound in the theorem; its terms can be simplified as follows. Again, for the $\ell_1$-regularizer and the $\ell_2$ error metric, we have $\Psi(A(S)) = \sqrt{|S|}$. Now $|S|$ can be bounded as follows:

$$R_q \geq \sum_{i=1}^p |\theta^*_i|^q \geq \sum_{i \in S} |\theta^*_i|^q \geq \tau^q\, |S|,$$

so that $|S| \leq \tau^{-q} R_q$. Further, given the soft sparsity assumption, $r(\theta^*_{S^c})$ can be bounded as follows:

$$\|\theta^*_{S^c}\|_1 = \sum_{i \in S^c} |\theta^*_i| = \sum_{i \in S^c} |\theta^*_i|^q\, |\theta^*_i|^{1-q} \leq \tau^{1-q} \sum_{i \in S^c} |\theta^*_i|^q \leq R_q\, \tau^{1-q}.$$

We thus obtain from Theorem 1 that

$$\|\hat{\theta} - \theta^*\|_2 \leq \frac{1}{\gamma}\Bigl[2\sqrt{|S|}\,\lambda_n + 2\sqrt{\gamma\,\lambda_n\,\|\theta^*_{S^c}\|_1}\Bigr] \leq \frac{1}{\gamma}\Bigl[2\sqrt{R_q}\,\tau^{-q/2}\,\lambda_n + 2\sqrt{\gamma\,\lambda_n\, R_q\, \tau^{1-q}}\Bigr].$$

From the settings of $\tau$ and $\lambda_n$, it can be seen that $\lambda_n = \tau$, which when substituted in the previous expression yields

$$\|\hat{\theta} - \theta^*\|_2 \leq c\,\sqrt{R_q}\,\lambda_n^{1 - q/2},$$

for a constant $c$ depending only on $\gamma$. Substituting for the value of $\lambda_n$, we thus obtain the bound in the Corollary.

D.1 Restricted Strong Convexity for Weak-Sparse Models

One sufficient condition for the restricted strong convexity condition to hold is that the design matrices $X \in \mathbb{R}^{n \times p}$ satisfy, for some constants $c_1 > 0$ and $c_2 > 0$, the condition

$$\frac{\|Xv\|_2}{\sqrt{n}} \geq c_1\|v\|_2 - c_2\sqrt{\frac{\log p}{n}}\,\|v\|_1.$$

In our setting, $\|v_{S^c}\|_1 \leq 3\|v_S\|_1 + 4\|\theta^*_{S^c}\|_1$, so that $\|v\|_1 \leq 4\bigl[\|v_S\|_1 + \|\theta^*_{S^c}\|_1\bigr]$, which further implies that $\|v\|_1 \leq 4\bigl[\sqrt{|S|}\,\|v\|_2 + \|\theta^*_{S^c}\|_1\bigr]$. Therefore, it immediately follows that

$$\frac{\|Xv\|_2}{\sqrt{n}} \geq \Bigl(c_1 - 4c_2\sqrt{\frac{|S|\log p}{n}}\Bigr)\|v\|_2 - 4c_2\sqrt{\frac{\log p}{n}}\,\|\theta^*_{S^c}\|_1.$$

Recall from the arguments above that $\|\theta^*_{S^c}\|_1 \leq R_q\,\tau^{1-q}$, where $\tau \asymp \sqrt{(\log p)/n}$, and that we are only concerned with sets such that $|S| \leq R_q\,\tau^{-q}$, so that

$$\frac{\|Xv\|_2}{\sqrt{n}} \geq \bigl(c_1 - 4c_2\sqrt{R_q\,\tau^{2-q}}\bigr)\|v\|_2 - 4c_2\, R_q\,\tau^{2-q}.$$

For the applications of restricted strong convexity above, we only need it to hold for vectors $v$ such that $\|v\|_2$ is at least of the order $\sqrt{R_q\,\tau^{2-q}}$, where we recall that $\tau \asymp \lambda_n$, justifying the swap. Finally, applying this lower bound on $\|v\|_2$ yields that

$$\frac{\|Xv\|_2}{\sqrt{n}} \geq c_1\bigl(1 - 4\bar{c}\sqrt{R_q\,\tau^{2-q}}\bigr)\|v\|_2 - 4c_2\sqrt{R_q\,\tau^{2-q}}\,\|v\|_2 = c_1\bigl(1 - 8\bar{c}\sqrt{R_q\,\tau^{2-q}}\bigr)\|v\|_2,$$

where $\bar{c} = c_2/c_1$. The constants $c_1$ and $c_2$ are independent of everything else, and by the scaling of $n$, the term in the parentheses can be made arbitrarily close to 1 by taking $n$ sufficiently large; in particular, it is eventually at least $1/2$. Therefore, we have that $\frac{\|Xv\|_2}{\sqrt{n}} \geq \frac{c_1}{2}\|v\|_2$, which immediately implies that restricted strong convexity holds with $\gamma = c_1^2/8$ for such $v$. Note, in fact, that the bound holds for any $v$ such that $\|v\|_2 \geq c\sqrt{R_q\,\tau^{2-q}}$, which implies that the bound established in Corollary 2 is valid, since the tolerance $\epsilon$ there is of exactly this order.

E Restricted Strong Convexity for the Trace Observation Model

Recall that the low-rank matrix observation model is $Y_i = \mathrm{trace}(X_i^T\Theta^*) + W_i$, where $X_i, \Theta^* \in \mathbb{R}^{k \times m}$. Note that we can convert each $X_i$ and $\Theta$ to a vector to yield the usual linear regression observation model $y = \mathbb{X}\theta + w$, where $\mathbb{X} \in \mathbb{R}^{n \times km}$ and $\theta \in \mathbb{R}^{km}$. We establish RSC for the simple case where the observation matrices $X_i$ are drawn from the i.i.d. Gaussian ensemble. We appeal to the Gordon-Slepian lemma to establish that, with high probability,

$$\inf_{\Delta}\ \|\mathbb{X}(\Delta)\|_2 \geq c_1\sqrt{n} - c_2\bigl(\sqrt{k} + \sqrt{m}\bigr)$$

over the relevant set of directions $\Delta$ with $\|\Delta\|_F = 1$, where $\|\cdot\|_*$ is the nuclear norm and $\|\cdot\|_F$ is the Frobenius norm. The Gordon-Slepian comparison lower bounds the expected value of the random variable $\inf_\Delta \|\mathbb{X}(\Delta)\|_2$, while concentration results then yield the above bound with high probability; we leave that step as an exercise. We know that

$$\inf_{\Delta}\|\mathbb{X}(\Delta)\|_2 = \inf_{\Delta}\ \sup_{\|u\|_2 = 1} u^T\,\mathbb{X}(\Delta) =: \inf_{\Delta}\ \sup_{u} X_{u,\Delta}.$$

Now, $X_{u,\Delta}$ is a centered Gaussian random process indexed by $u$ and $\Delta$. We may construct a second centered Gaussian random process indexed by $u$ and $\Delta$ by defining

$$Y_{u,\Delta} = u^T W' + \langle \Delta, Z\rangle,$$

where $W' \in \mathbb{R}^n$ and $Z \in \mathbb{R}^{k \times m}$ are independent with i.i.d. standard Gaussian entries. We then have

$$\mathbb{E}\bigl[(X_{u,\Delta} - X_{u',\Delta'})^2\bigr] = \|\Delta u^T - \Delta'(u')^T\|_F^2, \qquad (28)$$

and

$$\mathbb{E}\bigl[\bigl((u - u')^T W' + \langle \Delta - \Delta', Z\rangle\bigr)^2\bigr] = \|u - u'\|_2^2 + \|\Delta - \Delta'\|_F^2. \qquad (29)$$

Equation (28) is upper bounded by equation (29). On the other hand, if $\Delta = \Delta'$, then equation (28) equals equation (29), thus verifying the conditions of the Gordon-Slepian lemma. Therefore, by the lemma, it immediately follows that

$$\mathbb{E}\,\inf_{\Delta}\sup_{u} X_{u,\Delta} \geq \mathbb{E}\,\inf_{\Delta}\sup_{u}\bigl[u^T W' + \langle\Delta, Z\rangle\bigr] = \mathbb{E}\|W'\|_2 - \mathbb{E}\|Z\|_2 \geq \sqrt{n} - 2\bigl(\sqrt{k} + \sqrt{m}\bigr),$$

as desired.


More information

Lecture 13: Maximum Likelihood Estimation

Lecture 13: Maximum Likelihood Estimation ECE90 Sprig 007 Statistical Learig Theory Istructor: R. Nowak Lecture 3: Maximum Likelihood Estimatio Summary of Lecture I the last lecture we derived a risk (MSE) boud for regressio problems; i.e., select

More information

EECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1

EECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1 EECS564 Estimatio, Filterig, ad Detectio Hwk 2 Sols. Witer 25 4. Let Z be a sigle observatio havig desity fuctio where. p (z) = (2z + ), z (a) Assumig that is a oradom parameter, fid ad plot the maximum

More information

Basics of Probability Theory (for Theory of Computation courses)

Basics of Probability Theory (for Theory of Computation courses) Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.

More information

Feedback in Iterative Algorithms

Feedback in Iterative Algorithms Feedback i Iterative Algorithms Charles Byre (Charles Byre@uml.edu), Departmet of Mathematical Scieces, Uiversity of Massachusetts Lowell, Lowell, MA 01854 October 17, 2005 Abstract Whe the oegative system

More information

Empirical Process Theory and Oracle Inequalities

Empirical Process Theory and Oracle Inequalities Stat 928: Statistical Learig Theory Lecture: 10 Empirical Process Theory ad Oracle Iequalities Istructor: Sham Kakade 1 Risk vs Risk See Lecture 0 for a discussio o termiology. 2 The Uio Boud / Boferoi

More information

Rademacher Complexity

Rademacher Complexity EECS 598: Statistical Learig Theory, Witer 204 Topic 0 Rademacher Complexity Lecturer: Clayto Scott Scribe: Ya Deg, Kevi Moo Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved for

More information

Minimax rates of estimation for high-dimensional linear regression over l q -balls

Minimax rates of estimation for high-dimensional linear regression over l q -balls Miimax rates of estimatio for high-dimesioal liear regressio over l q -balls Garvesh Raskutti Marti J. Waiwright, garveshr@stat.berkeley.edu waiwrig@stat.berkeley.edu Bi Yu, biyu@stat.berkeley.edu arxiv:090.04v

More information

Bull. Korean Math. Soc. 36 (1999), No. 3, pp. 451{457 THE STRONG CONSISTENCY OF NONLINEAR REGRESSION QUANTILES ESTIMATORS Seung Hoe Choi and Hae Kyung

Bull. Korean Math. Soc. 36 (1999), No. 3, pp. 451{457 THE STRONG CONSISTENCY OF NONLINEAR REGRESSION QUANTILES ESTIMATORS Seung Hoe Choi and Hae Kyung Bull. Korea Math. Soc. 36 (999), No. 3, pp. 45{457 THE STRONG CONSISTENCY OF NONLINEAR REGRESSION QUANTILES ESTIMATORS Abstract. This paper provides suciet coditios which esure the strog cosistecy of regressio

More information

High-dimensional support union recovery in multivariate regression

High-dimensional support union recovery in multivariate regression High-dimesioal support uio recovery i multivariate regressio Guillaume Oboziski Departmet of Statistics UC Berkeley gobo@stat.berkeley.edu Marti J. Waiwright Departmet of Statistics Dept. of Electrical

More information

THE ASYMPTOTIC COMPLEXITY OF MATRIX REDUCTION OVER FINITE FIELDS

THE ASYMPTOTIC COMPLEXITY OF MATRIX REDUCTION OVER FINITE FIELDS THE ASYMPTOTIC COMPLEXITY OF MATRIX REDUCTION OVER FINITE FIELDS DEMETRES CHRISTOFIDES Abstract. Cosider a ivertible matrix over some field. The Gauss-Jorda elimiatio reduces this matrix to the idetity

More information

Math 155 (Lecture 3)

Math 155 (Lecture 3) Math 55 (Lecture 3) September 8, I this lecture, we ll cosider the aswer to oe of the most basic coutig problems i combiatorics Questio How may ways are there to choose a -elemet subset of the set {,,,

More information

Linear Support Vector Machines

Linear Support Vector Machines Liear Support Vector Machies David S. Roseberg The Support Vector Machie For a liear support vector machie (SVM), we use the hypothesis space of affie fuctios F = { f(x) = w T x + b w R d, b R } ad evaluate

More information

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate Supplemetary Material for Fast Stochastic AUC Maximizatio with O/-Covergece Rate Migrui Liu Xiaoxua Zhag Zaiyi Che Xiaoyu Wag 3 iabao Yag echical Lemmas ized versio of Hoeffdig s iequality, ote that We

More information

Problem Set 4 Due Oct, 12

Problem Set 4 Due Oct, 12 EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios

More information

Riesz-Fischer Sequences and Lower Frame Bounds

Riesz-Fischer Sequences and Lower Frame Bounds Zeitschrift für Aalysis ud ihre Aweduge Joural for Aalysis ad its Applicatios Volume 1 (00), No., 305 314 Riesz-Fischer Sequeces ad Lower Frame Bouds P. Casazza, O. Christese, S. Li ad A. Lider Abstract.

More information

Discrete Mathematics for CS Spring 2008 David Wagner Note 22

Discrete Mathematics for CS Spring 2008 David Wagner Note 22 CS 70 Discrete Mathematics for CS Sprig 2008 David Wager Note 22 I.I.D. Radom Variables Estimatig the bias of a coi Questio: We wat to estimate the proportio p of Democrats i the US populatio, by takig

More information

Cov(aX, cy ) Var(X) Var(Y ) It is completely invariant to affine transformations: for any a, b, c, d R, ρ(ax + b, cy + d) = a.s. X i. as n.

Cov(aX, cy ) Var(X) Var(Y ) It is completely invariant to affine transformations: for any a, b, c, d R, ρ(ax + b, cy + d) = a.s. X i. as n. CS 189 Itroductio to Machie Learig Sprig 218 Note 11 1 Caoical Correlatio Aalysis The Pearso Correlatio Coefficiet ρ(x, Y ) is a way to measure how liearly related (i other words, how well a liear model

More information

Binary classification, Part 1

Binary classification, Part 1 Biary classificatio, Part 1 Maxim Ragisky September 25, 2014 The problem of biary classificatio ca be stated as follows. We have a radom couple Z = (X,Y ), where X R d is called the feature vector ad Y

More information

CHAPTER I: Vector Spaces

CHAPTER I: Vector Spaces CHAPTER I: Vector Spaces Sectio 1: Itroductio ad Examples This first chapter is largely a review of topics you probably saw i your liear algebra course. So why cover it? (1) Not everyoe remembers everythig

More information

Sparse Estimation with Strongly Correlated Variables using Ordered Weighted l 1 Regularization

Sparse Estimation with Strongly Correlated Variables using Ordered Weighted l 1 Regularization Sparse Estimatio with Strogly Correlated Variables usig Ordered Weighted l Regularizatio Mário A. T. Figueiredo Istituto de Telecomuicaçõe ad Istituto Superior Técico, Uiversidade de Lisboa, Portugal Robert

More information

Sequences. Notation. Convergence of a Sequence

Sequences. Notation. Convergence of a Sequence Sequeces A sequece is essetially just a list. Defiitio (Sequece of Real Numbers). A sequece of real umbers is a fuctio Z (, ) R for some real umber. Do t let the descriptio of the domai cofuse you; it

More information

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices Radom Matrices with Blocks of Itermediate Scale Strogly Correlated Bad Matrices Jiayi Tog Advisor: Dr. Todd Kemp May 30, 07 Departmet of Mathematics Uiversity of Califoria, Sa Diego Cotets Itroductio Notatio

More information

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach STAT 425: Itroductio to Noparametric Statistics Witer 28 Lecture 7: Desity Estimatio: k-nearest Neighbor ad Basis Approach Istructor: Ye-Chi Che Referece: Sectio 8.4 of All of Noparametric Statistics.

More information

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015 ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],

More information

Supplementary material to Non-negative least squares for high-dimensional linear models: consistency and sparse recovery without regularization

Supplementary material to Non-negative least squares for high-dimensional linear models: consistency and sparse recovery without regularization Electroic Joural of Statistics ISSN: 1935-754 Supplemetary material to No-egative least squares for high-dimesioal liear models: cosistecy ad sparse recovery without regularizatio Marti Slawski ad Matthias

More information

Rates of Convergence by Moduli of Continuity

Rates of Convergence by Moduli of Continuity Rates of Covergece by Moduli of Cotiuity Joh Duchi: Notes for Statistics 300b March, 017 1 Itroductio I this ote, we give a presetatio showig the importace, ad relatioship betwee, the modulis of cotiuity

More information

Optimization Methods MIT 2.098/6.255/ Final exam

Optimization Methods MIT 2.098/6.255/ Final exam Optimizatio Methods MIT 2.098/6.255/15.093 Fial exam Date Give: December 19th, 2006 P1. [30 pts] Classify the followig statemets as true or false. All aswers must be well-justified, either through a short

More information

Information-based Feature Selection

Information-based Feature Selection Iformatio-based Feature Selectio Farza Faria, Abbas Kazeroui, Afshi Babveyh Email: {faria,abbask,afshib}@staford.edu 1 Itroductio Feature selectio is a topic of great iterest i applicatios dealig with

More information

Notes for Lecture 11

Notes for Lecture 11 U.C. Berkeley CS78: Computatioal Complexity Hadout N Professor Luca Trevisa 3/4/008 Notes for Lecture Eigevalues, Expasio, ad Radom Walks As usual by ow, let G = (V, E) be a udirected d-regular graph with

More information

arxiv: v1 [stat.ml] 5 Aug 2008

arxiv: v1 [stat.ml] 5 Aug 2008 Uio support recovery i high-dimesioal multivariate regressio Guillaume Oboziski Marti J. Waiwright, Michael I. Jorda, {gobo, waiwrig, jorda}@stat.berkeley.edu Departmet of Statistics, ad Departmet of Electrical

More information

Spectral Partitioning in the Planted Partition Model

Spectral Partitioning in the Planted Partition Model Spectral Graph Theory Lecture 21 Spectral Partitioig i the Plated Partitio Model Daiel A. Spielma November 11, 2009 21.1 Itroductio I this lecture, we will perform a crude aalysis of the performace of

More information

Lecture 8: October 20, Applications of SVD: least squares approximation

Lecture 8: October 20, Applications of SVD: least squares approximation Mathematical Toolkit Autum 2016 Lecturer: Madhur Tulsiai Lecture 8: October 20, 2016 1 Applicatios of SVD: least squares approximatio We discuss aother applicatio of sigular value decompositio (SVD) of

More information

4 The Sperner property.

4 The Sperner property. 4 The Sperer property. I this sectio we cosider a surprisig applicatio of certai adjacecy matrices to some problems i extremal set theory. A importat role will also be played by fiite groups. I geeral,

More information

The random version of Dvoretzky s theorem in l n

The random version of Dvoretzky s theorem in l n The radom versio of Dvoretzky s theorem i l Gideo Schechtma Abstract We show that with high probability a sectio of the l ball of dimesio k cε log c > 0 a uiversal costat) is ε close to a multiple of the

More information

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4.

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4. 4. BASES I BAACH SPACES 39 4. BASES I BAACH SPACES Sice a Baach space X is a vector space, it must possess a Hamel, or vector space, basis, i.e., a subset {x γ } γ Γ whose fiite liear spa is all of X ad

More information

Lecture 12: September 27

Lecture 12: September 27 36-705: Itermediate Statistics Fall 207 Lecturer: Siva Balakrisha Lecture 2: September 27 Today we will discuss sufficiecy i more detail ad the begi to discuss some geeral strategies for costructig estimators.

More information

Bounds for the Extreme Eigenvalues Using the Trace and Determinant

Bounds for the Extreme Eigenvalues Using the Trace and Determinant ISSN 746-7659, Eglad, UK Joural of Iformatio ad Computig Sciece Vol 4, No, 9, pp 49-55 Bouds for the Etreme Eigevalues Usig the Trace ad Determiat Qi Zhog, +, Tig-Zhu Huag School of pplied Mathematics,

More information