FAST GLOBAL CONVERGENCE OF GRADIENT METHODS FOR HIGH-DIMENSIONAL STATISTICAL RECOVERY

The Annals of Statistics 2012, Vol. 40, No. 5, DOI: 10.1214/12-AOS1032
© Institute of Mathematical Statistics, 2012

FAST GLOBAL CONVERGENCE OF GRADIENT METHODS FOR HIGH-DIMENSIONAL STATISTICAL RECOVERY

BY ALEKH AGARWAL¹,², SAHAND NEGAHBAN¹,³ AND MARTIN J. WAINWRIGHT¹,³

University of California, Berkeley, Massachusetts Institute of Technology and University of California, Berkeley

Many statistical M-estimators are based on convex optimization problems formed by the combination of a data-dependent loss function with a norm-based regularizer. We analyze the convergence rates of projected gradient and composite gradient methods for solving such problems, working within a high-dimensional framework that allows the ambient dimension d to grow with (and possibly exceed) the sample size n. Our theory identifies conditions under which projected gradient descent enjoys globally linear convergence up to the statistical precision of the model, meaning the typical distance between the true unknown parameter θ* and an optimal solution θ̂. By establishing these conditions with high probability for numerous statistical models, our analysis applies to a wide range of M-estimators, including sparse linear regression using Lasso; group Lasso for block sparsity; log-linear models with regularization; low-rank matrix recovery using nuclear norm regularization; and matrix decomposition using a combination of the nuclear and ℓ1 norms. Overall, our analysis reveals interesting connections between statistical and computational efficiency in high-dimensional estimation.

1. Introduction. High-dimensional data sets present challenges that are both statistical and computational in nature. On the statistical side, recent years have witnessed a flurry of results on convergence rates for various estimators under high-dimensional scaling, allowing for the possibility that the problem dimension d exceeds the sample size n. These results typically involve some assumption regarding the structure of the parameter space, such as sparse vectors, structured covariance matrices, or low-rank matrices, as well as some regularity of the data-generating process. On the computational side, many estimators for statistical recovery are based on solving convex programs. Examples of such M-estimators include ℓ1-regularized quadratic programs (Lasso) for sparse linear regression (e.g.,

Received April 2011; revised January.
¹ Supported in part by Grant AFOSR-09NL.
² Supported in part by a Microsoft Graduate Fellowship and Google Ph.D. Fellowship.
³ Supported by funding from NSF-CDI.
MSC2010 subject classifications. Primary 62F30; secondary 62H12.
Key words and phrases. High-dimensional inference, convex optimization, regularized M-estimation.

[7, 13, 26, 40, 44]), second-order cone programs (SOCP) for the group Lasso (e.g., [19, 24, 45]) and SDP relaxations for various problems, including sparse PCA and low-rank matrix estimation (e.g., [3, 11, 28, 36, 37, 39]). Many of these programs are instances of convex conic programs, and so can (in principle) be solved to ε-accuracy in polynomial time using interior point methods, and other standard methods from convex programming; for example, see the books [6, 8]. However, the complexity of such quasi-Newton methods can be prohibitively expensive for the very large-scale problems that arise from high-dimensional data sets. Accordingly, recent years have witnessed a renewed interest in simpler first-order methods, among them the methods of projected gradient descent and mirror descent. Several authors (e.g., [4, 5, 20]) have used variants of Nesterov's accelerated gradient method [31] to obtain algorithms for high-dimensional statistical problems with a sublinear rate of convergence. Note that an optimization algorithm, generating a sequence of iterates {θ^t}_{t=0}^∞, is said to exhibit sublinear convergence to an optimum θ̂ if the optimization error ‖θ^t − θ̂‖ decays at the rate 1/t^κ, for some exponent κ > 0 and norm ‖·‖. It is known that this is the best possible convergence rate for gradient descent-type methods for convex programs under only Lipschitz conditions [30]. It is also known that much faster global rates, in particular a linear or geometric rate, can be achieved if global regularity conditions like strong convexity and smoothness are imposed [30]. An optimization algorithm is said to exhibit linear or geometric convergence if the optimization error ‖θ^t − θ̂‖ decays at a rate κ^t, for some contraction coefficient κ ∈ (0, 1). Note that such convergence is exponentially faster than sublinear convergence. For certain classes of problems involving polyhedral constraints and global smoothness, Tseng and Luo [25] have established geometric convergence. However, a challenging aspect of statistical estimation in high dimensions is that the underlying optimization problems can never be strongly convex in a global sense when d > n (since the d × d Hessian matrix is rank-deficient), and global smoothness conditions cannot hold when d/n → +∞.

Some more recent work has exploited structure specific to the optimization problems that arise in statistical settings. For the special case of sparse linear regression with random isotropic designs (also referred to as compressed sensing), some authors have established local linear convergence, meaning guarantees that apply once the iterates are close enough to the optimum [9, 17]. Also in the setting of compressed sensing, Tropp and Gilbert [41] studied finite convergence of greedy algorithms, while Garg and Khandekar [16] provide results for a thresholded gradient algorithm. In both of these results, the convergence happens up to a tolerance of the order of the noise variance, which is substantially larger than the true statistical precision of the problem.

The focus of this paper is the convergence rate of two simple gradient-based algorithms for solving optimization problems that underlie regularized M-estimators. For a constrained problem with a differentiable objective function, the projected gradient method generates a sequence of iterates {θ^t}_{t=0}^∞ by taking a step

FIG. 1. Convergence rates of projected gradient descent in application to Lasso (ℓ1-constrained least-squares). Each panel shows the log optimization error log ‖θ^t − θ̂‖ versus the iteration number t. Panel (a) shows three curves, corresponding to dimensions d ∈ {5000; 10,000; 20,000}, sparsity s = ⌈√d⌉ and all with the same sample size n = 2500. All cases show geometric convergence, but the rate for larger problems becomes progressively slower. (b) For an appropriately rescaled sample size (α = n/(s log d)), all three convergence rates should be roughly the same, as predicted by the theory.

in the negative gradient direction, and then projecting the result onto the constraint set. The composite gradient method of Nesterov [31] is well-suited to solving regularized problems formed by the sum of a differentiable and a nondifferentiable component. The main contribution of this paper is to establish a form of global geometric convergence for these algorithms that holds for a broad class of high-dimensional statistical problems.

In order to provide intuition for this guarantee, Figure 1 shows the performance of projected gradient descent for Lasso problems (ℓ1-constrained least-squares), each one based on a fixed sample size n = 2500 and varying dimensions d ∈ {5000; 10,000; 20,000}. In panel (a), we have plotted the logarithm of the optimization error, measured in terms of the Euclidean norm ‖θ^t − θ̂‖ between θ^t and an optimal solution θ̂, versus the iteration number t. Note that all curves are linear (on this logarithmic scale), revealing the geometric convergence predicted by our theory. Moreover, the results in panel (a) exhibit an interesting property: the convergence rate is dimension-dependent, meaning that for a fixed sample size, projected gradient descent converges more slowly for a larger problem than a smaller problem. This phenomenon reflects the natural intuition that larger problems are harder than smaller problems. A notable aspect of our theory is that it makes a quantitative prediction regarding the extent to which a larger problem is harder than a smaller one. In particular, our convergence rates suggest that if the sample size n is re-scaled according to the dimension d and also other

model parameters such as sparsity, then the convergence rates should be roughly similar. Panel (b) confirms this prediction: when the sample size is rescaled according to our theory (in particular, see Corollary 2 in Section 3.2), then all three curves lie essentially on top of each other.

Although high-dimensional optimization problems are typically neither strongly convex nor smooth, this paper shows that it is fruitful to consider suitably restricted notions of strong convexity and smoothness. Our notion of restricted strong convexity (RSC) is related to but slightly different than that of Negahban et al. [27] for establishing statistical consistency. We also introduce a related notion of restricted smoothness (RSM), not needed for proving statistical rates, but essential in the setting of optimization. Our analysis consists of two parts. We first show that for optimization problems underlying many regularized M-estimators, RSC/RSM conditions are sufficient to guarantee global linear convergence of projected gradient descent. Our second contribution is to prove that for the iterates generated by our methods, these RSC/RSM assumptions do hold with high probability for numerous statistical models, among them sparse linear models, models with group sparsity, and various matrix estimation problems, including matrix completion and matrix decomposition.

An interesting aspect of our results is that the geometric convergence is not guaranteed to an arbitrary precision, but only to an accuracy related to the statistical precision of the problem. For a given norm ‖·‖, the statistical precision is given by the mean-squared error E[‖θ̂ − θ*‖²] between the true parameter θ* and the solution θ̂ of the optimization problem. Our analysis guarantees geometric convergence to a parameter θ such that ‖θ − θ*‖ = ‖θ̂ − θ*‖ + o(‖θ̂ − θ*‖), which is the best we can hope for statistically, ignoring lower order terms. Overall, our results reveal an interesting connection between the statistical and computational properties of M-estimators; that is, the properties of the underlying statistical model that make it favorable for estimation also render it more amenable to optimization procedures.

The remainder of this paper is organized as follows. We begin in Section 2 with our setup and the necessary background. Section 3 is devoted to the statement of our main results and various corollaries. In Section 4, we provide a number of empirical results that confirm the sharpness of our theory. Proofs of our results are provided in the Supplementary Material [1].

2. Background and problem formulation. In this section, we begin by describing the class of regularized M-estimators to which our analysis applies, as well as the optimization algorithms that we analyze. Finally, we introduce some important notions that underlie our analysis, including the notion of a decomposable regularizer, and the properties of restricted strong convexity and smoothness.

2.1. Loss functions, regularization and gradient-based methods. Given a random variable Z ∼ P taking values in some set Z, let Z₁ⁿ = {Z₁, ..., Zₙ} be a sample of n observations. Assuming that P lies within some indexed family {P_θ, θ ∈ Ω}, the goal is to recover an estimate of the unknown true parameter θ* generating the data. Here Ω is some subset of R^d, where d is the ambient dimension of the problem. In order to measure the fit of any θ ∈ Ω to a given data set Z₁ⁿ, we introduce a loss function L_n : Ω × Zⁿ → R₊. By construction, for any given n-sample data set Z₁ⁿ ∈ Zⁿ, the loss function assigns a cost L_n(θ; Z₁ⁿ) ≥ 0 to the parameter θ ∈ Ω. In many applications, the loss function has a separable structure across the data set, meaning that L_n(θ; Z₁ⁿ) = (1/n) Σ_{i=1}^n ℓ(θ; Z_i), where ℓ : Ω × Z → R₊ is the loss function associated with a single data point.

Of primary interest in this paper are estimation problems that are underdetermined, meaning that the sample size n is smaller than the ambient dimension d. In such settings, without further restrictions on the parameter space Ω, there are various impossibility theorems, asserting that consistent estimates of the unknown parameter θ* cannot be obtained. For this reason, it is necessary to assume that the unknown parameter θ* either lies within a smaller subset of Ω, or is well-approximated by some member of such a subset. In order to incorporate these types of structural constraints, we introduce a regularizer R : Ω → R₊ over the parameter space. Given a user-defined radius ρ > 0, our analysis applies to the constrained M-estimator

(1)  θ̂_ρ ∈ arg min_{R(θ) ≤ ρ} L_n(θ; Z₁ⁿ),

as well as to the regularized M-estimator

(2)  θ̂_{λ_n} ∈ arg min_{R(θ) ≤ ρ̄} { L_n(θ; Z₁ⁿ) + λ_n R(θ) } =: arg min_{R(θ) ≤ ρ̄} φ_n(θ),

where the regularization weight λ_n > 0 is user-defined. Note that the radii ρ and ρ̄ may be different in general. Throughout this paper, we impose the following two conditions: (a) for any data set Z₁ⁿ, the function L_n(·; Z₁ⁿ) is convex and differentiable over Ω, and (b) the regularizer R is a norm. These conditions ensure that the overall problem is convex, so that by Lagrangian duality, the optimization problems (1) and (2) are equivalent. However, as our analysis will show, solving one or the other can be computationally more preferable depending upon the assumptions made. When the radius ρ or the regularization parameter λ_n is clear from the context, we will drop the subscript on θ̂ to ease the notation. Similarly, we frequently adopt the shorthand L_n(θ). Procedures based on optimization problems of either form are known as M-estimators in the statistics literature.

The focus of this paper is on two simple algorithms for solving the above optimization problems. The method of projected gradient descent applies naturally to the constrained problem (1), whereas the composite gradient descent method due to Nesterov [31] is suitable for solving the regularized problem (2). Each routine generates a sequence {θ^t}_{t=0}^∞ of iterates by first initializing to some parameter θ⁰, and then for t = 0, 1, 2, ..., applying the recursive update

(3)  θ^{t+1} = arg min_{θ ∈ B_R(ρ)} { L_n(θ^t) + ⟨∇L_n(θ^t), θ − θ^t⟩ + (γ_u/2) ‖θ − θ^t‖² }

in the case of projected gradient descent, or the update

(4)  θ^{t+1} = arg min_{θ ∈ B_R(ρ̄)} { L_n(θ^t) + ⟨∇L_n(θ^t), θ − θ^t⟩ + (γ_u/2) ‖θ − θ^t‖² + λ_n R(θ) }

for the composite gradient method. Note that the only difference between the two updates is the addition of the regularization term in the objective. These updates have a natural intuition: the next iterate θ^{t+1} is obtained by constrained minimization of a first-order approximation to the loss function, combined with a smoothing term that controls how far one moves from the current iterate in terms of Euclidean norm. Moreover, it is easily seen that update (3) is equivalent to

(5)  θ^{t+1} = Π( θ^t − (1/γ_u) ∇L_n(θ^t) ),

where Π ≡ Π_{B_R(ρ)} denotes Euclidean projection onto the regularizer norm ball B_R(ρ) := {θ | R(θ) ≤ ρ} of radius ρ. In this formulation, we see that the algorithm takes a step in the negative gradient direction, using the quantity 1/γ_u as stepsize parameter, and then projects the resulting vector onto the constraint set. Update (4) takes an analogous form; however, the projection will depend on both λ_n and γ_u. As will be illustrated in the examples to follow, for many problems, updates (3) and (4), or equivalently (5), have a very simple solution. For instance, in the case of ℓ1-regularization, they are easily computed by an appropriate form of soft-thresholding, as illustrated in the sketch below.
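To make the ℓ1 case concrete, the following minimal sketch (Python with NumPy; the function names are ours, not from the paper) implements update (5): a gradient step with step-size 1/γ_u followed by Euclidean projection onto the ℓ1-ball, here via the standard sorting-based projection; an O(d) variant is discussed in [14].

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto the l1-ball of the given radius (> 0).
    Sorting-based algorithm, O(d log d); O(d) variants exist [14]."""
    if np.sum(np.abs(v)) <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]          # sorted magnitudes, descending
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    k = np.max(np.nonzero(u > (css - radius) / ks)[0]) + 1   # largest feasible index
    tau = (css[k - 1] - radius) / k       # soft-thresholding level
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def projected_gradient_step(theta, grad, gamma_u, radius):
    """One projected gradient update (5): gradient step with step-size 1/gamma_u,
    then projection onto the l1-ball of radius rho."""
    return project_l1_ball(theta - grad / gamma_u, radius)
```

The same projection routine reappears in the matrix setting of Section 2.4.2, where it is applied to the vector of singular values.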

2.2. Restricted strong convexity and smoothness. In this section, we define the conditions on the loss function and regularizer that underlie our analysis. Global smoothness and strong convexity assumptions play an important role in the classical analysis of optimization algorithms [6, 8, 30]. In application to a differentiable loss function L_n, both of these properties are defined in terms of a first-order Taylor series expansion around a vector θ′ in the direction of θ, namely the quantity

(6)  T_L(θ; θ′) := L_n(θ) − L_n(θ′) − ⟨∇L_n(θ′), θ − θ′⟩.

By the assumed convexity of L_n, this error is always nonnegative, and global strong convexity is equivalent to imposing a stronger condition, namely that for some parameter γ_l > 0, the first-order Taylor error T_L(θ; θ′) is lower bounded by a quadratic term (γ_l/2) ‖θ − θ′‖² for all θ, θ′ ∈ Ω. Global smoothness is defined in a similar way, by imposing a quadratic upper bound on the Taylor error. It is known that under global smoothness and strong convexity assumptions, the method of projected gradient descent (3) enjoys a globally geometric convergence rate, meaning that there is some κ ∈ (0, 1) such that⁴

(7)  ‖θ^t − θ̂‖² ≲ κ^t ‖θ⁰ − θ̂‖²  for all iterations t = 0, 1, 2, ....

We refer the reader to Bertsekas [6], Proposition 1.2.3, page 145, or Nesterov [30], Theorem 2.2.8, page 88, for such results on projected gradient descent, and to Nesterov [31] for related results on composite gradient descent.

Unfortunately, in the high-dimensional setting (d > n), it is usually impossible to guarantee strong convexity of problem (1) in a global sense. For instance, when the data are drawn i.i.d., the loss function consists of a sum of n terms. If the loss is twice differentiable, the resulting d × d Hessian matrix ∇²L_n(θ; Z₁ⁿ) is often a sum of n matrices each with rank one, so that the Hessian is rank-degenerate when n < d. However, as we show in this paper, in order to obtain fast convergence rates for optimization method (3), it is sufficient that (a) the objective is strongly convex and smooth in a restricted set of directions, and (b) the algorithm approaches the optimum θ̂ only along these directions. Let us now formalize these ideas.

DEFINITION 1 [Restricted strong convexity (RSC)]. The loss function L_n satisfies restricted strong convexity with respect to R and with parameters (γ_l, τ_l(L_n)) over the set Ω′ if

(8)  T_L(θ; θ′) ≥ (γ_l/2) ‖θ − θ′‖² − τ_l(L_n) R²(θ − θ′)  for all θ, θ′ ∈ Ω′.

We refer to the quantity γ_l as the (lower) curvature parameter, and to the quantity τ_l as the tolerance parameter. The set Ω′ corresponds to a suitably chosen subset of the space of all possible parameters.⁵

In order to gain intuition for this definition, first suppose that condition (8) holds with tolerance parameter τ_l = 0. In this case, the regularizer plays no role in the definition, and condition (8) is equivalent to the usual definition of strong convexity on the optimization set. As discussed previously, this type of global strong convexity typically fails to hold for high-dimensional inference problems. In contrast, when the tolerance parameter τ_l is strictly positive, condition (8) is much milder,

⁴ In this statement (and throughout the paper), we use ≲ to mean an inequality that holds with some universal constant c, independent of the problem parameters.
⁵ As pointed out by a referee, our RSC condition is an instance of the general theory of paraconvexity (e.g., [32]); however, we are not aware of convergence rates for minimizing general paraconvex functions.

in that it only applies to a limited set of vectors. For a given pair θ ≠ θ′, consider the inequality

(9)  R²(θ − θ′) / ‖θ − θ′‖² < γ_l / (2 τ_l(L_n)).

If this inequality is violated, then the right-hand side of bound (8) is nonpositive, in which case the RSC constraint (8) is vacuous. Thus, RSC imposes a nontrivial constraint only on pairs θ ≠ θ′ for which inequality (9) holds, and a central part of our analysis will be to prove that for our methods, the optimization error Δ^t := θ^t − θ̂ satisfies a constraint of the form (9). We note that since the regularizer R is convex, strong convexity of the loss function L_n also implies the strong convexity of the regularized loss φ_n.

We also specify an analogous notion of restricted smoothness:

DEFINITION 2 [Restricted smoothness (RSM)]. We say the loss function L_n satisfies restricted smoothness with respect to R and with parameters (γ_u, τ_u(L_n)) over the set Ω′ if

(10)  T_L(θ; θ′) ≤ (γ_u/2) ‖θ − θ′‖² + τ_u(L_n) R²(θ − θ′)  for all θ, θ′ ∈ Ω′.

As with our definition of restricted strong convexity, the additional tolerance τ_u(L_n) is not present in analogous smoothness conditions in the optimization literature, but it is essential in our set-up.

2.3. Decomposable regularizers. In past work on the statistical properties of regularization, the notion of a decomposable regularizer has been shown to be useful [27]. Although the focus of this paper is a rather different set of questions, namely optimization as opposed to statistics, decomposability also plays an important role here. Decomposability is defined with respect to a pair of subspaces of the parameter space Ω ⊆ R^d. The set M is known as the model subspace, whereas the set M̄⊥, referred to as the perturbation subspace, captures deviations from the model subspace.

DEFINITION 3. Given a subspace pair (M, M̄⊥) such that M ⊆ M̄, we say that a norm R is (M, M̄⊥)-decomposable if

(11)  R(α + β) = R(α) + R(β)  for all α ∈ M and β ∈ M̄⊥.

To gain some intuition for this definition, note that by the triangle inequality, we always have the bound R(α + β) ≤ R(α) + R(β). For a decomposable regularizer, this inequality always holds with equality. Thus, given a fixed vector α ∈ M, the key property of any decomposable regularizer is that it affords the maximum penalization of any deviation β ∈ M̄⊥. For a given error norm ‖·‖, its interaction with the regularizer R plays an important role in our results. In particular, we have the following:

DEFINITION 4 (Subspace compatibility). Given the regularizer R(·) and a norm ‖·‖, the associated subspace compatibility is given by

(12)  Ψ(M̄) := sup_{θ ∈ M̄\{0}} R(θ)/‖θ‖  when M̄ ≠ {0}, and Ψ({0}) := 0.

The quantity Ψ(M̄) corresponds to the Lipschitz constant of the norm R with respect to ‖·‖, when restricted to the subspace M̄.

2.4. Some illustrative examples. We now describe some particular examples of M-estimators with decomposable regularizers, and discuss the form of the projected gradient updates as well as RSC/RSM conditions. We cover two main families of examples: log-linear models with sparsity constraints and ℓ1-regularization (Section 2.4.1), and matrix regression problems with nuclear norm regularization (Section 2.4.2).

2.4.1. Sparse log-linear models and ℓ1-regularization. Suppose that each sample Z_i consists of a scalar-vector pair (y_i, x_i) ∈ R × R^d, corresponding to the scalar response y_i ∈ R associated with a vector of predictors x_i ∈ R^d. A log-linear model with canonical link function assumes that the response y_i is linked to the covariate vector x_i via a conditional distribution of the form P(y_i | x_i; θ, σ) ∝ exp{ [y_i ⟨θ, x_i⟩ − Φ(⟨θ, x_i⟩)] / c(σ) }, where c(σ) is a known scaling parameter, Φ(·) is a known log-partition function and θ ∈ R^d is an unknown regression vector. In many applications, θ* is relatively sparse, so that it is natural to impose an ℓ1-constraint. Computing the maximum likelihood estimate subject to such a constraint involves solving the convex program⁶

(13)  θ̂ ∈ arg min_θ (1/n) Σ_{i=1}^n { Φ(⟨θ, x_i⟩) − y_i ⟨θ, x_i⟩ }  such that ‖θ‖₁ ≤ ρ,

where the objective is the loss L_n(θ; Z₁ⁿ). We refer to this estimator as the log-linear Lasso; it is a special case of the M-estimator (1). Ordinary linear regression is the special case of the log-linear setting with Φ(t) = t²/2 and Ω = R^d, and in this case, estimator (13) corresponds to the ordinary least-squares version of the Lasso [13, 40]. Other forms of log-linear Lasso that are of interest include logistic regression, Poisson regression and multinomial regression.

⁶ The function Φ is convex since it is the log-partition function of a canonical exponential family.

Projected gradient updates. For the log-linear loss from equation (13), an easy calculation yields the gradient ∇L_n(θ) = (1/n) Σ_{i=1}^n x_i { Φ′(⟨θ, x_i⟩) − y_i }, and update (5) corresponds to the Euclidean projection of the vector θ^t − γ_u^{-1} ∇L_n(θ^t) onto the ℓ1-ball of radius ρ. It is well known that this projection can be characterized in terms of soft-thresholding, and that the projected update (5) can be computed in O(d) operations [14].

Composite gradient updates. The composite gradient update for this problem amounts to solving

θ^{t+1} = arg min_{‖θ‖₁ ≤ ρ̄} { ⟨∇L_n(θ^t), θ⟩ + (γ_u/2) ‖θ − θ^t‖² + λ_n ‖θ‖₁ }.

The update can be computed by two soft-thresholding operations. The first step is soft-thresholding the vector θ^t − (1/γ_u) ∇L_n(θ^t) at a level λ_n/γ_u. If the resulting vector has ℓ1-norm greater than ρ̄, then we project onto the ℓ1-ball as before. Overall, the complexity of the update is still O(d). (A sketch of this two-step update appears below, following the discussion of decomposability.)

Decomposability of the ℓ1-norm. We now illustrate how the ℓ1-norm is decomposable with respect to appropriately chosen subspaces. For any subset S ⊆ {1, 2, ..., d}, consider the subspace

(14)  M(S) := { α ∈ R^d | α_j = 0 for all j ∉ S },

corresponding to all vectors supported only on S. Defining M̄(S) = M(S), its orthogonal complement (with respect to the Euclidean inner product) is given by M⊥(S) = M̄⊥(S) = { β ∈ R^d | β_j = 0 for all j ∈ S }. Since any pair of vectors α ∈ M(S) and β ∈ M̄⊥(S) have disjoint supports, it follows that ‖α‖₁ + ‖β‖₁ = ‖α + β‖₁. Consequently, for any subset S, the ℓ1-norm is decomposable with respect to the pairs (M(S), M̄⊥(S)). In analogy to the ℓ1-norm, various types of group-sparse norms are also decomposable with respect to nontrivial subspace pairs. We refer the reader to the paper [27] for further examples of such decomposable norms.
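Returning to the projected and composite gradient updates for the log-linear Lasso described above, the sketch below spells them out for the logistic special case, Φ(t) = log(1 + e^t). It reuses project_l1_ball from the earlier sketch, and the function names and step-size handling are illustrative choices of ours rather than constructions from the paper.

```python
import numpy as np

def logistic_loss_grad(theta, X, y):
    """Loss (13) with Phi(t) = log(1 + exp(t)) and its gradient
    grad L_n(theta) = (1/n) X^T (Phi'(X theta) - y)."""
    n = X.shape[0]
    z = X @ theta
    loss = np.mean(np.logaddexp(0.0, z) - y * z)
    grad = X.T @ (1.0 / (1.0 + np.exp(-z)) - y) / n
    return loss, grad

def soft_threshold(v, tau):
    """Elementwise soft-thresholding at level tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def composite_gradient_step(theta, grad, gamma_u, lam, radius):
    """Composite update (4) for the l1-regularizer: soft-threshold the gradient
    step at level lam/gamma_u, then project onto the l1-ball of radius rho-bar
    only if that constraint is violated."""
    v = soft_threshold(theta - grad / gamma_u, lam / gamma_u)
    if np.sum(np.abs(v)) > radius:
        v = project_l1_ball(v, radius)   # projection routine from the earlier sketch
    return v
```

Both steps amount to a constant number of passes over the d coordinates, apart from the sorting inside the projection routine, which matches the complexity discussion above.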

RSC/RSM conditions. A calculation using the mean-value theorem shows that for loss function (13), the error in the first-order Taylor series, as previously defined in equation (6), can be written as

T_L(θ; θ′) = (1/n) Σ_{i=1}^n Φ″(⟨θ_t, x_i⟩) ⟨x_i, θ − θ′⟩²,

where θ_t = tθ + (1 − t)θ′ for some t ∈ [0, 1]. When n < d, then we can always find pairs θ ≠ θ′ such that ⟨x_i, θ − θ′⟩ = 0 for all i = 1, 2, ..., n, showing that the objective function can never be strongly convex. On the other hand, RSC for log-linear models requires only that there exist positive numbers (γ_l, τ_l(L_n)) such that for all θ, θ′ ∈ Ω′,

(15)  (1/n) Σ_{i=1}^n Φ″(⟨θ_t, x_i⟩) ⟨x_i, θ − θ′⟩² ≥ (γ_l/2) ‖θ − θ′‖² − τ_l(L_n) R²(θ − θ′),

where Ω′ := B₂(R). This restriction is essential because for many generalized linear models (e.g., logistic), the Hessian function Φ″ approaches zero as its argument diverges. RSM imposes an analogous upper bound on the Taylor error. For a broad class of log-linear models, such bounds hold with tolerances τ_l(L_n) and τ_u(L_n) of the order (log d)/n. A detailed discussion of RSC for exponential families can be found in the paper [27].

In the special case of linear regression, we have Φ″(t) = 1 for all t ∈ R, so that the lower bound (15) involves only the Gram matrix X^T X/n. (Here X ∈ R^{n×d} is the usual design matrix, with x_i ∈ R^d as its ith row.) For linear regression and ℓ1-regularization, the RSC condition is equivalent to

(16)  ‖X(θ − θ′)‖²₂ / n ≥ (γ_l/2) ‖θ − θ′‖²₂ − τ_l(L_n) ‖θ − θ′‖²₁  for all θ, θ′ ∈ Ω′.

Such a condition corresponds to a variant of the restricted eigenvalue (RE) conditions that have been studied in the literature [7, 42]. Such RE conditions are significantly milder than the restricted isometry property; we refer the reader to van de Geer and Bühlmann [42] for an in-depth comparison of different RE conditions. From past work, condition (16) is satisfied with high probability with a constant γ_l > 0 and tolerance τ_l(L_n) ≍ (log d)/n for a broad class of anisotropic random design matrices [33, 38], and parts of our analysis make use of this fact.

2.4.2. Matrices and nuclear norm regularization. We now discuss a general class of matrix regression problems that falls within our framework. Consider the space of d₁ × d₂ matrices endowed with the trace inner product ⟨⟨A, B⟩⟩ := trace(A^T B). Let Θ* ∈ R^{d₁×d₂} be an unknown matrix and suppose that for i = 1, 2, ..., n, we observe the pair Z_i = (y_i, X_i) ∈ R × R^{d₁×d₂}, where the scalar response y_i and covariate matrix X_i are linked to the unknown matrix Θ* via the linear model

(17)  y_i = ⟨⟨X_i, Θ*⟩⟩ + w_i  for i = 1, 2, ..., n.

Here w_i is an additive observation noise. In many contexts, it is natural to assume that Θ* is exactly low-rank, or approximately so, meaning that it is well-approximated by a matrix of low rank. In such settings, a number of authors (e.g., [15, 28, 37]) have studied the M-estimator

(18)  Θ̂ ∈ arg min_{Θ ∈ R^{d₁×d₂}} (1/2n) Σ_{i=1}^n (y_i − ⟨⟨X_i, Θ⟩⟩)²  such that |||Θ|||₁ ≤ ρ,

or the corresponding regularized version. Defining d = min{d₁, d₂}, the nuclear or trace norm is given by |||Θ|||₁ := Σ_{j=1}^d σ_j(Θ), corresponding to the sum of the singular values. As discussed in Section 3.3, there are various applications in which this estimator and variants thereof have proven useful.

Form of projected gradient descent. For the M-estimator (18), the projected gradient updates take a very simple form, namely

(19)  Θ^{t+1} = Π( Θ^t + (1/(γ_u n)) Σ_{i=1}^n (y_i − ⟨⟨X_i, Θ^t⟩⟩) X_i ),

where Π denotes Euclidean (i.e., in Frobenius norm) projection onto the nuclear norm ball B_N(ρ) = { Θ ∈ R^{d₁×d₂} | |||Θ|||₁ ≤ ρ }. This nuclear norm projection can be obtained by first computing the singular value decomposition (SVD), and then projecting the vector of singular values onto the ℓ1-ball. The latter step can be achieved by the fast projection algorithms discussed earlier, and there are various methods for fast computation of SVDs. The composite gradient update also has a simple form, requiring at most two singular value thresholding operations. (See the sketch at the end of this subsection.)

Decomposability of the nuclear norm. We now define matrix subspaces for which the nuclear norm is decomposable. Defining d := min{d₁, d₂}, let U ∈ R^{d₁×d} and V ∈ R^{d₂×d} be arbitrary matrices with orthonormal columns. Using col to denote the column span of a matrix, we define the subspaces⁷

M(U, V) := { Θ ∈ R^{d₁×d₂} | col(Θ^T) ⊆ col(V), col(Θ) ⊆ col(U) } and
M̄⊥(U, V) := { Θ ∈ R^{d₁×d₂} | col(Θ^T) ⊆ (col(V))⊥, col(Θ) ⊆ (col(U))⊥ }.

Finally, let us verify the decomposability of the nuclear norm. By construction, any pair of matrices Θ ∈ M(U, V) and Γ ∈ M̄⊥(U, V) have orthogonal row and column spaces, which implies the required decomposability condition, namely |||Θ + Γ|||₁ = |||Θ|||₁ + |||Γ|||₁. Finally, we note that in some special cases such as matrix completion or matrix decomposition, Ω′ will involve an additional bound on the entries of Θ* as well as the iterates Θ^t to establish RSC/RSM conditions.

⁷ Note that the model space M(U, V) is not equal to M̄(U, V). Nonetheless, as required by Definition 3, we do have the inclusion M(U, V) ⊆ M̄(U, V).
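The next sketch makes the nuclear-norm machinery above concrete: it projects onto the nuclear-norm ball by combining an SVD with the ℓ1-projection of the singular values, and then performs one projected gradient update (19) for the least-squares objective (18). It assumes that project_l1_ball from the earlier sketch is in scope; the function names and the list-of-matrices representation of the covariates are illustrative choices of ours.

```python
import numpy as np

def project_nuclear_ball(Theta, radius):
    """Frobenius-norm projection onto the nuclear-norm ball B_N(radius):
    compute an SVD and project the vector of singular values onto the l1-ball."""
    U, s, Vt = np.linalg.svd(Theta, full_matrices=False)
    s_proj = project_l1_ball(s, radius)      # l1 projection from the earlier sketch
    return (U * s_proj) @ Vt

def matrix_projected_gradient_step(Theta, X_list, y, gamma_u, radius):
    """One update (19) for the least-squares matrix regression objective (18)."""
    n = len(y)
    residuals = np.array([np.sum(X_i * Theta) for X_i in X_list]) - y
    grad = sum(r * X_i for r, X_i in zip(residuals, X_list)) / n
    return project_nuclear_ball(Theta - grad / gamma_u, radius)
```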

3. Main results and some consequences. We are now equipped to state the two main results of our paper, and discuss some of their consequences. We illustrate their application to several statistical models, including sparse regression (Section 3.2), matrix estimation with rank constraints (Section 3.3) and matrix decomposition problems (Section 3.4). The proofs of all our results can be found in the Supplementary Material [1].

3.1. Geometric convergence. Recall that the projected gradient algorithm (3) is well suited to solving an M-estimation problem in its constrained form, whereas the composite gradient algorithm (4) is appropriate for a regularized problem. Accordingly, let θ̂ be any optimum of the constrained problem (1), or the regularized problem (2), and let {θ^t}_{t=0}^∞ be a sequence of iterates generated by the projected gradient updates (3), or the composite gradient updates (4), respectively. Of primary interest to us are bounds on the optimization error, which can be measured either in terms of the error vector Δ^t := θ^t − θ̂, or the difference between the objective values at θ^t and θ̂. In this section, we state two main results, Theorems 1 and 2, corresponding to the constrained and regularized cases, respectively. In addition to the optimization error previously discussed, both of these results involve the statistical error Δ* := θ̂ − θ* between the optimum θ̂ and the nominal parameter θ*. At a high level, these results guarantee that under the RSC/RSM conditions, the optimization error shrinks geometrically, with a contraction coefficient that depends on the loss function L_n via the parameters (γ_l, τ_l(L_n)) and (γ_u, τ_u(L_n)). An interesting feature is that the contraction occurs only up to a certain tolerance ε² depending on these same parameters, and the statistical error. However, as we discuss, for many statistical problems of interest, we can show that this tolerance ε² is of a lower order than the intrinsic statistical error, and consequently our theory gives an upper bound on the number of iterations required to solve an M-estimation problem up to the statistical precision.

Convergence rates for projected gradient. We now provide the notation necessary for a precise statement of this claim. Our main result involves a family of upper bounds, one for each pair (M, M̄⊥) of R-decomposable subspaces; see Definition 3. This subspace choice can be optimized for different models to obtain the tightest possible bounds. For a given pair (M, M̄⊥) such that 16 Ψ²(M̄) τ_u(L_n) < γ_u, let us define the contraction coefficient

(20)  κ(L_n; M̄) := { 1 − γ_l/γ_u + 16 Ψ²(M̄)(τ_u(L_n) + τ_l(L_n))/γ_u } { 1 − 16 Ψ²(M̄) τ_u(L_n)/γ_u }^{-1}.

In addition, we define the tolerance parameter

(21)  ε²(Δ*; M, M̄⊥) := (32 (τ_u(L_n) + τ_l(L_n))/γ_u) ( 2 R(Π_{M̄⊥}(θ*)) + Ψ(M̄) ‖Δ*‖ + 2 R(Δ*) )²,

where Δ* = θ̂ − θ* is the statistical error, and Π_{M̄⊥}(θ*) denotes the Euclidean projection of θ* onto the subspace M̄⊥. In terms of these two ingredients, we now state our first main result:

THEOREM 1. Suppose that the loss function L_n satisfies the RSC/RSM conditions with parameters (γ_l, τ_l(L_n)) and (γ_u, τ_u(L_n)), respectively. Let (M, M̄⊥) be any R-decomposable pair of subspaces such that M ⊆ M̄ and

(22)  0 < κ(L_n, M̄) < 1.

Then for any optimum θ̂ of the problem (1) for which the constraint is active, for all iterations t = 0, 1, 2, ..., we have

(23)  ‖θ^{t+1} − θ̂‖² ≤ κ^t ‖θ⁰ − θ̂‖² + ε²(Δ*; M, M̄⊥)/(1 − κ),

where κ ≡ κ(L_n, M̄).

REMARKS. Theorem 1 actually provides a family of upper bounds, one for each R-decomposable pair (M, M̄⊥) such that condition (22) holds. This condition is always satisfied by setting M̄ equal to the trivial subspace {0}: indeed, by definition (12) of the subspace compatibility, we have Ψ({0}) = 0, and hence κ(L_n; {0}) = (1 − γ_l/γ_u) < 1. Although this choice of M̄ minimizes the contraction coefficient, it will lead⁸ to a very large tolerance parameter ε²(Δ*; M, M̄⊥). A more typical application of Theorem 1 involves nontrivial choices of the subspace M̄.

Bound (23) guarantees that the optimization error decreases geometrically, with contraction factor κ ∈ (0, 1), up to a certain tolerance proportional to ε²(Δ*; M, M̄⊥), as illustrated in Figure 2(a). Whenever the tolerance terms in the RSC/RSM conditions decay to zero as the sample size n increases (the typical case), then the contraction factor κ approaches 1 − γ_l/γ_u. The appearance of the ratio γ_l/γ_u is natural since it measures the conditioning of the objective function; more specifically, it is essentially a restricted condition number of the Hessian matrix. On the other hand, the residual error ε defined in equation (21) depends on the choice of decomposable subspaces, the parameters of the RSC/RSM conditions and the statistical error Δ* = θ̂ − θ*. In the corollaries of Theorem 1 to follow, we show that the subspaces can often be chosen such that ε²(Δ*; M, M̄⊥) = o(‖θ̂ − θ*‖²). Consequently, bound (23) guarantees geometric convergence up to a residual error smaller than the statistical precision, as illustrated in Figure 2(b). This is sensible, since in statistical settings, there is no point to optimizing beyond the statistical precision.

The result of Theorem 1 takes a simpler form when there is a subspace M that includes θ*, and the R-ball radius is chosen such that ρ ≤ R(θ*).

⁸ Indeed, the setting M̄⊥ = R^d means that the term R(Π_{M̄⊥}(θ*)) = R(θ*) appears in the tolerance; this quantity is far larger than the statistical precision.

FIG. 2. (a) Generic illustration of Theorem 1. The optimization error Δ^t = θ^t − θ̂ is guaranteed to decrease geometrically with coefficient κ ∈ (0, 1), up to the tolerance ε² = ε²(Δ*; M, M̄⊥), represented by the circle. (b) Relation between the optimization tolerance ε²(Δ*; M, M̄⊥) (solid circle) and the statistical precision ‖Δ*‖ = ‖θ̂ − θ*‖ (dotted circle). In many settings, we have ε²(Δ*; M, M̄⊥) ≪ ‖Δ*‖².

COROLLARY 1. In addition to the conditions of Theorem 1, suppose that θ* ∈ M and ρ ≤ R(θ*). Then as long as Ψ²(M̄)(τ_u(L_n) + τ_l(L_n)) = o(1), we have for all iterations t = 0, 1, 2, ...,

(24)  ‖θ^{t+1} − θ̂‖² ≤ κ^t ‖θ⁰ − θ̂‖² + o(‖θ̂ − θ*‖²).

Thus, Corollary 1 guarantees that the optimization error decreases geometrically, with contraction factor κ, up to a tolerance that is of strictly lower order than the statistical precision ‖θ̂ − θ*‖². As will be clarified in several examples to follow, the condition Ψ²(M̄)(τ_u(L_n) + τ_l(L_n)) = o(1) is satisfied for many statistical models, including sparse linear regression and low-rank matrix regression. This result is illustrated in Figure 2(b), where the solid circle represents the optimization tolerance, and the dotted circle represents the statistical precision. In the results to follow, we quantify the term o(‖θ̂ − θ*‖²) in a more precise manner for different statistical models.

Convergence rates for composite gradient. We now present our main result for the composite gradient iterates (4) that are suitable for the Lagrangian-based estimator (2). As before, our analysis yields a range of bounds indexed by subspace pairs (M, M̄⊥) that are R-decomposable. For any subspace M̄ such that 64 τ_l(L_n) Ψ²(M̄) < γ_l, we define the effective RSC coefficient as

(25)  γ̄_l := γ_l − 64 τ_l(L_n) Ψ²(M̄).

This coefficient accounts for the residual amount of strong convexity after accounting for the lower tolerance terms. In addition, we define the compound contraction

coefficient as

(26)  κ(L_n; M̄) := { 1 − γ̄_l/(4γ_u) + 64 Ψ²(M̄) τ_u(L_n)/γ̄_l } ξ(M̄),

where ξ(M̄) := (1 − 64 τ_u(L_n) Ψ²(M̄)/γ̄_l)^{-1}, and Δ* = θ̂_{λ_n} − θ* is the statistical error vector⁹ for a specific choice of ρ̄ and λ_n. As before, the coefficient κ measures the geometric rate of convergence for the algorithm. Finally, we define the compound tolerance parameter

(27)  ε²(Δ*; M, M̄⊥) := 8 ξ(M̄) β(M̄) ( 6 Ψ(M̄) ‖Δ*‖ + 8 R(Π_{M̄⊥}(θ*)) )²,

where β(M̄) := 2(γ̄_l/γ_l) τ_l(L_n) + 8 τ_u(L_n) + 2 τ_l(L_n). As with our previous result, the tolerance parameter determines the radius up to which geometric convergence can be attained.

Recall that the regularized problem (2) involves both a regularization weight λ_n and a constraint radius ρ̄. Our theory requires that the constraint radius is chosen such that ρ̄ ≥ R(θ*), which ensures that θ* is feasible. In addition, the regularization parameter should be chosen to satisfy

(28)  λ_n ≥ 2 R*(∇L_n(θ*)),

where R* is the dual norm of the regularizer. This constraint is known to play an important role in proving bounds on the statistical error of regularized M-estimators; see the paper [27] and references therein for further details. Recalling definition (2) of the overall objective function φ_n, the following result provides bounds on the excess loss φ_n(θ^t) − φ_n(θ̂_{λ_n}).

THEOREM 2. Consider the optimization problem (2) for a radius ρ̄ such that θ* is feasible, and a regularization parameter λ_n satisfying bound (28), and suppose that the loss function L_n satisfies the RSC/RSM conditions with parameters (γ_l, τ_l(L_n)) and (γ_u, τ_u(L_n)), respectively. Let (M, M̄⊥) be any R-decomposable pair such that

(29)  κ ≡ κ(L_n, M̄) ∈ [0, 1)  and  (32 ρ̄ ξ(M̄) β(M̄)) / (1 − κ(L_n; M̄)) ≤ λ_n.

Then for any δ² ≥ ε²(Δ*; M, M̄⊥)/(1 − κ), we have φ_n(θ^t) − φ_n(θ̂_{λ_n}) ≤ δ² for all

(30)  t ≥ 2 log((φ_n(θ⁰) − φ_n(θ̂_{λ_n}))/δ²) / log(1/κ) + log₂ log₂(ρ̄ λ_n/δ²) (1 + log 2/log(1/κ)).

⁹ When the context is clear, we remind the reader that we drop the subscript λ_n on the parameter θ̂.

REMARKS. Note that bound (30) guarantees that the excess loss φ_n(θ^t) − φ_n(θ̂) decays geometrically up to any squared error δ² larger than the compound tolerance (27). Moreover, the RSC condition also allows us to further translate this result to a bound on the optimization error θ^t − θ̂. In particular, for any iterate θ^t such that φ_n(θ^t) − φ_n(θ̂) ≤ δ², we are guaranteed that

(31)  ‖θ^t − θ̂_{λ_n}‖² ≤ 2δ²/γ̄_l + 2δ⁴ τ_l(L_n)/(γ̄_l λ_n²) + 4 τ_l(L_n)( 6 Ψ(M̄) ‖Δ*‖ + 8 R(Π_{M̄⊥}(θ*)) )²/γ̄_l.

In conjunction with Theorem 2, we see that it suffices to take a number of steps that is logarithmic in the inverse tolerance (1/δ), again showing a geometric rate of convergence. Whereas Theorem 1 requires setting the radius so that the constraint is active, Theorem 2 has only a very mild constraint on the radius ρ̄, namely that it be large enough such that ρ̄ ≥ R(θ*). The reason for this much milder requirement is that the additive regularization with weight λ_n suffices to constrain the solution, whereas the extra side constraint is only needed to ensure good behavior of the optimization algorithm in the first few iterations.

Step-size setting. It seems that updates (3) and (4) need to know the smoothness bound γ_u in order to set the step-size for gradient updates. However, we can use the same doubling trick as described in Algorithm 3.1 of Nesterov [31]. At each step, we check if the smoothness upper bound holds at the current iterate relative to the previous one. If the condition does not hold, we double our estimate of γ_u and resume. Nesterov [31] demonstrates that this guarantees a geometric convergence with a contraction factor worse at most by a factor of 2, compared to knowing γ_u exactly. (A sketch of this doubling heuristic appears below.)

The following subsections are devoted to the development of some consequences of Theorems 1 and 2 and Corollary 1 for some specific statistical models, among them sparse linear regression with ℓ1-regularization, and matrix regression with nuclear norm regularization. In contrast to the entirely deterministic arguments that underlie Theorems 1 and 2, these corollaries involve probabilistic arguments, more specifically in order to establish that the RSC and RSM properties hold with high probability.
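The doubling heuristic just described can be implemented with a simple check at each iterate. The sketch below uses illustrative names of ours and checks the plain quadratic upper bound from Nesterov's scheme, ignoring the tolerance term of the RSM condition; it doubles the estimate of γ_u until the bound holds at the candidate iterate.

```python
def doubling_step(theta, loss_grad, update_step, gamma_u):
    """One iterate with an adaptively estimated smoothness constant.

    loss_grad(theta) -> (loss value, gradient);
    update_step(theta, grad, gamma_u) -> candidate iterate, e.g. the
    projected_gradient_step or composite_gradient_step from the earlier sketches.
    Doubles gamma_u until the quadratic upper bound holds, then accepts."""
    f_t, grad_t = loss_grad(theta)
    while True:
        theta_next = update_step(theta, grad_t, gamma_u)
        f_next, _ = loss_grad(theta_next)
        diff = theta_next - theta
        # quadratic upper bound check between current and candidate iterates
        if f_next <= f_t + grad_t @ diff + 0.5 * gamma_u * (diff @ diff):
            return theta_next, gamma_u
        gamma_u *= 2.0
```

In practice one may start the next iteration from the last accepted estimate of γ_u rather than resetting it, so that the doubling phase is only paid once.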

3.2. Sparse vector regression. Recall from Section 2.4.1 the observation model for sparse linear regression. In a variety of applications, it is natural to assume that θ* is sparse. For a parameter q ∈ [0, 1] and radius R_q > 0, let us define the ℓq-ball

(32)  B_q(R_q) := { θ ∈ R^d | Σ_{j=1}^d |θ_j|^q ≤ R_q }.

Note that q = 0 corresponds to the case of hard sparsity, for which any vector β ∈ B₀(R₀) is supported on a set of cardinality at most R₀. For q ∈ (0, 1], membership in the set B_q(R_q) enforces a decay rate on the ordered coefficients, thereby modeling approximate sparsity. In order to estimate the unknown regression vector θ* ∈ B_q(R_q), we consider the least-squares Lasso estimator from Section 2.4.1, based on L_n(θ; Z₁ⁿ) := (1/2n) ‖y − Xθ‖²₂, where X ∈ R^{n×d} is the design matrix. In order to state a concrete result, we consider a random design matrix X, in which each row x_i ∈ R^d is drawn i.i.d. from a N(0, Σ) distribution, where Σ is the covariance matrix. We use σ_max(Σ) and σ_min(Σ) to refer to the maximum and minimum eigenvalues of Σ, respectively, and ζ(Σ) := max_{j=1,2,...,d} Σ_jj for the maximum variance. We also assume that the observation noise is zero-mean and ν²-sub-Gaussian.

Guarantees for constrained Lasso. Our convergence rate on the optimization error ‖θ^t − θ̂‖ is stated in terms of the contraction coefficient

(33)  κ := { 1 − σ_min(Σ)/(4σ_max(Σ)) + χ_n(Σ) } { 1 − χ_n(Σ) }^{-1},

where we have adopted the shorthand

(34)  χ_n(Σ) := (c₀ ζ(Σ)/σ_max(Σ)) R_q ((log d)/n)^{1−q/2}  for q > 0, and  χ_n(Σ) := (c₀ ζ(Σ)/σ_max(Σ)) (s log d)/n  for q = 0,

for a numerical constant c₀. We assume that χ_n(Σ) is small enough to ensure that κ ∈ (0, 1); in terms of the sample size, this amounts to a condition of the form n = Ω(R_q^{1/(1−q/2)} log d). Such a scaling is sensible, since it is known from minimax theory on sparse linear regression [34] to be necessary for any method to be statistically consistent over the ℓq-ball. With this set-up, we have the following consequence of Theorem 1:

COROLLARY 2 (Sparse vector recovery). Under the conditions of Theorem 1, suppose that we solve the constrained Lasso with ρ ≤ ‖θ*‖₁ and γ_u = 2σ_max(Σ).

(a) Exact sparsity: Suppose that θ* is supported on a subset of cardinality s. Then the iterates (3) satisfy

(35)  ‖θ^t − θ̂‖²₂ ≤ κ^t ‖θ⁰ − θ̂‖²₂ + c₂ χ_n(Σ) ‖θ̂ − θ*‖²₂  for all t = 0, 1, 2, ...

with probability at least 1 − exp(−c₁ log d).

(b) Weak sparsity: Suppose that θ* ∈ B_q(R_q) for some q ∈ (0, 1]. Then the error ‖θ^t − θ̂‖²₂ in the iterates (3) is at most

(36)  κ^t ‖θ⁰ − θ̂‖²₂ + c₂ χ_n(Σ) { R_q ((log d)/n)^{1−q/2} + ‖θ̂ − θ*‖²₂ }

for all t = 0, 1, 2, ... with probability at least 1 − exp(−c₁ log d).
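As a quick numerical illustration of the geometric decay predicted by Corollary 2(a) (and visible in Figure 1), the following toy script runs the projected gradient updates on a synthetic Lasso instance. It assumes that project_l1_ball from the sketch in Section 2.1 is in scope; the problem sizes, noise level and step-size choice are illustrative only, and the final iterate is used as a stand-in for the optimum θ̂.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 500, 1000, 30                       # illustrative sizes with d > n
theta_star = np.zeros(d)
theta_star[:s] = rng.standard_normal(s)
X = rng.standard_normal((n, d))               # identity-covariance random design
y = X @ theta_star + 0.5 * rng.standard_normal(n)

rho = np.sum(np.abs(theta_star))              # radius chosen so the constraint is active
gamma_u = 2.0 * (np.linalg.norm(X, ord=2) ** 2) / n   # crude empirical smoothness estimate
theta = np.zeros(d)
iterates = []
for t in range(150):
    grad = X.T @ (X @ theta - y) / n
    theta = project_l1_ball(theta - grad / gamma_u, rho)
    iterates.append(theta.copy())

theta_hat = iterates[-1]                      # stand-in for an optimal solution
log_errors = [np.log10(np.linalg.norm(th - theta_hat) + 1e-15) for th in iterates[:-1]]
print(log_errors[::15])                       # roughly linear decay, as in Figure 1(a)
```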

We can now compare part (a), which deals with the special case of exactly sparse vectors, to some past work that has established convergence guarantees for optimization algorithms for sparse linear regression. Certain methods are known to converge at sublinear rates (e.g., [5]), more specifically at the rate O(1/t²). The geometric rate of convergence guaranteed by Corollary 2 is exponentially faster. Other work on sparse regression has provided geometric rates of convergence that hold once the iterates are close to the optimum [9, 17], or geometric convergence up to the noise level ν² using various methods, including greedy methods [41] and thresholded gradient methods [16]. In contrast, Corollary 2 guarantees geometric convergence for all iterates up to a precision below that of the statistical error. For these problems, the statistical error ν² s (log d)/n is typically much smaller than the noise variance ν², and decreases as the sample size is increased.

In addition, Corollary 2 also applies to the case of approximately sparse vectors, lying within the set B_q(R_q) for q ∈ (0, 1]. There are some important differences between the case of exact sparsity and that of approximate sparsity. Part (a) guarantees geometric convergence to a tolerance depending only on the statistical error ‖θ̂ − θ*‖₂. In contrast, the second result also has the additional term R_q((log d)/n)^{1−q/2}. This second term arises due to the statistical nonidentifiability of linear regression over the ℓq-ball, and it is no larger than ‖θ̂ − θ*‖²₂ with high probability. This fact follows from known results [34] about minimax rates for linear regression over ℓq-balls; these unimprovable rates include a term of this order.

Guarantees for regularized Lasso. Using similar methods, we can also use Theorem 2 to obtain an analogous guarantee for the regularized Lasso estimator. Here we focus only on the case of exact sparsity, although the result extends to approximate sparsity in a similar fashion. Letting c_i, i = 0, 1, 2, 3, 4 be universal positive constants, we define the modified curvature constant γ̄_l := γ_l − c₀ (s log d/n) ζ(Σ). Our results assume that n = Ω(s log d), a condition known to be necessary for statistical consistency, so that γ̄_l > 0. The contraction factor then takes the form

κ := { 1 − σ_min(Σ)/(16σ_max(Σ)) + c₁ χ_n(Σ) } { 1 − c₂ χ_n(Σ) }^{-1},  where

(37)  χ_n(Σ) := (ζ(Σ)/γ̄_l) (s log d)/n.

The residual error in the optimization is given by

ε²_tol := ((5 + c₂ χ_n(Σ))/(1 − c₃ χ_n(Σ))) ζ(Σ) (s log d/n) ‖θ̂ − θ*‖²₂,

where θ* ∈ R^d is the unknown regression vector, and θ̂ is any optimal solution. With this notation, we have the following corollary.

COROLLARY 3 (Regularized Lasso). Under the conditions of Theorem 2, suppose that we solve the regularized Lasso with λ_n = 6ν√((log d)/n), and that θ* is supported

on a subset of cardinality at most s. Suppose further that we have

(38)  (64 ρ̄ log d)/n { 5 + γ̄_l/(4γ_u) + (64 s log d)/(n γ̄_l) } { 1 − (128 s log d)/(n γ̄_l) }^{-1} ≤ λ_n.

Then for any δ² ≥ ε²_tol and any optimum θ̂_{λ_n}, we have

‖θ^t − θ̂_{λ_n}‖²₂ ≤ δ²  for all iterations t ≥ log((φ_n(θ⁰) − φ_n(θ̂_{λ_n}))/δ²) / log(1/κ),

with probability at least 1 − exp(−c₄ log d).

As with Corollary 2(a), this result guarantees that O(log(1/ε²_tol)) iterations are sufficient to obtain an iterate θ^t that is within squared error O(ε²_tol) of any optimum θ̂_{λ_n}. Condition (38) is the specialization of equation (29) to the sparse linear regression problem, and imposes an upper bound on admissible settings of ρ̄ for our theory. Moreover, whenever (s log d)/n = o(1), a condition that is required for statistical consistency of any method by known minimax results [34], the residual error ε²_tol is of lower order than the statistical error ‖θ̂ − θ*‖²₂.

3.3. Matrix regression with rank constraints. We now turn to estimation of matrices under various types of "soft" rank constraints. Recall the model of matrix regression from Section 2.4.2, and the M-estimator (18) based on least-squares regularized with the nuclear norm. So as to reduce notational overhead, here we specialize to square matrices Θ* ∈ R^{d×d}, so that our observations are of the form

(39)  y_i = ⟨⟨X_i, Θ*⟩⟩ + w_i  for i = 1, 2, ..., n,

where X_i ∈ R^{d×d} is a matrix of covariates, and w_i ∼ N(0, ν²) is Gaussian noise. As discussed in Section 2.4.2, the nuclear norm R(Θ) = |||Θ|||₁ = Σ_{j=1}^d σ_j(Θ) is decomposable with respect to appropriately chosen matrix subspaces, and we exploit this fact heavily in our analysis. We model the behavior of both exactly and approximately low-rank matrices by enforcing a sparsity condition on the vector of singular values. In particular, for a parameter q ∈ [0, 1], we define the ℓq-ball of matrices

(40)  B_q(R_q) := { Θ ∈ R^{d×d} | Σ_{j=1}^d |σ_j(Θ)|^q ≤ R_q },

where σ_j(Θ) denotes the jth singular value of Θ. Note that if q = 0, then B₀(R₀) consists of the set of all matrices with rank at most r = R₀. On the other hand, for q ∈ (0, 1], the set B_q(R_q) contains matrices of all ranks, but enforces a relatively fast rate of decay on the singular values.

Bounds for matrix compressed sensing. We begin by considering the compressed sensing version of matrix regression, a model first introduced by Recht et al. [36], and later studied by other authors (e.g., [22, 28]). In this model, the observation matrices X_i ∈ R^{d×d} are dense and drawn from some random ensemble. The simplest example is the standard Gaussian ensemble, in which each entry of X_i is drawn i.i.d. as standard normal N(0, 1). Note that X_i is a dense matrix in general; this is an important contrast with the matrix completion setting to follow shortly. Here we consider a more general ensemble of random matrices X_i, in which each matrix X_i ∈ R^{d×d} is drawn i.i.d. from a zero-mean normal distribution in R^{d²} with covariance matrix Σ ∈ R^{d²×d²}. The setting Σ = I_{d²×d²} recovers the standard Gaussian ensemble studied in past work. As usual, we let σ_max(Σ) and σ_min(Σ) define the maximum and minimum eigenvalues of Σ, and we define ζ_mat(Σ) = sup_{‖u‖₂=1} sup_{‖v‖₂=1} var(⟨⟨X, uv^T⟩⟩), corresponding to the maximal variance of X when projected onto rank one matrices. For the identity ensemble, we have ζ_mat(I) = 1.

We now state a result on the convergence of the updates (19) when applied to a statistical problem involving a matrix Θ* ∈ B_q(R_q). The convergence rate depends on the contraction coefficient

κ := { 1 − σ_min(Σ)/(4σ_max(Σ)) + χ_n(Σ) } { 1 − χ_n(Σ) }^{-1},  where χ_n(Σ) := (c₁ ζ_mat(Σ)/σ_max(Σ)) R_q (d/n)^{1−q/2}

for some universal constant c₁. In the case q = 0, corresponding to matrices with rank at most r, note that we have R₀ = r. With this notation, we have the following convergence guarantee:

COROLLARY 4 (Low-rank matrix recovery). Under the conditions of Theorem 1, consider the semidefinite program (18) with ρ ≤ |||Θ*|||₁, and suppose that we apply the projected gradient updates (19) with γ_u = 2σ_max(Σ).

(a) Exactly low-rank: Suppose that Θ* has rank r < d. Then the iterates (19) satisfy the bound

(41)  ‖Θ^t − Θ̂‖²_F ≤ κ^t ‖Θ⁰ − Θ̂‖²_F + c₂ χ_n(Σ) ‖Θ̂ − Θ*‖²_F  for all t = 0, 1, 2, ...

with probability at least 1 − exp(−c₀ d).

(b) Approximately low-rank: Suppose that Θ* ∈ B_q(R_q) for some q ∈ (0, 1]. Then the iterates (19) satisfy

‖Θ^t − Θ̂‖²_F ≤ κ^t ‖Θ⁰ − Θ̂‖²_F + c₂ χ_n(Σ) { R_q (d/n)^{1−q/2} + ‖Θ̂ − Θ*‖²_F }

for all t = 0, 1, 2, ... with probability at least 1 − exp(−c₀ d).

Although quantitative aspects of the rates are different, Corollary 4 is analogous to Corollary 2. For the case of exactly low-rank matrices [part (a)], geometric convergence is guaranteed up to a tolerance involving the statistical error ‖Θ̂ − Θ*‖²_F. For the case of approximately low-rank matrices [part (b)], the tolerance term involves an additional factor of R_q(d/n)^{1−q/2}. Again, from known results on minimax rates for matrix estimation [37], this term is known to be of comparable or lower order than the quantity ‖Θ̂ − Θ*‖²_F. As before, it is also possible to derive an analogous corollary of Theorem 2 for estimating low-rank matrices; in the interests of space, we leave such a development to the reader.

3.3.1. Bounds for matrix completion. In this model, the observation y_i is a noisy version of a randomly selected entry Θ*_{a(i),b(i)} of the unknown matrix Θ*. Applications of this matrix completion problem include collaborative filtering [39], where the rows of the matrix Θ* correspond to users, and the columns correspond to items (e.g., movies in the Netflix database), and the entry Θ*_{ab} corresponds to user a's rating of item b. Given observations of only a subset of the entries of Θ*, the goal is to fill in, or complete, the matrix, thereby making recommendations of movies that a user has not yet seen. Matrix completion can be viewed as a particular case of the matrix regression model (17), in particular by setting X_i = E_{a(i)b(i)}, corresponding to the matrix with a single one in position (a(i), b(i)), and zeros in all other positions. Note that these observation matrices are extremely sparse, in contrast to the compressed sensing model. Nuclear-norm based estimators for matrix completion are known to have good statistical properties (e.g., [11, 29, 35, 39]). Here we consider the M-estimator

(42)  Θ̂ ∈ arg min_{Θ ∈ Ω} (1/2n) Σ_{i=1}^n (y_i − Θ_{a(i)b(i)})²  such that |||Θ|||₁ ≤ ρ,

where Ω = { Θ ∈ R^{d×d} | ‖Θ‖_∞ ≤ α/d } is the set of matrices with bounded elementwise ℓ∞-norm. This constraint eliminates matrices that are overly "spiky" (i.e., concentrate too much of their mass in a single position); as discussed in the paper [29], such spikiness control is necessary in order to bound the nonidentifiable component of the matrix completion model.

COROLLARY 5 (Matrix completion). Under the conditions of Theorem 1, suppose that Θ* ∈ B_q(R_q), and that we solve program (42) with ρ ≤ |||Θ*|||₁. As long as n > c₀ R_q^{1/(1−q/2)} d log d for a sufficiently large constant c₀, then there is a contraction coefficient κ_t ∈ (0, 1) that decreases with t such that

(43)  ‖Θ^{t+1} − Θ̂‖²_F ≤ κ_t^t ‖Θ⁰ − Θ̂‖²_F + c₂ { R_q (α² d log d / n)^{1−q/2} + ‖Θ̂ − Θ*‖²_F }

for all iterations t = 0, 1, 2, ..., with probability at least 1 − exp(−c₁ d log d).


More information

Supplemental Material: Proofs

Supplemental Material: Proofs Proof to Theorem Supplemetal Material: Proofs Proof. Let be the miimal umber of traiig items to esure a uique solutio θ. First cosider the case. It happes if ad oly if θ ad Rak(A) d, which is a special

More information

1 Duality revisited. AM 221: Advanced Optimization Spring 2016

1 Duality revisited. AM 221: Advanced Optimization Spring 2016 AM 22: Advaced Optimizatio Sprig 206 Prof. Yaro Siger Sectio 7 Wedesday, Mar. 9th Duality revisited I this sectio, we will give a slightly differet perspective o duality. optimizatio program: f(x) x R

More information

Differentiable Convex Functions

Differentiable Convex Functions Differetiable Covex Fuctios The followig picture motivates Theorem 11. f ( x) f ( x) f '( x)( x x) ˆx x 1 Theorem 11 : Let f : R R be differetiable. The, f is covex o the covex set C R if, ad oly if for

More information

17. Joint distributions of extreme order statistics Lehmann 5.1; Ferguson 15

17. Joint distributions of extreme order statistics Lehmann 5.1; Ferguson 15 17. Joit distributios of extreme order statistics Lehma 5.1; Ferguso 15 I Example 10., we derived the asymptotic distributio of the maximum from a radom sample from a uiform distributio. We did this usig

More information

Geometry of LS. LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT

Geometry of LS. LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT OCTOBER 7, 2016 LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT Geometry of LS We ca thik of y ad the colums of X as members of the -dimesioal Euclidea space R Oe ca

More information

6.867 Machine learning, lecture 7 (Jaakkola) 1

6.867 Machine learning, lecture 7 (Jaakkola) 1 6.867 Machie learig, lecture 7 (Jaakkola) 1 Lecture topics: Kerel form of liear regressio Kerels, examples, costructio, properties Liear regressio ad kerels Cosider a slightly simpler model where we omit

More information

A Note on the Symmetric Powers of the Standard Representation of S n

A Note on the Symmetric Powers of the Standard Representation of S n A Note o the Symmetric Powers of the Stadard Represetatio of S David Savitt 1 Departmet of Mathematics, Harvard Uiversity Cambridge, MA 0138, USA dsavitt@mathharvardedu Richard P Staley Departmet of Mathematics,

More information

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters

More information

Lecture 3: August 31

Lecture 3: August 31 36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,

More information

Machine Learning for Data Science (CS 4786)

Machine Learning for Data Science (CS 4786) Machie Learig for Data Sciece CS 4786) Lecture & 3: Pricipal Compoet Aalysis The text i black outlies high level ideas. The text i blue provides simple mathematical details to derive or get to the algorithm

More information

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence Chapter 3 Strog covergece As poited out i the Chapter 2, there are multiple ways to defie the otio of covergece of a sequece of radom variables. That chapter defied covergece i probability, covergece i

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit Theorems Throughout this sectio we will assume a probability space (, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

CEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering

CEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering CEE 5 Autum 005 Ucertaity Cocepts for Geotechical Egieerig Basic Termiology Set A set is a collectio of (mutually exclusive) objects or evets. The sample space is the (collectively exhaustive) collectio

More information

Convergence of random variables. (telegram style notes) P.J.C. Spreij

Convergence of random variables. (telegram style notes) P.J.C. Spreij Covergece of radom variables (telegram style otes).j.c. Spreij this versio: September 6, 2005 Itroductio As we kow, radom variables are by defiitio measurable fuctios o some uderlyig measurable space

More information

Accuracy Assessment for High-Dimensional Linear Regression

Accuracy Assessment for High-Dimensional Linear Regression Uiversity of Pesylvaia ScholarlyCommos Statistics Papers Wharto Faculty Research -016 Accuracy Assessmet for High-Dimesioal Liear Regressio Toy Cai Uiversity of Pesylvaia Zijia Guo Uiversity of Pesylvaia

More information

Basics of Probability Theory (for Theory of Computation courses)

Basics of Probability Theory (for Theory of Computation courses) Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.

More information

Chapter 6 Principles of Data Reduction

Chapter 6 Principles of Data Reduction Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a

More information

Lecture 8: October 20, Applications of SVD: least squares approximation

Lecture 8: October 20, Applications of SVD: least squares approximation Mathematical Toolkit Autum 2016 Lecturer: Madhur Tulsiai Lecture 8: October 20, 2016 1 Applicatios of SVD: least squares approximatio We discuss aother applicatio of sigular value decompositio (SVD) of

More information

Notes for Lecture 11

Notes for Lecture 11 U.C. Berkeley CS78: Computatioal Complexity Hadout N Professor Luca Trevisa 3/4/008 Notes for Lecture Eigevalues, Expasio, ad Radom Walks As usual by ow, let G = (V, E) be a udirected d-regular graph with

More information

A survey on penalized empirical risk minimization Sara A. van de Geer

A survey on penalized empirical risk minimization Sara A. van de Geer A survey o pealized empirical risk miimizatio Sara A. va de Geer We address the questio how to choose the pealty i empirical risk miimizatio. Roughly speakig, this pealty should be a good boud for the

More information

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization ECE 90 Lecture 4: Maximum Likelihood Estimatio ad Complexity Regularizatio R Nowak 5/7/009 Review : Maximum Likelihood Estimatio We have iid observatios draw from a ukow distributio Y i iid p θ, i,, where

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract I this lecture we derive risk bouds for kerel methods. We will start by showig that Soft Margi kerel SVM correspods to miimizig

More information

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 +

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + 62. Power series Defiitio 16. (Power series) Give a sequece {c }, the series c x = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + is called a power series i the variable x. The umbers c are called the coefficiets of

More information

Lecture 2. The Lovász Local Lemma

Lecture 2. The Lovász Local Lemma Staford Uiversity Sprig 208 Math 233A: No-costructive methods i combiatorics Istructor: Ja Vodrák Lecture date: Jauary 0, 208 Origial scribe: Apoorva Khare Lecture 2. The Lovász Local Lemma 2. Itroductio

More information

High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity High-dimesioal regressio with oisy ad missig data: Provable guaratees with o-covexity Po-Lig Loh Departmet of Statistics Uiversity of Califoria, Berkeley Berkeley, CA 94720 ploh@berkeley.edu Marti J. Waiwright

More information

A RANK STATISTIC FOR NON-PARAMETRIC K-SAMPLE AND CHANGE POINT PROBLEMS

A RANK STATISTIC FOR NON-PARAMETRIC K-SAMPLE AND CHANGE POINT PROBLEMS J. Japa Statist. Soc. Vol. 41 No. 1 2011 67 73 A RANK STATISTIC FOR NON-PARAMETRIC K-SAMPLE AND CHANGE POINT PROBLEMS Yoichi Nishiyama* We cosider k-sample ad chage poit problems for idepedet data i a

More information

Sequences. Notation. Convergence of a Sequence

Sequences. Notation. Convergence of a Sequence Sequeces A sequece is essetially just a list. Defiitio (Sequece of Real Numbers). A sequece of real umbers is a fuctio Z (, ) R for some real umber. Do t let the descriptio of the domai cofuse you; it

More information

Problem Cosider the curve give parametrically as x = si t ad y = + cos t for» t» ß: (a) Describe the path this traverses: Where does it start (whe t =

Problem Cosider the curve give parametrically as x = si t ad y = + cos t for» t» ß: (a) Describe the path this traverses: Where does it start (whe t = Mathematics Summer Wilso Fial Exam August 8, ANSWERS Problem 1 (a) Fid the solutio to y +x y = e x x that satisfies y() = 5 : This is already i the form we used for a first order liear differetial equatio,

More information

REGRESSION WITH QUADRATIC LOSS

REGRESSION WITH QUADRATIC LOSS REGRESSION WITH QUADRATIC LOSS MAXIM RAGINSKY Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X, Y ), where, as before, X is a R d

More information

Riesz-Fischer Sequences and Lower Frame Bounds

Riesz-Fischer Sequences and Lower Frame Bounds Zeitschrift für Aalysis ud ihre Aweduge Joural for Aalysis ad its Applicatios Volume 1 (00), No., 305 314 Riesz-Fischer Sequeces ad Lower Frame Bouds P. Casazza, O. Christese, S. Li ad A. Lider Abstract.

More information

Regularization methods for large scale machine learning

Regularization methods for large scale machine learning Regularizatio methods for large scale machie learig Lorezo Rosasco March 7, 2017 Abstract After recallig a iverse problems perspective o supervised learig, we discuss regularizatio methods for large scale

More information

6.3 Testing Series With Positive Terms

6.3 Testing Series With Positive Terms 6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week 2 Lecture: Cocept Check Exercises Starred problems are optioal. Excess Risk Decompositio 1. Let X = Y = {1, 2,..., 10}, A = {1,..., 10, 11} ad suppose the data distributio

More information

11 THE GMM ESTIMATION

11 THE GMM ESTIMATION Cotets THE GMM ESTIMATION 2. Cosistecy ad Asymptotic Normality..................... 3.2 Regularity Coditios ad Idetificatio..................... 4.3 The GMM Iterpretatio of the OLS Estimatio.................

More information

A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers

A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers A uified framework for high-dimesioal aalysis of M-estimators with decomposable regularizers Sahad Negahba Departmet of EECS UC Berkeley sahad @eecs.berkeley.edu Marti J. Waiwright Departmet of Statistics

More information

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014.

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014. Product measures, Toelli s ad Fubii s theorems For use i MAT3400/4400, autum 2014 Nadia S. Larse Versio of 13 October 2014. 1. Costructio of the product measure The purpose of these otes is to preset the

More information

ECONOMETRIC THEORY. MODULE XIII Lecture - 34 Asymptotic Theory and Stochastic Regressors

ECONOMETRIC THEORY. MODULE XIII Lecture - 34 Asymptotic Theory and Stochastic Regressors ECONOMETRIC THEORY MODULE XIII Lecture - 34 Asymptotic Theory ad Stochastic Regressors Dr. Shalabh Departmet of Mathematics ad Statistics Idia Istitute of Techology Kapur Asymptotic theory The asymptotic

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit theorems Throughout this sectio we will assume a probability space (Ω, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

Advanced Stochastic Processes.

Advanced Stochastic Processes. Advaced Stochastic Processes. David Gamarik LECTURE 2 Radom variables ad measurable fuctios. Strog Law of Large Numbers (SLLN). Scary stuff cotiued... Outlie of Lecture Radom variables ad measurable fuctios.

More information

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4.

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4. 4. BASES I BAACH SPACES 39 4. BASES I BAACH SPACES Sice a Baach space X is a vector space, it must possess a Hamel, or vector space, basis, i.e., a subset {x γ } γ Γ whose fiite liear spa is all of X ad

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS MASSACHUSTTS INSTITUT OF TCHNOLOGY 6.436J/5.085J Fall 2008 Lecture 9 /7/2008 LAWS OF LARG NUMBRS II Cotets. The strog law of large umbers 2. The Cheroff boud TH STRONG LAW OF LARG NUMBRS While the weak

More information

Quantile regression with multilayer perceptrons.

Quantile regression with multilayer perceptrons. Quatile regressio with multilayer perceptros. S.-F. Dimby ad J. Rykiewicz Uiversite Paris 1 - SAMM 90 Rue de Tolbiac, 75013 Paris - Frace Abstract. We cosider oliear quatile regressio ivolvig multilayer

More information

Distribution of Random Samples & Limit theorems

Distribution of Random Samples & Limit theorems STAT/MATH 395 A - PROBABILITY II UW Witer Quarter 2017 Néhémy Lim Distributio of Radom Samples & Limit theorems 1 Distributio of i.i.d. Samples Motivatig example. Assume that the goal of a study is to

More information

Efficient GMM LECTURE 12 GMM II

Efficient GMM LECTURE 12 GMM II DECEMBER 1 010 LECTURE 1 II Efficiet The estimator depeds o the choice of the weight matrix A. The efficiet estimator is the oe that has the smallest asymptotic variace amog all estimators defied by differet

More information

Slide Set 13 Linear Model with Endogenous Regressors and the GMM estimator

Slide Set 13 Linear Model with Endogenous Regressors and the GMM estimator Slide Set 13 Liear Model with Edogeous Regressors ad the GMM estimator Pietro Coretto pcoretto@uisa.it Ecoometrics Master i Ecoomics ad Fiace (MEF) Uiversità degli Studi di Napoli Federico II Versio: Friday

More information

On Random Line Segments in the Unit Square

On Random Line Segments in the Unit Square O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,

More information

Statistical Inference Based on Extremum Estimators

Statistical Inference Based on Extremum Estimators T. Rotheberg Fall, 2007 Statistical Iferece Based o Extremum Estimators Itroductio Suppose 0, the true value of a p-dimesioal parameter, is kow to lie i some subset S R p : Ofte we choose to estimate 0

More information

Stochastic Simulation

Stochastic Simulation Stochastic Simulatio 1 Itroductio Readig Assigmet: Read Chapter 1 of text. We shall itroduce may of the key issues to be discussed i this course via a couple of model problems. Model Problem 1 (Jackso

More information

Regression with quadratic loss

Regression with quadratic loss Regressio with quadratic loss Maxim Ragisky October 13, 2015 Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X,Y, where, as before,

More information

b i u x i U a i j u x i u x j

b i u x i U a i j u x i u x j M ath 5 2 7 Fall 2 0 0 9 L ecture 1 9 N ov. 1 6, 2 0 0 9 ) S ecod- Order Elliptic Equatios: Weak S olutios 1. Defiitios. I this ad the followig two lectures we will study the boudary value problem Here

More information

5.1 A mutual information bound based on metric entropy

5.1 A mutual information bound based on metric entropy Chapter 5 Global Fao Method I this chapter, we exted the techiques of Chapter 2.4 o Fao s method the local Fao method) to a more global costructio. I particular, we show that, rather tha costructig a local

More information

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3 MATH 337 Sequeces Dr. Neal, WKU Let X be a metric space with distace fuctio d. We shall defie the geeral cocept of sequece ad limit i a metric space, the apply the results i particular to some special

More information

10. Comparative Tests among Spatial Regression Models. Here we revisit the example in Section 8.1 of estimating the mean of a normal random

10. Comparative Tests among Spatial Regression Models. Here we revisit the example in Section 8.1 of estimating the mean of a normal random Part III. Areal Data Aalysis 0. Comparative Tests amog Spatial Regressio Models While the otio of relative likelihood values for differet models is somewhat difficult to iterpret directly (as metioed above),

More information

Rademacher Complexity

Rademacher Complexity EECS 598: Statistical Learig Theory, Witer 204 Topic 0 Rademacher Complexity Lecturer: Clayto Scott Scribe: Ya Deg, Kevi Moo Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved for

More information

Lecture 19: Convergence

Lecture 19: Convergence Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may

More information

Outline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression

Outline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression REGRESSION 1 Outlie Liear regressio Regularizatio fuctios Polyomial curve fittig Stochastic gradiet descet for regressio MLE for regressio Step-wise forward regressio Regressio methods Statistical techiques

More information

Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector

Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector Dimesio-free PAC-Bayesia bouds for the estimatio of the mea of a radom vector Olivier Catoi CREST CNRS UMR 9194 Uiversité Paris Saclay olivier.catoi@esae.fr Ilaria Giulii Laboratoire de Probabilités et

More information

Lecture 12: September 27

Lecture 12: September 27 36-705: Itermediate Statistics Fall 207 Lecturer: Siva Balakrisha Lecture 2: September 27 Today we will discuss sufficiecy i more detail ad the begi to discuss some geeral strategies for costructig estimators.

More information

ALGEBRAIC GEOMETRY COURSE NOTES, LECTURE 5: SINGULARITIES.

ALGEBRAIC GEOMETRY COURSE NOTES, LECTURE 5: SINGULARITIES. ALGEBRAIC GEOMETRY COURSE NOTES, LECTURE 5: SINGULARITIES. ANDREW SALCH 1. The Jacobia criterio for osigularity. You have probably oticed by ow that some poits o varieties are smooth i a sese somethig

More information

Lecture Notes for Analysis Class

Lecture Notes for Analysis Class Lecture Notes for Aalysis Class Topological Spaces A topology for a set X is a collectio T of subsets of X such that: (a) X ad the empty set are i T (b) Uios of elemets of T are i T (c) Fiite itersectios

More information

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015 ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],

More information

Chi-Squared Tests Math 6070, Spring 2006

Chi-Squared Tests Math 6070, Spring 2006 Chi-Squared Tests Math 6070, Sprig 2006 Davar Khoshevisa Uiversity of Utah February XXX, 2006 Cotets MLE for Goodess-of Fit 2 2 The Multiomial Distributio 3 3 Applicatio to Goodess-of-Fit 6 3 Testig for

More information

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices Radom Matrices with Blocks of Itermediate Scale Strogly Correlated Bad Matrices Jiayi Tog Advisor: Dr. Todd Kemp May 30, 07 Departmet of Mathematics Uiversity of Califoria, Sa Diego Cotets Itroductio Notatio

More information

Definitions and Theorems. where x are the decision variables. c, b, and a are constant coefficients.

Definitions and Theorems. where x are the decision variables. c, b, and a are constant coefficients. Defiitios ad Theorems Remember the scalar form of the liear programmig problem, Miimize, Subject to, f(x) = c i x i a 1i x i = b 1 a mi x i = b m x i 0 i = 1,2,, where x are the decisio variables. c, b,

More information

A Hadamard-type lower bound for symmetric diagonally dominant positive matrices

A Hadamard-type lower bound for symmetric diagonally dominant positive matrices A Hadamard-type lower boud for symmetric diagoally domiat positive matrices Christopher J. Hillar, Adre Wibisoo Uiversity of Califoria, Berkeley Jauary 7, 205 Abstract We prove a ew lower-boud form of

More information

A class of spectral bounds for Max k-cut

A class of spectral bounds for Max k-cut A class of spectral bouds for Max k-cut Miguel F. Ajos, José Neto December 07 Abstract Let G be a udirected ad edge-weighted simple graph. I this paper we itroduce a class of bouds for the maximum k-cut

More information

Linear Support Vector Machines

Linear Support Vector Machines Liear Support Vector Machies David S. Roseberg The Support Vector Machie For a liear support vector machie (SVM), we use the hypothesis space of affie fuctios F = { f(x) = w T x + b w R d, b R } ad evaluate

More information

1.010 Uncertainty in Engineering Fall 2008

1.010 Uncertainty in Engineering Fall 2008 MIT OpeCourseWare http://ocw.mit.edu.00 Ucertaity i Egieerig Fall 2008 For iformatio about citig these materials or our Terms of Use, visit: http://ocw.mit.edu.terms. .00 - Brief Notes # 9 Poit ad Iterval

More information

Empirical Process Theory and Oracle Inequalities

Empirical Process Theory and Oracle Inequalities Stat 928: Statistical Learig Theory Lecture: 10 Empirical Process Theory ad Oracle Iequalities Istructor: Sham Kakade 1 Risk vs Risk See Lecture 0 for a discussio o termiology. 2 The Uio Boud / Boferoi

More information

Linear Programming and the Simplex Method

Linear Programming and the Simplex Method Liear Programmig ad the Simplex ethod Abstract This article is a itroductio to Liear Programmig ad usig Simplex method for solvig LP problems i primal form. What is Liear Programmig? Liear Programmig is

More information

Singular Continuous Measures by Michael Pejic 5/14/10

Singular Continuous Measures by Michael Pejic 5/14/10 Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable

More information

Notes on iteration and Newton s method. Iteration

Notes on iteration and Newton s method. Iteration Notes o iteratio ad Newto s method Iteratio Iteratio meas doig somethig over ad over. I our cotet, a iteratio is a sequece of umbers, vectors, fuctios, etc. geerated by a iteratio rule of the type 1 f

More information

Lecture 10 October Minimaxity and least favorable prior sequences

Lecture 10 October Minimaxity and least favorable prior sequences STATS 300A: Theory of Statistics Fall 205 Lecture 0 October 22 Lecturer: Lester Mackey Scribe: Brya He, Rahul Makhijai Warig: These otes may cotai factual ad/or typographic errors. 0. Miimaxity ad least

More information

CHAPTER 10 INFINITE SEQUENCES AND SERIES

CHAPTER 10 INFINITE SEQUENCES AND SERIES CHAPTER 10 INFINITE SEQUENCES AND SERIES 10.1 Sequeces 10.2 Ifiite Series 10.3 The Itegral Tests 10.4 Compariso Tests 10.5 The Ratio ad Root Tests 10.6 Alteratig Series: Absolute ad Coditioal Covergece

More information

EECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1

EECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1 EECS564 Estimatio, Filterig, ad Detectio Hwk 2 Sols. Witer 25 4. Let Z be a sigle observatio havig desity fuctio where. p (z) = (2z + ), z (a) Assumig that is a oradom parameter, fid ad plot the maximum

More information

Sequences and Series of Functions

Sequences and Series of Functions Chapter 6 Sequeces ad Series of Fuctios 6.1. Covergece of a Sequece of Fuctios Poitwise Covergece. Defiitio 6.1. Let, for each N, fuctio f : A R be defied. If, for each x A, the sequece (f (x)) coverges

More information