FAST GLOBAL CONVERGENCE OF GRADIENT METHODS FOR HIGH-DIMENSIONAL STATISTICAL RECOVERY

The Annals of Statistics 2012, Vol. 40, No. 5, DOI: 10.1214/12-AOS1032
© Institute of Mathematical Statistics, 2012

FAST GLOBAL CONVERGENCE OF GRADIENT METHODS FOR HIGH-DIMENSIONAL STATISTICAL RECOVERY

BY ALEKH AGARWAL¹,², SAHAND NEGAHBAN¹,³ AND MARTIN J. WAINWRIGHT¹,³

University of California, Berkeley, Massachusetts Institute of Technology and University of California, Berkeley

Many statistical M-estimators are based on convex optimization problems formed by the combination of a data-dependent loss function with a norm-based regularizer. We analyze the convergence rates of projected gradient and composite gradient methods for solving such problems, working within a high-dimensional framework that allows the ambient dimension d to grow with (and possibly exceed) the sample size n. Our theory identifies conditions under which projected gradient descent enjoys globally linear convergence up to the statistical precision of the model, meaning the typical distance between the true unknown parameter θ* and an optimal solution θ̂. By establishing these conditions with high probability for numerous statistical models, our analysis applies to a wide range of M-estimators, including sparse linear regression using Lasso; group Lasso for block sparsity; log-linear models with regularization; low-rank matrix recovery using nuclear norm regularization; and matrix decomposition using a combination of the nuclear and ℓ1 norms. Overall, our analysis reveals interesting connections between statistical and computational efficiency in high-dimensional estimation.

1. Introduction. High-dimensional data sets present challenges that are both statistical and computational in nature. On the statistical side, recent years have witnessed a flurry of results on convergence rates for various estimators under high-dimensional scaling, allowing for the possibility that the problem dimension d exceeds the sample size n. These results typically involve some assumption regarding the structure of the parameter space, such as sparse vectors, structured covariance matrices, or low-rank matrices, as well as some regularity of the data-generating process. On the computational side, many estimators for statistical recovery are based on solving convex programs. Examples of such M-estimators include ℓ1-regularized quadratic programs (Lasso) for sparse linear regression (e.g.,

Received April 2011; revised January.
¹ Supported in part by Grant AFOSR-09NL.
² Supported in part by a Microsoft Graduate Fellowship and Google Ph.D. Fellowship.
³ Supported by funding from NSF-CDI.
MSC2010 subject classifications. Primary 62F30; secondary 62H12.
Key words and phrases. High-dimensional inference, convex optimization, regularized M-estimation.

[7, 13, 26, 40, 44]), second-order cone programs (SOCP) for the group Lasso (e.g., [19, 24, 45]) and SDP relaxations for various problems, including sparse PCA and low-rank matrix estimation (e.g., [3, 11, 28, 36, 37, 39]). Many of these programs are instances of convex conic programs, and so can (in principle) be solved to ε-accuracy in polynomial time using interior point methods, and other standard methods from convex programming; for example, see the books [6, 8]. However, the complexity of such quasi-Newton methods can be prohibitively expensive for the very large-scale problems that arise from high-dimensional data sets. Accordingly, recent years have witnessed a renewed interest in simpler first-order methods, among them the methods of projected gradient descent and mirror descent. Several authors (e.g., [4, 5, 20]) have used variants of Nesterov's accelerated gradient method [31] to obtain algorithms for high-dimensional statistical problems with a sublinear rate of convergence. Note that an optimization algorithm, generating a sequence of iterates {θ^t}_{t=0}^∞, is said to exhibit sublinear convergence to an optimum θ̂ if the optimization error ‖θ^t − θ̂‖ decays at the rate 1/t^κ, for some exponent κ > 0 and norm ‖·‖. It is known that this is the best possible convergence rate for gradient descent-type methods for convex programs under only Lipschitz conditions [30]. It is also known that much faster global rates, in particular a linear or geometric rate, can be achieved if global regularity conditions like strong convexity and smoothness are imposed [30]. An optimization algorithm is said to exhibit linear or geometric convergence if the optimization error ‖θ^t − θ̂‖ decays at a rate κ^t, for some contraction coefficient κ ∈ (0, 1). Note that such convergence is exponentially faster than sublinear convergence. For certain classes of problems involving polyhedral constraints and global smoothness, Tseng and Luo [25] have established geometric convergence. However, a challenging aspect of statistical estimation in high dimensions is that the underlying optimization problems can never be strongly convex in a global sense when d > n (since the d × d Hessian matrix is rank-deficient), and global smoothness conditions cannot hold when d/n → +∞.

Some more recent work has exploited structure specific to the optimization problems that arise in statistical settings. For the special case of sparse linear regression with random isotropic designs (also referred to as compressed sensing), some authors have established local linear convergence, meaning guarantees that apply once the iterates are close enough to the optimum [9, 17]. Also in the setting of compressed sensing, Tropp and Gilbert [41] studied finite convergence of greedy algorithms, while Garg and Khandekar [16] provide results for a thresholded gradient algorithm. In both of these results, the convergence happens up to a tolerance of the order of the noise variance, which is substantially larger than the true statistical precision of the problem.

The focus of this paper is the convergence rate of two simple gradient-based algorithms for solving optimization problems that underlie regularized M-estimators. For a constrained problem with a differentiable objective function, the projected gradient method generates a sequence of iterates {θ^t}_{t=0}^∞ by taking a step

FIG. 1. Convergence rates of projected gradient descent in application to Lasso (ℓ1-constrained least-squares). Each panel shows the log optimization error log ‖θ^t − θ̂‖ versus the iteration number t. Panel (a) shows three curves, corresponding to dimensions d ∈ {5000; 10,000; 20,000}, sparsity s = ⌈√d⌉ and all with the same sample size n = 2500. All cases show geometric convergence, but the rate for larger problems becomes progressively slower. (b) For an appropriately rescaled sample size (α = n/(s log d)), all three convergence rates should be roughly the same, as predicted by the theory.

in the negative gradient direction, and then projecting the result onto the constraint set. The composite gradient method of Nesterov [31] is well-suited to solving regularized problems formed by the sum of a differentiable and a nondifferentiable component. The main contribution of this paper is to establish a form of global geometric convergence for these algorithms that holds for a broad class of high-dimensional statistical problems.

In order to provide intuition for this guarantee, Figure 1 shows the performance of projected gradient descent for Lasso problems (ℓ1-constrained least-squares), each one based on a fixed sample size n = 2500 and varying dimensions d ∈ {5000; 10,000; 20,000}. In panel (a), we have plotted the logarithm of the optimization error, measured in terms of the Euclidean norm ‖θ^t − θ̂‖ between θ^t and an optimal solution θ̂, versus the iteration number t. Note that all curves are linear (on this logarithmic scale), revealing the geometric convergence predicted by our theory. Moreover, the results in panel (a) exhibit an interesting property: the convergence rate is dimension-dependent, meaning that for a fixed sample size, projected gradient descent converges more slowly for a larger problem than a smaller problem. This phenomenon reflects the natural intuition that larger problems are harder than smaller problems. A notable aspect of our theory is that it makes a quantitative prediction regarding the extent to which a larger problem is harder than a smaller one. In particular, our convergence rates suggest that if the sample size n is re-scaled according to the dimension d and also other

model parameters such as sparsity, then the convergence rates should be roughly similar. Panel (b) confirms this prediction: when the sample size is rescaled according to our theory (in particular, see Corollary 2 in Section 3.2), then all three curves lie essentially on top of each other.

Although high-dimensional optimization problems are typically neither strongly convex nor smooth, this paper shows that it is fruitful to consider suitably restricted notions of strong convexity and smoothness. Our notion of restricted strong convexity (RSC) is related to but slightly different than that of Negahban et al. [27] for establishing statistical consistency. We also introduce a related notion of restricted smoothness (RSM), not needed for proving statistical rates, but essential in the setting of optimization. Our analysis consists of two parts. We first show that for optimization problems underlying many regularized M-estimators, RSC/RSM conditions are sufficient to guarantee global linear convergence of projected gradient descent. Our second contribution is to prove that for the iterates generated by our methods, these RSC/RSM assumptions do hold with high probability for numerous statistical models, among them sparse linear models, models with group sparsity, and various matrix estimation problems, including matrix completion and matrix decomposition.

An interesting aspect of our results is that the geometric convergence is not guaranteed to an arbitrary precision, but only to an accuracy related to the statistical precision of the problem. For a given norm ‖·‖, the statistical precision is given by the mean-squared error E[‖θ̂ − θ*‖²] between the true parameter θ* and the solution θ̂ of the optimization problem. Our analysis guarantees geometric convergence to a parameter θ such that ‖θ − θ*‖ = ‖θ̂ − θ*‖ + o(‖θ̂ − θ*‖), which is the best we can hope for statistically, ignoring lower order terms. Overall, our results reveal an interesting connection between the statistical and computational properties of M-estimators; that is, the properties of the underlying statistical model that make it favorable for estimation also render it more amenable to optimization procedures.

The remainder of this paper is organized as follows. We begin in Section 2 with our setup and the necessary background. Section 3 is devoted to the statement of our main results and various corollaries. In Section 4, we provide a number of empirical results that confirm the sharpness of our theory. Proofs of our results are provided in the Supplementary Material [1].

2. Background and problem formulation. In this section, we begin by describing the class of regularized M-estimators to which our analysis applies, as well as the optimization algorithms that we analyze. Finally, we introduce some important notions that underlie our analysis, including the notion of a decomposable regularizer, and the properties of restricted strong convexity and smoothness.

2.1. Loss functions, regularization and gradient-based methods. Given a random variable Z ∼ P taking values in some set Z, let Z₁ⁿ = {Z₁, ..., Zₙ} be a sample of n observations. Assuming that P lies within some indexed family {P_θ, θ ∈ Ω}, the goal is to recover an estimate of the unknown true parameter θ* generating the data. Here Ω is some subset of R^d, where d is the ambient dimension of the problem. In order to measure the fit of any θ ∈ Ω to a given data set Z₁ⁿ, we introduce a loss function L_n : Ω × Zⁿ → R₊. By construction, for any given n-sample data set Z₁ⁿ ∈ Zⁿ, the loss function assigns a cost L_n(θ; Z₁ⁿ) ≥ 0 to the parameter θ ∈ Ω. In many applications, the loss function has a separable structure across the data set, meaning that L_n(θ; Z₁ⁿ) = (1/n) Σ_{i=1}^n ℓ(θ; Z_i), where ℓ : Ω × Z → R₊ is the loss function associated with a single data point.

Of primary interest in this paper are estimation problems that are underdetermined, meaning that the sample size n is smaller than the ambient dimension d. In such settings, without further restrictions on the parameter space Ω, there are various impossibility theorems, asserting that consistent estimates of the unknown parameter θ* cannot be obtained. For this reason, it is necessary to assume that the unknown parameter θ* either lies within a smaller subset of Ω, or is well-approximated by some member of such a subset. In order to incorporate these types of structural constraints, we introduce a regularizer R : Ω → R₊ over the parameter space. Given a user-defined radius ρ > 0, our analysis applies to the constrained M-estimator

(1)  θ̂_ρ ∈ arg min_{R(θ) ≤ ρ} L_n(θ; Z₁ⁿ),

as well as to the regularized M-estimator

(2)  θ̂_{λ_n} ∈ arg min_{R(θ) ≤ ρ̄} { L_n(θ; Z₁ⁿ) + λ_n R(θ) } =: arg min_{R(θ) ≤ ρ̄} φ_n(θ),

where the regularization weight λ_n > 0 is user-defined. Note that the radii ρ and ρ̄ may be different in general. Throughout this paper, we impose the following two conditions: (a) for any data set Z₁ⁿ, the function L_n(·; Z₁ⁿ) is convex and differentiable over Ω, and (b) the regularizer R is a norm. These conditions ensure that the overall problem is convex, so that by Lagrangian duality, the optimization problems (1) and (2) are equivalent. However, as our analysis will show, solving one or the other can be computationally more preferable depending upon the assumptions made. When the radius ρ or the regularization parameter λ_n is clear from the context, we will drop the subscript on θ̂ to ease the notation. Similarly, we frequently adopt the shorthand L_n(θ). Procedures based on optimization problems of either form are known as M-estimators in the statistics literature.

The focus of this paper is on two simple algorithms for solving the above optimization problems. The method of projected gradient descent applies naturally to the constrained problem (1), whereas the composite gradient descent method due to Nesterov [31] is suitable for solving the regularized problem (2). Each routine generates a sequence {θ^t}_{t=0}^∞ of iterates by first initializing to some parameter θ⁰, and then for t = 0, 1, 2, ..., applying the recursive update

(3)  θ^{t+1} = arg min_{θ ∈ B_R(ρ)} { L_n(θ^t) + ⟨∇L_n(θ^t), θ − θ^t⟩ + (γ_u/2) ‖θ − θ^t‖² }

in the case of projected gradient descent, or the update

(4)  θ^{t+1} = arg min_{θ ∈ B_R(ρ̄)} { L_n(θ^t) + ⟨∇L_n(θ^t), θ − θ^t⟩ + (γ_u/2) ‖θ − θ^t‖² + λ_n R(θ) }

for the composite gradient method. Note that the only difference between the two updates is the addition of the regularization term in the objective. These updates have a natural intuition: the next iterate θ^{t+1} is obtained by constrained minimization of a first-order approximation to the loss function, combined with a smoothing term that controls how far one moves from the current iterate in terms of Euclidean norm. Moreover, it is easily seen that update (3) is equivalent to

(5)  θ^{t+1} = Π( θ^t − (1/γ_u) ∇L_n(θ^t) ),

where Π ≡ Π_{B_R(ρ)} denotes Euclidean projection onto the regularizer norm ball B_R(ρ) := {θ | R(θ) ≤ ρ} of radius ρ. In this formulation, we see that the algorithm takes a step in the negative gradient direction, using the quantity 1/γ_u as stepsize parameter, and then projects the resulting vector onto the constraint set. Update (4) takes an analogous form; however, the projection will depend on both λ_n and γ_u. As will be illustrated in the examples to follow, for many problems, updates (3) and (4), or equivalently (5), have a very simple solution. For instance, in the case of ℓ1-regularization, they are easily computed by an appropriate form of soft-thresholding, as illustrated in the sketch below.
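To make the ℓ1 case concrete, the following minimal sketch (Python with NumPy; the function names are ours, not from the paper) implements update (5): a gradient step with step-size 1/γ_u followed by Euclidean projection onto the ℓ1-ball, here via the standard sorting-based projection; an O(d) variant is discussed in [14].

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto the l1-ball of the given radius (> 0).
    Sorting-based algorithm, O(d log d); O(d) variants exist [14]."""
    if np.sum(np.abs(v)) <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]          # sorted magnitudes, descending
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    k = np.max(np.nonzero(u > (css - radius) / ks)[0]) + 1   # largest feasible index
    tau = (css[k - 1] - radius) / k       # soft-thresholding level
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def projected_gradient_step(theta, grad, gamma_u, radius):
    """One projected gradient update (5): gradient step with step-size 1/gamma_u,
    then projection onto the l1-ball of radius rho."""
    return project_l1_ball(theta - grad / gamma_u, radius)
```

The same projection routine reappears in the matrix setting of Section 2.4.2, where it is applied to the vector of singular values.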

2.2. Restricted strong convexity and smoothness. In this section, we define the conditions on the loss function and regularizer that underlie our analysis. Global smoothness and strong convexity assumptions play an important role in the classical analysis of optimization algorithms [6, 8, 30]. In application to a differentiable loss function L_n, both of these properties are defined in terms of a first-order Taylor series expansion around a vector θ′ in the direction of θ, namely the quantity

(6)  T_L(θ; θ′) := L_n(θ) − L_n(θ′) − ⟨∇L_n(θ′), θ − θ′⟩.

By the assumed convexity of L_n, this error is always nonnegative, and global strong convexity is equivalent to imposing a stronger condition, namely that for some parameter γ_l > 0, the first-order Taylor error T_L(θ; θ′) is lower bounded by a quadratic term (γ_l/2) ‖θ − θ′‖² for all θ, θ′ ∈ Ω. Global smoothness is defined in a similar way, by imposing a quadratic upper bound on the Taylor error. It is known that under global smoothness and strong convexity assumptions, the method of projected gradient descent (3) enjoys a globally geometric convergence rate, meaning that there is some κ ∈ (0, 1) such that⁴

(7)  ‖θ^t − θ̂‖² ≲ κ^t ‖θ⁰ − θ̂‖²  for all iterations t = 0, 1, 2, ....

We refer the reader to Bertsekas [6], Proposition 1.2.3, page 145, or Nesterov [30], Theorem 2.2.8, page 88, for such results on projected gradient descent, and to Nesterov [31] for related results on composite gradient descent.

Unfortunately, in the high-dimensional setting (d > n), it is usually impossible to guarantee strong convexity of problem (1) in a global sense. For instance, when the data are drawn i.i.d., the loss function consists of a sum of n terms. If the loss is twice differentiable, the resulting d × d Hessian matrix ∇²L_n(θ; Z₁ⁿ) is often a sum of n matrices each with rank one, so that the Hessian is rank-degenerate when n < d. However, as we show in this paper, in order to obtain fast convergence rates for optimization method (3), it is sufficient that (a) the objective is strongly convex and smooth in a restricted set of directions, and (b) the algorithm approaches the optimum θ̂ only along these directions. Let us now formalize these ideas.

DEFINITION 1 [Restricted strong convexity (RSC)]. The loss function L_n satisfies restricted strong convexity with respect to R and with parameters (γ_l, τ_l(L_n)) over the set Ω′ if

(8)  T_L(θ; θ′) ≥ (γ_l/2) ‖θ − θ′‖² − τ_l(L_n) R²(θ − θ′)  for all θ, θ′ ∈ Ω′.

We refer to the quantity γ_l as the (lower) curvature parameter, and to the quantity τ_l as the tolerance parameter. The set Ω′ corresponds to a suitably chosen subset of the space of all possible parameters.⁵

In order to gain intuition for this definition, first suppose that condition (8) holds with tolerance parameter τ_l = 0. In this case, the regularizer plays no role in the definition, and condition (8) is equivalent to the usual definition of strong convexity on the optimization set. As discussed previously, this type of global strong convexity typically fails to hold for high-dimensional inference problems. In contrast, when the tolerance parameter τ_l is strictly positive, condition (8) is much milder,

⁴ In this statement (and throughout the paper), we use ≲ to mean an inequality that holds with some universal constant c, independent of the problem parameters.
⁵ As pointed out by a referee, our RSC condition is an instance of the general theory of paraconvexity (e.g., [32]); however, we are not aware of convergence rates for minimizing general paraconvex functions.

in that it only applies to a limited set of vectors. For a given pair θ ≠ θ′, consider the inequality

(9)  R²(θ − θ′) / ‖θ − θ′‖² < γ_l / (2 τ_l(L_n)).

If this inequality is violated, then the right-hand side of bound (8) is nonpositive, in which case the RSC constraint (8) is vacuous. Thus, RSC imposes a nontrivial constraint only on pairs θ ≠ θ′ for which inequality (9) holds, and a central part of our analysis will be to prove that for our methods, the optimization error Δ^t := θ^t − θ̂ satisfies a constraint of the form (9). We note that since the regularizer R is convex, strong convexity of the loss function L_n also implies the strong convexity of the regularized loss φ_n.

We also specify an analogous notion of restricted smoothness:

DEFINITION 2 [Restricted smoothness (RSM)]. We say the loss function L_n satisfies restricted smoothness with respect to R and with parameters (γ_u, τ_u(L_n)) over the set Ω′ if

(10)  T_L(θ; θ′) ≤ (γ_u/2) ‖θ − θ′‖² + τ_u(L_n) R²(θ − θ′)  for all θ, θ′ ∈ Ω′.

As with our definition of restricted strong convexity, the additional tolerance τ_u(L_n) is not present in analogous smoothness conditions in the optimization literature, but it is essential in our set-up.

2.3. Decomposable regularizers. In past work on the statistical properties of regularization, the notion of a decomposable regularizer has been shown to be useful [27]. Although the focus of this paper is a rather different set of questions, namely optimization as opposed to statistics, decomposability also plays an important role here. Decomposability is defined with respect to a pair of subspaces of the parameter space Ω ⊆ R^d. The set M is known as the model subspace, whereas the set M̄⊥, referred to as the perturbation subspace, captures deviations from the model subspace.

DEFINITION 3. Given a subspace pair (M, M̄⊥) such that M ⊆ M̄, we say that a norm R is (M, M̄⊥)-decomposable if

(11)  R(α + β) = R(α) + R(β)  for all α ∈ M and β ∈ M̄⊥.

To gain some intuition for this definition, note that by the triangle inequality, we always have the bound R(α + β) ≤ R(α) + R(β). For a decomposable regularizer, this inequality always holds with equality. Thus, given a fixed vector α ∈ M, the key property of any decomposable regularizer is that it affords the maximum penalization of any deviation β ∈ M̄⊥. For a given error norm ‖·‖, its interaction with the regularizer R plays an important role in our results. In particular, we have the following:

DEFINITION 4 (Subspace compatibility). Given the regularizer R(·) and a norm ‖·‖, the associated subspace compatibility is given by

(12)  Ψ(M̄) := sup_{θ ∈ M̄\{0}} R(θ)/‖θ‖  when M̄ ≠ {0}, and Ψ({0}) := 0.

The quantity Ψ(M̄) corresponds to the Lipschitz constant of the norm R with respect to ‖·‖, when restricted to the subspace M̄.

2.4. Some illustrative examples. We now describe some particular examples of M-estimators with decomposable regularizers, and discuss the form of the projected gradient updates as well as RSC/RSM conditions. We cover two main families of examples: log-linear models with sparsity constraints and ℓ1-regularization (Section 2.4.1), and matrix regression problems with nuclear norm regularization (Section 2.4.2).

2.4.1. Sparse log-linear models and ℓ1-regularization. Suppose that each sample Z_i consists of a scalar-vector pair (y_i, x_i) ∈ R × R^d, corresponding to the scalar response y_i ∈ R associated with a vector of predictors x_i ∈ R^d. A log-linear model with canonical link function assumes that the response y_i is linked to the covariate vector x_i via a conditional distribution of the form P(y_i | x_i; θ, σ) ∝ exp{ [y_i ⟨θ, x_i⟩ − Φ(⟨θ, x_i⟩)] / c(σ) }, where c(σ) is a known scaling parameter, Φ(·) is a known log-partition function and θ ∈ R^d is an unknown regression vector. In many applications, θ* is relatively sparse, so that it is natural to impose an ℓ1-constraint. Computing the maximum likelihood estimate subject to such a constraint involves solving the convex program⁶

(13)  θ̂ ∈ arg min_θ (1/n) Σ_{i=1}^n { Φ(⟨θ, x_i⟩) − y_i ⟨θ, x_i⟩ }  such that ‖θ‖₁ ≤ ρ,

where the objective is the loss L_n(θ; Z₁ⁿ). We refer to this estimator as the log-linear Lasso; it is a special case of the M-estimator (1). Ordinary linear regression is the special case of the log-linear setting with Φ(t) = t²/2 and Ω = R^d, and in this case, estimator (13) corresponds to the ordinary least-squares version of the Lasso [13, 40]. Other forms of log-linear Lasso that are of interest include logistic regression, Poisson regression and multinomial regression.

⁶ The function Φ is convex since it is the log-partition function of a canonical exponential family.

Projected gradient updates. For the log-linear loss from equation (13), an easy calculation yields the gradient ∇L_n(θ) = (1/n) Σ_{i=1}^n x_i { Φ′(⟨θ, x_i⟩) − y_i }, and update (5) corresponds to the Euclidean projection of the vector θ^t − γ_u^{-1} ∇L_n(θ^t) onto the ℓ1-ball of radius ρ. It is well known that this projection can be characterized in terms of soft-thresholding, and that the projected update (5) can be computed in O(d) operations [14].

Composite gradient updates. The composite gradient update for this problem amounts to solving

θ^{t+1} = arg min_{‖θ‖₁ ≤ ρ̄} { ⟨∇L_n(θ^t), θ⟩ + (γ_u/2) ‖θ − θ^t‖² + λ_n ‖θ‖₁ }.

The update can be computed by two soft-thresholding operations. The first step is soft-thresholding the vector θ^t − (1/γ_u) ∇L_n(θ^t) at a level λ_n/γ_u. If the resulting vector has ℓ1-norm greater than ρ̄, then we project onto the ℓ1-ball as before. Overall, the complexity of the update is still O(d). (A sketch of this two-step update appears below, following the discussion of decomposability.)

Decomposability of the ℓ1-norm. We now illustrate how the ℓ1-norm is decomposable with respect to appropriately chosen subspaces. For any subset S ⊆ {1, 2, ..., d}, consider the subspace

(14)  M(S) := { α ∈ R^d | α_j = 0 for all j ∉ S },

corresponding to all vectors supported only on S. Defining M̄(S) = M(S), its orthogonal complement (with respect to the Euclidean inner product) is given by M⊥(S) = M̄⊥(S) = { β ∈ R^d | β_j = 0 for all j ∈ S }. Since any pair of vectors α ∈ M(S) and β ∈ M̄⊥(S) have disjoint supports, it follows that ‖α‖₁ + ‖β‖₁ = ‖α + β‖₁. Consequently, for any subset S, the ℓ1-norm is decomposable with respect to the pairs (M(S), M̄⊥(S)). In analogy to the ℓ1-norm, various types of group-sparse norms are also decomposable with respect to nontrivial subspace pairs. We refer the reader to the paper [27] for further examples of such decomposable norms.
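Returning to the projected and composite gradient updates for the log-linear Lasso described above, the sketch below spells them out for the logistic special case, Φ(t) = log(1 + e^t). It reuses project_l1_ball from the earlier sketch, and the function names and step-size handling are illustrative choices of ours rather than constructions from the paper.

```python
import numpy as np

def logistic_loss_grad(theta, X, y):
    """Loss (13) with Phi(t) = log(1 + exp(t)) and its gradient
    grad L_n(theta) = (1/n) X^T (Phi'(X theta) - y)."""
    n = X.shape[0]
    z = X @ theta
    loss = np.mean(np.logaddexp(0.0, z) - y * z)
    grad = X.T @ (1.0 / (1.0 + np.exp(-z)) - y) / n
    return loss, grad

def soft_threshold(v, tau):
    """Elementwise soft-thresholding at level tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def composite_gradient_step(theta, grad, gamma_u, lam, radius):
    """Composite update (4) for the l1-regularizer: soft-threshold the gradient
    step at level lam/gamma_u, then project onto the l1-ball of radius rho-bar
    only if that constraint is violated."""
    v = soft_threshold(theta - grad / gamma_u, lam / gamma_u)
    if np.sum(np.abs(v)) > radius:
        v = project_l1_ball(v, radius)   # projection routine from the earlier sketch
    return v
```

Both steps amount to a constant number of passes over the d coordinates, apart from the sorting inside the projection routine, which matches the complexity discussion above.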

RSC/RSM conditions. A calculation using the mean-value theorem shows that for loss function (13), the error in the first-order Taylor series, as previously defined in equation (6), can be written as

T_L(θ; θ′) = (1/n) Σ_{i=1}^n Φ″(⟨θ_t, x_i⟩) ⟨x_i, θ − θ′⟩²,

where θ_t = tθ + (1 − t)θ′ for some t ∈ [0, 1]. When n < d, then we can always find pairs θ ≠ θ′ such that ⟨x_i, θ − θ′⟩ = 0 for all i = 1, 2, ..., n, showing that the objective function can never be strongly convex. On the other hand, RSC for log-linear models requires only that there exist positive numbers (γ_l, τ_l(L_n)) such that for all θ, θ′ ∈ Ω′,

(15)  (1/n) Σ_{i=1}^n Φ″(⟨θ_t, x_i⟩) ⟨x_i, θ − θ′⟩² ≥ (γ_l/2) ‖θ − θ′‖² − τ_l(L_n) R²(θ − θ′),

where Ω′ := B₂(R). This restriction is essential because for many generalized linear models (e.g., logistic), the Hessian function Φ″ approaches zero as its argument diverges. RSM imposes an analogous upper bound on the Taylor error. For a broad class of log-linear models, such bounds hold with tolerances τ_l(L_n) and τ_u(L_n) of the order (log d)/n. A detailed discussion of RSC for exponential families can be found in the paper [27].

In the special case of linear regression, we have Φ″(t) = 1 for all t ∈ R, so that the lower bound (15) involves only the Gram matrix X^T X/n. (Here X ∈ R^{n×d} is the usual design matrix, with x_i ∈ R^d as its ith row.) For linear regression and ℓ1-regularization, the RSC condition is equivalent to

(16)  ‖X(θ − θ′)‖²₂ / n ≥ (γ_l/2) ‖θ − θ′‖²₂ − τ_l(L_n) ‖θ − θ′‖²₁  for all θ, θ′ ∈ Ω′.

Such a condition corresponds to a variant of the restricted eigenvalue (RE) conditions that have been studied in the literature [7, 42]. Such RE conditions are significantly milder than the restricted isometry property; we refer the reader to van de Geer and Bühlmann [42] for an in-depth comparison of different RE conditions. From past work, condition (16) is satisfied with high probability with a constant γ_l > 0 and tolerance τ_l(L_n) ≍ (log d)/n for a broad class of anisotropic random design matrices [33, 38], and parts of our analysis make use of this fact.

2.4.2. Matrices and nuclear norm regularization. We now discuss a general class of matrix regression problems that falls within our framework. Consider the space of d₁ × d₂ matrices endowed with the trace inner product ⟨⟨A, B⟩⟩ := trace(A^T B). Let Θ* ∈ R^{d₁×d₂} be an unknown matrix and suppose that for i = 1, 2, ..., n, we observe the pair Z_i = (y_i, X_i) ∈ R × R^{d₁×d₂}, where the scalar response y_i and covariate matrix X_i are linked to the unknown matrix Θ* via the linear model

(17)  y_i = ⟨⟨X_i, Θ*⟩⟩ + w_i  for i = 1, 2, ..., n.

Here w_i is an additive observation noise. In many contexts, it is natural to assume that Θ* is exactly low-rank, or approximately so, meaning that it is well-approximated by a matrix of low rank. In such settings, a number of authors (e.g., [15, 28, 37]) have studied the M-estimator

(18)  Θ̂ ∈ arg min_{Θ ∈ R^{d₁×d₂}} (1/2n) Σ_{i=1}^n (y_i − ⟨⟨X_i, Θ⟩⟩)²  such that |||Θ|||₁ ≤ ρ,

or the corresponding regularized version. Defining d = min{d₁, d₂}, the nuclear or trace norm is given by |||Θ|||₁ := Σ_{j=1}^d σ_j(Θ), corresponding to the sum of the singular values. As discussed in Section 3.3, there are various applications in which this estimator and variants thereof have proven useful.

Form of projected gradient descent. For the M-estimator (18), the projected gradient updates take a very simple form, namely

(19)  Θ^{t+1} = Π( Θ^t + (1/(γ_u n)) Σ_{i=1}^n (y_i − ⟨⟨X_i, Θ^t⟩⟩) X_i ),

where Π denotes Euclidean (i.e., in Frobenius norm) projection onto the nuclear norm ball B_N(ρ) = { Θ ∈ R^{d₁×d₂} | |||Θ|||₁ ≤ ρ }. This nuclear norm projection can be obtained by first computing the singular value decomposition (SVD), and then projecting the vector of singular values onto the ℓ1-ball. The latter step can be achieved by the fast projection algorithms discussed earlier, and there are various methods for fast computation of SVDs. The composite gradient update also has a simple form, requiring at most two singular value thresholding operations. (See the sketch at the end of this subsection.)

Decomposability of the nuclear norm. We now define matrix subspaces for which the nuclear norm is decomposable. Defining d := min{d₁, d₂}, let U ∈ R^{d₁×d} and V ∈ R^{d₂×d} be arbitrary matrices with orthonormal columns. Using col to denote the column span of a matrix, we define the subspaces⁷

M(U, V) := { Θ ∈ R^{d₁×d₂} | col(Θ^T) ⊆ col(V), col(Θ) ⊆ col(U) } and
M̄⊥(U, V) := { Θ ∈ R^{d₁×d₂} | col(Θ^T) ⊆ (col(V))⊥, col(Θ) ⊆ (col(U))⊥ }.

Finally, let us verify the decomposability of the nuclear norm. By construction, any pair of matrices Θ ∈ M(U, V) and Γ ∈ M̄⊥(U, V) have orthogonal row and column spaces, which implies the required decomposability condition, namely |||Θ + Γ|||₁ = |||Θ|||₁ + |||Γ|||₁. Finally, we note that in some special cases such as matrix completion or matrix decomposition, Ω′ will involve an additional bound on the entries of Θ* as well as the iterates Θ^t to establish RSC/RSM conditions.

⁷ Note that the model space M(U, V) is not equal to M̄(U, V). Nonetheless, as required by Definition 3, we do have the inclusion M(U, V) ⊆ M̄(U, V).
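The next sketch makes the nuclear-norm machinery above concrete: it projects onto the nuclear-norm ball by combining an SVD with the ℓ1-projection of the singular values, and then performs one projected gradient update (19) for the least-squares objective (18). It assumes that project_l1_ball from the earlier sketch is in scope; the function names and the list-of-matrices representation of the covariates are illustrative choices of ours.

```python
import numpy as np

def project_nuclear_ball(Theta, radius):
    """Frobenius-norm projection onto the nuclear-norm ball B_N(radius):
    compute an SVD and project the vector of singular values onto the l1-ball."""
    U, s, Vt = np.linalg.svd(Theta, full_matrices=False)
    s_proj = project_l1_ball(s, radius)      # l1 projection from the earlier sketch
    return (U * s_proj) @ Vt

def matrix_projected_gradient_step(Theta, X_list, y, gamma_u, radius):
    """One update (19) for the least-squares matrix regression objective (18)."""
    n = len(y)
    residuals = np.array([np.sum(X_i * Theta) for X_i in X_list]) - y
    grad = sum(r * X_i for r, X_i in zip(residuals, X_list)) / n
    return project_nuclear_ball(Theta - grad / gamma_u, radius)
```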

3. Main results and some consequences. We are now equipped to state the two main results of our paper, and discuss some of their consequences. We illustrate their application to several statistical models, including sparse regression (Section 3.2), matrix estimation with rank constraints (Section 3.3) and matrix decomposition problems (Section 3.4). The proofs of all our results can be found in the Supplementary Material [1].

3.1. Geometric convergence. Recall that the projected gradient algorithm (3) is well suited to solving an M-estimation problem in its constrained form, whereas the composite gradient algorithm (4) is appropriate for a regularized problem. Accordingly, let θ̂ be any optimum of the constrained problem (1), or the regularized problem (2), and let {θ^t}_{t=0}^∞ be a sequence of iterates generated by the projected gradient updates (3), or the composite gradient updates (4), respectively. Of primary interest to us are bounds on the optimization error, which can be measured either in terms of the error vector Δ^t := θ^t − θ̂, or the difference between the objective values at θ^t and θ̂. In this section, we state two main results, Theorems 1 and 2, corresponding to the constrained and regularized cases, respectively. In addition to the optimization error previously discussed, both of these results involve the statistical error Δ* := θ̂ − θ* between the optimum θ̂ and the nominal parameter θ*. At a high level, these results guarantee that under the RSC/RSM conditions, the optimization error shrinks geometrically, with a contraction coefficient that depends on the loss function L_n via the parameters (γ_l, τ_l(L_n)) and (γ_u, τ_u(L_n)). An interesting feature is that the contraction occurs only up to a certain tolerance ε² depending on these same parameters, and the statistical error. However, as we discuss, for many statistical problems of interest, we can show that this tolerance ε² is of a lower order than the intrinsic statistical error, and consequently our theory gives an upper bound on the number of iterations required to solve an M-estimation problem up to the statistical precision.

Convergence rates for projected gradient. We now provide the notation necessary for a precise statement of this claim. Our main result involves a family of upper bounds, one for each pair (M, M̄⊥) of R-decomposable subspaces; see Definition 3. This subspace choice can be optimized for different models to obtain the tightest possible bounds. For a given pair (M, M̄⊥) such that 16 Ψ²(M̄) τ_u(L_n) < γ_u, let us define the contraction coefficient

(20)  κ(L_n; M̄) := { 1 − γ_l/γ_u + 16 Ψ²(M̄)(τ_u(L_n) + τ_l(L_n))/γ_u } { 1 − 16 Ψ²(M̄) τ_u(L_n)/γ_u }^{-1}.

In addition, we define the tolerance parameter

(21)  ε²(Δ*; M, M̄⊥) := (32 (τ_u(L_n) + τ_l(L_n))/γ_u) ( 2 R(Π_{M̄⊥}(θ*)) + Ψ(M̄) ‖Δ*‖ + 2 R(Δ*) )²,

where Δ* = θ̂ − θ* is the statistical error, and Π_{M̄⊥}(θ*) denotes the Euclidean projection of θ* onto the subspace M̄⊥. In terms of these two ingredients, we now state our first main result:

THEOREM 1. Suppose that the loss function L_n satisfies the RSC/RSM conditions with parameters (γ_l, τ_l(L_n)) and (γ_u, τ_u(L_n)), respectively. Let (M, M̄⊥) be any R-decomposable pair of subspaces such that M ⊆ M̄ and

(22)  0 < κ(L_n, M̄) < 1.

Then for any optimum θ̂ of the problem (1) for which the constraint is active, for all iterations t = 0, 1, 2, ..., we have

(23)  ‖θ^{t+1} − θ̂‖² ≤ κ^t ‖θ⁰ − θ̂‖² + ε²(Δ*; M, M̄⊥)/(1 − κ),

where κ ≡ κ(L_n, M̄).

REMARKS. Theorem 1 actually provides a family of upper bounds, one for each R-decomposable pair (M, M̄⊥) such that condition (22) holds. This condition is always satisfied by setting M̄ equal to the trivial subspace {0}: indeed, by definition (12) of the subspace compatibility, we have Ψ({0}) = 0, and hence κ(L_n; {0}) = (1 − γ_l/γ_u) < 1. Although this choice of M̄ minimizes the contraction coefficient, it will lead⁸ to a very large tolerance parameter ε²(Δ*; M, M̄⊥). A more typical application of Theorem 1 involves nontrivial choices of the subspace M̄.

Bound (23) guarantees that the optimization error decreases geometrically, with contraction factor κ ∈ (0, 1), up to a certain tolerance proportional to ε²(Δ*; M, M̄⊥), as illustrated in Figure 2(a). Whenever the tolerance terms in the RSC/RSM conditions decay to zero as the sample size n increases (the typical case), then the contraction factor κ approaches 1 − γ_l/γ_u. The appearance of the ratio γ_l/γ_u is natural since it measures the conditioning of the objective function; more specifically, it is essentially a restricted condition number of the Hessian matrix. On the other hand, the residual error ε defined in equation (21) depends on the choice of decomposable subspaces, the parameters of the RSC/RSM conditions and the statistical error Δ* = θ̂ − θ*. In the corollaries of Theorem 1 to follow, we show that the subspaces can often be chosen such that ε²(Δ*; M, M̄⊥) = o(‖θ̂ − θ*‖²). Consequently, bound (23) guarantees geometric convergence up to a residual error smaller than the statistical precision, as illustrated in Figure 2(b). This is sensible, since in statistical settings, there is no point to optimizing beyond the statistical precision.

The result of Theorem 1 takes a simpler form when there is a subspace M that includes θ*, and the R-ball radius is chosen such that ρ ≤ R(θ*).

⁸ Indeed, the setting M̄⊥ = R^d means that the term R(Π_{M̄⊥}(θ*)) = R(θ*) appears in the tolerance; this quantity is far larger than the statistical precision.

FIG. 2. (a) Generic illustration of Theorem 1. The optimization error Δ^t = θ^t − θ̂ is guaranteed to decrease geometrically with coefficient κ ∈ (0, 1), up to the tolerance ε² = ε²(Δ*; M, M̄⊥), represented by the circle. (b) Relation between the optimization tolerance ε²(Δ*; M, M̄⊥) (solid circle) and the statistical precision ‖Δ*‖ = ‖θ̂ − θ*‖ (dotted circle). In many settings, we have ε²(Δ*; M, M̄⊥) ≪ ‖Δ*‖².

COROLLARY 1. In addition to the conditions of Theorem 1, suppose that θ* ∈ M and ρ ≤ R(θ*). Then as long as Ψ²(M̄)(τ_u(L_n) + τ_l(L_n)) = o(1), we have for all iterations t = 0, 1, 2, ...,

(24)  ‖θ^{t+1} − θ̂‖² ≤ κ^t ‖θ⁰ − θ̂‖² + o(‖θ̂ − θ*‖²).

Thus, Corollary 1 guarantees that the optimization error decreases geometrically, with contraction factor κ, up to a tolerance that is of strictly lower order than the statistical precision ‖θ̂ − θ*‖². As will be clarified in several examples to follow, the condition Ψ²(M̄)(τ_u(L_n) + τ_l(L_n)) = o(1) is satisfied for many statistical models, including sparse linear regression and low-rank matrix regression. This result is illustrated in Figure 2(b), where the solid circle represents the optimization tolerance, and the dotted circle represents the statistical precision. In the results to follow, we quantify the term o(‖θ̂ − θ*‖²) in a more precise manner for different statistical models.

Convergence rates for composite gradient. We now present our main result for the composite gradient iterates (4) that are suitable for the Lagrangian-based estimator (2). As before, our analysis yields a range of bounds indexed by subspace pairs (M, M̄⊥) that are R-decomposable. For any subspace M̄ such that 64 τ_l(L_n) Ψ²(M̄) < γ_l, we define the effective RSC coefficient as

(25)  γ̄_l := γ_l − 64 τ_l(L_n) Ψ²(M̄).

This coefficient accounts for the residual amount of strong convexity after accounting for the lower tolerance terms. In addition, we define the compound contraction

coefficient as

(26)  κ(L_n; M̄) := { 1 − γ̄_l/(4γ_u) + 64 Ψ²(M̄) τ_u(L_n)/γ̄_l } ξ(M̄),

where ξ(M̄) := (1 − 64 τ_u(L_n) Ψ²(M̄)/γ̄_l)^{-1}, and Δ* = θ̂_{λ_n} − θ* is the statistical error vector⁹ for a specific choice of ρ̄ and λ_n. As before, the coefficient κ measures the geometric rate of convergence for the algorithm. Finally, we define the compound tolerance parameter

(27)  ε²(Δ*; M, M̄⊥) := 8 ξ(M̄) β(M̄) ( 6 Ψ(M̄) ‖Δ*‖ + 8 R(Π_{M̄⊥}(θ*)) )²,

where β(M̄) := 2(γ̄_l/γ_l) τ_l(L_n) + 8 τ_u(L_n) + 2 τ_l(L_n). As with our previous result, the tolerance parameter determines the radius up to which geometric convergence can be attained.

Recall that the regularized problem (2) involves both a regularization weight λ_n and a constraint radius ρ̄. Our theory requires that the constraint radius is chosen such that ρ̄ ≥ R(θ*), which ensures that θ* is feasible. In addition, the regularization parameter should be chosen to satisfy

(28)  λ_n ≥ 2 R*(∇L_n(θ*)),

where R* is the dual norm of the regularizer. This constraint is known to play an important role in proving bounds on the statistical error of regularized M-estimators; see the paper [27] and references therein for further details. Recalling definition (2) of the overall objective function φ_n, the following result provides bounds on the excess loss φ_n(θ^t) − φ_n(θ̂_{λ_n}).

THEOREM 2. Consider the optimization problem (2) for a radius ρ̄ such that θ* is feasible, and a regularization parameter λ_n satisfying bound (28), and suppose that the loss function L_n satisfies the RSC/RSM conditions with parameters (γ_l, τ_l(L_n)) and (γ_u, τ_u(L_n)), respectively. Let (M, M̄⊥) be any R-decomposable pair such that

(29)  κ ≡ κ(L_n, M̄) ∈ [0, 1)  and  (32 ρ̄ ξ(M̄) β(M̄)) / (1 − κ(L_n; M̄)) ≤ λ_n.

Then for any δ² ≥ ε²(Δ*; M, M̄⊥)/(1 − κ), we have φ_n(θ^t) − φ_n(θ̂_{λ_n}) ≤ δ² for all

(30)  t ≥ 2 log((φ_n(θ⁰) − φ_n(θ̂_{λ_n}))/δ²) / log(1/κ) + log₂ log₂(ρ̄ λ_n/δ²) (1 + log 2/log(1/κ)).

⁹ When the context is clear, we remind the reader that we drop the subscript λ_n on the parameter θ̂.

REMARKS. Note that bound (30) guarantees that the excess loss φ_n(θ^t) − φ_n(θ̂) decays geometrically up to any squared error δ² larger than the compound tolerance (27). Moreover, the RSC condition also allows us to further translate this result to a bound on the optimization error θ^t − θ̂. In particular, for any iterate θ^t such that φ_n(θ^t) − φ_n(θ̂) ≤ δ², we are guaranteed that

(31)  ‖θ^t − θ̂_{λ_n}‖² ≤ 2δ²/γ̄_l + 2δ⁴ τ_l(L_n)/(γ̄_l λ_n²) + 4 τ_l(L_n)( 6 Ψ(M̄) ‖Δ*‖ + 8 R(Π_{M̄⊥}(θ*)) )²/γ̄_l.

In conjunction with Theorem 2, we see that it suffices to take a number of steps that is logarithmic in the inverse tolerance (1/δ), again showing a geometric rate of convergence. Whereas Theorem 1 requires setting the radius so that the constraint is active, Theorem 2 has only a very mild constraint on the radius ρ̄, namely that it be large enough such that ρ̄ ≥ R(θ*). The reason for this much milder requirement is that the additive regularization with weight λ_n suffices to constrain the solution, whereas the extra side constraint is only needed to ensure good behavior of the optimization algorithm in the first few iterations.

Step-size setting. It seems that updates (3) and (4) need to know the smoothness bound γ_u in order to set the step-size for gradient updates. However, we can use the same doubling trick as described in Algorithm 3.1 of Nesterov [31]. At each step, we check if the smoothness upper bound holds at the current iterate relative to the previous one. If the condition does not hold, we double our estimate of γ_u and resume. Nesterov [31] demonstrates that this guarantees a geometric convergence with a contraction factor worse at most by a factor of 2, compared to knowing γ_u exactly. (A sketch of this doubling heuristic appears below.)

The following subsections are devoted to the development of some consequences of Theorems 1 and 2 and Corollary 1 for some specific statistical models, among them sparse linear regression with ℓ1-regularization, and matrix regression with nuclear norm regularization. In contrast to the entirely deterministic arguments that underlie Theorems 1 and 2, these corollaries involve probabilistic arguments, more specifically in order to establish that the RSC and RSM properties hold with high probability.
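The doubling heuristic just described can be implemented with a simple check at each iterate. The sketch below uses illustrative names of ours and checks the plain quadratic upper bound from Nesterov's scheme, ignoring the tolerance term of the RSM condition; it doubles the estimate of γ_u until the bound holds at the candidate iterate.

```python
def doubling_step(theta, loss_grad, update_step, gamma_u):
    """One iterate with an adaptively estimated smoothness constant.

    loss_grad(theta) -> (loss value, gradient);
    update_step(theta, grad, gamma_u) -> candidate iterate, e.g. the
    projected_gradient_step or composite_gradient_step from the earlier sketches.
    Doubles gamma_u until the quadratic upper bound holds, then accepts."""
    f_t, grad_t = loss_grad(theta)
    while True:
        theta_next = update_step(theta, grad_t, gamma_u)
        f_next, _ = loss_grad(theta_next)
        diff = theta_next - theta
        # quadratic upper bound check between current and candidate iterates
        if f_next <= f_t + grad_t @ diff + 0.5 * gamma_u * (diff @ diff):
            return theta_next, gamma_u
        gamma_u *= 2.0
```

In practice one may start the next iteration from the last accepted estimate of γ_u rather than resetting it, so that the doubling phase is only paid once.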

3.2. Sparse vector regression. Recall from Section 2.4.1 the observation model for sparse linear regression. In a variety of applications, it is natural to assume that θ* is sparse. For a parameter q ∈ [0, 1] and radius R_q > 0, let us define the ℓq-ball

(32)  B_q(R_q) := { θ ∈ R^d | Σ_{j=1}^d |θ_j|^q ≤ R_q }.

Note that q = 0 corresponds to the case of hard sparsity, for which any vector β ∈ B₀(R₀) is supported on a set of cardinality at most R₀. For q ∈ (0, 1], membership in the set B_q(R_q) enforces a decay rate on the ordered coefficients, thereby modeling approximate sparsity. In order to estimate the unknown regression vector θ* ∈ B_q(R_q), we consider the least-squares Lasso estimator from Section 2.4.1, based on L_n(θ; Z₁ⁿ) := (1/2n) ‖y − Xθ‖²₂, where X ∈ R^{n×d} is the design matrix. In order to state a concrete result, we consider a random design matrix X, in which each row x_i ∈ R^d is drawn i.i.d. from a N(0, Σ) distribution, where Σ is the covariance matrix. We use σ_max(Σ) and σ_min(Σ) to refer to the maximum and minimum eigenvalues of Σ, respectively, and ζ(Σ) := max_{j=1,2,...,d} Σ_jj for the maximum variance. We also assume that the observation noise is zero-mean and ν²-sub-Gaussian.

Guarantees for constrained Lasso. Our convergence rate on the optimization error ‖θ^t − θ̂‖ is stated in terms of the contraction coefficient

(33)  κ := { 1 − σ_min(Σ)/(4σ_max(Σ)) + χ_n(Σ) } { 1 − χ_n(Σ) }^{-1},

where we have adopted the shorthand

(34)  χ_n(Σ) := (c₀ ζ(Σ)/σ_max(Σ)) R_q ((log d)/n)^{1−q/2}  for q > 0, and  χ_n(Σ) := (c₀ ζ(Σ)/σ_max(Σ)) (s log d)/n  for q = 0,

for a numerical constant c₀. We assume that χ_n(Σ) is small enough to ensure that κ ∈ (0, 1); in terms of the sample size, this amounts to a condition of the form n = Ω(R_q^{1/(1−q/2)} log d). Such a scaling is sensible, since it is known from minimax theory on sparse linear regression [34] to be necessary for any method to be statistically consistent over the ℓq-ball. With this set-up, we have the following consequence of Theorem 1:

COROLLARY 2 (Sparse vector recovery). Under the conditions of Theorem 1, suppose that we solve the constrained Lasso with ρ ≤ ‖θ*‖₁ and γ_u = 2σ_max(Σ).

(a) Exact sparsity: Suppose that θ* is supported on a subset of cardinality s. Then the iterates (3) satisfy

(35)  ‖θ^t − θ̂‖²₂ ≤ κ^t ‖θ⁰ − θ̂‖²₂ + c₂ χ_n(Σ) ‖θ̂ − θ*‖²₂  for all t = 0, 1, 2, ...

with probability at least 1 − exp(−c₁ log d).

(b) Weak sparsity: Suppose that θ* ∈ B_q(R_q) for some q ∈ (0, 1]. Then the error ‖θ^t − θ̂‖²₂ in the iterates (3) is at most

(36)  κ^t ‖θ⁰ − θ̂‖²₂ + c₂ χ_n(Σ) { R_q ((log d)/n)^{1−q/2} + ‖θ̂ − θ*‖²₂ }

for all t = 0, 1, 2, ... with probability at least 1 − exp(−c₁ log d).
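As a quick numerical illustration of the geometric decay predicted by Corollary 2(a) (and visible in Figure 1), the following toy script runs the projected gradient updates on a synthetic Lasso instance. It assumes that project_l1_ball from the sketch in Section 2.1 is in scope; the problem sizes, noise level and step-size choice are illustrative only, and the final iterate is used as a stand-in for the optimum θ̂.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 500, 1000, 30                       # illustrative sizes with d > n
theta_star = np.zeros(d)
theta_star[:s] = rng.standard_normal(s)
X = rng.standard_normal((n, d))               # identity-covariance random design
y = X @ theta_star + 0.5 * rng.standard_normal(n)

rho = np.sum(np.abs(theta_star))              # radius chosen so the constraint is active
gamma_u = 2.0 * (np.linalg.norm(X, ord=2) ** 2) / n   # crude empirical smoothness estimate
theta = np.zeros(d)
iterates = []
for t in range(150):
    grad = X.T @ (X @ theta - y) / n
    theta = project_l1_ball(theta - grad / gamma_u, rho)
    iterates.append(theta.copy())

theta_hat = iterates[-1]                      # stand-in for an optimal solution
log_errors = [np.log10(np.linalg.norm(th - theta_hat) + 1e-15) for th in iterates[:-1]]
print(log_errors[::15])                       # roughly linear decay, as in Figure 1(a)
```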

We can now compare part (a), which deals with the special case of exactly sparse vectors, to some past work that has established convergence guarantees for optimization algorithms for sparse linear regression. Certain methods are known to converge at sublinear rates (e.g., [5]), more specifically at the rate O(1/t²). The geometric rate of convergence guaranteed by Corollary 2 is exponentially faster. Other work on sparse regression has provided geometric rates of convergence that hold once the iterates are close to the optimum [9, 17], or geometric convergence up to the noise level ν² using various methods, including greedy methods [41] and thresholded gradient methods [16]. In contrast, Corollary 2 guarantees geometric convergence for all iterates up to a precision below that of the statistical error. For these problems, the statistical error ν² s (log d)/n is typically much smaller than the noise variance ν², and decreases as the sample size is increased.

In addition, Corollary 2 also applies to the case of approximately sparse vectors, lying within the set B_q(R_q) for q ∈ (0, 1]. There are some important differences between the case of exact sparsity and that of approximate sparsity. Part (a) guarantees geometric convergence to a tolerance depending only on the statistical error ‖θ̂ − θ*‖₂. In contrast, the second result also has the additional term R_q((log d)/n)^{1−q/2}. This second term arises due to the statistical nonidentifiability of linear regression over the ℓq-ball, and it is no larger than ‖θ̂ − θ*‖²₂ with high probability. This fact follows from known results [34] about minimax rates for linear regression over ℓq-balls; these unimprovable rates include a term of this order.

Guarantees for regularized Lasso. Using similar methods, we can also use Theorem 2 to obtain an analogous guarantee for the regularized Lasso estimator. Here we focus only on the case of exact sparsity, although the result extends to approximate sparsity in a similar fashion. Letting c_i, i = 0, 1, 2, 3, 4 be universal positive constants, we define the modified curvature constant γ̄_l := γ_l − c₀ (s log d/n) ζ(Σ). Our results assume that n = Ω(s log d), a condition known to be necessary for statistical consistency, so that γ̄_l > 0. The contraction factor then takes the form

κ := { 1 − σ_min(Σ)/(16σ_max(Σ)) + c₁ χ_n(Σ) } { 1 − c₂ χ_n(Σ) }^{-1},  where

(37)  χ_n(Σ) := (ζ(Σ)/γ̄_l) (s log d)/n.

The residual error in the optimization is given by

ε²_tol := ((5 + c₂ χ_n(Σ))/(1 − c₃ χ_n(Σ))) ζ(Σ) (s log d/n) ‖θ̂ − θ*‖²₂,

where θ* ∈ R^d is the unknown regression vector, and θ̂ is any optimal solution. With this notation, we have the following corollary.

COROLLARY 3 (Regularized Lasso). Under the conditions of Theorem 2, suppose that we solve the regularized Lasso with λ_n = 6ν√((log d)/n), and that θ* is supported

on a subset of cardinality at most s. Suppose further that we have

(38)  (64 ρ̄ log d)/n { 5 + γ̄_l/(4γ_u) + (64 s log d)/(n γ̄_l) } { 1 − (128 s log d)/(n γ̄_l) }^{-1} ≤ λ_n.

Then for any δ² ≥ ε²_tol and any optimum θ̂_{λ_n}, we have

‖θ^t − θ̂_{λ_n}‖²₂ ≤ δ²  for all iterations t ≥ log((φ_n(θ⁰) − φ_n(θ̂_{λ_n}))/δ²) / log(1/κ),

with probability at least 1 − exp(−c₄ log d).

As with Corollary 2(a), this result guarantees that O(log(1/ε²_tol)) iterations are sufficient to obtain an iterate θ^t that is within squared error O(ε²_tol) of any optimum θ̂_{λ_n}. Condition (38) is the specialization of equation (29) to the sparse linear regression problem, and imposes an upper bound on admissible settings of ρ̄ for our theory. Moreover, whenever (s log d)/n = o(1), a condition that is required for statistical consistency of any method by known minimax results [34], the residual error ε²_tol is of lower order than the statistical error ‖θ̂ − θ*‖²₂.

3.3. Matrix regression with rank constraints. We now turn to estimation of matrices under various types of "soft" rank constraints. Recall the model of matrix regression from Section 2.4.2, and the M-estimator (18) based on least-squares regularized with the nuclear norm. So as to reduce notational overhead, here we specialize to square matrices Θ* ∈ R^{d×d}, so that our observations are of the form

(39)  y_i = ⟨⟨X_i, Θ*⟩⟩ + w_i  for i = 1, 2, ..., n,

where X_i ∈ R^{d×d} is a matrix of covariates, and w_i ∼ N(0, ν²) is Gaussian noise. As discussed in Section 2.4.2, the nuclear norm R(Θ) = |||Θ|||₁ = Σ_{j=1}^d σ_j(Θ) is decomposable with respect to appropriately chosen matrix subspaces, and we exploit this fact heavily in our analysis. We model the behavior of both exactly and approximately low-rank matrices by enforcing a sparsity condition on the vector of singular values. In particular, for a parameter q ∈ [0, 1], we define the ℓq-ball of matrices

(40)  B_q(R_q) := { Θ ∈ R^{d×d} | Σ_{j=1}^d |σ_j(Θ)|^q ≤ R_q },

where σ_j(Θ) denotes the jth singular value of Θ. Note that if q = 0, then B₀(R₀) consists of the set of all matrices with rank at most r = R₀. On the other hand, for q ∈ (0, 1], the set B_q(R_q) contains matrices of all ranks, but enforces a relatively fast rate of decay on the singular values.

Bounds for matrix compressed sensing. We begin by considering the compressed sensing version of matrix regression, a model first introduced by Recht et al. [36], and later studied by other authors (e.g., [22, 28]). In this model, the observation matrices X_i ∈ R^{d×d} are dense and drawn from some random ensemble. The simplest example is the standard Gaussian ensemble, in which each entry of X_i is drawn i.i.d. as standard normal N(0, 1). Note that X_i is a dense matrix in general; this is an important contrast with the matrix completion setting to follow shortly. Here we consider a more general ensemble of random matrices X_i, in which each matrix X_i ∈ R^{d×d} is drawn i.i.d. from a zero-mean normal distribution in R^{d²} with covariance matrix Σ ∈ R^{d²×d²}. The setting Σ = I_{d²×d²} recovers the standard Gaussian ensemble studied in past work. As usual, we let σ_max(Σ) and σ_min(Σ) define the maximum and minimum eigenvalues of Σ, and we define ζ_mat(Σ) = sup_{‖u‖₂=1} sup_{‖v‖₂=1} var(⟨⟨X, uv^T⟩⟩), corresponding to the maximal variance of X when projected onto rank one matrices. For the identity ensemble, we have ζ_mat(I) = 1.

We now state a result on the convergence of the updates (19) when applied to a statistical problem involving a matrix Θ* ∈ B_q(R_q). The convergence rate depends on the contraction coefficient

κ := { 1 − σ_min(Σ)/(4σ_max(Σ)) + χ_n(Σ) } { 1 − χ_n(Σ) }^{-1},  where χ_n(Σ) := (c₁ ζ_mat(Σ)/σ_max(Σ)) R_q (d/n)^{1−q/2}

for some universal constant c₁. In the case q = 0, corresponding to matrices with rank at most r, note that we have R₀ = r. With this notation, we have the following convergence guarantee:

COROLLARY 4 (Low-rank matrix recovery). Under the conditions of Theorem 1, consider the semidefinite program (18) with ρ ≤ |||Θ*|||₁, and suppose that we apply the projected gradient updates (19) with γ_u = 2σ_max(Σ).

(a) Exactly low-rank: Suppose that Θ* has rank r < d. Then the iterates (19) satisfy the bound

(41)  ‖Θ^t − Θ̂‖²_F ≤ κ^t ‖Θ⁰ − Θ̂‖²_F + c₂ χ_n(Σ) ‖Θ̂ − Θ*‖²_F  for all t = 0, 1, 2, ...

with probability at least 1 − exp(−c₀ d).

(b) Approximately low-rank: Suppose that Θ* ∈ B_q(R_q) for some q ∈ (0, 1]. Then the iterates (19) satisfy

‖Θ^t − Θ̂‖²_F ≤ κ^t ‖Θ⁰ − Θ̂‖²_F + c₂ χ_n(Σ) { R_q (d/n)^{1−q/2} + ‖Θ̂ − Θ*‖²_F }

for all t = 0, 1, 2, ... with probability at least 1 − exp(−c₀ d).

Although quantitative aspects of the rates are different, Corollary 4 is analogous to Corollary 2. For the case of exactly low-rank matrices [part (a)], geometric convergence is guaranteed up to a tolerance involving the statistical error ‖Θ̂ − Θ*‖²_F. For the case of approximately low-rank matrices [part (b)], the tolerance term involves an additional factor of R_q(d/n)^{1−q/2}. Again, from known results on minimax rates for matrix estimation [37], this term is known to be of comparable or lower order than the quantity ‖Θ̂ − Θ*‖²_F. As before, it is also possible to derive an analogous corollary of Theorem 2 for estimating low-rank matrices; in the interests of space, we leave such a development to the reader.

3.3.1. Bounds for matrix completion. In this model, the observation y_i is a noisy version of a randomly selected entry Θ*_{a(i),b(i)} of the unknown matrix Θ*. Applications of this matrix completion problem include collaborative filtering [39], where the rows of the matrix Θ* correspond to users, and the columns correspond to items (e.g., movies in the Netflix database), and the entry Θ*_{ab} corresponds to user a's rating of item b. Given observations of only a subset of the entries of Θ*, the goal is to fill in, or complete, the matrix, thereby making recommendations of movies that a user has not yet seen. Matrix completion can be viewed as a particular case of the matrix regression model (17), in particular by setting X_i = E_{a(i)b(i)}, corresponding to the matrix with a single one in position (a(i), b(i)), and zeros in all other positions. Note that these observation matrices are extremely sparse, in contrast to the compressed sensing model. Nuclear-norm based estimators for matrix completion are known to have good statistical properties (e.g., [11, 29, 35, 39]). Here we consider the M-estimator

(42)  Θ̂ ∈ arg min_{Θ ∈ Ω} (1/2n) Σ_{i=1}^n (y_i − Θ_{a(i)b(i)})²  such that |||Θ|||₁ ≤ ρ,

where Ω = { Θ ∈ R^{d×d} | ‖Θ‖_∞ ≤ α/d } is the set of matrices with bounded elementwise ℓ∞-norm. This constraint eliminates matrices that are overly "spiky" (i.e., concentrate too much of their mass in a single position); as discussed in the paper [29], such spikiness control is necessary in order to bound the nonidentifiable component of the matrix completion model.

COROLLARY 5 (Matrix completion). Under the conditions of Theorem 1, suppose that Θ* ∈ B_q(R_q), and that we solve program (42) with ρ ≤ |||Θ*|||₁. As long as n > c₀ R_q^{1/(1−q/2)} d log d for a sufficiently large constant c₀, then there is a contraction coefficient κ_t ∈ (0, 1) that decreases with t such that

(43)  ‖Θ^{t+1} − Θ̂‖²_F ≤ κ_t^t ‖Θ⁰ − Θ̂‖²_F + c₂ { R_q (α² d log d / n)^{1−q/2} + ‖Θ̂ − Θ*‖²_F }

for all iterations t = 0, 1, 2, ..., with probability at least 1 − exp(−c₁ d log d).


More information

Supplemental Material: Proofs

Supplemental Material: Proofs Proof to Theorem Supplemetal Material: Proofs Proof. Let be the miimal umber of traiig items to esure a uique solutio θ. First cosider the case. It happes if ad oly if θ ad Rak(A) d, which is a special

More information

1 Duality revisited. AM 221: Advanced Optimization Spring 2016

1 Duality revisited. AM 221: Advanced Optimization Spring 2016 AM 22: Advaced Optimizatio Sprig 206 Prof. Yaro Siger Sectio 7 Wedesday, Mar. 9th Duality revisited I this sectio, we will give a slightly differet perspective o duality. optimizatio program: f(x) x R

More information

Differentiable Convex Functions

Differentiable Convex Functions Differetiable Covex Fuctios The followig picture motivates Theorem 11. f ( x) f ( x) f '( x)( x x) ˆx x 1 Theorem 11 : Let f : R R be differetiable. The, f is covex o the covex set C R if, ad oly if for

More information

17. Joint distributions of extreme order statistics Lehmann 5.1; Ferguson 15

17. Joint distributions of extreme order statistics Lehmann 5.1; Ferguson 15 17. Joit distributios of extreme order statistics Lehma 5.1; Ferguso 15 I Example 10., we derived the asymptotic distributio of the maximum from a radom sample from a uiform distributio. We did this usig

More information

Geometry of LS. LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT

Geometry of LS. LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT OCTOBER 7, 2016 LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT Geometry of LS We ca thik of y ad the colums of X as members of the -dimesioal Euclidea space R Oe ca

More information

6.867 Machine learning, lecture 7 (Jaakkola) 1

6.867 Machine learning, lecture 7 (Jaakkola) 1 6.867 Machie learig, lecture 7 (Jaakkola) 1 Lecture topics: Kerel form of liear regressio Kerels, examples, costructio, properties Liear regressio ad kerels Cosider a slightly simpler model where we omit

More information

A Note on the Symmetric Powers of the Standard Representation of S n

A Note on the Symmetric Powers of the Standard Representation of S n A Note o the Symmetric Powers of the Stadard Represetatio of S David Savitt 1 Departmet of Mathematics, Harvard Uiversity Cambridge, MA 0138, USA dsavitt@mathharvardedu Richard P Staley Departmet of Mathematics,

More information

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters

More information

Lecture 3: August 31

Lecture 3: August 31 36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,

More information

Machine Learning for Data Science (CS 4786)

Machine Learning for Data Science (CS 4786) Machie Learig for Data Sciece CS 4786) Lecture & 3: Pricipal Compoet Aalysis The text i black outlies high level ideas. The text i blue provides simple mathematical details to derive or get to the algorithm

More information

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence Chapter 3 Strog covergece As poited out i the Chapter 2, there are multiple ways to defie the otio of covergece of a sequece of radom variables. That chapter defied covergece i probability, covergece i

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit Theorems Throughout this sectio we will assume a probability space (, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

CEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering

CEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering CEE 5 Autum 005 Ucertaity Cocepts for Geotechical Egieerig Basic Termiology Set A set is a collectio of (mutually exclusive) objects or evets. The sample space is the (collectively exhaustive) collectio

More information

Convergence of random variables. (telegram style notes) P.J.C. Spreij

Convergence of random variables. (telegram style notes) P.J.C. Spreij Covergece of radom variables (telegram style otes).j.c. Spreij this versio: September 6, 2005 Itroductio As we kow, radom variables are by defiitio measurable fuctios o some uderlyig measurable space

More information

Accuracy Assessment for High-Dimensional Linear Regression

Accuracy Assessment for High-Dimensional Linear Regression Uiversity of Pesylvaia ScholarlyCommos Statistics Papers Wharto Faculty Research -016 Accuracy Assessmet for High-Dimesioal Liear Regressio Toy Cai Uiversity of Pesylvaia Zijia Guo Uiversity of Pesylvaia

More information

Basics of Probability Theory (for Theory of Computation courses)

Basics of Probability Theory (for Theory of Computation courses) Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.

More information

Chapter 6 Principles of Data Reduction

Chapter 6 Principles of Data Reduction Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a

More information

Lecture 8: October 20, Applications of SVD: least squares approximation

Lecture 8: October 20, Applications of SVD: least squares approximation Mathematical Toolkit Autum 2016 Lecturer: Madhur Tulsiai Lecture 8: October 20, 2016 1 Applicatios of SVD: least squares approximatio We discuss aother applicatio of sigular value decompositio (SVD) of

More information

Notes for Lecture 11

Notes for Lecture 11 U.C. Berkeley CS78: Computatioal Complexity Hadout N Professor Luca Trevisa 3/4/008 Notes for Lecture Eigevalues, Expasio, ad Radom Walks As usual by ow, let G = (V, E) be a udirected d-regular graph with

More information

A survey on penalized empirical risk minimization Sara A. van de Geer

A survey on penalized empirical risk minimization Sara A. van de Geer A survey o pealized empirical risk miimizatio Sara A. va de Geer We address the questio how to choose the pealty i empirical risk miimizatio. Roughly speakig, this pealty should be a good boud for the

More information

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization ECE 90 Lecture 4: Maximum Likelihood Estimatio ad Complexity Regularizatio R Nowak 5/7/009 Review : Maximum Likelihood Estimatio We have iid observatios draw from a ukow distributio Y i iid p θ, i,, where

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract I this lecture we derive risk bouds for kerel methods. We will start by showig that Soft Margi kerel SVM correspods to miimizig

More information

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 +

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + 62. Power series Defiitio 16. (Power series) Give a sequece {c }, the series c x = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + is called a power series i the variable x. The umbers c are called the coefficiets of

More information

Lecture 2. The Lovász Local Lemma

Lecture 2. The Lovász Local Lemma Staford Uiversity Sprig 208 Math 233A: No-costructive methods i combiatorics Istructor: Ja Vodrák Lecture date: Jauary 0, 208 Origial scribe: Apoorva Khare Lecture 2. The Lovász Local Lemma 2. Itroductio

More information

High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity High-dimesioal regressio with oisy ad missig data: Provable guaratees with o-covexity Po-Lig Loh Departmet of Statistics Uiversity of Califoria, Berkeley Berkeley, CA 94720 ploh@berkeley.edu Marti J. Waiwright

More information

A RANK STATISTIC FOR NON-PARAMETRIC K-SAMPLE AND CHANGE POINT PROBLEMS

A RANK STATISTIC FOR NON-PARAMETRIC K-SAMPLE AND CHANGE POINT PROBLEMS J. Japa Statist. Soc. Vol. 41 No. 1 2011 67 73 A RANK STATISTIC FOR NON-PARAMETRIC K-SAMPLE AND CHANGE POINT PROBLEMS Yoichi Nishiyama* We cosider k-sample ad chage poit problems for idepedet data i a

More information

Sequences. Notation. Convergence of a Sequence

Sequences. Notation. Convergence of a Sequence Sequeces A sequece is essetially just a list. Defiitio (Sequece of Real Numbers). A sequece of real umbers is a fuctio Z (, ) R for some real umber. Do t let the descriptio of the domai cofuse you; it

More information

Problem Cosider the curve give parametrically as x = si t ad y = + cos t for» t» ß: (a) Describe the path this traverses: Where does it start (whe t =

Problem Cosider the curve give parametrically as x = si t ad y = + cos t for» t» ß: (a) Describe the path this traverses: Where does it start (whe t = Mathematics Summer Wilso Fial Exam August 8, ANSWERS Problem 1 (a) Fid the solutio to y +x y = e x x that satisfies y() = 5 : This is already i the form we used for a first order liear differetial equatio,

More information

REGRESSION WITH QUADRATIC LOSS

REGRESSION WITH QUADRATIC LOSS REGRESSION WITH QUADRATIC LOSS MAXIM RAGINSKY Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X, Y ), where, as before, X is a R d

More information

Riesz-Fischer Sequences and Lower Frame Bounds

Riesz-Fischer Sequences and Lower Frame Bounds Zeitschrift für Aalysis ud ihre Aweduge Joural for Aalysis ad its Applicatios Volume 1 (00), No., 305 314 Riesz-Fischer Sequeces ad Lower Frame Bouds P. Casazza, O. Christese, S. Li ad A. Lider Abstract.

More information

Regularization methods for large scale machine learning

Regularization methods for large scale machine learning Regularizatio methods for large scale machie learig Lorezo Rosasco March 7, 2017 Abstract After recallig a iverse problems perspective o supervised learig, we discuss regularizatio methods for large scale

More information

6.3 Testing Series With Positive Terms

6.3 Testing Series With Positive Terms 6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week 2 Lecture: Cocept Check Exercises Starred problems are optioal. Excess Risk Decompositio 1. Let X = Y = {1, 2,..., 10}, A = {1,..., 10, 11} ad suppose the data distributio

More information

11 THE GMM ESTIMATION

11 THE GMM ESTIMATION Cotets THE GMM ESTIMATION 2. Cosistecy ad Asymptotic Normality..................... 3.2 Regularity Coditios ad Idetificatio..................... 4.3 The GMM Iterpretatio of the OLS Estimatio.................

More information

A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers

A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers A uified framework for high-dimesioal aalysis of M-estimators with decomposable regularizers Sahad Negahba Departmet of EECS UC Berkeley sahad @eecs.berkeley.edu Marti J. Waiwright Departmet of Statistics

More information

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014.

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014. Product measures, Toelli s ad Fubii s theorems For use i MAT3400/4400, autum 2014 Nadia S. Larse Versio of 13 October 2014. 1. Costructio of the product measure The purpose of these otes is to preset the

More information

ECONOMETRIC THEORY. MODULE XIII Lecture - 34 Asymptotic Theory and Stochastic Regressors

ECONOMETRIC THEORY. MODULE XIII Lecture - 34 Asymptotic Theory and Stochastic Regressors ECONOMETRIC THEORY MODULE XIII Lecture - 34 Asymptotic Theory ad Stochastic Regressors Dr. Shalabh Departmet of Mathematics ad Statistics Idia Istitute of Techology Kapur Asymptotic theory The asymptotic

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit theorems Throughout this sectio we will assume a probability space (Ω, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

Advanced Stochastic Processes.

Advanced Stochastic Processes. Advaced Stochastic Processes. David Gamarik LECTURE 2 Radom variables ad measurable fuctios. Strog Law of Large Numbers (SLLN). Scary stuff cotiued... Outlie of Lecture Radom variables ad measurable fuctios.

More information

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4.

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4. 4. BASES I BAACH SPACES 39 4. BASES I BAACH SPACES Sice a Baach space X is a vector space, it must possess a Hamel, or vector space, basis, i.e., a subset {x γ } γ Γ whose fiite liear spa is all of X ad

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS MASSACHUSTTS INSTITUT OF TCHNOLOGY 6.436J/5.085J Fall 2008 Lecture 9 /7/2008 LAWS OF LARG NUMBRS II Cotets. The strog law of large umbers 2. The Cheroff boud TH STRONG LAW OF LARG NUMBRS While the weak

More information

Quantile regression with multilayer perceptrons.

Quantile regression with multilayer perceptrons. Quatile regressio with multilayer perceptros. S.-F. Dimby ad J. Rykiewicz Uiversite Paris 1 - SAMM 90 Rue de Tolbiac, 75013 Paris - Frace Abstract. We cosider oliear quatile regressio ivolvig multilayer

More information

Distribution of Random Samples & Limit theorems

Distribution of Random Samples & Limit theorems STAT/MATH 395 A - PROBABILITY II UW Witer Quarter 2017 Néhémy Lim Distributio of Radom Samples & Limit theorems 1 Distributio of i.i.d. Samples Motivatig example. Assume that the goal of a study is to

More information

Efficient GMM LECTURE 12 GMM II

Efficient GMM LECTURE 12 GMM II DECEMBER 1 010 LECTURE 1 II Efficiet The estimator depeds o the choice of the weight matrix A. The efficiet estimator is the oe that has the smallest asymptotic variace amog all estimators defied by differet

More information

Slide Set 13 Linear Model with Endogenous Regressors and the GMM estimator

Slide Set 13 Linear Model with Endogenous Regressors and the GMM estimator Slide Set 13 Liear Model with Edogeous Regressors ad the GMM estimator Pietro Coretto pcoretto@uisa.it Ecoometrics Master i Ecoomics ad Fiace (MEF) Uiversità degli Studi di Napoli Federico II Versio: Friday

More information

On Random Line Segments in the Unit Square

On Random Line Segments in the Unit Square O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,

More information

Statistical Inference Based on Extremum Estimators

Statistical Inference Based on Extremum Estimators T. Rotheberg Fall, 2007 Statistical Iferece Based o Extremum Estimators Itroductio Suppose 0, the true value of a p-dimesioal parameter, is kow to lie i some subset S R p : Ofte we choose to estimate 0

More information

Stochastic Simulation

Stochastic Simulation Stochastic Simulatio 1 Itroductio Readig Assigmet: Read Chapter 1 of text. We shall itroduce may of the key issues to be discussed i this course via a couple of model problems. Model Problem 1 (Jackso

More information

Regression with quadratic loss

Regression with quadratic loss Regressio with quadratic loss Maxim Ragisky October 13, 2015 Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X,Y, where, as before,

More information

b i u x i U a i j u x i u x j

b i u x i U a i j u x i u x j M ath 5 2 7 Fall 2 0 0 9 L ecture 1 9 N ov. 1 6, 2 0 0 9 ) S ecod- Order Elliptic Equatios: Weak S olutios 1. Defiitios. I this ad the followig two lectures we will study the boudary value problem Here

More information

5.1 A mutual information bound based on metric entropy

5.1 A mutual information bound based on metric entropy Chapter 5 Global Fao Method I this chapter, we exted the techiques of Chapter 2.4 o Fao s method the local Fao method) to a more global costructio. I particular, we show that, rather tha costructig a local

More information

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3 MATH 337 Sequeces Dr. Neal, WKU Let X be a metric space with distace fuctio d. We shall defie the geeral cocept of sequece ad limit i a metric space, the apply the results i particular to some special

More information

10. Comparative Tests among Spatial Regression Models. Here we revisit the example in Section 8.1 of estimating the mean of a normal random

10. Comparative Tests among Spatial Regression Models. Here we revisit the example in Section 8.1 of estimating the mean of a normal random Part III. Areal Data Aalysis 0. Comparative Tests amog Spatial Regressio Models While the otio of relative likelihood values for differet models is somewhat difficult to iterpret directly (as metioed above),

More information

Rademacher Complexity

Rademacher Complexity EECS 598: Statistical Learig Theory, Witer 204 Topic 0 Rademacher Complexity Lecturer: Clayto Scott Scribe: Ya Deg, Kevi Moo Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved for

More information

Lecture 19: Convergence

Lecture 19: Convergence Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may

More information

Outline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression

Outline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression REGRESSION 1 Outlie Liear regressio Regularizatio fuctios Polyomial curve fittig Stochastic gradiet descet for regressio MLE for regressio Step-wise forward regressio Regressio methods Statistical techiques

More information

Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector

Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector Dimesio-free PAC-Bayesia bouds for the estimatio of the mea of a radom vector Olivier Catoi CREST CNRS UMR 9194 Uiversité Paris Saclay olivier.catoi@esae.fr Ilaria Giulii Laboratoire de Probabilités et

More information

Lecture 12: September 27

Lecture 12: September 27 36-705: Itermediate Statistics Fall 207 Lecturer: Siva Balakrisha Lecture 2: September 27 Today we will discuss sufficiecy i more detail ad the begi to discuss some geeral strategies for costructig estimators.

More information

ALGEBRAIC GEOMETRY COURSE NOTES, LECTURE 5: SINGULARITIES.

ALGEBRAIC GEOMETRY COURSE NOTES, LECTURE 5: SINGULARITIES. ALGEBRAIC GEOMETRY COURSE NOTES, LECTURE 5: SINGULARITIES. ANDREW SALCH 1. The Jacobia criterio for osigularity. You have probably oticed by ow that some poits o varieties are smooth i a sese somethig

More information

Lecture Notes for Analysis Class

Lecture Notes for Analysis Class Lecture Notes for Aalysis Class Topological Spaces A topology for a set X is a collectio T of subsets of X such that: (a) X ad the empty set are i T (b) Uios of elemets of T are i T (c) Fiite itersectios

More information

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015 ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],

More information

Chi-Squared Tests Math 6070, Spring 2006

Chi-Squared Tests Math 6070, Spring 2006 Chi-Squared Tests Math 6070, Sprig 2006 Davar Khoshevisa Uiversity of Utah February XXX, 2006 Cotets MLE for Goodess-of Fit 2 2 The Multiomial Distributio 3 3 Applicatio to Goodess-of-Fit 6 3 Testig for

More information

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices Radom Matrices with Blocks of Itermediate Scale Strogly Correlated Bad Matrices Jiayi Tog Advisor: Dr. Todd Kemp May 30, 07 Departmet of Mathematics Uiversity of Califoria, Sa Diego Cotets Itroductio Notatio

More information

Definitions and Theorems. where x are the decision variables. c, b, and a are constant coefficients.

Definitions and Theorems. where x are the decision variables. c, b, and a are constant coefficients. Defiitios ad Theorems Remember the scalar form of the liear programmig problem, Miimize, Subject to, f(x) = c i x i a 1i x i = b 1 a mi x i = b m x i 0 i = 1,2,, where x are the decisio variables. c, b,

More information

A Hadamard-type lower bound for symmetric diagonally dominant positive matrices

A Hadamard-type lower bound for symmetric diagonally dominant positive matrices A Hadamard-type lower boud for symmetric diagoally domiat positive matrices Christopher J. Hillar, Adre Wibisoo Uiversity of Califoria, Berkeley Jauary 7, 205 Abstract We prove a ew lower-boud form of

More information

A class of spectral bounds for Max k-cut

A class of spectral bounds for Max k-cut A class of spectral bouds for Max k-cut Miguel F. Ajos, José Neto December 07 Abstract Let G be a udirected ad edge-weighted simple graph. I this paper we itroduce a class of bouds for the maximum k-cut

More information

Linear Support Vector Machines

Linear Support Vector Machines Liear Support Vector Machies David S. Roseberg The Support Vector Machie For a liear support vector machie (SVM), we use the hypothesis space of affie fuctios F = { f(x) = w T x + b w R d, b R } ad evaluate

More information

1.010 Uncertainty in Engineering Fall 2008

1.010 Uncertainty in Engineering Fall 2008 MIT OpeCourseWare http://ocw.mit.edu.00 Ucertaity i Egieerig Fall 2008 For iformatio about citig these materials or our Terms of Use, visit: http://ocw.mit.edu.terms. .00 - Brief Notes # 9 Poit ad Iterval

More information

Empirical Process Theory and Oracle Inequalities

Empirical Process Theory and Oracle Inequalities Stat 928: Statistical Learig Theory Lecture: 10 Empirical Process Theory ad Oracle Iequalities Istructor: Sham Kakade 1 Risk vs Risk See Lecture 0 for a discussio o termiology. 2 The Uio Boud / Boferoi

More information

Linear Programming and the Simplex Method

Linear Programming and the Simplex Method Liear Programmig ad the Simplex ethod Abstract This article is a itroductio to Liear Programmig ad usig Simplex method for solvig LP problems i primal form. What is Liear Programmig? Liear Programmig is

More information

Singular Continuous Measures by Michael Pejic 5/14/10

Singular Continuous Measures by Michael Pejic 5/14/10 Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable

More information

Notes on iteration and Newton s method. Iteration

Notes on iteration and Newton s method. Iteration Notes o iteratio ad Newto s method Iteratio Iteratio meas doig somethig over ad over. I our cotet, a iteratio is a sequece of umbers, vectors, fuctios, etc. geerated by a iteratio rule of the type 1 f

More information

Lecture 10 October Minimaxity and least favorable prior sequences

Lecture 10 October Minimaxity and least favorable prior sequences STATS 300A: Theory of Statistics Fall 205 Lecture 0 October 22 Lecturer: Lester Mackey Scribe: Brya He, Rahul Makhijai Warig: These otes may cotai factual ad/or typographic errors. 0. Miimaxity ad least

More information

CHAPTER 10 INFINITE SEQUENCES AND SERIES

CHAPTER 10 INFINITE SEQUENCES AND SERIES CHAPTER 10 INFINITE SEQUENCES AND SERIES 10.1 Sequeces 10.2 Ifiite Series 10.3 The Itegral Tests 10.4 Compariso Tests 10.5 The Ratio ad Root Tests 10.6 Alteratig Series: Absolute ad Coditioal Covergece

More information

EECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1

EECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1 EECS564 Estimatio, Filterig, ad Detectio Hwk 2 Sols. Witer 25 4. Let Z be a sigle observatio havig desity fuctio where. p (z) = (2z + ), z (a) Assumig that is a oradom parameter, fid ad plot the maximum

More information

Sequences and Series of Functions

Sequences and Series of Functions Chapter 6 Sequeces ad Series of Fuctios 6.1. Covergece of a Sequece of Fuctios Poitwise Covergece. Defiitio 6.1. Let, for each N, fuctio f : A R be defied. If, for each x A, the sequece (f (x)) coverges

More information