Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression

Anonymous Author(s)
Affiliation
Address
email

Abstract

We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle. We present the first algorithm that achieves jointly the optimal prediction error rates for least-squares regression, both in terms of forgetting of initial conditions and in terms of dependence on the noise and dimension of the problem, and prove dimensionless and tighter rates for a regularized version of this algorithm.

1 Introduction

Many supervised machine learning problems are naturally cast as the minimization of a smooth function defined on a Euclidean space. This includes least-squares regression, logistic regression (see, e.g., Hastie et al., 2009) or generalized linear models (McCullagh and Nelder, 1989). While small problems with few or low-dimensional input features may be solved precisely by many potential optimization algorithms (e.g., Newton's method), large-scale problems with many high-dimensional features are typically solved with simple gradient-based iterative techniques whose per-iteration cost is small.

In this paper, we consider a quadratic objective function f whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite-variance random error. In this stochastic approximation framework (Robbins and Monro, 1951), it is known that two quantities dictate the behavior of various algorithms, namely the covariance matrix V of the noise in the gradients, and the deviation θ0 − θ∗ between the initial point of the algorithm θ0 and any of the global minimizers θ∗ of f.
This leads to a bias/variance decomposition (Bach and Moulines, 2013; Hsu et al., 2014) of the performance of most algorithms as the sum of two terms: (a) the bias term characterizes how fast initial conditions are forgotten and thus is increasing in a well-chosen norm of θ0 − θ∗; while (b) the variance term characterizes the effect of the noise in the gradients, independently of the starting point, and with a term that is increasing in the covariance of the noise.

For quadratic functions with (a) a noise covariance matrix V which is proportional (with constant σ²) to the Hessian of f (a situation which corresponds to least-squares regression) and (b) an initial point characterized by the norm ‖θ0 − θ∗‖², the optimal bias and variance terms are known separately. On the one hand, the optimal bias term after n iterations is proportional to L‖θ0 − θ∗‖²/n², where L is the largest eigenvalue of the Hessian of f. This rate is achieved by accelerated gradient descent (Nesterov, 1983, 2004), and is known to be optimal if the number of iterations n is less than the dimension d of the underlying predictors, but the algorithm is not robust to random or deterministic noise in the gradients (d'Aspremont, 2008; Schmidt et al., 2011; Devolder et al., 2014). On the other hand, the optimal variance term is proportional to σ²d/n (Tsybakov, 2003); it is known to be achieved by averaged gradient descent (Bach and Moulines, 2013), for which the bias term only achieves L‖θ0 − θ∗‖²/n instead of L‖θ0 − θ∗‖²/n².

Submitted to 30th Conference on Neural Information Processing Systems (NIPS 2016). Do not distribute.
Our first contribution in this paper is to analyze in Section 3 averaged accelerated gradient descent, showing that it attains optimal rates for both the variance and the bias terms. It shows that averaging is beneficial for accelerated techniques and provides a provable robustness to noise. While optimal when measuring performance in terms of the dimension d and the initial distance to optimum ‖θ0 − θ∗‖², these rates are not adapted to many situations where either d is larger than the number of iterations n (i.e., the number of observations for regular stochastic gradient descent) or L‖θ0 − θ∗‖² is much larger than n². Our second contribution is to provide in Section 4 an analysis of a new algorithm (based on some additional regularization) that can adapt our bounds to finer assumptions on θ0 − θ∗ and the Hessian of the problem, leading in particular to dimension-free quantities that can thus be extended to the Hilbert space setting (in particular for non-parametric estimation).

2 Least-Squares Regression

Statistical Assumptions. We consider the following general setting: H is a d-dimensional Euclidean space with d ≥ 1; the observations (xn, yn) ∈ H × R, n ≥ 1, are independent and identically distributed (i.i.d.), and such that E‖xn‖² and E yn² are finite. We consider the least-squares regression problem, which is the minimization of the quadratic function f(θ) = (1/2) E(⟨xn, θ⟩ − yn)².

Covariance matrix: We denote by Σ = E(xn ⊗ xn) ∈ R^{d×d} the population covariance matrix, which is the Hessian of f at all points. Without loss of generality, we can assume Σ invertible. This implies that all eigenvalues of Σ are strictly positive (but they may be arbitrarily small). We assume there exists R > 0 such that E[‖xn‖² xn ⊗ xn] ≼ R²Σ, where A ≼ B means that B − A is positive semi-definite. This assumption is satisfied, for example, for least-squares regression with almost surely bounded data.

Eigenvalue decay: Most convergence bounds depend on the dimension d of H.
However, it is possible to derive dimension-free and often tighter convergence rates by considering bounds depending on the value tr Σ^b for b ∈ [0, 1]. Given b, if we consider the eigenvalues of Σ ordered in decreasing order, which we denote by s_i, then tr Σ^b = Σ_i s_i^b, which is smaller when the eigenvalues decay faster. For b going to 0, tr Σ^b tends to d and we are back in the classical low-dimensional case. When b = 1, we simply get tr Σ = E‖xn‖², which will correspond to the weakest assumption in our context.

Optimal predictor: The regression function f(θ) = (1/2) E(⟨xn, θ⟩ − yn)² always admits a global minimum θ∗ = Σ⁻¹ E(yn xn). When initializing algorithms at θ0 = 0 or regularizing by the squared norm, rates of convergence generally depend on ‖θ∗‖, a quantity which could be arbitrarily large. However, there exists a systematic upper bound ‖Σ^{1/2} θ∗‖² ≤ E yn². This leads naturally to the consideration of convergence bounds depending on ‖Σ^{r/2} θ∗‖ for r ≤ 1.

Noise: We denote by εn = yn − ⟨θ∗, xn⟩ the residual, for which we have E[εn xn] = 0. Although we do not have E[εn | xn] = 0 in general unless the model is well-specified, we assume the noise to be a structured process such that there exists σ > 0 with E[εn² xn ⊗ xn] ≼ σ²Σ. This assumption is satisfied, for example, for almost surely bounded data or when the model is well-specified.

Averaged Gradient Methods and Acceleration. We focus in this paper on stochastic gradient methods with acceleration for a quadratic function regularized by (λ/2)‖θ − θ0‖². The regularization will be useful when deriving tighter convergence rates in Section 4, and it has the additional benefit of making the problem λ-strongly-convex. Accelerated stochastic gradient descent is defined by an iterative system with two parameters (θn, νn), starting from θ0 = ν0 ∈ H and satisfying for n ≥ 1,

θn = νn−1 − γ f′n(νn−1) − γλ(νn−1 − θ0),
νn = θn + δ(θn − θn−1),    (1)

with (γ, δ) ∈ R² and f′n(νn−1) an unbiased estimate of the gradient f′(νn−1).
The momentum coefficient δ ∈ R is chosen to accelerate the convergence rate (Nesterov, 1983; Beck and Teboulle, 2009) and has its roots in the heavy-ball algorithm of Polyak (1964). We especially concentrate here, following Polyak and Juditsky (1992), on the average of the sequence, θ̄n = (1/(n+1)) Σ_{i=0}^{n} θi.
Stochastic Oracles on the Gradient. Let (Fn)n≥0 be the increasing family of σ-fields that are generated by all variables (xi, yi) for i ≤ n. The oracle we consider is the sum of the true gradient f′(θ) and an independent zero-mean noise that does not depend on θ.¹ Consequently, it is of the form f′n(θ) = f′(θ) − ξn, where the noise process ξn is Fn-measurable with E[ξn | Fn−1] = 0 and E[‖ξn‖²] finite. Furthermore, we also assume that there exists τ ∈ R such that E[ξn ⊗ ξn] ≼ τ²Σ, that is, the noise has a particular structure adapted to least-squares regression.

3 Accelerated Stochastic Averaged Gradient Descent

We study the convergence of averaged accelerated stochastic gradient descent defined by Eq. (1) for λ = 0 and δ = 1. It can be rewritten for the quadratic function f as a second-order iterative system with constant coefficients: θn = [I − γΣ](2θn−1 − θn−2) + γ yn xn.

Theorem 1 For any constant step-size γ such that γΣ ≼ I,

E f(θ̄n) − f(θ∗) ≤ 36 ‖θ0 − θ∗‖² / (γ(n + 1)²) + 8 τ²d / (n + 1).    (2)

We can make the following observations:

The first bound ‖θ0 − θ∗‖²/(γn²) in Eq. (2) corresponds to the usual accelerated rate. It has been shown by Nesterov (2004) to be the optimal rate of convergence for optimizing a quadratic function with a first-order method that can access only sequences of gradients, when n ≤ d. We recover by averaging an algorithm dedicated to strongly-convex functions the traditional convergence rate for non-strongly-convex functions.

The second bound in Eq. (2) matches the optimal statistical performance τ²d/n over all estimators in H (Tsybakov, 2008), even without computational limits, in the sense that no estimator that uses the same information can improve upon this rate. Accordingly, this algorithm achieves joint bias/variance optimality (when measured in terms of τ² and ‖θ0 − θ∗‖²).
We have the same rate of convergence for the bias when compared to the regular Nesterov acceleration without averaging studied by Flammarion and Bach (2015), which corresponds to choosing δn = 1 − 2/n for all n. However, if the problem is µ-strongly convex, the latter was shown to also converge at the linear rate O((1 − √(γµ))^n) and thus is adaptive to hidden strong convexity (since the algorithm does not need to know µ to run), and thus ends up converging faster than the rate 1/n². This is confirmed in our experiments in Section 5.

Overall, the bias term is improved whereas the variance term is not degraded, and acceleration is thus robust to noise in the gradients. Thereby, while second-order iterative methods for optimizing quadratic functions in the singular case, such as conjugate gradient (Polyak, 1987, Section 6.1), are notoriously highly sensitive to noise, we are able to propose a version which is robust to stochastic noise.

4 Tighter Convergence Rates

We have seen in Theorem 1 above that the averaged accelerated gradient algorithm matches the lower bounds τ²d/n and L‖θ0 − θ∗‖²/n² for the prediction error. However, the algorithm performs better in almost all cases except the worst-case scenarios corresponding to the lower bounds. For example, the algorithm may still predict well when the dimension d is much bigger than n. Similarly, the norm of the optimal predictor ‖θ∗‖² may be huge and the prediction still good, as gradient algorithms happen to be adaptive to the difficulty of the problem: indeed, if the problem is simpler, the convergence rate of the gradient algorithm will be improved. In this section, we provide such a theoretical guarantee. We study the convergence of averaged accelerated stochastic gradient descent defined by Eq. (1) for λ = (γ(n + 1)²)⁻¹ and δ ∈ [1 − 2/(n + 2), 1]. We have the following theorem:

¹ This is different from the oracle usually considered in stochastic approximation (see Bach and Moulines (2013); Dieuleveut and Bach (2015)).
Theorem 2 For any constant step-size γ such that γ(Σ + λI) ≼ I,

E f(θ̄n) − f(θ∗) ≤ min_{r∈[0,1], b∈[0,1]} [ 74 ‖Σ^{r/2}(θ0 − θ∗)‖² / (γ^{1−r}(n + 1)^{2(1−r)}) + 8 τ² γ^b tr(Σ^b) / (n + 1)^{1−2b} ].

We can make the following observations:

The algorithm is independent of r and b; thus all the bounds for different values of (r, b) are valid. This is a strong property of the algorithm, which is indeed adaptive to the regularity and the effective dimension of the problem (once γ is chosen).

In situations in which either d is larger than n or L‖θ0 − θ∗‖² is larger than n², the algorithm can still enjoy good convergence properties, by adapting to the best values of b and r.

For b = 0 we recover the variance term of Theorem 1, but for b > 0 and fast decays of eigenvalues of Σ, the bound may be much smaller; note that we lose in the dependency in n, but typically, for large d, this can be advantageous. With r, b well chosen, we recover the optimal rate for non-parametric regression (Caponnetto and De Vito, 2007).

5 Experiments

We illustrate now our theoretical results on synthetic examples. For d = 25 we consider normally distributed inputs xn with random covariance matrix Σ which has eigenvalues 1/i³, for i = 1, ..., d, and random optimum θ∗ and starting point θ0 such that ‖θ0 − θ∗‖ = 1. The outputs yn are generated from a linear function with homoscedastic noise with unit signal-to-noise ratio (σ² = 1); we take R² = tr Σ, the average radius of the data, a step-size γ = 1/R², and λ = 0. The additive noise oracle is used. We show results averaged over 10 replications. We compare the performance of averaged SGD (AvSGD), usual Nesterov acceleration for convex functions (AccSGD) and our novel averaged accelerated SGD (AvAccSGD)², on two different problems: one deterministic (‖θ0 − θ∗‖ = 1, σ² = 0), which will illustrate how the bias term behaves, and one purely stochastic (‖θ0 − θ∗‖ = 0, σ² = 1), which will illustrate how the variance term behaves.
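Before turning to the results, the purely stochastic part of this setup can be sketched in a few lines (an illustrative NumPy simulation working directly in the eigenbasis of Σ, with the additive noise drawn as ξn ~ N(0, σ²Σ); the number of samples and variable names are our own simplifications of the protocol above):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma2 = 25, 20000, 1.0
s = 1.0 / np.arange(1, d + 1) ** 3     # eigenvalues of Sigma (diagonal in the eigenbasis)
gamma = 1.0 / np.sum(s)                # gamma = 1/R^2 with R^2 = tr(Sigma)

def excess_risk(theta):                # f(theta) - f(theta_*), with theta_* = 0 here
    return 0.5 * np.sum(s * theta**2)

# Additive noise oracle: exact gradient Sigma*theta minus noise with covariance sigma^2 Sigma.
noise = rng.standard_normal((n, d)) * np.sqrt(sigma2 * s)

# Averaged SGD (AvSGD).
theta, avg = np.zeros(d), np.zeros(d)
for k in range(n):
    theta = theta - gamma * (s * theta - noise[k])
    avg += (theta - avg) / (k + 2)     # running average of theta_0, ..., theta_{k+1}
risk_avsgd = excess_risk(avg)

# Averaged accelerated SGD (AvAccSGD): Eq. (1) with lambda = 0 and delta = 1.
theta, prev, nu, avg2 = np.zeros(d), np.zeros(d), np.zeros(d), np.zeros(d)
for k in range(n):
    prev, theta = theta, nu - gamma * (s * nu - noise[k])
    nu = theta + (theta - prev)
    avg2 += (theta - avg2) / (k + 2)
risk_avaccsgd = excess_risk(avg2)

# Both averaged methods should land within the tau^2 d / n variance scale of Theorem 1.
assert max(risk_avsgd, risk_avaccsgd) < 8 * sigma2 * d / n
```

On this variance-only problem both averaged methods converge at the statistical scale, which is the behavior reported in the right plot of Figure 1.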
For the bias (left plot of Figure 1), AvSGD converges at speed O(1/n), while AvAccSGD and AccSGD both converge at speed O(1/n²). However, as mentioned in the observations following Theorem 1, AccSGD takes advantage of the hidden strong convexity of the quadratic function and starts converging linearly at the end. For the variance (right plot of Figure 1), AccSGD is not converging to the optimum and keeps oscillating, whereas AvSGD and AvAccSGD both converge to the optimum at a speed O(1/n). However, AvSGD remains slightly faster in the beginning. Note that for small n, or when the bias L‖θ0 − θ∗‖²/n² is much bigger than the variance σ²d/n, the bias may have a stronger effect, although asymptotically, the variance always dominates. It is thus essential to have an algorithm which is optimal in both regimes; this is achieved by AvAccSGD.

Figure 1: Synthetic problem (d = 25), γ = 1/R². Left: bias. Right: variance. Both panels show log10[f(θ) − f(θ∗)] versus log10(n) for AvAccSGD, AccSGD and AvSGD.

² This is not the averaging of AccSGD, because the momentum term is proportional to 1 − 3/n for AccSGD instead of being equal to 1 for AvAccSGD.
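Relatedly, the adaptivity promised by Theorem 2 can be probed on the same spectrum by evaluating its right-hand side over a grid of (r, b) (an illustrative computation; the values of n, τ and of θ0 − θ∗, expressed here in the eigenbasis of Σ, are our own choices):

```python
import numpy as np

d, n, tau = 25, 10**4, 1.0
s = 1.0 / np.arange(1, d + 1) ** 3     # eigenvalues 1/i^3, as in the experiments
gamma = 1.0 / np.sum(s)                # step-size 1/R^2 with R^2 = tr(Sigma)
delta0 = np.ones(d) / np.sqrt(d)       # theta_0 - theta_* in the eigenbasis, norm 1

def bound(r, b):
    """Right-hand side of Theorem 2 for a given pair (r, b)."""
    bias = 74 * np.sum(s**r * delta0**2) / (gamma ** (1 - r) * (n + 1) ** (2 * (1 - r)))
    variance = 8 * tau**2 * gamma**b * np.sum(s**b) / (n + 1) ** (1 - 2 * b)
    return bias + variance

# The algorithm does not depend on (r, b), so the tightest bound over the grid is valid.
grid = [(r, b) for r in np.linspace(0, 1, 11) for b in np.linspace(0, 1, 11)]
best = min(bound(r, b) for r, b in grid)
assert best <= bound(0.0, 0.0)  # never worse than the (r, b) = (0, 0) regime of Theorem 1
```

The minimum over the grid is by construction no larger than the (r, b) = (0, 0) value, which is the regime of Theorem 1; for other spectra or horizons a different (r, b) may dominate.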
References

Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems.

Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183-202.

Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331-368.

d'Aspremont, A. (2008). Smooth optimization with approximate gradient. SIAM J. Optim., 19(3):1171-1183.

Devolder, O., Glineur, F., and Nesterov, Y. (2014). First-order methods of smooth convex optimization with inexact oracle. Math. Program., 146(1-2, Ser. A):37-75.

Dieuleveut, A. and Bach, F. (2015). Non-parametric stochastic approximation with large step sizes. Annals of Statistics. To appear.

Flammarion, N. and Bach, F. (2015). From averaging to acceleration, there is only a step-size. In Proceedings of the International Conference on Learning Theory (COLT).

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer Series in Statistics. Springer, second edition.

Hsu, D., Kakade, S. M., and Zhang, T. (2014). Random design analysis of ridge regression. Foundations of Computational Mathematics, 14(3):569-600.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Monographs on Statistics and Applied Probability. Chapman & Hall, London, second edition.

Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372-376.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, MA.

Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1-17.
Polyak, B. T. (1987). Introduction to Optimization. Translations Series in Mathematics and Engineering. Optimization Software, Inc., Publications Division, New York.

Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838-855.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407.

Schmidt, M., Le Roux, N., and Bach, F. (2011). Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems.

Tsybakov, A. B. (2003). Optimal rates of aggregation. In Proceedings of the Annual Conference on Computational Learning Theory.

Tsybakov, A. B. (2008). Introduction to Nonparametric Estimation. Springer.