On Acceleration with Noise-Corrupted Gradients


A. Omitted Proofs from Section 3

Proof of Lemma 3.2. Let $m_k(u) = \sum_{i=1}^k a_i \langle \widehat{\nabla} f(x_i), u - x_i \rangle + D_\psi(u, x_0)$ denote the function under the minimum in the lower bound. By Proposition 3.1, $v_k = \nabla \psi^*(z_k) = \arg\min_{u \in X} m_k(u)$. Observe that $m_k(u) = a_k \langle \widehat{\nabla} f(x_k), u - x_k \rangle + m_{k-1}(u)$. By the definition of Bregman divergence:

$$m_{k-1}(v_k) = m_{k-1}(v_{k-1}) + \langle \nabla m_{k-1}(v_{k-1}), v_k - v_{k-1} \rangle + D_{m_{k-1}}(v_k, v_{k-1}).$$

As Bregman divergence is blind to linear and zero-order terms, we have that $D_{m_{k-1}}(v_k, v_{k-1}) = D_\psi(v_k, v_{k-1})$. By Proposition 3.1, $v_{k-1} = \arg\min_{u \in X} m_{k-1}(u)$, and hence $\langle \nabla m_{k-1}(v_{k-1}), v_k - v_{k-1} \rangle \geq 0$. Therefore,

$$m_k(v_k) \geq m_{k-1}(v_{k-1}) + a_k \langle \widehat{\nabla} f(x_k), v_k - x_k \rangle + D_\psi(v_k, v_{k-1}).$$

Using the definition of $\widehat{\nabla} f(x_k) = \nabla f(x_k) + \eta_k$, the change in the lower bound is:

$$A_k L_k - A_{k-1} L_{k-1} \geq a_k f(x_k) + a_k \langle \nabla f(x_k), v_k - x_k \rangle + D_\psi(v_k, v_{k-1}) - a_k \langle \eta_k, x^* - v_k \rangle. \tag{A.1}$$

For the change in the upper bound, we have:

$$A_k U_k - A_{k-1} U_{k-1} = A_k f(y_k) - A_{k-1} f(y_{k-1}) = a_k f(x_k) + A_k \big( f(y_k) - f(x_k) \big) + A_{k-1} \big( f(x_k) - f(y_{k-1}) \big). \tag{A.2}$$

By convexity of $f(\cdot)$:

$$f(x_k) - f(y_{k-1}) \leq \langle \nabla f(x_k), x_k - y_{k-1} \rangle. \tag{A.3}$$

Combining (A.1)-(A.3) and (AGD+):

$$A_k G_k - A_{k-1} G_{k-1} \leq a_k \langle \eta_k, x^* - v_k \rangle - D_\psi(v_k, v_{k-1}) + A_k \big( f(y_k) - f(x_k) - \langle \nabla f(x_k), y_k - x_k \rangle \big),$$

as claimed.

B. AGD+ for Smooth and Strongly Convex Minimization

Here we show that AGD+ can be extended to the setting of smooth and strongly convex minimization. As is customary (Nesterov, 2013), in this setting we assume that $\|\cdot\| = \|\cdot\|_2$, so that $f(\cdot)$ is $L$-smooth and $\mu$-strongly convex w.r.t. the $\ell_2$ norm, for $L < \infty$ and $\mu > 0$. To distinguish from the non-strongly-convex case, we refer to AGD+ for smooth and strongly convex minimization as $\mu$AGD+.

To analyze AGD+ in this setting, we need to use a stronger lower bound $L_k$, which is constructed by the same arguments as before, but now using strong convexity instead of regular convexity. Such a construction gives:

$$L_k = \frac{\sum_{i=1}^k a_i f(x_i) + \min_{u \in X} m_k(u) - \sum_{i=1}^k a_i \langle \eta_i, x^* - x_i \rangle - D_\psi(x^*, x_0)}{A_k}, \tag{B.1}$$

where $m_k(u) = \sum_{i=1}^k a_i \big( \langle \widehat{\nabla} f(x_i), u - x_i \rangle + \frac{\mu}{2}\|u - x_i\|^2 \big) + D_\psi(u, x_0)$. While it suffices to have $\psi$ be an arbitrary function that is strongly convex w.r.t. $\|\cdot\|$, for simplicity, we take $\psi(x) = \frac{\mu_0}{2}\|x\|^2$, where $\mu_0$ will be specified later. For $\theta_k = a_k / A_k$, the algorithm can now be stated as follows:

$$v_k = \arg\min_{u \in X} m_k(u), \qquad x_k = \frac{1}{1 + \theta_k}\, y_{k-1} + \frac{\theta_k}{1 + \theta_k}\, v_{k-1}, \qquad y_k = (1 - \theta_k)\, y_{k-1} + \theta_k v_k, \tag{$\mu$AGD+}$$

where $x_1 = x_0 = y_0 = v_0$ is an arbitrary initial point from $X$.
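For concreteness, the following is a minimal Python sketch of the ($\mu$AGD+) iteration in the unconstrained Euclidean case, where $\arg\min_u m_k(u)$ has a closed form (obtained by setting $\nabla m_k(u) = 0$). The step-weight recursion enforcing $a_k/A_k = \sqrt{\mu/L}$, the Gaussian noise model, and the toy quadratic are our illustration choices; this is a sketch, not the paper's reference implementation.

```python
import numpy as np

def mu_agd_plus(grad, x0, L, mu, K, noise_std=0.0, rng=None):
    """Minimal sketch of the (muAGD+) iteration: unconstrained X, Euclidean
    psi(x) = (mu0/2)||x||^2 with mu0 = a_1 (L - mu), and a noisy oracle
    modeled as grad f(x) + eta, eta ~ N(0, noise_std^2 I). Weights follow
    a_1 = 1 and a_k / A_k = gamma = sqrt(mu / L) for k >= 2."""
    rng = np.random.default_rng() if rng is None else rng
    gamma = np.sqrt(mu / L)
    mu0 = L - mu                       # a_1 = 1
    y, v = x0.copy(), x0.copy()        # y_0 = v_0 = x_0
    A = 0.0
    sum_ag = np.zeros_like(x0)         # running sum of a_i * ghat_i
    sum_ax = np.zeros_like(x0)         # running sum of a_i * x_i
    for k in range(1, K + 1):
        a = 1.0 if k == 1 else gamma * A / (1.0 - gamma)  # gives a/(A+a) = gamma
        A += a
        theta = a / A
        x = (y + theta * v) / (1.0 + theta)               # x_k; equals x_0 at k = 1
        ghat = grad(x) + noise_std * rng.standard_normal(x.shape)
        sum_ag += a * ghat
        sum_ax += a * x
        # Closed-form v_k = argmin_u m_k(u) for unconstrained X:
        v = (mu * sum_ax + mu0 * x0 - sum_ag) / (mu * A + mu0)
        y = (1.0 - theta) * y + theta * v                 # y_k
    return y

# Example on a toy quadratic f(x) = 0.5 * x' Q x (assumed test problem):
Q = np.diag(np.linspace(0.1, 1.0, 20))
yK = mu_agd_plus(lambda x: Q @ x, np.ones(20), L=1.0, mu=0.1, K=200)
print(0.5 * yK @ Q @ yK)   # should be close to 0
```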

As before, the main convergence argument is to show that $A_k G_k - A_{k-1} G_{k-1} \leq E^\eta_k$, and to combine it with a bound on the initial gap $G_1$. We start with bounding the initial gap, as follows.

Proposition B.1. If $\psi(x) = \frac{\mu_0}{2}\|x\|^2$, where $\mu_0 = a_1 (L - \mu)$, then

$$A_1 G_1 \leq \frac{a_1 (L - \mu)}{2}\|x^* - x_0\|^2 + E^\eta_1, \quad \text{where } E^\eta_1 = a_1 \langle \eta_1, x^* - v_1 \rangle.$$

Proof. As $x_1 = x_0$, the initial lower bound is:

$$A_1 L_1 = a_1 f(x_1) + a_1 \langle \widehat{\nabla} f(x_1), v_1 - x_1 \rangle + \frac{a_1 \mu + \mu_0}{2}\|v_1 - x_1\|^2 - \frac{\mu_0}{2}\|x^* - x_0\|^2 - a_1 \langle \eta_1, x^* - x_1 \rangle.$$

As $a_1 = A_1$, it follows that $y_1 = v_1$, and hence:

$$A_1 U_1 = a_1 f(v_1) \leq a_1 f(x_1) + a_1 \langle \nabla f(x_1), v_1 - x_1 \rangle + \frac{a_1 L}{2}\|v_1 - x_1\|^2 = a_1 f(x_1) + a_1 \langle \widehat{\nabla} f(x_1), v_1 - x_1 \rangle + \frac{a_1 L}{2}\|v_1 - x_1\|^2 - a_1 \langle \eta_1, v_1 - x_1 \rangle,$$

where the inequality is by the smoothness of $f(\cdot)$. Combining the bounds on the initial upper and lower bounds, it follows that:

$$A_1 G_1 \leq \frac{\mu_0}{2}\|x^* - x_0\|^2 + \frac{a_1 L - a_1 \mu - \mu_0}{2}\|v_1 - x_1\|^2 + a_1 \langle \eta_1, x^* - v_1 \rangle = \frac{a_1 (L - \mu)}{2}\|x^* - x_0\|^2 + E^\eta_1,$$

as, by the initial assumption, $\mu_0 = a_1 (L - \mu)$, and $E^\eta_1 = a_1 \langle \eta_1, x^* - v_1 \rangle$.

To bound the change in the lower bound, it is useful to first bound $m_k(v_k) - m_{k-1}(v_{k-1})$, as in the following technical proposition.

Proposition B.2. Let $\psi(x) = \frac{a_1 (L - \mu)}{2}\|x\|^2$. Then:

$$m_k(v_k) - m_{k-1}(v_{k-1}) \geq a_k \langle \widehat{\nabla} f(x_k), v_k - x_k \rangle + \frac{A_{k-1}\mu}{2}\|v_k - v_{k-1}\|^2 + \frac{a_k \mu}{2}\|v_k - x_k\|^2.$$

Proof. Observe that, by the definition of $m_k(\cdot)$,

$$m_k(v_k) = m_{k-1}(v_k) + a_k \langle \widehat{\nabla} f(x_k), v_k - x_k \rangle + \frac{a_k \mu}{2}\|v_k - x_k\|^2.$$

The rest of the proof bounds $m_{k-1}(v_k) - m_{k-1}(v_{k-1})$. Observe that, as $v_{k-1} = \arg\min_{u \in X} m_{k-1}(u)$, it must be that $\langle \nabla m_{k-1}(v_{k-1}), u - v_{k-1} \rangle \geq 0$, $\forall u \in X$. As Bregman divergence is blind to linear terms, and as $m_{k-1}(\cdot)$ is $\big(A_{k-1}\mu + a_1(L - \mu)\big)$-strongly convex:

$$m_{k-1}(v_k) - m_{k-1}(v_{k-1}) = \langle \nabla m_{k-1}(v_{k-1}), v_k - v_{k-1} \rangle + D_{m_{k-1}}(v_k, v_{k-1}) \geq \frac{A_{k-1}\mu + a_1(L - \mu)}{2}\|v_k - v_{k-1}\|^2 \geq \frac{A_{k-1}\mu}{2}\|v_k - v_{k-1}\|^2.$$
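As a quick numerical sanity check of this construction (our own, on hypothetical random data), the closed-form minimizer of $m_k(\cdot)$ for the quadratic $\psi$ above can be compared against a black-box minimizer:

```python
import numpy as np
from scipy.optimize import minimize

# Sanity check (ours, on random data): for the quadratic psi above, the
# minimizer of
#   m_k(u) = sum_i a_i (<g_i, u - x_i> + (mu/2)||u - x_i||^2) + (mu0/2)||u - x0||^2
# has the closed form
#   v_k = (mu * sum_i a_i x_i + mu0 * x0 - sum_i a_i g_i) / (mu * A_k + mu0).
rng = np.random.default_rng(0)
n, k, mu, mu0 = 5, 4, 0.3, 2.0
a = rng.uniform(0.5, 2.0, size=k)
xs = rng.standard_normal((k, n))   # stand-ins for the query points x_i
gs = rng.standard_normal((k, n))   # stand-ins for the noisy gradients
x0 = rng.standard_normal(n)

def m(u):
    terms = sum(a[i] * (gs[i] @ (u - xs[i]) + 0.5 * mu * np.sum((u - xs[i]) ** 2))
                for i in range(k))
    return terms + 0.5 * mu0 * np.sum((u - x0) ** 2)

v_closed = (mu * (a[:, None] * xs).sum(axis=0) + mu0 * x0
            - (a[:, None] * gs).sum(axis=0)) / (mu * a.sum() + mu0)
v_num = minimize(m, x0).x
print(np.allclose(v_closed, v_num, atol=1e-5))   # expect: True
```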

We are now ready to move to the main part of the convergence argument, namely, to show that $A_k G_k - A_{k-1} G_{k-1} \leq E^\eta_k$ for a certain choice of $a_k$.

Lemma B.3. Let $\psi(x) = \frac{a_1 (L - \mu)}{2}\|x\|^2$ and $0 < \frac{a_k}{A_k} \leq \sqrt{\frac{\mu}{L}}$. Then:

$$A_k G_k - A_{k-1} G_{k-1} \leq E^\eta_k, \quad \text{where } E^\eta_k = a_k \langle \eta_k, x^* - v_k \rangle.$$

Proof. As $U_k = f(y_k)$, we have that:

$$A_k U_k - A_{k-1} U_{k-1} = A_k f(y_k) - A_{k-1} f(y_{k-1}). \tag{B.2}$$

Using Proposition B.2, the change in the lower bound is:

$$A_k L_k - A_{k-1} L_{k-1} \geq a_k f(x_k) + a_k \langle \widehat{\nabla} f(x_k), v_k - x_k \rangle - a_k \langle \eta_k, x^* - x_k \rangle + \frac{A_{k-1}\mu}{2}\|v_k - v_{k-1}\|^2 + \frac{a_k \mu}{2}\|v_k - x_k\|^2. \tag{B.3}$$

Denote $w_k = \frac{A_{k-1} v_{k-1} + a_k x_k}{A_k}$. By Jensen's inequality:

$$\frac{A_{k-1}\mu}{2}\|v_k - v_{k-1}\|^2 + \frac{a_k \mu}{2}\|v_k - x_k\|^2 \geq \frac{A_k \mu}{2}\|v_k - w_k\|^2. \tag{B.4}$$

Write $a_k \langle \widehat{\nabla} f(x_k), v_k - x_k \rangle$ as:

$$a_k \langle \widehat{\nabla} f(x_k), v_k - x_k \rangle = a_k \langle \nabla f(x_k), v_k - w_k \rangle + \frac{a_k}{A_k} \big\langle \nabla f(x_k), A_{k-1}(v_{k-1} - x_k) \big\rangle + a_k \langle \eta_k, v_k - x_k \rangle. \tag{B.5}$$

As $\frac{a_k}{A_k} \leq \sqrt{\frac{\mu}{L}}$ and by smoothness of $f(\cdot)$:

$$a_k \langle \nabla f(x_k), v_k - w_k \rangle \geq A_k \Big( f\big(x_k + \tfrac{a_k}{A_k}(v_k - w_k)\big) - f(x_k) \Big) - \frac{A_k \mu}{2}\|v_k - w_k\|^2. \tag{B.6}$$

Combining (B.3)-(B.6), we have the following bound for the change in the lower bound:

$$A_k L_k - A_{k-1} L_{k-1} \geq a_k f(x_k) + A_k \Big( f\big(x_k + \tfrac{a_k}{A_k}(v_k - w_k)\big) - f(x_k) \Big) + A_{k-1} \big\langle \nabla f(x_k), \tfrac{a_k}{A_k}(v_{k-1} - x_k) \big\rangle - a_k \langle \eta_k, x^* - v_k \rangle.$$

Using the definition of $w_k$, $\theta_k = \frac{a_k}{A_k}$, and ($\mu$AGD+), it is not hard to show that $y_k = x_k + \frac{a_k}{A_k}(v_k - w_k)$ and $\frac{a_k}{A_k}(v_{k-1} - x_k) = x_k - y_{k-1}$, which, using the convexity of $f(\cdot)$, gives:

$$A_k L_k - A_{k-1} L_{k-1} \geq A_k f(y_k) - A_{k-1} f(y_{k-1}) - a_k \langle \eta_k, x^* - v_k \rangle. \tag{B.7}$$

Combining (B.7) and (B.2), the proof follows.

Theorem B.4. Let $\psi(x) = \frac{L - \mu}{2}\|x\|^2$, $a_1 = 1$, $\frac{a_i}{A_i} = \gamma_i \leq \sqrt{\frac{\mu}{L}}$ for $i \geq 2$, and let $y_k, x_k$ evolve according to ($\mu$AGD+). Then, $\forall k \geq 1$:

$$f(y_k) - f(x^*) \leq \frac{1}{A_k} \Big( \frac{L - \mu}{2}\|x^* - x_0\|^2 + \sum_{i=1}^k a_i \langle \eta_i, x^* - v_i \rangle \Big) = \Big( \prod_{i=2}^k (1 - \gamma_i) \Big) \Big( \frac{L - \mu}{2}\|x^* - x_0\|^2 + \sum_{i=1}^k a_i \langle \eta_i, x^* - v_i \rangle \Big).$$

Proof. Applying Lemma B.3, it follows that $A_k G_k \leq A_1 G_1 + \sum_{i=2}^k E^\eta_i$. As $\frac{A_{i-1}}{A_i} = 1 - \gamma_i$, we have $\frac{A_1}{A_k} = \frac{A_1}{A_2} \cdot \frac{A_2}{A_3} \cdots \frac{A_{k-1}}{A_k} = \prod_{i=2}^k (1 - \gamma_i)$, and hence, using $A_1 = a_1 = 1$:

$$G_k \leq \Big( \prod_{i=2}^k (1 - \gamma_i) \Big) \Big( G_1 + \sum_{i=2}^k E^\eta_i \Big).$$

The rest of the proof is by applying Proposition B.1 and using that $f(y_k) - f(x^*) \leq G_k$.
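The two identities invoked at the end of the proof of Lemma B.3 can be checked symbolically from the ($\mu$AGD+) update as reconstructed above. Scalar symbols suffice, since both identities are affine and hold coordinatewise; this is our own verification sketch:

```python
import sympy as sp

# Symbolic check of the two identities used in the proof of Lemma B.3.
theta, A = sp.symbols('theta A', positive=True)
y_prev, v_prev, v = sp.symbols('y_prev v_prev v')
a = theta * A                                   # a_k = theta_k * A_k
x = (y_prev + theta * v_prev) / (1 + theta)     # x_k from (muAGD+)
y = (1 - theta) * y_prev + theta * v            # y_k from (muAGD+)
w = ((A - a) * v_prev + a * x) / A              # w_k = (A_{k-1} v_{k-1} + a_k x_k) / A_k
print(sp.simplify(y - (x + theta * (v - w))))            # 0:  y_k = x_k + (a_k/A_k)(v_k - w_k)
print(sp.simplify((x - y_prev) - theta * (v_prev - x)))  # 0:  (a_k/A_k)(v_{k-1} - x_k) = x_k - y_{k-1}
```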

Using the same arguments for bounding the noise term as in the case of smooth minimization (Section 3), we have the following corollary.

Corollary B.5 (of Theorem B.4). If $\eta_i \equiv 0$ (the noiseless gradient case), setting $\gamma_i = \sqrt{\mu/L}$, we recover the standard convergence result for accelerated smooth and strongly convex minimization:

$$f(y_k) - f(x^*) \leq \Big(1 - \sqrt{\tfrac{\mu}{L}}\Big)^{k-1} \frac{L - \mu}{2}\|x^* - x_0\|^2.$$

If $\mathbb{E}[\|\eta_i\|] \leq M_i$ and $\max_{u \in X}\|x^* - u\| \leq R_x$, then setting $\gamma_i = \sqrt{\mu/L}$:

$$\mathbb{E}[f(y_k) - f(x^*)] \leq \Big(1 - \sqrt{\tfrac{\mu}{L}}\Big)^{k-1} \frac{L - \mu}{2}\|x^* - x_0\|^2 + \frac{R_x \sum_{i=1}^k a_i M_i}{A_k}.$$

If the $\eta_i$'s are zero-mean and independent and $\mathbb{E}[\|\eta_i\|^2] \leq \sigma^2$, then:

$$\mathbb{E}[f(y_k) - f(x^*)] \leq \Big( \prod_{i=2}^k (1 - \gamma_i) \Big) \frac{L - \mu}{2}\|x^* - x_0\|^2 + \frac{\sigma^2}{\mu} \cdot \frac{\sum_{i=1}^k a_i^2 / A_i}{A_k}.$$

In particular, setting $\gamma_i = \sqrt{\mu/L}$:

$$\mathbb{E}[f(y_k) - f(x^*)] \leq \Big(1 - \sqrt{\tfrac{\mu}{L}}\Big)^{k-1} \frac{L - \mu}{2}\|x^* - x_0\|^2 + \frac{\sigma^2}{\sqrt{\mu L}},$$

while setting $a_i = i^p$ for $p \in \mathbb{Z}_+$:

$$\mathbb{E}[f(y_k) - f(x^*)] = O\Big( \frac{(L - \mu)\|x^* - x_0\|^2}{k^{p+1}} + \frac{(p+1)\,\sigma^2}{k\,\mu} \Big).$$

Proof. The bounds for $\eta_i \equiv 0$ and for $\mathbb{E}[\|\eta_i\|] \leq M_i$, $\max_{u \in X}\|x^* - u\| \leq R_x$ are straightforward. Assume that the $\eta_i$'s are zero-mean and independent, and denote $\tilde\psi_k(x) = \sum_{i=1}^k \frac{a_i \mu}{2}\|x - x_i\|^2 + \frac{\mu_0}{2}\|x - x_0\|^2$. Observe that the strong convexity parameter of $\tilde\psi_k$ is $A_k \mu + \mu_0 > A_k \mu$. From Fact 4, $v_k = \nabla \tilde\psi_k^*(z_k)$. Similarly as for the case of smooth minimization, let $\hat v_k = \nabla \tilde\psi_k^*(z_k + a_k \eta_k)$. Then $\hat v_k$ is independent of $\eta_k$, and, using Fact 5, we have:

$$\mathbb{E}[a_k \langle \eta_k, x^* - v_k \rangle] = \mathbb{E}[a_k \langle \eta_k, x^* - \hat v_k \rangle] + \mathbb{E}[a_k \langle \eta_k, \hat v_k - v_k \rangle] \leq \frac{a_k^2}{A_k \mu}\, \mathbb{E}[\|\eta_k\|^2].$$

Combining with Theorem B.4, we get:

$$\mathbb{E}[f(y_k) - f(x^*)] \leq \Big( \prod_{i=2}^k (1 - \gamma_i) \Big) \frac{L - \mu}{2}\|x^* - x_0\|^2 + \frac{\sigma^2}{\mu} \cdot \frac{\sum_{i=1}^k a_i^2 / A_i}{A_k}.$$

The rest of the proof follows by plugging in the particular choices of $a_i$.

Let us make a few more remarks here. When the $\eta_i$'s are zero-mean, independent, and $\mathbb{E}[\|\eta_i\|^2] \leq \sigma^2$, even the vanilla version of $\mu$AGD+ does not accumulate noise (the noise averages out). Under the same assumptions and when $a_i = i$, we recover the asymptotic bound from (Ghadimi & Lan, 2012).⁷ More generally, $a_i = i^p$ for any constant integer $p$ gives a convergence bound in which the deterministic term vanishes at rate $1/k^{p+1}$ while the noise term vanishes at rate $1/k$. When $p = \Theta(\log k)$ for a fixed number of iterations $k$ of $\mu$AGD+, we get

$$\mathbb{E}[f(y_k) - f(x^*)] = O\Big( \Big(\frac{\log k}{k}\Big)^{\log k} (L - \mu)\|x^* - x_0\|^2 + \frac{\log k}{k} \cdot \frac{\sigma^2}{\mu} \Big),$$

i.e., the deterministic term (independent of noise) decreases super-polynomially with the iteration count, while the noise term decreases at rate $\log(k)/k$. This is a much stronger bound than the one from (Ghadimi & Lan, 2012), and closer to the theoretical lower bound $\Omega\big( \big(1 - \sqrt{\mu/L}\big)^k (L - \mu)\|x^* - x_0\|^2 + \frac{\sigma^2}{k\mu} \big)$ from (Nemirovskii & Yudin, 1983). Note that in the setting of constrained (bounded-diameter) minimization, (Ghadimi & Lan, 2013) obtained the optimal convergence bound $O\big( \big(1 - \sqrt{\mu/L}\big)^k (L - \mu)\|x^* - x_0\|^2 + \frac{\sigma^2}{k\mu} \big)$ by coupling the algorithm from (Ghadimi & Lan, 2012) with a domain-shrinking procedure, resulting in a multi-stage algorithm. We expect it is possible to obtain a similar result for $\mu$AGD+ by coupling it with the domain-shrinking procedure from (Ghadimi & Lan, 2013).

⁷Note that the independence of the $\eta_i$'s is a stronger assumption than the one used in (Ghadimi & Lan, 2012). Nevertheless, we can obtain the same bounds as stated in Corollary B.5 under the same assumptions as in (Ghadimi & Lan, 2012), using the same arguments as for the case of smooth minimization (see Appendix C).
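To see the trade-off between the two terms of Corollary B.5 numerically, the following sketch (ours) computes the deterministic factor $1/A_k$ and the accumulated-noise factor $\frac{1}{A_k}\sum_{i} a_i^2 / A_i$ for the polynomial weights $a_i = i^p$:

```python
import numpy as np

# Numerical illustration (our sketch) of the trade-off behind Corollary B.5:
# with polynomial weights a_i = i^p, the deterministic term of the bound is
# proportional to 1/A_k = Theta(1/k^{p+1}), while the accumulated-noise term
# is proportional to (1/A_k) * sum_i a_i^2 / A_i = Theta(1/k), with a
# constant that grows with p.
k = 10**4
for p in [1, 2, 4]:
    a = np.arange(1, k + 1, dtype=float) ** p
    A = np.cumsum(a)
    print(f"p={p}:  1/A_k = {1/A[-1]:.2e}  ~ k^-(p+1) = {k**-(p+1.0):.2e},"
          f"  noise factor = {np.sum(a**2 / A) / A[-1]:.2e}  ~ C(p)/k")
```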

C. Different Models of Inexact Oracle

C.1. Adversarial Models

There are two main adversarial models of inexact gradient oracles that have been used in the convergence analysis of accelerated methods: the approximate gradient model of (d'Aspremont, 2008) and the inexact first-order oracle of (Devolder et al., 2014).

The approximate gradient model (d'Aspremont, 2008) defines the inexact oracle by a deterministic perturbation satisfying the following condition for all queries:

$$|\langle \eta_k, y - z \rangle| \leq \delta, \quad \forall y, z \in X.$$

Hence, this model is only applicable to constrained optimization with a bounded-diameter domain and bounded (adversarial) additive noise. Under these assumptions, (d'Aspremont, 2008) proves that it is possible to approximate $f(x^*)$ up to an error of $\delta$ while achieving an accelerated rate. We can show the same asymptotic bound⁸ by applying the assumption to Equation (3) in Proposition 3.5. This yields:

$$G_k \leq \frac{D_\psi(x^*, x_0)}{A_k} + \delta$$

whenever $\frac{a_k^2}{A_k} \leq \frac{\mu}{L}$ for all $k$. Setting $a_k = \frac{\mu}{L} \cdot \frac{k+1}{2}$ yields $A_k = \Omega\big(\frac{\mu k^2}{L}\big)$, which establishes the accelerated decrease of the first term in the error bound above.

The inexact first-order oracle (Devolder et al., 2014) is a generalization of the model from (d'Aspremont, 2008) that defines the inexact oracle by:

$$f(y) \leq f(x) + \langle \widehat{\nabla} f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2 + \delta.$$

As stated in (Devolder et al., 2014), this model does not apply to noise-corrupted gradients per se, but rather to non-smooth and weakly smooth convex problems. In other words, the model was introduced to characterize the behavior of accelerated methods on objective functions that are non-smooth, but close to smooth. Our results agree with those of (Devolder et al., 2014) and lead to the same kind of error accumulation. To see this, observe that we only use the definition of smoothness when bounding $E_k$ in Theorem 3.4. Thus, the error from the inexact oracle would only appear as an additional $A_k \delta$ term in $E_k$, leading to:

$$f(y_k) - f(x^*) \leq \frac{D_\psi(x^*, x_0)}{A_k} + k\delta.$$

This is exactly the same bound as in Theorem 4 of (Devolder et al., 2014), but we obtain it through a generic algorithm with a simple analysis.

C.2. Generalized Stochastic Models

It is possible to generalize the results from Lemma 3.7 to the noise model from (Lan, 2012; Ghadimi & Lan, 2012). In such a model, $\eta_k = G(x_k, \xi_k) - \nabla f(x_k)$, where the $\{\xi_i\}$'s are i.i.d. random vectors, $\mathbb{E}[\eta_k] = 0$, and $\mathbb{E}[\|\eta_k\|^2] \leq \sigma^2$. Let $\mathcal{F}_k$ denote the natural filtration up to (and including) iteration $k$. Then $\hat v_k = \nabla \psi^*(z_k + a_k \eta_k)$ is measurable w.r.t. $\mathcal{F}_{k-1}$ (as $\{x_i\}_{i=1}^k$ and $\{\xi_i\}_{i=1}^{k-1}$ are measurable w.r.t. $\mathcal{F}_{k-1}$). It follows that:

$$\mathbb{E}[E^\eta_k \mid \mathcal{F}_{k-1}] = a_k \mathbb{E}[\langle \eta_k, x^* - v_k \rangle \mid \mathcal{F}_{k-1}] = a_k \mathbb{E}[\langle \eta_k, x^* - \hat v_k \rangle \mid \mathcal{F}_{k-1}] + a_k \mathbb{E}[\langle \eta_k, \hat v_k - v_k \rangle \mid \mathcal{F}_{k-1}] \leq \frac{a_k^2}{\mu} \mathbb{E}[\|\eta_k\|^2 \mid \mathcal{F}_{k-1}] \leq \frac{a_k^2 \sigma^2}{\mu}. \tag{C.1}$$

Define $\Gamma_k \stackrel{\mathrm{def}}{=} A_k G_k - \sum_{i=1}^k \frac{a_i^2 \sigma^2}{\mu}$. Then, using the results from Section 3 and (C.1):

$$\mathbb{E}[\Gamma_k - \Gamma_{k-1} \mid \mathcal{F}_{k-1}] = \mathbb{E}\Big[ A_k G_k - A_{k-1} G_{k-1} - \frac{a_k^2 \sigma^2}{\mu} \,\Big|\, \mathcal{F}_{k-1} \Big] \leq \mathbb{E}\Big[ E^\eta_k - \frac{a_k^2 \sigma^2}{\mu} \,\Big|\, \mathcal{F}_{k-1} \Big] \leq 0,$$

i.e., $\Gamma_k$ is a supermartingale. Hence, we can conclude that $\mathbb{E}[\Gamma_k] \leq \mathbb{E}[\Gamma_1]$, implying $\mathbb{E}[A_k G_k] \leq A_1 \mathbb{E}[G_1] + \sum_{i=2}^k \frac{a_i^2 \sigma^2}{\mu}$, and we recover the same bound as in Lemma 3.7:

$$\mathbb{E}[f(y_k)] - f(x^*) \leq \frac{D_\psi(x^*, x_0) + \frac{\sigma^2}{\mu} \sum_{i=1}^k a_i^2}{A_k}.$$

⁸We actually obtain better constants than those in the corresponding theorem of (d'Aspremont, 2008).
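A minimal sketch of the two oracle models above, as we read them: the adversarial oracle keeps $|\langle \eta, y - z \rangle| \leq \delta$ over a bounded domain via Cauchy-Schwarz, while the stochastic oracle returns $G(x, \xi) = \nabla f(x) + \eta$ with zero-mean $\eta$ and $\mathbb{E}\|\eta\|^2 = \sigma^2$. The wrapper names and noise distributions are our illustration choices, not part of the original models:

```python
import numpy as np

def adversarial_oracle(grad_f, delta, diameter):
    """Approximate gradient model: any eta with ||eta|| <= delta / diameter
    satisfies |<eta, y - z>| <= delta for all y, z in a set of the given
    diameter (Cauchy-Schwarz); here the direction is chosen at random."""
    def oracle(x, rng):
        d = rng.standard_normal(x.shape)
        return grad_f(x) + (delta / diameter) * d / np.linalg.norm(d)
    return oracle

def stochastic_oracle(grad_f, sigma):
    """Stochastic model of (Lan, 2012; Ghadimi & Lan, 2012): zero-mean eta
    with E||eta||^2 = sigma^2 (isotropic Gaussian scaling)."""
    def oracle(x, rng):
        eta = rng.standard_normal(x.shape) * (sigma / np.sqrt(x.size))
        return grad_f(x) + eta
    return oracle

# Example: corrupt the gradient of f(x) = 0.5 ||x||^2.
rng = np.random.default_rng(0)
g_adv = adversarial_oracle(lambda x: x, delta=0.1, diameter=2.0)
g_sto = stochastic_oracle(lambda x: x, sigma=0.1)
print(g_adv(np.ones(3), rng), g_sto(np.ones(3), rng))
```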

Figure 2: Median performance of gradient descent (GD) and accelerated algorithms (AGD, AXGD, AGD+ with restart and slow-down semi-heuristics, and TO-AGD+) over 50 repeated runs on a hard instance for unconstrained smooth minimization, for $\eta \sim N(0, \sigma_\eta^2 I)$ over $\mathbb{R}^n$ with $\sigma_\eta \in \{0, 10^{-5}, 10^{-3}, 10^{-1}\}$, for: (a)-(d) 500 iterations; (e)-(h) 100 iterations.

D. Additional Experiments

D.1. Hard Instance for Unconstrained Smooth Minimization

Unconstrained Minimization. In Section 5, we compared various accelerated algorithms with gradient descent (GD) when the restart and slow-down semi-heuristics are disabled and enabled. We also included a comparison with AGD+ when its parameters are set according to Corollary 3.9 (TO-AGD+). Since the step sizes of TO-AGD+ depend on the number of iterations, one may hope that TO-AGD+ could outperform the other accelerated algorithms when the number of iterations is small. However, this is not true: for a smaller number of iterations, in the best case TO-AGD+ matches the performance of the other algorithms with restart and slow-down. When all algorithms have the same performance, they all essentially run their vanilla versions, without restart and slow-down and with their standard accelerated step sizes. This is illustrated in Fig. 2.

Hard Instance over Simplex. The set of experiments in Figure 3 corresponds to the minimization of the hard instance function for smooth optimization, constrained over the probability simplex. It should be compared to the unconstrained version in Figure 1. As predicted, we observe that the presence of constraints decreases the effect of error accumulation, as the boundary of the feasible set limits the variance. Given the low variance due to the constraints, the effect of RESTARTSLOWDOWN is less evident in this batch of experiments.

Figure 3: Performance of gradient descent (GD) and accelerated algorithms (AGD, AXGD, AGD+) on a hard instance for smooth minimization, for $\eta \sim N(0, \sigma_\eta^2 I)$ with $\sigma_\eta \in \{0, 10^{-5}, 10^{-3}, 10^{-1}\}$, over the probability simplex: (a)-(d) without and (i)-(l) with RESTARTSLOWDOWN and RESTARTSLOWDOWN-.
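The experiments in this subsection can be reproduced in outline as follows. This is a sketch under our assumptions: the excerpt does not define the hard instance, so we use Nesterov's standard worst-case smooth function, and we show plain (projected) gradient descent; the accelerated variants are run analogously with the updates from Appendices A-B.

```python
import numpy as np

# Sketch of the D.1 experiments (Figures 2 and 3), assuming the hard
# instance is Nesterov's worst-case smooth function
#   f(x) = (L/4) * (x' A x / 2 - x[0]),  A = tridiagonal Laplacian.
# Gradients are corrupted as grad f(x) + eta, eta ~ N(0, sigma_eta^2 I);
# the simplex variant projects each iterate onto the probability simplex.
def hard_instance(n, L=1.0):
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    e1 = np.zeros(n); e1[0] = 1.0
    f = lambda x: 0.25 * L * (0.5 * x @ A @ x - x[0])
    grad = lambda x: 0.25 * L * (A @ x - e1)
    return f, grad

def project_simplex(y):
    # Euclidean projection onto {x >= 0, sum x = 1} (Held et al.; Duchi et al., 2008).
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, y.size + 1) > (css - 1.0))[0][-1]
    return np.maximum(y - (css[rho] - 1.0) / (rho + 1.0), 0.0)

def noisy_gd(grad, x0, step, iters, sigma_eta, rng, project=None):
    x = x0.copy()
    for _ in range(iters):
        x = x - step * (grad(x) + sigma_eta * rng.standard_normal(x.shape))
        if project is not None:
            x = project(x)
    return x

n, L = 100, 1.0
f, grad = hard_instance(n, L)
for sigma_eta in [0.0, 1e-5, 1e-3, 1e-1]:
    finals = [f(noisy_gd(grad, np.ones(n) / n, 1.0 / L, 500, sigma_eta,
                         np.random.default_rng(seed), project=project_simplex))
              for seed in range(50)]
    print(f"sigma_eta={sigma_eta:g}: median f = {np.median(finals):.3e}")
```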

D.2. Regression on Epileptic Seizure Dataset

For the second set of experiments, we used the Epileptic Seizure Recognition Dataset (Andrzejak et al., 2001), obtained from (Lichman, 2013). The dataset consists of brain-activity EEG recordings for 100 patients at different time points and in five different states, of which only one indicates an epileptic seizure. Each row of the dataset has 179 columns, of which the first 178 are features, while the last column indicates whether the patient was in seizure. Before running the experiments, we standardize the data using a standard preprocessing function from the Python scikit-learn library.

Linear regression and LASSO. We performed linear regression on this dataset mainly to illustrate the performance of the algorithms in the bounded regime, for $\ell_1$-constraints (LASSO), which we discuss here. The results for standard, unconstrained linear regression are similar to the results for logistic regression discussed below and are omitted.

Fig. 4(a)-4(d) shows the performance of GD and the accelerated algorithms for $\ell_1$-constrained linear regression (LASSO) on the Epileptic Seizure Recognition Dataset. Interestingly, the experiments suggest that there are cases when AGD performs better than AXGD and AGD+. In particular, in the noiseless (Fig. 4(a)) and low-noise (Fig. 4(b)) settings, AGD performs much better than its worst case and converges faster than AXGD and AGD+. However, the faster convergence comes at the expense of lower stability as the noise becomes higher. Specifically, as the noise is increased, AGD performs only marginally better, and with higher variance, than AXGD and AGD+ (Fig. 4(c)), and it stabilizes at a much higher mean and variance in the very high-noise setting (Fig. 4(d)). Intuitively, the greedy gradient steps that AGD takes may reduce the function value significantly and lead to faster convergence in the noiseless and low-noise settings, while making the convergence very sensitive to the noise from the last iteration, as the gradient steps depend only on the last seen (noisy) gradient. In contrast, AXGD and AGD+ are more stable to noise, since both of their per-iteration steps depend on the aggregate gradient (and, thus, aggregate noise) information.

As expected from the analytical results of Section 3, restart and slow-down does not noticeably improve the mean error of the algorithms (Fig. 4(e)-4(h), 4(i)-4(l)). However, in agreement with the analysis, it can reduce the error variance in the high-noise-variance setting (Fig. 4(l)). We also note that TO-AGD+ is more stable over the repeated executions of the methods, at the expense of slower initial convergence.

Logistic regression. Finally, we evaluated the performance of the accelerated algorithms and GD for (unregularized) logistic regression on the Epileptic Seizure Recognition Dataset. The results are shown in Fig. 5. Similarly to the case of unconstrained minimization from the beginning of this section, in the noiseless and low-noise settings (Fig. 5(a), 5(b)) all accelerated algorithms perform similarly, and restart and slow-down does not lead to any noticeable improvement or degradation (Fig. 5(e), 5(i), 5(f), 5(j)). Once the noise is high enough (Fig. 5(c), 5(d)), all accelerated algorithms begin to accumulate noise, while restart and slow-down stabilizes their performance at a low error mean and variance. Interestingly, in all the experiments, when RESTARTSLOWDOWN and RESTARTSLOWDOWN- are employed, all accelerated algorithms perform at least as well as GD in terms of the error mean and variance.
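The $\ell_1$-constrained (LASSO) runs above additionally require a Euclidean projection onto the $\ell_1$-ball at each step. A standard reduction to simplex projection (our illustration, not necessarily the implementation used in the experiments) is:

```python
import numpy as np

def project_l1_ball(y, r=1.0):
    """Euclidean projection onto {x : ||x||_1 <= r}: project |y| onto the
    scaled simplex and restore signs (standard reduction)."""
    if np.abs(y).sum() <= r:
        return y.copy()
    u = np.sort(np.abs(y))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, y.size + 1) > (css - r))[0][-1]
    tau = (css[rho] - r) / (rho + 1.0)
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

w = project_l1_ball(np.array([0.6, -0.8, 0.1]), r=1.0)
print(w, np.abs(w).sum())   # l1 norm equals the radius r = 1
```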

Figure 4: Performance of GD and accelerated algorithms (AGD, AXGD, AGD+, TO-AGD+) for linear regression over the $\ell_1$-ball (LASSO) on the Epileptic Seizure Recognition Dataset (Andrzejak et al., 2001), for $\sigma_\eta \in \{0, 10^{-5}, 10^{-3}, 10^{-1}\}$: (a)-(d) without noise reduction; (e)-(h) sample run with TO-AGD+, RESTARTSLOWDOWN, and RESTARTSLOWDOWN-; and (i)-(l) 50 repeated runs and the median with TO-AGD+, RESTARTSLOWDOWN, and RESTARTSLOWDOWN-.

Figure 5: Performance of gradient descent (GD) and accelerated algorithms (AGD, AXGD, AGD+, TO-AGD+) for logistic regression on the Epileptic Seizure Recognition Dataset (Andrzejak et al., 2001), for $\sigma_\eta \in \{0, 10^{-5}, 10^{-3}, 10^{-1}\}$: (a)-(d) without noise reduction; (e)-(h) sample run with noise reduction; and (i)-(l) 50 repeated runs and the median run with noise reduction.
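For completeness, a minimal sketch (our reconstruction, with random stand-ins for the EEG data) of the noisy-gradient logistic-regression setup behind Figure 5:

```python
import numpy as np

# Random stand-ins for the standardized EEG features and +/-1 seizure labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 178))
ylab = rng.integers(0, 2, size=500) * 2 - 1

def loss(w):
    # Unregularized logistic loss: mean of log(1 + exp(-y_i <x_i, w>)).
    return np.mean(np.log1p(np.exp(-ylab * (X @ w))))

def noisy_grad(w, sigma_eta=0.0):
    s = 1.0 / (1.0 + np.exp(ylab * (X @ w)))       # sigmoid(-y_i <x_i, w>)
    g = -(X.T @ (ylab * s)) / X.shape[0]
    return g + sigma_eta * rng.standard_normal(g.shape)

w = np.zeros(178)
for _ in range(200):                                # plain GD with noisy oracle
    w -= 0.5 * noisy_grad(w, sigma_eta=1e-3)
print(loss(w))
```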
