An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback

Journal of Machine Learning Research 18 (2017) 1-11. Submitted 2/16; Published 5/17.

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback

Ohad Shamir (ohad.shamir@weizmann.ac.il)
Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 7610001, Israel

Editor: Alexander Rakhlin

©2017 Ohad Shamir. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/16-632.html.

Abstract

We consider the closely related problems of bandit convex optimization with two-point feedback, and zero-order stochastic convex optimization with two function evaluations per round. We provide a simple algorithm and analysis which is optimal for convex Lipschitz functions. This improves on Duchi et al. (2015), which only provides an optimal result for smooth functions; moreover, the algorithm and analysis are simpler, and readily extend to non-Euclidean problems. The algorithm is based on a small but surprisingly powerful modification of the gradient estimator.

Keywords: zero-order optimization, bandit optimization, stochastic optimization, gradient estimator

1. Introduction

We consider the problem of bandit convex optimization with two-point feedback (Agarwal et al., 2010). This problem can be defined as a repeated game between a learner and an adversary as follows: At each round $t$, the adversary picks a convex function $f_t$ on $\mathbb{R}^d$, which is not revealed to the learner. The learner then chooses a point $w_t$ from some known and closed convex set $\mathcal{W} \subseteq \mathbb{R}^d$, and suffers a loss $f_t(w_t)$. As feedback, the learner may choose two points $w_t', w_t'' \in \mathcal{W}$ and receive $f_t(w_t'), f_t(w_t'')$.¹ The learner's goal is to minimize the average regret, defined as
$$\frac{1}{T}\sum_{t=1}^{T} f_t(w_t) \;-\; \min_{w\in\mathcal{W}} \frac{1}{T}\sum_{t=1}^{T} f_t(w).$$
In this paper, we focus on obtaining bounds on the expected average regret (with respect to the learner's randomness).

¹ This is slightly different than the model of Agarwal et al. (2010), where the learner only chooses $w_t', w_t''$ and the loss is $\frac{1}{2}(f_t(w_t') + f_t(w_t''))$. However, our results and analysis can be easily translated to their setting, and the model we discuss translates more directly to the zero-order stochastic optimization considered later.

A closely-related and easier setting is zero-order stochastic convex optimization. In this setting, our goal is to approximately solve $\min_{w\in\mathcal{W}} F(w)$ where $F(w) = \mathbb{E}_{\xi}[f(w;\xi)]$, given limited access to $\{f(\cdot;\xi_t)\}$ where the $\xi_t$ are i.i.d. instantiations. Specifically, we assume that each $f(\cdot;\xi_t)$ is not directly observed, but rather can be queried at two points. This models situations where computing gradients directly is complicated or infeasible.

It is well-known (Cesa-Bianchi et al., 2004) that given an algorithm with expected average regret $R_T$ in the bandit optimization setting above, if we feed it with the functions $f_t(w) = f(w;\xi_t)$, then the average point $\bar{w} = \frac{1}{T}\sum_{t=1}^{T} w_t$ of the points generated satisfies the following bound on the expected optimization error:
$$\mathbb{E}\left[F(\bar{w})\right] - \min_{w\in\mathcal{W}} F(w) \;\le\; R_T.$$
Thus, an algorithm for bandit optimization can be converted to an algorithm for zero-order stochastic optimization with similar guarantees.
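To make this reduction concrete, here is a minimal Python sketch (our own illustration, not part of the paper; the learner interface with predict() and observe() is hypothetical): it runs a two-point bandit learner on the random losses $f_t(w) = f(w;\xi_t)$ and returns the averaged iterate $\bar{w}$.

```python
import numpy as np

def online_to_batch(learner, sample_xi, f, T):
    """Online-to-batch conversion: feed i.i.d. losses f_t(w) = f(w, xi_t) to a
    two-point bandit learner and return the average iterate w_bar. By the argument
    of Cesa-Bianchi et al. (2004), E[F(w_bar)] - min_w F(w) is bounded by the
    learner's expected average regret. The predict/observe interface is hypothetical."""
    iterates = []
    for _ in range(T):
        xi = sample_xi()                          # fresh i.i.d. instantiation
        f_t = lambda w, xi=xi: f(w, xi)           # this round's loss
        w_t, query_points = learner.predict()     # learner's point and its two query points
        learner.observe([f_t(q) for q in query_points])
        iterates.append(w_t)
    return np.mean(np.asarray(iterates), axis=0)  # w_bar
```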

The bandit optimization setting with two-point feedback was proposed and studied in Agarwal et al. (2010). Independently, Nesterov (2011) considered two-point methods for stochastic optimization. Both papers are based on randomized gradient estimates, which are then fed into standard first-order algorithms (e.g. gradient descent, or more generally mirror descent). However, the regret/error guarantees in both papers were suboptimal in terms of the dependence on the dimension.

Recently, Duchi et al. (2015) considered a similar approach for the stochastic optimization setting, attaining an optimal error guarantee when $f(\cdot;\xi)$ is a smooth function (differentiable and with Lipschitz-continuous gradients). Related results in the smooth case were also obtained by Ghadimi and Lan (2013). However, to tackle the general case, where $f(\cdot;\xi)$ may be non-smooth, Duchi et al. (2015) resorted to a non-trivial smoothing scheme and a significantly more involved analysis. The resulting bounds have additional factors (logarithmic in the dimension) compared to the guarantees in the smooth case. Moreover, an analysis is only provided for Euclidean problems (where the domain $\mathcal{W}$ and the Lipschitz parameter of $f_t$ scale with the $L_2$ norm).

In this note, we present and analyze a simple algorithm with the following properties:

- For Euclidean problems, it is optimal up to constants for both smooth and non-smooth functions. This closes the gap between the smooth and non-smooth Euclidean problems in this setting.
- The algorithm and analysis are readily applicable to non-Euclidean problems. We give an example for the 1-norm, with the resulting bound optimal up to logarithmic factors.
- The algorithm and analysis are simpler than those proposed in Duchi et al. (2015). They apply equally to the bandit and zero-order optimization setting, and can be readily extended using standard techniques, e.g. improved bounds for strongly-convex functions; regret/error bounds holding with high probability rather than just in expectation; and improved bounds if we are allowed $k > 2$ observations per round instead of just two (Hazan et al., 2007; Shalev-Shwartz, 2007; Agarwal et al., 2010).

Like previous algorithms, our algorithm is based on a random gradient estimator, which given a function $f$ and a point $w$, queries $f$ at two random locations close to $w$, and computes a random vector whose expectation is a gradient of a smoothed version of $f$. The papers Nesterov (2011), Duchi et al. (2015) and Ghadimi and Lan (2013) essentially use the estimator which queries at $w$ and $w + \delta u$ (where $u$ is a random unit vector and $\delta > 0$ is a small parameter), and returns
$$\frac{d}{\delta}\left(f(w+\delta u) - f(w)\right)u. \qquad (1)$$
The intuition is readily seen in the one-dimensional ($d=1$) case, where the expectation of this expression equals
$$\frac{1}{2\delta}\left(f(w+\delta) - f(w-\delta)\right), \qquad (2)$$
which indeed approximates the derivative of $f$ (assuming $f$ is differentiable) at $w$, if $\delta$ is small enough. In contrast, our algorithm uses a slightly different estimator (also used in Agarwal et al., 2010), which queries at $w - \delta u$ and $w + \delta u$, and returns
$$\frac{d}{2\delta}\left(f(w+\delta u) - f(w-\delta u)\right)u. \qquad (3)$$
Again, the intuition is readily seen in the case $d=1$, where the expectation of this expression also equals Eq. (2). When $\delta$ is sufficiently small and $f$ is differentiable at $w$, both estimators compute a good approximation of the true gradient $\nabla f(w)$. However, when $f$ is not differentiable, the variance of the estimator in Eq. (1) can be quadratic in the dimension, as pointed out by Duchi et al. (2015): For example, for $f(w) = \|w\|_2$ and $w = 0$, the second moment equals
$$\mathbb{E}\left\|\frac{d}{\delta}\left(f(\delta u) - f(0)\right)u\right\|_2^2 = d^2\,\mathbb{E}\|u\|_2^4 = d^2.$$
Since the performance of the algorithm crucially depends on the second moment of the gradient estimate, this leads to a highly sub-optimal guarantee. In Duchi et al. (2015), this was handled by adding an additional random perturbation and using a more involved analysis. Surprisingly, it turns out that the slightly different estimator in Eq. (3) does not suffer from this problem, and its second moment is essentially linear in the dimension.

We note that in this work, we assume that $u$ is a random unit vector, similar to previous works. However, our results can be readily extended to other distributions, such as uniform in the Euclidean unit ball, or a Gaussian distribution.
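To make the gap concrete, the following small Monte Carlo sketch (our own illustration, not part of the paper; all names are ours) estimates the second moment of the two estimators on the example above, $f(w) = \|w\|_2$ at $w = 0$: the estimator of Eq. (1) comes out close to $d^2$, while the symmetric estimator of Eq. (3) vanishes at this particular point, and more generally its second moment stays essentially linear in $d$ for Lipschitz $f$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere_sample(d, n):
    """n i.i.d. points uniform on the Euclidean unit sphere in R^d."""
    g = rng.standard_normal((n, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def second_moments(f, w, d, delta=1e-3, n=100_000):
    """Monte Carlo estimates of E||g||_2^2 for the estimators in Eq. (1) and Eq. (3)."""
    u = sphere_sample(d, n)
    f_plus = np.array([f(w + delta * ui) for ui in u])
    f_minus = np.array([f(w - delta * ui) for ui in u])
    f_w = f(w)
    g1 = (d / delta) * (f_plus - f_w)[:, None] * u            # Eq. (1)
    g3 = (d / (2 * delta)) * (f_plus - f_minus)[:, None] * u  # Eq. (3)
    return (np.linalg.norm(g1, axis=1) ** 2).mean(), (np.linalg.norm(g3, axis=1) ** 2).mean()

d = 50
m1, m3 = second_moments(np.linalg.norm, np.zeros(d), d)
print(f"Eq.(1): {m1:.1f} (about d^2 = {d**2}),  Eq.(3): {m3:.3g}")
```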

2. Algorithm and Main Results

We consider the algorithm described in Figure 1, which performs standard mirror descent using a randomized gradient estimator $g_t$ of a (smoothed) version of $f_t$ at the point $w_t$. Following Duchi et al. (2015), we assume that one can indeed query $f_t$ at any point $w_t \pm \delta_t u_t$ as specified in the algorithm.² The analysis of the algorithm is presented in the following theorem.

² This may require us to query at a distance $\delta_t$ outside $\mathcal{W}$. If we must query within $\mathcal{W}$, then a standard technique (see Agarwal et al., 2010) is to simply run the algorithm on a slightly smaller set $(1-\epsilon)\mathcal{W}$, where $\epsilon > 0$ is sufficiently large so that $w_t \pm \delta_t u_t$ must be in $\mathcal{W}$. Since the formal guarantee in Thm. 1 holds for arbitrarily small $\delta_t$, and each $f_t$ is Lipschitz, we can generally take $\delta_t$ (and hence $\epsilon$) sufficiently small so that the additional regret/error incurred is arbitrarily small.

Figure 1: Two-Point Bandit Convex Optimization Algorithm
  Input: step size $\eta$, function $r : \mathcal{W} \to \mathbb{R}$, exploration parameters $\delta_t > 0$
  Initialize $\theta_1 = 0$
  for $t = 1, \ldots, T$ do
    Predict $w_t = \arg\max_{w\in\mathcal{W}}\, \langle \theta_t, w\rangle - r(w)$
    Sample $u_t$ uniformly from the Euclidean unit sphere $\{w : \|w\|_2 = 1\}$
    Query $f_t(w_t + \delta_t u_t)$ and $f_t(w_t - \delta_t u_t)$
    Set $g_t = \frac{d}{2\delta_t}\left(f_t(w_t + \delta_t u_t) - f_t(w_t - \delta_t u_t)\right)u_t$
    Update $\theta_{t+1} = \theta_t - \eta\, g_t$
  end for

Theorem 1 Assume the following conditions hold:
1. $r$ is 1-strongly convex with respect to a norm $\|\cdot\|$, and $\sup_{w\in\mathcal{W}} r(w) \le R^2$ for some $R < \infty$.
2. Each $f_t$ is convex and $G_2$-Lipschitz with respect to the 2-norm.
3. The dual norm $\|\cdot\|_\star$ of $\|\cdot\|$ is such that $\frac{d^2}{4}\,\mathbb{E}_{u_t}\left[\|u_t\|_\star^4\right] \le p^2$ for some $p < \infty$.
If $\eta = \frac{R}{\sqrt{pT}\,G_2}$, and the $\delta_t$ are chosen such that $\delta_t \le \frac{\sqrt{p}\,R}{\sqrt{T}}$, then the sequence $w_1, \ldots, w_T$ generated by the algorithm satisfies the following for any $T$ and $w^* \in \mathcal{W}$:
$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\left(f_t(w_t) - f_t(w^*)\right)\right] \;\le\; \frac{c\,\sqrt{p}\,G_2\,R}{\sqrt{T}},$$
where $c$ is some numerical constant.

We note that condition 1 is standard in the analysis of the mirror-descent method (see the specific corollaries below), whereas conditions 2 and 3 are needed to ensure that the variance of our gradient estimator is controlled. As mentioned earlier, the bound on the average regret which appears in Thm. 1 immediately implies a similar bound on the error in a stochastic optimization setting, for the average point $\bar{w} = \frac{1}{T}\sum_{t=1}^{T} w_t$. We note that the result is robust to the choice of $\eta$, and is the same up to constants as long as $\eta = \Theta\!\left(R/(\sqrt{pT}\,G_2)\right)$. Also, the constant $c$, while always strictly positive, shrinks as $\delta_t \to 0$ (see the proof below for details).
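The following Python sketch (our own illustration; function names and the toy problem are ours, not the paper's) implements the loop in Figure 1, with the prediction step $w_t = \arg\max_{w\in\mathcal{W}}\langle\theta_t,w\rangle - r(w)$ supplied as a callable. For the Euclidean case $r(w) = \frac{1}{2}\|w\|_2^2$ on a ball, which is covered by Corollary 2 below, that step is just a projection, recovering online gradient descent.

```python
import numpy as np

def two_point_bandit_opt(loss_fns, d, eta, deltas, predict, rng=None):
    """Sketch of the algorithm in Figure 1: mirror descent driven by the two-point
    gradient estimator of Eq. (3). `predict(theta)` implements the step
    w_t = argmax_{w in W} <theta, w> - r(w). All names are ours."""
    rng = rng if rng is not None else np.random.default_rng(0)
    theta = np.zeros(d)
    iterates = []
    for f_t, delta_t in zip(loss_fns, deltas):
        w_t = predict(theta)
        u_t = rng.standard_normal(d)
        u_t /= np.linalg.norm(u_t)              # uniform direction on the unit sphere
        g_t = (d / (2.0 * delta_t)) * (f_t(w_t + delta_t * u_t) - f_t(w_t - delta_t * u_t)) * u_t
        theta = theta - eta * g_t               # theta_{t+1} = theta_t - eta * g_t
        iterates.append(w_t)
    return np.array(iterates)

def euclidean_predict(R):
    """Euclidean case r(w) = ||w||_2^2 / 2 on W = {||w||_2 <= R}: the argmax step
    is the projection of theta onto the ball (online gradient descent)."""
    def predict(theta):
        nrm = np.linalg.norm(theta)
        return theta if nrm <= R else (R / nrm) * theta
    return predict

# Toy zero-order problem: f_t(w) = ||w - w_star||_2 for all t (so G_2 = 1).
d, T, R = 20, 2000, 1.0
w_star = 0.5 * np.ones(d) / np.sqrt(d)
loss_fns = [lambda w: np.linalg.norm(w - w_star)] * T
eta = R / np.sqrt(d * T)                        # eta = R / (sqrt(p T) G_2) with p = d
deltas = [np.sqrt(d) * R / np.sqrt(T)] * T      # delta_t <= sqrt(p) R / sqrt(T)
ws = two_point_bandit_opt(loss_fns, d, eta, deltas, euclidean_predict(R))
print("suboptimality of the average iterate:", np.linalg.norm(ws.mean(axis=0) - w_star))
```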

As a first application of the theorem, let us consider the case where $\|\cdot\|$ is the Euclidean norm. In this case, we can take $r(w) = \frac{1}{2}\|w\|_2^2$, and the algorithm reduces to a standard variant of online gradient descent, defined as $\theta_{t+1} = \theta_t - \eta\,g_t$ and $w_t = \arg\min_{w\in\mathcal{W}} \|w - \theta_t\|_2$. In this case, we get the following corollary:

Corollary 2 Suppose $f_t$ for all $t$ is $G_2$-Lipschitz with respect to the Euclidean norm, and $\mathcal{W} \subseteq \{w : \|w\|_2 \le R\}$. Then using $\|\cdot\| = \|\cdot\|_2$ and $r(w) = \frac{1}{2}\|w\|_2^2$, it holds for some constant $c$ and any $w^* \in \mathcal{W}$ that
$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\left(f_t(w_t) - f_t(w^*)\right)\right] \;\le\; c\,G_2\,R\,\sqrt{\frac{d}{T}}.$$

The proof is immediately obtained from Thm. 1, noting that $p = d$ in our case. This bound matches (up to constants) the lower bound in Duchi et al. (2015), hence closing the gap between upper and lower bounds in this setting.

As a second application, let us consider the case where $\|\cdot\|$ is the 1-norm $\|\cdot\|_1$, the domain $\mathcal{W}$ is the simplex in $\mathbb{R}^d$, $d > 1$ (although our result easily extends to any subset of the 1-norm unit ball), and we use a standard entropic regularizer:

Corollary 3 Suppose $f_t$ for all $t$ is $G_\infty$-Lipschitz with respect to the $L_\infty$ norm. Then using $\|\cdot\| = \|\cdot\|_1$ and $r(w) = \sum_{i=1}^{d} w_i\log(w_i)$, it holds for some constant $c$ and any $w^* \in \mathcal{W}$ that
$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\left(f_t(w_t) - f_t(w^*)\right)\right] \;\le\; c\,G_\infty\,\frac{\log(2d)}{\sqrt{T}}.$$

This bound matches (this time up to a factor polylogarithmic in $d$) the lower bound in Duchi et al. (2015) for this setting.

Proof The function $r$ is 1-strongly convex with respect to the 1-norm (see for instance Shalev-Shwartz, 2012, Example 2.5), and has value at most $\log(d)$ (in absolute value) on the simplex. Also, if $f_t$ is $G_\infty$-Lipschitz with respect to the $\infty$-norm, then it must be $G_\infty$-Lipschitz with respect to the Euclidean norm. Finally, to satisfy condition 3 in Thm. 1, we upper bound $\frac{d^2}{4}\mathbb{E}\left[\|u_t\|_\infty^4\right]$ using the following lemma, whose proof is given in the appendix:

Lemma 4 If $u$ is uniformly distributed on the unit sphere in $\mathbb{R}^d$, $d > 1$, then
$$\frac{d^2}{4}\,\mathbb{E}\left[\|u\|_\infty^4\right] \;\le\; c\,\log^2(2d),$$
where $c$ is a positive numerical constant independent of $d$.

Plugging these observations into Thm. 1 leads to the desired result.

Finally, we make two additional remarks on possible extensions and improvements to Thm. 1.

Remark 5 (Querying at $k > 2$ points) If the algorithm is allowed to query $f_t$ at $k > 2$ points, then it can be modified to attain an improved regret bound, by computing $k/2$ independent estimates of $g_t$ at every round (using a freshly sampled $u_t$ each time), and using their average. This leads to a new gradient estimator $g_t^k$, which satisfies $\mathbb{E}\|g_t^k\|_\star^2 \le \frac{2}{k}\,\mathbb{E}\|g_t\|_\star^2 + 2\,\|\mathbb{E}[g_t]\|_\star^2$. Based on the proof of Thm. 1, it is easily verified that this leads to an average expected regret bound of $c\,G_2\,R\,\sqrt{(1 + p/k)/T}$ for some numerical constant $c$.

Remark 6 (Non-Euclidean Geometries) When considering norms other than the Euclidean norm, it is tempting to conjecture that our algorithm and analysis can be improved, by sampling $u_t$ from a distribution adapted to the geometry of that norm (not necessarily the Euclidean ball), and assuming $f_t$ is Lipschitz w.r.t. the dual norm. However, adapting the proof (and in particular obtaining appropriate versions of Lemma 8 and Lemma 9) does not appear straightforward, and the potential performance improvement is currently unclear.
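For the simplex/entropic setting of Corollary 3, only the prediction step of the sketch above changes: $\arg\max_{w\in\mathcal{W}}\langle\theta,w\rangle - \sum_i w_i\log(w_i)$ over the simplex is the softmax of $\theta$, so the update $\theta_{t+1} = \theta_t - \eta g_t$ becomes a multiplicative, exponentiated-gradient-style step. A minimal sketch, with our own naming:

```python
import numpy as np

def entropic_predict(theta):
    """Prediction step on the simplex with the entropic regularizer r(w) = sum_i w_i log w_i:
    argmax_{w in simplex} <theta, w> - r(w) is the softmax of theta, so mirror descent
    with the two-point estimator reduces to an exponentiated-gradient-style update."""
    z = theta - theta.max()      # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

# Plugs into the mirror-descent sketch above, e.g.
#   ws = two_point_bandit_opt(loss_fns, d, eta, deltas, entropic_predict)
# with eta and delta_t set per Corollary 3 (R^2 = log d and p of order log(2d)).
```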

3. Proof of Theorem 1

As discussed in the introduction, the key to getting improved results compared to previous papers is the use of a slightly different random gradient estimator, which turns out to have significantly less variance. The formal proof relies on a few simple lemmas listed below. The key lemma is Lemma 10, which establishes the improved variance behavior.

Lemma 7 For any $w \in \mathcal{W}$, it holds that
$$\sum_{t=1}^{T}\langle g_t, w_t - w\rangle \;\le\; \frac{R^2}{\eta} + \eta\sum_{t=1}^{T}\|g_t\|_\star^2.$$
This lemma is the canonical result on the convergence of online mirror descent, and the proof is standard (see e.g. Shalev-Shwartz, 2012).

Lemma 8 Define the function $\hat{f}_t(w) = \mathbb{E}_{u_t}\left[f_t(w + \delta_t u_t)\right]$ over $\mathcal{W}$, where $u_t$ is a vector picked uniformly at random from the Euclidean unit sphere. Then the function is convex, Lipschitz with constant $G_2$, satisfies
$$\sup_{w\in\mathcal{W}}\left|\hat{f}_t(w) - f_t(w)\right| \;\le\; \delta_t\,G_2,$$
and is differentiable with the following gradient:
$$\nabla\hat{f}_t(w) = \mathbb{E}_{u_t}\left[\frac{d}{\delta_t}\,f_t(w + \delta_t u_t)\,u_t\right].$$

Proof The fact that the function is convex and Lipschitz is immediate from its definition and the assumptions in the theorem. The inequality follows from $u_t$ being a unit vector and $f_t$ being assumed $G_2$-Lipschitz with respect to the 2-norm. The differentiability property follows from Lemma 2.1 in Flaxman et al. (2005).

Lemma 9 For any function $g$ which is $L$-Lipschitz with respect to the 2-norm, it holds that if $u$ is uniformly distributed on the Euclidean unit sphere, then
$$\mathbb{E}\left[\left(g(u) - \mathbb{E}[g(u)]\right)^4\right] \;\le\; \frac{c\,L^4}{d^2}$$
for some numerical constant $c$.

Proof A standard result on the concentration of Lipschitz functions on the Euclidean unit sphere implies that
$$\Pr\left(\left|g(u) - \mathbb{E}[g(u)]\right| > t\right) \;\le\; 2\exp\left(-\frac{c_1\,d\,t^2}{L^2}\right)$$
for some numerical constant $c_1 > 0$ (see the proof of Proposition 2.10 and Corollary 2.6 in Ledoux, 2005).

Therefore,
$$\mathbb{E}\left[(g(u) - \mathbb{E}[g(u)])^4\right] = \int_{t=0}^{\infty}\Pr\left((g(u) - \mathbb{E}[g(u)])^4 > t\right)dt = \int_{t=0}^{\infty}\Pr\left(|g(u) - \mathbb{E}[g(u)]| > t^{1/4}\right)dt \;\le\; \int_{t=0}^{\infty} 2\exp\left(-\frac{c_1\,d\,\sqrt{t}}{L^2}\right)dt = \frac{4L^4}{(c_1 d)^2},$$
where in the last step we used the fact that $\int_{x=0}^{\infty}\exp(-x)\,x\,dx = 1$. The expression above equals $c\,L^4/d^2$ for some numerical constant $c$.

Lemma 10 It holds that $\mathbb{E}[g_t \mid w_t] = \nabla\hat{f}_t(w_t)$ (where $\hat{f}_t(\cdot)$ is as defined in Lemma 8), and
$$\mathbb{E}\left[\|g_t\|_\star^2 \mid w_t\right] \;\le\; c\,p\,G_2^2$$
for some numerical constant $c$.

Proof For simplicity of notation, we drop the $t$ subscript. Since $u$ has a symmetric distribution around the origin,
$$\mathbb{E}[g \mid w] = \mathbb{E}_u\left[\frac{d}{2\delta}\left(f(w+\delta u) - f(w-\delta u)\right)u\right]
= \mathbb{E}_u\left[\frac{d}{2\delta}f(w+\delta u)\,u\right] + \mathbb{E}_u\left[\frac{d}{2\delta}f(w-\delta u)\,(-u)\right]
= \mathbb{E}_u\left[\frac{d}{2\delta}f(w+\delta u)\,u\right] + \mathbb{E}_u\left[\frac{d}{2\delta}f(w+\delta u)\,u\right]
= \mathbb{E}_u\left[\frac{d}{\delta}f(w+\delta u)\,u\right],$$
which equals $\nabla\hat{f}(w)$ by Lemma 8.

As to the second part of the lemma, we have the following, where $\alpha$ is an arbitrary parameter and where we use the elementary inequality $(a-b)^2 \le 2(a^2 + b^2)$:
$$\mathbb{E}\left[\|g\|_\star^2 \mid w\right] = \mathbb{E}_u\left\|\frac{d}{2\delta}\left(f(w+\delta u) - f(w-\delta u)\right)u\right\|_\star^2
= \frac{d^2}{4\delta^2}\,\mathbb{E}_u\left[\|u\|_\star^2\left(f(w+\delta u) - f(w-\delta u)\right)^2\right]
= \frac{d^2}{4\delta^2}\,\mathbb{E}_u\left[\|u\|_\star^2\left((f(w+\delta u) - \alpha) - (f(w-\delta u) - \alpha)\right)^2\right]
\le \frac{d^2}{2\delta^2}\,\mathbb{E}_u\left[\|u\|_\star^2\left((f(w+\delta u) - \alpha)^2 + (f(w-\delta u) - \alpha)^2\right)\right]
= \frac{d^2}{2\delta^2}\left(\mathbb{E}_u\left[\|u\|_\star^2(f(w+\delta u) - \alpha)^2\right] + \mathbb{E}_u\left[\|u\|_\star^2(f(w-\delta u) - \alpha)^2\right]\right).$$

Again using the symmetric distribution of $u$, this equals
$$\frac{d^2}{2\delta^2}\left(\mathbb{E}_u\left[\|u\|_\star^2(f(w+\delta u) - \alpha)^2\right] + \mathbb{E}_u\left[\|u\|_\star^2(f(w+\delta u) - \alpha)^2\right]\right) = \frac{d^2}{\delta^2}\,\mathbb{E}_u\left[\|u\|_\star^2\left(f(w+\delta u) - \alpha\right)^2\right].$$
Applying Cauchy-Schwartz and using the condition $\frac{d^2}{4}\mathbb{E}_u\left[\|u\|_\star^4\right] \le p^2$ stated in the theorem, we get the upper bound
$$\frac{d^2}{\delta^2}\sqrt{\mathbb{E}_u\left[\|u\|_\star^4\right]}\sqrt{\mathbb{E}_u\left[(f(w+\delta u) - \alpha)^4\right]} \;\le\; \frac{2p}{\delta^2}\sqrt{d^2\,\mathbb{E}_u\left[(f(w+\delta u) - \alpha)^4\right]}.$$
In particular, taking $\alpha = \mathbb{E}_u\left[f(w+\delta u)\right]$ and using Lemma 9 (noting that $f(w+\delta u)$ is $G_2\delta$-Lipschitz w.r.t. $u$ in terms of the 2-norm), this is at most
$$\frac{2p}{\delta^2}\sqrt{d^2\,\frac{c\,(G_2\delta)^4}{d^2}} = 2\sqrt{c}\,p\,G_2^2,$$
as required.

We are now ready to prove the theorem. Taking expectations on both sides of the inequality in Lemma 7, we have
$$\sum_{t=1}^{T}\mathbb{E}\left[\langle g_t, w_t - w\rangle\right] \;\le\; \frac{R^2}{\eta} + \eta\sum_{t=1}^{T}\mathbb{E}\left[\|g_t\|_\star^2\right] = \frac{R^2}{\eta} + \eta\sum_{t=1}^{T}\mathbb{E}\left[\mathbb{E}\left[\|g_t\|_\star^2 \mid w_t\right]\right]. \qquad (4)$$
Using Lemma 10, the right-hand side is at most $\frac{R^2}{\eta} + \eta\,c\,p\,G_2^2\,T$. The left-hand side of Eq. (4), by Lemma 10 and convexity of $\hat{f}_t$, equals
$$\sum_{t=1}^{T}\mathbb{E}\left[\left\langle \mathbb{E}[g_t \mid w_t],\, w_t - w\right\rangle\right] = \sum_{t=1}^{T}\mathbb{E}\left[\left\langle \nabla\hat{f}_t(w_t),\, w_t - w\right\rangle\right] \;\ge\; \sum_{t=1}^{T}\mathbb{E}\left[\hat{f}_t(w_t) - \hat{f}_t(w)\right].$$
By Lemma 8, this is at least
$$\sum_{t=1}^{T}\mathbb{E}\left[f_t(w_t) - f_t(w)\right] - 2G_2\sum_{t=1}^{T}\delta_t.$$
Combining these inequalities and plugging back into Eq. (4), we get
$$\sum_{t=1}^{T}\mathbb{E}\left[f_t(w_t) - f_t(w)\right] \;\le\; 2G_2\sum_{t=1}^{T}\delta_t + \frac{R^2}{\eta} + c\,p\,G_2^2\,\eta\,T.$$
Choosing $\eta = \frac{R}{\sqrt{pT}\,G_2}$, and any $\delta_t \le \frac{\sqrt{p}\,R}{\sqrt{T}}$, we get
$$\sum_{t=1}^{T}\mathbb{E}\left[f_t(w_t) - f_t(w)\right] \;\le\; (c + 3)\,\sqrt{p}\,G_2\,R\,\sqrt{T}.$$

Dividing both sides by $T$, the result follows.

Acknowledgments

This research was supported in part by an Israel Science Foundation grant 425/13 and an FP7 Marie Curie CIG grant. We thank the anonymous reviewers for several helpful comments.

Appendix A. Proof of Lemma 4

We note that the distribution of $\|u\|_\infty^4$ is identical to that of $\|n\|_\infty^4 / \|n\|_2^4$, where $n \sim \mathcal{N}(0, I_d)$ is a standard Gaussian random vector. Moreover, by a standard concentration bound on the norm of Gaussian random vectors (e.g. Corollary 2.3 in Barvinok, 2005, with $\epsilon = 1/2$):
$$\max\left\{\Pr\left(\|n\|_2^2 \le \tfrac{d}{2}\right),\ \Pr\left(\|n\|_2^2 \ge \tfrac{3d}{2}\right)\right\} \;\le\; \exp\left(-\tfrac{d}{16}\right).$$
Finally, for any value of $n$, we always have $\|n\|_\infty \le \|n\|_2$, since the Euclidean norm is always larger than the infinity norm. Combining these observations, and using $\mathbb{1}_A$ for the indicator function of the event $A$, we have
$$\mathbb{E}\left[\|u\|_\infty^4\right] = \mathbb{E}\left[\frac{\|n\|_\infty^4}{\|n\|_2^4}\right]
= \Pr\left(\|n\|_2^2 \le \tfrac{d}{2}\right)\mathbb{E}\left[\frac{\|n\|_\infty^4}{\|n\|_2^4}\,\Big|\, \|n\|_2^2 \le \tfrac{d}{2}\right] + \Pr\left(\|n\|_2^2 > \tfrac{d}{2}\right)\mathbb{E}\left[\frac{\|n\|_\infty^4}{\|n\|_2^4}\,\Big|\, \|n\|_2^2 > \tfrac{d}{2}\right]
\le \exp\left(-\tfrac{d}{16}\right) + \mathbb{E}\left[\frac{\|n\|_\infty^4}{(d/2)^2}\,\mathbb{1}_{\|n\|_2^2 > d/2}\right]
\le \exp\left(-\tfrac{d}{16}\right) + \frac{4}{d^2}\,\mathbb{E}\left[\|n\|_\infty^4\right]. \qquad (5)$$
Thus, it remains to upper bound $\mathbb{E}\left[\|n\|_\infty^4\right]$, where $n$ is a standard Gaussian random vector.

Letting $n = (n_1, \ldots, n_d)$, and noting that $n_1, \ldots, n_d$ are independent and identically distributed standard Gaussian random variables, we have for any scalar $z$ that
$$\Pr\left(\|n\|_\infty \le z\right) = \prod_{i=1}^{d}\Pr\left(|n_i| \le z\right) = \left(1 - \Pr\left(|n_1| > z\right)\right)^{d} \;\overset{(1)}{\ge}\; 1 - d\,\Pr\left(|n_1| > z\right) = 1 - 2d\,\Pr\left(n_1 > z\right) \;\overset{(2)}{\ge}\; 1 - d\exp\left(-z^2/2\right),$$
where (1) is Bernoulli's inequality, and (2) uses a standard tail bound for a Gaussian random variable. In particular, the above implies that $\Pr\left(\|n\|_\infty > z\right) \le d\exp\left(-z^2/2\right)$. Therefore, for an arbitrary positive scalar $r$,
$$\mathbb{E}\left[\|n\|_\infty^4\right] = \int_{z=0}^{\infty}\Pr\left(\|n\|_\infty^4 > z\right)dz \;\le\; \int_{z=0}^{r} 1\,dz + \int_{z=r}^{\infty}\Pr\left(\|n\|_\infty > z^{1/4}\right)dz \;\le\; r + \int_{z=r}^{\infty} d\exp\left(-\sqrt{z}/2\right)dz = r + d\left(4\sqrt{r} + 8\right)\exp\left(-\sqrt{r}/2\right).$$
In particular, plugging in $r = 4\log^2(2d)$ (which is larger than 1, since we assume $d > 1$), we get
$$\mathbb{E}\left[\|n\|_\infty^4\right] \;\le\; 4\left(1 + \log(2d) + \log^2(2d)\right).$$
Plugging this back into Eq. (5), we get that
$$\mathbb{E}\left[\|u\|_\infty^4\right] \;\le\; \exp\left(-\tfrac{d}{16}\right) + \frac{16}{d^2}\left(1 + \log(2d) + \log^2(2d)\right),$$
which can be shown to be at most $\frac{c}{d^2}\log^2(2d)$ for all $d > 1$, where $c < 50$ is a numerical constant. In particular, this means that $\frac{d^2}{4}\,\mathbb{E}\left[\|u\|_\infty^4\right] \le \frac{c}{4}\log^2(2d)$, as required.
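As an informal numerical sanity check of Lemma 4 (our own addition, not part of the paper; names are ours), the following sketch estimates $\mathbb{E}\|u\|_\infty^4$ by Monte Carlo and compares it with $\log^2(2d)/d^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sup_norm_fourth_moment(d, n=10_000):
    """Monte Carlo estimate of E||u||_inf^4 for u uniform on the Euclidean unit sphere in R^d."""
    g = rng.standard_normal((n, d))
    u = g / np.linalg.norm(g, axis=1, keepdims=True)
    return (np.abs(u).max(axis=1) ** 4).mean()

for d in (2, 10, 100, 1000):
    est = sup_norm_fourth_moment(d)
    ref = np.log(2 * d) ** 2 / d ** 2
    print(f"d = {d:4d}   E||u||_inf^4 ~ {est:.3e}   log^2(2d)/d^2 = {ref:.3e}")
```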

References

A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Conference on Learning Theory (COLT), 2010.

A. Barvinok. Measure concentration lecture notes. http://www.math.lsa.umich.edu/~barvinok/total710.pdf, 2005.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, 2004.

J. Duchi, M. Jordan, M. Wainwright, and A. Wibisono. Optimal rates for zero-order optimization: the power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788-2806, May 2015.

A. Flaxman, A. Kalai, and B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In ACM-SIAM Symposium on Discrete Algorithms (SODA), 2005.

S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.

E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169-192, 2007.

M. Ledoux. The concentration of measure phenomenon, volume 89. American Mathematical Society, 2005.

Y. Nesterov. Random gradient-free minimization of convex functions. Technical Report 2011/16, ECORE, 2011.

S. Shalev-Shwartz. Online learning: Theory, algorithms, and applications. PhD thesis, The Hebrew University, 2007.

S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 2012.