On the Complexity of Bandit and Derivative-Free Stochastic Convex Optimization

Size: px
Start display at page:

Download "On the Complexity of Bandit and Derivative-Free Stochastic Convex Optimization"

Transcription

1 JMLR: Workshop an Conference Proceeings vol ) 1 On the Complexity of Banit an Derivative-Free Stochastic Convex Optimization Oha Shamir Microsoft Research an the Weizmann Institute of Science oha.shamir@weizmann.ac.il Abstract he problem of stochastic convex optimization with banit feeback in the learning community) or without knowlege of graients in the optimization community) has receive much attention in recent years, in the form of algorithms an performance upper bouns. However, much less is known about the inherent complexity of these problems, an there are few lower bouns in the literature, especially for nonlinear functions. In this paper, we investigate the attainable error/regret in the banit an erivative-free settings, as a function of the imension an the available number of queries. We provie a precise characterization of the attainable performance for strongly-convex an smooth functions, which also imply a non-trivial lower boun for more general problems. Moreover, we prove that in both the banit an erivative-free setting, the require number of queries must scale at least quaratically with the imension. Finally, we show that on the natural class of quaratic functions, it is possible to obtain a fast O1/ ) error rate in terms of, uner mil assumptions, even without having access to graients. o the best of our knowlege, this is the first such rate in a erivative-free stochastic setting, an hols espite previous results which seem to imply the contrary. Keywors: Stochastic Convex Optimization; Derivative-Free Optimization; Banit Convex Optimization; Regret 1. Introuction his paper consiers the following funamental question: Given an unknown convex function F, an the ability to query for possibly noisy) realizations of its values at various points, how can we optimize F with as few queries as possible? his question, uner ifferent guises, has playe an important role in several communities. In the optimization community, this is usually known as zeroth-orer or erivativefree convex optimization, since we only have access to function values rather than graients or higher-orer information. he goal is to return a point with small optimization error on some convex omain, using a limite number of queries. Derivative-free methos were among the earliest algorithms to numerically solve unconstraine optimization problems, an have recently enjoye increasing interest, being especial useful in black-box situations where graient information is har to compute or oes not exist Nesterov 011); Stich et al. 011). In a stochastic framework, we can only obtain noisy realizations of the function values for instance, ue to running the optimization process on sample ata). We refer to this setting as erivative-free SCO short for stochastic convex optimization). In the learning community, these kins of problems have been closely stuie in the context of multi-arme banits an more generally) banit online optimization, which are 013 O. Shamir.

2 Shamir powerful moels for sequential ecision making uner uncertainty Cesa-Bianchi an Lugosi 006); Bubeck an Cesa-Bianchi 01). In a stochastic framework, these settings correspon to repeately choosing points in some convex omain, obtaining noisy realizations of some unerlying convex function s value. However, rather than minimizing optimization error, our goal is to minimize the average) regret: roughly speaking, that the average of the function values we obtain is not much larger than the minimal function value. For example, the well-known multi-arme banit problem correspons to a linear function over the simplex. We refer to this setting as banit SCO. As will be more explicitly iscusse later on, any algorithm which attains small average regret can be converte to an algorithm with the same optimization error. In other wors, banit SCO is only harer than erivative-free SCO. We note that in the context of stochastic multi-arme banits, the potential gap between the two settings uner the terms cumulative regret an simple regret ) was introuce an stuie in Bubeck et al. 011). When one is given graient information, the attainable optimization error / average regret is well-known: uner mil conitions, it is Θ1/ ) for convex functions an Θ1/ ) for strongly-convex functions, where is the number of queries Zinkevich 003); Hazan an Kale 011); Rakhlin et al. 01). Note that these bouns o not explicitly epen on the imension of the omain. he inherent complexity of banit/erivative-free SCO is not as well-unerstoo. An important exception is multi-arme banits, where the attainable error/regret is known to be exactly Θ / ), where is the imension an is the number of queries 1 Auer et al. 00); Auibert an Bubeck 009). Linear functions over other convex omains has also been explore, with upper bouns on the orer of O / ) to O / ) e.g. Abbasi-Yakori et al. 011); Bubeck et al. 01)). For linear functions over general omains, information-theoretic Ω / ) lower bouns have been proven in Dani et al. 007, 008); Auibert et al. 011). However, these lower bouns are either on the regret not optimization error); shown for non-convex omains; or are implicit an rely on artificial, carefully constructe omains. In contrast, we focus here on simple, natural omains an convex problems. When ealing with more general, non-linear functions, much less is known. he problem was originally consiere over 30 years ago, in the seminal work by Yuin an Nemirovsky on the complexity of optimization Nemirovsky an Yuin 1983). he authors provie some algorithms an upper bouns, but as they themselves emphasize cf. pg. 359), the attainable complexity is far from clear. Quite recently, Jamieson et al. 01) provie an Ω / ) lower boun for strongly-convex functions, which emonstrates that the fast O1/ ) rate in terms of, that one enjoys with graient information, is not possible here. In contrast, the current best-known upper bouns are O 4 / ), O 3 / ), O / ) for convex, strongly-convex, an strongly-convex-an-smooth functions respectively Flaxman et al. 005); Agarwal et al. 010)); An a O 3 / ) boun for convex functions Agarwal 1. In a stochastic setting, a more common boun in the literature is O log )/ ), but the O-notation hies a non-trivial epenence on the form of the unerlying linear function in multi-arme banits terminology, a gap between the expecte rewars boune away from 0). Such assumptions are not natural in a nonlinear banits SCO setup, an without them, the regret is inee Θ / ). See for instance Bubeck an Cesa-Bianchi, 01, Chapter ) for more etails.

3 Complexity of Banit an Derivative-Free Stochastic Convex Optimization et al. 011)), which is better in terms of epenence on but very ba in terms of the imension. In this paper, we investigate the complexity of banit an erivative-free stochastic convex optimization, focusing on nonlinear functions, with the following contributions see also the summary in able 1): ˆ We prove that for strongly-convex an smooth functions, the attainable error/regret is exactly Θ / ). his has three important ramifications: First of all, it settles the question of attainable performance for such functions, an is the first sharp characterization of complexity for a general nonlinear banit/erivative-free class of problems. Secon, it proves that the require number of queries in such problems must scale quaratically with the imension, even in the easier optimization setting, an in contrast to the linear case which often allows linear scaling with the imension. hir, it formally provies a natural Ω / ) lower boun for more general classes of convex problems. ˆ We analyze an important special case of strongly-convex an smooth functions, namely quaratic functions. We show that for such functions, one can efficiently) attain Θ / ) optimization error, an that this rate is sharp. o the best of our knowlege, it is the first general class of nonlinear functions for which one can show a fast rate in terms of ) in a erivative-free stochastic setting. In fact, this may seem to contraict the result in Jamieson et al. 01), which shows an Ω / ) lower boun on quaratic functions. However, as we explain in more etail later on, there is no contraiction, since the example establishing the lower boun of Jamieson et al. 01) imposes an extremely small omain which actually ecays with ), while our result hols for a fixe omain. Although this result is tight, we also show that uner more restrictive assumptions on the noise process, it is sometimes possible to obtain better error bouns, as goo as O/ ). ˆ We prove that even for quaratic functions, the attainable average regret is exactly Θ / ), in contrast to the Θ / ) result for optimization error. his shows there is a real gap between what can be obtaine for erivative-free SCO an banit SCO, without any specific istributional assumptions. Again, this stans in contrast to settings such as multi-arme banits, where there is no ifference in their istributionfree performance. We emphasize that our upper bouns are base on the assumption that the function minimizer is boune away from the omain bounary, or that we can query points slightly outsie the omain. However, we argue that this assumption is not very restrictive in the context of strongly-convex functions especially in learning applications), where the omain is often R, an a minimizer always exists. he paper is structure as follows: In Sec., we formally efine the setup an introuce the notation we shall use in the remainer of the paper. For clarity of exposition, we begin with the case of quaratic functions in Sec. 3, proviing algorithms, upper an lower bouns. he tools an insights we evelop for the quaratic case will allow us to tackle the more general strongly-convex-an-smooth setting in Sec. 4. We en the main part of the paper with a summary an iscussion of open problems in Sec. 5. In Appenix A, we emonstrate 3

4 Shamir Optimization Error Average Regret Function ype O ) Ω ) O ) Ω ) Quaratic Str. Convex an Smooth Str. Convex Convex { 3 min, { 4 min, } 3 3 } { 3 min, { 4 min, } 3 3 } able 1: A summary of the complexity upper bouns O )) an lower bouns Ω )), for erivative-free stochastic convex optimization optimization error) an banit stochastic convex optimization average regret), for various function classes, in terms of the imension an the number of queries. he boxe results are shown in this paper. he upper bouns for the convex an strongly convex case combine results from Flaxman et al. 005); Agarwal et al. 010, 011). he table shows epenence on, only an ignores other factors an constants. that one can obtain improve performance in the quaratic case, if we re consiering more specific natural noise processes. Aitional proofs are presente in Appenix B.. Preliminaries Let enote the stanar Eucliean norm. We let F ) : W R enote the convex function of interest, where W R is a close) convex omain. We say that F is λ- strongly convex, for λ > 0, if for any w, w W an any subgraient g of F at w, it hols that F w ) F w) + g, w w + λ w w. Intuitively, this means that we can lower boun F everywhere by a quaratic function of fixe curvature. We say that F is µ-smooth if for any w, w W, an any subgraient g of F at w, it hols that F w ) F w) + g, w w + µ w w. Intuitively, this means that we can upperboun F everywhere by a quaratic function of fixe curvature. We let w W enote a minimizer of F on w. o prevent trivialities, we consier in this paper only functions whose optimum w is known beforehan to lie in some boune omain even if W is large or all of R ), an the function is Lipschitz in that omain. he learning/optimization process procees in rouns. Each roun t, we pick an query a point w t W, obtaining an inepenent realization of F w) + ξ w, where ξ w is an unknown zero-mean ranom variable, such that E[ξw max { 1, w }. In the banit. We note that this slightly eviates from the more common assumption in the banits/erivative-free SCO setting that E[ξ w O1). While such assumptions are equivalent for boune W, we also wish to consier cases with unrestricte omains W = R. In that case, assuming E[ξ w O1) may lea to trivialities in the erivative-free setting. For example, consier the case where F w) = w Aw + b w. hen for any w an any ξ w with uniformly boune variance, we can get a virtually noiseless estimate 4

5 Complexity of Banit an Derivative-Free Stochastic Convex Optimization SCO setting, our goal is to minimize the expecte average regret, namely [ 1 E F w t ) F w ), whereas in the erivative-free SCO setting, our goal is to compute, base on w 1,..., w an the observe values, some point w W, such that the expecte optimization error E [F w ) F w ), is as small as possible. We note that given a banit SCO algorithm with some regret boun, one can get a erivative-free SCO algorithm with the same optimization error boun: we simply run the stochastic banit algorithm, getting w 1,..., w, an returning 1 w t. By Jensen s inequality, the expecte optimization error is at most the expecte average regret with respect to w 1,..., w. hus, banit SCO is only harer than erivative-free SCO. In this paper, we provie upper an lower bouns on the attainable optimization error / average regret, as a function of the imension an the number of rouns/queries. For simplicity, we focus here on bouns which hol in expectation, an an interesting point for further research is to exten these to bouns on the actual error/regret, which hol with high probability. 3. Quaratic Functions In this section, we consier the class of quaratic functions, which have the form F w) = w Aw + b w + c where A is positive-efinite with a minimal eigenvalue boune away from 0). Moreover, to make the problem well-behave, we assume that A has a spectral norm of at most 1, an that b 1, c 1. We note that if the norms are boune but larger than 1, this can be easily hanle by rescaling the function. It is easily seen that such functions are both strongly convex an smooth. Moreover, this is a natural an important class of functions, which in learning applications appears, for instance, in the context of least squares an rige regression. Besies proviing new insights for this class, we will use the techniques evelope here later on, in the more general case of strongly-convex an smooth functions Upper Bouns We begin by showing that for erivative-free SCO, one can obtain an optimization error boun of O / ). o the best of our knowlege, this is the first example of a erivative-free stochastic boun scaling as O1/ ) for a general class of nonlinear functions, as oppose to O1/ ). However, to achieve this result, we nee to make the following mil assumption: of w Aw by picking w = cw for some large c an computing 1 c F w ) + ξ w ). Variants of this iea will also allow virtually noiseless estimates of the linear term. 5

6 Shamir Assumption 1 At least one of the following hols for some fixe ɛ 0, 1: ˆ he quaratic function attains its minimum w in the omain W, an the Eucliean istance of w from the omain bounary is at least ɛ. ˆ We can query not just points in W, but any point whose istance from W is at most ɛ. With strongly-convex functions, the most common case is that W = R, an then both cases actually hol for any value of ɛ. Even in other situations, one of these assumptions virtually always hols. Note that we crucially rely here on the strong-convexity assumption: with say) linear functions, the omain must always be boune an the optimum always lies at the bounary of the omain. With this assumption, the boun we obtain is on the orer of /ɛ. As iscusse earlier, Jamieson et al. 01) recently prove a Ω / ) lower boun for erivative-free SCO, which actually applies to quaratic functions. his oes not contraict our result, since in their example the iameter of W an hence also ɛ) ecays with. In contrast, our O / ) boun hols for fixe ɛ, which we believe is natural in most applications. o obtain this behavior, we utilize a well-known 1-point graient estimate technique, which allows us to get an unbiase estimate of the graient at any point by ranomly querying for a noisy) value of the function aroun it see Nemirovsky an Yuin 1983); Flaxman et al. 005)). Our key insight is that whereas for general functions one must query very close to the point of interest scaling to 0 with ), quaratic functions have aitional structure which allows us to query relatively far away, allowing graient estimates with much smaller variance. he algorithm we use is presente as Algorithm 1, an is computationally efficient. It uses a moification W of the omain W, efine as follows. First, we let B enote some known upper boun on w. If the first alternative of assumption 1 hols, then W consists of all points in W {w : w B}, whose istance from W s bounary is at least ɛ. If the secon alternative hols, then W = W {w : w B}. Note that uner any alternative, it hols that W is convex, that w t B, that w W, an that our algorithm always queries at legitimate points. In the pseuocoe, we use Π W to enote projection on W. For simplicity, we assume that / is an integer an that W inclues the origin 0. Algorithm 1 Derivative-Free SCO Algorithm for Strongly-Convex Quaratic Functions Input: Strong convexity parameter λ > 0; Distance parameter ɛ 0, 1 Initialize w 1 = 0. for t = 1,..., 1 o Pick r { 1, +1} uniformly at ranom Query noisy function value v at point w t + ɛ r v Let g = ɛ r Let w t+1 = Π W wt 1 λt g) en for Return w = t=/ w t. he following theorem quantifies the optimization error of our algorithm. 6

7 Complexity of Banit an Derivative-Free Stochastic Convex Optimization heorem 1 Let F w) = w Aw+b w+c be a λ-strongly convex function, where A, b, c are all at most 1, an suppose the optimum w has a norm of at most B. hen uner Assumption 1, the point w returne by Algorithm 1 satisfies E [F w ) F w log))b + 1)4 ) λɛ. Note that returning w as the average over the last / iterates as oppose to averaging over all iterates) is necessary to avoi log ) factors Rakhlin et al. 01). As an interesting sie-note, we conjecture that a graient-base approach is crucial here to obtain O1/ ) rates in terms of ). For example, a ifferent family of erivative-free methos see for instance Nemirovsky an Yuin 1983); Agarwal et al. 011); Jamieson et al. 01)) is base on a type of noisy binary search, where a few strategically selecte points are repeately sample in orer to estimate which of them has a larger/smaller function value. his is use to shrink the feasible region where the optimum w might lie. Since it is generally impossible to estimate the mean of noisy function values at a rate better than O1/ ), it is not clear if one can get an optimization rate faster than O1/ ) with such methos. he proof of the theorem relies on the following key lemma, whose proof appears in the appenix. Lemma For any w t, we have that an E r,v [ g = F w t ) E r,v [ g 4 B + 1) 4 ɛ. his lemma implies that Algorithm 1 essentially performs stochastic graient escent over the strongly-convex function F w), where the graient estimates are unbiase an with boune secon moments. he returne point is a suffix-average of the last / iterates. Using a convergence analysis for stochastic graient escent with suffix-averaging Rakhlin et al., 01, heorem 5), an plugging in the bouns of Lemma, we get hm Lower Bouns In this subsection, we prove that the upper boun obtaine in hm. 1 is essentially tight: namely, up to constants, the worst-case error rate one can obtain for erivative-free SCO of quaratic functions is orer of /. Besies showing that the algorithm above is essentially optimal, it implies that even for extremely nice strongly-convex functions an omains, the number of queries require to reach some fixe accuracy scales quaratically with the imension. his stans in contrast to the case of linear functions, where the provable query complexity often scales linearly with. heorem 3 Let the number of rouns be fixe. hen for any possibly ranomize) querying strategy, there exists a quaratic function of the form F w) = 1 w e, w, which is minimize at e where e 1, such that the resulting w satisfies } E[F w ) F w ) 0.01 min {1,. 7

8 Shamir Note that since e 1, we know in avance that the optimum must lie in the unit Eucliean ball. Despite this, the lower boun hols even if we o not restrict at all the omain in which we are allowe to query - i.e., it can even be all of R. Proof he proof technique is inspire by a lower boun which appears in Arias-Castro et al. 011), in the ifferent context of compresse sensing. he argument also bears some close similarities to the proof of Assoua s lemma see Cybakov 009)). We will exhibit a istribution over quaratic functions F, such that in expectation over this istribution, any querying strategy will attain Ω / ) optimization error. his implies that for any querying strategy, there exists some eterministic F for which it will have this amount of error. he functions we shall consier are F e w) = 1 w e, w, where e is rawn uniformly from { µ, µ}, with µ 0, 1/ ) being a parameter to be specifie later. Moreover, we will assume that the noise ξ w is a Gaussian ranom variable with zero mean an stanar eviation max { 1, w }. By efinition of 1-strong convexity, it is easy to verify that F e w) F e e) 1 w e. hus, the expecte optimization error over the querying strategy) is at least [ 1 E[F e w ) F e e) E w e E [ 1 w i e i ) E [ µ 1 wi e i <0. 1) We will assume that the querying strategy is eterministic: w t is a eterministic function of the previous query values v 1, v,..., v t 1 at w 1,..., w t 1. his assumption is without loss of generality, since any ranom querying strategy can be seen as a ranomization over eterministic querying strategy. hus, a lower boun which hols uniformly for any eterministic querying strategy woul also hol over a ranomization. o lower boun Eq. 1), we use the following key lemma, which relates this to the question of how informative are the query values as measure by Kullback-Leibler or KL ivergence) for etermining the sign of e s coorinates. Intuitively, the more similar the query values are, the smaller is the KL ivergence an the harer it is to istinguish the true sign of each e i, leaing to a larger lower boun. he proof appears in the appenix. Lemma 4 Let e be a ranom vector, none of whose coorinates is supporte on 0, an let v 1, v,..., v be a sequence of query values obtaine by a eterministic strategy returning a point w so that the query location w t is a eterministic function of v 1,..., v t 1, an w is a eterministic function of v 1,..., v ). hen we have [ E 1 wi e i <0 1 1 U t,i, where U t,i = sup D kl Pr vt e i > 0, {e j } j i, {v l } t 1 ) l=1 Pr vt e i < 0, {e j } j i, {v l } t 1 )) l=1 {e j } j i an D kl represents the KL ivergence between two istributions. 8

9 Complexity of Banit an Derivative-Free Stochastic Convex Optimization Using Lemma 4, we can get a lower boun for the above, provie an upper boun on the U t,i s. o analyze this, consier any fixe values of {e j } j i, an any fixe values of v 1,..., v t 1. Since the querying strategy is assume to be eterministic, it follows that w t is uniquely etermine. Given this w t, the function value v t equals F e w t ) = 1 w t + e j w t,j + µw t,i + ξ wt ) j i conitione on e i > 0, an F e w t ) = 1 w t + e j w t,j µw t,i + ξ wt 3) j i conitione on e i < 0. Comparing Eq. ) an Eq. 3), we notice that they both represent a Gaussian istribution ue to the ξ wt noise term), with stanar eviation max { 1, w t } an means seperate by µw t,i. o boun the ivergence, we use the following stanar result on the KL ivergence between two Gaussians Kullback 1959): Lemma 5 Let N µ, σ ) represent a Gaussian istribution variable with mean µ an variance σ. hen D kl N µ1, σ ) N µ, σ ) ) = µ 1 µ ) σ Using this lemma, it follows that D kl P v t v 1,..., v t 1 ) Qv t v 1,..., v t 1 )) µw t,i ) max {1, w t 4 } = µ w t,i max {1, w t 4 }. Plugging this upper boun on the U t,i s in Lemma 4, we can further lower boun on the expecte optimization error from Eq. 1) by µ 1 1 µ w t,i 4 max {1, w t 4 = µ 1 µ w t } 4 max {1, w t 4 } = µ { } ) 1 µ min w t 4 1, w t µ µ 1. 4) 4 Finally, we choose µ = min{1/, /4 }, an obtain a lower boun of ) } } min {1, > 0.01 min {1, 4 4 as require. he theorem above applies to the optimization error for erivative-free SCO. We now turn to eal with the case of banit SCO an regret, showing an Ω / ) lower boun. 9

10 Shamir Since the erivative-free SCO boun was Θ / ), the result implies a real gap between what can be obtaine in terms of average regret, as oppose to optimization error, without any specific istributional assumptions. his stans in contrast to settings such as multiarme banits, where the construction implying the known Ω / ) lower boun e.g. Cesa-Bianchi an Lugosi 006)) applies equally well to erivative-free an banit SCO see Bubeck et al. 011)). heorem 6 Let the number of rouns be fixe. hen for any possibly ranomize) querying strategy, there exists a quaratic function of the form F w) = 1 w e, w, which is minimize at e where e 1/, such that E [ 1 { } F w t ) F w ) 0.0 min 1,. Note that our lower boun hols even when the omain is unrestricte the algorithm can pick any point in R ). Moreover, the lower boun coincies up to a constant) with the O / ) regret upper-boun shown for strongly-convex an smooth functions in Agarwal et al. 010). his shows that for strongly-convex an smooth functions, the minimax average regret is Θ / ). Also, the lower boun implies that one cannot hope to obtain average regret better than / for more general banit problems, such as strongly-convex or even convex problems. he proof relies on techniques similar to the lower boun of hm. 3, with a key aitional insight. Specifically, in hm. 3, the lower boun obtaine actually epens on the norm of the points w 1,..., w see Eq. 4)), an the optimal w has a very small norm. In a regret minimization setting the points w 1,..., w cannot be too far from w, an thus must have a small norm as well, leaing to a stronger lower boun than that of hm. 3. he formal proof appears in the appenix. 4. Strongly Convex an Smooth Functions We now turn to the more general case of strongly convex an smooth functions. First, we note that in the case of functions which are both strongly convex an smooth, Agarwal et al., 010, heorem 14) alreay provie an O / ) average regret boun which hols even in a non-stochastic setting). he main result of this section is a matching lower boun, which hols even if we look at the much easier case of erivative-free SCO. his lower boun implies that the attainable error for strongly-convex an smooth functions is orer of /, an at least / for any harer setting. heorem 7 Let the number of rouns be fixe. hen for any possibly ranomize) querying strategy, there exists a function F over R which is 0.5-strongly convex an 3.5- smooth; Is 4-Lipschitz over the unit Eucliean ball; has a global minimum in the unit ball; An such that the resulting w satisfies { } E[F w ) F w ) min 1,. 10

11 Complexity of Banit an Derivative-Free Stochastic Convex Optimization Note that we mae no attempt to optimize the constant. he general proof technique is rather similar to that of hm. 3, but the construction is a bit more intricate. Specifically, letting µ > 0 be a parameter to be etermine later, we look at functions of the form F e w) = w e i w i 1 + w i /e i ), where e is uniformly istribute on { µ, +µ}. o see the intuition behin this choice, let us consier the one-imensional case = 1). Recall that in the quaratic setting, the function we consiere in one imension) was of the form F e w) = 1 w ew, where e was chosen uniformly at ranom from { µ, +µ}, an µ is a small number. hus, the optimum is at either µ or µ, an the ifference F µ w) F µ w) at these optima is orer of µ. However, by picking w = Θ1), the ifference F µ w) F µ w) is on the orer of µ - much larger than the ifference close to the optimum, which is orer of µ. herefore, by querying for w far from the optimum, an getting noisy values of F e, it is easier to istinguish whether we are ealing with e = +µ or e = µ, leaing to a / optimization error boun. In contrast, the function we consier here in the one-imensional case) is of the form F e w) = w ew 1 + w/e). 5) his form is carefully esigne so that F µ w) F µ w) is orer of µ, not just at the optima of F µ an F µ, but for all w. his is because of the aitional enominator, which makes the function closer an closer to w the larger w is - see Fig. 1 for a graphical illustration. As a result, no matter how the function is querie, istinguishing the choice of µ is ifficult, leaing to the strong lower boun of hm. 7. A formal proof is presente in the appenix. 5. Discussion In this paper, we consiere the ual settings of banit an erivative-free stochastic convex optimization. We provie a sharp characterization of the attainable performance for strongly-convex an smooth functions. he results also provie useful lower-bouns for more general settings. We also consiere the case of quaratic functions, showing that a fast O1/ ) rate is possible in a stochastic setting, even without knowlege of erivatives. Our results have several qualitative ifferences compare to previously known results which focus on linear functions, such as quaratic epenence on the imension even for extremely nice functions, an a provable gap between the attainable performance in banit optimization an erivative-free optimization. Our work leaves open several questions. For example, we have only ealt with bouns which hol in expectation, an our lower bouns focuse on the epenence on,, where other problem parameters, such as the Lipschitz constant an strong convexity parameter, are fixe constants. While this follows the setting of previous works, it oes not cover 11

12 Shamir Figure 1: he two soli blue lines represents F e w) as in Eq. 5), for e = 0.1 an e = 0.1, whereas the two ashe black lines represent two quaratic functions with similar minimum points. Close to the minima, F e w) an the quaratic functions behave rather similarly. However, as we increase w, the two quaratic functions become rather istinguishable, whereas F e w) become more an more inistinguishable for the two choices of e. hus, istinguishing whether e = 0.1 or e = 0.1, base only on function values is of F e w), is much harer than the quaratic case situations where these parameters scale with. Finally, while this paper settles the case of strongly-convex an smooth functions, we still on t know what is the attainable performance for general convex functions, as well as the more specific case of strongly-convex ) possibly non-smooth) functions. Our Ω / lower boun still hols, but the existing upper bouns are much larger: min /, } { 4 3 / for convex functions, an { 3 min /, } 3 / for strongly-convex functions see table 1). We on t know if the lower boun or the existing upper bouns are tight. However, it is the current upper bouns which seem less natural, an we suspect that they are the ones that can be consierably improve, using new algorithms which remain uniscovere. Acknowlegments We thank John Duchi, Satyen Kale, Robi Krauthgamer an the anonymous reviewers for helpful iscussions an comments. References Y. Abbasi-Yakori, D. Pál, an C. Szepesvári. Improve algorithms for linear stochastic banits. In NIPS,

13 Complexity of Banit an Derivative-Free Stochastic Convex Optimization A. Agarwal, O. Dekel, an L. Xiao. Optimal algorithms for online convex optimization with multi-point banit feeback. In COL, 010. A. Agarwal, D. Foster, D. Hsu, S. Kakae, an A. Rakhlin. Stochastic convex optimization with banit feeback. In NIPS, 011. E. Arias-Castro, E. Canès, an M. Davenport. On the funamental limits of aaptive sensing. CoRR, abs/ , 011. J.-Y. Auibert an S. Bubeck. Minimax policies for aversarial an stochastic banits. In COL, 009. J.-Y. Auibert, S. Bubeck, an G. Lugosi. Minimax policies for combinatorial preiction games. COL, 011. P. Auer, N. Cesa-Bianchi, Y. Freun, an R. Schapire. he nonstochastic multiarme banit problem. SIAM J. Comput., 31):48 77, 00. S. Bubeck an N. Cesa-Bianchi. Regret analysis of stochastic an nonstochastic multi-arme banit problems. CoRR, abs/ , 01. S. Bubeck, R. Munos, an G. Stoltz. Pure exploration in finitely-arme an continuousarme banits. heoretical Computer Science, 4119): , 011. S. Bubeck, N. Cesa-Bianchi, an S. Kakae. owars minimax policies for online linear optimization with banit feeback. In COL, 01. N. Cesa-Bianchi an G. Lugosi. Preiction, learning, an games. Cambrige University Press, Cover an J. homas. Elements of information theory. Wiley, eition, 006. A.B. Cybakov. Introuction to nonparametric estimation. Springer series in statistics. Springer, 009. V. Dani,. Hayes, an S. Kakae. he price of banit information for online optimization. In NIPS, 007. V. Dani,. Hayes, an S. Kakae. Stochastic linear optimization uner banit feeback. In COL, 008. A. Flaxman, A. Kalai, an B. McMahan. Online convex optimization in the banit setting: graient escent without a graient. In SODA, 005. E. Hazan an S. Kale. Beyon the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In COL, 011. K. Jamieson, R. Nowak, an B. Recht. Query complexity of erivative-free optimization. CoRR, abs/ , 01. S. Kullback. Information heory an Statistics. Dover,

14 Shamir A. Nemirovsky an D. Yuin. Problem Complexity an Metho Efficiency in Optimization. Wiley-Interscience, Y. Nesterov. Ranom graient-free minimization of convex functions. echnical Report 16, ECORE Discussion Paper, 011. A. Rakhlin, O. Shamir, an K. Sriharan. Making graient escent optimal for strongly convex stochastic optimization. In ICML, 01. S. Stich, C. Müller, an B. Gärtner. Optimization of convex functions with ranom pursuit. CoRR, abs/ , 011. M. Zinkevich. Online convex programming an generalize infinitesimal graient ascent. In ICML, 003. Appenix A. Improve Results for Quaratic Functions In Sec. 3, we showe a tight Θ / ) boun on the achievable error for quaratic functions, in the erivative-free SCO setting. his was shown uner the assumption that the noise ξ w is zero-mean an has a secon moment boune by max{1, w }. In this appenix, we show how uner aitional natural assumptions on the noise, one can improve on this result with an efficient algorithm. he main message here is not so much the algorithmic result, but rather to show that the generic noise assumption is important for our lower bouns, an that better algorithms may still be possible for more specific settings. o give a concrete example, consier the classic setting of rige regression, where we have labele training examples x, y) sample i.i.. from some istribution over R R, an our goal is to fin some w R minimizing F w) = λ [ ) w + E x,y) w x y. In a banit / erivative-free SCO setting, we can think of each query as giving as the value of ˆF w) = λ w + w x y). 6) for some specific example x, y), an note that its expecte value over the ranom raw of x, y)) equals F w). hus, it falls within the setting consiere in this paper. However, the noise process is not generic, but has a particular structure. We will show here that one can actually attain an error rate as goo as O/ ) for this problem. o formally present our result, it woul be useful to consier a more general setting, the rige regression setting above being a special case. Suppose we can write F w) as E[ ˆF w), where ˆF w) ecomposes into a eterministic term Rw) an a stochastic quaratic term Ĝw): ˆF w) = Rw) + w Ĝw) = Rw) + Âw + ˆb ) w + ĉ, where Â, ˆb, ĉ are ranom variables. We assume that whenever we query a point w, we get ˆF w) for some ranom realization of Â, ˆb, ĉ. In general, Rw) can be a strongly-convex regularization term, such as λ w in Eq. 6). 14

15 Complexity of Banit an Derivative-Free Stochastic Convex Optimization he algorithm we consier, Algorithm, is a slight variant of Algorithm 1, which takes this ecomposition of F w) into account when constructing its unbiase graient estimate. Compare to Algorithm 1, this algorithm also queries at ranom points further away from w t, up to a istance of. We will assume here that we can always query at such points 3. We also let W = W {w : w B} in the algorithm, where we recall that B is some known upper boun on w. Algorithm Derivative-Free SCO Algorithm for Decomposable-Quaratic Functions Input: Deterministic term R ); Strong convexity parameter λ > 0 Initialize w 1 = 0. for t = 1,..., 1 o Pick r { 1, +1} uniformly at ranom Query noisy function value v at point w t + r Let g = v R w t + r)) r + g R w t ), where g R w) is a subgraient of R ) at w Let w t+1 = Π W wt 1 λt g) en for Return w = w t=/ w t. We now show that with this algorithm, one can improve on our O / ) error upper boun from hm. 1). heorem 8 In the setting escribe above, suppose Â, ˆb, ĉ are all at most 1 with probability 1, the optimum w has a norm of at most B, an g R w) N for any w W. hen uner Assumption 1, the point w returne by Algorithm satisfies E [F w ) F w ) where F is the Frobenius norm. N log)) B + 1) 4 + E λ [  F Note that if we only assume  1, then  F can be as high as, which leas to an O / ) boun, same as in hm. 1. However, it may be much smaller than that. In particular, for the rige regression case we consiere earlier,  correspons to xx where x is a ranomly rawn instance. Uner the common assumption that x O1) inepenent of the imension), it follows that xx F = x 4 = O1). herefore,  F is inepenent of the imension, leaing to an O/ ) error upper boun in terms of,. We remark that even in this specific setting, the O/ ) boun oes not carry over to the banit SCO setting i.e. in terms of regret), since the algorithm requires us to query far away from w t. Also, we again emphasize that this result oes not contraict our lower boun in the quaratic case hm. 3), since the setting there inclue a generic noise term, while here the stochastic noise has a very specific structure. As to the proof of hm. 8, it is very similar to that of hm. 1, the key ifference being a better moment upper boun on the graient estimate ḡ, as formalize in the following lemma. Plugging this improve boun into the calculations results in the theorem. ) 3. Similar to Algorithm 1, if one can only query at some istance ɛ, where ɛ 0, 1, then one can moify the algorithm to hanle such cases, with the resulting error boun epening on ɛ., 15

16 Shamir Lemma 9 For any w t, we have that E r,v [ g is a subgraient of F w t ), an [ )) E r,v [ g 4 N + 3 B + 1) 4 + E  F. Proof By efinition of F w t ), we note that g = w t + r)  w t + r) + ˆb ) w t + r) + ĉ r + g R w t ). Using a similar calculation to the one in the proof of Lemma, we have that the expecte value of this expression over r an Â, ˆb, ĉ is w t E[ + E[ˆb + g R w t ), which is a subgraient of F w t ). As to the moment boun, we have [ E[ g E 4 w t + r)  w t + r)) r ˆb ) + 4 w t + r) r + 4ĉ r + 4 g R w t ) [ 4 E  w t + wt Âr + r Âr) ˆb ) ˆb ) ) + w t + r N = 4 E = 1 [ B + w t B 4 + 4E ) Âr + r Ar + B ˆb ) ) + r [ ) [ ) ) wt Âr + E r Ar N [ ˆb ) ) B + E r N. Letting â i,j enote entry i, j) in Â, an recalling that by efinition of r, E[r ir j = 1 i=j, we have that [ ) E r Ar = E r i r j â i,j = E r i r j r i r j â i,j â i,j i,j i,j,i,j = E ri rj â i,j = E [ â i,j = E  F. i,j i,j 7) Also, using the fact that E[rr is the ientity matrix, we have [ ) [ [ [ E wt Âr = E wt Ârr  w t = E wt  w t E w t  B. Finally, we have [ ˆb ) E r [ˆb = E rr ˆb [ = E ˆb 1. Plugging these inequalities back into Eq. 7), we get that ) E[ g 1 B 4 + 4B + E[  F + 8 B + 1 ) N [ ) = 4 3B B E  F + 4N [ ) 1 B + 1) 4 + E  F + 4N, 16

17 Complexity of Banit an Derivative-Free Stochastic Convex Optimization from which the lemma follows. Appenix B. Aitional Proofs B.1. Proof of Lemma By the way r is picke, we have that E r [r i r j = 1 i=j an that E r [r i r j r k = 0 for all i, j, k. hus, letting E enote expectation w.r.t. r an the ranom function values, we have [ v E[ g = E = E = E ɛ [ ɛ [ ɛ r w + ɛ r) A w + ɛ ) r + b w + ɛ ) ) r + c + ξ wt+ ɛ r r ) w Aw + b w + c + ξ wt+ ɛ r r + ɛ ) [ ) ) r Ar r + E w Ar r + b r r = 0 + w A + b + 0 = F w). Also, by the assumptions on A, b, c an the assumptions on the noise ξ w, we have [ F w t + ɛ ) ) r + ξ wt+ ɛ r [ v E[ g = E ɛ r [ ɛ E as require. ɛ ɛ F w t + sup w: w B+ɛ sup w: w B+ɛ = ɛ E[v = ɛ E ɛ r)) + ξ w t+ ɛ r { F w)) + max 1, w t + ɛ } ) r ) ) w Aw + b w + c + B + 1) ɛ B + ɛ) +B + ɛ) + 1 ) + B + 1) ) 4 B + 1)4 ɛ 17

18 Shamir B.. Proof of Lemma 4 We have the following: [ E 1 wi e i <0 = Pr w i e i < 0) = 1 Pr w i < 0 e i > 0) + Pr w i > 0 e i < 0)) ) = 1 Pr w i > 0 e i > 0) Pr w i > 0 e i < 0)) ) 1 1 Pr w i > 0 e i > 0) Pr w i > 0 e i < 0) 1 1 Pr w i > 0 e i > 0) Pr w i > 0 e i < 0)), 8) where the last inequality is by the fact that for any values a 1,..., a, it hols that a a a a. Consier without loss of generality) the term corresponing to the first coorinate, namely Pr w 1 > 0 e 1 > 0) Pr w 1 > 0 e 1 < 0)). his term equals Pr{e j } j=) e,...,e Pr{e j } j=) e,...,e sup Pr e,...,e 1 D kl ) )) ) Pr w 1 > 0 e 1 > 0, {e j } j= Pr w 1 > 0 e 1 < 0, {e j } j= ) )) Pr w 1 > 0 e 1 > 0, {e j } j= Pr w 1 > 0 e 1 < 0, {e j } j= ) )) w 1 > 0 e 1 > 0, {e j } j= Pr w 1 > 0 e 1 < 0, {e j } j= By Pinsker s inequality an the assumption that w is a eterministic function of v 1,..., v, this expression is at most Pr v 1,..., v e 1 > 0, {e j } j= ) )) Pr v 1,..., v e 1 < 0, {e j } j=, where D kl P Q) is the Kullback-Leibler ivergence between the two istributions. By the chain rule see e.g. Cover an homas 006)), we can upper boun the above by 1 ) D kl Pr v t e 1 > 0, {e j } j=, {v l } t 1 l=1 Plugging these bouns back into Eq. 8), the result follows. )) Pr v t e 1 < 0, {e j } j=, {v l } t 1 l=1. 18

19 Complexity of Banit an Derivative-Free Stochastic Convex Optimization B.3. Proof of hm. 6 We may assume without loss of generality that, an it is enough to show that the expecte average regret is at least 0.0 /. his is because if there was a strategy with < 0.0 average regret after < rouns, then for the case of rouns, we coul just run that strategy for rouns, compute the average w of all points playe so far, an then repeately choose w in the remaining rouns. By Jensen s inequality, this woul imply a < 0.0 average regret after rouns, in contraiction. Let w be an arbitrary eterministic function of w 1,..., w. A proof ientical to that of hm. 3, up to Eq. 4), implies that for any µ > 0, there exists a quaratic function of the form F e = 1 w e, w, with e { µ, µ}, such that E[F e w ) F e w ) E µ 4 1 µ { } min w t 1, w t. In particular, letting w = 1 w t, using Jensen s inequality, an iscaring the min, we get that [ 1 E F e w t ) F e w ) µ 1 µ w t 4. 9) However, we also know that by strong convexity of F e, we have [ 1 E F e w t ) F e w ) 1 w t e. 10) Using the fact that w t = w t e + e w t e + e ) w t e + e, we get that w t e 1 w t e = 1 w t µ. Substituting into Eq. 10) an slightly manipulating the resulting inequality, we get [ w t 1 4 E F e w t ) F e w ) + µ. [ For simplicity, enote the average regret term E 1 F ew t ) F e w ) by R. Substituting the expression above into Eq. 9), we get ) R µ µ R + µ ) µ 4µ 1 R ) µ

20 Shamir Rearranging an simplifying, we get R + µ3 R + µ µ ) he equation above can be seen as a quaratic function of R, with the roots ) 1 µ3 ± µ3 + µ 1 µ ). Now, recall that µ is a free parameter that we can choose at will. If we choose it so that 1 µ > 0, then it is easy to show that we get two roots, one strictly positive an one strictly negative. Since we know R is a nonnegative quantity, we get that 1 R = µ3 + µ3 ) + µ 1 µ ) ) µ µ + 4 µ4 + 1 µ. Finally, choosing µ = 1/4 / which inee satisfies 1 µ > 0), an simplifying, we get R Recalling that R is the expecte average regret, it only remains to take the square of the two sies. We note that since we assume, then e = µ = / / 1/, as specifie in the theorem statement. B.4. Proof of hm. 7 Let µ > 0 be a parameter to be etermine later. As iscusse in the text, we will look at functions of the form F e w) = w e i w i 1 + w i /e i ), 11) where e is uniformly istribute on { µ, +µ}. Our goal will be to prove a lower boun on the expecte optimization error over the ranomize choice of F e, with respect to eterministic querying strategies. As explaine in the proof of hm. 3, this woul imply the existence of some fixe F e such that the expecte optimization error over a possibly ranomize) querying strategy is the same. We will nee the following properties of F e : Lemma 10 For any µ > 0 an any e { µ, +µ}, the function F e in Eq. 11) is: ˆ 0.5-Strongly convex an 3.5-smooth 0

21 Complexity of Banit an Derivative-Free Stochastic Convex Optimization ˆ + µ-lipschitz for any w such that w 1. ˆ F e is globally minimize at w = ce, where c = /3 ˆ For any e { µ, +µ} which iffers from e in a single coorinate, an for any w R, it hols that F e w) F e w) µ. Proof Note that we can write the function F e w) as g e i w i ), where g a x) = x ax 1 + x/a). It is not har to realize that to prove the lemma, it is enough to prove that: 1. g a x) is 0.5-strongly convex an 3.5-smooth;. g ax) is at most 4 x + a ; 3. For all µ, g µ x) g µ x) µ ; 4. g a x) is minimize at ca where c = o show item 1, we calculate the secon erivative of g a x), which is 1 + a3 x3a x ) ) a + x ) 3. By efinition of strong convexity an smoothness, it is enough to show that this term is always at least 0.5 an at most 3.5. Substituting x = ay an simplifying, we get 1 + y3 ) y ) 1 + y ) 3. It is a straightforwar exercise to verify that y3 y ) 1+y ) is at most 3/4 for all y R, hence 3 the expression above is always in [0.5, 3.5 as require. As to item, we note that g ax) = x a5 a 3 x 1 x/a) a + x = x a ) 1 + x/a) ). For any value of x/a, the value of the fraction above is easily verifie to be at most 1, hence we can upper boun g ax) by x + a as require. As to item 3, we have g µ x) g µ x) = µx µx = µ 1 + x/µ) µ + x µ, where the last step uses µ + x µx, which follows from the ientity µ + x ) Since this woul imply that F ew) is at most wi + µ) 4w i + µ ) 4w i ) + µ ) = w + µ, which is at most + µ for any w in the unit ball. 1

22 Shamir Finally, as to item 4, we note that this function can be equivalently written as g a x) = a x/a) x/a) ) 1 + x/a). Substituting x = ay, we get a y y/1 + y )). A numerical calculation reveals that the minimizing value of y is , hence the minimizing value of x is a as require. We now begin to erive the lower boun. Using strong convexity an the lemma, we have [ [ [ 1 E[F w ) F w ) E 4 w w = 1 4 E w i wi ) 1 4 E wi ) 1 wi wi <0 [ [ 1 4 E ei ) 1 wi e 3 i <0 = µ 36 E 1 wi e i <0 1) We now lower boun this term using Lemma 4. o o so, we nee to upper boun the KL ivergence of the query values at roun t uner the two hypotheses e i = +µ an e i = µ, the other coorinates being fixe. We assume each noise term ξ w is a stanar Gaussian ranom variable. hus, the query value that we see is istribute as F e w t ) + ξ w = w j=1 e j w j 1 + w j /e j ) + ξ w. where one of the coorinates i of e is either +µ or µ an the other coorinates are fixe. his is a Gaussian istribution, with mean F e w t ) an variance 1. By Lemma 10, the ifference between the two means uner the two cases e i = +µ, e i = µ is at most µ, so by Lemma 5, the KL-ivergence is at most µ 4 /. Using Lemma 4, this implies that Eq. 1) is at least µ 1 1 ) µ 4 = µ µ Picking µ = 1/4, we get a lower boun of /144 > /. Finally, note that for this choice of µ, by Lemma 10, our function F e for any realization of e) is + / - Lipschitz in the unit ball, an has a global minimum with norm at most 0.35 /. If, the Lipschitz parameter is at most 4 an the global minimum is insie the unit ball, satisfying the requirements in the theorem statement. If <, then the boun cannot be better than what we woul obtain for = the argument is similar to the one in the proof of hm. 6), which is hus, for any, the boun is at least { } { } min 0.004, = min 1, as require.

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback Journal of Machine Learning Research 8 07) - Submitte /6; Publishe 5/7 An Optimal Algorithm for Banit an Zero-Orer Convex Optimization with wo-point Feeback Oha Shamir Department of Computer Science an

More information

Algorithms and matching lower bounds for approximately-convex optimization

Algorithms and matching lower bounds for approximately-convex optimization Algorithms an matching lower bouns for approximately-convex optimization Yuanzhi Li Department of Computer Science Princeton University Princeton, NJ, 08450 yuanzhil@cs.princeton.eu Anrej Risteski Department

More information

Lecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012

Lecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012 CS-6 Theory Gems November 8, 0 Lecture Lecturer: Alesaner Mąry Scribes: Alhussein Fawzi, Dorina Thanou Introuction Toay, we will briefly iscuss an important technique in probability theory measure concentration

More information

Least-Squares Regression on Sparse Spaces

Least-Squares Regression on Sparse Spaces Least-Squares Regression on Sparse Spaces Yuri Grinberg, Mahi Milani Far, Joelle Pineau School of Computer Science McGill University Montreal, Canaa {ygrinb,mmilan1,jpineau}@cs.mcgill.ca 1 Introuction

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 12 EFFICIENT LEARNING So far, our focus has been on moels of learning an basic algorithms for those moels. We have not place much emphasis on how to learn quickly.

More information

Self-normalized Martingale Tail Inequality

Self-normalized Martingale Tail Inequality Online-to-Confience-Set Conversions an Application to Sparse Stochastic Banits A Self-normalize Martingale Tail Inequality The self-normalize martingale tail inequality that we present here is the scalar-value

More information

Lower Bounds for the Smoothed Number of Pareto optimal Solutions

Lower Bounds for the Smoothed Number of Pareto optimal Solutions Lower Bouns for the Smoothe Number of Pareto optimal Solutions Tobias Brunsch an Heiko Röglin Department of Computer Science, University of Bonn, Germany brunsch@cs.uni-bonn.e, heiko@roeglin.org Abstract.

More information

Agmon Kolmogorov Inequalities on l 2 (Z d )

Agmon Kolmogorov Inequalities on l 2 (Z d ) Journal of Mathematics Research; Vol. 6, No. ; 04 ISSN 96-9795 E-ISSN 96-9809 Publishe by Canaian Center of Science an Eucation Agmon Kolmogorov Inequalities on l (Z ) Arman Sahovic Mathematics Department,

More information

Separation of Variables

Separation of Variables Physics 342 Lecture 1 Separation of Variables Lecture 1 Physics 342 Quantum Mechanics I Monay, January 25th, 2010 There are three basic mathematical tools we nee, an then we can begin working on the physical

More information

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k A Proof of Lemma 2 B Proof of Lemma 3 Proof: Since the support of LL istributions is R, two such istributions are equivalent absolutely continuous with respect to each other an the ivergence is well-efine

More information

Survey Sampling. 1 Design-based Inference. Kosuke Imai Department of Politics, Princeton University. February 19, 2013

Survey Sampling. 1 Design-based Inference. Kosuke Imai Department of Politics, Princeton University. February 19, 2013 Survey Sampling Kosuke Imai Department of Politics, Princeton University February 19, 2013 Survey sampling is one of the most commonly use ata collection methos for social scientists. We begin by escribing

More information

Euler equations for multiple integrals

Euler equations for multiple integrals Euler equations for multiple integrals January 22, 2013 Contents 1 Reminer of multivariable calculus 2 1.1 Vector ifferentiation......................... 2 1.2 Matrix ifferentiation........................

More information

Analyzing Tensor Power Method Dynamics in Overcomplete Regime

Analyzing Tensor Power Method Dynamics in Overcomplete Regime Journal of Machine Learning Research 18 (2017) 1-40 Submitte 9/15; Revise 11/16; Publishe 4/17 Analyzing Tensor Power Metho Dynamics in Overcomplete Regime Animashree Ananumar Department of Electrical

More information

arxiv: v2 [cs.ds] 11 May 2016

arxiv: v2 [cs.ds] 11 May 2016 Optimizing Star-Convex Functions Jasper C.H. Lee Paul Valiant arxiv:5.04466v2 [cs.ds] May 206 Department of Computer Science Brown University {jasperchlee,paul_valiant}@brown.eu May 3, 206 Abstract We

More information

Linear Regression with Limited Observation

Linear Regression with Limited Observation Ela Hazan Tomer Koren Technion Israel Institute of Technology, Technion City 32000, Haifa, Israel ehazan@ie.technion.ac.il tomerk@cs.technion.ac.il Abstract We consier the most common variants of linear

More information

PDE Notes, Lecture #11

PDE Notes, Lecture #11 PDE Notes, Lecture # from Professor Jalal Shatah s Lectures Febuary 9th, 2009 Sobolev Spaces Recall that for u L loc we can efine the weak erivative Du by Du, φ := udφ φ C0 If v L loc such that Du, φ =

More information
