An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback

Journal of Machine Learning Research 18 (2017) 1-11. Submitted 2/16; Published 5/17.

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback

Ohad Shamir (ohad.shamir@weizmann.ac.il)
Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 7610001, Israel

Editor: Alexander Rakhlin

©2017 Ohad Shamir. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/16-632.html.

Abstract

We consider the closely related problems of bandit convex optimization with two-point feedback, and zero-order stochastic convex optimization with two function evaluations per round. We provide a simple algorithm and analysis which is optimal for convex Lipschitz functions. This improves on Duchi et al. (2015), which only provides an optimal result for smooth functions; moreover, the algorithm and analysis are simpler, and readily extend to non-Euclidean problems. The algorithm is based on a small but surprisingly powerful modification of the gradient estimator.

Keywords: zero-order optimization, bandit optimization, stochastic optimization, gradient estimator

1. Introduction

We consider the problem of bandit convex optimization with two-point feedback (Agarwal et al., 2010). This problem can be defined as a repeated game between a learner and an adversary as follows: At each round $t$, the adversary picks a convex function $f_t$ on $\mathbb{R}^d$, which is not revealed to the learner. The learner then chooses a point $w_t$ from some known and closed convex set $\mathcal{W} \subseteq \mathbb{R}^d$, and suffers a loss $f_t(w_t)$. As feedback, the learner may choose two points $w_t', w_t'' \in \mathcal{W}$ and receive $f_t(w_t'), f_t(w_t'')$.¹ The learner's goal is to minimize the average regret, defined as
$$\frac{1}{T}\sum_{t=1}^{T} f_t(w_t) \;-\; \min_{w\in\mathcal{W}} \frac{1}{T}\sum_{t=1}^{T} f_t(w).$$
In this paper, we focus on obtaining bounds on the expected average regret (with respect to the learner's randomness).

¹ This is slightly different than the model of Agarwal et al. (2010), where the learner only chooses $w_t', w_t''$ and the loss is $\frac{1}{2}(f_t(w_t') + f_t(w_t''))$. However, our results and analysis can be easily translated to their setting, and the model we discuss translates more directly to the zero-order stochastic optimization considered later.

A closely-related and easier setting is zero-order stochastic convex optimization. In this setting, our goal is to approximately solve $\min_{w\in\mathcal{W}} F(w)$ where $F(w) = \mathbb{E}_{\xi}[f(w;\xi)]$, given limited access to $\{f(\cdot;\xi_t)\}$ where the $\xi_t$ are i.i.d. instantiations. Specifically, we assume that each $f(\cdot;\xi_t)$ is not directly observed, but rather can be queried at two points. This models situations where computing gradients directly is complicated or infeasible.

It is well-known (Cesa-Bianchi et al., 2004) that given an algorithm with expected average regret $R_T$ in the bandit optimization setting above, if we feed it with the functions $f_t(w) = f(w;\xi_t)$, then the average point $\bar{w} = \frac{1}{T}\sum_{t=1}^{T} w_t$ of the points generated satisfies the following bound on the expected optimization error:
$$\mathbb{E}\left[F(\bar{w})\right] - \min_{w\in\mathcal{W}} F(w) \;\le\; R_T.$$
Thus, an algorithm for bandit optimization can be converted to an algorithm for zero-order stochastic optimization with similar guarantees.
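To make this reduction concrete, here is a minimal Python sketch (our own illustration, not part of the paper; the learner interface with predict() and observe() is hypothetical): it runs a two-point bandit learner on the random losses $f_t(w) = f(w;\xi_t)$ and returns the averaged iterate $\bar{w}$.

```python
import numpy as np

def online_to_batch(learner, sample_xi, f, T):
    """Online-to-batch conversion: feed i.i.d. losses f_t(w) = f(w, xi_t) to a
    two-point bandit learner and return the average iterate w_bar. By the argument
    of Cesa-Bianchi et al. (2004), E[F(w_bar)] - min_w F(w) is bounded by the
    learner's expected average regret. The predict/observe interface is hypothetical."""
    iterates = []
    for _ in range(T):
        xi = sample_xi()                          # fresh i.i.d. instantiation
        f_t = lambda w, xi=xi: f(w, xi)           # this round's loss
        w_t, query_points = learner.predict()     # learner's point and its two query points
        learner.observe([f_t(q) for q in query_points])
        iterates.append(w_t)
    return np.mean(np.asarray(iterates), axis=0)  # w_bar
```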

The bandit optimization setting with two-point feedback was proposed and studied in Agarwal et al. (2010). Independently, Nesterov (2011) considered two-point methods for stochastic optimization. Both papers are based on randomized gradient estimates, which are then fed into standard first-order algorithms (e.g. gradient descent, or more generally mirror descent). However, the regret/error guarantees in both papers were suboptimal in terms of the dependence on the dimension.

Recently, Duchi et al. (2015) considered a similar approach for the stochastic optimization setting, attaining an optimal error guarantee when $f(\cdot;\xi)$ is a smooth function (differentiable and with Lipschitz-continuous gradients). Related results in the smooth case were also obtained by Ghadimi and Lan (2013). However, to tackle the general case, where $f(\cdot;\xi)$ may be non-smooth, Duchi et al. (2015) resorted to a non-trivial smoothing scheme and a significantly more involved analysis. The resulting bounds have additional factors (logarithmic in the dimension) compared to the guarantees in the smooth case. Moreover, an analysis is only provided for Euclidean problems (where the domain $\mathcal{W}$ and the Lipschitz parameter of $f_t$ scale with the $L_2$ norm).

In this note, we present and analyze a simple algorithm with the following properties:

- For Euclidean problems, it is optimal up to constants for both smooth and non-smooth functions. This closes the gap between the smooth and non-smooth Euclidean problems in this setting.
- The algorithm and analysis are readily applicable to non-Euclidean problems. We give an example for the 1-norm, with the resulting bound optimal up to logarithmic factors.
- The algorithm and analysis are simpler than those proposed in Duchi et al. (2015). They apply equally to the bandit and zero-order optimization setting, and can be readily extended using standard techniques, e.g. improved bounds for strongly-convex functions; regret/error bounds holding with high probability rather than just in expectation; and improved bounds if we are allowed $k > 2$ observations per round instead of just two (Hazan et al., 2007; Shalev-Shwartz, 2007; Agarwal et al., 2010).

Like previous algorithms, our algorithm is based on a random gradient estimator, which given a function $f$ and a point $w$, queries $f$ at two random locations close to $w$, and computes a random vector whose expectation is a gradient of a smoothed version of $f$. The papers Nesterov (2011), Duchi et al. (2015) and Ghadimi and Lan (2013) essentially use the estimator which queries at $w$ and $w + \delta u$ (where $u$ is a random unit vector and $\delta > 0$ is a small parameter), and returns
$$\frac{d}{\delta}\left(f(w+\delta u) - f(w)\right)u. \qquad (1)$$
The intuition is readily seen in the one-dimensional ($d=1$) case, where the expectation of this expression equals
$$\frac{1}{2\delta}\left(f(w+\delta) - f(w-\delta)\right), \qquad (2)$$
which indeed approximates the derivative of $f$ (assuming $f$ is differentiable) at $w$, if $\delta$ is small enough. In contrast, our algorithm uses a slightly different estimator (also used in Agarwal et al., 2010), which queries at $w - \delta u$ and $w + \delta u$, and returns
$$\frac{d}{2\delta}\left(f(w+\delta u) - f(w-\delta u)\right)u. \qquad (3)$$
Again, the intuition is readily seen in the case $d=1$, where the expectation of this expression also equals Eq. (2). When $\delta$ is sufficiently small and $f$ is differentiable at $w$, both estimators compute a good approximation of the true gradient $\nabla f(w)$. However, when $f$ is not differentiable, the variance of the estimator in Eq. (1) can be quadratic in the dimension, as pointed out by Duchi et al. (2015): For example, for $f(w) = \|w\|_2$ and $w = 0$, the second moment equals
$$\mathbb{E}\left\|\frac{d}{\delta}\left(f(\delta u) - f(0)\right)u\right\|_2^2 = d^2\,\mathbb{E}\|u\|_2^4 = d^2.$$
Since the performance of the algorithm crucially depends on the second moment of the gradient estimate, this leads to a highly sub-optimal guarantee. In Duchi et al. (2015), this was handled by adding an additional random perturbation and using a more involved analysis. Surprisingly, it turns out that the slightly different estimator in Eq. (3) does not suffer from this problem, and its second moment is essentially linear in the dimension.

We note that in this work, we assume that $u$ is a random unit vector, similar to previous works. However, our results can be readily extended to other distributions, such as uniform in the Euclidean unit ball, or a Gaussian distribution.
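To make the gap concrete, the following small Monte Carlo sketch (our own illustration, not part of the paper; all names are ours) estimates the second moment of the two estimators on the example above, $f(w) = \|w\|_2$ at $w = 0$: the estimator of Eq. (1) comes out close to $d^2$, while the symmetric estimator of Eq. (3) vanishes at this particular point, and more generally its second moment stays essentially linear in $d$ for Lipschitz $f$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere_sample(d, n):
    """n i.i.d. points uniform on the Euclidean unit sphere in R^d."""
    g = rng.standard_normal((n, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def second_moments(f, w, d, delta=1e-3, n=100_000):
    """Monte Carlo estimates of E||g||_2^2 for the estimators in Eq. (1) and Eq. (3)."""
    u = sphere_sample(d, n)
    f_plus = np.array([f(w + delta * ui) for ui in u])
    f_minus = np.array([f(w - delta * ui) for ui in u])
    f_w = f(w)
    g1 = (d / delta) * (f_plus - f_w)[:, None] * u            # Eq. (1)
    g3 = (d / (2 * delta)) * (f_plus - f_minus)[:, None] * u  # Eq. (3)
    return (np.linalg.norm(g1, axis=1) ** 2).mean(), (np.linalg.norm(g3, axis=1) ** 2).mean()

d = 50
m1, m3 = second_moments(np.linalg.norm, np.zeros(d), d)
print(f"Eq.(1): {m1:.1f} (about d^2 = {d**2}),  Eq.(3): {m3:.3g}")
```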

2. Algorithm and Main Results

We consider the algorithm described in Figure 1, which performs standard mirror descent using a randomized gradient estimator $g_t$ of a (smoothed) version of $f_t$ at the point $w_t$. Following Duchi et al. (2015), we assume that one can indeed query $f_t$ at any point $w_t \pm \delta_t u_t$ as specified in the algorithm.² The analysis of the algorithm is presented in the following theorem.

² This may require us to query at a distance $\delta_t$ outside $\mathcal{W}$. If we must query within $\mathcal{W}$, then a standard technique (see Agarwal et al., 2010) is to simply run the algorithm on a slightly smaller set $(1-\epsilon)\mathcal{W}$, where $\epsilon > 0$ is sufficiently large so that $w_t \pm \delta_t u_t$ must be in $\mathcal{W}$. Since the formal guarantee in Thm. 1 holds for arbitrarily small $\delta_t$, and each $f_t$ is Lipschitz, we can generally take $\delta_t$ (and hence $\epsilon$) sufficiently small so that the additional regret/error incurred is arbitrarily small.

Figure 1: Two-Point Bandit Convex Optimization Algorithm
  Input: step size $\eta$, function $r : \mathcal{W} \to \mathbb{R}$, exploration parameters $\delta_t > 0$
  Initialize $\theta_1 = 0$
  for $t = 1, \ldots, T$ do
    Predict $w_t = \arg\max_{w\in\mathcal{W}}\, \langle \theta_t, w\rangle - r(w)$
    Sample $u_t$ uniformly from the Euclidean unit sphere $\{w : \|w\|_2 = 1\}$
    Query $f_t(w_t + \delta_t u_t)$ and $f_t(w_t - \delta_t u_t)$
    Set $g_t = \frac{d}{2\delta_t}\left(f_t(w_t + \delta_t u_t) - f_t(w_t - \delta_t u_t)\right)u_t$
    Update $\theta_{t+1} = \theta_t - \eta\, g_t$
  end for

Theorem 1 Assume the following conditions hold:
1. $r$ is 1-strongly convex with respect to a norm $\|\cdot\|$, and $\sup_{w\in\mathcal{W}} r(w) \le R^2$ for some $R < \infty$.
2. Each $f_t$ is convex and $G_2$-Lipschitz with respect to the 2-norm.
3. The dual norm $\|\cdot\|_\star$ of $\|\cdot\|$ is such that $\frac{d^2}{4}\,\mathbb{E}_{u_t}\left[\|u_t\|_\star^4\right] \le p^2$ for some $p < \infty$.
If $\eta = \frac{R}{\sqrt{pT}\,G_2}$, and the $\delta_t$ are chosen such that $\delta_t \le \frac{\sqrt{p}\,R}{\sqrt{T}}$, then the sequence $w_1, \ldots, w_T$ generated by the algorithm satisfies the following for any $T$ and $w^* \in \mathcal{W}$:
$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\left(f_t(w_t) - f_t(w^*)\right)\right] \;\le\; \frac{c\,\sqrt{p}\,G_2\,R}{\sqrt{T}},$$
where $c$ is some numerical constant.

We note that condition 1 is standard in the analysis of the mirror-descent method (see the specific corollaries below), whereas conditions 2 and 3 are needed to ensure that the variance of our gradient estimator is controlled. As mentioned earlier, the bound on the average regret which appears in Thm. 1 immediately implies a similar bound on the error in a stochastic optimization setting, for the average point $\bar{w} = \frac{1}{T}\sum_{t=1}^{T} w_t$. We note that the result is robust to the choice of $\eta$, and is the same up to constants as long as $\eta = \Theta\!\left(R/(\sqrt{pT}\,G_2)\right)$. Also, the constant $c$, while always strictly positive, shrinks as $\delta_t \to 0$ (see the proof below for details).
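The following Python sketch (our own illustration; function names and the toy problem are ours, not the paper's) implements the loop in Figure 1, with the prediction step $w_t = \arg\max_{w\in\mathcal{W}}\langle\theta_t,w\rangle - r(w)$ supplied as a callable. For the Euclidean case $r(w) = \frac{1}{2}\|w\|_2^2$ on a ball, which is covered by Corollary 2 below, that step is just a projection, recovering online gradient descent.

```python
import numpy as np

def two_point_bandit_opt(loss_fns, d, eta, deltas, predict, rng=None):
    """Sketch of the algorithm in Figure 1: mirror descent driven by the two-point
    gradient estimator of Eq. (3). `predict(theta)` implements the step
    w_t = argmax_{w in W} <theta, w> - r(w). All names are ours."""
    rng = rng if rng is not None else np.random.default_rng(0)
    theta = np.zeros(d)
    iterates = []
    for f_t, delta_t in zip(loss_fns, deltas):
        w_t = predict(theta)
        u_t = rng.standard_normal(d)
        u_t /= np.linalg.norm(u_t)              # uniform direction on the unit sphere
        g_t = (d / (2.0 * delta_t)) * (f_t(w_t + delta_t * u_t) - f_t(w_t - delta_t * u_t)) * u_t
        theta = theta - eta * g_t               # theta_{t+1} = theta_t - eta * g_t
        iterates.append(w_t)
    return np.array(iterates)

def euclidean_predict(R):
    """Euclidean case r(w) = ||w||_2^2 / 2 on W = {||w||_2 <= R}: the argmax step
    is the projection of theta onto the ball (online gradient descent)."""
    def predict(theta):
        nrm = np.linalg.norm(theta)
        return theta if nrm <= R else (R / nrm) * theta
    return predict

# Toy zero-order problem: f_t(w) = ||w - w_star||_2 for all t (so G_2 = 1).
d, T, R = 20, 2000, 1.0
w_star = 0.5 * np.ones(d) / np.sqrt(d)
loss_fns = [lambda w: np.linalg.norm(w - w_star)] * T
eta = R / np.sqrt(d * T)                        # eta = R / (sqrt(p T) G_2) with p = d
deltas = [np.sqrt(d) * R / np.sqrt(T)] * T      # delta_t <= sqrt(p) R / sqrt(T)
ws = two_point_bandit_opt(loss_fns, d, eta, deltas, euclidean_predict(R))
print("suboptimality of the average iterate:", np.linalg.norm(ws.mean(axis=0) - w_star))
```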

As a first application of the theorem, let us consider the case where $\|\cdot\|$ is the Euclidean norm. In this case, we can take $r(w) = \frac{1}{2}\|w\|_2^2$, and the algorithm reduces to a standard variant of online gradient descent, defined as $\theta_{t+1} = \theta_t - \eta\,g_t$ and $w_t = \arg\min_{w\in\mathcal{W}} \|w - \theta_t\|_2$. In this case, we get the following corollary:

Corollary 2 Suppose $f_t$ for all $t$ is $G_2$-Lipschitz with respect to the Euclidean norm, and $\mathcal{W} \subseteq \{w : \|w\|_2 \le R\}$. Then using $\|\cdot\| = \|\cdot\|_2$ and $r(w) = \frac{1}{2}\|w\|_2^2$, it holds for some constant $c$ and any $w^* \in \mathcal{W}$ that
$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\left(f_t(w_t) - f_t(w^*)\right)\right] \;\le\; c\,G_2\,R\,\sqrt{\frac{d}{T}}.$$

The proof is immediately obtained from Thm. 1, noting that $p = d$ in our case. This bound matches (up to constants) the lower bound in Duchi et al. (2015), hence closing the gap between upper and lower bounds in this setting.

As a second application, let us consider the case where $\|\cdot\|$ is the 1-norm $\|\cdot\|_1$, the domain $\mathcal{W}$ is the simplex in $\mathbb{R}^d$, $d > 1$ (although our result easily extends to any subset of the 1-norm unit ball), and we use a standard entropic regularizer:

Corollary 3 Suppose $f_t$ for all $t$ is $G_\infty$-Lipschitz with respect to the $L_\infty$ norm. Then using $\|\cdot\| = \|\cdot\|_1$ and $r(w) = \sum_{i=1}^{d} w_i\log(w_i)$, it holds for some constant $c$ and any $w^* \in \mathcal{W}$ that
$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\left(f_t(w_t) - f_t(w^*)\right)\right] \;\le\; c\,G_\infty\,\frac{\log(2d)}{\sqrt{T}}.$$

This bound matches (this time up to a factor polylogarithmic in $d$) the lower bound in Duchi et al. (2015) for this setting.

Proof The function $r$ is 1-strongly convex with respect to the 1-norm (see for instance Shalev-Shwartz, 2012, Example 2.5), and has value at most $\log(d)$ (in absolute value) on the simplex. Also, if $f_t$ is $G_\infty$-Lipschitz with respect to the $\infty$-norm, then it must be $G_\infty$-Lipschitz with respect to the Euclidean norm. Finally, to satisfy condition 3 in Thm. 1, we upper bound $\frac{d^2}{4}\mathbb{E}\left[\|u_t\|_\infty^4\right]$ using the following lemma, whose proof is given in the appendix:

Lemma 4 If $u$ is uniformly distributed on the unit sphere in $\mathbb{R}^d$, $d > 1$, then
$$\frac{d^2}{4}\,\mathbb{E}\left[\|u\|_\infty^4\right] \;\le\; c\,\log^2(2d),$$
where $c$ is a positive numerical constant independent of $d$.

Plugging these observations into Thm. 1 leads to the desired result.

Finally, we make two additional remarks on possible extensions and improvements to Thm. 1.

Remark 5 (Querying at $k > 2$ points) If the algorithm is allowed to query $f_t$ at $k > 2$ points, then it can be modified to attain an improved regret bound, by computing $k/2$ independent estimates of $g_t$ at every round (using a freshly sampled $u_t$ each time), and using their average. This leads to a new gradient estimator $g_t^k$, which satisfies $\mathbb{E}\|g_t^k\|_\star^2 \le \frac{2}{k}\,\mathbb{E}\|g_t\|_\star^2 + 2\,\|\mathbb{E}[g_t]\|_\star^2$. Based on the proof of Thm. 1, it is easily verified that this leads to an average expected regret bound of $c\,G_2\,R\,\sqrt{(1 + p/k)/T}$ for some numerical constant $c$.

Remark 6 (Non-Euclidean Geometries) When considering norms other than the Euclidean norm, it is tempting to conjecture that our algorithm and analysis can be improved, by sampling $u_t$ from a distribution adapted to the geometry of that norm (not necessarily the Euclidean ball), and assuming $f_t$ is Lipschitz w.r.t. the dual norm. However, adapting the proof (and in particular obtaining appropriate versions of Lemma 8 and Lemma 9) does not appear straightforward, and the potential performance improvement is currently unclear.
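For the simplex/entropic setting of Corollary 3, only the prediction step of the sketch above changes: $\arg\max_{w\in\mathcal{W}}\langle\theta,w\rangle - \sum_i w_i\log(w_i)$ over the simplex is the softmax of $\theta$, so the update $\theta_{t+1} = \theta_t - \eta g_t$ becomes a multiplicative, exponentiated-gradient-style step. A minimal sketch, with our own naming:

```python
import numpy as np

def entropic_predict(theta):
    """Prediction step on the simplex with the entropic regularizer r(w) = sum_i w_i log w_i:
    argmax_{w in simplex} <theta, w> - r(w) is the softmax of theta, so mirror descent
    with the two-point estimator reduces to an exponentiated-gradient-style update."""
    z = theta - theta.max()      # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

# Plugs into the mirror-descent sketch above, e.g.
#   ws = two_point_bandit_opt(loss_fns, d, eta, deltas, entropic_predict)
# with eta and delta_t set per Corollary 3 (R^2 = log d and p of order log(2d)).
```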

3. Proof of Theorem 1

As discussed in the introduction, the key to getting improved results compared to previous papers is the use of a slightly different random gradient estimator, which turns out to have significantly less variance. The formal proof relies on a few simple lemmas listed below. The key lemma is Lemma 10, which establishes the improved variance behavior.

Lemma 7 For any $w \in \mathcal{W}$, it holds that
$$\sum_{t=1}^{T}\langle g_t, w_t - w\rangle \;\le\; \frac{R^2}{\eta} + \eta\sum_{t=1}^{T}\|g_t\|_\star^2.$$
This lemma is the canonical result on the convergence of online mirror descent, and the proof is standard (see e.g. Shalev-Shwartz, 2012).

Lemma 8 Define the function $\hat{f}_t(w) = \mathbb{E}_{u_t}\left[f_t(w + \delta_t u_t)\right]$ over $\mathcal{W}$, where $u_t$ is a vector picked uniformly at random from the Euclidean unit sphere. Then the function is convex, Lipschitz with constant $G_2$, satisfies
$$\sup_{w\in\mathcal{W}}\left|\hat{f}_t(w) - f_t(w)\right| \;\le\; \delta_t\,G_2,$$
and is differentiable with the following gradient:
$$\nabla\hat{f}_t(w) = \mathbb{E}_{u_t}\left[\frac{d}{\delta_t}\,f_t(w + \delta_t u_t)\,u_t\right].$$

Proof The fact that the function is convex and Lipschitz is immediate from its definition and the assumptions in the theorem. The inequality follows from $u_t$ being a unit vector and $f_t$ being assumed $G_2$-Lipschitz with respect to the 2-norm. The differentiability property follows from Lemma 2.1 in Flaxman et al. (2005).

Lemma 9 For any function $g$ which is $L$-Lipschitz with respect to the 2-norm, it holds that if $u$ is uniformly distributed on the Euclidean unit sphere, then
$$\mathbb{E}\left[\left(g(u) - \mathbb{E}[g(u)]\right)^4\right] \;\le\; \frac{c\,L^4}{d^2}$$
for some numerical constant $c$.

Proof A standard result on the concentration of Lipschitz functions on the Euclidean unit sphere implies that
$$\Pr\left(\left|g(u) - \mathbb{E}[g(u)]\right| > t\right) \;\le\; 2\exp\left(-\frac{c_1\,d\,t^2}{L^2}\right)$$
for some numerical constant $c_1 > 0$ (see the proof of Proposition 2.10 and Corollary 2.6 in Ledoux, 2005).

Therefore,
$$\mathbb{E}\left[(g(u) - \mathbb{E}[g(u)])^4\right] = \int_{t=0}^{\infty}\Pr\left((g(u) - \mathbb{E}[g(u)])^4 > t\right)dt = \int_{t=0}^{\infty}\Pr\left(|g(u) - \mathbb{E}[g(u)]| > t^{1/4}\right)dt \;\le\; \int_{t=0}^{\infty} 2\exp\left(-\frac{c_1\,d\,\sqrt{t}}{L^2}\right)dt = \frac{4L^4}{(c_1 d)^2},$$
where in the last step we used the fact that $\int_{x=0}^{\infty}\exp(-x)\,x\,dx = 1$. The expression above equals $c\,L^4/d^2$ for some numerical constant $c$.

Lemma 10 It holds that $\mathbb{E}[g_t \mid w_t] = \nabla\hat{f}_t(w_t)$ (where $\hat{f}_t(\cdot)$ is as defined in Lemma 8), and
$$\mathbb{E}\left[\|g_t\|_\star^2 \mid w_t\right] \;\le\; c\,p\,G_2^2$$
for some numerical constant $c$.

Proof For simplicity of notation, we drop the $t$ subscript. Since $u$ has a symmetric distribution around the origin,
$$\mathbb{E}[g \mid w] = \mathbb{E}_u\left[\frac{d}{2\delta}\left(f(w+\delta u) - f(w-\delta u)\right)u\right]
= \mathbb{E}_u\left[\frac{d}{2\delta}f(w+\delta u)\,u\right] + \mathbb{E}_u\left[\frac{d}{2\delta}f(w-\delta u)\,(-u)\right]
= \mathbb{E}_u\left[\frac{d}{2\delta}f(w+\delta u)\,u\right] + \mathbb{E}_u\left[\frac{d}{2\delta}f(w+\delta u)\,u\right]
= \mathbb{E}_u\left[\frac{d}{\delta}f(w+\delta u)\,u\right],$$
which equals $\nabla\hat{f}(w)$ by Lemma 8.

As to the second part of the lemma, we have the following, where $\alpha$ is an arbitrary parameter and where we use the elementary inequality $(a-b)^2 \le 2(a^2 + b^2)$:
$$\mathbb{E}\left[\|g\|_\star^2 \mid w\right] = \mathbb{E}_u\left\|\frac{d}{2\delta}\left(f(w+\delta u) - f(w-\delta u)\right)u\right\|_\star^2
= \frac{d^2}{4\delta^2}\,\mathbb{E}_u\left[\|u\|_\star^2\left(f(w+\delta u) - f(w-\delta u)\right)^2\right]
= \frac{d^2}{4\delta^2}\,\mathbb{E}_u\left[\|u\|_\star^2\left((f(w+\delta u) - \alpha) - (f(w-\delta u) - \alpha)\right)^2\right]
\le \frac{d^2}{2\delta^2}\,\mathbb{E}_u\left[\|u\|_\star^2\left((f(w+\delta u) - \alpha)^2 + (f(w-\delta u) - \alpha)^2\right)\right]
= \frac{d^2}{2\delta^2}\left(\mathbb{E}_u\left[\|u\|_\star^2(f(w+\delta u) - \alpha)^2\right] + \mathbb{E}_u\left[\|u\|_\star^2(f(w-\delta u) - \alpha)^2\right]\right).$$

Again using the symmetric distribution of $u$, this equals
$$\frac{d^2}{2\delta^2}\left(\mathbb{E}_u\left[\|u\|_\star^2(f(w+\delta u) - \alpha)^2\right] + \mathbb{E}_u\left[\|u\|_\star^2(f(w+\delta u) - \alpha)^2\right]\right) = \frac{d^2}{\delta^2}\,\mathbb{E}_u\left[\|u\|_\star^2\left(f(w+\delta u) - \alpha\right)^2\right].$$
Applying Cauchy-Schwartz and using the condition $\frac{d^2}{4}\mathbb{E}_u\left[\|u\|_\star^4\right] \le p^2$ stated in the theorem, we get the upper bound
$$\frac{d^2}{\delta^2}\sqrt{\mathbb{E}_u\left[\|u\|_\star^4\right]}\sqrt{\mathbb{E}_u\left[(f(w+\delta u) - \alpha)^4\right]} \;\le\; \frac{2p}{\delta^2}\sqrt{d^2\,\mathbb{E}_u\left[(f(w+\delta u) - \alpha)^4\right]}.$$
In particular, taking $\alpha = \mathbb{E}_u\left[f(w+\delta u)\right]$ and using Lemma 9 (noting that $f(w+\delta u)$ is $G_2\delta$-Lipschitz w.r.t. $u$ in terms of the 2-norm), this is at most
$$\frac{2p}{\delta^2}\sqrt{d^2\,\frac{c\,(G_2\delta)^4}{d^2}} = 2\sqrt{c}\,p\,G_2^2,$$
as required.

We are now ready to prove the theorem. Taking expectations on both sides of the inequality in Lemma 7, we have
$$\sum_{t=1}^{T}\mathbb{E}\left[\langle g_t, w_t - w\rangle\right] \;\le\; \frac{R^2}{\eta} + \eta\sum_{t=1}^{T}\mathbb{E}\left[\|g_t\|_\star^2\right] = \frac{R^2}{\eta} + \eta\sum_{t=1}^{T}\mathbb{E}\left[\mathbb{E}\left[\|g_t\|_\star^2 \mid w_t\right]\right]. \qquad (4)$$
Using Lemma 10, the right-hand side is at most $\frac{R^2}{\eta} + \eta\,c\,p\,G_2^2\,T$. The left-hand side of Eq. (4), by Lemma 10 and convexity of $\hat{f}_t$, equals
$$\sum_{t=1}^{T}\mathbb{E}\left[\left\langle \mathbb{E}[g_t \mid w_t],\, w_t - w\right\rangle\right] = \sum_{t=1}^{T}\mathbb{E}\left[\left\langle \nabla\hat{f}_t(w_t),\, w_t - w\right\rangle\right] \;\ge\; \sum_{t=1}^{T}\mathbb{E}\left[\hat{f}_t(w_t) - \hat{f}_t(w)\right].$$
By Lemma 8, this is at least
$$\sum_{t=1}^{T}\mathbb{E}\left[f_t(w_t) - f_t(w)\right] - 2G_2\sum_{t=1}^{T}\delta_t.$$
Combining these inequalities and plugging back into Eq. (4), we get
$$\sum_{t=1}^{T}\mathbb{E}\left[f_t(w_t) - f_t(w)\right] \;\le\; 2G_2\sum_{t=1}^{T}\delta_t + \frac{R^2}{\eta} + c\,p\,G_2^2\,\eta\,T.$$
Choosing $\eta = \frac{R}{\sqrt{pT}\,G_2}$, and any $\delta_t \le \frac{\sqrt{p}\,R}{\sqrt{T}}$, we get
$$\sum_{t=1}^{T}\mathbb{E}\left[f_t(w_t) - f_t(w)\right] \;\le\; (c + 3)\,\sqrt{p}\,G_2\,R\,\sqrt{T}.$$

Dividing both sides by $T$, the result follows.

Acknowledgments

This research was supported in part by an Israel Science Foundation grant 425/13 and an FP7 Marie Curie CIG grant. We thank the anonymous reviewers for several helpful comments.

Appendix A. Proof of Lemma 4

We note that the distribution of $\|u\|_\infty^4$ is identical to that of $\|n\|_\infty^4 / \|n\|_2^4$, where $n \sim \mathcal{N}(0, I_d)$ is a standard Gaussian random vector. Moreover, by a standard concentration bound on the norm of Gaussian random vectors (e.g. Corollary 2.3 in Barvinok, 2005, with $\epsilon = 1/2$):
$$\max\left\{\Pr\left(\|n\|_2^2 \le \tfrac{d}{2}\right),\ \Pr\left(\|n\|_2^2 \ge \tfrac{3d}{2}\right)\right\} \;\le\; \exp\left(-\tfrac{d}{16}\right).$$
Finally, for any value of $n$, we always have $\|n\|_\infty \le \|n\|_2$, since the Euclidean norm is always larger than the infinity norm. Combining these observations, and using $\mathbb{1}_A$ for the indicator function of the event $A$, we have
$$\mathbb{E}\left[\|u\|_\infty^4\right] = \mathbb{E}\left[\frac{\|n\|_\infty^4}{\|n\|_2^4}\right]
= \Pr\left(\|n\|_2^2 \le \tfrac{d}{2}\right)\mathbb{E}\left[\frac{\|n\|_\infty^4}{\|n\|_2^4}\,\Big|\, \|n\|_2^2 \le \tfrac{d}{2}\right] + \Pr\left(\|n\|_2^2 > \tfrac{d}{2}\right)\mathbb{E}\left[\frac{\|n\|_\infty^4}{\|n\|_2^4}\,\Big|\, \|n\|_2^2 > \tfrac{d}{2}\right]
\le \exp\left(-\tfrac{d}{16}\right) + \mathbb{E}\left[\frac{\|n\|_\infty^4}{(d/2)^2}\,\mathbb{1}_{\|n\|_2^2 > d/2}\right]
\le \exp\left(-\tfrac{d}{16}\right) + \frac{4}{d^2}\,\mathbb{E}\left[\|n\|_\infty^4\right]. \qquad (5)$$
Thus, it remains to upper bound $\mathbb{E}\left[\|n\|_\infty^4\right]$, where $n$ is a standard Gaussian random vector.

Letting $n = (n_1, \ldots, n_d)$, and noting that $n_1, \ldots, n_d$ are independent and identically distributed standard Gaussian random variables, we have for any scalar $z$ that
$$\Pr\left(\|n\|_\infty \le z\right) = \prod_{i=1}^{d}\Pr\left(|n_i| \le z\right) = \left(1 - \Pr\left(|n_1| > z\right)\right)^{d} \;\overset{(1)}{\ge}\; 1 - d\,\Pr\left(|n_1| > z\right) = 1 - 2d\,\Pr\left(n_1 > z\right) \;\overset{(2)}{\ge}\; 1 - d\exp\left(-z^2/2\right),$$
where (1) is Bernoulli's inequality, and (2) uses a standard tail bound for a Gaussian random variable. In particular, the above implies that $\Pr\left(\|n\|_\infty > z\right) \le d\exp\left(-z^2/2\right)$. Therefore, for an arbitrary positive scalar $r$,
$$\mathbb{E}\left[\|n\|_\infty^4\right] = \int_{z=0}^{\infty}\Pr\left(\|n\|_\infty^4 > z\right)dz \;\le\; \int_{z=0}^{r} 1\,dz + \int_{z=r}^{\infty}\Pr\left(\|n\|_\infty > z^{1/4}\right)dz \;\le\; r + \int_{z=r}^{\infty} d\exp\left(-\sqrt{z}/2\right)dz = r + d\left(4\sqrt{r} + 8\right)\exp\left(-\sqrt{r}/2\right).$$
In particular, plugging in $r = 4\log^2(2d)$ (which is larger than 1, since we assume $d > 1$), we get
$$\mathbb{E}\left[\|n\|_\infty^4\right] \;\le\; 4\left(1 + \log(2d) + \log^2(2d)\right).$$
Plugging this back into Eq. (5), we get that
$$\mathbb{E}\left[\|u\|_\infty^4\right] \;\le\; \exp\left(-\tfrac{d}{16}\right) + \frac{16}{d^2}\left(1 + \log(2d) + \log^2(2d)\right),$$
which can be shown to be at most $\frac{c}{d^2}\log^2(2d)$ for all $d > 1$, where $c < 50$ is a numerical constant. In particular, this means that $\frac{d^2}{4}\,\mathbb{E}\left[\|u\|_\infty^4\right] \le \frac{c}{4}\log^2(2d)$, as required.
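As an informal numerical sanity check of Lemma 4 (our own addition, not part of the paper; names are ours), the following sketch estimates $\mathbb{E}\|u\|_\infty^4$ by Monte Carlo and compares it with $\log^2(2d)/d^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sup_norm_fourth_moment(d, n=10_000):
    """Monte Carlo estimate of E||u||_inf^4 for u uniform on the Euclidean unit sphere in R^d."""
    g = rng.standard_normal((n, d))
    u = g / np.linalg.norm(g, axis=1, keepdims=True)
    return (np.abs(u).max(axis=1) ** 4).mean()

for d in (2, 10, 100, 1000):
    est = sup_norm_fourth_moment(d)
    ref = np.log(2 * d) ** 2 / d ** 2
    print(f"d = {d:4d}   E||u||_inf^4 ~ {est:.3e}   log^2(2d)/d^2 = {ref:.3e}")
```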

References

A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Conference on Learning Theory (COLT), 2010.

A. Barvinok. Measure concentration lecture notes. http://www.math.lsa.umich.edu/~barvinok/total710.pdf, 2005.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, 2004.

J. Duchi, M. Jordan, M. Wainwright, and A. Wibisono. Optimal rates for zero-order optimization: the power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788-2806, May 2015.

A. Flaxman, A. Kalai, and B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In ACM-SIAM Symposium on Discrete Algorithms (SODA), 2005.

S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.

E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169-192, 2007.

M. Ledoux. The concentration of measure phenomenon, volume 89. American Mathematical Society, 2005.

Y. Nesterov. Random gradient-free minimization of convex functions. Technical Report 2011/16, ECORE, 2011.

S. Shalev-Shwartz. Online learning: Theory, algorithms, and applications. PhD thesis, The Hebrew University, 2007.

S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 2012.