arxiv: v2 [cs.ds] 11 May 2016

Size: px

Start display at page:

Download "arxiv: v2 [cs.ds] 11 May 2016"

Joanna Anthony
6 years ago
Views:

1 Optimizing Star-Convex Functions Jasper C.H. Lee Paul Valiant arxiv: v2 [cs.ds] May 206 Department of Computer Science Brown University May 3, 206 Abstract We introuce a polynomial time algorithm for optimizing the class of star-convex functions, uner no Lipschitz or other smoothness assumptions whatsoever, an no restrictions except exponential bouneness on a region about the origin, an Lebesgue measurability. The algorithm s performance is polynomial in the requeste number of igits of accuracy an the imension of the search omain. This contrasts with the previous best known algorithm of Nesterov an Polyak which has exponential epenence on the number of igits of accuracy, but only n ω epenence on the imension n (where ω is the matrix multiplication exponent), an which further requires Lipschitz secon ifferentiability of the function [3]. Star-convex functions constitute a rich class of functions generalizing convex functions, incluing, for example: for any convex (or star-convex) ( ) functions f, g, with global minima f(0) = g(0) = 0, their power mean h(x) = f(x) p +g(x) p /p 2 is star-convex, for any real p, efining powers via limits as appropriate. Star-convex functions arise as loss functions in non-convex machine learning contexts, incluing, for ata points X i, parameter ( ) /p, vector θ, an any real exponent p, the loss function h θ,x (ˆθ) = i (ˆθ θ) X i p significantly generalizing the well-stuie convex case where p. Further, for any function g > 0 on the surface of the unit sphere (incluing iscontinuous, or pathological functions that ( have) ifferent x behavior at rational vs. irrational angles), the star-convex function h(x) = x 2 g x 2 extens the arbitrary behavior of g to the whole space. Despite a long history of successful graient-base optimization algorithms, star-convex optimization is a uniquely challenging regime because ) graients an/or subgraients often o not exist; an 2) even in cases when graients exist, there are star-convex functions for which graients provably provie no information about the location of the global optimum. We thus bypass the usual approach of relying on graient oracles an introuce a new ranomize cutting plane algorithm that relies only on function evaluations. Our algorithm essentially looks for structure at all scales, since, unlike with convex functions, star-convex functions o not necessarily isplay simpler behavior on smaller length scales. Thus, while our cutting plane algorithm refines a feasible region of exponentially ecreasing volume by iteratively removing cuts, unlike for the stanar convex case, the structure to efficiently iscover such cuts may not be foun within the feasible region: our novel star-convex cutting plane approach iscovers cuts by sampling the function exponentially far outsie the feasible region. We emphasize that the class of star-convex functions we consier is as unrestricte as possible: the class of Lebesgue measurable star-convex functions has theoretical appeal, introucing to the omain of polynomial-time algorithms a huge class with many interesting pathologies. We view our results as a step forwar in unerstaning the scope of optimization techniques beyon the garen of convex optimization an local graient-base methos.

2 Introuction Optimization is one of the most influential ieas in computer science, central to many rapily eveloping areas within computer science, an also one of the primary exports to other fiels, incluing operations research, economics an finance, bioinformatics, an many esign problems in engineering. Convex optimization, in particular, has prouce many general an robust algorithmic frameworks that have each become funamental tools in many ifferent areas: linear programming has become a general moeling tool, its simple structure powering many algorithms an reuctions; semiefinite programming is an area whose scope is rapily expaning, yieling many of the best known approximation algorithms for optimizing constraint satisfaction problems [2]; convex optimization generalizes both of these an has introuce powerful optimization techniques incluing interior point an cutting plane methos. Our eveloping unerstaning of optimization has also le to new algorithmic esign principles, which in turn leas to new insights into optimization. Recent progress in algorithmic graph theory has benefite enormously from the optimization perspective, as many recent results on max flow/min cut [0, 6, 8, 3, 5], bipartite matching [0], an Laplacian solvers [20, 3, 7] have mae breakthroughs that stem from eveloping a eeper unerstaning of the characteristics of convex optimization techniques in the context of graph theory. These successes motivate the quest for a eeper an broaer unerstaning of optimization techniques: ) to what egree can convex optimization techniques be extene to non-convex functions; 2) can we evelop general new tools for tackling non-convex optimization problems; an 3) can new techniques from non-convex optimization yiel new insights into convex optimization? As a partial answer to the first question, graient escent perhaps the most natural optimization approach has ha enormous success recently in a variety of practically-motivate non-convex settings, sometimes with provable guarantees. The metho an its variants are the e facto stanar for training (eep) neural networks, a hot topic in high imensional non-convex optimization, with many recent practical results (e.g. [7, 6]). Graient escent can be thought of as a greey algorithm, which repeately chooses the most attractive irection from the local lanscape. The efficacy of graient escent algorithms relies on local assumptions about the function: for example, if the first erivative is Lipschitz (slowly varying), then one can take large ownhill steps in the graient irection without worrying that the function will change to going uphill along this irection. Thus the convergence of graient escent algorithms typically epens on a Lipschitz parameter (or other smoothness measure), an conveniently oes not epen on the imension of the search space. Many of these algorithms converge to within ɛ of a local optimum in time poly(/ɛ), with aitional polynomial epenence on the Lipschitz constant or other appropriate smoothness guarantee [2, 5]. In cases where all local optima are global optima, then one has global convergence guarantees. While one intuitively expects graient escent algorithms to always converge to a local minimum, in this paper we stuy the optimization of a natural generalization of convex functions where, espite the global optimum being the only stationary point/local minimum for every function in this class, provably no variant of graient escent converges in polynomial time. The class of star-convex functions, which we efine below, inclues many functions of both practical an theoretical interest, both generalizing common families of convex functions to wier parameter regimes, an introucing new pathologies not foun in the convex case. We show, essentially, how to make a graient-base cutting plane algorithm robust to many new pathologies, incluing lack of graients or subgraients, long narrow riges an rapi oscillation in irections transverse to the global minimum, an, in fact, almost arbitrary iscontinuities. This challenging setting gives new insights into what funamentally enables cutting plane algorithms to work, which we view as progress towars answering questions 2 an 3 above.

(a) g efine on the unit circle (b) Linear extension of g to f Figure : An example star-convex function f efine by linearly extrapolating an arbitrary positive function g efine on the unit circle (in

3 (a) g efine on the unit circle (b) Linear extension of g to f Figure : An example star-convex function f efine by linearly extrapolating an arbitrary positive function g efine on the unit circle (in white).. Star-convex functions This paper focuses on the optimization of star-convex functions, a particular class of (typically) nonconvex functions that inclues convex functions as a special case. We efine these functions as follows, base on the efinition in Nesterov an Polyak [3]. Definition. (Star-convex functions). A function f : R n R is star-convex if there is a global minimum x R n such that for all α [0, ] an x R n, f(αx + ( α)x) αf(x ) + ( α)f(x) We call x the star center of f. For convenience, in the rest of the paper we shall refer to the minimum function value as f. See Appenix A for a variety of basic constructions an examples of star-convex functions. Intuitively, if we visualize the objective function as a lanscape, star-convexity means that the global optimum is visible from every point there are no riges on the way to the global optimum, but there coul be many riges in transverse irections. Since the global optimum is always visible in a ownhill irection from every point, graient escent methos woul seem to be effective. Counterintuitively, these methos all fail in general. In Section.3 we emonstrate that stanar graient escent an cutting plane methos fail in the absence of Lipschitz guarantees. One broa class of star-convex functions that gives a goo sense of the scope of this efinition is constructe by the following process: ) pick an arbitrary positive function g(θ) on the unit circle (see Figure a) which may be iscontinuous, rapily oscillating, or otherwise baly behave; an 2) linearly exten this function to a function f on the entire plane, about the value f(0) = 0 (see Figure b). The name star-convex comes from the fact that each sublevel set (the set of x for which f(x) < c, for some c) is star-shape. At any point at which a graient or subgraient exists, the star center (global optimum) lies in the halfspace opposite the (sub)graient. However, even for the simple example of Figure, graients o not exist for angles θ at which g is iscontinuous, an subgraients o not exist at most angles, incluing those angles where g is a local maxima with respect to θ. Further, even for ifferentiable star-convex functions, graients can be misleaing, since a rapily oscillating g implies that graients typically point in the transverse irection, nearly orthogonal to the irection of the star center. While it is often stanar to esign algorithms assuming one has access to an oracle that returns both function values an graients, we instea only assume access to the function value: even in the case when graients exist, it is unclear whether they are algorithmically helpful in our setting. (See Section.3 for etails.) Inee, we pose this as an open problem: uner what assumptions (short of 2

4 the Lipschitz guarantees on the secon erivative of Nesterov an Polyak [3]), can one meaningfully use a graient oracle to optimize star-convex functions? In this paper, we show that, assuming only Lebesgue measurability an an exponential boun on the function value within a large ball, our algorithm optimizes a star-convex function in polylog(/ɛ) time where ɛ is the esire accuracy in function value. Theorem.2. (Informal) Given evaluation oracle access to a Lebesgue measurable star-convex function f : R n R an an error parameter ɛ, with the guarantee that x is within raius R of the origin, Algorithm returns an estimate f 0 of the minimum function value such that f 0 f ɛ with high probability in time poly(n, log ɛ, log R). In previous work, Nesterov an Polyak [3] introuce a new aaptation of Newton s metho to fin local minima in functions that are twice-ifferentiable, with Lipschitz continuity of their secon erivatives. For the class of Lipschitz-twice-ifferentiable star-convex functions, they show that their algorithm converges to within ɛ of the global optimum in time O( ɛ ), using star convexity to lower boun the amount of progress in each of their optimization steps. Our results are stronger in two significant senses: ) as oppose to assuming Lipschitz twiceifferentiability, we make no continuity assumptions whatsoever, optimizing over an essentially arbitrary measurable function within the class of star-convex functions; 2) while the algorithm of [3] requires exponential time to estimate the optimum to k igits of accuracy, our algorithm is polynomial in the requeste number of igits of accuracy. The reaer may note that the complexity of our algorithm oes epen polynomially on the number of imensions n of the search space, whereas the Nesterov-Polyak algorithm uses a number of calls to a Hessian oracle that is inepenent of the imension n. However, our epenency on the imension n is necessary in the absence of Lipschitz guarantees, even for convex optimization []. We further point out an important istinction between star-convex optimization an star-shape optimization. A star-shape function is efine similarly to a star-convex function, except the star center is allowe to be locate away from the global minimum. Certain NP-har optimization problems, incluing max-clique can be rephrase in terms of star-shape optimization [], an the problem is in general impossible without continuity guarantees, for the global minimum may be hien along a single ray from the star center that is iscontinuous from the rest of the function. In this sense, our star-convex optimization algorithm narrowly avois solving an NP-har optimization problem..2 Applications of Star-Convex Functions We begin with the following concrete example: the function ( x p + y p ) /p is convex for p, yet is star-convex for all real p, positive or negative, taking limits as appropriate for p = 0. We generalize this example significantly below. Empirical risk minimization is a central technique in machine learning [9], where, given training ata x i with labels y i, we seek a hypothesis h so as to minimize m m L(h(x i ), y i ), i= where L is a loss function that escribes the penalty for mispreiction, on input our preiction h(x i ) an the true answer y i. We take the hypothesis h to be a linear function (this setting inclues kernel methos that preprocess x i first an then apply a linear function). If the loss function L is convex, then fining the optimal hypothesis h is a convex optimization problem, which is wiely use in practice: for exponent p, we let L(ŷ i, y i ) = ŷ i y i p an thus aim to optimize the convex function arg min h m h x i y i p. () i= 3

5 In this paper we raw attention to the interesting regime of Equation where p <. This regime is not accessible with stanar techniques. It is straightforwar to verify, however, that the (/p)th power of the objective function of Equation is star-convex, an thus the main result of our paper yiels an efficient optimization algorithm. Slightly more generally: Corollary.3. For a loss function L(ŷ i, y i ) such that there exists a real number p for which L(, ) p is a star-convex function of its first argument, the following empirical risk minimization problem can be optimize to accuracy ɛ in time poly(n, log /ɛ) by inputting its (/p)th power to the main algorithm of this paper: arg min h m m L(h(x i ), y i ), i= (We note that negative p is a somewhat peculiar regime in the above corollary, where running a minimization algorithm on the (/p)th power of the loss woul actually yiel a global maximum instea of minimum; we inclue this case here for completeness sake.) The regime where p (0, ) has the structure that, when one is far from the right answer, small changes to the hypothesis o not significantly affect performance, but the closer one gets to the true parameters, the richer the lanscape becomes. One of many interesting settings with this behavior is biological evolution, where fit creatures have a rich lanscape to traverse, while rastically unfit creatures fail to survive. A paper by one of the authors propose star-convex optimization as a regime where evolutionary algorithms might be unexpectely successful. This current work fins an affirmative answer to an open question raise there by showing the first polynomial time algorithm for optimizing this general class of functions. (The previous results by Nesterov an Polyak [3] fail to apply to this setting because the objective function has regions that behave for some parameter regimes like x, which is not Lipschitz.) Corollary.4 (Extening Theorem 4.5 of [8]). There is a single mutation algorithm uner which for any p > 0, efining the loss function L(ŷ, y) = ŷ y p, the class of constant-egree polynomials from R n R m with boune coefficients is evolvable with respect to all istributions over the raius r ball..3 Nee for Novel Algorithmic Techniques Before we outline our approach to optimizing star-convex functions, we emonstrate the nee for novel algorithmic approaches by explaining how a wie variety of stanar techniques fail on this challenging class of functions. We start by explaining how simple star-convex functions such as ( x + y ) 2 confoun graient escent, cutting plane an relate approaches, an en with a pathological example with information-theoretic guarantees about its harness to optimize, even given arbitrarily accurate first-orer (graient) oracle access. Consier the star-convex function ( x + y ) 2, which can be thought of essentially as x + y. This function has unique global minimum (an star center) at the origin. Graients of this function go to infinity as either x or y goes to 0, thus the function essentially has eep canyons along both the x an y axes. Intuitively, this function is challenging to optimize because, espite the origin lying in a ownhill irection from every point, graient escent will typically converge to one of the axes first, at which point the graient has near-infinite magnitue in the irection of the nearest axis, an this irection is thus near-orthogonal to the irection of the origin. In short, many variants of graient escent will have the following behavior: rapily jump towars one of the two axes, an then fail to make significant further progress as the search point oscillates aroun that axis. This is in part a reflection of the fact that the graient of ( x + y ) 2 is not Lipschitz, an in fact varies arbitrarily rapily as (x, y) converges towars either axis; since these axes are canyons in the search 4

6 space, typical algorithms will spen a isproportionate amount of their time stumbling aroun this baly-behave region. There are many variants of graient escent constructe to tackle optimization settings that are challenging for vanilla graient escent. One recent promising line of work concerns normalize graient escent [5], which is esigne to work well espite regions where the magnitue of the graient is unusually small or large. However, the analysis techniques rely on the strict local quasi-convexity property, which says that there is a global constant ɛ such that the ownhill halfspace inuce by any graient contains a ball of raius ɛ about the global optimum; this property oes not apply even to the simple function ( x + y ) 2, where halfspaces cut arbitrarily close to the origin, the closer one gets to either the x or y axis. As an alternative to graient escent methos, one might consier a graient-base cutting plane algorithm. Cutting plane methos work by iteratively cutting a large search space into ever-smaller regions until they converge to a tiny region guarantee to contain the global optimum. Cutting plane algorithms, even for convex optimization, are known to typically lea to ill-conitione behavior, where one imension of the search region becomes much smaller than the overall iameter of the region, leaing to ba numerical properties, ifficulty fining an manipulating graients, etc. These problems become even worse in our star-convex setting, even for the simple function ( x + y ) 2 : as the search space converges aroun one of the axes, the graients become even more strongly oriente towars that axis, yieling cuts that make the alreay thin region even thinner, instea of cutting in a more prouctive transverse irection. Inee, up to numerical precision, the compute graients may be foun to point exactly towars the nearest axis, making cuts in the transverse irection impossible, an thus halting the overall progress of the algorithm inefinitely. Having seen simple examples showing how stanar techniques cannot optimize star-convex functions, we now present a more sophisticate an pathological example which proves that, even with access to a first-orer oracle an the assumptions of infinite ifferentiability an bouneness on a region aroun the origin, no eterministic algorithm can efficiently optimize star-convex functions without Lipschitz guarantees, motivating the somewhat unusual ranomize flavor of the algorithm of this paper. In particular, we show that, for any eterministic polynomial time algorithm, there is an infinitely ifferentiable star-convex function on R 2 with a unique global minimum near the origin such that for all querie points ) the returne function value is always an 2) the returne graient epens only on the y-coorinate. The existence of such a function implies the inability of the algorithm to fin the x-coorinate of the star center. Example.5. Given a eterministic polynomial time optimization algorithm A, we construct a starconvex function f A that algorithm A fails to optimize. We assume that A will only query the function insie some isk of raius R about the origin, for some (possibly exponentially large) R. Now, we first choose a y-coorinate y 0 (that A cannot specify exactly, for example an irrational number) for the star center; we shall choose x 0 later. Then, we simulate running the algorithm A, assuming that in response to any query (x, y) that A makes to the function/graient oracle, the return value is f A (x, y) = an the graient f A (x, y) is 0 in the x irection an /(y y 0 ) in the y irection. Since A is a polynomial time algorithm, there must only be polynomially many queries before A terminates. Therefore, we may choose x 0 for the star center such that ) (x 0, y 0 ) is not exponentially close to collinear with any pair of query points, an 2) the x coorinate of the star center, x 0, is chosen to be significantly far from the x coorinates of all query points. We thus construct a star-convex function f A subject to the constraints that ) it passes through all the points simulate above, with corresponing graients, 2) it has star center at (x 0, y 0 ), an 3) it is infinitely ifferentiable, with each erivative being at most exponentially large in the egree of the erivative an the running time of A. (Since no pair of query points are close to collinear with the 5

7 star center, stanar interpolation lets us construct f A by first efining a smooth function g A on the unit circle an linearly extening g A to the entire plane, except for an exponentially small region about (x 0, y 0 ) that we make smooth instea of conically shape.) Since the function an graient oracles return zero information about the x-coorinate of the star center, algorithm A clearly has a hopeless task optimizing this function. In summary, graient information is very har to use effectively in star-convex optimization, even for smooth functions. Our algorithm, outline below, instea only queries the function value not because we object to graients, but rather because they o not seem to help..4 Our Approach Our overall approach is via the ellipsoi metho, which repeately refines an ellipsoial region containing the star center (global optimum) by iteratively computing cuts that contain the star center, while significantly reucing the overall volume of the ellipsoi. As mentione in Section.3, even for smooth functions with access to a graient oracle, the cutting planes inuce by the graients may yiel no significant progress in some irections. Fining useful cutting planes in the star-convex setting requires three novelties see the introuction to Section 4 for complete etails. First, the star-convex function may be iscontinuous, without even subgraients efine at most points, so we instea rely on a sampling process involving the blurre logarithm of the objective function. The blurre logarithm mitigates both the (potentially) exponential range of the function, an the (potentially) arbitrary iscontinuities in the omain. The blurre logarithm, furthermore, is ifferentiable, an sampling results let us estimate an use its erivatives in our algorithm. This technique is similar to that of ranomize smoothing [4]. Secon, the negative graient of the blurre logarithm might point away from the star center (global optimum) espite all graients of the unblurre function (when they exist) pointing towars the star center because of the way blurring interacts with sharp wege-shape valleys. Aressing this requires an averaging technique that repeately samples Gaussians-in-Gaussians, until it etects sufficient conitions to conclue that the graient may be use as a cut irection. Thir an finally, the usual cutting plane criterion that the volume of the feasible region ecreases exponentially is no longer sufficient in the star-convex setting, as an ellipsoi in two coorinates (x, y) might get repeately cut in the y irection without any progress restricting the range of x. This is provably not a concern in convex optimization. Our algorithm tackles this issue by locking axes of the ellipsoi smaller than some threshol τ, an seeking cuts orthogonal to these axes; this orthogonal signal may be hien by much larger erivatives in other irections, an hence requires new techniques to expose. The counterintuitive approach is that, in orer to expose structure within an exponentially thin ellipsoi imension we must search exponentially far outsie the ellipsoi in this imension..5 Stochastic Optimization Our approach extens easily to the stochastic optimization setting. Here, the objective function is the expectation over a istribution of star-convex functions which share the same star center an optimum value. For example, in the context of empirical risk minimization with a star-convex loss function L where we assume there is a true hypothesis h such that each labele example is correctly preicte by h, then for each example i, the loss L(ĥ(x i), y i ) consiere as a function of ĥ is star-convex, with global optimum 0 at ĥ = h. An the overall objective function is the expectation of this quantity, over ranom examples: E i [L(ĥ(x i), y i )], which is thus also star-convex with global optimum 0 at ĥ = h. The optimization algorithm for this new stochastic setting is actually ientical to the general starconvex optimization algorithm. Since star-convex functions can take arbitrarily unrelate values on nearby rays from the star center, our algorithm is alreay fully equippe to eal with the situation, an 6

8 aing explicitly unrelate values to the moel by stochastically sampling from a family of functions L i makes the problem no harer. The algorithm, analysis, an convergence are unchange..6 Outline In Section 2 we introuce the oracle moel by which we moel access to star-convex functions. Because we aress such a large class of potentially baly-behave functions (incluing, for example, starconvex functions which, in polar coorinates, behave very ifferently epening on whether the angle is rational or irrational), we take great care to specify the input an output requirements of our optimization algorithm appropriately, in the spirit of Lovász [9]. In Section 3 we present the overall star-convex optimization algorithm, a cutting-plane metho that at each stage tracks an ellipsoi that contains the star center (global optimum). The heart of the algorithm consists of fining appropriate cutting planes at each step, which we present in Section 4. 2 Problem Statement an Overview Our aim here is to iscuss algorithms for optimizing star-convex functions in the most general setting possible, an thus we must be careful about how the functions are specifie to the algorithms. In particular, we o not assume continuity, so functions with one behavior on the rational points an a separate behavior on irrational points are possible. Therefore, it is essential that our algorithms have access to the function values of f at inputs beyon the usual rational points expressible via stanar computer number representations. As motivation, see Example 2. which escribes a star-convex function that has value on essentially all rational points an algorithm is likely to query, espite having a rather ifferent lanscape on the irrational points, leaing to a global minimum value of 0. Example 2.. We present the following example of a class of star-convex functions f in 2-imensions parameterize by integers (i, j) such that f i,j has global optimum at (/ 2 + i, / 3 + j). The class has the property that, if i an j are chosen ranomly from an exponential range, then, except with exponentially small probability, any (probabilistic) polynomial time algorithm that accesses f only at rational points will learn nothing about the location of the global optimum. It emonstrates the nee for access beyon the rationals. We efine a star-convex function f i,j parameterize by integers (i, j) that has a unique global optimum at (/ 2 + i, / 3 + j). We evaluate f(x, y) via three cases:. If the ray from (/ 2+i, / 3+j) to (x, y) passes through no rational points, then let f(x, y) = (x, y) (/ 2 + i, / 3 + j) Otherwise, if the ray from (/ 2 + i, / 3 + j) to (x, y) passes through a rational point (p, q), then: (a) If p = i an q = j then let f(x, y) = (x, y) (/ 2 + i, / 3 + j) 2 as above. (b) Else, let f(x, y) = (x, y) (/ 2 + i, / 3 + j) 2 (p, q) (/ 2 + i, / 3 + j) 2 We note that since / 2 an / 3 are linearly inepenent over the rationals, no ray from (/ 2 + i, / 3 + j) passes through more than one rational point. Hence the above three cases give a complete efinition of f i,j. The function f is star-convex since each of the three cases above apply on the entirety of any ray from the global optimum, an these cases each efine a linear function on the ray. The erivative of 7

9 f along any of these rays is at most /( / 2), since ( / 2) is the closest that a point outsie the integer square at (i, j) can get to the global optimum (/ 2 + i, / 3 + j); thus the function is linearly boune in any finite region. Further, f evaluate at any rational point (p, q) will fall into Case 2, an thus, unless ( p, q ) = (i, j), the function f will return. Thus, no algorithm that queries f only at rational points can efficiently optimize this class of functions, because unless the algorithm exactly guesses the pair (i, j) which was rawn from an exponentially large range no information is gaine (the value of f will always be otherwise). Since irectly querying function values of general star-convex functions at rational points is so limiting, we instea introuce the notion of a weak sampling evaluation oracle for a star-convex function, aapting the efinition of a weak evaluation oracle by Lovász [9]. Definition 2.2. A weak sampling evaluation oracle for a function f : R n R takes as inputs a point x Q n, a positive efinite covariance matrix Σ, an an error parameter ɛ. The oracle first chooses a ranom point y N (x, Σ), an returns a value r such that f(y) r < ɛ. The Gaussian sampling in Definition 2.2 can be equivalently change to choosing a ranom point in a ball of esire (small) raius, since any Gaussian istribution can be approximate to arbitrary precision (in the total variation sense) as the convolution of itself with a small enough ball, an vice versa. Because inputs an outputs to the oracle must be expressible in a polynomial number of igits, we consier well-guarantee star-convex functions (in analogy with Lovász [9]), where the raius boun R an function boun B below shoul be interprete as huge numbers (with polynomial numbers of igits), such as 0 00 an respectively. Numerical accuracy issues in our analysis are analogous to those for the stanar ellipsoi algorithm an we o not iscuss them further. Definition 2.3. A weak sampling evaluation oracle for a function f is calle well-guarantee if the oracle comes with two bouns, R an B, such that ) the global minimum of f is within istance R of the origin, an 2) within istance 0nR of the origin, f(x) B. Such sampling gets aroun the obstacles of Examples.5 an 2. because even if the evaluation oracle is querie at a preictable rational point, the value it returns will represent the function evaluate at an unpreictable, (typically) irrational point nearby. At this point, our notion of oracle access may seem unnatural, in part because it allows access to the function at irrational points that are not computationally expressible. However, there are two natural an wiely use justifications for this approach, of ifferent flavors. First, the oracle represents a computational moel of a mathematical abstraction (the unerlying star-convex function), where even for mathematically pathological functions, we can actually in many cases implement simple coe to emulate oracle access in the manner escribe above. For example, it is in fact easy to implement a weak sampling evaluation oracle for the function of Example 2., where rays from the star center through rational points behave ifferently from the other rays. Since the rational points have Lebesgue measure 0, the complexities introuce by case 2 of Example 2. happen with probability 0, an thus we can write coe to simulate a weak sampling evaluation oracle by only implementing the simpler case, evaluating f(x, y) = (x, y) (/ 2 + i, / 3 + j) 2 to the requeste accuracy ±ɛ. Secon, setting oracle implementation issues asie, the moel cleanly separates accessing a potentially pathological function, from computing properties of it, in this case, optimizing it. The more pathological a function is, the more surprising it is that any efficient automate technique can extract structure from it. Thus, in some sense, the unrealistic regime of pathological functions, for which we o not even know how to write coe emulating oracle access, yiels the most surprising regime for the success of the algorithmic results of this paper. This regime is the most insightful from a theory perspective almost exactly because it is the most unnatural an counterintuitive from a practical perspective. 8

10 Having establishe weak sampling evaluation oracles as input to our algorithms, we must now take the same care to efine the form of their outputs. The output of a traitional (convex) optimization problem is an x value such that f(x) is close to the global minimum. For star-convex functions, accesse via a weak sampling evaluation oracle, one can instea ask for a small spherical region in which f(x) remains close to its global optimum given an input point x within istance τ of the star 4B center, star convexity an bouneness of f imply that f(x) must be within value τ 0nR R τ of the global optimum (see the proof of Lemma 3.2). However, such an output requirement is too much to ask of a star-convex optimization algorithm: see Example B. for an example of a star-convex function for which there is a large region in which the istributions of f(x) in all small (Gaussian) balls are inistinguishable from each other except with negligible probability, say That is, no algorithm can hope to istinguish a small ball aroun the global optimum from a small ball anywhere else in this region, except by using close to 0 00 function evaluations. Instea, since on this entire region, the function is close to the global optimum except with 0 00 probability, the most natural option is for an optimization algorithm to return any portion of this region, namely a region where the function value is within ɛ of optimal except with probability This is how we efine the output of a star-convex optimization problem: in analogy to the input convention, we ask optimization algorithms to return a Gaussian region, specifie by a mean an covariance, on which the function is near-optimal on all but a δ fraction of its points. (See Example B. for etails.) Definition 2.4. The weak star-convex optimization problem for a Lebesgue measurable star-convex function f, parameterize by δ, ɛ, F, is as follows. Given a well-guarantee weak sampling evaluation oracle for f, return with probability at least F a Gaussian G such that P r[f(x) f + ɛ : x G] δ. We now formally state our main result. Theorem 2.5. Algorithm optimizes (in the sense of Definition 2.4) any Lebesgue measurable starconvex function f in time poly(n, /δ, log ɛ, log F, log R, log B). Observe that we require only Lebesgue measurability of our objective function, an we make no further continuity or ifferentiability assumptions. The measurability is necessary to ensure that probabilities an expectations regaring the function are well-efine, an is essentially the weakest assumption possible for any probabilistic algorithm. The minimality in our assumptions contrasts that of the work by Nesterov an Polyak, which assumes Lipschitz continuity in the secon erivative [3]. It is mathematically interesting that the carinality of the set of star-convex functions which we optimize, even in the 2-imensional case, equals the huge quantity 2 R, while the carinality of the entire set of continuous functions which is NP-har to optimize, an strictly contains most stanar optimization settings is only R. 3 Our Optimization Approach In this section, we escribe our general strategy of optimizing star-convex functions by an aaptation of the ellipsoi algorithm. Our algorithm, like the stanar ellipsoi algorithm, looks for the global optimum insie an ellipsoi whose volume ecreases by a fixe ratio for each in a series of iterations, via algorithmically iscovere cuts that remove portions of the ellipsoi that are iscovere to lie in the uphill irection from the center of the ellipsoi. In Section 4 we will explore the properties of star-convex functions that will enable us to prouce such cuts. However, it is crucial that, unlike for stanar convex optimization, such cuts are not enough. Consier, for example, the possibility that for a star-convex function f(x, y), a cutting plane oracle only ever returns cuts in the x-irection, an 9

11 never reuces the size of the ellipsoi in the y irection, a situation which provably cannot occur in the stanar convex setting. Thus, when constructing a cutting plane, our algorithm efines an exponentially small threshol τ (efine in Definition 4.4), such that whenever a semi-principal axis of the ellipsoi is smaller than τ, we guarantee a cut orthogonal to this axis. In this section, we make use of the following characterization of our cutting plane algorithm, Algorithm 2, introuce an analyze in Section 4: Proposition (See Proposition 4.4). With negligible probability of failure, given an ellipsoi containing the star center, centere at the origin with all semi-principal axes longer than τ scale to, Algorithm 2 either a) returns a Gaussian region G such that P[f(x) f + ɛ : x G] δ, or b) returns a irection, restricte to the subspace spanne by those ellipsoi semi-principal axes longer than τ, such that when normalize to a unit vector ˆ, the cut {x : x ˆ 3n } contains the global minimum. Note that the above preconition on the ellipsoi input is satisfie by applying the appropriate affine transformation to the current ellipsoi before supplying it to Algorithm 2. Also note that in case a) above, the returne Gaussian region is a solution to the overall optimization problem. Given these guarantees about the behavior of the cutting plane algorithm, Algorithm 2, we now introuce our overall optimization algorithm, base on the ellipsoi metho. For the following algorithm, we efine the number of iterations m = 6(n+) [n log R + τ (n ) log 2 accoring to a stanar analysis of the multiplicative volume ecrease of the ellipsoi metho state in Lemma C.2 in Appenix C. Algorithm (Ellipsoi metho) Input: A raius R ball centere at the origin which is guarantee to contain the global minimum. Output: Either a) A Gaussian G such that P[f(x) f + ɛ : x G] δ or b) An ellipsoi E such that all function values in E are at most f + ɛ.. Let ellipsoi E be the input raius R ball. 2. For i [, m + ] 2a. If all the axes of E i are shorter than τ, then Return E i an Halt. 2b. Otherwise, execute Algorithm 2 with ellipsoi E i. 2c. If it returns a Gaussian, then Return this Gaussian an Halt. 2. Otherwise, it returns a cut irection. Apply an affine transformation such that E i becomes the unit ball centere at the origin an points in the negative x irection. Use the construction in Lemma C. (see appenix) to construct a new ellipsoi E i+ that inclues the intersection of ellipsoi E i with the cut. 2e. If any semi-principal axis of the ellipsoi E i+ is larger than 3nR then apply Lemma C.3, an if the center of the ellipsoi has istance > R from the origin then apply Lemma C.4, to yiel an ellipsoi of smaller volume, containing the entire intersection of E i+ with the ball of raius R, that now has all semi-principal axes smaller than 3nR, an has center in the ball of raius R. 3n ] The analysis of Algorithm is a straightforwar aaptation of stanar techniques for analyzing the ellipsoi algorithm, bouning the ecrease in volume of the feasible ellipsoi at each step, until either the algorithm returns an explicit Gaussian as the solution (in Step 2c), or terminates because the ellipsoi is containe in an exponentially small ball (in Step 2a). See Appenix C for full etails. 0

12 Lemma 3.. Algorithm halts within m + iterations either through Step 2a or Step 2c. The following lemma shows that, if the star center (global optimum) is foun to lie within a ball of exponentially small raius τ, then this entire ball has function value sufficiently close to the optimum that any point within the ball can be returne. Lemma 3.2. Given an ellipsoi E that a) has its center within R of the origin, b) is containe within a ball of raius τ, an c) contains the star center x, then f(x) [f, f + ɛ] for all x E. Proof. For any x E, let y be the intersection of the ray x x with the sphere of raius 0nR about the origin. The bouneness of f implies that f(y) B, an that f(x ) B. Thus star-convexity yiels that f(x) f + 2B x x 2 x y 2 f 2τ + 2B 0nR R τ < f + ɛ since τ ɛ. Since the ellipsoi returne in Step 2a will always satisfy the preconitions of Lemma 3.2, the entire ellipsoi has function value within ɛ accuracy of the global minimum. In practice, we can just return this region. However, if we o wish to conform to our problem statement (Definition 2.4), we may simply return a Gaussian ball of sufficiently small raius an centere at the ellipsoi center, such that there is only negligible probability of sampling a point more than τ from its center. In summary, Lemmas 3. an 3.2 imply that Algorithm optimizes a star-convex function in time poly(n, /δ, log ɛ, log F, log R, log B), proving Theorem Computing Cuts for Star-Convex Functions The general goal of this section is to explain how to compute a single cut, in the context of our aapte ellipsoi algorithm explaine in Section 3. Namely, given (weak sampling, in the sense of Definition 2.2) access to a star-convex function f, an a bouning ellipsoi in which we know the global optimum lies, we present an algorithm (Algorithm 2) that will either a) return a cut passing close to the center of the ellipsoi an containing the global optimum, or b) irectly return the answer to the overall optimization problem. There are several obstacles to this kin of algorithm that we must overcome. Star-convex functions may be very iscontinuous, without any graients or even subgraients efine at most points. Furthermore, arbitrarily small changes in the input value x may prouce exponentially large changes in the output of f, for any x that is not alreay exponentially close to the global optimum. This means that stanar techniques to even approximate the function s shape o not remotely suffice in the context of star-convex functions. Finally, consiering again the example of linearly extening an arbitrary function of the unit sphere surface (see Figure or item 3 in Appenix A), unlike regular convex functions, star-convex functions o not become locally flat along thin imensions of increasingly thin ellipsois. Thus, the ellipsoi algorithm in general might repeately cut certain imensions while neglecting others. The algorithm in this section (Algorithm 2) employs three key strategies that we intuitively explain now.. We o not work irectly with the star-convex function f, but instea work with a blurre version of its logarithm (efine in Definition 4.), which has a) well-efine erivatives, that b) we can estimate efficiently. Given a baly-behave (measurable) function f, an the pf of a Gaussian, g, blurring f by the Gaussian prouces the convolution f g, which has well-behave erivatives since erivatives commute with convolution: namely, letting D stan for a erivative or secon erivative, we have D(f g) = f (Dg). Since the erivatives of a Gaussian pf g are each boune, we can estimate erivatives of blurre versions of iscontinuous functions by sampling. Sampling bouns imply that if we have a ranom variable boune to a range of size r, then

13 we can estimate its mean to within accuracy ±ɛ by taking the average of O(( r ɛ )2 ) samples. The logarithm function (after appropriate translation so that its inputs are positive), maps a starconvex function f to a much smaller range, which enables accurate sampling (in polynomial time). Therefore, we can efficiently estimate erivatives of the blurre logarithm of f. 2. While intuitively, the negative graient of (the blurre logarithm of) f points towars the global minimum, this signal might be overpowere by a confouning term in a ifferent irection (see Lemma 4.5), making the negative graient point away from the global optimum in some cases. To combat this, our algorithm repeately estimates the graient at points sample from a istribution aroun the ellipsoi center, an for each graient, estimates the confouning terms, returning the corresponing graient only once the confouning terms are foun to be small. Lemma 4.8 shows that the confouning terms are small in expectation, so Markov s inequality shows that our strategy will rapily succee. 3. The ellipsoi algorithm in general might repeately cut certain imensions while neglecting others, an our algorithm must actively combat this. If some axes of the ellipsoi ever become exponentially small, then we lock them, meaning that we eman a cut orthogonal to those imensions, thus maintaining a global lower boun on the length of any axis of the ellipsoi. Combine with the stanar ellipsoi algorithm guarantee that the volume of the ellipsoi ecreases exponentially, this implies that the iameter of the ellipsoi can be mae exponentially small in polynomial time, letting us conclue our optimization. Fining cuts uner this new locking guarantee, however, requires a new algorithmic technique. We must take avantage of the fact that the ellipsoi is exponentially small along a certain irection b to somehow gain the new ability to prouce a cut orthogonal to b. Convex functions become increasingly well-behave on increasingly narrow regions, however, star-convex functions crucially o not (see item 3 in Section A). Thus, as the thickness of the ellipsoi in irection b gets smaller, the structural properties of the function insie our ellipsoi o not give us any new powers. Strangely, we can take avantage of one new parameter regime that at first appears useless: relative to the thinness of the ellipsoi, we have exponentially much space in irection b outsie of the ellipsoi. To implement the intuition of Step above, we efine L z (x), a truncate an translate logarithm of the star-convex function f, which maps the potentially exponentially large range of f to the (polynomial-size) range [log ɛ, log 2B], where ɛ is efine below (Definition 4.4), an is slightly smaller than our function accuracy boun ɛ. In the below efinition, z intuitively represents our estimate of the global optimum function value, an will recor, essentially, the smallest function evaluation seen so far (see Algorithm 2). Definition 4.. Given an objective function f with boun f(x) B when x R an an offset value z f, we efine the truncate logarithmic version of f to be log ɛ f(x) z ɛ L z (x) = log 2B f(x) z 2B log(f(x) z) otherwise While mapping to a small range, L z (x) nonetheless gives us a precise view of small changes in the function as we converge to the optimum. The next result shows that, if we blur L z (x) by rawing x from a Gaussian istribution, then not only can we efficiently estimate the expecte value of the blurre logarithm of f, we can also estimate the erivatives of this expectation with respect to changing either the mean or the variance of the Gaussian. 2

14 For an arbitrary boune (measurable) function h, the erivative of its expecte value over a Gaussian of with σ with respect to either a) moving the center of the Gaussian or b) changing its with σ, is boune by O( σ ). Thus we normalize the estimates below in terms of the prouct of the Gaussian with an the erivative, instea of estimating the erivative alone. Proposition 4.2. Let N (µ, Σ) be a Gaussian with iagonal covariance matrix Σ consisting of elements σ 2, σ2 2,..., σ2 n. For an error boun κ > 0 an a probability of error δ > 0, we can estimate each of the following functions to within error κ with probability at least δ using poly(n, κ, log δ, log 2B/ɛ ) samples: ) the expectation E[L z (x) : x N (µ, Σ)]; 2) the (scale) erivative σ µ E[L z (x) : x N (µ, Σ)]; an 3) the erivative with respect to scaling σ σ E[L z (x) : x N (µ, Σ)]. A fortiori, these erivatives exist. Proof. Chernoff bouns yiel the first claim, since L z [log ɛ log 2B/ɛ, log 2B], so O(( κ ) 2 log δ ) samples suffice. For the secon claim, we note that the expectation can be rewritten as the integral over R n of L z times the probability ensity function of the normal istribution: E[L z (x) : x N (µ, Σ)] = (2π) n/2 Σ /2 R L n z (x) e (x µ)t Σ (x µ)/2 x. Thus the erivative of this expression with respect to the first coorinate µ can be expresse by taking the erivative of the probability ensity function insie the integral, which ens up scaling it by the vector (x µ )/σ 2. Thus σ µ E[L z (x) : x N (µ, Σ)] = E[ x µ σ L z (x) : x N (µ, Σ)]. The multiplier x µ σ is effectively boune, because for real numbers c, the probability P[ > c : x N (µ, Σ)] vanishes faster than exp( c n ). x µ σ Thus we can pick c = poly(n, log κ, log 2B/ɛ ) such that replacing the expression in the expectation, x µ σ L z (x), by this expression clampe between ±c will change the expectation by less than κ 2. We can thus estimate the expectation of this clampe quantity to within error κ 2 with probability at least δ via some sample of size poly(n, κ, log δ, log 2B/ɛ ), as esire. The analysis for the thir claim (the erivative with respect to σ ) is analogous: the erivative of the Gaussian probability ensity function with respect to σ (as in the case of a univariate Gaussian of variance σ 2) scales the probability ensity function by x µ 2 /σ 3, an after scaling by σ, this expression measures the square of the number of stanar eviations from the mean, an as above, the prouct x µ 2 σ 2 the expectation by more than κ 2 L z (x) can be clampe to some polynomially-boune interval [ c, c] without changing. The conclusion is analogous to the previous case. As mentione in the previous section, our guiing aim for the ellipsoi metho is to prevent any axis of the ellipsoi from getting too small. Therefore, in orer to treat axes ifferently epening on their length, we shall ientify our basis as the unit vectors along the axes of the current ellipsoi an istinguish between axes that are smaller than τ versus at least τ, where τ is an exponentially small threshol for thinness, efine below in Definition 4.4. Definition 4.3. Given an ellipsoi, consier an orthonormal basis parallel to its axes. Each semiprincipal axis of the ellipsoi whose length is less than τ, we call a thin imension, an the rest are non-thin imensions. Given a vector µ, we ecompose it into µ = µ + µ where µ is non-zero only in the non-thin imensions, an µ is non-zero only in the thin imensions. Similarly, given the ientity matrix I, we ecompose it into I = I + I. We apply a scaling to the non-thin imensions so as to scale the non-thin semi-principal axes of the ellipsoi to unit vectors (making the ellipsoi a unit ball in the non-thin imensions). We keep the thin imensions as they are. We present again Proposition 4.4, escribing the guarantees on our cutting plane algorithm require by Algorithm, an then state the cutting plane algorithm, Algorithm 2. Below, we make use of 3

Lecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012

Lecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012 CS-6 Theory Gems November 8, 0 Lecture Lecturer: Alesaner Mąry Scribes: Alhussein Fawzi, Dorina Thanou Introuction Toay, we will briefly iscuss an important technique in probability theory measure concentration