Algorithms and matching lower bounds for approximately-convex optimization

Algorithms and matching lower bounds for approximately-convex optimization

Yuanzhi Li, Department of Computer Science, Princeton University, Princeton, NJ
Andrej Risteski, Department of Computer Science, Princeton University, Princeton, NJ

Abstract

In recent years, a rapidly increasing number of applications in practice requires optimizing non-convex objectives, like training neural networks, learning graphical models, and maximum likelihood estimation. Though simple heuristics such as gradient descent with very few modifications tend to work well, theoretical understanding is very weak. We consider possibly the most natural class of non-convex functions where one could hope to obtain provable guarantees: functions that are approximately convex, i.e. functions f̃ : R^d → R for which there exists a convex function f such that for all x, |f̃(x) − f(x)| ≤ Δ for a fixed value Δ. We then want to minimize f̃, i.e. output a point x̃ such that f̃(x̃) ≤ min_x f̃(x) + ε. It is quite natural to conjecture that for fixed ε the problem gets harder as Δ grows; however, the exact dependency between ε and Δ is not known. In this paper, we significantly improve the known lower bound on Δ as a function of ε, and give an algorithm matching this lower bound for a natural class of convex bodies. More precisely, we identify a function T : R⁺ → R⁺ such that when Δ = O(T(ε)), we can give an algorithm that outputs a point x̃ with f̃(x̃) ≤ min_x f̃(x) + ε within time poly(d, 1/ε). On the other hand, when Δ = Ω̃(T(ε)), we also prove an information-theoretic lower bound: any algorithm that outputs such an x̃ must use a super-polynomial number of evaluations of f̃.

1 Introduction

Optimization of convex functions over a convex domain is a well-studied problem in machine learning, where a variety of algorithms exist to solve the problem efficiently. However, in recent years practitioners face ever more often non-convex objectives, e.g. training neural networks, learning graphical models, clustering data, maximum likelihood estimation, etc. Albeit simple heuristics such as gradient descent with few modifications usually work very well, theoretical understanding in these settings is still largely open.

The most natural class of non-convex functions where one could hope to obtain provable guarantees is functions that are approximately convex: functions f̃ : R^d → R for which there exists a convex function f such that for all x, |f̃(x) − f(x)| ≤ Δ for a fixed value Δ. In this paper, we focus on zero-order optimization of f̃: an algorithm that outputs a point x̃ such that f̃(x̃) ≤ min_x f̃(x) + ε, where the algorithm, in the course of its execution, is allowed to pick points x ∈ R^d and query the value of f̃(x).

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
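To make the definition concrete, here is a simple illustrative instance (our own example, not one used in the paper): perturb the convex, 1-Lipschitz function f(x) = ‖x‖₂ on the unit ball by a bounded oscillation.

\[
f(x) = \|x\|_2, \qquad
\tilde f(x) = \|x\|_2 + \Delta\,\cos\!\Big(\tfrac{2\|x\|_2}{\Delta}\Big),
\qquad
|\tilde f(x) - f(x)| \le \Delta \ \text{ for all } x .
\]

Along every ray from the origin, f̃ has spurious local minima spaced Θ(Δ) apart, and a local method that stops in one of them at radius t is still roughly t above the global minimum, even though the perturbation is only Δ. Even this simple example defeats naive local search, which is why the question below is phrased in terms of the number of value queries that any algorithm must make.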

Trivially, one can solve the problem by constructing an ε-net and searching through all the net points. However, such an algorithm requires Ω((1/ε)^d) evaluations of f̃, which is highly inefficient in high dimension. In this paper, we are interested in efficient algorithms: algorithms that run in time poly(d, 1/ε) (in particular, this implies the algorithm makes poly(d, 1/ε) evaluations of f̃).

One extreme case of the problem is Δ = 0, which is just standard convex optimization, where algorithms exist to solve it in polynomial time for every ε > 0. However, even when Δ is any quantity > 0, none of these algorithms extend without modification. (Indeed, we are not imposing any structure on f̃ − f, like stochasticity.) Of course, when Δ = +∞, the problem includes arbitrary non-convex optimization, where we cannot hope for an efficient solution for any finite ε. Therefore, the crucial quantity to study is the optimal tradeoff between Δ and ε: for which (Δ, ε) the problem can be solved in polynomial time, and for which it cannot.

In this paper, we study the rate of Δ as a function of ε: we identify a function T : R⁺ → R⁺ such that when Δ = O(T(ε)), we can give an algorithm that outputs a point x̃ such that f̃(x̃) ≤ min_x f̃(x) + ε within time poly(d, 1/ε) over a natural class of well-conditioned convex bodies. On the other hand, when Δ = Ω̃(T(ε)), we also prove an information-theoretic lower bound: any algorithm that outputs such an x̃ must use a super-polynomial number of evaluations of f̃. Our result can be summarized in the following two theorems.

Theorem 1 (Algorithmic upper bound, informal). There exists an algorithm A such that for any function f̃ over a well-conditioned convex set in R^d of diameter 1 which is Δ-close to a 1-Lipschitz convex function f, and Δ = O(max{ε²/√d, ε/d}), A finds a point x̃ such that f̃(x̃) ≤ min_x f̃(x) + ε within time poly(d, 1/ε).

The notion of well-conditioning will be formally defined in Section 3, but intuitively it captures the requirement that the convex body curves in all directions to a good extent.

Theorem 2 (Information-theoretic lower bound, informal). For every algorithm A and every ε, Δ with Δ = Ω̃(max{ε²/√d, ε/d}) (Footnote 1), there exists a function f̃ on a convex set in R^d of diameter 1, where f̃ is Δ-close to a 1-Lipschitz convex function f, such that A cannot find a point x̃ with f̃(x̃) ≤ min_x f̃(x) + ε in poly(d, 1/ε) evaluations of f̃ (Footnote 2).

2 Prior work

To the best of our knowledge, there are three prior works on the problem of approximately-convex optimization, which we summarize briefly below.

On the algorithmic side, the classical paper by [DKS14] considered optimizing smooth convex functions over convex bodies with smooth boundaries. More precisely, they assume a bound on both the gradient and the Hessian of f. Furthermore, they assume that for every small ball centered at a point in the body, a large proportion of the volume of the ball lies in the body. Their algorithm is local search: they show that for a sufficiently small r, in a ball of radius r there is with high probability a point which has a smaller value than the current one, as long as the current value is sufficiently larger than the optimum. For constant-smooth functions only, their algorithm applies when Δ = O(ε/√d).

Also on the algorithmic side, the work by [BLNR15] considers 1-Lipschitz functions, but their algorithm only applies to the case Δ = O(ε/d) (so it is not optimal unless ε = O(1/√d)). Their methods rely on sampling log-concave distributions via hit-and-run walks. The crucial idea is to show that for approximately convex functions, one needs to sample from approximately log-concave distributions, which they show can be done by a form of rejection sampling together with classical methods for sampling log-concave distributions.

Footnote 1: The Ω̃ notation hides polylog(d/ε) factors.
Footnote 2: The assumptions on the diameter of K and the Lipschitz condition are for convenience of stating the results. (See Section 6 for the extension to arbitrary diameter and Lipschitz constant.)

Finally, [SV15] consider information-theoretic lower bounds. They show that when Δ = 1/√d, no algorithm can, in polynomial time, achieve ε = 1/2 − δ when optimizing a convex function over the hypercube. This translates to a super-polynomial information-theoretic lower bound when Δ = Ω̃(ε/√d). They additionally give lower bounds when the approximately convex function is multiplicatively, rather than additively, close to a convex function (Footnote 3).

We also note that a related problem is zero-order optimization, where the goal is to minimize a function to which we only have value-oracle access. The algorithmic motivation here comes from various applications where we only have black-box access to the function we are optimizing, and there is a classical line of work on characterizing the oracle complexity of convex optimization [NY83, NS, DJWW15]. In all of these settings, however, the oracles are either noiseless or the noise is stochastic, usually because the target application is bandit optimization [AD10, AFH+11, Sha12].

3 Overview of results

Formally, we will consider the following scenario.

Definition 3.1. A function f̃ : K → R will be called Δ-approximately convex if there exists a 1-Lipschitz convex function f : K → R such that for all x ∈ K, |f̃(x) − f(x)| ≤ Δ.

For ease of exposition, we also assume that K has diameter 1 (Footnote 4). We consider the problem of optimizing f̃; more precisely, we are interested in finding a point x̃ ∈ K such that

f̃(x̃) ≤ min_{x ∈ K} f̃(x) + ε.

We give the following results.

Theorem 3.1 (Information-theoretic lower bound). For every constant c ≥ 1, there exists a constant c₀ such that for every algorithm A and every d ≥ c₀, there exist a convex set K ⊆ R^d with diameter 1, a Δ-approximately convex function f̃ : K → R, and an ε ∈ [0, 1/64) (Footnote 5) with

Δ ≥ (3c log d) · max{ ε²/√d, ε/d },

such that A fails to output, with probability at least 1/2, a point x̃ ∈ K with f̃(x̃) ≤ min_{x ∈ K} f̃(x) + ε in o((d/ε)^c) time.

In order to state the upper bounds, we will need the definition of a well-conditioned body.

Definition 3.2 (µ-well-conditioned). A convex body K is said to be µ-well-conditioned, for µ ≥ 1, if there exists a function F : R^d → R such that K = {x : F(x) ≤ 0} and for every x ∈ ∂K, ‖∇²F(x)‖₂ / ‖∇F(x)‖₂ ≤ µ.

This notion of well-conditioning of a convex body has, to the best of our knowledge, not been defined before, but it intuitively captures the requirement that the convex body should curve in all directions to a certain extent. In particular, the unit ball has µ = 1.

Theorem 3.2 (Algorithmic upper bound). Let d be a positive integer, δ > 0 a positive real number, and ε, Δ two positive real numbers such that

Δ ≤ (1/6348) · max{ ε²/(µ√d), ε/d }.

Then there exists an algorithm A such that, given any Δ-approximately convex function f̃ over a µ-well-conditioned convex set K ⊆ R^d of diameter 1, A returns, with probability 1 − δ and in time poly(d, 1/ε, log(1/δ)), a point x̃ ∈ K such that

f̃(x̃) ≤ min_{x ∈ K} f̃(x) + ε.

Footnote 3: Though these are not too difficult to derive from the additive ones, considering the convex body has diameter bounded by 1.
Footnote 4: Generalizing to arbitrary Lipschitz constants and diameters is discussed in Section 6.
Footnote 5: Since we normalize f to be 1-Lipschitz and K to have diameter 1, the problem is only interesting for ε ≤ 1.
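As a quick sanity check of Definition 3.2, reading it as a bound on the ratio of the Hessian norm to the gradient norm of the defining function F on the boundary of K, the claim that the unit ball is 1-well-conditioned can be verified directly (this short calculation is ours):

\[
F(x) = \|x\|_2^2 - 1,\qquad K = \{x : F(x) \le 0\} = B_1(0),\qquad
\nabla F(x) = 2x,\qquad \nabla^2 F(x) = 2I .
\]
\[
\text{For } x \in \partial K \ (\|x\|_2 = 1):\qquad
\frac{\|\nabla^2 F(x)\|_2}{\|\nabla F(x)\|_2} \;=\; \frac{2}{2\|x\|_2} \;=\; 1 .
\]

So the unit ball is 1-well-conditioned, matching the remark above. Intuitively, bodies with nearly flat boundary pieces (like the simplex) make this ratio hard to control, which is consistent with the separate, condition-free statement in Theorem 3.3 below.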

For the reader wishing to digest a condition-free version of the above result, the following weaker result is also true (and much easier to prove).

Theorem 3.3 (Algorithmic upper bound, condition-free). Let d be a positive integer, δ > 0 a positive real number, and ε, Δ two positive real numbers such that

Δ ≤ (1/6348) · max{ ε²/√d, ε/d }.

Then there exists an algorithm A such that, given any Δ-approximately convex function f̃ over a convex set K ⊆ R^d of diameter 1, A returns, with probability 1 − δ and in time poly(d, 1/ε, log(1/δ)), a point x̃ ∈ K such that

f̃(x̃) ≤ min_{x ∈ S(K, ε)} f̃(x) + ε,    where S(K, ε) = {x ∈ K : B_ε(x) ⊆ K}.

The result merely states that we can output a value that competes with points well inside the convex body, around which a ball of radius ε still lies inside the body.

The assumptions on the diameter of K and the Lipschitz condition are for convenience of stating the results. It is quite easy to extend both the lower and upper bounds to arbitrary diameter and Lipschitz constant, as we discuss in Section 6.

3.1 Proof techniques

We briefly outline the proof techniques we use. We proceed with the information-theoretic lower bound first. The idea behind the proof is the following. We will construct a function G(x) and a family of convex functions {f_w(x)} depending on a direction w ∈ S (S is the unit sphere in R^d). On one hand, the minimal values of G and f_w are quite different: min_x G(x) ≥ 0, while min_x f_w(x) ≤ −ε. On the other hand, the approximately convex function f̃_w(x) corresponding to f_w(x) will be such that f̃_w(x) = G(x) except in a very small cone around w. Picking w at random, no algorithm making a small number of queries will, with high probability, ever query a point in this cone. Therefore, the algorithm will proceed as if the function were G(x), and fail to optimize f̃_w.

Proceeding to the algorithmic result: since [BLNR15] already shows the existence of an efficient algorithm when Δ = O(ε/d), we only need to give an algorithm that solves the problem when Δ = Ω(ε/d) and Δ = O(ε²/√d) (i.e. when ε and Δ are large). There are two main ideas behind the algorithm. First, we show that the gradient of a smoothed version of f̃ (in the spirit of [FKM05]) at any point x is correlated with x* − x, where x* = argmin_{x ∈ K} f̃(x). This strategy, however, requires averaging the value of f̃ over a ball of radius Ω(ε), which in many cases will not be contained in K (especially when ε is large). Therefore, we come up with a way to extend f̃ outside of K in a manner that maintains the correlation with x* − x.

4 Information-theoretic lower bound

In this section, we present the proof of Theorem 3.1. The idea is to construct a function G(x) and a family of convex functions {f_w(x)} depending on a direction w ∈ S, such that min_x G(x) ≥ 0 and min_x f_w(x) ≤ −ε, together with an approximately convex f̃_w(x) for f_w(x) such that f̃_w(x) = G(x) except in a very small critical region depending on w. Picking w at random, we want to argue that the algorithm will, with high probability, not query the critical region. The convex body K used in the lower bound will be arguably the simplest convex body imaginable: a ball centered at the origin.

We might hope to prove a lower bound even for a linear function f_w, for a start, similarly to [SV15]. A reasonable candidate construction is the following: we set f_w(x) = ⟨w, x⟩ for a randomly chosen unit vector w, and define f̃(x) = 0 when |⟨x, w⟩| ≤ (√(log d)/√d)·‖x‖ and f̃(x) = f_w(x) otherwise (Footnote 6).

Footnote 6: For the proof sketch only, to maintain ease of reading, all of the inequalities we state will be correct only up to constants. In the actual proofs we will be completely formal.

Observe that this translates to Δ = ε·√(log d)/√d. It is a standard concentration-of-measure fact that for most points x in the unit ball, |⟨x, w⟩| ≤ (√(log d)/√d)·‖x‖. This implies that any algorithm that makes a polynomial number of queries to f̃ will, with high probability, see 0 in all of its queries, but clearly min_x f_w(x) = −ε. However, this idea fails to generalize to the optimal range, since Δ = Θ̃(ε/√d) is tight for linear, and even smooth, functions (Footnote 7).

In order to obtain the optimal bound, we need to modify the construction to a non-linear, non-smooth function. We will, in a certain sense, hide a random linear function inside a non-linear function. For a random unit vector w, we consider two regions inside the unit ball: a core C = B_r(0), for an appropriately chosen radius r, and a critical angle A = {x : ⟨x, w⟩ ≥ (√(log d)/√d)·‖x‖}. The convex function f_w will look like ‖x‖^{1+α}, for some α > 0, outside C ∪ A, and like a rescaled inner product ⟨w, x⟩ for x ∈ C ∪ A. We construct f̃_w as f̃_w = f_w wherever f_w(x) is sufficiently large (e.g. f_w(x) > Δ), and f̃_w = Δ otherwise. Such an f_w attains its minimum, of value −ε, at a point in the direction of w. However, since f_w = ‖x‖^{1+α} outside C ∪ A, the algorithm needs to query either A, or C ∖ A, in order to detect w. The former happens with exponentially small probability in high dimension; and for any x ∈ C ∖ A, f_w(x) is proportional to ⟨w, x⟩, whose magnitude is at most (√(log d)/√d)·‖x‖ ≤ (√(log d)/√d)·r, which for the chosen r is small enough to force f̃_w(x) = Δ. Therefore, the algorithm will fail with high probability.

Now we move on to the details of the constructions. We will consider K = B_{1/2}(0): the ball of radius 1/2 in R^d centered at 0 (Footnote 8).

4.1 The family {f_w(x)}

Before delving into the construction, we need the following definition.

Definition 4.1 (Lower Convex Envelope (LCE)). Given a set S ⊆ R^d and a function F : S → R, define the lower convex envelope F_LCE = LCE(F) as the function F_LCE : R^d → R such that for every x ∈ R^d,

F_LCE(x) = max_{y ∈ S} { ⟨x − y, ∇F(y)⟩ + F(y) }.

Proposition 4.1. LCE(F) is convex.

Proof. LCE(F) is a pointwise maximum of linear functions, so the claim follows.

Remark: The LCE of a function F is defined over the entire R^d, while the input function F is only defined on a set S (not necessarily a convex set). When the input function F is convex, LCE(F) can be considered an extension of F to the entire R^d.

To define the family f_w(x), we will need four parameters: a power factor α > 0, a shrinking factor β, a radius factor γ > 0, and a vector w ∈ R^d with ‖w‖ = 1, which we specify in a short bit.

Construction 4.1. Given w, α, β, γ, define the core C = B_γ(0), the critical angle A = {x : ⟨x, w⟩ ≥ β‖x‖}, and let H = K ∖ (C ∪ A). Let h : H → R be defined as h(x) = ‖x‖^{1+α}, and define l_w(x) = −8ε⟨x, w⟩. Finally, let f_w : K → R be

f_w(x) = max{ h_LCE(x), l_w(x) },

where h_LCE = LCE(h) as in Definition 4.1.

We then construct the hard function f̃_w as follows.

Construction 4.2. Consider the function f̃_w : K → R:

f̃_w(x) = f_w(x) if x ∈ K ∖ (C ∩ A);    f̃_w(x) = max{ f_w(x), Δ } otherwise.

Footnote 7: This follows from the results in [DKS14].
Footnote 8: We pick B_{1/2}(0) instead of the unit ball in order to ensure the diameter is 1.
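To make Definition 4.1 more concrete, the following short numerical sketch (ours, purely illustrative) evaluates the lower convex envelope at a query point using a finite sample of anchor points y ∈ S and numerically estimated gradients. It only illustrates that F_LCE is a pointwise maximum of tangent-plane lower bounds; with finitely many anchors it computes a lower approximation of the true envelope.

import numpy as np

def numerical_grad(F, y, step=1e-6):
    # Central-difference estimate of the gradient of F at y.
    d = y.shape[0]
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = step
        g[i] = (F(y + e) - F(y - e)) / (2.0 * step)
    return g

def lce_value(F, anchors, x):
    # Evaluate max over y in anchors of <x - y, grad F(y)> + F(y), i.e. the lower
    # convex envelope of F restricted to the finite anchor set (Definition 4.1).
    return max(np.dot(x - y, numerical_grad(F, y)) + F(y) for y in anchors)

# Tiny usage example: h(y) = ||y||^(1+alpha) sampled on an annulus, as in Construction 4.1.
rng = np.random.default_rng(0)
d, alpha = 5, 0.5
h_fn = lambda y: np.linalg.norm(y) ** (1 + alpha)
anchors = [v / np.linalg.norm(v) * rng.uniform(0.25, 0.5)
           for v in rng.standard_normal((200, d))]
x_query = 0.05 * rng.standard_normal(d)
print(lce_value(h_fn, anchors, x_query))   # a convex lower bound on h at x_query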

Consider the following settings of the parameters β, γ, and α, depending on the magnitude of ε: in the first two regimes (ε at most roughly polylog(d)/√d, and an intermediate range, respectively) we take α = 1/log(1/γ), while in the third regime (ε above roughly 1/(64·polylog d)) we take α = 1; in every case β and γ are chosen as suitable functions of ε, Δ, c, and log d.

With these in place, we formalize the proof intuition from the previous section with the following claims. Following the proof outline, we first show that the minimum of f̃_w is small; in particular, we show that its value at the boundary point in the direction of w is −ε.

Lemma 4.1. f̃_w(w/2) = −ε.

Next, we show that f̃_w is indeed Δ-approximately convex, by showing that |f̃_w(x) − f_w(x)| ≤ Δ for all x ∈ K, and that f_w is 1-Lipschitz and convex.

Proposition 4.2. f̃_w is Δ-approximately convex.

Next, we construct G(x), which does not depend on w; we want to show that an algorithm making a small number of queries to f̃_w cannot distinguish f̃_w from this function.

Construction 4.3. Let G : K → R be defined as:

G(x) = max{ (1/4)‖x‖^{1+α} − (1/4)γ^{1+α}, Δ } if x ∈ K ∖ C;    G(x) = Δ otherwise.

The following is true.

Lemma 4.2. G(x) ≥ 0, and {x ∈ K : G(x) ≠ f̃_w(x)} ⊆ A.

We show how Theorem 3.1 is implied by these statements.

Proof of Theorem 3.1. With everything prior to this set up, the final claim is somewhat standard. We want to show that no algorithm can, with probability ≥ 1/2, output a point x̃ such that f̃_w(x̃) ≤ min_x f̃_w(x) + ε. Since we know that f̃_w(x) agrees with G(x) everywhere except in K ∩ A, and G(x) satisfies min_x G(x) ≥ min_x f̃_w(x) + ε, we only need to show that, with high probability, any polynomial-time algorithm will not query any point in K ∩ A. Consider a (potentially) randomized algorithm A making random choices R₁, R₂, ..., R_m. Conditioned on a particular realization r₁, r₂, ..., r_m of the randomness, for a random choice of w each r_i lies in A with probability at most exp(−2c log(d/ε)), by a standard Gaussian tail bound. Union bounding, since m = o((d/ε)^c) for an algorithm that runs in time o((d/ε)^c), the probability that at least one of the queries of A lies in A is at most 1/2. Since the claim is true for any fixed choice r₁, ..., r_m of the randomness, by averaging, it also holds when r₁, ..., r_m are sampled according to the randomness of the algorithm.

The proofs of all of the lemmas above have been omitted due to space constraints, and are included in full in the appendix.

5 Algorithmic upper bound

As mentioned before, the algorithm in [BLNR15] covers the case Δ = O(ε/d), so we only need to give an algorithm for Δ = Ω(ε/d) and Δ = O(ε²/√d). Our approach will not make use of simulated annealing, but of a more robust version of gradient descent. The intuition comes from [FKM05], who use estimates of the gradient of a convex function derived from Stokes' formula:

E_{w∼S}[ f(x + rw) w ] = (r/d) · ∇ E_{v∼B}[ f(x + rv) ],

where w ∼ S denotes that w is a uniform sample from the unit sphere S, and v ∼ B a uniform sample from the unit ball B. Our observation is that this gradient estimate is robust to noise if we instead use f̃ on the left-hand side. Crucially, "robust" is not in the sense that it approximates the gradient of f; rather, it preserves the crucial property of the gradient of f that we need:

⟨∇f(x), x − x*⟩ ≥ f(x) − f(x*).

In words, this means that if we move x in the direction −∇f(x) for a small step, then x gets closer to x*; we will show that this property is preserved by f̃ when Δ = O(ε²/√d). Indeed, we have that

⟨E_{w∼S}[(d/r) f̃(x + rw) w], x − x*⟩ ≥ ⟨E_{w∼S}[(d/r) f(x + rw) w], x − x*⟩ − (Δd/r)·E_{w∼S}[|⟨w, x − x*⟩|].

The usual [FKM05] calculation shows that ⟨E_{w∼S}[(d/r) f(x + rw) w], x − x*⟩ = Ω(f(x) − f(x*) − r), while (Δd/r)·E_{w∼S}[|⟨w, x − x*⟩|] is bounded by O(Δ√d/r), since E_{w∼S}[|⟨w, x − x*⟩|] = O(1/√d). Therefore, we want f(x) − f(x*) ≳ Δ√d/r + r whenever f(x) − f(x*) ≥ ε. Choosing the parameters optimally leads to r = Θ(ε) and the requirement Δ ≲ ε²/√d.

This intuitive calculation basically proves the simple upper bound guarantee (Theorem 3.3). On the other hand, the argument requires sampling from a ball of radius Ω(ε) around the point x. This is problematic when ε is large: many convex bodies (e.g. the simplex, or the ℓ₁ ball, after rescaling to diameter one) do not contain a ball of radius ε. The idea is then to make the sampling possible by extending f̃ outside of K. Namely, we define a new function g : R^d → R (where Π_K(x) is the projection of x onto K) as

g(x) = f̃(Π_K(x)) + d(x, K).

In general, g(x) will not be convex, but we instead directly bound ⟨E_w[(d/r) g(x + rw) w], x − x*⟩ for x ∈ K and show that it behaves like ⟨∇f(x), x − x*⟩ ≥ f(x) − f(x*).

Algorithm 1: Noisy Convex Optimization
1: Input: a convex set K ⊆ R^d with diam(K) = 1 and 0 ∈ K, and a Δ-approximately convex function f̃.
2: Define g : R^d → R as g(x) = f̃(Π_K(x)) + d(x, K), where Π_K is the projection onto K and d(x, K) is the Euclidean distance from x to K.
3: Initialize x₁ = 0, sampling radius r = ε/(8µ), and step size η and iteration count T polynomial in d and 1/ε.
4: for t = 1, 2, ..., T do
5:    Let v_t = f̃(x_t).
6:    Estimate, to within small ℓ₂ error (by averaging over uniformly random samples w from the unit sphere S):
         g_t ≈ E_{w∼S}[(d/r)·g(x_t + rw)·w].
7:    Update x_{t+1} = Π_K(x_t − η·g_t).
8: end for
9: Output the point x_t attaining min_{t∈[T]} v_t.

The rest of this section is dedicated to showing the following main lemma for Algorithm 1.

Lemma 5.1 (Main, algorithm). Suppose Δ ≤ ε²/(6348·µ·√d). For every t ∈ [T], if there exists x* ∈ K such that f̃(x*) < f̃(x_t) − ε, then ⟨g_t, x* − x_t⟩ ≤ −ε/64.
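The following is a compact Python sketch of Algorithm 1 (ours, for illustration only): the step size, radius, iteration count and sample count are placeholder choices rather than the constants of Theorem 3.2, and the projection onto K is supplied as a black box.

import numpy as np

def noisy_convex_opt(f_tilde, proj_K, d, eps, mu=1.0, T=2000, n_samples=2000, seed=0):
    """Sketch of Algorithm 1: zero-order projected descent on a noisy convex function.

    f_tilde : value oracle for the Delta-approximately convex function.
    proj_K  : Euclidean projection onto the convex body K (0 is assumed to lie in K).
    The constants below are illustrative placeholders, not the ones from the paper.
    """
    rng = np.random.default_rng(seed)
    r = eps / (8.0 * mu)            # sampling radius, as in step 3 (placeholder scaling)
    eta = eps / (64.0 * d)          # step size (placeholder scaling)

    def g(x):
        # Extension of f_tilde outside K: g(x) = f_tilde(Pi_K(x)) + dist(x, K).
        p = proj_K(x)
        return f_tilde(p) + np.linalg.norm(x - p)

    x = np.zeros(d)                 # x_1 = 0, assumed to lie in K
    best_val, best_x = f_tilde(x), x.copy()
    for _ in range(T):
        # Monte-Carlo estimate of E_w[(d/r) * g(x + r w) * w], w uniform on the sphere.
        w = rng.standard_normal((n_samples, d))
        w /= np.linalg.norm(w, axis=1, keepdims=True)
        vals = np.array([g(x + r * wi) for wi in w])
        grad_est = (d / r) * (vals[:, None] * w).mean(axis=0)
        x = proj_K(x - eta * grad_est)          # projected gradient step
        v = f_tilde(x)
        if v < best_val:                        # keep the best value seen, as in step 9
            best_val, best_x = v, x.copy()
    return best_x, best_val

# Hypothetical usage on the unit ball, with a small non-convex perturbation:
# proj = lambda x: x if np.linalg.norm(x) <= 1 else x / np.linalg.norm(x)
# f_t  = lambda x: np.linalg.norm(x - 0.3) + 1e-3 * np.sin(50 * x[0])
# x_hat, val = noisy_convex_opt(f_t, proj, d=10, eps=0.05)

In the sketch, the gradient estimate is recomputed from fresh samples each iteration, matching step 6; the accuracy, step-size and iteration bounds established in Lemma 5.1 and Theorem 3.2 are what determine the actual parameter choices.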

Assuming this lemma, we can prove Theorem 3.2.

Proof of Theorem 3.2. We first focus on the number of iterations. For every t, suppose there exists x* ∈ K with f̃(x*) < f̃(x_t) − ε. Then, since ‖g_t‖ ≤ 2d/r and η is chosen small enough that η‖g_t‖² ≤ ε/64, we have:

‖x* − x_{t+1}‖² ≤ ‖x* − (x_t − ηg_t)‖² = ‖x* − x_t‖² + 2η⟨x* − x_t, g_t⟩ + η²‖g_t‖² ≤ ‖x* − x_t‖² − ηε/32 + η²‖g_t‖² ≤ ‖x* − x_t‖² − ηε/64.

Since originally ‖x* − x₁‖ ≤ 1, the algorithm ends in poly(d, 1/ε) iterations. Now we consider the sample complexity. Since ‖(d/r)·g(x_t + rw)·w‖ is bounded by a polynomial in d and 1/ε, a standard concentration bound shows that poly(d, 1/ε) samples per iteration suffice to estimate the expectation to the required accuracy.

Due to space constraints, we defer the proof of Lemma 5.1 to the appendix.

6 Discussion and open problems

6.1 Arbitrary Lipschitz constants and diameter

We assumed throughout the paper that the convex function f is 1-Lipschitz and that the convex set K has diameter 1. Our results can easily be extended to arbitrary functions and convex sets through a simple linear transformation. For f with Lipschitz constant f_Lip and K with diameter R, and the corresponding approximately convex f̃, define g : K/R → R as g(x) = (1/(R·f_Lip))·f(Rx), where K/R is the rescaling of K by a factor 1/R, and define g̃ analogously from f̃. This translates to |g̃(x) − g(x)| ≤ Δ/(R·f_Lip), and g(x) = (1/(R·f_Lip))·f(Rx) is 1-Lipschitz over a set K/R ⊆ R^d of diameter 1. Therefore, for general functions over general convex sets, our result trivially implies that the rate for being able to optimize approximately-convex functions is

Δ/(R·f_Lip) = max{ (ε/(R·f_Lip))²/√d, (ε/(R·f_Lip))/d },

which simplifies to

Δ = max{ ε²/(R·f_Lip·√d), ε/d }.

6.2 Body-specific bounds

Our algorithmic result matches the lower bound on well-conditioned bodies. The natural open problem is to resolve the problem for arbitrary bodies (Footnote 9). Also note that the lower bound cannot hold for every convex body K ⊆ R^d: for example, if K is just a one-dimensional line in R^d, then the threshold should not depend on d at all. But even when the inherent dimension of K is d, the result is still body-specific: one can show that for f̃ over the simplex in R^d, for a suitable range of ε, it is possible to optimize f̃ in polynomial time even when Δ is considerably larger than the threshold above (Footnote 10). Finally, while our algorithm made use of well-conditioning, what the correct property or parameter of the convex body is that governs the rate T(ε) remains a tantalizing question for future work.

Footnote 9: We do not show it here, but one can prove the upper and lower bounds still hold over the hypercube, and whenever one can find a ball of radius ε most of whose mass lies inside the convex body K.
Footnote 10: Again, we do not show this here, but essentially one can search through the d + 1 lines from the center to the d + 1 corners.
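As a practical companion to the reduction in Section 6.1, the following small wrapper (ours, purely illustrative; the names are hypothetical) maps a general instance, with diameter R and Lipschitz constant f_Lip, to the normalized setting used throughout the paper, so that a normalized solver can be applied and its answer mapped back.

def normalize_instance(f_tilde, proj_K, R, f_lip, center):
    """Rescale a general instance to diameter 1 and Lipschitz constant 1.

    f_tilde : value oracle on the original body K (diameter R, Lipschitz constant f_lip).
    proj_K  : Euclidean projection onto the original body K.
    center  : any point of K, used as the new origin.
    Returns (g_tilde, proj_norm, to_original): the normalized oracle, the projection
    onto the rescaled body (K - center)/R, and the map back to original coordinates.
    A Delta-approximately convex f_tilde becomes Delta/(R*f_lip)-approximately convex,
    matching the rate computation of Section 6.1.
    """
    scale = R * f_lip
    g_tilde = lambda y: f_tilde(center + R * y) / scale
    proj_norm = lambda y: (proj_K(center + R * y) - center) / R
    to_original = lambda y: center + R * y
    return g_tilde, proj_norm, to_original

Combined with the sketch of Algorithm 1 above, one would run the normalized solver on g_tilde with target accuracy ε/(R·f_Lip) and map the returned point back through to_original.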

References

[AD10] Alekh Agarwal and Ofer Dekel. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, 2010.
[AFH+11] Alekh Agarwal, Dean P. Foster, Daniel J. Hsu, Sham M. Kakade, and Alexander Rakhlin. Stochastic convex optimization with bandit feedback. In Advances in Neural Information Processing Systems, 2011.
[BLNR15] Alexandre Belloni, Tengyuan Liang, Hariharan Narayanan, and Alexander Rakhlin. Escaping the local minima via simulated annealing: Optimization of approximately convex functions. In Proceedings of The 28th Conference on Learning Theory, pages 240–265, 2015.
[DJWW15] John C. Duchi, Michael I. Jordan, Martin J. Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5), 2015.
[DKS14] Martin Dyer, Ravi Kannan, and Leen Stougie. A simple randomised algorithm for convex optimisation. Mathematical Programming, 147(1-2):207–229, 2014.
[FKM05] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385–394. Society for Industrial and Applied Mathematics, 2005.
[NS] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, pages 1–40.
[NY83] Arkadii Nemirovskii and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics. Wiley, Chichester, New York, 1983.
[Sha12] Ohad Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. arXiv preprint arXiv:1209.2388, 2012.
[SV15] Yaron Singer and Jan Vondrák. Information-theoretic lower bounds for convex optimization with erroneous oracles. In Advances in Neural Information Processing Systems, 2015.
