arxiv: v2 [cs.lg] 19 Oct 2018

Size: px

Start display at page:

Download "arxiv: v2 [cs.lg] 19 Oct 2018"

Marcus Lucas
5 years ago
Views:

1 Learning in Non-convex Games with an Optimization Oracle arxiv: v2 [cs.lg] 19 Oct 2018 Alon Gonen Elad Hazan Abstract We consider adversarial online learning in a non-convex setting under the assumption that the learner has an access to an offline optimization oracle. In the most general unstructured setting of prediction with expert advice, [5] established an exponential gap demonstrating that online learning can be significantly harder. Interestingly, this gap is eliminated once we assume a convex structure. A natural question which arises is whether the convexity assumption can be dropped. In this work we answer this question in the affirmative. Namely, we show that online learning is computationally equivalent to statistical learning in the Lipschitz-bounded setting. Notably, most deep neural networks satisfy these assumptions. We prove this result by adapting the ubiquitous Follow-he-Perturbed- Leader paradigm [7]. As an application we demonstrate how the offline oracle enables efficient computation of an equilibrium in non-convex games, that include GAN (generative adversarial networks) as a special case. 1 Introduction Online learning is a fundamental and expressive model which allows formulation of important and challenging learning tasks such as spam detection, online routing, etc ([2, 4, 10]). A key feature of this model is the ability of the environments to evolve over time, possibly in an adversarial manner. Consequently, this framework can be used to produce much more robust and powerful learners relative to the classic statistical and stationary learning framework. One would expect that such nice properties should come with a price. While it is well-known that any efficient online learner can be transformed into an efficient batch learner [1], it is important to understand to what extent is the online model harder. In this work we focus on the computational perspective while adopting the setting suggested in [5]. hat is, we assume that the learner has an access to the following two oracles: 1. Value oracle: the learner submits a pair of Department of Computer Science, Princeton University Google Brain and Princeton University 1

2 the form (hypothesis, loss function) and the oracle outputs the loss incurred by the hypothesis. 2. Optimization oracle: the learner submits a sequence of loss functions and receives a hypothesis that (approximately) minimize the cumulative loss w.r.t. this sequence. he second oracle might seem very powerful. However, as we know, many (offline) problems that are known to be hard in the worst case admit efficient solvers in practice. Moreover, it enables a systematic comparison between the online and the statistical models. Equipped with these oracles, we would like to compare between the computational complexities of online and statistical learning. he computational complexity is defined as the number of operations required to ensure expected excess risk of at most ɛ in the statistical model and expected average regret of at most ɛ in the online model. he required number of samples (or rounds in the online model) are implicit in the above definitions of the computational complexities. 1.1 Background [5] study our question in the general context of prediction with expert advice. In this setting, on each round t, the learner chooses (possibly in a random fashion) which expert to follow from a list of N candidates. hereafter, the environment assigns losses l t,j, j P rns to the experts. We measure the success of the learner using the notion of regret, which is the difference between the cumulative loss of the learner and that of the best single expert. In the statistical setting, we usually call experts hypotheses. A well-known fact, that follows from Hoeffding s inequality (and the union bound), is that statistical and the computational complexity of statistical learning is polynomial in log N and ɛ 1. In contrast, [5] showed a lower bound on the computational complexity of online learning (even for achieving a constant accuracy) of polypn q. his implies an exponential gap between the online and the statistical complexities. Interestingly, the above gap is completely eliminated if the problem is convex. o illustrate this idea, consider the online shortest path problem. Let G pv, Eq be a directed graph and fix some vertices s, t P V. On each round t, the learner chooses a path from s to t. hereafter, the adversary assigns nonnegative weights the edges. he loss associated with any path is the sum of edges along the path. his problem can be expressed in the framework of learning with expert advice, where each expert corresponds to a path between s and t. However, the number of such paths is typically exponential in E, rendering this reduction inefficient. In their seminal paper, [7] suggested the following algorithm: 1. At each time t and for each edge e in the graph, draw a random noise ν t res according to a suitable distribution with mean 0 and standard deviation of order?. Add this quantity to the cumulative weight of edge e during rounds 1,..., t Let w t be shortest path solution corresponding to the perturbed cumulative weights. It was shown in [7] that the regret of this algorithm is polynomial in E and sublinear in, so the overall complexity is 2

3 polynomial in E and ɛ 1, as in the batch setting. More generally, this approach yields a polypd, ɛ 1 q online algorithm for any family of linear loss functions defined over compact decision set W Ď R d. Even more generally, this approach can be extended to any family of convex loss functions by using a standard linearization technique (e.g., see Section 2.4 in [10]). 1.2 Main Result Naturally, the next obvious question is whether the convexity assumption can be dropped. Our paper answers this question in the affirmative. Before stating our result we specify the modest geomtric assumptions we make throughout the paper. Assumption 1. We assume that the following quantities are polynomial in the dimension d: a) he l 2 -Lipschitz parameter of any loss function. b) he magnitude of any loss function. c) he diameter of the domain W. heorem 1. (Main heorem) Let L be a class of loss functions satisfying Assumption 1. hen Algorithm 1 transforms any polypd, ɛ 1 q batch learner into a polypd, ɛ 1 q online learner. We deduce the following game theoretic result: Corollary 1. (informal) Convergence to equilibrium in two player zero-sum non-convex games is as hard as the corresponding offline best-response optimization problem. We elaborate on this implication and specify it to GANs in Section Overview and echniques Our result is proved by extending the Follow-he-Perturbed-Leader algorithm to the non-convex setting. As detailed below, online learnability requires algorithmic stability between consecutive rounds. In the linear case, FPL ensures stability by linearly perturbing the cumulative loss function. As the analysis of FPL in [7] relies heavily on the fact that the perturbation and the loss function are of the same type, it was not clear whether FPL can be extended to the non-convex case. herefore, a key challenge is to understand how randomness encourages Euclidean proximity between consecutive iterates of FPL in the non-convex setting. We illustrate the main ideas by restricting our attention to the relatively easy 1-dimensional non-convex objective w ÞÑ pσpwxq yq 2, where σpxq maxtx, 0u is the ReLU function. Note that this objective becomes convex when restricted either to the positive or the negative orthant. We show that FPL prevents frequent transitions between positive and negative predictions and consequently stabilize the predictions of the learner. We then proceed and extend this intuition to the full generality of online learning. 3

4 1.4 Additional Related Work Another interesting attempt to investigate settings that admit some structure but are non-convex has been conducted by [3]. hey revisited the experts setting and formulated abstract conditions under which the randomness can be shared between the experts. hey also demonstrated an application to online auctions. Several works have studied GANs in the regret minimization framework (e.g. [9, 8, 6]). We provide the first evidence that achieving equilibrium in GANs can be reduced to the offline problems associated with the players. Since its invention in [7], an extensive study of FPL has yielded efficient variants (e.g. [12]) and new insights. However, its extension to the non-convex setting has not been explored. 2 Preliminaries 2.1 Problem setting We denote the action set (a.k.a. domain or hypothesis class) by W Ď R d and assume that it is compact with diameter D. Let L Ď R W be a class of B- bounded and G-Lipschitz functions. For concreteness, we assume that B, D, G P polypdq. he online model can be described as a repeated game between the learner and a (possibly adversarial) environment. On each round t P r s, the learner decides on its next action w t P W. hereafter, the environment decides on a loss function l t P L. he regret of the learner is defined as ÿ ÿ Regret fi l t pw t q min l t pwq. wpw i 1 t 1 looomooon L pwq he goal of the learner is to attain a vanishing average regret (i.e., Regret op1q). Remark 1. o simplify the presentation, we assume in the sequel that the environment is oblivious, i.e., it chooses its strategy before the game begins. A standard modification enables us to cope with adaptive environments (coping with such environments is crucial for our application to non-convex games). We detail this extension in Section Optimization and value oracles he input to the optimization oracle is a pair pl 1:k, nq and the output is a predictor ŵ satisfying # + tÿ ÿ k l i pŵq n J ŵ ď min l i pwq n J w, wpw i 1 4 i 1

5 Note that we allow a linear perturbation of the standard offline oracle. As we deal with non-convex optimization, adding a linear term usually does not increase the computational complexity of the offline problem. We also emphasize that it is straightforward to extend our development to the case where the offline oracle is only assumed to return an ɛ 1 -approximate solution (where ɛ 1 ď ɛ) Statistical and online complexities For both the statistical and the batch models, we define the overall complexity as: sample complexity ` #of calls to the oracles he (statistical) sample complexity is defined as follows. Let P be an unknown distribution over L. he learner observes a finite sample of size n according to P n and outputs some ŵ P W. Given an accuracy parameter ɛ P p0, 1q, we are interested in bounds on the sample size n for which ErLpŵqs ď min wpw Lpwq`ɛ, wherelpwq E l P lpwq. Given the above oracles, it is easy to verify that the sample complexity is polypd, ɛ 1 q. 2 he sample complexity in the online model is defined as the number of rounds for which the expected average regret of the learner is at most ɛ. herefore, the main question we ask is: is there exists an online learner (in the above oracle model) whose overall complexity is polypd, ɛ 1 q. 2.4 Online to batch conversion he following well-known result due to [1] tells us that the online sample complexity dominates the batch sample complexity. he intuition that online learning is at least as hard as batch learning is formalized by the following online-tobatch conversion. heorem 2. [1] Suppose that A is an online learner with ErRegret s ď ɛp q for any. Consider the following algorithm for the batch setting: given a sample pl 1,..., l n q P n, the algorithm applies A to the sample in an online manner. hereafter, it draws a random round j P rns uniformly at random and returns ŵ w j. hen the expected excess risk of the algorithm, ErLpŵqs Lpw q, is at most ɛp q. 2.5 Online learning via stability he main challenge in online learning stems from the fact that the learner has to make decision before observing the adversarial action. Intuitively, we expect that the performance after shifting the actions of the learner by one step 1 We actually do not use the value oracle assumed in [5]. Formally, the input of this oracle is a pair pw, lq P W ˆ L and its output is lpwq. 2 his can be done various methods but perhaps the simplest is to consider the finite class obtained by covering the domain with l 1 -balls of radius G{2 (whose size is exponential in d and polynomial in the rest of the parameters), and applying the standard Oplnp H {ɛ 2 q bound for any finite class H. 5

6 (i.e. considering the loss l t pw t`1 q rather than l t pw t q) to be optimal. his view suggests that online learning is all about balancing between optimal performance w.r.t. previous rounds and ensuring stability between consecutive rounds. As we shall see, this is exactly the role of the random linear perturbation. In a sense n should be thought as a random regularizer. he next lemma provides a systematic approach for analyzing Follow-the-Regularized-Leader-type algorithms. Lemma 1. (FL-BL) he regret of the online learner is at most ErRegret s ď ÿ Erl t pw t q l t pw t`1 qs ` Ern J pw 1 w qs, i 1 where w argmint ř t 1 l ipwq : w P Wu. 2.6 he exponential distribution We use the following properties of the exponential distribution. Lemma 2. Let X be an exponential random variable with parameter η. 3 he following properties hold: a) for any s P R, P px ě sq expp ηsq. b) Memorylessness: for any s, q P R, P px ě s ` q X ě qq P px ě qq. c) if X 1,..., X d are i.i.d. with X i Exppηq, then Er}pX 1,..., X d q} 8 s ď η 1 plogpdq ` 1q. 3 FPL for Non-convex Losses: Warmup o illustrate the main ideas and provide some intuition, we consider the following 1-dimensional problem corresponding to a neural network with one hidden layer. Let W r 1, 1s be the decision set. Letting X r 1, 1s, Y r0, 1s, each l P L is of the form lpwq : l x,y ppwxq` yq 2, (where pzq` maxtz, 0u). he non-convexity of this objective is apparent in Figure 1 (where we set x, y 1). It can be verified that the minimizer of the cumulative loss after t rounds is either w t,` or w t,, where w ` argmint ÿ t:x tě0 l t pwqu, w argmint ÿ t:x tă0 l t pwqu For instance, if y t 1 for all t and x t alternates between `1 and 1, then w also alternates between w ` and w. As we explained above, our challenge now is to stabilize the output of the algorithm. We next explore several candidate approaches. 3 hat is, X has density ppxq η expp ηxq. 6

7 Figure 1: Square loss for one dimensional ReLU. Here we set x y Why standard approaches do not work? A common approach which works well in the convex setting is to apply l 2 - regularization, i.e., to let w t be the minimizer of ř iăt l ipwq ` η}w} 2 for some parameter η. However, the regularization has the effect of pushing the solution towards zero, which does not make sense in our case (for the alternating instance discussed above, any w with small magnitude misclassifies both the negative and the positive examples). Noting that stabilizing the algorithm amounts to deciding whether w t is positive or negative, we can reduce the problem to the experts setting by considering a positive and negative expert. his approach works well on our case but does not seem to extend to higher dimensions. 3.2 One-dimensional Non-convex FPL Although the loss function l x,y is not linear, we can still analyze the effect of adding a linear perturbation to the cumulative loss. Concretely, let ν be a d- dimensional i.i.d. random vector, whose coordinates are drawn according to the exponential distribution with parameter η α ą 0. he input to the offline oracle consists of a pair pl 1:t, νq and the output w t is assumed to satisfy # + tÿ w t P argmin l i pwq νw : w P W. iăt As we show below, this random modification reduces the probability of switching from positive predictors to negative predictors. Suppose that w t ě β ą 0, for some β P p0, 1q to be defined later. We next upper bound the probability 7

8 that w t`1 ă 0. By definition of w t, p@w ă 0q L t 1 pw t q νw t ď L t 1 pwq νw ñ p@w ă 0q νpw t wq ě L t 1 pw t q L t 1 pwq ñ ν ě L t 1 pw t q L t 1 pwq max : Z t 1 wpr 1,0q w t w Similar argument (using the definition of w t`1 ) implies that w t`1 remains nonnegative if 4 L t pw t q L t pwq ν ě max wpr 1,0q w t w Since the loss is B-bounded and w t w ě β for all w ă 0, it suffices to require that ν ě Z t 1 ` 2B β Using the memorylessness of the exponential distribution, we have that P rrν ě Z t 1 ` 2B β ν ě Z t 1 s P rrν ě 2B β s expp 2B β α q ñ P rrw t`1 ě 0 w t ě β s ě 1 expp 2B β α q ě 2B β α Similar argument can be applied for the bounding the probability that w t`1 ą 0 given that w t ď β. We can proceed to analyze the regret of the algorithm using the FL-BL lemma (a more rigorous proof is given in the next section). Note that B, G and D are constants. Let α 2{3, β 1{3. he above analysis shows that the probability for shifting between positive and negative prediction is at most Op 1{3 q. For consecutive rounds in which there is no transition between positive and negative prediction, we can use the Lipschitzenss assumption to deduce that l t pw t q l t pw t`1 q ď Op β q Op 1{3 q. herefore, the total instability (as defined in the FL-BL lemma) is Op 1{3 q Op 2{3 q. Since Erνs α, we obtain that the noise term in the FL-BL lemma is Op 2{3 q. Overall, the regret is Op 2{3 q which translates into a sample complexity bound of Op1{ɛ 3 q. In the next section we consider the general d-dimensional case while making the above arguments more formal. Note that in the d-dimensional case we might be worried about exponentially many transitions. he key idea will be to apply the above argument for each dimension separately. Finally, a more careful analysis allows us to improve the sample complexity bound from Op1{ɛ 3 q to Op1{ɛ 2 q. 4 General Analysis of Non-convex FPL We now consider the general Lipschitz-bounded case and analyze the variant of FPL presented in Algorithm 1. Our analysis below completes the proof of our main theorem. Fix the predictor w t chosen by the algorithm at time t and also 4 Simply since w t forms a better candidate. 8

9 Algorithm 1 Non-convex FPL Parameter: η ą 0 Draw a random vector ν pexppηqq d Prediction at time t: # + tÿ w t P argmin l i pwq ν J w : w P W iăt, fix an axis k P rds. Let U t,k tw P W : w t,k ě w k u. Our starting point is a simple extension of the analysis from the previous part; by using the definition of w t we obtain that L t 1 pw t q L t 1 puq ` řj k ν k ě max ν jpu j w t,j q fi z t,k. upu w t,k t,k u k We let E t,k be the conditional expectation given the identity of w t and z t,k. Lemma 3. We have E t,k rw t,k w t`1,k s ď Opη lnpη 1 qq. Proof. Based on the definition of w t and z t,k, for any u P U t,k, we have that w t`1 u provided that ν k ě z t,k ` ltpw t q l t puq w t,k u k (simply note that according to the definition of w t`1, w t forms a better candidate than u ). Using that the loss is B-bounded, we deduce that if ν k ě z t,k ` 2B{ for some ą 0, then w t,k w t`1,k ď. Equivalently, given that ν k q ě z t,k, we have that " * 2B w t,k w t`1,k ď min D, fi pz t,k q q z t,k 9

10 herefore, E t rw t,k w t`1,k s ď ż 8 qą0 ż 8 qąz tk ż 8 qą0 ż 8 ď ηd qą0 ż 1 η expp ηqq pz t,k q1 qązt,k expp ηz t,k q η expp ηqq pz t,kq expp ηz t,k q dq dq η expp ηpz t,k ` qqq mint2b{q, Du expp ηz t,k q η expp ηqq mint2b{q, Du dq qą0 ď ηd ` 2ηB ż η 1 ż expp ηqq 8 expp ηqq 1 dq ` 2ηB dq ` 2ηB dq q 1 q q q η 1 ż η 1 q 1 1 dt ` 2ηB q ż 8 q η 1 ď ηd ` 2ηB lnpη 1 q ` 2ηB expp ηqq ď ηd ` 2ηB lnpη 1 q ` 2ηB. dq expp ηqq η 1 dq Similar argument (this time using the optimality of w t at time t) can be used to derive the analogous inequality. Lemma 4. For any k P rds, E t,k rw t`1,k w t,k s ď Opη lnpη 1 qq. Summing over all coordinates k P rds yields the bound. Er}w t w t`1 } 1 s Oppolypdqη logpη 1 qq lomon η«p q 1{2 Oppolypdq 1{2 logp qq We are ready to conclude our main result. Proof. (of heorem 1) Using that Er}ν} 8 s ď η 1 plog d ` 1q and }w} 1 ď D, we can apply the FL-BL lemma (Lemma 1) to deduce the desired regret bound. Online-to-batch conversion yields the following corollary. Corollary 2. he sample complexity of non-convex online learning as defined above is O polypdq ɛ Oblivious vs. adaptive environments Our current analysis assumes that the environment is oblivious in the sense that its actions are chosen in advance. o cope with adaptive environments we apply a simple (standard) modification to Algorithm 1; instead of drawing a 10

11 single noise vector at the beginning of the game, the algorithm draws a fresh i.i.d. random vector on every round. he algorithm is detailed in Algorithm (2). 5 Algorithm 2 Non-convex FPL for adaptive adversaries Parameter: η ą 0 for t 1 to t do Draw a random vector ν t pexppηqq d Prediction at time t: # + tÿ w t P argmin l i pwq νt J w : w P W iăt, end for heorem 3. Algorithm 2 enjoys the same regret bound as Algorithm 2. Consequently, our main result (heorem 1) holds also in the non-oblivious setting. Proof. Clearly, the two algorithms suffer the same expected loss in the oblivious setting. Hence, Algorithm 2 attains the same regret bound as Algorithm 1. Since the distribution over the actions of Algorithm 2 is completely determined by the the loss sequence pf 1,..., f t 1 q, Lemma 4.1 in [2] implies the same regret bound against adaptive environments. 5 Implications to Non-convex Games Let F : X ˆ Y Ñ R, where X, Y Ď R d are compact with diameter at most D. he x-th player wishes to minimize F and whereas the y-th player wishes to maximize F. We assume that for all x P X and y P Y, both F p, yq and F px, q are G-Lipschitz and B-bounded. A known approach for achieving equilibrium is to apply (for each of the players) an online method with vanishing average regret. Precisely, on each round t both players choose a pair px t, y t q which induces the losses F px t, y t q and F px t, y t q, respectively. Finally, we draw a random index rjs P r s and output the pair pˆx, ŷq fi px j, y j q. By endowing the players with access to an offline oracle and playing according to non-convex FPL we can reach approximate equilibrium. heorem 4. Suppose that both the x-player and the y-player have an access to an offline oracle and play according to non-convex FPL (Algorithm 2). Given ɛ ą 0, let P polypdq{ɛ 2 such that the expected average regret of non-convex FPL is at most ɛ. hen, pˆx, ŷq forms an ɛ-approximated equilibrium, i.e., for any x P X and y P Y, ErF pˆx, ŷqs ď ErF px, ŷqs ` ɛ, ErF pˆx, ŷqs ě ErF pˆx, yqs ɛ. 5 In fact, the vanilla FPL also draws a fresh random noise vector on every round. 11

12 Note that the players can use their offline oracle to amplify their confidence and achieve an equilibrium with high probability. Proof. For all y P Y, 1 ErF pˆx, yqs E ñ ErF pˆx, ŷqs ď 1 Similarly, for all x P X, 1 ErF px, ŷqs E ñ ErF pˆx, ŷqs ě 1 ÿ j F pxt, yq ÿ F pxt, y t q ÿ j F px, yt q ÿ F pxt, y t q 1 ď E j ` ɛ 1 ě E j ɛ. ÿ j F pxt, y t q ` ɛ ÿ j F pxt, y t q ɛ 5.1 Implication to GANs In particular, we consider the case where the x-th player is a generator, who produces synthetic samples (e.g. images), whereas the y-th player acts as a discriminator by assigning scores to samples reflecting the probability of being generated from the true distribution. Formally, by choosing a parameter x P X and drawing a random noise z, the x-th player produces a sample denote G x pzq. Conversely, the y-th player chooses a parameter y P Y and assign the score D y pg x pzqq P r0, 1s to the sample G x pzq. he function F usually corresponds to the log-likelihood of mistakenly assigning an high score to a synthetic example and vice versa. It is reasonable to assume that F is Lipschitz and bounded w.r.t. the network parameters. As a result, efficient convergence to GANs is established by assuming an access to an offline oracle. 6 Discussion Our work establishes a computational equivalence between online and statistical learning in the non-convex setting. In a sense, we shed a light on the hardness result of [5] by demonstrating that online learning is significantly more difficult than statistical learning only when no structure is assumed. One interesting direction for further investigation is to refine the comparison model and study the polynomial dependencies more carefully. For instance, in the statistical setting, one can obtain dimension-independent bounds on the sample complexity under certain assumptions on the Lipschitz and boundedness parameters (e.g. Rademacher complexity bounds for SVM [11]). It is natural to ask whether one can achieve dimension-independent regret bounds in such settings. 12

13 References [1] Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the Generalization Ability of Online Learning Algorithms for Pairwise Loss Functions. IEEE ransactions on Information heory, 50: , [2] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games [3] Miroslav Dudik, Nika Haghtalab, Haipeng Luo, Robert E. Schapire, Vasilis Syrgkanis, and Jennifer Wortman Vaughan. Oracle-efficient online learning and auction design. In Annual Symposium on Foundations of Computer Science - Proceedings, [4] Elad Hazan. Introduction to Online Convex Optimization. Foundations and rends R in Optimization, 2(3-4): , [5] Elad Hazan and omer Koren. he Computational Power of Optimization in Online Learning. [6] Elad Hazan, Karan Singh, and Cyril Zhang. Efficient regret minimization in non-convex games. In International Conference on Machine Learning, pages , [7] Adam Kalai and Santosh Vempala. Efficient Algorithms for Online Decision Problems [8] Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On Convergence and Stability of GANs [9] Dale Schuurmans and Martin A Zinkevich. Deep learning games. In Advances in Neural Information Processing Systems, pages , [10] Shai Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and rends R in Machine Learning, 4(2): , [11] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning. Cambridge university press, [12] im Van Erven, Wojciech Kot lowski, and Manfred K Warmuth. Follow the leader with dropout perturbations. In Conference on Learning heory, pages ,

Advanced Machine Learning

Advanced Machine Learning Follow-he-Perturbed Leader MEHRYAR MOHRI MOHRI@ COURAN INSIUE & GOOGLE RESEARCH. General Ideas Linear loss: decomposition as a sum along substructures. sum of edge losses in a