Online Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016


1 Online Convex Optimization Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016

2 The General Setting

3 The General Setting (Cover) Given only the above, learning isn't always possible

4 Some Natural Restrictions Realizability: there is some hypothesis that makes no mistakes; implies simple algorithms like Consistent and Halving.

5 Some Natural Restrictions Realizability: there is some hypothesis that makes no mistakes; implies simple algorithms like Consistent and Halving. Randomization: our predictions are made via a probability distribution, and the environment does not control the randomness; suggests Expected Regret as a performance metric. Recall: expected regret compares the expected cumulative loss E[Σ_t f_t(w_t)] with that of the best fixed hypothesis.

6 Some Natural Restrictions Realizability: there is some hypothesis that makes no mistakes; implies simple algorithms like Consistent and Halving. Randomization: our predictions are made via a probability distribution, and the environment does not control the randomness; suggests Expected Regret as a performance metric. Non-adversarial: loss functions are sampled iid from some distribution; this is the stochastic gradient descent setting.

7 Unifying Framework

8 Unifying Framework Randomization is just OCO over the probability simplex with a 'surrogate loss function' that upper-bounds the original loss function.

9 Unifying Framework Randomization is just OCO over the probability simplex with a 'surrogate loss function' that upper-bounds the original loss function. Realizability additionally assumes that some hypothesis achieves zero loss.

10 How to Learn in OCO

11 The Naïve Approach For quadratic optimization problems, in which f_t(w) = (1/2)‖w − z_t‖₂², FTL (Follow-the-Leader) has logarithmic regret.

12 The Naïve Approach For linear optimization problems, in which f_t(w) = ⟨w, z_t⟩, FTL has unbounded (linear in T) regret. Just imagine a 1-dimensional example with z_t alternating between -1 and 1 each round.
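To make the failure mode concrete, here is a tiny Python simulation (my own sketch, not from the slides) of FTL on 1-dimensional linear losses f_t(w) = z_t·w over S = [-1, 1], with z_t alternating as described; FTL's cumulative loss grows linearly with T while the fixed point u = 0 pays nothing.

```python
import numpy as np

def ftl_linear_1d(z):
    """Follow-the-Leader on S = [-1, 1] with linear losses f_t(w) = z_t * w.
    The leader minimizes w * sum(z_1..z_{t-1}), i.e. w_t = -sign(running sum)."""
    running_sum, cum_loss = 0.0, 0.0
    for z_t in z:
        w_t = 0.0 if running_sum == 0 else -np.sign(running_sum)
        cum_loss += w_t * z_t          # loss suffered this round
        running_sum += z_t
    return cum_loss

T = 1000
z = np.array([0.5] + [(-1.0) ** t for t in range(1, T)])  # 0.5, -1, 1, -1, ...
print(ftl_linear_1d(z))   # roughly T - 1, while the fixed action u = 0 has zero loss
```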

13 Follow-the-Regularized-Leader (FoReL) Regularization stabilizes the solution of Follow-the-Leader. Goal: eliminate the jumpiness! Regularization function R: S → ℝ. Update: w_t = argmin_{w∈S} [ Σ_{i=1}^{t−1} f_i(w) + R(w) ]

14 Special Case: Linear Loss Functions FoReL: w_t = argmin_{w∈S} [ Σ_{i=1}^{t−1} f_i(w) + R(w) ]. Let f_t(w) = ⟨w, z_t⟩, S = ℝ^d, R(w) = (1/2η)‖w‖₂². Then w_{t+1} = argmin_w [ Σ_{i=1}^{t} f_i(w) + R(w) ] = argmin_w [ Σ_{i=1}^{t} ⟨w, z_i⟩ + (1/2η)‖w‖₂² ]. Setting the derivative equal to zero: 0 = Σ_{i=1}^{t} z_i + (1/η) w, so w_{t+1} = −η Σ_{i=1}^{t} z_i.

15 Special Case: Linear Loss Functions w_{t+1} = −η Σ_{i=1}^{t} z_i = −η Σ_{i=1}^{t−1} z_i − η z_t = w_t − η z_t. Note that f_t(w) = ⟨w, z_t⟩, so ∇f_t(w_t) = z_t. This gives us the formula for gradient descent for linear functions: w_{t+1} = w_t − η ∇f_t(w_t).
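As a quick illustration (a sketch of my own, not from the slides), the derived update is easy to run directly: for linear losses the FoReL iterate is just a gradient step on each z_t.

```python
import numpy as np

def ogd_linear(zs, eta):
    """FoReL with R(w) = (1/(2*eta)) * ||w||^2 on linear losses f_t(w) = <w, z_t>,
    which reduces to the gradient-descent update w_{t+1} = w_t - eta * z_t."""
    w = np.zeros(len(zs[0]))
    iterates = [w.copy()]
    for z_t in zs:
        w = w - eta * z_t              # gradient of <w, z_t> with respect to w is z_t
        iterates.append(w.copy())
    return iterates

# usage: a short sequence of loss vectors in R^2
zs = [np.array([1.0, 0.0]), np.array([0.0, -1.0]), np.array([0.5, 0.5])]
ws = ogd_linear(zs, eta=0.1)
```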

16 Special Case: Linear Loss Functions How can we bound the regret? Regret_T(u) = Σ_{t=1}^{T} (f_t(w_t) − f_t(u)). Let f_t(w) = ⟨w, z_t⟩, S = ℝ^d, R(w) = (1/2η)‖w‖₂². For all u, we have the following regret bound: Regret_T(u) ≤ (1/2η)‖u‖₂² + η Σ_{t=1}^{T} ‖z_t‖₂².

17 Special Case: Linear Loss Functions How can we bound the regret? Regret_T(u) = Σ_{t=1}^{T} (f_t(w_t) − f_t(u)). Let f_t(w) = ⟨w, z_t⟩, S = ℝ^d, R(w) = (1/2η)‖w‖₂². For all u: Regret_T(u) ≤ (1/2η)‖u‖₂² (the regularization term) + η Σ_{t=1}^{T} ‖z_t‖₂² (which grows when the loss functions have greater magnitude).

18 Digression: Proof of Lemma 2.3 Define a function f_0 to be equal to the regularization function R. Then running FoReL on f_1, …, f_T is equivalent to running FTL on f_0, f_1, …, f_T. Apply Lemma 2.1, which states that for FTL and all u ∈ S: Σ_t (f_t(w_t) − f_t(u)) ≤ Σ_t (f_t(w_t) − f_t(w_{t+1})), where here the sums run over t = 0, …, T.

19 Digression: Proof of Lemma 2.3, continued Since f_0 = R, this is equivalent to: Σ_{t=1}^{T} (f_t(w_t) − f_t(u)) + R(w_0) − R(u) ≤ Σ_{t=1}^{T} (f_t(w_t) − f_t(w_{t+1})) + R(w_0) − R(w_1). Rearranging gives Lemma 2.3: Regret_T(u) ≤ R(u) − R(w_1) + Σ_{t=1}^{T} (f_t(w_t) − f_t(w_{t+1})).

20 Special Case: Linear Loss Functions (Thm. 2.4) From Lemma 2.3: Regret_T(u) ≤ R(u) − R(w_1) + Σ_{t=1}^{T} (f_t(w_t) − f_t(w_{t+1})). By the definitions of R and f_t: Regret_T(u) ≤ (1/2η)(‖u‖₂² − ‖w_1‖₂²) + Σ_{t=1}^{T} ⟨w_t − w_{t+1}, z_t⟩ ≤ (1/2η)‖u‖₂² + Σ_{t=1}^{T} ⟨w_t − w_{t+1}, z_t⟩. Using that w_{t+1} = w_t − η z_t: Σ_{t=1}^{T} ⟨w_t − w_{t+1}, z_t⟩ = Σ_{t=1}^{T} ⟨η z_t, z_t⟩ = η Σ_{t=1}^{T} ‖z_t‖₂². Thus: Regret_T(u) ≤ (1/2η)‖u‖₂² + η Σ_{t=1}^{T} ‖z_t‖₂².

21 Extending from Linear to Convex Functions: Subgradients

22 Extending from Linear to Convex Functions: Online Gradient Descent SGD is the special case where loss functions are sampled iid from a distribution
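A small sketch of this step (the assumptions are mine: absolute-error losses f_t(w) = |⟨w, x_t⟩ − y_t| and a stream of (x_t, y_t) pairs), showing online gradient descent run with subgradients; when the pairs are drawn iid from a distribution, this is exactly SGD.

```python
import numpy as np

def online_subgradient_descent(stream, d, eta):
    """Online gradient descent using a subgradient of the non-differentiable loss
    f_t(w) = |<w, x_t> - y_t|; a valid subgradient is sign(<w, x_t> - y_t) * x_t."""
    w = np.zeros(d)
    for x_t, y_t in stream:
        g = np.sign(w @ x_t - y_t) * x_t   # a subgradient of f_t at w
        w = w - eta * g
    return w

# with iid samples this is SGD for least-absolute-deviation regression
rng = np.random.default_rng(0)
stream = [(rng.standard_normal(3), rng.standard_normal()) for _ in range(200)]
w_hat = online_subgradient_descent(stream, d=3, eta=0.05)
```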

23 Extending from Linear to Convex Functions: (Unsatisfying) Regret Bound We want sublinear regret

24 Extending from Linear to Convex Functions: Lipschitz-ness to the rescue! A function f is L-Lipschitz if for all x, y in the domain of f, we have |f(x) − f(y)| ≤ L‖x − y‖

25 Extending from Linear to Convex Functions: Lipschitz-ness to the rescue! A function f is L-Lipschitz if for all x, y in the domain of f, we have |f(x) − f(y)| ≤ L‖x − y‖

26 Extending from Linear to Convex Functions: How to pick step size?
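The slide's answer was a formula; one standard tuning (my reconstruction from the bound on slide 20, assuming ‖z_t‖₂ ≤ L for all t and comparing against u with ‖u‖₂ ≤ B) balances the two terms of the regret bound:

```latex
\mathrm{Regret}_T(u) \;\le\; \frac{B^2}{2\eta} + \eta\, T L^2
\quad\Longrightarrow\quad
\eta^\star = \frac{B}{L\sqrt{2T}},
\qquad
\mathrm{Regret}_T(u) \;\le\; B L \sqrt{2T}.
```

If T is not known in advance, the doubling trick on the backup slides recovers the same rate up to a constant factor.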

27 Special Cases: Specific Convex Functions Logarithmic Regret Algorithms (Hazan et al.) If the loss functions have bounded first- and second-order derivatives, OGD has O(log T) regret. Other algorithms exist for weaker assumptions: Online Newton Step / Follow-The-Approximate-Leader (for bounded gradients and exp-concave losses) and Exponentially Weighted Online Optimization (for exp-concave losses).
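A hedged sketch of the logarithmic-regret case (the interface and parameter names are mine): for H-strongly convex losses with bounded gradients, OGD with the decreasing schedule η_t = 1/(H t) is the variant analyzed by Hazan et al.

```python
import numpy as np

def ogd_log_regret(grad_fns, d, H):
    """OGD with step sizes eta_t = 1/(H*t); for H-strongly convex losses with
    bounded gradients this schedule yields O(log T) regret (Hazan et al. 2007).
    grad_fns[t](w) is an assumed callable returning a gradient of loss t at w."""
    w = np.zeros(d)
    iterates = []
    for t, grad in enumerate(grad_fns, start=1):
        iterates.append(w.copy())
        w = w - (1.0 / (H * t)) * grad(w)
    return iterates
```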

28 Special Case: Smoothed OCO Andrew, L., Barman, S., Ligett, K., Lin, M., Meyerson, A., Roytman, A., & Wierman, A. (2013, June). A tale of two metrics: Simultaneous bounds on competitiveness and regret. In ACM SIGMETRICS Performance Evaluation Review (Vol. 41, No. 1, pp ). ACM. Chen, N., Agarwal, A., Wierman, A., Barman, S., & Andrew, L. L. (2015, June). Online convex optimization using predictions. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (pp ). ACM.

29 Special Case: Smoothed OCO The cost model: at each step, pay a hitting cost c_t(x_t) plus a switching cost for moving, proportional to ‖x_t − x_{t−1}‖. Applications: Geographical Load Balancing, Dynamic Capacity Provisioning

30 Special Case: Smoothed OCO Some performance metrics: regret (compare against the best fixed point in hindsight) and competitive ratio (compare against the dynamic offline optimal).

31 Special Case: Smoothed OCO Result from the first paper: in the worst case you can't have both; no online algorithm simultaneously achieves sublinear regret and a constant competitive ratio.

32 Special Case: Smoothed OCO Result from the second paper: but in practice, it doesn't matter! Most noise is not adversarial, and we often have access to noisy predictions, so the paper proposes an algorithm with good performance on both metrics.

33 Special Case: Smoothed OCO Averaging Fixed Horizon Control (AFHC) Fixed Horizon Control: given predictions over timesteps t, …, t + w, just play whatever is optimal over these steps. AFHC: average several FHC instances whose planning windows start at different time steps. A toy sketch follows below.
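The sketch below is illustrative only, under assumptions of my own (1-dimensional quadratic hitting costs, a squared switching cost so each window problem stays smooth, and perfect predictions taken from the true sequence); it is not the algorithm exactly as specified in the paper, and it uses scipy for the per-window minimization.

```python
import numpy as np
from scipy.optimize import minimize

def plan(y_pred, x_start, beta):
    """Minimize sum_t (x_t - y_t)^2 + beta * (x_t - x_{t-1})^2 over one window,
    starting from x_start (squared switching cost keeps the problem smooth)."""
    y_pred = np.asarray(y_pred, dtype=float)
    def cost(x):
        steps = np.diff(np.concatenate(([x_start], x)))
        return np.sum((x - y_pred) ** 2) + beta * np.sum(steps ** 2)
    return minimize(cost, x0=y_pred.copy()).x

def fhc(y, w, beta, offset=0):
    """Fixed Horizon Control: at the start of each length-(w+1) window, plan
    against the predictions for that window and commit to the whole plan."""
    T, x, x_prev = len(y), np.zeros(len(y)), 0.0
    blocks = ([(0, offset)] if offset > 0 else []) + \
             [(s, min(s + w + 1, T)) for s in range(offset, T, w + 1)]
    for s, e in blocks:
        seg = plan(y[s:e], x_prev, beta)
        x[s:e], x_prev = seg, seg[-1]
    return x

def afhc(y, w, beta):
    """Averaging FHC: average w+1 FHC trajectories with different phase offsets."""
    return np.mean([fhc(y, w, beta, offset=r) for r in range(w + 1)], axis=0)

# toy usage: track a noisy sinusoid, using the true values as (perfect) predictions
rng = np.random.default_rng(0)
y = np.sin(np.linspace(0.0, 6.0, 60)) + 0.1 * rng.standard_normal(60)
x = afhc(y, w=4, beta=1.0)
```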

34 Special Case: Iterative Soft-Thresholding Algorithm Assume the loss L can be decomposed into differentiable and non-differentiable components: L(w) = G(w) + H(w). Example: L(w) = Σ_{(x_i, y_i)∈S} l(x_i, y_i, w) + λ‖w‖₁, where G(w) = Σ_{(x_i, y_i)∈S} l(x_i, y_i, w) is differentiable and H(w) = λ‖w‖₁ is non-differentiable.

35 Special Case: Iterative Soft-Thresholding Algorithm L(w) = G(w) + H(w) = Σ_{(x_i, y_i)∈S} l(x_i, y_i, w) + λ‖w‖₁. Solve the differentiable part G with a gradient step: v_t = w_{t−1} − η_t ∇_w G(w)|_{w = w_{t−1}}. For the non-differentiable part H, add a proximity term that keeps w close to v_t and perform the minimization: w_t = argmin_w [ (1/2)‖w − v_t‖₂² + η_t λ‖w‖₁ ], which is solved coordinate-wise by soft-thresholding v_t at level η_t λ.

36 Special Case: Iterative Soft-Thresholding Algorithm
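A minimal sketch of iterative soft-thresholding for the lasso-style example above (batch form; the design matrix X, targets y, and step size η here are illustrative assumptions):

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||w||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(X, y, lam, eta, n_iters=500):
    """Minimize 0.5 * ||X w - y||^2 + lam * ||w||_1 by alternating a gradient step
    on the smooth part G with soft-thresholding for the l1 part H."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)            # gradient of the differentiable part G
        v = w - eta * grad                  # gradient-descent step
        w = soft_threshold(v, eta * lam)    # argmin_w (1/2)||w - v||^2 + eta*lam*||w||_1
    return w

# toy usage; eta must be <= 1 / (largest eigenvalue of X.T @ X) for ISTA to converge
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal(50)
w_hat = ista(X, y, lam=0.1, eta=1.0 / np.linalg.norm(X, 2) ** 2)
```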

37 Generalizing OCO

38 Generalizations What if our iterates can break the rules (constraints) sometimes? If the weight vectors need only satisfy the convex constraints on average in the long run, lower-regret algorithms exist. (Jenatton, et al.)

39 Generalizations What if our decisions can't change too quickly? Not only is the regret worse under these 'ramp constraints', but learners must be designed so as not to constrain their future actions. (Badiei, Li, Wierman)

40 Generalizations What if our loss functions aren't convex? Often OCO techniques still work, especially when training deep learners. (Balduzzi)

41 Generalizations What if we, or our experts, don't have to play each round? Often OCO techniques still work. (Balduzzi)

42 Generalizations What if we have a whole network of online learners that can communicate? Even when communication is only local, low-regret algorithms exist. (Koppel, et al.)

43 Generalizations: Dynamically-Varying Environment What if the underlying environment varies over time? To improve performance, dynamically model the environment (Hall 2013, Hall 2014). Dynamic Mirror Descent (DMD): incorporates dynamical-model state updates. Dynamic Fixed Share: selects a dynamical model from a family of candidates at each time step. DMD tracking regret bound: with Φ_t the dynamical model and regret taken with respect to the comparator sequence {θ_t}, Regret_T({θ_t}) = O( √T [1 + Σ_t ‖θ_{t+1} − Φ_t(θ_t)‖] ). Note: the regret scales with the deviation of {θ_t} from the dynamical model Φ_t.
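A minimal Euclidean sketch of the DMD idea (my own simplification: the callables in grad_fns and dynamics are assumed interfaces, and the full algorithm of Hall and Willett uses a general Bregman divergence rather than the plain gradient step shown here):

```python
import numpy as np

def dynamic_mirror_descent(grad_fns, dynamics, theta0, eta):
    """Euclidean sketch of Dynamic Mirror Descent: take a gradient step on the
    observed loss, then push the iterate through the dynamical model Phi_t."""
    theta = np.asarray(theta0, dtype=float)
    iterates = [theta.copy()]
    for grad, phi in zip(grad_fns, dynamics):
        theta_tilde = theta - eta * grad(theta)   # mirror/gradient step
        theta = phi(theta_tilde)                  # dynamical-model state update
        iterates.append(theta.copy())
    return iterates
```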

44 Generalizations: Bandit Setting (Agarwal 2010) Bandit setting: at each time t, we only find out l_t(x_t), not all of l_t. Completely adaptive adversary: chooses l_t knowing x_1, …, x_t; regret is Ω(T), i.e. at least order T. Adaptive adversary: chooses l_t knowing x_1, …, x_{t−1}; regret is Ω(√T), and regret is O(√T) with linear loss functions (with high probability). Compare with regret for the full-information case: O(√T) for convex Lipschitz and smooth loss functions, O(log T) for strongly convex and smooth loss functions.

45 Generalizations: Multi-Point Bandit Setting Player queries each loss function at k randomized points. Take expected regret with respect to the player's randomness: Regret = E[ (1/k) Σ_{t=1}^{T} Σ_{i=1}^{k} l_t(y_{t,i}) ] − min_{x∈K} E[ Σ_{t=1}^{T} l_t(x) ]. Regret bounds (convex Lipschitz and smooth losses / strongly convex and smooth losses):
Multi-point, k = 2, adaptive adversary: O(√T) (with high probability) / O(log T) (expected).
Multi-point, k = d + 1, completely adaptive adversary: O(√T) (deterministic) / O(log T) (deterministic).
Full information: O(√T) (deterministic) / O(log T) (deterministic).
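To illustrate how multi-point feedback gets used, here is a simplified sketch in the spirit of the k = 2 case (the two-point gradient estimate fed into projected gradient descent); the parameter choices, ball projection, and interface are my assumptions, not the paper's exact construction.

```python
import numpy as np

def two_point_bandit_ogd(loss_fns, d, eta, delta, radius=1.0, seed=0):
    """Bandit OCO with k = 2 queries per round: estimate a gradient from two
    function values along a random direction, then take a projected gradient step."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    for loss in loss_fns:
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)                            # random unit direction
        g = (d / (2 * delta)) * (loss(x + delta * u) - loss(x - delta * u)) * u
        x = x - eta * g                                   # gradient step with the estimate
        nrm = np.linalg.norm(x)
        if nrm > radius:                                  # project back onto the ball
            x = x * (radius / nrm)
    return x
```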

46 Summary The online convex optimization framework captures a huge part of online learning. Lots of algorithms we already know are OCO, for instance SGD. Follow-the-Regularized-Leader, and its variants, all perform well in OCO. Generalizations of OCO are widespread, and rely on OCO algorithms.

47 Questions?

48 References
Agarwal, Alekh, Ofer Dekel, and Lin Xiao. "Optimal Algorithms for Online Convex Optimization with Multi-Point Bandit Feedback." COLT 2010.
Andrew, L., Barman, S., Ligett, K., Lin, M., Meyerson, A., Roytman, A., and Wierman, A. "A Tale of Two Metrics: Simultaneous Bounds on Competitiveness and Regret." ACM SIGMETRICS Performance Evaluation Review, Vol. 41, No. 1, 2013.
Badiei, Masoud, Na Li, and Adam Wierman. "Online Convex Optimization with Ramp Constraints."
Balduzzi, David. "Deep Online Convex Optimization by Putting Forecaster to Sleep." arXiv preprint, 2015.
Chen, N., Agarwal, A., Wierman, A., Barman, S., and Andrew, L. L. "Online Convex Optimization Using Predictions." Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, 2015.
Hall, Eric C., and Rebecca M. Willett. "Online Convex Optimization in Dynamic Environments." IEEE Journal of Selected Topics in Signal Processing 9.4, 2015.
Hall, Eric C., and Rebecca M. Willett. "Dynamical Models and Tracking Regret in Online Convex Programming." Proceedings of the 30th International Conference on Machine Learning, 2013.
Hazan, Elad, Amit Agarwal, and Satyen Kale. "Logarithmic Regret Algorithms for Online Convex Optimization." Machine Learning, 2007.
Jenatton, Rodolphe, Jim Huang, and Cédric Archambeau. "Adaptive Algorithms for Online Convex Optimization with Long-term Constraints." arXiv preprint, 2015.
Koppel, Alec, Felicia Y. Jakubiec, and Alejandro Ribeiro. "A Saddle Point Algorithm for Networked Online Convex Optimization." IEEE Transactions on Signal Processing, 2015.
Yue, Yisong. Advanced Optimization Notes. Course notes.

49 A Couple Back-Up Slides

50 Strongly Convex Function
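For reference, the standard definition (my addition, since the slide content itself was a figure): a differentiable function f is λ-strongly convex over S if

```latex
f(y) \;\ge\; f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{\lambda}{2}\,\lVert y - x \rVert_2^2
\qquad \text{for all } x, y \in S.
```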

51 Eliminating the Time Horizon Dependence: The Doubling Trick For m = 0, 1, 2, …, ⌈log₂ T⌉: run the algorithm on the 2^m rounds t = 2^m, …, min(2^{m+1} − 1, T). m = 0: run for t = 1. m = 1: run for t = 2, 3. m = 2: run for t = 4, …, 7. Etc.
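A small sketch of the block schedule (the helper name is mine); the algorithm is restarted at the start of each block with horizon 2^m.

```python
def doubling_schedule(T):
    """Blocks used by the doubling trick: rounds 2^m through min(2^(m+1) - 1, T)."""
    blocks, m = [], 0
    while 2 ** m <= T:
        blocks.append((2 ** m, min(2 ** (m + 1) - 1, T)))
        m += 1
    return blocks

# doubling_schedule(10) -> [(1, 1), (2, 3), (4, 7), (8, 10)]
```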

52 Eliminating the Time Horizon Dependence: The Doubling Trick Assume the algorithm's regret on each block of 2^m rounds is bounded by α√(2^m). Then Regret_T(u) ≤ Σ_{m=0}^{⌈log₂ T⌉} α√(2^m) = α Σ_{m=0}^{⌈log₂ T⌉} (√2)^m = α ((√2)^{⌈log₂ T⌉ + 1} − 1)/(√2 − 1) ≤ α (√2/(√2 − 1)) √(2T) = (2/(√2 − 1)) α √T. So the regret bound only worsens by a constant multiplicative factor.
