Advanced Machine Learning


1 Advanced Machine Learning: Online Convex Optimization
Mehryar Mohri, Courant Institute & Google Research.

2 Outline
Online projected subgradient descent.
Exponentiated Gradient (EG).
Mirror descent.
Dual averaging.

3 Set-Up
Convex set $C$. For $t = 1$ to $T$: predict $w_t \in C$; receive convex loss function $f_t \colon C \to \mathbb{R}$; incur loss $f_t(w_t)$.
Regret of algorithm $A$:
\[
  R_T(A) = \sum_{t=1}^T f_t(w_t) - \inf_{w \in C} \sum_{t=1}^T f_t(w).
\]
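To make the protocol concrete, here is a minimal Python sketch of one pass of the interaction and of the regret computation; the `learner` interface (`predict`/`update`) and the fixed `comparator` argument are illustrative assumptions of this sketch, not notation from the slides.

```python
def run_oco(learner, losses, comparator):
    """One pass of the online convex optimization protocol.
    `losses` is a list of (f_t, grad_f_t) pairs; `comparator` is a fixed
    point of C used in place of the infimum when estimating the regret."""
    cum_loss = 0.0
    cum_comp = 0.0
    for f, grad in losses:
        w_t = learner.predict()          # predict w_t in C
        cum_loss += f(w_t)               # incur loss f_t(w_t)
        learner.update(grad(w_t))        # receive (a subgradient of) f_t
        cum_comp += f(comparator)
    return cum_loss - cum_comp           # regret against this comparator
```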

4 Online Projected Subgradient Descent
Algorithm: $w_1 \in C$ arbitrary;
\[
  w_{t+1} = \Pi_C\big[w_t - \eta\, \nabla f_t(w_t)\big],
\]
where $\Pi_C$ is the projection onto $C$, $\eta > 0$ is a parameter, and $\nabla f_t(w_t)$ is a subgradient of $f_t$ at $w_t$.
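A minimal NumPy sketch of the update, taking $C$ to be the Euclidean unit ball for illustration so that the projection has a simple closed form; the function names and the `subgradients` callback are assumptions of this sketch, not part of the slides.

```python
import numpy as np

def project_l2_ball(w, radius=1.0):
    """Euclidean projection onto the L2 ball of the given radius,
    standing in for the generic projection Pi_C onto a convex set C."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def psgd(subgradients, dim, eta, T, project=project_l2_ball):
    """Online projected subgradient descent: w_{t+1} = Pi_C[w_t - eta * g_t].
    `subgradients[t](w)` should return a subgradient of f_t at w."""
    w = np.zeros(dim)              # w_1 in C (arbitrary)
    iterates = [w.copy()]
    for t in range(T):
        g = subgradients[t](w)     # subgradient of f_t at w_t
        w = project(w - eta * g)   # gradient step followed by projection
        iterates.append(w.copy())
    return iterates
```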

5 Analysis
Assumptions: $\|w_1 - w^*\| \le R$ where $w^* \in \operatorname{argmin}_{w \in C} \sum_{t=1}^T f_t(w)$, and $\|\nabla f_t(w_t)\| \le G$.
Theorem (Zinkevich, 2003): the regret of online projected subgradient descent (PSGD) is bounded as follows:
\[
  R_T(\mathrm{PSGD}) \le \frac{R^2}{2\eta} + \frac{\eta\, G^2 T}{2}.
\]
Choosing $\eta$ to minimize the bound gives $R_T(\mathrm{PSGD}) \le R\, G\, \sqrt{T}$.
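For completeness, the step-size choice behind the $RG\sqrt{T}$ bound is the following one-line minimization over $\eta$ (standard calculus, not spelled out on the slide):
\[
  \frac{d}{d\eta}\Big(\frac{R^2}{2\eta} + \frac{\eta G^2 T}{2}\Big)
  = -\frac{R^2}{2\eta^2} + \frac{G^2 T}{2} = 0
  \;\Longrightarrow\; \eta^* = \frac{R}{G\sqrt{T}},
  \qquad
  \frac{R^2}{2\eta^*} + \frac{\eta^* G^2 T}{2} = R\,G\,\sqrt{T}.
\]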

6 Proof
The proof uses the definition of the subgradient and the contraction property of the projection:
\[
\begin{aligned}
R_T(\mathrm{PSGD})
&= \sum_{t=1}^T f_t(w_t) - f_t(w^*)\\
&\le \sum_{t=1}^T \nabla f_t(w_t) \cdot (w_t - w^*) &&\text{(def. of subgradient)}\\
&= \sum_{t=1}^T \frac{1}{2\eta}\Big[\|w_t - w^*\|^2 + \eta^2 \|\nabla f_t(w_t)\|^2 - \|w_t - \eta\, \nabla f_t(w_t) - w^*\|^2\Big]\\
&\le \sum_{t=1}^T \frac{1}{2\eta}\Big[\|w_t - w^*\|^2 + \eta^2 G^2 - \|w_{t+1} - w^*\|^2\Big] &&\text{(prop. of projection)}\\
&= \frac{1}{2\eta}\Big[\|w_1 - w^*\|^2 - \|w_{T+1} - w^*\|^2\Big] + \frac{\eta\, G^2 T}{2}\\
&\le \frac{1}{2\eta}\,\|w_1 - w^*\|^2 + \frac{\eta\, G^2 T}{2}
\le \frac{R^2}{2\eta} + \frac{\eta\, G^2 T}{2}.
\end{aligned}
\]

7 Strong Convexity
Definition: a convex function $f$ defined over a convex set $C$ is $\lambda$-strongly convex with respect to a norm $\|\cdot\|$ if the function $w \mapsto f(w) - \frac{\lambda}{2}\|w\|^2$ is convex or, equivalently, if for all $w, w'$ in $C$,
\[
  f(w') \ge f(w) + \langle \nabla f(w),\, w' - w\rangle + \frac{\lambda}{2}\|w' - w\|^2.
\]
(Figure: $f$ lies above its first-order approximation $f(w) + \langle\nabla f(w), w' - w\rangle$ by at least the quadratic term.)

8 Strongly Convex Objectives
Theorem (Hazan et al., 2007): assume that the functions $f_t$ are $\lambda$-strongly convex and that $\|\nabla f_t(w)\| \le G$ for all $w$ and $t$. Then, the regret of online projected subgradient descent (PSGD) with parameter $\eta_{t+1} = \frac{1}{\lambda t}$ is bounded as follows:
\[
  R_T(\mathrm{PSGD}) \le \frac{G^2}{2\lambda}\,(1 + \log T).
\]
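A sketch of the corresponding variant with the decaying step size, again using the unit Euclidean ball as a stand-in for $C$; the interface mirrors the earlier PSGD sketch and is an assumption of this sketch, not part of the slides.

```python
import numpy as np

def psgd_strongly_convex(subgradients, dim, lam, T):
    """PSGD with the decaying step size eta_{t+1} = 1 / (lam * t) used in the
    logarithmic-regret theorem for lam-strongly convex losses. The projection
    onto the unit L2 ball stands in for the generic projection Pi_C."""
    def project(w):
        n = np.linalg.norm(w)
        return w if n <= 1.0 else w / n

    w = np.zeros(dim)                       # w_1 in C (arbitrary)
    for t in range(1, T + 1):
        g = subgradients[t - 1](w)          # subgradient of f_t at w_t
        w = project(w - g / (lam * t))      # step size 1/(lam * t)
    return w
```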

9 Proof
\[
\begin{aligned}
R_T(\mathrm{PSGD})
&= \sum_{t=1}^T f_t(w_t) - f_t(w^*)\\
&\le \sum_{t=1}^T \Big[\nabla f_t(w_t)\cdot(w_t - w^*) - \tfrac{\lambda}{2}\|w_t - w^*\|^2\Big] &&\text{(strong convexity)}\\
&= \sum_{t=1}^T \Big[\tfrac{1}{2\eta_{t+1}}\big(\|w_t - w^*\|^2 + \eta_{t+1}^2\|\nabla f_t(w_t)\|^2 - \|w_t - \eta_{t+1}\nabla f_t(w_t) - w^*\|^2\big) - \tfrac{\lambda}{2}\|w_t - w^*\|^2\Big]\\
&\le \sum_{t=1}^T \Big[\tfrac{1}{2\eta_{t+1}}\big(\|w_t - w^*\|^2 + \eta_{t+1}^2 G^2 - \|w_{t+1} - w^*\|^2\big) - \tfrac{\lambda}{2}\|w_t - w^*\|^2\Big] &&\text{(prop. of proj.)}\\
&= \frac{\lambda}{2}\sum_{t=1}^T \Big[(t-1)\|w_t - w^*\|^2 - t\,\|w_{t+1} - w^*\|^2\Big] + \frac{G^2}{2\lambda}\sum_{t=1}^T \frac{1}{t} &&\text{(def. of } \eta_{t+1})\\
&= -\frac{\lambda}{2}\,T\,\|w_{T+1} - w^*\|^2 + \frac{G^2}{2\lambda}\sum_{t=1}^T \frac{1}{t}
\le \frac{G^2}{2\lambda}\sum_{t=1}^T \frac{1}{t}
\le \frac{G^2}{2\lambda}\,(1 + \log T).
\end{aligned}
\]

10 Smoothness
Definition: a continuously differentiable function $f$ is $\beta$-smooth if its gradient is $\beta$-Lipschitz: $\|\nabla f(w') - \nabla f(w)\| \le \beta\,\|w' - w\|$ for all $w, w'$.
Property: if $f$ is convex and $\beta$-smooth, then, for all $w, w'$,
\[
  0 \le f(w) - f(w') - \nabla f(w')\cdot(w - w') \le \frac{\beta}{2}\,\|w - w'\|^2.
\]

11 Exponentiated Gradient (EG)
Convex set: the simplex $C = \{w \in \mathbb{R}^N \colon w \ge 0 \,\wedge\, \|w\|_1 = 1\}$.
Algorithm (Kivinen and Warmuth, 1997): $w_1 = \big(\tfrac{1}{N}, \ldots, \tfrac{1}{N}\big)$ and
\[
  w_{t+1,i} = \frac{w_{t,i}\, \exp\big(-\eta\, [\nabla f_t(w_t)]_i\big)}{Z_t},
  \qquad
  Z_t = \sum_{i=1}^N w_{t,i}\, e^{-\eta\, [\nabla f_t(w_t)]_i}.
\]
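A minimal NumPy sketch of the multiplicative update; the `gradients` callback and the function names are assumptions of this sketch, not notation from the slides.

```python
import numpy as np

def eg_update(w, grad, eta):
    """One Exponentiated Gradient step: w_{t+1,i} proportional to
    w_{t,i} * exp(-eta * [grad f_t(w_t)]_i), renormalized by Z_t."""
    v = w * np.exp(-eta * grad)
    return v / v.sum()

def eg(gradients, N, eta, T):
    """Run EG from the uniform point w_1 = (1/N, ..., 1/N) on the simplex;
    `gradients[t](w)` should return the gradient of f_t at w."""
    w = np.full(N, 1.0 / N)
    for t in range(T):
        w = eg_update(w, gradients[t](w), eta)
    return w
```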

12 Analysis
Assumption: $\|\nabla f_t(w_t)\|_\infty \le G_\infty$.
Theorem: the regret of the Exponentiated Gradient (EG) algorithm is bounded as follows:
\[
  R_T(\mathrm{EG}) \le \frac{\log N}{\eta} + \frac{\eta\, G_\infty^2\, T}{2}.
\]
Choosing $\eta$ to minimize the bound gives $R_T(\mathrm{EG}) \le G_\infty \sqrt{2\, T \log N}$.

13 Proof
Potential: $\Phi_t = D(w^* \,\|\, w_t) = \sum_{i=1}^N w_i^* \log \frac{w_i^*}{w_{t,i}}$ (relative entropy). Then
\[
\begin{aligned}
\Phi_{t+1} - \Phi_t
&= \sum_{i=1}^N w_i^* \log \frac{w_{t,i}}{w_{t+1,i}}
 = \sum_{i=1}^N w_i^* \Big[\log Z_t + \eta\, [\nabla f_t(w_t)]_i\Big]
 = \log Z_t + \eta\, w^* \cdot \nabla f_t(w_t),\\
\log Z_t
&= \log \sum_{i=1}^N w_{t,i}\, e^{-\eta [\nabla f_t(w_t)]_i}
 = \log \operatorname*{E}_{i\sim w_t}\big[e^{-\eta [\nabla f_t(w_t)]_i}\big]\\
&\le -\eta \operatorname*{E}_{i\sim w_t}\big[[\nabla f_t(w_t)]_i\big] + \frac{\eta^2\, 4 G_\infty^2}{8}
 = -\eta\, w_t \cdot \nabla f_t(w_t) + \frac{\eta^2 G_\infty^2}{2}
 &&\text{(Hoeffding's ineq.)}
\end{aligned}
\]

14 Proof
Combining the equality and the inequality:
\[
\begin{aligned}
&\Phi_{t+1} - \Phi_t \le \frac{\eta^2 G_\infty^2}{2} + \eta\, (w^* - w_t) \cdot \nabla f_t(w_t)\\
\Rightarrow\;& \eta\, (w_t - w^*) \cdot \nabla f_t(w_t) \le \frac{\eta^2 G_\infty^2}{2} + (\Phi_t - \Phi_{t+1})\\
\Rightarrow\;& \eta \sum_{t=1}^T (w_t - w^*) \cdot \nabla f_t(w_t) \le \frac{\eta^2 G_\infty^2\, T}{2} + \Phi_1 - \Phi_{T+1}\\
\Rightarrow\;& R_T(\mathrm{EG}) = \sum_{t=1}^T f_t(w_t) - f_t(w^*)
 \le \sum_{t=1}^T \nabla f_t(w_t) \cdot (w_t - w^*)\\
&\qquad \le \frac{\eta\, G_\infty^2\, T}{2} + \frac{\Phi_1 - \Phi_{T+1}}{\eta}
 \le \frac{\eta\, G_\infty^2\, T}{2} + \frac{\Phi_1}{\eta} &&\text{(rel. entropy non-neg.)}\\
&\qquad = \frac{\eta\, G_\infty^2\, T}{2} + \frac{D(w^* \,\|\, w_1)}{\eta}
 \le \frac{\eta\, G_\infty^2\, T}{2} + \frac{\log N}{\eta}.
\end{aligned}
\]

15 Convex Optimization
Application: fixed loss function $f_t = f$, objective $\min_{w \in C} f(w)$.
Guarantee for the average weight vector $\bar{w}_T = \frac{1}{T}\sum_{t=1}^T w_t$ (using convexity of $f$ for the first inequality):
\[
  f(\bar{w}_T) - f(w^*)
  \le \frac{1}{T}\sum_{t=1}^T \big[f(w_t) - f(w^*)\big]
  = \frac{R_T(A)}{T}
  = O\Big(\frac{1}{\sqrt{T}}\Big).
\]
Thus, convergence to an $\epsilon$-accurate solution in $O(1/\epsilon^2)$ iterations.
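A hedged sketch of this online-to-batch use: the `online_step(w, g)` callable is a placeholder for one step of any no-regret algorithm (for example, a projected-subgradient step), not a specific method from the slides.

```python
import numpy as np

def minimize_fixed_f(online_step, grad_f, w1, T):
    """Online-to-batch conversion: run a no-regret online algorithm on the
    fixed loss f_t = f for T rounds and return the average iterate w_bar;
    by convexity, f(w_bar) - f(w*) <= (1/T) * sum_t [f(w_t) - f(w*)] = R_T / T."""
    w = w1.copy()
    avg = np.zeros_like(w1)
    for _ in range(T):
        avg += w                        # accumulate w_1, ..., w_T
        w = online_step(w, grad_f(w))   # one step of the online algorithm
    return avg / T
```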

16 Generalization
PSGD and EG are both special instances of a more general algorithm: Mirror Descent. Mirror Descent is based on a Bregman divergence:
PSGD: $B(w \,\|\, w') = \frac{1}{2}\|w - w'\|_2^2$.
EG: unnormalized relative entropy, $B(w \,\|\, w') = \sum_{i=1}^N \big[w_i \log\frac{w_i}{w'_i} - w_i + w'_i\big]$.
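The two divergences can be written out directly; a small sketch (function names are mine, not from the slides):

```python
import numpy as np

def bregman_sq_euclidean(w, w0):
    """B(w || w0) for Phi(w) = 0.5 * ||w||_2^2: the divergence underlying PSGD."""
    return 0.5 * np.sum((w - w0) ** 2)

def bregman_unnorm_rel_entropy(w, w0):
    """Unnormalized relative entropy, the divergence underlying EG
    (entries are assumed strictly positive)."""
    return np.sum(w * np.log(w / w0) - w + w0)
```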

17 Bregman Divergence
Definition: let $\Phi$ be a convex differentiable function over an open convex set $C$. The Bregman divergence associated to $\Phi$ is defined by
\[
  B_\Phi(w \,\|\, w') = \Phi(w) - \Phi(w') - \langle \nabla \Phi(w'),\, w - w'\rangle.
\]
(Figure: $B_\Phi(w \,\|\, w')$ is the gap between $\Phi(w)$ and the first-order approximation $\Phi(w') + \langle\nabla\Phi(w'), w - w'\rangle$.)

18 Properties
Proposition: the following properties hold for a Bregman divergence.
Non-negativity: $\forall w, w' \in C$, $B_\Phi(w \,\|\, w') \ge 0$.
Linearity: $B_{\Phi + \Psi} = B_\Phi + B_\Psi$.
Projection: for any closed convex set $K \subseteq C$, the $\Phi$-projection of $w'$ onto $K$, $P_K(w') = \operatorname{argmin}_{w \in K} B_\Phi(w \,\|\, w')$, is unique.
Triangular identity: $\big(\nabla\Phi(w) - \nabla\Phi(v)\big) \cdot (w - u) = B_\Phi(u \,\|\, w) + B_\Phi(w \,\|\, v) - B_\Phi(u \,\|\, v)$.
Pythagorean theorem: for $w \in K$, $B_\Phi(w \,\|\, w') \ge B_\Phi(w \,\|\, P_K(w')) + B_\Phi(P_K(w') \,\|\, w')$.

19 Pythagorean Theorem
(Figure: $w \in K$, $w'$ outside $K$, and its projection $P_K(w')$.)
\[
  B_\Phi(w \,\|\, w') \ge B_\Phi(w \,\|\, P_K(w')) + B_\Phi(P_K(w') \,\|\, w').
\]

20 Legendre-Type Functions
Definition (Rockafellar, 1970): a real-valued function $\Phi$ defined over a non-empty open convex set $C$ is said to be of Legendre type if it is proper, closed, convex, and differentiable over $C$, and if one of the following equivalent conditions holds:
$\lim_{w \to \partial C} \|\nabla \Phi(w)\| = +\infty$;
$\nabla\Phi$ is a one-to-one mapping from $C$ to $\nabla\Phi(C)$.

21 Mirror Descent
(Nemirovski and Yudin, 1983)
(Figure: the mirror map $\nabla\Phi$ sends $w_t \in C$ to the dual space, where the step $\nabla\Phi(w_t) - \eta\,\nabla f_t(w_t) = \nabla\Phi(v_{t+1})$ is taken; $v_{t+1} = [\nabla\Phi]^{-1}(\cdot)$ is mapped back to $C$ and Bregman-projected onto $K$ to give $w_{t+1}$.)

22 Mirror Descent
Mirror-Descent()
1  $w_1 \leftarrow \operatorname{argmin}_{w \in K \cap C} \Phi(w)$
2  for $t \leftarrow 1$ to $T$ do
3      $v_{t+1} \leftarrow [\nabla\Phi]^{-1}\big(\nabla\Phi(w_t) - \eta\, \nabla f_t(w_t)\big)$
4      $w_{t+1} \leftarrow \operatorname{argmin}_{w \in K \cap C} B(w \,\|\, v_{t+1})$
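A minimal sketch of the generic scheme instantiated with the negative-entropy mirror map on the simplex, where every step has a closed form; with this choice the update reduces to the EG update, which makes it a convenient instance to check the scheme against. The names and the `gradients` callback are assumptions of this sketch.

```python
import numpy as np

def mirror_descent_simplex(gradients, N, eta, T):
    """Mirror descent with the negative-entropy mirror map
    Phi(w) = sum_i w_i log w_i on the probability simplex."""
    w = np.full(N, 1.0 / N)            # w_1 = argmin of Phi over the simplex
    for t in range(T):
        g = gradients[t](w)            # (sub)gradient of f_t at w_t
        theta = np.log(w) - eta * g    # dual step: grad Phi(w_t) - eta*g (up to a constant)
        v = np.exp(theta)              # map back with [grad Phi]^{-1} (up to a constant)
        w = v / v.sum()                # Bregman (KL) projection onto the simplex
    return w
```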

23 MD Guarantee
Theorem: let $C$ be a non-empty open convex set and $K$ a compact convex set. Assume that $\Phi \colon C \to \mathbb{R}$ is of Legendre type and $\sigma$-strongly convex with respect to $\|\cdot\|$, and that each $f_t$ is convex and $G$-Lipschitz with respect to $\|\cdot\|$. Then, the regret of Mirror Descent can be bounded as follows:
\[
  R_T(\mathrm{MD}) \le \frac{B(w^* \,\|\, w_1)}{\eta} + \frac{\eta\, G^2 T}{2\sigma}.
\]
Choosing $\eta$ to minimize the bound gives $R_T(\mathrm{MD}) \le D\, G\, \sqrt{\frac{2T}{\sigma}}$, with $B(w^* \,\|\, w_1) \le D^2$.

24 Proof
\[
\begin{aligned}
R_T(\mathrm{MD})
&= \sum_{t=1}^T f_t(w_t) - f_t(w^*)\\
&\le \sum_{t=1}^T \nabla f_t(w_t)\cdot(w_t - w^*) &&\text{(def. of subgrad.)}\\
&= \frac{1}{\eta}\sum_{t=1}^T \big[\nabla\Phi(w_t) - \nabla\Phi(v_{t+1})\big]\cdot(w_t - w^*) &&\text{(def. of } v_{t+1})\\
&= \frac{1}{\eta}\sum_{t=1}^T \Big[B(w^* \,\|\, w_t) - B(w^* \,\|\, v_{t+1}) + B(w_t \,\|\, v_{t+1})\Big] &&\text{(Breg. div. identity)}\\
&\le \frac{1}{\eta}\sum_{t=1}^T \Big[B(w^* \,\|\, w_t) - B(w^* \,\|\, w_{t+1}) - B(w_{t+1} \,\|\, v_{t+1}) + B(w_t \,\|\, v_{t+1})\Big] &&\text{(Pythagorean ineq.)}\\
&= \frac{1}{\eta}\Big[B(w^* \,\|\, w_1) - B(w^* \,\|\, w_{T+1})\Big] + \frac{1}{\eta}\sum_{t=1}^T \Big[B(w_t \,\|\, v_{t+1}) - B(w_{t+1} \,\|\, v_{t+1})\Big]\\
&\le \frac{1}{\eta}\, B(w^* \,\|\, w_1) + \frac{1}{\eta}\sum_{t=1}^T \Big[B(w_t \,\|\, v_{t+1}) - B(w_{t+1} \,\|\, v_{t+1})\Big].
\end{aligned}
\]

25 Proof
\[
\begin{aligned}
B(w_t \,\|\, v_{t+1}) - B(w_{t+1} \,\|\, v_{t+1})
&= \Phi(w_t) - \Phi(w_{t+1}) - \nabla\Phi(v_{t+1})\cdot(w_t - w_{t+1})\\
&\le \big[\nabla\Phi(w_t) - \nabla\Phi(v_{t+1})\big]\cdot(w_t - w_{t+1}) - \frac{\sigma}{2}\|w_t - w_{t+1}\|^2 &&\text{($\sigma$-strong convexity)}\\
&= \eta\, \nabla f_t(w_t)\cdot(w_t - w_{t+1}) - \frac{\sigma}{2}\|w_t - w_{t+1}\|^2 &&\text{(def. of } v_{t+1})\\
&\le \eta\, G\, \|w_t - w_{t+1}\| - \frac{\sigma}{2}\|w_t - w_{t+1}\|^2 &&\text{($G$-Lipschitzness)}\\
&\le \frac{(\eta G)^2}{2\sigma}. &&\text{(max. of 2nd-degree polynomial)}
\end{aligned}
\]

26 Equivalent Description
Mirror-Descent()
1  $w_1 \leftarrow \operatorname{argmin}_{w \in K \cap C} \Phi(w)$
2  for $t \leftarrow 1$ to $T - 1$ do
3      $w_{t+1} \leftarrow \operatorname{argmin}_{w \in K \cap C}\; \nabla f_t(w_t) \cdot w + \frac{1}{\eta}\, B(w \,\|\, w_t)$
(linearization of $f_t$ plus Bregman regularization)
Proof:
\[
\begin{aligned}
w_{t+1}
&= \operatorname*{argmin}_{w \in K \cap C} B(w \,\|\, v_{t+1})\\
&= \operatorname*{argmin}_{w \in K \cap C} \Phi(w) - \nabla\Phi(v_{t+1}) \cdot w &&\text{(def. of Breg. div.)}\\
&= \operatorname*{argmin}_{w \in K \cap C} \Phi(w) - \big[\nabla\Phi(w_t) - \eta\,\nabla f_t(w_t)\big]\cdot w &&\text{(def. of } v_{t+1})\\
&= \operatorname*{argmin}_{w \in K \cap C} \eta\,\nabla f_t(w_t)\cdot w + B(w \,\|\, w_t). &&\text{(def. of Breg. div.)}
\end{aligned}
\]

27 Dual Averaging
(Iouditski and Nesterov, 2010)
Dual-Averaging()
1  $v_1 \leftarrow \operatorname{argmin}_{w \in C} \Phi(w)$  (so that $\nabla\Phi(v_1) = 0$)
2  $w_1 \leftarrow \operatorname{argmin}_{w \in K \cap C} B(w \,\|\, v_1)$
3  for $t \leftarrow 1$ to $T$ do
4      $v_{t+1} \leftarrow [\nabla\Phi]^{-1}\big(\nabla\Phi(v_t) - \eta\, \nabla f_t(w_t)\big)$
5      $w_{t+1} \leftarrow \operatorname{argmin}_{w \in K \cap C} B(w \,\|\, v_{t+1})$
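A matching sketch of Dual Averaging with the same entropic mirror map on the simplex as in the Mirror Descent sketch; the only change is that the dual iterate accumulates the past gradients instead of being recomputed from $w_t$. Names and interface are assumptions of this sketch.

```python
import numpy as np

def dual_averaging_simplex(gradients, N, eta, T):
    """Dual averaging with the negative-entropy mirror map on the simplex.
    `theta` plays the role of grad Phi(v_t) up to an additive constant
    (the constant cancels when we normalize), so the dual iterate simply
    accumulates -eta times the past gradients; w_{t+1} is the Bregman (KL)
    projection of v_{t+1} onto the simplex, i.e. a normalization."""
    theta = np.zeros(N)
    w = np.full(N, 1.0 / N)        # w_1: projection of v_1 onto the simplex
    for t in range(T):
        g = gradients[t](w)        # gradient of f_t at w_t
        theta -= eta * g           # dual-space update (uses v_t, not w_t)
        v = np.exp(theta)
        w = v / v.sum()
    return w
```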

28 Equivalent Description
Equivalent form:
\[
\begin{aligned}
w_{t+1}
&= \operatorname*{argmin}_{w \in K \cap C} B(w \,\|\, v_{t+1})\\
&= \operatorname*{argmin}_{w \in K \cap C} \Phi(w) - \nabla\Phi(v_{t+1}) \cdot w &&\text{(def. of Breg. div.)}\\
&= \operatorname*{argmin}_{w \in K \cap C} \Phi(w) - \big[\nabla\Phi(v_t) - \eta\,\nabla f_t(w_t)\big]\cdot w &&\text{(def. of } v_{t+1})\\
&= \operatorname*{argmin}_{w \in K \cap C} \eta \sum_{s=1}^{t} \nabla f_s(w_s)\cdot w + \Phi(w). &&\text{(recurrence)}
\end{aligned}
\]
In particular, for linear losses $f_t(w) = a_t \cdot w$, Dual Averaging coincides with regularized Follow-the-Leader (FTRL):
\[
  w_{t+1} = \operatorname*{argmin}_{w \in K \cap C} \sum_{s=1}^{t} a_s \cdot w + \frac{1}{\eta}\,\Phi(w).
\]
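As a concrete instance (not worked out on the slide): with the negative-entropy regularizer $\Phi(w) = \sum_i w_i \log w_i$ on the simplex, the regularized-FTL iterate above has the closed form
\[
  w_{t+1,i} = \frac{\exp\big(-\eta \sum_{s=1}^{t} a_{s,i}\big)}{\sum_{j=1}^{N} \exp\big(-\eta \sum_{s=1}^{t} a_{s,j}\big)},
\]
which matches the entropic dual-averaging sketch given earlier.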

29 DA Guarantee
Theorem: under the same assumptions as for MD, the following holds for the regret of Dual Averaging:
\[
  R_T(\mathrm{DA}) \le \frac{\Phi(w^*) - \Phi(w_1)}{\eta} + \frac{2\,\eta\, G^2 T}{\sigma}.
\]
Choosing $\eta$ to minimize the bound gives $R_T(\mathrm{DA}) \le 2 D\, G\, \sqrt{\frac{2T}{\sigma}}$, with $\Phi(w^*) - \Phi(w_1) \le D^2$.

30 References
Abraham Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA. SIAM, 2005.
Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. J. Comput. Syst. Sci., 74(1):97-114, 2008.
Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3), 2003.
Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3), 2001.

31 References
Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3), 2007.
Anatoli Iouditski and Yuri Nesterov. Primal-dual subgradient methods for minimizing uniformly convex functions. HAL preprint, 2010.
Adam T. Kalai and Santosh Vempala. Efficient algorithms for online decision problems. J. Comput. Syst. Sci., 71(3), 2005.
Jyrki Kivinen and Manfred K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1-64, 1997.
Jyrki Kivinen and Manfred K. Warmuth. Relative loss bounds for multidimensional regression problems. Machine Learning, 45(3), 2001.
Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.

32 References
Arkadii Semenovich Nemirovski and David Berkovich Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, 1983.
Eiji Takimoto and Manfred K. Warmuth. Path kernels and multiplicative updates. JMLR, 4, 2003.
Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.
