Tutorial: PART 2. Online Convex Optimization, A Game-Theoretic Approach to Learning


1 Tutorial: PART 2. Online Convex Optimization, A Game-Theoretic Approach to Learning. Elad Hazan, Princeton University; Satyen Kale, Yahoo Research.

2 Exploiting curvature: logarithmic regret

3 Logarithmic regret. Some natural problems admit better regret: least squares linear regression, soft-margin SVM, portfolio selection. $\alpha$-strong convexity (e.g. $f(x) = \|x\|^2$): $\nabla^2 f(x) \succeq \alpha I$. $\alpha$-exp-concavity (e.g. $f(x) = -\log\langle a, x\rangle$): $\nabla^2 f(x) \succeq \alpha\, \nabla f(x)\nabla f(x)^\top$.

4 Online Soft-Margin SVM. Features $a \in \mathbb{R}^d$, labels $b \in \{-1, 1\}$. Hypothesis class $H$ is a set of linear predictors, parameterized by weight vectors $x \in \mathbb{R}^d$; the prediction of weight vector $x$ for feature vector $a$ is $a^\top x$. Loss: $f_t(x) = \max(0, 1 - b_t\langle a_t, x\rangle) + \lambda\|x\|^2$, which is convex, non-smooth, and strongly convex. Thm [Hazan, Agarwal, Kale 2007]: OGD with step sizes $\eta_t \approx 1/t$ has $O(\log T)$ regret.
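As an illustration of this theorem, here is a minimal sketch (not from the tutorial itself) of OGD with $1/t$-type step sizes on the regularized hinge loss; the synthetic data stream, the regularization constant `lam`, and the unconstrained domain are assumptions made for the example.

```python
import numpy as np

def online_svm_ogd(stream, dim, lam=0.1):
    """OGD on f_t(x) = max(0, 1 - b_t * <a_t, x>) + lam * ||x||^2.

    The lam*||x||^2 term makes each loss 2*lam-strongly convex, so step
    sizes eta_t = 1 / (2 * lam * t) give the O(log T) regret of the theorem.
    """
    x = np.zeros(dim)
    total_loss = 0.0
    for t, (a, b) in enumerate(stream, start=1):
        margin = b * a.dot(x)
        total_loss += max(0.0, 1.0 - margin) + lam * x.dot(x)
        # Subgradient: hinge part plus gradient of the regularizer.
        g = (-b * a if margin < 1.0 else np.zeros(dim)) + 2.0 * lam * x
        x = x - g / (2.0 * lam * t)   # eta_t proportional to 1/t
    return x, total_loss

# Hypothetical usage on a synthetic stream:
rng = np.random.default_rng(0)
stream = [(rng.normal(size=5), float(rng.choice([-1, 1]))) for _ in range(1000)]
w, cum_loss = online_svm_ogd(stream, dim=5)
```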

5 Universal Portfolio Selection. Loss function $f_t(x) = -\log\langle r_t, x\rangle$. Convex, but with no strongly convex component. But exp-concave: $\exp(-f_t(x)) = \langle r_t, x\rangle$ is a concave (actually, linear) function. Definition: $f$ is called $\alpha$-exp-concave if $\exp(-\alpha f(x))$ is concave. Theorem [Hazan, Agarwal, Kale 2007]: for exp-concave losses, there is an algorithm, Online Newton Step, that obtains $O(\log T)$ regret.

6 Online Newton Step algorithm. Use $x_1 =$ an arbitrary point in $K$ in round 1. For $t > 1$, use $x_t$ defined by the following equations, writing $\nabla_\tau = \nabla f_\tau(x_\tau)$. Cumulative Hessian: $C_{t-1} = \sum_{\tau=1}^{t-1} \nabla_\tau \nabla_\tau^\top$. Cumulative gradient step: $g_{t-1} = \sum_{\tau=1}^{t-1} \left(\nabla_\tau \nabla_\tau^\top x_\tau - c\,\nabla_\tau\right)$, where $c$ is a constant depending on $\alpha$ and other bounds. Newton update: $y_t = C_{t-1}^{-1} g_{t-1}$. Generalized projection: $x_t = \arg\min_{x \in K} (x - y_t)^\top C_{t-1} (x - y_t)$.
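Below is a minimal sketch of one Online Newton Step iterate in the follow-the-leader form shown on this slide. It assumes the generalized projection can be skipped (i.e. $y_t$ already lies in $K$), and the regularization constant `eps` and the inputs are hypothetical; a full implementation would solve the quadratic projection onto $K$.

```python
import numpy as np

def ons_iterate(grads, xs, c, eps=1.0):
    """One ONS iterate: grads are past gradients, xs the past iterates,
    c the constant depending on the exp-concavity parameter alpha.

    A small multiple of the identity (eps) keeps C invertible; the final
    generalized projection onto K is omitted (K is taken to be all of R^d).
    """
    d = len(xs[0])
    C = eps * np.eye(d) + sum(np.outer(g, g) for g in grads)          # cumulative Hessian
    b = sum(np.outer(g, g).dot(x) - c * g for g, x in zip(grads, xs)) # cumulative gradient step
    y = np.linalg.solve(C, b)                                         # Newton update y_t = C^{-1} g
    # Full ONS would return argmin_{x in K} (x - y)^T C (x - y).
    return y
```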

7 Regularization: follow the regularized leader, online mirror descent.

8 Follow-the-leader. Most natural algorithm: $x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} f_i(x)$, with Regret $= \sum_t f_t(x_t) - \min_{x \in K} \sum_t f_t(x)$. The be-the-leader algorithm, $x'_t = \arg\min_{x \in K} \sum_{i=1}^{t} f_i(x) = x_{t+1}$, has zero regret [Kalai-Vempala, 2005]. So if $f_t(x_t) - f_t(x_{t+1})$ is small, we get a regret bound. Instability: $f_t(x_t) - f_t(x_{t+1})$ can be large!

9 Fixing FTL: Follow-The-Regularized-Leader (FTRL). First, it suffices to replace $f_t$ by a linear function, $\langle \nabla f_t(x_t), x\rangle$. Next, add regularization: $x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x)$, where $R(x)$ is a strongly convex function and $\eta$ is a regularization constant. Under certain conditions, this ensures stability: $\nabla f_t(x_t)^\top x_t - \nabla f_t(x_t)^\top x_{t+1} \le O(\eta)$.

10 Specific instances of FTRL. Take $R(x) = \|x\|^2$. Then $x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x) = \Pi_K\!\left(-\eta \sum_{i=1}^{t-1} \nabla f_i(x_i)\right)$ (absorbing constants into $\eta$), where $\Pi_K(y) = \arg\min_{x \in K} \|y - x\|$. Almost like OGD: starting with $y_1 = 0$, for $t = 1, 2, \ldots$: $x_t = \Pi_K(y_t)$, $y_{t+1} = y_t - \eta \nabla f_t(x_t)$.
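A minimal sketch of this "lazy" OGD view of Euclidean FTRL follows; the choice of $K$ as a Euclidean ball (so the projection has a closed form) and the step size are assumptions for the example.

```python
import numpy as np

def lazy_ogd(grad_fns, dim, eta=0.1, radius=1.0):
    """Euclidean FTRL ("lazy" OGD): accumulate gradients in y, project to play.

    K is taken to be the Euclidean ball of the given radius, a hypothetical
    choice for which the projection Pi_K has a closed form.
    """
    def project(y):
        nrm = np.linalg.norm(y)
        return y if nrm <= radius else y * (radius / nrm)

    y = np.zeros(dim)
    plays = []
    for grad in grad_fns:          # each grad(x) returns nabla f_t(x)
        x = project(y)             # x_t = Pi_K(y_t)
        plays.append(x)
        y = y - eta * grad(x)      # y_{t+1} = y_t - eta * nabla f_t(x_t)
    return plays
```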

11 Specific instances of FTRL. Experts setting: $K = \Delta_n$, distributions over experts, and $f_t(x) = \langle c_t, x\rangle$, where $c_t$ is the vector of losses. Take $R(x) = \sum_i x_i \log x_i$, the negative entropy. Then $x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x) = \exp\!\left(-\eta \sum_{i=1}^{t-1} c_i\right) / Z_t$ (entrywise exponential, with normalization constant $Z_t$). Exactly RWM! Starting with $y_1 =$ the uniform distribution, for $t = 1, 2, \ldots$: $x_t = \tilde\Pi_K(y_t)$ (projection = normalization), $y_{t+1} = y_t \odot \exp(-\eta c_t)$ (entrywise product with the entrywise exponential).
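A minimal sketch of this entropic-regularizer instance (i.e. the RWM/Hedge update); the learning rate `eta` and the form of the loss stream are assumptions for the example.

```python
import numpy as np

def entropic_ftrl(loss_vectors, n, eta=0.1):
    """FTRL with the negative-entropy regularizer over the simplex = RWM.

    loss_vectors yields the cost vector c_t revealed after x_t is played.
    """
    y = np.ones(n) / n                        # y_1 = uniform distribution
    plays = []
    for c in loss_vectors:
        x = y / y.sum()                       # projection = normalization
        plays.append(x)
        y = y * np.exp(-eta * np.asarray(c))  # entrywise multiplicative update
    return plays
```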

12 FTRL $\equiv$ Online Mirror Descent. FTRL: $x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x)$. Bregman divergence: $B_R(x \,\|\, y) := R(x) - R(y) - \nabla R(y)^\top (x - y)$. Bregman projection: $\Pi_K^R(y) = \arg\min_{x \in K} B_R(x \,\|\, y)$. OMD: $y_{t+1} = (\nabla R)^{-1}\!\left(\nabla R(y_t) - \eta \nabla f_t(x_t)\right)$, $x_t = \Pi_K^R(y_t)$.

13 Advanced regularization: follow the perturbed leader, AdaGrad.

14 Randomized regularization: Follow-The-Perturbed-Leader. Special case of OCO, Online Linear Optimization: $f_t(x) = \langle c_t, x\rangle$. Follow-The-Perturbed-Leader [Kalai-Vempala 2005]: $x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} c_i^\top x + p^\top x$. Regularization $p^\top x$: the vector $p$ is drawn uniformly at random from $[-\sqrt{T}, \sqrt{T}]^n$. Expected regret is the same if $p$ is chosen afresh every round. Re-randomizing every round: $x_t = x_{t+1}$ w.p. at least $1 - 1/\sqrt{T}$, so regret $= O(\sqrt{T})$.

15 Follow-The-Perturbed-Leader. Algorithm: $p$ is a random vector with entries in $[-\sqrt{T}, \sqrt{T}]$, and $x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} c_i^\top x + p^\top x$. Implementation requires only linear optimization over $K$, which is easier than the projections needed by OMD. The $O(\sqrt{T})$ regret bound holds for arbitrary sets $K$: convexity unnecessary! Can be extended to convex losses using the gradient trick + using the expected perturbed leader in each round.
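A minimal FTPL sketch is given below. It assumes $K$ is the probability simplex, so the linear optimization oracle simply picks the coordinate with the smallest cumulative perturbed cost; the exact range of the perturbation is an assumption (the slides' range is only partly recoverable).

```python
import numpy as np

def ftpl(cost_vectors, n, T, rng=None):
    """Follow-The-Perturbed-Leader over the n-simplex (a hypothetical K).

    Linear optimization over the simplex picks the vertex (coordinate) with
    the smallest cumulative perturbed cost, so no projections are needed.
    The perturbation p is drawn once, uniformly from [-sqrt(T), sqrt(T)]^n.
    """
    rng = rng or np.random.default_rng(0)
    p = rng.uniform(-np.sqrt(T), np.sqrt(T), size=n)   # fixed random perturbation
    cum = np.zeros(n)
    plays = []
    for c in cost_vectors:                             # c_t revealed after playing
        i = int(np.argmin(cum + p))                    # linear optimization oracle
        plays.append(i)
        cum += np.asarray(c)
    return plays
```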

16 Adaptive Regularization: AdaGrad. Consider the use of OGD for learning a generalized linear model. For parameter $x \in \mathbb{R}^d$ and example $(a, b) \in \mathbb{R}^d \times \mathbb{R}$, the prediction is some function of $\langle a, x\rangle$. Thus $f_t(x) = \ell(\langle a_t, x\rangle, b_t)$, and the OGD update is $x_{t+1} = x_t - \eta \nabla f_t(x_t) = x_t - \eta\, \ell'(\langle a_t, x_t\rangle, b_t)\, a_t$. Thus all features are treated equally in updating the parameter vector. In typical text classification tasks, feature vectors $a_t$ are very sparse: slow learning! Adaptive regularization implements per-feature learning rates.

17 AdaGrad (diagonal form) [Duchi, Hazan, Singer '10]. Set $x_1 \in K$ arbitrarily. For $t = 1, 2, \ldots$: 1. use $x_t$, obtain $f_t$; 2. compute $x_{t+1}$ as follows: $G_t = \mathrm{diag}\!\left(\sum_{i=1}^{t} \nabla f_i(x_i) \nabla f_i(x_i)^\top\right)$, $y_{t+1} = x_t - \eta\, G_t^{-1/2} \nabla f_t(x_t)$, $x_{t+1} = \arg\min_{x \in K} (y_{t+1} - x)^\top G_t^{1/2} (y_{t+1} - x)$. Regret bound: $O\!\left(\sum_{j=1}^{d} \sqrt{\sum_t \nabla f_t(x_t)_j^2}\right)$. Infrequently occurring, or small-scale, features have small influence on the regret (and therefore on convergence to the optimal parameter).
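A minimal sketch of the diagonal AdaGrad update follows; it assumes an unconstrained domain (so the Mahalanobis projection is skipped), and `eta` and `eps` are hypothetical parameters.

```python
import numpy as np

def adagrad(grad_fns, dim, eta=1.0, eps=1e-8):
    """Diagonal AdaGrad without projection (K = R^d for this sketch).

    Each coordinate j gets its own effective step size eta / sqrt(G[j]),
    so rare (sparse) features keep large learning rates for longer.
    """
    x = np.zeros(dim)
    G = np.full(dim, eps)                 # running sum of squared gradients
    for grad in grad_fns:                 # each grad(x) returns nabla f_t(x)
        g = grad(x)
        G += g * g
        x = x - eta * g / np.sqrt(G)      # per-feature learning rates
    return x
```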

18 Bandit Convex Optimization

19 Bandit Optimization. Motivating example: ad selection on websites. Simplest form: in each round $t$, choose one ad $i_t$ out of $n$; simultaneously, the user chooses a loss vector $c_t$; observe only the loss of the chosen ad, $c_t(i_t)$. Here $c_t(i)$ is minus the revenue if ad $i$ would be clicked when shown, and $0$ otherwise. Regret $= \sum_{t=1}^{T} c_t(i_t) - \min_i \sum_{t=1}^{T} c_t(i)$. Essentially, the experts problem with partial feedback. Modeled as online linear optimization over the $n$-simplex of distributions. The available choices are usually called arms (from multi-armed bandits). Randomization is necessary, so consider expected regret.

20 Constructing cost estimators. Exploration: choose arm $i_t$ from $\{1, 2, \ldots, n\}$ uniformly at random, observe $c_t(i_t)$, and set $\hat{c}_t = n\, c_t(i_t)\, e_{i_t}$; then $\mathbb{E}[\hat{c}_t] = c_t$. Exploitation: $x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \hat{c}_i^\top x + \frac{1}{\eta} R(x)$.

21 Separate explore-exploit template. Algorithm: 1. w.p. $q$ explore (as before), creating an estimator with $\mathbb{E}[\hat{c}_t] = c_t$; 2. otherwise exploit: $x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \hat{c}_i^\top x + \frac{1}{\eta} R(x)$. Regret is essentially (ignoring the dependence on $n$) bounded by $qT + (1-q)\cdot O\!\left(\sqrt{T/q}\right) = O(T^{2/3})$. Can we do better?
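To make the $T^{2/3}$ rate explicit, here is the standard balancing calculation over $q$ (a sketch, not taken verbatim from the slides):

```latex
\[
  \min_{q \in (0,1)} \Big( qT + \sqrt{T/q} \Big):
  \quad \frac{d}{dq}\Big( qT + T^{1/2} q^{-1/2} \Big)
  = T - \tfrac{1}{2} T^{1/2} q^{-3/2} = 0
  \;\Longrightarrow\; q = \Theta\!\big(T^{-1/3}\big)
  \;\Longrightarrow\; qT + \sqrt{T/q} = \Theta\!\big(T^{2/3}\big).
\]
```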

22 Simultaneous exploration-exploitation [Auer, Cesa-Bianchi, Freund, Schapire 2002]. Exploration + exploitation: choose arm $i_t$ from the distribution $x_t$, observe $c_t(i_t)$, and set $\hat{c}_t = \frac{c_t(i_t)}{x_t(i_t)}\, e_{i_t}$; then $\mathbb{E}[\hat{c}_t] = c_t$. Here $x_t = \arg\min_{x \in \Delta_n} \sum_{i=1}^{t-1} \hat{c}_i^\top x + \frac{1}{\eta} R(x)$. With $R = $ negative entropy, this algorithm (called EXP3) has $O(\sqrt{T})$ regret.
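A minimal EXP3 sketch combining the importance-weighted estimator above with the entropic FTRL (multiplicative weights) update; the interface `loss_fn(t, i)` and the fixed learning rate `eta` are assumptions for the example.

```python
import numpy as np

def exp3(loss_fn, n, T, eta=0.1, rng=None):
    """EXP3: importance-weighted loss estimates fed into multiplicative weights.

    loss_fn(t, i) returns the loss of arm i at round t; only the pulled arm's
    loss is ever queried, matching the partial-feedback (bandit) setting.
    """
    rng = rng or np.random.default_rng(0)
    w = np.ones(n)
    total = 0.0
    for t in range(T):
        x = w / w.sum()                        # play distribution x_t
        i = int(rng.choice(n, p=x))            # pull arm i_t ~ x_t
        loss = loss_fn(t, i)
        total += loss
        c_hat = np.zeros(n)
        c_hat[i] = loss / x[i]                 # unbiased estimator of c_t
        w *= np.exp(-eta * c_hat)              # entropic FTRL update
    return total
```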

23 Bandit Linear Optimization. Generalize from multi-armed bandits: $K$ is a convex set in $\mathbb{R}^d$, with linear cost functions $f_t(x) = \langle c_t, x\rangle$. Approach: 1. (Exploration) construct unbiased estimators with $\mathbb{E}[\hat{c}_t] = c_t$; 2. (Exploitation) plug them into FTRL: $x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \hat{c}_i^\top x + \frac{1}{\eta} R(x)$.

24 Bandit Linear Optimization. $K$ is a convex set in $\mathbb{R}^d$, with linear cost functions $f_t(x) = \langle c_t, x\rangle$. Assume $K$ contains the exploration basis $e_1, e_2, \ldots, e_n$. In round $t$: 1. W.p. $q$, choose $i_t \in \{1, 2, \ldots, n\}$ uniformly at random and play $e_{i_t}$; observe $c_t(i_t)$ and set $\hat{c}_t = \frac{c_t(i_t)}{q/n}\, e_{i_t}$. 2. W.p. $1 - q$, play the $x_t$ obtained from FTRL, $x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \hat{c}_i^\top x + \frac{1}{\eta} R(x)$, and set $\hat{c}_t = 0$. Then $\mathbb{E}[\hat{c}_t] = c_t$. This gives a regret bound of $O(T^{2/3})$.

25 Geometric regularization via barriers [Abernethy, Hazan, Rakhlin 2008]. $R$ = a self-concordant barrier for $K$. The Dikin ellipsoid at any point $x$ lies in $K$: $\{y : (y - x)^\top \nabla^2 R(x)\, (y - x) \le 1\} \subseteq K$. Algorithm: play a random endpoint of the axes of the Dikin ellipsoid at $x_t$, where $x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \hat{c}_i^\top x + \frac{1}{\eta} R(x)$; this uses $x_t$ in expectation (exploitation), and the $2n$ endpoints suffice to construct $\hat{c}_t$ (exploration). Thm: Regret $= O(\sqrt{T})$.

26 Bandit Online Convex Optimization [Flaxman, Kalai, McMahan 2005]. Here the $f_t$ are arbitrary convex functions. Approach: 1. Construct an estimator $\hat{c}_t$ of $\nabla f_t(x_t)$ via the single-point spherical estimator $\nabla \hat{f}_\delta(x) = \frac{n}{\delta}\, \mathbb{E}_{u \in S^{n-1}}[f(x + \delta u)\, u]$, where $\hat{f}_\delta$ is a smoothed version of $f$. 2. Plug it into FTRL: $x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \hat{c}_i^\top x + \frac{1}{\eta} R(x)$. Near the boundary, the estimators have large variance; hence $x_t$ and $x_{t+1}$ can be very different. Regret bound $= O(T^{3/4})$. Recent developments: $O(\sqrt{T})$ regret is possible!
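A minimal sketch of the single-query spherical gradient estimate used in this approach; the smoothing radius `delta` and the function interface are assumptions for the example.

```python
import numpy as np

def one_point_gradient(f, x, delta=0.01, rng=None):
    """Single-query spherical gradient estimate used in bandit OCO.

    Returns (n/delta) * f(x + delta*u) * u for u uniform on the unit sphere;
    in expectation this is the gradient of a delta-smoothed version of f.
    """
    rng = rng or np.random.default_rng(0)
    n = len(x)
    u = rng.normal(size=n)
    u /= np.linalg.norm(u)                 # uniform direction on the sphere
    return (n / delta) * f(x + delta * u) * u
```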

27 Projection-free algorithms

28 Matrix completion. In round $t$: obtain a (user, movie) pair and output a recommended rating; then observe the true rating and suffer the square loss. Comparison class: all matrices with bounded trace norm.

29 Bounded trace norm matrices. The trace norm of a matrix is the sum of its singular values. $K = \{X : X$ is a matrix with trace norm at most $D\}$. Optimizing over $K$ promotes low-rank solutions. The Online Matrix Completion problem is an OCO problem over $K$. FTRL, OMD, etc. require computing projections onto $K$. Expensive! It requires computing an eigendecomposition of the matrix to be projected: an $O(n^3)$ operation. But linear optimization over $K$ is easier: it requires computing only the top singular vector, which can typically be done in $O(\text{sparsity})$ time.

30 Projections → linear optimization. 1. Matrix completion ($K$ = bounded trace norm matrices): eigendecomposition → largest singular vector computation. 2. Online routing ($K$ = flow polytope): conic optimization over the flow polytope → shortest path computation. 3. Rotations ($K$ = rotation matrices): conic optimization over the set of rotations → Wahba's algorithm, linear time. 4. Matroids ($K$ = matroid polytope): convex optimization via the ellipsoid method → greedy algorithm, nearly linear time!

31 Projection-Free Online Learning. Is online learning possible given access to a linear optimization oracle rather than a projection oracle? Yes: projection can be implemented via polynomially many linear optimizations. But for efficiency, we would like only a few (1 or 2) calls to the linear optimization oracle per round. Follow-The-Perturbed-Leader does the job for linear loss functions. What about general convex loss functions? Is OCO possible with only 1 or 2 calls to a linear optimization oracle per round?

32 Conditional Gradient algorithm [Frank, Wolfe '56]. Convex optimization problem: $\min_{x \in K} f(x)$. Assume: $f$ is smooth and convex, and linear optimization over $K$ is easy. Iterate: $v_t = \arg\min_{x \in K} \nabla f(x_t)^\top x$, $x_{t+1} = x_t + \eta_t (v_t - x_t)$. 1. At iteration $t$: $x_t$ is a convex combination of at most $t$ vertices (sparsity). 2. No learning-rate tuning: $\eta_t \approx 1/t$ (independent of the diameter, gradients, etc.).
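A minimal offline Frank-Wolfe sketch follows; the oracle interface, the $2/(t+2)$ step schedule, and the simplex example are assumptions for illustration.

```python
import numpy as np

def frank_wolfe(grad, linear_opt, x0, steps=100):
    """Conditional gradient: only linear optimization over K, no projections.

    grad(x) is the gradient of the smooth convex objective; linear_opt(g)
    returns argmin_{v in K} <g, v>. Step size eta_t = 2/(t+2), the usual
    parameter-free schedule.
    """
    x = np.array(x0, dtype=float)
    for t in range(steps):
        v = linear_opt(grad(x))            # linear optimization oracle call
        eta = 2.0 / (t + 2.0)
        x = x + eta * (v - x)              # convex combination stays inside K
    return x

# Hypothetical usage: minimize ||x - b||^2 over the probability simplex,
# where linear optimization just selects a vertex (a coordinate basis vector).
b = np.array([0.7, 0.2, 0.1, 0.5])
lin_opt = lambda g: np.eye(len(g))[np.argmin(g)]
x_star = frank_wolfe(lambda x: 2.0 * (x - b), lin_opt, np.ones(4) / 4)
```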

33 Online Conditional Gradient [Hazan, Kale '12]. Set $x_1 \in K$ arbitrarily. For $t = 1, 2, \ldots$: 1. use $x_t$, obtain $f_t$; 2. compute $x_{t+1}$ as follows: $v_t = \arg\min_{x \in K} \left(\sum_{i=1}^{t} \nabla f_i(x_i) + \sigma_t x_t\right)^{\!\top} x$, $x_{t+1} = (1 - \eta_t) x_t + \eta_t v_t$, where $\sigma_t$ is a regularization parameter. Theorem: Regret $= O(T^{3/4})$.
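A minimal sketch of this online Frank-Wolfe scheme with one linear-optimization call per round; the particular schedules for $\sigma_t$ and $\eta_t$ below are illustrative assumptions, not the exact tuning of the analysis.

```python
import numpy as np

def online_conditional_gradient(grad_fns, linear_opt, x0):
    """Online conditional gradient sketch: one linear-opt oracle call per round.

    grad_fns yields, per round, a function grad(x) = nabla f_t(x);
    linear_opt(g) returns argmin_{v in K} <g, v>. The schedules for the
    regularization weight sigma_t and step eta_t are chosen only up to
    constants for this sketch.
    """
    x = np.array(x0, dtype=float)
    cum_grad = np.zeros_like(x)
    plays = []
    for t, grad in enumerate(grad_fns, start=1):
        plays.append(x.copy())                     # play x_t, then observe f_t
        cum_grad += grad(x)
        sigma_t = np.sqrt(t)                       # weight on the proximity term
        v = linear_opt(cum_grad + sigma_t * x)     # single oracle call per round
        eta_t = t ** (-0.5)
        x = (1.0 - eta_t) * x + eta_t * v          # convex combination stays in K
    return plays
```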

34 Linearly convergent CG [Garber, Hazan 2013]. Theorem: for polytopes, and strongly-convex and smooth losses: 1. Offline: the optimality gap after $t$ steps is $O(\exp(-ct))$ for a constant $c$; 2. Online: Regret $= O(\sqrt{T})$.

35 Directions for future research. 1. Efficient algorithms and optimal dependence on the dimension for bandit OCO. 2. Faster online matrix algorithms. 3. Optimal regret / running-time tradeoffs. 4. Optimal-regret, projection-free algorithms for general convex sets. 5. Time series / dependence / adaptivity (vs. static regret). 6. Adding state: online Reinforcement Learning / MDPs / stochastic games [lots of existing work]. 7. Tensor/spectral learning in the online setting. 8. Game-theoretic aspects: stronger notions, approachability, ...

36 Bibliography & more information, see: http://www.cs.princeton.edu/~ehazan/tutorial/tutorial.htm. Thank you!
