Online Learning and Online Convex Optimization

Online Learning and Online Convex Optimization
Nicolò Cesa-Bianchi, Università degli Studi di Milano

Summary
1. My beautiful regret
2. A supposedly fun game I'll play again
3. The joy of convex

Machine learning
Classification/regression tasks: predictive models $h$ mapping data instances $X$ to labels $Y$ (e.g., a binary classifier).
Training data $S_T = \big((X_1,Y_1),\dots,(X_T,Y_T)\big)$ (e.g., messages annotated as spam vs. nonspam).
A learning algorithm $A$ (e.g., a Support Vector Machine) maps the training data $S_T$ to a model $h = A(S_T)$.
Evaluate the risk of the trained model $h$ with respect to a given loss function.

Two notions of risk
View data as a statistical sample: statistical risk
$$\mathbb{E}\big[\ell\big(A(S_T),\,(X,Y)\big)\big]$$
where the training set $S_T = \big((X_1,Y_1),\dots,(X_T,Y_T)\big)$ and the test example $(X,Y)$ are drawn i.i.d. from the same unknown and fixed distribution.
View data as an arbitrary sequence: sequential risk
$$\sum_{t=1}^{T} \ell\big(A(S_{t-1}),\,(X_t,Y_t)\big)$$
where the sequence of models is trained on growing prefixes $S_{t-1} = \big((X_1,Y_1),\dots,(X_{t-1},Y_{t-1})\big)$ of the data sequence.

Regrets, I had a few
A learning algorithm $A$ maps datasets to models in a given class $\mathcal{H}$.
Variance error in statistical learning:
$$\mathbb{E}\big[\ell\big(A(S_T),(X,Y)\big)\big] - \inf_{h \in \mathcal{H}} \mathbb{E}\big[\ell\big(h,(X,Y)\big)\big]$$
compare to the expected loss of the best model in the class.
Regret in online learning:
$$\sum_{t=1}^{T} \ell\big(A(S_{t-1}),(X_t,Y_t)\big) - \inf_{h \in \mathcal{H}} \sum_{t=1}^{T} \ell\big(h,(X_t,Y_t)\big)$$
compare to the cumulative loss of the best model in the class.

Incremental model update
A natural blueprint for online learning algorithms. For $t = 1, 2, \dots$
1. Apply the current model $h_{t-1}$ to the next data element $(X_t, Y_t)$
2. Update the current model: $h_{t-1} \to h_t \in \mathcal{H}$ (local optimization)
Goal: control the regret
$$\sum_{t=1}^{T} \ell\big(h_{t-1},(X_t,Y_t)\big) - \inf_{h \in \mathcal{H}} \sum_{t=1}^{T} \ell\big(h,(X_t,Y_t)\big)$$
View this as a repeated game between a player generating predictors $h_t \in \mathcal{H}$ and an opponent generating data $(X_t, Y_t)$.

Summary
1. My beautiful regret
2. A supposedly fun game I'll play again
3. The joy of convex

Theory of repeated games
James Hannan and David Blackwell, Learning to play a game (1956): play a game repeatedly against a possibly suboptimal opponent.

Zero-sum 2-person games played more than once
An $N \times M$ known loss matrix with entries $\ell(i,y)$: the row player (player) has $N$ actions, the column player (opponent) has $M$ actions.
For each game round $t = 1, 2, \dots$
- Player chooses action $i_t$ and opponent chooses action $y_t$
- The player suffers loss $\ell(i_t, y_t)$ (= gain of the opponent)
The player can learn from the opponent's history of past choices $y_1,\dots,y_{t-1}$.

Prediction with expert advice (Volodya Vovk, Manfred Warmuth)
The loss table assigns each action $i \in \{1,\dots,N\}$ a loss $\ell_t(i)$ at each round $t = 1, 2, \dots$ The opponent's moves $y_1, y_2, \dots$ define a sequential prediction problem with a time-varying loss function $\ell(i_t, y_t) = \ell_t(i_t)$.

Playing the experts game
A sequential decision problem with $N$ actions and an unknown deterministic assignment of losses to actions: $\ell_t = \big(\ell_t(1),\dots,\ell_t(N)\big) \in [0,1]^N$ for $t = 1, 2, \dots$
For $t = 1, 2, \dots$
1. Player picks an action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
2. Player gets full feedback information: $\ell_t(1),\dots,\ell_t(N)$

Regret analysis
$$R_T \stackrel{\mathrm{def}}{=} \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(I_t)\right] - \min_{i=1,\dots,N} \sum_{t=1}^{T} \ell_t(i) \qquad \text{want } R_T = o(T)$$
Lower bound using random losses [Experts paper, 1997]: let $\ell_t(i) = L_t(i) \in \{0,1\}$ be independent random coin flips. For any player strategy, $\mathbb{E}\big[\sum_{t=1}^{T} L_t(I_t)\big] = \frac{T}{2}$. The expected regret is then
$$\mathbb{E}\left[\max_{i=1,\dots,N} \sum_{t=1}^{T}\left(\tfrac{1}{2} - L_t(i)\right)\right] = \big(1 - o(1)\big)\sqrt{\frac{T \ln N}{2}} \qquad \text{for } N, T \to \infty$$

Exponentially weighted forecaster (Hedge)
At time $t$ pick action $I_t = i$ with probability proportional to
$$\exp\left(-\eta \sum_{s=1}^{t-1} \ell_s(i)\right)$$
where the sum in the exponent is the total loss of action $i$ up to now.
Regret bound [Experts paper, 1997]: if $\eta = \sqrt{8(\ln N)/T}$ then $R_T \le \sqrt{(T/2)\ln N}$, matching the lower bound including constants. The dynamic choice $\eta_t = \sqrt{8(\ln N)/t}$ only loses small constants.
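
As a concrete illustration, here is a minimal NumPy sketch of the Hedge update above; the toy environment (an i.i.d. loss matrix) and the tracking of the expected loss instead of a random draw are my choices, not part of the slides.

```python
import numpy as np

def hedge(losses, eta):
    """Exponentially weighted forecaster on a T x N array of losses in [0,1].
    Returns the expected cumulative loss and the regret vs. the best action."""
    T, N = losses.shape
    cum_loss = np.zeros(N)               # total loss of each action so far
    expected_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum_loss - cum_loss.min()))  # shift for stability
        p = w / w.sum()                  # P(I_t = i) proportional to exp(-eta * cum loss)
        expected_loss += p @ losses[t]
        cum_loss += losses[t]
    return expected_loss, expected_loss - cum_loss.min()

rng = np.random.default_rng(0)
T, N = 10_000, 10
losses = rng.random((T, N))              # toy i.i.d. losses, illustration only
eta = np.sqrt(8 * np.log(N) / T)         # tuning from the regret bound
print(hedge(losses, eta))                # regret should be O(sqrt(T ln N))
```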

The nonstochastic bandit problem
For $t = 1, 2, \dots$
1. Player picks an action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
2. Player gets partial information: only $\ell_t(I_t)$ is revealed
The player is still competing against the best offline action:
$$R_T = \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(I_t)\right] - \min_{i=1,\dots,N} \sum_{t=1}^{T} \ell_t(i)$$

The Exp3 algorithm [Auer et al., 2002]
Hedge with estimated losses:
$$P_t(I_t = i) \propto \exp\left(-\eta \sum_{s=1}^{t-1} \widehat{\ell}_s(i)\right) \qquad \widehat{\ell}_t(i) = \begin{cases} \dfrac{\ell_t(i)}{P_t\big(\ell_t(i)\ \text{observed}\big)} & \text{if } I_t = i \\ 0 & \text{otherwise} \end{cases} \qquad i = 1,\dots,N$$
In the bandit case $P_t\big(\ell_t(i)\ \text{observed}\big) = P_t(I_t = i)$, and only one component of $\widehat{\ell}_t$ is non-zero.
Properties of the importance weighting estimator:
$$\mathbb{E}_t\big[\widehat{\ell}_t(i)\big] = \ell_t(i) \ \ \text{(unbiasedness)} \qquad \mathbb{E}_t\big[\widehat{\ell}_t(i)^2\big] \le \frac{1}{P_t\big(\ell_t(i)\ \text{observed}\big)} \ \ \text{(variance control)}$$
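
A minimal sketch of Exp3 under the same toy setup as before; again, the environment and the fixed $\eta$ are illustrative assumptions rather than part of the original algorithm statement.

```python
import numpy as np

def exp3(losses, eta, rng):
    """Exp3 on a T x N array of losses in [0,1]: Hedge run on
    importance-weighted loss estimates built from bandit feedback."""
    T, N = losses.shape
    cum_est = np.zeros(N)                 # cumulative estimated losses
    total_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = w / w.sum()
        i = rng.choice(N, p=p)            # play I_t drawn from p
        total_loss += losses[t, i]        # only ell_t(I_t) is revealed
        cum_est[i] += losses[t, i] / p[i] # unbiased importance-weighted estimate
    return total_loss, total_loss - losses.sum(axis=0).min()

rng = np.random.default_rng(0)
T, N = 100_000, 10
losses = rng.random((T, N))
eta = np.sqrt(2 * np.log(N) / (N * T))    # tuning from the sqrt(NT ln N) bound
print(exp3(losses, eta, rng))
```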

Exp3 regret bound
$$R_T \le \frac{\ln N}{\eta} + \frac{\eta}{2}\,\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{N} P_t(I_t = i)\,\mathbb{E}_t\big[\widehat{\ell}_t(i)^2\big]\right] \le \frac{\ln N}{\eta} + \frac{\eta}{2}\,\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{N} \frac{P_t(I_t = i)}{P_t\big(\ell_t(i)\ \text{observed}\big)}\right] = \frac{\ln N}{\eta} + \frac{\eta}{2}\,NT$$
which, for tuned $\eta$, is of order $\sqrt{NT \ln N}$; the lower bound is $\Omega\big(\sqrt{NT}\big)$. An improved matching upper bound was given by [Audibert and Bubeck, 2009].
In the full information (experts) setting the player observes the whole loss vector $\ell_t$ after each play, so $P_t\big(\ell_t(i)\ \text{observed}\big) = 1$ and $R_T \le \sqrt{T \ln N}$.

Nonoblivious opponents
The adaptive adversary: the loss of action $i$ at time $t$ depends on the player's past $m$ actions,
$$\ell_t(i) \equiv \ell_t(I_{t-m},\dots,I_{t-1},i)$$
Example: bandits with switching cost.
Nonoblivious regret:
$$R_T^{\mathrm{non}} = \mathbb{E}\left[\sum_{t=1}^{T}\ell_t(I_{t-m},\dots,I_{t-1},I_t) - \min_{i=1,\dots,N}\sum_{t=1}^{T}\ell_t(I_{t-m},\dots,I_{t-1},i)\right]$$
Policy regret:
$$R_T^{\mathrm{pol}} = \mathbb{E}\left[\sum_{t=1}^{T}\ell_t(I_{t-m},\dots,I_{t-1},I_t)\right] - \min_{i=1,\dots,N}\sum_{t=1}^{T}\ell_t(\underbrace{i,\dots,i}_{m\ \text{times}},i)$$

Bandits and reactive opponents
Bounds on the nonoblivious regret (even when $m$ depends on $T$):
$$R_T^{\mathrm{non}} = O\big(\sqrt{TN \ln N}\big)$$
achieved by Exp3 with biased loss estimates. Is the $\ln N$ factor necessary?
Bounds on the policy regret, for any constant $m \ge 1$:
$$R_T^{\mathrm{pol}} = O\big((N \ln N)^{1/3}\, T^{2/3}\big)$$
achieved by a very simple player strategy, and optimal up to log factors! [Dekel, Koren, and Peres, 2014]

Partial monitoring: not observing any loss
Dynamic pricing: perform as well as the best fixed price.
1. Post a T-shirt price
2. Observe whether the next customer buys or not
3. Adjust the price
The feedback does not reveal the player's loss.
(Figure: loss matrix and feedback matrix of the pricing game.)

A characterization of minimax regret
Special case, multiarmed bandits: the loss and feedback matrices are the same.
A general gap theorem [Bartók, Foster, Pál, Rakhlin and Szepesvári, 2013] gives a constructive characterization of the minimax regret for any pair of loss/feedback matrices. Only three possible rates for nontrivial games:
1. Easy games (e.g., bandits): $\Theta\big(\sqrt{T}\big)$
2. Hard games (e.g., revealing action): $\Theta\big(T^{2/3}\big)$
3. Impossible games: $\Theta(T)$

A game equivalent to prediction with expert advice
Online linear optimization in the simplex:
1. Play $p_t$ from the $N$-dimensional simplex $\Delta_N$
2. Incur the linear loss $\mathbb{E}\big[\ell_t(I_t)\big] = p_t^{\top}\ell_t$
3. Observe the loss gradient $\ell_t$
Regret: compete against the best point in the simplex,
$$\sum_{t=1}^{T} p_t^{\top}\ell_t - \min_{q \in \Delta_N} \sum_{t=1}^{T} q^{\top}\ell_t = \sum_{t=1}^{T} p_t^{\top}\ell_t - \min_{i=1,\dots,N} \sum_{t=1}^{T} \ell_t(i)$$

From game theory to machine learning
(Figure: unlabeled data enters a classification system, which outputs a guessed label; the opponent then reveals the true label.)
The opponent's moves $y_t$ are viewed as values or labels assigned to observations $x_t \in \mathbb{R}^d$ (e.g., categories of documents). This is a repeated game between the player, choosing an element $w_t$ of a linear space, and the opponent, choosing a label $y_t$ for $x_t$. Regret is measured with respect to the best element in the linear space.

Summary
1. My beautiful regret
2. A supposedly fun game I'll play again
3. The joy of convex

Online convex optimization [Zinkevich, 2003]
1. Play $w_t$ from a convex and compact subset $S$ of a linear space
2. Observe a convex loss $\ell_t : S \to \mathbb{R}$ and pay $\ell_t(w_t)$
3. Update: $w_t \to w_{t+1} \in S$
Examples: regression with the square loss, $\ell_t(w) = \big(w^{\top}x_t - y_t\big)^2$ with $y_t \in \mathbb{R}$; classification with the hinge loss, $\ell_t(w) = \big[1 - y_t\, w^{\top}x_t\big]_+$ with $y_t \in \{-1,+1\}$.
Regret:
$$R_T(u) = \sum_{t=1}^{T}\ell_t(w_t) - \sum_{t=1}^{T}\ell_t(u) \qquad u \in S$$
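
For concreteness, a small sketch of these two losses and their (sub)gradients in NumPy; the function names are mine, not from the slides.

```python
import numpy as np

def square_loss(w, x, y):
    """Square loss (w'x - y)^2 and its gradient in w."""
    r = w @ x - y
    return r * r, 2.0 * r * x

def hinge_loss(w, x, y):
    """Hinge loss [1 - y w'x]_+ and a subgradient in w (y in {-1,+1})."""
    margin = 1.0 - y * (w @ x)
    if margin <= 0.0:
        return 0.0, np.zeros_like(w)   # loss is flat here: subgradient 0
    return margin, -y * x

w = np.array([0.5, -0.2]); x = np.array([1.0, 2.0])
print(square_loss(w, x, 1.0))          # (r^2, 2 r x) with r = w'x - 1
print(hinge_loss(w, x, +1.0))
```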

Finding a good online algorithm
Follow the leader:
$$w_{t+1} = \operatorname*{arg\,inf}_{w \in S} \sum_{s=1}^{t} \ell_s(w)$$
Regret can be linear due to lack of stability. Take $S = [-1,+1]$, $\ell_1(w) = \frac{w}{2}$, and for $t \ge 2$
$$\ell_t(w) = \begin{cases} -w & \text{if } t \text{ is even} \\ +w & \text{if } t \text{ is odd} \end{cases}$$
Note that
$$\sum_{s=1}^{t} \ell_s(w) = \begin{cases} -\frac{w}{2} & \text{if } t \text{ is even} \\ +\frac{w}{2} & \text{if } t \text{ is odd} \end{cases}$$
Hence the leader alternates between the endpoints and $\ell_{t+1}(w_{t+1}) = 1$ for all $t = 1, 2, \dots$
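
A tiny simulation of this instability (the grid over $S$ and the arbitrary first play $w_1 = 0$ are my choices): follow the leader pays about 1 per round while the best fixed point pays about 0, so the regret is linear.

```python
import numpy as np

T = 1000
grid = np.linspace(-1.0, 1.0, 2001)      # fine grid over S = [-1, +1]

def loss(t, w):                          # t is 1-based, as on the slide
    if t == 1:
        return w / 2
    return -w if t % 2 == 0 else w

cum = np.zeros_like(grid)                # cumulative loss of every point of S
w, ftl_loss = 0.0, 0.0                   # w_1 = 0 (arbitrary: no data yet)
for t in range(1, T + 1):
    ftl_loss += loss(t, w)               # play the current leader
    cum += loss(t, grid)
    w = grid[np.argmin(cum)]             # leader after round t
print(ftl_loss, cum.min())               # approx T vs approx -1/2: linear regret
```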

Follow the regularized leader [Shalev-Shwartz, 2007; Abernethy, Hazan and Rakhlin, 2008]
$$w_{t+1} = \operatorname*{arg\,min}_{w \in S}\left[\eta \sum_{s=1}^{t} \ell_s(w) + \Phi(w)\right]$$
where $\Phi$ is a strongly convex regularizer and $\eta > 0$ is a scale parameter.

Convexity, smoothness, and duality
Strong convexity: $\Phi : S \to \mathbb{R}$ is $\beta$-strongly convex w.r.t. a norm $\|\cdot\|$ if for all $u, v \in S$
$$\Phi(v) \ge \Phi(u) + \nabla\Phi(u)^{\top}(v-u) + \frac{\beta}{2}\,\|u-v\|^2$$
Smoothness: $\Phi : S \to \mathbb{R}$ is $\alpha$-smooth w.r.t. a norm $\|\cdot\|$ if for all $u, v \in S$
$$\Phi(v) \le \Phi(u) + \nabla\Phi(u)^{\top}(v-u) + \frac{\alpha}{2}\,\|u-v\|^2$$
If $\Phi$ is $\beta$-strongly convex w.r.t. $\|\cdot\|_2$, then $\nabla^2\Phi \succeq \beta I$; if $\Phi$ is $\alpha$-smooth w.r.t. $\|\cdot\|_2$, then $\nabla^2\Phi \preceq \alpha I$.

Examples
- Euclidean norm: $\Phi = \frac{1}{2}\|\cdot\|_2^2$ is 1-strongly convex w.r.t. $\|\cdot\|_2$
- $p$-norm: $\Phi = \frac{1}{2}\|\cdot\|_p^2$ is $(p-1)$-strongly convex w.r.t. $\|\cdot\|_p$ (for $1 < p \le 2$)
- Entropy: $\Phi(p) = \sum_{i=1}^{d} p_i \ln p_i$ is 1-strongly convex w.r.t. $\|\cdot\|_1$ (for $p$ in the probability simplex)
- Power norm: $\Phi(w) = \frac{1}{2}\,w^{\top}Aw$ is 1-strongly convex w.r.t. $\|w\| = \sqrt{w^{\top}Aw}$ (for $A$ symmetric and positive definite)

Convex duality
Definition: the convex dual of $\Phi$ is
$$\Phi^*(\theta) = \max_{w \in S}\big(\theta^{\top}w - \Phi(w)\big)$$
One-dimensional example: let $f : \mathbb{R} \to \mathbb{R}$ be convex with $f(0) = 0$ and
$$f^*(\theta) = \max_{w \in \mathbb{R}}\big(w\,\theta - f(w)\big)$$
The maximizer is the $w_0$ such that $f'(w_0) = \theta$, which gives $f^*(\theta) = w_0\, f'(w_0) - f(w_0)$. Since $f(0) = 0$, $f^*(\theta)$ is the error made when approximating $f(0)$ by a first-order expansion of $f$ around $w_0$.
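
As a quick worked instance of this recipe (my example, not on the slides), take $f(w) = \frac{1}{2}w^2$:
$$\frac{d}{dw}\left(w\theta - \tfrac{1}{2}w^2\right) = \theta - w_0 = 0 \;\Rightarrow\; w_0 = \theta, \qquad f^*(\theta) = \theta\cdot\theta - \tfrac{1}{2}\theta^2 = \tfrac{1}{2}\theta^2$$
consistent with the Euclidean-norm entry in the table of duals below.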

Convex duality
(Figure illustrating the convex dual; thanks to Shai Shalev-Shwartz for the image.)

Convexity, smoothness, and duality: examples
- Euclidean norm: $\Phi = \frac{1}{2}\|\cdot\|_2^2$ and $\Phi^* = \Phi$
- $p$-norm: $\Phi = \frac{1}{2}\|\cdot\|_p^2$ and $\Phi^* = \frac{1}{2}\|\cdot\|_q^2$ where $\frac{1}{p} + \frac{1}{q} = 1$
- Entropy: $\Phi(p) = \sum_{i=1}^{d} p_i \ln p_i$ and $\Phi^*(\theta) = \ln\big(e^{\theta_1} + \dots + e^{\theta_d}\big)$
- Power norm: $\Phi(w) = \frac{1}{2}\,w^{\top}Aw$ and $\Phi^*(\theta) = \frac{1}{2}\,\theta^{\top}A^{-1}\theta$

Some useful properties
If $\Phi : S \to \mathbb{R}$ is $\beta$-strongly convex w.r.t. $\|\cdot\|$, then its convex dual $\Phi^*$ is everywhere differentiable and $\frac{1}{\beta}$-smooth w.r.t. $\|\cdot\|_*$ (the dual norm of $\|\cdot\|$), with
$$\nabla\Phi^*(\theta) = \operatorname*{arg\,max}_{w \in S}\big(\theta^{\top}w - \Phi(w)\big)$$
Recall follow the regularized leader (FTRL):
$$w_{t+1} = \operatorname*{arg\,min}_{w \in S}\left[\eta \sum_{s=1}^{t} \ell_s(w) + \Phi(w)\right]$$
where $\Phi$ is a strongly convex regularizer and $\eta > 0$ is a scale parameter.

Using the loss gradient
Linearization of convex losses:
$$\ell_t(w_t) - \ell_t(u) \le \nabla\ell_t(w_t)^{\top}(w_t - u)$$
FTRL with linearized losses:
$$w_{t+1} = \operatorname*{arg\,min}_{w \in S}\left(\eta \sum_{s=1}^{t} \nabla\ell_s(w_s)^{\top}w + \Phi(w)\right) = \operatorname*{arg\,max}_{w \in S}\big(\theta_{t+1}^{\top}w - \Phi(w)\big) = \nabla\Phi^*(\theta_{t+1})$$
where $\theta_{t+1} = -\eta\sum_{s=1}^{t}\nabla\ell_s(w_s)$. Note: $w_{t+1} \in S$ always holds.

The Mirror Descent algorithm [Nemirovsky and Yudin, 1983]
Recall:
$$w_{t+1} = \nabla\Phi^*(\theta_{t+1}) = \nabla\Phi^*\left(-\eta \sum_{s=1}^{t} \nabla\ell_s(w_s)\right)$$
Online Mirror Descent (FTRL with linearized losses).
Parameters: strongly convex regularizer $\Phi$ with domain $S$, $\eta > 0$. Initialize $\theta_1 = 0$.
For $t = 1, 2, \dots$
1. Use $w_t = \nabla\Phi^*(\theta_t)$ (mirror step)
2. Suffer loss $\ell_t(w_t)$
3. Observe the loss gradient $\nabla\ell_t(w_t)$
4. Update $\theta_{t+1} = \theta_t - \eta\,\nabla\ell_t(w_t)$ (gradient step)

An equivalent formulation
Under some assumptions on the regularizer $\Phi$, OMD can be equivalently written in terms of projected gradient descent.
Online Mirror Descent (alternative version).
Parameters: strongly convex regularizer $\Phi$ and learning rate $\eta > 0$. Initialize $z_1 = \nabla\Phi^*(0)$ and $w_1 = \operatorname*{arg\,min}_{w \in S} D_{\Phi}(w, z_1)$.
For $t = 1, 2, \dots$
1. Use $w_t$ and suffer loss $\ell_t(w_t)$
2. Observe the loss gradient $\nabla\ell_t(w_t)$
3. Update $z_{t+1} = \nabla\Phi^*\big(\nabla\Phi(z_t) - \eta\,\nabla\ell_t(w_t)\big)$ (gradient step)
4. $w_{t+1} = \operatorname*{arg\,min}_{w \in S} D_{\Phi}(w, z_{t+1})$ (projection step)
Here $D_{\Phi}$ is the Bregman divergence induced by $\Phi$.

Some examples
Online Gradient Descent (OGD) [Zinkevich, 2003; Gentile, 2003]: $\Phi(w) = \frac{1}{2}\|w\|^2$, with update
$$w' = w_t - \eta\,\nabla\ell_t(w_t) \qquad w_{t+1} = \operatorname*{arg\,inf}_{w \in S}\|w - w'\|_2$$
$p$-norm version: $\Phi(w) = \frac{1}{2}\|w\|_p^2$.
Exponentiated gradient (EG) [Kivinen and Warmuth, 1997]: $\Phi(p) = \sum_{i=1}^{d} p_i \ln p_i$ with $p$ in the simplex $S$, and update
$$p_{t+1,i} = \frac{p_{t,i}\, e^{-\eta\,\nabla\ell_t(p_t)_i}}{\sum_{j=1}^{d} p_{t,j}\, e^{-\eta\,\nabla\ell_t(p_t)_j}}$$
Note: when the losses are linear this is Hedge.
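
A minimal sketch of both updates, assuming $S$ is the Euclidean unit ball for OGD (the projection step depends on $S$) and the simplex for EG; the toy quadratic and linear losses are illustrative.

```python
import numpy as np

def ogd_step(w, grad, eta):
    """OGD on the Euclidean unit ball: gradient step plus L2 projection."""
    w = w - eta * grad
    norm = np.linalg.norm(w)
    return w / norm if norm > 1.0 else w

def eg_step(p, grad, eta):
    """Exponentiated gradient on the simplex: multiplicative update."""
    p = p * np.exp(-eta * grad)
    return p / p.sum()

rng = np.random.default_rng(0)
d, T, eta = 5, 1000, 0.05
w, p = np.zeros(d), np.full(d, 1.0 / d)
for t in range(T):
    x, y = rng.standard_normal(d), 1.0
    grad_sq = 2.0 * (w @ x - y) * x        # gradient of the square loss
    w = ogd_step(w, grad_sq, eta)
    ell = rng.random(d)                    # linear loss vector in [0,1]^d
    p = eg_step(p, ell, eta)               # with linear losses EG = Hedge
print(w, p)
```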

Regret analysis
Regret bound [Kakade, Shalev-Shwartz and Tewari, 2012]: for all $u \in S$, where $\ell_1, \ell_2, \dots$ are arbitrary convex losses,
$$R_T(u) \le \frac{\Phi(u) - \min_{w \in S}\Phi(w)}{\eta} + \frac{\eta}{2\beta}\sum_{t=1}^{T}\big\|\nabla\ell_t(w_t)\big\|_*^2$$
When $\eta$ is tuned with respect to $G$ and $D$, where $\sup_{w \in S}\|\nabla\ell_t(w)\|_* \le G$ and $\sup_{u,w \in S}\big(\Phi(u) - \Phi(w)\big) \le D^2$, this gives $R_T(u) \le DG\sqrt{2T/\beta}$ for all $u \in S$.
Boundedness of the gradients of $\ell_t$ w.r.t. $\|\cdot\|_*$ is equivalent to Lipschitzness of $\ell_t$ w.r.t. $\|\cdot\|$. This regret bound is optimal for general convex losses $\ell_t$.

Analysis relies on smoothness of $\Phi^*$
$$\Phi^*(\theta_{t+1}) - \Phi^*(\theta_t) \le \underbrace{\nabla\Phi^*(\theta_t)}_{w_t}{}^{\top}\underbrace{\big(\theta_{t+1} - \theta_t\big)}_{-\eta\nabla\ell_t(w_t)} + \frac{1}{2\beta}\big\|\theta_{t+1} - \theta_t\big\|_*^2$$
Hence
$$-\eta\sum_{t=1}^{T} u^{\top}\nabla\ell_t(w_t) - \Phi(u) = u^{\top}\theta_{T+1} - \Phi(u) \le \Phi^*(\theta_{T+1}) \qquad \text{(Fenchel-Young inequality)}$$
$$= \sum_{t=1}^{T}\big(\Phi^*(\theta_{t+1}) - \Phi^*(\theta_t)\big) + \Phi^*(\theta_1) \le \sum_{t=1}^{T}\left(-\eta\, w_t^{\top}\nabla\ell_t(w_t) + \frac{\eta^2}{2\beta}\big\|\nabla\ell_t(w_t)\big\|_*^2\right) + \Phi^*(0)$$
where $\Phi^*(0) = \max_{w \in S}\big(-\Phi(w)\big) = -\min_{w \in S}\Phi(w)$.

Some examples
Losses of the form $\ell_t(w) = l_t\big(w^{\top}x_t\big)$ with $\max_t |l_t'| \le L$ and $\max_t \|x_t\|_p \le X_p$.
Bounds for OGD with convex losses:
$$R_T(u) \le BLX_2\sqrt{T} = O\big(L\sqrt{dT}\big) \qquad \text{for all } u \text{ such that } \|u\|_2 \le B$$
(the $\sqrt{d}$ appears because $X_2 = O(\sqrt{d})$ when the entries of $x_t$ are bounded).
Bounds logarithmic in the dimension: for EG run in the simplex, $S = \Delta_d$,
$$R_T(q) \le LX_{\infty}\sqrt{(\ln d)\,T} = O\big(L\sqrt{(\ln d)\,T}\big) \qquad q \in \Delta_d$$
The same bound holds for the $p$-norm regularizer with $p = \frac{\ln d}{\ln d - 1}$. If the losses are linear with $[0,1]$ coefficients, we recover the bound for Hedge.

Exploiting curvature: minimization of the SVM objective
Training set $(x_1,y_1),\dots,(x_m,y_m) \in \mathbb{R}^d \times \{-1,+1\}$. The SVM objective over $\mathbb{R}^d$ is
$$F(w) = \frac{1}{m}\sum_{t=1}^{m}\underbrace{\big[1 - y_t\, w^{\top}x_t\big]_+}_{\text{hinge loss } h_t(w)} + \frac{\lambda}{2}\|w\|^2$$
Rewrite $F(w) = \frac{1}{m}\sum_{t=1}^{m}\ell_t(w)$ where $\ell_t(w) = h_t(w) + \frac{\lambda}{2}\|w\|^2$: each loss $\ell_t$ is $\lambda$-strongly convex.
The Pegasos algorithm: run OGD on a random sequence of $T$ training examples. Then
$$\mathbb{E}\left[F\left(\frac{1}{T}\sum_{t=1}^{T} w_t\right)\right] \le \min_{w \in \mathbb{R}^d} F(w) + \frac{G^2}{2\lambda}\cdot\frac{\ln T + 1}{T}$$
$O(\ln T)$ regret rates hold for any sequence of strongly convex losses.
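
A minimal Pegasos-style sketch (OGD with step size $1/(\lambda t)$ on randomly drawn examples, averaging the iterates); the synthetic data and the exact step-size schedule are my assumptions.

```python
import numpy as np

def pegasos(X, y, lam, T, rng):
    """SGD on the lambda-strongly convex per-example SVM objective
    ell_t(w) = [1 - y w'x]_+ + (lam/2)||w||^2, with step size 1/(lam*t)."""
    m, d = X.shape
    w, w_avg = np.zeros(d), np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)                 # random training example
        grad = lam * w                      # gradient of the regularizer
        if y[i] * (w @ X[i]) < 1.0:         # hinge is active
            grad -= y[i] * X[i]             # subgradient of the hinge
        w -= grad / (lam * t)
        w_avg += (w - w_avg) / t            # running average of the iterates
    return w_avg

rng = np.random.default_rng(0)
m, d = 500, 10
X = rng.standard_normal((m, d))
u = rng.standard_normal(d)
y = np.sign(X @ u)                          # linearly separable toy labels
w = pegasos(X, y, lam=0.1, T=50_000, rng=rng)
print(np.mean(np.sign(X @ w) == y))         # training accuracy
```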

Exp-concave losses
Exp-concavity (strong convexity along the gradient direction): a convex $\ell : S \to \mathbb{R}$ is $\alpha$-exp-concave when $g(w) = e^{-\alpha\,\ell(w)}$ is concave. For twice-differentiable losses this is
$$\nabla^2\ell(w) \succeq \alpha\,\nabla\ell(w)\,\nabla\ell(w)^{\top} \qquad \text{for all } w \in S$$
Example: $\ell_t(w) = -\ln\big(w^{\top}x_t\big)$ is exp-concave.

Online Newton Step [Hazan, Agarwal and Kale, 2007]
Update:
$$w' = w_t - A_t^{-1}\nabla\ell_t(w_t) \qquad w_{t+1} = \operatorname*{arg\,min}_{w \in S}\|w - w'\|_{A_t} \qquad \text{where } A_t = \varepsilon I + \sum_{s=1}^{t} \nabla\ell_s(w_s)\,\nabla\ell_s(w_s)^{\top}$$
Note: this is not an instance of OMD.
Logarithmic regret bound for exp-concave losses:
$$R_T(u) \le 5d\left(\frac{1}{\alpha} + GD\right)\ln(T+1) \qquad u \in S$$
Extension of ONS to convex losses [Luo, Agarwal, C-B, Langford, 2016]: for $\ell_t(w) = l_t\big(w^{\top}x_t\big)$ with $\max_t |l_t'| \le L$,
$$R_T(u) \le \widetilde{O}\big(CL\sqrt{dT}\big) \qquad \text{for all } u \text{ s.t. } \big|u^{\top}x_t\big| \le C$$
with invariance to linear transformations of the data.
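
A minimal unconstrained ONS sketch (taking $S = \mathbb{R}^d$ so the $\|\cdot\|_{A_t}$ projection is skipped, and using a rank-one Sherman-Morrison update to keep $A_t^{-1}$ cheap); the square-loss data and the choice of $\varepsilon$ are illustrative assumptions, not part of the original statement.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, eps = 5, 2000, 1.0
u = rng.standard_normal(d)                  # target for toy square losses
w = np.zeros(d)
A_inv = np.eye(d) / eps                     # maintain A_t^{-1} directly
total = 0.0
for t in range(T):
    x = rng.standard_normal(d)
    y = u @ x                               # comparator u has zero loss
    g = 2.0 * (w @ x - y) * x               # gradient of (w'x - y)^2 at w_t
    total += (w @ x - y) ** 2
    # Sherman-Morrison rank-one update of A^{-1} for A <- A + g g'
    Ag = A_inv @ g
    A_inv -= np.outer(Ag, Ag) / (1.0 + g @ Ag)
    w = w - A_inv @ g                       # Newton-style step (no projection)
print(total)                                # cumulative loss, hence the regret
```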

Online Ridge Regression [Vovk, 2001; Azoury and Warmuth, 2001]
Logarithmic regret for the square loss $\ell_t(u) = \big(u^{\top}x_t - y_t\big)^2$, with $Y = \max_{t=1,\dots,T}|y_t|$ and $X = \max_{t=1,\dots,T}\|x_t\|$.
OMD with the adaptive regularizer $\Phi_t(w) = \frac{1}{2}\|w\|_{A_t}^2$, where
$$A_t = I + \sum_{s=1}^{t} x_s x_s^{\top} \qquad \theta_t = \sum_{s=1}^{t} y_s x_s$$
Regret bound (oracle inequality):
$$\sum_{t=1}^{T}\ell_t(w_t) \le \inf_{u \in \mathbb{R}^d}\left(\sum_{t=1}^{T}\ell_t(u) + \|u\|^2\right) + dY^2\ln\left(1 + \frac{TX^2}{d}\right)$$
Parameterless, and the comparison set is unbounded.
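
A small sketch of this forecaster in one standard form: predict with $w_t = A_t^{-1}\theta_{t-1}$, where $A_t$ already includes the current $x_t$ but $\theta_{t-1}$ only the labels seen so far. The synthetic data and the ridge comparator used for checking are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 5000
u = rng.standard_normal(d)                      # unknown target vector
X_data = rng.standard_normal((T, d))
y_data = X_data @ u + 0.1 * rng.standard_normal(T)

A = np.eye(d)                                   # A_t = I + sum_s x_s x_s'
theta = np.zeros(d)                             # theta_t = sum_s y_s x_s
total = 0.0
for t in range(T):
    x, y = X_data[t], y_data[t]
    A += np.outer(x, x)                         # include the current x_t
    w = np.linalg.solve(A, theta)               # predict before seeing y_t
    total += (w @ x - y) ** 2
    theta += y * x                              # now incorporate the label
ridge_u = np.linalg.solve(np.eye(d) + X_data.T @ X_data, X_data.T @ y_data)
best = ((X_data @ ridge_u - y_data) ** 2).sum()
print(total - best)                             # should grow like d ln T
```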

Scale-free algorithm for convex losses [Orabona and Pál, 2015]
OMD with the adaptive regularizer
$$\Phi_t(w) = \Phi_0(w)\sqrt{\sum_{s=1}^{t-1}\big\|\nabla\ell_s(w_s)\big\|_*^2}$$
where $\Phi_0$ is a $\beta$-strongly convex base regularizer.
Regret bound (oracle inequality) for convex loss functions $\ell_t$:
$$\sum_{t=1}^{T}\ell_t(w_t) \le \inf_{u \in \mathbb{R}^d}\left[\sum_{t=1}^{T}\ell_t(u) + \left(\Phi_0(u) + \frac{1}{\beta}\right)\sqrt{\sum_{t=1}^{T}\big\|\nabla\ell_t(w_t)\big\|_*^2}\right] + \max_t\big\|\nabla\ell_t(w_t)\big\|_*$$

Regularization via stochastic smoothing
Follow the Perturbed Leader (FPL) [Kalai and Vempala, 2005], for linear losses:
$$w_{t+1} = \mathbb{E}_Z\left[\operatorname*{arg\,min}_{w \in S}\; w^{\top}\left(\eta\sum_{s=1}^{t}\nabla\ell_s(w_s) + Z\right)\right]$$
The distribution of $Z$ must be stable (small variance and small average sensitivity). The regret bound is similar to FTRL/OMD, and for some choices of $Z$, FPL becomes equivalent to OMD [Abernethy, Lee, Sinha and Tewari, 2014].

Shifting regret [Herbster and Warmuth, 2001]
Nonstationarity: if the data source is not fitted well by any model in the class, then comparing to the best fixed model $u \in S$ is trivial. Compare instead to the best sequence $u_1, u_2, \dots \in S$ of models.
Shifting regret for OMD [Zinkevich, 2003]:
$$\underbrace{\sum_{t=1}^{T}\ell_t(w_t)}_{\text{cumulative loss}} \le \inf_{u_1,\dots,u_T \in S}\left[\underbrace{\sum_{t=1}^{T}\ell_t(u_t)}_{\text{model fit}} + \underbrace{\sum_{t=1}^{T}\|u_t - u_{t-1}\|}_{\text{shifting model cost}}\right] + \operatorname{diam}(S) + \cdots$$

Strongly adaptive regret [Daniely, Gonen, Shalev-Shwartz, 2015]
Definition: for all intervals $I = \{r,\dots,s\}$ with $1 \le r < s \le T$,
$$R_{T,I}(u) = \sum_{t \in I}\ell_t(w_t) - \sum_{t \in I}\ell_t(u)$$
Regret bound for strongly adaptive OGD:
$$R_{T,I}(u) \le \big(BLX_2 + \ln(T+1)\big)\sqrt{|I|} \qquad \text{for all } u \text{ such that } \|u\|_2 \le B$$
Remarks: this is a generic black-box reduction applicable to any online learning algorithm, and it runs a logarithmic number of instances of the base learner.

Online bandit convex optimization
1. Play $w_t$ from a convex and compact subset $S$ of a linear space
2. Observe only $\ell_t(w_t)$, where $\ell_t : S \to \mathbb{R}$ is an unobserved convex loss
3. Update: $w_t \to w_{t+1} \in S$
Regret: $R_T(u) = \sum_{t=1}^{T}\ell_t(w_t) - \sum_{t=1}^{T}\ell_t(u)$ for $u \in S$.
Results:
- Linear losses: $\Omega\big(d\sqrt{T}\big)$ [Dani, Hayes, and Kakade, 2008]
- Linear losses: $\widetilde{O}\big(d\sqrt{T}\big)$ [Bubeck, C-B, and Kakade, 2012]
- Strongly convex and smooth losses: $\widetilde{O}\big(d^{3/2}\sqrt{T}\big)$ [Hazan and Levy, 2014]
- Convex losses: $\widetilde{O}\big(d^{9.5}\sqrt{T}\big)$ [Bubeck, Eldan, and Lee, 2016]


More information

Combinatorial Bandits

Combinatorial Bandits Combinatorial Bandits Nicolò Cesa-Bianchi Università degli Studi di Milano, Italy Gábor Lugosi 1 ICREA and Pompeu Fabra University, Spain Abstract We study sequential prediction problems in which, at each

More information

The Price of Differential Privacy for Online Learning

The Price of Differential Privacy for Online Learning Naman Agarwal Karan Singh Abstract We design differentially private algorithms for the problem of online linear optimization in the full information and bandit settings with optimal ( p T ) regret bounds.

More information

Online Learning: Random Averages, Combinatorial Parameters, and Learnability

Online Learning: Random Averages, Combinatorial Parameters, and Learnability Online Learning: Random Averages, Combinatorial Parameters, and Learnability Alexander Rakhlin Department of Statistics University of Pennsylvania Karthik Sridharan Toyota Technological Institute at Chicago

More information

Algorithmic Chaining and the Role of Partial Feedback in Online Nonparametric Learning

Algorithmic Chaining and the Role of Partial Feedback in Online Nonparametric Learning Proceedings of Machine Learning Research vol 65:1 17, 2017 Algorithmic Chaining and the Role of Partial Feedback in Online Nonparametric Learning Nicolò Cesa-Bianchi Università degli Studi di Milano, Milano,

More information

Online Learning with Predictable Sequences

Online Learning with Predictable Sequences JMLR: Workshop and Conference Proceedings vol (2013) 1 27 Online Learning with Predictable Sequences Alexander Rakhlin Karthik Sridharan rakhlin@wharton.upenn.edu skarthik@wharton.upenn.edu Abstract We

More information

Partial monitoring classification, regret bounds, and algorithms

Partial monitoring classification, regret bounds, and algorithms Partial monitoring classification, regret bounds, and algorithms Gábor Bartók Department of Computing Science University of Alberta Dávid Pál Department of Computing Science University of Alberta Csaba

More information

Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning

Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning Francesco Orabona Yahoo! Labs New York, USA francesco@orabona.com Abstract Stochastic gradient descent algorithms

More information

Algorithmic Chaining and the Role of Partial Feedback in Online Nonparametric Learning

Algorithmic Chaining and the Role of Partial Feedback in Online Nonparametric Learning Algorithmic Chaining and the Role of Partial Feedback in Online Nonparametric Learning Nicolò Cesa-Bianchi, Pierre Gaillard, Claudio Gentile, Sébastien Gerchinovitz To cite this version: Nicolò Cesa-Bianchi,

More information

Efficient learning by implicit exploration in bandit problems with side observations

Efficient learning by implicit exploration in bandit problems with side observations Efficient learning by implicit exploration in bandit problems with side observations omáš Kocák Gergely Neu Michal Valko Rémi Munos SequeL team, INRIA Lille Nord Europe, France {tomas.kocak,gergely.neu,michal.valko,remi.munos}@inria.fr

More information

Online Learning meets Optimization in the Dual

Online Learning meets Optimization in the Dual Online Learning meets Optimization in the Dual Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci. & Eng., The Hebrew University, Jerusalem 91904, Israel 2 Google Inc., 1600 Amphitheater

More information

Online Sparse Linear Regression

Online Sparse Linear Regression JMLR: Workshop and Conference Proceedings vol 49:1 11, 2016 Online Sparse Linear Regression Dean Foster Amazon DEAN@FOSTER.NET Satyen Kale Yahoo Research SATYEN@YAHOO-INC.COM Howard Karloff HOWARD@CC.GATECH.EDU

More information

Bandit models: a tutorial

Bandit models: a tutorial Gdt COS, December 3rd, 2015 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions) Bandit game: a each round t, an agent chooses

More information