Optimal and Adaptive Online Learning


Optimal and Adaptive Online Learning
Haipeng Luo
Advisor: Robert Schapire
Computer Science Department, Princeton University

Examples of Online Learning (a): Spam detection
emails arrive one by one
the spam detector makes a prediction for each of them: spam or not
it sometimes gets feedback from users
and makes better predictions next time

Examples of Online Learning (b): Online routing
packets arrive one by one
pick a path to send each packet
observe the latency afterward
adjust the strategy and pick a better path next time

Examples of Online Learning (c): Recommendation
users visit Netflix/Amazon/...
movies/products/... are recommended to each user
users click or not
make better recommendations next time

Why Online Learning?
Traditional batch learning: input is a batch of training data; output is a fixed predictor.
Compared to batch learning, online learning is:
a natural model for many applications, mixing training and testing
memory efficient, requiring only one pass over the data
often free of distributional assumptions
suitable for changing or adversarial environments
An active research area, proven extremely useful in practice.

Main Goals of My Work
Design better online learning algorithms that are:
computationally efficient
theoretically optimal
adaptive and robust, working well for different data and different criteria
parameter-free and ready to use

Overview of My Work
1. Combining expert advice [LS14a, LS14b, LS15]
2. Online boosting [BKL15, BHKL15]
3. Efficient second-order online learning [LACL16]
4. Accelerated stochastic optimization [LHP14, HL16]
5. Fast convergence in games [SALS15]
These works connect online learning with boosting, data sketching, optimization and stochastic learning, and game theory.

Online Boosting
(Beygelzimer, Kale and Luo, ICML 2015; Beygelzimer, Hazan, Kale and Luo, NIPS 2015)

Brief Introduction of Boosting (Schapire, 1990; Freund, 1995)
Boosting is a highly successful batch learning algorithm.
Idea: combine weak rules of thumb to form a highly accurate predictor.
Example: spam detection.
Given: a set of training examples, e.g.
("Wanna make fast money?...", spam)
("How is your thesis going? Rob", not spam)
Obtain a classifier by asking a weak learning algorithm, e.g. contains the word "money" ⇒ spam.
Reweight the examples so that difficult ones get more attention, e.g. spam that doesn't contain "money".
Obtain another classifier, e.g. empty "To" address ⇒ spam.
...
At the end, predict by taking a (weighted) majority vote.
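
To make the reweighting idea concrete, here is a minimal AdaBoost-style sketch in Python, using single-feature threshold rules (decision stumps) as the weak rules of thumb. It is an illustrative sketch of the general recipe, not the exact algorithm analyzed later in the talk; the stump learner and the number of rounds are placeholders.

import numpy as np

def train_stump(X, y, w):
    """Return the single-feature threshold rule with smallest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, n_rounds=10):
    """X: n x d array, y in {-1, +1}^n. Returns a list of weighted stumps."""
    n = len(y)
    w = np.full(n, 1.0 / n)                     # example weights, initially uniform
    ensemble = []
    for _ in range(n_rounds):
        err, j, thr, sign = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # vote weight of this weak rule
        pred = np.where(X[:, j] > thr, sign, -sign)
        w *= np.exp(-alpha * y * pred)          # misclassified examples get more attention
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    """Final prediction: a weighted majority vote of the weak rules."""
    score = sum(a * np.where(X[:, j] > thr, sign, -sign) for a, j, thr, sign in ensemble)
    return np.sign(score)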

Online Boosting: Motivation
Boosting is well studied in the batch setting, but becomes infeasible when the amount of data is huge.
Online learning comes to the rescue: it is memory efficient and works even in an adversarial environment.
A natural question: how can boosting be extended to the online setting? Equivalently, how can we boost the accuracy of existing online learning algorithms?

Related Work
Several algorithms exist (Oza and Russell, 2001; Grabner and Bischof, 2006; Liu and Yu, 2007; Grabner et al., 2008): they mimic their batch counterparts and have achieved great success in many real-world applications, but they come with no theoretical guarantees.
Chen et al. (2012): the first online boosting algorithms with theoretical guarantees, connecting online boosting to smooth batch boosting.
Our work: a more natural formulation, plus optimal and adaptive algorithms.

Batch Boosting
Given a batch of $T$ examples $(x_t, y_t)$ for $t = 1, \ldots, T$; learner $A$ predicts $A(x_t)$ for example $x_t$.
Weak learner $A$ (with edge $\gamma$): $\#\text{Mistakes} = \sum_{t=1}^{T} \mathbf{1}\{A(x_t) \neq y_t\} \leq (\tfrac{1}{2} - \gamma)T$.
Strong learner $A^*$ (with any target error rate $\epsilon$): $\#\text{Mistakes} = \sum_{t=1}^{T} \mathbf{1}\{A^*(x_t) \neq y_t\} \leq \epsilon T$.
Boosting (Schapire, 1990; Freund, 1995) turns a weak learner into a strong learner.

Online Boosting
Examples $(x_t, y_t)$ arrive online for $t = 1, \ldots, T$; learner $A$ observes $x_t$ and predicts $A(x_t)$ before seeing $y_t$.
Weak online learner $A$ (with edge $\gamma$ and excess loss $S$, the mistakes made before it has learned anything): $\#\text{Mistakes} = \sum_{t=1}^{T} \mathbf{1}\{A(x_t) \neq y_t\} \leq (\tfrac{1}{2} - \gamma)T + S$.
Strong online learner $A^*$ (with any target error rate $\epsilon$ and excess loss $S^*$): $\#\text{Mistakes} = \sum_{t=1}^{T} \mathbf{1}\{A^*(x_t) \neq y_t\} \leq \epsilon T + S^*$.
Online boosting (our result) turns a weak online learner into a strong online learner.

Structure of Online Boosting
On each round $t$, the booster receives $x_t$ and queries the weak learners $\mathrm{WL}^1, \ldots, \mathrm{WL}^N$, each of which predicts $\hat{y}_t^i$ on $x_t$. The booster combines these into its final prediction $\hat{y}_t$, then observes the true label $y_t$, and passes the example $(x_t, y_t)$ to each weak learner $\mathrm{WL}^i$ for an update with probability $p_t^i$.
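
In code, one round of this structure might look like the sketch below. The WeakLearner interface (predict/update), the combining rule, and the sampling probabilities p_t^i are deliberately left abstract, since choosing them is exactly what distinguishes the concrete algorithms on the next slide.

import random

def boosting_round(weak_learners, combine, sample_prob, x_t, y_t):
    """One round of online boosting with N weak learners.

    weak_learners: objects with .predict(x) -> label and .update(x, y)
    combine:       maps the list of weak predictions to the booster's prediction
    sample_prob:   maps (i, weak_preds, y_t) to the probability p_t^i of updating learner i
    """
    # each weak learner predicts on x_t
    weak_preds = [wl.predict(x_t) for wl in weak_learners]
    # the booster combines them into its own prediction, made before y_t is seen
    y_hat = combine(weak_preds)
    # the true label is revealed; weak learner i receives (x_t, y_t) with probability p_t^i
    for i, wl in enumerate(weak_learners):
        if random.random() < sample_prob(i, weak_preds, y_t):
            wl.update(x_t, y_t)
    return y_hat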

Theoretical Guarantees
To achieve error rate $\epsilon$, how many weak learners ($N$) and examples ($T$) do we need?
Online BBM: $N = O(\frac{1}{\gamma^2} \ln \frac{1}{\epsilon})$, $T = \tilde{O}(\frac{1}{\epsilon \gamma^2})$ (optimal, but not adaptive)
AdaBoost.OL: $N = O(\frac{1}{\epsilon \gamma^2})$, $T = \tilde{O}(\frac{1}{\epsilon^2 \gamma^4})$ (adaptive, but not optimal)
Chen et al. (2012): $N = O(\frac{1}{\epsilon \gamma^2})$, $T = \tilde{O}(\frac{1}{\epsilon \gamma^2})$ (neither optimal nor adaptive)

Algorithms
Online BBM: derived by generalizing the drifting game (Schapire, 2001)
prediction: majority vote
update: $\Pr[(x_t, y_t) \text{ sent to the } i\text{-th weak learner}] \propto \Pr[k_t^i \text{ heads in } N - i \text{ flips of a } \tfrac{\gamma}{2}\text{-biased coin}]$
AdaBoost.OL: derived via online loss minimization
prediction: weighted majority vote
update: $\Pr[(x_t, y_t) \text{ sent to the } i\text{-th weak learner}] \propto$ the derivative of the logistic loss at a suitable point
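
As an illustration, the Online BBM weight has the form of a binomial probability, which could be computed as below (reading "γ/2-biased" as heads probability 1/2 + γ/2); how k_t^i is defined and how these weights are normalized into probabilities follow the paper and are not reproduced here.

from math import comb

def bbm_weight(k, i, N, gamma):
    """Pr[k heads in N - i flips of a coin with heads probability 1/2 + gamma/2]."""
    n = N - i
    p = 0.5 + gamma / 2.0
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

# e.g. with N = 20 weak learners and edge gamma = 0.1:
# bbm_weight(k=8, i=5, N=20, gamma=0.1)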

Experimental Results (I)
Available in the widely used toolkit Vowpal Wabbit (command line option: --boosting), with VW as the default weak learner (a rather strong one!).
[Table: test error rates of the VW baseline, Online BBM, AdaBoost.OL, and Chen et al. (2012) on the datasets 20news, a9a, activity, adult, bio, census, covtype, letter, maptaskcoref, nomao, poker, rcv1, and vehv2binary.]

Experimental Results (II)
Generalized to the regression setting (BHKL15).
Base learner         Average improvement
Regression stumps    20.22%
Neural networks      7.88%
VW default           1.65%

Combining Expert Advice More Efficiently
(Luo and Schapire, NIPS 2014, COLT 2015)

The Expert Problem
Need to make a decision every day: which stock to invest in? which route to drive home? ...
Given: advice from a set of experts (financial advisers; Route 1, Route 206, I-95, ...).
Question: how to combine the advice and make smart choices?

Comparison with Online Boosting
Online boosting uses copies of the same weak learning algorithm; the expert problem has different experts.
Online boosting makes an assumption on the weak learners; the experts give arbitrary advice.
Online boosting controls the inputs to the weak learners; the expert problem only combines the advice.
Online boosting aims to improve accuracy substantially; the expert problem aims to find the best expert.

Why Interesting?
Applications: investment, route selection, ...
Foundation of many other learning algorithms: batch boosting, bandit problems, solving SDPs, online PCA, ...

Formal Setup (Littlestone and Warmuth, 1994; Freund and Schapire, 1997)
For $t = 1, 2, \ldots, T$:
the player predicts a distribution $p_t$ over $N$ experts
the adversary reveals the loss $\ell_{t,i} \in [0, 1]$ of each expert
the player suffers loss $p_t^\top \ell_t$ for this round
Regret to expert $i$: $R_{T,i} = \sum_{t=1}^{T} p_t^\top \ell_t - \sum_{t=1}^{T} \ell_{t,i}$, the total loss of the player minus the total loss of expert $i$.
Goal: obtain sublinear regret to the best expert, and more ambitiously to the top experts (say the top 10, or the top $\epsilon$-fraction) or to any competitor.
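
For concreteness, here is a minimal Python implementation of the classical exponential-weights (Hedge) strategy for this setup; the learning rate eta is a tuning parameter (typically of order sqrt(ln N / T)), in contrast to the parameter-free algorithm introduced later.

import numpy as np

def hedge(loss_stream, N, eta):
    """Expert problem with exponential weights: p_t proportional to exp(eta * R_{t-1,i}).

    loss_stream: iterable of loss vectors ell_t in [0, 1]^N
    Returns the player's total loss and each expert's total loss.
    """
    R = np.zeros(N)                       # cumulative regret to each expert
    player_loss = 0.0
    expert_loss = np.zeros(N)
    for ell in loss_stream:
        ell = np.asarray(ell, dtype=float)
        p = np.exp(eta * (R - R.max()))   # subtract the max for numerical stability
        p /= p.sum()
        round_loss = p @ ell              # the player suffers p_t . ell_t
        player_loss += round_loss
        expert_loss += ell
        R += round_loss - ell             # update the regrets
    return player_loss, expert_loss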

Beyond the Classic Regret
More difficult objectives:
achieving much smaller regret when the problem is easy, while still ensuring worst-case robustness
learning with an unknown number of experts
competing with a sequence of different competitors
learning with experts who provide confidence-rated advice
Previous work: each of these was studied to some extent, with a different algorithm for each.
Our work: one single parameter-free algorithm achieving all these goals, with improved results.

Expert Problem as a Drifting Game
Main technique: reduce the problem to a drifting game.
Consequence: one upper bound on the 0-1 loss yields one expert algorithm.
Exponential surrogate: $p_{t,i} \propto \exp(\eta R_{t-1,i})$, regret to the top-$\epsilon$ fraction $\sqrt{T \ln \frac{1}{\epsilon}}$.
Squared hinge surrogate: $p_{t,i} \propto [R_{t-1,i}+1]_+^2 - [R_{t-1,i}-1]_+^2$, regret $\sqrt{T / \epsilon}$.
Surrogate $\exp([s]_+^2 / 3T)$: $p_{t,i} \propto \exp\big(\frac{[R_{t-1,i}+1]_+^2}{3t}\big) - \exp\big(\frac{[R_{t-1,i}-1]_+^2}{3t}\big)$, regret $\sqrt{T \ln \frac{\ln T}{\epsilon}}$.

AdaNormalHedge
A first attempt weights expert $i$ on round $t$ by
$p_{t,i} \propto \exp\Big(\frac{[R_{t-1,i}+1]_+^2}{3t}\Big) - \exp\Big(\frac{[R_{t-1,i}-1]_+^2}{3t}\Big)$
which gives $\tilde{O}(\sqrt{T})$ regret: worst-case optimal, but too conservative!
AdaNormalHedge instead uses
$p_{t,i} \propto \exp\Big(\frac{[R_{t-1,i}+1]_+^2}{3(C_{t-1,i}+1)}\Big) - \exp\Big(\frac{[R_{t-1,i}-1]_+^2}{3(C_{t-1,i}+1)}\Big)$
where $C_{t,i} = \sum_{\tau=1}^{t} |p_\tau^\top \ell_\tau - \ell_{\tau,i}|$ is the cumulative regret magnitude (the regret magnitude of expert $i$ at round $\tau$, summed over rounds).
This is much more adaptive:
small regret when the best expert has small loss
almost constant regret when the losses are stochastic
...
resolving open problems of Chaudhuri et al. (2009), Chernov and Vovk (2010), and Warmuth and Koolen (2014).
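
A minimal sketch of AdaNormalHedge following the displayed formula, with a uniform prior over the experts; the paper's full version also handles non-uniform priors and sleeping experts, which are omitted here.

import numpy as np

def anh_weights(R, C):
    """AdaNormalHedge distribution from cumulative regrets R and regret magnitudes C."""
    def phi(r):
        return np.exp(np.maximum(r, 0.0) ** 2 / (3.0 * (C + 1.0)))
    w = phi(R + 1.0) - phi(R - 1.0)
    if w.sum() <= 0:                       # all regrets very negative: any distribution is fine
        return np.full(len(R), 1.0 / len(R))
    return w / w.sum()

def anh_round(R, C, ell):
    """Play one round against the loss vector ell; returns (p_t, updated R, updated C)."""
    p = anh_weights(R, C)
    inst = p @ ell - ell                   # instantaneous regret to each expert
    return p, R + inst, C + np.abs(inst)

# usage sketch: R = np.zeros(N); C = np.zeros(N)
# for ell in loss_stream: p, R, C = anh_round(R, C, np.asarray(ell, dtype=float))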

Application
Given: a big unpruned decision tree.
Goal: predict almost as well as the best pruned tree, online.
This reduces to an expert problem (Helmbold and Schapire, 1995; Freund et al., 1997), and AdaNormalHedge provides a parameter-free solution with an exponential improvement in regret!

Efficient Second Order Online Learning via Sketching
(Luo, Agarwal, Cesa-Bianchi and Langford, preprint, 2016)

Online Linear Prediction
At each time $t$:
see an example $x_t$ (e.g. an image)
predict its label as $w_t^\top x_t$ (e.g. cat or not)
see the true label and suffer some loss
update $w_t$ to $w_{t+1}$
e.g. Online Gradient Descent (Zinkevich, 2003): $w_{t+1} = w_t - \eta g_t$, where $g_t$ is the gradient.
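
A minimal sketch of online gradient descent for binary linear prediction with the logistic loss; the step size eta is a placeholder that would normally be tuned.

import numpy as np

def ogd_logistic(stream, d, eta=0.1):
    """stream yields (x_t, y_t) with x_t in R^d and y_t in {-1, +1}."""
    w = np.zeros(d)
    mistakes = 0
    for x, y in stream:
        margin = w @ x                              # predict sign(w_t . x_t)
        mistakes += int(np.sign(margin) != y)
        g = -y * x / (1.0 + np.exp(y * margin))     # gradient of log(1 + exp(-y * w.x))
        w = w - eta * g                             # w_{t+1} = w_t - eta * g_t
    return w, mistakes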

Problem
Performance often depends on the conditioning of the data.

Comparisons
[Figure: error rate as a function of the condition number of the data, for OGD, Online Newton, AdaGrad, and SON.]
Notation: $A_t = \sum_{\tau=1}^{t} g_\tau g_\tau^\top$; $s$ = sparsity, $d$ = dimensionality, $m$ = sketch size.
OGD (Zinkevich, 2003): $w_{t+1} = w_t - \eta\, g_t$, time per example $O(s)$
Online Newton (Hazan et al., 2007): $w_{t+1} = w_t - \eta\, A_t^{-1} g_t$, time per example $O(d^2)$
AdaGrad (Duchi et al., 2011): $w_{t+1} = w_t - \eta\, \mathrm{diag}(A_t)^{-1/2} g_t$, time per example $O(s)$
SON (our work): $w_{t+1} = w_t - \eta\, \mathrm{sketch}(A_t)^{-1} g_t$, time per example $O(ms + m^2)$
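
SON keeps the Newton flavor affordable by replacing the full matrix A_t with a low-rank sketch built from the gradients. As an illustration of that sketching ingredient only (the full SON update also needs a Woodbury-style inverse of the sketch plus the paper's projection step), here is a minimal Frequent Directions sketch of a gradient stream, assuming a sketch size 2 <= m <= d.

import numpy as np

def frequent_directions(gradients, m):
    """Maintain an m x d sketch S with S^T S approximating sum_t g_t g_t^T."""
    S = None
    row = 0
    for g in gradients:
        g = np.asarray(g, dtype=float)
        if S is None:
            S = np.zeros((m, len(g)))
        if row == m:                                 # sketch is full: shrink it
            _, sig, Vt = np.linalg.svd(S, full_matrices=False)
            sig2 = np.maximum(sig ** 2 - sig[m // 2] ** 2, 0.0)
            S = np.sqrt(sig2)[:, None] * Vt          # rows m//2, ..., m-1 become zero
            row = m // 2
        S[row] = g                                   # insert the new gradient into a zero row
        row += 1
    return S

# SON-style use: invert S^T S + alpha * I cheaply via the Woodbury identity,
# which is how a per-example cost of roughly O(ms + m^2) can be achieved.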

Performance on Real-World Datasets
[Figure: scatter plot of the error rate of SON against the error rate of AdaGrad on a collection of real-world datasets.]

Conclusions
1. Online boosting: a novel and natural framework; new online boosting algorithms for classification and regression.
2. Combining expert advice: a general game-theoretic framework for designing algorithms; adaptive algorithms with many desirable properties simultaneously.
3. Efficient second-order learning: how to utilize sketching techniques to deal with ill-conditioned data.
