Optimal and Adaptive Online Learning


Optimal and Adaptive Online Learning
Haipeng Luo
Advisor: Robert Schapire
Computer Science Department, Princeton University

Examples of Online Learning (a): Spam detection
emails arrive one by one
the spam detector makes a prediction for each of them: spam or not
it sometimes gets feedback from users
and makes better predictions next time

Examples of Online Learning (b): Online routing
packets arrive one by one
pick a path to send each packet
observe the latency afterward
adjust the strategy and pick a better path next time

Examples of Online Learning (c): Recommendation
users visit Netflix/Amazon/...
movies/products/... are recommended to each user
users click or not
make better recommendations next time

Why Online Learning?
Traditional batch learning: input is a batch of training data; output is a fixed predictor.
Compared to batch learning, online learning is:
a natural model for many applications, mixing training and testing
memory efficient, requiring only one pass over the data
often free of distributional assumptions
suitable for changing or adversarial environments
An active research area, proven extremely useful in practice.

Main Goals of My Work
Design better online learning algorithms that are:
computationally efficient
theoretically optimal
adaptive and robust, working well for different data and different criteria
parameter-free and ready to use

Overview of My Work
1. Combining expert advice [LS14a, LS14b, LS15]
2. Online boosting [BKL15, BHKL15]
3. Efficient second-order online learning [LACL16]
4. Accelerated stochastic optimization [LHP14, HL16]
5. Fast convergence in games [SALS15]
These works connect online learning with boosting, data sketching, optimization and stochastic learning, and game theory.

Online Boosting
(Beygelzimer, Kale and Luo, ICML 2015; Beygelzimer, Hazan, Kale and Luo, NIPS 2015)

Brief Introduction of Boosting (Schapire, 1990; Freund, 1995)
Boosting is a highly successful batch learning algorithm.
Idea: combine weak rules of thumb to form a highly accurate predictor.
Example: spam detection.
Given: a set of training examples, e.g.
("Wanna make fast money?...", spam)
("How is your thesis going? Rob", not spam)
Obtain a classifier by asking a weak learning algorithm, e.g. contains the word "money" ⇒ spam.
Reweight the examples so that difficult ones get more attention, e.g. spam that doesn't contain "money".
Obtain another classifier, e.g. empty "To" address ⇒ spam.
...
At the end, predict by taking a (weighted) majority vote.
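
To make the reweighting idea concrete, here is a minimal AdaBoost-style sketch in Python, using single-feature threshold rules (decision stumps) as the weak rules of thumb. It is an illustrative sketch of the general recipe, not the exact algorithm analyzed later in the talk; the stump learner and the number of rounds are placeholders.

import numpy as np

def train_stump(X, y, w):
    """Return the single-feature threshold rule with smallest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, n_rounds=10):
    """X: n x d array, y in {-1, +1}^n. Returns a list of weighted stumps."""
    n = len(y)
    w = np.full(n, 1.0 / n)                     # example weights, initially uniform
    ensemble = []
    for _ in range(n_rounds):
        err, j, thr, sign = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # vote weight of this weak rule
        pred = np.where(X[:, j] > thr, sign, -sign)
        w *= np.exp(-alpha * y * pred)          # misclassified examples get more attention
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    """Final prediction: a weighted majority vote of the weak rules."""
    score = sum(a * np.where(X[:, j] > thr, sign, -sign) for a, j, thr, sign in ensemble)
    return np.sign(score)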

Online Boosting: Motivation
Boosting is well studied in the batch setting, but becomes infeasible when the amount of data is huge.
Online learning comes to the rescue: it is memory efficient and works even in an adversarial environment.
A natural question: how can boosting be extended to the online setting? Equivalently, how can we boost the accuracy of existing online learning algorithms?

Related Work
Several algorithms exist (Oza and Russell, 2001; Grabner and Bischof, 2006; Liu and Yu, 2007; Grabner et al., 2008): they mimic their batch counterparts and have achieved great success in many real-world applications, but they come with no theoretical guarantees.
Chen et al. (2012): the first online boosting algorithms with theoretical guarantees, connecting online boosting to smooth batch boosting.
Our work: a more natural formulation, plus optimal and adaptive algorithms.

Batch Boosting
Given a batch of $T$ examples $(x_t, y_t)$ for $t = 1, \ldots, T$; learner $A$ predicts $A(x_t)$ for example $x_t$.
Weak learner $A$ (with edge $\gamma$): $\#\text{Mistakes} = \sum_{t=1}^{T} \mathbf{1}\{A(x_t) \neq y_t\} \leq (\tfrac{1}{2} - \gamma)T$.
Strong learner $A^*$ (with any target error rate $\epsilon$): $\#\text{Mistakes} = \sum_{t=1}^{T} \mathbf{1}\{A^*(x_t) \neq y_t\} \leq \epsilon T$.
Boosting (Schapire, 1990; Freund, 1995) turns a weak learner into a strong learner.

Online Boosting
Examples $(x_t, y_t)$ arrive online for $t = 1, \ldots, T$; learner $A$ observes $x_t$ and predicts $A(x_t)$ before seeing $y_t$.
Weak online learner $A$ (with edge $\gamma$ and excess loss $S$, the mistakes made before it has learned anything): $\#\text{Mistakes} = \sum_{t=1}^{T} \mathbf{1}\{A(x_t) \neq y_t\} \leq (\tfrac{1}{2} - \gamma)T + S$.
Strong online learner $A^*$ (with any target error rate $\epsilon$ and excess loss $S^*$): $\#\text{Mistakes} = \sum_{t=1}^{T} \mathbf{1}\{A^*(x_t) \neq y_t\} \leq \epsilon T + S^*$.
Online boosting (our result) turns a weak online learner into a strong online learner.

Structure of Online Boosting
On each round $t$, the booster receives $x_t$ and queries the weak learners $\mathrm{WL}^1, \ldots, \mathrm{WL}^N$, each of which predicts $\hat{y}_t^i$ on $x_t$. The booster combines these into its final prediction $\hat{y}_t$, then observes the true label $y_t$, and passes the example $(x_t, y_t)$ to each weak learner $\mathrm{WL}^i$ for an update with probability $p_t^i$.
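
In code, one round of this structure might look like the sketch below. The WeakLearner interface (predict/update), the combining rule, and the sampling probabilities p_t^i are deliberately left abstract, since choosing them is exactly what distinguishes the concrete algorithms on the next slide.

import random

def boosting_round(weak_learners, combine, sample_prob, x_t, y_t):
    """One round of online boosting with N weak learners.

    weak_learners: objects with .predict(x) -> label and .update(x, y)
    combine:       maps the list of weak predictions to the booster's prediction
    sample_prob:   maps (i, weak_preds, y_t) to the probability p_t^i of updating learner i
    """
    # each weak learner predicts on x_t
    weak_preds = [wl.predict(x_t) for wl in weak_learners]
    # the booster combines them into its own prediction, made before y_t is seen
    y_hat = combine(weak_preds)
    # the true label is revealed; weak learner i receives (x_t, y_t) with probability p_t^i
    for i, wl in enumerate(weak_learners):
        if random.random() < sample_prob(i, weak_preds, y_t):
            wl.update(x_t, y_t)
    return y_hat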

Theoretical Guarantees
To achieve error rate $\epsilon$, how many weak learners ($N$) and examples ($T$) do we need?
Online BBM: $N = O(\frac{1}{\gamma^2} \ln \frac{1}{\epsilon})$, $T = \tilde{O}(\frac{1}{\epsilon \gamma^2})$ (optimal, but not adaptive)
AdaBoost.OL: $N = O(\frac{1}{\epsilon \gamma^2})$, $T = \tilde{O}(\frac{1}{\epsilon^2 \gamma^4})$ (adaptive, but not optimal)
Chen et al. (2012): $N = O(\frac{1}{\epsilon \gamma^2})$, $T = \tilde{O}(\frac{1}{\epsilon \gamma^2})$ (neither optimal nor adaptive)

Algorithms
Online BBM: derived by generalizing the drifting game (Schapire, 2001)
prediction: majority vote
update: $\Pr[(x_t, y_t) \text{ sent to the } i\text{-th weak learner}] \propto \Pr[k_t^i \text{ heads in } N - i \text{ flips of a } \tfrac{\gamma}{2}\text{-biased coin}]$
AdaBoost.OL: derived via online loss minimization
prediction: weighted majority vote
update: $\Pr[(x_t, y_t) \text{ sent to the } i\text{-th weak learner}] \propto$ the derivative of the logistic loss at a suitable point
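
As an illustration, the Online BBM weight has the form of a binomial probability, which could be computed as below (reading "γ/2-biased" as heads probability 1/2 + γ/2); how k_t^i is defined and how these weights are normalized into probabilities follow the paper and are not reproduced here.

from math import comb

def bbm_weight(k, i, N, gamma):
    """Pr[k heads in N - i flips of a coin with heads probability 1/2 + gamma/2]."""
    n = N - i
    p = 0.5 + gamma / 2.0
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

# e.g. with N = 20 weak learners and edge gamma = 0.1:
# bbm_weight(k=8, i=5, N=20, gamma=0.1)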

Experimental Results (I)
Available in the widely used toolkit Vowpal Wabbit (command line option: --boosting), with VW as the default weak learner (a rather strong one!).
[Table: test error rates of the VW baseline, Online BBM, AdaBoost.OL, and Chen et al. (2012) on the datasets 20news, a9a, activity, adult, bio, census, covtype, letter, maptaskcoref, nomao, poker, rcv1, and vehv2binary.]

Experimental Results (II)
Generalized to the regression setting (BHKL15).
Base learner         Average improvement
Regression stumps    20.22%
Neural networks      7.88%
VW default           1.65%

Combining Expert Advice More Efficiently
(Luo and Schapire, NIPS 2014, COLT 2015)

The Expert Problem
Need to make a decision every day: which stock to invest in? which route to drive home? ...
Given: advice from a set of experts (financial advisers; Route 1, Route 206, I-95, ...).
Question: how to combine the advice and make smart choices?

Comparison with Online Boosting
Online boosting uses copies of the same weak learning algorithm; the expert problem has different experts.
Online boosting makes an assumption on the weak learners; the experts give arbitrary advice.
Online boosting controls the inputs to the weak learners; the expert problem only combines the advice.
Online boosting aims to improve accuracy substantially; the expert problem aims to find the best expert.

Why Interesting?
Applications: investment, route selection, ...
Foundation of many other learning algorithms: batch boosting, bandit problems, solving SDPs, online PCA, ...

Formal Setup (Littlestone and Warmuth, 1994; Freund and Schapire, 1997)
For $t = 1, 2, \ldots, T$:
the player predicts a distribution $p_t$ over $N$ experts
the adversary reveals the loss $\ell_{t,i} \in [0, 1]$ of each expert
the player suffers loss $p_t^\top \ell_t$ for this round
Regret to expert $i$: $R_{T,i} = \sum_{t=1}^{T} p_t^\top \ell_t - \sum_{t=1}^{T} \ell_{t,i}$, the total loss of the player minus the total loss of expert $i$.
Goal: obtain sublinear regret to the best expert, and more ambitiously to the top experts (say the top 10, or the top $\epsilon$-fraction) or to any competitor.
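
For concreteness, here is a minimal Python implementation of the classical exponential-weights (Hedge) strategy for this setup; the learning rate eta is a tuning parameter (typically of order sqrt(ln N / T)), in contrast to the parameter-free algorithm introduced later.

import numpy as np

def hedge(loss_stream, N, eta):
    """Expert problem with exponential weights: p_t proportional to exp(eta * R_{t-1,i}).

    loss_stream: iterable of loss vectors ell_t in [0, 1]^N
    Returns the player's total loss and each expert's total loss.
    """
    R = np.zeros(N)                       # cumulative regret to each expert
    player_loss = 0.0
    expert_loss = np.zeros(N)
    for ell in loss_stream:
        ell = np.asarray(ell, dtype=float)
        p = np.exp(eta * (R - R.max()))   # subtract the max for numerical stability
        p /= p.sum()
        round_loss = p @ ell              # the player suffers p_t . ell_t
        player_loss += round_loss
        expert_loss += ell
        R += round_loss - ell             # update the regrets
    return player_loss, expert_loss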

Beyond the Classic Regret
More difficult objectives:
achieving much smaller regret when the problem is easy, while still ensuring worst-case robustness
learning with an unknown number of experts
competing with a sequence of different competitors
learning with experts who provide confidence-rated advice
Previous work: each of these was studied to some extent, with a different algorithm for each.
Our work: one single parameter-free algorithm achieving all these goals, with improved results.

Expert Problem as a Drifting Game
Main technique: reduce the problem to a drifting game.
Consequence: one upper bound on the 0-1 loss yields one expert algorithm.
Exponential surrogate: $p_{t,i} \propto \exp(\eta R_{t-1,i})$, regret to the top-$\epsilon$ fraction $\sqrt{T \ln \frac{1}{\epsilon}}$.
Squared hinge surrogate: $p_{t,i} \propto [R_{t-1,i}+1]_+^2 - [R_{t-1,i}-1]_+^2$, regret $\sqrt{T / \epsilon}$.
Surrogate $\exp([s]_+^2 / 3T)$: $p_{t,i} \propto \exp\big(\frac{[R_{t-1,i}+1]_+^2}{3t}\big) - \exp\big(\frac{[R_{t-1,i}-1]_+^2}{3t}\big)$, regret $\sqrt{T \ln \frac{\ln T}{\epsilon}}$.

AdaNormalHedge
A first attempt weights expert $i$ on round $t$ by
$p_{t,i} \propto \exp\Big(\frac{[R_{t-1,i}+1]_+^2}{3t}\Big) - \exp\Big(\frac{[R_{t-1,i}-1]_+^2}{3t}\Big)$
which gives $\tilde{O}(\sqrt{T})$ regret: worst-case optimal, but too conservative!
AdaNormalHedge instead uses
$p_{t,i} \propto \exp\Big(\frac{[R_{t-1,i}+1]_+^2}{3(C_{t-1,i}+1)}\Big) - \exp\Big(\frac{[R_{t-1,i}-1]_+^2}{3(C_{t-1,i}+1)}\Big)$
where $C_{t,i} = \sum_{\tau=1}^{t} |p_\tau^\top \ell_\tau - \ell_{\tau,i}|$ is the cumulative regret magnitude (the regret magnitude of expert $i$ at round $\tau$, summed over rounds).
This is much more adaptive:
small regret when the best expert has small loss
almost constant regret when the losses are stochastic
...
resolving open problems of Chaudhuri et al. (2009), Chernov and Vovk (2010), and Warmuth and Koolen (2014).
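
A minimal sketch of AdaNormalHedge following the displayed formula, with a uniform prior over the experts; the paper's full version also handles non-uniform priors and sleeping experts, which are omitted here.

import numpy as np

def anh_weights(R, C):
    """AdaNormalHedge distribution from cumulative regrets R and regret magnitudes C."""
    def phi(r):
        return np.exp(np.maximum(r, 0.0) ** 2 / (3.0 * (C + 1.0)))
    w = phi(R + 1.0) - phi(R - 1.0)
    if w.sum() <= 0:                       # all regrets very negative: any distribution is fine
        return np.full(len(R), 1.0 / len(R))
    return w / w.sum()

def anh_round(R, C, ell):
    """Play one round against the loss vector ell; returns (p_t, updated R, updated C)."""
    p = anh_weights(R, C)
    inst = p @ ell - ell                   # instantaneous regret to each expert
    return p, R + inst, C + np.abs(inst)

# usage sketch: R = np.zeros(N); C = np.zeros(N)
# for ell in loss_stream: p, R, C = anh_round(R, C, np.asarray(ell, dtype=float))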

Application
Given: a big unpruned decision tree.
Goal: predict almost as well as the best pruned tree, online.
This reduces to an expert problem (Helmbold and Schapire, 1995; Freund et al., 1997), and AdaNormalHedge provides a parameter-free solution with an exponential improvement in regret!

Efficient Second Order Online Learning via Sketching
(Luo, Agarwal, Cesa-Bianchi and Langford, preprint, 2016)

Online Linear Prediction
At each time $t$:
see an example $x_t$ (e.g. an image)
predict its label as $w_t^\top x_t$ (e.g. cat or not)
see the true label and suffer some loss
update $w_t$ to $w_{t+1}$
e.g. Online Gradient Descent (Zinkevich, 2003): $w_{t+1} = w_t - \eta g_t$, where $g_t$ is the gradient.
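
A minimal sketch of online gradient descent for binary linear prediction with the logistic loss; the step size eta is a placeholder that would normally be tuned.

import numpy as np

def ogd_logistic(stream, d, eta=0.1):
    """stream yields (x_t, y_t) with x_t in R^d and y_t in {-1, +1}."""
    w = np.zeros(d)
    mistakes = 0
    for x, y in stream:
        margin = w @ x                              # predict sign(w_t . x_t)
        mistakes += int(np.sign(margin) != y)
        g = -y * x / (1.0 + np.exp(y * margin))     # gradient of log(1 + exp(-y * w.x))
        w = w - eta * g                             # w_{t+1} = w_t - eta * g_t
    return w, mistakes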

Problem
Performance often depends on the conditioning of the data.

Comparisons
[Figure: error rate as a function of the condition number of the data, for OGD, Online Newton, AdaGrad, and SON.]
Notation: $A_t = \sum_{\tau=1}^{t} g_\tau g_\tau^\top$; $s$ = sparsity, $d$ = dimensionality, $m$ = sketch size.
OGD (Zinkevich, 2003): $w_{t+1} = w_t - \eta\, g_t$, time per example $O(s)$
Online Newton (Hazan et al., 2007): $w_{t+1} = w_t - \eta\, A_t^{-1} g_t$, time per example $O(d^2)$
AdaGrad (Duchi et al., 2011): $w_{t+1} = w_t - \eta\, \mathrm{diag}(A_t)^{-1/2} g_t$, time per example $O(s)$
SON (our work): $w_{t+1} = w_t - \eta\, \mathrm{sketch}(A_t)^{-1} g_t$, time per example $O(ms + m^2)$
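
SON keeps the Newton flavor affordable by replacing the full matrix A_t with a low-rank sketch built from the gradients. As an illustration of that sketching ingredient only (the full SON update also needs a Woodbury-style inverse of the sketch plus the paper's projection step), here is a minimal Frequent Directions sketch of a gradient stream, assuming a sketch size 2 <= m <= d.

import numpy as np

def frequent_directions(gradients, m):
    """Maintain an m x d sketch S with S^T S approximating sum_t g_t g_t^T."""
    S = None
    row = 0
    for g in gradients:
        g = np.asarray(g, dtype=float)
        if S is None:
            S = np.zeros((m, len(g)))
        if row == m:                                 # sketch is full: shrink it
            _, sig, Vt = np.linalg.svd(S, full_matrices=False)
            sig2 = np.maximum(sig ** 2 - sig[m // 2] ** 2, 0.0)
            S = np.sqrt(sig2)[:, None] * Vt          # rows m//2, ..., m-1 become zero
            row = m // 2
        S[row] = g                                   # insert the new gradient into a zero row
        row += 1
    return S

# SON-style use: invert S^T S + alpha * I cheaply via the Woodbury identity,
# which is how a per-example cost of roughly O(ms + m^2) can be achieved.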

Performance on Real-World Datasets
[Figure: scatter plot of the error rate of SON against the error rate of AdaGrad on a collection of real-world datasets.]

Conclusions
1. Online boosting: a novel and natural framework; new online boosting algorithms for classification and regression.
2. Combining expert advice: a general game-theoretic framework for designing algorithms; adaptive algorithms with many desirable properties simultaneously.
3. Efficient second-order learning: how to utilize sketching techniques to deal with ill-conditioned data.
