Optimal and Adaptive Online Learning
Haipeng Luo. Advisor: Robert Schapire. Computer Science Department, Princeton University.
Examples of Online Learning (a): Spam Detection

- emails arrive one by one
- the spam detector makes a prediction for each of them: spam or not
- sometimes we get feedback from users
- make better predictions next time
Examples of Online Learning (b): Online Routing

- packages arrive one by one
- pick a path to send each package
- observe the latency afterward
- adjust the strategy and pick a better path next time
Examples of Online Learning (c): Recommendation

- users visit Netflix/Amazon/...
- movies/products/... are recommended to each user
- users click or not
- make better recommendations next time
Why Online Learning?

Traditional batch learning: the input is a batch of training data; the output is a fixed predictor.

Compared to batch learning, online learning is:

- a natural model for many applications, mixing training and testing
- memory efficient, requiring only one pass over the data
- often free of distributional assumptions
- suitable for changing or adversarial environments

It is an active research area that has proven extremely useful in practice.
Main Goals of My Work

Design better online learning algorithms that are:

- computationally efficient
- theoretically optimal
- adaptive and robust, working for different data and different criteria
- parameter-free and ready to use
Overview of My Work

1. Combining expert advice [LS14a, LS14b, LS15] (online learning)
2. Online boosting [BKL15, BHKL15] (online learning + boosting)
3. Efficient second order online learning [LACL16] (online learning + data sketching)
4. Accelerated stochastic optimization [LHP14, HL16] (online learning + optimization/stochastic learning)
5. Fast convergence in games [SALS15] (online learning + game theory)
Online Boosting

(Beygelzimer, Kale and Luo, ICML 2015; Beygelzimer, Hazan, Kale and Luo, NIPS 2015)
Brief Introduction to Boosting (Schapire, 1990; Freund, 1995)

Boosting is a highly successful batch learning algorithm. The idea: combine weak rules of thumb to form a highly accurate predictor.

Example: spam detection.

- Given: a set of training examples, e.g. ("Wanna make fast money?...", spam) and ("How is your thesis going? Rob", not spam).
- Obtain a classifier by asking a weak learning algorithm: e.g. contains the word "money" ⇒ spam.
- Reweight the examples so that difficult ones get more attention: e.g. spam that doesn't contain "money".
- Obtain another classifier: e.g. empty "To" address ⇒ spam.
- ... (repeat)
- At the end, predict by taking a (weighted) majority vote.
Online Boosting: Motivation

Boosting is well studied in the batch setting, but becomes infeasible when the amount of data is huge.

Online learning comes to the rescue: it is memory efficient and works even in an adversarial environment.

Natural questions: How do we extend boosting to the online setting? How do we boost the accuracy of existing online learning algorithms?
Related Work

Several algorithms exist (Oza and Russell, 2001; Grabner and Bischof, 2006; Liu and Yu, 2007; Grabner et al., 2008). They mimic their batch counterparts and have achieved great success in many real-world applications, but come with no theoretical guarantees.

Chen et al. (2012) gave the first online boosting algorithms with theoretical guarantees, by connecting online boosting and smooth batch boosting.

Our work: a more natural formulation, plus optimal and adaptive algorithms.
Batch Boosting

Given a batch of $T$ examples $(x_t, y_t)$ for $t = 1, \ldots, T$, a learner $A$ predicts $A(x_t)$ for example $x_t$.

Weak learner $A$ (with edge $\gamma$):
$$\#\text{Mistakes} = \sum_{t=1}^{T} \mathbf{1}\{A(x_t) \neq y_t\} \leq \left(\tfrac{1}{2} - \gamma\right) T$$

Strong learner $A^*$ (with any target error rate $\epsilon$):
$$\#\text{Mistakes} = \sum_{t=1}^{T} \mathbf{1}\{A^*(x_t) \neq y_t\} \leq \epsilon T$$

Boosting (Schapire, 1990; Freund, 1995) converts the former into the latter.
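The reweight-and-vote recipe above can be sketched in a few lines. This is a minimal, hypothetical AdaBoost-style loop, not the exact algorithm from any cited paper: it assumes the weak hypotheses are simply given as lists of ±1 predictions on the training set, and at each round picks the one with the smallest weighted error.

```python
import math

def adaboost(preds, y, rounds):
    """Minimal batch-boosting sketch. `preds` is a list of weak
    hypotheses, each a list of +/-1 predictions on the T examples;
    `y` holds the true +/-1 labels. Returns per-hypothesis votes."""
    T = len(y)
    D = [1.0 / T] * T                      # example weights
    alphas = [0.0] * len(preds)
    for _ in range(rounds):
        # ask the "weak learner": pick the hypothesis with smallest weighted error
        errs = [sum(D[t] for t in range(T) if h[t] != y[t]) for h in preds]
        i = min(range(len(preds)), key=lambda j: errs[j])
        eps = min(max(errs[i], 1e-10), 1 - 1e-10)  # clamp away from 0 and 1
        alpha = 0.5 * math.log((1 - eps) / eps)
        alphas[i] += alpha
        # reweight: misclassified (difficult) examples get more attention
        D = [D[t] * math.exp(-alpha * y[t] * preds[i][t]) for t in range(T)]
        Z = sum(D)
        D = [d / Z for d in D]
    return alphas

def vote(alphas, preds, t):
    """Final prediction: weighted majority vote of the weak hypotheses."""
    s = sum(a * h[t] for a, h in zip(alphas, preds))
    return 1 if s >= 0 else -1
```

Under the weak-learning assumption (every chosen hypothesis has edge $\gamma$), the combined vote's training error drops exponentially in the number of rounds.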
Online Boosting

Examples $(x_t, y_t)$ arrive online for $t = 1, \ldots, T$; learner $A$ observes $x_t$ and predicts $A(x_t)$ before seeing $y_t$.

Weak online learner $A$ (with edge $\gamma$ and excess loss $S$, the mistakes made before learning):
$$\#\text{Mistakes} = \sum_{t=1}^{T} \mathbf{1}\{A(x_t) \neq y_t\} \leq \left(\tfrac{1}{2} - \gamma\right) T + S$$

Strong online learner $A^*$ (with any target error rate $\epsilon$ and excess loss $S'$):
$$\#\text{Mistakes} = \sum_{t=1}^{T} \mathbf{1}\{A^*(x_t) \neq y_t\} \leq \epsilon T + S'$$

Online boosting (our result) converts the former into the latter.
Structure of Online Boosting

[Diagram: on round $t$, the booster forwards $x_t$ to weak learners $\mathrm{WL}^1, \ldots, \mathrm{WL}^N$; each $\mathrm{WL}^i$ predicts $\hat{y}_t^i$, and the booster combines these into its prediction $\hat{y}_t$. After the true label $y_t$ is revealed, each $\mathrm{WL}^i$ receives $(x_t, y_t)$ as an update with probability $p_t^i$.]
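The predict-then-randomly-update structure can be sketched as follows. This is a schematic skeleton only: the class name, the weak-learner interface, and the probability rule `p_i` are illustrative placeholders; the actual probabilities are what distinguish Online BBM from AdaBoost.OL.

```python
import random

class OnlineBooster:
    """Schematic online booster over N weak online learners (WLs)."""

    def __init__(self, weak_learners, p_i):
        self.wls = weak_learners   # objects with .predict(x) -> +/-1 and .update(x, y)
        self.p_i = p_i             # p_i(i, x, y) -> probability in [0, 1]

    def predict(self, x):
        # combine the weak predictions by (unweighted) majority vote
        votes = sum(wl.predict(x) for wl in self.wls)
        return 1 if votes >= 0 else -1

    def observe(self, x, y):
        # after y_t is revealed, WL i receives (x_t, y_t) with probability p_t^i
        for i, wl in enumerate(self.wls):
            if random.random() < self.p_i(i, x, y):
                wl.update(x, y)
```

Sending the example with probability $p_t^i$, rather than with a real-valued weight, keeps each weak learner's input a plain unweighted example stream.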
Theoretical Guarantees

To achieve error rate $\epsilon$, how many weak learners ($N$) and examples ($T$) do we need?

| Algorithm | $N$ | $T$ | Optimal? | Adaptive? |
|---|---|---|---|---|
| Online BBM | $O(\frac{1}{\gamma^2} \ln \frac{1}{\epsilon})$ | $\tilde{O}(\frac{1}{\epsilon \gamma^2})$ | yes | no |
| AdaBoost.OL | $O(\frac{1}{\epsilon \gamma^2})$ | $\tilde{O}(\frac{1}{\epsilon^2 \gamma^4})$ | no | yes |
| Chen et al. (2012) | $O(\frac{1}{\epsilon \gamma^2})$ | $\tilde{O}(\frac{1}{\epsilon \gamma^2})$ | no | no |
Algorithms

Online BBM, derived by generalizing the drifting game (Schapire, 2001):

- prediction: majority vote
- update: $\Pr[(x_t, y_t) \text{ sent to the } i\text{th weak learner}] \propto \Pr[k_t^i \text{ heads in } N - i \text{ flips of a } \tfrac{\gamma}{2}\text{-biased coin}]$

AdaBoost.OL, derived via online loss minimization:

- prediction: weighted majority vote
- update: $\Pr[(x_t, y_t) \text{ sent to the } i\text{th weak learner}] \propto$ the derivative of the logistic loss at some point
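The binomial probability in the Online BBM update is cheap to compute directly. A small helper, assuming (our reading of the slide, not stated explicitly) that the $\tfrac{\gamma}{2}$-biased coin lands heads with probability $\tfrac{1}{2} - \tfrac{\gamma}{2}$:

```python
from math import comb

def binom_weight(k, n, gamma):
    """Probability of exactly k heads in n flips of a coin with
    heads probability 1/2 - gamma/2. The orientation of the bias
    is an assumption for illustration."""
    p = 0.5 - gamma / 2.0
    if not 0 <= k <= n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)
```

The booster would normalize these values across rounds to turn them into the sampling probabilities $p_t^i$.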
Experimental Results (I)

The algorithms are available in the widely used toolkit Vowpal Wabbit (command line option: --boosting), with VW as the default weak learner (a rather strong one!). Error rates were compared across the VW baseline, Online BBM, AdaBoost.OL, and Chen et al. on the datasets news, a9a, activity, adult, bio, census, covtype, letter, maptaskcoref, nomao, poker, rcv, and vehv2binary. [The per-dataset numbers did not survive transcription.]
Experimental Results (II)

The approach generalizes to the regression setting (BHKL15):

| Base learner | Average improvement |
|---|---|
| Regression stumps | 20.22% |
| Neural networks | 7.88% |
| VW default | 1.65% |
Combining Expert Advice More Efficiently

(Luo and Schapire, NIPS 2014; COLT 2015)
The Expert Problem

We need to make a decision every day: which stock to invest in? which route to drive home? ...

Given: advice from a set of experts (financial advisers; "Route 1", "Route 206", "I-95", ...).

Question: how do we combine the advice and make smart choices?
Comparison with Online Boosting

| Online boosting | Expert problem |
|---|---|
| same weak learning algorithm | different experts |
| assumptions on weak learners | arbitrary advice |
| controls the inputs to weak learners | combines advice |
| aims to improve substantially on the weak learners | aims to find the best expert |
Why Interesting?

Applications: investment, route selection, ...

Foundation of many other learning algorithms: batch boosting, bandit problems, solving SDPs, online PCA, ...
Formal Setup (Littlestone and Warmuth, 1994; Freund and Schapire, 1997)

For $t = 1, 2, \ldots, T$:

- the player predicts a distribution $p_t$ over $N$ experts;
- the adversary reveals the loss $\ell_{t,i} \in [0, 1]$ of each expert;
- the player suffers loss $p_t \cdot \ell_t$ for this round.

Regret to expert $i$:
$$R_{T,i} = \underbrace{\sum_{t=1}^{T} p_t \cdot \ell_t}_{\text{total loss of the player}} - \underbrace{\sum_{t=1}^{T} \ell_{t,i}}_{\text{total loss of expert } i}$$

Goal: obtain sublinear regret to the best expert, and more generally to the top experts or to any competitor.
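The classic algorithm for this setup is exponential weights (Hedge). A minimal sketch that tracks the regrets defined above; the function name and the choice of a fixed learning rate `eta` are ours for illustration:

```python
import math

def hedge(losses, eta):
    """Exponential-weights sketch for the expert problem.
    `losses` is a T x N list of per-expert losses in [0, 1].
    Returns the player's total loss and the regret to each expert."""
    N = len(losses[0])
    R = [0.0] * N                      # cumulative regret to each expert
    total = 0.0                        # player's cumulative loss
    w = [1.0] * N
    for l in losses:
        W = sum(w)
        p = [wi / W for wi in w]       # play the normalized weights
        step = sum(pi * li for pi, li in zip(p, l))
        total += step
        for i in range(N):
            R[i] += step - l[i]
            # weight grows with regret; since the player's loss is common
            # to all experts, this is the usual w_i proportional to
            # exp(-eta * cumulative loss of expert i)
            w[i] = math.exp(eta * R[i])
    return total, R
```

With $\eta \approx \sqrt{\ln N / T}$ this achieves the $\sqrt{T \ln N}$ worst-case regret mentioned in the table that follows.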
Beyond the Classic Regret

More difficult objectives:

- achieving much smaller regret when the problem is easy, while still ensuring worst-case robustness
- learning with an unknown number of experts
- competing with a sequence of different competitors
- learning with experts who provide confidence-rated advice

Previous work studied each of these to some extent, with different algorithms. Our work: one single parameter-free algorithm that achieves all of these goals, with improved results.
Expert Problem as a Drifting Game

Main technique: reduce the problem to a drifting game.

Consequence: each upper bound on the 0-1 loss yields an expert algorithm.

| Surrogate of 0-1 loss | Algorithm: $p_{t,i} \propto$ | Regret to top-$\epsilon$ fraction |
|---|---|---|
| exponential | $\exp(\eta R_{t-1,i})$ | $\sqrt{T \ln \frac{1}{\epsilon}}$ |
| squared hinge | $[R_{t-1,i} + 1]_+^2 - [R_{t-1,i} - 1]_+^2$ | $\sqrt{\frac{T}{\epsilon}}$ |
| $\exp\left(\frac{[s]_+^2}{3T}\right)$ | $\exp\left(\frac{[R_{t-1,i}+1]_+^2}{3t}\right) - \exp\left(\frac{[R_{t-1,i}-1]_+^2}{3t}\right)$ | $\sqrt{T \ln \frac{1}{\epsilon}}$ (up to $\ln T$ factors) |
AdaNormalHedge

A first version sets the weight for expert $i$ on round $t$ to
$$p_{t,i} \propto \exp\left(\frac{[R_{t-1,i}+1]_+^2}{3t}\right) - \exp\left(\frac{[R_{t-1,i}-1]_+^2}{3t}\right),$$
giving $\tilde{O}(\sqrt{T})$ regret: worst-case optimal, but too conservative!

AdaNormalHedge instead uses
$$p_{t,i} \propto \exp\left(\frac{[R_{t-1,i}+1]_+^2}{3(C_{t-1,i}+1)}\right) - \exp\left(\frac{[R_{t-1,i}-1]_+^2}{3(C_{t-1,i}+1)}\right),$$
where $C_{t,i} = \sum_{\tau=1}^{t} |p_\tau \cdot \ell_\tau - \ell_{\tau,i}|$ is the cumulative regret magnitude of expert $i$.

This is much more adaptive:

- small regret when the best expert has small loss
- almost constant regret when losses are stochastic
- ...

It resolves open problems of Chaudhuri et al. (2009), Chernov and Vovk (2010), and Warmuth and Koolen (2014).
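The AdaNormalHedge weight is a difference of two potentials and is simple to evaluate. A small sketch of the weight computation only (the surrounding loop that maintains $R$ and $C$ is omitted, and the uniform fallback for the degenerate all-zero case is our choice):

```python
import math

def anh_weight(R, C):
    """AdaNormalHedge potential-difference weight for one expert,
    given its cumulative regret R and cumulative regret magnitude C."""
    def phi(s):
        return math.exp(max(s, 0.0) ** 2 / (3.0 * (C + 1.0)))
    return phi(R + 1.0) - phi(R - 1.0)

def anh_distribution(Rs, Cs):
    """Normalize raw weights into the play distribution p_t."""
    raw = [anh_weight(R, C) for R, C in zip(Rs, Cs)]
    Z = sum(raw)
    if Z <= 0:
        # every regret is at most -1, so all weights vanish;
        # fall back to uniform (an illustrative choice)
        return [1.0 / len(raw)] * len(raw)
    return [r / Z for r in raw]
```

Note there is no learning rate anywhere: the denominator $3(C_{t-1,i}+1)$ adapts per expert, which is what makes the algorithm parameter-free.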
Application

Given: a big unpruned decision tree. Goal: predict almost as well as the best pruned tree, online.

This reduces to the expert problem (Helmbold and Schapire, 1995; Freund et al., 1997), and AdaNormalHedge provides a parameter-free solution with an exponential regret improvement!
Efficient Second Order Online Learning via Sketching

(Luo, Agarwal, Cesa-Bianchi and Langford, preprint, 2016)
Online Linear Prediction

On each round $t$:

- see an example $x_t$ (e.g. an image)
- predict its label as $w_t^\top x_t$ (e.g. cat or not)
- see the true label and suffer some loss
- update $w_t$ to $w_{t+1}$

e.g. online gradient descent (Zinkevich, 2003): $w_{t+1} = w_t - \eta g_t$, where $g_t$ is the gradient.
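The four-step protocol above with the OGD update can be sketched in a few lines. Squared loss is an assumption here for concreteness (the talk does not fix a loss):

```python
def ogd(examples, eta, d):
    """Online gradient descent (Zinkevich, 2003) for online linear
    prediction with squared loss 0.5*(w.x - y)^2, whose gradient
    at w_t is (w.x - y) * x."""
    w = [0.0] * d
    total_loss = 0.0
    for x, y in examples:
        pred = sum(wi * xi for wi, xi in zip(w, x))   # predict w_t . x_t
        total_loss += 0.5 * (pred - y) ** 2           # see y_t, suffer loss
        g = [(pred - y) * xi for xi in x]             # gradient g_t
        w = [wi - eta * gi for wi, gi in zip(w, g)]   # w_{t+1} = w_t - eta * g_t
    return w, total_loss
```

Each update touches only the nonzero coordinates of $x_t$, which is the $O(s)$ per-example cost in the comparison table that follows.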
Problem

Performance often depends on the conditioning of the data.

[Figure: error rate vs. condition number for OGD, Online Newton, AdaGrad, and SON.]

With $A_t = \sum_{\tau=1}^{t} g_\tau g_\tau^\top$, $s$ the sparsity, $d$ the dimensionality, and $m$ the sketch size:

| Algorithm | Update rule | Time per example |
|---|---|---|
| OGD (Zinkevich, 2003) | $w_{t+1} = w_t - \eta\, g_t$ | $O(s)$ |
| Online Newton (Hazan et al., 2007) | $w_{t+1} = w_t - \eta\, A_t^{-1} g_t$ | $O(d^2)$ |
| AdaGrad (Duchi et al., 2011) | $w_{t+1} = w_t - \eta\, \mathrm{diag}(A_t)^{-\frac{1}{2}} g_t$ | $O(s)$ |
| SON (our work) | $w_{t+1} = w_t - \eta\, \mathrm{sketch}(A_t)^{-1} g_t$ | $O(ms + m^2)$ |
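One standard way to maintain $\mathrm{sketch}(A_t)$ in $O(md)$ memory is the Frequent Directions sketch (Liberty, 2013). The sketch below is an illustrative batch version, not the incremental bookkeeping an online learner would actually use:

```python
import numpy as np

def frequent_directions(gradients, m):
    """Frequent Directions: maintain an m x d matrix B with
    B^T B approximating sum_t g_t g_t^T, using O(m*d) memory
    instead of the O(d^2) needed for the full matrix A_t."""
    d = len(gradients[0])
    B = np.zeros((m, d))
    for g in gradients:
        zero_rows = np.where(~B.any(axis=1))[0]
        if len(zero_rows) == 0:
            # B is full: shrink all singular values by the smallest one,
            # which zeroes out at least the last row
            U, s, Vt = np.linalg.svd(B, full_matrices=False)
            s2 = np.maximum(s**2 - s[-1]**2, 0.0)
            B = np.diag(np.sqrt(s2)) @ Vt
            zero_rows = np.where(~B.any(axis=1))[0]
        B[zero_rows[0]] = g   # insert the new gradient into a free row
    return B
```

Inverting (a regularized) $B^\top B$ costs $O(m^2)$ per step via the Woodbury identity, which is where SON's $O(ms + m^2)$ per-example time comes from.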
Performance on Real-World Datasets

[Figure: scatter plot of the error rate of SON vs. the error rate of AdaGrad across datasets.]
Conclusions

1. Online boosting: a novel and natural framework; new online boosting algorithms for classification and regression.
2. Combining expert advice: a general game-theoretic framework for designing algorithms; adaptive algorithms with many desirable properties simultaneously.
3. Efficient second order learning: how to utilize sketching techniques to deal with ill-conditioned data.
More informationCS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm
CS61: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm Tim Roughgarden February 9, 016 1 Online Algorithms This lecture begins the third module of the
More informationAdaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Vicente L. Malave February 23, 2011 Outline Notation minimize a number of functions φ
More informationYevgeny Seldin. University of Copenhagen
Yevgeny Seldin University of Copenhagen Classical (Batch) Machine Learning Collect Data Data Assumption The samples are independent identically distributed (i.i.d.) Machine Learning Prediction rule New
More informationOLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research
OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and
More informationCOS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture 24 Scribe: Sachin Ravi May 2, 2013
COS 5: heoretical Machine Learning Lecturer: Rob Schapire Lecture 24 Scribe: Sachin Ravi May 2, 203 Review of Zero-Sum Games At the end of last lecture, we discussed a model for two player games (call
More informationLittlestone s Dimension and Online Learnability
Littlestone s Dimension and Online Learnability Shai Shalev-Shwartz Toyota Technological Institute at Chicago The Hebrew University Talk at UCSD workshop, February, 2009 Joint work with Shai Ben-David
More informationOnline Convex Optimization
Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed
More informationFrom Bandits to Experts: A Tale of Domination and Independence
From Bandits to Experts: A Tale of Domination and Independence Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Domination and Independence 1 / 1 From Bandits to Experts: A
More informationA Second-order Bound with Excess Losses
A Second-order Bound with Excess Losses Pierre Gaillard 12 Gilles Stoltz 2 Tim van Erven 3 1 EDF R&D, Clamart, France 2 GREGHEC: HEC Paris CNRS, Jouy-en-Josas, France 3 Leiden University, the Netherlands
More informationAgnostic Online learnability
Technical Report TTIC-TR-2008-2 October 2008 Agnostic Online learnability Shai Shalev-Shwartz Toyota Technological Institute Chicago shai@tti-c.org ABSTRACT We study a fundamental question. What classes
More informationBoosting: Foundations and Algorithms. Rob Schapire
Boosting: Foundations and Algorithms Rob Schapire Example: Spam Filtering problem: filter out spam (junk email) gather large collection of examples of spam and non-spam: From: yoav@ucsd.edu Rob, can you
More informationCOMS 4721: Machine Learning for Data Science Lecture 13, 3/2/2017
COMS 4721: Machine Learning for Data Science Lecture 13, 3/2/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University BOOSTING Robert E. Schapire and Yoav
More informationLecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm. Lecturer: Sanjeev Arora
princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm Lecturer: Sanjeev Arora Scribe: (Today s notes below are
More informationOnline Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016
Online Convex Optimization Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 The General Setting The General Setting (Cover) Given only the above, learning isn't always possible Some Natural
More informationDynamic Regret of Strongly Adaptive Methods
Lijun Zhang 1 ianbao Yang 2 Rong Jin 3 Zhi-Hua Zhou 1 Abstract o cope with changing environments, recent developments in online learning have introduced the concepts of adaptive regret and dynamic regret
More informationAdministration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.
Administration CSCI567 Machine Learning Fall 2018 Prof. Haipeng Luo U of Southern California Nov 7, 2018 HW5 is available, due on 11/18. Practice final will also be available soon. Remaining weeks: 11/14,
More informationThe Boosting Approach to. Machine Learning. Maria-Florina Balcan 10/31/2016
The Boosting Approach to Machine Learning Maria-Florina Balcan 10/31/2016 Boosting General method for improving the accuracy of any given learning algorithm. Works by creating a series of challenge datasets
More informationOnline Advertising is Big Business
Online Advertising Online Advertising is Big Business Multiple billion dollar industry $43B in 2013 in USA, 17% increase over 2012 [PWC, Internet Advertising Bureau, April 2013] Higher revenue in USA
More informationEfficient and Principled Online Classification Algorithms for Lifelon
Efficient and Principled Online Classification Algorithms for Lifelong Learning Toyota Technological Institute at Chicago Chicago, IL USA Talk @ Lifelong Learning for Mobile Robotics Applications Workshop,
More informationOnline Learning with Feedback Graphs
Online Learning with Feedback Graphs Claudio Gentile INRIA and Google NY clagentile@gmailcom NYC March 6th, 2018 1 Content of this lecture Regret analysis of sequential prediction problems lying between
More informationThe No-Regret Framework for Online Learning
The No-Regret Framework for Online Learning A Tutorial Introduction Nahum Shimkin Technion Israel Institute of Technology Haifa, Israel Stochastic Processes in Engineering IIT Mumbai, March 2013 N. Shimkin,
More informationExponential Weights on the Hypercube in Polynomial Time
European Workshop on Reinforcement Learning 14 (2018) October 2018, Lille, France. Exponential Weights on the Hypercube in Polynomial Time College of Information and Computer Sciences University of Massachusetts
More informationHybrid Machine Learning Algorithms
Hybrid Machine Learning Algorithms Umar Syed Princeton University Includes joint work with: Rob Schapire (Princeton) Nina Mishra, Alex Slivkins (Microsoft) Common Approaches to Machine Learning!! Supervised
More informationLinear Probability Forecasting
Linear Probability Forecasting Fedor Zhdanov and Yuri Kalnishkan Computer Learning Research Centre and Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, TW0 0EX, UK {fedor,yura}@cs.rhul.ac.uk
More informationFoundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research
Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.
More information0.1 Motivating example: weighted majority algorithm
princeton univ. F 16 cos 521: Advanced Algorithm Design Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm Lecturer: Sanjeev Arora Scribe: Sanjeev Arora (Today s notes
More informationOnline and Stochastic Learning with a Human Cognitive Bias
Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence Online and Stochastic Learning with a Human Cognitive Bias Hidekazu Oiwa The University of Tokyo and JSPS Research Fellow 7-3-1
More informationExplore no more: Improved high-probability regret bounds for non-stochastic bandits
Explore no more: Improved high-probability regret bounds for non-stochastic bandits Gergely Neu SequeL team INRIA Lille Nord Europe gergely.neu@gmail.com Abstract This work addresses the problem of regret
More informationBoosting. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL , 10.7, 10.13
Boosting Ryan Tibshirani Data Mining: 36-462/36-662 April 25 2013 Optional reading: ISL 8.2, ESL 10.1 10.4, 10.7, 10.13 1 Reminder: classification trees Suppose that we are given training data (x i, y
More informationCS229 Supplemental Lecture notes
CS229 Supplemental Lecture notes John Duchi 1 Boosting We have seen so far how to solve classification (and other) problems when we have a data representation already chosen. We now talk about a procedure,
More informationOnline Advertising is Big Business
Online Advertising Online Advertising is Big Business Multiple billion dollar industry $43B in 2013 in USA, 17% increase over 2012 [PWC, Internet Advertising Bureau, April 2013] Higher revenue in USA
More informationOnline Learning. Jordan Boyd-Graber. University of Colorado Boulder LECTURE 21. Slides adapted from Mohri
Online Learning Jordan Boyd-Graber University of Colorado Boulder LECTURE 21 Slides adapted from Mohri Jordan Boyd-Graber Boulder Online Learning 1 of 31 Motivation PAC learning: distribution fixed over
More informationOnline Bounds for Bayesian Algorithms
Online Bounds for Bayesian Algorithms Sham M. Kakade Computer and Information Science Department University of Pennsylvania Andrew Y. Ng Computer Science Department Stanford University Abstract We present
More informationAdaptivity and Optimism: An Improved Exponentiated Gradient Algorithm
Adaptivity and Optimism: An Improved Exponentiated Gradient Algorithm Jacob Steinhardt Percy Liang Stanford University {jsteinhardt,pliang}@cs.stanford.edu Jun 11, 2013 J. Steinhardt & P. Liang (Stanford)
More informationAn Introduction to Adaptive Online Learning
An Introduction to Adaptive Online Learning Tim van Erven Joint work with: Wouter Koolen, Peter Grünwald ABN AMRO October 18, 2018 Example: Sequential Prediction for Football Games Before every match t
More informationClassification. Jordan Boyd-Graber University of Maryland WEIGHTED MAJORITY. Slides adapted from Mohri. Jordan Boyd-Graber UMD Classification 1 / 13
Classification Jordan Boyd-Graber University of Maryland WEIGHTED MAJORITY Slides adapted from Mohri Jordan Boyd-Graber UMD Classification 1 / 13 Beyond Binary Classification Before we ve talked about
More informationVoting (Ensemble Methods)
1 2 Voting (Ensemble Methods) Instead of learning a single classifier, learn many weak classifiers that are good at different parts of the data Output class: (Weighted) vote of each classifier Classifiers
More informationFrom Batch to Transductive Online Learning
From Batch to Transductive Online Learning Sham Kakade Toyota Technological Institute Chicago, IL 60637 sham@tti-c.org Adam Tauman Kalai Toyota Technological Institute Chicago, IL 60637 kalai@tti-c.org
More informationNew bounds on the price of bandit feedback for mistake-bounded online multiclass learning
Journal of Machine Learning Research 1 8, 2017 Algorithmic Learning Theory 2017 New bounds on the price of bandit feedback for mistake-bounded online multiclass learning Philip M. Long Google, 1600 Amphitheatre
More informationAd Placement Strategies
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January
More informationOnline Learning with Experts & Multiplicative Weights Algorithms
Online Learning with Experts & Multiplicative Weights Algorithms CS 159 lecture #2 Stephan Zheng April 1, 2016 Caltech Table of contents 1. Online Learning with Experts With a perfect expert Without perfect
More informationAnnouncements Kevin Jamieson
Announcements My office hours TODAY 3:30 pm - 4:30 pm CSE 666 Poster Session - Pick one First poster session TODAY 4:30 pm - 7:30 pm CSE Atrium Second poster session December 12 4:30 pm - 7:30 pm CSE Atrium
More informationLogistic Regression Logistic
Case Study 1: Estimating Click Probabilities L2 Regularization for Logistic Regression Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 10 th,
More informationReducing contextual bandits to supervised learning
Reducing contextual bandits to supervised learning Daniel Hsu Columbia University Based on joint work with A. Agarwal, S. Kale, J. Langford, L. Li, and R. Schapire 1 Learning to interact: example #1 Practicing
More informationSubgradient Methods for Maximum Margin Structured Learning
Nathan D. Ratliff ndr@ri.cmu.edu J. Andrew Bagnell dbagnell@ri.cmu.edu Robotics Institute, Carnegie Mellon University, Pittsburgh, PA. 15213 USA Martin A. Zinkevich maz@cs.ualberta.ca Department of Computing
More informationLearning with Exploration
Learning with Exploration John Langford (Yahoo!) { With help from many } Austin, March 24, 2011 Yahoo! wants to interactively choose content and use the observed feedback to improve future content choices.
More informationAdaptive Online Gradient Descent
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 6-4-2007 Adaptive Online Gradient Descent Peter Bartlett Elad Hazan Alexander Rakhlin University of Pennsylvania Follow
More informationBetter Algorithms for Selective Sampling
Francesco Orabona Nicolò Cesa-Bianchi DSI, Università degli Studi di Milano, Italy francesco@orabonacom nicolocesa-bianchi@unimiit Abstract We study online algorithms for selective sampling that use regularized
More informationConvex Repeated Games and Fenchel Duality
Convex Repeated Games and Fenchel Duality Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci. & Eng., he Hebrew University, Jerusalem 91904, Israel 2 Google Inc. 1600 Amphitheater Parkway,
More informationOnline prediction with expert advise
Online prediction with expert advise Jyrki Kivinen Australian National University http://axiom.anu.edu.au/~kivinen Contents 1. Online prediction: introductory example, basic setting 2. Classification with
More informationOnline Learning for Time Series Prediction
Online Learning for Time Series Prediction Joint work with Vitaly Kuznetsov (Google Research) MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Motivation Time series prediction: stock values.
More informationRegret Bounds for Online Portfolio Selection with a Cardinality Constraint
Regret Bounds for Online Portfolio Selection with a Cardinality Constraint Shinji Ito NEC Corporation Daisuke Hatano RIKEN AIP Hanna Sumita okyo Metropolitan University Akihiro Yabe NEC Corporation akuro
More informationAdvanced Machine Learning
Advanced Machine Learning Bandit Problems MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Multi-Armed Bandit Problem Problem: which arm of a K-slot machine should a gambler pull to maximize his
More informationOnline Convex Optimization: From Gambling to Minimax Theorems by Playing Repeated Games
Online Convex Optimization: From Gambling to Minimax heorems by Playing Repeated Games im van Erven Nederlands Mathematisch Congres, April 4, 2018 Example: Betting on Football Games Before every match
More informationCOS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture #24 Scribe: Jad Bechara May 2, 2018
COS 5: heoretical Machine Learning Lecturer: Rob Schapire Lecture #24 Scribe: Jad Bechara May 2, 208 Review of Game heory he games we will discuss are two-player games that can be modeled by a game matrix
More informationRegret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I. Sébastien Bubeck Theory Group
Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I Sébastien Bubeck Theory Group i.i.d. multi-armed bandit, Robbins [1952] i.i.d. multi-armed bandit, Robbins [1952] Known
More informationEnsembles. Léon Bottou COS 424 4/8/2010
Ensembles Léon Bottou COS 424 4/8/2010 Readings T. G. Dietterich (2000) Ensemble Methods in Machine Learning. R. E. Schapire (2003): The Boosting Approach to Machine Learning. Sections 1,2,3,4,6. Léon
More informationTime Series Prediction & Online Learning
Time Series Prediction & Online Learning Joint work with Vitaly Kuznetsov (Google Research) MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Motivation Time series prediction: stock values. earthquakes.
More informationarxiv: v1 [cs.lg] 8 Feb 2018
Online Learning: A Comprehensive Survey Steven C.H. Hoi, Doyen Sahoo, Jing Lu, Peilin Zhao School of Information Systems, Singapore Management University, Singapore School of Software Engineering, South
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationBig Data Analytics. Special Topics for Computer Science CSE CSE Feb 24
Big Data Analytics Special Topics for Computer Science CSE 4095-001 CSE 5095-005 Feb 24 Fei Wang Associate Professor Department of Computer Science and Engineering fei_wang@uconn.edu Prediction III Goal
More informationApplications of on-line prediction. in telecommunication problems
Applications of on-line prediction in telecommunication problems Gábor Lugosi Pompeu Fabra University, Barcelona based on joint work with András György and Tamás Linder 1 Outline On-line prediction; Some
More informationSimultaneous Model Selection and Optimization through Parameter-free Stochastic Learning
Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning Francesco Orabona Yahoo! Labs New York, USA francesco@orabona.com Abstract Stochastic gradient descent algorithms
More informationOnline Passive-Aggressive Algorithms
Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il
More informationCSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18
CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$
More informationConvergence rate of SGD
Convergence rate of SGD heorem: (see Nemirovski et al 09 from readings) Let f be a strongly convex stochastic function Assume gradient of f is Lipschitz continuous and bounded hen, for step sizes: he expected
More informationLearning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley
Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley Course Synopsis A finite comparison class: A = {1,..., m}. Converting online to batch. Online convex optimization.
More informationarxiv: v4 [cs.lg] 27 Jan 2016
The Computational Power of Optimization in Online Learning Elad Hazan Princeton University ehazan@cs.princeton.edu Tomer Koren Technion tomerk@technion.ac.il arxiv:1504.02089v4 [cs.lg] 27 Jan 2016 Abstract
More informationLearning Ensembles. 293S T. Yang. UCSB, 2017.
Learning Ensembles 293S T. Yang. UCSB, 2017. Outlines Learning Assembles Random Forest Adaboost Training data: Restaurant example Examples described by attribute values (Boolean, discrete, continuous)
More informationEnsemble learning 11/19/13. The wisdom of the crowds. Chapter 11. Ensemble methods. Ensemble methods
The wisdom of the crowds Ensemble learning Sir Francis Galton discovered in the early 1900s that a collection of educated guesses can add up to very accurate predictions! Chapter 11 The paper in which
More informationOnline Passive-Aggressive Algorithms
Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il
More informationGradient Boosting (Continued)
Gradient Boosting (Continued) David Rosenberg New York University April 4, 2016 David Rosenberg (New York University) DS-GA 1003 April 4, 2016 1 / 31 Boosting Fits an Additive Model Boosting Fits an Additive
More informationECE 5424: Introduction to Machine Learning
ECE 5424: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting PAC Learning Readings: Murphy 16.4;; Hastie 16 Stefan Lee Virginia Tech Fighting the bias-variance tradeoff Simple
More informationOnline Manifold Regularization: A New Learning Setting and Empirical Study
Online Manifold Regularization: A New Learning Setting and Empirical Study Andrew B. Goldberg 1, Ming Li 2, Xiaojin Zhu 1 1 Computer Sciences, University of Wisconsin Madison, USA. {goldberg,jerryzhu}@cs.wisc.edu
More informationStochastic and Adversarial Online Learning without Hyperparameters
Stochastic and Adversarial Online Learning without Hyperparameters Ashok Cutkosky Department of Computer Science Stanford University ashokc@cs.stanford.edu Kwabena Boahen Department of Bioengineering Stanford
More information