Optimal and Adaptive Algorithms for Online Boosting
Alina Beygelzimer (Yahoo! Labs, NYC), Satyen Kale (Yahoo! Labs, NYC), Haipeng Luo (Computer Science Department, Princeton University)
July 8, 2015
Boosting: An Example
Idea: combine weak rules of thumb to form a highly accurate predictor.
Example: spam detection.
- Given: a set of training examples, e.g. ("Wanna make fast money?...", spam) and ("How is research going? Rob", not spam).
- Obtain a classifier by asking a weak learning algorithm: e.g. contains the word "money" => spam.
- Reweight the examples so that difficult ones get more attention: e.g. spam that doesn't contain "money".
- Obtain another classifier: e.g. empty "To:" address => spam.
- ...and so on.
- At the end, predict by taking a (weighted) majority vote. (A minimal batch sketch of this loop follows.)
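To make the reweight-and-vote loop concrete, here is a minimal batch sketch in Python. It is illustrative only: the `weak_learner(examples, weights)` helper and the example format are hypothetical stand-ins for whatever weak learning algorithm is available, and the vote-weight update is the standard AdaBoost choice rather than anything specific to this talk.

```python
import math

def batch_boost(examples, weak_learner, rounds):
    """Reweight-and-vote loop: train weak rules on reweighted data,
    then predict with a weighted majority vote.
    `examples` is a list of (x, y) pairs with y in {-1, +1};
    `weak_learner(examples, weights)` (hypothetical) returns a classifier h(x)."""
    n = len(examples)
    weights = [1.0 / n] * n                       # start with uniform example weights
    ensemble = []                                 # list of (vote weight, classifier)
    for _ in range(rounds):
        h = weak_learner(examples, weights)
        # weighted error of this weak rule, clamped away from 0 and 1
        err = sum(w for w, (x, y) in zip(weights, examples) if h(x) != y)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)   # standard AdaBoost vote weight
        ensemble.append((alpha, h))
        # reweight: misclassified examples get more attention
        weights = [w * math.exp(-alpha * y * h(x))
                   for w, (x, y) in zip(weights, examples)]
        total = sum(weights)
        weights = [w / total for w in weights]
    def predict(x):
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return predict
```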
Online Boosting: Motivation
- Boosting is well studied in the batch setting, but becomes infeasible when the amount of data is huge.
- Online learning has proven extremely useful: one pass over the data, making predictions on the fly, and it works even in an adversarial environment (e.g. spam detection).
- A natural question: how do we extend boosting to the online setting?
Related Work
- Several algorithms exist (Oza and Russell, 2001; Grabner and Bischof, 2006; Liu and Yu, 2007; Grabner et al., 2008): they mimic offline counterparts and achieve great success in many real-world applications, but come with no theoretical guarantees.
- Chen et al. (2012): the first online boosting algorithms with theoretical guarantees, based on an online analogue of the weak learning assumption and a connection between online boosting and smooth batch boosting.
Batch Boosting
Given a batch of $T$ examples $(x_t, y_t) \in X \times \{-1, +1\}$ for $t = 1, \ldots, T$, a learner $A$ predicts $A(x_t) \in \{-1, +1\}$ for example $x_t$.
- Weak learner $A$ (with edge $\gamma$): $\sum_{t=1}^{T} \mathbf{1}\{A(x_t) \neq y_t\} \leq (\tfrac{1}{2} - \gamma)T$.
- Strong learner $A'$ (with any target error rate $\epsilon$): $\sum_{t=1}^{T} \mathbf{1}\{A'(x_t) \neq y_t\} \leq \epsilon T$.
- Boosting (Schapire, 1990; Freund, 1995) turns the former into the latter.
Online Boosting
Examples $(x_t, y_t) \in X \times \{-1, +1\}$ arrive online for $t = 1, \ldots, T$; the learner $A$ observes $x_t$ and predicts $A(x_t) \in \{-1, +1\}$ before seeing $y_t$.
- Weak online learner $A$ (with edge $\gamma$ and excess loss $S$): $\sum_{t=1}^{T} \mathbf{1}\{A(x_t) \neq y_t\} \leq (\tfrac{1}{2} - \gamma)T + S$.
- Strong online learner $A'$ (with any target error rate $\epsilon$ and excess loss $S'$): $\sum_{t=1}^{T} \mathbf{1}\{A'(x_t) \neq y_t\} \leq \epsilon T + S'$.
- Online boosting (our result) turns the former into the latter.
- This talk: $S = \sqrt{T}/\gamma$ (corresponding to $\sqrt{T}$ regret).
Main Results
Parameters of interest:
- $N$ = number of weak learners (of edge $\gamma$) needed to achieve error rate $\epsilon$.
- $T_\epsilon$ = minimal number of examples such that the error rate is $\epsilon$.

Algorithm          | $N$                                          | $T_\epsilon$                              | Optimal? | Adaptive?
Online BBM         | $O(\frac{1}{\gamma^2}\ln\frac{1}{\epsilon})$ | $\tilde{O}(\frac{1}{\epsilon\gamma^2})$   | yes      | no
AdaBoost.OL        | $O(\frac{1}{\epsilon\gamma^2})$              | $\tilde{O}(\frac{1}{\epsilon^2\gamma^4})$ | no       | yes
Chen et al. (2012) | $O(\frac{1}{\epsilon\gamma^2})$              | $\tilde{O}(\frac{1}{\epsilon\gamma^2})$   | no       | no
Structure of Online Boosting
[Diagram: on each round $t = 1, \ldots, T$, the booster routes the example through $N$ weak learners.]
- The booster receives $x_t$ and passes it to weak learners $WL^1, \ldots, WL^N$; each $WL^i$ returns a prediction $\hat{y}_t^i$.
- The booster combines these into its own prediction $\hat{y}_t$, and then the true label $y_t$ is revealed.
- Each $WL^i$ is then updated on $(x_t, y_t)$ with probability $p_t^i$, chosen by the booster.
(A code sketch of this protocol follows.)
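The following Python sketch shows one round of this protocol. The weak-learner interface (`predict(x)`, `update(x, y)`) and the `probs` callback supplying $p_t^i$ are assumed names, not anything from the paper.

```python
import random

def online_boosting_round(weak_learners, probs, x_t, y_t):
    """One round of the protocol above (a sketch under assumed interfaces).
    In the real protocol the booster predicts before y_t is revealed;
    the two phases are shown in one function here for brevity."""
    # Phase 1: every weak learner predicts on x_t; the booster combines
    # the predictions (here: a simple majority vote).
    preds = [wl.predict(x_t) for wl in weak_learners]   # each in {-1, +1}
    y_hat = 1 if sum(preds) >= 0 else -1
    # Phase 2: y_t is revealed; weak learner i is updated on (x_t, y_t)
    # with probability p_t^i chosen by the booster.
    for i, wl in enumerate(weak_learners):
        if random.random() < probs(i, x_t, y_t):
            wl.update(x_t, y_t)
    return y_hat
```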
Boosting as a Drifting Game (Schapire, 2001; Luo and Schapire, 2014)
- Batch boosting can be analyzed using drifting games.
- Online version: a sequence of potentials $\Phi_i(s)$ such that
$$\Phi_N(s) \geq \mathbf{1}\{s \leq 0\}, \qquad \Phi_{i-1}(s) \geq \left(\tfrac{1}{2} - \tfrac{\gamma}{2}\right)\Phi_i(s-1) + \left(\tfrac{1}{2} + \tfrac{\gamma}{2}\right)\Phi_i(s+1).$$
- Online boosting algorithm using the $\Phi_i$:
  - prediction: majority vote.
  - update: $p_t^i = \Pr[(x_t, y_t) \text{ sent to the } i\text{th weak learner}] \propto w_t^i$, where $w_t^i$ is the difference in potentials according to whether the example is misclassified or not.
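Since the recurrence runs backward from $\Phi_N$, the potentials are easy to tabulate numerically. The sketch below does exactly that; the index shift and boundary clamping are implementation conveniences, not part of the paper.

```python
def bbm_potentials(N, gamma):
    """Tabulate drifting-game potentials Phi_i(s) for s in [-N, N] by
    backward recursion over the recurrence above; phi[i][s + N] holds Phi_i(s)."""
    span = 2 * N + 1
    phi = [[0.0] * span for _ in range(N + 1)]
    for s in range(-N, N + 1):                 # base case: Phi_N(s) = 1{s <= 0}
        phi[N][s + N] = 1.0 if s <= 0 else 0.0
    def clamp(s):
        return max(-N, min(N, s))              # keep neighbor indices in the table
    for i in range(N, 0, -1):
        for s in range(-N, N + 1):
            phi[i - 1][s + N] = ((0.5 - gamma / 2) * phi[i][clamp(s - 1) + N]
                                 + (0.5 + gamma / 2) * phi[i][clamp(s + 1) + N])
    return phi

# Phi_0(0) bounds the achievable error rate and shrinks as N grows, e.g.:
# bbm_potentials(100, 0.2)[0][100]
```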
Mistake Bound
The generalized drifting-games analysis implies
$$\sum_{t=1}^{T} \mathbf{1}\{A'(x_t) \neq y_t\} \;\leq\; \underbrace{\Phi_0(0)}_{\leq\,\epsilon}\,T \;+\; \underbrace{\left(S + \tfrac{1}{\gamma}\right)\sum_i \|w^i\|_\infty}_{=\,S'}.$$
- So we want the weights $w^i$ to be small.
- The exponential potential (corresponding to AdaBoost) does not work.
- The Boost-by-Majority (Freund, 1995) potential works well:
$$w_t^i = \Pr\left[k_t^i \text{ heads in } N-i \text{ flips of a } \left(\tfrac{1}{2} + \tfrac{\gamma}{2}\right)\text{-biased coin}\right] \leq O\!\left(\tfrac{1}{\sqrt{N-i}}\right).$$
- Online BBM: to get error rate $\epsilon$, it needs $N = O\left(\frac{1}{\gamma^2}\ln\frac{1}{\epsilon}\right)$ weak learners and $T_\epsilon = O\left(\frac{1}{\epsilon\gamma^2}\right)$ examples. (Optimal.)
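Each BBM weight is just a binomial probability, so computing it is straightforward. A sketch (how $k_t^i$ is derived from the example's running count of correct weak predictions is omitted here):

```python
from math import comb

def bbm_weight(k, N, i, gamma):
    """BBM weight w_t^i: the probability of exactly k heads in N - i flips
    of a (1/2 + gamma/2)-biased coin. In the algorithm, k = k_t^i is
    determined by the example's state after the first i weak learners."""
    n = N - i
    p = 0.5 + gamma / 2.0
    if n < 0 or k < 0 or k > n:
        return 0.0
    return comb(n, k) * (p ** k) * ((1 - p) ** (n - k))
```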
Drawback of Online BBM
The drawback of Online BBM (as with Chen et al. (2012)) is the lack of adaptivity:
- it requires $\gamma$ as a parameter;
- it treats each weak learner equally, predicting via a simple majority vote.
Adaptivity is the key advantage of AdaBoost: different weak learners are weighted differently based on their performance.
Adaptivity via Online Loss Minimization
- Batch boosting finds a combination of weak learners that minimizes some loss function using coordinate descent (Breiman, 1999). AdaBoost uses the exponential loss; AdaBoost.L uses the logistic loss.
- We generalize this to the online setting: replace the line search with online gradient descent.
- The exponential loss does not work here either; using the logistic loss instead gives the adaptive online boosting algorithm AdaBoost.OL. (A rough sketch of one round follows.)
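To illustrate replacing line search with online gradient descent, here is a rough Python sketch of one round. It shows only the central ingredient: each weak learner's vote weight $\alpha_i$ takes a gradient step on the logistic loss of the partial score, and the example is forwarded with probability given by the loss derivative. This is a sketch under assumed interfaces, not the paper's exact AdaBoost.OL pseudocode (which is also more careful about how the final prediction is formed).

```python
import math, random

def adaboost_ol_round(weak_learners, alphas, x_t, y_t, eta=0.1):
    """One round in the spirit of AdaBoost.OL (illustrative sketch).
    Logistic loss: l(z) = log(1 + exp(-y z)) on the running score z."""
    s = 0.0                                      # partial score entering learner i
    for i, wl in enumerate(weak_learners):
        h = wl.predict(x_t)                      # weak prediction in {-1, +1}
        # gradient of log(1 + exp(-y*(s + alpha*h))) with respect to alpha
        grad = -y_t * h / (1.0 + math.exp(y_t * (s + alphas[i] * h)))
        # forward (x_t, y_t) to learner i with probability equal to the
        # magnitude of the loss derivative at the current partial score
        w_i = 1.0 / (1.0 + math.exp(y_t * s))
        if random.random() < w_i:
            wl.update(x_t, y_t)
        s += alphas[i] * h                       # advance the combined score
        alphas[i] -= eta * grad                  # online gradient descent step
    return 1 if s >= 0 else -1                   # (made before y_t in reality)
```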
Mistake Bound
If weak learner $i$ has edge $\gamma_i$, then
$$\sum_{t=1}^{T} \mathbf{1}\{A'(x_t) \neq y_t\} \;\leq\; \frac{2T}{\sum_i \gamma_i^2} + \tilde{O}\!\left(\frac{N^2}{\sum_i \gamma_i^2}\right).$$
- Suppose $\gamma_i \geq \gamma$ for all $i$, so that $\sum_i \gamma_i^2 \geq N\gamma^2$; then to get error rate $\epsilon$, AdaBoost.OL needs $N = O\left(\frac{1}{\epsilon\gamma^2}\right)$ weak learners and $T_\epsilon = \tilde{O}\left(\frac{1}{\epsilon^2\gamma^4}\right)$ examples.
- Not optimal, but adaptive.
Results
- Available in Vowpal Wabbit 8.0; command line option: --boosting.
- VW is used as the default weak learner (a rather strong one!).
- [Table: per-dataset error rates for the VW baseline, Online BBM, AdaBoost.OL, and Chen et al. (2012) on news, a9a, activity, adult, bio, census, covtype, letter, maptaskcoref, nomao, poker, rcv, and vehv2binary; the numeric entries were lost in transcription.]
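As a usage example, a training and evaluation run might look like the following; the file names are placeholders, and `--boosting N` is read here as setting the number of weak learners, which the slide does not spell out.

```
vw --boosting 20 -d train.vw -f model.vw    # train: boosting over VW's default learner
vw -t -d test.vw -i model.vw                # evaluate the saved model on held-out data
```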
Conclusions
We propose:
- a natural framework for online boosting;
- an optimal algorithm, Online BBM;
- an adaptive algorithm, AdaBoost.OL.
Future directions:
- Open problem: an algorithm that is both optimal and adaptive?
- Beyond classification: online gradient boosting for regression (see the companion arXiv preprint).