Optimal and Adaptive Algorithms for Online Boosting


Optimal and Adaptive Algorithms for Online Boosting

Alina Beygelzimer (Yahoo! Labs, NYC)
Satyen Kale (Yahoo! Labs, NYC)
Haipeng Luo (Computer Science Department, Princeton University)

Jul 8, 2015

Boosting: An Example

Idea: combine weak rules of thumb to form a highly accurate predictor.

Example: email spam detection.

Given: a set of training examples.
  ("Wanna make fast money?...", spam)
  ("How is research going? Rob", not spam)

Obtain a classifier by asking a weak learning algorithm: e.g. contains the word "money" ⇒ spam.

Reweight the examples so that difficult ones get more attention, e.g. spam that doesn't contain "money".

Obtain another classifier: e.g. empty "to" address ⇒ spam.

... (and so on)

At the end, predict by taking a (weighted) majority vote.
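To make the reweight-and-vote loop concrete, here is a minimal batch AdaBoost-style sketch in Python (an illustration only: the weak_learner routine, the data, and the reweighting constants are placeholders, not details from the talk).

    import numpy as np

    def boost(X, y, weak_learner, rounds=10):
        """Toy AdaBoost-style loop: fit weak rules on reweighted data, then
        combine them with a weighted majority vote.  Labels y are in {-1, +1};
        weak_learner(X, y, D) is assumed to return a callable h with h(X) in {-1, +1}."""
        n = len(y)
        D = np.ones(n) / n                          # start with uniform example weights
        hypotheses, alphas = [], []
        for _ in range(rounds):
            h = weak_learner(X, y, D)               # weak rule trained on the weighted data
            err = np.clip(np.sum(D * (h(X) != y)), 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - err) / err)   # vote weight: better rules vote louder
            D *= np.exp(-alpha * y * h(X))          # up-weight the misclassified examples
            D /= D.sum()
            hypotheses.append(h)
            alphas.append(alpha)

        def predict(X_new):                         # final (weighted) majority vote
            return np.sign(sum(a * h(X_new) for a, h in zip(alphas, hypotheses)))
        return predict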

Online Boosting: Motivation

Boosting is well studied in the batch setting, but becomes infeasible when the amount of data is huge.

Online learning has proven extremely useful:
  one pass over the data, making predictions on the fly.
  works even in an adversarial environment, e.g. spam detection.

A natural question: how do we extend boosting to the online setting?

Related Work

Several online boosting algorithms exist (Oza and Russell, 2001; Grabner and Bischof, 2006; Liu and Yu, 2007; Grabner et al., 2008):
  they mimic their offline counterparts.
  they achieve great success in many real-world applications.
  but they come with no theoretical guarantees.

Chen et al. (2012): the first online boosting algorithms with theoretical guarantees:
  an online analogue of the weak learning assumption.
  a connection between online boosting and smooth batch boosting.

Batch Boosting

Given a batch of T examples $(x_t, y_t) \in \mathcal{X} \times \{-1, +1\}$, $t = 1, \ldots, T$; a learner A predicts $A(x_t) \in \{-1, +1\}$ for example $x_t$.

Weak learner A (with edge $\gamma$): $\sum_{t=1}^{T} \mathbf{1}\{A(x_t) \neq y_t\} \leq (\tfrac{1}{2} - \gamma)\, T$

Strong learner A' (with any target error rate $\epsilon$): $\sum_{t=1}^{T} \mathbf{1}\{A'(x_t) \neq y_t\} \leq \epsilon T$

Boosting (Schapire, 1990; Freund, 1995): turns a weak learner into a strong learner.

Online Boosting

Examples $(x_t, y_t) \in \mathcal{X} \times \{-1, +1\}$ arrive online, for $t = 1, \ldots, T$; the learner A observes $x_t$ and predicts $A(x_t) \in \{-1, +1\}$ before seeing $y_t$.

Weak online learner A (with edge $\gamma$ and excess loss S): $\sum_{t=1}^{T} \mathbf{1}\{A(x_t) \neq y_t\} \leq (\tfrac{1}{2} - \gamma)\, T + S$

Strong online learner A' (with any target error rate $\epsilon$ and excess loss S'): $\sum_{t=1}^{T} \mathbf{1}\{A'(x_t) \neq y_t\} \leq \epsilon T + S'$

Online boosting (our result): turns a weak online learner into a strong online learner.

This talk: $S = \sqrt{T}/\gamma$ (corresponding to $\sqrt{T}$ regret).

Main Results

Parameters of interest:
  N = number of weak learners (of edge γ) needed to achieve error rate ε.
  T_ε = minimal number of examples such that the error rate is ε.

  Algorithm            N                     T_ε              Optimal?   Adaptive?
  Online BBM           O((1/γ²) ln(1/ε))     Õ(1/(εγ²))       yes        no
  AdaBoost.OL          O(1/(εγ²))            Õ(1/(ε²γ⁴))      no         yes
  Chen et al. (2012)   O(1/(εγ²))            Õ(1/(εγ²))       no         no

Structure of Online Boosting

(Diagram: the Booster sits between the example stream and N weak learners WL 1, ..., WL N.)

On each round t:
  the booster receives $x_t$ and passes it to every weak learner; WL i predicts $\hat{y}_t^i$;
  the booster combines these into its own prediction $\hat{y}_t$, and then the true label $y_t$ is revealed;
  each WL i receives $(x_t, y_t)$ as a training example with probability $p_t^i$ and updates.
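A minimal Python sketch of one round of this protocol (illustrative: the weak-learner objects, the voting weights, and the rule producing $p_t^i$ are placeholders for whatever a concrete booster such as Online BBM or AdaBoost.OL prescribes).

    import random

    def online_boost_round(weak_learners, alpha, x_t, reveal_label, p_rule):
        """One round of the generic online boosting protocol described above.
        weak_learners: objects with predict(x) -> {-1, +1} and update(x, y).
        alpha:         per-learner voting weights (all 1.0 for a plain majority vote).
        p_rule(i, x_t, y_t, preds): the probability p_t^i of sending (x_t, y_t) to WL i."""
        # 1. Every weak learner predicts on x_t.
        preds = [wl.predict(x_t) for wl in weak_learners]

        # 2. The booster makes its own (weighted majority vote) prediction before seeing y_t.
        y_hat = 1 if sum(a * p for a, p in zip(alpha, preds)) >= 0 else -1

        # 3. The true label is revealed.
        y_t = reveal_label()

        # 4. Each weak learner is trained on (x_t, y_t) with probability p_t^i.
        for i, wl in enumerate(weak_learners):
            if random.random() < p_rule(i, x_t, y_t, preds):
                wl.update(x_t, y_t)
        return y_hat, y_t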

Boosting as a Drifting Game (Schapire, 2001; Luo and Schapire, 2014)

Batch boosting can be analyzed as a drifting game.

Online version: a sequence of potentials $\Phi_i(s)$ such that
  $\Phi_N(s) \geq \mathbf{1}\{s \leq 0\}$,
  $\Phi_{i-1}(s) \geq (\tfrac{1}{2} - \tfrac{\gamma}{2})\,\Phi_i(s-1) + (\tfrac{1}{2} + \tfrac{\gamma}{2})\,\Phi_i(s+1)$.

Online boosting algorithm built from the $\Phi_i$:
  prediction: majority vote.
  update: $p_t^i = \Pr[(x_t, y_t) \text{ is sent to the } i\text{th weak learner}] \propto w_t^i$, where $w_t^i$ is the difference in potentials between the example being misclassified or not.
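A small Python sketch of how such potentials can be tabulated backwards, assuming the recursion is taken with equality and $\Phi_N(s) = \mathbf{1}\{s \leq 0\}$; the state range, the boundary handling, and the exact weight convention are assumptions made only so the sketch runs, following the slide's verbal description.

    def bbm_potentials(N, gamma, s_range=50):
        """Tabulate Phi_i(s) for i = N, ..., 0 over integer sums s in [-s_range, s_range],
        using the recursion from the slide with equality."""
        p_up, p_down = 0.5 + gamma / 2, 0.5 - gamma / 2
        states = range(-s_range, s_range + 1)

        def boundary(s):                                   # value assumed outside the table
            return 1.0 if s <= 0 else 0.0

        phi = {N: {s: boundary(s) for s in states}}        # Phi_N(s) = 1{s <= 0}
        for i in range(N - 1, -1, -1):
            phi[i] = {s: p_down * phi[i + 1].get(s - 1, boundary(s - 1))
                         + p_up * phi[i + 1].get(s + 1, boundary(s + 1))
                      for s in states}
        return phi

    def bbm_weight(phi, i, s):
        """w_t^i: difference in potential between the example being misclassified
        (sum drops to s - 1) and classified correctly (sum rises to s + 1)."""
        lo = phi[i].get(s - 1, 1.0 if s - 1 <= 0 else 0.0)
        hi = phi[i].get(s + 1, 1.0 if s + 1 <= 0 else 0.0)
        return lo - hi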

Mistake Bound

The generalized drifting-games analysis implies
  $\sum_{t=1}^{T} \mathbf{1}\{A'(x_t) \neq y_t\} \;\leq\; \underbrace{\Phi_0(0)}_{\epsilon}\, T \;+\; \underbrace{\big(S + \tfrac{1}{\gamma}\big) \sum_i \|w^i\|_\infty}_{=\, S'}.$

So we want the $w^i$ to be small:
  the exponential potential (corresponding to AdaBoost) does not work;
  the Boost-by-Majority (Freund, 1995) potential works well:
    $w_t^i = \Pr\big[k_t^i \text{ heads in } N-i \text{ flips of a } (\tfrac{1}{2} + \tfrac{\gamma}{2})\text{-biased coin}\big] \leq \tfrac{4}{\sqrt{N-i}}.$

Online BBM: to achieve error rate ε, it needs $N = O(\tfrac{1}{\gamma^2} \ln \tfrac{1}{\epsilon})$ weak learners and $T_\epsilon = \tilde{O}(\tfrac{1}{\epsilon \gamma^2})$ examples. (Optimal)
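One way to see the $N = O(\tfrac{1}{\gamma^2} \ln \tfrac{1}{\epsilon})$ scaling (a standard Hoeffding-type step added here for completeness, not spelled out on the slide): $\Phi_0(0)$ is roughly the probability that a $(\tfrac{1}{2} + \tfrac{\gamma}{2})$-biased coin comes up heads at most $N/2$ times in $N$ flips, so

    $\Phi_0(0) \;\leq\; e^{-2N(\gamma/2)^2} \;=\; e^{-N\gamma^2/2} \;\leq\; \epsilon \quad \text{once} \quad N \;\geq\; \tfrac{2}{\gamma^2} \ln\tfrac{1}{\epsilon}.$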

Drawback of Online BBM

The drawback of Online BBM (or of Chen et al. (2012)) is the lack of adaptivity:
  it requires γ as an input parameter;
  it treats every weak learner equally, predicting via a simple majority vote.

Adaptivity is the key advantage of AdaBoost: different weak learners are weighted differently based on their performance.

Adaptivity via Online Loss Minimization

Batch boosting finds a combination of weak learners that minimizes some loss function via coordinate descent (Breiman, 1999):
  AdaBoost: exponential loss.
  AdaBoost.L: logistic loss.

We generalize this to the online setting: replace line search with online gradient descent.
  The exponential loss again does not work; the logistic loss yields the adaptive online boosting algorithm AdaBoost.OL.
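A rough Python sketch of this mechanism (a simplified illustration under assumptions, not the exact AdaBoost.OL pseudocode: the step size, the projection interval, and the sampling rule are stand-ins): each weak learner's vote weight is adjusted by an online gradient descent step on the logistic loss of the running score, and weak learners are fed the example with probability tied to how wrong that running score still is.

    import math
    import random

    def adaboost_ol_round_sketch(weak_learners, alpha, x_t, y_t, eta=0.1, clip=2.0):
        """One-round sketch of adaptivity via online loss minimization.
        Weak learners are placeholders exposing predict(x) -> {-1, +1} and update(x, y);
        alpha holds one vote weight per weak learner and is modified in place."""
        preds = [wl.predict(x_t) for wl in weak_learners]
        vote = sum(a * h for a, h in zip(alpha, preds))
        y_hat = 1 if vote >= 0 else -1                    # booster's prediction (made before y_t)

        s = 0.0                                           # running weighted score, learner by learner
        for i, (wl, h) in enumerate(zip(weak_learners, preds)):
            # Sample (x_t, y_t) for WL i with probability driven by the logistic-loss
            # derivative at the current running score: harder examples are sampled more.
            p_i = 1.0 / (1.0 + math.exp(y_t * s))
            if random.random() < p_i:
                wl.update(x_t, y_t)
            # Online gradient descent step on log(1 + exp(-y_t * (s + alpha[i] * h)))
            # with respect to alpha[i], followed by projection onto [-clip, clip].
            grad = -y_t * h / (1.0 + math.exp(y_t * (s + alpha[i] * h)))
            alpha[i] = max(-clip, min(clip, alpha[i] - eta * grad))
            s += alpha[i] * h
        return y_hat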

Mistake Bound

If WL i has edge $\gamma_i$, then
  $\sum_{t=1}^{T} \mathbf{1}\{A'(x_t) \neq y_t\} \;\leq\; \frac{2T}{\sum_i \gamma_i^2} + \tilde{O}\!\left(\frac{N^2}{\sum_i \gamma_i^2}\right).$

If every $\gamma_i \geq \gamma$, then to achieve error rate ε, AdaBoost.OL needs $N = O(\tfrac{1}{\epsilon\gamma^2})$ weak learners and $T_\epsilon = \tilde{O}(\tfrac{1}{\epsilon^2\gamma^4})$ examples.

Not optimal, but adaptive.

Results

Available in Vowpal Wabbit 8.0. Command line option: --boosting. VW is used as the default weak learner (a rather strong one!).

Average loss (lower is better):

  Dataset        VW baseline   Online BBM   AdaBoost.OL   Chen et al. '12
  20news         0.0812        0.0775       0.0777        0.0791
  a9a            0.1509        0.1495       0.1497        0.1509
  activity       0.0133        0.0114       0.0128        0.0130
  adult          0.1543        0.1526       0.1536        0.1539
  bio            0.0035        0.0031       0.0032        0.0033
  census         0.0471        0.0469       0.0469        0.0469
  covtype        0.2563        0.2347       0.2495        0.2470
  letter         0.2295        0.1923       0.2078        0.2148
  maptaskcoref   0.1091        0.1077       0.1083        0.1093
  nomao          0.0641        0.0627       0.0635        0.0627
  poker          0.4555        0.4312       0.4555        0.4555
  rcv1           0.0487        0.0485       0.0484        0.0488
  vehv2binary    0.0292        0.0286       0.0291        0.0284
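For instance, an invocation along these lines enables online boosting from the command line (train.vw is a placeholder dataset name, and passing the number of weak learners as the flag's argument is an assumption about its usage; see vw --help in VW 8.0 for the exact syntax):

    vw --boosting 10 train.vw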

Conclusions

We propose:
  a natural framework for online boosting;
  an optimal algorithm, Online BBM;
  an adaptive algorithm, AdaBoost.OL.

Future directions:
  Open problem: an algorithm that is both optimal and adaptive?
  Beyond classification: online gradient boosting for regression (see arXiv:1506.04820).