
COMS 4771 Lecture 12
1. Boosting

Boosting

What is boosting?

Boosting: using a learning algorithm that provides rough rules-of-thumb to construct a very accurate predictor.

Motivation: it is easy to construct classification rules that are correct more often than not (e.g., "If 5% of the e-mail's characters are dollar signs, then it's spam."), but hard to find a single rule that is almost always correct.

Basic idea:
Input: training data $S$, weak learning algorithm $A$.
For $t = 1, 2, \dots, T$:
  1. Choose a subset of examples $S_t \subseteq S$ (or a distribution over $S$).
  2. Call the weak learning algorithm to get a classifier: $f_t := A(S_t)$.
Return a weighted majority vote over $f_1, f_2, \dots, f_T$ (sketched below).
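The loop above is deliberately generic; as a rough illustration, here is a minimal Python sketch of it. The weak-learner interface `A`, the subset-selection rule (sampling with replacement), and the uniform vote weights are placeholder assumptions, not part of the slide; AdaBoost, introduced below, makes the distribution choice and the vote weights precise.

```python
import random

def boost(S, A, T, subset_size=None):
    """Generic boosting skeleton: repeatedly call a weak learner A on
    (re)sampled training data, then combine the resulting classifiers
    by a weighted majority vote.

    S: list of (x, y) pairs with y in {-1, +1}.
    A: function mapping a list of examples to a classifier f: x -> {-1, +1}.
    The subset-selection rule (sampling with replacement) and the uniform
    vote weights below are illustrative placeholders.
    """
    classifiers, weights = [], []
    for t in range(T):
        # 1. Choose a subset S_t of S (here: a bootstrap-style sample).
        k = subset_size or len(S)
        S_t = [random.choice(S) for _ in range(k)]
        # 2. Call the weak learning algorithm: f_t := A(S_t).
        f_t = A(S_t)
        classifiers.append(f_t)
        weights.append(1.0)  # placeholder; AdaBoost chooses these adaptively
    # Return a weighted majority vote over f_1, ..., f_T.
    def f_final(x):
        vote = sum(w * f(x) for w, f in zip(weights, classifiers))
        return 1 if vote >= 0 else -1
    return f_final
```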

Boosting: history

1984: Valiant and Kearns ask whether boosting is theoretically possible (formalized in the PAC learning model).
1989: Schapire creates the first boosting algorithm, solving the open problem of Valiant and Kearns.
1990: Freund creates an optimal boosting algorithm (Boost-by-majority).
1992: Drucker, Schapire, and Simard empirically observe practical limitations of early boosting algorithms.
1995: Freund and Schapire create AdaBoost, a boosting algorithm with practical advantages over early boosting algorithms.

Winner of the 2004 ACM Paris Kanellakis Award: "For their seminal work and distinguished contributions [...] to the development of the theory and practice of boosting, a general and provably effective method of producing arbitrarily accurate prediction rules by combining weak learning rules"; specifically, for AdaBoost, which can be used to significantly reduce the error of algorithms used in statistical analysis, spam filtering, fraud detection, optical character recognition, and market segmentation, among other applications.

AdaBoost Input Training data S from X {±1}. Weak learning algorithm A (for importance-weighted classification). 1: initialize D 1(x, y) := 1/ S for each (x, y) S (a probability distribution). 2: for t = 1, 2,..., T do 3: Give D t-weighted examples to A; get back f t : X {±1}. 4: Update weights: z t := D t(x, y) yf t(x) [ 1, +1] (x,y) S α t := 1 1 + zt ln R (weight of f t) 2 1 z t D t+1(x, y) D t(x, y) exp( α t yf t(x)) for each (x, y) S. 5: end for ( T ) 6: return Final classifier f final (x) := sign α t f t(x). t=1 5 / 16

Interpretation

Interpreting $z_t$: if $\Pr_{(X,Y) \sim D_t}[f_t(X) = Y] = \frac{1}{2} + \gamma_t$ for some $\gamma_t \in [-1/2, +1/2]$, then
$z_t = \sum_{(x,y) \in S} D_t(x, y)\, y f_t(x) = 2\gamma_t \in [-1, +1]$.

$z_t = 0$: random guessing w.r.t. $D_t$.
$z_t > 0$: better than random guessing w.r.t. $D_t$.
$z_t < 0$: better off using the opposite of $f_t$'s predictions.

Interpretation

Classifier weights: $\alpha_t = \frac{1}{2} \ln \frac{1 + z_t}{1 - z_t}$.
[Plot: $\alpha_t$ as a function of $z_t \in (-1, +1)$; $\alpha_t = 0$ at $z_t = 0$, and $\alpha_t$ diverges as $z_t \to \pm 1$.]

Example weights: $D_{t+1}(x, y) \propto D_t(x, y) \exp(-\alpha_t\, y f_t(x))$.
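As a quick numerical check of this formula, using the value $z_1 = 0.40$ that appears in the example run later in these slides (the limiting behavior is a property of the formula itself, not a claim from the slides):

\[
\alpha_t \Big|_{z_t = 0.40} \;=\; \tfrac{1}{2} \ln \frac{1 + 0.40}{1 - 0.40} \;=\; \tfrac{1}{2} \ln \tfrac{7}{3} \;\approx\; 0.42,
\qquad
\alpha_t \to +\infty \ \text{ as } z_t \to 1^{-}.
\]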

Example: AdaBoost with decision stumps

Weak learning algorithm $A$: ERM with $F$ = decision stumps on $\mathbb{R}^2$, i.e., axis-aligned threshold functions $x \mapsto \mathrm{sign}(v(x_i - t))$ for a coordinate $i$, threshold $t$, and sign $v \in \{\pm 1\}$. It is straightforward to handle importance weights in ERM.

(Example from Figures 1.1 and 1.2 of the Schapire & Freund text.)
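One straightforward way to implement this weak learner is an exhaustive search over coordinates and thresholds. The Python sketch below is an assumption about how such an importance-weighted ERM could look (matching the `weak_learner(S, D)` interface assumed earlier), not necessarily what was used to produce the figures.

```python
def stump_weak_learner(S, D):
    """Importance-weighted ERM over decision stumps on R^d, i.e.,
    classifiers x -> sign(v * (x[i] - t)) for a coordinate i, a
    threshold t, and an orientation v in {-1, +1}.

    S: list of (x, y) pairs, each x a sequence of d floats, y in {-1, +1}.
    D: one nonnegative weight per example (the distribution D_t).
    Returns the stump with the smallest D-weighted classification error.
    """
    d = len(S[0][0])
    best_stump, best_err = None, float("inf")
    for i in range(d):
        vals = sorted(x[i] for x, _ in S)
        # candidate thresholds: below all values, plus midpoints between neighbors
        thresholds = [vals[0] - 1.0] + [(a + b) / 2 for a, b in zip(vals, vals[1:])]
        for t in thresholds:
            for v in (-1, +1):
                err = sum(w for w, (x, y) in zip(D, S)
                          if (v if x[i] > t else -v) != y)
                if err < best_err:
                    best_err, best_stump = err, (i, t, v)
    i, t, v = best_stump
    return lambda x, i=i, t=t, v=v: v if x[i] > t else -v
```

This naive search costs roughly $O(d\,n^2)$ per call; sorting once per coordinate and sweeping the threshold would bring it down to $O(d\,n \log n)$, but the simple version keeps the sketch short.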

Example: execution of AdaBoost

[Figure: three rounds of AdaBoost on a toy 2-D dataset. Panels show the weighted training sets $D_1, D_2, D_3$ (points labeled "+"/"-") and the decision stumps $f_1, f_2, f_3$ chosen in each round.]

$z_1 = 0.40,\ \alpha_1 = 0.42 \qquad z_2 = 0.58,\ \alpha_2 = 0.65 \qquad z_3 = 0.72,\ \alpha_3 = 0.92$

Example: final classifier from AdaBoost

[Figure: the three stumps $f_1, f_2, f_3$ with their round statistics $z_1 = 0.40,\ \alpha_1 = 0.42$; $z_2 = 0.58,\ \alpha_2 = 0.65$; $z_3 = 0.72,\ \alpha_3 = 0.92$, together with the combined decision boundary.]

Final classifier: $f_{\mathrm{final}}(x) = \mathrm{sign}\bigl(0.42 f_1(x) + 0.65 f_2(x) + 0.92 f_3(x)\bigr)$. (Zero training error!)
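A small arithmetic observation about these particular weights (a check on the numbers, not a statement from the slides): each pair of stumps outvotes the remaining one, so here the weighted vote coincides with a simple majority vote over $f_1, f_2, f_3$:

\[
0.42 + 0.65 = 1.07 > 0.92, \qquad
0.42 + 0.92 = 1.34 > 0.65, \qquad
0.65 + 0.92 = 1.57 > 0.42 .
\]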

Empirical results (UCI)

Test error rates of C4.5 and AdaBoost on several classification problems. Each point represents a single classification problem/dataset from the UCI repository.

[Two scatter plots, both axes from 0 to 30: C4.5 test error (%) versus AdaBoost+stumps, and C4.5 test error (%) versus AdaBoost+C4.5.]

C4.5 is a popular algorithm for learning decision trees. (Figure 1.3 from the Schapire & Freund text.)

Training error of final classifier

Recall $\gamma_t := \Pr_{(X,Y) \sim D_t}[f_t(X) = Y] - 1/2 = z_t / 2$.

Training error of the final classifier from AdaBoost:
$\mathrm{err}(f_{\mathrm{final}}, S) \le \exp\Bigl(-2 \sum_{t=1}^{T} \gamma_t^2\Bigr)$.

If the average $\bar\gamma^2 := \frac{1}{T} \sum_{t=1}^{T} \gamma_t^2 > 0$, then the training error is at most $\exp(-2 \bar\gamma^2 T)$.

AdaBoost = Adaptive Boosting: some $\gamma_t$ could be small (or even negative!); only the overall average $\bar\gamma^2$ matters.

What about the true error?
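To get a feel for the rate, here is an illustrative calculation (the specific numbers are assumptions, not from the slides): if $\bar\gamma^2 = 0.01$ (e.g., every weak classifier is about 60% accurate w.r.t. its distribution $D_t$) and $|S| = 100$, then the bound drops below $1/|S|$, which forces the training error to be exactly zero, as soon as

\[
\exp(-2 \bar\gamma^2 T) < \frac{1}{|S|}
\iff
T > \frac{\ln |S|}{2 \bar\gamma^2} = \frac{\ln 100}{2 \cdot 0.01} \approx 230 .
\]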

Combining classifiers

Let $F$ be the function class used by the weak learning algorithm $A$. The function class used by AdaBoost is
$F_T := \bigl\{ x \mapsto \mathrm{sign}\bigl(\sum_{t=1}^{T} \alpha_t f_t(x)\bigr) : f_1, f_2, \dots, f_T \in F,\ \alpha_1, \alpha_2, \dots, \alpha_T \in \mathbb{R} \bigr\}$,
i.e., (signs of) linear combinations of $T$ functions from $F$. The complexity of $F_T$ grows linearly with $T$.

Theoretical guarantee: with high probability over the choice of the i.i.d. sample $S$,
$\mathrm{err}(f) \le \mathrm{err}(f, S) + O\Bigl(\sqrt{\tfrac{T \log(|F|\,|S|)}{|S|}}\Bigr)$ for all $f \in F_T$.

Theory suggests a danger of over-fitting when $T$ is very large. Indeed, this does happen sometimes... but often not!

A typical run of boosting

AdaBoost+C4.5 on the "letters" dataset.

[Plot: error (%) versus the number of rounds $T$, on a log scale from 10 to 1000, showing the C4.5 test error, the AdaBoost test error, and the AdaBoost training error. The number of nodes across all decision trees in $f_{\mathrm{final}}$ is more than $2 \times 10^6$.]

Training error is zero after just five rounds, but test error continues to decrease, even up to 1000 rounds! (Figure 1.7 from the Schapire & Freund text.)

Boosting the margin

Final classifier from AdaBoost:
$f_{\mathrm{final}}(x) = \mathrm{sign}\bigl(g(x)\bigr), \quad \text{where } g(x) := \frac{\sum_{t=1}^{T} \alpha_t f_t(x)}{\sum_{t=1}^{T} \alpha_t} \in [-1, +1]$.
Call $y\, g(x) \in [-1, +1]$ the margin achieved on example $(x, y)$.

New theory [Schapire, Freund, Bartlett, and Lee, 1998]:
Larger margins imply better generalization error, independent of $T$.
AdaBoost tends to increase margins on training examples.
(Similar to, but not the same as, SVM margins.)

On the "letters" dataset:

                      T = 5    T = 100   T = 1000
  training error       0.0%     0.0%      0.0%
  test error           8.4%     3.3%      3.1%
  % margins <= 0.5     7.7%     0.0%      0.0%
  min. margin          0.14     0.52      0.55
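A short sketch of how numbers like those in the table could be computed from a trained ensemble; it assumes the lists `fs` and `alphas` of weak classifiers and weights as returned by the `adaboost` sketch earlier, and the helper name `margin_stats` is made up for illustration.

```python
def margin_stats(S, fs, alphas, threshold=0.5):
    """Normalized margins y * g(x), where
    g(x) = (sum_t alpha_t * f_t(x)) / (sum_t alpha_t) lies in [-1, +1].

    Returns (fraction of examples with margin <= threshold, minimum margin).
    """
    total = sum(alphas)
    margins = [y * sum(a * f(x) for a, f in zip(alphas, fs)) / total
               for x, y in S]
    frac_small = sum(m <= threshold for m in margins) / len(margins)
    return frac_small, min(margins)
```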

More on boosting

Many variants of boosting:
  AdaBoost.L and LogitBoost
  Forward-{step,stage}wise regression
  Boosted decision trees = boosting + decision trees (see ESL Chapter 10)
  Boosting algorithms for ranking and for multi-class classification
  Boosting algorithms that are robust to certain kinds of noise
  ...

Many connections between boosting and other subjects:
  Game theory, online learning
  Information geometry
  Computational complexity
  ...