Boosting: Foundations and Algorithms. Rob Schapire


Example: Spam Filtering. Problem: filter out spam (junk email). Gather a large collection of examples of spam and non-spam: From: yoav@ucsd.edu "Rob, can you review a paper..." (non-spam); From: xa412@hotmail.com "Earn money without working!!!!..." (spam); ... Goal: have the computer learn from examples to distinguish spam from non-spam.

Machine Learning studies how to automatically learn to make accurate predictions based on past observations. Classification problems: classify examples into a given set of categories. [Diagram: labeled training examples → machine learning algorithm → classification rule; new example → classification rule → predicted classification.]

Examples of Classification Problems: text categorization (e.g., spam filtering); fraud detection; machine vision (e.g., face detection); natural-language processing (e.g., spoken language understanding); market segmentation (e.g., predicting whether a customer will respond to a promotion); bioinformatics (e.g., classifying proteins according to their function).

Back to Spam. Main observation: it is easy to find rules of thumb that are often correct ("If 'viagra' occurs in the message, then predict spam"), but hard to find a single rule that is very highly accurate.

The Boosting Approach: devise a computer program for deriving rough rules of thumb; apply the procedure to a subset of the examples to obtain a rule of thumb; apply it to a second subset of examples to obtain a second rule of thumb; repeat T times.

Key Details. How do we choose the examples on each round? Concentrate on the "hardest" examples (those most often misclassified by previous rules of thumb). How do we combine the rules of thumb into a single prediction rule? Take a (weighted) majority vote of the rules of thumb.

Boosting. Boosting = a general method of converting rough rules of thumb into a highly accurate prediction rule. Technically: assume we are given a weak learning algorithm that can consistently find classifiers ("rules of thumb") at least slightly better than random, say, 55% accuracy (in the two-class setting) [the "weak learning assumption"]. Given sufficient data, a boosting algorithm can then provably construct a single classifier with very high accuracy, say, 99%.

Early History. [Valiant 84]: introduced the theoretical ("PAC") model for studying machine learning. [Kearns & Valiant 88]: posed the open problem of finding a boosting algorithm. If boosting is possible, then: one can use (fairly) wild guesses to produce highly accurate predictions; if one can learn "part way," then one can learn "all the way"; it should be possible to improve any learning algorithm; and for any learning problem, either one can always learn with nearly perfect accuracy, or there exist cases where one cannot learn even slightly better than random guessing.

First Boosting Algorithms. [Schapire 89]: first provable boosting algorithm. [Freund 90]: "optimal" algorithm that "boosts by majority." [Drucker, Schapire & Simard 92]: first experiments using boosting; limited by practical drawbacks. [Freund & Schapire 95]: introduced the AdaBoost algorithm, with strong practical advantages over previous boosting algorithms.

A Formal Description of Boosting
given training set (x_1, y_1), ..., (x_m, y_m), where y_i ∈ {−1, +1} is the correct label of instance x_i ∈ X
for t = 1, ..., T:
  construct distribution D_t on {1, ..., m}
  find weak classifier ("rule of thumb") h_t : X → {−1, +1} with small error ε_t on D_t, where ε_t = Pr_{i∼D_t}[ h_t(x_i) ≠ y_i ]
output final classifier H_final

AdaBoost [with Freund]
constructing D_t:
  D_1(i) = 1/m
  given D_t and h_t:
    D_{t+1}(i) = (D_t(i) / Z_t) · e^{−α_t} if y_i = h_t(x_i)
                 (D_t(i) / Z_t) · e^{+α_t} if y_i ≠ h_t(x_i)
               = D_t(i) · exp(−α_t · y_i · h_t(x_i)) / Z_t
  where Z_t = normalization factor and α_t = (1/2) ln((1 − ε_t) / ε_t) > 0
final classifier: H_final(x) = sign( Σ_t α_t h_t(x) )
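A minimal sketch of this procedure in Python (the weak_learn interface below, a function taking (X, y, D) and returning a callable classifier, is an assumption made for illustration and is not specified in the talk):

    import numpy as np

    def adaboost(X, y, weak_learn, T):
        # X: (m, d) feature matrix; y: (m,) labels in {-1, +1}
        # weak_learn(X, y, D) must return a callable h with h(X) in {-1, +1}^m
        m = len(y)
        D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
        alphas, hs = [], []
        for t in range(T):
            h = weak_learn(X, y, D)                # weak classifier for round t
            pred = h(X)
            eps = D[pred != y].sum()               # weighted error eps_t under D_t
            alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t = (1/2) ln((1 - eps_t)/eps_t)
            D = D * np.exp(-alpha * y * pred)      # up-weight mistakes, down-weight correct
            D = D / D.sum()                        # divide by Z_t (normalization)
            alphas.append(alpha)
            hs.append(h)
        def H_final(X_new):                        # weighted majority vote
            return np.sign(sum(a * h(X_new) for a, h in zip(alphas, hs)))
        return H_final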

Toy Example. [Figure: training examples under the initial distribution D_1.] Weak classifiers = vertical or horizontal half-planes.

Round 1. [Figure: weak classifier h_1 and the reweighted distribution D_2.] ε_1 = 0.30, α_1 = 0.42.

Round 2. [Figure: weak classifier h_2 and the reweighted distribution D_3.] ε_2 = 0.21, α_2 = 0.65.

Round 3. [Figure: weak classifier h_3.] ε_3 = 0.14, α_3 = 0.92.

Final Classifier. [Figure: the combined classifier.] H_final = sign(0.42·h_1 + 0.65·h_2 + 0.92·h_3).
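As a quick arithmetic check on the toy run, the α_t values follow directly from the stated ε_t via α_t = (1/2) ln((1 − ε_t)/ε_t); a short sketch:

    import numpy as np

    eps = np.array([0.30, 0.21, 0.14])       # errors from the three toy rounds
    alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t = (1/2) ln((1 - eps_t)/eps_t)
    print(np.round(alpha, 2))                # [0.42 0.66 0.91], matching the slide values up to rounding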

AdaBoost (recap)
given training set (x_1, y_1), ..., (x_m, y_m), where x_i ∈ X and y_i ∈ {−1, +1}
initialize D_1(i) = 1/m (for all i)
for t = 1, ..., T:
  train weak classifier h_t : X → {−1, +1} with error ε_t = Pr_{i∼D_t}[ h_t(x_i) ≠ y_i ]
  set α_t = (1/2) ln((1 − ε_t) / ε_t)
  update, for all i: D_{t+1}(i) = D_t(i) · exp(−α_t · y_i · h_t(x_i)) / Z_t, where Z_t = normalization factor
output H_final(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )

Analyzing the Training Error [with Freund]
Theorem: write ε_t as 1/2 − γ_t [γ_t = the "edge"]; then
  training error(H_final) ≤ Π_t [ 2 √(ε_t (1 − ε_t)) ] = Π_t √(1 − 4 γ_t²) ≤ exp(−2 Σ_t γ_t²)
so: if γ_t ≥ γ > 0 for all t, then training error(H_final) ≤ e^{−2 γ² T}
AdaBoost is adaptive: it does not need to know γ or T a priori, and it can exploit γ_t ≫ γ
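To get a feel for how quickly this bound shrinks, here is a small illustration with an assumed edge of γ = 0.05 (5% better than random), a number chosen only for the example:

    import numpy as np

    gamma = 0.05                             # assumed edge: every round beats random by 5%
    for T in (100, 500, 1000):
        print(T, np.exp(-2 * gamma**2 * T))  # bound on training error: ~0.61, ~0.082, ~0.0067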

How Will Test Error Behave? (A First Guess) [Figure: a hypothetical plot of training and test error versus # of rounds (T), with test error rising after an initial drop.] Expect: training error to continue to drop (or reach zero); test error to increase when H_final becomes "too complex" ("Occam's razor"; overfitting; hard to know when to stop training).

Technically... with high probability:
  generalization error ≤ training error + Õ( √(dT / m) )
the bound depends on m = # training examples, d = "complexity" of the weak classifiers, and T = # rounds (generalization error = E[test error]); it predicts overfitting

Overfitting Can Happen. [Figure: training and test error versus # of rounds for boosting "stumps" on the heart-disease dataset; test error eventually increases.] ...but it often doesn't...

Actual Typical Run. [Figure: training and test error versus # of rounds (T) for boosting C4.5 on the "letter" dataset.] Test error does not increase, even after 1,000 rounds (total size > 2,000,000 nodes); test error continues to drop even after training error is zero!

  # rounds      5     100    1000
  train error   0.0   0.0    0.0
  test error    8.4   3.3    3.1

Occam's razor wrongly predicts that the "simpler" rule is better.

A Better Story: The Margins Explanation [with Freund, Bartlett & Lee]. Key idea: training error only measures whether classifications are right or wrong; we should also consider the confidence of classifications. Recall: H_final is a weighted majority vote of weak classifiers. Measure confidence by the margin = strength of the vote = (weighted fraction voting correctly) − (weighted fraction voting incorrectly). [Figure: margin scale from −1 (high-confidence incorrect) through 0 (low confidence) to +1 (high-confidence correct).]
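A minimal sketch of computing these normalized margins, assuming the per-round weights α_t and the weak classifiers' predictions on the training set have been stored (the array layout is an assumption for illustration):

    import numpy as np

    def margins(alphas, preds, y):
        # alphas: (T,) round weights; preds: (T, m) weak-classifier outputs in {-1, +1}
        # y: (m,) true labels; returns normalized margins in [-1, +1]
        alphas = np.asarray(alphas, dtype=float)
        vote = alphas @ np.asarray(preds)    # weighted vote received by each example
        return y * vote / alphas.sum()       # +1 = confident correct, -1 = confident incorrect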

Empirical Evidence: The Margin Distribution. Margin distribution = cumulative distribution of the margins of the training examples. [Figure: the training/test error curves and the cumulative margin distributions after 5, 100, and 1000 rounds.]

  # rounds          5      100    1000
  train error       0.0    0.0    0.0
  test error        8.4    3.3    3.1
  % margins ≤ 0.5   7.7    0.0    0.0
  minimum margin    0.14   0.52   0.55

Theoretical Evidence: Analyzing Boosting Using Margins. Theorem: large margins ⇒ better bound on generalization error (independent of the number of rounds). Theorem: boosting tends to increase the margins of training examples (given the weak learning assumption); moreover, larger edges ⇒ larger margins.

Consequences of Margins Theory. Predicts good generalization with no overfitting if: the weak classifiers have large edges (implying large margins), and the weak classifiers are not too complex relative to the size of the training set; e.g., boosting decision trees is resistant to overfitting, since trees often have large edges and limited complexity. Overfitting may occur if: the edges are small (underfitting), or the weak classifiers are overly complex; e.g., on the heart-disease dataset, stumps yield small edges, and the dataset is also small.

More Theory. Many other ways of understanding AdaBoost: as playing a repeated two-person matrix game (the weak learning assumption and the optimal margin have natural game-theoretic interpretations; AdaBoost is a special case of a more general game-playing algorithm); as a method for minimizing a particular loss function via numerical techniques such as coordinate descent; using convex analysis, in an information-geometric framework that includes logistic regression and maximum entropy; as a universally consistent statistical method. One can also derive an optimal boosting algorithm, and extend boosting to continuous time.

Practical Advantages of AdaBoost. Fast, simple, and easy to program; no parameters to tune (except T). Flexible: can be combined with any learning algorithm; no prior knowledge needed about the weak learner; provably effective, provided one can consistently find rough rules of thumb (a shift in mind set: the goal now is merely to find classifiers barely better than random guessing). Versatile: can be used with data that is textual, numeric, discrete, etc., and has been extended to learning problems well beyond binary classification.

Caveats. The performance of AdaBoost depends on the data and on the weak learner. Consistent with theory, AdaBoost can fail if: the weak classifiers are too complex → overfitting; or the weak classifiers are too weak (γ_t → 0 too quickly) → underfitting → low margins → overfitting. Empirically, AdaBoost seems especially susceptible to uniform noise.

UCI Experiments [with Freund]. Tested AdaBoost on UCI benchmarks. Weak learners used: C4.5 (Quinlan's decision tree algorithm) and "decision stumps": very simple rules of thumb that test on a single attribute. [Figure: two example stumps: "eye color = brown?" — predict +1 if yes, −1 if no; "height > 5 feet?" — predict −1 if yes, +1 if no.]
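For experiments in this style, one readily available (though not the original) implementation is scikit-learn's AdaBoostClassifier with depth-one trees as the decision stumps; a sketch, assuming a scikit-learn environment and using a built-in dataset as a stand-in for the UCI benchmarks:

    from sklearn.datasets import load_breast_cancer       # stand-in for a UCI benchmark
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    stumps = AdaBoostClassifier(                           # note: 'estimator' is 'base_estimator' in older releases
        estimator=DecisionTreeClassifier(max_depth=1),     # depth-1 trees = decision stumps
        n_estimators=100)
    print(cross_val_score(stumps, X, y, cv=5).mean())      # cross-validated accuracy of boosted stumps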

UCI Results. [Figure: two scatterplots of test error rates (each axis 0–30%) comparing C4.5 against boosting stumps, and C4.5 against boosting C4.5, across the UCI benchmarks.]

Application: Detecting Faces [Viola & Jones]. Problem: find faces in a photograph or movie. Weak classifiers: detect light/dark rectangles in the image. Many clever tricks make it extremely fast and accurate.

Application: Human-computer Spoken Dialogue [with Rahim, Di Fabbrizio, Dutton, Gupta, Hollister & Riccardi]. Application: an automatic "store front" or help desk for AT&T Labs' Natural Voices business. Callers can request a demo, pricing information, technical support, a sales agent, etc., through an interactive dialogue.

How It Works. [Diagram: the human caller's raw utterance goes to an automatic speech recognizer; the recognized text goes to natural language understanding (NLU); the predicted category goes to a dialogue manager; the manager's text response is played back to the caller via text-to-speech.] The NLU's job: classify caller utterances into 24 categories (demo, sales rep, pricing info, yes, no, etc.). Weak classifiers: test for the presence of a word or phrase.
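Such a weak classifier can be as simple as a single phrase-presence test; a hypothetical sketch (the phrase and utterances below are invented for illustration):

    def phrase_stump(phrase):
        # weak classifier: vote +1 for the target category if the phrase occurs, else -1
        def h(utterance):
            return +1 if phrase in utterance.lower() else -1
        return h

    h = phrase_stump("sales rep")
    print(h("I'd like to talk to a sales rep please"))     # -> +1
    print(h("How much does the demo cost?"))               # -> -1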