An overview of Boosting. Yoav Freund UCSD


Plan of talk: Generative vs. non-generative modeling; Boosting; Alternating decision trees; Boosting and over-fitting; Applications.

Toy Example: a computer receives a telephone call, measures the pitch of the caller's voice, and decides the gender of the caller (male or female).

Generative modeling. [figure: the probability density of voice pitch modeled as two Gaussians, one per gender, with parameters mean1, var1 and mean2, var2]

Discriminative approach [Vapnik 85]. [figure: the number of mistakes as a function of a decision threshold on voice pitch]

Ill-behaved data. [figure: a voice-pitch distribution for which the fitted Gaussians (mean1, mean2) and the mistake-minimizing threshold lead to different decision boundaries]

Traditional Statistics vs. Machine Learning. [diagram: the traditional pipeline runs Data → Statistics → Estimated world state → Decision Theory → Actions, whereas Machine Learning maps Data directly to Predictions and Actions]

Comparison of methodologies:
                      Generative              Discriminative
Goal                  probability estimates   classification rule
Performance measure   likelihood              misclassification rate
Mismatch problems     outliers                misclassifications

Boosting

A weighted training set: feature vectors $x_i$, binary labels $y_i \in \{-1,+1\}$, and positive weights $w_i$: $(x_1, y_1, w_1), (x_2, y_2, w_2), \ldots, (x_m, y_m, w_m)$.

A weak learner. Given the weighted training set $(x_1, y_1, w_1), (x_2, y_2, w_2), \ldots, (x_m, y_m, w_m)$, the weak learner produces a weak rule $h$ mapping the instances $x_1, x_2, \ldots, x_m$ to predictions $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_m$ with $\hat{y}_i \in \{-1,+1\}$. The weak requirement: $\frac{\sum_{i=1}^m y_i \hat{y}_i w_i}{\sum_{i=1}^m w_i} > \gamma > 0$.

The boosting process. Start from the uniformly weighted set $(x_1, y_1, 1), (x_2, y_2, 1), \ldots, (x_n, y_n, 1)$ and obtain $h_1$ from the weak learner; reweight to $(x_1, y_1, w_1^1), \ldots, (x_n, y_n, w_n^1)$ and obtain $h_2$; continue reweighting and calling the weak learner until round $T$, where the weights $(x_1, y_1, w_1^{T-1}), \ldots, (x_n, y_n, w_n^{T-1})$ yield $h_T$. The final prediction is $\mathrm{Prediction}(x) = \mathrm{sign}(F_T(x))$.

A Formal Description of Boosting [Slides from Rob Schapire]. Given a training set $(x_1, y_1), \ldots, (x_m, y_m)$, where $y_i \in \{-1,+1\}$ is the correct label of instance $x_i \in X$. For $t = 1, \ldots, T$: construct a distribution $D_t$ on $\{1, \ldots, m\}$; find a weak classifier ("rule of thumb") $h_t : X \to \{-1,+1\}$ with small error $\epsilon_t$ on $D_t$, where $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$. Output the final classifier $H_{\mathrm{final}}$.

AdaBoost [Slides from Rob Schapire] [with Freund]. Constructing $D_t$: $D_1(i) = 1/m$; given $D_t$ and $h_t$,
$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \end{cases} = \frac{D_t(i)}{Z_t} \exp(-\alpha_t\, y_i\, h_t(x_i)),$
where $Z_t$ is a normalization factor and $\alpha_t = \frac{1}{2} \ln\!\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) > 0$. Final classifier: $H_{\mathrm{final}}(x) = \mathrm{sign}\!\left(\sum_t \alpha_t h_t(x)\right)$.
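To make the update concrete, here is a minimal sketch of the algorithm above in Python/numpy, using exhaustive-search decision stumps as the weak learner; the function names and the early-stopping heuristic are illustrative choices, not part of the original talk.

```python
import numpy as np

def stump_learner(X, y, D):
    """Weak learner: search axis-parallel threshold stumps for the one with
    the smallest weighted error under the distribution D.
    X: (m, d) array of feature vectors; y: array of +/-1 labels."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] > thresh, sign, -sign)
                err = D[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, thresh, sign)
    j, thresh, sign = best
    return (lambda Z: np.where(Z[:, j] > thresh, sign, -sign)), best_err

def adaboost(X, y, T=10):
    """AdaBoost: D_1(i) = 1/m, reweight by exp(-alpha_t * y_i * h_t(x_i)),
    and combine the rules with alpha_t = 0.5 * ln((1 - eps_t) / eps_t)."""
    m = len(y)
    D = np.full(m, 1.0 / m)
    rules, alphas = [], []
    for t in range(T):
        h, eps = stump_learner(X, y, D)
        if eps >= 0.5:                      # no edge left: stop early
            break
        eps = max(eps, 1e-12)               # guard against a perfect stump
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        D *= np.exp(-alpha * y * h(X))      # up-weight mistakes, down-weight hits
        D /= D.sum()                        # Z_t: renormalize to a distribution
        rules.append(h)
        alphas.append(alpha)
    # H_final(x) = sign(sum_t alpha_t * h_t(x))
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, rules)))
```

Calling `H = adaboost(X, y, T=20)` and then `H(X_test)` returns ±1 predictions from the weighted vote of the stumps.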

Toy Example [Slides from Rob Schapire]. Initial distribution $D_1$; the weak classifiers are vertical or horizontal half-planes.

Round 1 [Slides from Rob Schapire]: weak rule $h_1$, updated distribution $D_2$; $\epsilon_1 = 0.30$, $\alpha_1 = 0.42$.

Round 2 [Slides from Rob Schapire]: weak rule $h_2$, updated distribution $D_3$; $\epsilon_2 = 0.21$, $\alpha_2 = 0.65$.

Round 3 [Slides from Rob Schapire]: weak rule $h_3$; $\epsilon_3 = 0.14$, $\alpha_3 = 0.92$.

Final Classifier [Slides from Rob Schapire]: $H_{\mathrm{final}} = \mathrm{sign}(0.42\, h_1 + 0.65\, h_2 + 0.92\, h_3)$.

Analyzing the Training Error [Slides from Rob Schapire] [with Freund]. Theorem: write $\epsilon_t$ as $\frac{1}{2} - \gamma_t$ ($\gamma_t$ = "edge"); then
$\mathrm{training\ error}(H_{\mathrm{final}}) \le \prod_t \left[ 2\sqrt{\epsilon_t (1 - \epsilon_t)} \right] = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\!\left(-2 \sum_t \gamma_t^2\right).$
So if $\gamma_t \ge \gamma > 0$ for all $t$, then $\mathrm{training\ error}(H_{\mathrm{final}}) \le e^{-2\gamma^2 T}$. AdaBoost is adaptive: it does not need to know $\gamma$ or $T$ a priori, and it can exploit $\gamma_t \gg \gamma$.
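As a quick sanity check (my own arithmetic, not from the slides), plugging the errors from the toy example above into the first bound gives
$\prod_{t=1}^{3} 2\sqrt{\epsilon_t (1 - \epsilon_t)} = 2\sqrt{0.30 \cdot 0.70} \cdot 2\sqrt{0.21 \cdot 0.79} \cdot 2\sqrt{0.14 \cdot 0.86} \approx 0.92 \cdot 0.81 \cdot 0.69 \approx 0.52,$
so after only three rounds the training error of $H_{\mathrm{final}}$ is already guaranteed to be below 52%, and the bound keeps shrinking geometrically with every further round that has a positive edge.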

Boosting block diagram. [diagram: the booster sends example weights to the weak learner; the weak learner returns a weak rule; the booster combines the weak rules into an accurate rule, i.e., a strong learner]

Boosting with specialists. Specialists predict only when they are confident: in addition to {-1,+1}, specialists use 0 to indicate "no prediction". Since boosting allows both positive and negative weights, we restrict attention to specialists that output {0,1}.

AdaBoost as variational optimization. Weak rule: $h_t : X \to \{0,1\}$; label: $y \in \{-1,+1\}$; training set: $\{(x_1, y_1, 1), (x_2, y_2, 1), \ldots, (x_n, y_n, 1)\}$. The coefficient is $\alpha_t = \frac{1}{2} \ln \frac{\varepsilon_t^{+}}{\varepsilon_t^{-}}$, where $\varepsilon_t^{+} = \sum_{i:\, h_t(x_i)=1,\, y_i=+1} w_i^t$ and $\varepsilon_t^{-} = \sum_{i:\, h_t(x_i)=1,\, y_i=-1} w_i^t$, and the strong rule is updated as $F_t = F_{t-1} + \alpha_t h_t$.

AdaBoost as variational optimization. $x$ = input (a scalar), $y$ = output ($-1$ or $+1$); $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ = training set. $F_{t-1}$ = strong rule after $t-1$ boosting iterations; $h_t$ = weak rule produced at iteration $t$. [figure: labeled examples along the $x$ axis, with the current strong rule $F_{t-1}$ and the candidate weak rule $h_t$]

AdaBoost as variational optimization. $F_{t-1}$ = strong rule after $t-1$ boosting iterations; $h_t$ = weak rule produced at iteration $t$. [figure: the combination $F_{t-1}(x) + 0 \cdot h_t(x)$, i.e., the weak rule added with coefficient 0]

AdaBoost as variational optimization. $F_{t-1}$ = strong rule after $t-1$ boosting iterations; $h_t$ = weak rule produced at iteration $t$. [figure: the combination $F_{t-1}(x) + 0.4\, h_t(x)$]

AdaBoost as variational optimization. $F_{t-1}$ = strong rule after $t-1$ boosting iterations; $h_t$ = weak rule produced at iteration $t$. [figure: the chosen combination $F_t(x) = F_{t-1}(x) + 0.8\, h_t(x)$]

Margins. Fix a set of weak rules, $\vec{h}(x) = (h_1(x), h_2(x), \ldots, h_T(x))$, and represent the $i$-th example $x_i$ by the outputs of the weak rules, $\vec{h}_i$. Labels: $y \in \{-1,+1\}$; training set: $(\vec{h}_1, y_1), (\vec{h}_2, y_2), \ldots, (\vec{h}_n, y_n)$. Goal: find a weight vector $\vec{\alpha} = (\alpha_1, \ldots, \alpha_T)$ that minimizes the number of training mistakes. The margin of example $i$ is $y_i\, \vec{h}_i \cdot \vec{\alpha}$; a mistake corresponds to a negative margin and a correct prediction to a positive margin. [figure: examples projected onto the direction $\vec{\alpha}$, with the cumulative number of examples plotted as a function of their margin]

Boosting as gradient descent. Our goal is to minimize the number of mistakes (the 0-1 loss). Unfortunately, the step function has derivative zero everywhere except at zero, where the derivative is undefined. AdaBoost therefore minimizes an upper bound on the 0-1 loss, the exponential loss $e^{-\mathrm{margin}}$, and finds the vector $\vec{\alpha}$ through coordinate-wise gradient descent: weak rules are added one at a time. [figure: the 0-1 loss and the exponential loss as functions of the margin]

AdaBoost as gradient descent. Weak rules are added one at a time: fixing $\alpha_1, \ldots, \alpha_{t-1}$, find the coefficient $\alpha$ for the new rule $h_t$. The weak rule determines how each example moves, i.e., whether its margin increases or decreases. The new margin of $(x_i, y_i)$ is $y_i F_{t-1}(x_i) + \alpha\, y_i h_t(x_i)$, and the new total loss is $\sum_{i=1}^n \exp\!\left(-y_i \left(F_{t-1}(x_i) + \alpha h_t(x_i)\right)\right)$. The derivative of the total loss with respect to $\alpha$ at $\alpha = 0$ is
$\frac{\partial}{\partial \alpha}\Big|_{\alpha=0} \sum_{i=1}^n \exp\!\left(-y_i \left(F_{t-1}(x_i) + \alpha h_t(x_i)\right)\right) = -\sum_{i=1}^n y_i h_t(x_i) \underbrace{\exp\!\left(-y_i F_{t-1}(x_i)\right)}_{\text{weight of example } (x_i, y_i)}.$
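A small numpy sketch of the quantities on this slide, using made-up scores and labels: the example weights $\exp(-y_i F_{t-1}(x_i))$, the derivative of the exponential loss at $\alpha = 0$, and the closed-form coefficient that the coordinate-wise step selects for a $\pm 1$ weak rule (which recovers the familiar AdaBoost formula).

```python
import numpy as np

# Illustrative values: scores of the strong rule after t-1 rounds,
# the candidate weak rule's outputs (+/-1), and the true labels.
F_prev = np.array([ 0.8, -0.3,  0.1, -1.2,  0.4])
h_vals = np.array([ 1.0,  1.0, -1.0, -1.0,  1.0])
y      = np.array([ 1.0, -1.0, -1.0, -1.0,  1.0])

# Example weights exp(-y * F_{t-1}(x)): mistakes of the current strong rule
# (negative margins) get exponentially large weights.
w = np.exp(-y * F_prev)

# Derivative of the total exponential loss w.r.t. alpha at alpha = 0.
grad_at_zero = -np.sum(y * h_vals * w)

# For a +/-1 weak rule the 1-D minimization over alpha has a closed form:
# alpha = 0.5 * ln(W_correct / W_wrong), the familiar AdaBoost coefficient.
W_correct = w[y * h_vals > 0].sum()
W_wrong   = w[y * h_vals < 0].sum()
alpha = 0.5 * np.log(W_correct / W_wrong)
print(grad_at_zero, alpha)
```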

LogitBoost as gradient descent (also called Gentle-Boost / LogitBoost; Hastie, Friedman & Tibshirani). Instead of the exponential loss $e^{-m}$ of the margin $m$, define the loss to be $\log(1 + e^{-m})$. When the margin $m \gg 0$, $\log(1 + e^{-m}) \approx e^{-m}$; when $m \ll 0$, $\log(1 + e^{-m}) \approx -m$. The new margin of $(x_i, y_i)$ is $y_i F_{t-1}(x_i) + \alpha\, y_i h_t(x_i)$, and the new total loss is $\sum_{i=1}^n \log\!\left(1 + \exp\!\left(-y_i \left(F_{t-1}(x_i) + \alpha h_t(x_i)\right)\right)\right)$. The derivative of the total loss with respect to $\alpha$ at $\alpha = 0$ is
$\frac{\partial}{\partial \alpha}\Big|_{\alpha=0} \sum_{i=1}^n \log\!\left(1 + \exp\!\left(-y_i \left(F_{t-1}(x_i) + \alpha h_t(x_i)\right)\right)\right) = -\sum_{i=1}^n y_i h_t(x_i) \underbrace{\frac{\exp\!\left(-y_i F_{t-1}(x_i)\right)}{1 + \exp\!\left(-y_i F_{t-1}(x_i)\right)}}_{\text{weight of example } (x_i, y_i)}.$
[figure: the LogitBoost loss and weight compared with the AdaBoost loss and weight as functions of the margin]
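To see numerically why the logistic loss is gentler on badly misclassified examples, here is a tiny comparison (the margin values are illustrative) of the AdaBoost weight $e^{-m}$ with the LogitBoost weight $e^{-m}/(1 + e^{-m})$, which never exceeds 1.

```python
import numpy as np

margins = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])   # y * F(x), illustrative values

ada_w   = np.exp(-margins)                               # exponential-loss weight, unbounded
logit_w = np.exp(-margins) / (1.0 + np.exp(-margins))    # logistic-loss weight, at most 1

for m, a, l in zip(margins, ada_w, logit_w):
    print(f"margin {m:+.1f}: AdaBoost weight {a:8.3f}, LogitBoost weight {l:.3f}")
```

At margin -4 the AdaBoost weight is about 55 while the LogitBoost weight is about 0.98, which is exactly the behavior discussed on the next slide.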

Noise resistance. AdaBoost performs well when the achievable error rate is close to zero (the almost-consistent case); errors, i.e., examples with negative margins, get very large weights, and AdaBoost can overfit. LogitBoost is inferior to AdaBoost when the achievable error rate is close to zero, but is often better than AdaBoost when the achievable error rate is high: the weight on any example is never larger than 1.

Gradient-Boost / AnyBoost. A general recipe for learning by incremental optimization that applies to any differentiable loss function. A labeled example is $(x, y)$, where $y$ can be anything. The prediction has the form $F_t(x) = \sum_{i=1}^{t} \alpha_i h_i(x)$. The loss is $L(F_t(x), y)$, and the total loss so far is $\sum_{j=1}^{n} L(F_{t-1}(x_j), y_j)$. The weight of example $(x, y)$ is $\frac{\partial}{\partial z} L(F_{t-1}(x) + z,\, y)\big|_{z=0}$. At each iteration: compute the example weights, call the weak learner with the weighted examples, and add the generated weak rule to create the new rule $F_t = F_{t-1} + \alpha_t h_t$.
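A minimal sketch of this recipe, assuming the squared loss and a one-split regression stump as stand-ins for the loss and the weak learner (the names, the learning rate, and the fixed number of rounds are my own illustrative choices). Following the usual gradient-boosting instantiation of the recipe, the weak rule is fit directly to the per-example negative gradients.

```python
import numpy as np

def squared_loss_grad(F, y):
    """d/dz L(F + z, y) at z = 0 for the squared loss L(F, y) = (F - y)^2 / 2."""
    return F - y

def fit_stump(X, target):
    """Weak learner: one-feature threshold stump fit to the given targets."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            mask = X[:, j] > thr
            if mask.all() or (~mask).all():
                continue                      # skip degenerate splits
            left, right = target[~mask].mean(), target[mask].mean()
            pred = np.where(mask, right, left)
            sse = ((target - pred) ** 2).sum()
            if sse < best[0]:
                best = (sse, (j, thr, left, right))
    j, thr, left, right = best[1]
    return lambda Z: np.where(Z[:, j] > thr, right, left)

def gradient_boost(X, y, T=50, lr=0.1, loss_grad=squared_loss_grad):
    """At each iteration: compute per-example negative gradients of the loss,
    fit a weak rule to them, and add it: F_t = F_{t-1} + lr * h_t."""
    F = np.zeros(len(y))
    rules = []
    for t in range(T):
        negative_gradients = -loss_grad(F, y)    # the targets for the weak rule
        h = fit_stump(X, negative_gradients)
        F = F + lr * h(X)
        rules.append(h)
    return lambda Xnew: sum(lr * h(Xnew) for h in rules)
```

Swapping `squared_loss_grad` for the gradient of any other differentiable loss (exponential, logistic, ...) changes the example weights but leaves the loop untouched, which is the point of the recipe.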

Styles of weak learners. Simple (stumps), ranging from generalist rules (a single split, e.g., on Season: Winter → $-\alpha$, Summer → $+\alpha$) to specialist rules that fire only for particular feature values; complex rules (fully grown trees, e.g., over Season and Location values such as Prison, Beach, Ski Slope); and everything in between (neural networks, nearest neighbors, Naive Bayes, ...). [figure: example stumps and a fully grown tree, each leaf contributing a signed weight $\alpha$]

Alternating Decision Trees Joint work with Llew Mason

Decision Trees. [figure: a small decision tree that first tests X > 3 and then Y > 5, with $\pm 1$ predictions at the leaves, shown next to the corresponding axis-parallel partition of the (X, Y) plane at X = 3 and Y = 5]

Decision tree as a sum. The same tree can be rewritten as a sum of real-valued contributions: a score at the root and a score on each branch (the values shown are -0.2, +0.2, -0.1, +0.1, -0.3, +0.2), with the prediction being the sign of the sum of the contributions along the path taken. [figure: the tree redrawn with per-branch scores and the resulting partition of the (X, Y) plane]

An alternating decision tree. Prediction nodes carrying real-valued scores alternate with decision nodes; every decision node whose precondition holds contributes, and the prediction is the sign of the total score. In this example the tree tests X > 3, Y > 5, and Y < 1, with contributions such as -0.2, +0.2, -0.1, +0.1, -0.3, 0.0, and +0.7. [figure: the alternating decision tree and the resulting scoring of the (X, Y) plane]
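A short sketch of how a prediction is read off an alternating decision tree: every rule whose precondition holds contributes one of its two scores, and the prediction is the sign of the total. The rules, scores, and thresholds below are illustrative stand-ins loosely modeled on the slide, not the exact learned tree.

```python
# Each rule: (precondition, condition, score_if_true, score_if_false).
# Conditions are predicates over a feature dict; the root contributes a base score.
base_score = -0.2   # illustrative

rules = [
    (lambda p: True,       lambda p: p["X"] > 3, +0.2, -0.1),   # illustrative scores
    (lambda p: p["X"] > 3, lambda p: p["Y"] > 5, -0.3, +0.1),   # nested under X > 3
    (lambda p: True,       lambda p: p["Y"] < 1, +0.7,  0.0),   # illustrative scores
]

def adt_score(point):
    """Sum the contributions of all rules whose precondition holds."""
    total = base_score
    for precond, cond, s_true, s_false in rules:
        if precond(point):
            total += s_true if cond(point) else s_false
    return total

def adt_predict(point):
    return 1 if adt_score(point) > 0 else -1

print(adt_predict({"X": 4.0, "Y": 0.5}))   # sign of the summed scores
```

Unlike an ordinary decision tree, several branches can contribute at once, which is what lets boosting grow the tree one weak rule at a time.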

Example: Medical Diagnostics. Cleve dataset from the UC Irvine database. Heart disease diagnostics (+1 = healthy, -1 = sick); 13 features from tests (real-valued and discrete); 303 instances.

ADtree for the Cleveland heart-disease diagnostics problem. [figure: the learned alternating decision tree]

Cross-validated accuracy:
Learning algorithm   Number of splits   Average test error   Test error variance
ADtree               6                  17.0%                0.6%
C5.0                 27                 27.2%                0.5%
C5.0 + boosting      446                20.2%                0.5%
Boost Stumps         16                 16.5%                0.8%

Applications of Boosting

Applications of Boosting: academic research, applied research, commercial deployment.

Academic research: % test error rates
Database    Other              Boosting   Error reduction
Cleveland   27.2 (DT)          16.5       39%
Promoters   22.0 (DT)          11.8       46%
Letter      13.8 (DT)          3.5        74%
Reuters 4   5.8, 6.0, 9.8      2.95       ~60%
Reuters 8   11.3, 12.1, 13.4   7.4        ~40%

Applied research [Schapire, Singer, Gorin 98]. AT&T's "How may I help you?": classify voice requests (voice → text → category) into fourteen categories, including area code, AT&T service, billing credit, calling card, collect, competitor, dial assistance, directory, how to dial, person to person, rate, third party, and time charge.

Example transcribed sentences:
"Yes I'd like to place a collect call long distance please" → collect
"Operator I need to make a call but I need to bill it to my office" → third party
"Yes I'd like to place a call on my master card please" → calling card
"I just called a number in Sioux City and I musta rang the wrong number because I got the wrong party and I would like to have that taken off my bill" → billing credit

Weak rules generated by BoosTexter. [table: for categories such as calling card, collect call, and third party, each weak rule assigns scores depending on whether a particular word occurs or does not occur in the sentence]

Results: 7844 training examples (hand transcribed); 1000 test examples (hand / machine transcribed). Accuracy with 20% rejected: machine transcribed 75%, hand transcribed 90%.

Commercial deployment [Freund, Mason, Rogers, Pregibon, Cortes 2000]: distinguish business from residence customers using statistics from call-detail records. Alternating decision trees combine very simple rules; they can over-fit, so cross-validation is used to decide when to stop.

Massive datasets (for 1997): 260M calls/day, 230M telephone numbers, with the label unknown for ~30%. Hancock: software for computing statistical signatures. 100K randomly selected training examples (~10K is enough); training takes about 2 hours. The generated classifier has to be both accurate and efficient.

Alternating tree for buizocity

Alternating Tree (Detail)

Precision/recall graphs. [figure: accuracy as a function of the classifier score]

Face Detection Viola and Jones / 2001

Face Detection (Viola and Jones). Live demo at the 2001 conference; higher accuracy than the manually designed state of the art; real-time on a 2001 laptop. Many uses: a standard feature in cameras, user interfaces, security systems, video compression, image database analysis.

Face Detection as a filtering process. The detector is scanned over roughly 50,000 locations and scales per image, from the smallest scale up to larger scales, and most sub-windows are negative. [figure: the scanned locations and scales]

The classifier is learned from labeled data. Example: a 28x28 image patch; label: face / non-face. 5000 faces and $10^8$ non-faces. Faces are normalized for scale and translation; rotation remains.

Image Features: rectangle filters, similar to Haar wavelets (Papageorgiou et al.). Each filter value $f_t(x_i)$ defines a binary feature
$h_t(x_i) = \begin{cases} 1 & \text{if } f_t(x_i) > \theta_t \\ 0 & \text{otherwise,} \end{cases}$
which is very fast to compute using the integral image. The binary features are combined using AdaBoost.
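A short sketch of the integral-image trick that makes rectangle features cheap: after one cumulative-sum pass, the sum of any rectangle is obtained from four lookups, so a two-rectangle (Haar-like) feature costs a handful of array reads regardless of its size. The particular feature geometry and threshold below are illustrative.

```python
import numpy as np

def integral_image(img):
    """ii[r, c] = sum of img[:r, :c]; padded with a leading row/column of zeros."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, r, c, h, w):
    """Sum of the h x w rectangle with top-left corner (r, c): four lookups."""
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

def two_rect_feature(ii, r, c, h, w):
    """A Haar-like feature: left half minus right half of a 2w-wide rectangle."""
    return rect_sum(ii, r, c, h, w) - rect_sum(ii, r, c + w, h, w)

patch = np.random.rand(28, 28)            # a 28x28 image patch, as in the slide
ii = integral_image(patch)
f = two_rect_feature(ii, r=4, c=4, h=12, w=6)

theta_t = 0.5                             # illustrative threshold
h_t = 1 if f > theta_t else 0             # the binary weak rule from the slide
```

Because the four lookups are independent of the rectangle's size, every candidate feature costs the same constant time, which is what makes the exhaustive search on the next slide feasible.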

Cascaded boosting. Features are combined using AdaBoost, with the best weak rule found by exhaustive search; training ran for 2 days on a 250-node cluster. For detection, the features are combined in a cascade: each image sub-window first passes a 1-feature classifier (which lets through about 50% of sub-windows), then a 5-feature classifier (about 20%), then a 20-feature classifier (about 2%); a sub-window rejected at any stage is declared NON-FACE, and only sub-windows that pass every stage are declared FACE. The detector runs in real time, 15 FPS, on a laptop.
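A sketch of the cascade logic: each stage is a boosted sum of binary features compared to a threshold, and a sub-window is rejected as soon as any stage scores it below threshold, so almost all sub-windows are discarded by the cheap early stages. The stage sizes follow the slide's 1/5/20-feature diagram, but the dummy features and thresholds are placeholders standing in for rules learned by AdaBoost.

```python
import numpy as np

def make_stage(rules, alphas, threshold):
    """One cascade stage: a boosted sum of binary features compared to a threshold."""
    def stage(window):
        score = sum(a * h(window) for a, h in zip(alphas, rules))
        return score >= threshold
    return stage

def cascade_detect(window, stages):
    """Run stages in order of increasing cost; reject at the first failed stage."""
    for stage in stages:
        if not stage(window):
            return False          # NON-FACE: rejected early and cheaply
    return True                   # passed every stage: FACE

# Placeholder binary features standing in for learned rectangle-feature rules.
def dummy_feature(threshold):
    return lambda window: 1 if window.mean() > threshold else 0

stages = [
    make_stage([dummy_feature(0.3)], [1.0], threshold=0.5),
    make_stage([dummy_feature(t) for t in np.linspace(0.2, 0.6, 5)],
               [1.0] * 5, threshold=2.5),
    make_stage([dummy_feature(t) for t in np.linspace(0.1, 0.7, 20)],
               [1.0] * 20, threshold=10.0),
]

print(cascade_detect(np.random.rand(28, 28), stages))
```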

Paul Viola and Mike Jones

Summary. Boosting is a computational method for learning accurate classifiers. Its resistance to over-fitting is explained by margins. Boosting has been applied successfully to a variety of classification problems.