
COMS 4721: Machine Learning for Data Science Lecture 13, 3/2/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University

BOOSTING Robert E. Schapire and Yoav Freund, Boosting: Foundations and Algorithms, MIT Press, 2012. See this textbook for many more details. (I borrow some figures from that book.)

BAGGING CLASSIFIERS

Algorithm: Bagging binary classifiers
Given (x_1, y_1), ..., (x_n, y_n), with x \in \mathcal{X} and y \in \{-1, +1\}.
For b = 1, ..., B:
  1. Sample a bootstrap dataset B_b of size n: select each entry of B_b by drawing (x_i, y_i) with probability 1/n. Some (x_i, y_i) will repeat and some won't appear in B_b.
  2. Learn a classifier f_b using the data in B_b.
Define the classification rule to be

  f_boost... f_bag(x_0) = sign( \sum_{b=1}^{B} f_b(x_0) ).

With bagging, we observe that a committee of classifiers votes on a label. Each classifier is learned on a bootstrap sample from the data set. Learning a collection of classifiers is referred to as an ensemble method.
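As an illustration (not from the lecture), here is a minimal Python sketch of this bagging procedure; it assumes scikit-learn's DecisionTreeClassifier as the base learner and labels in {-1, +1}.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, B=25, seed=0):
    """Train B classifiers, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    classifiers = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # n draws with replacement, probability 1/n each
        f_b = DecisionTreeClassifier().fit(X[idx], y[idx])
        classifiers.append(f_b)
    return classifiers

def bagging_predict(classifiers, X0):
    """f_bag(x0) = sign( sum_b f_b(x0) ), with labels in {-1, +1}; a tie maps to 0."""
    votes = np.sum([f_b.predict(X0) for f_b in classifiers], axis=0)
    return np.sign(votes)
```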

BOOSTING

"How is it that a committee of blockheads can somehow arrive at highly reasoned decisions, despite the weak judgment of the individual members?" (Schapire & Freund, Boosting: Foundations and Algorithms)

Boosting is another powerful method for ensemble learning. It is similar to bagging in that a set of classifiers is combined to make a better one. It works for any classifier, but a weak one that is easy to learn is usually chosen. (Weak = accuracy a little better than random guessing.)

Short history
1984: Leslie Valiant and Michael Kearns ask if boosting is possible.
1989: Robert Schapire creates the first boosting algorithm.
1990: Yoav Freund creates an optimal boosting algorithm.
1995: Freund and Schapire create AdaBoost (Adaptive Boosting), the major boosting algorithm.

BAGGING VS BOOSTING (OVERVIEW)

[Figure: bagging learns f_1(x), f_2(x), f_3(x) from independent bootstrap samples of the training sample; boosting learns f_1(x), f_2(x), f_3(x) sequentially from weighted versions of the training sample.]

THE ADABOOST ALGORITHM (SAMPLING VERSION)

[Figure: starting from the training sample, boosting repeatedly draws a weighted bootstrap sample B_t, classifies it to obtain a weighted error \epsilon_t, and produces a weight \alpha_t and classifier f_t(x) for t = 1, 2, 3, ...]

  f_boost(x_0) = sign( \sum_{t=1}^{T} \alpha_t f_t(x_0) )

THE ADABOOST ALGORITHM (SAMPLING VERSION)

Algorithm: Boosting a binary classifier
Given (x_1, y_1), ..., (x_n, y_n), with x \in \mathcal{X} and y \in \{-1, +1\}, set w_1(i) = 1/n for i = 1, ..., n.
For t = 1, ..., T:
  1. Sample a bootstrap dataset B_t of size n according to the distribution w_t. Notice we pick (x_i, y_i) with probability w_t(i) and not 1/n.
  2. Learn a classifier f_t using the data in B_t.
  3. Set \epsilon_t = \sum_{i=1}^{n} w_t(i) 1\{y_i \neq f_t(x_i)\} and \alpha_t = \frac{1}{2} \ln\big( \frac{1 - \epsilon_t}{\epsilon_t} \big).
  4. Scale \hat{w}_{t+1}(i) = w_t(i) e^{-\alpha_t y_i f_t(x_i)} and set w_{t+1}(i) = \hat{w}_{t+1}(i) / \sum_j \hat{w}_{t+1}(j).
Set the classification rule to be f_boost(x_0) = sign( \sum_{t=1}^{T} \alpha_t f_t(x_0) ).

Comment: The description is usually simplified to "learn classifier f_t using distribution w_t."
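A minimal Python sketch of this sampling version (my own illustration, not the lecture's code); it assumes a depth-1 DecisionTreeClassifier as the weak learner f_t and labels y_i in {-1, +1}.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50, seed=0):
    """Sampling version of AdaBoost with decision stumps as the weak learners."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                    # w_1(i) = 1/n
    alphas, stumps = [], []
    for t in range(T):
        # 1. Bootstrap sample B_t of size n according to distribution w_t.
        idx = rng.choice(n, size=n, replace=True, p=w)
        # 2. Learn a weak classifier f_t on B_t.
        f_t = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        pred = f_t.predict(X)
        # 3. Weighted error eps_t and classifier weight alpha_t.
        eps = np.sum(w * (pred != y))
        eps = np.clip(eps, 1e-12, 1 - 1e-12)   # guard against log(0) / division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)
        # 4. Reweight and normalize.
        w = w * np.exp(-alpha * y * pred)
        w = w / np.sum(w)
        alphas.append(alpha)
        stumps.append(f_t)
    return alphas, stumps

def adaboost_predict(alphas, stumps, X0):
    """f_boost(x0) = sign( sum_t alpha_t f_t(x0) )."""
    score = sum(a * f.predict(X0) for a, f in zip(alphas, stumps))
    return np.sign(score)
```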

BOOSTING A DECISION STUMP (EXAMPLE 1)

Original data with the uniform distribution w_1. Learn a weak classifier; here, use a decision stump that thresholds a single coordinate (e.g., x_1 > 1.7), predicting \hat{y} = +1 on one side and \hat{y} = -1 on the other.

BOOSTING A DECISION STUMP (EXAMPLE 1) Round 1 classifier. Weighted error: \epsilon_1 = 0.3. Weight update: \alpha_1 = 0.42.
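As a quick check of the update formula from the algorithm (a worked calculation, not on the original slide), the round-1 weight follows directly from \epsilon_1 = 0.3:

  \alpha_1 = \tfrac{1}{2} \ln\big( \tfrac{1 - \epsilon_1}{\epsilon_1} \big)
           = \tfrac{1}{2} \ln\big( \tfrac{0.7}{0.3} \big)
           \approx \tfrac{1}{2} (0.847)
           \approx 0.42.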

BOOSTING A DECISION STUMP (EXAMPLE 1) Weighted data After round 1

BOOSTING A DECISION STUMP (EXAMPLE 1) Round 2 classifier. Weighted error: \epsilon_2 = 0.21. Weight update: \alpha_2 = 0.65.

BOOSTING A DECISION STUMP (EXAMPLE 1) Weighted data After round 2

BOOSTING A DECISION STUMP (EXAMPLE 1) Round 3 classifier. Weighted error: \epsilon_3 = 0.14. Weight update: \alpha_3 = 0.92.

BOOSTING A DECISION STUMP (EXAMPLE 1) Classifier after three rounds:

  f_boost(x) = sign( 0.42 f_1(x) + 0.65 f_2(x) + 0.92 f_3(x) ).

BOOSTING A DECISION STUMP (EXAMPLE 2)

Example problem error rates:
  Random guessing: 50% error
  Decision stump: 45.8% error
  Full decision tree: 24.7% error
  Boosted stump: 5.8% error

BOOSTING Each point is one dataset; its location gives the error rate with and without boosting. The boosted version of the same classifier almost always produces better results.

BOOSTING (left) Boosting a bad classifier is often better than not boosting a good one. (right) Boosting a good classifier is often better, but can take more time.

BOOSTING AND FEATURE MAPS

Q: What makes boosting work so well?
A: This is a well-studied question. We will present one analysis later, but we can also give intuition by tying it in with what we've already learned.

The classification for a new x_0 from boosting is

  f_boost(x_0) = sign( \sum_{t=1}^{T} \alpha_t f_t(x_0) ).

Define \phi(x) = [f_1(x), ..., f_T(x)], where each f_t(x) \in \{-1, +1\}. We can think of \phi(x) as a high-dimensional feature map of x. The vector \alpha = [\alpha_1, ..., \alpha_T] corresponds to a hyperplane. So the classifier can be written

  f_boost(x_0) = sign( \phi(x_0)^T \alpha ).

Boosting learns the feature mapping and the hyperplane simultaneously.
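To make the feature-map view concrete, here is a small sketch (my own, reusing the hypothetical `alphas` and `stumps` returned by the AdaBoost code above): applying the weak learners as a feature map and taking the sign of the inner product with \alpha reproduces f_boost.

```python
import numpy as np

def phi(stumps, X0):
    """Feature map phi(x) = [f_1(x), ..., f_T(x)], each entry in {-1, +1}."""
    return np.column_stack([f_t.predict(X0) for f_t in stumps])

def boost_as_linear_classifier(alphas, stumps, X0):
    """f_boost(x0) = sign( phi(x0)^T alpha ): a linear classifier in the learned feature space."""
    return np.sign(phi(stumps, X0) @ np.asarray(alphas))
```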

APPLICATION: FACE DETECTION

FACE DETECTION (VIOLA & JONES, 2001)

Problem: Locate the faces in an image or video.

Processing: Divide the image into patches of different scales, e.g., 24 × 24, 48 × 48, etc. Extract features from each patch. Classify each patch as face or no face using a boosted decision stump. This can be done in real time, for example by your digital camera (at 15 fps).

One patch from a larger image is masked with many feature extractors. Each pattern gives one number: the sum of all pixels in the black region minus the sum of pixels in the white region (a total of 45,000 features).
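Rectangle features like these can be computed quickly with an integral image (summed-area table). A minimal sketch, assuming a grayscale patch stored as a 2-D NumPy array; the specific two-rectangle pattern below is just one illustrative feature, not the exact Viola-Jones feature set.

```python
import numpy as np

def integral_image(patch):
    """Summed-area table with a zero row/column prepended, so ii[r, c] = sum of patch[:r, :c]."""
    ii = np.cumsum(np.cumsum(patch, axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of pixels in patch[r0:r1, c0:c1] using four lookups in the integral image."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

def two_rectangle_feature(patch, r0, c0, r1, c1):
    """One Haar-like feature: pixel sum over the left ('black') half minus the right ('white') half."""
    ii = integral_image(patch)
    cm = (c0 + c1) // 2
    return rect_sum(ii, r0, c0, r1, cm) - rect_sum(ii, r0, cm, r1, c1)
```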

FACE DETECTION (EXAMPLE RESULTS)

ANALYSIS OF BOOSTING

ANALYSIS OF BOOSTING

Training error theorem
We can use analysis to make a statement about the accuracy of boosting on the training data.

Theorem: Under the AdaBoost framework, if \epsilon_t is the weighted error of classifier f_t, then for the classifier f_boost(x_0) = sign( \sum_{t=1}^{T} \alpha_t f_t(x_0) ),

  training error = \frac{1}{n} \sum_{i=1}^{n} 1\{y_i \neq f_boost(x_i)\} \leq \exp\big( -2 \sum_{t=1}^{T} (\tfrac{1}{2} - \epsilon_t)^2 \big).

Even if each \epsilon_t is only a little better than random guessing, the sum over T classifiers can lead to a large negative value in the exponent when T is large. For example, if we set \epsilon_t = 0.45 and T = 1000, the training error bound is approximately 0.0067.
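Plugging the numbers from the example into the bound (a worked check, not on the original slide):

  \exp\big( -2 \sum_{t=1}^{1000} (\tfrac{1}{2} - 0.45)^2 \big)
  = \exp\big( -2 \cdot 1000 \cdot (0.05)^2 \big)
  = e^{-5} \approx 0.0067.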

PROOF OF THEOREM

Setup
We break the proof into three steps. It is an application of the fact that if a \leq b (Step 2) and b \leq c (Step 3), then a \leq c (conclusion). Step 1 calculates the value of b. Steps 2 and 3 prove the two inequalities.

Also recall the following step from AdaBoost:
  Update: \hat{w}_{t+1}(i) = w_t(i) e^{-\alpha_t y_i f_t(x_i)}.
  Normalize: w_{t+1}(i) = \hat{w}_{t+1}(i) / \sum_j \hat{w}_{t+1}(j).
  Define Z_t = \sum_j \hat{w}_{t+1}(j).

PROOF OF THEOREM (a \leq b \leq c)

Step 1
We first want to expand the equation of the weights to show that

  w_{T+1}(i) = \frac{1}{n} \frac{e^{-y_i \sum_{t=1}^{T} \alpha_t f_t(x_i)}}{\prod_{t=1}^{T} Z_t}
             = \frac{1}{n} \frac{e^{-y_i h_T(x_i)}}{\prod_{t=1}^{T} Z_t},
  where h_T(x) := \sum_{t=1}^{T} \alpha_t f_t(x).

Derivation of Step 1: Notice the update rule

  w_{t+1}(i) = \frac{1}{Z_t} w_t(i) e^{-\alpha_t y_i f_t(x_i)}.

Do the same expansion for w_t(i) and continue until reaching w_1(i) = 1/n:

  w_{T+1}(i) = w_1(i) \cdot \frac{e^{-\alpha_1 y_i f_1(x_i)}}{Z_1} \cdots \frac{e^{-\alpha_T y_i f_T(x_i)}}{Z_T}.

The product \prod_{t=1}^{T} Z_t is the quantity b above. We use this form of w_{T+1}(i) in Step 2.

PROOF OF THEOREM (a \leq b \leq c)

Step 2
Next we show that the training error of f_boost^(T) (boosting after T steps) is \leq \prod_{t=1}^{T} Z_t. Currently we know

  w_{T+1}(i) = \frac{1}{n} \frac{e^{-y_i h_T(x_i)}}{\prod_{t=1}^{T} Z_t}
  \;\Longleftrightarrow\;
  w_{T+1}(i) \prod_{t=1}^{T} Z_t = \frac{1}{n} e^{-y_i h_T(x_i)},
  and f_boost^(T)(x) = sign(h_T(x)).

Derivation of Step 2: Observe that 1\{y_i \neq f_boost^(T)(x_i)\} \leq e^{-y_i h_T(x_i)}: when y_i h_T(x_i) \leq 0 the right-hand side is at least 1, and when y_i h_T(x_i) > 0 it is still positive. Therefore

  \frac{1}{n} \sum_{i=1}^{n} 1\{y_i \neq f_boost^(T)(x_i)\}
  \;\leq\; \frac{1}{n} \sum_{i=1}^{n} e^{-y_i h_T(x_i)}
  \;=\; \sum_{i=1}^{n} w_{T+1}(i) \prod_{t=1}^{T} Z_t
  \;=\; \prod_{t=1}^{T} Z_t,

since the weights w_{T+1}(i) sum to one. The left-hand side is a, the training error and the quantity we care about; the right-hand side is b.

PROOF OF THEOREM (a \leq b \leq c)

Step 3
The final step is to calculate an upper bound on Z_t, and by extension \prod_{t=1}^{T} Z_t.

Derivation of Step 3: This step is slightly more involved. It also shows why \alpha_t := \frac{1}{2} \ln\big( \frac{1 - \epsilon_t}{\epsilon_t} \big).

  Z_t = \sum_{i=1}^{n} w_t(i) e^{-\alpha_t y_i f_t(x_i)}
      = \sum_{i : y_i = f_t(x_i)} e^{-\alpha_t} w_t(i) + \sum_{i : y_i \neq f_t(x_i)} e^{\alpha_t} w_t(i)
      = e^{-\alpha_t} (1 - \epsilon_t) + e^{\alpha_t} \epsilon_t.

Remember we defined \epsilon_t = \sum_{i : y_i \neq f_t(x_i)} w_t(i), the probability of error under w_t.

PROOF OF THEOREM (a \leq b \leq c)

Derivation of Step 3 (continued): Remember from Step 2 that

  training error = \frac{1}{n} \sum_{i=1}^{n} 1\{y_i \neq f_boost(x_i)\} \leq \prod_{t=1}^{T} Z_t,

and we just showed that Z_t = e^{-\alpha_t} (1 - \epsilon_t) + e^{\alpha_t} \epsilon_t. We want the training error to be small, so we pick \alpha_t to minimize Z_t. Minimizing, we get the value of \alpha_t used by AdaBoost:

  \alpha_t = \frac{1}{2} \ln\big( \frac{1 - \epsilon_t}{\epsilon_t} \big).

Plugging this value back in gives Z_t = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}.
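The minimization is a one-line calculation (filling in the algebra the slide skips):

  \frac{dZ_t}{d\alpha_t} = -e^{-\alpha_t} (1 - \epsilon_t) + e^{\alpha_t} \epsilon_t = 0
  \;\Longrightarrow\; e^{2\alpha_t} = \frac{1 - \epsilon_t}{\epsilon_t}
  \;\Longrightarrow\; \alpha_t = \tfrac{1}{2} \ln\big( \frac{1 - \epsilon_t}{\epsilon_t} \big),

and then

  Z_t = e^{-\alpha_t} (1 - \epsilon_t) + e^{\alpha_t} \epsilon_t
      = \sqrt{\tfrac{\epsilon_t}{1 - \epsilon_t}} (1 - \epsilon_t) + \sqrt{\tfrac{1 - \epsilon_t}{\epsilon_t}} \, \epsilon_t
      = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}.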

PROOF OF THEOREM (a \leq b \leq c)

Derivation of Step 3 (continued): Next, rewrite Z_t as

  Z_t = 2 \sqrt{\epsilon_t (1 - \epsilon_t)} = \sqrt{1 - 4 (\tfrac{1}{2} - \epsilon_t)^2}.

[Figure: plot of e^{-x} and 1 - x, illustrating that 1 - x \leq e^{-x}.]

Then, use the inequality 1 - x \leq e^{-x} to conclude that

  Z_t = \big( 1 - 4 (\tfrac{1}{2} - \epsilon_t)^2 \big)^{1/2} \leq \big( e^{-4 (\tfrac{1}{2} - \epsilon_t)^2} \big)^{1/2} = e^{-2 (\tfrac{1}{2} - \epsilon_t)^2}.

PROOF OF THEOREM

Concluding the right inequality (a \leq b \leq c): Because both sides of Z_t \leq e^{-2 (\tfrac{1}{2} - \epsilon_t)^2} are positive, we can say that

  \prod_{t=1}^{T} Z_t \leq \prod_{t=1}^{T} e^{-2 (\tfrac{1}{2} - \epsilon_t)^2} = e^{-2 \sum_{t=1}^{T} (\tfrac{1}{2} - \epsilon_t)^2}.

This concludes the b \leq c portion of the proof.

Combining everything:

  training error = \frac{1}{n} \sum_{i=1}^{n} 1\{y_i \neq f_boost(x_i)\}   (this is a)
                 \leq \prod_{t=1}^{T} Z_t                                  (this is b)
                 \leq e^{-2 \sum_{t=1}^{T} (\tfrac{1}{2} - \epsilon_t)^2}. (this is c)

We set out to prove a \leq c, and we did so by using b as a stepping-stone.
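A quick numerical sanity check of the chain a \leq b \leq c (my own illustration, not from the lecture): for any sequence of weighted errors, the product of the Z_t's at the optimal \alpha_t is bounded by the exponential term.

```python
import numpy as np

def bound_chain(eps):
    """For weighted errors eps_t, compare b = prod_t Z_t with c = exp(-2 sum_t (1/2 - eps_t)^2)."""
    eps = np.asarray(eps, dtype=float)
    b = np.prod(2.0 * np.sqrt(eps * (1.0 - eps)))      # prod_t Z_t at the minimizing alpha_t
    c = np.exp(-2.0 * np.sum((0.5 - eps) ** 2))        # the training-error bound from the theorem
    return b, c

b, c = bound_chain([0.3, 0.21, 0.14])    # the three rounds from Example 1
print(b, c, b <= c)                      # b <= c holds, as the proof guarantees
```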

TRAINING VS TESTING ERROR

Q: Driving the training error to zero leads one to ask, does boosting overfit?
A: Sometimes, but very often it doesn't!

[Figure: error vs. rounds of boosting, showing the C4.5 (tree) testing error, the AdaBoost testing error, and the AdaBoost training error.]