Voting (Ensemble Methods)


Voting (Ensemble Methods)
Instead of learning a single classifier, learn many weak classifiers that are good at different parts of the data.
Output class: a (weighted) vote of the classifiers. Classifiers that are most sure will vote with more conviction, and each classifier will be most sure about a particular part of the space.
On average, the ensemble does better than a single classifier. But how?
- Force classifiers to learn about different parts of the input space?
- Train them on different subsets of the data?
- Weigh the votes of different classifiers?

BAGGing = Bootstrap AGGregation (Breiman, 1996)
For i = 1, 2, ..., K:
  T_i <- randomly select M training instances with replacement
  h_i <- learn(T_i)   [Decision Tree, Naive Bayes, ...]
Now combine the h_i together with uniform voting (w_i = 1/K for all i).
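Below is a minimal sketch of this procedure in Python; the function names are illustrative (not from the lecture), and scikit-learn's DecisionTreeClassifier stands in for learn(T_i).

```python
# BAGGing sketch: K bootstrap samples of M instances each, combined by uniform voting.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, K=25, M=None, seed=0):
    """Learn K classifiers h_i, each on M instances drawn with replacement."""
    rng = np.random.default_rng(seed)
    M = len(X) if M is None else M
    models = []
    for _ in range(K):
        idx = rng.choice(len(X), size=M, replace=True)                # T_i: bootstrap sample
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))   # h_i <- learn(T_i)
    return models

def bagging_predict(models, X):
    """Uniform vote (w_i = 1/K), assuming labels in {-1, +1}."""
    votes = np.stack([h.predict(X) for h in models])                  # shape (K, n_samples)
    return np.where(votes.mean(axis=0) >= 0, 1, -1)
```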

[Slide figures: bagging demo. The base learner is the decision tree learning algorithm, very similar to the version in earlier slides; in the prediction plots, shades of blue/red indicate the strength of the vote for a particular classification.]

Fighting the bias-variance tradeoff
Simple (a.k.a. weak) learners are good: e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees). They have low variance and don't usually overfit.
Simple (a.k.a. weak) learners are bad: they have high bias and can't solve hard learning problems.
Can we make weak learners always good? No! But often yes...

Boosting [Schapire, 1989]
Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote.
On each iteration t:
- weight each training example by how incorrectly it was classified
- learn a hypothesis h_t
- choose a strength α_t for this hypothesis
Final classifier: H(x) = \mathrm{sign}\left(\sum_t \alpha_t h_t(x)\right)
Practically useful and theoretically interesting.

[Slide figures: AdaBoost demo. Blue/red indicates the class, the size of each dot its weight, and the weak learner is a decision stump (a horizontal or vertical split).
At time t = 1, the first hypothesis has 15% error, and so does the ensemble, since the ensemble contains just this one hypothesis.
The demo continues at t = 2, 3, 13, 100, and 300; by t = 300 the ensemble is overfitting.]

Learning from weighted data
Consider a weighted dataset: D(i) is the weight of the i-th training example (x_i, y_i).
Interpretations:
- the i-th training example counts as if it occurred D(i) times
- if I were to resample the data, I would get more samples of heavier data points
Now, always do weighted calculations. E.g., for the MLE in naïve Bayes, redefine Count(Y = y) to be the weighted count:
Count(Y = y) = \sum_{j=1}^{n} D(j) \, \delta(Y_j = y)
Setting D(j) = 1 (or any constant value) for all j recreates the unweighted case.
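A one-function numpy sketch of this weighted count; the labels and weights below are made up for illustration.

```python
# Weighted Count(Y = y) = sum_j D(j) * delta(Y_j = y)
import numpy as np

Y = np.array([1, 1, -1, 1, -1])           # labels Y_j
D = np.array([0.4, 0.1, 0.2, 0.1, 0.2])   # example weights D(j)

def weighted_count(Y, D, y):
    return D[Y == y].sum()

print(weighted_count(Y, D, +1))   # 0.6
print(weighted_count(Y, D, -1))   # 0.4
# With D(j) = 1 for every j, this reduces to the ordinary unweighted count.
```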

Given: (x_1, y_1), ..., (x_m, y_m) where x_i \in \mathbb{R}^n, y_i \in \{-1, +1\}
Initialize: D_1(i) = 1/m, for i = 1, ..., m
For t = 1 ... T:
- Train base classifier h_t(x) using D_t
- Choose α_t  (How? Many possibilities; we will see one shortly.)
- Update, for i = 1 ... m:
  D_{t+1}(i) \propto D_t(i) \exp(-\alpha_t y_i h_t(x_i))
  with normalization constant Z_t = \sum_{i=1}^{m} D_t(i) \exp(-\alpha_t y_i h_t(x_i))
Output final classifier: H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)

Why reweight? Examples i that are misclassified get higher weights:
- y_i h_t(x_i) > 0  ⇒  h_t correct on example i
- y_i h_t(x_i) < 0  ⇒  h_t wrong on example i
- h_t correct and α_t > 0  ⇒  D_{t+1}(i) < D_t(i)
- h_t wrong and α_t > 0  ⇒  D_{t+1}(i) > D_t(i)
Final result: a weighted linear sum of the base (weak) classifier outputs.

ε_t is the error of h_t, weighted by D_t:
\epsilon_t = \sum_{i=1}^{m} D_t(i) \, \delta(h_t(x_i) \neq y_i), \qquad 0 \le \epsilon_t \le 1
How α_t behaves as a function of ε_t:
- No errors: ε_t = 0  ⇒  α_t = +∞
- All errors: ε_t = 1  ⇒  α_t = −∞
- Random: ε_t = 0.5  ⇒  α_t = 0
[Figure: α_t plotted against ε_t, decreasing from +∞ at ε_t = 0 through 0 at ε_t = 0.5 to −∞ at ε_t = 1.]

What α_t to choose for hypothesis h_t? [Schapire, 1989]
Idea: choose α_t to minimize a bound on the training error:
\sum_{i=1}^{m} \delta(H(x_i) \neq y_i) \;\le\; \sum_{i=1}^{m} \exp(-y_i f(x_i))
where f(x) = \sum_t \alpha_t h_t(x) and H(x) = \mathrm{sign}(f(x)).
[Figure: exp(-y_i f(x_i)) plotted against the margin y_i f(x_i), upper-bounding the 0/1 loss δ(H(x_i) ≠ y_i).]

What α_t to choose for hypothesis h_t? [Schapire, 1989]
\frac{1}{m}\sum_{i=1}^{m} \delta(H(x_i) \neq y_i) \;\le\; \frac{1}{m}\sum_{i=1}^{m} \exp(-y_i f(x_i)) \;=\; \prod_t Z_t
where
Z_t = \sum_{i=1}^{m} D_t(i) \exp(-\alpha_t y_i h_t(x_i))
This equality isn't obvious! It can be shown with algebra (telescoping sums); see the derivation below.
If we minimize \prod_t Z_t, we minimize our training error. We can tighten this bound greedily by choosing α_t and h_t on each iteration to minimize Z_t. h_t is estimated as a black box, but can we solve for α_t?
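The telescoping argument, spelled out (standard algebra, not extra slide content): unroll the weight update D_{t+1}(i) = D_t(i) exp(-α_t y_i h_t(x_i)) / Z_t from t = T back to t = 1.

```latex
\begin{align*}
D_{T+1}(i)
  &= D_1(i)\,\frac{\prod_{t=1}^{T} e^{-\alpha_t y_i h_t(x_i)}}{\prod_{t=1}^{T} Z_t}
   = \frac{1}{m}\cdot\frac{e^{-y_i \sum_{t=1}^{T}\alpha_t h_t(x_i)}}{\prod_{t=1}^{T} Z_t}
   = \frac{1}{m}\cdot\frac{e^{-y_i f(x_i)}}{\prod_{t=1}^{T} Z_t} \\
\intertext{and since the weights $D_{T+1}$ sum to one, summing over $i$ gives}
1 &= \frac{1}{m}\sum_{i=1}^{m}\frac{e^{-y_i f(x_i)}}{\prod_{t=1}^{T} Z_t}
  \quad\Longrightarrow\quad
  \frac{1}{m}\sum_{i=1}^{m} e^{-y_i f(x_i)} \;=\; \prod_{t=1}^{T} Z_t .
\end{align*}
```

Combined with δ(H(x_i) ≠ y_i) ≤ exp(-y_i f(x_i)) (whenever H errs, y_i f(x_i) ≤ 0, so the exponential is at least 1), this yields the bound above.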

Summary: choose α_t to minimize the error bound [Schapire, 1989]
We can squeeze this bound by choosing α_t on each iteration to minimize
Z_t = \sum_{i=1}^{m} D_t(i) \exp(-\alpha_t y_i h_t(x_i)), \qquad \epsilon_t = \sum_{i=1}^{m} D_t(i) \, \delta(h_t(x_i) \neq y_i)
For boolean Y: differentiate, set equal to 0, and there is a closed-form solution [Freund & Schapire '97]:
\alpha_t = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right)
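A short derivation of that closed form, splitting Z_t over correctly and incorrectly classified examples:

```latex
Z_t \;=\; \sum_{i:\,h_t(x_i)=y_i} D_t(i)\,e^{-\alpha_t} \;+\; \sum_{i:\,h_t(x_i)\neq y_i} D_t(i)\,e^{\alpha_t}
    \;=\; (1-\epsilon_t)\,e^{-\alpha_t} + \epsilon_t\,e^{\alpha_t},
\qquad
\frac{\partial Z_t}{\partial \alpha_t} = 0
\;\Longrightarrow\;
\alpha_t = \frac{1}{2}\ln\!\frac{1-\epsilon_t}{\epsilon_t},
\qquad
Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)} .
```

The minimized value Z_t = 2√(ε_t(1−ε_t)) is strictly less than 1 whenever ε_t ≠ 1/2, which is why the product ∏_t Z_t, and hence the training error bound, keeps shrinking.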

The complete AdaBoost algorithm:
Given: (x_1, y_1), ..., (x_m, y_m) where x_i \in \mathbb{R}^n, y_i \in \{-1, +1\}
Initialize: D_1(i) = 1/m, for i = 1, ..., m
For t = 1 ... T:
- Train base classifier h_t(x) using D_t
- Compute \epsilon_t = \sum_{i=1}^{m} D_t(i)\,\delta(h_t(x_i) \neq y_i) and choose \alpha_t = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right)
- Update, for i = 1 ... m: D_{t+1}(i) \propto D_t(i) \exp(-\alpha_t y_i h_t(x_i)), with normalization constant Z_t = \sum_{i=1}^{m} D_t(i) \exp(-\alpha_t y_i h_t(x_i))
Output final classifier: H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)
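A minimal, self-contained Python sketch of the algorithm above, using 1-D decision stumps as the base classifier (the helper names and the threshold search are illustrative, not the lecture's own code).

```python
# AdaBoost with 1-D decision stumps: h(x) = s * (+1 if x > thr else -1).
import numpy as np

def train_stump(x, y, D):
    """Return the (threshold, sign) pair minimizing the weighted error under D."""
    xs = np.sort(x)
    thresholds = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0))
    best = (None, None, np.inf)
    for thr in thresholds:
        for s in (+1, -1):
            pred = s * np.where(x > thr, 1, -1)
            err = D[pred != y].sum()               # weighted error of this candidate
            if err < best[2]:
                best = (thr, s, err)
    return best                                    # (threshold, sign, epsilon_t)

def adaboost(x, y, T=10):
    m = len(x)
    D = np.full(m, 1.0 / m)                        # D_1(i) = 1/m
    stumps, alphas = [], []
    for _ in range(T):
        thr, s, eps = train_stump(x, y, D)         # train base classifier h_t using D_t
        if eps == 0.0 or eps >= 0.5:               # degenerate weak learner: stop early
            break
        alpha = 0.5 * np.log((1 - eps) / eps)      # closed-form alpha_t
        pred = s * np.where(x > thr, 1, -1)
        D = D * np.exp(-alpha * y * pred)          # D_{t+1}(i) ∝ D_t(i) exp(-alpha_t y_i h_t(x_i))
        D = D / D.sum()                            # normalize by Z_t
        stumps.append((thr, s))
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, x):
    """H(x) = sign(sum_t alpha_t h_t(x))."""
    f = sum(a * s * np.where(x > thr, 1, -1) for (thr, s), a in zip(stumps, alphas))
    return np.where(f >= 0, 1, -1)

if __name__ == "__main__":
    x = np.array([-1.0, 0.0, 1.0])                 # the toy dataset of the next slides
    y = np.array([+1, -1, +1])
    stumps, alphas = adaboost(x, y, T=3)
    print(predict(stumps, alphas, x))              # [ 1 -1  1]: zero training error
```

On this toy dataset three rounds already drive the training error of H to zero, although the particular stumps chosen can differ from the worked example below, because ties in weighted error are broken differently here.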

Worked example, using decision stumps as the base classifier.
Data (a single feature x_1):
  x_1 = -1, y = +1
  x_1 =  0, y = -1
  x_1 = +1, y = +1
Initial weights: D_1 = [D_1(1), D_1(2), D_1(3)] = [0.33, 0.33, 0.33]
t = 1: train a stump [work omitted, breaking ties randomly]:
  h_1(x) = +1 if x_1 > 0.5, -1 otherwise
  ε_1 = Σ_i D_1(i) δ(h_1(x_i) ≠ y_i) = 0.33·1 + 0.33·0 + 0.33·0 = 0.33
  α_1 = (1/2) ln((1-ε_1)/ε_1) = 0.5 ln(2) ≈ 0.35
  D_2(1) ∝ D_1(1) exp(-α_1 y_1 h_1(x_1)) = 0.33 · exp(-0.35 · 1 · (-1)) = 0.33 · exp(0.35) = 0.46
  D_2(2) ∝ D_1(2) exp(-α_1 y_2 h_1(x_2)) = 0.33 · exp(-0.35 · (-1) · (-1)) = 0.33 · exp(-0.35) = 0.23
  D_2(3) ∝ D_1(3) exp(-α_1 y_3 h_1(x_3)) = 0.33 · exp(-0.35 · 1 · 1) = 0.33 · exp(-0.35) = 0.23
  After normalization: D_2 = [D_2(1), D_2(2), D_2(3)] = [0.5, 0.25, 0.25]
  Ensemble so far: H(x) = sign(0.35 h_1(x))
t = 2 continues on the next slide.

t = 2: train a stump [work omitted; a different stump is chosen because of the new data weights D_2; breaking ties opportunistically (will discuss at the end)]:
  h_2(x) = +1 if x_1 < 1.5, -1 otherwise
  ε_2 = Σ_i D_2(i) δ(h_2(x_i) ≠ y_i) = 0.5·0 + 0.25·1 + 0.25·0 = 0.25
  α_2 = (1/2) ln((1-ε_2)/ε_2) = 0.5 ln(3) ≈ 0.55
  D_3(1) ∝ D_2(1) exp(-α_2 y_1 h_2(x_1)) = 0.5 · exp(-0.55 · 1 · 1) = 0.5 · exp(-0.55) = 0.29
  D_3(2) ∝ D_2(2) exp(-α_2 y_2 h_2(x_2)) = 0.25 · exp(-0.55 · (-1) · 1) = 0.25 · exp(0.55) = 0.43
  D_3(3) ∝ D_2(3) exp(-α_2 y_3 h_2(x_3)) = 0.25 · exp(-0.55 · 1 · 1) = 0.25 · exp(-0.55) = 0.14
  After normalization: D_3 = [D_3(1), D_3(2), D_3(3)] = [0.33, 0.5, 0.17]
  Ensemble so far: H(x) = sign(0.35 h_1(x) + 0.55 h_2(x))
    with h_1(x) = +1 if x_1 > 0.5, -1 otherwise; h_2(x) = +1 if x_1 < 1.5, -1 otherwise
t = 3 continues on the next slide.

t = 3: train a stump [work omitted; a different stump is chosen because of the new data weights D_3; breaking ties opportunistically (will discuss at the end)]:
  h_3(x) = +1 if x_1 < -0.5, -1 otherwise
  ε_3 = Σ_i D_3(i) δ(h_3(x_i) ≠ y_i) = 0.33·0 + 0.5·0 + 0.17·1 = 0.17
  α_3 = (1/2) ln((1-ε_3)/ε_3) = 0.5 ln(4.88) ≈ 0.79
Stop! How did we know to stop?
Final classifier: H(x) = sign(0.35 h_1(x) + 0.55 h_2(x) + 0.79 h_3(x))
  h_1(x) = +1 if x_1 > 0.5, -1 otherwise
  h_2(x) = +1 if x_1 < 1.5, -1 otherwise
  h_3(x) = +1 if x_1 < -0.5, -1 otherwise
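For reference, a short numpy check of the hand computation above, with the three stumps hard-coded exactly as chosen on the slides (α_3 comes out as 0.80 rather than 0.79 only because the slide rounded ε_3 to 0.17 before taking the log).

```python
# Recompute epsilon_t, alpha_t and the weight updates for the worked example.
import numpy as np

x = np.array([-1.0, 0.0, 1.0])
y = np.array([+1, -1, +1])
stumps = [
    lambda x: np.where(x > 0.5, 1, -1),    # h_1
    lambda x: np.where(x < 1.5, 1, -1),    # h_2
    lambda x: np.where(x < -0.5, 1, -1),   # h_3
]

D = np.full(3, 1.0 / 3.0)
alphas = []
for t, h in enumerate(stumps, start=1):
    pred = h(x)
    eps = D[pred != y].sum()
    alpha = 0.5 * np.log((1 - eps) / eps)
    alphas.append(alpha)
    print(f"t={t}: D_t={np.round(D, 2)}, eps_t={eps:.2f}, alpha_t={alpha:.2f}")
    D = D * np.exp(-alpha * y * pred)
    D = D / D.sum()
# t=1: D_t=[0.33 0.33 0.33], eps_t=0.33, alpha_t=0.35
# t=2: D_t=[0.5  0.25 0.25], eps_t=0.25, alpha_t=0.55
# t=3: D_t=[0.33 0.5  0.17], eps_t=0.17, alpha_t=0.80

H = np.sign(sum(a * h(x) for a, h in zip(alphas, stumps)))
print((H == y).all())   # True: H classifies every training point correctly, hence "Stop!"
```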

Strong vs. weak classifiers
If each classifier is (at least slightly) better than random, i.e. ε_t < 0.5, there is another bound on the training error:
\frac{1}{m}\sum_{i=1}^{m} \delta(H(x_i) \neq y_i) \;\le\; \prod_t Z_t \;\le\; \exp\!\left(-2\sum_{t=1}^{T} \left(\tfrac{1}{2} - \epsilon_t\right)^2\right)
What does this imply about the training error? It will reach zero, and it will get there exponentially fast!
Is it hard to achieve better-than-random training error?
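A back-of-the-envelope illustration of how fast that bound shrinks (these numbers are mine, not the slide's): suppose every weak learner has ε_t = 0.4, i.e. an edge of γ = 0.1 over random guessing.

```python
# Training-error bound exp(-2 * sum_t (1/2 - eps_t)^2) with eps_t = 0.4 for all t.
import math

gamma = 0.1                                  # edge over random: eps_t = 0.5 - gamma
for T in (10, 100, 500):
    print(T, math.exp(-2 * gamma ** 2 * T))
# 10 -> 0.819, 100 -> 0.135, 500 -> 4.5e-05: the bound decays exponentially in T
```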

Boosting results: digit recognition [Schapire, 1989]
[Figure: training error and test error versus the number of boosting rounds.]
Boosting seems to be robust to overfitting: the test error can decrease even after the training error is zero!

Boosting generalization error bound [Freund & Schapire, 1996]
[The bound itself appeared as a figure; roughly, generalization error ≤ training error + \tilde{O}\big(\sqrt{Td/m}\big).]
Constants:
- T: number of boosting rounds. Higher T ⇒ looser bound.
- d: measures the complexity of the weak classifiers. Higher d ⇒ bigger hypothesis space ⇒ looser bound.
- m: number of training examples. More data ⇒ tighter bound.

Boosting generalization error bound [Freund & Schapire, 1996]
Theory does not match practice: boosting is robust to overfitting, and the test-set error decreases even after the training error is zero. What does this imply?
Constants:
- T: number of boosting rounds. Higher T ⇒ looser bound.
- d: VC dimension of the weak learner, measures the complexity of the classifier. Higher d ⇒ bigger hypothesis space ⇒ looser bound.
- m: number of training examples. More data ⇒ tighter bound.

Boosting: Experimental Results [Freund & Schapire, 1996]
Comparison of C4.5, boosting C4.5, and boosting decision stumps (depth-1 trees) on 27 benchmark datasets.
[Figure: scatter plots of test error comparing the methods pairwise across the datasets.]

Boosting and Logistic Regression
Logistic regression is equivalent to minimizing the log loss:
\sum_i \ln\big(1 + \exp(-y_i f(x_i))\big)
Boosting minimizes a similar loss function, the exponential loss:
\sum_i \exp(-y_i f(x_i))
Both are smooth approximations of the 0/1 loss δ(H(x_i) ≠ y_i), viewed as functions of the margin y_i f(x_i).
[Figure: the two losses and the 0/1 loss plotted against the margin y_i f(x_i).]
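A tiny sketch that makes the comparison concrete by evaluating the three losses at a few margins z = y·f(x); the numbers are only for illustration.

```python
# 0/1 loss, exponential loss (boosting) and log loss (logistic regression)
# as functions of the margin z = y * f(x).
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])    # margins y_i * f(x_i)
zero_one = (z <= 0).astype(float)            # 0/1 loss (counting z = 0 as an error)
exp_loss = np.exp(-z)                        # boosting's surrogate
log_loss = np.log(1.0 + np.exp(-z))          # logistic regression's surrogate

for zi, l01, le, ll in zip(z, zero_one, exp_loss, log_loss):
    print(f"z={zi:+.1f}  0/1={l01:.0f}  exp={le:.3f}  log={ll:.3f}")
# Both surrogates decrease monotonically in the margin; the exponential loss
# upper-bounds the 0/1 loss, while the log loss grows only linearly for very
# negative margins, so it is less sensitive to badly misclassified outliers.
```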

Logistic regression and Boosting
Logistic regression:
- Minimize the loss \sum_{i=1}^{m} \ln\big(1 + \exp(-y_i f(x_i))\big)
- Define f(x) = w_0 + \sum_j w_j x_j, where each feature x_j is predefined
- Jointly optimize the parameters w_0, w_1, ..., w_n via gradient ascent
Boosting:
- Minimize the loss \sum_{i=1}^{m} \exp(-y_i f(x_i))
- Define f(x) = \sum_t \alpha_t h_t(x), where each h_t(x) is learned to fit the data
- Weights α_t are learned incrementally (a new one for each training pass)

What you need to know about Boosting
- Combine weak classifiers to get a very strong classifier: each weak classifier need only be slightly better than random on the training data, and the resulting strong classifier can reach zero training error.
- The AdaBoost algorithm.
- Boosting vs. logistic regression: both fit a linear model, but boosting learns the features; they use similar loss functions; LR is a single joint optimization, while boosting incrementally improves the classification.
- The most popular application of boosting: boosted decision stumps! Very simple to implement, and a very effective classifier.