
BBM406 - Introduction to ML, Spring 2014. Ensemble Methods. Aykut Erdem, Dept. of Computer Engineering, Hacettepe University

Slides adapted from David Sontag, Mehryar Mohri, Ziv Bar-Joseph, Arvind Rao, Greg Shakhnarovich, Rob Schapire, Tommi Jaakkola, Jan Sochman and Jiri Matas.

Today: Ensemble Methods
- Bagging
- Boosting

Bias/Variance Tradeoff [figure from Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, 2001]

Bias/Variance Tradeoff. Graphical illustration of bias and variance: http://scott.fortmann-roe.com/docs/biasvariance.html

Fighting the bias-variance tradeoff. Simple (a.k.a. weak) learners are good - e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees): low variance, don't usually overfit. Simple (a.k.a. weak) learners are bad: high bias, can't solve hard learning problems.

Reduce Variance Without Increasing Bias. Averaging reduces variance (when predictions are independent): average models to reduce model variance. One problem: there is only one training set - where do multiple models come from?
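To make the averaging claim precise (a standard result, not spelled out on the slide): if N models make predictions with a common variance $\sigma^2$ and their errors are independent, the variance of their average is

$$\mathrm{Var}\left(\frac{1}{N}\sum_{i=1}^{N} h_i(x)\right) = \frac{1}{N^2}\sum_{i=1}^{N}\mathrm{Var}\big(h_i(x)\big) = \frac{\sigma^2}{N},$$

while the bias of the average equals the bias of the individual models, so averaging attacks variance only.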

Bagging (Bootstrap Aggregating), Leo Breiman (1994). Take repeated bootstrap samples from the training set D. Bootstrap sampling: given a set D containing N training examples, create D' by drawing N examples at random with replacement from D. Bagging:
- Create k bootstrap samples D_1, ..., D_k.
- Train a distinct classifier on each D_i.
- Classify new instances by majority vote / average.
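As a concrete sketch (my own illustration, not code from the lecture; it assumes scikit-learn-style estimators with fit/predict, here a decision tree as the base learner):

```python
# Minimal bagging sketch: k classifiers, each trained on a bootstrap sample.
import numpy as np
from collections import Counter
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, base=DecisionTreeClassifier(), k=100, seed=0):
    """Train k classifiers, each on a bootstrap sample D_i drawn with
    replacement from the training set D = (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)              # N draws with replacement
        models.append(clone(base).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classify new instances by majority vote over the k classifiers."""
    votes = np.array([m.predict(X) for m in models])  # shape (k, n_test)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```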

Bagging. Best case: with independent models the variance drops to 1/N of the single-model variance. In practice:
- models are correlated, so the reduction is smaller than 1/N
- the variance of models trained on fewer training cases is usually somewhat larger

Bagging Example

CART* decision boundary (* a decision tree learning algorithm, very similar to ID3)

100 bagged trees. Shades of blue/red indicate the strength of the vote for a particular classification.

Reduce Bias² and Decrease Variance? Bagging reduces variance by averaging. Bagging has little effect on bias. Can we average and reduce bias? Yes: Boosting.

Boosting Ideas. Main idea: use a weak learner to create a strong learner. Ensemble method: combine the base classifiers returned by the weak learner. Finding simple, relatively accurate base classifiers is often not hard. But how should the base classifiers be combined?

Example: How May I Help You? Goal: automatically categorize the type of call requested by a phone customer (Collect, CallingCard, PersonToPerson, etc.)
- "yes I'd like to place a collect call long distance please" (Collect)
- "operator I need to make a call but I need to bill it to my office" (ThirdNumber)
- "yes I'd like to place a call on my master card please" (CallingCard)
- "I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill" (BillingCredit)
Observation:
- easy to find rules of thumb that are often correct, e.g.: IF 'card' occurs in utterance THEN predict CallingCard
- hard to find a single highly accurate prediction rule
[Gorin et al.]

Boosting: Intuition. Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space. Output class: (weighted) vote of each classifier
- classifiers that are most sure will vote with more conviction
- classifiers will be most sure about a particular part of the space
- on average, do better than a single classifier
But how do you:
- force classifiers to learn about different parts of the input space?
- weigh the votes of different classifiers?

Boosting [Schapire, 1989]. Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote. On each iteration t:
- weight each training example by how incorrectly it was classified
- learn a hypothesis h_t
- and a strength for this hypothesis, α_t
Final classifier: a linear combination of the votes of the different classifiers, weighted by their strength. Practically useful, theoretically interesting.

Boosting: Intuition. Want to pick weak classifiers that contribute something to the ensemble. Greedy algorithm: for m = 1, ..., M:
- pick a weak classifier h_m
- adjust weights: misclassified examples get heavier
- α_m set according to the weighted error of h_m


First Boosting Algorithms.
[Schapire 89]: first provable boosting algorithm
[Freund 90]: optimal algorithm that boosts by majority
[Drucker, Schapire & Simard 92]: first experiments using boosting; limited by practical drawbacks
[Freund & Schapire 95]: introduced the AdaBoost algorithm; strong practical advantages over previous boosting algorithms

The AdaBoost Algorithm

Toy Example. Minimize the error; weak hypotheses = vertical or horizontal half-planes.

Round 1: weak classifier h_1 and reweighted distribution D_2; ε_1 = 0.30, α_1 = 0.42

Round 2: weak classifier h_2 and reweighted distribution D_3; ε_2 = 0.21, α_2 = 0.65

Round 3: weak classifier h_3; ε_3 = 0.14, α_3 = 0.92

Final Hypothesis: $H_{\mathrm{final}}(x) = \mathrm{sign}\big(0.42\, h_1(x) + 0.65\, h_2(x) + 0.92\, h_3(x)\big)$
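As a quick check, these votes follow from the round errors via the weight formula used by AdaBoost below, $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$:

$$\alpha_1 = \tfrac{1}{2}\ln\tfrac{0.70}{0.30} \approx 0.42, \qquad \alpha_2 = \tfrac{1}{2}\ln\tfrac{0.79}{0.21} \approx 0.65, \qquad \alpha_3 = \tfrac{1}{2}\ln\tfrac{0.86}{0.14} \approx 0.92.$$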

Voted combination of classifiers. The general problem here is to combine many simple weak classifiers into a single strong classifier. We consider voted combinations of simple binary ±1 component classifiers,

$$\hat h_M(x) = \alpha_1 h(x;\theta_1) + \dots + \alpha_M h(x;\theta_M),$$

where the (non-negative) votes α_i can be used to emphasize component classifiers that are more reliable than others.

Components: decision stumps. Consider the following simple family of component classifiers generating ±1 labels:

$$h(x;\theta) = \mathrm{sign}(w_1 x_k - w_0), \qquad \theta = \{k, w_1, w_0\}.$$

These are called decision stumps. Each decision stump pays attention to only a single component of the input vector.
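In code, a stump is a one-line classifier (an illustrative sketch; the parameter names k, w0, w1 follow the reconstructed form above):

```python
# A decision stump: a +/-1 classifier that looks at a single input coordinate.
import numpy as np

def stump(x, k, w0, w1):
    """h(x; theta) = sign(w1 * x[k] - w0), with theta = (k, w0, w1)."""
    return np.sign(w1 * x[k] - w0)

print(stump(np.array([0.2, 1.7, -3.0]), k=1, w0=1.0, w1=1.0))  # prints 1.0
```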

Voted combinations (cont'd.). We need to define a loss function for the combination so we can determine which new component h(x; θ) to add and how many votes it should receive. While there are many options for the loss function, we consider here only a simple exponential loss.
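Concretely (my reconstruction of the slide's formula, following the standard treatment), the empirical exponential loss of the voted combination is

$$J(\hat h_M) = \frac{1}{n}\sum_{i=1}^{n} \exp\big(-y_i\, \hat h_M(x_i)\big) = \frac{1}{n}\sum_{i=1}^{n} \exp\Big(-y_i \sum_{m=1}^{M} \alpha_m h(x_i;\theta_m)\Big),$$

which upper-bounds the training error, since $\exp(-y f(x)) \ge 1$ whenever $\mathrm{sign}(f(x)) \ne y$.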

Modularity, errors, and loss. Consider adding the m-th component. At the m-th iteration the new component (and its votes) should optimize a weighted loss, weighted towards mistakes.
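The decomposition behind this statement (reconstructed here; the notation follows the loss above) separates the already-fixed ensemble $\hat h_{m-1}$ from the new component:

$$\sum_{i=1}^{n} \exp\Big(-y_i\big[\hat h_{m-1}(x_i) + \alpha_m h(x_i;\theta_m)\big]\Big) = \sum_{i=1}^{n} W_i^{(m-1)} \exp\big(-y_i\, \alpha_m h(x_i;\theta_m)\big),$$

where $W_i^{(m-1)} = \exp(-y_i\, \hat h_{m-1}(x_i))$ depends only on earlier rounds. Each training example therefore enters the new optimization with a weight that is large exactly when the current ensemble gets it wrong.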

Empirical exponential loss (cont'd.). To increase modularity we'd like to further decouple the optimization of h(x; θ_m) from the associated votes α_m. To this end we select the h(x; θ_m) that optimizes the rate at which the loss would decrease as a function of α_m.

Empirical exponential loss (cont'd.). We find that the best new component h(x; θ_m) is the one that minimizes the weighted classification error. We can also normalize the weights so that they sum to one; the vote α_m is subsequently chosen to minimize the resulting loss.
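Spelled out (a reconstruction consistent with the AdaBoost update below): with normalized weights $\tilde W_i^{(m-1)} = W_i^{(m-1)} / \sum_j W_j^{(m-1)}$, the chosen component minimizes the weighted error

$$\epsilon_m = \sum_{i=1}^{n} \tilde W_i^{(m-1)}\, [\![\, y_i \ne h(x_i;\theta_m)\,]\!],$$

and the vote that then minimizes the exponential loss is $\alpha_m = \frac{1}{2}\log\frac{1-\epsilon_m}{\epsilon_m}$, exactly the $\alpha_t$ used in AdaBoost.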

The AdaBoost Algorithm

Given: $(x_1,y_1),\dots,(x_m,y_m)$ with $x_i \in X$, $y_i \in \{-1,+1\}$.
Initialise weights $D_1(i) = 1/m$.
For $t = 1, \dots, T$:
- Find $h_t = \arg\min_{h_j \in \mathcal{H}} \epsilon_j$, where $\epsilon_j = \sum_{i=1}^{m} D_t(i)\,[\![\, y_i \ne h_j(x_i)\,]\!]$
- If $\epsilon_t \ge 1/2$ then stop
- Set $\alpha_t = \frac{1}{2}\log\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$
- Update $D_{t+1}(i) = \frac{D_t(i)\exp(-\alpha_t\, y_i\, h_t(x_i))}{Z_t}$, where $Z_t$ is a normalisation factor
Output the final classifier: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

[The slides step through the algorithm on the toy data for t = 1, ..., 7 and t = 40, alongside a plot of training error versus boosting step.]
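A runnable sketch of the algorithm above, using decision stumps as the weak learners (my own illustration, not the course's reference implementation; assumes numpy):

```python
# AdaBoost with decision stumps as weak learners.
import numpy as np

def fit_stump(X, y, D):
    """Exhaustively pick the (feature, threshold, polarity) stump with the
    smallest weighted classification error under the weights D."""
    best_err, best_params = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for polarity in (+1, -1):
                pred = np.where(X[:, j] > thr, polarity, -polarity)
                err = np.sum(D * (pred != y))
                if err < best_err:
                    best_err, best_params = err, (j, thr, polarity)
    return best_err, best_params

def stump_predict(params, X):
    j, thr, polarity = params
    return np.where(X[:, j] > thr, polarity, -polarity)

def adaboost(X, y, T=50):
    n = len(y)
    D = np.full(n, 1.0 / n)                 # D_1(i) = 1/m
    ensemble = []                           # list of (alpha_t, stump params)
    for t in range(T):
        eps, params = fit_stump(X, y, D)    # h_t minimises the weighted error
        if eps >= 0.5:                      # weak learner no better than chance
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        D = D * np.exp(-alpha * y * stump_predict(params, X))
        D = D / D.sum()                     # Z_t normalisation
        ensemble.append((alpha, params))
    return ensemble

def predict(ensemble, X):
    score = sum(a * stump_predict(p, X) for a, p in ensemble)
    return np.sign(score)

# Toy usage: XOR-like data that no single stump can separate.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)
model = adaboost(X, y, T=50)
print("training error:", np.mean(predict(model, X) != y))
```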

Reweighting


Boosting results: digit recognition. [Plot: training error and test error versus # rounds of boosting.] Boosting is often (but not always)
- robust to overfitting
- test set error decreases even after the training error is zero
[Schapire, 1989]

Application: Detecting Faces. Problem: find faces in a photograph or movie. Weak classifiers: detect light/dark rectangles in the image. Many clever tricks to make it extremely fast and accurate. [Viola & Jones]

Boosting vs. Logistic Regression.
Logistic regression:
- Minimize log loss
- Define f(x) = w_0 + Σ_j w_j x_j, where the features x_j are predefined (a linear classifier)
- Jointly optimize over all weights w_0, w_1, w_2, ...
Boosting:
- Minimize exp loss
- Define f(x) = Σ_t α_t h_t(x), where h_t(x) is defined dynamically to fit the data (not a linear classifier)
- Weights α_t are learned incrementally, one per iteration t
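For reference (standard definitions, with labels $y \in \{-1,+1\}$; these formulas are not transcribed on the slide):

$$\text{log loss: } \sum_i \ln\big(1 + e^{-y_i f(x_i)}\big), \qquad \text{exp loss: } \sum_i e^{-y_i f(x_i)}.$$

Both are smooth, convex surrogates for the 0/1 training error (the exp loss is an exact upper bound); the exponential loss penalizes badly misclassified points far more aggressively.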

Boosting vs. Bagging.
Bagging:
- Resamples data points
- The weight of each classifier is the same
- Only reduces variance
Boosting:
- Reweights data points (modifies their distribution)
- A classifier's weight depends on its accuracy
- Reduces both bias and variance: the learned rule becomes more complex with iterations