Boosting: Can we make dumb learners smart?


Boosting: Can we make dumb learners smart? Aarti Singh, Machine Learning 10-601, Nov 29, 2011. Slides courtesy: Carlos Guestrin, Freund & Schapire

Why boost weak learners? Goal: automatically categorize the type of call requested (collect, calling card, person-to-person, etc.). It is easy to find rules of thumb that are often correct, e.g., if "card" occurs in the utterance, then predict "calling card". It is hard to find a single highly accurate prediction rule.

Fighting the bias-variance tradeoff. Simple (a.k.a. weak) learners, e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees). The good: low variance, they don't usually overfit. The bad: high bias, they can't solve hard learning problems. Can we make weak learners always good? No! But often yes.

Voting (Ensemble Methods). Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space. Output class: (weighted) vote of each classifier. Classifiers that are most sure will vote with more conviction, and classifiers will be most sure about a particular part of the space. On average, do better than a single classifier! With h1, h2 : X -> {-1, +1}, the combined classifier is H(x) = sign(h1(x) + h2(x)), or more generally H(x) = sign(sum_t alpha_t h_t(x)) with weights alpha_t.

Voting (Ensemble Methods). Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space. Output class: (weighted) vote of each classifier. Classifiers that are most sure will vote with more conviction, and classifiers will be most sure about a particular part of the space. On average, do better than a single classifier! But how do you force classifiers h_t to learn about different parts of the input space, and how do you weigh the votes alpha_t of different classifiers?
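To make the weighted vote concrete, here is a minimal Python sketch; the two stumps h1, h2 and their weights are made-up values for illustration, not anything from the slides:

```python
def weighted_vote(classifiers, alphas, x):
    """Weighted vote of weak classifiers h_t: x -> {-1, +1}.

    The combined prediction is H(x) = sign(sum_t alpha_t * h_t(x)).
    """
    score = sum(alpha * h(x) for h, alpha in zip(classifiers, alphas))
    return 1 if score >= 0 else -1

# Two hypothetical decision stumps on a 1-D input, with made-up weights
h1 = lambda x: 1 if x > 0.0 else -1
h2 = lambda x: 1 if x > 0.5 else -1
print(weighted_vote([h1, h2], [0.7, 0.3], 0.2))   # h1 votes +1, h2 votes -1 -> +1 wins
```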

Boosting [Schapire '89]. Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote. On each iteration t: weight each training example by how incorrectly it was classified, learn a weak hypothesis h_t, and assign a strength alpha_t to this hypothesis. Final classifier: H(x) = sign(sum_t alpha_t h_t(x)). Practically useful and theoretically interesting.

Learning from weighted data. Consider a weighted dataset where D(i) is the weight of the i-th training example (x_i, y_i). Interpretations: the i-th training example counts as D(i) examples; if I were to resample the data, I would get more samples of the heavier data points. Now, in all calculations, the i-th training example counts as D(i) examples. E.g., in MLE redefine Count(Y=y) to be a weighted count: unweighted data, Count(Y=y) = sum_{i=1}^m 1(y_i = y); with weights D(i), Count(Y=y) = sum_{i=1}^m D(i) 1(y_i = y).
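A tiny numeric illustration of the weighted count, with toy labels and weights chosen only for the example:

```python
import numpy as np

y = np.array([1, 1, -1, 1, -1])
D = np.array([0.4, 0.1, 0.2, 0.1, 0.2])   # example weights, summing to 1

# Unweighted count of Y = 1: each point counts once
print(np.sum(y == 1))        # 3

# Weighted count of Y = 1: point i counts as D(i) examples
print(np.sum(D[y == 1]))     # 0.6
```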

AdaBoost [Freund & Schapire '95]. Initialize D_1(i) = 1/m (initially equal weights). On each round t: train a weak learner (e.g., naïve Bayes, a decision stump) on the weighted data to get h_t; compute its weighted error eps_t; choose a strength alpha_t ("magic" positive weight); update D_{t+1}(i) proportional to D_t(i) exp(-alpha_t y_i h_t(x_i)), which increases the weight of point i if h_t is wrong on it (y_i h_t(x_i) = -1 < 0). Normalize so that the weights of all points sum to 1: sum_i D_{t+1}(i) = 1 for every t. Final classifier: H(x) = sign(sum_t alpha_t h_t(x)).
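The loop above can be sketched in a few lines of Python. This is a minimal illustration rather than the course's reference implementation, assuming scikit-learn decision stumps as the weak learner and labels in {-1, +1}:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """AdaBoost sketch with decision stumps; y must be in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                        # initially equal weights
    stumps, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1)    # weak learner: decision stump
        h.fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.sum(D[pred != y])                 # weighted training error
        if eps >= 0.5:                             # weak-learning assumption violated
            break
        alpha = 0.5 * np.log((1 - eps) / (eps + 1e-12))
        D *= np.exp(-alpha * y * pred)             # increase weight where h_t is wrong
        D /= D.sum()                               # weights of all points sum to 1
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Final classifier H(x) = sign(sum_t alpha_t h_t(x))."""
    agg = sum(a * h.predict(X) for h, a in zip(stumps, alphas))
    return np.sign(agg)
```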

What alpha_t to choose for hypothesis h_t? Weight update rule [Freund & Schapire '95]: D_{t+1}(i) proportional to D_t(i) exp(-alpha_t y_i h_t(x_i)), with weighted training error eps_t = sum_i D_t(i) 1(h_t(x_i) != y_i) (does h_t get the i-th point wrong?). Intuition for the choice: if eps_t = 0 (h_t perfectly classifies all weighted data points), alpha_t should be +infinity; if eps_t = 1 (h_t is perfectly wrong, so -h_t is perfectly right), alpha_t should be -infinity; if eps_t = 0.5, alpha_t = 0.
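The choice alpha_t = 0.5 ln((1 - eps_t)/eps_t), derived a few slides later, has exactly this behavior; a quick numeric check (the eps values are arbitrary):

```python
import numpy as np

# How alpha_t = 0.5 * ln((1 - eps_t) / eps_t) behaves for a few error levels
for eps in [0.01, 0.1, 0.3, 0.5, 0.7, 0.99]:
    alpha = 0.5 * np.log((1 - eps) / eps)
    print(f"eps_t = {eps:.2f}  ->  alpha_t = {alpha:+.2f}")
# eps_t near 0  -> large positive alpha_t (hypothesis trusted strongly)
# eps_t = 0.5   -> alpha_t = 0           (hypothesis ignored)
# eps_t near 1  -> large negative alpha_t (the flipped hypothesis -h_t is trusted)
```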

Boosting Example (Decision Stumps). [figure-only slides showing successive rounds: stump decision boundaries and the reweighted training points]

Analyzing training error. The analysis reveals what alpha_t to choose for hypothesis h_t in terms of eps_t, the weighted training error: if each weak learner h_t is slightly better than random guessing (eps_t < 0.5), then the training error of AdaBoost decays exponentially fast in the number of rounds T. [figure: training error vs. number of rounds]

Analyzing training error. The training error of the final classifier is bounded by a convex upper bound: (1/m) sum_i 1(H(x_i) != y_i) <= (1/m) sum_i exp(-y_i f(x_i)), where f(x) = sum_t alpha_t h_t(x) and H(x) = sign(f(x)). The exp loss upper-bounds the 0/1 loss. [figure: 0/1 loss and exp loss as functions of the margin y f(x)] If boosting can drive this upper bound to 0, then the training error is 0.

Analyzing training error. The training error of the final classifier is bounded by: (1/m) sum_i 1(H(x_i) != y_i) <= (1/m) sum_i exp(-y_i f(x_i)) = prod_t Z_t, where Z_t = sum_i D_t(i) exp(-alpha_t y_i h_t(x_i)) is the normalizer in the weight update rule. Proof: unwind the weight update rule and use the fact that the weights of all points add to 1.
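For reference, the standard unwinding of the weight recursion (as in Freund & Schapire; written out here since the slide's equations did not survive the transcript):

```latex
% Unwinding D_{t+1}(i) = D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} / Z_t with D_1(i) = 1/m:
\[
D_{T+1}(i) \;=\; \frac{1}{m}\,\frac{\exp\!\big(-y_i \sum_{t} \alpha_t h_t(x_i)\big)}{\prod_{t} Z_t}
\;=\; \frac{\exp(-y_i f(x_i))}{m \prod_{t} Z_t}.
\]
% Since 1\{H(x_i) \neq y_i\} \le \exp(-y_i f(x_i)) and \sum_i D_{T+1}(i) = 1:
\[
\frac{1}{m}\sum_{i=1}^{m} 1\{H(x_i)\neq y_i\}
\;\le\; \frac{1}{m}\sum_{i=1}^{m} \exp(-y_i f(x_i))
\;=\; \Big(\prod_{t=1}^{T} Z_t\Big) \sum_{i=1}^{m} D_{T+1}(i)
\;=\; \prod_{t=1}^{T} Z_t .
\]
```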

Analyzing training error. The training error of the final classifier is bounded by prod_t Z_t, where Z_t = sum_i D_t(i) exp(-alpha_t y_i h_t(x_i)). If Z_t < 1 on every round, the training error decreases exponentially (even though the weak learners may not be good, eps_t ~ 0.5). [figure: training error and the prod_t Z_t upper bound vs. round t]

What alpha_t to choose for hypothesis h_t? The training error of the final classifier is bounded by prod_t Z_t, where Z_t = sum_i D_t(i) exp(-alpha_t y_i h_t(x_i)). If we minimize prod_t Z_t, we minimize our training error. We can tighten this bound greedily by choosing alpha_t and h_t on each iteration to minimize Z_t.

What alpha_t to choose for hypothesis h_t? We can minimize this bound by choosing alpha_t on each iteration to minimize Z_t. For a boolean target function, this is accomplished by [Freund & Schapire '97]: alpha_t = 0.5 ln((1 - eps_t)/eps_t). Proof: write Z_t = (1 - eps_t) e^{-alpha_t} + eps_t e^{alpha_t}, set dZ_t/d alpha_t = 0, and solve for alpha_t; plugging the solution back in gives the minimized value Z_t = 2 sqrt(eps_t (1 - eps_t)).

Dumb classifiers made smart. The training error of the final classifier is bounded by prod_t Z_t = prod_t 2 sqrt(eps_t (1 - eps_t)), which, using 1 - x <= e^{-x}, shrinks exponentially; the shrinkage grows as eps_t moves away from 1/2. If each classifier is (at least slightly) better than random, eps_t < 0.5, AdaBoost will achieve zero training error exponentially fast (in the number of rounds T)! What about test error?
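Writing eps_t = 1/2 - gamma_t, the standard calculation behind this claim (filled in here because the slide's formula was garbled in the transcript) is:

```latex
\[
Z_t \;=\; 2\sqrt{\varepsilon_t(1-\varepsilon_t)}
\;=\; \sqrt{1 - 4\gamma_t^2}
\;\le\; \exp(-2\gamma_t^2)
\qquad\text{(using } 1 - x \le e^{-x}\text{)},
\]
\[
\text{so}\qquad
\frac{1}{m}\sum_{i=1}^{m} 1\{H(x_i)\neq y_i\}
\;\le\; \prod_{t=1}^{T} Z_t
\;\le\; \exp\!\Big(-2\sum_{t=1}^{T}\gamma_t^2\Big),
\]
```

which goes to zero exponentially fast as long as each edge gamma_t = 1/2 - eps_t stays bounded away from 0.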

Boosting results: digit recognition [Schapire, 1989]. [figure: test error and training error vs. number of boosting rounds] Boosting is often robust to overfitting, but not always; the test set error can keep decreasing even after the training error is zero.

Margin-Based Bounds [Schapire, Freund, Bartlett, Lee '98]. With high probability, the generalization error is bounded in terms of the fraction of training points with small margin. Boosting increases the margin very aggressively since it concentrates on the hardest examples. If the margin is large, more weak learners agree, so more rounds does not necessarily imply that the final classifier is getting more complex. The bound is independent of the number of rounds T! Boosting can still overfit if the margin is too small or the weak learners are too complex.
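The slide's formula did not survive the transcript; paraphrasing from memory the form of the published bound (treat the exact form and constants as indicative, not exact): for the normalized vote f(x) = sum_t alpha_t h_t(x) / sum_t alpha_t and any margin threshold theta > 0, with high probability

```latex
\[
\Pr_{\mathcal{D}}\big[\,y f(x) \le 0\,\big]
\;\le\;
\Pr_{S}\big[\,y f(x) \le \theta\,\big]
\;+\;
\tilde{O}\!\left(\sqrt{\frac{d}{m\,\theta^2}}\right),
\]
```

where d is the VC dimension of the weak hypothesis class and m the number of training examples; note that T does not appear.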

[figure: grid of train/test error curves on several datasets; boosting is okay in some cases and overfits in others]

Boosting and Logistic Regression. Logistic regression is equivalent to minimizing the log loss; boosting minimizes a similar loss function, the exp loss, over a weighted average of weak learners f(x) = sum_t alpha_t h_t(x). [figure: log loss, exp loss, and 0/1 loss as functions of the margin] Both are smooth approximations of the 0/1 loss!
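A quick numeric comparison of the three losses as a function of the margin m = y f(x); the margin values are arbitrary:

```python
import numpy as np

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
zero_one = (margins <= 0).astype(float)        # 0/1 loss
exp_loss = np.exp(-margins)                    # minimized by boosting
log_loss = np.log(1.0 + np.exp(-margins))      # minimized by logistic regression

for m, z, e, l in zip(margins, zero_one, exp_loss, log_loss):
    print(f"margin {m:+.1f}: 0/1 = {z:.0f}, exp = {e:.3f}, log = {l:.3f}")
```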

Boosting and Logistic Regression. Logistic regression: minimize the log loss, defining f(x) = w_0 + sum_j w_j x_j where the x_j are predefined features (a linear classifier), and jointly optimize over all weights w_0, w_1, w_2, ... Boosting: minimize the exp loss, defining f(x) = sum_t alpha_t h_t(x) where the h_t(x) are defined dynamically to fit the data (not a linear classifier in the original features), and the weights alpha_t are learned incrementally, one per iteration t.

Hard & Soft Decision. Weighted average of weak learners: f(x) = sum_t alpha_t h_t(x). Hard decision / predicted label: H(x) = sign(f(x)). Soft decision: a probability obtained from f(x), based on the analogy with logistic regression.
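One common way to turn the boosting score into a probability is the logistic-style calibration P(Y=1|x) = 1/(1 + exp(-2 f(x))); the specific factor of 2 is the choice suggested by the logistic-regression analogy and is an assumption here, since the transcript does not show the slide's formula:

```python
import numpy as np

def soft_decision(f_x):
    """Map the boosting score f(x) = sum_t alpha_t h_t(x) to a probability.

    Assumed calibration: P(Y=1 | x) = 1 / (1 + exp(-2 f(x))).
    """
    return 1.0 / (1.0 + np.exp(-2.0 * f_x))

print(soft_decision(0.0))    # 0.5   -> maximally uncertain
print(soft_decision(2.0))    # ~0.982 -> confidently positive
```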

Effect of Outliers. The good: boosting can identify outliers, since it focuses on examples that are hard to categorize. The bad: too many outliers can degrade classification performance and dramatically increase the time to convergence.

Bagging [Breiman, 1996]. A related approach to combining classifiers: 1. Run independent weak learners on bootstrap replicates (sampled with replacement) of the training set. 2. Average/vote over the weak hypotheses. Bagging vs. boosting: bagging resamples data points, while boosting reweights them (modifies their distribution); in bagging the weight of each classifier is the same, while in boosting it depends on the classifier's accuracy; bagging gives only variance reduction, while boosting reduces both bias and variance (the learning rule becomes more complex with iterations).
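For contrast with the AdaBoost sketch earlier, here is a minimal bagging sketch under the same assumptions (scikit-learn trees, labels in {-1, +1}); it is an illustration, not a reference implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging(X, y, n_models=25, seed=0):
    """Bagging sketch: train each model on a bootstrap replicate of the data."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)     # sample with replacement
        m = DecisionTreeClassifier()         # independent weak learner, no reweighting
        m.fit(X[idx], y[idx])
        models.append(m)
    return models

def bagging_predict(models, X):
    """Unweighted majority vote over the bagged models."""
    votes = np.sum([m.predict(X) for m in models], axis=0)
    return np.sign(votes)
```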

Boosting Summary. Combine weak classifiers to obtain a very strong classifier: a weak classifier need only be slightly better than random on the training data, yet the resulting strong classifier can eventually provide zero training error. The AdaBoost algorithm. Boosting vs. logistic regression: similar loss functions; a single optimization (LR) vs. incrementally improving classification (boosting). Most popular application of boosting: boosted decision stumps! Very simple to implement, and a very effective classifier.