CSE 151 Machine Learning. Instructor: Kamalika Chaudhuri

Ensemble Learning: how to combine multiple classifiers into a single one. This works well if the classifiers are complementary. This class covers two types of ensemble methods: Bagging and Boosting.

Bagging (Bootstrap AGgregatING)
Input: n labelled training examples (x_i, y_i), i = 1,...,n.
Algorithm: repeat k times: select m samples out of n with replacement to get a training set S_i, then train a classifier (decision tree, k-NN, perceptron, etc.) h_i on S_i.
Output: classifiers h_1,..., h_k.
Classification: on a test example x, output majority(h_1(x),..., h_k(x)).
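The procedure above maps directly to a short implementation. Below is a minimal sketch in Python/NumPy (not from the lecture): `train_classifier` is a hypothetical callable that fits any base learner and returns a prediction function, and labels are assumed to be -1/+1 so the majority vote can be taken as a sign.

```python
import numpy as np

def bagging_train(X, y, train_classifier, k=25, m=None, seed=None):
    """Train k classifiers, each on a bootstrap sample of m examples."""
    rng = np.random.default_rng(seed)
    n = len(X)
    m = n if m is None else m                  # popular choice: m = n
    classifiers = []
    for _ in range(k):
        idx = rng.integers(0, n, size=m)       # sample with replacement
        classifiers.append(train_classifier(X[idx], y[idx]))
    return classifiers

def bagging_predict(classifiers, x):
    """Majority vote over the k classifiers (labels assumed to be -1/+1)."""
    votes = sum(h(x) for h in classifiers)
    return 1 if votes >= 0 else -1
```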

Example
Input: n labelled training examples (x_i, y_i), i = 1,...,n.
Algorithm: repeat k times: select m samples out of n with replacement to get a training set S_i, then train a classifier (decision tree, k-NN, perceptron, etc.) h_i on S_i.
How do we pick m? A popular choice is m = n. This is still different from working with the entire training set. Why?

Bagging
Input: n labelled training examples S = {(x_i, y_i)}, i = 1,...,n. Suppose we select n samples out of n with replacement to get training set S_i. This is still different from working with the entire training set. Why?
Pr(S_i = S) = n!/n^n (a tiny number, exponentially small in n)
Pr((x_i, y_i) not in S_i) = (1 - 1/n)^n ≈ e^{-1}
For large data sets, about 37% of the data set is left out!
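As a quick numerical check of the 37% figure (an illustrative snippet, not part of the lecture), the probability that a fixed example is missed by a bootstrap sample of size n is (1 - 1/n)^n, which approaches e^{-1} ≈ 0.368 as n grows:

```python
import math

# Probability that a fixed training example is NOT chosen in one bootstrap
# sample of size n drawn with replacement: (1 - 1/n)^n -> e^{-1} as n grows.
for n in (10, 100, 10_000, 1_000_000):
    print(n, (1 - 1 / n) ** n)
print("e^-1 =", math.exp(-1))   # ~0.368, i.e. about 37% of the data is left out
```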

Bias and Variance Classification error = Bias + Variance

Bias and Variance. Classification error = Bias + Variance. Bias is the true error of the best classifier in the concept class (e.g., the best linear separator, or the best decision tree with a fixed number of nodes). Bias is high if the concept class cannot model the true data distribution well; it does not depend on the training set size.

Bias. (Figure: a high-bias fit vs. a low-bias fit.) Underfitting: when you have high bias.

Bias and Variance. Classification error = Bias + Variance. Variance is the error of the trained classifier with respect to the best classifier in the concept class. Variance depends on the training set size: it decreases with more training data, and increases with more complicated classifiers.

Variance. Overfitting: when you have very high variance.

Bias and Variance. Classification error = Bias + Variance. If you have high bias, both training and test error will be high. If you have high variance, training error will be low and test error will be high.

Bias-Variance Tradeoff. If we make the concept class more complicated (e.g., moving from linear to quadratic classification, or increasing the number of nodes in the decision tree), then bias decreases but variance increases. Thus there is a bias-variance tradeoff.
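One way to see the tradeoff concretely is to vary the complexity of a classifier on synthetic data. The sketch below is illustrative only (not from the lecture): it uses a 1-D k-nearest-neighbour classifier, where a small k behaves like a complicated concept class (low bias, high variance, low training error) and a large k like a simple one (high bias). All data and parameter choices are made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic 1-D data: label is 1 iff x > 0.5, with 10% label noise."""
    X = rng.uniform(0, 1, n)
    y = (X > 0.5).astype(int)
    flip = rng.random(n) < 0.1
    return X, np.where(flip, 1 - y, y)

def knn_predict(X_train, y_train, X_test, k):
    """Predict by majority vote among the k nearest training points."""
    preds = []
    for x in X_test:
        nearest = np.argsort(np.abs(X_train - x))[:k]
        preds.append(int(y_train[nearest].mean() >= 0.5))
    return np.array(preds)

X_tr, y_tr = make_data(50)
X_te, y_te = make_data(2000)

for k in (1, 5, 25, 49):     # small k: complex model; large k: simple model
    err_tr = np.mean(knn_predict(X_tr, y_tr, X_tr, k) != y_tr)
    err_te = np.mean(knn_predict(X_tr, y_tr, X_te, k) != y_te)
    print(f"k={k:2d}: train error {err_tr:.2f}, test error {err_te:.2f}")
```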

Why is Bagging useful? Bagging reduces the variance of the classifier, but it doesn't help much with bias.

Ensemble Learning: how to combine multiple classifiers into a single one. This works well if the classifiers are complementary. This class covers two types of ensemble methods: Bagging and Boosting.

Boosting
Goal: determine whether an email is spam or not, based on the text in it.
From: Yuncong Chen. Text: "151 homeworks are all graded..." (Not Spam)
From: Work from home solutions. Text: "Earn money without working!" (Spam)
Sometimes it is: * easy to come up with simple rule-of-thumb classifiers, * hard to come up with a single high-accuracy rule.

Boosting
Goal: detect whether an image contains a face.
(Figure: a rectangle feature placed over a face image; the rule of thumb asks whether the black region is darker on average than the white region.)
Sometimes it is: * easy to come up with simple rule-of-thumb classifiers, * hard to come up with a single high-accuracy rule.

Boosting
Weak learner: a simple rule-of-thumb classifier that doesn't necessarily work very well.
Strong learner: a good classifier.
Boosting: how can we combine many weak learners into a strong learner?

Boosting Procedure:
1. Design a method for finding a good rule of thumb.
2. Apply the method to the training data to get a good rule of thumb.
3. Modify the training data to get a 2nd data set.
4. Apply the method to the 2nd data set to get a good rule of thumb.
5. Repeat T times...

Boosting
1. How do we get a good rule of thumb? This depends on the application, e.g., single-node decision trees.
2. How do we choose examples on each round? Focus on the hardest examples so far -- namely, the examples misclassified most often by the previous rules of thumb.
3. How do we combine the rules of thumb into a prediction rule? Take a weighted majority of the rules.

Some Notation
Let D be a distribution over examples, and h be a classifier. The error of h with respect to D is:
err_D(h) = Pr_{(X,Y) ~ D}(h(X) ≠ Y)
Example: below, X is uniform over [0, 1], and Y = 1 if X > 0.5, 0 otherwise. (Figure: a threshold classifier h on [0, 1] with err_D(h) = 0.25.)

Some Notation
Let D be a distribution over examples, and h be a classifier. The error of h with respect to D is:
err_D(h) = Pr_{(X,Y) ~ D}(h(X) ≠ Y)
h is called a weak learner if err_D(h) < 0.5. If you guess completely randomly, the error is 0.5.

Some Notation
Given training examples {(x_i, y_i)}, i = 1,...,n, we can assign weights w_i to the examples. If the w_i sum to 1, then we can think of them as a distribution W over the examples. The error of a classifier h with respect to W is:
err_W(h) = Σ_{i=1}^{n} w_i 1(h(x_i) ≠ y_i)
Note: 1(·) here is the indicator function.
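This weighted error is straightforward to compute; the snippet below is a minimal sketch (function and variable names are illustrative, not from the lecture). With uniform weights w_i = 1/n it reduces to the ordinary training error rate.

```python
import numpy as np

def weighted_error(h, X, y, w):
    """err_W(h) = sum_i w_i * 1(h(x_i) != y_i), where the weights w sum to 1."""
    predictions = np.array([h(x) for x in X])
    return float(np.sum(w * (predictions != y)))

# Example with a toy threshold classifier on 1-D inputs and uniform weights:
X = np.array([0.1, 0.4, 0.6, 0.9])
y = np.array([-1, -1, 1, 1])
w = np.full(len(X), 1 / len(X))
h = lambda x: 1 if x > 0.5 else -1
print(weighted_error(h, X, y, w))   # 0.0: this h classifies every example correctly
```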

Boosting (AdaBoost)
Given a training set S = {(x_1, y_1),..., (x_n, y_n)}, with y in {-1, 1}.
For t = 1,..., T: construct a distribution D_t on the examples, and find a weak learner h_t which has small error err_{D_t}(h_t) with respect to D_t. Then output the final classifier.
Initially, D_1(i) = 1/n for all i (uniform). Given D_t and h_t:
D_{t+1}(i) = (D_t(i) / Z_t) exp(-α_t y_i h_t(x_i))
where α_t = (1/2) ln((1 - err_{D_t}(h_t)) / err_{D_t}(h_t)) and Z_t is a normalization constant.
D_{t+1}(i) goes down if x_i is classified correctly by h_t, and goes up otherwise; a high D_{t+1}(i) means x_i is a hard example.
α_t is higher if h_t has low error with respect to D_t, lower otherwise; it is > 0 whenever err_{D_t}(h_t) < 0.5.
Final classifier: sign(Σ_{t=1}^{T} α_t h_t(x))
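Putting the pieces together, here is a compact sketch of the AdaBoost loop above in Python/NumPy (an illustration, not the lecture's code). The `weak_learner` argument is a placeholder for whatever rule-of-thumb finder is used, e.g. a decision-stump trainer; it is assumed to return a classifier whose weighted error is strictly between 0 and 0.5, and labels are assumed to be in {-1, +1}.

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """AdaBoost as on the slides. y must take values in {-1, +1}.
    weak_learner(X, y, D) returns a classifier h with h(X) in {-1, +1}^n
    and weighted error 0 < err_D(h) < 0.5."""
    n = len(y)
    D = np.full(n, 1.0 / n)               # D_1(i) = 1/n (uniform)
    hs, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)
        pred = h(X)
        err = np.sum(D * (pred != y))     # err_{D_t}(h_t)
        alpha = 0.5 * np.log((1 - err) / err)
        D = D * np.exp(-alpha * y * pred) # up-weight misclassified examples
        D /= D.sum()                      # Z_t = normalization constant
        hs.append(h)
        alphas.append(alpha)

    def final_classifier(X_new):
        scores = sum(a * h(X_new) for a, h in zip(alphas, hs))
        return np.sign(scores)            # sign(sum_t alpha_t h_t(x))

    return final_classifier
```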

Boosting: Example (Schapire, 2011). Weak classifiers: horizontal or vertical half-planes. (Figures: several successive rounds of boosting on a toy 2-D data set, ending with the final classifier formed as a weighted combination of the half-planes.)

How to Find the Stopping Time
The AdaBoost loop above runs for T rounds. To find the stopping time, use a validation dataset: stop when the error on the validation dataset stops getting better, or when you can't find a good rule of thumb. (Figure: validation error vs. iterations.)
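As a small illustration of this stopping rule (not from the lecture), suppose we have recorded the validation error of the boosted classifier after each round. The helper below returns the round with the lowest validation error and gives up once it has failed to improve for a while; the `patience` parameter is an assumption made for the sketch.

```python
def choose_stopping_time(val_errors, patience=10):
    """Pick the boosting round with the lowest validation error, stopping the
    scan once the error has not improved for `patience` consecutive rounds."""
    best_t, best_err = 0, float("inf")
    for t, err in enumerate(val_errors, start=1):
        if err < best_err:
            best_t, best_err = t, err
        elif t - best_t >= patience:
            break    # validation error has stopped getting better
    return best_t

# Example: validation error that improves, then plateaus and rises.
print(choose_stopping_time([0.30, 0.22, 0.18, 0.17, 0.17, 0.18, 0.19, 0.20]))  # -> 4
```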