2D1431 Machine Learning. Bagging & Boosting

Outline
Bagging and Boosting
Evaluating Hypotheses
Feature Subset Selection
Model Selection

Question of the Day. Three salesmen arrive at a hotel one night and ask for a room. The manager informs them the price is $30.00. Each gives the manager a ten, and they retire for the night. Shortly afterwards, the manager remembers that the room was at a discount, on account of it being haunted. So he tells his bellhop that the room was only $25.00, gives the bellhop five dollars and tells him to give the men the refund. The bellhop is slightly crooked and rationalizes, "Five doesn't divide well among three, I'll save them some arguing and just give them a dollar each." Which he does, and keeps the leftover two dollars for himself. Now each of the men paid $9.00 for the room, for a total of $27.00. The bellhop has $2.00. $27+$2=$29. What happened to the missing dollar?

Classification Problem. Assume a set S of N instances x_i ∈ X, each belonging to one of M classes {c_1, …, c_M}. The training set consists of pairs (x_i, c_i). A classifier C assigns a classification C(x) ∈ {c_1, …, c_M} to an instance x. The classifier learned in trial t is denoted C_t, while C* is the composite boosted classifier.

Linear Discriminant Analysis. Possible classes: {x, o}. LDA predicts x if w_1 x_1 + w_2 x_2 + w_3 > 0, and o otherwise. [Figure: instances labelled x and o in the (x_1, x_2) plane, separated by the decision boundary w_1 x_1 + w_2 x_2 + w_3 = 0.]
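
To make the decision rule concrete, here is a minimal Python sketch of the linear discriminant above; the weight values and the example instances are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def lda_predict(x, w):
    """Linear decision rule: predict 'x' if w_1*x_1 + w_2*x_2 + w_3 > 0, else 'o'."""
    score = w[0] * x[0] + w[1] * x[1] + w[2]   # w_3 acts as the bias term
    return "x" if score > 0 else "o"

# Illustrative weights and instances.
w = np.array([1.0, -1.0, 0.5])
print(lda_predict(np.array([2.0, 1.0]), w))    # 'x' (score = 1.5)
print(lda_predict(np.array([0.0, 2.0]), w))    # 'o' (score = -1.5)
```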

Bagging & Boosting. Bagging and boosting aggregate multiple hypotheses generated by the same learning algorithm invoked over different distributions of the training data [Breiman 1996, Freund & Schapire 1996]. Both generate a composite classifier with a smaller error on the training data, because it combines multiple hypotheses which individually have larger errors.

Bagging. For each trial t = 1, 2, …, T create a bootstrap sample S_t. Obtain a hypothesis C_t on the bootstrap sample S_t. The final classification is the majority class:
c_BA(x) = argmax_{c_j ∈ C} Σ_t δ(c_j, C_t(x))
[Figure: classifiers C_1, …, C_T, each trained on its own bootstrap training set; for an instance x with votes C_1(x) = +, C_2(x) = -, …, C_T(x) = +, the majority vote yields c_BA(x) = +.]

Bagging. From the overall training set S, randomly sample (with replacement) T different training sets S_1, …, S_T of size N. For each sample set S_t obtain a hypothesis C_t. To an unseen instance x, assign the majority classification c_BA(x) among the hypotheses' classifications c_t(x):
c_BA(x) = argmax_{c_j ∈ C} Σ_t δ(c_j, C_t(x))
[Figure: hypotheses C_1, …, C_T are trained on the samples S_1, …, S_T; their classifications c_1(x), …, c_T(x) of an instance x are combined by majority vote into c_BA(x).]
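
A rough sketch of this procedure in Python, bagging decision trees; the scikit-learn base learner, the number of trials T, and the toy data are illustrative assumptions, not part of the lecture.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=25, seed=0):
    """Train T hypotheses C_t, each on a bootstrap sample S_t of size N."""
    rng = np.random.default_rng(seed)
    N = len(X)
    hypotheses = []
    for t in range(T):
        idx = rng.integers(0, N, size=N)                  # sample S_t with replacement
        C_t = DecisionTreeClassifier().fit(X[idx], y[idx])
        hypotheses.append(C_t)
    return hypotheses

def bagging_predict(hypotheses, x):
    """Majority vote: c_BA(x) = argmax_{c_j} sum_t delta(c_j, C_t(x))."""
    votes = [C_t.predict(x.reshape(1, -1))[0] for C_t in hypotheses]
    return Counter(votes).most_common(1)[0][0]

# Toy usage with illustrative data.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 1, 1])
ensemble = bagging_fit(X, y, T=11)
print(bagging_predict(ensemble, np.array([0.9, 0.2])))    # typically prints 1
```

Because each C_t sees a different bootstrap sample, the gain comes from averaging away the variance of an unstable base learner, which is the point of the next slide.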

Bagging. Bagging requires unstable classifiers, such as decision trees or concept learners. "The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy." (Breiman 1996)

Boosting. Boosting maintains a weight w_i for each instance <x_i, c_i> in the training set. The higher the weight w_i, the more the instance x_i influences the next hypothesis learned. At each trial, the weights are adjusted to reflect the performance of the previously learned hypothesis, with the result that the weight of correctly classified instances is decreased and the weight of incorrectly classified instances is increased.

Boosting. Construct a hypothesis C_t from the current distribution of instances described by w^t. Adjust the weights according to the classification error ε_t of classifier C_t. The strength α_t of a hypothesis depends on its training error ε_t:
α_t = ½ ln((1 - ε_t) / ε_t)
[Figure: learning loop in which a hypothesis C_t with strength α_t is learned from the set of weighted instances (x_i, w_i^t), after which the weights are adjusted.]

Boosting. The final hypothesis C_BO aggregates the individual hypotheses C_t by weighted voting:
c_BO(x) = argmax_{c_j ∈ C} Σ_{t=1..T} α_t δ(c_j, C_t(x))
Each hypothesis' vote is a function of its accuracy. Let w_i^t denote the weight of an instance x_i at trial t; for every x_i, w_i^1 = 1/N. The weight w_i^t reflects the importance (e.g. probability of occurrence) of the instance x_i in the sample set S_t. At each trial t = 1, …, T a hypothesis C_t is constructed from the given instances under the distribution w^t. This requires that the learning algorithm can deal with fractional (weighted) examples.

Boosting. The error of the hypothesis C_t is measured with respect to the weights:
ε_t = Σ_{i: C_t(x_i) ≠ c_i} w_i^t / Σ_i w_i^t
α_t = ½ ln((1 - ε_t) / ε_t)
Update the weights w_i^t of correctly and incorrectly classified instances by
w_i^{t+1} = w_i^t e^{-α_t}   if C_t(x_i) = c_i
w_i^{t+1} = w_i^t e^{α_t}    if C_t(x_i) ≠ c_i
Afterwards, normalize the w_i^{t+1} such that they form a proper distribution, Σ_i w_i^{t+1} = 1.
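
To see how the update behaves, here is a small worked example in Python with hypothetical numbers (four uniformly weighted instances, exactly one of which C_t misclassifies); none of these values come from the lecture.

```python
import numpy as np

# Hypothetical setup: four instances with uniform weights w_i^t = 1/4,
# and the current hypothesis C_t misclassifies only the last one.
w = np.full(4, 0.25)
misclassified = np.array([False, False, False, True])

eps = w[misclassified].sum() / w.sum()           # 0.25
alpha = 0.5 * np.log((1 - eps) / eps)            # 0.5 * ln(3) ~ 0.549
w_next = w * np.exp(np.where(misclassified, alpha, -alpha))
w_next /= w_next.sum()                           # renormalize to a distribution

print(round(eps, 3), round(alpha, 3))            # 0.25 0.549
print(np.round(w_next, 3))                       # [0.167 0.167 0.167 0.5]
```

After one update the misclassified instance carries half of the total weight, so the next hypothesis is pushed to get it right.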

Boosting. The classification c_BO(x) of the boosted hypothesis is obtained by summing the votes of the hypotheses C_1, C_2, …, C_T, where the vote of each hypothesis C_t is weighted by α_t:
c_BO(x) = argmax_{c_j ∈ C} Σ_{t=1..T} α_t δ(c_j, C_t(x))
[Figure: hypotheses C_1, …, C_T are trained on the weighted training sets (S, w^1), …, (S, w^T); their classifications c_1(x), …, c_T(x) are combined by weighted voting into c_BO(x).]

Boosting. Given {<x_1, c_1>, …, <x_m, c_m>}:
Initialize w_i^1 = 1/m.
For t = 1, …, T:
  train the weak learner using the distribution w^t
  get a weak hypothesis C_t : X → C with error ε_t = Σ_{i: C_t(x_i) ≠ c_i} w_i^t / Σ_i w_i^t
  choose α_t = ½ ln((1 - ε_t) / ε_t)
  update: w_i^{t+1} = w_i^t e^{-α_t} if C_t(x_i) = c_i, and w_i^{t+1} = w_i^t e^{α_t} if C_t(x_i) ≠ c_i
Output the final hypothesis c_BO(x) = argmax_{c_j ∈ C} Σ_{t=1..T} α_t δ(c_j, C_t(x)).
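
A minimal Python sketch of this procedure, following the error, strength, and weight-update formulas above; the choice of depth-1 decision trees (stumps) as the weak learner and the scikit-learn interface are assumptions, not part of the slide.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boosting_fit(X, y, T=20):
    """Boosting as on the slide: reweight instances, learn C_t, compute alpha_t."""
    m = len(X)
    w = np.full(m, 1.0 / m)                      # w_i^1 = 1/m
    hypotheses, alphas = [], []
    for t in range(T):
        # Weak learner trained on the current distribution w^t (stumps are an assumption).
        C_t = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = C_t.predict(X)
        eps = np.sum(w[pred != y]) / np.sum(w)   # weighted training error eps_t
        if eps == 0 or eps >= 0.5:               # safeguard: perfect or too-weak hypothesis
            if eps == 0:
                hypotheses.append(C_t)
                alphas.append(1.0)
            break
        alpha = 0.5 * np.log((1 - eps) / eps)    # strength alpha_t
        w = w * np.exp(np.where(pred == y, -alpha, alpha))
        w = w / np.sum(w)                        # renormalize to a distribution
        hypotheses.append(C_t)
        alphas.append(alpha)
    return hypotheses, np.array(alphas)

def boosting_predict(hypotheses, alphas, X):
    """Weighted vote: c_BO(x) = argmax_{c_j} sum_t alpha_t * delta(c_j, C_t(x))."""
    classes = hypotheses[0].classes_
    votes = np.zeros((len(X), len(classes)))
    for C_t, alpha in zip(hypotheses, alphas):
        pred = C_t.predict(X)
        for j, c in enumerate(classes):
            votes[:, j] += alpha * (pred == c)
    return classes[np.argmax(votes, axis=1)]
```

The stopping test ε_t ≥ 0.5 is a common safeguard rather than something stated on the slide; with it, every α_t is positive, so each vote rewards the hypothesis' own accuracy.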

Bayes MAP Hypothesis. [Figure: Bayes MAP hypothesis for two classes, x and o; incorrectly classified instances are marked in red.]

Boosted Bayes MAP Hypothesis. [Figure: the boosted Bayes MAP hypothesis has a more complex decision surface than any individual hypothesis alone.]

Feature Subset Selection. [Figure: the input features feed a feature subset search; candidate subsets are scored by feature subset evaluation, which invokes the induction algorithm.]

Feature Subset Selection. Evaluate the accuracy of the hypothesis for a given subset of features by means of n-fold cross-validation.
Forward elimination: starts with the empty set of features and greedily adds the feature that improves the estimated accuracy the most (see the sketch below).
Backward elimination: starts with the set of all features and greedily removes the worst feature.
Forward-backward elimination: alternates between backward and forward elimination, choosing whichever step improves the accuracy of the current subset of features.
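
A sketch of forward elimination in Python, greedily adding the feature that most improves n-fold cross-validated accuracy; the decision-tree classifier and the 5-fold default are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_elimination(X, y, n_folds=5):
    """Start from the empty feature set and greedily add the feature that
    improves the cross-validated accuracy estimate the most."""
    remaining = list(range(X.shape[1]))
    selected, best_acc = [], 0.0
    while remaining:
        scores = []
        for f in remaining:
            candidate = selected + [f]
            acc = cross_val_score(DecisionTreeClassifier(), X[:, candidate], y,
                                  cv=n_folds, scoring="accuracy").mean()
            scores.append((acc, f))
        acc, f = max(scores)                 # best single addition this round
        if acc <= best_acc:                  # no feature improves the estimate: stop
            break
        selected.append(f)
        remaining.remove(f)
        best_acc = acc
    return selected, best_acc
```

Backward elimination works the same way in reverse, starting from all features and removing the feature whose removal hurts the estimated accuracy the least.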

Question of the Day. What happened to the missing dollar? This classic is a simple wording deception that occurs in the last sentence: "Now each of the men paid $9.00 for the room, for a total of $27.00. The bellhop has $2.00. $27+$2=$29. What happened to the missing dollar?" There is no missing dollar; the arithmetic above makes no sense in the context of the problem. There is no reason to add the two dollars the bellhop kept to the nine dollars each of the men spent. Each man paid nine dollars, for a total of twenty-seven dollars. Of that twenty-seven, two went to the bellhop and twenty-five went to the hotel. The equation should be $27 - $2 = $25, the price of the room.