Learning with multiple models. Boosting.


CS 2750 Machine Learning
Lecture 21: Learning with multiple models. Boosting.
Milos Hauskrecht, milos@cs.pitt.edu, 5329 Sennott Square

Learning with multiple models: Approach 2

Approach 2: use multiple models (classifiers, regressors) that cover the complete input (x) space and combine their outputs.
Committee machines: combine the predictions of all models to produce the output (as sketched below)
- Regression: averaging
- Classification: a majority vote
Goal: improve the accuracy of the base model
Methods:
- Bagging (the same base models)
- Boosting (the same base models)
- Stacking (different base models) -- not covered
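A minimal illustration (not from the lecture) of the two combination rules just listed, with made-up predictions from three committee members:

```python
import numpy as np

# Hypothetical outputs of three committee members (illustrative values only).
regression_preds = np.array([2.1, 1.9, 2.4])   # one numeric prediction per model
class_preds = np.array([1, 0, 1])              # one binary class label per model

# Regression: combine by averaging the individual outputs.
combined_regression = regression_preds.mean()                      # ~2.13

# Classification: combine by a majority vote over the predicted labels.
combined_class = np.bincount(class_preds, minlength=2).argmax()    # class 1
print(combined_regression, combined_class)
```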

Bagging algorithm

Training:
For each model M1, M2, ..., Mk:
- Randomly sample with replacement N examples from the training set (bootstrap)
- Train a chosen base model (e.g. a neural network, a decision tree) on the sample

[Schematic: the training data (N examples) yield bootstrap samples Data 1, Data 2, ..., Data k, each of size N, which are used to train models M1, M2, ..., Mk.]

Test:
For each test example:
- Run all base models M1, M2, ..., Mk
- Predict by combining the results of all k trained models:
  - Regression: averaging
  - Classification: a majority vote
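A minimal sketch of this procedure (my code, not from the lecture), assuming integer class labels and a fully grown decision tree as the base model:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, k=10, rng=np.random.default_rng(0)):
    """Train k base models, each on a bootstrap sample of the N training examples."""
    N = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, N, size=N)       # sample N indices with replacement
        model = DecisionTreeClassifier()       # fully grown tree: a high-variance base model
        models.append(model.fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classification: majority vote over the k trained models."""
    preds = np.stack([m.predict(X) for m in models]).astype(int)   # shape (k, n_test)
    return np.apply_along_axis(lambda votes: np.bincount(votes).argmax(), 0, preds)
```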

When Bagging works

Main property of Bagging (proof omitted):
- Bagging decreases the variance of the base model without changing the bias. Why? Averaging! (A numerical illustration follows below.)

Bagging typically helps:
- when applied with an over-fitted base model that depends strongly on the actual training data
- example: fully grown decision trees

It does not help much:
- when the base model has high bias
- when the base model is robust to changes in the training data (due to sampling)

Boosting

Bagging:
- multiple models cover the complete space; a learner is not biased to any region
- learners are learned independently

Boosting:
- every learner covers the complete space
- learners are biased towards regions not predicted well by the other learners
- learners are dependent
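A quick numerical illustration (mine, not from the slides) of why averaging reduces variance: averaging k independent, equally noisy estimators leaves the bias unchanged and shrinks the variance by roughly a factor of k. Bootstrap replicates are correlated in practice, so the reduction achieved by bagging is smaller, but the direction of the effect is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value, noise_std, k, trials = 1.0, 0.5, 10, 100_000

# One noisy estimate vs. the average of k independent noisy estimates.
single = true_value + noise_std * rng.standard_normal(trials)
averaged = (true_value + noise_std * rng.standard_normal((trials, k))).mean(axis=1)

print("bias:", single.mean() - true_value, averaged.mean() - true_value)  # both ~0
print("variance:", single.var(), averaged.var())                          # ~0.25 vs ~0.025
```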

Boosting. Theoretical foundations.

PAC: Probably Approximately Correct framework; an $(\varepsilon, \delta)$ solution.
PAC learning: learning with a pre-specified error $\varepsilon$ and a confidence parameter $\delta$: the probability that the misclassification error $ME(c)$ is larger than $\varepsilon$ is smaller than $\delta$:
$P(ME(c) > \varepsilon) \le \delta$
Alternative rewrite: $P(Acc(c) \ge 1 - \varepsilon) \ge 1 - \delta$
- Accuracy $(1 - \varepsilon)$: percent of correctly classified samples in the test
- Confidence $(1 - \delta)$: the probability that in one experiment this accuracy will be achieved

PAC Learnability

Strong (PAC) learnability: there exists a learning algorithm that efficiently learns the classification with pre-specified error and confidence values.
Strong (PAC) learner: a learning algorithm P that:
- given an arbitrary classification error $\varepsilon$ (< 1/2) and confidence $\delta$ (< 1/2), or in other words a classification accuracy > $(1 - \varepsilon)$ with confidence probability > $(1 - \delta)$,
- outputs a classifier that satisfies these parameters.
Efficiency: it runs in time polynomial in $1/\varepsilon$, $1/\delta$.
This implies that the number of samples N is polynomial in $1/\varepsilon$, $1/\delta$.
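Putting the definition just given into one compact formal statement (a standard restatement, using the notation above):

```latex
% Strong PAC learnability, restated compactly in the notation above.
\[
\forall\, \varepsilon, \delta \in \left(0, \tfrac{1}{2}\right):\qquad
P\bigl( ME(c) > \varepsilon \bigr) \le \delta ,
\]
% with running time (and hence sample size N) polynomial in 1/\varepsilon and 1/\delta.
```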

Weak Learner

Weak learner: a learning algorithm (learner) M that achieves some fixed (not arbitrary) error $\varepsilon_0$ (< 1/2) and confidence $\delta_0$ (< 1/2).
Alternatively: a classification accuracy > 0.5 with probability > 0.5, and this on an arbitrary distribution of the data entries.

Weak learnability = Strong (PAC) learnability?

Assume there exists a weak learner: it is better than a random guess (> 50%), with confidence higher than 50%, on any data distribution.
Question: Is the problem then also strongly PAC-learnable? Can we construct an algorithm P that achieves an arbitrary $(\varepsilon, \delta)$ accuracy?
Why is this important? The usual classification methods (decision trees, neural nets) have specified but uncontrollable performance. Can we improve the performance to achieve any pre-specified accuracy (confidence)?

Weak = Strong learnability!

Proof due to R. Schapire. An arbitrary $(\varepsilon, \delta)$ improvement is possible.
Idea: combine multiple weak learners together.
Given a weak learner W with confidence $\delta_0$ and maximal error $\varepsilon_0$, it is possible:
- to improve (boost) the confidence
- to improve (boost) the accuracy
by training different weak learners on slightly different datasets.

Boosting accuracy

[Schematic: samples drawn from the distribution are used to train learners H1, H2, H3; the picture marks the regions of correct classification, wrong classification, and the region where H1 and H2 classify differently.]

Training:
- Sample randomly from the distribution of examples and train hypothesis H1 on the sample.
- Evaluate the accuracy of H1 on the distribution.
- Sample randomly such that H1 classifies half of the samples correctly and the other half incorrectly; train hypothesis H2 on this sample.
- Train H3 on samples from the distribution on which H1 and H2 classify differently.

Test:
- For each example, decide according to the majority vote of H1, H2 and H3.

Theorem: if each hypothesis has an error < $\varepsilon_0$, the final voting classifier has an error < $g(\varepsilon_0) = 3\varepsilon_0^2 - 2\varepsilon_0^3$.
The accuracy is improved! Apply the construction recursively to get to the target accuracy.
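A quick numerical check (mine, not in the slides) of how much one round of this construction helps:

```latex
% Plugging \varepsilon_0 = 0.3 into the theorem:
\[
g(0.3) = 3(0.3)^2 - 2(0.3)^3 = 0.27 - 0.054 = 0.216 < 0.3 .
\]
% More generally, g(\varepsilon) < \varepsilon for every \varepsilon < 1/2, since
% 3\varepsilon^2 - 2\varepsilon^3 < \varepsilon \iff (2\varepsilon - 1)(\varepsilon - 1) > 0,
% so each round strictly reduces the error of the voting classifier.
```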

Theoretical Boosting algorithm

Similarly to boosting the accuracy, we can boost the confidence at some restricted accuracy cost.
The key result: we can improve both the accuracy and the confidence.

Problems with the theoretical algorithm:
- it needs a good (better than 50%) classifier on all distributions and problems
- we cannot get a good sample from the data distribution
- the method requires a large training set

Solution to the sampling problem: boosting by sampling, i.e. the AdaBoost algorithm and its variants.

AdaBoost

AdaBoost: boosting by sampling.
Classification (Freund, Schapire; 1996):
- AdaBoost.M1 (two-class problems)
- AdaBoost.M2 (multi-class problems)
Regression (Drucker; 1997):
- AdaBoost.R

AdaBoost training

Start from the training data with distribution $D_1$: the uniform distribution over the training examples, $P(\text{example } i) = 1/N$.
Sample randomly according to $D_1$ and train Model 1.

Test Model 1 and calculate its errors (Errors 1).
Use the errors to recalculate a new distribution $D_2$ on the data: more probability is placed on the examples with errors.

The process is repeated: $D_1 \to$ Model 1 $\to$ Errors 1; $D_2 \to$ Model 2 $\to$ Errors 2; ...; $D_T \to$ Model T $\to$ Errors T.

AdaBoost

Given:
- a training set of N examples (attribute + class-label pairs)
- a base learning model (e.g. a decision tree, a neural network)

Training stage:
- Train a sequence of T base models on T different sampling distributions defined on the training set (D).
- The sampling distribution $D_t$ used to build model t is constructed by modifying the sampling distribution $D_{t-1}$ from the (t-1)-th step: examples classified incorrectly in the previous step receive higher weights in the new distribution (the algorithm attempts to cover the misclassified samples).

Application (classification) stage:
- Classify according to the weighted majority of the classifiers.
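For reference, scikit-learn ships a SAMME-style variant of this procedure; a minimal usage sketch on synthetic data (my example, not from the lecture):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic two-class data, just for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The default base model is a depth-1 decision tree (a decision stump).
clf = AdaBoostClassifier(n_estimators=50).fit(X, y)
print(clf.score(X, y))   # accuracy of the weighted majority vote of the 50 stumps
```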

AdaBoost algorithm

Training (step t):
- Sampling distribution $D_t$: $D_t(i)$ is the probability that example i from the original training dataset is selected; $D_1(i) = 1/N$ for the first step (t = 1).
- Take K samples from the training set according to $D_t$.
- Train a classifier $h_t$ on the samples.
- Calculate the error $\varepsilon_t$ of $h_t$: $\varepsilon_t = \sum_{i:\, h_t(x_i) \ne y_i} D_t(i)$
- Classifier weight: $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$
- New sampling distribution:
  $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} \beta_t & \text{if } h_t(x_i) = y_i \\ 1 & \text{otherwise} \end{cases}$
  where $Z_t$ is a normalization constant.
(A from-scratch sketch of this loop follows below.)

AdaBoost: sampling probabilities

Example: a nonlinearly separable binary classification problem, with neural networks used as the weak learners. [Figures omitted: the sampling probabilities over the input space for this example.]
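A minimal from-scratch sketch of the training step above (my code, not the lecture's): it uses decision stumps as the weak learner, resamples K = N examples according to $D_t$, and adds a stopping rule for when the weak-learning assumption ($\varepsilon_t < 1/2$) fails.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=20, rng=np.random.default_rng(0)):
    """AdaBoost.M1-style training by sampling; labels y are assumed to be in {0, 1}."""
    N = len(X)
    D = np.full(N, 1.0 / N)                  # D_1(i) = 1/N
    models, betas = [], []
    for t in range(T):
        idx = rng.choice(N, size=N, p=D)     # take K = N samples according to D_t
        h = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])   # decision stump
        miss = h.predict(X) != y             # mistakes on the original training set
        eps = D[miss].sum()                  # eps_t = sum of D_t(i) over misclassified i
        if eps == 0 or eps >= 0.5:           # weak-learner assumption violated: stop
            break
        beta = eps / (1.0 - eps)             # classifier weight beta_t
        D = D * np.where(miss, 1.0, beta)    # keep weight on errors, shrink the rest
        D = D / D.sum()                      # normalize by Z_t
        models.append(h)
        betas.append(beta)
    return models, betas
```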

AdaBoost classification

We have T different classifiers $h_t$; the weight $w_t$ of classifier t is proportional to its accuracy on the training set:
$w_t = \log(1/\beta_t) = \log\bigl((1 - \varepsilon_t)/\varepsilon_t\bigr)$, with $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$.

Classification:
- For every class j = 0, 1: compute the sum of the weights $w_t$ over ALL classifiers that predict class j.
- Output the class that corresponds to the maximal sum of weights (weighted majority):
  $h_{final}(x) = \arg\max_j \sum_{t:\, h_t(x) = j} w_t$
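A companion to the adaboost_train sketch above, implementing this weighted majority rule (again my code, under the same {0, 1} label assumption):

```python
import numpy as np

def adaboost_predict(models, betas, X):
    """h_final(x) = argmax_j sum over {t : h_t(x) = j} of w_t, with w_t = log(1/beta_t)."""
    w = np.log(1.0 / np.array(betas))                   # classifier weights w_t
    preds = np.stack([h.predict(X) for h in models])    # shape (T, n_examples)
    score_1 = (w[:, None] * (preds == 1)).sum(axis=0)   # summed weight voting for class 1
    score_0 = (w[:, None] * (preds == 0)).sum(axis=0)   # summed weight voting for class 0
    return (score_1 > score_0).astype(int)
```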

Two-class example. Classification.

Classifier 1: yes, weight 0.7
Classifier 2: no, weight 0.3
Classifier 3: no, weight 0.2
Weighted majority: yes gets 0.7, no gets 0.3 + 0.2 = 0.5, so yes wins by 0.7 - 0.5 = +0.2.
The final choice is yes (+1).

What is boosting doing?

- Each classifier specializes on a particular subset of examples.
- The algorithm concentrates on more and more difficult examples.
Boosting can:
- reduce variance (the same as Bagging)
- eliminate the effect of the high bias of the weak learner (unlike Bagging)
Train versus test error performance:
- training errors can be driven close to 0
- but the test errors do not show overfitting
Proofs and theoretical explanations can be found in a number of papers.

Boosting. Error performance.

[Figure omitted: training error, test error, and single-learner error as a function of the number of models (x-axis 0 to 16, errors roughly in the 0 to 0.4 range).]

Model Averaging

An alternative way to combine (weight) multiple models; it can be used in both supervised and unsupervised frameworks.
For example, the likelihood of the data can be expressed by averaging over multiple models:
$P(D) = \sum_{m=1}^{N} P(D \mid M_m)\, P(M_m)$
Prediction:
$P(y \mid x) = \sum_{m=1}^{N} P(y \mid x, M_m)\, P(M_m)$
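A tiny numerical sketch of the prediction formula (the model weights and per-model predictions are made-up values, just to show the computation):

```python
import numpy as np

# Hypothetical values: weights P(M_m) of three models and each model's
# predictive probability P(y = 1 | x, M_m) for a single input x.
model_weights = np.array([0.5, 0.3, 0.2])        # must sum to 1
p_y1_given_model = np.array([0.9, 0.6, 0.2])

# Model averaging: P(y = 1 | x) = sum_m P(y = 1 | x, M_m) * P(M_m)
p_y1 = np.dot(p_y1_given_model, model_weights)   # 0.45 + 0.18 + 0.04 = 0.67
print(p_y1)
```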