Boosting. CAP5610: Machine Learning. Instructor: Guo-Jun Qi


Weak classifiers. Examples of weak classifiers: a decision stump (a one-layer decision tree), Naive Bayes (a classifier that ignores feature correlations), and a linear classifier such as logistic regression. Weak classifiers usually have larger training error but smaller variance. A single weak classifier is usually not adequate in real applications, but it is possible to combine an ensemble of weak classifiers to build a strong one.
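
To make the decision stump concrete, here is a minimal sketch in Python (an illustration, not code from the lecture); fit_stump and predict_stump are hypothetical helper names, and the stump simply searches every (feature, threshold, polarity) combination for the lowest weighted 0/1 error, anticipating the weighted training sets used later:

# A minimal decision stump: threshold one feature and predict +1 or -1.
import numpy as np

def fit_stump(X, y, weights=None):
    """Pick the (feature, threshold, polarity) with the lowest weighted error."""
    n, d = X.shape
    if weights is None:
        weights = np.full(n, 1.0 / n)
    best = None
    for j in range(d):                       # candidate feature
        for thr in np.unique(X[:, j]):       # candidate threshold
            for polarity in (+1, -1):        # which side predicts +1
                pred = polarity * np.where(X[:, j] >= thr, 1, -1)
                err = weights[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best  # (weighted error, feature index, threshold, polarity)

def predict_stump(stump, X):
    _, j, thr, polarity = stump
    return polarity * np.where(X[:, j] >= thr, 1, -1)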

Idea: weighted voting. Combine an ensemble of weak classifiers by weighted voting. Although each weak classifier is not adequate for classifying the whole feature space, it can still produce good results on certain parts of the feature space. Each weak classifier is given a weight based on its performance: a more accurate classifier votes with a larger weight. Weighted voting usually gives better performance by combining complementary classifiers that are good at classifying different parts of the feature space. Combined classifier: f(x) = sign(α_1 h_1(x) + α_2 h_2(x)).

Problems to solve: How is an ensemble of weak classifiers learned? (That is, how do we decide which part of the feature space each weak classifier focuses on?) And how is the weight of each weak classifier decided?

Boosting [Schapire '98]. Idea: learn a pool of weak classifiers (usually of the same type, e.g., decision stumps or logistic regression) on different sets of training examples resampled from different parts of an original training set. [Figure: the original training set, two resampled subsets with weak classifiers h_1 and h_2, and the combined classifier.]

Boosting: the algorithm. On each iteration t: weight each training example by how correctly it has been classified so far; learn a weak classifier h_t that best classifies the weighted training examples; and decide a strength α_t for this weak classifier. Final classifier: f(x) = sign(Σ_t α_t h_t(x)).

Learning from weighted training examples. Consider a weighted dataset where D(i) is the weight of the i-th training example (x_i, y_i). Two interpretations: the i-th example counts as D(i) examples, or the i-th example is resampled from the training set with probability proportional to D(i). Accordingly, there are two ways to learn a weak classifier from weighted training examples: resample the training set according to D(i) and train a weak classifier on the resampled set, or learn a weak classifier directly from the weighted samples, e.g., a weighted logistic regression classifier h = argmin_h Σ_i D(i) loss(h(x_i), y_i).
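
A minimal sketch of both options (illustrative, not from the lecture), assuming numpy arrays and binary labels; it uses scikit-learn's LogisticRegression, whose fit method accepts a per-example sample_weight, and the resampling variant draws a bootstrap sample with probabilities D(i):

# Two ways to train a weak classifier on weighted examples (sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression

def weak_learner_by_resampling(X, y, D, seed=0):
    # Draw a bootstrap sample in which example i appears with probability D(i).
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=len(X), replace=True, p=D)
    return LogisticRegression().fit(X[idx], y[idx])

def weak_learner_by_weighting(X, y, D):
    # Minimize the weighted loss directly via per-example sample weights.
    return LogisticRegression().fit(X, y, sample_weight=D)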

AdaBoost [Freund & Schapire '95]. Initialize D_1(i) = 1/m for i = 1, ..., m. For t = 1, ..., T: train a weak classifier h_t on the examples weighted by D_t; compute its weighted error ε_t and its combination weight α_t (next slide); update the example weights as D_{t+1}(i) = D_t(i) exp(-y_i α_t h_t(x_i)) / Z_t, where Z_t normalizes D_{t+1} to sum to one. Output the final classifier H(x) = sign(Σ_t α_t h_t(x)).

Decide the combination weight for each weak classifier. The weight of weak classifier h_t is α_t = (1/2) log((1 - ε_t) / ε_t), where ε_t is its weighted training error ε_t = Σ_i D_t(i) δ(h_t(x_i) ≠ y_i). If a classifier is better than a random guess, then ε_t < 0.5 and α_t > 0; otherwise α_t < 0. In the latter case it is an adverse classifier rather than a weak classifier.
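
Putting the weighting rule, the error ε_t, and the weight α_t together, the following is a compact AdaBoost sketch under the formulas above (an assumed illustration, not code distributed with the lecture); it expects labels in {-1, +1} and uses scikit-learn decision stumps (DecisionTreeClassifier with max_depth=1) trained with sample_weight:

# AdaBoost sketch with decision stumps as the weak learners.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """y must take values in {-1, +1}; returns the weak classifiers and their weights."""
    m = len(X)
    D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
    stumps, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.clip(D[pred != y].sum(), 1e-12, 1 - 1e-12)   # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)                  # alpha_t
        D = D * np.exp(-alpha * y * pred)                      # reweight examples
        D = D / D.sum()                                        # normalize: divide by Z_t
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    score = sum(a * h.predict(X) for h, a in zip(stumps, alphas))
    return np.sign(score)          # H(x) = sign(sum_t alpha_t h_t(x))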

Boosting example (decision stumps): the three weak classifiers.

Boosting example (decision stumps): the final combined classifier.

How much can boosting reduce the training error? If each weak classifier h_t is slightly better than a random guess, i.e., ε_t ≤ 1/2 - γ for some γ > 0, then the training error of AdaBoost decreases exponentially fast in the number T of combined weak classifiers: (1/m) Σ_i δ(H(x_i) ≠ y_i) ≤ exp(-2γ²T). This is an astounding result, since AdaBoost can drive the training error arbitrarily close to zero.

Proof: training error of AdaBoost. Note that the exponential function exp(-y_i f(x_i)) is an upper bound on the 0/1 loss δ(y_i f(x_i) ≤ 0), viewed as a function of the margin y_i f(x_i).

Proof: training error of AdaBoost. The total training error is bounded by (1/m) Σ_i δ(y_i f(x_i) ≤ 0) ≤ (1/m) Σ_i exp(-y_i f(x_i)), where f(x_i) = Σ_t α_t h_t(x_i) and H(x_i) = sign(f(x_i)) is the final classifier.

Proof: training error of AdaBoost. The weight updates give
D_1(i) = 1/m,
D_2(i) = exp(-y_i α_1 h_1(x_i)) / (m Z_1),
D_3(i) = D_2(i) exp(-y_i α_2 h_2(x_i)) / Z_2 = exp(-y_i α_1 h_1(x_i)) exp(-y_i α_2 h_2(x_i)) / (m Z_1 Z_2).
By induction,
D_{T+1}(i) = D_T(i) exp(-y_i α_T h_T(x_i)) / Z_T = exp(-Σ_t y_i α_t h_t(x_i)) / (m Z_1 Z_2 ... Z_T) = exp(-y_i f(x_i)) / (m Z_1 Z_2 ... Z_T).
Since Σ_i D_{T+1}(i) = 1, we have Π_t Z_t = (1/m) Σ_i exp(-y_i f(x_i)).

Proof: training error of AdaBoost. So we have (1/m) Σ_i δ(y_i f(x_i) ≤ 0) ≤ (1/m) Σ_i exp(-y_i f(x_i)) = Π_t Z_t. If every Z_t < 1, the training error decays exponentially in T.

Proof: training error of AdaBoost. Find the optimal weight α_t by minimizing Z_t:
Z_t = Σ_i D_t(i) exp(-α_t y_i h_t(x_i))
    = Σ_{i: h_t(x_i) ≠ y_i} D_t(i) exp(α_t) + Σ_{i: h_t(x_i) = y_i} D_t(i) exp(-α_t)
    = exp(α_t) ε_t + exp(-α_t)(1 - ε_t).
Setting ∂Z_t/∂α_t = exp(α_t) ε_t - exp(-α_t)(1 - ε_t) = 0 gives the optimal α_t = (1/2) log((1 - ε_t)/ε_t), and then Z_t = 2√(ε_t(1 - ε_t)) = √(1 - (1 - 2ε_t)²).
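
As a quick numerical sanity check (illustrative, not part of the lecture), the snippet below evaluates Z_t(α) = exp(α) ε_t + exp(-α)(1 - ε_t) on a grid for a sample error rate and confirms that the minimizer and the minimum match the closed forms above:

# Verify the closed-form alpha_t and Z_t numerically for eps = 0.3.
import numpy as np

eps = 0.3
alphas = np.linspace(-3, 3, 100001)
Z = np.exp(alphas) * eps + np.exp(-alphas) * (1 - eps)

alpha_star = 0.5 * np.log((1 - eps) / eps)   # closed-form minimizer, ~0.4236
Z_star = 2 * np.sqrt(eps * (1 - eps))        # closed-form minimum, ~0.9165

print(alphas[Z.argmin()], alpha_star)
print(Z.min(), Z_star)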

Proof: training error of AdaBoost. The training error is therefore bounded by (1/m) Σ_i δ(y_i f(x_i) ≤ 0) ≤ Π_t Z_t = Π_t √(1 - (1 - 2ε_t)²) ≤ exp(-2 Σ_t (1/2 - ε_t)²), which goes to zero exponentially fast whenever every ε_t ≤ 1/2 - γ for some γ > 0.

Digit recognition example. The test error keeps decreasing even after the training error reaches zero. Does this show that boosting is robust to overfitting?

Comparison between LR and Boosting. Logistic regression assumes P(y_i | x_i) = 1 / (1 + exp(-y_i f(x_i))), with f(x) = Σ_d w_d x_d + w_0. Maximizing the data log-likelihood, max_w -Σ_i log(1 + exp(-y_i f(x_i))), is equivalent to min_w Σ_i log(1 + exp(-y_i f(x_i))).

Comparison between LR and Boosting.
Logistic regression: min_w Σ_i log(1 + exp(-y_i f(x_i))), with f(x) = Σ_d w_d x_d + w_0.
Boosting: min (1/m) Σ_i exp(-y_i f(x_i)) = Π_t Z_t, with f(x) = Σ_t α_t h_t(x).
Comparing α_t with w_d and h_t(x) with x_d, LR and Boosting become comparable. For LR, all the w_d are learned jointly, whereas for Boosting the h_t are learned sequentially. Note that LR is a linear classifier, but Boosting is not.
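
To make the two surrogate losses concrete, this small sketch (not from the lecture) evaluates the logistic loss log(1 + exp(-m)) and the exponential loss exp(-m) as functions of the margin m = y f(x); the exponential loss upper-bounds the 0/1 loss and penalizes negative margins far more aggressively than the logistic loss:

# Compare the logistic and exponential surrogate losses as functions of the margin.
import numpy as np

margins = np.linspace(-2, 2, 9)
logistic_loss = np.log(1 + np.exp(-margins))   # used by logistic regression
exp_loss = np.exp(-margins)                    # used by AdaBoost
zero_one = (margins <= 0).astype(float)        # 0/1 loss, shown for reference

for m, l, e, z in zip(margins, logistic_loss, exp_loss, zero_one):
    print(f"margin={m:+.1f}  logistic={l:.3f}  exponential={e:.3f}  0/1={z:.0f}")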

Attention, a common mistake: even if you choose a linear classifier for h_t, the final classifier in Boosting is not linear. Each weak classifier must include the sign function, h_t(x) = sign(w_t^T x); otherwise f(x) = Σ_t α_t w_t^T x is a linear combination of linear functions and is therefore still linear, which is not what we want.
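
A tiny illustration of this point (an assumed example, not from the lecture): with the sign, a weighted combination of two 1-D linear weak classifiers gives a piecewise-constant, nonlinear decision function, while without it the combination collapses to a single linear function:

# With sign(): piecewise-constant and nonlinear. Without sign(): still one straight line.
import numpy as np

x = np.linspace(-2, 2, 9)
w1, b1, a1 = 1.0, 0.5, 0.7    # two hypothetical 1-D linear weak classifiers
w2, b2, a2 = -1.0, 1.0, 0.4   # with boosting weights a1 and a2

with_sign = a1 * np.sign(w1 * x + b1) + a2 * np.sign(w2 * x + b2)
without_sign = a1 * (w1 * x + b1) + a2 * (w2 * x + b2)   # = (a1*w1 + a2*w2)*x + const

print(np.round(with_sign, 2))     # a step function of x: not linear
print(np.round(without_sign, 2))  # a straight line in x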

Summary. We have learned to use a set of weak classifiers to build a strong classifier, which can reduce the training error arbitrarily close to zero, even if each weak classifier is only slightly better than a random guess. We studied a particular boosting algorithm, AdaBoost, and compared logistic regression with Boosting: linear vs. nonlinear, and joint optimization vs. sequential (iterative) optimization.