
Machine Learning: Boosting. Prof. Dr. Martin Riedmiller, AG Maschinelles Lernen und Natürlichsprachliche Systeme, Institut für Informatik, Technische Fakultät, Albert-Ludwigs-Universität Freiburg, riedmiller@informatik.uni-freiburg.de. Acknowledgment: slides are based on tutorials from Robert Schapire and Gunnar Raetsch.

Overview of Today's Lecture: motivation for boosting, AdaBoost, examples.

Motivation: often, simple rules already yield reasonable classification performance. E.g. in an email spam filter, the phrase "buy now" alone would already be a good indicator for spam, doing better than random. Idea: use several of these simple rules ("decision stumps") and combine them to do classification (a committee approach).

Boosting: a general method to combine simple, rough rules into a highly accurate decision model. It assumes base learners (or "weak learners") that do at least slightly better than random, e.g. 53% accuracy for a two-class classification; in other words, the error is strictly smaller than 0.5. Given sufficient data, a boosting algorithm can provably construct a single classifier with very high accuracy (say, 99%). Algorithm: AdaBoost (Freund and Schapire, 1995).

General Framework for Boosting: given a training set (x_1, y_1), ..., (x_p, y_p), where y_i ∈ {−1, +1} and x_i ∈ X. For t = 1, ..., T: construct a distribution D_t = (d_1^t, ..., d_p^t) on the training set, where d_i^t is the probability of occurrence of training pattern i (also called the weight of i); train a base learner h_t : X → {−1, +1} according to the training-set distribution D_t with a small error ε_t = Pr_{D_t}[h_t(x_i) ≠ y_i]. Output the final classifier H_final as a combination of the T hypotheses h_1, ..., h_T.

AdaBoost - Details (Freund and Schapire, 1995): the final hypothesis is a weighted sum of the individual hypotheses: H_final(x) = sign(Σ_t α_t h_t(x)). Hypotheses are weighted depending on the error they produce: α_t := ½ ln((1 − ε_t) / ε_t). E.g.: ε_t = 0.1 → α_t ≈ 1.10; ε_t = 0.4 → α_t ≈ 0.20; ε_t = 0.5 → α_t = 0. This choice of α can be shown to minimize an upper bound on the error of the final hypothesis (see e.g. (Schapire, 2003) for details).
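As a quick sanity check on these values, here is a small Python snippet (not from the slides) that evaluates the weight formula for a few error rates:

```python
import math

def alpha(eps):
    """Hypothesis weight alpha_t = 1/2 * ln((1 - eps_t) / eps_t)."""
    return 0.5 * math.log((1.0 - eps) / eps)

for eps in (0.1, 0.3, 0.4, 0.5):
    print(f"eps = {eps:.2f}  ->  alpha = {alpha(eps):.2f}")
# eps = 0.10  ->  alpha = 1.10
# eps = 0.30  ->  alpha = 0.42
# eps = 0.40  ->  alpha = 0.20
# eps = 0.50  ->  alpha = 0.00
```

A base learner at exactly 50% error contributes nothing (α = 0); the smaller its error, the larger its vote in the final combination.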

AdaBoost - Details (cont'd) (Freund and Schapire, 1995): training-pattern probabilities are recomputed such that wrongly classified patterns get a higher probability in the next round: d_i^{t+1} := (d_i^t / Z_t) · exp(α_t) if y_i ≠ h_t(x_i), and (d_i^t / Z_t) · exp(−α_t) otherwise; equivalently, d_i^{t+1} = (d_i^t / Z_t) · exp(−α_t y_i h_t(x_i)). Z_t is a normalisation factor chosen such that Σ_i d_i^{t+1} = 1 (a probability distribution), i.e. Z_t = Σ_i d_i^t exp(−α_t y_i h_t(x_i)).

AdaBoost (Freund and Schapire, 1995). Initialize D_1 = (1/p, ..., 1/p), where p is the number of training patterns. For t = 1, ..., T: train base learner h_t according to distribution D_t; determine its expected error ε_t := Σ_{i=1}^p d_i^t I(h_t(x_i) ≠ y_i), where I(A) = 1 if A is true and 0 otherwise; compute the hypothesis weight and the new pattern distribution: α_t := ½ ln((1 − ε_t) / ε_t) and d_i^{t+1} := (d_i^t / Z_t) exp(−α_t y_i h_t(x_i)), with Z_t a normalisation factor. Final hypothesis: H_final(x) = sign(Σ_t α_t h_t(x)).
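The loop above maps directly onto code. Below is a minimal sketch in Python/NumPy, using axis-parallel decision stumps (as in the example that follows) as base learners; the names fit_stump, adaboost and predict are illustrative choices, not part of the original slides.

```python
import numpy as np

def fit_stump(X, y, d):
    """Weighted decision stump: axis-parallel split on one feature.
    Returns (feature, threshold, sign, weighted error)."""
    best = (0, 0.0, 1, np.inf)
    n_samples, n_features = X.shape
    for j in range(n_features):
        for thr in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = np.where(X[:, j] <= thr, s, -s)
                err = d[pred != y].sum()
                if err < best[3]:
                    best = (j, thr, s, err)
    return best

def adaboost(X, y, T=10):
    """AdaBoost (Freund and Schapire, 1995) with stumps; y in {-1, +1}."""
    p = len(y)
    d = np.full(p, 1.0 / p)               # D_1 = (1/p, ..., 1/p)
    ensemble = []                         # list of (alpha_t, stump parameters)
    for t in range(T):
        j, thr, s, eps = fit_stump(X, y, d)
        if eps >= 0.5:                    # weak-learner assumption violated
            break
        eps = max(eps, 1e-12)             # guard against division by zero
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        h = np.where(X[:, j] <= thr, s, -s)
        d = d * np.exp(-alpha * y * h)    # reweight: wrong patterns gain weight
        d = d / d.sum()                   # Z_t normalisation
        ensemble.append((alpha, (j, thr, s)))
    return ensemble

def predict(ensemble, X):
    """H_final(x) = sign(sum_t alpha_t * h_t(x))."""
    score = np.zeros(len(X))
    for alpha, (j, thr, s) in ensemble:
        score += alpha * np.where(X[:, j] <= thr, s, -s)
    return np.sign(score)
```

A call like adaboost(X, y, T=3) then performs three rounds of exactly the reweighting that is worked through in the example below.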

Example (1), taken from (R. Schapire). Hypotheses: simple splits parallel to one axis (decision stumps).

Example (2), taken from (R. Schapire). Error ε_1 = ? Result: ε_1 = 0.3, α_1 = 0.42; wrongly classified patterns: d_i^2 = 0.166; correctly classified: d_i^2 = 0.07 (computation on the next slide).

ε_1 = Σ_{i=1}^p d_i^1 I(h_1(x_i) ≠ y_i) = (1/10) · 3 = 0.3
α_1 = ½ ln((1 − ε_1) / ε_1) = ½ ln(0.7 / 0.3) = 0.42
wrong patterns: d_i^1 exp(α_1) = 0.1 · exp(0.42) = 0.152
correct patterns: d_i^1 exp(−α_1) = 0.1 · exp(−0.42) = 0.066
Z_1 = 7 · 0.066 + 3 · 0.152 = 0.918
wrong patterns: d_i^2 = 0.152 / 0.918 = 0.166
correct patterns: d_i^2 = 0.066 / 0.918 = 0.07
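The same round-1 numbers can be reproduced in a few lines of Python (a sketch based on the example's 10 patterns, 3 of which h_1 misclassifies):

```python
import math

p, n_wrong = 10, 3                          # 10 patterns, 3 misclassified by h_1
eps1 = n_wrong / p                          # 0.3
alpha1 = 0.5 * math.log((1 - eps1) / eps1)  # 0.42
up = 0.1 * math.exp(alpha1)                 # unnormalised weight, wrong pattern
down = 0.1 * math.exp(-alpha1)              # unnormalised weight, correct pattern
Z1 = (p - n_wrong) * down + n_wrong * up    # 7*0.066 + 3*0.152 = 0.918
print(round(alpha1, 2), round(up / Z1, 3), round(down / Z1, 2))
# 0.42 0.167 0.07   (the slide shows 0.166 because it rounds intermediate values)
```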

Example (3), taken from (R. Schapire). Error ε_2 = ? Result: ε_2 = 0.21, α_2 = 0.65 (computation on the next slide).

ε_2 = Σ_{i=1}^p d_i^2 I(h_2(x_i) ≠ y_i) = 0.07 + 0.07 + 0.07 = 0.21
α_2 = ½ ln((1 − ε_2) / ε_2) = ½ ln(0.79 / 0.21) = 0.65

Example (4), taken from (R. Schapire). Third round: ε_3 = 0.14, α_3 = 0.92.

Example (5), taken from (R. Schapire). Final classifier: H_final(x) = sign(0.42 · h_1(x) + 0.65 · h_2(x) + 0.92 · h_3(x)).

Further examples (1), taken from (Meir, Raetsch, 2003): AdaBoost on a 2D toy data set. Color indicates the label and the diameter is proportional to the weight of the example. Dashed lines show the decision boundaries of the single classifiers (up to the 5th iteration); the solid line shows the decision boundary of the combined classifier. In the last two plots the decision boundary of Bagging is plotted for comparison.

Performance of AdaBoost: in the case of low noise in the training data, AdaBoost can reasonably learn even complex decision boundaries without overfitting. Figure taken from G. Raetsch, Tutorial MLSS 03.

Practical Application: Face Detection (Viola and Jones, 2004). Problem: find faces in photographs and movies. Weak classifiers: detect light/dark rectangle features in images. Many tricks make the method fast and accurate.

Further aspects: Boosting can be applied to any ML classifier: MLPs, decision trees, ... A related approach is Bagging (Breiman): the training-sample distribution remains unchanged; for each training of a hypothesis, another set of training samples is drawn with replacement (see the sketch below). Boosting usually performs very well with respect to not overfitting the data if the noise in the data is low. While this is astonishing at first glance, it can be theoretically explained by analyzing the margins of the classifiers. If the data has a higher noise level, boosting runs into the problem of overfitting; regularisation methods exist that try to avoid this (e.g. by restricting the weight of notorious outliers). Extensions to multi-class problems and regression problems exist.
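For contrast with AdaBoost's reweighting, the sketch below illustrates the bagging idea mentioned above in Python/NumPy. Here fit_base stands for an arbitrary, hypothetical base-learner trainer that returns a predictor; it is not part of the slides.

```python
import numpy as np

def bagging(X, y, fit_base, T=25, seed=0):
    """Bagging (Breiman): each base learner is trained on a bootstrap sample
    drawn with replacement; the pattern distribution is never reweighted."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(T):
        idx = rng.integers(0, len(y), size=len(y))   # sample with replacement
        models.append(fit_base(X[idx], y[idx]))      # assumed to return a predictor
    return models

def majority_vote(models, X):
    """Unweighted vote of the T base classifiers (labels in {-1, +1})."""
    return np.sign(sum(m(X) for m in models))
```

The key difference to AdaBoost is visible in the loop: bagging varies the training set by resampling, while AdaBoost keeps the training set fixed and varies the weights d_i^t.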

References
R. Schapire: A Boosting Tutorial. www.cs.princeton.edu/~schapire
R. Meir and G. Raetsch: An Introduction to Boosting and Leveraging, 2003.
G. Raetsch: Tutorial at MLSS 2003.
P. Viola and M. Jones: Robust Real-Time Face Detection, 2004.