The Boosting Approach to Machine Learning. Maria-Florina Balcan. 10/31/2016

Boosting
General method for improving the accuracy of any given learning algorithm. Works by creating a series of challenge datasets such that even modest performance on these can be used to produce an overall high-accuracy predictor. Works amazingly well in practice: Adaboost and its variations are among the top 10 algorithms. Backed up by solid foundations.

Readings:
- The Boosting Approach to Machine Learning: An Overview. Rob Schapire, 2001.
- Theory and Applications of Boosting. NIPS tutorial. http://www.cs.princeton.edu/~schapire/talks/nips-tutorial.pdf

Plan for today: Motivation. A bit of history. Adaboost: algorithm, guarantees, discussion. Focus on supervised classification.

An Example: Spam Detection
E.g., classify which emails are spam and which are important.
Key observation/motivation: it is easy to find rules of thumb that are often correct. E.g., if "buy now" is in the message, then predict spam. E.g., if "say good-bye to debt" is in the message, then predict spam. It is much harder to find a single rule that is very highly accurate.

An Example: Spam Detection
Boosting: a meta-procedure that takes in an algorithm for finding rules of thumb (a weak learner), and produces a highly accurate rule by calling the weak learner repeatedly on cleverly chosen datasets:
- apply the weak learner to a subset of emails, obtain a rule of thumb $h_1$;
- apply it to a 2nd subset of emails, obtain a 2nd rule of thumb $h_2$;
- apply it to a 3rd subset of emails, obtain a 3rd rule of thumb $h_3$;
- repeat T times; combine the weak rules $h_1, \ldots, h_T$ into a single highly accurate rule.

Boosting: Important Aspects
How to choose examples on each round? Typically, concentrate on the hardest examples (those most often misclassified by previous rules of thumb).
How to combine rules of thumb into a single prediction rule? Take a (weighted) majority vote of the rules of thumb.

Historically.

Weak Learning vs Strong Learning
[Kearns & Valiant '88] defined weak learning: being able to predict better than random guessing (error $\leq \frac{1}{2} - \gamma$), consistently. They posed an open problem: does there exist a boosting algorithm that turns a weak learner into a strong learner (one that can produce arbitrarily accurate hypotheses)?
Informally: given a weak learning algorithm that can consistently find classifiers of error $\leq \frac{1}{2} - \gamma$, a boosting algorithm would provably construct a single classifier with error $\leq \epsilon$.

Surprisingly, Weak Learning = Strong Learning
Original construction [Schapire '89]: a poly-time boosting algorithm; it exploits the fact that we can learn a little on every distribution. A modest booster is obtained by calling the weak learning algorithm on 3 distributions: error $= \beta < \frac{1}{2} - \gamma$ is driven down to error $\leq 3\beta^2 - 2\beta^3$. The modest boost of accuracy is then amplified by running this construction recursively. Cool conceptually and technically, but not very practical.
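
As a quick sanity check (my own numbers, not from the slides), plugging a concrete weak error such as $\beta = 0.4$ into the map $\beta \mapsto 3\beta^2 - 2\beta^3$ shows the amplification at each level of the recursion:

```latex
% Hypothetical numbers; for any \beta < 1/2 we have 3\beta^2 - 2\beta^3 < \beta.
\begin{align*}
  \beta = 0.40  &\;\longmapsto\; 3(0.40)^2  - 2(0.40)^3  = 0.480 - 0.128 = 0.352,\\
  \beta = 0.352 &\;\longmapsto\; 3(0.352)^2 - 2(0.352)^3 \approx 0.372 - 0.087 \approx 0.285,\\
  &\;\;\vdots \qquad \text{(recursing drives the error toward } 0\text{)}
\end{align*}
```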

An explosion of subsequent work

Adaboost (Adaptive Boosting)
"A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting" [Freund & Schapire, JCSS '97]. Gödel Prize winner, 2003.

Informal Description of Adaboost
Boosting: turns a weak algorithm into a strong learner.
Input: $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ with $x_i \in X$, $y_i \in Y = \{-1, 1\}$, and a weak learning algorithm A (e.g., Naïve Bayes, decision stumps).
For $t = 1, 2, \ldots, T$:
- construct a distribution $D_t$ on $\{x_1, \ldots, x_m\}$;
- run A on $D_t$, producing a weak classifier $h_t : X \to \{-1, 1\}$ with error $\epsilon_t = P_{x_i \sim D_t}(h_t(x_i) \neq y_i)$ over $D_t$.
Output the final classifier $H_{final}(x) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.
Roughly speaking, $D_{t+1}$ increases the weight on $x_i$ if $h_t$ is incorrect on $x_i$, and decreases it if $h_t$ is correct.

Adaboost (Adaptive Boosting)
Weak learning algorithm A.
For $t = 1, 2, \ldots, T$:
- construct $D_t$ on $\{x_1, \ldots, x_m\}$;
- run A on $D_t$, producing $h_t$.
Constructing $D_t$:
- $D_1(i) = \frac{1}{m}$ [i.e., uniform on $\{x_1, \ldots, x_m\}$];
- given $D_t$ and $h_t$, set $D_{t+1}(i) = \frac{D_t(i)}{Z_t} e^{-\alpha_t}$ if $y_i = h_t(x_i)$, and $D_{t+1}(i) = \frac{D_t(i)}{Z_t} e^{\alpha_t}$ if $y_i \neq h_t(x_i)$; equivalently, $D_{t+1}(i) = \frac{D_t(i)}{Z_t} e^{-\alpha_t y_i h_t(x_i)}$, where $Z_t$ is the normalization factor and $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) > 0$.
Final hypothesis: $H_{final}(x) = \text{sign}\left(\sum_t \alpha_t h_t(x)\right)$.
$D_{t+1}$ puts half of its weight on examples $x_i$ where $h_t$ is incorrect and half on examples where $h_t$ is correct.
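
The slide above gives the algorithm as pseudocode; below is a minimal runnable sketch of the same update rule in Python/NumPy, assuming labels in $\{-1, +1\}$ and using single-threshold decision stumps as a stand-in for the weak learner A (the function names and the stump search are my own, not from the slides).

```python
# Minimal AdaBoost sketch; decision stumps play the role of the weak learner A.
import numpy as np

def weak_learner(X, y, D):
    """Hypothetical weak learner: best single-feature threshold stump under weights D."""
    best = None
    for j in range(X.shape[1]):                  # feature to split on
        for thr in np.unique(X[:, j]):           # candidate threshold
            for sign in (+1, -1):                # orientation of the stump
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = D[pred != y].sum()         # weighted error over D_t
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best                                  # (epsilon_t, feature, threshold, sign)

def adaboost(X, y, T=20):
    m = len(y)
    D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
    stumps, alphas = [], []
    for _ in range(T):
        eps, j, thr, sign = weak_learner(X, y, D)
        eps = max(eps, 1e-12)                    # guard against a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t = (1/2) ln((1 - eps_t)/eps_t)
        pred = sign * np.where(X[:, j] > thr, 1, -1)
        D *= np.exp(-alpha * y * pred)           # D_{t+1}(i) proportional to D_t(i) e^{-alpha_t y_i h_t(x_i)}
        D /= D.sum()                             # divide by the normalizer Z_t
        stumps.append((j, thr, sign))
        alphas.append(alpha)
    def H_final(Xnew):                           # sign of the weighted vote sum_t alpha_t h_t(x)
        votes = sum(a * s * np.where(Xnew[:, j] > t, 1, -1)
                    for a, (j, t, s) in zip(alphas, stumps))
        return np.sign(votes)
    return H_final
```

Running it on any small labeled dataset, e.g. `X = np.random.randn(200, 2)` with `y = np.sign(X[:, 0] + X[:, 1])`, is enough to watch the training error of the returned `H_final` drop as T grows.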

Adaboost: A toy example
Weak classifiers: vertical or horizontal half-planes (a.k.a. decision stumps). (The original slides step through a few rounds of Adaboost on a small 2D point set in figures, which are not reproduced in this transcription.)

Nice Features of Adaboost
Very general: a meta-procedure, it can use any weak learning algorithm (e.g., Naïve Bayes, decision stumps)!
Very fast (a single pass through the data each round) and simple to code; no parameters to tune.
Shift in mindset: the goal is now just to find classifiers a bit better than random guessing.
Grounded in rich theory.

Guarantees about the Training Error
Theorem: let $\epsilon_t = 1/2 - \gamma_t$ (the error of $h_t$ over $D_t$). Then $err_S(H_{final}) \leq \exp\left(-2 \sum_t \gamma_t^2\right)$.
So, if $\gamma_t \geq \gamma > 0$ for all $t$, then $err_S(H_{final}) \leq \exp\left(-2 \gamma^2 T\right)$: the training error drops exponentially in T!
To get $err_S(H_{final}) \leq \epsilon$, we need only $T = O\left(\frac{1}{\gamma^2} \log \frac{1}{\epsilon}\right)$ rounds.
Adaboost is adaptive: it does not need to know $\gamma$ or T a priori, and it can exploit $\gamma_t \gg \gamma$.
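
A quick plug-in with hypothetical numbers (an edge of $\gamma = 0.1$ and a target training error of $\epsilon = 0.01$, neither taken from the slides) shows how modest T needs to be:

```latex
% Hypothetical numbers: \gamma = 0.1, \epsilon = 0.01.
\exp(-2\gamma^2 T) \le \epsilon
  \;\iff\; T \ge \frac{\ln(1/\epsilon)}{2\gamma^2}
          = \frac{\ln 100}{2 \cdot (0.1)^2}
          \approx \frac{4.61}{0.02} \approx 230 \text{ rounds.}
```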

Understanding the Updates & Normalization
Claim: $D_{t+1}$ puts half of the weight on the $x_i$ where $h_t$ was incorrect and half of the weight on the $x_i$ where $h_t$ was correct.
Recall $D_{t+1}(i) = \frac{D_t(i)}{Z_t} e^{-\alpha_t y_i h_t(x_i)}$, with $e^{\alpha_t} = \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}}$.
Weight on incorrect examples: $\Pr_{i \sim D_{t+1}}[y_i \neq h_t(x_i)] = \sum_{i: y_i \neq h_t(x_i)} \frac{D_t(i) e^{\alpha_t}}{Z_t} = \frac{\epsilon_t e^{\alpha_t}}{Z_t} = \frac{\epsilon_t \sqrt{(1-\epsilon_t)/\epsilon_t}}{Z_t} = \frac{\sqrt{\epsilon_t(1-\epsilon_t)}}{Z_t}$.
Weight on correct examples: $\Pr_{i \sim D_{t+1}}[y_i = h_t(x_i)] = \sum_{i: y_i = h_t(x_i)} \frac{D_t(i) e^{-\alpha_t}}{Z_t} = \frac{(1-\epsilon_t) e^{-\alpha_t}}{Z_t} = \frac{(1-\epsilon_t)\sqrt{\epsilon_t/(1-\epsilon_t)}}{Z_t} = \frac{\sqrt{\epsilon_t(1-\epsilon_t)}}{Z_t}$.
The two probabilities are equal. Moreover, the normalizer is $Z_t = \sum_i D_t(i) e^{-\alpha_t y_i h_t(x_i)} = \sum_{i: y_i = h_t(x_i)} D_t(i) e^{-\alpha_t} + \sum_{i: y_i \neq h_t(x_i)} D_t(i) e^{\alpha_t} = (1-\epsilon_t) e^{-\alpha_t} + \epsilon_t e^{\alpha_t} = 2\sqrt{\epsilon_t(1-\epsilon_t)}$, so each side indeed receives weight exactly $1/2$.
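
A tiny numerical check of the claim (my own illustration, not part of the slides), using a simulated weak classifier with roughly 30% error on a uniform $D_t$:

```python
# After one AdaBoost reweighting, mistakes and correct examples each carry half the mass.
import numpy as np

rng = np.random.default_rng(0)
m = 1000
y = rng.choice([-1, 1], size=m)                # true labels
h = np.where(rng.random(m) < 0.7, y, -y)       # simulated weak classifier, ~30% error

D = np.full(m, 1.0 / m)                        # D_t, uniform for simplicity
eps = D[h != y].sum()                          # weighted error eps_t
alpha = 0.5 * np.log((1 - eps) / eps)          # alpha_t

D_next = D * np.exp(-alpha * y * h)            # unnormalized D_{t+1}
D_next /= D_next.sum()                         # divide by Z_t

print(D_next[h != y].sum(), D_next[h == y].sum())   # both print ~0.5
```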

Generalization Guarantees (informal)
Theorem: $err_S(H_{final}) \leq \exp(-2 \sum_t \gamma_t^2)$, where $\epsilon_t = 1/2 - \gamma_t$. How about generalization guarantees?
Original analysis [Freund & Schapire '97]:
- let H be the set of rules that the weak learner can use;
- let G be the set of weighted majority rules over T elements of H (i.e., the things that AdaBoost might output).
Theorem [Freund & Schapire '97]: for all $g \in G$, $err(g) \leq err_S(g) + \tilde{O}\left(\sqrt{\frac{Td}{m}}\right)$, where T = # of rounds and d = VC dimension of H.

Generalization Guarantees
Theorem [Freund & Schapire '97]: for all $g \in G$, $err(g) \leq err_S(g) + \tilde{O}\left(\sqrt{\frac{Td}{m}}\right)$, where T = # of rounds and d = VC dimension of H.
(Figure: train error and generalization error plotted against complexity, i.e. T = # of rounds.)
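
To see why this bound predicts overfitting as T grows, here is a plug-in with made-up numbers ($d = 10$, $m = 10{,}000$; neither is from the slides):

```latex
% Hypothetical numbers: d = 10, m = 10\,000.
\tilde{O}\!\left(\sqrt{\tfrac{Td}{m}}\right):\qquad
T = 100 \;\Rightarrow\; \sqrt{\tfrac{1000}{10000}} \approx 0.32, \qquad
T = 1000 \;\Rightarrow\; \sqrt{\tfrac{10000}{10000}} = 1 \ \ \text{(vacuous).}
```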

Generalization Guarantees
Experiments with boosting showed that the test error of the generated classifier usually does not increase as its size becomes very large. Experiments also showed that continuing to add new weak learners after correct classification of the training set had been achieved could further improve test set performance!
These results seem to contradict the [Freund & Schapire '97] bound and Occam's razor (which says that in order to achieve good test error the classifier should be as simple as possible)!

How can we explain the experiments?
R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee give a nice theoretical explanation in "Boosting the margin: A new explanation for the effectiveness of voting methods".
Key idea: the training error does not tell the whole story; we also need to consider the classification confidence.

What you should know
The difference between weak and strong learners.
The AdaBoost algorithm and the intuition for the distribution update step.
The training error bound for Adaboost, and the overfitting/generalization guarantee aspects.

Advanced (Not required) Material

Analyzing Training Error: Proof Intuition
Theorem: $\epsilon_t = 1/2 - \gamma_t$ (error of $h_t$ over $D_t$) implies $err_S(H_{final}) \leq \exp(-2 \sum_t \gamma_t^2)$.
On round t, we increase the weight of the $x_i$ for which $h_t$ is wrong.
If $H_{final}$ incorrectly classifies $x_i$:
- then $x_i$ is incorrectly classified by the (weighted) majority of the $h_t$'s,
- which implies that the final probability weight of $x_i$ is large; one can show it is $\geq \frac{1}{m} \cdot \frac{1}{\prod_t Z_t}$.
Since the probabilities sum to 1, we can't have too many examples of high weight: one can show # incorrectly classified $\leq m \prod_t Z_t$. And $\prod_t Z_t \to 0$.

Analyzing Training Error: Proof Math
Step 1: unwrapping the recurrence: $D_{T+1}(i) = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$, where $f(x_i) = \sum_t \alpha_t h_t(x_i)$ [the unthresholded weighted vote of the $h_t$'s on $x_i$].
Step 2: $err_S(H_{final}) \leq \prod_t Z_t$.
Step 3: $\prod_t Z_t = \prod_t 2\sqrt{\epsilon_t (1 - \epsilon_t)} = \prod_t \sqrt{1 - 4\gamma_t^2} \leq e^{-2 \sum_t \gamma_t^2}$.

Analyzing Training Error: Proof Math
Step 1: unwrapping the recurrence: $D_{T+1}(i) = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$, where $f(x_i) = \sum_t \alpha_t h_t(x_i)$.
Recall $D_1(i) = \frac{1}{m}$ and $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \exp(-y_i \alpha_t h_t(x_i))$. Therefore
$D_{T+1}(i) = \frac{\exp(-y_i \alpha_T h_T(x_i))}{Z_T} D_T(i) = \frac{\exp(-y_i \alpha_T h_T(x_i))}{Z_T} \cdot \frac{\exp(-y_i \alpha_{T-1} h_{T-1}(x_i))}{Z_{T-1}} D_{T-1}(i) = \cdots = \frac{\exp(-y_i \alpha_T h_T(x_i))}{Z_T} \cdots \frac{\exp(-y_i \alpha_1 h_1(x_i))}{Z_1} \cdot \frac{1}{m} = \frac{1}{m} \cdot \frac{\exp(-y_i (\alpha_1 h_1(x_i) + \cdots + \alpha_T h_T(x_i)))}{Z_1 \cdots Z_T} = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$.

Analyzing Training Error: Proof Math
Step 1: unwrapping the recurrence: $D_{T+1}(i) = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$, where $f(x_i) = \sum_t \alpha_t h_t(x_i)$.
Step 2: $err_S(H_{final}) \leq \prod_t Z_t$. Indeed,
$err_S(H_{final}) = \frac{1}{m} \sum_i 1_{y_i \neq H_{final}(x_i)} = \frac{1}{m} \sum_i 1_{y_i f(x_i) \leq 0} \leq \frac{1}{m} \sum_i \exp(-y_i f(x_i))$ [the exponential loss upper-bounds the 0/1 loss]
$= \sum_i D_{T+1}(i) \prod_t Z_t = \prod_t Z_t$.

Analyzing Training Error: Proof Math
Step 1: unwrapping the recurrence: $D_{T+1}(i) = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$, where $f(x_i) = \sum_t \alpha_t h_t(x_i)$.
Step 2: $err_S(H_{final}) \leq \prod_t Z_t$.
Step 3: $\prod_t Z_t = \prod_t 2\sqrt{\epsilon_t (1 - \epsilon_t)} = \prod_t \sqrt{1 - 4\gamma_t^2} \leq e^{-2 \sum_t \gamma_t^2}$.
Note: recall $Z_t = (1 - \epsilon_t) e^{-\alpha_t} + \epsilon_t e^{\alpha_t} = 2\sqrt{\epsilon_t(1 - \epsilon_t)}$; $\alpha_t$ is exactly the minimizer of $\alpha \mapsto (1 - \epsilon_t) e^{-\alpha} + \epsilon_t e^{\alpha}$.