Lecture 8. Instructor: Haipeng Luo


Boosting and AdaBoost

In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine learning, but also one of the most widely used machine learning algorithms in practice. It was originally proposed in a statistical learning setting for binary classification, which will also be the focus here. Informally, the key question that boosting tries to answer is: if we have a learning algorithm with some nontrivial (but also not great) prediction accuracy (say 51%), is it possible to boost the accuracy to something arbitrarily high (say 99.9%) in a blackbox manner?

Formally, we consider the following binary classification task. Given a set of training examples $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, where each $(x_i, y_i) \in \mathcal{X} \times \{-1, +1\}$ (for some feature space $\mathcal{X}$) is an i.i.d. sample of an unknown distribution $\mathcal{D}$, our goal is to output a binary classifier $H : \mathcal{X} \to \{-1, +1\}$ with small training error $\frac{1}{N}\sum_{i=1}^N \mathbf{1}\{H(x_i) \neq y_i\}$ and, more importantly, small generalization error $\mathbb{E}_{(x,y)\sim\mathcal{D}}[\mathbf{1}\{H(x) \neq y\}]$.

Now suppose we are given a weak learning oracle $\mathcal{A}$ which takes a training set $S$ and a distribution $p(\cdot)$ over the training examples as input (in other words, a weighted training set), and outputs a classifier $h = \mathcal{A}(S, p) \in \mathcal{H}$ for some hypothesis space $\mathcal{H}$ such that the weighted average error according to $p$ is always bounded as
\[
\mathbb{E}_{i\sim p}[\mathbf{1}\{h(x_i) \neq y_i\}] = \sum_{i:\, h(x_i) \neq y_i} p(i) \leq \frac{1}{2} - \gamma \tag{1}
\]
for some constant $\gamma > 0$. This is called the weak learning assumption, and $h$ is also called a weak classifier. One should think of $\gamma$ as something very small, such as 1%, so the weak learning assumption is very mild and is just saying that the oracle $\mathcal{A}$ is doing something slightly nontrivial, since trivial random guessing has expected error rate $1/2$. We call $\gamma$ the edge of the oracle.

In practice, such an oracle can be any existing off-the-shelf learning algorithm, such as SVM, decision tree algorithms (e.g. C4.5), neural net algorithms, or even something much simpler such as decision stump algorithms (a stump is a tree with depth 1). Many of these algorithms support taking a weighted training set as input. However, even if a weighted training set is not supported, one can simply generate a new (and large enough) training set by sampling from $S$ according to $p$ with replacement and use it as the input for the oracle instead, so that the (unweighted) training error on this new training set is close to the weighted error rate on $S$.

A boosting algorithm is then a master algorithm that uses the oracle as a subroutine and outputs a strong classifier with very high accuracy. The fact that this is even possible is highly nontrivial and was not clear until the seminal work of [Schapire, 1990]. The idea is to repeatedly apply the oracle with a distribution that focuses on hard examples and to combine all the weak classifiers outputted by the oracle in some smart way. The most successful boosting algorithm is AdaBoost [Freund and Schapire, 1997] (short for Adaptive Boosting), outlined in Algorithm 1 below. At each iteration AdaBoost maintains a distribution $p_t$ and invokes the oracle with a training set weighted by $p_t$ to get a weak classifier $h_t$. Afterwards, based on the weighted error rate of $h_t$, $p_t$ is updated in a multiplicative way so that misclassified examples get more weight in the next iteration. The final output of AdaBoost is a weighted majority vote of all the weak classifiers.
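Before stating Algorithm 1, it may help to see what the oracle interface $\mathcal{A}(S, p)$ can look like in code. The following is a minimal Python/NumPy sketch of a weighted decision-stump learner of the kind mentioned above; the name stump_oracle and its brute-force search over thresholds are illustrative assumptions, not part of the lecture, and it only promises to return the stump with the smallest $p$-weighted error rather than to satisfy Eq. (1) on every dataset.

import numpy as np

def stump_oracle(X, y, p):
    """Hypothetical weak learning oracle A(S, p): return the decision stump
    (feature j, threshold theta, sign s) minimizing the p-weighted error.
    X: (N, d) features, y: (N,) labels in {-1, +1}, p: (N,) distribution."""
    N, d = X.shape
    best, best_err = None, np.inf
    for j in range(d):
        for theta in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = s * np.where(X[:, j] <= theta, 1, -1)
                err = np.sum(p[pred != y])   # weighted error E_{i~p}[1{h(x_i) != y_i}]
                if err < best_err:
                    best_err, best = err, (j, theta, s)
    j, theta, s = best
    # return the stump as a function h mapping a feature matrix to {-1, +1} labels
    return lambda Z, j=j, theta=theta, s=s: s * np.where(Z[:, j] <= theta, 1, -1)

Any off-the-shelf learner that accepts example weights could play the same role; this stump is only the simplest concrete instance.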

Algorithm 1: AdaBoost

Input: training set $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, weak learning oracle $\mathcal{A}$.
Initialize: $p_1(i) = 1/N$ for $i = 1, \ldots, N$.
for $t = 1, \ldots, T$ do
    invoke the oracle to get $h_t = \mathcal{A}(S, p_t)$
    calculate the weighted error $\epsilon_t = \mathbb{E}_{i\sim p_t}[\mathbf{1}\{h_t(x_i) \neq y_i\}]$, and $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$
    update the distribution: for all $i \in [N]$,
    \[
    p_{t+1}(i) \propto p_t(i) \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{else} \end{cases}
    \]
output the final classifier: $H(x) = \mathrm{sgn}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$.

Here $\mathrm{sgn}(z)$ is $+1$ if $z > 0$ and $-1$ otherwise, and the weight $\alpha_t$ for each $h_t$ is calculated based on the error rate of $h_t$ so that a more accurate classifier has more weight in the final vote. (Note that the algorithm makes sense even when $\epsilon_t \geq 1/2$. Think about why.)

The first result about AdaBoost is that it drives the training error to zero exponentially fast.

Theorem 1. After $T$ rounds, the error rate of the final output of AdaBoost is bounded as
\[
\frac{1}{N}\sum_{i=1}^N \mathbf{1}\{H(x_i) \neq y_i\} \leq \exp\left(-2\sum_{t=1}^T \gamma_t^2\right) \tag{2}
\]
where $\gamma_t = \frac{1}{2} - \epsilon_t$ is the edge of classifier $h_t$. Under the weak learning assumption, the error rate is then bounded by $\exp(-2\gamma^2 T)$ and is $0$ as long as $T > \frac{\ln N}{2\gamma^2}$.

Proof. Let $F(x) = \sum_{t=1}^T \alpha_t h_t(x)$ so that $H(x) = \mathrm{sgn}(F(x))$. The first step is to realize that the 0-1 loss is bounded by the exponential loss:
\[
\frac{1}{N}\sum_{i=1}^N \mathbf{1}\{H(x_i) \neq y_i\} = \sum_{i=1}^N p_1(i)\,\mathbf{1}\{y_i F(x_i) \leq 0\} \leq \sum_{i=1}^N p_1(i)\exp(-y_i F(x_i)).
\]
Now notice that the update rule of $p_t$ can be written as $p_{t+1}(i) = p_t(i)\exp(-y_i \alpha_t h_t(x_i))/Z_t$, where $Z_t$ is the normalization factor. We then have
\[
p_{T+1}(i) = p_1(i)\prod_{t=1}^T \frac{\exp(-y_i \alpha_t h_t(x_i))}{Z_t} = p_1(i)\,\frac{\exp(-y_i F(x_i))}{\prod_{t=1}^T Z_t},
\]
and therefore the error rate is bounded by $\sum_{i=1}^N p_{T+1}(i)\prod_{t=1}^T Z_t = \prod_{t=1}^T Z_t$. It remains to bound each $Z_t$:
\[
Z_t = \sum_{i=1}^N p_t(i)\exp(-y_i \alpha_t h_t(x_i)) = \sum_{i:\, h_t(x_i)=y_i} p_t(i)e^{-\alpha_t} + \sum_{i:\, h_t(x_i)\neq y_i} p_t(i)e^{\alpha_t} = (1-\epsilon_t)e^{-\alpha_t} + \epsilon_t e^{\alpha_t} = \sqrt{1-4\gamma_t^2} \leq \exp(-2\gamma_t^2),
\]
where the second-to-last equality is by the definition of $\alpha_t$ (which is in fact chosen to minimize this expression), and the last step is by the fact $1+z \leq e^z$. This finishes the proof of Eq. (2). Under the weak learning assumption, we further have $\gamma_t \geq \gamma$ and thus the stated bound $\exp(-2\gamma^2 T)$. Finally, note that as soon as the error rate drops below $1/N$, it must become zero. Therefore, as long as $\exp(-2\gamma^2 T) < 1/N$, which means $T > \frac{\ln N}{2\gamma^2}$, AdaBoost ensures zero training error.
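The distribution update and the choice $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ in Algorithm 1 translate directly into a few lines of code. Below is a minimal sketch, assuming a weak_learner callable with the same interface as the hypothetical stump_oracle above; the clamping of $\epsilon_t$ away from $0$ and $1$ is a numerical convenience, not part of the algorithm as stated.

import numpy as np

def adaboost(X, y, weak_learner, T):
    """Sketch of Algorithm 1. X: (N, d), y: (N,) in {-1, +1},
    weak_learner(X, y, p) returns a classifier h (a callable on feature matrices)."""
    N = len(y)
    p = np.full(N, 1.0 / N)                 # p_1(i) = 1/N
    hs, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, p)
        pred = h(X)
        eps = float(np.clip(np.sum(p[pred != y]), 1e-12, 1 - 1e-12))  # weighted error eps_t
        alpha = 0.5 * np.log((1 - eps) / eps)                          # alpha_t
        p = p * np.exp(-alpha * y * pred)    # multiplicative update: e^{-alpha} if correct, e^{alpha} else
        p /= p.sum()                         # normalization plays the role of Z_t
        hs.append(h)
        alphas.append(alpha)
    # final classifier H(x) = sgn(sum_t alpha_t h_t(x)); ties (sign 0) left as-is in this sketch
    def H(Z):
        return np.sign(sum(a * h(Z) for a, h in zip(alphas, hs)))
    return H, hs, np.array(alphas)

For example, H, hs, alphas = adaboost(X, y, stump_oracle, T=100) would run the sketch with the hypothetical stump oracle from earlier.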

Of course, at the end of the day one cares about the generalization error instead of the training error. Standard VC theory says that the difference between the generalization error and the training error is bounded by something like $\tilde{O}(\sqrt{C/N})$ where $C$ is some complexity measure of the hypothesis space. For boosting, it is not so surprising that this complexity grows as $O(T)$, since combining more weak classifiers leads to a more complicated final classifier $H$. Since we just proved that after $T > \frac{\ln N}{2\gamma^2}$ rounds the training error of AdaBoost is zero, this means that if we stop the algorithm at the right time, AdaBoost will have generalization error of order $\tilde{O}(\sqrt{T/N})$, which can be arbitrarily small when $N$ is large enough. In other words, under the weak learning assumption AdaBoost does ensure arbitrarily small generalization error given enough examples, implying that boosting is indeed possible, or in more technical terms, weak learnability is equivalent to strong learnability.

Preventing Overfitting. Like all other machine learning algorithms, AdaBoost could also overfit the training set. This is consistent with the VC theory: if we keep running the algorithm even after the training error has dropped to zero, then the generalization error bound is of order $\tilde{O}(\sqrt{T/N})$, which is increasing in the number of rounds $T$. However, what usually happens in practice is that AdaBoost tends to prevent overfitting, in the sense that even if one keeps running the algorithm for many more rounds after the training error has dropped to zero, the generalization error still keeps decreasing. How should we explain this?

It turns out that the concept of margin is the key to understanding this phenomenon. The margin of an example $(x, y)$ (with respect to the classifier $H$) is defined as $y f(x)$ where
\[
f(x) = \frac{F(x)}{\sum_{\tau=1}^T \alpha_\tau} = \sum_{t=1}^T \frac{\alpha_t}{\sum_{\tau=1}^T \alpha_\tau}\, h_t(x).
\]
It is clear that the margin is always in $[-1, 1]$, and the sign of the margin indicates whether the final classifier $H$ makes a mistake on $x$ or not. Specifically, we want the margin to be positive in order to have a low error rate. However, intuitively we also want the margin to be a large positive number (that is, close to $1$), since in this case there is a huge difference between the vote for $+1$ and the vote for $-1$ (recall $H$ is just doing a weighted majority vote), and the decisive win of one label makes us feel more confident about the final prediction. Indeed, margin theory says that for any $\theta$, the generalization error of $H$ and the fraction of training examples with margin at most $\theta$ are related as follows:
\[
\mathbb{E}_{(x,y)\sim\mathcal{D}}[\mathbf{1}\{H(x) \neq y\}] \leq \frac{1}{N}\sum_{i=1}^N \mathbf{1}\{y_i f(x_i) \leq \theta\} + \tilde{O}\left(\sqrt{\frac{C_{\mathcal{H}}}{N\theta^2}}\right)
\]
where $C_{\mathcal{H}}$ is the complexity of the hypothesis space $\mathcal{H}$ used by the oracle (recall $h_t \in \mathcal{H}$), which is independent of the number of weak classifiers combined by $H$ (that is, $T$)! Therefore, if keeping AdaBoost running has the effect of increasing the margins even after the training error drops to zero, then this explains why overfitting does not happen. This is indeed true: under the weak learning assumption AdaBoost guarantees
\[
\frac{1}{N}\sum_{i=1}^N \mathbf{1}\{y_i f(x_i) \leq \theta\} \leq \left((1-2\gamma)^{1-\theta}(1+2\gamma)^{1+\theta}\right)^{T/2}.
\]
One way to interpret this bound is to note that as long as $\theta$ is such that $(1-2\gamma)^{1-\theta}(1+2\gamma)^{1+\theta} < 1$, which translates to
\[
\theta < \Gamma(\gamma) \stackrel{\mathrm{def}}{=} \frac{\ln\left(\frac{1}{1-4\gamma^2}\right)}{\ln\left(\frac{1+2\gamma}{1-2\gamma}\right)},
\]
then the fraction of examples with margin at most $\theta$ is eventually zero when $T$ is large enough. In other words, if we keep running AdaBoost, eventually the smallest margin is $\Gamma(\gamma)$ and the generalization error is thus $\tilde{O}\left(\sqrt{\frac{C_{\mathcal{H}}}{N\,\Gamma(\gamma)^2}}\right)$. As a final remark, interestingly AdaBoost is not doing the best possible job in maximizing the smallest margin: the best possible smallest margin under the weak learning assumption can be shown to be exactly $2\gamma$.
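To make the margin definition concrete, the normalized margins $y_i f(x_i)$ and the empirical fraction $\frac{1}{N}\sum_i \mathbf{1}\{y_i f(x_i) \leq \theta\}$ appearing in the bounds above can be computed directly from the output of the hypothetical adaboost sketch; the snippet below only illustrates the definitions, not the margin bound itself.

import numpy as np

def margins(X, y, hs, alphas):
    """Normalized margins y_i f(x_i), with f(x) = sum_t (alpha_t / sum_tau alpha_tau) h_t(x)."""
    F = sum(a * h(X) for a, h in zip(alphas, hs))   # unnormalized score F(x_i)
    f = F / np.sum(alphas)                          # normalize so f(x) lies in [-1, 1]
    return y * f

def margin_fraction(m, theta):
    """Fraction of training examples with margin at most theta."""
    return np.mean(m <= theta)

Plotting margin_fraction over a grid of $\theta$ before and after the training error hits zero is the usual way to visualize the "margins keep growing" effect described above.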

Boosting via Online Learning

In this section we show a simple reduction from boosting to an expert problem. In other words, given an expert algorithm as a blackbox, we can turn it into a boosting algorithm, as shown below.

Algorithm 2: Boosting via an Expert Algorithm

Input: training set $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, weak learning oracle $\mathcal{A}$, expert algorithm $\mathcal{E}$.
Initialize the expert algorithm $\mathcal{E}$ with $N$ experts.
for $t = 1, \ldots, T$ do
    let $p_t$ be the prediction of $\mathcal{E}$ at round $t$
    invoke the oracle to get $h_t = \mathcal{A}(S, p_t)$
    feed the loss vector $\ell_t$ to $\mathcal{E}$, where $\ell_t(i) = \mathbf{1}\{h_t(x_i) = y_i\}$
output the final classifier: $H(x) = \mathrm{sgn}\left(\sum_{t=1}^T h_t(x)\right)$.

While one might imagine that it is natural to treat each possible weak classifier $h \in \mathcal{H}$ as an expert, the reduction above actually treats each training example as an expert. Moreover, the reduction punishes an expert/example by assigning loss $1$ if it is correctly classified, which seems counterintuitive at first glance but is in fact consistent with the key idea of boosting, that is, more focus should be put on hard examples.

Now suppose the regret of $\mathcal{E}$ after $T$ rounds is $R_T$. We have by construction
\[
\sum_{t=1}^T \sum_{i=1}^N p_t(i)\,\mathbf{1}\{h_t(x_i) = y_i\} \leq \sum_{t=1}^T \mathbf{1}\{h_t(x_j) = y_j\} + R_T \quad \text{for any } j \in [N].
\]
On the other hand, reusing the notation $\epsilon_t = \mathbb{E}_{i\sim p_t}[\mathbf{1}\{h_t(x_i) \neq y_i\}]$ and $\gamma_t = \frac{1}{2} - \epsilon_t$, we have under the weak learning assumption
\[
\sum_{t=1}^T \sum_{i=1}^N p_t(i)\,\mathbf{1}\{h_t(x_i) = y_i\} = \sum_{t=1}^T (1 - \epsilon_t) = \frac{T}{2} + \sum_{t=1}^T \gamma_t \geq T\left(\frac{1}{2} + \gamma\right).
\]
Combining the two leads to
\[
T\left(\frac{1}{2} + \gamma\right) \leq \sum_{t=1}^T \mathbf{1}\{h_t(x_j) = y_j\} + R_T. \tag{3}
\]
Therefore, as long as $\gamma T > R_T$, we have $\frac{T}{2} < \sum_{t=1}^T \mathbf{1}\{h_t(x_j) = y_j\}$, implying that more than half of the weak classifiers predict correctly. Since the final classifier $H$ is a simple majority vote, this also implies that $H$ has zero training error. Plugging in the optimal regret $R_T = O(\sqrt{T\ln N})$ (for example, by using Hedge), we thus only need $T > O\left(\frac{\ln N}{\gamma^2}\right)$ rounds to achieve zero training error, the same as AdaBoost.

Moreover, since $\mathbf{1}\{h(x) = y\} = \frac{y h(x) + 1}{2}$, plugging this fact into Eq. (3) and simplifying lead to (with $f(x) = \frac{1}{T}\sum_{t=1}^T h_t(x)$)
\[
2\gamma - \frac{2R_T}{T} \leq y_j f(x_j).
\]
Note that $y_j f(x_j)$ is exactly the margin for example $x_j$. The above is therefore saying that if $R_T$ is sublinear and we keep running the algorithm, eventually every example will have margin above something arbitrarily close to $2\gamma$, the optimal margin mentioned in the last section. In other words, this expert-algorithm-turned-boosting-algorithm is doing an even better job in maximizing the margin compared to AdaBoost.
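Algorithm 2 is easy to instantiate with Hedge as the expert algorithm $\mathcal{E}$: one expert per training example, loss $1$ when the example is classified correctly, and a multiplicative-weights update. The sketch below is one such instantiation under those assumptions; the learning rate $\eta = \sqrt{\ln(N)/T}$ is a standard Hedge tuning and is not specified in the notes.

import numpy as np

def hedge_boost(X, y, weak_learner, T):
    """Sketch of Algorithm 2 with Hedge as the expert algorithm:
    one expert per training example, loss l_t(i) = 1{h_t(x_i) = y_i}."""
    N = len(y)
    eta = np.sqrt(np.log(N) / T)            # standard Hedge tuning (an assumption, not from the notes)
    w = np.ones(N)                          # one weight per expert/example
    hs = []
    for t in range(T):
        p = w / w.sum()                     # Hedge's prediction p_t
        h = weak_learner(X, y, p)
        hs.append(h)
        loss = (h(X) == y).astype(float)    # loss 1 if the example is correctly classified
        w = w * np.exp(-eta * loss)         # Hedge's multiplicative-weights update
    # final classifier: simple (unweighted) majority vote; ties (sign 0) left as-is in this sketch
    return lambda Z: np.sign(sum(h(Z) for h in hs))

Compared with the AdaBoost sketch earlier, the only differences are the fixed learning rate (instead of the adaptive $\alpha_t$) and the unweighted final majority vote, exactly as in Algorithm 2.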

Finally, we point out how an adaptive quantile regret bound can say something more meaningful in this context. For any margin threshold $\theta$, what does the above derivation tell us about the fraction of examples with margin at most $\theta$, that is, $\frac{1}{N}\sum_{i=1}^N \mathbf{1}\{y_i f(x_i) \leq \theta\}$? In fact, it only says that if $\theta \leq 2\gamma - \frac{2R_T}{T}$ then the fraction is zero; otherwise it could be as large as $1$, a trivial bound. However, suppose the expert algorithm admits a quantile regret bound (such as Squint). Then, assuming $y_1 f(x_1) \leq y_2 f(x_2) \leq \cdots \leq y_N f(x_N)$ without loss of generality, we have
\[
2\gamma - O\left(\sqrt{\frac{\ln(N/j)}{T}}\right) \leq y_j f(x_j).
\]
Therefore, for any $\theta < 2\gamma$, we can just find the largest $j_\theta$ such that $2\gamma - O\big(\sqrt{\ln(N/j_\theta)/T}\big) \leq \theta$, and assert that the fraction of examples with margin at most $\theta$ is at most $j_\theta/N$, which is of order $\exp(-\Omega(T(2\gamma - \theta)^2))$.

References

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, August 1997.

Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197-227, 1990.