AdaBoost

AdaBoost, which stands for "Adaptive Boosting", is an ensemble learning algorithm that uses the boosting paradigm [1]. We will discuss AdaBoost for binary classification. That is, we assume that we are given a training set

    S := (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)

where ∀i, y_i ∈ {-1, 1}, and a pool of hypothesis functions 𝓗 from which we are to pick T hypotheses in order to form an ensemble H. H then makes a decision using the individual hypotheses h_1, ..., h_T in the ensemble as follows:

    H(x) := Σ_{i=1}^T α_i h_i(x)    (1)

That is, H uses a linear combination of the decisions of each of the h_i hypotheses in the ensemble. The AdaBoost algorithm sequentially chooses each h_i from 𝓗 and assigns this hypothesis a weight α_i. We let H_t be the classifier formed by the first t hypotheses. That is,

    H_t(x) := Σ_{i=1}^t α_i h_i(x) = H_{t-1}(x) + α_t h_t(x)

where H_0(x) := 0. That is, the empty ensemble always outputs 0.

The idea behind the AdaBoost algorithm is that the t-th hypothesis should correct for the errors that the first t-1 hypotheses make on the training set. More specifically, after we select the first t-1 hypotheses, we determine which instances in S these t-1 hypotheses perform poorly on and make sure that the t-th hypothesis performs well on those instances. The pseudocode for AdaBoost is given in Algorithm 1. A high-level overview of the algorithm is described below.

1. Initialize a training set distribution

At each iteration 1, ..., T of the AdaBoost algorithm, we define a probability distribution D over the training instances in S. We let D_t be the probability distribution at the t-th iteration and D_t(i) be the probability assigned to the i-th training instance, (x_i, y_i) ∈ S, according to D_t. As the algorithm proceeds, each iteration designs D_t so that it assigns higher probability mass to instances that the first t-1 hypotheses performed poorly on. That is, the worse the performance on x_i, the higher D_t(i) will be. At the onset of the algorithm, we set D_1 to be the uniform distribution over the instances. That is,

    ∀i ∈ {1, 2, ..., n}, D_1(i) := 1/n

where n is the size of S.

Algorithm 1: AdaBoost for binary classification

Precondition: a training set S := (x_1, y_1), ..., (x_n, y_n), a hypothesis space 𝓗, and a number of iterations T.

    for i ∈ {1, 2, ..., n} do
        D_1(i) ← 1/n
    end for
    H ← ∅
    for t = 1, ..., T do
        h_t ← argmin_{h ∈ 𝓗} P_{i∼D_t}(h(x_i) ≠ y_i)        ▷ find a good hypothesis on the weighted training set
        ϵ_t ← P_{i∼D_t}(h_t(x_i) ≠ y_i)                      ▷ compute the hypothesis's error
        α_t ← (1/2) ln((1 - ϵ_t)/ϵ_t)                        ▷ compute the hypothesis's weight
        H ← H ∪ {(α_t, h_t)}                                 ▷ add the hypothesis to the ensemble
        for i ∈ {1, 2, ..., n} do                            ▷ update the training set distribution
            D_{t+1}(i) ← D_t(i) e^{-α_t y_i h_t(x_i)} / Σ_j D_t(j) e^{-α_t y_j h_t(x_j)}
        end for
    end for
    return H

2. Find a new hypothesis to add to the ensemble

At the t-th iteration, we search for a new hypothesis, h_t, that performs well on S assuming that instances are drawn from D_t. By "performs well", we mean that h_t should have a low expected 0-1 loss on S under D_t. That is,

    h_t := argmin_{h ∈ 𝓗} E_{i∼D_t}[ℓ_{0-1}(h, x_i, y_i)] = argmin_{h ∈ 𝓗} P_{i∼D_t}(y_i ≠ h(x_i))

We call this expected loss the "weighted loss" because the 0-1 loss is not computed on the instances in the training set directly, but rather on the weighted instances in the training set.

3. Assign the new hypothesis a weight

Once we compute h_t, we assign h_t a weight α_t based on its performance. More specifically, we give it the weight

    α_t := (1/2) ln((1 - ϵ_t)/ϵ_t)    (2)

where ϵ_t := P_{i∼D_t}(y_i ≠ h_t(x_i)). We will soon explain the theoretical justification for this precise weight assignment, but intuitively we see that the higher ϵ_t, the larger the denominator and the smaller the numerator of (1 - ϵ_t)/ϵ_t, and thus the smaller (1/2) ln((1 - ϵ_t)/ϵ_t) will be. Thus, if the new hypothesis, h_t, has a high error ϵ_t, then we assign this hypothesis a smaller weight. That is, h_t will contribute less to the output of the ensemble H.

4. Recompute the training set distribution

Once the new hypothesis is added to the ensemble, we recompute the training set distribution so that each instance is assigned a probability reflecting how poorly the current ensemble H_t performs on it. We compute D_{t+1} as follows:

    D_{t+1}(i) := D_t(i) e^{-α_t y_i h_t(x_i)} / Σ_j D_t(j) e^{-α_t y_j h_t(x_j)}    (3)

We will soon explain a theoretical justification for this precise probability assignment, but for now we can gain an intuitive understanding. Note the term e^{-α_t y_i h_t(x_i)}. If h_t(x_i) = y_i, then y_i h_t(x_i) = 1, which means that e^{-α_t y_i h_t(x_i)} = e^{-α_t}. If, on the other hand, h_t(x_i) ≠ y_i, then y_i h_t(x_i) = -1, which means that e^{-α_t y_i h_t(x_i)} = e^{α_t}. Thus, we see that e^{-α_t y_i h_t(x_i)} is smaller when the hypothesis's prediction agrees with the true label. That is, we assign higher probability to the i-th instance if h_t was wrong on x_i.

Repeat steps 2 through 4

Repeat steps 2 through 4 for the remaining T - 1 iterations. (A runnable sketch of Algorithm 1 is given below.)
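To make Algorithm 1 concrete, the following is a minimal NumPy sketch in which the hypothesis pool 𝓗 is taken to be the set of axis-aligned decision stumps; this choice, the function names (fit_adaboost, best_stump, stump_predict, predict), and the final sign-thresholding of the ensemble score H(x) are illustrative assumptions, not part of the notes. Labels are expected in {-1, +1}.

import numpy as np


def stump_predict(X, feature, threshold, polarity):
    """A decision stump: predict +1 on one side of the threshold, -1 on the other."""
    return polarity * np.where(X[:, feature] <= threshold, 1.0, -1.0)


def best_stump(X, y, D):
    """Exhaustively search stumps for the one minimizing the D-weighted 0-1 loss."""
    best, best_err = None, np.inf
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (1.0, -1.0):
                pred = stump_predict(X, feature, threshold, polarity)
                err = np.sum(D[pred != y])  # P_{i ~ D}(h(x_i) != y_i)
                if err < best_err:
                    best_err = err
                    best = (feature, threshold, polarity)
    return best, best_err


def fit_adaboost(X, y, T=20, eps=1e-10):
    n = len(y)
    D = np.full(n, 1.0 / n)             # step 1: D_1(i) = 1/n
    ensemble = []                       # list of (alpha_t, stump parameters)
    for _ in range(T):
        (f, thr, pol), err = best_stump(X, y, D)              # step 2: find h_t
        alpha = 0.5 * np.log((1 - err + eps) / (err + eps))   # step 3: alpha_t (Equation 2)
        ensemble.append((alpha, f, thr, pol))
        pred = stump_predict(X, f, thr, pol)
        D = D * np.exp(-alpha * y * pred)                     # step 4: reweight (Equation 3)
        D /= D.sum()                                          # normalize to a distribution
    return ensemble


def predict(ensemble, X):
    """H(x) = sum_t alpha_t h_t(x); take the sign for a binary decision."""
    scores = sum(a * stump_predict(X, f, thr, pol) for a, f, thr, pol in ensemble)
    return np.sign(scores)

For example, with ϵ_t = 0.25, Equation 2 gives α_t = (1/2) ln 3 ≈ 0.55, so before normalization each misclassified instance's weight is multiplied by e^{α_t} ≈ 1.73 while each correctly classified instance's weight is multiplied by e^{-α_t} ≈ 0.58.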

Derivation of AdaBoost from first principles

The AdaBoost algorithm can be viewed as an algorithm that searches for hypotheses of the form of Equation 1 in order to minimize the empirical loss under the exponential loss function:

    ℓ_exp(h, x, y) := e^{-y h(x)}

We note that there are many ways in which one might search for a hypothesis of the form of Equation 1 in order to minimize the exponential loss function. The AdaBoost algorithm performs this minimization using a sequential procedure such that, at iteration t, we are given H_{t-1} and our goal is to produce H_t = H_{t-1} + α_t h_t, where the new h_t and α_t minimize the exponential loss of H_t on the training data. Theorem 1 shows that AdaBoost's choice of h_t minimizes the exponential loss of H_t over the training data. That is,

    h_t = argmin_{h ∈ 𝓗} L_S(H_{t-1} + Ch)

where

    L_S(H_{t-1} + Ch) := (1/n) Σ_{i=1}^n ℓ_exp(H_{t-1} + Ch, x_i, y_i)

and C is an arbitrary positive constant. Theorem 2 shows that once h_t is chosen, AdaBoost's choice of α_t then further minimizes the exponential loss of H_t over the training set. That is,

    α_t := argmin_α L_S(H_{t-1} + α h_t)
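As a small illustration of this objective (a sketch with hypothetical names, not from the notes), the snippet below evaluates the empirical exponential loss L_S(H) = (1/n) Σ_i e^{-y_i H(x_i)} for a vector of real-valued ensemble scores H(x_i).

import numpy as np

def empirical_exp_loss(ensemble_scores, y):
    """L_S(H) = (1/n) * sum_i exp(-y_i * H(x_i)); y[i] in {-1, +1}."""
    return np.mean(np.exp(-y * ensemble_scores))

# Example: one confidently correct score and one mildly incorrect score.
scores = np.array([2.0, -0.5])   # H(x_1), H(x_2)
labels = np.array([1.0, 1.0])    # y_1, y_2
print(empirical_exp_loss(scores, labels))  # (e^{-2} + e^{0.5}) / 2 ≈ 0.89

Note that instances with y_i H(x_i) < 0 are penalized exponentially, which is what drives the reweighting in Equation 3.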

Theorem 1. The choice of h_t under AdaBoost,

    h_t := argmin_{h ∈ 𝓗} P_{i∼D_t}(y_i ≠ h(x_i)),

minimizes the exponential loss of H_t over the training set. That is, given an arbitrary positive constant C,

    h_t = argmin_{h ∈ 𝓗} L_S(H_{t-1} + Ch)

Proof:

    argmin_{h ∈ 𝓗} L_S(H_{t-1} + Ch)

    = argmin_{h ∈ 𝓗} (1/n) Σ_{i=1}^n e^{-y_i [H_{t-1}(x_i) + C h(x_i)]}

    = argmin_{h ∈ 𝓗} Σ_{i=1}^n e^{-y_i H_{t-1}(x_i)} e^{-C y_i h(x_i)}

    = argmin_{h ∈ 𝓗} Σ_{i=1}^n w_{t,i} e^{-C y_i h(x_i)}                                            (let w_{t,i} := e^{-y_i H_{t-1}(x_i)})

    = argmin_{h ∈ 𝓗} [ e^{-C} Σ_{i: h(x_i) = y_i} w_{t,i} + e^{C} Σ_{i: h(x_i) ≠ y_i} w_{t,i} ]     (split the summation)

    = argmin_{h ∈ 𝓗} [ e^{-C} Σ_i w_{t,i} + (e^{C} - e^{-C}) Σ_{i: h(x_i) ≠ y_i} w_{t,i} ]

    = argmin_{h ∈ 𝓗} Σ_{i: h(x_i) ≠ y_i} w_{t,i}                                                    (e^{-C} Σ_i w_{t,i} and e^{C} - e^{-C} > 0 are constants in h)

    = argmin_{h ∈ 𝓗} Σ_{i: h(x_i) ≠ y_i} w_{t,i} / Σ_j w_{t,j}                                      (Σ_j w_{t,j} is a constant in h)

    = argmin_{h ∈ 𝓗} P_{i∼D_t}(y_i ≠ h(x_i))                                                        (see Lemma 1)

    = h_t
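As a quick sanity check of Theorem 1 (an illustration with made-up data, not part of the notes), the sketch below compares ranking a pool of candidate hypotheses by their D_t-weighted 0-1 error against ranking them by the exponential loss of H_{t-1} + Ch for a positive C; both select the same hypothesis.

import numpy as np

rng = np.random.default_rng(1)
n, n_candidates = 50, 8
y = rng.choice([-1.0, 1.0], size=n)                            # labels y_i
H_prev = rng.normal(size=n)                                    # H_{t-1}(x_i): arbitrary real scores
candidates = rng.choice([-1.0, 1.0], size=(n_candidates, n))   # candidate predictions h(x_i)

w = np.exp(-y * H_prev)        # w_{t,i} = exp(-y_i * H_{t-1}(x_i))
D = w / w.sum()                # D_t(i), using Lemma 1

# Weighted 0-1 error P_{i ~ D_t}(h(x_i) != y_i) for each candidate h.
weighted_01 = np.array([D[h != y].sum() for h in candidates])

# Exponential loss L_S(H_{t-1} + C*h) for each candidate h, with some C > 0.
C = 0.7
exp_loss = np.array([np.mean(np.exp(-y * (H_prev + C * h))) for h in candidates])

print(np.argmin(weighted_01) == np.argmin(exp_loss))   # True: the same hypothesis wins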

Lemma 1.

    P_{i∼D_t}(y_i ≠ h(x_i)) = Σ_{i: h(x_i) ≠ y_i} w_{t,i} / Σ_j w_{t,j}

where w_{t,i} := e^{-y_i H_{t-1}(x_i)}.

Proof: First, we show that

    D_t(i) = w_{t,i} / Σ_j w_{t,j}    (4)

We show this fact by induction. First, we prove the base case:

    w_{1,i} / Σ_j w_{1,j} = e^{-y_i H_0(x_i)} / Σ_j e^{-y_j H_0(x_j)} = 1/n = D_1(i)    for all i, because H_0(x_i) = 0

Next, we need to prove the inductive step. That is, we prove that

    D_t(i) = w_{t,i} / Σ_j w_{t,j}  ⟹  D_{t+1}(i) = w_{t+1,i} / Σ_j w_{t+1,j}

This is proven as follows:

    D_{t+1}(i) := D_t(i) e^{-α_t y_i h_t(x_i)} / Σ_j D_t(j) e^{-α_t y_j h_t(x_j)}                                   (by Equation 3)

    = [w_{t,i} / Σ_j w_{t,j}] e^{-α_t y_i h_t(x_i)} / Σ_j [w_{t,j} / Σ_k w_{t,k}] e^{-α_t y_j h_t(x_j)}             (by the inductive hypothesis)

    = [e^{-y_i H_{t-1}(x_i)} / Σ_j e^{-y_j H_{t-1}(x_j)}] e^{-α_t y_i h_t(x_i)} / Σ_j [e^{-y_j H_{t-1}(x_j)} / Σ_k e^{-y_k H_{t-1}(x_k)}] e^{-α_t y_j h_t(x_j)}     (by the fact that w_{t,i} := e^{-y_i H_{t-1}(x_i)})

    = e^{-y_i H_{t-1}(x_i)} e^{-α_t y_i h_t(x_i)} / Σ_j e^{-y_j H_{t-1}(x_j)} e^{-α_t y_j h_t(x_j)}                 (the normalizing sums cancel)

    = e^{-y_i H_{t-1}(x_i) - α_t y_i h_t(x_i)} / Σ_j e^{-y_j H_{t-1}(x_j) - α_t y_j h_t(x_j)}

    = e^{-y_i H_t(x_i)} / Σ_j e^{-y_j H_t(x_j)}

    = w_{t+1,i} / Σ_j w_{t+1,j}

Now that we have proven Equation 4, it follows that

    P_{i∼D_t}(y_i ≠ h(x_i)) = Σ_{i: h(x_i) ≠ y_i} D_t(i) = Σ_{i: h(x_i) ≠ y_i} w_{t,i} / Σ_j w_{t,j}

Theorem 2. The choice of α_t under AdaBoost,

    α_t := (1/2) ln((1 - ϵ_t)/ϵ_t)    where ϵ_t := P_{i∼D_t}(y_i ≠ h_t(x_i)),

minimizes the exponential loss of H_t over the training set. That is,

    α_t = argmin_α L_S(H_{t-1} + α h_t)

Proof: Our goal is to solve

    α_t := argmin_α L_S(H_{t-1} + α h_t)

Expanding the loss as in the proof of Theorem 1 and dropping factors that do not depend on α,

    argmin_α L_S(H_{t-1} + α h_t) = argmin_α [ e^{-α} Σ_{i: h_t(x_i) = y_i} w_{t,i} + e^{α} Σ_{i: h_t(x_i) ≠ y_i} w_{t,i} ]

To solve this, we set the derivative of the function inside the argmin to zero and solve for α (the function is convex, though we don't prove it here):

    d/dα [ e^{-α} Σ_{i: h_t(x_i) = y_i} w_{t,i} + e^{α} Σ_{i: h_t(x_i) ≠ y_i} w_{t,i} ] = 0

    ⟹ -e^{-α} Σ_{i: h_t(x_i) = y_i} w_{t,i} + e^{α} Σ_{i: h_t(x_i) ≠ y_i} w_{t,i} = 0

    ⟹ e^{2α} = Σ_{i: h_t(x_i) = y_i} w_{t,i} / Σ_{i: h_t(x_i) ≠ y_i} w_{t,i}

    ⟹ 2α = ln( Σ_{i: h_t(x_i) = y_i} w_{t,i} / Σ_{i: h_t(x_i) ≠ y_i} w_{t,i} )

    ⟹ α = (1/2) ln( Σ_{i: h_t(x_i) = y_i} w_{t,i} / Σ_{i: h_t(x_i) ≠ y_i} w_{t,i} )

    ⟹ α = (1/2) ln( [Σ_{i: h_t(x_i) = y_i} w_{t,i} / Σ_j w_{t,j}] / [Σ_{i: h_t(x_i) ≠ y_i} w_{t,i} / Σ_j w_{t,j}] )

    ⟹ α = (1/2) ln( (1 - ϵ_t) / ϵ_t )    (by Lemma 1, Σ_{i: h_t(x_i) ≠ y_i} w_{t,i} / Σ_j w_{t,j} = ϵ_t)
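As a quick numerical check of this closed form (an illustration with made-up weights, not part of the notes), the sketch below compares α_t = (1/2) ln((1 - ϵ_t)/ϵ_t) against a brute-force grid minimization of e^{-α} W_correct + e^{α} W_wrong.

import numpy as np

rng = np.random.default_rng(0)
w = rng.random(10)                                        # arbitrary positive weights w_{t,i}
correct = np.array([1, 1, 1, -1, 1, -1, 1, 1, -1, 1])     # y_i * h_t(x_i): +1 right, -1 wrong

W_correct = w[correct == 1].sum()
W_wrong = w[correct == -1].sum()
eps_t = W_wrong / w.sum()                                 # weighted error under D_t

closed_form = 0.5 * np.log((1 - eps_t) / eps_t)           # Theorem 2's alpha_t

alphas = np.linspace(-3.0, 3.0, 60001)                    # brute-force grid search
objective = np.exp(-alphas) * W_correct + np.exp(alphas) * W_wrong
brute_force = alphas[np.argmin(objective)]

print(closed_form, brute_force)                           # the two agree to within the grid spacing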


Bibliography

[1] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.