Open Problem: A (missing) boosting-type convergence result for ADABOOST.MH with factorized multi-class classifiers

JMLR: Workshop and Conference Proceedings vol 35:1–8, 2014

Balázs Kégl
LAL/LRI, University of Paris-Sud, CNRS, 91898 Orsay, France
BALAZS.KEGL@GMAIL.COM

Abstract

In (Kégl, 2014), we recently showed empirically that ADABOOST.MH is one of the best multi-class boosting algorithms when the classical one-against-all base classifiers, proposed in the seminal paper of Schapire and Singer (1999), are replaced by factorized base classifiers containing a binary classifier and a vote (or code) vector. In a slightly different setup, a similar factorization coupled with an iterative optimization of the two factors also proved to be an excellent approach (Gao and Koller, 2011). The main algorithmic advantage of our approach over the original setup of Schapire and Singer (1999) is that trees can be built in a straightforward way by using the binary classifier at inner nodes. In this open problem paper we take a step back to the basic setup of boosting generic multi-class factorized (Hamming) classifiers (so no trees), and state the classical problem of boosting-like convergence of the training error. Given a vote vector, training the classifier leads to a standard weighted binary classification problem. The main difficulty of proving the convergence is that, unlike in binary ADABOOST, the sum of the weights in this weighted binary classification problem is less than one, which means that the lower bound on the edge, coming from the weak learning condition, shrinks. To show the convergence, we need a (uniform) lower bound on the sum of the weights in this derived binary classification problem.

Let the training data be $\mathcal{D} = \{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_n, \mathbf{y}_n)\}$, where $\mathbf{x}_i \in \mathbb{R}^d$ are observation vectors and $\mathbf{y}_i \in \{\pm 1\}^K$ are label vectors. Sometimes we will use the notion of an $n \times d$ observation matrix $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ and an $n \times K$ label matrix $\mathbf{Y} = (\mathbf{y}_1, \ldots, \mathbf{y}_n)$ instead of the set of pairs $\mathcal{D}$.¹ In multi-class classification, the single label $\ell(\mathbf{x})$ of the observation $\mathbf{x}$ comes from a finite set. Without loss of generality, we will suppose that $\ell \in \mathcal{L} = \{1, \ldots, K\}$. The label vector $\mathbf{y}$ is a one-hot representation of the correct class: the $\ell(\mathbf{x})$th element of $\mathbf{y}$ will be $+1$ and all the other elements will be $-1$. To avoid confusion, from now on we will call $\mathbf{y}$ and $\ell$ the label and the label index of $\mathbf{x}$, respectively.

The goal of the ADABOOST.MH algorithm (Schapire and Singer 1999; Appendix B) is to return a vector-valued discriminant function $\mathbf{f}^{(T)}: \mathbb{R}^d \to \mathbb{R}^K$ with a small Hamming loss

$$R_{\mathrm{H}}(\mathbf{f}, \mathbf{W}) = \frac{1}{n} \sum_{i=1}^{n} \sum_{\ell=1}^{K} w_{i,\ell} \, \mathbb{I}\big\{\mathrm{sign}\big(f_\ell(\mathbf{x}_i)\big) \neq y_{i,\ell}\big\}, \tag{1}$$

where $\mathbf{W} = [w_{i,\ell}]$ is an $n \times K$ weight matrix over data points and labels, usually defined as

$$w_\ell = \begin{cases} \dfrac{1}{2} & \text{if } \ell = \ell(\mathbf{x}) \text{ (i.e., if } y_\ell = +1\text{)}, \\[4pt] \dfrac{1}{2(K-1)} & \text{otherwise (i.e., if } y_\ell = -1\text{)}. \end{cases} \tag{2}$$

1. We will use bold capitals $\mathbf{X}$ for matrices, bold small letters $\mathbf{x}_{i,\cdot}$ and $\mathbf{x}_{\cdot,j}$ for its row and column vectors, respectively, and italic for its elements $x_{i,j}$.

© 2014 B. Kégl.
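To make (1) and (2) concrete, here is a minimal NumPy sketch that builds the weight matrix of (2) from label indices and evaluates the weighted Hamming loss (1). The function names and the ±1 one-hot helper are our own illustration, not part of the paper.

```python
import numpy as np

def one_hot_labels(label_indices, K):
    """+1/-1 one-hot label matrix Y: +1 at the correct class, -1 elsewhere."""
    n = len(label_indices)
    Y = -np.ones((n, K))
    Y[np.arange(n), label_indices] = 1.0
    return Y

def initial_weights(label_indices, K):
    """Weight matrix W of Eq. (2): 1/2 on the correct label, 1/(2(K-1)) elsewhere."""
    n = len(label_indices)
    W = np.full((n, K), 1.0 / (2.0 * (K - 1)))
    W[np.arange(n), label_indices] = 0.5
    return W

def hamming_loss(F, Y, W):
    """Weighted Hamming loss R_H of Eq. (1); F holds the scores f_l(x_i)."""
    n = Y.shape[0]
    return np.sum(W * (np.sign(F) != Y)) / n
```

With the weights of (2), each row of $\mathbf{W}$ sums to one, so $R_{\mathrm{H}}$ in (1) always lies in $[0, 1]$.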

ADABOOST.MH directly minimizes a surrogate, the weighted multi-class exponential margin-based error

$$R_{\mathrm{EXP}}\big(\mathbf{f}^{(T)}, \mathbf{W}\big) = \frac{1}{n} \sum_{i=1}^{n} \sum_{\ell=1}^{K} w_{i,\ell} \exp\big({-f^{(T)}_\ell(\mathbf{x}_i)\, y_{i,\ell}}\big). \tag{3}$$

Since $\exp(-\rho) \geq \mathbb{I}\{\rho < 0\}$, (3) upper bounds (1). ADABOOST.MH builds the final discriminant function $\mathbf{f}^{(T)}(\mathbf{x}) = \sum_{t=1}^{T} \mathbf{h}^{(t)}(\mathbf{x})$ as a sum of $T$ base classifiers $\mathbf{h}^{(t)}: \mathbb{R}^d \to \mathbb{R}^K$ returned by a base learner algorithm BASE$(\mathbf{X}, \mathbf{Y}, \mathbf{W}^{(t)})$ in each iteration $t$. Since

$$R_{\mathrm{EXP}}\big(\mathbf{f}^{(T)}, \mathbf{W}\big) = \prod_{t=1}^{T} Z\big(\mathbf{h}^{(t)}, \mathbf{W}^{(t)}\big), \quad \text{where} \quad Z\big(\mathbf{h}, \mathbf{W}^{(t)}\big) = \sum_{i=1}^{n} \sum_{\ell=1}^{K} w^{(t)}_{i,\ell} \exp\big({-h_\ell(\mathbf{x}_i)\, y_{i,\ell}}\big), \tag{4}$$

the goal of the base classifier is to minimize $Z\big(\mathbf{h}, \mathbf{W}^{(t)}\big)$. Assuming factorized base classifiers

$$\mathbf{h}(\mathbf{x}) = \alpha \mathbf{v} \varphi(\mathbf{x}), \tag{5}$$

where $\alpha \in \mathbb{R}_+$ is a positive real-valued base coefficient, $\mathbf{v}$ is an input-independent vote (or code) vector of length $K$, and $\varphi(\mathbf{x})$ is a label-independent binary (scalar) classifier, Kégl (2014) shows that

$$Z(\mathbf{h}, \mathbf{W}) = \frac{e^{\alpha} + e^{-\alpha}}{2} - \frac{e^{\alpha} - e^{-\alpha}}{2} \sum_{\ell=1}^{K} v_\ell \big(\mu_{\ell+} - \mu_{\ell-}\big), \tag{6}$$

where

$$\mu_{\ell-} = \sum_{i=1}^{n} w_{i,\ell} \, \mathbb{I}\{\varphi(\mathbf{x}_i) \neq y_{i,\ell}\} \tag{7}$$

is the weighted per-class error rate and

$$\mu_{\ell+} = \sum_{i=1}^{n} w_{i,\ell} \, \mathbb{I}\{\varphi(\mathbf{x}_i) = y_{i,\ell}\} \tag{8}$$

is the weighted per-class correct classification rate for each class $\ell = 1, \ldots, K$. The quantity

$$\gamma_\ell = v_\ell \big(\mu_{\ell+} - \mu_{\ell-}\big) = \sum_{i=1}^{n} w_{i,\ell} \, v_\ell \, \varphi(\mathbf{x}_i)\, y_{i,\ell} \tag{9}$$

is the classwise edge of $\mathbf{h}(\mathbf{x})$. The full multi-class edge of the classifier is then

$$\gamma = \gamma(\mathbf{v}, \varphi, \mathbf{W}) = \sum_{\ell=1}^{K} \gamma_\ell = \sum_{\ell=1}^{K} v_\ell \big(\mu_{\ell+} - \mu_{\ell-}\big) = \sum_{i=1}^{n} \varphi(\mathbf{x}_i) \sum_{\ell=1}^{K} w_{i,\ell} \, v_\ell \, y_{i,\ell}. \tag{10}$$

With this notation, the classical (Freund and Schapire, 1997) binary coefficient $\alpha$ is recovered: it is easy to see that (6) is minimized when

$$\alpha = \frac{1}{2} \log \frac{1 + \gamma}{1 - \gamma}. \tag{11}$$
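To illustrate the quantities (6)-(11), the following minimal NumPy sketch takes the outputs of a fixed binary classifier $\varphi$ and a weight matrix $\mathbf{W}$ that sums to one, and computes the classwise edges, a vote vector chosen as the sign of $\mu_{\ell+} - \mu_{\ell-}$ (the rule (12) derived below), the resulting multi-class edge $\gamma$, and the coefficient $\alpha$ of (11). The function name is ours.

```python
import numpy as np

def fit_votes_and_alpha(phi_out, Y, W):
    """phi_out: length-n array of phi(x_i) in {-1,+1}; Y, W: n x K label/weight matrices.
    Assumes the weights sum to one, as they do inside the boosting loop."""
    # Classwise edges of 1*phi, i.e. mu_{l+} - mu_{l-} for each class (Eq. (9) with v = 1).
    edge_per_class = np.sum(W * phi_out[:, None] * Y, axis=0)
    # Votes: the sign of mu_{l+} - mu_{l-} maximizes the edge (rule (12) below).
    v = np.where(edge_per_class >= 0, 1.0, -1.0)
    # Multi-class edge (10) with these votes, and the coefficient (11).
    gamma = np.sum(np.abs(edge_per_class))
    alpha = 0.5 * np.log((1.0 + gamma) / (1.0 - gamma))
    return v, gamma, alpha
```

Plugging the returned $\alpha$ back into (6) gives $Z(\mathbf{h}, \mathbf{W}) = \sqrt{1 - \gamma^2}$, which is the quantity driving the convergence argument that follows.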

With this optimal coefficient, similarly to the binary case, (6) becomes $Z(\mathbf{h}, \mathbf{W}) = \sqrt{1 - \gamma^2}$, so $Z(\mathbf{h}, \mathbf{W})$ is minimized when $\gamma$ is maximized. From (10) it then follows that $Z(\mathbf{h}, \mathbf{W})$ is minimized if $v_\ell$ agrees with the sign of $(\mu_{\ell+} - \mu_{\ell-})$, that is,

$$v_\ell = \begin{cases} +1 & \text{if } \mu_{\ell+} > \mu_{\ell-}, \\ -1 & \text{otherwise} \end{cases} \tag{12}$$

for all classes $\ell = 1, \ldots, K$. As in the binary case, it can be shown that if $\gamma > \delta$, the training error becomes zero exponentially fast, more precisely, after

$$T = \left\lceil \frac{2 \log\big(2n(K-1)\big)}{\delta^2} \right\rceil + 1 \tag{13}$$

iterations.²

The problem, and the subject of this open problem paper, is that $\gamma > \delta$ is not implied by the weak learning condition on our single binary classifier $\varphi$. To show this, let us rewrite the edge (10) as

$$\gamma = \sum_{i=1}^{n} \varphi(\mathbf{x}_i)\big(w^+_i - w^-_i\big) = \sum_{i=1}^{n} \varphi(\mathbf{x}_i)\, \mathrm{sign}\big(w^+_i - w^-_i\big)\, \big|w^+_i - w^-_i\big|,$$

where

$$w^+_i = \sum_{\ell=1}^{K} w_{i,\ell} \, \mathbb{I}\{v_\ell y_{i,\ell} = +1\} \quad \text{and} \quad w^-_i = \sum_{\ell=1}^{K} w_{i,\ell} \, \mathbb{I}\{v_\ell y_{i,\ell} = -1\}.$$

By assigning pseudo-labels $y'_i = \mathrm{sign}\big(w^+_i - w^-_i\big)$ and pseudo-weights $w'_i = \big|w^+_i - w^-_i\big|$ to each point $(\mathbf{x}_i, \mathbf{y}_i)$, we can rewrite the multi-class edge (10) as a classical binary classification edge $\gamma = \sum_{i=1}^{n} w'_i \varphi(\mathbf{x}_i) y'_i$. The trouble is that

$$w'_\Sigma = \sum_{i=1}^{n} w'_i \leq 1. \tag{14}$$

The classical weak learning condition requires that there exists a constant $\delta > 0$ for which, with any weighting $\mathbf{w}$ summing to one ($\sum_{i=1}^{n} w_i = 1$), there exists a $\varphi$ with edge

$$\gamma = \sum_{i=1}^{n} w_i \varphi(\mathbf{x}_i) y_i \geq \delta. \tag{15}$$

Since $w'_\Sigma \leq 1$, this only implies $\delta' = \delta\, w'_\Sigma \leq \delta$ in (13). If $w'_\Sigma$ can be arbitrarily small, then the convergence to zero training error can be arbitrarily slow. The open problem is to find a lower bound on $w'_\Sigma$ which may depend on the number of classes $K$ but is independent of the number of training points $n$.

Acknowledgments

This work was supported by the ANR-2010-COSI-002 grant of the French National Research Agency.

2. This bound can be tightened to $\big\lceil 2 \log\big(n\sqrt{K-1}\big)/\delta^2 \big\rceil + 1$ (Schapire and Singer, 1999).
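The quantity $w'_\Sigma$ at the heart of the open problem is easy to compute for a given vote vector. Below is a minimal NumPy sketch (the function name is ours) that builds the derived binary problem behind (14) and (15) from $\mathbf{v}$, $\mathbf{Y}$, and a weight matrix $\mathbf{W}$ assumed to sum to one.

```python
import numpy as np

def derived_binary_problem(v, Y, W):
    """Pseudo-labels y'_i = sign(w_i^+ - w_i^-), pseudo-weights w'_i = |w_i^+ - w_i^-|,
    and their sum w'_Sigma <= 1 (Eq. (14)). W is assumed to be normalized to sum to one."""
    agree = (v[None, :] * Y) > 0            # indicator of v_l * y_{i,l} = +1
    w_plus = np.sum(W * agree, axis=1)      # w_i^+
    w_minus = np.sum(W * ~agree, axis=1)    # w_i^-
    y_pseudo = np.sign(w_plus - w_minus)
    w_pseudo = np.abs(w_plus - w_minus)
    return y_pseudo, w_pseudo, w_pseudo.sum()   # the last value is w'_Sigma
```

Scanning all $2^K$ vote vectors with such a routine on small synthetic weight matrices is a cheap way to probe how small $\max_{\mathbf{v}} w'_\Sigma$ can get, which is essentially the question raised in Remark 3 of Appendix A.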

Appendix A. Remarks

Remark 1  The reason why the proof works out in the binary case $K = 2$ is that we have strict equality in (14). Indeed, transforming the binary setup into multi-class with two label columns, we always have $y_{i,1} = -y_{i,2}$ and $v_1 = -v_2$, which means that either $w^+_i = 0$ (when $\mathbf{v} = -\mathbf{y}_i$) or $w^-_i = 0$ (when $\mathbf{v} = \mathbf{y}_i$), so we end up with $w'_i = w_{i,1} + w_{i,2}$ for all $i = 1, \ldots, n$, and $\sum_{i=1}^{n} w'_i = \sum_{i=1}^{n} (w_{i,1} + w_{i,2}) = 1$ in each iteration (by the classical boosting-type induction argument).

Remark 2  The reason why we do not face this problem in the original ADABOOST.MH setup of Schapire and Singer (1999) is that $\mathbf{h}(\mathbf{x})$ is not factorized as in (5) there; rather, it is assumed that $K$ independent one-against-all classifiers $\varphi_\ell(\mathbf{x})$, $\ell = 1, \ldots, K$, are trained separately on the binary problems defined by the columns of $\mathbf{Y}$ and $\mathbf{W}$. The classwise edges $\gamma_\ell = \sum_{i=1}^{n} w_{i,\ell} \varphi_\ell(\mathbf{x}_i) y_{i,\ell}$ are not coupled by the label-independent single classifier $\varphi(\mathbf{x})$ as in (9). From the weak learning assumption (15) it follows that for each classwise edge we have $\gamma_\ell \geq \delta \sum_{i=1}^{n} w_{i,\ell}$, and so $\gamma = \sum_{\ell=1}^{K} \gamma_\ell \geq \delta \sum_{\ell=1}^{K} \sum_{i=1}^{n} w_{i,\ell} = \delta$.

Remark 3  If it is assumed that $\mathbf{v}$ is fixed, then it seems plausible that we can construct a toy example with $w'_\Sigma = 0$. What makes the problem non-trivial is that the base learner optimizes the edge $\gamma(\mathbf{v}, \varphi, \mathbf{W})$ (10) in $\varphi$ and $\mathbf{v}$ simultaneously. In the case of decision stumps (Appendix C), a global optimum can be found in $\Theta(ndK)$ time. If $\varphi$ is learned by a generic binary classification algorithm, a greedy but practically very efficient iterative optimization usually leads to an excellent solution (Gao and Koller, 2011). So the question is whether there exists a setup ($\mathbf{X}$, $\mathbf{W}$, $\mathbf{Y}$, and function class) in which all of the $2^K$ different vote vectors $\mathbf{v} \in \{\pm 1\}^K$ lead to arbitrarily small (or zero) $w'_\Sigma$, or whether we can find a constant (independent of $n$) lower bound $\omega$ such that with at least one vote vector $\mathbf{v}$ and classifier $\varphi$, $w'_\Sigma \geq \omega$ holds. This would then imply $\delta' = \delta\omega$ in (13).

Remark 4  It is easy to see that the edge $\gamma(\mathbf{v}, \varphi, \mathbf{W})$ can never be negative, since if it were, flipping the sign of either $\mathbf{v}$ or $\varphi$ would make it strictly positive. This means that in practice, the algorithm does not get stuck: given even a random classifier $\varphi$, unless the edge is exactly zero for all vote vectors $\mathbf{v}$, we can find a vote vector for which the edge is strictly positive. The only question we raise here is whether this edge can become arbitrarily small (even if the weak learning condition (15) holds).

Appendix B. The pseudocode of the ADABOOST.MH algorithm with factorized base classifiers

The following is the pseudocode of the ADABOOST.MH algorithm with factorized base classifiers (5). $\mathbf{X}$ is the $n \times d$ observation matrix, $\mathbf{Y}$ is the $n \times K$ label matrix, $\mathbf{W}$ is the user-defined weight matrix used in the definition of the weighted Hamming error (1) and the weighted exponential margin-based error (3), BASE$(\cdot,\cdot,\cdot)$ is the base learner algorithm, and $T$ is the number of iterations.

$\alpha^{(t)}$ is the base coefficient, $\mathbf{v}^{(t)}$ is the vote vector, $\varphi^{(t)}(\cdot)$ is the scalar base (weak) classifier, $\mathbf{h}^{(t)}(\cdot)$ is the vector-valued base classifier, and $\mathbf{f}^{(T)}(\cdot)$ is the final (strong) discriminant function.

ADABOOST.MH$(\mathbf{X}, \mathbf{Y}, \mathbf{W}, \mathrm{BASE}(\cdot,\cdot,\cdot), T)$
1  $\mathbf{W}^{(1)} \leftarrow \frac{1}{n} \mathbf{W}$
2  for $t \leftarrow 1$ to $T$
3      $\big(\alpha^{(t)}, \mathbf{v}^{(t)}, \varphi^{(t)}(\cdot)\big) \leftarrow \mathrm{BASE}\big(\mathbf{X}, \mathbf{Y}, \mathbf{W}^{(t)}\big)$
4      $\mathbf{h}^{(t)}(\cdot) \leftarrow \alpha^{(t)} \mathbf{v}^{(t)} \varphi^{(t)}(\cdot)$
5      for $i \leftarrow 1$ to $n$, for $\ell \leftarrow 1$ to $K$
6          $w^{(t+1)}_{i,\ell} \leftarrow w^{(t)}_{i,\ell} \, e^{-h^{(t)}_\ell(\mathbf{x}_i) y_{i,\ell}} \Big/ \underbrace{\textstyle\sum_{i'=1}^{n} \sum_{\ell'=1}^{K} w^{(t)}_{i',\ell'} \, e^{-h^{(t)}_{\ell'}(\mathbf{x}_{i'}) y_{i',\ell'}}}_{Z(\mathbf{h}^{(t)}, \mathbf{W}^{(t)})}$
7  return $\mathbf{f}^{(T)}(\cdot) = \sum_{t=1}^{T} \mathbf{h}^{(t)}(\cdot)$
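For readers who prefer running code, here is a minimal NumPy sketch of the loop above. The name `adaboost_mh` and the interface of the `base` callable are our assumptions: `base(X, Y, Wt)` is expected to return $(\alpha, \mathbf{v}, \varphi)$ with $\varphi$ a vectorized $\{\pm 1\}$-valued function, matching the factorization (5).

```python
import numpy as np

def adaboost_mh(X, Y, W, base, T):
    """Sketch of ADABOOST.MH with factorized base classifiers h(x) = alpha * v * phi(x)."""
    n, K = Y.shape
    Wt = W / n                                     # line 1: W^(1) <- (1/n) W
    ensemble = []
    for t in range(T):
        alpha, v, phi = base(X, Y, Wt)             # line 3: call the base learner
        margins = alpha * np.outer(phi(X), v) * Y  # h^(t)_l(x_i) * y_{i,l}
        Wt = Wt * np.exp(-margins)                 # line 6: reweight ...
        Wt = Wt / Wt.sum()                         # ... and normalize by Z(h^(t), W^(t))
        ensemble.append((alpha, v, phi))

    def f(X_new):                                  # line 7: f^(T)(x) = sum_t h^(t)(x)
        return sum(a * np.outer(p(X_new), vv) for a, vv, p in ensemble)

    return f
```

Any base learner that returns a positive edge keeps the product of the $Z$ factors, and hence the exponential loss (3), shrinking; a decision-stump learner in the spirit of Appendix C can be plugged in as `base`.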

Appendix C. Multi-class decision stumps

The simplest scalar base learner used in practice on numerical features is the decision stump, a one-decision two-leaf decision tree of the form

$$\varphi_{j,b}(\mathbf{x}) = \begin{cases} +1 & \text{if } x^{(j)} \geq b, \\ -1 & \text{otherwise}, \end{cases}$$

where $j$ is the index of the selected feature and $b$ is the decision threshold. If the feature values $\big(x^{(j)}_1, \ldots, x^{(j)}_n\big)$ are pre-ordered before the first boosting iteration, a decision stump maximizing the edge (10) (or minimizing the energy (4)³) can be found very efficiently in $\Theta(ndK)$ time. The pseudocode of the algorithm is given in Figure 2. STUMPBASE first calculates the edge vector $\boldsymbol{\gamma}^{(0)}$ of the constant classifier $h^{(0)}(\mathbf{x}) \equiv 1$, which will serve as the initial edge vector for each featurewise edge-maximizer. Then it loops over the features, calls BESTSTUMP to return the best featurewise stump, and then selects the best of the best by minimizing the energy (4).

BESTSTUMP loops over all (sorted) feature values $s_1, \ldots, s_{n-1}$. It considers all thresholds $b$ halfway between two non-identical feature values $s_i \neq s_{i+1}$. The main trick (and, at the same time, the bottleneck of the algorithm) is the update of the classwise edges in lines 4-5: when the threshold moves from $b = \frac{s_{i-1} + s_i}{2}$ to $b = \frac{s_i + s_{i+1}}{2}$, the classwise edge $\gamma_\ell$ of $\mathbf{1}\varphi(\mathbf{x})$ (that is, $\mathbf{v}\varphi(\mathbf{x})$ with $\mathbf{v} = \mathbf{1}$) can only change by $\pm 2 w_{i,\ell}$, depending on the sign of $y_{i,\ell}$ (Figure 1). The total edge of $\mathbf{v}\varphi(\mathbf{x})$ with optimal votes (12) is then the sum of the absolute values of the classwise edges of $\mathbf{1}\varphi(\mathbf{x})$ (line 7).

Figure 1: Updating the edge $\gamma_\ell$ in line 5 of BESTSTUMP. If $y_{i,\ell} = 1$, then $\gamma_\ell$ decreases by $2 w_{i,\ell}$, and if $y_{i,\ell} = -1$, then $\gamma_\ell$ increases by $2 w_{i,\ell}$.

3. Note the distinction: for full binary $\mathbf{v}$ the two are equivalent, but for ternary or real-valued $\mathbf{v}$ and/or real-valued $\varphi(\mathbf{x})$ they are not. In Figure 2 we are maximizing the edge within each feature (line 7 in BESTSTUMP) but across features we are minimizing the energy (line 7 in STUMPBASE). Updating the energy inside the inner loop (line 4) could not be done in $\Theta(K)$ time.

STUMPBASE$(\mathbf{X}, \mathbf{Y}, \mathbf{W})$
1  for $\ell \leftarrow 1$ to $K$  ▷ for all classes
2      $\gamma^{(0)}_\ell \leftarrow \sum_{i=1}^{n} w_{i,\ell}\, y_{i,\ell}$  ▷ classwise edges (9) of the constant classifier $h^{(0)}(\mathbf{x}) \equiv 1$
3  for $j \leftarrow 1$ to $d$  ▷ all (numerical) features
4      $\mathbf{s} \leftarrow \mathrm{SORT}\big(x^{(j)}_1, \ldots, x^{(j)}_n\big)$  ▷ sort the $j$th column of $\mathbf{X}$
5      $(\mathbf{v}_j, b_j, \gamma_j) \leftarrow \mathrm{BESTSTUMP}\big(\mathbf{s}, \mathbf{Y}, \mathbf{W}, \boldsymbol{\gamma}^{(0)}\big)$  ▷ best stump per feature
6      $\alpha_j \leftarrow \frac{1}{2} \log \frac{1 + \gamma_j}{1 - \gamma_j}$  ▷ base coefficient (11)
7  $j^* \leftarrow \arg\min_j Z\big(\alpha_j \mathbf{v}_j \varphi_{j,b_j}, \mathbf{W}\big)$  ▷ best stump across features
8  return $\big(\alpha_{j^*}, \mathbf{v}_{j^*}, \varphi_{j^*,b_{j^*}}(\cdot)\big)$

BESTSTUMP$(\mathbf{s}, \mathbf{Y}, \mathbf{W}, \boldsymbol{\gamma}^{(0)})$
1  $\boldsymbol{\gamma}^* \leftarrow \boldsymbol{\gamma}^{(0)}$  ▷ best edge vector
2  $\boldsymbol{\gamma} \leftarrow \boldsymbol{\gamma}^{(0)}$  ▷ initial edge vector
3  for $i \leftarrow 1$ to $n - 1$  ▷ for all points in order $s_1 \leq \cdots \leq s_{n-1}$
4      for $\ell \leftarrow 1$ to $K$  ▷ for all classes
5          $\gamma_\ell \leftarrow \gamma_\ell - 2 w_{i,\ell}\, y_{i,\ell}$  ▷ update classwise edges of the stump with $\mathbf{v} = \mathbf{1}$
6      if $s_i \neq s_{i+1}$ then  ▷ no threshold if identical coordinates $s_i = s_{i+1}$
7          if $\sum_{\ell=1}^{K} |\gamma_\ell| > \sum_{\ell=1}^{K} |\gamma^*_\ell|$ then  ▷ found a better stump
8              $\boldsymbol{\gamma}^* \leftarrow \boldsymbol{\gamma}$  ▷ update best edge vector
9              $b^* \leftarrow \frac{s_i + s_{i+1}}{2}$  ▷ update best threshold
10 for $\ell \leftarrow 1$ to $K$  ▷ for all classes
11     $v_\ell \leftarrow \mathrm{sign}(\gamma^*_\ell)$  ▷ set vote vector according to (12)
12 if $\boldsymbol{\gamma}^* = \boldsymbol{\gamma}^{(0)}$  ▷ did not beat the constant classifier
13     return $\big(\mathbf{v}, -\infty, \|\boldsymbol{\gamma}^*\|_1\big)$  ▷ constant classifier with optimal votes
14 else
15     return $\big(\mathbf{v}, b^*, \|\boldsymbol{\gamma}^*\|_1\big)$  ▷ best stump

Figure 2: Exhaustive search for the best decision stump. BESTSTUMP receives a sorted column (feature) $\mathbf{s}$ of the observation matrix $\mathbf{X}$. The sorting in line 4 can be done once for all features outside of the boosting loop. BESTSTUMP examines all thresholds $b$ halfway between two non-identical coordinates $s_i \neq s_{i+1}$ and returns the threshold $b$ and vote vector $\mathbf{v}$ that maximize the edge $\gamma(\mathbf{v}, \varphi_{j,b}, \mathbf{W})$. STUMPBASE then sets the coefficient $\alpha_j$ according to (11) and chooses the stump across features that minimizes the energy (4).
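As a companion to Figure 2, here is a minimal NumPy sketch of the inner search. The function name `best_stump` and the convention that `s`, `Y`, and `W` arrive already reordered by the sort permutation of the feature are our assumptions.

```python
import numpy as np

def best_stump(s, Y, W, gamma0):
    """One pass of BESTSTUMP over a pre-sorted feature column s (ascending).
    Y and W are assumed to be reordered consistently with s.
    Returns (votes v, threshold b, total edge)."""
    gamma = gamma0.astype(float)          # classwise edges of the stump with v = 1
    best_gamma = gamma.copy()
    best_b = -np.inf                      # -inf keeps the constant classifier phi = +1
    n = len(s)
    for i in range(n - 1):
        # Moving the threshold past s[i] flips phi(x_i) from +1 to -1 (lines 4-5).
        gamma -= 2.0 * W[i] * Y[i]
        if s[i] != s[i + 1]:              # line 6: no threshold between identical values
            if np.abs(gamma).sum() > np.abs(best_gamma).sum():   # line 7
                best_gamma = gamma.copy()
                best_b = 0.5 * (s[i] + s[i + 1])
    v = np.where(best_gamma >= 0, 1.0, -1.0)   # votes (12)
    return v, best_b, np.abs(best_gamma).sum()
```

The outer loop of STUMPBASE would call this routine once per pre-sorted feature column and keep the candidate that minimizes the energy $Z$, which is where the overall $\Theta(ndK)$ cost comes from.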

References

Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.

T. Gao and D. Koller. Multiclass boosting with hinge loss based on output coding. In International Conference on Machine Learning, 2011.

B. Kégl. The return of AdaBoost.MH: multi-class Hamming trees. In International Conference on Learning Representations, 2014. URL http://arxiv.org/abs/1312.6086.

R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.