Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers
Erin Allwein, Robert Schapire and Yoram Singer
Journal of Machine Learning Research, 1, 2000

CSE 254: Seminar on Learning Algorithms
Professor: Charles Elkan
Student: Aldebaro Klautau
April 3, 2001

Outline
- Motivation
- Definitions: margin, loss, output coding
- Paper contributions: unified view of classifiers, 0 entries in coding matrix, bounds and simulations
- Algorithm
- Bound on training error
- Experimental results. Compare: a) Hamming vs. loss-based decoding, b) different coding matrices, c) simple binary problems vs. robust matrix
- Conclusions

Motivation
- Many applications require multiclass categorization
- Algorithms that easily handle the multiclass case include C4.5 and CART
- Others, such as AdaBoost and SVM, do not easily handle the multiclass case
- Alternative: reduce the multiclass problem to multiple binary classification problems

Reducing multiclass to multiple binary problems
- The paper proposes a general framework that unifies several methods, namely binary margin-based algorithms coupled through a coding matrix

Binary margin-based learning algorithm
- Input: a total of m binary-labeled training examples $(x_1, y_1), \ldots, (x_m, y_m)$ such that $x_i \in X$ and $y_i \in \{-1, +1\}$
- Output: a real-valued function (hypothesis) $f : X \to \mathbb{R}$
- The binary margin of a training example (x, y) is defined as $z = y f(x)$
- $z > 0$ if and only if f classifies x correctly

Classification error
- It is difficult to minimize the classification error directly
- Instead, minimize some loss function of the margin, $L(z) = L(y f(x))$, of each example (x, y)
- Algorithms that use a margin-based loss: support vector machines (SVM), AdaBoost, regression, decision trees, etc.
- Example: neural networks and least-squares regression attempt to minimize the squared error. Since $y^2 = 1$,
  $(y - f(x))^2 = y^2 (y - f(x))^2 = (yy - y f(x))^2 = (1 - y f(x))^2$, so $L(z) = (1 - z)^2$
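The margin definition and the squared-error identity above can be checked numerically. The following is a minimal sketch; the function names `margin` and `squared_loss` and the sample (label, prediction) pairs are illustrative assumptions, not taken from the slides.

```python
def margin(y, fx):
    """Binary margin z = y * f(x) for a label y in {-1, +1}."""
    return y * fx


def squared_loss(z):
    """Squared error written as a function of the margin: L(z) = (1 - z)^2."""
    return (1.0 - z) ** 2


# Hypothetical (label, prediction) pairs, chosen only for illustration.
examples = [(+1, 0.8), (-1, 0.3), (-1, -1.5)]

for y, fx in examples:
    z = margin(y, fx)
    # Because y^2 = 1, (y - f(x))^2 equals (1 - y f(x))^2.
    assert abs((y - fx) ** 2 - squared_loss(z)) < 1e-12
    correct = z > 0  # z > 0 exactly when the sign of f(x) matches y
    print(y, fx, z, correct)
```

Only the second pair has a negative margin, so it is the only one counted as misclassified.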

Loss functions of popular algorithms
$L : \mathbb{R} \to [0, \infty)$, as a function of the margin $z = y f(x)$

Algorithm     Loss L(z)
AdaBoost      $e^{-z}$
regression    $(1 - z)^2$
logistic      $\log(1 + e^{-2z})$
SVM           $(1 - z)_+$

[Figure: loss L(z) versus margin z = y f(x) for the four algorithms]

Scenario: the training sequence has labels drawn from a set with cardinality k > 2, while the algorithm(s) assume k = 2
- Two popular alternatives: one-against-all and all-pairs
- For k = 10 classes:

Approach           Number of classifiers
one-against-all    10
all-pairs          45
complete
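The losses and classifier counts on this slide can be written down directly. This is a small sketch under stated assumptions: the logistic loss is taken as log(1 + exp(-2z)) and the SVM loss as the hinge max(0, 1 - z), following the paper; the names `LOSSES` and `num_classifiers` are hypothetical.

```python
import math

# Margin-based losses, each as a function of z = y * f(x).
# The logistic and SVM forms are assumptions based on the paper.
LOSSES = {
    "adaboost": lambda z: math.exp(-z),
    "regression": lambda z: (1.0 - z) ** 2,
    "logistic": lambda z: math.log(1.0 + math.exp(-2.0 * z)),
    "svm": lambda z: max(0.0, 1.0 - z),
}


def num_classifiers(k):
    """Number of binary problems needed for k classes under each reduction."""
    return {
        "one-against-all": k,             # one classifier per class
        "all-pairs": k * (k - 1) // 2,    # one classifier per pair of classes
    }


for name, loss in LOSSES.items():
    print(name, [round(loss(z), 3) for z in (-1.0, 0.0, 1.0)])
print(num_classifiers(10))
```

All four losses decrease as the margin grows from -1 to 1, which is why minimizing them pushes examples toward correct classification.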

Output coding for solving multiclass problems
- k = number of classes (rows) and l = number of classifiers (columns); here k = 10 and l = 6
- Each column of the code word corresponds to a binary classifier: vertical line, horizontal line, diagonal line, closed curve, curve open to left, curve open to right
[Table: class versus code word, one column per binary classifier]

Error-correcting output codes (ECOC), proposed by Dietterich and Bakiri, 1995
[Table: class versus code word, one column per binary classifier]
- Associate each class r with a row of a coding matrix M
- Train each binary classifier s: each example labeled y is mapped to M(y, s)
- Obtain l hypotheses $f_1, \ldots, f_l$
- Given a test example x, choose the row y of M that is closest to the binary predictions $(f_1(x), \ldots, f_l(x))$ according to some distance (e.g. Hamming)

Summary of paper contributions
- Unified view of margin classifiers
- Possibility of 0 entries in the ECOC matrix
- Decoding when the matrix has 0 entries
- Bound on training error (general)
- Bound on testing error (restricted to AdaBoost)
- Experimental results

Idea: allow 0 entries
- The coding matrix M is taken from the extended set $\{-1, 0, +1\}^{k \times l}$
- The entry M(r, s) = 0 indicates that we do not care how $f_s$ categorizes examples with label r
- One-against-all: k by k matrix
- All-pairs: k by $\binom{k}{2}$ matrix
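To make the two matrix shapes concrete, here is a minimal sketch that builds both coding matrices as lists of rows. The function names and the column ordering of the all-pairs matrix are illustrative assumptions.

```python
from itertools import combinations


def one_against_all(k):
    """k x k matrix: +1 on the diagonal, -1 everywhere else."""
    return [[+1 if r == s else -1 for s in range(k)] for r in range(k)]


def all_pairs(k):
    """k x C(k, 2) matrix: the column for pair (a, b) has +1 in row a,
    -1 in row b, and 0 (do not care) in every other row."""
    pairs = list(combinations(range(k), 2))
    return [[+1 if r == a else -1 if r == b else 0 for (a, b) in pairs]
            for r in range(k)]


for row in all_pairs(4):
    print(row)
```

For k = 4 the all-pairs matrix has 6 columns, and each column contains exactly one +1, one -1 and two 0 entries.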

Algorithm
- Associate each class r with a row of a coding matrix $M \in \{-1, 0, +1\}^{k \times l}$
- Train the binary classifier for each column $s = 1, \ldots, l$
- For training classifier s, an example labeled y is mapped to M(y, s). Omit examples for which M(y, s) = 0
- Obtain l classifiers
- Given a test example x, choose the row y of M that is closest to the binary predictions $(f_1(x), \ldots, f_l(x))$ according to some distance (e.g. modified Hamming)

Distance: two intuitive options
a) Quantize the predictions and then use a generalized Hamming distance:
   $\Delta(u, v) = \sum_{s=1}^{l} \frac{1 - u_s v_s}{2}$
   [Table: String A, String B, distance parcels]
b) Loss-based decoding: for each row, calculate the margin $z_s$ of each classifier s and sum the losses $L(z_s)$, adopting the same L used by the binary classifier. This sounds better because the magnitude of a prediction is an indication of its level of confidence
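The two decoding options can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the 3-class coding matrix, the prediction vector and the choice of the exponential loss are hypothetical, and ties are broken by taking the first minimizing row.

```python
import math


def sign(v):
    """Quantize a real-valued prediction to -1, 0 or +1."""
    return (v > 0) - (v < 0)


def hamming_distance(row, preds):
    """Generalized Hamming distance: sum over s of (1 - M(r,s) * sign(f_s(x))) / 2.
    A 0 entry in the row contributes 1/2 whatever the prediction is."""
    return sum((1 - m * sign(f)) / 2 for m, f in zip(row, preds))


def loss_distance(row, preds, loss):
    """Loss-based distance: sum over s of L(M(r,s) * f_s(x))."""
    return sum(loss(m * f) for m, f in zip(row, preds))


def decode(M, preds, distance):
    """Index of the row of M that is closest to the predictions."""
    return min(range(len(M)), key=lambda r: distance(M[r], preds))


# Hypothetical coding matrix (3 classes, 3 classifiers) and predictions.
M = [[+1, +1, +1],
     [-1, -1, -1],
     [+1, -1, 0]]
preds = [0.1, 0.2, -3.0]


def exp_loss(z):
    return math.exp(-z)


print(decode(M, preds, hamming_distance))
print(decode(M, preds, lambda row, p: loss_distance(row, p, exp_loss)))
```

With these values Hamming decoding picks row 0 (two of three signs agree), while loss-based decoding picks row 1 because the third classifier's large-magnitude prediction dominates the sum of losses.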

Hamming vs. loss-based decoding
[Figure: quantization of the predictions]

Losses for classes 3 and 4 in the figure
    prediction = [ ]
    row_3 = [ ]
    row_4 = [ ]
    row_r_loss = exp(-prediction .* row_r)
    loss = sum(row_r_loss)
[Figure: row_r_loss (log scale) versus binary classifier #; annotation: big influence]

Training error
Bound on training error
- Average binary loss:
  $\varepsilon = \frac{1}{ml} \sum_{i=1}^{m} \sum_{s=1}^{l} L(M(y_i, s) f_s(x_i))$
- Minimum distance between a pair of rows:
  $\rho = \min \{ \Delta(M(r_1), M(r_2)) : r_1 \neq r_2 \}$
- Training error E:
  $E \le \frac{l \varepsilon}{\rho L(0)}$
- One-against-all: $\rho = 2$
- All-pairs: $\rho = 1 + \frac{1}{2}(l - 1)$

Bound on training error
$E \le \frac{l \varepsilon}{\rho L(0)}$
Simulation: AdaBoost with L(0) = 1
- Complete code: $E \le \frac{(2^{k-1} - 1) \varepsilon}{2^{k-2}} < 2 \varepsilon$
- One-against-all: $E \le \frac{k \varepsilon}{2}$
[Figure: average binary loss, actual classification error and theoretical bound versus number of classes]
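The quantities in the bound are easy to compute for a given coding matrix. The sketch below assumes the generalized Hamming distance between rows, sum over s of (1 - u_s v_s) / 2, and uses a made-up average binary loss of 0.1; the function names are hypothetical.

```python
from itertools import combinations


def row_distance(u, v):
    """Generalized Hamming distance between two rows: sum of (1 - u_s * v_s) / 2."""
    return sum((1 - a * b) / 2 for a, b in zip(u, v))


def min_row_distance(M):
    """rho: minimum distance over all pairs of distinct rows of M."""
    return min(row_distance(M[r1], M[r2])
               for r1, r2 in combinations(range(len(M)), 2))


def training_error_bound(M, avg_binary_loss, loss_at_zero=1.0):
    """The bound l * eps / (rho * L(0)) stated on the slide."""
    l = len(M[0])
    return l * avg_binary_loss / (min_row_distance(M) * loss_at_zero)


k = 4
one_vs_all = [[+1 if r == s else -1 for s in range(k)] for r in range(k)]
pairs = list(combinations(range(k), 2))
all_pairs = [[+1 if r == a else -1 if r == b else 0 for a, b in pairs]
             for r in range(k)]

print(min_row_distance(one_vs_all))
print(min_row_distance(all_pairs))
# avg_binary_loss=0.1 is an illustrative value, not a result from the paper.
print(training_error_bound(one_vs_all, avg_binary_loss=0.1))
```

For k = 4 this gives rho = 2 for one-against-all and rho = 3.5 = 1 + (6 - 1) / 2 for all-pairs, matching the closed forms above.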

Bound on training error
- Requirement: $\frac{L(z) + L(-z)}{2} \ge L(0) > 0$
- L does not need to be convex: sin(z)
[Figure: loss L(z) versus margin z = y f(x)]

Experimental results
- Tradeoff between simple binary problems and robust coding matrix
- Hamming versus loss-based decoding (AdaBoost and SVM)
- Comparison among different output codes (AdaBoost and SVM)
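The requirement on L can be checked numerically on a grid of margins. This is a rough sketch: the grid, the tolerance and the function name `satisfies_requirement` are assumptions, and `1 - sin(z)` is an assumed non-negative variant of the sine example mentioned on the slide.

```python
import math


def satisfies_requirement(loss, zs, tol=1e-12):
    """Check (L(z) + L(-z)) / 2 >= L(0) > 0 for every margin z in zs."""
    l0 = loss(0.0)
    return l0 > 0 and all((loss(z) + loss(-z)) / 2 >= l0 - tol for z in zs)


grid = [i / 10.0 for i in range(-50, 51)]  # margins from -5 to 5

losses = {
    "adaboost": lambda z: math.exp(-z),
    "regression": lambda z: (1.0 - z) ** 2,
    "logistic": lambda z: math.log(1.0 + math.exp(-2.0 * z)),
    "svm": lambda z: max(0.0, 1.0 - z),
    # Non-convex example; the exact form is an assumption.
    "1 - sin(z)": lambda z: 1.0 - math.sin(z),
}

for name, loss in losses.items():
    print(name, satisfies_requirement(loss, grid))
```

A grid check is only evidence, not a proof, but it quickly rules out losses such as 1 - z^2, whose symmetrized value falls below L(0) for any nonzero z.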

Experiment 1 - synthetic data
- Set thresholds to have exactly 100 examples per class
- Label test examples using these thresholds
- Use AdaBoost
- Weak hypotheses: a set of thresholds. Threshold t labels x as +1 if x < t and -1 otherwise
- 10 rounds
[Figure panels: 3 classes, 8 classes]

Comparisons
a) Hamming vs. loss-based decoding
b) Simple binary problems vs. robust matrix
[Figure panels: complete code, one-against-all]
AdaBoost used in both simulations

[Figure: example coding matrices for k = 4: complete code, and k by k one-against-all]

Comparison: different output codes
- Experiments with UCI databases
- SVM: polynomial kernels of degree 4
- AdaBoost: decision stumps for base hypotheses
- 5 output codes: one-against-all, complete, all-pairs, dense and sparse
- Design of dense and sparse codes: try 10,000 random codes, pick a code with high ρ
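The "try many random codes, keep one with high ρ" design can be sketched as below. Several details are assumptions rather than facts from the slides: the entry probabilities (dense: -1 and +1 equally likely; sparse: 0 with probability 1/2, -1 and +1 with probability 1/4 each), the matrix size, the rule that every column needs at least one +1 and one -1, and all function names. The example uses 1,000 tries to keep the run short.

```python
import random
from itertools import combinations


def row_distance(u, v):
    """Generalized Hamming distance between two rows."""
    return sum((1 - a * b) / 2 for a, b in zip(u, v))


def min_row_distance(M):
    """rho: minimum distance over all pairs of distinct rows."""
    return min(row_distance(M[a], M[b])
               for a, b in combinations(range(len(M)), 2))


def usable(M):
    """Every column needs a +1 and a -1 to define a binary problem (assumed rule)."""
    return all(+1 in col and -1 in col for col in zip(*M))


def best_random_code(k, l, entries, weights, tries, seed=0):
    """Draw `tries` random k x l matrices and keep the usable one with the largest rho."""
    rng = random.Random(seed)
    best, best_rho = None, -1.0
    for _ in range(tries):
        M = [rng.choices(entries, weights=weights, k=l) for _ in range(k)]
        if not usable(M):
            continue
        rho = min_row_distance(M)
        if rho > best_rho:
            best, best_rho = M, rho
    return best, best_rho


dense, rho_dense = best_random_code(6, 8, [-1, +1], [1, 1], tries=1000)
sparse, rho_sparse = best_random_code(6, 8, [-1, 0, +1], [1, 2, 1], tries=1000)
print(rho_dense, rho_sparse)
```

Random search is a heuristic: it returns a code with a comparatively large ρ among those sampled, with no guarantee of optimality.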

[Figure: error obtained with Hamming decoding minus error obtained with loss-based decoding]
- Positive height indicates that loss-based decoding outperformed Hamming decoding (its error is lower)

SVM
- Entry (r, c) shows the error of the row r classifier minus the error of the column c classifier
- Positive height indicates that classifier c outperformed classifier r

AdaBoost
- Entry (r, c) shows the error of the row r classifier minus the error of the column c classifier
- Positive height indicates that classifier c outperformed classifier r

Conclusions
- Bounds give insight into the tradeoffs but can be of limited use in practice
- Experiments show that in most cases:
  - Loss-based decoding is better than Hamming decoding
  - One-against-all is outperformed (SVM)
- Choosing or designing the coding matrix is an open problem, and the best approach is possibly task-dependent (Crammer & Singer, 2000)
