Supervised Learning of Non-binary Problems, Part I: Multiclass Categorization via Output Codes
Yoram Singer, Hebrew University, http://www.cs.huji.il/~singer
Based on joint work with: Koby Crammer (Hebrew Univ.), Rob Schapire (AT&T Labs), Erin Allwein (SWRI)
NATO ASI on Learning Theory and Practice, July 8-19, 2002, K.U. Leuven, Belgium
Problem Setting
Given: multiclass data, e.g., predict a child's favorite Teletubby:
(name=guy, age=5, sex=m), (name=neta, age=2, sex=f), (name=dana, age=6, sex=f), ...
Given: a binary (2-class) learning algorithm A:
(x_1, +), (x_2, −), (x_3, +), ... → A → h, with h(x) ∈ {−, +}
Problem Setting (cont.)
Goal: use the binary learning algorithm to build a classifier for multiclass data:
(x_1, y_1), (x_2, y_2), (x_3, y_3), ... → multiclass learner (using A) → H, with H(x) ∈ {one of the k classes}
(on the original slide the class labels are four Teletubby icons)
The One-against-all Approach [folklore]
break the k-class problem into k binary problems and solve each separately: binary problem s relabels the examples of class s as + and all other examples as −
[table on the slide: five training examples x_1, ..., x_5 relabeled for each of the binary problems h_1, ..., h_4]
how to combine predictions? evaluate all h's and hope exactly one is +
e.g.: if h_1(x) = h_2(x) = h_4(x) = − and h_3(x) = + then predict class 3
problem: the combined prediction can be incorrect even if only one h is wrong
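For concreteness, a minimal sketch of the one-against-all reduction. The function names, and the use of scikit-learn's LinearSVC as the binary learner A, are illustrative assumptions, not anything prescribed by the talk:

```python
# One-against-all: k binary problems, one per class (illustrative sketch).
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_all(X, y, k):
    """Binary problem s relabels examples of class s as +1, all others as -1."""
    return [LinearSVC().fit(X, np.where(y == s, 1, -1)) for s in range(k)]

def predict_one_vs_all(classifiers, x):
    """Evaluate all h's; ideally exactly one is positive. Ties and all-negative
    outputs are resolved with the real-valued score, which anticipates the
    confidence-based decoding discussed later in the talk."""
    scores = [h.decision_function(x.reshape(1, -1))[0] for h in classifiers]
    return int(np.argmax(scores))
```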
The All-pairs Approach [Hastie & Tibshirani]
create one binary problem for each pair of classes (k(k−1)/2 problems)
[table on the slide: the examples x_1, ..., x_5 relabeled for the six pairwise problems h_1, ..., h_6; examples of the two paired classes get + and −, all other examples are omitted]
not obvious how to combine predictions
Error-correcting Output Codes [Dietterich & Bakiri]
reduce to binary using a coding matrix M over {−, +}; the rows of M are code words, one per class
[table on the slide: a 4 × 5 coding matrix and the five induced binary problems h_1, ..., h_5, one per column]
given x, choose the row of M closest to h(x) = (h_1(x), ..., h_5(x))
e.g.: if h(x) = (−, +, +, +, −) then predict the class whose code word is nearest
ECOC (continued)
if the rows of M are far from one another, the combined classifier will be highly robust to errors in the binary predictions
potentially much faster when k (# of classes) is large
disadvantage: the binary problems may be unnatural and hard to solve
Outline
- general framework for studying such reductions from multiclass to binary, particularly for margin-based binary learners
- decoding methods for combining predictions of binary classifiers: Hamming decoding; loss-based decoding
- general training error bounds
- generalization error bound when the binary learner is AdaBoost
- finding a good output coding matrix: discrete output codes; continuous output codes; learning continuous output codes using QP; generalization error bound for continuous output codes
- experiments with discrete and continuous codes
- variation on a theme: multilabel problems
A (Slightly) More General Approach
choose a matrix M over {−1, 0, +1}; a 0 entry means the class is simply left out of that binary problem
[table on the slide: a 4 × 5 matrix over {−1, 0, +1} and the induced binary problems h_1, ..., h_5]
generalizes all three earlier approaches
how to do decoding?
Hamming Decoding (generalized)
Hamming distance between u, v ∈ {−1, 0, +1}^ℓ:
Δ(u, v) = (1/ℓ) Σ_{s=1}^{ℓ} d(u_s, v_s), where d(u_s, v_s) = 0 if u_s = v_s and u_s v_s ≠ 0; 1 if u_s ≠ v_s and u_s v_s ≠ 0; 1/2 if u_s = 0 or v_s = 0
e.g.: Δ((+, −, 0, 0, +), (+, +, 0, 0, +)) = (1/5)(0 + 1 + 1/2 + 1/2 + 0) = 2/5
to classify x: evaluate h(x) = (h_1(x), ..., h_ℓ(x)) and choose the row r of M closest to sign(h(x)) in Hamming distance
weakness: ignores the confidence of the predictions
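A small sketch of generalized Hamming decoding under the definition above (function names are mine; the final check is the slide's 2/5 example as reconstructed here):

```python
import numpy as np

def hamming_distance(u, v):
    """Per-coordinate distance: 0 if u_s = v_s (both nonzero), 1 if they
    disagree (both nonzero), 1/2 whenever either coordinate is 0."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    per_coord = np.where(u * v == 0, 0.5, (u != v).astype(float))
    return float(per_coord.mean())

def hamming_decode(M, h):
    """Choose the row of the coding matrix M closest to sign(h(x))."""
    signs = np.sign(h)
    return int(np.argmin([hamming_distance(row, signs) for row in np.asarray(M)]))

print(hamming_distance([1, -1, 0, 0, 1], [1, 1, 0, 0, 1]))  # 0.4 = 2/5
```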
Margin-based Binary Learning Algorithms
a typical binary classifier produces real-valued predictions, i.e., h(x) ∈ ℝ rather than h(x) ∈ {−, +}
interpretation: sign(h(x)) = predicted class (− or +); |h(x)| = confidence
the margin of a labeled example (x, y) is y·h(x):
y·h(x) > 0 ⟺ correct prediction; |y·h(x)| = |h(x)| = confidence
many learning algorithms attempt to find a classifier h minimizing Σ_{i=1}^{m} L(y_i h(x_i)), where (x_1, y_1), ..., (x_m, y_m) is the given training set and L is a loss function of the margin
Examples of Margin-based Learning Algorithms

algorithm                   L(z)
support-vector machines     max{1 − z, 0}
logistic regression         ln(1 + e^{−z})
AdaBoost                    e^{−z}
least-squares regression    (1 − z)²
C4.5                        ln(1 + e^{−z})
CART                        (1 − z)²
Loss-based Decoding
can be used instead of Hamming decoding when the binary learner is margin-based
to classify x: evaluate h(x) = (h_1(x), ..., h_ℓ(x))
for each class r compute the total loss that would have been suffered if x were labeled r:
Σ_{s=1}^{ℓ} L(M(r, s) h_s(x))
choose the class r that minimizes the total loss
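A corresponding sketch of loss-based decoding; the exponential loss is used as the default L, matching the AdaBoost row of the loss table above, and the names are illustrative:

```python
import numpy as np

def loss_based_decode(M, h, L=lambda z: np.exp(-z)):
    """Pick the class r minimizing sum_s L(M[r, s] * h_s(x)).
    M: k x l coding matrix over {-1, 0, +1} (or real-valued),
    h: the l real-valued binary predictions h_1(x), ..., h_l(x)."""
    margins = np.asarray(M, float) * np.asarray(h, float)  # one row per class
    return int(np.argmin(L(margins).sum(axis=1)))
```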
Loss-based Decoding (example)
M = the 4 × 5 coding matrix over {−1, 0, +1} from the previous slide
h(x) = (1.9, −3.1, 4.2, 0.7, 6.2)
compute the total loss Σ_{s=1}^{5} L(M(r, s) h_s(x)) for each of the four classes; entries with M(r, s) = 0 contribute L(0) regardless of h_s(x)
Hamming decoding uses only the signs, sign(h(x)) = (+, −, +, +, +), and predicts one class
loss-based decoding with L(z) = e^{−z} weighs the confidences and predicts a different class
[the four per-class loss sums and the two predicted classes appear on the original slide]
Training Error Bounds
let ε̄ be the expected loss on the training set, averaged over all ℓ binary problems:
ε̄ = (1/(mℓ)) Σ_{i=1}^{m} Σ_{s=1}^{ℓ} L(M(y_i, s) h_s(x_i))
let ρ be the minimum distance between pairs of rows of M:
ρ = min{Δ(M(r_1, ·), M(r_2, ·)) : r_1 ≠ r_2}
assume L(0) = 1, (1/2)(L(z) + L(−z)) ≥ L(0), and other minor technical assumptions
Theorem:
for loss-based decoding: # training mistakes ≤ m ε̄ / ρ
for Hamming decoding: # training mistakes ≤ 2 m ε̄ / ρ
Proof Sketch
if loss-based decoding incorrectly classifies an example (x, y), then for some row r ≠ y:
Σ_{s=1}^{ℓ} L(M(r, s) f_s(x)) ≤ Σ_{s=1}^{ℓ} L(M(y, s) f_s(x))
define: S_1 = {s : M(r, s) ≠ M(y, s), M(r, s) ≠ 0, M(y, s) ≠ 0}
S_2 = {s : M(r, s) = 0 or M(y, s) = 0}
z_s = M(y, s) f_s(x), z̄_s = M(r, s) f_s(x)
then Σ_{s=1}^{ℓ} L(z_s) ≥ Σ_{s=1}^{ℓ} L(z̄_s), and since L ≥ 0 and the two rows agree outside S_1 ∪ S_2:
Σ_{s=1}^{ℓ} L(z_s) ≥ Σ_{s∈S_1∪S_2} L(z_s) and Σ_{s=1}^{ℓ} L(z_s) ≥ Σ_{s∈S_1∪S_2} L(z̄_s)
Proof Sketch (cont.)
Σ_{s=1}^{ℓ} L(z_s) ≥ (1/2) Σ_{s∈S_1∪S_2} (L(z_s) + L(z̄_s))
= (1/2) Σ_{s∈S_1} (L(z_s) + L(z̄_s)) + (1/2) Σ_{s∈S_2} (L(z_s) + L(z̄_s))
≥ L(0) |S_1| + L(0) |S_2| / 2 = L(0) ℓ Δ(M(r, ·), M(y, ·)) ≥ ρ ℓ L(0)
(on S_1 we have z̄_s = −z_s, so (1/2)(L(z_s) + L(z̄_s)) ≥ L(0); on S_2 at least one of z_s, z̄_s is 0)
a mistake on training example (x_i, y_i) therefore implies
Σ_{s=1}^{ℓ} L(M(y_i, s) f_s(x_i)) ≥ ρ ℓ L(0)
so the number of training mistakes is at most (1/(ρ ℓ L(0))) Σ_{i=1}^{m} Σ_{s=1}^{ℓ} L(M(y_i, s) f_s(x_i))
and the training error is at most ℓ ε̄ / (ρ ℓ L(0)) = ε̄ / (ρ L(0))
Discussion
want good performance on the binary problems (so that ε̄ is small)
want rows far apart from each other (so that ρ is large): ρ ≈ 1/2 for ECOC, ρ = 2/k for one-against-all
but: making ρ large may make the binary problems more difficult to solve
many zeros in the matrix make the binary problems easier and faster to solve
but: adding zeros tends to increase the average error ε̄
Generalization Error for Boosting with Loss-based Decoding - Outline
analyze performance on unseen data when the binary learner is AdaBoost
in the binary case, margins are used to analyze boosting (Schapire, Freund, Bartlett, Lee 1998):
larger margins on the training data imply a better bound on the generalization error, and boosting tends to increase training-data margins
for loss-based decoding: modify the classification rule; modifying the definition of the margin then gives a two-part analysis
Generalization Error Bounds for AdaBoost.MO
loss-based decoding for AdaBoost:
H(x) = arg min_{y∈Y} Σ_{s=1}^{ℓ} exp(−M(y, s) f(x, s)) = arg min_{y∈Y} Σ_{s=1}^{ℓ} exp(−M(y, s) Σ_{t=1}^{T} α_t h_t(x, s))
scaling f to [−1, +1] and passing each total loss through the transform z ↦ −(1/η) ln(z/ℓ):
η = Σ_{t=1}^{T} α_t,  f(x, s) = (1/η) Σ_{t=1}^{T} α_t h_t(x, s)
ν(f, η, x, y) = −(1/η) ln( (1/ℓ) Σ_{s=1}^{ℓ} e^{−η M(y, s) f(x, s)} )
H(x) = arg max_{y∈Y} ν(f, η, x, y)
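As a sanity check on the formulas above, a short sketch computing ν(f, η, x, y) for every row at once; it assumes f(x, ·) has already been scaled to [−1, +1], and the names are mine:

```python
import numpy as np

def nu(M, f, eta):
    """nu(f, eta, x, y) = -(1/eta) * ln((1/l) * sum_s exp(-eta * M[y,s] * f(x,s))),
    computed for all rows y; decoding is the argmax of the returned vector."""
    M, f = np.asarray(M, float), np.asarray(f, float)
    l = M.shape[1]
    return -(1.0 / eta) * np.log(np.exp(-eta * M * f).sum(axis=1) / l)

# H(x) = int(np.argmax(nu(M, f_values, eta)))
```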
Generalization Error Bounds for AdaBoost.MO (cont.)
modified definition of the margin:
M_{f,η}(x, y) = (1/2) [ ν(f, η, x, y) − max_{r≠y} ν(f, η, x, r) ]
H is the base-hypothesis space; its convex hull is
co(H) = { f : x ↦ Σ_h α_h h(x) | α_h ≥ 0, Σ_h α_h = 1 }
Theorem: let D be a distribution over X × Y, S a sample of m examples, H a base-classifier space of VC-dimension d, and M a coding matrix; then w.p. 1 − δ every f ∈ co(H), η > 0, and θ > 0 satisfy
Pr_D[ M_{f,η}(x, y) ≤ 0 ] ≤ Pr_S[ M_{f,η}(x, y) ≤ θ ] + O( (1/√m) [ d log²(ℓm/d) / θ² + log(1/δ) ]^{1/2} )
Experiments
tested SVM on 8 benchmark problems using both decoding methods and 5 different coding matrices (one-against-all, all-pairs, complete, random dense, random sparse)
overall, loss-based decoding is better than Hamming decoding
[figure: test-error difference between Hamming and loss-based decoding vs. number of classes, one panel per coding matrix]
one-against-all is consistently worse than the other methods, but there is no clear pattern among the other methods
[figure: test-error difference between one-against-all and each of the other coding matrices vs. number of classes]
So far we assumed that the matrix M is predefined
problems:
all columns have the same influence in the decoding process, but some binary problems might be simpler than others
each binary classifier induces the same loss for all labels, but some labels are easier to predict than others
correlations between the binary classifiers are not taken into account, but similar columns of M might result in highly correlated classifiers
idea: change M in accordance with the predictions of the binary classifiers
Code design - discrete case
the following problem is NP-complete:
given: a training set S = {(x_1, y_1), ..., (x_m, y_m)}; a set of binary classifiers h(x) = (h_1(x), ..., h_ℓ(x)); an error bound q
find: a matrix M ∈ {−1, +1}^{k×ℓ} such that the classifier H(x) based on M and h(·) makes at most q mistakes on S
however, if ℓ is fixed there might exist a poly-time algorithm for finding the matrix M that attains the smallest number of errors on S
Code design - continuous case
analogously to the binary classifiers, let M ∈ ℝ^{k×ℓ}
column t of M defines a partition of the labels into two disjoint sets:
sign(M_{r,t}) is the set (+1 or −1) to which class r belongs
|M_{r,t}| is the confidence in the associated partition
the classification rule: H(x) = arg max_r (h(x) · M_r), where M_r is the r-th row of M
Continuous codes - example
M =
  +0.1  −1.5  −0.2  −2.4  +3.7
  −1.8  −1.1  +0.9  −0.3  +1.9
  +2.1  +0.1  +0.4  −4.5  −2.5
  +3.2  +0.7  +2.1  +2.7  +0.2
h(x) = (1.1, −3.1, 4.2, 0.7, 6.2)
compute the similarity h(x) · M_r for each class:
class 1: (1.1)(0.1) + (−3.1)(−1.5) + (4.2)(−0.2) + (0.7)(−2.4) + (6.2)(3.7) ≈ 25.2
class 2: (1.1)(−1.8) + (−3.1)(−1.1) + (4.2)(0.9) + (0.7)(−0.3) + (6.2)(1.9) ≈ 16.8
class 3: (1.1)(2.1) + (−3.1)(0.1) + (4.2)(0.4) + (0.7)(−4.5) + (6.2)(−2.5) ≈ −15.0
class 4: (1.1)(3.2) + (−3.1)(0.7) + (4.2)(2.1) + (0.7)(2.7) + (6.2)(0.2) ≈ 13.3
predict: class 1 (the largest similarity)
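The worked example above in a few lines of numpy, with the matrix and prediction values as reconstructed from the slide:

```python
import numpy as np

M = np.array([[ 0.1, -1.5, -0.2, -2.4,  3.7],
              [-1.8, -1.1,  0.9, -0.3,  1.9],
              [ 2.1,  0.1,  0.4, -4.5, -2.5],
              [ 3.2,  0.7,  2.1,  2.7,  0.2]])
h = np.array([1.1, -3.1, 4.2, 0.7, 6.2])

print(M @ h)             # approx. [25.2, 16.8, -15.0, 13.3]
print(np.argmax(M @ h))  # 0: the first class wins
```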
Multiclass Margin
multiclass margin = (similarity of the correct label) − (largest similarity of the rest):
margin > 1: no loss
0 < margin < 1: loss, even though the label is still predicted correctly
margin < 0: loss and a classification mistake
replace the discrete multiclass loss with a piecewise-linear bound:
ε_S = (1/m) Σ_{i=1}^{m} [[H(x_i) ≠ y_i]] ≤ (1/m) Σ_{i=1}^{m} [ max_r { (h(x_i) · M_r) + 1 − δ_{y_i,r} } − (h(x_i) · M_{y_i}) ]
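A sketch of the piecewise-linear bound for a single example; the names are illustrative, and `sims` holds the k similarities h(x)·M_r:

```python
import numpy as np

def hinge_bound(sims, y):
    """max_r (sims[r] + 1 - delta(y, r)) - sims[y]: equals 0 exactly when the
    correct label beats every other label by a margin of at least 1, and is
    >= 1 whenever the example is misclassified."""
    s = np.asarray(sims, float)
    shifted = s + 1.0
    shifted[y] = s[y]          # delta(y, y) = 1 cancels the +1 for r = y
    return float(shifted.max() - s[y])
```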
Optimization problem
minimize: β ‖M‖_p + Σ_{i=1}^{m} ξ_i
subject to the constraints: ∀ i, r : (h(x_i) · M_{y_i}) − (h(x_i) · M_r) ≥ 1 − δ_{y_i,r} − ξ_i
problem size for the norms p = 1, 2, ∞:

                      ℓ_1        ℓ_2      ℓ_∞
primal variables      m + 2kℓ    m + kℓ   m + kℓ + 1
primal ≥-constraints  km + 2kℓ   km       km + 2kℓ
dual variables        km + 2kℓ   km       km + 2kℓ
dual ≥-constraints    km + 2kℓ   km       km + 2kℓ
dual =-constraints    m + 2kℓ    m        m + kℓ + 1
Constructing the code
using the dual program we get M_r = (1/β) Σ_{i=1}^{m} [δ_{y_i,r} − η_{i,r}] h(x_i)
an example i participates in the solution for M if and only if (η_{i,1}, ..., η_{i,k}) is not a point distribution concentrating on the correct label y_i
the code is therefore constructed from examples whose labels are hard to predict
efficient algorithm: SPOC (Support Pattern Output Coding):
can be combined with Mercer kernels
chooses a single example and efficiently solves a mini-QP problem
constructs a continuous code, or a multiclass SVM when h(x) = x
the number of iterations needed to get ε-close to the optimal solution can be bounded
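A sketch of reading M off the dual solution, assuming the η_{i,r} come from a QP solver such as SPOC (not implemented here); the names and shapes are my own conventions:

```python
import numpy as np

def code_from_dual(H, y, eta, beta):
    """M_r = (1/beta) * sum_i (delta(y_i, r) - eta[i, r]) * h(x_i).
    H: m x l matrix of binary predictions (row i is h(x_i)),
    eta: m x k matrix whose rows are distributions over the k labels."""
    m, k = eta.shape
    delta = np.zeros((m, k))
    delta[np.arange(m), np.asarray(y)] = 1.0
    # Rows of (delta - eta) that are all-zero correspond to easy examples;
    # only the remaining 'support patterns' contribute to the code.
    return (delta - eta).T @ np.asarray(H, float) / beta
```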
Generalization Error for Continuous Codes and Multiclass SVM
suppose we perfectly classify a random sample S of m examples using the matrix M; then w.p. 1 − δ the generalization error is at most
O( (R²/m) ( k ‖M‖² log(m) + k² log(m/δ) ) )
where R is the radius of a ball containing the support of S
Example
[figure: a 2-D training set with three classes and the three induced one-against-rest binary problems: Square vs. Rest, Star vs. Rest, Circle vs. Rest]
Old and new decision boundaries
[figure: decision boundaries before (old) and after (new) learning the code]
Old and new codes
[figure: the initial ID (one-against-all) code matrix and the learned continuous code matrix]
Experiments
[figure: relative performance (%) of learned codes for three initializations (id, BCH, random) on satimage, shuttle, mnist, isolet, letter, vowel, glass, and soybean]
Multilabel Problems
joint work with Eleazar Eskin (Columbia) and Yair Weiss (HUJI)
topics: Earnings, Forecasts, Share-Capital, Debt-Issues, Loans, Credits, Credit-Ratings, Mergers, Acquisitions, Asset-Transfers, ...
labeled documents:
Doc#1: {Share-Capital, Credit-Ratings, Asset-Transfers}
Doc#2: {Mergers, Acquisitions}
Doc#3: {Earnings}
Doc#4: {Earnings, Forecasts}
Doc#5: ...
task: find the set of relevant topics for each new document
Multilabel problems - formal definition
instances: x_i ∈ X
set of possible topics: Y = {1, 2, ..., K}
relevant topics: t_i ∈ 2^Y; labels: y_i ∈ {−1, +1}^K
example: Doc#17 with relevant topics {Share-Capital, Acquisitions} gets the label vector with +1 in the Share-Capital and Acquisitions coordinates and −1 everywhere else
find f : X → {−1, +1}^K
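The encoding of a topic set as a ±1 label vector, as a one-liner (a hypothetical helper, with topics indexed 0..K−1 for convenience):

```python
def topics_to_labels(relevant, K):
    """t in 2^Y  ->  y in {-1, +1}^K: +1 for relevant topics, -1 otherwise."""
    return [1 if topic in relevant else -1 for topic in range(K)]

# e.g. topics_to_labels({0, 3}, 5) == [1, -1, -1, 1, -1]
```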
Naïve Approach
train K classifiers f_i : X → {−1, +1} and set f(x) = (f_1(x), ..., f_K(x))
classifier i predicts whether topic i is relevant (+1) or not (−1)
apply each classifier independently to each new instance
pros: simple
cons: error prone; does not account for correlations between topics
employed by AdaBoost.MH and versions of SVM
Adding Fault Tolerance
train classifiers that take topic correlations into account:
if an article is about Mergers, learn whether it is also about Acquisitions; if it is not about Mergers, abstain
obtain K² classifiers f_{i→j} : X → {−1, +1}
view each possible vector of labels in {−1, +1}^K as a super-label
build the corresponding output code of size 2^K × K²
use loss-based decoding
Multilabel + ECOC: example
three classes, nine classifiers: f_{1→1}, f_{2→2}, f_{3→3}, f_{1→2}, f_{1→3}, f_{2→1}, f_{2→3}, f_{3→1}, f_{3→2}
output code of size 2³ × 3² = 8 × 9
[table: the 8 × 9 coding matrix over {−1, 0, +1}, one row per super-label in {−1, +1}³]
but... classification time is exponential in K
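A sketch of building the 2^K × K² super-label code. The slide's exact entry convention did not survive extraction, so the column semantics below (abstain with a 0 when topic i is irrelevant, otherwise code topic j's label, with column (i, i) coding topic i itself) are one plausible reading of the "learn whether also about / abstain" rule above:

```python
import itertools
import numpy as np

def multilabel_code(K):
    """Rows: all 2^K super-labels in {-1, +1}^K.
    Columns: the K^2 classifiers f_{i->j}."""
    rows = list(itertools.product([-1, 1], repeat=K))
    M = np.zeros((2 ** K, K * K))
    for r, labels in enumerate(rows):
        for i in range(K):
            for j in range(K):
                if i == j:
                    M[r, i * K + j] = labels[i]   # plain classifier for topic i
                elif labels[i] == 1:
                    M[r, i * K + j] = labels[j]   # about i: also about j?
                # labels[i] == -1 and i != j: abstain, leave 0
    return M

print(multilabel_code(3).shape)  # (8, 9), matching 2^3 x 3^2
```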
Bayes to the Rescue
define the probability of the K labels corresponding to row r:
P(r | x) ∝ exp( −Σ_{s=1}^{ℓ} L(M(r, s) f_s(x)) )
note that
Σ_{s=1}^{ℓ} L(M(r, s) f_s(x)) > Σ_{s=1}^{ℓ} L(M(y, s) f_s(x))  if and only if  P(r | x) < P(y | x)
so loss-based decoding is equivalent to picking the most probable super-label
construct a belief net with loops from the constraints
use loopy belief propagation (max-product) to find the best K labels
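The identity above says loss-based decoding is exactly maximum-probability decoding under P(r|x) ∝ exp(−Σ_s L(M(r,s) f_s(x))). Enumerating the unnormalized log-probabilities makes the exponential cost in K explicit (names are illustrative):

```python
import numpy as np

def superlabel_log_probs(M, f, L=lambda z: np.exp(-z)):
    """Unnormalized log P(r | x) = -sum_s L(M[r, s] * f_s(x)) for every row.
    With M = multilabel_code(K) this enumerates all 2^K rows, which is
    exactly why the slide resorts to loopy max-product for larger K."""
    return -L(np.asarray(M, float) * np.asarray(f, float)).sum(axis=1)

# best super-label: int(np.argmax(superlabel_log_probs(M, f)))
```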
Simulation Results
[figure: scatter plot of ECOC error vs. raw (independent-classifiers) error on simulated data]
apps in the works: multilabel text classification; gene multi-functionality classification
Summary
- general framework for solving multiclass problems with margin classifiers using output codes
- general empirical loss analysis
- generalization error analysis when the binary learner is AdaBoost
- experimental results support loss-based decoding
- design of continuous output codes: efficient algorithm, generalization analysis
research directions:
- online models for multiclass problems
- hierarchical classification and coding
- VC analysis of learning with output codes
- learning output codes using Bregman-Legendre projections
Based On
T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of AI Research, 2:263-286, 1995.
R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297-336, 1999.
E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of ML Research, 1:113-141, 2000.
K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of ML Research, 2:265-292, 2001.
K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, 47(2/3):201-233, 2002.