Supervised Learning of Non-binary Problems Part I: Multiclass Categorization via Output Codes

Supervised Learning of Non-binary Problems Part I: Multiclass Categorization via Output Codes. Yoram Singer, Hebrew University, http://www.cs.huji.ac.il/~singer. Based on joint work with: Koby Crammer (Hebrew Univ.), Rob Schapire (AT&T Labs), Erin Allwein (SWRI). NATO ASI on Learning Theory and Practice, July 8-19, 2002, K.U. Leuven, Belgium.

Problem Setting. Given: multiclass data, e.g. predict a child's favorite Teletubby from attributes such as (name=Guy, age=5, sex=M), (name=Neta, age=2, sex=F), (name=Dana, age=6, sex=F), ... where each example is labeled with one of k classes (the class labels appear as pictures in the original slide). Given: a binary (2-class) learning algorithm A that maps a labeled sample (x_1, +1), (x_2, -1), (x_3, +1), ... to a classifier h with h(x) ∈ {-1, +1}.

Problem Setting (cont.). Goal: use the binary learning algorithm to build a classifier for multiclass data. A multiclass learner maps the labeled sample (x_1, y_1), (x_2, y_2), (x_3, y_3), ... to a classifier H with H(x) taking one of the k class labels.

The One-against-all Approach [folklore]. Break the k-class problem into k binary problems ("class r vs. the rest") and solve each separately with the binary learner, yielding h_1, ..., h_k. (The slide shows five training examples x_1, ..., x_5 relabeled for each of the four binary problems h_1, ..., h_4.) How to combine the predictions? Evaluate all the h's and hope that exactly one is positive; e.g., if h_1(x) = h_2(x) = h_4(x) = -1 and h_3(x) = +1, then predict class 3. Problem: the combined prediction can be incorrect even if only one h is wrong.

The All-pairs Approach [Hastie & Tibshirani]. Create one binary problem for each pair of classes ("class r vs. class r'"), so only the examples of the two classes involved are used in each problem. (The slide shows the five examples relabeled for the six pairwise problems h_1, ..., h_6 of a 4-class task.) It is not obvious how to combine the predictions.

Error-correcting Output Codes [Dietterich & Bakiri]. Reduce to binary using a coding matrix M over {-1, +1}: the rows of M are code words, one per class, and each of the l columns defines a binary problem. (The slide shows a 4 x 5 coding matrix together with the five induced binary relabelings of the training examples.) Given x, choose the class whose row of M is closest to h(x) = (h_1(x), ..., h_5(x)); e.g., if h(x) = (-, +, +, +, -), predict the class whose code word is closest to this vector.

ECOC (continued). If the rows of M are far from one another, the combined classifier is highly robust to errors of the individual binary classifiers. It is also potentially much faster when k (the number of classes) is large. Disadvantage: the binary problems may be unnatural and hard to solve.

Outline
- a general framework for studying such reductions from multiclass to binary, particularly for margin-based binary learners
- decoding methods for combining the predictions of the binary classifiers: Hamming decoding and loss-based decoding
- general training error bounds
- a generalization error bound when the binary learner is AdaBoost
- finding a good output coding matrix: discrete output codes, continuous output codes
- learning continuous output codes using QP
- a generalization error bound for continuous output codes
- experiments with discrete and continuous codes
- variation on a theme: multilabel problems

A (Slightly) More General Approach. Choose the matrix M over {-1, 0, +1}: a zero entry means that the corresponding class is simply left out of that binary problem. (The slide shows a 4 x 5 matrix over {-1, 0, +1} and the induced binary relabelings, with the zero-entry examples omitted.) This generalizes all three earlier approaches. How to do decoding?

Hamming Decoding (generalized). The generalized Hamming distance between u, v ∈ {-1, 0, +1}^l is

Δ(u, v) = (1/l) Σ_{s=1}^l d(u_s, v_s),   where d(u_s, v_s) = 0 if u_s = v_s ≠ 0, 1 if u_s ≠ v_s and u_s v_s ≠ 0, and 1/2 if u_s = 0 or v_s = 0.

e.g.: Δ((+, -, 0, 0, +), (+, +, 0, -, +)) = (1/5)(0 + 1 + 1/2 + 1/2 + 0) = 2/5.

To classify x: evaluate h(x) = (h_1(x), ..., h_l(x)) and choose the row r of M closest to h(x) in Hamming distance. Weakness: this ignores the confidence of the predictions.
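
As a concrete illustration of the generalized Hamming distance and the resulting decoder, here is a minimal numpy sketch (the helper names are my own, not from the slides); it reproduces the 2/5 distance of the example above.

```python
import numpy as np

def gen_hamming(u, v):
    """Generalized Hamming distance over {-1, 0, +1}^l: a position contributes
    0 if the entries agree and are nonzero, 1 if they disagree and are nonzero,
    and 1/2 if either entry is zero."""
    p = np.asarray(u) * np.asarray(v)
    return np.where(p > 0, 0.0, np.where(p < 0, 1.0, 0.5)).mean()

def hamming_decode(M, h_x):
    """Predict the class whose code word (row of M) is closest to sign(h(x))."""
    s = np.sign(h_x)                                   # confidences are ignored
    return int(np.argmin([gen_hamming(row, s) for row in M]))

print(gen_hamming([+1, -1, 0, 0, +1], [+1, +1, 0, -1, +1]))   # 0.4 = 2/5

M = np.array([[+1, -1, 0,  0, +1],
              [-1, +1, 0, -1, +1]])
print(hamming_decode(M, [0.9, -2.0, 0.3, 1.1, 4.0]))          # row 0 is closer
```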

Margin-based Binary Learning Algorithms. A typical binary classifier produces real-valued predictions, i.e., h(x) ∈ R rather than h(x) ∈ {-1, +1}. Interpretation: sign(h(x)) is the predicted class (-1 or +1) and |h(x)| is the confidence. The margin of a labeled example (x, y) is y h(x): y h(x) > 0 means a correct prediction, and |y h(x)| = |h(x)| is the confidence. Many learning algorithms attempt to find a classifier h minimizing Σ_{i=1}^m L(y_i h(x_i)), where (x_1, y_1), ..., (x_m, y_m) is the given training set and L is a loss function of the margin.

Examples of Margin-based Learning Algorithms (loss L(z) as a function of the margin z):
  support-vector machines:   max{1 - z, 0}
  logistic regression:       ln(1 + e^{-z})
  AdaBoost:                  e^{-z}
  least-squares regression:  (1 - z)^2
  C4.5:                      ln(1 + e^{-z})
  CART:                      (1 - z)^2
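
A small sketch of the first four losses as Python functions (numpy assumed; the dictionary keys are my own naming) makes the table concrete.

```python
import numpy as np

# Margin losses L(z), where z = y * h(x) is the margin of a labeled example.
margin_losses = {
    "svm":           lambda z: np.maximum(1.0 - z, 0.0),    # hinge loss
    "logistic":      lambda z: np.log(1.0 + np.exp(-z)),
    "adaboost":      lambda z: np.exp(-z),
    "least_squares": lambda z: (1.0 - z) ** 2,
}

for name, L in margin_losses.items():
    print(f"{name:14s} L(-1)={L(-1.0):.3f}  L(0)={L(0.0):.3f}  L(1)={L(1.0):.3f}")
```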

Loss-based Decoding. Can be used instead of Hamming decoding when the binary learner is margin-based. To classify x: evaluate h(x) = (h_1(x), ..., h_l(x)); for each class r compute the total loss that would have been suffered if x were labeled r, namely Σ_{s=1}^l L(M(r, s) h_s(x)); choose the class r that minimizes this total loss.

Loss-based Decoding (example). M is a {-1, 0, +1} coding matrix with one row per class and 5 columns (its entries are not reproduced here), and h(x) = (1.9, -3.1, 4.2, 0.7, 6.2). For each class r compute the total loss Σ_{s=1}^5 L(M(r, s) h_s(x)); columns with M(r, s) = 0 contribute L(0). Hamming decoding, which only uses sign(h(x)) = (+, -, +, +, +), and loss-based decoding with L(z) = e^{-z} predict different classes in this example.
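
The following sketch mirrors the decoding rule above; since the slide's coding matrix did not survive transcription, the matrix here is an arbitrary {-1, 0, +1} example of my own, used only to show the mechanics.

```python
import numpy as np

def loss_based_decode(M, h_x, L):
    """Choose the class r minimizing sum_s L(M[r, s] * h_s(x));
    zero entries of M contribute L(0)."""
    totals = np.array([sum(L(m * f) for m, f in zip(row, h_x)) for row in M])
    return int(np.argmin(totals)), totals

# Arbitrary illustrative coding matrix (4 classes, 5 binary problems).
M = np.array([[+1, -1,  0,  0, +1],
              [-1, +1, +1,  0, -1],
              [ 0, +1, -1, +1, -1],
              [-1,  0, +1, -1, +1]])
h_x = np.array([1.9, -3.1, 4.2, 0.7, 6.2])

pred, totals = loss_based_decode(M, h_x, lambda z: np.exp(-z))   # L(z) = e^{-z}
print("loss-based prediction:", pred)
print("totals:", np.round(totals, 2))
```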

Training Error Bounds. Let ε be the expected loss on the training set, averaged over all l binary problems:

ε = (1/(ml)) Σ_{i=1}^m Σ_{s=1}^l L(M(y_i, s) h_s(x_i)).

Let ρ be the minimum distance between pairs of rows of M:

ρ = min{ Δ(M(r_1, ·), M(r_2, ·)) : r_1 ≠ r_2 }.

Assume L(0) = 1 and (1/2)(L(z) + L(-z)) ≥ L(0), plus other minor technical assumptions.

Theorem: for loss-based decoding the number of training mistakes is at most m ε / ρ (training error at most ε/ρ); for Hamming decoding it is at most 2 m ε / ρ (training error at most 2ε/ρ).

Proof Sketch. Loss-based decoding incorrectly classifies an example (x, y) only if some row r ≠ y satisfies

Σ_{s=1}^l L(M(r, s) f_s(x)) ≤ Σ_{s=1}^l L(M(y, s) f_s(x)).

Define S = {s : M(r, s) ≠ M(y, s), M(r, s) ≠ 0, M(y, s) ≠ 0}, S_0 = {s : M(r, s) = 0 or M(y, s) = 0}, z_s = M(y, s) f_s(x), and z'_s = M(r, s) f_s(x). Then

Σ_{s=1}^l L(z_s) ≥ (1/2) [ Σ_{s=1}^l L(z_s) + Σ_{s=1}^l L(z'_s) ] ≥ (1/2) Σ_{s ∈ S ∪ S_0} [ L(z_s) + L(z'_s) ].

Proof Sketch (cont.). Continuing,

(1/2) Σ_{s ∈ S ∪ S_0} [L(z_s) + L(z'_s)] = (1/2) Σ_{s ∈ S} [L(z_s) + L(z'_s)] + (1/2) Σ_{s ∈ S_0} [L(z_s) + L(z'_s)] ≥ L(0) |S| + L(0) |S_0| / 2 = L(0) · l · Δ(M(r, ·), M(y, ·)) ≥ ρ l L(0),

using (1/2)(L(z) + L(-z)) ≥ L(0) on S (where z'_s = -z_s) and the fact that on S_0 at least one of z_s, z'_s equals 0. Hence a mistake on training example (x_i, y_i) implies Σ_{s=1}^l L(M(y_i, s) f_s(x_i)) ≥ ρ l L(0), so the number of training mistakes is at most (1/(ρ l L(0))) Σ_{i=1}^m Σ_{s=1}^l L(M(y_i, s) f_s(x_i)), and the training error is at most ε/(ρ L(0)) = ε/ρ.

Discussion. We want good performance on the binary problems (so that ε is small) and rows that are far apart from each other (so that ρ is large): ρ ≈ 1/2 for ECOC, while ρ = 2/k for one-against-all. But making ρ large may make the binary problems more difficult to solve. Many zeros in the matrix make the binary problems easier and faster to solve, but adding zeros tends to increase the average error ε.
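
A quick numerical check of the ρ values quoted above, as a sketch under the normalized distance Δ defined earlier: one-against-all gives ρ = 2/k, while a dense random code has ρ close to 1/2.

```python
import numpy as np
from itertools import combinations

def gen_hamming(u, v):
    p = u * v
    return np.where(p > 0, 0.0, np.where(p < 0, 1.0, 0.5)).mean()

def rho(M):
    """Minimum generalized Hamming distance between distinct rows of M."""
    return min(gen_hamming(M[a], M[b]) for a, b in combinations(range(len(M)), 2))

k = 5
one_vs_all = 2 * np.eye(k) - 1                 # +1 on the diagonal, -1 elsewhere
print(rho(one_vs_all), "vs", 2 / k)            # 0.4 vs 0.4

rng = np.random.default_rng(0)
random_dense = rng.choice([-1, 1], size=(k, 50))
print(rho(random_dense))                       # typically close to 0.5
```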

Generalization Error for Boosting with Loss-based Decoding - Outline. Analyze performance on unseen data when the binary learner is AdaBoost. In the binary case, margins are used to analyze boosting (Schapire, Freund, Bartlett, Lee 1998): larger empirical margins on the training data imply a better bound on the generalization error, and boosting tends to increase the training-data margins. For loss-based decoding we modify the classification rule and the definition of the margin accordingly, which gives a two-part analysis.

Generalization Error Bounds for AdaBoost.MO. Loss-based decoding for AdaBoost:

H(x) = arg min_{y ∈ Y} Σ_{s=1}^l exp(-M(y, s) f(x, s)) = arg min_{y ∈ Y} Σ_{s=1}^l exp(-M(y, s) Σ_{t=1}^T α_t h_t(x, s)).

Scaling f to [-1, +1] and applying the monotone transform x ↦ -(1/η) ln(x/l):

η = Σ_{t=1}^T α_t,   f(x, s) = (1/η) Σ_{t=1}^T α_t h_t(x, s),

ν(f, η, x, y) = -(1/η) ln( (1/l) Σ_{s=1}^l e^{-η M(y, s) f(x, s)} ),

H(x) = arg max_{y ∈ Y} ν(f, η, x, y).
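
A small sketch of the rescaled score ν(f, η, x, y) and the resulting arg-max rule; the random coding matrix and base-hypothesis outputs below are hypothetical stand-ins (numpy assumed).

```python
import numpy as np

def nu(M, F, eta, y):
    """nu(f, eta, x, y) = -(1/eta) * ln( (1/l) * sum_s exp(-eta * M[y, s] * F[s]) )."""
    return -(1.0 / eta) * np.log(np.mean(np.exp(-eta * M[y] * F)))

rng = np.random.default_rng(1)
T, l, k = 10, 6, 4
H_t = rng.choice([-1, 1], size=(T, l))           # base hypotheses h_t(x, s) for one x
alpha = rng.random(T)                            # AdaBoost weights alpha_t
eta = alpha.sum()
F = (alpha[:, None] * H_t).sum(axis=0) / eta     # f(x, s), scaled into [-1, +1]

M = rng.choice([-1, 1], size=(k, l))             # coding matrix
scores = [nu(M, F, eta, y) for y in range(k)]
print("H(x) =", int(np.argmax(scores)))
```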

Generalization Error Bounds for AdaBoost.MO (cont.). Modified definition of the margin:

M_{f,η}(x, y) = (1/2) [ ν(f, η, x, y) - max_{r ≠ y} ν(f, η, x, r) ].

With H the base-hypothesis space, its convex hull is

co(H) = { f : x ↦ Σ_h α_h h(x) | α_h ≥ 0, Σ_h α_h = 1 }.

Theorem: let D be a distribution over X × Y, S a sample of m examples, H a base-classifier space of VC-dimension d, and M a coding matrix. With probability at least 1 - δ, every function f ∈ co(H) and every η > 0, θ > 0 satisfy

Pr_D[ M_{f,η}(x, y) ≤ 0 ] ≤ Pr_S[ M_{f,η}(x, y) ≤ θ ] + O( (1/√m) ( d log^2(lm/d) / θ^2 + log(1/δ) )^{1/2} ).

Experiments. SVM was tested on 8 benchmark problems (roughly 6 to 19 classes) using both decoding methods and 5 different coding matrices: one-against-all, all-pairs, complete, random dense, and random sparse. [Figure: for each coding matrix, the test-error difference between Hamming and loss-based decoding as a function of the number of classes.] Overall, loss-based decoding is better than Hamming decoding. [Figure: test-error difference between one-against-all and each of the other four matrices.] One-against-all is consistently worse than the other methods, but there is no clear pattern among the other methods.

So far we have assumed that the matrix M is predefined. Problems: all columns have the same influence in the decoding process, but some binary problems might be simpler than others; each binary classifier induces the same loss for all labels, but some labels are easier to predict than others; correlations between the binary classifiers are not taken into account, but similar columns of M might result in highly correlated classifiers. Idea: change M in accordance with the predictions of the binary classifiers.

Code design - discrete case. The following problem is NP-complete. Given: a training set S = {(x_1, y_1), ..., (x_m, y_m)}, a set of binary classifiers h(x) = (h_1(x), ..., h_l(x)), and an error bound q; find a matrix M ∈ {-1, +1}^{k×l} such that the classifier H(x) based on M and h(·) makes at most q mistakes on S. However, if l is fixed there might exist a poly-time algorithm for finding the matrix M which attains the smallest number of errors on S.

Code design - continuous case. Analogously to the binary classifiers, let M ∈ R^{k×l}. Column t of M defines a partition of the labels into two disjoint sets: sign(M_{r,t}) is the set (+1 or -1) to which class r belongs, and |M_{r,t}| is the confidence in the associated partition. The classification rule is H(x) = arg max_r (h(x) · M_r), where M_r is the r-th row of M.

Continuous codes - example.

M = [ +0.1  -1.5  -0.2  -2.4  +3.7
      -1.8  -1.1  +0.9  -0.3  +1.9
      +2.1  +0.1  +0.4  -4.5  -2.5
      +3.2  +0.7  +2.1  +2.7  +0.2 ]   (one row per class, columns 1-5)

h(x) = (1.1, -3.1, 4.2, 0.7, 6.2). Compute the similarity (h(x) · M_r) for each class:

class 1:  1.1(0.1)  + (-3.1)(-1.5) + 4.2(-0.2) + 0.7(-2.4) + 6.2(3.7)  ≈  25.2
class 2:  1.1(-1.8) + (-3.1)(-1.1) + 4.2(0.9)  + 0.7(-0.3) + 6.2(1.9)  ≈  16.8
class 3:  1.1(2.1)  + (-3.1)(0.1)  + 4.2(0.4)  + 0.7(-4.5) + 6.2(-2.5) ≈ -15.0
class 4:  1.1(3.2)  + (-3.1)(0.7)  + 4.2(2.1)  + 0.7(2.7)  + 6.2(0.2)  ≈  13.3

Predict class 1 (the largest similarity).
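
The arithmetic of this example is just one dot product per row; the sketch below (numpy assumed) reproduces the four similarities.

```python
import numpy as np

M = np.array([[ 0.1, -1.5, -0.2, -2.4,  3.7],
              [-1.8, -1.1,  0.9, -0.3,  1.9],
              [ 2.1,  0.1,  0.4, -4.5, -2.5],
              [ 3.2,  0.7,  2.1,  2.7,  0.2]])
h_x = np.array([1.1, -3.1, 4.2, 0.7, 6.2])

similarities = M @ h_x                     # (h(x) . M_r) for each class r
print(np.round(similarities, 1))           # approximately [ 25.2  16.8 -15.   13.3]
print("predicted class:", int(np.argmax(similarities)) + 1)   # class 1
```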

Multiclass Margin. [Figure: three cases for the difference between the similarity of the correct label and the largest similarity of the remaining labels: difference > 1 (no loss), difference < 1 (loss), difference < 0 (misclassification and loss).] Replace the discrete multiclass loss with a piecewise-linear upper bound:

ε_S = (1/m) Σ_{i=1}^m [[H(x_i) ≠ y_i]] ≤ (1/m) Σ_{i=1}^m [ max_r { (h(x_i) · M_r) + 1 - δ_{y_i, r} } - (h(x_i) · M_{y_i}) ].

Optimization problem. Minimize β ||M||_p + Σ_{i=1}^m ξ_i subject to the constraints

∀ i, r :  (h(x_i) · M_{y_i}) - (h(x_i) · M_r) ≥ 1 - δ_{y_i, r} - ξ_i.

Problem sizes for the norms p = 1, 2, ∞:

                          l_1        l_2      l_∞
  primal variables        m + 2kl    m + kl   m + kl + 1
  primal ≥-constraints    km + 2kl   km       km + 2kl
  dual variables          km + 2kl   km       km + 2kl
  dual ≥-constraints      km + 2kl   km       km + 2kl
  dual =-constraints      m + 2kl    m        m + kl + 1
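
For the l_2 case (using the squared norm as regularizer), eliminating the slacks ξ_i turns the program above into the unconstrained objective β ||M||^2 + Σ_i [ max_r ((h(x_i) · M_r) + 1 - δ_{y_i, r}) - (h(x_i) · M_{y_i}) ]. The sketch below minimizes it by plain subgradient descent on a synthetic problem with h(x) = x; it is only an illustration of the objective, not the SPOC algorithm described on the next slide.

```python
import numpy as np

def learn_code(H, y, k, beta=0.1, lr=0.01, epochs=300):
    """Subgradient descent on beta*||M||^2 + sum_i [max_r(H_i.M_r + 1 - [r==y_i]) - H_i.M_{y_i}]."""
    m, l = H.shape
    M = np.zeros((k, l))
    for _ in range(epochs):
        G = 2.0 * beta * M                      # gradient of the regularizer
        for i in range(m):
            scores = M @ H[i] + 1.0
            scores[y[i]] -= 1.0                 # subtract delta_{y_i, r}
            r = int(np.argmax(scores))
            G[r] += H[i]                        # subgradient of the max term
            G[y[i]] -= H[i]                     # gradient of -(H_i . M_{y_i})
        M -= lr * G / m
    return M

# Tiny synthetic 3-class problem in the plane (h(x) = x, i.e. a linear multiclass SVM).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in ([-2, -2], [2, 0], [0, 2])])
y = np.repeat([0, 1, 2], 20)
M = learn_code(X, y, k=3)
print("training accuracy:", np.mean((X @ M.T).argmax(axis=1) == y))
```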

Constructing the code. From the dual program we get

M_r = β^{-1} Σ_{i=1}^m [ δ_{y_i, r} - η_{i, r} ] h(x_i).

An example i participates in the solution for M if and only if {η_{i,1}, ..., η_{i,k}} is not a point distribution concentrating on the correct label y_i; the code is therefore constructed from examples whose labels are hard to predict. Efficient algorithm: SPOC (Support Pattern Output Coding): can be combined with Mercer kernels; chooses a single example and efficiently solves a mini-QP problem; constructs a continuous code or a multiclass SVM (h(x) = x); the number of iterations needed to get ε-close to the optimal solution can be bounded.

Generalization Error for Continuous Codes and Multiclass SVM. Suppose we perfectly classify a random sample S of m examples using the matrix M; then with probability at least 1 - δ the generalization error is bounded by

O( (R^2 / m) ( k ||M||^2 log(m) + k^2 log(m/δ) ) ),

where R is the radius of a ball containing the support of S.

Example. [Figure: a 2D training set with three classes and the three one-against-rest binary problems: Square vs. Rest, Star vs. Rest, Circle vs. Rest.]

Old and new decision boundaries. [Figure: the decision boundaries induced by the original (ID) code ("Old") and by the learned continuous code ("New").]

Old and new codes. [Figure: the ID code matrix and the new (learned) code matrix, shown side by side.]

Experiments. [Figure: four panels showing the relative performance (%) of the learned continuous codes against the initial codes (id, BCH, random) on the satimage, shuttle, mnist, isolet, letter, vowel, glass, and soybean datasets.]

Multilabel Problems. Joint work with Eleazar Eskin (Columbia) and Yair Weiss (HUJI). Topics: Earnings, Forecasts, Share-Capital, Debt-Issues, Loans, Credits, Credit-Ratings, Mergers, Acquisitions, Asset-Transfers, ... Labeled documents: Doc#1: {Share-Capital, Credit-Ratings, Asset-Transfers}; Doc#2: {Mergers, Acquisitions}; Doc#3: {Earnings}; Doc#4: {Earnings, Forecasts}; Doc#5: ... Task: find the set of relevant topics for each new document.

Multilabel problems - formal definition.
  Instances: x_i ∈ X
  Set of possible topics: Y = {1, 2, ..., K}
  Relevant topics: t_i ∈ 2^Y
  Labels: y_i ∈ {-1, +1}^K
  Example: Doc#17 with relevant topics {Share-Capital, Acquisitions} is labeled by the vector that is +1 in the Share-Capital and Acquisitions coordinates and -1 elsewhere.
Find f : X → {-1, +1}^K.

Naïve Approach. Train K classifiers f_i : X → {-1, +1} and set f(x) = (f_1(x), ..., f_K(x)); classifier i predicts whether topic i is relevant (+1) or not (-1), and each classifier is applied independently to each new instance. Pros: simple. Cons: error prone; does not account for correlations between topics. Employed by AdaBoost.MH and versions of SVM.

Adding Fault Tolerance. Train classifiers that take topic correlations into account: if an article is about Mergers, learn whether it is also about Acquisitions; if it is not about Mergers, abstain. This gives K^2 classifiers f_{i→j} : X → {-1, +1}. View each possible vector of labels in {-1, +1}^K as a super-label, build the corresponding output code of size 2^K × K^2, and use loss-based decoding.

Multilabel + ECOC: example. Three topics, nine classifiers f_{1→1}, f_{2→2}, f_{3→3}, f_{1→2}, f_{1→3}, f_{2→1}, f_{2→3}, f_{3→1}, f_{3→2}; the output code has size 2^3 × 3^2 = 8 × 9, one row per super-label. [The slide shows the 8 × 9 code matrix.] But... classification time is exponential in K.
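
Below is a hedged sketch of this construction (numpy assumed). The placement of the abstention (zero) entries in the 2^K × K^2 code is my guess at the encoding described above: column (i, j) asks whether topic j is relevant, but abstains whenever i ≠ j and topic i is absent from the super-label. Decoding enumerates all 2^K rows, which makes the exponential classification cost explicit.

```python
import numpy as np
from itertools import product

def build_multilabel_code(K):
    """2^K x K^2 code: row = super-label v in {-1,+1}^K, column (i, j) = v_j,
    except it is 0 (abstain) when i != j and v_i = -1 (assumed encoding)."""
    rows = np.array(list(product([-1, +1], repeat=K)))
    M = np.zeros((2 ** K, K * K))
    for r, v in enumerate(rows):
        for i in range(K):
            for j in range(K):
                if i == j or v[i] == +1:
                    M[r, i * K + j] = v[j]
    return rows, M

def decode(rows, M, f_x, L=lambda z: np.exp(-z)):
    """Loss-based decoding over all 2^K super-labels (exponential in K)."""
    totals = L(M * f_x).sum(axis=1)
    return rows[int(np.argmin(totals))]

rows, M = build_multilabel_code(K=3)
f_x = np.random.default_rng(0).normal(size=9)    # hypothetical outputs of the 9 classifiers
print(decode(rows, M, f_x))                      # predicted label vector in {-1, +1}^3
```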

Bayes to the Rescue. Define the probability of the K labels corresponding to row r as

P(r | x) ∝ exp( -Σ_{s=1}^l L(M(r, s) f_s(x)) ).

Note that if Σ_{s=1}^l L(M(r, s) f_s(x)) > Σ_{s=1}^l L(M(y, s) f_s(x)), then P(r | x) < P(y | x), so loss-based decoding amounts to finding the most probable row. Construct a belief net (with loops) from the constraints and use loopy belief propagation (max-product) to find the best K labels.

Simulation Results. [Figure: scatter plot comparing the ECOC error to the raw error of the independent classifiers.] Applications in the works: multilabel text classification, gene multi-functionality classification.

Summary
- a general framework for solving multiclass problems with margin classifiers using output codes
- general empirical loss analysis
- generalization error analysis when the binary learner is AdaBoost
- experimental results supporting loss-based decoding
- design of continuous output codes: efficient algorithm and generalization analysis
Research directions:
- online models for multiclass problems
- hierarchical classification and coding
- VC analysis of learning with output codes
- learning output codes using Bregman-Legendre projections

Based On
T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of AI Research, 2:263-286, 1995.
R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297-336, 1999.
E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113-141, 2000.
K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265-292, 2001.
K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, 47(2/3):201-233, 2002.