Introduction to Machine Learning
Lecture 13
Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu
Multi-Class Classification
Motivation
Real-world problems often have multiple classes: text, speech, image, biological sequences.
The algorithms studied so far are designed for binary classification problems.
How do we design multi-class classification algorithms?
- Can the algorithms used for binary classification be generalized to multi-class classification?
- Can we reduce multi-class classification to binary classification?
Multi-Class Classification Problem
Training data: sample drawn i.i.d. from $X$ according to some distribution $D$,
$$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m,$$
- mono-label case: $\mathrm{Card}(Y) = k$.
- multi-label case: $Y = \{-1, +1\}^k$.
Problem: find classifier $h\colon X \to Y$ in $H$ with small generalization error,
- mono-label case: $R_D(h) = \mathrm{E}_{x \sim D}\big[1_{h(x) \neq f(x)}\big]$.
- multi-label case: $R_D(h) = \mathrm{E}_{x \sim D}\big[\frac{1}{k} \sum_{l=1}^{k} 1_{[h(x)]_l \neq [f(x)]_l}\big]$.
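As a small illustration (not part of the original slides), the sketch below computes the two empirical error measures on hypothetical arrays standing in for the target $f$ and the classifier $h$ on a sample.

```python
import numpy as np

# Mono-label case: labels in {1, ..., k}; error = fraction of misclassified points.
y_true = np.array([2, 0, 1, 1])          # hypothetical targets f(x_i)
y_pred = np.array([2, 1, 1, 0])          # hypothetical predictions h(x_i)
mono_error = np.mean(y_true != y_pred)   # empirical version of R_D(h)

# Multi-label case: labels in {-1, +1}^k; error = average fraction of wrong components.
Y_true = np.array([[+1, -1, +1], [-1, +1, +1]])
Y_pred = np.array([[+1, +1, +1], [-1, +1, -1]])
multi_error = np.mean(Y_true != Y_pred)  # averages over both the points and the k components

print(mono_error, multi_error)
```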
Notes
- In most tasks, the number of classes is $k \le 100$.
- For $k$ large or infinite, the problem is often not treated as a multi-class classification problem, e.g., automatic speech recognition.
- Computational efficiency issues arise for larger values of $k$.
- In general, classes are not balanced.
Approaches
Single classifier:
- Decision trees.
- AdaBoost-type algorithm.
- SVM-type algorithm.
Combination of binary classifiers:
- One-vs-all.
- One-vs-one.
- Error-correcting codes.
AdaBoost-Type Algorithm (Schapire and Singer, 2000)
Training data (multi-label case): $(x_1, y_1), \ldots, (x_m, y_m) \in X \times \{-1, +1\}^k$.
Reduction to binary classification:
- each example leads to $k$ binary examples:
$$(x_i, y_i) \to \big((x_i, 1), y_i[1]\big), \ldots, \big((x_i, k), y_i[k]\big), \quad i \in [1, m];$$
- apply AdaBoost to the resulting problem (see the sketch below);
- choice of $\alpha_t$.
Computational cost: $mk$ distribution updates at each round.
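A minimal sketch of this reduction, with a hypothetical helper `expand_to_binary` and toy data; the only assumption is that the labels are stored as an $m \times k$ matrix over $\{-1, +1\}$.

```python
import numpy as np

def expand_to_binary(X, Y):
    """Map each (x_i, y_i) with y_i in {-1,+1}^k to k binary examples ((x_i, l), y_i[l])."""
    m, k = Y.shape
    # store index pairs (i, l) rather than copying x_i for every class
    return [((i, l), int(Y[i, l])) for i in range(m) for l in range(k)]

# hypothetical toy data: m = 2 points, k = 3 classes
X = np.array([[0.5, 1.0], [1.5, -0.2]])
Y = np.array([[+1, -1, -1], [-1, +1, -1]])
print(expand_to_binary(X, Y))   # 6 = mk binary-labeled examples
```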
AdaBoost.MH
$H \subseteq \{-1, +1\}^{X \times Y}$.

AdaBoost.MH(S = ((x_1, y_1), ..., (x_m, y_m)))
 1  for i ← 1 to m do
 2    for l ← 1 to k do
 3      D_1(i, l) ← 1/(mk)
 4  for t ← 1 to T do
 5    h_t ← base classifier in H with small error ε_t = Pr_{D_t}[h_t(x_i, l) ≠ y_i[l]]
 6    α_t ← choice to minimize Z_t
 7    Z_t ← Σ_{i,l} D_t(i, l) exp(−α_t y_i[l] h_t(x_i, l))
 8    for i ← 1 to m do
 9      for l ← 1 to k do
10        D_{t+1}(i, l) ← D_t(i, l) exp(−α_t y_i[l] h_t(x_i, l)) / Z_t
11  f_T ← Σ_{t=1}^T α_t h_t
12  return h_T = sgn(f_T)
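The following is a rough Python rendering of the pseudocode above, not an official implementation; the base-learner search is simplified to picking the best hypothesis from a user-supplied list `candidates`, an assumption made here for brevity.

```python
import numpy as np

def adaboost_mh(X, Y, candidates, T=10):
    """Minimal AdaBoost.MH sketch.

    X: (m, d) inputs; Y: (m, k) labels in {-1, +1}.
    candidates: list of functions h(X) -> (m, k) array in {-1, +1}, playing the role of H.
    Returns f_T, a function mapping inputs to (m, k) real-valued scores.
    """
    m, k = Y.shape
    D = np.full((m, k), 1.0 / (m * k))            # D_1(i, l) = 1/(mk)
    ensemble = []                                  # list of (alpha_t, h_t)
    for _ in range(T):
        # pick the candidate with smallest weighted error eps_t = Pr_{D_t}[h(x_i, l) != y_i[l]]
        errors = [np.sum(D * (h(X) != Y)) for h in candidates]
        best = int(np.argmin(errors))
        h_t = candidates[best]
        eps_t = np.clip(errors[best], 1e-12, 1 - 1e-12)
        alpha_t = 0.5 * np.log((1 - eps_t) / eps_t)       # closed form for {-1,+1}-valued h_t
        # distribution update: D_{t+1}(i,l) ∝ D_t(i,l) exp(-alpha_t y_i[l] h_t(x_i,l))
        D *= np.exp(-alpha_t * Y * h_t(X))
        D /= D.sum()                                       # division by Z_t
        ensemble.append((alpha_t, h_t))
    return lambda Xnew: sum(a * h(Xnew) for a, h in ensemble)   # prediction: np.sign(f_T(Xnew))
```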
Bound on Empirical Error
Theorem: the empirical error of the classifier output by AdaBoost.MH verifies:
$$\widehat{R}(h) \le \prod_{t=1}^{T} Z_t.$$
Proof: similar to the proof for AdaBoost.
Choice of $\alpha_t$:
- for $H \subseteq \{-1, +1\}^{X \times Y}$, as for AdaBoost, $\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$;
- for $H \subseteq [-1, +1]^{X \times Y}$, same choice: minimize the upper bound;
- other cases: numerical/approximation method (see the sketch below).
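For the "other cases" where no closed form is available, one way to pick $\alpha_t$ numerically is a one-dimensional convex minimization of $Z_t(\alpha)$; the sketch below uses SciPy and assumes real-valued base-classifier outputs stored in an array `H_vals` (both the function name and the bound on $\alpha$ are illustrative choices, not part of the slides).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def best_alpha(D, Y, H_vals):
    """Numerically pick alpha_t minimizing Z_t(alpha) = sum_{i,l} D(i,l) exp(-alpha y_i[l] h_t(x_i,l)).

    D, Y, H_vals: (m, k) arrays for the current distribution, the labels in {-1,+1},
    and real-valued base-classifier outputs in [-1, +1].
    """
    Z = lambda a: np.sum(D * np.exp(-a * Y * H_vals))
    # Z_t is convex in alpha, so a bounded scalar search is enough for a sketch
    res = minimize_scalar(Z, bounds=(0.0, 50.0), method="bounded")
    return res.x, Z(res.x)
```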
Multi-Class SVMs (Weston and Watkins, 1999)
Optimization problem:
$$\min_{w, \xi} \; \frac{1}{2} \sum_{l=1}^{k} \|w_l\|^2 + C \sum_{i=1}^{m} \sum_{l \neq y_i} \xi_{il}$$
subject to:
$$w_{y_i} \cdot x_i + b_{y_i} \ge w_l \cdot x_i + b_l + 2 - \xi_{il}, \quad \xi_{il} \ge 0, \quad (i, l) \in [1, m] \times (Y - \{y_i\}).$$
Decision function:
$$h\colon x \mapsto \operatorname*{argmax}_{l \in Y} \, (w_l \cdot x + b_l).$$
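For very small problems, this primal can be handed directly to a generic convex solver; the cvxpy sketch below (hypothetical helper `weston_watkins_svm`, not an optimized solver) mirrors the objective and constraints as written above.

```python
import numpy as np
import cvxpy as cp

def weston_watkins_svm(X, y, k, C=1.0):
    """Solve the Weston-Watkins multi-class SVM primal with cvxpy (tiny problems only)."""
    m, d = X.shape
    W = cp.Variable((k, d))
    b = cp.Variable(k)
    Xi = cp.Variable((m, k), nonneg=True)
    mask = np.ones((m, k))
    mask[np.arange(m), y] = 0.0                   # slack xi_{il} only counted for l != y_i
    objective = 0.5 * cp.sum_squares(W) + C * cp.sum(cp.multiply(mask, Xi))
    constraints = []
    for i in range(m):
        for l in range(k):
            if l == y[i]:
                continue
            constraints.append(W[y[i]] @ X[i] + b[y[i]] >= W[l] @ X[i] + b[l] + 2 - Xi[i, l])
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return W.value, b.value       # decision function: argmax_l (w_l . x + b_l)
```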
Notes
Idea: slack variable $\xi_{il}$ penalizes the case where $(w_{y_i} \cdot x_i + b_{y_i}) - (w_l \cdot x_i + b_l) < 2$.
(Figure: two binary-SVM-style margin bands, with hyperplanes $w_1 \cdot x + b_1 = \pm 1$ and $w_2 \cdot x + b_2 = \pm 1$.)
Binary SVM obtained as a special case: $w_1 = -w_2$, $b_1 = -b_2$, $\xi_{i1} = \xi_{i2} = 2\xi_i$.
Dual Formulation
Notation: $\alpha_i = \sum_{l=1}^{k} \alpha_{il}$, $c_{il} = 1_{y_i = l}$.
Optimization problem:
$$\max_{\alpha} \; 2 \sum_{i=1}^{m} \alpha_i + \sum_{i, j, l} \Big[ -\tfrac{1}{2} c_{j y_i} \alpha_i \alpha_j + \alpha_{il} \alpha_{j y_i} - \tfrac{1}{2} \alpha_{il} \alpha_{jl} \Big] (x_i \cdot x_j)$$
subject to:
$$\forall l \in [1, k], \; \sum_{i=1}^{m} \alpha_{il} = \sum_{i=1}^{m} c_{il} \alpha_i;$$
$$\forall (i, l) \in [1, m] \times (Y - \{y_i\}), \; 0 \le \alpha_{il} \le C, \; \alpha_{i y_i} = 0.$$
Decision function:
$$h\colon x \mapsto \operatorname*{argmax}_{l = 1, \ldots, k} \; \sum_{i=1}^{m} (c_{il} \alpha_i - \alpha_{il}) (x_i \cdot x) + b_l.$$
Notes
- A PDS kernel can be used instead of the inner product.
- Optimization: complex constraints, problem of size $mk$.
- Generalization error: the leave-one-out analysis and bounds of the binary SVM apply similarly.
- The one-vs-all solution gives a (non-optimal) feasible solution of the multi-class SVM problem.
- Simplification: a single slack variable per point, $\xi_{il} \to \xi_i$ (Crammer and Singer, 2001).
Simplified Multi-Class SVMs (Crammer and Singer, 2001)
Optimization problem:
$$\min_{w, \xi} \; \frac{1}{2} \sum_{l=1}^{k} \|w_l\|^2 + C \sum_{i=1}^{m} \xi_i$$
subject to:
$$w_{y_i} \cdot x_i + \delta_{y_i, l} \ge w_l \cdot x_i + 1 - \xi_i, \quad (i, l) \in [1, m] \times Y.$$
Decision function:
$$h\colon x \mapsto \operatorname*{argmax}_{l \in Y} \, (w_l \cdot x).$$
Notes
- Single slack variable per point: the maximum of the previous slack variables (penalty for the worst class), $\sum_{l=1}^{k} \xi_{il} \to \max_{l=1}^{k} \xi_{il}$.
- A PDS kernel can be used instead of the inner product.
- Optimization: complex constraints, problem of size $mk$; specific solution based on a decomposition into $m$ disjoint sets of constraints (Crammer and Singer, 2001).
Dual Formulation
Optimization problem ($\alpha_i$ is the $i$th row of the matrix $\alpha$):
$$\max_{\alpha = [\alpha_{ij}]} \; \sum_{i=1}^{m} \alpha_i \cdot e_{y_i} - \frac{1}{2} \sum_{i, j = 1}^{m} (\alpha_i \cdot \alpha_j)(x_i \cdot x_j)$$
subject to:
$$\forall i \in [1, m], \; \alpha_i \le C \, e_{y_i} \;\wedge\; \alpha_i \cdot \mathbf{1} = 0.$$
Decision function:
$$h(x) = \operatorname*{argmax}_{l = 1, \ldots, k} \; \sum_{i=1}^{m} \alpha_{il} (x_i \cdot x).$$
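In practice, the linear (non-kernel) version of the Crammer-Singer formulation is available in scikit-learn; a minimal usage sketch with hypothetical toy data is shown below (the exact option name and behavior may vary with the library version).

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# toy 3-class problem, used only to exercise the solver
X, y = make_classification(n_samples=200, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

# 'crammer_singer' requests the joint multi-class objective with a single slack per point,
# instead of the default one-vs-rest reduction; only the linear (non-kernel) case is handled.
clf = LinearSVC(multi_class="crammer_singer", C=1.0).fit(X, y)
print(clf.score(X, y))
```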
Approaches
Single classifier:
- Decision trees.
- AdaBoost-type algorithm.
- SVM-type algorithm.
Combination of binary classifiers:
- One-vs-all.
- One-vs-one.
- Error-correcting codes.
One-vs-All
Technique:
- for each class $l \in Y$, learn a binary classifier $h_l = \operatorname{sgn}(f_l)$ separating class $l$ from the rest;
- combine the binary classifiers by selecting the class with the highest score: $h\colon x \mapsto \operatorname*{argmax}_{l \in Y} f_l(x)$.
Problem:
- poor theoretical justification;
- calibration: classifier scores are not comparable;
- nevertheless: simple and frequently used in practice, with computational advantages in some cases.
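A minimal one-vs-all sketch using scikit-learn's `OneVsRestClassifier` wrapper around binary SVMs; the Iris data is used purely as a stand-in multi-class sample.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# k binary SVMs, one per class; prediction takes the class with the largest score f_l(x)
ova = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)
print(ova.predict(X[:5]))
```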
One-vs-One
Technique:
- for each pair $(l, l') \in Y^2$, $l \neq l'$, learn a binary classifier $h_{ll'}\colon X \to \{0, 1\}$;
- combine the binary classifiers via majority vote: $h(x) = \operatorname*{argmax}_{l \in Y} \, \big|\{l' \colon h_{ll'}(x) = 1\}\big|$.
Problem:
- computational: train $k(k-1)/2$ binary classifiers;
- overfitting: the training sample for a given pair could become small.
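The corresponding one-vs-one sketch, again on stand-in data; note the $k(k-1)/2$ pairwise estimators it trains.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# k(k-1)/2 pairwise SVMs; the predicted class is the one winning the most pairwise votes
ovo = OneVsOneClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)
print(len(ovo.estimators_), ovo.predict(X[:5]))   # 3 classes -> 3 pairwise classifiers
```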
Computational Comparison

              Training                                      Testing
One-vs-all    O(k B_train(m)), i.e., O(k m^α)               O(k B_test)
One-vs-one    O(k^2 B_train(m/k)), i.e., O(k^{2-α} m^α)     O(k^2 B_test), but on average a smaller number of support vectors per classifier

Time complexity for SVMs: B_train(m) = O(m^α), with α less than 3.
Heuristics (Platt et al., 2000)
Training: reuse of computation between classifiers, e.g.,
- sharing of kernel computations;
- caching.
Testing: directed acyclic graph (DAG).
- smaller number of tests (see the sketch below);
- ordering?
(Figure: DAG of pairwise tests for k = 4, with root node "1 vs 4", internal nodes "1 vs 3", "2 vs 4", "1 vs 2", "2 vs 3", "3 vs 4", edges labeled "not 1", "not 2", "not 3", "not 4", and leaves 1, 2, 3, 4.)
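A hedged sketch of DAG-style testing: with the pairwise classifiers already trained, a prediction needs only k-1 pairwise tests, obtained by repeatedly eliminating one candidate class. The `pairwise` dictionary interface is an assumption made for illustration.

```python
def dag_predict(x, classes, pairwise):
    """DAG-style test: k-1 pairwise tests instead of evaluating all k(k-1)/2 classifiers.

    classes: list of candidate labels, e.g. [1, 2, 3, 4].
    pairwise: dict mapping (a, b) with a < b to a function of x returning True
              if class a beats class b (hypothetical interface to the pairwise classifiers).
    """
    remaining = sorted(classes)
    while len(remaining) > 1:
        a, b = remaining[0], remaining[-1]   # compare the two extreme remaining candidates
        if pairwise[(a, b)](x):
            remaining.pop()                  # "not b": b is eliminated
        else:
            remaining.pop(0)                 # "not a": a is eliminated
    return remaining[0]

# toy usage with hypothetical pairwise rules that always favour the larger class index
rules = {(a, b): (lambda x, a=a, b=b: False) for a in range(1, 5) for b in range(a + 1, 5)}
print(dag_predict(x=None, classes=[1, 2, 3, 4], pairwise=rules))   # -> 4, after 3 tests
```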
Error-Correcting Code Approach (Dietterich and Bakiri, 1995)
Technique:
- assign an $F$-long binary code word to each class: $M = [M_{lj}] \in \{0, 1\}^{[1,k] \times [1,F]}$;
- learn a binary classifier $f_j\colon X \to \{0, 1\}$ for each column; an example $x$ in class $l$ is labeled with $M_{lj}$;
- classifier output: with $f(x) = (f_1(x), \ldots, f_F(x))$,
$$h\colon x \mapsto \operatorname*{argmin}_{l \in Y} d_{\mathrm{Hamming}}\big(M_l, f(x)\big).$$
Illustration
8 classes, code length 6.

class | code
      | 1 2 3 4 5 6
  1   | 0 0 0 1 0 0
  2   | 1 0 0 0 0 0
  3   | 0 1 1 0 1 0
  4   | 1 1 0 0 0 0
  5   | 1 1 0 0 1 0
  6   | 0 0 1 1 0 1
  7   | 0 0 1 0 0 0
  8   | 0 1 0 1 0 0

New example $x$: $(f_1(x), \ldots, f_6(x)) = (0, 1, 1, 0, 1, 1)$; the closest code word is that of class 3 (Hamming distance 1).
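Reproducing the decoding step of this illustration in NumPy: the code matrix is the one in the table above, and the row nearest in Hamming distance gives the predicted class. (For reference, scikit-learn's `OutputCodeClassifier` automates the same approach with randomly generated code matrices.)

```python
import numpy as np

# code matrix from the illustration: 8 classes (rows), code length F = 6 (columns)
M = np.array([
    [0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0],
])

f_x = np.array([0, 1, 1, 0, 1, 1])            # outputs f_1(x), ..., f_6(x) for the new example
hamming = np.sum(M != f_x, axis=1)            # Hamming distance of f(x) to each code word
print(hamming, 1 + int(np.argmin(hamming)))   # class 3 is the closest (distance 1)
```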
Error-Correcting Codes - Design
Main ideas:
- independent columns: otherwise no effective discrimination;
- distance between rows: if the minimal Hamming distance between rows is $d$, then the multi-class code can correct up to $\lfloor \frac{d-1}{2} \rfloor$ errors;
- columns may correspond to features selected for the task;
- one-vs-all and one-vs-one (with ternary codes) are special cases.
Extensions (Allwein et al., 2000)
Matrix entries in $\{-1, 0, +1\}$:
- examples marked with 0 are disregarded during training;
- one-vs-one also becomes a special case.
Hamming decoding:
$$h(x) = \operatorname*{argmin}_{l \in \{1, \ldots, k\}} \sum_{j=1}^{F} \frac{1 - \operatorname{sgn}\big(M_{lj} f_j(x)\big)}{2}.$$
Margin loss $L$: a function of the margin $y f(x)$, e.g., the hinge loss. Loss-based decoding:
$$h(x) = \operatorname*{argmin}_{l \in \{1, \ldots, k\}} \sum_{j=1}^{F} L\big(M_{lj} f_j(x)\big).$$
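A small sketch of loss-based decoding in the spirit of Allwein et al., using the hinge loss; the ternary code matrix, the scores, and the helper name are hypothetical.

```python
import numpy as np

def loss_based_decode(M, scores, loss=lambda z: np.maximum(0.0, 1.0 - z)):
    """Loss-based decoding sketch: pick the row of M minimizing sum_j L(M[l, j] * f_j(x)).

    M: (k, F) code matrix with entries in {-1, 0, +1}; scores: length-F real-valued outputs f_j(x).
    Zero entries contribute the constant L(0) for every class, so they do not affect the argmin.
    """
    return int(np.argmin(np.sum(loss(M * scores), axis=1)))

# hypothetical ternary code for 3 classes and F = 4 real-valued classifiers
M = np.array([[+1, +1,  0, -1],
              [-1,  0, +1, +1],
              [ 0, -1, -1, +1]])
print(loss_based_decode(M, np.array([0.8, -0.3, 0.5, 1.2])))   # -> class index 2
```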
Continuous Codes (Crammer and Singer, 2000, 2002)
Optimization problem ($M_l$ is the $l$th row of $M$):
$$\min_{M, \xi} \; \|M\|_2^2 + C \sum_{i=1}^{m} \xi_i$$
subject to:
$$K\big(f(x_i), M_{y_i}\big) \ge K\big(f(x_i), M_l\big) + 1 - \xi_i, \quad (i, l) \in [1, m] \times [1, k].$$
Decision function:
$$h\colon x \mapsto \operatorname*{argmax}_{l \in \{1, \ldots, k\}} K\big(f(x), M_l\big).$$
Ideas
- Continuous codes: learn a real-valued code matrix $M$.
- Kernel $K$ used to measure the similarity between a matrix row and the prediction vector $f(x)$.
- Similar optimization problems can be formulated with other matrix norms.
Multiclass Margin
Definition: let $H \subseteq \mathbb{R}^{X \times Y}$. The margin of the training point $(x, y) \in X \times Y$ for the hypothesis $h \in H$ is
$$\gamma_h(x, y) = h(x, y) - \max_{y' \neq y} h(x, y').$$
Thus, $h$ misclassifies $(x, y)$ iff $\gamma_h(x, y) \le 0$.
Margin Bound (Koltchinskii and Panchenko, 2002)
Theorem: let $H_1 = \{x \mapsto h(x, y) \colon h \in H, y \in Y\}$ where $H \subseteq \mathbb{R}^{X \times Y}$. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H$:
$$\Pr[\gamma_h(x, y) \le 0] \le \widehat{\Pr}[\gamma_h(x, y) \le \rho] + c \, \frac{k^2 \mathfrak{R}_m(H_1)}{\rho} + c' \sqrt{\frac{\log \frac{1}{\delta}}{m}},$$
for some constants $c, c' > 0$.
Applications
- The one-vs-all approach is the most widely used.
- No clear empirical evidence of the superiority of other approaches (Rifkin and Klautau, 2004), except perhaps on small data sets with relatively large error rates.
- Large structured multi-class problems: often treated as ranking problems (see next lecture).
References
- Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113-141, 2000.
- Koby Crammer and Yoram Singer. Improved output coding for classification using continuous relaxation. In Proceedings of NIPS, 2000.
- Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265-292, 2001.
- Koby Crammer and Yoram Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, 47, 2002.
- Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, 1995.
- Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1), 2002.
- John C. Platt, Nello Cristianini, and John Shawe-Taylor. Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems 12 (NIPS 1999), pages 547-553, 2000.
- Ryan Rifkin. Everything Old Is New Again: A Fresh Look at Historical Approaches in Machine Learning. Ph.D. Thesis, MIT, 2002.
- Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101-141, 2004.
- Robert E. Schapire. The boosting approach to machine learning: an overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.
- Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.
- Robert E. Schapire and Yoram Singer. BoosTexter: a boosting-based system for text categorization. Machine Learning, 39(2/3):135-168, 2000.
- Jason Weston and Chris Watkins. Support vector machines for multi-class pattern recognition. In Proceedings of the Seventh European Symposium on Artificial Neural Networks (ESANN 1999), 1999.