1 Foundations of Machine Learning: Multi-Class Classification
Mehryar Mohri, Courant Institute and Google Research

2 Motivation
Real-world problems often have multiple classes: text, speech, image, biological sequences.
Algorithms studied so far: designed for binary classification problems. How do we design multi-class classification algorithms?
- Can the algorithms used for binary classification be generalized to multi-class classification?
- Can we reduce multi-class classification to binary classification?

3 Multi-Class Classification Problem
Training data: sample drawn i.i.d. from $X \times Y$ according to some distribution $D$: $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m$.
- mono-label case: $\mathrm{Card}(Y) = k$.
- multi-label case: $Y = \{-1, +1\}^k$.
Problem: find classifier $h \colon X \to Y$ in $H$ with small generalization error:
- mono-label case: $R(h) = \mathbb{E}_{x \sim D}\big[1_{h(x) \neq f(x)}\big]$.
- multi-label case: $R(h) = \mathbb{E}_{x \sim D}\big[\frac{1}{k} \sum_{l=1}^{k} 1_{[h(x)]_l \neq [f(x)]_l}\big]$.
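A quick illustrative sketch (not from the slides) of how these two error notions are estimated on a finite sample; `h` and `f` are hypothetical callables returning a single label in the mono-label case and a vector in $\{-1,+1\}^k$ in the multi-label case.

```python
# Empirical estimates of the two generalization errors on a sample xs;
# h and f are placeholder callables (assumptions, not from the slides).
import numpy as np

def mono_label_error(h, f, xs):
    # R(h) ~ average of 1_{h(x) != f(x)} over the sample.
    return float(np.mean([h(x) != f(x) for x in xs]))

def multi_label_error(h, f, xs):
    # R(h) ~ average of (1/k) sum_l 1_{[h(x)]_l != [f(x)]_l}.
    return float(np.mean([np.mean(np.asarray(h(x)) != np.asarray(f(x)))
                          for x in xs]))
```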

4 Notes
In most tasks considered, the number of classes is $k \le 100$. For $k$ large, the problem is often not treated as a multi-class classification problem (ranking or density estimation, e.g., automatic speech recognition).
Computational efficiency issues arise for larger $k$.
In general, classes are not balanced.

5 Multi-Class Classification - Margin
Hypothesis set $H$: functions $h \colon X \times Y \to \mathbb{R}$; label returned: $x \mapsto \operatorname{argmax}_{y \in Y} h(x, y)$.
Margin: $\rho_h(x, y) = h(x, y) - \max_{y' \neq y} h(x, y')$.
Error: $1_{\rho_h(x, y) \le 0} \le \Phi_\rho(\rho_h(x, y))$.
Empirical margin loss: $\widehat{R}_\rho(h) = \frac{1}{m} \sum_{i=1}^{m} \Phi_\rho(\rho_h(x_i, y_i))$.
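A minimal sketch of these definitions, assuming `scores` is an $(m, k)$ array with `scores[i, y]` $= h(x_i, y)$ and using the $\rho$-margin function $\Phi_\rho(u) = \min(1, \max(0, 1 - u/\rho))$.

```python
# Multi-class margins rho_h(x_i, y_i) and the empirical margin loss.
import numpy as np

def margins(scores, y):
    m = scores.shape[0]
    correct = scores[np.arange(m), y]
    rival = scores.copy()
    rival[np.arange(m), y] = -np.inf       # exclude the correct class
    return correct - rival.max(axis=1)     # h(x, y) - max_{y' != y} h(x, y')

def empirical_margin_loss(scores, y, rho):
    u = margins(scores, y)
    return float(np.mean(np.clip(1.0 - u / rho, 0.0, 1.0)))
```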

6 Multi-Class Margin Bound (MM et al. 2012; Kuznetsov, MM, and Syed, 2014)
Theorem: let $H \subseteq \mathbb{R}^{X \times Y}$ with $Y = \{1, \ldots, k\}$. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following multi-class classification bound holds for all $h \in H$:
$R(h) \le \widehat{R}_\rho(h) + \frac{4k}{\rho} \mathfrak{R}_m(\Pi_1(H)) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$,
with $\Pi_1(H) = \{x \mapsto h(x, y) \colon y \in Y, h \in H\}$.

7 Kernel-Based Hypotheses
Hypothesis set $H_{K,p}$:
- $\Phi$: feature mapping associated to PDS kernel $K$.
- functions $(x, y) \mapsto w_y \cdot \Phi(x)$, $y \in \{1, \ldots, k\}$.
- label returned: $x \mapsto \operatorname{argmax}_{y \in \{1, \ldots, k\}} w_y \cdot \Phi(x)$.
For any $p \ge 1$,
$H_{K,p} = \{(x, y) \in X \times [1, k] \mapsto w_y \cdot \Phi(x) \colon \mathbf{W} = (w_1, \ldots, w_k), \|\mathbf{W}\|_{\mathbb{H},p} \le \Lambda\}$.

8 Multi-Class Margin Bound - Kernels (MM et al. 2012)
Theorem: let $K \colon X \times X \to \mathbb{R}$ be a PDS kernel and let $\Phi \colon X \to \mathbb{H}$ be a feature mapping associated to $K$. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following multi-class bound holds for all $h \in H_{K,p}$:
$R(h) \le \widehat{R}_\rho(h) + 4k \sqrt{\frac{r^2 \Lambda^2 / \rho^2}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$,
where $r^2 = \sup_{x \in X} K(x, x)$.

9 Approaches
Single classifier: Multi-class SVMs. AdaBoost.MH. Conditional Maxent. Decision trees.
Combination of binary classifiers: One-vs-all. One-vs-one. Error-correcting codes.

10 Multi-Class SVMs (Weston and Watkins, 1999; Crammer and Singer, 2001)
Optimization problem:
$\min_{\mathbf{w}, \xi} \ \frac{1}{2} \sum_{l=1}^{k} \|w_l\|^2 + C \sum_{i=1}^{m} \xi_i$
subject to: $w_{y_i} \cdot x_i + \delta_{y_i, l} \ge w_l \cdot x_i + 1 - \xi_i$, $\forall (i, l) \in [1, m] \times Y$.
Decision function: $h \colon x \mapsto \operatorname{argmax}_{l \in Y} (w_l \cdot x)$.
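A minimal usage sketch, assuming scikit-learn is available: its `LinearSVC` exposes the Crammer-Singer formulation (linear kernel) via the `multi_class="crammer_singer"` option; the data set here is only a placeholder.

```python
# Crammer-Singer multi-class SVM via scikit-learn (a sketch, not the slides' code).
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
clf = LinearSVC(multi_class="crammer_singer", C=1.0).fit(X, y)
print(clf.score(X, y))  # decision rule: h(x) = argmax_l (w_l . x)
```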

11 Notes
Directly based on generalization bounds.
Comparison with (Weston and Watkins, 1999): single slack variable per point; maximum of slack variables (penalty for worst class): $\xi_i = \max_{l=1}^{k} \xi_{il}$ instead of $\sum_{l=1}^{k} \xi_{il}$.
PDS kernel instead of inner product.
Optimization: complex constraints, $mk$-size problem; specific solution based on decomposition into $m$ disjoint sets of constraints (Crammer and Singer, 2001).

12 Dual Formulation
Optimization problem ($\alpha_i$ denotes the $i$-th row of the matrix $\alpha = [\alpha_{ij}] \in \mathbb{R}^{m \times k}$):
$\max_{\alpha} \ \sum_{i=1}^{m} \alpha_i \cdot e_{y_i} - \frac{1}{2} \sum_{i,j=1}^{m} (\alpha_i \cdot \alpha_j)(x_i \cdot x_j)$
subject to: $\forall i \in [1, m]$, $(0 \le \alpha_{i y_i} \le C) \wedge (\forall j \neq y_i, \ \alpha_{ij} \le 0) \wedge (\alpha_i \cdot \mathbf{1} = 0)$.
Decision function: $h(x) = \operatorname{argmax}_{l=1}^{k} \sum_{i=1}^{m} \alpha_{il} (x_i \cdot x)$.
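A small sketch of evaluating this dual decision function from a learned coefficient matrix; `alpha` and the test/train kernel matrix are assumed inputs, not part of the slides.

```python
# h(x) = argmax_l sum_i alpha_il K(x_i, x), vectorized over test points.
import numpy as np

def dual_predict(alpha, K_test_train):
    # alpha: (m, k) dual variables; K_test_train[n, i] = K(x_n, x_i).
    return (K_test_train @ alpha).argmax(axis=1)  # class indices in [0, k)
```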

13 AdaBoost (Schapire and Singer, 2000)
Training data (multi-label case): $(x_1, y_1), \ldots, (x_m, y_m) \in X \times \{-1, +1\}^k$.
Reduction to binary classification: each example leads to $k$ binary examples:
$(x_i, y_i) \mapsto ((x_i, 1), y_i[1]), \ldots, ((x_i, k), y_i[k])$, $i \in [1, m]$.
Apply AdaBoost to the resulting problem; choice of $\alpha_t$.
Computational cost: $mk$ distribution updates at each round.

14 AdaBoost.MH
$H \subseteq (\{-1, +1\})^{X \times Y}$.

AdaBoost.MH($S = ((x_1, y_1), \ldots, (x_m, y_m))$)
1  for $i \leftarrow 1$ to $m$ do
2    for $l \leftarrow 1$ to $k$ do
3      $D_1(i, l) \leftarrow \frac{1}{mk}$
4  for $t \leftarrow 1$ to $T$ do
5    $h_t \leftarrow$ base classifier in $H$ with small error $\epsilon_t = \Pr_{D_t}[h_t(x_i, l) \neq y_i[l]]$
6    $\alpha_t \leftarrow$ choice to minimize $Z_t$
7    $Z_t \leftarrow \sum_{i,l} D_t(i, l) \exp(-\alpha_t y_i[l] h_t(x_i, l))$
8    for $i \leftarrow 1$ to $m$ do
9      for $l \leftarrow 1$ to $k$ do
10       $D_{t+1}(i, l) \leftarrow \frac{D_t(i, l) \exp(-\alpha_t y_i[l] h_t(x_i, l))}{Z_t}$
11 $f_T \leftarrow \sum_{t=1}^{T} \alpha_t h_t$
12 return $h_T = \operatorname{sgn}(f_T)$
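A compact runnable sketch of this pseudocode, under the assumption (not from the slides) that the base class consists of one-feature threshold stumps with an optimal per-label sign $s_l \in \{-1, +1\}$; labels `Y` are an $(m, k)$ matrix in $\{-1, +1\}$.

```python
# AdaBoost.MH with per-label signed threshold stumps (a sketch).
import numpy as np

def adaboost_mh(X, Y, T=50):
    m, k = Y.shape
    D = np.full((m, k), 1.0 / (m * k))            # D_1(i, l) = 1/(mk)
    ensemble = []
    for _ in range(T):
        best = None
        for j in range(X.shape[1]):
            for th in np.unique(X[:, j]):
                p = np.where(X[:, j] <= th, -1.0, 1.0)[:, None]  # (m, 1)
                wrong = np.sum(D * (p != Y), axis=0)             # per-label error
                col_w = D.sum(axis=0)
                s = np.where(wrong <= col_w / 2, 1.0, -1.0)      # optimal signs
                eps = np.sum(np.minimum(wrong, col_w - wrong))
                if best is None or eps < best[0]:
                    best = (eps, j, th, s)
        eps, j, th, s = best
        alpha = 0.5 * np.log((1.0 - eps) / max(eps, 1e-12))      # minimizes Z_t
        P = np.where(X[:, j] <= th, -1.0, 1.0)[:, None] * s      # h_t(x_i, l)
        D = D * np.exp(-alpha * Y * P)
        D /= D.sum()                                             # division by Z_t
        ensemble.append((alpha, j, th, s))
    return ensemble

def predict_mh(ensemble, X):
    F = sum(a * np.where(X[:, j] <= th, -1.0, 1.0)[:, None] * s
            for a, j, th, s in ensemble)
    return np.sign(F)                                            # (m, k) in {-1, +1}
```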

15 Bound on Empirical Error
Theorem: the empirical error of the classifier output by AdaBoost.MH verifies: $\widehat{R}(h) \le \prod_{t=1}^{T} Z_t$.
Proof: similar to the proof for AdaBoost.
Choice of $\alpha_t$:
- for $H \subseteq (\{-1, +1\})^{X \times Y}$, as for AdaBoost, $\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$.
- for $H \subseteq ([-1, +1])^{X \times Y}$, same choice: minimize upper bound.
- other cases: numerical/approximation method.

16 Notes
Objective function: $F(\alpha) = \sum_{i=1}^{m} \sum_{l=1}^{k} e^{-y_i[l] f_n(x_i, l)} = \sum_{i=1}^{m} \sum_{l=1}^{k} e^{-y_i[l] \sum_{t=1}^{n} \alpha_t h_t(x_i, l)}$.
All comments and analysis given for AdaBoost apply here.
Alternative: AdaBoost.MR, which coincides with a special case of RankBoost (ranking lecture).

17 Decision Trees
[Figure: a binary decision tree asking threshold questions $X_1 < a_1$, $X_1 < a_2$, $X_2 < a_3$, $X_2 < a_4$ at its internal nodes, with leaves $R_1, \ldots, R_5$, shown next to the corresponding axis-aligned partition of the $(X_1, X_2)$ plane into regions $R_1$-$R_5$.]

18 Different Types of Questions
Decision trees:
- $X \in \{\text{blue}, \text{white}, \text{red}\}$: categorical questions.
- $X \le a$: continuous variables.
Binary space partition (BSP) trees: $\sum_{i=1}^{n} \alpha_i X_i \le a$: partitioning with convex polyhedral regions.
Sphere trees: $\|X - a_0\| \le a$: partitioning with pieces of spheres.

19 Hypotheses
In each region $R_t$:
- classification: majority vote, ties broken arbitrarily: $\hat{y}_t = \operatorname{argmax}_{y \in Y} |\{x_i \in R_t \colon i \in [1, m], y_i = y\}|$.
- regression: average value: $\hat{y}_t = \frac{1}{|\{i \colon x_i \in R_t\}|} \sum_{x_i \in R_t} y_i$.
Form of hypotheses: $h \colon x \mapsto \sum_{t} \hat{y}_t 1_{x \in R_t}$.

20 Training
Problem: the general problem of determining the partition with minimum empirical error is NP-hard.
Heuristics: greedy algorithm. For all $j \in [1, N]$, $\theta \in \mathbb{R}$:
$R^{+}(j, \theta) = \{x_i \in R \colon x_i[j] \ge \theta, i \in [1, m]\}$,
$R^{-}(j, \theta) = \{x_i \in R \colon x_i[j] < \theta, i \in [1, m]\}$.

Decision-Trees($S = ((x_1, y_1), \ldots, (x_m, y_m))$)
1 $P \leftarrow \{S\}$  (initial partition)
2 for each region $R \in P$ such that Pred($R$) do
3   $(j, \theta) \leftarrow \operatorname{argmin}_{(j, \theta)} [\text{error}(R^{-}(j, \theta)) + \text{error}(R^{+}(j, \theta))]$
4   $P \leftarrow (P - \{R\}) \cup \{R^{-}(j, \theta), R^{+}(j, \theta)\}$
5 return $P$
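A minimal sketch of the greedy split search in line 3 of the pseudocode, using misclassification count as the error and assuming labels are non-negative integers.

```python
# Best single split (j, theta) of one region by empirical error.
import numpy as np

def misclass_error(y):
    # Unnormalized empirical error of the majority-vote label.
    if len(y) == 0:
        return 0
    return len(y) - np.bincount(y).max()

def best_split(X, y):
    best = None
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            left = X[:, j] < theta          # R^-(j, theta); ~left is R^+(j, theta)
            err = misclass_error(y[left]) + misclass_error(y[~left])
            if best is None or err < best[0]:
                best = (err, j, theta)
    return best                              # (error, feature j, threshold theta)
```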

21 Splitting/Stopping Criteria
Problem: larger trees overfit the training sample.
Conservative splitting: split a node only if the loss is reduced by some fixed value $\lambda > 0$; issue: a seemingly bad split may dominate useful splits further down the tree.
Grow-then-prune technique (CART):
- grow a very large tree: Pred($R$): $|R| > n_0$.
- prune the tree based on $F(T) = \text{Loss}(T) + \lambda |T|$, with the parameter $\lambda \ge 0$ determined by cross-validation.
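Grow-then-prune in scikit-learn terms (a sketch, assuming scikit-learn and a placeholder data set): `cost_complexity_pruning_path` enumerates the effective values of $\lambda$ in $F(T) = \text{Loss}(T) + \lambda |T|$, and cross-validation picks among them.

```python
# Cost-complexity pruning with the parameter chosen by cross-validation.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
cv = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                      X, y, cv=5).mean() for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(cv))]
print(best_alpha)
```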

22 Decision Tree Tools
Most commonly used tools for learning decision trees:
- CART (classification and regression tree) (Breiman et al., 1984).
- C4.5 (Quinlan, 1986, 1993) and C5.0 (RuleQuest Research), a commercial system.
Differences: minor between latest versions.

23 Approaches
Single classifier: SVM-type algorithm. AdaBoost-type algorithm. Conditional Maxent. Decision trees.
Combination of binary classifiers: One-vs-all. One-vs-one. Error-correcting codes.

24 One-vs-All
Technique:
- for each class $l \in Y$, learn a binary classifier $h_l = \operatorname{sgn}(f_l)$.
- combine the binary classifiers via a voting mechanism, typically majority vote: $h \colon x \mapsto \operatorname{argmax}_{l \in Y} f_l(x)$.
Problem: poor justification (in general); calibration: classifier scores not comparable.
Nevertheless: simple and frequently used in practice, with computational advantages in some cases.
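A one-vs-all sketch built from any binary learner with a real-valued score; the binary learner here is a linear SVM via scikit-learn, which is an assumption for illustration.

```python
# One-vs-all: one binary problem per class, combined by argmax over scores.
import numpy as np
from sklearn.svm import LinearSVC

def train_ova(X, y, classes):
    # Class l vs. the rest, relabeled to {-1, +1}.
    return {l: LinearSVC().fit(X, np.where(y == l, 1, -1)) for l in classes}

def predict_ova(models, X):
    # h(x) = argmax_l f_l(x), using the (uncalibrated) decision scores.
    scores = np.column_stack([m.decision_function(X) for m in models.values()])
    return np.array(list(models))[scores.argmax(axis=1)]
```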

25 One-vs-One
Technique:
- for each pair $(l, l') \in Y \times Y$, $l \neq l'$, learn a binary classifier $h_{ll'} \colon X \to \{0, 1\}$.
- combine the binary classifiers via majority vote: $h(x) = \operatorname{argmax}_{l' \in Y} |\{l \colon h_{ll'}(x) = 1\}|$.
Problem:
- computational: train $k(k-1)/2$ binary classifiers.
- overfitting: the size of the training sample could become small for a given pair.
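A corresponding one-vs-one sketch (same assumed binary learner as above): $k(k-1)/2$ classifiers, each trained only on the points of its two classes, combined by counting wins per class.

```python
# One-vs-one with pairwise voting.
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def train_ovo(X, y, classes):
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)               # only the two classes' points
        models[(a, b)] = LinearSVC().fit(X[mask], (y[mask] == a).astype(int))
    return models

def predict_ovo(models, X, classes):
    votes = {l: np.zeros(len(X)) for l in classes}
    for (a, b), m in models.items():
        pred = m.predict(X)                      # 1 votes for a, 0 votes for b
        votes[a] += pred
        votes[b] += 1 - pred
    V = np.column_stack([votes[l] for l in classes])
    return np.asarray(classes)[V.argmax(axis=1)]
```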

26 Computational Comparison

              Training                          Testing
One-vs-all    $O(k B_{\text{train}}(m))$        $O(k B_{\text{test}})$
One-vs-one    $O(k^2 B_{\text{train}}(m/k))$    $O(k^2 B_{\text{test}})$ (smaller $N_{SV}$ per $B$ on average)

Time complexity for SVMs: $B_{\text{train}}(m) = O(m^{\alpha})$ with $\alpha$ less than 3, giving $O(k m^{\alpha})$ training time for one-vs-all and $O(k^{2-\alpha} m^{\alpha})$ for one-vs-one.

27 Error-Correcting Code Approach (Dietterich and Bakiri, 1995)
Idea:
- assign an $F$-long binary code word to each class: $M = [M_{lj}] \in \{0, 1\}^{[1,k] \times [1,F]}$.
- learn a binary classifier $f_j \colon X \to \{0, 1\}$ for each column; example $x$ in class $l$ is labeled with $M_{lj}$.
- classifier output: $f(x) = (f_1(x), \ldots, f_F(x))$, $h \colon x \mapsto \operatorname{argmin}_{l \in Y} d_{\text{Hamming}}(M_l, f(x))$.
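An ECOC sketch under the assumptions (not from the slides) that classes are indexed $0, \ldots, k-1$ and that every column of the code matrix `M` contains both a 0 and a 1, so each binary problem is well posed; the binary learner is again a linear SVM for illustration.

```python
# Error-correcting output codes: one classifier per column, Hamming decoding.
import numpy as np
from sklearn.svm import LinearSVC

def train_ecoc(X, y, M):
    # Column j relabels example (x_i, y_i) as M[y_i, j].
    return [LinearSVC().fit(X, M[y, j]) for j in range(M.shape[1])]

def predict_ecoc(models, X, M):
    F = np.column_stack([m.predict(X) for m in models])    # f(x) in {0,1}^F
    dist = np.abs(F[:, None, :] - M[None, :, :]).sum(-1)   # Hamming distances
    return dist.argmin(axis=1)                             # closest code word
```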

28 Illustration
[Figure: 8 classes, code length 6: an $8 \times 6$ binary code matrix with one row (code word) per class and one column per binary classifier $f_1, \ldots, f_6$; a new example $x$ is assigned the class whose code word is closest to $(f_1(x), \ldots, f_6(x))$.]

29 Error-Correcting Codes - Design
Main ideas:
- independent columns: otherwise no effective discrimination.
- distance between rows: if the minimal Hamming distance between rows is $d$, then the multi-class classifier can correct up to $\lfloor \frac{d-1}{2} \rfloor$ errors.
- columns may correspond to features selected for the task.
- one-vs-all and one-vs-one (with ternary codes) are special cases.

30 Extensions (Allwein et al., 2000)
Matrix entries in $\{-1, 0, +1\}$:
- examples marked with 0 are disregarded during training.
- one-vs-one also becomes a special case.
Margin loss $L$: function of the margin $y f(x)$, e.g., hinge loss.
Hamming distance decoding: $h(x) = \operatorname{argmin}_{l \in \{1, \ldots, k\}} \sum_{j=1}^{F} \frac{1 - \operatorname{sgn}(M_{lj} f_j(x))}{2}$.
Loss-based decoding: $h(x) = \operatorname{argmin}_{l \in \{1, \ldots, k\}} \sum_{j=1}^{F} L(M_{lj} f_j(x))$.
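A loss-based decoding sketch for ternary codes $M \in \{-1, 0, +1\}^{k \times F}$ with the hinge loss $L(u) = \max(0, 1 - u)$; `scores` holds the real-valued outputs $f_j(x)$ and is an assumed input. Entries $M_{lj} = 0$ contribute the constant $L(0) = 1$, as in the decoding rule above.

```python
# Loss-based decoding: argmin_l sum_j L(M_lj * f_j(x)) with the hinge loss.
import numpy as np

def loss_based_decode(scores, M):
    # scores: (n, F); M: (k, F). Margins M_lj * f_j(x) for every class l.
    U = M[None, :, :] * scores[:, None, :]        # shape (n, k, F)
    return np.maximum(0.0, 1.0 - U).sum(-1).argmin(axis=1)
```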

31 Applications
The one-vs-all approach is the most widely used.
No clear empirical evidence of the superiority of other approaches (Rifkin and Klautau, 2004), except perhaps on small data sets with relatively large error rates.
Large structured multi-class problems: often treated as ranking problems (see ranking lecture).

32 References
- Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113-141, 2000.
- Koby Crammer and Yoram Singer. Improved output coding for classification using continuous relaxation. In Proceedings of NIPS, 2000.
- Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265-292, 2001.
- Koby Crammer and Yoram Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, 47, 2002.
- Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research (JAIR), 2:263-286, 1995.
- Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
- John C. Platt, Nello Cristianini, and John Shawe-Taylor. Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems 12 (NIPS 1999), pp. 547-553, 2000.

33 References
- Ryan Rifkin. Everything Old Is New Again: A Fresh Look at Historical Approaches in Machine Learning. Ph.D. thesis, MIT, 2002.
- Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101-141, 2004.
- Robert E. Schapire. The boosting approach to machine learning: an overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.
- Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.
- Robert E. Schapire and Yoram Singer. BoosTexter: a boosting-based system for text categorization. Machine Learning, 39(2/3):135-168, 2000.
- Jason Weston and Chris Watkins. Support vector machines for multi-class pattern recognition. In Proceedings of the Seventh European Symposium on Artificial Neural Networks (ESANN 1999), 1999.
