Foundations of Machine Learning Lecture 9. Mehryar Mohri Courant Institute and Google Research


1 Foundations of Machine Learning Lecture 9 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

2 Multi-Class Classification

3 Motivation
Real-world problems often have multiple classes: text, speech, image, biological sequences.
Algorithms studied so far: designed for binary classification problems.
How do we design multi-class classification algorithms?
- can the algorithms used for binary classification be generalized to multi-class classification?
- can we reduce multi-class classification to binary classification?

4 Multi-Class Classification Problem
Training data: sample drawn i.i.d. from X according to some distribution D,
S = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m.
- mono-label case: Card(Y) = k.
- multi-label case: Y = {-1, +1}^k.
Problem: find classifier h: X → Y in H with small generalization error,
- mono-label case: R_D(h) = E_{x∼D}[1_{h(x) ≠ f(x)}].
- multi-label case: R_D(h) = E_{x∼D}[(1/k) Σ_{l=1}^k 1_{[h(x)]_l ≠ [f(x)]_l}].

5 Notes
In most tasks, the number of classes is k ≤ 100.
For k large or infinite, the problem is often not treated as a multi-class classification problem, e.g., automatic speech recognition.
Computational efficiency issues arise for larger k.
In general, classes are not balanced.

6 Multiclass Classification - Margin
Hypothesis set H: functions h: X × Y → R.
Label returned: x ↦ argmax_{y∈Y} h(x, y).
Margin: ρ_h(x, y) = h(x, y) - max_{y'≠y} h(x, y').
Empirical margin loss: R̂_ρ(h) = (1/m) Σ_{i=1}^m Φ_ρ(ρ_h(x_i, y_i)).
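To make the margin definition concrete, here is a minimal NumPy sketch (an illustrative helper, not part of the lecture) computing ρ_h(x_i, y_i) for a batch of points from an m × k matrix of scores h(x_i, y):

```python
import numpy as np

def multiclass_margins(scores, y):
    """rho_h(x_i, y_i) = h(x_i, y_i) - max_{y' != y_i} h(x_i, y'),
    given an (m, k) score matrix and true labels y in {0, ..., k-1}."""
    m = scores.shape[0]
    true_scores = scores[np.arange(m), y]
    masked = scores.copy()
    masked[np.arange(m), y] = -np.inf      # exclude the true class
    runner_up = masked.max(axis=1)         # max over y' != y_i
    return true_scores - runner_up         # positive iff correctly classified

# Example: 3 points, 4 classes.
scores = np.array([[2.0, 0.5, -1.0, 0.0],
                   [0.1, 0.3,  0.2, 0.0],
                   [1.0, 2.0,  0.5, 0.5]])
y = np.array([0, 1, 0])
print(multiclass_margins(scores, y))       # [ 1.5  0.1 -1. ]
```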

7 Multiclass Margin Bound (Koltchinskii and Panchenko, 2002; MM et al., 2012)
Theorem: let H ⊆ R^{X×Y} with Y = {1, ..., k}. Fix ρ > 0. Then, for any δ > 0, with probability at least 1 - δ, the following multi-class classification bound holds for all h ∈ H:
R(h) ≤ R̂_ρ(h) + (2k²/ρ) R_m(Π_1(H)) + √(log(1/δ) / (2m)),
with Π_1(H) = {x ↦ h(x, y): y ∈ Y, h ∈ H}.

8 Kernel-Based Hypotheses
Hypothesis set H_{K,p}:
- Φ feature mapping associated to PDS kernel K.
- functions (x, y) ↦ w_y · Φ(x), y ∈ {1, ..., k}.
- label returned: x ↦ argmax_{y∈{1,...,k}} w_y · Φ(x).
For any p ≥ 1,
H_{K,p} = {(x, y) ∈ X × [1, k] ↦ w_y · Φ(x): W = (w_1, ..., w_k), ‖W‖_{H,p} ≤ Λ}.

9 Multiclass Margin Bound - Kernels (MM et al., 2012)
Theorem: let K: X × X → R be a PDS kernel and let Φ: X → H be a feature mapping associated to K. Assume that K(x, x) ≤ r² for all x ∈ X. Fix ρ > 0. Then, for any δ > 0, with probability at least 1 - δ, the following multiclass bound holds for all h ∈ H_{K,p}:
R(h) ≤ R̂_ρ(h) + 2k² √(r²Λ²/ρ² / m) + √(log(1/δ) / (2m)).

10 Approaches
Single classifier:
- Multi-class SVMs.
- AdaBoost.MH.
- Decision trees.
Combination of binary classifiers:
- One-vs-all.
- One-vs-one.
- Error-correcting codes.

11 Multi-Class SVMs (Weston and Watkins, 1999; Crammer and Singer, 2001)
Optimization problem:
min_{W,ξ} (1/2) Σ_{l=1}^k ‖w_l‖² + C Σ_{i=1}^m ξ_i
subject to: w_{y_i} · x_i + δ_{y_i,l} ≥ w_l · x_i + 1 - ξ_i, for all (i, l) ∈ [1, m] × Y.
Decision function:
h: x ↦ argmax_{l∈Y} (w_l · x).
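As a quick illustration, scikit-learn's LinearSVC exposes this joint formulation via multi_class='crammer_singer'; a minimal sketch (dataset and parameters chosen only for the example):

```python
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# Solves a single joint optimization over all weight vectors w_l
# (Crammer-Singer formulation), rather than k independent one-vs-rest problems.
clf = LinearSVC(multi_class="crammer_singer", C=1.0).fit(X, y)
print(clf.score(X, y))  # training accuracy of the multi-class SVM
```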

12 Notes
Directly based on generalization bounds.
Comparison with (Weston and Watkins, 1999): single slack variable per point, maximum of slack variables instead of their sum (penalty for worst class): max_{l=1}^k ξ_{il} instead of Σ_{l=1}^k ξ_{il}.
PDS kernel instead of inner product.
Optimization: complex constraints, mk-size problem; specific solution based on decomposition into m disjoint sets of constraints (Crammer and Singer, 2001).

13 Dual Formulation
Optimization problem (α_i is the ith row of the matrix α = [α_{ij}] ∈ R^{m×k}):
max_{α∈R^{m×k}} Σ_{i=1}^m α_i · e_{y_i} - (1/2) Σ_{i,j=1}^m (α_i · α_j)(x_i · x_j)
subject to: ∀i ∈ [1, m], (0 ≤ α_{i y_i} ≤ C) ∧ (∀j ≠ y_i, α_{ij} ≤ 0) ∧ (α_i · 1 = 0).
Decision function:
h(x) = argmax_{l∈[1,k]} Σ_{i=1}^m α_{il} (x_i · x).

14 AdaBoost (Schapire and Singer, 2000)
Training data (multi-label case): (x_1, y_1), ..., (x_m, y_m) ∈ X × {-1, +1}^k.
Reduction to binary classification:
- each example leads to k binary examples:
(x_i, y_i) ↦ ((x_i, 1), y_i[1]), ..., ((x_i, k), y_i[k]), i ∈ [1, m].
- apply AdaBoost to the resulting problem.
- choice of α_t.
Computational cost: mk distribution updates at each round.

15 AdaBoost.MH
H ⊆ {-1, +1}^{X×Y}.

AdaBoost.MH(S = ((x_1, y_1), ..., (x_m, y_m)))
  for i ← 1 to m do
      for l ← 1 to k do
          D_1(i, l) ← 1/(mk)
  for t ← 1 to T do
      h_t ← base classifier in H with small error ε_t = Pr_{D_t}[h_t(x_i, l) ≠ y_i[l]]
      α_t ← choose to minimize Z_t
      Z_t ← Σ_{i,l} D_t(i, l) exp(-α_t y_i[l] h_t(x_i, l))
      for i ← 1 to m do
          for l ← 1 to k do
              D_{t+1}(i, l) ← D_t(i, l) exp(-α_t y_i[l] h_t(x_i, l)) / Z_t
  f_T ← Σ_{t=1}^T α_t h_t
  return h_T = sgn(f_T)
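A compact Python sketch of the pseudocode above (an illustration, not the lecture's code, assuming depth-1 trees over the augmented points (x_i, l) as base classifiers, following the reduction of slide 14):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_mh(X, Y, T=50):
    """AdaBoost.MH sketch. X: (m, d) features; Y: (m, k) labels in {-1, +1}."""
    m, k = X.shape[0], Y.shape[1]
    # One binary example per (point, class) pair: features [x_i, one-hot(l)].
    Xa = np.hstack([np.repeat(X, k, axis=0), np.tile(np.eye(k), (m, 1))])
    ya = Y.reshape(-1)                      # y_i[l], same (i, l) ordering
    D = np.full(m * k, 1.0 / (m * k))       # D_1(i, l) = 1/(mk)
    hs, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(Xa, ya, sample_weight=D)
        pred = h.predict(Xa)
        eps = D[pred != ya].sum()           # epsilon_t, weighted error
        if eps <= 0.0 or eps >= 0.5:
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)   # minimizes Z_t here
        D = D * np.exp(-alpha * ya * pred)
        D /= D.sum()                        # normalization by Z_t
        hs.append(h)
        alphas.append(alpha)

    def predict(Xnew):
        n = Xnew.shape[0]
        Xna = np.hstack([np.repeat(Xnew, k, axis=0), np.tile(np.eye(k), (n, 1))])
        F = np.zeros(n * k)
        for a, h in zip(alphas, hs):
            F += a * h.predict(Xna)         # f_T = sum_t alpha_t h_t
        return np.sign(F).reshape(n, k)     # h_T = sgn(f_T)

    return predict
```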

16 Bound on Empirical Error
Theorem: the empirical error of the classifier output by AdaBoost.MH verifies:
R̂(h) ≤ Π_{t=1}^T Z_t.
Proof: similar to the proof for AdaBoost.
Choice of α_t:
- for H ⊆ {-1, +1}^{X×Y}: as for AdaBoost, α_t = (1/2) log((1 - ε_t)/ε_t).
- for H ⊆ [-1, +1]^{X×Y}: same choice, minimizes the upper bound.
- other cases: numerical/approximation method.

17 Notes
Objective function:
F(α) = Σ_{i=1}^m Σ_{l=1}^k e^{-y_i[l] f_n(x_i, l)} = Σ_{i=1}^m Σ_{l=1}^k e^{-y_i[l] Σ_{t=1}^n α_t h_t(x_i, l)}.
All comments and analysis given for AdaBoost apply here.
Alternative: AdaBoost.MR, which coincides with a special case of RankBoost (next lecture).

18 Decision Trees
[Figure: a decision tree with questions X1 < a1, X1 < a2, X2 < a3, X2 < a4 at the internal nodes and leaves R1, ..., R5, shown next to the corresponding axis-aligned partition of the (X1, X2) plane into the regions R1, ..., R5.]

19 Different Types of Questions
Decision trees:
- X ∈ {blue, white, red}: categorical questions.
- X ≤ a: continuous variables.
Binary space partition (BSP) trees:
- Σ_{i=1}^n β_i X_i ≤ a: partitioning with convex polyhedral regions.
Sphere trees:
- ‖X - a_0‖ ≤ a: partitioning with pieces of spheres.

20 Hypotheses
In each region R_t:
- classification: majority vote, ties broken arbitrarily:
ŷ_t = argmax_{y∈Y} |{x_i ∈ R_t: i ∈ [1, m], y_i = y}|.
- regression: average value:
ŷ_t = (1/|{i: x_i ∈ R_t}|) Σ_{x_i∈R_t} y_i.
Form of hypotheses: h: x ↦ Σ_t ŷ_t 1_{x∈R_t}.

21 Training
Problem: the general problem of determining the partition with minimum empirical error is NP-hard.
Heuristics: greedy algorithm. For all j ∈ [1, N], θ ∈ R:
R^+(j, θ) = {x_i ∈ R: x_i[j] ≥ θ, i ∈ [1, m]}
R^-(j, θ) = {x_i ∈ R: x_i[j] < θ, i ∈ [1, m]}.

Decision-Trees(S = ((x_1, y_1), ..., (x_m, y_m)))
  P ← {S}   (initial partition)
  for each region R ∈ P such that Pred(R) do
      (j, θ) ← argmin_{(j,θ)} [error(R^-(j, θ)) + error(R^+(j, θ))]
      P ← (P - {R}) ∪ {R^-(j, θ), R^+(j, θ)}
  return P
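A minimal sketch of one greedy step (illustrative only; error(R) here is the number of points in R misclassified by the region's majority vote, matching the classification hypotheses of slide 20):

```python
import numpy as np

def best_split(X, y):
    """Return (j, theta) minimizing error(R^-(j, theta)) + error(R^+(j, theta))."""
    def error(labels):
        if labels.size == 0:
            return 0
        _, counts = np.unique(labels, return_counts=True)
        return labels.size - counts.max()   # misclassified by majority vote
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            left = y[X[:, j] < theta]       # R^-(j, theta)
            right = y[X[:, j] >= theta]     # R^+(j, theta)
            e = error(left) + error(right)
            if e < best[2]:
                best = (j, theta, e)
    return best[:2]
```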

22 Splitting/Stopping Criteria
Problem: larger trees overfit the training sample.
Conservative splitting: split a node only if the loss is reduced by some fixed value η > 0.
- issue: a seemingly bad split dominating useful splits.
Grow-then-prune technique (CART):
- grow a very large tree, using Pred(R): |R| > n_0.
- prune the tree based on F(T) = Loss(T) + λ|T|, with the parameter λ ≥ 0 determined by cross-validation.

23 Decision Tree Tools
Most commonly used tools for learning decision trees:
- CART (classification and regression tree) (Breiman et al., 1984).
- C4.5 (Quinlan, 1986, 1993) and C5.0 (RuleQuest Research), a commercial system.
Differences: minor between latest versions.

24 Approaches
Single classifier:
- SVM-type algorithm.
- AdaBoost-type algorithm.
- Decision trees.
Combination of binary classifiers:
- One-vs-all.
- One-vs-one.
- Error-correcting codes.

25 One-vs-All
Technique:
- for each class l ∈ Y, learn a binary classifier h_l = sgn(f_l).
- combine binary classifiers via a voting mechanism, typically majority vote:
h: x ↦ argmax_{l∈Y} f_l(x).
Problem: poor justification.
- calibration: classifier scores not comparable.
- nevertheless: simple and frequently used in practice, computational advantages in some cases.
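A minimal one-vs-all sketch (illustrative; scikit-learn also ships this strategy as sklearn.multiclass.OneVsRestClassifier):

```python
import numpy as np
from sklearn.svm import LinearSVC

class OneVsAll:
    """One binary scorer f_l per class; prediction by argmax_l f_l(x).
    Note the scores are raw margins, not calibrated probabilities."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = [LinearSVC().fit(X, np.where(y == l, 1, -1))
                        for l in self.classes_]
        return self
    def predict(self, X):
        scores = np.column_stack([m.decision_function(X) for m in self.models_])
        return self.classes_[scores.argmax(axis=1)]
```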

26 One-vs-One
Technique:
- for each pair (l, l') ∈ Y × Y, l ≠ l', learn a binary classifier h_{ll'}: X → {0, 1}.
- combine binary classifiers via majority vote:
h(x) = argmax_{l'∈Y} |{l: h_{ll'}(x) = 1}|.
Problems:
- computational: train k(k-1)/2 binary classifiers.
- overfitting: size of training sample could become small for a given pair.
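And the corresponding one-vs-one sketch (illustrative; cf. sklearn.multiclass.OneVsOneClassifier), with one classifier per pair trained only on that pair's points:

```python
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

class OneVsOne:
    """k(k-1)/2 binary classifiers, one per class pair; majority vote."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.pairs_, self.models_ = [], []
        for l, lp in combinations(self.classes_, 2):
            mask = (y == l) | (y == lp)        # only this pair's points
            self.pairs_.append((l, lp))
            self.models_.append(LinearSVC().fit(X[mask], y[mask]))
        return self
    def predict(self, X):
        votes = np.zeros((X.shape[0], len(self.classes_)), dtype=int)
        index = {l: i for i, l in enumerate(self.classes_)}
        for m in self.models_:
            for i, winner in enumerate(m.predict(X)):
                votes[i, index[winner]] += 1   # each pair votes for its winner
        return self.classes_[votes.argmax(axis=1)]
```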

27 Computational Comparison

             Training                Testing (on average)
One-vs-all   O(k B_train(m))         O(k B_test)
One-vs-one   O(k² B_train(m/k))      O(k² B_test), but smaller N_SV per classifier

Time complexity for SVMs: with B_train(m) = O(m^α), α < 3, training is O(k m^α) for one-vs-all vs O(k^{2-α} m^α) for one-vs-one.

28 Heuristics
Training: reuse of computation between classifiers, e.g., sharing of kernel computations; caching.
Testing: directed acyclic graph (Platt et al., 2000).
- smaller number of tests.
- ordering?
[Figure: decision DAG for four classes; the root tests 1 vs 4, its "not 4" and "not 1" edges lead to 1 vs 3 and 2 vs 4, and from there, via "not 1", "not 3", "not 2", "not 4" edges, to the final tests 3 vs 4, 2 vs 3, and 1 vs 2.]
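The DAG testing idea can be sketched in a few lines (illustrative; h here is assumed to be a pairwise predictor built as in one-vs-one):

```python
def dag_predict(x, classes, h):
    """DAG testing sketch (Platt et al., 2000): h(l, lp, x) returns the winner
    of the binary classifier for pair (l, lp). Only k-1 tests per point,
    instead of evaluating all k(k-1)/2 classifiers as in plain voting."""
    remaining = list(classes)
    while len(remaining) > 1:
        l, lp = remaining[0], remaining[-1]
        winner = h(l, lp, x)
        # eliminate the loser; the winner stays in the running
        remaining.remove(lp if winner == l else l)
    return remaining[0]
```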

29 Error-Correcting Code Approach (Dietterich and Bakiri, 1995)
Technique:
- assign an F-long binary code word to each class: M = [M_{lj}] ∈ {0, 1}^{[1,k]×[1,F]}.
- learn a binary classifier f_j: X → {0, 1} for each column: example x in class l is labeled with M_{lj}.
- classifier output: f(x) = (f_1(x), ..., f_F(x)),
h: x ↦ argmin_{l∈Y} d_Hamming(M_l, f(x)).
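A minimal ECOC sketch with Hamming decoding (illustrative; assumes classes are indexed 0..k-1 so they index the rows of M, and that every column of M contains both labels so each binary problem is non-degenerate):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ecoc_fit_predict(X, y, Xtest, M):
    """ECOC with Hamming decoding. M is a (k, F) matrix in {0, 1}."""
    F = M.shape[1]
    # One binary classifier per column: a point in class l gets label M[l, j].
    fs = [LogisticRegression().fit(X, M[y, j]) for j in range(F)]
    fx = np.column_stack([f.predict(Xtest) for f in fs])   # f(x) in {0,1}^F
    # Assign each point the class whose code word is closest in Hamming distance.
    dists = (fx[:, None, :] != M[None, :, :]).sum(axis=2)  # (n_test, k)
    return dists.argmin(axis=1)
```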

30 Illustration
[Figure: 8 classes, code length 6: an 8×6 binary code matrix with one row (code word) per class and one column per binary classifier f_1(x), ..., f_6(x); a new example x is assigned the class whose code word is closest in Hamming distance to (f_1(x), ..., f_6(x)).]

31 Error-Correcting Codes - Design
Main ideas:
- independent columns: otherwise no effective discrimination.
- distance between rows: if the minimal Hamming distance between rows is d, then the multi-class code can correct ⌊(d - 1)/2⌋ errors.
- columns may correspond to features selected for the task.
- one-vs-all and one-vs-one (with ternary codes) are special cases.

32 Extensions (Allwein et al., 2000)
Matrix entries in {-1, 0, +1}:
- examples marked with 0 are disregarded during training.
- one-vs-one also becomes a special case.
Hamming loss:
h(x) = argmin_{l∈{1,...,k}} Σ_{j=1}^F (1 - sgn(M_{lj} f_j(x))) / 2.
Loss-based decoding, with margin loss L, a function of the margin yf(x) (e.g., hinge loss):
h(x) = argmin_{l∈{1,...,k}} Σ_{j=1}^F L(M_{lj} f_j(x)).
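Loss-based decoding in the same style (illustrative; hinge loss by default, and zero matrix entries simply contribute the constant L(0)):

```python
import numpy as np

def loss_based_decoding(scores, M, L=lambda z: np.maximum(0.0, 1.0 - z)):
    """scores: (n, F) real-valued outputs f_j(x); M: (k, F) in {-1, 0, +1}.
    Returns, for each point, the class l minimizing sum_j L(M[l, j] * f_j(x))."""
    total = L(scores[:, None, :] * M[None, :, :]).sum(axis=2)  # (n, k)
    return total.argmin(axis=1)
```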

33 Ideas
Continuous codes: real-valued matrix M.
- learn the matrix code M.
- similar optimization problems with other matrix norms.
- kernel K used for similarity between a matrix row and the prediction vector.

34 Continuous Codes (Crammer and Singer, 2000, 2002)
Optimization problem (M_l is the lth row of M):
min_{M,ξ} ‖M‖ + C Σ_{i=1}^m ξ_i
subject to: K(f(x_i), M_{y_i}) ≥ K(f(x_i), M_l) + 1 - ξ_i, for all (i, l) ∈ [1, m] × [1, k].
Decision function:
h: x ↦ argmax_{l∈{1,...,k}} K(f(x), M_l).

35 Applications
One-vs-all approach is the most widely used.
No clear empirical evidence of the superiority of other approaches (Rifkin and Klautau, 2004), except perhaps on small data sets with relatively large error rate.
Large structured multi-class problems: often treated as ranking problems (see next lecture).

36 References
Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113-141, 2000.
Koby Crammer and Yoram Singer. Improved output coding for classification using continuous relaxation. In Proceedings of NIPS, 2000.
Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265-292, 2001.
Koby Crammer and Yoram Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, 47, 2002.
Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research (JAIR), 2:263-286, 1995.
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
John C. Platt, Nello Cristianini, and John Shawe-Taylor. Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems 12 (NIPS 1999), pp. 547-553, 2000.

37 References
Ryan Rifkin. Everything Old Is New Again: A Fresh Look at Historical Approaches in Machine Learning. Ph.D. thesis, MIT, 2002.
Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101-141, 2004.
Robert E. Schapire. The boosting approach to machine learning: an overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.
Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.
Robert E. Schapire and Yoram Singer. BoosTexter: a boosting-based system for text categorization. Machine Learning, 39(2/3):135-168, 2000.
Jason Weston and Chris Watkins. Support vector machines for multi-class pattern recognition. In Proceedings of the Seventh European Symposium on Artificial Neural Networks (ESANN 99), 1999.
