Introduction to Machine Learning Lecture 13. Mehryar Mohri Courant Institute and Google Research

Introduction to Machine Learning Lecture 13 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Multi-Class Classification Mehryar Mohri - Introduction to Machine Learning page 2

Motivation. Real-world problems often have multiple classes: text, speech, image, biological sequences. The algorithms studied so far are designed for binary classification problems. How do we design multi-class classification algorithms? Can the algorithms used for binary classification be generalized to multi-class classification? Can we reduce multi-class classification to binary classification? Mehryar Mohri - Introduction to Machine Learning page 3

Multi-Class Classification Problem. Training data: sample drawn i.i.d. from $X$ according to some distribution $D$, $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m$. Mono-label case: $\mathrm{Card}(Y) = k$. Multi-label case: $Y = \{-1, +1\}^k$. Problem: find a classifier $h \colon X \to Y$ in $H$ with small generalization error. Mono-label case: $R_D(h) = \mathrm{E}_{x \sim D}\big[1_{h(x) \neq f(x)}\big]$. Multi-label case: $R_D(h) = \mathrm{E}_{x \sim D}\big[\frac{1}{k}\sum_{l=1}^{k} 1_{[h(x)]_l \neq [f(x)]_l}\big]$. Mehryar Mohri - Introduction to Machine Learning page 4
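To make the two error measures concrete, here is a small numpy sketch (the function names and label encodings are illustrative, not part of the lecture): the mono-label error is the fraction of misclassified points, the multi-label error the average per-class disagreement for labels in $\{-1,+1\}^k$.

```python
import numpy as np

def mono_label_error(y_pred, y_true):
    """Mono-label case: fraction of points with h(x) != f(x)."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return np.mean(y_pred != y_true)

def multi_label_error(Y_pred, Y_true):
    """Multi-label case: average per-class disagreement, labels in {-1, +1}^k."""
    Y_pred, Y_true = np.asarray(Y_pred), np.asarray(Y_true)
    return np.mean(Y_pred != Y_true)  # averages over both the m examples and the k classes

# Example with k = 3 classes.
print(mono_label_error([0, 2, 1, 1], [0, 1, 1, 2]))            # 0.5
print(multi_label_error([[+1, -1, -1], [-1, +1, -1]],
                        [[+1, -1, +1], [-1, -1, -1]]))          # 2 mismatches / 6 ~ 0.333
```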

Notes. In most tasks, the number of classes is $k \leq 100$. For $k$ large or infinite, the problem is often not treated as a multi-class classification problem, e.g., automatic speech recognition. Computational efficiency issues arise for larger $k$s. In general, classes are not balanced. Mehryar Mohri - Introduction to Machine Learning page 5

Approaches Single classifier: Decision trees. AdaBoost-type algorithm. SVM-type algorithm. Combination of binary classifiers: One-vs-all. One-vs-one. Error-correcting codes. Mehryar Mohri - Introduction to Machine Learning page 6

AdaBoost-Type Algorithm (Schapire and Singer, 2000). Training data (multi-label case): $(x_1, y_1), \ldots, (x_m, y_m) \in X \times \{-1, +1\}^k$. Reduction to binary classification: each example leads to $k$ binary examples, $(x_i, y_i) \mapsto ((x_i, 1), y_i[1]), \ldots, ((x_i, k), y_i[k])$, $i \in [1, m]$. Apply AdaBoost to the resulting problem; choice of $\alpha_t$. Computational cost: $mk$ distribution updates at each round. Mehryar Mohri - Introduction to Machine Learning page 7

AdaBoost.MH: $H \subseteq \{-1, +1\}^{X \times Y}$.
AdaBoost.MH($S = ((x_1, y_1), \ldots, (x_m, y_m))$)
    for $i \leftarrow 1$ to $m$ do
        for $l \leftarrow 1$ to $k$ do
            $D_1(i, l) \leftarrow \frac{1}{mk}$
    for $t \leftarrow 1$ to $T$ do
        $h_t \leftarrow$ base classifier in $H$ with small error $\epsilon_t = \Pr_{D_t}[h_t(x_i, l) \neq y_i[l]]$
        $\alpha_t \leftarrow$ choice to minimize $Z_t$
        $Z_t \leftarrow \sum_{i,l} D_t(i, l)\exp(-\alpha_t\, y_i[l]\, h_t(x_i, l))$
        for $i \leftarrow 1$ to $m$ do
            for $l \leftarrow 1$ to $k$ do
                $D_{t+1}(i, l) \leftarrow \frac{D_t(i, l)\exp(-\alpha_t\, y_i[l]\, h_t(x_i, l))}{Z_t}$
    $f_T \leftarrow \sum_{t=1}^{T} \alpha_t h_t$
    return $h_T = \mathrm{sgn}(f_T)$
Mehryar Mohri - Introduction to Machine Learning page 8
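As an illustration of the reduction, here is a hedged Python sketch of AdaBoost.MH: the (example, class) pairs are folded into a single binary problem and plain AdaBoost is run on it. Using depth-1 trees over the augmented features [x, one-hot(l)] as base classifiers is my assumption; the pseudocode only requires a base classifier with small weighted error.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # depth-1 trees as base learners (an assumption)

def adaboost_mh_fit(X, Y, T=50):
    """AdaBoost.MH via the binary reduction over (example, class) pairs.
    X: (m, d) features; Y: (m, k) labels in {-1, +1}."""
    m, d = X.shape
    k = Y.shape[1]
    # Augmented binary dataset: one row per pair (x_i, l), features = [x_i, one-hot(l)].
    Z = np.hstack([np.repeat(X, k, axis=0), np.tile(np.eye(k), (m, 1))])
    z = Y.reshape(-1)                       # binary labels y_i[l] in {-1, +1}
    D = np.full(m * k, 1.0 / (m * k))       # D_1(i, l) = 1/(mk)
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(Z, z, sample_weight=D)
        pred = stump.predict(Z)
        eps = np.clip(D[pred != z].sum(), 1e-12, 1 - 1e-12)   # weighted error epsilon_t
        alpha = 0.5 * np.log((1 - eps) / eps)
        D *= np.exp(-alpha * z * pred)      # distribution update
        D /= D.sum()                        # normalization by Z_t
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas, k

def adaboost_mh_predict(X, stumps, alphas, k):
    """Mono-label prediction: argmax_l f_T(x, l)."""
    m = X.shape[0]
    Z = np.hstack([np.repeat(X, k, axis=0), np.tile(np.eye(k), (m, 1))])
    F = sum(a * s.predict(Z) for s, a in zip(stumps, alphas)).reshape(m, k)
    return F.argmax(axis=1)
```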

Bound on Empirical Error. Theorem: the empirical error of the classifier output by AdaBoost.MH verifies $\widehat{R}(h) \leq \prod_{t=1}^{T} Z_t$. Proof: similar to the proof for AdaBoost. Choice of $\alpha_t$: for $H \subseteq \{-1, +1\}^{X \times Y}$, as for AdaBoost, $\alpha_t = \frac{1}{2}\log\frac{1 - \epsilon_t}{\epsilon_t}$; for $H \subseteq [-1, +1]^{X \times Y}$, same choice: it minimizes the upper bound; other cases: numerical/approximation method. Mehryar Mohri - Introduction to Machine Learning page 9

Multi-Class SVMs (Weston and Watkins, 1999). Optimization problem:
$\min_{w, \xi}\ \frac{1}{2}\sum_{l=1}^{k}\|w_l\|^2 + C\sum_{i=1}^{m}\sum_{l \neq y_i}\xi_{il}$
subject to: $w_{y_i}\cdot x_i + b_{y_i} \geq w_l\cdot x_i + b_l + 2 - \xi_{il}$, $\xi_{il} \geq 0$, $\forall (i, l) \in [1, m] \times (Y - \{y_i\})$.
Decision function: $h \colon x \mapsto \operatorname{argmax}_{l \in Y}(w_l\cdot x + b_l)$.
Mehryar Mohri - Introduction to Machine Learning page 10
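For small problems, this primal can be handed directly to a generic convex solver. The following is a minimal sketch using cvxpy (a modeling library assumed here; function and variable names are illustrative), restricted to the linear kernel.

```python
import numpy as np
import cvxpy as cp

def weston_watkins_svm(X, y, k, C=1.0):
    """Weston-Watkins multi-class SVM primal (linear case) for a small dataset.
    X: (m, d) features; y: (m,) integer labels in {0, ..., k-1}."""
    m, d = X.shape
    W = cp.Variable((k, d))
    b = cp.Variable(k)
    Xi = cp.Variable((m, k), nonneg=True)      # slack variables xi_{il}
    constraints = []
    for i in range(m):
        for l in range(k):
            if l != y[i]:
                constraints.append(X[i] @ W[y[i]] + b[y[i]]
                                   >= X[i] @ W[l] + b[l] + 2 - Xi[i, l])
    # Slacks with l = y_i appear in no constraint, so they stay at 0 at the optimum.
    objective = 0.5 * cp.sum_squares(W) + C * cp.sum(Xi)
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return W.value, b.value

def predict(W, b, X):
    """Decision function: h(x) = argmax_l (w_l . x + b_l)."""
    return np.argmax(X @ W.T + b, axis=1)
```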

Notes. Idea: the slack variable $\xi_{il}$ penalizes the case where $(w_{y_i}\cdot x_i + b_{y_i}) - (w_l\cdot x_i + b_l) < 2$. [Figure: two pairs of separating hyperplanes, $w_1\cdot x + b_1 = \pm 1$ and $w_2\cdot x + b_2 = \pm 1$.] The binary SVM is obtained as a special case: $w_1 = -w_2$, $b_1 = -b_2$, $\xi_{i1} = \xi_{i2} = 2\xi_i$. Mehryar Mohri - Introduction to Machine Learning page 11

Dual Formulation. Notation: $\alpha_i = \sum_{l=1}^{k}\alpha_{il}$, $c_{il} = 1_{y_i = l}$.
Optimization problem:
$\max_{\alpha}\ 2\sum_{i=1}^{m}\alpha_i + \sum_{i,j,l}\Big[-\frac{1}{2}c_{j y_i}\,\alpha_i\alpha_j + \alpha_{il}\,\alpha_{j y_i} - \frac{1}{2}\alpha_{il}\alpha_{jl}\Big](x_i\cdot x_j)$
subject to: $\forall l \in [1, k],\ \sum_{i=1}^{m}\alpha_{il} = \sum_{i=1}^{m}c_{il}\,\alpha_i$; $\forall (i, l) \in [1, m] \times (Y - \{y_i\}),\ 0 \leq \alpha_{il} \leq C$, $\alpha_{i y_i} = 0$.
Decision function: $h \colon x \mapsto \operatorname{argmax}_{l = 1, \ldots, k}\ \sum_{i=1}^{m}(c_{il}\,\alpha_i - \alpha_{il})(x_i\cdot x) + b_l$.
Mehryar Mohri - Introduction to Machine Learning page 12

Notes. PDS kernel instead of inner product. Optimization: complex constraints, $mk$-size problem. Generalization error: the leave-one-out analysis and bounds of the binary SVM apply similarly. A one-vs-all solution is a (non-optimal) feasible solution of the multi-class SVM problem. Simplification: a single slack variable per point, $\xi_{il} \to \xi_i$ (Crammer and Singer, 2001). Mehryar Mohri - Introduction to Machine Learning page 13

Simplified Multi-Class SVMs (Crammer and Singer, 2001). Optimization problem:
$\min_{w, \xi}\ \frac{1}{2}\sum_{l=1}^{k}\|w_l\|^2 + C\sum_{i=1}^{m}\xi_i$
subject to: $w_{y_i}\cdot x_i + \delta_{y_i, l} \geq w_l\cdot x_i + 1 - \xi_i$, $\forall (i, l) \in [1, m] \times Y$.
Decision function: $h \colon x \mapsto \operatorname{argmax}_{l \in Y}(w_l\cdot x)$.
Mehryar Mohri - Introduction to Machine Learning page 14

Notes. Single slack variable per point: the maximum of the previous slack variables (penalty for the worst class), $\sum_{l=1}^{k}\xi_{il} \to \max_{l=1}^{k}\xi_{il}$. PDS kernel instead of inner product. Optimization: complex constraints, $mk$-size problem; a specific solution is based on a decomposition into $m$ disjoint sets of constraints (Crammer and Singer, 2001). Mehryar Mohri - Introduction to Machine Learning page 15

Dual Formulation. Optimization problem ($\alpha_i$ denotes the $i$th row of the matrix $\alpha$):
$\max_{\alpha = [\alpha_{ij}]}\ \sum_{i=1}^{m}\alpha_i\cdot e_{y_i} - \frac{1}{2}\sum_{i,j=1}^{m}(\alpha_i\cdot\alpha_j)(x_i\cdot x_j)$
subject to: $\forall i \in [1, m],\ \alpha_i \leq C\,e_{y_i}$ and $\alpha_i\cdot\mathbf{1} = 0$ (so that $0 \leq \alpha_{i y_i} \leq C$).
Decision function: $h(x) = \operatorname{argmax}_{l = 1, \ldots, k}\ \sum_{i=1}^{m}\alpha_{il}\,(x_i\cdot x)$.
Mehryar Mohri - Introduction to Machine Learning page 16
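In practice this dual is rarely solved by hand. In the scikit-learn versions I have used, LinearSVC exposes a multi_class='crammer_singer' option implementing the single-slack formulation (check your version's documentation, as options and defaults may change). A brief usage sketch, with the iris data as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
# fit_intercept=False matches the slide's decision function h(x) = argmax_l (w_l . x).
clf = LinearSVC(multi_class="crammer_singer", C=1.0,
                fit_intercept=False, max_iter=10000).fit(X, y)
print(clf.decision_function(X[:2]))   # one score w_l . x per class
print(clf.predict(X[:2]))             # argmax over the class scores
```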

Approaches Single classifier: Decision trees. AdaBoost-type algorithm. SVM-type algorithm. Combination of binary classifiers: One-vs-all. One-vs-one. Error-correcting codes. Mehryar Mohri - Introduction to Machine Learning page 17

One-vs-All. Technique: for each class $l \in Y$, learn a binary classifier $h_l = \mathrm{sgn}(f_l)$; combine the binary classifiers via a voting mechanism, typically $h \colon x \mapsto \operatorname{argmax}_{l \in Y} f_l(x)$. Problem: poor justification; calibration: the classifier scores are not comparable. Nevertheless: simple and frequently used in practice, with computational advantages in some cases. Mehryar Mohri - Introduction to Machine Learning page 18
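A minimal one-vs-all sketch, assuming linear SVMs as the binary learners and combining them by an argmax over the real-valued scores $f_l(x)$ (which, as noted above, are not calibrated against each other):

```python
import numpy as np
from sklearn.svm import LinearSVC

def one_vs_all_fit(X, y, k, C=1.0):
    """Train one binary classifier f_l per class l (class l vs. the rest)."""
    return [LinearSVC(C=C).fit(X, np.where(y == l, 1, -1)) for l in range(k)]

def one_vs_all_predict(classifiers, X):
    """Combine by argmax of the real-valued scores f_l(x), not their signs."""
    scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
    return scores.argmax(axis=1)
```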

One-vs-One. Technique: for each pair $(l, l') \in Y^2$, $l \neq l'$, learn a binary classifier $h_{ll'} \colon X \to \{0, 1\}$; combine the binary classifiers via majority vote: $h(x) = \operatorname{argmax}_{l' \in Y}\,\big|\{l \colon h_{ll'}(x) = 1\}\big|$. Problem: computational: train $k(k-1)/2$ binary classifiers; overfitting: the size of the training sample could become small for a given pair. Mehryar Mohri - Introduction to Machine Learning page 19
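One-vs-one is available off the shelf, for instance through scikit-learn's OneVsOneClassifier, which trains the $k(k-1)/2$ pairwise classifiers and combines them by (a tie-broken variant of) majority vote. A short sketch on a 10-class dataset:

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)                 # k = 10 classes
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)     # trains k(k-1)/2 = 45 pairwise SVMs
print(len(ovo.estimators_))                         # 45
print(ovo.predict(X[:5]))                           # majority-vote style combination
```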

Computational Comparison.
One-vs-all: training $O(k\,B_{\mathrm{train}}(m))$, i.e. $O(k\,m^{\alpha})$ for SVMs; testing $O(k\,B_{\mathrm{test}})$.
One-vs-one: training $O(k^2\,B_{\mathrm{train}}(m/k))$, i.e. $O(k^{2-\alpha}\,m^{\alpha})$ for SVMs; testing $O(k^2\,B_{\mathrm{test}})$, but with a smaller number of support vectors per binary classifier on average.
Here $B_{\mathrm{train}}$ and $B_{\mathrm{test}}$ denote the training and testing costs of one binary classifier; for SVMs, the time-complexity exponent $\alpha$ is less than 3.
Mehryar Mohri - Introduction to Machine Learning page 20

Heuristics (Platt et al., 2000). Training: reuse of computation between classifiers, e.g., sharing of kernel computations; caching. Testing: directed acyclic graph: smaller number of tests; ordering? [Figure: decision DAG for 4 classes; the root node '1 vs 4' branches via 'not 4' and 'not 1' to '1 vs 3' and '2 vs 4', which in turn branch to '1 vs 2', '2 vs 3', and '3 vs 4', with leaves 1, 2, 3, 4.] Mehryar Mohri - Introduction to Machine Learning page 21
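The DAG idea can be sketched in a few lines: keep a list of candidate classes and let each pairwise test eliminate one of the two extreme candidates, so only $k-1$ classifiers are evaluated per test point. The pairwise-classifier interface below is hypothetical; it stands for whatever binary classifiers were trained in the one-vs-one phase.

```python
def dag_predict(x, pairwise, classes):
    """DAGSVM-style evaluation: k-1 pairwise tests instead of k(k-1)/2.
    pairwise[(a, b)](x) returns the winner (a or b) of the binary classifier 'a vs b'."""
    remaining = list(classes)
    while len(remaining) > 1:
        a, b = remaining[0], remaining[-1]          # test the two extreme candidates, e.g. '1 vs 4'
        winner = pairwise[(a, b)](x)
        remaining.remove(b if winner == a else a)   # eliminate the losing class ('not 4' / 'not 1')
    return remaining[0]
```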

Error-Correcting Code Approach (Dietterich and Bakiri, 1995). Technique: assign an $F$-long binary code word to each class, $M = [M_{lj}] \in \{0, 1\}^{[1,k]\times[1,F]}$; learn a binary classifier $f_j \colon X \to \{0, 1\}$ for each column; an example $x$ in class $l$ is labeled with $M_{lj}$. Classifier output: $f(x) = (f_1(x), \ldots, f_F(x))$, $h \colon x \mapsto \operatorname{argmin}_{l \in Y} d_{\mathrm{Hamming}}\big(M_l, f(x)\big)$. Mehryar Mohri - Introduction to Machine Learning page 22

Illustration: 8 classes, code length 6.
class   code
  1     0 0 0 1 0 0
  2     1 0 0 0 0 0
  3     0 1 1 0 1 0
  4     1 1 0 0 0 0
  5     1 1 0 0 1 0
  6     0 0 1 1 0 1
  7     0 0 1 0 0 0
  8     0 1 0 1 0 0
New example $x$: $(f_1(x), \ldots, f_6(x)) = 0\ 1\ 1\ 0\ 1\ 1$, which is closest in Hamming distance (distance 1) to the code word of class 3.
Mehryar Mohri - Introduction to Machine Learning page 23
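A compact sketch of the error-correcting code approach, using the 8 x 6 code matrix of the illustration and Hamming-distance decoding; the choice of linear SVMs as the per-column binary learners is an assumption (the training data must contain all 8 classes so that every column has both labels).

```python
import numpy as np
from sklearn.svm import LinearSVC

# 8 classes, code length F = 6: the code matrix from the illustration (class l is row l-1).
M = np.array([[0, 0, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 1, 0],
              [0, 0, 1, 1, 0, 1],
              [0, 0, 1, 0, 0, 0],
              [0, 1, 0, 1, 0, 0]])

def ecoc_fit(X, y, M):
    """Train one binary classifier f_j per column; an example in class l gets label M[l, j]."""
    return [LinearSVC().fit(X, M[y, j]) for j in range(M.shape[1])]

def ecoc_predict(classifiers, X, M):
    """Predict the class whose code word is closest in Hamming distance to f(x)."""
    F = np.column_stack([clf.predict(X) for clf in classifiers])   # (n, F) predictions in {0, 1}
    dist = (F[:, None, :] != M[None, :, :]).sum(axis=2)            # Hamming distance to each row of M
    return dist.argmin(axis=1)                                     # e.g. 0 1 1 0 1 1 -> class index 2 (class 3)
```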

Error-Correcting Codes - Design. Main ideas: independent columns: otherwise no effective discrimination. Distance between rows: if the minimal Hamming distance between rows is $d$, then the multi-class classifier can correct up to $\lfloor\frac{d-1}{2}\rfloor$ errors. Columns may correspond to features selected for the task. One-vs-all and one-vs-one (with ternary codes) are special cases. Mehryar Mohri - Introduction to Machine Learning page 24

Extensions (Allwein et al., 2000). Matrix entries in $\{-1, 0, +1\}$: examples marked with 0 are disregarded during training; one-vs-one also becomes a special case. Margin loss $L$: a function of the margin $y f(x)$, e.g., the hinge loss. Hamming-loss decoding: $h(x) = \operatorname{argmin}_{l \in \{1,\ldots,k\}} \sum_{j=1}^{F} \frac{1 - \operatorname{sgn}(M_{lj} f_j(x))}{2}$. Margin-loss decoding: $h(x) = \operatorname{argmin}_{l \in \{1,\ldots,k\}} \sum_{j=1}^{F} L\big(M_{lj} f_j(x)\big)$. Mehryar Mohri - Introduction to Machine Learning page 25

Continuous Codes (Crammer and Singer, 2000, 2002). Optimization problem ($M_l$ denotes the $l$th row of $M$):
$\min_{M, \xi}\ \|M\|_2^2 + C\sum_{i=1}^{m}\xi_i$
subject to: $K\big(f(x_i), M_{y_i}\big) \geq K\big(f(x_i), M_l\big) + 1 - \xi_i$, $\xi_i \geq 0$, $\forall (i, l) \in [1, m] \times [1, k]$.
Decision function: $h \colon x \mapsto \operatorname{argmax}_{l \in \{1,\ldots,k\}} K\big(f(x), M_l\big)$.
Mehryar Mohri - Introduction to Machine Learning page 26

Ideas Continuous codes: real-valued matrix. Learn matrix code M. Similar optimization problems with other matrix norms. Kernel K used for similarity between matrix row and prediction vector. Mehryar Mohri - Introduction to Machine Learning page 27

Multiclass Margin. Definition: let $H \subseteq \mathbb{R}^{X \times Y}$. The margin of the training point $(x, y) \in X \times Y$ for the hypothesis $h \in H$ is $\gamma_h(x, y) = h(x, y) - \max_{y' \neq y} h(x, y')$. Thus, $h$ misclassifies $(x, y)$ iff $\gamma_h(x, y) \leq 0$. Mehryar Mohri - Introduction to Machine Learning page 28
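A small numpy helper for computing these margins from a matrix of scores $h(x_i, l)$; the example values are made up, simply to show that a non-positive margin coincides with a misclassification.

```python
import numpy as np

def multiclass_margins(scores, y):
    """scores: (m, k) matrix with scores[i, l] = h(x_i, l); y: (m,) true labels.
    Returns gamma_h(x_i, y_i) = h(x_i, y_i) - max_{l != y_i} h(x_i, l)."""
    m = len(y)
    correct = scores[np.arange(m), y]
    masked = scores.copy()
    masked[np.arange(m), y] = -np.inf          # exclude the correct class from the max
    return correct - masked.max(axis=1)

scores = np.array([[2.0, 0.5, 1.0],
                   [0.2, 0.9, 1.5]])
print(multiclass_margins(scores, np.array([0, 1])))   # [ 1.0 -0.6]; the second point is misclassified
```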

Margin Bound (Koltchinskii and Panchenko, 2002). Theorem: let $H_1 = \{x \mapsto h(x, y) \colon h \in H, y \in Y\}$, where $H \subseteq \mathbb{R}^{X \times Y}$. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds:
$\Pr[\gamma_h(x, y) \leq 0] \leq \widehat{\Pr}[\gamma_h(x, y) \leq \rho] + c\,\frac{k^2\,\mathfrak{R}_m(H_1)}{\rho} + c'\sqrt{\frac{\log\frac{1}{\delta}}{m}}$,
for some constants $c, c' > 0$. Mehryar Mohri - Introduction to Machine Learning page 29

Applications. The one-vs-all approach is the most widely used. There is no clear empirical evidence of the superiority of other approaches (Rifkin and Klautau, 2004), except perhaps on small data sets with relatively large error rates. Large structured multi-class problems: often treated as ranking problems (see next lecture). Mehryar Mohri - Introduction to Machine Learning page 30

References
Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113-141, 2000.
Koby Crammer and Yoram Singer. Improved output coding for classification using continuous relaxation. In Proceedings of NIPS, 2000.
Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265-292, 2001.
Koby Crammer and Yoram Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, 47, 2002.
Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research (JAIR), 2:263-286, 1995.
John C. Platt, Nello Cristianini, and John Shawe-Taylor. Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems 12 (NIPS 1999), pp. 547-553, 2000.
Mehryar Mohri - Introduction to Machine Learning page 31

References
Ryan Rifkin. Everything Old Is New Again: A Fresh Look at Historical Approaches in Machine Learning. Ph.D. thesis, MIT, 2002.
Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101-141, 2004.
Robert E. Schapire. The boosting approach to machine learning: an overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.
Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.
Robert E. Schapire and Yoram Singer. BoosTexter: a boosting-based system for text categorization. Machine Learning, 39(2/3):135-168, 2000.
Jason Weston and Chris Watkins. Support vector machines for multi-class pattern recognition. In Proceedings of the Seventh European Symposium on Artificial Neural Networks (ESANN 1999), 1999.
Mehryar Mohri - Introduction to Machine Learning page 32