Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers

Transcription:

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers
Erin Allwein, Robert Schapire and Yoram Singer
Journal of Machine Learning Research, 1:113-141, 2000

CSE 254: Seminar on Learning Algorithms
Professor: Charles Elkan
Student: Aldebaro Klautau
April 3, 2001

Outline
- Motivation
- Definitions: margin, loss, output coding
- Paper contributions: unified view of classifiers, 0 entries in the coding matrix, bounds and simulations
- Algorithm
- Bound on training error
- Experimental results, comparing: a) Hamming vs. loss-based decoding, b) different coding matrices, c) simple binary problems vs. a robust matrix
- Conclusions

Motivation
- Many applications require multiclass categorization.
- Some algorithms, such as C4.5 and CART, handle the multiclass case easily.
- Others, such as AdaBoost and SVM, do not.
- Alternative: reduce the multiclass problem to multiple binary classification problems.

Reducing multiclass to multiple binary problems
- The paper proposes a general framework that unifies several methods: binary margin-based algorithms coupled through a coding matrix.

Binary margin-based learning algorithm
- Input: m binary labeled training examples (x_1, y_1), ..., (x_m, y_m) such that x_i is in X and y_i is in {-1, +1}.
- Output: a real-valued function (hypothesis) f : X -> R.
- The binary margin of a training example (x, y) is defined as z = y f(x); z > 0 if and only if f classifies x correctly.

Classification error
- It is difficult to minimize the classification error directly.
- Instead, minimize some loss function of the margin, L(z) = L(y f(x)), for each example (x, y).
- Algorithms that use a margin-based loss: support vector machines (SVM), AdaBoost, regression, decision trees, etc.
- Example: neural networks and least-squares regression attempt to minimize the squared error; using y^2 = 1 for y in {-1, +1}, (y - f(x))^2 = y^2 (y - f(x))^2 = (y y - y f(x))^2 = (1 - y f(x))^2, so L(z) = (1 - z)^2.

Loss functions of popular algorithms
A loss is a function L : R -> [0, infinity) applied to the margin z = y f(x) (see the code sketch after the next slide):
- AdaBoost: exp(-z)
- Regression: (1 - z)^2
- Logistic regression: log(1 + exp(-2z))
- SVM: (1 - z)_+
[Plot of the four losses L(z) against the margin z = y f(x)]

Scenario: the training sequence has labels drawn from a set with cardinality k > 2, but the algorithm assumes k = 2. Two popular alternatives are one-against-all and all-pairs.

Number of binary classifiers for k = 10 classes:
  one-against-all   10
  all-pairs         45
  complete         511
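The following is a minimal Python sketch (function names are mine, not from the slides) of the four margin losses listed above; it simply evaluates each loss at a negative and a positive margin to show how misclassifications are penalized.

```python
import numpy as np

# Margin-based losses, each taking the margin z = y * f(x).
def adaboost_loss(z):      # exponential loss
    return np.exp(-z)

def regression_loss(z):    # squared loss written in terms of the margin
    return (1.0 - z) ** 2

def logistic_loss(z):      # logistic loss
    return np.log(1.0 + np.exp(-2.0 * z))

def svm_loss(z):           # hinge loss (1 - z)_+
    return np.maximum(0.0, 1.0 - z)

# All four penalize a negative margin (misclassification) much more than a
# positive one, e.g. at z = -1 versus z = +1:
for L in (adaboost_loss, regression_loss, logistic_loss, svm_loss):
    print(L.__name__, L(-1.0), L(1.0))
```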

Output coding for solving multiclass problems
[Coding matrix for the k = 10 digit classes (rows) and l = 6 binary classifiers (columns: vertical line, horizontal line, diagonal line, closed curve, curve open to left, curve open to right); each class is assigned a code word of -1/+1 entries]

Error-correcting output codes (ECOC), proposed by Dietterich and Bakiri, 1995
- Associate each class r with a row of a coding matrix M.
- Train each binary classifier: for column s, each example labeled y is mapped to M(y, s).
- Obtain l hypotheses f_1, ..., f_l.
- Given a test example x, choose the row y of M that is closest to the binary predictions (f_1(x), ..., f_l(x)) according to some distance, e.g. Hamming (sketched below).
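Below is a small illustrative sketch (my own, not the paper's code) of Hamming decoding for a coding matrix with -1/+1 entries: quantize the real-valued predictions to signs and pick the row with the fewest disagreements. The toy 3-class matrix is the one-against-all code.

```python
import numpy as np

def hamming_decode(M, predictions):
    """Return the class whose code word disagrees least often with sign(f_1(x)), ..., sign(f_l(x))."""
    signs = np.sign(predictions)                # quantize real-valued predictions to {-1, +1}
    disagreements = (M != signs).sum(axis=1)    # Hamming distance to each row of M
    return int(np.argmin(disagreements))

# Toy example: 3 classes, l = 3 binary classifiers (one-against-all code).
M = np.array([[+1, -1, -1],
              [-1, +1, -1],
              [-1, -1, +1]])
print(hamming_decode(M, np.array([-0.3, 2.1, -0.7])))   # -> 1
```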

Summary of paper contributions
- Unified view of margin classifiers
- Possibility of 0 entries in the ECOC matrix
- Decoding when the matrix has 0 entries
- Bound on training error (general)
- Bound on testing error (restricted to AdaBoost)
- Experimental results

Idea: allow 0 entries
- The coding matrix M is taken from the extended set {-1, 0, +1}^(k x l).
- The entry M(r, s) = 0 indicates that we do not care how f_s categorizes examples with label r.

One-against-all (k by k, shown for k = 4):
  +1 -1 -1 -1
  -1 +1 -1 -1
  -1 -1 +1 -1
  -1 -1 -1 +1

All-pairs (k by k(k-1)/2, shown for k = 4):
  +1 +1 +1  0  0  0
  -1  0  0 +1 +1  0
   0 -1  0 -1  0 +1
   0  0 -1  0 -1 -1

Algorithm
- Associate each class r with a row of a coding matrix M in {-1, 0, +1}^(k x l).
- Train a binary classifier for each column s = 1, ..., l: an example labeled y is mapped to M(y, s), and examples for which M(y, s) = 0 are omitted.
- Obtain l classifiers f_1, ..., f_l.
- Given a test example x, choose the row y of M that is closest to the binary predictions (f_1(x), ..., f_l(x)) according to some distance (e.g. a modified Hamming distance).

Two intuitive decoding options (a code sketch follows this slide)
a) Hamming decoding: quantize the predictions, then use the generalized Hamming distance
     Delta(u, v) = sum_{s=1..l} (1 - u_s v_s) / 2
   Example (signs reconstructed from the distance parcels; each 0 entry contributes 1/2):
     String A:          +1  -1   0   0  +1
     String B:          +1  +1  +1   0  +1
     Distance parcels:   0   1  0.5 0.5  0
b) Loss-based decoding: for each row, compute the margin z_s of each classifier s and sum the losses L(z_s), using the same L as the binary learner. This sounds better because the magnitude of each prediction indicates a level of confidence.
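The sketch below (my own naming; `fit_binary` stands in for any margin-based binary learner such as AdaBoost or an SVM) illustrates the two steps described above: training one binary problem per column of M, omitting the 0 entries, and the generalized Hamming distance used in option (a).

```python
import numpy as np

def train_columns(M, X, y, fit_binary):
    """Train one binary hypothesis per column of M in {-1, 0, +1}^(k x l)."""
    hypotheses = []
    for s in range(M.shape[1]):
        relabeled = M[y, s]          # map class label y_i to M(y_i, s); y holds integers 0..k-1
        keep = relabeled != 0        # omit "don't care" examples for this column
        hypotheses.append(fit_binary(X[keep], relabeled[keep]))
    return hypotheses

def generalized_hamming(u, v):
    # Each position contributes 0 if the entries agree (+1/+1 or -1/-1),
    # 1 if they disagree, and 1/2 whenever either entry is 0.
    return np.sum((1.0 - u * v) / 2.0)

# The example from the slide above: two strings with 0 entries.
A = np.array([+1, -1, 0, 0, +1])
B = np.array([+1, +1, +1, 0, +1])
print(generalized_hamming(A, B))   # parcels 0, 1, 0.5, 0.5, 0 -> 2.0
```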

Hamming vs. loss-based decoding
[Figure illustrating the quantization step used by Hamming decoding]

Losses for classes 3 and 4 in the figure
  prediction = [0.5, -7, -1, ..., -10, ..., 9]   (some digits were lost in the transcription)
  row_3 = [+1,  0, -1, -1, -1, +1, +1]
  row_4 = [-1, -1, +1,  0, -1, -1, +1]
  row_r_loss = exp(-prediction .* row_r)
  loss = sum(row_r_loss)
[Bar plot of row_r_loss per binary classifier (1-7) on a log scale from 10^-6 to 10^6; the largest term is annotated as having a "big influence"]
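A small Python version of the computation above, assuming the exponential loss; the row vectors are the ones shown on the slide, while the prediction vector is only partially legible in the transcription, so the values used here are illustrative.

```python
import numpy as np

# Loss-based decoding with the exponential loss: for each row r,
# sum L(z_s) = exp(-M(r, s) * f_s(x)) over the binary classifiers.
prediction = np.array([0.5, -7.0, -1.0, -2.0, -10.0, -1.0, 9.0])   # f_1(x), ..., f_7(x), illustrative
row_3 = np.array([+1,  0, -1, -1, -1, +1, +1])
row_4 = np.array([-1, -1, +1,  0, -1, -1, +1])

for name, row in (("class 3", row_3), ("class 4", row_4)):
    per_classifier_loss = np.exp(-prediction * row)
    print(name, per_classifier_loss.sum())

# The predicted class is the row with the smaller total loss; a single classifier
# that is confidently wrong (large |f_s(x)| with the wrong sign) dominates the sum.
```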

Bound on training error
- Average binary loss:
    eps_bar = (1 / (m l)) * sum_{i=1..m} sum_{s=1..l} L( M(y_i, s) f_s(x_i) )
- Minimum distance between any pair of rows:
    rho = min { Delta(M(r_1), M(r_2)) : r_1 != r_2 }
- Bound on the training error: E <= l * eps_bar / (rho * L(0))
- One-against-all (the k = 4 matrix shown earlier): rho = 2
- All-pairs (the k = 4 matrix shown earlier): rho = 1 + (l - 1)/2
  (These values are checked numerically in the sketch after the next slide.)

Bound on training error: simulation
- Simulation with AdaBoost, for which L(0) = 1.
- Complete code: l = 2^(k-1) - 1 and rho = 2^(k-2), so E <= (2^(k-1) - 1) * eps_bar / 2^(k-2), roughly 2 * eps_bar.
- One-against-all: l = k and rho = 2, so E <= k * eps_bar / 2.
[Two panels, for the complete code and for one-against-all, plotting the average binary loss, the actual classification error, and the theoretical bound against the number of classes (3 to 8)]
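To make the rho values concrete, here is a short Python check (my own, not the authors' code) that builds the k = 4 one-against-all and all-pairs matrices and computes rho as the minimum pairwise generalized distance.

```python
import numpy as np
from itertools import combinations

def min_row_distance(M):
    """rho = min over row pairs of Delta(u, v) = sum_s (1 - u_s v_s) / 2."""
    return min(np.sum((1.0 - M[r1] * M[r2]) / 2.0)
               for r1, r2 in combinations(range(M.shape[0]), 2))

def one_vs_all(k):
    return 2 * np.eye(k, dtype=int) - 1          # +1 on the diagonal, -1 elsewhere

def all_pairs(k):
    cols = []
    for i, j in combinations(range(k), 2):       # one column per pair of classes
        col = np.zeros(k, dtype=int)
        col[i], col[j] = +1, -1
        cols.append(col)
    return np.stack(cols, axis=1)

print(min_row_distance(one_vs_all(4)))   # 2.0
print(min_row_distance(all_pairs(4)))    # 3.5  ( = 1 + (l - 1)/2 with l = 6 )
```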

Bound on training error: requirement on the loss
- Requirement: L(-z) + L(z) >= 2 L(0) > 0.
- The loss does not need to be convex; for example, sin(z) + 1 satisfies the requirement.
[Plot of the loss sin(z) + 1 against the margin z = y f(x)]

Experimental results
- Tradeoff between simple binary problems and a robust coding matrix
- Hamming versus loss-based decoding (AdaBoost and SVM)
- Comparison among different output codes (AdaBoost and SVM)

Experiment 1: synthetic data
- Set thresholds so that there are exactly 100 examples per class; label test examples using these thresholds.
- Use AdaBoost with 10 rounds. Weak hypotheses: a set of thresholds, where threshold t labels x as +1 if x < t and -1 otherwise.
[Plots of the results for 3 classes and for 8 classes]
(A rough sketch of this setup appears after the next slide.)

Comparisons
a) Hamming vs. loss-based decoding
b) Simple binary problems vs. a robust matrix: complete code vs. one-against-all
AdaBoost is used in both simulations.
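A rough sketch of the synthetic setup, with details that the slide does not specify (one-dimensional uniform data, equal-sized classes by construction) filled in as assumptions.

```python
import numpy as np

def make_synthetic(k, n_per_class=100, seed=0):
    """Draw points on a line and choose class thresholds so each class has exactly n_per_class examples."""
    rng = np.random.default_rng(seed)
    x = np.sort(rng.uniform(0.0, 1.0, size=k * n_per_class))
    y = np.repeat(np.arange(k), n_per_class)   # sorted points + equal blocks = implicit thresholds
    return x, y

def stump(t):
    # Weak hypothesis from the slide: threshold t labels x as +1 if x < t and -1 otherwise.
    return lambda x: np.where(x < t, 1.0, -1.0)

x, y = make_synthetic(k=3)
h = stump(0.5)
print(h(x[:5]))   # the smallest points are labeled +1
```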

Coding matrices compared (k = 4)
Complete code (k by 2^(k-1) - 1):
  [4 x 7 matrix whose columns enumerate the non-trivial binary splits of the four classes]
One-against-all (k by k):
  +1 -1 -1 -1
  -1 +1 -1 -1
  -1 -1 +1 -1
  -1 -1 -1 +1

Comparison: different output codes
- Experiments with UCI databases.
- SVM: polynomial kernels of degree 4. AdaBoost: decision stumps as base hypotheses.
- 5 output codes: one-against-all, complete, all-pairs, dense, and sparse.
- Design of the dense and sparse codes: try 10,000 random codes and pick a code with high rho (see the sketch below).
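A sketch of the random-code search described above (my own, not the authors' code); the entry distribution for sparse codes is an assumption for illustration, and checks such as rejecting duplicate or all-zero rows are omitted.

```python
import numpy as np
from itertools import combinations

def min_row_distance(M):
    """rho = min over row pairs of Delta(u, v) = sum_s (1 - u_s v_s) / 2."""
    return min(np.sum((1.0 - M[r1] * M[r2]) / 2.0)
               for r1, r2 in combinations(range(M.shape[0]), 2))

def random_code(k, l, sparse, rng):
    if sparse:
        # Sparse codes draw entries from {-1, 0, +1}; the mix below is an assumed choice.
        return rng.choice([-1, 0, +1], size=(k, l), p=[0.25, 0.5, 0.25])
    return rng.choice([-1, +1], size=(k, l))      # dense codes use only -1/+1

def search_code(k, l, sparse=False, trials=10_000, seed=0):
    """Try many random codes and keep the one with the largest minimum row distance rho."""
    rng = np.random.default_rng(seed)
    return max((random_code(k, l, sparse, rng) for _ in range(trials)),
               key=min_row_distance)

# Example: a dense code for 6 classes with 20 columns.
# best = search_code(k=6, l=20)
```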

Hamming vs. loss-based decoding on the benchmark datasets
[Bar plots showing, for each dataset, the error obtained with Hamming decoding minus the error with loss-based decoding]
- Negative height indicates that loss-based decoding outperformed Hamming decoding.

Comparison among output codes: SVM
[Matrix of bar plots; entry (r, c) shows the error of the row-r classifier minus the error of the column-c classifier. Positive height indicates that classifier c outperformed classifier r.]

Comparison among output codes: AdaBoost
[Same layout; entry (r, c) shows the error of the row-r classifier minus the error of the column-c classifier. Positive height indicates that classifier c outperformed classifier r.]

Conclusions
- The bounds give insight into the tradeoffs but can be of limited use in practice.
- Experiments show that in most cases: loss-based decoding is better than Hamming decoding, and one-against-all is outperformed (SVM).
- Choosing or designing the coding matrix is an open problem, and the best approach is possibly task-dependent (Crammer & Singer, 2000).