Multiclass Boosting with Repartitioning

Multiclass Boosting with Repartitioning
Ling Li, Learning Systems Group, Caltech
ICML 2006

Binary and Multiclass Problems
Binary classification problems: Y = {−1, +1}
Multiclass classification problems: Y = {1, 2, ..., K}
A multiclass problem can be reduced to a collection of binary problems; examples: one-vs-one, one-vs-all.
Usually we obtain an ensemble of binary classifiers.
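As a quick illustration of the one-vs-all reduction (my sketch, not material from the talk), the helper below relabels a K-class data set into K binary problems; `labels` and `K` are hypothetical names:

```python
# Sketch of the one-vs-all reduction: one binary problem per class,
# where examples of that class are labeled +1 and all others -1.

def one_vs_all_labels(labels, K):
    """labels: class labels in {1, ..., K}; returns one binary label list per class."""
    return [[+1 if y == k else -1 for y in labels] for k in range(1, K + 1)]

# Example with a 4-class problem; each inner list feeds one binary learner.
print(one_vs_all_labels([1, 2, 3, 4, 2], K=4))
```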

A Unified Approach [Allwein et al., 2000]
Given a coding matrix
    M = [ − −
          − +
          + −
          + + ]
Each row is a codeword for a class; the codeword for class 2 is (−, +).
Construct a binary classifier for each column (partition); f_1 should discriminate classes 1 and 2 from 3 and 4.
Decode (f_1(x), f_2(x)) to predict: (f_1(x), f_2(x)) = (+, +) predicts class label 4.
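The decoding rule can be written down directly; here is a minimal sketch (mine, not from the slides) using the 4-class, 2-column matrix above, with '+' encoded as +1 and '−' as −1:

```python
import numpy as np

# Coding matrix from the slide: rows are class codewords, columns are partitions.
M = np.array([[-1, -1],   # class 1
              [-1, +1],   # class 2
              [+1, -1],   # class 3
              [+1, +1]])  # class 4

def decode(binary_outputs, M):
    """Predict the class whose codeword has the smallest Hamming distance to the outputs."""
    f = np.asarray(binary_outputs)
    hamming = np.sum(M != f, axis=1)    # disagreements with each codeword
    return int(np.argmin(hamming)) + 1  # classes are numbered from 1

print(decode([+1, +1], M))  # -> 4, as in the slide
```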

Coding Matrix
Error-correcting: if a few binary classifiers make mistakes, the correct label can still be predicted.
Assure the Hamming distance between codewords is large. [Example coding matrix omitted.]
Assume errors are independent.
Extensions: some entries can be 0; various distance measures can be used.
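To make the error-correcting intuition concrete, this small sketch (my addition) computes the minimum pairwise Hamming distance between codewords; with minimum distance d and independent errors, roughly ⌊(d − 1)/2⌋ binary mistakes can be corrected:

```python
import numpy as np
from itertools import combinations

def min_codeword_distance(M):
    """Minimum Hamming distance between any two rows (codewords) of M."""
    return min(int(np.sum(M[i] != M[j])) for i, j in combinations(range(len(M)), 2))

# The 4-class, 2-column matrix from the previous slide has minimum distance 1,
# so a single binary mistake can already change the prediction.
M = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]])
print(min_codeword_distance(M))  # -> 1
```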

Multiclass Boosting [Guruswami & Sahai, 1999]
Problems:
Errors of the binary classifiers may be highly correlated.
The optimal coding matrix is problem dependent.
Boosting approach:
Dynamically generates the coding matrix.
Reweights examples to reduce the error correlation.
Minimizes a multiclass margin cost.

Prototype
The ensemble is F = (f_1, f_2, ..., f_T); each f_t has a coefficient α_t.
The weighted Hamming distance is
    Δ(M(k), F(x)) = Σ_{t=1}^{T} α_t · (1 − M(k, t) f_t(x)) / 2.

Multiclass Boosting
1: F ← (0, 0, ..., 0), i.e., f_t ≡ 0
2: for t = 1 to T do
3:   Pick the t-th column M(·, t) ∈ {−, +}^K
4:   Train a binary hypothesis f_t on {(x_n, M(y_n, t))}_{n=1}^{N}
5:   Decide a coefficient α_t
6: end for
7: return M, F, and the α_t's
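A minimal sketch (mine) of the weighted Hamming distance Δ and the resulting decoding rule, assuming ±1-valued binary hypotheses; `M`, `F_x`, and `alpha` are hypothetical argument names:

```python
import numpy as np

def weighted_hamming(M, F_x, alpha):
    """Delta(M(k), F(x)) = sum_t alpha_t * (1 - M[k, t] * f_t(x)) / 2 for every class k.
    M: (K, T) coding matrix with +/-1 entries; F_x: the T outputs f_t(x); alpha: T coefficients."""
    M = np.asarray(M, dtype=float)
    F_x = np.asarray(F_x, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return ((1.0 - M * F_x) / 2.0) @ alpha   # shape (K,)

def predict(M, F_x, alpha):
    """Predict the class (1-based) with the smallest weighted Hamming distance."""
    return int(np.argmin(weighted_hamming(M, F_x, alpha))) + 1

M = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]])
print(predict(M, F_x=[+1, +1], alpha=[0.7, 0.4]))  # -> 4
```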

Multiclass Margin Cost
For an example (x, y), we want Δ(M(k), F(x)) > Δ(M(y), F(x)) for all k ≠ y.
Margin: the margin of the example (x, y) for class k is
    ρ_k(x, y) = Δ(M(k), F(x)) − Δ(M(y), F(x)).
Exponential margin cost:
    C(F) = Σ_{n=1}^{N} Σ_{k ≠ y_n} e^{−ρ_k(x_n, y_n)}
This is similar to the binary exponential margin cost.
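The margins and the exponential cost can be evaluated directly from these definitions; below is a small sketch of my own (with assumed array shapes and names), not the talk's code:

```python
import numpy as np

def exponential_cost(M, F_X, alpha, y):
    """C(F) = sum_n sum_{k != y_n} exp(-rho_k(x_n, y_n)),
    with rho_k = Delta(M(k), F(x_n)) - Delta(M(y_n), F(x_n)).
    M: (K, T) coding matrix; F_X: (N, T) binary outputs f_t(x_n);
    alpha: T coefficients; y: N class labels in {1, ..., K}."""
    M = np.asarray(M, dtype=float)
    F_X = np.asarray(F_X, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    y0 = np.asarray(y) - 1                                    # 0-based row indices
    # Weighted Hamming distance for every example n and class k: shape (N, K)
    delta = ((1.0 - F_X[:, None, :] * M[None, :, :]) / 2.0) @ alpha
    rho = delta - delta[np.arange(len(y0)), y0][:, None]      # margins rho_k(x_n, y_n)
    cost = np.exp(-rho)
    cost[np.arange(len(y0)), y0] = 0.0                        # exclude k = y_n
    return float(cost.sum())
```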

Gradient Descent [Sun et al., 2005]
A multiclass boosting algorithm can be derived as gradient descent on the margin cost.

Multiclass Boosting (gradient-descent view)
1: F ← (0, 0, ..., 0), i.e., f_t ≡ 0
2: for t = 1 to T do
3:   Pick M(·, t) and f_t to maximize the negative gradient
4:   Pick α_t to minimize the cost along the gradient direction
5: end for
6: return M, F, and the α_t's

AdaBoost.ECC is a concrete algorithm for the exponential cost.
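Step 4 is not spelled out on the slide; for the exponential cost it reduces, as in binary AdaBoost, to a one-dimensional minimization with the familiar closed form below. This is my own annotation (consistent with the weighted binary error ε_t defined on the next slide), not something shown in the talk:

```python
import math

def adaboost_alpha(eps_t):
    """Minimizer of (1 - eps)*exp(-a) + eps*exp(a) over a, for 0 < eps < 1."""
    return 0.5 * math.log((1.0 - eps_t) / eps_t)

print(adaboost_alpha(0.25))  # ~0.55
```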

Gradient of Exponential Cost
Say F = (f_1, ..., f_t, 0, ...). (skipped most math equations)
    −∂C(F)/∂α_t |_{α_t=0} = U_t (1 − 2ε_t)
D_t(n, k) = e^{−ρ_k(x_n, y_n)} (before f_t is added): how would this example of class y_n be confused as class k?
U_t = Σ_{n=1}^{N} Σ_{k=1}^{K} D_t(n, k) ⟦M(k, t) ≠ M(y_n, t)⟧  (sum of the confusion over binary relabeled examples)
D_t(n) = U_t^{−1} Σ_{k=1}^{K} D_t(n, k) ⟦M(k, t) ≠ M(y_n, t)⟧  (sum of the confusion for an individual example)
ε_t = Σ_{n=1}^{N} D_t(n) ⟦f_t(x_n) ≠ M(y_n, t)⟧
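These quantities can be computed directly from the margins of the current ensemble; the sketch below is mine (assumed array shapes, 1-based labels converted internally), not code from the paper:

```python
import numpy as np

def boosting_weights(rho, col, y, f_x):
    """rho: (N, K) margins rho_k(x_n, y_n) of the current ensemble (before f_t is added);
    col: the K entries of M(., t) in {-1, +1}; y: N labels in {1, ..., K};
    f_x: N binary outputs f_t(x_n) in {-1, +1}.
    Returns U_t, the per-example weights D_t(n), and the weighted binary error eps_t."""
    col = np.asarray(col)
    y0 = np.asarray(y) - 1
    D_nk = np.exp(-np.asarray(rho, dtype=float))                  # D_t(n, k)
    D_nk[np.arange(len(y0)), y0] = 0.0                            # only k != y_n contributes
    differs = (col[None, :] != col[y0][:, None])                  # [[ M(k,t) != M(y_n,t) ]]
    U_t = float(np.sum(D_nk * differs))
    D_n = np.sum(D_nk * differs, axis=1) / U_t                    # D_t(n), sums to 1
    eps_t = float(np.sum(D_n * (np.asarray(f_x) != col[y0])))     # weighted binary error
    return U_t, D_n, eps_t
```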

Picking Partitions
    −∂C(F)/∂α_t |_{α_t=0} = U_t (1 − 2ε_t)
U_t is determined by the t-th column/partition; ε_t is also decided by the binary learning performance.
It seems we should pick the partition to maximize U_t and ask the binary learner to minimize ε_t.
Two strategies:
max-cut: picks the partition with the largest U_t
rand-half: randomly assigns + to half of the classes
Which one would you pick?
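A tiny sketch (my own) of the rand-half strategy, plus a crude surrogate for max-cut that simply keeps the random half-partition with the largest U_t; the slides do not specify how the max-cut partitions were actually computed, so treat the second function purely as an illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_half_column(K):
    """rand-half: assign +1 to a random half of the K classes and -1 to the rest."""
    col = -np.ones(K, dtype=int)
    col[rng.choice(K, size=K // 2, replace=False)] = +1
    return col

def column_U(col, D_nk, y0):
    """U_t of a candidate column: total confusion weight where M(k,t) differs from M(y_n,t)."""
    return float(np.sum(D_nk * (col[None, :] != col[y0][:, None])))

def best_of_random(D_nk, y0, K, trials=50):
    """Crude stand-in for max-cut: keep the random half-partition with the largest U_t."""
    candidates = [rand_half_column(K) for _ in range(trials)]
    return max(candidates, key=lambda c: column_U(c, D_nk, y0))
```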

Tangram
[Figure: a tangram with its seven pieces numbered 1–7.]

Margin Cost (with perceptrons)
[Figure: normalized training cost (log scale) vs. number of iterations (0–50) for AdaBoost.ECC (max-cut) and AdaBoost.ECC (rand-half).]

Why Was Max-Cut Worse?
Maximizing U_t brings strong error-correcting ability, but it also generates much harder binary problems.

Trade-Off
    −∂C(F)/∂α_t |_{α_t=0} = U_t (1 − 2ε_t)
Hard problems deteriorate the binary learning, so overall the negative gradient might be smaller.
We need to find a trade-off between U_t and ε_t.
The hardness depends on the binary learner, so we may ask the binary learner for a better partition.

Repartitioning
Given a binary classifier f_t, which partition is the best?
The one that maximizes −∂C(F)/∂α_t |_{α_t=0}. (skipped most math equations)
M(k, t) can be decided from the outputs of f_t and the confusion values D_t(n, k).
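The slide skips the details, so the following is only my reconstruction of one such rule: the negative gradient at α_t = 0, Σ_n Σ_k D_t(n, k) f_t(x_n) (M(y_n, t) − M(k, t)) / 2, is linear in the entries of the column, so for a fixed f_t each entry M(c, t) can be set to the sign of its coefficient. The paper's actual procedure may include extra safeguards (for instance, avoiding a column whose entries are all equal); this is a sketch, not the authors' code.

```python
import numpy as np

def repartition(D_nk, f_x, y0, K):
    """Choose M(., t) in {-1, +1}^K maximizing the negative gradient for a FIXED f_t.
    The negative gradient is linear in the column entries, so each entry gets the sign
    of its coefficient.  (My reconstruction; the exact rule was skipped in the talk.)
    D_nk: (N, K) confusion weights with D_nk[n, y_n] = 0; f_x: N outputs of f_t in {-1, +1};
    y0: N 0-based class labels."""
    D_nk = np.asarray(D_nk, dtype=float)
    f_x = np.asarray(f_x, dtype=float)
    y0 = np.asarray(y0)
    g = np.zeros(K)
    # Coefficient contributed through M(y_n, t): examples whose true class is c.
    for c in range(K):
        g[c] += np.sum(f_x[y0 == c] * D_nk[y0 == c].sum(axis=1))
    # Coefficient contributed through -M(k, t): confusion pointing toward class c.
    g -= D_nk.T @ f_x
    return np.where(g >= 0, 1, -1)
```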

AdaBoost.ERP
Given a partition, a binary classifier can be learned.
Given a binary classifier, a better partition can be generated.
These two steps can be carried out alternately.
We use a string of L's and R's to denote the schedule; for example, LRL means Learning, Repartitioning, Learning.
We can also start from partial partitions; for example, rand-2 starts with two random classes.
This gives faster learning and focuses on the local class structure.
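To illustrate how a schedule string such as LRL or LRLR could drive the alternation, here is a small placeholder sketch; `train_binary` and `repartition_fn` stand in for the binary learner and the repartitioning rule and are hypothetical interfaces, not the talk's:

```python
def run_schedule(schedule, col, train_binary, repartition_fn):
    """Alternate Learning ('L') and Repartitioning ('R') steps, e.g. 'LRL' or 'LRLR'.
    col: the current column/partition; train_binary(col) -> binary classifier;
    repartition_fn(clf) -> improved column.  Both are placeholder interfaces."""
    clf = None
    for step in schedule:
        if step == "L":
            clf = train_binary(col)       # learn a binary classifier for this partition
        elif step == "R":
            col = repartition_fn(clf)     # improve the partition for the fixed classifier
    return col, clf
```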

Experiment Settings
We compared one-vs-one, one-vs-all, AdaBoost.ECC, and AdaBoost.ERP.
Four different binary learners: decision stumps, perceptrons, binary AdaBoost, and SVM-perceptron.
Ten UCI data sets, with the number of classes ranging from 3 to 26.

Cost with Decision Stumps on letter
[Figure: normalized training cost (log scale) vs. number of iterations (0–1000) for AdaBoost.ECC (max-cut), AdaBoost.ECC (rand-half), AdaBoost.ERP (max-2, LRL), AdaBoost.ERP (rand-2, LRL), and AdaBoost.ERP (rand-2, LRLR).]

Test Error with Decision Stumps on letter
[Figure: test error (%) vs. number of iterations (0–1000).]

Cost with Perceptrons on letter
[Figure: normalized training cost (log scale) vs. number of iterations (0–500) for AdaBoost.ECC (max-cut), AdaBoost.ECC (rand-half), AdaBoost.ERP (max-2, LRL), AdaBoost.ERP (rand-2, LRL), and AdaBoost.ERP (rand-2, LRLR).]

Test Error with Perceptrons on letter
[Figure: test error (%) vs. number of iterations (0–500).]

Overall Results
AdaBoost.ERP achieved the lowest cost, and the lowest test error on most of the data sets.
The improvement is especially significant for weak binary learners.
With SVM-perceptron, all methods were comparable.
AdaBoost.ERP starting with partial partitions was much faster than AdaBoost.ECC.
One-vs-one is much worse with weak binary learners, but it is much faster.

Summary
A multiclass problem can be reduced to a collection of binary problems via an error-correcting coding matrix.
Multiclass boosting dynamically generates the coding matrix and the binary problems.
Hard binary problems deteriorate the binary learning.
AdaBoost.ERP achieves a better trade-off between error-correcting ability and binary learning performance.