Multi-Category Classification by Soft-Max Combination of Binary Classifiers

1 Multi-Category Classification by Soft-Max Combination of Binary Classifiers
K. Duan, S. S. Keerthi, W. Chu, S. K. Shevade, A. N. Poo
Department of Mechanical Engineering, National University of Singapore
Gatsby Computational Neuroscience Unit, University College London
Department of Computer Science and Automation, Indian Institute of Science
2003/06/12, The fourth International Workshop on Multiple Classifier Systems

2 Multi-category classification Using Binary Classifiers
- All-Together Methods: Training speed is usually slow.
- One-Versus-All Methods: For M-category classification, construct $M$ one-versus-all binary classifiers, each to distinguish one class from all other classes. Implementation strategy: Winner-Takes-All.
- One-Versus-One Methods: For M-category classification, construct $M(M-1)/2$ binary classifiers, each to distinguish one class from another. Implementation strategy: Max-Wins-Voting etc.
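The two implementation strategies named on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the 0-based class indices, the sign convention for pairwise outputs (positive means the first class of the pair wins), and the tie-breaking rule are all assumptions made for the example.

```python
def winner_takes_all(ova_outputs):
    """ova_outputs[k] is the real-valued output of the 'class k vs. rest' classifier.

    The predicted class is the one whose classifier gives the largest output.
    """
    return max(range(len(ova_outputs)), key=lambda k: ova_outputs[k])


def max_wins_voting(ovo_outputs, num_classes):
    """ovo_outputs[(k, t)] with k < t is the output of the 'class k vs. class t' classifier.

    Assumed convention: a positive output is a vote for k, otherwise a vote for t.
    Ties are broken in favour of the lowest class index (an arbitrary choice).
    """
    votes = [0] * num_classes
    for (k, t), r in ovo_outputs.items():
        votes[k if r > 0 else t] += 1
    return max(range(num_classes), key=lambda k: votes[k])


M = 3
ova = [-0.4, 1.2, 0.3]                            # hypothetical one-versus-all outputs
ovo = {(0, 1): -0.8, (0, 2): 0.5, (1, 2): 1.1}    # hypothetical pairwise outputs
assert len(ovo) == M * (M - 1) // 2               # M(M-1)/2 pairwise classifiers

wta_class = winner_takes_all(ova)
mwv_class = max_wins_voting(ovo, M)
print(wta_class, mwv_class)
```

Both rules return only a class label; neither yields the posterior class probabilities that the soft-max combination is designed to provide.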

3 Multi-category classification Using Binary Classifiers
- Pairwise Coupling: For one-versus-one binary classifiers with probabilistic outputs, such as kernel logistic regression. Central idea: Couple $M(M-1)/2$ pairwise class probability estimates to obtain estimates of posterior probabilities for $M$ classes.
- Our Methods: Combine one-versus-others or one-versus-one binary classifiers through soft-max functions to obtain posterior class probabilities.

4 Soft-Max Combination of Binary Classifiers
$M$ classes and $l$ labelled training data $(x_1, y_1), \ldots, (x_l, y_l)$, where $x_i \in \mathbb{R}^m$ and $y_i \in \{1, \ldots, M\}$.
- Combination of One-Versus-All binary classifiers;
- Combination of One-Versus-One binary classifiers;
- Relation to Previous Work.

5 Soft-Max Combination of One-Versus-All Classifiers
Denote the output of the $k$th binary classifier (class $c_k$ versus the rest) for $x_i$ as $r_k^i$. The posterior probabilities are obtained through a soft-max function

$$P_k^i = \mathrm{Prob}(c_k \mid x_i) = \frac{e^{w_k r_k^i + w_{ko}}}{z^i} \tag{1}$$

where $z^i = \sum_{k=1}^{M} e^{w_k r_k^i + w_{ko}}$ is a normalization term that ensures $\sum_{k=1}^{M} P_k^i = 1$.
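A minimal sketch of equation (1) for a single example is given below. The function name `softmax_ova`, the 0-based class indices, and the numeric values are assumptions for illustration only. Subtracting the largest score before exponentiating is a standard numerical safeguard that is not part of the slide; it leaves the ratios in (1) unchanged.

```python
import math


def softmax_ova(r, w, w0):
    """Return [P_1, ..., P_M] with P_k = exp(w_k * r_k + w_k0) / z for one example.

    r  : outputs of the M one-versus-all classifiers for this example
    w  : positive scale parameters w_k
    w0 : positive offset parameters w_ko
    """
    scores = [wk * rk + bk for wk, rk, bk in zip(w, r, w0)]
    shift = max(scores)                      # avoids overflow, cancels in the ratio
    exps = [math.exp(s - shift) for s in scores]
    z = sum(exps)                            # normalization term z^i (up to the shift)
    return [e / z for e in exps]


r = [-1.3, 0.9, -0.2]      # hypothetical classifier outputs
w = [1.0, 1.0, 1.0]        # hypothetical weights
w0 = [0.1, 0.1, 0.1]       # hypothetical offsets
P = softmax_ova(r, w, w0)
print(P, sum(P))
```

With equal weights and offsets, the most probable class is simply the one with the largest classifier output, so the prediction agrees with Winner-Takes-All while also giving probability estimates.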

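6 Soft-Max Combination of One-Versus-All Classifiers
$w = \{(w_1, w_{1o}), \ldots, (w_M, w_{Mo})\}$ can be designed to minimize a penalized negative log-likelihood (NLL):

$$\min \; E(w) = \frac{1}{2}\|w\|^2 - C \sum_{i=1}^{l} \log P_{y_i}^i \tag{2}$$

subject to $w_k, w_{ko} > 0$, $k = 1, \ldots, M$.

Auxiliary Variables: $s_k = \log(w_k)$, $s_{ko} = \log(w_{ko})$. Since $w_k = e^{s_k}$ and $w_{ko} = e^{s_{ko}}$ are positive for any real $s_k$, $s_{ko}$, the problem becomes unconstrained in the auxiliary variables.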

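7 Soft-Max Combination of One-Versus-All Classifiers
The optimization problem can be solved using gradient methods. The following formulas give the gradients with respect to the auxiliary variables:

$$\frac{\partial E}{\partial s_k} = \frac{\partial E}{\partial w_k}\frac{\partial w_k}{\partial s_k} = \Big[\, w_k + C \sum_{y_i = k} (P_k^i - 1)\, r_k^i + C \sum_{y_i \neq k} P_k^i\, r_k^i \Big]\, w_k$$

$$\frac{\partial E}{\partial s_{ko}} = \frac{\partial E}{\partial w_{ko}}\frac{\partial w_{ko}}{\partial s_{ko}} = \Big[\, w_{ko} + C \sum_{y_i = k} (P_k^i - 1) + C \sum_{y_i \neq k} P_k^i \Big]\, w_{ko}$$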
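The gradient formulas can be checked numerically against objective (2). The sketch below is an illustration under stated assumptions, not the authors' implementation: all function names, the toy data, and the 0-based class labels are made up for the example, and the regularizer is taken to include both $w_k$ and $w_{ko}$, as the gradient formulas imply.

```python
import math


def posteriors(r_i, w, w0):
    """Soft-max posteriors of equation (1) for one example."""
    scores = [w[k] * r_i[k] + w0[k] for k in range(len(w))]
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def objective(s, s0, R, y, C):
    """E = 0.5 * ||w||^2 - C * sum_i log P^i_{y_i}, with w = exp(s), w0 = exp(s0)."""
    w = [math.exp(v) for v in s]
    w0 = [math.exp(v) for v in s0]
    reg = 0.5 * (sum(v * v for v in w) + sum(v * v for v in w0))
    nll = -sum(math.log(posteriors(r_i, w, w0)[y_i]) for r_i, y_i in zip(R, y))
    return reg + C * nll


def gradient(s, s0, R, y, C):
    """Analytic gradients of E with respect to the auxiliary variables s_k, s_ko."""
    w = [math.exp(v) for v in s]
    w0 = [math.exp(v) for v in s0]
    g_w, g_w0 = list(w), list(w0)            # contribution of the regularizer
    for r_i, y_i in zip(R, y):
        P = posteriors(r_i, w, w0)
        for k in range(len(w)):
            # (P_k - 1) when y_i == k, and P_k otherwise
            err = P[k] - (1.0 if y_i == k else 0.0)
            g_w[k] += C * err * r_i[k]
            g_w0[k] += C * err
    # chain rule: dw/ds = w
    return [g * v for g, v in zip(g_w, w)], [g * v for g, v in zip(g_w0, w0)]


def numeric_gradient(s, s0, R, y, C, eps=1e-6):
    """Central finite differences, used only to validate the analytic gradient."""
    def bump(vec, k, delta):
        out = list(vec)
        out[k] += delta
        return out

    g_s = [(objective(bump(s, k, eps), s0, R, y, C)
            - objective(bump(s, k, -eps), s0, R, y, C)) / (2 * eps)
           for k in range(len(s))]
    g_s0 = [(objective(s, bump(s0, k, eps), R, y, C)
             - objective(s, bump(s0, k, -eps), R, y, C)) / (2 * eps)
            for k in range(len(s0))]
    return g_s, g_s0


# Hypothetical one-versus-all outputs for four examples and three classes.
R = [[1.2, -0.7, -0.9], [-0.8, 0.6, -0.1], [-1.1, -0.3, 0.9], [0.2, 0.4, -1.0]]
y = [0, 1, 2, 1]
s, s0, C = [0.1, -0.2, 0.3], [-0.5, 0.0, 0.2], 2.0

a_s, a_s0 = gradient(s, s0, R, y, C)
n_s, n_s0 = numeric_gradient(s, s0, R, y, C)
max_err = max(abs(a - n) for a, n in zip(a_s + a_s0, n_s + n_s0))
print(max_err)
```

Any gradient-based optimizer can then be run on `objective` and `gradient` in the unconstrained variables, recovering the positive weights as `exp(s)` afterwards.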

8 Soft-Max Combination of One-Versus-One Classifiers
Denote the output of the one-versus-one classifier $C_{kt}$ for $x_i$ as $r_{kt}^i$. We have $r_{tk}^i = -r_{kt}^i$. The posterior probabilities can be obtained through a soft-max function

$$P_k^i = \mathrm{Prob}(c_k \mid x_i) = \frac{e^{\sum_{t \neq k} w_{kt} r_{kt}^i + w_{ko}}}{z^i} \tag{3}$$

where $z^i = \sum_{k=1}^{M} e^{\sum_{t \neq k} w_{kt} r_{kt}^i + w_{ko}}$ is a normalization term.
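Equation (3) can be sketched in the same way. The helper names, the 0-based class indices, and the numbers below are assumptions for illustration; the antisymmetry $r_{tk}^i = -r_{kt}^i$ is built into the matrix of pairwise outputs.

```python
import math


def antisymmetric(upper, M):
    """Build the full M x M matrix of pairwise outputs from the entries with k < t."""
    r = [[0.0] * M for _ in range(M)]
    for (k, t), value in upper.items():
        r[k][t] = value
        r[t][k] = -value                     # r_tk = -r_kt
    return r


def softmax_ovo(r, w, w0):
    """P_k = exp(sum_{t != k} w_kt * r_kt + w_k0) / z for one example."""
    M = len(w0)
    scores = [sum(w[k][t] * r[k][t] for t in range(M) if t != k) + w0[k]
              for k in range(M)]
    shift = max(scores)                      # numerical safeguard, cancels in the ratio
    exps = [math.exp(s - shift) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


M = 3
r = antisymmetric({(0, 1): -0.8, (0, 2): 0.5, (1, 2): 1.1}, M)   # hypothetical outputs
w = [[1.0] * M for _ in range(M)]                                # hypothetical weights w_kt
w0 = [0.1] * M                                                   # hypothetical offsets w_ko
P_ovo = softmax_ovo(r, w, w0)
print(P_ovo)
```

Note that $w_{kt}$ and $w_{tk}$ are separate parameters, so the weight matrix need not be symmetric even though the output matrix is antisymmetric.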

9 Soft-Max Combination of One-Versus-One Classifiers
The weight parameters $w$ can be designed to minimize a penalized negative log-likelihood (NLL):

$$\min \; E(w) = \frac{1}{2}\|w\|^2 - C \sum_{i=1}^{l} \log P_{y_i}^i \tag{4}$$

subject to $w_{kt}, w_{ko} > 0$, $k, t = 1, \ldots, M$ and $t \neq k$.

Auxiliary Variables: $s_{kt} = \log(w_{kt})$, $s_{ko} = \log(w_{ko})$

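10 Soft-Max Combination of One-Versus-One Classifiers
The optimization problem can be solved using gradient methods. The following formulas give the gradients with respect to the auxiliary variables:

$$\frac{\partial E}{\partial s_{kt}} = \frac{\partial E}{\partial w_{kt}}\frac{\partial w_{kt}}{\partial s_{kt}} = \Big[\, w_{kt} + C \sum_{y_i = k} (P_k^i - 1)\, r_{kt}^i + C \sum_{y_i \neq k} P_k^i\, r_{kt}^i \Big]\, w_{kt}$$

$$\frac{\partial E}{\partial s_{ko}} = \frac{\partial E}{\partial w_{ko}}\frac{\partial w_{ko}}{\partial s_{ko}} = \Big[\, w_{ko} + C \sum_{y_i = k} (P_k^i - 1) + C \sum_{y_i \neq k} P_k^i \Big]\, w_{ko}$$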

11 Relation to Previous Work
The following parametric model is used by Platt (1999) to fit the posterior probability

$$\mathrm{Prob}(c_1 \mid x_i) = \frac{1}{1 + e^{A f_i + B}}, \tag{5}$$

where $f_i$ is the output of SVMs.

One-Versus-All case with $M = 2$, $r_1^i = f_i$, and $r_2^i = -r_1^i$:

$$\mathrm{Prob}(c_1 \mid x_i) = \frac{1}{1 + e^{-(w_1 + w_2) f_i + (w_{2o} - w_{1o})}}. \tag{6}$$

12 Relation to Previous Work
One-Versus-One case with $M = 2$, $r_{12}^i = f_i$ and $r_{21}^i = -r_{12}^i$:

$$\mathrm{Prob}(c_1 \mid x_i) = \frac{1}{1 + e^{-(w_{12} + w_{21}) f_i + (w_{2o} - w_{1o})}}. \tag{7}$$

Therefore, our soft-max combination methods can be viewed as natural extensions of Platt's sigmoid-fitting idea to multi-category classification.
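The two-class reduction can be verified numerically: with $A = -(w_1 + w_2)$ and $B = w_{2o} - w_{1o}$, the two-class soft-max and Platt's sigmoid give the same probability. The snippet below is a small check under assumed parameter values; the function names are hypothetical. The one-versus-one case is identical with $w_{12}$, $w_{21}$ in place of $w_1$, $w_2$.

```python
import math


def platt_sigmoid(f, A, B):
    """Platt's parametric model: Prob(c_1 | x) = 1 / (1 + exp(A * f + B))."""
    return 1.0 / (1.0 + math.exp(A * f + B))


def softmax_two_class(f, w1, w2, w1o, w2o):
    """Two-class soft-max with r_1 = f and r_2 = -f."""
    a = math.exp(w1 * f + w1o)
    b = math.exp(w2 * (-f) + w2o)
    return a / (a + b)


w1, w2, w1o, w2o = 0.7, 1.4, 0.3, 0.9        # hypothetical positive parameters
A, B = -(w1 + w2), w2o - w1o                 # mapping to Platt's parameters

gaps = [abs(platt_sigmoid(f, A, B) - softmax_two_class(f, w1, w2, w1o, w2o))
        for f in (-2.0, -0.5, 0.0, 0.3, 1.7)]
print(max(gaps))
```

Because the weights are constrained to be positive, $A$ is always negative under this mapping, so the probability of class $c_1$ increases with the SVM output $f_i$.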

13 Practical Issues in Soft-Max Design
- 5-fold cross validation for soft-max design: The original training data is partitioned into 5 folds, with each fold containing an equal percentage of the examples of each class.
- Regularization Parameter C: We select the optimal C using the validation estimates of error rate and negative log-likelihood.
- Simplified soft-max function design: We may omit the use of regularization.
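One way to build such a class-balanced 5-fold partition is sketched below. The function name `stratified_folds`, the round-robin assignment, and the fixed random seed are assumptions for illustration; the slides do not say how the partition was actually produced.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Split example indices into folds holding roughly equal shares of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the examples of this class to the folds in round-robin order.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


labels = [0] * 10 + [1] * 20 + [2] * 15      # hypothetical class labels
folds = stratified_folds(labels)
for fold in folds:
    print(sorted(labels[i] for i in fold))
```

Each fold can then serve once as the validation set on which the error rate and negative log-likelihood are estimated for a candidate value of C.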

14 Numerical Study
- Soft-max combination of SVM one-versus-all classifiers: standard design and simplified soft-max function design
- Soft-max combination of SVM one-versus-one classifiers: standard design and simplified soft-max function design
- Winner-Takes-All of one-versus-all classifiers: SVM, SVM with Platt's posterior probabilities (PSVM) and kernel logistic regression (KLOGR)
- Max-Wins-Voting of one-versus-one classifiers: SVM, PSVM and KLOGR
- Pairwise coupling of one-versus-one classifiers: PSVM and KLOGR.

15 Results and Conclusions
- Winner-Takes-All of KLOGR seems best among all one-versus-all classifiers.
- Max-Wins-Voting of KLOGR seems best among all one-versus-one classifiers.
- Overall, Pairwise-Coupling of KLOGR seems slightly better.
- The proposed soft-max combination methods with simplified combination function design are competitive and simpler to design.
- They provide new ways of obtaining posterior probability estimates from binary classifiers whose outputs are not probabilistic values.
