Probability Estimates for Multi-class Classification by Pairwise Coupling

Ting-Fan Wu  Chih-Jen Lin
Department of Computer Science
National Taiwan University
Taipei 106, Taiwan

Ruby C. Weng
Department of Statistics
National Chengchi University
Taipei 116, Taiwan

Abstract

Pairwise coupling is a popular multi-class classification method that combines all pairwise comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and the method of [3].

1 Introduction

The multi-class classification problem refers to assigning each observation to one of k classes. As two-class problems are much easier to solve, many authors propose to use two-class classifiers for multi-class classification. In this paper we focus on techniques that provide a multi-class classification solution by combining all pairwise comparisons.

A common way to combine pairwise comparisons is by voting [6, 2]. It constructs a rule for discriminating between every pair of classes and then selects the class with the most winning two-class decisions. Though the voting procedure requires only pairwise decisions, it predicts just a class label. In many scenarios, however, probability estimates are desired. As numerous (pairwise) classifiers do provide class probabilities, several authors [12, 11, 3] have proposed probability estimates obtained by combining the pairwise class probabilities.

Given the observation x and the class label y, we assume that the estimated pairwise class probabilities $r_{ij}$ of $\mu_{ij} = P(y = i \mid y = i \text{ or } j,\ x)$ are available. Here the $r_{ij}$ are obtained by some binary classifiers. The goal is then to estimate $\{p_i\}_{i=1}^{k}$, where $p_i = P(y = i \mid x)$, $i = 1, \dots, k$. We propose to obtain an approximate solution to an identity, and then select the label with the highest estimated class probability. The existence of the solution is guaranteed by the theory of finite Markov chains. Motivated by the optimization formulation of this method, we propose a second approach. Interestingly, it can also be regarded as an improved version of the coupling approach given by [12]. Both of the proposed methods can be reduced to solving linear systems and are simple to implement in practice. Furthermore, from conceptual and experimental points of view, we show that the two proposed methods are more stable than voting and the method in [3].

We organize the paper as follows. In Section 2, we review two existing methods. Sections 3 and 4 detail the two proposed approaches. Section 5 presents the relationship among the four methods through their corresponding optimization formulations. In Section 6, we compare these methods using simulated and real data; the classifiers considered are support vector machines. Section 7 concludes the paper. Due to space limitations, we omit all detailed proofs. A complete version of this work is available at http://www.csie.ntu.edu.tw/~cjlin/papers/svmprob/svmprob.pdf.

2 Review of Two Methods

Let $r_{ij}$ be the estimates of $\mu_{ij} = p_i/(p_i + p_j)$. The voting rule [6, 2] is

$\delta_V = \arg\max_i \Big[\sum_{j: j \neq i} I_{\{r_{ij} > r_{ji}\}}\Big]. \quad (1)$

A simple estimate of the probabilities can be derived as $v_i = 2 \sum_{j: j \neq i} I_{\{r_{ij} > r_{ji}\}} / (k(k-1))$.

The authors of [3] suggest another method to estimate class probabilities, and they claim that the resulting classification rule can outperform $\delta_V$ in some situations. Their approach is based on the minimization of the Kullback-Leibler (KL) distance between $r_{ij}$ and $\mu_{ij}$:

$\ell(\mathbf{p}) = \sum_{i \neq j} n_{ij}\, r_{ij} \log(r_{ij}/\mu_{ij}), \quad (2)$

where $\sum_{i=1}^{k} p_i = 1$, $p_i > 0$, $i = 1, \dots, k$, and $n_{ij}$ is the number of instances in class i or j. Setting $\nabla \ell(\mathbf{p}) = 0$ leads to a nonlinear system, and [3] proposes an iterative procedure to find the minimum of (2). If $r_{ij} > 0$ for all $i \neq j$, the existence of a unique global minimum of (2) has been proved in [5] and references therein. Let $\tilde{\mathbf{p}}$ denote this point. Then the resulting classification rule is

$\delta_{HT}(x) = \arg\max_i [\tilde{p}_i]$.

It is shown in Theorem 1 of [3] that

$\tilde{p}_i > \tilde{p}_j \iff \bar{p}_i > \bar{p}_j, \quad \text{where} \quad \bar{p}_j = \frac{2 \sum_{s: s \neq j} r_{js}}{k(k-1)}; \quad (3)$

that is, the $\tilde{p}_i$ are in the same order as the $\bar{p}_i$. Therefore, the $\bar{p}_i$ are sufficient if one only requires the classification rule. In fact, as pointed out by [3], $\bar{\mathbf{p}}$ can be derived as an approximation to the identity

$p_i = \sum_{j: j \neq i} \Big(\frac{p_i + p_j}{k-1}\Big)\Big(\frac{p_i}{p_i + p_j}\Big) = \sum_{j: j \neq i} \Big(\frac{p_i + p_j}{k-1}\Big)\mu_{ij} \quad (4)$

by replacing $p_i + p_j$ with $2/k$ and $\mu_{ij}$ with $r_{ij}$.
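As a concrete illustration of these two baselines, here is a minimal numpy sketch (ours, not from the paper), assuming r is a k-by-k array of the pairwise estimates with $r_{ij} + r_{ji} = 1$ off the diagonal and an unused diagonal:

```python
import numpy as np

def voting_estimate(r):
    """The crude probability estimate v_i built from the vote counts in (1)."""
    k = r.shape[0]
    wins = (r > r.T).sum(axis=1)          # number of j with r_ij > r_ji
    return 2.0 * wins / (k * (k - 1))

def p_bar(r):
    """p_bar of (3); taking argmax over it reproduces the decision of delta_HT."""
    k = r.shape[0]
    off = r - np.diag(np.diag(r))         # zero the diagonal so sums skip j = i
    return 2.0 * off.sum(axis=1) / (k * (k - 1))
```

Both rules then classify with np.argmax over the returned vector; for $\delta_V$ ties are possible, and (as noted in Section 6.1) the paper breaks them randomly.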

3 Our First Approach

Note that $\delta_{HT}$ is essentially $\arg\max_i [\bar{p}_i]$, and $\bar{\mathbf{p}}$ is an approximate solution to (4). Instead of replacing $p_i + p_j$ by $2/k$, in this section we propose to solve the system

$p_i = \sum_{j: j \neq i} \Big(\frac{p_i + p_j}{k-1}\Big) r_{ij}, \ \forall i, \quad \text{subject to} \quad \sum_{i=1}^{k} p_i = 1, \ p_i \geq 0, \ \forall i. \quad (5)$

Let $\mathbf{p}^*$ denote the solution to (5). Then the resulting decision rule is

$\delta_1 = \arg\max_i [p_i^*]$.

As $\delta_{HT}$ relies on the approximation $p_i + p_j \approx 2/k$, in Section 6.1 we use two examples to illustrate possible problems with this rule.

To solve (5), we rewrite it as

$Q\mathbf{p} = \mathbf{p}, \quad \sum_{i=1}^{k} p_i = 1, \quad p_i \geq 0, \ \forall i, \quad \text{where} \quad Q_{ij} = \begin{cases} r_{ij}/(k-1) & \text{if } i \neq j, \\ \sum_{s: s \neq i} r_{is}/(k-1) & \text{if } i = j. \end{cases} \quad (6)$

Observe that $\sum_{i=1}^{k} Q_{ij} = 1$ for $j = 1, \dots, k$ and $0 \leq Q_{ij} \leq 1$ for all $i, j$, so there exists a finite Markov chain whose transition matrix is $Q^T$. Moreover, if $r_{ij} > 0$ for all $i \neq j$, then $Q_{ij} > 0$, which implies this Markov chain is irreducible and aperiodic. These conditions guarantee the existence of a unique stationary probability and that all states are positive recurrent. Hence, we have the following theorem:

Theorem 1. If $r_{ij} > 0$ for all $i \neq j$, then (6) has a unique solution $\mathbf{p}$ with $0 < p_i < 1$, $\forall i$.

With Theorem 1 and some further analysis, if we remove the constraints $p_i \geq 0$, $\forall i$, the linear system with $k + 1$ equations still has the same unique solution. Furthermore, if any one of the k equalities of $Q\mathbf{p} = \mathbf{p}$ is removed, we obtain a system with k variables and k equalities which again has the same single solution. Thus, (6) can be solved by Gaussian elimination. On the other hand, as the stationary solution of a Markov chain is the limit of the n-step transition probability matrix $(Q^T)^n$, we can also solve for $\mathbf{p}$ by repeatedly multiplying any initial probability vector by $Q$.

Now we re-examine this method to gain more insight. The following argument shows that the solution to (5) is the global minimum of a meaningful optimization problem. To begin, using the property $r_{ij} + r_{ji} = 1$, $\forall i \neq j$, we express (5) as

$\sum_{j: j \neq i} r_{ji}\, p_i - \sum_{j: j \neq i} r_{ij}\, p_j = 0, \quad i = 1, \dots, k.$

Then the solution to (5) is in fact the global minimum of the problem

$\min_{\mathbf{p}} \sum_{i=1}^{k} \Big(\sum_{j: j \neq i} r_{ji}\, p_i - \sum_{j: j \neq i} r_{ij}\, p_j\Big)^2 \quad \text{subject to} \quad \sum_{i=1}^{k} p_i = 1, \ p_i \geq 0, \ \forall i, \quad (7)$

since the objective function is always nonnegative and attains zero at the solution of (5)-(6).
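A minimal sketch of this first approach (again ours, assuming the same k-by-k array r as before with $r_{ij} > 0$ off the diagonal): it builds Q from (6) and solves the k-variable linear system obtained by replacing one redundant equation with the normalization constraint.

```python
import numpy as np

def coupling_markov(r):
    """Solve (6): Qp = p with sum(p) = 1, via Gaussian elimination."""
    k = r.shape[0]
    off = r - np.diag(np.diag(r))                        # r with its diagonal zeroed
    Q = off / (k - 1)                                    # Q_ij = r_ij / (k-1), i != j
    Q[np.diag_indices(k)] = off.sum(axis=1) / (k - 1)    # Q_ii = sum_{s != i} r_is / (k-1)
    A = np.eye(k) - Q                                    # (I - Q)p = 0 ...
    A[-1, :] = 1.0                                       # ... with one row swapped for sum(p) = 1
    b = np.zeros(k)
    b[-1] = 1.0
    return np.linalg.solve(A, b)
```

The power-method alternative mentioned above amounts to iterating p = Q @ p from any initial probability vector until it stops changing.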

4 Our Second Approach

Note that both approaches in Sections 2 and 3 involve solving optimization problems built from relations like $p_i/(p_i + p_j) \approx r_{ij}$ or $r_{ji} p_i \approx r_{ij} p_j$. Motivated by (7), we suggest another optimization formulation:

$\min_{\mathbf{p}} \frac{1}{2} \sum_{i=1}^{k} \sum_{j: j \neq i} (r_{ji}\, p_i - r_{ij}\, p_j)^2 \quad \text{subject to} \quad \sum_{i=1}^{k} p_i = 1, \ p_i \geq 0, \ \forall i. \quad (8)$

In related work, [12] proposes to solve a linear system consisting of $\sum_{i=1}^{k} p_i = 1$ and any $k - 1$ equations of the form $r_{ji} p_i = r_{ij} p_j$. However, as pointed out in [11], the results of [12] depend strongly on the selection of the $k - 1$ equations. In fact, as (8) considers all $r_{ij} p_j \approx r_{ji} p_i$, not just $k - 1$ of them, it can be viewed as an improved version of [12]. Let $\mathbf{p}^*$ denote the corresponding solution. We then define the classification rule as

$\delta_2 = \arg\max_i [p_i^*]$.

Since (7) has a unique solution which can be obtained by solving a simple linear system, it is natural to ask whether the minimization problem (8) shares these nice properties. In the rest of the section, we show that it does. The following theorem shows that the nonnegativity constraints in (8) are redundant.

Theorem 2. Problem (8) is equivalent to its simplification without the conditions $p_i \geq 0$, $\forall i$.

Note that we can rewrite the objective function of (8) as

$\min_{\mathbf{p}} \frac{1}{2}\, \mathbf{p}^T Q \mathbf{p}, \quad \text{where} \quad Q_{ij} = \begin{cases} \sum_{s: s \neq i} r_{si}^2 & \text{if } i = j, \\ -r_{ji} r_{ij} & \text{if } i \neq j. \end{cases} \quad (9)$

From here we can show that Q is positive semi-definite. Therefore, without the constraints $p_i \geq 0$, $\forall i$, (9) is a linearly-constrained convex quadratic programming problem. Consequently, a point $\mathbf{p}$ is a global minimum if and only if it satisfies the KKT optimality condition: there is a scalar b such that

$\begin{bmatrix} Q & \mathbf{e} \\ \mathbf{e}^T & 0 \end{bmatrix} \begin{bmatrix} \mathbf{p} \\ b \end{bmatrix} = \begin{bmatrix} \mathbf{0} \\ 1 \end{bmatrix}. \quad (10)$

Here e is the vector of all ones and b is the Lagrange multiplier of the equality constraint $\sum_{i=1}^{k} p_i = 1$. Thus, the solution of (8) can be obtained by solving the simple linear system (10). The existence of a unique solution is guaranteed by the invertibility of the matrix in (10). Moreover, if Q is positive definite (PD), this matrix is invertible. The following theorem shows that Q is PD under quite general conditions.

Theorem 3. If for every $i = 1, \dots, k$ there are $s \neq i$ and $j \neq i$ such that $\frac{r_{si} r_{sj}}{r_{is}} \neq \frac{r_{ji} r_{js}}{r_{ij}}$, then Q is positive definite.

In addition to direct methods, we next propose a simple iterative method for solving (10):

Algorithm 1
1. Start with some initial $p_i \geq 0$, $\forall i$, satisfying $\sum_{i=1}^{k} p_i = 1$.
2. Repeat ($t = 1, \dots, k, 1, \dots$) until (10) is satisfied:

$p_t \leftarrow \frac{1}{Q_{tt}} \Big[ -\sum_{j: j \neq t} Q_{tj}\, p_j + \mathbf{p}^T Q \mathbf{p} \Big] \quad (11)$

normalize $\mathbf{p}$ $\quad (12)$

Theorem 4. If $r_{sj} > 0$ for all $s \neq j$ and $\{\mathbf{p}^n\}_{n=1}^{\infty}$ is the sequence generated by Algorithm 1, then any convergent sub-sequence converges to a global minimum of (8).

As Theorem 3 indicates that Q is in general positive definite, the sequence $\{\mathbf{p}^n\}_{n=1}^{\infty}$ from Algorithm 1 usually converges globally to the unique minimum of (8).
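A minimal sketch of the second approach under the same assumptions on r: it assembles Q from (9) and solves the KKT system (10) directly, with the coordinate iteration (11)-(12) included for comparison (the function names are ours):

```python
import numpy as np

def build_Q(r):
    """The matrix Q of (9)."""
    off = r - np.diag(np.diag(r))                # zero the diagonal of r
    Q = -(off.T * off)                           # Q_ij = -r_ji r_ij for i != j
    np.fill_diagonal(Q, (off ** 2).sum(axis=0))  # Q_ii = sum_{s != i} r_si^2
    return Q

def coupling_qp(r):
    """Solve the KKT system (10) for p."""
    k = r.shape[0]
    K = np.zeros((k + 1, k + 1))
    K[:k, :k] = build_Q(r)
    K[:k, k] = K[k, :k] = 1.0                    # the e column and the e^T row
    rhs = np.zeros(k + 1)
    rhs[k] = 1.0
    return np.linalg.solve(K, rhs)[:k]

def coupling_qp_iter(r, sweeps=100):
    """Algorithm 1: repeat update (11), then normalize as in (12).

    A fixed sweep count stands in for the convergence test against (10)."""
    k = r.shape[0]
    Q = build_Q(r)
    p = np.full(k, 1.0 / k)
    for _ in range(sweeps):
        for t in range(k):
            p[t] = (p @ Q @ p - (Q[t] @ p - Q[t, t] * p[t])) / Q[t, t]
            p /= p.sum()
    return p
```

On well-conditioned problems the two routines agree to numerical precision; the direct solve is usually preferable, while the iteration mirrors how (11)-(12) are stated.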

5 Relations Among Four Methods

The four decision rules $\delta_{HT}$, $\delta_1$, $\delta_2$, and $\delta_V$ can all be written as $\arg\max_i [p_i]$, where $\mathbf{p}$ is derived from the following four optimization formulations under the constraints $\sum_{i=1}^{k} p_i = 1$ and $p_i \geq 0$, $\forall i$:

$\delta_{HT}: \quad \min_{\mathbf{p}} \sum_{i=1}^{k} \Big[\sum_{j: j \neq i} \Big(r_{ij} - \frac{k}{2}\, p_i\Big)\Big]^2, \quad (13)$

$\delta_1: \quad \min_{\mathbf{p}} \sum_{i=1}^{k} \Big[\sum_{j: j \neq i} \big(r_{ij}\, p_j - r_{ji}\, p_i\big)\Big]^2, \quad (14)$

$\delta_2: \quad \min_{\mathbf{p}} \sum_{i=1}^{k} \sum_{j: j \neq i} \big(r_{ij}\, p_j - r_{ji}\, p_i\big)^2, \quad (15)$

$\delta_V: \quad \min_{\mathbf{p}} \sum_{i=1}^{k} \sum_{j: j \neq i} \big(I_{\{r_{ij} > r_{ji}\}}\, p_j - I_{\{r_{ji} > r_{ij}\}}\, p_i\big)^2. \quad (16)$

Note that (13) can be easily verified, and that (14) and (15) have been explained in Sections 3 and 4. For (16), its solution is

$p_i = \frac{c}{\sum_{j: j \neq i} I_{\{r_{ji} > r_{ij}\}}},$

where c is the normalizing constant; therefore, $\arg\max_i [p_i]$ is the same as (1). (For $I$ to be well defined, we consider $r_{ij} \neq r_{ji}$, which is generally true. In addition, if there is an i for which $\sum_{j: j \neq i} I_{\{r_{ji} > r_{ij}\}} = 0$, an optimal solution of (16) is $p_i = 1$ and $p_j = 0$, $\forall j \neq i$; the resulting decision is still the same as that of (1).)

Clearly, (13) can be obtained from (14) by letting $p_j \approx 1/k$, $\forall j$, and $r_{ji} \approx 1/2$, $\forall i, j$. Such approximations ignore the differences between the $p_i$. Similarly, (16) comes from (15) by pushing $r_{ij}$ to its extreme values, 0 or 1; as a result, (16) may enlarge the differences between the $p_i$. Next, compared with (15), (14) may tend to underestimate the differences between the $p_i$: (14) allows the differences between $r_{ij} p_j$ and $r_{ji} p_i$ to cancel first. Thus, conceptually, (13) and (16) are the more extreme formulations; the former tends to underestimate the differences between the $p_i$, while the latter overestimates them. These arguments will be supported by simulated and real data in the next section.

6 Experiments

6.1 Simple Simulated Examples

[3] designs a simple experiment in which all $p_i$ are fairly close and their method $\delta_{HT}$ outperforms the voting strategy $\delta_V$. We conduct this experiment first to assess the performance of our proposed methods. As in [3], we define the class probabilities $p_1 = 1.5/k$, $p_j = (1 - p_1)/(k - 1)$, $j = 2, \dots, k$, and then set

$r_{ij} = \frac{p_i}{p_i + p_j} + 0.1\, z_{ij} \quad \text{if } i > j, \quad (17)$

$r_{ji} = 1 - r_{ij} \quad \text{if } j > i, \quad (18)$

where the $z_{ij}$ are standard normal variates. Since the $r_{ij}$ are required to lie within (0, 1), we truncate $r_{ij}$ at $\epsilon$ below and $1 - \epsilon$ above, with $\epsilon = 0.00001$.
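A minimal sketch of this generation process, following (17)-(18) (the function name and rng plumbing are ours):

```python
import numpy as np

def simulate_r(p, eps=0.00001, rng=None):
    """One noisy pairwise matrix r drawn from true class probabilities p."""
    rng = rng or np.random.default_rng()
    k = len(p)
    r = np.zeros((k, k))
    for i in range(k):
        for j in range(i):                                   # pairs with i > j, as in (17)
            noisy = p[i] / (p[i] + p[j]) + 0.1 * rng.standard_normal()
            r[i, j] = np.clip(noisy, eps, 1 - eps)           # keep r_ij inside (0, 1)
            r[j, i] = 1.0 - r[i, j]                          # (18)
    return r
```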

In this example, class 1 has the highest probability and hence is the correct class. Figure 1 shows the accuracy rates of each of the four methods for k = 3, 5, 8, 10, 12, 15, 20, averaged over 1,000 replicates.

[Figure 1: Accuracy of predicting the true class by the methods $\delta_{HT}$ (solid line, cross marked), $\delta_V$ (dashed line, square marked), $\delta_1$ (dotted line, circle marked), and $\delta_2$ (dashed line, asterisk marked) from simulated class probabilities $p_i$, $i = 1, 2, \dots, k$. Panels: (a) balanced $p_i$, (b) unbalanced $p_i$, (c) highly unbalanced $p_i$. Each panel plots the accuracy rate against $\log_2 k$.]

Note that in this experiment all classes are quite competitive, so, when using $\delta_V$, the highest vote sometimes occurs at two or more different classes. We handle this problem by randomly selecting one class from the ties. This partly explains why $\delta_V$ performs poorly. Another explanation is that the $r_{ij}$ here are all close to 1/2, but (16) uses 1 or 0 instead; therefore, the solution may be severely biased. Besides $\delta_V$, the other three rules do very well in this example.

Since $\delta_{HT}$ relies on the approximation $p_i + p_j \approx 2/k$, this rule may suffer some loss if the class probabilities are not highly balanced. To examine this point, we consider the following two sets of class probabilities:

(1) We let $k_1 = k/2$ if k is even, and $(k + 1)/2$ if k is odd; then we define $p_1 = 0.95 \times 1.5/k_1$, $p_i = (0.95 - p_1)/(k_1 - 1)$ for $i = 2, \dots, k_1$, and $p_i = 0.05/(k - k_1)$ for $i = k_1 + 1, \dots, k$.

(2) If k = 3, we define $p_1 = 0.95 \times 1.5/2$, $p_2 = 0.95 - p_1$, and $p_3 = 0.05$. If k > 3, we define $p_1 = 0.475$, $p_2 = p_3 = 0.475/2$, and $p_i = 0.05/(k - 3)$ for $i = 4, \dots, k$.

After setting the $p_i$, we define the pairwise comparisons $r_{ij}$ as in (17)-(18). Both experiments are repeated 1,000 times, and the accuracy rates are shown in Figures 1(b) and 1(c). In both scenarios the $p_i$ are unbalanced and, as expected, $\delta_{HT}$ is quite sensitive to this imbalance. The situation is much worse in Figure 1(c), because the approximation $p_i + p_j \approx 2/k$ is more seriously violated, especially when k is large. In summary, $\delta_1$ and $\delta_2$ are less sensitive to the $p_i$, and their overall performance is fairly stable. All features observed here agree with our analysis in Section 5.
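Putting the earlier sketches together, a small driver in the spirit of the balanced setting of Figure 1(a); it assumes the hypothetical helpers voting_estimate, p_bar, coupling_markov, coupling_qp, and simulate_r sketched above:

```python
import numpy as np

def accuracy(rule, p, trials=1000, rng=None):
    """Fraction of trials in which a rule ranks the true class (index 0) first."""
    rng = rng or np.random.default_rng(0)
    hits = sum(np.argmax(rule(simulate_r(p, rng=rng))) == 0 for _ in range(trials))
    return hits / trials

k = 10
p = np.r_[1.5 / k, np.full(k - 1, (1 - 1.5 / k) / (k - 1))]  # balanced p_i
for name, rule in [("delta_HT", p_bar), ("delta_1", coupling_markov),
                   ("delta_2", coupling_qp), ("delta_V", voting_estimate)]:
    print(name, accuracy(rule, p))
```

One caveat: np.argmax resolves $\delta_V$'s ties by taking the first maximizer rather than a random one, so its numbers will differ slightly from the randomized tie-breaking described above.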

6.2 Real Data

In this section we present experimental results on several multi-class problems: segment, satimage, and letter from the Statlog collection [9], USPS [4], and MNIST [7]. All data sets are available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets. Their numbers of classes are 7, 6, 26, 10, and 10, respectively. From the thousands of instances in each data set, we select 300 for training and 500 for testing. We consider support vector machines (SVM) with the RBF kernel $e^{-\gamma \|x_i - x_j\|^2}$ as the binary classifier. The regularization parameter C and the kernel parameter $\gamma$ are selected by cross-validation: for each training set, a five-fold cross-validation is conducted over the following grid of (C, $\gamma$): $[2^{-5}, 2^{-3}, \dots, 2^{15}] \times [2^{-5}, 2^{-3}, \dots, 2^{15}]$. This is done by modifying LIBSVM [1], a library for SVM.

At each (C, $\gamma$), four folds are used in turn as the training set and one fold as the validation set. The training of the four folds consists of $k(k-1)/2$ binary SVMs. For the binary SVM of the ith and jth classes, using the decision values $\hat{f}$ of the training data, we employ an improved implementation [8] of Platt's posterior probabilities [10] to estimate $r_{ij}$:

$r_{ij} = P(i \mid i \text{ or } j,\ x) = \frac{1}{1 + e^{A\hat{f} + B}}, \quad (19)$

where A and B are estimated by minimizing the negative log-likelihood function. ([10] suggests using $\hat{f}$ from validation data instead of training data; however, this requires a further cross-validation on the four-fold data, so for simplicity we use $\hat{f}$ from the training data directly.) Then, for each validation instance, we apply the four methods to obtain classification decisions, and the error over the five validation sets is the cross-validation error at (C, $\gamma$).

After the cross-validation is done, each rule obtains its best (C, $\gamma$). (If more than one parameter set returns the smallest cross-validation error, we simply choose the one with the smallest C.) Using these parameters, we train on the whole training set to obtain the final model. Next, as in (19), the decision values from the training data are used to find $r_{ij}$, and the testing data are then classified by each of the four rules. Due to the randomness of separating the training data into five folds for finding the best (C, $\gamma$), we repeat the five-fold cross-validation five times and obtain the mean and standard deviation of the testing error. Moreover, as the selection of 300 training and 500 testing instances from a larger data set is also random, we generate five such pairs. In Table 1, each row reports the testing error based on one pair of training and testing sets.

Table 1: Testing errors (in percentage) by the four methods. Each row reports the testing errors based on a pair of the training and testing sets. The mean and std (standard deviation) are from five 5-fold cross-validation procedures used to select the best (C, gamma).

Dataset    k    delta_HT         delta_1          delta_2          delta_V
                mean     std     mean     std     mean     std     mean     std
satimage   6    4.080    1.306   4.600    0.938   4.760    0.784   5.400    0.29
                2.960    0.320   3.400    0.400   3.400    0.400   3.360    0.080
                4.520    0.968   4.760    1.637   3.880    0.392   4.080    0.240
                2.400    0.000   2.200    0.000   2.640    0.294   2.680    1.4
                6.60     0.294   6.400    0.379   6.20     0.299   6.60     0.344
segment    7    9.960    0.480   9.480    0.240   9.000    0.400   8.880    0.27
                6.040    0.528   6.280    0.299   6.200    0.456   6.760    0.445
                6.600    0.000   6.680    0.349   6.920    0.27    7.60     0.96
                5.520    0.466   5.200    0.420   5.400    0.580   5.480    0.588
                7.440    0.625   8.60     0.637   8.040    0.408   7.840    0.344
USPS       10   4.840    0.388   3.520    0.560   2.760    0.233   2.520    0.60
                2.080    0.560   1.440    0.625   1.600    1.08    1.440    0.99
                10.640   0.933   10.000   0.657   9.920    0.483   10.320   0.744
                2.320    0.845   1.960    1.03    1.560    0.784   1.840    1.248
                3.400    0.30    2.640    0.080   2.920    0.299   2.520    0.97
MNIST      10   7.400    0.000   6.560    0.080   5.760    0.96    5.960    0.463
                5.200    0.400   4.600    0.000   3.720    0.588   2.360    0.96
                7.320    1.608   4.280    0.560   3.400    0.657   3.760    0.794
                4.720    0.449   4.60     0.96    3.360    0.686   3.520    0.325
                2.560    0.294   2.600    0.000   3.080    0.560   2.440    0.233
letter     26   39.880   1.42    37.60    1.06    34.560   2.44    33.480   0.325
                41.640   0.463   39.400   0.769   35.920   1.389   33.440   1.06
                41.320   1.700   38.920   0.854   35.800   1.453   35.000   1.066
                35.240   1.439   32.920   1.2     29.240   1.335   27.400   1.7
                43.240   0.637   40.360   1.472   36.960   1.74    34.520   1.00

The results show that when the number of classes k is small, the four methods perform similarly; however, for problems with larger k, $\delta_{HT}$ is less competitive. In particular, for the problem letter, which has 26 classes, $\delta_2$ or $\delta_V$ outperforms $\delta_{HT}$ by at least 5%. It seems that the problems here have characteristics closer to the setting of Figure 1(c) than to that of Figure 1(a).
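For reference, a rough sketch of the conversion (19): it assumes decision values f and 0/1 labels y from one pairwise SVM and fits A, B by minimizing the negative log-likelihood with a generic scipy optimizer, a simplification of the dedicated implementation in [8]:

```python
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(f, y):
    """Return (A, B) minimizing the NLL of P(y=1 | f) = 1 / (1 + exp(A*f + B))."""
    def nll(ab):
        a, b = ab
        q = 1.0 / (1.0 + np.exp(a * f + b))   # modelled P(y = 1 | f)
        q = np.clip(q, 1e-12, 1 - 1e-12)      # guard the logarithms
        return -(y * np.log(q) + (1 - y) * np.log(1 - q)).sum()
    return minimize(nll, x0=np.zeros(2), method="Nelder-Mead").x

def r_from_decision(f, A, B):
    """r_ij of (19) for new decision values f."""
    return 1.0 / (1.0 + np.exp(A * f + B))
```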

All these results agree with the earlier findings in Sections 5 and 6.1. Note that in Table 1 some standard deviations are zero; this means that the best (C, $\gamma$) chosen by the different cross-validation runs were all the same. Overall, the variation in parameter selection due to the randomness of cross-validation is not large.

7 Discussions and Conclusions

As the minimization of the KL distance is a well-known criterion, some may wonder why the performance of $\delta_{HT}$ is not quite satisfactory in some of the examples. One possible explanation is that here the KL distance is derived under the assumptions that $n_{ij} r_{ij} \sim \text{Bin}(n_{ij}, \mu_{ij})$ and that the $r_{ij}$ are independent; however, as pointed out in [3], neither assumption holds in the classification problem.

In conclusion, we have provided two methods which are shown to be more stable than both $\delta_{HT}$ and $\delta_V$. In addition, the two proposed approaches require only the solution of linear systems instead of the nonlinear one in [3].

Acknowledgments

The authors thank S. Sathiya Keerthi for helpful comments.

References

[1] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[2] J. Friedman. Another approach to polychotomous classification. Technical report, Department of Statistics, Stanford University, 1996. Available at http://www-stat.stanford.edu/reports/friedman/poly.ps.Z.

[3] T. Hastie and R. Tibshirani. Classification by pairwise coupling. The Annals of Statistics, 26(1):451-471, 1998.

[4] J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550-554, May 1994.

[5] D. R. Hunter. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 2004. To appear.

[6] S. Knerr, L. Personnaz, and G. Dreyfus. Single-layer learning revisited: a stepwise procedure for building and training a neural network. In J. Fogelman, editor, Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, 1990.

[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998. MNIST database available at http://yann.lecun.com/exdb/mnist/.

[8] H.-T. Lin, C.-J. Lin, and R. C. Weng. A note on Platt's probabilistic outputs for support vector machines. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, 2003.

[9] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Prentice Hall, Englewood Cliffs, N.J., 1994. Data available at http://www.ncc.up.pt/liacc/ML/statlog/datasets.html.

[10] J. Platt. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 2000.

[11] D. Price, S. Knerr, L. Personnaz, and G. Dreyfus. Pairwise neural network classifiers with probabilistic outputs. In G. Tesauro, D. Touretzky, and T. Leen, editors, Neural Information Processing Systems, volume 7, pages 1109-1116. The MIT Press, 1995.

[12] P. Refregier and F. Vallet. Probabilistic approach for multiclass classification with neural networks. In Proceedings of International Conference on Artificial Networks, pages 1003-1007, 1991.