Probability Estimates for Multi-class Classification by Pairwise Coupling


Probability Estimates for Multi-class Classification by Pairwise Coupling

By Ting-Fan Wu, Chih-Jen Lin, and Ruby C. Weng

National Taiwan University and National Chengchi University

Abstract. Pairwise coupling is a popular multi-class classification method that combines all pairwise comparisons between pairs of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and the method of Hastie and Tibshirani [9].

Key words and phrases: pairwise coupling, probability estimates, random forest, support vector machines.

1. Introduction. The multi-class classification problem refers to assigning each observation to one of $k$ classes. As two-class problems are much easier to solve, many authors propose to use two-class classifiers for multi-class classification. In this paper we focus on techniques that provide a multi-class solution by combining all pairwise comparisons.

A common way to combine pairwise comparisons is by voting [12, 8]: one constructs a rule for discriminating between every pair of classes and then selects the class with the most winning two-class decisions. Though the voting procedure requires only pairwise decisions, it predicts nothing but a class label. In many scenarios, however, probability estimates are desired. As numerous (pairwise) classifiers do provide class probabilities, several authors [19, 18, 9] have proposed probability estimates obtained by combining the pairwise class probabilities.

Given the observation $x$ and the class label $y$, we assume that estimates $r_{ij}$ of the pairwise class probabilities $\mu_{ij} = P(y = i \mid y = i \text{ or } j,\ x)$ are available; the $r_{ij}$ are obtained by some binary classifiers. The goal is then to estimate $\{p_i\}_{i=1}^k$, where $p_i = P(y = i \mid x)$, $i = 1, \ldots, k$.

Ting-Fan Wu and Chih-Jen Lin are with the Department of Computer Science, National Taiwan University, Taipei 106, Taiwan. Ruby C. Weng is with the Department of Statistics, National Chengchi University, Taipei 116, Taiwan.

We propose to obtain an approximate solution to an identity, and then select the label with the highest estimated class probability. The existence of the solution is guaranteed by the theory of finite Markov chains. Motivated by the optimization formulation of this method, we propose a second approach. Interestingly, it can also be regarded as an improved version of the coupling approach given by [19]. Both of the proposed methods reduce to solving linear systems and are simple to implement. Furthermore, from conceptual and experimental points of view, we show that the two proposed methods are more stable than voting and the method in [9].

We organize the paper as follows. In Section 2, we review several existing methods. Sections 3 and 4 detail the two proposed approaches. Section 5 presents the relationship among the different methods through their corresponding optimization formulations. In Section 6, we compare these methods using simulated data, and in Section 7 we conduct experiments using real data. The binary classifiers considered are support vector machines and random forests. A preliminary version of this paper was presented in [22].

2. Survey of Existing Methods

2.1. Voting. Let $r_{ij}$ be the estimates of $\mu_{ij} = p_i/(p_i + p_j)$. The voting rule [12, 8] is

(2.1)    \delta_V = \arg\max_i \Big[ \sum_{j:\, j \neq i} I_{\{r_{ij} > r_{ji}\}} \Big].

A simple estimate of the probabilities can be derived as

    p_i^v = 2 \sum_{j:\, j \neq i} I_{\{r_{ij} > r_{ji}\}} \Big/ \big(k(k-1)\big).
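To make the notation concrete, the following small sketch (ours, in Python/NumPy; the paper's own experiments used MATLAB) computes the voting decision (2.1) and the simple estimate $p^v$ from a matrix `r` with `r[i, j]` holding $r_{ij}$; ties are resolved here by the first maximal index rather than randomly.

```python
import numpy as np

def voting_estimate(r):
    """Voting rule (2.1) and the simple estimate p^v from pairwise probabilities.

    r : (k, k) array with r[i, j] = r_ij (estimate of P(y=i | y=i or j, x));
        the diagonal is ignored and r[i, j] + r[j, i] is assumed to equal 1.
    """
    k = r.shape[0]
    wins = (r > r.T).sum(axis=1)          # number of pairwise wins of each class
    p_v = 2.0 * wins / (k * (k - 1))      # simple probability estimate
    return int(np.argmax(wins)), p_v

# toy example with k = 3
r = np.array([[0.0, 0.6, 0.7],
              [0.4, 0.0, 0.8],
              [0.3, 0.2, 0.0]])
label, p_v = voting_estimate(r)           # label = 0, p_v = [2/3, 1/3, 0]
```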

2.2. Method by Refregier and Vallet. In [19], the authors observe that

(2.2)    \frac{r_{ij}}{r_{ji}} \approx \frac{\mu_{ij}}{\mu_{ji}} = \frac{p_i}{p_j}.

Thus, turning (2.2) into equalities is one possible way to solve for the $p_i$. However, the number of equations, $k(k-1)/2$, exceeds the number of unknowns $k$, so [19] proposes to choose any $k-1$ of the $r_{ij}$. Then, with the condition $\sum_{i=1}^k p_i = 1$, $p$ can be obtained by solving a linear system. However, as pointed out in [18], the results strongly depend on the selection of the $k-1$ $r_{ij}$. In Section 4, by considering (2.2) as well, we propose a method which remedies this problem.

2.3. Method by Price, Knerr, Personnaz, and Dreyfus. In [18], the authors use the identity

    \sum_{j:\, j \neq i} p(y = i \text{ or } j \mid x) - (k-2)\, p(y = i \mid x) = \sum_{j=1}^k p(y = j \mid x) = 1.

Using $r_{ij} \approx \mu_{ij} = p(y = i \mid x)/p(y = i \text{ or } j \mid x)$, one obtains

(2.3)    p_i \approx \frac{1}{\sum_{j:\, j \neq i} 1/r_{ij} - (k-2)}.

As $\sum_{i=1}^k p_i = 1$ does not hold, we must normalize $p$. This approach is very simple and easy to implement. In the rest of this paper we refer to it as PKPD.

2.4. Method by Hastie and Tibshirani. In [9], the authors propose to minimize the Kullback-Leibler (KL) distance between $r_{ij}$ and $\mu_{ij}$,

(2.4)    l(p) = \sum_{i \neq j} n_{ij}\, r_{ij} \log \frac{r_{ij}}{\mu_{ij}},

where $\mu_{ij} = p_i/(p_i + p_j)$ and $n_{ij}$ is the number of training data in the $i$th or $j$th class. To find a minimizer of (2.4), they first calculate

(2.5)    \frac{\partial l(p)}{\partial p_i} = \sum_{j:\, j \neq i} n_{ij} \Big( -\frac{r_{ij}}{p_i} + \frac{1}{p_i + p_j} \Big).

Thus, setting $\nabla l(p) = 0$, [9] proposes to find a point satisfying

(2.6)    \sum_{j:\, j \neq i} n_{ij}\, \mu_{ij} = \sum_{j:\, j \neq i} n_{ij}\, r_{ij}, \quad \sum_{i=1}^k p_i = 1, \quad p_i > 0,\ i = 1, \ldots, k.

Such a point is obtained by the following algorithm.

Algorithm 1.
1. Start with some initial $p_i > 0$ for all $i$ and the corresponding $\mu_{ij} = p_i/(p_i + p_j)$.
2. Repeat ($i = 1, \ldots, k, 1, \ldots$):

(2.7)    \alpha = \frac{\sum_{j:\, j \neq i} n_{ij} r_{ij}}{\sum_{j:\, j \neq i} n_{ij} \mu_{ij}},

(2.8)    \mu_{ij} \leftarrow \frac{\alpha \mu_{ij}}{\alpha \mu_{ij} + \mu_{ji}}, \quad \mu_{ji} \leftarrow 1 - \mu_{ij} \quad \text{for all } j \neq i, \qquad p_i \leftarrow \alpha p_i,

(2.9)    normalize $p$ (optional),

   until $k$ consecutive values of $\alpha$ are all close to one under some stopping criterion.
3. $p \leftarrow p / \sum_{i=1}^k p_i$.
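For illustration, here is a compact sketch of Algorithm 1 (our Python/NumPy rendering, not the authors' implementation); it uses equal weights $n_{ij}$ by default and the simple stopping test that all $\alpha$ within one sweep are close to one.

```python
import numpy as np

def coupling_ht(r, n=None, tol=1e-8, max_sweeps=1000):
    """Hastie-Tibshirani coupling (Algorithm 1): find p satisfying (2.6),
    i.e. sum_j n_ij * mu_ij = sum_j n_ij * r_ij with mu_ij = p_i / (p_i + p_j)."""
    k = r.shape[0]
    if n is None:
        n = np.ones((k, k))                     # unweighted case: all n_ij equal
    p = np.full(k, 1.0 / k)
    mu = p[:, None] / (p[:, None] + p[None, :])
    off = ~np.eye(k, dtype=bool)
    for _ in range(max_sweeps):
        alphas = []
        for i in range(k):
            j = off[i]
            alpha = np.dot(n[i, j], r[i, j]) / np.dot(n[i, j], mu[i, j])   # (2.7)
            mu[i, j] = alpha * mu[i, j] / (alpha * mu[i, j] + mu[j, i])    # (2.8)
            mu[j, i] = 1.0 - mu[i, j]
            p[i] *= alpha
            alphas.append(alpha)
        if np.all(np.abs(np.array(alphas) - 1.0) < tol):
            break
    return p / p.sum()                          # step 3
```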

There are several remarks about this algorithm. First, the initial $p_i$ must be positive so that all later $p_i$ remain positive and $\alpha$ is well defined. Second, (2.9) is optional because whether we normalize $p$ or not does not affect the values of $\mu_{ij}$ and $\alpha$ in (2.7) and (2.8).

[9] proves that Algorithm 1 generates a sequence of points at which the KL distance is strictly decreasing. However, as indicated in [11], the strict decrease of $l(p)$ alone does not guarantee that a limit point satisfies (2.6). [11] discusses the convergence of algorithms for generalized Bradley-Terry models, of which Algorithm 1 is a special case. It points out that [23] proved that, for any initial point, the whole sequence generated by Algorithm 1 converges to a point satisfying (2.6), and that this point is the unique global minimum of $l(p)$ under the constraints $\sum_{i=1}^k p_i = 1$ and $p_i \ge 0$, $i = 1, \ldots, k$.

Let $p$ denote this global minimum of $l(p)$. It is shown in [9] that $p$ satisfies

(2.10)    p_i > p_j \quad \text{if and only if} \quad \tilde p_i = \frac{2 \sum_{s:\, s \neq i} r_{is}}{k(k-1)} > \tilde p_j = \frac{2 \sum_{s:\, s \neq j} r_{js}}{k(k-1)}.

Therefore, the $\tilde p_i$ are sufficient if one only requires the classification rule. In fact, as pointed out by [9], $\tilde p$ can be derived as an approximation to the identity

(2.11)    p_i = \sum_{j:\, j \neq i} \Big( \frac{p_i + p_j}{k-1} \Big) \Big( \frac{p_i}{p_i + p_j} \Big) = \sum_{j:\, j \neq i} \Big( \frac{p_i + p_j}{k-1} \Big) \mu_{ij}

by replacing $p_i + p_j$ with $2/k$ and $\mu_{ij}$ with $r_{ij}$.

In the next two sections, we propose two methods which are simpler in both practical implementation and algorithmic analysis.

3. Our First Approach. Note that $\delta_{HT}$ is essentially

(3.1)    \arg\max_i\, [\tilde p_i],

and $\tilde p$ is an approximate solution to (2.11). Instead of replacing $p_i + p_j$ by $2/k$, in this section we propose to solve the system

(3.2)    p_i = \sum_{j:\, j \neq i} \Big( \frac{p_i + p_j}{k-1} \Big) r_{ij} \ \ \forall i, \quad \text{subject to} \quad \sum_{i=1}^k p_i = 1, \ p_i \ge 0 \ \forall i.

Let $p$ denote the solution to (3.2). The resulting decision rule is $\delta_1 = \arg\max_i\, [p_i]$. As $\delta_{HT}$ relies on the approximation $p_i + p_j \approx 2/k$, in Section 6 we use two examples to illustrate possible problems with that rule.

3.1. Solving (3.2). To solve (3.2), we rewrite it as

(3.3)    Qp = p, \quad \sum_{i=1}^k p_i = 1, \quad p_i \ge 0 \ \forall i, \qquad \text{where} \quad Q_{ij} = \begin{cases} r_{ij}/(k-1) & \text{if } i \neq j, \\ \sum_{s:\, s \neq i} r_{is}/(k-1) & \text{if } i = j. \end{cases}

Observe that $\sum_{i=1}^k Q_{ij} = 1$ for $j = 1, \ldots, k$ (each column of $Q$ sums to one) and $0 \le Q_{ij} \le 1$, so $Q^T$ is the transition matrix of a finite Markov chain. Moreover, if $r_{ij} > 0$ for all $i \neq j$, then $Q_{ij} > 0$, which implies that this Markov chain is irreducible and aperiodic. By the theory of finite Markov chains (see [20]), these conditions guarantee the existence of a unique stationary probability distribution, with all states positive recurrent. Hence we have the following theorem:

Theorem 1. If $r_{ij} > 0$ for all $i \neq j$, then (3.3) has a unique solution $p$ with $0 < p_i < 1$ for all $i$.

Note that even without the constraints $p_i \ge 0$, the linear system

(3.4)    Qp = p, \quad \sum_{i=1}^k p_i = 1

still has the same unique solution. Therefore, unlike the method of Section 2.4, for which a special iterative procedure has to be implemented, here we only have to solve a simple linear system. In practice we drop any one equality from $Qp = p$, so that (3.3) yields a square linear system that can be solved by standard Gaussian elimination. Since each column of $Q$ sums to one, it can be verified that this square system and the $k+1$ equations of (3.4) have the same unique solution. Alternatively, as the stationary distribution of a Markov chain is the limit of its $n$-step transition probabilities, $p$ can also be obtained by repeatedly multiplying any initial probability vector by $Q$.
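A minimal sketch of this first approach (ours, not the paper's code): build $Q$ as in (3.3), replace one equation of $Qp = p$ by the normalization constraint, and solve the resulting square system by Gaussian elimination.

```python
import numpy as np

def coupling_markov(r):
    """First approach: solve Qp = p, sum(p) = 1 with Q as defined in (3.3)."""
    k = r.shape[0]
    Q = r / (k - 1.0)
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, Q.sum(axis=1))    # Q_ii = sum_{s != i} r_is / (k - 1)
    A = Q - np.eye(k)                     # (Q - I) p = 0
    A[-1, :] = 1.0                        # replace the last equation by sum(p) = 1
    b = np.zeros(k)
    b[-1] = 1.0
    # alternative: power iteration, repeatedly setting p <- Q @ p
    return np.linalg.solve(A, b)
```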

3.2. Another Look at (3.2). The following argument shows that the solution to (3.2) is the global minimum of a meaningful optimization problem. To begin, we re-express $Qp = p$ of (3.3) as

(3.5)    \sum_{j:\, j \neq i} r_{ji} p_i - \sum_{j:\, j \neq i} r_{ij} p_j = 0, \quad i = 1, \ldots, k,

using the property that $r_{ij} + r_{ji} = 1$ for $i \neq j$. Therefore, a solution of (3.3) is in fact the unique global minimum of the convex problem

(3.6)    \min_p \ \sum_{i=1}^k \Big( \sum_{j:\, j \neq i} (r_{ji} p_i - r_{ij} p_j) \Big)^2 \quad \text{subject to} \quad \sum_{i=1}^k p_i = 1, \ p_i \ge 0, \ i = 1, \ldots, k.

The reason is that the objective function is always nonnegative, and it attains zero under (3.3). Again, the constraints $p_i \ge 0$ are not necessary.

4. Our Second Approach. Note that both approaches in Sections 2.4 and 3 involve solving optimization problems built from relations such as $p_i/(p_i + p_j) \approx r_{ij}$ or $r_{ji} p_i \approx r_{ij} p_j$. Motivated by (3.6), we suggest another optimization formulation:

(4.1)    \min_p \ \frac{1}{2} \sum_{i=1}^k \sum_{j:\, j \neq i} (r_{ji} p_i - r_{ij} p_j)^2 \quad \text{subject to} \quad \sum_{i=1}^k p_i = 1, \ p_i \ge 0 \ \forall i.

Note that the method of [19] described in Section 2.2 considers a selection of only $k-1$ equations of the form $r_{ji} p_i = r_{ij} p_j$. As (4.1) considers all relations $r_{ij} p_j \approx r_{ji} p_i$, not just $k-1$ of them, it can be viewed as an improved version of [19]. Let $p$ denote the corresponding solution. We then define the classification rule as $\delta_2 = \arg\max_i\, [p_i]$.

4.1. A Linear System from (4.1). Since (3.6) has a unique solution which can be obtained by solving a simple linear system, it is natural to ask whether the minimization problem (4.1) shares these properties. In this subsection we show that (4.1) can also be solved via a simple linear system and that, under a mild condition, it has a unique solution. First, the following theorem shows that the nonnegativity constraints in (4.1) are redundant.

Theorem 2. Problem (4.1) is equivalent to

(4.2)    \min_p \ \frac{1}{2} \sum_{i=1}^k \sum_{j:\, j \neq i} (r_{ji} p_i - r_{ij} p_j)^2 \quad \text{subject to} \quad \sum_{i=1}^k p_i = 1.

The proof is in Appendix A. The objective function of (4.2) can be rewritten as

(4.3)    \min_p \ p^T Q p, \qquad \text{where}

(4.4)    Q_{ij} = \begin{cases} \sum_{s:\, s \neq i} r_{si}^2 & \text{if } i = j, \\ -r_{ji} r_{ij} & \text{if } i \neq j. \end{cases}

From (4.4), $Q$ is positive semi-definite, since for any $v$, $v^T Q v = \frac{1}{2} \sum_{i=1}^k \sum_{j=1}^k (r_{ji} v_i - r_{ij} v_j)^2 \ge 0$. Therefore, without the constraints $p_i \ge 0$, (4.3) is a linear-equality-constrained convex quadratic programming problem.

Consequently, a point $p$ is a global minimum if and only if it satisfies the Karush-Kuhn-Tucker (KKT) optimality condition: there is a scalar $b$ such that

(4.5)    \begin{bmatrix} Q & e \\ e^T & 0 \end{bmatrix} \begin{bmatrix} p \\ b \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.

Here $e$ is the $k \times 1$ vector of all ones, $0$ is the $k \times 1$ vector of all zeros, and $b$ is, up to a constant factor, the Lagrange multiplier of the equality constraint $\sum_{i=1}^k p_i = 1$. Thus, the solution to (4.1) can be obtained by solving the simple linear system (4.5).

4.2. Solving (4.5). The system (4.5) can be solved by direct methods in numerical linear algebra. Theorem 3(i) below shows that the matrix in (4.5) is invertible; therefore, Gaussian elimination can be applied directly. Though the matrix in (4.5) is symmetric, we cannot apply Cholesky factorization to save computational time, as this matrix is not positive definite. However, if $Q$ is positive definite, Cholesky factorization can be used to obtain $b = -1/(e^T Q^{-1} e)$ first and then $p = -b\, Q^{-1} e$. Theorem 3(ii) shows that $Q$ is positive definite under quite general conditions. Even if $Q$ is only positive semi-definite, Theorem 3(i) proves that $Q + a e e^T$ is positive definite for any constant $a > 0$. Along with the fact that (4.5) is equivalent to

(4.6)    \begin{bmatrix} Q + e e^T & e \\ e^T & 0 \end{bmatrix} \begin{bmatrix} p \\ b \end{bmatrix} = \begin{bmatrix} e \\ 1 \end{bmatrix},

we can do a Cholesky factorization of $Q + e e^T$ and solve for $b$ and $p$ similarly, whether or not $Q$ is positive definite.

Theorem 3. If $r_{tu} > 0$ for all $t \neq u$, we have:
(i) For any $a > 0$, $Q + a e e^T$ is positive definite. In addition, $\begin{bmatrix} Q & e \\ e^T & 0 \end{bmatrix}$ is invertible, and hence (4.1) has a unique global minimum.
(ii) If for any $i = 1, \ldots, k$ there are $s \neq j$ with $s \neq i$ and $j \neq i$ such that

(4.7)    \frac{r_{si}}{r_{is}} \neq \frac{r_{sj}}{r_{js}} \cdot \frac{r_{ji}}{r_{ij}},

then $Q$ is positive definite.

We leave the proof to Appendix B.
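The direct method takes only a few lines of code; the sketch below (ours) builds $Q$ from (4.4) and solves the bordered system (4.5) with a general-purpose solver instead of the Cholesky-based variants discussed above.

```python
import numpy as np

def coupling_quadratic(r):
    """Second approach: build Q as in (4.4) and solve the KKT system (4.5)."""
    k = r.shape[0]
    off = ~np.eye(k, dtype=bool)
    Q = np.where(off, -r.T * r, 0.0)                               # Q_ij = -r_ji r_ij, i != j
    np.fill_diagonal(Q, (np.where(off, r, 0.0) ** 2).sum(axis=0))  # Q_ii = sum_{s != i} r_si^2
    A = np.zeros((k + 1, k + 1))
    A[:k, :k] = Q
    A[:k, k] = 1.0                                                 # e
    A[k, :k] = 1.0                                                 # e^T
    b = np.zeros(k + 1)
    b[k] = 1.0
    sol = np.linalg.solve(A, b)                                    # [p; b] of (4.5)
    return sol[:k]
```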

In addition to direct methods, we next propose a simple iterative method for solving (4.5):

Algorithm 2.
1. Start with some initial $p_i \ge 0$ for all $i$ with $\sum_{i=1}^k p_i = 1$.
2. Repeat ($t = 1, \ldots, k, 1, \ldots$):

(4.8)    p_t \leftarrow \frac{1}{Q_{tt}} \Big[ -\sum_{j:\, j \neq t} Q_{tj} p_j + p^T Q p \Big],

(4.9)    normalize $p$,

   until (4.5) is satisfied.

The definition (4.4) and the assumption $r_{ij} > 0$ for $i \neq j$ ensure that the right-hand side of (4.8) is always nonnegative. Moreover, $\sum_{i=1}^k p_i > 0$, so (4.9) is well defined; see (C.3) for an explanation. Note that $b = -p^T Q p$ can be obtained from (4.5), and (4.8) is motivated by the $t$th equality in (4.5) with $b$ replaced by $-p^T Q p$. The convergence of Algorithm 2 is established in the following theorem:

Theorem 4. If $r_{sj} > 0$ for all $s \neq j$ and $\{p^i\}_{i=1}^\infty$ is the sequence generated by Algorithm 2, then any convergent subsequence converges to a global minimum of (4.1).

As Theorem 3 indicates that $Q$ is in general positive definite, the sequence $\{p^i\}_{i=1}^\infty$ from Algorithm 2 usually converges globally to the unique minimum of (4.1).
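A sketch of Algorithm 2 (ours, not the authors' code); it performs the cyclic updates (4.8)-(4.9) and stops when the KKT residual of (4.5) is small.

```python
import numpy as np

def coupling_iterative(r, tol=1e-10, max_sweeps=1000):
    """Algorithm 2: coordinate updates (4.8)-(4.9) for min p'Qp, sum(p) = 1."""
    k = r.shape[0]
    off = ~np.eye(k, dtype=bool)
    Q = np.where(off, -r.T * r, 0.0)                               # as in (4.4)
    np.fill_diagonal(Q, (np.where(off, r, 0.0) ** 2).sum(axis=0))
    p = np.full(k, 1.0 / k)
    for _ in range(max_sweeps):
        for t in range(k):
            pQp = p @ Q @ p
            p[t] = (-Q[t] @ p + Q[t, t] * p[t] + pQp) / Q[t, t]    # (4.8)
            p /= p.sum()                                           # (4.9)
        b = -(p @ Q @ p)
        if np.max(np.abs(Q @ p + b)) < tol:                        # residual of (4.5)
            break
    return p
```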

5. Relations Among Different Methods. Among the methods discussed in this paper, the four decision rules $\delta_{HT}$, $\delta_1$, $\delta_2$, and $\delta_V$ can all be written as $\arg\max_i\, [p_i]$, where $p$ is derived from the following four optimization formulations under the constraints $\sum_{i=1}^k p_i = 1$ and $p_i \ge 0$ for all $i$:

(5.1)    \delta_{HT}: \quad \min_p \ \sum_{i=1}^k \Big[ \sum_{j:\, j \neq i} \Big( \frac{r_{ij}}{k} - \frac{p_i}{2} \Big) \Big]^2,

(5.2)    \delta_1: \quad \min_p \ \sum_{i=1}^k \Big[ \sum_{j:\, j \neq i} (r_{ij} p_j - r_{ji} p_i) \Big]^2,

(5.3)    \delta_2: \quad \min_p \ \sum_{i=1}^k \sum_{j:\, j \neq i} (r_{ij} p_j - r_{ji} p_i)^2,

(5.4)    \delta_V: \quad \min_p \ \sum_{i=1}^k \sum_{j:\, j \neq i} \big( I_{\{r_{ij} > r_{ji}\}} p_j - I_{\{r_{ji} > r_{ij}\}} p_i \big)^2.

Note that (5.1) can be easily verified, and that (5.2) and (5.3) have been explained in Sections 3 and 4. For (5.4), its solution is

(5.5)    p_i = \frac{c}{\sum_{j:\, j \neq i} I_{\{r_{ji} > r_{ij}\}}},

where $c$ is the normalizing constant; therefore, $\arg\max_i\, [p_i]$ is the same as (2.1) (see footnote 3). The detailed derivation of (5.5) is in Appendix D.

Clearly, (5.1) can be obtained from (5.2) by setting $p_j \approx 1/k$ and $r_{ji} \approx 1/2$. Such approximations ignore the differences between the $p_i$. Next, (5.4) is obtained from (5.3) with $r_{ij}$ replaced by $I_{\{r_{ij} > r_{ji}\}}$; hence (5.4) may enlarge the differences between the $p_i$. Moreover, compared with (5.3), (5.2) allows the differences between $r_{ij} p_j$ and $r_{ji} p_i$ to cancel first, so (5.2) may tend to underestimate the differences between the $p_i$. In conclusion, (5.1) and (5.4) are conceptually the two extremes: the former tends to underestimate the differences between the $p_i$, while the latter tends to overestimate them. These arguments will be supported by simulated and real data in the next two sections.

For the PKPD approach (2.3), the decision rule can be written as $\delta_{PKPD} = \arg\min_i \big[ \sum_{j:\, j \neq i} 1/r_{ij} \big]$. This form looks similar to $\delta_{HT} = \arg\max_i \big[ \sum_{j:\, j \neq i} r_{ij} \big]$, which can be obtained from (2.10) and (3.1). Notice that the differences among the sums $\sum_{j:\, j \neq i} 1/r_{ij}$ tend to be larger than those among $\sum_{j:\, j \neq i} r_{ij}$, because $1/r_{ij} > 1 > r_{ij}$. More discussion of these two rules is given in Section 6.

Footnote 3: For $I_{\{r_{ij} > r_{ji}\}}$ to be well defined, we assume $r_{ij} \neq r_{ji}$, which is generally true. In addition, if there is an $i$ for which $\sum_{j:\, j \neq i} I_{\{r_{ji} > r_{ij}\}} = 0$, an optimal solution of (5.4) is $p_i = 1$ and $p_j = 0$ for $j \neq i$; the resulting decision is the same as that of (2.1).

6. Experiments on Synthetic Data. In this section, we use synthetic data to compare the performance of the existing methods described in Section 2 and the two new approaches proposed in Sections 3 and 4. We do not include the method of Section 2.2 because its results strongly depend on the choice of the $k-1$ $r_{ij}$, and our second method is an improved version of it.

[9] designs a simple experiment in which all $p_i$ are fairly close and their method $\delta_{HT}$ outperforms the voting strategy $\delta_V$. We conduct this experiment first to assess the performance of our proposed methods. As in [9], we define class probabilities $p_1 = 1.5/k$, $p_j = (1 - p_1)/(k - 1)$, $j = 2, \ldots, k$, and then set

(6.1)    r_{ij} = \frac{p_i}{p_i + p_j} + 0.1\, z_{ij} \quad \text{if } i > j,

(6.2)    r_{ji} = 1 - r_{ij} \quad \text{if } j > i,

where the $z_{ij}$ are standard normal variates. Since the $r_{ij}$ are required to lie in $(0, 1)$, we truncate them at $\epsilon$ below and $1 - \epsilon$ above, with $\epsilon = 10^{-7}$.
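One replicate of this simulation can be generated as follows (our sketch; the constants $0.1$ and $\epsilon = 10^{-7}$ are those given above).

```python
import numpy as np

def simulate_r(p, eps=1e-7, rng=None):
    """One replicate of the pairwise probabilities r_ij per (6.1)-(6.2)."""
    rng = np.random.default_rng() if rng is None else rng
    k = len(p)
    r = np.zeros((k, k))
    for i in range(k):
        for j in range(i):                                          # i > j
            z = rng.standard_normal()
            r[i, j] = np.clip(p[i] / (p[i] + p[j]) + 0.1 * z, eps, 1.0 - eps)
            r[j, i] = 1.0 - r[i, j]                                  # (6.2)
    return r

# first setting of Section 6: p_1 = 1.5/k, the remaining classes equal
k = 8
p = np.full(k, (1.0 - 1.5 / k) / (k - 1))
p[0] = 1.5 / k
r = simulate_r(p)
```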

In this example, class 1 has the highest probability and hence is the correct class. Figure 1(a) shows the accuracy rates of the five methods for $k = \lfloor 2^2 \rfloor, \lfloor 2^{2.5} \rfloor, \lfloor 2^3 \rfloor, \ldots, \lfloor 2^7 \rfloor$, where $\lfloor x \rfloor$ denotes the largest integer not exceeding $x$. The accuracy rates are averaged over 1,000 replicates. Note that in this experiment all classes are quite competitive, so, when using $\delta_V$, the highest vote sometimes occurs at two or more classes. We handle this by randomly selecting one class from the ties. This partly explains why $\delta_V$ performs poorly. Another explanation is that the $r_{ij}$ here are all close to $1/2$, but (5.4) uses 1 or 0 instead, as noted in the previous section; therefore, the solution may be severely biased. Apart from $\delta_V$, the other four rules do very well in this example.

Since $\delta_{HT}$ relies on the approximation $p_i + p_j \approx 2/k$, this rule may suffer some loss if the class probabilities are not highly balanced. To examine this point, we consider the following two sets of class probabilities:

(1) We let $k_1 = k/2$ if $k$ is even and $(k+1)/2$ if $k$ is odd; then we define $p_1 = 0.95 \times 1.5/k_1$, $p_i = (0.95 - p_1)/(k_1 - 1)$ for $i = 2, \ldots, k_1$, and $p_i = 0.05/(k - k_1)$ for $i = k_1 + 1, \ldots, k$.

(2) We define $p_1 = 0.95 \times 1.5/2$, $p_2 = 0.95 - p_1$, and $p_i = 0.05/(k-2)$, $i = 3, \ldots, k$.

After setting the $p_i$, we define the pairwise comparisons $r_{ij}$ as in (6.1)-(6.2). Both experiments are repeated 1,000 times. The accuracy rates are shown in Figures 1(b) and 1(c). In both scenarios the $p_i$ are not balanced and, as expected, $\delta_{HT}$ is quite sensitive to this imbalance. The situation is much worse in Figure 1(c) because the approximation $p_i + p_j \approx 2/k$ is more seriously violated, especially when $k$ is large. A further analysis of Figure 1(c) shows that, when $k$ is large,

    r_{12} \approx \tfrac{3}{4} + 0.1 z_{12}, \quad r_{1j} \approx 1 + 0.1 z_{1j},\ j \ge 3,
    r_{21} \approx \tfrac{1}{4} + 0.1 z_{21}, \quad r_{2j} \approx 1 + 0.1 z_{2j},\ j \ge 3,
    r_{ij} \approx \tfrac{1}{2} + 0.1 z_{ij}, \quad i \neq j,\ i, j \ge 3,

where $z_{ji} = -z_{ij}$ are standard normal variates. From (2.10), the decision rule $\delta_{HT}$ in this case mainly compares $\sum_{j:\, j \neq 1} r_{1j}$ with $\sum_{j:\, j \neq 2} r_{2j}$.

The difference between these two sums is approximately

    \tfrac{1}{2} + 0.1 \Big( \sum_{j:\, j \neq 1} z_{1j} - \sum_{j:\, j \neq 2} z_{2j} \Big).

Therefore, when $k$ is large, the decision depends strongly on these normal variates, and the probability of choosing the first class approaches one half. On the other hand, $\delta_{PKPD}$ relies on comparing $\sum_{j:\, j \neq 1} 1/r_{1j}$ with $\sum_{j:\, j \neq 2} 1/r_{2j}$. As the difference between $1/r_{21}$ and $1/r_{12}$ is larger than that between $r_{12}$ and $r_{21}$, the situation is less serious, though the accuracy rates still decline as $k$ increases.

We also analyze the mean squared error (MSE) in Figure 2:

(6.3)    \text{MSE} = \frac{1}{1000} \sum_{j=1}^{1000} \sum_{i=1}^{k} (\hat p_i^{\,j} - p_i)^2,

where $\hat p^{\,j}$ is the probability estimate obtained in the $j$th of the 1,000 replicates. Overall, $\delta_{HT}$ and $\delta_V$ have higher MSE, confirming again that they are less stable. In summary, $\delta_1$ and $\delta_2$ are less sensitive to the configuration of the $p_i$, and their overall performance is fairly stable. All observations about $\delta_{HT}$, $\delta_1$, $\delta_2$, and $\delta_V$ here agree with our analysis in Section 5. Despite some similarity to $\delta_{HT}$, $\delta_{PKPD}$ outperforms $\delta_{HT}$ in general. The experiments in this section were done using MATLAB.

7. Experiments on Real Data. In this section we present experimental results on several multi-class problems: dna, satimage, segment, and letter from the Statlog collection [16], waveform from the UCI Machine Learning Repository [1], USPS [10], and MNIST [13]. The numbers of classes and features are reported in Table 1.

Table 1. Data set statistics
dataset       dna   waveform   satimage   segment   USPS   MNIST   letter
#class          3          3          6         7     10      10       26
#attribute    180         21         36        19    256     784       16

Except for dna, whose attributes take the two values 0 and 1, each attribute of all other data sets is scaled to $[-1, 1]$. For each scaled data set, we randomly select 300 training and 500 testing instances from the thousands of available data points. Twenty such selections are generated and the testing error rates are averaged. Similarly, we conduct experiments on larger sets (800 training and 1,000 testing instances). All training and testing sets used are available at svmprob/data.
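For reference, the error measure (6.3) used in Figure 2, and reused in Section 7.1 with the true $p$ replaced by a one-hot class indicator (the Brier score), amounts to the following (our sketch).

```python
import numpy as np

def mse(p_hat, p_true):
    """MSE of (6.3): mean over replicates of the squared error summed over classes.

    p_hat  : (n_replicates, k) array of probability estimates.
    p_true : (k,) true probabilities; for the Brier score of Section 7.1,
             pass the one-hot indicator of the true class instead.
    """
    return float(np.mean(np.sum((np.asarray(p_hat) - np.asarray(p_true)) ** 2, axis=1)))
```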

Fig. 1. Accuracy of predicting the true class by the methods $\delta_{HT}$ (solid line, cross marked), $\delta_V$ (dashed line, square marked), $\delta_1$ (dotted line, circle marked), $\delta_2$ (dashed line, asterisk marked), and $\delta_{PKPD}$ (dash-dot line, diamond marked). Panels (a)-(c) correspond to the three settings of Section 6; the horizontal axis is $\log_2 k$.

Fig. 2. MSE of the same five methods, with the same line styles as in Figure 1. Panels (a)-(c) correspond to the three settings of Section 6; the horizontal axis is $\log_2 k$.

7.1. SVM as the Binary Classifier. We first consider support vector machines (SVM) [2, 6] with the RBF kernel $e^{-\gamma \|x_i - x_j\|^2}$ as the binary classifier. The regularization parameter $C$ and the kernel parameter $\gamma$ are selected by cross-validation. To begin, for each training set, a five-fold cross-validation is conducted over the following grid of $(C, \gamma)$: $[2^{-5}, 2^{-3}, \ldots, 2^{15}] \times [2^{-5}, 2^{-3}, \ldots, 2^{15}]$. This is done by modifying LIBSVM [5], a library for SVM. At each $(C, \gamma)$, four folds are used in turn as the training set and the remaining fold as the validation set.

The training on the four folds consists of $k(k-1)/2$ binary SVMs. For the binary SVM of the $i$th and $j$th classes, we employ an improved implementation [15] of Platt's posterior probabilities [17] to estimate $r_{ij}$:

(7.1)    r_{ij} = P(i \mid i \text{ or } j,\ x) \approx \frac{1}{1 + e^{A \hat f + B}},

where $A$ and $B$ are estimated by minimizing the negative log-likelihood function, and $\hat f$ are the decision values of the training data (see footnote 4). Next, for each instance in the validation set, we apply the pairwise coupling methods to obtain classification decisions. The error over the five validation sets is the cross-validation error at $(C, \gamma)$. From this, each rule obtains its best $(C, \gamma)$ (see footnote 5). Then, the decision values from the five-fold cross-validation at the best $(C, \gamma)$ are employed in (7.1) to find the $A$ and $B$ of the final model (see footnote 6), with which the testing data are tested.

The averages of the 20 MSEs are presented in the left panel of Figure 3, where the solid line represents the results for the small sets (300 training/500 testing) and the dotted line those for the large sets (800 training/1,000 testing). The definition of MSE here is as in (6.3), but as there is no correct $p_i$ for these problems, we let $p_i = 1$ if the data point is in the $i$th class, and 0 otherwise. This measurement is called the Brier score [4], which is popular in meteorology. The figures show that for smaller $k$ (the number of classes), $\delta_{HT}$, $\delta_1$, $\delta_2$, and $\delta_{PKPD}$ have similar MSEs, but for larger $k$, $\delta_{HT}$ has the largest MSE. The MSEs of $\delta_V$ are much larger than those of all the other methods, so they are not included in the figures. In summary, the two proposed approaches, $\delta_1$ and $\delta_2$, are fairly insensitive to the value of $k$, and all of the above observations agree well with the findings in Sections 5 and 6.

Next, the left panels of Figures 4 and 5 present the averages of the 20 test errors for the problems with small size (300 training/500 testing) and large size (800 training/1,000 testing), respectively. The caption of each sub-figure also shows the average of the 20 test errors of the multi-class implementation in LIBSVM. This rule is voting using merely the pairwise SVM decision values, and is denoted as $\delta_{DV}$ in the later discussion.

Footnote 4: [17] suggests using $\hat f$ from validation instead of training. However, in our setting this would require a further cross-validation within the four training folds. For simplicity, we directly use $\hat f$ from the training data.
Footnote 5: If more than one parameter setting attains the smallest cross-validation error, we simply choose the one with the smallest $C$.
Footnote 6: We also tried the training decision values $\hat f$ at the best $(C, \gamma)$, and the results are similar. However, it is recommended to use validated decision values if possible.
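To connect (7.1) with code, the sketch below (ours; it assumes SciPy is available and uses a generic optimizer, whereas [15] describes a more careful and numerically robust Newton procedure) fits $A$ and $B$ on cross-validation decision values and labels, and then maps new decision values to $r_{ij}$.

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt(f, y):
    """Fit A, B of (7.1), P(class i | x) = 1 / (1 + exp(A*f + B)).

    f : decision values (class i vs. class j) from cross-validation.
    y : labels, +1 for class i and -1 for class j.
    """
    t = (y + 1) / 2.0                                    # targets in {0, 1}
    def nll(ab):
        z = ab[0] * f + ab[1]
        # negative log-likelihood, with log(1 + exp(z)) computed stably
        return np.sum(np.logaddexp(0.0, z) - (1.0 - t) * z)
    res = minimize(nll, x0=np.array([0.0, 0.0]), method="BFGS")
    return res.x                                         # A, B

def pairwise_prob(f_new, A, B):
    """Map new decision values to r_ij via (7.1)."""
    return 1.0 / (1.0 + np.exp(A * f_new + B))
```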

The figures show that the errors of the five methods are fairly close for smaller $k$ but quite different for larger $k$. Notice that for smaller $k$ (Figures 4 and 5, panels (a), (c), (e), and (g)) the differences in the averaged errors among the five methods are less than 0.5%, and there is no particular trend in these figures. However, for the problems with larger $k$ (Figures 4 and 5, panels (i), (k), and (m)), the differences are bigger and $\delta_{HT}$ is less competitive. In particular, for the letter problem (Figure 4(m), $k = 26$), $\delta_2$ and $\delta_V$ outperform $\delta_{HT}$ by more than 7%. The test errors, together with the MSE, seem to indicate that, for problems with larger $k$, the posterior probabilities $p_i$ are closer to the setting of Figure 1(c) than to that of Figure 1(a). Another notable feature is that, when $k$ is larger, the results of $\delta_2$ are closer to those of $\delta_V$, and $\delta_1$ closer to $\delta_{HT}$, for both the small and the large training/testing sets. As for $\delta_{PKPD}$, its overall performance is competitive, but we are not clear about its relationship to the other methods.

7.2. Random Forest as the Binary Classifier. In this subsection we consider random forest [3] as the binary classifier and conduct experiments on the same data sets. As random forest itself can provide multi-class probability estimates, we denote the corresponding rule as $\delta_{RF}$ and also compare it with the coupling methods. For each pair of classes, we construct 500 trees as the random forest classifier. Using $m_{try}$ randomly selected features, a bootstrap sample (covering around two thirds of the training data) is used to grow each tree fully, without pruning. For each test instance, $r_{ij}$ is simply the proportion of the 500 trees in which class $i$ wins over class $j$. As the number of trees is fixed at 500, the only parameter left for tuning is $m_{try}$. Similar to [21], we select $m_{try}$ from $\{1, \sqrt{m}, m/3, m/2, m\}$ by five-fold cross-validation, where $m$ is the number of attributes. The cross-validation procedure is as in Section 7.1: four folds are used in turn to construct the $k(k-1)/2$ pairwise random forests, the decision for each instance in the validation fold is obtained by the pairwise coupling methods, and the cross-validation error at the given $m_{try}$ is the error over the five validation folds. Of course, the more efficient out-of-bag validation could be used for random forest, but here we use cross-validation for consistency. The experiments are conducted using an R interface [14] to the code from [3].

The MSE presented in the right panel of Figure 3 shows that $\delta_1$ and $\delta_2$ yield more stable results than $\delta_{HT}$ and $\delta_V$ for both the small and the large sets. The right panels of Figures 4 and 5 give the averages of the 20 test errors. The caption of each sub-figure also shows the averaged error when random forest is used directly as a multi-class classifier ($\delta_{RF}$). Notice that random forest bears a resemblance to SVM: the errors are only slightly different among the five methods for smaller $k$, but $\delta_V$ and $\delta_2$ tend to outperform $\delta_{HT}$ and $\delta_1$ for larger $k$.

In summary, the results obtained with random forest as the binary classifier strongly support the earlier findings regarding the four methods.

7.3. Miscellaneous Observations. Recall that in Section 7.1 we considered $\delta_{DV}$, which does not use Platt's posterior probabilities. The experimental results show that $\delta_{DV}$ is slightly better for dna, USPS, and MNIST, but is about 2% worse than all probability-based methods for waveform. Similar observations on waveform are also reported in [7], where the comparison is between $\delta_{DV}$ and $\delta_{HT}$. We now explain why the results of probability-based and decision-value-based methods can be so different. For some problems, the parameters selected by $\delta_{DV}$ are quite different from those selected by the other five rules. For example, in waveform, at some parameters all probability-based methods give much higher cross-validation accuracy than $\delta_{DV}$. We observe, for instance, that the decision values of the validation sets lie in [0.73, 0.97] and [0.93, 1.02] for data in the two classes; hence, all validation data are classified into one class and the error is high. In contrast, the probability-based methods fit the decision values with a sigmoid function, which can better separate the two classes by cutting at a decision value inside the overlap of these two intervals. This observation sheds some light on the difference between probability-based and decision-value-based methods.

The results of random forest as a multi-class classifier are reported in the caption of each sub-figure of Figures 4 and 5. We observe that, when the number of classes is larger, using random forest directly as a multi-class classifier is better than coupling binary random forests. However, for dna ($k = 3$) the result is the other way around. As our focus in this paper is on different pairwise coupling methods for probability estimates, rather than on different classifiers, we leave this observation as a future research issue.

Acknowledgments. The authors thank S. Sathiya Keerthi for helpful comments. This work was supported in part by the National Science Council of Taiwan.

REFERENCES

[1] C. L. Blake and C. J. Merz. UCI repository of machine learning databases. Technical report, University of California, Department of Information and Computer Science, Irvine, CA, 1998.

[2] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 1992.
[3] L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
[4] G. W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78:1-3, 1950.
[5] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001.
[6] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273-297, 1995.
[7] K. Duan and S. S. Keerthi. Which is the best multiclass SVM method? An empirical study. Technical Report CD-03-12, Control Division, Department of Mechanical Engineering, National University of Singapore, 2003.
[8] J. Friedman. Another approach to polychotomous classification. Technical report, Department of Statistics, Stanford University, 1996.
[9] T. Hastie and R. Tibshirani. Classification by pairwise coupling. The Annals of Statistics, 26(1):451-471, 1998.
[10] J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), May 1994.
[11] D. R. Hunter. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, to appear.
[12] S. Knerr, L. Personnaz, and G. Dreyfus. Single-layer learning revisited: a stepwise procedure for building and training a neural network. In J. Fogelman, editor, Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, 1990.
[13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), November 1998.
[14] A. Liaw and M. Wiener. Classification and regression by randomForest. R News, 2(3):18-22, December 2002.
[15] H.-T. Lin, C.-J. Lin, and R. C. Weng. A note on Platt's probabilistic outputs for support vector machines. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, 2003.
[16] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Prentice Hall, Englewood Cliffs, NJ, 1994.
[17] J. Platt. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 2000.
[18] D. Price, S. Knerr, L. Personnaz, and G. Dreyfus. Pairwise neural network classifiers with probabilistic outputs. In G. Tesauro, D. Touretzky, and T. Leen, editors, Neural Information Processing Systems, volume 7. MIT Press, 1995.

[19] P. Refregier and F. Vallet. Probabilistic approach for multiclass classification with neural networks. In Proceedings of the International Conference on Artificial Neural Networks, 1991.
[20] S. Ross. Stochastic Processes. John Wiley & Sons, second edition, 1996.
[21] V. Svetnik, A. Liaw, C. Tong, J. C. Culberson, R. P. Sheridan, and B. P. Feuston. Random forest: a classification and regression tool for compound classification and QSAR modeling. Journal of Chemical Information and Computer Sciences, 43(6):1947-1958, 2003.
[22] T.-F. Wu, C.-J. Lin, and R. C. Weng. Probability estimates for multi-class classification by pairwise coupling. In Proceedings of NIPS 2003.
[23] E. Zermelo. Die Berechnung der Turnier-Ergebnisse als ein Maximumproblem der Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 29:436-460, 1929.

A. Proof of Theorem 2

It suffices to prove that any optimal solution $p$ of (4.2) satisfies $p_i \ge 0$, $i = 1, \ldots, k$. If this is not true, then without loss of generality we may assume $p_1 < 0, \ldots, p_r < 0$ and $p_{r+1} \ge 0, \ldots, p_k \ge 0$, where $r < k$ because $\sum_{i=1}^k p_i = 1$. We can then define a new feasible solution of (4.2):

    \bar p_1 = \cdots = \bar p_r = 0, \quad \bar p_{r+1} = p_{r+1}/\alpha, \ \ldots, \ \bar p_k = p_k/\alpha, \qquad \text{where } \alpha = \sum_{i=r+1}^k p_i > 1.

With $r_{ij} > 0$ and $r_{ji} > 0$, we obtain

(A.1)    (r_{ji} p_i - r_{ij} p_j)^2 \ge 0 = (r_{ji} \bar p_i - r_{ij} \bar p_j)^2, \quad \text{if } i, j \le r,
(A.2)    (r_{ji} p_i - r_{ij} p_j)^2 > (r_{ij} p_j)^2/\alpha^2 = (r_{ji} \bar p_i - r_{ij} \bar p_j)^2, \quad \text{if } i \le r,\ r+1 \le j \le k,
(A.3)    (r_{ji} p_i - r_{ij} p_j)^2 \ge (r_{ji} p_i - r_{ij} p_j)^2/\alpha^2 = (r_{ji} \bar p_i - r_{ij} \bar p_j)^2, \quad \text{if } r+1 \le i, j \le k.

Therefore,

    \sum_{i=1}^k \sum_{j:\, j \neq i} (r_{ji} p_i - r_{ij} p_j)^2 > \sum_{i=1}^k \sum_{j:\, j \neq i} (r_{ji} \bar p_i - r_{ij} \bar p_j)^2,

which contradicts the assumption that $p$ is an optimal solution of (4.2).

B. Proof of Theorem 3

(i) If $Q + a e e^T$ is not positive definite, then there is a vector $v$ with some $v_i \neq 0$ such that

(B.1)    v^T (Q + a e e^T) v = \frac{1}{2} \sum_{t=1}^k \sum_{u:\, u \neq t} (r_{ut} v_t - r_{tu} v_u)^2 + a \Big( \sum_{t=1}^k v_t \Big)^2 = 0.

Then for all $t \neq i$, $r_{it} v_t - r_{ti} v_i = 0$, so

    v_t = \frac{r_{ti}}{r_{it}} v_i \neq 0.

Thus,

    \sum_{t=1}^k v_t = \Big( 1 + \sum_{t:\, t \neq i} \frac{r_{ti}}{r_{it}} \Big) v_i \neq 0,

which contradicts (B.1).

As $\begin{bmatrix} Q + ee^T & e \\ e^T & 0 \end{bmatrix}$ has the same rank as $\begin{bmatrix} Q & e \\ e^T & 0 \end{bmatrix}$, the positive definiteness of $Q + ee^T$ implies that $\begin{bmatrix} Q + ee^T & e \\ e^T & 0 \end{bmatrix}$ is invertible. Hence $\begin{bmatrix} Q & e \\ e^T & 0 \end{bmatrix}$ is invertible, (4.5) has a unique solution, and so does (4.1).

(ii) If $Q$ is not positive definite, there is a vector $v$ with some $v_i \neq 0$ such that

    v^T Q v = \frac{1}{2} \sum_{t=1}^k \sum_{u:\, u \neq t} (r_{ut} v_t - r_{tu} v_u)^2 = 0.

Therefore,

    (r_{ut} v_t - r_{tu} v_u)^2 = 0 \quad \text{for all } t \neq u.

As $r_{tu} > 0$ for all $t \neq u$, for any $s \neq j$ with $s \neq i$ and $j \neq i$ we have

(B.2)    v_s = \frac{r_{si}}{r_{is}} v_i, \quad v_j = \frac{r_{ji}}{r_{ij}} v_i, \quad v_s = \frac{r_{sj}}{r_{js}} v_j.

Since $v_i \neq 0$, (B.2) implies $\frac{r_{si}}{r_{is}} = \frac{r_{sj}}{r_{js}} \cdot \frac{r_{ji}}{r_{ij}}$, which contradicts (4.7).

C. Proof of Theorem 4

First we need a lemma establishing the strict decrease of the objective function.

Lemma 1. If $r_{ij} > 0$ for all $i \neq j$, $p$ and $p^n$ are the iterates of two consecutive iterations of Algorithm 2, and $p^n \neq p$, then

(C.1)    (p^n)^T Q p^n < p^T Q p.

Proof. Assume that $p_t$ is the component to be updated. Then $p^n$ is obtained through the following calculation:

(C.2)    \bar p_i = p_i \ \text{if } i \neq t, \qquad \bar p_t = \frac{1}{Q_{tt}} \Big( -\sum_{j:\, j \neq t} Q_{tj} p_j + p^T Q p \Big),

and

(C.3)    p^n = \bar p \Big/ \sum_{i=1}^k \bar p_i.

For (C.3) to be a valid operation, $\sum_{i=1}^k \bar p_i$ must be strictly positive. To see this, note first that all $\bar p_i$ are nonnegative, so if $\sum_{i=1}^k \bar p_i = 0$ then $\bar p_i = 0$ for all $i$. Next, from (C.2), $p_i = \bar p_i = 0$ for $i \neq t$, which together with $\sum_{i=1}^k p_i = 1$ implies $p_t = 1$. However, if $p_t = 1$ and $p_i = 0$ for $i \neq t$, then (C.2) gives $\bar p_t = 1$, contradicting $\bar p_t = 0$. Hence $\sum_{i=1}^k \bar p_i > 0$.

To prove (C.1), let $B = \{1, \ldots, k\} \setminus \{t\}$ and observe that

    \bar p^T Q \bar p - \Big( \sum_{i=1}^k \bar p_i \Big)^2 p^T Q p
    = p^T Q p + 2 p_B^T Q_{Bt} (\bar p_t - p_t) + Q_{tt} (\bar p_t^2 - p_t^2) - (1 + \bar p_t - p_t)^2 p^T Q p
(C.4)    = (\bar p_t - p_t) \Big( 2 p_B^T Q_{Bt} + (\bar p_t + p_t) Q_{tt} - (2 + \bar p_t - p_t) p^T Q p \Big).

It then suffices to prove that (C.4) is negative. Since $p^n \neq p$, by (C.2) we must have $\bar p_t \neq p_t$. Again from (C.2),

    Q_{tt} (\bar p_t - p_t) + \sum_{j=1}^k Q_{tj} p_j = p^T Q p,

so $\sum_{j=1}^k Q_{tj} p_j \neq p^T Q p$ by the properties $\bar p_t \neq p_t$ and $Q_{tt} = \sum_{j:\, j \neq t} r_{jt}^2 > 0$. In fact,

(C.5)    \sum_{j=1}^k Q_{tj} p_j < p^T Q p \iff \bar p_t > p_t, \qquad \sum_{j=1}^k Q_{tj} p_j > p^T Q p \iff \bar p_t < p_t.

We then consider two cases.

Case 1: $\sum_{j=1}^k Q_{tj} p_j < p^T Q p$. In this case, $\bar p_t > p_t$ by (C.5). Moreover,

(C.6)    \bar p_t - p_t = \frac{-Q_{tB} p_B + p^T Q p - Q_{tt} p_t}{Q_{tt}}.

Then the following calculation shows that (C.4) is negative:

    2 p_B^T Q_{Bt} + (\bar p_t + p_t) Q_{tt} = \sum_{j=1}^k Q_{tj} p_j + p^T Q p < 2\, p^T Q p \le (2 + \bar p_t - p_t)\, p^T Q p.

Case 2: $\sum_{j=1}^k Q_{tj} p_j > p^T Q p$.

In this case, $\bar p_t < p_t$ by (C.5). Together with (C.6), we derive

    2 p_B^T Q_{Bt} + (\bar p_t + p_t) Q_{tt} = \sum_{j=1}^k Q_{tj} p_j + p^T Q p > 2\, p^T Q p \ge (2 + \bar p_t - p_t)\, p^T Q p.

Thus (C.4) is again negative, and the lemma is proved.

Now we are ready to prove the theorem. If the result does not hold, there is a convergent subsequence $\{p^i\}_{i \in K}$ such that $p^* = \lim_{i \in K,\, i \to \infty} p^i$ is not optimal for (4.1). Note that at least one index in $\{1, \ldots, k\}$ is updated in infinitely many iterations. Without loss of generality, we assume that for all $i \in K$ the component $p^i_t$ is updated to generate the next iterate $p^{i+1}$. As $p^*$ is not optimal for (4.1), going through the components in the order $t+1, t+2, \ldots, k, 1, \ldots, t$, there is a first component $\bar t$ for which

    \sum_{j=1}^k Q_{\bar t j}\, p^*_j - (p^*)^T Q p^* \neq 0.

By applying one iteration of Algorithm 2 to component $\bar t$ of $p^*$, an argument similar to (C.5) shows that we obtain $p^{*,n}$ with $p^{*,n}_{\bar t} \neq p^*_{\bar t}$. Then, by Lemma 1,

    (p^{*,n})^T Q p^{*,n} < (p^*)^T Q p^*.

Note that if it takes $\bar\imath$ steps to go from $t$ to $\bar t$ and $\bar\imath > 1$, then every component $\tilde t$ updated strictly before $\bar t$ satisfies $\sum_{j=1}^k Q_{\tilde t j}\, p^*_j - (p^*)^T Q p^* = 0$, so, by (C.2)-(C.3),

    \lim_{i \in K,\, i \to \infty} p^{i+1}_{\tilde t}
    = \frac{\frac{1}{Q_{\tilde t \tilde t}} \big( -\sum_{j:\, j \neq \tilde t} Q_{\tilde t j}\, p^*_j + (p^*)^T Q p^* \big)}
           {\sum_{j:\, j \neq \tilde t} p^*_j + \frac{1}{Q_{\tilde t \tilde t}} \big( -\sum_{j:\, j \neq \tilde t} Q_{\tilde t j}\, p^*_j + (p^*)^T Q p^* \big)}
    = \frac{p^*_{\tilde t}}{\sum_{j=1}^k p^*_j} = p^*_{\tilde t}.

Therefore,

    \lim_{i \in K,\, i \to \infty} p^{i+1} = \lim_{i \in K,\, i \to \infty} p^{i+2} = \cdots = \lim_{i \in K,\, i \to \infty} p^{i+\bar\imath-1} = p^*.

Moreover, $\lim_{i \in K,\, i \to \infty} p^{i+\bar\imath} = p^{*,n}$,

and hence

    \lim_{i \in K,\, i \to \infty} (p^{i+\bar\imath})^T Q p^{i+\bar\imath} = (p^{*,n})^T Q p^{*,n} < (p^*)^T Q p^* = \lim_{i \in K,\, i \to \infty} (p^i)^T Q p^i.

This contradicts the fact, implied by Lemma 1, that the sequence of objective values is non-increasing:

    (p^1)^T Q p^1 \ge (p^2)^T Q p^2 \ge \cdots \ge (p^*)^T Q p^*.

Therefore, $p^*$ must be optimal for (4.1).

D. Derivation of (5.5)

Because $I_{\{r_{ij} > r_{ji}\}} I_{\{r_{ji} > r_{ij}\}} = 0$, the cross terms vanish and

    \sum_{i=1}^k \sum_{j:\, j \neq i} \big( I_{\{r_{ij} > r_{ji}\}} p_j - I_{\{r_{ji} > r_{ij}\}} p_i \big)^2
    = \sum_{i=1}^k \sum_{j:\, j \neq i} \big( I_{\{r_{ij} > r_{ji}\}} p_j^2 + I_{\{r_{ji} > r_{ij}\}} p_i^2 \big)
    = 2 \sum_{i=1}^k \Big( \sum_{j:\, j \neq i} I_{\{r_{ji} > r_{ij}\}} \Big) p_i^2.

If $\sum_{j:\, j \neq i} I_{\{r_{ji} > r_{ij}\}} \neq 0$ for all $i$, then, under the constraint $\sum_{i=1}^k p_i = 1$, the optimal solution satisfies

    p_1 \sum_{j:\, j \neq 1} I_{\{r_{j1} > r_{1j}\}} = \cdots = p_k \sum_{j:\, j \neq k} I_{\{r_{jk} > r_{kj}\}}.

Thus, (5.5) is the optimal solution of (5.4).

Fig. 3. MSE of the four probability-estimate methods $\delta_{HT}$, $\delta_1$, $\delta_2$, and $\delta_{PKPD}$ based on binary SVMs (left) and binary random forests (right); the MSE of $\delta_V$ is too large to be shown. Solid line: 300 training/500 testing points; dotted line: 800 training/1,000 testing points. Panels: (a) dna (k = 3) by binary SVMs; (b) dna (k = 3) by binary random forests; (c) waveform (k = 3) by binary SVMs; (d) waveform (k = 3) by binary random forests; (e) satimage (k = 6) by binary SVMs; (f) satimage (k = 6) by binary random forests; (g) segment (k = 7) by binary SVMs; (h) segment (k = 7) by binary random forests; (i) USPS (k = 10) by binary SVMs; (j) USPS (k = 10) by binary random forests; (k) MNIST (k = 10) by binary SVMs; (l) MNIST (k = 10) by binary random forests; (m) letter (k = 26) by binary SVMs; (n) letter (k = 26) by binary random forests.

Fig. 4. Average of the 20 test errors of the five probability-estimate methods based on binary SVMs (left) and binary random forests (right); each test error uses 300 training/500 testing points. Panels (left/right): (a)/(b) dna (k = 3); (c)/(d) waveform (k = 3); (e)/(f) satimage (k = 6); (g)/(h) segment (k = 7); (i)/(j) USPS (k = 10); (k)/(l) MNIST (k = 10); (m)/(n) letter (k = 26). The caption of each sub-figure also reports the averaged error of voting with pairwise SVM decision values ($\delta_{DV}$, left panels) or of the multi-class random forest ($\delta_{RF}$, right panels).

Fig. 5. Average of the 20 test errors of the five probability-estimate methods based on binary SVMs (left) and binary random forests (right); each test error uses 800 training/1,000 testing points. Panels (left/right): (a)/(b) dna (k = 3); (c)/(d) waveform (k = 3); (e)/(f) satimage (k = 6); (g)/(h) segment (k = 7); (i)/(j) USPS (k = 10); (k)/(l) MNIST (k = 10); (m)/(n) letter (k = 26). The caption of each sub-figure also reports the averaged error of voting with pairwise SVM decision values ($\delta_{DV}$, left panels) or of the multi-class random forest ($\delta_{RF}$, right panels).


More information

Kernel Methods. Machine Learning A W VO

Kernel Methods. Machine Learning A W VO Kernel Methods Machine Learning A 708.063 07W VO Outline 1. Dual representation 2. The kernel concept 3. Properties of kernels 4. Examples of kernel machines Kernel PCA Support vector regression (Relevance

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

Support Vector Ordinal Regression using Privileged Information

Support Vector Ordinal Regression using Privileged Information Support Vector Ordinal Regression using Privileged Information Fengzhen Tang 1, Peter Tiňo 2, Pedro Antonio Gutiérrez 3 and Huanhuan Chen 4 1,2,4- The University of Birmingham, School of Computer Science,

More information

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan Support'Vector'Machines Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan kasthuri.kannan@nyumc.org Overview Support Vector Machines for Classification Linear Discrimination Nonlinear Discrimination

More information

Random projection ensemble classification

Random projection ensemble classification Random projection ensemble classification Timothy I. Cannings Statistics for Big Data Workshop, Brunel Joint work with Richard Samworth Introduction to classification Observe data from two classes, pairs

More information

Polyhedral Computation. Linear Classifiers & the SVM

Polyhedral Computation. Linear Classifiers & the SVM Polyhedral Computation Linear Classifiers & the SVM mcuturi@i.kyoto-u.ac.jp Nov 26 2010 1 Statistical Inference Statistical: useful to study random systems... Mutations, environmental changes etc. life

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines Nonlinear Support Vector Machines through Iterative Majorization and I-Splines P.J.F. Groenen G. Nalbantov J.C. Bioch July 9, 26 Econometric Institute Report EI 26-25 Abstract To minimize the primal support

More information

MinOver Revisited for Incremental Support-Vector-Classification

MinOver Revisited for Incremental Support-Vector-Classification MinOver Revisited for Incremental Support-Vector-Classification Thomas Martinetz Institute for Neuro- and Bioinformatics University of Lübeck D-23538 Lübeck, Germany martinetz@informatik.uni-luebeck.de

More information

Click Prediction and Preference Ranking of RSS Feeds

Click Prediction and Preference Ranking of RSS Feeds Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS

More information

The Origin of Deep Learning. Lili Mou Jan, 2015

The Origin of Deep Learning. Lili Mou Jan, 2015 The Origin of Deep Learning Lili Mou Jan, 2015 Acknowledgment Most of the materials come from G. E. Hinton s online course. Outline Introduction Preliminary Boltzmann Machines and RBMs Deep Belief Nets

More information

Top-k Parametrized Boost

Top-k Parametrized Boost Top-k Parametrized Boost Turki Turki 1,4, Muhammad Amimul Ihsan 2, Nouf Turki 3, Jie Zhang 4, Usman Roshan 4 1 King Abdulaziz University P.O. Box 80221, Jeddah 21589, Saudi Arabia tturki@kau.edu.sa 2 Department

More information

Bagging. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL 8.7

Bagging. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL 8.7 Bagging Ryan Tibshirani Data Mining: 36-462/36-662 April 23 2013 Optional reading: ISL 8.2, ESL 8.7 1 Reminder: classification trees Our task is to predict the class label y {1,... K} given a feature vector

More information

Support Vector Machines: Maximum Margin Classifiers

Support Vector Machines: Maximum Margin Classifiers Support Vector Machines: Maximum Margin Classifiers Machine Learning and Pattern Recognition: September 16, 2008 Piotr Mirowski Based on slides by Sumit Chopra and Fu-Jie Huang 1 Outline What is behind

More information

Extended Input Space Support Vector Machine

Extended Input Space Support Vector Machine Extended Input Space Support Vector Machine Ricardo Santiago-Mozos, Member, IEEE, Fernando Pérez-Cruz, Senior Member, IEEE, Antonio Artés-Rodríguez Senior Member, IEEE, 1 Abstract In some applications,

More information

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018 From Binary to Multiclass Classification CS 6961: Structured Prediction Spring 2018 1 So far: Binary Classification We have seen linear models Learning algorithms Perceptron SVM Logistic Regression Prediction

More information

1162 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 11, NO. 5, SEPTEMBER The Evidence Framework Applied to Support Vector Machines

1162 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 11, NO. 5, SEPTEMBER The Evidence Framework Applied to Support Vector Machines 1162 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 11, NO. 5, SEPTEMBER 2000 Brief Papers The Evidence Framework Applied to Support Vector Machines James Tin-Yau Kwok Abstract In this paper, we show that

More information

An introduction to Support Vector Machines

An introduction to Support Vector Machines 1 An introduction to Support Vector Machines Giorgio Valentini DSI - Dipartimento di Scienze dell Informazione Università degli Studi di Milano e-mail: valenti@dsi.unimi.it 2 Outline Linear classifiers

More information

Training Support Vector Machines: Status and Challenges

Training Support Vector Machines: Status and Challenges ICML Workshop on Large Scale Learning Challenge July 9, 2008 Chih-Jen Lin (National Taiwan Univ.) 1 / 34 Training Support Vector Machines: Status and Challenges Chih-Jen Lin Department of Computer Science

More information

Probabilistic Random Forests: Predicting Data Point Specific Misclassification Probabilities ; CU- CS

Probabilistic Random Forests: Predicting Data Point Specific Misclassification Probabilities ; CU- CS University of Colorado, Boulder CU Scholar Computer Science Technical Reports Computer Science Spring 5-1-23 Probabilistic Random Forests: Predicting Data Point Specific Misclassification Probabilities

More information

Multi-Category Classification by Soft-Max Combination of Binary Classifiers

Multi-Category Classification by Soft-Max Combination of Binary Classifiers Multi-Category Classification by Soft-Max Combination of Binary Classifiers K. Duan, S. S. Keerthi, W. Chu, S. K. Shevade, A. N. Poo Department of Mechanical Engineering, National University of Singapore

More information

A Bias Correction for the Minimum Error Rate in Cross-validation

A Bias Correction for the Minimum Error Rate in Cross-validation A Bias Correction for the Minimum Error Rate in Cross-validation Ryan J. Tibshirani Robert Tibshirani Abstract Tuning parameters in supervised learning problems are often estimated by cross-validation.

More information

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES Supervised Learning Linear vs non linear classifiers In K-NN we saw an example of a non-linear classifier: the decision boundary

More information

Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example

Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Günther Eibl and Karl Peter Pfeiffer Institute of Biostatistics, Innsbruck, Austria guenther.eibl@uibk.ac.at Abstract.

More information

Sparse Kernel Machines - SVM

Sparse Kernel Machines - SVM Sparse Kernel Machines - SVM Henrik I. Christensen Robotics & Intelligent Machines @ GT Georgia Institute of Technology, Atlanta, GA 30332-0280 hic@cc.gatech.edu Henrik I. Christensen (RIM@GT) Support

More information

Machine Learning and Data Mining. Support Vector Machines. Kalev Kask

Machine Learning and Data Mining. Support Vector Machines. Kalev Kask Machine Learning and Data Mining Support Vector Machines Kalev Kask Linear classifiers Which decision boundary is better? Both have zero training error (perfect training accuracy) But, one of them seems

More information

Outliers Treatment in Support Vector Regression for Financial Time Series Prediction

Outliers Treatment in Support Vector Regression for Financial Time Series Prediction Outliers Treatment in Support Vector Regression for Financial Time Series Prediction Haiqin Yang, Kaizhu Huang, Laiwan Chan, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering

More information

Learning with kernels and SVM

Learning with kernels and SVM Learning with kernels and SVM Šámalova chata, 23. května, 2006 Petra Kudová Outline Introduction Binary classification Learning with Kernels Support Vector Machines Demo Conclusion Learning from data find

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Learning theory. Ensemble methods. Boosting. Boosting: history

Learning theory. Ensemble methods. Boosting. Boosting: history Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over

More information

Variable Selection in Data Mining Project

Variable Selection in Data Mining Project Variable Selection Variable Selection in Data Mining Project Gilles Godbout IFT 6266 - Algorithmes d Apprentissage Session Project Dept. Informatique et Recherche Opérationnelle Université de Montréal

More information

Support Vector Machines.

Support Vector Machines. Support Vector Machines www.cs.wisc.edu/~dpage 1 Goals for the lecture you should understand the following concepts the margin slack variables the linear support vector machine nonlinear SVMs the kernel

More information

1 Training and Approximation of a Primal Multiclass Support Vector Machine

1 Training and Approximation of a Primal Multiclass Support Vector Machine 1 Training and Approximation of a Primal Multiclass Support Vector Machine Alexander Zien 1,2 and Fabio De Bona 1 and Cheng Soon Ong 1,2 1 Friedrich Miescher Lab., Max Planck Soc., Spemannstr. 39, Tübingen,

More information

ABC-LogitBoost for Multi-Class Classification

ABC-LogitBoost for Multi-Class Classification Ping Li, Cornell University ABC-Boost BTRY 6520 Fall 2012 1 ABC-LogitBoost for Multi-Class Classification Ping Li Department of Statistical Science Cornell University 2 4 6 8 10 12 14 16 2 4 6 8 10 12

More information

Introduction to Machine Learning Lecture 13. Mehryar Mohri Courant Institute and Google Research

Introduction to Machine Learning Lecture 13. Mehryar Mohri Courant Institute and Google Research Introduction to Machine Learning Lecture 13 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Multi-Class Classification Mehryar Mohri - Introduction to Machine Learning page 2 Motivation

More information

Pattern Recognition 2018 Support Vector Machines

Pattern Recognition 2018 Support Vector Machines Pattern Recognition 2018 Support Vector Machines Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 1 / 48 Support Vector Machines Ad Feelders ( Universiteit Utrecht

More information

Low Bias Bagged Support Vector Machines

Low Bias Bagged Support Vector Machines Low Bias Bagged Support Vector Machines Giorgio Valentini Dipartimento di Scienze dell Informazione Università degli Studi di Milano, Italy valentini@dsi.unimi.it Thomas G. Dietterich Department of Computer

More information

Support Vector Machine (continued)

Support Vector Machine (continued) Support Vector Machine continued) Overlapping class distribution: In practice the class-conditional distributions may overlap, so that the training data points are no longer linearly separable. We need

More information

CS6375: Machine Learning Gautam Kunapuli. Support Vector Machines

CS6375: Machine Learning Gautam Kunapuli. Support Vector Machines Gautam Kunapuli Example: Text Categorization Example: Develop a model to classify news stories into various categories based on their content. sports politics Use the bag-of-words representation for this

More information

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS Maya Gupta, Luca Cazzanti, and Santosh Srivastava University of Washington Dept. of Electrical Engineering Seattle,

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning. Nakul Verma COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW1 due next lecture Project details are available decide on the group and topic by Thursday Last time Generative vs. Discriminative

More information

Improvements to Platt s SMO Algorithm for SVM Classifier Design

Improvements to Platt s SMO Algorithm for SVM Classifier Design LETTER Communicated by John Platt Improvements to Platt s SMO Algorithm for SVM Classifier Design S. S. Keerthi Department of Mechanical and Production Engineering, National University of Singapore, Singapore-119260

More information

Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c /9/9 page 331 le-tex

Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c /9/9 page 331 le-tex Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c15 2013/9/9 page 331 le-tex 331 15 Ensemble Learning The expression ensemble learning refers to a broad class

More information

1/sqrt(B) convergence 1/B convergence B

1/sqrt(B) convergence 1/B convergence B The Error Coding Method and PICTs Gareth James and Trevor Hastie Department of Statistics, Stanford University March 29, 1998 Abstract A new family of plug-in classication techniques has recently been

More information