Probability Estimates for Multi-class Classification by Pairwise Coupling

Ting-Fan Wu  Chih-Jen Lin
Department of Computer Science
National Taiwan University
Taipei 106, Taiwan

Ruby C. Weng
Department of Statistics
National Chengchi University
Taipei 116, Taiwan

Abstract

Pairwise coupling is a popular multi-class classification method that combines all pairwise comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and the method of [3].

1 Introduction

The multi-class classification problem refers to assigning each observation to one of k classes. As two-class problems are much easier to solve, many authors propose to use two-class classifiers for multi-class classification. In this paper we focus on techniques that provide a multi-class classification solution by combining all pairwise comparisons.

A common way to combine pairwise comparisons is by voting [6, 2]. It constructs a rule for discriminating between every pair of classes and then selects the class with the most winning two-class decisions. Though the voting procedure requires only pairwise decisions, it predicts just a class label. In many scenarios, however, probability estimates are desired. As numerous (pairwise) classifiers do provide class probabilities, several authors [12, 11, 3] have proposed probability estimates obtained by combining the pairwise class probabilities.

Given the observation x and the class label y, we assume that the estimated pairwise class probabilities $r_{ij}$ of $\mu_{ij} = P(y = i \mid y = i \text{ or } j,\ x)$ are available. Here the $r_{ij}$ are obtained by some binary classifiers. The goal is then to estimate $\{p_i\}_{i=1}^{k}$, where $p_i = P(y = i \mid x)$, $i = 1, \dots, k$. We propose to obtain an approximate solution to an identity, and then select the label with the highest estimated class probability. The existence of the solution is guaranteed by the theory of finite Markov chains. Motivated by the optimization formulation of this method, we propose a second approach. Interestingly, it can also be regarded as an improved version of the coupling approach given by [12]. Both of the proposed methods can be reduced to solving linear systems and are simple to implement in practice. Furthermore, from conceptual and experimental points of view, we show that the two proposed methods are more stable than voting and the method in [3].

We organize the paper as follows. In Section 2, we review two existing methods. Sections 3 and 4 detail the two proposed approaches. Section 5 presents the relationship among the four methods through their corresponding optimization formulations. In Section 6, we compare these methods using simulated and real data; the classifiers considered are support vector machines. Section 7 concludes the paper. Due to space limitations, we omit all detailed proofs. A complete version of this work is available at http://www.csie.ntu.edu.tw/~cjlin/papers/svmprob/svmprob.pdf.

2 Review of Two Methods

Let $r_{ij}$ be the estimates of $\mu_{ij} = p_i/(p_i + p_j)$. The voting rule [6, 2] is

$\delta_V = \arg\max_i \Big[\sum_{j: j \neq i} I_{\{r_{ij} > r_{ji}\}}\Big]. \quad (1)$

A simple estimate of the probabilities can be derived as $v_i = 2 \sum_{j: j \neq i} I_{\{r_{ij} > r_{ji}\}} / (k(k-1))$.

The authors of [3] suggest another method to estimate class probabilities, and they claim that the resulting classification rule can outperform $\delta_V$ in some situations. Their approach is based on the minimization of the Kullback-Leibler (KL) distance between $r_{ij}$ and $\mu_{ij}$:

$\ell(\mathbf{p}) = \sum_{i \neq j} n_{ij}\, r_{ij} \log(r_{ij}/\mu_{ij}), \quad (2)$

where $\sum_{i=1}^{k} p_i = 1$, $p_i > 0$, $i = 1, \dots, k$, and $n_{ij}$ is the number of instances in class i or j. Setting $\nabla \ell(\mathbf{p}) = 0$ leads to a nonlinear system, and [3] proposes an iterative procedure to find the minimum of (2). If $r_{ij} > 0$ for all $i \neq j$, the existence of a unique global minimum of (2) has been proved in [5] and references therein. Let $\tilde{\mathbf{p}}$ denote this point. Then the resulting classification rule is

$\delta_{HT}(x) = \arg\max_i [\tilde{p}_i]$.

It is shown in Theorem 1 of [3] that

$\tilde{p}_i > \tilde{p}_j \iff \bar{p}_i > \bar{p}_j, \quad \text{where} \quad \bar{p}_j = \frac{2 \sum_{s: s \neq j} r_{js}}{k(k-1)}; \quad (3)$

that is, the $\tilde{p}_i$ are in the same order as the $\bar{p}_i$. Therefore, the $\bar{p}_i$ are sufficient if one only requires the classification rule. In fact, as pointed out by [3], $\bar{\mathbf{p}}$ can be derived as an approximation to the identity

$p_i = \sum_{j: j \neq i} \Big(\frac{p_i + p_j}{k-1}\Big)\Big(\frac{p_i}{p_i + p_j}\Big) = \sum_{j: j \neq i} \Big(\frac{p_i + p_j}{k-1}\Big)\mu_{ij} \quad (4)$

by replacing $p_i + p_j$ with $2/k$ and $\mu_{ij}$ with $r_{ij}$.
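As a concrete illustration of these two baselines, here is a minimal numpy sketch (ours, not from the paper), assuming r is a k-by-k array of the pairwise estimates with $r_{ij} + r_{ji} = 1$ off the diagonal and an unused diagonal:

```python
import numpy as np

def voting_estimate(r):
    """The crude probability estimate v_i built from the vote counts in (1)."""
    k = r.shape[0]
    wins = (r > r.T).sum(axis=1)          # number of j with r_ij > r_ji
    return 2.0 * wins / (k * (k - 1))

def p_bar(r):
    """p_bar of (3); taking argmax over it reproduces the decision of delta_HT."""
    k = r.shape[0]
    off = r - np.diag(np.diag(r))         # zero the diagonal so sums skip j = i
    return 2.0 * off.sum(axis=1) / (k * (k - 1))
```

Both rules then classify with np.argmax over the returned vector; for $\delta_V$ ties are possible, and (as noted in Section 6.1) the paper breaks them randomly.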

3 Our First Approach

Note that $\delta_{HT}$ is essentially $\arg\max_i [\bar{p}_i]$, and $\bar{\mathbf{p}}$ is an approximate solution to (4). Instead of replacing $p_i + p_j$ by $2/k$, in this section we propose to solve the system

$p_i = \sum_{j: j \neq i} \Big(\frac{p_i + p_j}{k-1}\Big) r_{ij}, \ \forall i, \quad \text{subject to} \quad \sum_{i=1}^{k} p_i = 1, \ p_i \geq 0, \ \forall i. \quad (5)$

Let $\mathbf{p}^*$ denote the solution to (5). Then the resulting decision rule is

$\delta_1 = \arg\max_i [p_i^*]$.

As $\delta_{HT}$ relies on the approximation $p_i + p_j \approx 2/k$, in Section 6.1 we use two examples to illustrate possible problems with this rule.

To solve (5), we rewrite it as

$Q\mathbf{p} = \mathbf{p}, \quad \sum_{i=1}^{k} p_i = 1, \quad p_i \geq 0, \ \forall i, \quad \text{where} \quad Q_{ij} = \begin{cases} r_{ij}/(k-1) & \text{if } i \neq j, \\ \sum_{s: s \neq i} r_{is}/(k-1) & \text{if } i = j. \end{cases} \quad (6)$

Observe that $\sum_{i=1}^{k} Q_{ij} = 1$ for $j = 1, \dots, k$ and $0 \leq Q_{ij} \leq 1$ for all $i, j$, so there exists a finite Markov chain whose transition matrix is $Q^T$. Moreover, if $r_{ij} > 0$ for all $i \neq j$, then $Q_{ij} > 0$, which implies this Markov chain is irreducible and aperiodic. These conditions guarantee the existence of a unique stationary probability and that all states are positive recurrent. Hence, we have the following theorem:

Theorem 1. If $r_{ij} > 0$ for all $i \neq j$, then (6) has a unique solution $\mathbf{p}$ with $0 < p_i < 1$, $\forall i$.

With Theorem 1 and some further analysis, if we remove the constraints $p_i \geq 0$, $\forall i$, the linear system with $k + 1$ equations still has the same unique solution. Furthermore, if any one of the k equalities of $Q\mathbf{p} = \mathbf{p}$ is removed, we obtain a system with k variables and k equalities which again has the same single solution. Thus, (6) can be solved by Gaussian elimination. On the other hand, as the stationary solution of a Markov chain is the limit of the n-step transition probability matrix $(Q^T)^n$, we can also solve for $\mathbf{p}$ by repeatedly multiplying any initial probability vector by $Q$.

Now we re-examine this method to gain more insight. The following argument shows that the solution to (5) is the global minimum of a meaningful optimization problem. To begin, using the property $r_{ij} + r_{ji} = 1$, $\forall i \neq j$, we express (5) as

$\sum_{j: j \neq i} r_{ji}\, p_i - \sum_{j: j \neq i} r_{ij}\, p_j = 0, \quad i = 1, \dots, k.$

Then the solution to (5) is in fact the global minimum of the problem

$\min_{\mathbf{p}} \sum_{i=1}^{k} \Big(\sum_{j: j \neq i} r_{ji}\, p_i - \sum_{j: j \neq i} r_{ij}\, p_j\Big)^2 \quad \text{subject to} \quad \sum_{i=1}^{k} p_i = 1, \ p_i \geq 0, \ \forall i, \quad (7)$

since the objective function is always nonnegative and attains zero at the solution of (5)-(6).
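A minimal sketch of this first approach (again ours, assuming the same k-by-k array r as before with $r_{ij} > 0$ off the diagonal): it builds Q from (6) and solves the k-variable linear system obtained by replacing one redundant equation with the normalization constraint.

```python
import numpy as np

def coupling_markov(r):
    """Solve (6): Qp = p with sum(p) = 1, via Gaussian elimination."""
    k = r.shape[0]
    off = r - np.diag(np.diag(r))                        # r with its diagonal zeroed
    Q = off / (k - 1)                                    # Q_ij = r_ij / (k-1), i != j
    Q[np.diag_indices(k)] = off.sum(axis=1) / (k - 1)    # Q_ii = sum_{s != i} r_is / (k-1)
    A = np.eye(k) - Q                                    # (I - Q)p = 0 ...
    A[-1, :] = 1.0                                       # ... with one row swapped for sum(p) = 1
    b = np.zeros(k)
    b[-1] = 1.0
    return np.linalg.solve(A, b)
```

The power-method alternative mentioned above amounts to iterating p = Q @ p from any initial probability vector until it stops changing.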

4 Our Second Approach

Note that both approaches in Sections 2 and 3 involve solving optimization problems built from relations like $p_i/(p_i + p_j) \approx r_{ij}$ or $r_{ji} p_i \approx r_{ij} p_j$. Motivated by (7), we suggest another optimization formulation:

$\min_{\mathbf{p}} \frac{1}{2} \sum_{i=1}^{k} \sum_{j: j \neq i} (r_{ji}\, p_i - r_{ij}\, p_j)^2 \quad \text{subject to} \quad \sum_{i=1}^{k} p_i = 1, \ p_i \geq 0, \ \forall i. \quad (8)$

In related work, [12] proposes to solve a linear system consisting of $\sum_{i=1}^{k} p_i = 1$ and any $k - 1$ equations of the form $r_{ji} p_i = r_{ij} p_j$. However, as pointed out in [11], the results of [12] depend strongly on the selection of the $k - 1$ equations. In fact, as (8) considers all $r_{ij} p_j \approx r_{ji} p_i$, not just $k - 1$ of them, it can be viewed as an improved version of [12]. Let $\mathbf{p}^*$ denote the corresponding solution. We then define the classification rule as

$\delta_2 = \arg\max_i [p_i^*]$.

Since (7) has a unique solution which can be obtained by solving a simple linear system, it is natural to ask whether the minimization problem (8) shares these nice properties. In the rest of the section, we show that it does. The following theorem shows that the nonnegativity constraints in (8) are redundant.

Theorem 2. Problem (8) is equivalent to its simplification without the conditions $p_i \geq 0$, $\forall i$.

Note that we can rewrite the objective function of (8) as

$\min_{\mathbf{p}} \frac{1}{2}\, \mathbf{p}^T Q \mathbf{p}, \quad \text{where} \quad Q_{ij} = \begin{cases} \sum_{s: s \neq i} r_{si}^2 & \text{if } i = j, \\ -r_{ji} r_{ij} & \text{if } i \neq j. \end{cases} \quad (9)$

From here we can show that Q is positive semi-definite. Therefore, without the constraints $p_i \geq 0$, $\forall i$, (9) is a linearly-constrained convex quadratic programming problem. Consequently, a point $\mathbf{p}$ is a global minimum if and only if it satisfies the KKT optimality condition: there is a scalar b such that

$\begin{bmatrix} Q & \mathbf{e} \\ \mathbf{e}^T & 0 \end{bmatrix} \begin{bmatrix} \mathbf{p} \\ b \end{bmatrix} = \begin{bmatrix} \mathbf{0} \\ 1 \end{bmatrix}. \quad (10)$

Here e is the vector of all ones and b is the Lagrange multiplier of the equality constraint $\sum_{i=1}^{k} p_i = 1$. Thus, the solution of (8) can be obtained by solving the simple linear system (10). The existence of a unique solution is guaranteed by the invertibility of the matrix in (10). Moreover, if Q is positive definite (PD), this matrix is invertible. The following theorem shows that Q is PD under quite general conditions.

Theorem 3. If for every $i = 1, \dots, k$ there are $s \neq i$ and $j \neq i$ such that $\frac{r_{si} r_{sj}}{r_{is}} \neq \frac{r_{ji} r_{js}}{r_{ij}}$, then Q is positive definite.

In addition to direct methods, we next propose a simple iterative method for solving (10):

Algorithm 1
1. Start with some initial $p_i \geq 0$, $\forall i$, satisfying $\sum_{i=1}^{k} p_i = 1$.
2. Repeat ($t = 1, \dots, k, 1, \dots$) until (10) is satisfied:

$p_t \leftarrow \frac{1}{Q_{tt}} \Big[ -\sum_{j: j \neq t} Q_{tj}\, p_j + \mathbf{p}^T Q \mathbf{p} \Big] \quad (11)$

normalize $\mathbf{p}$ $\quad (12)$

Theorem 4. If $r_{sj} > 0$ for all $s \neq j$ and $\{\mathbf{p}^n\}_{n=1}^{\infty}$ is the sequence generated by Algorithm 1, then any convergent sub-sequence converges to a global minimum of (8).

As Theorem 3 indicates that Q is in general positive definite, the sequence $\{\mathbf{p}^n\}_{n=1}^{\infty}$ from Algorithm 1 usually converges globally to the unique minimum of (8).
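A minimal sketch of the second approach under the same assumptions on r: it assembles Q from (9) and solves the KKT system (10) directly, with the coordinate iteration (11)-(12) included for comparison (the function names are ours):

```python
import numpy as np

def build_Q(r):
    """The matrix Q of (9)."""
    off = r - np.diag(np.diag(r))                # zero the diagonal of r
    Q = -(off.T * off)                           # Q_ij = -r_ji r_ij for i != j
    np.fill_diagonal(Q, (off ** 2).sum(axis=0))  # Q_ii = sum_{s != i} r_si^2
    return Q

def coupling_qp(r):
    """Solve the KKT system (10) for p."""
    k = r.shape[0]
    K = np.zeros((k + 1, k + 1))
    K[:k, :k] = build_Q(r)
    K[:k, k] = K[k, :k] = 1.0                    # the e column and the e^T row
    rhs = np.zeros(k + 1)
    rhs[k] = 1.0
    return np.linalg.solve(K, rhs)[:k]

def coupling_qp_iter(r, sweeps=100):
    """Algorithm 1: repeat update (11), then normalize as in (12).

    A fixed sweep count stands in for the convergence test against (10)."""
    k = r.shape[0]
    Q = build_Q(r)
    p = np.full(k, 1.0 / k)
    for _ in range(sweeps):
        for t in range(k):
            p[t] = (p @ Q @ p - (Q[t] @ p - Q[t, t] * p[t])) / Q[t, t]
            p /= p.sum()
    return p
```

On well-conditioned problems the two routines agree to numerical precision; the direct solve is usually preferable, while the iteration mirrors how (11)-(12) are stated.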

5 Relations Among Four Methods

The four decision rules $\delta_{HT}$, $\delta_1$, $\delta_2$, and $\delta_V$ can all be written as $\arg\max_i [p_i]$, where $\mathbf{p}$ is derived from the following four optimization formulations under the constraints $\sum_{i=1}^{k} p_i = 1$ and $p_i \geq 0$, $\forall i$:

$\delta_{HT}: \quad \min_{\mathbf{p}} \sum_{i=1}^{k} \Big[\sum_{j: j \neq i} \Big(r_{ij} - \frac{k}{2}\, p_i\Big)\Big]^2, \quad (13)$

$\delta_1: \quad \min_{\mathbf{p}} \sum_{i=1}^{k} \Big[\sum_{j: j \neq i} \big(r_{ij}\, p_j - r_{ji}\, p_i\big)\Big]^2, \quad (14)$

$\delta_2: \quad \min_{\mathbf{p}} \sum_{i=1}^{k} \sum_{j: j \neq i} \big(r_{ij}\, p_j - r_{ji}\, p_i\big)^2, \quad (15)$

$\delta_V: \quad \min_{\mathbf{p}} \sum_{i=1}^{k} \sum_{j: j \neq i} \big(I_{\{r_{ij} > r_{ji}\}}\, p_j - I_{\{r_{ji} > r_{ij}\}}\, p_i\big)^2. \quad (16)$

Note that (13) can be easily verified, and that (14) and (15) have been explained in Sections 3 and 4. For (16), its solution is

$p_i = \frac{c}{\sum_{j: j \neq i} I_{\{r_{ji} > r_{ij}\}}},$

where c is the normalizing constant; therefore, $\arg\max_i [p_i]$ is the same as (1). (For $I$ to be well defined, we consider $r_{ij} \neq r_{ji}$, which is generally true. In addition, if there is an i for which $\sum_{j: j \neq i} I_{\{r_{ji} > r_{ij}\}} = 0$, an optimal solution of (16) is $p_i = 1$ and $p_j = 0$, $\forall j \neq i$; the resulting decision is still the same as that of (1).)

Clearly, (13) can be obtained from (14) by letting $p_j \approx 1/k$, $\forall j$, and $r_{ji} \approx 1/2$, $\forall i, j$. Such approximations ignore the differences between the $p_i$. Similarly, (16) comes from (15) by pushing $r_{ij}$ to its extreme values, 0 or 1; as a result, (16) may enlarge the differences between the $p_i$. Next, compared with (15), (14) may tend to underestimate the differences between the $p_i$: (14) allows the differences between $r_{ij} p_j$ and $r_{ji} p_i$ to cancel first. Thus, conceptually, (13) and (16) are the more extreme formulations; the former tends to underestimate the differences between the $p_i$, while the latter overestimates them. These arguments will be supported by simulated and real data in the next section.

6 Experiments

6.1 Simple Simulated Examples

[3] designs a simple experiment in which all $p_i$ are fairly close and their method $\delta_{HT}$ outperforms the voting strategy $\delta_V$. We conduct this experiment first to assess the performance of our proposed methods. As in [3], we define the class probabilities $p_1 = 1.5/k$, $p_j = (1 - p_1)/(k - 1)$, $j = 2, \dots, k$, and then set

$r_{ij} = \frac{p_i}{p_i + p_j} + 0.1\, z_{ij} \quad \text{if } i > j, \quad (17)$

$r_{ji} = 1 - r_{ij} \quad \text{if } j > i, \quad (18)$

where the $z_{ij}$ are standard normal variates. Since the $r_{ij}$ are required to lie within (0, 1), we truncate $r_{ij}$ at $\epsilon$ below and $1 - \epsilon$ above, with $\epsilon = 0.00001$.
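A minimal sketch of this generation process, following (17)-(18) (the function name and rng plumbing are ours):

```python
import numpy as np

def simulate_r(p, eps=0.00001, rng=None):
    """One noisy pairwise matrix r drawn from true class probabilities p."""
    rng = rng or np.random.default_rng()
    k = len(p)
    r = np.zeros((k, k))
    for i in range(k):
        for j in range(i):                                   # pairs with i > j, as in (17)
            noisy = p[i] / (p[i] + p[j]) + 0.1 * rng.standard_normal()
            r[i, j] = np.clip(noisy, eps, 1 - eps)           # keep r_ij inside (0, 1)
            r[j, i] = 1.0 - r[i, j]                          # (18)
    return r
```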

In this example, class 1 has the highest probability and hence is the correct class. Figure 1 shows the accuracy rates of each of the four methods for k = 3, 5, 8, 10, 12, 15, 20, averaged over 1,000 replicates.

[Figure 1: Accuracy of predicting the true class by the methods $\delta_{HT}$ (solid line, cross marked), $\delta_V$ (dashed line, square marked), $\delta_1$ (dotted line, circle marked), and $\delta_2$ (dashed line, asterisk marked) from simulated class probabilities $p_i$, $i = 1, 2, \dots, k$. Panels: (a) balanced $p_i$, (b) unbalanced $p_i$, (c) highly unbalanced $p_i$. Each panel plots the accuracy rate against $\log_2 k$.]

Note that in this experiment all classes are quite competitive, so, when using $\delta_V$, the highest vote sometimes occurs at two or more different classes. We handle this problem by randomly selecting one class from the ties. This partly explains why $\delta_V$ performs poorly. Another explanation is that the $r_{ij}$ here are all close to 1/2, but (16) uses 1 or 0 instead; therefore, the solution may be severely biased. Besides $\delta_V$, the other three rules do very well in this example.

Since $\delta_{HT}$ relies on the approximation $p_i + p_j \approx 2/k$, this rule may suffer some loss if the class probabilities are not highly balanced. To examine this point, we consider the following two sets of class probabilities:

(1) We let $k_1 = k/2$ if k is even, and $(k + 1)/2$ if k is odd; then we define $p_1 = 0.95 \times 1.5/k_1$, $p_i = (0.95 - p_1)/(k_1 - 1)$ for $i = 2, \dots, k_1$, and $p_i = 0.05/(k - k_1)$ for $i = k_1 + 1, \dots, k$.

(2) If k = 3, we define $p_1 = 0.95 \times 1.5/2$, $p_2 = 0.95 - p_1$, and $p_3 = 0.05$. If k > 3, we define $p_1 = 0.475$, $p_2 = p_3 = 0.475/2$, and $p_i = 0.05/(k - 3)$ for $i = 4, \dots, k$.

After setting the $p_i$, we define the pairwise comparisons $r_{ij}$ as in (17)-(18). Both experiments are repeated 1,000 times, and the accuracy rates are shown in Figures 1(b) and 1(c). In both scenarios the $p_i$ are unbalanced and, as expected, $\delta_{HT}$ is quite sensitive to this imbalance. The situation is much worse in Figure 1(c), because the approximation $p_i + p_j \approx 2/k$ is more seriously violated, especially when k is large. In summary, $\delta_1$ and $\delta_2$ are less sensitive to the $p_i$, and their overall performance is fairly stable. All features observed here agree with our analysis in Section 5.
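Putting the earlier sketches together, a small driver in the spirit of the balanced setting of Figure 1(a); it assumes the hypothetical helpers voting_estimate, p_bar, coupling_markov, coupling_qp, and simulate_r sketched above:

```python
import numpy as np

def accuracy(rule, p, trials=1000, rng=None):
    """Fraction of trials in which a rule ranks the true class (index 0) first."""
    rng = rng or np.random.default_rng(0)
    hits = sum(np.argmax(rule(simulate_r(p, rng=rng))) == 0 for _ in range(trials))
    return hits / trials

k = 10
p = np.r_[1.5 / k, np.full(k - 1, (1 - 1.5 / k) / (k - 1))]  # balanced p_i
for name, rule in [("delta_HT", p_bar), ("delta_1", coupling_markov),
                   ("delta_2", coupling_qp), ("delta_V", voting_estimate)]:
    print(name, accuracy(rule, p))
```

One caveat: np.argmax resolves $\delta_V$'s ties by taking the first maximizer rather than a random one, so its numbers will differ slightly from the randomized tie-breaking described above.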

6.2 Real Data

In this section we present experimental results on several multi-class problems: segment, satimage, and letter from the Statlog collection [9], USPS [4], and MNIST [7]. All data sets are available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets. Their numbers of classes are 7, 6, 26, 10, and 10, respectively. From the thousands of instances in each data set, we select 300 for training and 500 for testing. We consider support vector machines (SVM) with the RBF kernel $e^{-\gamma \|x_i - x_j\|^2}$ as the binary classifier. The regularization parameter C and the kernel parameter $\gamma$ are selected by cross-validation: for each training set, a five-fold cross-validation is conducted over the following grid of (C, $\gamma$): $[2^{-5}, 2^{-3}, \dots, 2^{15}] \times [2^{-5}, 2^{-3}, \dots, 2^{15}]$. This is done by modifying LIBSVM [1], a library for SVM.

At each (C, $\gamma$), four folds are used in turn as the training set and one fold as the validation set. The training of the four folds consists of $k(k-1)/2$ binary SVMs. For the binary SVM of the ith and jth classes, using the decision values $\hat{f}$ of the training data, we employ an improved implementation [8] of Platt's posterior probabilities [10] to estimate $r_{ij}$:

$r_{ij} = P(i \mid i \text{ or } j,\ x) = \frac{1}{1 + e^{A\hat{f} + B}}, \quad (19)$

where A and B are estimated by minimizing the negative log-likelihood function. ([10] suggests using $\hat{f}$ from validation data instead of training data; however, this requires a further cross-validation on the four-fold data, so for simplicity we use $\hat{f}$ from the training data directly.) Then, for each validation instance, we apply the four methods to obtain classification decisions, and the error over the five validation sets is the cross-validation error at (C, $\gamma$).

After the cross-validation is done, each rule obtains its best (C, $\gamma$). (If more than one parameter set returns the smallest cross-validation error, we simply choose the one with the smallest C.) Using these parameters, we train on the whole training set to obtain the final model. Next, as in (19), the decision values from the training data are used to find $r_{ij}$, and the testing data are then classified by each of the four rules. Due to the randomness of separating the training data into five folds for finding the best (C, $\gamma$), we repeat the five-fold cross-validation five times and obtain the mean and standard deviation of the testing error. Moreover, as the selection of 300 training and 500 testing instances from a larger data set is also random, we generate five such pairs. In Table 1, each row reports the testing error based on one pair of training and testing sets.

Table 1: Testing errors (in percentage) by the four methods. Each row reports the testing errors based on a pair of the training and testing sets. The mean and std (standard deviation) are from five 5-fold cross-validation procedures used to select the best (C, gamma).

Dataset    k    delta_HT         delta_1          delta_2          delta_V
                mean     std     mean     std     mean     std     mean     std
satimage   6    4.080    1.306   4.600    0.938   4.760    0.784   5.400    0.29
                2.960    0.320   3.400    0.400   3.400    0.400   3.360    0.080
                4.520    0.968   4.760    1.637   3.880    0.392   4.080    0.240
                2.400    0.000   2.200    0.000   2.640    0.294   2.680    1.4
                6.60     0.294   6.400    0.379   6.20     0.299   6.60     0.344
segment    7    9.960    0.480   9.480    0.240   9.000    0.400   8.880    0.27
                6.040    0.528   6.280    0.299   6.200    0.456   6.760    0.445
                6.600    0.000   6.680    0.349   6.920    0.27    7.60     0.96
                5.520    0.466   5.200    0.420   5.400    0.580   5.480    0.588
                7.440    0.625   8.60     0.637   8.040    0.408   7.840    0.344
USPS       10   4.840    0.388   3.520    0.560   2.760    0.233   2.520    0.60
                2.080    0.560   1.440    0.625   1.600    1.08    1.440    0.99
                10.640   0.933   10.000   0.657   9.920    0.483   10.320   0.744
                2.320    0.845   1.960    1.03    1.560    0.784   1.840    1.248
                3.400    0.30    2.640    0.080   2.920    0.299   2.520    0.97
MNIST      10   7.400    0.000   6.560    0.080   5.760    0.96    5.960    0.463
                5.200    0.400   4.600    0.000   3.720    0.588   2.360    0.96
                7.320    1.608   4.280    0.560   3.400    0.657   3.760    0.794
                4.720    0.449   4.60     0.96    3.360    0.686   3.520    0.325
                2.560    0.294   2.600    0.000   3.080    0.560   2.440    0.233
letter     26   39.880   1.42    37.60    1.06    34.560   2.44    33.480   0.325
                41.640   0.463   39.400   0.769   35.920   1.389   33.440   1.06
                41.320   1.700   38.920   0.854   35.800   1.453   35.000   1.066
                35.240   1.439   32.920   1.2     29.240   1.335   27.400   1.7
                43.240   0.637   40.360   1.472   36.960   1.74    34.520   1.00

The results show that when the number of classes k is small, the four methods perform similarly; however, for problems with larger k, $\delta_{HT}$ is less competitive. In particular, for the problem letter, which has 26 classes, $\delta_2$ or $\delta_V$ outperforms $\delta_{HT}$ by at least 5%. It seems that the problems here have characteristics closer to the setting of Figure 1(c) than to that of Figure 1(a).
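For reference, a rough sketch of the conversion (19): it assumes decision values f and 0/1 labels y from one pairwise SVM and fits A, B by minimizing the negative log-likelihood with a generic scipy optimizer, a simplification of the dedicated implementation in [8]:

```python
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(f, y):
    """Return (A, B) minimizing the NLL of P(y=1 | f) = 1 / (1 + exp(A*f + B))."""
    def nll(ab):
        a, b = ab
        q = 1.0 / (1.0 + np.exp(a * f + b))   # modelled P(y = 1 | f)
        q = np.clip(q, 1e-12, 1 - 1e-12)      # guard the logarithms
        return -(y * np.log(q) + (1 - y) * np.log(1 - q)).sum()
    return minimize(nll, x0=np.zeros(2), method="Nelder-Mead").x

def r_from_decision(f, A, B):
    """r_ij of (19) for new decision values f."""
    return 1.0 / (1.0 + np.exp(A * f + B))
```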

All these results agree with the earlier findings in Sections 5 and 6.1. Note that in Table 1 some standard deviations are zero; this means that the best (C, $\gamma$) chosen by the different cross-validation runs were all the same. Overall, the variation in parameter selection due to the randomness of cross-validation is not large.

7 Discussions and Conclusions

As the minimization of the KL distance is a well-known criterion, some may wonder why the performance of $\delta_{HT}$ is not quite satisfactory in some of the examples. One possible explanation is that here the KL distance is derived under the assumptions that $n_{ij} r_{ij} \sim \text{Bin}(n_{ij}, \mu_{ij})$ and that the $r_{ij}$ are independent; however, as pointed out in [3], neither assumption holds in the classification problem.

In conclusion, we have provided two methods which are shown to be more stable than both $\delta_{HT}$ and $\delta_V$. In addition, the two proposed approaches require only the solution of linear systems instead of the nonlinear one in [3].

Acknowledgments

The authors thank S. Sathiya Keerthi for helpful comments.

References

[1] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[2] J. Friedman. Another approach to polychotomous classification. Technical report, Department of Statistics, Stanford University, 1996. Available at http://www-stat.stanford.edu/reports/friedman/poly.ps.Z.

[3] T. Hastie and R. Tibshirani. Classification by pairwise coupling. The Annals of Statistics, 26(1):451-471, 1998.

[4] J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550-554, May 1994.

[5] D. R. Hunter. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 2004. To appear.

[6] S. Knerr, L. Personnaz, and G. Dreyfus. Single-layer learning revisited: a stepwise procedure for building and training a neural network. In J. Fogelman, editor, Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, 1990.

[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998. MNIST database available at http://yann.lecun.com/exdb/mnist/.

[8] H.-T. Lin, C.-J. Lin, and R. C. Weng. A note on Platt's probabilistic outputs for support vector machines. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, 2003.

[9] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Prentice Hall, Englewood Cliffs, N.J., 1994. Data available at http://www.ncc.up.pt/liacc/ML/statlog/datasets.html.

[10] J. Platt. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 2000.

[11] D. Price, S. Knerr, L. Personnaz, and G. Dreyfus. Pairwise neural network classifiers with probabilistic outputs. In G. Tesauro, D. Touretzky, and T. Leen, editors, Neural Information Processing Systems, volume 7, pages 1109-1116. The MIT Press, 1995.

[12] P. Refregier and F. Vallet. Probabilistic approach for multiclass classification with neural networks. In Proceedings of International Conference on Artificial Networks, pages 1003-1007, 1991.