From Margins to Probabilities in Multiclass Learning Problems


Andrea Passerini¹, Massimiliano Pontil² and Paolo Frasconi³

Abstract. We study the problem of multiclass classification within the framework of error correcting output codes (ECOC) using margin-based binary classifiers. An important open problem in this context is how to measure the distance between class codewords and the outputs of the classifiers. In this paper we propose a new decoding function that combines the margins through an estimate of their class conditional probabilities. We report experiments using support vector machines as the base binary classifiers, showing the advantage of the proposed decoding function over other functions of the margin commonly used in practice. We also present new theoretical results bounding the leave-one-out error of ECOC of kernel machines, which can be used to tune kernel parameters. An empirical validation indicates that the bound leads to good estimates of kernel parameters and the corresponding classifiers attain high accuracy.

Keywords: Machine Learning, Error Correcting Output Codes, Support Vector Machines, Statistical Learning Theory.

1 Introduction and Notation

Many machine learning algorithms are intrinsically conceived for binary classification. However, many real world learning problems require that inputs be mapped into one of several possible categories. The extension of a binary algorithm to its multiclass counterpart is not always possible or easy to conceive. An alternative consists in reducing a multiclass problem into several binary sub-problems. Perhaps the most general reduction scheme is the information-theoretic method based on error correcting output codes (ECOC), introduced by Dietterich and Bakiri [5] and more recently extended in [1]. ECOC work in two steps: training and classification. During the first step, S binary classifiers are trained on S dichotomies of the instance space, formed by joining non-overlapping subsets of classes. Assuming Q classes, let us introduce a coding matrix M \in \{-1, 0, 1\}^{Q \times S} which specifies a relation between classes and dichotomies. m_{qs} = 1 (m_{qs} = -1) means that examples belonging to class q are used as positive (negative) examples to train the s-th classifier f_s. When m_{qs} = 0, points in class q are not used to train the s-th classifier. Thus each class q is encoded by the q-th row of matrix M, which we also denote by M_q. Typically the classifiers are trained independently; this approach is known as the multi-call approach. Recent works [8, 2] also considered the case where classifiers are trained simultaneously (single-call approach). In this paper we focus on the former approach, although some of the results may be extended to the latter. In the second step a new input x is classified by computing the vector formed by the outputs of the classifiers, f(x) = (f_1(x), ..., f_S(x)), and choosing the class whose corresponding row is closest to f(x). In so doing, classification can be seen as a decoding operation, and the class of input x is computed as

\arg\min_{q=1,...,Q} d(M_q, f(x)),

where d is the decoding function.

¹ Dept. of Systems and Computer Science, University of Florence, Firenze, Italy
² Dept. of Computer Science, University of Siena, Siena, Italy
³ Dept. of Systems and Computer Science, University of Florence, Firenze, Italy
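To make the two-step scheme concrete, the sketch below trains one binary SVM per column of a coding matrix M and classifies by the decoding rule above. It is a minimal illustration, not the authors' code: the helper names train_ecoc and decode are ours, scikit-learn is assumed for the base learners, and d is any decoding function taking a code row and the margin vector.

```python
import numpy as np
from sklearn.svm import SVC

def train_ecoc(X, y, M, gamma=1.0):
    """Train one binary SVM per column of the Q x S coding matrix M.

    Column s defines a dichotomy: examples of class q are positive if
    M[q, s] == +1, negative if M[q, s] == -1, and left out if M[q, s] == 0.
    """
    classifiers = []
    for s in range(M.shape[1]):
        bits = M[y, s]            # code bit of each example's class
        used = bits != 0          # classes with a zero bit are not used
        clf = SVC(kernel="rbf", gamma=gamma).fit(X[used], bits[used])
        classifiers.append(clf)
    return classifiers

def decode(classifiers, M, x, d):
    """Classify x as argmin_q d(M_q, f(x)) for a decoding function d."""
    f = np.array([c.decision_function(x.reshape(1, -1))[0] for c in classifiers])
    return int(np.argmin([d(M[q], f) for q in range(M.shape[0])]))
```

For instance, with Q classes the one-vs-all code used in the experiments below is simply M = 2 * np.eye(Q) - 1.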
In [5] the entries of matrix M were restricted to take only binary values and d was chosen to be the Hamming distance:

d_H(M_q, f) = \sum_{s=1}^{S} \frac{1 - m_{qs} \, \mathrm{sign}(f_s)}{2}.

In the case that the binary learners are margin-based classifiers, [1] showed the advantage of using a loss-based function of the margin:

d_L(M_q, f) = \sum_{s=1}^{S} L(m_{qs} f_s),   (1)

where L is a loss function. Such a function has the advantage of weighting the confidence of each classifier, thus allowing a more robust classification criterion. The simplest loss function one can use is the linear loss, for which L(m_{qs} f_s) = -m_{qs} f_s, but several other choices are possible and it is not clear which one should work best. In the case that all binary classifiers are computed by the same learning algorithm, Allwein, Schapire and Singer [1] proposed to set L to be the same loss function used by that algorithm. In this paper we suggest a different approach, based on decoding via conditional probabilities of the outputs of the classifiers. The advantages offered by our approach are twofold. First, the use of conditional probabilities allows us to combine the margins of each classifier in a principled way. Second, the decoding function is itself a class conditional probability, which can give an estimate of the confidence of the multiclass classification. This approach is discussed in Section 2. In Section 3 we present experiments, using SVMs as the underlying binary classifiers, which highlight the advantage over other proposed decoding schemes. In Section 4 we study the generalization error of ECOC. We present a general bound on the leave-one-out error in the case of ECOC of kernel machines. The bound can be used for estimating optimal kernel parameters. The novelty of this analysis is that it allows multiclass parameter optimization even though the binary classifiers are trained independently. We report experiments showing that the bound leads to good estimates of kernel parameters and that the corresponding classifiers attain high accuracy.
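Both decoding functions are one-liners. A small sketch (ours, not the paper's code), written so that each returns a distance to be minimized, with the linear loss taken as L(z) = -z:

```python
import numpy as np

def hamming_decoding(m, f):
    # d_H(M_q, f) = sum_s (1 - m_qs * sign(f_s)) / 2
    return np.sum((1.0 - m * np.sign(f)) / 2.0)

def linear_decoding(m, f):
    # loss-based decoding, Eq. (1), with the linear loss L(z) = -z,
    # so that d_L(M_q, f) = -M_q . f
    return -float(np.dot(m, f))
```

Either can be passed as the argument d of the decode sketch above.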

2 Decoding Functions Based on Conditional Probabilities

We have seen that a loss function of the margin presents some advantages over the standard Hamming distance, because it can encode the confidence of each classifier in the ECOC. This confidence is, however, a relative quantity: the range of the values of the margin may vary with the classifier used. Thus, just using a linear loss function may introduce some bias in the final classification, in the sense that classifiers with a larger output range will receive a higher weight. Not surprisingly, we will see in the experiments below that the Hamming distance can work better than the linear and soft-margin losses in the case of pairwise schemes. A straightforward normalization into some interval, e.g. [-1, 1], can also introduce bias, since it does not fully take into account the margin distribution. A more principled approach is to estimate the conditional probability of each output codebit O_s given the input vector x. Assuming that all the information about x that is relevant for determining O_s is contained in the margin f_s(x) (or f_s for short), the above probability for each classifier is P(O_s | f_s, s). We can now assume a simple model for the probability of Y given the codebits O_1, ..., O_S. The conditional probability that Y = q should be 1 if the configuration of O_1, ..., O_S is the code of q, should be zero if it is the code of some other class q' \ne q, and uniform (i.e. 1/Q) if the configuration is not a valid class code. Under this model,

P(Y = q | f) = P(O_1 = m_{q1}, ..., O_S = m_{qS} | f) + \alpha,

\alpha being a constant that collects the probability mass dispersed over the 2^S - Q invalid codes. Assuming that O_s and O_{s'} are conditionally independent given x for each s \ne s', we can write the likelihood that the resulting output codeword is q as

P(Y = q | f) = \prod_{s=1}^{S} P(O_s = m_{qs} | f_s) + \alpha.   (2)

In this case, if \alpha is small, the decoding function will be:

d(M_q, f) = -\log P(Y = q | f).   (3)

The problem boils down to estimating the individual conditional probabilities in Eq. (2). To do so, one can try to fit a parametric model on the output produced by an n-fold cross-validation procedure. In our experiments we chose this model to be a sigmoid, computed on a 3-fold cross validation as suggested in [10]:

P(O_s = m_{qs} | f_s, s) = \frac{1}{1 + \exp\{A_s f_s + B_s\}}.

It is interesting to notice that an additional advantage of the proposed decoding algorithm is that the multiclass classifier outputs a conditional probability rather than a mere class decision.

3 Experimental Comparison Between Different Decoding Functions

The proposed decoding method is validated on ten datasets from the UCI repository. Their characteristics are briefly summarized in Table 1. We trained multiclass classifiers using SVMs as the base binary classifier. In our experiments we compared our decoding strategy to Hamming and other common loss-based decoding schemes (linear, and the soft-margin loss used to train SVMs) for different types of ECOC schemes: one-vs-all, all-pairs, and dense matrices consisting of 3Q columns of \{-1, 1\} entries. SVMs were trained with a Gaussian kernel, with the same value of the variance for all the experiments. For datasets with fewer than 2,000 instances we used ten random splits of the available data (picking 2/3 of the examples for training and 1/3 for testing) and averaged results over the ten trials. For the remaining datasets we used the original split defined in the UCI repository. Results are summarized in Tables 2-4.

Table 1. Characteristics of the datasets used (columns: Name, Classes, Train, Test, Inputs; rows: Anneal, Ecoli, Glass, Letter, Optdigits, Pendigits, Satimage, Segment, Soybean, Yeast).
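Before turning to the result tables, here is a minimal sketch of the likelihood decoding just described. It is our illustration under stated assumptions: the sigmoid of [10] is fitted by maximum likelihood with SciPy on held-out margins (standing in for the paper's 3-fold cross-validation fit), zero code bits are skipped, and the small \alpha term of Eq. (2) is dropped as in Eq. (3).

```python
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(margins, bits):
    """Fit P(O_s = +1 | f_s) = 1 / (1 + exp(A f_s + B)) by maximum likelihood.

    `margins` are held-out outputs f_s(x_i); `bits` are the true code bits
    of the corresponding classes, in {-1, +1}. Returns (A, B).
    """
    t = (np.asarray(bits) + 1) / 2.0          # map {-1, +1} to {0, 1}
    def nll(ab):
        p = 1.0 / (1.0 + np.exp(ab[0] * margins + ab[1]))
        p = np.clip(p, 1e-12, 1.0 - 1e-12)    # guard the logarithms
        return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))
    return minimize(nll, x0=np.array([-1.0, 0.0]), method="Nelder-Mead").x

def likelihood_decoding(m, f, sigmoids):
    """d(M_q, f) = -sum_s log P(O_s = m_qs | f_s), as in Eq. (3)."""
    d = 0.0
    for s, (A, B) in enumerate(sigmoids):
        if m[s] == 0:                          # class not used in dichotomy s
            continue
        p_pos = 1.0 / (1.0 + np.exp(A * f[s] + B))
        p = p_pos if m[s] == 1 else 1.0 - p_pos
        d -= np.log(max(p, 1e-12))
    return d
```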
Table 2. One-vs-all: results on the ten datasets of Table 1.

Table 3. All-pairs: results on the ten datasets of Table 1.

Table 4. Dense codes: results on the ten datasets of Table 1.

Our likelihood decoding works significantly better for all ECOC schemes. This result is particularly clear in the case of pairwise classifiers. In the case of dense ECOC some dichotomies can be particularly hard to learn, and in this situation the sigmoid may be too simple a model for the conditional probability. We suspect that by using a better estimate of the conditional probability one could obtain better results with the likelihood decoding. Notice that the Hamming distance works well in the case of pairwise classification, while it performs poorly with one-vs-all classifiers. Neither result is surprising: the Hamming distance corresponds to the majority vote, which is known to work well for pairwise classifiers [7] but does not make much sense for one-vs-all, because in this case ties may occur often.

4 Tuning Kernel Parameters

In this section we study the problem of model selection in the case of ECOC of kernel machines [12, 11, 6, 13]. The analysis uses a leave-one-out error estimate as the key quantity for selecting the best model. We first recall the main features of kernel machines for binary classification. For a more detailed account consistent with the notation in this section see [6].

4.1 Background on Kernel Machines

Kernel machines are the minimizers of functionals of the form:

H[f; D_\ell] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(c_i f(x_i)) + \lambda \|f\|_K^2,   (4)

where we use the following notation:

- D_\ell = \{(x_i, c_i)\}_{i=1}^{\ell} \subset X \times \{-1, 1\} is the training set.
- f is a function from IR^n to IR belonging to a reproducing kernel Hilbert space H defined by a symmetric and positive definite kernel K, and \|f\|_K^2 is the norm of f in this space. See [12, 13] for a number of kernels. The classification is done by taking the sign of this function.
- V(c f(x)) is a loss function whose choice determines different learning techniques, each leading to a different learning algorithm (for computing the coefficients \alpha_i; see below). In this paper we assume that V is monotonically non-increasing.
- \lambda is called the regularization parameter and is a positive constant.

Machines of the form in Eq. (4) have been motivated in the framework of statistical learning theory. Under rather general conditions the solution of Equation (4) is of the form⁴

f(x) = \sum_{i=1}^{\ell} \alpha_i c_i K(x_i, x).   (5)

⁴ We assume that the bias term is incorporated in the kernel K.

Let us introduce the matrix G defined by G_{ij} = K(x_i, x_j). The coefficients \alpha_i in Equation (5) are learned by solving the following optimization problem:

\max_\alpha H(\alpha) = \sum_{i=1}^{\ell} S(\alpha_i) - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j c_i c_j G_{ij}
subject to: 0 \le \alpha_i \le C, i = 1, ..., \ell,   (6)

where S(\cdot) is a concave function and C = \frac{1}{2\lambda\ell} a constant. Support Vector Machines (SVMs) are a particular case of these machines, obtained for S(\alpha) = \alpha. This corresponds to a loss function V in (4) of the form |1 - c f(x)|_+, where |x|_+ = x if x > 0 and zero otherwise. The points for which \alpha_i > 0 are called support vectors.

An important property of kernel machines is that, since we assumed K is symmetric and positive definite, the kernel function can be rewritten as

K(x, t) = \sum_{n=1}^{\infty} \lambda_n \phi_n(x) \phi_n(t),   (7)

where the \phi_n(x) are the eigenfunctions of the integral operator associated to the kernel K and the \lambda_n \ge 0 their eigenvalues [4]. By setting \Phi(x) to be the sequence (\sqrt{\lambda_n} \phi_n(x))_{n=1}^{\infty}, we see that K is a dot product in a Hilbert space of sequences (called the feature space), that is, K(x, t) = \Phi(x) \cdot \Phi(t). Thus, the function in Eq. (5) is a linear combination of features, f(x) = \sum_{n=1}^{\infty} w_n \phi_n(x). The great advantage of working with the compact representation in Eq. (5) is that we avoid directly estimating the infinitely many parameters w_n. Finally, note that the kernel K can also be defined/built by choosing the \phi's and \lambda's but, in general, these do not need to be known and can be implicitly defined by K.
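As a concrete check of the representation in Eq. (5), this sketch (ours; scikit-learn assumed) fits a binary Gaussian-kernel SVM and rebuilds f(x) from the learned coefficients \alpha_i c_i and the kernel values. Note that scikit-learn keeps an explicit bias term, whereas the paper folds the bias into the kernel.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(40, 2)                 # toy binary problem
c = np.where(X[:, 0] > 0, 1, -1)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, c)

# f(x) = sum_i alpha_i c_i K(x_i, x), Eq. (5): scikit-learn stores the
# products alpha_i * c_i of the support vectors in dual_coef_.
alpha_c = clf.dual_coef_[0]
x_new = rng.randn(3, 2)
K = rbf_kernel(x_new, clf.support_vectors_, gamma=gamma)   # kernel values
f_manual = K @ alpha_c + clf.intercept_[0]                 # explicit bias term

assert np.allclose(f_manual, clf.decision_function(x_new))
```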
4.2 Bounds on the Leave-One-Out Error

We first introduce some more notation. We define the multiclass margin (see also [1]) of a point (x, y) to be

g(x, y) = -d(M_y, f(x)) + d(M_{r(x,y)}, f(x)),   with   r(x, y) = \arg\min_{r \ne y} d(M_r, f(x)).

When g(x, y) is negative, point x is misclassified. Notice that when Q = 2 and L is the linear loss, g(x, y) reduces to the definition of margin for a binary real-valued classification function. Denoting by \theta the Heaviside step function, the empirical misclassification error can be written as:

\frac{1}{\ell} \sum_{i=1}^{\ell} \theta(-g(x_i, y_i)).

The leave-one-out (LOO) error is defined by

\frac{1}{\ell} \sum_{i=1}^{\ell} \theta(-g^i(x_i, y_i)),

where we have denoted by g^i(x_i, y_i) the margin of example x_i when the ECOC is trained on the dataset D_\ell \setminus \{(x_i, y_i)\}. The LOO error is an interesting quantity to look at when we want to select the optimal (hyper)parameters of a classification function, as it is an almost unbiased estimator of the test error, and we may expect its minimum to be close to the minimum of the test error. Unfortunately, computing the LOO error is time demanding when \ell is large, and becomes practically impossible when we need to know it for several values of the parameters of the machine used.
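The brute-force computation that the text calls time demanding is itself only a few lines; this hypothetical sketch retrains the full ECOC \ell times, reusing the train_ecoc and decode helpers sketched earlier:

```python
import numpy as np

def loo_error(X, y, M, d, gamma=1.0):
    """Exact LOO error of the ECOC: one full retraining per example."""
    mistakes = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        classifiers = train_ecoc(X[keep], y[keep], M, gamma=gamma)
        mistakes += decode(classifiers, M, X[i], d) != y[i]
    return mistakes / len(y)
```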

In the following theorem we give a bound on the LOO error of ECOC of kernel machines. An interesting property of this bound is that it only depends on the solution of the machine trained on the full dataset (so training the machine once will suffice). Below we denote by f_s the s-th machine, f_s(x) = \sum_{i=1}^{\ell} \alpha_i^s m_{y_i s} K^s(x_i, x), and let G^s_{ij} = K^s(x_i, x_j).

Theorem 1. Suppose the decoding function uses the linear loss. Then, the LOO error of the ECOC of kernel machines is bounded by

\frac{1}{\ell} \sum_{i=1}^{\ell} \theta\left( -g(x_i, y_i) + \max_{t \ne y_i} U_t(x_i) \right),   (8)

where, with r = r(x_i, y_i), we have defined the function

U_t(x_i) = (M_t - M_r) \cdot f(x_i) + \sum_{s=1}^{S} m_{y_i s} (m_{y_i s} - m_{ts}) \alpha_i^s G^s_{ii}.

To prove this theorem we first need the following lemma for binary kernel machines.

Lemma 4.1. Let f(x) be the kernel machine defined in Equations (5) and (6). Let f^i(x) be the solution of (4) found when the data point (x_i, c_i) is removed from the training set. We have

c_i f(x_i) - \alpha_i G_{ii} \le c_i f^i(x_i) \le c_i f(x_i).   (9)

Proof: The l.h.s. part of Inequality (9) was proved in [9]. Note that, if x_i is not a support vector, f = f^i and we have a trivial inequality. Thus suppose that x_i is a support vector. To prove the r.h.s. inequality we observe that:

H[f; D_\ell] \le H[f^i; D_\ell],   and   H[f; D_\ell^i] \ge H[f^i; D_\ell^i].

By subtracting the second bound from the first one and using the definition of H we obtain that V(c_i f(x_i)) \le V(c_i f^i(x_i)). Then, the result follows from the monotonicity property of V.

We now turn to the proof of Theorem 1. Our goal is to bound the difference between the multiclass margins of f and f^i, I = g(x_i, y_i) - g^i(x_i, y_i). Let us apply Lemma 4.1 to each SVM trained in the ECOC procedure. Inequality (9) can be rewritten as

m_{ts} f^i_s(x_i) = m_{ts} f_s(x_i) - \lambda_s m_{y_i s} m_{ts},   (10)

where \lambda_s is a parameter in [0, \alpha^s_i G^s_{ii}]. Using this equation we can transform the quantity I into the desired result through the following steps, where we define r^i = \arg\min_{t \ne y_i} d(M_t, f^i(x_i)):

I = M_{y_i} \cdot f(x_i) - M_r \cdot f(x_i) - [ M_{y_i} \cdot f^i(x_i) - M_{r^i} \cdot f^i(x_i) ]
  = M_{y_i} \cdot f(x_i) - M_{y_i} \cdot f^i(x_i) + [ M_{r^i} \cdot f^i(x_i) - M_r \cdot f(x_i) ]
  = \sum_{s=1}^{S} (m_{y_i s})^2 \lambda_s + (M_{r^i} - M_r) \cdot f(x_i) - \sum_{s=1}^{S} m_{y_i s} m_{r^i s} \lambda_s
  = (M_{r^i} - M_r) \cdot f(x_i) + \sum_{s=1}^{S} m_{y_i s} (m_{y_i s} - m_{r^i s}) \lambda_s.

Then, the bound in Eq. (8) follows by using the definition of r^i and by setting \lambda_s = \alpha^s_i G^s_{ii}.

This theorem highlights some interesting properties of the ECOC of kernel machines. First, observe that the larger the margin of an example is, the less likely it is that this point will be counted as a LOO error. Second, if the number of support vectors of each kernel machine is small, say less than \nu, the LOO error will be at most S\nu/\ell. Therefore, if S\nu \ll \ell the LOO error will be small, and so will be the test error. We also notice that, although the bound in Eq. (8) only uses knowledge about the machine trained on the full dataset, it gives an estimate close to the LOO error when the parameter C used to train the kernel machine is small, where small means that C \sup_i K(x_i, x_i) \ll 1.
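Read as pseudocode, Eq. (8) can be evaluated from a single training run. The sketch below is our reading of the bound under linear-loss decoding (d(M_q, f) = -M_q . f); it assumes, for every binary machine s, the margins f_s(x_i) on the training set, the coefficients \alpha_i^s (zero for non-support vectors) and the kernel diagonal G^s_ii.

```python
import numpy as np

def loo_bound(F, M, y, alpha, Gdiag):
    """Bound of Eq. (8) on the LOO error, linear-loss decoding.

    F[i, s]     margin f_s(x_i) of the machine trained on all data
    M[q, s]     coding matrix
    y[i]        class label of example i
    alpha[i, s] coefficient alpha_i^s (0 if x_i is not a support vector of s)
    Gdiag[i, s] kernel diagonal G^s_ii = K^s(x_i, x_i)
    """
    l, Q = F.shape[0], M.shape[0]
    mistakes = 0
    for i in range(l):
        scores = M @ F[i]                 # M_q . f(x_i), so d = -scores
        masked = np.where(np.arange(Q) == y[i], -np.inf, scores)
        r = int(np.argmax(masked))        # r(x_i, y_i): the runner-up class
        g = scores[y[i]] - scores[r]      # multiclass margin g(x_i, y_i)
        lam = alpha[i] * Gdiag[i]         # lambda_s = alpha_i^s G^s_ii
        U = [(M[t] - M[r]) @ F[i] + np.sum(M[y[i]] * (M[y[i]] - M[t]) * lam)
             for t in range(Q) if t != y[i]]
        mistakes += (-g + max(U)) > 0     # theta(.)
    return mistakes / l
```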
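The recipe behind these figures is a one-dimensional sweep: train once per candidate \gamma, evaluate the bound, keep the minimizer. A hypothetical sketch reusing the earlier helpers; for simplicity it assumes a code without zero entries (one-vs-all or dense), so that the support-vector indices of each machine refer to the full training set, and uses K(x, x) = 1 for the Gaussian kernel.

```python
import numpy as np

def select_gamma(X, y, M, gammas):
    """Pick the Gaussian kernel width minimizing the bound of Theorem 1."""
    best = (np.inf, None)
    for gamma in gammas:
        classifiers = train_ecoc(X, y, M, gamma=gamma)
        F = np.column_stack([c.decision_function(X) for c in classifiers])
        alpha = np.zeros_like(F)
        for s, c in enumerate(classifiers):
            alpha[c.support_, s] = np.abs(c.dual_coef_[0])
        Gdiag = np.ones_like(F)            # Gaussian kernel: K(x, x) = 1
        best = min(best, (loo_bound(F, M, y, alpha, Gdiag), gamma))
    return best[1]
```

For example, gamma = select_gamma(X, y, M, np.logspace(-3, 1, 9)) scans the same logarithmic range shown in the figures.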

Results indicate that the minimum of our LOO estimate is very close to the minimum of the test error, although we often observed a slight bias towards smaller values of the variance. Experiments with all ECOC schemes were repeated with such optimal parameters. Tables 5-7 show that we can improve performance significantly. Notice that in the case of one-vs-all and dense codes the advantage of using likelihood decoding is less evident than in the experiments of Section 3. We conjecture that this may be due to the fact that likelihood estimates are computed on a 3-fold cross validation; thus the optimal value of the variance should be computed separately for each fold. Moreover, the bias issues discussed above may further affect such estimates. We will investigate these problems in future experiments.

Table 5. One-vs-all: optimizing the variance of the Gaussian kernel (Optdigits, Pendigits, Satimage, Segment).

Table 6. All-pairs: optimizing the variance of the Gaussian kernel (Optdigits, Pendigits, Satimage, Segment).

Table 7. Dense codes: optimizing the variance of the Gaussian kernel (Optdigits, Pendigits, Satimage, Segment).

5 Conclusions

We studied ECOC constructed on margin-based binary classifiers under two complementary perspectives: the use of conditional probabilities for building a decoding function, and the use of a theoretically estimated bound on the leave-one-out error for optimizing kernel parameters. Our experiments show that transforming margins into conditional probabilities is very effective and improves the overall accuracy of multiclass classification in comparison to standard loss-based decoding schemes. Moreover, fitting Gaussian kernel parameters by means of our theoretical bound further improves the classification accuracy of ECOC machines.

REFERENCES

[1] E. L. Allwein, R. E. Schapire, and Y. Singer, Reducing multiclass to binary: A unifying approach for margin classifiers, in Proc. 17th International Conf. on Machine Learning, Morgan Kaufmann, San Francisco, CA, (2000).
[2] K. Crammer and Y. Singer, On the learnability and design of output codes for multiclass problems, in Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, (2000).
[3] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, (2000).
[4] F. Cucker and S. Smale, On the mathematical foundations of learning, Bulletin of the American Mathematical Society, 39(1), 1-49, (2001).
[5] T. G. Dietterich and G. Bakiri, Solving multiclass learning problems via error-correcting output codes, Journal of Artificial Intelligence Research, (1995).
[6] T. Evgeniou, M. Pontil, and T. Poggio, Regularization networks and support vector machines, Advances in Computational Mathematics, 13, 1-50, (2000).
[7] J. H. Friedman, Another approach to polychotomous classification, Technical report, Department of Statistics, Stanford University, (1997).
[8] Y. Guermeur, A. Elisseeff, and H. Paugam-Moisy, A new multi-class SVM based on a uniform convergence result, in Proceedings of IJCNN, International Joint Conference on Neural Networks, IEEE, (2000).
[9] T. Jaakkola and D. Haussler, Exploiting generative models in discriminative classifiers, in Proc. of Neural Information Processing Conference, (1998).
[10] J. Platt, Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, in Advances in Large Margin Classifiers, A. Smola et al., Eds., MIT Press, (1999).
[11] B. Schölkopf, C. Burges, and A. Smola, Advances in Kernel Methods: Support Vector Learning, MIT Press, (1998).
[12] V. N. Vapnik, Statistical Learning Theory, Wiley, New York, (1998).
[13] G. Wahba, Spline Models for Observational Data, Series in Applied Mathematics, Vol. 59, SIAM, Philadelphia, (1990).
