Estimating the sample complexity of a multi-class discriminant model

Yann Guermeur
LIP6, UMR CNRS 7606, Université Paris 6, 4 place Jussieu, 75252 Paris cedex 05
Yann.Guermeur@lip6.fr

André Elisseeff and Hélène Paugam-Moisy
ERIC, Université Lumière Lyon 2, 5 avenue Pierre Mendès-France, 69676 Bron cedex
{aelissee,hpaugam}@univ-lyon2.fr

Abstract

We study the generalization performance of a multi-class discriminant model. Several bounds on its sample complexity are derived from uniform convergence results based on different measures of capacity. This gives us an insight into the nature of the capacity measure best suited to the study of multi-class discriminant models.

1 Introduction

Since the pioneering work of Vapnik and Chervonenkis [14], extending the classical Glivenko-Cantelli theorem to a uniform convergence result over classes of indicator functions, many studies have dealt with uniform strong laws of large numbers (see for instance [10, 7]). The bounds they provide, both in pattern recognition and regression estimation, can readily be used to derive a sample complexity which is an increasing function of a particular measure of the capacity of the model considered. The choice of the appropriate bound, as well as the computation of a tight upper bound on the capacity measure, thus appear to be of central importance in deriving a tight bound on the sample complexity. This is the subject we investigate in this paper, through an application to the Multivariate Linear Regression (MLR) combiner described in [6]. Up to now, very few studies in statistical learning theory have dealt with multi-class discrimination. Section 2 briefly outlines the implementation of the MLR model for the combination of class posterior probability estimates. Section 3 is devoted to the estimation of the sample complexity using two combinatorial quantities generalizing the Vapnik-Chervonenkis (VC) dimension, called the graph dimension and the Natarajan dimension.
In section 4, another bound is derived from a theorem where the capacity of the family of functions is characterized by its covering number. Both bounds are discussed in section 5, where further improvements are proposed as perspectives.

2 MLR combiner for classifier combination

We consider a Q-category discrimination task, under the usual hypothesis that there is a joint distribution, fixed but unknown, on S = X × Y, where X is the input space and Y the set of categories. We further assume that, for each input pattern x ∈ X, the outputs of P classifiers are available. Let f_j denote the function computed by the j-th of these classifiers: f_j(x) = [f_jk(x)] ∈ R^Q. The k-th output f_jk(x) approximates the class posterior probability p(C_k|x). Precisely, f_j(x) ∈ U with

U = { u ∈ R^Q_+ : 1_Q^T u = 1 }
In other words, the outputs are non-negative and sum to 1. Let F(x) = [f_j(x)], (1 ≤ j ≤ P) (F(x) ∈ U^P), be the vector of predictors. The MLR model studied here, parameterized by v = [v_k] ∈ R^{Q²P}, computes the functions g ∈ G given by:

g(x) = [g_1(x), …, g_k(x), …, g_Q(x)]^T = [v_1^T, …, v_k^T, …, v_Q^T]^T F(x),  i.e.  g_k(x) = v_k^T F(x)

Let Λ be the set of loss functions satisfying the general conditions for outputs to be interpreted as probabilities (see for instance [3]), and s an N-sample of observations (s ∈ S^N). Among the functions of G, the MLR combiner is any of the functions which constitutes a solution to the following optimization problem:

Problem 1  Given a convex loss function L ∈ Λ and an N-sample s ∈ S^N, find a function in G minimizing the empirical risk Ĵ(v) and taking its values in U.

The MLR combiner is thus designed to take class posterior probability estimates as inputs and to output better estimates with respect to some given criterion (least squares, cross-entropy, …). v_{k,l,m}, the general term of v_k, is the coefficient associated with the predictor f_lm(x) in the regression computed to estimate p(C_k|x). Let v be the vector of all the parameters (v = [v_k] ∈ R^{Q²P}). As was pointed out in [6], optimal solutions to Problem 1 are obtained by minimizing Ĵ(v) subject to v ∈ V with

V = { v ∈ R^{Q²P}_+ : ∀(l, m), Σ_k (v_{k,l,m} − v_{k,l,1}) = 0, 1^T v = Q }

This result holds irrespective of the choice of L and s. Details on the computation of a global minimum can be found in [5]. To sum up, the model we study is a multiple-output perceptron, under the constraints ∀x ∈ X, F(x) ∈ U^P and v ∈ V. One can easily verify that the Euclidean norm ‖F(x)‖ of every vector in U^P is bounded above by √P. Furthermore, solving a simple quadratic programming problem establishes that for all v ∈ V and k ∈ {1, …, Q}, ‖v_k‖ ≤ √Q.

3 Growth function bounds

From now on, we consider the MLR combiner as a discriminant model, by application of Bayes' estimated decision rule.
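As a concrete special case of the model above (a hedged sketch, not the solver of Problem 1; the sizes Q and P and the choice of weights are illustrative), taking v_{k,l,m} = w_l when m = k and 0 otherwise, with the w_l forming a convex combination, makes g a weighted average of the P posterior estimates. The constraints defining V are then satisfied and g(x) lies in the simplex U:

```python
import numpy as np

rng = np.random.default_rng(1)
Q, P = 3, 4  # illustrative numbers of categories and classifiers

# P classifier outputs for one pattern, each row lying in the simplex U
f = rng.dirichlet(np.ones(Q), size=P)          # shape (P, Q)

# convex combination weights over classifiers: one valid choice of v in V
w = rng.dirichlet(np.ones(P))                  # w >= 0, sum(w) = 1

# v[k, l, m] = w[l] if m == k else 0  ->  g_k(x) = sum_l w_l * f_{lk}(x)
v = np.zeros((Q, P, Q))
for k in range(Q):
    v[k, :, k] = w

F = f.ravel()                                  # stacked vector F(x) in U^P
assert np.linalg.norm(F) <= np.sqrt(P) + 1e-12  # ||F(x)|| <= sqrt(P)

g = v.reshape(Q, -1) @ F                       # g_k(x) = v_k^T F(x)

# g(x) is itself in the simplex U ...
assert np.all(g >= 0) and np.isclose(g.sum(), 1.0)
# ... and v satisfies the constraints defining V:
# sum_k (v_{k,l,m} - v_{k,l,1}) = 0 for all (l, m), and 1^T v = Q
col_sums = v.sum(axis=0)                       # shape (P, Q)
assert np.allclose(col_sums, col_sums[:, :1])
assert np.isclose(v.sum(), Q)
```

This is only one point of V; the actual combiner solves the constrained minimization of Problem 1 over the whole set.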
Although the learning process described in the previous section only amounts to estimating probability densities on a finite sample, the criterion of interest, the generalization ability in terms of recognition rate, can be both estimated and controlled. To perform this task, we use results derived in the framework of computational learning theory and of the statistical learning theory introduced by Vapnik. Roughly speaking, they are corollaries of uniform strong laws of large numbers. The theorems used in this section are derived from bounds grounded on a combinatorial capacity measure called the growth function. In order to state them, we must first introduce additional notations and definitions. Let H be a family of functions from X to a finite set Y. E_s(h) and E(h) respectively designate the error on s (observed error) and the generalization error of a function h belonging to H.

Definition 1  Let H be a set of indicator functions (values in {0, 1}). Let s_X be an N-sample of X and Δ_H(s_X) the number of different classifications of s_X by the functions of H. The growth function Π_H is defined by:

Π_H(N) = max { Δ_H(s_X) : s_X ∈ X^N }

Definition 2  The VC dimension of a set H of indicator functions is the maximum number d of vectors that can be shattered, i.e. separated into two classes in all 2^d possible ways, using functions of H. If this maximum does not exist, the VC dimension is infinite.
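For intuition, the growth function of a simple indicator class can be computed by brute force. The sketch below is an illustration, not part of the paper's model: it counts the distinct classifications of a 1-D sample by threshold functions x ↦ 1{x > θ}, a class of VC dimension 1, for which Π_H(N) = N + 1.

```python
import numpy as np

def growth_threshold(sample):
    """Count distinct labelings of `sample` by h_theta(x) = 1 if x > theta else 0."""
    sample = np.asarray(sample, dtype=float)
    # candidate thresholds: one below the minimum, plus each sample point;
    # any other theta reproduces one of these labelings
    thetas = np.concatenate(([sample.min() - 1.0], sample))
    labelings = {tuple((sample > t).astype(int)) for t in thetas}
    return len(labelings)

sample = [0.3, 1.2, 2.7, 4.1, 5.5]
print(growth_threshold(sample))  # N + 1 = 6 distinct labelings
```

For richer classes the same enumeration quickly becomes infeasible, which is precisely why the closed-form bounds of this section are needed.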
These definitions only apply to sets of indicator functions. The following theorem, dealing with multiple-output functions, appears as an immediate corollary of a result due to Vapnik [12] (see for instance [13]):

Theorem 1  Let s ∈ S^N. With probability 1 − δ,

E(h) < E_s(h) + √( (1/N) ( ln Π_GH(2N) − ln(δ/4) ) ) + 1/N

GH, the graph space of H, is defined as follows: for h ∈ H, let Gh be the function from X × Y to {0, 1} defined by Gh(x, y) = 1 ⟺ h(x) = y, and GH = {Gh : h ∈ H}. Π_GH is the growth function of GH [9]. Thus, in order to obtain an upper bound on the confidence interval in Theorem 1, one must find an upper bound on the growth function associated with the MLR combiner. This constitutes the object of the next two subsections, where two different bounds are stated.

3.1 Graph dimension of the MLR combiner

The discriminant functions computed by the MLR combiner are also computed by the single-hidden-layer perceptron with threshold activation functions depicted in Figure 1. Several articles have been devoted to bounding the growth function and VC dimension of multilayer perceptrons [9, 11]. Proceeding as in [11], one can use as an upper bound on the growth function the product of the bounds on the growth functions of the individual hidden units derived from Sauer's lemma. Since the VC dimension of each hidden unit is at most equal to the dimension of the smallest subspace of R^{QP} containing U^P, and this dimension is d_E = P(Q − 1) + 1 [6], we get:

Π_GMLP(N) < (eN/d_E)^{Q(Q−1)d_E/2}   (1)

Figure 1: Architecture of a multi-layer perceptron with threshold units computing the same discriminant function as the MLR combiner. Each hidden unit computes a function h_{i,j}(F(x)) = t(F(x)^T (v_i − v_j)), (1 ≤ i < j ≤ Q), where t(z) = 1 if z > 0 and t(z) = −1 otherwise. The weights of the output layer, either +1 (solid lines) or −1 (dashed lines), and the biases are chosen so that the output units compute a logical AND.
The number below each layer corresponds to the number of units.

This bound can be significantly improved by making use of the dependences between the hidden units. For lack of space, we only exhibit here a simple way to do so. Let x ∈ s_X. Among all the possible classifications performed by the hidden unit whose vector is v_1 − v_2, exactly half of them associate the value 1 to F(x). The same is true for the hidden unit whose vector is v_2 − v_3. Now, if these two units provide the same output for F(x), then the output of the hidden unit whose vector is v_1 − v_3 is known. Consequently, the growth function associated with these three units is at most (eN/d_E)^{2d_E}. Proceeding step by step, we thus get an improved bound:

Π_GMLP(N) < (eN/d_E)^{(Q−1)d_E}   (2)
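Taking the two bounds at face value in the reconstructed forms Π_GMLP(N) < (eN/d_E)^{Q(Q−1)d_E/2} and Π_GMLP(N) < (eN/d_E)^{(Q−1)d_E}, the dependence argument shrinks the exponent, and hence the log of the bound, by a factor Q/2. A numerical check with illustrative values of Q, P and N:

```python
import math

Q, P = 4, 10
d_E = P * (Q - 1) + 1   # VC-dimension bound for each hidden unit
N = 10_000

# logarithms of the two growth-function bounds (reconstructed forms)
log_bound_1 = Q * (Q - 1) / 2 * d_E * math.log(math.e * N / d_E)
log_bound_2 = (Q - 1) * d_E * math.log(math.e * N / d_E)

print(log_bound_1 / log_bound_2)  # ratio of exponents = Q/2 = 2.0
```

Since the confidence interval of Theorem 1 grows like the square root of ln Π_GH, the improvement tightens the interval by a factor √(Q/2).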
3.2 Natarajan dimension

Natarajan introduced in [9] an extension of the VC dimension to multiple-output functions, based on the following definition of shattering:

Definition 3  A set H of discrete-valued functions shatters a set s_X of vectors if and only if there exist two functions h_1 and h_2 belonging to H such that: (a) for any x ∈ s_X, h_1(x) ≠ h_2(x); (b) for all s_1 ⊆ s_X, there exists h_3 ∈ H such that h_3 agrees with h_1 on s_1 and with h_2 on s_X \ s_1, i.e. ∀x ∈ s_1, h_3(x) = h_1(x), and ∀x ∈ s_X \ s_1, h_3(x) = h_2(x).

We established the following result in [6]:

Theorem 2  The Natarajan dimension d_N of the MLR combiner satisfies:

P + ⌊(Q−1)/2⌋ ≤ d_N ≤ Q(Q−1)(d_E − 1)/2

In [8], the following theorem was proved:

Theorem 3  Let N_cH(N) be the number of different classifications performed by a set of functions H on a set of size N. Let d_N be the Natarajan dimension of H. Then, for all d ≥ d_N:

N_cH(N) ≤ Σ_{i=0}^{d} C(N, i) (Q(Q−1)/2)^i

We have the following inequality:

Σ_{i=0}^{d} C(N, i) (Q(Q−1)/2)^i ≤ ( eNQ(Q−1) / (2d) )^d   (3)

Obviously, Π_GH(N) ≤ N_cH(N). Thus, substituting the upper bound on d_N provided by Theorem 2 for d in the formula of Theorem 3 and applying (3), we get:

Π_GMLP(N) < ( eN / (d_E − 1) )^{Q(Q−1)(d_E−1)/2}   (4)

This last bound is very similar to those provided by (1) and (2). This is a good indication that these bounds are quite loose, since bounding Π_GH(N) by N_cH(N) is crude. In fact, the difficulty with the use of the growth function lies in the way the specificity of the model (the links between the hidden units) is taken into account when applying a generalization of Sauer's lemma. This difficulty concerns all the approaches using combinatorial methods which do not consider how the model is built. This led us to consider an alternative definition of the capacity of the combiner, which yields an original method to obtain confidence bounds for multi-class discriminant models based on the computation of the a posteriori probabilities p(C_k|x).
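The combinatorial bound of Theorem 3, in the form reconstructed here with C(Q, 2) = Q(Q−1)/2 label pairs, can be evaluated directly and checked against its closed-form relaxation (3). All numeric values below are illustrative:

```python
import math

def n_c_upper(N, d, Q):
    """Sum_{i=0}^{d} C(N, i) * (Q*(Q-1)/2)**i  -- Theorem 3's combinatorial bound."""
    m = Q * (Q - 1) // 2
    return sum(math.comb(N, i) * m**i for i in range(d + 1))

def closed_form(N, d, Q):
    """(e*N*Q*(Q-1) / (2*d))**d  -- the relaxation given by inequality (3)."""
    m = Q * (Q - 1) / 2
    return (math.e * N * m / d) ** d

N, d, Q = 100, 5, 3
assert n_c_upper(N, d, Q) <= closed_form(N, d, Q)
```

The exact sum is noticeably smaller than the closed form; the real looseness, however, comes from bounding Π_GH(N) by N_cH(N), as noted above.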
4 Covering number bounds

Many studies have been devoted to giving rates of uniform convergence of means to their expectations, based on a capacity measure called the covering number [12, 10, 7]. To define this measure, we first introduce the notion of ε-cover in a pseudo-metric space.

Definition 4  Let (E, ρ) be a pseudo-metric space. A finite set T ⊆ E is an ε-cover of a set H ⊆ E with respect to ρ if for all h ∈ H there is a h̄ ∈ T such that ρ(h, h̄) ≤ ε.

Definition 5  Given ε and ρ, the covering number N(ε, H, ρ) of H is the size of the smallest ε-cover of H.

Covering numbers were introduced in learning theory to derive bounds for regression estimation problems. In [2], Bartlett extended their use to discrimination, specifically to dichotomy computation. From now on, H is supposed to be a set of functions taking their values in R. The discriminant function associated with any of the functions h ∈ H is the function t ∘ h, where t is the sign function defined in the caption of Figure 1. E(h) is defined accordingly.
Definition 6  Let π_γ : R → [−γ, γ] be the piecewise-linear function such that:

π_γ(x) = γ t(x) if |x| ≥ γ, and π_γ(x) = x otherwise

Let us define π_γ(H) = {π_γ ∘ h : h ∈ H} and N_∞(ε, H, N) = max_{s_X} N(ε, H, ‖·‖^{s_X}_∞), where ‖h‖^{s_X}_∞ = max_{1≤i≤N} |h(x_i)|. Let the empirical error according to the margin γ be defined as:

Definition 7  E^γ_s(h) = (1/N) |{ (x_i, y_i) ∈ s : h(x_i) y_i < γ }|

We have:

Theorem 4  Suppose γ > 0 and 0 < δ < 1/2. Then, with probability at least 1 − δ,

E(h) ≤ E^γ_s(h) + √( (2/N) ln( 2 N_∞(γ/2, π_γ(H), N) / δ ) )

The MLR model takes its values in [0, 1]. For each pattern x, the desired output y is the canonical coding of the category of x. If |g_k(x) − y_k| < 1/2 for all k, then the use of Bayes' estimated decision rule provides the correct classification for x. A contrario, an error can occur only if |g_k(x) − y_k| ≥ 1/2 for at least one k. For all k, we define E(g_k) as the error for g_k − 1/2, and E(g) as the generalization error of the discriminant function associated with g. Then, we have:

E(g) ≤ Σ_{k=1}^{Q} E(g_k)   (5)

As in the previous section, bounding the generalization ability of the MLR combiner amounts to bounding a measure of capacity. This time, however, this measure is defined for each individual function g_k. The end of this section thus deals with stating bounds on N_∞(γ/2, π_γ(H), N), where H equals {g_k − 1/2} for any k in {1, …, Q}. Since the function π_γ is 1-Lipschitzian,

N_∞(γ/2, π_γ(H), N) ≤ N_∞(γ/2, H, N)

Let T be the affine application defined over B_{QP} (the unit ball of R^{QP} endowed with ‖·‖) as:

T : B_{QP} → R^N, w ↦ √Q [F(x_1)^T w, …, F(x_N)^T w]^T − (1/2) 1_N

It has the property that { [F(x_1)^T w − 1/2, …, F(x_N)^T w − 1/2]^T s.t. ‖w‖ ≤ √Q } ⊆ T(B_{QP}). Hence, since ‖v_k‖ ≤ √Q,

N_∞(γ/2, H, N) ≤ N(γ/2, T(B_{QP}), ‖·‖_∞)   (6)

According to [4], we have

N(γ/2, T(B_{QP}), ‖·‖_∞) ≤ ( 8 ‖T̃‖ / γ )^{QP}   (7)

where T̃ is the linear operator associated with T and ‖T̃‖ = sup_{w ∈ B_{QP}} ‖T̃w‖_∞ / ‖w‖. Since ‖F(x)‖ ≤ √P, the Cauchy-Schwarz inequality shows that ‖T̃‖ is bounded above by √(QP).
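Covering numbers on a sample, as used in Theorem 4, can be upper-bounded constructively: a greedy pass that opens a new center whenever a function lies farther than ε (in the empirical sup norm) from all existing centers produces an ε-cover, and its size upper-bounds N(ε, H, ‖·‖^{s_X}_∞). A small sketch over a finite, randomly drawn function class (all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def greedy_cover(values, eps):
    """values: (n_functions, N) array of function values on a fixed N-sample.
    Returns indices of centers forming an eps-cover in the sup norm."""
    centers = []
    for i, v in enumerate(values):
        if not any(np.max(np.abs(v - values[c])) <= eps for c in centers):
            centers.append(i)
    return centers

H = rng.uniform(-1.0, 1.0, size=(200, 10))   # 200 functions seen through 10 points
eps = 0.5
cover = greedy_cover(H, eps)

# by construction, every function is within eps of some center,
# so len(cover) is an upper bound on the covering number on this sample
assert all(any(np.max(np.abs(h - H[c])) <= eps for c in cover) for h in H)
```

For the sets T(B_{QP}) considered above, the operator-theoretic bound (7) plays the role of this construction in closed form.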
Injecting (7) into (6), and applying Theorem 4 to bound each term of the right-hand side of (5), yields

E(g) ≤ Σ_{k=1}^{Q} E^γ_s(g_k) + Q √( (2/N) ( ln(2Q/δ) + QP ln( 8√(QP)/γ ) ) )

This bound on the error should be compared with the bounds derived from Theorem 1 by substituting the successive bounds established in section 3 for Π_GH(N).

5 Discussion and Conclusion

This paper analyses two strategies to derive bounds on the generalization error of a multi-class discriminant model. One is based on combinatorial dimensions and the other uses covering numbers. The bounds on the generalization ability derived in the two former sections cannot be readily compared, since they rest on different definitions of the empirical error. A direct application of these results to a biological application [5]
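Reading the bound above (in the form reconstructed here) as a confidence term ε(N) = Q √((2/N)(ln(2Q/δ) + QP ln(8√(QP)/γ))), one can solve for the sample size N that brings this term below a target ε₀ — which is precisely a sample complexity. A sketch, where all numeric values are illustrative and the constants inherit the uncertainty of the reconstruction:

```python
import math

def confidence_term(N, Q, P, gamma, delta):
    """Additive term of the covering-number bound (reconstructed form)."""
    log_cov = Q * P * math.log(8 * math.sqrt(Q * P) / gamma)
    return Q * math.sqrt(2.0 / N * (math.log(2 * Q / delta) + log_cov))

def sample_complexity(eps, Q, P, gamma, delta):
    """Smallest N with confidence_term(N) <= eps, solved in closed form:
    Q * sqrt(2A/N) <= eps  <=>  N >= 2 Q^2 A / eps^2."""
    log_cov = Q * P * math.log(8 * math.sqrt(Q * P) / gamma)
    A = math.log(2 * Q / delta) + log_cov
    return math.ceil(2.0 * Q**2 * A / eps**2)

N0 = sample_complexity(0.05, Q=3, P=4, gamma=0.25, delta=0.05)
assert confidence_term(N0, 3, 4, 0.25, 0.05) <= 0.05
```

The quadratic dependence on Q and the linear dependence on P make explicit how the number of categories and of combined classifiers drives the required sample size.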
leads to similar bounds in both cases, which are roughly 5%. These bounds are not accurate enough for such an application. In their current implementation, both methods fail to provide useful bounds for real-world applications. However, this study sheds light on some features which could be used to improve the generalization control so as to make it of practical interest. As pointed out in section 3, the combinatorial approach does not take into account the specificity of the model, and should be avoided as it is used here. Indeed, improving the estimate of the Natarajan dimension will not decrease the gap between Π_GH(N) and N_cH(N), which is one of the most important Achilles' heels of the method. To use knowledge of the structure when bounding the generalization error, we introduced a method based on covering numbers which is new for multi-class discrimination. This method, however, has the disadvantage of crudely bounding E(g) by the sum Σ_{k=1}^{Q} E(g_k). This is the main weakness of the method, and future work should address it. One way to do so is to develop a global approach directly controlling the generalization error E(g) in terms of covering numbers, instead of controlling the individual generalization errors E(g_k). This will be the subject of our next research. Thus, by presenting a new approach for the study of multi-class discriminant models with real internal representations, we have stated new bounds and pointed out a way to improve them. Covering numbers make it possible to use knowledge of the learning system and to include it in the confidence bounds. Further work will be to derive more practical bounds for the model of interest.

References

[1] M. Anthony (1997): Probabilistic Analysis of Learning in Artificial Neural Networks: The PAC Model and its Variants. Neural Computing Surveys, Vol. 1, 1-47.
[2] P. Bartlett (1998): The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network.
Technical report, Department of Systems Engineering, Australian National University, ftp: syseng.anu.edu.au:pub/peter/tr9d.ps.
[3] C.M. Bishop (1995): Neural Networks for Pattern Recognition. Clarendon Press, Oxford.
[4] B. Carl and I. Stephani (1990): Entropy, Compactness, and the Approximation of Operators. Cambridge University Press, Cambridge, UK.
[5] Y. Guermeur, C. Geourjon, P. Gallinari and G. Deléage (1999): Improved Performance in Protein Secondary Structure Prediction by Inhomogeneous Score Combination. To appear in Bioinformatics.
[6] Y. Guermeur, H. Paugam-Moisy and P. Gallinari (1998): Multivariate Linear Regression on Classifier Outputs: a Capacity Study. ICANN'98, 693-698.
[7] D. Haussler (1992): Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications. Information and Computation, 100, 78-150.
[8] D. Haussler and P.M. Long (1995): A Generalization of Sauer's Lemma. Journal of Combinatorial Theory, Series A, 71, 219-240.
[9] B.K. Natarajan (1989): On Learning Sets and Functions. Machine Learning, Vol. 4, 67-97.
[10] D. Pollard (1984): Convergence of Stochastic Processes. Springer Series in Statistics, Springer-Verlag, N.Y.
[11] J. Shawe-Taylor and M. Anthony (1991): Sample sizes for multiple-output threshold networks. Network: Computation in Neural Systems, Vol. 2, 107-117.
[12] V.N. Vapnik (1982): Estimation of Dependences Based on Empirical Data. Springer-Verlag, N.Y.
[13] V.N. Vapnik (1998): Statistical Learning Theory. John Wiley & Sons, Inc., N.Y.
[14] V.N. Vapnik and A.Y. Chervonenkis (1971): On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, Vol. 16, 264-280.