Estimating the sample complexity of a multi-class discriminant model

Yann Guermeur
LIP6, UMR CNRS, Université Paris 6, place Jussieu, Paris cedex 05
Yann.Guermeur@lip6.fr

André Elisseeff and Hélène Paugam-Moisy
ERIC, Université Lumière Lyon 2, 5 avenue Pierre Mendès-France, Bron cedex
{aelissee,hpaugam}@univ-lyon2.fr

Abstract

We study the generalization performance of a multi-class discriminant model. Several bounds on its sample complexity are derived from uniform convergence results based on different measures of capacity. This gives us an insight into the nature of the capacity measure best suited to the study of multi-class discriminant models.

1 Introduction

Since the pioneering work of Vapnik and Chervonenkis [14], which extended the classical Glivenko-Cantelli theorem to a uniform convergence result over classes of indicator functions, many studies have dealt with uniform strong laws of large numbers (see for instance [10, 7]). The bounds they provide, both in pattern recognition and in regression estimation, can readily be used to derive a sample complexity which is an increasing function of a particular measure of the capacity of the model considered. The choice of the appropriate bound, as well as the computation of a tight upper bound on the capacity measure, are thus of central importance for deriving a tight bound on the sample complexity. This is the subject we investigate in this paper, through an application to the Multivariate Linear Regression (MLR) combiner described in [6]. Up to now, very few studies in statistical learning theory have dealt with multi-class discrimination.

Section 2 briefly outlines the implementation of the MLR model for combining class posterior probability estimates. Section 3 is devoted to the estimation of the sample complexity using two combinatorial quantities which generalize the Vapnik-Chervonenkis (VC) dimension, the graph dimension and the Natarajan dimension. In Section 4, another bound is derived from a theorem in which the capacity of the family of functions is characterized by its covering number. Both bounds are discussed in Section 5, and further improvements are proposed as perspectives.

2 MLR combiner for classifier combination

We consider a $Q$-category discrimination task, under the usual hypothesis that there is a joint distribution, fixed but unknown, on $S = X \times Y$, where $X$ is the input space and $Y$ the set of categories. We further assume that, for each input pattern $x \in X$, the outputs of $P$ classifiers are available. Let $f_j$ denote the function computed by the $j$-th of these classifiers: $f_j(x) = [f_{jk}(x)] \in \mathbb{R}^Q$. The $k$-th output $f_{jk}(x)$ approximates the class posterior probability $p(C_k \mid x)$. Precisely, $f_j(x) \in U$ with

$$U = \left\{ u \in \mathbb{R}_+^Q \; : \; \mathbf{1}^T u = 1 \right\},$$

in other words, the outputs are non-negative and sum to one.
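To make this setting concrete, here is a minimal Python sketch (the base classifiers and the softmax normalization below are illustrative assumptions, not part of the model) which builds, for one input pattern, the $P$ probability vectors $f_j(x) \in U$:

```python
import numpy as np

Q, P = 3, 2          # assumed numbers of categories and of base classifiers
rng = np.random.default_rng(0)

def to_simplex(scores):
    """Map raw scores to an element of U: non-negative entries summing to one.
    Softmax is just one convenient, assumed normalization; any map into U would do."""
    z = np.exp(scores - scores.max())
    return z / z.sum()

x = rng.normal(size=5)                                               # a dummy input pattern
raw_scores = [rng.normal(size=(Q, x.size)) @ x for _ in range(P)]    # hypothetical classifiers
f = [to_simplex(s) for s in raw_scores]                              # f_j(x), one element of U per classifier

for f_j in f:
    assert np.all(f_j >= 0) and np.isclose(f_j.sum(), 1.0)
```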

Let $F(x) = [f_j(x)]$, $1 \le j \le P$ (so that $F(x) \in U^P$), be the vector of predictors. The MLR model studied here, parameterized by $v = [v_k]$ with $v_k \in \mathbb{R}^{PQ}$, computes the functions $g \in G$ given by:

$$g(x) = \begin{bmatrix} g_1(x) \\ \vdots \\ g_k(x) \\ \vdots \\ g_Q(x) \end{bmatrix} = \begin{bmatrix} v_1^T \\ \vdots \\ v_k^T \\ \vdots \\ v_Q^T \end{bmatrix} F(x)$$

Let $\mathcal{L}$ be the set of loss functions satisfying the general conditions for outputs to be interpreted as probabilities (see for instance [7]), and let $s$ be an $N$-sample of observations ($s \in S^N$). Among the functions of $G$, the MLR combiner is any of the functions which constitutes a solution to the following optimization problem:

Problem 1. Given a convex loss function $L \in \mathcal{L}$ and an $N$-sample $s \in S^N$, find a function in $G$ minimizing the empirical risk $\hat J(v)$ and taking its values in $U$.

The MLR combiner is thus designed to take as inputs class posterior probability estimates and to output better estimates with respect to some given criterion (least squares, cross-entropy, ...). $v_{k,l,m}$, the general term of $v_k$, is the coefficient associated with the predictor $f_{lm}(x)$ in the regression computed to estimate $p(C_k \mid x)$. Let $v$ be the vector of all the parameters ($v = [v_k] \in \mathbb{R}^{PQ^2}$). As was pointed out in [6], optimal solutions to Problem 1 are obtained by minimizing $\hat J(v)$ subject to $v \in V$, with

$$V = \left\{ v \in \mathbb{R}_+^{PQ^2} \; : \; \forall (l, m), \; \sum_{k} \left( v_{k,l,m} - v_{k,l,1} \right) = 0, \quad \mathbf{1}_{PQ^2}^T v = Q \right\}.$$

This result holds irrespective of the choice of $L$ and $s$. Details on the computation of a global minimum can be found in [5]. To sum up, the model we study is a multiple-output perceptron, under the constraints $\forall x \in X$, $F(x) \in U^P$ and $v \in V$. One can easily verify that the Euclidean norm $\|F(x)\|$ of every vector in $U^P$ is bounded above by $\sqrt{P}$. Furthermore, solving a simple quadratic programming problem establishes that for all $v \in V$ and $k \in \{1, \ldots, Q\}$, $\|v_k\| \le \sqrt{Q}$.

3 Growth function bounds

From now on, we consider the MLR combiner as a discriminant model, by application of Bayes' estimated decision rule. Although the learning process described in the previous section only amounts to estimating probability densities on a finite sample, the criterion of interest, the generalization ability in terms of recognition rate, can be both estimated and controlled. To perform this task, we use results derived in the framework of computational learning theory and of the statistical learning theory introduced by Vapnik. Roughly speaking, they are corollaries of uniform strong laws of large numbers. The theorems used in this section are derived from bounds grounded on a combinatorial capacity measure called the growth function. In order to express them, we must first introduce additional notations and definitions. Let $H$ be a family of functions from $X$ to a finite set $Y$. $E_s(h)$ and $E(h)$ respectively designate the error on $s$ (observed error) and the generalization error of a function $h$ belonging to $H$.

Definition 1. Let $H$ be a set of indicator functions (values in $\{0,1\}$). Let $s_X$ be an $N$-sample of $X$ and $\Delta_H(s_X)$ the number of different classifications of $s_X$ by the functions of $H$. The growth function $\Delta_H$ is defined by:

$$\Delta_H(N) = \max \left\{ \Delta_H(s_X) \; : \; s_X \in X^N \right\}.$$

Definition 2. The VC dimension of a set $H$ of indicator functions is the maximum number $d$ of vectors that can be shattered, i.e. separated into two classes in all $2^d$ possible ways, using functions of $H$. If this maximum does not exist, the VC dimension is equal to infinity.
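As a toy illustration of Definitions 1 and 2 (a sketch only: the finite class of threshold functions below is an assumption chosen for simplicity, not part of the paper's analysis), the growth function and the shattering property can be checked by brute force:

```python
import numpy as np

# A toy class of indicator functions on R: h_t(x) = 1 if x >= t, else 0.
thresholds = np.linspace(-1.0, 1.0, 21)
H = [lambda x, t=t: (x >= t).astype(int) for t in thresholds]

def n_classifications(s_X):
    """Delta_H(s_X): number of distinct labelings of the sample s_X by H."""
    return len({tuple(h(s_X)) for h in H})

def growth_function(N, trials=200, rng=np.random.default_rng(0)):
    """Estimate Delta_H(N) by maximizing over random N-samples (a lower bound)."""
    return max(n_classifications(rng.uniform(-1, 1, size=N)) for _ in range(trials))

def shattered(s_X):
    """Check whether H shatters s_X, i.e. realizes all 2^d labelings."""
    return n_classifications(s_X) == 2 ** len(s_X)

print([growth_function(N) for N in range(1, 5)])                      # grows like N + 1 for thresholds
print(shattered(np.array([0.0])), shattered(np.array([-0.5, 0.5])))   # True, False
```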

These definitions only apply to sets of indicator functions. The following theorem, dealing with multiple-output functions, appears as an immediate corollary of a result due to Vapnik [12] (see for instance [13]):

Theorem 1. Let $s \in S^N$. With probability $1 - \delta$,

$$E(h) < E_s(h) + \sqrt{\frac{1}{N} \left( \ln \Delta_{GH}(2N) - \ln \delta \right)} + \frac{1}{N}$$

$GH$, the graph space of $H$, is defined as follows: for $h \in H$, let $Gh$ be the function from $X \times Y$ to $\{0, 1\}$ defined by $Gh(x, y) = 1 \iff h(x) = y$, and let $GH = \{ Gh : h \in H \}$. $\Delta_{GH}$ is the growth function of $GH$ [9]. Thus, in order to obtain an upper bound on the confidence interval in Theorem 1, one must find an upper bound on the growth function associated with the MLR combiner. This constitutes the object of the next two subsections, where two different bounds are stated.

3.1 Graph dimension of the MLR combiner

The discriminant functions computed by the MLR combiner are also computed by the single-hidden-layer perceptron with threshold activation functions depicted in Figure 1. Several articles have been devoted to bounding the growth function and VC dimension of multilayer perceptrons [9, 11]. Proceeding as in [11], one can use as upper bound on the growth function the product of the bounds on the growth functions of the individual hidden units derived from Sauer's lemma. Since the VC dimension of each hidden unit is at most equal to the dimension of the smallest subspace of $\mathbb{R}^{PQ}$ containing $U^P$, and this dimension is $d_E = P(Q - 1) + 1$ [6], we get:

$$\Delta_{GMLP}(N) < \left( \frac{eN}{d_E} \right)^{\frac{Q(Q-1)}{2} d_E} \qquad (1)$$

Figure 1: Architecture of a multi-layer perceptron with threshold units computing the same discriminant function as the MLR combiner. Each hidden unit computes a function $h_{i,j}(F(x)) = t\left( F(x)^T (v_i - v_j) \right)$, $1 \le i < j \le Q$, where $t(z) = 1$ if $z > 0$ and $t(z) = -1$ otherwise. The weights of the output layer, either $+1$ (solid lines) or $-1$ (dashed lines), and the biases are chosen so that the output units compute a logical AND. The number below each layer corresponds to the number of units.

This bound can be significantly improved by making use of the dependences between the hidden units. For lack of space, we only exhibit here a simple way to do so. Let $x \in s_X$. Among all the possible classifications performed by the hidden unit whose vector is $v_1 - v_2$, exactly half of them associate to $F(x)$ the value 1. The same is true for the hidden unit whose vector is $v_2 - v_3$. Now if these two units provide the same output for $F(x)$, then the output of the hidden unit whose vector is $v_1 - v_3$ is known. Consequently, the growth function associated with these three units is at most $\left( \frac{eN}{d_E} \right)^{2 d_E}$. Proceeding step by step, we thus get an improved bound of the form:

$$\Delta_{GMLP}(N) < \left( \frac{eN}{d_E} \right)^{\alpha\, d_E}, \quad \text{with } \alpha < \frac{Q(Q-1)}{2}. \qquad (2)$$
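As a purely illustrative sketch, the confidence term of Theorem 1 can be evaluated numerically once the Sauer-based bound (1) is plugged in; the values of $Q$, $P$, $N$ and $\delta$ below are assumptions, not figures from the paper:

```python
import numpy as np

def confidence_term(N, Q, P, delta=0.05):
    """Confidence interval of Theorem 1, with ln Delta_GH(2N) replaced by the
    Sauer-based bound (1) evaluated at 2N: (Q(Q-1)/2) * d_E * ln(2eN/d_E)."""
    d_E = P * (Q - 1) + 1                                   # VC dimension of one hidden unit
    ln_growth = (Q * (Q - 1) / 2) * d_E * np.log(2 * np.e * N / d_E)
    return np.sqrt((ln_growth - np.log(delta)) / N) + 1.0 / N

# Illustrative values only (assumed, not taken from the paper).
for N in (10**3, 10**4, 10**5, 10**6):
    print(N, round(confidence_term(N, Q=4, P=3), 3))
```

Even for moderate $Q$ and $P$, the term only becomes small for very large $N$, which is consistent with the looseness discussed in the following sections.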

3.2 Natarajan dimension

Natarajan introduced in [9] an extension of the VC dimension to multiple-output functions, based on the following definition of shattering:

Definition 3. A set $H$ of discrete-valued functions shatters a set $s_X$ of vectors if and only if there exist two functions $h_1$ and $h_2$ belonging to $H$ such that: (a) for any $x \in s_X$, $h_1(x) \neq h_2(x)$; (b) for all $s_1 \subset s_X$, there exists $h_3 \in H$ such that $h_3$ agrees with $h_1$ on $s_1$ and with $h_2$ on $s_X \setminus s_1$, i.e. $\forall x \in s_1$, $h_3(x) = h_1(x)$, and $\forall x \in s_X \setminus s_1$, $h_3(x) = h_2(x)$.

We established the following result in [6]:

Theorem 2. The Natarajan dimension $d_N$ of the MLR combiner satisfies:

$$P + \left\lfloor \frac{Q-1}{2} \right\rfloor \le d_N \le Q\,(d_E - 1)$$

In [8], the following theorem was proved:

Theorem 3. Let $N_c^H(N)$ be the number of different classifications performed by a set of functions $H$ on a set of size $N$. Let $d_N$ be the Natarajan dimension of $H$. Then, for all $d \ge d_N$:

$$N_c^H(N) \le \sum_{i=0}^{i=d} \binom{N}{i} \binom{Q}{2}^i$$

We have the following inequality:

$$\sum_{i=0}^{i=d} \binom{N}{i} \binom{Q}{2}^i \le \left( \frac{eN \binom{Q}{2}}{d} \right)^d \qquad (3)$$

Obviously, $\Delta_{GH}(N) \le N_c^H(N)$. Thus, substituting the upper bound on $d_N$ provided by Theorem 2 for $d$ in the formula of Theorem 3 and applying (3), we get:

$$\Delta_{GMLP}(N) < \left( \frac{eN \binom{Q}{2}}{Q\,(d_E - 1)} \right)^{Q\,(d_E - 1)} \qquad (4)$$

This last bound is very similar to those provided by (1) and (2). This is a good indication that these bounds are quite loose, since bounding $\Delta_{GH}(N)$ by $N_c^H(N)$ is crude. In fact, the difficulty with the use of the growth function lies in the way the specificity of the model (the links between the hidden units) is taken into account when applying a generalization of Sauer's lemma. This difficulty concerns all the approaches using combinatorial methods which do not consider how the model is built. This led us to consider an alternative definition of the capacity of the combiner, which provides an original method for deriving confidence bounds for multi-class discriminant models based on the computation of the a posteriori probabilities $p(C_k \mid x)$.

4 Covering number bounds

Many studies have been devoted to giving rates of uniform convergence of means to their expectations, based on a capacity measure called the covering number [12, 10, 7]. To define this measure, we first introduce the notion of $\varepsilon$-cover in a pseudo-metric space.

Definition 4. Let $(E, \rho)$ be a pseudo-metric space. A finite set $T \subset E$ is an $\varepsilon$-cover of a set $H \subset E$ with respect to $\rho$ if for all $h \in H$ there is a $\bar h \in T$ such that $\rho(h, \bar h) \le \varepsilon$.

Definition 5. Given $\varepsilon$ and $\rho$, the covering number $\mathcal{N}(\varepsilon, H, \rho)$ of $H$ is the size of the smallest $\varepsilon$-cover of $H$.

Covering numbers were introduced in learning theory to derive bounds for regression estimation problems. In [2], Bartlett extended their use to discrimination, specifically to dichotomy computation. From now on, $H$ is supposed to be a set of functions taking their values in $\mathbb{R}$. The discriminant function associated with any of the functions $h \in H$ is the function $t \circ h$, where $t$ is the sign function defined in the caption of Figure 1. $E(h)$ is defined accordingly.
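Definitions 4 and 5 can be illustrated with a small numerical sketch (the random set of functions, the sample size and the greedy construction are all assumptions made for illustration; a greedy cover only upper-bounds the true covering number):

```python
import numpy as np

def greedy_cover_size(points, eps, dist):
    """Size of an eps-cover of `points` built greedily, an upper bound on the
    covering number N(eps, points, dist) of Definition 5."""
    centers = []
    for p in points:
        if all(dist(p, c) > eps for c in centers):
            centers.append(p)        # p is not eps-close to any center yet
    return len(centers)

# Pseudo-metric used in the next section: sup-distance between two functions
# restricted to a sample (here each function is simply the vector of its values).
sup_dist = lambda u, v: np.max(np.abs(u - v))

rng = np.random.default_rng(0)
H_on_sample = rng.uniform(0, 1, size=(500, 20))   # 500 functions evaluated on N = 20 points
for eps in (0.5, 0.25, 0.1):
    print(eps, greedy_cover_size(H_on_sample, eps, sup_dist))
```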

Definition 6. Let $\pi_\gamma : \mathbb{R} \to [-\gamma, \gamma]$ be the piecewise-linear function such that:

$$\pi_\gamma(x) = \begin{cases} \gamma\, t(x) & \text{if } |x| \ge \gamma \\ x & \text{otherwise.} \end{cases}$$

Let us define $H_\gamma = \{ \pi_\gamma(h) : h \in H \}$ and $\mathcal{N}_\infty(\varepsilon, H_\gamma, N) = \max_{s_X \in X^N} \mathcal{N}(\varepsilon, H_\gamma, \|\cdot\|_\infty^{s_X})$, where $\|h\|_\infty^{s_X} = \max_{1 \le i \le N} |h(x_i)|$. Let the empirical error according to the margin $\gamma$ be defined as:

Definition 7. $E_s^\gamma(h) = \frac{1}{N} \left| \{ (x_i, y_i) \in s \; : \; h(x_i)\, y_i < \gamma \} \right|$

We have:

Theorem 4. Suppose $\gamma > 0$ and $0 < \delta < 1/2$. Then, with probability at least $1 - \delta$,

$$E(h) \le E_s^\gamma(h) + \sqrt{\frac{2}{N} \ln \frac{2\, \mathcal{N}_\infty(\gamma/2, H_\gamma, N)}{\delta}}.$$

The MLR model takes its values in $[0, 1]^Q$. For each pattern $x$, the desired output $y$ is the canonical coding of the category of $x$. If $|g_k(x) - y_k| < 1/2$ for all $k$, then the use of Bayes' estimated decision rule will provide the correct classification for $x$. A contrario, an error can occur only if $|g_k(x) - y_k| \ge 1/2$ for at least one $k$. For all $k$, we define $E(g_k)$ as being the error of $g_k - 1/2$, and $E(g)$ as the generalization error of the discriminant function associated with $g$. Then we have:

$$E(g) \le \sum_{k=1}^{Q} E(g_k) \qquad (5)$$

As in the previous section, bounding the generalization ability of the MLR combiner amounts to bounding a measure of capacity. This time, however, this measure is defined for each individual function $g_k$. The end of this section thus deals with stating bounds on $\mathcal{N}_\infty(\gamma/2, H_\gamma, N)$, where $H$ equals $\{ g_k - 1/2 \}$ for any $k$ in $\{1, \ldots, Q\}$. Since the function $\pi_\gamma$ is 1-Lipschitz,

$$\mathcal{N}_\infty(\gamma/2, H_\gamma, N) \le \mathcal{N}_\infty(\gamma/2, H, N).$$

Let $T$ be the affine application defined over $B_{PQ}$ (the unit ball of $\mathbb{R}^{PQ}$ endowed with $\|\cdot\|_2$) as:

$$T : B_{PQ} \to \mathbb{R}^N, \qquad w \mapsto \sqrt{Q}\, [F(x_1)^T w, \ldots, F(x_N)^T w]^T - \tfrac{1}{2}\, \mathbf{1}_N.$$

It has the property that $\{ [F(x_1)^T w - 1/2, \ldots, F(x_N)^T w - 1/2]^T \text{ s.t. } \|w\| \le \sqrt{Q} \} \subset T(B_{PQ})$. Hence, since $\|v_k\| \le \sqrt{Q}$,

$$\mathcal{N}_\infty(\gamma/2, H, N) \le \mathcal{N}(\gamma/2, T(B_{PQ}), \|\cdot\|_\infty). \qquad (6)$$

According to [4], we have

$$\mathcal{N}(\gamma/2, T(B_{PQ}), \|\cdot\|_\infty) \le \left( \frac{8\, \|\tilde T\|}{\gamma} \right)^{PQ}, \qquad (7)$$

where $\tilde T$ is the linear operator associated with $T$ and $\|\tilde T\| = \sup_{w \in B_{PQ}} \frac{\|\tilde T w\|_\infty}{\|w\|}$. Since $\|F(x)\| \le \sqrt{P}$, the Cauchy-Schwarz inequality implies that $\|\tilde T\|$ is bounded above by $\sqrt{QP}$. By injection of (7) into (6) and by applying Theorem 4 to bound the right-hand side of (5), this yields:

$$E(g) \le \sum_{k=1}^{Q} \left( E_s^{\gamma_k}(g_k) + \sqrt{\frac{2}{N} \left( \ln \frac{2}{\delta} + PQ \ln \frac{8 \sqrt{QP}}{\gamma_k} \right)} \right),$$

where $\gamma_k$ is the margin used for $g_k$. This bound on the error should be related to the bound derived from Theorem 1 by substituting for $\Delta_{GH}(2N)$ the successive bounds established in Section 3.
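For comparison with the sketch given after bound (2), the confidence term obtained from (6), (7) and Theorem 4 can be evaluated in the same way (again with purely illustrative values of $Q$, $P$, $N$, $\gamma$ and $\delta$, and a single margin $\gamma$ assumed for all classes):

```python
import numpy as np

def covering_bound_term(N, Q, P, gamma=0.25, delta=0.05):
    """Confidence term obtained by plugging (7) into (6) and applying Theorem 4
    to each of the Q functions g_k (the same margin gamma is assumed for every k)."""
    op_norm = np.sqrt(Q * P)                          # bound on the operator norm of T~
    ln_cover = P * Q * np.log(8 * op_norm / gamma)    # ln of the right-hand side of (7)
    per_class = np.sqrt((2.0 / N) * (np.log(2.0 / delta) + ln_cover))
    return Q * per_class                              # summed over the Q classes

# Illustrative values only (assumed, not taken from the paper's experiments).
for N in (10**3, 10**4, 10**5, 10**6):
    print(N, round(covering_bound_term(N, Q=4, P=3), 3))
```

As with the growth-function route, the term is vacuous unless $N$ is very large, which is consistent with the discussion below.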

5 Discussion and Conclusion

This paper analyses two strategies for deriving bounds on the generalization error of a multi-class discriminant model. One is based on combinatorial dimensions and the other uses covering numbers. The bounds on the generalization ability derived in the two former sections cannot be readily compared, since they rest on different definitions of the empirical error. A direct application of these results to the biological problem studied in [5] leads to similar bounds in both cases. These bounds are not accurate for such an application. In their current implementation, both methods fail to provide useful bounds for real-world applications. However, this study sheds light on some features which could be used to improve the generalization control so as to make it of practical interest. As pointed out in Section 3, the combinatorial approach does not take into account the specificity of the model and should be avoided as it is used here. Indeed, improving the bound on the Natarajan dimension will not decrease the gap between $\Delta_{GH}(N)$ and $N_c^H(N)$, which is one of the most important Achilles' heels of the method. To use knowledge of the structure when bounding the generalization error, we have introduced a method based on covering numbers which is new for multi-class discrimination. This method, however, has the disadvantage of crudely bounding $E(g)$ by the sum $\sum_{k=1}^{Q} E(g_k)$. This is the main weakness of the method, and future work should fix it. One way to do so is to develop a global approach, directly controlling the generalization error $E(g)$ in terms of covering numbers instead of controlling the individual generalization errors $E(g_k)$. This will be the subject of our next research. Thus, by presenting a new approach for the study of multi-class discriminant models with real-valued internal representations, we have stated new bounds and pointed out a way to improve them. Covering numbers allow knowledge of the learning system to be used and included in the confidence bounds. Further work will be to derive more practical bounds for the model of interest.

References

[1] M. Anthony (1997): Probabilistic Analysis of Learning in Artificial Neural Networks: The PAC Model and its Variants. Neural Computing Surveys, Vol. 1, 1-47.

[2] P. Bartlett (1996): The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. Technical report, Department of Systems Engineering, Australian National University, ftp: syseng.anu.edu.au:pub/peter/tr9d.ps.

[3] C.M. Bishop (1995): Neural Networks for Pattern Recognition. Clarendon Press, Oxford.

[4] B. Carl and I. Stephani (1990): Entropy, Compactness and the Approximation of Operators. Cambridge University Press, Cambridge, UK.

[5] Y. Guermeur, C. Geourjon, P. Gallinari and G. Deléage (1999): Improved Performance in Protein Secondary Structure Prediction by Inhomogeneous Score Combination. To appear in Bioinformatics.

[6] Y. Guermeur, H. Paugam-Moisy and P. Gallinari (1998): Multivariate Linear Regression on Classifier Outputs: a Capacity Study. ICANN'98.

[7] D. Haussler (1992): Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications. Information and Computation, 100, 78-150.

[8] D. Haussler and P.M. Long (1995): A Generalization of Sauer's Lemma. Journal of Combinatorial Theory, Series A, 71, 219-240.

[9] B.K. Natarajan (1989): On Learning Sets and Functions. Machine Learning, Vol. 4, 67-97.

[10] D. Pollard (1984): Convergence of Stochastic Processes. Springer Series in Statistics, Springer-Verlag, N.Y.

[11] J. Shawe-Taylor and M. Anthony (1991): Sample sizes for multiple-output threshold networks. Network: Computation in Neural Systems, Vol. 2, 107-117.

[12] V.N. Vapnik (1982): Estimation of Dependences Based on Empirical Data. Springer-Verlag, N.Y.

[13] V.N. Vapnik (1998): Statistical Learning Theory. John Wiley & Sons, Inc., N.Y.

[14] V.N. Vapnik and A.Y. Chervonenkis (1971): On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, Vol. 16, 264-280.