Sample width for multi-category classifiers

R u t c o r Research R e p o r t Sample width for multi-category classifiers Martin Anthony a Joel Ratsaby b RRR 29-2012, November 2012 RUTCOR Rutgers Center for Operations Research Rutgers University 640 Bartholomew Road Piscataway, New Jersey 08854-8003 Telephone: 732-445-3804 Telefax: 732-445-5472 Email: rrr@rutcor.rutgers.edu http://rutcor.rutgers.edu/ rrr a Department of Mathematics, The London School of Economics and Political Science, Houghton Street, London WC2A 2AE, UK. m.anthony@lse.ac.uk b Electrical and Electronics Engineering Department, Ariel University Center of Samaria, Ariel 40700, Israel. ratsaby@ariel.ac.il

Abstract. In a recent paper, the authors introduced the notion of sample width for binary classifiers defined on the set of real numbers. It was shown that the performance of such classifiers could be quantified in terms of this sample width. This paper considers how to adapt the idea of sample width so that it can be applied in cases where the classifiers are multi-category and are defined on some arbitrary metric space.

Acknowledgements: This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886.

1 Introduction

By a (multi-category) classifier on a set $X$, we mean a function mapping from $X$ to $[C] = \{1, 2, \ldots, C\}$, where $C \geq 2$ is the number of possible categories. Such a classifier indicates to which of the $C$ different classes objects from $X$ belong and, in supervised machine learning, it is arrived at on the basis of a sample, a set of objects from $X$ together with their classifications in $[C]$. In [4], the notion of sample width for binary classifiers ($C = 2$) mapping from the real line $X = \mathbb{R}$ was introduced, and in [5] this was generalized to finite metric spaces. In this paper, we consider how a similar approach might be taken to the situation in which $C$ could be larger than 2, and in which the classifiers map not simply from the real line, but from some metric space (which would not generally have the linear structure of the real line). The definition of sample width is given below, but it is possible to indicate the basic idea at this stage: we define the sample width to be at least $\gamma$ if the classifier achieves the correct classifications on the sample and, furthermore, for each sample point, the minimum distance to a point of the domain having a different classification is at least $\gamma$.

A key issue that arises in machine learning is that of generalization error: given that a classifier has been produced by some learning algorithm on the basis of a (random) sample of a certain size, how can we quantify the accuracy of that classifier, where by its accuracy we mean its likely performance in classifying objects from $X$ correctly? In this paper, we seek answers to this question that involve not just the sample size, but the sample width.

2 Probabilistic modelling of learning

We work in a version of the popular PAC framework of computational learning theory (see [14, 7]). This model assumes that the sample $s$ consists of an ordered set $(x_i, y_i)$ of labeled examples, where $x_i \in X$ and $y_i \in Y = [C]$, and that each $(x_i, y_i)$ in the training sample $s$ has been generated randomly according to some fixed (but unknown) probability distribution $P$ on $Z = X \times Y$. (This includes, as a special case, the situation in which each $x_i$ is drawn according to a fixed distribution on $X$ and is then labeled deterministically by $y_i = t(x_i)$, where $t$ is some fixed function.) Thus, a sample $s$ of length $m$ can be thought of as being drawn randomly according to the product probability distribution $P^m$. An appropriate measure of how well $h : X \to Y$ would perform on further randomly drawn points is its error, $\mathrm{er}_P(h)$, the probability that $h(X) \neq Y$ for random $(X, Y)$. Given any function $h \in H$, we can measure how well $h$ matches the training sample through its sample error
$$\mathrm{er}_s(h) = \frac{1}{m}\,\bigl|\{\, i : h(x_i) \neq y_i \,\}\bigr|$$
(the proportion of points in the sample incorrectly classified by $h$).
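As a concrete (if simple) illustration of these two quantities, the following Python sketch computes the sample error $\mathrm{er}_s(h)$ on a labelled sample and estimates $\mathrm{er}_P(h)$ by Monte Carlo when examples can be drawn from $P$; the classifier, distribution and function names here are our own toy choices, not anything from the paper.

```python
import random

def sample_error(h, sample):
    """Empirical error er_s(h): fraction of labelled points (x, y) with h(x) != y."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def monte_carlo_error(h, draw_example, n_trials=20_000):
    """Estimate er_P(h) = P(h(X) != Y) by drawing (X, Y) repeatedly from P."""
    errs = 0
    for _ in range(n_trials):
        x, y = draw_example()
        errs += (h(x) != y)
    return errs / n_trials

# Toy usage: a threshold classifier into [C] = {1, 2} with 10% label noise.
if __name__ == "__main__":
    h = lambda x: 1 if x >= 0.5 else 2
    def draw_example():
        x = random.random()
        clean = 1 if x >= 0.5 else 2
        return x, (clean if random.random() > 0.1 else random.choice([1, 2]))
    sample = [draw_example() for _ in range(50)]
    print("er_s(h) =", sample_error(h, sample))
    print("er_P(h) ~", monte_carlo_error(h, draw_example))
```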

Much classical work in learning theory (see [7, 14], for instance) related the error of a classifier $h$ to its sample error. A typical result would state that, for all $\delta \in (0, 1)$, with probability at least $1 - \delta$, for all $h$ belonging to some specified set of functions, we have $\mathrm{er}_P(h) < \mathrm{er}_s(h) + \epsilon(m, \delta)$, where $\epsilon(m, \delta)$ (known as a generalization error bound) is decreasing in $m$ and $\delta$. Such results can be derived using uniform convergence theorems from probability theory [15, 11, 8], in which case $\epsilon(m, \delta)$ would typically involve a quantity known as the growth function of the set of classifiers [15, 7, 14, 2]. More recently, emphasis has been placed on learning with a large margin. (See, for instance, [13, 2, 1, 12].) The rationale behind margin-based generalization error bounds in the two-category classification case is that if a binary classifier can be thought of as a geometrical separator between points, and if it has managed to achieve a wide separation between the points of different classification, then this indicates that it is a good classifier, and it is possible that a better generalization error bound can be obtained. Margin-based results apply when the binary classifiers are derived from real-valued functions by thresholding (taking their sign). Margin analysis has been extended to multi-category classifiers in [9].

3 The width of a classifier

We now discuss the case where the underlying set of objects $X$ forms a metric space. Let $X$ be a set on which is defined a metric $d : X \times X \to \mathbb{R}$. For a subset $S$ of $X$, define the distance $d(x, S)$ from $x \in X$ to $S$ as follows:
$$d(x, S) := \inf_{y \in S} d(x, y).$$
We define the diameter of $X$ to be
$$\mathrm{diam}(X) := \sup_{x, y \in X} d(x, y).$$
We will denote by $H$ the set of all possible functions $h$ from $X$ to $[C]$.

The paper [4] introduced the notion of the width of a binary classifier at a point in the domain, in the case where the domain was the real line $\mathbb{R}$. Consider a set of points $\{x_1, x_2, \ldots, x_m\}$ from $\mathbb{R}$ which, together with their true classifications $y_i \in \{-1, 1\}$, yield a training sample
$$s = ((x_j, y_j))_{j=1}^m = ((x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)).$$
We say that $h : \mathbb{R} \to \{-1, 1\}$ achieves sample margin at least $\gamma$ on $s$ if $h(x_i) = y_i$ for each $i$ (so that $h$ correctly classifies the sample) and, furthermore, $h$ is constant on each of the intervals $(x_i - \gamma, x_i + \gamma)$. It was then possible to obtain generalization error bounds in terms of the sample width.
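To make the real-line definition concrete, here is a small Python sketch (our own illustration, not code from [4]) that approximates the largest sample margin of a binary classifier on $\mathbb{R}$: for each candidate $\gamma$ it checks that $h$ classifies the sample correctly and is constant on each interval $(x_i - \gamma, x_i + \gamma)$, the constancy test being done on a fine grid.

```python
import numpy as np

def sample_margin(h, xs, ys, gammas, n_grid=2001):
    """Largest gamma in the ascending list `gammas` such that h(x_i) = y_i for all i
    and h is constant on each open interval (x_i - gamma, x_i + gamma), checked
    approximately on a grid.  Returns 0.0 if h misclassifies some sample point."""
    if any(h(x) != y for x, y in zip(xs, ys)):
        return 0.0
    best = 0.0
    for g in gammas:
        constant_everywhere = True
        for x in xs:
            grid = np.linspace(x - g, x + g, n_grid)[1:-1]  # stay inside the open interval
            if any(h(t) != h(x) for t in grid):
                constant_everywhere = False
                break
        if constant_everywhere:
            best = g
    return best

# Toy usage: a threshold classifier sign(x - 0.3) on a small labelled sample.
h = lambda x: 1 if x >= 0.3 else -1
xs, ys = [0.0, 0.1, 0.5, 0.9], [-1, -1, 1, 1]
print(sample_margin(h, xs, ys, gammas=np.linspace(0.01, 0.5, 50)))
```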

In this paper we use an analogous notion of width to analyse multi-category classifiers defined on a metric space. For each $k$ between 1 and $C$, let us denote by $S_k^h$ (or simply by $S_k$) the sets corresponding to the function $h : X \to [C]$, defined as follows:
$$S_k := h^{-1}(k) = \{x \in X : h(x) = k\}. \qquad (1)$$
We define the width $w_h(x)$ of $h$ at a point $x \in X$ as follows:
$$w_h(x) := \min_{l \neq h(x)} d(x, S_l).$$
In other words, it is the distance from $x$ to the set of points that are labeled differently from $h(x)$. The term width is appropriate since the functional value is just the geometric distance between $x$ and the complement of $S_{h(x)}$.

Given $h : X \to [C]$, for each $k$ between 1 and $C$, we define $f_k^h : X \to \mathbb{R}$ by
$$f_k^h(x) = \min_{l \neq k} d(x, S_l) - d(x, S_k),$$
and we define $f^h : X \to \mathbb{R}^C$ by setting the $k$th component function of $f^h$ to be $f_k^h$: that is, $(f^h)_k = f_k^h$. Note that if $h(x) = k$, then $f_k^h(x) \geq 0$ and $f_j^h(x) \leq 0$ for $j \neq k$. The function $f^h$ contains geometrical information encoding how definitive the classification of a point is: if $f_k^h(x)$ is a large positive number, then the point $x$ belongs to category $k$ and is a large distance from differently classified points. We will regard $h$ as being in error on $x \in X$ if $f_k^h(x)$ is not positive, where $k = h(x)$ (that is, if the classification is not unambiguously revealed by $f^h$). The error $\mathrm{er}_P(h)$ of $h$ can then be expressed in terms of the function $f^h$:
$$\mathrm{er}_P(h) = P\bigl(f_Y^h(X) \leq 0\bigr). \qquad (2)$$
We define the class $F$ of functions as
$$F := \{\, f^h : h \in H \,\}. \qquad (3)$$
Note that $f^h$ is a mapping from $X$ to the bounded set $[-\mathrm{diam}(X), \mathrm{diam}(X)]^C \subseteq \mathbb{R}^C$.

Henceforth, we will use $\gamma > 0$ to denote a width parameter whose value is in the range $(0, \mathrm{diam}(X)]$. For a positive width parameter $\gamma > 0$ and a training sample $s$, the empirical (sample) $\gamma$-width error is defined as
$$E_s^\gamma(h) = \frac{1}{m} \sum_{j=1}^m I\bigl( f_{y_j}^h(x_j) \leq \gamma \bigr).$$
(Here, $I(A)$ is the indicator function of the set, or event, $A$.)
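The quantities just defined are easy to approximate numerically when each region $S_k$ is represented by a finite set of reference points of the domain. The sketch below is our own illustration (the metric, classifier and reference set are toy assumptions, not from the paper): it computes $d(x, S_k)$, the margin functions $f_k^h(x)$, the width $w_h(x)$, and the empirical $\gamma$-width error $E_s^\gamma(h)$.

```python
import numpy as np

def dist_to_set(x, S, d):
    """d(x, S) = inf_{y in S} d(x, y), approximated over a finite reference set S."""
    return min(d(x, y) for y in S) if S else float("inf")

def margin_functions(x, h, domain_points, d, C):
    """Return [f_1^h(x), ..., f_C^h(x)] with f_k^h(x) = min_{l != k} d(x, S_l) - d(x, S_k),
    where each region S_k = h^{-1}(k) is approximated by the finite list domain_points."""
    S = {k: [z for z in domain_points if h(z) == k] for k in range(1, C + 1)}
    dists = {k: dist_to_set(x, S[k], d) for k in range(1, C + 1)}
    return [min(dists[l] for l in dists if l != k) - dists[k] for k in range(1, C + 1)]

def width(x, h, domain_points, d, C):
    """w_h(x) = min_{l != h(x)} d(x, S_l): distance to the nearest differently-labelled region."""
    S = {k: [z for z in domain_points if h(z) == k] for k in range(1, C + 1)}
    return min(dist_to_set(x, S[l], d) for l in S if l != h(x))

def empirical_width_error(sample, h, domain_points, d, C, gamma):
    """E_s^gamma(h) = (1/m) * #{ j : f_{y_j}^h(x_j) <= gamma }."""
    bad = sum(1 for x, y in sample
              if margin_functions(x, h, domain_points, d, C)[y - 1] <= gamma)
    return bad / len(sample)

# Toy usage on the plane with the Euclidean metric and C = 3 classes.
d = lambda u, v: float(np.linalg.norm(np.asarray(u) - np.asarray(v)))
h = lambda p: 1 if p[0] < 0 else (2 if p[1] < 0 else 3)
rng = np.random.default_rng(0)
domain_points = [tuple(p) for p in rng.uniform(-1, 1, size=(500, 2))]
sample = [(p, h(p)) for p in domain_points[:40]]     # a correctly labelled sample
print("w_h((0.4, 0.4)) ~", width((0.4, 0.4), h, domain_points, d, C=3))
print("E_s^0.1(h) =", empirical_width_error(sample, h, domain_points, d, C=3, gamma=0.1))
```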

We shall also denote $E_s^\gamma(h)$ by $E_s^\gamma(f)$ where $f = f^h$. Note that
$$f_y^h(x) \leq \gamma \iff \min_{l \neq y} d(x, S_l) - d(x, S_y) \leq \gamma \iff \exists\, l \neq y \text{ such that } d(x, S_l) \leq d(x, S_y) + \gamma.$$
So the empirical $\gamma$-width error on the sample is the proportion of points in the sample which are either misclassified by $h$, or which are classified correctly but lie within distance $\gamma$ of the set of points classified differently. (We recall that $h(x) = y$ implies $d(x, S_y) = 0$.)

Our aim is to show that (with high probability) the generalization error $\mathrm{er}_P(h)$ is not much greater than $E_s^\gamma(h)$. (In particular, as a special case, we want to bound the generalization error given that $E_s^\gamma(h) = 0$.) This will imply that if the learner finds a hypothesis which, for a large value of $\gamma$, has a small $\gamma$-width error, then that hypothesis is likely to have small error. What this indicates, then, is that if a hypothesis has a large width on most points of a sample, then it will be likely to have small error.

4 Covering numbers

4.1 Covering numbers

We will make use of some results and ideas from large-margin theory. One central idea in large-margin analysis is that of covering numbers. We will discuss different types of covering numbers, so we introduce the idea in some generality to start with. Suppose $(A, d)$ is a (pseudo-)metric space and that $\alpha > 0$. Then an $\alpha$-cover of $A$ (with respect to $d$) is a finite subset $C$ of $A$ such that, for every $a \in A$, there is some $c \in C$ such that $d(a, c) \leq \alpha$. If such a cover exists, then the minimum cardinality of such a cover is the covering number $N(A, \alpha, d)$.

We are working with the set $F$ of vector-valued functions from $X$ to $\mathbb{R}^C$, as defined earlier. We define the sup-metric $d_\infty$ on $F$ as follows: for $f, g : X \to \mathbb{R}^C$,
$$d_\infty(f, g) = \sup_{x \in X} \max_{1 \leq k \leq C} |f_k(x) - g_k(x)|,$$
where $f_k$ denotes the $k$th component function of $f$. (Note that each component function is bounded, so the metric is well-defined.) We can bound the covering numbers $N(F, \alpha, d_\infty)$ of $F$ (with respect to the sup-metric) in terms of the covering numbers of $X$ with respect to its metric $d$.
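Covering numbers of $X$ itself can be estimated constructively: a greedy pass over a finite point set produces an $\alpha$-cover of those points, so its size upper-bounds their covering number. The following Python sketch is our own illustration, not part of the paper; the point cloud and metric are arbitrary.

```python
import numpy as np

def greedy_cover(points, alpha, d):
    """Build an alpha-cover of the finite set `points` under metric d greedily:
    a point becomes a new centre whenever it is farther than alpha from every
    centre chosen so far.  Every point then lies within alpha of some centre,
    so len(result) is an upper bound on the covering number of `points`."""
    centres = []
    for p in points:
        if all(d(p, c) > alpha for c in centres):
            centres.append(p)
    return centres

# Toy usage: covering numbers of random points in the unit square (Euclidean metric).
d = lambda u, v: float(np.linalg.norm(np.asarray(u) - np.asarray(v)))
rng = np.random.default_rng(1)
X_points = [tuple(p) for p in rng.uniform(0, 1, size=(2000, 2))]
for alpha in (0.4, 0.2, 0.1):
    print(alpha, len(greedy_cover(X_points, alpha, d)))
```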

The result is as follows.

Theorem 4.1 For $\alpha \in (0, \mathrm{diam}(X)]$,
$$N(F, \alpha, d_\infty) \leq \left( \frac{9\,\mathrm{diam}(X)}{\alpha} \right)^{C N_\alpha},$$
where $N_\alpha = N(X, \alpha/3, d)$.

4.2 Smoothness of the function class

As a first step towards establishing this result, we prove that the functions in $F$ satisfy a certain Lipschitz (or smoothness) property.

Proposition 4.2 For every $f \in F$, and for all $x, x' \in X$,
$$\max_{1 \leq k \leq C} |f_k(x) - f_k(x')| \leq 2\, d(x, x'). \qquad (4)$$

Proof: Let $x, x' \in X$ and fix $k$ between 1 and $C$. We show that $|f_k(x) - f_k(x')| \leq 2\, d(x, x')$. Recall that, since $f \in F$, there is some $h : X \to [C]$ such that, for all $x$,
$$f_k(x) = \min_{l \neq k} d(x, S_l) - d(x, S_k),$$
where, for each $i$, $S_i = h^{-1}(i)$. We have
$$|f_k(x) - f_k(x')| = \left| \min_{l \neq k} d(x, S_l) - d(x, S_k) - \min_{l \neq k} d(x', S_l) + d(x', S_k) \right| \leq \left| \min_{l \neq k} d(x, S_l) - \min_{l \neq k} d(x', S_l) \right| + \left| d(x', S_k) - d(x, S_k) \right|.$$
We consider in turn each of the two terms in this final expression. We start with the second. Let $\epsilon > 0$ and suppose $r \in S_k$ is such that $d(x, r) < d(x, S_k) + \epsilon$. Such an $r$ exists since $d(x, S_k) = \inf_{y \in S_k} d(x, y)$. Let $s \in S_k$ be such that $d(x', s) < d(x', S_k) + \epsilon$. Then $d(x, S_k) \leq d(x, r) \leq d(x, s) + \epsilon$, since $d(x, r) < d(x, S_k) + \epsilon$ and $d(x, S_k) \leq d(x, s)$. So,
$$d(x, S_k) \leq d(x', s) + d(x, x') + \epsilon < d(x', S_k) + \epsilon + d(x, x') + \epsilon = d(x', S_k) + d(x, x') + 2\epsilon.$$

A similar argument establishes
$$d(x', S_k) \leq d(x', s) \leq d(x', r) + \epsilon \leq d(x, r) + d(x, x') + \epsilon < d(x, S_k) + d(x, x') + 2\epsilon.$$
Combining these two inequalities shows $|d(x, S_k) - d(x', S_k)| \leq d(x, x') + 2\epsilon$. Since this holds for all $\epsilon > 0$, it follows that $|d(x, S_k) - d(x', S_k)| \leq d(x, x')$.

Next we show
$$\left| \min_{l \neq k} d(x, S_l) - \min_{l \neq k} d(x', S_l) \right| \leq d(x, x').$$
Suppose that $\min_{l \neq k} d(x, S_l) = d(x, S_p)$ and that $\min_{l \neq k} d(x', S_l) = d(x', S_q)$. Let $\epsilon > 0$. Suppose that $y \in S_p$ is such that $d(x, y) < d(x, S_p) + \epsilon$, and that $y' \in S_q$ is such that $d(x', y') < d(x', S_q) + \epsilon$. Then
$$d(x', S_q) \leq d(x', S_p) \leq d(x', y) \leq d(x, x') + d(x, y) < d(x, x') + d(x, S_p) + \epsilon$$
and
$$d(x, S_p) \leq d(x, S_q) \leq d(x, y') \leq d(x, x') + d(x', y') < d(x, x') + d(x', S_q) + \epsilon.$$
From these inequalities, it follows that $|d(x, S_p) - d(x', S_q)| \leq d(x, x') + \epsilon$ for all $\epsilon > 0$, and it follows that $|d(x, S_p) - d(x', S_q)| \leq d(x, x')$.

Next, we exploit this smoothness to construct a cover for $F$.

4.3 Covering F

Let the subset $C_\alpha \subseteq X$ be a minimal size $\alpha/3$-cover for $X$ with respect to the metric $d$. So, for every $x \in X$ there is some $\hat{x} \in C_\alpha$ such that $d(x, \hat{x}) \leq \alpha/3$. Denote by $N_\alpha$ the cardinality of $C_\alpha$. Let
$$\Lambda_\alpha = \left\{ \lambda_i = \frac{i\alpha}{3} : i = -\left\lceil \frac{3\,\mathrm{diam}(X)}{\alpha} \right\rceil, \ldots, -1, 0, 1, 2, \ldots, \left\lceil \frac{3\,\mathrm{diam}(X)}{\alpha} \right\rceil \right\} \qquad (5)$$
and define the class $\hat{F}$ to be all functions $\hat{f} : C_\alpha \to (\Lambda_\alpha)^C$. Then $\hat{F}$ is of a finite size equal to $|\Lambda_\alpha|^{C N_\alpha}$. For any $\hat{f} \in \hat{F}$ define the extension $\hat{f}_{\mathrm{ext}} : X \to \mathbb{R}^C$ of $\hat{f}$ to the whole domain $X$ as follows: given $\hat{f}$ (which is well-defined on the points $\hat{x}_i$ of the cover), then for every point $x$ in the ball $B_{\alpha/3}(\hat{x}_i) = \{x \in X : d(x, \hat{x}_i) \leq \alpha/3\}$, we let $\hat{f}_{\mathrm{ext}}(x) = \hat{f}(\hat{x}_i)$, for all $\hat{x}_i \in C_\alpha$ (where, if, for a point $x$, there is more than one point $\hat{x}_i$ such that $x \in B_{\alpha/3}(\hat{x}_i)$, we arbitrarily pick one of the points $\hat{x}_i$ in order to assign the value of $\hat{f}_{\mathrm{ext}}(x)$).

There is a one-to-one correspondence between the functions $\hat{f}$ and the functions $\hat{f}_{\mathrm{ext}}$. Hence the set $\hat{F}_{\mathrm{ext}} = \{\, \hat{f}_{\mathrm{ext}} : \hat{f} \in \hat{F} \,\}$ is of cardinality equal to $|\Lambda_\alpha|^{C N_\alpha}$.

We claim that for any $f \in F$ there exists an $\hat{f}_{\mathrm{ext}}$ such that $d_\infty(f, \hat{f}_{\mathrm{ext}}) \leq \alpha$. To see this, first, for every point $\hat{x}_i \in C_\alpha$, consider $f(\hat{x}_i)$ and find a corresponding element in $\Lambda_\alpha^C$, call it $\hat{f}(\hat{x}_i)$, such that
$$\max_{1 \leq k \leq C} \left| (f(\hat{x}_i))_k - (\hat{f}(\hat{x}_i))_k \right| \leq \alpha/3. \qquad (6)$$
(That there exists such a value follows by design of $\Lambda_\alpha$.) By the above definition of extension, it follows that for all points $x \in B_{\alpha/3}(\hat{x}_i)$ we have $\hat{f}_{\mathrm{ext}}(x) = \hat{f}(\hat{x}_i)$. Now, from (4) we have, for all $f \in F$,
$$\max_{1 \leq k \leq C} \sup_{x \in B_{\alpha/3}(\hat{x}_i)} \left| (f(x))_k - (f(\hat{x}_i))_k \right| \leq 2\, d(x, \hat{x}_i) \leq 2\alpha/3. \qquad (7)$$
Hence for any $f \in F$ there exists a function $\hat{f} \in \hat{F}$ with a corresponding $\hat{f}_{\mathrm{ext}} \in \hat{F}_{\mathrm{ext}}$ such that, given an $x \in X$, there exists $\hat{x}_i \in C_\alpha$ such that, for each $k$ between 1 and $C$,
$$\left| (f(x))_k - (\hat{f}_{\mathrm{ext}}(x))_k \right| = \left| (f(x))_k - (\hat{f}_{\mathrm{ext}}(\hat{x}_i))_k \right|.$$
The right hand side can be expressed as
$$\left| (f(x))_k - (\hat{f}_{\mathrm{ext}}(\hat{x}_i))_k \right| = \left| (f(x))_k - (\hat{f}(\hat{x}_i))_k \right| = \left| (f(x))_k - (f(\hat{x}_i))_k + (f(\hat{x}_i))_k - (\hat{f}(\hat{x}_i))_k \right| \leq \left| (f(x))_k - (f(\hat{x}_i))_k \right| + \left| (f(\hat{x}_i))_k - (\hat{f}(\hat{x}_i))_k \right| \leq 2\alpha/3 + \alpha/3 = \alpha, \qquad (8)$$
where (8) follows from (6) and (7). Hence the set $\hat{F}_{\mathrm{ext}}$ forms an $\alpha$-covering of the class $F$ in the sup-norm. Thus we have the following covering number bound:
$$N(F, \alpha, d_\infty) \leq |\Lambda_\alpha|^{C N_\alpha} = \left( 2\left\lceil \frac{3\,\mathrm{diam}(X)}{\alpha} \right\rceil + 1 \right)^{C N_\alpha}. \qquad (9)$$
Theorem 4.1 now follows because (for $0 < \alpha \leq \mathrm{diam}(X)$)
$$2\left\lceil \frac{3\,\mathrm{diam}(X)}{\alpha} \right\rceil + 1 \leq 2\left( \frac{3\,\mathrm{diam}(X)}{\alpha} + 1 \right) + 1 = \frac{6\,\mathrm{diam}(X)}{\alpha} + 3 \leq \frac{9\,\mathrm{diam}(X)}{\alpha}.$$
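For a sense of scale, the bound of Theorem 4.1 is easy to evaluate in logarithmic form once an estimate of $N_\alpha = N(X, \alpha/3, d)$ is available (for instance from the greedy cover sketch above). The following lines are our own illustration; the plugged-in numbers are arbitrary.

```python
import math

def log_covering_bound_F(alpha, diam_X, C, N_alpha):
    """Logarithm of the Theorem 4.1 bound
       N(F, alpha, d_inf) <= (9 * diam(X) / alpha) ** (C * N_alpha),
    with N_alpha = N(X, alpha/3, d).  Working in logs avoids overflow,
    since the bound itself is astronomically large."""
    assert 0 < alpha <= diam_X
    return C * N_alpha * math.log(9 * diam_X / alpha)

# Toy usage: unit square (diam = sqrt(2)), C = 3 classes, N_alpha taken as an
# estimate (e.g. from a greedy cover of sample points of X).
print(log_covering_bound_F(alpha=0.3, diam_X=math.sqrt(2), C=3, N_alpha=40))
```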

5 Generalization error bounds

We present two results. The first bounds the generalization error in terms of a width parameter $\gamma$ for which the $\gamma$-width error on the sample is zero; the second (more general, but looser in that special case) bounds the error in terms of $\gamma$ and the $\gamma$-width error on the sample (which could be non-zero).

Theorem 5.1 Suppose that $X$ is a metric space of diameter $\mathrm{diam}(X)$. Suppose $P$ is any probability measure on $Z = X \times [C]$. Let $\delta \in (0, 1)$. Then, with probability at least $1 - \delta$, the following holds for $s \in Z^m$: for any function $h : X \to [C]$, and for any $\gamma \in (0, \mathrm{diam}(X)]$, if $E_s^\gamma(h) = 0$, then
$$\mathrm{er}_P(h) \leq \frac{2}{m} \left( C\, N(X, \gamma/12, d)\, \log_2\!\left( \frac{36\,\mathrm{diam}(X)}{\gamma} \right) + \log_2\!\left( \frac{4\,\mathrm{diam}(X)}{\delta\gamma} \right) \right).$$

Theorem 5.2 Suppose that $X$ is a metric space of diameter $\mathrm{diam}(X)$. Suppose $P$ is any probability measure on $Z = X \times [C]$. Let $\delta \in (0, 1)$. Then, with probability at least $1 - \delta$, the following holds for $s \in Z^m$: for any function $h : X \to [C]$, and for any $\gamma \in (0, \mathrm{diam}(X)]$,
$$\mathrm{er}_P(h) \leq E_s^\gamma(h) + \sqrt{ \frac{2}{m} \left( C\, N(X, \gamma/6, d)\, \ln\!\left( \frac{18\,\mathrm{diam}(X)}{\gamma} \right) + \ln\!\left( \frac{2\,\mathrm{diam}(X)}{\gamma\delta} \right) \right) } + \frac{1}{m}.$$

What we have in Theorem 5.2 is a high-probability bound that takes the following form: for all $h$ and for all $\gamma \in (0, \mathrm{diam}(X)]$,
$$\mathrm{er}_P(h) \leq E_s^\gamma(h) + \epsilon(m, \gamma, \delta),$$
where $\epsilon$ tends to 0 as $m \to \infty$ and $\epsilon$ decreases as $\gamma$ increases. The rationale for seeking such a bound is that there is likely to be a trade-off between width error on the sample and the value of $\epsilon$: taking $\gamma$ small so that the error term $E_s^\gamma(h)$ is zero might entail a large value of $\epsilon$; and, conversely, choosing $\gamma$ large will make $\epsilon$ relatively small, but lead to a large sample error term. So, in principle, since the value $\gamma$ is free to be chosen, one could optimize the choice of $\gamma$ on the right-hand side of the bound to minimize it.
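Both bounds are straightforward to evaluate numerically once estimates of the relevant covering numbers are plugged in. The sketch below is our own illustration, using the formulas as stated above and arbitrary toy values for the covering numbers; the bounds are only informative when they come out below 1, and they shrink as the sample size $m$ grows.

```python
import math

def bound_thm51(m, gamma, delta, diam_X, C, N_cov_gamma12):
    """Theorem 5.1 bound on er_P(h), valid when E_s^gamma(h) = 0.
    N_cov_gamma12 is (an upper estimate of) N(X, gamma/12, d)."""
    return (2.0 / m) * (C * N_cov_gamma12 * math.log2(36 * diam_X / gamma)
                        + math.log2(4 * diam_X / (delta * gamma)))

def bound_thm52(m, gamma, delta, diam_X, C, N_cov_gamma6, emp_width_err):
    """Theorem 5.2 bound: E_s^gamma(h) plus the deviation term plus 1/m.
    N_cov_gamma6 is (an upper estimate of) N(X, gamma/6, d)."""
    dev = math.sqrt((2.0 / m) * (C * N_cov_gamma6 * math.log(18 * diam_X / gamma)
                                 + math.log(2 * diam_X / (gamma * delta))))
    return emp_width_err + dev + 1.0 / m

# Toy usage with made-up covering-number estimates.
for m in (10_000, 100_000, 1_000_000):
    print(m, bound_thm51(m, gamma=0.3, delta=0.05, diam_X=math.sqrt(2),
                         C=3, N_cov_gamma12=600))
print(bound_thm52(m=100_000, gamma=0.3, delta=0.05, diam_X=math.sqrt(2),
                  C=3, N_cov_gamma6=180, emp_width_err=0.02))
```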

Proof of Theorem 5.1

The proof uses techniques similar to those first used in [15, 14, 8, 11] and in subsequent work extending those techniques to learning with real-valued functions, such as [10, 3, 1, 6]. The first observation is that if
$$Q = \{\, s \in Z^m : \exists\, h \in H \text{ with } E_s^\gamma(h) = 0,\ \mathrm{er}_P(h) \geq \epsilon \,\}$$
and
$$T = \{\, (s, s') \in Z^m \times Z^m : \exists\, h \in H \text{ with } E_s^\gamma(h) = 0,\ E_{s'}^0(h) \geq \epsilon/2 \,\},$$
then, for $m \geq 8/\epsilon$, $P^m(Q) \leq 2\, P^{2m}(T)$. This is because
$$P^{2m}(T) \geq P^{2m}\bigl( \{\, \exists\, h : E_s^\gamma(h) = 0,\ \mathrm{er}_P(h) \geq \epsilon \text{ and } E_{s'}^0(h) \geq \epsilon/2 \,\} \bigr) = \int_Q P^m\bigl( \{\, s' : \exists\, h,\ E_s^\gamma(h) = 0,\ \mathrm{er}_P(h) \geq \epsilon \text{ and } E_{s'}^0(h) \geq \epsilon/2 \,\} \bigr)\, dP^m(s) \geq \frac{1}{2}\, P^m(Q),$$
for $m \geq 8/\epsilon$. The final inequality follows from the fact that if $\mathrm{er}_P(h) \geq \epsilon$, then for $m \geq 8/\epsilon$, $P^m\bigl( E_{s'}^0(h) \geq \epsilon/2 \bigr) \geq 1/2$, for any $h$, something that follows by a Chernoff bound.

Let $G$ be the permutation group (the "swapping group") on the set $\{1, 2, \ldots, 2m\}$ generated by the transpositions $(i, m+i)$ for $i = 1, 2, \ldots, m$. Then $G$ acts on $Z^{2m}$ by permuting the coordinates: for $\sigma \in G$,
$$\sigma(z_1, z_2, \ldots, z_{2m}) = (z_{\sigma(1)}, \ldots, z_{\sigma(2m)}).$$
By invariance of $P^{2m}$ under the action of $G$,
$$P^{2m}(T) \leq \max\{\, \Pr(\sigma z \in T) : z \in Z^{2m} \,\},$$
where $\Pr$ denotes the probability over uniform choice of $\sigma$ from $G$. (See, for instance, [11, 2].)

Let $F = \{f^h : h \in H\}$ be the set of vector-valued functions derived from $H$ as before, and let $\hat{F}$ be a minimal $\gamma/2$-cover of $F$ in the $d_\infty$-metric. Theorem 4.1 tells us that the size of $\hat{F}$ is no more than
$$\left( \frac{18\,\mathrm{diam}(X)}{\gamma} \right)^{C N}, \quad \text{where } N = N(X, \gamma/6, d).$$

Fix $z \in Z^{2m}$. Suppose $\tau z = (s, s') \in T$ and that $h$ is such that $E_s^\gamma(h) = 0$ and $E_{s'}^0(h) \geq \epsilon/2$. Let $\hat{f} \in \hat{F}$ be such that $d_\infty(\hat{f}, f^h) \leq \gamma/2$. Then the fact that $E_s^\gamma(h) = 0$ implies that $E_s^{\gamma/2}(\hat{f}) = 0$. And, the fact that $E_{s'}^0(h) \geq \epsilon/2$ implies that $E_{s'}^{\gamma/2}(\hat{f}) \geq \epsilon/2$. To see the first claim, note that we have $f_y^h(x) > \gamma$ for all $(x, y)$ in the sample $s$ and that, since, for all $k$ and all $x$, $|\hat{f}_k(x) - f_k^h(x)| \leq \gamma/2$, we have also that, for all such $(x, y)$, $\hat{f}_y(x) > \gamma/2$. For the second claim, we observe that if $f_y^h(x) \leq 0$ then $\hat{f}_y(x) \leq \gamma/2$.

It now follows that if $\tau z \in T$, then, for some $\hat{f} \in \hat{F}$, $\tau z \in R(\hat{f})$, where
$$R(\hat{f}) = \{\, (s, s') \in Z^m \times Z^m : E_s^{\gamma/2}(\hat{f}) = 0,\ E_{s'}^{\gamma/2}(\hat{f}) \geq \epsilon/2 \,\}.$$
By symmetry, $\Pr\bigl( \sigma z \in R(\hat{f}) \bigr) = \Pr\bigl( \sigma(\tau z) \in R(\hat{f}) \bigr)$. Suppose that $E_{s'}^{\gamma/2}(\hat{f}) = r/m$, where $r \geq \epsilon m/2$ is the number of $(x_i, y_i)$ in $s'$ on which $\hat{f}$ is such that $\hat{f}_{y_i}(x_i) \leq \gamma/2$.

Then those permutations $\sigma$ such that $\sigma(\tau z) \in R(\hat{f})$ are precisely those that do not transpose these $r$ coordinates, and there are $2^{m-r} \leq 2^{m - \epsilon m/2}$ such $\sigma$. It follows that, for each fixed $\hat{f} \in \hat{F}$,
$$\Pr\bigl( \sigma z \in R(\hat{f}) \bigr) \leq \frac{2^{m(1 - \epsilon/2)}}{|G|} = 2^{-\epsilon m/2}.$$
We therefore have
$$\Pr(\sigma z \in T) \leq \Pr\Bigl( \sigma z \in \bigcup_{\hat{f} \in \hat{F}} R(\hat{f}) \Bigr) \leq \sum_{\hat{f} \in \hat{F}} \Pr\bigl(\sigma z \in R(\hat{f})\bigr) \leq |\hat{F}|\, 2^{-\epsilon m/2}.$$
So,
$$P^m(Q) \leq 2\, P^{2m}(T) \leq 2\, |\hat{F}|\, 2^{-\epsilon m/2} \leq 2 \left( \frac{18\,\mathrm{diam}(X)}{\gamma} \right)^{C N} 2^{-\epsilon m/2},$$
where $N = N(X, \gamma/6, d)$. This is at most $\delta$ when
$$\epsilon = \frac{2}{m} \left( C N \log_2\!\left( \frac{18\,\mathrm{diam}(X)}{\gamma} \right) + \log_2\!\left( \frac{2}{\delta} \right) \right).$$

Next, we use this to obtain a result in which $\gamma$ is not prescribed in advance. For $\alpha_1, \alpha_2, \delta \in (0, 1)$, let $E(\alpha_1, \alpha_2, \delta)$ be the set of $z \in Z^m$ for which there exists some $h \in H$ with $E_z^{\alpha_2}(h) = 0$ and $\mathrm{er}_P(h) \geq \epsilon_1(m, \alpha_1, \delta)$, where
$$\epsilon_1(m, \alpha_1, \delta) = \frac{2}{m} \left( C\, N(X, \alpha_1/6, d)\, \log_2\!\left( \frac{18\,\mathrm{diam}(X)}{\alpha_1} \right) + \log_2\!\left( \frac{2}{\delta} \right) \right).$$
Then the result just obtained tells us that $P^m(E(\alpha, \alpha, \delta)) \leq \delta$. It is also clear that if $\alpha_1 \leq \alpha \leq \alpha_2$ and $\delta_1 \leq \delta$, then $E(\alpha_1, \alpha_2, \delta_1) \subseteq E(\alpha, \alpha, \delta)$. It follows, from a result from [6], that
$$P^m\left( \bigcup_{\gamma \in (0, \mathrm{diam}(X)]} E\bigl(\gamma/2,\ \gamma,\ \delta\gamma/(2\,\mathrm{diam}(X))\bigr) \right) \leq \delta.$$
In other words, with probability at least $1 - \delta$, for all $\gamma \in (0, \mathrm{diam}(X)]$, we have that if $h \in H$ satisfies $E_s^\gamma(h) = 0$, then $\mathrm{er}_P(h) < \epsilon_2(m, \gamma, \delta)$, where
$$\epsilon_2(m, \gamma, \delta) = \frac{2}{m} \left( C\, N(X, \gamma/12, d)\, \log_2\!\left( \frac{36\,\mathrm{diam}(X)}{\gamma} \right) + \log_2\!\left( \frac{4\,\mathrm{diam}(X)}{\delta\gamma} \right) \right).$$
Note that $\gamma$ now need not be prescribed in advance.

Proof of Theorem 5.2

Guermeur [9] has developed a framework in which to analyse multi-category classification, and we can apply one of his results to obtain the bound of Theorem 5.2, which is a generalization error bound applicable to the case in which the $\gamma$-width sample error is not zero. In that framework, there is a set $G$ of functions from $X$ into $\mathbb{R}^C$, and a typical $g \in G$ is represented by its component functions $g_k$ for $k = 1$ to $C$. Each $g \in G$ satisfies the constraint
$$\sum_{k=1}^C g_k(x) = 0, \quad x \in X.$$
A function of this type acts as a classifier as follows: it assigns category $l \in [C]$ to $x \in X$ if and only if $g_l(x) > \max_{k \neq l} g_k(x)$. (If more than one value of $k$ maximizes $g_k(x)$, then the classification is left undefined, assigned some value not in $[C]$.) The risk of $g \in G$, when the underlying probability measure on $X \times Y$ is $P$, is defined to be
$$R(g) = P\left( \{\, (x, y) \in X \times [C] : g_y(x) \leq \max_{k \neq y} g_k(x) \,\} \right).$$
For $(v, k) \in \mathbb{R}^C \times [C]$, let
$$M(v, k) = \frac{1}{2}\left( v_k - \max_{l \neq k} v_l \right)$$
and, for $g \in G$, let $\tilde{g}$ be the function $X \to \mathbb{R}^C$ given by
$$\tilde{g}(x) = (\tilde{g}_k(x))_{k=1}^C = (M(g(x), k))_{k=1}^C.$$
Given a sample $s \in (X \times [C])^m$, let
$$R_{\gamma, s}(g) = \frac{1}{m} \sum_{i=1}^m I\{ \tilde{g}_{y_i}(x_i) < \gamma \}.$$
A result following from [9] is (in the above notation) as follows: Let $\delta \in (0, 1)$ and suppose $P$ is a probability measure on $Z = X \times [C]$. With $P^m$-probability at least $1 - \delta$, $s \in Z^m$ will be such that we have the following: (for any fixed $d > 0$) for all $\gamma \in (0, d]$ and for all $g \in G$,
$$R(g) \leq R_{\gamma, s}(g) + \sqrt{ \frac{2}{m} \left( \ln N(G, \gamma/4, d_\infty) + \ln\!\left( \frac{2d}{\gamma\delta} \right) \right) } + \frac{1}{m}.$$
(In fact, the result from [9] involves empirical covering numbers rather than $d_\infty$-covering numbers. The latter are at least as large as the empirical covering numbers, but we use these because we have bounded them earlier in this paper.)

For each function $h : X \to [C]$, let $g = g^h$ be the function $X \to \mathbb{R}^C$ defined by
$$g_k(x) = \frac{1}{C} \sum_{i=1}^C d(x, S_i) - d(x, S_k),$$
where, as before, $S_j = h^{-1}(j)$. Let $G = \{g^h : h \in H\}$ be the set of all such $g$. Then these functions satisfy the constraint that their coordinate functions sum to the zero function, since
$$\sum_{k=1}^C g_k(x) = \sum_{k=1}^C \frac{1}{C} \sum_{i=1}^C d(x, S_i) - \sum_{k=1}^C d(x, S_k) = \sum_{k=1}^C d(x, S_k) - \sum_{k=1}^C d(x, S_k) = 0.$$
For each $k$,
$$\tilde{g}_k(x) = M(g(x), k) = \frac{1}{2}\left( g_k(x) - \max_{l \neq k} g_l(x) \right) = \frac{1}{2}\left( \frac{1}{C} \sum_{i=1}^C d(x, S_i) - d(x, S_k) - \max_{l \neq k}\left( \frac{1}{C} \sum_{i=1}^C d(x, S_i) - d(x, S_l) \right) \right) = \frac{1}{2}\left( -d(x, S_k) - \max_{l \neq k}\bigl( -d(x, S_l) \bigr) \right) = \frac{1}{2}\left( \min_{l \neq k} d(x, S_l) - d(x, S_k) \right) = \frac{1}{2}\, f_k^h(x).$$
From the definition of $g$, the event that $g_y(x) \leq \max_{k \neq y} g_k(x)$ is equivalent to the event that
$$\frac{1}{C} \sum_{i=1}^C d(x, S_i) - d(x, S_y) \leq \max_{k \neq y}\left( \frac{1}{C} \sum_{i=1}^C d(x, S_i) - d(x, S_k) \right),$$
which is equivalent to $\min_{k \neq y} d(x, S_k) \leq d(x, S_y)$. It can therefore be seen that $R(g) = \mathrm{er}_P(h)$. Similarly, $R_{\gamma, s}(g) = E_s^\gamma(h)$. Noting that $G = (1/2)F$, so that an $\alpha/2$-cover of $F$ will provide an $\alpha/4$-cover of $G$, we can therefore apply Guermeur's result to see that, with probability at least $1 - \delta$, for all $h$ and for all $\gamma \in (0, \mathrm{diam}(X)]$,
$$\mathrm{er}_P(h) \leq E_s^\gamma(h) + \sqrt{ \frac{2}{m}\left( \ln N(F, \gamma/2, d_\infty) + \ln\!\left( \frac{2\,\mathrm{diam}(X)}{\gamma\delta} \right) \right) } + \frac{1}{m} \leq E_s^\gamma(h) + \sqrt{ \frac{2}{m}\left( C\, N(X, \gamma/6, d)\, \ln\!\left( \frac{18\,\mathrm{diam}(X)}{\gamma} \right) + \ln\!\left( \frac{2\,\mathrm{diam}(X)}{\gamma\delta} \right) \right) } + \frac{1}{m}.$$
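The identities used in this proof, $\sum_{k=1}^C g_k(x) = 0$ and $M(g(x), k) = \frac{1}{2} f_k^h(x)$, can be checked numerically. The following Python sketch is our own illustration: the regions $S_k$ are again approximated by a finite set of toy reference points on the plane, as in the earlier sketches.

```python
import numpy as np

def g_components(x, h, domain_points, d, C):
    """g_k(x) = (1/C) * sum_i d(x, S_i) - d(x, S_k), with each S_i = h^{-1}(i)
    approximated by a finite set of domain points."""
    S = {k: [z for z in domain_points if h(z) == k] for k in range(1, C + 1)}
    dist = {k: min(d(x, z) for z in S[k]) for k in range(1, C + 1)}
    avg = sum(dist.values()) / C
    return [avg - dist[k] for k in range(1, C + 1)], dist

def margin_operator(v, k):
    """M(v, k) = (1/2) * (v_k - max_{l != k} v_l), with indices k = 1..C."""
    others = [v[l - 1] for l in range(1, len(v) + 1) if l != k]
    return 0.5 * (v[k - 1] - max(others))

# Toy check on the plane: sum_k g_k(x) = 0 and 2 * M(g(x), k) = f_k^h(x).
d = lambda u, v: float(np.linalg.norm(np.asarray(u) - np.asarray(v)))
h = lambda p: 1 if p[0] < 0 else (2 if p[1] < 0 else 3)
rng = np.random.default_rng(2)
pts = [tuple(p) for p in rng.uniform(-1, 1, size=(400, 2))]
x = (0.3, 0.2)
g, dist = g_components(x, h, pts, d, C=3)
f = [min(dist[l] for l in dist if l != k) - dist[k] for k in (1, 2, 3)]  # f_k^h(x)
print("sum of g_k(x):", round(sum(g), 12))
print("2 * M(g(x), k):", [round(2 * margin_operator(g, k), 6) for k in (1, 2, 3)])
print("f_k^h(x):      ", [round(v, 6) for v in f])
```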

Acknowledgements

This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886.

References

[1] M. Anthony and P. L. Bartlett. Function learning from interpolation. Combinatorics, Probability and Computing, 9:213-225, 2000.

[2] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

[3] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615-631, 1997.

[4] M. Anthony and J. Ratsaby. Maximal width learning of binary functions. Theoretical Computer Science, 411:138-147, 2010.

[5] M. Anthony and J. Ratsaby. Learning on finite metric spaces. RUTCOR Research Report RRR-19-2012, June 2012.

[6] P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525-536, 1998.

[7] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929-965, 1989.

[8] R. M. Dudley. Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathematics 63, Cambridge University Press, Cambridge, UK, 1999.

[9] Y. Guermeur. VC theory of large margin multi-category classifiers. Journal of Machine Learning Research, 8:2551-2594, 2007.

[10] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78-150, 1992.

[11] D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, 1984.

[12] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926-1940, 1998.

[13] A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans. Advances in Large-Margin Classifiers (Neural Information Processing). MIT Press, 2000.

[14] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.

[15] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971.