
Function Learning from Interpolation

Martin Anthony, Department of Mathematics, The London School of Economics and Political Science, Houghton Street, London WC2A 2AE, United Kingdom. anthony@vax.lse.ac.uk

Peter Bartlett, Department of Systems Engineering, Research School of Information Sciences and Engineering, The Australian National University, Canberra 0200, Australia. Peter.Bartlett@anu.edu.au

NeuroCOLT Technical Report Series, NC-TR, October. Produced as part of the ESPRIT Working Group in Neural and Computational Learning, NeuroCOLT 8556. NeuroCOLT Coordinating Partner: Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, England. For more information contact John Shawe-Taylor at the above address or neurocolt@dcs.rhbnc.ac.uk

1. The research reported here was conducted while Martin Anthony was visiting the Department of Systems Engineering, The Research School of Information Sciences and Engineering, The Australian National University.
2. Received October 11.

Abstract

In this paper, we study a statistical property of classes of real-valued functions that we call approximation from interpolated examples. We derive a characterization of function classes that have this property, in terms of their `fat-shattering function', a notion that has proven useful in computational learning theory. We discuss the implications for function learning of approximation from interpolated examples. Specifically, it can be interpreted as a problem of learning real-valued functions from random examples in which we require satisfactory performance from every algorithm that returns a function which approximately interpolates the training examples.

1 Introduction

In the problem of learning a real-valued function from examples, a learner sees a sequence of values of an unknown function at a number of randomly chosen points. On the basis of these examples, the learner chooses a function (called a hypothesis) from some class H of hypotheses, with the aim that the learner's hypothesis is close to the target function on future random examples. In this paper we require that, for most training samples, with high probability the absolute difference between the values of the learner's hypothesis and the target function on a random point is small.

A natural learning algorithm to consider is one that chooses a function in H that is close to the target function on the training examples. This poses the following statistical problem: for what function classes H will any function in H that approximately interpolates the target function on the training examples probably have small absolute error? More precisely, we have the following definition of approximation from interpolated examples.

Definition 1 Let $C, H$ be sets of functions that map from a set $X$ to $\mathbb{R}$. We say that H approximates C from interpolated examples if for all $\eta, \epsilon, \gamma, \delta \in (0,1)$, there is $m_0(\eta,\epsilon,\gamma,\delta)$ such that for every $t \in C$ and for every probability measure (footnote 1) $P$ on $X$, if $m \ge m_0(\eta,\epsilon,\gamma,\delta)$, then with $P^m$-probability at least $1-\delta$, $x = (x_1, x_2, \ldots, x_m) \in X^m$ has the property that if $h \in H$ and $|h(x_i) - t(x_i)| < \eta$ for $1 \le i \le m$, then
\[ P(\{x \in X : |h(x) - t(x)| \ge \eta + \epsilon\}) < \gamma. \]
We say that $m_0(\eta,\epsilon,\gamma,\delta)$ is a sufficient sample length function for H to approximate C from interpolated examples.

Footnote 1: More formally, one has a fixed $\sigma$-algebra $\Sigma$ on $X$: when $X$ is countable this is $2^X$, and when $X \subseteq \mathbb{R}^n$, it is the Borel $\sigma$-algebra. Then, by "any probability measure on $X$," we mean "for any probability measure on $\Sigma$," where $\Sigma$, the fixed $\sigma$-algebra, is understood. The class H must have some fairly benign measurability properties; we refer to [14, 10] for details.
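Operationally, Definition 1 quantifies over every approximate interpolant at once: draw a sample from $P$, take any $h \in H$ within $\eta$ of the target on every sample point, and ask whether the set where $|h - t| \ge \eta + \epsilon$ has probability less than $\gamma$. As a purely illustrative sketch (the finite hypothesis pool, the toy target and the Monte Carlo test set are assumptions for the illustration, not part of the definition), this can be checked empirically for small examples:

import random

def worst_interpolant_error(hypotheses, target, sample, eta, eps, test_points):
    # Over all hypotheses that eta-interpolate the target on the sample,
    # return the largest empirical estimate of P{ x : |h(x) - t(x)| >= eta + eps }.
    worst = 0.0
    for h in hypotheses:
        if all(abs(h(x) - target(x)) < eta for x in sample):
            bad = sum(abs(h(x) - target(x)) >= eta + eps for x in test_points)
            worst = max(worst, bad / len(test_points))
    return worst

if __name__ == "__main__":
    target = lambda x: 0.3
    hypotheses = [lambda x, c=c / 100.0: c for c in range(101)]   # constant functions on [0, 1]
    sample = [random.random() for _ in range(50)]
    tests = [random.random() for _ in range(10000)]
    print(worst_interpolant_error(hypotheses, target, sample, eta=0.1, eps=0.05, test_points=tests))

Definition 1 asks that, for large enough samples, this worst-case error probability falls below $\gamma$ for most samples, simultaneously for every choice of approximate interpolant.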

Two cases of particular interest are those in which $C = H$ and $C = \mathbb{R}^X$, the set of all functions from $X$ to $\mathbb{R}$. If H approximates H from interpolated examples, we simply say that H approximates from interpolated examples. The main aim of this paper is to find characterizations of classes which approximate from interpolated examples and which approximate $\mathbb{R}^X$ from interpolated examples.

While this problem may be of independent interest, it can be interpreted as a learning problem in which we require satisfactory performance from every algorithm that returns a function that approximately interpolates the training examples. If, instead of requiring that all algorithms in this class be suitable, we require only the existence of a suitable algorithm, no necessary and sufficient conditions on the function class H are known. Because an arbitrary amount of information can be conveyed in a single real value, it is possible to construct complicated function classes in which the identity of a function is encoded in its value at every point, and an algorithm can take advantage of this (see [5]). We can avoid this unnatural "conspiracy" between algorithms and function classes in two ways: by requiring that the algorithm be robust in the presence of random observation noise, as was considered in [5], or by requiring satisfactory performance of every algorithm in a class of reasonable algorithms, as we consider here. Results in this paper show that these two approaches are equivalent.

Another reason for studying the problem of this paper is that it has implications for learning in the presence of malicious noise, in which the labels on the training sample can be any real numbers within $\eta$ of the true value of the target. This will be discussed later in the paper, but for the moment simply observe that if h is $\eta$-close to a training sample where the labels have been corrupted to a level of at most $\eta$, then h is certainly $2\eta$-close to the target on the sample. If H approximates from interpolated examples, we can then deduce that if the sample is large enough then (with high probability) h is within $2\eta + \epsilon$ of the target on `most' of $X$.

In the remainder of this section, we give an example of a function class that approximates from interpolated examples. In the next section, we define a measure of the complexity of a class H of functions (the fat-shattering function), and we state the main result: that the fat-shattering function is the key quantity in this problem. In Sections 3 and 4 we give upper and lower bounds on the number of examples necessary for approximation from interpolated examples. Section 5 discusses the gap between these bounds, and describes the implications for learning with malicious noise. Section 6 describes the relationship between the problem of approximation from interpolated examples and a related problem (the case $\epsilon = 0$ in Definition 1) studied in [3].

By way of motivation, we give upper and lower bounds on the number of examples necessary for approximation from interpolated examples with a particular function class, a class of Lipschitz-continuous functions.

Proposition 2 Suppose that $k > 0$. Let H be the class of all $k$-Lipschitz-continuous functions from $[0,1]$ to $[0,1]$. Thus, for each $h \in H$ and for all

$x, y \in [0,1]$, $|h(x) - h(y)| \le k|x - y|$. Then H approximates from interpolated examples. A sufficient sample length function is
\[ m_0(\eta,\epsilon,\gamma,\delta) = \frac{2k}{\epsilon\gamma}\ln\frac{2k}{\epsilon\gamma\delta} + 1. \]
Furthermore, for $\epsilon \le k/4$, $\gamma \le 1/2$, and $\delta \le 1/13$, any sufficient sample length function must satisfy
\[ m_0(\eta,\epsilon,\gamma,\delta) \ge \frac{1}{8\gamma}\left(\frac{k}{2\epsilon} - 2\right) \]
for all $\eta \in (0,1)$.

Proof: Suppose that $x = (x_1, x_2, \ldots, x_m) \in [0,1]^m$ and that, without loss of generality, $0 \le x_1 \le x_2 \le x_3 \le \cdots \le x_m \le 1$. Denote by $I_0$ the closed interval $[0, x_1]$ and by $I_m$ the interval $[x_m, 1]$, and for $j$ between 1 and $m-1$ let $I_j = [x_j, x_{j+1}]$. Let $\eta, \epsilon, \gamma, \delta \in (0,1)$ be given. Suppose that $P$ is any probability distribution on $[0,1]$. Let
\[ B_\beta = \left\{ x \in [0,1]^m : \max_{0 \le i \le m} P(I_i) > \beta \right\}, \]
where $\beta \in (0,1)$. Then
\[ P^m(B_\beta) = P^m\left(\{ x \in [0,1]^m : \exists\, 0 \le i \le m,\ \forall j \ne i,\ x_j \notin S(x_i, \beta) \}\right), \]
where $S(x_i, \beta)$ is the shortest interval $[x_i, z_i]$ satisfying $P([x_i, z_i]) \ge \beta$ (or $[x_i, 1]$ if there is no such interval), and $x_0 = 0$. So
\[ P^m(B_\beta) \le (m+1)\max_{0 \le i \le m} P^m\left(\{ x \in [0,1]^m : \forall j \ne i,\ x_j \notin S(x_i, \beta) \}\right) \le (m+1)\sup_{\alpha \in [0,1]} P^{m-1}\left(\{ x \in [0,1]^{m-1} : \forall j,\ x_j \notin S(\alpha, \beta) \}\right) \le (m+1)(1-\beta)^{m-1}. \]
Choosing $\beta = \frac{1}{m-1}\ln\frac{m+1}{\delta}$, and using the fact that $(1 - \lambda/n)^n < e^{-\lambda}$, implies that this probability is no more than $\delta$.

Now, there are at most $k/\epsilon$ intervals $I_i$ of length at least $\epsilon/k$. Suppose that $t, h \in H$. If the interval $I_i$ has length less than $\epsilon/k$ and $|h(x_i) - t(x_i)| < \eta$, $|h(x_{i+1}) - t(x_{i+1})| < \eta$, then, for all $x$ in the interval $I_i$, $|h(x) - t(x)| < \eta + \epsilon$. Indeed, suppose $x \in I_i$ and $|x - x_i| \le \epsilon/2k$. (A similar argument, using the endpoint $x_{i+1}$, applies if $|x - x_i| > \epsilon/2k$.) Then, since $t$ and $h$ are $k$-Lipschitz-continuous, we have
\[ |(h(x) - t(x)) - (h(x_i) - t(x_i))| \le |h(x) - h(x_i)| + |t(x) - t(x_i)| < \frac{\epsilon}{2k}\,k + \frac{\epsilon}{2k}\,k = \epsilon, \]

so that $|h(x) - t(x)| < \eta + \epsilon$. It follows that, with probability at least $1 - \delta$,
\[ P(\{x : |h(x) - t(x)| \ge \eta + \epsilon\}) \le \sum_{i:\ \mathrm{length}(I_i) \ge \epsilon/k} P(I_i) \le \frac{k}{\epsilon}\max_{0 \le j \le m} P(I_j) \le \frac{k}{\epsilon}\cdot\frac{1}{m-1}\ln\frac{m+1}{\delta}, \]
which is less than $\gamma$ provided
\[ m > \frac{k}{\epsilon\gamma}\ln\frac{2m}{\delta} + 1. \]
Using the fact that $\ln(\alpha x) \le \alpha x - 1$ for $\alpha, x > 0$, with $\alpha = \epsilon\gamma\delta/(4k)$, shows that
\[ m > \frac{k}{\epsilon\gamma}\left(\ln\frac{4k}{\epsilon\gamma\delta} - 1\right) + \frac{m}{2} + 1 \]
will suffice, and this inequality is satisfied if
\[ m \ge \frac{2k}{\epsilon\gamma}\ln\frac{2k}{\epsilon\gamma\delta} + 1. \]
The fact that H approximates from interpolated examples, and the upper bound on sufficient sample length, follow.

We now obtain the lower bound, by an argument similar to one given in [9]. Suppose that $\epsilon \le k/4$ and let $r = \lfloor k/(2\epsilon)\rfloor - 1 \ge 1$. Let $X_0 = \{y_0, y_1, \ldots, y_r\}$ where $y_0 = 0$ and, for $1 \le i \le r$, $y_i = 2i\epsilon/k$. Define a probability distribution $P$ on $[0,1]$ as follows:
\[ P(x) = \begin{cases} 1 - 2\gamma & x = y_0 \\ 2\gamma/r & x \in \{y_1, \ldots, y_r\} \\ 0 & \text{otherwise.} \end{cases} \]
Let $t$ be the identically-zero function on $[0,1]$. Now, suppose a sample $x = (x_1, x_2, \ldots, x_m) \in X_0^m$ is given and that $Y = \{y_1, y_2, \ldots, y_r\} \setminus \{x_1, x_2, \ldots, x_m\} \ne \emptyset$. Let $\zeta > 0$ be such that $\zeta \le \min(\eta, \epsilon)$. Then there is a function $h \in H$ such that $h(x_i) = \eta - \zeta$ for $1 \le i \le m$ but such that, for each $y \in Y$, $h(y) = \eta + \epsilon$. Indeed, let $h(x) = \min(g(x), 1)$ where $g$ is defined as follows. For $1 \le j \le r$ and $x \in [y_{j-1}, y_j]$, let
\[ g(x) = \begin{cases} \eta - \zeta & \text{if } y_{j-1} \notin Y \text{ and } y_j \notin Y \\ (\eta - \zeta) + (x - y_{j-1})\,k(\epsilon + \zeta)/(2\epsilon) & \text{if } y_{j-1} \notin Y \text{ and } y_j \in Y \\ \eta + \epsilon & \text{if } y_{j-1} \in Y \text{ and } y_j \in Y \\ (\eta + \epsilon) - (x - y_{j-1})\,k(\epsilon + \zeta)/(2\epsilon) & \text{if } y_{j-1} \in Y \text{ and } y_j \notin Y. \end{cases} \]

Let $Q \subseteq X_0^m$ be the set of $x$ of length $m$ containing fewer than $r/2$ examples from $\{y_1, y_2, \ldots, y_r\}$. Then we see that, for $x \in Q$, there is a function $h$ such that $|h(x_i) - t(x_i)| < \eta$ for $1 \le i \le m$, and
\[ P(\{x \in [0,1] : |h(x) - t(x)| \ge \eta + \epsilon\}) \ge \frac{2\gamma}{r}\,|\{y_1, y_2, \ldots, y_r\} \setminus \{x_1, x_2, \ldots, x_m\}| > \gamma. \]
Now, the probability of drawing a sample of length $m$ that is not in $Q$ is at most the probability of at least $r/2$ successes in a sequence of $m$ Bernoulli trials, where the probability of success at each trial is $2\gamma$. From standard Chernoff bounds on the tails of the binomial distribution (see [2, 13]), this probability is no more than
\[ \exp\left(-\frac{2\gamma m}{3}\left(\frac{r}{4\gamma m} - 1\right)^2\right), \]
and this is less than $1 - \delta$ when
\[ m < \frac{3}{2\gamma}\left(\frac{r}{6} - \ln\frac{1}{1-\delta}\right). \]
For $\delta \le 1/13$, $m < r/(8\gamma)$ will suffice, from which the result follows. □

2 Definitions and the Main Result

A number of ways of measuring the `expressive power' of a class H of functions has been proposed. This power is quantified by associating a `dimension' to the class. Sometimes this is simply one number depending on H. Sometimes, in what is known as a scale-sensitive dimension, it is a function depending on H. An important example of the first type of dimension is the pseudo-dimension [10, 14]. We say that a finite subset $S = \{x_1, x_2, \ldots, x_d\}$ of $X$ is shattered if there is $r = (r_1, r_2, \ldots, r_d) \in \mathbb{R}^d$ such that for every $b = (b_1, b_2, \ldots, b_d) \in \{0,1\}^d$, there is a function $h_b \in H$ with $h_b(x_i) > r_i$ if $b_i = 1$ and $h_b(x_i) < r_i$ if $b_i = 0$. The pseudo-dimension of H, denoted $\mathrm{Pdim}(H)$, is the largest cardinality of a shattered set, or infinity if there is no bound on the cardinalities of the shattered sets. The pseudo-dimension is a well-understood and useful measure of expressive power. One attractive feature of this dimension is that if the set of functions is a vector space then its pseudo-dimension coincides with its linear dimension (see for example [10]).

Perhaps the most important scale-sensitive dimension which has been used to date in the development of the theory of learning real-valued functions is the fat-shattering function. This is a scale-sensitive version of the pseudo-dimension and was introduced by Kearns and Schapire [11]. Suppose that H is a set of functions from $X$ to $[0,1]$ and that $\gamma \in (0,1)$. We say that a finite subset $S = \{x_1, x_2, \ldots, x_d\}$ of $X$ is $\gamma$-shattered if there is $r = (r_1, r_2, \ldots, r_d) \in \mathbb{R}^d$ such that for every $b = (b_1, b_2, \ldots, b_d) \in \{0,1\}^d$, there is a function $h_b \in H$

with $h_b(x_i) \ge r_i + \gamma$ if $b_i = 1$ and $h_b(x_i) \le r_i - \gamma$ if $b_i = 0$. Thus, $S$ is $\gamma$-shattered if it is shattered with a `width of shattering' of at least $\gamma$. We define the fat-shattering function, $\mathrm{fat}_H : \mathbb{R}^+ \to \mathbb{N}_0 \cup \{\infty\}$, as
\[ \mathrm{fat}_H(\gamma) = \max\{|S| : S \subseteq X \text{ is } \gamma\text{-shattered by } H\}, \]
or $\mathrm{fat}_H(\gamma) = \infty$ if the maximum does not exist. (Here, $\mathbb{N}_0$ denotes the set of nonnegative integers.) It is easy to see that $\mathrm{Pdim}(H) = \lim_{\gamma \to 0}\mathrm{fat}_H(\gamma)$. It should be noted, however, that it is possible for the pseudo-dimension to be infinite, even when $\mathrm{fat}_H(\gamma)$ is finite for all $\gamma$. We shall say that H has finite fat-shattering function whenever it is the case that for all $\gamma \in (0,1)$, $\mathrm{fat}_H(\gamma)$ is finite.

The fat-shattering function plays an important role in the learning theory of real-valued functions. Kearns and Schapire [11] proved that if a class of probabilistic concepts is learnable, then the class has finite fat-shattering function. (A probabilistic concept $f$ is a $[0,1]$-valued function. In this model, the learner sees examples $(x_i, y_i)$, where $\Pr(y_i = 1) = f(x_i)$.) Alon et al. [1] proved, conversely, that if a class of probabilistic concepts has finite fat-shattering function, then it is learnable. The main result in [5] is that finiteness of the fat-shattering function of a class of $[0,1]$-valued functions is a necessary and sufficient condition for learning with random observation noise.

Our main result is the following.

Theorem 3 Suppose that H is a set of functions from a set $X$ to $[0,1]$. Then the following propositions are equivalent.
1. H approximates from interpolated examples.
2. H approximates $\mathbb{R}^X$ from interpolated examples.
3. H has finite fat-shattering function.

3 Upper bound

In this section, we prove that finite fat-shattering function is a sufficient condition for approximation from interpolated examples and we provide a suitable sample length function $m_0(\eta,\epsilon,\gamma,\delta)$. We first need the notion of covering numbers, as used extensively in [10, 1, 8] for instance. Suppose that $(A, d)$ is a pseudo-metric space and $\epsilon > 0$. Then a subset $N$ of $A$ is said to be an $\epsilon$-cover for a subset $B$ of $A$ if, for every $x \in B$, there is an $\hat{x} \in N$ such that $d(x, \hat{x}) < \epsilon$. The metric space is totally bounded if there is a finite $\epsilon$-cover for $A$, for all $\epsilon > 0$. When $(A, d)$ is totally bounded, we shall denote the minimal cardinality of an $\epsilon$-cover for $A$ by $\mathcal{N}_A(\epsilon, d)$ for $\epsilon > 0$. A subset $M$ of $A$ is said to be $\epsilon$-separated if, for all distinct $x, y \in M$,

$d(x, y) \ge \epsilon$. We shall denote the maximal cardinality of an $\epsilon$-separated subset of $A$ by $\mathcal{M}_A(\epsilon, d)$. It is easy to show that $\mathcal{M}_A(2\epsilon, d) \le \mathcal{N}_A(\epsilon, d) \le \mathcal{M}_A(\epsilon, d)$ (see [12]), so $\mathcal{M}_A(\epsilon, d)$ is always defined if $(A, d)$ is totally bounded.

Suppose now that H is a set of functions from a set $X$ to $[0,1]$ and that $x = (x_1, x_2, \ldots, x_m) \in X^m$, where $m$ is a positive integer. We may define a pseudometric $l^\infty_x$ on H as follows: for $g, h \in H$,
\[ l^\infty_x(g, h) = \max_{1 \le i \le m}|g(x_i) - h(x_i)|. \]
(This metric has been used in [8, 1], for example.) Alon et al. [1] obtained (essentially) the following result bounding the $l^\infty_x$-covering number of H in terms of the fat-shattering function of H.

Lemma 4 Suppose that H is a set of functions from $X$ to $[0,1]$ and that H has finite fat-shattering function. Let $m \in \mathbb{N}$ and $x \in X^m$. Then the pseudometric space $(H, l^\infty_x)$ is totally bounded. Suppose $\epsilon > 0$. Let $d = \mathrm{fat}_H(\epsilon/4)$ and
\[ y = \sum_{i=1}^{d}\binom{m}{i}\left(\frac{2}{\epsilon}\right)^i. \]
Then, provided $m \ge \log y + 1$,
\[ \mathcal{N}_H(\epsilon, l^\infty_x) < 2\left(\frac{2m}{\epsilon^2}\right)^{\log y}. \]
Here, as elsewhere in the paper, log denotes logarithm to base 2.

We then have the following result.

Theorem 5 Suppose that H is a class of functions mapping from a domain $X$ to the real interval $[0,1]$ and that H has finite fat-shattering function. Let $t$ be any function from $X$ to $\mathbb{R}$ and let $\eta, \epsilon, \gamma > 0$. Let $P$ be any probability distribution on $X$ and define B to be the set of functions $h \in H$ for which
\[ P(\{x \in X : |h(x) - t(x)| \ge \eta + \epsilon\}) \ge \gamma. \]
Let $d = \mathrm{fat}_H(\epsilon/8)$ and let
\[ y = \sum_{i=1}^{d}\binom{2m}{i}\left(\frac{4}{\epsilon}\right)^i. \]
Then, for $m \ge \max(8/\gamma,\ \log y + 1)$,

the probability
\[ P^m\left(\{x \in X^m : \exists h \in B,\ |h(x_i) - t(x_i)| < \eta\ (1 \le i \le m)\}\right) \]
is at most
\[ 4\left(\frac{16m}{\epsilon^2}\right)^{\log y} 2^{-\gamma m/2}. \]

Proof: The proof is based on a technique analogous to that used in [17, 7, 10], where we `symmetrize' and then `combinatorially bound'. The first step (symmetrization) relates the desired probability to a `sample-based' one. This latter probability is then bounded using counting techniques. Fix $t$, $P$, $m$, the parameters $\eta, \epsilon, \gamma$, and hence the set B. We define two sets $Q$ and $R$ as follows. The subset $Q$ of $X^m$ is given by
\[ Q = \{x \in X^m : \exists h \in B,\ |h(x_i) - t(x_i)| < \eta\ (1 \le i \le m)\}. \]
The subset $R$ of $X^{2m}$ is defined as follows, where $xy \in X^{2m}$ denotes the concatenation of $x, y \in X^m$:
\[ R = \{xy \in X^{2m} : \exists h \in B,\ |h(x_i) - t(x_i)| < \eta\ (1 \le i \le m) \text{ and } |\{i : |h(y_i) - t(y_i)| \ge \eta + \epsilon\}| > \gamma m/2\}. \]
We claim that, for $m \ge 8/\gamma$, $P^m(Q) \le 2P^{2m}(R)$. To prove this claim, we first note that if $h \in B$, then, for $m \ge 8/\gamma$, by Chebyshev's inequality (for example), the probability that $y \in X^m$ satisfies $|\{i : |h(y_i) - t(y_i)| \ge \eta + \epsilon\}| \le \gamma m/2$ is at most $1/2$. The characteristic function $\chi_R$ of $R$ may be expressed as follows [4]:
\[ \chi_R(xy) = \chi_Q(x)\,\chi_x(y), \]
where $\chi_Q$ is the characteristic function of $Q$ and the $\{0,1\}$-valued function $\chi_x$ is such that $\chi_x(y) = 1$ if and only if there is $h \in B$ such that $|h(x_i) - t(x_i)| < \eta$ for $1 \le i \le m$ and such that $|\{i : |h(y_i) - t(y_i)| \ge \eta + \epsilon\}| > \gamma m/2$. Then
\[ P^{2m}(R) = \int \chi_R(xy)\,dP^{2m}(xy) = \int\left(\int \chi_Q(x)\,\chi_x(y)\,dP^m(y)\right)dP^m(x). \]
For fixed $x \in Q$ and for $m \ge 8/\gamma$, the inner integral is at least $1/2$ by the observation above. It follows that
\[ P^{2m}(R) \ge \frac{1}{2}\int \chi_Q(x)\,dP^m(x) = \frac{1}{2}P^m(Q), \]
as claimed.

The next step is to bound the probability of $R$ using combinatorial techniques. For this, let $\Gamma$ be the `swapping group' [14] of permutations on the set $\{1, 2, \ldots, 2m\}$. This is the group generated by the transpositions $(i, m+i)$

for $1 \le i \le m$. The group $\Gamma$ acts in a natural way on vectors in $X^{2m}$: for $\sigma \in \Gamma$ and $z \in X^{2m}$, we define $\sigma z$ to be $(z_{\sigma(1)}, z_{\sigma(2)}, \ldots, z_{\sigma(2m)})$. Let
\[ \Gamma(R, z) = |\{\sigma \in \Gamma : \sigma z \in R\}| \]
be the number of permutations in $\Gamma$ taking $z$ into $R$. Since $P^{2m}$ is a product distribution, we have $P^{2m}(R) = P^{2m}(\{z : \sigma z \in R\})$ and hence
\[ P^{2m}(R) = \frac{1}{|\Gamma|}\sum_{\sigma \in \Gamma}\int \chi_R(\sigma z)\,dP^{2m}(z) = \int \frac{1}{|\Gamma|}\sum_{\sigma \in \Gamma}\chi_R(\sigma z)\,dP^{2m}(z) = \mathbf{E}_{P^{2m}}\left(\frac{\Gamma(R, z)}{2^m}\right), \]
where $\mathbf{E}_{P^{2m}}$ denotes expectation with respect to $P^{2m}$. It follows from this that
\[ P^{2m}(R) \le \frac{1}{2^m}\max_{z \in X^{2m}}\Gamma(R, z). \]
Now, let us fix $z \in X^{2m}$ and consider the pseudo-metric space $(B, l^\infty_z)$. Since H has finite fat-shattering function, so does B, and Lemma 4 implies that this pseudo-metric space is totally bounded. Let $N = \{\hat{h}_1, \hat{h}_2, \ldots, \hat{h}_n\}$ be a minimal $\epsilon/2$-cover for B. From Lemma 4,
\[ n < 2\left(\frac{16m}{\epsilon^2}\right)^{\log y}. \]
Since $N$ is an $\epsilon/2$-cover, given $h \in B$, there is $\hat{h} \in N$ such that $l^\infty_z(\hat{h}, h) < \epsilon/2$, which means that for $1 \le i \le 2m$, $|\hat{h}(z_i) - h(z_i)| < \epsilon/2$. Suppose that $z = xy \in R$. Then, by the definition of $R$, there is some $h \in B$ such that $|h(x_i) - t(x_i)| < \eta$ for $1 \le i \le m$ and such that, for more than $\gamma m/2$ of the $y_i$, $|h(y_i) - t(y_i)| \ge \eta + \epsilon$. But (taking $\hat{h}$ to be, as described above, a function in the cover $\epsilon/2$-close to $h$) this implies that there is $\hat{h} \in N$ such that for $1 \le i \le m$, $|\hat{h}(x_i) - t(x_i)| < \eta + \epsilon/2$ and such that, for more than $\gamma m/2$ of the $y_i$, $|\hat{h}(y_i) - t(y_i)| \ge \eta + \epsilon/2$. It follows from this that if $z \in R$, then, for some $l$ between 1 and $n$, $z$ belongs to the set $\hat{R}_l$, defined as
\[ \hat{R}_l = \{xy \in X^{2m} : |\hat{h}_l(x_i) - t(x_i)| < \eta + \epsilon/2\ (1 \le i \le m) \text{ and } |\{i : |\hat{h}_l(y_i) - t(y_i)| \ge \eta + \epsilon/2\}| > \gamma m/2\}. \]

Let $\Gamma(\hat{R}_l, z)$ be the number of $\sigma$ in $\Gamma$ for which $\sigma z \in \hat{R}_l$. Since $z \in R$ implies $z \in \hat{R}_l$ for some $l$, we have
\[ \Gamma(R, z) \le \sum_{l=1}^{n}\Gamma(\hat{R}_l, z). \]
Consider a particular $l$ between 1 and $n$ and suppose that $\Gamma(\hat{R}_l, z) \ne 0$. Let $k$ be the number of indices $i$ between 1 and $2m$ such that $|\hat{h}_l(z_i) - t(z_i)| \ge \eta + \epsilon/2$. Then $\gamma m/2 < k \le m$. The number of permutations $\sigma$ in $\Gamma$ for which $\sigma z$ belongs to $\hat{R}_l$ is then equal to $2^{m-k}$, which is less than $2^{m(1 - \gamma/2)}$. (The $z_i$ which can be `swapped' are precisely those $m - k$ satisfying $|\hat{h}_l(z_i) - t(z_i)| < \eta + \epsilon/2$.) It follows that
\[ P^{2m}(R) < \frac{1}{2^m}\sum_{l=1}^{n}2^{m(1 - \gamma/2)} = n\,2^{-\gamma m/2} \le 2\left(\frac{16m}{\epsilon^2}\right)^{\log y}2^{-\gamma m/2}. \]
The statement in the Theorem now follows. □

It follows from this result that finiteness of $\mathrm{fat}_H$ implies that H approximates $\mathbb{R}^X$ from interpolated examples, and hence H approximates H from interpolated examples.

Corollary 6 Suppose that H is a set of functions from $X$ to $[0,1]$ and that H has finite fat-shattering function. Then H approximates $\mathbb{R}^X$ from interpolated examples. Furthermore, there is a positive constant $K$ such that a sufficient sample length function is
\[ m_0(\eta,\epsilon,\gamma,\delta) = \frac{K}{\gamma}\left(d\log^2\frac{d}{\gamma\epsilon^2} + \log\frac{1}{\delta}\right), \]
where $d = \mathrm{fat}_H(\epsilon/8)$.

Proof: For all $m, d \ge 1$,
\[ y = \sum_{i=1}^{d}\binom{2m}{i}\left(\frac{4}{\epsilon}\right)^i \le d(2m)^d\left(\frac{5}{\epsilon}\right)^d < \left(\frac{20m}{\epsilon}\right)^d. \]
So
\[ 4\left(\frac{16m}{\epsilon^2}\right)^{\log y}2^{-\gamma m/2} < \delta \]
if
\[ 4\left(\frac{50m}{\epsilon^2}\right)^{d\log(20m/\epsilon)}2^{-\gamma m/2} < \delta \iff \frac{\gamma m}{2} - d\log\frac{20m}{\epsilon}\log\frac{50m}{\epsilon^2} > \log\frac{4}{\delta} \Longleftarrow \frac{\gamma m}{2} - d\log^2\frac{50m}{\epsilon^2} > \log\frac{4}{\delta}. \tag{1} \]

Now, since $\ln\beta \le \beta/e$ for all $\beta > 0$, we have $\ln\lambda = 2\ln\sqrt{\lambda} \le \frac{2}{e}\sqrt{\lambda}$, and hence $\ln^2\lambda \le \frac{4}{e^2}\lambda$ for all $\lambda > 0$. It follows that, for all $\alpha > 0$ and $\lambda > 0$,
\[ \ln^2\lambda = \bigl(\ln(\alpha\lambda) + \ln(1/\alpha)\bigr)^2 \le 2\ln^2(\alpha\lambda) + 2\ln^2(1/\alpha) \le \frac{8}{e^2}\,\alpha\lambda + 2\ln^2(1/\alpha). \]
For $\lambda = 50m/\epsilon^2$ and $\alpha = \gamma\epsilon^2\ln^2 2/(1200d)$, this inequality implies that (1) holds when
\[ m > \frac{4}{\gamma}\left(\frac{3d}{\ln^2 2}\,\ln^2\frac{1200d}{\gamma\epsilon^2\ln^2 2} + \log\frac{4}{\delta}\right). \qquad \Box \]

We remark that, for a given $k$, the class of all $k$-Lipschitz continuous functions discussed in Proposition 2 has fat-shattering function $\mathrm{fat}_H(\gamma) = O(1/\gamma)$ and that the sufficient sample length function of Proposition 2 is of order
\[ \frac{1}{\gamma}\left(\mathrm{fat}_H(\epsilon)\log\frac{\mathrm{fat}_H(\epsilon)}{\gamma} + \log\frac{1}{\delta}\right). \]

4 Lower bound

In this section, we give lower bounds on the number of examples necessary for H to approximate $\mathbb{R}^X$ from interpolated examples and for H to approximate H from interpolated examples. The bounds are in terms of $\mathrm{fat}_H$, the fat-shattering function of H. To prove them, we consider a discretized version of H. For $a \in [0,1]$, let $D_\alpha(a) = \lceil a/\alpha\rceil$. For a function $f : X \to [0,1]$, let $D_\alpha(f) : X \to \{0, 1, \ldots, \lceil 1/\alpha\rceil\}$ be defined as the composition of $D_\alpha$ and $f$. Let $D_\alpha(H)$ denote $\{D_\alpha(f) : f \in H\}$. Functions in $D_\alpha(H)$ map to $\{0, 1, \ldots, n\}$, where $n = \lceil 1/\alpha\rceil$. Let $[n]$ denote $\{0, 1, \ldots, n\}$. From the definition of the fat-shattering function,
\[ \mathrm{fat}_{D_\alpha(H)}(\gamma) \ge \mathrm{fat}_H(\gamma\alpha) \]
for every positive integer $\gamma$ and every $\alpha \in \mathbb{R}^+$.

To obtain a lower bound on the sample length function in terms of the fat-shattering function, we consider a number of notions of dimension, defined using classes of $\{0, 1, *\}$-valued functions on $[n]$. Our approach is similar to that of Ben-David, Cesa-Bianchi, Haussler and Long [6], who consider learning $[n]$-valued functions. We show that a large family of these dimensions (including the fat-shattering function) are closely related, and prove the lower bound in terms of the most convenient, the gnat-dimension (a `gapped' version of the Natarajan dimension).
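The discretization $D_\alpha$ used below is elementary to compute. The following sketch (the helper names and the finite point set are illustrative assumptions) applies $D_\alpha(a) = \lceil a/\alpha\rceil$ pointwise, mapping a $[0,1]$-valued function to an $[n]$-valued one with $n = \lceil 1/\alpha\rceil$:

import math

def discretize_value(a, alpha):
    # D_alpha(a) = ceil(a / alpha), taking [0, 1] into {0, 1, ..., ceil(1/alpha)}
    return math.ceil(a / alpha)

def discretize_function(f, points, alpha):
    # Tabulate D_alpha(f) on a finite set of points
    return {x: discretize_value(f(x), alpha) for x in points}

if __name__ == "__main__":
    f = lambda x: x * x
    print(discretize_function(f, [0.0, 0.3, 0.6, 1.0], 0.25))   # values lie in {0, ..., 4}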

Definition 7 Let $\mathrm{gnat}(k)$ be the following set of $\{0, 1, *\}$-valued functions defined on $[n]$, where $k \in \{2, 3, \ldots\}$:
\[ \mathrm{gnat}(k) = \{\psi_{i,j} : i, j \in [n],\ |i - j| \ge k\}, \]
with
\[ \psi_{i,j}(\ell) = \begin{cases} 1 & \ell = i \\ 0 & \ell = j \\ * & \text{otherwise.} \end{cases} \]
If F is a class of $[n]$-valued functions defined on $X$ and $\Psi$ is a class of $\{0, 1, *\}$-valued functions defined on $[n]$, we say that F $\Psi$-shatters $x = (x_1, \ldots, x_d) \in X^d$ if there is a sequence $(\psi_1, \ldots, \psi_d) \in \Psi^d$ such that
\[ \{0,1\}^d \subseteq \{(\psi_1(f(x_1)), \ldots, \psi_d(f(x_d))) : f \in F\}. \]
The $\Psi$-dimension of F, denoted $\Psi$-dim(F), is the size of the largest $\Psi$-shattered sequence, or infinity if there is no largest sequence.

The following result is similar to the key result in [1] (Lemma 15), which bounds covering numbers of F in terms of $\mathrm{fat}_F(1)$. We will see later that it generalizes that lemma, since the $\mathrm{gnat}(2)$-dimension is the smallest of a family of dimensions that includes $\mathrm{fat}_F(1)$.

Lemma 8 Let $k \ge 2$ and $n \ge 1$ be integers. Suppose that F is a class of $[n]$-valued functions defined on $X$ satisfying $\mathrm{gnat}(k)$-dim(F) $\le d$, and that $m \ge \log y + 1$, where
\[ y = \sum_{i=0}^{d}\binom{m}{i}n^{2i}. \]
Then
\[ \max_{x \in X^m}\mathcal{M}_F(k, l^\infty_x) < 2(2mn^2)^{\log y}. \]

Proof: Fix $x \in X^m$ and let $S = \{x_i : i = 1, \ldots, m\}$. For a set $A \subseteq S$ and functions $u, l : A \to [n]$, we say that F shatters $(A, l, u)$ if $u(x) - l(x) \ge k$ for all $x \in A$, and for all $b \in \{0,1\}^A$ there is a function $f_b \in F$ such that, for every $x \in A$,
\[ f_b(x) = \begin{cases} u(x) & b_x = 1 \\ l(x) & b_x = 0. \end{cases} \]
Define the function $t : \mathbb{N}^4 \to \mathbb{N}_0$ as
\[ t(h, m, n, k) = \min\bigl\{ |\{(A, l, u) : A \subseteq Y,\ l, u : A \to [n],\ G \text{ shatters } (A, l, u)\}| : |Y| \le m,\ G \subseteq [n]^Y,\ |G| \ge h,\ G \text{ is } k\text{-separated on } Y \bigr\}, \]
where we say that a set G is $k$-separated on $Y$ if, for all distinct functions $f, g \in G$, there is a $z \in Y$ such that $|f(z) - g(z)| \ge k$.

It is easy to see that if $|Y| \le m$ then
\[ |\{(A, l, u) : A \subseteq Y,\ |A| \le d,\ l, u : A \to [n],\ \forall z \in A\ u(z) - l(z) \ge k\}| < y. \]
It follows that, if there is an $h \in \mathbb{N}$ such that $t(h, m, n, k) \ge y$, then every $k$-separated function class $G \subseteq [n]^Y$ with $|Y| \le m$ and $|G| \ge h$ has $\mathrm{gnat}(k)$-dim(G) $> d$.

Suppose $F \subseteq [n]^X$ has $\mathrm{gnat}(k)$-dim(F) $\le d$. Fix $x \in X^m$ and let G be some subset of F restricted to $\{x_1, \ldots, x_m\}$. Clearly, $\mathrm{gnat}(k)$-dim(G) $\le d$, so if G is $k$-separated on $\{x_1, \ldots, x_m\}$ then $|G| < h$. In that case, $\max_{x \in X^m}\mathcal{M}_F(k, l^\infty_x) < h$. So the result follows from $t(2(2mn^2)^{\log y}, m, n, k) \ge y$, which we prove by induction.

For all $m' \in \mathbb{N}$ and $n \ge k$, we have $t(2, m', n, k) \ge 1$. To see this, take $\{g_1, g_2\} \subseteq G$, where the functions $g_1$ and $g_2$ map from a set $Y$ (with $|Y| \le m'$) to $[n]$, and are $k$-separated. There must be an $x \in Y$ for which (without loss of generality) $g_1(x) - g_2(x) \ge k$. It follows that G shatters $(\{x\}, g_2, g_1)$, and hence $t(2, m', n, k) \ge 1$.

Now, take a class G of $[n]$-valued functions defined on $Y$, where $|Y| = m$, $|G| = 2mn^2h$, and G is $k$-separated. Divide G into $mn^2h$ pairs,
\[ G = \{(f_1, g_1), \ldots, (f_{mn^2h}, g_{mn^2h})\}. \]
For each pair $(f_i, g_i)$, choose an $x_i \in Y$ for which $f_i(x_i) - g_i(x_i) \ge k$ (swapping the labels $f_i$ and $g_i$ if necessary). This is always possible since G is $k$-separated. By the pigeonhole principle, there is an $x \in Y$ for which
\[ |\{i \in \{1, \ldots, mn^2h\} : x_i = x\}| \ge n^2h. \]
Again by the pigeonhole principle, there is a pair $(a, b)$ and a set
\[ S = \{i \in \{1, \ldots, mn^2h\} : x_i = x,\ f_i(x_i) = a,\ g_i(x_i) = b\} \]
such that $|S| = h$. Define $G_1 = \{f_i|_{Y\setminus\{x\}} : i \in S\}$ and $G_2 = \{g_i|_{Y\setminus\{x\}} : i \in S\}$, where $f|_T$ is the restriction of the function $f$ to the set $T$. Clearly, $G_1$ is $k$-separated on $Y \setminus \{x\}$ because G is $k$-separated on $Y$ and for all distinct pairs $f_i, f_j \in G_1$ we have $f_i(x) = f_j(x)$. Also, $|G_1| = h$. By the definition of t, the set
\[ S_1 = \{(A, l, u) : A \subseteq Y \setminus \{x\},\ l, u : A \to [n],\ G_1 \text{ shatters } (A, l, u)\} \]
satisfies $|S_1| \ge t(h, m-1, n, k)$. Similarly, if we define $S_2$ as the set of triples shattered by $G_2$, we have $|S_2| \ge t(h, m-1, n, k)$. For any triple $(A, l, u)$ in $S_1 \cap S_2$, define $l^*, u^* : A \cup \{x\} \to [n]$ as
\[ l^*(\ell) = \begin{cases} b & \ell = x \\ l(\ell) & \text{otherwise,} \end{cases} \qquad u^*(\ell) = \begin{cases} a & \ell = x \\ u(\ell) & \text{otherwise.} \end{cases} \]

Then G shatters $(A \cup \{x\}, l^*, u^*)$. It follows that G shatters at least
\[ |S_1| + |S_2| \ge 2t(h, m-1, n, k) \]
triples. Since G was arbitrary, $t(2mn^2h, m, n, k) \ge 2t(h, m-1, n, k)$.

So we have $t(2, m', n, k) \ge 1$ for $m' \in \mathbb{N}$ and $n \ge k$, and $t(2mn^2h, m, n, k) \ge 2t(h, m-1, n, k)$. By induction,
\[ t\left(2(2n^2)^i\prod_{j=1}^{i}(m_0 + j),\ m_0 + i,\ n,\ k\right) \ge 2^i \]
for all $i \in \mathbb{N}$. If $m \ge \log y + 1$, we can assign $i = \log y$, $m_0 = m - i$, and $h = 2(2n^2m)^i$, and we have
\[ h \ge 2(2n^2)^i\prod_{j=1}^{i}(m_0 + j). \]
From the definition of t, $h_1 \ge h_2$ implies $t(h_1, m, n, k) \ge t(h_2, m, n, k)$, so
\[ t(2(2n^2m)^i, m, n, k) \ge y. \]
The result follows. □

A useful family of function classes is the $k$-gapped distinguishers.

Definition 9 Let $k \ge 2$. A set $\Psi$ of functions from $[n]$ to $\{0, 1, *\}$ is a $k$-gapped distinguisher if it satisfies
1. for all $i \in \{0, 1, \ldots, n-k\}$ and $j \in \{i+k, \ldots, n\}$, there is a function $\psi \in \Psi$ and a bit $b \in \{0,1\}$ such that $\psi(i) = b$ and $\psi(j) = 1-b$;
2. $\min\{|i - j| : i, j \in [n],\ \exists\psi \in \Psi,\ \psi(i) = 0,\ \psi(j) = 1\} = k$.

The set $\mathrm{gnat}(k)$ is an example of a $k$-gapped distinguisher. Another important example is the class
\[ \mathrm{g}(k) = \left\{\psi \in \{0,1,*\}^{[n]} : \min\{|i - j| : i, j \in [n],\ \psi(i) = 0,\ \psi(j) = 1\} \ge k\right\}. \]
In fact $\mathrm{g}(k)$ is the largest $k$-gapped distinguisher, in the sense that it contains any other $k$-gapped distinguisher.

Lemma 10 Suppose F is a class of $[n]$-valued functions defined on $X$, $\Psi$ is a class of $\{0,1,*\}$-valued functions defined on $[n]$, and $k \ge 2$. If $\Psi$ is a $k$-gapped distinguisher then
\[ \mathrm{gnat}(k)\text{-dim}(F) \le \Psi\text{-dim}(F) \le \mathrm{g}(k)\text{-dim}(F). \]

Proof: Take a $\mathrm{gnat}(k)$-shattered sequence $x \in X^d$. Since $\Psi$ is a $k$-gapped distinguisher, for all $\psi_{i,j} \in \mathrm{gnat}(k)$ there is a $\psi \in \Psi$ and a $b \in \{0,1\}$ for which

$\psi(i) = b$ and $\psi(j) = 1-b$. It follows that $x$ is $\Psi$-shattered, which gives the first inequality. The second inequality follows from the fact that $\Psi \subseteq \mathrm{g}(k)$. □

We can express the fat-shattering dimension $\mathrm{fat}_F(k)$ as a dimension of this type, for $k \ge 2$. Define
\[ \mathrm{fat}(k) = \{\psi_i : i \in \{0, \ldots, n-k\}\}, \]
with
\[ \psi_i(z) = \begin{cases} 1 & z \ge i + k \\ * & i < z < i + k \\ 0 & z \le i, \end{cases} \]
for $z \in [n]$. Then $\mathrm{fat}_F(\gamma) = \mathrm{fat}(\lfloor 2\gamma\rfloor)$-dim(F) for all classes F of functions from X to $[n]$. Clearly, $\mathrm{fat}(k)$ is a $k$-gapped distinguisher. It follows from Lemma 10 that Lemma 8 generalizes Alon et al.'s Lemma 15, which gave a similar result for the $\mathrm{fat}(2)$-dimension.

The following result shows that the $\mathrm{gnat}(k)$-dimension, $\mathrm{g}(k)$-dimension, and $\Psi$-dimension (for any $k$-gapped distinguisher $\Psi$) are all closely related.

Lemma 11 Let $k \ge 2$. Let F be a class of functions that map from X to $[n]$, satisfying $\mathrm{gnat}(k)$-dim(F) $= d_N \ge 2$ and $\mathrm{g}(k)$-dim(F) $\ge d_B \ge 1$. Then
\[ d_N > \frac{d_B}{3\log^2(2d_Bn^2)}. \]

Proof: Suppose $x = (x_1, \ldots, x_{d_B}) \in X^{d_B}$ is $\mathrm{g}(k)$-shattered by F. The definition of $\mathrm{g}(k)$ implies that any subset of F that $\mathrm{g}(k)$-shatters $x$ is $k$-separated, so $\mathcal{M}_F(k, l^\infty_x) \ge 2^{d_B}$. Let
\[ y = \sum_{i=0}^{d_N}\binom{d_B}{i}n^{2i}. \]
Then $y \le d_N\,d_B^{d_N}\,n^{2d_N}$, so $\log y \le \log d_N + d_N\log(d_Bn^2)$, since $d_N \ge 2$. If $d_B \ge \log y + 1$ then
\[ 2^{d_B} \le \mathcal{M}_F(k, l^\infty_x) < 2(2d_Bn^2)^{\log y}, \]
so
\[ d_B < 1 + \log y\,\log(2d_Bn^2). \tag{2} \]
Alternatively, if $d_B \le \log y$, then (2) is obviously true. So we have
\[ d_B < 1 + \log y\,\log(2d_Bn^2) \le 1 + \bigl(\log d_N + d_N\log(d_Bn^2)\bigr)\log(2d_Bn^2) \le 3d_N\log^2(2d_Bn^2). \qquad \Box \]
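For a small finite class, the $\mathrm{gnat}(k)$-dimension of Definition 7 can be checked by brute force over witness pairs. The sketch below is illustrative only (F is represented as a list of dictionaries mapping points to values in $[n]$, and no attempt is made at efficiency); it tests whether a particular tuple of points is $\mathrm{gnat}(k)$-shattered:

from itertools import product

def make_psi(i, j):
    # The {0, 1, *}-valued function psi_{i,j} of Definition 7
    def psi(value):
        if value == i:
            return 1
        if value == j:
            return 0
        return "*"
    return psi

def gnat_shatters(F, points, k):
    # True if the finite class F gnat(k)-shatters the given tuple of points
    n = max(f[p] for f in F for p in points)
    pairs = [(i, j) for i in range(n + 1) for j in range(n + 1) if abs(i - j) >= k]
    for witness in product(pairs, repeat=len(points)):
        psis = [make_psi(i, j) for (i, j) in witness]
        patterns = {tuple(ps(f[p]) for ps, p in zip(psis, points)) for f in F}
        if all(b in patterns for b in product((0, 1), repeat=len(points))):
            return True
    return False

The $\mathrm{gnat}(k)$-dimension of F on a finite domain is then the largest tuple length for which some tuple of points passes this test.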

The following result follows easily from Theorem 1 in [9], which gives a lower bound on the number of examples necessary for learning $\{0,1\}^{[d]}$ in the probably approximately correct model (see also [7]).

Lemma 12 Let $0 < \gamma \le 1/8$, $0 < \delta < 1/100$, and $d \ge 1$. If
\[ m < \max\left(\frac{d-1}{32\gamma},\ \frac{1-\gamma}{\gamma}\ln\frac{1}{\delta}\right) \]
then there is a distribution P on $[d]$ and a function $t \in \{0,1\}^{[d]}$ such that
\[ P^m\bigl(\{x \in [d]^m : \exists f \in \{0,1\}^{[d]} \text{ such that } f(x_i) = t(x_i),\ i = 1, \ldots, m,\ \text{and } P\{y : f(y) \ne t(y)\} \ge \gamma\}\bigr) \ge \delta. \]

We use Lemma 12 to prove the following lower bound on the sample length function for approximating $\mathbb{R}^X$ from interpolated examples.

Theorem 13 Suppose H is a class of $[0,1]$-valued functions defined on a set X, $0 < \eta < \epsilon < 1$, and $\gamma, \delta \in (0,1)$. Then any sample length function $m_0$ for H to approximate $\mathbb{R}^X$ from interpolated examples satisfies
\[ m_0(\eta,\epsilon,\gamma,\delta) \ge \max\left(\frac{1}{32\gamma}\left(\frac{d}{3\log^2(4d/\epsilon^2)} - 1\right),\ \frac{1}{\gamma}\log\frac{1}{\delta}\right), \]
where $d = \mathrm{fat}_H(\epsilon)$.

Proof: Fix $0 < \eta < \epsilon < 1$, define $n = \lceil 1/\epsilon\rceil$, and suppose $\mathrm{fat}_H(\epsilon) \ge d$. Let $F = D_\epsilon(H)$. Then $\mathrm{fat}_F(1) \ge d$, so $\mathrm{gnat}(2)$-dim(F) $> k$, where $k = d/(3\log^2(2dn^2))$. Consider a sequence $(x_1, \ldots, x_k) \in X^k$ that is $\mathrm{gnat}(2)$-shattered by F. Clearly, there is a subset $H_0 \subseteq H$ with $|H_0| = 2^k$ and a sequence $(\psi_{a_1,b_1}, \ldots, \psi_{a_k,b_k}) \in \mathrm{gnat}(2)^k$ such that
\[ \{(\psi_{a_1,b_1}(f(x_1)), \ldots, \psi_{a_k,b_k}(f(x_k))) : f \in D_\epsilon(H_0)\} = \{0,1\}^k. \]
Without loss of generality, we can assume that $a_j > b_j$ for $j = 1, \ldots, k$. Now, if $m < \max((k-1)/(32\gamma),\ ((1-\gamma)/\gamma)\ln(1/\delta))$, Lemma 12 implies that there is a distribution P on $\{1, \ldots, k\}$ and a function $p : \{1, \ldots, k\} \to \{0,1\}$ such that
\[ P^m\bigl(\{l \in \{1, \ldots, k\}^m : \exists p' : \{1, \ldots, k\} \to \{0,1\} \text{ such that } p(l_i) = p'(l_i) \text{ for } i = 1, \ldots, m \text{ and } P\{y \in \{1, \ldots, k\} : p(y) \ne p'(y)\} \ge \gamma\}\bigr) \ge \delta. \]
Choose a function $t : X \to \mathbb{R}$ satisfying
\[ t(x_j) = \begin{cases} (a_j - 1)\epsilon + \zeta & p(j) = 1 \\ b_j\epsilon - \zeta & p(j) = 0 \end{cases} \]

for $j = 1, \ldots, k$, where
\[ \zeta = \frac{1}{2}\min\{h(x_j) - (a_j - 1)\epsilon : h \in H_0 \text{ and } h(x_j) > (a_j - 1)\epsilon,\ j = 1, \ldots, k\}. \]
For each function $h \in H_0$ define $f_h = D_\epsilon(h)$. Let $p_h : \{1, \ldots, k\} \to \{0,1\}$ be defined by
\[ p_h(j) = \begin{cases} 1 & f_h(x_j) = a_j \\ 0 & f_h(x_j) = b_j. \end{cases} \]
Clearly, if $|h(x_j) - t(x_j)| < \eta$ for some $h \in H_0$ and some $j \in \{1, \ldots, k\}$, then $h(x_j) \in ((a_j - 1)\epsilon, a_j\epsilon] \cup ((b_j - 1)\epsilon, b_j\epsilon]$, so $p_h(j) = p(j)$. Also, if $p_h(j) \ne p(j)$ for some $h \in H$, then $|h(x_j) - t(x_j)| > \eta + \epsilon$. It follows that $P\{y \in \{1, \ldots, k\} : p_h(y) \ne p(y)\} \ge \gamma$ implies $Q\{y \in X : |h(y) - t(y)| \ge \eta + \epsilon\} \ge \gamma$, where Q is the discrete probability distribution on X satisfying $Q(x_j) = P(j)$ for $j = 1, \ldots, k$. So
\[ Q^m\{y \in X^m : \exists h \in H \text{ such that } |h(y_i) - t(y_i)| < \eta \text{ for } i = 1, \ldots, m \text{ and } Q\{y \in X : |h(y) - t(y)| \ge \eta + \epsilon\} \ge \gamma\} \]
\[ \ge Q^m\{y \in X^m : \exists h \in H_0 \text{ such that } |h(y_i) - t(y_i)| < \eta \text{ for } i = 1, \ldots, m \text{ and } Q\{y \in X : |h(y) - t(y)| \ge \eta + \epsilon\} \ge \gamma\} \]
\[ \ge P^m\{l \in \{1, \ldots, k\}^m : \exists p' : \{1, \ldots, k\} \to \{0,1\} \text{ such that } p(l_i) = p'(l_i) \text{ for } i = 1, \ldots, m \text{ and } P\{y \in \{1, \ldots, k\} : p(y) \ne p'(y)\} \ge \gamma\} \ge \delta. \qquad \Box \]

We also have the following result, which bounds from below the sample length function for H to approximate H from interpolated examples.

Theorem 14 Suppose H is a class of $[0,1]$-valued functions defined on a set X, $0 < \epsilon < 1$, $3\epsilon/2 \le \eta < 1$, and $\gamma, \delta \in (0,1)$. If d satisfies $\mathrm{fat}_H(\eta + \epsilon) \ge d$, then any sample length function $m_0$ for H to approximate H from interpolated examples satisfies
\[ m_0(\eta,\epsilon,\gamma,\delta) \ge \max\left(\frac{1}{32\gamma}\left(\frac{d}{3\log^2(4d/\epsilon^2)} - 1\right),\ \frac{1}{\gamma}\log\frac{1}{\delta}\right). \]

Proof: Fix $0 < \epsilon < 1$ and $3\epsilon/2 \le \eta < 1$, define $n = \lceil 1/\epsilon\rceil$, and suppose $d \le \mathrm{fat}_H(\eta + \epsilon)$. Let $F = D_\epsilon(H)$. Then
\[ \mathrm{fat}_F\!\left(\frac{\lfloor 2\eta/\epsilon\rfloor + 1}{2}\right) \ge \mathrm{fat}_H(\eta + \epsilon) \ge d, \]
so $\mathrm{fat}(\lfloor 2\eta/\epsilon\rfloor + 1)$-dim(F) $\ge d$, hence $\mathrm{gnat}(\lfloor 2\eta/\epsilon\rfloor + 1)$-dim(F) $> k$, where $k = d/(3\log^2(2dn^2))$. Consider a sequence $(x_1, \ldots, x_k) \in X^k$ that is $\mathrm{gnat}(\lfloor 2\eta/\epsilon\rfloor + 1)$-shattered by F. Clearly, there is a subset $H_0 \subseteq H$ with $|H_0| = 2^k$ and a sequence $(\psi_{a_1,b_1}, \ldots, \psi_{a_k,b_k}) \in \mathrm{gnat}(\lfloor 2\eta/\epsilon\rfloor + 1)^k$ such that
\[ \{(\psi_{a_1,b_1}(f(x_1)), \ldots, \psi_{a_k,b_k}(f(x_k))) : f \in D_\epsilon(H_0)\} = \{0,1\}^k. \]

Fix a function $t \in H_0$. Any function $h \in H$ that has $\psi_{a_i,b_i}(D_\epsilon(h)(x_i)) = \psi_{a_i,b_i}(D_\epsilon(t)(x_i))$ satisfies $|h(x_i) - t(x_i)| < \epsilon < \eta$. Any function h in H that has $\psi_{a_i,b_i}(D_\epsilon(h)(x_i)) \ne \psi_{a_i,b_i}(D_\epsilon(t)(x_i))$ satisfies
\[ |h(x_i) - t(x_i)| \ge \lfloor 2\eta/\epsilon\rfloor\,\epsilon \ge \eta + \epsilon, \]
since $(\eta - \epsilon)/\epsilon \ge 1/2$ and hence $\lfloor 2\eta/\epsilon\rfloor \ge \eta/\epsilon + 1$. Using the same argument as in the proof of Theorem 13, there is a distribution P on X such that if m is too small then with $P^m$-probability at least $\delta$, some $h \in H$ is within $\eta$ of t on a random sample, but $P(|h - t| \ge \eta + \epsilon) \ge \gamma$. □

5 Discussion

Figure 1 shows the sample complexity bounds for approximation from interpolated examples. These bounds imply Theorem 3.

Figure 1: Sample complexity bounds.
\[ m = \frac{1}{\gamma}\left(\mathrm{fat}_H(\epsilon/8)\,\log^2\frac{\mathrm{fat}_H(\epsilon/8)}{\gamma\epsilon^2} + \log\frac{1}{\delta}\right) \Longrightarrow \text{H approximates } \mathbb{R}^X \text{ from interpolated examples} \Longrightarrow m = \max\left(\frac{\mathrm{fat}_H(\epsilon)}{\gamma\,\log^2(\mathrm{fat}_H(\epsilon)/\epsilon^2)},\ \frac{1}{\gamma}\log\frac{1}{\delta}\right) \]
\[ m = \frac{1}{\gamma}\left(\mathrm{fat}_H(\epsilon/8)\,\log^2\frac{\mathrm{fat}_H(\epsilon/8)}{\gamma\epsilon^2} + \log\frac{1}{\delta}\right) \Longrightarrow \text{H approximates H from interpolated examples} \Longrightarrow m = \max\left(\frac{\mathrm{fat}_H(\eta + \epsilon)}{\gamma\,\log^2(\mathrm{fat}_H(\eta + \epsilon)/\epsilon^2)},\ \frac{1}{\gamma}\log\frac{1}{\delta}\right) \]

Notice that the upper and lower bounds on the sample length for H to approximate $\mathbb{R}^X$ from interpolated examples are within log factors of each other. If H is the class of $k$-Lipschitz-continuous functions from $[0,1]$ to $[0,1]$, the upper bound on the sample length function for H to approximate H from interpolated examples is within a $\log^2$ factor of the lower bound implied by Proposition 2. Indeed, $\mathrm{fat}_H(\gamma) = \Theta(1/\gamma)$ in this case, so Proposition 2 shows that $m_0(\eta,\epsilon,\gamma,\delta) = \Omega(\mathrm{fat}_H(\epsilon)/\gamma)$.
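As a rough illustration of the scale of these bounds for the Lipschitz class, one can evaluate the two expressions of Proposition 2 (in the form given above; the concrete parameter values below are arbitrary choices for illustration) for particular settings of $k$, $\epsilon$, $\gamma$ and $\delta$:

import math

def lipschitz_upper_bound(k, eps, gamma, delta):
    # Sufficient sample length (2k/(eps*gamma)) * ln(2k/(eps*gamma*delta)) + 1
    return (2 * k / (eps * gamma)) * math.log(2 * k / (eps * gamma * delta)) + 1

def lipschitz_lower_bound(k, eps, gamma):
    # Necessary sample length (1/(8*gamma)) * (k/(2*eps) - 2)
    return (k / (2 * eps) - 2) / (8 * gamma)

if __name__ == "__main__":
    k, eps, gamma, delta = 1.0, 0.05, 0.05, 0.01
    print(lipschitz_upper_bound(k, eps, gamma, delta))   # roughly 9 * 10^3
    print(lipschitz_lower_bound(k, eps, gamma))          # 20

The two values differ by constant and logarithmic factors only, in line with the discussion above.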

These sample complexity bounds for approximation from interpolated examples are also relevant to the problem of learning real-valued functions in the presence of malicious noise. Suppose a learner sees a sequence of training examples that correspond to the values of a target function corrupted with arbitrary bounded additive noise. That is, each example is of the form $(x_i, t(x_i) + n_i)$, where $t \in H$ and $|n_i| < \eta$. Clearly, any function $h \in H$ that is $\eta$-close to the training sample will satisfy
\[ \Pr(|h - t| \ge 2\eta + \epsilon) < \gamma, \]
provided that H approximates from interpolated examples and the training sample is sufficiently large. In addition, if there is an algorithm that can learn in the presence of malicious noise (in this sense), then it can certainly learn in the presence of uniformly distributed random noise (as defined in [5]), which implies $\mathrm{fat}_H$ is finite ([5], Theorem 3). That is, a function class H is learnable with malicious noise if and only if $\mathrm{fat}_H$ is finite.

6 When $\epsilon$ meets 0

In [3], the related problem of valid generalization from approximate interpolation was studied. We say that H validly generalizes C from approximate interpolation if, for all $\eta > 0$ and $\gamma, \delta \in (0,1)$, there is an $m_0$ such that for all probability distributions P on X and all $t \in C$, with $P^m$-probability at least $1 - \delta$, a sample $x = (x_1, x_2, \ldots, x_m) \in X^m$ has the property that if $h \in H$ and $|h(x_i) - t(x_i)| < \eta$ for $1 \le i \le m$, then
\[ P(\{x \in X : |h(x) - t(x)| \ge \eta\}) < \gamma. \]
This problem is equivalent to approximation from interpolated examples with $\epsilon$ set to 0 (see Definition 1). Results in [3] give necessary and sufficient conditions for valid generalization from approximate interpolation. In particular, H validly generalizes $\mathbb{R}^X$ from approximate interpolation if and only if the pseudo-dimension $\mathrm{Pdim}(H)$ is finite. It was also shown in [3] that H validly generalizes H from approximate interpolation if and only if $\mathrm{Ddim}_H(\eta)$ is finite for all $\eta$, where $\mathrm{Ddim}_H(\eta)$ is defined as follows.

Definition 15 For $h, t \in H$, let $h_{[\eta,t]} : X \to \{0,1\}$ be defined by $h_{[\eta,t]}(x) = 1$ iff $|h(x) - t(x)| \ge \eta$. Let $H_{[\eta,t]} = \{h_{[\eta,t]} : h \in H\}$. Then we define
\[ \mathrm{Ddim}_H(\eta) = \max_{t \in H}\mathrm{Pdim}\bigl(H_{[\eta,t]}\bigr). \]
(Since functions in $H_{[\eta,t]}$ are $\{0,1\}$-valued, the pseudo-dimension of $H_{[\eta,t]}$ is equal to its Vapnik-Chervonenkis dimension, defined in [17].)

Clearly, valid generalization from approximate interpolation implies approximation from interpolated examples, so finiteness of $\mathrm{Pdim}(H)$ or $\mathrm{Ddim}_H$ implies

finiteness of $\mathrm{fat}_H$. Since the fat-shattering function is a non-increasing function of $\gamma$, finiteness of the pseudo-dimension of H implies a uniform upper bound on $\mathrm{fat}_H(\gamma)$. The following proposition gives a lower bound on $\mathrm{Ddim}_H(\eta)$ in terms of $\mathrm{fat}_H$.

Proposition 16
\[ \mathrm{Ddim}_H(\eta) \ge \frac{\mathrm{fat}_H(\eta)}{3\log^2(4\,\mathrm{fat}_H(\eta)/\eta^2)}. \]

Proof: Let $F = D_\eta(H)$. Then $\mathrm{fat}_F(1) \ge \mathrm{fat}_H(\eta)$, so Lemma 11 implies that $\mathrm{gnat}(2)$-dim(F) $\ge k$, where
\[ k = \frac{\mathrm{fat}_H(\eta)}{3\log^2(4\,\mathrm{fat}_H(\eta)/\eta^2)}. \]
Suppose $x \in X^k$ is $\mathrm{gnat}(2)$-shattered by F. Then there is a subset $H_0 \subseteq H$ with $|H_0| = 2^k$ and a sequence $(\psi_{a_1,b_1}, \ldots, \psi_{a_k,b_k}) \in \mathrm{gnat}(2)^k$ such that
\[ \{(\psi_{a_1,b_1}(f(x_1)), \ldots, \psi_{a_k,b_k}(f(x_k))) : f \in D_\eta(H_0)\} = \{0,1\}^k. \]
Choose $t \in H_0$. Then for $h \in H_0$ the definition of $h_{[\eta,t]}$ implies that $h_{[\eta,t]}(x_i) = 0$ if and only if
\[ \psi_{a_i,b_i}(D_\eta(h(x_i))) = \psi_{a_i,b_i}(D_\eta(t(x_i))). \]
That is,
\[ \{(h_{[\eta,t]}(x_1), \ldots, h_{[\eta,t]}(x_k)) : h \in H_0\} = \{0,1\}^k, \]
so $\mathrm{Ddim}_H(\eta) \ge k$. □

7 Conclusions

We have shown that approximation from interpolated examples is possible if and only if the fat-shattering function of the hypothesis class is finite. This problem can be interpreted as a problem of learning real-valued functions, where we require satisfactory performance from every algorithm that returns a function that approximately interpolates the training examples. It follows that this problem is equivalent to those of learning real-valued functions with random (see [5]) and malicious (see Section 5) observation noise, in the sense that learning is possible in each case if and only if the fat-shattering function of the hypothesis class is finite.

Acknowledgements

This research was supported in part by the Australian Telecommunications and Electronics Research Board. The work of Martin Anthony is supported in part by the European Union through the "NeuroCOLT" ESPRIT Working Group.

References

[1] Alon, N., Ben-David, S., Cesa-Bianchi, N. and Haussler, D. (1993). Scale-sensitive dimensions, uniform convergence, and learnability. In Proceedings of the 1993 IEEE Symposium on Foundations of Computer Science, IEEE Press.

[2] Angluin, D. and Valiant, L. (1979). Fast probabilistic algorithms for Hamiltonian circuits and matchings. Journal of Computer and System Sciences, 18.

[3] Anthony, M., Bartlett, P. L., Ishai, Y. and Shawe-Taylor, J. (1994). Valid generalisation from approximate interpolation. To appear, Combinatorics, Probability and Computing.

[4] Anthony, M. and Biggs, N. (1992). Computational Learning Theory: An Introduction, Cambridge University Press.

[5] Bartlett, P. L., Long, P. M. and Williamson, R. C. (1994). Fat-shattering and the learnability of real-valued functions. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, ACM Press, New York.

[6] Ben-David, S., Cesa-Bianchi, N., Haussler, D. and Long, P. (1992). Characterizations of learnability for classes of {0, ..., n}-valued functions. Technical report. (An earlier version appeared in Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, ACM Press, New York.) To appear, Journal of Computer and System Sciences.

[7] Blumer, A., Ehrenfeucht, A., Haussler, D. and Warmuth, M. K. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4): 929-965.

[8] Dudley, R. M., Giné, E. and Zinn, J. (1991). Uniform and universal Glivenko-Cantelli classes. Journal of Theoretical Probability, 4: 485-510.

[9] Ehrenfeucht, A., Haussler, D., Kearns, M. and Valiant, L. (1989). A general lower bound on the number of examples needed for learning. Information and Computation, 82.

[10] Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100: 78-150.

[11] Kearns, M. J. and Schapire, R. E. (1990). Efficient distribution-free learning of probabilistic concepts. In Proceedings of the 1990 IEEE Symposium on Foundations of Computer Science, IEEE Press.

[12] Kolmogorov, A. N. and Tihomirov, V. M. (1961). ε-entropy and ε-capacity of sets in functional spaces. AMS Translations, Series 2, 17.

[13] McDiarmid, C. (1989). On the method of bounded differences. In J. Siemons, editor, Surveys in Combinatorics, 1989, London Mathematical Society Lecture Note Series (141). Cambridge University Press, Cambridge, UK.

[14] Pollard, D. (1984). Convergence of Stochastic Processes, Springer-Verlag.

[15] Simon, H. U. (1994). Bounds on the number of examples needed for learning functions. In Computational Learning Theory: EuroCOLT'93 (ed. J. Shawe-Taylor and M. Anthony), Oxford University Press.

[16] Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11): 1134-1142.

[17] Vapnik, V. N. and Chervonenkis, A. Ya. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2).


More information

Analysis on Graphs. Alexander Grigoryan Lecture Notes. University of Bielefeld, WS 2011/12

Analysis on Graphs. Alexander Grigoryan Lecture Notes. University of Bielefeld, WS 2011/12 Analysis on Graphs Alexander Grigoryan Lecture Notes University of Bielefeld, WS 0/ Contents The Laplace operator on graphs 5. The notion of a graph............................. 5. Cayley graphs..................................

More information

Simple Learning Algorithms for. Decision Trees and Multivariate Polynomials. xed constant bounded product distribution. is depth 3 formulas.

Simple Learning Algorithms for. Decision Trees and Multivariate Polynomials. xed constant bounded product distribution. is depth 3 formulas. Simple Learning Algorithms for Decision Trees and Multivariate Polynomials Nader H. Bshouty Department of Computer Science University of Calgary Calgary, Alberta, Canada Yishay Mansour Department of Computer

More information

March 16, Abstract. We study the problem of portfolio optimization under the \drawdown constraint" that the

March 16, Abstract. We study the problem of portfolio optimization under the \drawdown constraint that the ON PORTFOLIO OPTIMIZATION UNDER \DRAWDOWN" CONSTRAINTS JAKSA CVITANIC IOANNIS KARATZAS y March 6, 994 Abstract We study the problem of portfolio optimization under the \drawdown constraint" that the wealth

More information

The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors

The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors Martin Anthony Department of Mathematics London School of Economics and Political Science Houghton Street, London WC2A2AE

More information

On the Complexity ofteaching. AT&T Bell Laboratories WUCS August 11, Abstract

On the Complexity ofteaching. AT&T Bell Laboratories WUCS August 11, Abstract On the Complexity ofteaching Sally A. Goldman Department of Computer Science Washington University St. Louis, MO 63130 sg@cs.wustl.edu Michael J. Kearns AT&T Bell Laboratories Murray Hill, NJ 07974 mkearns@research.att.com

More information

Elementary 2-Group Character Codes. Abstract. In this correspondence we describe a class of codes over GF (q),

Elementary 2-Group Character Codes. Abstract. In this correspondence we describe a class of codes over GF (q), Elementary 2-Group Character Codes Cunsheng Ding 1, David Kohel 2, and San Ling Abstract In this correspondence we describe a class of codes over GF (q), where q is a power of an odd prime. These codes

More information

290 J.M. Carnicer, J.M. Pe~na basis (u 1 ; : : : ; u n ) consisting of minimally supported elements, yet also has a basis (v 1 ; : : : ; v n ) which f

290 J.M. Carnicer, J.M. Pe~na basis (u 1 ; : : : ; u n ) consisting of minimally supported elements, yet also has a basis (v 1 ; : : : ; v n ) which f Numer. Math. 67: 289{301 (1994) Numerische Mathematik c Springer-Verlag 1994 Electronic Edition Least supported bases and local linear independence J.M. Carnicer, J.M. Pe~na? Departamento de Matematica

More information

Learning by Canonical Smooth Estimation, Part II: Abstract. In this paper, we analyze the properties of a procedure for learning from examples.

Learning by Canonical Smooth Estimation, Part II: Abstract. In this paper, we analyze the properties of a procedure for learning from examples. Learning by Canonical Smooth Estimation, Part II: Learning and Choice of Model Complexity y Kevin L. Buescher z and P. R. Kumar x Abstract In this paper, we analyze the properties of a procedure for learning

More information

10.1 The Formal Model

10.1 The Formal Model 67577 Intro. to Machine Learning Fall semester, 2008/9 Lecture 10: The Formal (PAC) Learning Model Lecturer: Amnon Shashua Scribe: Amnon Shashua 1 We have see so far algorithms that explicitly estimate

More information

Asymptotics of generating the symmetric and alternating groups

Asymptotics of generating the symmetric and alternating groups Asymptotics of generating the symmetric and alternating groups John D. Dixon School of Mathematics and Statistics Carleton University, Ottawa, Ontario K2G 0E2 Canada jdixon@math.carleton.ca October 20,

More information

1 Introduction It will be convenient to use the inx operators a b and a b to stand for maximum (least upper bound) and minimum (greatest lower bound)

1 Introduction It will be convenient to use the inx operators a b and a b to stand for maximum (least upper bound) and minimum (greatest lower bound) Cycle times and xed points of min-max functions Jeremy Gunawardena, Department of Computer Science, Stanford University, Stanford, CA 94305, USA. jeremy@cs.stanford.edu October 11, 1993 to appear in the

More information

COMP9444: Neural Networks. Vapnik Chervonenkis Dimension, PAC Learning and Structural Risk Minimization

COMP9444: Neural Networks. Vapnik Chervonenkis Dimension, PAC Learning and Structural Risk Minimization : Neural Networks Vapnik Chervonenkis Dimension, PAC Learning and Structural Risk Minimization 11s2 VC-dimension and PAC-learning 1 How good a classifier does a learner produce? Training error is the precentage

More information

Back circulant Latin squares and the inuence of a set. L F Fitina, Jennifer Seberry and Ghulam R Chaudhry. Centre for Computer Security Research

Back circulant Latin squares and the inuence of a set. L F Fitina, Jennifer Seberry and Ghulam R Chaudhry. Centre for Computer Security Research Back circulant Latin squares and the inuence of a set L F Fitina, Jennifer Seberry and Ghulam R Chaudhry Centre for Computer Security Research School of Information Technology and Computer Science University

More information

Coins with arbitrary weights. Abstract. Given a set of m coins out of a collection of coins of k unknown distinct weights, we wish to

Coins with arbitrary weights. Abstract. Given a set of m coins out of a collection of coins of k unknown distinct weights, we wish to Coins with arbitrary weights Noga Alon Dmitry N. Kozlov y Abstract Given a set of m coins out of a collection of coins of k unknown distinct weights, we wish to decide if all the m given coins have the

More information

Multiclass Learnability and the ERM principle

Multiclass Learnability and the ERM principle Multiclass Learnability and the ERM principle Amit Daniely Sivan Sabato Shai Ben-David Shai Shalev-Shwartz November 5, 04 arxiv:308.893v [cs.lg] 4 Nov 04 Abstract We study the sample complexity of multiclass

More information

Computer Science Dept.

Computer Science Dept. A NOTE ON COMPUTATIONAL INDISTINGUISHABILITY 1 Oded Goldreich Computer Science Dept. Technion, Haifa, Israel ABSTRACT We show that following two conditions are equivalent: 1) The existence of pseudorandom

More information

1 GMD FIRST, Rudower Chaussee 5, Berlin, Germany 2 Department of Computer Science, Royal Holloway College, U

1 GMD FIRST, Rudower Chaussee 5, Berlin, Germany 2 Department of Computer Science, Royal Holloway College, U Generalization Bounds via Eigenvalues of the Gram Matrix Bernhard Scholkopf, GMD 1 John Shawe-Taylor, University of London 2 Alexander J. Smola, GMD 3 Robert C. Williamson, ANU 4 NeuroCOLT2 Technical Report

More information

The Perceptron algorithm vs. Winnow: J. Kivinen 2. University of Helsinki 3. M. K. Warmuth 4. University of California, Santa Cruz 5.

The Perceptron algorithm vs. Winnow: J. Kivinen 2. University of Helsinki 3. M. K. Warmuth 4. University of California, Santa Cruz 5. The Perceptron algorithm vs. Winnow: linear vs. logarithmic mistake bounds when few input variables are relevant 1 J. Kivinen 2 University of Helsinki 3 M. K. Warmuth 4 University of California, Santa

More information

Nets Hawk Katz Theorem. There existsaconstant C>so that for any number >, whenever E [ ] [ ] is a set which does not contain the vertices of any axis

Nets Hawk Katz Theorem. There existsaconstant C>so that for any number >, whenever E [ ] [ ] is a set which does not contain the vertices of any axis New York Journal of Mathematics New York J. Math. 5 999) {3. On the Self Crossing Six Sided Figure Problem Nets Hawk Katz Abstract. It was shown by Carbery, Christ, and Wright that any measurable set E

More information

Nader H. Bshouty Lisa Higham Jolanta Warpechowska-Gruca. Canada. (

Nader H. Bshouty Lisa Higham Jolanta Warpechowska-Gruca. Canada. ( Meeting Times of Random Walks on Graphs Nader H. Bshouty Lisa Higham Jolanta Warpechowska-Gruca Computer Science University of Calgary Calgary, AB, T2N 1N4 Canada (e-mail: fbshouty,higham,jolantag@cpsc.ucalgary.ca)

More information

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9 MAT 570 REAL ANALYSIS LECTURE NOTES PROFESSOR: JOHN QUIGG SEMESTER: FALL 204 Contents. Sets 2 2. Functions 5 3. Countability 7 4. Axiom of choice 8 5. Equivalence relations 9 6. Real numbers 9 7. Extended

More information

Garrett: `Bernstein's analytic continuation of complex powers' 2 Let f be a polynomial in x 1 ; : : : ; x n with real coecients. For complex s, let f

Garrett: `Bernstein's analytic continuation of complex powers' 2 Let f be a polynomial in x 1 ; : : : ; x n with real coecients. For complex s, let f 1 Bernstein's analytic continuation of complex powers c1995, Paul Garrett, garrettmath.umn.edu version January 27, 1998 Analytic continuation of distributions Statement of the theorems on analytic continuation

More information

2 Section 2 However, in order to apply the above idea, we will need to allow non standard intervals ('; ) in the proof. More precisely, ' and may gene

2 Section 2 However, in order to apply the above idea, we will need to allow non standard intervals ('; ) in the proof. More precisely, ' and may gene Introduction 1 A dierential intermediate value theorem by Joris van der Hoeven D pt. de Math matiques (B t. 425) Universit Paris-Sud 91405 Orsay Cedex France June 2000 Abstract Let T be the eld of grid-based

More information

1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer 11(2) (1989),

1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer 11(2) (1989), Real Analysis 2, Math 651, Spring 2005 April 26, 2005 1 Real Analysis 2, Math 651, Spring 2005 Krzysztof Chris Ciesielski 1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer

More information

MACHINE LEARNING. Vapnik-Chervonenkis (VC) Dimension. Alessandro Moschitti

MACHINE LEARNING. Vapnik-Chervonenkis (VC) Dimension. Alessandro Moschitti MACHINE LEARNING Vapnik-Chervonenkis (VC) Dimension Alessandro Moschitti Department of Information Engineering and Computer Science University of Trento Email: moschitti@disi.unitn.it Computational Learning

More information

growth rates of perturbed time-varying linear systems, [14]. For this setup it is also necessary to study discrete-time systems with a transition map

growth rates of perturbed time-varying linear systems, [14]. For this setup it is also necessary to study discrete-time systems with a transition map Remarks on universal nonsingular controls for discrete-time systems Eduardo D. Sontag a and Fabian R. Wirth b;1 a Department of Mathematics, Rutgers University, New Brunswick, NJ 08903, b sontag@hilbert.rutgers.edu

More information

The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors

The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors R u t c o r Research R e p o r t The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors Martin Anthony a Joel Ratsaby b RRR 17-2011, October 2011 RUTCOR Rutgers Center for Operations

More information

Markov processes Course note 2. Martingale problems, recurrence properties of discrete time chains.

Markov processes Course note 2. Martingale problems, recurrence properties of discrete time chains. Institute for Applied Mathematics WS17/18 Massimiliano Gubinelli Markov processes Course note 2. Martingale problems, recurrence properties of discrete time chains. [version 1, 2017.11.1] We introduce

More information

Numerical Analysis: Interpolation Part 1

Numerical Analysis: Interpolation Part 1 Numerical Analysis: Interpolation Part 1 Computer Science, Ben-Gurion University (slides based mostly on Prof. Ben-Shahar s notes) 2018/2019, Fall Semester BGU CS Interpolation (ver. 1.00) AY 2018/2019,

More information

The Decision List Machine

The Decision List Machine The Decision List Machine Marina Sokolova SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 sokolova@site.uottawa.ca Nathalie Japkowicz SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 nat@site.uottawa.ca

More information

Neural Network Learning: Testing Bounds on Sample Complexity

Neural Network Learning: Testing Bounds on Sample Complexity Neural Network Learning: Testing Bounds on Sample Complexity Joaquim Marques de Sá, Fernando Sereno 2, Luís Alexandre 3 INEB Instituto de Engenharia Biomédica Faculdade de Engenharia da Universidade do

More information

Measurable Choice Functions

Measurable Choice Functions (January 19, 2013) Measurable Choice Functions Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/fun/choice functions.pdf] This note

More information

ON A CLASS OF APERIODIC SUM-FREE SETS NEIL J. CALKIN PAUL ERD OS. that is that there are innitely many n not in S which are not a sum of two elements

ON A CLASS OF APERIODIC SUM-FREE SETS NEIL J. CALKIN PAUL ERD OS. that is that there are innitely many n not in S which are not a sum of two elements ON A CLASS OF APERIODIC SUM-FREE SETS NEIL J. CALKIN PAUL ERD OS Abstract. We show that certain natural aperiodic sum-free sets are incomplete, that is that there are innitely many n not in S which are

More information

Introduction and Preliminaries

Introduction and Preliminaries Chapter 1 Introduction and Preliminaries This chapter serves two purposes. The first purpose is to prepare the readers for the more systematic development in later chapters of methods of real analysis

More information

Agnostic Online learnability

Agnostic Online learnability Technical Report TTIC-TR-2008-2 October 2008 Agnostic Online learnability Shai Shalev-Shwartz Toyota Technological Institute Chicago shai@tti-c.org ABSTRACT We study a fundamental question. What classes

More information

Lecture 5: Efficient PAC Learning. 1 Consistent Learning: a Bound on Sample Complexity

Lecture 5: Efficient PAC Learning. 1 Consistent Learning: a Bound on Sample Complexity Universität zu Lübeck Institut für Theoretische Informatik Lecture notes on Knowledge-Based and Learning Systems by Maciej Liśkiewicz Lecture 5: Efficient PAC Learning 1 Consistent Learning: a Bound on

More information

Learning Recursive Functions From Approximations

Learning Recursive Functions From Approximations Sacred Heart University DigitalCommons@SHU Computer Science & Information Technology Faculty Publications Computer Science & Information Technology 8-1997 Learning Recursive Functions From Approximations

More information

A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997

A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997 A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997 Vasant Honavar Artificial Intelligence Research Laboratory Department of Computer Science

More information

Constructing c-ary Perfect Factors

Constructing c-ary Perfect Factors Constructing c-ary Perfect Factors Chris J. Mitchell Computer Science Department Royal Holloway University of London Egham Hill Egham Surrey TW20 0EX England. Tel.: +44 784 443423 Fax: +44 784 443420 Email:

More information

A version of for which ZFC can not predict a single bit Robert M. Solovay May 16, Introduction In [2], Chaitin introd

A version of for which ZFC can not predict a single bit Robert M. Solovay May 16, Introduction In [2], Chaitin introd CDMTCS Research Report Series A Version of for which ZFC can not Predict a Single Bit Robert M. Solovay University of California at Berkeley CDMTCS-104 May 1999 Centre for Discrete Mathematics and Theoretical

More information

The Vapnik-Chervonenkis Dimension: Information versus Complexity in Learning

The Vapnik-Chervonenkis Dimension: Information versus Complexity in Learning REVlEW The Vapnik-Chervonenkis Dimension: Information versus Complexity in Learning Yaser S. Abu-Mostafa California Institute of Terhnohgy, Pasadcna, CA 91 125 USA When feasible, learning is a very attractive

More information

Empirical Processes: General Weak Convergence Theory

Empirical Processes: General Weak Convergence Theory Empirical Processes: General Weak Convergence Theory Moulinath Banerjee May 18, 2010 1 Extended Weak Convergence The lack of measurability of the empirical process with respect to the sigma-field generated

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

MACHINE LEARNING - CS671 - Part 2a The Vapnik-Chervonenkis Dimension

MACHINE LEARNING - CS671 - Part 2a The Vapnik-Chervonenkis Dimension MACHINE LEARNING - CS671 - Part 2a The Vapnik-Chervonenkis Dimension Prof. Dan A. Simovici UMB Prof. Dan A. Simovici (UMB) MACHINE LEARNING - CS671 - Part 2a The Vapnik-Chervonenkis Dimension 1 / 30 The

More information

Stochastic Processes

Stochastic Processes Introduction and Techniques Lecture 4 in Financial Mathematics UiO-STK4510 Autumn 2015 Teacher: S. Ortiz-Latorre Stochastic Processes 1 Stochastic Processes De nition 1 Let (E; E) be a measurable space

More information

Multiclass Learnability and the ERM Principle

Multiclass Learnability and the ERM Principle Journal of Machine Learning Research 6 205 2377-2404 Submitted 8/3; Revised /5; Published 2/5 Multiclass Learnability and the ERM Principle Amit Daniely amitdaniely@mailhujiacil Dept of Mathematics, The

More information

Hence C has VC-dimension d iff. π C (d) = 2 d. (4) C has VC-density l if there exists K R such that, for all

Hence C has VC-dimension d iff. π C (d) = 2 d. (4) C has VC-density l if there exists K R such that, for all 1. Computational Learning Abstract. I discuss the basics of computational learning theory, including concept classes, VC-dimension, and VC-density. Next, I touch on the VC-Theorem, ɛ-nets, and the (p,

More information

Entrance Exam, Real Analysis September 1, 2017 Solve exactly 6 out of the 8 problems

Entrance Exam, Real Analysis September 1, 2017 Solve exactly 6 out of the 8 problems September, 27 Solve exactly 6 out of the 8 problems. Prove by denition (in ɛ δ language) that f(x) = + x 2 is uniformly continuous in (, ). Is f(x) uniformly continuous in (, )? Prove your conclusion.

More information

Online Learning: Random Averages, Combinatorial Parameters, and Learnability

Online Learning: Random Averages, Combinatorial Parameters, and Learnability Online Learning: Random Averages, Combinatorial Parameters, and Learnability Alexander Rakhlin Department of Statistics University of Pennsylvania Karthik Sridharan Toyota Technological Institute at Chicago

More information

Theorem 1 can be improved whenever b is not a prime power and a is a prime divisor of the base b. Theorem 2. Let b be a positive integer which is not

Theorem 1 can be improved whenever b is not a prime power and a is a prime divisor of the base b. Theorem 2. Let b be a positive integer which is not A RESULT ON THE DIGITS OF a n R. Blecksmith*, M. Filaseta**, and C. Nicol Dedicated to the memory of David R. Richman. 1. Introduction Let d r d r,1 d 1 d 0 be the base b representation of a positive integer

More information

Some Combinatorial Aspects of Time-Stamp Systems. Unite associee C.N.R.S , cours de la Liberation F TALENCE.

Some Combinatorial Aspects of Time-Stamp Systems. Unite associee C.N.R.S , cours de la Liberation F TALENCE. Some Combinatorial spects of Time-Stamp Systems Robert Cori Eric Sopena Laboratoire Bordelais de Recherche en Informatique Unite associee C.N.R.S. 1304 351, cours de la Liberation F-33405 TLENCE bstract

More information

Bounds on the generalised acyclic chromatic numbers of bounded degree graphs

Bounds on the generalised acyclic chromatic numbers of bounded degree graphs Bounds on the generalised acyclic chromatic numbers of bounded degree graphs Catherine Greenhill 1, Oleg Pikhurko 2 1 School of Mathematics, The University of New South Wales, Sydney NSW Australia 2052,

More information

Lexington, MA September 18, Abstract

Lexington, MA September 18, Abstract October 1990 LIDS-P-1996 ACTIVE LEARNING USING ARBITRARY BINARY VALUED QUERIES* S.R. Kulkarni 2 S.K. Mitterl J.N. Tsitsiklisl 'Laboratory for Information and Decision Systems, M.I.T. Cambridge, MA 02139

More information