
Function Learning from Interpolation

Martin Anthony, Department of Mathematics, The London School of Economics and Political Science, Houghton Street, London WC2A 2AE, United Kingdom. anthony@vax.lse.ac.uk

Peter Bartlett, Department of Systems Engineering, Research School of Information Sciences and Engineering, The Australian National University, Canberra 0200, Australia. Peter.Bartlett@anu.edu.au

NeuroCOLT Technical Report Series, NC-TR, October. Produced as part of the ESPRIT Working Group in Neural and Computational Learning, NeuroCOLT 8556. NeuroCOLT Coordinating Partner: Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, England. For more information contact John Shawe-Taylor at the above address or neurocolt@dcs.rhbnc.ac.uk

1. The research reported here was conducted while Martin Anthony was visiting the Department of Systems Engineering, The Research School of Information Sciences and Engineering, The Australian National University.
2. Received October 11.

Abstract

In this paper, we study a statistical property of classes of real-valued functions that we call approximation from interpolated examples. We derive a characterization of function classes that have this property, in terms of their `fat-shattering function', a notion that has proven useful in computational learning theory. We discuss the implications for function learning of approximation from interpolated examples. Specifically, it can be interpreted as a problem of learning real-valued functions from random examples in which we require satisfactory performance from every algorithm that returns a function which approximately interpolates the training examples.

1 Introduction

In the problem of learning a real-valued function from examples, a learner sees a sequence of values of an unknown function at a number of randomly chosen points. On the basis of these examples, the learner chooses a function (called a hypothesis) from some class H of hypotheses, with the aim that the learner's hypothesis is close to the target function on future random examples. In this paper we require that, for most training samples, with high probability the absolute difference between the values of the learner's hypothesis and the target function on a random point is small.

A natural learning algorithm to consider is one that chooses a function in H that is close to the target function on the training examples. This poses the following statistical problem: for what function classes H will any function in H that approximately interpolates the target function on the training examples probably have small absolute error? More precisely, we have the following definition of approximation from interpolated examples.

Definition 1 Let $C, H$ be sets of functions that map from a set $X$ to $\mathbb{R}$. We say that H approximates C from interpolated examples if for all $\eta, \epsilon, \gamma, \delta \in (0,1)$, there is $m_0(\eta,\epsilon,\gamma,\delta)$ such that for every $t \in C$ and for every probability measure (footnote 1) $P$ on $X$, if $m \ge m_0(\eta,\epsilon,\gamma,\delta)$, then with $P^m$-probability at least $1-\delta$, $x = (x_1, x_2, \ldots, x_m) \in X^m$ has the property that if $h \in H$ and $|h(x_i) - t(x_i)| < \eta$ for $1 \le i \le m$, then
\[ P(\{x \in X : |h(x) - t(x)| \ge \eta + \epsilon\}) < \gamma. \]
We say that $m_0(\eta,\epsilon,\gamma,\delta)$ is a sufficient sample length function for H to approximate C from interpolated examples.

Footnote 1: More formally, one has a fixed $\sigma$-algebra $\Sigma$ on $X$: when $X$ is countable this is $2^X$, and when $X \subseteq \mathbb{R}^n$, it is the Borel $\sigma$-algebra. Then, by "any probability measure on $X$," we mean "for any probability measure on $\Sigma$," where $\Sigma$, the fixed $\sigma$-algebra, is understood. The class H must have some fairly benign measurability properties; we refer to [14, 10] for details.
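Operationally, Definition 1 quantifies over every approximate interpolant at once: draw a sample from $P$, take any $h \in H$ within $\eta$ of the target on every sample point, and ask whether the set where $|h - t| \ge \eta + \epsilon$ has probability less than $\gamma$. As a purely illustrative sketch (the finite hypothesis pool, the toy target and the Monte Carlo test set are assumptions for the illustration, not part of the definition), this can be checked empirically for small examples:

import random

def worst_interpolant_error(hypotheses, target, sample, eta, eps, test_points):
    # Over all hypotheses that eta-interpolate the target on the sample,
    # return the largest empirical estimate of P{ x : |h(x) - t(x)| >= eta + eps }.
    worst = 0.0
    for h in hypotheses:
        if all(abs(h(x) - target(x)) < eta for x in sample):
            bad = sum(abs(h(x) - target(x)) >= eta + eps for x in test_points)
            worst = max(worst, bad / len(test_points))
    return worst

if __name__ == "__main__":
    target = lambda x: 0.3
    hypotheses = [lambda x, c=c / 100.0: c for c in range(101)]   # constant functions on [0, 1]
    sample = [random.random() for _ in range(50)]
    tests = [random.random() for _ in range(10000)]
    print(worst_interpolant_error(hypotheses, target, sample, eta=0.1, eps=0.05, test_points=tests))

Definition 1 asks that, for large enough samples, this worst-case error probability falls below $\gamma$ for most samples, simultaneously for every choice of approximate interpolant.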

Two cases of particular interest are those in which $C = H$ and $C = \mathbb{R}^X$, the set of all functions from $X$ to $\mathbb{R}$. If H approximates H from interpolated examples, we simply say that H approximates from interpolated examples. The main aim of this paper is to find characterizations of classes which approximate from interpolated examples and which approximate $\mathbb{R}^X$ from interpolated examples.

While this problem may be of independent interest, it can be interpreted as a learning problem in which we require satisfactory performance from every algorithm that returns a function that approximately interpolates the training examples. If, instead of requiring that all algorithms in this class be suitable, we require only the existence of a suitable algorithm, no necessary and sufficient conditions on the function class H are known. Because an arbitrary amount of information can be conveyed in a single real value, it is possible to construct complicated function classes in which the identity of a function is encoded in its value at every point, and an algorithm can take advantage of this (see [5]). We can avoid this unnatural "conspiracy" between algorithms and function classes in two ways: by requiring that the algorithm be robust in the presence of random observation noise, as was considered in [5], or by requiring satisfactory performance of every algorithm in a class of reasonable algorithms, as we consider here. Results in this paper show that these two approaches are equivalent.

Another reason for studying the problem of this paper is that it has implications for learning in the presence of malicious noise, in which the labels on the training sample can be any real numbers within $\eta$ of the true value of the target. This will be discussed later in the paper, but for the moment simply observe that if h is $\eta$-close to a training sample where the labels have been corrupted to a level of at most $\eta$, then h is certainly $2\eta$-close to the target on the sample. If H approximates from interpolated examples, we can then deduce that if the sample is large enough then (with high probability) h is within $2\eta + \epsilon$ of the target on `most' of $X$.

In the remainder of this section, we give an example of a function class that approximates from interpolated examples. In the next section, we define a measure of the complexity of a class H of functions (the fat-shattering function), and we state the main result: that the fat-shattering function is the key quantity in this problem. In Sections 3 and 4 we give upper and lower bounds on the number of examples necessary for approximation from interpolated examples. Section 5 discusses the gap between these bounds, and describes the implications for learning with malicious noise. Section 6 describes the relationship between the problem of approximation from interpolated examples and a related problem (the case $\epsilon = 0$ in Definition 1) studied in [3].

By way of motivation, we give upper and lower bounds on the number of examples necessary for approximation from interpolated examples with a particular function class, a class of Lipschitz-continuous functions.

Proposition 2 Suppose that $k > 0$. Let H be the class of all $k$-Lipschitz-continuous functions from $[0,1]$ to $[0,1]$. Thus, for each $h \in H$ and for all

$x, y \in [0,1]$, $|h(x) - h(y)| \le k|x - y|$. Then H approximates from interpolated examples. A sufficient sample length function is
\[ m_0(\eta,\epsilon,\gamma,\delta) = \frac{2k}{\epsilon\gamma}\ln\frac{2k}{\epsilon\gamma\delta} + 1. \]
Furthermore, for $\epsilon \le k/4$, $\gamma \le 1/2$, and $\delta \le 1/13$, any sufficient sample length function must satisfy
\[ m_0(\eta,\epsilon,\gamma,\delta) \ge \frac{1}{8\gamma}\left(\frac{k}{2\epsilon} - 2\right) \]
for all $\eta \in (0,1)$.

Proof: Suppose that $x = (x_1, x_2, \ldots, x_m) \in [0,1]^m$ and that, without loss of generality, $0 \le x_1 \le x_2 \le x_3 \le \cdots \le x_m \le 1$. Denote by $I_0$ the closed interval $[0, x_1]$ and by $I_m$ the interval $[x_m, 1]$, and for $j$ between 1 and $m-1$ let $I_j = [x_j, x_{j+1}]$. Let $\eta, \epsilon, \gamma, \delta \in (0,1)$ be given. Suppose that $P$ is any probability distribution on $[0,1]$. Let
\[ B_\beta = \left\{ x \in [0,1]^m : \max_{0 \le i \le m} P(I_i) > \beta \right\}, \]
where $\beta \in (0,1)$. Then
\[ P^m(B_\beta) = P^m\left(\{ x \in [0,1]^m : \exists\, 0 \le i \le m,\ \forall j \ne i,\ x_j \notin S(x_i, \beta) \}\right), \]
where $S(x_i, \beta)$ is the shortest interval $[x_i, z_i]$ satisfying $P([x_i, z_i]) \ge \beta$ (or $[x_i, 1]$ if there is no such interval), and $x_0 = 0$. So
\[ P^m(B_\beta) \le (m+1)\max_{0 \le i \le m} P^m\left(\{ x \in [0,1]^m : \forall j \ne i,\ x_j \notin S(x_i, \beta) \}\right) \le (m+1)\sup_{\alpha \in [0,1]} P^{m-1}\left(\{ x \in [0,1]^{m-1} : \forall j,\ x_j \notin S(\alpha, \beta) \}\right) \le (m+1)(1-\beta)^{m-1}. \]
Choosing $\beta = \frac{1}{m-1}\ln\frac{m+1}{\delta}$, and using the fact that $(1 - \lambda/n)^n < e^{-\lambda}$, implies that this probability is no more than $\delta$.

Now, there are at most $k/\epsilon$ intervals $I_i$ of length at least $\epsilon/k$. Suppose that $t, h \in H$. If the interval $I_i$ has length less than $\epsilon/k$ and $|h(x_i) - t(x_i)| < \eta$, $|h(x_{i+1}) - t(x_{i+1})| < \eta$, then, for all $x$ in the interval $I_i$, $|h(x) - t(x)| < \eta + \epsilon$. Indeed, suppose $x \in I_i$ and $|x - x_i| \le \epsilon/2k$. (A similar argument, using the endpoint $x_{i+1}$, applies if $|x - x_i| > \epsilon/2k$.) Then, since $t$ and $h$ are $k$-Lipschitz-continuous, we have
\[ |(h(x) - t(x)) - (h(x_i) - t(x_i))| \le |h(x) - h(x_i)| + |t(x) - t(x_i)| < \frac{\epsilon}{2k}\,k + \frac{\epsilon}{2k}\,k = \epsilon, \]

so that $|h(x) - t(x)| < \eta + \epsilon$. It follows that, with probability at least $1 - \delta$,
\[ P(\{x : |h(x) - t(x)| \ge \eta + \epsilon\}) \le \sum_{i:\ \mathrm{length}(I_i) \ge \epsilon/k} P(I_i) \le \frac{k}{\epsilon}\max_{0 \le j \le m} P(I_j) \le \frac{k}{\epsilon}\cdot\frac{1}{m-1}\ln\frac{m+1}{\delta}, \]
which is less than $\gamma$ provided
\[ m > \frac{k}{\epsilon\gamma}\ln\frac{2m}{\delta} + 1. \]
Using the fact that $\ln(\alpha x) \le \alpha x - 1$ for $\alpha, x > 0$, with $\alpha = \epsilon\gamma\delta/(4k)$, shows that
\[ m > \frac{k}{\epsilon\gamma}\left(\ln\frac{4k}{\epsilon\gamma\delta} - 1\right) + \frac{m}{2} + 1 \]
will suffice, and this inequality is satisfied if
\[ m \ge \frac{2k}{\epsilon\gamma}\ln\frac{2k}{\epsilon\gamma\delta} + 1. \]
The fact that H approximates from interpolated examples, and the upper bound on sufficient sample length, follow.

We now obtain the lower bound, by an argument similar to one given in [9]. Suppose that $\epsilon \le k/4$ and let $r = \lfloor k/(2\epsilon)\rfloor - 1 \ge 1$. Let $X_0 = \{y_0, y_1, \ldots, y_r\}$ where $y_0 = 0$ and, for $1 \le i \le r$, $y_i = 2i\epsilon/k$. Define a probability distribution $P$ on $[0,1]$ as follows:
\[ P(x) = \begin{cases} 1 - 2\gamma & x = y_0 \\ 2\gamma/r & x \in \{y_1, \ldots, y_r\} \\ 0 & \text{otherwise.} \end{cases} \]
Let $t$ be the identically-zero function on $[0,1]$. Now, suppose a sample $x = (x_1, x_2, \ldots, x_m) \in X_0^m$ is given and that $Y = \{y_1, y_2, \ldots, y_r\} \setminus \{x_1, x_2, \ldots, x_m\} \ne \emptyset$. Let $\zeta > 0$ be such that $\zeta \le \min(\eta, \epsilon)$. Then there is a function $h \in H$ such that $h(x_i) = \eta - \zeta$ for $1 \le i \le m$ but such that, for each $y \in Y$, $h(y) = \eta + \epsilon$. Indeed, let $h(x) = \min(g(x), 1)$ where $g$ is defined as follows. For $1 \le j \le r$ and $x \in [y_{j-1}, y_j]$, let
\[ g(x) = \begin{cases} \eta - \zeta & \text{if } y_{j-1} \notin Y \text{ and } y_j \notin Y \\ (\eta - \zeta) + (x - y_{j-1})\,k(\epsilon + \zeta)/(2\epsilon) & \text{if } y_{j-1} \notin Y \text{ and } y_j \in Y \\ \eta + \epsilon & \text{if } y_{j-1} \in Y \text{ and } y_j \in Y \\ (\eta + \epsilon) - (x - y_{j-1})\,k(\epsilon + \zeta)/(2\epsilon) & \text{if } y_{j-1} \in Y \text{ and } y_j \notin Y. \end{cases} \]

Let $Q \subseteq X_0^m$ be the set of $x$ of length $m$ containing fewer than $r/2$ examples from $\{y_1, y_2, \ldots, y_r\}$. Then we see that, for $x \in Q$, there is a function $h$ such that $|h(x_i) - t(x_i)| < \eta$ for $1 \le i \le m$, and
\[ P(\{x \in [0,1] : |h(x) - t(x)| \ge \eta + \epsilon\}) \ge \frac{2\gamma}{r}\,|\{y_1, y_2, \ldots, y_r\} \setminus \{x_1, x_2, \ldots, x_m\}| > \gamma. \]
Now, the probability of drawing a sample of length $m$ that is not in $Q$ is at most the probability of at least $r/2$ successes in a sequence of $m$ Bernoulli trials, where the probability of success at each trial is $2\gamma$. From standard Chernoff bounds on the tails of the binomial distribution (see [2, 13]), this probability is no more than
\[ \exp\left(-\frac{2\gamma m}{3}\left(\frac{r}{4\gamma m} - 1\right)^2\right), \]
and this is less than $1 - \delta$ when
\[ m < \frac{3}{2\gamma}\left(\frac{r}{6} - \ln\frac{1}{1-\delta}\right). \]
For $\delta \le 1/13$, $m < r/(8\gamma)$ will suffice, from which the result follows. □

2 Definitions and the Main Result

A number of ways of measuring the `expressive power' of a class H of functions has been proposed. This power is quantified by associating a `dimension' to the class. Sometimes this is simply one number depending on H. Sometimes, in what is known as a scale-sensitive dimension, it is a function depending on H. An important example of the first type of dimension is the pseudo-dimension [10, 14]. We say that a finite subset $S = \{x_1, x_2, \ldots, x_d\}$ of $X$ is shattered if there is $r = (r_1, r_2, \ldots, r_d) \in \mathbb{R}^d$ such that for every $b = (b_1, b_2, \ldots, b_d) \in \{0,1\}^d$, there is a function $h_b \in H$ with $h_b(x_i) > r_i$ if $b_i = 1$ and $h_b(x_i) < r_i$ if $b_i = 0$. The pseudo-dimension of H, denoted $\mathrm{Pdim}(H)$, is the largest cardinality of a shattered set, or infinity if there is no bound on the cardinalities of the shattered sets. The pseudo-dimension is a well-understood and useful measure of expressive power. One attractive feature of this dimension is that if the set of functions is a vector space then its pseudo-dimension coincides with its linear dimension (see for example [10]).

Perhaps the most important scale-sensitive dimension which has been used to date in the development of the theory of learning real-valued functions is the fat-shattering function. This is a scale-sensitive version of the pseudo-dimension and was introduced by Kearns and Schapire [11]. Suppose that H is a set of functions from $X$ to $[0,1]$ and that $\gamma \in (0,1)$. We say that a finite subset $S = \{x_1, x_2, \ldots, x_d\}$ of $X$ is $\gamma$-shattered if there is $r = (r_1, r_2, \ldots, r_d) \in \mathbb{R}^d$ such that for every $b = (b_1, b_2, \ldots, b_d) \in \{0,1\}^d$, there is a function $h_b \in H$

with $h_b(x_i) \ge r_i + \gamma$ if $b_i = 1$ and $h_b(x_i) \le r_i - \gamma$ if $b_i = 0$. Thus, $S$ is $\gamma$-shattered if it is shattered with a `width of shattering' of at least $\gamma$. We define the fat-shattering function, $\mathrm{fat}_H : \mathbb{R}^+ \to \mathbb{N}_0 \cup \{\infty\}$, as
\[ \mathrm{fat}_H(\gamma) = \max\{|S| : S \subseteq X \text{ is } \gamma\text{-shattered by } H\}, \]
or $\mathrm{fat}_H(\gamma) = \infty$ if the maximum does not exist. (Here, $\mathbb{N}_0$ denotes the set of nonnegative integers.) It is easy to see that $\mathrm{Pdim}(H) = \lim_{\gamma \to 0}\mathrm{fat}_H(\gamma)$. It should be noted, however, that it is possible for the pseudo-dimension to be infinite, even when $\mathrm{fat}_H(\gamma)$ is finite for all $\gamma$. We shall say that H has finite fat-shattering function whenever it is the case that for all $\gamma \in (0,1)$, $\mathrm{fat}_H(\gamma)$ is finite.

The fat-shattering function plays an important role in the learning theory of real-valued functions. Kearns and Schapire [11] proved that if a class of probabilistic concepts is learnable, then the class has finite fat-shattering function. (A probabilistic concept $f$ is a $[0,1]$-valued function. In this model, the learner sees examples $(x_i, y_i)$, where $\Pr(y_i = 1) = f(x_i)$.) Alon et al. [1] proved, conversely, that if a class of probabilistic concepts has finite fat-shattering function, then it is learnable. The main result in [5] is that finiteness of the fat-shattering function of a class of $[0,1]$-valued functions is a necessary and sufficient condition for learning with random observation noise.

Our main result is the following.

Theorem 3 Suppose that H is a set of functions from a set $X$ to $[0,1]$. Then the following propositions are equivalent.
1. H approximates from interpolated examples.
2. H approximates $\mathbb{R}^X$ from interpolated examples.
3. H has finite fat-shattering function.

3 Upper bound

In this section, we prove that finite fat-shattering function is a sufficient condition for approximation from interpolated examples and we provide a suitable sample length function $m_0(\eta,\epsilon,\gamma,\delta)$. We first need the notion of covering numbers, as used extensively in [10, 1, 8] for instance. Suppose that $(A, d)$ is a pseudo-metric space and $\epsilon > 0$. Then a subset $N$ of $A$ is said to be an $\epsilon$-cover for a subset $B$ of $A$ if, for every $x \in B$, there is an $\hat{x} \in N$ such that $d(x, \hat{x}) < \epsilon$. The metric space is totally bounded if there is a finite $\epsilon$-cover for $A$, for all $\epsilon > 0$. When $(A, d)$ is totally bounded, we shall denote the minimal cardinality of an $\epsilon$-cover for $A$ by $\mathcal{N}_A(\epsilon, d)$ for $\epsilon > 0$. A subset $M$ of $A$ is said to be $\epsilon$-separated if, for all distinct $x, y \in M$,

$d(x, y) \ge \epsilon$. We shall denote the maximal cardinality of an $\epsilon$-separated subset of $A$ by $\mathcal{M}_A(\epsilon, d)$. It is easy to show that $\mathcal{M}_A(2\epsilon, d) \le \mathcal{N}_A(\epsilon, d) \le \mathcal{M}_A(\epsilon, d)$ (see [12]), so $\mathcal{M}_A(\epsilon, d)$ is always defined if $(A, d)$ is totally bounded.

Suppose now that H is a set of functions from a set $X$ to $[0,1]$ and that $x = (x_1, x_2, \ldots, x_m) \in X^m$, where $m$ is a positive integer. We may define a pseudometric $l^\infty_x$ on H as follows: for $g, h \in H$,
\[ l^\infty_x(g, h) = \max_{1 \le i \le m}|g(x_i) - h(x_i)|. \]
(This metric has been used in [8, 1], for example.) Alon et al. [1] obtained (essentially) the following result bounding the $l^\infty_x$-covering number of H in terms of the fat-shattering function of H.

Lemma 4 Suppose that H is a set of functions from $X$ to $[0,1]$ and that H has finite fat-shattering function. Let $m \in \mathbb{N}$ and $x \in X^m$. Then the pseudometric space $(H, l^\infty_x)$ is totally bounded. Suppose $\epsilon > 0$. Let $d = \mathrm{fat}_H(\epsilon/4)$ and
\[ y = \sum_{i=1}^{d}\binom{m}{i}\left(\frac{2}{\epsilon}\right)^i. \]
Then, provided $m \ge \log y + 1$,
\[ \mathcal{N}_H(\epsilon, l^\infty_x) < 2\left(\frac{2m}{\epsilon^2}\right)^{\log y}. \]
Here, as elsewhere in the paper, log denotes logarithm to base 2.

We then have the following result.

Theorem 5 Suppose that H is a class of functions mapping from a domain $X$ to the real interval $[0,1]$ and that H has finite fat-shattering function. Let $t$ be any function from $X$ to $\mathbb{R}$ and let $\eta, \epsilon, \gamma > 0$. Let $P$ be any probability distribution on $X$ and define B to be the set of functions $h \in H$ for which
\[ P(\{x \in X : |h(x) - t(x)| \ge \eta + \epsilon\}) \ge \gamma. \]
Let $d = \mathrm{fat}_H(\epsilon/8)$ and let
\[ y = \sum_{i=1}^{d}\binom{2m}{i}\left(\frac{4}{\epsilon}\right)^i. \]
Then, for $m \ge \max(8/\gamma,\ \log y + 1)$,

the probability
\[ P^m\left(\{x \in X^m : \exists h \in B,\ |h(x_i) - t(x_i)| < \eta\ (1 \le i \le m)\}\right) \]
is at most
\[ 4\left(\frac{16m}{\epsilon^2}\right)^{\log y} 2^{-\gamma m/2}. \]

Proof: The proof is based on a technique analogous to that used in [17, 7, 10], where we `symmetrize' and then `combinatorially bound'. The first step (symmetrization) relates the desired probability to a `sample-based' one. This latter probability is then bounded using counting techniques. Fix $t$, $P$, $m$, the parameters $\eta, \epsilon, \gamma$, and hence the set B. We define two sets $Q$ and $R$ as follows. The subset $Q$ of $X^m$ is given by
\[ Q = \{x \in X^m : \exists h \in B,\ |h(x_i) - t(x_i)| < \eta\ (1 \le i \le m)\}. \]
The subset $R$ of $X^{2m}$ is defined as follows, where $xy \in X^{2m}$ denotes the concatenation of $x, y \in X^m$:
\[ R = \{xy \in X^{2m} : \exists h \in B,\ |h(x_i) - t(x_i)| < \eta\ (1 \le i \le m) \text{ and } |\{i : |h(y_i) - t(y_i)| \ge \eta + \epsilon\}| > \gamma m/2\}. \]
We claim that, for $m \ge 8/\gamma$, $P^m(Q) \le 2P^{2m}(R)$. To prove this claim, we first note that if $h \in B$, then, for $m \ge 8/\gamma$, by Chebyshev's inequality (for example), the probability that $y \in X^m$ satisfies $|\{i : |h(y_i) - t(y_i)| \ge \eta + \epsilon\}| \le \gamma m/2$ is at most $1/2$. The characteristic function $\chi_R$ of $R$ may be expressed as follows [4]:
\[ \chi_R(xy) = \chi_Q(x)\,\chi_x(y), \]
where $\chi_Q$ is the characteristic function of $Q$ and the $\{0,1\}$-valued function $\chi_x$ is such that $\chi_x(y) = 1$ if and only if there is $h \in B$ such that $|h(x_i) - t(x_i)| < \eta$ for $1 \le i \le m$ and such that $|\{i : |h(y_i) - t(y_i)| \ge \eta + \epsilon\}| > \gamma m/2$. Then
\[ P^{2m}(R) = \int \chi_R(xy)\,dP^{2m}(xy) = \int\left(\int \chi_Q(x)\,\chi_x(y)\,dP^m(y)\right)dP^m(x). \]
For fixed $x \in Q$ and for $m \ge 8/\gamma$, the inner integral is at least $1/2$ by the observation above. It follows that
\[ P^{2m}(R) \ge \frac{1}{2}\int \chi_Q(x)\,dP^m(x) = \frac{1}{2}P^m(Q), \]
as claimed.

The next step is to bound the probability of $R$ using combinatorial techniques. For this, let $\Gamma$ be the `swapping group' [14] of permutations on the set $\{1, 2, \ldots, 2m\}$. This is the group generated by the transpositions $(i, m+i)$

for $1 \le i \le m$. The group $\Gamma$ acts in a natural way on vectors in $X^{2m}$: for $\sigma \in \Gamma$ and $z \in X^{2m}$, we define $\sigma z$ to be $(z_{\sigma(1)}, z_{\sigma(2)}, \ldots, z_{\sigma(2m)})$. Let
\[ \Gamma(R, z) = |\{\sigma \in \Gamma : \sigma z \in R\}| \]
be the number of permutations in $\Gamma$ taking $z$ into $R$. Since $P^{2m}$ is a product distribution, we have $P^{2m}(R) = P^{2m}(\{z : \sigma z \in R\})$ and hence
\[ P^{2m}(R) = \frac{1}{|\Gamma|}\sum_{\sigma \in \Gamma}\int \chi_R(\sigma z)\,dP^{2m}(z) = \int \frac{1}{|\Gamma|}\sum_{\sigma \in \Gamma}\chi_R(\sigma z)\,dP^{2m}(z) = \mathbf{E}_{P^{2m}}\left(\frac{\Gamma(R, z)}{2^m}\right), \]
where $\mathbf{E}_{P^{2m}}$ denotes expectation with respect to $P^{2m}$. It follows from this that
\[ P^{2m}(R) \le \frac{1}{2^m}\max_{z \in X^{2m}}\Gamma(R, z). \]
Now, let us fix $z \in X^{2m}$ and consider the pseudo-metric space $(B, l^\infty_z)$. Since H has finite fat-shattering function, so does B, and Lemma 4 implies that this pseudo-metric space is totally bounded. Let $N = \{\hat{h}_1, \hat{h}_2, \ldots, \hat{h}_n\}$ be a minimal $\epsilon/2$-cover for B. From Lemma 4,
\[ n < 2\left(\frac{16m}{\epsilon^2}\right)^{\log y}. \]
Since $N$ is an $\epsilon/2$-cover, given $h \in B$, there is $\hat{h} \in N$ such that $l^\infty_z(\hat{h}, h) < \epsilon/2$, which means that for $1 \le i \le 2m$, $|\hat{h}(z_i) - h(z_i)| < \epsilon/2$. Suppose that $z = xy \in R$. Then, by the definition of $R$, there is some $h \in B$ such that $|h(x_i) - t(x_i)| < \eta$ for $1 \le i \le m$ and such that, for more than $\gamma m/2$ of the $y_i$, $|h(y_i) - t(y_i)| \ge \eta + \epsilon$. But (taking $\hat{h}$ to be, as described above, a function in the cover $\epsilon/2$-close to $h$) this implies that there is $\hat{h} \in N$ such that for $1 \le i \le m$, $|\hat{h}(x_i) - t(x_i)| < \eta + \epsilon/2$ and such that, for more than $\gamma m/2$ of the $y_i$, $|\hat{h}(y_i) - t(y_i)| \ge \eta + \epsilon/2$. It follows from this that if $z \in R$, then, for some $l$ between 1 and $n$, $z$ belongs to the set $\hat{R}_l$, defined as
\[ \hat{R}_l = \{xy \in X^{2m} : |\hat{h}_l(x_i) - t(x_i)| < \eta + \epsilon/2\ (1 \le i \le m) \text{ and } |\{i : |\hat{h}_l(y_i) - t(y_i)| \ge \eta + \epsilon/2\}| > \gamma m/2\}. \]

Let $\Gamma(\hat{R}_l, z)$ be the number of $\sigma$ in $\Gamma$ for which $\sigma z \in \hat{R}_l$. Since $z \in R$ implies $z \in \hat{R}_l$ for some $l$, we have
\[ \Gamma(R, z) \le \sum_{l=1}^{n}\Gamma(\hat{R}_l, z). \]
Consider a particular $l$ between 1 and $n$ and suppose that $\Gamma(\hat{R}_l, z) \ne 0$. Let $k$ be the number of indices $i$ between 1 and $2m$ such that $|\hat{h}_l(z_i) - t(z_i)| \ge \eta + \epsilon/2$. Then $\gamma m/2 < k \le m$. The number of permutations $\sigma$ in $\Gamma$ for which $\sigma z$ belongs to $\hat{R}_l$ is then equal to $2^{m-k}$, which is less than $2^{m(1 - \gamma/2)}$. (The $z_i$ which can be `swapped' are precisely those $m - k$ satisfying $|\hat{h}_l(z_i) - t(z_i)| < \eta + \epsilon/2$.) It follows that
\[ P^{2m}(R) < \frac{1}{2^m}\sum_{l=1}^{n}2^{m(1 - \gamma/2)} = n\,2^{-\gamma m/2} \le 2\left(\frac{16m}{\epsilon^2}\right)^{\log y}2^{-\gamma m/2}. \]
The statement in the Theorem now follows. □

It follows from this result that finiteness of $\mathrm{fat}_H$ implies that H approximates $\mathbb{R}^X$ from interpolated examples, and hence H approximates H from interpolated examples.

Corollary 6 Suppose that H is a set of functions from $X$ to $[0,1]$ and that H has finite fat-shattering function. Then H approximates $\mathbb{R}^X$ from interpolated examples. Furthermore, there is a positive constant $K$ such that a sufficient sample length function is
\[ m_0(\eta,\epsilon,\gamma,\delta) = \frac{K}{\gamma}\left(d\log^2\frac{d}{\gamma\epsilon^2} + \log\frac{1}{\delta}\right), \]
where $d = \mathrm{fat}_H(\epsilon/8)$.

Proof: For all $m, d \ge 1$,
\[ y = \sum_{i=1}^{d}\binom{2m}{i}\left(\frac{4}{\epsilon}\right)^i \le d(2m)^d\left(\frac{5}{\epsilon}\right)^d < \left(\frac{20m}{\epsilon}\right)^d. \]
So
\[ 4\left(\frac{16m}{\epsilon^2}\right)^{\log y}2^{-\gamma m/2} < \delta \]
if
\[ 4\left(\frac{50m}{\epsilon^2}\right)^{d\log(20m/\epsilon)}2^{-\gamma m/2} < \delta \iff \frac{\gamma m}{2} - d\log\frac{20m}{\epsilon}\log\frac{50m}{\epsilon^2} > \log\frac{4}{\delta} \Longleftarrow \frac{\gamma m}{2} - d\log^2\frac{50m}{\epsilon^2} > \log\frac{4}{\delta}. \tag{1} \]

Now, since $\ln\beta \le \beta/e$ for all $\beta > 0$, we have $\ln\lambda = 2\ln\sqrt{\lambda} \le \frac{2}{e}\sqrt{\lambda}$, and hence $\ln^2\lambda \le \frac{4}{e^2}\lambda$ for all $\lambda > 0$. It follows that, for all $\alpha > 0$ and $\lambda > 0$,
\[ \ln^2\lambda = \bigl(\ln(\alpha\lambda) + \ln(1/\alpha)\bigr)^2 \le 2\ln^2(\alpha\lambda) + 2\ln^2(1/\alpha) \le \frac{8}{e^2}\,\alpha\lambda + 2\ln^2(1/\alpha). \]
For $\lambda = 50m/\epsilon^2$ and $\alpha = \gamma\epsilon^2\ln^2 2/(1200d)$, this inequality implies that (1) holds when
\[ m > \frac{4}{\gamma}\left(\frac{3d}{\ln^2 2}\,\ln^2\frac{1200d}{\gamma\epsilon^2\ln^2 2} + \log\frac{4}{\delta}\right). \qquad \Box \]

We remark that, for a given $k$, the class of all $k$-Lipschitz continuous functions discussed in Proposition 2 has fat-shattering function $\mathrm{fat}_H(\gamma) = O(1/\gamma)$ and that the sufficient sample length function of Proposition 2 is of order
\[ \frac{1}{\gamma}\left(\mathrm{fat}_H(\epsilon)\log\frac{\mathrm{fat}_H(\epsilon)}{\gamma} + \log\frac{1}{\delta}\right). \]

4 Lower bound

In this section, we give lower bounds on the number of examples necessary for H to approximate $\mathbb{R}^X$ from interpolated examples and for H to approximate H from interpolated examples. The bounds are in terms of $\mathrm{fat}_H$, the fat-shattering function of H. To prove them, we consider a discretized version of H. For $a \in [0,1]$, let $D_\alpha(a) = \lceil a/\alpha\rceil$. For a function $f : X \to [0,1]$, let $D_\alpha(f) : X \to \{0, 1, \ldots, \lceil 1/\alpha\rceil\}$ be defined as the composition of $D_\alpha$ and $f$. Let $D_\alpha(H)$ denote $\{D_\alpha(f) : f \in H\}$. Functions in $D_\alpha(H)$ map to $\{0, 1, \ldots, n\}$, where $n = \lceil 1/\alpha\rceil$. Let $[n]$ denote $\{0, 1, \ldots, n\}$. From the definition of the fat-shattering function,
\[ \mathrm{fat}_{D_\alpha(H)}(\gamma) \ge \mathrm{fat}_H(\gamma\alpha) \]
for every positive integer $\gamma$ and every $\alpha \in \mathbb{R}^+$.

To obtain a lower bound on the sample length function in terms of the fat-shattering function, we consider a number of notions of dimension, defined using classes of $\{0, 1, *\}$-valued functions on $[n]$. Our approach is similar to that of Ben-David, Cesa-Bianchi, Haussler and Long [6], who consider learning $[n]$-valued functions. We show that a large family of these dimensions (including the fat-shattering function) are closely related, and prove the lower bound in terms of the most convenient, the gnat-dimension (a `gapped' version of the Natarajan dimension).
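The discretization $D_\alpha$ used below is elementary to compute. The following sketch (the helper names and the finite point set are illustrative assumptions) applies $D_\alpha(a) = \lceil a/\alpha\rceil$ pointwise, mapping a $[0,1]$-valued function to an $[n]$-valued one with $n = \lceil 1/\alpha\rceil$:

import math

def discretize_value(a, alpha):
    # D_alpha(a) = ceil(a / alpha), taking [0, 1] into {0, 1, ..., ceil(1/alpha)}
    return math.ceil(a / alpha)

def discretize_function(f, points, alpha):
    # Tabulate D_alpha(f) on a finite set of points
    return {x: discretize_value(f(x), alpha) for x in points}

if __name__ == "__main__":
    f = lambda x: x * x
    print(discretize_function(f, [0.0, 0.3, 0.6, 1.0], 0.25))   # values lie in {0, ..., 4}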

Definition 7 Let $\mathrm{gnat}(k)$ be the following set of $\{0, 1, *\}$-valued functions defined on $[n]$, where $k \in \{2, 3, \ldots\}$:
\[ \mathrm{gnat}(k) = \{\psi_{i,j} : i, j \in [n],\ |i - j| \ge k\}, \]
with
\[ \psi_{i,j}(\ell) = \begin{cases} 1 & \ell = i \\ 0 & \ell = j \\ * & \text{otherwise.} \end{cases} \]
If F is a class of $[n]$-valued functions defined on $X$ and $\Psi$ is a class of $\{0, 1, *\}$-valued functions defined on $[n]$, we say that F $\Psi$-shatters $x = (x_1, \ldots, x_d) \in X^d$ if there is a sequence $(\psi_1, \ldots, \psi_d) \in \Psi^d$ such that
\[ \{0,1\}^d \subseteq \{(\psi_1(f(x_1)), \ldots, \psi_d(f(x_d))) : f \in F\}. \]
The $\Psi$-dimension of F, denoted $\Psi$-dim(F), is the size of the largest $\Psi$-shattered sequence, or infinity if there is no largest sequence.

The following result is similar to the key result in [1] (Lemma 15), which bounds covering numbers of F in terms of $\mathrm{fat}_F(1)$. We will see later that it generalizes that lemma, since the $\mathrm{gnat}(2)$-dimension is the smallest of a family of dimensions that includes $\mathrm{fat}_F(1)$.

Lemma 8 Let $k \ge 2$ and $n \ge 1$ be integers. Suppose that F is a class of $[n]$-valued functions defined on $X$ satisfying $\mathrm{gnat}(k)$-dim(F) $\le d$, and that $m \ge \log y + 1$, where
\[ y = \sum_{i=0}^{d}\binom{m}{i}n^{2i}. \]
Then
\[ \max_{x \in X^m}\mathcal{M}_F(k, l^\infty_x) < 2(2mn^2)^{\log y}. \]

Proof: Fix $x \in X^m$ and let $S = \{x_i : i = 1, \ldots, m\}$. For a set $A \subseteq S$ and functions $u, l : A \to [n]$, we say that F shatters $(A, l, u)$ if $u(x) - l(x) \ge k$ for all $x \in A$, and for all $b \in \{0,1\}^A$ there is a function $f_b \in F$ such that, for every $x \in A$,
\[ f_b(x) = \begin{cases} u(x) & b_x = 1 \\ l(x) & b_x = 0. \end{cases} \]
Define the function $t : \mathbb{N}^4 \to \mathbb{N}_0$ as
\[ t(h, m, n, k) = \min\bigl\{ |\{(A, l, u) : A \subseteq Y,\ l, u : A \to [n],\ G \text{ shatters } (A, l, u)\}| : |Y| \le m,\ G \subseteq [n]^Y,\ |G| \ge h,\ G \text{ is } k\text{-separated on } Y \bigr\}, \]
where we say that a set G is $k$-separated on $Y$ if, for all distinct functions $f, g \in G$, there is a $z \in Y$ such that $|f(z) - g(z)| \ge k$.

It is easy to see that if $|Y| \le m$ then
\[ |\{(A, l, u) : A \subseteq Y,\ |A| \le d,\ l, u : A \to [n],\ \forall z \in A\ u(z) - l(z) \ge k\}| < y. \]
It follows that, if there is an $h \in \mathbb{N}$ such that $t(h, m, n, k) \ge y$, then every $k$-separated function class $G \subseteq [n]^Y$ with $|Y| \le m$ and $|G| \ge h$ has $\mathrm{gnat}(k)$-dim(G) $> d$.

Suppose $F \subseteq [n]^X$ has $\mathrm{gnat}(k)$-dim(F) $\le d$. Fix $x \in X^m$ and let G be some subset of F restricted to $\{x_1, \ldots, x_m\}$. Clearly, $\mathrm{gnat}(k)$-dim(G) $\le d$, so if G is $k$-separated on $\{x_1, \ldots, x_m\}$ then $|G| < h$. In that case, $\max_{x \in X^m}\mathcal{M}_F(k, l^\infty_x) < h$. So the result follows from $t(2(2mn^2)^{\log y}, m, n, k) \ge y$, which we prove by induction.

For all $m' \in \mathbb{N}$ and $n \ge k$, we have $t(2, m', n, k) \ge 1$. To see this, take $\{g_1, g_2\} \subseteq G$, where the functions $g_1$ and $g_2$ map from a set $Y$ (with $|Y| \le m'$) to $[n]$, and are $k$-separated. There must be an $x \in Y$ for which (without loss of generality) $g_1(x) - g_2(x) \ge k$. It follows that G shatters $(\{x\}, g_2, g_1)$, and hence $t(2, m', n, k) \ge 1$.

Now, take a class G of $[n]$-valued functions defined on $Y$, where $|Y| = m$, $|G| = 2mn^2h$, and G is $k$-separated. Divide G into $mn^2h$ pairs,
\[ G = \{(f_1, g_1), \ldots, (f_{mn^2h}, g_{mn^2h})\}. \]
For each pair $(f_i, g_i)$, choose an $x_i \in Y$ for which $f_i(x_i) - g_i(x_i) \ge k$ (swapping the labels $f_i$ and $g_i$ if necessary). This is always possible since G is $k$-separated. By the pigeonhole principle, there is an $x \in Y$ for which
\[ |\{i \in \{1, \ldots, mn^2h\} : x_i = x\}| \ge n^2h. \]
Again by the pigeonhole principle, there is a pair $(a, b)$ and a set
\[ S = \{i \in \{1, \ldots, mn^2h\} : x_i = x,\ f_i(x_i) = a,\ g_i(x_i) = b\} \]
such that $|S| = h$. Define $G_1 = \{f_i|_{Y\setminus\{x\}} : i \in S\}$ and $G_2 = \{g_i|_{Y\setminus\{x\}} : i \in S\}$, where $f|_T$ is the restriction of the function $f$ to the set $T$. Clearly, $G_1$ is $k$-separated on $Y \setminus \{x\}$ because G is $k$-separated on $Y$ and for all distinct pairs $f_i, f_j \in G_1$ we have $f_i(x) = f_j(x)$. Also, $|G_1| = h$. By the definition of t, the set
\[ S_1 = \{(A, l, u) : A \subseteq Y \setminus \{x\},\ l, u : A \to [n],\ G_1 \text{ shatters } (A, l, u)\} \]
satisfies $|S_1| \ge t(h, m-1, n, k)$. Similarly, if we define $S_2$ as the set of triples shattered by $G_2$, we have $|S_2| \ge t(h, m-1, n, k)$. For any triple $(A, l, u)$ in $S_1 \cap S_2$, define $l^*, u^* : A \cup \{x\} \to [n]$ as
\[ l^*(\ell) = \begin{cases} b & \ell = x \\ l(\ell) & \text{otherwise,} \end{cases} \qquad u^*(\ell) = \begin{cases} a & \ell = x \\ u(\ell) & \text{otherwise.} \end{cases} \]

Then G shatters $(A \cup \{x\}, l^*, u^*)$. It follows that G shatters at least
\[ |S_1| + |S_2| \ge 2t(h, m-1, n, k) \]
triples. Since G was arbitrary, $t(2mn^2h, m, n, k) \ge 2t(h, m-1, n, k)$.

So we have $t(2, m', n, k) \ge 1$ for $m' \in \mathbb{N}$ and $n \ge k$, and $t(2mn^2h, m, n, k) \ge 2t(h, m-1, n, k)$. By induction,
\[ t\left(2(2n^2)^i\prod_{j=1}^{i}(m_0 + j),\ m_0 + i,\ n,\ k\right) \ge 2^i \]
for all $i \in \mathbb{N}$. If $m \ge \log y + 1$, we can assign $i = \log y$, $m_0 = m - i$, and $h = 2(2n^2m)^i$, and we have
\[ h \ge 2(2n^2)^i\prod_{j=1}^{i}(m_0 + j). \]
From the definition of t, $h_1 \ge h_2$ implies $t(h_1, m, n, k) \ge t(h_2, m, n, k)$, so
\[ t(2(2n^2m)^i, m, n, k) \ge y. \]
The result follows. □

A useful family of function classes is the $k$-gapped distinguishers.

Definition 9 Let $k \ge 2$. A set $\Psi$ of functions from $[n]$ to $\{0, 1, *\}$ is a $k$-gapped distinguisher if it satisfies
1. for all $i \in \{0, 1, \ldots, n-k\}$ and $j \in \{i+k, \ldots, n\}$, there is a function $\psi \in \Psi$ and a bit $b \in \{0,1\}$ such that $\psi(i) = b$ and $\psi(j) = 1-b$;
2. $\min\{|i - j| : i, j \in [n],\ \exists\psi \in \Psi,\ \psi(i) = 0,\ \psi(j) = 1\} = k$.

The set $\mathrm{gnat}(k)$ is an example of a $k$-gapped distinguisher. Another important example is the class
\[ \mathrm{g}(k) = \left\{\psi \in \{0,1,*\}^{[n]} : \min\{|i - j| : i, j \in [n],\ \psi(i) = 0,\ \psi(j) = 1\} \ge k\right\}. \]
In fact $\mathrm{g}(k)$ is the largest $k$-gapped distinguisher, in the sense that it contains any other $k$-gapped distinguisher.

Lemma 10 Suppose F is a class of $[n]$-valued functions defined on $X$, $\Psi$ is a class of $\{0,1,*\}$-valued functions defined on $[n]$, and $k \ge 2$. If $\Psi$ is a $k$-gapped distinguisher then
\[ \mathrm{gnat}(k)\text{-dim}(F) \le \Psi\text{-dim}(F) \le \mathrm{g}(k)\text{-dim}(F). \]

Proof: Take a $\mathrm{gnat}(k)$-shattered sequence $x \in X^d$. Since $\Psi$ is a $k$-gapped distinguisher, for all $\psi_{i,j} \in \mathrm{gnat}(k)$ there is a $\psi \in \Psi$ and a $b \in \{0,1\}$ for which

$\psi(i) = b$ and $\psi(j) = 1-b$. It follows that $x$ is $\Psi$-shattered, which gives the first inequality. The second inequality follows from the fact that $\Psi \subseteq \mathrm{g}(k)$. □

We can express the fat-shattering dimension $\mathrm{fat}_F(k)$ as a dimension of this type, for $k \ge 2$. Define
\[ \mathrm{fat}(k) = \{\psi_i : i \in \{0, \ldots, n-k\}\}, \]
with
\[ \psi_i(z) = \begin{cases} 1 & z \ge i + k \\ * & i < z < i + k \\ 0 & z \le i, \end{cases} \]
for $z \in [n]$. Then $\mathrm{fat}_F(\gamma) = \mathrm{fat}(\lfloor 2\gamma\rfloor)$-dim(F) for all classes F of functions from X to $[n]$. Clearly, $\mathrm{fat}(k)$ is a $k$-gapped distinguisher. It follows from Lemma 10 that Lemma 8 generalizes Alon et al.'s Lemma 15, which gave a similar result for the $\mathrm{fat}(2)$-dimension.

The following result shows that the $\mathrm{gnat}(k)$-dimension, $\mathrm{g}(k)$-dimension, and $\Psi$-dimension (for any $k$-gapped distinguisher $\Psi$) are all closely related.

Lemma 11 Let $k \ge 2$. Let F be a class of functions that map from X to $[n]$, satisfying $\mathrm{gnat}(k)$-dim(F) $= d_N \ge 2$ and $\mathrm{g}(k)$-dim(F) $\ge d_B \ge 1$. Then
\[ d_N > \frac{d_B}{3\log^2(2d_Bn^2)}. \]

Proof: Suppose $x = (x_1, \ldots, x_{d_B}) \in X^{d_B}$ is $\mathrm{g}(k)$-shattered by F. The definition of $\mathrm{g}(k)$ implies that any subset of F that $\mathrm{g}(k)$-shatters $x$ is $k$-separated, so $\mathcal{M}_F(k, l^\infty_x) \ge 2^{d_B}$. Let
\[ y = \sum_{i=0}^{d_N}\binom{d_B}{i}n^{2i}. \]
Then $y \le d_N\,d_B^{d_N}\,n^{2d_N}$, so $\log y \le \log d_N + d_N\log(d_Bn^2)$, since $d_N \ge 2$. If $d_B \ge \log y + 1$ then
\[ 2^{d_B} \le \mathcal{M}_F(k, l^\infty_x) < 2(2d_Bn^2)^{\log y}, \]
so
\[ d_B < 1 + \log y\,\log(2d_Bn^2). \tag{2} \]
Alternatively, if $d_B \le \log y$, then (2) is obviously true. So we have
\[ d_B < 1 + \log y\,\log(2d_Bn^2) \le 1 + \bigl(\log d_N + d_N\log(d_Bn^2)\bigr)\log(2d_Bn^2) \le 3d_N\log^2(2d_Bn^2). \qquad \Box \]
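For a small finite class, the $\mathrm{gnat}(k)$-dimension of Definition 7 can be checked by brute force over witness pairs. The sketch below is illustrative only (F is represented as a list of dictionaries mapping points to values in $[n]$, and no attempt is made at efficiency); it tests whether a particular tuple of points is $\mathrm{gnat}(k)$-shattered:

from itertools import product

def make_psi(i, j):
    # The {0, 1, *}-valued function psi_{i,j} of Definition 7
    def psi(value):
        if value == i:
            return 1
        if value == j:
            return 0
        return "*"
    return psi

def gnat_shatters(F, points, k):
    # True if the finite class F gnat(k)-shatters the given tuple of points
    n = max(f[p] for f in F for p in points)
    pairs = [(i, j) for i in range(n + 1) for j in range(n + 1) if abs(i - j) >= k]
    for witness in product(pairs, repeat=len(points)):
        psis = [make_psi(i, j) for (i, j) in witness]
        patterns = {tuple(ps(f[p]) for ps, p in zip(psis, points)) for f in F}
        if all(b in patterns for b in product((0, 1), repeat=len(points))):
            return True
    return False

The $\mathrm{gnat}(k)$-dimension of F on a finite domain is then the largest tuple length for which some tuple of points passes this test.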

The following result follows easily from Theorem 1 in [9], which gives a lower bound on the number of examples necessary for learning $\{0,1\}^{[d]}$ in the probably approximately correct model (see also [7]).

Lemma 12 Let $0 < \gamma \le 1/8$, $0 < \delta < 1/100$, and $d \ge 1$. If
\[ m < \max\left(\frac{d-1}{32\gamma},\ \frac{1-\gamma}{\gamma}\ln\frac{1}{\delta}\right) \]
then there is a distribution P on $[d]$ and a function $t \in \{0,1\}^{[d]}$ such that
\[ P^m\bigl(\{x \in [d]^m : \exists f \in \{0,1\}^{[d]} \text{ such that } f(x_i) = t(x_i),\ i = 1, \ldots, m,\ \text{and } P\{y : f(y) \ne t(y)\} \ge \gamma\}\bigr) \ge \delta. \]

We use Lemma 12 to prove the following lower bound on the sample length function for approximating $\mathbb{R}^X$ from interpolated examples.

Theorem 13 Suppose H is a class of $[0,1]$-valued functions defined on a set X, $0 < \eta < \epsilon < 1$, and $\gamma, \delta \in (0,1)$. Then any sample length function $m_0$ for H to approximate $\mathbb{R}^X$ from interpolated examples satisfies
\[ m_0(\eta,\epsilon,\gamma,\delta) \ge \max\left(\frac{1}{32\gamma}\left(\frac{d}{3\log^2(4d/\epsilon^2)} - 1\right),\ \frac{1}{\gamma}\log\frac{1}{\delta}\right), \]
where $d = \mathrm{fat}_H(\epsilon)$.

Proof: Fix $0 < \eta < \epsilon < 1$, define $n = \lceil 1/\epsilon\rceil$, and suppose $\mathrm{fat}_H(\epsilon) \ge d$. Let $F = D_\epsilon(H)$. Then $\mathrm{fat}_F(1) \ge d$, so $\mathrm{gnat}(2)$-dim(F) $> k$, where $k = d/(3\log^2(2dn^2))$. Consider a sequence $(x_1, \ldots, x_k) \in X^k$ that is $\mathrm{gnat}(2)$-shattered by F. Clearly, there is a subset $H_0 \subseteq H$ with $|H_0| = 2^k$ and a sequence $(\psi_{a_1,b_1}, \ldots, \psi_{a_k,b_k}) \in \mathrm{gnat}(2)^k$ such that
\[ \{(\psi_{a_1,b_1}(f(x_1)), \ldots, \psi_{a_k,b_k}(f(x_k))) : f \in D_\epsilon(H_0)\} = \{0,1\}^k. \]
Without loss of generality, we can assume that $a_j > b_j$ for $j = 1, \ldots, k$. Now, if $m < \max((k-1)/(32\gamma),\ ((1-\gamma)/\gamma)\ln(1/\delta))$, Lemma 12 implies that there is a distribution P on $\{1, \ldots, k\}$ and a function $p : \{1, \ldots, k\} \to \{0,1\}$ such that
\[ P^m\bigl(\{l \in \{1, \ldots, k\}^m : \exists p' : \{1, \ldots, k\} \to \{0,1\} \text{ such that } p(l_i) = p'(l_i) \text{ for } i = 1, \ldots, m \text{ and } P\{y \in \{1, \ldots, k\} : p(y) \ne p'(y)\} \ge \gamma\}\bigr) \ge \delta. \]
Choose a function $t : X \to \mathbb{R}$ satisfying
\[ t(x_j) = \begin{cases} (a_j - 1)\epsilon + \zeta & p(j) = 1 \\ b_j\epsilon - \zeta & p(j) = 0 \end{cases} \]

for $j = 1, \ldots, k$, where
\[ \zeta = \frac{1}{2}\min\{h(x_j) - (a_j - 1)\epsilon : h \in H_0 \text{ and } h(x_j) > (a_j - 1)\epsilon,\ j = 1, \ldots, k\}. \]
For each function $h \in H_0$ define $f_h = D_\epsilon(h)$. Let $p_h : \{1, \ldots, k\} \to \{0,1\}$ be defined by
\[ p_h(j) = \begin{cases} 1 & f_h(x_j) = a_j \\ 0 & f_h(x_j) = b_j. \end{cases} \]
Clearly, if $|h(x_j) - t(x_j)| < \eta$ for some $h \in H_0$ and some $j \in \{1, \ldots, k\}$, then $h(x_j) \in ((a_j - 1)\epsilon, a_j\epsilon] \cup ((b_j - 1)\epsilon, b_j\epsilon]$, so $p_h(j) = p(j)$. Also, if $p_h(j) \ne p(j)$ for some $h \in H$, then $|h(x_j) - t(x_j)| > \eta + \epsilon$. It follows that $P\{y \in \{1, \ldots, k\} : p_h(y) \ne p(y)\} \ge \gamma$ implies $Q\{y \in X : |h(y) - t(y)| \ge \eta + \epsilon\} \ge \gamma$, where Q is the discrete probability distribution on X satisfying $Q(x_j) = P(j)$ for $j = 1, \ldots, k$. So
\[ Q^m\{y \in X^m : \exists h \in H \text{ such that } |h(y_i) - t(y_i)| < \eta \text{ for } i = 1, \ldots, m \text{ and } Q\{y \in X : |h(y) - t(y)| \ge \eta + \epsilon\} \ge \gamma\} \]
\[ \ge Q^m\{y \in X^m : \exists h \in H_0 \text{ such that } |h(y_i) - t(y_i)| < \eta \text{ for } i = 1, \ldots, m \text{ and } Q\{y \in X : |h(y) - t(y)| \ge \eta + \epsilon\} \ge \gamma\} \]
\[ \ge P^m\{l \in \{1, \ldots, k\}^m : \exists p' : \{1, \ldots, k\} \to \{0,1\} \text{ such that } p(l_i) = p'(l_i) \text{ for } i = 1, \ldots, m \text{ and } P\{y \in \{1, \ldots, k\} : p(y) \ne p'(y)\} \ge \gamma\} \ge \delta. \qquad \Box \]

We also have the following result, which bounds from below the sample length function for H to approximate H from interpolated examples.

Theorem 14 Suppose H is a class of $[0,1]$-valued functions defined on a set X, $0 < \epsilon < 1$, $3\epsilon/2 \le \eta < 1$, and $\gamma, \delta \in (0,1)$. If d satisfies $\mathrm{fat}_H(\eta + \epsilon) \ge d$, then any sample length function $m_0$ for H to approximate H from interpolated examples satisfies
\[ m_0(\eta,\epsilon,\gamma,\delta) \ge \max\left(\frac{1}{32\gamma}\left(\frac{d}{3\log^2(4d/\epsilon^2)} - 1\right),\ \frac{1}{\gamma}\log\frac{1}{\delta}\right). \]

Proof: Fix $0 < \epsilon < 1$ and $3\epsilon/2 \le \eta < 1$, define $n = \lceil 1/\epsilon\rceil$, and suppose $d \le \mathrm{fat}_H(\eta + \epsilon)$. Let $F = D_\epsilon(H)$. Then
\[ \mathrm{fat}_F\!\left(\frac{\lfloor 2\eta/\epsilon\rfloor + 1}{2}\right) \ge \mathrm{fat}_H(\eta + \epsilon) \ge d, \]
so $\mathrm{fat}(\lfloor 2\eta/\epsilon\rfloor + 1)$-dim(F) $\ge d$, hence $\mathrm{gnat}(\lfloor 2\eta/\epsilon\rfloor + 1)$-dim(F) $> k$, where $k = d/(3\log^2(2dn^2))$. Consider a sequence $(x_1, \ldots, x_k) \in X^k$ that is $\mathrm{gnat}(\lfloor 2\eta/\epsilon\rfloor + 1)$-shattered by F. Clearly, there is a subset $H_0 \subseteq H$ with $|H_0| = 2^k$ and a sequence $(\psi_{a_1,b_1}, \ldots, \psi_{a_k,b_k}) \in \mathrm{gnat}(\lfloor 2\eta/\epsilon\rfloor + 1)^k$ such that
\[ \{(\psi_{a_1,b_1}(f(x_1)), \ldots, \psi_{a_k,b_k}(f(x_k))) : f \in D_\epsilon(H_0)\} = \{0,1\}^k. \]

Fix a function $t \in H_0$. Any function $h \in H$ that has $\psi_{a_i,b_i}(D_\epsilon(h)(x_i)) = \psi_{a_i,b_i}(D_\epsilon(t)(x_i))$ satisfies $|h(x_i) - t(x_i)| < \epsilon < \eta$. Any function h in H that has $\psi_{a_i,b_i}(D_\epsilon(h)(x_i)) \ne \psi_{a_i,b_i}(D_\epsilon(t)(x_i))$ satisfies
\[ |h(x_i) - t(x_i)| \ge \lfloor 2\eta/\epsilon\rfloor\,\epsilon \ge \eta + \epsilon, \]
since $(\eta - \epsilon)/\epsilon \ge 1/2$ and hence $\lfloor 2\eta/\epsilon\rfloor \ge \eta/\epsilon + 1$. Using the same argument as in the proof of Theorem 13, there is a distribution P on X such that if m is too small then with $P^m$-probability at least $\delta$, some $h \in H$ is within $\eta$ of t on a random sample, but $P(|h - t| \ge \eta + \epsilon) \ge \gamma$. □

5 Discussion

Figure 1 shows the sample complexity bounds for approximation from interpolated examples. These bounds imply Theorem 3.

Figure 1: Sample complexity bounds.
\[ m = \frac{1}{\gamma}\left(\mathrm{fat}_H(\epsilon/8)\,\log^2\frac{\mathrm{fat}_H(\epsilon/8)}{\gamma\epsilon^2} + \log\frac{1}{\delta}\right) \Longrightarrow \text{H approximates } \mathbb{R}^X \text{ from interpolated examples} \Longrightarrow m = \max\left(\frac{\mathrm{fat}_H(\epsilon)}{\gamma\,\log^2(\mathrm{fat}_H(\epsilon)/\epsilon^2)},\ \frac{1}{\gamma}\log\frac{1}{\delta}\right) \]
\[ m = \frac{1}{\gamma}\left(\mathrm{fat}_H(\epsilon/8)\,\log^2\frac{\mathrm{fat}_H(\epsilon/8)}{\gamma\epsilon^2} + \log\frac{1}{\delta}\right) \Longrightarrow \text{H approximates H from interpolated examples} \Longrightarrow m = \max\left(\frac{\mathrm{fat}_H(\eta + \epsilon)}{\gamma\,\log^2(\mathrm{fat}_H(\eta + \epsilon)/\epsilon^2)},\ \frac{1}{\gamma}\log\frac{1}{\delta}\right) \]

Notice that the upper and lower bounds on the sample length for H to approximate $\mathbb{R}^X$ from interpolated examples are within log factors of each other. If H is the class of $k$-Lipschitz-continuous functions from $[0,1]$ to $[0,1]$, the upper bound on the sample length function for H to approximate H from interpolated examples is within a $\log^2$ factor of the lower bound implied by Proposition 2. Indeed, $\mathrm{fat}_H(\gamma) = \Theta(1/\gamma)$ in this case, so Proposition 2 shows that $m_0(\eta,\epsilon,\gamma,\delta) = \Omega(\mathrm{fat}_H(\epsilon)/\gamma)$.
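As a rough illustration of the scale of these bounds for the Lipschitz class, one can evaluate the two expressions of Proposition 2 (in the form given above; the concrete parameter values below are arbitrary choices for illustration) for particular settings of $k$, $\epsilon$, $\gamma$ and $\delta$:

import math

def lipschitz_upper_bound(k, eps, gamma, delta):
    # Sufficient sample length (2k/(eps*gamma)) * ln(2k/(eps*gamma*delta)) + 1
    return (2 * k / (eps * gamma)) * math.log(2 * k / (eps * gamma * delta)) + 1

def lipschitz_lower_bound(k, eps, gamma):
    # Necessary sample length (1/(8*gamma)) * (k/(2*eps) - 2)
    return (k / (2 * eps) - 2) / (8 * gamma)

if __name__ == "__main__":
    k, eps, gamma, delta = 1.0, 0.05, 0.05, 0.01
    print(lipschitz_upper_bound(k, eps, gamma, delta))   # roughly 9 * 10^3
    print(lipschitz_lower_bound(k, eps, gamma))          # 20

The two values differ by constant and logarithmic factors only, in line with the discussion above.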

These sample complexity bounds for approximation from interpolated examples are also relevant to the problem of learning real-valued functions in the presence of malicious noise. Suppose a learner sees a sequence of training examples that correspond to the values of a target function corrupted with arbitrary bounded additive noise. That is, each example is of the form $(x_i, t(x_i) + n_i)$, where $t \in H$ and $|n_i| < \eta$. Clearly, any function $h \in H$ that is $\eta$-close to the training sample will satisfy
\[ \Pr(|h - t| \ge 2\eta + \epsilon) < \gamma, \]
provided that H approximates from interpolated examples and the training sample is sufficiently large. In addition, if there is an algorithm that can learn in the presence of malicious noise (in this sense), then it can certainly learn in the presence of uniformly distributed random noise (as defined in [5]), which implies $\mathrm{fat}_H$ is finite ([5], Theorem 3). That is, a function class H is learnable with malicious noise if and only if $\mathrm{fat}_H$ is finite.

6 When $\epsilon$ meets 0

In [3], the related problem of valid generalization from approximate interpolation was studied. We say that H validly generalizes C from approximate interpolation if, for all $\eta > 0$ and $\gamma, \delta \in (0,1)$, there is an $m_0$ such that for all probability distributions P on X and all $t \in C$, with $P^m$-probability at least $1 - \delta$, a sample $x = (x_1, x_2, \ldots, x_m) \in X^m$ has the property that if $h \in H$ and $|h(x_i) - t(x_i)| < \eta$ for $1 \le i \le m$, then
\[ P(\{x \in X : |h(x) - t(x)| \ge \eta\}) < \gamma. \]
This problem is equivalent to approximation from interpolated examples with $\epsilon$ set to 0 (see Definition 1). Results in [3] give necessary and sufficient conditions for valid generalization from approximate interpolation. In particular, H validly generalizes $\mathbb{R}^X$ from approximate interpolation if and only if the pseudo-dimension $\mathrm{Pdim}(H)$ is finite. It was also shown in [3] that H validly generalizes H from approximate interpolation if and only if $\mathrm{Ddim}_H(\eta)$ is finite for all $\eta$, where $\mathrm{Ddim}_H(\eta)$ is defined as follows.

Definition 15 For $h, t \in H$, let $h_{[\eta,t]} : X \to \{0,1\}$ be defined by $h_{[\eta,t]}(x) = 1$ iff $|h(x) - t(x)| \ge \eta$. Let $H_{[\eta,t]} = \{h_{[\eta,t]} : h \in H\}$. Then we define
\[ \mathrm{Ddim}_H(\eta) = \max_{t \in H}\mathrm{Pdim}\bigl(H_{[\eta,t]}\bigr). \]
(Since functions in $H_{[\eta,t]}$ are $\{0,1\}$-valued, the pseudo-dimension of $H_{[\eta,t]}$ is equal to its Vapnik-Chervonenkis dimension, defined in [17].)

Clearly, valid generalization from approximate interpolation implies approximation from interpolated examples, so finiteness of $\mathrm{Pdim}(H)$ or $\mathrm{Ddim}_H$ implies

finiteness of $\mathrm{fat}_H$. Since the fat-shattering function is a non-increasing function of $\gamma$, finiteness of the pseudo-dimension of H implies a uniform upper bound on $\mathrm{fat}_H(\gamma)$. The following proposition gives a lower bound on $\mathrm{Ddim}_H(\eta)$ in terms of $\mathrm{fat}_H$.

Proposition 16
\[ \mathrm{Ddim}_H(\eta) \ge \frac{\mathrm{fat}_H(\eta)}{3\log^2(4\,\mathrm{fat}_H(\eta)/\eta^2)}. \]

Proof: Let $F = D_\eta(H)$. Then $\mathrm{fat}_F(1) \ge \mathrm{fat}_H(\eta)$, so Lemma 11 implies that $\mathrm{gnat}(2)$-dim(F) $\ge k$, where
\[ k = \frac{\mathrm{fat}_H(\eta)}{3\log^2(4\,\mathrm{fat}_H(\eta)/\eta^2)}. \]
Suppose $x \in X^k$ is $\mathrm{gnat}(2)$-shattered by F. Then there is a subset $H_0 \subseteq H$ with $|H_0| = 2^k$ and a sequence $(\psi_{a_1,b_1}, \ldots, \psi_{a_k,b_k}) \in \mathrm{gnat}(2)^k$ such that
\[ \{(\psi_{a_1,b_1}(f(x_1)), \ldots, \psi_{a_k,b_k}(f(x_k))) : f \in D_\eta(H_0)\} = \{0,1\}^k. \]
Choose $t \in H_0$. Then for $h \in H_0$ the definition of $h_{[\eta,t]}$ implies that $h_{[\eta,t]}(x_i) = 0$ if and only if
\[ \psi_{a_i,b_i}(D_\eta(h(x_i))) = \psi_{a_i,b_i}(D_\eta(t(x_i))). \]
That is,
\[ \{(h_{[\eta,t]}(x_1), \ldots, h_{[\eta,t]}(x_k)) : h \in H_0\} = \{0,1\}^k, \]
so $\mathrm{Ddim}_H(\eta) \ge k$. □

7 Conclusions

We have shown that approximation from interpolated examples is possible if and only if the fat-shattering function of the hypothesis class is finite. This problem can be interpreted as a problem of learning real-valued functions, where we require satisfactory performance from every algorithm that returns a function that approximately interpolates the training examples. It follows that this problem is equivalent to those of learning real-valued functions with random (see [5]) and malicious (see Section 5) observation noise, in the sense that learning is possible in each case if and only if the fat-shattering function of the hypothesis class is finite.

Acknowledgements

This research was supported in part by the Australian Telecommunications and Electronics Research Board. The work of Martin Anthony is supported in part by the European Union through the "NeuroCOLT" ESPRIT Working Group.

References

[1] Alon, N., Ben-David, S., Cesa-Bianchi, N. and Haussler, D. (1993). Scale-sensitive dimensions, uniform convergence, and learnability. In Proceedings of the 1993 IEEE Symposium on Foundations of Computer Science, IEEE Press.

[2] Angluin, D. and Valiant, L. (1979). Fast probabilistic algorithms for Hamiltonian circuits and matchings. Journal of Computer and System Sciences, 18.

[3] Anthony, M., Bartlett, P. L., Ishai, Y. and Shawe-Taylor, J. (1994). Valid generalisation from approximate interpolation. To appear, Combinatorics, Probability and Computing.

[4] Anthony, M. and Biggs, N. (1992). Computational Learning Theory: An Introduction, Cambridge University Press.

[5] Bartlett, P. L., Long, P. M. and Williamson, R. C. (1994). Fat-shattering and the learnability of real-valued functions. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, ACM Press, New York.

[6] Ben-David, S., Cesa-Bianchi, N., Haussler, D. and Long, P. (1992). Characterizations of learnability for classes of {0, ..., n}-valued functions. Technical report. (An earlier version appeared in Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, ACM Press, New York.) To appear, Journal of Computer and System Sciences.

[7] Blumer, A., Ehrenfeucht, A., Haussler, D. and Warmuth, M. K. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4): 929-965.

[8] Dudley, R. M., Giné, E. and Zinn, J. (1991). Uniform and universal Glivenko-Cantelli classes. Journal of Theoretical Probability, 4: 485-510.

[9] Ehrenfeucht, A., Haussler, D., Kearns, M. and Valiant, L. (1989). A general lower bound on the number of examples needed for learning. Information and Computation, 82.

[10] Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100: 78-150.

[11] Kearns, M. J. and Schapire, R. E. (1990). Efficient distribution-free learning of probabilistic concepts. In Proceedings of the 1990 IEEE Symposium on Foundations of Computer Science, IEEE Press.

[12] Kolmogorov, A. N. and Tihomirov, V. M. (1961). ε-entropy and ε-capacity of sets in functional spaces. AMS Translations, Series 2, 17.

[13] McDiarmid, C. (1989). On the method of bounded differences. In J. Siemons, editor, Surveys in Combinatorics, 1989, London Mathematical Society Lecture Note Series (141). Cambridge University Press, Cambridge, UK.

[14] Pollard, D. (1984). Convergence of Stochastic Processes, Springer-Verlag.

[15] Simon, H. U. (1994). Bounds on the number of examples needed for learning functions. In Computational Learning Theory: EuroCOLT'93 (ed. J. Shawe-Taylor and M. Anthony), Oxford University Press.

[16] Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11): 1134-1142.

[17] Vapnik, V. N. and Chervonenkis, A. Ya. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2).


More information

Analysis on Graphs. Alexander Grigoryan Lecture Notes. University of Bielefeld, WS 2011/12

Analysis on Graphs. Alexander Grigoryan Lecture Notes. University of Bielefeld, WS 2011/12 Analysis on Graphs Alexander Grigoryan Lecture Notes University of Bielefeld, WS 0/ Contents The Laplace operator on graphs 5. The notion of a graph............................. 5. Cayley graphs..................................

More information

Simple Learning Algorithms for. Decision Trees and Multivariate Polynomials. xed constant bounded product distribution. is depth 3 formulas.

Simple Learning Algorithms for. Decision Trees and Multivariate Polynomials. xed constant bounded product distribution. is depth 3 formulas. Simple Learning Algorithms for Decision Trees and Multivariate Polynomials Nader H. Bshouty Department of Computer Science University of Calgary Calgary, Alberta, Canada Yishay Mansour Department of Computer

More information

March 16, Abstract. We study the problem of portfolio optimization under the \drawdown constraint" that the

March 16, Abstract. We study the problem of portfolio optimization under the \drawdown constraint that the ON PORTFOLIO OPTIMIZATION UNDER \DRAWDOWN" CONSTRAINTS JAKSA CVITANIC IOANNIS KARATZAS y March 6, 994 Abstract We study the problem of portfolio optimization under the \drawdown constraint" that the wealth

More information

The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors

The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors Martin Anthony Department of Mathematics London School of Economics and Political Science Houghton Street, London WC2A2AE

More information

On the Complexity ofteaching. AT&T Bell Laboratories WUCS August 11, Abstract

On the Complexity ofteaching. AT&T Bell Laboratories WUCS August 11, Abstract On the Complexity ofteaching Sally A. Goldman Department of Computer Science Washington University St. Louis, MO 63130 sg@cs.wustl.edu Michael J. Kearns AT&T Bell Laboratories Murray Hill, NJ 07974 mkearns@research.att.com

More information

Elementary 2-Group Character Codes. Abstract. In this correspondence we describe a class of codes over GF (q),

Elementary 2-Group Character Codes. Abstract. In this correspondence we describe a class of codes over GF (q), Elementary 2-Group Character Codes Cunsheng Ding 1, David Kohel 2, and San Ling Abstract In this correspondence we describe a class of codes over GF (q), where q is a power of an odd prime. These codes

More information

290 J.M. Carnicer, J.M. Pe~na basis (u 1 ; : : : ; u n ) consisting of minimally supported elements, yet also has a basis (v 1 ; : : : ; v n ) which f

290 J.M. Carnicer, J.M. Pe~na basis (u 1 ; : : : ; u n ) consisting of minimally supported elements, yet also has a basis (v 1 ; : : : ; v n ) which f Numer. Math. 67: 289{301 (1994) Numerische Mathematik c Springer-Verlag 1994 Electronic Edition Least supported bases and local linear independence J.M. Carnicer, J.M. Pe~na? Departamento de Matematica

More information

Learning by Canonical Smooth Estimation, Part II: Abstract. In this paper, we analyze the properties of a procedure for learning from examples.

Learning by Canonical Smooth Estimation, Part II: Abstract. In this paper, we analyze the properties of a procedure for learning from examples. Learning by Canonical Smooth Estimation, Part II: Learning and Choice of Model Complexity y Kevin L. Buescher z and P. R. Kumar x Abstract In this paper, we analyze the properties of a procedure for learning

More information

10.1 The Formal Model

10.1 The Formal Model 67577 Intro. to Machine Learning Fall semester, 2008/9 Lecture 10: The Formal (PAC) Learning Model Lecturer: Amnon Shashua Scribe: Amnon Shashua 1 We have see so far algorithms that explicitly estimate

More information

Asymptotics of generating the symmetric and alternating groups

Asymptotics of generating the symmetric and alternating groups Asymptotics of generating the symmetric and alternating groups John D. Dixon School of Mathematics and Statistics Carleton University, Ottawa, Ontario K2G 0E2 Canada jdixon@math.carleton.ca October 20,

More information

1 Introduction It will be convenient to use the inx operators a b and a b to stand for maximum (least upper bound) and minimum (greatest lower bound)

1 Introduction It will be convenient to use the inx operators a b and a b to stand for maximum (least upper bound) and minimum (greatest lower bound) Cycle times and xed points of min-max functions Jeremy Gunawardena, Department of Computer Science, Stanford University, Stanford, CA 94305, USA. jeremy@cs.stanford.edu October 11, 1993 to appear in the

More information

COMP9444: Neural Networks. Vapnik Chervonenkis Dimension, PAC Learning and Structural Risk Minimization

COMP9444: Neural Networks. Vapnik Chervonenkis Dimension, PAC Learning and Structural Risk Minimization : Neural Networks Vapnik Chervonenkis Dimension, PAC Learning and Structural Risk Minimization 11s2 VC-dimension and PAC-learning 1 How good a classifier does a learner produce? Training error is the precentage

More information

Back circulant Latin squares and the inuence of a set. L F Fitina, Jennifer Seberry and Ghulam R Chaudhry. Centre for Computer Security Research

Back circulant Latin squares and the inuence of a set. L F Fitina, Jennifer Seberry and Ghulam R Chaudhry. Centre for Computer Security Research Back circulant Latin squares and the inuence of a set L F Fitina, Jennifer Seberry and Ghulam R Chaudhry Centre for Computer Security Research School of Information Technology and Computer Science University

More information

Coins with arbitrary weights. Abstract. Given a set of m coins out of a collection of coins of k unknown distinct weights, we wish to

Coins with arbitrary weights. Abstract. Given a set of m coins out of a collection of coins of k unknown distinct weights, we wish to Coins with arbitrary weights Noga Alon Dmitry N. Kozlov y Abstract Given a set of m coins out of a collection of coins of k unknown distinct weights, we wish to decide if all the m given coins have the

More information

Multiclass Learnability and the ERM principle

Multiclass Learnability and the ERM principle Multiclass Learnability and the ERM principle Amit Daniely Sivan Sabato Shai Ben-David Shai Shalev-Shwartz November 5, 04 arxiv:308.893v [cs.lg] 4 Nov 04 Abstract We study the sample complexity of multiclass

More information

Computer Science Dept.

Computer Science Dept. A NOTE ON COMPUTATIONAL INDISTINGUISHABILITY 1 Oded Goldreich Computer Science Dept. Technion, Haifa, Israel ABSTRACT We show that following two conditions are equivalent: 1) The existence of pseudorandom

More information

1 GMD FIRST, Rudower Chaussee 5, Berlin, Germany 2 Department of Computer Science, Royal Holloway College, U

1 GMD FIRST, Rudower Chaussee 5, Berlin, Germany 2 Department of Computer Science, Royal Holloway College, U Generalization Bounds via Eigenvalues of the Gram Matrix Bernhard Scholkopf, GMD 1 John Shawe-Taylor, University of London 2 Alexander J. Smola, GMD 3 Robert C. Williamson, ANU 4 NeuroCOLT2 Technical Report

More information

The Perceptron algorithm vs. Winnow: J. Kivinen 2. University of Helsinki 3. M. K. Warmuth 4. University of California, Santa Cruz 5.

The Perceptron algorithm vs. Winnow: J. Kivinen 2. University of Helsinki 3. M. K. Warmuth 4. University of California, Santa Cruz 5. The Perceptron algorithm vs. Winnow: linear vs. logarithmic mistake bounds when few input variables are relevant 1 J. Kivinen 2 University of Helsinki 3 M. K. Warmuth 4 University of California, Santa

More information

Nets Hawk Katz Theorem. There existsaconstant C>so that for any number >, whenever E [ ] [ ] is a set which does not contain the vertices of any axis

Nets Hawk Katz Theorem. There existsaconstant C>so that for any number >, whenever E [ ] [ ] is a set which does not contain the vertices of any axis New York Journal of Mathematics New York J. Math. 5 999) {3. On the Self Crossing Six Sided Figure Problem Nets Hawk Katz Abstract. It was shown by Carbery, Christ, and Wright that any measurable set E

More information

Nader H. Bshouty Lisa Higham Jolanta Warpechowska-Gruca. Canada. (

Nader H. Bshouty Lisa Higham Jolanta Warpechowska-Gruca. Canada. ( Meeting Times of Random Walks on Graphs Nader H. Bshouty Lisa Higham Jolanta Warpechowska-Gruca Computer Science University of Calgary Calgary, AB, T2N 1N4 Canada (e-mail: fbshouty,higham,jolantag@cpsc.ucalgary.ca)

More information

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9 MAT 570 REAL ANALYSIS LECTURE NOTES PROFESSOR: JOHN QUIGG SEMESTER: FALL 204 Contents. Sets 2 2. Functions 5 3. Countability 7 4. Axiom of choice 8 5. Equivalence relations 9 6. Real numbers 9 7. Extended

More information

Garrett: `Bernstein's analytic continuation of complex powers' 2 Let f be a polynomial in x 1 ; : : : ; x n with real coecients. For complex s, let f

Garrett: `Bernstein's analytic continuation of complex powers' 2 Let f be a polynomial in x 1 ; : : : ; x n with real coecients. For complex s, let f 1 Bernstein's analytic continuation of complex powers c1995, Paul Garrett, garrettmath.umn.edu version January 27, 1998 Analytic continuation of distributions Statement of the theorems on analytic continuation

More information

2 Section 2 However, in order to apply the above idea, we will need to allow non standard intervals ('; ) in the proof. More precisely, ' and may gene

2 Section 2 However, in order to apply the above idea, we will need to allow non standard intervals ('; ) in the proof. More precisely, ' and may gene Introduction 1 A dierential intermediate value theorem by Joris van der Hoeven D pt. de Math matiques (B t. 425) Universit Paris-Sud 91405 Orsay Cedex France June 2000 Abstract Let T be the eld of grid-based

More information

1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer 11(2) (1989),

1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer 11(2) (1989), Real Analysis 2, Math 651, Spring 2005 April 26, 2005 1 Real Analysis 2, Math 651, Spring 2005 Krzysztof Chris Ciesielski 1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer

More information

MACHINE LEARNING. Vapnik-Chervonenkis (VC) Dimension. Alessandro Moschitti

MACHINE LEARNING. Vapnik-Chervonenkis (VC) Dimension. Alessandro Moschitti MACHINE LEARNING Vapnik-Chervonenkis (VC) Dimension Alessandro Moschitti Department of Information Engineering and Computer Science University of Trento Email: moschitti@disi.unitn.it Computational Learning

More information

growth rates of perturbed time-varying linear systems, [14]. For this setup it is also necessary to study discrete-time systems with a transition map

growth rates of perturbed time-varying linear systems, [14]. For this setup it is also necessary to study discrete-time systems with a transition map Remarks on universal nonsingular controls for discrete-time systems Eduardo D. Sontag a and Fabian R. Wirth b;1 a Department of Mathematics, Rutgers University, New Brunswick, NJ 08903, b sontag@hilbert.rutgers.edu

More information

The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors

The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors R u t c o r Research R e p o r t The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors Martin Anthony a Joel Ratsaby b RRR 17-2011, October 2011 RUTCOR Rutgers Center for Operations

More information

Markov processes Course note 2. Martingale problems, recurrence properties of discrete time chains.

Markov processes Course note 2. Martingale problems, recurrence properties of discrete time chains. Institute for Applied Mathematics WS17/18 Massimiliano Gubinelli Markov processes Course note 2. Martingale problems, recurrence properties of discrete time chains. [version 1, 2017.11.1] We introduce

More information

Numerical Analysis: Interpolation Part 1

Numerical Analysis: Interpolation Part 1 Numerical Analysis: Interpolation Part 1 Computer Science, Ben-Gurion University (slides based mostly on Prof. Ben-Shahar s notes) 2018/2019, Fall Semester BGU CS Interpolation (ver. 1.00) AY 2018/2019,

More information

The Decision List Machine

The Decision List Machine The Decision List Machine Marina Sokolova SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 sokolova@site.uottawa.ca Nathalie Japkowicz SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 nat@site.uottawa.ca

More information

Neural Network Learning: Testing Bounds on Sample Complexity

Neural Network Learning: Testing Bounds on Sample Complexity Neural Network Learning: Testing Bounds on Sample Complexity Joaquim Marques de Sá, Fernando Sereno 2, Luís Alexandre 3 INEB Instituto de Engenharia Biomédica Faculdade de Engenharia da Universidade do

More information

Measurable Choice Functions

Measurable Choice Functions (January 19, 2013) Measurable Choice Functions Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/fun/choice functions.pdf] This note

More information

ON A CLASS OF APERIODIC SUM-FREE SETS NEIL J. CALKIN PAUL ERD OS. that is that there are innitely many n not in S which are not a sum of two elements

ON A CLASS OF APERIODIC SUM-FREE SETS NEIL J. CALKIN PAUL ERD OS. that is that there are innitely many n not in S which are not a sum of two elements ON A CLASS OF APERIODIC SUM-FREE SETS NEIL J. CALKIN PAUL ERD OS Abstract. We show that certain natural aperiodic sum-free sets are incomplete, that is that there are innitely many n not in S which are

More information

Introduction and Preliminaries

Introduction and Preliminaries Chapter 1 Introduction and Preliminaries This chapter serves two purposes. The first purpose is to prepare the readers for the more systematic development in later chapters of methods of real analysis

More information

Agnostic Online learnability

Agnostic Online learnability Technical Report TTIC-TR-2008-2 October 2008 Agnostic Online learnability Shai Shalev-Shwartz Toyota Technological Institute Chicago shai@tti-c.org ABSTRACT We study a fundamental question. What classes

More information

Lecture 5: Efficient PAC Learning. 1 Consistent Learning: a Bound on Sample Complexity

Lecture 5: Efficient PAC Learning. 1 Consistent Learning: a Bound on Sample Complexity Universität zu Lübeck Institut für Theoretische Informatik Lecture notes on Knowledge-Based and Learning Systems by Maciej Liśkiewicz Lecture 5: Efficient PAC Learning 1 Consistent Learning: a Bound on

More information

Learning Recursive Functions From Approximations

Learning Recursive Functions From Approximations Sacred Heart University DigitalCommons@SHU Computer Science & Information Technology Faculty Publications Computer Science & Information Technology 8-1997 Learning Recursive Functions From Approximations

More information

A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997

A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997 A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997 Vasant Honavar Artificial Intelligence Research Laboratory Department of Computer Science

More information

Constructing c-ary Perfect Factors

Constructing c-ary Perfect Factors Constructing c-ary Perfect Factors Chris J. Mitchell Computer Science Department Royal Holloway University of London Egham Hill Egham Surrey TW20 0EX England. Tel.: +44 784 443423 Fax: +44 784 443420 Email:

More information

A version of for which ZFC can not predict a single bit Robert M. Solovay May 16, Introduction In [2], Chaitin introd

A version of for which ZFC can not predict a single bit Robert M. Solovay May 16, Introduction In [2], Chaitin introd CDMTCS Research Report Series A Version of for which ZFC can not Predict a Single Bit Robert M. Solovay University of California at Berkeley CDMTCS-104 May 1999 Centre for Discrete Mathematics and Theoretical

More information

The Vapnik-Chervonenkis Dimension: Information versus Complexity in Learning

The Vapnik-Chervonenkis Dimension: Information versus Complexity in Learning REVlEW The Vapnik-Chervonenkis Dimension: Information versus Complexity in Learning Yaser S. Abu-Mostafa California Institute of Terhnohgy, Pasadcna, CA 91 125 USA When feasible, learning is a very attractive

More information

Empirical Processes: General Weak Convergence Theory

Empirical Processes: General Weak Convergence Theory Empirical Processes: General Weak Convergence Theory Moulinath Banerjee May 18, 2010 1 Extended Weak Convergence The lack of measurability of the empirical process with respect to the sigma-field generated

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

MACHINE LEARNING - CS671 - Part 2a The Vapnik-Chervonenkis Dimension

MACHINE LEARNING - CS671 - Part 2a The Vapnik-Chervonenkis Dimension MACHINE LEARNING - CS671 - Part 2a The Vapnik-Chervonenkis Dimension Prof. Dan A. Simovici UMB Prof. Dan A. Simovici (UMB) MACHINE LEARNING - CS671 - Part 2a The Vapnik-Chervonenkis Dimension 1 / 30 The

More information

Stochastic Processes

Stochastic Processes Introduction and Techniques Lecture 4 in Financial Mathematics UiO-STK4510 Autumn 2015 Teacher: S. Ortiz-Latorre Stochastic Processes 1 Stochastic Processes De nition 1 Let (E; E) be a measurable space

More information

Multiclass Learnability and the ERM Principle

Multiclass Learnability and the ERM Principle Journal of Machine Learning Research 6 205 2377-2404 Submitted 8/3; Revised /5; Published 2/5 Multiclass Learnability and the ERM Principle Amit Daniely amitdaniely@mailhujiacil Dept of Mathematics, The

More information

Hence C has VC-dimension d iff. π C (d) = 2 d. (4) C has VC-density l if there exists K R such that, for all

Hence C has VC-dimension d iff. π C (d) = 2 d. (4) C has VC-density l if there exists K R such that, for all 1. Computational Learning Abstract. I discuss the basics of computational learning theory, including concept classes, VC-dimension, and VC-density. Next, I touch on the VC-Theorem, ɛ-nets, and the (p,

More information

Entrance Exam, Real Analysis September 1, 2017 Solve exactly 6 out of the 8 problems

Entrance Exam, Real Analysis September 1, 2017 Solve exactly 6 out of the 8 problems September, 27 Solve exactly 6 out of the 8 problems. Prove by denition (in ɛ δ language) that f(x) = + x 2 is uniformly continuous in (, ). Is f(x) uniformly continuous in (, )? Prove your conclusion.

More information

Online Learning: Random Averages, Combinatorial Parameters, and Learnability

Online Learning: Random Averages, Combinatorial Parameters, and Learnability Online Learning: Random Averages, Combinatorial Parameters, and Learnability Alexander Rakhlin Department of Statistics University of Pennsylvania Karthik Sridharan Toyota Technological Institute at Chicago

More information

Theorem 1 can be improved whenever b is not a prime power and a is a prime divisor of the base b. Theorem 2. Let b be a positive integer which is not

Theorem 1 can be improved whenever b is not a prime power and a is a prime divisor of the base b. Theorem 2. Let b be a positive integer which is not A RESULT ON THE DIGITS OF a n R. Blecksmith*, M. Filaseta**, and C. Nicol Dedicated to the memory of David R. Richman. 1. Introduction Let d r d r,1 d 1 d 0 be the base b representation of a positive integer

More information

Some Combinatorial Aspects of Time-Stamp Systems. Unite associee C.N.R.S , cours de la Liberation F TALENCE.

Some Combinatorial Aspects of Time-Stamp Systems. Unite associee C.N.R.S , cours de la Liberation F TALENCE. Some Combinatorial spects of Time-Stamp Systems Robert Cori Eric Sopena Laboratoire Bordelais de Recherche en Informatique Unite associee C.N.R.S. 1304 351, cours de la Liberation F-33405 TLENCE bstract

More information

Bounds on the generalised acyclic chromatic numbers of bounded degree graphs

Bounds on the generalised acyclic chromatic numbers of bounded degree graphs Bounds on the generalised acyclic chromatic numbers of bounded degree graphs Catherine Greenhill 1, Oleg Pikhurko 2 1 School of Mathematics, The University of New South Wales, Sydney NSW Australia 2052,

More information

Lexington, MA September 18, Abstract

Lexington, MA September 18, Abstract October 1990 LIDS-P-1996 ACTIVE LEARNING USING ARBITRARY BINARY VALUED QUERIES* S.R. Kulkarni 2 S.K. Mitterl J.N. Tsitsiklisl 'Laboratory for Information and Decision Systems, M.I.T. Cambridge, MA 02139

More information