Martin Anthony. The London School of Economics and Political Science, Houghton Street, London WC2A 2AE, United Kingdom.
Function Learning from Interpolation¹

Martin Anthony
Department of Mathematics, The London School of Economics and Political Science, Houghton Street, London WC2A 2AE, United Kingdom. anthony@vax.lse.ac.uk

Peter Bartlett
Department of Systems Engineering, Research School of Information Sciences and Engineering, The Australian National University, Canberra 0200, Australia. Peter.Bartlett@anu.edu.au

NeuroCOLT Technical Report Series, NC-TR, October.² Produced as part of the ESPRIT Working Group in Neural and Computational Learning, NeuroCOLT 8556. NeuroCOLT Coordinating Partner: Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, England. For more information contact John Shawe-Taylor at the above address or neurocolt@dcs.rhbnc.ac.uk.
¹ The research reported here was conducted while Martin Anthony was visiting the Department of Systems Engineering, Research School of Information Sciences and Engineering, The Australian National University.
² Received October 11.
Abstract

In this paper, we study a statistical property of classes of real-valued functions that we call approximation from interpolated examples. We derive a characterization of the function classes that have this property in terms of their `fat-shattering function', a notion that has proven useful in computational learning theory. We then discuss the implications of approximation from interpolated examples for function learning. Specifically, it can be interpreted as a problem of learning real-valued functions from random examples in which we require satisfactory performance from every algorithm that returns a function approximately interpolating the training examples.

1 Introduction

In the problem of learning a real-valued function from examples, a learner sees a sequence of values of an unknown function at a number of randomly chosen points. On the basis of these examples, the learner chooses a function (called a hypothesis) from some class $H$ of hypotheses, with the aim that the hypothesis be close to the target function on future random examples. In this paper we require that, for most training samples, with high probability the absolute difference between the values of the learner's hypothesis and of the target function at a random point is small. A natural learning algorithm to consider is one that chooses a function in $H$ that is close to the target function on the training examples. This poses the following statistical problem: for which function classes $H$ will any function in $H$ that approximately interpolates the target function on the training examples probably have small absolute error? More precisely, we have the following definition of approximation from interpolated examples.

Definition 1 Let $C, H$ be sets of functions that map from a set $X$ to $\mathbb{R}$.
We say that $H$ approximates $C$ from interpolated examples if for all $\eta, \gamma, \epsilon, \delta \in (0,1)$ there is $m_0(\eta, \gamma, \epsilon, \delta)$ such that, for every $t \in C$ and for every probability measure¹ $P$ on $X$, if $m \geq m_0(\eta, \gamma, \epsilon, \delta)$, then with $P^m$-probability at least $1 - \delta$, the sample $x = (x_1, x_2, \ldots, x_m) \in X^m$ has the property that if $h \in H$ and $|h(x_i) - t(x_i)| < \eta$ for $1 \leq i \leq m$, then
$$P\left(\{x \in X : |h(x) - t(x)| \geq \eta + \gamma\}\right) < \epsilon.$$
We say that $m_0(\eta, \gamma, \epsilon, \delta)$ is a sufficient sample length function for $H$ to approximate $C$ from interpolated examples.

¹ More formally, one has a fixed $\sigma$-algebra $\Sigma$ on $X$: when $X$ is countable this is $2^X$, and when $X \subseteq \mathbb{R}^n$, it is the Borel $\sigma$-algebra. Then, by "any probability measure on $X$," we mean "any probability measure on $\Sigma$," where $\Sigma$, the fixed $\sigma$-algebra, is understood. The class $H$ must have some fairly benign measurability properties; we refer to [14, 10] for details.
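To make the quantifiers in the definition concrete, here is a minimal sketch (all function and variable names are our own, hypothetical choices, not from the paper): the event in the definition checks whether an $\eta$-interpolant of $t$ on the sample also has small error mass, which can be estimated by Monte Carlo for a given distribution.

```python
import random

def interpolates(h, t, sample, eta):
    """True iff h eta-interpolates t on the sample: |h(x_i) - t(x_i)| < eta for all i."""
    return all(abs(h(x) - t(x)) < eta for x in sample)

def error_mass(h, t, eta, gamma, draw, trials=10_000):
    """Monte Carlo estimate of P({x : |h(x) - t(x)| >= eta + gamma})."""
    bad = 0
    for _ in range(trials):
        x = draw()
        if abs(h(x) - t(x)) >= eta + gamma:
            bad += 1
    return bad / trials

# Example: t is identically zero, h a small constant offset smaller than eta.
random.seed(0)
t = lambda x: 0.0
h = lambda x: 0.05
print(interpolates(h, t, [0.1, 0.5, 0.9], eta=0.1))              # h eta-interpolates t
print(error_mass(h, t, eta=0.1, gamma=0.2, draw=random.random))  # offset never reaches eta+gamma
```

The definition demands that, for large enough samples, every $h$ passing the first check also passes the second, simultaneously for all interpolants.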
Two cases of particular interest are those in which $C = H$ and $C = \mathbb{R}^X$, the set of all functions from $X$ to $\mathbb{R}$. If $H$ approximates $H$ from interpolated examples, we simply say that $H$ approximates from interpolated examples. The main aim of this paper is to find characterizations of the classes which approximate from interpolated examples and of those which approximate $\mathbb{R}^X$ from interpolated examples. While this problem may be of independent interest, it can be interpreted as a learning problem in which we require satisfactory performance from every algorithm that returns a function approximately interpolating the training examples. If, instead of requiring that all algorithms in this class be suitable, we require only the existence of a suitable algorithm, no necessary and sufficient conditions on the function class $H$ are known. Because an arbitrary amount of information can be conveyed in a single real value, it is possible to construct complicated function classes in which the identity of a function is encoded in its value at every point, and an algorithm can take advantage of this (see [5]). We can avoid this unnatural "conspiracy" between algorithms and function classes in two ways: by requiring that the algorithm be robust in the presence of random observation noise, as was considered in [5], or by requiring satisfactory performance of every algorithm in a class of reasonable algorithms, as we consider here. Results in this paper show that these two approaches are equivalent. Another reason for studying the problem of this paper is that it has implications for learning in the presence of malicious noise, in which the labels on the training sample can be any real numbers within $\eta$ of the true values of the target. This will be discussed later in the paper; for the moment, simply observe that if $h$ is $\eta$-close to a training sample in which the labels have been corrupted to a level of at most $\eta$, then $h$ is certainly $2\eta$-close to the target on the sample.
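The observation about corrupted labels is just the triangle inequality; a minimal sketch (our own illustration, with hypothetical names and noise values):

```python
def eta_close(h, xs, ys, eta):
    """Is h within eta of the labels ys at every sample point?"""
    return all(abs(h(x) - y) < eta for x, y in zip(xs, ys))

# Target t, with labels corrupted by additive noise bounded in magnitude by eta.
eta = 0.1
t = lambda x: 0.5
xs = [0.1, 0.2, 0.3]
ys = [t(x) + n for x, n in zip(xs, (0.05, -0.05, 0.09))]  # each |n_i| < eta

h = lambda x: 0.5  # any hypothesis eta-close to the corrupted labels ...
assert eta_close(h, xs, ys, eta)
# ... is 2*eta-close to the target on the sample: |h - t| <= |h - y| + |y - t| < 2*eta
assert all(abs(h(x) - t(x)) < 2 * eta for x in xs)
```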
If $H$ approximates from interpolated examples, we can then deduce that if the sample is large enough then (with high probability) $h$ is within $2\eta + \gamma$ of the target on `most' of $X$. In the remainder of this section, we give an example of a function class that approximates from interpolated examples. In the next section, we define a measure of the complexity of a class $H$ of functions (the fat-shattering function), and we state the main result: that the fat-shattering function is the key quantity in this problem. In Sections 3 and 4 we give upper and lower bounds on the number of examples necessary for approximation from interpolated examples. Section 5 discusses the gap between these bounds and describes the implications for learning with malicious noise. Section 6 describes the relationship between the problem of approximation from interpolated examples and a related problem (the case $\gamma = 0$ in Definition 1) studied in [3]. By way of motivation, we first give upper and lower bounds on the number of examples necessary for approximation from interpolated examples with a particular function class, a class of Lipschitz-continuous functions.

Proposition 2 Suppose that $k > 0$. Let $H$ be the class of all $k$-Lipschitz-continuous functions from $[0,1]$ to $[0,1]$. Thus, for each $h \in H$ and for all
$x, y \in [0,1]$, $|h(x) - h(y)| \leq k|x - y|$. Then $H$ approximates from interpolated examples, and a sufficient sample length function is
$$m_0(\eta, \gamma, \epsilon, \delta) = \frac{2k}{\gamma\epsilon}\ln\left(\frac{4k}{\gamma\epsilon\delta}\right) + 1.$$
Furthermore, for $\gamma \leq k/4$, $\eta \leq 1/2$, and $\epsilon \leq 1/13$, any sufficient sample length function must satisfy
$$m_0(\eta, \gamma, \epsilon, \delta) \geq \frac{1}{8\epsilon}\left(\frac{k}{2\gamma} - 2\right)$$
for all $\delta \in (0,1)$.

Proof: Suppose that $x = (x_1, x_2, \ldots, x_m) \in [0,1]^m$ and that, without loss of generality, $0 \leq x_1 \leq x_2 \leq \cdots \leq x_m \leq 1$. Denote by $I_0$ the closed interval $[0, x_1]$ and by $I_m$ the interval $[x_m, 1]$, and for $j$ between $1$ and $m-1$ let $I_j = [x_j, x_{j+1}]$. Let $\eta, \gamma, \epsilon, \delta \in (0,1)$ be given and suppose that $P$ is any probability distribution on $[0,1]$. For $\alpha \in (0,1)$, let
$$B_\alpha = \left\{x \in [0,1]^m : \max_{0 \leq i \leq m} P(I_i) > \alpha\right\}.$$
Then
$$P^m(B_\alpha) \leq P^m\left\{x \in [0,1]^m : \exists\, 0 \leq i \leq m,\ \forall j \neq i,\ x_j \notin S(x_i, \alpha)\right\},$$
where $S(x_i, \alpha)$ is the shortest interval $[x_i, z_i]$ satisfying $P([x_i, z_i]) \geq \alpha$ (or $[x_i, 1]$ if there is no such interval), and $x_0 = 0$. So
$$P^m(B_\alpha) \leq (m+1)\max_{0 \leq i \leq m} P^m\{x : \forall j \neq i,\ x_j \notin S(x_i, \alpha)\} \leq (m+1)\sup_{\xi \in [0,1]} P^{m-1}\left\{x \in [0,1]^{m-1} : \forall j,\ x_j \notin S(\xi, \alpha)\right\} \leq (m+1)(1-\alpha)^{m-1}.$$
Choosing $\alpha = \frac{1}{m-1}\ln\frac{m+1}{\delta}$, and using the fact that $(1 - \beta/n)^n < e^{-\beta}$, shows that this probability is no more than $\delta$. Now, there are at most $k/\gamma$ intervals $I_i$ of length at least $\gamma/k$. Suppose that $t, h \in H$. If the interval $I_i$ has length less than $\gamma/k$ and $|h(x_i) - t(x_i)| < \eta$, $|h(x_{i+1}) - t(x_{i+1})| < \eta$, then, for all $x$ in the interval $I_i$, $|h(x) - t(x)| < \eta + \gamma$. Indeed, suppose $x \in I_i$ and $|x - x_i| \leq \gamma/(2k)$. (A similar argument, using the other endpoint, applies if $|x - x_i| > \gamma/(2k)$.) Then, since $t$ and $h$ are $k$-Lipschitz-continuous, we have
$$|(h(x) - t(x)) - (h(x_i) - t(x_i))| \leq |h(x) - h(x_i)| + |t(x) - t(x_i)| \leq k\frac{\gamma}{2k} + k\frac{\gamma}{2k} = \gamma,$$
so that $|h(x) - t(x)| < \eta + \gamma$. It follows that, with probability at least $1 - \delta$,
$$P\left(\{x : |h(x) - t(x)| \geq \eta + \gamma\}\right) \leq \left|\left\{i : \mathrm{length}(I_i) \geq \gamma/k\right\}\right|\max_{0 \leq j \leq m} P(I_j) \leq \frac{k}{\gamma}\cdot\frac{1}{m-1}\ln\left(\frac{m+1}{\delta}\right),$$
which is less than $\epsilon$ provided
$$m > \frac{k}{\gamma\epsilon}\ln\left(\frac{2m}{\delta}\right) + 1.$$
Using the fact that $\ln(\alpha x) \leq \alpha x - 1$ for $\alpha, x > 0$, with $\alpha = \gamma\epsilon/(4k)$, shows that
$$m \geq \frac{2k}{\gamma\epsilon}\ln\left(\frac{4k}{\gamma\epsilon\delta}\right) + 1$$
will suffice. The fact that $H$ approximates from interpolated examples, and the upper bound on the sufficient sample length, follow.

We now obtain the lower bound, by an argument similar to one given in [9]. Suppose that $\gamma \leq k/4$ and let $r = \lfloor k/(2\gamma)\rfloor - 1 \geq 1$. Let $X_0 = \{y_0, y_1, \ldots, y_r\}$, where $y_0 = 0$ and, for $1 \leq i \leq r$, $y_i = 2i\gamma/k$. Define a probability distribution $P$ on $[0,1]$ as follows:
$$P(x) = \begin{cases} 1 - 2\epsilon & x = y_0 \\ 2\epsilon/r & x \in \{y_1, \ldots, y_r\} \\ 0 & \text{otherwise.} \end{cases}$$
Let $t$ be the identically-zero function on $[0,1]$. Now, suppose a sample $x = (x_1, x_2, \ldots, x_m) \in X_0^m$ is given and that $Y = \{y_1, y_2, \ldots, y_r\} \setminus \{x_1, x_2, \ldots, x_m\} \neq \emptyset$. Let $\lambda > 0$ be such that $\lambda \leq \min(\eta, \gamma)$. Then there is a function $h \in H$ such that $h(x_i) = \eta - \lambda$ for $1 \leq i \leq m$ but such that, for each $y \in Y$, $h(y) = \eta + \gamma$. Indeed, let $h(x) = \min(g(x), 1)$, where $g$ is defined as follows. For $1 \leq j \leq r$ and $x \in [y_{j-1}, y_j]$ (with $g$ constant beyond $y_r$), let
$$g(x) = \begin{cases} \eta - \lambda & y_{j-1} \notin Y \text{ and } y_j \notin Y \\ (\eta - \lambda) + (x - y_{j-1})\,k(\gamma + \lambda)/(2\gamma) & y_{j-1} \notin Y \text{ and } y_j \in Y \\ \eta + \gamma & y_{j-1} \in Y \text{ and } y_j \in Y \\ (\eta + \gamma) - (x - y_{j-1})\,k(\gamma + \lambda)/(2\gamma) & y_{j-1} \in Y \text{ and } y_j \notin Y. \end{cases}$$
Let $Q \subseteq X_0^m$ be the set of samples $x$ of length $m$ containing fewer than $r/2$ of the points $\{y_1, y_2, \ldots, y_r\}$. Then we see that, for $x \in Q$, there is a function $h \in H$ such that $|h(x_i) - t(x_i)| < \eta$ for $1 \leq i \leq m$, and
$$P\left(\{x \in [0,1] : |h(x) - t(x)| \geq \eta + \gamma\}\right) \geq \frac{2\epsilon}{r}\left|\{y_1, y_2, \ldots, y_r\} \setminus \{x_1, x_2, \ldots, x_m\}\right| > \epsilon.$$
Now, the probability of drawing a sample of length $m$ that is not in $Q$ is at most the probability of at least $r/2$ successes in a sequence of $m$ Bernoulli trials, where the probability of success at each trial is $2\epsilon$. From standard Chernoff bounds on the tails of the binomial distribution (see [2, 13]), this probability is no more than
$$\exp\left(-\frac{2m\epsilon}{3}\left(\frac{r}{4m\epsilon} - 1\right)^2\right),$$
and this is less than $1 - \delta$ when
$$m < \frac{3}{2\epsilon}\left(\frac{r}{6} - \ln\frac{1}{1-\delta}\right).$$
For $\epsilon \leq 1/13$, $m < r/(8\epsilon)$ will suffice, from which the result follows. $\Box$

2 Definitions and the Main Result

A number of ways of measuring the `expressive power' of a class $H$ of functions have been proposed. This power is quantified by associating a `dimension' with the class. Sometimes this is simply one number depending on $H$; sometimes, in what is known as a scale-sensitive dimension, it is a function depending on $H$. An important example of the first type of dimension is the pseudo-dimension [10, 14]. We say that a finite subset $S = \{x_1, x_2, \ldots, x_d\}$ of $X$ is shattered if there is $r = (r_1, r_2, \ldots, r_d) \in \mathbb{R}^d$ such that for every $b = (b_1, b_2, \ldots, b_d) \in \{0,1\}^d$, there is a function $h_b \in H$ with $h_b(x_i) > r_i$ if $b_i = 1$ and $h_b(x_i) < r_i$ if $b_i = 0$. The pseudo-dimension of $H$, denoted $\mathrm{Pdim}(H)$, is the largest cardinality of a shattered set, or infinity if there is no bound on the cardinalities of the shattered sets. The pseudo-dimension is a well-understood and useful measure of expressive power. One attractive feature of this dimension is that if the set of functions is a vector space, then its pseudo-dimension coincides with its linear dimension (see, for example, [10]).
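For a small finite class over a finite domain, the shattering condition behind the pseudo-dimension can be checked exhaustively. The following sketch is our own illustration (the names and the midpoint witness grid are hypothetical choices, not from the paper); it searches witness vectors $r$ among midpoints of adjacent attained values, which suffices because only the ordering of each $h(x_i)$ relative to $r_i$ matters.

```python
from itertools import combinations, product

def shatters(fs, pts, witnesses):
    """Check whether fs realizes all 2^d above/below patterns around the witnesses."""
    patterns = set()
    for f in fs:
        # ties count as "below"; they arise only at constant points, which never shatter
        patterns.add(tuple(1 if f(x) > r else 0 for x, r in zip(pts, witnesses)))
    return len(patterns) == 2 ** len(pts)

def pseudo_dimension(fs, domain):
    """Brute-force Pdim of a finite class of real-valued functions on a finite domain."""
    best = 0
    for d in range(1, len(domain) + 1):
        found = False
        for pts in combinations(domain, d):
            # candidate witnesses at each point: midpoints of adjacent attained values
            cands = []
            for x in pts:
                vals = sorted({f(x) for f in fs})
                mids = [(a + b) / 2 for a, b in zip(vals, vals[1:])]
                cands.append(mids or vals)  # constant point: no separation possible
            if any(shatters(fs, pts, r) for r in product(*cands)):
                found = True
                break
        if not found:
            break
        best = d
    return best

# Threshold functions x -> [x >= theta] have pseudo-dimension 1 ...
thresholds = [lambda x, th=th: 1.0 if x >= th else 0.0 for th in (-0.5, 0.5, 1.5, 2.5)]
# ... while affine functions x -> a*x + b (a 2-dimensional vector space) have pseudo-dimension 2.
affine = [lambda x, a=a, b=b: a * x + b for a in (-1, 0, 1) for b in (-1, 0, 1)]
print(pseudo_dimension(thresholds, [0, 1, 2]), pseudo_dimension(affine, [0, 1, 2]))  # -> 1 2
```

The affine example illustrates the vector-space property mentioned above: the computed pseudo-dimension equals the linear dimension, 2.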
Perhaps the most important scale-sensitive dimension used to date in the development of the theory of learning real-valued functions is the fat-shattering function. This is a scale-sensitive version of the pseudo-dimension, and it was introduced by Kearns and Schapire [11]. Suppose that $H$ is a set of functions from $X$ to $[0,1]$ and that $\gamma \in (0,1)$. We say that a finite subset $S = \{x_1, x_2, \ldots, x_d\}$ of $X$ is $\gamma$-shattered if there is $r = (r_1, r_2, \ldots, r_d) \in \mathbb{R}^d$ such that for every $b = (b_1, b_2, \ldots, b_d) \in \{0,1\}^d$, there is a function $h_b \in H$
with $h_b(x_i) \geq r_i + \gamma$ if $b_i = 1$ and $h_b(x_i) \leq r_i - \gamma$ if $b_i = 0$. Thus, $S$ is $\gamma$-shattered if it is shattered with a `width of shattering' of at least $\gamma$. We define the fat-shattering function, $\mathrm{fat}_H : \mathbb{R}^+ \to \mathbb{N}_0 \cup \{\infty\}$, as
$$\mathrm{fat}_H(\gamma) = \max\{|S| : S \subseteq X \text{ is } \gamma\text{-shattered by } H\},$$
or $\mathrm{fat}_H(\gamma) = \infty$ if the maximum does not exist. (Here, $\mathbb{N}_0$ denotes the set of nonnegative integers.) It is easy to see that $\mathrm{Pdim}(H) = \lim_{\gamma \to 0}\mathrm{fat}_H(\gamma)$. It should be noted, however, that it is possible for the pseudo-dimension to be infinite even when $\mathrm{fat}_H(\gamma)$ is finite for all $\gamma$. We shall say that $H$ has finite fat-shattering function whenever $\mathrm{fat}_H(\gamma)$ is finite for all $\gamma \in (0,1)$.

The fat-shattering function plays an important role in the learning theory of real-valued functions. Kearns and Schapire [11] proved that if a class of probabilistic concepts is learnable, then the class has finite fat-shattering function. (A probabilistic concept $f$ is a $[0,1]$-valued function. In this model, the learner sees examples $(x_i, y_i)$, where $\Pr(y_i = 1) = f(x_i)$.) Alon et al. [1] proved, conversely, that if a class of probabilistic concepts has finite fat-shattering function, then it is learnable. The main result in [5] is that finiteness of the fat-shattering function of a class of $[0,1]$-valued functions is a necessary and sufficient condition for learning with random observation noise. Our main result is the following.

Theorem 3 Suppose that $H$ is a set of functions from a set $X$ to $[0,1]$. Then the following propositions are equivalent.
1. $H$ approximates from interpolated examples.
2. $H$ approximates $\mathbb{R}^X$ from interpolated examples.
3. $H$ has finite fat-shattering function.

3 Upper bound

In this section, we prove that a finite fat-shattering function is a sufficient condition for approximation from interpolated examples, and we provide a suitable sample length function $m_0(\eta, \gamma, \epsilon, \delta)$. We first need the notion of covering numbers, as used extensively in [10, 1, 8], for instance.
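Like the pseudo-dimension, the $\gamma$-shattering condition can be checked exhaustively for a small finite class. The sketch below is our own hypothetical illustration (witnesses are restricted to a fixed grid, which is an assumption for tractability, not part of the definition); it shows the scale sensitivity: the dimension collapses once the width $2\gamma$ exceeds the spread of the attained function values.

```python
from itertools import combinations, product

def gamma_shatters(fs, pts, witnesses, gamma):
    """Are all 2^d patterns realized with margin gamma around the witnesses?"""
    patterns = set()
    for f in fs:
        pat = []
        for x, r in zip(pts, witnesses):
            if f(x) >= r + gamma:
                pat.append(1)
            elif f(x) <= r - gamma:
                pat.append(0)
            else:
                pat.append(None)  # inside the margin: realizes no pattern here
        if None not in pat:
            patterns.add(tuple(pat))
    return len(patterns) == 2 ** len(pts)

def fat_shattering(fs, domain, gamma, grid):
    """Brute-force fat_H(gamma), with witness values restricted to `grid`."""
    best = 0
    for d in range(1, len(domain) + 1):
        if any(gamma_shatters(fs, pts, r, gamma)
               for pts in combinations(domain, d)
               for r in product(grid, repeat=d)):
            best = d
        else:
            break
    return best

# Four functions on {0, 1}, taking values 0.2 or 0.8 independently at each point.
fs = [lambda x, u=u, v=v: u if x == 0 else v for u in (0.2, 0.8) for v in (0.2, 0.8)]
grid = [0.3, 0.4, 0.5, 0.6, 0.7]
print(fat_shattering(fs, [0, 1], 0.25, grid))  # width 0.25 fits inside the 0.6 spread -> 2
print(fat_shattering(fs, [0, 1], 0.35, grid))  # width 0.35 does not fit -> 0
```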
Suppose that $(A, d)$ is a pseudo-metric space and $\epsilon > 0$. Then a subset $N$ of $A$ is said to be an $\epsilon$-cover for a subset $B$ of $A$ if, for every $x \in B$, there is an $\hat{x} \in N$ such that $d(x, \hat{x}) < \epsilon$. The space is totally bounded if there is a finite $\epsilon$-cover for $A$, for all $\epsilon > 0$. When $(A, d)$ is totally bounded, we shall denote the minimal cardinality of an $\epsilon$-cover for $A$ by $\mathcal{N}_A(\epsilon, d)$, for $\epsilon > 0$. A subset $M$ of $A$ is said to be $\epsilon$-separated if, for all distinct $x, y \in M$,
$d(x, y) \geq \epsilon$. We shall denote the maximal cardinality of an $\epsilon$-separated subset of $A$ by $\mathcal{M}_A(\epsilon, d)$. It is easy to show that $\mathcal{M}_A(2\epsilon, d) \leq \mathcal{N}_A(\epsilon, d) \leq \mathcal{M}_A(\epsilon, d)$ (see [12]), so $\mathcal{M}_A(\epsilon, d)$ is always defined if $(A, d)$ is totally bounded. Suppose now that $H$ is a set of functions from a set $X$ to $[0,1]$ and that $x = (x_1, x_2, \ldots, x_m) \in X^m$, where $m$ is a positive integer. We may define a pseudometric $l^\infty_x$ on $H$ as follows: for $g, h \in H$,
$$l^\infty_x(g, h) = \max_{1 \leq i \leq m}|g(x_i) - h(x_i)|.$$
(This metric has been used in [8, 1], for example.) Alon et al. [1] obtained (essentially) the following result, bounding the $l^\infty_x$-covering number of $H$ in terms of the fat-shattering function of $H$.

Lemma 4 Suppose that $H$ is a set of functions from $X$ to $[0,1]$ and that $H$ has finite fat-shattering function. Let $m \in \mathbb{N}$ and $x \in X^m$. Then the pseudometric space $(H, l^\infty_x)$ is totally bounded. Suppose $\epsilon > 0$. Let $d = \mathrm{fat}_H(\epsilon/4)$ and
$$y = \sum_{i=1}^{d}\binom{m}{i}\left(\frac{2}{\epsilon}\right)^i.$$
Then, provided $m \geq \log y + 1$,
$$\mathcal{N}_H(\epsilon, l^\infty_x) < 2\left(\frac{2m}{\epsilon^2}\right)^{\log y}.$$
Here, as elsewhere in the paper, $\log$ denotes logarithm to base 2. We then have the following result.

Theorem 5 Suppose that $H$ is a class of functions mapping from a domain $X$ to the real interval $[0,1]$ and that $H$ has finite fat-shattering function. Let $t$ be any function from $X$ to $\mathbb{R}$ and let $\eta, \gamma, \epsilon > 0$. Let $P$ be any probability distribution on $X$ and define $B$ to be the set of functions $h \in H$ for which
$$P\left(\{x \in X : |h(x) - t(x)| \geq \eta + \gamma\}\right) \geq \epsilon.$$
Let $d = \mathrm{fat}_H(\gamma/8)$ and let
$$y = \sum_{i=1}^{d}\binom{2m}{i}\left(\frac{4}{\gamma}\right)^i.$$
Then, for $m \geq \max\left(8/\epsilon,\ \log y + 1\right)$,
the probability
$$P^m\left(\{x \in X^m : \exists h \in B,\ |h(x_i) - t(x_i)| < \eta\ (1 \leq i \leq m)\}\right)$$
is at most
$$4\left(\frac{16m}{\gamma^2}\right)^{\log y} 2^{-\epsilon m/2}.$$

Proof: The proof is based on a technique analogous to that used in [17, 7, 10], where we `symmetrize' and then `combinatorially bound'. The first step, symmetrization, relates the desired probability to a `sample-based' one; this latter probability is then bounded using counting techniques. Fix $t$, $P$, $m$, the parameters $\eta, \gamma, \epsilon$, and hence the set $B$. We define two sets $Q$ and $R$ as follows. The subset $Q$ of $X^m$ is given by
$$Q = \{x \in X^m : \exists h \in B,\ |h(x_i) - t(x_i)| < \eta\ (1 \leq i \leq m)\}.$$
The subset $R$ of $X^{2m}$ is defined as follows, where $xy \in X^{2m}$ denotes the concatenation of $x, y \in X^m$:
$$R = \left\{xy \in X^{2m} : \exists h \in B,\ |h(x_i) - t(x_i)| < \eta\ (1 \leq i \leq m) \text{ and } |\{i : |h(y_i) - t(y_i)| \geq \eta + \gamma\}| > \epsilon m/2\right\}.$$
We claim that, for $m \geq 8/\epsilon$, $P^m(Q) \leq 2P^{2m}(R)$. To prove this claim, we first note that if $h \in B$ then, for $m \geq 8/\epsilon$, by Chebyshev's inequality (for example), the probability that $y \in X^m$ satisfies $|\{i : |h(y_i) - t(y_i)| \geq \eta + \gamma\}| \leq \epsilon m/2$ is at most $1/2$. The characteristic function $\chi_R$ of $R$ may be expressed as follows [4]:
$$\chi_R(xy) = \chi_Q(x)\,\chi_x(y),$$
where $\chi_Q$ is the characteristic function of $Q$ and the $\{0,1\}$-valued function $\chi_x$ is such that $\chi_x(y) = 1$ if and only if there is $h \in B$ such that $|h(x_i) - t(x_i)| < \eta$ for $1 \leq i \leq m$ and such that $|\{i : |h(y_i) - t(y_i)| \geq \eta + \gamma\}| > \epsilon m/2$. Then
$$P^{2m}(R) = \int \chi_R(xy)\,dP^{2m}(xy) = \int \chi_Q(x)\left(\int \chi_x(y)\,dP^m(y)\right)dP^m(x).$$
For fixed $x \in Q$ and for $m \geq 8/\epsilon$, the inner integral is at least $1/2$, by the observation above. It follows that
$$P^{2m}(R) \geq \frac{1}{2}\int \chi_Q(x)\,dP^m(x) = \frac{1}{2}P^m(Q),$$
as claimed. The next step is to bound the probability of $R$ using combinatorial techniques. For this, let $\Gamma$ be the `swapping group' [14] of permutations of the set $\{1, 2, \ldots, 2m\}$: the group generated by the transpositions $(i, m+i)$
for $1 \leq i \leq m$. The group $\Gamma$ acts in a natural way on vectors in $X^{2m}$: for $\sigma \in \Gamma$ and $z \in X^{2m}$, we define $\sigma z$ to be $(z_{\sigma(1)}, z_{\sigma(2)}, \ldots, z_{\sigma(2m)})$. Let
$$\Gamma(R, z) = |\{\sigma \in \Gamma : \sigma z \in R\}|$$
be the number of permutations in $\Gamma$ taking $z$ into $R$. Since $P^{2m}$ is a product distribution, we have $P^{2m}(R) = P^{2m}\{z : \sigma z \in R\}$ for each $\sigma \in \Gamma$, and hence
$$P^{2m}(R) = \frac{1}{|\Gamma|}\sum_{\sigma \in \Gamma}\int \chi_R(\sigma z)\,dP^{2m}(z) = \int \frac{1}{|\Gamma|}\sum_{\sigma \in \Gamma}\chi_R(\sigma z)\,dP^{2m}(z) = E_{P^{2m}}\left(\frac{\Gamma(R, z)}{2^m}\right),$$
where $E_{P^{2m}}$ denotes expectation with respect to $P^{2m}$. It follows from this that
$$P^{2m}(R) \leq \frac{1}{2^m}\max_{z \in X^{2m}}\Gamma(R, z).$$
Now, let us fix $z \in X^{2m}$ and consider the pseudo-metric space $(B, l^\infty_z)$. Since $H$ has finite fat-shattering function, so does $B$, and Lemma 4 implies that this pseudo-metric space is totally bounded. Let $N = \{\hat{h}_1, \hat{h}_2, \ldots, \hat{h}_n\}$ be a minimal $\gamma/2$-cover for $B$. From Lemma 4,
$$n < 2\left(\frac{16m}{\gamma^2}\right)^{\log y}.$$
Since $N$ is a $\gamma/2$-cover, given $h \in B$, there is $\hat{h} \in N$ such that $l^\infty_z(\hat{h}, h) < \gamma/2$, which means that $|\hat{h}(z_i) - h(z_i)| < \gamma/2$ for $1 \leq i \leq 2m$. Suppose that $z = xy \in R$. Then, by the definition of $R$, there is some $h \in B$ such that $|h(x_i) - t(x_i)| < \eta$ for $1 \leq i \leq m$ and such that, for more than $\epsilon m/2$ of the $y_i$, $|h(y_i) - t(y_i)| \geq \eta + \gamma$. But (taking $\hat{h}$ to be, as described above, a function in the cover $\gamma/2$-close to $h$) this implies that there is $\hat{h} \in N$ such that $|\hat{h}(x_i) - t(x_i)| < \eta + \gamma/2$ for $1 \leq i \leq m$ and such that, for more than $\epsilon m/2$ of the $y_i$, $|\hat{h}(y_i) - t(y_i)| \geq \eta + \gamma/2$. It follows from this that if $z \in R$, then, for some $l$ between $1$ and $n$, $z$ belongs to the set $\hat{R}_l$, defined as
$$\hat{R}_l = \left\{xy \in X^{2m} : |\hat{h}_l(x_i) - t(x_i)| < \eta + \gamma/2\ (1 \leq i \leq m) \text{ and } |\{i : |\hat{h}_l(y_i) - t(y_i)| \geq \eta + \gamma/2\}| > \epsilon m/2\right\}.$$
Let $\Gamma(\hat{R}_l, z)$ be the number of $\sigma \in \Gamma$ for which $\sigma z \in \hat{R}_l$. Since $\sigma z \in R$ implies $\sigma z \in \hat{R}_l$ for some $l$, we have
$$\Gamma(R, z) \leq \sum_{l=1}^{n}\Gamma(\hat{R}_l, z).$$
Consider a particular $l$ between $1$ and $n$ and suppose that $\Gamma(\hat{R}_l, z) \neq 0$. Let $k$ be the number of indices $i$ between $1$ and $2m$ such that $|\hat{h}_l(z_i) - t(z_i)| \geq \eta + \gamma/2$. Then $\epsilon m/2 < k \leq m$. The number of permutations $\sigma \in \Gamma$ for which $\sigma z$ belongs to $\hat{R}_l$ is then equal to $2^{m-k}$, which is less than $2^{m(1 - \epsilon/2)}$. (The pairs which can be `swapped' are precisely those $m - k$ pairs whose coordinates both satisfy $|\hat{h}_l(z_i) - t(z_i)| < \eta + \gamma/2$.) It follows that
$$P^{2m}(R) < \frac{1}{2^m}\sum_{l=1}^{n}2^{m(1-\epsilon/2)} = n\,2^{-\epsilon m/2} \leq 2\left(\frac{16m}{\gamma^2}\right)^{\log y}2^{-\epsilon m/2}.$$
The statement of the theorem now follows. $\Box$

It follows from this result that finiteness of $\mathrm{fat}_H$ implies that $H$ approximates $\mathbb{R}^X$ from interpolated examples, and hence that $H$ approximates $H$ from interpolated examples.

Corollary 6 Suppose that $H$ is a set of functions from $X$ to $[0,1]$ and that $H$ has finite fat-shattering function. Then $H$ approximates $\mathbb{R}^X$ from interpolated examples. Furthermore, there is a positive constant $K$ such that a sufficient sample length function is
$$m_0(\eta, \gamma, \epsilon, \delta) = \frac{K}{\epsilon}\left(\log\frac{1}{\delta} + d\log^2\frac{d}{\gamma^2\epsilon}\right),$$
where $d = \mathrm{fat}_H(\gamma/8)$.

Proof: For all $m, d \geq 1$,
$$y = \sum_{i=1}^{d}\binom{2m}{i}\left(\frac{4}{\gamma}\right)^i \leq d(2m)^d\left(\frac{5}{\gamma}\right)^d < \left(\frac{20m}{\gamma}\right)^d.$$
So
$$4\left(\frac{16m}{\gamma^2}\right)^{\log y}2^{-\epsilon m/2} < \delta$$
holds if
$$4\left(\frac{50m}{\gamma^2}\right)^{d\log(20m/\gamma)}2^{-\epsilon m/2} < \delta \iff \frac{\epsilon m}{2} - d\log\left(\frac{20m}{\gamma}\right)\log\left(\frac{50m}{\gamma^2}\right) > \log\frac{4}{\delta},$$
which in turn is implied by
$$\frac{\epsilon m}{2} - d\log^2\left(\frac{50m}{\gamma^2}\right) > \log\frac{4}{\delta}. \qquad (1)$$
Now, a routine calculation with the inequality $\ln\beta \leq \alpha\beta - 1 + \ln(1/\alpha)$ (valid for all $\alpha, \beta > 0$), applied with $\beta = 50m/\gamma^2$ and $\alpha = \gamma^2\epsilon\ln^2 2/(1200d)$, shows that (1) holds when
$$m > \frac{4}{\epsilon}\left(\frac{3d}{\ln^2 2}\ln^2\left(\frac{1200d}{\gamma^2\epsilon\ln^2 2}\right) + \log\frac{4}{\delta}\right),$$
and the corollary follows. $\Box$

We remark that, for a given $k$, the class of all $k$-Lipschitz-continuous functions discussed in Proposition 2 has fat-shattering function $\mathrm{fat}_H(\gamma) = \Theta(1/\gamma)$, and that the sufficient sample length function of Proposition 2 is of order
$$\frac{1}{\epsilon}\left(\mathrm{fat}_H(\gamma)\log\mathrm{fat}_H(\gamma) + \log\frac{1}{\delta}\right).$$

4 Lower bound

In this section, we give lower bounds on the number of examples necessary for $H$ to approximate $\mathbb{R}^X$ from interpolated examples and for $H$ to approximate $H$ from interpolated examples. The bounds are in terms of $\mathrm{fat}_H$, the fat-shattering function of $H$. To prove them, we consider a discretized version of $H$. For $a \in [0,1]$, let $D_\beta(a) = \lceil a/\beta\rceil$. For a function $f : X \to [0,1]$, let $D_\beta(f) : X \to \{0, 1, \ldots, \lceil 1/\beta\rceil\}$ be defined as the composition of $D_\beta$ and $f$. Let $D_\beta(H)$ denote $\{D_\beta(f) : f \in H\}$. Functions in $D_\beta(H)$ map to $\{0, 1, \ldots, n\}$, where $n = \lceil 1/\beta\rceil$. Let $[n]$ denote $\{0, 1, \ldots, n\}$. From the definition of the fat-shattering function,
$$\mathrm{fat}_H(\gamma\beta) \leq \mathrm{fat}_{D_\beta(H)}(\gamma) \leq \mathrm{fat}_H(\gamma\beta/2)$$
for $\gamma, \beta \in \mathbb{R}^+$. To obtain a lower bound on the sample length function in terms of the fat-shattering function, we consider a number of notions of dimension, defined using classes of $\{0, 1, *\}$-valued functions on $[n]$. Our approach is similar to that of Ben-David, Cesa-Bianchi, Haussler and Long [6], who consider learning $[n]$-valued functions. We show that a large family of these dimensions (including the fat-shattering function) are closely related, and prove the lower bound in terms of the most convenient, the $\Psi_{\mathrm{gnat}}$-dimension (a `gapped' version of the Natarajan dimension).
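The discretization map is straightforward to state in code. The following sketch is our own illustration (hypothetical names); it shows $D_\beta$ and the order-preservation property of the quantizer, which is what allows shattering witnesses to survive discretization.

```python
import math

def D(a, beta):
    """The quantizer D_beta(a) = ceil(a / beta), mapping [0,1] into {0,...,ceil(1/beta)}."""
    return math.ceil(a / beta)

def discretize(f, beta):
    """D_beta(f): pointwise composition of the quantizer with f."""
    return lambda x: D(f(x), beta)

beta = 0.25
print([D(a, beta) for a in (0.0, 0.2, 0.25, 0.3, 1.0)])  # -> [0, 1, 1, 2, 4]
# Order preservation: a <= b implies D(a, beta) <= D(b, beta), and values separated
# by at least 2*gamma*beta in [0,1] land in integer levels separated by roughly 2*gamma.
```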
Definition 7 Let $\Psi_{\mathrm{gnat}}(k)$ be the following set of $\{0, 1, *\}$-valued functions defined on $[n]$, where $k \in \{2, 3, \ldots\}$:
$$\Psi_{\mathrm{gnat}}(k) = \{\psi_{i,j} : i, j \in [n],\ |i - j| \geq k\},$$
with
$$\psi_{i,j}(\xi) = \begin{cases} 1 & \xi = i \\ 0 & \xi = j \\ * & \text{otherwise.} \end{cases}$$
If $F$ is a class of $[n]$-valued functions defined on $X$ and $\Psi$ is a class of $\{0, 1, *\}$-valued functions defined on $[n]$, we say that $F$ $\Psi$-shatters $x = (x_1, \ldots, x_d) \in X^d$ if there is a sequence $(\psi_1, \ldots, \psi_d) \in \Psi^d$ such that
$$\{0,1\}^d \subseteq \{(\psi_1(f(x_1)), \ldots, \psi_d(f(x_d))) : f \in F\}.$$
The $\Psi$-dimension of $F$, denoted $\Psi\text{-dim}(F)$, is the size of the largest $\Psi$-shattered sequence, or infinity if there is no largest sequence.

The following result is similar to the key result in [1] (Lemma 15), which bounds covering numbers of $F$ in terms of $\mathrm{fat}_F(1)$. We will see later that it generalizes that lemma, since the $\Psi_{\mathrm{gnat}}(2)$-dimension is the smallest of a family of dimensions that includes $\mathrm{fat}_F(1)$.

Lemma 8 Let $k \geq 2$ and $n \geq 1$ be integers. Suppose that $F$ is a class of $[n]$-valued functions defined on $X$ satisfying $\Psi_{\mathrm{gnat}}(k)\text{-dim}(F) \leq d$, and that $m \geq \log y + 1$, where
$$y = \sum_{i=0}^{d}\binom{m}{i}n^{2i}.$$
Then
$$\max_{x \in X^m}\mathcal{M}_F(k, l^\infty_x) < 2(2mn^2)^{\log y}.$$

Proof: Fix $x \in X^m$ and let $S = \{x_i : i = 1, \ldots, m\}$. For a set $A \subseteq S$ and functions $u, l : A \to [n]$, we say that $F$ shatters $(A, l, u)$ if $u(x) - l(x) \geq k$ for all $x \in A$ and, for every assignment $b$ of a bit $b_x \in \{0,1\}$ to each $x \in A$, there is a function $f_b \in F$ such that, for all $x \in A$,
$$f_b(x) = \begin{cases} u(x) & b_x = 1 \\ l(x) & b_x = 0. \end{cases}$$
Define the function $t : \mathbb{N}^4 \to \mathbb{N}_0$ as
$$t(h, m, n, k) = \min\left\{\left|\{(A, l, u) : A \subseteq Y,\ l, u : A \to [n],\ G \text{ shatters } (A, l, u)\}\right| : |Y| \leq m,\ G \subseteq [n]^Y,\ |G| \geq h,\ G \text{ is } k\text{-separated on } Y\right\},$$
where we say that a set $G$ is $k$-separated on $Y$ if, for all distinct functions $f, g \in G$, there is a $z \in Y$ such that $|f(z) - g(z)| \geq k$.
It is easy to see that if $|Y| \leq m$ then
$$\left|\{(A, l, u) : A \subseteq Y,\ |A| \leq d,\ l, u : A \to [n],\ \forall z \in A,\ u(z) - l(z) \geq k\}\right| < y.$$
It follows that, if there is an $h \in \mathbb{N}$ such that $t(h, m, n, k) \geq y$, then every $k$-separated function class $G \subseteq [n]^Y$ with $|Y| \leq m$ and $|G| \geq h$ has $\Psi_{\mathrm{gnat}}(k)\text{-dim}(G) > d$. Suppose $F \subseteq [n]^X$ has $\Psi_{\mathrm{gnat}}(k)\text{-dim}(F) \leq d$. Fix $x \in X^m$ and let $G$ be some subset of $F$ restricted to $\{x_1, \ldots, x_m\}$. Clearly, $\Psi_{\mathrm{gnat}}(k)\text{-dim}(G) \leq d$, so if $G$ is $k$-separated then $|G| < h$. In that case, $\max_{x \in X^m}\mathcal{M}_F(k, l^\infty_x) < h$. So the result follows from $t(2(2mn^2)^{\log y}, m, n, k) \geq y$, which we prove by induction.

For all $m' \in \mathbb{N}$ and $n \geq k$, we have $t(2, m', n, k) \geq 1$. To see this, take $\{g_1, g_2\} \subseteq G$, where the functions $g_1$ and $g_2$ map from a set $Y$ (with $|Y| \leq m'$) to $[n]$ and are $k$-separated. There must be an $x \in Y$ for which (without loss of generality) $g_1(x) - g_2(x) \geq k$. It follows that $G$ shatters $(\{x\}, g_2, g_1)$, and hence $t(2, m', n, k) \geq 1$.

Now, take a class $G$ of $[n]$-valued functions defined on $Y$, where $|Y| = m$, $|G| = 2mn^2h$, and $G$ is $k$-separated. Divide $G$ into $mn^2h$ pairs,
$$G = \{(f_1, g_1), \ldots, (f_{mn^2h}, g_{mn^2h})\}.$$
For each pair $(f_i, g_i)$, choose an $x_i \in Y$ for which $f_i(x_i) - g_i(x_i) \geq k$ (swapping the labels $f_i$ and $g_i$ if necessary); this is always possible since $G$ is $k$-separated. By the pigeonhole principle, there is an $x \in Y$ for which
$$\left|\{i \in \{1, \ldots, mn^2h\} : x_i = x\}\right| \geq n^2h.$$
Again by the pigeonhole principle, there is a pair $(a, b)$ and a set
$$S = \{i \in \{1, \ldots, mn^2h\} : x_i = x,\ f_i(x_i) = a,\ g_i(x_i) = b\}$$
such that $|S| \geq h$. Define $G_1 = \{f_i|_{Y\setminus\{x\}} : i \in S\}$ and $G_2 = \{g_i|_{Y\setminus\{x\}} : i \in S\}$, where $f|_T$ is the restriction of the function $f$ to the set $T$. Clearly, $G_1$ is $k$-separated on $Y\setminus\{x\}$, because $G$ is $k$-separated on $Y$ and for all distinct pairs $f_i, f_j \in G_1$ we have $f_i(x) = f_j(x)$. Also, $|G_1| \geq h$. By the definition of $t$, the set
$$S_1 = \{(A, l, u) : A \subseteq Y\setminus\{x\},\ l, u : A \to [n],\ G_1 \text{ shatters } (A, l, u)\}$$
satisfies $|S_1| \geq t(h, m-1, n, k)$.
Similarly, if we define $S_2$ as the set of triples shattered by $G_2$, we have $|S_2| \geq t(h, m-1, n, k)$. For any triple $(A, l, u)$ in $S_1 \cap S_2$, define $l^*, u^* : A \cup \{x\} \to [n]$ as
$$l^*(\xi) = \begin{cases} b & \xi = x \\ l(\xi) & \text{otherwise,} \end{cases} \qquad u^*(\xi) = \begin{cases} a & \xi = x \\ u(\xi) & \text{otherwise.} \end{cases}$$
Then $G$ shatters $(A \cup \{x\}, l^*, u^*)$. It follows that $G$ shatters at least $|S_1| + |S_2| \geq 2t(h, m-1, n, k)$ triples: the triples in $S_1 \cup S_2$ together with the triples $(A \cup \{x\}, l^*, u^*)$ for $(A, l, u) \in S_1 \cap S_2$ are distinct, and they number $|S_1 \cup S_2| + |S_1 \cap S_2| = |S_1| + |S_2|$. Since $G$ was arbitrary, $t(2mn^2h, m, n, k) \geq 2t(h, m-1, n, k)$.

So we have $t(2, m', n, k) \geq 1$ for $m' \in \mathbb{N}$ and $n \geq k$, and $t(2mn^2h, m, n, k) \geq 2t(h, m-1, n, k)$. By induction,
$$t\left(2(2n^2)^i\prod_{j=1}^{i}(m_0 + j),\ m_0 + i,\ n,\ k\right) \geq 2^i$$
for all $i \in \mathbb{N}$. If $m \geq \log y + 1$, we can take $i = \lceil\log y\rceil$, $m_0 = m - i$, and $h = 2(2n^2m)^i$, and we have
$$h \geq 2(2n^2)^i\prod_{j=1}^{i}(m_0 + j).$$
From the definition of $t$, $h_1 \geq h_2$ implies $t(h_1, m, n, k) \geq t(h_2, m, n, k)$, so
$$t(2(2n^2m)^i, m, n, k) \geq 2^i \geq y.$$
The result follows. $\Box$

A useful family of function classes is the $k$-gapped distinguishers.

Definition 9 Let $k \geq 2$. A set $\Psi$ of functions from $[n]$ to $\{0, 1, *\}$ is a $k$-gapped distinguisher if it satisfies:
1. for all $i \in \{0, 1, \ldots, n-k\}$ and $j \in \{i+k, \ldots, n\}$, there is a function $\psi \in \Psi$ and a bit $b \in \{0,1\}$ such that $\psi(i) = b$ and $\psi(j) = 1 - b$;
2. $\min\{|i - j| : i, j \in [n],\ \exists\psi \in \Psi,\ \psi(i) = 0,\ \psi(j) = 1\} = k$.

The set $\Psi_{\mathrm{gnat}}(k)$ is an example of a $k$-gapped distinguisher. Another important example is the class
$$\Psi_g(k) = \left\{\psi \in \{0, 1, *\}^{[n]} : \min\{|i - j| : i, j \in [n],\ \psi(i) = 0,\ \psi(j) = 1\} \geq k\right\}.$$
In fact, $\Psi_g(k)$ is the largest $k$-gapped distinguisher, in the sense that it contains any other $k$-gapped distinguisher.

Lemma 10 Suppose $F$ is a class of $[n]$-valued functions defined on $X$, $\Psi$ is a class of $\{0, 1, *\}$-valued functions defined on $[n]$, and $k \geq 2$. If $\Psi$ is a $k$-gapped distinguisher then
$$\Psi_{\mathrm{gnat}}(k)\text{-dim}(F) \leq \Psi\text{-dim}(F) \leq \Psi_g(k)\text{-dim}(F).$$

Proof: Take a $\Psi_{\mathrm{gnat}}(k)$-shattered sequence $x \in X^d$. Since $\Psi$ is a $k$-gapped distinguisher, for each $\psi_{i,j} \in \Psi_{\mathrm{gnat}}(k)$ there is a $\psi \in \Psi$ and a $b \in \{0,1\}$ for which
$\psi(i) = b$ and $\psi(j) = 1 - b$. It follows that $x$ is $\Psi$-shattered, which gives the first inequality. The second inequality follows from the fact that $\Psi \subseteq \Psi_g(k)$. $\Box$

We can express the fat-shattering dimension $\mathrm{fat}_F(k)$ as a dimension of this type, for $k \geq 2$. Define
$$\Psi_{\mathrm{fat}}(k) = \{\psi_i : i \in \{0, \ldots, n-k\}\},$$
with
$$\psi_i(z) = \begin{cases} 1 & z \geq i + k \\ * & i < z < i + k \\ 0 & z \leq i, \end{cases}$$
for $z \in [n]$. Then $\mathrm{fat}_F(\gamma) = \Psi_{\mathrm{fat}}(\lfloor 2\gamma\rfloor)\text{-dim}(F)$ for all classes $F$ of functions from $X$ to $[n]$. Clearly, $\Psi_{\mathrm{fat}}(k)$ is a $k$-gapped distinguisher. It follows from Lemma 10 that Lemma 8 generalizes Alon et al.'s Lemma 15, which gave a similar result for the $\Psi_{\mathrm{fat}}(2)$-dimension.

The following result shows that the $\Psi_{\mathrm{gnat}}(k)$-dimension, the $\Psi_g(k)$-dimension, and the $\Psi$-dimension (for any $k$-gapped distinguisher $\Psi$) are all closely related.

Lemma 11 Let $k \geq 2$. Let $F$ be a class of functions that map from $X$ to $[n]$, satisfying $\Psi_{\mathrm{gnat}}(k)\text{-dim}(F) = d_N \geq 2$ and $\Psi_g(k)\text{-dim}(F) \geq d_B \geq 1$. Then
$$d_N > \frac{d_B}{3\log^2(2d_B n^2)}.$$

Proof: Suppose $x = (x_1, \ldots, x_{d_B}) \in X^{d_B}$ is $\Psi_g(k)$-shattered by $F$. The definition of $\Psi_g(k)$ implies that any subset of $F$ that $\Psi_g(k)$-shatters $x$ is $k$-separated, so $\mathcal{M}_F(k, l^\infty_x) \geq 2^{d_B}$. Let
$$y = \sum_{i=0}^{d_N}\binom{d_B}{i}n^{2i}.$$
Then $y \leq d_N d_B^{d_N} n^{2d_N}$, so $\log y \leq \log d_N + d_N\log(d_B n^2)$, since $d_N \geq 2$. If $d_B > \log y$ then, by Lemma 8,
$$2^{d_B} \leq \mathcal{M}_F(k, l^\infty_x) < 2(2d_B n^2)^{\log y},$$
so
$$d_B < 1 + \log y\,\log(2d_B n^2). \qquad (2)$$
Alternatively, if $d_B \leq \log y$, then (2) is obviously true. So we have
$$d_B < 1 + \log y\,\log(2d_B n^2) \leq 1 + \left(\log d_N + d_N\log(d_B n^2)\right)\log(2d_B n^2) \leq 3d_N\log^2(2d_B n^2). \qquad \Box$$
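The $\Psi_{\mathrm{gnat}}$-shattering of Definition 7 is also easy to check directly for a small class. The sketch below is our own hypothetical illustration (Python has no `*` value, so we use `None` for the undefined outputs of $\psi_{i,j}$):

```python
from itertools import product

def psi(i, j):
    """psi_{i,j} from Definition 7: 1 at i, 0 at j, undefined (*) elsewhere."""
    return lambda v: 1 if v == i else (0 if v == j else None)

def gnat_shatters(fs, pts, psis):
    """Do the patterns of fs, composed through the psis, cover all of {0,1}^d?"""
    patterns = {tuple(p(f(x)) for p, x in zip(psis, pts)) for f in fs}
    return all(b in patterns for b in product((0, 1), repeat=len(pts)))

# n = 4: four functions on {0, 1}, each taking value 0 or 4 at each point.
fs = [lambda x, u=u, v=v: u if x == 0 else v for u in (0, 4) for v in (0, 4)]
print(gnat_shatters(fs, (0, 1), (psi(4, 0), psi(4, 0))))  # psi_{4,0} has gap 4 >= k
print(gnat_shatters(fs, (0, 1), (psi(3, 0), psi(4, 0))))  # fails: no f takes the value 3
```

This also illustrates why the dimension is `gapped': a sequence counts only when the class realizes both designated values $i$ and $j$, which are at least $k$ levels apart.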
The following result follows easily from Theorem 1 in [9], which gives a lower bound on the number of examples necessary for learning $\{0,1\}^{[d]}$ in the probably approximately correct model (see also [7]).

Lemma 12 Let $0 < \epsilon \leq 1/8$, $0 < \delta < 1/100$, and $d \geq 1$. If
$$m < \max\left(\frac{d}{32\epsilon},\ \frac{1-\epsilon}{\epsilon}\ln\frac{1}{\delta}\right),$$
then there is a distribution $P$ on $[d]$ and a function $t \in \{0,1\}^{[d]}$ such that
$$P^m\left\{x : \exists f \in \{0,1\}^{[d]} \text{ such that } f(x_i) = t(x_i),\ i = 1, \ldots, m,\ \text{and}\ P\{y : f(y) \neq t(y)\} \geq \epsilon\right\} \geq \delta.$$

We use Lemma 12 to prove the following lower bound on the sample length function for approximating $\mathbb{R}^X$ from interpolated examples.

Theorem 13 Suppose $H$ is a class of $[0,1]$-valued functions defined on a set $X$, $0 < \gamma < \eta < 1$, and $\epsilon, \delta \in (0,1)$. Then any sample length function $m_0$ for $H$ to approximate $\mathbb{R}^X$ from interpolated examples satisfies
$$m_0(\eta, \gamma, \epsilon, \delta) \geq \max\left(\frac{1}{32\epsilon}\left(\frac{d}{3\log^2(4d/\gamma^2)} - 1\right),\ \frac{1-\epsilon}{\epsilon}\log\frac{1}{\delta}\right),$$
where $d = \mathrm{fat}_H(\gamma)$.

Proof: Fix $0 < \gamma < \eta < 1$, define $n = \lceil 1/\gamma\rceil$, and suppose $\mathrm{fat}_H(\gamma) \geq d$. Let $F = D_\gamma(H)$. Then $\mathrm{fat}_F(1) \geq d$, so $\Psi_{\mathrm{gnat}}(2)\text{-dim}(F) > k$, where $k = d/(3\log^2(2dn^2))$. Consider a sequence $(x_1, \ldots, x_k) \in X^k$ that is $\Psi_{\mathrm{gnat}}(2)$-shattered by $F$. Clearly, there is a subset $H_0 \subseteq H$ with $|H_0| = 2^k$ and a sequence $(\psi_{a_1,b_1}, \ldots, \psi_{a_k,b_k}) \in \Psi_{\mathrm{gnat}}(2)^k$ such that
$$\{(\psi_{a_1,b_1}(f(x_1)), \ldots, \psi_{a_k,b_k}(f(x_k))) : f \in D_\gamma(H_0)\} = \{0,1\}^k.$$
Without loss of generality, we can assume that $a_j > b_j$ for $j = 1, \ldots, k$. Now, if $m < \max((k-1)/(32\epsilon),\ ((1-\epsilon)/\epsilon)\ln(1/\delta))$, Lemma 12 implies that there is a distribution $\bar{P}$ on $\{1, \ldots, k\}$ and a function $p : \{1, \ldots, k\} \to \{0,1\}$ such that
$$\bar{P}^m\left\{l \in \{1, \ldots, k\}^m : \exists p' : \{1, \ldots, k\} \to \{0,1\} \text{ such that } p(l_i) = p'(l_i) \text{ for } i = 1, \ldots, m \text{ and } \bar{P}\{y : p(y) \neq p'(y)\} \geq \epsilon\right\} \geq \delta.$$
Choose a function $t : X \to \mathbb{R}$ satisfying
$$t(x_j) = \begin{cases} (a_j - 1)\gamma + \eta - \lambda & p(j) = 1 \\ b_j\gamma - \eta + \lambda & p(j) = 0 \end{cases}$$
for $j = 1, \ldots, k$, where
$$\lambda = \frac{1}{2}\min\{h(x_j) - (a_j - 1)\gamma : h \in H_0 \text{ and } h(x_j) > (a_j - 1)\gamma,\ j = 1, \ldots, k\}.$$
For each function $h \in H_0$, define $f_h = D_\gamma(h)$, and let $p_h : \{1, \ldots, k\} \to \{0,1\}$ be defined by
$$p_h(j) = \begin{cases} 1 & f_h(x_j) = a_j \\ 0 & f_h(x_j) = b_j. \end{cases}$$
Clearly, if $|h(x_j) - t(x_j)| < \eta$ for some $h \in H_0$ and some $j \in \{1, \ldots, k\}$, then $h(x_j) \in ((a_j - 1)\gamma, a_j\gamma] \cup ((b_j - 1)\gamma, b_j\gamma]$, so $p_h(j) = p(j)$. Also, if $p_h(j) \neq p(j)$ for some $h \in H$, then $|h(x_j) - t(x_j)| \geq \eta + \gamma$. It follows that $\bar{P}\{y : p_h(y) \neq p(y)\} \geq \epsilon$ implies $Q\{y \in X : |h(y) - t(y)| \geq \eta + \gamma\} \geq \epsilon$, where $Q$ is the discrete probability distribution on $X$ satisfying $Q(x_j) = \bar{P}(j)$ for $j = 1, \ldots, k$. So
$$Q^m\left\{y \in X^m : \exists h \in H \text{ with } |h(y_i) - t(y_i)| < \eta \text{ for } i = 1, \ldots, m \text{ and } Q\{y : |h(y) - t(y)| \geq \eta + \gamma\} \geq \epsilon\right\}$$
$$\geq Q^m\left\{y \in X^m : \exists h \in H_0 \text{ with } |h(y_i) - t(y_i)| < \eta \text{ for } i = 1, \ldots, m \text{ and } Q\{y : |h(y) - t(y)| \geq \eta + \gamma\} \geq \epsilon\right\}$$
$$\geq \bar{P}^m\left\{l \in \{1, \ldots, k\}^m : \exists p' \text{ with } p(l_i) = p'(l_i) \text{ for } i = 1, \ldots, m \text{ and } \bar{P}\{y : p(y) \neq p'(y)\} \geq \epsilon\right\} \geq \delta. \qquad \Box$$

We also have the following result, which bounds from below the sample length function for $H$ to approximate $H$ from interpolated examples.

Theorem 14 Suppose $H$ is a class of $[0,1]$-valued functions defined on a set $X$, $0 < \gamma < 1$, $3\gamma/2 \leq \eta < 1$, and $\epsilon, \delta \in (0,1)$. If $d$ satisfies $\mathrm{fat}_H(\eta + \gamma) \geq d$, then any sample length function $m_0$ for $H$ to approximate $H$ from interpolated examples satisfies
$$m_0(\eta, \gamma, \epsilon, \delta) \geq \max\left(\frac{1}{32\epsilon}\left(\frac{d}{3\log^2(4d/\gamma^2)} - 1\right),\ \frac{1-\epsilon}{\epsilon}\log\frac{1}{\delta}\right).$$

Proof: Fix $0 < \gamma < 1$ and $3\gamma/2 \leq \eta < 1$, define $n = \lceil 1/\gamma\rceil$, and suppose $d \leq \mathrm{fat}_H(\eta + \gamma)$. Let $F = D_\gamma(H)$. Then
$$\mathrm{fat}_F\left(\frac{\eta + \gamma}{\gamma}\right) \geq d,$$
so $\Psi_{\mathrm{fat}}(\lfloor 2\eta/\gamma\rfloor + 1)\text{-dim}(F) \geq d$, and hence $\Psi_{\mathrm{gnat}}(\lfloor 2\eta/\gamma\rfloor + 1)\text{-dim}(F) > k$, where $k = d/(3\log^2(2dn^2))$. Consider a sequence $(x_1, \ldots, x_k) \in X^k$ that is $\Psi_{\mathrm{gnat}}(\lfloor 2\eta/\gamma\rfloor + 1)$-shattered by $F$. Clearly, there is a subset $H_0 \subseteq H$ with $|H_0| = 2^k$ and a sequence $(\psi_{a_1,b_1}, \ldots, \psi_{a_k,b_k}) \in \Psi_{\mathrm{gnat}}(\lfloor 2\eta/\gamma\rfloor + 1)^k$ such that
$$\{(\psi_{a_1,b_1}(f(x_1)), \ldots, \psi_{a_k,b_k}(f(x_k))) : f \in D_\gamma(H_0)\} = \{0,1\}^k.$$
Fix a function t ∈ H₀. Any function h ∈ H that has ψ_{a_i,b_i}(D_γ(h)(x_i)) = ψ_{a_i,b_i}(D_γ(t)(x_i)) satisfies |h(x_i) − t(x_i)| < γ < η. Any function h in H that has ψ_{a_i,b_i}(D_γ(h)(x_i)) ≠ ψ_{a_i,b_i}(D_γ(t)(x_i)) satisfies

  |h(x_i) − t(x_i)| ≥ ⌊2η/γ⌋γ = 2γ + ⌊2(η − γ)/γ⌋γ ≥ 2γ + (η − γ) = η + γ,

since (η − γ)/γ ≥ 1/2 and ⌊2x⌋ ≥ x for x ≥ 1/2. Using the same argument as in the proof of Theorem 13, there is a distribution P on X such that if m is too small then with P^m-probability at least δ, some h ∈ H is within η of t on a random sample, but P(|h − t| ≥ η + γ) ≥ ε. □

5 Discussion

Figure 1 shows the sample complexity bounds for approximation from interpolated examples. These bounds imply Theorem 3.

  m = O( (1/ε) ( fat_H(η/8) log₂( fat_H(η/8)/(γε) ) + log(1/δ) ) )
    ⟹ H approximates ℝ^X from interpolated examples
    ⟹ m = Ω( max( (1/ε) · fat_H(γ)/log₂( fat_H(γ)/γ² ), (1/ε) log(1/δ) ) )

  H approximates H from interpolated examples
    ⟹ m = Ω( max( (1/ε) · fat_H(η + γ)/log₂( fat_H(η + γ)/γ² ), (1/ε) log(1/δ) ) )

  Figure 1: Sample complexity bounds

Notice that the upper and lower bounds on the sample length for H to approximate ℝ^X from interpolated examples are within log factors of each other. If H is the class of k-Lipschitz-continuous functions from [0,1] to [0,1], the upper bound on the sample length function for H to approximate H from interpolated examples is within a log² factor of the lower bound implied by Proposition 2. Indeed, fat_H(η) = Θ(1/η) in this case, so Proposition 2 shows that m₀(η, γ, ε, δ) = Ω(fat_H(η)/ε).
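The fat-shattering function that appears throughout these bounds can be computed by brute force for a small finite class on a finite domain, directly from the definition: a set of points is γ-shattered if there are witness values r_i such that every 0/1 pattern is realised by some h ∈ H lying at least γ above the witness (where the pattern bit is 1) or at least γ below it (where the bit is 0). The following sketch is not from the paper; the function names and the finite witness grid are our own illustrative assumptions.

```python
from itertools import combinations, product

def gamma_shattered(points, H, gamma, witness_grid):
    """Check whether H gamma-shatters `points`: some choice of witnesses r_i
    from witness_grid must realise every 0/1 pattern with margin gamma."""
    for r in product(witness_grid, repeat=len(points)):
        if all(any(all(h(x) >= ri + gamma if bit else h(x) <= ri - gamma
                       for x, ri, bit in zip(points, r, pattern))
                   for h in H)
               for pattern in product([0, 1], repeat=len(points))):
            return True
    return False

def fat_shattering(X, H, gamma, witness_grid):
    """Largest d such that some d-point subset of X is gamma-shattered by H."""
    d = 0
    for k in range(1, len(X) + 1):
        if any(gamma_shattered(c, H, gamma, witness_grid)
               for c in combinations(X, k)):
            d = k
        else:
            break
    return d

# Example: the four binary-valued functions on the two points {0, 1}.
H = [lambda x, a=a, b=b: float(a) if x == 0 else float(b)
     for a in (0, 1) for b in (0, 1)]
grid = [i / 10 for i in range(11)]
print(fat_shattering([0, 1], H, 0.25, grid))  # 2: witness 0.5 shatters both points
print(fat_shattering([0, 1], H, 0.6, grid))   # 0: no witness leaves a 0.6 margin both ways
```

The search is exponential in the number of points and witnesses, so this is only a finite-case illustration of the quantity the theorems bound, not a practical algorithm.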
These sample complexity bounds for approximation from interpolated examples are also relevant to the problem of learning real-valued functions in the presence of malicious noise. Suppose a learner sees a sequence of training examples that correspond to the values of a target function corrupted with arbitrary bounded additive noise. That is, each example is of the form (x_i, t(x_i) + n_i), where t ∈ H and |n_i| < η. Clearly, any function h ∈ H that is η-close to the training sample will satisfy

  Pr(|h − t| ≥ 2η + γ) < ε,

provided that H approximates H from interpolated examples and the training sample is sufficiently large. In addition, if there is an algorithm that can learn in the presence of malicious noise (in this sense), then it can certainly learn in the presence of uniformly distributed random noise (as defined in [5]), which implies fat_H is finite ([5], Theorem 3). That is, a function class H is learnable with malicious noise if and only if fat_H is finite.

6 When γ meets 0

In [3], the related problem of valid generalization from approximate interpolation was studied. We say that H validly generalizes C from approximate interpolation if, for all η > 0 and ε, δ ∈ (0,1), there is an m such that for all probability distributions P on X and all t ∈ C, with P^m-probability at least 1 − δ, a sample x = (x_1, x_2,…,x_m) ∈ X^m has the property that if h ∈ H and |h(x_i) − t(x_i)| < η for 1 ≤ i ≤ m, then

  P({x ∈ X : |h(x) − t(x)| ≥ η}) < ε.

This problem is equivalent to approximation from interpolated examples with γ set to 0 (see Definition 1). Results in [3] give necessary and sufficient conditions for valid generalization from approximate interpolation. In particular, H validly generalizes ℝ^X from approximate interpolation if and only if the pseudo-dimension Pdim(H) is finite. It was also shown in [3] that H validly generalizes H from approximate interpolation if and only if Ddim_H(η) is finite for all η, where Ddim_H(η) is defined as follows.
Definition 15 For h, t ∈ H, let h^[η,t] : X → {0,1} be defined by h^[η,t](x) = 1 iff |h(x) − t(x)| ≥ η. Let H^[η,t] = { h^[η,t] : h ∈ H }. Then we define

  Ddim_H(η) = max_{t ∈ H} Pdim( H^[η,t] ).

(Since functions in H^[η,t] are {0,1}-valued, the pseudo-dimension of H^[η,t] is equal to its Vapnik-Chervonenkis dimension, defined in [17].) Clearly, valid generalization from approximate interpolation implies approximation from interpolated examples, so finiteness of Pdim(H) or Ddim_H implies
finiteness of fat_H. Since the fat-shattering function is a non-increasing function of γ, finiteness of the pseudo-dimension of H implies a uniform upper bound on fat_H(γ). The following proposition gives a lower bound on Ddim_H(η) in terms of fat_H.

Proposition 16

  Ddim_H(η) ≥ fat_H(η) / (3 log₂(4 fat_H(η)/η²)).

Proof: Let F = D_η(H). Then fat_F(1) ≥ fat_H(η), so Lemma 11 implies that Ψ_Nat(2)-dim(F) ≥ k, where

  k = fat_H(η) / (3 log₂(4 fat_H(η)/η²)).

Suppose x ∈ X^k is Ψ_Nat(2)-shattered by F. Then there is a subset H₀ ⊆ H with |H₀| = 2^k and a sequence (ψ_{a₁,b₁},…,ψ_{a_k,b_k}) ∈ Ψ_Nat^k such that

  { (ψ_{a₁,b₁}(f(x_1)),…,ψ_{a_k,b_k}(f(x_k))) : f ∈ D_η(H₀) } = {0,1}^k.

Choose t ∈ H₀. Then for h ∈ H₀ the definition of h^[η,t] implies that h^[η,t](x_i) = 0 if and only if ψ_{a_i,b_i}(D_η(h)(x_i)) = ψ_{a_i,b_i}(D_η(t)(x_i)). That is,

  { (h^[η,t](x_1),…,h^[η,t](x_k)) : h ∈ H₀ } = {0,1}^k,

so Ddim_H(η) ≥ k. □

7 Conclusions

We have shown that approximation from interpolated examples is possible if and only if the fat-shattering function of the hypothesis class is finite. This problem can be interpreted as a problem of learning real-valued functions, where we require satisfactory performance from every algorithm that returns a function that approximately interpolates the training examples. It follows that this problem is equivalent to those of learning real-valued functions with random (see [5]) and malicious (see Section 5) observation noise, in the sense that learning is possible in each case if and only if the fat-shattering function of the hypothesis class is finite.

Acknowledgements

This research was supported in part by the Australian Telecommunications and Electronics Research Board. The work of Martin Anthony is supported in part by the European Union through the "Neurocolt" ESPRIT Working Group.
References

[1] Alon, N., Ben-David, S., Cesa-Bianchi, N. and Haussler, D. (1993). Scale-sensitive dimensions, uniform convergence, and learnability. In Proceedings of the 1993 IEEE Symposium on Foundations of Computer Science, IEEE Press.

[2] Angluin, D. and Valiant, L. (1979). Fast probabilistic algorithms for Hamiltonian circuits and matchings. Journal of Computer and System Sciences, 18: 155-193.

[3] Anthony, M., Bartlett, P.L., Ishai, Y. and Shawe-Taylor, J. (1994). Valid generalisation from approximate interpolation. To appear, Combinatorics, Probability and Computing.

[4] Anthony, M. and Biggs, N. (1992). Computational Learning Theory: An Introduction, Cambridge University Press.

[5] Bartlett, P.L., Long, P.M. and Williamson, R.C. (1994). Fat-shattering and the learnability of real-valued functions. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, ACM Press, New York.

[6] Ben-David, S., Cesa-Bianchi, N., Haussler, D. and Long, P. (1992). Characterizations of learnability for classes of {0,…,n}-valued functions. Technical Report. (An earlier version appeared in Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, ACM Press, New York.) To appear, Journal of Computer and System Sciences.

[7] Blumer, A., Ehrenfeucht, A., Haussler, D. and Warmuth, M.K. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4): 929-965.

[8] Dudley, R.M., Giné, E. and Zinn, J. (1991). Uniform and universal Glivenko-Cantelli classes. Journal of Theoretical Probability, 4: 485-510.

[9] Ehrenfeucht, A., Haussler, D., Kearns, M. and Valiant, L. (1989). A general lower bound on the number of examples needed for learning. Information and Computation, 82: 247-261.

[10] Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100: 78-150.

[11] Kearns, M.J. and Schapire, R.E. (1990).
Efficient distribution-free learning of probabilistic concepts. In Proceedings of the 1990 IEEE Symposium on Foundations of Computer Science, IEEE Press.

[12] Kolmogorov, A.N. and Tihomirov, V.M. (1961). ε-entropy and ε-capacity of sets in functional spaces. AMS Translations, Series 2, 17.
[13] McDiarmid, C. (1989). On the method of bounded differences. In J. Siemons, editor, Surveys in Combinatorics, 1989, London Mathematical Society Lecture Note Series (141). Cambridge University Press, Cambridge, UK.

[14] Pollard, D. (1984). Convergence of Stochastic Processes, Springer-Verlag.

[15] Simon, H.U. (1994). Bounds on the number of examples needed for learning functions. In Computational Learning Theory: EUROCOLT'93 (ed. J. Shawe-Taylor and M. Anthony), Oxford University Press.

[16] Valiant, L.G. (1984). A theory of the learnable. Communications of the ACM, 27(11): 1134-1142.

[17] Vapnik, V.N. and Chervonenkis, A.Ya. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2): 264-280.
Sacred Heart University DigitalCommons@SHU Computer Science & Information Technology Faculty Publications Computer Science & Information Technology 8-1997 Learning Recursive Functions From Approximations
More informationA Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997
A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997 Vasant Honavar Artificial Intelligence Research Laboratory Department of Computer Science
More informationConstructing c-ary Perfect Factors
Constructing c-ary Perfect Factors Chris J. Mitchell Computer Science Department Royal Holloway University of London Egham Hill Egham Surrey TW20 0EX England. Tel.: +44 784 443423 Fax: +44 784 443420 Email:
More informationA version of for which ZFC can not predict a single bit Robert M. Solovay May 16, Introduction In [2], Chaitin introd
CDMTCS Research Report Series A Version of for which ZFC can not Predict a Single Bit Robert M. Solovay University of California at Berkeley CDMTCS-104 May 1999 Centre for Discrete Mathematics and Theoretical
More informationThe Vapnik-Chervonenkis Dimension: Information versus Complexity in Learning
REVlEW The Vapnik-Chervonenkis Dimension: Information versus Complexity in Learning Yaser S. Abu-Mostafa California Institute of Terhnohgy, Pasadcna, CA 91 125 USA When feasible, learning is a very attractive
More informationEmpirical Processes: General Weak Convergence Theory
Empirical Processes: General Weak Convergence Theory Moulinath Banerjee May 18, 2010 1 Extended Weak Convergence The lack of measurability of the empirical process with respect to the sigma-field generated
More informationFORMULATION OF THE LEARNING PROBLEM
FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we
More informationMACHINE LEARNING - CS671 - Part 2a The Vapnik-Chervonenkis Dimension
MACHINE LEARNING - CS671 - Part 2a The Vapnik-Chervonenkis Dimension Prof. Dan A. Simovici UMB Prof. Dan A. Simovici (UMB) MACHINE LEARNING - CS671 - Part 2a The Vapnik-Chervonenkis Dimension 1 / 30 The
More informationStochastic Processes
Introduction and Techniques Lecture 4 in Financial Mathematics UiO-STK4510 Autumn 2015 Teacher: S. Ortiz-Latorre Stochastic Processes 1 Stochastic Processes De nition 1 Let (E; E) be a measurable space
More informationMulticlass Learnability and the ERM Principle
Journal of Machine Learning Research 6 205 2377-2404 Submitted 8/3; Revised /5; Published 2/5 Multiclass Learnability and the ERM Principle Amit Daniely amitdaniely@mailhujiacil Dept of Mathematics, The
More informationHence C has VC-dimension d iff. π C (d) = 2 d. (4) C has VC-density l if there exists K R such that, for all
1. Computational Learning Abstract. I discuss the basics of computational learning theory, including concept classes, VC-dimension, and VC-density. Next, I touch on the VC-Theorem, ɛ-nets, and the (p,
More informationEntrance Exam, Real Analysis September 1, 2017 Solve exactly 6 out of the 8 problems
September, 27 Solve exactly 6 out of the 8 problems. Prove by denition (in ɛ δ language) that f(x) = + x 2 is uniformly continuous in (, ). Is f(x) uniformly continuous in (, )? Prove your conclusion.
More informationOnline Learning: Random Averages, Combinatorial Parameters, and Learnability
Online Learning: Random Averages, Combinatorial Parameters, and Learnability Alexander Rakhlin Department of Statistics University of Pennsylvania Karthik Sridharan Toyota Technological Institute at Chicago
More informationTheorem 1 can be improved whenever b is not a prime power and a is a prime divisor of the base b. Theorem 2. Let b be a positive integer which is not
A RESULT ON THE DIGITS OF a n R. Blecksmith*, M. Filaseta**, and C. Nicol Dedicated to the memory of David R. Richman. 1. Introduction Let d r d r,1 d 1 d 0 be the base b representation of a positive integer
More informationSome Combinatorial Aspects of Time-Stamp Systems. Unite associee C.N.R.S , cours de la Liberation F TALENCE.
Some Combinatorial spects of Time-Stamp Systems Robert Cori Eric Sopena Laboratoire Bordelais de Recherche en Informatique Unite associee C.N.R.S. 1304 351, cours de la Liberation F-33405 TLENCE bstract
More informationBounds on the generalised acyclic chromatic numbers of bounded degree graphs
Bounds on the generalised acyclic chromatic numbers of bounded degree graphs Catherine Greenhill 1, Oleg Pikhurko 2 1 School of Mathematics, The University of New South Wales, Sydney NSW Australia 2052,
More informationLexington, MA September 18, Abstract
October 1990 LIDS-P-1996 ACTIVE LEARNING USING ARBITRARY BINARY VALUED QUERIES* S.R. Kulkarni 2 S.K. Mitterl J.N. Tsitsiklisl 'Laboratory for Information and Decision Systems, M.I.T. Cambridge, MA 02139
More information