Fat-shattering and the learnability of real-valued functions. Peter L. Bartlett. Department of Systems Engineering

Fat-shattering and the learnability of real-valued functions

Peter L. Bartlett
Department of Systems Engineering
Research School of Information Sciences and Engineering
Australian National University
Canberra, 0200 Australia

Philip M. Long
Research Triangle Institute
3040 Cornwallis Road
P.O. Box 12194
Research Triangle Park, NC 27709 USA

Robert C. Williamson
Department of Engineering
Australian National University
Canberra, 0200 Australia

August 1, 1995

Proposed running head: Fat-shattering and learnability

Author to whom correspondence should be sent:
Peter L. Bartlett
Department of Systems Engineering
Research School of Information Sciences and Engineering
Australian National University
Canberra, 0200 Australia

Abstract

We consider the problem of learning real-valued functions from random examples when the function values are corrupted with noise. With mild conditions on independent observation noise, we provide characterizations of the learnability of a real-valued function class in terms of a generalization of the Vapnik-Chervonenkis dimension, the fat-shattering function, introduced by Kearns and Schapire. We show that, given some restrictions on the noise, a function class is learnable in our model if and only if its fat-shattering function is finite. With different (also quite mild) restrictions, satisfied for example by gaussian noise, we show that a function class is learnable from polynomially many examples if and only if its fat-shattering function grows polynomially. We prove analogous results in an agnostic setting, where there is no assumption of an underlying function class.

1 Introduction

In many common definitions of learning, a learner sees a sequence of values of an unknown function at random points, and must, with high probability, choose an accurate approximation to that function. The function is assumed to be a member of some known class. Using a popular definition of the problem of learning {0,1}-valued functions (probably approximately correct learning; see [12], [26]), Blumer, Ehrenfeucht, Haussler, and Warmuth have shown [12] that the Vapnik-Chervonenkis dimension (see [27]) of a function class characterizes its learnability, in the sense that a function class is learnable if and only if its Vapnik-Chervonenkis dimension is finite. Natarajan [19] and Ben-David, Cesa-Bianchi, Haussler and Long [11] have characterized the learnability of {0,...,n}-valued functions for fixed n. Alon, Ben-David, Cesa-Bianchi, and Haussler have proved an analogous result for the problem of learning probabilistic concepts [1]. In this case, there is an unknown [0,1]-valued function, but the learner does not receive a sequence of values of the function at random points. Instead, with each random point it sees either 0 or 1, with the probability of a 1 given by the value of the unknown function at that point. Kearns and Schapire [16] introduced a generalization of the Vapnik-Chervonenkis dimension, which we call the fat-shattering function, and showed that a class of probabilistic concepts is learnable only if the class has a finite fat-shattering function. The main learning result of [1] is that finiteness of the fat-shattering function of a class of probabilistic concepts is also sufficient for learnability. In this paper, we consider the learnability of [0,1]-valued function classes.
We show that a class of [0,1]-valued functions is learnable from a finite training sample with observation noise satisfying some mild conditions (the distribution has bounded support and its density satisfies a smoothness constraint) if and only if the class has a finite fat-shattering function. Here, as elsewhere, our main contribution is in showing that the finiteness of the fat-shattering function is necessary for learning. We also consider small-sample learnability, for which the sample size is allowed to grow only polynomially with the required performance parameters. We show that a real-valued function class is learnable from a small sample with observation noise satisfying some other quite mild conditions (the distribution need not have bounded support, but it must have light tails and be symmetric about zero; gaussian noise satisfies these conditions) if and only if the fat-shattering function of the class has a polynomial rate of growth. We also consider agnostic learning [15], [17], in which there is no assumption of an underlying function generating the training examples, and the performance of the learning algorithm is measured by comparison with some function class F. We show that the fat-shattering function of F characterizes finite-sample and small-sample learnability in this case also. In fact, the proof in [1] that finiteness of the fat-shattering function of a class of probabilistic concepts implies learnability also gives a related sufficient condition for agnostic learnability of [0,1]-valued functions. We show that this condition is implied by finiteness of the fat-shattering function of F. The proof of the lower bound on the number of examples necessary for learning is in two steps. First, we show that the problem of learning real-valued functions in the presence of noise is not much easier than that of learning functions in a discrete-valued function class obtained by quantizing the real-valued function class.
This formalizes the intuition that a noisy, real-valued measurement provides little more information than a quantized measurement, if the quantization width is sufficiently small. Existing lower bounds on the number of examples required for learning discrete-valued function classes [11], [19] are not strong enough for our purposes. We improve these lower bounds by relating the problem of learning the quantized

function class to that of learning {0,1}-valued functions. In addition to the aforementioned papers, other general results about learning real-valued functions have been obtained. Haussler [15] gives sufficient conditions for agnostic learnability. Anthony, Bartlett, Ishai, and Shawe-Taylor [4] give necessary and sufficient conditions for a function that approximately interpolates the target function to be a good approximation to it (see also [5] and [3]). Natarajan [20] considers the problem of learning a class of real-valued functions in the presence of bounded observation noise, and presents sufficient conditions for learnability. (Theorem 2 in [4] shows that these conditions are not necessary in our setting.) Merhav and Feder [18], and Auer, Long, Maass, and Woeginger [6] study function learning in a worst-case setting. In the next section, we define admissible noise distribution classes and the learning problems, and present the characterizations of learnability. Sections 3 and 4 give lower and upper bounds on the number of examples necessary for learning real-valued functions. Section 5 presents the characterization of agnostic learnability. Section 6 discusses our results. An earlier version of this paper appeared in [10].

2 Definitions and main result

Denote the integers by Z, the positive integers by N, the reals by R and the nonnegative reals by R₊. We use log to denote logarithm to base two, and ln to denote the natural logarithm. Fix an arbitrary set X. Throughout the paper, X denotes the input space on which the real-valued functions are defined. We refer to probability distributions on X without explicitly defining a σ-algebra S. For countable X, let S be the set of all subsets of X. If X is a metric space, let S be the Borel sets of X. All functions and sets we consider are assumed to be measurable.

Classes of noise distributions

The noise distributions we consider are absolutely continuous, and their densities have bounded variation.
A function f : R → R is said to have bounded variation if there is a constant C > 0 such that for every ordered sequence x₀ < ⋯ < x_n in R (n ∈ N) we have

    Σ_{k=1}^{n} |f(x_k) − f(x_{k−1})| ≤ C.

In that case, the total variation of f on R is

    V(f) = sup { Σ_{k=1}^{n} |f(x_k) − f(x_{k−1})| : n ∈ N, x₀ < ⋯ < x_n }.

Definition 1 An admissible noise distribution class D is a class of distributions on R that satisfies:

1. Each distribution in D has mean 0 and finite variance.

2. Each distribution in D is absolutely continuous and its probability density function (pdf) has bounded variation. Furthermore, there is a function v : R₊ → R₊ such that, if f is the pdf of any distribution in D with variance σ², then V(f) ≤ v(σ). The function v is called the total variation function of the class D.
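The supremum in the definition of V(f) can be approximated numerically by summing increments over a fine grid. The following sketch is ours, not from the paper; for a unimodal density such as the standard gaussian pdf, the total variation is twice the peak value, 2/√(2π) ≈ 0.798, and the grid estimate recovers this.

```python
import math

def total_variation(f, lo, hi, n=100_000):
    """Estimate V(f) by summing |f(x_k) - f(x_{k-1})| over a fine grid;
    this is a lower bound on the supremum in the definition."""
    xs = [lo + (hi - lo) * k / n for k in range(n + 1)]
    return sum(abs(f(xs[k]) - f(xs[k - 1])) for k in range(1, n + 1))

def gaussian_pdf(x, sigma=1.0):
    return math.exp(-x * x / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# For a unimodal density the total variation is 2 * (peak value):
# here 2 / sqrt(2*pi), about 0.798.
v = total_variation(gaussian_pdf, -10.0, 10.0)
```

The grid estimate is exact up to the grid resolution for piecewise-monotone functions, which is the case that matters for the densities considered here.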

If D also satisfies the following condition, we say it is a bounded admissible noise distribution class.

3. There is a function s : R₊ → R₊ such that, if D is a distribution in D with variance σ², then the support of D is contained in a closed interval of length s(σ). The function s is called the support function of D.

If D satisfies Conditions 1, 2, and the following Condition 3′, we say it is an almost-bounded admissible noise distribution class.

3′. Each distribution D in D has an even pdf (f(x) = f(−x)) and light tails: there are constants s₀ and c₀ in R₊ such that, for all distributions D in D with variance σ², and all s > s₀σ,

    D{η : |η| > s/2} ≤ c₀ e^{−s/σ}.

Example (Uniform noise) Let U = {U_σ : σ > 0}, where U_σ is uniform on (−√3 σ, √3 σ). Then this noise has mean 0, standard deviation σ, total variation function v(σ) = 1/(√3 σ), and support function s(σ) = 2√3 σ, so U is a bounded admissible noise distribution class.

Example (Gaussian noise) Let G = {G_σ : σ > 0}, where G_σ is the zero mean gaussian distribution with variance σ². Since the density f_σ of G_σ has f_σ(0) = (σ√(2π))⁻¹, and f_σ(x) is monotonic decreasing for x > 0, the total variation function is v(σ) = 2(σ√(2π))⁻¹. Obviously, f_σ is an even function. Standard bounds on the area under the tails of the gaussian density (see [21], p. 64, Fact 3.7.3) give

    G_σ{η ∈ R : |η| > s/2} ≤ exp(−s²/(8σ²)),   (1)

and if s > 8σ, exp(−s²/(8σ²)) < exp(−s/σ), so the constants c₀ = 1 and s₀ = 8 will satisfy Condition 3′. So the class G of gaussian distributions is almost-bounded admissible.

The learning problem

Choose a set F of functions from X to [0,1]. For m ∈ N, f ∈ F, x ∈ X^m, and η ∈ R^m, let

    sam(x, η, f) = ((x₁, f(x₁) + η₁), ..., (x_m, f(x_m) + η_m)) ∈ (X × R)^m.

(We often dispense with the parentheses in tuples of this form, to avoid cluttering the notation.) Informally, a learning algorithm takes a sample of the above form, and outputs a hypothesis¹

¹ In fact, Condition 3′ is stronger than we need. It suffices that the distributions be "close to" symmetric and have light tails in the following sense: there are constants s₀ and c₀ in R₊ such that, for all distributions D in D with variance σ² and all s > s₀σ, if l ∈ R satisfies ∫_l^{l+s} x φ(x) dx = 0, then ∫_l^{l+s} φ(x) dx ≥ 1 − c₀ e^{−s/σ}, where φ is the pdf of D.

for f. More formally, a deterministic learning algorithm² is defined to be a mapping from ∪_m (X × R)^m to [0,1]^X. A randomized learning algorithm L is a pair (A, P_Z), where P_Z is a distribution on a set Z, and A is a mapping from ∪_m (X × R)^m × Z^m to [0,1]^X. That is, given a sample of length m, the randomized algorithm chooses a sequence z ∈ Z^m at random from P_Z^m, and passes it to the (deterministic) mapping A as a parameter. For a probability distribution P on X, f ∈ F and h : X → [0,1], define

    er_{P,f}(h) = ∫ |h(x) − f(x)| dP(x).

The following definition of learning is based on those of [12], [19], [26].

Definition 2 Let D be a class of distributions on R. Choose 0 < ε, δ < 1, σ > 0, and m ∈ N. We say a learning algorithm L = (A, P_Z) (ε, δ, σ)-learns F from m examples with noise D if for all distributions P on X, all functions f in F, and all distributions D ∈ D with variance σ²,

    (P^m × D^m × P_Z^m) {(x, η, z) ∈ X^m × R^m × Z^m : er_{P,f}(A(sam(x, η, f), z)) ≥ ε} < δ.

Similarly, L (ε, δ)-learns F from m examples without noise if, for all distributions P on X and all functions f in F,

    (P^m × P_Z^m) {(x, z) ∈ X^m × Z^m : er_{P,f}(A(sam(x, 0, f), z)) ≥ ε} < δ.

We say F is learnable with noise D if there is a learning algorithm L and a function m₀ : (0,1) × (0,1) × R₊ → N such that for all 0 < ε, δ < 1 and all σ > 0, algorithm L (ε, δ, σ)-learns F from m₀(ε, δ, σ) examples with noise D. We say F is small-sample learnable with noise D if, in addition, the function m₀ is bounded by a polynomial in 1/ε, 1/δ, and σ.

The following definition comes from [16]. Choose x₁, ..., x_d ∈ X. We say x₁, ..., x_d are γ-shattered by F if there exists r ∈ [0,1]^d such that for each b ∈ {0,1}^d, there is an f ∈ F such that for each i,

    f(x_i) ≥ r_i + γ if b_i = 1,
    f(x_i) ≤ r_i − γ if b_i = 0.

For each γ, let

    fat_F(γ) = max{d ∈ N : ∃ x₁, ..., x_d such that F γ-shatters x₁, ..., x_d}

if such a maximum exists, and ∞ otherwise. If fat_F(γ) is finite for all γ, we say F has a finite fat-shattering function. The following is our main result.
Theorem 3 Suppose F is a permissible³ class of [0,1]-valued functions defined on X. If D is a bounded admissible noise distribution class, then F is learnable with observation noise D if and only if F has a finite fat-shattering function. If D is an almost-bounded admissible noise distribution class, then F is small-sample learnable with observation noise D if and only if there is a polynomial p such that fat_F(γ) < p(1/γ) for all γ > 0.

² Despite the name "algorithm," there is no requirement that this mapping be computable. Throughout the paper, we ignore issues of computability.
³ This is a benign measurability constraint defined in Section 4.
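To make γ-shattering concrete, fat_F(γ) can be computed by brute force for a small finite class on a finite domain. The sketch below is ours, not from the paper: it uses the observation that points are γ-shattered exactly when some assignment of a function f_b to each sign pattern b leaves, at every point, a gap of at least 2γ between the values realizing b_i = 1 and those realizing b_i = 0 (the witness level r_i then sits in that gap).

```python
from itertools import product, combinations

def gamma_shattered(points, funcs, gamma):
    """Check whether `points` are gamma-shattered by the finite class `funcs`.
    Equivalent to the definition: some assignment of a function to each sign
    pattern b leaves min_{b_i=1} f_b(x_i) - max_{b_i=0} f_b(x_i) >= 2*gamma
    at every point, so a witness level r_i exists in between."""
    patterns = list(product([0, 1], repeat=len(points)))
    for choice in product(funcs, repeat=len(patterns)):
        ok = True
        for i, x in enumerate(points):
            hi = min(f(x) for b, f in zip(patterns, choice) if b[i] == 1)
            lo = max(f(x) for b, f in zip(patterns, choice) if b[i] == 0)
            if hi - lo < 2 * gamma:
                ok = False
                break
        if ok:
            return True
    return False

def fat(domain, funcs, gamma):
    """Brute-force fat_F(gamma) over all subsets of a finite domain."""
    best = 0
    for d in range(1, len(domain) + 1):
        for pts in combinations(domain, d):
            if gamma_shattered(pts, funcs, gamma):
                best = d
    return best

# A toy class of two thresholds on the domain {0, 1}: one function is 0 at
# x=0 and 1 at x=1; the other is identically 0.  Only the point x=1 can be
# gamma-shattered, so fat is 1 for gamma <= 1/2.
F = [lambda x, c=c: 1.0 if x >= c else 0.0 for c in (0.5, 1.5)]
dim = fat([0, 1], F, 0.5)
```

The double enumeration is exponential in the number of points, which is fine for an illustration but not for anything larger.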

3 Lower bound

In this section, we give a lower bound on the number of examples necessary to learn a real-valued function class in the presence of observation noise. Lemma 5 in Section 3.1 shows that an algorithm that can learn a real-valued function class with observation noise can be used to construct an algorithm that can learn a quantized version of the function class to slightly worse accuracy and confidence with the same number of examples, provided the quantization width is sufficiently small. Lemma 10 in Section 3.2 gives a lower bound on the number of examples necessary for learning a quantized function class in terms of its fat-shattering function. In Section 3.3, we combine these results to give the lower bound for real-valued functions, Theorem 11.

3.1 Learnability with noise implies quantized learnability

In this subsection, we relate the problem of learning a real-valued function class with observation noise to the problem of learning a quantized version of that class, without noise.

Definition 4 For α ∈ R₊, define the quantization function

    Q_α(y) = α ⌈(y − α/2)/α⌉.

For a set S ⊆ R, let Q_α(S) = {Q_α(y) : y ∈ S}. For a function class F ⊆ [0,1]^X, let Q_α(F) be the set {Q_α ∘ f : f ∈ F} of Q_α([0,1])-valued functions defined on X.

Lemma 5 Suppose F is a set of functions from X to [0,1], D is an admissible noise distribution class with total variation function v, A is a learning algorithm, 0 < ε, δ < 1, σ ∈ R₊, and m ∈ N. If the quantization width α ∈ R₊ satisfies

    α ≤ min{ δ/(v(σ)m), 2ε },

and A (ε, δ, σ)-learns F from m examples with noise D, then there is a randomized learning algorithm (C, P_Z) that (2ε, 2δ)-learns Q_α(F) from m examples.

Figure 1 illustrates our approach. Suppose an algorithm A can (ε, δ, σ)-learn F from m noisy examples (x_i, f(x_i) + η_i).
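The quantizer of Definition 4 rounds to the nearest multiple of the width α: for y ∈ (nα − α/2, nα + α/2], we have ⌈(y − α/2)/α⌉ = n, so |y − Q_α(y)| ≤ α/2, the fact used at the end of the proof of Lemma 5. A one-line sketch of ours:

```python
import math

def quantize(y, alpha):
    """Q_alpha(y) = alpha * ceil((y - alpha/2) / alpha): round y to the
    nearest multiple of alpha (half-grid ties round down)."""
    return alpha * math.ceil((y - alpha / 2) / alpha)

# Every y lands within alpha/2 of its quantized value.
alpha = 0.1
assert all(abs(quantize(k / 1000, alpha) - k / 1000) <= alpha / 2 + 1e-12
           for k in range(1001))
```

The 1e-12 slack only guards against floating-point round-off at the half-grid boundaries; the mathematical bound is exactly α/2.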
If we quantize the observations to accuracy α and add noise that is uniform on (−α/2, α/2), Lemma 6(a) shows that the distribution of the observations is approximately unchanged (in the notation of Figure 1, the distributions P₁ and P₂ are close), so A learns almost as well as it did previously. If we define Algorithm B as this operation of adding uniform noise and then invoking Algorithm A, B solves a quantized learning problem in which the examples are given as (x_i, Q_α(f(x_i) + η_i)). Lemma 6(b) shows that this problem is similar to the problem of learning the quantized function class when the observations are contaminated with independent noise whose distribution is a quantized version of the original observation noise (that is, the examples are given as (x_i, Q_α(f(x_i)) + Q_α(η_i))). In the notation of Figure 1, Lemma 6(b) shows that the distributions P₃ and P₄ are close. It follows that Algorithm C, which adds this quantized noise to the observations and passes them to Algorithm B, learns

[Figure 1 appears here as a block diagram: Algorithm A receives examples (x, y) corrupted by observation noise (observation distribution P₁); Algorithm B quantizes, adds uniform noise, and invokes Algorithm A (distributions P₂ and P₃); Algorithm C adds simulated quantized observation noise and invokes Algorithm B (distribution P₄).]

Figure 1: Lemma 5 shows that a learning algorithm for real-valued functions (Algorithm A) can be used to construct a randomized learning algorithm for quantized functions (Algorithm C).

the quantized function class without observation noise (that is, when the examples are given as (x_i, Q_α(f(x_i)))).

For distributions P and Q on R, define the total variation distance between P and Q as

    d_TV(P, Q) = 2 sup_E |P(E) − Q(E)|,

where the supremum is over all Borel sets. If P and Q are discrete, it is easy to show that

    d_TV(P, Q) = Σ_x |P(x) − Q(x)|,

where the sum is over all x in the union of the supports of P and Q. Similarly, if P and Q are continuous with probability density functions p and q respectively,

    d_TV(P, Q) = ∫_{−∞}^{∞} |p(x) − q(x)| dx.

Lemma 6 Let D be an admissible noise distribution class with total variation function v. Let σ > 0 and 0 < α < 1. Let D be a distribution in D with variance σ². Let η and ν be random variables, and suppose that η is distributed according to D, and ν is distributed uniformly on (−α/2, α/2).

(a) For any y ∈ [0,1], if P₁ is the distribution of y + η and P₂ is the distribution of Q_α(y + η) + ν, we have d_TV(P₁, P₂) ≤ α v(σ).

(b) For any y ∈ [0,1], if P₃ is the distribution of Q_α(y + η) and P₄ is the distribution of Q_α(y) + Q_α(η), we have d_TV(P₃, P₄) ≤ α v(σ).

Proof Let p be the pdf of D.

(a) The random variable y + η has density p₁(a) = p(a − y), and Q_α(y + η) + ν has density p₂ given by

    p₂(a) = (1/α) ∫_{Q_α(a)−α/2}^{Q_α(a)+α/2} p(x − y) dx

for a ∈ R. So

    d_TV(P₁, P₂) = ∫_{−∞}^{∞} |p₁(a) − p₂(a)| da
                 = Σ_{n=−∞}^{∞} ∫_{−α/2}^{α/2} | p(x − y + nα) − (1/α) ∫_{−α/2}^{α/2} p(ζ − y + nα) dζ | dx.

By the mean value theorem, there are z₁ and z₂ in [−α/2, α/2] such that

    p(z₁ − y + nα) ≤ (1/α) ∫_{−α/2}^{α/2} p(ζ − y + nα) dζ ≤ p(z₂ − y + nα),

so for all x ∈ [−α/2, α/2],

    | p(x − y + nα) − (1/α) ∫_{−α/2}^{α/2} p(ζ − y + nα) dζ | ≤ sup_{z ∈ (−α/2, α/2)} |p(x − y + nα) − p(z − y + nα)|,

and therefore

    d_TV(P₁, P₂) ≤ Σ_{n=−∞}^{∞} ∫_{−α/2}^{α/2} sup_{z ∈ (−α/2, α/2)} |p(x − y + nα) − p(z − y + nα)| dx ≤ α v(σ).

(b) The distribution P₃ of Q_α(y + η) is discrete, and is given by

    P₃(a) = ∫_{nα−α/2}^{nα+α/2} p(x − y) dx  if a = nα for some n ∈ Z, and 0 otherwise.

Since η has distribution D, the distribution P₄ of Q_α(y) + Q_α(η) is also discrete, and is given by

    P₄(a) = ∫_{nα−α/2}^{nα+α/2} p(x) dx  if a = Q_α(y) + nα for some n ∈ Z, and 0 otherwise.

So

    d_TV(P₃, P₄) = Σ_{n=−∞}^{∞} | ∫_{nα−α/2}^{nα+α/2} p(x − y) dx − ∫_{nα−α/2}^{nα+α/2} p(x − Q_α(y)) dx |
                 ≤ Σ_{n=−∞}^{∞} ∫_{−α/2}^{α/2} |p(x − y + nα) − p(x − Q_α(y) + nα)| dx
                 ≤ Σ_{n=−∞}^{∞} ∫_{−α/2}^{α/2} sup_{z ∈ (−α/2, α/2)} |p(x − y + nα) − p(x − y + nα + z)| dx
                 ≤ α v(σ),

since |y − Q_α(y)| ≤ α/2. □

We will use the following lemma. The proof is by induction, and is implicit in the proof of Lemma 12 in [8].

Lemma 7 If P_i and Q_i (i = 1, ..., m) are distributions on a set Y, and φ is a [0,1]-valued random variable defined on Y^m, then

    | ∫_{Y^m} φ dP − ∫_{Y^m} φ dQ | ≤ (1/2) Σ_{i=1}^{m} d_TV(P_i, Q_i),

where P = ∏_{i=1}^{m} P_i and Q = ∏_{i=1}^{m} Q_i are distributions on Y^m.
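Lemma 7 bounds how much replacing the product measure ∏P_i by ∏Q_i can change the expectation of a [0,1]-valued φ. A small numerical check for discrete distributions (our own sketch; φ here is the indicator of an event, as in the proof of Lemma 5):

```python
from itertools import product

def tv_discrete(p, q):
    """Total variation distance sum_x |P(x) - Q(x)| between discrete
    distributions given as dicts mapping outcome -> probability."""
    support = set(p) | set(q)
    return sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def product_prob(dists, event):
    """Probability of `event` (a set of outcome tuples) under the product
    of the given discrete distributions."""
    total = 0.0
    for outcome in product(*[list(d) for d in dists]):
        if outcome in event:
            prob = 1.0
            for d, x in zip(dists, outcome):
                prob *= d[x]
            total += prob
    return total

# Two fair coins versus two slightly biased coins (m = 2).
P_i = {0: 0.5, 1: 0.5}
Q_i = {0: 0.6, 1: 0.4}
event = {(1, 0), (1, 1)}          # "first coordinate is 1"
lhs = abs(product_prob([P_i, P_i], event) - product_prob([Q_i, Q_i], event))
rhs = 0.5 * (tv_discrete(P_i, Q_i) + tv_discrete(P_i, Q_i))
# Lemma 7 with phi = indicator of the event: lhs <= rhs.
```

Here lhs = |0.5 − 0.4| = 0.1 and rhs = 0.5 · (0.2 + 0.2) = 0.2, consistent with the lemma.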

Proof (of Lemma 5) We will describe a randomized algorithm (Algorithm C) that is constructed from Algorithm A, and show that it (2ε, 2δ)-learns the quantized function class Q_α(F). Fix a noise distribution D in D with variance σ², a function f ∈ F, and a distribution P on X. Since A (ε, δ, σ)-learns F, we have

    P^m × D^m {(x, η) ∈ X^m × R^m : er_{P,f}(A(sam(x, η, f))) ≥ ε} < δ.

That is, the probability (over all x ∈ X^m and η ∈ R^m) that Algorithm A chooses a bad function is small. We will show that this implies that the probability that Algorithm C chooses a bad function is also small, where the probability is over all x ∈ X^m and all values of the random variables that Algorithm C uses.

Let ν be a random variable with distribution U, where U is the uniform distribution on (−α/2, α/2). For an arbitrary sequence (y₁, ..., y_m), let Algorithm B be the randomized algorithm that adds noise ν_i to each y value it receives, and passes the sequence to Algorithm A. That is, for any sequence of (x_i, y_i) pairs,

    B(x₁, y₁, ..., x_m, y_m) = A(x₁, y₁ + ν₁, ..., x_m, y_m + ν_m).

First we prove that, for a given sequence x of input values, the probability that Algorithm A outputs a bad hypothesis when it is called from Algorithm B in the scenario shown in Figure 1 (that is, when it sees examples of the form (x_i, Q_α(f(x_i) + η_i) + ν_i)) is no more than δ/2 more than the probability that Algorithm A outputs a bad hypothesis after receiving examples (x_i, f(x_i) + η_i). We prove this by considering the set of noisy function values for the input sequence x that cause Algorithm A to output a bad hypothesis. Now, fix a sequence x = (x₁, ..., x_m) ∈ X^m, and define the events

    E = {η ∈ R^m : er_{P,f}(A(sam(x, η, f))) ≥ ε},
    E₁ = {y ∈ R^m : er_{P,f}(A(x₁, y₁, ..., x_m, y_m)) ≥ ε}.

That is, E is the set of noise sequences that make A choose a bad function, and E₁ is the corresponding set of y sequences. Clearly,

    D^m(E) = (∏_{i=1}^{m} P_{1|x_i})(E₁),   (2)

where P_{1|x_i} is the distribution of f(x_i) + η.

We will show that D^m(E) is close to the corresponding probability under the distribution of y values that Algorithm A sees when Algorithm B invokes it. Define P_{2|x_i} as the distribution of Q_α(f(x_i) + η) + ν. From Lemma 6(a), d_TV(P_{1|x_i}, P_{2|x_i}) ≤ α v(σ). Applying Lemma 7 with φ = 1_{E₁}, the indicator⁴ function for E₁, gives

    | (∏_{i=1}^{m} P_{2|x_i})(E₁) − (∏_{i=1}^{m} P_{1|x_i})(E₁) | ≤ m α v(σ)/2.

But by hypothesis α ≤ δ/(m v(σ)), so this and (2) imply

    (∏_{i=1}^{m} P_{2|x_i})(E₁) ≤ D^m(E) + δ/2.

⁴ That is, 1_{E₁}(y) takes value 1 if y ∈ E₁, and 0 otherwise.

Next we observe that, for a fixed sequence x of input values, the probability that Algorithm B outputs a bad hypothesis when given quantized noisy examples (of the form (x_i, Q_α(f(x_i) + η_i))) is equal to the probability that Algorithm A outputs a bad hypothesis when given examples of the form (x_i, Q_α(f(x_i) + η_i) + ν_i). More formally, we can write this as follows. Let P_{3|x_i} be the distribution of Q_α(f(x_i) + η), and let

    E₃ = {(y, ν) ∈ R^m × R^m : er_{P,f}(A(x₁, y₁ + ν₁, ..., x_m, y_m + ν_m)) ≥ ε}.

In this case, E₃ is the set of (y, ν) pairs that correspond to B choosing a bad function. Clearly,

    (∏_{i=1}^{m} P_{2|x_i})(E₁) = (∏_{i=1}^{m} P_{3|x_i} × U^m)(E₃).

Let ξ be a random variable with distribution D. Let Algorithm C be the randomized algorithm that adds noise Q_α(ξ_i) to each y value it receives, and passes the sequence to Algorithm B. That is,

    C(x₁, y₁, ..., x_m, y_m) = B(x₁, y₁ + Q_α(ξ₁), ..., x_m, y_m + Q_α(ξ_m)).

Next we prove that, for a fixed sequence x, the probability that Algorithm B outputs a bad hypothesis when it is called from Algorithm C in the scenario shown in Figure 1 (that is, when it sees examples of the form (x_i, Q_α(f(x_i)) + Q_α(ξ_i))) is no more than δ/2 more than the probability that Algorithm B outputs a bad hypothesis after receiving examples of the form (x_i, Q_α(f(x_i) + η_i)). Let P_{4|x_i} be the distribution of Q_α(f(x_i)) + Q_α(ξ). Applying Lemma 7, with φ(y) equal to the probability under U^m that A produces a bad hypothesis, gives

    | (∏_{i=1}^{m} P_{3|x_i} × U^m)(E₃) − (∏_{i=1}^{m} P_{4|x_i} × U^m)(E₃) | ≤ (1/2) Σ_{i=1}^{m} d_TV(P_{3|x_i}, P_{4|x_i}).

From Lemma 6(b), d_TV(P_{4|x_i}, P_{3|x_i}) ≤ α v(σ), so we have

    (∏_{i=1}^{m} P_{4|x_i} × U^m)(E₃) ≤ (∏_{i=1}^{m} P_{3|x_i} × U^m)(E₃) + m α v(σ)/2
                                      ≤ (∏_{i=1}^{m} P_{2|x_i})(E₁) + δ/2
                                      ≤ D^m(E) + δ.

We have shown that, for any x ∈ X^m,

    U^m × D^m {(ν, ξ) : er_{P,f}(A(x₁, Q_α(f(x₁)) + Q_α(ξ₁) + ν₁, ..., x_m, Q_α(f(x_m)) + Q_α(ξ_m) + ν_m)) ≥ ε}
        ≤ D^m {η : er_{P,f}(A(sam(x, η, f))) ≥ ε} + δ.

It follows that

    P^m × U^m × D^m {(x, ν, ξ) : er_{P,f}(A(x₁, Q_α(f(x₁)) + Q_α(ξ₁) + ν₁, ..., x_m, Q_α(f(x_m)) + Q_α(ξ_m) + ν_m)) ≥ ε}
        ≤ P^m × D^m {(x, η) : er_{P,f}(A(sam(x, η, f))) ≥ ε} + δ
        < 2δ.

For any function h : [0; 1], the triangle inequality for the absolute dierence on R gives er P;f (h) = jh(x)? f(x)j dp (x) (jh(x)? Q (f(x))j? jf(x)? Q (f(x))j) dp (x) er P;Q(f ) (h)? 2 er P;Q(f ) (h)? ; since for all x, jf(x)? Q (f(x))j =2, and 2 by hypothesis. It follows that Hence fer P;Q(f ) (C( )) 2g fer P;f (C( )) g: Pr er P;Q(f ) (C (x 1 ; Q (f(x 1 )); : : : ; x m ; Q (f(x m )))) 2 < 2; where the probability is taken over all x in m and all values of and, the random variables used by Algorithm C. This is true for any Q (f) in Q (F ), so this algorithm (2; 2)-learns Q (F ) from m examples. 3.2 Lower bounds for quantized learning In the previous subsection, we showed that if a class F can be (; ; )-learned with a certain number of examples, then an associated class Q (F ) of discrete-valued functions can be (2; 2)- learned with the same number of examples. Given this result, one would be tempted to apply techniques of Natarajan [19] or Ben-David, Cesa-Bianchi, Haussler, and Long [11] (who consider the learnability of discrete-valued functions) to lower bound the number of examples required for learning Q (F ). The main results of those papers, however, were for the discrete loss function, where the learner \loses" 1 whenever it's hypothesis is incorrect. When those results are applied directly to get bounds for learning with the absolute loss, the resulting bounds are not strong enough for our purposes because of the restrictions on required to show that learning F is not much harder than learning Q (F ). In this subsection, we present a new technique, inspired by the techniques of [7]. We show that an algorithm for learning a class of discrete-valued functions can eectively be used as a subroutine in an algorithm for learning binary-valued functions. We then apply a lower bound result for binary-valued functions. For each d 2 N, let power d be the set of all functions from f1; : : : ; dg to f0; 1g. 
We will make use of the following special case of a general result about $\mathrm{power}_d$ ([12], Theorem 2.1b).

Theorem 8 ([12]) Let $A$ be a randomized learning algorithm which always outputs $\{0,1\}$-valued hypotheses. If $A$ is given fewer than $d/2$ examples, $A$ fails to $(1/8, 1/8)$-learn $\mathrm{power}_d$.

Theorem 2.1b of [12] is stated for deterministic algorithms, but an almost identical proof gives the same result for randomized algorithms. We will also make use of the standard Chernoff bounds, proved in this form by Angluin and Valiant [2].
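For small $m$, the two Chernoff bounds (stated next as Theorem 9) can be verified exactly against the binomial tails they dominate; a quick numerical check, not part of any proof:

```python
import math

def binom_pmf(m: int, p: float, k: int) -> float:
    """Probability that exactly k of m iid Bernoulli(p) variables equal 1."""
    return math.comb(m, k) * p**k * (1 - p) ** (m - k)

def upper_tail(m: int, p: float, t: float) -> float:
    """Pr(sum of the Y_i >= t)."""
    return sum(binom_pmf(m, p, k) for k in range(math.ceil(t), m + 1))

def lower_tail(m: int, p: float, t: float) -> float:
    """Pr(sum of the Y_i <= t)."""
    return sum(binom_pmf(m, p, k) for k in range(0, math.floor(t) + 1))

# The bounds of [2]: Pr(sum >= 2mp) <= exp(-mp/3), Pr(sum <= mp/2) <= exp(-mp/8).
for m in (50, 100, 200):
    for p in (0.1, 0.25, 0.5):
        assert upper_tail(m, p, 2 * m * p) <= math.exp(-m * p / 3)
        assert lower_tail(m, p, m * p / 2) <= math.exp(-m * p / 8)
```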

Theorem 9 ([2]) Let $Y_1, \ldots, Y_m$ be independent, identically distributed $\{0,1\}$-valued random variables with $\Pr(Y_1 = 1) = p$. Then

$$\Pr\Big(\sum_{i=1}^m Y_i \ge 2mp\Big) \le e^{-mp/3} \qquad\text{and}\qquad \Pr\Big(\sum_{i=1}^m Y_i \le mp/2\Big) \le e^{-mp/8}.$$

Lemma 10 For $0 < \alpha < 1/2$, choose a set $F$ of functions from $\Omega$ to $Q_\alpha([0,1])$, $d \in \mathbb{N}$ and $\gamma > 0$ such that $\mathrm{fat}_F(\gamma) \ge d$. If a randomized learning algorithm $A$ is given fewer than

$$\frac{d - 666}{4 + 192 \ln\lceil 1/\alpha + 1/2\rceil}$$

examples, $A$ fails to $(\gamma/32, 1/16)$-learn $F$ without noise.

Proof We will show that if there is an algorithm that can $(\gamma/32, 1/16)$-learn the quantized class $F$ from fewer than the number of examples given in the lemma, then this could be used as a subroutine of an algorithm that could $(1/8, 1/8)$-learn $\mathrm{power}_d$ from fewer than $d/2$ examples, violating Theorem 8.

Choose an algorithm $A$ for learning $F$. Let $x_1, \ldots, x_d \in \Omega$ be $\gamma$-shattered by $F$, and let $r_1, \ldots, r_d \in [0,1]$ be such that for each $b \in \{0,1\}^d$, there is an $f_b \in F$ such that for all $j$, $1 \le j \le d$,

$$f_b(x_j) \begin{cases} \ge r_j + \gamma & \text{if } b_j = 1 \\ \le r_j - \gamma & \text{if } b_j = 0. \end{cases}$$

For each $q \in \mathbb{N}$, consider the algorithm $\tilde A_q$ (which will be used for learning $\mathrm{power}_d$), which uses $A$ as a subroutine as follows. Given $m > q$ examples $(\lambda_1, y_1), \ldots, (\lambda_m, y_m)$ in $\{1, \ldots, d\} \times \{0,1\}$, Algorithm $\tilde A_q$ first, for each $v \in Q_\alpha([0,1])^q = \{0, \alpha, \ldots, \alpha\lceil 1/\alpha - 1/2\rceil\}^q$, sets $h_{\lambda,v} = A((x_{\lambda_1}, v_1), \ldots, (x_{\lambda_q}, v_q))$. Algorithm $\tilde A_q$ then uses this to define a set $\tilde S$ of $\{0,1\}$-valued functions defined on $\{1, \ldots, d\}$ by

$$\tilde S = \big\{ \tilde h_{\lambda,v} : v \in Q_\alpha([0,1])^q \big\}, \qquad\text{where}\qquad \tilde h_{\lambda,v}(j) = \begin{cases} 1 & \text{if } h_{\lambda,v}(x_j) \ge r_j \\ 0 & \text{otherwise,} \end{cases}$$

for all $j \in \{1, \ldots, d\}$. Finally, $\tilde A_q$ returns an $\tilde h$ in $\tilde S$ for which the number of disagreements with the last $m - q$ examples is minimized. That is,

$$\tilde h = \arg\min_{\tilde h \in \tilde S} \Big| \big\{ j \in \{q+1, \ldots, m\} : \tilde h(\lambda_j) \ne y_j \big\} \Big|.$$

We claim that if $A$ can $(\gamma/32, 1/16)$-learn $F$ from $m_0 \in \mathbb{N}$ examples without noise, then $\tilde A_{m_0}$ can $(1/8, 1/8)$-learn $\mathrm{power}_d$ from $m_0 + \lceil 96(\ln 32 + m_0 \ln\lceil 1/\alpha + 1/2\rceil)\rceil$ examples without noise, and we can then apply Theorem 8 to give the desired lower bound on $m_0$.
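The two-stage selection performed by $\tilde A_q$ — threshold each real-valued hypothesis at the witness levels $r_j$ to get a binary function, then keep the candidate with fewest disagreements on the held-out examples — can be sketched as follows (the candidate hypotheses and data below are illustrative stand-ins, not the paper's objects):

```python
from typing import Callable, List, Sequence, Tuple

def threshold(h: Callable[[int], float], r: Sequence[float]) -> Callable[[int], int]:
    """Binary version of a real-valued hypothesis h: output 1 iff h(j) >= r_j."""
    return lambda j: 1 if h(j) >= r[j] else 0

def select_by_disagreement(
    candidates: List[Callable[[int], float]],
    r: Sequence[float],
    validation: Sequence[Tuple[int, int]],
) -> Callable[[int], int]:
    """Return the thresholded candidate minimizing disagreements on (j, y) pairs."""
    binarized = [threshold(h, r) for h in candidates]
    return min(binarized, key=lambda hb: sum(hb(j) != y for j, y in validation))

# Illustrative stand-ins: two candidates, witness levels r, validation examples.
r = [0.5, 0.5, 0.5]
cands = [lambda j: 0.9 if j == 0 else 0.1,   # predicts 1 only on j = 0
         lambda j: 0.9]                      # predicts 1 everywhere
val = [(0, 1), (1, 0), (2, 0)]
best = select_by_disagreement(cands, r, val)
assert [best(j) for j in range(3)] == [1, 0, 0]
```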
To see this, assume $A$ $(\gamma/32, 1/16)$-learns $F$ from $m_0$ examples, and let $\tilde A = \tilde A_{m_0}$. Suppose $\tilde A$ is trying to learn $g \in \mathrm{power}_d$ and the distribution on the domain $\{1, \ldots, d\}$ is $\tilde P$. Let $P$ be the corresponding distribution on $\{x_1, \ldots, x_d\}$, and let $b = (g(1), \ldots, g(d)) \in \{0,1\}^d$. Since $A$ $(\gamma/32, 1/16)$-learns $F$, we have

$$\tilde P^{m_0}\big\{ \lambda \in \{1,\ldots,d\}^{m_0} : \mathrm{er}_{P,f_b}\big(A\big((x_{\lambda_1}, f_b(x_{\lambda_1})), \ldots, (x_{\lambda_{m_0}}, f_b(x_{\lambda_{m_0}}))\big)\big) \ge \gamma/32 \big\} < 1/16,$$

which implies

$$\tilde P^{m_0}\big\{ \lambda \in \{1,\ldots,d\}^{m_0} : \forall v \in Q_\alpha([0,1])^{m_0},\ \mathrm{er}_{P,f_b}(h_{\lambda,v}) \ge \gamma/32 \big\} < 1/16.$$

This can be rewritten as

$$\tilde P^{m_0}\Big\{ \lambda : \forall v,\ \sum_j |h_{\lambda,v}(x_j) - f_b(x_j)|\, \tilde P(j) \ge \gamma/32 \Big\} < 1/16,$$

which, applying Markov's inequality, yields

$$\tilde P^{m_0}\big\{ \lambda : \forall v,\ \tilde P\{j : |h_{\lambda,v}(x_j) - f_b(x_j)| \ge \gamma\} \ge 1/32 \big\} < 1/16. \quad (3)$$

Now, for all $j$, $|f_b(x_j) - r_j| \ge \gamma$, so if $|\tilde h_{\lambda,v}(j) - b_j| = 1$ the definitions of $\tilde h_{\lambda,v}$ and $f_b$ imply $|h_{\lambda,v}(x_j) - f_b(x_j)| \ge \gamma$. Therefore $\mathrm{er}_{\tilde P,g}(\tilde h_{\lambda,v}) \ge 1/32$ implies

$$\tilde P\{j : |h_{\lambda,v}(x_j) - f_b(x_j)| \ge \gamma\} \ge 1/32,$$

so (3) implies

$$\tilde P^{m_0}\big\{ \lambda : \forall v,\ \mathrm{er}_{\tilde P,g}(\tilde h_{\lambda,v}) \ge 1/32 \big\} < 1/16. \quad (4)$$

That is, $\tilde A$ is unlikely to choose $\tilde S$ so that all elements have large error. We will show that $\tilde A$ can use the remaining $u$ examples to find an accurate function in $\tilde S$. Let

$$u = \big\lceil 96\big(\ln 32 + m_0 \ln\lceil 1/\alpha + 1/2\rceil\big) \big\rceil.$$

Fix a $v$ in $Q_\alpha([0,1])^{m_0}$ and a $\lambda$ in $\{1, \ldots, d\}^{m_0}$. If $\mathrm{er}_{\tilde P,g}(\tilde h_{\lambda,v}) \ge 1/8$, we can apply Theorem 9, with

$$Y_j = \begin{cases} 1 & \text{if } \tilde h_{\lambda,v}(\zeta_j) \ne g(\zeta_j) \\ 0 & \text{otherwise,} \end{cases}$$

to give

$$\tilde P^u\big\{ (\zeta_1, \ldots, \zeta_u) : |\{j : \tilde h_{\lambda,v}(\zeta_j) \ne g(\zeta_j)\}| \le u/16 \big\} \le e^{-u/64}.$$

Similarly, if $\mathrm{er}_{\tilde P,g}(\tilde h_{\lambda,v}) \le 1/32$, Theorem 9 implies

$$\tilde P^u\big\{ (\zeta_1, \ldots, \zeta_u) : |\{j : \tilde h_{\lambda,v}(\zeta_j) \ne g(\zeta_j)\}| \ge u/16 \big\} \le e^{-u/96}.$$

Since this is true for any $v$, and since $|Q_\alpha([0,1])| = \lceil 1/\alpha + 1/2\rceil$, we have

$$\tilde P^u\big\{ (\zeta_1, \ldots, \zeta_u) : \exists v,\ \big(\mathrm{er}_{\tilde P,g}(\tilde h_{\lambda,v}) \ge 1/8 \text{ and } |\{j : \tilde h_{\lambda,v}(\zeta_j) \ne g(\zeta_j)\}| \le u/16\big) \text{ or } \big(\mathrm{er}_{\tilde P,g}(\tilde h_{\lambda,v}) \le 1/32 \text{ and } |\{j : \tilde h_{\lambda,v}(\zeta_j) \ne g(\zeta_j)\}| \ge u/16\big) \big\} \le 2\lceil 1/\alpha + 1/2\rceil^{m_0} e^{-u/96} \quad (5)$$

for any $\lambda \in \{1, \ldots, d\}^{m_0}$. Let $E$ be the event that some hypothesis in $\tilde S$ has error below $1/32$,

$$E = \big\{ (\lambda, \zeta) \in \{1, \ldots, d\}^{m_0 + u} : \exists v,\ \mathrm{er}_{\tilde P,g}(\tilde h_{\lambda,v}) < 1/32 \big\}.$$

(Notice that this event is independent of the examples $\zeta \in \{1, \ldots, d\}^u$ that are used to assess the functions in $\tilde S$.) For $\lambda \in \{1, \ldots, d\}^{m_0}$ and $\zeta \in \{1, \ldots, d\}^u$, let $\tilde A_{\lambda,\zeta,g}$ denote

$$\tilde A\big(\lambda_1, g(\lambda_1), \ldots, \lambda_{m_0}, g(\lambda_{m_0}), \zeta_1, g(\zeta_1), \ldots, \zeta_u, g(\zeta_u)\big).$$

Then (5) and the definition of $u$ imply

$$\Pr\big( \mathrm{er}_{\tilde P,g}(\tilde A_{\lambda,\zeta,g}) > 1/8 \,\big|\, E \big) \le 2\lceil 1/\alpha + 1/2\rceil^{m_0} e^{-u/96} \le 1/16, \quad (6)$$

where the probability is taken over all values of $\lambda$ and $\zeta$, conditioned on $(\lambda, \zeta) \in E$. But (4), which shows that $\Pr(\text{not } E) < 1/16$, and (6) imply

$$\Pr\big( \mathrm{er}_{\tilde P,g}(\tilde A_{\lambda,\zeta,g}) > 1/8 \big) \le \Pr\big( \mathrm{er}_{\tilde P,g}(\tilde A_{\lambda,\zeta,g}) > 1/8 \,\big|\, E \big) + \Pr(\text{not } E) < 1/8.$$

That is, $\tilde A$ $(1/8, 1/8)$-learns $\mathrm{power}_d$ using $m_0 + \lceil 96(\ln 32 + m_0 \ln\lceil 1/\alpha + 1/2\rceil)\rceil$ examples, as claimed. Applying Theorem 8, this implies

$$m_0 + \big\lceil 96\ln 32 + 96 m_0 \ln\lceil 1/\alpha + 1/2\rceil \big\rceil \ge d/2 \;\Rightarrow\; m_0\big(1 + \lceil 96\ln\lceil 1/\alpha + 1/2\rceil\rceil\big) + 333 \ge d/2 \;\Rightarrow\; m_0 \ge \frac{d/2 - 333}{2 + 96\ln\lceil 1/\alpha + 1/2\rceil},$$

which implies the lemma.

3.3 The lower bound

In this section, we combine Lemmas 5 and 10 to prove the following lower bound on the number of examples necessary for learning with observation noise. Obviously the constants have not been optimized.

Theorem 11 Suppose $F$ is a set of $[0,1]$-valued functions defined on $\Omega$, $\mathcal{D}$ is an admissible noise distribution class with total variation function $v$, $0 < \gamma < 1$, $0 < \varepsilon \le \gamma/65$, $0 < \delta \le 1/32$, $\sigma \in \mathbb{R}^+$, and $d \in \mathbb{N}$. If $\mathrm{fat}_F(\gamma) \ge d > 1000$, then any algorithm that $(\varepsilon, \delta, \sigma)$-learns $F$ with noise $\mathcal{D}$ requires at least $m_0$ examples, where

$$m_0 > \min\left\{ \frac{d}{1152 \ln\big(2 + dv(\sigma)/(17\varepsilon)\big)},\ \frac{d}{1152 \ln(d/238)},\ \frac{d}{576 \ln(35/\varepsilon)} \right\}. \quad (7)$$

In particular, if

$$v(\sigma) > \max\big(1/14,\ 101/(d\sqrt{\varepsilon})\big), \quad (8)$$

then

$$m_0 > \frac{d}{1152 \ln\big(2 + dv(\sigma)/(17\varepsilon)\big)}.$$

This theorem shows that if there is a $\gamma > 0$ such that $\mathrm{fat}_F(\gamma)$ is infinite, then we can choose $\varepsilon$, $\delta$, and $\sigma$ for which $(\varepsilon, \delta, \sigma)$-learning is impossible from a finite sample. Similarly, if $\mathrm{fat}_F(\gamma)$ grows faster than polynomially in $1/\gamma$, we can fix $\sigma$ and Theorem 11 implies that the number of examples necessary for learning must grow faster than polynomially in $1/\varepsilon$. This proves the "only if" parts of the characterization theorem (Theorem 3). We will use the following lemma.

Lemma 12 If $x, y, z > 0$, $yz \ge 1$, $w \ge 1$, and $x \ge z/\ln(w(1+xy))$, then $x > z/(2\ln(w(1+yz)))$.

Proof Suppose $x \le z/(2\ln(w(1+yz)))$. Then, since $x\ln(w(1+xy))$ is an increasing function of $x$, we have

$$x\ln(w(1+xy)) \le \frac{z}{2\ln(w(1+yz))}\, \ln\!\left( w\left(1 + \frac{yz}{2\ln(w(1+yz))}\right)\right).$$

But $yz \ge 1$, so

$$w\left(1 + \frac{yz}{2\ln(w(1+yz))}\right) < w^2(1+yz)^2,$$

which implies $x\ln(w(1+xy)) < z$, a contradiction.

Proof (of Theorem 11) Set $\varepsilon = \gamma/65$ and $\delta = 1/32$. Suppose a learning algorithm can $(\varepsilon, \delta, \sigma)$-learn $F$ from $m$ examples with noise $\mathcal{D}$. Lemma 5 shows that, provided

$$\alpha \le \min\big(\varepsilon/(v(\sigma)m),\ 2\varepsilon\big), \quad (9)$$

there is a learning algorithm that can $(2\varepsilon, 2\delta)$-learn $Q_\alpha(F)$ from $m$ examples. From the definition of fat-shattering, $\mathrm{fat}_F(\gamma) \ge d$ implies $\mathrm{fat}_{Q_\alpha(F)}(\gamma - \alpha/2) \ge d$. Furthermore, since $\varepsilon = \gamma/65$, if Inequality (9) is satisfied, we have

$$(\gamma - \alpha/2)/32 \ge (\gamma - \gamma/65)/32 = 2\varepsilon.$$

Lemma 10 shows that, if an algorithm can $(2\varepsilon, 2\delta)$-learn $Q_\alpha(F)$ from $m$ examples (when $2\varepsilon \le (\gamma - \alpha/2)/32$ and $2\delta \le 1/16$), then

$$m \ge \frac{d - 666}{4 + 192\ln\lceil 1/\alpha + 1/2\rceil}. \quad (10)$$

That is, if Inequality (9) is satisfied, we must have $m$ at least this large. Using a case-by-case analysis, in each case choosing $\alpha$ to satisfy Inequality (9), we will show that $m$ is larger than at least one of the terms in (7). Consider the two cases $2\varepsilon \ge \varepsilon/(v(\sigma)m)$ and $2\varepsilon < \varepsilon/(v(\sigma)m)$.

Case (1) ($2\varepsilon \ge \varepsilon/(v(\sigma)m)$). If we set $\alpha = \varepsilon/(v(\sigma)m)$, Inequality (9) is satisfied, so

$$m \ge \frac{d - 666}{4 + 192\ln\lceil v(\sigma)m/\varepsilon + 1/2\rceil} \ge \frac{d/3}{4 + 192\ln\big(\tfrac{3}{2}(1 + 64 v(\sigma)m/(3\varepsilon))\big)} = \frac{d}{12 + 576\ln\big(\tfrac{3}{2}(1 + 64 v(\sigma)m/(3\varepsilon))\big)}. \quad (11)$$

Consider the two cases $v(\sigma) > 3\varepsilon/64$ and $v(\sigma) \le 3\varepsilon/64$. First, suppose $v(\sigma) > 3\varepsilon/64$. Using Lemma 12 with $x = m$, $z = d/576$, $w = \tfrac{3}{2}e^{12/576}$, and $y = 64v(\sigma)/(3\varepsilon)$ (so $yz > d/576 > 1$), we have

$$m > \frac{d}{1152\ln\big(\tfrac{3}{2}e^{12/576} + e^{12/576}\, dv(\sigma)/(18\varepsilon)\big)} \ge \frac{d}{1152\ln\big(2 + dv(\sigma)/(17\varepsilon)\big)},$$

which is the first term in the minimum of Inequality (7).

Now suppose that $v(\sigma) \le 3\varepsilon/64$. Then (11) implies

$$m \ge \frac{d}{12 + 576\ln\big(\tfrac{3}{2}(1 + m)\big)}.$$

Using Lemma 12 with $x = m$, $z = d/576$, $w = \tfrac{3}{2}e^{12/576}$, and $y = 1$ (and noting that $yz = d/576 > 1$), we have

$$m > \frac{d}{1152\ln\big(\tfrac{3}{2}e^{12/576}(1 + d/576)\big)} \ge \frac{d}{1152\ln(d/238)},$$

which is the second term in the minimum of Inequality (7).

Case (2) ($2\varepsilon < \varepsilon/(v(\sigma)m)$). If we set $\alpha = 2\varepsilon$, Inequality (9) is satisfied, so Inequality (10) implies

$$m \ge \frac{d - 666}{4 + 192\ln\lceil 1/(2\varepsilon) + 1/2\rceil} \ge \frac{d}{12 + 576\ln\big(65/(2\gamma) + 3/2\big)} \ge \frac{d}{576\ln(35/\varepsilon)},$$

which is the third term in the minimum of Inequality (7).

We now use Inequality (7) to prove the second part of the theorem. If

$$1152\ln\big(2 + dv(\sigma)/(17\varepsilon)\big) > 1152\ln(d/238) \quad (12)$$

and

$$1152\ln\big(2 + dv(\sigma)/(17\varepsilon)\big) > 576\ln(35/\varepsilon), \quad (13)$$

then

$$m_0 > \frac{d}{1152\ln\big(2 + dv(\sigma)/(17\varepsilon)\big)}.$$

So it suffices to show that (12) and (13) are implied by (8). Indeed, we have that

$$v(\sigma) > 1/14 \;\Rightarrow\; dv(\sigma)/(17\varepsilon) > d/238 \;\Rightarrow\; 2 + dv(\sigma)/(17\varepsilon) > d/238,$$

which implies (12). Similarly,

$$v(\sigma) > 101/(d\sqrt{\varepsilon}) \;\Rightarrow\; 2 + dv(\sigma)/(17\varepsilon) > \sqrt{35/\varepsilon},$$

which implies (13).

4 Upper bound

In this section, we prove an upper bound on the number of examples required for learning with observation noise, finishing the proof of Theorem 3. For $n \in \mathbb{N}$ and $v, w \in \mathbb{R}^n$, let

$$d(v, w) = \frac{1}{n}\sum_{i=1}^n |v_i - w_i|.$$

For $U \subseteq \mathbb{R}^n$ and $\varepsilon > 0$, we say $C \subseteq \mathbb{R}^n$ is an $\varepsilon$-cover of $U$ if and only if for all $v \in U$ there exists $w \in C$ such that $d(v, w) \le \varepsilon$, and we denote by $\mathcal{N}(\varepsilon, U)$ the size of the smallest $\varepsilon$-cover of $U$ (the $\varepsilon$-covering number of $U$). For a function $f : \Omega \to [0,1]$, define $\ell_f : \Omega \times \mathbb{R} \to \mathbb{R}$ by $\ell_f(x, y) = (f(x) - y)^2$, and if $F \subseteq [0,1]^\Omega$, let $\ell_F = \{\ell_f : f \in F\}$. If $W$ is a set, $f : W \to \mathbb{R}$, and $w \in W^m$, let $f_{|w} \in \mathbb{R}^m$ denote $(f(w_1), \ldots, f(w_m))$. Finally, if $F$ is a set of functions from $W$ to $\mathbb{R}$, let $F_{|w} \subseteq \mathbb{R}^m$ be defined by $F_{|w} = \{f_{|w} : f \in F\}$.

The following theorem is due to Haussler [15] (Theorem 3, p. 107); it is an improvement of a result of Pollard [22]. We say a function class is PH-permissible if it satisfies the mild measurability condition defined in Haussler's Section 9.2 [15]. We say a class $F$ of real-valued functions is permissible if the class $\ell_F$ is PH-permissible. This implies that the class $\ell^a_F = \{(x, y) \mapsto |f(x) - y| : f \in F\}$ is PH-permissible, since the square root function on $\mathbb{R}^+$ is measurable.

Theorem 13 ([15]) Let $Y$ be a set and $G$ a PH-permissible class of $[0, M]$-valued functions defined on $Z = \Omega \times Y$, where $M \in \mathbb{R}^+$. For any $\nu > 0$ and any distribution $P$ on $Z$,

$$P^m\left\{ z \in Z^m : \exists g \in G,\ \left| \frac{1}{m}\sum_{i=1}^m g(z_i) - \int g\, dP \right| > \nu \right\} \le 4 \max_{z \in Z^{2m}} \mathcal{N}(\nu/16, G_{|z})\, e^{-\nu^2 m/(64M^2)}.$$

Corollary 14 Let $F$ be a permissible class of $[0,1]$-valued functions defined on $\Omega$. Let $Y = [a, b]$ with $a \le 0$ and $b \ge 1$, and let $Z = \Omega \times Y$. There is a mapping $B$ from $(0,1) \times \bigcup_i Z^i$ to $[0,1]^\Omega$ such that, for any $0 < \nu < 1$ and any distribution $P$ on $Z$,

$$P^m\left\{ z \in Z^m : \int \ell_{B(\nu, z)}\, dP \ge \inf_{f \in F} \int \ell_f\, dP + \nu \right\} \le 4 \max_{z \in Z^{2m}} \mathcal{N}\big(\nu/48, \ell_{F|z}\big)\, e^{-\nu^2 m/(576(b-a)^4)}.$$

The proof is similar to the proof of Haussler's Lemma 1 [15].
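The covering numbers $\mathcal{N}(\varepsilon, U)$ just defined can always be bounded from above by a simple greedy construction; a minimal sketch in Python, illustrative only, using the $\ell_1$-type metric $d$ defined above:

```python
from typing import List, Sequence

def dist(v: Sequence[float], w: Sequence[float]) -> float:
    """The metric d(v, w) = (1/n) * sum |v_i - w_i| used for covers."""
    return sum(abs(a - b) for a, b in zip(v, w)) / len(v)

def greedy_cover(points: List[Sequence[float]], eps: float) -> List[Sequence[float]]:
    """Greedily pick centers until every point is within eps of one of them.
    The result is an eps-cover, so N(eps, points) <= len(cover)."""
    cover: List[Sequence[float]] = []
    for p in points:
        if all(dist(p, c) > eps for c in cover):
            cover.append(p)
    return cover

pts = [(0.0, 0.0), (0.05, 0.05), (1.0, 1.0), (0.9, 1.0)]
cover = greedy_cover(pts, 0.2)
assert all(any(dist(p, c) <= 0.2 for c in cover) for p in pts)
assert len(cover) == 2
```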
Proof For a sequence $z = (z_1, \ldots, z_m)$, let the mapping $B$ return a function $f^* \in F$ that satisfies

$$\frac{1}{m}\sum_{i=1}^m \ell_{f^*}(z_i) < \inf_{f \in F} \frac{1}{m}\sum_{i=1}^m \ell_f(z_i) + \nu/3. \quad (14)$$

Let $M = (b - a)^2$. Theorem 13 implies that, with probability at least

$$1 - 4 \max_z \mathcal{N}\big(\nu/48, \ell_{F|z}\big)\, e^{-\nu^2 m/(576M^2)},$$

we have

$$\left| \frac{1}{m}\sum_{i=1}^m \ell_{f^*}(z_i) - \int \ell_{f^*}\, dP \right| < \nu/3 \quad (15)$$

and

$$\left| \inf_{f \in F} \frac{1}{m}\sum_{i=1}^m \ell_f(z_i) - \inf_{f \in F} \int \ell_f\, dP \right| < \nu/3. \quad (16)$$

By the triangle inequality for the absolute difference on the reals, (14), (15), and (16) imply

$$\int \ell_{f^*}\, dP - \inf_{f \in F} \int \ell_f\, dP < \nu.$$

The following result follows trivially from Alon, Ben-David, Cesa-Bianchi and Haussler's Lemmas 14 and 15 [1].

Theorem 15 ([1]) If $F$ is a class of $[0,1]$-valued functions defined on $\Omega$, $0 < \varepsilon < 1$, and $m \in \mathbb{N}$, then for all $x$ in $\Omega^m$,

$$\mathcal{N}(\varepsilon, F_{|x}) \le 2(m b^2)^{\log c},$$

where $b = \lceil 2/\varepsilon \rceil + 1$ and $c = \sum_{i=1}^{\mathrm{fat}_F(\varepsilon/4)} \binom{m}{i} b^i$.

Corollary 16 For $F$ defined as in Theorem 15, if $0 < \varepsilon < 1/2$ and $m \ge \mathrm{fat}_F(\varepsilon/4)/2$, then for all $x$ in $\Omega^m$,

$$\mathcal{N}(\varepsilon, F_{|x}) \le \exp\left( \frac{2}{\ln 2}\, \mathrm{fat}_F(\varepsilon/4)\, \ln^2\!\left(\frac{9m}{\varepsilon^2}\right) \right).$$

Proof Let $d = \mathrm{fat}_F(\varepsilon/4)$. If $d = 0$ then any $f_1$ and $f_2$ in $F$ have $|f_1(x) - f_2(x)| < \varepsilon/2$, so $\mathcal{N}(\varepsilon, F_{|x}) \le 1$ in this case. Assume then that $d \ge 1$. We have $b < 3/\varepsilon$ and

$$\log c \le \log \sum_{i=1}^d \binom{m}{i} (3/\varepsilon)^i \le \log\big( d\, m^d (3/\varepsilon)^d \big) \le d \log(3m/\varepsilon) + \log d.$$

So we have

$$\ln \mathcal{N}(\varepsilon, F_{|x}) \le \ln 2 + \big( d \log(3m/\varepsilon) + \log d \big) \ln(9m/\varepsilon^2) < \frac{2d}{\ln 2} \ln(3m/\varepsilon) \ln(9m/\varepsilon^2) < \frac{2d}{\ln 2} \ln^2(9m/\varepsilon^2).$$
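The reason covers of $F$ will yield covers of $\ell_F$ (Lemma 17 below) is that the squared loss is Lipschitz in the hypothesis value on a bounded range: $|(g - y)^2 - (f - y)^2| \le 3|b - a|\,|f - g|$ for $f, g \in [0,1]$ and $y \in [a, b]$. A numerical spot check of this inequality on a grid (illustrative only):

```python
# Spot-check the Lipschitz bound |(g - y)^2 - (f - y)^2| <= 3*(b - a)*|f - g|
# for f, g in [0, 1] and y in [a, b], with a <= 0 and b >= 1.
a, b = -0.5, 1.5
grid = [i / 20 for i in range(21)]              # f, g values in [0, 1]
ys = [a + (b - a) * i / 20 for i in range(21)]  # y values in [a, b]
for f in grid:
    for g in grid:
        for y in ys:
            lhs = abs((g - y) ** 2 - (f - y) ** 2)
            assert lhs <= 3 * (b - a) * abs(f - g) + 1e-12
```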

Note that the bound of Corollary 14 involves covering numbers of $\ell_F$, whereas Corollary 16 bounds covering numbers of $F$. This was handled in [1] in the case of probabilistic concepts (where the $Y = [a, b]$ in Corollary 14 is replaced by $Y = \{0,1\}$) by showing that in that case, $\mathrm{fat}_{\ell_F}(\varepsilon) \le \mathrm{fat}_F(\varepsilon/2)$. In the following lemma, we relate the covering numbers of $\ell_F$ and of $F$.[5]

Lemma 17 Choose a set $F$ of functions from $\Omega$ to $[0,1]$. Then for any $\nu > 0$ and any $m \in \mathbb{N}$, if $a \le 0$ and $b \ge 1$,

$$\max_{z \in (\Omega \times [a,b])^m} \mathcal{N}\big(\nu, (\ell_F)_{|z}\big) \le \max_{x \in \Omega^m} \mathcal{N}\!\left( \frac{\nu}{3|b-a|},\ F_{|x} \right).$$

Proof We show that, for any sequence $z$ of $(x, y)$ pairs in $\Omega \times [a, b]$ and any functions $f$ and $g$, if the restrictions of $f$ and $g$ to $x$ are close, then the restrictions of $\ell_f$ and $\ell_g$ to $z$ are close. Thus, given a cover of $F_{|x}$, we can construct a cover of $(\ell_F)_{|z}$ that is no bigger. Now, choose $(x_1, y_1), \ldots, (x_m, y_m) \in \Omega \times [a, b]$, and $f, g : \Omega \to [0,1]$. We have

$$\frac{1}{m}\sum_{i=1}^m \big| (g(x_i) - y_i)^2 - (f(x_i) - y_i)^2 \big| = \frac{1}{m}\sum_{i=1}^m \big| \big((f(x_i) - g(x_i)) + g(x_i) - y_i\big)^2 - (g(x_i) - y_i)^2 \big|$$
$$= \frac{1}{m}\sum_{i=1}^m \big| (f(x_i) - g(x_i))^2 + 2(f(x_i) - g(x_i))(g(x_i) - y_i) \big| \le \frac{1}{m}\sum_{i=1}^m \Big( (f(x_i) - g(x_i))^2 + 2|f(x_i) - g(x_i)|\,|g(x_i) - y_i| \Big)$$
$$\le \frac{1}{m}\sum_{i=1}^m 3|b - a|\,|f(x_i) - g(x_i)|.$$

Thus if $x = (x_1, \ldots, x_m) \in \Omega^m$ and $z = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\Omega \times [a,b])^m$, and $d(f_{|x}, g_{|x}) \le \nu/(3|b-a|)$, then $d(\ell_{f|z}, \ell_{g|z}) \le \nu$. So if $S$ is a $\nu/(3|b-a|)$-cover of $F_{|x}$, we can construct a $\nu$-cover $T$ of $(\ell_F)_{|z}$ as

$$T = \Big\{ \big((u_1 - y_1)^2, \ldots, (u_m - y_m)^2\big) : u \in S \Big\}.$$

Since $(x_1, y_1), \ldots, (x_m, y_m)$ was chosen arbitrarily, this completes the proof.

In our proof of upper bounds on the number of examples needed for learning, we will make use of the following lemma.

Lemma 18 For any $y_1, y_2, y_4, \varepsilon > 0$ and $y_3 \ge 1$, if

$$m \ge \frac{8y_2}{y_4}\left( 2\ln(4\sqrt{2}) + \ln\frac{2 y_2 y_3}{y_4} \right)^2 + \frac{2}{y_4}\ln\frac{y_1}{\varepsilon},$$

then $y_1 \exp\big( y_2 \ln^2(y_3 m) - y_4 m \big) \le \varepsilon$.

[5] Recently, Gurvits and Koiran have proved a result relating the fat-shattering functions of $\ell_F$ and $F$ [14].

Proof The assumed lower bound on $m$ implies that

$$m \ge \frac{2}{y_4}\ln\frac{y_1}{\varepsilon} \quad (17)$$

and

$$m \ge \frac{8y_2}{y_4}\left( 2\ln(4\sqrt{2}) + \ln\frac{2 y_2 y_3}{y_4} \right)^2.$$

Taking square roots of the latter inequality and fiddling a little with the second term, we get

$$\sqrt{m} \ge 2\sqrt{2}\sqrt{\frac{y_2}{y_4}}\left( 2\ln\!\left(4\sqrt{\frac{2y_2}{y_4}}\right) + \ln y_3 \right).$$

Setting $b = \frac{1}{4}\sqrt{\frac{y_4}{2y_2}}$, the previous inequality implies that

$$\sqrt{m}\left(1 - 2b\sqrt{\frac{2y_2}{y_4}}\right) \ge \sqrt{2}\sqrt{\frac{y_2}{y_4}}\big( 2\ln(1/b) + \ln y_3 \big),$$

which trivially yields

$$\sqrt{m} \ge \sqrt{2}\sqrt{\frac{y_2}{y_4}}\big( 2(b\sqrt{m} + \ln(1/b)) + \ln y_3 \big).$$

The above inequality, using the fact [24] that for all $a, b > 0$, $\ln a \le ab + \ln(1/b)$, implies that

$$\sqrt{m} \ge \sqrt{2}\sqrt{\frac{y_2}{y_4}}\big( 2\ln\sqrt{m} + \ln y_3 \big) = \sqrt{2}\sqrt{\frac{y_2}{y_4}}\,\ln(y_3 m).$$

Squaring both sides and combining with (17), we get

$$m \ge \frac{1}{y_4}\left( y_2 \ln^2(y_3 m) + \ln\frac{y_1}{\varepsilon} \right).$$

Solving for $\varepsilon$ completes the proof.

We can now present the upper bound. Again, the constants have not been optimized.

Theorem 19 For any permissible class $F$ of functions from $\Omega$ to $[0,1]$, there is a learning algorithm $A$ such that, for all bounded admissible distribution classes $\mathcal{D}$ with support function $s$, for all probability distributions $P$ on $\Omega$, and for all $0 < \varepsilon < 1/2$, $0 < \delta < 1$, and $\sigma > 0$, if $d = \mathrm{fat}_F\big(\varepsilon^2/(576(s(\sigma) + 1))\big)$, then $A$ $(\varepsilon, \delta, \sigma)$-learns $F$ from

$$\frac{1152(1 + s(\sigma))^4}{\varepsilon^4}\left( 12d\left( 25 + \ln\frac{d(1 + s(\sigma))^6}{\varepsilon^8} \right)^2 + \ln\frac{4}{\delta} \right)$$

examples with noise $\mathcal{D}$.

Proof Let $B$ be the mapping from Corollary 14. Choose $0 < \varepsilon < 1/2$, $0 < \delta < 1$, and $\sigma > 0$. Let $\varepsilon_0 = \varepsilon^2$. Let $D$ be a distribution in $\mathcal{D}$ with variance $\sigma^2$ and support contained in $[c, d]$, so $d - c \le s(\sigma)$. Choose a distribution $P$ on $\Omega$ and a function $f \in F$.

For $x \in \Omega^m$ and $\eta \in [c, d]^m$, let $B_{x,\eta} = B(\varepsilon_0, \mathrm{sam}(x, \eta, f))$. Define the event

$$\mathrm{bad} = \left\{ (x, \eta) \in \Omega^m \times [c, d]^m : \int_\Omega \int_{[c,d]} \big( B_{x,\eta}(u) - (f(u) + \eta')\big)^2\, dD(\eta')\, dP(u) \ge \inf_{g \in F} \int_\Omega \int_{[c,d]} \big( g(u) - (f(u) + \eta')\big)^2\, dD(\eta')\, dP(u) + \varepsilon_0 \right\}.$$

Since $D$ has variance $\sigma^2$ and mean $0$,

$$\inf_{g \in F} \int_\Omega \int_{[c,d]} \big( g(u) - (f(u) + \eta')\big)^2\, dD(\eta')\, dP(u) = \sigma^2,$$

so

$$\mathrm{bad} = \left\{ (x, \eta) : \int_\Omega \int_{[c,d]} \big( B_{x,\eta}(u) - (f(u) + \eta')\big)^2\, dD(\eta')\, dP(u) \ge \sigma^2 + \varepsilon_0 \right\}.$$

The random variable $f(u) + \eta'$ has a distribution on $[c, 1 + d]$, determined by the distributions $P$ and $D$ and the function $f$. Thus, by Corollary 14,

$$\Pr(\mathrm{bad}) \le 4 \max_{z \in (\Omega \times [c, 1+d])^{2m}} \mathcal{N}\big(\varepsilon_0/48, (\ell_F)_{|z}\big) \exp\left( -\frac{\varepsilon_0^2 m}{576(1 + s(\sigma))^4} \right).$$

Lemma 17 implies

$$\Pr(\mathrm{bad}) \le 4 \max_{x \in \Omega^{2m}} \mathcal{N}\left( \frac{\varepsilon_0}{144(1 + s(\sigma))},\ F_{|x} \right) \exp\left( -\frac{\varepsilon_0^2 m}{576(1 + s(\sigma))^4} \right).$$

Applying Corollary 16, if

$$d = \mathrm{fat}_F\left( \frac{\varepsilon_0}{576(1 + s(\sigma))} \right)$$

and $m \ge d/2$, then

$$\Pr(\mathrm{bad}) \le 4 \exp\left( \frac{2d}{\ln 2}\, \ln^2\!\left( \frac{373248\, m (1 + s(\sigma))^2}{\varepsilon_0^2} \right) - \frac{\varepsilon_0^2 m}{576(1 + s(\sigma))^4} \right). \quad (18)$$

For any particular $x \in \Omega^m$ and $\eta \in [c, d]^m$,

$$\int_\Omega \int_{[c,d]} \big( B_{x,\eta}(u) - (f(u) + \eta')\big)^2\, dD(\eta')\, dP(u) = \int_\Omega \big( B_{x,\eta}(u) - f(u)\big)^2\, dP(u) - 2\int_\Omega \int_{[c,d]} \eta'\big( B_{x,\eta}(u) - f(u)\big)\, dD(\eta')\, dP(u) + \sigma^2 = \int_\Omega \big( B_{x,\eta}(u) - f(u)\big)^2\, dP(u) + \sigma^2,$$

because of the independence of the noise, and the fact that it has zero mean. Thus

$$\mathrm{bad} = \left\{ (x, \eta) \in \Omega^m \times [c, d]^m : \int_\Omega \big( B_{x,\eta}(u) - f(u)\big)^2\, dP(u) \ge \varepsilon_0 \right\}.$$

If

$$m \ge \frac{1152(1 + s(\sigma))^4}{\varepsilon_0^2}\left( 12d\left( 25 + \ln\frac{d(1 + s(\sigma))^6}{\varepsilon_0^4} \right)^2 + \ln\frac{4}{\delta} \right), \quad (19)$$

then applying Lemma 18, with $y_1 = 4$, $y_2 = 2d/\ln 2$, $y_3 = 373248(1 + s(\sigma))^2/\varepsilon_0^2$, and $y_4 = \varepsilon_0^2/(576(1 + s(\sigma))^4)$, we have that (18) and (19) imply

$$P^m \times D^m\left\{ (x, \eta) \in \Omega^m \times [c,d]^m : \int_\Omega \big( B_{x,\eta}(u) - f(u)\big)^2\, dP(u) \ge \varepsilon_0 \right\} < \delta. \quad (20)$$

From Jensen's inequality,

$$\left\{ (x, \eta) : \int_\Omega |B_{x,\eta}(u) - f(u)|\, dP(u) \ge \sqrt{\varepsilon_0} \right\} \subseteq \left\{ (x, \eta) : \int_\Omega \big( B_{x,\eta}(u) - f(u)\big)^2\, dP(u) \ge \varepsilon_0 \right\},$$

so if $m \ge m_0(\varepsilon, \delta, \sigma)$, where

$$m_0(\varepsilon, \delta, \sigma) = \frac{1152(1 + s(\sigma))^4}{\varepsilon^4}\left( 12d\left( 25 + \ln\frac{d(1 + s(\sigma))^6}{\varepsilon^8} \right)^2 + \ln\frac{4}{\delta} \right) \qquad\text{and}\qquad d = \mathrm{fat}_F\left( \frac{\varepsilon^2}{576(1 + s(\sigma))} \right),$$

then

$$P^m \times D^m\left\{ (x, \eta) : \int_\Omega \big| \big(B(\varepsilon^2, \mathrm{sam}(x, \eta, f))\big)(u) - f(u)\big|\, dP(u) \ge \varepsilon \right\} < \delta.$$

Now, let $A$ be the algorithm that counts the number $m$ of examples it receives and chooses $\varepsilon_1$ such that $m_0(\varepsilon_1, 1, 0) = m$. This is always possible, since $d$ and hence $m_0$ are non-increasing functions of $\varepsilon$. Algorithm $A$ then passes $\varepsilon_1^2$ and the examples to the mapping $B$, and returns $B$'s hypothesis. Since $s(\sigma)$ is a non-decreasing function of $\sigma$, $m_0$ is a non-decreasing function of $1/\varepsilon$, $1/\delta$, and $\sigma$, so for any $\varepsilon$, $\delta$, and $\sigma$ satisfying $m_0(\varepsilon, \delta, \sigma) \le m$, we must have $\varepsilon_1 \le \varepsilon$. It follows that, for any $\varepsilon$, $\delta$, and $\sigma$ for which $A$ sees at least $m_0(\varepsilon, \delta, \sigma)$ examples, if $P$ is a distribution on $\Omega$ and $D \in \mathcal{D}$ has variance $\sigma^2$, then

$$P^m \times D^m\left\{ (x, \eta) : \int_\Omega \big| \big(A(\mathrm{sam}(x, \eta, f))\big)(u) - f(u)\big|\, dP(u) \ge \varepsilon \right\} < \delta,$$

completing the proof.

As an immediate consequence of Theorem 19, if $F$ has a finite fat-shattering function and $\mathcal{D}$ is a bounded admissible distribution class, then $F$ is learnable with observation noise $\mathcal{D}$. The following corollary provides the one implication in Theorem 3 we have yet to prove.

Corollary 20 Let $F$ be a class of functions from $\Omega$ to $[0,1]$. Let $p$ be a polynomial, and suppose $\mathrm{fat}_F(\gamma) < p(1/\gamma)$ for all $0 < \gamma < 1$. Then for any almost-bounded admissible noise distribution class $\mathcal{D}$, $F$ is small-sample learnable with noise $\mathcal{D}$.

Proof We will show that Algorithm $A$ from Theorem 19 can $(\varepsilon, \delta, \sigma)$-learn $F$ from a polynomial number of examples with noise $\mathcal{D}$. Let $s : \mathbb{R}^+ \to \mathbb{R}^+$ (we will define $s$ later). Choose $0 < \varepsilon, \delta < 1$ and $\sigma > 0$. Fix a distribution $P$ on $\Omega$, a function $f$ in $F$, and a noise distribution $D$ in $\mathcal{D}$ with variance $\sigma^2$. Construct a distribution $D_s$ from $D$ as follows. Let $\phi$ be the pdf of $D$. Define the pdf $\phi_s$ of $D_s$ as

$$\phi_s(x) = \begin{cases} \dfrac{\phi(x)}{\int_{-s(\sigma)/2}^{s(\sigma)/2} \phi(t)\, dt} & \text{if } -s(\sigma)/2 < x < s(\sigma)/2 \\ 0 & \text{otherwise.} \end{cases}$$

Since $\mathcal{D}$ is an almost-bounded admissible class, there are universal constants $s_0, c_0 \in \mathbb{R}^+$ such that, if $s(\sigma) > s_0$,

$$\int_{-s(\sigma)/2}^{s(\sigma)/2} \phi(x)\, dx \ge 1 - c_0 e^{-s(\sigma)/\sigma}.$$

Let $I = \int_{-s(\sigma)/2}^{s(\sigma)/2} \phi(x)\, dx$. The total variation distance between $D$ and $D_s$ is

$$d_{TV}(D, D_s) = \int_{-\infty}^{\infty} |\phi(x) - \phi_s(x)|\, dx = 1 - I + \int_{-s(\sigma)/2}^{s(\sigma)/2} |\phi(x) - \phi_s(x)|\, dx = 1 - I + |1 - 1/I|\int_{-s(\sigma)/2}^{s(\sigma)/2} \phi(x)\, dx = 2(1 - I) \le 2 c_0 e^{-s(\sigma)/\sigma}. \quad (21)$$

For some $m$ in $\mathbb{N}$, fix $x \in \Omega^m$ and define the event

$$E_1 = \big\{ \eta \in \mathbb{R}^m : \mathrm{er}_{P,f}(A(\mathrm{sam}(x, \eta, f))) \ge \varepsilon \big\}.$$

Then (21) and Lemma 7 show that

$$D^m(E_1) \le D_s^m(E_1) + m c_0 \exp(-s(\sigma)/\sigma).$$

If we choose $s(\sigma) = \sigma(s_0 + |\ln(2m c_0/\delta)|)$, then (21) holds and $s(\sigma)/\sigma \ge \ln(2m c_0/\delta)$, so $D^m(E_1) \le D_s^m(E_1) + \delta/2$. Since this is true for any $x \in \Omega^m$,

$$P^m \times D^m(E_2) \le P^m \times D_s^m(E_2) + \delta/2,$$

where

$$E_2 = \big\{ (x, \eta) \in \Omega^m \times \mathbb{R}^m : \mathrm{er}_{P,f}(A(\mathrm{sam}(x, \eta, f))) \ge \varepsilon \big\}.$$

Clearly, $D_s$ has mean $0$, finite variance, and support contained in an interval of length $s(\sigma)$. From the proof of Theorem 19, there is a polynomial $p_1$ such that if

$$m \ge p_1\big(s(\sigma), d, 1/\varepsilon, \ln(1/\delta)\big),$$

then

$$P^m \times D_s^m(E_2) < \delta/2. \quad (22)$$

Now, $\mathrm{fat}_F(\gamma) < p(1/\gamma)$, so for some polynomial $p_2$, $m > p_2(\sigma, 1/\varepsilon, \log(1/\delta), \log m)$ implies (22). Clearly, for some polynomial $p_3$, if $m > p_3(\sigma, 1/\varepsilon, \log(1/\delta))$ then $P^m \times D^m(E_2) < \delta$. Since this is true for any $P$ and any $D$ in $\mathcal{D}$ with variance $\sigma^2$, Algorithm $A$ $(\varepsilon, \delta, \sigma)$-learns $F$ with noise $\mathcal{D}$ from $p_3(\sigma, 1/\varepsilon, \log(1/\delta))$ examples.
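The truncation step in the proof above — renormalizing the mass inside $[-s(\sigma)/2, s(\sigma)/2]$ and computing $d_{TV}(D, D_s) = 2(1 - I)$ — can be sketched with a discrete stand-in for the noise density (the point masses below are illustrative, not a distribution from the paper):

```python
# Discrete stand-in for a noise density: point masses at equally spaced values.
# Truncation keeps the mass inside (-s/2, s/2) and renormalizes, as in the proof.
def truncate(masses, s):
    """Return (truncated distribution, inside mass I)."""
    inside = {v: p for v, p in masses.items() if -s / 2 < v < s / 2}
    I = sum(inside.values())
    return {v: p / I for v, p in inside.items()}, I

def l1_distance(p, q):
    """Integral of |p - q| over the support (the text's d_TV convention)."""
    support = set(p) | set(q)
    return sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)

D = {-2.0: 0.05, -1.0: 0.2, 0.0: 0.5, 1.0: 0.2, 2.0: 0.05}
Ds, I = truncate(D, 3.0)   # keeps -1, 0, 1
assert abs(l1_distance(D, Ds) - 2 * (1 - I)) < 1e-12
```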

5 Agnostic learning

In this section, we consider an agnostic learning model, a model of learning in which assumptions about the target function and observation noise are removed. In this model, we assume labelled examples $(x, y)$ are generated by some joint distribution $P$ on $\Omega \times [0,1]$. The agnostic learning problem can be viewed as the problem of learning a real-valued function $f$ with observation noise when the constraints on the noise are relaxed; in particular, we no longer have the constraint that the noise is independent of the value $f(x)$. This model has been studied in [15] and [17]. If $h$ is a $[0,1]$-valued function defined on $\Omega$, define the error of $h$ with respect to $P$ as

$$\mathrm{er}_P(h) = \int_{\Omega \times [0,1]} |h(x) - y|\, dP(x, y).$$

We require that the learner choose a function with error little worse than that of the best function in some "touchstone" function class $F$. Notice that the learner is not restricted to choose a function from $F$; the class $F$ serves only to provide a performance measurement standard (see [17]).

Definition 21 Suppose $F$ is a class of $[0,1]$-valued functions defined on $\Omega$, $P$ is a probability distribution on $\Omega \times [0,1]$, $0 < \varepsilon, \delta < 1$ and $m \in \mathbb{N}$. We say a learning algorithm $L = (A, D_A)$ (where $D_A$ is the distribution of the algorithm's internal randomization) $(\varepsilon, \delta)$-learns in the agnostic sense with respect to $F$ from $m$ examples if, for all distributions $P$ on $\Omega \times [0,1]$,

$$(P^m \times D_A^m)\left\{ (x, y, z) : \mathrm{er}_P(A(x, y, z)) \ge \inf_{f \in F} \mathrm{er}_P(f) + \varepsilon \right\} < \delta.$$

The function class $F$ is agnostically learnable if there is a learning algorithm $L$ and a function $m_0 : (0,1) \times (0,1) \to \mathbb{N}$ such that, for all $0 < \varepsilon, \delta < 1$, algorithm $L$ $(\varepsilon, \delta)$-learns in the agnostic sense with respect to $F$ from $m_0(\varepsilon, \delta)$ examples. If, in addition, $m_0$ is bounded by a polynomial in $1/\varepsilon$ and $1/\delta$, we say that $F$ is small-sample agnostically learnable.

The following result is analogous to the characterization theorem of Section 2.

Theorem 22 Suppose $F$ is a permissible class of $[0,1]$-valued functions defined on $\Omega$. Then $F$ is agnostically learnable if and only if its fat-shattering function is finite, and $F$ is small-sample agnostically learnable if and only if there is a polynomial $p$ such that $\mathrm{fat}_F(\gamma) < p(1/\gamma)$ for all $\gamma > 0$.

The proof by Alon, Ben-David, Cesa-Bianchi and Haussler [1] that finiteness of the fat-shattering function of the class $\ell_F$ is sufficient for learnability of a class $F$ of probabilistic concepts also shows that this condition is sufficient for the agnostic learnability of a class $F$ of real-valued functions. A simpler version of Lemma 17 then shows that finiteness of the fat-shattering function of $F$ suffices for agnostic learnability. If the "loss" of the learning algorithm were measured with $(h(x) - y)^2$ instead of $|h(x) - y|$, then the necessity part of Theorem 22 would follow from the results of Kearns and Schapire [16]. The following result proves the "only if" parts of the theorem.

Theorem 23 Let $F$ be a class of $[0,1]$-valued functions defined on $\Omega$. Suppose $0 < \gamma < 1$, $0 < \varepsilon \le \gamma/65$, $0 < \delta \le 1/16$, and $d \in \mathbb{N}$. If $\mathrm{fat}_F(\gamma) \ge d > 1000$, then any learning algorithm that $(\varepsilon, \delta)$-learns in the agnostic sense with respect to $F$ requires at least $m_0$ examples, where

$$m_0 > \frac{d}{576 \ln(35/\varepsilon)}.$$

Proof The proof is similar to, though simpler than, the argument in Section 3. We will show that the agnostic learning problem is not much harder than the problem of learning a quantized version of the function class $F$, and then apply Lemma 10. Set $\varepsilon = \gamma/65$ and $\delta = 1/16$. Consider the class of distributions $P$ on $\Omega \times [0,1]$ for which there exists an $f$ in $F$ such that, for all $x \in \Omega$,

$$P(y \mid x) = \begin{cases} 1 & \text{if } y = Q_{2\varepsilon}(f(x)) \\ 0 & \text{otherwise.} \end{cases}$$

Fix a distribution $P$ in this class. Let $L$ be a randomized learning algorithm that can $(\varepsilon, \delta)$-learn in the agnostic sense with respect to $F$. Then

$$\Pr\left( \mathrm{er}_P(L) \ge \inf_{f \in F} \mathrm{er}_P(f) + \varepsilon \right) < \delta,$$

where $\mathrm{er}_P(L)$ is the error of the function that the learning algorithm chooses. But the definition of $P$ ensures that $\inf_{f \in F} \mathrm{er}_P(f) \le \varepsilon$, so

$$\Pr\big( \mathrm{er}_P(L) \ge 2\varepsilon \big) < \delta.$$

Since this is true for any distribution $P$ that can be expressed as the product of a distribution on $\Omega$ and the generalized derivative of the indicator function on $[0,1]$ of a function in $Q_{2\varepsilon}(F)$, the learning algorithm $L$ can $(2\varepsilon, \delta)$-learn the quantized function class $Q_{2\varepsilon}(F)$. By hypothesis, $\mathrm{fat}_F(\gamma) \ge d$, but then the definition of fat-shattering implies that $\mathrm{fat}_{Q_{2\varepsilon}(F)}(\gamma - \varepsilon) \ge d$. Since $\varepsilon = \gamma/65$, $2\varepsilon \le (\gamma - \varepsilon)/32$. Also, $\delta = 1/16$, so Lemma 10 implies

$$m_0 \ge \frac{d - 666}{4 + 192\ln\lceil 1/(2\varepsilon) + 1/2\rceil} \ge \frac{d}{12 + 576\ln\big(65/(2\gamma) + 3/2\big)} \ge \frac{d}{576\ln(35/\varepsilon)}.$$

With minor modifications, the proof of Theorem 19 yields the following analogous result for agnostic learning.

Theorem 24 Choose a permissible set $F$ of functions from $\Omega$ to $[0,1]$. There exists an algorithm $A$ such that, for all $0 < \varepsilon < 1/2$ and all $0 < \delta < 1$, if $\mathrm{fat}_F(\varepsilon/192) = d$, then $A$ agnostically $(\varepsilon, \delta)$-learns $F$ from

$$\frac{1152}{\varepsilon^2}\left( 12d\left( 23 + \ln\frac{d}{\varepsilon^4} \right)^2 + \ln\frac{4}{\delta} \right)$$

examples.

Proof Sketch First, the analog of Corollary 14, where the expected absolute error is used to measure the "quality" of a hypothesis in place of the expected squared error, and $b = 1$ and $a = 0$, can be proved using essentially the same argument.
Second, the analog of Lemma 17, in which $\ell_F$ is replaced with the corresponding class constructed from the absolute loss in place of $\ell$, with $a = 0$ and $b = 1$, and with the $\nu/(3|b - a|)$ of the upper bound replaced by $\nu$, is obtained by a simpler but similar proof. These results are combined with Corollary 16 and Lemma 18 in much the same way as in the proof of Theorem 19.
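The agnostic learner behind Theorem 24 is, at heart, empirical risk minimization under the absolute loss: choose the hypothesis whose empirical error on the sample is smallest, and uniform convergence guarantees it is near the best in the touchstone class. A minimal sketch over a finite candidate set (the constant-hypothesis class and labels below are illustrative stand-ins):

```python
import random
from typing import Callable, List, Sequence, Tuple

def empirical_abs_loss(h: Callable[[float], float],
                       sample: Sequence[Tuple[float, float]]) -> float:
    """(1/m) * sum |h(x) - y|, the empirical version of er_P(h)."""
    return sum(abs(h(x) - y) for x, y in sample) / len(sample)

def erm(candidates: List[Callable[[float], float]],
        sample: Sequence[Tuple[float, float]]) -> Callable[[float], float]:
    """Empirical risk minimization over a finite candidate set."""
    return min(candidates, key=lambda h: empirical_abs_loss(h, sample))

# Illustrative stand-ins: constant hypotheses as the touchstone class, and
# labels y = 0.5 regardless of x, so the best constant is 0.5.
candidates = [lambda x, c=c: c for c in (0.0, 0.25, 0.5, 0.75, 1.0)]
rng = random.Random(0)
sample = [(rng.random(), 0.5) for _ in range(100)]
best = erm(candidates, sample)
assert best(0.0) == 0.5
```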