Estimating Average-Case Learning Curves Using Bayesian, Statistical Physics and VC Dimension Methods


David Haussler
University of California
Santa Cruz, California

Manfred Opper
Institut für Theoretische Physik
Universität Giessen, Germany

Michael Kearns
AT&T Bell Laboratories
Murray Hill, New Jersey

Robert Schapire
AT&T Bell Laboratories
Murray Hill, New Jersey

Abstract

In this paper we investigate an average-case model of concept learning, and give results that place the popular statistical physics and VC dimension theories of learning curve behavior in a common framework.

1 INTRODUCTION

In this paper we study a simple concept learning model in which the learner attempts to infer an unknown target concept $f$, chosen from a known concept class $\mathcal{F}$ of $\{0,1\}$-valued functions over an input space $X$. At each trial $i$, the learner is given a point $x_i \in X$ and asked to predict the value of $f(x_i)$. If the learner predicts $f(x_i)$ incorrectly, we say the learner makes a mistake. After making its prediction, the learner is told the correct value. This simple theoretical paradigm applies to many areas of machine learning, including much of the research in neural networks. The quantity of fundamental interest in this setting is the learning curve, which is the function of $m$ defined as the probability that the learning algorithm makes a mistake predicting $f(x_{m+1})$, having already seen the examples $(x_1, f(x_1)), \ldots, (x_m, f(x_m))$.

Contact author. Address: AT&T Bell Laboratories, 600 Mountain Avenue, Room 2A-423, Murray Hill, New Jersey 07974. Electronic mail: kearns@research.att.com.
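This prediction protocol is easy to simulate for a small finite class. The following Python sketch is purely illustrative and not from the paper: it assumes a toy class of threshold functions on $\{0, \ldots, N-1\}$, a uniform prior over thresholds, uniformly drawn instances, and a majority-vote predictor (which is the Bayes rule under a uniform prior), and it estimates the learning curve by Monte Carlo.

    import random

    # Toy concept class, assumed for illustration only: thresholds on {0, ..., N-1},
    # with f_t(x) = 1 iff x >= t, a uniform prior over thresholds, and uniformly
    # drawn instances.  None of these choices come from the paper.
    N = 20
    CONCEPTS = list(range(N + 1))            # threshold t in {0, ..., N}

    def label(t, x):
        return 1 if x >= t else 0

    def bayes_predict(version_space, x):
        # Majority vote over the concepts still consistent with the labels seen so
        # far; with a uniform prior this is exactly the Bayes optimal prediction.
        votes = sum(label(t, x) for t in version_space)
        if 2 * votes > len(version_space):
            return 1
        if 2 * votes < len(version_space):
            return 0
        return random.randint(0, 1)          # tie: flip a fair coin

    def learning_curve(max_m=30, runs=2000):
        mistakes = [0] * max_m
        for _ in range(runs):
            target = random.choice(CONCEPTS)             # f drawn from the prior
            version_space = list(CONCEPTS)
            for m in range(max_m):
                x = random.randrange(N)                  # x_{m+1} drawn uniformly
                if bayes_predict(version_space, x) != label(target, x):
                    mistakes[m] += 1
                version_space = [t for t in version_space
                                 if label(t, x) == label(target, x)]
        return [c / runs for c in mistakes]

    if __name__ == "__main__":
        for m, p in enumerate(learning_curve()):
            print(f"after {m:2d} examples: estimated Pr[mistake] = {p:.3f}")

For this one-dimensional toy class the estimated curve decays roughly like a constant over $m$, which is the qualitative behavior the $d/m$-type bounds of Section 6 lead one to expect.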

In this paper we study learning curves in an average-case setting that admits a prior distribution over the concepts in $\mathcal{F}$. We examine learning curve behavior for the optimal Bayes algorithm and for the related Gibbs algorithm that has been studied in statistical physics analyses of learning curve behavior. For both algorithms we give new upper and lower bounds on the learning curve in terms of the Shannon information gain.

The main contribution of this research is in showing that the average-case or Bayesian model provides a unifying framework for the popular statistical physics and VC dimension theories of learning curves. By beginning in an average-case setting and deriving bounds in information-theoretic terms, we can gradually recover a worst-case theory by removing the averaging in favor of combinatorial parameters that upper bound certain expectations.

Due to space limitations, the paper is technically dense and almost all derivations and proofs have been omitted. We strongly encourage the reader to refer to our longer and more complete versions [4, 6] for additional motivation and technical detail.

2 NOTATIONAL CONVENTIONS

Let $X$ be a set called the instance space. A concept class $\mathcal{F}$ over $X$ is a (possibly infinite) collection of subsets of $X$. We will find it convenient to view a concept $f \in \mathcal{F}$ as a function $f : X \to \{0,1\}$, where we interpret $f(x) = 1$ to mean that $x \in X$ is a positive example of $f$, and $f(x) = 0$ to mean $x$ is a negative example.

The symbols $\mathcal{P}$ and $\mathcal{D}$ are used to denote probability distributions. The distribution $\mathcal{P}$ is over $\mathcal{F}$, and $\mathcal{D}$ is over $X$. When $\mathcal{F}$ and $X$ are countable we assume that these distributions are defined as probability mass functions. For uncountable $\mathcal{F}$ and $X$ they are assumed to be probability measures over some appropriate $\sigma$-algebra. All of our results hold for both countable and uncountable $\mathcal{F}$ and $X$.

We use the notation $E_{f \in \mathcal{P}}[\chi(f)]$ for the expectation of the random variable $\chi$ with respect to the distribution $\mathcal{P}$, and $\Pr_{f \in \mathcal{P}}[\mathrm{cond}(f)]$ for the probability with respect to the distribution $\mathcal{P}$ of the set of all $f$ satisfying the predicate $\mathrm{cond}(f)$. Everything that needs to be measurable is assumed to be measurable.

3 INFORMATION GAIN AND LEARNING

Let $\mathcal{F}$ be a concept class over the instance space $X$. Fix a target concept $f \in \mathcal{F}$ and an infinite sequence of instances $\mathbf{x} = x_1, \ldots, x_m, x_{m+1}, \ldots$ with $x_m \in X$ for all $m$. For now we assume that the fixed instance sequence $\mathbf{x}$ is known in advance to the learner, but that the target concept $f$ is not. Let $\mathcal{P}$ be a probability distribution over the concept class $\mathcal{F}$. We think of $\mathcal{P}$ in the Bayesian sense as representing the prior beliefs of the learner about which target concept it will be learning. In our setting, the learner receives information about $f$ incrementally via the label sequence $f(x_1), \ldots, f(x_m), f(x_{m+1}), \ldots$. At time $m$, the learner receives the label $f(x_m)$.
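A minimal sketch may help fix this notation; everything in it (the tiny instance space, the three concepts, the non-uniform prior, the helper names) is an assumed toy example of mine, not from the paper. Concepts are stored as label vectors, and $E_{f\in\mathcal{P}}[\cdot]$ and $\Pr_{f\in\mathcal{P}}[\cdot]$ become one-line sums over the class.

    # A minimal sketch of the notation, under assumptions of my own: a four-point
    # instance space, three concepts stored as label vectors, and a non-uniform
    # prior P given as a dict.  E_{f in P}[.] and Pr_{f in P}[.] are plain sums.
    X_SPACE = [0, 1, 2, 3]

    F = {                                   # concept name -> (f(0), f(1), f(2), f(3))
        "f_a": (0, 0, 1, 1),
        "f_b": (0, 1, 1, 1),
        "f_c": (0, 0, 0, 1),
    }
    P = {"f_a": 0.5, "f_b": 0.25, "f_c": 0.25}   # prior over F, sums to 1

    def expectation(chi):
        """E_{f in P}[chi(f)] for a real-valued function chi of the concept."""
        return sum(P[name] * chi(F[name]) for name in F)

    def probability(cond):
        """Pr_{f in P}[cond(f)] for a boolean predicate cond on the concept."""
        return sum(P[name] for name in F if cond(F[name]))

    if __name__ == "__main__":
        print(probability(lambda f: f[1] == 1))   # weight of concepts labeling x=1 positive: 0.25
        print(expectation(lambda f: sum(f)))      # expected number of positive labels: 2.0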

For any $m \ge 1$ we define (with respect to $\mathbf{x}$, $f$) the $m$th version space

$$\mathcal{F}_m(\mathbf{x}, f) = \{\hat{f} \in \mathcal{F} : \hat{f}(x_1) = f(x_1), \ldots, \hat{f}(x_m) = f(x_m)\}$$

and the $m$th volume $V_m(\mathbf{x}, f) = \mathcal{P}[\mathcal{F}_m(\mathbf{x}, f)]$. We define $\mathcal{F}_0(\mathbf{x}, f) = \mathcal{F}$ for all $\mathbf{x}$ and $f$, so $V_0(\mathbf{x}, f) = 1$. The version space at time $m$ is simply the class of all concepts in $\mathcal{F}$ consistent with the first $m$ labels of $f$ (with respect to $\mathbf{x}$), and the $m$th volume is the measure of this class under $\mathcal{P}$. For the first part of the paper, the infinite instance sequence $\mathbf{x}$ and the prior $\mathcal{P}$ are fixed; thus we simply write $\mathcal{F}_m(f)$ and $V_m(f)$. Later, when the sequence $\mathbf{x}$ is chosen randomly, we will reintroduce this dependence explicitly. We adopt this notational practice of omitting any dependence on a fixed $\mathbf{x}$ in many other places as well.

For each $m \ge 0$ let us define the $m$th posterior distribution $\mathcal{P}_m(\mathbf{x}, f) = \mathcal{P}_m$ by restricting $\mathcal{P}$ to the $m$th version space $\mathcal{F}_m(f)$; that is, for all (measurable) $S \subseteq \mathcal{F}$, $\mathcal{P}_m[S] = \mathcal{P}[S \cap \mathcal{F}_m(f)]/\mathcal{P}[\mathcal{F}_m(f)] = \mathcal{P}[S \cap \mathcal{F}_m(f)]/V_m(f)$.

Having already seen $f(x_1), \ldots, f(x_m)$, how much information (assuming the prior $\mathcal{P}$) does the learner expect to gain by seeing $f(x_{m+1})$? If we let $\mathcal{I}_{m+1}(\mathbf{x}, f)$ (abbreviated $\mathcal{I}_{m+1}(f)$ since $\mathbf{x}$ is fixed for now) be a random variable whose value is the (Shannon) information gained from $f(x_{m+1})$, then it can be shown that the expected information is

$$E_{f \in \mathcal{P}}[\mathcal{I}_{m+1}(f)] = E_{f \in \mathcal{P}}\!\left[-\log \frac{V_{m+1}(f)}{V_m(f)}\right] = E_{f \in \mathcal{P}}[-\log X_{m+1}(f)] \qquad (1)$$

where we define the $(m+1)$st volume ratio by $X_{m+1}(\mathbf{x}, f) = X_{m+1}(f) = V_{m+1}(f)/V_m(f)$.

We now return to our learning problem, which we define to be that of predicting the label $f(x_{m+1})$ given only the previous labels $f(x_1), \ldots, f(x_m)$. The first learning algorithm we consider is called the Bayes optimal classification algorithm, or the Bayes algorithm for short. For any $m$ and $b \in \{0,1\}$, define $\mathcal{F}_m^{(b)}(\mathbf{x}, f) = \mathcal{F}_m^{(b)}(f) = \{\hat{f} \in \mathcal{F}_m(\mathbf{x}, f) : \hat{f}(x_{m+1}) = b\}$. Then the Bayes algorithm is:

If $\mathcal{P}[\mathcal{F}_m^{(1)}(f)] > \mathcal{P}[\mathcal{F}_m^{(0)}(f)]$, predict $f(x_{m+1}) = 1$.
If $\mathcal{P}[\mathcal{F}_m^{(1)}(f)] < \mathcal{P}[\mathcal{F}_m^{(0)}(f)]$, predict $f(x_{m+1}) = 0$.
If $\mathcal{P}[\mathcal{F}_m^{(1)}(f)] = \mathcal{P}[\mathcal{F}_m^{(0)}(f)]$, flip a fair coin to predict $f(x_{m+1})$.

It is well known that if the target concept $f$ is drawn at random according to the prior distribution $\mathcal{P}$, then the Bayes algorithm is optimal in the sense that it minimizes the probability that $f(x_{m+1})$ is predicted incorrectly. Furthermore, if we let $\mathrm{Bayes}_{m+1}(\mathbf{x}, f)$ (abbreviated $\mathrm{Bayes}_{m+1}(f)$ since $\mathbf{x}$ is fixed for now) be a random variable whose value is 1 if the Bayes algorithm makes a mistake predicting $f(x_{m+1})$ and 0 otherwise, then it can be shown that the probability of a mistake for a random $f$ is

$$E_{f \in \mathcal{P}}[\mathrm{Bayes}_{m+1}(f)] = E_{f \in \mathcal{P}}\!\left[\mathbf{1}_{\{X_{m+1}(f) < 1/2\}} + \tfrac{1}{2}\,\mathbf{1}_{\{X_{m+1}(f) = 1/2\}}\right]. \qquad (2)$$

Despite the optimality of the Bayes algorithm, it suffers the drawback that its hypothesis at any time $m$ may not be a member of the target class $\mathcal{F}$. (Here we define the hypothesis of an algorithm at time $m$ to be the (possibly probabilistic) mapping $\hat{f} : X \to \{0,1\}$ obtained by letting $\hat{f}(x)$ be the prediction of the algorithm when $x_{m+1} = x$.)
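For a finite class these quantities can be computed exactly by enumeration. The sketch below is an assumed toy example of mine (a nested family of five threshold-like concepts under a uniform prior, with logarithms taken base 2); it evaluates the volumes $V_m(f)$, the volume ratio $X_{m+1}(f)$, the expected information gain of Equation (1), and the Bayes mistake probability of Equation (2).

    import math

    # A toy, fully enumerable example (my own choices: five nested threshold-like
    # concepts on four instances, uniform prior, logs base 2) of the volume V_m(f),
    # the volume ratio X_{m+1}(f), and Equations (1) and (2).
    X_SEQ = [3, 0, 2, 1]                      # fixed instance sequence x_1, x_2, ...
    CONCEPTS = {                              # concept -> labels on instances 0..3
        "t0": (1, 1, 1, 1),
        "t1": (0, 1, 1, 1),
        "t2": (0, 0, 1, 1),
        "t3": (0, 0, 0, 1),
        "t4": (0, 0, 0, 0),
    }
    PRIOR = {name: 1.0 / len(CONCEPTS) for name in CONCEPTS}

    def version_space(target, m):
        """F_m(x, f): concepts agreeing with the target on x_1, ..., x_m."""
        return [n for n, g in CONCEPTS.items()
                if all(g[X_SEQ[i]] == CONCEPTS[target][X_SEQ[i]] for i in range(m))]

    def volume(target, m):
        """V_m(f) = P[F_m(x, f)]."""
        return sum(PRIOR[n] for n in version_space(target, m))

    def volume_ratio(target, m):
        """X_{m+1}(f) = V_{m+1}(f) / V_m(f)."""
        return volume(target, m + 1) / volume(target, m)

    def expected_info(m):
        """Equation (1): E_{f in P}[-log2 X_{m+1}(f)]."""
        return sum(PRIOR[f] * -math.log2(volume_ratio(f, m)) for f in CONCEPTS)

    def bayes_mistake_prob(m):
        """Equation (2): E_f[ 1{X_{m+1} < 1/2} + (1/2) 1{X_{m+1} = 1/2} ]."""
        total = 0.0
        for f in CONCEPTS:
            x = volume_ratio(f, m)
            total += PRIOR[f] * (1.0 if x < 0.5 else 0.5 if x == 0.5 else 0.0)
        return total

    if __name__ == "__main__":
        for m in range(len(X_SEQ) - 1):
            print(f"m={m}: E[I_(m+1)] = {expected_info(m):.3f}, "
                  f"E[Bayes_(m+1)] = {bayes_mistake_prob(m):.3f}")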

This drawback is absent in our second learning algorithm, which we call the Gibbs algorithm [6]: Given $f(x_1), \ldots, f(x_m)$, choose a hypothesis concept $\hat{f}$ randomly from $\mathcal{P}_m$. Given $x_{m+1}$, predict $f(x_{m+1}) = \hat{f}(x_{m+1})$.

The Gibbs algorithm is the "zero-temperature" limit of the learning algorithm studied in several recent papers [2, 3, 8, 9]. If we let $\mathrm{Gibbs}_{m+1}(\mathbf{x}, f)$ (abbreviated $\mathrm{Gibbs}_{m+1}(f)$ since $\mathbf{x}$ is fixed for now) be a random variable whose value is 1 if the Gibbs algorithm makes a mistake predicting $f(x_{m+1})$ and 0 otherwise, then it can be shown that the probability of a mistake for a random $f$ is

$$E_{f \in \mathcal{P}}[\mathrm{Gibbs}_{m+1}(f)] = E_{f \in \mathcal{P}}[1 - X_{m+1}(f)]. \qquad (3)$$

Note that by the definition of the Gibbs algorithm, Equation (3) is exactly the average probability of mistake of a consistent hypothesis, using the distribution on $\mathcal{F}$ defined by the prior. Thus bounds on this expectation provide an interesting contrast to those obtained via VC dimension analysis, which always gives bounds on the probability of mistake of the worst consistent hypothesis.

4 THE MAIN INEQUALITY

In this section we state one of our main results: a chain of inequalities that upper and lower bound the expected error for both the Bayes and Gibbs algorithms by simple functions of the expected information gain. More precisely, using the characterizations of the expectations in terms of the volume ratio $X_{m+1}(f)$ given by Equations (1), (2) and (3), we can prove the following, which we refer to as the main inequality:

$$\mathcal{H}^{-1}\!\big(E_{f \in \mathcal{P}}[\mathcal{I}_{m+1}(f)]\big) \;\le\; E_{f \in \mathcal{P}}[\mathrm{Bayes}_{m+1}(f)] \;\le\; E_{f \in \mathcal{P}}[\mathrm{Gibbs}_{m+1}(f)] \;\le\; \tfrac{1}{2}\, E_{f \in \mathcal{P}}[\mathcal{I}_{m+1}(f)]. \qquad (4)$$

Here we have defined an inverse to the binary entropy function $\mathcal{H}(p) = -p \log p - (1-p) \log(1-p)$ by letting $\mathcal{H}^{-1}(q)$, for $q \in [0,1]$, be the unique $p \in [0, 1/2]$ such that $\mathcal{H}(p) = q$. Note that the bounds given depend on properties of the particular prior $\mathcal{P}$, and on properties of the particular fixed sequence $\mathbf{x}$. These upper and lower bounds are equal (and therefore tight) at both extremes $E_{f \in \mathcal{P}}[\mathcal{I}_{m+1}(f)] = 1$ (maximal information gain) and $E_{f \in \mathcal{P}}[\mathcal{I}_{m+1}(f)] = 0$ (minimal information gain). To obtain a weaker but perhaps more convenient lower bound, it can also be shown that there is a constant $c_0 > 0$ such that for all $p > 0$, $\mathcal{H}^{-1}(p) \ge c_0\, p/\log(2/p)$. Finally, if all that is wanted is a direct comparison of the performances of the Gibbs and Bayes algorithms, we can also show:

$$E_{f \in \mathcal{P}}[\mathrm{Gibbs}_{m+1}(f)] \;\le\; 2\, E_{f \in \mathcal{P}}[\mathrm{Bayes}_{m+1}(f)]. \qquad (5)$$
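The main inequality is easy to check numerically in the simplest situation I can think of: a single trial on which the version space splits into two cells, with posterior weight $q$ on the correct label, so that $E[\mathcal{I}_{m+1}] = \mathcal{H}(q)$, $E[\mathrm{Bayes}_{m+1}] = \min(q, 1-q)$ and $E[\mathrm{Gibbs}_{m+1}] = 2q(1-q)$. This two-cell setting, the base-2 logarithms, and the bisection routine below are my own simplifications, not part of the paper.

    import math

    # Numeric sanity check of Inequalities (4) and (5) for a single trial with a
    # two-way split of the version space (an assumption of this sketch): the
    # correct label carries posterior weight q, so X_{m+1} = q with probability q
    # and 1 - q with probability 1 - q.  Logs are base 2.

    def H(p):
        """Binary entropy H(p) = -p log2 p - (1-p) log2 (1-p)."""
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def H_inv(q, tol=1e-12):
        """Inverse entropy: the unique p in [0, 1/2] with H(p) = q, by bisection."""
        lo, hi = 0.0, 0.5
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if H(mid) < q:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    if __name__ == "__main__":
        for q in (0.05, 0.1, 0.25, 0.4, 0.5, 0.7, 0.9):
            info  = H(q)                      # E[I_{m+1}] for this split
            bayes = min(q, 1 - q)             # E[Bayes_{m+1}], Equation (2)
            gibbs = 2 * q * (1 - q)           # E[Gibbs_{m+1}], Equation (3)
            assert H_inv(info) <= bayes + 1e-9 <= gibbs + 1e-9 <= info / 2 + 1e-9
            assert gibbs <= 2 * bayes + 1e-9                  # Inequality (5)
            print(f"q={q:.2f}  H^-1(info)={H_inv(info):.3f}  "
                  f"Bayes={bayes:.3f}  Gibbs={gibbs:.3f}  info/2={info/2:.3f}")

In this two-cell case the lower bound is tight ($\mathcal{H}^{-1}(\mathcal{H}(q)) = \min(q, 1-q)$ exactly), and all four quantities coincide at $q = 1/2$, the maximal-information extreme.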

5 THE MAIN INEQUALITY: CUMULATIVE VERSION

In this section we state a cumulative version of the main inequality: namely, bounds on the expected cumulative number of mistakes made in the first $m$ trials (rather than just the instantaneous expectations). First, for the cumulative information gain, it can be shown that $E_{f \in \mathcal{P}}\big[\sum_{i=1}^{m} \mathcal{I}_i(f)\big] = E_{f \in \mathcal{P}}[-\log V_m(f)]$. This expression has a natural interpretation. The first $m$ instances $x_1, \ldots, x_m$ of $\mathbf{x}$ induce a partition $\Pi_m^{\mathcal{F}}(\mathbf{x})$ of the concept class $\mathcal{F}$ defined by $\Pi_m^{\mathcal{F}}(\mathbf{x}) = \Pi_m^{\mathcal{F}} = \{\mathcal{F}_m(\mathbf{x}, f) : f \in \mathcal{F}\}$. Note that $|\Pi_m^{\mathcal{F}}|$ is always at most $2^m$, but may be considerably smaller, depending on the interaction between $\mathcal{F}$ and $x_1, \ldots, x_m$. It is clear that $E_{f \in \mathcal{P}}[-\log V_m(f)] = -\sum_{\pi \in \Pi_m^{\mathcal{F}}} \mathcal{P}[\pi] \log \mathcal{P}[\pi]$. Thus the expected cumulative information gained from the labels of $x_1, \ldots, x_m$ is simply the entropy of the partition $\Pi_m^{\mathcal{F}}$ under the distribution $\mathcal{P}$. We shall denote this entropy by $\mathcal{H}_{\mathcal{P}}(\Pi_m^{\mathcal{F}}(\mathbf{x})) = \mathcal{H}_m^{\mathcal{P}}(\mathbf{x}) = \mathcal{H}_m^{\mathcal{P}}$. Now, analogous to the main inequality for the instantaneous case (Inequality (4)), we can show:

$$\frac{c_0\, \mathcal{H}_m^{\mathcal{P}}}{\log(2m/\mathcal{H}_m^{\mathcal{P}})} \;\le\; m\, \mathcal{H}^{-1}\!\Big(\tfrac{1}{m}\, \mathcal{H}_m^{\mathcal{P}}\Big) \;\le\; E_{f \in \mathcal{P}}\Big[\sum_{i=1}^{m} \mathrm{Bayes}_i(f)\Big] \;\le\; E_{f \in \mathcal{P}}\Big[\sum_{i=1}^{m} \mathrm{Gibbs}_i(f)\Big] \;\le\; \tfrac{1}{2}\, \mathcal{H}_m^{\mathcal{P}}. \qquad (6)$$

Here we have applied the inequality $\mathcal{H}^{-1}(p) \ge c_0\, p/\log(2/p)$ in order to give the lower bound in more convenient form. As in the instantaneous case, the upper and lower bounds here depend on properties of the particular $\mathcal{P}$ and $\mathbf{x}$. When the cumulative information gain is maximum ($\mathcal{H}_m^{\mathcal{P}} = m$), the upper and lower bounds are tight. These bounds on learning performance in terms of a partition entropy are of special importance to us, since they will form the crucial link between the Bayesian setting and the Vapnik-Chervonenkis dimension theory.

6 MOVING TO A WORST-CASE THEORY: BOUNDING THE INFORMATION GAIN BY THE VC DIMENSION

Although we have given upper bounds on the expected cumulative number of mistakes for the Bayes and Gibbs algorithms in terms of $\mathcal{H}_m^{\mathcal{P}}(\mathbf{x})$, we are still left with the problem of evaluating this entropy, or at least obtaining reasonable upper bounds on it. We can intuitively see that the "worst case" for learning occurs when the partition entropy $\mathcal{H}_m^{\mathcal{P}}(\mathbf{x})$ is as large as possible. In our context, the entropy is qualitatively maximized when two conditions hold: (1) the instance sequence $\mathbf{x}$ induces a partition of $\mathcal{F}$ that is the largest possible, and (2) the prior $\mathcal{P}$ gives equal weight to each element of this partition. In this section, we move away from our Bayesian average-case setting to obtain worst-case bounds by formalizing these two conditions in terms of combinatorial parameters depending only on the concept class $\mathcal{F}$. In doing so, we form the link between the theory developed so far and the VC dimension theory.
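Before these two conditions are formalized, a small sketch may help make the partition $\Pi_m^{\mathcal{F}}(\mathbf{x})$ and its entropy $\mathcal{H}_m^{\mathcal{P}}$ concrete. The class (thresholds on eight points), the uniform prior, the fixed instance sequence, and the base-2 logarithms are all assumptions of mine for illustration; the script groups concepts with identical label vectors to get $\mathcal{H}_m^{\mathcal{P}}$ and checks the upper bound of Inequality (6) against the exactly enumerated expected cumulative Gibbs mistakes.

    import math
    from collections import defaultdict

    # Sketch of the partition Pi_m(x) and its entropy H_m^P, plus the exact
    # expected cumulative Gibbs mistakes E[sum_i (1 - X_i)], for an assumed toy
    # class of thresholds on {0, ..., N-1} under a uniform prior.  Logs base 2.

    N = 8
    CONCEPTS = list(range(N + 1))                 # threshold t: f_t(x) = 1 iff x >= t
    PRIOR = {t: 1.0 / len(CONCEPTS) for t in CONCEPTS}
    X_SEQ = [5, 1, 6, 3, 0, 7, 2, 4]              # fixed instance sequence

    def labels(t, m):
        return tuple(1 if X_SEQ[i] >= t else 0 for i in range(m))

    def partition_entropy(m):
        """H_m^P = -sum_{pi in Pi_m} P[pi] log2 P[pi] (cells = equal label vectors)."""
        cells = defaultdict(float)
        for t in CONCEPTS:
            cells[labels(t, m)] += PRIOR[t]
        return -sum(p * math.log2(p) for p in cells.values())

    def expected_cumulative_gibbs(m):
        """E_{f in P}[sum_{i=1}^m (1 - X_i(f))], computed exactly by enumeration."""
        total = 0.0
        for target in CONCEPTS:
            vs = list(CONCEPTS)
            for i in range(m):
                consistent = [t for t in vs
                              if labels(t, i + 1) == labels(target, i + 1)]
                x_ratio = sum(PRIOR[t] for t in consistent) / sum(PRIOR[t] for t in vs)
                total += PRIOR[target] * (1.0 - x_ratio)
                vs = consistent
        return total

    if __name__ == "__main__":
        for m in (2, 4, 8):
            h = partition_entropy(m)
            print(f"m={m}: H_m = {h:.3f},  E[sum Gibbs_i] = "
                  f"{expected_cumulative_gibbs(m):.3f}  <=  H_m/2 = {h/2:.3f}")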

The second of the two conditions above is easily quantified. Since the entropy of a partition is at most the logarithm of the number of classes in it, a trivial upper bound on the entropy which holds for all priors $\mathcal{P}$ is $\mathcal{H}_m^{\mathcal{P}}(\mathbf{x}) \le \log |\Pi_m^{\mathcal{F}}(\mathbf{x})|$.

VC dimension theory provides an upper bound on $\log |\Pi_m^{\mathcal{F}}(\mathbf{x})|$ as follows. For any sequence $\mathbf{x} = x_1, x_2, \ldots$ of instances and for $m \ge 1$, let $\dim_m(\mathcal{F}, \mathbf{x})$ denote the largest $d \ge 0$ such that there exists a subsequence $x_{i_1}, \ldots, x_{i_d}$ of $x_1, \ldots, x_m$ with $|\Pi_d^{\mathcal{F}}((x_{i_1}, \ldots, x_{i_d}))| = 2^d$; that is, for every possible labeling of $x_{i_1}, \ldots, x_{i_d}$ there is some target concept in $\mathcal{F}$ that gives this labeling. The Vapnik-Chervonenkis (VC) dimension of $\mathcal{F}$ is defined by $\dim(\mathcal{F}) = \max\{\dim_m(\mathcal{F}, \mathbf{x}) : m \ge 1 \text{ and } x_1, x_2, \ldots \in X\}$. It can be shown [7, 10] that for all $\mathbf{x}$ and $m \ge d \ge 1$,

$$\log |\Pi_m^{\mathcal{F}}(\mathbf{x})| \;\le\; (1 + o(1))\, \dim_m(\mathcal{F}, \mathbf{x}) \log \frac{m}{\dim_m(\mathcal{F}, \mathbf{x})} \qquad (7)$$

where $o(1)$ is a quantity that goes to zero as $\alpha = m/\dim_m(\mathcal{F}, \mathbf{x})$ goes to infinity.

In all of our discussions so far, we have assumed that the instance sequence $\mathbf{x}$ is fixed in advance, but that the target concept $f$ is drawn randomly according to $\mathcal{P}$. We now move to the completely probabilistic model, in which $f$ is drawn according to $\mathcal{P}$, and each instance $x_m$ in the sequence $\mathbf{x}$ is drawn randomly and independently according to a distribution $\mathcal{D}$ over the instance space $X$ (this infinite sequence of draws from $\mathcal{D}$ will be denoted $\mathbf{x} \in \mathcal{D}^*$). Under these assumptions, it follows from Inequalities (6) and (7), and the observation above that $\mathcal{H}_m^{\mathcal{P}}(\mathbf{x}) \le \log |\Pi_m^{\mathcal{F}}(\mathbf{x})|$, that for any $\mathcal{P}$ and any $\mathcal{D}$,

$$E_{f \in \mathcal{P},\, \mathbf{x} \in \mathcal{D}^*}\!\Big[\sum_{i=1}^{m} \mathrm{Bayes}_i(\mathbf{x}, f)\Big] \le E_{f \in \mathcal{P},\, \mathbf{x} \in \mathcal{D}^*}\!\Big[\sum_{i=1}^{m} \mathrm{Gibbs}_i(\mathbf{x}, f)\Big] \le \frac{1}{2}\, E_{\mathbf{x} \in \mathcal{D}^*}\big[\log |\Pi_m^{\mathcal{F}}(\mathbf{x})|\big] \le (1 + o(1))\, E_{\mathbf{x} \in \mathcal{D}^*}\!\Big[\frac{\dim_m(\mathcal{F}, \mathbf{x})}{2} \log \frac{m}{\dim_m(\mathcal{F}, \mathbf{x})}\Big] \le (1 + o(1))\, \frac{\dim(\mathcal{F})}{2} \log \frac{m}{\dim(\mathcal{F})}. \qquad (8)$$

The expectation $E_{\mathbf{x} \in \mathcal{D}^*}[\log |\Pi_m^{\mathcal{F}}(\mathbf{x})|]$ is the VC entropy defined by Vapnik and Chervonenkis in their seminal paper on uniform convergence [11].

In terms of instantaneous mistake bounds, using more sophisticated techniques [4], we can show that for any $\mathcal{P}$ and any $\mathcal{D}$,

$$E_{f \in \mathcal{P},\, \mathbf{x} \in \mathcal{D}^*}[\mathrm{Bayes}_m(\mathbf{x}, f)] \;\le\; E_{\mathbf{x} \in \mathcal{D}^*}\!\left[\frac{\dim_m(\mathcal{F}, \mathbf{x})}{m}\right] \;\le\; \frac{\dim(\mathcal{F})}{m} \qquad (9)$$

$$E_{f \in \mathcal{P},\, \mathbf{x} \in \mathcal{D}^*}[\mathrm{Gibbs}_m(\mathbf{x}, f)] \;\le\; E_{\mathbf{x} \in \mathcal{D}^*}\!\left[\frac{2 \dim_m(\mathcal{F}, \mathbf{x})}{m}\right] \;\le\; \frac{2 \dim(\mathcal{F})}{m} \qquad (10)$$

Haussler, Littlestone and Warmuth [5] construct specific $\mathcal{D}$, $\mathcal{P}$ and $\mathcal{F}$ for which the last bound given by Inequality (8) is tight to within a factor of $1/\ln 2 \approx 1.44$; thus this bound cannot be improved by more than this factor in general.¹ Similarly, the bound given by Inequality (9) cannot be improved by more than a factor of 2 in general.

¹It follows that the expected total number of mistakes of the Bayes and the Gibbs algorithms differ by a factor of at most about 1.44 in each of these cases; this was not previously known.
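The combinatorial quantities appearing in these bounds can be computed by brute force for a small class. The sketch below assumes a toy class of intervals on ten points (VC dimension 2) and a fixed instance sequence of my choosing; it computes $\dim_m(\mathcal{F}, \mathbf{x})$ by checking which subsequences are shattered, counts $|\Pi_m^{\mathcal{F}}(\mathbf{x})|$, and compares $\log_2 |\Pi_m^{\mathcal{F}}(\mathbf{x})|$ with the Sauer-type bound $d \log_2(em/d)$, whose leading term is the right-hand side of Inequality (7).

    import itertools
    import math

    # Brute-force sketch of dim_m(F, x) and of the Sauer-type bound behind (7),
    # for an assumed toy class: intervals [a, b] on the points {0, ..., 9},
    # i.e. f_{a,b}(x) = 1 iff a <= x <= b.  This class has VC dimension 2.

    POINTS = list(range(10))
    CONCEPTS = [(a, b) for a in POINTS for b in POINTS if a <= b]

    def label(c, x):
        a, b = c
        return 1 if a <= x <= b else 0

    def dichotomies(instances):
        """Pi(F restricted to the given instances): the set of realizable labelings."""
        return {tuple(label(c, x) for x in instances) for c in CONCEPTS}

    def dim_m(x_seq, m):
        """Largest d such that some size-d subsequence of x_1..x_m is shattered."""
        best = 0
        for d in range(1, m + 1):
            for subset in itertools.combinations(x_seq[:m], d):
                if len(dichotomies(subset)) == 2 ** d:
                    best = d
                    break                      # found a shattered set of size d
        return best

    if __name__ == "__main__":
        x_seq = [4, 8, 1, 6, 3, 9, 0, 7, 2, 5]     # a fixed instance sequence
        for m in (4, 7, 10):
            d = dim_m(x_seq, m)
            n_cells = len(dichotomies(x_seq[:m]))
            bound = d * math.log2(math.e * m / d)  # Sauer: log2|Pi_m| <= d log2(em/d)
            print(f"m={m}: dim_m = {d}, log2|Pi_m| = {math.log2(n_cells):.2f}, "
                  f"d*log2(em/d) = {bound:.2f}")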

For specific $\mathcal{D}$, $\mathcal{P}$ and $\mathcal{F}$, however, it is possible to improve the general bounds given in Inequalities (8), (9) and (10) by more than the factors indicated above. We calculate the instantaneous mistake bounds for the Bayes and Gibbs algorithms in the natural case that $\mathcal{F}$ is the set of homogeneous linear threshold functions on $\mathbf{R}^d$ and both the distribution $\mathcal{D}$ and the prior $\mathcal{P}$ on possible target concepts (represented also by vectors in $\mathbf{R}^d$) are uniform on the unit sphere in $\mathbf{R}^d$. This class has VC dimension $d$. In this case, under certain reasonable assumptions used in statistical mechanics, it can be shown that for $m \gg d \gg 1$,

$$E_{f \in \mathcal{P},\, \mathbf{x} \in \mathcal{D}^*}[\mathrm{Bayes}_m(\mathbf{x}, f)] \approx 0.44\, \frac{d}{m}$$

(compared with the upper bound of $d/m$ given by Inequality (9) for any class of VC dimension $d$) and

$$E_{f \in \mathcal{P},\, \mathbf{x} \in \mathcal{D}^*}[\mathrm{Gibbs}_m(\mathbf{x}, f)] \approx 0.62\, \frac{d}{m}$$

(compared with the upper bound of $2d/m$ in Inequality (10)). The ratio of these asymptotic bounds is $\sqrt{2}$. We can also show that this performance advantage of Bayes over Gibbs is quite robust even when $\mathcal{P}$ and $\mathcal{D}$ vary, and there is noise in the examples [6].

7 OTHER RESULTS AND CONCLUSIONS

We have a number of other results, and briefly describe here one that may be of particular interest to neural network researchers. In the case that the class $\mathcal{F}$ has infinite VC dimension (for instance, if $\mathcal{F}$ is the class of all multi-layer perceptrons of finite size), we can still obtain bounds on the number of cumulative mistakes by decomposing $\mathcal{F}$ into $\mathcal{F}_1, \mathcal{F}_2, \ldots, \mathcal{F}_i, \ldots$, where each $\mathcal{F}_i$ has finite VC dimension $d_i$, and by decomposing the prior $\mathcal{P}$ over $\mathcal{F}$ as a linear sum $\mathcal{P} = \sum_{i=1}^{\infty} a_i \mathcal{P}_i$, where each $\mathcal{P}_i$ is an arbitrary prior over $\mathcal{F}_i$, and $\sum_{i=1}^{\infty} a_i = 1$. A typical decomposition might let $\mathcal{F}_i$ be all multi-layer perceptrons of a given architecture with at most $i$ weights, in which case $d_i = O(i \log i)$ [1]. Here we can show an upper bound on the cumulative mistakes during the first $m$ examples of roughly $\mathcal{H}\{a_i\} + \big[\sum_{i=1}^{\infty} a_i d_i\big] \log m$ for both the Bayes and Gibbs algorithms, where $\mathcal{H}\{a_i\} = -\sum_{i=1}^{\infty} a_i \log a_i$. The quantity $\sum_{i=1}^{\infty} a_i d_i$ plays the role of an "effective VC dimension" relative to the prior weights $\{a_i\}$. In the case that $\mathbf{x}$ is also chosen randomly, we can bound the probability of mistake on the $m$th trial by roughly $\frac{1}{m}\big(\mathcal{H}\{a_i\} + \big[\sum_{i=1}^{\infty} a_i d_i\big] \log m\big)$.

In our current research we are working on extending the basic theory presented here to the problems of learning with noise (see Opper and Haussler [6]), learning multi-valued functions, and learning with other loss functions. Perhaps the most important general conclusion to be drawn from the work presented here is that the various theories of learning curves based on diverse ideas from information theory, statistical physics and the VC dimension are all in fact closely related, and can be naturally and beneficially placed in a common Bayesian framework.
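As a purely numerical aside on the decomposition bound of Section 7, the sketch below tabulates $\mathcal{H}\{a_i\} + \big[\sum_i a_i d_i\big] \log_2 m$ and its per-trial version for one made-up choice of prior weights $a_i \propto 2^{-i}$ and dimensions $d_i$ growing like $i \log i$; none of these numbers come from the paper.

    import math

    # A made-up numeric illustration of the decomposition bound above: prior
    # weights a_i proportional to 2^{-i} over nested subclasses, d_i growing like
    # i*log(i) (as for networks with at most i weights), logs base 2.

    def mixture_bound(a, d, m):
        h_a = -sum(ai * math.log2(ai) for ai in a if ai > 0)    # H{a_i}
        eff_dim = sum(ai * di for ai, di in zip(a, d))          # "effective VC dimension"
        cumulative = h_a + eff_dim * math.log2(m)               # rough cumulative bound
        return cumulative, cumulative / m                       # (cumulative, per-trial)

    sizes = list(range(1, 21))
    raw = [2.0 ** -i for i in sizes]
    total = sum(raw)
    a = [r / total for r in raw]                                # normalize: sum a_i = 1
    d = [max(1, round(i * math.log2(i + 1))) for i in sizes]    # d_i = O(i log i), assumed

    if __name__ == "__main__":
        for m in (100, 1000, 10000):
            cum, inst = mixture_bound(a, d, m)
            print(f"m={m}: cumulative bound ~ {cum:.1f}, per-trial bound ~ {inst:.4f}")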

Acknowledgements

We are greatly indebted to Ron Rivest for his valuable suggestions and guidance, and to Sara Solla and Naftali Tishby for insightful ideas in the early stages of this investigation. We also thank Andrew Barron, Andy Kahn, Nick Littlestone, Phil Long, Terry Sejnowski and Haim Sompolinsky for stimulating discussions on these topics. This research was supported by ONR grant N00014-91-J-1162, AFOSR grant AFOSR-89-0506, ARO grant DAAL03-86-K-0171, DARPA contract N00014-89-J-1988, and a grant from the Siemens Corporation. This research was conducted in part while M. Kearns was at the M.I.T. Laboratory for Computer Science and the International Computer Science Institute, and while R. Schapire was at the M.I.T. Laboratory for Computer Science and Harvard University.

References

[1] E. Baum and D. Haussler. What size net gives valid generalization? Neural Computation, 1(1):151-160, 1989.

[2] J. Denker, D. Schwartz, B. Wittner, S. Solla, R. Howard, L. Jackel, and J. Hopfield. Automatic learning, rule extraction and generalization. Complex Systems, 1:877-922, 1987.

[3] G. Györgyi and N. Tishby. Statistical theory of learning a rule. In Neural Networks and Spin Glasses. World Scientific, 1990.

[4] D. Haussler, M. Kearns, and R. Schapire. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. In Computational Learning Theory: Proceedings of the Fourth Annual Workshop. Morgan Kaufmann, 1991.

[5] D. Haussler, N. Littlestone, and M. Warmuth. Predicting {0,1}-functions on randomly drawn points. Technical Report UCSC-CRL-90-54, University of California Santa Cruz, Computer Research Laboratory, Dec. 1990.

[6] M. Opper and D. Haussler. Calculation of the learning curve of Bayes optimal classification algorithm for learning a perceptron with noise. In Computational Learning Theory: Proceedings of the Fourth Annual Workshop. Morgan Kaufmann, 1991.

[7] N. Sauer. On the density of families of sets. Journal of Combinatorial Theory (Series A), 13:145-147, 1972.

[8] H. Sompolinsky, N. Tishby, and H. Seung. Learning from examples in large neural networks. Physical Review Letters, 65:1683-1686, 1990.

[9] N. Tishby, E. Levin, and S. Solla. Consistent inference of probabilities in layered networks: predictions and generalizations. In IJCNN International Joint Conference on Neural Networks, volume II, pages 403-409. IEEE, 1989.

[10] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.

[11] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971.