The VC-Dimension and Pseudodimension of Two-Layer Neural Networks with Discrete Inputs

Peter L. Bartlett and Robert C. Williamson
Department of Systems Engineering, Research School of Information Sciences and Engineering, Australian National University, Canberra 0200, AUSTRALIA

March 24, 1994. Revised: September 6, 1994 and July 28, 1995.

Abstract

We give upper bounds on the Vapnik-Chervonenkis dimension and pseudodimension of two-layer neural networks that use the standard sigmoid function or radial basis function and have inputs from $\{-D, \ldots, D\}^n$. In Valiant's probably approximately correct (pac) learning framework for pattern classification, and in Haussler's generalization of this framework to nonlinear regression, the results imply that the number of training examples necessary for satisfactory learning performance grows no more rapidly than $W \log(WD)$, where $W$ is the number of weights. The previous best bound for these networks was $O(W^4)$.

In using neural networks for pattern classification and regression tasks, it is important to be able to predict how much training data will be sufficient for satisfactory performance. Valiant's probably approximately correct (pac) framework [Val84] provides a formal definition of `satisfactory learning performance' for $\{0,1\}$-valued functions. In [BEHW89], Blumer et al. present upper and lower bounds on the number of examples necessary and sufficient for learning under this definition.

These bounds depend linearly on the Vapnik-Chervonenkis dimension of the function class used for learning. In [Hau92], Haussler presents a generalization of the pac framework that applies to the problem of learning real-valued functions. In this case, the pseudodimension of the function class gives an upper bound on the number of examples necessary for learning.

References [BH89, Maa92, Sak93, Bar93, GJ93] present upper and lower bounds on the VC-dimension of threshold networks and networks with piecewise polynomial output functions. Most of these results can easily be extended to give bounds on the pseudodimension. However, these networks are not commonly used in applications, because the most popular learning algorithm, the backpropagation algorithm, relies on differentiability of the units' output functions. In practice, sigmoid networks and radial basis function (RBF) networks are most widely used. Recently, Macintyre and Sontag [MS93] showed that the VC-dimension and pseudodimension of these networks are finite, and Karpinski and Macintyre [KM95] proved a bound of $O(W^2 N^2)$, where $W$ is the number of weights in the network and $N$ is the number of processing units.

In this note, we show that the VC-dimension and pseudodimension of two-layer sigmoid networks and radial basis function networks with discrete inputs is $O(W \log(WD))$, where the input domain is $\{-D, \ldots, D\}^n$. In many pattern classification and nonlinear regression tasks for which neural networks have been used, the set of inputs is a small finite set of this form. For the special case of sigmoid networks with binary inputs, our bound is within a log factor of the best known lower bounds [Bar93]. The result follows from the observation that a network with discrete inputs can be represented as a polynomially parametrized function class, hence VC-dimension bounds for such classes can be applied.

Suppose $X$ is a set, and $F$ is a class of real-valued functions defined on $X$. For $x = (x_1, \ldots, x_m) \in X^m$ and $r = (r_1, \ldots, r_m) \in \mathbb{R}^m$, we say that $F$ shatters $((x_1, r_1), \ldots, (x_m, r_m))$ if for all sign sequences $s = (s_1, \ldots, s_m) \in \{-1, 1\}^m$ there is a function $f$ in $F$ such that $s_i(f(x_i) - r_i) > 0$ for $i = 1, \ldots, m$. The pseudodimension of $F$, denoted $\dim_P(F)$, is the length of the largest shattered sequence. For pattern classification problems, we typically consider a class of $\{0,1\}$-valued functions obtained by thresholding a class of real-valued functions. Define the threshold function $H : \mathbb{R} \to \{0,1\}$ as $H(\alpha) = 1$ if and only if $\alpha \ge 0$. If $F$ is a class of real-valued functions, let $H(F)$ denote the set $\{H(f) : f \in F\}$.
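As a concrete illustration of this definition, the following sketch (my own; the tiny finite class of affine functions and the chosen sample points are assumptions made purely for illustration, since any realistic class is infinite) brute-forces the shattering condition $s_i(f(x_i) - r_i) > 0$ over all sign patterns.

```python
from itertools import product

def pseudo_shatters(functions, points, thresholds):
    """Brute-force check of the definition above: for every sign pattern
    s in {-1, +1}^m there must be some f in the (finite, explicitly listed)
    class with s_i * (f(x_i) - r_i) > 0 for all i."""
    for signs in product((-1, 1), repeat=len(points)):
        if not any(all(s * (f(x) - r) > 0
                       for s, x, r in zip(signs, points, thresholds))
                   for f in functions):
            return False
    return True

# Toy class: affine functions f(x) = a*x + b on R with a, b in {-1, 0, 1}.
affine = [lambda x, a=a, b=b: a * x + b
          for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0, 1.0)]

# Two points with thresholds r_i = 0 are pseudo-shattered (prints True).
print(pseudo_shatters(affine, points=[-1.0, 1.0], thresholds=[0.0, 0.0]))
```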

The Vapnik-Chervonenkis dimension of a class $F$ of real-valued functions defined on $X$ is the size of the largest sequence of points that can be classified arbitrarily by $H(F)$,
\[
\mathrm{VCdim}(F) = \max\{m : \exists x \in X^m,\ H(F) \text{ shatters } ((x_1, 1/2), \ldots, (x_m, 1/2))\}.
\]
Clearly, $\mathrm{VCdim}(F) \le \dim_P(F)$.

The function classes considered in this note can be indexed using a real vector of parameters. Let $\Theta$ and $X$ be the parameter and input spaces respectively, and let $f : \Theta \times X \to \mathbb{R}$. The function $f$ defines a parametrized class of real-valued functions defined on $X$, $\{f(\theta, \cdot) : \theta \in \Theta\}$. We also denote this function class by $f$.

Definition 1. A two-layer sigmoid network with $n$ inputs, $W$ weights, and a single real-valued output is described by the function $f_S : \mathbb{R}^W \times X \to \mathbb{R}$, where $X \subseteq \mathbb{R}^n$,
\[
f_S(\theta, x) = a_0 + \sum_{i=1}^k \frac{a_i}{1 + e^{-(b_i \cdot x + b_{i0})}}
\]
with $a_i \in \mathbb{R}$, $b_i = (b_{i1}, \ldots, b_{in}) \in \mathbb{R}^n$, and $\theta = (a_0, \ldots, a_k, b_{10}, \ldots, b_{kn}) \in \mathbb{R}^W$. (For $x, y \in \mathbb{R}^n$, $x \cdot y = \sum_{i=1}^n x_i y_i$.) In this case, $W = kn + 2k + 1$.

A radial basis function (RBF) network is described by the function
\[
f_{RBF}(\theta, x) = a_0 + \sum_{i=1}^k a_i e^{-\|x - c_i\|^2}
\]
with $a_i \in \mathbb{R}$, $c_i = (c_{i1}, \ldots, c_{in}) \in \mathbb{R}^n$, and $\theta = (a_0, \ldots, a_k, c_{11}, \ldots, c_{kn}) \in \mathbb{R}^W$. (If $x \in \mathbb{R}^n$, $\|x\|^2 = x \cdot x$.) Here, $W = kn + k + 1$.

Theorem 2. Let $X = \{-D, \ldots, D\}^n$ for some positive integer $D$. For the sigmoid and RBF networks $f_S, f_{RBF} : \mathbb{R}^W \times X \to \mathbb{R}$, we have
\[
\mathrm{VCdim}(f_S) \le \dim_P(f_S) < 2W \log_2(24eWD),
\]
\[
\mathrm{VCdim}(f_{RBF}) \le \dim_P(f_{RBF}) < 4W \log_2(24eWD).
\]

The proof of Theorem 2 follows from the simple observation that the function classes $f_S$ and $f_{RBF}$ can be expressed as polynomials in some transformed set of parameters when the inputs are integers. We can then use an upper bound from [GJ93] on the VC-dimension of such a function class.
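As a numerical sanity check on Definition 1 and Theorem 2, here is a minimal sketch (my own; the function and parameter names, the random weights, and the example sizes $k$, $n$, $D$ are assumptions chosen for illustration). It evaluates $f_S$ and $f_{RBF}$ at an integer input point and prints the two pseudodimension bounds for the corresponding weight counts.

```python
import numpy as np

def f_sigmoid(a0, a, B, b0, x):
    """f_S(theta, x) = a0 + sum_i a_i / (1 + exp(-(b_i . x + b_i0)))."""
    return a0 + np.sum(a / (1.0 + np.exp(-(B @ x + b0))))

def f_rbf(a0, a, C, x):
    """f_RBF(theta, x) = a0 + sum_i a_i * exp(-||x - c_i||^2)."""
    return a0 + np.sum(a * np.exp(-np.sum((x - C) ** 2, axis=1)))

def theorem2_bounds(k, n, D):
    """Upper bounds of Theorem 2 for k hidden units and inputs in {-D,...,D}^n."""
    W_s = k * n + 2 * k + 1   # weights of the sigmoid network
    W_r = k * n + k + 1       # weights of the RBF network
    return (2 * W_s * np.log2(24 * np.e * W_s * D),
            4 * W_r * np.log2(24 * np.e * W_r * D))

# Example sizes: k = 3 hidden units, n = 5 inputs, D = 10.
k, n, D = 3, 5, 10
rng = np.random.default_rng(0)
x = rng.integers(-D, D + 1, size=n).astype(float)   # a point of {-D, ..., D}^n
print(f_sigmoid(0.5, rng.normal(size=k), rng.normal(size=(k, n)),
                rng.normal(size=k), x))
print(f_rbf(0.5, rng.normal(size=k), rng.normal(size=(k, n)), x))
print(theorem2_bounds(k, n, D))   # bounds on dim_P, not on sample size
```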

Proof. Consider the function $f_S$ defined in Definition 1. For any $\theta \in \Theta$, $x \in X$ and $r \in \mathbb{R}$, let
\[
\tilde f_S(\theta, (x, r)) = (f_S(\theta, x) - r) \prod_{i=1}^k \left( 1 + e^{-(b_i \cdot x + b_{i0})} \right) \prod_{j=1}^n e^{-b_{ij} D}
\]
\[
= (a_0 - r) \prod_{i=1}^k \left( \prod_{j=1}^n e^{-b_{ij} D} + e^{-b_{i0}} \prod_{j=1}^n e^{-b_{ij}(x_j + D)} \right)
+ \sum_{i=1}^k a_i \prod_{j=1}^n e^{-b_{ij} D} \prod_{\substack{h=1 \\ h \ne i}}^k \left( \prod_{j=1}^n e^{-b_{hj} D} + e^{-b_{h0}} \prod_{j=1}^n e^{-b_{hj}(x_j + D)} \right).
\]
Clearly, $\tilde f_S(\theta, (x, r))$ always has the same sign as $f_S(\theta, x) - r$, since the denominators in $f_S(\theta, x)$ are always positive, so $\dim_P(f_S) \le \mathrm{VCdim}(\tilde f_S)$. But $\tilde f_S(\theta, (x, r))$ is polynomial in $\tilde\theta = (a_0, \ldots, a_k, e^{-b_{10}}, \ldots, e^{-b_{kn}})$, with degree no more than $2Dnk + k + Dn + 1 < 3DW$. Theorem 2.2 in [GJ93] implies that $\mathrm{VCdim}(\tilde f_S) < 2W \log_2(24eWD)$.

Similarly, $f_{RBF}(\theta, x) - r$ has the same sign as $\tilde f_{RBF}(\theta, (x, r))$, where
\[
\tilde f_{RBF}(\theta, (x, r)) = (f_{RBF}(\theta, x) - r) \prod_{i=1}^k \prod_{j=1}^n e^{2 c_{ij} D}
= (a_0 - r) \prod_{i=1}^k \prod_{j=1}^n e^{2 c_{ij} D}
+ \sum_{i=1}^k a_i \left( \prod_{j=1}^n e^{-x_j^2} e^{2 c_{ij}(x_j + D)} e^{-c_{ij}^2} \right) \prod_{\substack{h=1 \\ h \ne i}}^k \prod_{j=1}^n e^{2 c_{hj} D}.
\]
Again, $\tilde f_{RBF}(\theta, (x, r))$ is polynomial in $\tilde\theta = (a_0, \ldots, a_k, e^{2c_{11}}, \ldots, e^{2c_{kn}}, e^{-c_{11}^2}, \ldots, e^{-c_{kn}^2})$, with degree no more than $nD(k+2) + 2 < 3DW$. As above, $\dim_P(f_{RBF}) < 4W \log_2(24eDW)$.

The same argument can be used to show that two-layer sigmoid or RBF networks with $W$ weights, $\{-D, \ldots, D\}$-valued inputs, and arbitrary connectivity have pseudodimension (and hence VC-dimension) $O(W \log(WD))$. From the results in [KM95], it is clear that the dependence on $D$, the size of the input set, is not necessary in these upper bounds.
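To make the key step of the proof concrete, the following toy symbolic check (my own construction, not part of the paper) treats the smallest sigmoid case $k = n = 1$, $D = 1$: with $U$ standing for $e^{-b_{11}}$ and $V$ for $e^{-b_{10}}$, the transformed function is a polynomial in $(a_0, a_1, U, V, r)$ for every integer input, and equals $f_S(\theta, x) - r$ times a positive factor.

```python
import sympy as sp

a0, a1, b10, b11, r = sp.symbols('a0 a1 b10 b11 r', real=True)
U, V = sp.symbols('U V', positive=True)   # U plays exp(-b11), V plays exp(-b10)
D = 1

for x in (-1, 0, 1):                      # the input set {-D, ..., D} with D = 1
    # Transformed function from the proof, specialised to k = n = 1.
    g = (a0 - r) * (U**D + V * U**(x + D)) + a1 * U**D
    degree = sp.Poly(g, a0, a1, U, V, r).total_degree()

    # It should equal (f_S - r) times the positive factor
    # (1 + exp(-(b11*x + b10))) * exp(-b11*D), hence have the same sign.
    f = a0 + a1 / (1 + sp.exp(-(b11 * x + b10)))
    positive_factor = (1 + sp.exp(-(b11 * x + b10))) * sp.exp(-b11 * D)
    residual = sp.simplify(
        g.subs({U: sp.exp(-b11), V: sp.exp(-b10)}) - (f - r) * positive_factor)

    print(f"x = {x:2d}: polynomial degree = {degree}, identity holds: {residual == 0}")
```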

Acknowledgements

This research was supported by the Australian Telecommunications and Electronics Research Board and by the Australian Research Council. Thanks to Sridevan Parameswaran for help in obtaining references.

References

[Bar93] P. L. Bartlett. Vapnik-Chervonenkis dimension bounds for two- and three-layer networks. Neural Computation, 5(3):353-355, 1993.

[BEHW89] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 36(4):929-965, 1989.

[BH89] E. B. Baum and D. Haussler. What size net gives valid generalization? Neural Computation, 1(1):151-160, 1989.

[GJ93] P. Goldberg and M. Jerrum. Bounding the Vapnik-Chervonenkis dimension of concept classes parametrized by real numbers. In Proceedings of the Sixth ACM Workshop on Computational Learning Theory, pages 361-369, 1993.

[Hau92] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78-150, 1992.

[KM95] M. Karpinski and A. Macintyre. Quadratic VC-dimension bounds for sigmoidal networks. In Proceedings of the 27th Annual Symposium on the Theory of Computing, 1995.

[Maa92] W. Maass. Bounds for the computational power and learning complexity of analog neural nets. Technical report, Graz University of Technology, 1992.

[MS93] A. Macintyre and E. D. Sontag. Finiteness results for sigmoidal `neural' networks. In Proceedings of the 25th Annual Symposium on the Theory of Computing, 1993.

[Sak93] A. Sakurai. Tighter bounds on the VC-dimension of three-layer networks. In World Congress on Neural Networks, 1993.

[Val84] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.