Limitations of One-Hidden-Layer Perceptron Networks


J. Yaghob (Ed.): ITAT 2015, Charles University in Prague, Prague, 2015

Věra Kůrková
Institute of Computer Science, Czech Academy of Sciences
WWW home page: vera

Abstract: Limitations of one-hidden-layer perceptron networks in efficiently representing finite mappings are investigated. It is shown that almost any uniformly randomly chosen mapping on a sufficiently large finite domain cannot be tractably represented by a one-hidden-layer perceptron network. This existential probabilistic result is complemented by a concrete example of a class of functions constructed using quasi-random sequences. Analogies with the central paradox of coding theory and the no free lunch theorem are discussed.

1 Introduction

A widely used type of neural-network architecture is a network with one hidden layer of computational units (such as perceptrons, radial or kernel units) and one linear output unit. Recently, new hybrid learning algorithms for feedforward networks with two or more hidden layers, called deep networks [9, 3], were successfully applied to various pattern recognition tasks. Thus a theoretical analysis identifying tasks for which shallow networks require considerably larger model complexities than deep ones is needed. In [4, 5], Bengio et al. suggested that a cause of large model complexities of shallow networks with one hidden layer might be the amount of variation of the functions to be computed, and they illustrated their suggestion by an example of representation of $d$-dimensional parities by Gaussian SVM.

In practical applications, feedforward networks compute functions on finite domains in $\mathbb{R}^d$ representing, e.g., scattered empirical data or pixels of images. It is well known that shallow networks with many types of computational units have the universal representation property, i.e., they can exactly represent any real-valued function on a finite domain.
This property holds, e.g., for networks with perceptrons with any sigmoidal activation function [10] and for networks with Gaussian radial units [15]. However, proofs of universal representation capabilities assume that networks have numbers of hidden units equal to the sizes of the domains of the functions to be computed. For large domains, this can be a factor limiting practical implementations. Upper bounds on rates of approximation of multivariable functions by shallow networks with increasing numbers of units were studied in terms of variational norms tailored to types of network units (see, e.g., [11] and references therein).

In this paper, we employ these norms to derive lower bounds on model complexities of shallow networks representing finite mappings. Using geometrical properties of high-dimensional spaces, we show that a representation of almost any uniformly randomly chosen function on a large finite domain by a shallow perceptron network requires a large number of units or large output weights. We illustrate this existential probabilistic result by a concrete construction of a class of functions based on Hadamard and quasi-noise matrices. We discuss analogies with the central paradox of coding theory and the no free lunch theorem.

The paper is organized as follows. Section 2 contains basic concepts and notation on shallow networks and dictionaries of computational units. Section 3 reviews variational norms as tools for investigating network complexity and proves estimates of probabilistic distributions of sizes of variational norms. In Section 4, concrete examples of functions which cannot be tractably represented by perceptron networks are constructed using Hadamard and pseudo-noise matrices. Section 5 is a brief discussion.
2 Preliminaries

One-hidden-layer networks with single linear outputs (shallow networks) compute input-output functions from sets of the form
$$\mathrm{span}_n\, G := \left\{ \sum_{i=1}^{n} w_i g_i \;\middle|\; w_i \in \mathbb{R},\ g_i \in G \right\},$$
where $G$, called a dictionary, is a set of functions computable by a given type of units, the coefficients $w_i$ are called output weights, and $n$ is the number of hidden units. This number is sometimes used as a measure of model complexity.

In this paper, we focus on representations of functions on finite domains $X \subset \mathbb{R}^d$. We denote by
$$F(X) := \{ f \mid f : X \to \mathbb{R} \}$$
the set of all real-valued functions on $X$. On $F(X)$ we have the Euclidean inner product defined as
$$\langle f, g \rangle := \sum_{u \in X} f(u)\, g(u)$$
and the Euclidean norm
$$\|f\| := \sqrt{\langle f, f \rangle}.$$
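The definitions above can be made concrete on a tiny domain. The following is a minimal sketch, assuming the domain is the Boolean square $\{0,1\}^2$ and using a made-up two-element dictionary; the names `inner`, `norm`, `shallow_network`, `g1`, and `g2` are illustrative and do not come from the paper.

```python
import math

# Illustrative finite domain X: the four vertices of the Boolean square {0,1}^2.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]

def inner(f, g, domain=X):
    """Euclidean inner product on F(X): sum over u in X of f(u) * g(u)."""
    return sum(f(u) * g(u) for u in domain)

def norm(f, domain=X):
    """Euclidean norm on F(X): square root of <f, f>."""
    return math.sqrt(inner(f, f, domain))

def shallow_network(weights, units):
    """Return the function u -> sum_i w_i * g_i(u), an element of span_n G."""
    if len(weights) != len(units):
        raise ValueError("one output weight per hidden unit is required")
    return lambda u: sum(w * g(u) for w, g in zip(weights, units))

# Hypothetical dictionary elements (coordinate projections), chosen only for illustration.
g1 = lambda u: u[0]
g2 = lambda u: u[1]

# A network with n = 2 hidden units and output weights 2 and -1.
f = shallow_network([2.0, -1.0], [g1, g2])
print(inner(f, f), norm(f))
```

Here functions on $X$ are ordinary Python callables, so the inner product is just a sum over the domain; for larger domains a vector of function values would be the more practical representation.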

To distinguish the inner product $\langle .,. \rangle$ on $F(X)$ from the inner product on $X \subset \mathbb{R}^d$, we denote the latter by $\cdot$, i.e., for $u, v \in X$,
$$u \cdot v := \sum_{i=1}^{d} u_i v_i.$$

We investigate networks with units from the dictionary of signum perceptrons
$$P_d(X) := \{ \mathrm{sgn}(v \cdot . + b) : X \to \{-1,1\} \mid v \in \mathbb{R}^d,\ b \in \mathbb{R} \},$$
where $\mathrm{sgn}(t) := -1$ for $t < 0$ and $\mathrm{sgn}(t) := 1$ for $t \ge 0$. Note that from the point of view of model complexity, there is only a minor difference between networks with signum perceptrons and those with Heaviside perceptrons, as
$$\mathrm{sgn}(t) = 2\vartheta(t) - 1 \quad \text{and} \quad \vartheta(t) = \frac{\mathrm{sgn}(t) + 1}{2},$$
where $\vartheta(t) := 0$ for $t < 0$ and $\vartheta(t) := 1$ for $t \ge 0$.

3 Model Complexity and Variational Norms

A useful tool for deriving estimates of numbers of units and sizes of output weights in shallow networks is the concept of a variational norm tailored to network units, introduced in [12] as an extension of the concept of variation with respect to half-spaces from [2]. For a subset $G$ of a normed linear space $(\mathcal{X}, \|.\|_{\mathcal{X}})$, $G$-variation (variation with respect to the set $G$), denoted by $\|.\|_G$, is defined as
$$\|f\|_G := \inf \{ c \in \mathbb{R}_+ \mid f/c \in \mathrm{cl}_{\mathcal{X}}\, \mathrm{conv}(G \cup -G) \},$$
where $\mathrm{cl}_{\mathcal{X}}$ denotes the closure with respect to the norm $\|.\|_{\mathcal{X}}$ on $\mathcal{X}$, $-G := \{ -g \mid g \in G \}$, and
$$\mathrm{conv}\, G := \left\{ \sum_{i=1}^{k} a_i g_i \;\middle|\; a_i \in [0,1],\ \sum_{i=1}^{k} a_i = 1,\ g_i \in G,\ k \in \mathbb{N} \right\}$$
is the convex hull of $G$.

The following straightforward consequence of the definition of $G$-variation shows that in any representation of a function with large $G$-variation by a shallow network with units from the dictionary $G$, the number of units must be large or the absolute values of some output weights must be large.

Proposition 1. Let $G$ be a finite subset of a normed linear space $(\mathcal{X}, \|.\|_{\mathcal{X}})$. Then for every $f \in \mathcal{X}$,
$$\|f\|_G = \min \left\{ \sum_{i=1}^{k} |w_i| \;\middle|\; f = \sum_{i=1}^{k} w_i g_i,\ w_i \in \mathbb{R},\ g_i \in G \right\}.$$

Note that classes of functions defined by constraints on their variational norms represent a similar type of concept as classes of functions defined by constraints on both numbers of gates and sizes of output weights studied in the theory of circuit complexity [16].
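As a small illustration of Proposition 1, the sketch below writes the $\{-1,1\}$-valued XOR function on $\{0,1\}^2$ as a linear combination of three signum perceptrons and sums the absolute values of the output weights. Any such representation gives an upper bound on the variation with respect to a dictionary containing these units. The particular units and weights are an assumption made for this example, not taken from the paper, and the representation is not claimed to be the minimizing one.

```python
from itertools import product

def sgn(t):
    """Signum as defined in the text: -1 for t < 0, +1 for t >= 0."""
    return -1 if t < 0 else 1

def signum_perceptron(v, b):
    """Return the unit u -> sgn(v . u + b)."""
    return lambda u: sgn(sum(vi * ui for vi, ui in zip(v, u)) + b)

X = list(product((0, 1), repeat=2))

def xor_pm(u):
    """XOR with values in {-1, 1}: +1 when the two inputs differ."""
    return 1 if u[0] != u[1] else -1

# Hand-picked units (illustrative): OR-like, AND-like, and a constant unit.
units = [
    signum_perceptron((1, 1), -0.5),   # -1 only at (0, 0)
    signum_perceptron((1, 1), -1.5),   # +1 only at (1, 1)
    signum_perceptron((0, 0), 1.0),    # constant +1
]
weights = [1.0, -1.0, -1.0]

def network(u):
    return sum(w * g(u) for w, g in zip(weights, units))

represents_xor = all(network(u) == xor_pm(u) for u in X)
l1_norm_of_weights = sum(abs(w) for w in weights)
print(represents_xor, l1_norm_of_weights)
```

The sum of absolute output weights in this representation is 3, so the variation of XOR with respect to any finite dictionary containing the three units is at most 3.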
To derive lower bounds on variational norms, we use the following theorem from [13], which shows that functions that are not correlated with any element of the dictionary $G$ have large variations.

Theorem 2. Let $(\mathcal{X}, \|.\|_{\mathcal{X}})$ be a Hilbert space with inner product $\langle .,. \rangle_{\mathcal{X}}$ and $G$ its bounded subset. Then for every $f \in \mathcal{X} \setminus G^{\perp}$,
$$\|f\|_G \ge \frac{\|f\|^2}{\sup_{g \in G} |\langle f, g \rangle_{\mathcal{X}}|}.$$

The following theorem shows that when a dictionary $G(X)$ is not too large, then for a large domain $X$, almost any randomly chosen function has large $G(X)$-variation. We denote by
$$S_r(X) := \{ f \in F(X) \mid \|f\| = r \}$$
the sphere of radius $r$ in $F(X)$ and, for $f \in F(X)$, $f^o := f / \|f\|$. The proof of the theorem is based on the geometry of spheres in high-dimensional Euclidean spaces. In large dimensions, most of the area of a sphere lies very close to its equator [1].

Theorem 3. Let $d$ be a positive integer, $X \subset \mathbb{R}^d$ with $\mathrm{card}\, X = m$, $G(X)$ a subset of $F(X)$ with $\mathrm{card}\, G(X) = n$ such that for all $g \in G(X)$, $\|g\| \le r$, $\mu$ a uniform probability measure on $S_r(X)$, and $b > 0$. Then
$$\mu(\{ f \in S_r(X) \mid \|f\|_{G(X)} \ge b \}) \ge 1 - 2n\, e^{-\frac{m}{2b^2}}.$$

Proof. Denote for $g \in S_r(X)$ and $\varepsilon \in (0,1)$,
$$C(g, \varepsilon) := \{ h \in S_r(X) \mid \langle h^o, g^o \rangle \ge \varepsilon \}.$$
As $C(g, \varepsilon)$ is equivalent to a polar cap in $\mathbb{R}^{\mathrm{card}\, X}$, whose measure decreases exponentially with the dimension $m$, we have $\mu(C(g, \varepsilon)) \le e^{-\frac{m \varepsilon^2}{2}}$ (see, e.g., [1]). By Theorem 2,
$$\{ f \in S_r(X) \mid \|f\|_{G(X)} \ge b \} \supseteq S_r(X) \setminus \bigcup_{g \in G(X) \cup -G(X)} C(g, 1/b).$$
The union consists of at most $2n$ caps, each of measure at most $e^{-\frac{m}{2b^2}}$. Hence the statement follows.

Theorem 3 can be applied to dictionaries $G(X)$ on domains $X \subset \mathbb{R}^d$ with $\mathrm{card}\, X = m$ which are relatively small. In particular, dictionaries of signum and Heaviside perceptrons are relatively small. Estimates of their sizes can be obtained from bounds on the numbers of linearly separable dichotomies into which finite subsets of $\mathbb{R}^d$ can be partitioned. Various estimates of numbers of dichotomies have been derived by several authors, starting from results by Schläfli [17]. The next bound is obtained by combining a theorem from [7, p.330] with an upper bound on partial sums of binomials.
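The role of dictionary size and of Theorems 2 and 3 discussed above can be illustrated on the smallest interesting domain. The sketch below enumerates the dichotomies of $\{0,1\}^2$ induced by signum perceptrons using a small grid of weights and biases (a grid that is an assumption of this example and is only adequate for this two-dimensional domain), evaluates the lower bound of Theorem 2 for the $\{-1,1\}$-valued XOR function, and evaluates the right-hand side of Theorem 3 for given $m$, $n$, $b$. All function names are illustrative.

```python
import math
from itertools import product

X = list(product((0, 1), repeat=2))

def sgn(t):
    return -1 if t < 0 else 1

def perceptron_dichotomies(domain):
    """Value vectors of signum perceptrons on `domain`, found by a small grid search.

    The grid (weights in {-1, 0, 1}, half-integer biases) is an assumption that
    suffices for {0,1}^2; it is not a general enumeration procedure.
    """
    found = set()
    for v in product((-1, 0, 1), repeat=2):
        for b in (-1.5, -0.5, 0.5, 1.5):
            found.add(tuple(sgn(v[0] * u[0] + v[1] * u[1] + b) for u in domain))
    return found

def theorem2_lower_bound(f_values, dictionary):
    """||f||^2 / sup_g |<f, g>| over a finite dictionary of value vectors."""
    sup = max(abs(sum(a * b for a, b in zip(f_values, g))) for g in dictionary)
    return sum(a * a for a in f_values) / sup

def theorem3_probability_bound(m, n, b):
    """Right-hand side 1 - 2 n exp(-m / (2 b^2)) of Theorem 3 (may be negative, i.e. vacuous)."""
    return 1.0 - 2.0 * n * math.exp(-m / (2.0 * b * b))

P = perceptron_dichotomies(X)
xor_values = tuple(1 if u[0] != u[1] else -1 for u in X)
print(len(P), theorem2_lower_bound(xor_values, P))
print(theorem3_probability_bound(m=1024, n=100, b=2.0))
```

On this domain the enumeration yields 14 dichotomies (all Boolean functions of two variables except XOR and its negation), and Theorem 2 gives the lower bound 2 on the variation of XOR. For very small $m$ the bound of Theorem 3 is vacuous, which is consistent with the theorem being a statement about large domains.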

Theorem 4. For every $d$ and every $X \subset \mathbb{R}^d$ such that $\mathrm{card}\, X = m$,
$$\mathrm{card}\, P_d(X) \le 2 \sum_{i=0}^{d} \binom{m-1}{i} \le 2\, \frac{m^d}{d!}.$$

Combining Theorems 3 and 4, we obtain a lower bound on measures of sets of functions having variations with respect to signum perceptrons bounded from below by a given bound $b$.

Corollary 1. Let $d$ be a positive integer, $X \subset \mathbb{R}^d$ with $\mathrm{card}\, X = m$, $\mu$ a uniform probability measure on $S_{\sqrt{m}}(X)$, and $b > 0$. Then
$$\mu(\{ f \in S_{\sqrt{m}}(X) \mid \|f\|_{P_d(X)} \ge b \}) \ge 1 - 4\, \frac{m^d}{d!}\, e^{-\frac{m}{2b^2}}.$$

For example, for the domain $X = \{0,1\}^d$ and $b = 2^{d/4}$, we obtain from Corollary 1 a lower bound
$$1 - \frac{2^{d^2+2}}{e^{(2^{d/2-1})}\, d!}$$
on the probability that a function on $\{0,1\}^d$ with the norm $2^{d/2}$ has variation with respect to signum perceptrons greater than or equal to $2^{d/4}$. Thus for large $d$, almost any uniformly randomly chosen function on the $d$-dimensional Boolean cube $\{0,1\}^d$ of the same norm $2^{d/2}$ as signum perceptrons has variation with respect to signum perceptrons depending on $d$ exponentially.

4 Construction of Functions with Large Variations

The results derived in the previous section are existential. In this section, we construct a class of functions which cannot be represented by shallow perceptron networks of low model complexities. We construct such functions using Hadamard matrices. We show that the class of Hadamard matrices contains circulant matrices with rows being segments of pseudo-noise sequences, which mimic some properties of random sequences.

Recall that a Hadamard matrix of order $m$ is an $m \times m$ square matrix $M$ with entries in $\{-1,1\}$ such that any two distinct rows (or equivalently columns) of $M$ are orthogonal. Note that this property is invariant under permuting rows or columns and under flipping the signs of all entries in a column or a row. Two distinct rows of a Hadamard matrix differ in exactly $m/2$ positions.

The next theorem gives a lower bound on variation with respect to signum perceptrons of a $\{-1,1\}$-valued function constructed using a Hadamard matrix.

Theorem 5.
Let $M$ be an $m \times m$ Hadamard matrix, $\{x_i \mid i = 1, \dots, m\} \subset \mathbb{R}^d$, $\{y_j \mid j = 1, \dots, m\} \subset \mathbb{R}^d$, $X = \{x_i \mid i = 1, \dots, m\} \times \{y_j \mid j = 1, \dots, m\} \subset \mathbb{R}^{2d}$, and $f_M : X \to \{-1,1\}$ be defined as $f_M(x_i, y_j) := M_{i,j}$. Then
$$\|f_M\|_{P_{2d}(X)} \ge \frac{\sqrt{m}}{\log_2 m}.$$

Proof. By Theorem 2,
$$\|f_M\|_{P_{2d}(X)} \ge \frac{\|f_M\|^2}{\sup_{g \in P_{2d}(X)} |\langle f_M, g \rangle|} = \frac{m^2}{\sup_{g \in P_{2d}(X)} |\langle f_M, g \rangle|}.$$
For each $g \in P_{2d}(X)$, let $M(g)$ be the $m \times m$ matrix defined as $M(g)_{i,j} = g(x_i, y_j)$. It is easy to see that $\langle f_M, g \rangle = M \cdot M(g)$, where $A \cdot B := \sum_{i,j=1}^{m} A_{i,j} B_{i,j}$ denotes the entrywise inner product of matrices. Using suitable permutations, we reorder rows and columns of both matrices $M(g)$ and $M$ in such a way that each row and each column of the reordered matrix $\bar{M}(g)$ starts with a (possibly empty) initial segment of $-1$'s followed by a (possibly empty) segment of $+1$'s. Denoting by $\bar{M}$ the reordered matrix $M$, we have
$$\langle f_M, g \rangle = M \cdot M(g) = \bar{M} \cdot \bar{M}(g).$$
As the property of being a Hadamard matrix is invariant under permutations of rows and columns, we can apply the Lindsey lemma [8, p.88] to submatrices of the Hadamard matrix $\bar{M}$ on which all entries of the matrix $\bar{M}(g)$ are either $-1$ or $1$. Thus we obtain an upper bound $m\sqrt{m}$ on the differences of numbers of $+1$s and $-1$s in suitable submatrices of $\bar{M}$. Iterating the procedure at most $\log_2 m$ times, we obtain an upper bound $m \sqrt{m} \log_2 m$ on $|\bar{M} \cdot \bar{M}(g)| = |\langle f_M, g \rangle|$. Thus
$$\|f_M\|_{P_{2d}(X)} \ge \frac{m^2}{m \sqrt{m} \log_2 m} = \frac{\sqrt{m}}{\log_2 m}.$$

Theorem 5 shows that functions whose representations by shallow perceptron networks require numbers of units or sizes of output weights bounded from below by $\frac{\sqrt{m}}{\log_2 m}$ can be constructed using Hadamard matrices. In particular, when the domain is the $d$-dimensional Boolean cube $\{0,1\}^d$, where $d$ is even, the lower bound is $\frac{2^{d/4}}{d/2}$. So the lower bound grows with $d$ exponentially.

Recall that if a Hadamard matrix of order $m$ exists, then $m = 1$ or $m = 2$ or $m$ is divisible by 4 [14, p.44]. It is conjectured that there exists a Hadamard matrix of every order divisible by 4. Listings of Hadamard matrices of various orders can be found in Neil Sloane's library of Hadamard matrices.

We show that suitable Hadamard matrices can be obtained from pseudo-noise sequences. An infinite sequence $a_0, a_1, \dots, a_i, \dots$ of elements of $\{0,1\}$ is called a $k$-th order linear recurring sequence if for some $h_0, \dots, h_k \in \{0,1\}$,
$$a_i = \sum_{j=1}^{k} a_{i-j}\, h_{k-j} \mod 2$$
for all $i \ge k$. It is called a $k$-th order pseudo-noise (PN) sequence (or pseudo-random sequence) if it is a $k$-th order linear recurring sequence with minimal period $2^k - 1$.

A $2^k \times 2^k$ matrix $L$ is called pseudo-noise if for all $i = 1, \dots, 2^k$, $L_{1,i} = 0$ and $L_{i,1} = 0$, and for all $i = 2, \dots, 2^k$ and $j = 2, \dots, 2^k$,
$$L_{i,j} = \bar{L}_{i-1, j-1},$$
where the $(2^k - 1) \times (2^k - 1)$ matrix $\bar{L}$ is a circulant matrix with rows formed by shifted segments of length $2^k - 1$ of a $k$-th order pseudo-noise sequence.

PN sequences have many useful applications because some of their properties mimic those of random sequences. A run is a string of consecutive 1's or a string of consecutive 0's. In any segment of length $2^k - 1$ of a $k$-th order PN sequence, one-half of the runs have length 1, one-quarter have length 2, one-eighth have length 3, and so on. In particular, there is one run of length $k$ of 1's and one run of length $k - 1$ of 0's. Thus every segment of length $2^k - 1$ contains $2^k/2$ ones and $2^k/2 - 1$ zeros [14, p.410].

Let $\tau : \{0,1\} \to \{-1,1\}$ be defined as $\tau(x) = (-1)^x$ (i.e., $\tau(0) = 1$ and $\tau(1) = -1$). The following theorem states that a matrix obtained by applying $\tau$ to the entries of a pseudo-noise matrix is a Hadamard matrix.

Theorem 6. Let $L$ be a $2^k \times 2^k$ pseudo-noise matrix and $L_\tau$ be the $2^k \times 2^k$ matrix with entries in $\{-1,1\}$ obtained from $L$ by applying $\tau$ to all its entries. Then $L_\tau$ is a Hadamard matrix.

Proof. We show that the inner product of any two distinct rows of $L_\tau$ is equal to zero. The autocorrelation of a sequence $a_0, a_1, \dots, a_i, \dots$ of elements of $\{0,1\}$ with period $2^k - 1$ is defined as
$$\rho(t) = \frac{1}{2^k - 1} \sum_{j=0}^{2^k - 2} (-1)^{a_j + a_{j+t}}.$$
For every pseudo-noise sequence,
$$\rho(t) = -\frac{1}{2^k - 1}$$
for every $t = 1, \dots, 2^k - 2$ [14, p. 411]. Thus the inner product of every two distinct rows of the matrix $\bar{L}_\tau$ is equal to $-1$. As all elements of the first column of $L_\tau$ are equal to 1, the inner product of every pair of distinct rows of $L_\tau$ is equal to zero.
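The construction in Theorem 6 can be checked numerically for a small order. The sketch below assumes $k = 3$ and the recurrence $a_i = a_{i-2} + a_{i-3} \bmod 2$ (one choice that gives minimal period $7$), builds the bordered circulant matrix, applies $\tau$, and tests the Hadamard property. The function names, the seed, and the tap convention (`taps[j-1]` plays the role of $h_{k-j}$) are assumptions of this example.

```python
def linear_recurring_sequence(taps, seed, length):
    """Generate a_i = sum_{j=1..k} taps[j-1] * a_{i-j} mod 2, starting from `seed`."""
    a = list(seed)
    k = len(taps)
    while len(a) < length:
        a.append(sum(taps[j - 1] * a[-j] for j in range(1, k + 1)) % 2)
    return a[:length]

def minimal_period(a, max_period):
    """Smallest p <= max_period with a[i] == a[i + p] on the generated prefix."""
    for p in range(1, max_period + 1):
        if all(a[i] == a[i + p] for i in range(len(a) - p)):
            return p
    return None

def pseudo_noise_matrix(segment):
    """Border the circulant matrix of shifts of `segment` with a zero first row and column."""
    n = len(segment)
    circulant = [[segment[(i + j) % n] for j in range(n)] for i in range(n)]
    return [[0] * (n + 1)] + [[0] + row for row in circulant]

def tau(x):
    """tau(0) = 1, tau(1) = -1, i.e. (-1)^x on {0, 1}."""
    return 1 - 2 * x

def is_hadamard(M):
    """Check that distinct rows are orthogonal and each row has squared norm m."""
    m = len(M)
    return all(
        sum(M[i][c] * M[j][c] for c in range(m)) == (m if i == j else 0)
        for i in range(m) for j in range(m)
    )

k = 3
sequence = linear_recurring_sequence(taps=(0, 1, 1), seed=(1, 0, 0), length=4 * (2**k - 1))
segment = sequence[: 2**k - 1]
L = pseudo_noise_matrix(segment)
L_tau = [[tau(x) for x in row] for row in L]
print(segment, minimal_period(sequence, 2**k - 1), is_hadamard(L_tau))
```

For this choice the segment is 1, 0, 0, 1, 0, 1, 1, which contains four ones and three zeros, in agreement with the run properties stated above, and the resulting $8 \times 8$ matrix passes the Hadamard check.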
Theorem 5 implies that for every pseudo-noise matrix $L$ of order $2^k$ and $X \subset \mathbb{R}^d$ such that $\mathrm{card}\, X = 2^k \times 2^k$, there exists a function $f_{L_\tau} : X \to \{-1,1\}$ induced by the matrix $L_\tau$, obtained from $L$ by replacing 0's with 1's and 1's with $-1$'s, such that
$$\|f_{L_\tau}\|_{P_d(X)} \ge \frac{2^{k/2}}{k}.$$
So the variation of $f_{L_\tau}$ with respect to signum perceptrons depends on $k$ exponentially. In particular, setting $X = \{0,1\}^d$, where $d = 2k$ is even, we obtain a function of $d$ variables with variation with respect to signum perceptrons growing with $d$ exponentially as
$$\|f_{L_\tau}\|_{P_d(X)} \ge \frac{2^{d/4}}{d/2}.$$
Representation of this function by a shallow perceptron network requires a number of units or sizes of some output weights depending on $d$ exponentially.

It is easy to show that for each even integer $d$, the function induced by the Sylvester-Hadamard matrix $M_{u,v} = (-1)^{u \cdot v}$, where $u, v \in \{0,1\}^{d/2}$, can be represented by a two-hidden-layer network with $d/2$ units in each hidden layer.

5 Discussion

We proved that almost any uniformly randomly chosen function on a sufficiently large finite set in $\mathbb{R}^d$ has large variation with respect to signum perceptrons and thus cannot be tractably represented by a shallow perceptron network. It seems to be a paradox that although representations of almost all functions by shallow perceptron networks are intractable, it is difficult to construct such functions. The situation can be rephrased in analogy with the title of an article from coding theory, "Any code of which we cannot think is good" [6], as "representation of almost any function of which we cannot think by shallow perceptron networks is intractable".

A central paradox of coding theory concerns the existence and construction of the best codes. Virtually every linear code is good (in the sense that it meets the Gilbert-Varshamov bound on distance versus redundancy); however, despite the sophisticated constructions for codes derived over the years, no one has succeeded in demonstrating a constructive procedure that yields such good codes.
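The remark on the Sylvester-Hadamard function can be sketched as follows. This is one possible construction, not necessarily the one the author has in mind: it assumes Heaviside perceptrons (which the text notes differ only slightly from signum perceptrons) and a linear output unit with a constant bias term. The first hidden layer computes the $d/2$ conjunctions $u_i \wedge v_i$, and the second hidden layer uses $d/2$ threshold units whose alternating sum gives the parity of the number of active conjunctions.

```python
from itertools import product

def heaviside(t):
    """Heaviside function as in the text: 0 for t < 0, 1 for t >= 0."""
    return 0 if t < 0 else 1

def sylvester_hadamard(u, v):
    """Entry M_{u,v} = (-1)^(u . v) for binary vectors u, v."""
    return (-1) ** sum(ui * vi for ui, vi in zip(u, v))

def two_hidden_layer_network(u, v):
    """Illustrative network with len(u) units in each of two hidden layers."""
    # First hidden layer: unit i computes the conjunction u_i AND v_i.
    h1 = [heaviside(ui + vi - 1.5) for ui, vi in zip(u, v)]
    s = sum(h1)
    # Second hidden layer: unit j is active iff at least j first-layer units are active.
    h2 = [heaviside(s - j + 0.5) for j in range(1, len(h1) + 1)]
    # Linear output (with bias 1): the alternating sum of h2 equals the parity of s.
    parity = sum((-1) ** (j + 1) * h for j, h in enumerate(h2, start=1))
    return 1 - 2 * parity

def shallow_lower_bound(d):
    """Lower bound 2^(d/4) / (d/2) on the variation w.r.t. signum perceptrons (d even)."""
    return 2 ** (d / 4) / (d / 2)

half = 3  # d/2, so d = 6 input variables
agree = all(
    two_hidden_layer_network(u, v) == sylvester_hadamard(u, v)
    for u in product((0, 1), repeat=half)
    for v in product((0, 1), repeat=half)
)
print(agree, shallow_lower_bound(40))
```

The contrast is that the two-hidden-layer construction uses only $d$ hidden units in total, while the lower bound for one hidden layer grows exponentially with $d$.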
The only class of functions having large variations which we succeeded in constructing is the class described in Section 4 based on Hadamard matrices. Among these matrices are quasi-noise (quasi-random) matrices with rows obtained as shifts of segments of quasi-noise sequences. These sequences have been used in the construction of codes, interplanetary satellite picture transmission, precision measurements, acoustics, radar camouflage, and light diffusers. Pseudo-noise sequences permit the design of surfaces that scatter incoming signals very broadly, making reflected energy invisible or inaudible.

It should be emphasized that, similarly to the no free lunch theorem [18], our results assume uniform distributions of functions to be represented. However, probability distributions of functions modeling some practical tasks of interest (such as colors of pixels in a photograph) might be highly non-uniform.

Acknowledgments. This work was partially supported by the grant COST LD13002 of the Ministry of Education of the Czech Republic and institutional support of the Institute of Computer Science RVO

References

[1] Ball, K.: An elementary introduction to modern convex geometry. In: S. Levy (ed.), Flavors of Geometry, 1–58, Cambridge University Press, 1997
[2] Barron, A. R.: Neural net approximation. In: K. S. Narendra (ed.), Proc. 7th Yale Workshop on Adaptive and Learning Systems, 69–72, Yale University Press, 1992
[3] Bengio, Y.: Learning deep architectures for AI. Foundations and Trends in Machine Learning 2 (2009)
[4] Bengio, Y., Delalleau, O., Le Roux, N.: The curse of dimensionality for local kernel machines. Technical Report 1258, Département Informatique et Recherche Opérationnelle, Université de Montréal, 2005
[5] Bengio, Y., Delalleau, O., Le Roux, N.: The curse of highly variable functions for local kernel machines. In: Advances in Neural Information Processing Systems 18, MIT Press, 2006
[6] Coffey, J. T., Goodman, R. M.: Any code of which we cannot think is good. IEEE Transactions on Information Theory 36 (1990)
[7] Cover, T.: Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. on Electronic Computers 14 (1965)
[8] Erdös, P., Spencer, J. H.: Probabilistic Methods in Combinatorics. Academic Press, 1974
[9] Hinton, G. E., Osindero, S., Teh, Y. W.: A fast learning algorithm for deep belief nets. Neural Computation 18 (2006)
[10] Ito, Y.: Finite mapping by neural networks and truth functions. Mathematical Scientist 17 (1992)
[11] Kainen, P. C., Kůrková, V., Sanguineti, M.: Dependence of computational models on input dimension: Tractability of approximation and optimization tasks. IEEE Trans. on Information Theory 58 (2012)
[12] Kůrková, V.: Dimension-independent rates of approximation by neural networks. In: K. Warwick and M. Kárný (eds.), Computer-Intensive Methods in Control and Signal Processing. The Curse of Dimensionality, Birkhäuser, Boston, MA, 1997
[13] Kůrková, V., Savický, P., Hlaváčková, K.: Representations and rates of approximation of real-valued Boolean functions by neural networks. Neural Networks 11 (1998)
[14] MacWilliams, F. J., Sloane, N. J. A.: The theory of error-correcting codes. North Holland, New York, 1977
[15] Micchelli, C. A.: Interpolation of scattered data: Distance matrices and conditionally positive definite functions. Constructive Approximation 2 (1986)
[16] Roychowdhury, V., Siu, K.-Y., Orlitsky, A.: Neural models and spectral methods. In: V. Roychowdhury, K. Siu, and A. Orlitsky (eds.), Theoretical Advances in Neural Computation and Learning, 3–36, Springer, New York, 1994
[17] Schläfli, L.: Theorie der vielfachen Kontinuität. Zürcher & Furrer, Zürich, 1901
[18] Wolpert, D. H., Macready, W. G.: No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1(1) (1997), 67–82

Bounds on Sparsity of One-Hidden-Layer Perceptron Networks

J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 100–105, CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, © 2017 V. Kůrková

Věra Kůrková
Institute