1 Introduction Research on neural networks has led to the investigation of massively parallel computational models that consist of analog computationa

Size: px

Start display at page:

Download "1 Introduction Research on neural networks has led to the investigation of massively parallel computational models that consist of analog computationa"

Gervase Strickland
6 years ago
Views:

1 A Comparison of the Computational Power of Sigmoid and Boolean Threshold Circuits Wolfgang Maass Institute for Theoretical Computer Science Technische Universitaet Graz Klosterwiesgasse 32/2 A-8010 Graz, Austria Georg Schnitger Fachbereich Mathematik/Informatik Universitat Paderborn D-4790 Paderborn, Germany Eduardo D. Sontag y SYCON { Rutgers Center for Systems and Control Department of Mathematics Rutgers University sontag@hilbert.rutgers.edu Abstract We examine the power of constant depth circuits with sigmoid (i.e. smooth) threshold gates for computing boolean functions. It is shown that, for depth 2, constant size circuits of this type are strictly more powerful than constant size boolean threshold circuits (i.e. circuits with linear threshold gates). On the other hand it turns out that, for any constant depth d, polynomial size sigmoid threshold circuits with polynomially bounded weights compute exactly the same boolean functions as the corresponding circuits with linear threshold gates. Partially supported by NSF-CCR and AFOSR y Partially supported by Siemens Corporate Research and AFOSR and AFOSR

2 1 Introduction Research on neural networks has led to the investigation of massively parallel computational models that consist of analog computational elements. Usually these analog computational elements are assumed to be smooth threshold gates, i.e.-gates for some nondecreasing dierentiable function : R! R. A -gate with weights w 1 ; : : : ; w m 2 R and threshold t 2 R is dened to be a gate that computes the function (x 1 ; : : : ; x m ) 7! ( P m w i x i? t) from R m into R. A -circuit is dened as a directed acyclic circuit that consists of -gates. The most frequently considered special case of a smooth threshold circuit is the sigmoid threshold circuit, which is a -circuit for : R! R dened by 1 (x) = 1 + exp(?x). Smooth threshold circuits (-circuits for \smooth" functions ) have become the standard model for the investigation of learning on multi-layer articial neural nets ([K], [HKP], [RM], [SS1], [SS2], [WK]). In fact, the most common learning algorithm for multi-layer neural nets, the Backwards-Propagation algorithm, can only be implemented on -circuits for dierentiable functions. Another motivation for the investigation of smooth threshold circuits is the desire to explore simple models for the (very complicated) information processing mechanisms in neural systems of living organisms. In a rst approximation one may view the current ring rate of a neuron as its current output ([S], [RM], [K]). The ring rates of neurons are known to change between a few and several hundred rings per second. Hence a smooth threshold gate provides a somewhat better computational model for a neuron than a digital element that has just two dierent output signals. In this paper we examine the power of smooth threshold circuits for computing Boolean functions. In particular, we compare their power with that of boolean threshold circuits (i.e. s-circuits for the \heaviside" function s, with s(x) = 1 if x 0 and s(x) = 0 if x < 0). In the literature one often refers to such \boolean threshold circuits" as \linear threshold circuits", or simply as \threshold circuits". The most surprising result of this paper is the existence of a boolean function F n, that can be computed by a large class of -circuits (containing -circuits) with small weights in depth 2 and size 5 (Theorem 2.2.), but which cannot be computed with any weight size by constant size boolean threshold circuits of depth 2 (Theorem 3.1). A witness for this dierence in computational power is the boolean function F n with F n (~x; ~y) := Majority(~x) Majority(~y); where ~x and ~y are n-bit vectors. The proof of this lower bound result for boolean threshold circuits (Theorem 3.1) is of independent interest. First, this proof demonstrates that the restriction method is not only useful in order to prove lower bounds for AC 0 -circuits, but also for threshold circuits. Secondly, this proof exploits some previously unused potential in a standard tool for the analysis of threshold circuits: the "-Discriminator Lemma of [HMP ST ]. It is essential for our proof that the "-Discriminator Lemma holds not just for the uniform distribution over the input space (as it is stated in [HMPST]), but for any distribution. 2

3 Hence we have the freedom to construct such a distribution in a malicious manner, where we exploit specic \weak points" of the considered threshold circuit. This extra power of the (generalized) "-Discriminator Lemma is crucial: in Remark 3.10 we show that its conventional version is insucient for the proof of Theorem 3.1. Subsequent to the preliminary version [MSS] of this paper, DasGupta and Schnitger [DS] have shown that the boolean function SQ n with SQ n (x 1 ; : : : x n ; y 1 ; : : : ; y n 2) = (( n x i ) 2 leads to a depth-independent separation of boolean threshold circuits and -circuits. In particular, SQ n is computable by a large class of -circuits in constant size, whereas any boolean threshold circuit requires size (log n). However, the gate function has to satisfy more stringent dierentiabilty requirements than are necessary for the constantsize computation of F n. For the case of analog input, [DS] also provides a comparison of the approximation power of various -circuits. In order to compute a boolean function on an analog computational device one has to adopt a suitable output convention (similar to the conventions that are used to carry out digital computations on real-world computers, which consists of non-digital computational elements such as transistors). Denition 1.1 A -circuit C computes a boolean function F : f0; 1g n! f0; 1g with separation " if there is some t C 2 R such that for any input (x 1 ; : : : ; x n ) 2 f0; 1g n the output gate of C outputs a value which is at least t C +" if F (x 1 ; : : : ; x n ) = 1, and at most t C? " otherwise. A computation without separation at the output gate appears to be less interesting, since then an innitesimal change in the output of any -gate in the circuit may invert the output bit. Hence we consider in this paper computations on -circuits C n with 1 separation at least for some polynomial p (where n is the number of input bits of p(n) C n ). One nice feature of this convention is that, for Lipschitz bounded gate functions and polynomial size -circuits C n of constant depth and with polynomially bounded 1 weights, it allows a tolerance of for all -gates in C poly(n) n. We will give in Theorem 4.1 a \separation boosting" result, which says that for any constant depth d one may demand for polynomial size -circuits with polynomially bounded weights just as well a separation of size (1) without changing the class of boolean functions that can be computed. An extended abstract of this paper has previously appeared in [MSS]. In this full version of the paper we have strengthened the claim of Theorem 4.1 by imposing a less stringent condition on. n 2 y i ) 3

4 2 Sigmoid Threshold Circuits for the OR of Majorities We write (NL) for the following property of a function : R! R, (NL) There is some rational number s so that: 1. is dierentiable on some open interval containing s, and (s) exists and is nonzero. Obviously the function satises (NL). Observe that property (NL) is basically the requirement that the function be nonlinear; for instance, if 00 happens to be everywhere dened, then (NL) is precisely equivalent to not being a linear function. The nonlinearity of is obviously a necessary assumption for Theorem 2.2, since otherwise a -circuit can only compute linear functions. Without loss of generality, we will assume that c := 00 (s) 4 for some point s as in the denition. If this value where to be negative, we simply replace by? in what follows. > 0 Lemma 2.1 Assume that satises (NL). Dene the function (x) := (x + s) + (?x + s) : Then, the function is even, and there exists some " > 0 so that the following property holds: (a + h)? (a) ch 2 for all a; h 2 [0; "] : Proof. Note that (?x) = (x) directly from the denition, so is even. Moreover, is dierentiable on some open interval containing x = 0, because is dierentiable in a neighborhood of s, and evenness implies that 0 (0) = 0. Observe also that 00 (0) exists, and in fact 00 (0) = 2 00 (s) = 8c > 0 : By denition of 00 (0) (just write 0 (l) = 0 (0) + 00 r(l) (0)l + r(l) with lim = 0), there is l!0 l some " > 0 so that 0 (l) 00 (0)l = 4cl (1) 2 for all each l 2 [0; 2"]. Because 0 (l) 4cl > 0 for l > 0, it follows that is strictly increasing on [0; 2"]. We are only left to prove that this " is so that the last property holds. 4

5 Pick any a; h 2 [0; "]. Assume that h 6= 0, as otherwise there is nothing to prove. As a and a + h are both in the interval [0; 2"], and is strictly increasing there, it follows that (a + h)? (a) > (a + h)? (a + h 2 ) and by the Mean Value Theorem this last expression equals 0 (l) h 2 for some l 2 (a+ h 2 ; a+ h). Since l < a + h 2", we may apply inequality (1) to obtain (a + h)? (a) > 2clh : The result now follows from the fact that l > a + h 2 h 2. Theorem 2.2 Assume that : R! R satises (NL). Then there exists for every n 2 N a -circuit C n of depth 2 with 5 gates (and rational weights and thresholds of size O(1)) that computes F n with separation (1=n 2 ). Proof. With and " as in Lemma 2.1 one has (a) > (b), jaj > jbj for any a; b 2 [?"; +"]. Hence any two nonzero reals u; v 2 [?"=2; +"=2] have dierent sign if and only if (u? v)? (u + v) > 0. Let x 1 ; : : : ; x n ; y 1 ; : : : ; y n 2 f0; 1g be arbitrary and set Then we obtain Furthermore Lemma 2.1 implies that u := " 2 (4(x 1 + : : : + x n )? 2n + 1 ); 4n v := " 2 (4(y 1 + : : : + y n )? 2n + 1 ): 4n (u? v)? (u + v) > 0, F n (~x; ~y) = 1: j(u? v)? (u + v)j 4c minfu 2 ; v 2 g = (1=n 2 ): Hence, we can achieve separation (1=n 2 ) by using a -gate on level two of circuit C n that checks whether (u? v)? (u + v) > 0. Such a -gate exists: Since 00 (s) 6= 0, there is some t with 0 (t) 6= 0. Now transform (u? v)? (u + v) into a suitable neighborhood of t and choose a suitable rational approximation of (t) as threshold. Corollary 2.3 Assume that : R! R satises (NL) and is monotone. Then there exists for every n 2 N a -circuit C n of depth 2 and size 5 (with rational weights and thresholds of size polynomial in n) that computes F n with separation (1). Proof. Multiply the weights of the -gate on level two of the circuit C n with n 2 and transform the threshold accordingly. In this way we can ensure that the weighted sum computed at the top gate has distance (1) from its threshold. 5

6 Remark 2.4. For computations with real (rather than Boolean) inputs, there has been some work dealing with the dierences in capabilities between sigmoidal and threshold devices; in particular [So] studies questions of interpolation and classication related to learnability (VC dimension). 3 Boolean threshold gates are less powerful Theorem 3.1 No family (C n j n 2 N) of constant size boolean threshold circuits of depth 2 (with unrestricted weights and thresholds) can compute the function F n. Proof. Assume, by way of contradiction, that there exist such circuits C n, each with at most k 0 gates on level one. We can demand that all weights are integers and that the level 2 gate has weights of absolute value at most 2 O(k0 log k 0 ) ([Mu],[MT]). Thus we can assume, after appropriate duplication of level one gates, that the gate on level 2 has only weights from f?1; 1g. Let k be an upper bound on the resulting number of gates. In the next section we use the restriction method to eliminate those gates on level one of C n whose weights for the x i (y i ) have drastically dierent sizes. It turns out that we cannot achieve this goal for all gates. For example, if all the weights w i (for the x i ) are much larger than the weights u i (for the y i ), then we can only limit the variance of the weights w i (see condition b. in Denition 3.2). Nevertheless, the restriction method allows us to \regularize" all bottom gates of C n (see Lemma 3.4). In section 3.2 we show that the resulting regularized gates behave predictably for certain distributions (see Lemma 3.7). The argument concludes in section 3.3 with a non-standard application of the "-Discriminator Lemma. 3.1 The Restriction Method Our goal will be to x certain inputs such that all bottom gates of C n will have a normal form as described in the following denition. Denition 3.2. Let G be a boolean threshold gate (with inputs x 1 ; : : : ; x m ; y 1 ; : : : ; y m ) that outputs 1 if and only if that w i x i + u i y i t. Assume that the numbering is such jw 1 j : : : jw m j and ju 1 j : : : ju m j: We say that G is l-regular if and only if all w i have the same sign (negative, zero, or positive) and all u i have the same sign. Additionally, one of the following conditions has to hold, 6

7 a. G is constant. b. 8i (jw i j m 1=8 ju i j) and jw m j 60jw 1 j. c. 8i (ju i j m 1=8 jw i j) and ju m j 60ju 1 j. d. jw m j 30(1 + l)jw 1 j and ju m j 30(1 + l)ju 1 j. First we will transform a single threshold gate to a regular gate. Lemma 3.3 Let G be an arbitrary threshold gate that outputs 1 if and only if n w i x i + n u i y i t: Then there are sets M x f1; : : : ; ng and M y f1; : : : ; ng of size n 60 assignment A : fx i : i 62 M x g [ fy i : i 62 M y g! f0; 1g such that each and an a. when values are assigned according to A, F n=60 will be obtained as the corresponding subfunction of F n, and b. G, when restricted to the remaining free variables, is n 1=8 -regular. Proof. First we determine a set M 0 x f1; : : : ; ng of size n=3 such that all w i (with i 2 M 0 x) are either all positive, all negative or all zero. A set M 0 y f1; : : : ; ng of size n=3 is chosen analogously to enforce the same property for the coecients u i (with i 2 M 0 y). Set m = n=3. After possibly renumbering the indices, we can assume that M 0 x = M 0 y = f1; : : : ; mg. We can also assume that jw 1 j : : : jw m j as well as ju 1 j : : : ju m j. We dene R := f1; : : : ; m 4 g; S := fm 4 + 1; : : : ; 3m 4 g and T := f3m 4 + 1; : : : ; mg: By assigning 1's to the x i 's with i 2 R and 0's to the x i 's with i 2 T or vice versa, and by assigning 1's to the y i 's with i 2 R and 0's to the y i 's with i 2 T or vice versa, we obtain four partial assignments. Let us now interpret G as a threshold gate of the remaining variables x i (i 2 S) and y i (i 2 S). By choosing one of the four assignments, we can \move" the threshold of the resulting gate over a distance d with d = i2t jw i j? jw i j + ju i j? ju i j: i2r i2t i2r If for none of these four partial assignments the threshold gate G gives constant output, we have d jw i j + ju i j: i2s i2s 7

8 This implies that i2t (jw i j + ju i j) i2r[s (jw i j + ju i j): (2) Set a = P i2r[s (jw i j + ju i j)=(3m=4) and b = P i2t (jw i j + ju i j)=(m=4): Then (2) implies for these \averages" of jw i j + ju i j over R [ S respectively T that b 3a. We subdivide the set S by introducing the sets P = f 3m 4? 2m ; : : : ; 3m 4? m 10 g and Q = f3m 4? m ; : : : ; 3m 4 g: Since jw i j + ju i j is a non-decreasing function of i we have for all i 2 R [ S (and in particular for all i 2 P [ Q) Furthermore, we have for all i 2 P jw i j + ju i j b 3a: (3) jw i j + ju i j a=10; (4) since otherwise jw i j + ju i j < a=10 for all i 2 (R [ S)? (P [ Q), and we would get i2r[s (jw i j + ju i j) = i2(r[s)?(p[q) (jw i j + ju i j) + ( 3 4? 2 10 )m a 2m + 3a = m a( (3? 1 10 )) < 3m a ; 4 which is a contradiction to the denition of a. (3) and (4) jointly imply that i2p[q (jw i j + ju i j) max (jw ij + ju i j) 30 min (jw ij + ju i j): (5) i2p[q i2p[q Case 1: 8 i 2 P(jw i j m 1=8 ju i j _ ju i j m 1=8 jw i j). We can nd a subset P 0 P of size m=20 such that 8i 2 P 0 (jw i j m 1=8 ju i j) or 8i 2 P 0 (ju i j m 1=8 jw i j): 8

9 In the former case, (5) implies that max jw i j 30 min(jw i j + ju i j) i2p 0 i2p 0 30(1 + m?1=8 ) min jw i j i2p 0 60 min jw i j: i2p 0 Set M x = M y = P 0 and x the remaining variables such that exactly half of the x i 's and half of the y i 's are 0. Analogously, in the latter case we obtain max obtained as above. Case 2: Otherwise. i2p 0 ju i j 60 min ju i j. M x and M y are i2p 0 Then 9i 0 2 P (jw i0 j < m 1=8 ju i0 j^ ju i0 j < m 1=8 jw i0 j). We have for all i 2 Q: jw i j + ju i j 30(jw i0 j + ju i0 j) minf30jw i0 j(1 + m 1=8 ); 30ju i0 j(1 + m 1=8 )g: Thus we have max i2q jw i j 30(1 + m 1=8 ) min i2q jw i j and max i2q ju i j 30(1 + m 1=8 ) min i2q ju i j. Choose M x to be an arbitrary subsets of Q of size n 60, set M y = M x and x the remaining variables in the same fashion as before. If we perform the \regularization process" for all bottom gates of C n, then we obtain Lemma 3.4 There are sets M x ; M y f1; : : : ; ng of size m = n 60 k and there is an assignment A : fx i : i 62 M x g [ fy i : i 62 M y g! f0; 1g such that a. when values are assigned according to A, F m will be obtained as the corresponding subfunction of F n, and b. all level one gates of C n, when restricted to the free variables, are n 1=8 -regular. Proof. Apply Lemma 3.3 successively to each of the k level one gates of C n. Let M x be the set of indices of those variables x i which did not receive a value during the processing of all gates by Lemma 3.3. M y is dened analogously. A is the union of all partial assignments that have been made in this process. We write D n for the circuit that results from C n by the restriction of Lemma 3.4. Observe that D n computes the function F m (for m = n ). 60 k 9

10 3.2 The Likely Behavior of a Threshold Gate In this section we will exploit the result of our regularization process. In particular, in Lemma 3.7, we will show that, for the input distribution dened below, a weighted sum with small variance in weight sizes \almost" behaves as if all the weights were identical. For the integer s, 1 s m, set U(s) = f~x 2 f0; 1g m : random variable which assigns to each ~x 2 U(s) the value x i = sg. (s) is the w i x i ; all elements of U(s) are equally likely. Obviously E((s)) = s w i. m In the following, we will assume that the w i 's are either all positive or all negative. Proposition 3.5 Set W = w 2 i and g = maxf jw ij jw j j : 1 i; j mg. Then W m s 2 g2 E((s)) 2 : Proof. Set MIN = minfjw i j : 1 i mg. We get W 1 m (m2 g 2 MIN 2 ): (6) Also, E((s)) 2 = ( s m m w i ) 2 ( s m m MIN)2. Thus m 2 MIN 2 ( m s )2 E((s)) 2 : (7) If we replace m 2 MIN 2 in (6) according to (7), we get W m s 2 g2 E((s)) 2 : Proposition 3.6 V ar((s)) s m W. Proof. We have V ar((s)) = E((s) 2 )? E((s)) 2. Also, m s E((s) 2 ) = 1 m s ~x2u (s) = 1 ( m s = 1 ( ( ~x2u (s) w i 2 w i x i ) 2 w i2 x i + 2 ~x2u (s) 10 x i + 2 ~x2u (s) 1i<jm 1i<jm w i w j w i w j x i x j ) ~x2u (s) x i x j )

11 = s m m w i2 + 2s(s? 1) m(m? 1) 1i<jm w i w j : Furthermore, In summary, we obtain m s E((s)) 2 = ( 1 m s = ( 1 = ( s m = s2 m 2 ~x2u (s) m w i ~x2u (s) w i ) 2 w i2 + 2s2 m 2 w i x i ) 2 x i ) 2 1i<jm w i w j : V ar((s)) ( s m? s2 m m ) w 2 i2 s m W: Lemma 3.7 If maxf jw ij jw j j : 1 i; j mg = O(m1=8 ), then P r(j(s)? E((s))j je((s))j m 1=4 ) = O( m3=4 s ): Proof. By Chebyshev's inequality, we get for any t (t > 0) Thus, for t = je((s))j ), we obtain m 1=4 Proposition 3.6 implies P r(j(s)? E((s))j t) V ar((s) t 2 : P r(j(s)? E((s))j je((s))j m 1=4 ) V ar((s)) m1=2 E((s)) 2 : and with Proposition 3.5 V ar((s)) m 1=2 E((s)) 2 s W m1=2 m E((s)) 2 ; s W m 1=2 m E((s)) 2 m s g2 m?1=2 = O( m s m?1=4 ): 11

12 3.3 A Non-standard Application of the Discriminator Lemma Let G be some boolean threshold gate with weights w 1 ; : : : ; w m, u 1 ; : : : ; u m and threshold t. Set P m P w m i a := m and b := u i m : With G we can thus associate the two-dimensional threshold function ax + by t. Similarly, with F m we associate the two-dimensional function F : f0; : : : ; mg 2! f0; 1g, where F (x; y) = 1 if and only if (x m 2 ^ y < m 2 ) _ (x < m 2 ^ y m 2 ): Let L be the line ax + by = t in R 2 (where t is the threshold of G). Let x 0 (y 0 ) be the x-coordinate (y-coordinate) of the intersection of L and y = m (x = m). Set 2 2 x0 = 1 (y 0 = 1) if the line L is horizontal (vertical). We dene D(G) := minfjx 0? m 2 j; jy0? m 2 jg: and let U r be the uniform distribu- Proposition 3.8 Let r be an integer with 0 r m 2 tion over V r = f m? r; : : : ; m + rg. Then 2 2 P r UrUr[(x; y) 2 V 2 r j ax + by t ^ F (x; y) = 1g D(G) + 1 2r + 1 : Proof. Let be the area enclosed by the two lines ax + by = t and ax + by = (a + b) m 2. (The latter is the line through (m 2 ; m 2 ).) Intersect with the set Vr and call the 2 intersection r. Let us assume that D(G) = jx 0? mj. Then 2 r will contain at most D(G) + 1 points per row of Vr 2. Thus j r j (2r + 1) (D(G) + 1). On the other hand, the halfspace ax + by (a + b) m 2 the elements of the set f(x; y) 2 Vr 2 j F (x; y) = 1g: contains exactly one half of all Let us consider the case that all weights w i are identical and all weights u i are identical. If D(G) is \small", then G will not show any signicant advantage in predicting F for a subcollection of the m + 1 distributions U 2 r mentioned in Proposition 3.8. If on the other hand D(G) is large (say proportional to m), then we can trivialize G by choosing a distribution with a small value for r. Our goal is to carry out a similar argument for arbitrary gates G. Consequently we introduce a collection Q r of distributions over f0; 1g m with Q r (~x) = 8 >< >: (2r + 1) 0 ; if j P m x i? m 2 j > r 1 ; otherwise. m P m x i 12

13 Note that the probability of a string only depends on its number of ones. The appropriate value for the parameter r will be determined later. Finally we dene for the considered threshold gate G with input variables x 1 ; : : : ; x m, y 1 ; : : : ; y m, ADV r (G) := P r QrQr[G(~x; ~y) = 1jF m (~x; ~y) = 1]? P r QrQr[G(~x; ~y) = 1jF m (~x; ~y) = 0]: Lemma 3.9. Set m = n 60 k. Assume that the boolean threshold gate G with input variables x 1 ; : : : ; x m ; y 1 ; : : : y m is n 1=8 -regular, and that n is suciently large. Furthermore assume that the natural number r 2 [m 31=32 ; m 4 ] satises D(G) r=(64k) or D(G) 4r : Then jadv r (G)j 1 2k : Proof. Case 1: D(G) r. 64k We know that G is n 1=8 -regular. We proceed by examining the three dierent cases (see Denition 3.2.). Case 1.1: 8i (jw i j m 1=8 ju i j) and jw m j 60jw 1 j: This implies that jaj m 1=8 jbj. Hence the line L is very \steep". We have in this case, maxfx 2 [0; m] : 9y 2 [0; m] ((x; y) 2 L)g?minfx 2 [0; m] : 9y 2 [0; m] ((x; y) 2 L)g m 7=8 : Thus, the set fx 2 [0; m] : 9x 0 ; y 0 2 [0; m] (jx? x 0 j m 3=4 ^ (x 0 ; y 0 ) 2 Lg is contained in an interval of length m 7=8 + 2m 3= m 7=8. This implies that jf(x; y) 2 f m 2? r; : : : ; m 2 + rg2 : P (x; y)gj 3m 7=8 m = 3m 15=8 ; (8) where P (x; y) is equivalent to (ax + by < t) ^ 9(x 0 ; y 0 ) 2 [0; m] 2 ( (ax 0 + by 0 t) ^ (jx? x 0 j m 3=4 ) ): As a rst step towards estimating ADV r (G) we consider the set S = f(~x; ~y) 2 U : G(~x; ~y) = 1 ^ F m (~x; ~y) = 1g where U := f(~x; ~y) 2 f0; 1g 2m : m 2? r m x i ; y i m 2 + rg: 13

14 One shows that S is contained in the following two sets, S 1 = f(~x; ~y) 2 U : j w i x i? ( x i w i )=mj ( x i w i )=m 5=4 g and S 2 = f(~x; ~y) 2 U : F m (~x; ~y) = 1 ^ Q(~x; ~y)g, where Q(~x; ~y) is equivalent to 9~x 0 ; ~y 0 2 f0; 1g m (j x 0 i?? x i j m 3=4 ^ a x 0 i + b y 0 i t): Intuitively, the set S 1 consists of all those inputs (that are relevant for Q r ) on which our approximation of G by ax + by t fails. We will show later that this set has small probability. S 2 on the other hand is the collection of all relevant inputs on which the approximation (in a quite liberal sense) succeeds. j w i x i? a x i j ( x i jaj)=m 1=4 m 3=4 jaj: (9) We need to nd vectors ~x 0 ; ~y 0 according to the denition of set S 2. If a 0, we pick some ~x 0 such that U 3m=4 [ i=m=4 f0; 1g i. We then have with (9) a If a < 0, we pick some ~x 0 such that a x i? a m 3=4 = a x i + jaj m 3=4 x 0 i = x 0 i = Furthermore, we pick some vector ~y 0 with b x 0 i w i x i. x i + m 3=4. w i x i. This is possible, since x i? m 3=4. We then have a y 0 i x 0 i = u i y i according to the following procedure: if all components of u i are positive or all components are zero, then set ~y 0 = (1; : : : ; 1). Otherwise all components are negative and we set ~y 0 = (0; : : : ; 0). This concludes our proof of inclusion, since property Q(~x; ~y) holds for the pair (~x 0 ; ~y 0 ). It is obvious that P r QrQr[S 1 j F m (~x; ~y) = 1] 2 P r QrQr[S 1 ]: If we apply Lemma 3.7 for s 2 [ m? r; m + r] [ m; 3m], we obtain P r QrQr[S 1 ] = O(m?1=4 ). In order to give an upper bound on P r QrQr[S 2 ] we observe that S 2 S 3 [ S 4, where S 3 = f(~x; ~y) : (F m (~x; ~y) = 1) ^ (a S 4 = f(~x; ~y) : (F m (~x; ~y) = 1) ^ (a R(~x; ~y) is equivalent to 14 x i + b x i + b y i t)g and y i < t) ^ R(~x; ~y)g:

15 9~x 0 ; ~y 0 (j x i? It follows from Proposition 3.8 that x 0 ij m 3=4 ^ a x 0 i + b y 0 i t): Also, it is obvious that P r QrQr[S 3 j F m (~x; ~y) = 1] D(G) + 1 2r + 1 : P r QrQr[S 4 j F m (~x; ~y) = 1] 2 P r QrQr[S 4 ]: Furthermore, by (8), P r QrQr[S 4 ] 3 m15=8 : Thus we have 2 (2r + 1) P r QrQr[S j F m (~x; ~y) = 1] P r QrQr[S 1 [ S 3 [ S 4 j F m (~x; ~y) = 1] O(m?1=4 ) D(G) + 1 2r k + O(m?1=16 ): We will obtain the same upper bound for the probability of S 0 = f(~x; ~y) 2 U : G(~x; ~y) = 0 ^ F m (~x; ~y) = 1g: + 6m15=8 (2r + 1) 2 Thus, since P r QrQr[S j F m = 1] + P r QrQr[S 0 j F m = 1] = 1; we get jp r QrQr[S j F m (~x; ~y) = 1]? 1 2 j 1 64k + O(m?1=16 ): One shows analogously for T = f(~x; ~y) 2 U : G(~x; ~y) = 1 ^ F m (~x; ~y) = 0g that jp r QrQr[T j F m (~x; ~y) = 0]? 1 2 j 1 64k + O(m?1=16 ): Thus, jadv r (G)j 1 32k + O(m?1=16 ). Case 1.2: 8i (ju i j m 1=8 jw i j) and ju m j 60ju 1 j. The argument is analogous to Case 1.1. Case 1.3: jw m j 30(1 + n 1=8 )jw 1 j and ju m j 30(1 + n 1=8 )ju 1 j. We rst observe that the set S is contained in the union of the sets S 0 1 and S 0 2, where S 0 1 = f(~x; ~y) 2 U : P 0 (~x; ~y)g; and P 0 (~x; ~y) is equivalent to 15

16 (j w i x i? a x i j jaj P m x i m 1=4 ) _ (j u i y i? b S 0 2 = f(~x; ~y) 2 U : F m (~x; ~y) = 1 ^ Q 0 (~x; ~y)g, and Q 0 (~x; ~y) is equivalent to j 9~x 0 ; ~y 0 2 f0; 1g m (j y 0 i? y i j m 3=4 x 0 i?? ^ a x i j m 3=4 ^ x 0 i + b y 0 i y i j jbj P m y i m 1=4 ); t): Lemma 3.7 implies that P r QrQr[S 0 1 jf m (~x; ~y) = 1] 2 P r QrQr[S 0 1] = O(m?1=4 ): With an argument analogous to Case 1.1 we get S 0 2 S 3 [ S 0 4 where S 0 4 = f(~x; ~y) : (F m (~x; ~y) = 1) ^ (a R 0 (~x; ~y) is equivalent to j y i? 9~x 0 ; ~y 0 (j We have already shown that Furthermore, it is obvious that y 0 ij m 3=4 x i? ^ a x i + b x 0 ij m 3=4 ^ x 0 i + b y i < t) ^ R 0 (~x; ~y)g: y 0 i t): P r QrQr[S 3 j F m (~x; ~y) = 1] D(G) + 1 2r + 1 : P r QrQr[S 0 4 j F m (~x; ~y) = 1] 4 m m3=4 (2r + 1) 2 : The remaining argument is now analogous to Case 1.1. Case 2: D 4r. The analysis is now far simpler. The probability of the set S 1 (resp. S 0 1) is computed as before. As for S 3 we now get For S 4 we obtain P r QrQr[S 3 j F m (~x; ~y) = 1] 2 f0; 1g: P r QrQr[S 4 j F m (~x; ~y) = 1] = 0: The same applies to S4. 0 This follows, since the set U will be entirely contained in one of y i = tg. the halfspaces of f(~x; ~y) : a P m x i + b P m 16

17 In order to prove Theorem 3.1 we observe that for suciently large n we can nd r such that for each of the at most k gates G on level one of D n : D(G) r 64k or D(G) 4r: (A value for r can be found whenever k is bounded from above by the number of possible \r-intervals". This is the case, provided k c log k (m) for a suitably small constant c. This in turn is satised for k d for a suitably small constant d.) log n log log n The "-Discriminator Lemma of [HMPST] can be generalized to hold for any distribution over the input space. We apply it here to the distribution Q r Q r over the input space f0; 1g 2m of the circuit D n (which computes the function F m ). Since the weights of the gate on level two of D n are from f?1; 1g, we get jadv r (G)j 1 for some gate G on level one of D k n. But this contradicts Lemma 3.5. Thus we get a lower bound of ( log n ) for the size of depth 2 threshold circuits (with log log n weights from f?1; 1g for the top gate) computing F n. For unrestricted threshold circuits log log n our lower bound will be ( ) ([M],[MT]). log log log n Remark 3.10 It is not possible to prove Theorem 3.1 with the customary version of the "-Discriminator Lemma, where one considers the uniform distribution over the input space. Consider for example the threshold gate G dened by G(x 1 ; : : : x n ; y 1 ; : : : y n ) = 1, n x i? n y i c p n: For appropriate c one has ADV (G) = (1) (where ADV (G) is dened like ADV r (G), but with regard to the uniform distribution over f0; 1g 2n ). This happens, because a \large discrepancy" in x-sum and y-sum is more likely if we assume P n x i P n and 2 n y i n than if we assume (say) P n 2 x i n and P n 2 y i n. This phenomenon has 2 been independently observed by Bultman [B]. Corollary 3.7The class of boolean functions computable by constant size boolean threshold circuits of depth 2 with integer weights of polynomial size is properly contained in the class of boolean functions computable by constant size -circuits of depth 2 with polynomial size rational weights (even with common polynomial size denominator) and separation 1. poly The same statement holds if one considers arbitrary real weights for both types of circuits (still with separation 1 ). poly Proof. It is quite easy to simulate boolean threshold circuits of size s and constant depth d by sigmoid threshold circuits of the same size and depth. The containment is proper as a consequence of Theorems 3.1 and

18 4 Simulation Results and Separation Boosting T Cd() 0 is the class of those families (g n j n 2 N) of boolean functions that are computable, 1 with separation ( ), by polynomial size, depth d -circuits whose weights are reals poly(n) of absolute value at most poly(n). T Cd 0 ([HMPST]) is the corresponding class of families of boolean functions computable by polynomial size, depth d boolean threshold circuits whose weights are polynomial size integers. Theorem 4.1 Let : R! [0; 1] be a nondecreasing function that is Lipschitz-bounded and converges fast to 0 (resp. 1) in the following sense: Then the following holds. 9 " > 0 9 x 0 > 0 8 x x 0 (?x) 1 x " ^ 1? (x) 1 x " : (a) For every d 2 N, T C 0 d = T C 0 d(). (b) The class T C 0 d() does not change if we demand separation (1). Observe, that the above class of functions also includes the standard sigmoid. Proof. Assume that (g n jn 2 N) is a family of boolean functions in T Cd(). 0 Thus (g n jn 2 N) can be computed with separation 1 by some family (C p(n) n jn 2 N) of - circuits of depth d with the number of gates and the size of weights bounded by q(n) (for some polynomials p and q). Since is Lipschitz-bounded, and since the depth d of C n is a constant, there exists a polynomial r(n) with the following property: If the gate function of each gate G in C n is replaced by some arbitrary function G : R! R (where the functions G may be dierent for dierent gates G) such that 8x 2 R(j(x)? G (x)j 1 r(n) ); then for each input x 1 ; : : : ; x n of C n the value of the output gate of the new circuit diers from the value of the output gate of C n by at most 1. 2p(n) In order to construct a boolean threshold circuit Cn b that computes g n, one replaces in C n each internal -gate that outputs ( P m j=1 j y j? ) for inputs y 1 ; : : : ; y m 2 [0; 1] (with reals 1 ; : : : ; m ; of polynomial size in n) by a weighted sum S(~y) := l k=1 1 2r(n) H k(y 1 ; : : : ; y m ) of l := 2r(n) boolean threshold gates H 1 ; : : : ; H l (which use the same weights 1 ; : : : ; m as G). The function S is chosen to be a step function which approximates such that for all y 1 ; : : : ; y m 2 [0; 1], j( j y j? )? S(~y)j 1 2r(n) : j=1 18

19 In a second step, one replaces each of the boolean threshold gates H k by a boolean threshold gate H 0 k whose weights and thresholds are integers of polynomial size. We set S 0 (~y) := l k=1 The threshold gates H 0 k are chosen such that 1 2r(n) H0 k(y 1 ; : : : ; y m ): 8y 1 ; : : : ; y m 2 [0; 1](jS(~y)? S 0 (~y)j 1 2r(n) ): Let C 0 n be the circuit that results from C n by replacing in the described manner each internal -gate in C n by an array of boolean threshold gates Hk. 0 For every input, the value of the output gates of C n and C 0 1 n dier by at most. Hence we can replace 2p(n) the output gate of C 0 n by a boolean threshold gate with integer weights and threshold of polynomial size such that the resulting boolean threshold circuit Cn b computes g n. This shows that (g n jn 2 N) 2 T Cd. 0 In order to prove the other inclusion assume that (g n jn 2 N) 2 T Cd 0 is computed by a family (B n jn 2 N) of boolean threshold circuits of depth d, where B n has at most p(n) gates and its weights and thresholds are integers of absolute value at most q(n) (for some polynomials p and q with p(n) q(n) 2 for all n 2 N). Without loss of generality, we assume that for each circuit input the weighted sum at each gate in B n has distance at least 1 from its threshold (if this is not the case, rst multiply all weights and thresholds of gates in B n by 2, and then lower each threshold by 1). In addition we assume for simplicity that x 0 = 1 in the assumption about. By the assumption about there exists some l 2 N such that 8 x 1((?x l ) 1 x and 1? (xl ) 1 x ): Let B 0 n be a boolean threshold circuit that results from B n by multiplying rst all weights and thresholds of gates in B n by 2[2p(n)q(n)] l. It is obvious that B 0 n also computes the boolean function g n. In addition, for each circuit input the weighted sum at each gate in B 0 n has distance at least 2[2p(n)q(n)] l from its threshold. Let C n be the -circuit that results if we replace each boolean threshold gate in B 0 n by a -gate with the same weights and threshold. Then one shows by induction on the depth of a gate G in C n that for every boolean circuit input, the output of G diers by at most n from the output of the corresponding gate in B 0 n, where n := 1 2p(n)q(n). In the induction step one exploits that p(n) q(n) 2 [2p(n)q(n)] l n = [2p(n)q(n)] l : This implies that a change of at most n in each of the at most p(n) inputs of G causes a change of at most [2p(n)q(n)] l in the value of the weighted sum that reaches G, and so this weighted sum has distance at least 2[2p(n)q(n)] l? [2p(n)q(n)] l = [2p(n)q(n)] l 19

20 from the threshold. Therefore the output value of the -gate G diers by at most max n 1? ([2p(n)q(n)] l ); (?[2p(n)q(n)] l ) o 1 2p(n)q(n) = n from the output of the corresponding boolean threshold gate in B 0 n. The preceding argument implies that for any n 2 the -circuit C n threshold 1 computes the boolean function g 2 n with separation 1. 4 with outer Remark 4.2 One can also simulate polynomial size -circuits with weights of absolute value at most 2 poly(n) by polynomial size boolean threshold circuits with 0-1 weights; however in this case the circuit depth increases by a constant factor. This simulation can be extended to the case of real-valued inputs, where we assume that polynomially many bits of each real input are given as inputs to the simulating boolean threshold circuit. Remark 4.3 More recently it has been shown (see the paper by Maass in this volume, or the extended abstract [M]) that for neural nets with arbitrary piecewise polynomial activation functions (with polynomially many polynomial pieces of bounded degree) and arbitrary real weights, the class of boolean functions that can be computed in constant depth and polynomial size (with arbitrarily small separation) is contained in T C 0. 20

21 References [B] [DS] [HKP] [HMPST] [K] [M] [MSS] [MT] W.J. Bultman, \Topics in the Theory of Machine Learning and Neural Computing", Ph.D Dissertation, University of Illinois at Chicago, B. DasGupta and G. Schnitger, \The power of Approximating: a Comparison of Activation Functions", in Advances in Neural Information Processing Systems 5, S.E. Hanson, J.D. Cowan and C.L. Giles eds, pp , J. Hertz, A. Krogh and R.G. Palmer, \Introduction to the Theory of Neural Computation", Addison-Wesley, Redwood City, A.Hajnal, W. Maass, P. Pudlak, M. Szegedy and G. Turan, \Threshold Circuits of Bounded Depth", Proc. of the 28th Annual Symp. on Foundations of Computer Science, pp , To appear in J. Comp. Syst. Sci.. C.C. Klimasauskas, \The 1989 Neuro-Computing Bibliography", MIT Press, Cambridge, W. Maass, \Bounds for the computational power and learning complexity of analog neural nets", Proc. of the 25th ACM Symposium on the Theory of Computing, 1993, W. Maass, G. Schnitger, E. D. Sontag, \On the computational power of sigmoid versus boolean threshold circuits", Proc. of the 32nd Annual IEEE Symp. on Foundations of Computer Science, 1991, W. Maass and G. Turan, \How Fast Can a Threshold Gate Learn?", in: Computational Learning Theory and Natural Learning Systems: Constraints and Prospects, G. Drastal, S.J. Hanson and R. Rivest eds., MIT Press, to appear. [Mu] S. Muroga, \Threshold Logic and its Applications", Wiley, New York, [S] E.L. Schwartz, \Computational Neuroscience", MIT Press, Cambridge, [So] [SS1] [SS2] E.D. Sontag, \Feedforward Nets for Interpolation and Classication", J. Comp. Syst. Sci., 45(1992): E.D. Sontag and H.J. Sussmann, \Backpropagation can give rise to spurious local minima even for networks without hidden layers", Complex Systems 3, pp , E.D. Sontag and H.J. Sussmann, \Backpropagation separates where Perceptrons do", Neural Networks, 4, pp , [WK] S.M. Weiss, C.A. Kulikowsky, \Computer Systems that Learn", Morgan Kaufmann, San Mateo,

Analog Neural Nets with Gaussian or other Common. Noise Distributions cannot Recognize Arbitrary. Regular Languages.

Analog Neural Nets with Gaussian or other Common Noise Distributions cannot Recognize Arbitrary Regular Languages Wolfgang Maass Inst. for Theoretical Computer Science, Technische Universitat Graz Klosterwiesgasse