On Topology, Size and Generalization of Non-linear Feed-Forward Neural Networks

Stephan Rudolph
Institute for Statics and Dynamics of Aerospace Structures
University of Stuttgart, Pfaffenwaldring 27, D-70569 Stuttgart, Germany

Published in: Neurocomputing, vol. 16, no. 1 (July 1997), pp. 1-22.

Abstract. The use of similarity transforms in the design and the interpretation of feed-forward neural networks is proposed. The method is based on the so-called Buckingham-Theorem or Pi-Theorem and is valid for all neural network function approximation problems which belong to the class of dimensionally homogeneous equations. The new design method allows the a priori determination of a minimal topology size of the first and last network layer. Finally, the correct and unique pointwise generalization capability of the new so-called similarity network topology is proved and illustrated using two examples.

Keywords: Pi-Theorem, similarity transforms, similarity functions, dimensional homogeneity, neural network generalization, neural network topology.

1 Introduction

The potential of feed-forward neural networks to approximate the functional relationship g implicitly encoded in a certain number of p training patterns $\{x_1, \ldots, x_n\}_p$ has originated much research in the understanding, the training and the generalization performance of neural networks [23, 22, 33, 34, 37]. Usually such feed-forward neural networks are composed of $k$ layers with up to $j$ summation units $s_{j,k}$ which sum their inputs $x_{i,k}$. Each input is hereby multiplied by an adjustable weight $w_{ij,k}$. The summation result $s_{j,k}$ with

$$s_{j,k} = \sum_i w_{ij,k} x_{i,k} + w_{0,k} \qquad (1)$$

is then propagated through a non-linear function $h(s_{j,k})$ to the output, which serves as input for the nodes of the next network layer. Often non-linear functions of the type

$$h(s_{j,k}) = \frac{e^{s_{j,k}}}{1 + e^{s_{j,k}}} \qquad (2)$$

are chosen, which are commonly referred to as sigmoidal functions [33, 34]. The computational power attributed to these networks originates mainly from these non-linear functions $h(s_{j,k})$ of the weighted sums, since the limitations of linear neural network models are now well understood and documented in the literature [21, 24, 33]. On the other hand, it is mainly this non-linearity which makes any deeper mathematical analysis of the network properties and performance very difficult [20, 34]. Neural network research has therefore mainly concentrated on the establishment of certain classes of non-linear multi-layered feed-forward neural networks for which theorems and theoretical bounds for the approximation properties can be stated [1, 2, 7, 15, 36]. But until today no general theory has been presented for the a priori topology design, the explanation of the generalization properties and the a posteriori interpretation of non-linear multi-layered feed-forward neural networks.
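As a quick illustration of equations (1) and (2), the following minimal Python sketch (not part of the original paper; the inputs, weights and bias are purely illustrative values) evaluates one summation unit followed by the sigmoidal transfer function.

```python
import numpy as np

def unit_output(x, w, w0):
    """One summation unit: the weighted sum of eq. (1) followed by the
    sigmoidal transfer function of eq. (2)."""
    s = np.dot(w, x) + w0                  # s = sum_i w_i * x_i + w_0
    return np.exp(s) / (1.0 + np.exp(s))   # h(s) = e^s / (1 + e^s)

# purely illustrative inputs and weights
print(unit_output(np.array([0.2, -1.0, 0.5]), np.array([0.4, 0.1, -0.3]), 0.05))
```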

1.1 Motivation

The search for a general theory for the design of feed-forward neural networks means that one seeks additional constraints to reduce the class of all possibly imaginable functional relationships g to a smaller subclass f, for which certain general properties can be derived. Inherently such additional constraints are difficult to select a priori, since the form of g is naturally unknown when using feed-forward neural networks to approximate the functional relationship g encoded in a set of p training patterns $\{x_1, \ldots, x_n\}_p$, as shown in Fig. 1.

[Figure 1: Classical neural network: the inputs $(x_1, x_2, \ldots, x_{n-1})$ are mapped by the approximation function g onto the output $x_n$.]

It is evident that a too weak assumption upon the unknown function g may not result in a very significant restriction of the remaining function class properties of f, while a too strong assumption may reduce the range of validity to a too narrow, possibly unimportant function class f. Ideally, the assumption should also be immediately justifiable by inspection of the a priori available knowledge, e.g. from the available set of training patterns $\{x_1, \ldots, x_n\}_p$ only. A well suited a priori assumption in all technical applications of feed-forward neural networks, made without loss of generality, is the restriction of all possible neural network functions g to the class of all dimensionally homogeneous functions f. Therefore, in later parts of the paper the terms dimensional homogeneity, dimensionally homogeneous equation, dimension, dimensionless product, similarity transform and similarity function will be of major importance for the understanding of the subsequent derivations using the Pi-Theorem [5, 6]. The definitions and usage of these terms in the natural sciences are recalled for this reason in the following paragraph.

1.2 Definition of terms

The term dimensional homogeneity simply means that in any equation $f(x_1, \ldots, x_n) = 0$ (this is the implicit notation of $x_n = f(x_1, \ldots, x_{n-1})$ only, where $x_n$ is commonly denoted as $y$) the functional relationship of the physical variables $x_1, \ldots, x_n$ has to apply to the physical dimensions of the variables (usually expressed in SI-units) as well (e.g. from $[\mathrm{Nm}] = [\mathrm{kg}]([\mathrm{m}]/[\mathrm{s}])^2$ it follows that $E = mc^2$ fulfills the dimension check and is possibly correct, if the quantitative validity of the equation can also be shown). This means that any physical dimension (i.e. an SI-unit like [kg]) cannot be created from or disappear into the void. This so-called principle of dimensional homogeneity guarantees that in every possible and correct physical equation the dimensions on the left hand side of the equal sign are always identical to those on the right hand side. All (the known as well as the still unknown) equations in physics therefore belong to the so-called class of dimensionally homogeneous equations. The term dimension thus has multiple meanings in mathematics and physics. A vector $x = \{x_1, x_2, x_3\}^T$ is called a 3-dimensional vector since it has 3 components. If the vector components are the three physical variables of the previous example (i.e. $x = \{m, c, E\}^T$), each of the vector components also has physical dimensions. The term dimensionless product stands for a special class of monomial expressions of the form $\Pi_j = x_j \prod_i x_i^{\alpha_{ji}}$ which have no physical dimensions (i.e. are dimensionless) and are formed out of a (sub)set of physical variables $x_i, x_j$.
These physical variables $x_i, x_j$ are again elements of the set of variables in the very same existing dimensionally homogeneous equation $f(x_1, \ldots, x_n) = 0$. The monomial expression $\Pi_j$ can also be interpreted as a mapping $x_i, x_j \mapsto \Pi_j$ which belongs to the class of so-called similarity transforms. A dimensionless function $F(\Pi_1, \ldots, \Pi_m) = 0$ in these variables $\Pi_j$ is also called a similarity function. The principle of dimensional homogeneity is purely epistemologically based and is the axiomatic foundation of group theoretic methods in mathematical physics [3, 4]. Its philosophical foundation is used in the validation of any theoretical model building in physics and engineering. The general validity of this principle has led to the establishment of the commonly known statement that one cannot compare apples with oranges (i.e. 5 [apples] $\neq$ 5 [oranges]).

In all natural sciences it is therefore generally agreed that any dimensionally non-homogeneous model cannot be correct [6, 9, 11]. The principle of dimensional homogeneity must therefore always be observed.

2 Theoretical Foundation

This section introduces without proof the so-called Buckingham- or Pi-Theorem.

Pi-Theorem [5, 6]. From the existence of a dimensionally homogeneous and complete equation f of n physical quantities $x_i$, the existence of an equation F of only m dimensionless quantities $\Pi_j$ can be shown:

$$f(x_1, \ldots, x_n) = 0 \qquad (3)$$

$$F(\Pi_1, \ldots, \Pi_m) = 0 \qquad (4)$$

where $r = n - m$ is the rank of the dimensional matrix constructed from the $x_i$, and with dimensionless quantities $\Pi_j$ of the form

$$\Pi_j = x_j \prod_{i=1}^{r} x_i^{\alpha_{ji}} \qquad (5)$$

with $j = 1, \ldots, m \in \mathbb{N}$ and $\alpha_{ji} \in \mathbb{R}$ as constants.

The implicit form $f(x_1, \ldots, x_n) = 0$ includes here all explicit notations $x_n = f(x_1, \ldots, x_{n-1})$, since any explicit equation can always be written in an implicit form. Furthermore, modern proofs of the Pi-Theorem [9, 11] impose no special assumptions on the specific nature of the operator f. It can be proved that every dimensionally homogeneous equation f in physics can be subjected to the Pi-Theorem (i.e. algebraic equations, differential equations, integro-differential equations, and so on).

2.1 Dimensional Matrix

The so-called dimensional matrix associated with each set of p training patterns $x_1, \ldots, x_n$ is shown on the left hand side of Fig. 2. This dimensional matrix has n rows for the variables $x_i$ and up to k columns for the representation of the dimensional exponents $e_{ij}$ of the variables $x_i$ in the k base dimensions $s_k$ of the employed unit system. In the currently known SI-unit system seven dimensions (mass, length, time, temperature, current, amount of substance and intensity of light) are distinguished, thus $k \leq 7$. Further examples of dimensional matrices are given in Figs. 4 and 16. To calculate the dimensionless products $\Pi_j$ in equation (5), the dimensional matrix of the pattern $x_1, \ldots, x_n$ as shown on the left hand side of Fig. 2 needs to be created. By rank preserving operations the upper diagonal form of the dimensional matrix as shown on the right hand side of Fig. 2 is obtained.

[Figure 2: Definition of the dimensional matrix: left, the original matrix with rows $x_1, \ldots, x_n$, columns $s_1, \ldots, s_k$ and entries $e_{ij}$; right, the transformed matrix with the row groups $x_1, \ldots, x_r$ (r rows) and $x_{r+1}, \ldots, x_{r+m}$ (m rows), columns $s_1, \ldots, s_r$, an upper diagonal part and the exponents $\alpha_{ji}$ in the lower, hatched part.]

This means that either multiples of matrix columns may be added to each other or that matrix rows may be interchanged. The unknown exponents $\alpha_{ji}$ of the dimensionless products in equation (5) are then automatically determined by negation of the values of the resulting matrix elements in the hatched part of the matrix on the lower right hand side of Fig. 2. (Note: with respect to equation (5) the index $j = 1, \ldots, m$ of the variables $x_j$ refers hereby to the group of variables $x_i$ with $i = r+1, \ldots, r+m$, i.e. $x_{r+1}, \ldots, x_{r+m}$ as shown in the hatched part of the dimensional matrix in Fig. 2, right. This index transform greatly simplifies the notation of the dimensionless groups and is used consistently in the following.) The addition of matrix columns which contain the physical dimension exponents $e_{ij}$ to one another means that the dimensions of the original dimensional representation system $s_1, \ldots, s_k$ will be combined with each other by multiplication.
This signifies that the dimensional representations of the variables $x_1, \ldots, x_n$ will be transformed into a representation in another, equivalent dimensional representation system $s_1, \ldots, s_r$, with $r \leq k$ [28, 29]. A physical interpretation of such a representation change is given in the next section.
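The construction of section 2.1 can be mechanized. The sketch below is an illustration added here, not part of the original paper; it uses the fact that an exponent vector $a$ with $D^T a = 0$ defines a dimensionless product $\prod_i x_i^{a_i}$, so the null space of the transposed dimensional matrix yields one valid, although not necessarily unique, set of exponent vectors. The matrix shown is the bending-bar matrix of Fig. 4 (section 2.2); sympy is assumed to be available.

```python
import sympy as sp

# Rows of the dimensional matrix D hold the exponents of each variable in the
# k base dimensions; an exponent vector a with D^T a = 0 defines a
# dimensionless product prod_i x_i^(a_i).  The matrix below is the
# bending-bar matrix of Fig. 4 (columns [M], [L], [T]).
D = sp.Matrix([
    [1,  1, -2],   # P
    [0,  1,  0],   # l
    [1, -1, -2],   # E
    [0,  4,  0],   # I
    [0,  1,  0],   # u
])

n, k = D.shape
r = D.rank()
m = n - r                       # number of dimensionless products (Pi-Theorem)
exponents = D.T.nullspace()     # one exponent vector per dimensionless product
print(f"n={n}, r={r}, m={m}")   # -> n=5, r=2, m=3
for a in exponents:
    print(list(a))              # exponents over (P, l, E, I, u)
```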

2.2 Dimensionless Products

A practical example of the derivation of dimensionless products $\Pi_j$ from a dimensional matrix is shown in the following. The simple example of a bending truss bar is used to illustrate the theoretical concept of the Pi-Theorem introduced above and to show the straightforward transformation of a dimensionally homogeneous function f into a dimensionless function F. The bending of a truss bar of length l, with a material of Young's modulus E and a cross sectional moment I, exhibits a deflection u under a load P. This is shown in Fig. 3.

[Figure 3: Bending of a truss bar of length l, Young's modulus E and cross sectional moment I under a load P with deflection u.]

Since the underlying differential equation of linear bending theory is solvable, a closed analytical solution in the form of f exists and is well known in the literature [19, 38] to be equal to

$$f(l, E, P, I, u) = 0 \qquad (6)$$

with

$$u = \frac{P l^3}{3 E I} \qquad (7)$$

Ignoring for a moment the final result in equation (7) and starting only with the knowledge of the physical dimensions of the variables $l, E, P, I, u$ in the relevance list of equation (6), the following dimensional matrix can immediately be established. In Fig. 4, left, the dimension exponents of the physical variables are given in a ([M]ass, [L]ength, [T]ime)-system.

[Figure 4: Dimensional Matrix Computations]

Variable   SI-units        [M] [L] [T]        [F] [L]
P          [kg m/s^2]       1   1  -2    =>    1   0
l          [m]              0   1   0          0   1
E          [kg/(m s^2)]     1  -1  -2          1  -2
I          [m^4]            0   4   0          0   4
u          [m]              0   1   0          0   1

By adding multiples of the matrix columns to each other on the left hand side of Fig. 4, the modified dimensional matrix on the right hand side of Fig. 4 is obtained. There the dimensional representations of the physical variables in the former ([M]ass, [L]ength, [T]ime)-system have now been transformed into their equivalent dimensional representation in a ([F]orce, [L]ength)-system [28, 29]. (Note: in similarity theory it can be shown that there exists an infinite number of equivalent dimensional representation systems which can be transformed into one another by simple matrix operations [11]. The derivation of this proof lies however outside the main scope of this paper. The choice of a set of k fundamental dimensions to describe n physical variables can be seen as analogous to the choice of a set of k linearly independent base vectors of a linear vector space. If the original base vectors are then replaced by a linear combination of the latter to form a new vector base, the coordinate representations of the n vectors described in the original base system will change accordingly. This is exactly what is shown by the operations in the dimensional matrix in Fig. 4. A representation change may thus lead to a more or less dense coordinate description, according to a more or less appropriate choice of the base vectors.) Concerning the right hand side of Fig. 4, the third column (i.e. [T]ime) has been omitted, since it contains only zeros. Since the shown bending problem is static, it requires no explicit modeling of time. The rank of both dimensional matrices (in Fig. 4 left as well as in Fig. 4 right) is $r = 2$, so from $n = 5$ physical variables only $m = n - r = 3$ dimensionless products

$$\Pi_1 = E P^{-1} l^2 = \frac{E l^2}{P} \qquad (8)$$

$$\Pi_2 = I l^{-4} = \frac{I}{l^4} \qquad (9)$$

$$\Pi_3 = u l^{-1} = \frac{u}{l} \qquad (10)$$

are obtained, as guaranteed by the Pi-Theorem. According to the definitions in Fig. 2 and equation (5), the values of the $\alpha_{ji}$ can be determined by visual inspection of the coefficients of the lower part of the modified dimensional matrix; compare also Fig. 2 and Fig. 4.
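As a quick check (an illustration added here, not taken from the paper), the following Python snippet verifies that the three products of equations (8)-(10) are indeed dimensionless by summing the weighted dimension exponents of Fig. 4:

```python
import numpy as np

# Dimension exponents in the [M], [L], [T] system of Fig. 4 (left).
dims = {"P": [1, 1, -2], "l": [0, 1, 0], "E": [1, -1, -2], "I": [0, 4, 0], "u": [0, 1, 0]}

def dimension_of_product(exponents):
    """Sum of the variable dimension vectors weighted by the monomial exponents."""
    return sum(e * np.array(dims[name]) for name, e in exponents.items())

# Exponents of the products of eqs. (8)-(10): Pi1 = E l^2 / P, Pi2 = I / l^4, Pi3 = u / l.
for name, expo in {"Pi1": {"E": 1, "l": 2, "P": -1},
                   "Pi2": {"I": 1, "l": -4},
                   "Pi3": {"u": 1, "l": -1}}.items():
    print(name, dimension_of_product(expo))   # each prints [0 0 0]: dimensionless
```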
Taking now a look back at the exact solution in the form of equation (7), an algebraic manipulation of elementary calculus (multiplication of the right hand side of equation (7) by unity in the form of the factor $l^2/l^2$) leads to

$$\frac{u}{l} = \frac{1}{3} \left( \frac{P}{E l^2} \right) \left( \frac{l^4}{I} \right) \qquad (11)$$

Substituting now equations (8), (9) and (10) into equation (11) yields the dimensionless equation

$$\Pi_3 = \frac{1}{3 \, \Pi_1 \Pi_2} \qquad (12)$$

Equation (12) is a practical example of the fact that every complete and dimensionally homogeneous equation of n physical variables can be written in the form of a dimensionless equation of its $m = n - r$ dimensionless groups (i.e. the dimensionless products). The dimensionless products can thus be interpreted as the necessary and sufficient building blocks of the correct solution. This is stated by the Pi-Theorem and written in the general form of equations (3), (4) and (5).

2.3 Relevance to Neural Networks

The above considerations of dimensional homogeneity have multiple conceptual consequences for the design and the generalization properties of non-linear multi-layered feed-forward neural networks. In the following it is described how the Pi-Theorem is used for the topology design and for the proof of generalization of non-linear multi-layered feed-forward neural networks. This will be discussed in direct comparison to the classical neural network approach, where the principle of dimensional homogeneity is ignored. The current approach to the problem of function approximation by neural networks as shown in Fig. 1 is considered first.

[Figure 5: Original pattern data: training examples $e_1, \ldots, e_p$ with the numerical values $x_{1p}, \ldots, x_{5p}$ of the variables P, l, E, I, u.]

Classically, a set of p numerical training pattern data as shown in Fig. 5 is used in the training of the neural network to approximate the unknown functional relationship g encoded in the patterns $x_{1,p}, \ldots, x_{n,p}$, which has in the current example with $n = 5$ the form $g(P, l, E, I, u) = 0$. From the data in Fig. 5, as well as from the knowledge of the exact analytical solution from linear bending theory in equation (7), it is quite clear that the approximation problem during the training phases of the neural network consists in the identification of the correct n-dimensional hyper-surface $g(x_1, \ldots, x_n) = 0$ with $n = 5$. This is shown in Fig. 6, which represents the computation sequence inside such a neural network with five adjustable weights $w_0, w_1, w_2, w_3$ and $w_4$.

[Figure 6: Computation sequence for g: the inputs (P, l, E, I) are combined via the weights $w_0, \ldots, w_4$ into the output u.]

According to the definitions in equation (1), this network sums the weighted logarithms of its inputs in s and propagates the sum through the exponential $h(s) = e^s$ as output function. The topology of this network is shown in Fig. 7.

[Figure 7: Neural network topology for g: input nodes P, l, E, I with weights $w_1, w_2, w_3, w_4$, a bias $w_0$, a summation node s and the output u.]

According to equation (7) the correct solution after training should be $w_1 = 1$, $w_2 = 3$, $w_3 = -1$, $w_4 = -1$ and $w_0 = 1/3$ for the neural network to generalize correctly. In this context it is important to note that in classical neural network function approximation the weights $w_0, w_1, w_2, w_3, w_4$ are initialized at random before the training and are iteratively updated according to the employed learning rule. This however means that the initial neural network state, as well as most of the intermediate neural network states, encodes dimensionally non-homogeneous functions.
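To make the computation sequence of Figs. 6 and 7 concrete, here is a minimal Python sketch (added for illustration; the numerical values are hypothetical, and the treatment of the bias constant $w_0$ as passing through the logarithm, so that it acts as a multiplicative factor, is an assumption made for this sketch in line with the node description given for Fig. 11 below):

```python
import numpy as np

def g_network(P, l, E, I, w, w0):
    """Forward pass of the monomial network of Figs. 6/7: the bias w0 and the
    inputs pass through a logarithm, are summed with weights, and the sum is
    exponentiated, i.e. u = w0 * P^w1 * l^w2 * E^w3 * I^w4."""
    s = np.log(w0) + np.dot(w, np.log([P, l, E, I]))
    return np.exp(s)

# Hypothetical input values; with w = (1, 3, -1, -1) and w0 = 1/3 as demanded
# by eq. (7), the network output coincides with the analytical deflection.
P, l, E, I = 5.0e3, 2.0, 2.1e11, 3.0e-6
u_net = g_network(P, l, E, I, w=np.array([1.0, 3.0, -1.0, -1.0]), w0=1.0 / 3.0)
print(u_net, P * l**3 / (3.0 * E * I))   # the two values agree
```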

To highlight this fact, the condition of dimensional homogeneity of the current weights in the neural network of Figs. 6 and 7 is written as

$$[u] = [P]^{w_1} [l]^{w_2} [E]^{w_3} [I]^{w_4} \qquad (13)$$

where $[u], [P], \ldots$ denote the dimensional representations of the variables $u, P, \ldots$. From the general equation (13) one obtains two equations in the two dimensions [F]orce and [L]ength used to represent the variables in the dimensional matrix. This leads to a linear equation system in the weights:

$$\text{in } [F]: \quad w_1 + w_3 = 0 \qquad (14)$$

$$\text{in } [L]: \quad w_2 - 2 w_3 + 4 w_4 = 1 \qquad (15)$$

since the dimensional representation of u is $[u] = [F]^0 [L]^1$. This means that all numerical values of the weights $w_1, \ldots, w_4$ in Figs. 6 and 7 which do not satisfy equations (14) and (15) do not represent dimensionally homogeneous equations. In the context of universal function approximation by neural networks, many neural network states (i.e. the randomly initialized neural network weights as well as most of the intermediate neural network states during the training) thus violate the principle of dimensional homogeneity (here equations (14) and (15)) and encode dimensionally non-homogeneous functions g. The neural network is then in a physically illegal and meaningless state, regardless of whether by accident some neural network responses might be numerically correct (e.g. numerically 5 [apples] are equal to 5 [oranges]). This is shown in Fig. 8.

[Figure 8: Properties of the function classes g and f.]

This means that the sufficient search space of all possible physical solutions f is artificially enlarged to the set of all dimensionally non-homogeneous functions g, which cannot represent physically correct solutions by definition. By ignoring the property of dimensional homogeneity, the original problem of correct function approximation might thus even have worsened, especially in such cases where only very few training patterns are known, since the now added, purely numerically correct solutions of the form 5 [apples] = 5 [oranges] are in the training numerically indistinguishable from the numerically and dimensionally correct solutions of the form 5 [apples] = 5 [apples]. Taking therefore the a priori property of dimensional homogeneity of the unknown and sought after function f into account, the dimensionless groups can be determined from the dimensional matrix as shown in the previous example of Fig. 4. This means that every data point in $x_1, \ldots, x_n$ corresponds to a data point in $\Pi_1, \ldots, \Pi_m$. In Fig. 9 the numerical values of this mapping for the data points in Fig. 5 are shown.

[Figure 9: Transformed pattern data: training examples $e_1, \ldots, e_p$ with the values $\Pi_{1,p}, \Pi_{2,p}, \Pi_{3,p}$.]

From this transformed data table, as well as from the knowledge of the exact dimensionless solution in the form of equation (12), it is clear that the approximation problem during the training phases of the neural network has now been transformed into the problem of the identification of the correct m-dimensional hyper-surface $F(\Pi_1, \ldots, \Pi_m) = 0$ with only $m = 3$. This is shown in Fig. 10, which represents the computation sequence inside such a neural network with three adjustable weights $v_0, v_1$ and $v_2$.

[Figure 10: Modified computation of f via F: the inputs (P, l, E, I) are mapped onto $\Pi_1, \Pi_2$, combined via $v_0, v_1, v_2$ into $\Pi_3$, and back-transformed into the output u.]
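The homogeneity conditions of equations (14)-(15) can be checked mechanically for any current weight vector. The following sketch is an illustration added here (not from the paper); it simply compares the weighted sum of the input dimension vectors with the dimension vector of the output:

```python
import numpy as np

# Dimension vectors of P, l, E, I and u in the [F], [L] system of Fig. 4 (right).
DIM = {"P": np.array([1, 0]), "l": np.array([0, 1]),
       "E": np.array([1, -2]), "I": np.array([0, 4]), "u": np.array([0, 1])}

def is_dimensionally_homogeneous(w, tol=1e-12):
    """Check eqs. (14)-(15): the weighted dimension vectors of the inputs
    must add up to the dimension vector of the output u."""
    lhs = sum(wi * DIM[name] for wi, name in zip(w, ["P", "l", "E", "I"]))
    return np.allclose(lhs, DIM["u"], atol=tol)

print(is_dimensionally_homogeneous([1.0, 3.0, -1.0, -1.0]))   # True: the weights of eq. (7)
print(is_dimensionally_homogeneous([0.7, 2.1, -0.3, 0.4]))    # False: a typical random
                                                              # intermediate training state
```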

According to equation (12) the correct solution after training is $v_1 = -1$, $v_2 = -1$ and $v_0 = 1/3$ for the neural network to generalize correctly. A network topology for this is shown in Fig. 11.

[Figure 11: Network topology for f via F: input nodes P, l, E, I and a bias $v_0$, intermediate nodes $\Pi_1, \Pi_2$, a node $\Pi_3$ and the output u.]

With respect to the definitions in equation (1), this network first computes in the first hidden layer the intermediate dimensionless products $\Pi_1, \Pi_2$ as sums of the weighted logarithms of the inputs (in fact the logarithms of $\Pi_1, \Pi_2$ are computed). The weighted sum $\Pi_3$ in the second hidden layer is then propagated through the exponential $h(s) = e^s$ as output function. (Remark: the output function of the nodes P, l, E, I and $v_0$ is the logarithm, the output function of $\Pi_3$ is the exponential, while $\Pi_1$ and $\Pi_2$ have the identity as output function.) In contrast to the previous classical neural network function approximation of g, it is important to note in this context that the neural network can now permanently represent only dimensionally homogeneous functions f, regardless of the random initialization and the iterative update of $v_0, v_1, v_2$ during the training. The validity of the Pi-Theorem for all dimensionally homogeneous equations in physics can thus be interpreted in such a way that, since for every function $f(x_1, \ldots, x_n) = 0$ a function $F(\Pi_1, \ldots, \Pi_m) = 0$ exists, the function F can be seen as being nested inside of f, enclosed by the appropriate similarity mappings as indicated in Fig. 10. This mapping scheme based on the existence proof of the Pi-Theorem can thus be generalized to the new neural network similarity topology design and interpretation scheme as shown in Fig. 12. This means that the first and last network layer represent the (here predetermined and fixed) for- and back-transform into and from dimensionless space, while the learning during the training is done through adjustment of the weights of the sought after similarity function F only.

[Figure 12: Similarity network for f via F: the inputs $(x_1, x_2, \ldots, x_{n-1})$ are transformed into dimensionless space, the function F is approximated there, and the output $x_n$ is recovered by the back-transform.]

This exploitation of the principle of dimensional homogeneity in the design of feed-forward neural networks, which is advantageous in direct comparison to Fig. 1, as well as the derivation of several important neural network properties which can be proved with the help of this theory, are stated in the following.

2.4 Proof of Generalization

Based on this neural network topology design scheme as shown in Fig. 12, a generally valid proof of the two necessary and sufficient conditions for the correct generalization in neural networks can now be established in the form of the two following consecutive steps.

The formerly unresolved generalization capability of non-linear multi-layered feed-forward neural networks can now be proven to be pointwise correct, if and only if a training pattern p can be learned and recalled error-free by the new similarity neural network topology F. This proof is due to the fact that well distinct data points in x-space may fall onto the very same point in $\Pi$-space. This is known in physics as the phenomenon of complete similarity.

[Figure 13: Complete similarity condition: distinct points $(x_1, \ldots, x_n)_{p,1}, (x_1, \ldots, x_n)_{p,2}, \ldots$ in x-space are mapped onto the very same point $(\Pi_1, \ldots, \Pi_m)_p = \mathrm{const}$ in dimensionless space.]
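A minimal Python sketch of the similarity topology of Figs. 11 and 12 follows (an illustration added here; the numerical values are hypothetical, and the trainable part F is written as a monomial only because the exact solution (12) happens to be one, whereas in general F is approximated by the adjustable middle layers):

```python
import numpy as np

def similarity_network(P, l, E, I, v, v0):
    """Forward pass of the similarity topology of Figs. 11/12:
    fixed forward transform x -> Pi, trainable F, fixed back-transform."""
    # fixed forward transform into dimensionless space, eqs. (8)-(9)
    Pi1 = E * l**2 / P
    Pi2 = I / l**4
    # trainable part: here F is itself a monomial, Pi3 = v0 * Pi1^v1 * Pi2^v2
    Pi3 = v0 * Pi1**v[0] * Pi2**v[1]
    # fixed back-transform from Pi3 = u / l to the physical output u
    return Pi3 * l

P, l, E, I = 5.0e3, 2.0, 2.1e11, 3.0e-6                 # hypothetical values
u = similarity_network(P, l, E, I, v=[-1.0, -1.0], v0=1.0 / 3.0)
print(u, P * l**3 / (3.0 * E * I))                      # identical deflections
```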

For each training pattern p there exists an infinite number of completely similar points on n-dimensional hyper-surfaces, which are defined by the specific constant numerical values of the dimensionless variables $\Pi_{j,p} = \mathrm{const}$ as shown in Fig. 13. A numerical example of two such completely similar points, lying on such a hyper-surface defined by

$$x_j = \Pi_{j,p} \prod_{i=1}^{r} x_i^{-\alpha_{ji}}, \qquad x_i \in \mathbb{R}^{+}, \quad j = 1, \ldots, m \qquad (16)$$

are the two pattern data sets p = 1 and p = 5 given in Fig. 5, which have been computed for this purpose with the specific constant values of $\Pi_{j,p}$ in Fig. 9. Equations (16) stem from equations (5), which in engineering are commonly known as similarity laws [11, 6]. The pointwise correct generalization of the p pattern data is in a mathematical sense a necessary condition for the overall correct generalization capability of the neural network.

The generalization is furthermore totally correct, if and only if the neural network approximates after the training the correct similarity function $F(\Pi_1, \ldots, \Pi_m) = 0$ which is associated with $f(x_1, \ldots, x_n) = 0$. The correct similarity function F is approximated if and only if the correct pointwise generalization property is fulfilled for each point in the whole domain of definition of F. This is in a mathematical sense the necessary and sufficient condition for the correct generalization capability of the neural network.

At this point three remarks are in order to put the above proof of generalization into perspective. First, it is evident that the above statements represent the theoretical result which one would obtain with ideal noise-free data. It is however important to realize that the permanent presence of noise and/or measurement error requires no fundamental methodological change in the principal approach of modeling real physical behavior as dimensionally homogeneous models (see also section 4). Second, it is important to see that the correct pointwise generalization is automatically achieved for every point in the training data set which can be reproduced by the neural network within reasonable error bounds. This is a consequence of the transformation sequence inside the similarity network topology only and does not depend on the identification of the overall correct dimensionless function F. As a further advantage, the error of the pointwise correct approximation can be verified by a simple recall of each of the training patterns after the training phase. Third, it should be clear that because of the necessary and sufficient condition for the totally correct generalization in the form of the identification of the correct similarity function F, no explicit or implicit claim is made that this correct result is in fact achieved after the training of the neural network. It should be clear that the underlying basic problem of function approximation based on sparse data samples remains. It is however claimed that every correct result can always be written in this form, and that the search space of intermediate stages of the sought after approximation of g without the use of the dimensionless groups is very likely to violate the property of dimensional homogeneity. In the following section the resulting consequences are summarized which stem from the necessary and sufficient conditions for the correct generalization property of the new similarity topology design scheme.

2.5 Design Consequences

Multi-layered non-linear feed-forward neural networks used as a tool for universal function approximation may now be designed according to the Pi-Theorem.
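The complete similarity condition can be demonstrated numerically. The following sketch is an illustration added here, with hypothetical values and an arbitrary scale factor rather than the patterns p = 1 and p = 5 of Fig. 5: scaling a bending-bar configuration according to the similarity laws (16) produces a well distinct point in x-space that falls onto the very same point in Pi-space.

```python
import numpy as np

def pi_groups(P, l, E, I, u):
    """Map one bending-bar data point onto (Pi1, Pi2, Pi3) of eqs. (8)-(10)."""
    return np.array([E * l**2 / P, I / l**4, u / l])

# a first (hypothetical) configuration and its exact deflection from eq. (7)
l1, E1, P1, I1 = 2.0, 2.1e11, 5.0e3, 3.0e-6
u1 = P1 * l1**3 / (3.0 * E1 * I1)

# a completely similar configuration: scale l by s and adapt P, I so that
# the dimensionless products of eq. (16) keep their values
s = 4.0
l2, E2, P2, I2 = s * l1, E1, s**2 * P1, s**4 * I1
u2 = P2 * l2**3 / (3.0 * E2 * I2)

print(pi_groups(P1, l1, E1, I1, u1))   # same Pi-values for both points,
print(pi_groups(P2, l2, E2, I2, u2))   # although the x-space points differ
```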
As a consequence, the first and last network layer have to encode a similarity transformation of the training patterns as indicated in Fig. 12. Because of $m = n - r$ as stated in the Pi-Theorem, this leads to the neural network layer sizes shown in Fig. 14. The similarity transforms $\Pi_j$ can be determined before the training phase and depend on the dimensional information of the training patterns only. Most importantly, the feed-forward neural network is now no longer able to encode dimensionally non-homogeneous functions g, but can only encode and approximate dimensionally homogeneous functions $f(x_1, \ldots, x_n) = 0$ via the adjustment of the weights of the corresponding similarity function $F(\Pi_1, \ldots, \Pi_m) = 0$.

[Figure 14: General similarity topology: a fixed first part with (n-1) x-nodes and (m-1) Pi-nodes, a variable middle part approximating F, and a fixed last part with one $\Pi_m$-node and one $x_n$-node.]

About the necessary minimum number of weights, layers and nodes to encode the similarity function F nothing can be stated in full generality from the Pi-Theorem, since this depends on the physics of the modeled phenomenon and the available set of node transfer functions according to equation (2). This still unresolved question and the therefore necessary variable topology are indicated by the variable middle part in Fig. 14. Since the modern proofs of the Pi-Theorem make no special assumptions on the specific nature of the operator f, as stated in section 2, the similarity topology design method can be imposed a priori on any feed-forward neural network for arbitrary non-linear function approximation problems without any loss of generality, as long as the approximation problem belongs to the class of dimensionally homogeneous models. The following properties can then be proven straightforwardly:

The transformation sequence inside the neural network is $x \mapsto \Pi \mapsto F(\Pi) \mapsto x$, as shown in Fig. 14. The first and last transformation layer hereby have a simple precomputed product form as shown in equation (5). The last computation step consists of the back-transform from the resulting $\Pi_m$ to the sought after $x_n$ contained therein. To compute this, up to r current input values of the so-called basis variables $x_i$ in the resulting dimensionless product $\Pi_m = x_n \prod_{i=1}^{r} x_i^{\alpha_{mi}}$ need to be propagated directly from the first to this last hidden layer node. The general form of the back-transform is thus $x_n = \Pi_m \prod_{i=1}^{r} x_i^{-\alpha_{mi}}$. This is at the same time an effortless model correspondence to the so-called information shortcuts [33, 34] observed in neuro-biological systems.

If and only if the neural network learns during the training phase the corresponding dimensionless similarity function F of the dimensionally homogeneous function f, the correct and unique generalization capability of the network over the whole range of definition of f can be proved based on the properties of the similarity function F. Any other feed-forward neural network which cannot be shown analytically to be equivalent to the minimal topology generated by the new method based on the Pi-Theorem is principally unable to generalize correctly over the whole range of definition of f, because it does not encode the correct similarity function F.

The weights and transfer functions of the internal nodes can a posteriori be interpreted as dimensionless similarity variables and similarity functions, thus supporting the engineering analysis, discussion, interpretation and understanding of the now explicit functional relationship formerly implicitly encoded in the training patterns.

The original approximation problem (i.e. the learning process) has not been worsened, since the dimensionality of F with respect to f is reduced by r. The ratio of the number of patterns p to the number of independent variables involved (n versus m = n - r) has thus been increased, since $(p/n) \leq (p/m)$.

In contexts other than neural networks, the equivalents of many of the above stated mathematical properties have a proven record of usefulness in many other fields of engineering and science [3, 4, 5, 6, 16]. Mainly heat transfer [14] and fluid mechanics [17] account for the traditional strengths of similarity theory.
The new interpretation of the results of similarity theory in the context of neural network topology design and interpretation is now briefly demonstrated in the comparison of a classical and the new neural network topology in the following real world problem setting of a non-linear function approximation problem, whose solution is generally attributed to PRANDTL [26].

3 Application Example

The drag w exerted by a fluid (velocity v, density $\rho$ and viscosity $\nu$) on a sphere with diameter l is a highly non-linear physical phenomenon. Since the discovery of the governing Navier-Stokes equations, which cannot be solved analytically, approximate solutions have been experimentally justified [26]. From the measurements in these experiments, a functional relationship $f(l, v, \rho, \nu, w) = 0$ can be expected. This is shown in Fig. 15.

[Figure 15: Flow around a sphere of diameter l in a fluid of density $\rho$ and viscosity $\nu$ at velocity v, exerting the drag w.]

Two different neural network topology concepts are compared: a traditional neural network with 4 input nodes, two hidden layers with 4 nodes each and one output node, all having sigmoidal transfer functions as in equation (2), as well as a neural network constructed with the similarity principle and polynomials as transfer functions. Both networks were presented the same patterns in the numerical simulation using the SNNS package [35]. To determine the new neural network topology, the dimensional matrix shown on the left hand side of Fig. 16 is constructed from the dimensional information of the patterns.

[Figure 16: Dimensional Matrix Computations]

Variable   SI-units       [M] [L] [T]
rho        [kg/m^3]        1  -3   0
l          [m]             0   1   0
v          [m/s]           0   1  -1
nu         [m^2/s]         0   2  -1
w          [kg m/s^2]      1   1  -2

Then, by using rank preserving operations and adding multiples of the matrix columns to each other, the modified dimensional matrix on the right hand side of Fig. 16 is obtained. According to equation (5), two dimensionless products $\Pi_1$ and $\Pi_2$ can immediately be derived by inspection of the lower diagonal coefficients $\alpha_{ji}$ of this modified dimensional matrix to be equal to

$$\Pi_1 = \frac{v\,l}{\nu} \quad (= Re) \qquad (17)$$

$$\Pi_2 = \frac{w}{\rho\,v^2 l^2} \quad (= c_w) \qquad (18)$$

This means that a functional relationship in the form $F(\Pi_1, \Pi_2) = 0$ with only two dimensionless similarity variables exists and corresponds to the expected functional relationship $f(x_1, \ldots, x_5) = 0$. Both dimensionless products occur so often in fluid dynamics that $\Pi_1$ is called the REYNOLDS number Re and $\Pi_2$ is called the drag coefficient $c_w$. The determination of F is traditionally done using statistical methods [26, 18], but it can also be determined iteratively by a neural network during the training period. From the experimental data [26] available, 44 data points in the range $Re = 10^3, \ldots$ have been selected as training patterns, as shown in Fig. 17.

[Figure 17: Approximation results: the drag coefficient $c_w$ over the Reynolds number Re for the training patterns ("pattern.set"), the similarity network F ("similarity.net") and the conventional network f ("konventional.net").]

The figure shows the projection of the training result of both the classical neural network learning f and the new neural network topology learning F. A closer look at the error distribution in Fig. 18 shows that the relative error of the neural network constructed with the new topology design method is better balanced over the $p = 1, \ldots, 44$ training patterns.

[Figure 18: Relative approximation error in $c_w$ in percent over the training patterns p, for the conventional network f ("konv.plt") and the similarity network F ("pi.plt").]
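For illustration (not part of the paper; the measurement values below are made up and are not taken from the Prandtl data set), the mapping of a raw sphere-drag measurement onto the two similarity variables of equations (17)-(18) looks as follows:

```python
import numpy as np

def to_similarity_space(rho, l, v, nu, w):
    """Map one raw sphere-drag measurement (rho, l, v, nu, w) onto the two
    dimensionless products of eqs. (17)-(18): Reynolds number and drag coefficient."""
    Re = v * l / nu                  # Pi_1, eq. (17)
    cw = w / (rho * v**2 * l**2)     # Pi_2, eq. (18)
    return Re, cw

# hypothetical measurement: air flow around a 0.1 m sphere
rho, l, v, nu, w = 1.2, 0.1, 15.0, 1.5e-5, 0.9     # kg/m^3, m, m/s, m^2/s, N
print(to_similarity_space(rho, l, v, nu, w))       # -> Re = 1e5, cw ~ 0.33
```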

As can be observed from this data, the neural network with the new topology concept obtains better (or at least as good) approximation results over the whole range of training data. However, this is not the main point here at all, and error propagation is one of the topics of our ongoing research. Therefore no further numerical data are presented here, because none of the claims made in this paper is based on numerical or statistical observations. Most importantly, the new neural network topology guarantees an optimal generalization, since (an approximation of) the similarity function F of the physical phenomenon is encoded in the network. Since all elements of the hyper-planes with $v\,l/\nu = \mathrm{const} = Re \in [10^3, \ldots]$ are projected onto the very same point in dimensionless space, the generalization of the network F is guaranteed to be correct for p = 44 times infinitely many points not included in the original training set p, with exactly the same relative approximation error as for the p = 44 data points in Fig. 18. This is the main advantage over the classical neural network approximation g and a unique property inherent in the new similarity topology design method.

4 Discussion

The problem of universal function approximation and of the a priori estimation of the size of the middle layer(s), which now necessarily approximate F instead of f (necessary number of nodes, links or kinds of transfer functions), is not dealt with here, since this issue is already addressed in the literature [1, 2, 7, 15, 36]. Imposing the new topology concept does not change nor worsen the original problem of universal function approximation, but it clearly identifies the well distinct origin and separation of the problem of universal function approximation from the problem of correct generalization in feed-forward neural networks. From the key idea of dimensional homogeneity highlighted in this paper it should be quite clear that one should be very cautious not to jump to numerical simulations too early. It is evident that a major part of the cognitive effort and of the scientific achievement in understanding physical phenomena lies in the discovery of the qualitatively correct relevance list of the expected functional relationship, and only finally in the successful establishment of a quantitative model description. In this respect similarity theory is one of the keys to a meaningful a posteriori interpretation of the inner nodes of the neural network after successful training. The epistemological concept of dimensions thus has much more significant consequences than one might expect when looking at numerical values of data only. After more than 2000 years of scientific effort of mankind, just 7 independent dimensions have been established in our presently known and used SI-unit system. It is evident that the range of validity of similarity theory is limited to the kind of problems which can be described with functional relationships of variables represented in these 7 dimensions. The range of validity is thus typically the area of engineering and physics. It is an open question in the philosophy of the natural sciences whether there are more (but still unknown) dimensions out there to be discovered, or whether there exist classes of problems which inherently lie outside the domain of dimensional representations.
In this respect it is just mentioned that the generalization and further extension of the Pi-Theorem to areas other than physics and engineering has already been suggested and seems to lead to fruitful extensions [27]. The Pi-Theorem has also already been successfully applied to the problem of pattern recognition with neural networks [10] as well as in the broad field of economics [8]. A clear understanding of all dimensions involved in a certain problem is therefore one of the crucial basic steps in any effort of theoretical model building and can significantly facilitate the theoretical analysis and the conclusions drawn from the investigated model, as shown in the examples.

4.1 Related Issues

Noise. Real world data are commonly affected by errors and noise when measured. While systematic measurement errors can be compensated for, measurement noise can commonly only be handled and compensated for if a certain noise distribution model is assumed. In this respect it is stated without proof that similarity theory in the form of dimensional analysis is often the method of choice to process and display experimental data affected with noise.

The real world example of the flow around a sphere in section 3 shows best the advantageous projection of noise-affected experimental data into a physically meaningful lower-dimensional space. The mapping of several well distinct data points in x-space onto principally the same point (or into its proximity) in $\Pi$-space by similarity transforms even helps to deal with statistically distributed noise effects [18]. Dimensional analysis has therefore always been one of the methods of choice of experimenters [26]. The influence of noise is judged to be so important that it is one of the main topics of our ongoing work.

Non-uniqueness of solutions. Without further consequences for the main arguments in this paper, it is mentioned that the existence proof of solutions of equation (5) is constructive but not unique. Since the dimensionless products form a free Abelian group, all possible solutions to equation (4) are of the form

$$\hat{\Pi}_k = \prod_{j=1}^{m} \Pi_j^{\kappa_{kj}} \qquad (19)$$

with $k = 1, \ldots, m \in \mathbb{N}$ and $\kappa_{kj} \in \mathbb{R}$. The general solution may thus consist of any arbitrary combination of the m original $\Pi_j$ which also satisfies the condition of structural independence [11, 25]. This means that the square matrix with elements $\kappa_{kj}$ has to be of full rank to guarantee the equivalence of both dimensionless parameter sets $\Pi = \{\Pi_1, \ldots, \Pi_m\}$ and $\hat{\Pi} = \{\hat{\Pi}_1, \ldots, \hat{\Pi}_m\}$. With respect to the dimensionless equation $F(\Pi_1, \ldots, \Pi_m) = 0$ as guaranteed by the Pi-Theorem, this means that the form of F depends on the choice of a specific set $\hat{\Pi} = \{\hat{\Pi}_1, \ldots, \hat{\Pi}_m\}$. In terms of the bending bar example in section 2.2, choosing the full rank matrix

$$\kappa = \begin{pmatrix} -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \qquad (20)$$

results in the following 3 modified dimensionless products

$$\hat{\Pi}_1 = \Pi_1^{-1} = \frac{P}{E l^2} \qquad (21)$$

$$\hat{\Pi}_2 = \Pi_2^{-1} = \frac{l^4}{I} \qquad (22)$$

$$\hat{\Pi}_3 = \Pi_3 = \frac{u}{l} \qquad (23)$$

which in turn changes the form of the resulting similarity function F to

$$\hat{\Pi}_3 = \frac{1}{3} \hat{\Pi}_1 \hat{\Pi}_2 \qquad (24)$$

4.2 Related Works

In a recent paper by GUNARATNAM and GERO [12], the effect of dimensionless representations (i.e. dimensionless products) on neural network generalization performance in the presence of noise was tested numerically in comparison to classical neural network approaches. The numerical experiments reported in [12] fit perfectly well into the systematic presentation of the underlying theoretical framework given here. However, no general explanation and/or proof in the form of the necessary and sufficient conditions for the correct generalization as in section 2.4, nor further details of the topology design of similarity networks as enumerated in section 2.5, were provided in [12]. In another recent paper by RUDOLPH [32], the practical implementation of the presented similarity theory into the fitness function of a genetic algorithm is described. There the condition of complete similarity according to equation (16) is used to construct a fitness function of a genetic algorithm which enables the selection of correctly generalizing neural network topologies out of a sequence of randomly mutated populations of neural network individuals. Despite the fact that the neural network topologies were left completely unconstrained on a so-called genetic grid, which defined the maximally possible network size in terms of layers and nodes, the theoretical result in the form of the general topology design scheme as shown in Figs. 12 and 14 was always achieved in the numerous numerical computer simulations.
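The re-parameterization of equation (19) can also be checked numerically; the following sketch (an illustration added here, with hypothetical input values) applies the exponent matrix of equation (20) to the original products and confirms equation (24):

```python
import numpy as np

# Equivalent Pi-sets via eq. (19): hat_Pi_k = prod_j Pi_j^(kappa_kj) for any
# full-rank exponent matrix kappa; kappa = diag(-1, -1, 1) reproduces the
# modified products of eqs. (21)-(23) of the bending-bar example.
l, E, P, I = 2.0, 2.1e11, 5.0e3, 3.0e-6                 # hypothetical values
u = P * l**3 / (3.0 * E * I)

Pi = np.array([E * l**2 / P, I / l**4, u / l])          # original set, eqs. (8)-(10)
kappa = np.diag([-1.0, -1.0, 1.0])
assert np.linalg.matrix_rank(kappa) == 3                # structural independence

Pi_hat = np.prod(Pi ** kappa, axis=1)                   # eq. (19)
# eq. (24): hat_Pi_3 = (1/3) * hat_Pi_1 * hat_Pi_2
print(np.isclose(Pi_hat[2], Pi_hat[0] * Pi_hat[1] / 3.0))   # True
```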
Besides the two recent short presentations of the principle of dimensional homogeneity in feed-forward neural networks at workshops [30, 31] with limited audience, it is claimed, based on an extensive literature search, that this is the first time that the principle of dimensional homogeneity is theoretically applied to the a priori topology design of non-linear feed-forward neural networks, and that a proof in the form of necessary and sufficient conditions for the correct generalization in non-linear feed-forward neural networks is presented.

The author has also communicated his original idea prior to publication to the graduate students G. EMRICH, H.-G. HERRMANN and O. BARTH of the Institut für Statik und Dynamik der Luft- und Raumfahrtkonstruktionen (ISD) to stimulate and encourage further work [10, 13].

5 Summary

The principle of dimensional homogeneity is a necessary and sufficient condition for the mapping of the n dimensional variables $x_i$ into the m-dimensional dimensionless space. The proof of the Pi-Theorem hereby guarantees the reduction to a minimum of $m = n - r$ independent dimensionless variables. Thus, the topology of any feed-forward neural network designed with this method consists of a predetermined node reduction and special transfer functions in the first layer of the neural network and a predetermined node expansion and special transfer functions in the last layer of the neural network. Only these topology properties guarantee the unique feature of correct pointwise generalization of non-linear feed-forward neural networks based on the correct approximation of the given training data only. No other theoretical way is known today to substitute for this unique feature. This result implies that all other neural networks which cannot be shown to be analytically equivalent to the new neural network topology design scheme are incapable of generalizing correctly, and it therefore suggests the systematic use of this topology design method.

Acknowledgments

The support of G. EMRICH in the numerical simulations and the proof-reading of several previous drafts was very valuable and has been appreciated. The financial support of this work by the Deutsche Forschungsgemeinschaft (DFG) is acknowledged.

References

[1] A. Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Transactions on Information Theory, 39 (3) (1993).
[2] E. Blum and L. Li, Approximation theory and neural networks, Neural Networks, 4 (4) (1991).
[3] G. Bluman and J. Cole, Similarity Methods for Differential Equations (Springer, New York, 1974).
[4] G. Bluman and S. Kumei, Symmetries and Differential Equations (Springer, New York, 1989).
[5] P. Bridgman, Dimensional Analysis (Yale University Press, New Haven, 1922).
[6] E. Buckingham, The principle of similitude, Nature 96 (1915).
[7] G. Cybenko, Approximations by superpositions of a sigmoidal function, Math. Control, Signals, Systems, 2 (1989).
[8] F. de Jong, Dimensional Analysis for Economists (North-Holland, Amsterdam, 1967).
[9] B. Elvers, S. Hawkins, G. Schulz (eds.), Ullmanns Encyclopedia of Industrial Chemistry. Volume B: Fundamentals of Chemical Engineering (Verlagsgesellschaft, Weinheim, 1990).
[10] G. Emrich, Bilderkennung mit neuen, multiskaleninvarianten Zentralmomenten, in: Kröplin, B. (ed.): Internationales Workshop Neuronale Netze in Ingenieuranwendungen, Institut für Statik und Dynamik der Luft- und Raumfahrtkonstruktionen, Universität Stuttgart, Februar 1996.
[11] H. Görtler, Dimensionsanalyse (Springer, Berlin, 1975).
[12] D. Gunaratnam and J. Gero, Effect of Representation on the Performance of Neural Networks in Structural Engineering Applications, Microcomputers in Civil Engineering 9, 97-108, 1994.
[13] H.-G. Herrmann, Untersuchungen zur Anwendbarkeit von Neuronalen Netzen in der Strukturmechanik (PhD in preparation), Institut für Statik und Dynamik (ISD), Universität Stuttgart.

[14] J. Holman, Heat Transfer (McGraw-Hill, New York, 1986).
[15] K. Hornik, M. Stinchcombe, and H. White, Multilayer feedforward networks are universal approximators, Neural Networks, 2 (5) (1989).
[16] H. Huntley, Dimensional Analysis (MacDonald, London, 1952).
[17] S. Kline, Similitude and Approximation Theory (Springer, New York, 1986).
[18] C. Li and Y. Lee, A statistical procedure for model building in dimensional analysis, International Journal of Heat and Mass Transfer, 33 (7) (1990).
[19] L. Malvern, Introduction to the Mechanics of a Continuous Medium (Prentice Hall, London, 1969).
[20] T. Masters, Practical Neural Network Recipes in C++ (chapter 6: Multilayer Feedforward Networks, 85-90, Academic Press, Boston, 1993).
[21] M. Minsky and S. Papert, Perceptrons: An Introduction to Computational Geometry (MIT Press, Cambridge, MA, 1969).
[22] K. Möller and G. Paaß (eds.), Künstliche Neuronale Netze: Eine Bestandsaufnahme, Künstliche Intelligenz KI, 8 (4) (1994).
[23] G. Paaß, Assessing and improving neural network predictions by the bootstrap algorithm, in: J. Cowan, S. Hanson and C. Giles (eds.), Advances in Neural Information Processing Systems 5 (NIPS 5) (Morgan Kaufmann, San Mateo, CA, 1993).
[24] Y.-H. Pao, Adaptive Pattern Recognition and Neural Networks (Addison-Wesley, Reading, MA, 1989).
[25] J. Pawlowski, Die Ähnlichkeitstheorie in der physikalisch-technischen Forschung. Grundlagen und Anwendungen (Springer, Berlin, 1971).
[26] L. Prandtl, C. Wieselsberger, and A. Betz, Ergebnisse der Aerodynamischen Versuchsanstalt zu Göttingen (Oldenbourg, Berlin, 1923).
[27] Z. Rosenbaum, Foundations and Techniques of Generalized Dimensional Analysis, Ph.D. dissertation, The State University of New Jersey, 1990.
[28] S. Rudolph, Eine Methodik zur systematischen Bewertung von Konstruktionen, Ph.D. dissertation, Universität Stuttgart, VDI Fortschrittsberichte, Reihe, Nummer 25, Düsseldorf, 1995.
[29] S. Rudolph, A Methodology for the Systematic Evaluation of Engineering Design Objects, Ph.D. dissertation, translation of the original German PhD thesis [28] into English. A copy of this translated Ph.D. thesis is available on request by e-mail from: rudolph@isd.uni-stuttgart.de, ISD Verlag, Number 02-94, Stuttgart University, 1995.
[30] S. Rudolph, Entwurf, Anwendung und Interpretation Neuronaler Netze im Ingenieurwesen, in: Berkhan, V., Egly, H. und Olbrich, M. (eds.): Forum Bauinformatik, Junge Wissenschaftler forschen, Hannover '95, VDI Fortschrittsberichte Reihe 20, Nummer 73, VDI-Verlag, Düsseldorf, 24-30, 1995.
[31] S. Rudolph, On Topology and Generalization in Feed-Forward Neural Networks, in: Kröplin, B.-H. (ed.): Neuronale Netze in Ingenieuranwendungen, Internationales Workshop, Institut für Statik und Dynamik der Luft- und Raumfahrtkonstruktionen, Universität Stuttgart, Februar 1996, 7-26, 1996.
[32] S. Rudolph, On a Genetic Algorithm for the Selection of Optimally Generalizing Neural Network Topologies, Proceedings of the 2nd International Conference on Adaptive Computing in Engineering Design and Control '96, I. C. Parmee (ed.), University of Plymouth, March 26th-28th, Plymouth, United Kingdom, 79-86, 1996.
[33] D. Rumelhart and J. McClelland, Parallel Distributed Processing. Volume I and II (MIT Press, Cambridge, MA, 1986).
[34] E. Sanchez-Sinencio and C. Lau (eds.), Artificial Neural Networks (IEEE Press, New York, 1992).
[35] SNNS (Stuttgart Neural Network Simulator), User Manual, Version 3.2, Institute for Parallel and Distributed High Performance Systems, Stuttgart University, Germany, 1994.
[36] K.-Y. Siu, V. Roychowdhury, and T. Kailath, Rational approximation techniques for analysis of neural networks, IEEE Transactions on Information Theory, 40 (2) (1994).
[37] M. Stone, An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion, Journal of the Royal Statistical Society, Ser. B, 39 (1) (1977).
[38] S. Timoshenko and J. Goodier, Theory of Elasticity (McGraw-Hill, London).


More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks Threshold units Gradient descent Multilayer networks Backpropagation Hidden layer representations Example: Face Recognition Advanced topics 1 Connectionist Models Consider humans:

More information

CONTROL SYSTEMS, ROBOTICS AND AUTOMATION Vol. XI Stochastic Stability - H.J. Kushner

CONTROL SYSTEMS, ROBOTICS AND AUTOMATION Vol. XI Stochastic Stability - H.J. Kushner STOCHASTIC STABILITY H.J. Kushner Applied Mathematics, Brown University, Providence, RI, USA. Keywords: stability, stochastic stability, random perturbations, Markov systems, robustness, perturbed systems,

More information

On Information Maximization and Blind Signal Deconvolution

On Information Maximization and Blind Signal Deconvolution On Information Maximization and Blind Signal Deconvolution A Röbel Technical University of Berlin, Institute of Communication Sciences email: roebel@kgwtu-berlinde Abstract: In the following paper we investigate

More information

Section 3.2. Multiplication of Matrices and Multiplication of Vectors and Matrices

Section 3.2. Multiplication of Matrices and Multiplication of Vectors and Matrices 3.2. Multiplication of Matrices and Multiplication of Vectors and Matrices 1 Section 3.2. Multiplication of Matrices and Multiplication of Vectors and Matrices Note. In this section, we define the product

More information

Short Term Memory and Pattern Matching with Simple Echo State Networks

Short Term Memory and Pattern Matching with Simple Echo State Networks Short Term Memory and Pattern Matching with Simple Echo State Networks Georg Fette (fette@in.tum.de), Julian Eggert (julian.eggert@honda-ri.de) Technische Universität München; Boltzmannstr. 3, 85748 Garching/München,

More information

Two alternative derivations of Bridgman s theorem

Two alternative derivations of Bridgman s theorem Journal of Mathematical Chemistry 26 (1999) 255 261 255 Two alternative derivations of Bridgman s theorem Mário N. Berberan-Santos a and Lionello Pogliani b a Centro de Química-Física Molecular, Instituto

More information

On the complexity of shallow and deep neural network classifiers

On the complexity of shallow and deep neural network classifiers On the complexity of shallow and deep neural network classifiers Monica Bianchini and Franco Scarselli Department of Information Engineering and Mathematics University of Siena Via Roma 56, I-53100, Siena,

More information

Quantum Computation via Sparse Distributed Representation

Quantum Computation via Sparse Distributed Representation 1 Quantum Computation via Sparse Distributed Representation Gerard J. Rinkus* ABSTRACT Quantum superposition states that any physical system simultaneously exists in all of its possible states, the number

More information

ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92

ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 BIOLOGICAL INSPIRATIONS Some numbers The human brain contains about 10 billion nerve cells (neurons) Each neuron is connected to the others through 10000

More information

Input layer. Weight matrix [ ] Output layer

Input layer. Weight matrix [ ] Output layer MASSACHUSETTS INSTITUTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science 6.034 Artificial Intelligence, Fall 2003 Recitation 10, November 4 th & 5 th 2003 Learning by perceptrons

More information

Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems

Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems Thore Graepel and Nicol N. Schraudolph Institute of Computational Science ETH Zürich, Switzerland {graepel,schraudo}@inf.ethz.ch

More information

In: Proc. BENELEARN-98, 8th Belgian-Dutch Conference on Machine Learning, pp 9-46, 998 Linear Quadratic Regulation using Reinforcement Learning Stephan ten Hagen? and Ben Krose Department of Mathematics,

More information

Cover Page. The handle holds various files of this Leiden University dissertation

Cover Page. The handle  holds various files of this Leiden University dissertation Cover Page The handle http://hdl.handle.net/1887/39637 holds various files of this Leiden University dissertation Author: Smit, Laurens Title: Steady-state analysis of large scale systems : the successive

More information

Lecture 4: Perceptrons and Multilayer Perceptrons

Lecture 4: Perceptrons and Multilayer Perceptrons Lecture 4: Perceptrons and Multilayer Perceptrons Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning Perceptrons, Artificial Neuronal Networks Lecture 4: Perceptrons

More information

Approximation of Functions by Multivariable Hermite Basis: A Hybrid Method

Approximation of Functions by Multivariable Hermite Basis: A Hybrid Method Approximation of Functions by Multivariable Hermite Basis: A Hybrid Method Bartlomiej Beliczynski Warsaw University of Technology, Institute of Control and Industrial Electronics, ul. Koszykowa 75, -66

More information

Virtual distortions applied to structural modelling and sensitivity analysis. Damage identification testing example

Virtual distortions applied to structural modelling and sensitivity analysis. Damage identification testing example AMAS Workshop on Smart Materials and Structures SMART 03 (pp.313 324) Jadwisin, September 2-5, 2003 Virtual distortions applied to structural modelling and sensitivity analysis. Damage identification testing

More information

CHAPTER 0 PRELIMINARY MATERIAL. Paul Vojta. University of California, Berkeley. 18 February 1998

CHAPTER 0 PRELIMINARY MATERIAL. Paul Vojta. University of California, Berkeley. 18 February 1998 CHAPTER 0 PRELIMINARY MATERIAL Paul Vojta University of California, Berkeley 18 February 1998 This chapter gives some preliminary material on number theory and algebraic geometry. Section 1 gives basic

More information

On the minimal free resolution of a monomial ideal.

On the minimal free resolution of a monomial ideal. On the minimal free resolution of a monomial ideal. Caitlin M c Auley August 2012 Abstract Given a monomial ideal I in the polynomial ring S = k[x 1,..., x n ] over a field k, we construct a minimal free

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

Vector, Matrix, and Tensor Derivatives

Vector, Matrix, and Tensor Derivatives Vector, Matrix, and Tensor Derivatives Erik Learned-Miller The purpose of this document is to help you learn to take derivatives of vectors, matrices, and higher order tensors (arrays with three dimensions

More information

On the Lebesgue constant of barycentric rational interpolation at equidistant nodes

On the Lebesgue constant of barycentric rational interpolation at equidistant nodes On the Lebesgue constant of barycentric rational interpolation at equidistant nodes by Len Bos, Stefano De Marchi, Kai Hormann and Georges Klein Report No. 0- May 0 Université de Fribourg (Suisse Département

More information

Discrete Projection Methods for Incompressible Fluid Flow Problems and Application to a Fluid-Structure Interaction

Discrete Projection Methods for Incompressible Fluid Flow Problems and Application to a Fluid-Structure Interaction Discrete Projection Methods for Incompressible Fluid Flow Problems and Application to a Fluid-Structure Interaction Problem Jörg-M. Sautter Mathematisches Institut, Universität Düsseldorf, Germany, sautter@am.uni-duesseldorf.de

More information

Outliers Treatment in Support Vector Regression for Financial Time Series Prediction

Outliers Treatment in Support Vector Regression for Financial Time Series Prediction Outliers Treatment in Support Vector Regression for Financial Time Series Prediction Haiqin Yang, Kaizhu Huang, Laiwan Chan, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

2 Systems of Linear Equations

2 Systems of Linear Equations 2 Systems of Linear Equations A system of equations of the form or is called a system of linear equations. x + 2y = 7 2x y = 4 5p 6q + r = 4 2p + 3q 5r = 7 6p q + 4r = 2 Definition. An equation involving

More information

Convergence Acceleration of Logarithmically Convergent Series Avoiding Summation

Convergence Acceleration of Logarithmically Convergent Series Avoiding Summation Convergence Acceleration of Logarithmically Convergent Series Avoiding Summation Herbert H. H. Homeier Institut für Physikalische und Theoretische Chemie Universität Regensburg, D-93040 Regensburg, Germany

More information

Computational Complexity and Genetic Algorithms

Computational Complexity and Genetic Algorithms Computational Complexity and Genetic Algorithms BART RYLANDER JAMES FOSTER School of Engineering Department of Computer Science University of Portland University of Idaho Portland, Or 97203 Moscow, Idaho

More information

An Exact Solution of the Differential Equation For flow-loaded Ropes

An Exact Solution of the Differential Equation For flow-loaded Ropes International Journal of Science and Technology Volume 5 No. 11, November, 2016 An Exact Solution of the Differential Equation For flow-loaded Ropes Mathias Paschen Chair of Ocean Engineering, University

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Artificial Neural Network

Artificial Neural Network Artificial Neural Network Contents 2 What is ANN? Biological Neuron Structure of Neuron Types of Neuron Models of Neuron Analogy with human NN Perceptron OCR Multilayer Neural Network Back propagation

More information

Reinforcement Learning, Neural Networks and PI Control Applied to a Heating Coil

Reinforcement Learning, Neural Networks and PI Control Applied to a Heating Coil Reinforcement Learning, Neural Networks and PI Control Applied to a Heating Coil Charles W. Anderson 1, Douglas C. Hittle 2, Alon D. Katz 2, and R. Matt Kretchmar 1 1 Department of Computer Science Colorado

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

λ-universe: Introduction and Preliminary Study

λ-universe: Introduction and Preliminary Study λ-universe: Introduction and Preliminary Study ABDOLREZA JOGHATAIE CE College Sharif University of Technology Azadi Avenue, Tehran IRAN Abstract: - Interactions between the members of an imaginary universe,

More information

Small sample size generalization

Small sample size generalization 9th Scandinavian Conference on Image Analysis, June 6-9, 1995, Uppsala, Sweden, Preprint Small sample size generalization Robert P.W. Duin Pattern Recognition Group, Faculty of Applied Physics Delft University

More information

On the convergence speed of artificial neural networks in the solving of linear systems

On the convergence speed of artificial neural networks in the solving of linear systems Available online at http://ijimsrbiauacir/ Int J Industrial Mathematics (ISSN 8-56) Vol 7, No, 5 Article ID IJIM-479, 9 pages Research Article On the convergence speed of artificial neural networks in

More information

General Properties for Determining Power Loss and Efficiency of Passive Multi-Port Microwave Networks

General Properties for Determining Power Loss and Efficiency of Passive Multi-Port Microwave Networks University of Massachusetts Amherst From the SelectedWorks of Ramakrishna Janaswamy 015 General Properties for Determining Power Loss and Efficiency of Passive Multi-Port Microwave Networks Ramakrishna

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2014 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Combination of M-Estimators and Neural Network Model to Analyze Inside/Outside Bark Tree Diameters

Combination of M-Estimators and Neural Network Model to Analyze Inside/Outside Bark Tree Diameters Combination of M-Estimators and Neural Network Model to Analyze Inside/Outside Bark Tree Diameters Kyriaki Kitikidou, Elias Milios, Lazaros Iliadis, and Minas Kaymakis Democritus University of Thrace,

More information

Convex envelopes, cardinality constrained optimization and LASSO. An application in supervised learning: support vector machines (SVMs)

Convex envelopes, cardinality constrained optimization and LASSO. An application in supervised learning: support vector machines (SVMs) ORF 523 Lecture 8 Princeton University Instructor: A.A. Ahmadi Scribe: G. Hall Any typos should be emailed to a a a@princeton.edu. 1 Outline Convexity-preserving operations Convex envelopes, cardinality

More information

Numerical methods for the Navier- Stokes equations

Numerical methods for the Navier- Stokes equations Numerical methods for the Navier- Stokes equations Hans Petter Langtangen 1,2 1 Center for Biomedical Computing, Simula Research Laboratory 2 Department of Informatics, University of Oslo Dec 6, 2012 Note:

More information

Algebra and Trigonometry 2006 (Foerster) Correlated to: Washington Mathematics Standards, Algebra 2 (2008)

Algebra and Trigonometry 2006 (Foerster) Correlated to: Washington Mathematics Standards, Algebra 2 (2008) A2.1. Core Content: Solving problems The first core content area highlights the type of problems students will be able to solve by the end of, as they extend their ability to solve problems with additional

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction

More information

Neural Networks for Two-Group Classification Problems with Monotonicity Hints

Neural Networks for Two-Group Classification Problems with Monotonicity Hints Neural Networks for Two-Group Classification Problems with Monotonicity Hints P. Lory 1, D. Gietl Institut für Wirtschaftsinformatik, Universität Regensburg, D-93040 Regensburg, Germany Abstract: Neural

More information

PREDICTION OF FATIGUE LIFE OF COLD FORGING TOOLS BY FE SIMULATION AND COMPARISON OF APPLICABILITY OF DIFFERENT DAMAGE MODELS

PREDICTION OF FATIGUE LIFE OF COLD FORGING TOOLS BY FE SIMULATION AND COMPARISON OF APPLICABILITY OF DIFFERENT DAMAGE MODELS PREDICTION OF FATIGUE LIFE OF COLD FORGING TOOLS BY FE SIMULATION AND COMPARISON OF APPLICABILITY OF DIFFERENT DAMAGE MODELS M. Meidert and C. Walter Thyssen/Krupp Presta AG Liechtenstein FL-9492 Eschen

More information

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2. APPENDIX A Background Mathematics A. Linear Algebra A.. Vector algebra Let x denote the n-dimensional column vector with components 0 x x 2 B C @. A x n Definition 6 (scalar product). The scalar product

More information

INVARIANT SUBSETS OF THE SEARCH SPACE AND THE UNIVERSALITY OF A GENERALIZED GENETIC ALGORITHM

INVARIANT SUBSETS OF THE SEARCH SPACE AND THE UNIVERSALITY OF A GENERALIZED GENETIC ALGORITHM INVARIANT SUBSETS OF THE SEARCH SPACE AND THE UNIVERSALITY OF A GENERALIZED GENETIC ALGORITHM BORIS MITAVSKIY Abstract In this paper we shall give a mathematical description of a general evolutionary heuristic

More information

Linearly-solvable Markov decision problems

Linearly-solvable Markov decision problems Advances in Neural Information Processing Systems 2 Linearly-solvable Markov decision problems Emanuel Todorov Department of Cognitive Science University of California San Diego todorov@cogsci.ucsd.edu

More information

Linear Discrimination Functions

Linear Discrimination Functions Laurea Magistrale in Informatica Nicola Fanizzi Dipartimento di Informatica Università degli Studi di Bari November 4, 2009 Outline Linear models Gradient descent Perceptron Minimum square error approach

More information

12. Lecture Stochastic Optimization

12. Lecture Stochastic Optimization Soft Control (AT 3, RMA) 12. Lecture Stochastic Optimization Differential Evolution 12. Structure of the lecture 1. Soft control: the definition and limitations, basics of expert" systems 2. Knowledge

More information

Determinants of Partition Matrices

Determinants of Partition Matrices journal of number theory 56, 283297 (1996) article no. 0018 Determinants of Partition Matrices Georg Martin Reinhart Wellesley College Communicated by A. Hildebrand Received February 14, 1994; revised

More information

Deep Belief Networks are compact universal approximators

Deep Belief Networks are compact universal approximators 1 Deep Belief Networks are compact universal approximators Nicolas Le Roux 1, Yoshua Bengio 2 1 Microsoft Research Cambridge 2 University of Montreal Keywords: Deep Belief Networks, Universal Approximation

More information

Proceedings of 12th International Heat Pipe Conference, pp , Moscow, Russia, 2002.

Proceedings of 12th International Heat Pipe Conference, pp , Moscow, Russia, 2002. 7KHUPDO3HUIRUPDQFH0RGHOLQJRI3XOVDWLQJ+HDW3LSHVE\$UWLILFLDO1HXUDO1HWZRUN Sameer Khandekar (a), Xiaoyu Cui (b), Manfred Groll (a) (a) IKE, University of Stuttgart, Pfaffenwaldring 31, 70569, Stuttgart, Germany.

More information

L p Approximation of Sigma Pi Neural Networks

L p Approximation of Sigma Pi Neural Networks IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 11, NO. 6, NOVEMBER 2000 1485 L p Approximation of Sigma Pi Neural Networks Yue-hu Luo and Shi-yi Shen Abstract A feedforward Sigma Pi neural networks with a

More information

Taylor series. Chapter Introduction From geometric series to Taylor polynomials

Taylor series. Chapter Introduction From geometric series to Taylor polynomials Chapter 2 Taylor series 2. Introduction The topic of this chapter is find approximations of functions in terms of power series, also called Taylor series. Such series can be described informally as infinite

More information

Address for Correspondence

Address for Correspondence Research Article APPLICATION OF ARTIFICIAL NEURAL NETWORK FOR INTERFERENCE STUDIES OF LOW-RISE BUILDINGS 1 Narayan K*, 2 Gairola A Address for Correspondence 1 Associate Professor, Department of Civil

More information

On Elementary and Algebraic Cellular Automata

On Elementary and Algebraic Cellular Automata Chapter On Elementary and Algebraic Cellular Automata Yuriy Gulak Center for Structures in Extreme Environments, Mechanical and Aerospace Engineering, Rutgers University, New Jersey ygulak@jove.rutgers.edu

More information

Hybrid HMM/MLP models for time series prediction

Hybrid HMM/MLP models for time series prediction Bruges (Belgium), 2-23 April 999, D-Facto public., ISBN 2-649-9-X, pp. 455-462 Hybrid HMM/MLP models for time series prediction Joseph Rynkiewicz SAMOS, Université Paris I - Panthéon Sorbonne Paris, France

More information

Nonlinear Coordinate Transformations for Unconstrained Optimization I. Basic Transformations

Nonlinear Coordinate Transformations for Unconstrained Optimization I. Basic Transformations Nonlinear Coordinate Transformations for Unconstrained Optimization I. Basic Transformations TIBOR CSENDES Kalmár Laboratory, József Attila University, Szeged, Hungary and TAMÁS RAPCSÁK Computer and Automation

More information

Neural Network Based Response Surface Methods a Comparative Study

Neural Network Based Response Surface Methods a Comparative Study . LS-DYNA Anwenderforum, Ulm Robustheit / Optimierung II Neural Network Based Response Surface Methods a Comparative Study Wolfram Beyer, Martin Liebscher, Michael Beer, Wolfgang Graf TU Dresden, Germany

More information

Lecture 7 Artificial neural networks: Supervised learning

Lecture 7 Artificial neural networks: Supervised learning Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in

More information

Notes on the Matrix-Tree theorem and Cayley s tree enumerator

Notes on the Matrix-Tree theorem and Cayley s tree enumerator Notes on the Matrix-Tree theorem and Cayley s tree enumerator 1 Cayley s tree enumerator Recall that the degree of a vertex in a tree (or in any graph) is the number of edges emanating from it We will

More information

AI Programming CS F-20 Neural Networks

AI Programming CS F-20 Neural Networks AI Programming CS662-2008F-20 Neural Networks David Galles Department of Computer Science University of San Francisco 20-0: Symbolic AI Most of this class has been focused on Symbolic AI Focus or symbols

More information

A Novel Activity Detection Method

A Novel Activity Detection Method A Novel Activity Detection Method Gismy George P.G. Student, Department of ECE, Ilahia College of,muvattupuzha, Kerala, India ABSTRACT: This paper presents an approach for activity state recognition of

More information

Artificial Neural Network Method of Rock Mass Blastability Classification

Artificial Neural Network Method of Rock Mass Blastability Classification Artificial Neural Network Method of Rock Mass Blastability Classification Jiang Han, Xu Weiya, Xie Shouyi Research Institute of Geotechnical Engineering, Hohai University, Nanjing, Jiangshu, P.R.China

More information

CS 4700: Foundations of Artificial Intelligence

CS 4700: Foundations of Artificial Intelligence CS 4700: Foundations of Artificial Intelligence Prof. Bart Selman selman@cs.cornell.edu Machine Learning: Neural Networks R&N 18.7 Intro & perceptron learning 1 2 Neuron: How the brain works # neurons

More information

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer. University of Cambridge Engineering Part IIB & EIST Part II Paper I0: Advanced Pattern Processing Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training x y (x) Inputs x 2 y (x) 2 Outputs x

More information

AN ELEMENTARY PROOF OF THE SPECTRAL RADIUS FORMULA FOR MATRICES

AN ELEMENTARY PROOF OF THE SPECTRAL RADIUS FORMULA FOR MATRICES AN ELEMENTARY PROOF OF THE SPECTRAL RADIUS FORMULA FOR MATRICES JOEL A. TROPP Abstract. We present an elementary proof that the spectral radius of a matrix A may be obtained using the formula ρ(a) lim

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

arxiv:quant-ph/ v1 20 Apr 1995

arxiv:quant-ph/ v1 20 Apr 1995 Combinatorial Computation of Clebsch-Gordan Coefficients Klaus Schertler and Markus H. Thoma Institut für Theoretische Physik, Universität Giessen, 3539 Giessen, Germany (February, 008 The addition of

More information

Chapter Two Elements of Linear Algebra

Chapter Two Elements of Linear Algebra Chapter Two Elements of Linear Algebra Previously, in chapter one, we have considered single first order differential equations involving a single unknown function. In the next chapter we will begin to

More information

A New Approach to Estimating the Expected First Hitting Time of Evolutionary Algorithms

A New Approach to Estimating the Expected First Hitting Time of Evolutionary Algorithms A New Approach to Estimating the Expected First Hitting Time of Evolutionary Algorithms Yang Yu and Zhi-Hua Zhou National Laboratory for Novel Software Technology Nanjing University, Nanjing 20093, China

More information

Artificial Neural Network Simulation of Battery Performance

Artificial Neural Network Simulation of Battery Performance Artificial work Simulation of Battery Performance C.C. O Gorman, D. Ingersoll, R.G. Jungst and T.L. Paez Sandia National Laboratories PO Box 58 Albuquerque, NM 8785 Abstract Although they appear deceptively

More information