An Integral Representation of Functions using Three-layered Networks and Their Approximation Bounds


Noboru Murata 1
Department of Mathematical Engineering and Information Physics, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, JAPAN

Abstract

Neural networks are widely known to provide a method of approximating nonlinear functions. In order to clarify this approximation ability, a new theorem on an integral transform of ridge functions is presented. Using this theorem, an approximation bound, which evaluates the quantitative relationship between the approximation accuracy and the number of elements in the hidden layer, can be obtained. This result shows that the approximation accuracy depends on the smoothness of the target function. It also shows that approximation methods which use ridge functions are free from the "curse of dimensionality".

Keywords: integral transform, ridge function, three-layered network, approximation bound, random coding, curse of dimensionality

1 Introduction

In the middle of the 1980s, computational research on neural networks was revitalized by the work of the Parallel Distributed Processing (PDP) group (Rumelhart et al., 1986). In this movement, multi-layered networks with sigmoidal activation functions, together with back-propagation learning, played an important role. The numerous examples provided by the PDP group attracted the interest of many other researchers, and a large number of subsequent computer simulations have shown that multi-layered networks can be usefully applied to practical problems such as image processing, speech recognition, and system control.

1 Currently staying at GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany, supported by the Alexander von Humboldt-Stiftung.

Preprint submitted to Elsevier Science, November 1995

Advantages of these simple learning machines can be summarized in the following two points. One is that the "back-propagation" learning algorithm can be easily implemented on computers, because the algorithm is essentially a gradient descent method, and because the derivative of the sigmoidal function depends only on its output, so that the gradient with respect to each parameter can be calculated locally. The other is the fact that a three-layered network can approximate an arbitrary function with any desired accuracy if it has sufficiently many hidden units. Irie and Miyake (1988), Funahashi (1989), Cybenko (1989) and White (1990) proved this fact mainly based on the Fourier transform. Funahashi (1989) also gave another proof making use of Kolmogorov and Arnol'd's theorem and Sprecher's theorem. Recently Jones (1992), Barron (1993) and Girosi and Anzellotti (1992) showed that the approximation error is inversely proportional to the number of hidden units with respect to the mean square or sup norm criterion for a certain class of functions. Their results are interesting and important because they indicate that greedy algorithms, such as multi-layered networks and radial basis approximators with gradient descent learning, can avoid the "curse of dimensionality".

In this paper, we discuss the second advantage from another point of view. As the first step toward clarifying the performance and the limitations of three-layered networks as function approximators, we focus on the case where we have complete information about the target function. We do not consider learning from examples in this paper. We show that the structure of multi-layered networks has a property well suited to approximating nonlinear functions.

First, we define the integral transform and the inverse transform of functions using ridge functions. The inversion formula gives a precise representation of functions and also provides a reasonable interpretation of the structure of three-layered networks. From the correspondence between transformation coefficients and network parameters, we obtain an intuitive interpretation of the network parameters.

Second, applying this result, we give a bound on the approximation accuracy of a three-layered network with a finite number of hidden units. Using the random coding technique, which is well known in the field of information theory, it is shown that the mean square error of the approximation is inversely proportional to the number of hidden units. This means that if higher accuracy is desired, the number of required hidden units increases not exponentially with respect to the input dimension, but simply linearly with the increase in accuracy. Hence it follows that three-layered networks are free of the "curse of dimensionality".

Third, we discuss the relationship between the smoothness of target functions and the approximation errors of three-layered networks, and we give a class of functions which can be well approximated by three-layered networks. We also suggest that there is a close connection between the smoothness of target functions and the magnitude of network parameters, and that some smoothness conditions might guarantee the convergence of learning.

2 Integral Transform using Ridge Functions

First, we define ridge functions.

Definition 1. When a function $F : \mathbb{R}^m \to \mathbb{R}$ can be written as
$$F(x) = G(a \cdot x - b), \qquad (1)$$
with a vector $a \in \mathbb{R}^m$, a real number $b$ and an appropriate function $G : \mathbb{R} \to \mathbb{R}$, it is called a ridge function.

In other words, a ridge function takes the same value on those hyper-planes in $\mathbb{R}^m$ whose normal vectors are parallel to $a$ (see for example Figure 1). Clearly, the input-output relation of a neuron in a conventional artificial neural network, i.e. a weighted sum followed by a sigmoidal activation function, belongs to the class of ridge functions given in Equation (1). We should note that when $m \ge 2$, $F$ is not integrable on $\mathbb{R}^m$ even if $G$ is integrable on $\mathbb{R}$.

In the following, we discuss a method of approximating a function $f : \mathbb{R}^m \to \mathbb{R}$ using a linear combination of ridge functions. Let us assume that the function $f$ belongs to $L^1(\mathbb{R}^m) \cap L^p(\mathbb{R}^m)$ $(1 \le p < \infty)$, or that $f$ is bounded and uniformly continuous. Let a pair of functions $\varphi_d, \varphi_c \in L^1(\mathbb{R}) \cap L^2(\mathbb{R})$ be bounded and satisfy the following conditions:
$$\hat{\varphi}_d(\omega)\,\overline{\hat{\varphi}_c(\omega)} = \hat{\varphi}_d(-\omega)\,\overline{\hat{\varphi}_c(-\omega)}, \qquad (2)$$
$$\int \frac{\bigl|\hat{\varphi}_d(\omega)\,\overline{\hat{\varphi}_c(\omega)}\bigr|}{|\omega|^m}\, d\omega < \infty \qquad (3)$$
and
$$\int \frac{\hat{\varphi}_d(\omega)\,\overline{\hat{\varphi}_c(\omega)}}{|\omega|^m}\, d\omega \ne 0, \qquad (4)$$
where $\hat{\cdot}$ denotes the Fourier transform and $\overline{\cdot}$ denotes the complex conjugate. The indices $d$ and $c$ indicate the decomposing and the composing kernel, respectively.
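To make Definition 1 concrete, the following small numerical sketch (not part of the paper; the choice of $G$, $a$ and $b$ is arbitrary) evaluates a ridge function on $\mathbb{R}^2$ and checks that it is constant along directions orthogonal to $a$:

```python
import numpy as np

# A ridge function F(x) = G(a . x - b) on R^2, with an arbitrary profile G.
G = np.tanh                      # any function G: R -> R works here
a = np.array([2.0, -1.0])        # normal vector of the level hyper-planes
b = 0.5

def F(x):
    """Ridge function of Definition 1 evaluated at a point x in R^2."""
    return G(np.dot(a, x) - b)

# Moving along a direction orthogonal to a does not change the value:
x0 = np.array([0.3, 1.2])
orth = np.array([1.0, 2.0])      # a . orth = 0
orth = orth / np.linalg.norm(orth)
vals = [F(x0 + t * orth) for t in np.linspace(-3, 3, 7)]
print(np.allclose(vals, vals[0]))   # True: constant on the hyper-plane
```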

Fig. 1. An example of a ridge function on $\mathbb{R}^2$.

We define
$$C_{\varphi_d,\varphi_c} = \frac{1}{2\pi}\int \frac{\hat{\varphi}_d(\omega)\,\overline{\hat{\varphi}_c(\omega)}}{|\omega|^m}\, d\omega, \qquad (5)$$
which is finite and non-zero by conditions (3) and (4). The transform $T$ of a function $f$ with respect to the kernels $\varphi_d$ and $\varphi_c$ is defined by
$$T(a,b) = \frac{1}{(2\pi)^m\, C_{\varphi_d,\varphi_c}} \int_{\mathbb{R}^m} \varphi_d(a\cdot x - b)\, f(x)\, dx. \qquad (6)$$
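As an illustration of Equation (6) only (this computation is not part of the paper), the transform of a one-dimensional target can be approximated on a grid by plain quadrature. The decomposing kernel and the constant below are placeholders; an actual choice would have to satisfy conditions (2)-(4):

```python
import numpy as np

# Toy one-dimensional target f in L1(R), so m = 1.
f = lambda x: np.exp(-x**2) * np.cos(3 * x)

# Hypothetical decomposing kernel phi_d (a Gaussian second derivative here);
# whether it is admissible for a given phi_c must be checked via (2)-(4).
phi_d = lambda z: (z**2 - 1.0) * np.exp(-z**2 / 2)

C = 1.0          # stands in for (2*pi)^m * C_{phi_d, phi_c}; not computed here

xs = np.linspace(-8, 8, 4001)    # quadrature grid for the x-integral
dx = xs[1] - xs[0]

def T(a, b):
    """Riemann-sum approximation of the transform (6) for scalar a, b."""
    return np.sum(phi_d(a * xs - b) * f(xs)) * dx / C

print(T(1.0, 0.0), T(2.0, 0.5))
```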

Then we can prove the following inversion theorem.

Theorem 2. Using the transform $T$ given in Equation (6), the function $f$ can be represented as
$$f(x) = \lim_{\varepsilon \to 0} \int_{\mathbb{R}^{m+1}} \varphi_c(a\cdot x - b)\, T(a,b)\, e^{-\varepsilon|a|^2}\, da\, db. \qquad (7)$$
If $f \in L^1(\mathbb{R}^m) \cap L^p(\mathbb{R}^m)$ $(1 \le p < \infty)$, the right-hand side of Equation (7) converges to $f$ in the sense of the $L^p$-norm, and if $f$ is bounded and uniformly continuous, it converges in the sense of the $L^\infty$-norm.

The proof is given in Appendix A. Examples of the kernels $\varphi_d$ and $\varphi_c$ will be given in Section 3. When $m = 1$, i.e. $f$ is a function on $\mathbb{R}$, we can see a close relation between this transform and the wavelet transform. When $m \ge 2$, this kind of relation cannot be found, because wavelets belong to $L^2(\mathbb{R}^m)$, i.e. functions are decomposed into finite-energy waves, while $\varphi_c(a\cdot x - b)$ is not square-integrable with respect to $x$.

From the fact that $f$ belongs to $L^1(\mathbb{R}^m)$ and $\varphi_d$ is bounded, we see that $T$ is bounded, i.e. $T(a,b) \in L^\infty(\mathbb{R}^{m+1})$. We can also derive
$$\int |T(a,b)|\, db = \frac{1}{(2\pi)^m\, |C_{\varphi_d,\varphi_c}|} \int \left| \int_{\mathbb{R}^m} \varphi_d(a\cdot x - b)\, f(x)\, dx \right| db \le \frac{1}{(2\pi)^m\, |C_{\varphi_d,\varphi_c}|} \int_{\mathbb{R}^m} \left( \int |\varphi_d(a\cdot x - b)|\, db \right) |f(x)|\, dx < \infty.$$
Even though $T(a,b)$ is bounded and $\varphi_c$ belongs to $L^1(\mathbb{R})$,
$$f(x) = \int_{\mathbb{R}^{m+1}} \varphi_c(a\cdot x - b)\, T(a,b)\, da\, db \qquad (8)$$
might be a divergent integral. Hence the convergence of the integral in Equation (7) is guaranteed by the convergence factor $e^{-\varepsilon|a|^2}$. Note that although we adopt a Gaussian kernel for simplicity of calculation, other functions are also available for this purpose.

Equation (6) can be seen as a map from $L^1(\mathbb{R}^m)$ to $L^\infty(\mathbb{R}^{m+1})$. While the Fourier transform and the inverse Fourier transform give a one-to-one correspondence between $L^2(\mathbb{R}^m)$ and $L^2(\mathbb{R}^m)$, Equation (6) has redundancy.

In other words, $f$ could for example be represented by two different transforms $T^{(1)}(a,b)$ and $T^{(2)}(a,b)$ with respect to two different decomposing kernel functions $\varphi_d^{(1)}$ and $\varphi_d^{(2)}$, both of which satisfy the admissible condition expressed in Equation (3) with the same composing kernel $\varphi_c$. In general, there is a family of decomposing kernels $\{\varphi_d^{(1)}, \varphi_d^{(2)}, \ldots\}$, and we can choose a convenient kernel $\varphi_d$ among them by imposing appropriate properties such as compactness of support and smoothness. We use this freedom to investigate the relationship between the smoothness of target functions and the approximation ability in Section 3.4.

3 Application to Three-layered Networks

From the previous result, we can evaluate some aspects of approximating functions using three-layered networks.

3.1 Three-layered Networks with Bell-shaped Functions

In the engineering field, a sigmoidal function $\sigma$ with the following characteristics,
$$\lim_{z\to\infty}\sigma(z) = 1, \qquad \lim_{z\to-\infty}\sigma(z) = 0, \qquad \frac{d}{dz}\sigma(z) = \sigma'(z) > 0, \qquad \lim_{z\to\pm\infty}\sigma'(z) = 0,$$
is usually used as the activation function of the hidden units. The monotonicity and smoothness of sigmoidal functions are advantageous for the back-propagation learning algorithm. Since sigmoidal functions do not belong to $L^1(\mathbb{R})$, we cannot directly apply the previous representation result to this type of network. We therefore discard the assumption of monotonicity and define a bell-shaped function $\varphi_c$ which satisfies the following integrability and shape conditions:
$$\int \varphi_c(z)\, dz < \infty, \qquad \lim_{z\to\pm\infty}\varphi_c(z) = 0, \qquad \varphi_c(z) \ge 0, \qquad \max_z \varphi_c(z) = 1, \qquad \varphi_c \text{ is unimodal}.$$
In short, a bell-shaped function is a unimodal $L^1(\mathbb{R})$ function whose maximum value is 1.
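As a quick check of these conditions (a sketch added here, not in the original), one concrete bell-shaped function is the difference of two logistic sigmoids rescaled so that its peak value is 1, which is exactly the construction described in the next paragraphs:

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # logistic sigmoid

h = 1.0                                      # positive shift constant
c = 1.0 / (2 * sigma(h) - 1)                 # normalizes the peak at z = 0 to 1

phi_c = lambda z: c * (sigma(z + h) - sigma(z - h))

zs = np.linspace(-50, 50, 200001)
vals = phi_c(zs)
dz = zs[1] - zs[0]
print("max value:", vals.max())              # ~ 1, attained at z = 0
print("integral :", vals.sum() * dz)         # finite, so phi_c is in L1(R)
print("tails    :", phi_c(-50.0), phi_c(50.0))   # ~ 0 as z -> +/- infinity
```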

In the following, we consider networks composed of such bell-shaped units, in which for an input $x \in \mathbb{R}^m$ the output $f_n(x)$ is calculated as
$$f_n(x) = \sum_{i=1}^{n} c_i\, \varphi_c(a_i\cdot x - b_i), \qquad (9)$$
where $a_i \in \mathbb{R}^m$, $b_i, c_i \in \mathbb{R}$, and $n$ denotes the number of hidden units.

A bell-shaped function can be constructed from two appropriate sigmoidal functions. For example,
$$\varphi_c(z) = c\,\bigl(\sigma(z+h) - \sigma(z-h)\bigr),$$
where $c$ is a constant which normalizes the maximum value to 1 and $h$ is a positive constant. In this case, a network with a given number of bell-shaped hidden units can instead be constructed from twice as many sigmoidal hidden units. In general, the family of sigmoidal networks includes that of bell-shaped networks as a subset. We can also represent a bell-shaped function as the derivative of a sigmoidal function,
$$\varphi_c(z) = \frac{d}{dz}\sigma(z).$$
In this case $\varphi_c$ is clearly integrable.

3.2 Representing Functions using Three-layered Networks

Using the integral transform defined in the previous section, an arbitrary $f \in L^1(\mathbb{R}^m)$ can be written as
$$f(x) = \lim_{\varepsilon\to 0}\int_{\mathbb{R}^{m+1}} \varphi_c(a\cdot x - b)\, T(a,b)\, e^{-\varepsilon|a|^2}\, da\, db.$$
Then we see that a three-layered network composed of bell-shaped functions can approximate an arbitrary function $f \in L^1(\mathbb{R}^m)$ with any demanded accuracy if the network has a sufficient number of hidden units. Here $T(a,b)$ is calculated using Equation (6) with a function $\varphi_d$ which satisfies the admissible condition (3). For any fixed $\varepsilon > 0$ corresponding to the desired accuracy, $T(a,b)\,e^{-\varepsilon|a|^2}$ can be approximated with a sufficient number of hidden units because it belongs to $L^1(\mathbb{R}^{m+1})$.
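The finite network of Equation (9) is easy to state in code. The following sketch (added for illustration, with arbitrary parameter values) evaluates $f_n$ for bell-shaped units built from pairs of sigmoids as above:

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
h = 1.0
norm = 1.0 / (2 * sigma(h) - 1)
phi_c = lambda z: norm * (sigma(z + h) - sigma(z - h))   # bell-shaped unit

def f_n(x, A, b, c):
    """Three-layered network of Eq. (9).

    x : (m,) input vector
    A : (n, m) input-to-hidden weights a_i
    b : (n,) hidden thresholds b_i
    c : (n,) hidden-to-output weights c_i
    """
    return np.dot(c, phi_c(A @ x - b))

rng = np.random.default_rng(0)
m, n = 3, 5                       # input dimension and number of hidden units
A = rng.normal(size=(n, m))
b = rng.normal(size=n)
c = rng.normal(size=n)
print(f_n(np.array([0.2, -1.0, 0.7]), A, b, c))
```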

Consequently, noting that a bell-shaped function can be represented by a linear combination of two sigmoidal functions, a three-layered sigmoidal network can also approximate arbitrary functions.

We can see a straightforward correspondence between the integral representation presented above and the structure of three-layered networks: the parameters $a$ and $b$ correspond to the weights and the thresholds from the input layer to the hidden layer, and the transform $T(a,b)$ corresponds to the weights from the hidden layer to the output layer (see Figure 2).

Fig. 2. The relationship between the transform $T(a,b)$ and the three-layered network: the hidden units compute $\varphi_c(a_i\cdot x - b_i)$ from the input $x$, and the output unit forms the sum $\sum_{i=1}^{n} c_i\,\varphi_c(a_i\cdot x - b_i)$, the finite analogue of $\int_{\mathbb{R}^{m+1}} T(a,b)\,\varphi_c(a\cdot x - b)\, da\, db$.

If the composing kernel $\varphi_c$ is sufficiently smooth, we can use a decomposing kernel $\varphi_d$ whose Fourier transform is given by
$$\hat{\varphi}_d(\omega) = |\omega|^m\, \hat{\varphi}_c(\omega).$$
In this case, a sufficient condition for the existence of the inverse Fourier transform $\varphi_d$ is that $\hat{\varphi}_c$ is bounded and $\varphi_c$ is $(m+2)$-times differentiable.

If the composing kernel $\varphi_c$ is even and sufficiently smooth, we can instead construct a decomposing kernel $\varphi_d$ with compact support as follows.

Choose an even function $\psi \in C_0^\infty(\mathbb{R})$ which satisfies
$$\psi(z) \ge 0, \qquad \psi(z) = 0 \text{ if } |z| \ge 1, \qquad 0 < \int \psi(z)\, dz < \infty,$$
where $C_0^\infty(\mathbb{R})$ denotes the class of functions on $\mathbb{R}$ which belong to $C^\infty(\mathbb{R})$ and have compact support. For example,
$$\psi(z) = \begin{cases} \exp\!\left(\dfrac{1}{|z|^2 - 1}\right) & \text{if } |z| < 1, \\[4pt] 0 & \text{if } |z| \ge 1 \end{cases} \qquad (10)$$
is available. A decomposing kernel is then given by
$$\varphi_d(z) = \begin{cases} \dfrac{d^m}{dz^m}\,\psi(z) & \text{if } m \text{ is even}, \\[6pt] \dfrac{d^{m+1}}{dz^{m+1}}\,\psi(z) & \text{if } m \text{ is odd}, \end{cases}$$
depending on the dimensionality $m$ of the input $x$. We confirm in Appendix B that these kernels satisfy the admissible condition expressed in Equation (3).
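For the case $m = 1$ the decomposing kernel above is simply the second derivative of the bump function (10). The following sketch (not from the paper) builds it symbolically and confirms that it stays bounded and inherits the compact support of $\psi$:

```python
import numpy as np
import sympy as sp

z = sp.symbols('z')
psi_expr = sp.exp(1 / (z**2 - 1))          # Eq. (10) on |z| < 1; zero outside

m = 1                                       # odd input dimension
phi_d_expr = sp.diff(psi_expr, z, m + 1)    # d^{m+1} psi / dz^{m+1}

psi = sp.lambdify(z, psi_expr, 'numpy')
phi_d = sp.lambdify(z, phi_d_expr, 'numpy')

zs = np.linspace(-0.999, 0.999, 2001)       # strictly inside the support (-1, 1)
print("sup |psi|   :", np.max(np.abs(psi(zs))))
print("sup |phi_d| :", np.max(np.abs(phi_d(zs))))
# For |z| >= 1, psi and all of its derivatives vanish identically,
# so phi_d also has compact support.
```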

3.3 Approximation Bounds of Bell-shaped Networks

In the following, we assume that the inputs $x \in \mathbb{R}^m$ are generated subject to a probability density $\mu(x)$, and we evaluate quantitatively the approximation accuracy of the network with $n$ hidden units in terms of the $L^2$-norm with respect to the probability density $\mu(x)$,
$$\|f_n - f\|^2_{L^2(\mathbb{R}^m),\mu} = \int_{\mathbb{R}^m} \bigl(f_n(x) - f(x)\bigr)^2\, \mu(x)\, dx, \qquad (11)$$
where $f$ is the target function to be approximated. The indicator function of the domain of the input $x$, normalized by the volume of the domain, is a typical case of $\mu(x)$. We assume that
$$\int_{\mathbb{R}^{m+1}} |T(a,b)|\, da\, db < \infty \qquad (12)$$
holds. In this case, we can omit the convergence factor. Taking into account that $f$ and $\varphi_c$ are real functions, we can derive the representation
$$f(x) = \int_{\mathbb{R}^{m+1}} \mathrm{Re}\,T(a,b)\,\varphi_c(a\cdot x - b)\, da\, db = \int |\mathrm{Re}\,T(a,b)|\,\mathrm{sign}\bigl(\mathrm{Re}\,T(a,b)\bigr)\,\varphi_c(a\cdot x - b)\, da\, db = \int c(a,b)\,\varphi_c(a\cdot x - b)\, p(a,b)\, da\, db, \qquad (13)$$
where
$$C_T = \int_{\mathbb{R}^{m+1}} |\mathrm{Re}\,T(a,b)|\, da\, db, \qquad c(a,b) = \mathrm{sign}\bigl(\mathrm{Re}\,T(a,b)\bigr)\, C_T, \qquad p(a,b) = \frac{|\mathrm{Re}\,T(a,b)|}{C_T}.$$
Since $p(a,b)$ is a non-negative function and its integral over $\mathbb{R}^{m+1}$ is equal to 1, the function $p(a,b)$ can be regarded as a probability density of $a$ and $b$. Let us consider the following function,
$$g_n(x) = \frac{1}{n}\sum_{i=1}^{n} c(a_i,b_i)\, \varphi_c(a_i\cdot x - b_i), \qquad (14)$$
where the $(a_i,b_i)$ $(i = 1,\ldots,n)$ are independently chosen subject to the probability density $p(a,b)$. The expectation and the variance of $g_n(x)$ are
$$E[g_n(x)] = f(x),$$
$$V[g_n(x)] = \frac{1}{n}\, V\bigl[c(a,b)\,\varphi_c(a\cdot x - b)\bigr] = \frac{1}{n}\Bigl( C_T^2\, E\bigl[\varphi_c(a\cdot x - b)^2\bigr] - f(x)^2 \Bigr) \le \frac{1}{n}\Bigl( C_T^2 - f(x)^2 \Bigr).$$
The last inequality comes from the fact that the maximum value of $\varphi_c$ is normalized to 1.

These relations lead to
$$E\left[\int \bigl(g_n(x) - f(x)\bigr)^2 \mu(x)\, dx\right] = \int E\Bigl[\bigl(g_n(x) - f(x)\bigr)^2\Bigr]\,\mu(x)\, dx = \int V[g_n(x)]\,\mu(x)\, dx \le \frac{C_T^2 - \|f\|^2_{L^2(\mathbb{R}^m),\mu}}{n} \qquad (15)$$
$$\le \frac{C_T^2}{n}. \qquad (16)$$
Inequality (15) shows that the expectation of $\|g_n - f\|^2_{L^2(\mathbb{R}^m),\mu}$ is less than or equal to $(C_T^2 - \|f\|^2_{L^2(\mathbb{R}^m),\mu})/n$, and therefore there exists at least one linear combination $g_n(x)$ whose mean square error is as small as that bound. Also, Inequality (16) shows that there exists a norm-independent upper bound $C_T^2/n$ which does not depend on the input density $\mu(x)$. These evaluations are derived under the assumption of Equation (12). If the transform $T$ belongs to $L^1(\mathbb{R}^{m+1})$, the mean square error of function approximation with a three-layered network is bounded by $O(1/n)$, where $n$ is the number of hidden units. Now we obtain the following theorem and corollary.

Theorem 3. Let $\varphi_c$ be a bell-shaped function which is the activation function of the hidden units, and let $\varphi_d$ be a corresponding decomposing kernel function which satisfies the admissible condition (3). If the absolute integral $C_T$ of the transform $T$ of the function $f$ is bounded, then for an arbitrary input distribution $\mu(x)$ there exists a bell-shaped three-layered network $f_n(x)$ which satisfies
$$\|f_n - f\|^2_{L^2(\mathbb{R}^m),\mu} \le \frac{C_T^2 - \|f\|^2_{L^2(\mathbb{R}^m),\mu}}{n}. \qquad (17)$$

Corollary 4. For an arbitrary input distribution $\mu(x)$, the approximation error of bell-shaped three-layered networks $f_n(x)$ can be bounded as
$$\|f_n - f\|^2_{L^2(\mathbb{R}^m),\mu} \le \frac{C_T^2}{n}. \qquad (18)$$

We should note that the absolute integral $C_T$ depends on the kernels $\varphi_d$ and $\varphi_c$; therefore $C_T$ is more accurately expressed as $C_{T,\varphi_d,\varphi_c}$.
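A small simulation can make the random-coding argument behind Theorem 3 tangible. The sketch below is illustrative only: it posits a density $p(a,b)$ and a coefficient function $c(a,b)$ instead of deriving them from an actual transform, defines a target directly through representation (13), and watches the mean square error shrink as the number of sampled hidden units grows:

```python
import numpy as np

rng = np.random.default_rng(0)

phi_c = lambda z: np.exp(-z**2)          # a bell-shaped unit with max value 1

# Posited ingredients of representation (13): uniform density p(a, b) on a box
# and c(a, b) = sign(.) * C_T.  These are assumptions for the demo only.
A, B, C_T = 3.0, 3.0, 2.0
c_ab = lambda a, b: np.sign(np.sin(2 * a) + 0.3 * b + 0.1) * C_T

def f(x, n_grid=600):
    """Target defined by (13); the dense grid mean approximates the
    integral against the uniform density p on the box."""
    a = np.linspace(-A, A, n_grid)
    b = np.linspace(-B, B, n_grid)
    aa, bb = np.meshgrid(a, b, indexing="ij")
    return np.mean(c_ab(aa, bb) * phi_c(aa * x - bb))

def g_n(xs, n):
    """Random-coding network (14): average of n units with (a_i, b_i) ~ p."""
    a = rng.uniform(-A, A, n)
    b = rng.uniform(-B, B, n)
    return np.mean(c_ab(a, b) * phi_c(np.outer(xs, a) - b), axis=1)

xs = np.linspace(-2, 2, 41)
f_ref = np.array([f(x) for x in xs])
for n in (10, 100, 1000):
    mse = np.mean((g_n(xs, n) - f_ref) ** 2)   # L2 error for uniform mu on xs
    print(f"n = {n:5d}   mse = {mse:.3e}   C_T^2 / n = {C_T**2 / n:.3e}")
```

Averaged over the random draws, the error is bounded by $C_T^2/n$ as in Corollary 4; a single run typically shows the same roughly $1/n$ trend.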

Also, since the above bounds hold for any admissible pair of kernels, the bound for a fixed composing kernel $\varphi_c$ can be stated exactly with
$$C_T = \min_{\varphi_d} C_{T,\varphi_d,\varphi_c} \qquad (19)$$
when the minimum exists, and otherwise with
$$C_T = \inf_{\varphi_d} C_{T,\varphi_d,\varphi_c} + \varepsilon, \qquad \varepsilon > 0. \qquad (20)$$
To achieve the above approximation bound, greedy algorithms (Jones, 1992, etc.) may be applied to representation (13).

The coefficient of the bound given in the above theorem is obtained by a rough estimation. In this evaluation, we suppose that all weights from the hidden units to the output unit have the same absolute value, i.e. $+C_T$ or $-C_T$ (see for example Figure 3). In other words, the approximation depends only on the density of $(a,b)$ and takes no account of the hidden-to-output weight values $c_i$. Another estimation of an approximation bound is discussed in Appendix C, but it results in a bound of the same order, $O(1/n)$.

Fig. 3. The relationship between the transform $T(a,b)$ and the density of $(a,b)$.

This theorem shows that three-layered networks have good properties which make them useful for approximating functions. The approximation bound is inversely proportional to the number of hidden units with respect to the mean square error. Naturally, it can be seen that if there is a sufficiently large number of hidden units, three-layered networks can achieve approximation with any desired accuracy. An important point is that, to improve the accuracy, an exponentially larger number of hidden units is not required, but only a linearly larger number. Also, the number of necessary hidden units is independent of the input dimensionality.

Roughly speaking, in order to reduce the mean square error by half, twice the number of hidden units is needed, not $2^m$ times as many. This means that three-layered networks are free from the "curse of dimensionality".

In addition, this result also shows that a combination of good approximations does not necessarily give a good approximation. For example, suppose we divide the target function $f$ into two functions $f_1$ and $f_2$,
$$f(x) = f_1(x) + f_2(x),$$
and we have two good approximations $g_1$ and $g_2$, each with $n$ hidden units, which satisfy
$$\|g_1 - f_1\|^2_{L^2(\mathbb{R}^m),\mu} \le \frac{C_{T_1}^2}{n}, \qquad \|g_2 - f_2\|^2_{L^2(\mathbb{R}^m),\mu} \le \frac{C_{T_2}^2}{n},$$
where $T_1$ and $T_2$ are the transforms of $f_1$ and $f_2$. In order to approximate the function $f$, we combine the two approximations again. The error bound is evaluated as
$$\bigl\|(g_1 + g_2) - f\bigr\|^2_{L^2(\mathbb{R}^m),\mu} \le \frac{C_{T_1}^2 + C_{T_2}^2}{n}.$$
Note that this combined network has $2n$ hidden units, and whether equality holds depends on the partition of the target function. From the theorem, we obtain the following bound for an approximator $g$ with $2n$ hidden units:
$$\|g - f\|^2_{L^2(\mathbb{R}^m),\mu} \le \frac{(C_{T_1} + C_{T_2})^2}{2n}.$$
This bound is equivalent to the one above when and only when $C_{T_1} = C_{T_2}$, and otherwise it is smaller than the one above, since $(C_{T_1} + C_{T_2})^2 \le 2\,(C_{T_1}^2 + C_{T_2}^2)$ with equality exactly when $C_{T_1} = C_{T_2}$. Although this is only an estimate of the bounds, it implies that there might exist a better approximation than a combination of two approximators if the target function is partitioned inappropriately. In this sense, function approximation by three-layered networks has a nonlinear character.

3.4 Smoothness of Functions and Integrability of Transforms

In practice, it is difficult to confirm whether the transform of the target function is integrable or not. The integrability of the transform should therefore be translated into other, more easily verified conditions.

There are many kinds of conditions under which the transform $T$ belongs to $L^1(\mathbb{R}^{m+1})$. In the following, we only show a sufficient condition for functions with compact support and even composing kernel functions. The integrability is restated in terms of the smoothness of target functions, and hence it is an important condition for practical applications.

Let $C_0^m(\mathbb{R})$ denote the class of functions with compact support belonging to $C^m(\mathbb{R})$. The function $f_e(z)$ is defined as
$$f_e(z) = \int_{e\cdot x = z} f(x)\, dx^{(m-1)}$$
for a unit vector $e \in \mathbb{R}^m$, where $dx^{(m-1)}$ denotes the volume element on the hyper-plane $e\cdot x = z$. $f_e^{(m)}(z)$ denotes the result of differentiating $f_e(z)$ $m$ times.

Theorem 5. Any function $f$ which satisfies the conditions
(i) $f_e \in C_0^m(\mathbb{R})$ for every unit vector $e \in \mathbb{R}^m$, and
(ii) there exist $M \ge 0$ and $\beta$ $(0 < \beta \le 1)$ such that $\bigl|f_e^{(m)}(z) - f_e^{(m)}(z')\bigr| < M\,|z - z'|^\beta$,
has an absolutely integrable transform $T$, i.e. $f$ can be approximated by a three-layered network having $n$ bell-shaped hidden units with $O(1/n)$ accuracy in the sense of the mean square error.

For details of the constructive proof, refer to Appendix D. In practical use, it is almost impossible to confirm the integrability of the transform of a target function directly. Since the continuity condition can be checked more easily than the integrability, Theorem 5 might be widely applicable. The condition stated in Theorem 5 is called the Hölder continuity condition. When all the $m$-th order partial derivatives of $f(x)$ are Hölder continuous with respect to the $m$-dimensional $L^2$-norm, it is clear that $f_e^{(m)}(z)$ is Hölder continuous. Moreover, when $f(x)$ is a bounded $(m+1)$-times differentiable function with compact support, $f(x)$ satisfies the conditions of Theorem 5.

Corollary 6. Any function $f \in C_0^{m+1}(\mathbb{R}^m) \cap L^1(\mathbb{R}^m)$ can be approximated by a three-layered network with $O(1/n)$ accuracy.

It is expected that most of the functions used in practical applications satisfy the above conditions.

From the result of Theorem 5, we make a conjecture concerning the problem of "overfitting" in learning from examples. The theorem implies that if the target function is sufficiently smooth, the parameters stay in an appropriate region, but if there is some discontinuity in the target function, it is not guaranteed that the parameters will stay in a bounded region. An "overfitting" problem occurs when an "outlier" is included. In the case of learning from examples, especially with noise, it is often observed that huge parameter values are needed in order to approximate a peculiar data point which sticks out like a delta function. This overfitting phenomenon is a serious problem in practical applications.

This problem can be restated as follows. In the case of learning from examples, we do not have complete information about the target function. From the family of functions which networks can represent, we choose an imagined target function which locally minimizes a particular loss defined on the examples. "Locally" means that the result of learning depends on the initial values of the parameters and is not always optimal. The process of modifying the parameters towards this imagined target function is called learning. If the network has more hidden units than needed to approximate the examples other than the outlier, some of the hidden units can be used to represent the outlier. In this case, the imagined target function might have an extreme shape, roughly speaking unsmooth and discontinuous like a delta function, and the absolute integral of its transform might be quite huge or might diverge. Consequently, learning would be unstable when hidden units are added carelessly. To avoid this problem, kernel methods are commonly used, and they are validated by our framework: taking a convolution with a smooth kernel function makes the function itself smooth and makes its transform integrable. We expect that our theoretical framework gives a means to analyze various problems of learning from examples.

4 Conclusion

We found an integral representation of functions using ridge functions. Especially when a bell-shaped function is used as the activation function of the hidden units of a three-layered network, there exists a clear and direct correspondence between the integral representation and the three-layered network. It gives an intuitive interpretation of the structure of three-layered networks.

Based on this interpretation and the random coding technique, we gave an approximation bound for three-layered networks in the case of having complete information about the target function. The bound on the mean square error is inversely proportional to the number of hidden units and does not depend on the dimensionality of the input.

From this result, it can be seen that approximations by three-layered networks are not subject to the "curse of dimensionality".

In addition, we discussed a sufficient condition for a class of functions to be well approximated. As seen in the previous discussions, we showed that the smoothness of functions is closely related to the error of approximating functions by combining sigmoidal functions. This condition has a wide range of possible practical applications.

Future work will be dedicated to applying our results to problems of learning from examples. Possible applications would be the estimation of the required number of hidden units for a given accuracy, the estimation of the error rate, the acceleration of the convergence of learning, and warding off "over-learning".

Acknowledgement

The author would like to give very special thanks to the reviewer for his useful comments simplifying the proofs and his careful examination of the manuscript. The author would like to thank Professor S. Amari, Professor S. Yoshizawa, Dr. K.-R. Müller and D. Harada for helpful suggestions and discussions. The present work was supported in part by a Grant-in-Aid for Scientific Research in Priority Areas on Higher-Order Brain Information Processing from the Ministry of Education, Science and Culture of Japan.

References

[1] Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3), 930-945.
[2] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2, 303-314.
[3] Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2, 183-192.
[4] Girosi, F. and Anzellotti, G. (1992). Convergence rates of approximation by translates (Tech. Rep. A.I. Memo 1288). Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
[5] Irie, B. and Miyake, S. (1988). Capabilities of three-layered Perceptrons. In Proceedings of the International Conference on Neural Networks, pp. 641-648.
[6] Jones, L. K. (1992). A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Annals of Statistics, 20(1), 608-613.

[7] Murata, N. and Amari, S. (1993). Approximation bounds for linear combination of sigmoidal functions. In Proceedings of the 1993 International Symposium on Nonlinear Theory and its Applications, pp. 73-76.
[8] Murata, N. (1994). Function approximation by three-layered networks and its error bounds: an integral representation theorem (Tech. Rep. METR 94-9). Department of Mathematical Engineering and Information Physics, University of Tokyo.
[9] Niyogi, P. and Girosi, F. (1994). On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions (Tech. Rep. A.I. Memo 1467). Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
[10] Rumelhart, D., McClelland, J. L., and the PDP Research Group (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge, MA: MIT Press.
[11] White, H. (1990). Connectionist nonparametric regression: multilayer feedforward networks can learn arbitrary mappings. Neural Networks, 3(5), 535-549.

A Proof of the Inversion Theorem

Let us define
$$f_\varepsilon(x) = \frac{1}{(2\pi)^m\, C_{\varphi_d,\varphi_c}} \int_{\mathbb{R}^m}\!\int_{\mathbb{R}}\!\int_{\mathbb{R}^m} \varphi_d(a\cdot y - b)\, f(y)\, \varphi_c(a\cdot x - b)\, e^{-\varepsilon|a|^2}\, dy\, db\, da = \frac{1}{(2\pi)^m\, C_{\varphi_d,\varphi_c}} \int_{\mathbb{R}^m} \left( \int_{\mathbb{R}} \left( \int_{\mathbb{R}^m} \varphi_d(a\cdot y - b)\, f(y)\, dy \right) \varphi_c(a\cdot x - b)\, db \right) e^{-\varepsilon|a|^2}\, da. \qquad \mathrm{(A.1)}$$
From the fact that $f(y) \in L^1(\mathbb{R}^m)$, $\varphi_d$ is bounded, and $\varphi_c \in L^1(\mathbb{R})$, we can see that the right-hand side of Equation (A.1) is absolutely integrable. According to the Fubini-Tonelli theorem, we can change the order of integration in $f_\varepsilon(x)$. As $\varphi_d(a\cdot y - b)$ and $\varphi_c(a\cdot x - b)$ can be seen as functions of $b$ shifted by $a\cdot y$ and $a\cdot x$ respectively, by using Parseval's equality we obtain

$$f_\varepsilon(x) = \frac{1}{(2\pi)^{m+1}\, C_{\varphi_d,\varphi_c}} \int_{\mathbb{R}^m}\!\int_{\mathbb{R}^m}\!\int_{\mathbb{R}} \hat{\varphi}_d(\xi)\,\overline{\hat{\varphi}_c(\xi)}\, e^{-i\xi\, a\cdot(x-y)}\, e^{-\varepsilon|a|^2}\, f(y)\, d\xi\, dy\, da$$
$$= \frac{1}{(2\pi)^{m+1}\, C_{\varphi_d,\varphi_c}} \int\!\!\int\!\!\int \hat{\varphi}_d(\xi)\,\overline{\hat{\varphi}_c(\xi)}\, e^{-\varepsilon\left(a + \frac{i\xi(x-y)}{2\varepsilon}\right)\cdot\left(a + \frac{i\xi(x-y)}{2\varepsilon}\right)}\, e^{-\frac{\xi^2|x-y|^2}{4\varepsilon}}\, f(y)\, da\, d\xi\, dy$$
$$= \frac{1}{2\pi\, C_{\varphi_d,\varphi_c}} \int\!\!\int \hat{\varphi}_d(\xi)\,\overline{\hat{\varphi}_c(\xi)}\, G_\varepsilon\bigl(\xi(x-y)\bigr)\, f(y)\, d\xi\, dy = \frac{1}{2\pi\, C_{\varphi_d,\varphi_c}} \int \frac{\hat{\varphi}_d(\xi)\,\overline{\hat{\varphi}_c(\xi)}}{|\xi|^m}\, \bigl(G_{\varepsilon/\xi^2} * f\bigr)(x)\, d\xi,$$
where
$$G_\varepsilon(x) = \frac{1}{(4\pi\varepsilon)^{m/2}}\, e^{-\frac{|x|^2}{4\varepsilon}}$$
and $*$ denotes convolution.

For $\|\cdot\| = \|\cdot\|_{L^p(\mathbb{R}^m)}$, or $\|\cdot\| = \|\cdot\|_{L^\infty(\mathbb{R}^m)}$ in the case of bounded uniformly continuous functions, from Hölder's inequality we have
$$\|f_\varepsilon - f\| \le \frac{1}{2\pi\, |C_{\varphi_d,\varphi_c}|} \int \frac{\bigl|\hat{\varphi}_d(\xi)\,\overline{\hat{\varphi}_c(\xi)}\bigr|}{|\xi|^m}\, \bigl\|G_{\varepsilon/\xi^2} * f - f\bigr\|\, d\xi = \frac{1}{2\pi\, |C_{\varphi_d,\varphi_c}|} \left[ \int_{|\xi| < \xi_0} \frac{\bigl|\hat{\varphi}_d(\xi)\,\overline{\hat{\varphi}_c(\xi)}\bigr|}{|\xi|^m}\, \bigl\|G_{\varepsilon/\xi^2} * f - f\bigr\|\, d\xi + \int_{|\xi| \ge \xi_0} \frac{\bigl|\hat{\varphi}_d(\xi)\,\overline{\hat{\varphi}_c(\xi)}\bigr|}{|\xi|^m}\, \bigl\|G_{\varepsilon/\xi^2} * f - f\bigr\|\, d\xi \right] \equiv I_1 + I_2. \qquad \mathrm{(A.2)}$$
In either case,
$$\lim_{\varepsilon\to 0} \bigl\|G_\varepsilon * f - f\bigr\| = 0 \qquad \mathrm{(A.3)}$$
and
$$\|G_\varepsilon * f\| \le \|f\| \qquad \mathrm{(A.4)}$$
hold, and therefore
$$\bigl\|G_\varepsilon * f - f\bigr\| \le 2\,\|f\|. \qquad \mathrm{(A.5)}$$

Additionally, we assume the following admissible condition:
$$\int \frac{\bigl|\hat{\varphi}_d(\omega)\,\overline{\hat{\varphi}_c(\omega)}\bigr|}{|\omega|^m}\, d\omega < \infty. \qquad \mathrm{(A.6)}$$
From Equations (A.5) and (A.6), we know that for any $\delta > 0$ we may choose $\xi_0$ so that $|I_1| < \delta/2$; then Equations (A.3) and (A.6) guarantee that we may choose $\varepsilon_0$ so that $\varepsilon < \varepsilon_0$ implies $|I_2| < \delta/2$. Now we can conclude that for any $\delta > 0$ there exists $\varepsilon_0$ such that $\varepsilon < \varepsilon_0$ implies $\|f_\varepsilon - f\| < \delta$.

B An Example of a Decomposing Kernel

The decomposing kernel
$$\varphi_d(z) = \begin{cases} \dfrac{d^m}{dz^m}\,\psi(z) & \text{if } m \text{ is even}, \\[6pt] \dfrac{d^{m+1}}{dz^{m+1}}\,\psi(z) & \text{if } m \text{ is odd}, \end{cases}$$
can be confirmed to satisfy the admissible condition (3) as follows, where $m$ is the dimension of the input $x$ and $\psi$ is given by
$$\psi(z) = \begin{cases} \exp\!\left(\dfrac{1}{|z|^2 - 1}\right) & \text{if } |z| < 1, \\[4pt] 0 & \text{if } |z| \ge 1. \end{cases}$$

When $m$ is an even number,
$$\int \frac{\bigl|\hat{\varphi}_d(\omega)\,\overline{\hat{\varphi}_c(\omega)}\bigr|}{|\omega|^m}\, d\omega = \int \frac{\bigl|(i\omega)^m\,\hat{\psi}(\omega)\,\overline{\hat{\varphi}_c(\omega)}\bigr|}{|\omega|^m}\, d\omega = \int \bigl|\hat{\psi}(\omega)\,\overline{\hat{\varphi}_c(\omega)}\bigr|\, d\omega \le \left(\int |\hat{\psi}(\omega)|^2\, d\omega\right)^{1/2} \left(\int |\hat{\varphi}_c(\omega)|^2\, d\omega\right)^{1/2} = 2\pi \left(\int |\psi(z)|^2\, dz\right)^{1/2} \left(\int |\varphi_c(z)|^2\, dz\right)^{1/2} < \infty$$
holds. When $m$ is an odd number,
$$\int \frac{\bigl|\hat{\varphi}_d(\omega)\,\overline{\hat{\varphi}_c(\omega)}\bigr|}{|\omega|^m}\, d\omega = \int \frac{\bigl|(i\omega)^{m+1}\,\hat{\psi}(\omega)\,\overline{\hat{\varphi}_c(\omega)}\bigr|}{|\omega|^m}\, d\omega = \int \bigl|(i\omega)\,\hat{\psi}(\omega)\bigr|\,\bigl|\overline{\hat{\varphi}_c(\omega)}\bigr|\, d\omega \le \left(\int |(i\omega)\,\hat{\psi}(\omega)|^2\, d\omega\right)^{1/2} \left(\int |\hat{\varphi}_c(\omega)|^2\, d\omega\right)^{1/2} = 2\pi \left(\int |\psi'(z)|^2\, dz\right)^{1/2} \left(\int |\varphi_c(z)|^2\, dz\right)^{1/2} < \infty$$
holds, where $\psi'(z)$ denotes the derivative of $\psi(z)$.

C Another Bound Estimation

Using any function $c(a,b)$ such that $\mathrm{Re}\,T(a,b)/c(a,b)$ is a positive real number and
$$\int \frac{\mathrm{Re}\,T(a,b)}{c(a,b)}\, da\, db = 1,$$
and defining $p(a,b) = \mathrm{Re}\,T(a,b)/c(a,b)$, we can obtain the inequality

$$E\left[\int \bigl(g_n(x) - f(x)\bigr)^2 \mu(x)\, dx\right] = \frac{1}{n}\int \Bigl\{ E\Bigl[\bigl(c(a,b)\,\varphi_c(a\cdot x - b)\bigr)^2\Bigr] - f(x)^2 \Bigr\}\,\mu(x)\, dx \le \frac{1}{n}\Bigl\{ \sup_{a,b}\, |c(a,b)|^2\, \bigl\|\varphi_c(a\cdot x - b)\bigr\|^2_{L^2(\mathbb{R}^m),\mu} - \|f\|^2_{L^2(\mathbb{R}^m),\mu} \Bigr\}. \qquad \mathrm{(C.1)}$$
In this case the order of the approximation error is equal to that of the previous evaluation, namely $O(1/n)$, but the integrability condition on the transform $T$ is reduced from condition (12),
$$\int \bigl|\mathrm{Re}\,T(a,b)\bigr|\, da\, db < \infty,$$
to
$$\int \frac{\mathrm{Re}\,T(a,b)}{c(a,b)}\, da\, db < \infty. \qquad \mathrm{(C.2)}$$
Clearly, this new condition (C.2) is weaker than condition (12) of Theorem 3, because for larger $|a|$ the norm $\|\varphi_c(a\cdot x - b)\|_{L^2(\mathbb{R}^m),\mu}$ becomes smaller, and therefore $c(a,b)$ can take large values without changing the bound in Inequality (C.1).

D Proof of Theorem 5

Let us assume that $\varphi_c$ is in $C^2(\mathbb{R})$. Since $\hat{\varphi}_c$ is bounded, $\int |\omega|\,\bigl|\hat{\psi}(\omega)\,\hat{\varphi}_c(\omega)\bigr|\, d\omega < \infty$ holds, and we define
$$\varphi_d(z) = \begin{cases} \dfrac{d^{m+2}}{dz^{m+2}}\,\psi(z) & \text{if } m \text{ is even}, \\[6pt] \dfrac{d^{m+1}}{dz^{m+1}}\,\psi(z) & \text{if } m \text{ is odd}, \end{cases}$$
with $\psi$ taken from Equation (10). Noting that $\varphi_c$ is twice differentiable, we can see that $\varphi_d$ and $\varphi_c$ satisfy the admissible condition (3) in a manner similar to Section 3.2.

In this case,
$$\hat{\varphi}_d(\omega) = \begin{cases} (i\omega)^{m+2}\,\hat{\psi}(\omega) & \text{if } m \text{ is even}, \\[4pt] (i\omega)^{m+1}\,\hat{\psi}(\omega) & \text{if } m \text{ is odd}, \end{cases}$$
and
$$\int z^k\,\varphi_d(z)\, dz = i^k \left.\frac{d^k}{d\omega^k}\,\hat{\varphi}_d(\omega)\right|_{\omega=0} \qquad \mathrm{(D.1)}$$
hold, and hence the $k$-th moment as calculated in Equation (D.1) vanishes if $k \le m$ when $m$ is odd, or if $k \le m+1$ when $m$ is even. In either case the moments vanish up to at least the $m$-th order.

Let $f$ be in $C_0^m(\mathbb{R}^m)$, where $C_0^m(\mathbb{R}^m)$ denotes the class of functions with compact support belonging to $C^m(\mathbb{R}^m)$. Let $f_e(z)$ be
$$f_e(z) = \int_{e\cdot x = z} f(x)\, dx^{(m-1)}$$
for a unit vector $e \in \mathbb{R}^m$, where $dx^{(m-1)}$ denotes the volume element on the hyper-plane $e\cdot x = z$. Up to the constant factor in Equation (6), we can rewrite the transform $T$ as
$$T_e(\alpha, b) = \int \varphi_d(a\cdot x - b)\, f(x)\, dx = \int \varphi_d(\alpha z - b)\, f_e(z)\, dz,$$
where $\alpha = |a|$ and $e = a/|a|$. We assume that for any $e$,
$$f_e(z) \in C_0^m(\mathbb{R}), \qquad \mathrm{(D.2)}$$
and that there exist $M \ge 0$ and $\beta$ $(0 < \beta \le 1)$ such that
$$\bigl|f_e^{(m)}(z) - f_e^{(m)}(z')\bigr| < M\,|z - z'|^\beta. \qquad \mathrm{(D.3)}$$

Then
$$|T_e(\alpha, b)| = \frac{1}{\alpha}\left|\int \varphi_d(z)\, f_e\!\left(\frac{z+b}{\alpha}\right) dz\right| = \frac{1}{\alpha}\left|\int \varphi_d(z)\left\{ f_e\!\left(\frac{z+b}{\alpha}\right) - \sum_{k=0}^{m-1} \frac{1}{k!}\left(\frac{z}{\alpha}\right)^{k} f_e^{(k)}\!\left(\frac{b}{\alpha}\right) - \frac{1}{m!}\left(\frac{z}{\alpha}\right)^{m} f_e^{(m)}\!\left(\frac{b}{\alpha}\right)\right\} dz\right|$$
$$= \frac{1}{\alpha}\left|\int \varphi_d(z)\,\frac{1}{m!}\left(\frac{z}{\alpha}\right)^{m}\left\{ f_e^{(m)}\!\left(\frac{b + \theta z}{\alpha}\right) - f_e^{(m)}\!\left(\frac{b}{\alpha}\right)\right\} dz\right| \le \frac{M}{m!\,\alpha}\int |\varphi_d(z)|\left|\frac{z}{\alpha}\right|^{m+\beta} dz = \frac{C_1}{\alpha^{m+1+\beta}},$$
where $\theta$ satisfies $0 < \theta < 1$ and $C_1$ is a constant. Here the Taylor expansion of $f_e$ around $b/\alpha$ is used; the subtracted polynomial terms integrate to zero against $\varphi_d$ because the moments of $\varphi_d$ vanish up to the $m$-th order, in particular the $m$-th order moment term
$$\int \varphi_d(z)\, z^m\, f_e^{(m)}\!\left(\frac{b}{\alpha}\right) dz = 0$$
is inserted, and Equation (D.3) is applied to the remainder. Since $T_e(\alpha, b)$ is also bounded, we obtain the integrable inequality
$$|T_e(\alpha, b)| \le \frac{C_2}{1 + \alpha^{m+1+\beta}},$$
where $C_2$ is a constant. From the fact that $f$ and $\varphi_d$ have compact support, $T(a,b)$ also has compact support with respect to $b$ when $a$ is fixed, and the size of that support is bounded by $C_3\,\alpha + C_4$ with appropriate constants $C_3$ and $C_4$. Let $d\Omega$ be the volume element of the unit hyper-sphere in $\mathbb{R}^m$. The absolute integral $C_T$ is then bounded, up to the constant factor in Equation (6), by
$$\int |T_e(\alpha, b)|\, \alpha^{m-1}\, d\alpha\, d\Omega\, db,$$
and we see that
$$\int |T_e(\alpha, b)|\, \alpha^{m-1}\, d\alpha\, d\Omega\, db \le C_5 \int \alpha^{m-1}\, (C_3\,\alpha + C_4)\, \frac{C_2}{1 + \alpha^{m+1+\beta}}\, d\alpha < \infty$$
holds, where $C_5$ is a definite integral with respect to $d\Omega$. Therefore, Equations (D.2) and (D.3) are sufficient conditions for the integrability of the transform $T$.


Richard DiSalvo. Dr. Elmer. Mathematical Foundations of Economics. Fall/Spring, The Finite Dimensional Normed Linear Space Theorem Richard DiSalvo Dr. Elmer Mathematical Foundations of Economics Fall/Spring, 20-202 The claim that follows, which I have called the nite-dimensional normed

More information

Identication and Control of Nonlinear Systems Using. Neural Network Models: Design and Stability Analysis. Marios M. Polycarpou and Petros A.

Identication and Control of Nonlinear Systems Using. Neural Network Models: Design and Stability Analysis. Marios M. Polycarpou and Petros A. Identication and Control of Nonlinear Systems Using Neural Network Models: Design and Stability Analysis by Marios M. Polycarpou and Petros A. Ioannou Report 91-09-01 September 1991 Identication and Control

More information

WHY DEEP NEURAL NETWORKS FOR FUNCTION APPROXIMATION SHIYU LIANG THESIS

WHY DEEP NEURAL NETWORKS FOR FUNCTION APPROXIMATION SHIYU LIANG THESIS c 2017 Shiyu Liang WHY DEEP NEURAL NETWORKS FOR FUNCTION APPROXIMATION BY SHIYU LIANG THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical and Computer

More information

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector On Minimax Filtering over Ellipsoids Eduard N. Belitser and Boris Y. Levit Mathematical Institute, University of Utrecht Budapestlaan 6, 3584 CD Utrecht, The Netherlands The problem of estimating the mean

More information

COS 424: Interacting with Data

COS 424: Interacting with Data COS 424: Interacting with Data Lecturer: Rob Schapire Lecture #14 Scribe: Zia Khan April 3, 2007 Recall from previous lecture that in regression we are trying to predict a real value given our data. Specically,

More information

Introduction Wavelet shrinage methods have been very successful in nonparametric regression. But so far most of the wavelet regression methods have be

Introduction Wavelet shrinage methods have been very successful in nonparametric regression. But so far most of the wavelet regression methods have be Wavelet Estimation For Samples With Random Uniform Design T. Tony Cai Department of Statistics, Purdue University Lawrence D. Brown Department of Statistics, University of Pennsylvania Abstract We show

More information

Feed-forward Network Functions

Feed-forward Network Functions Feed-forward Network Functions Sargur Srihari Topics 1. Extension of linear models 2. Feed-forward Network Functions 3. Weight-space symmetries 2 Recap of Linear Models Linear Models for Regression, Classification

More information

Approximating the Best Linear Unbiased Estimator of Non-Gaussian Signals with Gaussian Noise

Approximating the Best Linear Unbiased Estimator of Non-Gaussian Signals with Gaussian Noise IEICE Transactions on Information and Systems, vol.e91-d, no.5, pp.1577-1580, 2008. 1 Approximating the Best Linear Unbiased Estimator of Non-Gaussian Signals with Gaussian Noise Masashi Sugiyama (sugi@cs.titech.ac.jp)

More information

Computation Of Asymptotic Distribution. For Semiparametric GMM Estimators. Hidehiko Ichimura. Graduate School of Public Policy

Computation Of Asymptotic Distribution. For Semiparametric GMM Estimators. Hidehiko Ichimura. Graduate School of Public Policy Computation Of Asymptotic Distribution For Semiparametric GMM Estimators Hidehiko Ichimura Graduate School of Public Policy and Graduate School of Economics University of Tokyo A Conference in honor of

More information

On the Noise Model of Support Vector Machine Regression. Massimiliano Pontil, Sayan Mukherjee, Federico Girosi

On the Noise Model of Support Vector Machine Regression. Massimiliano Pontil, Sayan Mukherjee, Federico Girosi MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 1651 October 1998

More information

Temporal Backpropagation for FIR Neural Networks

Temporal Backpropagation for FIR Neural Networks Temporal Backpropagation for FIR Neural Networks Eric A. Wan Stanford University Department of Electrical Engineering, Stanford, CA 94305-4055 Abstract The traditional feedforward neural network is a static

More information

Outline of Fourier Series: Math 201B

Outline of Fourier Series: Math 201B Outline of Fourier Series: Math 201B February 24, 2011 1 Functions and convolutions 1.1 Periodic functions Periodic functions. Let = R/(2πZ) denote the circle, or onedimensional torus. A function f : C

More information

Neural Networks and the Back-propagation Algorithm

Neural Networks and the Back-propagation Algorithm Neural Networks and the Back-propagation Algorithm Francisco S. Melo In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely

More information

Local minima and plateaus in hierarchical structures of multilayer perceptrons

Local minima and plateaus in hierarchical structures of multilayer perceptrons Neural Networks PERGAMON Neural Networks 13 (2000) 317 327 Contributed article Local minima and plateaus in hierarchical structures of multilayer perceptrons www.elsevier.com/locate/neunet K. Fukumizu*,

More information

October 7, :8 WSPC/WS-IJWMIP paper. Polynomial functions are renable

October 7, :8 WSPC/WS-IJWMIP paper. Polynomial functions are renable International Journal of Wavelets, Multiresolution and Information Processing c World Scientic Publishing Company Polynomial functions are renable Henning Thielemann Institut für Informatik Martin-Luther-Universität

More information

In the Name of God. Lectures 15&16: Radial Basis Function Networks
