An Integral Representation of Functions using Three-layered Networks and Their Approximation Bounds


Noboru Murata 1
Department of Mathematical Engineering and Information Physics, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, JAPAN

Abstract

Neural networks are widely known to provide a method of approximating nonlinear functions. In order to clarify this approximation ability, a new theorem on an integral transform of ridge functions is presented. Using this theorem, an approximation bound, which evaluates the quantitative relationship between the approximation accuracy and the number of elements in the hidden layer, can be obtained. This result shows that the approximation accuracy depends on the smoothness of the target function. It also shows that approximation methods which use ridge functions are free from the "curse of dimensionality".

Keywords: integral transform, ridge function, three-layered network, approximation bound, random coding, curse of dimensionality

1 Introduction

In the middle of the 1980s, computational research on neural networks was revitalized by the work of the Parallel Distributed Processing (PDP) group (Rumelhart et al., 1986). In this movement, multi-layered networks with sigmoidal activation functions, together with back-propagation learning, played an important role. The numerous examples provided by the PDP group attracted the interest of many other researchers, and a large number of subsequent computer simulations have shown that multi-layered networks can be usefully applied to practical problems such as image processing, speech recognition, and system control.

1 Currently staying at GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany, supported by the Alexander von Humboldt-Stiftung.

Preprint submitted to Elsevier Science, November 1995

Advantages of these simple learning machines can be summarized in the following two points. One is that the "back-propagation" learning algorithm can be easily implemented on computers, because the algorithm is essentially a gradient descent method, and because the derivative of the sigmoidal function depends only on its output, so that the gradient with respect to each parameter can be calculated locally. The other is the fact that a three-layered network can approximate an arbitrary function with any desired accuracy if it has sufficiently many hidden units. Irie and Miyake (1988), Funahashi (1989), Cybenko (1989) and White (1990) proved this fact mainly based on the Fourier transform. Funahashi (1989) also gave another proof making use of Kolmogorov and Arnol'd's theorem and Sprecher's theorem. Recently Jones (1992), Barron (1993) and Girosi and Anzellotti (1992) showed that the approximation error is inversely proportional to the number of hidden units with respect to the mean square or sup norm criterion for a certain class of functions. Their results are interesting and important because they indicate that greedy algorithms, such as multi-layered networks and radial basis approximators with gradient descent learning, can avoid the "curse of dimensionality".

In this paper, we discuss the second advantage from another point of view. As the first step toward clarifying the performance and the limitations of three-layered networks as function approximators, we focus on the case where we have complete information about the target function. We do not consider learning from examples in this paper. We show that the structure of multi-layered networks has a property well suited to approximating nonlinear functions.

First, we define the integral transform and the inverse transform of functions using ridge functions. The inversion formula gives a precise representation of functions and also provides a reasonable interpretation of the structure of three-layered networks. From the correspondence between transformation coefficients and network parameters, we obtain an intuitive interpretation of the network parameters.

Second, applying this result, we give a bound on the approximation accuracy of a three-layered network with a finite number of hidden units. Using the random coding technique, which is well known in the field of information theory, it is shown that the mean square error of the approximation is inversely proportional to the number of hidden units. This means that if higher accuracy is desired, the number of required hidden units increases not exponentially with respect to the input dimension, but simply linearly with the increase in accuracy. Hence it follows that three-layered networks are free of the "curse of dimensionality".

Third, we discuss the relationship between the smoothness of target functions and the approximation errors of three-layered networks, and we give a class of functions which can be well approximated by three-layered networks. We also suggest that there is a close connection between the smoothness of target functions and the magnitude of network parameters, and that some smoothness conditions might guarantee the convergence of learning.

2 Integral Transform using Ridge Functions

First, we define ridge functions.

Definition 1. When a function $F : \mathbb{R}^m \to \mathbb{R}$ can be written as
$$F(x) = G(a \cdot x - b), \qquad (1)$$
with a vector $a \in \mathbb{R}^m$, a real number $b$ and an appropriate function $G : \mathbb{R} \to \mathbb{R}$, it is called a ridge function.

In other words, a ridge function takes the same value on those hyper-planes in $\mathbb{R}^m$ whose normal vectors are parallel to $a$ (see for example Figure 1). Clearly, the input-output relation of a neuron in a conventional artificial neural network, i.e. a weighted sum followed by a sigmoidal activation function, belongs to the class of ridge functions given in Equation (1). We should note that when $m \ge 2$, $F$ is not integrable on $\mathbb{R}^m$ even if $G$ is integrable on $\mathbb{R}$.

In the following, we discuss a method of approximating a function $f : \mathbb{R}^m \to \mathbb{R}$ using a linear combination of ridge functions. Let us assume that the function $f$ belongs to $L^1(\mathbb{R}^m) \cap L^p(\mathbb{R}^m)$ $(1 \le p < \infty)$, or that $f$ is bounded and uniformly continuous. Let a pair of functions $\varphi_d, \varphi_c \in L^1(\mathbb{R}) \cap L^2(\mathbb{R})$ be bounded and satisfy the following conditions:
$$\hat{\varphi}_d(\omega)\,\overline{\hat{\varphi}_c(\omega)} = \hat{\varphi}_d(-\omega)\,\overline{\hat{\varphi}_c(-\omega)}, \qquad (2)$$
$$\int \frac{\bigl|\hat{\varphi}_d(\omega)\,\overline{\hat{\varphi}_c(\omega)}\bigr|}{|\omega|^m}\, d\omega < \infty \qquad (3)$$
and
$$\int \frac{\hat{\varphi}_d(\omega)\,\overline{\hat{\varphi}_c(\omega)}}{|\omega|^m}\, d\omega \ne 0, \qquad (4)$$
where $\hat{\cdot}$ denotes the Fourier transform and $\overline{\cdot}$ denotes the complex conjugate. The indices $d$ and $c$ indicate the decomposing and the composing kernel, respectively.
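To make Definition 1 concrete, the following small numerical sketch (not part of the paper; the choice of $G$, $a$ and $b$ is arbitrary) evaluates a ridge function on $\mathbb{R}^2$ and checks that it is constant along directions orthogonal to $a$:

```python
import numpy as np

# A ridge function F(x) = G(a . x - b) on R^2, with an arbitrary profile G.
G = np.tanh                      # any function G: R -> R works here
a = np.array([2.0, -1.0])        # normal vector of the level hyper-planes
b = 0.5

def F(x):
    """Ridge function of Definition 1 evaluated at a point x in R^2."""
    return G(np.dot(a, x) - b)

# Moving along a direction orthogonal to a does not change the value:
x0 = np.array([0.3, 1.2])
orth = np.array([1.0, 2.0])      # a . orth = 0
orth = orth / np.linalg.norm(orth)
vals = [F(x0 + t * orth) for t in np.linspace(-3, 3, 7)]
print(np.allclose(vals, vals[0]))   # True: constant on the hyper-plane
```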

Fig. 1. An example of a ridge function on $\mathbb{R}^2$.

We define
$$C_{\varphi_d,\varphi_c} = \frac{1}{2\pi}\int \frac{\hat{\varphi}_d(\omega)\,\overline{\hat{\varphi}_c(\omega)}}{|\omega|^m}\, d\omega, \qquad (5)$$
which is finite and non-zero by conditions (3) and (4). The transform $T$ of a function $f$ with respect to the kernels $\varphi_d$ and $\varphi_c$ is defined by
$$T(a,b) = \frac{1}{(2\pi)^m\, C_{\varphi_d,\varphi_c}} \int_{\mathbb{R}^m} \varphi_d(a\cdot x - b)\, f(x)\, dx. \qquad (6)$$
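As an illustration of Equation (6) only (this computation is not part of the paper), the transform of a one-dimensional target can be approximated on a grid by plain quadrature. The decomposing kernel and the constant below are placeholders; an actual choice would have to satisfy conditions (2)-(4):

```python
import numpy as np

# Toy one-dimensional target f in L1(R), so m = 1.
f = lambda x: np.exp(-x**2) * np.cos(3 * x)

# Hypothetical decomposing kernel phi_d (a Gaussian second derivative here);
# whether it is admissible for a given phi_c must be checked via (2)-(4).
phi_d = lambda z: (z**2 - 1.0) * np.exp(-z**2 / 2)

C = 1.0          # stands in for (2*pi)^m * C_{phi_d, phi_c}; not computed here

xs = np.linspace(-8, 8, 4001)    # quadrature grid for the x-integral
dx = xs[1] - xs[0]

def T(a, b):
    """Riemann-sum approximation of the transform (6) for scalar a, b."""
    return np.sum(phi_d(a * xs - b) * f(xs)) * dx / C

print(T(1.0, 0.0), T(2.0, 0.5))
```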

Then we can prove the following inversion theorem.

Theorem 2. Using the transform $T$ given in Equation (6), the function $f$ can be represented as
$$f(x) = \lim_{\varepsilon \to 0} \int_{\mathbb{R}^{m+1}} \varphi_c(a\cdot x - b)\, T(a,b)\, e^{-\varepsilon|a|^2}\, da\, db. \qquad (7)$$
If $f \in L^1(\mathbb{R}^m) \cap L^p(\mathbb{R}^m)$ $(1 \le p < \infty)$, the right-hand side of Equation (7) converges to $f$ in the sense of the $L^p$-norm, and if $f$ is bounded and uniformly continuous, it converges in the sense of the $L^\infty$-norm.

The proof is given in Appendix A. Examples of the kernels $\varphi_d$ and $\varphi_c$ will be given in Section 3. When $m = 1$, i.e. $f$ is a function on $\mathbb{R}$, we can see a close relation between this transform and the wavelet transform. When $m \ge 2$, this kind of relation cannot be found, because wavelets belong to $L^2(\mathbb{R}^m)$, i.e. functions are decomposed into finite-energy waves, while $\varphi_c(a\cdot x - b)$ is not square-integrable with respect to $x$.

From the fact that $f$ belongs to $L^1(\mathbb{R}^m)$ and $\varphi_d$ is bounded, we see that $T$ is bounded, i.e. $T(a,b) \in L^\infty(\mathbb{R}^{m+1})$. We can also derive
$$\int |T(a,b)|\, db = \frac{1}{(2\pi)^m\, |C_{\varphi_d,\varphi_c}|} \int \left| \int_{\mathbb{R}^m} \varphi_d(a\cdot x - b)\, f(x)\, dx \right| db \le \frac{1}{(2\pi)^m\, |C_{\varphi_d,\varphi_c}|} \int_{\mathbb{R}^m} \left( \int |\varphi_d(a\cdot x - b)|\, db \right) |f(x)|\, dx < \infty.$$
Even though $T(a,b)$ is bounded and $\varphi_c$ belongs to $L^1(\mathbb{R})$,
$$f(x) = \int_{\mathbb{R}^{m+1}} \varphi_c(a\cdot x - b)\, T(a,b)\, da\, db \qquad (8)$$
might be a divergent integral. Hence the convergence of the integral in Equation (7) is guaranteed by the convergence factor $e^{-\varepsilon|a|^2}$. Note that although we adopt a Gaussian kernel for simplicity of calculation, other functions are also available for this purpose.

Equation (6) can be seen as a map from $L^1(\mathbb{R}^m)$ to $L^\infty(\mathbb{R}^{m+1})$. While the Fourier transform and the inverse Fourier transform give a one-to-one correspondence between $L^2(\mathbb{R}^m)$ and $L^2(\mathbb{R}^m)$, Equation (6) has redundancy.

In other words, $f$ could for example be represented by two different transforms $T^{(1)}(a,b)$ and $T^{(2)}(a,b)$ with respect to two different decomposing kernel functions $\varphi_d^{(1)}$ and $\varphi_d^{(2)}$, both of which satisfy the admissible condition expressed in Equation (3) with the same composing kernel $\varphi_c$. In general, there is a family of decomposing kernels $\{\varphi_d^{(1)}, \varphi_d^{(2)}, \ldots\}$, and we can choose a convenient kernel $\varphi_d$ among them by imposing appropriate properties such as compactness of support and smoothness. We use this freedom to investigate the relationship between the smoothness of target functions and the approximation ability in Section 3.4.

3 Application to Three-layered Networks

From the previous result, we can evaluate some aspects of approximating functions using three-layered networks.

3.1 Three-layered Networks with Bell-shaped Functions

In the engineering field, a sigmoidal function $\sigma$ with the following characteristics,
$$\lim_{z\to\infty}\sigma(z) = 1, \qquad \lim_{z\to-\infty}\sigma(z) = 0, \qquad \frac{d}{dz}\sigma(z) = \sigma'(z) > 0, \qquad \lim_{z\to\pm\infty}\sigma'(z) = 0,$$
is usually used as the activation function of the hidden units. The monotonicity and smoothness of sigmoidal functions are advantageous for the back-propagation learning algorithm. Since sigmoidal functions do not belong to $L^1(\mathbb{R})$, we cannot directly apply the previous representation result to this type of network. We therefore discard the assumption of monotonicity and define a bell-shaped function $\varphi_c$ which satisfies the following integrability and shape conditions:
$$\int \varphi_c(z)\, dz < \infty, \qquad \lim_{z\to\pm\infty}\varphi_c(z) = 0, \qquad \varphi_c(z) \ge 0, \qquad \max_z \varphi_c(z) = 1, \qquad \varphi_c \text{ is unimodal}.$$
In short, a bell-shaped function is a unimodal $L^1(\mathbb{R})$ function whose maximum value is 1.
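As a quick check of these conditions (a sketch added here, not in the original), one concrete bell-shaped function is the difference of two logistic sigmoids rescaled so that its peak value is 1, which is exactly the construction described in the next paragraphs:

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # logistic sigmoid

h = 1.0                                      # positive shift constant
c = 1.0 / (2 * sigma(h) - 1)                 # normalizes the peak at z = 0 to 1

phi_c = lambda z: c * (sigma(z + h) - sigma(z - h))

zs = np.linspace(-50, 50, 200001)
vals = phi_c(zs)
dz = zs[1] - zs[0]
print("max value:", vals.max())              # ~ 1, attained at z = 0
print("integral :", vals.sum() * dz)         # finite, so phi_c is in L1(R)
print("tails    :", phi_c(-50.0), phi_c(50.0))   # ~ 0 as z -> +/- infinity
```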

In the following, we consider networks composed of such bell-shaped units, in which for an input $x \in \mathbb{R}^m$ the output $f_n(x)$ is calculated as
$$f_n(x) = \sum_{i=1}^{n} c_i\, \varphi_c(a_i\cdot x - b_i), \qquad (9)$$
where $a_i \in \mathbb{R}^m$, $b_i, c_i \in \mathbb{R}$, and $n$ denotes the number of hidden units.

A bell-shaped function can be constructed from two appropriate sigmoidal functions. For example,
$$\varphi_c(z) = c\,\bigl(\sigma(z+h) - \sigma(z-h)\bigr),$$
where $c$ is a constant which normalizes the maximum value to 1 and $h$ is a positive constant. In this case, a network with a given number of bell-shaped hidden units can instead be constructed from twice as many sigmoidal hidden units. In general, the family of sigmoidal networks includes that of bell-shaped networks as a subset. We can also represent a bell-shaped function as the derivative of a sigmoidal function,
$$\varphi_c(z) = \frac{d}{dz}\sigma(z).$$
In this case $\varphi_c$ is clearly integrable.

3.2 Representing Functions using Three-layered Networks

Using the integral transform defined in the previous section, an arbitrary $f \in L^1(\mathbb{R}^m)$ can be written as
$$f(x) = \lim_{\varepsilon\to 0}\int_{\mathbb{R}^{m+1}} \varphi_c(a\cdot x - b)\, T(a,b)\, e^{-\varepsilon|a|^2}\, da\, db.$$
Then we see that a three-layered network composed of bell-shaped functions can approximate an arbitrary function $f \in L^1(\mathbb{R}^m)$ with any demanded accuracy if the network has a sufficient number of hidden units. Here $T(a,b)$ is calculated using Equation (6) with a function $\varphi_d$ which satisfies the admissible condition (3). For any fixed $\varepsilon > 0$ corresponding to the desired accuracy, $T(a,b)\,e^{-\varepsilon|a|^2}$ can be approximated with a sufficient number of hidden units because it belongs to $L^1(\mathbb{R}^{m+1})$.
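The finite network of Equation (9) is easy to state in code. The following sketch (added for illustration, with arbitrary parameter values) evaluates $f_n$ for bell-shaped units built from pairs of sigmoids as above:

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
h = 1.0
norm = 1.0 / (2 * sigma(h) - 1)
phi_c = lambda z: norm * (sigma(z + h) - sigma(z - h))   # bell-shaped unit

def f_n(x, A, b, c):
    """Three-layered network of Eq. (9).

    x : (m,) input vector
    A : (n, m) input-to-hidden weights a_i
    b : (n,) hidden thresholds b_i
    c : (n,) hidden-to-output weights c_i
    """
    return np.dot(c, phi_c(A @ x - b))

rng = np.random.default_rng(0)
m, n = 3, 5                       # input dimension and number of hidden units
A = rng.normal(size=(n, m))
b = rng.normal(size=n)
c = rng.normal(size=n)
print(f_n(np.array([0.2, -1.0, 0.7]), A, b, c))
```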

Consequently, noting that a bell-shaped function can be represented by a linear combination of two sigmoidal functions, a three-layered sigmoidal network can also approximate arbitrary functions.

We can see a straightforward correspondence between the integral representation presented above and the structure of three-layered networks: the parameters $a$ and $b$ correspond to the weights and the thresholds from the input layer to the hidden layer, and the transform $T(a,b)$ corresponds to the weights from the hidden layer to the output layer (see Figure 2).

Fig. 2. The relationship between the transform $T(a,b)$ and the three-layered network: the hidden units compute $\varphi_c(a_i\cdot x - b_i)$ from the input $x$, and the output unit forms the sum $\sum_{i=1}^{n} c_i\,\varphi_c(a_i\cdot x - b_i)$, the finite analogue of $\int_{\mathbb{R}^{m+1}} T(a,b)\,\varphi_c(a\cdot x - b)\, da\, db$.

If the composing kernel $\varphi_c$ is sufficiently smooth, we can use a decomposing kernel $\varphi_d$ whose Fourier transform is given by
$$\hat{\varphi}_d(\omega) = |\omega|^m\, \hat{\varphi}_c(\omega).$$
In this case, a sufficient condition for the existence of the inverse Fourier transform $\varphi_d$ is that $\hat{\varphi}_c$ is bounded and $\varphi_c$ is $(m+2)$-times differentiable.

If the composing kernel $\varphi_c$ is even and sufficiently smooth, we can instead construct a decomposing kernel $\varphi_d$ with compact support as follows.

Choose an even function $\psi \in C_0^\infty(\mathbb{R})$ which satisfies
$$\psi(z) \ge 0, \qquad \psi(z) = 0 \text{ if } |z| \ge 1, \qquad 0 < \int \psi(z)\, dz < \infty,$$
where $C_0^\infty(\mathbb{R})$ denotes the class of functions on $\mathbb{R}$ which belong to $C^\infty(\mathbb{R})$ and have compact support. For example,
$$\psi(z) = \begin{cases} \exp\!\left(\dfrac{1}{|z|^2 - 1}\right) & \text{if } |z| < 1, \\[4pt] 0 & \text{if } |z| \ge 1 \end{cases} \qquad (10)$$
is available. A decomposing kernel is then given by
$$\varphi_d(z) = \begin{cases} \dfrac{d^m}{dz^m}\,\psi(z) & \text{if } m \text{ is even}, \\[6pt] \dfrac{d^{m+1}}{dz^{m+1}}\,\psi(z) & \text{if } m \text{ is odd}, \end{cases}$$
depending on the dimensionality $m$ of the input $x$. We confirm in Appendix B that these kernels satisfy the admissible condition expressed in Equation (3).
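For the case $m = 1$ the decomposing kernel above is simply the second derivative of the bump function (10). The following sketch (not from the paper) builds it symbolically and confirms that it stays bounded and inherits the compact support of $\psi$:

```python
import numpy as np
import sympy as sp

z = sp.symbols('z')
psi_expr = sp.exp(1 / (z**2 - 1))          # Eq. (10) on |z| < 1; zero outside

m = 1                                       # odd input dimension
phi_d_expr = sp.diff(psi_expr, z, m + 1)    # d^{m+1} psi / dz^{m+1}

psi = sp.lambdify(z, psi_expr, 'numpy')
phi_d = sp.lambdify(z, phi_d_expr, 'numpy')

zs = np.linspace(-0.999, 0.999, 2001)       # strictly inside the support (-1, 1)
print("sup |psi|   :", np.max(np.abs(psi(zs))))
print("sup |phi_d| :", np.max(np.abs(phi_d(zs))))
# For |z| >= 1, psi and all of its derivatives vanish identically,
# so phi_d also has compact support.
```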

3.3 Approximation Bounds of Bell-shaped Networks

In the following, we assume that the inputs $x \in \mathbb{R}^m$ are generated subject to a probability density $\mu(x)$, and we evaluate quantitatively the approximation accuracy of the network with $n$ hidden units in terms of the $L^2$-norm with respect to the probability density $\mu(x)$,
$$\|f_n - f\|^2_{L^2(\mathbb{R}^m),\mu} = \int_{\mathbb{R}^m} \bigl(f_n(x) - f(x)\bigr)^2\, \mu(x)\, dx, \qquad (11)$$
where $f$ is the target function to be approximated. The indicator function of the domain of the input $x$, normalized by the volume of the domain, is a typical case of $\mu(x)$. We assume that
$$\int_{\mathbb{R}^{m+1}} |T(a,b)|\, da\, db < \infty \qquad (12)$$
holds. In this case, we can omit the convergence factor. Taking into account that $f$ and $\varphi_c$ are real functions, we can derive the representation
$$f(x) = \int_{\mathbb{R}^{m+1}} \mathrm{Re}\,T(a,b)\,\varphi_c(a\cdot x - b)\, da\, db = \int |\mathrm{Re}\,T(a,b)|\,\mathrm{sign}\bigl(\mathrm{Re}\,T(a,b)\bigr)\,\varphi_c(a\cdot x - b)\, da\, db = \int c(a,b)\,\varphi_c(a\cdot x - b)\, p(a,b)\, da\, db, \qquad (13)$$
where
$$C_T = \int_{\mathbb{R}^{m+1}} |\mathrm{Re}\,T(a,b)|\, da\, db, \qquad c(a,b) = \mathrm{sign}\bigl(\mathrm{Re}\,T(a,b)\bigr)\, C_T, \qquad p(a,b) = \frac{|\mathrm{Re}\,T(a,b)|}{C_T}.$$
Since $p(a,b)$ is a non-negative function and its integral over $\mathbb{R}^{m+1}$ is equal to 1, the function $p(a,b)$ can be regarded as a probability density of $a$ and $b$. Let us consider the following function,
$$g_n(x) = \frac{1}{n}\sum_{i=1}^{n} c(a_i,b_i)\, \varphi_c(a_i\cdot x - b_i), \qquad (14)$$
where the $(a_i,b_i)$ $(i = 1,\ldots,n)$ are independently chosen subject to the probability density $p(a,b)$. The expectation and the variance of $g_n(x)$ are
$$E[g_n(x)] = f(x),$$
$$V[g_n(x)] = \frac{1}{n}\, V\bigl[c(a,b)\,\varphi_c(a\cdot x - b)\bigr] = \frac{1}{n}\Bigl( C_T^2\, E\bigl[\varphi_c(a\cdot x - b)^2\bigr] - f(x)^2 \Bigr) \le \frac{1}{n}\Bigl( C_T^2 - f(x)^2 \Bigr).$$
The last inequality comes from the fact that the maximum value of $\varphi_c$ is normalized to 1.

These relations lead to
$$E\left[\int \bigl(g_n(x) - f(x)\bigr)^2 \mu(x)\, dx\right] = \int E\Bigl[\bigl(g_n(x) - f(x)\bigr)^2\Bigr]\,\mu(x)\, dx = \int V[g_n(x)]\,\mu(x)\, dx \le \frac{C_T^2 - \|f\|^2_{L^2(\mathbb{R}^m),\mu}}{n} \qquad (15)$$
$$\le \frac{C_T^2}{n}. \qquad (16)$$
Inequality (15) shows that the expectation of $\|g_n - f\|^2_{L^2(\mathbb{R}^m),\mu}$ is less than or equal to $(C_T^2 - \|f\|^2_{L^2(\mathbb{R}^m),\mu})/n$, and therefore there exists at least one linear combination $g_n(x)$ whose mean square error is as small as that bound. Also, Inequality (16) shows that there exists a norm-independent upper bound $C_T^2/n$ which does not depend on the input density $\mu(x)$. These evaluations are derived under the assumption of Equation (12). If the transform $T$ belongs to $L^1(\mathbb{R}^{m+1})$, the mean square error of function approximation with a three-layered network is bounded by $O(1/n)$, where $n$ is the number of hidden units. Now we obtain the following theorem and corollary.

Theorem 3. Let $\varphi_c$ be a bell-shaped function which is the activation function of the hidden units, and let $\varphi_d$ be a corresponding decomposing kernel function which satisfies the admissible condition (3). If the absolute integral $C_T$ of the transform $T$ of the function $f$ is bounded, then for an arbitrary input distribution $\mu(x)$ there exists a bell-shaped three-layered network $f_n(x)$ which satisfies
$$\|f_n - f\|^2_{L^2(\mathbb{R}^m),\mu} \le \frac{C_T^2 - \|f\|^2_{L^2(\mathbb{R}^m),\mu}}{n}. \qquad (17)$$

Corollary 4. For an arbitrary input distribution $\mu(x)$, the approximation error of bell-shaped three-layered networks $f_n(x)$ can be bounded as
$$\|f_n - f\|^2_{L^2(\mathbb{R}^m),\mu} \le \frac{C_T^2}{n}. \qquad (18)$$

We should note that the absolute integral $C_T$ depends on the kernels $\varphi_d$ and $\varphi_c$; therefore $C_T$ is more accurately expressed as $C_{T,\varphi_d,\varphi_c}$.
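A small simulation can make the random-coding argument behind Theorem 3 tangible. The sketch below is illustrative only: it posits a density $p(a,b)$ and a coefficient function $c(a,b)$ instead of deriving them from an actual transform, defines a target directly through representation (13), and watches the mean square error shrink as the number of sampled hidden units grows:

```python
import numpy as np

rng = np.random.default_rng(0)

phi_c = lambda z: np.exp(-z**2)          # a bell-shaped unit with max value 1

# Posited ingredients of representation (13): uniform density p(a, b) on a box
# and c(a, b) = sign(.) * C_T.  These are assumptions for the demo only.
A, B, C_T = 3.0, 3.0, 2.0
c_ab = lambda a, b: np.sign(np.sin(2 * a) + 0.3 * b + 0.1) * C_T

def f(x, n_grid=600):
    """Target defined by (13); the dense grid mean approximates the
    integral against the uniform density p on the box."""
    a = np.linspace(-A, A, n_grid)
    b = np.linspace(-B, B, n_grid)
    aa, bb = np.meshgrid(a, b, indexing="ij")
    return np.mean(c_ab(aa, bb) * phi_c(aa * x - bb))

def g_n(xs, n):
    """Random-coding network (14): average of n units with (a_i, b_i) ~ p."""
    a = rng.uniform(-A, A, n)
    b = rng.uniform(-B, B, n)
    return np.mean(c_ab(a, b) * phi_c(np.outer(xs, a) - b), axis=1)

xs = np.linspace(-2, 2, 41)
f_ref = np.array([f(x) for x in xs])
for n in (10, 100, 1000):
    mse = np.mean((g_n(xs, n) - f_ref) ** 2)   # L2 error for uniform mu on xs
    print(f"n = {n:5d}   mse = {mse:.3e}   C_T^2 / n = {C_T**2 / n:.3e}")
```

Averaged over the random draws, the error is bounded by $C_T^2/n$ as in Corollary 4; a single run typically shows the same roughly $1/n$ trend.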

Also, since the above bounds hold for any admissible pair of kernels, the bound for a fixed composing kernel $\varphi_c$ can be stated exactly with
$$C_T = \min_{\varphi_d} C_{T,\varphi_d,\varphi_c} \qquad (19)$$
when the minimum exists, and otherwise with
$$C_T = \inf_{\varphi_d} C_{T,\varphi_d,\varphi_c} + \varepsilon, \qquad \varepsilon > 0. \qquad (20)$$
To achieve the above approximation bound, greedy algorithms (Jones, 1992, etc.) may be applied to representation (13).

The coefficient of the bound given in the above theorem is obtained by a rough estimation. In this evaluation, we suppose that all weights from the hidden units to the output unit have the same absolute value, i.e. $+C_T$ or $-C_T$ (see for example Figure 3). In other words, the approximation depends only on the density of $(a,b)$ and takes no account of the hidden-to-output weight values $c_i$. Another estimation of an approximation bound is discussed in Appendix C, but it results in a bound of the same order, $O(1/n)$.

Fig. 3. The relationship between the transform $T(a,b)$ and the density of $(a,b)$.

This theorem shows that three-layered networks have good properties which make them useful for approximating functions. The approximation bound is inversely proportional to the number of hidden units with respect to the mean square error. Naturally, it can be seen that if there is a sufficiently large number of hidden units, three-layered networks can achieve approximation with any desired accuracy. An important point is that, to improve the accuracy, an exponentially larger number of hidden units is not required, but only a linearly larger number. Also, the number of necessary hidden units is independent of the input dimensionality.

Roughly speaking, in order to reduce the mean square error by half, twice the number of hidden units is needed, not $2^m$ times as many. This means that three-layered networks are free from the "curse of dimensionality".

In addition, this result also shows that a combination of good approximations does not necessarily give a good approximation. For example, suppose we divide the target function $f$ into two functions $f_1$ and $f_2$,
$$f(x) = f_1(x) + f_2(x),$$
and we have two good approximations $g_1$ and $g_2$, each with $n$ hidden units, which satisfy
$$\|g_1 - f_1\|^2_{L^2(\mathbb{R}^m),\mu} \le \frac{C_{T_1}^2}{n}, \qquad \|g_2 - f_2\|^2_{L^2(\mathbb{R}^m),\mu} \le \frac{C_{T_2}^2}{n},$$
where $T_1$ and $T_2$ are the transforms of $f_1$ and $f_2$. In order to approximate the function $f$, we combine the two approximations again. The error bound is evaluated as
$$\bigl\|(g_1 + g_2) - f\bigr\|^2_{L^2(\mathbb{R}^m),\mu} \le \frac{C_{T_1}^2 + C_{T_2}^2}{n}.$$
Note that this combined network has $2n$ hidden units, and whether equality holds depends on the partition of the target function. From the theorem, we obtain the following bound for an approximator $g$ with $2n$ hidden units:
$$\|g - f\|^2_{L^2(\mathbb{R}^m),\mu} \le \frac{(C_{T_1} + C_{T_2})^2}{2n}.$$
This bound is equivalent to the one above when and only when $C_{T_1} = C_{T_2}$, and otherwise it is smaller than the one above, since $(C_{T_1} + C_{T_2})^2 \le 2\,(C_{T_1}^2 + C_{T_2}^2)$ with equality exactly when $C_{T_1} = C_{T_2}$. Although this is only an estimate of the bounds, it implies that there might exist a better approximation than a combination of two approximators if the target function is partitioned inappropriately. In this sense, function approximation by three-layered networks has a nonlinear character.

3.4 Smoothness of Functions and Integrability of Transforms

In practice, it is difficult to confirm whether the transform of the target function is integrable or not. The integrability of the transform should therefore be translated into other, more easily verified conditions.

There are many kinds of conditions under which the transform $T$ belongs to $L^1(\mathbb{R}^{m+1})$. In the following, we only show a sufficient condition for functions with compact support and even composing kernel functions. The integrability is restated in terms of the smoothness of target functions, and hence it is an important condition for practical applications.

Let $C_0^m(\mathbb{R})$ denote the class of functions with compact support belonging to $C^m(\mathbb{R})$. The function $f_e(z)$ is defined as
$$f_e(z) = \int_{e\cdot x = z} f(x)\, dx^{(m-1)}$$
for a unit vector $e \in \mathbb{R}^m$, where $dx^{(m-1)}$ denotes the volume element on the hyper-plane $e\cdot x = z$. $f_e^{(m)}(z)$ denotes the result of differentiating $f_e(z)$ $m$ times.

Theorem 5. Any function $f$ which satisfies the conditions
(i) $f_e \in C_0^m(\mathbb{R})$ for every unit vector $e \in \mathbb{R}^m$, and
(ii) there exist $M \ge 0$ and $\beta$ $(0 < \beta \le 1)$ such that $\bigl|f_e^{(m)}(z) - f_e^{(m)}(z')\bigr| < M\,|z - z'|^\beta$,
has an absolutely integrable transform $T$, i.e. $f$ can be approximated by a three-layered network having $n$ bell-shaped hidden units with $O(1/n)$ accuracy in the sense of the mean square error.

For details of the constructive proof, refer to Appendix D. In practical use, it is almost impossible to confirm the integrability of the transform of a target function directly. Since the continuity condition can be checked more easily than the integrability, Theorem 5 might be widely applicable. The condition stated in Theorem 5 is called the Hölder continuity condition. When all the $m$-th order partial derivatives of $f(x)$ are Hölder continuous with respect to the $m$-dimensional $L^2$-norm, it is clear that $f_e^{(m)}(z)$ is Hölder continuous. Moreover, when $f(x)$ is a bounded $(m+1)$-times differentiable function with compact support, $f(x)$ satisfies the conditions of Theorem 5.

Corollary 6. Any function $f \in C_0^{m+1}(\mathbb{R}^m) \cap L^1(\mathbb{R}^m)$ can be approximated by a three-layered network with $O(1/n)$ accuracy.

It is expected that most of the functions used in practical applications satisfy the above conditions.

From the result of Theorem 5, we make a conjecture concerning the problem of "overfitting" in learning from examples. The theorem implies that if the target function is sufficiently smooth, the parameters stay in an appropriate region, but if there is some discontinuity in the target function, it is not guaranteed that the parameters will stay in a bounded region. An "overfitting" problem occurs when an "outlier" is included. In the case of learning from examples, especially with noise, it is often observed that huge parameter values are needed in order to approximate a peculiar data point which sticks out like a delta function. This overfitting phenomenon is a serious problem in practical applications.

This problem can be restated as follows. In the case of learning from examples, we do not have complete information about the target function. From the family of functions which networks can represent, we choose an imagined target function which locally minimizes a particular loss defined on the examples. "Locally" means that the result of learning depends on the initial values of the parameters and is not always optimal. The process of modifying the parameters towards this imagined target function is called learning. If the network has more hidden units than needed to approximate the examples other than the outlier, some of the hidden units can be used to represent the outlier. In this case, the imagined target function might have an extreme shape, roughly speaking unsmooth and discontinuous like a delta function, and the absolute integral of its transform might be quite huge or might diverge. Consequently, learning would be unstable when hidden units are added carelessly. To avoid this problem, kernel methods are commonly used, and they are validated by our framework: taking a convolution with a smooth kernel function makes the function itself smooth and makes its transform integrable. We expect that our theoretical framework gives a means to analyze various problems of learning from examples.

4 Conclusion

We found an integral representation of functions using ridge functions. Especially when a bell-shaped function is used as the activation function of the hidden units of a three-layered network, there exists a clear and direct correspondence between the integral representation and the three-layered network. It gives an intuitive interpretation of the structure of three-layered networks.

Based on this interpretation and the random coding technique, we gave an approximation bound for three-layered networks in the case of having complete information about the target function. The bound on the mean square error is inversely proportional to the number of hidden units and does not depend on the dimensionality of the input.

From this result, it can be seen that approximations by three-layered networks are not subject to the "curse of dimensionality".

In addition, we discussed a sufficient condition for a class of functions to be well approximated. As seen in the previous discussions, we showed that the smoothness of functions is closely related to the error of approximating functions by combining sigmoidal functions. This condition has a wide range of possible practical applications.

Future work will be dedicated to applying our results to problems of learning from examples. Possible applications would be the estimation of the required number of hidden units for a given accuracy, the estimation of the error rate, the acceleration of the convergence of learning, and warding off "over-learning".

Acknowledgement

The author would like to give very special thanks to the reviewer for his useful comments simplifying the proofs and his careful examination of the manuscript. The author would like to thank Professor S. Amari, Professor S. Yoshizawa, Dr. K.-R. Müller and D. Harada for helpful suggestions and discussions. The present work was supported in part by a Grant-in-Aid for Scientific Research in Priority Areas on Higher-Order Brain Information Processing from the Ministry of Education, Science and Culture of Japan.

References

[1] Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3), 930-945.
[2] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2, 303-314.
[3] Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2, 183-192.
[4] Girosi, F. and Anzellotti, G. (1992). Convergence rates of approximation by translates (Tech. Rep. A.I. Memo 1288). Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
[5] Irie, B. and Miyake, S. (1988). Capabilities of three-layered Perceptrons. In Proceedings of the International Conference on Neural Networks, pp. 641-648.
[6] Jones, L. K. (1992). A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Annals of Statistics, 20(1), 608-613.

[7] Murata, N. and Amari, S. (1993). Approximation bounds for linear combination of sigmoidal functions. In Proceedings of the 1993 International Symposium on Nonlinear Theory and its Applications, pp. 73-76.
[8] Murata, N. (1994). Function approximation by three-layered networks and its error bounds: an integral representation theorem (Tech. Rep. METR 94-9). Department of Mathematical Engineering and Information Physics, University of Tokyo.
[9] Niyogi, P. and Girosi, F. (1994). On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions (Tech. Rep. A.I. Memo 1467). Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
[10] Rumelhart, D., McClelland, J. L., and the PDP Research Group (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge, MA: MIT Press.
[11] White, H. (1990). Connectionist nonparametric regression: multilayer feedforward networks can learn arbitrary mappings. Neural Networks, 3(5), 535-549.

A Proof of the Inversion Theorem

Let us define
$$f_\varepsilon(x) = \frac{1}{(2\pi)^m\, C_{\varphi_d,\varphi_c}} \int_{\mathbb{R}^m}\!\int_{\mathbb{R}}\!\int_{\mathbb{R}^m} \varphi_d(a\cdot y - b)\, f(y)\, \varphi_c(a\cdot x - b)\, e^{-\varepsilon|a|^2}\, dy\, db\, da = \frac{1}{(2\pi)^m\, C_{\varphi_d,\varphi_c}} \int_{\mathbb{R}^m} \left( \int_{\mathbb{R}} \left( \int_{\mathbb{R}^m} \varphi_d(a\cdot y - b)\, f(y)\, dy \right) \varphi_c(a\cdot x - b)\, db \right) e^{-\varepsilon|a|^2}\, da. \qquad \mathrm{(A.1)}$$
From the fact that $f(y) \in L^1(\mathbb{R}^m)$, $\varphi_d$ is bounded, and $\varphi_c \in L^1(\mathbb{R})$, we can see that the right-hand side of Equation (A.1) is absolutely integrable. According to the Fubini-Tonelli theorem, we can change the order of integration in $f_\varepsilon(x)$. As $\varphi_d(a\cdot y - b)$ and $\varphi_c(a\cdot x - b)$ can be seen as functions of $b$ shifted by $a\cdot y$ and $a\cdot x$ respectively, by using Parseval's equality we obtain

$$f_\varepsilon(x) = \frac{1}{(2\pi)^{m+1}\, C_{\varphi_d,\varphi_c}} \int_{\mathbb{R}^m}\!\int_{\mathbb{R}^m}\!\int_{\mathbb{R}} \hat{\varphi}_d(\xi)\,\overline{\hat{\varphi}_c(\xi)}\, e^{-i\xi\, a\cdot(x-y)}\, e^{-\varepsilon|a|^2}\, f(y)\, d\xi\, dy\, da$$
$$= \frac{1}{(2\pi)^{m+1}\, C_{\varphi_d,\varphi_c}} \int\!\!\int\!\!\int \hat{\varphi}_d(\xi)\,\overline{\hat{\varphi}_c(\xi)}\, e^{-\varepsilon\left(a + \frac{i\xi(x-y)}{2\varepsilon}\right)\cdot\left(a + \frac{i\xi(x-y)}{2\varepsilon}\right)}\, e^{-\frac{\xi^2|x-y|^2}{4\varepsilon}}\, f(y)\, da\, d\xi\, dy$$
$$= \frac{1}{2\pi\, C_{\varphi_d,\varphi_c}} \int\!\!\int \hat{\varphi}_d(\xi)\,\overline{\hat{\varphi}_c(\xi)}\, G_\varepsilon\bigl(\xi(x-y)\bigr)\, f(y)\, d\xi\, dy = \frac{1}{2\pi\, C_{\varphi_d,\varphi_c}} \int \frac{\hat{\varphi}_d(\xi)\,\overline{\hat{\varphi}_c(\xi)}}{|\xi|^m}\, \bigl(G_{\varepsilon/\xi^2} * f\bigr)(x)\, d\xi,$$
where
$$G_\varepsilon(x) = \frac{1}{(4\pi\varepsilon)^{m/2}}\, e^{-\frac{|x|^2}{4\varepsilon}}$$
and $*$ denotes convolution.

For $\|\cdot\| = \|\cdot\|_{L^p(\mathbb{R}^m)}$, or $\|\cdot\| = \|\cdot\|_{L^\infty(\mathbb{R}^m)}$ in the case of bounded uniformly continuous functions, from Hölder's inequality we have
$$\|f_\varepsilon - f\| \le \frac{1}{2\pi\, |C_{\varphi_d,\varphi_c}|} \int \frac{\bigl|\hat{\varphi}_d(\xi)\,\overline{\hat{\varphi}_c(\xi)}\bigr|}{|\xi|^m}\, \bigl\|G_{\varepsilon/\xi^2} * f - f\bigr\|\, d\xi = \frac{1}{2\pi\, |C_{\varphi_d,\varphi_c}|} \left[ \int_{|\xi| < \xi_0} \frac{\bigl|\hat{\varphi}_d(\xi)\,\overline{\hat{\varphi}_c(\xi)}\bigr|}{|\xi|^m}\, \bigl\|G_{\varepsilon/\xi^2} * f - f\bigr\|\, d\xi + \int_{|\xi| \ge \xi_0} \frac{\bigl|\hat{\varphi}_d(\xi)\,\overline{\hat{\varphi}_c(\xi)}\bigr|}{|\xi|^m}\, \bigl\|G_{\varepsilon/\xi^2} * f - f\bigr\|\, d\xi \right] \equiv I_1 + I_2. \qquad \mathrm{(A.2)}$$
In either case,
$$\lim_{\varepsilon\to 0} \bigl\|G_\varepsilon * f - f\bigr\| = 0 \qquad \mathrm{(A.3)}$$
and
$$\|G_\varepsilon * f\| \le \|f\| \qquad \mathrm{(A.4)}$$
hold, and therefore
$$\bigl\|G_\varepsilon * f - f\bigr\| \le 2\,\|f\|. \qquad \mathrm{(A.5)}$$

Additionally, we assume the following admissible condition:
$$\int \frac{\bigl|\hat{\varphi}_d(\omega)\,\overline{\hat{\varphi}_c(\omega)}\bigr|}{|\omega|^m}\, d\omega < \infty. \qquad \mathrm{(A.6)}$$
From Equations (A.5) and (A.6), we know that for any $\delta > 0$ we may choose $\xi_0$ so that $|I_1| < \delta/2$; then Equations (A.3) and (A.6) guarantee that we may choose $\varepsilon_0$ so that $\varepsilon < \varepsilon_0$ implies $|I_2| < \delta/2$. Now we can conclude that for any $\delta > 0$ there exists $\varepsilon_0$ such that $\varepsilon < \varepsilon_0$ implies $\|f_\varepsilon - f\| < \delta$.

B An Example of a Decomposing Kernel

The decomposing kernel
$$\varphi_d(z) = \begin{cases} \dfrac{d^m}{dz^m}\,\psi(z) & \text{if } m \text{ is even}, \\[6pt] \dfrac{d^{m+1}}{dz^{m+1}}\,\psi(z) & \text{if } m \text{ is odd}, \end{cases}$$
can be confirmed to satisfy the admissible condition (3) as follows, where $m$ is the dimension of the input $x$ and $\psi$ is given by
$$\psi(z) = \begin{cases} \exp\!\left(\dfrac{1}{|z|^2 - 1}\right) & \text{if } |z| < 1, \\[4pt] 0 & \text{if } |z| \ge 1. \end{cases}$$

When $m$ is an even number,
$$\int \frac{\bigl|\hat{\varphi}_d(\omega)\,\overline{\hat{\varphi}_c(\omega)}\bigr|}{|\omega|^m}\, d\omega = \int \frac{\bigl|(i\omega)^m\,\hat{\psi}(\omega)\,\overline{\hat{\varphi}_c(\omega)}\bigr|}{|\omega|^m}\, d\omega = \int \bigl|\hat{\psi}(\omega)\,\overline{\hat{\varphi}_c(\omega)}\bigr|\, d\omega \le \left(\int |\hat{\psi}(\omega)|^2\, d\omega\right)^{1/2} \left(\int |\hat{\varphi}_c(\omega)|^2\, d\omega\right)^{1/2} = 2\pi \left(\int |\psi(z)|^2\, dz\right)^{1/2} \left(\int |\varphi_c(z)|^2\, dz\right)^{1/2} < \infty$$
holds. When $m$ is an odd number,
$$\int \frac{\bigl|\hat{\varphi}_d(\omega)\,\overline{\hat{\varphi}_c(\omega)}\bigr|}{|\omega|^m}\, d\omega = \int \frac{\bigl|(i\omega)^{m+1}\,\hat{\psi}(\omega)\,\overline{\hat{\varphi}_c(\omega)}\bigr|}{|\omega|^m}\, d\omega = \int \bigl|(i\omega)\,\hat{\psi}(\omega)\bigr|\,\bigl|\overline{\hat{\varphi}_c(\omega)}\bigr|\, d\omega \le \left(\int |(i\omega)\,\hat{\psi}(\omega)|^2\, d\omega\right)^{1/2} \left(\int |\hat{\varphi}_c(\omega)|^2\, d\omega\right)^{1/2} = 2\pi \left(\int |\psi'(z)|^2\, dz\right)^{1/2} \left(\int |\varphi_c(z)|^2\, dz\right)^{1/2} < \infty$$
holds, where $\psi'(z)$ denotes the derivative of $\psi(z)$.

C Another Bound Estimation

Using any function $c(a,b)$ such that $\mathrm{Re}\,T(a,b)/c(a,b)$ is a positive real number and
$$\int \frac{\mathrm{Re}\,T(a,b)}{c(a,b)}\, da\, db = 1,$$
and defining $p(a,b) = \mathrm{Re}\,T(a,b)/c(a,b)$, we can obtain the inequality

$$E\left[\int \bigl(g_n(x) - f(x)\bigr)^2 \mu(x)\, dx\right] = \frac{1}{n}\int \Bigl\{ E\Bigl[\bigl(c(a,b)\,\varphi_c(a\cdot x - b)\bigr)^2\Bigr] - f(x)^2 \Bigr\}\,\mu(x)\, dx \le \frac{1}{n}\Bigl\{ \sup_{a,b}\, |c(a,b)|^2\, \bigl\|\varphi_c(a\cdot x - b)\bigr\|^2_{L^2(\mathbb{R}^m),\mu} - \|f\|^2_{L^2(\mathbb{R}^m),\mu} \Bigr\}. \qquad \mathrm{(C.1)}$$
In this case the order of the approximation error is equal to that of the previous evaluation, namely $O(1/n)$, but the integrability condition on the transform $T$ is reduced from condition (12),
$$\int \bigl|\mathrm{Re}\,T(a,b)\bigr|\, da\, db < \infty,$$
to
$$\int \frac{\mathrm{Re}\,T(a,b)}{c(a,b)}\, da\, db < \infty. \qquad \mathrm{(C.2)}$$
Clearly, this new condition (C.2) is weaker than condition (12) of Theorem 3, because for larger $|a|$ the norm $\|\varphi_c(a\cdot x - b)\|_{L^2(\mathbb{R}^m),\mu}$ becomes smaller, and therefore $c(a,b)$ can take large values without changing the bound in Inequality (C.1).

D Proof of Theorem 5

Let us assume that $\varphi_c$ is in $C^2(\mathbb{R})$. Since $\hat{\varphi}_c$ is bounded, $\int |\omega|\,\bigl|\hat{\psi}(\omega)\,\hat{\varphi}_c(\omega)\bigr|\, d\omega < \infty$ holds, and we define
$$\varphi_d(z) = \begin{cases} \dfrac{d^{m+2}}{dz^{m+2}}\,\psi(z) & \text{if } m \text{ is even}, \\[6pt] \dfrac{d^{m+1}}{dz^{m+1}}\,\psi(z) & \text{if } m \text{ is odd}, \end{cases}$$
with $\psi$ taken from Equation (10). Noting that $\varphi_c$ is twice differentiable, we can see that $\varphi_d$ and $\varphi_c$ satisfy the admissible condition (3) in a manner similar to Section 3.2.

In this case,
$$\hat{\varphi}_d(\omega) = \begin{cases} (i\omega)^{m+2}\,\hat{\psi}(\omega) & \text{if } m \text{ is even}, \\[4pt] (i\omega)^{m+1}\,\hat{\psi}(\omega) & \text{if } m \text{ is odd}, \end{cases}$$
and
$$\int z^k\,\varphi_d(z)\, dz = i^k \left.\frac{d^k}{d\omega^k}\,\hat{\varphi}_d(\omega)\right|_{\omega=0} \qquad \mathrm{(D.1)}$$
hold, and hence the $k$-th moment as calculated in Equation (D.1) vanishes if $k \le m$ when $m$ is odd, or if $k \le m+1$ when $m$ is even. In either case the moments vanish up to at least the $m$-th order.

Let $f$ be in $C_0^m(\mathbb{R}^m)$, where $C_0^m(\mathbb{R}^m)$ denotes the class of functions with compact support belonging to $C^m(\mathbb{R}^m)$. Let $f_e(z)$ be
$$f_e(z) = \int_{e\cdot x = z} f(x)\, dx^{(m-1)}$$
for a unit vector $e \in \mathbb{R}^m$, where $dx^{(m-1)}$ denotes the volume element on the hyper-plane $e\cdot x = z$. Up to the constant factor in Equation (6), we can rewrite the transform $T$ as
$$T_e(\alpha, b) = \int \varphi_d(a\cdot x - b)\, f(x)\, dx = \int \varphi_d(\alpha z - b)\, f_e(z)\, dz,$$
where $\alpha = |a|$ and $e = a/|a|$. We assume that for any $e$,
$$f_e(z) \in C_0^m(\mathbb{R}), \qquad \mathrm{(D.2)}$$
and that there exist $M \ge 0$ and $\beta$ $(0 < \beta \le 1)$ such that
$$\bigl|f_e^{(m)}(z) - f_e^{(m)}(z')\bigr| < M\,|z - z'|^\beta. \qquad \mathrm{(D.3)}$$

Then
$$|T_e(\alpha, b)| = \frac{1}{\alpha}\left|\int \varphi_d(z)\, f_e\!\left(\frac{z+b}{\alpha}\right) dz\right| = \frac{1}{\alpha}\left|\int \varphi_d(z)\left\{ f_e\!\left(\frac{z+b}{\alpha}\right) - \sum_{k=0}^{m-1} \frac{1}{k!}\left(\frac{z}{\alpha}\right)^{k} f_e^{(k)}\!\left(\frac{b}{\alpha}\right) - \frac{1}{m!}\left(\frac{z}{\alpha}\right)^{m} f_e^{(m)}\!\left(\frac{b}{\alpha}\right)\right\} dz\right|$$
$$= \frac{1}{\alpha}\left|\int \varphi_d(z)\,\frac{1}{m!}\left(\frac{z}{\alpha}\right)^{m}\left\{ f_e^{(m)}\!\left(\frac{b + \theta z}{\alpha}\right) - f_e^{(m)}\!\left(\frac{b}{\alpha}\right)\right\} dz\right| \le \frac{M}{m!\,\alpha}\int |\varphi_d(z)|\left|\frac{z}{\alpha}\right|^{m+\beta} dz = \frac{C_1}{\alpha^{m+1+\beta}},$$
where $\theta$ satisfies $0 < \theta < 1$ and $C_1$ is a constant. Here the Taylor expansion of $f_e$ around $b/\alpha$ is used; the subtracted polynomial terms integrate to zero against $\varphi_d$ because the moments of $\varphi_d$ vanish up to the $m$-th order, in particular the $m$-th order moment term
$$\int \varphi_d(z)\, z^m\, f_e^{(m)}\!\left(\frac{b}{\alpha}\right) dz = 0$$
is inserted, and Equation (D.3) is applied to the remainder. Since $T_e(\alpha, b)$ is also bounded, we obtain the integrable inequality
$$|T_e(\alpha, b)| \le \frac{C_2}{1 + \alpha^{m+1+\beta}},$$
where $C_2$ is a constant. From the fact that $f$ and $\varphi_d$ have compact support, $T(a,b)$ also has compact support with respect to $b$ when $a$ is fixed, and the size of that support is bounded by $C_3\,\alpha + C_4$ with appropriate constants $C_3$ and $C_4$. Let $d\Omega$ be the volume element of the unit hyper-sphere in $\mathbb{R}^m$. The absolute integral $C_T$ is then bounded, up to the constant factor in Equation (6), by
$$\int |T_e(\alpha, b)|\, \alpha^{m-1}\, d\alpha\, d\Omega\, db,$$
and we see that
$$\int |T_e(\alpha, b)|\, \alpha^{m-1}\, d\alpha\, d\Omega\, db \le C_5 \int \alpha^{m-1}\, (C_3\,\alpha + C_4)\, \frac{C_2}{1 + \alpha^{m+1+\beta}}\, d\alpha < \infty$$
holds, where $C_5$ is a definite integral with respect to $d\Omega$. Therefore, Equations (D.2) and (D.3) are sufficient conditions for the integrability of the transform $T$.


Richard DiSalvo. Dr. Elmer. Mathematical Foundations of Economics. Fall/Spring, The Finite Dimensional Normed Linear Space Theorem Richard DiSalvo Dr. Elmer Mathematical Foundations of Economics Fall/Spring, 20-202 The claim that follows, which I have called the nite-dimensional normed

More information

Identication and Control of Nonlinear Systems Using. Neural Network Models: Design and Stability Analysis. Marios M. Polycarpou and Petros A.

Identication and Control of Nonlinear Systems Using. Neural Network Models: Design and Stability Analysis. Marios M. Polycarpou and Petros A. Identication and Control of Nonlinear Systems Using Neural Network Models: Design and Stability Analysis by Marios M. Polycarpou and Petros A. Ioannou Report 91-09-01 September 1991 Identication and Control

More information

WHY DEEP NEURAL NETWORKS FOR FUNCTION APPROXIMATION SHIYU LIANG THESIS

WHY DEEP NEURAL NETWORKS FOR FUNCTION APPROXIMATION SHIYU LIANG THESIS c 2017 Shiyu Liang WHY DEEP NEURAL NETWORKS FOR FUNCTION APPROXIMATION BY SHIYU LIANG THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical and Computer

More information

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector On Minimax Filtering over Ellipsoids Eduard N. Belitser and Boris Y. Levit Mathematical Institute, University of Utrecht Budapestlaan 6, 3584 CD Utrecht, The Netherlands The problem of estimating the mean

More information

COS 424: Interacting with Data

COS 424: Interacting with Data COS 424: Interacting with Data Lecturer: Rob Schapire Lecture #14 Scribe: Zia Khan April 3, 2007 Recall from previous lecture that in regression we are trying to predict a real value given our data. Specically,

More information

Introduction Wavelet shrinage methods have been very successful in nonparametric regression. But so far most of the wavelet regression methods have be

Introduction Wavelet shrinage methods have been very successful in nonparametric regression. But so far most of the wavelet regression methods have be Wavelet Estimation For Samples With Random Uniform Design T. Tony Cai Department of Statistics, Purdue University Lawrence D. Brown Department of Statistics, University of Pennsylvania Abstract We show

More information

Feed-forward Network Functions

Feed-forward Network Functions Feed-forward Network Functions Sargur Srihari Topics 1. Extension of linear models 2. Feed-forward Network Functions 3. Weight-space symmetries 2 Recap of Linear Models Linear Models for Regression, Classification

More information

Approximating the Best Linear Unbiased Estimator of Non-Gaussian Signals with Gaussian Noise

Approximating the Best Linear Unbiased Estimator of Non-Gaussian Signals with Gaussian Noise IEICE Transactions on Information and Systems, vol.e91-d, no.5, pp.1577-1580, 2008. 1 Approximating the Best Linear Unbiased Estimator of Non-Gaussian Signals with Gaussian Noise Masashi Sugiyama (sugi@cs.titech.ac.jp)

More information

Computation Of Asymptotic Distribution. For Semiparametric GMM Estimators. Hidehiko Ichimura. Graduate School of Public Policy

Computation Of Asymptotic Distribution. For Semiparametric GMM Estimators. Hidehiko Ichimura. Graduate School of Public Policy Computation Of Asymptotic Distribution For Semiparametric GMM Estimators Hidehiko Ichimura Graduate School of Public Policy and Graduate School of Economics University of Tokyo A Conference in honor of

More information

On the Noise Model of Support Vector Machine Regression. Massimiliano Pontil, Sayan Mukherjee, Federico Girosi

On the Noise Model of Support Vector Machine Regression. Massimiliano Pontil, Sayan Mukherjee, Federico Girosi MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 1651 October 1998

More information

Temporal Backpropagation for FIR Neural Networks

Temporal Backpropagation for FIR Neural Networks Temporal Backpropagation for FIR Neural Networks Eric A. Wan Stanford University Department of Electrical Engineering, Stanford, CA 94305-4055 Abstract The traditional feedforward neural network is a static

More information

Outline of Fourier Series: Math 201B

Outline of Fourier Series: Math 201B Outline of Fourier Series: Math 201B February 24, 2011 1 Functions and convolutions 1.1 Periodic functions Periodic functions. Let = R/(2πZ) denote the circle, or onedimensional torus. A function f : C

More information

Neural Networks and the Back-propagation Algorithm

Neural Networks and the Back-propagation Algorithm Neural Networks and the Back-propagation Algorithm Francisco S. Melo In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely

More information

Local minima and plateaus in hierarchical structures of multilayer perceptrons

Local minima and plateaus in hierarchical structures of multilayer perceptrons Neural Networks PERGAMON Neural Networks 13 (2000) 317 327 Contributed article Local minima and plateaus in hierarchical structures of multilayer perceptrons www.elsevier.com/locate/neunet K. Fukumizu*,

More information

October 7, :8 WSPC/WS-IJWMIP paper. Polynomial functions are renable

October 7, :8 WSPC/WS-IJWMIP paper. Polynomial functions are renable International Journal of Wavelets, Multiresolution and Information Processing c World Scientic Publishing Company Polynomial functions are renable Henning Thielemann Institut für Informatik Martin-Luther-Universität

More information

In the Name of God. Lectures 15&16: Radial Basis Function Networks
