Technical Report TR96-9, November
Noise Suppression in Training Data for Improving Generalization
Akiko Nakashima, Akira Hirabayashi, and Hidemitsu Ogawa
Department of Computer Science, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro, Tokyo 152, Japan
(c) The author(s) of this report reserve all the rights.
To appear in IEEE International Joint Conference on Neural Networks '98.
Noise Suppression in Training Data for Improving Generalization

Akiko Nakashima, Akira Hirabayashi and Hidemitsu Ogawa
Dept. of Computer Science, Tokyo Institute of Technology, 2-12-1 O-okayama, Meguro-ku, Tokyo 152, Japan.

Abstract

Multi-layer feedforward neural networks are trained using the error back-propagation (BP) algorithm. This algorithm minimizes the error between the outputs of a neural network (NN) and the training data. Hence, in the case of noisy training data, a trained network memorizes the noisy outputs for the given inputs. Such learning is called rote memorization learning (RML). In this paper we propose error correcting memorization learning (CML), which can suppress noise in training data. In order to evaluate the generalization ability of CML, it is compared with the projection learning (PL) criterion. It is theoretically proved that although CML merely suppresses noise in the training data, it provides the same generalization as PL under a certain necessary and sufficient condition.

1 Introduction

The learning problem of feed-forward neural networks with noisy training data is considered from the functional analytic point of view. What is important in the learning problem is to achieve a high level of generalization, that is, to construct a neural network which outputs true values not only for the training inputs but also for novel inputs. The back-propagation algorithm is often used for training a neural network. It is derived from the criterion of so-called rote memorization learning (RML), which minimizes the error between the outputs of a neural network and the noisy training data. Hence, RML does not guarantee generalization ability. In order to solve this problem, a regularization method was proposed [5],[2]. However, it still uses the RML criterion, merely adding a smoothness term. In this paper, we propose error correcting memorization learning (CML) to suppress noise in training data.
The generalization ability of CML is evaluated by comparing it with projection learning (PL), which reduces the error in the original function space. We obtain a necessary and sufficient condition under which CML not only suppresses noise in the training data but also improves generalization. It is known that RML also provides the same generalization as PL under some condition. Although an analytical solution for this condition was provided in [8], here we use the results on CML to interpret and clarify that solution.

2 Neural network learning as an inverse problem

In this section, we present a brief review of the basic formalization necessary for discussing the learning problem in NNs from the functional analytic point of view. Let us begin by considering a three-layer feedforward neural network whose numbers of input, hidden, and output units are L, N, and 1, respectively, as shown in Fig. 1. Let x be the L-dimensional vector consisting of the L inputs, which is referred to as the input vector. The network can then be considered as a real-valued function f(x) of L variables:

f(x) = Σ_{n=1}^N w_n u_n(x),

where u_n(x) is the output of the n-th hidden unit and w_n is the corresponding output-layer weight.

[Figure 1: NN as a real-valued function.]

The learning problem is to construct a neural network by using a set of training data so that the NN expresses the best approximation f₀(x) to a desired function f(x) under some learning criterion. We define some of the notation used here.

{x_m}_{m=1}^M : A training set, given as a set of M input vectors.
{y_m}_{m=1}^M : The corresponding noisy output values, where y_m = f(x_m) + n_m.

{(x_m, y_m)}_{m=1}^M : A set of training data.

Once a training set {x_m}_{m=1}^M is fixed, the corresponding true outputs {f(x_m)}_{m=1}^M are uniquely determined by f. Hence, we can introduce an operator A which maps f to the vector consisting of {f(x_m)}_{m=1}^M. Let y and n be the M-dimensional vectors consisting of the elements {y_m}_{m=1}^M and {n_m}_{m=1}^M, respectively. Then we have

y = Af + n.  (1)

The operator A is called the sampling operator. It is a linear operator even when we are concerned with nonlinear NNs. Let H be the set of all functions f to be approximated by the neural networks. Assume that H is a Hilbert space with a reproducing kernel K(x, x'). Let D be the domain of the functions f, which is a subset of the L-dimensional Euclidean space R^L. The reproducing kernel K(x, x') is a bivariate function defined on D × D which satisfies the following two conditions:

1. For any fixed x' in D, K(x, x') is a function of x in H.
2. For any f in H and any x' in D, it holds that

(f(·), K(·, x')) = f(x'),  (2)

where the left-hand side of eq.(2) denotes the inner product in H.

In the theory of Hilbert spaces, arguments are developed by regarding a function as a point in the space. Thus, notions such as 'the value of a function at a point' cannot be discussed within the general framework of Hilbert spaces. However, if the Hilbert space has a reproducing kernel, the value of a function at a point can be dealt with, as shown in eq.(2). The sampling operator A is expressed by the reproducing kernel as

A = Σ_{m=1}^M (e_m ⊗ K(·, x_m)),  (3)

where {e_m}_{m=1}^M is the so-called natural basis of R^M, i.e., e_m is the M-dimensional vector whose elements are all zero except for the m-th element, which is equal to 1. The notation (· ⊗ ·) is the Schatten product, defined by

(e_m ⊗ g)f = (f, g) e_m.  (4)

Now the learning problem is the problem of obtaining an estimate, say f₀, of f from y in the model (1).
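As a concrete illustration of eqs.(1)-(4), the sampling operator can be written down explicitly for a small reproducing-kernel space. The sketch below is not from the paper; it borrows the two-dimensional space H = span{sin, cos} and the kernel K(x, x') = cos(x − x') from the artificial problem reported later, and represents a function by its coefficient vector in the orthonormal basis {sin, cos}, so that the inner product in H becomes the ordinary dot product:

```python
import numpy as np

# Illustrative sketch (not the authors' code): H = span{sin, cos} with an
# orthonormal basis, so a function f = a*sin + b*cos is the coefficient
# vector (a, b), and the H inner product is the dot product.

def kernel_coeffs(x):
    # Coefficients of K(., x) = cos(. - x) = sin(x)*sin(.) + cos(x)*cos(.)
    return np.array([np.sin(x), np.cos(x)])

def sampling_operator(points):
    # A = sum_m e_m (x) K(., x_m)  (eq.(3)): row m holds the coefficients
    # of K(., x_m), so (Af)_m = (f, K(., x_m)) = f(x_m) by eq.(2).
    return np.stack([kernel_coeffs(x) for x in points])

f = np.array([0.5, 3.0])          # coefficients of f(x) = 0.5 sin x + 3 cos x
x_train = [0.8, 0.4, 1.0]
A = sampling_operator(x_train)    # M x dim(H) = 3 x 2 matrix
true_outputs = A @ f              # (f(x_1), f(x_2), f(x_3))

# Reproducing property (eq.(2)) checked at a novel point x':
xp = 0.3
assert np.isclose(f @ kernel_coeffs(xp), 0.5 * np.sin(xp) + 3 * np.cos(xp))
```

Noisy training data y = Af + n (eq.(1)) are then just this matrix-vector product plus a noise vector.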
This can be considered as an inverse problem [4], equivalent to obtaining an operator X which provides f₀ from y:

f₀ = Xy.  (5)

The operator X is called the learning operator. It can be optimized based on different learning criteria [4]. We denote a criterion by J in general, and an operator X satisfying J by A^(J).

3 Rote memorization learning

The BP method minimizes the training error, that is,

Σ_{m=1}^M (f₀(x_m) − y_m)².  (6)

Hence, the learning criterion for the BP method is as follows.

Definition 1 (Rote memorization learning) If an operator X minimizes the functional

J_RM[X] = ||AXy − y||²,  (7)

then X is called the rote memorization learning (RML) operator and is denoted by A^(RM), where ||·|| is the norm in R^M.

A general form of the RML operator is given as

A^(RM) = A† + Y − A†AY,  (8)

where A† is the Moore-Penrose generalized inverse of A [1] and Y is an arbitrary operator from R^M to H. J_RM requires only that the given noisy training data be memorized by rote.

4 Error correcting memorization learning

When we expect a NN to output the correct values for the given inputs, the mean squared error between the outputs of the NN and the correct values underlying the noisy training data should be minimized. This error is expressed by

E_n[Σ_{m=1}^M (f₀(x_m) − f(x_m))²] = E_n[||Af₀ − Af||²],  (9)

where E_n denotes the expectation over the noise ensemble {n}. Using eqs.(1) and (5), we can decompose Af₀ as

Af₀ = AXAf + AXn.  (10)

The first and second terms on the right-hand side of eq.(10) are the signal component and the noise
component of Af₀, respectively. The former is deterministic, whereas the latter is probabilistic in nature. Therefore, we require that the signal component of Af₀ agree with the true values Af of the training data. This leads us to the concept of error correcting memorization learning.

[Figure 2: Learned functions and the mechanism of noise suppression.]

Definition 2 (Error correcting memorization learning) For any f₀ given by eqs.(5) and (1), if an operator X minimizes the functional

J_CM[X] = E_n[||Af₀ − Af||²]  (11)

under the constraint

AXA = A,  (12)

then X is called the error correcting memorization learning (CML) operator and is denoted by A^(CM).

Theorem 1 A general form of the CML operator is given as

A^(CM) = V†A*U† + Y − A†AYUU†,  (13)

where A* is the adjoint operator of A, and U and V are defined as

U = AA* + Q,  V = A*U†A,  (14)

where Q is the correlation matrix of the noise and Y is an arbitrary operator from R^M to H. The minimum value of J_CM[X] is given by

min_X J_CM[X] = J_CM[A^(CM)] = tr(AV†A*) − tr(AA*).  (15)

The correlation matrix of the noise is the M × M matrix defined by Q = E_n(n ⊗ n), whose ij-th component is E_n(n_i n_j).

The CML operator in eq.(13) is determined by the operators A, Q, and Y. A is obtained from a training set, as shown in eq.(3). Q is the correlation matrix, which is determined by the nature of the noise. Note that the noise is not limited to, for example, a normal distribution or a zero-mean distribution. Hence, we can apply the theorem to any type of noise as long as Q can be estimated.

Statistically, almost all noisy training data y lie in the range of U, denoted by R(U) [6]. R(U) has the following (generally nonorthogonal) direct sum decomposition:

R(U) = R(A) ∔ QR(A)⊥,  (16)

where R(A)⊥ is the orthogonal complement of R(A). This decomposition yields the following result.

Theorem 2 An operator X satisfies the CML criterion if and only if

AXy = y for y ∈ R(A),  AXy = 0 for y ∈ QR(A)⊥.  (17)
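Theorems 1 and 2 can be checked numerically in the finite-dimensional setting of the paper's artificial problem. The sketch below is my own illustration, not part of the paper: it takes Y = 0 in eq.(13) and uses the coefficient representation of H = span{sin, cos}, then verifies the constraint AXA = A and the two cases of eq.(17):

```python
import numpy as np

# Numerical check of Theorems 1 and 2 (illustration only; Y = 0 in eq.(13)).
x_train = [0.8, 0.4, 1.0]
A = np.stack([[np.sin(x), np.cos(x)] for x in x_train])   # sampling operator
Q = np.array([[0.73, 0.80, 1.20],
              [0.80, 1.04, 1.50],
              [1.20, 1.50, 2.26]])                        # eq.(25)

U = A @ A.T + Q                        # eq.(14); U is invertible here
V = A.T @ np.linalg.inv(U) @ A
X_cml = np.linalg.inv(V) @ A.T @ np.linalg.inv(U)         # eq.(13) with Y = 0

# Constraint (12): AXA = A
assert np.allclose(A @ X_cml @ A, A)

# Eq.(17), first case: any y in R(A) is reproduced exactly
y_sig = A @ np.array([0.5, 3.0])
assert np.allclose(A @ X_cml @ y_sig, y_sig)

# Eq.(17), second case: any y in QR(A)⊥ is annihilated
z = np.linalg.svd(A)[0][:, -1]         # unit vector orthogonal to R(A)
assert np.allclose(A @ X_cml @ (Q @ z), 0, atol=1e-9)
```

The minimum value of eq.(15) can also be checked against the direct expression J_CM = tr((AX)Q(AX)*), which the asserts in the sketch's test exercise; both follow from the same U, V defined above.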
Theorem 2 shows the mechanism of noise suppression by CML, which is illustrated in Fig. 2. Let us consider a vector y in R(U). It is decomposed using eq.(16) as

y = Af + n₁ + n₂,  (18)

where n₁ and n₂ are the R(A)-component and the QR(A)⊥-component of n, respectively. For this vector y,

Af₀ = AA^(CM)y = Af + n₁,  (19)

which is the R(A)-component of y. CML removes the QR(A)⊥-component n₂ of n. In this sense QR(A)⊥
is the subspace which results in optimal noise suppression.

[Figure 3: Functions learned by CML and RML with the training set {x_m}_{m=1}^3 = {0.8, 0.4, 1}. *: training data; solid: original function; dashed: function learned by CML; dotted: function learned by RML.]

[Figure 4: Functions learned by CML and RML with the training set {x_m}_{m=1}^3 = {1, 1, 1}. *: training data; solid: original function; dashed: function learned by CML; dotted: function learned by RML.]

Next, we show results of CML on an artificial problem. Let H be the 2-dimensional function space spanned by

{φ_n(x)}_{n=1}^2 = {sin x, cos x}  (20)

and let the inner product in H be defined as

(f, g) = (1/π) ∫_{−π}^{π} f(x)g(x)dx,  f, g ∈ H.  (21)

Then {φ_1, φ_2} in eq.(20) becomes an orthonormal basis of H. The reproducing kernel of H is given by

K(x, x') = Σ_{n=1}^2 φ_n(x)φ_n(x')  (22)
         = cos(x − x').  (23)

Let us now consider the problem of learning the function

f(x) = 0.5 sin x + 3 cos x  (24)

in this function space H. We consider two different experiments. The first (Fig. 3) uses the 3 training points {x_m}_{m=1}^3 = {0.8, 0.4, 1}, while the second (Fig. 4) uses a different set of 3 training points, {x_m}_{m=1}^3 = {1, 1, 1}. Here, we assume that the noise is generated from the three-dimensional normal distribution with diagonal covariance matrix diag(0.3², 0.2², 0.1²) and mean E_n n = (0.8, 1.0, 1.5)ᵗ. Then the noise correlation matrix is given by

Q = [ 0.73  0.80  1.20 ;
      0.80  1.04  1.50 ;
      1.20  1.50  2.26 ].  (25)

In Figs. 3 and 4, the true sampled values {f(x_m)}_{m=1}^3 and the noisy training data y are denoted by 'o' and '*', respectively. The original function f is drawn as a solid line, while the functions f₀ learned by CML and RML are shown by a dashed line and a dotted line, respectively. As for the training error, both experiments give similar results. The function f₀ learned by CML passes near the true points denoted by 'o', whereas f₀ learned by RML passes near the noisy data points denoted by '*'. This example shows that CML certainly suppresses noise in the training data.
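The Fig. 3 experiment can be reproduced in a few lines. The sketch below is my own reconstruction, not the authors' code: the third training point and the leading coefficient in eq.(24) are partly illegible in the source and are assumed to be 1 and 0.5, and for reproducibility the noise vector is fixed at its mean (0.8, 1.0, 1.5) instead of being drawn at random:

```python
import numpy as np

# Reconstruction sketch of the Fig. 3 experiment (assumed readings noted above),
# in the coefficient representation of H = span{sin x, cos x}.
x_train = np.array([0.8, 0.4, 1.0])
A = np.stack([[np.sin(x), np.cos(x)] for x in x_train])
f_true = np.array([0.5, 3.0])              # f(x) = 0.5 sin x + 3 cos x (eq.(24))
noise = np.array([0.8, 1.0, 1.5])          # one draw, fixed at E_n[n]
y = A @ f_true + noise                     # eq.(1)

Q = np.array([[0.73, 0.80, 1.20],
              [0.80, 1.04, 1.50],
              [1.20, 1.50, 2.26]])         # eq.(25)

# RML (Y = 0 in eq.(8)): rote memorization via the pseudoinverse
f_rml = np.linalg.pinv(A) @ y

# CML (Y = 0 in eq.(13)), with U and V from eq.(14)
U = A @ A.T + Q
V = A.T @ np.linalg.inv(U) @ A
f_cml = np.linalg.inv(V) @ A.T @ np.linalg.inv(U) @ y

print("RML coefficient error:", np.linalg.norm(f_rml - f_true))
print("CML coefficient error:", np.linalg.norm(f_cml - f_true))
```

With this deterministic noise draw, the CML estimate lands much closer to the true coefficients (0.5, 3) than the RML estimate, mirroring the qualitative behavior of Fig. 3: RML absorbs the strongly biased noise, while CML suppresses it through Q.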
As for the generalization, the experimental results are somewhat different. CML in Fig. 3 provides a better approximation than CML in Fig. 4. This result shows that the generalization of CML depends on the training set {x_m}_{m=1}^3.

5 Admissibility

CML evaluates the error only over the training set. Therefore, even when a network is successfully trained for the given inputs by CML, it is not guaranteed that f₀ provides the desired output values for novel inputs, as shown in Fig. 4. When we expect CML to achieve high generalization, we implicitly use the CML criterion as a substitute for some true criterion J which directly estimates the generalization error. In order to discuss the conditions under which we may substitute J_CM for J, the concept of admissibility is useful [7]. Consider the general case in which a criterion J' substitutes for a criterion J. Generally, there are many learning operators which satisfy a given criterion; the set of operators satisfying J is denoted by A{J}.
Definition 3 (Admissibility) [7]
(i) (Non-admissibility) If no J'-learning satisfies J, i.e., if it holds that
A{J} ∩ A{J'} = ∅,  (26)
then it is said that J does not admit J'.
(ii) (Partial admissibility) If there is at least one J'-learning which satisfies J, i.e., if it holds that
A{J} ∩ A{J'} ≠ ∅,  (27)
then it is said that J partially admits J'.
(iii) (Admissibility) If all J'-learnings satisfy J, i.e., if it holds that
A{J} ⊃ A{J'},  (28)
then it is said that J always admits J', or in brief, J admits J'.
(iv) (Complete admissibility) If J always admits J' and vice versa, i.e., if it holds that
A{J} = A{J'},  (29)
then it is said that J completely admits J'.
(v) (Inverse admissibility) If all J-learnings satisfy J', i.e., if it holds that
A{J} ⊂ A{J'},  (30)
then it is said that J is always admitted by J'.

Eq.(28) means that J' is sufficient for J, while eq.(30) means that J' is necessary for J. Based on the concept of admissibility, we shall discuss the generalization ability of CML in the next section.

6 Generalization ability of CML

In this section, as an example of the true criterion J, we consider projection learning (PL) [3]. Let P be the orthogonal projection operator onto R(A*).

Definition 4 (Projection learning) [3] For any f₀ given by eqs.(5) and (1), if an operator X minimizes the functional

J_P[X] = E_n[||f₀ − Pf||²]  (31)

under the constraint

XA = P,  (32)

then X is called the projection learning (PL) operator and is denoted by A^(P), where ||·|| is the norm in H.

Whenever we use a linear operator X for constructing f₀, the range of X is a subspace of H. Hence 'the best approximation' means that f₀ is the nearest point to f in the subspace R(X), i.e., the orthogonal projection of f onto R(X). R(A*) is the largest subspace in which we can obtain the orthogonal projection of f from y without knowing the original f. That is the reason why in eq.(31) the error is evaluated not between f₀ and f but between f₀ and Pf.
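In the same finite-dimensional setting as before, the PL constraint of eq.(32) can be verified numerically for the operator V†A*U†, which the paper later identifies as the PL solution contained in the CML operator. This is a sketch under my assumed coefficient representation of H = span{sin, cos}; since A has full column rank here, R(A*) is all of H and P reduces to the identity:

```python
import numpy as np

# Check of the PL constraint XA = P (eq.(32)) for X = V†A*U†
# (illustration only, in the coefficient representation of H).
x_train = [0.8, 0.4, 1.0]
A = np.stack([[np.sin(x), np.cos(x)] for x in x_train])
Q = np.array([[0.73, 0.80, 1.20],
              [0.80, 1.04, 1.50],
              [1.20, 1.50, 2.26]])
U = A @ A.T + Q                        # eq.(14)
V = A.T @ np.linalg.inv(U) @ A
X_pl = np.linalg.inv(V) @ A.T @ np.linalg.inv(U)

# P: orthogonal projector onto R(A*).  A has full column rank here,
# so R(A*) = H and P is the 2x2 identity on coefficient space.
P = np.linalg.pinv(A) @ A
```

The check X_pl @ A == P holds because X_pl A = V⁻¹(A*U⁻¹A) = V⁻¹V, which is exactly the design of V in eq.(14).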
Eqs.(5) and (1) yield

f₀ = XAf + Xn.  (33)

The first term XAf on the right-hand side of eq.(33) is the signal component of f₀. It is independent of the noise n in y. Hence, it is required that the signal component of f₀ agree with the best approximation Pf of f in R(A*), which is represented by the constraint of eq.(32).

Let us consider the case where the CML criterion J_CM is used as a substitute for the PL criterion J_P. In this case, the following two of the five kinds of admissibility listed in Section 5 appear.

Theorem 3 (Inverse admissibility) All the PL operators satisfy the CML criterion, i.e., it always holds that

A{J_P} ⊂ A{J_CM}.  (34)

Theorem 4 (Complete admissibility) The PL criterion completely admits the CML criterion, i.e., it holds that

A{J_P} = A{J_CM}  (35)

if and only if

N(A) = {0}  (36)

or

N(A) = H and R(Q) = {0},  (37)

where N(A) represents the null space of A.

Theorem 3 says that any projection learning operator A^(P) always suppresses noise in the training data. Theorems 3 and 4 show that, in general, there are A^(CM)'s which do not satisfy J_P. Hence, noise suppression in the training data is not enough for CML to obtain the same generalization as PL; in addition, eq.(36) or eq.(37) has to be satisfied. N(A) is the subspace consisting of the functions which are mapped to the zero vector by the sampling operator A. Eq.(37) is the condition that all the training data are statistically always zero, which does not make sense in practical learning problems. Therefore, eq.(36) is the essential condition for complete admissibility. When eq.(36) does not hold, the situation is as follows. In Fig. 2, N(A), which contains many nonzero functions, is drawn as a line perpendicular to R(A*).
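Condition (36) is easy to test numerically: N(A) = {0} exactly when the sampling matrix has full column rank. A small check (my own illustration; the training points {0.8, 0.4, 1} and {1, 1, 1} are assumed readings of the Fig. 3 and Fig. 4 sets):

```python
import numpy as np

# Rank test for condition (36), N(A) = {0}, in H = span{sin, cos}.
def sampling_matrix(points):
    return np.stack([[np.sin(x), np.cos(x)] for x in points])

A3 = sampling_matrix([0.8, 0.4, 1.0])   # Fig. 3 training set (assumed)
A4 = sampling_matrix([1.0, 1.0, 1.0])   # Fig. 4 training set (assumed)

rank3 = np.linalg.matrix_rank(A3)       # 2 = dim H  ->  N(A) = {0}
rank4 = np.linalg.matrix_rank(A4)       # 1 < dim H  ->  N(A) is nontrivial
```

With all Fig. 4 points equal, the kernel functions K(·, x_m) coincide, so A cannot separate the two basis directions; eq.(36) fails, matching the poorer generalization seen in Fig. 4.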
From eq.(13), the function learned by CML from y in R(U) is given as

f₀ = V†A*U†y + f₁,  (38)

where f₁ is an arbitrary function in N(A). The first term on the right-hand side of eq.(38) is the function obtained by PL from the same y. Hence, the generalization error of the function in eq.(38) depends on the selection of f₁. When eq.(36) holds, N(A) consists only of the zero function. Then the function in eq.(38) becomes equal to the function obtained by PL. Whether eq.(36) holds or not depends on the training set {x_m}_{m=1}^M, because the sampling operator A is determined by the training set, as shown in eq.(3). Hence, the generalization of CML depends on the selection of the training set. The training set used in Fig. 3 satisfies eq.(36), whereas the training set used in Fig. 4 does not. This difference causes the variation in generalization between Fig. 3 and Fig. 4.

7 Generalization ability of RML

Although RML does not consider the suppression of noise, it provides the same generalization as PL under certain conditions. In this section we interpret these conditions by using the results on CML. Since statistically almost all y lie in R(U), let us consider RML with y limited to R(U). Complete admissibility holds for CML if eq.(36) is satisfied, while for RML the additional condition

QR(A)⊥ ⊂ R(A)⊥  (39)

is necessary. The left-hand side of eq.(39) is the subspace which results in optimal noise suppression in the sense of CML, as shown by Theorem 2. RML also suppresses noise to some extent, although this may seem to contradict the RML criterion. RML constructs f₀ so that Af₀ becomes the best approximation to the noisy y. Af₀ belongs to R(A), even though in general y does not belong to R(A) because of the noise. Hence, the best approximation to y is the orthogonal projection of y onto R(A). As a result, the component of the noise in R(A)⊥ is removed, independently of the nature of the noise. When R(A)⊥ includes QR(A)⊥, as in eq.(39), for any y in R(U) the component in R(A)⊥
becomes equal to the component in QR(A)⊥. Then RML can suppress as much noise in the training data as CML does.

8 Conclusions

We proposed error correcting memorization learning (CML), which can suppress noise in training data by using the noise correlation matrix. By comparing the generalization ability of CML with that of PL, we obtained a necessary and sufficient condition under which CML provides the same generalization as PL. In the case of RML, a further condition is necessary. We interpreted the meaning of this additional condition by using the results on CML.

Acknowledgements

We would like to thank Mr. S. Vijayakumar for fruitful discussions. This work was supported by the Grants-in-Aid for Scientific Research # and #4429.

References

[1] A. Albert, Regression and the Moore-Penrose Pseudoinverse, Academic Press (1972).
[2] C. M. Bishop, "Improving the generalization properties of radial basis function networks", Neural Computation, vol.3, no.4, pp.579-588 (1991).
[3] H. Ogawa, "Projection filter regularization of ill-conditioned problem", Proc. SPIE, Inverse Problems in Optics, vol.808 (1987).
[4] H. Ogawa, "Neural network learning, generalization and over-learning", Proc. ICIIPS'92 (Beijing), Oct.-Nov. 1992, vol.2 (1992).
[5] T. Poggio and F. Girosi, "Networks for approximation and learning", Proc. of the IEEE, vol.78, no.9, pp.1481-1497 (Sep. 1990).
[6] Y. Yamashita and H. Ogawa, "Properties of averaged projection filter for image restoration", Trans. IEICE, Japan, vol.J74-D-II, no.2 (Feb. 1991) (in Japanese).
[7] H. Ogawa and Y. Yamasaki, "A theory of over-learning", Trans. IEICE, Japan, vol.J76-D-II, no.7 (June 1993) (in Japanese); an English short version appeared in Artificial Neural Networks 2, vol.1, I. Aleksander and J. Taylor, Eds., North-Holland, pp.25-28 (1992).
[8] A. Hirabayashi and H. Ogawa, "Admissibility of memorization learning with respect to projection learning in the presence of noise", Proc. ICNN'96 (Washington, D.C.), Jun. 3-6, 1996, vol.1 (June 1996).
More informationVectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1. x 2. x =
Linear Algebra Review Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1 x x = 2. x n Vectors of up to three dimensions are easy to diagram.
More informationLinear algebra II Homework #1 due Thursday, Feb A =
Homework #1 due Thursday, Feb. 1 1. Find the eigenvalues and the eigenvectors of the matrix [ ] 3 2 A =. 1 6 2. Find the eigenvalues and the eigenvectors of the matrix 3 2 2 A = 2 3 2. 2 2 1 3. The following
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationPreliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012
Instructions Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 The exam consists of four problems, each having multiple parts. You should attempt to solve all four problems. 1.
More informationThe Gram-Schmidt Process 1
The Gram-Schmidt Process In this section all vector spaces will be subspaces of some R m. Definition.. Let S = {v...v n } R m. The set S is said to be orthogonal if v v j = whenever i j. If in addition
More informationLINEAR ALGEBRA REVIEW
LINEAR ALGEBRA REVIEW When we define a term, we put it in boldface. This is a very compressed review; please read it very carefully and be sure to ask questions on parts you aren t sure of. x 1 WedenotethesetofrealnumbersbyR.
More information4.3 - Linear Combinations and Independence of Vectors
- Linear Combinations and Independence of Vectors De nitions, Theorems, and Examples De nition 1 A vector v in a vector space V is called a linear combination of the vectors u 1, u,,u k in V if v can be
More informationMATH 22A: LINEAR ALGEBRA Chapter 4
MATH 22A: LINEAR ALGEBRA Chapter 4 Jesús De Loera, UC Davis November 30, 2012 Orthogonality and Least Squares Approximation QUESTION: Suppose Ax = b has no solution!! Then what to do? Can we find an Approximate
More informationContents. 2.1 Vectors in R n. Linear Algebra (part 2) : Vector Spaces (by Evan Dummit, 2017, v. 2.50) 2 Vector Spaces
Linear Algebra (part 2) : Vector Spaces (by Evan Dummit, 2017, v 250) Contents 2 Vector Spaces 1 21 Vectors in R n 1 22 The Formal Denition of a Vector Space 4 23 Subspaces 6 24 Linear Combinations and
More informationIntertibility and spectrum of the multiplication operator on the space of square-summable sequences
Intertibility and spectrum of the multiplication operator on the space of square-summable sequences Objectives Establish an invertibility criterion and calculate the spectrum of the multiplication operator
More informationQuantum logics with given centres and variable state spaces Mirko Navara 1, Pavel Ptak 2 Abstract We ask which logics with a given centre allow for en
Quantum logics with given centres and variable state spaces Mirko Navara 1, Pavel Ptak 2 Abstract We ask which logics with a given centre allow for enlargements with an arbitrary state space. We show in
More informationNew concepts: Span of a vector set, matrix column space (range) Linearly dependent set of vectors Matrix null space
Lesson 6: Linear independence, matrix column space and null space New concepts: Span of a vector set, matrix column space (range) Linearly dependent set of vectors Matrix null space Two linear systems:
More informationLecture 2: Review of Prerequisites. Table of contents
Math 348 Fall 217 Lecture 2: Review of Prerequisites Disclaimer. As we have a textbook, this lecture note is for guidance and supplement only. It should not be relied on when preparing for exams. In this
More informationAn Integral Representation of Functions using. Three-layered Networks and Their Approximation. Bounds. Noboru Murata 1
An Integral epresentation of Functions using Three-layered Networks and Their Approximation Bounds Noboru Murata Department of Mathematical Engineering and Information Physics, University of Tokyo, Hongo
More informationDesigning Information Devices and Systems II
EECS 16B Fall 2016 Designing Information Devices and Systems II Linear Algebra Notes Introduction In this set of notes, we will derive the linear least squares equation, study the properties symmetric
More informationMTH 2032 SemesterII
MTH 202 SemesterII 2010-11 Linear Algebra Worked Examples Dr. Tony Yee Department of Mathematics and Information Technology The Hong Kong Institute of Education December 28, 2011 ii Contents Table of Contents
More informationChapter 3: Vector Spaces x1: Basic concepts Basic idea: a vector space V is a collection of things you can add together, and multiply by scalars (= nu
Math 314 Topics for second exam Technically, everything covered by the rst exam plus Chapter 2 x6 Determinants (Square) matrices come in two avors: invertible (all Ax = b have a solution) and noninvertible
More informationε ε
The 8th International Conference on Computer Vision, July, Vancouver, Canada, Vol., pp. 86{9. Motion Segmentation by Subspace Separation and Model Selection Kenichi Kanatani Department of Information Technology,
More information1. Subspaces A subset M of Hilbert space H is a subspace of it is closed under the operation of forming linear combinations;i.e.,
Abstract Hilbert Space Results We have learned a little about the Hilbert spaces L U and and we have at least defined H 1 U and the scale of Hilbert spaces H p U. Now we are going to develop additional
More informationMath Real Analysis II
Math 4 - Real Analysis II Solutions to Homework due May Recall that a function f is called even if f( x) = f(x) and called odd if f( x) = f(x) for all x. We saw that these classes of functions had a particularly
More informationBearing fault diagnosis based on EMD-KPCA and ELM
Bearing fault diagnosis based on EMD-KPCA and ELM Zihan Chen, Hang Yuan 2 School of Reliability and Systems Engineering, Beihang University, Beijing 9, China Science and Technology on Reliability & Environmental
More informationSubspace Information Criterion for Model Selection
Neural Computation, vol.13, no.8, pp.1863 1889, 21. 1 Subspace Information Criterion for Model Selection Masashi Sugiyama Hidemitsu Ogawa Department of Computer Science, Graduate School of Information
More informationLinear Algebra, Summer 2011, pt. 3
Linear Algebra, Summer 011, pt. 3 September 0, 011 Contents 1 Orthogonality. 1 1.1 The length of a vector....................... 1. Orthogonal vectors......................... 3 1.3 Orthogonal Subspaces.......................
More informationDS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.
DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1
More informationOctober 25, 2013 INNER PRODUCT SPACES
October 25, 2013 INNER PRODUCT SPACES RODICA D. COSTIN Contents 1. Inner product 2 1.1. Inner product 2 1.2. Inner product spaces 4 2. Orthogonal bases 5 2.1. Existence of an orthogonal basis 7 2.2. Orthogonal
More informationOld painting digital color restoration
Old painting digital color restoration Michail Pappas Ioannis Pitas Dept. of Informatics, Aristotle University of Thessaloniki GR-54643 Thessaloniki, Greece Abstract Many old paintings suffer from the
More informationSeminar on Linear Algebra
Supplement Seminar on Linear Algebra Projection, Singular Value Decomposition, Pseudoinverse Kenichi Kanatani Kyoritsu Shuppan Co., Ltd. Contents 1 Linear Space and Projection 1 1.1 Expression of Linear
More informationDavid Hilbert was old and partly deaf in the nineteen thirties. Yet being a diligent
Chapter 5 ddddd dddddd dddddddd ddddddd dddddddd ddddddd Hilbert Space The Euclidean norm is special among all norms defined in R n for being induced by the Euclidean inner product (the dot product). A
More informationTORWARDS A GENERAL FORMULATION FOR OVER-SAMPLING AND UNDER-SAMPLING
TORWARDS A GEERAL FORMULATIO FOR OVER-SAMPLIG AD UDER-SAMPLIG Aira Hirabayashi 1 and Laurent Condat 2 1 Dept. of Information Science and Engineering, Yamaguchi University, 2-16-1, Toiwadai, Ube 755-8611,
More informationGaussian Process Regression: Active Data Selection and Test Point. Rejection. Sambu Seo Marko Wallat Thore Graepel Klaus Obermayer
Gaussian Process Regression: Active Data Selection and Test Point Rejection Sambu Seo Marko Wallat Thore Graepel Klaus Obermayer Department of Computer Science, Technical University of Berlin Franklinstr.8,
More informationLECTURE 7. k=1 (, v k)u k. Moreover r
LECTURE 7 Finite rank operators Definition. T is said to be of rank r (r < ) if dim T(H) = r. The class of operators of rank r is denoted by K r and K := r K r. Theorem 1. T K r iff T K r. Proof. Let T
More informationTHE PROBLEMS OF ROBUST LPC PARAMETRIZATION FOR. Petr Pollak & Pavel Sovka. Czech Technical University of Prague
THE PROBLEMS OF ROBUST LPC PARAMETRIZATION FOR SPEECH CODING Petr Polla & Pavel Sova Czech Technical University of Prague CVUT FEL K, 66 7 Praha 6, Czech Republic E-mail: polla@noel.feld.cvut.cz Abstract
More informationMath Linear Algebra II. 1. Inner Products and Norms
Math 342 - Linear Algebra II Notes 1. Inner Products and Norms One knows from a basic introduction to vectors in R n Math 254 at OSU) that the length of a vector x = x 1 x 2... x n ) T R n, denoted x,
More information5 and A,1 = B = is obtained by interchanging the rst two rows of A. Write down the inverse of B.
EE { QUESTION LIST EE KUMAR Spring (we will use the abbreviation QL to refer to problems on this list the list includes questions from prior midterm and nal exams) VECTORS AND MATRICES. Pages - of the
More informationwhich arises when we compute the orthogonal projection of a vector y in a subspace with an orthogonal basis. Hence assume that P y = A ij = x j, x i
MODULE 6 Topics: Gram-Schmidt orthogonalization process We begin by observing that if the vectors {x j } N are mutually orthogonal in an inner product space V then they are necessarily linearly independent.
More informationPh 219/CS 219. Exercises Due: Friday 20 October 2006
1 Ph 219/CS 219 Exercises Due: Friday 20 October 2006 1.1 How far apart are two quantum states? Consider two quantum states described by density operators ρ and ρ in an N-dimensional Hilbert space, and
More informationLinear Algebra (Review) Volker Tresp 2017
Linear Algebra (Review) Volker Tresp 2017 1 Vectors k is a scalar (a number) c is a column vector. Thus in two dimensions, c = ( c1 c 2 ) (Advanced: More precisely, a vector is defined in a vector space.
More informationHilbert Spaces: Infinite-Dimensional Vector Spaces
Hilbert Spaces: Infinite-Dimensional Vector Spaces PHYS 500 - Southern Illinois University October 27, 2016 PHYS 500 - Southern Illinois University Hilbert Spaces: Infinite-Dimensional Vector Spaces October
More information1. The Polar Decomposition
A PERSONAL INTERVIEW WITH THE SINGULAR VALUE DECOMPOSITION MATAN GAVISH Part. Theory. The Polar Decomposition In what follows, F denotes either R or C. The vector space F n is an inner product space with
More informationKernels for Multi task Learning
Kernels for Multi task Learning Charles A Micchelli Department of Mathematics and Statistics State University of New York, The University at Albany 1400 Washington Avenue, Albany, NY, 12222, USA Massimiliano
More informationExercises * on Linear Algebra
Exercises * on Linear Algebra Laurenz Wiskott Institut für Neuroinformatik Ruhr-Universität Bochum, Germany, EU 4 February 7 Contents Vector spaces 4. Definition...............................................
More informationMatrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =
30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can
More informationOptimum Sampling Vectors for Wiener Filter Noise Reduction
58 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 50, NO. 1, JANUARY 2002 Optimum Sampling Vectors for Wiener Filter Noise Reduction Yukihiko Yamashita, Member, IEEE Absact Sampling is a very important and
More informationGeneral Inner Product & Fourier Series
General Inner Products 1 General Inner Product & Fourier Series Advanced Topics in Linear Algebra, Spring 2014 Cameron Braithwaite 1 General Inner Product The inner product is an algebraic operation that
More informationPart 1a: Inner product, Orthogonality, Vector/Matrix norm
Part 1a: Inner product, Orthogonality, Vector/Matrix norm September 19, 2018 Numerical Linear Algebra Part 1a September 19, 2018 1 / 16 1. Inner product on a linear space V over the number field F A map,
More informationMath113: Linear Algebra. Beifang Chen
Math3: Linear Algebra Beifang Chen Spring 26 Contents Systems of Linear Equations 3 Systems of Linear Equations 3 Linear Systems 3 2 Geometric Interpretation 3 3 Matrices of Linear Systems 4 4 Elementary
More informationBasic Elements of Linear Algebra
A Basic Review of Linear Algebra Nick West nickwest@stanfordedu September 16, 2010 Part I Basic Elements of Linear Algebra Although the subject of linear algebra is much broader than just vectors and matrices,
More information(v, w) = arccos( < v, w >
MA322 Sathaye Notes on Inner Products Notes on Chapter 6 Inner product. Given a real vector space V, an inner product is defined to be a bilinear map F : V V R such that the following holds: For all v
More informationADJOINTS, ABSOLUTE VALUES AND POLAR DECOMPOSITIONS
J. OPERATOR THEORY 44(2000), 243 254 c Copyright by Theta, 2000 ADJOINTS, ABSOLUTE VALUES AND POLAR DECOMPOSITIONS DOUGLAS BRIDGES, FRED RICHMAN and PETER SCHUSTER Communicated by William B. Arveson Abstract.
More information= w 2. w 1. B j. A j. C + j1j2
Local Minima and Plateaus in Multilayer Neural Networks Kenji Fukumizu and Shun-ichi Amari Brain Science Institute, RIKEN Hirosawa 2-, Wako, Saitama 35-098, Japan E-mail: ffuku, amarig@brain.riken.go.jp
More informationLecture 4 February 2
4-1 EECS 281B / STAT 241B: Advanced Topics in Statistical Learning Spring 29 Lecture 4 February 2 Lecturer: Martin Wainwright Scribe: Luqman Hodgkinson Note: These lecture notes are still rough, and have
More informationusing the Hamiltonian constellations from the packing theory, i.e., the optimal sphere packing points. However, in [11] it is shown that the upper bou
Some 2 2 Unitary Space-Time Codes from Sphere Packing Theory with Optimal Diversity Product of Code Size 6 Haiquan Wang Genyuan Wang Xiang-Gen Xia Abstract In this correspondence, we propose some new designs
More information1 Linear Algebra Problems
Linear Algebra Problems. Let A be the conjugate transpose of the complex matrix A; i.e., A = A t : A is said to be Hermitian if A = A; real symmetric if A is real and A t = A; skew-hermitian if A = A and
More information