Advances in Neural Information Processing Systems 8

Active Learning in Multilayer Perceptrons

Kenji Fukumizu
Information and Communication R&D Center, Ricoh Co., Ltd.
Shin-yokohama, Yokohama, 222 Japan
E-mail: fuku@ic.rdc.ricoh.co.jp

Abstract

We propose an active learning method with hidden-unit reduction, which is devised especially for multilayer perceptrons (MLP). First, we review our active learning method, and point out that many Fisher-information-based methods applied to MLP share a critical problem: the information matrix may be singular. To solve this problem, we derive the singularity condition of the information matrix and propose an active learning technique that is applicable to MLP. Its effectiveness is verified through experiments.

1 INTRODUCTION

When one trains a learning machine using a set of data given by the true system, its ability can be improved if one selects the training data actively. In this paper, we consider the problem of active learning in multilayer perceptrons (MLP). First, we review our method of active learning (Fukumizu et al., 1994), in which we prepare a probability distribution and obtain training data as samples from it. This methodology leads to an information-matrix-based criterion similar to other existing ones (Fedorov, 1972; Pukelsheim, 1993). Active learning techniques have recently been used with neural networks (MacKay, 1992; Cohn, 1994). Our method, however, like many others, has a crucial problem: the required inverse of the information matrix may not exist (White, 1989). We propose an active learning technique that is applicable to three-layer perceptrons. Developing a theory on the singularity of the Fisher information matrix, we present an active learning algorithm that keeps the information matrix nonsingular. We demonstrate the effectiveness of the algorithm through experiments.

2 STATISTICALLY OPTIMAL TRAINING DATA

2.1 A CRITERION OF OPTIMALITY

We review the criterion of statistically optimal training data (Fukumizu et al., 1994). We consider the regression problem in which the target system maps a given input $x$ to $y$ according to

y = f(x) + Z,

where $f(x)$ is a deterministic function from $R^L$ to $R^M$, and $Z$ is a random variable whose law is the normal distribution $N(0, \sigma^2 I_M)$ ($I_M$ is the $M \times M$ unit matrix). Our objective is to estimate the true function $f$ as accurately as possible.

Let $\{f(x; \theta)\}$ be a parametric model for estimation. We use the maximum likelihood estimator (MLE) $\hat\theta$ for the training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, which in this case minimizes the sum of squared errors. In the theoretical derivations, we assume that the target function $f$ is included in the model and equal to $f(\cdot; \theta_0)$. We make a training example by choosing an input $x^{(i)}$ to try, observing the resulting output $y^{(i)}$, and pairing them. The problem of active learning is how to determine the input data $\{x^{(i)}\}_{i=1}^{N}$ so as to minimize the estimation error after training. Our approach is a statistical one: we use a probability distribution for training, $r(x)$, and choose $\{x^{(i)}\}_{i=1}^{N}$ as independent samples from $r(x)$ to minimize the expectation of the MSE in the actual environment:

EMSE = E_{\{(x^{(i)}, y^{(i)})\}} \left[ \int\!\!\int \| y - f(x; \hat\theta) \|^2 \, p(y|x) \, dy \, dQ \right].   (1)

In the above equation, $Q$ is the environmental probability which gives input vectors to the true system in the actual environment, and $E_{\{(x^{(i)}, y^{(i)})\}}$ denotes the expectation over the training data. Eq. (1), therefore, gives the average error of the trained machine when it is used as a substitute for the true function in the actual environment.

2.2 REVIEW OF AN ACTIVE LEARNING METHOD

Using statistical asymptotic theory, Eq. (1) is approximated as follows:

EMSE = \frac{\sigma^2}{N} Tr[ I(\theta_0) J^{-1}(\theta_0) ] + O(N^{-3/2}),   (2)

where the matrices $I$ and $J$ are (Fisher) information matrices defined by

I_{ab}(x; \theta) = \frac{\partial f^T}{\partial \theta_a}(x; \theta) \, \frac{\partial f}{\partial \theta_b}(x; \theta),   I(\theta) = \int I(x; \theta) \, dQ(x),   J(\theta) = \int I(x; \theta) \, r(x) \, dx.

The essential part of Eq. (2) is $Tr[I(\theta_0) J^{-1}(\theta_0)]$, which involves the unavailable parameter $\theta_0$. We have proposed a practical algorithm in which we replace $\theta_0$ with $\hat\theta$, prepare a family of probabilities $\{r(x; v) \mid v : \text{parameter}\}$ from which to choose training samples, and optimize $v$ and $\hat\theta$ iteratively (Fukumizu et al., 1994).

Active Learning Algorithm
1. Select an initial training data set $D^{[0]}$ from $r(x; v^{[0]})$, and compute $\hat\theta^{[0]}$.
2. $k := 1$.
3. Compute the optimal $v = v^{[k]}$ to minimize $Tr[I(\hat\theta^{[k-1]}) J^{-1}(\hat\theta^{[k-1]})]$.
4. Choose $N_k$ new training data from $r(x; v^{[k]})$ and let $D^{[k]}$ be the union of $D^{[k-1]}$ and the new data.
5. Compute the MLE $\hat\theta^{[k]}$ based on the training data set $D^{[k]}$.
6. $k := k + 1$ and go to 3.
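As an illustration of the criterion in Eq. (2), the quantity $Tr[I(\hat\theta) J^{-1}(\hat\theta)]$ can be estimated by Monte Carlo once the per-example Jacobian of the network output with respect to the parameters is available. The following numpy sketch is not part of the original paper: the `jacobian` argument is a hypothetical user-supplied routine, and the small ridge term only guards this illustration against exact singularity (the paper instead keeps $J$ nonsingular via the reduction procedure of Section 3).

```python
import numpy as np

def fisher_matrix(xs, theta, jacobian):
    """Monte Carlo estimate of an information matrix
    (1/n) * sum_i Jf(x_i)^T Jf(x_i), where Jf is the M x P Jacobian
    of the network output with respect to the P parameters."""
    P = theta.size
    G = np.zeros((P, P))
    for x in xs:
        Jf = jacobian(x, theta)            # expected shape (M, P)
        G += Jf.T @ Jf
    return G / len(xs)

def emse_criterion(xs_env, xs_train, theta, jacobian, ridge=1e-8):
    """Estimate Tr[I(theta) J(theta)^{-1}] of Eq. (2).
    xs_env   -- samples drawn from the environmental distribution Q
    xs_train -- samples drawn from the training distribution r(x; v)
    jacobian -- hypothetical routine returning d f(x; theta) / d theta
    ridge    -- tiny regularizer, used only in this illustration to
                avoid exact singularity of J."""
    I = fisher_matrix(xs_env, theta, jacobian)
    J = fisher_matrix(xs_train, theta, jacobian)
    J = J + ridge * np.eye(J.shape[0])
    return np.trace(I @ np.linalg.inv(J))
```

Minimizing this estimate over the parameter $v$ of $r(x; v)$, for example by steepest descent, corresponds to step 3 of the algorithm above.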

The above method utilizes a probability distribution to generate training data. Compared with existing methods, in which only one data point is chosen at each step, it has the advantage of producing many data in one step, though the underlying criteria are similar to each other.

3 SINGULARITY OF AN INFORMATION MATRIX

3.1 A PROBLEM ON ACTIVE LEARNING IN MLP

Hereafter, we focus on active learning in three-layer perceptrons with $H$ hidden units, $N_H = \{f(x; \theta)\}$. The map $f(x; \theta)$ is defined by

f_i(x; \theta) = \sum_{j=1}^{H} w_{ij} \, s\Bigl( \sum_{k=1}^{L} u_{jk} x_k + \zeta_j \Bigr) + \eta_i,   (1 \le i \le M),   (3)

where $s(t)$ is the sigmoidal function $s(t) = 1/(1 + e^{-t})$.

Our active learning method, like many others, requires the inverse of the information matrix $J$. The information matrix of an MLP, however, is not always invertible (White, 1989). Any statistical algorithm utilizing this inverse therefore cannot be applied directly to MLP (Hagiwara et al., 1993). Such problems do not arise in linear models, which almost always have a nonsingular information matrix.

3.2 SINGULARITY OF AN INFORMATION MATRIX OF MLP

The following theorem shows that the information matrix of a three-layer perceptron is singular if and only if the network has redundant hidden units. We can deduce that if the information matrix is singular, we can make it nonsingular by eliminating redundant hidden units without changing the input-output map.

Theorem 1. Assume $r(x)$ is continuous and positive at any $x$. Then the Fisher information matrix $J$ is singular if and only if at least one of the following three conditions is satisfied:
(1) $u_j := (u_{j1}, \ldots, u_{jL})^T = 0$ for some $j$.
(2) $w_j := (w_{1j}, \ldots, w_{Mj}) = 0^T$ for some $j$.
(3) For different $j_1$ and $j_2$, $(u_{j_1}^T, \zeta_{j_1}) = (u_{j_2}^T, \zeta_{j_2})$ or $(u_{j_1}^T, \zeta_{j_1}) = -(u_{j_2}^T, \zeta_{j_2})$.

A rough sketch of the proof is given below. The complete proof will appear in a forthcoming paper (Fukumizu, 1996).

Rough sketch of the proof. It is easily seen that an information matrix is singular if and only if the derivatives $\{\partial f / \partial \theta_a\}_a$ are linearly dependent. Sufficiency is proved easily. To show necessity, we show that the derivatives are linearly independent if none of the three conditions is satisfied. Assume a linear relation

\sum_{i=1}^{M} \sum_{j=1}^{H} \alpha_{ij} \frac{\partial f}{\partial w_{ij}} + \sum_{j=1}^{H} \sum_{k=1}^{L} \beta_{jk} \frac{\partial f}{\partial u_{jk}} + \sum_{j=1}^{H} \beta_{j0} \frac{\partial f}{\partial \zeta_j} + \sum_{i=1}^{M} \gamma_i \frac{\partial f}{\partial \eta_i} = 0.   (4)

We can show that there exists a basis of $R^L$, $\langle x^{(1)}, \ldots, x^{(L)} \rangle$, such that $u_j \cdot x^{(l)} \ne 0$ for all $j$ and all $l$, and $u_{j_1} \cdot x^{(l)} + \zeta_{j_1} \ne \pm (u_{j_2} \cdot x^{(l)} + \zeta_{j_2})$ for $j_1 \ne j_2$ and all $l$. We replace $x$ in Eq. (4) by $x^{(l)} t$ ($t \in R$). Let $m_j^{(l)} := u_j \cdot x^{(l)}$, $S_j^{(l)} := \{ z \in C \mid z = ((2n+1)\pi\sqrt{-1} - \zeta_j)/m_j^{(l)}, \; n \in Z \}$, and $D^{(l)} := C - \bigcup_j S_j^{(l)}$. The points of $S_j^{(l)}$ are the singularities of $s(m_j^{(l)} z + \zeta_j)$. We define holomorphic functions on $D^{(l)}$ by

\Psi_i^{(l)}(z) := \sum_{j=1}^{H} \alpha_{ij} \, s(m_j^{(l)} z + \zeta_j) + \sum_{j=1}^{H} \sum_{k=1}^{L} \beta_{jk} w_{ij} \, s'(m_j^{(l)} z + \zeta_j) \, x_k^{(l)} z + \sum_{j=1}^{H} \beta_{j0} w_{ij} \, s'(m_j^{(l)} z + \zeta_j),   (1 \le i \le M).

From Eq. (4), we have $\Psi_i^{(l)}(t) = 0$ for all $t \in R$. Using standard arguments on the isolated singularities of holomorphic functions, we find that the points of $S_j^{(l)}$ are removable singularities of $\Psi_i^{(l)}(z)$, and finally obtain

w_j \sum_{k=1}^{L} \beta_{jk} x_k^{(l)} = 0,   w_j \beta_{j0} = 0,   \alpha_{ij} = 0,   \gamma_i = 0.

It is easy to see that $\beta_{jk} = 0$. This completes the proof.
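Theorem 1 gives a directly checkable characterization of singularity in terms of the weights. The following numpy sketch tests the three conditions for a given weight configuration of the network in Eq. (3). It is an illustration, not part of the paper: in particular, the numerical tolerance is an assumption of the sketch, since the theorem concerns exact equalities.

```python
import numpy as np

def information_matrix_singular(U, zeta, W, tol=1e-6):
    """Test the three conditions of Theorem 1 for the network of Eq. (3).
    U    -- (H, L) array; row j is the input weight vector u_j
    zeta -- (H,)   array of hidden-unit biases
    W    -- (M, H) array; column j is the output weight vector w_j
    Returns True if any condition holds, i.e. J is singular (up to tol)."""
    H = U.shape[0]
    # (1) u_j = 0 for some hidden unit j
    if np.any(np.linalg.norm(U, axis=1) < tol):
        return True
    # (2) w_j = 0 for some hidden unit j
    if np.any(np.linalg.norm(W, axis=0) < tol):
        return True
    # (3) (u_{j1}^T, zeta_{j1}) = +/- (u_{j2}^T, zeta_{j2}) for j1 != j2
    V = np.hstack([U, zeta[:, None]])
    for j1 in range(H):
        for j2 in range(j1 + 1, H):
            if (np.linalg.norm(V[j1] - V[j2]) < tol
                    or np.linalg.norm(V[j1] + V[j2]) < tol):
                return True
    return False
```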

3.3 REDUCTION PROCEDURE

We introduce the following reduction procedure based on Theorem 1. Used during BP training, it eliminates redundant hidden units and keeps the information matrix nonsingular. The criterion of elimination is very important, because excessive elimination of hidden units degrades the approximation capacity. We propose an algorithm which does not increase the mean squared error on average. In the following, let $\hat{s}_j := s(\hat{u}_j \cdot x + \hat\zeta_j)$ and $\varepsilon(N) = A/N$ for a positive number $A$.

Reduction Procedure
1. If $\|\hat{w}_j\|^2 \int (\hat{s}_j - s(\hat\zeta_j))^2 \, dQ < \varepsilon(N)$, then eliminate the $j$th hidden unit, and $\hat\eta_i \to \hat\eta_i + \hat{w}_{ij} \, s(\hat\zeta_j)$ for all $i$.
2. If $\|\hat{w}_j\|^2 \int \hat{s}_j^2 \, dQ < \varepsilon(N)$, then eliminate the $j$th hidden unit.
3. If $\|\hat{w}_{j_2}\|^2 \int (\hat{s}_{j_2} - \hat{s}_{j_1})^2 \, dQ < \varepsilon(N)$ for different $j_1$ and $j_2$, then eliminate the $j_2$th hidden unit and $\hat{w}_{ij_1} \to \hat{w}_{ij_1} + \hat{w}_{ij_2}$ for all $i$.
4. If $\|\hat{w}_{j_2}\|^2 \int (1 - \hat{s}_{j_2} - \hat{s}_{j_1})^2 \, dQ < \varepsilon(N)$ for different $j_1$ and $j_2$, then eliminate the $j_2$th hidden unit and $\hat{w}_{ij_1} \to \hat{w}_{ij_1} - \hat{w}_{ij_2}$, $\hat\eta_i \to \hat\eta_i + \hat{w}_{ij_2}$ for all $i$.

From Theorem 1, we know that $\hat{w}_j$, $\hat{u}_j$, $(\hat{u}_{j_2}^T, \hat\zeta_{j_2}) - (\hat{u}_{j_1}^T, \hat\zeta_{j_1})$, or $(\hat{u}_{j_2}^T, \hat\zeta_{j_2}) + (\hat{u}_{j_1}^T, \hat\zeta_{j_1})$ can be reduced to 0 if the information matrix is singular. Let $\tilde\theta \in N_K$ denote the parameter obtained by reducing $\hat\theta$ according to the above procedure. The above four conditions are then given by calculating $\int \| f(x; \tilde\theta) - f(x; \hat\theta) \|^2 \, dQ$.
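To make the elimination tests concrete, the following numpy sketch approximates the integrals over $Q$ by a Monte Carlo average on samples drawn from $Q$. It is not the paper's implementation: the default value of $A$, the single sweep over units, and the restriction to conditions 1 and 2 are assumptions made for brevity (conditions 3 and 4 would merge near-duplicate or mirrored units in the same way).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def reduce_hidden_units(U, zeta, W, eta, xs_env, n_train, A=1e-2):
    """One sweep of (part of) the reduction procedure.
    U, zeta, W, eta -- current estimates of u_j, zeta_j, w_ij, eta_i
    xs_env          -- (n, L) array of Monte Carlo samples from Q
    n_train         -- N, the number of training data; eps(N) = A / N
    A               -- illustrative default; the paper only requires A > 0."""
    eps = A / n_train
    S = sigmoid(xs_env @ U.T + zeta)          # (n, H) hidden activations
    keep = list(range(U.shape[0]))
    for j in list(keep):
        wj2 = np.sum(W[:, j] ** 2)
        # condition 1: unit j is nearly constant -> fold it into the output bias
        if wj2 * np.mean((S[:, j] - sigmoid(zeta[j])) ** 2) < eps:
            eta = eta + W[:, j] * sigmoid(zeta[j])
            keep.remove(j)
            continue
        # condition 2: unit j contributes almost nothing -> simply drop it
        if wj2 * np.mean(S[:, j] ** 2) < eps:
            keep.remove(j)
    return U[keep], zeta[keep], W[:, keep], eta
```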

We briefly explain why the procedure keeps the information matrix nonsingular and does not increase the EMSE with high probability. First, suppose $\det J(\theta_0) = 0$; then there exists $\theta_0^K \in N_K$ ($K < H$) such that $f(x; \theta_0) = f(x; \theta_0^K)$ and $\det J(\theta_0^K) \ne 0$ in $N_K$. The elimination of hidden units down to $K$, of course, does not increase the EMSE. Therefore, we have only to consider the case in which $\det J(\theta_0) \ne 0$ and hidden units are eliminated.

Suppose $\int \| f(x; \theta_0^K) - f(x; \theta_0) \|^2 \, dQ > O(N^{-1})$ for any parameter $\theta_0^K$ reduced from $\theta_0$. Then the probability of satisfying $\int \| f(x; \hat\theta) - f(x; \tilde\theta) \|^2 \, dQ < A/N$ is very small for a sufficiently small $A$; thus, the elimination of hidden units occurs only with very tiny probability.

Next, suppose $\int \| f(x; \theta_0^K) - f(x; \theta_0) \|^2 \, dQ = O(N^{-1})$. Let $\tilde\theta \in N_K$ be the parameter reduced from $\hat\theta$ by the same procedure by which we obtain $\theta_0^K$ from $\theta_0$. We will show that, for a sufficiently small $A$,

E\left[ \int \| f(x; \hat\theta^K) - f(x; \theta_0) \|^2 \, dQ \right] \le E\left[ \int \| f(x; \hat\theta) - f(x; \theta_0) \|^2 \, dQ \right],

where $\hat\theta^K$ is the MLE computed in $N_K$. We write $\theta = (\theta^{(1)}, \theta^{(2)})$, in which $\theta^{(2)}$ is the part set to 0 in the reduction, changing the coordinate system if necessary. The Taylor expansion and asymptotic theory give

E\left[ \int \| f(x; \hat\theta^K) - f(x; \theta_0) \|^2 \, dQ \right] \simeq \int \| f(x; \theta_0^K) - f(x; \theta_0) \|^2 \, dQ + \frac{\sigma^2}{N} Tr[ I_{11}(\theta_0^K) J_{11}^{-1}(\theta_0^K) ],

E\left[ \int \| f(x; \tilde\theta) - f(x; \hat\theta) \|^2 \, dQ \right] \simeq \int \| f(x; \theta_0^K) - f(x; \theta_0) \|^2 \, dQ + \frac{\sigma^2}{N} Tr[ I_{22}(\theta_0^K) J_{22}^{-1}(\theta_0^K) ],

where $I_{\nu\nu}$ and $J_{\nu\nu}$ denote the local information matrices with respect to $\theta^{(\nu)}$ ($\nu = 1, 2$). Thus,

E\left[ \int \| f(x; \hat\theta) - f(x; \theta_0) \|^2 \, dQ \right] - E\left[ \int \| f(x; \hat\theta^K) - f(x; \theta_0) \|^2 \, dQ \right]
  \simeq - E\left[ \int \| f(x; \tilde\theta) - f(x; \hat\theta) \|^2 \, dQ \right] + \frac{\sigma^2}{N} Tr[ I_{22}(\theta_0^K) J_{22}^{-1}(\theta_0^K) ] - \frac{\sigma^2}{N} Tr[ I_{11}(\theta_0^K) J_{11}^{-1}(\theta_0^K) ] + E\left[ \int \| f(x; \hat\theta) - f(x; \theta_0) \|^2 \, dQ \right].

Since the sum of the last two terms is positive, the left-hand side is positive if $E[ \int \| f(x; \hat\theta^K) - f(x; \hat\theta) \|^2 \, dQ ] < B/N$ for a sufficiently small $B$. Although we cannot know the value of this expectation, we can make the probability that this inequality holds very high by taking a small $A$.

4 ACTIVE LEARNING WITH REDUCTION PROCEDURE

The reduction procedure keeps the information matrix nonsingular and makes the active learning algorithm applicable to MLP even with surplus hidden units.

Active Learning with Hidden Unit Reduction
1. Select an initial training data set $D^{[0]}$ from $r(x; v^{[0]})$, and compute $\hat\theta^{[0]}$.
2. $k := 1$, and do the REDUCTION PROCEDURE.
3. Compute the optimal $v = v^{[k]}$ to minimize $Tr[I(\hat\theta^{[k-1]}) J^{-1}(\hat\theta^{[k-1]})]$, using the steepest descent method.
4. Choose $N_k$ new training data from $r(x; v^{[k]})$ and let $D^{[k]}$ be the union of $D^{[k-1]}$ and the new data.
5. Compute the MLE $\hat\theta^{[k]}$ based on the training data $D^{[k]}$, using BP with the REDUCTION PROCEDURE.
6. $k := k + 1$ and go to 3.

BP with the reduction procedure is applicable not only to active learning but also to a variety of statistical techniques that require the inverse of an information matrix; we do not discuss this further in this paper.
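The overall control flow of the algorithm can be sketched as follows. This skeleton is only illustrative: the three callables are hypothetical placeholders for the components described in Sections 2 and 3, and only the loop structure follows the listing above.

```python
def active_learning_with_reduction(v0, theta0, query_new_data,
                                   fit_bp_with_reduction, optimize_v,
                                   n_steps, n_per_step):
    """Skeleton of 'Active Learning with Hidden Unit Reduction'.
    Hypothetical helpers (not library calls):
      query_new_data(v, n)         -- n pairs (x, y) with x ~ r(x; v) and y
                                      observed from the true system (step 4)
      fit_bp_with_reduction(th, D) -- MLE by BP with the reduction procedure
                                      (steps 1 and 5)
      optimize_v(theta, v)         -- v minimizing Tr[I(theta) J(theta)^{-1}]
                                      by steepest descent (step 3)."""
    data = list(query_new_data(v0, n_per_step))      # step 1
    theta, v = fit_bp_with_reduction(theta0, data), v0
    for _ in range(n_steps):                         # steps 2 and 6
        v = optimize_v(theta, v)                     # step 3
        data += list(query_new_data(v, n_per_step))  # step 4
        theta = fit_bp_with_reduction(theta, data)   # step 5
    return theta, v
```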

[Figure 1: Active/Passive Learning, f(x) = s(x). Left: learning curves (mean squared error, log scale, with [Av-Sd, Av+Sd] bands) for active and passive learning versus the number of training data. Right: the number of hidden units versus the number of training data in a typical run.]

5 EXPERIMENTS

We demonstrate the effect of the proposed active learning algorithm through experiments. First, we use a three-layer model with 1 input unit, 3 hidden units, and 1 output unit. The true function $f$ is an MLP network with 1 hidden unit; the information matrix is therefore singular at $\theta_0$. The environmental probability $Q$ is the normal distribution $N(0, 4)$. We evaluate the generalization error in the actual environment using the following mean squared error of the function values:

\int \| f(x; \hat\theta) - f(x) \|^2 \, dQ.

We set the deviation of the noise in the true system to $\sigma = 0.01$. As the family of training distributions $\{r(x; v)\}$, a mixture of 4 normal distributions is used. In each step of active learning, 100 new samples are added. A network is trained using online BP, presented with all the training data repeatedly in each step, and the reduction procedure is operated once every 100 cycles between the 5000th and the 10000th cycle. We run 30 trainings, changing the seed of the random number generator. For comparison, we train a network passively, based on training samples given by the probability $Q$.

Fig. 1 shows the averaged learning curves of active/passive learning and the number of hidden units in a typical learning curve. The advantage of the proposed active learning algorithm is clear. We find that the algorithm has the expected effects on this simple, ideal approximation problem.

Second, we apply the algorithm to a problem in which the true function is not included in the MLP model. We use an MLP with 4 input units, 7 hidden units, and 1 output unit. The true function is given by $f(x) = \mathrm{erf}(x_1)$, where $\mathrm{erf}(t)$ is the error function. The graph of the error function resembles that of the sigmoidal function, while they never coincide under any affine transform. We set $Q = N(0, 25^2 I_4)$. We train a network actively/passively based on 10 data sets and evaluate the MSEs of the function values. The other conditions are the same as those of the first experiment.

Fig. 2 shows the averaged learning curves and the number of hidden units in a typical learning curve. We find that the active learning algorithm reduces the error even though the theoretical condition is not perfectly satisfied in this case. This suggests the robustness of our active learning algorithm.
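To illustrate the general setup rather than the exact experiment, the following numpy sketch shows how a parametrized training distribution $r(x; v)$ (a mixture of 4 normals) can be sampled and how the generalization error $\int \| f(x; \hat\theta) - f(x) \|^2 \, dQ$ can be estimated by Monte Carlo. The parameterization of the mixture, the reading of $N(0, 4)$ as variance 4, and the Monte Carlo sample size are assumptions of this sketch, not details given in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(v, n):
    """Draw n scalar inputs from a mixture of 4 normal distributions.
    v = (means, log_sds, logits), each of length 4; this particular
    parameterization of r(x; v) is an assumption of the sketch."""
    means, log_sds, logits = (np.asarray(a, dtype=float) for a in v)
    p = np.exp(logits - np.max(logits))
    p = p / p.sum()
    comp = rng.choice(4, size=n, p=p)        # mixture component per sample
    return rng.normal(means[comp], np.exp(log_sds)[comp])

def generalization_error(f_hat, f_true, n_mc=10000, q_std=2.0):
    """Monte Carlo estimate of  int ||f(x; theta_hat) - f(x)||^2 dQ
    for the first experiment, with Q read as N(0, 4), i.e. std 2."""
    x = rng.normal(0.0, q_std, size=n_mc)
    return np.mean((f_hat(x) - f_true(x)) ** 2)
```

In the active-learning loop, $v$ would be updated by steepest descent on the estimated $Tr[I J^{-1}]$ criterion of Section 2, while a quantity like `generalization_error` plays the role of the evaluation measure plotted in Figures 1 and 2.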

[Figure 2: Active/Passive Learning, f(x) = erf(x_1). Left: learning curves (mean squared error, log scale) for active and passive learning versus the number of training data. Right: the number of hidden units versus the number of training data in a typical run.]

6 CONCLUSION

We review statistical active learning methods and point out a problem in their application to MLP: the required inverse of the information matrix does not exist if the network has redundant hidden units. We characterize the singularity condition of the information matrix and propose an active learning algorithm which is applicable to MLP with any number of hidden units. The effectiveness of the algorithm is verified through computer simulations, even when the theoretical assumptions are not perfectly satisfied.

References

D. A. Cohn. (1994) Neural network exploration using optimal experiment design. In J. Cowan et al. (eds.), Advances in Neural Information Processing Systems 6. San Mateo, CA: Morgan Kaufmann.

V. V. Fedorov. (1972) Theory of Optimal Experiments. New York: Academic Press.

K. Fukumizu. (1996) A regularity condition of the information matrix of a multilayer perceptron network. Neural Networks, to appear.

K. Fukumizu & S. Watanabe. (1994) Error estimation and learning data arrangement for neural networks. Proc. IEEE Int. Conf. Neural Networks.

K. Hagiwara, N. Toda, & S. Usui. (1993) On the problem of applying AIC to determine the structure of a layered feed-forward neural network. Proc. Int. Joint Conf. Neural Networks.

D. MacKay. (1992) Information-based objective functions for active data selection. Neural Computation 4(4).

F. Pukelsheim. (1993) Optimal Design of Experiments. New York: John Wiley & Sons.

H. White. (1989) Learning in artificial neural networks: A statistical perspective. Neural Computation 1(4).
