To appear in the IEEE Transactions on Systems, Man, and Cybernetics

Bidirectional Backpropagation

Olaoluwa Adigun, Member, IEEE, and Bart Kosko, Fellow, IEEE

Abstract—We extend backpropagation learning from ordinary unidirectional training to bidirectional training of deep multilayer neural networks. This gives a form of backward chaining or inverse inference from an observed network output to a candidate input that produced the output. The trained network learns a bidirectional mapping and can apply to some inverse problems. A bidirectional multilayer neural network can exactly represent some invertible functions. We prove that a fixed three-layer network can always exactly represent any finite permutation function and its inverse. The forward pass computes the permutation function value. The backward pass computes the function's inverse value using the same hidden neurons and weights. A joint forward-backward error function allows backpropagation learning in both directions without overwriting learning in either direction. This applies to both classification and regression. The algorithm does not require that the underlying sampled function have an inverse. The trained network tends to map an output to the centroid of its pre-image set.

Index Terms—Backpropagation learning, backward chaining, inverse problems, bidirectional associative memory, function representation and approximation.

I. BIDIRECTIONAL BACKPROPAGATION

We extend the familiar unidirectional backpropagation (BP) algorithm [1]-[5] to the bidirectional case. Unidirectional BP maps an input vector to an output vector by passing the input vector forward through the network's visible and hidden neurons and its connection weights. Bidirectional BP (B-BP) combines this forward pass with a backward pass through the same neurons and weights. It does not use two separate feedforward or unidirectional networks.

B-BP training endows a multilayered neural network N: R^n -> R^p with a form of backward inference. The forward pass gives the usual predicted neural output N(x) given a vector input x. The output vector value y = N(x) answers the what-if question that x poses: What would we observe if x occurred? What would be the effect? The backward pass answers the why question that y poses: Why did y occur? What type of input would cause y? Feedback convergence to a resonating bidirectional fixed-point attractor [6], [7] gives a long-term or equilibrium answer to both the what-if and why questions. This paper does not address the global stability of multilayered bidirectional networks.

Bidirectional neural learning applies to large-scale problems and big data because the BP algorithm scales linearly with training data. BP has time complexity O(n) for n training samples because its forward pass has complexity O(1) while its backward pass has complexity O(n). So the B-BP algorithm still has O(n) complexity because O(n) + O(n) = O(n).

Olaoluwa Adigun and Bart Kosko are with the Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089 USA (e-mail: kosko@usc.edu). Manuscript received November 9, 2016; revised January 26, 2017.

Fig. 1: Exact bidirectional representation of a permutation map. The 3-layer bidirectional threshold network exactly represents the invertible 3-bit bipolar permutation function f in Table I. The forward pass takes the input bipolar vector x at the input layer and passes it forward through the weighted edges and the hidden layer of threshold neurons to the output layer. The backward pass feeds the output bipolar vector y back through the same weighted edges and threshold neurons. All neurons are bipolar and use zero thresholds. The bidirectional network computes y = f(x) on the forward pass and the inverse value f^{-1}(y) on the backward pass.
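Figure 1's two passes share a single set of weights: the forward pass maps x through W and then U, while the backward pass maps y back through the transposes U^T and W^T. The short sketch below illustrates only this shared-weight pass structure with bipolar threshold neurons. The layer sizes and random weights are assumed for illustration and are not the Figure 1 values, so the backward pass will not in general invert the forward pass until the weights are constructed (Section II) or learned (Section III).

import numpy as np

def threshold(v):
    # Bipolar threshold neuron with zero threshold: +1 if positive, else -1.
    return np.where(v > 0, 1.0, -1.0)

def forward(x, W, U):
    # Forward pass x -> hidden -> y through W and then U.
    a_h = threshold(W.T @ x)
    return threshold(U.T @ a_h)

def backward(y, W, U):
    # Backward pass y -> hidden -> x through the SAME weights, transposed.
    a_h = threshold(U @ y)
    return threshold(W @ a_h)

# Toy example with assumed random bipolar weights (not the paper's Figure 1 weights).
rng = np.random.default_rng(0)
W = rng.choice([-1.0, 1.0], size=(3, 4))   # input-to-hidden
U = rng.choice([-1.0, 1.0], size=(4, 3))   # hidden-to-output
x = np.array([1.0, -1.0, 1.0])
print("forward:", forward(x, W, U), " backward:", backward(forward(x, W, U), W, U))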
This linear scaling does not hold for most machine-learning algorithms. An example is the quadratic complexity O(n^2) of support-vector kernel methods [8].

We first show that multilayer bidirectional networks have sufficient power to exactly represent permutation mappings. These mappings are invertible and discrete. Then we develop the B-BP algorithms that can approximate these and other mappings if the network has enough hidden neurons.

A neural network N exactly represents a function f just in case N(x) = f(x) for all input vectors x. Exact representation is a much stronger condition than the more familiar property of function approximation: N(x) ~ f(x). Feedforward multilayer neural networks can uniformly approximate continuous functions on compact sets [9], [10]. Additive fuzzy systems are also uniform function approximators [11]. But additive

fuzzy systems have the further property that they can exactly represent any real function if it is bounded [12]. This exact representation needs only two fuzzy rules because the rules absorb the function into their fuzzy sets. This holds more generally for generalized probability mixtures because the fuzzy rules define the mixed probability densities [13], [14].

Figure 1 shows a bidirectional 3-layer network of zero-threshold neurons that exactly represents the 3-bit permutation function f in Table I, where the bracket notation [- - +] denotes the bipolar vector (-1, -1, 1). So f is a self-bijection that rearranges the 8 vectors in the bipolar hypercube {-1, 1}^3. This f is just one of the 8! or 40,320 permutation maps or rearrangements on the bipolar hypercube {-1, 1}^3. The forward pass converts an input bipolar vector in the first column of Table I to the corresponding output bipolar vector in the second column. The backward pass converts that output vector back to the input vector over the same fixed synaptic connection weights. These same weights and neurons similarly convert each of the 8 input vectors in the first column of Table I to the corresponding output vector in the second column and vice versa.

Theorem 1 states that multilayer bidirectional networks can exactly represent all finite bipolar or binary permutation functions. This result requires a hidden layer with 2^n hidden neurons for a permutation function on the bipolar hypercube {-1, 1}^n. Using so many hidden neurons is neither practical nor necessary in most real-world cases. The exact bidirectional representation in Figure 1 uses only 4 hidden threshold neurons to represent the 3-bit permutation function. This was the smallest hidden layer that we found through guesswork. Many other bidirectional representations also use fewer than 8 hidden neurons. We seek instead a practical learning algorithm that can learn bidirectional approximations from sample data. Figure 2 shows a learned bidirectional representation of the same 3-bit permutation in Table I. It uses only 3 hidden neurons. The B-BP algorithm tuned the neurons' threshold values as well as their connection weights. All the learned threshold values were near zero. We rounded them to zero to achieve the bidirectional representation with just 3 hidden neurons.

The rest of the paper derives the B-BP algorithm for regression in both directions and for classification networks and mixed classification-regression networks. This takes some care because training the same weights in one direction tends to undo the BP training in the other direction. The B-BP algorithm solves this problem by minimizing a joint error function. The error function is cross entropy for classification. It is squared error for regression. See Figure 4. The learning approximation tends to improve if we add more hidden neurons. Figure 5 shows that the B-BP training cross-entropy error falls off as the number of hidden neurons grows when learning the 5-bit permutation in Table II. Figure 6 shows a deep 8-layer bidirectional approximation of the nonlinear function f(x) = 0.5 sigma(6x + 1) + 0.5 sigma(4x - 0.2) and its inverse. The network used 6 hidden layers of bipolar logistic neurons. A bipolar logistic activation sigma scales and translates an ordinary logistic. It has the form

sigma(x) = 2 / (1 + e^{-x}) - 1.    (1)

Fig. 2: Learned bidirectional representation of the 3-bit permutation in Table I. The bidirectional backpropagation algorithm found this representation using the double-classification learning laws of Section III. All the neurons were bipolar and had zero thresholds. The zero thresholding gave exact representation of the 3-bit permutation.

The final sections show that similar B-BP algorithms hold for training two-way classification networks and mixed classification-regression networks. B-BP learning also approximates non-invertible functions.
The algorithm tends to learn the centroid of many-to-one functions. Suppose that the target function f: R^n -> R^p is not one-to-one or injective. So it has no inverse f^{-1} point mapping. But it does have a set-valued inverse or pre-image pullback mapping f^{-1}: 2^{R^p} -> 2^{R^n} such that f^{-1}(B) = {x in R^n : f(x) in B} for any B subset of R^p. Suppose that the n input training samples x_1, ..., x_n map to the same output training sample y: f^{-1}({y}) = {x_1, ..., x_n}. Then B-BP learning tends to map y to the centroid of f^{-1}({y}) because the centroid minimizes the mean-squared error of regression.

Figure 7 shows such an approximation for the non-invertible target function f(x) = sin x. The forward regression approximates sin x. The backward regression approximates the average or centroid of the two points in the pre-image set of y = sin x. Then f^{-1}({y}) = {theta, pi - theta} for 0 < theta < pi/2 if 0 < y < 1. This gives the pullback's centroid as pi/2. The centroid equals -pi/2 if -1 < y < 0.

Bidirectional BP differs from earlier neural approaches to approximating inverses. Marks et al. developed an inverse algorithm for query-based learning in binary classification [15]. The BP-based algorithm is not bidirectional but it exploits the data-weight inner-product input to neurons. It holds the weights constant while it tunes the data for a given output. Wunsch et al. have applied this inverse algorithm to problems in aerospace and elsewhere [16], [17]. Bidirectional BP also

differs from the more recent bidirectional extreme-learning-machine algorithm that uses a two-stage learning process but in a unidirectional network [18].

II. BIDIRECTIONAL EXACT REPRESENTATION OF BIPOLAR PERMUTATIONS

This section proves that there exist multilayered neural networks that can exactly bidirectionally represent some invertible functions. We first define the network variables. The proof uses threshold neurons. The B-BP algorithms use soft-threshold logistic sigmoids for hidden neurons. They use identity activations for input and output neurons.

A bidirectional neural network is a multilayer network N: X -> Y that maps the input space X to the output space Y and conversely through the same set of weights. The backward pass uses the matrix transposes of the weight matrices that the forward pass uses. Such a network is a bidirectional associative memory or BAM [6], [7]. The forward pass sends the input vector x through the weight matrix W that connects the input layer to the hidden layer. The result passes on through matrix U to the output layer. The backward pass sends the output y from the output layer back through the hidden layer to the input layer.

Let I, J, and K denote the respective numbers of input, hidden, and output neurons. Then the I x J matrix W connects the input layer to the hidden layer. The J x K matrix U connects the hidden layer to the output layer.

TABLE I: 3-Bit Bipolar Permutation Function f.

  Input x      Output t
  [+ + +]      [- - +]
  [+ + -]      [- + +]
  [+ - +]      [+ + +]
  [+ - -]      [+ - +]
  [- + +]      [- + -]
  [- + -]      [- - -]
  [- - +]      [+ - -]
  [- - -]      [+ + -]

The hidden-neuron input o^h_j has the affine form

o^h_j = sum_i w_{ij} a^x_i(x_i) + b^h_j    (2)

where weight w_{ij} connects the i-th input neuron to the j-th hidden neuron, a^x_i is the activation of the i-th input neuron, and b^h_j is the bias term of the j-th hidden neuron. The activation a^h_j of the j-th hidden neuron is a bipolar threshold:

a^h_j(o^h_j) = -1 if o^h_j <= 0, and 1 if o^h_j > 0.    (3)

The B-BP algorithm in the next section uses soft-threshold bipolar logistic functions for the hidden activations because such sigmoid functions are differentiable. The proof below also modifies the hidden thresholds to take on binary values in (14) and to fire with a slightly different condition.

The input o^y_k to the k-th output neuron from the hidden layer is also affine:

o^y_k = sum_{j=1}^{J} u_{jk} a^h_j + b^y_k    (4)

where weight u_{jk} connects the j-th hidden neuron to the k-th output neuron. Term b^y_k is the additive bias of the k-th output neuron. The output activation vector a^y gives the predicted outcome or target on the forward pass. The k-th output neuron has bipolar threshold activation a^y_k:

a^y_k(o^y_k) = -1 if o^y_k <= 0, and 1 if o^y_k > 0.    (5)

The forward pass of an input bipolar vector x from Table I through the network in Figure 1 gives an output activation vector a^y that equals the table's corresponding target vector y. The backward pass feeds y from the output layer back through the hidden layer to the input layer. Then the backward-pass input o^{hb}_j to the j-th hidden neuron is

o^{hb}_j = sum_k u_{jk} a^y_k(y_k) + b^h_j    (6)

where y_k is the output of the k-th output neuron and a^y_k is the activation of the k-th output neuron. The backward-pass activation a^{hb}_j of the j-th hidden neuron is

a^{hb}_j(o^{hb}_j) = -1 if o^{hb}_j <= 0, and 1 if o^{hb}_j > 0.    (7)

The backward-pass input o^{xb}_i to the i-th input neuron is

o^{xb}_i = sum_{j=1}^{J} w_{ij} a^{hb}_j + b^x_i    (8)

where b^x_i is the bias for the i-th input neuron. The input-layer activation a^x gives the predicted value for the backward pass. The i-th input neuron has bipolar activation

a^{xb}_i(o^{xb}_i) = -1 if o^{xb}_i <= 0, and 1 if o^{xb}_i > 0.    (9)

We can now state and prove the bidirectional representation theorem for bipolar permutations. The theorem also applies to binary permutations because the input and output neurons have bipolar threshold activations.

Theorem 1 (Exact Bidirectional Representation of Bipolar Permutation Functions): Suppose that the invertible function f: {-1, 1}^n -> {-1, 1}^n is a permutation.
Then there exists a 3-layer bidirectional neural network N: {-1, 1}^n -> {-1, 1}^n that exactly represents f in the sense that N(x) = f(x) and N^{-1}(x) = f^{-1}(x) for all x. The hidden layer has 2^n threshold neurons.

Proof: The proof strategy picks the weight matrices W and U so that exactly one hidden neuron fires on both the forward and the backward pass. Figure 3 illustrates the proof technique for the special case of a 3-bit bipolar permutation. So we structure

the network such that any input vector x fires only one hidden neuron on the forward pass and such that the output vector y = N(x) fires only the same hidden neuron on the backward pass.

The bipolar permutation f is a bijective map of the bipolar hypercube {-1, 1}^n into itself. The bipolar hypercube contains the 2^n input bipolar column vectors x_1, x_2, ..., x_{2^n}. It likewise contains the 2^n output bipolar vectors y_1, y_2, ..., y_{2^n}. The network will use 2^n corresponding hidden threshold neurons. So J = 2^n. Matrix W connects the input layer to the hidden layer. Matrix U connects the hidden layer to the output layer. Define W so that its columns list all 2^n bipolar input vectors. Define U so that its rows list all 2^n transposed bipolar output vectors:

W = [ x_1  x_2  ...  x_{2^n} ]    and    U = [ y_1^T ; y_2^T ; ... ; y_{2^n}^T ].

We now show that this arrangement fires only one hidden neuron and that the forward pass of any input vector x_m gives the corresponding output vector y_m. Assume that every neuron has zero bias.

Pick a bipolar input vector x_m for the forward pass. Then the input activation vector a^x(x_m) = (a^x_1(x^1_m), ..., a^x_n(x^n_m)) equals the input bipolar vector x_m because the input activations (9) are bipolar threshold functions with zero threshold. So a^x equals x_m because the vector space is the bipolar hypercube {-1, 1}^n. The hidden-layer input o^h is the same as (2). It has the matrix-vector form

o^h = W^T a^x    (10)
    = W^T x_m    (11)
    = (o^h_1, o^h_2, ..., o^h_j, ..., o^h_{2^n})^T    (12)
    = (x_1^T x_m, x_2^T x_m, ..., x_j^T x_m, ..., x_{2^n}^T x_m)^T    (13)

from the definition of W since o^h_j is the inner product of x_j and x_m. The input o^h_j to the j-th neuron of the hidden layer obeys o^h_j = n when j = m and o^h_j < n when j differs from m. This holds because the vectors x_j are bipolar with scalar components in {-1, 1}. The squared magnitude of a bipolar vector in {-1, 1}^n is n. The inner product x_j^T x_m is a maximum when both vectors have the same direction. This occurs when j = m. The inner product is otherwise less than n.

Figure 3 shows a bidirectional neural network that fires just the sixth hidden neuron. The weight matrices W and U for the network in Figure 3 list the input and output columns of Table I in just this way.

Fig. 3: Bidirectional network structure for the proof of Theorem 1. The input and output layers have n threshold neurons while the hidden layer has 2^n neurons with threshold values of n. The 8 fan-in 3-vectors of weights in W from the input to the hidden layer list the 2^3 elements of the bipolar cube {-1, 1}^3 and thus the 8 vectors in the input column of Table I. The 8 fan-in 3-vectors of weights in U from the output to the hidden layer list the 8 bipolar vectors in the output column of Table I. The threshold value for the sixth and highlighted hidden neuron is 3. Passing the sixth input vector (-1, 1, -1) through W leads to the vector of thresholded hidden units (0, 0, 0, 0, 0, 1, 0, 0). Passing this 8-bit vector through U produces after thresholding the sixth output vector (-1, -1, -1) in Table I. Passing this output vector back through the transpose of U produces the same unit bit vector of thresholded hidden-unit values. Passing this vector back through the transpose of W produces the original bipolar vector (-1, 1, -1).

Now comes the key step in the proof. Define the hidden activation a^h_j as a binary (not bipolar) threshold function where n is the threshold value:

a^h_j(o^h_j) = 1 if o^h_j >= n, and 0 if o^h_j < n.    (14)

Then the hidden-layer activation a^h is the unit bit vector (0, 0, ..., 1, ..., 0)^T where a^h_j = 1 when j = m and where a^h_j = 0 when j differs from m. This holds because all 2^n bipolar vectors x_m in {-1, 1}^n are distinct. So exactly one of these 2^n vectors achieves the maximal inner-product value n = x_m^T x_m. See Figure 3 for details in the case of representing the 3-bit bipolar permutation in Table I.
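A short script can check this construction numerically for n = 3: stack all 2^n bipolar vectors as the columns of W, stack their permuted images as the rows of U, and use the binary hidden threshold of (14) with threshold value n. The permutation below is an assumed cyclic relabeling of the 8 cube vectors rather than the specific function of Table I, so only the construction itself is being tested.

import numpy as np
from itertools import product

n = 3
cube = np.array(list(product([-1, 1], repeat=n)), dtype=float)  # all 2^n bipolar vectors
perm = np.roll(np.arange(2 ** n), 1)          # assumed permutation: a cyclic relabeling
X = cube                                      # inputs x_j, one per row
Y = cube[perm]                                # outputs y_j = f(x_j), one per row

W = X.T                                       # n x 2^n: columns list all inputs
U = Y                                         # 2^n x n: rows list all outputs

def hidden(o):
    # Binary (not bipolar) threshold with threshold value n, as in (14).
    return (o >= n).astype(float)

def forward(x):
    a_h = hidden(W.T @ x)                     # exactly one hidden neuron fires
    return np.sign(U.T @ a_h)                 # bipolar output threshold

def backward(y):
    a_h = hidden(U @ y)                       # the same hidden neuron fires
    return np.sign(W @ a_h)

assert all(np.array_equal(forward(x), y) for x, y in zip(X, Y))
assert all(np.array_equal(backward(y), x) for x, y in zip(X, Y))
print("exact bidirectional representation verified for all", 2 ** n, "cube vectors")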

The input vector o^y to the output layer is

o^y = U^T a^h    (15)
    = sum_{j=1}^{J} y_j a^h_j    (16)
    = y_m    (17)

where a^h_j is the activation of the j-th hidden neuron. The activation a^y_k of the output layer is

a^y_k(o^y_k) = 1 if o^y_k >= 0, and -1 if o^y_k < 0.    (18)

The output-layer activation leaves o^y unchanged because o^y equals y_m and because y_m is a vector in {-1, 1}^n. So

a^y = y_m.    (19)

So the forward pass of an input vector x_m through the network yields the desired corresponding output vector y_m where y_m = f(x_m) for the bipolar permutation map f.

Consider next the backward pass over N. The backward pass propagates the output vector y_m from the output layer to the input layer through the hidden layer. The hidden-layer input o^{hb} has the form (6) and so

o^{hb} = U y_m    (20)

where o^{hb} = (y_1^T y_m, y_2^T y_m, ..., y_j^T y_m, ..., y_{2^n}^T y_m)^T. The input o^{hb}_j of the j-th neuron in the hidden layer equals the inner product of y_j and y_m. So o^{hb}_j = n when j = m and o^{hb}_j < n when j differs from m. This holds because again the squared magnitude of a bipolar vector in {-1, 1}^n is n. The inner product o^{hb}_j is a maximum when the vectors y_m and y_j lie in the same direction. The activation a^{hb} for the hidden layer has the same components as (14). So the hidden-layer activation a^{hb} again equals the unit bit vector (0, 0, ..., 1, ..., 0)^T where a^{hb}_j = 1 when j = m and a^{hb}_j = 0 when j differs from m. Then the input vector o^{xb} for the input layer is

o^{xb} = W a^{hb}    (21)
       = sum_{j=1}^{J} x_j a^{hb}_j    (22)
       = x_m.    (23)

The i-th input neuron has a threshold activation that is the same as (18):

a^{xb}_i(o^{xb}_i) = 1 if o^{xb}_i >= 0, and -1 if o^{xb}_i < 0    (24)

where o^{xb}_i is the input of the i-th neuron in the input layer. This activation leaves o^{xb} unchanged because o^{xb} equals x_m and because the vector x_m lies in {-1, 1}^n. So

a^{xb} = o^{xb}    (25)
       = x_m.    (26)

So the backward pass of any target vector y_m yields the desired input vector x_m where f^{-1}(y_m) = x_m. This completes the backward pass and the proof.

III. BIDIRECTIONAL BACKPROPAGATION ALGORITHMS

A. Double Regression

We now derive the first of three bidirectional BP learning algorithms. The first case is double regression where the network performs regression in both directions. Bidirectional BP training minimizes both the forward error E_f and the backward error E_b. B-BP alternates between backward training and forward training. Forward training minimizes E_f while holding E_b constant. Backward training minimizes E_b while holding E_f constant. E_f is the error at the output layer. E_b is the error at the input layer. Double regression uses squared error for both error functions.

The forward pass sends the input vector x through the hidden layer to the output layer. We use only one hidden layer for simplicity and without loss of generality. The B-BP double-regression algorithm applies to any number of hidden layers in a deep network. The hidden-layer input values o^h_j are the same as (2). The j-th hidden activation a^h_j is the ordinary binary logistic map:

a^h_j(o^h_j) = 1 / (1 + e^{-o^h_j})    (27)

and (4) gives the input o^y_k to the k-th output neuron. The hidden activations can be logistic or any other sigmoidal function so long as it is differentiable. The activation for an output neuron is the identity function:

a^y_k = o^y_k    (28)

where a^y_k is the activation of the k-th output neuron. The error function E_f for the forward pass is squared error:

E_f = (1/2) sum_k (y_k - a^y_k)^2    (29)

where y_k denotes the value of the k-th neuron in the output layer. Ordinary unidirectional BP updates the weights and other network parameters by propagating the error from the output layer back to the input layer.

The backward pass sends the output vector y from the output layer to the input layer through the hidden layer. The input o^{hb}_j to the j-th hidden neuron is the same as (6). The activation a^{hb}_j of the j-th hidden neuron is

a^{hb}_j = 1 / (1 + e^{-o^{hb}_j}).    (30)

The input o^{xb}_i for the i-th input neuron is the same as (8). The activation at the input layer is the identity function:
a^{xb}_i = o^{xb}_i.    (31)

A nonlinear sigmoid (or Gaussian) activation can replace the linear function. The backward-pass error E_b is similarly the squared error

E_b = (1/2) sum_i (x_i - a^{xb}_i)^2.    (32)
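The minimal sketch below wires these pieces together for one hidden layer: the forward pass of (2), (4), (27), and (28), the backward pass of (6), (8), (30), and (31), and the two squared errors (29) and (32). The layer sizes, random weights, and sample vectors are assumed placeholders.

import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward_pass(x, W, U, b_h, b_y):
    a_h = logistic(W.T @ x + b_h)      # (2), (27): hidden input and logistic activation
    a_y = U.T @ a_h + b_y              # (4), (28): output input and identity activation
    return a_h, a_y

def backward_pass(y, W, U, b_h, b_x):
    a_hb = logistic(U @ y + b_h)       # (6), (30): backward hidden input and activation
    a_xb = W @ a_hb + b_x              # (8), (31): backward input-layer identity value
    return a_hb, a_xb

rng = np.random.default_rng(1)
I, J, K = 4, 6, 2                              # assumed layer sizes
W, U = rng.normal(size=(I, J)), rng.normal(size=(J, K))
b_x, b_h, b_y = np.zeros(I), np.zeros(J), np.zeros(K)
x, y = rng.normal(size=I), rng.normal(size=K)  # one assumed training pair

_, a_y = forward_pass(x, W, U, b_h, b_y)
_, a_xb = backward_pass(y, W, U, b_h, b_x)
E_f = 0.5 * np.sum((y - a_y) ** 2)             # forward squared error (29)
E_b = 0.5 * np.sum((x - a_xb) ** 2)            # backward squared error (32)
print(E_f, E_b)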

The partial derivative of the hidden-layer activation in the forward direction is

da^h_j/do^h_j = d/do^h_j [ 1 / (1 + e^{-o^h_j}) ]    (33)
             = e^{-o^h_j} / (1 + e^{-o^h_j})^2    (34)
             = [ 1 / (1 + e^{-o^h_j}) ] [ 1 - 1 / (1 + e^{-o^h_j}) ]    (35)
             = a^h_j (1 - a^h_j).    (36)

Let a^h_j' denote the derivative of a^h_j with respect to the inner-product term o^h_j. We again use the superscript b to denote the backward pass. Then the partial derivative of E_f with respect to weight u_{jk} is

dE_f/du_{jk} = d/du_{jk} [ (1/2) sum_k (y_k - a^y_k)^2 ]    (37)
            = (dE_f/da^y_k) (da^y_k/do^y_k) (do^y_k/du_{jk})    (38)
            = (a^y_k - y_k) a^h_j.    (39)

The partial derivative of E_f with respect to w_{ij} is

dE_f/dw_{ij} = d/dw_{ij} [ (1/2) sum_{k=1}^{K} (y_k - a^y_k)^2 ]    (40)
            = (dE_f/da^h_j) (da^h_j/do^h_j) (do^h_j/dw_{ij})    (41)
            = [ sum_{k=1}^{K} (a^y_k - y_k) u_{jk} ] a^h_j' x_i    (42)

where a^h_j' is the same as in (36). The partial derivative of E_f with respect to the bias b^y_k of the k-th output neuron is

dE_f/db^y_k = d/db^y_k [ (1/2) sum_k (y_k - a^y_k)^2 ]    (43)
           = (dE_f/da^y_k) (da^y_k/do^y_k) (do^y_k/db^y_k)    (44)
           = a^y_k - y_k.    (45)

The partial derivative of E_f with respect to the bias b^h_j of the j-th hidden neuron is

dE_f/db^h_j = d/db^h_j [ (1/2) sum_{k=1}^{K} (y_k - a^y_k)^2 ]    (46)
           = (dE_f/da^h_j) (da^h_j/do^h_j) (do^h_j/db^h_j)    (47)
           = [ sum_{k=1}^{K} (a^y_k - y_k) u_{jk} ] a^h_j'    (48)

where a^h_j' is the same as (36).

The partial derivative of the hidden-layer activation in the backward direction is

da^{hb}_j/do^{hb}_j = d/do^{hb}_j [ 1 / (1 + e^{-o^{hb}_j}) ]    (49)
                   = e^{-o^{hb}_j} / (1 + e^{-o^{hb}_j})^2    (50)
                   = [ 1 / (1 + e^{-o^{hb}_j}) ] [ 1 - 1 / (1 + e^{-o^{hb}_j}) ]    (51)
                   = a^{hb}_j (1 - a^{hb}_j).    (52)

The partial derivative of E_b with respect to w_{ij} is

dE_b/dw_{ij} = d/dw_{ij} [ (1/2) sum_i (x_i - a^{xb}_i)^2 ]    (53)
            = (dE_b/da^{xb}_i) (da^{xb}_i/do^{xb}_i) (do^{xb}_i/dw_{ij})    (54)
            = (a^{xb}_i - x_i) a^{hb}_j.    (55)

The partial derivative of E_b with respect to u_{jk} is

dE_b/du_{jk} = d/du_{jk} [ (1/2) sum_i (x_i - a^{xb}_i)^2 ]    (56)
            = sum_i (dE_b/do^{xb}_i) (do^{xb}_i/da^{hb}_j) (da^{hb}_j/do^{hb}_j) (do^{hb}_j/du_{jk})    (57)
            = [ sum_i (a^{xb}_i - x_i) w_{ij} ] a^{hb}_j' y_k    (58)

where a^{hb}_j' is the same as in (52). The partial derivative of E_b with respect to the bias b^x_i of the i-th input neuron is

dE_b/db^x_i = d/db^x_i [ (1/2) sum_i (x_i - a^{xb}_i)^2 ]    (59)
           = (dE_b/da^{xb}_i) (da^{xb}_i/do^{xb}_i) (do^{xb}_i/db^x_i)    (60)
           = a^{xb}_i - x_i.    (61)

The partial derivative of E_b with respect to the bias b^h_j of the j-th hidden neuron is

dE_b/db^h_j = d/db^h_j [ (1/2) sum_i (x_i - a^{xb}_i)^2 ]    (62)
           = sum_i (dE_b/do^{xb}_i) (do^{xb}_i/da^{hb}_j) (da^{hb}_j/do^{hb}_j) (do^{hb}_j/db^h_j)    (63)
           = [ sum_i (a^{xb}_i - x_i) w_{ij} ] a^{hb}_j'    (64)

where a^{hb}_j' is the same as (52). The input-neuron bias b^x_i updates only with the backward error E_b because the forward pass clamps the input activation to x. The output-neuron bias b^y_k updates only with the forward error E_f because the backward pass clamps the output activation to y.
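A finite-difference check is a quick way to confirm these closed forms. The sketch below compares the analytic derivative of (39), dE_f/du_{jk} = (a^y_k - y_k) a^h_j, with a central-difference estimate on an assumed random network; the same style of check applies to (42) through (64).

import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def E_f(x, y, W, U, b_h, b_y):
    a_h = logistic(W.T @ x + b_h)      # (2), (27)
    a_y = U.T @ a_h + b_y              # (4), (28)
    return 0.5 * np.sum((y - a_y) ** 2), a_h, a_y

rng = np.random.default_rng(2)
I, J, K = 3, 5, 2                      # assumed layer sizes
W, U = rng.normal(size=(I, J)), rng.normal(size=(J, K))
b_h, b_y = rng.normal(size=J), rng.normal(size=K)
x, y = rng.normal(size=I), rng.normal(size=K)

_, a_h, a_y = E_f(x, y, W, U, b_h, b_y)
analytic = np.outer(a_h, a_y - y)      # (39): entry [j, k] is (a_y_k - y_k) * a_h_j

numeric = np.zeros_like(U)
eps = 1e-6
for j in range(J):
    for k in range(K):
        Up, Um = U.copy(), U.copy()
        Up[j, k] += eps
        Um[j, k] -= eps
        numeric[j, k] = (E_f(x, y, W, Up, b_h, b_y)[0]
                         - E_f(x, y, W, Um, b_h, b_y)[0]) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # should be near round-off scale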

The above update laws for forward regression have the final form:

u_{jk}^{(n+1)} = u_{jk}^{(n)} - eta (a^y_k - y_k) a^h_j    (65)
w_{ij}^{(n+1)} = w_{ij}^{(n)} - eta [ sum_{k=1}^{K} (a^y_k - y_k) u_{jk} ] a^h_j' x_i    (66)
b^h_j^{(n+1)} = b^h_j^{(n)} - eta [ sum_{k=1}^{K} (a^y_k - y_k) u_{jk} ] a^h_j'    (67)
b^y_k^{(n+1)} = b^y_k^{(n)} - eta (a^y_k - y_k).    (68)

The dual update laws for backward regression have the final form:

u_{jk}^{(n+1)} = u_{jk}^{(n)} - eta [ sum_i (a^{xb}_i - x_i) w_{ij} ] a^{hb}_j' y_k    (69)
w_{ij}^{(n+1)} = w_{ij}^{(n)} - eta (a^{xb}_i - x_i) a^{hb}_j    (70)
b^x_i^{(n+1)} = b^x_i^{(n)} - eta (a^{xb}_i - x_i)    (71)
b^h_j^{(n+1)} = b^h_j^{(n)} - eta [ sum_i (a^{xb}_i - x_i) w_{ij} ] a^{hb}_j'.    (72)

The B-BP training minimizes E_f while holding E_b constant. It then minimizes E_b while holding E_f constant. Equations (65)-(68) state the update rules for forward training. Equations (69)-(72) state the update rules for backward training. Forward training minimizes E_f while keeping E_b constant. Backward training minimizes E_b while keeping E_f constant. Each training iteration involves forward training and then backward training. Algorithm 1 summarizes the B-BP algorithm. The algorithm explains how to combine forward and backward training in B-BP. Figure 6 shows how double-regression B-BP approximates the invertible function f(x) = 0.5 sigma(6x + 1) + 0.5 sigma(4x - 0.2) where sigma(x) is the logistic function. The approximation used a deep 8-layer network with 6 layers of bipolar logistic neurons. The input and output layer each contained only a single identity neuron.

B. Double Classification

We now derive a B-BP algorithm where the network's forward pass acts as a classifier network and so does its backward pass. We call this double classification. We present the derivation in terms of cross entropy for the sake of simplicity. But our double-classification simulations used the slightly more general form of cross entropy in (114) that we call logistic cross entropy. The simpler cross-entropy derivation applies to softmax input neurons and output neurons (with implied 1-in-K coding). But logistic input and output neurons require logistic cross entropy for the same BP derivation.

The simplest double classification uses Gibbs or softmax neurons at both the input and output layer to create a winner-take-all structure at those layers. Then the k-th softmax neuron in the output layer codes for the k-th input pattern and represents it as a K-length unit bit vector with a 1 in the k-th slot and a 0 in the other K - 1 slots [3], [19]. The same 1-in-I binary encoding holds for the i-th neuron at the input layer. The softmax structure implies that the input and output fields each compute a discrete probability distribution for each input.

Classification networks differ from regression networks in another key aspect: They do not minimize squared error. They instead minimize the cross entropy of the given target vector and the softmax activation values of the output or input layer [3]. Equation (79) states the forward cross entropy at the output layer where y_k is the desired or target value of the k-th output neuron and where a^y_k is its actual softmax activation value. The entropy structure applies because both the target vector and the input/output vector are probability vectors. Minimizing the cross entropy minimizes the Kullback-Leibler divergence [20] and vice versa [19].

The classification BP algorithm depends on another optimization equivalence: Minimizing the cross entropy is equivalent to maximizing the network's likelihood or log-likelihood [19]. We will establish this equivalence because it implies that the BP learning laws have the same form for both classification and regression. We will prove the equivalence only for the forward direction but it applies equally in the backward direction. The result both unifies the BP learning laws and allows noise-boosts and other methods to enhance the network likelihood since BP is a special case [19], [21] of the Expectation-Maximization algorithm for iteratively maximizing a likelihood with missing data or hidden variables [22].

Denote the network's forward probability density function as p_f(y|x, Theta).
The input vector x passes through the network with total parameter vector Theta and produces the output vector y. Then the network's forward log-likelihood L_f(Theta) is the natural logarithm of the forward network probability: L_f(Theta) = ln p_f(y|x, Theta). We will show that p_f(y|x, Theta) = exp{-E_f(Theta)}. So BP's forward pass computes the forward cross entropy as it maximizes the likelihood [19]. The key assumption is that the output softmax neurons in a classifier network are independent because there are no intra-layer connections among them. Then the network probability density p_f(y|x, Theta) factors into a product of K-many marginals [3]: p_f(y|x, Theta) = prod_{k=1}^{K} p_f(y_k|x, Theta). This gives

L_f(Theta) = ln p_f(y|x, Theta)    (73)
           = sum_{k=1}^{K} ln p_f(y_k|x, Theta)    (74)
           = ln prod_{k=1}^{K} (a^y_k)^{y_k}    (75)
           = sum_k y_k ln a^y_k    (76)
           = -E_f(Theta)    (77)

from (79). Then exponentiation gives p_f(y|x, Theta) = exp{-E_f(Theta)}. Minimizing the forward cross entropy E_f is equivalent to maximizing the negative cross entropy -E_f. So minimizing E_f maximizes the forward network likelihood L_f and vice versa.
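This equivalence is easy to verify numerically. With a 1-in-K target the cross entropy (79) reduces to the negative log of the softmax probability of the correct class, so exp(-E_f) recovers p_f(y|x, Theta). The logits below are arbitrary assumed values.

import numpy as np

o_y = np.array([2.0, -1.0, 0.5])          # assumed output-layer inputs (logits)
a_y = np.exp(o_y) / np.sum(np.exp(o_y))   # softmax activations (78)
y = np.array([1.0, 0.0, 0.0])             # 1-in-K target: class 1 is correct

E_f = -np.sum(y * np.log(a_y))            # forward cross entropy (79)
log_likelihood = np.log(a_y[0])           # ln p_f(y | x, Theta) for the correct class

print(np.isclose(E_f, -log_likelihood))   # True: minimizing E_f maximizes the likelihood
print(np.isclose(np.exp(-E_f), a_y[0]))   # True: p_f(y | x, Theta) = exp(-E_f)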

The third equality (75) holds because the k-th marginal factor p_f(y_k|x, Theta) in a classifier network equals the exponentiated softmax activation (a^y_k)^{y_k} since y_k = 1 if k is the correct class label for the input pattern x and y_k = 0 otherwise. This defines an output categorical distribution or a single-sample multinomial.

We now derive the B-BP algorithm for double classification. The algorithm minimizes the error functions separately where E_f(Theta) is the forward cross entropy in (79) and E_b(Theta) is the backward cross entropy in (81). We first derive the forward B-BP classifier algorithm and then derive the backward B-BP algorithm.

The forward pass sends the input vector x through the hidden layer or layers to the output layer. The input activation vector a^x is the vector x. We assume only one hidden layer for simplicity. The derivation applies to deep networks with any number of hidden layers. The input o^h_j to the j-th hidden neuron has the same linear form as in (2). The j-th hidden activation a^h_j is the same ordinary unit-interval-valued logistic function in (27). The input o^y_k to the k-th output neuron is the same as in (4). The hidden activations can also be hyperbolic tangent or any other differentiable monotone nondecreasing sigmoid function. The forward classifier's output-layer neurons use Gibbs or softmax activations:

a^y_k = e^{o^y_k} / sum_{l=1}^{K} e^{o^y_l}    (78)

where a^y_k is the activation of the k-th output neuron. Then the forward error E_f is the cross entropy

E_f = - sum_k y_k ln a^y_k    (79)

between the binary target values y_k and the actual output activations a^y_k.

We next describe the backward pass through the classifier network. The backward pass sends the output target vector y through the hidden layer to the input layer. So the initial activation vector a^y equals the target vector y. The input o^{hb}_j to the j-th neuron of the hidden layer has the same linear form as (6). The activation of the j-th hidden neuron is the same as (30). The backward-pass input o^{xb}_i to the i-th input neuron is the same as (8). The input activation is Gibbs or softmax:

a^{xb}_i = e^{o^{xb}_i} / sum_{l=1}^{I} e^{o^{xb}_l}    (80)

where a^{xb}_i is the backward-pass activation of the i-th input neuron. Then the backward error E_b is the cross entropy

E_b = - sum_i x_i ln a^{xb}_i    (81)

where x_i is the target value of the i-th input neuron.

The partial derivatives of the hidden activations a^h_j and a^{hb}_j are the same as (33) and (49). The partial derivative of the output activation a^y_k for the forward classification pass is

da^y_k/do^y_k = d/do^y_k [ e^{o^y_k} / sum_{l=1}^{K} e^{o^y_l} ]    (82)
             = [ e^{o^y_k} sum_{l=1}^{K} e^{o^y_l} - e^{o^y_k} e^{o^y_k} ] / ( sum_{l=1}^{K} e^{o^y_l} )^2    (83)
             = [ e^{o^y_k} ( sum_{l=1}^{K} e^{o^y_l} - e^{o^y_k} ) ] / ( sum_{l=1}^{K} e^{o^y_l} )^2    (84)
             = a^y_k (1 - a^y_k).    (85)

The derivative when l differs from k is

da^y_k/do^y_l = d/do^y_l [ e^{o^y_k} / sum_{m=1}^{K} e^{o^y_m} ]    (86)
             = - e^{o^y_k} e^{o^y_l} / ( sum_{l=1}^{K} e^{o^y_l} )^2    (87)
             = - a^y_k a^y_l.    (88)

So the derivative of a^y_k with respect to o^y_l is

da^y_k/do^y_l = - a^y_k a^y_l if l differs from k, and a^y_k (1 - a^y_k) if l = k.    (89)

We denote this derivative as a^y_k'. The derivative a^{xb}_i' of the backward classification pass has the same form because both classifier layers use softmax activations.

The partial derivative of the forward cross entropy E_f with respect to u_{jk} is

dE_f/du_{jk} = d/du_{jk} [ - sum_l y_l ln a^y_l ]    (90)
            = - sum_l (y_l / a^y_l) (da^y_l/do^y_k) (do^y_k/du_{jk})    (91)
            = - [ y_k (1 - a^y_k) - sum_{l not k} y_l a^y_k ] a^h_j    (92)
            = (a^y_k - y_k) a^h_j    (93)

since the target probabilities y_l sum to one. The partial derivative of the forward cross entropy E_f with respect to the bias b^y_k of the k-th output neuron is

dE_f/db^y_k = d/db^y_k [ - sum_l y_l ln a^y_l ]    (94)
           = - sum_l (y_l / a^y_l) (da^y_l/do^y_k) (do^y_k/db^y_k)    (95)
           = - [ y_k (1 - a^y_k) - sum_{l not k} y_l a^y_k ]    (96)
           = a^y_k - y_k.    (97)

Equations (93) and (97) show that the derivatives of E_f with respect to u_{jk} and b^y_k for double classification are the same as for double regression in (39) and (45). The activations of the hidden neurons are the same as for double regression. So the

derivatives of E_f with respect to w_{ij} and b^h_j are the same as the respective ones in (42) and (48).

The partial derivative of E_b with respect to w_{ij} is

dE_b/dw_{ij} = d/dw_{ij} [ - sum_l x_l ln a^{xb}_l ]    (98)
            = - sum_l (x_l / a^{xb}_l) (da^{xb}_l/do^{xb}_i) (do^{xb}_i/dw_{ij})    (99)
            = - [ x_i (1 - a^{xb}_i) - sum_{l not i} x_l a^{xb}_i ] a^{hb}_j    (100)
            = (a^{xb}_i - x_i) a^{hb}_j.    (101)

The partial derivative of E_b with respect to the bias b^x_i of the i-th input neuron is

dE_b/db^x_i = d/db^x_i [ - sum_l x_l ln a^{xb}_l ]    (102)
           = - sum_l (x_l / a^{xb}_l) (da^{xb}_l/do^{xb}_i) (do^{xb}_i/db^x_i)    (103)
           = - [ x_i (1 - a^{xb}_i) - sum_{l not i} x_l a^{xb}_i ]    (104)
           = a^{xb}_i - x_i.    (105)

Equations (101) and (105) likewise show that the derivatives of E_b with respect to w_{ij} and b^x_i for double classification are the same as for double regression in (53) and (59). The activations of the hidden neurons are the same as for double regression. So the derivatives of E_b with respect to u_{jk} and b^h_j are the same as the respective ones in (56) and (62).

The bidirectional training of BP for double classification alternates between minimizing E_f while holding E_b constant and minimizing E_b while holding E_f constant. The forward and backward errors here are cross entropies. The update laws for forward classification have the final form:

u_{jk}^{(n+1)} = u_{jk}^{(n)} - eta (a^y_k - y_k) a^h_j    (106)
w_{ij}^{(n+1)} = w_{ij}^{(n)} - eta [ sum_{k=1}^{K} (a^y_k - y_k) u_{jk} ] a^h_j' x_i    (107)
b^h_j^{(n+1)} = b^h_j^{(n)} - eta [ sum_{k=1}^{K} (a^y_k - y_k) u_{jk} ] a^h_j'    (108)
b^y_k^{(n+1)} = b^y_k^{(n)} - eta (a^y_k - y_k).    (109)

The dual update laws for backward classification have the final form:

u_{jk}^{(n+1)} = u_{jk}^{(n)} - eta [ sum_i (a^{xb}_i - x_i) w_{ij} ] a^{hb}_j' y_k    (110)
w_{ij}^{(n+1)} = w_{ij}^{(n)} - eta (a^{xb}_i - x_i) a^{hb}_j    (111)
b^x_i^{(n+1)} = b^x_i^{(n)} - eta (a^{xb}_i - x_i)    (112)
b^h_j^{(n+1)} = b^h_j^{(n)} - eta [ sum_i (a^{xb}_i - x_i) w_{ij} ] a^{hb}_j'.    (113)

The derivation shows that the update rules for double classification are the same as the update rules for double regression. The B-BP training minimizes E_f while holding E_b constant and then minimizes E_b while holding E_f constant. Equations (106)-(109) are the update rules for forward training. Equations (110)-(113) are the update rules for backward training. Forward training minimizes E_f while keeping E_b constant. Backward training minimizes E_b while keeping E_f constant. Each training iteration involves running the forward training and then the backward training. Algorithm 1 summarizes the B-BP algorithm.

The more general case of double classification uses logistic neurons at the input and output layer. Then the BP derivation requires the slightly more general logistic-cross-entropy performance measure. We used logistic cross entropy E_log for double-classification training because the input and output neurons were logistic (rather than softmax):

E_log = - sum_{k=1}^{K} [ y_k ln a^y_k + (1 - y_k) ln(1 - a^y_k) ].    (114)

C. Mixed Case: Classification and Regression

We last derive the B-BP learning algorithm for the mixed case of a neural classifier network in the forward direction and a regression network in the backward direction. This mixed case describes the common case of neural image classification. The user need only add backward-regression training to allow the same classifier net to predict which image input produced a given output classification. Backward regression estimates this answer as the centroid of the inverse set-theoretic mapping or pre-image. The B-BP algorithm achieves this by alternating between minimizing E_f and minimizing E_b. The forward error E_f is the same as the cross entropy in the double-classification network above. The backward error E_b is the same as the squared error in the double-regression network of the previous section. The input space is likewise the I-dimensional real space R^I for regression. The output space uses 1-in-K binary encoding for classification. Regression networks use identity functions as activation functions. Classifier networks use softmax activations.

The forward pass sends the input vector x through the hidden layer to the output layer. The input activation vector equals x. We again consider only a single hidden layer for simplicity. The input o^h_j to the j-th hidden neuron is the same as (2). The activation a^h_j of the j-th hidden neuron is the ordinary logistic activation in (27). Equation (4) represents the input o^y_k to the k-th output neuron. The output activation is softmax.
So the output activation a^y_k is the same as in (78). The forward error E_f is the cross entropy in (79). The forward pass in this case is the same as the forward pass for double classification. So (93) and (97) give the derivatives of the forward error E_f with respect to u_{jk} and b^y_k, and (42) and (48) give them with respect to w_{ij} and b^h_j.

The backward pass propagates the 1-in-K vector y from the output layer through the hidden layer to the input layer. The output-layer activation vector a^y equals y. The input o^{hb}_j to the

j-th hidden neuron for the backward pass is the same as (6), and (30) gives the activation a^{hb}_j for the j-th hidden unit in the backward pass. Equation (8) gives the input o^{xb}_i for the i-th input neuron. The activation a^{xb}_i of the i-th input neuron for the backward pass is the same as (31). The backward error E_b is the squared error in (32). The backward pass in this case is the same as the backward pass for double regression. So (55), (58), (61), and (64) give the derivatives of the backward error E_b with respect to w_{ij}, u_{jk}, b^x_i, and b^h_j.

The update laws for forward classification training have the final form:

u_{jk}^{(n+1)} = u_{jk}^{(n)} - eta (a^y_k - y_k) a^h_j    (115)
w_{ij}^{(n+1)} = w_{ij}^{(n)} - eta [ sum_{k=1}^{K} (a^y_k - y_k) u_{jk} ] a^h_j' x_i    (116)
b^h_j^{(n+1)} = b^h_j^{(n)} - eta [ sum_{k=1}^{K} (a^y_k - y_k) u_{jk} ] a^h_j'    (117)
b^y_k^{(n+1)} = b^y_k^{(n)} - eta (a^y_k - y_k).    (118)

The update laws for backward regression training have the final form:

u_{jk}^{(n+1)} = u_{jk}^{(n)} - eta [ sum_{i=1}^{I} (a^{xb}_i - x_i) w_{ij} ] a^{hb}_j' y_k    (119)
w_{ij}^{(n+1)} = w_{ij}^{(n)} - eta (a^{xb}_i - x_i) a^{hb}_j    (120)
b^x_i^{(n+1)} = b^x_i^{(n)} - eta (a^{xb}_i - x_i)    (121)
b^h_j^{(n+1)} = b^h_j^{(n)} - eta [ sum_{i=1}^{I} (a^{xb}_i - x_i) w_{ij} ] a^{hb}_j'.    (122)

B-BP training minimizes E_f while holding E_b constant and then minimizes E_b while holding E_f constant. Equations (115)-(118) state the update rules for forward training. Equations (119)-(122) state the update rules for backward training. Algorithm 1 shows how forward learning combines with backward learning in B-BP.

IV. SIMULATION RESULTS

TABLE II: 5-Bit Bipolar Permutation Function on {-1, 1}^5. The two columns list the 32 input vectors x and their permuted output vectors t = f(x).

Fig. 4: Logistic-cross-entropy-based learning for double classification on a 3-layer network with forward BP training, backward BP training, and bidirectional BP training. The trained network represents the 5-bit permutation function in Table II. (a) Forward BP tuned the network with respect to the logistic cross entropy for the forward pass E_f only. (b) Backward BP training tuned the network with respect to the logistic cross entropy for the backward pass E_b only. (c) Bidirectional BP training summed the logistic cross entropy for both the forward pass E_f and the backward pass E_b to update the network parameters.

We tested the B-BP algorithm for double classification on a 5-bit permutation function. We used 3-layer networks with different numbers of hidden neurons. The neurons used bipolar logistic activations. The performance measure was logistic cross entropy.

Fig. 5: B-BP training error for the 5-bit permutation in Table II using different numbers of hidden neurons. Training used the double-classification algorithm. The two curves describe the logistic cross entropy for the forward and backward passes through the 3-layer network. Each test used 64 samples. The number of hidden neurons increased from 5, 10, 20, 50, to 100.

Fig. 7: Bidirectional backpropagation double-regression learning for the non-invertible target function f(x) = sin x. (a) The forward pass learns the function y = f(x) = sin x. (b) The backward pass approximates the centroid of the values in the set-theoretic pre-image f^{-1}({y}) for y values in (-1, 1). The two centroids are pi/2 and -pi/2.

Fig. 6: B-BP double-regression approximation of the invertible function f(x) = 0.5 sigma(6x + 1) + 0.5 sigma(4x - 0.2) using a deep 8-layer network with 6 hidden layers. The function sigma denotes the bipolar logistic function in (1). Each hidden layer contained bipolar logistic neurons. The input and output layers each used a single neuron with an identity activation function. The forward pass approximated the forward function f. The backward pass approximated the inverse function f^{-1}.

TABLE III: Forward-Pass Cross Entropy E_f. The columns give the forward, backward, and bidirectional forms of backpropagation training. The rows give the number of hidden neurons (5, 10, 20, 50, 100).

The forward pass of (standard) BP used logistic cross entropy as its error function. The backward pass did as well. Bidirectional BP summed the forward and backward errors for its joint error. We computed the test error for the forward and backward passes. Each plotted error value averaged 20 runs. The B-BP algorithm produced either an exact representation or an approximation. The permutation function bijectively mapped the 5-bit bipolar vector space {-1, 1}^5 onto itself. Table II displays the permutation test function. We compared the forward and backward forms of unidirectional BP with bidirectional BP. We also tested whether adding more hidden neurons improved network approximation accuracy.

Figure 4 shows the results of running the three types of BP learning for classification on a 3-layer network. The values of E_f and E_b decrease with an increase in the training iterations for bidirectional BP. This was not the case for the unidirectional cases of forward BP and backward BP training. Forward and backward training performed well only for function approximation in their respective training direction. Neither performed well in the opposite direction.

TABLE IV: Backward-Pass Cross Entropy E_b. The columns give the forward, backward, and bidirectional forms of backpropagation training. The rows give the number of hidden neurons (5, 10, 20, 50, 100).

Table III shows the forward-pass error E_f for learning 3-layer classification neural networks as the number of hidden neurons grows. We again compared the three forms of BP for the network training: two forms of unidirectional BP and bidirectional BP. The forward-pass error for forward BP fell substantially as the number of hidden neurons grew. The forward-pass error of backward BP decreased only slightly as the number of hidden neurons grew. It gave the worst performance. Bidirectional BP performed well on the test set. Its forward-pass error also fell substantially as the number of hidden neurons grew. Table IV shows similar error-versus-hidden-neuron results for the backward-pass error E_b. The two tables jointly show that the unidirectional forms of BP performed well only in one direction while the B-BP algorithm performed well in both directions.

We tested the B-BP algorithm for double regression with the invertible function f(x) = 0.5 sigma(6x + 1) + 0.5 sigma(4x - 0.2) for values of x in [-1.5, 1.5]. We used a deep 8-layer network with 6 hidden layers for this approximation. There was only a single identity neuron in the input and output layers. The error functions E_f and E_b were ordinary squared errors. Figure 6 compares the bidirectional BP approximation with the target function for both the forward pass and the backward pass.

We also tested the B-BP double-regression algorithm on the non-invertible function f(x) = sin x for x in [-pi, pi]. The forward mapping f(x) = sin x is a well-defined point function. The backward mapping sin^{-1}(y) is not. It defines instead a set-based pullback or pre-image f^{-1}(y) = f^{-1}({y}) = {x in R: f(x) = y} subset of R. The B-BP-trained neural network tends to map each output point y to the centroid of its pre-image f^{-1}({y}) on the backward pass because centroids minimize squared error and because backward-regression training uses squared error as its performance measure. Figure 7 shows that the forward regression learns the target function sin x while the backward regression approximates the centroids pi/2 and -pi/2 of the two pre-image sets.

Algorithm 1: The Bidirectional Backpropagation Algorithm
Data: T input vectors {x_1, x_2, ..., x_T} and T corresponding output vectors {y_1, y_2, ..., y_T} such that f(x_n) = y_n. Number of hidden neurons J. Batch size S and number of epochs R. Choose the learning rate eta.
Result: Bidirectional neural-network representation of the function f.
Initialize: Randomly select the initial weights W(0) and U(0). Randomly pick the bias weights {b^x(0), b^h(0), b^y(0)} for the input, hidden, and output neurons.
while epoch r = 0 : R do
  Select S random samples from the training dataset.
  Initialize: dW = 0, dU = 0, db^x = 0, db^h = 0, db^y = 0.
  FORWARD TRAINING
  while batch sample l = 1 : S do
    Randomly pick input vector x_l and its corresponding output vector y_l.
    Compute the hidden-layer input o^h and the corresponding hidden activation a^h.
    Compute the output-layer input o^y and the corresponding output activation a^y.
    Compute the forward error E_f.
    Compute the derivatives of E_f with respect to W, U, b^h, and b^y.
    Update: dW <- dW + grad_W E_f ; db^h <- db^h + grad_{b^h} E_f ; dU <- dU + grad_U E_f ; db^y <- db^y + grad_{b^y} E_f.
  end
  Update: W(r) <- W(r) - eta dW ; U(r) <- U(r) - eta dU ; b^h(r) <- b^h(r) - eta db^h ; b^y(r) <- b^y(r) - eta db^y.
  Initialize: dW = 0, dU = 0, db^x = 0, db^h = 0, db^y = 0.
  BACKWARD TRAINING
  while batch sample l = 1 : S do
    Pick input vector x_l and its corresponding output vector y_l.
    Compute the hidden-layer input o^{hb} and the hidden activation a^{hb}.
    Compute the input o^{xb} at the input layer and the input activation a^{xb}.
    Compute the backward error E_b.
    Compute the derivatives of E_b with respect to W, U, b^h, and b^x.
    Update: dW <- dW + grad_W E_b ; db^h <- db^h + grad_{b^h} E_b ; dU <- dU + grad_U E_b ; db^x <- db^x + grad_{b^x} E_b.
  end
  Update: W(r+1) <- W(r) - eta dW ; U(r+1) <- U(r) - eta dU ; b^x(r+1) <- b^x(r) - eta db^x ; b^h(r+1) <- b^h(r) - eta db^h ; b^y(r+1) <- b^y(r).
end
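The sketch below follows Algorithm 1 for the double-regression case: each epoch accumulates and applies the forward-error gradients (65)-(68) over a batch and then the backward-error gradients (69)-(72) over the same batch and the same weights. The toy linear target map, layer sizes, learning rate, batch size, and epoch count are all assumed placeholders rather than settings from the paper's experiments.

import numpy as np

rng = np.random.default_rng(3)
I, J, K = 2, 8, 2                     # assumed layer sizes
eta, R, S = 0.05, 2000, 16            # assumed learning rate, epochs, batch size

# Assumed invertible toy mapping f(x) = A x used only for this demonstration.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
X = rng.uniform(-1, 1, size=(200, I))
Y = X @ A.T

W = rng.normal(0, 0.5, size=(I, J))
U = rng.normal(0, 0.5, size=(J, K))
b_x, b_h, b_y = np.zeros(I), np.zeros(J), np.zeros(K)
sig = lambda v: 1.0 / (1.0 + np.exp(-v))

for r in range(R):
    batch = rng.choice(len(X), size=S, replace=False)
    # FORWARD TRAINING: accumulate gradients of E_f over the batch, then update.
    dW, dU = np.zeros_like(W), np.zeros_like(U)
    db_h, db_y = np.zeros_like(b_h), np.zeros_like(b_y)
    for l in batch:
        x, y = X[l], Y[l]
        a_h = sig(W.T @ x + b_h)
        a_y = U.T @ a_h + b_y
        err = a_y - y
        dU += np.outer(a_h, err)                   # (65)
        db_y += err                                # (68)
        delta_h = (U @ err) * a_h * (1 - a_h)      # hidden error term
        dW += np.outer(x, delta_h)                 # (66)
        db_h += delta_h                            # (67)
    W -= eta * dW; U -= eta * dU; b_h -= eta * db_h; b_y -= eta * db_y
    # BACKWARD TRAINING: accumulate gradients of E_b over the same batch, then update.
    dW, dU = np.zeros_like(W), np.zeros_like(U)
    db_h, db_x = np.zeros_like(b_h), np.zeros_like(b_x)
    for l in batch:
        x, y = X[l], Y[l]
        a_hb = sig(U @ y + b_h)
        a_xb = W @ a_hb + b_x
        errb = a_xb - x
        dW += np.outer(errb, a_hb)                 # (70)
        db_x += errb                               # (71)
        delta_hb = (W.T @ errb) * a_hb * (1 - a_hb)
        dU += np.outer(delta_hb, y)                # (69)
        db_h += delta_hb                           # (72)
    W -= eta * dW; U -= eta * dU; b_h -= eta * db_h; b_x -= eta * db_x

x_test = np.array([0.3, -0.7])
a_h = sig(W.T @ x_test + b_h); y_hat = U.T @ a_h + b_y
a_hb = sig(U @ y_hat + b_h); x_hat = W @ a_hb + b_x
print("forward:", y_hat, "target:", A @ x_test)
print("backward:", x_hat, "target:", x_test)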
V. CONCLUSION

Unidirectional backpropagation learning extends to bidirectional backpropagation learning if the algorithm uses the appropriate joint error function for both the forward and backward passes. This bidirectional extension applies to classification networks as well as to regression networks and to their combinations. Most classification networks in practice can easily acquire a backward-inference capability by adding a backward-regression step to their current training. So most networks simply ignore this property of their weight structure.

A bidirectional multilayer threshold network can exactly represent permutation mappings if the hidden layer contains an exponential number of hidden threshold neurons. An open question is whether these networks can represent or at least approximate an arbitrary invertible mapping with fewer hidden neurons. A related question is what specific classes of invertible functions these bidirectional networks can exactly represent and which classes they cannot represent. Another open question is the extent to which carefully injected noise can speed B-BP convergence and accuracy. This follows because of the statistical structure of BP as a special case of the expectation-maximization algorithm [19] and because the appropriate noise can boost such hill-climbing algorithms [21], [23].

REFERENCES

[1] P. J. Werbos, "Beyond regression: New tools for prediction and analysis in the behavioral sciences," Doctoral Dissertation, Applied Mathematics, Harvard University, MA, 1974.

[2] D. Rumelhart, G. Hinton, and R. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533-536, 1986.
[3] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[4] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436-444, 2015.
[5] M. Jordan and T. Mitchell, "Machine learning: trends, perspectives, and prospects," Science, vol. 349, pp. 255-260, 2015.
[6] B. Kosko, "Bidirectional associative memories," IEEE Transactions on Systems, Man and Cybernetics, vol. 18, no. 1, pp. 49-60, 1988.
[7] B. Kosko, Neural Networks and Fuzzy Systems: A Dynamical Systems Approach to Machine Intelligence. Prentice Hall, 1991.
[8] S. Y. Kung, Kernel Methods and Machine Learning. Cambridge University Press, 2014.
[9] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303-314, 1989.
[10] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359-366, 1989.
[11] B. Kosko, "Fuzzy systems as universal approximators," IEEE Transactions on Computers, vol. 43, no. 11, pp. 1329-1333, 1994.
[12] F. Watkins, "The representation problem for additive fuzzy systems," in Proceedings of the International Conference on Fuzzy Systems (IEEE FUZZ-95), 1995.
[13] B. Kosko, "Generalized mixture representations and combinations for additive fuzzy systems," in 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017.
[14] B. Kosko, "Additive fuzzy systems: From generalized mixtures to rule continua," International Journal of Intelligent Systems, 2017.
[15] J.-N. Hwang, J. J. Choi, S. Oh, and R. Marks, "Query-based learning applied to partially trained multilayer perceptrons," IEEE Transactions on Neural Networks, vol. 2, no. 1, pp. 131-136, 1991.
[16] E. W. Saad, J. J. Choi, J. L. Vian, and D. Wunsch, "Query-based learning for aerospace applications," IEEE Transactions on Neural Networks, vol. 14, no. 6, 2003.
[17] E. W. Saad and D. C. Wunsch, "Neural network explanation using inversion," Neural Networks, vol. 20, no. 1, pp. 78-93, 2007.
[18] Y. Yang, Y. Wang, and X. Yuan, "Bidirectional extreme learning machine for regression problem and its learning effectiveness," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 9, 2012.
[19] K. Audhkhasi, O. Osoba, and B. Kosko, "Noise-enhanced convolutional neural networks," Neural Networks, vol. 78, pp. 15-23, 2016.
[20] S. Kullback and R. A. Leibler, "On information and sufficiency," The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79-86, 1951.
[21] O. Osoba and B. Kosko, "The noisy expectation-maximization algorithm for multiplicative noise injection," Fluctuation and Noise Letters, p. 1650007, 2016.
[22] R. V. Hogg, J. McKean, and A. T. Craig, Introduction to Mathematical Statistics. Pearson, 2013.
[23] O. Osoba, S. Mitaim, and B. Kosko, "The noisy expectation maximization algorithm," Fluctuation and Noise Letters, vol. 12, no. 3, p. 1350012, 2013.


More information

Kernel Methods and SVMs Extension

Kernel Methods and SVMs Extension Kernel Methods and SVMs Extenson The purpose of ths document s to revew materal covered n Machne Learnng 1 Supervsed Learnng regardng support vector machnes (SVMs). Ths document also provdes a general

More information

Homework Assignment 3 Due in class, Thursday October 15

Homework Assignment 3 Due in class, Thursday October 15 Homework Assgnment 3 Due n class, Thursday October 15 SDS 383C Statstcal Modelng I 1 Rdge regresson and Lasso 1. Get the Prostrate cancer data from http://statweb.stanford.edu/~tbs/elemstatlearn/ datasets/prostate.data.

More information

NUMERICAL DIFFERENTIATION

NUMERICAL DIFFERENTIATION NUMERICAL DIFFERENTIATION 1 Introducton Dfferentaton s a method to compute the rate at whch a dependent output y changes wth respect to the change n the ndependent nput x. Ths rate of change s called the

More information

Chapter 11: Simple Linear Regression and Correlation

Chapter 11: Simple Linear Regression and Correlation Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests

More information

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification E395 - Pattern Recognton Solutons to Introducton to Pattern Recognton, Chapter : Bayesan pattern classfcaton Preface Ths document s a soluton manual for selected exercses from Introducton to Pattern Recognton

More information

Markov Chain Monte Carlo (MCMC), Gibbs Sampling, Metropolis Algorithms, and Simulated Annealing Bioinformatics Course Supplement

Markov Chain Monte Carlo (MCMC), Gibbs Sampling, Metropolis Algorithms, and Simulated Annealing Bioinformatics Course Supplement Markov Chan Monte Carlo MCMC, Gbbs Samplng, Metropols Algorthms, and Smulated Annealng 2001 Bonformatcs Course Supplement SNU Bontellgence Lab http://bsnuackr/ Outlne! Markov Chan Monte Carlo MCMC! Metropols-Hastngs

More information

The exam is closed book, closed notes except your one-page cheat sheet.

The exam is closed book, closed notes except your one-page cheat sheet. CS 89 Fall 206 Introducton to Machne Learnng Fnal Do not open the exam before you are nstructed to do so The exam s closed book, closed notes except your one-page cheat sheet Usage of electronc devces

More information

The Order Relation and Trace Inequalities for. Hermitian Operators

The Order Relation and Trace Inequalities for. Hermitian Operators Internatonal Mathematcal Forum, Vol 3, 08, no, 507-57 HIKARI Ltd, wwwm-hkarcom https://doorg/0988/mf088055 The Order Relaton and Trace Inequaltes for Hermtan Operators Y Huang School of Informaton Scence

More information

MATH 567: Mathematical Techniques in Data Science Lab 8

MATH 567: Mathematical Techniques in Data Science Lab 8 1/14 MATH 567: Mathematcal Technques n Data Scence Lab 8 Domnque Gullot Departments of Mathematcal Scences Unversty of Delaware Aprl 11, 2017 Recall We have: a (2) 1 = f(w (1) 11 x 1 + W (1) 12 x 2 + W

More information

MAXIMUM A POSTERIORI TRANSDUCTION

MAXIMUM A POSTERIORI TRANSDUCTION MAXIMUM A POSTERIORI TRANSDUCTION LI-WEI WANG, JU-FU FENG School of Mathematcal Scences, Peng Unversty, Bejng, 0087, Chna Center for Informaton Scences, Peng Unversty, Bejng, 0087, Chna E-MIAL: {wanglw,

More information

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018 INF 5860 Machne learnng for mage classfcaton Lecture 3 : Image classfcaton and regresson part II Anne Solberg January 3, 08 Today s topcs Multclass logstc regresson and softma Regularzaton Image classfcaton

More information

Multi-layer neural networks

Multi-layer neural networks Lecture 0 Mult-layer neural networks Mlos Hauskrecht mlos@cs.ptt.edu 5329 Sennott Square Lnear regresson w Lnear unts f () Logstc regresson T T = w = p( y =, w) = g( w ) w z f () = p ( y = ) w d w d Gradent

More information

CSC 411 / CSC D11 / CSC C11

CSC 411 / CSC D11 / CSC C11 18 Boostng s a general strategy for learnng classfers by combnng smpler ones. The dea of boostng s to take a weak classfer that s, any classfer that wll do at least slghtly better than chance and use t

More information

SDMML HT MSc Problem Sheet 4

SDMML HT MSc Problem Sheet 4 SDMML HT 06 - MSc Problem Sheet 4. The recever operatng characterstc ROC curve plots the senstvty aganst the specfcty of a bnary classfer as the threshold for dscrmnaton s vared. Let the data space be

More information

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010 Parametrc fractonal mputaton for mssng data analyss Jae Kwang Km Survey Workng Group Semnar March 29, 2010 1 Outlne Introducton Proposed method Fractonal mputaton Approxmaton Varance estmaton Multple mputaton

More information

VQ widely used in coding speech, image, and video

VQ widely used in coding speech, image, and video at Scalar quantzers are specal cases of vector quantzers (VQ): they are constraned to look at one sample at a tme (memoryless) VQ does not have such constrant better RD perfomance expected Source codng

More information

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING 1 ADVANCED ACHINE LEARNING ADVANCED ACHINE LEARNING Non-lnear regresson technques 2 ADVANCED ACHINE LEARNING Regresson: Prncple N ap N-dm. nput x to a contnuous output y. Learn a functon of the type: N

More information

CS286r Assign One. Answer Key

CS286r Assign One. Answer Key CS286r Assgn One Answer Key 1 Game theory 1.1 1.1.1 Let off-equlbrum strateges also be that people contnue to play n Nash equlbrum. Devatng from any Nash equlbrum s a weakly domnated strategy. That s,

More information

Linear Feature Engineering 11

Linear Feature Engineering 11 Lnear Feature Engneerng 11 2 Least-Squares 2.1 Smple least-squares Consder the followng dataset. We have a bunch of nputs x and correspondng outputs y. The partcular values n ths dataset are x y 0.23 0.19

More information

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture #16 Scribe: Yannan Wang April 3, 2014

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture #16 Scribe: Yannan Wang April 3, 2014 COS 511: Theoretcal Machne Learnng Lecturer: Rob Schapre Lecture #16 Scrbe: Yannan Wang Aprl 3, 014 1 Introducton The goal of our onlne learnng scenaro from last class s C comparng wth best expert and

More information

Evaluation of classifiers MLPs

Evaluation of classifiers MLPs Lecture Evaluaton of classfers MLPs Mlos Hausrecht mlos@cs.ptt.edu 539 Sennott Square Evaluaton For any data set e use to test the model e can buld a confuson matrx: Counts of examples th: class label

More information

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models ECO 452 -- OE 4: Probt and Logt Models ECO 452 -- OE 4 Maxmum Lkelhood Estmaton of Bnary Dependent Varables Models: Probt and Logt hs note demonstrates how to formulate bnary dependent varables models

More information

On mutual information estimation for mixed-pair random variables

On mutual information estimation for mixed-pair random variables On mutual nformaton estmaton for mxed-par random varables November 3, 218 Aleksandr Beknazaryan, Xn Dang and Haln Sang 1 Department of Mathematcs, The Unversty of Msssspp, Unversty, MS 38677, USA. E-mal:

More information

Report on Image warping

Report on Image warping Report on Image warpng Xuan Ne, Dec. 20, 2004 Ths document summarzed the algorthms of our mage warpng soluton for further study, and there s a detaled descrpton about the mplementaton of these algorthms.

More information

Transfer Functions. Convenient representation of a linear, dynamic model. A transfer function (TF) relates one input and one output: ( ) system

Transfer Functions. Convenient representation of a linear, dynamic model. A transfer function (TF) relates one input and one output: ( ) system Transfer Functons Convenent representaton of a lnear, dynamc model. A transfer functon (TF) relates one nput and one output: x t X s y t system Y s The followng termnology s used: x y nput output forcng

More information

Solving Nonlinear Differential Equations by a Neural Network Method

Solving Nonlinear Differential Equations by a Neural Network Method Solvng Nonlnear Dfferental Equatons by a Neural Network Method Luce P. Aarts and Peter Van der Veer Delft Unversty of Technology, Faculty of Cvlengneerng and Geoscences, Secton of Cvlengneerng Informatcs,

More information

Lecture 3: Dual problems and Kernels

Lecture 3: Dual problems and Kernels Lecture 3: Dual problems and Kernels C4B Machne Learnng Hlary 211 A. Zsserman Prmal and dual forms Lnear separablty revsted Feature mappng Kernels for SVMs Kernel trck requrements radal bass functons SVM

More information

Learning Theory: Lecture Notes

Learning Theory: Lecture Notes Learnng Theory: Lecture Notes Lecturer: Kamalka Chaudhur Scrbe: Qush Wang October 27, 2012 1 The Agnostc PAC Model Recall that one of the constrants of the PAC model s that the data dstrbuton has to be

More information

Using deep belief network modelling to characterize differences in brain morphometry in schizophrenia

Using deep belief network modelling to characterize differences in brain morphometry in schizophrenia Usng deep belef network modellng to characterze dfferences n bran morphometry n schzophrena Walter H. L. Pnaya * a ; Ary Gadelha b ; Orla M. Doyle c ; Crstano Noto b ; André Zugman d ; Qurno Cordero b,

More information

Lossy Compression. Compromise accuracy of reconstruction for increased compression.

Lossy Compression. Compromise accuracy of reconstruction for increased compression. Lossy Compresson Compromse accuracy of reconstructon for ncreased compresson. The reconstructon s usually vsbly ndstngushable from the orgnal mage. Typcally, one can get up to 0:1 compresson wth almost

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

APPENDIX A Some Linear Algebra

APPENDIX A Some Linear Algebra APPENDIX A Some Lnear Algebra The collecton of m, n matrces A.1 Matrces a 1,1,..., a 1,n A = a m,1,..., a m,n wth real elements a,j s denoted by R m,n. If n = 1 then A s called a column vector. Smlarly,

More information

Singular Value Decomposition: Theory and Applications

Singular Value Decomposition: Theory and Applications Sngular Value Decomposton: Theory and Applcatons Danel Khashab Sprng 2015 Last Update: March 2, 2015 1 Introducton A = UDV where columns of U and V are orthonormal and matrx D s dagonal wth postve real

More information

j) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1

j) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1 Random varables Measure of central tendences and varablty (means and varances) Jont densty functons and ndependence Measures of assocaton (covarance and correlaton) Interestng result Condtonal dstrbutons

More information

SL n (F ) Equals its Own Derived Group

SL n (F ) Equals its Own Derived Group Internatonal Journal of Algebra, Vol. 2, 2008, no. 12, 585-594 SL n (F ) Equals ts Own Derved Group Jorge Macel BMCC-The Cty Unversty of New York, CUNY 199 Chambers street, New York, NY 10007, USA macel@cms.nyu.edu

More information

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix Lectures - Week 4 Matrx norms, Condtonng, Vector Spaces, Lnear Independence, Spannng sets and Bass, Null space and Range of a Matrx Matrx Norms Now we turn to assocatng a number to each matrx. We could

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

Why feed-forward networks are in a bad shape

Why feed-forward networks are in a bad shape Why feed-forward networks are n a bad shape Patrck van der Smagt, Gerd Hrznger Insttute of Robotcs and System Dynamcs German Aerospace Center (DLR Oberpfaffenhofen) 82230 Wesslng, GERMANY emal smagt@dlr.de

More information

Global Sensitivity. Tuesday 20 th February, 2018

Global Sensitivity. Tuesday 20 th February, 2018 Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values

More information

Multilayer neural networks

Multilayer neural networks Lecture Multlayer neural networks Mlos Hauskrecht mlos@cs.ptt.edu 5329 Sennott Square Mdterm exam Mdterm Monday, March 2, 205 In-class (75 mnutes) closed book materal covered by February 25, 205 Multlayer

More information

Feature Selection: Part 1

Feature Selection: Part 1 CSE 546: Machne Learnng Lecture 5 Feature Selecton: Part 1 Instructor: Sham Kakade 1 Regresson n the hgh dmensonal settng How do we learn when the number of features d s greater than the sample sze n?

More information

1 Derivation of Rate Equations from Single-Cell Conductance (Hodgkin-Huxley-like) Equations

1 Derivation of Rate Equations from Single-Cell Conductance (Hodgkin-Huxley-like) Equations Physcs 171/271 -Davd Klenfeld - Fall 2005 (revsed Wnter 2011) 1 Dervaton of Rate Equatons from Sngle-Cell Conductance (Hodgkn-Huxley-lke) Equatons We consder a network of many neurons, each of whch obeys

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm The Expectaton-Maxmaton Algorthm Charles Elan elan@cs.ucsd.edu November 16, 2007 Ths chapter explans the EM algorthm at multple levels of generalty. Secton 1 gves the standard hgh-level verson of the algorthm.

More information

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.

More information

Problem Set 9 Solutions

Problem Set 9 Solutions Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem

More information

Neural Networks. Perceptrons and Backpropagation. Silke Bussen-Heyen. 5th of Novemeber Universität Bremen Fachbereich 3. Neural Networks 1 / 17

Neural Networks. Perceptrons and Backpropagation. Silke Bussen-Heyen. 5th of Novemeber Universität Bremen Fachbereich 3. Neural Networks 1 / 17 Neural Networks Perceptrons and Backpropagaton Slke Bussen-Heyen Unverstät Bremen Fachberech 3 5th of Novemeber 2012 Neural Networks 1 / 17 Contents 1 Introducton 2 Unts 3 Network structure 4 Snglelayer

More information

Relevance Vector Machines Explained

Relevance Vector Machines Explained October 19, 2010 Relevance Vector Machnes Explaned Trstan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introducton Ths document has been wrtten n an attempt to make Tppng s [1] Relevance Vector Machnes

More information

STAT 3008 Applied Regression Analysis

STAT 3008 Applied Regression Analysis STAT 3008 Appled Regresson Analyss Tutoral : Smple Lnear Regresson LAI Chun He Department of Statstcs, The Chnese Unversty of Hong Kong 1 Model Assumpton To quantfy the relatonshp between two factors,

More information

EM and Structure Learning

EM and Structure Learning EM and Structure Learnng Le Song Machne Learnng II: Advanced Topcs CSE 8803ML, Sprng 2012 Partally observed graphcal models Mxture Models N(μ 1, Σ 1 ) Z X N N(μ 2, Σ 2 ) 2 Gaussan mxture model Consder

More information

Grover s Algorithm + Quantum Zeno Effect + Vaidman

Grover s Algorithm + Quantum Zeno Effect + Vaidman Grover s Algorthm + Quantum Zeno Effect + Vadman CS 294-2 Bomb 10/12/04 Fall 2004 Lecture 11 Grover s algorthm Recall that Grover s algorthm for searchng over a space of sze wors as follows: consder the

More information

Errors for Linear Systems

Errors for Linear Systems Errors for Lnear Systems When we solve a lnear system Ax b we often do not know A and b exactly, but have only approxmatons  and ˆb avalable. Then the best thng we can do s to solve ˆx ˆb exactly whch

More information

Generative classification models

Generative classification models CS 675 Intro to Machne Learnng Lecture Generatve classfcaton models Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square Data: D { d, d,.., dn} d, Classfcaton represents a dscrete class value Goal: learn

More information

Neural Networks. Class 22: MLSP, Fall 2016 Instructor: Bhiksha Raj

Neural Networks. Class 22: MLSP, Fall 2016 Instructor: Bhiksha Raj Neural Networs Class 22: MLSP, Fall 2016 Instructor: Bhsha Raj IMPORTANT ADMINSTRIVIA Fnal wee. Project presentatons on 6th 18797/11755 2 Neural Networs are tang over! Neural networs have become one of

More information

Lecture 3. Ax x i a i. i i

Lecture 3. Ax x i a i. i i 18.409 The Behavor of Algorthms n Practce 2/14/2 Lecturer: Dan Spelman Lecture 3 Scrbe: Arvnd Sankar 1 Largest sngular value In order to bound the condton number, we need an upper bound on the largest

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

One-sided finite-difference approximations suitable for use with Richardson extrapolation

One-sided finite-difference approximations suitable for use with Richardson extrapolation Journal of Computatonal Physcs 219 (2006) 13 20 Short note One-sded fnte-dfference approxmatons sutable for use wth Rchardson extrapolaton Kumar Rahul, S.N. Bhattacharyya * Department of Mechancal Engneerng,

More information

The Basic Idea of EM

The Basic Idea of EM The Basc Idea of EM Janxn Wu LAMDA Group Natonal Key Lab for Novel Software Technology Nanjng Unversty, Chna wujx2001@gmal.com June 7, 2017 Contents 1 Introducton 1 2 GMM: A workng example 2 2.1 Gaussan

More information

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg prnceton unv. F 17 cos 521: Advanced Algorthm Desgn Lecture 7: LP Dualty Lecturer: Matt Wenberg Scrbe: LP Dualty s an extremely useful tool for analyzng structural propertes of lnear programs. Whle there

More information

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan Kernels n Support Vector Machnes Based on lectures of Martn Law, Unversty of Mchgan Non Lnear separable problems AND OR NOT() The XOR problem cannot be solved wth a perceptron. XOR Per Lug Martell - Systems

More information