Development of a General Purpose On-Line Update Multiple Layer Feedforward Backpropagation Neural Network


Master Thesis MEE 97-4
Made by
Development of a General Purpose On-Line Update Multiple Layer Feedforward Backpropagation Neural Network
Master Program in Electrical Science 1997
College/University of Karlskrona/Ronneby
Supervisor/Examiner: Mattias Dahl

Abstract

This Master thesis deals with the complete understanding and creation of a 3-layer Backpropagation Neural Network with synaptic weight update performed on a per-sample basis (called On-Line update). The aim is to create such a network for general purpose applications and with a great degree of freedom in choosing the inner structure of the network. The algorithms used are all members of supervised learning classes, i.e. they are all supervised by a desired signal. The theory will be treated thoroughly for the steepest descent algorithm and for additional features which can be employed in order to increase the degree of generalization and the learning rate of the network. Empirical results will be presented and some comparisons with pure linear algorithms will be made for a signal processing application, speech enhancement.

Contents

Section 1  Preface
Section 2  Structure of a single neuron
  2.1  General methods for unconstrained optimization
  2.2  Basic neuron model
  2.3  Widrow-Hoff Delta Rule or LMS Algorithm
  2.4  Single Layer Perceptron
  2.5  General Transfer functions
Section 3  Neural Network Models
  3.1  Different Models
  3.2  Feedforward Multilayer Neural Network
Section 4  Steepest Descent Backpropagation Learning Algorithm
  4.1  Learning of a Single Perceptron
  4.2  Learning of a Multilayer Perceptron
Section 5  Additional Features for improving Convergence speed and Generalization
  5.1  Algorithm with Momentum Updating
  5.2  Algorithms with Non-Euclidean Error Signals
  5.3  Algorithm with an Adaptation of the Slope of the Activation Functions
  5.4  Adaptation of the Learning Rate and Mixing of input patterns
  5.5  A combined stochastic-deterministic weight update
Section 6  Comparison of a Batch update system with an On-Line Update system
Section 7  Empirical Results
  7.1  Approximation of sin(x) with two neurons in hidden layer
  7.2  Separation of signals with noise
  7.3  Processing of EEG-signals
  7.4  Echo-canceling in a Hybrid
  7.5  Speech enhancement for Hands-Free mobile telephone set
Section 8  Further Development
Section 9  Conclusions

1. Preface

Neural Networks are used in a broad variety of applications. The way they work and the tasks they solve differ widely. Often a Neural Network is used as an associator, where one input vector is associated with an output vector. These networks are trained rather than programmed to perform a given task, i.e. a set of associated vector pairs is presented in order to train the network. The network will then hopefully give satisfactory outputs when facing input vectors not present in the training phase. This is the generalization property, which in fact is one of the key features of Neural Networks.

The training phase is sometimes done in batch, which means that the training vector pairs are all presented at the same time to the network. If the task to be solved is the same over time, i.e. the task is stationary, training will be needed only once. In this case the batch procedure can be used. If the task is changing over time the network will have to be adaptive, i.e. to change with the task. In this case on-line training is preferable. The on-line approach takes the training vector pairs one at a time and performs a small adjustment in performance for every such presentation. The on-line approach has the great advantage that learning time is reduced considerably when compared to a batch system. In signal processing applications the adaptation during training is of great importance. This can only be achieved with the on-line system.

In this thesis an on-line update Neural Network will be created and thoroughly explained. The Neural Network will be general in the sense that its structure, performance, complexity and adaptation rules can be chosen arbitrarily. Empirical experiments on both artificial and real-life applications will be presented. The results show that many standard applications which earlier have been solved with linear systems can easily be solved with neural networks, often with better results. With a properly chosen structure, adaptation during learning can be satisfactory. In real-time implementations the use of neural networks can be limited due to their need for computational power, but the parallel structure of these networks can make way for implementations with parallel processors.

I would like to thank Professor Andrzej Cichocki for letting me use material from his and Dr Rolf Unbehauen's book [7]. Table 2.1 and figures 3.1, 4.1 and 4.2 are taken from this book.

2. Structure of a single neuron

2.1 General methods for unconstrained optimization

Consider the following optimization problem: find a vector w that minimizes the real-valued scalar function J(w):

min J(w),  where w = [w_1 w_2 ... w_n]^T,

and the number of elements of w, denoted n, is arbitrary. This problem can be transformed into an associated system of first order ordinary differential equations,

dw_i/dt = -η ∂J/∂w_i,   i = 1, 2, ..., n.

The vector w will then follow a trajectory in the phase portrait of the differential equations to a local or global minimum. In vector/matrix notation this becomes

dw/dt = -η(w,t) ∇_w J(w).

An initial vector w(0) must be chosen, from which the trajectory starts. Different selections of the matrix η(w,t) give different methods. The simplest selection is to choose the matrix as the unity matrix multiplied with a small constant η, the learning parameter; in this case the method becomes the well-known method of Steepest Descent.

If a Taylor approximation of J(w), truncated to second order terms, is taken and its gradient is set equal to zero (a necessary condition for a local minimum is that the gradient equals zero), the resulting equation can be solved. In this case η(w,t) is chosen as the inverse Hessian matrix. This is the well-known Newton's method:

dw/dt = -η [∇²_w J(w)]⁻¹ ∇_w J(w).

There are some drawbacks with Newton's method. First, the inverse of the Hessian must be calculated, and this inverse does not always exist (or the Hessian can be very ill-conditioned). One way to overcome this is to add a unity matrix multiplied with a small scalar to the Hessian matrix in order to improve the condition; of course this will then only be an approximation of the true Hessian. This method is called the Marquardt-Levenberg algorithm. Second, there are strong restrictions on the chosen initial values of w, because the truncated Taylor series gives poor approximations far away from the local minima. In the context of Neural Networks there are even more difficulties because of the computational overhead and large memory requirements. In an On-Line update system we would need to calculate as many matrix inversions as the number of neurons in the network for every sample. There are several approaches which iteratively approximate this inverse without actually calculating one, i.e. Quasi-Newton methods.

This thesis will emphasize the Steepest Descent method, since the weight update will be On-Line and this sets strong requirements on computational simplicity.
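As a concrete illustration of the two methods (this sketch is not part of the original thesis), the following Python fragment compares a discretized steepest-descent iteration with Newton's method on a simple quadratic cost J(w) = 1/2 w^T A w - b^T w; the matrix A, the vector b, the step size and the iteration counts are arbitrary illustrative choices.

    import numpy as np

    # Illustrative quadratic cost J(w) = 0.5 w^T A w - b^T w, gradient A w - b, Hessian A
    A = np.array([[3.0, 0.5], [0.5, 1.0]])   # positive definite, so a unique minimum exists
    b = np.array([1.0, -2.0])

    def grad(w):
        return A @ w - b

    # Steepest descent: w <- w - eta * grad J(w)
    w_sd, eta = np.zeros(2), 0.1
    for _ in range(200):
        w_sd = w_sd - eta * grad(w_sd)

    # Newton's method: w <- w - H^{-1} grad J(w); for this quadratic the Hessian H is simply A
    w_nt = np.zeros(2)
    for _ in range(5):
        w_nt = w_nt - np.linalg.solve(A, grad(w_nt))

    print(w_sd, w_nt, np.linalg.solve(A, b))   # both approach the exact minimizer A^{-1} b

For a quadratic cost Newton's method reaches the minimizer in a single step, while steepest descent needs many small steps; this mirrors the trade-off between per-step cost and convergence speed discussed above.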

2.2 Basic neuron model

Let us consider a single neuron which takes a vector x = [x_1 x_2 ... x_n]^T as input and produces a scalar y as output. As before, the number of elements in the vector x, denoted n, is arbitrary. The neuron has inner strengths w = [w_1 w_2 ... w_n]^T, also called the synaptic weights. In most models the neuron also has a bias, a scalar denoted Θ. The input x is multiplied with the synaptic weights and then the bias is added. The result is passed through a scalar activation function Ψ(·) according to

y = Ψ( Σ_{i=1}^{n} w_i x_i + Θ ).

This can be written in a compact form as

y = Ψ( Σ_{i=0}^{n} w_i x_i ),

where the index i stands for the i:th scalar in the vectors w and x respectively, and w_0 = Θ and x_0 = 1 by definition. The activation function Ψ(·) can be any linear or non-linear function which is piecewise differentiable. The state of the neuron is measured by the output signal y.

2.3 Widrow-Hoff Delta rule or LMS algorithm

In the Widrow-Hoff Delta rule, also called the LMS algorithm, the activation function is linear, often with a slope of unity. In supervised learning the output y should equal a desired value d. The LMS algorithm is a method for adjusting the synaptic weights to assure minimization of the error function

J = (1/2) e²(t) = (1/2) (d - y)².

For several inputs the sum of J is to be minimized according to the L2-norm (other norms can be used, see section 5.2). Applying the steepest descent approach, the following system of differential equations is obtained:

dw_i/dt = -η ∂J/∂w_i = -η (∂J/∂y)(∂y/∂w_i),

and since y = Σ_{i=0}^{n} w_i x_i we get

dw/dt = η e(t) x(t),

where t is the continuous time index and η is the learning parameter. Converting this into the standard discrete-time LMS algorithm and writing it in vector notation gives

w(k+1) = w(k) + μ e(k) x(k).

Here the index k stands for the discrete sample index and μ is the learning parameter (step size). The parameter μ differs from the learning parameter η in the continuous case; generally it must be smaller to ensure convergence.

In signal processing applications, where LMS is widely used, the vector x is a time-shifted version of a stream of discrete input samples. Consider a stream of input scalars (a sampled signal) with sample index k starting at 1 and increasing to infinity, that is

X = [x(1) x(2) ... x(k-1) x(k) x(k+1) ...].

The aim is to map this signal one-to-one onto a desired signal D,

D = [d(1) d(2) ... d(k-1) d(k) d(k+1) ...],

by filtering the signal X with a finite impulse response filter F. This type of filter uses a finite length of past experience from the signal X as input, without any recursion. The number of filter coefficients in F is denoted n and the filter coefficients are w = [w(1) w(2) ... w(n)]^T. The output at sample index k is

y(k) = Σ_{i=1}^{n} w(i) x(k-i+1),

which is the convolution between w and x, that is y = w * x. Now using the LMS algorithm in order to update the filter coefficients during presentation of the input stream gives, as before,

w(k+1) = w(k) + μ e(k) x(k),

where the input vector x at sample index k is

x(k) = [x(k) x(k-1) ... x(k-n+1)]^T.

The error e(k) is, as before, defined as the difference between the desired value and the actual outcome of the filter, that is e(k) = d(k) - y(k).

There are several improvements to this standard LMS algorithm, each fitting its special purpose. Table 2.1 shows some modified versions of the LMS algorithm. The aim of these modifications is to improve the standard LMS algorithm with regard to different aspects, such as rapid convergence speed or minimization of the excess mean squared error. Some of the modifications aim to reduce computational complexity, such as the sign algorithms. Further insight into their behavior can be obtained from any textbook in signal processing.
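The following minimal Python sketch (added for illustration, not taken from the thesis) implements the on-line LMS update for an FIR filter exactly as written above; the filter length, step size and the toy system-identification signals are made-up values.

    import numpy as np

    def lms_filter(x, d, n_taps=8, mu=0.01):
        """On-line LMS: w(k+1) = w(k) + mu * e(k) * x(k), with x(k) a sliding window of past samples."""
        w = np.zeros(n_taps)
        y = np.zeros(len(x))
        e = np.zeros(len(x))
        for k in range(len(x)):
            # input vector x(k) = [x(k), x(k-1), ..., x(k-n+1)], zero-padded at the start
            xk = np.array([x[k - i] if k - i >= 0 else 0.0 for i in range(n_taps)])
            y[k] = w @ xk                 # filter output (convolution of w with the input stream)
            e[k] = d[k] - y[k]            # error against the desired signal
            w = w + mu * e[k] * xk        # per-sample (on-line) weight update
        return w, y, e

    # Toy example: identify an unknown FIR system from its input/output signals
    rng = np.random.default_rng(0)
    x = rng.standard_normal(5000)
    true_h = np.array([0.5, -0.3, 0.2, 0.1])
    d = np.convolve(x, true_h)[:len(x)]
    w, y, e = lms_filter(x, d, n_taps=4, mu=0.02)
    print(np.round(w, 3))   # should be close to true_h after adaptation

Because the update is performed per sample, the filter can track a slowly changing system, which is the property exploited throughout this thesis.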

Master Thess MEE 97-4 /5/98 Table. Modfed and mproved versons of the LMS algorthm, [7] Page 8

2.4 Single layer perceptron

The single layer perceptron uses the hard limiter as the activation function. The hard limiter gives only quantized binary outputs, that is ỹ ∈ {-1, 1}. In this case the output is compared with the quantized binary desired value d̃ ∈ {-1, 1}, and a quantized error signal ẽ = (d̃ - ỹ) ∈ {-2, 0, 2} is produced. The update rule is very similar to that of the LMS algorithm:

w(k+1) = w(k) + μ ẽ(k) x(k).

The only difference is that the error is quantized. This perceptron can achieve any separation of input vectors that are linearly separable. Input vectors that are not linearly separable can be separated as well by appropriately interconnecting several perceptrons. In more general perceptron implementations the activation function is non-linear, see the next section. If an arbitrary transfer function is used which is piecewise differentiable, the update rule can be derived in a similar way as for the LMS algorithm. Consider a general transfer function Ψ(·) with derivative Ψ'(·), and let

u = Σ_{i=0}^{n} w_i x_i,   y = Ψ(u).

As previously, we will use the L2-norm for error measurement (a more general error norm will be described in section 5.2):

J = (1/2) ẽ²(t) = (1/2) (d̃ - ỹ)²,   where the index t stands for continuous time.

The differential equations state (applying the chain rule)

dw/dt = -η ∂J/∂w = -η (∂J/∂ẽ)(∂ẽ/∂w) = -η (∂J/∂ẽ)(∂ẽ/∂u)(∂u/∂w).

This gives, for a general activation function, the discrete vector update form

w(k+1) = w(k) + μ ẽ(k) Ψ'(u(k)) x(k).

Here the error ẽ is quantized as before, the index k stands for the discrete sample index, and μ is the discrete step size. Often the activation function is an s-shaped sigmoid function.

Master Thess MEE 97-4 /5/98 The slope γs a parameter whch controls the steepness n the non-lnearty and ths can ether be pre-specfed or t can be updated accordng to a scheme, secton 5.3 deals wth ths aspect. The bpolar sgmod functon for dfferent slopes (.5,.,.).8.6.4. -. -.4 -.6 -.8 - -5-4 -3 - - 3 4 5 Fg.4. The bpolar sgmod functon for dfferent slopes (.5,.,.) The unpolar sgmod functon for dfferent slopes (.5,.,.).9.8.7.6.5.4.3.. -5-4 -3 - - 3 4 5 Fg.4. The unpolar sgmod functon for dfferent slopes (.5,.,.) The partal dervatve for the bpolar sgmod functon (bsg) follows by usng the dervatve rule of a quote. γu γu γu γu dψ( u) γ e ( + e ) ( e )( γe ) = γ u = d( u) ( + e ) γ u γ u γ u γ u γ u γ u γ u γ u γ e + γ ( e ) + γ e γ ( e ) γ + γ e + γ ( e ) γ + γ e γ ( e ) γ u = γ u ( + e ) ( + e ) γ u γ u γ u ( + e ) ( e ) ( + e ) γ γ u = γ ( γ u = γ γ ( + e ) ( e ) ) ( tanh( u) ) = And for the unpolar sgmod functon (usg) dψ( u) = du γ + e γ u γ u γ u γ u γ e + e + e γ u = γ γ u = γ ( u u ) = ( + e ) ( + e ) ( γ + e ) ( γ + e ) + e γ u = γ Ψ( u) ( Ψ( u)) Page

3. Neural Network Models

3.1 Different models

There are different types of Neural Network models for different tasks. The way they work and the tasks they solve differ widely, but they also share some common features. Generally, Neural Networks consist of simple processing units interconnected in parallel. These connections divide the set of networks into three large categories:

1. Feedforward networks. These networks can be compared with conventional FIR filters without adaptation of the weights. The difference is that they can perform a non-linear mapping of input signals onto the target set. As for FIR filters, there are no stability problems.

2. Feedback networks. These networks are non-linear dynamic systems and can be compared with IIR filters. The Elman network and the Hopfield network are both feedback networks used in practice.

3. Cellular networks. These networks have more complex interconnections. Every neuron is interconnected with its neighbors and the neurons can be organized in two or three dimensions. A change in one neuron's state will affect all the others. Often used training methods are the simulated annealing schedule (which is a pure stochastic algorithm) or mean field theory (which is a deterministic algorithm).

In supervised learning, backpropagation learning can be used in order to find the weights that best (according to an error norm) perform a specific task. This method will be described in detail in section 4.2 for a feedforward Neural Network. In unsupervised learning, where no desired target is present, the task to be performed is often to reduce all components in the input signal that are correlated (here not only linear correlation is meant, as is often the case). This can for instance be done by minimizing

E(w) = (α/2) wᵀw - σ(wᵀx),

where σ(u) is the loss function, typically σ(u) = u². This is called the potential learning rule. The first term on the right side is usually called the leaky effect. If this unsupervised approach is adopted, the algorithm described above only changes to

w(k+1) = (1 - α) w(k) + μ u(k) Ψ'(u(k)) x(k).

Here the error e is replaced with the internal state u, and a small constant α is added to prevent the weights from growing beyond limits. Sometimes the constraint on w is instead chosen as a fixed length of unity, that is ||w|| = 1. In this case one wishes to minimize the absolute value of the loss function. Observe that this is a local update rule, that is, no information has to be passed to other interconnected neurons.
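A minimal Python sketch of this local unsupervised rule is given below (added for illustration only; the leak α, step size μ, slope γ and the correlated toy input are arbitrary choices, not the thesis setup).

    import numpy as np

    def potential_rule_step(w, x, mu=0.01, alpha=0.001, gamma=1.0):
        """One local unsupervised update: w <- (1 - alpha) * w + mu * u * Psi'(u) * x,
        where u = w^T x and Psi is the bipolar sigmoid, so Psi'(u) = gamma * (1 - tanh(gamma*u)^2)."""
        u = w @ x
        psi_prime = gamma * (1.0 - np.tanh(gamma * u) ** 2)
        return (1.0 - alpha) * w + mu * u * psi_prime * x

    rng = np.random.default_rng(1)
    w = 0.1 * rng.standard_normal(3)
    for _ in range(2000):
        # toy inputs with one dominant correlated direction, just for illustration
        x = rng.standard_normal(3) + np.array([1.0, 1.0, 0.0]) * rng.standard_normal()
        w = potential_rule_step(w, x)
    print(w / np.linalg.norm(w))   # the weight vector tends toward the most strongly correlated input direction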

3.2 Feedforward Multilayer Neural Network

A feedforward neural network can consist of an arbitrary number of layers, but in practice there will be only one, two or three layers. It has been shown theoretically that it is sufficient to use a maximum of three layers to solve an arbitrarily complex pattern classification problem []. The layers are ordered in serial and the last layer is called the output layer. The preceding layers are called hidden layers, with indices starting at one in the layer closest to the input vector. Sometimes the input vector is called the input layer. Here a three-layer perceptron will be regarded. In every layer there can be neurons with any transfer function, with or without a bias. The number of neurons in the hidden layers is determined by the complexity of the problem. The number of neurons in the output layer is determined by the type of problem to be solved, because this is the number of output signals from the network. In this approach the types of transfer functions used are the linear, bsg and usg, intermixed arbitrarily in each layer. All neurons will have a bias in order to improve the generalization.

Using the compact notation (as in section 2.2)

y = Ψ( Σ_{i=0}^{n} w_i x_i ),   where w_0 = Θ and x_0 = 1,

and arranging the neurons as a column, the processing of a whole layer can be described in matrix notation. Define a matrix W^[1] for the first layer, where the i:th row contains the weights of the i:th neuron, and the transfer function Ψ^[1] as a multivariable function

Ψ^[1]: R^{n_1} → R^{n_1},   Ψ^[1] = [Ψ_1^[1] Ψ_2^[1] ... Ψ_{n_1}^[1]]^T.

The components of Ψ^[1] are arranged according to the chosen transfer functions. The first layer's processing can then be described as

o^[1] = Ψ^[1]( W^[1] x ),

where o^[1] is the first layer's output (see fig 3.1). In a similar way the other layers can be described, with the input signals of each layer being the output values of the preceding layer. The total processing of the feedforward network will be

y = Ψ^[3]( W^[3] Ψ^[2]( W^[2] Ψ^[1]( W^[1] x ) ) ).

Figure 3.1 shows the configuration of a three-layer feedforward neural network.

Fig 3.1 A Feedforward Multilayer Neural Network (3-layer), [7]

The number of neurons in layers 1, 2 and 3 will be denoted n_1, n_2 and n_3 respectively.
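The matrix form above translates directly into code. The Python sketch below (illustrative only, with arbitrary layer sizes and random weights) computes y = Ψ^[3](W^[3] Ψ^[2](W^[2] Ψ^[1](W^[1] x))), handling the bias of each neuron by prepending a constant x_0 = 1 to every layer input.

    import numpy as np

    def forward(x, weights, activations):
        """Forward pass o = Psi(W @ [1, input]) applied layer by layer.
        weights[s] has shape (n_{s+1}, n_s + 1); column 0 holds the biases."""
        o = np.asarray(x, dtype=float)
        for W, psi in zip(weights, activations):
            o = psi(W @ np.concatenate(([1.0], o)))   # prepend x0 = 1 so w0 acts as the bias
        return o

    bsg = np.tanh                       # bipolar sigmoid with slope 1
    linear = lambda u: u                # linear output neuron(s)

    rng = np.random.default_rng(0)
    n0, n1, n2, n3 = 4, 5, 3, 1         # input size and neurons per layer (illustrative)
    weights = [rng.standard_normal((n1, n0 + 1)),
               rng.standard_normal((n2, n1 + 1)),
               rng.standard_normal((n3, n2 + 1))]
    y = forward(rng.standard_normal(n0), weights, [bsg, bsg, linear])
    print(y.shape)   # (1,)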

Here all neurons in every layer are connected with all feeding sources (inputs). This approach is called a fully connected neural network. If some of the connections are removed the network becomes a reduced-connection network. The way in which one chooses the connections in the latter network is not in any way trivial. The aim is to maintain the maximum of information flow through the network, that is, only to remove connections that are redundant. This of course varies with the application. In speech enhancement systems, for example, one could try to make use of the fact that speech is quite correlated. This could perhaps make it possible to remove some of the connections without any major loss in performance. In this thesis a general approach is made with a fully connected network.

4. Steepest Descent Backpropagation Learning Algorithm

4.1 Learning of a Single Perceptron

The derivation of the learning rule for the single perceptron will be done with a bias and with a general transfer function. The perceptron will be denoted with index 1 for straightforward incorporation later into the network, see figure 4.1. Consider a general transfer function Ψ(·) with derivative Ψ'(·), and let

u_1 = Σ_{i=0}^{n} w_{1i} x_i,   with w_{10} = Θ_1 and x_0 = 1,   y_1 = Ψ(u_1).

We wish to minimize the instantaneous squared error of the output signal,

J = (1/2) e_1²(t) = (1/2) (d_1 - y_1)².

The steepest descent approach gives the following differential equations:

dw_{1i}/dt = -η ∂J/∂w_{1i} = -η (∂J/∂e_1)(∂e_1/∂u_1)(∂u_1/∂w_{1i}),

and since

∂J/∂e_1 = e_1,   ∂e_1/∂u_1 = -Ψ'(u_1),   ∂u_1/∂w_{1i} = x_i,

the update rule becomes

dw_1/dt = η e_1 Ψ'(u_1) x.

If we define a learning signal δ_1 as

δ_1 = e_1 Ψ'(u_1),

we can write the discrete vector update form as

w_1(k+1) = w_1(k) + μ δ_1 x(k),

where k is the iteration index and μ is the learning parameter. This last formula is represented by the box called Adaptive algorithm in figure 4.1.

Fig 4.1 Learning of a single perceptron, [7]

If the activation function bsg is used, as derived in section 2.5, the update rule becomes

w_1(k+1) = w_1(k) + μ γ e_1(k) (1 - y_1²(k)) x(k),

and for the usg activation function the update rule becomes (see section 2.5)

w_1(k+1) = w_1(k) + μ γ e_1(k) y_1(k) (1 - y_1(k)) x(k).

For the pure linear function the rule becomes the well-known LMS algorithm (section 2.3):

w_1(k+1) = w_1(k) + μ γ e_1(k) x(k).

As stated before, the slope γ in LMS is often chosen as unity, since it only affects the step size of the rule.

4.2 Learning of a Multilayer Perceptron, MLP

Here a 3-layer perceptron will be derived; fewer layers can be used. The network structure is according to fig 4.2. The structure is built up of several single-layer perceptrons, where the learning signals are modified according to the rules described below. In this figure the biases are left out. The extension to more layers is straightforward. The network behavior should be determined on the basis of input/output vector pairs, where the size of these vectors can be chosen in any way. The network will have the same number of neurons in the third layer as the number of desired signals. A set of input/output pairs will be used to train the network. The presented pair will be denoted with index k. Each learning pair is composed of n input signals x_i (i = 1, 2, ..., n) and n_3 corresponding desired output signals d_j (j = 1, 2, ..., n_3). In the first hidden layer there will be n_1 neurons and in the second hidden layer there will be n_2 neurons. The number of neurons in the hidden layers can be chosen arbitrarily and represents the complexity of the problem. It is a delicate task to choose this complexity for specific applications, and often a trial and error approach is used.

There are some ways to automatically increase or decrease the number of neurons in the hidden layers during learning. This approach is called pruning. Sometimes this is done by looking at each neuron's output state during learning cycles; if it is similar to another neuron's state (or similar with opposite sign) then one of them can be removed. Levels for how much they may vary are proposed by []. Too many neurons in the hidden layers decreases the generalization of the network, and too few neurons will not solve the task.

The learning of the MLP consists in adjusting all the weights in order to minimize the error measure between the desired and the actual outcome. An initializing set of weights must be chosen.

Fig 4.2 Learning of a Multiple Layer Perceptron, MLP (3-layer), [7]

If multiple outputs are chosen, then the sum of all errors is to be minimized for all input/output pairs, that is

min { J_k },   J_k = (1/2) Σ_{j=1}^{n_3} (d_{jk} - y_{jk})² = (1/2) Σ_{j=1}^{n_3} e_{jk}².

Here the L2-norm is used; in section 5.2 other error measures are discussed. This function is called the local error function, since this error is to be regarded as the instantaneous error for the k:th input/output pair. The global error function (also called the performance function) is the sum of the above function over all training data. If the task to be solved is stationary, the latter function is to be minimized; on the other hand, if adaptation during the presentation of learning patterns is desired, then the local error function must be minimized. This can only be done in an on-line update system, which is of concern here. A discussion of this topic will be presented in section 6.

When minimizing the local error function, the global error function is not always minimized. It has been proved that, if the learning parameter (step size) is sufficiently small, the local minimization will lead to a global minimization [3]. When deriving the backpropagation algorithm, the differential equations will be initialized by the gradient method for minimization of the local error function, that is

dw_{ij}^[s]/dt = -η ∂J_k/∂w_{ij}^[s],   η > 0.

(Here the index [s] stands for layer s, i stands for neuron i, j stands for the j:th weight, and k is the k:th input/output pattern.)

First the update rules for the output layer will be derived (s = 3). Using the steepest descent approach gives

dw_{ij}^[3]/dt = -η ∂J_k/∂w_{ij}^[3] = -η (∂J_k/∂u_i^[3]) (∂u_i^[3]/∂w_{ij}^[3]),

and since

u_i^[3] = Σ_j w_{ij}^[3] o_j^[2],

where o^[2] is the input vector to layer 3, the same as the output vector from layer 2 (see fig 4.2 and compare with the single perceptron), this becomes

dw_{ij}^[3]/dt = -η (∂J_k/∂e_{ik}) (∂e_{ik}/∂u_i^[3]) o_j^[2] = η e_{ik} Ψ_i'(u_i^[3]) o_j^[2].

If we define the local error δ_i^[3] as

δ_i^[3] = -∂J_k/∂u_i^[3] = e_{ik} Ψ_i'(u_i^[3]) = (d_{ik} - y_{ik}) ∂Ψ_i/∂u_i^[3],

this can be written as

dw_{ij}^[3]/dt = η δ_i^[3] o_j^[2],

and the discrete vector update rule becomes

w_i^[3](k+1) = w_i^[3](k) + μ e_i(k) Ψ_i'(u_i^[3](k)) o^[2](k)    [Output layer, neuron i]

where the index k stands for the k:th iteration of input/output patterns and the index i is the i:th neuron in the output layer. Here the step size μ differs from the learning parameter η in the continuous case; generally it must be smaller to ensure convergence. The derivative above depends on the transfer function chosen. In this approach the transfer functions can be intermixed in any order, as mentioned earlier, see section 2.5.

For the second hidden layer the error is not directly reachable, so the derivatives must be taken with regard to quantities already calculated and others that can be evaluated. It still holds that

dw_{ij}^[2]/dt = -η ∂J_k/∂w_{ij}^[2] = -η (∂J_k/∂u_i^[2]) (∂u_i^[2]/∂w_{ij}^[2]) = η δ_i^[2] o_j^[1].

The difference from the output layer lies in the local error δ_i^[2]. As before, the local error is defined as

δ_i^[2] = -∂J_k/∂u_i^[2],

and this gives (using the chain rule)

δ_i^[2] = -(∂J_k/∂o_i^[2]) (∂o_i^[2]/∂u_i^[2]) = -(∂J_k/∂o_i^[2]) Ψ_i'(u_i^[2]),

since o_i^[2] = Ψ_i(u_i^[2]). The second factor is the derivative of the transfer function used for the i:th neuron. The reason why this algorithm is called backpropagation is seen when calculating ∂J_k/∂o_i^[2], since information from the output layer update is used here to update the second hidden layer:

∂J_k/∂o_i^[2] = Σ_{p=1}^{n_3} (∂J_k/∂u_p^[3]) (∂u_p^[3]/∂o_i^[2]) = Σ_{p=1}^{n_3} (∂J_k/∂u_p^[3]) w_{pi}^[3] = -Σ_{p=1}^{n_3} δ_p^[3] w_{pi}^[3].

This gives the local error for the second hidden layer as

δ_i^[2] = Ψ_i'(u_i^[2]) Σ_{p=1}^{n_3} δ_p^[3] w_{pi}^[3].

In vector notation this becomes

w_i^[2](k+1) = w_i^[2](k) + μ [ Σ_{p=1}^{n_3} δ_p^[3] w_{pi}^[3] ] Ψ_i'(u_i^[2](k)) o^[1](k)    [Second hidden layer, neuron i]

Here w_{pi}^[3] means neuron p, weight i, in the output layer. The information that is backpropagated consists of the local errors and the updated weights from layer three. In a similar way the update rule for the first hidden layer is derived, now with regard to the second hidden layer's local errors and weights. This gives the vector update rule for the first hidden layer as

w_i^[1](k+1) = w_i^[1](k) + μ [ Σ_{p=1}^{n_2} δ_p^[2] w_{pi}^[2] ] Ψ_i'(u_i^[1](k)) x(k)    [First hidden layer, neuron i]

Here w_{pi}^[2] means neuron p, weight i, in the second hidden layer. The extension to more hidden layers is straightforward but not dealt with here.
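To summarize the three update rules, the following compact Python sketch (added for illustration; it is not the thesis implementation) performs one per-sample steepest-descent update for a fully connected 3-layer network with bipolar sigmoid hidden layers and a linear output layer. The layer sizes, the step size and the sin(x) toy data are assumptions made only for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    def bsg(u):        return np.tanh(u)            # bipolar sigmoid, slope gamma = 1
    def bsg_prime(u):  return 1.0 - np.tanh(u) ** 2
    def lin(u):        return u                     # linear output layer
    def lin_prime(u):  return np.ones_like(u)

    def init_weights(n_in, n1, n2, n3, scale=0.5):
        # each weight matrix carries an extra first column for the bias (input x0 = 1)
        return [scale * rng.standard_normal((n1, n_in + 1)),
                scale * rng.standard_normal((n2, n1 + 1)),
                scale * rng.standard_normal((n3, n2 + 1))]

    def online_backprop_step(W, x, d, mu=0.05):
        """One per-sample steepest-descent update for a 3-layer MLP (bsg, bsg, linear)."""
        acts, primes = [bsg, bsg, lin], [bsg_prime, bsg_prime, lin_prime]
        ins, us, o = [], [], np.asarray(x, dtype=float)
        for Ws, psi in zip(W, acts):                    # forward pass, storing layer inputs and net sums u
            inp = np.concatenate(([1.0], o))
            u = Ws @ inp
            ins.append(inp)
            us.append(u)
            o = psi(u)
        e = d - o                                       # output error e = d - y
        delta = e * primes[2](us[2])                    # output layer local error: delta = e * Psi'(u)
        for s in (2, 1, 0):                             # backward pass
            W[s] = W[s] + mu * np.outer(delta, ins[s])  # w <- w + mu * delta * (layer input)
            if s > 0:
                back = W[s][:, 1:].T @ delta            # backpropagate the local errors through the
                delta = back * primes[s - 1](us[s - 1]) # (already updated) weights, bias column excluded
        return W, e

    # Toy usage: learn y = sin(x) on [0, pi] with small hidden layers (sizes are illustrative)
    W = init_weights(1, 2, 2, 1)
    grid = np.linspace(0.0, np.pi, 20)
    for epoch in range(2000):
        for x in rng.permutation(grid):                 # mixed (shuffled) input patterns, new order per epoch
            W, _ = online_backprop_step(W, np.array([x]), np.array([np.sin(x)]))

Note that, as stated above, the weights of a layer are updated before its local errors are backpropagated to the layer below, so the backpropagated information consists of the local errors and the already updated weights.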

5. Additional features for improving Convergence speed and Generalization

5.1 Algorithm with Momentum Updating

The standard backpropagation has some drawbacks: the learning parameter (step size in the discrete case) should be chosen small in order to provide minimization of the global error function, but a small learning parameter slows down the learning process. In the discrete case the step size must also be kept small to ensure convergence. A large learning parameter is desirable for rapid convergence and for minimizing the risk of getting trapped in local minima or very flat plateaus of the error surface. One way to improve the backpropagation algorithm is to smooth the weights by overrelaxation. This is done by adding a fraction of the previous weight update to the actual weight update. The fraction is called the momentum term, and the update rule is modified to

Δw_{ij}^[s](k) = η δ_i^[s] o_j^[s-1] + α Δw_{ij}^[s](k-1),   (s = 1, 2, 3)

where α is a parameter which controls the amount of momentum (0 ≤ α < 1). This is done for each layer [s] separately. The momentum concept will increase the speed of convergence and at the same time improve the steady-state performance of the algorithm. If we are on a plateau of the error surface, where the gradient is approximately the same for two consecutive steps, the effective step size will be

η_eff = η / (1 - α),

since

Δw_{ij}^[s](k) = -η ∂J_k/∂w_{ij}^[s](k) + α Δw_{ij}^[s](k-1) ≈ -(η / (1 - α)) ∂J_k/∂w_{ij}^[s](k).

If we are in a local or global minimum the momentum term will have the opposite sign to the local error and thus decrease the effective step size. The result is that the learning rate is increased without magnifying the parasitic oscillations at minima.
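A short Python sketch of the momentum modification (illustrative values only) shows how the effective step size approaches η/(1 - α) when the gradient stays roughly constant.

    import numpy as np

    def momentum_update(w, grad, prev_dw, eta=0.05, alpha=0.8):
        """delta_w(k) = -eta * grad J_k + alpha * delta_w(k-1);  0 <= alpha < 1."""
        dw = -eta * grad + alpha * prev_dw
        return w + dw, dw

    # On a flat plateau the gradient is nearly constant, so the effective step
    # approaches eta / (1 - alpha), here 0.05 / 0.2 = 0.25 per iteration.
    w, dw = np.array([0.0]), np.array([0.0])
    for _ in range(50):
        w, dw = momentum_update(w, np.array([1.0]), dw)   # constant gradient of 1
    print(dw)   # close to -0.25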

5.2 Algorithms with Non-Euclidean Error Signals

So far the optimization criterion has been to minimize the local/global error surface on the basis of the least-squares error, the Euclidean or L2-norm. When the output layer is large, and/or the input signals are contaminated with non-Gaussian noise (especially wild spiky noise), other error measures are needed in order to improve the learning of the network. In this general approach the error measure will be derived for the L1-norm and for a continuous selection of norms between the Euclidean and the L1-norm. The norms will be approximations with minor differences to the true norms. Consider the performance (error) function defined as

J_k = Σ_{j=1}^{n_3} σ(e_{jk}),   where e_{jk} = d_{jk} - y_{jk} as before.

Here n_3 is the number of neurons in the output layer and σ is a function (typically convex) called the loss function. With σ(e) = |e| the L1-norm is obtained, and with σ(e) = e²/2 the L2-norm is obtained. There are many different loss functions proposed for a variety of applications. A general loss function which easily gives freedom in choosing the shape of the function is the logistic function

σ(e) = β² ln cosh(e/β).

Figure 5.1 shows the logistic function for different selections of β.

Fig 5.1 The logistic loss function for different selections of β

When β is close to one the function approximates the absolute (L1) norm, and when β is large it approximates the Euclidean (L2) norm. The derivative of this loss function must be calculated in order to employ it in the network:

∂J_k/∂y_{jk} = (∂J_k/∂e_{jk})(∂e_{jk}/∂y_{jk}) = -∂J_k/∂e_{jk},   e_{jk} = d_{jk} - y_{jk},   (j = 1, 2, ..., n_3).

Introducing h(g) = ln(g), g(y) = cosh(y) and y(e) = e/β, so that ∂h/∂g = 1/g, ∂g/∂y = sinh(y) and ∂y/∂e = 1/β, and using the chain rule,

∂σ/∂e = β² (1/cosh(e/β)) sinh(e/β) (1/β) = β tanh(e/β).

Now using the backpropagation algorithm with this logistic loss function gives

dw_{ij}/dt = -η ∂J_k/∂w_{ij} = -η (∂J_k/∂y_i)(∂y_i/∂u_i)(∂u_i/∂w_{ij}),

and since ∂J_k/∂y_{ik} = -β tanh(e_{ik}/β) and ∂y_i/∂u_i = Ψ_i'(u_i), the discrete update rule for the output layer becomes

w_i^[3](k+1) = w_i^[3](k) + μ β tanh(e_i(k)/β) Ψ_i'(u_i^[3](k)) o^[2](k)    [Output layer, neuron i, logistic loss function]

For the hidden layers the update rules remain identical to those with the Euclidean error norm; the effect of this loss function is backpropagated into the inner layers of the network through the local errors.
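The logistic loss and its derivative are easy to verify numerically. The short Python sketch below (illustrative only) shows how the gradient β tanh(e/β) saturates for small β (L1-like behavior, limiting the influence of spiky errors) and becomes proportional to e for large β (L2-like behavior).

    import numpy as np

    def logistic_loss(e, beta):
        """sigma(e) = beta^2 * ln(cosh(e / beta))."""
        return beta ** 2 * np.log(np.cosh(e / beta))

    def logistic_loss_grad(e, beta):
        """d sigma / d e = beta * tanh(e / beta)."""
        return beta * np.tanh(e / beta)

    e = np.linspace(-3, 3, 7)
    # Small beta: the gradient saturates at +-beta, behaving like a (scaled) L1 norm
    print(np.round(logistic_loss_grad(e, beta=1.0), 3))
    # Large beta: tanh(e/beta) ~ e/beta, so the gradient is ~ e, as for the L2 norm
    print(np.round(logistic_loss_grad(e, beta=50.0), 3))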

5.3 Algorithm with an Adaptation of the Slope of the Activation Functions

So far the slope of the activation functions has been fixed (typically γ = 1). Kruschke and Movellan [4] have shown that an adaptation of the slope (gain) of all activation functions greatly increases the learning speed and improves the generalization. In this approach the slopes of bsg and usg (see 2.5) are modified when adaptation of the slope is applied. No adaptation is applied to the pure linear transfer functions, because this could lead to instability. When the slope is modified for each presented input/output pattern, this gives the following for each neuron i in each layer [s]:

o_i^[s] = Ψ_i^[s]( γ_i^[s] u_i^[s] ),

where Ψ is bsg or usg as in section 2.5. Applying the steepest descent approach when minimizing the instantaneous error with regard to the slope γ gives

dγ_i^[s]/dt = -η ∂J_k/∂γ_i^[s] = -η (∂J_k/∂y_i^[s]) (∂y_i^[s]/∂γ_i^[s]) = η δ_i^[s] u_i^[s] / γ_i^[s].

Here the local error from the synaptic weight update has been reused, since the only difference between ∂y_i^[s]/∂γ_i^[s] and ∂y_i^[s]/∂u_i^[s] is that the slope γ is here treated as the variable and the inner state u as a constant. The discrete form of this update with a general loss function is

γ_i^[3](k+1) = γ_i^[3](k) + μ_γ (∂σ/∂e_i(k)) Ψ_i'(u_i^[3](k)) u_i^[3](k) / γ_i^[3](k)    [Output layer, neuron i, general loss function]

γ_i^[2](k+1) = γ_i^[2](k) + μ_γ [ Σ_{p=1}^{n_3} δ_p^[3] w_{pi}^[3] ] Ψ_i'(u_i^[2](k)) u_i^[2](k) / γ_i^[2](k)    [Second hidden layer, neuron i]

γ_i^[1](k+1) = γ_i^[1](k) + μ_γ [ Σ_{p=1}^{n_2} δ_p^[2] w_{pi}^[2] ] Ψ_i'(u_i^[1](k)) u_i^[1](k) / γ_i^[1](k)    [First hidden layer, neuron i]

The local errors are those already calculated in the update of the synaptic weights, which makes this extension cheap with regard to the extra computation needed.
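The slope update reuses quantities that are already available. The Python sketch below (illustrative only; a single bsg output neuron with a made-up input, target and step size) applies the rule γ(k+1) = γ(k) + μ_γ δ u / γ.

    import numpy as np

    def adapt_slope(gamma, delta, u, mu_gamma=0.1):
        """gamma(k+1) = gamma(k) + mu_gamma * delta * u / gamma,
        reusing the local error delta already computed for the synaptic weight update."""
        return gamma + mu_gamma * delta * u / gamma

    # Single bsg output neuron y = tanh(gamma * u), repeatedly shown one (u, d) pair
    gamma, u, d = 1.0, 0.8, 0.95
    for _ in range(2000):
        y = np.tanh(gamma * u)
        delta = (d - y) * gamma * (1.0 - y ** 2)    # local error e * dPsi/du used for the weight update
        gamma = adapt_slope(gamma, delta, u)
    print(gamma, np.tanh(gamma * u))                # gamma has grown, pushing the output toward d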

5.4 Adaptation of the learning rate and Mixing of input patterns

Besides the addition of a momentum term there is another way to accomplish rapid convergence while keeping the oscillations small in error surface minima: an adaptation of the learning rate (step size). In a batch update system it is easier to keep an effective magnitude of the learning rate, since there is more information about the learning in the global error function than there is in the local error function. More differences between the two will be discussed in section 6. In an on-line update system there is a simple rule which can control the magnitude of the learning parameter. The update of the weights will be as before but with a variable learning parameter,

Δw(k) = -η(k) ∂J_k/∂w,   with   η(k) = min( (J_k - J_0) / ||∂J_k/∂w||², η_max ),

where η_max is the maximum learning rate and J_0 is a small offset error. These values will differ with the application. It can be understood intuitively that when the error surface is at a high level the numerator of the expression will be large, and therefore the step size will be large. The norm of the gradient in the denominator will be small near a minimum, which also increases the magnitude. If, on the other hand, we are at a low level of the error surface, the expression will set the step size to a lower value. It is critical that the offset error is set to what is believed to be the smallest error possible to obtain; thus we need some a priori knowledge about the task to be solved. This simple scheme has not been very successful on my empirical material, so some modification would be necessary. In this thesis no additional effort has been spent on this issue, see further development, section 8. The learning parameter can also be updated uniquely for every neuron, and some work has been done in this field [5]. If trying to accomplish this in an on-line update system, there will be a great risk of instability.

Input patterns that are presented in some kind of order can disturb the learning process if no adaptation during epochs is desired. This follows from the fact that the instantaneous error is minimized and not the global error. The algorithm can sometimes perform better when minimizing the local error rather than the global one, but if the application demands that the adaptation is shut off during use, this will lead to a problem: the weights given by the last update in the learning phase will then perform badly when facing the problem of minimizing the total error. This problem arises only in on-line update, since the minimization in a batch system is always done on the global error. The solution is simply to present the input/output patterns in a random order, where the order is changed for every epoch. In section 7.1 an example which points this out is presented.
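One possible reading of the step-size rule above, together with the per-epoch mixing of patterns, is sketched below in Python; the offset error J_0, η_max and the example numbers are arbitrary illustrative values.

    import numpy as np

    def adaptive_eta(J_k, grad, J0=1e-3, eta_max=0.5):
        """eta(k) = min((J_k - J0) / ||grad J_k||^2, eta_max); J0 is the assumed smallest reachable error."""
        g2 = float(grad @ grad)
        if g2 == 0.0:
            return 0.0
        return float(np.clip((J_k - J0) / g2, 0.0, eta_max))

    # Two hypothetical situations: a high error level gives a large (capped) step,
    # a low error level gives a small step.
    print(adaptive_eta(J_k=5.0,  grad=np.array([2.0, 1.0])))
    print(adaptive_eta(J_k=0.01, grad=np.array([0.5, 0.2])))

    # Mixing of input patterns: a new random presentation order for every epoch
    rng = np.random.default_rng(0)
    patterns = np.arange(10)
    for epoch in range(3):
        order = rng.permutation(patterns)   # present the pattern pairs in this shuffled order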

5.5 A combined stochastic-deterministic weight update

When optimizing a function by following the trajectories of differential equations, the global minimum is not likely to be found. By investigating the neural cells of living creatures it has been found that they are essentially stochastic in nature. This can be employed in an artificial environment by introducing a stochastic term in the function which is to be minimized. Considering a general function J(w) and minimizing it with a stochastic term added gives the following problem:

min_w J̃(w) = min_w ( J(w) + c(t) Σ_{i=1}^{n} w_i N_i(t) ),

where n is the number of variables in the vector w, N is a vector of independent random variables (ideally white noise) and c(t) is a parameter controlling the magnitude of the noise. c(t) should approach zero as time goes to infinity. In this approach c(t) has been chosen as an exponentially decreasing function,

c(t) = β e^{-αt},

where β and α are chosen to fit the application. In this thesis the starting value β is chosen as an absolute value, and the rate of noise damping α is adjusted as a relative time such that the magnitude of the function c(t) has decreased by 90 percent at the end of all training presentations. The resulting differential equation becomes

dw/dt = -η ( ∂J_k(w)/∂w + c(t) N(t) ).

In an on-line update system there is naturally a stochastic part in the update quantities, since no averaging is done during pattern presentations, so the magnitude of c(t) should be chosen with care in order to avoid divergence.
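A Python sketch of the combined stochastic-deterministic update on a toy quadratic cost is given below (illustrative only); β, η and the decay horizon are made-up values, with α chosen so that c(t) has decayed by 90 percent at the end of training, as described above.

    import numpy as np

    rng = np.random.default_rng(0)

    def noise_scale(k, total_steps, beta=0.1):
        """c(t) = beta * exp(-alpha * t), with alpha chosen so that c has decayed
        by 90 percent at the end of all training presentations."""
        alpha = np.log(10.0) / total_steps
        return beta * np.exp(-alpha * k)

    # Combined stochastic-deterministic update on a toy quadratic J(w) = 0.5 * ||w - a||^2
    a = np.array([1.0, 2.0, 3.0])
    w, eta, total = np.zeros(3), 0.05, 1000
    for k in range(total):
        grad = w - a                                     # deterministic gradient part
        noise = rng.standard_normal(3)                   # ideally white noise N(t)
        w = w - eta * (grad + noise_scale(k, total) * noise)
    print(np.round(w, 2))                                # close to a; the injected noise fades away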

6. Comparison of a Batch update system with an On-Line Update system

There are some differences between a batch system and an on-line system which must be outlined. Both have some advantages and, of course, some disadvantages. Here some key differences will be presented.

- The on-line approach has to be used if all training examples are not available before the learning starts and an adaptation to the actual (on-line) stream of input/output patterns is desired.
- The on-line learning algorithm is usually more efficient with regard to computational and memory requirements when the number of training patterns is large.
- The on-line procedure introduces some randomness (noise) that often may help in escaping from local minima.
- Usually the on-line algorithm is faster and more efficient than the batch procedure for large-scale classification problems. This follows from the fact that many training examples possess redundant information and give approximately the same gradient contributions, so waiting to update the weights until after a whole epoch is a waste of time.
- For high-precision mapping the batch procedure can give better results, since more sophisticated methods for optimization can be used (for instance Newton's method).

7. Empirical Results

7.1 Approximation of sin(x) with two neurons in hidden layer

In this first test the input signals are real-valued scalars in the set X = [0, π]. The desired (or target) signal is sin(x). The inner structure of the network is:

Number of layers: 2
Hidden neurons: 2 bipolar sigmoid functions
Output neuron: 1 linear function (slope of unity)
Biases: Present for all neurons
Momentum term: Not used
Error measure: Euclidean (L2-norm)
Adaptive slope: Used, step size .5
Adaptation of learning rate: Used
Stochastic term: Not used
Mixing of inputs: Used, random order for every epoch
Number of input patterns: equally distributed in the set X

Results:
Resulting slopes: bsg1 .3956, bsg2 .945
SSE (Sum Squared Error) for last epoch: .8

Figure 7.1.1 shows the absolute error for all input/output pairs during the last epoch.

Fig 7.1.1 The instantaneous error for the last epoch in example 7.1

The match between the actual outcome and the true value sin(x) is shown in figure 7.1.2.

Fig 7.1.2 The actual outcome and the true value sin(x) in example 7.1

It is critical in this example to mix the inputs in order to find the solution that minimizes the global error function. The inputs are otherwise so well ordered that it is easier for the network to approximate a straight line and then update the slope of this line during the epochs. The match in this case can be very good (depending on the magnitude of the step size). In this problem the match can probably not be better than the one which minimizes the global error. This follows because the algorithm probably has enough degrees of freedom to exactly imitate the trigonometric function sin(x). I say probably because the best possible result for a Neural Network cannot easily be proven. This is in fact one of the most important drawbacks of Neural Networks, and the reason why they are of no interest in some applications.

7.2 Separation of signals with noise

In this example the input signal consists of three simple signals added together, with a noise component also added. The task is to separate these three signals from the sum. In figure 7.2.1 the first plot shows the three components and the second plot shows the input signal. The third plot shows the separation after training the network.

Number of layers: 2
Hidden neurons: 5 bipolar sigmoid functions
Output neurons: 3 linear functions (slope of unity)
Filter length: samples
Biases: Present for all neurons
Momentum term: Not used
Error measure: Euclidean (L2-norm)
Adaptive slope: Used
Adaptation of learning rate: Used
Stochastic term: Not used
Mixing of inputs: Used, random order for every epoch
Number of input patterns: 6
Number of epochs: 4

Results:

Resulting slopes: bsg1 6.46, bsg2 .4587, bsg3 .48, bsg4 .6754, bsg5 .465
SSE (Sum Squared Error) for last epoch: 3.677

Fig 7.2.1 The true components, the input signal and the resulting separation

This example shows that even signals with abrupt drops and rises can be separated. Here the hidden layer must provide satisfactory information for all three output neurons simultaneously. Studying the resulting slopes, one can realize that this solution could not be accomplished with slopes fixed at unity, which is usually the case. Of course this may not be the best solution, but slopes of high magnitude give the neuron a shape like the hard limiter, and this is well suited for signals with abrupt drops and rises (such as the square signal).

7.3 Processing of EEG-signals

This example is taken from real data sampled at the University of Linköping. The inputs are two EEG signals sampled from a person while the person is closing and opening his/her eyes at pre-specified time frames. The aim is, by looking only at the EEG signals, to determine whether or not the person has his eyes open. The true logical value (looking/not looking) from one patient is used to train the network, and input signals from another patient are used to validate the result. The complexity of this problem is unknown. The EEG signals were sampled at an 8 Hz sample rate. The structure chosen is:

Number of layers: 2
Hidden neurons: 3 bipolar sigmoid functions
Output neuron: 1 linear function (slope of unity)
Filter length: samples
Biases: Present for all neurons
Momentum term: Not used

Error measure: Euclidean (L2-norm)
Adaptive slope: Used
Adaptation of learning rate: Used
Stochastic term: Not used
Mixing of inputs: Used, random order for every epoch
Number of input patterns: 8
Number of epochs: 4

Results:
Resulting slopes: bsg1 .569, bsg2 .949, bsg3 .585

The input EEG signals are shown in figure 7.3.1. The information in the input signals is quite obvious and the classification problem is not too complex.

Fig 7.3.1 The EEG signals fed to the network

Fig 7.3.2 The Neural Network output for the EEG signals and the desired state

It can be seen from figure 7.3.2 that some spiky effects occur. This probably comes from the fact that the input signal is widely spread. The spikes can be reduced by processing the output of the neural network with a median filter, as shown in figure 7.3.3.

Fig 7.3.3 The Neural Network output of example 7.3, filtered with a median filter, and the desired state

The output does not follow the desired signal exactly, and this is probably because the desired signal differs from the real events. This difference could be caused by the difficulty of measuring exactly whether or not the person has his eyes open.

7.4 Echo-canceling in a Hybrid

In a telephone system, hybrids are used to separate speech in one direction from speech in the other direction. This is done so that amplifiers can be used to compensate for losses in the wiring. These hybrids can be regarded as converters from a 2-wire system to a 4-wire system, in which the directions are separated. Unfortunately, the separation is not perfect and some of the speech in one direction is induced into the other. This phenomenon will be apprehended as an echo from the user's point of view. Here the aim is to use a neural network to remove this induction (echo) by subtracting the echo from the channel. The training has been done with white noise as input and the output from the hybrid as the desired signal. Figure 7.4.1 shows the unprocessed signal and the difference between the unprocessed and the processed signal. Approximately 9 dB damping of the echo is achieved. True speech as input would probably give a better result, since the character of speech is mostly at low frequencies, and low-frequency parts are damped more here than higher frequencies.

Number of layers: 2
Hidden neurons: 5 bipolar sigmoid functions
Output neuron: 1 linear function (slope of unity)
Filter length: samples
Biases: Present for all neurons
Momentum term: Not used
Error measure: Euclidean (L2-norm)
Adaptive slope: Used, step size .3
Adaptation of learning rate: Used
Stochastic term: Not used
Mixing of inputs: Used, random order for every epoch
Number of epochs: 5

Fig 7.4.1 Damping of the hybrid echo [dB], with white noise as input

7.5 Speech enhancement for Hands-Free mobile telephone set

In a hands-free mobile telephone set used in a car, many sources of disturbance are present. The speech coming from the hands-free speaker will be heard in the hands-free microphones. This will be apprehended as an echo by the user at the far end of the communication. Noise from the engine and wind friction will cause degraded recognition of the speech coming from the person using the set. An approach to solve this problem and to enhance the speech is presented in [6]. The solution relies on the fact that hearing can be made directional by using a microphone array, and this gives the algorithm used a foundation to solve the problem. In [6] linear methods were used. In this thesis the sampled data and the structure proposed by [6] will be used; the difference here is that the linear model is replaced by a neural network. The structure of the chosen network is:

Number of layers: 2
Hidden neurons: bipolar sigmoid functions
Output neuron: 1 linear function (slope of unity)
Filter length: 56 samples
Biases: Present for all neurons
Momentum term: Not used
Error measure: Euclidean (L2-norm)
Adaptive slope: Used/Not used
Adaptation of learning rate: Used
Stochastic term: Not used
Mixing of inputs: Used, random order for every epoch
Number of input patterns: 48*6 (six microphones in the array)
Number of epochs: 3

Results:
SSE (Sum Squared Error) for last epoch: 3.3839 with no adaptation of slopes, 3.8 with adaptation of slopes

Figure 7.5.1 shows the sum squared error (SSE) versus training epochs without any adaptation of the slopes. Figure 7.5.2 shows the SSE versus training epochs with adaptation of the slopes.

Fig 7.5.1 SSE versus epochs without adaptive slopes

Fig 7.5.2 SSE versus epochs with adaptive slopes

There are no major differences between the network with and without adaptation of the slopes. Examining the resulting slopes shows that they all stay close to unity, so slopes fixed at unity are probably good values. Figure 7.5.3 shows the unprocessed signal in the first half of the plot and the enhanced signal in the second half. The unprocessed signal consists of true speech in the beginning and the echo at the end. The signal has also been degraded with noise. For a detailed description of the preliminaries see [6]. Figure 7.5.4 shows the same as figure 7.5.3, but with slope adaptation.

Fig 7.5.3 Results using no adaptation of the slopes [dB]

Fig 7.5.4 Results using adaptation of the slopes [dB]

Comparison with the linear system: The damping is approximately 5 dB for noise and about 6 dB for echo. It should be mentioned that further training would probably give better results. In the results of [6] the damping of both noise and echo is less than this, and the distortion of the speech signal is quite noticeable. When listening to the resulting signal here, no actual distortion is noticed. This is a major advantage of using neural networks compared to the linear algorithm used in [6]. The network task here has been to solve both the noise and the echo problem simultaneously. If only one of these tasks is facing the network, better results can be expected, see [6].

8. Further Development

The network adopted in this thesis can be improved in some ways. The adaptation of the step size described in section 5.4 has proven to be inadequate for keeping the oscillations of the weights small in error surface minima. This may be due to the naturally arising stochastic nature of an on-line update system. The norm of the gradient follows this stochastic nature and will therefore maintain the oscillations rather than damp them. Some kind of filtering of the instantaneous error and/or the gradient norm could perhaps give the step size update a damped nature. Of course only past experience can be used in this filtering. Choosing the offset bias for this step size update is a delicate task. This bias would probably give better results if decreased gradually during learning. Some initial value must still be chosen, but this task is not so critical.

The effect of the combined stochastic-deterministic weight update described in section 5.5 is almost negligible in an on-line update system, due to the already stochastic nature of the updates. The scheme described in section 5.5 can be modified by replacing the function c(t) with an annealing schedule. The idea is to add some statistical information from the task and adjust c(t) appropriately. The concept is taken from the field of mechanics. This approach, of course, will give unique solutions depending on the application.

The adding and deletion of neurons during learning (pruning), as described in section 4.2, can be adopted in order to minimize the needed a priori knowledge of the task to be solved.