2. FEEDFORWARD NETWORKS

In retrospect, it was not until astonishingly late that methods for training perceptrons in several layers, multilayer perceptrons, became known. Such methods had been proposed by Bryson and Ho (1969), Werbos (1974), Parker (1985) and Le Cun (1985), but the real breakthrough was not experienced until the publication of Rumelhart's and McClelland's books (1986) and a paper in Nature (1986), where the back-propagation (or generalized delta-rule) method was proposed. A reason for this was the fact that gradient-based methods were not applicable for training as long as a Heaviside activation function of the nodes was used. Hopfield, inspired by findings in biological systems, however in 1982 presented a network configuration (today called the Hopfield network) where the output signals from the nodes show a graded response to the inputs. The sigmoid

    σ(a) = 1 / (1 + e^(-a))                                                 (2.1)

that Hopfield introduced was to become the most common activation function in neural networks.

Exercise: Draw the sigmoid and study its asymptotic behavior.

[Figure: Multi-layer feedforward network]

2.1 Central activation functions

Linear: ADALINE has already illustrated the behavior of a linear element with y = a.

Heaviside: The action of this activation function can be understood by rewriting the net input as

    a = Σ_{i=1..N} w_i x_i + w_0 = ã + w_0

Thus, the bias w_0 can adjust the location of the step, since w_0 acts as a threshold.

Sigmoid: Like in the Heaviside function, the bias w_0 determines the location of the transition point of the sigmoid (see Eq. 2.1). If a gain, β, is included in the equation,

    y = 1 / (1 + e^(-βa))                                                   (2.2)

it is also possible to adjust the steepness of the curve; β → ∞ yields a sigmoid approximating the Heaviside function, while a small β yields a practically linear input-output relation. Note! These adjustments of the curve can be carried out by the weights, since β can be absorbed by w. If the node output is to vary between y_min and y_max one may use

    y = y_min + (y_max - y_min) / (1 + e^(-a))                              (2.3)

The basic sigmoid (2.1) is a special case of this more general formulation, with y_min = 0 and y_max = 1. Another interesting case is obtained for y_min = -1 and y_max = 1,

    y = (1 - e^(-a)) / (1 + e^(-a)) = tanh(a/2)                             (2.4)

An advantage of the symmetrical activation function is that the output can take on the value zero without requiring the node to saturate.

Exercise: Draw a sigmoid network approximating f(x) = 5 for x < 4 and f(x) = … for x ≥ 4.

Ramp: A linear ramp can be implemented by restricting the output signal to, e.g., [0, 1], by

    y = min(1, max(0, a))                                                   (2.5)
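
As a complement to the exercises above, the central activation functions can be summarized in a few lines of code. The sketch below (in Python) is illustrative only and not part of the original material; the function names are arbitrary.

    import numpy as np

    def sigmoid(a, beta=1.0):
        """Eq. (2.2): logistic sigmoid with gain beta; beta = 1 gives Eq. (2.1)."""
        return 1.0 / (1.0 + np.exp(-beta * a))

    def scaled_sigmoid(a, y_min=0.0, y_max=1.0):
        """Eq. (2.3): sigmoid with output range [y_min, y_max]."""
        return y_min + (y_max - y_min) / (1.0 + np.exp(-a))

    def symmetric_sigmoid(a):
        """Eq. (2.4): the case y_min = -1, y_max = 1, equal to tanh(a/2)."""
        return (1.0 - np.exp(-a)) / (1.0 + np.exp(-a))

    def ramp(a):
        """Eq. (2.5): linear ramp restricted to [0, 1]."""
        return np.minimum(1.0, np.maximum(0.0, a))

    a = np.linspace(-6.0, 6.0, 5)
    print(sigmoid(a), symmetric_sigmoid(a), ramp(a), sep="\n")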

2.2 Some training algorithms

In the following, some training algorithms for feedforward neural networks are described. A more detailed summary is given in, e.g., Hertz et al. and Battiti.

2.2.1 Back-propagation (BP)

This training method is illustrated in the form of the equations for a network with one layer of hidden neurons. It is quite easy to generalize the results to M > 1 hidden layers. Consider minimization of the loss function in (1.10),

    E = ½ Σ_i e_i² = ½ Σ_i (t_i - y_i)²

where the iteration index for the sake of clarity has been omitted (or replaced by the pattern index p). All weights and variables dependent on them, such as net inputs, internal and external outputs, etc., will in what follows be assumed to be calculated at this moment. Introduce the notation i for output nodes, j for hidden nodes and k for input signals (i.e., input nodes). Since

    y_i = σ(a_i),   a_i = Σ_j w_ij y_j                                      (2.6)

and

    y_j = σ(a_j),   a_j = Σ_k w_jk x_k                                      (2.7)

the network's outputs are given by

    y_i = σ( Σ_j w_ij σ( Σ_k w_jk x_k ) )                                   (2.8)

The partial derivative of the loss function with respect to w_ij is given by the chain rule as

    ∂E/∂w_ij = -e_i ∂y_i/∂w_ij = -(t_i - y_i) σ'(a_i) y_j = -δ_i y_j,
    with δ_i = (t_i - y_i) σ'(a_i)                                          (2.9)

If the weight update is given by

    Δw_ij = -η ∂E/∂w_ij = η δ_i y_j                                         (2.10)

we obtain the Widrow-Hoff delta rule, except for the derivative term σ'(a_i). In an ADALINE σ(a) = a, so σ'(a) = 1.

Exercise: Write the derivative of the sigmoid, σ'(a), as a function of the node's output only.

Exercise: Derive a corresponding term for the hidden-layer weights w_jk.

Generally, one can write for the weights w_lm between two successive layers, whose nodes have been labeled m and l,

    Δw_lm = η δ_l y_m                                                       (2.13)

The correction of the weights is thus obtained by propagating the error, δ, backward from the output layer to the input layer; this gives the method its name, error back-propagation.

Pros and cons:
+ Computationally efficient to calculate gradient and weight update
+ Well suited for parallel implementation: uses only local information
- Slow convergence

[Figure: The sigmoid and its derivative]
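
To make Eqs. (2.6)-(2.10) concrete, the following sketch performs batch back-propagation epochs for a network with one hidden layer of sigmoids and sigmoid output nodes (bias weights are omitted for brevity). It is a minimal illustration, not code from the notes; the names and the toy data are chosen for the example.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def bp_epoch(X, T, W_jk, W_ij, eta=0.02):
        """One batch weight update according to Eqs. (2.6)-(2.10).
        X: (P, K) inputs, T: (P, I) targets, W_jk: (J, K), W_ij: (I, J)."""
        A_j = X @ W_jk.T                        # hidden net inputs, Eq. (2.7)
        Y_j = sigmoid(A_j)                      # hidden outputs
        A_i = Y_j @ W_ij.T                      # output net inputs, Eq. (2.6)
        Y_i = sigmoid(A_i)                      # network outputs, Eq. (2.8)
        D_i = (T - Y_i) * Y_i * (1.0 - Y_i)     # delta_i, Eq. (2.9), with sigma'(a) = y(1 - y)
        D_j = (D_i @ W_ij) * Y_j * (1.0 - Y_j)  # deltas propagated back to the hidden layer
        W_ij = W_ij + eta * D_i.T @ Y_j         # Eq. (2.10), summed over the patterns
        W_jk = W_jk + eta * D_j.T @ X
        return W_jk, W_ij, 0.5 * np.sum((T - Y_i) ** 2)

    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, (50, 2))
    T = sigmoid(2.0 * X[:, :1] - X[:, 1:])      # a smooth toy target
    W_jk = rng.normal(0.0, 0.5, (4, 2))
    W_ij = rng.normal(0.0, 0.5, (1, 4))
    for epoch in range(2000):
        W_jk, W_ij, E = bp_epoch(X, T, W_jk, W_ij)
    print("loss after 2000 epochs:", E)

Bias weights could be included by appending a constant input of one to x and to the hidden-layer outputs.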

A large number of modifications of the BP have been proposed to enhance the convergence. An early invention was to include an extra momentum term in the weight update,

    Δw_ij = η δ_i y_j + α Δw_ij(old)                                        (2.14)

where the second term on the RHS includes the previous weight change and α is the momentum parameter (typically chosen close to, but below, one). One can show that this modification of the updating resembles the one that is given by the method of conjugate gradients. The efficiency of the search is, however, often strongly dependent on the choice of η and α.

Observe that the BP equations can also be applied incrementally, to update the weights after every presented pattern, i.e., the summation over the patterns in the weight-update equations can be dropped. An advantage of this is that the search directions get a stochastic contribution by which local optima may be avoided, since a temporal increase in E is allowed.

2.2.2 Conjugate gradients (CG)

Optimization methods based on conjugate gradients are well suited for training the weights of neural networks, since they build an estimate of the Hessian matrix (of the second-derivative terms) without large storage or computational requirements. The CG method applies the update

    Δw_k = η_k d_k                                                          (2.16)

where the search direction is given by

    d_{k+1} = -g_{k+1} + α_k d_k                                            (2.17)

starting from d_0 = -g_0, i.e., from the negative gradient at the starting point w_0. The scalar α_k is given by (Fletcher and Reeves 1964)

    α_k = (g_{k+1}ᵀ g_{k+1}) / (g_kᵀ g_k)                                   (2.18)

or by (Polak and Ribière 1969)

    α_k = (g_{k+1}ᵀ (g_{k+1} - g_k)) / (g_kᵀ g_k)                           (2.19)
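
For a quadratic loss, the conjugate-gradient iteration (2.16)-(2.18) can be written out with an exact line search; for a network loss, η_k would instead come from a numerical line search and the gradient from back-propagation. The code below is a sketch for illustration only.

    import numpy as np

    def cg_quadratic(A, b, w0, n_iter=10):
        """Minimize E(w) = 0.5 w'Aw - b'w with Fletcher-Reeves conjugate gradients."""
        w = np.array(w0, dtype=float)
        g = A @ w - b                             # gradient of E
        d = -g                                    # d_0 = -g_0
        for k in range(n_iter):
            eta = -(g @ d) / (d @ A @ d)          # exact line search along d
            w = w + eta * d                       # Eq. (2.16)
            g_new = A @ w - b
            if np.linalg.norm(g_new) < 1e-12:
                break
            alpha = (g_new @ g_new) / (g @ g)     # Fletcher-Reeves, Eq. (2.18)
            d = -g_new + alpha * d                # Eq. (2.17)
            g = g_new
        return w

    A = np.array([[3.0, 0.2], [0.2, 1.0]])
    b = np.array([1.0, -2.0])
    print(cg_quadratic(A, b, [0.0, 0.0]))         # approaches the solution of A w = b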

Pros and cons:
+ Converges more rapidly; may show quadratic convergence
+ No ad-hoc parameter values to be set
- Requires more computational effort per iteration than BP
- Higher storage requirements than BP

2.2.3 The Levenberg-Marquardt method (LM)

A method for parameter estimation in nonlinear systems (e.g., in nonlinear regression) was presented by Levenberg (1944) and Marquardt (1963). This trust-region method applies an interpolation between the directions of steepest descent and Newton's method in the search, which makes it a robust procedure. Applied to neural network training, where the parameters are the weights, w, the changes in the weights, Δw, are solved from

    (J_kᵀ J_k + μ I) Δw_k = -J_kᵀ e_k                                       (2.20)

where J is the Jacobian (i.e., the matrix of partial derivatives of e with respect to the weights), e is the residual vector with elements e_i (cf. Eq. 1.10), and μ is a parameter that is adjusted during the search.

A weakness of all methods proposed thus far is that they are only suited to find a local optimum and not the globally best solution. This problem can be somewhat resolved by starting from several different randomized initial weights, but to tackle the problem rigorously, stochastic methods (see below) are to be preferred.
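
The Levenberg-Marquardt step (2.20) is a single linear solve per iteration. The sketch below uses a toy curve-fitting problem and keeps the damping parameter μ fixed for brevity; in practice μ is adapted during the search, and the Jacobian of a network would be obtained by back-propagation. All names and the example problem are illustrative assumptions.

    import numpy as np

    def lm_step(w, residuals, jacobian, mu):
        """Solve (J'J + mu I) dw = -J'e, Eq. (2.20), and return the updated parameters."""
        e = residuals(w)
        J = jacobian(w)
        H = J.T @ J + mu * np.eye(len(w))
        dw = np.linalg.solve(H, -J.T @ e)
        return w + dw

    # Toy nonlinear least-squares problem: fit y = exp(w0 * x) + w1
    x = np.linspace(0.0, 1.0, 20)
    y = np.exp(0.7 * x) + 0.3
    residuals = lambda w: (np.exp(w[0] * x) + w[1]) - y
    jacobian = lambda w: np.column_stack([x * np.exp(w[0] * x), np.ones_like(x)])
    w = np.array([0.0, 0.0])
    for k in range(20):
        w = lm_step(w, residuals, jacobian, mu=1e-2)
    print("estimated parameters:", w)             # approach (0.7, 0.3)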

2.2.4 Simulated Annealing (SA)

A method for global optimization, which at least partly can avoid getting trapped in local minima, was developed in the 1950s by Metropolis et al., based on an analogy between the mathematical minimization of a loss (or energy) function and the minimization of the energy of a physical system that takes place when a material is annealed. In the latter system, the movement of the atoms is considerable at high temperature, making practically any state occur, while at lower temperatures the system strives to organize itself into a minimum-energy state. The distribution of the states is given by

    P(E) ∝ e^(-E / (k_B T))                                                 (2.21)

where P(E) is the probability of finding the system in an energy state E, k_B is Boltzmann's constant and T is the thermodynamic temperature. Based on this, a search method can be formulated (Kirkpatrick et al. 1983). Introduce an artificial temperature, T, that follows an a priori fixed program in which its value decreases with time, and a constant c:

1. Initialize the system: set k = 0 and generate a random starting point, w_0.
2. Calculate the value of the objective function, E_k = E(w_k).
3. Adjust the weights randomly (e.g., only one weight, w_i, in turn) to give w_{k+1}.
4. Calculate E_{k+1} and ΔE_k = E_{k+1} - E_k.
5. Accept the new point and set k ← k + 1 with the probability

    prob = e^(-ΔE_k / (cT))   if ΔE_k ≥ 0
    prob = 1                  if ΔE_k < 0                                   (2.22)

6. Test for convergence; if the test fails, go to 2.

[Figure: Search for a global minimum of a function with local minima]
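
Steps 1-6 can be turned into code almost directly; the geometric cooling, the Gaussian step and the one-dimensional test function below are illustrative choices only (cf. Eqs. (2.22)-(2.25)).

    import numpy as np

    def simulated_annealing(E, w0, T0=1.0, c=1.0, gamma=0.9, n_iter=5000, step=0.5, seed=0):
        """Minimize E(w) with the acceptance rule of Eq. (2.22)."""
        rng = np.random.default_rng(seed)
        w = np.array(w0, dtype=float)
        E_w, T = E(w), T0
        best_w, best_E = w.copy(), E_w
        for k in range(n_iter):
            w_new = w + step * rng.normal(size=w.shape)      # random weight adjustment
            dE = E(w_new) - E_w
            if dE < 0 or rng.random() < np.exp(-dE / (c * T)):
                w, E_w = w_new, E_w + dE                     # accept the new point
                if E_w < best_E:
                    best_w, best_E = w.copy(), E_w
            if k % 100 == 99:
                T = gamma * T                                # slow cooling, cf. Eq. (2.24)
        return best_w, best_E

    # A one-dimensional loss with many local minima
    E = lambda w: float(w[0] ** 2 + 10.0 * np.sin(3.0 * w[0]))
    print(simulated_annealing(E, [4.0]))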

Some central problems that have to be tackled in the implementation of the algorithm are the following.

The temperature program: The artificial temperature should be reduced slowly enough to prevent the search from being trapped in local minima. It has been shown that T should be inversely proportional to the logarithm of the time, e.g.,

    T_k = T_0 / log(1 + k)                                                  (2.23)

but this requires a large number of iterations. Instead, more practical choices are made, e.g.,

    T_k = γ T_{k-1};   γ ≈ 0.8, …                                           (2.24)

Starting point: It is often motivated to start from several different points, w_0.

Search direction and step length: The former may be assumed to be of Gaussian distribution,

    P(Δw_k) ∝ e^(-|Δw_k|² / T_k)                                            (2.25)

Another choice is to make the step length proportional to the temperature (Wasserman). The search direction, in turn, can be generated by randomly selecting a coordinate direction, i, and modifying w_i. Alternatively, all w_i may be visited in turn to guarantee that all weights are changed.

2.3 Important practical aspects

Choice of inputs and outputs: These are the most important matters in statistical model building! Often, 70 % of the modeling time may be spent on selection and preprocessing of inputs and outputs.

A proper scaling of the inputs and outputs is essential for facilitating a good numerical solution of the problem. A scaling procedure often applied is

    x̃_i = (x_i - x̄_i) / s_i    or    x̃_i = (x_i - x_i,min) / (x_i,max - x_i,min)     (2.26a,b)
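
Eq. (2.26a,b) in code, assuming that the data matrix X holds one observation per row:

    import numpy as np

    def scale_standard(X):
        """Eq. (2.26a): subtract the column means and divide by the column standard deviations."""
        return (X - X.mean(axis=0)) / X.std(axis=0)

    def scale_minmax(X):
        """Eq. (2.26b): map each column linearly onto [0, 1]."""
        X_min, X_max = X.min(axis=0), X.max(axis=0)
        return (X - X_min) / (X_max - X_min)

    X = np.array([[1.0, 200.0], [2.0, 400.0], [4.0, 300.0]])
    print(scale_standard(X))
    print(scale_minmax(X))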

Choice of network structure: Because of the large number of weights typically used in neural network models, Runge effects often occur: as the model complexity (e.g., the number of hidden nodes) grows, the fit on the training data becomes better and better, but the model's performance on novel data (not seen during training) gets worse after some point, since the model starts to fit noise and outliers in the training set.

[Figure: Overfitting and overtraining: prediction error E on training and test data as a function of model complexity / number of iterations]

In summary, there is a need for criteria for the selection of network complexity. Even though procedures for such statistical testing have been proposed at an early stage in the literature (Baum and Haussler 1989, Moody 1992, Kendall and Hall 1992), it seems that no completely general criteria exist. It should also be noted that there is not even consensus on such tests for selection among linear models, even though certain test criteria are commonly applied, such as the FPE (final prediction error) and AIC (information criterion A) (Akaike 1970). These estimate the model's prediction error on novel data and are, therefore, especially useful in cases where the number of observations (data rows) is limited.

A way to tackle the problem is to use
- Constructive algorithms that start from a small network and, if needed, add hidden nodes (e.g., Fahlman and Lebiere 1990, Frean 1991)
- Destructive algorithms that start from a large trained network and prune unnecessary connections (e.g., Mozer and Smolensky 1989, Chauvin 1989, Le Cun et al. 1990, Karnin 1990, Burkitt 1991, Finnoff et al. 1993)
- Simultaneous optimization of weights and structure. Even though mathematical programming can be used to tackle the problem in theory, the nonlinearity of the problem and the comparatively large number of integer variables make the task extremely challenging. Recently, genetic and evolutionary programming has been applied by several investigators to tackle the problem (Fogel et al. 1990, Maniezzo 1994, Gao et al. 1999, Yen and Lu).

The criteria applied for deciding whether to increase or decrease the network complexity are often rather ad hoc, even though most have a background in statistics.

Yet another alternative, which today has become a routine task in neural network modeling, is to study the generalization performance during training by following the model's behavior on a set of novel, independent data points (Weigend et al. 1990, Finnoff et al. 1993), and to stop training when the test set errors are at their minimum. A typical evolution of the errors was schematically depicted in the previous figure; the increase experienced in the test error is the result of overtraining. It is possible to show (Ljung and Sjöberg 1992) that such training interruption is equivalent to using regularization terms in the loss function (see below). This may serve to explain why clearly oversized networks commonly reported in the literature are still able to produce models that generalize well: as the networks are trained by inefficient methods (such as back-propagation), overfitting is avoided, since a considerable part of the expressional capacity remains unused. This is analogous to using a smaller, well-trained network (Weigend et al.).

Choice of loss function: A desired feature of the loss function, E, that is minimized during training is that it shows a theoretical minimum of zero when the model has learned to produce all outputs perfectly well. The squared Euclidean norm (square sum of errors), Eq. 1.10, that is often used satisfies this requirement, but there is usually no theoretical justification for this choice. Sometimes the network's task is to classify or categorize data, and in such cases the ith output, y_i, should give the probability that the input vector, x, corresponds to attribute i. Then, it may be motivated to use a relative entropy as a measure of the fit (Solla et al.), given by

    E = Σ_i [ t_i ln(t_i / y_i) + (1 - t_i) ln((1 - t_i) / (1 - y_i)) ]     (2.28)

where the pre-logarithmic terms stem from conditional probabilities. This loss function has some nice properties: it diverges if the nodal output is saturated in the wrong end of the sigmoid, and quite small errors for t_i ≈ 0 affect the loss function considerably. Moreover, part of the loss function can be shown to be convex, so it is likely that fewer local minima exist.
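
The relative entropy (2.28) is easy to implement; the clipping below is a practical safeguard added for the example (it keeps the logarithms finite) and is not part of the formula in the notes.

    import numpy as np

    def relative_entropy_loss(T, Y, eps=1e-12):
        """Eq. (2.28) for targets T and network outputs Y in (0, 1)."""
        T = np.clip(T, eps, 1.0 - eps)
        Y = np.clip(Y, eps, 1.0 - eps)
        return np.sum(T * np.log(T / Y) + (1.0 - T) * np.log((1.0 - T) / (1.0 - Y)))

    T = np.array([1.0, 0.0, 1.0])
    print(relative_entropy_loss(T, np.array([0.9, 0.1, 0.8])))   # small loss
    print(relative_entropy_loss(T, np.array([0.1, 0.9, 0.2])))   # outputs at the wrong end: large loss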

Constraining the magnitude of the weights: If large weights should be avoided, a penalty term can be added to the loss function, giving an augmented loss Ẽ. The problem can, e.g., be written as

    min Ẽ = min [ Σ_i (t_i - y_i)² + γ Σ w² ]                               (2.29)

or, in vector form,

    min [ eᵀe + γ wᵀw ]                                                     (2.30)

This gives an extra term in the derivative that is proportional to the value of the weight in question, so the back-propagation update becomes

    w_ij(k+1) = w_ij(k) + η δ_i y_j - η γ w_ij(k) = (1 - ηγ) w_ij(k) + η δ_i y_j     (2.31)

The penalty term in Ẽ can, however, be considered to penalize unreasonably much for large but possibly significant weights, so one may instead use the form

    Ẽ = E + γ Σ w² / (1 + w²)                                               (2.32)

This forces small weights towards zero because of the quadratic penalty, while larger weights may still be allowed. This formulation has been claimed to prune unnecessary connections in the network, but studies on different decay equations have not revealed any clear practical advantage of (2.32) over the formulation in (2.29). Another interesting observation is that the application of a regularization such as (2.29) in the beginning of training by higher-order methods, where the weight estimates are still poor, considerably decreases the probability of getting stuck in clearly local minima. The reason is that node paralysis can be avoided, i.e., situations where nodes saturate at an early stage of the search. The procedure is often called weight decay (cf. Eq. 2.31).
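
The weight-decay update (2.31) differs from the plain back-propagation update only by the shrinkage factor (1 - ηγ); a minimal sketch with illustrative array shapes:

    import numpy as np

    def weight_decay_update(W, delta, Y_prev, eta=0.1, gamma=1e-3):
        """Eq. (2.31): shrink the weights by (1 - eta*gamma) and add the BP correction."""
        return (1.0 - eta * gamma) * W + eta * delta.T @ Y_prev

    rng = np.random.default_rng(1)
    W = rng.normal(size=(1, 4))          # weights of one output node fed by 4 hidden nodes
    delta = rng.normal(size=(50, 1))     # delta_i for 50 patterns
    Y_prev = rng.normal(size=(50, 4))    # hidden-node outputs for the same patterns
    print(weight_decay_update(W, delta, Y_prev))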

2.4 Some areas of application

2.4.1 Function approximation

An obvious application of feedforward neural networks is function approximation and nonlinear regression: networks with a hidden layer of sigmoids and a linear output layer have been shown to be able to approximate arbitrary continuous bounded functions to any accuracy, given enough capacity, i.e., a sufficient number of hidden nodes (Cybenko 1988, 1989, Funahashi). In practice, however, the size of the network must be strictly controlled to consider features of the data sets studied, e.g., to avoid overfitting. The way in which the network acts is quite easy to understand, even though the fine points of how it does the job are more complicated: in regions where the function to be approximated shows large changes, the network allocates nodes, and the steepness of the function will be reflected in the magnitude of the weights. An example of a fit is given below (Haykin).

[Figure: Example of a function approximated by a neural network]

Exercise: Adapt a neural network to y = 0.5 sin(8…) + 4… in the interval (0,1). How many hidden nodes are required to make the fit accurate?
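
The way the network allocates nodes to regions of rapid change is visible directly from the functional form of a network with one hidden layer of sigmoids and a linear output, y(x) = Σ_j v_j σ(w_j x + b_j) + b. The sketch below only evaluates such a sum with hand-picked, illustrative weights (steep sigmoids placed at x = -1, 0 and 1); it is not a fit to any figure in the notes.

    import numpy as np

    def one_hidden_layer_net(x, w, b, v, b_out):
        """y(x) = sum_j v_j * sigmoid(w_j * x + b_j) + b_out (linear output node)."""
        return (v / (1.0 + np.exp(-(np.outer(x, w) + b)))).sum(axis=1) + b_out

    x = np.linspace(-2.0, 2.0, 9)
    w = np.array([8.0, 8.0, 8.0])       # steep sigmoids give sharp transitions
    b = np.array([8.0, 0.0, -8.0])      # transitions located at x = -1, 0 and 1
    v = np.array([0.5, -1.0, 0.5])
    print(one_hidden_layer_net(x, w, b, v, b_out=0.2))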

2.4.2 Principal component analysis

Principal component analysis (PCA) is a mathematical method for analyzing multicomponent data that strives to reduce the dimension of the variable space with minimum loss of significant information. It can be applied to data compression, data mining, fault detection, elimination of redundant information, process monitoring (it describes the actual dimension of the process variables), visualization, etc. Let the measurement information be given by observations of an m-dimensional vector x. PCA sets up f < m orthogonal vectors in the m-dimensional space that describe as much as possible of the variation in the data (see the figure below). The covariance matrix is given by C = E{(x - x̄)(x - x̄)ᵀ}, or simply C = E{x xᵀ} for normalized (zero-mean) data. The first principal component is chosen along the direction of maximum variance, the next one orthogonal to the first and again with maximum variance, and so on. These directions can be shown to correspond to the directions of the eigenvectors, with the largest eigenvalues, of the covariance matrix of the data,

    λ_1 ≥ λ_2 ≥ λ_3 ≥ … ≥ λ_f                                               (2.33)

In what follows, a brief description of the linear PCA is given together with its possible extension to nonlinear PCA by neural networks (Kramer 1991). If the n observation vectors are collected row-wise in the n × m matrix X, the method can be condensed into

    min ‖E‖²   subject to   X = A Bᵀ + E                                    (2.34, 2.35)

where dim(A) = n × f, dim(B) = m × f and dim(E) = n × m. This factorization of X yields columns of B, denoted b_i, that are eigenvectors corresponding to the f largest eigenvalues of C. Assume that BᵀB = I, which holds if B's columns are orthogonal and normalized, i.e., b_iᵀb_j = δ_ij, where δ_ij is the Kronecker delta. A data vector x is projected from the m-dimensional observation space to the f-dimensional principal component space (feature space) by

    z = Bᵀ x                                                                (2.36)

From z the original vector can be reconstructed by

    x̂ = B z                                                                 (2.37)

with the reconstruction error e = x - x̂.

[Figure: Simple projections]
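
Linear PCA according to Eqs. (2.33)-(2.37), computed from the eigendecomposition of the covariance matrix (a sketch; in practice one would often use the SVD of the data matrix instead). The test data are generated for the example.

    import numpy as np

    def pca(X, f):
        """Return the scores z (Eq. 2.36), loadings B and reconstructions (Eq. 2.37)."""
        X0 = X - X.mean(axis=0)                    # mean-centered data
        C = X0.T @ X0 / (X0.shape[0] - 1)          # covariance matrix
        eigval, eigvec = np.linalg.eigh(C)         # eigh returns ascending eigenvalues
        order = np.argsort(eigval)[::-1][:f]       # keep the f largest, Eq. (2.33)
        B = eigvec[:, order]                       # m x f loading matrix
        Z = X0 @ B                                 # scores, Eq. (2.36)
        X_hat = Z @ B.T + X.mean(axis=0)           # reconstruction, Eq. (2.37)
        return Z, B, X_hat

    rng = np.random.default_rng(0)
    t = rng.normal(size=(100, 1))
    X = np.hstack([t, 2.0 * t, -t]) + 0.05 * rng.normal(size=(100, 3))   # nearly rank-1 data
    Z, B, X_hat = pca(X, f=1)
    print("reconstruction error:", np.linalg.norm(X - X_hat))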

In the nonlinear case, Eqs. (2.36)-(2.37) can be written as

    z = G(x)   and   x̂ = H(z)                                               (2.38, 2.39)

where G and H are nonlinear vector functions, each of which can be described by a neural network with a layer of hidden nodes. This gives the architecture shown below.

[Figure: Neural network for NLPCA]

Some noteworthy facts: The nodes in the input and output layers, as well as in the middle layer, are linear, while those of the remaining two layers are nonlinear (e.g., sigmoids). This means that, theoretically, the weights of the connections to and from the middle layer could be merged, but then we lose the bottleneck, i.e., no reduction in rank occurs. The middle layer carries the information about the values of the variables in the reduced dimension.

2.4.3 Dynamic modeling

The easiest way to use neural networks for modeling dynamic processes is to treat the time dimension as a spatial dimension. This leads to an architecture that is often called a time-delay network, with which temporal sequences can be detected by simultaneously feeding as inputs x_t, x_{t-1}, x_{t-2}, …, x_{t-L}.

[Figure: Time-delay network]

A classical example of an early application of the above is given by Sejnowski's and Rosenberg's (1987) NETtalk, where a window passing over a text was fed as input to a neural network that was trained to produce a phoneme code (for a speech synthesizer) for the centermost letter of the window (see the figure below). Since the letters were read from a text, every letter used 29 input signals; the dimension of the input vector was therefore 7 × 29. Three letters before and after the centermost one were used because the pronunciation of a letter is not independent of the context. The network was trained on 1024 words and produced correct results for about 95 % of the phonemes. For novel words, the rate was slightly below 80 %. In spite of the fact that it could not compete with a sophisticated conventional model in terms of accuracy, the exercise showed that a quite robust neural-network based model could be developed using a fraction of the time required for developing the conventional model.

[Figure: The principle of NETtalk]

In such applications, neural networks act as time series models, the theory of which is briefly revisited below. An autoregressive discrete time series model of k:th order for one variable, x, can be written

    x_t = φ_1 x_{t-1} + φ_2 x_{t-2} + … + φ_k x_{t-k} + ε_t                 (2.40)

where ε is usually taken to be a normally distributed uncorrelated random variable (white noise). The equation can also be written compactly, if, further, the variable is normalized to zero mean, as

    φ(z⁻¹) x_t = ε_t,   where   φ(z⁻¹) = 1 - φ_1 z⁻¹ - φ_2 z⁻² - … - φ_k z⁻ᵏ   (2.41)

and z⁻¹ is the backward shift operator. The extension of the equations to vector autoregressive (VAR) models is obvious.
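
An AR(k) model, Eq. (2.40), can be estimated by ordinary least squares once the series has been arranged in a lag matrix; replacing the linear solve by a neural network trained on the same lag matrix gives the NAR model discussed below. The AR(2) series is generated for the example.

    import numpy as np

    def lag_matrix(x, k):
        """Rows [x_{t-1}, ..., x_{t-k}] and targets x_t for t = k, ..., len(x)-1."""
        X = np.column_stack([x[k - 1 - i: len(x) - 1 - i] for i in range(k)])
        return X, x[k:]

    # Generate x_t = 0.6 x_{t-1} - 0.3 x_{t-2} + eps_t and re-estimate the coefficients
    rng = np.random.default_rng(0)
    x = np.zeros(500)
    for t in range(2, 500):
        x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + 0.1 * rng.normal()
    X, y = lag_matrix(x, k=2)
    phi, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("estimated AR coefficients:", phi)       # close to (0.6, -0.3)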

The advantage of using feedforward neural networks for time-series modeling lies in their ability to capture nonlinearities without a priori knowledge of the type of nonlinearity encountered. If φ in Eq. (2.41) is replaced by a nonlinear operator, implemented by a neural network, we obtain a NAR (nonlinear autoregressive) model. Typical problems tackled by the technique are found in speech recognition, stock market prediction, process identification, analysis of chaotic time series, etc. An example from the latter category is given below.

Study the logistic map (Feigenbaum)

    x_{t+1} = α x_t (1 - x_t)                                               (2.42)

The system shows a fixed-point attractor at x* = (α - 1)/α if α < 3.0, while it, for α ≥ 3.0, shows oscillations of growing periodicity which finally, for α > 3.57, change into chaotic behavior. If α exceeds 4.0, the values of the sequence start to fall outside (0,1). Since the new value of x is obtained through recursion, x_{t+1} = f(x_t), an l-step ahead prediction can be written

    x_{t+l} = f(x_{t+l-1}) = f(f(x_{t+l-2})) = … = f^(l)(x_t)               (2.43)

The next figure illustrates the recursion graphically, and the figure below it shows a typical evolution of the variable for α = 3.99 in Eq. (2.42).

[Figure: Recursion of the logistic map]
[Figure: 500 observations of x_t]
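
The logistic map (2.42) and the iterated prediction (2.43) in code; the "model" here is the exact map itself, so the example only illustrates the recursion, not a trained network.

    import numpy as np

    def logistic_map(x, alpha=3.99):
        """Eq. (2.42): x_{t+1} = alpha * x_t * (1 - x_t)."""
        return alpha * x * (1.0 - x)

    def iterate(f, x_t, l):
        """Eq. (2.43): l-step-ahead prediction by feeding the output back as input."""
        x = x_t
        for _ in range(l):
            x = f(x)
        return x

    # 500 observations of the chaotic series for alpha = 3.99
    x = np.empty(500)
    x[0] = 0.3
    for t in range(499):
        x[t + 1] = logistic_map(x[t])
    print("x_500 by direct recursion:     ", x[-1])
    print("x_500 predicted 19 steps ahead:", iterate(logistic_map, x[480], 19))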

It is easily found that a small network approximates the time series very well (Ydstie 1990, Saxén). The figure below shows the quadratic function and its approximation given by a (1,2,1) feedforward network trained on the first 500 observations of the series. The average prediction error for the following 500 observations is of the same magnitude, which shows that the network has learned the nonlinear map. However, if the network is applied for several-step-ahead prediction by feeding its own output back as input, it fails if the prediction horizon l is long.

[Figure: The series and its approximation by the neural network]

2.4.4 Control

On the basis of what was presented in subsections 2.4.1 and 2.4.3, it is straightforward to apply neural networks to control problems, and the literature presents numerous such applications. In its simplest form, a network can act as an inverse model of the process (system) to be controlled. This model can be identified by training the network in parallel with the process in question to produce the process input, x, on the basis of the output, y (see the figure).

[Figure: Identification of the inverse model of the process]

After this, the network can be connected before the process to calculate the control signal by feedforward control. It should be noted that the approach requires the inverse of the process to exist. In this approach the resulting control quality is strongly dependent on the quality of the process model, and on whether the process is time-independent.
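
Identification of the inverse model amounts to regressing the process input on the observed output. In the minimal sketch below a monotone (and hence invertible) toy process is used, and a cubic polynomial stands in for the neural network of the figure; both choices are assumptions made for the example.

    import numpy as np

    # Toy process: y = g(x) = x + 0.3 x^3 on x in [-1, 1] (monotone, hence invertible)
    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, 200)                         # process inputs
    y = x + 0.3 * x ** 3 + 0.01 * rng.normal(size=200)      # measured process outputs

    # Inverse model: fit a regressor that reproduces x from y
    inverse_model = np.poly1d(np.polyfit(y, x, deg=3))

    # Feedforward control: to reach a desired output y_ref, apply x = inverse_model(y_ref)
    y_ref = 0.8
    x_cmd = inverse_model(y_ref)
    print("commanded input:", x_cmd, "-> achieved output:", x_cmd + 0.3 * x_cmd ** 3)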

[Figure: Application of the inverse model to process control. Network B, an updated copy of Network A, computes the control signal, while Network A is identified in parallel with the process]

The identified model can, in turn, form the basis of more sophisticated control strategies, such as optimal (e.g., minimum-variance) control and model-predictive control (MPC), where the model is applied over a prediction horizon for which a loss function is minimized. It is also possible to use neural networks for learning optimal control trajectories. This approach can be useful in cases where a more sophisticated, but computationally expensive, process model is available for off-line studies of the control problem at hand.

Normally, feedback control is applied to tackle the problems of model inaccuracy or time-dependent processes mentioned above. Often, a process forward model is used in feedback control. This can be estimated by training a neural network, possibly recursively.

[Figure: Process identification]

Some early papers on applications of neural networks to identification and control are Narendra and Parthasarathy 1990, Ydstie 1990, Zurada 1992 and Ydstie and Narendra 1992.

2.4.5 Classification

As noted in Chapter 1, feedforward neural networks can be applied to solve classification problems, where the result is a Boolean variable. Since all classification problems with several classes can be rewritten into a set of problems with two classes, we shall in what follows only consider a problem where the input vector x is classified as belonging to either A+ or A-.

Neural networks thus constitute an alternative technique to traditional classification methods, such as nearest-neighbor and Bayesian classifiers. The classification is based on the capability of a node to linearly separate the space into two parts; if the input weights are large enough, the output of the node is zero (A-) or one (A+). For an n-dimensional input vector the node bisects the space by an (n-1)-dimensional hyperplane.

Problems that are not linearly separable can be tackled by combining the outputs from several nodes. A convex region in an n-dimensional space can be described by a logical conjunction of several linearly separable sub-regions. For instance, a triangular region in a two-dimensional space is formed as a logical conjunction of three linearly separable regions, as indicated in the figure below. Thus, three nodes are required to form these classification boundaries, and a fourth node to combine the outputs of the three nodes. From this it follows that a large number of nodes may be required to accurately form a nonlinear convex boundary. The number of almost linear segments that are needed to form the desired shape of the classification boundary gives an indication of the required number of hidden nodes. Unfortunately, the shape of the boundary is seldom known a priori.

A concave region can be formed as the union of several convex regions, which means that networks with two hidden layers of nodes would be required; the first layer creates the convex regions (logical conjunction) while the second takes the union of these (logical disjunction). An example of this is given in the figure below. Huang and Lippmann (1987), Lippmann (1988) and Zurada (1992) treat classification problems tackled by neural networks in more detail.

[Figure: Triangle-formed classification region and classification boundaries]
[Figure: Concave classification regions]
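
The triangle example can be written out explicitly: three Heaviside nodes implement the three half-planes and a fourth node takes their logical conjunction. The particular half-planes below (forming a triangle with corners (-1,0), (1,0) and (0,1)) are an illustrative choice, not taken from the figure.

    import numpy as np

    def heaviside(a):
        return (a >= 0.0).astype(float)

    def triangle_classifier(X):
        """Three half-plane nodes combined by a fourth node (conjunction: all three active)."""
        # Half-planes: x2 >= 0,  x2 <= x1 + 1,  x2 <= -x1 + 1
        W = np.array([[0.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])
        b = np.array([0.0, 1.0, 1.0])
        H = heaviside(X @ W.T + b)                 # outputs of the three hidden nodes
        return heaviside(H.sum(axis=1) - 2.5)      # fires only if all three nodes are active

    points = np.array([[0.0, 0.3], [0.0, 1.5], [0.9, 0.05], [-2.0, 0.0]])
    print(triangle_classifier(points))             # 1 inside the triangle, 0 outside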

An undesired phenomenon that arises in the classification of concave regions by feedforward neural networks is that the networks may create regions outside the training space with odd classification results. The reason is that the classification boundaries intersect there, forming new regions where the logical conditions are satisfied. This is especially alarming in, e.g., fault detection applications of the technique. Kramer and Leonard (1990) present a justified criticism of the technique.

Exercise: Suggest a network that solves the classification problem of distinguishing between the regions represented by circles and asterisks with a minimum number of hidden nodes.
