Principe, J.C. "Artificial Neural Networks." The Electrical Engineering Handbook. Ed. Richard C. Dorf. Boca Raton: CRC Press LLC, 2000.


20 Artificial Neural Networks

Jose C. Principe, University of Florida

20.1 Definitions and Scope
    Introduction · Definitions and Style of Computation · ANN Types and Applications
20.2 Multilayer Perceptrons
    Function of Each PE · How to Train MLPs · Applying Back-Propagation in Practice · A Posteriori Probabilities
20.3 Radial Basis Function Networks
20.4 Time-Lagged Networks
    Memory Structures · Training-Focused TLN Architectures
20.5 Hebbian Learning and Principal Component Analysis Networks
    Hebbian Learning · Principal Component Analysis · Associative Memories
20.6 Competitive Learning and Kohonen Networks

20.1 Definitions and Scope

Introduction

Artificial neural networks (ANNs) are among the newest signal-processing technologies in the engineer's toolbox. The field is highly interdisciplinary, but our approach will restrict the view to the engineering perspective. In engineering, neural networks serve two important functions: as pattern classifiers and as nonlinear adaptive filters. We will provide a brief overview of the theory, learning rules, and applications of the most important neural network models.

Definitions and Style of Computation

An ANN is an adaptive, most often nonlinear, system that learns to perform a function (an input/output map) from data. Adaptive means that the system parameters are changed during operation, normally called the training phase. After the training phase the ANN parameters are fixed and the system is deployed to solve the problem at hand (the testing phase). The ANN is built with a systematic step-by-step procedure that optimizes a performance criterion or follows some implicit internal constraint, which is commonly referred to as the learning rule. The input/output training data are fundamental in neural network technology, because they convey the necessary information to discover the optimal operating point. The nonlinear nature of the neural network processing elements (PEs) provides the system with great flexibility to achieve practically any desired input/output map, i.e., some ANNs are universal mappers.

There is a style in neural computation that is worth describing (Fig. 20.1). An input is presented to the network and a corresponding desired or target response is set at the output (when this is the case the training is called supervised). An error is composed from the difference between the desired response and the system output.

FIGURE 20.1  The style of neural computation.

This error information is fed back to the system and adjusts the system parameters in a systematic fashion (the learning rule). The process is repeated until the performance is acceptable. It is clear from this description that the performance hinges heavily on the data. If one does not have data that cover a significant portion of the operating conditions, or if the data are noisy, then neural network technology is probably not the right solution. On the other hand, if there is plenty of data but the problem is too poorly understood to derive an approximate model, then neural network technology is a good choice.

This operating procedure should be contrasted with traditional engineering design, made of exhaustive subsystem specifications and intercommunication protocols. In ANNs, the designer chooses the network topology, the performance function, the learning rule, and the criterion to stop the training phase, but the system adjusts the parameters automatically. So, it is difficult to bring a priori information into the design, and when the system does not work properly it is also hard to refine the solution incrementally. But ANN-based solutions are extremely efficient in terms of development time and resources, and in many difficult problems ANNs provide performance that is difficult to match with other technologies. Denker said 10 years ago that ANNs are the second best way to implement a solution, motivated by the simplicity of their design and by their universality, shadowed only by the traditional design obtained by studying the physics of the problem. At present, ANNs are emerging as the technology of choice for many applications, such as pattern recognition, prediction, system identification, and control.

ANN Types and Applications

It is always risky to establish a taxonomy of a technology, but our motivation is to provide a quick overview of the application areas and of the most popular topologies and learning paradigms. The following summary lists, for each application area, the most common topologies together with their supervised or unsupervised learning rules.

Association
    Hopfield [Zurada, 1992; Haykin, 1994]; unsupervised: Hebbian [Zurada, 1992; Haykin, 1994; Kung, 1993]
    Multilayer perceptron [Zurada, 1992; Haykin, 1994; Bishop, 1995]; supervised: back-propagation [Zurada, 1992; Haykin, 1994; Bishop, 1995]
    Linear associative memory [Zurada, 1992; Haykin, 1994]; unsupervised: Hebbian

Pattern recognition
    Multilayer perceptron [Zurada, 1992; Haykin, 1994; Bishop, 1995]; supervised: back-propagation
    Radial basis functions [Zurada, 1992; Bishop, 1995]; supervised: least mean square; unsupervised: k-means [Bishop, 1995]

Feature extraction
    Competitive [Zurada, 1992; Haykin, 1994]; unsupervised: competitive
    Kohonen [Zurada, 1992; Haykin, 1994]; unsupervised: Kohonen
    Multilayer perceptron [Kung, 1993]; supervised: back-propagation
    Principal component analysis [Zurada, 1992; Kung, 1993]; unsupervised: Oja's [Zurada, 1992; Kung, 1993]

Prediction, system identification
    Time-lagged networks [Zurada, 1992; Kung, 1993; de Vries and Principe, 1992]; supervised: back-propagation through time [Zurada, 1992]
    Fully recurrent networks [Zurada, 1992]; supervised: back-propagation through time [Zurada, 1992]

FIGURE 20.2  MLP with one hidden layer (d-k-m).

FIGURE 20.3  A PE and the most common nonlinearities.

It is clear that multilayer perceptrons (MLPs), the back-propagation algorithm, and its extensions (time-lagged networks, TLN, and back-propagation through time, BPTT) hold a prominent position in ANN technology. It is therefore only natural to spend most of our overview presenting the theory and tools of back-propagation learning. It is also important to notice that Hebbian learning (and its extension, the Oja rule) is a very useful (and biologically plausible) learning mechanism. It is an unsupervised learning method, since there is no need to specify a desired or target response to the ANN.

20.2 Multilayer Perceptrons

Multilayer perceptrons are a layered arrangement of nonlinear PEs, as shown in Fig. 20.2. The layer that receives the input is called the input layer, and the layer that produces the output is the output layer. The layers that do not have direct access to the external world are called hidden layers. A layered network with just the input and output layers is called the perceptron. Each connection between PEs is weighted by a scalar, w_i, called a weight, which is adapted during learning.

The PEs in the MLP are composed of an adder followed by a smooth saturating nonlinearity of the sigmoid type (Fig. 20.3). The most common saturating nonlinearities are the logistic function and the hyperbolic tangent. The threshold nonlinearity is used in other nets. The importance of the MLP is that it is a universal mapper (it implements arbitrary input/output maps) when the topology has at least two hidden layers and a sufficient number of PEs [Haykin, 1994]. Even MLPs with a single hidden layer are able to approximate continuous input/output maps. This means that we will rarely need to choose topologies with more than two hidden layers. But these are existence proofs, so the issue that we must solve as engineers is to choose how many layers and how many PEs in each layer are required to produce good results.

Many problems in engineering can be thought of in terms of a transformation of an input space, containing the input, to an output space where the desired response exists. For instance, dividing data into classes can be thought of as transforming the input into 0 and 1 responses that code the classes [Bishop, 1995]. Likewise, identification of an unknown system can also be framed as a mapping (function approximation) from the input to the system output [Kung, 1993]. The MLP is highly recommended for these applications.
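To make the layered arrangement concrete, the following is a minimal sketch of the forward pass of a d-k-m MLP with logistic PEs. It is illustrative only; the variable names (W1, b1, W2, b2) are ours and not part of the original text.

```python
import numpy as np

def logistic(v):
    # Smooth saturating nonlinearity commonly used in MLP PEs.
    return 1.0 / (1.0 + np.exp(-v))

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a d-k-m MLP with one hidden layer.

    x  : input vector, shape (d,)
    W1 : hidden-layer weights, shape (k, d);  b1 : hidden biases, shape (k,)
    W2 : output-layer weights, shape (m, k);  b2 : output biases, shape (m,)
    """
    hidden = logistic(W1 @ x + b1)        # hidden-layer PE activations
    output = logistic(W2 @ hidden + b2)   # output-layer PE activations
    return output

# Example: a 3-4-2 MLP with random weights.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
print(mlp_forward(np.array([0.5, -1.0, 2.0]), W1, b1, W2, b2))
```

How the weights of such a network are adapted is the subject of the back-propagation discussion that follows.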

FIGURE 20.4  A two-input PE and its separation surface.

Function of Each PE

Let us briefly study the function of a single PE with two inputs [Zurada, 1992]. If the nonlinearity is the threshold nonlinearity, we can immediately see that the output is simply 1 or -1. The surface that divides these subspaces is called a separation surface, and in this case it is the line of equation

    y(w_1, w_2) = w_1 x_1 + w_2 x_2 + b = 0                                  (20.1)

i.e., the PE weights and the bias control the orientation and position of the separation line, respectively (Fig. 20.4). In higher dimensions the separation surface becomes a hyperplane of dimension one less than the dimensionality of the input space. So, each PE creates a dichotomy in the input space. For smooth nonlinearities the separation surface is not crisp; it becomes fuzzy, but the same principles apply. In this case, the size of the weights controls the width of the fuzzy boundary (larger weights shrink the fuzzy boundary).

The perceptron input/output map is built from a juxtaposition of linear separation surfaces, so the perceptron gives zero classification error only for linearly separable classes (i.e., classes that can be exactly classified by hyperplanes). When one adds a layer to the perceptron, creating a one-hidden-layer MLP, the type of separation surface changes drastically. It can be shown that this learning machine is able to create "bumps" in the input space, i.e., an area of high response surrounded by low responses [Zurada, 1992]. The function of each PE is always the same, no matter whether the PE is part of a perceptron or an MLP. However, notice that the output layer in the MLP works with the result of the hidden-layer activations, creating an embedding of functions and producing more complex separation surfaces. The one-hidden-layer MLP is able to produce nonlinear separation surfaces. If one adds an extra layer (i.e., two hidden layers), the learning machine can now combine bumps at will, which can be interpreted as a universal mapper, since there is evidence that any function can be approximated by localized bumps. One important aspect to remember is that changing a single weight in the MLP can drastically change the location of the separation surfaces; i.e., the MLP achieves the input/output map through the interplay of all its weights.

How to Train MLPs

One fundamental issue is how to adapt the weights w_i of the MLP to achieve a given input/output map. The core ideas have been around for many years in optimization, and they are extensions of well-known engineering principles, such as the least mean square (LMS) algorithm of adaptive filtering [Haykin, 1994]. Let us review the theory here. Assume that we have a linear PE (f(net) = net) and that we want to adapt the weights so as to minimize the square difference between the desired signal and the PE response (Fig. 20.5). This problem has an analytical solution known as least squares [Haykin, 1994]. The optimal weights are obtained as the product of the inverse of the input autocorrelation function (R⁻¹) and the cross-correlation vector (P) between the input and the desired response.
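As an illustration of the analytical solution just mentioned, the sketch below (ours, with illustrative names) estimates R and P from data and computes the optimal weights of a linear PE as w* = R⁻¹P.

```python
import numpy as np

def least_squares_weights(X, d):
    """Optimal weights of a linear PE, w* = R^-1 P, where R is the input
    autocorrelation matrix and P is the cross-correlation vector between
    the input and the desired response.

    X : input patterns, shape (N, D);  d : desired responses, shape (N,)
    """
    R = X.T @ X / len(X)           # input autocorrelation estimate
    P = X.T @ d / len(X)           # input/desired cross-correlation estimate
    return np.linalg.solve(R, P)   # solves R w = P, i.e., w = R^-1 P

# Example: recover the weights of the linear map d = 3*x1 + 0.5*x2.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
d = 3.0 * X[:, 0] + 0.5 * X[:, 1]
print(least_squares_weights(X, d))   # approximately [3.0, 0.5]
```

Gradient descent, discussed next, reaches the same solution iteratively without forming or inverting R.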

FIGURE 20.5  Computing analytically the optimal weights for the linear PE.

The analytical solution is equivalent to a search for the minimum of the quadratic performance surface J(w_i) using gradient descent, where the weights at each iteration k are adjusted by

    w_i(k + 1) = w_i(k) - η ∇J(k),    ∇J = ∂J/∂w_i                           (20.2)

where η is a small constant called the step size and ∇J(k) is the gradient of the performance surface at iteration k. Bernard Widrow in the late 1960s proposed a very efficient estimate of the gradient at each iteration,

    ∇J(k) ≈ ∂/∂w_i [ (1/2) e²(k) ] = -e(k) x_i(k)                            (20.3)

which, when substituted into Eq. (20.2), produces the so-called LMS algorithm. He showed that the LMS converges to the analytic solution provided the step size η is small enough. Since it is a steepest-descent procedure, the largest step size is limited by the inverse of the largest eigenvalue of the input autocorrelation matrix. The larger the step size (below this limit), the faster the convergence, but the final values will rattle around the optimal value in a basin that has a radius proportional to the step size. Hence, there is a fundamental trade-off between speed of convergence and accuracy in the final weight values. One great appeal of the LMS algorithm is that it is very efficient (just one multiplication per weight) and requires only local quantities to be computed.

The LMS algorithm can be framed as a computation of partial derivatives of the cost with respect to the unknowns, i.e., the weight values. In fact, with the chain rule one writes

    ∂J/∂w_i = (∂J/∂y)(∂y/∂w_i) = ∂/∂y [ (1/2) (d - y)² ] · ∂/∂w_i ( Σ_k w_k x_k ) = -e x_i        (20.4)

and we obtain the LMS algorithm for the linear PE.
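The following is a minimal sketch of the LMS iteration of Eqs. (20.2) and (20.3) for a linear PE (names are illustrative). As noted above, the step size must stay below the limit set by the largest eigenvalue of the input autocorrelation matrix.

```python
import numpy as np

def lms(X, d, eta=0.01, epochs=20):
    """Train a linear PE y = w.x with the LMS rule of Eqs. (20.2)-(20.3).

    X : input patterns, shape (N, D);  d : desired responses, shape (N,)
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, di in zip(X, d):
            e = di - w @ xi          # instantaneous error e(k)
            w += eta * e * xi        # w(k+1) = w(k) + eta * e(k) * x(k)
    return w

# Identify the simple linear map d = 2*x1 - x2 from noisy samples.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
d = 2 * X[:, 0] - X[:, 1] + 0.01 * rng.normal(size=200)
print(lms(X, d))   # should approach [2, -1]
```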

What happens if the PE is nonlinear? If the nonlinearity is differentiable (smooth), we can still apply the same method because of the chain rule, which prescribes that (Fig. 20.6)

    ∂J/∂w_i = (∂J/∂y)(∂y/∂net)(∂net/∂w_i) = -(d - y) f'(net) x_i = -e f'(net) x_i                 (20.5)

where f'(net) is the derivative of the nonlinearity computed at the operating point. Equation (20.5) is known as the delta rule, and it will train the perceptron [Haykin, 1994]. Note that throughout the derivation we skipped the pattern index p for simplicity, but this rule is applied for each input pattern.

FIGURE 20.6  How to extend LMS to nonlinear PEs with the chain rule.

FIGURE 20.7  How to adapt the weights connected to the ith PE.

However, the delta rule cannot train MLPs, since it requires knowledge of the error signal at each PE. The principle of ordered derivatives can be extended to multilayer networks, provided we organize the computations in flows of activation and error propagation. The principle is very easy to understand, but a little complex to formulate in equation form [Haykin, 1994]. Suppose that we want to adapt the weights connected to a hidden-layer PE, the ith PE (Fig. 20.7). One can decompose the computation of the partial derivative of the cost with respect to the weight w_ij as

    ∂J/∂w_ij = (∂J/∂y_i)(∂y_i/∂w_ij) = (∂J/∂y_i) · (∂y_i/∂net_i)(∂net_i/∂w_ij)                   (20.6)

i.e., the partial derivative with respect to the weight is the product of the partial derivative with respect to the PE state, ∂J/∂y_i (part 1 in Eq. (20.6)), and the partial derivative of the local activation with respect to the weights, ∂y_i/∂w_ij (part 2 in Eq. (20.6)). This last quantity is exactly the same as for the nonlinear PE (f'(net_i) x_j), so the big issue is the computation of ∂J/∂y_i. For an output PE, ∂J/∂y_i becomes the injected error e_i, as in Eq. (20.4). For the hidden ith PE, ∂J/∂y_i is evaluated by summing all the errors that reach the PE from the top layer through the topology when the injected errors e_k are clamped at the top layer, or, in an equation,

    ∂J/∂y_i = Σ_k (∂J/∂y_k)(∂y_k/∂net_k)(∂net_k/∂y_i) = -Σ_k e_k f'(net_k) w_ki                   (20.7)

Substituting back into Eq. (20.6) we finally get

    ∂J/∂w_ij = -x_j f'(net_i) ( Σ_k e_k f'(net_k) w_ki )                     (20.8)
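A compact sketch of Eqs. (20.5) through (20.8) for an MLP with one hidden layer of logistic PEs is given below (ours, with illustrative names). For the logistic nonlinearity, f'(net) = y(1 - y); biases are omitted to keep the sketch short.

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(x, d, W1, W2, eta=0.1):
    """One back-propagation update for a one-hidden-layer MLP.

    x : input, shape (D,);  d : desired response, shape (M,)
    W1 : hidden weights, shape (K, D);  W2 : output weights, shape (M, K)
    """
    # Forward pass: activations computed everywhere.
    h = logistic(W1 @ x)                       # hidden activations y_i
    y = logistic(W2 @ h)                       # output activations y_k
    # External errors at the output PEs.
    e = d - y
    # Local errors: delta_k at the outputs, delta_i back-propagated (Eq. 20.7).
    delta_out = e * y * (1 - y)
    delta_hid = (W2.T @ delta_out) * h * (1 - h)
    # Weight updates = eta * local error * local activation (Eq. 20.8).
    W2 += eta * np.outer(delta_out, h)
    W1 += eta * np.outer(delta_hid, x)
    return W1, W2
```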

Equation (20.8) embodies the back-propagation training algorithm [Haykin, 1994; Bishop, 1995]. It can be rewritten as the product of a local activation (part 1) and a local error (part 2), exactly as in the LMS and delta rules. But now the local error is a composition of errors that flow through the topology, which becomes equivalent to the existence of a desired response at the PE.

There is an intrinsic flow in the implementation of the back-propagation algorithm: first, inputs are applied to the net and activations are computed everywhere to yield the output activation. Second, the external errors are computed by subtracting the net output from the desired response. Third, these external errors are used in Eq. (20.8) to compute the local errors for the layer immediately preceding the output layer, and the computations are chained up to the input layer. Once all the local errors are available, Eq. (20.2) can be used to update every weight. These three steps are then repeated for the other training patterns until the error is acceptable.

Step three is equivalent to injecting the external errors into the dual topology and back-propagating them up to the input layer [Haykin, 1994]. The dual topology is obtained from the original one by reversing the data flow and substituting summing junctions by splitting nodes and vice versa. The error at each PE of the dual topology is then multiplied by the activation of the original network to compute the weight updates. So, effectively, the dual topology is being used to compute the local errors, which makes the procedure highly efficient. This is the reason back-propagation trains a network of N weights with a number of multiplications proportional to N (O(N)), instead of O(N²) for previous methods of computing partial derivatives known in control theory. Using the dual topology to implement back-propagation is the best and most general way to program the algorithm on a digital computer.

Applying Back-Propagation in Practice

Now that we know an algorithm to train MLPs, let us see what the practical issues are in applying it. We will address the following aspects: size of the training set vs. number of weights, search procedures, how to stop training, and how to set the topology for maximum generalization.

Size of Training Set

The size of the training set is very important for good performance. Remember that the ANN gets its information from the training set. If the training data do not cover the full range of operating conditions, the system may perform badly when deployed. Under no circumstances should the training set be smaller than the number of weights in the ANN. A good size for the training data is ten times the number of weights in the network, with the lower limit being set around three times the number of weights (these values should be taken as an indication, subject to experimentation in each case) [Haykin, 1994].

Search Procedures

Searching along the direction of the gradient is fine if the performance surface is quadratic. However, in ANNs this is rarely the case, because of the use of nonlinear PEs and topologies with several layers. So gradient descent can be caught in local minima, and the search becomes very slow in regions of small curvature. One efficient way to speed up the search in regions of small curvature and, at the same time, to stabilize it in narrow valleys is to include a momentum term in the weight adaptation:

    w_ij(n + 1) = w_ij(n) + η δ_i(n) x_j(n) + α [ w_ij(n) - w_ij(n - 1) ]                         (20.9)

The value of the momentum α should be set experimentally between 0.5 and 0.9.
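A minimal sketch of the momentum update of Eq. (20.9) follows (ours, with illustrative names): a fraction α of the previous weight change is added to the usual gradient term.

```python
import numpy as np

def momentum_update(w, w_prev, delta, x, eta=0.1, alpha=0.7):
    """Eq. (20.9): w(n+1) = w(n) + eta*delta(n)*x(n) + alpha*[w(n) - w(n-1)].

    w, w_prev : current and previous weight vectors of one PE
    delta     : local error of the PE;  x : its input activations
    Returns the updated weights and the weights to remember as previous.
    """
    w_new = w + eta * delta * x + alpha * (w - w_prev)
    return w_new, w.copy()

# alpha between 0.5 and 0.9 is the usual experimental range.
w, w_prev = np.zeros(3), np.zeros(3)
w, w_prev = momentum_update(w, w_prev, delta=0.2, x=np.array([1.0, -1.0, 0.5]))
print(w)
```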
There are many more modifications to the conventional gradient search, such as adaptive step sizes, annealed noise, conjugate gradients, and second-order methods (using the information contained in the Hessian matrix), but the simplicity and power of momentum learning are hard to beat [Haykin, 1994; Bishop, 1995].

How to Stop Training

The stopping criterion is a fundamental aspect of training. The simple ideas of capping the number of iterations or of letting the system train until a predetermined error value is reached are not recommended.

The reason is that we want the ANN to perform well on the test set data; i.e., we would like the system to perform well on data it never saw before (good generalization) [Bishop, 1995]. The error in the training set tends to decrease with iteration when the ANN has enough degrees of freedom to represent the input/output map. However, the system may be memorizing the training patterns (overfitting) instead of finding the underlying mapping rule. This is called overtraining. To avoid overtraining, the performance on a validation set, i.e., a set of input data that the system has never seen before, must be checked regularly during training (e.g., once every 50 passes over the training set). The training should be stopped when the error in the validation set starts to increase, despite the fact that the error in the training set continues to decrease. This method is called cross-validation. The validation set should be about 10% of the training set, and distinct from it.

Size of the Topology

The size of the topology should also be carefully selected. If the number of layers or the size of each layer is too small, the network does not have enough degrees of freedom to classify the data or to approximate the function, and the performance suffers. On the other hand, if the size of the network is too large, performance may also suffer. This is the phenomenon of overfitting that we mentioned above. An alternative way to control it is to reduce the size of the network. There are basically two procedures to set the size of the network: either one starts small and adds new PEs, or one starts with a large network and prunes PEs [Haykin, 1994]. One quick way to prune the network is to impose a penalty term on the performance function, a regularizing term such as one limiting the slope of the input/output map [Bishop, 1995]. A regularization term that can be implemented locally is

    w_ij(n + 1) = w_ij(n) [ 1 - λ / (1 + w_ij(n))² ] + η δ_i(n) x_j(n)                            (20.10)

where λ is the weight-decay parameter and δ_i the local error. Weight decay tends to drive unimportant weights to zero.

A Posteriori Probabilities

We will finish the discussion of the MLP by noting that this topology, when trained with the mean square error, is able to estimate directly at its outputs a posteriori probabilities, i.e., the probability that a given input pattern belongs to a given class [Bishop, 1995]. This property is very useful because the MLP outputs can be interpreted as probabilities and operated on as numbers. In order to guarantee this property, one has to make sure that each class is assigned to one output PE, that the topology is sufficiently large to represent the mapping, that the training has converged to the absolute minimum, and that the outputs are normalized between 0 and 1. The first requirements are met by good design, while the last can easily be enforced if the softmax activation is used for the output PEs [Bishop, 1995]:

    y_i = exp(net_i) / Σ_j exp(net_j)                                        (20.11)
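As a small illustration, Eq. (20.11) can be computed as below; subtracting the maximum net value before exponentiating is a standard numerical safeguard and is an addition of ours, not part of the original text.

```python
import numpy as np

def softmax(net):
    # Eq. (20.11): y_i = exp(net_i) / sum_j exp(net_j).
    # Subtracting the max keeps the exponentials from overflowing.
    z = np.exp(net - np.max(net))
    return z / z.sum()

print(softmax(np.array([1.0, 2.0, 0.5])))   # outputs are positive and sum to 1
```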

20.3 Radial Basis Function Networks

The radial basis function (RBF) network constitutes another way of implementing arbitrary input/output mappings. The most significant difference between the MLP and the RBF lies in the PE nonlinearity. While the PE in the MLP responds to the full input space, the PE in the RBF is local, normally a Gaussian kernel in the input space. Hence, it responds only to inputs that are close to its center; i.e., it has basically a local response.

FIGURE 20.8  Radial basis function (RBF) network.

The RBF network is also a layered net, with the hidden layer built from Gaussian kernels and a linear (or nonlinear) output layer (Fig. 20.8). Training of the RBF network is normally done in two stages [Haykin, 1994]: first, the centers x_i are adaptively placed in the input space using competitive learning or k-means clustering [Bishop, 1995], which are unsupervised procedures. Competitive learning is explained later in the chapter. The variance of each Gaussian is chosen as a percentage (30 to 50%) of the distance to the nearest center. The goal is to cover the input data distribution adequately. Once the RBF centers are located, the second-layer weights w_i are trained using the LMS procedure. RBF networks are easy to work with, they train very fast, and they have shown good properties both for function approximation and for classification. The problem is that they require many Gaussian kernels in high-dimensional spaces.
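The two-stage training just described can be sketched as follows (ours; the simple k-means routine and all names are illustrative): centers are placed first, the Gaussian widths are set from the nearest-center distances, and the output weights are then obtained by a batch least-squares fit, which stands in here for the iterative LMS training of the output layer.

```python
import numpy as np

def kmeans(X, M, iters=20, seed=0):
    # Simple k-means to place the M Gaussian centers (unsupervised stage).
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), M, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for m in range(M):
            if np.any(labels == m):
                centers[m] = X[labels == m].mean(axis=0)
    return centers

def train_rbf(X, d, M=10, width_factor=0.5):
    """Two-stage RBF training sketch: centers by k-means, widths as a fraction
    of the distance to the nearest other center, output weights by least squares."""
    centers = kmeans(X, M)
    dists = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1) ** 0.5
    np.fill_diagonal(dists, np.inf)
    sigma = width_factor * dists.min(axis=1)       # 30 to 50% of nearest-center distance
    # Hidden-layer (Gaussian) activations for all patterns.
    G = np.exp(-((X[:, None, :] - centers) ** 2).sum(-1) / (2 * sigma ** 2))
    w, *_ = np.linalg.lstsq(G, d, rcond=None)      # linear output layer
    return centers, sigma, w

# Example: approximate d = sin(x) with 8 Gaussian kernels.
rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(300, 1))
d = np.sin(X[:, 0])
centers, sigma, w = train_rbf(X, d, M=8)
```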

20.4 Time-Lagged Networks

The MLP is the most common neural network topology, but it can handle only instantaneous information, since the system has no memory and is feedforward. In engineering, the processing of signals that exist in time requires systems with memory, i.e., linear filters. An alternative way to implement memory is to use feedback, which gives rise to recurrent networks. Fully recurrent networks are difficult to train and to stabilize, so it is preferable to develop topologies based on MLPs in which explicit subsystems that store past information are included. These subsystems are called short-term memory structures [de Vries and Principe, 1992]. The combination of an MLP with short-term memory structures is called a time-lagged network (TLN). The memory structures may themselves be recurrent, but the feedback is local, so stability is still easy to guarantee. Here, we will cover just one TLN topology, called focused, where the memory is at the input layer. The most general TLNs have memory added anywhere in the network, but they require other, more involved training strategies (BPTT [Haykin, 1994]). The interested reader is referred to de Vries and Principe [1992] for further details. The function of the short-term memory in the focused TLN is to represent the past of the input signal, while the nonlinear PEs provide the mapping, as in the MLP (Fig. 20.9).

FIGURE 20.9  A focused TLN.

Memory Structures

The simplest memory structure is built from a tap delay line (Fig. 20.10). The memory by delays is a single-input, multiple-output system that has no free parameters except its size K. The tap delay memory is the memory utilized in the time-delay neural network (TDNN), which has been used successfully in speech recognition and system identification [Kung, 1993].

FIGURE 20.10  Tap delay line memory.

FIGURE 20.11  Memory by feedback (context PE).

A different mechanism for linear memory is feedback (Fig. 20.11). Feedback allows the system to remember past events because of the exponential decay of the response. This memory has limited resolution because of the low-pass filtering required for long memories. But notice that, unlike the memory by delay, memory by feedback provides the learning system with a free parameter μ that controls the length of the memory. Memory by feedback has been used in Elman and Jordan networks [Haykin, 1994].

It is possible to combine the advantages of memory by feedback with those of memory by delay in linear systems called dispersive delay lines. The most studied of these memories is a cascade of low-pass functions called the gamma memory [de Vries and Principe, 1992]. The gamma memory has a free parameter μ that controls and decouples memory depth from memory resolution. Memory depth D is defined as the first moment of the impulse response from the input to the last tap K, while memory resolution R is the number of taps per unit time. For the gamma memory, D = K/μ and R = μ; i.e., changing μ modifies the memory depth and resolution inversely. This recursive parameter μ can be adapted with the output MSE like the other network parameters; i.e., the ANN is able to choose the best memory depth to minimize the output error, which is not possible with the tap delay memory.

FIGURE 20.12  Gamma memory (dispersive delay line).

Training-Focused TLN Architectures

The appeal of the focused architecture is that the MLP weights can still be adapted with back-propagation. However, the input/output mapping produced by these networks is static. The input memory layer brings in past input information to establish the value of the mapping. As we know from engineering, the size of the memory is fundamental for identifying an unknown plant, for instance, or for performing prediction with a small error. But note that with the focused TLN the models for system identification become nonlinear (i.e., nonlinear moving average, NMA). When the tap delay implements the short-term memory, straight back-propagation can be used, since the only adaptive parameters are the MLP weights. When the gamma memory (or the context PE) is used, the recursive parameter is adapted in a fully adaptive framework (or the parameter is preset by some external consideration). The equations to adapt the context PE and the gamma memory are shown in Figs. 20.11 and 20.12, respectively. For the context PE, δ(n) refers to the total error that is back-propagated from the MLP and that reaches the dual context PE.
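The extracted text gives the gamma recursion only through Fig. 20.12, so the sketch below uses one commonly stated form of the gamma memory (an assumption of ours, chosen because it is consistent with the depth D = K/μ quoted above); all names are illustrative.

```python
import numpy as np

def gamma_memory(u, K, mu):
    """Generate the K taps of a gamma memory for an input sequence u.

    Assumed recursion (a common form, consistent with depth D = K/mu):
        x_0(n) = u(n)
        x_k(n) = (1 - mu) * x_k(n - 1) + mu * x_{k-1}(n - 1),  k = 1..K
    Returns an array of shape (len(u), K+1) with tap k in column k.
    """
    x = np.zeros((len(u), K + 1))
    for n in range(len(u)):
        x[n, 0] = u[n]
        if n > 0:
            for k in range(1, K + 1):
                x[n, k] = (1 - mu) * x[n - 1, k] + mu * x[n - 1, k - 1]
    return x

# An impulse through a 5-tap gamma memory; mu = 1 reduces it to a tap delay line.
taps = gamma_memory(np.r_[1.0, np.zeros(19)], K=5, mu=0.5)
print(taps[:8].round(3))
```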

20.5 Hebbian Learning and Principal Component Analysis Networks

Hebbian Learning

Hebbian learning is an unsupervised learning rule that captures the similarity between an input and an output through correlation. To adapt a weight w_i using Hebbian learning we adjust the weights according to Δw_i = η x_i y, or, in an equation [Haykin, 1994],

    w_i(n + 1) = w_i(n) + η x_i(n) y(n)                                      (20.12)

where η is the step size, x_i is the ith input, and y is the PE output. The output of the single PE is an inner product between the input and the weight vector (formula in Fig. 20.13). It measures the similarity between the two vectors; i.e., if the input is close to the weight vector, the output y is large; otherwise it is small. The weights are computed by an outer product of the input X and output Y, i.e., W = XY^T, where T means transpose.

FIGURE 20.13  Hebbian PE.

The problem with Hebbian learning is that it is unstable; i.e., the weights keep growing with the number of iterations [Haykin, 1994]. Oja proposed to stabilize the Hebbian rule by normalizing the new weight by its size, which gives the rule [Haykin, 1994]

    w_i(n + 1) = w_i(n) + η y(n) [ x_i(n) - y(n) w_i(n) ]                    (20.13)

The weights now converge to finite values. They still define in the input space the direction along which the data cluster has its largest projection, which corresponds to the eigenvector with the largest eigenvalue of the input correlation matrix [Kung, 1993]. The output of the PE provides the largest eigenvalue of the input correlation matrix.
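A minimal sketch of Oja's rule, Eq. (20.13), follows (ours, with illustrative names); trained on zero-mean data, the weight vector should converge toward the direction of the first principal component.

```python
import numpy as np

def oja(X, eta=0.01, epochs=50, seed=0):
    """Single-PE Oja rule, Eq. (20.13): w <- w + eta * y * (x - y * w)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    for _ in range(epochs):
        for x in X:
            y = w @ x                       # PE output: inner product
            w += eta * y * (x - y * w)      # normalized Hebbian update
    return w

# Zero-mean data stretched along the direction [2, 1].
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 1)) * np.array([2.0, 1.0]) + 0.1 * rng.normal(size=(500, 2))
X -= X.mean(axis=0)
print(oja(X))   # roughly the unit vector along [2, 1], up to sign
```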

FIGURE 20.14  PCA network.

Principal Component Analysis

Principal component analysis (PCA) is a well-known technique in signal processing that is used to project a signal onto a signal-specific basis. The importance of PCA is that it provides the best linear projection to a subspace in terms of preserving the signal energy [Haykin, 1994]. Normally, PCA is computed analytically through a singular value decomposition. PCA networks offer an alternative to this computation by providing an iterative implementation that may be preferred for real-time operation in embedded systems.

The PCA network is a one-layer network with linear processing elements (Fig. 20.14). One can extend Oja's rule to multiple output PEs (fewer than or equal to the number of input PEs), according to the formula shown in Fig. 20.14, which is called Sanger's rule [Haykin, 1994]. The rows of the weight matrix (which contain the weights connected to the output PEs, in descending order) are the eigenvectors of the input correlation matrix. If we set the number of output PEs equal to M < D, we will be projecting the input data onto the M largest principal components, and the outputs will be proportional to the M largest eigenvalues. Note that we are performing an eigendecomposition through an iterative procedure.
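The extracted text refers to the update only through Fig. 20.14, so the sketch below uses the standard statement of Sanger's rule (an assumption of ours); all names are illustrative.

```python
import numpy as np

def sanger(X, M, eta=0.005, epochs=200, seed=0):
    """Sanger's rule (generalized Hebbian algorithm), as commonly stated:
        W[i, :] += eta * y[i] * (x - sum_{k <= i} y[k] * W[k, :])
    W has shape (M, D); for zero-mean X its rows move toward the M leading
    eigenvectors of the input correlation matrix."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.normal(size=(M, X.shape[1]))
    for _ in range(epochs):
        for x in X:
            y = W @ x
            for i in range(M):
                # y[:i+1] @ W[:i+1] is the reconstruction from the first i+1 outputs.
                W[i] += eta * y[i] * (x - y[: i + 1] @ W[: i + 1])
    return W

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3)) * np.array([2.0, 1.0, 0.5])
X -= X.mean(axis=0)
print(sanger(X, M=2))   # rows roughly along [1,0,0] and [0,1,0], up to sign
```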

Associative Memories

Hebbian learning is also the rule used to create associative memories [Zurada, 1992]. The most-utilized associative memory implements heteroassociation, where the system is able to associate an input X to a designated output Y, which can be of a different dimension (Fig. 20.15). So, in heteroassociation the signal Y works as the desired response. We can train such a memory using Hebbian learning or LMS, but LMS provides a more efficient encoding of the information. Associative memories differ from conventional computer memories in several respects. First, they are content addressable, and the information is distributed throughout the network, so they are robust to noise in the input. With nonlinear PEs or recurrent connections (as in the famous Hopfield network) [Haykin, 1994], they display the important property of pattern completion; i.e., when the input is distorted or only partially available, the recall can still be perfect.

FIGURE 20.15  Associative memory (heteroassociation).

FIGURE 20.16  Autoassociator.

A special case of the associative memory is called the autoassociator (Fig. 20.16), where the training output of size D is equal to the input signal (also of size D) [Kung, 1993]. Note that the hidden layer has fewer PEs (M < D) than the input (a bottleneck layer), and W_1 = W_2^T is enforced. The function of this network is one of encoding or data reduction. The training of this network (the W_2 matrix) is done with LMS. It can be shown that this network also implements PCA with M components, even when the hidden layer is built from nonlinear PEs.

20.6 Competitive Learning and Kohonen Networks

Competition is a very efficient way to divide the computing resources of a network. Instead of having each output PE more or less sensitive to the full input space, as in the associative memories, in a competitive network each PE specializes in a piece of the input space and represents it [Haykin, 1994].

FIGURE 20.17  Competitive neural network.

Competitive networks are linear, single-layer nets (Fig. 20.17). Their functionality is directly related to the competitive learning rule, which belongs to the unsupervised category. Only the PE that has the largest output gets its weights updated. The weights of the winning PE are updated according to the formula in Fig. 20.17, in such a way that they approach the present input. The step size controls exactly how large this adjustment is (see Fig. 20.17). Notice that there is an intrinsic nonlinearity in the learning rule: only the PE that has the largest output (the winner) has its weights updated; all the other weights remain unchanged. This is the mechanism that allows the competitive net PEs to specialize.

Competitive networks are used for clustering; i.e., an M-output-PE net will seek M clusters in the input space. The weights of each PE will correspond to the center of mass of one of the M clusters of input samples. When a given pattern is shown to the trained net, only one of the outputs will be active, and it can be used to label the sample as belonging to one of the clusters. No more information about the input data is preserved.
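The extracted text gives the winner update only through Fig. 20.17, so the sketch below uses the common form Δw_winner = η(x - w_winner), an assumption of ours; the winner is taken as the PE whose weight vector is closest to the input, which matches the largest output for normalized weights. Names are illustrative.

```python
import numpy as np

def competitive_learning(X, M, eta=0.1, epochs=20, seed=0):
    """Winner-take-all competitive learning: only the winning PE's weight
    vector is moved toward the present input, w_winner += eta * (x - w_winner)."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), M, replace=False)].astype(float)   # start on data points
    for _ in range(epochs):
        for x in X:
            winner = np.argmin(((W - x) ** 2).sum(axis=1))      # closest weight vector wins
            W[winner] += eta * (x - W[winner])                  # move the winner toward the input
    return W

# Two well-separated clouds: the two weight vectors settle near the cluster centers.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)), rng.normal(4, 0.3, size=(100, 2))])
print(competitive_learning(X, M=2))
```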

FIGURE 20.18  Kohonen SOFM.

Competitive learning is one of the fundamental components of the Kohonen self-organizing feature map (SOFM) network, which is also a single-layer network with linear PEs [Haykin, 1994]. Kohonen learning creates annealed competition in the output space by adapting not only the winner PE weights but also those of its spatial neighbors, using a Gaussian neighborhood function Λ. The output PEs are arranged in linear or two-dimensional neighborhoods (Fig. 20.18). Kohonen SOFM networks produce a mapping from the continuous input space to the discrete output space that preserves topological properties of the input space (i.e., local neighbors in the input space are mapped to neighbors in the output space). During training, both the spatial neighborhood and the learning constant are decreased slowly, by starting with a large neighborhood σ₀ and decreasing it (N₀ controls the scheduling). The initial step size η₀ also needs to be scheduled (by K). The Kohonen SOFM network is useful for projecting the input to a subspace, as an alternative to PCA networks. The topological properties of the output space provide more information about the input than straight clustering.

References

C. M. Bishop, Neural Networks for Pattern Recognition, New York: Oxford University Press, 1995.
de Vries and J. C. Principe, "The gamma model: a new neural model for temporal processing," Neural Networks, vol. 5, 1992.
S. Haykin, Neural Networks: A Comprehensive Foundation, New York: Macmillan, 1994.
S. Y. Kung, Digital Neural Networks, Englewood Cliffs, N.J.: Prentice-Hall, 1993.
J. M. Zurada, Artificial Neural Systems, West Publishing, 1992.

Further Information

The literature in this field is voluminous. We decided to limit the references to textbooks for an engineering audience, with different levels of sophistication. Zurada is the most accessible text, Haykin the most comprehensive. Kung provides interesting applications of PCA networks and of nonlinear signal processing and system identification. Bishop concentrates on the design of pattern classifiers. Interested readers are directed to the following journals for more information: IEEE Transactions on Signal Processing, IEEE Transactions on Neural Networks, Neural Networks, Neural Computation, and the Proceedings of the Neural Information Processing Systems (NIPS) conference.


A neural network with localized receptive fields for visual pattern classification Unversty of Wollongong Research Onlne Faculty of Informatcs - Papers (Archve) Faculty of Engneerng and Informaton Scences 2005 A neural network wth localzed receptve felds for vsual pattern classfcaton

More information

Lecture 3: Dual problems and Kernels

Lecture 3: Dual problems and Kernels Lecture 3: Dual problems and Kernels C4B Machne Learnng Hlary 211 A. Zsserman Prmal and dual forms Lnear separablty revsted Feature mappng Kernels for SVMs Kernel trck requrements radal bass functons SVM

More information

The Order Relation and Trace Inequalities for. Hermitian Operators

The Order Relation and Trace Inequalities for. Hermitian Operators Internatonal Mathematcal Forum, Vol 3, 08, no, 507-57 HIKARI Ltd, wwwm-hkarcom https://doorg/0988/mf088055 The Order Relaton and Trace Inequaltes for Hermtan Operators Y Huang School of Informaton Scence

More information

Errors for Linear Systems

Errors for Linear Systems Errors for Lnear Systems When we solve a lnear system Ax b we often do not know A and b exactly, but have only approxmatons  and ˆb avalable. Then the best thng we can do s to solve ˆx ˆb exactly whch

More information

Homework Assignment 3 Due in class, Thursday October 15

Homework Assignment 3 Due in class, Thursday October 15 Homework Assgnment 3 Due n class, Thursday October 15 SDS 383C Statstcal Modelng I 1 Rdge regresson and Lasso 1. Get the Prostrate cancer data from http://statweb.stanford.edu/~tbs/elemstatlearn/ datasets/prostate.data.

More information

Problem Set 9 Solutions

Problem Set 9 Solutions Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem

More information

Ensemble Methods: Boosting

Ensemble Methods: Boosting Ensemble Methods: Boostng Ncholas Ruozz Unversty of Texas at Dallas Based on the sldes of Vbhav Gogate and Rob Schapre Last Tme Varance reducton va baggng Generate new tranng data sets by samplng wth replacement

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

Design and Optimization of Fuzzy Controller for Inverse Pendulum System Using Genetic Algorithm

Design and Optimization of Fuzzy Controller for Inverse Pendulum System Using Genetic Algorithm Desgn and Optmzaton of Fuzzy Controller for Inverse Pendulum System Usng Genetc Algorthm H. Mehraban A. Ashoor Unversty of Tehran Unversty of Tehran h.mehraban@ece.ut.ac.r a.ashoor@ece.ut.ac.r Abstract:

More information

EEL 6266 Power System Operation and Control. Chapter 3 Economic Dispatch Using Dynamic Programming

EEL 6266 Power System Operation and Control. Chapter 3 Economic Dispatch Using Dynamic Programming EEL 6266 Power System Operaton and Control Chapter 3 Economc Dspatch Usng Dynamc Programmng Pecewse Lnear Cost Functons Common practce many utltes prefer to represent ther generator cost functons as sngle-

More information

Natural Language Processing and Information Retrieval

Natural Language Processing and Information Retrieval Natural Language Processng and Informaton Retreval Support Vector Machnes Alessandro Moschtt Department of nformaton and communcaton technology Unversty of Trento Emal: moschtt@ds.untn.t Summary Support

More information

Appendix B. The Finite Difference Scheme

Appendix B. The Finite Difference Scheme 140 APPENDIXES Appendx B. The Fnte Dfference Scheme In ths appendx we present numercal technques whch are used to approxmate solutons of system 3.1 3.3. A comprehensve treatment of theoretcal and mplementaton

More information

Inexact Newton Methods for Inverse Eigenvalue Problems

Inexact Newton Methods for Inverse Eigenvalue Problems Inexact Newton Methods for Inverse Egenvalue Problems Zheng-jan Ba Abstract In ths paper, we survey some of the latest development n usng nexact Newton-lke methods for solvng nverse egenvalue problems.

More information

Uncertainty in measurements of power and energy on power networks

Uncertainty in measurements of power and energy on power networks Uncertanty n measurements of power and energy on power networks E. Manov, N. Kolev Department of Measurement and Instrumentaton, Techncal Unversty Sofa, bul. Klment Ohrdsk No8, bl., 000 Sofa, Bulgara Tel./fax:

More information

Section 8.3 Polar Form of Complex Numbers

Section 8.3 Polar Form of Complex Numbers 80 Chapter 8 Secton 8 Polar Form of Complex Numbers From prevous classes, you may have encountered magnary numbers the square roots of negatve numbers and, more generally, complex numbers whch are the

More information

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix Lectures - Week 4 Matrx norms, Condtonng, Vector Spaces, Lnear Independence, Spannng sets and Bass, Null space and Range of a Matrx Matrx Norms Now we turn to assocatng a number to each matrx. We could

More information

Some Comments on Accelerating Convergence of Iterative Sequences Using Direct Inversion of the Iterative Subspace (DIIS)

Some Comments on Accelerating Convergence of Iterative Sequences Using Direct Inversion of the Iterative Subspace (DIIS) Some Comments on Acceleratng Convergence of Iteratve Sequences Usng Drect Inverson of the Iteratve Subspace (DIIS) C. Davd Sherrll School of Chemstry and Bochemstry Georga Insttute of Technology May 1998

More information

Fundamentals of Neural Networks

Fundamentals of Neural Networks Fundamentals of Neural Networks Xaodong Cu IBM T. J. Watson Research Center Yorktown Heghts, NY 10598 Fall, 2018 Outlne Feedforward neural networks Forward propagaton Neural networks as unversal approxmators

More information

We present the algorithm first, then derive it later. Assume access to a dataset {(x i, y i )} n i=1, where x i R d and y i { 1, 1}.

We present the algorithm first, then derive it later. Assume access to a dataset {(x i, y i )} n i=1, where x i R d and y i { 1, 1}. CS 189 Introducton to Machne Learnng Sprng 2018 Note 26 1 Boostng We have seen that n the case of random forests, combnng many mperfect models can produce a snglodel that works very well. Ths s the dea

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

Estimating the Fundamental Matrix by Transforming Image Points in Projective Space 1

Estimating the Fundamental Matrix by Transforming Image Points in Projective Space 1 Estmatng the Fundamental Matrx by Transformng Image Ponts n Projectve Space 1 Zhengyou Zhang and Charles Loop Mcrosoft Research, One Mcrosoft Way, Redmond, WA 98052, USA E-mal: fzhang,cloopg@mcrosoft.com

More information

Support Vector Machines CS434

Support Vector Machines CS434 Support Vector Machnes CS434 Lnear Separators Many lnear separators exst that perfectly classfy all tranng examples Whch of the lnear separators s the best? Intuton of Margn Consder ponts A, B, and C We

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

1 Derivation of Rate Equations from Single-Cell Conductance (Hodgkin-Huxley-like) Equations

1 Derivation of Rate Equations from Single-Cell Conductance (Hodgkin-Huxley-like) Equations Physcs 171/271 -Davd Klenfeld - Fall 2005 (revsed Wnter 2011) 1 Dervaton of Rate Equatons from Sngle-Cell Conductance (Hodgkn-Huxley-lke) Equatons We consder a network of many neurons, each of whch obeys

More information