arxiv: v1 [cs.lg] 17 Jan 2019


LECTURE NOTES

Artificial Neural Networks

B. MEHLIG

Department of Physics, University of Gothenburg, Göteborg, Sweden 2019

PREFACE

These are lecture notes for my course on Artificial Neural Networks that I have given at Chalmers (FFR135) and Gothenburg University (FIM720). This course describes the use of neural networks in machine learning: deep learning, recurrent networks, reinforcement learning, and other supervised and unsupervised machine-learning algorithms.

When I first developed my lectures, my main source was the book by Hertz, Krogh, and Palmer [1]. Other sources were the book by Haykin [2], as well as the lecture notes of Horner [3]. My main sources for the Chapter on deep learning were the deep-learning book by Goodfellow, Bengio & Courville [4], and the online-book by Nielsen [5].

I am grateful to Martin Čejka who typed the first version of my hand-written lecture notes and made most of the Figures, and to Erik Werner and Hampus Linander for their interest and their help in preparing Chapter 7. I would like to thank also Johan Fries and Oleksandr Balabanov for implementing the algorithms described in Section 7.4. Johan Fries and Marina Rafajlović made most of the exam questions. Finally many students past and present pointed out misprints and errors and suggested improvements. I thank them all.

CONTENTS

Preface
Contents
1 Introduction: 1.1 Neural networks; 1.2 McCulloch-Pitts neurons; 1.3 Other models for neural computation; 1.4 Summary

I Hopfield networks
2 Deterministic Hopfield networks: 2.1 Associative memory problem; 2.2 Hopfield network; 2.3 Energy function; 2.4 Spurious states; 2.5 Summary; 2.6 Exercises; 2.7 Exam questions
3 Stochastic Hopfield networks: 3.1 Noisy dynamics; 3.2 Order parameters; 3.3 Mean-field theory; 3.4 Storage capacity; 3.5 Beyond mean-field theory; 3.6 Summary; 3.7 Further reading; 3.8 Exercises; 3.9 Exam questions
4 Stochastic optimisation: 4.1 Combinatorial optimisation problems; 4.2 Energy functions; 4.3 Simulated annealing; 4.4 Monte-Carlo simulation; 4.5 Summary; 4.6 Further reading; 4.7 Exercises

II Supervised learning
5 Perceptrons: 5.1 A classification task; 5.2 Iterative learning algorithm; 5.3 Gradient-descent learning; 5.4 Multi-layer perceptrons; 5.5 Summary; 5.6 Further reading; 5.7 Exercises; 5.8 Exam questions
6 Stochastic gradient descent: 6.1 Chain rule and error backpropagation; 6.2 Stochastic gradient-descent algorithm; 6.3 Recipes for improving the performance; 6.4 Summary; 6.5 Further reading; 6.6 Exercises; 6.7 Exam questions
7 Deep learning: 7.1 How many hidden layers?; 7.2 Training deep networks; 7.3 Convolutional networks; 7.4 Learning to read handwritten digits; 7.5 Deep learning for object recognition; 7.6 Residual networks; 7.7 Summary; 7.8 Further reading; 7.9 Exercises; 7.10 Exam questions
8 Recurrent networks: 8.1 Recurrent backpropagation; 8.2 Backpropagation through time; 8.3 Recurrent networks for machine translation; 8.4 Summary; 8.5 Further reading; 8.6 Exercises; 8.7 Exam questions

III Unsupervised learning
9 Unsupervised Hebbian learning: 9.1 Oja's rule; 9.2 Competitive learning; 9.3 Kohonen's algorithm; 9.4 Summary; 9.5 Exercises; 9.6 Exam questions
10 Radial basis-function networks: 10.1 Separating capacity of a surface; 10.2 Radial basis-function networks; 10.3 Summary; 10.4 Further reading; 10.5 Exercises; 10.6 Exam questions
11 Reinforcement learning: 11.1 Stochastic output units; 11.2 Associative reward-penalty algorithm; 11.3 Summary; 11.4 Further reading

1 Introduction

The term neural networks refers to networks of neurons in the mammalian brain. Neurons are its fundamental units of computation. In the brain they are connected together in networks to process data. This is a very complex task, and the dynamics of neural networks in the mammalian brain is therefore quite intricate. Inputs and outputs of each neuron vary as functions of time, in the form of so-called spike trains, but also the network itself changes. We learn and improve our data-processing capacities by establishing reconnections between neurons.

Neural-network algorithms are inspired by the architecture and the dynamics of networks of neurons in the brain. Yet these algorithms use representations of neurons that are highly simplified, compared with real neurons. Nevertheless, the fundamental principle is the same: artificial neural networks learn by re-connection. Such networks can perform a multitude of information-processing tasks.

They can learn to recognise structures in a set of training data and generalise what they have learnt to other data sets (supervised learning). A training set contains a list of input data sets, together with a list of the corresponding target values that encode the properties of the input data that the network is supposed to learn. Solving such association tasks with artificial neural networks can work well when the new data sets are governed by the same principles that gave rise to the training data.

A prime example for a problem of this type is object recognition in images, for instance in the sequence of camera images of a self-driving car. Recently the use of neural networks for object recognition has exploded. There are several reasons for this strong interest. It is driven, first, by the acute need for such algorithms in industry. Second, there are now much better image data bases available for training the networks. Third, there is better hardware (GPU processors), so that networks with many layers containing many neurons can be efficiently trained (deep learning) [4, 6].

Another task where neural networks excel is machine translation.
These networks are dynamical (recurrent) networks. They take an input sequence of words or sometimes single letters. As one feeds the inputs word by word, the network outputs the words in the translated sentence. Recurrent networks can be efficiently trained on large training sets of input sentences and their translations. Google translate works in this way [7].

Artificial neural networks are good at analysing high-dimensional data where it may be difficult to determine a priori which properties are of interest. In this case one often relies on unsupervised learning algorithms where the network learns without a training set. Instead it determines in terms of which categories the data can be analysed. In this way, artificial neural networks can detect familiarity (which input patterns occur most often), clusters, and other structures in the input data. Unsupervised-learning algorithms work well when there is redundancy in the input data that is not immediately obvious because the data is high dimensional.

In many problems some information about targets is known, yet incomplete. In this case one uses algorithms that contain elements of both supervised and unsupervised learning (reinforcement learning). Such algorithms are used, for instance, in the software AlphaGo [8] that plays the game of go.

The different algorithms have much in common. They share the same building blocks: the neurons are modeled as linear threshold units (McCulloch-Pitts neurons), and the learning rules are similar (Hebb's rule). Closely related questions arise also regarding the network dynamics. A little bit of noise (not too much!) can improve the performance, and ensures that the long-time dynamics approaches a steady state. This allows one to analyse the convergence of the algorithms using the central-limit theorem.

There are many connections to methods used in Mathematical Statistics, such as Markov-chain Monte-Carlo algorithms and simulated annealing. Certain unsupervised learning algorithms are related to principal component analysis, others to clustering algorithms such as k-means clustering.

Figure 1.1: Neurons in the cerebral cortex (outer layer of the cerebrum, the largest and best developed part of the mammalian brain) of a macaque, an Asian monkey. Reproduced by permission of brainmaps.org [9] under the Creative Commons Attribution 3.0 License. The labels (axon, neural cell body, dendrites) were added.

1.1 Neural networks

The mammalian brain consists of different regions that perform different tasks. The cerebral cortex is the outer layer of the mammalian brain. We can think of it as a thin sheet (about 2 to 5 mm thick) that folds upon itself to increase its surface area. The cortex is the largest and best developed part of the Human brain. It contains large numbers of nerve cells, neurons. The Human cerebral cortex contains about $10^{10}$ neurons. They are linked together by nerve strands (axons) that branch and end in synapses. These synapses are the connections to other neurons. The synapses connect to dendrites, branched extensions from the neural cell body designed to receive input from other neurons in the form of electrical signals. A neuron in the Human brain may have thousands of synaptic connections with other neurons. The resulting network of connected neurons in the cerebral cortex is responsible for perception of visual, audio, and sensory data, for language and image processing, and for memory.

Figure 1.1 shows neurons in the cerebral cortex of the macaque, an Asian monkey. The image displays a silver-stained cross section through the cerebral cortex. The brown and black parts are the neurons. One can distinguish the cell bodies of the neural cells, their axons, and their dendrites.

Figure 1.2 shows a more schematic view of a neuron. Information is processed from left to right. On the left are the dendrites that receive signals and connect to the cell body of the neuron where the signal is processed. The right part of the Figure shows the axon, through which the output is sent to the dendrites of other neurons.

Information is transmitted as an electrical signal. Figure 1.3 shows an example of the time series of the electric potential for a pyramidal neuron in fish [10]. The time series consists of an intermittent series of electrical-potential spikes. Quiescent periods without spikes occur when the neuron is inactive, during spike-rich periods the neuron is active.


Figure 1.2: Schematic image of a neuron. Dendrites receive input in the form of electrical signals, via synapses. The signals are processed in the cell body of the neuron. The output travels from the neural cell body to other neurons through the axon. (Labels in the figure: dendrites, synapses, cell body, axon, terminals; input, process, output.)

Figure 1.3: Spike train in electrosensory pyramidal neuron in fish (Eigenmannia). Time series from Ref. [10]. Reproduced by permission of the publisher.

1.2 McCulloch-Pitts neurons

In artificial networks, the ways in which information is processed and signals are transferred are highly simplified. The model we use nowadays for the computational unit, the artificial neuron, goes back to McCulloch and Pitts [11]. Rosenblatt [12, 13] described how to connect such units in artificial neural networks to process information. He referred to these networks as perceptrons.

In its simplest form, the model for the artificial neuron has only two states, active or inactive. The model works as a linear threshold unit: it processes all input signals and computes an output. If the output exceeds a given threshold then the state of the neuron is said to be active, otherwise inactive. The model is illustrated in Figure 1.4. Neurons usually perform repeated computations, and one divides up time into discrete time steps $t = 0, 1, 2, 3, \ldots$. The state of neuron number $j$ at time step $t$ is denoted by

$$ n_j(t) = \begin{cases} 0 & \text{inactive}, \\ 1 & \text{active}. \end{cases} \tag{1.1} $$

Figure 1.4: Schematic diagram of a McCulloch-Pitts neuron with inputs $n_1(t), \ldots, n_N(t)$, weights $w_{i1}, \ldots, w_{iN}$, and output $n_i(t+1) = \theta_H\big(\sum_{j=1}^{N} w_{ij} n_j(t) - \mu_i\big)$. The index of the neuron is $i$, it receives inputs from $N$ other neurons. The strength of the connection from neuron $j$ to neuron $i$ is denoted by $w_{ij}$. The function $\theta_H(b)$ (activation function) is the Heaviside function. It is equal to zero for $b < 0$ and equal to unity for $b > 0$. The threshold value for neuron $i$ is denoted by $\mu_i$. The index $t = 0, 1, 2, 3, \ldots$ labels the discrete time sequence of computation steps.

Figure 1.5: Heaviside function $\theta_H(b)$ as a function of $b$.

Given the signals $n_j(t)$, neuron number $i$ computes

$$ n_i(t+1) = \theta_H\Big( \sum_j w_{ij}\, n_j(t) - \mu_i \Big). \tag{1.2} $$

As written, this computation is performed for all neurons $i$ in parallel, and the outputs $n_i$ are the inputs to all neurons at the next time step, therefore the outputs have the time argument $t+1$. These steps are repeated many times, resulting in time series of the activity levels of all neurons in the network, referred to as neural dynamics.

The procedure described above is called synchronous updating. An alternative is to choose a neuron randomly (or following a prescribed deterministic rule), and to update only this one, instead of all together. This scheme is called asynchronous updating. If there are $N$ neurons, then one synchronous step corresponds to $N$ asynchronous steps, on average. This difference in time scales is not the only difference between synchronous and asynchronous updating. In general the two schemes yield different neural dynamics.

Now consider the details of the computation step, Equation (1.2). The function $\theta_H(b)$ is the activation function. Its argument is often referred to as the local field, $b_i(t) = \sum_j w_{ij}\, n_j(t) - \mu_i$. Since the neurons can only assume the states 0/1, the activation function is taken to be the Heaviside function, $\theta_H(b) = 0$ if $b < 0$ and $\theta_H(b) = 1$ if $b > 0$ (Figure 1.5). The Heaviside function is not defined at $b = 0$. To avoid problems in our computer algorithms, we usually take $\theta_H(0) = 1$.
Equation (1.2) shows that the neuron performs a weighted linear average of the inputs $n_j(t)$. The weights $w_{ij}$ are called synaptic weights. Here the first index, $i$, refers to the neuron that does the computation, and $j$ labels all neurons that connect to neuron $i$. The connection strengths between different pairs of neurons are in general different, reflecting different strengths of the synaptic couplings. When the value of $w_{ij}$ is positive, the coupling is called excitatory. When $w_{ij}$ is negative, the connection is called inhibitory. When $w_{ij} = 0$ there is no connection:

$$ w_{ij} \begin{cases} > 0 & \text{excitatory connection}, \\ = 0 & \text{no connection from } j \text{ to } i, \\ < 0 & \text{inhibitory connection}. \end{cases} \tag{1.3} $$

Finally, the threshold for neuron $i$ is denoted by $\mu_i$.

1.3 Other models for neural computation

The dynamics defined by Equations (1.1) and (1.2) is just a caricature of the time series of electrical signals in the cortex. For a start, the neural model described in the previous Section can only assume two states, instead of a continuous range of signal strengths. While real neurons produce time series of spikes, the model gives rise to time sequences of zeros and ones. The two states, 0 and 1, are meant to model the inactive and active periods shown in Figure 1.3. For many computation tasks this is quite sufficient, and for our purposes it does not matter that the dynamics of real neurons is so different. The aim is not to model the neural dynamics in the brain, but to construct computation models inspired by real neural dynamics.

In the course of these lectures it will become apparent that the simplest model described above must be generalised to achieve certain tasks. The most important generalisations are the following. Sometimes it is necessary to allow the neuron to respond continuously to its inputs. To this end one replaces Eq. (1.2) by

$$ n_i(t+1) = g\Big( \sum_j w_{ij}\, n_j(t) - \mu_i \Big) \quad \text{for all } i. \tag{1.4} $$

Here $g(b)$ is a continuous activation function. An example is shown in Figure 1.6. This dictates that the states assume continuous values too, not just the discrete values 0 and 1 as given in Equation (1.1).

Figure 1.6: Continuous activation function $g(b)$ as a function of $b$.

Equations (1.2) and (1.4) describe synchronous updating schemes, as mentioned above. At time step $t$ all inputs $n_j(t)$ are stored. All neurons $i$ are simultaneously updated using the stored inputs. Sometimes asynchronous updating is preferable.
At each updating step one chooses a single neuron. Say that we chose neuron number $i$. Then only this neuron is updated according to the rule:

$$ n_i(t+1) = \theta_H\Big( \sum_j w_{ij}\, n_j(t) - \mu_i \Big) \quad \text{for one chosen value of } i. \tag{1.5} $$

Different schemes for choosing neurons are used. One possibility is to arrange the neurons into an array and to update them one by one, in a certain order (typewriter scheme). A second possibility is to choose randomly which neuron to update. This introduces stochasticity into the neural dynamics. This is very important, and we will see that there are different ways of introducing stochasticity. Random asynchronous updating is one example. In many scientific problems it is advantageous to avoid stochasticity, when randomness is due to errors (multiplicative or additive noise) that diminish the performance of the system. In neural-network dynamics, by contrast, stochasticity is often helpful, as we shall see below.

1.4 Summary

Artificial neural networks use a highly simplified model for the fundamental computation unit, the neuron. In its simplest form, the model is just a binary threshold unit. The units are linked together by weights $w_{ij}$, and each unit computes a weighted average of its inputs. The network performs these computations in sequence. Usually one considers discrete sequences of computation time steps, $t = 0, 1, 2, 3, \ldots$. Either all neurons are updated simultaneously in one time step (synchronous updating), or only one chosen neuron is updated (asynchronous updating). Most neural-network algorithms are built using the model described in this Chapter.
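The two updating schemes of Equations (1.2) and (1.5) can be sketched in a few lines of Python. This is a minimal illustration, not an implementation from the lecture notes: the three-neuron weights, the thresholds, and all function names are hypothetical choices made for the example. The convention $\theta_H(0) = 1$ follows the text.

```python
def heaviside(b):
    """Heaviside function with the convention theta_H(0) = 1 used in the text."""
    return 1 if b >= 0 else 0


def local_field(w, mu, n, i):
    """Local field b_i = sum_j w[i][j] * n[j] - mu[i]."""
    return sum(w[i][j] * n[j] for j in range(len(n))) - mu[i]


def synchronous_step(w, mu, n):
    """Eq. (1.2): all neurons are updated in parallel from the stored state n(t)."""
    return [heaviside(local_field(w, mu, n, i)) for i in range(len(n))]


def asynchronous_step(w, mu, n, i):
    """Eq. (1.5): only neuron i is updated, all other states are copied."""
    updated = list(n)
    updated[i] = heaviside(local_field(w, mu, n, i))
    return updated


# Hypothetical three-neuron network (weights and thresholds are assumptions).
w = [[0.0, 1.0, -1.0],
     [1.0, 0.0, 1.0],
     [-1.0, 1.0, 0.0]]
mu = [0.5, 0.5, 0.5]
n0 = [1, 0, 1]

n_sync = synchronous_step(w, mu, n0)          # every neuron sees n0
n_async = asynchronous_step(w, mu, n0, i=1)   # only neuron 1 changes
```

Starting from the same state, the two schemes generally produce different successor states, which is the point made above about synchronous and asynchronous dynamics.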

PART I HOPFIELD NETWORKS

2 Deterministic Hopfield networks

The Hopfield network [14] is an artificial neural network that can recognise or reconstruct images. Consider for example the binary images of digits in Figure 2.1. These images can be stored in the artificial neural network by assigning the weights $w_{ij}$ in a certain way (called Hebb's rule). Then one feeds a distorted image of one of the digits (Figure 2.2) to the network by assigning the initial states of the neurons in the network to the bits in the distorted image. The idea is that the neural-network dynamics converges to the correct undistorted digit. In this way the network can recognise the input as a distorted image of the correct digit (retrieve this digit). The point is that the network may recognise patterns with many bits very efficiently.

Figure 2.1: Binary representation of the digits 0 to 4. Each digit has $16 \times 10$ pixels.

This idea is quite old though. In the past such networks were used to perform pattern recognition tasks. Today there are more efficient algorithms for this purpose (Chapter 7). Yet the first part of these lectures deals with Hopfield networks, for several reasons. First, Hopfield nets form the basis for more recent algorithms such as Boltzmann machines [2] and deep-belief networks [2]. Second, all other neural-network algorithms discussed in these lectures are built from the same building blocks and use learning rules that are closely related to Hebb's rule. Third, Hopfield networks can solve optimisation problems, and the resulting algorithm is closely related to Markov-chain Monte-Carlo algorithms which are much used for a wide range of problems in Physics and Mathematical Statistics. Fourth, and most importantly, a certain degree of noise (not too much) can substantially improve the performance of Hopfield networks, and it is understood in detail why. The reason that so much is known about the role of noise in Hopfield networks is that they are closely related to stochastic systems studied in Physics, namely random magnets and spin glasses. The point is: understanding the effect of noise on the dynamics of Hopfield networks helps to analyse the performance of other neural-network models.

2.1 Associative memory problem

The pattern-recognition task described above (Figure 2.1) is an example of an associative-memory problem: there are $p$ images (patterns), each with $N$ bits. Examples for such sets of patterns are the letters in the alphabet, or the digits shown in Figure 2.1. The different patterns are labeled by the index $\mu = 1, \ldots, p$. The bits of pattern $\mu$ are denoted by $x_i^{(\mu)}$. The index $i$ labels the bits of a given pattern, it ranges from 1 to $N$. The bits are binary: they can take only the values 0 and 1, as illustrated in Figure 2.2. To determine the generic properties of the algorithm, one often turns to random patterns where each bit $x_i^{(\mu)}$ is chosen randomly. Each bit takes either value with probability $\tfrac{1}{2}$, and different bits (in the same and in different patterns) are independent. It is convenient to gather the bits of a pattern in a column vector

$$ \boldsymbol{x}^{(\mu)} = \begin{bmatrix} x_1^{(\mu)} \\ x_2^{(\mu)} \\ \vdots \\ x_N^{(\mu)} \end{bmatrix}. \tag{2.1} $$

In the following, vectors are written in bold math font.

Figure 2.2: Binary image ($N = 160$) of the digit 0, and a distorted version of the same image. The pixels take the values $x_i = 1$ or $x_i = 0$, with $i = 1, \ldots, N$.

The first part of the problem is to store the patterns $\boldsymbol{x}^{(1)}$ to $\boldsymbol{x}^{(p)}$. Second, one feeds a test pattern $\boldsymbol{x}$, a distorted version of one of the binary patterns in the problem. The aim is to determine which one of the stored patterns $\boldsymbol{x}^{(\mu)}$ most closely resembles $\boldsymbol{x}$. The problem is, in other words, to associate the test pattern with the closest one of the stored patterns. The formulation of the problem requires defining how close two given patterns are to each other. One possibility is to use the Hamming distance. For patterns with 0/1 bits, the Hamming distance $h_\mu$ between the patterns $\boldsymbol{x}$ and $\boldsymbol{x}^{(\mu)}$ is defined as

$$ h_\mu = \sum_{i=1}^{N} \Big[ x_i^{(\mu)} (1 - x_i) + \big(1 - x_i^{(\mu)}\big)\, x_i \Big]. \tag{2.2} $$

The Hamming distance equals the number of bits by which the patterns differ. Two patterns are identical if they have Hamming distance zero. For 0/1 patterns, Equation (2.2) is equivalent to:

$$ \frac{h_\mu}{N} = \frac{1}{N} \sum_{i=1}^{N} \big( x_i^{(\mu)} - x_i \big)^2. \tag{2.3} $$

This means that the Hamming distance is given by the mean-squared error, summed over all bits. Note that the Hamming distance does not refer to distortions by translations, rotations, or shearing. An improved version of the distance involves taking the minimum distance between the patterns subject to all possible translations, rotations, and so forth.

In summary, the association task is to find the index $\nu$ for which the Hamming distance $h_\nu$ is minimal, $h_\nu \le h_\mu$ for all $\mu = 1, \ldots, p$. How can one solve this task using a neural network? One feeds the distorted pattern $\boldsymbol{x}$ with bits $x_i$ into the network by assigning $n_i(t = 0) = x_i$. Assume that $\boldsymbol{x}$ is a distorted version of $\boldsymbol{x}^{(\nu)}$. Now the idea is to find a set of weights $w_{ij}$ so that the network dynamics converges to the correct stored pattern:

$$ n_i(t) \to x_i^{(\nu)} \quad \text{as } t \to \infty. \tag{2.4} $$
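The two expressions for the Hamming distance, Equations (2.2) and (2.3), and the association task itself can be checked directly on small examples. The following Python sketch uses made-up five-bit patterns and hypothetical function names. It solves the association task by brute-force comparison, not by a neural network, and is meant only to make the definitions concrete.

```python
def hamming_bitwise(x_stored, x_test):
    """Eq. (2.2): count positions where exactly one of the two bits equals 1."""
    return sum(a * (1 - b) + (1 - a) * b for a, b in zip(x_stored, x_test))


def hamming_squared(x_stored, x_test):
    """N times Eq. (2.3): sum of squared bit differences for 0/1 patterns."""
    return sum((a - b) ** 2 for a, b in zip(x_stored, x_test))


def closest_pattern(stored, x_test):
    """Index nu of the stored pattern with minimal Hamming distance to x_test."""
    distances = [hamming_bitwise(pattern, x_test) for pattern in stored]
    return min(range(len(stored)), key=distances.__getitem__)


# Made-up 0/1 patterns with N = 5 bits (illustrative assumption).
stored = [[0, 1, 1, 0, 1],
          [1, 1, 0, 0, 0]]
x = [0, 1, 1, 1, 1]  # differs from stored[0] in one bit

nu = closest_pattern(stored, x)
```

For 0/1 bits both functions return the same integer, because $(a - b)^2$ equals 1 exactly when the bits differ.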

Figure 2.3: Signum function $\operatorname{sgn}(b)$ as a function of $b$.

Which weights to choose depends on the patterns $\boldsymbol{x}^{(\mu)}$, so the weights must be functions of $\boldsymbol{x}^{(\mu)}$. We say that we store these patterns in the network by choosing the appropriate weights. If the network converges as in Equation (2.4), the pattern $\boldsymbol{x}^{(\nu)}$ is said to be an attractor in the space of all possible states of this network, the so-called configuration space or state space.

2.2 Hopfield network

Hopfield [14] used a network of McCulloch-Pitts neurons to solve the associative memory problem described in the previous Section. The states of neuron $i$ in the Hopfield network take the values

$$ S_i(t) = \begin{cases} -1 & \text{inactive}, \\ +1 & \text{active}, \end{cases} \tag{2.5} $$

instead of 0/1, because this simplifies the mathematical analysis as we shall see. The transformation from $n_i \in \{0, 1\}$ to $S_i \in \{-1, 1\}$ is straightforward:

$$ S_i = 2 n_i - 1. \tag{2.6} $$

The thresholds are transformed accordingly, $\theta_i = 2\mu_i - \sum_j w_{ij}$. For the bits of the patterns we keep the symbol $x_i^{(\mu)}$, but it is important to remember that now $x_i^{(\mu)} = \pm 1$. To ensure that the $S_i$ can only take the values $\pm 1$, the activation function is taken to be the signum function (Figure 2.3),

$$ \operatorname{sgn}(b) = \begin{cases} -1, & b < 0, \\ +1, & b > 0. \end{cases} \tag{2.7} $$

The signum function is not defined at $b = 0$. To avoid problems in our computer algorithms we usually define $\operatorname{sgn}(0) = 1$.

The asynchronous update rule takes the form

$$ S_i \leftarrow \operatorname{sgn}\Big( \underbrace{\sum_j w_{ij} S_j - \theta_i}_{\equiv\, b_i} \Big) \quad \text{for one chosen value of } i. \tag{2.8} $$

As before $i$ is the index of the chosen neuron. The arrow indicates that $S_i(t+1)$ is assigned the r.h.s. of this equation evaluated at $S_j(t)$. The synchronous update rule reads

$$ S_i \leftarrow \operatorname{sgn}\Big( \underbrace{\sum_j w_{ij} S_j - \theta_i}_{\equiv\, b_i} \Big), \tag{2.9} $$

where all bits are updated in parallel. In Eqs. (2.8) and (2.9) the argument of the activation function is denoted by $b_i$, sometimes called the local field.

Now we need a strategy for choosing the weights $w_{ij}$, so that the patterns $\boldsymbol{x}^{(\mu)}$ are attractors. If one feeds a pattern $\boldsymbol{x}$ close to $\boldsymbol{x}^{(\nu)}$ to the network, we want the network to converge to $\boldsymbol{x}^{(\nu)}$:

$$ S_i(t = 0) = x_i \approx x_i^{(\nu)}; \qquad S_i(t) \to x_i^{(\nu)} \quad \text{as } t \to \infty. \tag{2.10} $$

This means that the network succeeds in correcting a small number of errors. If the number of errors is too large, the network may converge to another pattern. The region in configuration space around pattern $\boldsymbol{x}^{(\nu)}$ in which all patterns converge to $\boldsymbol{x}^{(\nu)}$ is called the region of attraction of $\boldsymbol{x}^{(\nu)}$.

However, we shall see that it is in general very difficult to prove convergence according to Eq. (2.10). Therefore we try to answer a different question first: if one feeds one of the undistorted patterns $\boldsymbol{x}^{(\nu)}$, does the network recognise that it is one of the stored, undistorted patterns? The network should not make any changes to $\boldsymbol{x}^{(\nu)}$ because all bits are correct:

$$ S_i(t = 0) = x_i^{(\nu)}; \qquad S_i(t) = x_i^{(\nu)} \quad \text{for all } t = 0, 1, 2, \ldots. \tag{2.11} $$

Even this question is in general difficult to answer. We therefore consider a simple limit of the problem first, namely $p = 1$. There is only one pattern to recognise, $\boldsymbol{x}^{(1)}$. A suitable choice of weights $w_{ij}$ is given by Hebb's rule

$$ w_{ij} = \frac{1}{N}\, x_i^{(1)} x_j^{(1)} \quad \text{and} \quad \theta_i = 0. \tag{2.12} $$

We say that the pattern $\boldsymbol{x}^{(1)}$ is stored in the network by assigning the weights $w_{ij}$ using the rule (2.12). Note that the weights are symmetric, $w_{ij} = w_{ji}$. To check that the rule (2.12) does the trick, feed the pattern to the network by assigning $S_j(t = 0) = x_j^{(1)}$, and evaluate Equation (2.8):

$$ \sum_{j=1}^{N} w_{ij}\, x_j^{(1)} = \frac{1}{N} \sum_{j=1}^{N} x_i^{(1)} x_j^{(1)} x_j^{(1)} = \frac{1}{N} \sum_{j=1}^{N} x_i^{(1)}. \tag{2.13} $$

The last equality follows because $x_j^{(1)}$ can only take the values $\pm 1$, so that $[x_j^{(1)}]^2 = 1$. The remaining sum over $j$ contains $N$ identical terms and evaluates to $N x_i^{(1)}$, so

$$ \operatorname{sgn}\Big( \sum_{j=1}^{N} w_{ij}\, x_j^{(1)} \Big) = x_i^{(1)}. \tag{2.14} $$

Recall that $x_i^{(\mu)} = \pm 1$, so that $\operatorname{sgn}(x_i^{(\mu)}) = x_i^{(\mu)}$.
Comparing Equation (2.14) with the update rule (2.8) shows that the bits $x_j^{(1)}$ of the pattern $\boldsymbol{x}^{(1)}$ remain unchanged under the update, as required by Eq. (2.11). The network recognises the pattern as a stored one, so Hebb's rule (2.12) does what we asked.

But does the network correct small errors? In other words, is the pattern $\boldsymbol{x}^{(1)}$ an attractor [Eq. (2.10)]? This question cannot be answered in general. Yet in practice Hopfield models often work very well! It is a fundamental insight that neural networks may work well although it is impossible to strictly prove that their dynamics converges to the correct solution.

To illustrate the difficulties consider an example, a Hopfield network with $p = 1$ and $N = 4$ (Figure 2.4). Store the pattern $\boldsymbol{x}^{(1)}$ shown in Figure 2.4 by assigning the weights $w_{ij}$ using Hebb's rule (2.12). Now feed a distorted pattern $\boldsymbol{x}$ to the network that has a non-zero distance to $\boldsymbol{x}^{(1)}$:

$$ h = \frac{1}{4} \sum_{i=1}^{4} \big( x_i - x_i^{(1)} \big)^2 > 0. \tag{2.15} $$
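Hebb's rule (2.12) and the synchronous update (2.9) with $\theta_i = 0$ are easy to try out for a network with $N = 4$ neurons. The sketch below is only an illustration: the pattern `x1`, the distorted inputs, and the function names are assumptions chosen for the example, and $\operatorname{sgn}(0) = 1$ is used as in the text.

```python
def sgn(b):
    """Signum function with the convention sgn(0) = 1."""
    return 1 if b >= 0 else -1


def hebb_single(x):
    """Eq. (2.12): w_ij = x_i * x_j / N for one stored pattern, thresholds zero."""
    n = len(x)
    return [[x[i] * x[j] / n for j in range(n)] for i in range(n)]


def synchronous_update(w, s):
    """Eq. (2.9) with theta_i = 0: all bits are updated in parallel."""
    n = len(s)
    return [sgn(sum(w[i][j] * s[j] for j in range(n))) for i in range(n)]


def distance(s, x):
    """Eq. (2.15): number of differing bits for +/-1 patterns."""
    return sum((a - b) ** 2 for a, b in zip(s, x)) / 4


# Illustrative +/-1 pattern with N = 4 bits (an assumption for this sketch).
x1 = [1, -1, -1, 1]
w = hebb_single(x1)

unchanged = synchronous_update(w, x1)               # stored pattern is a fixed point
one_wrong = synchronous_update(w, [-1, -1, -1, 1])  # distance 1 from x1
three_wrong = synchronous_update(w, [-1, 1, -1, -1])  # distance 3 from x1
```

With these inputs, the pattern with one wrong bit is mapped onto `x1` in a single step, while the pattern with three wrong bits is mapped onto the inverted pattern.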


Figure 2.4: Hopfield network with $N = 4$ neurons. (a) Network layout. The network has four neurons. The arrows indicate symmetric connections. (b) Pattern $\boldsymbol{x}^{(1)} = [1, -1, -1, 1]^{\sf T}$. Here ${\sf T}$ denotes the transpose of the column vector $\boldsymbol{x}^{(1)}$.

Figure 2.5: Reconstruction of a distorted image (left). Under synchronous updating (2.9) the first two distorted images (a) and (b) converge to the stored pattern $\boldsymbol{x}^{(1)}$ (right), but pattern (c) does not.

The factor $\tfrac{1}{4}$ in Equation (2.15) takes into account that the patterns take the values $\pm 1$ and not 0/1 as in Section 2.1. To feed the pattern to the network, one sets $S_i(t = 0) = x_i$. Now iterate the dynamics using synchronous updating (2.9). Results for different distorted patterns are shown in Figure 2.5. We see that the first two distorted patterns (distance 1) converge to the stored pattern, cases (a) and (b). But the third distorted pattern does not [case (c)].

To understand this behaviour it is most convenient to analyse the synchronous dynamics using the weight matrix

$$ \mathbb{W} = \frac{1}{N}\, \boldsymbol{x}^{(1)} \boldsymbol{x}^{(1){\sf T}}. \tag{2.16} $$

Here $\boldsymbol{x}^{(1){\sf T}}$ denotes the transpose of the column vector $\boldsymbol{x}^{(1)}$, so that $\boldsymbol{x}^{(1){\sf T}}$ is a row vector. The standard rules for matrix multiplication apply also to column and row vectors, they are just $N \times 1$ and $1 \times N$ matrices. This means that the product on the r.h.s. of Equation (2.16) is an $N \times N$ matrix. In the following, matrices with elements $A_{ij}$ or $B_{ij}$ are written as $\mathbb{A}$, $\mathbb{B}$, and so forth. The product in Equation (2.16) is also referred to as an outer product. The product

$$ \boldsymbol{x}^{(1){\sf T}} \boldsymbol{x}^{(1)} = \sum_{j=1}^{N} \big[ x_j^{(1)} \big]^2 = N, \tag{2.17} $$

by contrast, is just a number (equal to $N$). The product (2.17) is also called scalar product. It is denoted by $\boldsymbol{x}^{(1)} \cdot \boldsymbol{x}^{(1)} = \boldsymbol{x}^{(1){\sf T}} \boldsymbol{x}^{(1)}$.

Using Equation (2.17) we see that $\mathbb{W}$ projects onto the vector $\boldsymbol{x}^{(1)}$,

$$ \mathbb{W} \boldsymbol{x}^{(1)} = \boldsymbol{x}^{(1)}. \tag{2.18} $$

In the same way we can show that the matrix $\mathbb{W}$ is idempotent:

$$ \mathbb{W}^n = \mathbb{W} \quad \text{for } n = 1, 2, 3, \ldots. \tag{2.19} $$

Equations (2.18) and (2.19) mean that the network recognises the pattern $\boldsymbol{x}^{(1)}$ as the stored one. The pattern is not updated [Eq. (2.11)]. This example illustrates the general proof, Equations (2.13) and (2.14).

Now consider the distorted pattern (a) in Figure 2.5. We feed this pattern to the network by assigning

$$ \boldsymbol{S}(t = 0) = [-1, -1, -1, 1]^{\sf T}. \tag{2.20} $$

To compute one step in the synchronous dynamics (2.9) we simply apply $\mathbb{W}$ to $\boldsymbol{S}(t = 0)$. This is done in two steps, using the outer-product form (2.16) of the weight matrix. We first multiply $\boldsymbol{S}(t = 0)$ with $\boldsymbol{x}^{(1){\sf T}}$ from the left,

$$ \boldsymbol{x}^{(1){\sf T}} \boldsymbol{S}(t = 0) = [1, -1, -1, 1]\, [-1, -1, -1, 1]^{\sf T} = 2, \tag{2.21} $$

and then we multiply this result with $\boldsymbol{x}^{(1)}$ and divide by $N = 4$. This gives:

$$ \mathbb{W} \boldsymbol{S}(t = 0) = \tfrac{1}{2}\, \boldsymbol{x}^{(1)}. \tag{2.22} $$

The signum of the $i$-th component of the vector $\mathbb{W} \boldsymbol{S}(t = 0)$ yields $S_i(t = 1)$:

$$ S_i(t = 1) = \operatorname{sgn}\Big( \sum_{j=1}^{N} w_{ij}\, S_j(t = 0) \Big) = x_i^{(1)}. \tag{2.23} $$

This means that the state of the network converges to the stored pattern, in one synchronous update. Since $\mathbb{W}$ is idempotent, the network stays there: the pattern $\boldsymbol{x}^{(1)}$ is an attractor. Case (b) in Figure 2.5 works in a similar way.

Now look at case (c), where the network fails to converge to the stored pattern. We feed this pattern to the network by assigning $\boldsymbol{S}(t = 0) = [-1, 1, -1, -1]^{\sf T}$. For one iteration of the synchronous dynamics we first evaluate

$$ \boldsymbol{x}^{(1){\sf T}} \boldsymbol{S}(0) = [1, -1, -1, 1]\, [-1, 1, -1, -1]^{\sf T} = -2. \tag{2.24} $$

It follows that

$$ \mathbb{W} \boldsymbol{S}(t = 0) = -\tfrac{1}{2}\, \boldsymbol{x}^{(1)}. \tag{2.25} $$

Using the update rule (2.9) we find

$$ \boldsymbol{S}(t = 1) = -\boldsymbol{x}^{(1)}. \tag{2.26} $$

Equation (2.19) implies that

$$ \boldsymbol{S}(t) = -\boldsymbol{x}^{(1)} \quad \text{for } t \ge 1. \tag{2.27} $$

Thus the network shown in Figure 2.4 has two attractors, the pattern $\boldsymbol{x}^{(1)}$ as well as the inverted pattern $-\boldsymbol{x}^{(1)}$. This is a general property of McCulloch-Pitts dynamics with Hebb's rule: if $\boldsymbol{x}^{(1)}$ is an attractor, then the pattern $-\boldsymbol{x}^{(1)}$ is an attractor too. But one ends up in the correct pattern $\boldsymbol{x}^{(1)}$ when more than half of the bits in $\boldsymbol{S}(t = 0)$ are correct.

In summary we have shown that Hebb's rule (2.12) allows the Hopfield network to recognise a stored pattern: if we feed the stored pattern without any distortions to the network, then it does not change the bits. This does not mean, however, that the network recognises distorted patterns. It may or may not converge to the correct pattern. We expect that convergence is more likely when the number of wrong bits is small. If all distorted patterns near the stored pattern $\boldsymbol{x}^{(1)}$ converge to $\boldsymbol{x}^{(1)}$ then we say that $\boldsymbol{x}^{(1)}$ is an attractor. If $\boldsymbol{x}^{(1)}$ is an attractor, then $-\boldsymbol{x}^{(1)}$ is too.

When there is more than one pattern then Hebb's rule (2.12) must be generalised. A guess is to simply sum Equation (2.12) over the stored patterns:

$$ w_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} x_i^{(\mu)} x_j^{(\mu)} \quad \text{and} \quad \theta_i = 0 \tag{2.28} $$

(Hebb's rule for $p > 1$ patterns). As for $p = 1$ the weight matrix is symmetric, $\mathbb{W} = \mathbb{W}^{\sf T}$, so that $w_{ij} = w_{ji}$. The diagonal weights are not zero in general. An alternative version of Hebb's rule [2] defines the diagonal weights to be zero:

$$ w_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} x_i^{(\mu)} x_j^{(\mu)} \ \text{ for } i \ne j, \quad w_{ii} = 0, \quad \text{and} \quad \theta_i = 0. \tag{2.29} $$

If we store only one pattern, $p = 1$, this modified Hebb's rule (2.29) satisfies Equation (2.11). In this Section we use Equation (2.29).

If we assign the weights according to Equation (2.29), does the network recognise distorted patterns? We saw in the previous Section that this question is difficult to answer in general, even for $p = 1$. Therefore we ask, first, whether the network recognises the stored pattern $\boldsymbol{x}^{(\nu)}$. The question is whether

$$ \operatorname{sgn}\Big( \underbrace{\frac{1}{N} \sum_{j \ne i} \sum_{\mu} x_i^{(\mu)} x_j^{(\mu)} x_j^{(\nu)}}_{\equiv\, b_i^{(\nu)}} \Big) = x_i^{(\nu)}. \tag{2.30} $$

To check whether Equation (2.30) holds, we must repeat the calculation described above for a single pattern [Equations (2.13) and (2.14)]. As a first step we evaluate the argument of the signum function,

$$ b_i^{(\nu)} = \Big( 1 - \frac{1}{N} \Big) x_i^{(\nu)} + \frac{1}{N} \sum_{j \ne i} \sum_{\mu \ne \nu} x_i^{(\mu)} x_j^{(\mu)} x_j^{(\nu)}. \tag{2.31} $$
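The decomposition (2.31) of the local field into a term proportional to $x_i^{(\nu)}$ and a sum over the remaining patterns can be verified numerically. The sketch below draws random $\pm 1$ patterns, builds the weights from Equation (2.29), and compares both sides of Equation (2.31). The sizes $p = 3$ and $N = 50$, the seed, and the function names are arbitrary assumptions made for this illustration.

```python
import random


def random_patterns(p, n, rng):
    """p patterns of n independent bits, each +1 or -1 with probability 1/2."""
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(p)]


def hebb_weights(patterns):
    """Eq. (2.29): w_ij = (1/N) sum_mu x_i x_j for i != j, and w_ii = 0."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                w[i][j] = sum(x[i] * x[j] for x in patterns) / n
    return w


def local_field(w, s, i):
    """b_i = sum_j w_ij s_j with zero threshold."""
    return sum(w[i][j] * s[j] for j in range(len(s)))


def signal_and_remainder(patterns, nu, i):
    """The two terms on the r.h.s. of Eq. (2.31) for bit i of pattern nu."""
    n = len(patterns[0])
    x_nu = patterns[nu]
    signal = (1 - 1 / n) * x_nu[i]
    remainder = sum(
        x[i] * x[j] * x_nu[j]
        for mu, x in enumerate(patterns) if mu != nu
        for j in range(n) if j != i
    ) / n
    return signal, remainder


rng = random.Random(1)  # fixed seed so that the run is reproducible
patterns = random_patterns(p=3, n=50, rng=rng)
w = hebb_weights(patterns)

# Largest deviation between the two sides of Eq. (2.31) over all bits of pattern 0.
max_deviation = max(
    abs(local_field(w, patterns[0], i) - sum(signal_and_remainder(patterns, 0, i)))
    for i in range(50)
)
```

Up to floating-point rounding, `max_deviation` vanishes, which confirms that the two terms add up to the local field computed from the weights.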

In Equation (2.31) we have split the sum over the patterns into two contributions. The first term corresponds to $\mu = \nu$, where $\nu$ refers to the pattern that was fed to the network, the one that we want the network to recognise. The second term in Equation (2.31) contains the sum over the remaining patterns. For large $N$ we can approximate $(1 - \tfrac{1}{N}) \approx 1$. It follows that condition (2.30) is satisfied if the second term in (2.31) does not affect the sign of the r.h.s. of this Equation. This second term is called cross-talk term.

Whether adding the cross-talk term to $x_i^{(\nu)}$ changes the signum of the r.h.s. of Equation (2.31) or not, depends on the stored patterns. Since the cross-talk term contains a sum over $\mu$ we may expect that the cross-talk term does not matter if $p$ is small enough. If this is true for all $i$ and $\nu$ then all $p$ stored patterns are recognised. Furthermore, by analogy with the example described in the previous Section, we may expect that the stored patterns are then also attractors, so that slightly distorted patterns converge to the correct stored pattern: patterns close to $\boldsymbol{x}^{(\nu)}$ converge to $\boldsymbol{x}^{(\nu)}$ under the network dynamics (but this is not guaranteed).

For a more quantitative analysis of the effect of the cross-talk term we store patterns with random bits (random patterns),

$$ \operatorname{Prob}\big( x_i^{(\nu)} = \pm 1 \big) = \tfrac{1}{2}. \tag{2.32} $$

Different bits (different values of $i$ and/or $\mu$) are assigned independent random values. This means that different patterns are uncorrelated because their covariance vanishes:

$$ \big\langle x_i^{(\mu)} x_j^{(\nu)} \big\rangle = \delta_{ij}\, \delta_{\mu\nu}. \tag{2.33} $$

Here $\langle \cdots \rangle$ denotes an ensemble average over many realisations of random patterns, and $\delta_{ij}$ is the Kronecker delta, equal to unity if $i = j$ but zero otherwise. Note that it follows from Equation (2.32) that $\langle x_i^{(\mu)} \rangle = 0$.

We now ask: what is the probability that the cross-talk term changes the signum of the r.h.s. of Equation (2.31)? In other words, what is the probability that the network produces a wrong bit in one asynchronous update, if all bits were initially correct? The magnitude of the cross-talk term does not matter when it has the same sign as $x_i^{(\nu)}$.
If t has a dfferent sgn, then the cross-talk term may matter. It does f ts magntude s larger than unty (the magntude of x (ν) ). To smplfy the analyss one wants to avod havng to dstngush between the two cases, whether or not the cross-talk term has the same sgn as x (ν). To ths end one defnes: If C (ν) C (ν) x (ν) x (µ) x (µ) µ ν x (ν) N } {{ } cross-talk term. (2.34) < 0 then the cross-talk term has same sgn as x (ν), so that t does not matter. If 0 < C (ν) t does not matter ether, only when C (ν) < >. The network produces an error n updatng neuron f C (ν) > for partcular values and ν: f ntally S (0) = x (ν), the sgn of the bt changes under the update although t should not so that an error results. How frequently does ths happen? For random patterns we can answer ths queston by computng the one-step (t = ) error probablty: Snce patterns and bts are dentcally dstrbuted, Prob(C (ν) does not carry any ndces. P t = error P t = (ν) error = Prob(C > ). (2.35) > ) does not depend on or ν. Therefore
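The definition (2.34) and the probability (2.35) can be checked numerically. The following is a minimal sketch, assuming random patterns as in Equation (2.32) and zero diagonal weights. The function names `cross_talk` and `estimate_one_step_error`, the seed, and the parameter values are illustrative choices, not part of the original notes.

```python
import random

def cross_talk(patterns, i, nu):
    """C_i^(nu) of Equation (2.34) for a list of +/-1 patterns (terms j = i excluded)."""
    n = len(patterns[0])
    total = 0
    for mu, x in enumerate(patterns):
        if mu == nu:
            continue  # the term mu = nu is the signal, not cross talk
        for j in range(n):
            if j != i:
                total += x[i] * x[j] * patterns[nu][j]
    return -patterns[nu][i] * total / n

def estimate_one_step_error(n, p, trials, seed=0):
    """Monte-Carlo estimate of Prob(C > 1), Equation (2.35), over random patterns."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        patterns = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(p)]
        # By symmetry it suffices to look at bit i = 0 of pattern nu = 0.
        if cross_talk(patterns, 0, 0) > 1:
            errors += 1
    return errors / trials

print(estimate_one_step_error(n=50, p=10, trials=2000))
```

Because bits and patterns are identically distributed, sampling a single bit of a single pattern per realisation is enough. The estimate fluctuates with the seed and the number of trials.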

How does $P_{\rm error}^{t=1}$ depend on the parameters of the problem, $p$ and $N$? When both $p$ and $N$ are large we can use the central-limit theorem to answer this question. Since different bits/patterns are independent, we can think of $C_i^{(\nu)}$ as a sum of independent random numbers $c_m$ that take the values $-1$ and $+1$ with equal probabilities,

$$ C_i^{(\nu)} = -\frac{1}{N} \sum_{j \neq i} \sum_{\mu \neq \nu} x_i^{(\mu)} x_j^{(\mu)} x_j^{(\nu)} x_i^{(\nu)} = -\frac{1}{N} \sum_{m=1}^{(N-1)(p-1)} c_m. \tag{2.36} $$

There are $M = (N-1)(p-1)$ terms in the sum on the r.h.s. because terms with $\mu = \nu$ are excluded, and also those with $j = i$ [Equation (2.29)]. If we use Equation (2.28) instead, then there is a correction to Equation (2.36) from the diagonal weights. For $p \ll N$ this correction is small.

When $p$ and $N$ are large, the sum $\sum_m c_m$ contains a large number of independent, identically distributed random numbers with mean zero and variance unity. It follows from the central-limit theorem that $\frac{1}{N} \sum_m c_m$ is Gaussian distributed with mean zero, and with variance

$$ \sigma_C^2 = \frac{1}{N^2} \Big\langle \Big( \sum_{m=1}^{M} c_m \Big)^2 \Big\rangle = \frac{1}{N^2} \sum_{n=1}^{M} \sum_{m=1}^{M} \langle c_n c_m \rangle. \tag{2.37} $$

Here $\langle \cdots \rangle$ denotes an average over the ensemble of realisations of $c_m$. Since the random numbers $c_m$ are independent for different indices, $\langle c_n c_m \rangle = \delta_{nm}$. So only the diagonal terms in the double sum contribute, summing up to $M \approx N p$. Therefore

$$ \sigma_C^2 \approx \frac{p}{N}. \tag{2.38} $$

One way of showing that the distribution of $\sum_m c_m$ is approximately Gaussian is to represent it in terms of Bernoulli trials. The sum $\sum_{m=1}^{M} c_m$ equals $2k - M$ where $k$ is the number of occurrences of $+1$ in the sum. Since the probability of $c_m = \pm 1$ is $\tfrac{1}{2}$, the probability of drawing $k$ times $+1$ and $M - k$ times $-1$ is

$$ P_{k,M} = \binom{M}{k} \Big(\frac{1}{2}\Big)^{k} \Big(\frac{1}{2}\Big)^{M-k}. \tag{2.39} $$

Here

$$ \binom{M}{k} = \frac{M!}{k!\,(M-k)!} \tag{2.40} $$

is the number of ways in which $k$ occurrences of $+1$ can be distributed over $M$ places. We expect that the quantity $2k - M$ is Gaussian distributed with mean zero and variance $M$. To demonstrate this, it is convenient to use the variable $z = (2k - M)/\sqrt{M}$, which is Gaussian with mean zero and unit variance. Therefore we substitute $k = \frac{M}{2} + \frac{\sqrt{M}}{2} z$ into Equation (2.39) and take the limit of large $M$ using Stirling's approximation

$$ n! \approx e^{\,n \log n - n + \frac{1}{2} \log 2\pi n}. \tag{2.41} $$

Expanding $P_{k,M}$ to leading order in $M^{-1}$, assuming that $z$ remains of order unity, gives $P_{k,M} = \sqrt{2/(\pi M)}\, \exp(-z^2/2)$. From $P(z)\,dz = P(k)\,dk$ it follows that $P(z) = (\sqrt{M}/2)\, P(k)$, so that $P(z) = (2\pi)^{-1/2} \exp(-z^2/2)$. So the distribution of $z$ is Gaussian with zero mean and unit variance, as we intended to show.

In summary, the distribution of $C$ is Gaussian

$$ P(C) = (2\pi\sigma_C^2)^{-1/2} \exp[-C^2/(2\sigma_C^2)] \tag{2.42} $$

with mean zero and variance $\sigma_C^2 \approx p/N$, as illustrated in Figure 2.6. To determine the one-step error probability we must integrate this distribution from $1$ to $\infty$:

$$ P_{\rm error}^{t=1} = \frac{1}{\sqrt{2\pi\sigma_C^2}} \int_{1}^{\infty} dC\, e^{-\frac{C^2}{2\sigma_C^2}} = \frac{1}{2} \Big[ 1 - \mathrm{erf}\Big( \sqrt{\frac{N}{2p}} \Big) \Big]. \tag{2.43} $$

Here erf is the error function defined as

$$ \mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \int_{0}^{z} dx\, e^{-x^2}. \tag{2.44} $$

This function is tabulated.

Figure 2.6: Gaussian distribution of $C_i^{(\nu)}$. The hashed area equals the error probability $P_{\rm error}^{t=1}$.

Since $\mathrm{erf}(z)$ increases as $z$ increases we conclude that $P_{\rm error}^{t=1}$ increases as $p$ increases, or as $N$ decreases. This is expected: it is more difficult for the network to recognise stored patterns when there are more of them. On the other hand, it is easier to distinguish between stored patterns if they have more bits. We also see that the one-step error probability depends on $p$ and $N$ only through the combination

$$ \alpha \equiv \frac{p}{N}. \tag{2.45} $$

The parameter $\alpha$ is called the storage capacity of the network. Figure 2.7 shows how $P_{\rm error}^{t=1}$ depends on the storage capacity. Take $\alpha = 0.185$ for example. Then the one-step error probability (the probability of an error in one asynchronous attempt to update a bit) is about 1%.

Figure 2.7: Dependence of the one-step error probability $P_{\rm error}^{t=1}$ on the storage capacity $\alpha$ according to Equation (2.43), schematic.

The error probability defined in this Section refers only to the initial update, the first iteration. What happens in the next iteration, and after many iterations? Numerical experiments show that the error probability can be much higher in later iterations, because more errors tend to increase the probability of making another error. The estimate $P_{\rm error}^{t=1}$ is a lower bound. Also, realistic patterns are not random with independent bits. We nevertheless expect that $P_{\rm error}^{t=1}$ describes the typical one-step error probability of the Hopfield network when $p$ and $N$ are large.
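Equation (2.43) is easy to evaluate with the error function from Python's standard library. The sketch below assumes the Gaussian approximation derived above. The function name `one_step_error_probability` and the sample values of $\alpha$ are illustrative.

```python
import math

def one_step_error_probability(alpha):
    """Equation (2.43) written in terms of alpha = p/N: (1/2) [1 - erf(sqrt(1/(2 alpha)))]."""
    return 0.5 * (1.0 - math.erf(math.sqrt(1.0 / (2.0 * alpha))))

for alpha in (0.05, 0.1, 0.185, 0.3):
    print(f"alpha = {alpha:5.3f}  P_error = {one_step_error_probability(alpha):.4f}")
```

The value at $\alpha = 0.185$ can be compared with the 1% quoted in the text.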

However, it is straightforward to construct counter examples. Consider for example orthogonal patterns:

$$ x^{(\mu)} \cdot x^{(\nu)} = 0 \quad \text{for} \quad \mu \neq \nu. \tag{2.46} $$

The cross-talk term vanishes in this case, so that $P_{\rm error}^{t=1} = 0$.

2.3 Energy function

The energy function is defined as

$$ H = -\frac{1}{2} \sum_{ij} w_{ij} S_i S_j. \tag{2.47} $$

The name comes from an analogy to spin systems in Physics. An alternative name for $H$ is Hamiltonian. The energy function (2.47) is important because it allows us to analyse the convergence of the dynamics of the Hopfield model. More generally, energy functions are important tools in analysing the convergence of different kinds of neural networks. A second reason for considering the energy function is that it allows us to derive Hebb's rule in a different way.

We can write the energy function as

$$ H = -\frac{1}{2} \sum_{\langle ij \rangle} (w_{ij} + w_{ji})\, S_i S_j + \text{const.} \tag{2.48} $$

The constant is independent of $S_i$ and $S_j$. Further, $\langle ij \rangle$ denotes that the sum is performed over connections (or bonds) between pairs of neurons. Note that Hebb's rule yields symmetric weights, $w_{ij} = w_{ji}$. For symmetric weights it follows that $H$ cannot increase under the dynamics of the Hopfield model. In each step $H$ either decreases, or it remains constant. To show this consider the update

$$ S_i' = \mathrm{sgn}\Big( \sum_{j} w_{ij} S_j \Big). \tag{2.49} $$

There are two possibilities, either $S_i' = S_i$ or $S_i' = -S_i$. In the first case $H$ remains unchanged, $H' = H$. Here $H'$ refers to the value of the energy function after the update (2.49). The other case is $S_i' = -S_i$. Then

$$ H' - H = -\frac{1}{2} \sum_{j \neq i} (w_{ij} + w_{ji}) (S_i' S_j - S_i S_j) = \sum_{j \neq i} (w_{ij} + w_{ji})\, S_i S_j. \tag{2.50} $$

The sum goes over all neurons $j$ that are connected to the neuron $i$ that is updated in Equation (2.49). Now if the weights are symmetric, $H' - H$ equals

$$ H' - H = 2 \sum_{j \neq i} w_{ij} S_i S_j. \tag{2.51} $$

Since the sign of $\sum_{j \neq i} w_{ij} S_j$ is that of $S_i'$, and since the sign of $S_i'$ differs from that of $S_i$, it follows from Equation (2.51) that

$$ H' - H < 0. \tag{2.52} $$

So either $H$ remains constant, or its value decreases in one update step.
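The statement that $H$ cannot increase can be checked in a small simulation. The following is a sketch only, assuming Hebb's rule with zero diagonal weights (so that the weights are symmetric), random patterns, and the convention $\mathrm{sgn}(0) = +1$, which is an assumption made here. The variable names, the seed, and the network size are illustrative.

```python
import random

def energy(w, s):
    """Energy function H of Equation (2.47)."""
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

rng = random.Random(0)
n, p = 30, 4
patterns = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(p)]

# Hebb's rule with zero diagonal weights: w_ij = w_ji by construction.
w = [[0.0 if i == j else sum(x[i] * x[j] for x in patterns) / n
      for j in range(n)] for i in range(n)]

s = [rng.choice((-1, 1)) for _ in range(n)]  # random initial state
energies = [energy(w, s)]
for _ in range(300):
    i = rng.randrange(n)                       # asynchronous update of one neuron
    b = sum(w[i][j] * s[j] for j in range(n))  # local field of neuron i
    s[i] = 1 if b >= 0 else -1                 # assumption: sgn(0) = +1
    energies.append(energy(w, s))

# Allow for rounding errors when comparing consecutive energies.
non_increasing = all(e2 <= e1 + 1e-12 for e1, e2 in zip(energies, energies[1:]))
print("energy never increased:", non_increasing)
```

Replacing the weights by a non-symmetric matrix is a simple way to see where the symmetry assumption enters.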

Figure 2.8: Minima in the energy function are attractors in state space. Not all minima correspond to stored patterns, and stored patterns need not correspond to minima. (Schematic: energy $H$ plotted against the network states, with minima at $x^{(1)}$, $x^{(2)}$ and at a spurious state $x^{(\rm spurious)}$.)

Since the energy $H$ cannot increase in the Hopfield dynamics, we see that minima of the energy function must correspond to attractors, as illustrated schematically in Figure 2.8. The state space of the network, corresponding to all possible choices of $(S_1, \ldots, S_N)$, is illustrated schematically by a single axis, the $x$-axis. But when $N$ is large, the state space is really very high dimensional.

Not all stored patterns are attractors. Our analysis of the cross-talk term showed this. If the cross-talk term causes errors for a certain stored pattern that is fed into the network, then this pattern is not located at a minimum of the energy function. Conversely there may be minima that do not correspond to stored patterns. Such states are referred to as spurious states. The network may converge to spurious states. This is undesirable but inevitable.

We now turn to an alternative derivation of Hebb's rule that uses the energy function. To keep things simple we assume $p = 1$ at first. The idea is to write down an energy function that assumes a minimum at the stored pattern $x^{(1)}$:

$$ H = -\frac{1}{2N} \Big( \sum_{i=1}^{N} S_i x_i^{(1)} \Big)^2. \tag{2.53} $$

This function is minimal when $S_i = x_i^{(1)}$ for all $i$ (and also when $S_i = -x_i^{(1)}$). You will see in a moment why the factor $1/(2N)$ is inserted. Now we evaluate the expression on the r.h.s.:

$$ H = -\frac{1}{2N} \Big( \sum_{i} S_i x_i^{(1)} \Big) \Big( \sum_{j} S_j x_j^{(1)} \Big) = -\frac{1}{2} \sum_{ij} \underbrace{\Big( \frac{1}{N} x_i^{(1)} x_j^{(1)} \Big)}_{= w_{ij}} S_i S_j. \tag{2.54} $$

This shows that the function $H$ has the same form as the energy function (2.47) for the Hopfield model, if we assign the weights $w_{ij}$ according to Hebb's rule (2.2). Thus this argument provides an alternative motivation for this rule: we write down an energy function that has a minimum at the stored pattern $x^{(1)}$. This ensures that this pattern is an attractor. Evaluating the function we see that it corresponds to choosing the weights according to Hebb's rule (2.2).

We know that this strategy can fail when $p > 1$. How can this happen? For $p > 1$ the analogue of Equation (2.53) is

$$ H = -\frac{1}{2N} \sum_{\mu=1}^{p} \Big( \sum_{i=1}^{N} S_i x_i^{(\mu)} \Big)^2. \tag{2.55} $$

Here the patterns $x^{(\nu)}$ are not necessarily minima of $H$, because a maximal value of $\big( \sum_{i=1}^{N} S_i x_i^{(\nu)} \big)^2$ may be compensated by terms stemming from other patterns. But one can hope that this happens rarely when $p$ is small (Section 2.2).

2.4 Spurious states

Stored patterns may be minima of the energy function (attractors), but they need not be. In addition there can be other attractors (spurious states), different from the stored patterns. For example, since $H$ is invariant under $S \to -S$, it follows that the pattern $-x^{(\mu)}$ is an attractor if $x^{(\mu)}$ is an attractor. We consider the inverted patterns as spurious states. The network may converge to the inverted patterns, as we saw in Section 2.2.

There are other types of spurious states. Examples are linear combinations of an odd number $n$ of patterns. Such states are called mixed states. For $n = 3$, for example, the bits are given by

$$ x_i^{(\rm mix)} = \mathrm{sgn}\big( \pm x_i^{(1)} \pm x_i^{(2)} \pm x_i^{(3)} \big). \tag{2.56} $$

Mixed states come in large numbers, $2^{2n+1} \binom{p}{2n+1}$, the more the larger $n$.

It is difficult to determine under which circumstances the network dynamics converges to a certain mixed state. But we can at least check whether a mixed state is recognised by the network (although we do not want it to do that). As an example consider the mixed state

$$ x_i^{(\rm mix)} = \mathrm{sgn}\big( x_i^{(1)} + x_i^{(2)} + x_i^{(3)} \big). \tag{2.57} $$

To check whether this state is recognised, we must determine whether or not

$$ \mathrm{sgn}\Big( \frac{1}{N} \sum_{\mu=1}^{p} \sum_{j=1}^{N} x_i^{(\mu)} x_j^{(\mu)} x_j^{(\rm mix)} \Big) = x_i^{(\rm mix)}, \tag{2.58} $$

under the update (2.8) using Hebb's rule (2.28). To this end we split the sum in the usual fashion

$$ \frac{1}{N} \sum_{\mu=1}^{p} \sum_{j=1}^{N} x_i^{(\mu)} x_j^{(\mu)} x_j^{(\rm mix)} = \sum_{\mu=1}^{3} x_i^{(\mu)}\, \frac{1}{N} \sum_{j=1}^{N} x_j^{(\mu)} x_j^{(\rm mix)} + \text{cross-talk term}. \tag{2.59} $$

Table 2.1: Signs of $s_\mu = x_j^{(\mu)} x_j^{(\rm mix)}$.

    x_j^(1)  x_j^(2)  x_j^(3)  x_j^(mix)   s_1   s_2   s_3
       1        1        1        1         1     1     1
       1        1       -1        1         1     1    -1
       1       -1        1        1         1    -1     1
       1       -1       -1       -1        -1     1     1
      -1        1        1        1        -1     1     1
      -1        1       -1       -1         1    -1     1
      -1       -1        1       -1         1     1    -1
      -1       -1       -1       -1         1     1     1

Let us ignore the cross-talk term for the moment and check whether the first term reproduces $x_i^{(\rm mix)}$. To make progress we assume random patterns [Equation (2.32)], and compute the probability that the sum on the r.h.s. of Equation (2.59) yields $x_i^{(\rm mix)}$. Bits $i$ and $j$ are uncorrelated, and the sum over $j$ on the r.h.s. of Equation (2.59) is an average over $s_\mu = x_j^{(\mu)} x_j^{(\rm mix)}$. Table 2.1 lists all possible combinations of the bits $x_j^{(1)}$, $x_j^{(2)}$, $x_j^{(3)}$ and the corresponding values of $s_\mu$. We see that on average $\langle s_\mu \rangle = \tfrac{1}{2}$, so that

$$ \frac{1}{N} \sum_{\mu=1}^{p} \sum_{j=1}^{N} x_i^{(\mu)} x_j^{(\mu)} x_j^{(\rm mix)} = \frac{1}{2} \sum_{\mu=1}^{3} x_i^{(\mu)} + \text{cross-talk term}. \tag{2.60} $$

Neglecting the cross-talk term and taking the sgn-function we see that $x_i^{(\rm mix)}$ is reproduced. So mixed states such as (2.57) are recognised, at least for small $\alpha$, and it may happen that the network converges to these states, loosely referred to as superpositions of odd numbers of patterns.

Finally, for large values of $p$ there are local minima of $H$ that are not correlated with any number of the stored patterns $x^{(\mu)}$. Such spin-glass states are discussed further in the book by Hertz, Krogh and Palmer [1].

2.5 Summary

We have analysed the dynamics of the Hopfield network as a means of solving the associative memory problem (Algorithm 1). The Hopfield network is a network of McCulloch-Pitts neurons. Its layout is defined by connection strengths $w_{ij}$, chosen according to Hebb's rule. These weights $w_{ij}$ are symmetric, and the network is in general fully connected. Hebb's rule ensures that stored patterns $x^{(\mu)}$ are recognised, at least most of the time if the storage capacity is not too large. A single-step estimate for the error probability was given in Section 2.2. If one iterates several steps, the error probability is generally much larger, and it is difficult to evaluate it. It turns out that it is much simpler to compute the error probability when noise is introduced into the network dynamics (not just random patterns).
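The deterministic dynamics summarised here (store patterns with Hebb's rule, feed a distorted pattern, update asynchronously, read out) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used for the course: the function names `hebb_weights`, `recognise` and `overlap` are hypothetical, the neuron to update is chosen uniformly at random, diagonal weights are set to zero as in Equation (2.29), and $\mathrm{sgn}(0) = +1$ is assumed.

```python
import random

def hebb_weights(patterns):
    """Hebb's rule with zero diagonal weights, as in Equation (2.29)."""
    n = len(patterns[0])
    return [[0.0 if i == j else sum(x[i] * x[j] for x in patterns) / n
             for j in range(n)] for i in range(n)]

def recognise(weights, distorted, steps, rng):
    """Asynchronous deterministic updates, starting from the distorted pattern."""
    s = list(distorted)
    n = len(s)
    for _ in range(steps):
        i = rng.randrange(n)                             # choose a neuron at random
        b = sum(weights[i][j] * s[j] for j in range(n))  # local field
        s[i] = 1 if b >= 0 else -1                       # assumption: sgn(0) = +1
    return s

def overlap(a, b):
    """Normalised overlap (1/N) sum_i a_i b_i of two +/-1 vectors."""
    return sum(u * v for u, v in zip(a, b)) / len(a)

rng = random.Random(1)
n, p = 100, 3
patterns = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(p)]
w = hebb_weights(patterns)

distorted = list(patterns[0])
for i in rng.sample(range(n), 10):   # flip 10 of the 100 bits
    distorted[i] = -distorted[i]

recovered = recognise(w, distorted, 20 * n, rng)
print("overlap before:", overlap(distorted, patterns[0]))
print("overlap after: ", overlap(recovered, patterns[0]))
```

With $\alpha = p/N = 0.03$ the cross-talk term is small, so one expects the overlap with the stored pattern to increase. Raising `p` at fixed `n` shows how recognition degrades.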
Algorithm 1 pattern recognition with deterministic Hopfield model
1: store patterns $x^{(\mu)}$ using Hebb's rule;
2: feed distorted pattern $x$ into network by assigning $S(t = 0) \leftarrow x$;
3: for $t = 1, \ldots, T$ do
4:   choose a value of $i$ and update $S_i(t) \leftarrow \mathrm{sgn}\big( \sum_{j} w_{ij} S_j(t-1) \big)$;
5: end for
6: read out pattern $S(T)$;
7: end;

2.6 Exercises

Modified Hebb's rule. Show that the modified Hebb's rule (2.29) satisfies Equation (2.) if we store only one pattern, for $p = 1$.

Orthogonal patterns. Show that the cross-talk term vanishes for orthogonal patterns, so that $P_{\rm error}^{t=1} = 0$.

Correction to cross-talk term. Evaluate the magnitude of the correction term in Equation (2.36) and show that it is negligible if $p \ll N$. Show that the correction term vanishes if we set the diagonal weights to zero, $w_{ii} = 0$. Compare the error probabilities for large values of $p/N$ when $w_{ii} = 0$ and when $w_{ii} \neq 0$. Explain why the error probability for large $\alpha$ is much smaller in the latter case.

Mixed states. Explain why there are no mixed states that are superpositions of an even number of stored patterns.

One-step error probability for mixed states. Write a computer program implementing the asynchronous deterministic dynamics of a Hopfield network to determine the one-step error probability for the mixed state (2.57). Plot how the one-step error probability depends on $\alpha$ for $N = 50$ and $N = 100$. Repeat this exercise for mixed patterns that are superpositions of the bits of 5 and 7 patterns.

Energy function. For the Hopfield network with two neurons shown in Figure 2.9 demonstrate that the energy function cannot increase under the deterministic dynamics. Write the energy function as $H = -\frac{w_{12} + w_{21}}{2} S_1 S_2$ and use the update rule $S_1' = \mathrm{sgn}(w_{12} S_2)$. In which step do you need to assume that the weights are symmetric, to prove that $H$ cannot increase?

Figure 2.9: Hopfield network with two neurons, connected by the weights $w_{12}$ and $w_{21}$.

2.7 Exam questions

2.7.1 One-step error probability

In a deterministic Hopfield network, the state $S_i$ of the $i$-th neuron is updated according to the McCulloch-Pitts rule $S_i \leftarrow \mathrm{sgn}\big( \sum_{j=1}^{N} w_{ij} S_j \big)$, where $N$ is the number of neurons in the model, $w_{ij}$ are the weights, and $p$ patterns $x^{(\mu)}$ are stored by assigning the weights according to Hebb's rule, $w_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} x_i^{(\mu)} x_j^{(\mu)}$ for $i \neq j$, and $w_{ii} = 0$.

(a) Apply pattern $x^{(\nu)}$ to the network. Derive the condition for bit $x_j^{(\nu)}$ of this pattern to be unchanged after a single asynchronous update. Express this condition in terms of the cross-talk term. (0.5p).

(b) Store $p$ patterns with random bits ($x_j^{(\mu)} = \pm 1$ with probability $\tfrac{1}{2}$) in the Hopfield network using Hebb's rule. Apply pattern $x^{(\nu)}$ to the network. For large $p$ and $N$, derive an approximate expression for the probability that bit $x_j^{(\nu)}$ is unchanged after a single asynchronous update. (1p).


2.7.2 Hopfield network with four neurons

The pattern shown in Fig. 2.10 is stored in a Hopfield network using Hebb's rule $w_{ij} = \frac{1}{N} x_i^{(1)} x_j^{(1)}$. There are $2^4$ four-bit patterns. Apply each of these to the Hopfield network, and perform one synchronous update. List the patterns you obtain and discuss your results. (1p).

Figure 2.10: The pattern $x^{(1)}$ has $N = 4$ bits, $x_1^{(1)} = 1$, and $x_i^{(1)} = -1$ for $i = 2, 3, 4$. Question 2.7.2.

2.7.3 Recognising letters with a Hopfield network

Figure 2.11 shows five patterns, each with $N = 32$ bits. Store the patterns $x^{(1)}$ and $x^{(2)}$ in a Hopfield network using Hebb's rule $w_{ij} = \frac{1}{N} \sum_{\mu=1}^{2} x_i^{(\mu)} x_j^{(\mu)}$ with $i, j = 1, \ldots, N$. Use the update rule $S_i \leftarrow \mathrm{sgn}\big( \sum_{j=1}^{N} w_{ij} S_j \big)$. Feed the patterns into the network. To determine their fate, follow the steps outlined below.

(a) Compute $\sum_{j=1}^{N} x_j^{(\mu)} x_j^{(\nu)}$, for $\mu = 1$, $\nu = 1, \ldots, 5$, and also for $\mu = 2$, $\nu = 1, \ldots, 5$. Hint: the result can be read off from the Hamming distances between the patterns shown in Figure 2.11. (0.5p).

(b) Consider the quantity $b_i^{(\nu)} = \sum_{j=1}^{N} w_{ij} x_j^{(\nu)}$, where $w_{ij}$ are the weights obtained by storing patterns $x^{(1)}$ and $x^{(2)}$. Compute $b_i^{(\nu)}$ for $\nu = 1, \ldots, 5$. Express your result as linear combinations of $x_i^{(1)}$ and $x_i^{(2)}$. Hint: use your answer to the first part of this question. (1p).

(c) Feed the patterns in Figure 2.11 to the network. Which of the patterns remain the same after one synchronous update using (2.49)? (0.5p).

Figure 2.11: Each of the five patterns $x^{(1)}, \ldots, x^{(5)}$ consists of 32 bits $x_i^{(\mu)}$. A black pixel $i$ in pattern $\mu$ corresponds to $x_i^{(\mu)} = 1$, a white one to $x_i^{(\mu)} = -1$. Question 2.7.3.

2.7.4 Energy function for deterministic Hopfield network

In a deterministic Hopfield network the energy function $H$ is defined as

$$ H = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} S_i S_j. \tag{2.61} $$

Here $N$ is the number of neurons, $w_{ij}$ are the weights, and the state $S_i$ of neuron $i$ is equal to $\pm 1$. The update rule is

$$ S_i \leftarrow \mathrm{sgn}\Big( \sum_{j=1}^{N} w_{ij} S_j \Big). \tag{2.62} $$

(a) Use Hebb's rule: $w_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} x_i^{(\mu)} x_j^{(\mu)}$ for $i \neq j$, and $w_{ii} = 0$. Show that $H$ either decreases or stays constant after a single asynchronous update (2.62). Which property of weights assures that this is the case? (1p).

(b) Assume that the weights are $w_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} x_i^{(\mu)} x_j^{(\mu)}$ for all $i, j$. In this case, how does $H$ change after a single asynchronous update according to Eq. (2.62)? Compare to the result of (a). Discuss. (1p).

2.7.5 Diluted Hopfield network

In the diluted Hopfield network with $N$ neurons, only a fraction $K/N \ll 1$ of the weights $w_{ij}$ is active:

$$ w_{ij} = \frac{K_{ij}}{K} \sum_{\mu=1}^{p} x_i^{(\mu)} x_j^{(\mu)}. \tag{2.63} $$

Here $K_{ij}$ is an element of a random connectivity matrix $\mathbb{K}$ with elements

$$ K_{ij} = \begin{cases} 1, & \text{with probability } K/N, \\ 0, & \text{otherwise}. \end{cases} \tag{2.64} $$

Here $K$ is the average number of connections to neuron $i$, $\big\langle \sum_{j=1}^{N} K_{ij} \big\rangle_c = K$, where $\langle \ldots \rangle_c$ denotes the average over random realisations of $\mathbb{K}$. The bits $x_i^{(\mu)}$ of the stored patterns $x^{(\mu)}$ ($\mu = 1, \ldots, p$ and $i = 1, \ldots, N$) are random: $x_i^{(\mu)} = 1$ or $-1$ with probability $\tfrac{1}{2}$. The update rule for $S_i$ is the usual one:

$$ S_i \leftarrow \mathrm{sgn}(b_i) \quad \text{with} \quad b_i = \sum_{j=1}^{N} w_{ij} S_j. \tag{2.65} $$

Following the steps (a)-(c) below, derive an approximate expression for

$$ m_\nu = \frac{1}{N} \sum_{i=1}^{N} x_i^{(\nu)} \big\langle \langle S_i \rangle_c \big\rangle \tag{2.66} $$

in the limit of $N \gg 1$, $K \gg 1$, and $1 \ll p \ll N$. Here $\langle \ldots \rangle_c$ is the average defined above, and the outer average is over the network dynamics.

(a) Assume that $S = (S_1, \ldots, S_N)^{\sf T}$ is equal to $x^{(\nu)}$. Assuming a single synchronous update [Eq. (2.65)] with weights given by Equation (2.63), derive an expression for $\langle b_i^{(\nu)} \rangle_c$. Write the expression using the average $\langle C_i^{(\nu)} \rangle_c$ of the "cross-talk term" over random connections. (0.5p).

(b) Show that the distribution $P(\langle C_i^{(\nu)} \rangle_c)$ of the cross-talk term is Gaussian in the limit of $K \gg 1$, $p \gg 1$, $N \gg 1$, and determine the mean and the variance of the distribution. (1p).

(c) In the limit of $N \gg 1$, $\big\langle \langle S_i \rangle_c \big\rangle \approx \langle S_i \rangle_c$. Use this and replace $\langle S_i \rangle_c$ on the right-hand side of Eq. (2.66) by $\mathrm{sgn}(\langle b_i^{(\nu)} \rangle_c)$, where $\langle b_i^{(\nu)} \rangle_c$ is the expression from (a). Then use that $\frac{1}{K} \sum_{j=1}^{N} K_{ij} x_j^{(\nu)} \langle S_j \rangle_c \approx m_\nu$ for $K \gg 1$. Finally, on the right-hand side of the resulting expression, approximate $\frac{1}{N} \sum_{i=1}^{N}$ by an integral over the distribution $P(\langle C_i^{(\nu)} \rangle_c)$ you obtained in (b). Evaluate the integral to find an approximate expression for $m_\nu$. (1.5p).

2.7.6 Mixed states

Consider $p$ random patterns $x^{(\mu)}$ ($\mu = 1, \ldots, p$) with $N$ bits $x_i^{(\mu)}$ ($i = 1, \ldots, N$), equal to 1 or -1 with probability $\tfrac{1}{2}$. Store the patterns in a deterministic Hopfield network using Hebb's rule $w_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} x_i^{(\mu)} x_j^{(\mu)}$. In the limit of $N \gg 1$, $p \gg 1$, $p \ll N$, show that the network recognises bit $x_i^{(\rm mix)}$ of the mixed state $x^{(\rm mix)}$ with bits

$$ x_i^{(\rm mix)} = \mathrm{sgn}\big( x_i^{(1)} + x_i^{(2)} + x_i^{(3)} \big), \tag{2.67} $$

after a single asynchronous update $S_i \leftarrow \mathrm{sgn}\big( \sum_{j=1}^{N} w_{ij} S_j \big)$. Follow the steps outlined below.

(a) Feed the mixed state (2.67) to the network. Use the weights $w_{ij}$ you obtained by applying Hebb's rule and express $\sum_{j=1}^{N} w_{ij} x_j^{(\rm mix)}$ in terms of $s_\mu$, defined by $s_\mu = \frac{1}{N} \sum_{j=1}^{N} x_j^{(\mu)} x_j^{(\rm mix)}$, for $\mu = 1 \ldots p$. (0.5p).

(b) Assume that the bits $x_i^{(\mu)}$ are independent random numbers, equal to 1 or -1 with equal probabilities. What is the value of $s_\mu$ for $\mu = 1, 2$ and $3$? What is the value of $s_\mu$ for $\mu > 3$? (1p).

(c) Rewrite the expression you derived in (a) as a sum of two terms. The first term is a sum over $\mu = 1, 2, 3$. The second term is the cross-talk term, a sum over the remaining values of $\mu$. Explain why the cross-talk term can be neglected in the limit stated above. (0.5p).

(d) Combine the results of (a), (b) and (c) to show that the network recognises the mixed state (2.67). (0.5p).

3 Stochastic Hopfield networks

Two related problems became apparent in the previous Chapter. First, the Hopfield dynamics may get stuck in spurious minima. In fact, if there is a local minimum downhill from a given initial state, between this state and the correct attractor, then the dynamics gets stuck in the local minimum, so that the algorithm fails to converge to the correct attractor. Second, the energy function usually is a strongly varying function over a high-dimensional state space. Therefore it is difficult to predict the long-time dynamics of the network. Which is the first local minimum encountered on the down-hill path that the network takes?

Both problems are solved by introducing a little bit of noise into the dynamics. This is a trick that works for many neural-network algorithms. But in general it is very challenging to analyse the noisy dynamics. For the Hopfield network, by contrast, much is known. The reason is that the stochastic Hopfield network is closely related to systems studied in statistical mechanics, so-called spin glasses. Like these systems, and like many other physical systems, the stochastic Hopfield network exhibits an order-disorder transition. This transition becomes sharp in the limit of large $N$. It may be that the network produces satisfactory results for a given number of patterns with a certain number of bits. But if one tries to store just one more pattern, the network may fail to recognise anything. The goal of this Chapter is to explain why this occurs, and how it can be avoided.

3.1 Noisy dynamics

The update rule (2.8) can be written as

$$ S_i \leftarrow \mathrm{sgn}(b_i), \tag{3.1} $$

where $b_i$ is the local field. This rule is called deterministic, because a given set of states $S_i$ determines the outcome of the update. To introduce noise, one replaces the rule (3.1) by the stochastic rule

$$ S_i \leftarrow \begin{cases} +1 & \text{with probability } \mathcal{P}(b_i), \\ -1 & \text{with probability } 1 - \mathcal{P}(b_i), \end{cases} \tag{3.2} $$

where $\mathcal{P}(b)$ is the function

$$ \mathcal{P}(b) = \frac{1}{1 + e^{-2\beta b}} \tag{3.3} $$

shown in Figure 3.1.

Figure 3.1: Probability function (3.3) used in the definition of the stochastic rule (3.2), plotted for $\beta = 10$ and $\beta = 0$.
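A single stochastic update according to Equations (3.2) and (3.3) might be coded as follows. This is a sketch with hypothetical names (`prob_plus`, `stochastic_update`). It uses the identity $1/(1 + e^{-2\beta b}) = \tfrac{1}{2}[1 + \tanh(\beta b)]$, which is a choice made here to avoid numerical overflow of the exponential for large $\beta |b|$.

```python
import math
import random

def prob_plus(b, beta):
    """Probability function (3.3), written as (1 + tanh(beta*b)) / 2 to avoid overflow."""
    return 0.5 * (1.0 + math.tanh(beta * b))

def stochastic_update(weights, s, i, beta, rng):
    """Apply the stochastic rule (3.2) in place to neuron i of the state s."""
    b = sum(weights[i][j] * s[j] for j in range(len(s)))  # local field b_i
    s[i] = 1 if rng.random() < prob_plus(b, beta) else -1

# Two neurons coupled by a positive weight; at large beta the update
# follows the sign of the local field almost surely.
rng = random.Random(0)
w = [[0.0, 1.0], [1.0, 0.0]]
s = [1, -1]
stochastic_update(w, s, 0, beta=50.0, rng=rng)
print(s)
```

For $\beta = 0$ the function `prob_plus` returns $\tfrac{1}{2}$ for every local field, so the update ignores the weights entirely.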
The stochastic algorithm is very similar to the deterministic algorithm for the Hopfield model (Algorithm 1), only step 4 is different. The parameter $\beta$ is the noise parameter. Unfortunately it is defined the wrong way around. When $\beta$ is large the noise level is small. In particular one obtains the deterministic dynamics (3.1) as $\beta$ tends to infinity:

$$ \beta \to \infty \qquad \text{deterministic dynamics}. \tag{3.4} $$

In this limit, the function $\mathcal{P}(b)$ tends to zero if $b$ is negative, and to unity if $b$ is positive. So for $\beta \to \infty$, the stochastic update rule (3.2) is precisely equivalent to the deterministic rule (3.1). Conversely, when $\beta = 0$, the function $\mathcal{P}(b)$ simply equals $\tfrac{1}{2}$. In this case $S_i$ is updated randomly with equal probability to $-1$ or $+1$. The dynamics bears no reference to the stored patterns (contained in the local field $b_i$).

The idea is to run the network for a small but finite noise level, that is at a large value of $\beta$. Then the dynamics is very similar to the deterministic Hopfield dynamics analysed in the previous Chapter. But the noise allows the system to sometimes also go uphill, making it possible to escape spurious minima. Since the dynamics is noisy, it is necessary to rephrase the convergence criterion (2.4). This is discussed next.

3.2 Order parameters

If we feed one of the stored patterns, $x^{(1)}$ for example, then we want the noisy dynamics to stay in the vicinity of $x^{(1)}$. This can only work if the noise is weak enough. Success is measured by the order parameter $m_\mu$:

$$ m_\mu \equiv \lim_{T \to \infty} m_\mu(T). \tag{3.5} $$

Here

$$ m_\mu(T) = \frac{1}{T} \sum_{t=1}^{T} \Big( \frac{1}{N} \sum_{i=1}^{N} S_i(t)\, x_i^{(\mu)} \Big) \tag{3.6} $$

is the average of $\frac{1}{N} \sum_{i=1}^{N} S_i(t)\, x_i^{(\mu)}$ over the noisy dynamics of the network, for given bits $x_i^{(\mu)}$. Since we feed pattern $x^{(1)}$ to the network, we have initially $S_i(t = 0) = x_i^{(1)}$ and thus

$$ \frac{1}{N} \sum_{i=1}^{N} S_i(t = 0)\, x_i^{(1)} = 1. \tag{3.7} $$

After a transient, the quantity $\frac{1}{N} \sum_{i=1}^{N} S_i(t)\, x_i^{(1)}$ settles into a steady state, where it fluctuates around a mean value with a definite distribution that is independent of time $t$. If the network works well, we expect that $S_i(t)$ remains close to $x_i^{(1)}$, so that $m_1$ converges to a value of order unity as $T \to \infty$. Since there is noise, this mean value is usually smaller than unity.

Figure 3.2: Illustrates how the average $m_1(T)$ depends upon the total iteration time $T$. The light grey lines show different realisations of $m_1(T)$ for different realisations of the stored patterns, at a large but finite value of $N$. The thick red line is the average over the different realisations of patterns.

Figure 3.2 illustrates how the average (3.6) converges to a definite value when $T$ becomes large. For finite values of $N$ the mean $m_1$ depends upon the stored patterns. In this case it is useful to average $m_1$ over different realisations of stored patterns (thick red line in Figure 3.2). In the limit of $N \to \infty$, the mean $m_1$ is independent of the stored patterns. We say that the system is self averaging.

The other order parameters are expected to be small, because the bits of the patterns $x^{(2)}$ to $x^{(p)}$ are independent from those of $x^{(1)}$. As a consequence the individual terms in the sum in Equation (3.6) cancel upon summation, if $S_i(t) \approx x_i^{(1)}$. In summary, we expect that

$$ m_\mu \approx \begin{cases} 1 & \text{if } \mu = 1, \\ 0 & \text{otherwise}. \end{cases} \tag{3.8} $$

The aim is now to compute how $m_1$ depends on the values of $p$, $N$, and $\beta$.

3.3 Mean-field theory

Assume that the dynamics of the stochastic Hopfield network reaches a steady state, as illustrated in Figure 3.2. The order parameter is defined as a time average over the stochastic dynamics of the network, in the limit of $T \to \infty$. In practice we cannot choose $T$ to be infinite, but $T$ must be large enough so that the average can be estimated accurately, and so that any initial transient does not matter.

It is a challenging task to compute the average over the dynamics. To simplify the problem we consider first the case of only one neuron, $N = 1$. In this case there are no connections to other neurons, but we can still assume that the network is defined by a local field $b_1$, corresponding to the threshold $\theta_1$ in Equation (2.). To compute the order parameter we must evaluate the time average of $S_1(t)$. We denote this time average as follows:

$$ \langle S_1 \rangle = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} S_1(t). \tag{3.9} $$

Note that this is a different average from the one introduced with Equation (2.33). The average there is over random bits, here it is over the stochastic network dynamics (3.2).
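We can evaluate the time average $\langle S_1 \rangle$ using Equation (3.2):

$$ \langle S_1 \rangle = \mathrm{Prob}(S_1 = +1) - \mathrm{Prob}(S_1 = -1) = \mathcal{P}(b_1) - [1 - \mathcal{P}(b_1)]. \tag{3.10} $$

Equation (3.3) yields:

$$ \langle S_1 \rangle = \frac{e^{\beta b_1}}{e^{\beta b_1} + e^{-\beta b_1}} - \frac{e^{-\beta b_1}}{e^{\beta b_1} + e^{-\beta b_1}} = \tanh(\beta b_1). \tag{3.11} $$

Figure 3.3 illustrates how $\langle S_1 \rangle$ depends upon $b_1$. For weak noise levels (large $\beta$), the threshold $b_1$ acts as a bias. When $b_1$ is negative then $\langle S_1 \rangle \approx -1$, while $\langle S_1 \rangle \approx 1$ when $b_1$ is positive. So the state $S_1$ reflects the bias most of the time. When however the noise is large (small $\beta$), then $\langle S_1 \rangle \approx 0$. In this case the state variable $S_1$ is essentially unaffected by the bias: $S_1$ is equally likely to be negative as positive, so that its average evaluates to zero.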
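The result (3.11) can be compared with a direct simulation of a single neuron. The sketch below assumes a constant local field $b_1$ and approximates the limit $T \to \infty$ in Equation (3.9) by a long but finite run. The function name `time_average`, the seed, and the parameter values are illustrative.

```python
import math
import random

def time_average(b, beta, steps, seed=0):
    """Finite-T estimate of <S_1>, Equation (3.9), under the stochastic rule (3.2)."""
    rng = random.Random(seed)
    p_plus = 0.5 * (1.0 + math.tanh(beta * b))  # equals 1 / (1 + exp(-2 beta b))
    total = 0
    for _ in range(steps):
        total += 1 if rng.random() < p_plus else -1
    return total / steps

b, beta = 0.5, 1.0
print("simulation:  ", time_average(b, beta, steps=100000))
print("tanh(beta b):", math.tanh(beta * b))
```

The two numbers agree up to statistical fluctuations, which shrink as the number of steps increases.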

Figure 3.3: Average $\langle S_1 \rangle$ under the dynamics (3.2), as a function of $b_1$ for different noise levels.

How can we generalise this calculation to a Hopfield model with many neurons? It can be done approximately at least, in the limit of large values of $N$. Consider neuron number $i$. The fate of $S_i$ is determined by $b_i$. But $b_i$ in turn depends on all $S_j$:

$$ b_i(t) = \sum_{j=1}^{N} w_{ij} S_j(t). \tag{3.12} $$

This makes the problem challenging. But when $N$ is large we may assume that $b_i(t)$ remains essentially constant in the steady state, independent of $t$. The argument is that fluctuations of $S_j(t)$ average out when summing over $j$, at least when $N$ is large. In this case $b_i(t)$ is approximately given by its time average:

$$ b_i(t) = \langle b_i \rangle + \text{small fluctuations in the limit } N \to \infty. \tag{3.13} $$

Here the average local field $\langle b_i \rangle$ is called the mean field, and theories that neglect the small corrections in Equation (3.13) are called mean-field theories. Let us assume that the fluctuations in Equation (3.13) are negligible. Then the local field for neuron $i$ is approximately constant,

$$ b_i(t) \approx \langle b_i \rangle, \tag{3.14} $$

so that the problem of evaluating the average of $S_i(t)$ reduces to the case discussed above, $N = 1$. From Equations (2.29) and (3.11) one deduces that

$$ \langle S_i \rangle = \tanh(\beta \langle b_i \rangle) \quad \text{with} \quad \langle b_i \rangle = \frac{1}{N} \sum_{\mu} \sum_{j \neq i} x_i^{(\mu)} x_j^{(\mu)} \langle S_j \rangle. \tag{3.15} $$

These mean-field equations are a set of $N$ non-linear equations for $\langle S_i \rangle$, for given fixed patterns $x_i^{(\mu)}$. The aim is to solve these equations to determine the time averages $\langle S_i \rangle$, and then the order parameters from

$$ m_\mu = \frac{1}{N} \sum_{j=1}^{N} \langle S_j \rangle\, x_j^{(\mu)}. \tag{3.16} $$

If we initially feed pattern $x^{(1)}$ we hope that $m_1 \approx 1$ while $m_\mu \approx 0$ for $\mu \neq 1$. To determine under which circumstances this has a chance to work we express the mean field in terms of the order parameters $m_\mu$:

$$ \langle b_i \rangle = \frac{1}{N} \sum_{\mu=1}^{p} \sum_{j \neq i} x_i^{(\mu)} x_j^{(\mu)} \langle S_j \rangle \approx \sum_{\mu} x_i^{(\mu)} m_\mu. \tag{3.17} $$

38 30 STOCHASTIC HOPFIELD NETWORKS m β - Fgure 3.4: Solutons of the mean-feld equaton (3.2). The crtcal nose level s β c =. The dashed lne corresponds to an unstable soluton. The last equalty s only approxmate because the -sum n the defnton of m µ contans the term =. Whether or not to nclude ths term makes only a small dfference to m µ, n the lmt of large N. Now assume that m m wth m of order unty, and m µ 0 for µ. Then the frst term n the sum over µ domnates, provded that the small terms do not add up to a contrbuton of order unty. Ths s the case f α = p (3.8) N s small enough. Roughly speakng ths works f α s of order N (a more precse estmate s that α s at most (log N )/N [5]). Now we use the mean-feld equatons (3.5) to approxmate Applyng the defnton of the order parameter S = tanh(β b ) tanh(β m x () ). (3.9) m = N N = S x () we fnd m = N N = tanh β m x () x () (3.20) Usng that tanh(z ) = tanh( z ) as well as the fact that the bts x (µ) can only assume the values ±, we get m = tanh(β m ). (3.2) Ths s a self-consstent equaton for m. We assumed that m s of order unty. So the queston s: does ths equaton admt such solutons? For β 0 there s one soluton, m = 0. Ths s not the desred one. For β, by contrast, there are three solutons, m = 0,±. Fgure 3.4 shows results of the numercal evaluaton of Equaton (3.2) for ntermedate values of β. Below a crtcal nose level there are three solutons, namely for β larger than β c =. (3.22) For β > β c, the soluton m = 0 s unstable (ths can be shown by computng the dervatves of the free energy of the Hopfeld network []). Even f we were to start wth an ntal condton that corresponds

39 MEAN-FIELD THEORY 3 to m = 0, the network would not stay there. The other two solutons are stable: when the network s ntalsed close to x (), then t converges to m = +m (wth m > 0). The symmetry of the problem dctates that there must also be a soluton wth m = m at small nose levels. Ths soluton corresponds to the nverted pattern x () (Secton 2.4). If we start n the vcnty of x (), then the network s unlkely to converge to x (), provded that N s large enough. The probablty of x () x () vanshes very rapdly as N ncreases and as the nose level decreases. If ths transton were to happen n a smulaton, the network would then stay near x () for a very long tme. Consder the lmt where T tends to at a fnte but possbly large value of N. Then the network would (at a very small rate) ump back and forth between x () and x (), so that the order parameter would average to zero. Ths shows that the lmts of large N and large T do not commute lm lm m (T ) lm lm m (T ). (3.23) N N T T In practce the nterestng lmt s the left one, that of a large network run for a tme T much longer than the ntal transent, but not nfnte. Ths s precsely where the mean-feld theory apples. It correponds to takng the lmt N frst, at fnte but large T. Ths descrbes smulatons where the transton x () x () does not occur. In summary, Equaton (3.2) predcts that the order parameter converges to a defnte value, m, ndependent of the stored patterns when N s large enough. Fgure 3.2 shows that the order parameter converges for large but fnte values of N. However, the lmtng value does depend on the stored patterns, as mentoned above. The system s not self averagng. When N s fnte we should therefore average the result over dfferent realsatons of the stored patterns. The value of m determnes the average number of correctly retreved bts n the steady state: N N correct = + S x () = N 2 2 ( + m ) (3.24) = The outer average s over dfferent realsatons of random patterns (the nner average s over the network dynamcs). 
Since $m_1 \to 1$ as $\beta \to \infty$ we see that the number of correctly retrieved bits approaches

$$\langle N_{\rm correct} \rangle \to N. \tag{3.25}$$

This is expected since the stored patterns $\boldsymbol{x}^{(\mu)}$ are recognised for small enough values of $\alpha$ in the deterministic limit, because the cross-talk term is negligible. But it is important to know that the stochastic dynamics slows down as the noise level tends to zero. The lower the noise level, the longer the network remains stuck in local minima, so that it takes a longer time to reach the steady state and to sample the steady-state statistics of $H$. Conversely, when the noise is strong, then

$$\langle N_{\rm correct} \rangle \to \frac{N}{2}. \tag{3.26}$$

In this limit the stochastic network ceases to function. If one assigns $N$ bits entirely randomly, then half of them are correct, on average.

We define the error probability in the steady state as

$$P_{\rm error}^{t=\infty} = \frac{N - \langle N_{\rm correct} \rangle}{N}. \tag{3.27}$$

From Equation (3.24) we find

$$P_{\rm error}^{t=\infty} = \tfrac{1}{2}\,(1 - m_1). \tag{3.28}$$
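The self-consistent equation $m_1 = \tanh(\beta m_1)$ and the resulting steady-state error probability $\tfrac{1}{2}(1 - m_1)$ can be explored numerically. The following is a minimal sketch, assuming a plain fixed-point iteration started at $m_1 = 1$; the function names `mean_field_m` and `steady_state_error` are illustrative and not part of the notes.

```python
import math

def mean_field_m(beta, m0=1.0, tol=1e-12, max_iter=100000):
    """Iterate m -> tanh(beta * m) from m0 > 0 until the change is below tol."""
    m = m0
    for _ in range(max_iter):
        m_new = math.tanh(beta * m)
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m  # not converged (happens very close to beta = 1)

def steady_state_error(beta):
    """Steady-state error probability (1 - m_1) / 2 in the small-alpha limit."""
    return 0.5 * (1.0 - mean_field_m(beta))

for beta in (0.5, 2.0, 5.0):
    print(beta, mean_field_m(beta), steady_state_error(beta))
```

For $\beta < 1$ the iteration collapses to $m_1 = 0$, for $\beta > 1$ it settles on the positive stable branch. Close to $\beta = 1$ the iteration converges very slowly, so a root finder would be the better choice there.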

In the deterministic limit the steady-state error probability approaches zero as $m_1$ tends to one. Let us compare this result with the one-step error probability $P_{\rm error}^{t=1}$ derived in Chapter 2 in the deterministic limit. We should take the limit $\alpha = p/N \to 0$ in Equation (2.43) because the result (3.28) was derived assuming that $\alpha$ is very small. In this limit we find that the one-step and the steady-state error probabilities agree (they are both equal to zero). Above the critical noise level, for $\beta < \beta_c$, the order parameter vanishes so that $P_{\rm error}^{t=\infty}$ equals $\tfrac{1}{2}$. So when the noise is too large the network fails.

It is important to note that noise can also help, because mixed states have lower critical noise levels than the stored patterns $\boldsymbol{x}^{(\mu)}$. This can be seen as follows. To derive the above mean-field result we assumed that $m_\mu = m\,\delta_{\mu 1}$. Mixed states correspond to solutions where an odd number of components of $\boldsymbol{m}$ is non-zero, for example:

$$\boldsymbol{m} = \begin{bmatrix} m \\ m \\ m \\ 0 \\ \vdots \end{bmatrix}. \tag{3.29}$$

Neglecting the cross-talk term, the mean-field equation reads

$$\langle S_i \rangle = \tanh\Big(\beta \sum_{\mu=1}^{p} m_\mu x_i^{(\mu)}\Big). \tag{3.30}$$

In the limit of $\beta \to \infty$, the $\langle S_i \rangle$ converge to the mixed states (2.57) when $\boldsymbol{m}$ is given by Equation (3.29). Using the definition of $m_\mu$ and averaging over the bits of the random patterns one finds:

$$m_\mu = \Big\langle x_i^{(\mu)} \tanh\Big(\beta \sum_{\nu=1}^{p} m_\nu x_i^{(\nu)}\Big) \Big\rangle. \tag{3.31}$$

The numerical solution of Equation (3.31) for the ansatz (3.29) shows that there is an $m \neq 0$ solution for $\beta > \beta_c = 1$. Yet this solution is unstable close to the critical noise level, more precisely for $1 < \beta < 2.17$ [16].

3.4 Storage capacity

The preceding analysis replaced the sum (3.17) by its first term, $x_i^{(1)} m_1$. This corresponds to neglecting the cross-talk term. We expect that this can only work if $\alpha = p/N$ is small enough. The influence of the cross-talk term was studied in Section 2.2, where the storage capacity

$$\alpha = \frac{p}{N}$$

was defined. When we computed $P_{\rm error}^{t=1}$ in Section 2.2, only the first update step was considered, because it was too difficult to analyse the long-time limit of the deterministic dynamics.
It is expected that the error probability increases as $t$ increases, at least when $\alpha$ is large enough so that the cross-talk term matters.

For the stochastic update rule it is easier to compute the long-time dynamics (if there is a steady state). The remainder of this Section describes the mean-field analysis of the steady state for larger

values of $\alpha$. It turns out that there is a critical storage capacity (that depends on the noise level) above which the network ceases to function.

We store $p$ patterns in the network using Hebb's rule (2.29) and feed pattern $\boldsymbol{x}^{(1)}$ to the network. The aim is to determine the order parameter $m_1$ and the corresponding error probability in the steady state for $p \sim N$, so that $\alpha$ remains finite as $N \to \infty$. In this case we can no longer approximate the sum in Equation (3.17) just by its first term, because the other terms for $\mu > 1$ can give a contribution of order $m_1$. Instead we must evaluate all $m_\mu$ to compute the mean field $\langle b_i \rangle$.

The relevant calculation is summarised in Section 2.5 of Hertz, Krogh and Palmer [1], and the remainder of this Section follows that outline quite closely. One starts by rewriting the mean-field equations (3.15) in terms of the order parameters $m_\mu$. Using

$$\langle S_i \rangle = \tanh\Big(\beta \sum_\mu x_i^{(\mu)} m_\mu\Big) \tag{3.32}$$

we find

$$m_\nu = \frac{1}{N}\sum_i x_i^{(\nu)} \langle S_i \rangle = \frac{1}{N}\sum_i x_i^{(\nu)} \tanh\Big(\beta \sum_\mu x_i^{(\mu)} m_\mu\Big). \tag{3.33}$$

This coupled set of $p$ non-linear equations is equivalent to the mean-field equation (3.15).

Now feed pattern $\boldsymbol{x}^{(1)}$ to the network. The strategy for solving Equation (3.33) is to assume that the network stays close to the pattern $\boldsymbol{x}^{(1)}$ in the steady state, so that $m_1$ remains of order unity. But we must also allow that the other order parameters

$$m_\mu = \frac{1}{N}\sum_i x_i^{(\mu)} \langle S_i \rangle \quad\text{for}\quad \mu \neq 1$$

remain finite (yet small). The trick to evaluate these order parameters is to assume random patterns, so that the $m_\mu$ become random numbers that fluctuate around zero with variance $\langle m_\mu^2 \rangle$ (this average is an average over the ensemble of random patterns). We use Equation (3.33) to compute the variance approximately.

In the $\mu$-sum on the r.h.s. of Equation (3.33) we must treat the term $\mu = \nu$ separately (because the index $\nu$ appears also on the l.h.s. of this equation). Also the term $\mu = 1$ must be treated separately, as before, because $\mu = 1$ is the index of the pattern that was fed to the network. Therefore it is necessary to distinguish between $\nu = 1$ and $\nu \neq 1$.
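We start with the second case and write

$$\begin{aligned}
m_\nu &= \frac{1}{N}\sum_i x_i^{(\nu)} \tanh\Big(\beta x_i^{(1)} m_1 + \beta x_i^{(\nu)} m_\nu + \beta \sum_{\mu \neq 1,\, \mu \neq \nu} x_i^{(\mu)} m_\mu\Big) \\
&= \frac{1}{N}\sum_i x_i^{(\nu)} x_i^{(1)} \tanh\Big(\underbrace{\beta m_1}_{\boxed{1}} + \underbrace{\beta x_i^{(1)} x_i^{(\nu)} m_\nu}_{\boxed{2}} + \underbrace{\beta \sum_{\mu \neq 1,\, \mu \neq \nu} x_i^{(\mu)} x_i^{(1)} m_\mu}_{\boxed{3}}\Big).
\end{aligned} \tag{3.34}$$

Now consider the three terms in the argument of $\tanh(\ldots)$. Term $\boxed{1}$ is of order unity, that is independent of $N$. Term $\boxed{3}$ is also large because the sum contains many terms, of the order of $p \sim N$ terms. However, term $\boxed{2}$ is small for large values of $N$. Therefore it is a good approximation to Taylor expand the $\tanh$ around the sum of the two large terms:

$$\tanh\big(\boxed{1} + \boxed{2} + \boxed{3}\big) \approx \tanh\big(\boxed{1} + \boxed{3}\big) + \boxed{2}\; \frac{d}{dx}\tanh(x)\Big|_{x = \boxed{1} + \boxed{3}}. \tag{3.35}$$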

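Using $\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)$ one gets

$$\begin{aligned}
m_\nu ={}& \frac{1}{N}\sum_i x_i^{(\nu)} x_i^{(1)} \tanh\Big(\underbrace{\beta m_1}_{\boxed{1}} + \underbrace{\beta \sum_{\mu \neq 1,\, \mu \neq \nu} x_i^{(\mu)} x_i^{(1)} m_\mu}_{\boxed{3}}\Big) \\
&+ \frac{1}{N}\sum_i x_i^{(\nu)} x_i^{(1)}\, \underbrace{\beta x_i^{(1)} x_i^{(\nu)} m_\nu}_{\boxed{2}}\, \Big[1 - \tanh^2\Big(\beta m_1 + \beta \sum_{\mu \neq 1,\, \mu \neq \nu} x_i^{(\mu)} x_i^{(1)} m_\mu\Big)\Big].
\end{aligned} \tag{3.36}$$

Using the fact that $x_i^{(\mu)} = \pm 1$ and thus $[x_i^{(\mu)}]^2 = 1$, this expression simplifies somewhat:

$$\begin{aligned}
m_\nu ={}& \frac{1}{N}\sum_i x_i^{(\nu)} x_i^{(1)} \tanh\Big(\beta m_1 + \beta \sum_{\mu \neq 1,\, \mu \neq \nu} x_i^{(\mu)} x_i^{(1)} m_\mu\Big) \\
&+ \beta m_\nu - \frac{\beta m_\nu}{N}\sum_i \tanh^2\Big(\beta m_1 + \beta \sum_{\mu \neq 1,\, \mu \neq \nu} x_i^{(\mu)} x_i^{(1)} m_\mu\Big).
\end{aligned} \tag{3.37}$$

The situation is similar to our discussion of the sum defining the cross-talk term in Section 2.2. The sums in Equation (3.37) depend on the stored patterns, through the $x_i^{(\mu)}$. To estimate the order parameters $m_\mu$ we proceed as in Section 2.2 and assume that the pattern bits $x_i^{(\mu)}$ are independently randomly distributed. Since the sums in Equation (3.37) contain many terms, we can estimate the sums using the central-limit theorem. If the patterns are random, then

$$z \equiv \sum_{\mu \neq 1,\, \mu \neq \nu} x_i^{(\mu)} x_i^{(1)} m_\mu \tag{3.38}$$

is approximately Gaussian with mean zero. The variance of $z$ is given by an average over a double sum. Since the distributions of $x_i^{(\mu)}$ are independent, only the diagonal in this double sum contributes:

$$\sigma_z^2 = \sum_{\mu \neq 1,\, \mu \neq \nu} \langle m_\mu^2 \rangle \approx p\, \langle m_\mu^2 \rangle \quad\text{for any}\quad \mu \neq 1, \nu. \tag{3.39}$$

Now return to Equation (3.37). The sum $\frac{1}{N}\sum_i$ in the second line can be approximated as an average over the Gaussian distributed variable $z$:

$$\frac{\beta m_\nu}{N}\sum_i \tanh^2\Big(\beta m_1 + \beta \sum_{\mu \neq 1,\, \mu \neq \nu} x_i^{(\mu)} x_i^{(1)} m_\mu\Big) \approx \beta m_\nu \int_{-\infty}^{\infty} \frac{dz}{\sqrt{2\pi}\,\sigma_z}\, e^{-\frac{z^2}{2\sigma_z^2}} \tanh^2(\beta m_1 + \beta z). \tag{3.40}$$

With the approximation (3.40), the second line of Equation (3.37) can be written as

$$\beta m_\nu - \beta m_\nu \int_{-\infty}^{\infty} \frac{dz}{\sqrt{2\pi}\,\sigma_z}\, e^{-\frac{z^2}{2\sigma_z^2}} \tanh^2(\beta m_1 + \beta z) \equiv \beta m_\nu (1 - q), \tag{3.41a}$$

using the following definition of the parameter $q$:

$$q = \int_{-\infty}^{\infty} \frac{dz}{\sqrt{2\pi}\,\sigma_z}\, e^{-\frac{z^2}{2\sigma_z^2}} \tanh^2(\beta m_1 + \beta z). \tag{3.41b}$$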
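The parameter $q$ is a Gaussian average of $\tanh^2(\beta m_1 + \beta z)$ over $z$ with zero mean and standard deviation $\sigma_z$. As a rough numerical illustration, the sketch below evaluates such an average with the trapezoidal rule on the interval $[-8\sigma_z, 8\sigma_z]$. The helper names `gaussian_average` and `q_parameter`, the grid size and the cut-off are assumptions made for this example only.

```python
import math

def gaussian_average(f, sigma, n=2001, width=8.0):
    """Trapezoidal estimate of the average of f(z) for Gaussian z
    with mean zero and standard deviation sigma."""
    lo, hi = -width * sigma, width * sigma
    h = (hi - lo) / (n - 1)
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    total = 0.0
    for k in range(n):
        z = lo + k * h
        weight = 0.5 if k in (0, n - 1) else 1.0
        total += weight * f(z) * norm * math.exp(-z * z / (2.0 * sigma * sigma))
    return total * h

def q_parameter(beta, m1, sigma_z):
    """Gaussian average of tanh^2(beta * m1 + beta * z)."""
    return gaussian_average(lambda z: math.tanh(beta * (m1 + z)) ** 2, sigma_z)

print(q_parameter(2.0, 0.9, 0.3))
```

For very small $\sigma_z$ the average reduces to $\tanh^2(\beta m_1)$, which gives a quick sanity check of the quadrature.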

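Returning to Equation (3.37) we see that it takes the form

$$m_\nu = \frac{1}{N}\sum_i x_i^{(\nu)} x_i^{(1)} \tanh\Big(\beta m_1 + \beta \sum_{\mu \neq 1,\, \mu \neq \nu} x_i^{(\mu)} x_i^{(1)} m_\mu\Big) + (1 - q)\,\beta m_\nu. \tag{3.42}$$

Solving for $m_\nu$ we find:

$$m_\nu = \frac{\frac{1}{N}\sum_i x_i^{(\nu)} x_i^{(1)} \tanh\Big(\beta m_1 + \beta \sum_{\mu \neq 1,\, \mu \neq \nu} x_i^{(\mu)} x_i^{(1)} m_\mu\Big)}{1 - \beta(1 - q)}, \tag{3.43}$$

for $\nu \neq 1$. This expression allows us to compute the variance $\sigma_z^2$, defined by Equation (3.39). Equation (3.43) shows that the average $\langle m_\nu^2 \rangle$ contains a double sum over the bit index, $i$. Since the bits are independent, only the diagonal terms contribute, so that

$$\langle m_\nu^2 \rangle \approx \frac{\frac{1}{N^2}\sum_i \Big\langle \tanh^2\Big(\beta m_1 + \beta \sum_{\mu \neq 1,\, \mu \neq \nu} x_i^{(\mu)} x_i^{(1)} m_\mu\Big) \Big\rangle}{[1 - \beta(1 - q)]^2}, \tag{3.44}$$

independent of $\nu$. The numerator is just $q/N$, from Equation (3.41b). So the variance evaluates to

$$\sigma_z^2 = \frac{\alpha q}{[1 - \beta(1 - q)]^2}. \tag{3.45}$$

Up to now it was assumed that $\nu \neq 1$. Now consider $\nu = 1$. One can derive an equation for $m_1$ by repeating almost the same steps as for $m_\nu$ with $\nu \neq 1$. The result is:

$$m_1 = \int_{-\infty}^{\infty} \frac{dz}{\sqrt{2\pi\sigma_z^2}}\, e^{-\frac{z^2}{2\sigma_z^2}} \tanh(\beta m_1 + \beta z). \tag{3.46}$$

This is a self-consistent equation for $m_1$. In summary there are three coupled equations, for $m_1$, $q$, and $\sigma_z$: Equations (3.41a), (3.45), and (3.46). They must be solved together to determine how $m_1$ depends on $\beta$ and $\alpha$.

To compare with the results described in Section 2.2 we must take the deterministic limit, $\beta \to \infty$. In this limit the coupled equations simplify [1]:

$$\beta(1 - q) = \sqrt{\frac{2}{\pi \sigma_z^2}}\; e^{-\frac{m_1^2}{2\sigma_z^2}}, \tag{3.47a}$$

$$\sigma_z^2 = \frac{\alpha}{[1 - \beta(1 - q)]^2}, \tag{3.47b}$$

$$m_1 = \operatorname{erf}\Big(\frac{m_1}{\sqrt{2\sigma_z^2}}\Big). \tag{3.47c}$$

Note that $q$ approaches unity in the limit $\beta \to \infty$, yet $\beta(1 - q)$ remains finite in this limit. Recall the definition (3.27) of the steady-state error probability. Inserting Equation (3.47c) for $m_1$ into this expression we find in the deterministic limit:

$$P_{\rm error}^{t=\infty} = \frac{1}{2}\Big[1 - \operatorname{erf}\Big(\frac{m_1}{\sqrt{2\sigma_z^2}}\Big)\Big]. \tag{3.48}$$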

Figure 3.5: Error probability as a function of the storage capacity $\alpha$ in the deterministic limit. The one-step error probability $P_{\rm error}^{t=1}$ [Equation (2.43)] is shown as a black line, the steady-state error probability $P_{\rm error}^{t=\infty}$ [Equation (3.27)] is shown as a red line. In the hashed region error avalanches increase the error probability. Similar to Figure 1 in Ref. [17].

Figure 3.6: Phase diagram of the Hopfield network in the limit of large $N$ (schematic). The region with $P_{\rm error}^{t=\infty} < \tfrac{1}{2}$ is the ordered phase, the region with $P_{\rm error}^{t=\infty} = \tfrac{1}{2}$ is the disordered phase. After Figure 2 in Ref. [17].

Compare this with Equation (2.43) for the one-step error probability in the deterministic limit. That equation was derived for only one step in the dynamics of the network, while Equation (3.48) describes the long-time limit. Yet it turns out that Equation (3.48) reduces to (2.43) in the limit of $\alpha \to 0$. To see this one solves Equation (3.47) by introducing the variable $y = m_1/\sqrt{2\sigma_z^2}$ [1]. One obtains a one-dimensional equation for $y$ [17]:

$$y\Big(\sqrt{2\alpha} + \frac{2}{\sqrt{\pi}}\, e^{-y^2}\Big) = \operatorname{erf}(y). \tag{3.49}$$

The physical solutions are those satisfying $0 \leq \operatorname{erf}(y) \leq 1$, because the order parameter is restricted to this range (transitions to $-m_1$ do not occur in the limit $N \to \infty$).

Figure 3.5 shows the steady-state error probability obtained from Equations (3.48) and (3.49). Also shown is the one-step error probability

$$P_{\rm error}^{t=1} = \frac{1}{2}\Big[1 - \operatorname{erf}\Big(\frac{1}{\sqrt{2\alpha}}\Big)\Big]$$

derived in Section 2.2. You see that $P_{\rm error}^{t=\infty}$ approaches $P_{\rm error}^{t=1}$ for small $\alpha$. This means that the error probability does not increase significantly as one iterates the network, at least for small $\alpha$. In this case errors in earlier iterations have little effect on the probability that later errors occur. But the

situation is different at larger values of $\alpha$. In that case $P_{\rm error}^{t=1}$ significantly underestimates the steady-state error probability. In the hashed region, errors in the dynamics increase the probability of errors in subsequent steps, giving rise to error avalanches.

Figure 3.5 illustrates that there is a critical value $\alpha_c$ where the steady-state error probability tends to $\tfrac{1}{2}$. Solution of the mean-field equations gives

$$\alpha_c \approx 0.138. \tag{3.50}$$

When $\alpha > \alpha_c$ the steady-state error probability equals $\tfrac{1}{2}$; in this region the network produces just noise. When $\alpha$ is small, the error probability is small; here the network works well. Figure 3.5 shows that the steady-state error probability changes very abruptly near $\alpha_c$. Assume you store 137 patterns with 1000 bits in a Hopfield network. Figure 3.5 demonstrates that the network can reliably retrieve the patterns. However, if you try to store one or two more patterns, the network fails to produce any output meaningfully related to the stored patterns. This rapid change is an example of a phase transition. In many physical systems one observes similar transitions between ordered and disordered phases.

What happens at higher noise levels? The numerical solution of Equations (3.41a), (3.46), and (3.45) shows that the critical storage capacity $\alpha_c$ decreases as the noise level increases (smaller values of $\beta$). This is shown schematically in Figure 3.6. Inside the hashed region the error probability is smaller than $\tfrac{1}{2}$ so that the network operates reliably (although less so as one approaches the phase-transition boundary). Outside this region the error probability equals $\tfrac{1}{2}$. In this region the network fails. In the limit of small $\alpha$ the critical noise level is $\beta_c = 1$. In this limit the network is described by the theory explained in Section 3.3, Equation (3.21).

Often these two different phases of the Hopfield network are characterised in terms of the order parameter $m_1$. We see that $m_1 > 0$ in the hashed region, while $m_1 = 0$ outside.
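The critical storage capacity in the deterministic limit can be estimated from the one-dimensional equation $y\,(\sqrt{2\alpha} + (2/\sqrt{\pi})\,e^{-y^2}) = \operatorname{erf}(y)$ of Equation (3.49). Solving it for $\alpha$ gives $\sqrt{2\alpha} = \operatorname{erf}(y)/y - (2/\sqrt{\pi})\,e^{-y^2}$, and $\alpha_c$ is the largest $\alpha$ for which a solution $y > 0$ exists. The sketch below finds this maximum with a simple grid scan; the function names, the scan range and the grid resolution are assumptions of this example.

```python
import math

def alpha_of_y(y):
    """Storage capacity alpha for which y > 0 solves Equation (3.49)."""
    rhs = math.erf(y) / y - (2.0 / math.sqrt(math.pi)) * math.exp(-y * y)
    return 0.5 * rhs * rhs if rhs > 0.0 else 0.0

def critical_capacity(y_max=5.0, steps=50000):
    """Scan y in (0, y_max] and return the largest alpha and its y."""
    best_alpha, best_y = 0.0, 0.0
    for k in range(1, steps + 1):
        y = y_max * k / steps
        a = alpha_of_y(y)
        if a > best_alpha:
            best_alpha, best_y = a, y
    return best_alpha, best_y

alpha_c, y_c = critical_capacity()
p_error_c = 0.5 * (1.0 - math.erf(y_c))  # Equation (3.48) evaluated at alpha_c
print(alpha_c, y_c, p_error_c)
```

The scan locates the maximum near $y \approx 1.5$. The steady-state error probability just below $\alpha_c$ is still small, which is consistent with the abrupt change described above.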
3.5 Beyond mean-field theory

The theory summarised in this Chapter rests on a mean-field approximation for the local field, Equation (3.14). The main result is the phase diagram shown in Figure 3.6. It is important to note that it was derived in the limit $N \to \infty$. For smaller values of $N$ one expects the transition to be less sharp, so that $m_1$ is non-zero for values of $\alpha$ larger than $\alpha_c$.

But even for large values of $N$ the question remains how accurate the mean-field theory is. To answer this question, one must take into account fluctuations. The corresponding calculation is more difficult than the ones outlined earlier in this Chapter, and it requires several steps. One starts from the steady-state distribution of $\boldsymbol{s}$ for fixed patterns $\boldsymbol{x}^{(\mu)}$. In Chapter 4 we will see that it takes the form of a Boltzmann distribution

$$P_\beta(\boldsymbol{s}) = Z^{-1} e^{-\beta H(\boldsymbol{s})}. \tag{3.51}$$

The normalisation factor $Z$ is called the partition function

$$Z = \sum_{\boldsymbol{s}} e^{-\beta H(\boldsymbol{s})}. \tag{3.52}$$

One can compute the order parameter by adding a threshold term to the energy function (2.47):

$$H = -\frac{1}{2}\sum_{ij} w_{ij} S_i S_j - \lambda \sum_i x_i^{(\mu)} S_i. \tag{3.53}$$

Then the order parameter $m_\mu$ is obtained by taking a derivative w.r.t. $\lambda$:

$$m_\mu = \Big\langle \frac{1}{N}\sum_i x_i^{(\mu)} \langle S_i \rangle \Big\rangle = \frac{1}{N\beta}\, \frac{\partial}{\partial \lambda}\, \langle \log Z \rangle. \tag{3.54}$$

The outer average is over different realisations of random patterns. Since the logarithm of $Z$ is difficult to average, one resorts to the replica trick. The idea is to represent the average of the logarithm as

$$\langle \log Z \rangle = \lim_{n \to 0} \frac{1}{n}\big(\langle Z^n \rangle - 1\big). \tag{3.55}$$

The function $Z^n$ looks like the partition function of $n$ copies of the system, thus the name replica trick. It is still debated when the replica trick works and when not [18]. Nevertheless, the most accurate theoretical result for the critical storage capacity is obtained in this way [19]

$$\alpha_c = \ldots \tag{3.56}$$

The mean-field result (3.50) is different from (3.56), but it is very close. Most of the time, mean-field theories do not give such good results; they are usually used to gain merely a qualitative understanding of phase transitions. In the Hopfield model the mean-field theory works so well because the connections are global: every neuron is connected with every other neuron. This helps to average out the fluctuations in Equation (3.14). The most precise Monte-Carlo simulations (Section 4.4) for finite values of $N$ [20] yield upon extrapolation to $N = \infty$

$$\alpha_c = 0.143 \pm \ldots \tag{3.57}$$

This is close to, yet significantly different from the best theoretical estimate, Equation (3.56), and also different from the mean-field result (3.50).

3.6 Summary

In this Chapter the use of Hopfield networks for pattern recognition was discussed. Hopfield networks share many properties with the networks discussed later on in these lectures. The most important point is perhaps that introducing noise in the dynamics makes it possible to study the convergence and performance of the network: in the presence of noise there is a well-defined steady state that can be analysed. Without noise, in the deterministic limit, the network dynamics can get stuck in local minima of the energy function, and may not reach the stored patterns. Naturally the noise must be small enough for the network to function reliably.
Apart from the noise level there is a second significant parameter, the storage capacity $\alpha$, equal to the ratio of the number of patterns to the number of bits per pattern. When $\alpha$ is small the network is reliable. A mean-field analysis of the limit $N \to \infty$ shows that there is a phase transition in the parameter plane (phase diagram) of the Hopfield network, Figure 3.6.

The building blocks of Hopfield networks are McCulloch-Pitts neurons and Hebb's rule for the weights. These elements are fundamental to the networks discussed in the coming Chapters.

3.7 Further reading

The statistical mechanics of Hopfield networks is explained in the book by Hertz, Krogh, and Palmer [1]. Starting from the Boltzmann distribution, Chapter 10 in this book explains how to compute the order parameters, and how to evaluate the stability of the corresponding solutions. For more details on the replica trick, refer to Ref. [15].

3.8 Exercises

Mixed states. Write a computer program that implements the stochastic dynamics of a Hopfield model. Compute how the order parameter for mixed states that are superpositions of the bits of three stored patterns depends on the noise level for $0.5 \leq \beta \leq 2.5$. Compare your numerical results with the predictions in Section 3.3. Repeat the exercise for mixed states that consist of superpositions of the bits of five stored patterns. To this end, first derive the equation for the order parameter and solve this equation numerically. Second, perform your computer simulations and compare.

Phase diagram of the Hopfield network. Derive Equation (3.49) from Equation (3.47). Numerically solve (3.49) to find the critical storage capacity $\alpha_c$ in the deterministic limit. Quote your result with three-digit accuracy. To determine how the critical storage capacity depends on the noise level, numerically solve the three coupled Equations (3.46), (3.41b), and (3.45). Compare your result with the schematic Figure 3.6.

3.9 Exam questions

3.9.1 Stochastic Hopfield network

In the stochastic Hopfield network the state $S_i$ of the $i$-th neuron is updated asynchronously according to

$$S_i \leftarrow \begin{cases} +1, & \text{with probability } P(b_i), \\ -1, & \text{with probability } 1 - P(b_i), \end{cases} \tag{3.58}$$

$$b_i = \sum_{j=1}^{N} w_{ij} S_j, \qquad P(b_i) = \frac{1}{1 + e^{-2\beta b_i}}, \tag{3.59}$$

where $\beta$ is the noise parameter. The weights $w_{ij}$ are given by Hebb's rule $w_{ij} = \frac{1}{N}\sum_{\mu=1}^{p} x_i^{(\mu)} x_j^{(\mu)}$ for $i \neq j$, and $w_{ii} = 0$. Whether the stochastic Hopfield network can reliably retrieve stored patterns depends upon the level of noise (measured by $\beta$), the number of bits per pattern ($N$), the number of stored patterns ($p$), and correlations between stored patterns. Explain and discuss how each of these factors influences the reliability of retrieval of stored patterns in the stochastic Hopfield model. In the discussion, refer to and explain the following terms: "storage capacity", "order parameter", "phase diagram", "local minima", "attractors", "spurious states". Your answer must not be longer than one A4 page. (1p).
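The asynchronous update rule (3.58) and (3.59) with Hebb's rule for the weights can be sketched in a few lines. This is only an illustration under stated assumptions: the network size, the number of patterns, the noise parameter, the number of sweeps and all function names are choices made for this example, and the weights are stored as a plain nested list for clarity rather than speed.

```python
import math
import random

def hebb_weights(patterns):
    """Hebb's rule: w_ij = (1/N) sum_mu x_i x_j for i != j, and w_ii = 0."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for x in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += x[i] * x[j] / n
    return w

def stochastic_update(state, w, beta, rng):
    """Update one randomly chosen neuron according to (3.58) and (3.59)."""
    n = len(state)
    i = rng.randrange(n)
    b = sum(w[i][j] * state[j] for j in range(n))
    prob_plus = 1.0 / (1.0 + math.exp(-2.0 * beta * b))
    state[i] = 1 if rng.random() < prob_plus else -1

def order_parameter(state, pattern):
    """Overlap (1/N) sum_i S_i x_i between the state and a stored pattern."""
    return sum(s * x for s, x in zip(state, pattern)) / len(state)

rng = random.Random(0)            # fixed seed, for reproducibility only
N, p, beta = 100, 3, 4.0          # illustrative values, alpha = p/N = 0.03
patterns = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(p)]
w = hebb_weights(patterns)
state = list(patterns[0])         # feed the first stored pattern
samples = []
for sweep in range(50):
    for _ in range(N):
        stochastic_update(state, w, beta, rng)
    samples.append(order_parameter(state, patterns[0]))
m1 = sum(samples) / len(samples)  # time average of the order parameter
print(m1)
```

At this low noise level and small storage capacity the time-averaged overlap with the fed pattern should stay close to one. Lowering `beta` towards one, or increasing `p`, lets one probe the other regions of the phase diagram.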


4 Stochastic optimisation

Hopfield networks were introduced as a means of solving the associative memory problem. In Section 2.3 it was shown that this corresponds to minimising an energy function $H$. In this Chapter we see how this can be used to solve combinatorial optimisation problems using neural networks. Such problems admit $2^k$ or $k!$ configurations, too many to list and check in a serial approach when $k$ is large.

The idea is to write down an energy function that is at most quadratic in the state variables, like Equation (2.47). Then one can use the Hopfield dynamics to minimise $H$. The problem is of course that the deterministic network dynamics gets stuck in the first local minimum encountered, usually not the desired optimum (Figure 4.1). Therefore it is important to introduce a little bit of noise. As discussed in Chapter 3, the noise helps the network to escape local minima.

A common strategy is to lower the noise level on the fly. In the beginning of the simulation the noise level is high, so that the network explores the rough features of the energy landscape. When the noise level is lowered, the network can see finer and finer features, and the hope is that it ends up in the global minimum. This method is called simulated annealing.

4.1 Combinatorial optimisation problems

A well-known combinatorial optimisation problem is the travelling-salesman problem. Given the coordinates of $k$ cities, the goal is to determine the shortest journey visiting each city exactly once before returning to the starting point. The coordinates of seven cities A,...,G are given in Figure 4.2 (this Figure illustrates the problem for $k = 7$). The Figure shows two different solutions. Denoting the distance between city A and B by $d_{AB}$ and so forth, the length of the path in panel (a) is

$$L = d_{AD} + d_{DB} + d_{BG} + d_{GF} + d_{FC} + d_{CE} + d_{EA}. \tag{4.1}$$

Figure 4.1: Spurious local minima in the energy function $H$. The global minimum corresponds to one of the stored patterns, $\boldsymbol{x}^{(\mu)}$.

Figure 4.2 (table of city coordinates): A (0., 0.5), B (0.4, 0.2), C (0.5, 0.7), D (0.2, 0.), E (0., 0.8), F (0.8, 0.9), G (0.9, 0.3). Panels (a) and (b) show two tours together with their matrix representations M.

Figure 4.2: Travelling-salesman problem for $k = 7$. Given are the coordinates of $k$ cities as points in the unit square. The problem is to find the shortest connected path that joins all cities, visits each city exactly once, and returns to the starting point. (a) best solution, (b) a second solution with a longer path. Also given are the matrix representations of these solutions.

Figure 4.3: One solution of the $k$-queens problem for $k = 8$.

Figure 4.2 also demonstrates how paths are represented in terms of $k \times k$ matrices. With the rows ordered A to G and the tour started at city A, as in Equation (4.1), path (a) corresponds to

$$M = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0
\end{bmatrix}. \tag{4.2}$$

Each row corresponds to a city, and the $j$-th element in this row has the entry 1 if the city is the $j$-th stop in the tour. The other entries are 0. Since each city is visited only once, there can be only one 1 in each row. Since each visit corresponds to exactly one city, there can be only one 1 in each column. Any permutation of the elements that satisfies these constraints is an allowed path. There are $k!$ such permutations. They are $2k$-fold degenerate (there are $k$ paths of the same length that differ by which city is visited first, and each path can be travelled clockwise or anti-clockwise). Therefore there are

$k!/(2k)$ possible paths to consider in trying to determine the shortest one. This makes the problem hard. Note that integer linear-programming methods for solving the travelling-salesman problem usually use a different representation of the paths [21].

The $k$-queens problem is derived from the game of chess. The question is how to arrange $k$ queens on a $k \times k$ chess board so that they cannot take each other out. This means that each row and column as well as each diagonal can have only one queen. The problem is illustrated in Figure 4.3. This Figure shows one solution for $k = 8$. The task is to find all solutions. Each arrangement of queens can be represented as a matrix $M$, where one sets $M_{ij} = 1$ if there is a queen on site $(i, j)$. All other elements are set to zero. To represent valid solutions, $M$ must satisfy the following constraints [22]

$$\sum_{i=1}^{k}\sum_{j=1}^{k} M_{ij} = k, \tag{4.3}$$

$$\text{If } M_{ij} = M_{pq} = 1 \text{ for } (i,j) \neq (p,q) \text{ then } i \neq p \text{ and } j \neq q \text{ and } i - j \neq p - q \text{ and } i + j \neq p + q. \tag{4.4}$$

The double-digest problem. The Human Genome sequence was first assembled by piecing together overlapping DNA segments in the right order by making sure that overlapping segments share the same DNA sequence. To this end it is necessary to uniquely identify the DNA segments. The actual DNA sequence of a segment is a unique identifier. But it is sufficient and more efficient to identify a DNA segment by a fingerprint, for example the sequence of restriction sites. These are short subsequences (four or six base pairs long) that are recognised by enzymes that cut (digest) the DNA strand precisely at these sites. A DNA segment is identified by the types and locations of restriction sites that it contains, the so-called restriction map.

When a DNA segment is cut by two different enzymes one can experimentally determine the lengths of the resulting fragments. Is it possible to determine how the cuts were ordered in the DNA sequence of the segment from the fragment lengths? This is the double-digest problem [23]. The order of cut sites is precisely the restriction map.
In a double-digest experiment, a given DNA sequence is first digested by one enzyme (A say). Assume that this results in $n$ fragments with lengths $a_i$ ($i = 1, \ldots, n$). Second, the DNA sequence is digested by another enzyme, B. In this case $m$ fragments are found, with lengths $b_1, b_2, \ldots, b_m$. Third, the DNA sequence is digested with both enzymes A and B, yielding $l$ fragments with lengths $c_1, \ldots, c_l$. The question is now to determine all possible orderings of the $a$- and $b$-cuts that result in $l$ fragments with lengths $c_1, c_2, \ldots, c_l$.

4.2 Energy functions

The first two problems introduced in the previous Section are similar in that possible solutions can be represented in terms of $k \times k$ matrices $M$ with 0/1 entries and certain constraints. It turns out that one can represent these problems in terms of a Hopfield network with $N = k^2$ neurons $n_{ij}$ that take the values 0/1 [24].

As an example, consider the travelling-salesman problem. We label the cities A to G by the integer $m = 1, \ldots, k$, and denote their distances by $d_{mn}$. Then the path length to be minimised is

$$L = \frac{1}{2}\sum_{mnj} d_{mn}\, M_{mj}\,\big(M_{n\,j-1} + M_{n\,j+1}\big). \tag{4.5}$$
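The path-length formula (4.5) can be checked against a direct computation of the tour length. The sketch below builds the matrix $M$ for a given visiting order and evaluates the triple sum, taking the stop index $j$ cyclically so that the last stop connects back to the first. The four city coordinates are hypothetical (they are not those of Figure 4.2), and the function names are assumptions of this example.

```python
import math

def tour_matrix(order, k):
    """M[m][j] = 1 if city m is the j-th stop of the tour, else 0."""
    M = [[0] * k for _ in range(k)]
    for j, m in enumerate(order):
        M[m][j] = 1
    return M

def path_length(M, d):
    """Path length from Equation (4.5), stop index j taken modulo k."""
    k = len(M)
    total = 0.0
    for m in range(k):
        for n in range(k):
            for j in range(k):
                total += d[m][n] * M[m][j] * (M[n][(j - 1) % k] + M[n][(j + 1) % k])
    return 0.5 * total

# Hypothetical cities on the corners of the unit square.
cities = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
k = len(cities)
d = [[math.dist(p, q) for q in cities] for p in cities]

square = tour_matrix([0, 1, 2, 3], k)   # walks around the square
crossed = tour_matrix([0, 2, 1, 3], k)  # uses both diagonals
print(path_length(square, d))
print(path_length(crossed, d))
```

Each link of the tour is counted twice in the triple sum (once from either end), which is why the factor $\tfrac{1}{2}$ appears in Equation (4.5).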

The column- and row-constraints upon the matrix $M$,

$$\sum_j M_{mj} = 1 \quad\text{(row constraint)}, \tag{4.6a}$$

$$\sum_m M_{mj} = 1 \quad\text{(column constraint)}, \tag{4.6b}$$

are incorporated using Lagrange multipliers $A$ and $B$ (both positive), so that the function to be minimised becomes

$$H = L + \frac{A}{2}\sum_m \Big(1 - \sum_j M_{mj}\Big)^2 + \frac{B}{2}\sum_j \Big(1 - \sum_m M_{mj}\Big)^2. \tag{4.7}$$

When the constraints (4.6) are satisfied, their contributions to $H$ vanish, otherwise they are positive. We conclude that $H$ has a global minimum at the desired solution. If we use a stochastic method to minimise $H$, it is not guaranteed that the algorithm finds the global minimum, either because the constraints are not exactly satisfied, or because the path found is not the shortest one. The magnitude of the Lagrange multipliers $A$ and $B$ determines how strongly the constraints are enforced, during the search and in sub-optimal solutions.

The expression (4.7) is a quadratic function of $M_{mj}$. This suggests that we can write $H$ as the energy function of a Hopfield model with 0/1 neurons $n_{ij}$:

$$H = -\frac{1}{2}\sum_{ijkl} w_{ijkl}\, n_{ij}\, n_{kl} + \sum_{ij} \mu_{ij}\, n_{ij} + \text{const.} \tag{4.8}$$

The weights $w_{ijkl}$ and thresholds $\mu_{ij}$ are determined by comparing Equations (4.5), (4.7), and (4.8). Note that the neurons carry two indices (not one as in Section 1.2).

The first term in Equation (4.8) is familiar [Equation (2.47)]. The second term is a threshold term. In Sections 2.2 and 2.3 all thresholds were set to zero (because they were not needed). If there are thresholds in the update rule [as in Equation (1.5)] then the energy function must have a threshold term to ensure that the energy cannot increase when the network is updated. The constant term is irrelevant; it does not affect the minimisation. This term can be left out. We write:

$$H = -\sum_{ij} b_{ij}\, n_{ij}, \tag{4.9a}$$

where

$$b_{ij} = \frac{1}{2}\sum_{kl} w_{ijkl}\, n_{kl} - \mu_{ij} \tag{4.9b}$$

are local fields including thresholds.

4.3 Simulated annealing

Under the deterministic dynamics (1.2)

$$n'_{ij} = \theta_{\rm H}(b_{ij}) \tag{4.10}$$

only updates satisfying $H(n'_{ij}) - H(n_{ij}) \leq 0$ occur, so that the network gets stuck in local minima. To avoid this, one introduces noise, as explained in Section 3.1. The analogue of Equation (3.2) is

$$n'_{ij} = \begin{cases} 1 & \text{with prob. } P(b_{ij}), \\ 0 & \text{with prob. } 1 - P(b_{ij}). \end{cases} \tag{4.11}$$

Here we take $P(b) = \frac{1}{1 + e^{-\beta b}}$. This form differs slightly from the definition (3.3). For $\beta \to \infty$ the stochastic rule reduces to the deterministic dynamics where $H(n'_{ij}) - H(n_{ij}) \leq 0$.

Now consider large but finite values of $\beta$, corresponding to small noise levels. In this case the energy function may increase in an update, yet down-moves are much more likely than up-moves. This becomes clear by considering an alternative (yet equivalent) formulation of the network dynamics:

1. Choose at random a unit $ij$.
2. Change $n_{ij}$ to $n'_{ij} \neq n_{ij}$ with probability

$$\text{Prob}(n_{ij} \to n'_{ij}) = \frac{1}{1 + e^{\beta \Delta H_{ij}}} \tag{4.12a}$$

with

$$\Delta H_{ij} = H(n'_{ij}) - H(n_{ij}) = -b_{ij}\,(n'_{ij} - n_{ij}). \tag{4.12b}$$

Equation (4.12) shows that moves with $\Delta H_{ij} > 0$ are less likely when $\beta$ is large. To demonstrate that this scheme is equivalent to the stochastic dynamics (4.11) we break the prescription (4.12) up into different cases. Changes occur with the following probabilities:

$$\text{if } n_{ij} = 0 \text{ obtain } n'_{ij} = 1 \text{ with prob. } \frac{1}{1 + e^{-\beta b_{ij}}} = P(b_{ij}), \tag{4.13a}$$

$$\text{if } n_{ij} = 1 \text{ obtain } n'_{ij} = 0 \text{ with prob. } \frac{1}{1 + e^{\beta b_{ij}}} = 1 - P(b_{ij}). \tag{4.13b}$$

In the second row we used that $1 - P(b) = 1 - \frac{1}{1 + e^{-\beta b}} = \frac{1}{1 + e^{\beta b}}$. The state remains unchanged with the complementary probabilities:

$$\text{if } n_{ij} = 0 \text{ obtain } n'_{ij} = 0 \text{ with prob. } 1 - \frac{1}{1 + e^{-\beta b_{ij}}} = \frac{1}{1 + e^{\beta b_{ij}}} = 1 - P(b_{ij}), \tag{4.13c}$$

$$\text{if } n_{ij} = 1 \text{ obtain } n'_{ij} = 1 \text{ with prob. } 1 - \frac{1}{1 + e^{\beta b_{ij}}} = \frac{1}{1 + e^{-\beta b_{ij}}} = P(b_{ij}). \tag{4.13d}$$

Comparing with Equation (4.11) we see that the two schemes are equivalent.

How does the network find a solution of the optimisation problem? We let it run with stochastic dynamics and compute $\langle n_{ij} \rangle$. If $\langle n_{ij} \rangle \approx 1$, we set $M_{ij} = 1$, otherwise 0. When the noise is weak, we expect that all $\langle n_{ij} \rangle$ are close either to zero or to one. One strategy is to change the noise level as the simulation proceeds. One starts with larger noise, so that the network explores first the rough features of the energy landscape. As the simulation proceeds, one reduces the noise level, so that the network can learn finer features of the landscape. This is called simulated annealing. See Section 10.9 in Numerical Recipes [25].

4.4 Monte-Carlo simulation

The algorithms described in the previous Section are equivalent to the Markov-chain Monte-Carlo algorithm, which is closely related to the Metropolis algorithm. This method is widely used in Statistical Physics and in Mathematical Statistics. It is therefore important to understand the connections between the different formulations.

The Markov-chain Monte-Carlo algorithm is a method to sample from a distribution that is too expensive to compute. This is not a contradiction in terms!
The distribution in question is the Boltzmann distribution

$$P_\beta(\boldsymbol{n}) = Z^{-1} e^{-\beta H(\boldsymbol{n})}. \tag{4.14}$$

Here $Z = \sum_{\boldsymbol{n}} e^{-\beta H(\boldsymbol{n})}$ is a normalisation factor, called the partition function. The vector $\boldsymbol{n}$ represents the state vector of the system, as in the Hopfield network. The distribution (4.14) plays an important role

in the equilibrium statistical mechanics of systems with energy function (Hamiltonian) $H$. In this context $\beta^{-1} = k_{\rm B} T$ where $k_{\rm B}$ is the Boltzmann constant and $T$ is the temperature of the system.

For systems with a large number of degrees of freedom, the function $P_\beta(\boldsymbol{n})$ can be very expensive to compute. Is there a way of sampling this distribution without actually evaluating it? The answer is yes, by constructing a Markov chain [26] of states $\boldsymbol{n}$ that are distributed according to $P_\beta(\boldsymbol{n})$, after an initial transient. A Markov chain is a memoryless random sequence of states defined by transition rates $p_{l \to k}$ from state $\boldsymbol{n}_l$ to $\boldsymbol{n}_k$. Equation (4.12) corresponds to

$$p_{l \to k} = \frac{1}{1 + e^{\beta \Delta H}} \tag{4.15}$$

where $\Delta H = H(\boldsymbol{n}_k) - H(\boldsymbol{n}_l)$. The transition rate $p_{l \to k}$ connects arbitrary states, allowing for local moves (as in the previous Section where only one element of $\boldsymbol{n}$ was changed) or global moves. The Monte-Carlo algorithm proceeds in two steps. Given the state $\boldsymbol{n}_l$, a new state $\boldsymbol{n}_k$ is chosen. Then it is accepted with probability $p_{l \to k}$. It is important to note that the probability of choosing $\boldsymbol{n}_k$ given $\boldsymbol{n}_l$ must be symmetric, that is equal to the probability of choosing $\boldsymbol{n}_l$ given $\boldsymbol{n}_k$. For the local moves discussed in the previous Chapters (asynchronous updating) this requirement is satisfied. In general, and in particular for global moves, it must be explicitly checked [27]. Equation (4.15) shows that downhill steps are more frequently accepted than uphill steps. These steps are repeated many times, creating a sequence of states. If the process satisfies the detailed-balance condition

$$P_\beta(\boldsymbol{n}_l)\, p_{l \to k} = P_\beta(\boldsymbol{n}_k)\, p_{k \to l}, \tag{4.16}$$

it follows that the Markov chain of states $\boldsymbol{n}_l$ has the steady-state distribution $P_\beta(\boldsymbol{n})$. Usually this means that the distribution of states generated in this way converges to $P_\beta(\boldsymbol{n})$ (see Ref. [26] for details).

To prove that condition (4.16) holds, we use Equations (4.14) and (4.15):

$$\frac{e^{-\beta H(\boldsymbol{n}_l)}}{1 + e^{\beta[H(\boldsymbol{n}_k) - H(\boldsymbol{n}_l)]}} = \frac{1}{e^{\beta H(\boldsymbol{n}_l)} + e^{\beta H(\boldsymbol{n}_k)}} = \frac{e^{-\beta H(\boldsymbol{n}_k)}}{1 + e^{\beta[H(\boldsymbol{n}_l) - H(\boldsymbol{n}_k)]}}. \tag{4.17}$$

This demonstrates that the Boltzmann distribution is a steady state of the Markov chain. If the simulation converges to the steady state (as it usually does), then states visited by the Markov chain are distributed according to the Boltzmann distribution. While this distribution may be difficult to evaluate ($Z$ involves the sum over all possible states), $\Delta H$ is cheap to compute for local moves. The states in the sequence are correlated, in particular when the moves are local, because then subsequent configurations are similar.

Note that the algorithm applies to energy functions of arbitrary form, not just the particular form (4.8) for the Hopfield network. Returning for a moment to Chapter 3, the above reasoning implies that the steady-state distribution for the Hopfield model is the Boltzmann distribution, as stated in Section 3.5.

In practice one uses a slightly different form of the transition rates (Metropolis algorithm)

$$p_{l \to k} = \begin{cases} e^{-\beta \Delta H} & \text{when } \Delta H > 0, \\ 1 & \text{when } \Delta H \leq 0, \end{cases} \tag{4.18}$$

with $\Delta H = H(\boldsymbol{n}_k) - H(\boldsymbol{n}_l)$. Equation (4.18) has the advantage that the transition rates are higher than in (4.15) so that moves are more frequently accepted. That the Metropolis rates obey the detailed-balance

condition (4.16) can be seen using Equations (4.14) and (4.18):

$$\begin{aligned}
P_\beta(\boldsymbol{n}_l)\, p_{l \to k} &= Z^{-1} e^{-\beta H(\boldsymbol{n}_l)} \begin{cases} e^{-\beta[H(\boldsymbol{n}_k) - H(\boldsymbol{n}_l)]} & \text{if } H(\boldsymbol{n}_k) > H(\boldsymbol{n}_l), \\ 1 & \text{otherwise} \end{cases} \\
&= Z^{-1} e^{-\beta \max\{H(\boldsymbol{n}_k),\, H(\boldsymbol{n}_l)\}} \\
&= Z^{-1} e^{-\beta H(\boldsymbol{n}_k)} \begin{cases} e^{-\beta[H(\boldsymbol{n}_l) - H(\boldsymbol{n}_k)]} & \text{if } H(\boldsymbol{n}_l) > H(\boldsymbol{n}_k), \\ 1 & \text{otherwise} \end{cases} \\
&= P_\beta(\boldsymbol{n}_k)\, p_{k \to l}.
\end{aligned} \tag{4.19}$$

The fact that the algorithm produces states distributed according to Equation (4.14) offers a different perspective upon the idea of simulated annealing. Slowly lowering the temperature through the simulation mimics slow cooling of a physical system. It passes through a sequence of quasi-equilibrium Boltzmann distributions with lower and lower temperatures, until the system finds the global minimum $H_{\min}$ of the energy function at zero temperature, where $P_\beta(\boldsymbol{n}) = 0$ when $H(\boldsymbol{n}) > H_{\min}$, but $P_\beta(\boldsymbol{n}) > 0$ when $H(\boldsymbol{n}) = H_{\min}$.

4.5 Summary

In this Chapter it was shown how Hopfield networks can perform optimisation tasks, exploring the energy function with stochastic Hopfield dynamics. This approach is equivalent to the Markov-chain Monte-Carlo algorithm. In simulated annealing one gradually reduces the noise level as the simulation proceeds. This mimics the slow cooling of a physical system, an efficient way of bringing the system into its global optimum.

4.6 Further reading

An older but still good reference for Monte-Carlo methods in Statistical Physics is the book Monte Carlo methods in Statistical Physics edited by Binder [28]. A more recent source is the book by Newman and Barkema [29]. For simulated annealing, you can refer to Ref. [30].

4.7 Exercises

Travelling-salesman problem. Derive Equation (4.5) for the path length in the travelling-salesman problem.

Double-digest problem. Implement the Metropolis algorithm for the double-digest problem. Denote the ordered set of fragment lengths produced by digesting with enzyme A by $a = \{a_1, \ldots, a_n\}$, where $a_1 \geq a_2 \geq \ldots \geq a_n$. Similarly $b = \{b_1, \ldots, b_m\}$ ($b_1 \geq b_2 \geq \ldots \geq b_m$) for fragment lengths produced by digesting with enzyme B, and $c = \{c_1, \ldots, c_l\}$ ($c_1 \geq c_2 \geq \ldots \geq c_l$) for fragment lengths produced by digesting first with A and then with B.
Given permutations $\sigma$ and $\mu$ of the sets $a$ and $b$ correspond to a set of $c$-fragments; we denote it by $\hat{c}(\sigma,\mu)$. Use the energy function $H(\sigma,\mu) = \sum_j c_j^{-1}\,[c_j - \hat{c}_j(\sigma,\mu)]^2$. Configuration space is the space of all permutation pairs $(\sigma,\mu)$. Local moves correspond to inversions of short subsequences of $\sigma$ and/or $\mu$. Check that the scheme of suggesting new states is symmetric. This

is necessary for the algorithm to converge. The solutions of the double-digest problem are degenerate. Determine the degeneracy of the solutions for the fragment sets shown in Table 4.1.

L = 0000
a = [5976, 543, 39, 20, 42]
b = [453, 2823, 2057, 607]
c = [453, 543, 39, 20, 607, 54, 342, 42]

L =
a = [8479, 4868, 3696, 2646, 69, 42]
b = [968, 5026, 08, 050, 69, 84]
c = [8479, 467, 2646, 08, 88, 859, 70, 69, 84, 69, 42]

L =
a = [9979, 9348, 8022, 4020, 2693, 892, 74, 37, 50, 45]
b = [9492, 8453, 7749, 7365, 2292, 280, 023, 959, 278, 24, 85]
c = [7042, 5608, 5464, 437, 3884, 32, 90, 768, 590, 959, 899, 707, 702, 50, 45, 42, 278, 24, 24, 85]

Table 4.1: Example configurations for the double-digest problem for three different chromosome lengths L. For each example, three ordered fragment sets are given, corresponding to the result of digestion with A, with B, and with both A and B.
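As a small illustration of the Metropolis rates discussed in this Chapter, the following sketch builds the full transition matrix for a toy state space and checks numerically that the Boltzmann distribution satisfies detailed balance and is a steady state. It is not a solution of the double-digest exercise. The function name `metropolis_matrix`, the four energy values, and the uniform suggestion scheme are assumptions made only for this example.

```python
import numpy as np

def metropolis_matrix(energies, beta):
    """Transition matrix of a Metropolis chain on a small state space.

    Assumption: a new state is suggested uniformly among the other states,
    so that the suggestion scheme is symmetric.
    """
    n = len(energies)
    P = np.zeros((n, n))
    for l in range(n):
        for k in range(n):
            if k == l:
                continue
            dH = energies[k] - energies[l]
            # Metropolis acceptance: always accept downhill or neutral moves
            accept = np.exp(-beta * dH) if dH > 0 else 1.0
            P[l, k] = accept / (n - 1)
        P[l, l] = 1.0 - P[l].sum()  # rejected suggestions leave the state unchanged
    return P

energies = np.array([0.0, 0.5, 1.3, 2.0])  # hypothetical energy values
beta = 1.5
P = metropolis_matrix(energies, beta)

# Boltzmann distribution for the same energies
boltzmann = np.exp(-beta * energies)
boltzmann /= boltzmann.sum()

# Detailed balance: P_beta(n_l) p_{l->k} = P_beta(n_k) p_{k->l}
flux = boltzmann[:, None] * P
detailed_balance = np.allclose(flux, flux.T)

# Steady state: one step of the chain leaves the distribution unchanged
stationary = np.allclose(boltzmann @ P, boltzmann)
```

In a real simulation one never forms this matrix, because the state space is far too large. One only draws a suggested move, evaluates the energy change, and accepts or rejects it.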


PART II SUPERVISED LEARNING

The Hopfield networks described in Part I solve auto-association tasks, where the neurons of the network act as inputs and outputs. In the pattern-recognition problem, a distorted pattern is fed into the network, and the recursive network dynamics is run until a steady state is reached. The aim is that the steady-state values of the neurons converge to those of the correct pattern associated with the distorted one.

A related and very common type of problem is the classification task. The machine-learning repository [31] at the University of California, Irvine contains a large number of such problems. A well-known example is the iris data set. It lists attributes of 150 iris plants. The data set was described by the geneticist R. A. Fisher [32]. For each plant four attributes are given (Figure 5.1): its sepal length, sepal width, petal length, and petal width. Also, each plant is classified into one of three classes: iris setosa, iris versicolor, or iris virginica. The task is to program a neural network that determines the class of a plant from its attributes. To each input (attributes of an iris plant) the network should associate the correct output, the class of the plant. The correct output is referred to as the target.

In supervised learning one uses a training data set of correct input/output pairs. One feeds an input from the training data into the input terminals of the network and compares the states of the output neurons to the target values. The weights and thresholds are changed to minimise the differences between network outputs and targets for all input patterns in the training set. In this way the network learns to associate input patterns in the training set with the correct target values. A crucial question is whether the trained network can generalise: does it find the correct targets for input patterns that were not in the training set?

The networks used for supervised learning are called perceptrons [12, 13]. They consist of layers of McCulloch-Pitts neurons: an input layer, a number of hidden layers, and an output layer.
The layers are usually arranged from the left (input) to the right (output). All connections are one-way, from neurons in one layer to neurons in the layer immediately to the right. There are no connections between neurons in a given layer, or back to layers on the left. This arrangement ensures convergence of the training algorithm (stochastic gradient descent). During training with this algorithm the weights are updated iteratively. In each step, an input is applied and the weights of the network are updated to reduce the error in the output. In a sense each step corresponds to adding a little bit of Hebb's rule to the weights. This is repeated until the network classifies the training set correctly.

Stochastic gradient descent for multi-layer perceptrons has received much attention recently, after it was realised that networks with many hidden layers can be trained to reliably recognise and classify image data, for self-driving cars for instance, but also for other applications (deep learning).

5 Perceptrons

Perceptrons [12, 13] are trained by iteratively updating their weights and thresholds. In the Hopfield networks described in Part I, by contrast, the weights were always assigned using Hebb's rule. To motivate the idea of updating weights iteratively, consider Hebb's rule, Equation (2.29). We estimated in Section 2.2 how frequently neurons are erroneously updated because the cross-talk term in Equation (2.3) changed the sign of the bit in question. To this end we assumed that all bits of all patterns were independently, identically, and randomly distributed, and we used the central-limit theorem. For correlated patterns the effect of the cross-talk term is different from the results calculated in Chapter 3. It has been argued that the storage capacity increases when the patterns are more strongly correlated, while others have claimed that the capacity decreases in this limit (see Ref. [33] for a discussion).

Figure 5.1: Left: petals and sepals of the iris flower. Right: six entries of the iris data set [31], listing sepal length, sepal width, petal length, petal width, and classification. All lengths in cm. The whole data set contains 150 entries.

When we must deal with a definite set of patterns (no randomness to average over), the situation seems to be even more challenging. Is there a way of modifying Hebb's rule to deal with this problem? Yes there is! We simply incorporate the overlaps

$$
Q_{\mu\nu} = \frac{1}{N}\, \boldsymbol{x}^{(\mu)} \cdot \boldsymbol{x}^{(\nu)} \tag{5.1}
$$

into Hebb's rule. To this end, define the $p \times p$ overlap matrix $\mathbb{Q}$ with elements $Q_{\mu\nu}$. The modified Hebb's rule reads:

$$
w_{ij} = \frac{1}{N} \sum_{\mu\nu} x_i^{(\mu)} \left(\mathbb{Q}^{-1}\right)_{\mu\nu} x_j^{(\nu)}. \tag{5.2}
$$

For orthogonal patterns ($Q_{\mu\nu} = \delta_{\mu\nu}$) this rule is identical to Equation (2.28). For non-orthogonal patterns, the rule (5.2) ensures that all patterns are recognised, provided that the matrix $\mathbb{Q}$ is invertible. In this case one can find the weights $w_{ij}$ iteratively, by successive improvement from an arbitrary starting point. We can say that the network learns the task through a sequence of weight changes. This is the idea used to solve classification tasks with perceptrons (Section 5.3). You will see in the following that this usually works even when Equation (5.2) fails.

A perceptron is a layered feed-forward network (Figure 5.2). The leftmost layer contains input terminals (black in Figure 5.2). To the right follows a number of layers of McCulloch-Pitts neurons. The right-most layer of neurons is the output layer, where the output of the network is read out. The other neuron layers are called hidden layers: because they feed into other neurons, their states are not read out. All connections $w_{ij}$ are one-way: every neuron (or input terminal) feeds only to neurons in the layer immediately to the right. There are no connections within layers, no back connections, and no connections that jump over a layer.

There are $N$ input terminals. As in Part I we denote the input patterns by

$$
\boldsymbol{x}^{(\mu)} = \begin{pmatrix} x_1^{(\mu)} \\ x_2^{(\mu)} \\ \vdots \\ x_N^{(\mu)} \end{pmatrix}. \tag{5.3}
$$

The index $\mu$ labels the different input patterns in the training set.
It ranges from 1 to $p$. All neurons are McCulloch-Pitts neurons. The output neurons in the network on the left of Figure 5.2, for example,

perform the computation:

$$
O_i = g(B_i) \quad\text{with}\quad B_i = \sum_j W_{ij}\, x_j - \Theta_i. \tag{5.4}
$$

The index $i$ labels the output neurons; it ranges from 1 to $M$. Each output neuron has a threshold, $\Theta_i$. In the literature on deep learning the thresholds are sometimes referred to as biases, defined as $-\Theta_i$. The function $g$ is an activation function as described in Section 1.2. Now consider the network on the right of Figure 5.2. The states of the neurons in the hidden layer are denoted by $V_j$, with thresholds $\theta_j$ and weights $w_{jk}$. In summary:

$$
V_j = \theta_{\rm H}(b_j) \quad\text{with}\quad b_j = \sum_k w_{jk}\, x_k - \theta_j, \tag{5.5a}
$$

$$
O_i = \theta_{\rm H}(B_i) \quad\text{with}\quad B_i = \sum_j W_{ij}\, V_j - \Theta_i. \tag{5.5b}
$$

A classification task is given by a training set of input patterns $\boldsymbol{x}^{(\mu)}$ and corresponding target values

$$
\boldsymbol{t}^{(\mu)} = \begin{pmatrix} t_1^{(\mu)} \\ t_2^{(\mu)} \\ \vdots \\ t_M^{(\mu)} \end{pmatrix}. \tag{5.6}
$$

The perceptron is trained by choosing its weights and thresholds so that the network produces the desired output:

$$
O_i^{(\mu)} = t_i^{(\mu)} \quad\text{for all } i \text{ and } \mu. \tag{5.7}
$$

Remark: if we take $t_i^{(\mu)} = x_i^{(\mu)}$ for $i = 1, \ldots, N$, the task is the associative memory problem discussed in Part I.

Figure 5.2: Feed-forward network without hidden layer (left), and with one hidden layer (right). The input terminals are coloured black.

Figure 5.3: Left: classification problem with two-dimensional real-valued inputs and target values equal to $\pm 1$. The red line is the decision boundary (see text). Right: corresponding perceptron.

5.1 A classification task

To illustrate how perceptrons can solve classification tasks, we consider a very simple example (Figure 5.3). There are ten patterns, each with two real-valued components:

$$
\boldsymbol{x}^{(\mu)} = \begin{pmatrix} x_1^{(\mu)} \\ x_2^{(\mu)} \end{pmatrix}. \tag{5.8}
$$

In Figure 5.3 the patterns are drawn as points in the $x_1$-$x_2$ plane, the input plane. There are two classes of patterns, with targets $\pm 1$: $t^{(\mu)} = 1$ for one class and $t^{(\mu)} = -1$ for the other, as indicated by the two symbols in the legend of Figure 5.3. (5.9)

The activation function consistent with the possible target values is the signum function, $g(b) = \mathrm{sgn}(b)$. The perceptron has two input terminals connected to a single output neuron. Since there is only one neuron, we can arrange the weights into a weight vector

$$
\boldsymbol{w} = \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}. \tag{5.10}
$$

The network performs the computation

$$
O = \mathrm{sgn}(w_1 x_1 + w_2 x_2 - \theta) = \mathrm{sgn}(\boldsymbol{w}\cdot\boldsymbol{x} - \theta). \tag{5.11}
$$

Here $\boldsymbol{w}\cdot\boldsymbol{x} = w_1 x_1 + w_2 x_2$ is the scalar product between the vectors $\boldsymbol{w}$ and $\boldsymbol{x}$. This allows us to find a geometrical interpretation of the classification task. We see in Figure 5.3 that the patterns fall into two clusters: to the right and to the left. We can classify the patterns by drawing a line that separates the two clusters, so that everything on the right of the line has $t = 1$, while the patterns on the left of the line have $t = -1$. This line is called the decision boundary. To find the geometrical significance of Equation (5.11), let us ignore the threshold for a moment, so that

$$
O = \mathrm{sgn}(\boldsymbol{w}\cdot\boldsymbol{x}). \tag{5.12}
$$

The classification task takes the form

$$
\mathrm{sgn}\big(\boldsymbol{w}\cdot\boldsymbol{x}^{(\mu)}\big) = t^{(\mu)}. \tag{5.13}
$$

Figure 5.4: The perceptron classifies the patterns correctly for the weight vector $\boldsymbol{w}$ shown, orthogonal to the decision boundary.

Figure 5.5: Decision boundaries without and with threshold.

To evaluate the scalar product we write the vectors as

$$
\boldsymbol{w} = |\boldsymbol{w}| \begin{pmatrix} \cos\beta \\ \sin\beta \end{pmatrix} \quad\text{and}\quad \boldsymbol{x} = |\boldsymbol{x}| \begin{pmatrix} \cos\alpha \\ \sin\alpha \end{pmatrix}. \tag{5.14}
$$

Here $|\boldsymbol{w}| = \sqrt{w_1^2 + w_2^2}$ denotes the norm of the vector $\boldsymbol{w}$, and $\alpha$ and $\beta$ are the angles of the vectors with the $x_1$-axis. Then $\boldsymbol{w}\cdot\boldsymbol{x} = |\boldsymbol{w}||\boldsymbol{x}|\cos(\alpha - \beta) = |\boldsymbol{w}||\boldsymbol{x}|\cos\varphi$, where $\varphi$ is the angle between the two vectors. When $\varphi$ is between $-\pi/2$ and $\pi/2$, the scalar product is positive, otherwise negative. As a consequence, the network classifies the patterns in Figure 5.3 correctly if the weight vector is orthogonal to the decision boundary drawn in Figure 5.4.

What is the role of the threshold $\theta$? Equation (5.11) shows that the decision boundary is parameterised by $\boldsymbol{w}\cdot\boldsymbol{x} = \theta$, or

$$
x_2 = -(w_1/w_2)\, x_1 + \theta/w_2. \tag{5.15}
$$

Therefore the threshold determines the intersection of the decision boundary with the $x_2$-axis (equal to $\theta/w_2$). This is illustrated in Figure 5.5.

The decision boundary (the straight line orthogonal to $\boldsymbol{w}$) should divide inputs with positive and negative targets. If no such line can be found, then the problem cannot be solved with a single neuron. Conversely, if such a line exists, the problem can be solved (and it is called linearly separable). Otherwise the problem is not linearly separable. This can occur only when $p > N$. Examples of problems that are linearly separable and not linearly separable are shown in Figure 5.6.

Other examples are Boolean functions. A Boolean function takes $N$ binary inputs and has one binary output. The Boolean AND function (two inputs) is illustrated in Figure 5.7. The value table of the function is shown on the left. The graphical representation is shown in the centre of the Figure (one symbol corresponds to $t = -1$ and the other to $t = +1$). Also shown are the decision boundary, the weight vector $\boldsymbol{w}$, and the network layout with the corresponding values of the weights and the threshold.
It is important to note that the decision boundary is not unique, and neither are the weight and threshold values that solve the problem. The norm of the weight vector, in particular, is arbitrary. Neither is its direction uniquely specified.
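The single-neuron computation $O = \mathrm{sgn}(\boldsymbol{w}\cdot\boldsymbol{x} - \theta)$ can be checked for the Boolean AND function in a few lines. The sketch below uses the threshold 3/2 quoted in Figure 5.7 together with unit weights, which are an assumption here, as are the 0/1 coding of the inputs and the helper name `sgn`. It also illustrates that rescaling weights and threshold by a common positive factor leaves the classification unchanged.

```python
import numpy as np

def sgn(b):
    # Signum activation; b = 0 does not occur for the values used below.
    return np.where(b > 0, 1, -1)

# Assumed parameters: unit weights, threshold 3/2 as quoted in Figure 5.7
w = np.array([1.0, 1.0])
theta = 1.5

# Value table of the Boolean AND function (inputs 0/1, targets -1/+1)
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([-1, -1, -1, 1])

outputs = sgn(inputs @ w - theta)

# The norm of the weight vector is arbitrary: scaling w and theta together
# by a positive factor does not move the decision boundary.
outputs_scaled = sgn(inputs @ (2.0 * w) - 2.0 * theta)
```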

Figure 5.8 shows that the Boolean XOR function is not linearly separable [34]. There are 16 different Boolean functions of two variables. Only two are not linearly separable, the XOR and the NOT XOR function.

Up to now we discussed only one output unit. If the classification task requires several output units, each has its own weight vector $\boldsymbol{w}_i$ and threshold $\theta_i$. We can group the weight vectors into a weight matrix as in Part I, so that the $\boldsymbol{w}_i$ are the rows of $\mathbb{W}$.

5.2 Iterative learning algorithm

In the previous Section we determined the weights and threshold for the Boolean AND problem by inspection. Now we discuss an algorithm that allows a computer to find the weights iteratively. How this works is illustrated in Figure 5.9. In panel (a), the pattern $\boldsymbol{x}^{(8)}$ ($t^{(8)} = 1$) is on the wrong side of the decision boundary. To turn the decision boundary anti-clockwise one adds a small multiple of the pattern vector $\boldsymbol{x}^{(8)}$ to the weight vector:

$$
\boldsymbol{w}' = \boldsymbol{w} + \delta\boldsymbol{w} \quad\text{with}\quad \delta\boldsymbol{w} = \eta\, \boldsymbol{x}^{(8)}. \tag{5.16}
$$

The parameter $\eta > 0$ is called the learning rate. It must be small, so that the decision boundary is not rotated too far. The result is shown in panel (b). Panel (c) shows another case, where pattern $\boldsymbol{x}^{(4)}$ ($t^{(4)} = -1$) is on the wrong side of the decision boundary. In order to turn the decision boundary in the right way, anti-clockwise, one subtracts a small multiple of $\boldsymbol{x}^{(4)}$:

$$
\boldsymbol{w}' = \boldsymbol{w} + \delta\boldsymbol{w} \quad\text{with}\quad \delta\boldsymbol{w} = -\eta\, \boldsymbol{x}^{(4)}. \tag{5.17}
$$

Note the minus sign. These two learning rules combine to

$$
\boldsymbol{w}' = \boldsymbol{w} + \delta\boldsymbol{w}^{(\mu)} \quad\text{with}\quad \delta\boldsymbol{w}^{(\mu)} = \eta\, t^{(\mu)} \boldsymbol{x}^{(\mu)}. \tag{5.18}
$$

For more than one output unit the rule reads

$$
w_{ij}' = w_{ij} + \delta w_{ij}^{(\mu)} \quad\text{with}\quad \delta w_{ij}^{(\mu)} = \eta\, t_i^{(\mu)} x_j^{(\mu)}. \tag{5.19}
$$

This rule is of the same form as Hebb's rule (2.2). One applies (5.19) iteratively for a sequence of randomly chosen patterns $\mu$, until the problem is solved. This corresponds to adding a little bit of Hebb's rule in each iteration. To ensure that the algorithm stops when the problem is solved, one can use

$$
\delta w_{ij}^{(\mu)} = \eta\, \big(t_i^{(\mu)} - O_i^{(\mu)}\big)\, x_j^{(\mu)}. \tag{5.20}
$$

Figure 5.6: Linearly separable and non-separable data.
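A minimal sketch of the learning rule (5.20) for a single output neuron, applied to the Boolean AND problem, might look as follows. The text has not yet said how thresholds are updated, so the sketch makes an assumption: the threshold is treated as one more weight, attached to a constant input equal to $-1$. The function name `train_perceptron`, the sweep limit, and the value $\eta = 0.5$ (chosen so that all arithmetic is exact for 0/1 inputs) are likewise illustrative choices.

```python
import numpy as np

def sgn(b):
    # Signum activation with the convention sgn(0) = -1 (an assumption).
    return np.where(b > 0, 1, -1)

def train_perceptron(x, t, eta, max_sweeps=1000, seed=0):
    """Iterate the rule (5.20) over randomly ordered patterns."""
    rng = np.random.default_rng(seed)
    # Assumption: the threshold is learned as the weight of a constant input -1.
    x_aug = np.hstack([x, -np.ones((len(x), 1))])
    w = np.zeros(x_aug.shape[1])
    for _ in range(max_sweeps):
        for mu in rng.permutation(len(x_aug)):
            output = sgn(x_aug[mu] @ w)
            # The increment vanishes when the output equals the target.
            w += eta * (t[mu] - output) * x_aug[mu]
        if np.array_equal(sgn(x_aug @ w), t):
            break  # all patterns classified correctly
    return w[:-1], w[-1]

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([-1, -1, -1, 1])  # Boolean AND with targets -1/+1

w, theta = train_perceptron(inputs, targets, eta=0.5)
outputs = sgn(inputs @ w - theta)
```

Because the AND problem is linearly separable, the loop stops after a small number of sweeps. For a problem that is not linearly separable it would run until `max_sweeps` is exhausted.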

Figure 5.7: Boolean AND function: value table, geometrical representation, and network layout. The weight values are written next to the connections.

5.3 Gradient-descent learning

In this Section the learning algorithm (5.20) is derived in a different way, by minimising an energy function using gradient descent. This requires differentiation, therefore we must choose a differentiable activation function. The simplest choice is $g(b) = b$, so that the network computes

$$
O_i^{(\mu)} = \sum_k w_{ik}\, x_k^{(\mu)} \tag{5.21}
$$

(linear unit). The outputs $O_i^{(\mu)}$ assume continuous values, but not necessarily the targets $t_i^{(\mu)}$. For linear units, the classification problem

$$
O_i^{(\mu)} = t_i^{(\mu)} \quad\text{for } i = 1, \ldots, N \text{ and } \mu = 1, \ldots, p \tag{5.22}
$$

has the formal solution

$$
w_{ik} = \frac{1}{N} \sum_{\mu\nu} t_i^{(\mu)} \left(\mathbb{Q}^{-1}\right)_{\mu\nu} x_k^{(\nu)}, \tag{5.23}
$$

as you can verify by inserting Equation (5.23) into (5.21). Here $\mathbb{Q}$ is the overlap matrix with elements

$$
Q_{\mu\nu} = \frac{1}{N}\, \boldsymbol{x}^{(\mu)} \cdot \boldsymbol{x}^{(\nu)} \tag{5.24}
$$

[see Equation (5.1)]. For the solution (5.23) to exist, the matrix $\mathbb{Q}$ must be invertible. This requires that $p \le N$, because otherwise the pattern vectors are linearly dependent, and thus also the columns (or rows) of $\mathbb{Q}$. If the matrix $\mathbb{Q}$ has linearly dependent columns or rows it cannot be inverted.

Let us assume that the patterns are linearly independent, so that the solution (5.23) exists. In this case we can find the solution iteratively. To this end one defines the energy function

$$
H(\{w_{ij}\}) = \frac{1}{2} \sum_{i\mu} \Big(t_i^{(\mu)} - O_i^{(\mu)}\Big)^2 = \frac{1}{2} \sum_{i\mu} \Big(t_i^{(\mu)} - \sum_j w_{ij}\, x_j^{(\mu)}\Big)^2. \tag{5.25}
$$

Figure 5.8: The Boolean XOR function is not linearly separable.
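The formal solution (5.23) is easy to check numerically. The sketch below draws a few random patterns with $p \le N$ (random Gaussian patterns are linearly independent with probability one), builds the overlap matrix, and verifies that linear units with the weights (5.23) reproduce the targets, so that the energy function vanishes. The sizes, the random data, and the convention that row $\mu$ of the arrays `x` and `t` holds pattern $\mu$ are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, p = 5, 2, 3                       # hypothetical sizes with p <= N

x = rng.normal(size=(p, N))             # row mu holds the pattern x^(mu)
t = rng.choice([-1.0, 1.0], size=(p, M))  # row mu holds the targets t^(mu)

# Overlap matrix Q_{mu nu} = x^(mu) . x^(nu) / N
Q = x @ x.T / N

# Formal solution: w_ik = (1/N) sum_{mu nu} t_i^(mu) (Q^-1)_{mu nu} x_k^(nu)
W = t.T @ np.linalg.inv(Q) @ x / N

# Linear units: O_i^(mu) = sum_k w_ik x_k^(mu)
outputs = x @ W.T

# Energy function: half the summed squared deviations from the targets
energy = 0.5 * np.sum((t - outputs) ** 2)
```

For $p > N$ the matrix `Q` becomes singular and `np.linalg.inv` fails or returns meaningless values, in line with the condition stated in the text.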

Figure 5.9: Illustration of the learning algorithm. In panel (a) the $t = 1$ pattern $\boldsymbol{x}^{(8)}$ is on the wrong side of the decision boundary. To correct the error the weight must be rotated anti-clockwise [panel (b)]. In panel (c) the $t = -1$ pattern $\boldsymbol{x}^{(4)}$ is on the wrong side of the decision boundary. To correct the error the weight must be rotated anti-clockwise [panel (d)].

Here $H$ is regarded as a function of the weights $w_{ij}$, unlike the energy function in Part I, which is a function of the state variables of the neurons. The energy function (5.25) is non-negative, and it vanishes for the optimal $w_{ij}$ if the pattern vectors $\boldsymbol{x}^{(\mu)}$ are linearly independent. This solution of the classification problem corresponds to the global minimum of $H$.

To find the global minimum of $H$ one uses gradient descent: one repeatedly updates the weights by adding increments

$$
w_{mn}' = w_{mn} + \delta w_{mn} \quad\text{with}\quad \delta w_{mn} = -\eta\, \frac{\partial H}{\partial w_{mn}}. \tag{5.26}
$$

The small parameter $\eta > 0$ is the learning rate. The negative gradient points in the direction of steepest descent of $H$. The idea is to take many downhill steps until one hopefully (but not necessarily) reaches the global minimum.

To evaluate the derivatives one uses the chain rule together with

$$
\frac{\partial w_{ij}}{\partial w_{mn}} = \delta_{im}\, \delta_{jn}. \tag{5.27}
$$

Here $\delta_{kl}$ is the Kronecker delta, $\delta_{kl} = 1$ if $k = l$ and zero otherwise. So $\delta_{im}\delta_{jn} = 1$ only if $i = m$ and $j = n$. Otherwise the product of Kronecker deltas equals zero.

Illustration. The linear function, $x$, and the constant function are going for a walk. When they suddenly see the derivative approaching, the constant function gets worried. "I'm not worried," says the function $x$ confidently, "I'm not put to zero by the derivative." When the derivative comes closer, it says "Hi! I'm $\partial/\partial y$. How are you?" Moral: when $i \ne m$ or $j \ne n$ then $w_{ij}$ and $w_{mn}$ are independent variables, so that the derivative (5.27) vanishes.

Equation (5.27) gives

$$
\delta w_{mn} = \eta \sum_\mu \Big(t_m^{(\mu)} - O_m^{(\mu)}\Big)\, x_n^{(\mu)}. \tag{5.28}
$$

This learning rule is very similar to Equation (5.20). The difference is that Equation (5.28) contains

a sum over all patterns (batch training). An advantage of the rule (5.28) is that it is derived from an energy function. This allows one to analyse the convergence of the algorithm.

Linear units [Equation (5.21)] are special. You cannot solve the Boolean AND problem (Figure 5.7) with a linear unit, although the problem is linearly separable, because the pattern vectors $\boldsymbol{x}^{(\mu)}$ are linearly dependent. Shifting the patterns or introducing a threshold does not change this fact. Linear separability does not imply linear independence (but the converse is true).

Therefore we usually use non-linear units, McCulloch-Pitts neurons with non-linear activation functions $g(b)$. There are four important points to keep in mind. First, for non-linear units it matters less whether or not the patterns are linearly dependent, but it is important whether the problem is linearly separable or not. Second, if the problem is linearly separable then we can use gradient descent to determine suitable weights (and thresholds). Third, for gradient descent we must require that the activation function $g(b)$ is differentiable, or at least piecewise differentiable. Fourth, we calculate the gradients using the chain rule, resulting in factors of derivatives $g'(b) = \frac{{\rm d}}{{\rm d}b} g(b)$. This is the origin of the vanishing-gradient problem (Chapter 7).

Figure 5.10: (a) Linearly separable problem. (b) Problem not linearly separable. Problems that are not linearly separable can be solved by a piecewise linear decision boundary.

5.4 Multi-layer perceptrons

In Sections 5.1 and 5.2 we discussed how to solve linearly separable problems [Figure 5.10(a)]. The aim of this Section is to show that non-separable problems like the one in Figure 5.10(b) can be solved by a perceptron with one hidden layer. A network that does the trick for the classification problem in Figure 5.10(b) is depicted in Figure 5.11.

Figure 5.11: Hidden-layer perceptron to solve the problem shown in Figure 5.10(b). The three hidden neurons are 0/1 neurons, the output neuron produces $\pm 1$.

Here the hidden neurons are 0/1 units, but the output neuron

gives $\pm 1$, as in the previous Section. The network computes with the following rules:

$$
V_j^{(\mu)} = \theta_{\rm H}\big(b_j^{(\mu)}\big) \quad\text{with}\quad b_j^{(\mu)} = \sum_k w_{jk}\, x_k^{(\mu)} - \theta_j, \qquad
O^{(\mu)} = \mathrm{sgn}\big(B^{(\mu)}\big) \quad\text{with}\quad B^{(\mu)} = \sum_j W_j\, V_j^{(\mu)} - \Theta. \tag{5.29}
$$

Here $\theta_{\rm H}(b)$ is the Heaviside function. Each of the three neurons in the hidden layer has its own decision boundary. The idea is to choose weights and thresholds in such a way that the three decision boundaries divide the input plane into distinct regions, so that each region contains either only $t = 0$ patterns or $t = 1$ patterns. We shall see that the values of the hidden neurons encode the different regions. Finally, the output neuron associates the correct target value with each region.

How this construction works is shown in Figure 5.12. The left part of the Figure shows the three decision boundaries. The indices of the corresponding hidden neurons are drawn in blue. Also shown are the weight vectors. The regions are encoded with a three-digit binary code. The value of the $j$-th digit is the value of the $j$-th hidden neuron: $V_j = 1$ if the pattern is on the weight-vector side of the decision boundary, and $V_j = 0$ on the other side. The Table shows the targets associated with each region, together with the code of the region.

Figure 5.12: Left: decision boundaries and regions. Right: encoding of the regions and corresponding targets. The region 00 does not exist.

A graphical representation of the output problem is shown in Figure 5.13. The problem is linearly separable. The following function computes the correct output for each region:

O (µ) = sgn V (µ) + V (µ) 2 + V (µ) (5.30)

Figure 5.13: Graphical representation of the output problem for the classification problem shown in Figure 5.12.

Figure 5.14: Boolean XOR function: value table, geometrical representation, and network layout. The two hidden neurons are 0/1 neurons, the output produces $\pm 1$.

This completes the construction of a solution. It is not unique. In summary, one can solve non-linearly separable problems by adding a hidden layer. The neurons in the hidden layer define segments of a piecewise linear decision boundary. More neurons are needed if the decision boundary is very wiggly.

Figure 5.14 shows another example: how to solve the Boolean XOR problem with a perceptron that has two 0/1 neurons in a hidden layer, with thresholds $\tfrac{1}{2}$ and $\tfrac{3}{2}$, and all weights equal to unity. The output neuron has weights $+1$ and $-1$ and threshold $\tfrac{1}{2}$:

$$
O = \mathrm{sgn}\big(V_1 - V_2 - \tfrac{1}{2}\big). \tag{5.31}
$$

Minsky and Papert [34] proved in 1969 that all Boolean functions can be represented by multi-layer perceptrons, but that at least one hidden neuron must be connected to all input terminals. This means that not all neurons in the network are locally connected (have only a few incoming weights). Since fully connected networks are much harder to train, Minsky and Papert offered a somewhat pessimistic view of learning with perceptrons, resulting in a controversy [35]. Now, almost 50 years later, the perspective has changed. Convolutional networks (Chapter 7) have only local connections to the inputs and can be trained to recognise objects in images with high accuracy.

In summary, perceptrons are trained on a training set $(\boldsymbol{x}^{(\mu)}, \boldsymbol{t}^{(\mu)})$ with $\mu = 1, \ldots, p$ by moving the decision boundaries into the correct positions. This is achieved by repeatedly applying Hebb's rule to adjust all weights. This corresponds to using gradient-descent learning on the energy function (5.25). We have not discussed how to update the thresholds yet, but it is clear that they can be updated with gradient-descent learning (Section 5.3).

Once all decision boundaries are in the right place we must ask: what happens when we apply the trained network to a new data set? Does it classify the new inputs correctly?
In other words, can the network generalise? An example is shown in Figure 5.15. Panel (a) shows the result of training the network on a training set. The decision boundary separates $t = 1$ patterns from $t = -1$ patterns, so that the network classifies all patterns in the training set correctly. In panel (b) the trained network is applied to patterns in a validation set. We see that most patterns are correctly classified, save for one error. This means that the energy function (5.25) is not exactly zero for the validation set. Nevertheless, the network does quite a good job. Usually it is not a good idea to try to precisely classify all patterns near the decision boundary, because real-world data sets are subject to noise. It is a futile effort to try to learn and predict noise.
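Returning briefly to the XOR construction with two hidden neurons described above, the following sketch evaluates that network for all four inputs. It uses the thresholds and weights quoted in the text. The 0/1 coding of the inputs, the array layout, and the helper names `heaviside` and `sgn` are assumptions of this example.

```python
import numpy as np

def heaviside(b):
    # Heaviside function: 1 for b > 0, 0 otherwise
    return np.where(b > 0, 1, 0)

def sgn(b):
    # Signum activation; b = 0 does not occur for the values used below.
    return np.where(b > 0, 1, -1)

w = np.ones((2, 2))            # hidden weights, all equal to unity
theta = np.array([0.5, 1.5])   # hidden thresholds 1/2 and 3/2
W = np.array([1.0, -1.0])      # output weights +1 and -1
Theta = 0.5                    # output threshold 1/2

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # all inputs (assumed 0/1 coding)

V = heaviside(x @ w.T - theta)   # states of the two hidden neurons
O = sgn(V @ W - Theta)           # network output, +1 when exactly one input is 1
```

The first hidden neuron switches on when at least one input equals 1, the second only when both do. The output neuron responds to the first and is vetoed by the second.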

Figure 5.15: (a) Result of training the network on a training set. (b) Validation by feeding the patterns of a validation set. One pattern is misclassified (error).

5.5 Summary

Perceptrons are layered feed-forward networks that can learn to classify data in a training set $(\boldsymbol{x}^{(\mu)}, \boldsymbol{t}^{(\mu)})$. For each input pattern $\boldsymbol{x}^{(\mu)}$ the network finds the correct targets $\boldsymbol{t}^{(\mu)}$. We discussed the learning algorithm for a simple example: real-valued patterns with just two components, and one binary target. This allowed us to represent the classification problem graphically. There are three different ways of understanding how the perceptron learns. First, geometrically, to learn means to move the decision boundaries into the right places. Second, this can be achieved by repeatedly adding a little bit of Hebb's rule. Third, this algorithm corresponds to gradient descent on the energy function (5.25).

5.6 Further reading

A short account of the history of perceptron research is the review by Kanal [35]. He discusses the work of Rosenblatt [12, 13], McCulloch and Pitts [11], as well as the early controversy around the book by Minsky and Papert [34].

5.7 Exercises

Non-orthogonal patterns. Show that the rule (5.2) ensures that all patterns are recognised, for any set of non-orthogonal patterns that gives rise to an invertible matrix $\mathbb{Q}$. Demonstrate this by showing that the cross-talk term evaluates to zero, assuming that $\mathbb{Q}^{-1}$ exists.

Boolean functions. How many Boolean functions with three-dimensional inputs are there? How many of them are linearly separable?

5.8 Exam questions

5.8.1 Linear separability of Boolean functions

(a) The value table for the Boolean XOR problem is shown in Table 5.1. Show that this problem cannot be solved by a simple perceptron, with two input terminals and one output unit (no hidden layer) (0.5p).

(b) Now consider Boolean functions with three-dimensional inputs. How many of the 256 Boolean functions with three-dimensional inputs can be solved by a perceptron with three input terminals and one output unit (no hidden layer)? Describe how you arrive at the answer. Hint: visualise the functions by colouring the corners of a cube. Use symmetries to reduce the number of cases. (1.5p).

Table 5.1: Value table for the XOR problem (columns $x_1$, $x_2$, $t$).

Boolean functions

Any $N$-dimensional Boolean function can be represented using a perceptron with one hidden layer consisting of $2^N$ neurons. Here we consider $N = 3$ and $N = 2$. The three-dimensional parity problem is specified in Figure 5.16. The input bits $x_k^{(\mu)}$ for $k = 1, 2, 3$ are either $+1$ or $-1$. The output $O^{(\mu)}$ of the network is $+1$ if there is an odd number of positive bits in $\boldsymbol{x}^{(\mu)}$, and $-1$ if the number of positive bits is even. In one solution, the state $V_j^{(\mu)}$ of neuron $j = 1, \ldots, 2^N$ in the hidden layer is given by:

$$
V_j^{(\mu)} = \begin{cases} 1 & \text{if } -\theta_j + \sum_k w_{jk}\, x_k^{(\mu)} > 0,\\ 0 & \text{if } -\theta_j + \sum_k w_{jk}\, x_k^{(\mu)} \le 0, \end{cases} \tag{5.32}
$$

where the weights and thresholds are given by $w_{jk} = x_k^{(j)}$ and $\theta_j = 2$. The output is given by $O^{(\mu)} = \sum_j W_j V_j^{(\mu)}$.

(a) Determine the weights $W_j$. (1 p).

(b) In two dimensions the problem in (a) is the XOR problem. Specialise the solution from (a) to the XOR problem and draw the decision boundaries of the hidden neurons. (0.5 p)

Figure 5.16: The three-dimensional parity problem. A white ball indicates $O^{(\mu)} = -1$, and a black ball indicates $O^{(\mu)} = +1$.

Figure 5.17: Classification problem. Input space is the $x_1$-$x_2$-plane.

Linearly inseparable problem

A classification problem is specified in Figure 5.17. The aim is to map input patterns $\boldsymbol{x}^{(\mu)}$ to outputs $O^{(\mu)}$ as follows: if a point $\boldsymbol{x}^{(\mu)}$ lies inside the triangle it is mapped to $O^{(\mu)} = 1$, but if $\boldsymbol{x}^{(\mu)}$ is outside the triangle it is mapped to $O^{(\mu)} = 0$. How patterns on the triangle boundary are classified is not important.

(a) Show that this problem is not linearly separable. (0.5 p).

(b) The problem can be solved by a perceptron with one hidden layer with three neurons ($j = 1, 2, 3$),

$$
V_j^{(\mu)} = \theta_{\rm H}\Big(-\theta_j + \sum_{k=1}^{2} w_{jk}\, x_k^{(\mu)}\Big). \tag{5.33}
$$

The output is computed as

$$
O^{(\mu)} = \theta_{\rm H}\Big(-\Theta + \sum_{j=1}^{3} W_j\, V_j^{(\mu)}\Big). \tag{5.34}
$$

Here $w_{jk}$ and $W_j$ are weights, $\theta_j$ and $\Theta$ are thresholds, and $\theta_{\rm H}$ is the Heaviside function:

$$
\theta_{\rm H}(b) = \begin{cases} 1 & \text{for } b > 0,\\ 0 & \text{for } b \le 0. \end{cases} \tag{5.35}
$$

Find weights and thresholds that solve the classification problem. (1 p)

Perceptron with one hidden layer

A perceptron has one input layer, one layer of hidden neurons, and one output unit. It receives two-dimensional input patterns $\boldsymbol{x}^{(\mu)} = (x_1^{(\mu)}, x_2^{(\mu)})^{\sf T}$. They are mapped to four hidden neurons $V_j^{(\mu)}$ according to

$$
V_j^{(\mu)} = \begin{cases} 0 & \text{if } -\theta_j + \sum_k w_{jk}\, x_k^{(\mu)} \le 0,\\ 1 & \text{if } -\theta_j + \sum_k w_{jk}\, x_k^{(\mu)} > 0, \end{cases} \tag{5.36}
$$

where $w_{jk}$ and $\theta_j$ are weights and thresholds of the hidden neurons. The output is given by

$$
O^{(\mu)} = \begin{cases} 0 & \text{if } -\Theta + \sum_j W_j\, V_j^{(\mu)} \le 0,\\ 1 & \text{if } -\Theta + \sum_j W_j\, V_j^{(\mu)} > 0. \end{cases} \tag{5.37}
$$

Weights $W_j$ and threshold $\Theta$ of the output are given by:

$$
W_1 = W_3 = W_4 = 1, \quad W_2 = -1, \quad\text{and}\quad \Theta = \tfrac{1}{2}. \tag{5.38}
$$

(a) Figure 5.18 (left) shows how input space is mapped to the hidden neurons. Draw the decision boundary of the network, given the weights and thresholds in Equation (5.38). (0.5p).

(b) Show that one cannot map the input space to the space of hidden neurons as in Figure 5.18 (right). (0.5 p).

(c) Give values of $w_{jk}$ and $\theta_j$ that yield the pattern in Figure 5.18 (left). (1 p).

Figure 5.18: Left: input space with decision boundaries of the hidden neurons $V_j$ (black lines). These decision boundaries divide input space into nine zones, each with a certain coordinate $\boldsymbol{V} = (V_1, V_2, V_3, V_4)^{\sf T}$ in the space of the hidden neurons. Right: same, but here the indicated mapping to the space of the hidden neurons is not possible.

Linearly inseparable problem

A classification problem is specified in Figure 5.19, where a grey triangle in input space is shown. The aim is to map input patterns $\boldsymbol{x}^{(\mu)}$ to outputs $O^{(\mu)}$ as follows: if a point $\boldsymbol{x}^{(\mu)}$ lies inside the triangle it is mapped to $O^{(\mu)} = +1$, but if $\boldsymbol{x}^{(\mu)}$ is outside the triangle it is mapped to $O^{(\mu)} = -1$. How patterns on the boundary of the triangle are classified is not important.

(a) Show that this problem is not linearly separable by constructing a counter-example using four input patterns. (0.5p).

(b) The problem can be solved by a perceptron with one hidden layer with three neurons ($j = 1, 2, 3$)

$$
V_j^{(\mu)} = \mathrm{sgn}\Big(-\theta_j + \sum_{k=1}^{2} w_{jk}\, x_k^{(\mu)}\Big) \tag{5.39}
$$

and output

$$
O^{(\mu)} = \mathrm{sgn}\Big(-\Theta + \sum_{j=1}^{3} W_j\, V_j^{(\mu)}\Big). \tag{5.40}
$$

Here $w_{jk}$ and $W_j$ are weights and $\theta_j$ and $\Theta$ are thresholds. The weights $w_{jk}$ are subject to the following constraints. First, the three weights $w_{j1}$ are all equal to one, $w_{11} = w_{21} = w_{31} = 1$. Second, the three weights $w_{j2}$ are such that $\boldsymbol{x}^{(\mu)} = (-4, -1)^{\sf T}$ maps to $\boldsymbol{V}^{(\mu)} = (1, -1, -1)^{\sf T}$ and $\boldsymbol{x}^{(\mu)} = (-1, 5)^{\sf T}$ maps to $\boldsymbol{V}^{(\mu)} = (-1, -1, 1)^{\sf T}$. Given these constraints, find values of $w_{jk}$, $W_j$, $\theta_j$ and $\Theta$ that solve the classification problem. Hint: The constraints uniquely determine the hidden thresholds, the orientations of the hidden weight vectors, and their order in the weight matrix. (1p).

Figure 5.19: Classification problem. Input space is the $x_1$-$x_2$-plane.

6 Stochastic gradient descent

In Chapter 5 we discussed how a hidden layer helps to classify problems that are not linearly separable. We explained how the decision boundary in Figure 5.12 is represented in terms of the weights and thresholds of the hidden neurons, and introduced a training algorithm based on gradient descent. In this Chapter, the training algorithm is discussed in more detail. It is explained how it is implemented, why it converges, and how its performance can be improved.

Figure 6.1 shows the layout of the network to be trained. There are $p$ input patterns $\boldsymbol{x}^{(\mu)}$ with $N$ components each, as before. The output of the network has $M$ components:

$$
\boldsymbol{O}^{(\mu)} = \begin{pmatrix} O_1^{(\mu)} \\ O_2^{(\mu)} \\ \vdots \\ O_M^{(\mu)} \end{pmatrix}, \tag{6.1}
$$

to be matched to the targets $\boldsymbol{t}^{(\mu)}$. The activation functions must be differentiable (or at least piecewise differentiable), but apart from that there is no need to specify them further at this point. The network shown in Figure 6.1 performs the computation

$$
V_j^{(\mu)} = g\big(b_j^{(\mu)}\big) \quad\text{with}\quad b_j^{(\mu)} = \sum_k w_{jk}\, x_k^{(\mu)} - \theta_j, \qquad
O_i^{(\mu)} = g\big(B_i^{(\mu)}\big) \quad\text{with}\quad B_i^{(\mu)} = \sum_j W_{ij}\, V_j^{(\mu)} - \Theta_i. \tag{6.2}
$$

So the outputs are computed in terms of nested activation functions:

$$
O_i^{(\mu)} = g\Big(\sum_j W_{ij}\, \underbrace{g\Big(\sum_k w_{jk}\, x_k^{(\mu)} - \theta_j\Big)}_{V_j^{(\mu)}} - \Theta_i\Big). \tag{6.3}
$$

This is a consequence of the network layout of the perceptron: all incoming connections to a given neuron are from the layer immediately to the left, all outgoing connections to the layer immediately to the right. The more hidden layers a network has, the deeper is the nesting of the activation functions.

6.1 Chain rule and error backpropagation

The network is trained by gradient-descent learning on the energy function (5.25), in the same way as in Section 5.3:

$$
H = \frac{1}{2} \sum_{\mu i} \Big(t_i^{(\mu)} - O_i^{(\mu)}\Big)^2. \tag{6.4}
$$

The weights are updated using the increments

$$
\delta W_{mn} = -\eta\, \frac{\partial H}{\partial W_{mn}} \quad\text{and}\quad \delta w_{mn} = -\eta\, \frac{\partial H}{\partial w_{mn}}. \tag{6.5}
$$

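Figure 6.1: Neural network with one hidden layer (inputs $x_k$, hidden weights $w_{jk}$, hidden neurons $V_j$, output weights $W_{ij}$, outputs $O_i$). Illustrates the notation used in Section 6.1.

As in Section 5.3, the small parameter $\eta > 0$ is the learning rate. The derivatives of the energy function are evaluated with the chain rule. For the weights connecting to the output layer we apply the chain rule once

$$\frac{\partial H}{\partial W_{mn}} = -\sum_{\mu i} \big(t_i^{(\mu)} - O_i^{(\mu)}\big) \frac{\partial O_i^{(\mu)}}{\partial W_{mn}}, \tag{6.6a}$$

and then once more:

$$\frac{\partial O_i^{(\mu)}}{\partial W_{mn}} = \frac{\partial}{\partial W_{mn}}\, g\Big(\sum_j W_{ij} V_j^{(\mu)} - \Theta_i\Big) = g'\big(B_i^{(\mu)}\big)\, \delta_{im} V_n^{(\mu)}. \tag{6.6b}$$

Here $g'(B_i) = \mathrm{d}g/\mathrm{d}B_i$ is the derivative of the activation function with respect to the local field $B_i$. An important point here is that the values $V_j$ of the neurons in the hidden layer do not depend on $W_{mn}$. The neurons $V_j$ do not have incoming connections with these weights, a consequence of the feed-forward layout of the network. In summary we obtain for the increments of the weights connecting to the output layer:

$$\delta W_{mn} = -\eta \frac{\partial H}{\partial W_{mn}} = \eta \sum_{\mu=1}^{p} \underbrace{\big(t_m^{(\mu)} - O_m^{(\mu)}\big)\, g'\big(B_m^{(\mu)}\big)}_{\equiv \Delta_m^{(\mu)}}\, V_n^{(\mu)}. \tag{6.7}$$

The quantity $\Delta_m^{(\mu)}$ is a weighted error: it vanishes when $O_m^{(\mu)} = t_m^{(\mu)}$.

The weights connecting to the hidden layer are updated in a similar fashion, by applying the chain rule three times:

$$\frac{\partial H}{\partial w_{mn}} = -\sum_{\mu i} \big(t_i^{(\mu)} - O_i^{(\mu)}\big) \frac{\partial O_i^{(\mu)}}{\partial w_{mn}}, \tag{6.8a}$$

$$\frac{\partial O_i^{(\mu)}}{\partial w_{mn}} = \frac{\partial}{\partial w_{mn}}\, g\Big(\sum_j W_{ij} V_j^{(\mu)} - \Theta_i\Big) = g'\big(B_i^{(\mu)}\big) \sum_j W_{ij} \frac{\partial V_j^{(\mu)}}{\partial w_{mn}}, \tag{6.8b}$$

$$\frac{\partial V_j^{(\mu)}}{\partial w_{mn}} = \frac{\partial}{\partial w_{mn}}\, g\Big(\sum_k w_{jk} x_k^{(\mu)} - \theta_j\Big) = g'\big(b_j^{(\mu)}\big)\, \delta_{jm} x_n^{(\mu)}. \tag{6.8c}$$

Here $\delta_{jm}$ is the Kronecker delta ($\delta_{jm} = 1$ if $j = m$ and zero otherwise). Taking these results together, one has

$$\delta w_{mn} = \eta \sum_{\mu} \underbrace{\sum_i \Delta_i^{(\mu)} W_{im}\, g'\big(b_m^{(\mu)}\big)}_{\equiv \delta_m^{(\mu)}}\, x_n^{(\mu)}. \tag{6.9}$$

The quantities $\delta_m^{(\mu)}$ are errors associated with the hidden layer (they vanish when the output errors are zero). Equation (6.9) shows that the errors are determined recursively, in terms of the errors in the layer to the right:

$$\delta_m^{(\mu)} = \sum_i \Delta_i^{(\mu)} W_{im}\, g'\big(b_m^{(\mu)}\big). \tag{6.10}$$

In other words, the error $\delta_m^{(\mu)}$ for the hidden layer is computed in terms of the output errors $\Delta_i^{(\mu)}$. Equations (6.7) and (6.9) show that the weight increments have the same form:

$$\delta W_{mn} = \eta \sum_{\mu=1}^{p} \Delta_m^{(\mu)} V_n^{(\mu)} \quad\text{and}\quad \delta w_{mn} = \eta \sum_{\mu=1}^{p} \delta_m^{(\mu)} x_n^{(\mu)}. \tag{6.11}$$

The rule (6.11) is sometimes referred to as the δ-rule. It is local: the increments of the weights feeding into a certain layer are determined by the errors associated with that layer, and by the states of the neurons in the layer immediately to the left.

If the network has more hidden layers, then their errors are computed recursively using Equation (6.10), and the formula for the weight increments has the same form as Equation (6.11) (Algorithm 2). Figure 6.2 illustrates the different ways in which neurons and errors are updated. The feed-forward structure of the layered network means that the neurons are updated from left to right (blue arrows). Equation (6.10), by contrast, implies that the errors are updated from the right to the left (red arrows), from the output layer to the hidden layer. The term backpropagation refers to this difference: the neurons are updated forward, the errors are updated backward.

The thresholds are updated in a similar way:

$$\delta \Theta_m = -\eta \frac{\partial H}{\partial \Theta_m} = -\eta \sum_{\mu} \big(t_m^{(\mu)} - O_m^{(\mu)}\big)\, g'\big(B_m^{(\mu)}\big) = -\eta \sum_{\mu} \Delta_m^{(\mu)}, \tag{6.12a}$$

$$\delta \theta_m = -\eta \frac{\partial H}{\partial \theta_m} = -\eta \sum_{\mu} \sum_i \Delta_i^{(\mu)} W_{im}\, g'\big(b_m^{(\mu)}\big) = -\eta \sum_{\mu} \delta_m^{(\mu)}. \tag{6.12b}$$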
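The batch increments (6.11) and (6.12) can be written compactly with arrays. The following is a minimal sketch, not the implementation used in the course: it assumes tanh activation functions, stores the patterns as rows of `x`, and all function names are hypothetical. A finite-difference comparison is a convenient way to check such a derivation.

```python
import numpy as np

def g(b):
    return np.tanh(b)                      # assumed activation function

def g_prime(b):
    return 1.0 - np.tanh(b) ** 2

def forward(x, w, theta, W, Theta):
    """x has shape (p, N); returns local fields and states for all patterns."""
    b = x @ w.T - theta                    # hidden local fields, Eq. (6.2)
    V = g(b)
    B = V @ W.T - Theta                    # output local fields, Eq. (6.2)
    O = g(B)
    return b, V, B, O

def energy(x, t, w, theta, W, Theta):
    O = forward(x, w, theta, W, Theta)[3]
    return 0.5 * np.sum((t - O) ** 2)      # Eq. (6.4)

def batch_increments(x, t, w, theta, W, Theta, eta):
    b, V, B, O = forward(x, w, theta, W, Theta)
    Delta = (t - O) * g_prime(B)           # output errors, Eq. (6.7)
    delta = (Delta @ W) * g_prime(b)       # hidden errors, Eq. (6.10)
    dW = eta * Delta.T @ V                 # Eq. (6.11)
    dw = eta * delta.T @ x                 # Eq. (6.11)
    dTheta = -eta * Delta.sum(axis=0)      # Eq. (6.12a)
    dtheta = -eta * delta.sum(axis=0)      # Eq. (6.12b)
    return dW, dw, dTheta, dtheta
```

With `eta = 1` each returned increment equals minus the corresponding partial derivative of `energy`, which is what a numerical gradient check compares against.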


Figure 6.2: Backpropagation algorithm: the states of the neurons are updated forward (from left to right) while errors are updated backward (right to left).

The general form for the threshold increments looks like Equation (6.11)

$$\delta \Theta_m = -\eta \sum_{\mu} \Delta_m^{(\mu)} \quad\text{and}\quad \delta \theta_m = -\eta \sum_{\mu} \delta_m^{(\mu)}, \tag{6.13}$$

but without the state variables of the neurons (or the inputs), as expected. A way to remember the difference between Equations (6.11) and (6.13) is to note that the formula for the threshold increments looks like the one for the weight increments if one sets the values of the neurons to $-1$.

Stochastic gradient descent. The backpropagation rules (6.7), (6.9), and (6.12) contain sums over patterns. This corresponds to feeding all patterns at the same time to compute the increments of weights and thresholds (batch training). Alternatively one may choose a single pattern, update the weights by backpropagation, and then continue to iterate these training steps many times. This is called sequential training. One iteration corresponds to feeding a single pattern, $p$ iterations are called one epoch (in batch training, one iteration corresponds to one epoch). If one chooses the patterns randomly, then sequential training results in stochastic gradient descent. Since the sum over patterns is absent, the steps do not necessarily point downhill; their directions fluctuate. This yields a stochastic path through weight and threshold space, less prone to getting stuck in local minima (Chapters 3 and 7). The stochastic gradient-descent algorithm is summarised in Section 6.2. It applies to networks with feed-forward layout, where neurons in a given layer take input only from the neurons in the layer immediately to the left.

Mini batches. In practice, the stochastic gradient-descent dynamics may be too noisy. It is often better to average over a small number of randomly chosen patterns. Such a set is called a mini batch, of size $m_B$ say. In stochastic gradient descent with mini batches one replaces Equations (6.11) and (6.13) by

$$\begin{aligned}
\delta W_{mn} &= \eta \sum_{\mu=1}^{m_B} \Delta_m^{(\mu)} V_n^{(\mu)} \quad\text{and}\quad \delta \Theta_m = -\eta \sum_{\mu=1}^{m_B} \Delta_m^{(\mu)}, \\
\delta w_{mn} &= \eta \sum_{\mu=1}^{m_B} \delta_m^{(\mu)} x_n^{(\mu)} \quad\text{and}\quad \delta \theta_m = -\eta \sum_{\mu=1}^{m_B} \delta_m^{(\mu)}.
\end{aligned} \tag{6.14}$$
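The sums in Equation (6.14) run over the patterns of one mini batch only. A minimal sketch of how an epoch might be split into mini batches is given here; it assumes that the pattern sequence is reshuffled at the start of each epoch, and the helper names and array layout (one row per pattern) are hypothetical.

```python
import numpy as np

def mini_batches(p, m_B, rng):
    """Yield index arrays of size m_B (the last may be smaller) for one shuffled epoch."""
    order = rng.permutation(p)
    for start in range(0, p, m_B):
        yield order[start:start + m_B]

def weight_increment(errors, states, batch, eta):
    """eta * sum over mu in batch of errors[mu, m] * states[mu, n], as in Eq. (6.14)."""
    return eta * errors[batch].T @ states[batch]

def threshold_increment(errors, batch, eta):
    """-eta * sum over mu in batch of errors[mu, m], as in Eq. (6.14)."""
    return -eta * errors[batch].sum(axis=0)
```

In actual training the errors must be recomputed after every mini-batch update, because the weights have changed. Only at fixed weights do the mini-batch increments of one epoch add up to the batch increment (6.11).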

Figure 6.3: Saturation of the activation functions (6.15), $\sigma(b)$ and $\tanh(b)$: the derivative $g'(b)$ tends to zero for large values of $|b|$.

Sometimes the mini-batch rule is quoted with prefactors of $m_B^{-1}$ before the sums. This does not make any fundamental difference, the factors $m_B^{-1}$ can just be absorbed in the learning rate. But when you compare learning rates for different implementations, it is important to check whether or not there are factors of $m_B^{-1}$ in front of the sums in Equation (6.14).

How does one assign inputs to mini batches? This is discussed in Section 6.3.1: at the beginning of each epoch, one should randomly shuffle the sequence of the input patterns in the training set. Then the first mini batch contains patterns $\mu = 1, \ldots, m_B$, and so forth.

Activation functions. Common choices for $g(b)$ are the sigmoid function or tanh:

$$g(b) = \frac{1}{1 + e^{-b}} \equiv \sigma(b), \tag{6.15a}$$

$$g(b) = \tanh(b). \tag{6.15b}$$

Their derivatives can be expressed in terms of the function itself:

$$\frac{\mathrm{d}}{\mathrm{d}b} \sigma(b) = \sigma(b)\,[1 - \sigma(b)], \tag{6.16a}$$

$$\frac{\mathrm{d}}{\mathrm{d}b} \tanh(b) = 1 - \tanh^2(b). \tag{6.16b}$$

Other activation functions are discussed in Chapter 7. As illustrated in Figure 6.3, the activation functions (6.15) saturate at large values of $|b|$, so that the derivative $g'(b)$ tends to zero. Since the backpropagation rules (6.7), (6.9), and (6.12) contain factors of $g'(b)$, this implies that the algorithm slows down. It is a good idea to monitor the values of the local fields during training, to check that they do not become too large.

Initialisation of weights and thresholds. The initial weights and thresholds must be chosen so that the local fields are not too large (but not too small either). A standard procedure is to take all weights to be initially randomly distributed, for example Gaussian with mean zero and a suitable variance. The performance of networks with many hidden layers (deep networks) can be very sensitive to the initialisation of the weights (Section 7.2.4). The thresholds are usually set to zero.
The initial values of the thresholds are not so critical, because thresholds are often learned more rapidly than the weights, at least initially.
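The derivative identities (6.16) and the saturation effect are easy to check numerically. The sketch below is only an illustration; the function names are hypothetical.

```python
import numpy as np

def sigmoid(b):
    return 1.0 / (1.0 + np.exp(-b))    # Eq. (6.15a)

def sigmoid_prime(b):
    s = sigmoid(b)
    return s * (1.0 - s)               # Eq. (6.16a)

def tanh_prime(b):
    return 1.0 - np.tanh(b) ** 2       # Eq. (6.16b)

# Both derivatives are largest at b = 0 and become tiny for large |b|,
# which is why large local fields slow down backpropagation.
for b in (0.0, 2.0, 10.0):
    print(b, sigmoid_prime(b), tanh_prime(b))
```

Monitoring the local fields during training, as recommended above, amounts to checking that these derivative factors stay well away from zero.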

Training. The training continues until the global minimum of $H$ has been reached, or until $H$ is deemed sufficiently small. The resulting weights and thresholds are not unique. In Figure 5.4 all weights for the Boolean XOR function are equal to $\pm 1$. But the training algorithm (6.7), (6.9), and (6.12) corresponds to repeatedly adding weight increments. This may cause the weights to grow.

6.2 Stochastic gradient-descent algorithm

The algorithm described in the previous Section applies to networks with any number of layers. We label the layers by the index $l = 0, \ldots, L$. The layer of input terminals has label $l = 0$, while $l = L$ denotes the layer of output neurons. The state variables for the neurons in layer $l$ are $V_j^{(l)}$, the weights connecting into these neurons from the left are $w_{jk}^{(l)}$, the local fields needed to update $V_j^{(l)}$ are $b_j^{(l)} = \sum_k w_{jk}^{(l)} V_k^{(l-1)} - \theta_j^{(l)}$, and the errors associated with layer $l$ are denoted by $\delta_j^{(l)}$. This is illustrated in Figure 6.4. The algorithm is summarised below (Algorithm 2).

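Figure 6.4: Illustrates notation for stochastic gradient-descent algorithm: errors $\delta_k^{(l-1)}$, $\delta_j^{(l)}$, $\delta_i^{(l+1)}$, neurons $V_k^{(l-1)}$, $V_j^{(l)}$, $V_i^{(l+1)}$, and weights $w_{jk}^{(l)}$, $w_{ij}^{(l+1)}$.

Algorithm 2 stochastic gradient descent

initialise weights $w_{ij}^{(l)}$ to Gaussian random numbers, thresholds to zero, $\theta_i^{(l)} = 0$;
for $t = 1, \ldots, T$ do
    choose a value of $\mu$ and apply pattern $\boldsymbol{x}^{(\mu)}$ to input layer, $\boldsymbol{V}^{(0)} \leftarrow \boldsymbol{x}^{(\mu)}$;
    for $l = 1, \ldots, L$ do
        propagate forward: $V_i^{(l)} \leftarrow g\big(\sum_j w_{ij}^{(l)} V_j^{(l-1)} - \theta_i^{(l)}\big)$;
    end for
    compute errors for output layer: $\delta_i^{(L)} \leftarrow g'\big(b_i^{(L)}\big)\big(t_i - V_i^{(L)}\big)$;
    for $l = L, \ldots, 2$ do
        propagate backward: $\delta_j^{(l-1)} \leftarrow \sum_i \delta_i^{(l)} w_{ij}^{(l)}\, g'\big(b_j^{(l-1)}\big)$;
    end for
    for $l = 1, \ldots, L$ do
        update: $w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} + \eta\, \delta_i^{(l)} V_j^{(l-1)}$ and $\theta_i^{(l)} \leftarrow \theta_i^{(l)} - \eta\, \delta_i^{(l)}$;
    end for
end for
end;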
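Algorithm 2 translates almost line by line into array code. The sketch below is one possible rendering under stated assumptions: tanh activation functions, Gaussian initial weights with variance $1/n_{\rm in}$ (the algorithm itself only asks for Gaussian random numbers), and hypothetical function names. Python lists are indexed from zero, so `weights[l]` holds $w^{(l+1)}$.

```python
import numpy as np

def g(b):
    return np.tanh(b)                  # assumed activation function

def g_prime(b):
    return 1.0 - np.tanh(b) ** 2

def init_network(sizes, rng):
    """sizes = [N, ..., M]. Gaussian weights (assumed variance 1/n_in), zero thresholds."""
    weights = [rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    thresholds = [np.zeros(n_out) for n_out in sizes[1:]]
    return weights, thresholds

def output(weights, thresholds, x):
    V = x
    for w, theta in zip(weights, thresholds):
        V = g(w @ V - theta)
    return V

def sgd_step(weights, thresholds, x, t, eta):
    """One iteration of Algorithm 2 for a single pattern; updates in place."""
    V, b = [x], []
    for w, theta in zip(weights, thresholds):        # propagate forward
        b.append(w @ V[-1] - theta)
        V.append(g(b[-1]))
    delta = [None] * len(weights)
    delta[-1] = g_prime(b[-1]) * (t - V[-1])         # errors for output layer
    for l in range(len(weights) - 1, 0, -1):         # propagate backward
        delta[l - 1] = (weights[l].T @ delta[l]) * g_prime(b[l - 1])
    for l in range(len(weights)):                    # update with the old errors
        weights[l] += eta * np.outer(delta[l], V[l])
        thresholds[l] -= eta * delta[l]
    return V[-1]

def train(weights, thresholds, patterns, targets, eta, T, rng):
    for _ in range(T):
        mu = rng.integers(len(patterns))             # choose a random pattern
        sgd_step(weights, thresholds, patterns[mu], targets[mu], eta)
```

All errors are computed before any weight is changed, so the backward pass uses the same weights as the forward pass, as in the listing above.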

Figure 6.5: Illustrates the effect of non-zero input mean upon the energy function for one output neuron with tanh activation function and two input terminals. The graph plots the contours of $H$ (as a function of the weights $w_1$ and $w_2$) for $\theta = 0$ for the training set on the left. The plot illustrates that $H$ is close to zero only at the bottom of a very narrow trough (hashed region) with steep sides.

6.3 Recipes for improving the performance

6.3.1 Preprocessing the input data

It can be useful to preprocess the input data, although any preprocessing may remove information from the data. Nevertheless, it is usually advisable to rigidly shift the data so that its mean vanishes

$$\langle x_k \rangle = \frac{1}{p} \sum_{\mu=1}^{p} x_k^{(\mu)} = 0. \tag{6.17}$$

There are several reasons for this. The first one is illustrated in Figure 6.5. The Figure shows the energy function for a single output neuron with tanh activation function and two input terminals. The classification problem is given in the Table. The input data has large mean values in both components, $x_1$ and $x_2$. Since it is difficult to visualise the dependence of $H$ on both weights and threshold, the graph on the right shows how the energy function $H$ depends on the weights for zero threshold. The large mean values of the inputs cause steep cliffs in the energy function that are difficult to maneuver with gradient descent. Different input-data variances in different directions have a similar effect. Therefore one usually scales the inputs so that the input-data distribution has the same variance in all directions (Figure 6.6), equal to unity for instance:

$$\sigma_k^2 = \frac{1}{p} \sum_{\mu=1}^{p} \big(x_k^{(\mu)} - \langle x_k \rangle\big)^2 = 1. \tag{6.18}$$

Second, to avoid saturation of the neurons connected to the inputs, their local fields must not be too large. If one initialises the weights in the above example to Gaussian random numbers with mean zero and unit variance, large activations are quite likely.

Third, enforcing zero input mean by shifting the input data avoids that the weights of the neurons in the first hidden layer must decrease or increase together [36].
Equation (6.14) shows that the increments $\delta \boldsymbol{w}_m$ of the weights feeding into neuron $m$ are likely to all have the same sign if the inputs have large mean values. The weights into that neuron then tend to change together, which makes it difficult for the network to learn to differentiate.
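The shift and scaling of Equations (6.17) and (6.18) can be sketched as follows. This is a minimal illustration with hypothetical function names; patterns are stored as rows, and the $1/p$ normalisation of (6.18) corresponds to NumPy's default `ddof=0`. The mean values and scaling factors are determined from the training set and then reused unchanged for any other data.

```python
import numpy as np

def fit_standardiser(x_train):
    """Mean and standard deviation per input component, from the training set only."""
    mean = x_train.mean(axis=0)          # <x_k>, Eq. (6.17)
    std = x_train.std(axis=0)            # sigma_k with 1/p normalisation, Eq. (6.18)
    return mean, std

def standardise(x, mean, std):
    """Shift and scale x; assumes no component has zero variance."""
    return (x - mean) / std

rng = np.random.default_rng(0)
x_train = rng.normal(loc=[100.0, 101.0], scale=[1.0, 5.0], size=(200, 2))
x_new = rng.normal(loc=[100.0, 101.0], scale=[1.0, 5.0], size=(50, 2))

mean, std = fit_standardiser(x_train)
z_train = standardise(x_train, mean, std)   # zero mean, unit variance
z_new = standardise(x_new, mean, std)       # same transformation, not refitted
```

The transformed new data will in general have a mean close to, but not exactly, zero, because the statistics come from the training set.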

Figure 6.6: Shift and scale the input data to achieve zero mean and unit variance.

In summary, one usually shifts and scales the input-data distribution so that it has mean zero and unit variance. This is illustrated in Figure 6.6. The same transformation (using the mean values and scaling factors determined for the training set) should be applied to any new data set that the network is supposed to classify after it has been trained on the training set.

Figure 6.7 shows a distribution of inputs that falls into two distinct clusters. The difference between the clusters is sometimes called covariate shift; here covariate is just another term for input. Imagine feeding first just inputs from one of the clusters to the network. It will learn local properties of the decision boundary, instead of its global features. Such global properties are efficiently learned if the network is more frequently confronted with unfamiliar data. For sequential training (stochastic gradient descent) this is not a problem, because the sequence of input patterns presented to the network is random. However, if one trains with mini batches, the mini batches should contain randomly chosen patterns in order to avoid covariate shifts. To this end one randomly shuffles the sequence of the input patterns in the training set, at the beginning of each epoch.

It is sometimes recommended [36] to observe the output errors during training. If the errors are similar for a number of subsequent learning steps, the corresponding inputs appear familiar to the network. Larger errors correspond to unfamiliar inputs, and Ref. [36] suggests to feed such inputs more often.

Often the input data is very high dimensional, requiring many input terminals. This usually means that there are many neurons in the hidden layers, and the large number of neurons makes the training computationally very expensive. To avoid this as far as possible, one can reduce the dimensionality of the input data by principal-component analysis.
This method allows one to project high-dimensional data to a lower-dimensional subspace. How this can work is illustrated for a simple example in Figure 6.8. You see that all data points fall onto a one-dimensional subspace, the solid line with slope $\tfrac{1}{2}$ (principal direction). The coordinate orthogonal to the principal direction is not useful in classifying the data, for the example shown. This coordinate can be removed in the following way. One uses the fact that the principal direction points in the direction of the leading eigenvector of the data-covariance matrix $\mathbb{C}$, that is to the eigenvector with the largest eigenvalue. The data-covariance matrix has elements

$$C_{ij} = \frac{1}{p} \sum_{\mu=1}^{p} \big(x_i^{(\mu)} - \langle x_i \rangle\big)\big(x_j^{(\mu)} - \langle x_j \rangle\big) \quad\text{with}\quad \langle x_i \rangle = \frac{1}{p} \sum_{\mu=1}^{p} x_i^{(\mu)}. \tag{6.19}$$

Figure 6.7: When the input data falls into clusters as shown in this Figure, one should randomly pick data from either cluster, to avoid that patterns become too familiar. The decision boundary is shown in red.

Figure 6.8: Illustration of principal-component analysis, showing the data in the $x_1$-$x_2$-plane together with the eigenvectors $\boldsymbol{u}_1$ and $\boldsymbol{u}_2$.

Figure 6.9: Overfitting. Left: accurate representation of the decision boundary in the training set, for a network with 5 neurons in the hidden layer. Right: this new data set differs from the first one just by a little bit of noise. The points in the vicinity of the decision boundary are not correctly classified. Legend: the two symbols denote $t^{(\mu)} = 1$ and $t^{(\mu)} = -1$.

For the example shown in Figure 6.8, the data-covariance matrix reads

$$\mathbb{C} = \frac{1}{4} \begin{pmatrix} 10 & 5 \\ 5 & \tfrac{5}{2} \end{pmatrix}. \tag{6.20}$$

Its eigenvalues and eigenvectors are:

$$\lambda_1 = \frac{25}{8},\quad \boldsymbol{u}_1 = \frac{1}{\sqrt{5}} \begin{pmatrix} 2 \\ 1 \end{pmatrix} \quad\text{and}\quad \lambda_2 = 0,\quad \boldsymbol{u}_2 = \frac{1}{\sqrt{5}} \begin{pmatrix} 1 \\ -2 \end{pmatrix}. \tag{6.21}$$

We see that the leading eigenvector $\boldsymbol{u}_1$ defines the principal direction. Figure 6.8 is an extreme example. Usually there is noise, causing the data to scatter around the principal direction. This does not change much. The result is that the smaller eigenvalue is no longer equal to zero, but still small if the data does not scatter too much about the principal direction. When there are many dimensions, we inspect the ordered sequence of eigenvalues. Often there is a gap between the small eigenvalues (close to zero), and larger ones. Then one can safely throw away the small eigenvalues. If there is no gap, it is less clear what to do.

6.3.2 Overfitting

The goal of supervised learning is to generalise from a training set to new data. Only general properties of the training set can be generalised, not specific ones that are particular to the training set and that could be very different in new data. A network with more neurons may classify the training data better, because it accurately represents all specific features of the data. But those specific properties could look quite different in new data (Figure 6.9). As a consequence, we must look for a compromise: between accurate classification of the training set and the ability of the network to generalise. The problem illustrated in Figure 6.9 is also referred to as overfitting: the network fits too fine details (for instance noise in the training set) that have no general meaning. The tendency to overfit is larger for networks with more neurons.

Figure 6.10: Progress of training and validation errors. The plot is schematic, and the data is smoothed. Shown is the natural logarithm of the energy functions for the training set (solid line) and the validation set (dashed line) as a function of the number of training iterations. The training is stopped when the validation energy begins to increase. A precise criterion for this early stopping, one that works for fluctuating data, is introduced elsewhere in these notes.

One way of avoiding overfitting is to use cross validation and early stopping. One splits the training data into two sets: a training set and a validation set. The idea is that these sets share the general features to be learnt. But although training and validation data are drawn from the same distribution, they differ in details that are not of interest. The network is trained on the training set. During training one monitors not only the energy function for the training set, but also the energy function evaluated on the validation data. As long as the network learns general features of the input distribution, both training and validation energies decrease. But when the network starts to learn specific features of the training set, then the validation energy saturates, or may start to increase. At this point the training should be stopped. This scheme is illustrated in Figure 6.10.
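The early-stopping scheme can be sketched in a few lines. The stopping rule used here (stop when the validation energy has not improved for `patience` consecutive checks) is an assumption made for illustration, not the precise criterion of these notes; the function name and the example values are hypothetical.

```python
def early_stopping(val_energies, patience=3):
    """Scan validation energies recorded during training.

    Returns (stop_iteration, best_iteration). stop_iteration is None
    if the assumed patience rule never triggers.
    """
    best, best_it, waited = float("inf"), None, 0
    for it, h in enumerate(val_energies):
        if h < best:
            best, best_it, waited = h, it, 0     # new minimum of validation energy
        else:
            waited += 1                          # no improvement at this check
            if waited >= patience:
                return it, best_it
    return None, best_it

# Made-up validation energies: decreasing at first, then increasing.
val = [1.0, 0.8, 0.6, 0.5, 0.55, 0.52, 0.6, 0.7]
stop_it, best_it = early_stopping(val, patience=3)
print(stop_it, best_it)
```

In practice one would keep a copy of the weights and thresholds at `best_it` and restore them after stopping.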
Often the possible values of the output neurons are continuous while the targets assume only discrete values. Then it is important to also monitor the classification error of the validation set. The definition of the classification error depends on the type of the classification problem. Assume first that there is one output unit, and that the targets take the values $t = 0/1$. Then the classification error is defined as

$$C = \frac{1}{p} \sum_{\mu=1}^{p} \Big| t^{(\mu)} - \theta_H\big(O^{(\mu)} - \tfrac{1}{2}\big) \Big|. \tag{6.22a}$$

If, by contrast, the targets take the values $t = \pm 1$, then the classification error is defined as

$$C = \frac{1}{2p} \sum_{\mu=1}^{p} \Big| t^{(\mu)} - \mathrm{sgn}\big(O^{(\mu)}\big) \Big|. \tag{6.22b}$$

Now consider a classification problem where inputs must be classified into $M$ mutually exclusive classes. An example is the MNIST data set of hand-written digits (Section 7.4) where $M = 10$. Another example is given in Table 6.1, with $M = 3$. In both examples one of the targets $t_i^{(\mu)} = 1$ while the others equal zero, for a given input $\boldsymbol{x}^{(\mu)}$. As a consequence, $\sum_{i}^{M} t_i^{(\mu)} = 1$. Assume that the network has sigmoid outputs, $O_i^{(\mu)} = \sigma(b_i^{(\mu)})$. To classify input $\boldsymbol{x}^{(\mu)}$ from the network outputs $O_i^{(\mu)}$ we compute for the given value of $\mu$:

$$y_i^{(\mu)} = \begin{cases} 1 & \text{if } O_i^{(\mu)} \text{ is the largest of all outputs } i = 1, \ldots, M, \\ 0 & \text{otherwise.} \end{cases} \tag{6.23a}$$

Then the classification error is defined as

$$C = \frac{1}{2p} \sum_{\mu=1}^{p} \sum_{i=1}^{M} \Big| t_i^{(\mu)} - y_i^{(\mu)} \Big|. \tag{6.23b}$$

In all cases, the classification accuracy is defined as $(1 - C)\,100\%$; it is usually quoted in percent.

Table 6.1: Illustrates the difference between energy function and classification error. Each table shows network outputs for three different inputs from the iris data set, as well as the correct classifications. Both tables have the columns "output", "targets" and "correct?", and the rows setosa (correct: yes), versicolor (correct: yes) and virginica (correct: no).

While the classification error is designed to show the fraction of inputs that are classified wrongly, it contains less information than the energy function (which is in fact a mean-squared error of the outputs). This is illustrated by the two problems in Table 6.1. Both problems have the same classification error, but the energy function is much lower for the second problem, reflecting the better quality of its solution. Yet another measure of classification success is the cross-entropy error. It is discussed in a later Chapter.

6.3.3 Weight decay and regularisation

Figure 5.4 shows a solution of the classification problem defined by the Boolean XOR function. All weights are of unit modulus, and also the thresholds are of order unity. If one uses the backpropagation algorithm to find a solution to this problem, one may find that the weights continue to grow during training. This can be problematic, if it means that the local fields become too large, so that the algorithm reaches the plateau of the activation function. Then training slows down, as explained in Section 6.1.

One solution to this problem is to reduce the weights by some factor during training, either at each iteration or in regular intervals, $w_{ij} \to (1 - \varepsilon)\, w_{ij}$ for $0 < \varepsilon < 1$, or

$$\delta w_{mn} = -\varepsilon\, w_{mn} \quad\text{for}\quad 0 < \varepsilon < 1. \tag{6.24}$$

This is achieved by adding a term to the energy function

$$H = \underbrace{\frac{1}{2} \sum_{i\mu} \big(t_i^{(\mu)} - O_i^{(\mu)}\big)^2}_{\equiv H_0} + \frac{\gamma}{2} \sum_{ij} w_{ij}^2. \tag{6.25}$$

Gradient descent on $H$ gives:

$$\delta w_{mn} = -\eta \frac{\partial H_0}{\partial w_{mn}} - \varepsilon\, w_{mn} \tag{6.26}$$

with $\varepsilon = \eta\gamma$. One can add a corresponding term for the thresholds, but this is usually not necessary. The scheme summarised here is sometimes called $L_2$-regularisation. An alternative scheme is $L_1$-regularisation. It amounts to

$$H = \frac{1}{2} \sum_{i\mu} \big(t_i^{(\mu)} - O_i^{(\mu)}\big)^2 + \frac{\gamma}{2} \sum_{ij} |w_{ij}|. \tag{6.27}$$

This gives the update rule

$$\delta w_{mn} = -\eta \frac{\partial H_0}{\partial w_{mn}} - \varepsilon\, \mathrm{sgn}(w_{mn}). \tag{6.28}$$

The discontinuity of the update rule at $w_{mn} = 0$ is cured by defining $\mathrm{sgn}(0) = 0$. Comparing Equations (6.26) and (6.28) we see that $L_1$-regularisation reduces small weights much more than $L_2$-regularisation: the $L_2$ decay term is proportional to the weight itself, whereas the $L_1$ decay term has fixed magnitude $\varepsilon$. We expect therefore that the $L_1$-scheme puts more weights to zero, compared with the $L_2$-scheme.

These two weight-decay schemes are referred to as regularisation schemes because they tend to help against overfitting. How does this work? Weight decay adds a constraint to the problem of minimising the energy function. The result is a compromise, depending upon the value of $\gamma$, between a small value of $H$ and small weight values. The idea is that a network with smaller weights is more robust to the effect of noise. When the weights are small, then small changes in some of the patterns do not give a substantially different training result. When the network has large weights, by contrast, it may happen that small changes in the input give significant differences in the training result that are difficult to generalise (Figure 6.9). Other regularisation schemes are discussed in a later Chapter.

6.3.4 Adaptation of the learning rate

It is tempting to choose larger learning rates, because they enable the network to escape more efficiently from shallow minima. But clearly this causes problems when the energy function varies rapidly. As a result the training may fail because the training dynamics starts to oscillate. This can be avoided by changing the learning rule somewhat

$$\delta w_{ij}^{(t)} = -\eta \frac{\partial H}{\partial w_{ij}^{(t)}} + \alpha\, \delta w_{ij}^{(t-1)}. \tag{6.29}$$

Here $t = 0, 1, 2, \ldots, n$ labels the iteration number, and $\delta w_{ij}^{(0)} = -\eta\, \partial H / \partial w_{ij}^{(0)}$.
You see that the increment $\delta w_{ij}^{(t)}$ at step $t$ depends not only on the instantaneous gradient, but also on the weight increment $\delta w_{ij}^{(t-1)}$ of the previous iteration. We say that the dynamics becomes inertial, the weights gain momentum. The parameter $\alpha \geq 0$ is called the momentum constant. It determines how strong the inertial effect is. Obviously $\alpha = 0$ corresponds to the usual backpropagation rule. When $\alpha$ is positive, then how does inertia change the learning rule? Iterating Equation (6.29) gives

$$\delta w_{ij}^{(n)} = -\eta \sum_{t=0}^{n} \alpha^{n-t} \frac{\partial H}{\partial w_{ij}^{(t)}}. \tag{6.30}$$

Figure 6.11: Left: Momentum method (6.29). Right: Nesterov's accelerated gradient method (6.33).

This Equation shows that $\delta w_{ij}^{(n)}$ is a weighted average of the gradients encountered during training. Now assume that the training is stuck in a shallow minimum. Then the gradient $\partial H / \partial w_{ij}^{(t)}$ remains roughly constant through many time steps, so that we can write

$$\delta w_{ij}^{(n)} \approx -\eta \frac{\partial H}{\partial w_{ij}^{(n)}} \sum_{t=0}^{n} \alpha^{n-t} = -\eta\, \frac{\alpha^{n+1} - 1}{\alpha - 1}\, \frac{\partial H}{\partial w_{ij}^{(n)}}. \tag{6.31}$$

In this situation, convergence is accelerated when $\alpha$ is close to unity. We also see that it is necessary that $\alpha < 1$ for the sum in Equation (6.31) to converge. The other limit to consider is that the gradient changes rapidly from iteration to iteration. How is the learning rule modified in this case? To make the point let us assume that the gradient remains of the same magnitude, but that its sign oscillates. Then we get

$$\delta w_{ij}^{(n)} \approx -\eta \frac{\partial H}{\partial w_{ij}^{(n)}} \sum_{t=0}^{n} (-1)^t \alpha^{n-t} = -\eta\, \frac{\alpha^{n+1} + (-1)^n}{\alpha + 1}\, \frac{\partial H}{\partial w_{ij}^{(n)}}, \tag{6.32}$$

so that the increments are much smaller. This shows that introducing inertia can substantially accelerate convergence without sacrificing accuracy. The disadvantage is, of course, that there is yet another parameter to choose, the momentum constant $\alpha$.

Nesterov's accelerated gradient method [37] is another way of implementing momentum. The algorithm was developed for smooth optimisation problems, but it is often used in stochastic gradient descent when training deep neural networks. The algorithm can be summarised as follows [38]:

$$\delta w_{ij}^{(t)} = -\eta \left. \frac{\partial H}{\partial w_{ij}} \right|_{w_{ij}^{(t)} + \alpha_{t-1} \delta w_{ij}^{(t-1)}} + \alpha_{t-1}\, \delta w_{ij}^{(t-1)}. \tag{6.33}$$

A suitable sequence of coefficients $\alpha_t$ is defined by recursion [38]. The coefficients $\alpha_t$ approach unity from below as $t$ increases.

Nesterov's accelerated-gradient method is more efficient than the simple momentum method, because the accelerated-gradient method evaluates the gradient at an extrapolated point, not at the initial point. Figure 6.11 illustrates why this is better. In practice, Nesterov's method often works better than the simple momentum scheme. Since it is not much more difficult to implement, it is now used quite frequently.
There are other ways of adapting the learning rate during training, see Section 4.10 in the book by Haykin [2]. Finally, the learning rate need not be the same for all neurons. It is often the case that the weights feeding into neurons in the layers close to the output layer experience larger energy gradients than the weights close to the input layer [2]. To make all neurons learn at approximately the same rate, one can reduce the learning rate for the layers that are further to the right in the network layout.
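Returning to the momentum rule (6.29), its two limiting cases can be checked numerically. The sketch below iterates the rule for a prescribed sequence of gradients; the function name and the chosen values of $\eta$, $\alpha$ and the gradient amplitude are hypothetical. For the oscillating case the gradient at step $t$ is taken to be $(-1)^t G$ with a constant amplitude $G$.

```python
import numpy as np

def momentum_increments(grads, eta, alpha):
    """Iterate Eq. (6.29): dw(t) = -eta * grad(t) + alpha * dw(t-1), starting from dw = 0."""
    dw = np.zeros_like(grads[0])
    history = []
    for grad in grads:
        dw = -eta * grad + alpha * dw
        history.append(dw)
    return history

eta, alpha, n, G = 0.1, 0.9, 10, 2.0

# Constant gradient (shallow minimum): increments build up.
const = momentum_increments([np.array([G])] * (n + 1), eta, alpha)
print(const[-1], -eta * G * (alpha ** (n + 1) - 1) / (alpha - 1))

# Oscillating gradient: increments stay small.
osc = momentum_increments([np.array([(-1) ** t * G]) for t in range(n + 1)], eta, alpha)
print(osc[-1], -eta * G * (alpha ** (n + 1) + (-1) ** n) / (alpha + 1))
```

Each printed pair compares the iterated increment with the geometric-sum expression from Equations (6.31) and (6.32). For these values the constant-gradient increment is several times larger in magnitude than the oscillating one.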

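6.3.5 Pruning

The term pruning refers to removing unnecessary weights or neurons from the network, to improve its efficiency. The simplest approach is weight elimination by weight decay [39]. Weights that tend to remain very close to zero during training are removed by setting them to zero and not updating them anymore. Neurons that have zero weights for all incoming connections are effectively removed (pruned). It has been shown that this method can help the network to generalise [40].

An efficient pruning algorithm is based on the idea to remove the weights that have least effect upon the energy function (that minimise the increase in $H$ upon removing them) [41]. Assume that the network was trained, so that the network reached a (local) minimum of the energy function. One starts by expanding the energy function around this minimum. To write this expansion down in a convenient form, one groups all weights in the network in a long weight vector $\boldsymbol{w}$ (as opposed to grouping them in a weight matrix $\mathbb{W}$ as we did in Chapter 2). A particular component $w_q$ is extracted from the vector $\boldsymbol{w}$ as follows:

$$w_q = \hat{\boldsymbol{e}}_q \cdot \boldsymbol{w} \quad\text{where}\quad \hat{\boldsymbol{e}}_q = (0, \ldots, 0, 1, 0, \ldots, 0)^T \text{ with the 1 in position } q. \tag{6.34}$$

Here $\hat{\boldsymbol{e}}_q$ is the Cartesian unit vector in the direction $q$, with components $e_{qj} = \delta_{qj}$. The expansion of $H$ reads:

$$H = H_{\min} + \tfrac{1}{2}\, \delta\boldsymbol{w} \cdot \mathbb{M}\, \delta\boldsymbol{w} + \text{higher orders in } \delta\boldsymbol{w}. \tag{6.35}$$

The term linear in $\delta\boldsymbol{w}$ vanishes because we expand around a local minimum. The matrix $\mathbb{M}$ is the Hessian, the matrix of second derivatives of the energy function. Eliminating the weight $w_q$ amounts to setting

$$\delta w_q + w_q = 0. \tag{6.36}$$

The idea is to minimise the damage to the network by eliminating the weight that has least effect upon $H$

$$\min_q \min_{\delta\boldsymbol{w}} \big\{ \tfrac{1}{2}\, \delta\boldsymbol{w} \cdot \mathbb{M}\, \delta\boldsymbol{w} \big\} \quad\text{subject to the constraint}\quad \hat{\boldsymbol{e}}_q \cdot \delta\boldsymbol{w} + w_q = 0. \tag{6.37}$$

I omitted the constant term $H_{\min}$ because it does not matter. Now we first minimise $H$ w.r.t. $\delta\boldsymbol{w}$, for a given value of $q$. The linear constraint is incorporated using a Lagrange multiplier as in Chapter 4, to form the Lagrangian

$$\mathcal{L} = \tfrac{1}{2}\, \delta\boldsymbol{w} \cdot \mathbb{M}\, \delta\boldsymbol{w} + \lambda\, (\hat{\boldsymbol{e}}_q \cdot \delta\boldsymbol{w} + w_q). \tag{6.38}$$

A necessary condition for a minimum $(\delta\boldsymbol{w}, \lambda)$ satisfying the constraint is

$$\frac{\partial \mathcal{L}}{\partial\, \delta\boldsymbol{w}} = \mathbb{M}\, \delta\boldsymbol{w} + \lambda\, \hat{\boldsymbol{e}}_q = 0 \quad\text{and}\quad \frac{\partial \mathcal{L}}{\partial \lambda} = \hat{\boldsymbol{e}}_q \cdot \delta\boldsymbol{w} + w_q = 0. \tag{6.39}$$

We denote the solution of these Equations by $\delta\boldsymbol{w}^*$ and $\lambda^*$. It is obtained by solving the linear system

$$\begin{pmatrix} \mathbb{M} & \hat{\boldsymbol{e}}_q \\ \hat{\boldsymbol{e}}_q^T & 0 \end{pmatrix} \begin{pmatrix} \delta\boldsymbol{w}^* \\ \lambda^* \end{pmatrix} = \begin{pmatrix} 0 \\ -w_q \end{pmatrix}. \tag{6.40}$$

If $\mathbb{M}$ is invertible we can use a standard formula for the inversion of $2 \times 2$ block matrices to find the inverse of the matrix in Equation (6.40):

$$\big(\hat{\boldsymbol{e}}_q^T \mathbb{M}^{-1} \hat{\boldsymbol{e}}_q\big)^{-1} \begin{pmatrix} \mathbb{M}^{-1} \big(\hat{\boldsymbol{e}}_q^T \mathbb{M}^{-1} \hat{\boldsymbol{e}}_q\big) - \mathbb{M}^{-1} \hat{\boldsymbol{e}}_q \hat{\boldsymbol{e}}_q^T \mathbb{M}^{-1} & \mathbb{M}^{-1} \hat{\boldsymbol{e}}_q \\ \hat{\boldsymbol{e}}_q^T \mathbb{M}^{-1} & -1 \end{pmatrix}. \tag{6.41}$$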
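The linear system (6.40) can also be solved numerically, which gives a useful cross-check of the block-matrix inverse (6.41). The sketch below is an illustration under stated assumptions: a random symmetric positive-definite matrix stands in for the Hessian $\mathbb{M}$, and the function name is hypothetical. Multiplying the inverse (6.41) by the right-hand side of (6.40) gives $\delta\boldsymbol{w}^* = -w_q\, \mathbb{M}^{-1} \hat{\boldsymbol{e}}_q\, (\hat{\boldsymbol{e}}_q^T \mathbb{M}^{-1} \hat{\boldsymbol{e}}_q)^{-1}$, which the code compares with the numerical solution.

```python
import numpy as np

def optimal_increment(M, w, q):
    """Solve the system (6.40) for the weight increment and the Lagrange multiplier."""
    n = len(w)
    e_q = np.zeros(n)
    e_q[q] = 1.0
    K = np.zeros((n + 1, n + 1))
    K[:n, :n] = M
    K[:n, n] = e_q
    K[n, :n] = e_q
    rhs = np.zeros(n + 1)
    rhs[n] = -w[q]                         # constraint e_q . dw = -w_q
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n]

rng = np.random.default_rng(0)
n, q = 5, 2
A = rng.normal(size=(n, n))
M = A @ A.T + n * np.eye(n)                # stand-in for the Hessian (assumption)
w = rng.normal(size=n)

dw, lam = optimal_increment(M, w, q)
M_inv = np.linalg.inv(M)
dw_closed = -w[q] * M_inv[:, q] / M_inv[q, q]
print(np.allclose(dw, dw_closed), dw[q] + w[q])
```

The increment changes all weights, not only $w_q$: the remaining weights are adjusted to compensate for the removal of weight $q$.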

From Equations (6.40) and (6.41) we find that

$$\delta\boldsymbol{w}^* = -\mathbb{M}^{-1} \hat{\boldsymbol{e}}_q\, w_q \big(\hat{\boldsymbol{e}}_q^T \mathbb{M}^{-1} \hat{\boldsymbol{e}}_q\big)^{-1} \quad\text{and}\quad \lambda^* = w_q \big(\hat{\boldsymbol{e}}_q^T \mathbb{M}^{-1} \hat{\boldsymbol{e}}_q\big)^{-1}. \tag{6.42}$$

The second step is to find the optimal $q$ by minimising

$$\mathcal{L}(\delta\boldsymbol{w}^*, \lambda^*; q) = \tfrac{1}{2}\, w_q^2 \big(\hat{\boldsymbol{e}}_q^T \mathbb{M}^{-1} \hat{\boldsymbol{e}}_q\big)^{-1}. \tag{6.43}$$

The Hessian of the energy function is expensive to evaluate, and so is the inverse of this matrix. Usually one resorts to an approximate expression for $\mathbb{M}^{-1}$ [41]. One possibility is to set the off-diagonal elements of $\mathbb{M}$ to zero [42].

Algorithm 3 pruning least important weight
1: train the network to reach $H_{\min}$;
2: compute $\mathbb{M}^{-1}$ approximately;
3: determine $q^*$ as the value of $q$ for which $\mathcal{L}(\delta\boldsymbol{w}^*, \lambda^*; q)$ is minimal;
4: if $\mathcal{L}(\delta\boldsymbol{w}^*, \lambda^*; q^*) \ll H_{\min}$ then
5:     update all weights using $\delta\boldsymbol{w} = -w_{q^*}\, \mathbb{M}^{-1} \hat{\boldsymbol{e}}_{q^*} \big(\hat{\boldsymbol{e}}_{q^*}^T \mathbb{M}^{-1} \hat{\boldsymbol{e}}_{q^*}\big)^{-1}$;
6:     goto 2;
7: else
8:     end;
9: end if

The algorithm is summarised in Algorithm 3. It turns out that this algorithm succeeds better than weight decay in reducing the unnecessary weights in the network [41]. Weight decay eliminates the smallest weights. But small weights are often needed to achieve a small training error. Another point is that one obtains the weight elimination of the smallest weights by substituting $\mathbb{M} = \mathbb{I}$ in the algorithm described above [Equation (6.43)]. But this is often not a good approximation.

6.4 Summary

Backpropagation is an efficient algorithm for stochastic gradient-descent on the energy function in weight space, because it refers only to quantities that are local to the weight to be updated. It is sometimes argued that biological neural networks are local in this way [2].

6.5 Further reading

The backpropagation algorithm is explained in Section 6.1 of Hertz, Krogh and Palmer [1], and in Chapter 4 of the book by Haykin [2]. The paper [36] by LeCun et al. predates deep learning, but it is still a very nice collection of recipes for making backpropagation more efficient. Historical note: one of the first papers on error backpropagation is the one by Rumelhart et al. [43]. Have a look! The paper gives an excellent explanation and summary of the backpropagation algorithm.
The authors also describe results of different numerical experiments; one of them introduces convolutional nets (Section 7.3) to learn to tell the difference between the letters T and C (Figure 6.12).

Figure 6.12: Patterns detected by the convolutional net of Ref. [43]. After Fig. 3 in Ref. [43].

6.6 Exercises

Covariance matrix. Show that the covariance matrix $\mathbb{C}$ defined in Equation (6.19) is positive semidefinite.

Nesterov's accelerated-gradient method. Show how the version (6.33) of Nesterov's algorithm corresponds to the original formulation [37]. This point is discussed in [38].

Pruning. Show that the expression (6.42) for the weight increment $\delta\boldsymbol{w}^*$ minimises the Lagrangian (6.38) subject to the constraint.

Skipping layers. Show how the backpropagation algorithm can be generalised for feed-forward networks that allow for connections from the two nearest layers to the left, not only from the nearest layer to the left.

6.7 Exam questions

6.7.1 Multilayer perceptron

A classification problem is shown in Figure 6.13. Here, $\mu = 1, 2, \ldots, 11$ is the index of input pattern $\boldsymbol{x}^{(\mu)} = (x_1^{(\mu)}, x_2^{(\mu)})^T$, and $t^{(\mu)}$ is the corresponding target value.

(a) Can this problem be solved by a simple perceptron with 2 input units, and a single output unit? Explain. (0.5p).

(b) This problem can be solved by a multilayer perceptron with two input units $x_i^{(\mu)}$ ($i = 1, 2$), three hidden units $V_j^{(\mu)} = \theta_H\big(\sum_{k=1}^{2} w_{jk} x_k^{(\mu)} - \theta_j\big)$ where $j = 1, 2, 3$, and one output unit $O^{(\mu)} = \theta_H\big(\sum_{j=1}^{3} W_j V_j^{(\mu)} - \Theta\big)$. Here

$$\theta_H(b) = \begin{cases} 1 & \text{for } b > 0, \\ 0 & \text{for } b \leq 0 \end{cases} \tag{6.44}$$

is the Heaviside function, and $w_{jk}$ and $W_j$ are weights for the hidden, and the output layer, respectively. Finally, $\theta_j$ and $\Theta$ are the thresholds assigned to the hidden units, and to the output unit. One way of solving the classification problem is illustrated in Fig. 6.13 where the three lines (solid, dashed, and dash-dotted line) are determined by weights $w_{jk}$ and thresholds $\theta_j$ assigned to the three hidden units in the hidden layer. Compute $w_{jk}$ and $\theta_j$ corresponding to the lines shown in Fig. 6.13. Note that the point where the dashed and dash-dotted lines intersect has the following coordinates $(0.5, 0.8)^T$. (0.5p).

(c) For each pattern $\boldsymbol{x}^{(\mu)}$ write its coordinates $V_j^{(\mu)}$ in the transformed (hidden) space. (0.5p).

(d) Graphically illustrate the problem in this transformed space.
Is the problem linearly separable in the transformed space, or not? If yes, illustrate a possible solution to the problem in this space. (0.5p)

(e) Compute the corresponding weights $W_j$ and the threshold $\Theta$ corresponding to the solution you illustrated in (d). (0.5p).

Figure 6.13: Left: Inputs and target values for a classification problem (table with columns $\mu$, $x_1^{(\mu)}$, $x_2^{(\mu)}$, $t^{(\mu)}$). The target output for each pattern $\mu$ is either $t^{(\mu)} = 0$ (white circles) or $t^{(\mu)} = 1$ (black circles). Right: the three lines illustrate a solution to the problem by a multilayer perceptron. Question 6.7.1.

6.7.2 Backpropagation

Explain how to train a multi-layer perceptron by back-propagation. Draw a flow-chart of the algorithm. In your discussion, refer to and explain the following terms: forward propagation, backward propagation, hidden layer, energy function, gradient descent, local energy minima, batch mode, training set, validation set, classification error, and overfitting. Your answer must not be longer than one A4 page. (1p)

6.7.3 Stochastic gradient descent

To train a multi-layer perceptron using stochastic gradient descent one needs update formulae for the weights and thresholds in the network. Derive these update formulae for sequential training using backpropagation for the network shown in Fig. 6.14. The weights for the first and second hidden layer, and for the output layer are denoted by $w_{jk}^{(1)}$, $w_{mj}^{(2)}$, and $W_m$. The corresponding thresholds are denoted by $\theta_j^{(1)}$, $\theta_m^{(2)}$, and $\Theta$, and the activation function by $g(\cdot)$. The target value for input pattern $x^{(\mu)}$ is $t^{(\mu)}$, and the pattern index $\mu$ ranges from 1 to $p$. The energy function is $H = \frac{1}{2}\sum_{\mu=1}^{p} \big(t^{(\mu)} - O^{(\mu)}\big)^2$. (2p).

Figure 6.14: Multi-layer perceptron with three input terminals, two hidden layers, and one output unit. Question 6.7.3.

6.7.4 Backpropagation

A multilayer perceptron has $L-1$ hidden layers ($l = 1, \ldots, L-1$) and one output layer ($l = L$). The state of a neuron $V_j^{(l,\mu)}$ in layer $l$ is given by

$$V_j^{(l,\mu)} = g\big(b_j^{(l,\mu)}\big), \quad\text{with}\quad b_j^{(l,\mu)} = -\theta_j^{(l)} + \sum_k w_{jk}^{(l)} V_k^{(l-1,\mu)}, \tag{6.45}$$

where $V^{(0,\mu)}$ equals the inputs $x^{(\mu)}$ ($V_k^{(0,\mu)} = x_k^{(\mu)}$), $w_{jk}^{(l)}$ are weights, $\theta_j^{(l)}$ are thresholds, and $g(b)$ is the activation function. The output is $O_j^{(\mu)} = g\big(b_j^{(L,\mu)}\big)$.

(a) Draw this network. Indicate where the elements $x_k^{(\mu)}$, $b_j^{(l,\mu)}$, $V_j^{(l,\mu)}$, $O_j^{(\mu)}$, $w_{jk}^{(l)}$ and $\theta_j^{(l)}$ for $l = 1, \ldots, L$ belong. (0.5 p)

(b) Derive the recursive rule for how the derivatives $\partial V_j^{(l,\mu)}/\partial w_{mn}^{(p)}$ depend on the derivatives $\partial V_j^{(l-1,\mu)}/\partial w_{mn}^{(p)}$ for $p < l$. (1 p).

(c) Evaluate the derivative $\partial V_j^{(l,\mu)}/\partial w_{mn}^{(p)}$ for $p = l$. (0.5 p).

(d) Use gradient descent on the energy function

$$H = \frac{1}{2}\sum_{j\mu}\big(t_j^{(\mu)} - O_j^{(\mu)}\big)^2, \tag{6.46}$$

where $t_j^{(\mu)}$ is the target value for input pattern $x^{(\mu)}$. Find the batch-mode update rule for the weight $w_{mn}^{(L-2)}$ with learning rate $\eta$. No momentum regularisation. (1 p)

6.7.5 Backpropagation

To train a multi-layer perceptron with stochastic gradient descent one needs update formulae for the weights and thresholds in the network. Derive these update formulae for the network shown in Fig. 6.15 using the gradient-descent algorithm with constant learning rate, no momentum and no weight decay. The weights for the hidden layer and for the output layer are denoted by $w_{jk}$ and $W_{ij}$, respectively. The corresponding thresholds are denoted by $\theta_j$, and $\Theta_i$, and the activation function by $g(\cdot)$. The target value for input pattern $x^{(\mu)}$ is $t_i^{(\mu)}$, and the network output is $O_i^{(\mu)}$. The energy function is $H = \frac{1}{2}\sum_{\mu=1}^{p}\sum_i \big(t_i^{(\mu)} - O_i^{(\mu)}\big)^2$. (2p)

Figure 6.15: Multi-layer perceptron. Question 6.7.5.

6.7.6 True/false questions

Indicate whether the following statements are true or false. 13 correct answers give 1.5 points, 12 correct answers give 1.0 point, 10-11 correct answers give 0.5 points, and 0-9 correct answers give zero points. (1.5 p).

1. Backpropagation is a form of unsupervised learning.
2. For backpropagation to work the weights must be symmetric.
3. For backpropagation it is necessary to know the target outputs of input patterns in the training set.

4. Early stopping in backpropagation is a way to avoid overfitting.
5. Early stopping in backpropagation helps to avoid getting stuck in local minima of energy.
6. The randomness in stochastic gradient descent helps to avoid being stuck in local minima of energy.
7. The randomness in stochastic gradient descent prevents overfitting.
8. There are $2^{(2^n)}$ Boolean functions with $n$ inputs.
9. None of the Boolean functions with 5 inputs are linearly separable.
10. There are 24 Boolean functions with three input units (and output 0/1) where exactly three input patterns map to 0.
11. When solving a $t = \pm 1$-problem in two dimensions using a decision boundary, the resulting output problem may sometimes not be linearly separable.
12. The training time for stochastic gradient descent may depend on how the weights are initialised.
13. The number of neurons in the input layer of a perceptron is equal to the number of input patterns.


7 Deep learning

7.1 How many hidden layers?

In Chapter 5 we saw why it is sometimes necessary to have a hidden layer: this makes it possible to solve problems that are not linearly separable. Under which circumstances is one hidden layer sufficient? Are there problems that require more than one hidden layer? Even if not necessary, may additional hidden layers improve the performance of the network?

Figure 7.1: Images of iris flowers. From left to right: iris setosa (copyright T. Monto), iris versicolor (copyright R. A. Nonemacher), and iris virginica (copyright A. Westermoreland). All images are copyrighted under the creative commons license.

The second question is more difficult to answer than the first, so we start with the first question. To understand how many hidden layers are necessary it is useful to view the classification problem as an approximation problem [44]. Consider the classification problem $(x^{(\mu)}, t^{(\mu)})$ for $\mu = 1, \ldots, p$. This problem defines a target function $t(x)$. Training a network to solve this task corresponds to approximating the target function $t(x)$ by the output function $O(x)$ of the network, from $N$-dimensional input space to one-dimensional output space.

How many hidden layers are necessary or sufficient to approximate a given set of functions to a certain accuracy, by choosing weights and thresholds? The answer depends on the nature of the set of functions. Are they real-valued or do they assume only discrete values? If the functions are real-valued, are they continuous or not?

We start by considering real-valued inputs and output. Consider the network drawn in Figure 7.2. The neurons in the hidden layers have sigmoid activation functions $\sigma(b) = (1 + e^{-b})^{-1}$. The output is continuous, with activation function $g(b) = b$.

Figure 7.2: Multi-layer perceptron for function approximation. The hidden neurons have $g(b) = \sigma(b)$, the output neuron has $g(b) = b$.

With two hidden layers the task is to approximate the

function $t(x)$ by

$$O(x) = \sum_j W_j\, g\Big(\sum_l w_{jl}^{(2)}\, g\Big(\sum_k w_{lk}^{(1)} x_k - \theta_l^{(1)}\Big) - \theta_j^{(2)}\Big) - \Theta. \tag{7.1}$$

Figure 7.3: The neural-network output $O(x)$ approximates the target function $t(x)$.

Figure 7.4: (a) Basis function $b(x)$. (b) Linear combination $\sigma(a_1(x - c_1)) - \sigma(a_2(x - c_2))$ of two sigmoid functions for a large value of $a_1 = a_2$.

In the simplest case the inputs are one-dimensional (Figure 7.3). The training set consists of pairs $(x^{(\mu)}, t^{(\mu)})$. The task is then to approximate the corresponding target function $t(x)$ by the network output $O(x)$:

$$O(x) \approx t(x). \tag{7.2}$$

We approximate the real-valued function $t(x)$ by linear combinations of the basis functions $b(x)$ shown in Figure 7.4(a). Any reasonable real-valued function $t(x)$ can be approximated by a sum of such basis functions, each suitably shifted and scaled. Furthermore, these basis functions can be expressed as differences of activation functions [Figure 7.4(b)]

$$b(x) = \sigma(w_1 x - \theta_1) - \sigma(w_2 x - \theta_2). \tag{7.3}$$

Comparison with Equation (7.1) shows that one hidden layer is sufficient to construct the function $O(x)$ in this way. Now consider two-dimensional inputs. In this case, suitable basis functions are

$$\sigma\big[\sigma(x_1) - \sigma(x_1 - \theta_1) + \sigma(x_2) - \sigma(x_2 - \theta_2) - \theta\big]. \tag{7.4}$$

So for two input dimensions two hidden layers are sufficient, with four neurons in the first layer, and one neuron per basis function in the second hidden layer. In general, for $N$ inputs, two hidden layers are sufficient, with $2N$ units in the first layer, and one unit per basis function in the second layer.

Yet it is not always necessary to use two layers for real-valued functions. For continuous functions, one hidden layer is sufficient. This is ensured by the universal approximation theorem [2]. This theorem

says that any continuous function can be approximated to arbitrary accuracy by a network with a single hidden layer, for sufficiently many neurons in the hidden layer.

In Chapter 5 we considered discrete Boolean functions. It turns out that any Boolean function with $N$-dimensional inputs can be represented by a network with one hidden layer, using $2^N$ neurons in the hidden layer:

$x_k \in \{+1, -1\}$, $k = 1, \ldots, N$: inputs
$V_j$, $j = 0, \ldots, 2^N - 1$: hidden neurons
$g(b) = \tanh(b)$: activation function of hidden neurons
$g(b) = \mathrm{sgn}(b)$: activation function of output unit (7.5)

A difference compared with the Boolean networks in Section 5.4 is that here the inputs take the values $\pm 1$. The reason is that this simplifies the proof. This proof goes by construction [1]. For each hidden neuron one assigns the weights as follows

$$w_{jk} = \begin{cases} \delta & \text{if the $k$th digit of the binary representation of $j$ is 1,} \\ -\delta & \text{otherwise,} \end{cases} \tag{7.6}$$

with $\delta > 1$ (see below). The thresholds $\theta_j$ of all hidden neurons are the same, equal to $N(\delta - 1)$. The idea is that each input pattern turns on exactly one neuron in the hidden layer (called the winning unit). This requires that $\delta$ is large enough, as we shall see. The weights feeding into the output neuron are assigned as follows. If the output for the pattern represented by neuron $V_j$ is $+1$, let $W_j = \gamma > 0$, otherwise $W_j = -\gamma$. The threshold is $\Theta = \sum_j W_j$.

Figure 7.5: Boolean XOR function. (a) Value table, (b) network layout. For the weights feeding into the hidden layer, dashed lines correspond to $w_{jk} = -\delta$, solid lines to $w_{jk} = \delta$. For the weights feeding into the output neuron, dashed lines correspond to $W_j = -\gamma$, and solid lines to $W_j = \gamma$. (c) Construction principle for the weights of the hidden layer.

To show how this construction works, consider the Boolean XOR function as an example. First, for each pattern only the corresponding winning neuron gives a positive signal. For pattern $x^{(1)} = [-1, -1]^T$, for example, this is the first neuron in the hidden layer ($j = 0$). To see this, compute the local fields for this input pattern:

$$\begin{aligned} b_0^{(1)} &= 2\delta - 2(\delta - 1) = 2, \\ b_1^{(1)} &= -2(\delta - 1) = 2 - 2\delta, \\ b_2^{(1)} &= -2(\delta - 1) = 2 - 2\delta, \\ b_3^{(1)} &= -2\delta - 2(\delta - 1) = 2 - 4\delta. \end{aligned} \tag{7.7}$$
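The construction can be checked numerically. The following Python sketch is not part of the original notes. It builds the hidden-layer weights and thresholds for $N = 2$ and prints the local fields, the winning unit, and the network output for each input pattern. The function names, the values $\delta = 2$ and $\gamma = 1$, and the convention that the binary digits of $j$ are counted from the least significant bit (chosen because it reproduces the local fields listed above for $x^{(1)}$) are assumptions made for illustration.

```python
import numpy as np


def build_network(targets, n_inputs, delta=2.0, gamma=1.0):
    """Weights and thresholds of the one-hidden-layer construction.

    targets[j] is the desired output (+1 or -1) for the pattern that
    hidden neuron j represents.
    """
    j = np.arange(2 ** n_inputs)[:, None]
    k = np.arange(n_inputs)[None, :]
    digits = (j >> k) & 1                      # assumed digit order: k = 0 is the lowest bit
    w = np.where(digits == 1, delta, -delta)   # hidden weights w_jk = +delta or -delta
    theta = n_inputs * (delta - 1.0)           # same threshold for all hidden neurons
    W = gamma * np.asarray(targets, dtype=float)
    Theta = W.sum()                            # output threshold, sum of output weights
    patterns = 2 * digits - 1                  # pattern represented by hidden neuron j
    return w, theta, W, Theta, patterns


def forward(x, w, theta, W, Theta):
    b = w @ x - theta                          # local fields of the hidden neurons
    V = np.tanh(b)
    return b, np.sign(W @ V - Theta)


xor_targets = [-1, 1, 1, -1]                   # XOR with inputs and outputs equal to +1 or -1
w, theta, W, Theta, patterns = build_network(xor_targets, n_inputs=2)
for x, t in zip(patterns, xor_targets):
    b, O = forward(x, w, theta, W, Theta)
    print(x, "local fields:", b, "winner:", int(np.argmax(b)),
          "output:", int(O), "target:", t)
```

For each pattern exactly one local field is positive (equal to $N$), and the sign of the weighted sum of hidden states reproduces the target.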

If we choose $\delta > 1$ then the first hidden neuron gives a positive output ($V_0 > 0$), while the other neurons produce negative outputs, $V_j < 0$ for $j = 1, 2, 3$. Now consider $x^{(3)} = [-1, +1]^T$. In this case

$$\begin{aligned} b_0^{(3)} &= -2(\delta - 1) = 2 - 2\delta, \\ b_1^{(3)} &= -2\delta - 2(\delta - 1) = 2 - 4\delta, \\ b_2^{(3)} &= 2\delta - 2(\delta - 1) = 2, \\ b_3^{(3)} &= -2(\delta - 1) = 2 - 2\delta. \end{aligned} \tag{7.8}$$

So in this case the third hidden neuron gives a positive output, while the others yield negative outputs. It works in the same way for the other two patterns, $x^{(2)}$ and $x^{(4)}$. This confirms that there is a unique winning neuron for each pattern. That pattern $\mu = k$ gives the winning neuron $j = k - 1$ is of no importance, it is just a consequence of how the patterns are ordered in the value table in Figure 7.5. Second, the output neuron computes

$$O = \mathrm{sgn}(-\gamma V_0 + \gamma V_1 + \gamma V_2 - \gamma V_3) \tag{7.9}$$

with $\gamma > 0$, and $\Theta = \sum_j W_j = 0$. For $x^{(1)}$ and $x^{(4)}$ we find the correct result $O = -1$. The same is true for $x^{(2)}$ and $x^{(3)}$, we obtain $O = 1$. In summary, this example illustrates how an $N$-dimensional Boolean function is represented by a network with one hidden layer, with $2^N$ neurons. The problem is of course that this network is expensive to train for large $N$ because the number of hidden neurons is very large.

There are more efficient layouts if one uses more than one hidden layer. As an example, consider the parity function for $N$ binary inputs equal to 0 or 1. The function measures the parity of the input sequence. It gives 1 if there is an odd number of ones in the input, otherwise 0. A construction similar to the above yields a network layout with $2^N$ neurons in the hidden layer. If one instead wires together the XOR networks shown in Figure 5.4, one can solve the parity problem with $O(N)$ neurons, as Figure 7.6 demonstrates. When $N$ is a power of two then this network has $3(N - 1)$ neurons. To see this, set the number of inputs to $N = 2^k$. Figure 7.6 shows that the number $\mathcal{N}_k$ of neurons satisfies the recursion $\mathcal{N}_{k+1} = 2\mathcal{N}_k + 3$ with $\mathcal{N}_1 = 3$. The solution of this recursion is $\mathcal{N}_k = 3(2^k - 1)$.

This example also illustrates a second reason why it may be useful to have more than one hidden layer.
To design a network for a certain task it is often convenient to build the network from building blocks. One wires them together, often in a hierarchical fashion. In Figure 7.6 there is only one building block, the XOR network from Figure 5.4. Another example are convolutional networks for image analysis (Section 7.3). Here the fundamental building blocks are feature maps, they recognise different geometrical features in the image, such as edges or corners.

7.2 Training deep networks

It was believed for a long time that networks with many hidden layers (deep networks) are so difficult to train that it is not practical to use many hidden layers. But the past few years have witnessed a paradigm shift regarding this question. It has been demonstrated that fully connected deep networks can in fact be trained efficiently with backpropagation (Section 6.2), and that these networks can solve complex classification tasks with very small error rates.

Browsing through the recent literature of the subject one may get the impression that training deep networks such as the one shown in Figure 7.7 is more an art than a science: some papers read like manuals, or like collections of engineering recipes. But there are several fundamental facts about how deep


networks learn. I summarise them in this Section. A comprehensive summary of these (and more) principles is given in the book Deep learning [4].

Figure 7.6: Solution of the parity problem for $N$-dimensional inputs. The network is built from XOR units (Figure 5.4). Each XOR unit has a hidden layer with two neurons. Above only the states of the inputs and outputs of the XOR units are shown, not those of the hidden neurons. In total, the whole network has $O(N)$ neurons.

Figure 7.7: Fully connected deep network with five hidden layers ($l = 1, \ldots, 5$) between input and output. How deep is deep? Usually one says: deep networks have two or more hidden layers.

7.2.1 Unstable gradients

In Chapter 6 we discussed that the learning slows down when the gradients $g'(b)$ become small. When the network has many hidden layers, this problem becomes worse. One finds that the neurons in hidden layers close to the input layer (small values of $l$ in Figure 7.7) change only by small amounts, the smaller the more hidden layers the network has. One way of quantifying this effect is to measure the gradient of the energy function with respect to the thresholds in layer $l$

$$\nabla_\theta^{(l)} H = \left[\frac{\partial H}{\partial \theta_1^{(l)}}, \ldots, \frac{\partial H}{\partial \theta_N^{(l)}}\right]^T. \tag{7.10}$$

Figure 7.8: Shows how the norm of the gradient of $H$ w.r.t. $\theta^{(l)}$ in layer $l$ depends on the number of training epochs. Gradients are shown for the hidden layers $l = 1, 2, 3$, and $4$. The data was obtained by training a network with four fully connected hidden layers with $N = 30$ neurons each on the MNIST data set (Section 7.4). The output layer has 10 neurons. Sigmoid activation functions (6.5a), quadratic energy function (6.4), learning rate $\eta$. The data was obtained by averaging over an ensemble of 100 independent runs (data by Johan Fries).

Equation (6.2) shows that the errors in layer $l$ and thus the weight updates are proportional to $\nabla_\theta^{(l)} H$. Figure 7.8 demonstrates that the norm of this gradient tends to be very small for the first 20 training epochs. In this regime the gradient (and thus the speed of training) vanishes exponentially as $l \to 1$. This slowing down is the result of the diminished effect of the neurons in layer $l$ upon the output, when $l$ is small. This is the vanishing-gradient problem.

To explain this phenomenon, consider the very simple case shown in Figure 7.9: a deep network with only one neuron per layer. To measure the effect of a given neuron on the output, we calculate how the output of the network changes when changing the state of a neuron in a particular layer. The output $V^{(L)}$ is given by the nested activation functions

$$V^{(L)} = g\Big(w^{(L)} g\Big(w^{(L-1)} \cdots g\big(w^{(2)} g(w^{(1)} x - \theta^{(1)}) - \theta^{(2)}\big) \cdots - \theta^{(L-1)}\Big) - \theta^{(L)}\Big). \tag{7.11}$$

The effects of the neurons in Figure 7.9 are computed using the chain rule:

$$\begin{aligned} \frac{\partial V^{(L)}}{\partial V^{(L-1)}} &= g'(b^{(L)})\, w^{(L)}, \\ \frac{\partial V^{(L)}}{\partial V^{(L-2)}} &= \frac{\partial V^{(L)}}{\partial V^{(L-1)}} \frac{\partial V^{(L-1)}}{\partial V^{(L-2)}} = g'(b^{(L)})\, w^{(L)}\, g'(b^{(L-1)})\, w^{(L-1)}, \\ &\;\;\vdots \end{aligned} \tag{7.12}$$

where $b^{(k)} = w^{(k)} V^{(k-1)} - \theta^{(k)}$ is the local field for neuron $k$. This yields the following expression for $J_{l,L} \equiv \partial V^{(L)}/\partial V^{(l)}$:

$$J_{l,L} = \frac{\partial V^{(L)}}{\partial V^{(l)}} = \prod_{k=L}^{l+1} \big[g'(b^{(k)})\, w^{(k)}\big]. \tag{7.13}$$

Note that the error $\delta^{(l)}$ of the hidden layer $l$ is determined by a closely related product. Algorithm 2 shows that the errors are given recursively by $\delta^{(l)} = \delta^{(l+1)} w^{(l+1)} g'(b^{(l)})$. Using $\delta^{(L)} = [t - V^{(L)}(x)]\, g'(b^{(L)})$ we have

$$\delta^{(l)} = [t - V^{(L)}(x)]\, g'(b^{(L)}) \prod_{k=L}^{l+1} \big[w^{(k)} g'(b^{(k-1)})\big]. \tag{7.14}$$
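To make the shrinking product concrete, here is a small Python sketch (not from the original notes) for the chain with one neuron per layer. It evaluates $\partial V^{(L)}/\partial V^{(l)}$ as the product of factors $g'(b^{(k)}) w^{(k)}$ and prints its magnitude for every layer. The sigmoid $\sigma(b) = (1 + e^{-b})^{-1}$, the depth $L = 10$, the input value, the random seed, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(b):
    return 1.0 / (1.0 + np.exp(-b))


def chain_forward(x, w, theta):
    """Return the states V^(0..L) and local fields b^(1..L) of the chain."""
    V, b = [x], []
    for w_k, theta_k in zip(w, theta):
        b.append(w_k * V[-1] - theta_k)
        V.append(sigmoid(b[-1]))
    return np.array(V), np.array(b)


def chain_sensitivities(w, b):
    """J[l] = dV^(L)/dV^(l) for l = 0..L, as a product of factors g'(b^(k)) w^(k)."""
    s = sigmoid(b)
    factors = s * (1.0 - s) * w          # g'(b^(k)) w^(k) for k = 1..L
    J = np.ones(len(w) + 1)
    for l in range(len(w) - 1, -1, -1):
        J[l] = J[l + 1] * factors[l]     # factors[l] belongs to layer k = l + 1
    return J


rng = np.random.default_rng(0)
L = 10
w = rng.normal(0.0, 1.0, size=L)         # assumed: unit-variance Gaussian weights
theta = np.zeros(L)                      # thresholds initialised to zero
V, b = chain_forward(0.5, w, theta)
J = chain_sensitivities(w, b)
for l in range(L + 1):
    print(f"l = {l:2d}   |dV(L)/dV(l)| = {abs(J[l]):.3e}")
```

Since the derivative of this sigmoid never exceeds 1/4, each factor is smaller than unity in magnitude unless the weight is unusually large, so the printed values typically decay rapidly as $l$ decreases.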

Figure 7.9: Network illustrating the vanishing-gradient problem, with neurons $V^{(l)}$, weights $w^{(l)}$, and thresholds $\theta^{(l)}$.

Equations (7.13) and (7.14) are consistent since $\delta^{(l)} = [t - V^{(L)}]\, \partial V^{(L)}/\partial(-\theta^{(l)}) = [t - V^{(L)}]\, [\partial V^{(L)}/\partial V^{(l)}]\, g'(b^{(l)})$.

Coming back to the product (7.13), consider first the early stages of training. If one initialises the weights as described in Chapter 6 to Gaussian random variables with mean zero and variance $\sigma_w^2$, and the thresholds to zero, then the factors $w^{(k)} g'(b^{(k-1)})$ are usually smaller than unity (for the activation functions (6.5), the maximum of $g'(b)$ is $\tfrac{1}{2}$ and 1, respectively). The product of these factors vanishes quickly as $l$ decreases. So the slowing down is a consequence of multiplying many small numbers to get something really small (vanishing-gradient problem).

What happens at later times? One might argue that the weights may grow during training, as a function of $l$. If that happened, the problem might become worse still, because $g'(b)$ tends to zero exponentially as $|b|$ grows. This indicates that the first layers may continue to learn slowly. Figure 7.8 shows that the effect persists for about 20 epochs. But then even the first layers begin to learn faster. This does not contradict the above discussion, because it assumed random weights. As the network learns, the weights are no longer independent random numbers. But there is to date no mathematical theory describing how this transition occurs.

More fundamentally, Equation (7.13) demonstrates that different layers of the network learn at different speeds, because their neurons typically have different effects on the output. This is due to the fact that the product in Equation (7.13) is unlikely to remain of order unity when $L$ is large. To see this, assume that the weights are independently distributed random numbers. Taking the logarithm and using the central-limit theorem shows that the distribution of the product is log normal. This means that the learning speed can be substantially different in different layers. This is also referred to as the problem of unstable gradients. The example shown in Figure 7.9 illustrates the origin of this problem: it is due to the fact that multiplying many small numbers together produces a result that is very small. Multiplying many numbers that are larger than unity, by contrast, yields a large result.

In networks like the one shown in Figure 7.7 the principle is the same, but instead of multiplying numbers one multiplies matrices. The product (7.13) of random numbers becomes a product of random matrices. Assume that all layers $l = 1, \ldots, L$ have $N$ neurons. We denote the matrix with elements $J_{ij}^{(l,L)} = \partial V_i^{(L)}/\partial V_j^{(l)}$ by $\mathbb{J}_{l,L}$. Using the chain rule we find:

$$\frac{\partial V_i^{(L)}}{\partial V_j^{(l)}} = \sum_{l m \cdots n} \frac{\partial V_i^{(L)}}{\partial V_l^{(L-1)}} \frac{\partial V_l^{(L-1)}}{\partial V_m^{(L-2)}} \cdots \frac{\partial V_n^{(l+1)}}{\partial V_j^{(l)}}. \tag{7.15}$$

Using the update rule

$$V_i^{(k)} = g\Big(\sum_j w_{ij}^{(k)} V_j^{(k-1)} - \theta_i^{(k)}\Big) \tag{7.16}$$

we can evaluate each factor:

$$\frac{\partial V_p^{(k)}}{\partial V_l^{(k-1)}} = g'(b_p^{(k)})\, w_{pl}^{(k)}. \tag{7.17}$$

In summary, this yields the following expression for $\mathbb{J}_{l,L}$:

$$\mathbb{J}_{l,L} = \mathbb{D}^{(L)} \mathbb{W}^{(L)} \mathbb{D}^{(L-1)} \mathbb{W}^{(L-1)} \cdots \mathbb{D}^{(l+1)} \mathbb{W}^{(l+1)}, \tag{7.18}$$

where $\mathbb{W}^{(k)}$ is the matrix of weights feeding into layer $k$, and

$$\mathbb{D}^{(k)} = \begin{pmatrix} g'(b_1^{(k)}) & & \\ & \ddots & \\ & & g'(b_N^{(k)}) \end{pmatrix}. \tag{7.19}$$

This expression is analogous to Equation (7.13). The eigenvalues of the matrix $\mathbb{J}_{0,k}$ describe how small changes $\delta V^{(0)}$ to the inputs $V^{(0)}$ (or small differences between the inputs) grow as they propagate through the layers. If the maximal eigenvalue is larger than unity, then $|\delta V^{(0)}|$ grows exponentially as a function of layer index $k$. This is quantified by the maximal Lyapunov exponent [45]

$$\lambda = \lim_{k\to\infty} \frac{1}{2k} \Big\langle \log \mathrm{tr}\big(\mathbb{J}_{0,k}^T \mathbb{J}_{0,k}\big) \Big\rangle \tag{7.20}$$

where the average is over realisations of weights and thresholds. The matrix $\mathbb{J}_{0,k}^T \mathbb{J}_{0,k}$ is called the right Cauchy-Green matrix, and tr denotes the trace of this matrix, the sum of its diagonal elements. The right Cauchy-Green matrix is symmetric, and it is positive definite. The eigenvectors of $\mathbb{J}_{0,k}^T \mathbb{J}_{0,k}$ are called forward Lyapunov vectors. They describe how small corrections to the inputs rotate, shrink, or stretch as they propagate through the network.

If we multiply the matrix $\mathbb{J}_{k,l}$ from the left with the transpose of the vector $\delta^{(L)}$ of output errors, we see how the errors change as they propagate backwards from layer $k$ to the leftmost hidden layer, how this vector rotates, shrinks, or stretches.

There are a number of different tricks that help to suppress vanishing gradients, to some extent at least. First, it is usually argued that it helps to use an activation function that does not saturate at large $b$, such as the ReLU function introduced in Section 7.2.2. But the results of Ref. [46] show that the effect is perhaps not as strong as originally thought. Second, batch normalisation (Section 7.2.6) may help against the unstable-gradient problem. Third, introducing connections that skip layers (residual network) can also reduce the unstable-gradient problem. This is discussed in a later section.

7.2.2 Rectified linear units

Glorot et al. [47] suggested to use a different activation function. Their choice is motivated by the response curve of leaky integrate-and-fire neurons. This is a model for the relation between the electrical current $I$ through the cell membrane into the neuron cell, and the membrane potential $U$. The simplest models for the dynamics of the membrane potential represent the neuron as a capacitor. In the leaky integrate-and-fire neuron, leakage is added by a resistor $R$ in parallel with the capacitor $C$, so that

$$I = \frac{U}{R} + C \frac{\mathrm{d}U}{\mathrm{d}t}. \tag{7.21}$$

Figure 7.10: (a) Firing rate $f(I)$ of a leaky integrate-and-fire neuron as a function of the electrical current $I$ through the cell membrane, Equation (7.22) for $\tau = 25$ and $U_c/R = 2$ (see text). (b) Rectified linear unit, $g(b) = \max\{0, b\}$.

For a constant current, the membrane potential grows from zero as a function of time, $U(t) = R I [1 - \exp(-t/\tau)]$, where $\tau = RC$ is the time constant of the model. One says that the neuron produces a spike when the membrane potential exceeds a critical value, $U_c$. Immediately after, the membrane potential is set to zero (and begins to grow again). In this model, the firing rate $f(I)$ is thus given by $t_c^{-1}$, where $t_c$ is the solution of $U(t) = U_c$. It follows that the firing rate exhibits a threshold behaviour (the system works like a rectifier):

$$f(I) = \begin{cases} 0 & \text{for } I \le U_c/R, \\ \Big[\tau \log\Big(\dfrac{RI}{RI - U_c}\Big)\Big]^{-1} & \text{for } I > U_c/R. \end{cases} \tag{7.22}$$

This response curve is illustrated in Figure 7.10(a). The main message is that there is a threshold below which the response is strictly zero (this is not the case for the activation function shown in Figure 1.6). The response function looks qualitatively like the ReLU function

$$g(b) = \mathrm{ReLU}(b) \equiv \max\{0, b\}, \tag{7.23}$$

shown in panel (b). Neurons with this activation function are called rectified linear units. The derivative of the ReLU function is discontinuous at $b = 0$. A common convention is to set the derivative to zero at $b = 0$.

What is the point of using rectified linear units? When training a deep network with ReLU functions it turns out that many of the hidden neurons (as many as 50%) produce outputs strictly equal to zero. This means that the network of active neurons (non-zero output) is sparsely connected. It is thought that sparse networks have desirable properties, and sparse representations of a classification problem are more likely to be linearly separable (as shown in Section 10.1). Figure 7.11 illustrates that for a given input pattern only a certain fraction of hidden neurons is active. For these neurons the computation is linear, yet different input patterns give different sets of active neurons. The product in Equation (7.18) acquires a particularly simple structure: the matrices $\mathbb{D}^{(k)}$ are diagonal with 0/1 entries. But while the weight matrices are independent, the $\mathbb{D}^{(k)}$-matrices are correlated: which elements vanish depends on the states of the neurons in the corresponding layer, which in turn depend on the weights to the right of $\mathbb{D}^{(k)}$ in the matrix product.

A hidden layer with only one or very few active neurons might act as a bottleneck preventing efficient backpropagation of output errors, which could in principle slow down training. For the examples given in Ref. [47] this does not occur.

The ReLU function is unbounded for large positive local fields. Therefore, the vanishing-gradient problem (Section 7.2.1) is thought to be less severe in networks made of rectified linear units, but see Ref. [46]. Since the ReLU function does not saturate, the weights tend to increase. Glorot et al. [47] suggested to use $L_1$-weight decay (Section 6.3.3) to make sure that the weights do not grow.
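The remark that the Jacobian product becomes simple for rectified linear units can be illustrated with a short Python sketch, which is not part of the original notes. It propagates one input through a few fully connected ReLU layers, records which neurons are active, and forms the product of the 0/1 diagonal matrices and the weight matrices. The layer width, the depth, the zero thresholds, the weight variance $1/N$, and the function names are assumptions made for this illustration.

```python
import numpy as np


def relu_forward(x, weights):
    """Propagate x through fully connected ReLU layers with zero thresholds."""
    V, masks = x, []
    for W in weights:
        b = W @ V
        masks.append((b > 0).astype(float))   # g'(b): 1 if active, 0 otherwise (0 at b = 0)
        V = np.maximum(0.0, b)
    return V, masks


def relu_jacobian(weights, masks):
    """Jacobian of the last layer w.r.t. the input: D^(L) W^(L) ... D^(1) W^(1)."""
    J = np.eye(weights[0].shape[1])
    for W, d in zip(weights, masks):
        J = (d[:, None] * W) @ J              # multiply by D^(k) W^(k) from the left
    return J


rng = np.random.default_rng(1)
N, L = 20, 5
weights = [rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N)) for _ in range(L)]
x = rng.normal(size=N)
V, masks = relu_forward(x, weights)
J = relu_jacobian(weights, masks)
print("fraction of active neurons per layer:", [float(m.mean()) for m in masks])
print("norm of the Jacobian:", np.linalg.norm(J))
```

Rows of the weight matrices that belong to inactive neurons are simply deleted from the product, which is why a layer with very few active neurons can act as a bottleneck.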

Figure 7.11: Sparse network of active neurons with ReLU activation functions. The red paths correspond to active neurons with positive local fields.

Finally, using ReLU functions instead of sigmoid functions speeds up the training, because the ReLU function has piecewise constant derivatives. Such function calls are faster to evaluate than sigmoid functions, for example.

7.2.3 Outputs and cost functions

Up to now we discussed networks that have the same activation functions for all neurons in all layers, either sigmoid or tanh activation functions (Equation 6.5), or ReLU functions (Section 7.2.2). These networks are trained by stochastic gradient descent on the quadratic energy function (5.25). It has been shown that it may be advantageous to employ a different energy function, and to use slightly different activation functions for the neurons in the output layer, so-called softmax outputs, defined as

$$O_i = \frac{e^{\alpha b_i^{(L)}}}{\sum_{k=1}^{M} e^{\alpha b_k^{(L)}}}. \tag{7.24}$$

Here $b_i^{(L)} = \sum_j w_{ij}^{(L)} V_j^{(L-1)} - \theta_i^{(L)}$ are the local fields in the output layer. In the limit $\alpha \to \infty$, you see that $O_i = \delta_{i i_0}$ where $i_0$ is the index of the winning output unit, the one with the largest value $b_i^{(L)}$ (Chapter 9). Usually one takes $\alpha = 1$, then Equation (7.24) is a soft version of this maximum criterion, thus the name softmax. Three important properties of softmax outputs are, first, that $0 \le O_i \le 1$. Second, the values of the outputs sum to one

$$\sum_{i=1}^{M} O_i = 1. \tag{7.25}$$

This means that the outputs of softmax units can be interpreted as probabilities. Third, the outputs are monotonous: when $b_i^{(L)}$ increases then $O_i$ increases but the values $O_k$ of the other output neurons $k \ne i$ decrease.

Softmax output units can simplify interpreting the network output for classification problems where the inputs must be assigned to one of $M$ classes. In this problem, the output $O_i^{(\mu)}$ of softmax unit $i$ represents the probability that the input $x^{(\mu)}$ is in class $i$ (in terms of the targets: $t_i^{(\mu)} = 1$ while $t_k^{(\mu)} = 0$ for $k \ne i$). Softmax units are often used in conjunction with a different energy function (or cost function). It is defined in terms of negative log likelihoods

$$H = -\sum_{i\mu} t_i^{(\mu)} \log O_i^{(\mu)}. \tag{7.26}$$
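The properties of softmax outputs listed above are easy to check numerically. The Python sketch below is an illustration rather than part of the notes: it evaluates the softmax outputs and the negative log likelihood for a single pattern. The local fields, the target vector, and the function names are hypothetical values chosen for the example. Subtracting the maximum local field before exponentiating is a standard numerical safeguard that leaves the outputs unchanged.

```python
import numpy as np


def softmax(b, alpha=1.0):
    """Softmax outputs O_i = exp(alpha b_i) / sum_k exp(alpha b_k)."""
    z = alpha * (np.asarray(b, dtype=float) - np.max(b))   # shifting b leaves the ratio unchanged
    e = np.exp(z)
    return e / e.sum()


def negative_log_likelihood(t, O):
    """H = -sum_i t_i log O_i for a single pattern with one-hot target t."""
    return -np.sum(t * np.log(O))


b = np.array([1.0, 2.0, 0.5])          # hypothetical local fields of three output units
t = np.array([0.0, 1.0, 0.0])          # the pattern belongs to the second class
O = softmax(b)
print("outputs:", O, "sum:", O.sum())
print("H =", negative_log_likelihood(t, O))
print("alpha = 50:", softmax(b, alpha=50.0))
```

For large $\alpha$ the output approaches a vector with a single entry equal to one at the position of the largest local field, the hard maximum criterion described in the text.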

Here and in the following log stands for the natural logarithm. The function (7.26) is minimal when $O_i^{(\mu)} = t_i^{(\mu)}$. Since the function (7.26) is different from the energy function used in Chapter 6, the details of the backpropagation algorithm are slightly different. To find the correct formula for backpropagation, we need to evaluate

$$\frac{\partial H}{\partial w_{mn}} = -\sum_{i\mu} \frac{t_i^{(\mu)}}{O_i^{(\mu)}} \frac{\partial O_i^{(\mu)}}{\partial w_{mn}}. \tag{7.27}$$

Here I did not write out the labels $L$ that denote the output layer, and in the following equations I also drop the index $\mu$ that refers to the input pattern. Using the identities

$$\frac{\partial O_i}{\partial b_l} = O_i(\delta_{il} - O_l) \quad\text{and}\quad \frac{\partial b_l}{\partial w_{mn}} = \delta_{lm} V_n, \tag{7.28}$$

one obtains

$$\frac{\partial O_i}{\partial w_{mn}} = \sum_l \frac{\partial O_i}{\partial b_l} \frac{\partial b_l}{\partial w_{mn}} = O_i(\delta_{im} - O_m) V_n. \tag{7.29}$$

So

$$\delta w_{mn} = -\eta \frac{\partial H}{\partial w_{mn}} = \eta \sum_{i\mu} t_i^{(\mu)} \big(\delta_{im} - O_m^{(\mu)}\big) V_n^{(\mu)} = \eta \sum_\mu \big(t_m^{(\mu)} - O_m^{(\mu)}\big) V_n^{(\mu)}, \tag{7.30}$$

since $\sum_{i=1}^{M} t_i^{(\mu)} = 1$ for the type of classification problem where each input belongs to precisely one class. The corresponding expression for the threshold updates reads

$$\delta\theta_m = -\eta \frac{\partial H}{\partial \theta_m} = -\eta \sum_\mu \big(t_m^{(\mu)} - O_m^{(\mu)}\big). \tag{7.31}$$

Equations (7.30) and (7.31) highlight a further advantage of softmax output neurons (apart from the fact that they allow the output to be interpreted in terms of probabilities). The weight and threshold increments for the output layer derived in Section 6 [Equations (6.7) and (6.2a)] contain factors of derivatives $g'(B_m^{(\mu)})$. As noted earlier, these derivatives tend to zero when the activation function saturates, slowing down the learning. But Equations (7.30) and (7.31) do not contain such factors! Here the rate at which the neuron learns is simply proportional to the error, $(t_m^{(\mu)} - O_m^{(\mu)})$, no small factors reduce this rate.

Softmax units are normally only used in the output layer. First, the derivation shows that the learning speedup mentioned above is coupled to the use of the log likelihood function (7.26). Second, one usually tries to avoid dependence between the neurons in a given hidden layer, but Equation (7.24) shows that the output of neuron $i$ depends on all local fields in the hidden layer (Figure 7.12). A better alternative is usually the ReLU activation function discussed in Section 7.2.2.

Figure 7.12: The symbol for a softmax layer (outputs $O_1, O_2, \ldots, O_M$) indicates that the neurons in this layer are not independent.
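The gradient formulae for softmax outputs with the negative log-likelihood energy can be verified against finite differences. The Python sketch below is an illustration under stated assumptions, not the notes' own code: it treats a single pattern, uses the local fields $b_m = \sum_n w_{mn} V_n - \theta_m$, and compares $\partial H/\partial w_{mn} = -(t_m - O_m) V_n$ and $\partial H/\partial\theta_m = t_m - O_m$ with numerical derivatives. The layer sizes, the random values, and all function names are hypothetical.

```python
import numpy as np


def softmax(b):
    e = np.exp(b - b.max())
    return e / e.sum()


def energy(w, theta, V, t):
    """Negative log likelihood for one pattern, with local fields b = w V - theta."""
    return -np.sum(t * np.log(softmax(w @ V - theta)))


def analytic_gradients(w, theta, V, t):
    """dH/dw_mn = -(t_m - O_m) V_n and dH/dtheta_m = t_m - O_m."""
    err = t - softmax(w @ V - theta)
    return -np.outer(err, V), err


def numeric_gradient(f, p, eps=1e-6):
    """Central finite differences of f() with respect to the entries of p (changed in place)."""
    g = np.zeros_like(p)
    for idx in np.ndindex(p.shape):
        old = p[idx]
        p[idx] = old + eps
        f_plus = f()
        p[idx] = old - eps
        f_minus = f()
        p[idx] = old
        g[idx] = (f_plus - f_minus) / (2 * eps)
    return g


rng = np.random.default_rng(0)
M, N = 3, 4                               # hypothetical sizes: 3 classes, 4 hidden neurons
w, theta = rng.normal(size=(M, N)), rng.normal(size=M)
V, t = rng.normal(size=N), np.array([0.0, 1.0, 0.0])
f = lambda: energy(w, theta, V, t)
dw, dtheta = analytic_gradients(w, theta, V, t)
print("max deviation, weights:   ", np.abs(dw - numeric_gradient(f, w)).max())
print("max deviation, thresholds:", np.abs(dtheta - numeric_gradient(f, theta)).max())
```

Note that no derivative of the activation function appears in the analytic gradients: the increments are proportional to the error $t_m - O_m$ alone.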

There is an alternative way of choosing the cost function that is very similar to the above, but works with sigmoid units:

$$H = -\sum_{i\mu} \Big[t_i^{(\mu)} \log O_i^{(\mu)} + \big(1 - t_i^{(\mu)}\big) \log\big(1 - O_i^{(\mu)}\big)\Big], \tag{7.32}$$

with $O_i = \sigma(b_i)$ where $\sigma$ is the sigmoid function (6.5a). The function (7.32) is called cross-entropy function. To compute the weight increments, we apply the chain rule:

$$\frac{\partial H}{\partial w_{mn}} = -\sum_{i\mu} \left[\frac{t_i^{(\mu)}}{O_i^{(\mu)}} - \frac{1 - t_i^{(\mu)}}{1 - O_i^{(\mu)}}\right] \frac{\partial O_i^{(\mu)}}{\partial w_{mn}} = -\sum_{i\mu} \frac{t_i^{(\mu)} - O_i^{(\mu)}}{O_i^{(\mu)}\big(1 - O_i^{(\mu)}\big)} \frac{\partial O_i^{(\mu)}}{\partial w_{mn}}. \tag{7.33}$$

Using Equation (6.6a) we obtain

$$\delta w_{mn} = \eta \sum_\mu \big(t_m^{(\mu)} - O_m^{(\mu)}\big) V_n^{(\mu)}, \tag{7.34}$$

identical to Equation (7.30). The thresholds are also updated in the same way, Equation (7.31). Yet the interpretation of the outputs is slightly different, since the values of the softmax units in the output layers sum to unity, while those of the sigmoid units do not. In either case you can use the definition (6.23) for the classification error.

7.2.4 Weight initialisation

The results of Section 7.2.1 point to the importance of initialising the weights in the right way, to avoid that the learning slows down. This is significant because it is often found that the initial transient learning phase poses a substantial bottleneck to learning [48]. For this initial transient, correct weight initialisation can give a substantial improvement. Moreover, when training deep networks with sigmoid activation functions in the hidden layers, it was observed that the values of the output neurons remain very close to zero for many training iterations (Figure 2 in Ref. [49]), slowing down the training substantially. It is argued that this is a consequence of the way the weights are initialised, in combination with the particular shape of the sigmoid activation function. It is sometimes argued that tanh activation functions work better than sigmoids, although the reasons are not quite clear.

So, how should the weights be initialised? The standard choice is to initialise the weights to independent Gaussian random numbers with mean zero and unit variance and the thresholds to zero (Section 6.1). But in networks that have large hidden layers with many neurons, this scheme may fail. Consider a neuron $i$ in the first hidden layer with $N$ incoming connections. Its local field

$$b_i = \sum_{j=1}^{N} w_{ij} x_j \tag{7.35}$$

is a sum of many independently identically distributed random numbers. Assume that the input patterns have independent random bits, equal to 0 or 1 with probability $\tfrac{1}{2}$. From the central-limit theorem we find that the local field is Gaussian distributed in the limit of large $N$, with mean zero and variance

$$\sigma_b^2 = \sigma_w^2 N/2. \tag{7.36}$$
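The variance of the local field can be estimated by direct sampling. The following Python sketch is not from the notes. It draws Gaussian weights and random 0/1 inputs and compares the sample variance of the local field with $\sigma_w^2 N/2$, once for unit-variance weights and once for weights whose variance is scaled as $1/N$. The value $N = 400$, the number of samples, the seed, and the function name are assumptions chosen for illustration.

```python
import numpy as np


def local_field_variance(n_inputs, sigma_w, n_samples, rng):
    """Sample variance of b = sum_j w_j x_j for Gaussian weights and random 0/1 inputs."""
    w = rng.normal(0.0, sigma_w, size=(n_samples, n_inputs))
    x = rng.integers(0, 2, size=(n_samples, n_inputs))
    return np.var(np.sum(w * x, axis=1))


rng = np.random.default_rng(0)
N = 400                                   # hypothetical number of incoming connections
var_unit = local_field_variance(N, 1.0, 5000, rng)
var_scaled = local_field_variance(N, 1.0 / np.sqrt(N), 5000, rng)
print("unit-variance weights: ", var_unit, " expected", N / 2)
print("variance 1/N weights:  ", var_scaled, " expected", 0.5)
```

With unit-variance weights the typical local field is of order $\sqrt{N}$, deep in the saturated region of a sigmoid, whereas scaling the weight variance as $1/N$ keeps the local fields of order unity.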

This means that the local field is typically quite large, of order $\sqrt{N}$, and this implies that the units of the first hidden layer saturate, slowing down the learning. This conclusion rests on our particular assumption concerning the input patterns, but it is in general much better to initialise the weights uniformly or Gaussian with mean zero and with variance

$$\sigma_w^2 \propto N^{-1}, \tag{7.37}$$

to cancel the factor of $N$ in Equation (7.36). The thresholds can be initialised to zero, as described in Section 6.1.

The normalisation (7.37) makes sure that the weights are not too large initially, but it does not circumvent the vanishing-gradient problem discussed in Section 7.2.1. There the problem was illustrated for $N = 1$, so unit variance for the initial weight distribution corresponds to Equation (7.37).

7.2.5 Regularisation

Deeper networks have more neurons, so the problem of overfitting (Figure 6.9) tends to be more severe for deeper networks. Therefore regularisation schemes that limit the tendency to overfit are more important for deeper networks. In Section 6.3.3 several regularisation schemes were described, for example $L_1$- and $L_2$-regularisation. In training deep networks, a number of other regularisation schemes have proved useful: drop out, pruning (Section 6.3.5), and expanding the training set.

Drop out

In this scheme some neurons are ignored during training [50]. Usually this regularisation technique is applied to hidden neurons. The procedure is illustrated in Figure 7.13. In each step of the training algorithm (for each mini batch, or for each individual pattern) one ignores at random a fraction $p$ of neurons from each hidden layer, and updates the weights in the remaining, diluted network in the usual fashion. The weights coming into the dropped neurons are not updated, and as a consequence neither are their outputs. For the next step in the training algorithm, the removed neurons are put back, and another set of hidden neurons is removed. Once the training is completed, all hidden neurons are activated, but their outputs are multiplied by $1 - p$, the fraction of neurons that was kept during training.

Figure 7.13: Illustrates regularisation by drop out.

Srivastava et al. [50] motivate this method by remarking that the performance of machine-learning algorithms is usually improved by combining the results of several learning attempts. In our case this corresponds to separately training several networks with different layouts on different inputs, and then to average over their outputs. However, for deep networks this is computationally very expensive. Drop out is an attempt to achieve the same goal more efficiently. The idea is that dropout corresponds

111 TRAINING DEEP NETWORKS 03 n Tranng success wthout prunng pruned network Fgure 7.4: Boolean XOR problem. The network has one hdden layer wth n ReLU neurons. The output neuron has a sgmod actvaton functon. The network s traned wth stochastc gradent descent for teratons. The ntal weghts were Gaussan random numbers wth mean zero, standard devaton 0., and max-norm regularsaton w < 2. The thresholds were ntally zero. Tranng success was measured n an ensemble of 000 ndependent tranng realsatons. Data from Ref. [5]. to effectvely tranng a large number of dfferent networks. If there are k hdden neurons, then there are 2 k dfferent combnatons of neurons that are turned on or off. The hope s that the network learns more robust features of the nput data n ths way, and that ths reduces overfttng. In practce the method s usually appled together wth another regularsaton scheme, max-norm regularsaton. Ths smply means that weghts are not allowed to grow larger than a gven constant: w c. Prunng Prunng (Secton 6.3.5) s also a regularsaton method: by removng unnecessary weghts one reduces the rsk of overfttng. As opposed to drop out, where hdden neurons are only temporarly gnored, prunng refers to permanently removng hdden neurons. The dea s to tran a large network, and then to prune a large fracton of neurons to obtan a much smaller network. It s usually found that such pruned nets generalse much better than small nets that were traned wthout prunng. Up to 90% of the hdden neurons can be removed. Ths method s appled wth success to deep networks, but here I want to dscuss a smple example: Frankle & Carbn [5] used the Boolean XOR functon to llustrate the effectveness of prunng. Fgure 5.4 shows that the XOR functon can be represented by a hdden layer wth two neurons. Sutable weghts and thresholds are gven n ths Fgure. Frankle & Carbn [5] pont out that backpropagaton takes a long tme to fnd a vald soluton, for random ntal weghts. 
They observe that a network with many more neurons in the hidden layer usually learns better. Figure 7.14 lists the fraction of successful trainings for networks with different numbers of neurons in the hidden layer. With two hidden neurons, only 49.1% of the networks learned the task within the allotted training steps of stochastic gradient descent. Networks with more neurons in the hidden layer ensure better training success. The Figure also shows the training success of pruned networks that were initially trained with $n = 10$ neurons. These networks were pruned iteratively during training, removing the neurons with the largest average magnitude. After training, the weights and thresholds were reset to their initial values, the values before training began. One can draw three conclusions from this data (from Ref. [51]). First, iterative pruning during training singles out neurons in the hidden layer that had initial weights and thresholds resulting in the correct decision boundaries. Second, the pruned network with two hidden neurons has much better training success than the network that was trained with only two hidden neurons. Third, despite pruning more than 50% of the hidden neurons, the network with $n = 4$ hidden neurons performs almost as well as the one with $n = 10$ hidden neurons. When training deep networks it is common to start with many neurons in the hidden layers, and to prune up to 90% of them. This results in small trained networks that can classify efficiently and reliably.

Expanding the training set

If one trains a network with a fixed number of hidden neurons on larger training sets, one observes that the network generalises with higher accuracy (better classification success). The reason is that overfitting is reduced when the training set is larger. Thus, a way of avoiding overfitting is to expand or augment the training set. It is sometimes argued that the recent success of deep neural networks in image recognition and object recognition is in large part due to larger training sets. One example is ImageNet, a database of more than $10^7$ hand-classified images, sorted into more than 20 000 categories [52]. Naturally it is expensive to improve training sets in this way. Instead, one can expand a training set artificially. For digit recognition (Figure 2.1), for example, one can expand the training set by randomly shifting, rotating, and shearing the digits.

7.2.6 Batch normalisation

Batch normalisation [53] can significantly speed up the training of deep networks with backpropagation. The idea is to shift and normalise the input data for each hidden layer, not only the input patterns (Section 6.3.1). This is done separately for each mini batch (Section 6.2), and for each component of the inputs $V_j^{(\mu)}$, $j = 1, \ldots$ (Algorithm 4). One calculates the average and variance over each mini batch,

$$ \overline{V}_j = \frac{1}{m_B} \sum_{\mu=1}^{m_B} V_j^{(\mu)} \quad\text{and}\quad \sigma_B^2 = \frac{1}{m_B} \sum_{\mu=1}^{m_B} \big(V_j^{(\mu)} - \overline{V}_j\big)^2 , \tag{7.38} $$

subtracts the mean from the $V_j^{(\mu)}$, and divides by $\sqrt{\sigma_B^2 + \epsilon}$. The parameter $\epsilon > 0$ is added to the denominator to avoid division by zero. There are two additional parameters in Algorithm 4, namely $\gamma_j$ and $\beta_j$. If one were to set $\beta_j = \overline{V}_j$ and $\gamma_j = \sqrt{\sigma_B^2 + \epsilon}$ (Algorithm 4), then batch normalisation would leave the $V_j^{(\mu)}$ unchanged. But instead these two parameters are learnt by backpropagation, just like the weights and thresholds. In general the new parameters are allowed to differ from layer to layer, $\gamma_j^{(l)}$ and $\beta_j^{(l)}$.
Batch normalisation was originally motivated by arguing that it reduces possible covariate shifts faced by hidden neurons in layer $l$: as the parameters of the neurons in the preceding layer $l-1$ change, their outputs shift, thus forcing the neurons in layer $l$ to adapt. However in Ref. [54] it was argued that batch normalisation does not reduce the internal covariate shift. It speeds up the training by effectively smoothing the energy landscape. Using batch normalisation helps to combat the vanishing-gradient problem because it prevents the local fields of hidden neurons from growing. This makes it possible to use sigmoid functions in deep networks, because the distribution of inputs remains normalised. It is sometimes argued that batch normalisation has a regularising effect, and it was suggested by Ioffe and Szegedy [53] that batch normalisation can replace drop out (Section 7.2.5). It is also argued that batch normalisation can help the network to generalise better, in particular if each mini batch contains randomly picked inputs. Then batch normalisation corresponds to randomly transforming the inputs to each hidden neuron (by the randomly changing means and variances). This may help to make the learning more robust. There is no theory to date that firmly proves either of these claims, but it is an empirical fact that batch normalisation often speeds up the training.

Algorithm 4 batch normalisation
for $j = 1, \ldots$ do
    calculate mean $\overline{V}_j \leftarrow \frac{1}{m_B} \sum_{\mu=1}^{m_B} V_j^{(\mu)}$
    calculate variance $\sigma_B^2 \leftarrow \frac{1}{m_B} \sum_{\mu=1}^{m_B} \big(V_j^{(\mu)} - \overline{V}_j\big)^2$
    normalise $\hat{V}_j^{(\mu)} \leftarrow \big(V_j^{(\mu)} - \overline{V}_j\big) / \sqrt{\sigma_B^2 + \epsilon}$
    calculate outputs as: $g\big(\gamma_j \hat{V}_j^{(\mu)} + \beta_j\big)$
end for
end;
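The steps of Algorithm 4 for a single component of the inputs can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used by any particular software package: the function name `batch_norm_component`, the default value of `eps`, and the choice of a ReLU as default activation function `g` are made here for the example.

```python
import math


def batch_norm_component(values, gamma, beta, eps=1e-5, g=lambda b: max(0.0, b)):
    """Apply Algorithm 4 to one component V_j over a mini batch.

    values: the numbers V_j^(mu), mu = 1, ..., m_B, for a fixed component j.
    gamma, beta: the two learnt parameters for this component.
    g: activation function (ReLU by default, an assumption of this sketch).
    """
    m_b = len(values)
    mean = sum(values) / m_b                                # mini-batch mean
    var = sum((v - mean) ** 2 for v in values) / m_b        # mini-batch variance
    scale = math.sqrt(var + eps)                            # eps avoids division by zero
    v_hat = [(v - mean) / scale for v in values]            # normalised inputs
    return [g(gamma * vh + beta) for vh in v_hat]           # outputs
```

Choosing `beta` equal to the mini-batch mean and `gamma` equal to $\sqrt{\sigma_B^2+\epsilon}$, with the identity as activation function, returns the original values, as stated in the text.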

Figure 7.15: (a) layout of a convolution layer (inputs $10 \times 10$, hidden layer $8 \times 8$). (b) several convolution layers are connected to the input layer to detect different features.

7.3 Convolutional networks

Convolutional networks have been around since the 1980s. They became widely used after Krizhevsky et al. [55] won the ImageNet challenge (Section 7.5) with a convolutional net. One reason for the recent success of convolutional networks is that they have fewer neurons. This has two advantages. Firstly, such networks are obviously cheaper to train. Secondly, as pointed out above, reducing the number of neurons regularises the network: it reduces the risk of overfitting.

Convolutional neural networks are designed for object recognition and pattern detection. They take images as inputs (Figure 7.1), not just a list of attributes (Figure 5.1). Convolutional networks have important properties in common with networks of neurons in the visual cortex of the human brain [4]. First, there is a spatial array of input terminals. For image analysis this is the two-dimensional array of bits. Second, neurons are designed to detect local features of the image (such as edges or corners for instance). The maps learned by such neurons, from inputs to output, are referred to as feature maps. Since these features occur in different parts of the image, one uses the same feature map (with the same weights and thresholds) for different parts of the image. Since feature maps are local, and since they act in a translational-invariant way, the number of neurons from the two-dimensional input array is greatly reduced, compared with usual fully connected networks. Feature maps act like the mathematical convolution operation. Therefore, layers with feature maps are also referred to as convolution layers.

Convolutional networks can have hierarchies of convolution layers. The idea is that the additional layers can learn more abstract features. Apart from feature maps, convolutional networks contain other types of layers.
Pooling layers connect directly to the convolution layer(s); their task is to simplify the output of the convolution layers. Connected to the pooling layers, convolutional networks may also contain several fully connected layers.

7.3.1 Feature maps

Figure 7.15(a) illustrates the layout of a convolution layer. Neuron $V_{11}$ connects to a $3 \times 3$ area of pixels in the input layer. In analogy with the terminology used in neuroscience, this area is called the local receptive field of this hidden neuron. Neuron $V_{12}$ connects to a shifted local receptive field, as illustrated in the Figure. Since the input has $10 \times 10$ pixels, the dimension of the convolution layer is $8 \times 8$. The important point is that the neurons $V_{11}$ and $V_{12}$, and all other neurons in this convolution layer, share their weights and the threshold. In the example shown in Figure 7.15(a) there are thus only 9 independent weights, and one threshold. Since the different neurons in the convolution layer share weights and thresholds, their computation rule takes the form of a discrete convolution:

$$ V_{ij} = g\Big(\sum_{p=1}^{3} \sum_{q=1}^{3} w_{pq}\, x_{p+i-1,\,q+j-1} - \theta\Big). \tag{7.39} $$

The activation function $g$ can be the sigmoid function. Usually one connects several convolution layers to the input layer, as shown in Figure 7.15(b). Different layers contain different feature maps, one that detects edges for example, and another one that detects corners, and so forth.

Figure 7.15 depicts a two-dimensional input array. For colour images there are usually three colour channels. In this case the input array is three-dimensional, and the input bits are labeled by three indices: two for position and the last one for colour, $x_{ijk}$. If one has several convolution layers that connect to the inputs, one groups the weights (and thresholds) into still higher-dimensional arrays (tensors). In this case the convolution takes the form:

$$ V_{ijk} = g\Big(\sum_{pqr} w_{pqkr}\, x_{p+i-1,\,q+j-1,\,r} - \theta_k\Big). \tag{7.40} $$

The software package TensorFlow [56] is designed to efficiently perform tensor operations as in Equation (7.40).

In Figure 7.15 the local receptive field is shifted by one pixel at a time. Sometimes it is useful to use a different stride, to shift the receptive field by $s$ pixels. Also, the local receptive regions need not have size $3 \times 3$. If we assume that their size is $Q \times P$, the rule (7.39) takes the form

$$ V_{ij} = g\Big(\sum_{p=1}^{P} \sum_{q=1}^{Q} w_{pq}\, x_{p+s(i-1),\,q+s(j-1)} - \theta\Big). \tag{7.41} $$

If one couples several convolution layers together, the number of neurons in these layers decreases rapidly as one moves to the right. In this case one can pad the image (and the convolution layers) by adding rows and columns of bits set to zero. In Figure 7.15(a), for example, one obtains a convolution layer of the same dimension as the original image if one pads the image with two rows and columns of bits. Convolution layers are trained with backpropagation.
Consider the simplest case, Equation (7.39). As usual, we use the chain rule to evaluate the gradients:

$$ \frac{\partial V_{ij}}{\partial w_{mn}} = \sum_{rs} \frac{\partial V_{ij}}{\partial b_{rs}} \frac{\partial b_{rs}}{\partial w_{mn}} . \tag{7.42} $$

The derivatives of the local fields are evaluated by applying rule (5.27) to Equation (7.39):

$$ \frac{\partial b_{rs}}{\partial w_{mn}} = \sum_{pq} \delta_{mp}\, \delta_{nq}\, x_{p+r-1,\,q+s-1} . \tag{7.43} $$

In this way one can train several stacked convolution layers too. It is important to keep track of the summation boundaries. To that end it helps to pad out the image and the convolution layers, so that the upper bounds remain the same in different layers.

Details aside, the fundamental principle of feature maps is that the map is applied in the same form to different parts of the image (translational invariance). In this way the learning of parameters is shared between pixels: each weight in a given feature map is trained on different parts of the image. This effectively increases the training set for the feature map and combats overfitting.
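The rule (7.41), including stride and zero padding, can be illustrated with a short NumPy sketch. It is a direct, unoptimised transcription of the double sum, written under stated assumptions: the function name `conv_layer`, the argument names, and the sigmoid as default activation function are choices made for this example, and `pad` counts the rows and columns of zeros added on each side of the image.

```python
import numpy as np


def sigmoid(b):
    return 1.0 / (1.0 + np.exp(-b))


def conv_layer(x, w, theta, stride=1, pad=0, g=sigmoid):
    """Evaluate V_ij = g(sum_pq w_pq x_{p+s(i-1), q+s(j-1)} - theta), cf. Eq. (7.41).

    x: two-dimensional input array, w: P x Q array of shared weights,
    theta: shared threshold, stride: shift s of the local receptive field.
    """
    if pad > 0:
        x = np.pad(x, pad)  # `pad` rows and columns of zeros on every side
    P, Q = w.shape
    n_rows = (x.shape[0] - P) // stride + 1
    n_cols = (x.shape[1] - Q) // stride + 1
    V = np.empty((n_rows, n_cols))
    for i in range(n_rows):
        for j in range(n_cols):
            # local receptive field of neuron (i, j); indices start at zero here
            patch = x[i * stride:i * stride + P, j * stride:j * stride + Q]
            V[i, j] = g(np.sum(w * patch) - theta)
    return V


if __name__ == "__main__":
    x = np.random.default_rng(0).integers(0, 2, size=(10, 10)).astype(float)
    w = np.ones((3, 3)) / 9.0
    print(conv_layer(x, w, theta=0.5).shape)          # (8, 8), as in Figure 7.15(a)
    print(conv_layer(x, w, theta=0.5, pad=1).shape)   # (10, 10), same as the input
```

All neurons of the layer use the same nine weights and the same threshold. With `pad=1` one row and one column of zeros is added on each side (two rows and two columns in total), so that the output has the dimension of the original image.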


Figure 7.16: Layout of a convolutional neural network for object recognition and image classification. The inputs are in a $10 \times 10$ array. They feed into four convolution layers representing four different $3 \times 3$ feature maps. Each convolution layer feeds into its own max-pooling layer. Between these and the output layer are a couple of fully connected hidden layers.

Figure 7.17: Examples of digits from the MNIST data set of handwritten digits [57]. The images were produced using MATLAB. But note that by default MATLAB displays the digits white on black background. Copyright for the data set: Y. LeCun and C. Cortes.

7.3.2 Pooling layers

Pooling layers process the output of convolution layers. A neuron in a pooling layer takes the outputs of several neighbouring feature maps and summarises their outputs into a single number. Max-pooling units, for example, summarise the outputs of nearby feature maps (in a $2 \times 2$ square for instance) by taking the maximum over the feature-map outputs. Instead, one may compute the root-mean square of the map values ($L_2$-pooling). There are no weights or thresholds associated with the pooling layers; they compute the output from the inputs using a pre-defined prescription. Other ways of pooling are discussed in Ref. [4].

Usually several feature maps are connected to the input. Pooling is performed separately on each of them. The network layout looks like the one shown schematically in Figure 7.16. In this Figure, the pooling layers feed into a number of fully connected hidden layers that connect to the output neurons. There are as many output neurons as there are classes to be recognised. This layout is qualitatively similar to the layout used by Krizhevsky et al. [55] in the ImageNet challenge (see Section 7.5 below).

7.4 Learning to read handwritten digits

Figure 7.17 shows patterns from the MNIST data set of handwritten digits [57].
The data set derives from a data set compiled by the National Institute of Standards and Technology (NIST), of digits handwritten by high-school students and employees of the United States Census Bureau. The data contains a data set of 60 000 images of digits with $28 \times 28$ pixels, and a test set of 10 000 digits. The images are grayscale with 8-bit resolution, so each pixel contains a value ranging from 0 to 255. The images in the database were preprocessed. The procedure is described on the MNIST home page. Each original binary image from the National Institute of Standards and Technology was represented as a $20 \times 20$ gray-scale image, preserving the aspect ratio of the digit. The resulting image was placed in a $28 \times 28$ image so that the centre-of-mass of the image coincided with its geometrical centre. These steps can make a crucial difference (Section 7.4.3).

We divide the data set into a training set with 50 000 digits and a validation set with 10 000 digits. The latter is used for cross-validation and early stopping. The test data is used for measuring the classification error after training. For this purpose one should use a data set that was not involved in the training.

The goal of this Section is to show how the principles described in Chapters 6 and 7 allow one to learn the MNIST data with low classification error, as outlined in Ref. [5]. You can follow the steps described below with your own computer program, using MATLAB 2017b which is available at StuDAT. But if you prefer you can also use other software packages such as Keras [58], an interface for TensorFlow [56], Theano [59], or PyTorch [60]. The networks described below use ReLU units (Section 7.2.2) in the hidden layers and a softmax output layer (Section 7.2.3) with ten output units $O_i$ and energy function (7.26), so that output $O_i$ is the probability that the pattern fed to the network falls into category $i$.

Algorithm 5 network layout and training options: no hidden layers, softmax output layer with 10 units. Here net is the network object containing the training data set, the network layout, and the training options.

    layers = [imageInputLayer([28 28 1])
              fullyConnectedLayer(10)
              softmaxLayer
              classificationLayer];
    options = trainingOptions('sgdm',...
        'MiniBatchSize', 8192,...
        'ValidationData', {xValid, tValid},...
        'ValidationFrequency', 30,...
        'MaxEpochs', 200,...
        'Plots', 'Training-Progress',...
        'L2Regularization', 0,...
        'Momentum', 0.9,...
        'ValidationPatience', 5,...
        'Shuffle', 'every-epoch',...
        'InitialLearnRate', 0.001);
    net = trainNetwork(xTrain, tTrain, layers, options);

7.4.1 Fully connected layout

The simplest network has no hidden layers at all, just one softmax output layer with 10 neurons.
The representation of this network and its training algorithm in MATLAB is summarised in Algorithm 5. There are three parts. The layout of the network is defined in the array layers=[...]. Here imageInputLayer([28 28 1]) reads in the $28 \times 28$ inputs, and preprocesses the inputs by subtracting the mean image averaged over the whole training set from each input image [Equation (6.7)]. The three array elements fullyConnectedLayer, softmaxLayer and classificationLayer define a softmax layer with 10 output units. First, fullyConnectedLayer(10) computes $b_i = \sum_k w_{ik} V_k + \theta_i$ for $i = 1, \ldots, 10$. Note that the sign of the threshold differs from the convention we use (Algorithm 2). Second, softmaxLayer computes the softmax function (7.24), where $O_i^{(\mu)}$ is the probability that input pattern $x^{(\mu)}$ belongs to class $i$. Finally, classificationLayer computes the negative log likelihoods (7.26).

The options variable defines the training options for the backpropagation algorithm. This includes choices for the learning parameters, such as mini-batch size $m_B$, learning rate $\eta$, momentum constant $\alpha$, and so forth. To find appropriate parameter values and network layouts is one of the main difficulties when training a neural network, and it usually requires a fair deal of experimenting. There are recipes for finding certain parameters [61], but the general approach is still trial and error [5].

In options, 'sgdm' means stochastic gradient descent with momentum, Equation (6.29). The momentum constant Momentum is set to $\alpha = 0.9$. The mini-batch size [Equation (6.4)] is set to 8192. The validation set is specified by {xValid, tValid}. The algorithm computes the validation error during training, in intervals specified by ValidationFrequency. This allows for cross validation and early stopping. Roughly speaking, training stops when the validation error begins to increase. During training, the algorithm keeps track of the smallest validation error observed so far. Training stops when the validation error was larger than the minimum for a specified number of times, ValidationPatience. This variable is set to 5 in Algorithm 5. The variable Shuffle determines at which intervals the sequence of patterns in the training set is randomised. The parameter InitialLearnRate defines the initial learning rate, $\eta = 0.001$. By default it does not change as a function of time. One epoch corresponds to applying $p$ patterns or $p/m_B$ mini batches (Section 6.1).

The resulting classification accuracy is about 90%. It can be improved by adding a hidden layer with 30 ReLU units (Algorithm 6), giving a classification accuracy of about 96%.
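The early-stopping rule controlled by ValidationPatience can be sketched as follows. This is a plain-Python illustration of the rule as described in the text, not MATLAB's implementation. I assume here that the counter is reset whenever a new minimum of the validation error is found; the text does not specify this. The function name `early_stopping_step` is hypothetical.

```python
def early_stopping_step(val_errors, patience=5):
    """Return the index of the validation check at which training stops.

    val_errors: validation errors, one per validation check, in order.
    patience: number of checks for which the validation error may exceed
              (or equal) the smallest value observed so far.
    Returns None if the stopping criterion is never met.
    """
    best = float("inf")   # smallest validation error observed so far
    n_worse = 0           # checks since the last new minimum (assumed reset rule)
    for step, err in enumerate(val_errors):
        if err < best:
            best = err
            n_worse = 0
        else:
            n_worse += 1
            if n_worse >= patience:
                return step
    return None
```

For the sequence of validation errors 1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9, 0.8 the minimum 0.7 is reached at the third check, and with `patience=5` training stops at the last check (index 7), after five checks without improvement.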
The accuracy can be further improved by increasing the number of neurons in the hidden layer. For 100 hidden ReLU units the accuracy becomes about 97.2% after training for 200 epochs (early stopping occurred after 135 epochs). Figure 7.18 shows how the training and the validation energies decrease during training, for both networks. You see that the energies are a little lower for the network with 100 hidden neurons. But we observe overfitting in both cases, because after many training steps the validation energy is much higher than the training energy. As mentioned above, early stopping caused the training of the larger network to abort after 135 epochs; this corresponds to 824 iterations.

Now let us add more hidden layers. Experimenting shows that it is better to use a slightly higher learning rate, $\eta = 0.01$. For two hidden layers we obtain classification accuracies that are only slightly higher, 97.3%. Adding a third hidden layer does not help much either. Try adding even more neurons, and/or more layers. You will see that it is difficult to increase the classification accuracy further. Adding more fully connected hidden layers does not necessarily improve the classification accuracy, even if you train for more epochs. One possible reason is that the network overfits the data (Section 6.3.2). This problem becomes more acute as you add more hidden neurons. The tendency of the network to overfit is reduced by regularisation (Section 7.2.5). For the network with one hidden layer with 100 ReLU units, $L_2$-regularisation improves the classification accuracy to almost 98%. Here L2Regularization was set to 0.03, together with a suitably chosen learning rate $\eta$.

Algorithm 6 one fully connected hidden layer, softmax outputs.

    layers = [imageInputLayer([28 28 1])
              fullyConnectedLayer(30)
              reluLayer
              fullyConnectedLayer(10)
              softmaxLayer
              classificationLayer];

Figure 7.18: Energy functions $H$ for the MNIST training set (solid lines) and for the validation set (dashed lines) for Algorithm 6 (red lines) and for a similar algorithm, but with 100 neurons in the hidden layer (green lines). The data was smoothed and the plot is schematic. The x-axis shows iterations. One iteration corresponds to feeding one minibatch of patterns. One epoch consists of $50000/8192 \approx 6$ iterations.

7.4.2 Convolutional networks

Deep convolutional networks can yield still higher classification accuracies than those obtained in the previous Section. The layout of Algorithm 7 corresponds to a network with one convolution layer with 20 feature maps, a max-pooling layer, and a fully connected hidden layer with 100 ReLU units, similar to the network shown in Figure 7.16. The classification accuracy obtained after 60 epochs is slightly above 98%. It can be further improved by including a second convolution layer, and batch normalisation (Section 7.2.6). See Algorithm 8, a slightly modified version of an example from MathWorks. The classification accuracy is 98.99% after 30 epochs.

The accuracy can be improved further by tuning parameters and network layout, and by using ensembles of convolutional neural networks [57]. The best classification accuracy found in this way is 99.77% [62]. Several of the MNIST digits are difficult to classify for humans too (Figure 7.19), so we conclude that convolutional nets really work very well.

Figure 7.19: Some hand-written digits from the MNIST test set, misclassified by a convolutional net that achieved an overall classification accuracy of 98%. Correct classification (top right), misclassification (bottom right). Data from Oleksandr Balabanov.
Yet the above examples also show that it takes much experimenting to find the right parameters and network layout, as well as long training times, to reach the best classification accuracies. It could be argued that one reaches a stage of diminishing returns as the classification error falls below a few percent.

Algorithm 7 convolutional network, one convolution layer.

    layers = [imageInputLayer([28 28 1])
              convolution2dLayer(5, 20, 'Padding', 1)
              reluLayer
              maxPooling2dLayer(2, 'Stride', 2)
              fullyConnectedLayer(100)
              reluLayer
              fullyConnectedLayer(10)
              softmaxLayer
              classificationLayer];
    options = trainingOptions('sgdm',...
        'MiniBatchSize', 8192,...
        'ValidationData', {xValid, tValid},...
        'MaxEpochs', 60,...
        'Plots', 'Training-Progress',...
        'L2Regularization', 0,...
        'InitialLearnRate', 0.001);
    net = trainNetwork(xTrain, tTrain, layers, options);

Algorithm 8 several convolution layers, batch normalisation. After MathWorks.

    layers = [imageInputLayer([28 28 1])
              convolution2dLayer(3, 20, 'Padding', 1)
              batchNormalizationLayer
              reluLayer
              maxPooling2dLayer(2, 'Stride', 2)
              convolution2dLayer(3, 30, 'Padding', 1)
              batchNormalizationLayer
              reluLayer
              maxPooling2dLayer(2, 'Stride', 2)
              convolution2dLayer(3, 50, 'Padding', 1)
              batchNormalizationLayer
              reluLayer
              fullyConnectedLayer(10)
              softmaxLayer
              classificationLayer];
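The max-pooling layers in Algorithms 7 and 8 take the maximum over $2 \times 2$ squares that are shifted by a stride of two pixels. The following NumPy sketch illustrates this prescription for a single feature map, together with the $L_2$-pooling (root-mean square) mentioned in the Section on pooling layers. It is meant as an illustration only: the function name `pool` and its arguments are chosen for this example and do not correspond to any MATLAB function.

```python
import numpy as np


def pool(V, size=2, stride=2, mode="max"):
    """Summarise size x size blocks of the feature-map output V into single numbers.

    mode="max": maximum over each block (max pooling).
    mode="l2":  root-mean square over each block (L2-pooling).
    There are no weights or thresholds: the prescription is fixed.
    """
    n_rows = (V.shape[0] - size) // stride + 1
    n_cols = (V.shape[1] - size) // stride + 1
    out = np.empty((n_rows, n_cols))
    for i in range(n_rows):
        for j in range(n_cols):
            block = V[i * stride:i * stride + size, j * stride:j * stride + size]
            if mode == "max":
                out[i, j] = block.max()
            elif mode == "l2":
                out[i, j] = np.sqrt(np.mean(block ** 2))
            else:
                raise ValueError("mode must be 'max' or 'l2'")
    return out
```

Applied to an $8 \times 8$ feature map with the default arguments, `pool` returns a $4 \times 4$ array, as in the layout of Figure 7.16. When several feature maps are present, the function is applied to each of them separately.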


Figure 7.20: Examples of digits drawn on an iPad. Data from Oleksandr Balabanov.

Figure 7.21: Same digits as in Figure 7.20, but preprocessed like the MNIST digits. Data from Oleksandr Balabanov.

7.4.3 Reading your own hand-written digits

In this Section I outline how you can test the convolutional networks described above on your own data set of hand-written digits, and which conclusions you can draw from such an experiment. Make your own data set by drawing the digits on an iPad with GoodNotes or a similar program. Draw as many digits as possible, save them in a PNG file, and extract the individual digits using an image-processing program such as Paint. Figure 7.20 shows digits obtained in this way. Now try out one of your convolutional networks (Algorithm 7 or 8) trained on the MNIST data set. You will see that the network has great difficulties recognising your digits.

One way of solving this problem is to add digits like the ones from Figure 7.20 to the training set. A second possibility is to try to preprocess the digits from Figure 7.20 in the same way as the MNIST data was preprocessed. The result is shown in Figure 7.21. Using a MNIST-trained convolutional net on these digits yields a classification accuracy of about 90%. So the algorithm does not work very well at all. What is going on?

Compare Figures 7.17 and 7.21. The digits in Figure 7.21 have a much more slender stroke. It was suggested in Ref. [63] that this may be the reason, since it is known that the line thickness of hand-written text can make a difference for algorithms that read hand-written text [64]. There are different methods for normalising the line thickness of hand-written text. Applying the method proposed in Ref. [64] to our digits results in Figure 7.22. The algorithm of Ref. [64] has a free parameter, $T$, that specifies the resulting line thickness. In Figure 7.22 it was taken to be $T = 10$, close to the line thickness of the MNIST digits; we measured the latter to be $T \approx 9.7$ using the method described in Ref. [64].

Figure 7.22: Same digits as in Figure 7.21. The difference is that the thickness of the stroke was normalised (see text). Data from Oleksandr Balabanov.

If we run a MNIST-trained convolutional net (Algorithm 8) on a data set of 60 digits with normalised line thickness, it fails on only two digits. This corresponds to a classification accuracy of roughly 97%, not so bad but not as good as the best results in Section 7.4.2. Note that we can only make a rough comparison. In order to obtain a better estimate of the classification accuracy we need to test many more than 60 digits. A question is of course whether there are perhaps other differences between our own hand-written digits and those in the MNIST data. It would also be of interest to try digits that were drawn using Paint, or a similar program. How do MNIST-trained convolutional nets perform on computer-drawn digits?

At any rate, the results of this Section show that the way the input data are processed can make a big difference. This raises a point of fundamental importance. We have seen that convolutional nets can be trained to represent a distribution of input patterns with very high accuracy. But if you test the network on a data set that has a slightly different distribution, perhaps because it was preprocessed differently, the network may not work as well.
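One of the MNIST preprocessing steps, placing the digit in a $28 \times 28$ image so that its centre of mass coincides with the geometrical centre, can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to produce Figure 7.21: the function name `centre_by_mass` is hypothetical, pixel values are assumed to be non-negative with background zero, the shift is rounded to whole pixels, and the rescaling of the digit and the normalisation of the stroke thickness are not included.

```python
import numpy as np


def centre_by_mass(digit, size=28):
    """Place a digit in a size x size frame, centre of mass at the frame centre.

    digit: two-dimensional array with non-negative pixel values (background 0).
    """
    rows, cols = np.nonzero(digit)
    if rows.size == 0:
        raise ValueError("empty image")
    # crop to the bounding box of the stroke
    crop = digit[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
    h, w = crop.shape
    r, c = np.indices(crop.shape)
    total = crop.sum()
    r_cm = (r * crop).sum() / total          # centre of mass (row)
    c_cm = (c * crop).sum() / total          # centre of mass (column)
    centre = (size - 1) / 2.0                # geometrical centre of the frame
    top = int(np.rint(centre - r_cm))        # shift rounded to whole pixels
    left = int(np.rint(centre - c_cm))
    if top < 0 or left < 0 or top + h > size or left + w > size:
        raise ValueError("digit does not fit into the frame")
    frame = np.zeros((size, size), dtype=float)
    frame[top:top + h, left:left + w] = crop
    return frame
```

Because the shift is rounded to whole pixels, the centre of mass of the result agrees with the geometrical centre only to within half a pixel in each direction.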


Figure 7.23: Object recognition using a deep convolutional network. Shown is a frame from a movie recorded on a telephone. The network was trained on the Pascal VOC data set [65] using YOLO [66]. Details on how to obtain the weights and how to install the software are given on the YOLO website.

Figure 7.24: Smallest classification error (top-5 error in %, as a function of year) for the ImageNet challenge [67]. The data up to 2014 comes from Ref. [67]. The data for 2015 comes from Ref. [68], for 2016 from Ref. [69], and for 2017 from Ref. [70]. From 2012 onwards the smallest error was achieved by convolutional neural networks (CNN). After Goodfellow et al. [4].

7.5 Deep learning for object recognition

Deep learning has become so popular in the last few years because deep convolutional networks are good at recognising objects in images. Figure 7.23 shows a frame from a movie taken from a car with my mobile telephone. A deep convolutional network trained on the Pascal VOC training set [65] recognises objects in the movie by putting bounding boxes around the objects and classifying them. The Pascal VOC data set is a training set for object-class recognition in images. Its images are each annotated with one of 20 classes. The people behind this data set ran image classification challenges from 2005 to 2012. A more recent challenge is the ImageNet large-scale visual recognition challenge (ILSVRC) [67], a competition for image classification and object recognition using the ImageNet database [52]. The challenge is based on a subset of ImageNet. The training set contains more than $10^6$ images manually classified into one of 1000 classes. There are approximately 1000 images for each class. The validation set contains 50 000 images.

The ILSVRC challenge consists of several tasks. One task is image classification, to list the object classes found in the image. A common measure for accuracy is the so-called top-5 error for this classification task. The algorithm lists the five object classes it identified with highest probabilities. The result is considered correct if the annotated class is among these five. The error equals the fraction of incorrectly classified images. Why does one not simply judge whether the most probable class is the correct one? The reason is that the images in the ImageNet database are annotated by a single-class identifier. Often this is not unique. The image in Figure 7.7, for example, shows not only a car but also trees, yet the image is annotated with the class label car. This is ambiguous. The ambiguity is significantly smaller if one considers the top five classes the algorithm gives, and checks whether the annotated class is among them.

Figure 7.25: Reproduced from xkcd.com/1897 under the creative commons attribution-noncommercial 2.5 license.

The tasks in the ILSVRC challenge are significantly more difficult than the digit recognition described in Section 7.4, and also more difficult than the VOC challenges. One reason is that the ImageNet classes are organised into a deep hierarchy of subclasses. This results in highly specific subclasses that can be very difficult to distinguish. The algorithm must be very sensitive to small differences between similar subclasses. We say that the algorithm must have high inter-class variability [71]. Different images in the same subclass, on the other hand, may look quite different. The algorithm should nevertheless recognise them as similar, belonging to the same class. We say that the algorithm should have small intra-class variability [71].

Since 2012, algorithms based on deep convolutional networks won the ILSVRC challenge. Figure 7.24 shows that the error decreased significantly until 2017, the last year of the challenge in the form described above. We saw in previous Sections that deep networks are difficult to train. So how can these algorithms work so well?
It is generally argued that the recent success of deep convolutional networks is mainly due to three factors.

First, there are now much larger and better annotated training sets available. ImageNet is an example. Excellent training data is now recognised as one of the most important factors. Companies developing software for self-driving cars and systems that help to avoid accidents recognise that good training sets are essential, and difficult to obtain: to get reliable training data one must manually collect and annotate the data (Figure 7.25). This is costly, but at the same time it is important to have data sets that are as large as possible, to reduce overfitting. In addition one must aim for a large variability in the collected data.

Second, the hardware is much better today. Deep networks are nowadays implemented on single or multiple GPUs. There are also dedicated chips, such as the tensor processing unit [72].

Third, improved regularisation techniques (Section 7.2.5) and weight sharing in convolution layers help to fight overfitting, and ReLU units (Section 7.2.2) render the networks less susceptible to the vanishing-gradient problem (Section 7.2.1).

The winning algorithm for 2012 was based on a network with five convolution layers and three fully connected layers, using drop out, ReLU units, and data-set augmentation [55]. The algorithm was implemented on GPU processors. The 2014 ILSVRC challenge was also won by a convolutional network [73], with 22 layers. Nevertheless, the network has substantially fewer free parameters (weights and thresholds) than the 2012 network. In 2015, the winning algorithm [68] had 152 layers. One significant new element in the layout were connections that skip layers (residual networks, Section 7.6). The 2016 [74] and 2017 [70] winning algorithms used ensembles of convolutional networks.

Figure 7.26: Schematic illustration of a network with skipping connections.

Figure 7.27: Network with connections that skip layers.

7.6 Residual networks

The network that won the 2015 ILSVRC challenge had connections that skip layers [68]. Empirical evidence shows that skipping layers makes deep networks easier to train. This is usually motivated by saying that skipping layers reduces the vanishing-gradient problem.

The layout is illustrated schematically in Figure 7.26. Black arrows stand for usual feed-forward connections. The notation differs somewhat from that of Algorithm 2. Here the weights from layer $l-1$ to $l$ are denoted by $w_{jk}^{(l,l-1)}$, and those from layer $l-2$ to $l$ by $w_{ij}^{(l,l-2)}$ (red arrow in Figure 7.26). Note that the superscripts are ordered in the same way as the subscripts: the right index refers to the layer on the left. Neuron $j$ in layer $l$ computes

$$V_j^{(l)} = g\Big(\sum_k w_{jk}^{(l,l-1)} V_k^{(l-1)} - \theta_j^{(l)} + \sum_n w_{jn}^{(l,l-2)} V_n^{(l-2)}\Big). \tag{7.44}$$

The weights of connections that skip layers are trained in the usual fashion, by stochastic gradient descent.
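As an illustration of Equation (7.44), here is a minimal sketch of the forward step for one layer with skipping connections. The function name `layer_with_skip`, the choice of ReLU as default activation, and all numbers are assumptions made for this example.

```python
def relu(b):
    """Rectified linear unit."""
    return max(0.0, b)


def layer_with_skip(V_prev, V_prev2, w, w_skip, theta, g=relu):
    """Evaluate Equation (7.44) for all neurons j of layer l.

    V_prev  : states V^(l-1) of the preceding layer
    V_prev2 : states V^(l-2) of the layer before that
    w       : weights w^(l,l-1), indexed w[j][k]
    w_skip  : weights w^(l,l-2) of the skipping connections, indexed w_skip[j][n]
    theta   : thresholds theta^(l)
    """
    states = []
    for j in range(len(theta)):
        # usual feed-forward contribution minus threshold
        b = sum(w[j][k] * V_prev[k] for k in range(len(V_prev))) - theta[j]
        # extra contribution from the connections that skip one layer
        b += sum(w_skip[j][n] * V_prev2[n] for n in range(len(V_prev2)))
        states.append(g(b))
    return states


# Hypothetical numbers, two neurons per layer.
V_prev2 = [1.0, 2.0]
V_prev = [0.5, -1.0]
w = [[1.0, 0.0], [0.0, 1.0]]
theta = [0.0, 0.0]

no_skip = layer_with_skip(V_prev, V_prev2, w, [[0.0, 0.0], [0.0, 0.0]], theta)
unit_skip = layer_with_skip(V_prev, V_prev2, w, [[1.0, 0.0], [0.0, 1.0]], theta)
```

With vanishing skip weights the function reduces to an ordinary feed-forward layer. With unit skip weights the state of layer $l-2$ is simply added to the local field.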
To illustrate the structure of the resulting formulae consider a network with just one neuron per layer (Figure 7.27). To begin with we calculate the increments of the weights $w^{(l,l-1)}$. To update $w^{(L,L-1)}$ we require

$$\frac{\partial V^{(L)}}{\partial w^{(L,L-1)}} = g'(b^{(L)})\,V^{(L-1)}. \tag{7.45}$$

This gives

$$\delta w^{(L,L-1)} = \eta\,\delta^{(L)} V^{(L-1)} \quad\text{with}\quad \delta^{(L)} = (t - V^{(L)})\,g'(b^{(L)}), \tag{7.46}$$

as in Algorithm 2. The factor $(t - V^{(L)})$ comes from the outer derivative of the energy function (6.4). The outputs are $O = V^{(L)}$. As in Algorithm 2, I have omitted the sum over $\mu$ (stochastic gradient descent, page 72). Also the update for $w^{(L-1,L-2)}$ is the same as in Algorithm 2:

$$\delta w^{(L-1,L-2)} = \eta\,\delta^{(L-1)} V^{(L-2)} \quad\text{with}\quad \delta^{(L-1)} = \delta^{(L)} w^{(L,L-1)} g'(b^{(L-1)}). \tag{7.47}$$

But the update for $w^{(L-2,L-3)}$ is different because the short cuts come into play. The extra connection from layer $L-2$ to $L$ gives rise to an extra term:

$$\frac{\partial V^{(L)}}{\partial w^{(L-2,L-3)}} = \Big(\frac{\partial V^{(L)}}{\partial V^{(L-1)}}\frac{\partial V^{(L-1)}}{\partial V^{(L-2)}} + \frac{\partial V^{(L)}}{\partial V^{(L-2)}}\Big)\frac{\partial V^{(L-2)}}{\partial w^{(L-2,L-3)}}. \tag{7.48}$$

Evaluating the partial derivatives we find

$$\frac{\partial V^{(L)}}{\partial w^{(L-2,L-3)}} = \Big[g'(b^{(L)})\,w^{(L,L-1)} g'(b^{(L-1)})\,w^{(L-1,L-2)} + g'(b^{(L)})\,w^{(L,L-2)}\Big]\,g'(b^{(L-2)})\,V^{(L-3)}.$$

This implies

$$\delta^{(L-2)} = \delta^{(L-1)} w^{(L-1,L-2)} g'(b^{(L-2)}) + \delta^{(L)} w^{(L,L-2)} g'(b^{(L-2)}). \tag{7.49}$$

In general, the error-backpropagation rule reads

$$\delta^{(l-1)} = \delta^{(l)} w^{(l,l-1)} g'(b^{(l-1)}) + \delta^{(l+1)} w^{(l+1,l-1)} g'(b^{(l-1)}) \tag{7.50}$$

for $l = L-1, L-2, \ldots$. The first term is the same as in step 9 of Algorithm 2. The second term is due to the skipping connections. The update formula for $w^{(l,l-1)}$ is

$$\delta w^{(l,l-1)} = \eta\,\delta^{(l)} V^{(l-1)}. \tag{7.51}$$

The updates of the weights $w^{(l+1,l-1)}$ are given by

$$\delta w^{(l+1,l-1)} = \eta\,\delta^{(l+1)} V^{(l-1)}, \tag{7.52}$$

with the same errors as in Equation (7.51).

Skipping connections reduce the vanishing-gradient problem. To see this, note that we can write the error $\delta^{(l)}$ as

$$\delta^{(l)} = \delta^{(L)} \sum_{l_1,l_2,\ldots,l_n} w^{(L,l_n)} g'(b^{(l_n)}) \cdots w^{(l_2,l_1)} g'(b^{(l_1)})\,w^{(l_1,l)} g'(b^{(l)}) \tag{7.53}$$

where the sum is over all paths $L > l_n > l_{n-1} > \cdots > l_1 > l$ back through the network. The smallest gradients are dominated by the product corresponding to the path with the smallest number of steps (factors), resulting in a smaller probability to get small gradients. Introducing connections that skip more than one layer tends to increase the small gradients, as Equation (7.53) shows. Recently it has been suggested to randomise the layout by randomly short-circuiting the network. Equation (7.53) remains valid for this case too.

The network described in Ref. [68] used unit weights for the skipping connections,

$$V_j^{(l)} = g\Big(\sum_k w_{jk}^{(l,l-1)} V_k^{(l-1)} - \theta_j^{(l)} + V_j^{(l-2)}\Big), \tag{7.54}$$

so that the hidden layer $V_k^{(l-1)}$ learns the difference between the input $V_j^{(l-2)}$ and the output $V_j^{(l)}$. Therefore such networks are called residual networks.

7.7 Summary

Networks with many hidden layers are called deep networks. It has recently been shown that such networks can be trained to recognise objects in images with high accuracy. It is sometimes stated that convolutional networks are now better than Humans, in that they recognise objects with lower classification errors than Humans [75]. This statement is problematic for several reasons. To start with, the article refers to the 2015 ILSVRC competition, and the company mentioned in the Guardian article was later caught out cheating. At any rate, this and similar statements refer to an experiment showing that the Human classification error in recognising objects in the ImageNet database is about 5.1% [76], worse than the most recent convolutional neural-network algorithms (Figure 7.24).

Yet it is clear that these algorithms learn in quite a different way from Humans. They can detect local features, but since these convolutional networks rely on translational invariance, they do not easily understand global features, and can mistake a leopard-patterned sofa for a leopard [77]. It may help to include more sofas in the training data set, but the essential difficulty remains: translational invariance imposes constraints on what convolutional networks can learn [77].

More fundamentally one may argue that Humans learn differently, by abstraction instead of going through vast training sets.
Just try it out for yourself: this website [78] allows you to learn like a convolutional network. Nevertheless, the examples described in this Chapter illustrate the tremendous success of deep convolutional networks.

We have also seen that training deep networks suffers from a number of fundamental problems. First, networks with many hidden neurons have many free parameters (their weights and thresholds). This increases the risk of overfitting. Overfitting reduces the power of the network to generalise. The tendency of deep networks to overfit can be reduced by cross-validation (Section 6.3.2) and by regularisation (weight decay, drop out, pruning, and data set augmentation, Section 7.2.5). In this regard convolutional nets have an advantage because they have fewer weights, and the weights of a given feature map are trained on different parts of the input images, effectively increasing the training set.

Second, the examples described in Section 7.4 show that convolutional nets are sensitive to differences in how the input data are preprocessed. You may run into problems if you train a network on given training and validation sets, but apply it to a test set that was preprocessed in a different way so that the test set corresponds to a different input distribution. Convolutional nets excel at learning the properties of a given input distribution, but they may have difficulties in recognising patterns sampled from a slightly different distribution, even if the two distributions appear very similar to the Human eye.

Note also that this problem cannot be solved by cross-validation, because training and validation sets are drawn from the same input distribution, but here we are concerned with what happens when the network is applied to an input distribution different from the one it was trained on. Here is another example illustrating this point: the authors of Ref. [79] trained a convolutional network on perturbed grayscale images from the ImageNet data base, adding a little bit of noise independently to each pixel (white noise) before training. This network failed to recognise images that were weakly perturbed in a different way, by setting a small number of pixels to white or black. When we look at the images we have no difficulties seeing through the noise.

Third, error backpropagation in deep networks suffers from the vanishing-gradient problem. This is more difficult to combat. It can be reduced by using ReLU units, by initialising the weights in certain ways, and by networks with connections that skip layers. Yet vanishing or exploding gradients remain a fundamental difficulty, slowing learning down in the initial phase of training. Brute force (computer power) helps to alleviate the problem. As a consequence, convolutional neural networks have become immensely successful in object recognition, outperforming other algorithms significantly.

Fourth, Refs. [80, 81] illustrate intriguing failures of convolutional networks. Szegedy et al. [80] show that the way convolutional nets partition input space can lead to surprising results. The authors took an image that the network classifies correctly with high confidence, and perturbed it slightly. The difference between the original and the perturbed image (adversarial image) is undetectable to the Human eye, yet the network misclassifies the perturbed image with high confidence [80]. This indicates that decision boundaries are always close in input space, not intuitive but possible in high dimensions. Figure 1 in Ref. [81] shows images that are completely unrecognisable to the Human eye. Yet a convolutional network classifies these images with high confidence.
This illustrates that there is no telling what a network may do if the input is far away from the training distribution. Unfortunately the network can sometimes be highly confident yet wrong.

To conclude, convolutional networks are very good at recognising objects in images. But we should not imagine that they understand what they see in the same way as Humans. The theory of deep learning has somewhat lagged behind the performance in practice. But some progress has been made in recent years, and there are many interesting open questions.

7.8 Further reading

What do the hidden layers in a convolutional network actually compute? Feature maps that are directly coupled to the inputs detect local features, such as edges or corners. Yet it is unclear precisely how hidden convolutional layers help the network to learn. Therefore it is interesting to visualise the activity of deep layers by asking: which input patterns maximise the outputs of the neurons in a certain layer [82]?

Another question concerns the structure of the energy landscape. It seems that local minima are perhaps less important for deep networks, because their energy functions tend to have more saddle points than minima [83].

Deep networks suffer from catastrophic forgetting: when you train a network on a new input distribution that is quite different from the one the network was originally trained on, then the network tends to forget what it learned initially. Recently there has been much interest in this question. A good starting point is Ref. [84].

The stochastic-gradient descent algorithm (with or without minibatches) samples the input-data distribution uniformly randomly. As mentioned in Section 6.3.1, it may be advantageous to sample those inputs more frequently that initially cause larger output errors. More generally, the algorithm may use other criteria to choose certain input data more often, with the goal to speed up learning. It may even suggest how to augment a given training set most efficiently, by asking to specifically label certain types of input data (active learning) [85].

For connections to Mathematical Statistics (multinomial and binary logistic regression), start with Ref. [4].

7.9 Exercises

Decision boundaries for XOR problem. Figure 7.5 shows the layout of a network that solves the Boolean XOR problem. Draw the decision boundaries for the four hidden neurons in the input plane, and label the boundaries and the regions as in Figure 5.2.

Vanishing-gradient problem. Train the network shown in Figure 7.7 on the iris data set, available from the Machine learning repository of the University of California Irvine. Measure how strongly training affects the neurons in the different layers, by calculating the derivative of the energy function $H$ w.r.t. the thresholds of the neurons in question.

7.10 Exam questions

7.10.1 Parity function

The parity function outputs 1 if and only if the input sequence of $n$ binary numbers has an odd number of ones, and zero otherwise. The parity function for $n = 2$ is also known as the Boolean XOR function.

(a) The XOR function can be represented using a multilayer perceptron with two inputs, a fully connected hidden layer with two hidden neurons, and one output unit. The activation function is the Heaviside function:

$$\theta_H(b) = \begin{cases} 1 & \text{for } b > 0, \\ 0 & \text{for } b \le 0 \end{cases} \tag{7.55}$$

for all layers. Determine suitable weight vectors $w_j$ and thresholds $\theta_j$ for the two hidden units ($j = 1, 2$), as well as the weight vector $W$ and the threshold $\Theta$ for the output unit. (0.5p)

(b) Illustrate the problem graphically in the input space, and indicate the planes determined by the weight vectors $w_j$ and thresholds $\theta_j$ that you determined in (a). In a separate graph, illustrate the transformed input data in the hidden space and draw the line determined by the weight vector $W$ and the threshold $\Theta$. (0.5p)

(c) Describe how you can combine several of the small XOR multilayer perceptrons analysed in (a)-(b) to create a deep network that computes the parity function for $n > 2$. Explain how the total number of nodes in the network grows with the input dimension $n$. (1p)
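To check a candidate answer to this question numerically, one could use a sketch like the following. The weights and thresholds below are one possible choice (an assumption, many others work), and chaining the XOR blocks one after the other is only one of several possible layouts.

```python
from itertools import product


def heaviside(b):
    """Heaviside activation of Equation (7.55)."""
    return 1 if b > 0 else 0


def xor_mlp(x1, x2):
    """Two hidden Heaviside neurons and one output neuron computing XOR."""
    h1 = heaviside(1.0 * x1 + 1.0 * x2 - 0.5)   # active unless both inputs are 0 (OR)
    h2 = heaviside(1.0 * x1 + 1.0 * x2 - 1.5)   # active only if both inputs are 1 (AND)
    return heaviside(1.0 * h1 - 1.0 * h2 - 0.5)  # OR but not AND


def parity(bits):
    """Chain n-1 XOR blocks: each block combines the running result with the next bit."""
    out = bits[0]
    for b in bits[1:]:
        out = xor_mlp(out, b)
    return out


def neuron_count(n):
    """Number of hidden and output neurons in the chained layout (3 per XOR block)."""
    return 3 * (n - 1)


# Truth table for n = 4 inputs.
table = {bits: parity(bits) for bits in product((0, 1), repeat=4)}
```

In this chained layout the number of neurons grows linearly with $n$. Arranging the same blocks as a binary tree would give the same count but a smaller depth.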

7.10.2 Softmax outputs

Consider a perceptron with $L$ layers and softmax output units. For pattern $\mu$, the state of the $i$-th output neuron is given by

$$O_i^{(\mu)} = \frac{e^{b_i^{(L,\mu)}}}{\sum_m e^{b_m^{(L,\mu)}}}, \tag{7.56}$$

where $b_m^{(L,\mu)}$ denotes the local field of the $m$-th output neuron:

$$b_m^{(L,\mu)} = -\theta_m^{(L)} + \sum_k w_{mk}^{(L)} V_k^{(L-1,\mu)}. \tag{7.57}$$

Here $\theta_m^{(L)}$ and $w_{mk}^{(L)}$ are thresholds and weights, and $V_k^{(L-1,\mu)}$ is the state of the $k$-th neuron in layer $L-1$, evaluated for pattern $\mu$.

(a) Compute the derivative of output $O_i^{(\mu)}$ with respect to the local field $b_j^{(L,\mu)}$ of the $j$-th output neuron. (1p)

(b) The network is trained by gradient descent on the negative log-likelihood function,

$$H = -\sum_{i\mu} t_i^{(\mu)} \log O_i^{(\mu)}. \tag{7.58}$$

The summation is over all patterns in the training set and over all output neurons, the logarithm is the natural logarithm, and $t_i^{(\mu)}$ denote targets. The targets satisfy the constraint

$$\sum_i t_i^{(\mu)} = 1 \tag{7.59}$$

for all patterns $\mu$. When updating, the increment of a weight $w_{nq}^{(l)}$ in layer $l$ is given by

$$\delta w_{nq}^{(l)} = -\eta \frac{\partial H}{\partial w_{nq}^{(l)}}, \tag{7.60}$$

where $\eta$ denotes the learning rate. Derive the increment for weight $w_{nq}^{(L)}$ in layer $L$. (1p)
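An answer to part (a) can be checked numerically. The sketch below compares the standard closed form for the softmax derivative, $\partial O_i/\partial b_j = O_i(\delta_{ij} - O_j)$, with central finite differences. The function names and the example local fields are assumptions made for illustration.

```python
import math


def softmax(b):
    """Softmax outputs (7.56) for a list of local fields b."""
    shift = max(b)  # subtracting the maximum avoids overflow and leaves the result unchanged
    e = [math.exp(x - shift) for x in b]
    total = sum(e)
    return [x / total for x in e]


def softmax_jacobian(b):
    """Closed form dO_i/db_j = O_i (delta_ij - O_j)."""
    O = softmax(b)
    n = len(b)
    return [[O[i] * ((1.0 if i == j else 0.0) - O[j]) for j in range(n)] for i in range(n)]


def finite_difference_jacobian(b, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    n = len(b)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        up = list(b)
        up[j] += eps
        dn = list(b)
        dn[j] -= eps
        O_up, O_dn = softmax(up), softmax(dn)
        for i in range(n):
            J[i][j] = (O_up[i] - O_dn[i]) / (2 * eps)
    return J


b = [0.2, -1.0, 0.7]  # hypothetical local fields of three output neurons
J = softmax_jacobian(b)
J_num = finite_difference_jacobian(b)
max_diff = max(abs(J[i][j] - J_num[i][j]) for i in range(3) for j in range(3))
```

Since the outputs sum to unity, each column of the Jacobian sums to zero. This is a convenient additional check.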

Figure 8.1: Network with a feedback connection. Neurons 1 and 2 are hidden neurons. The weights from the input $x_k$ to the neurons $V_i$ are denoted by $w_{ik}^{(vx)}$, the weight from neuron $V_j$ to neuron $V_i$ is $w_{ij}^{(vv)}$. Neurons 3 and 4 are output neurons, with prescribed target values $y_i$. To avoid confusion with the iteration index $t$, the targets are denoted by $y$ in this Chapter.

8 Recurrent networks

The layout of the perceptrons analysed in the previous Chapters is special. All connections are one way, and only to the layer immediately to the right, so that the update rule for the $i$-th neuron in layer $l$ becomes

$$V_i^{(l)} = g\Big(\sum_j w_{ij}^{(l)} V_j^{(l-1)} - \theta_i^{(l)}\Big). \tag{8.1}$$

The backpropagation algorithm relies on this feed-forward layout. It means that the derivatives $\partial V_j^{(l-1)}/\partial w_{mn}^{(l)}$ vanish. This ensures that the outputs are nested functions of the inputs, which in turn implies the simple iterative structure of the backpropagation algorithm on page 75.

In some cases it is necessary or convenient to use networks that do not have this simple layout. The Hopfield networks discussed in part I are examples where all connections are symmetric. More general networks may have a feed-forward layout with feedbacks, as shown in Figure 8.1. Such networks are called recurrent networks. There are many different ways in which the feedbacks can act: from the output layer to hidden neurons for example (Figure 8.1), or there could be connections between the neurons in a given layer. Neurons 3 and 4 in Figure 8.1 are output units, they are associated with targets just as in Chapters 5 to 7. The layout of recurrent networks is very general, but because of the feedback links we must consider how such networks can be trained.

Unlike the multi-layer perceptrons, recurrent networks are commonly used as dynamical networks, just as in Chapters 1 and 2 [cf. Equation (1.4)].
The dynamics can either be discrete,

$$V_i(t) = g\Big(\sum_j w_{ij}^{(vv)} V_j(t-1) + \sum_k w_{ik}^{(vx)} x_k - \theta_i^{(v)}\Big) \quad\text{for}\quad t = 1, 2, \ldots, \tag{8.2}$$

or continuous,

$$\tau \frac{dV_i}{dt} = -V_i + g\Big(\sum_j w_{ij}^{(vv)} V_j(t) + \sum_k w_{ik}^{(vx)} x_k - \theta_i^{(v)}\Big), \tag{8.3}$$

with time constant $\tau$. The parameters $\theta_i^{(v)}$ are thresholds. We shall see in a moment why it can be advantageous to use a dynamical network.

Recurrent networks can learn in different ways. One possibility is to use a training set of pairs $(x^{(\mu)}, y^{(\mu)})$ with $\mu = 1, \ldots, p$. To avoid confusion with the iteration index $t$, the targets are denoted by $y$ in this Chapter. One feeds a pattern from this set and runs the dynamics (8.2) or (8.3) for the given $x^{(\mu)}$ until the dynamics reaches a steady state $V^*$ (if this does not happen the training fails). Then one updates the weights by gradient descent using the energy function

$$H = \frac{1}{2}\sum_k E_k^2 \quad\text{where}\quad E_k = \begin{cases} y_k - V_k & \text{if } V_k \text{ is an output unit,} \\ 0 & \text{otherwise,} \end{cases} \tag{8.4}$$

evaluated at $V = V^*$, that is $H^* = \frac{1}{2}\sum_k (E_k^*)^2$ with $E_k^* = y_k - V_k^*$ for the output units. Instead of defining the energy function in terms of the mean-squared output errors, one could also use the negative log-likelihood function (7.32). These steps are repeated until the steady-state outputs yield the correct targets for all input patterns. This is reminiscent of the algorithms discussed in Chapters 5 to 7, and we shall see that the backpropagation algorithm can be modified (recurrent backpropagation) to make the networks learn as described earlier.

Another possibility is that inputs and targets change as functions of time $t$ while the network dynamics runs. In this way the network can solve temporal association tasks where it learns to output certain targets $y(t)$ in response to the sequence $x(t)$ of input patterns. In this way recurrent networks can translate written text or recognise speech. Such networks can be trained by unfolding their dynamics in time as explained in Section 8.2 (backpropagation through time), although this algorithm suffers from the vanishing-gradient problem discussed in Chapter 7.

8.1 Recurrent backpropagation

Recall Figure 8.1. We want to train a network with $N$ real-valued units $V_i$ with sigmoid activation functions, and weights $w_{ij}$ from $V_j$ to $V_i$. Several of the units may be connected to inputs $x_k^{(\mu)}$. Other units are output units with associated target values $y_i^{(\mu)}$. We take the dynamics to be continuous in time, Equation (8.3), and assume that the dynamics runs into a steady state

$$V(t) \to V^* \quad\text{so that}\quad \frac{dV_i^*}{dt} = 0. \tag{8.5}$$

From Equation (8.3) we deduce

$$V_i^* = g\Big(\sum_j w_{ij}^{(vv)} V_j^* + \sum_k w_{ik}^{(vx)} x_k - \theta_i^{(v)}\Big). \tag{8.6}$$

In other words we assume that the dynamics (8.3) has a stable steady state, so that small perturbations $\delta V$ away from $V^*$ decay with time. Equation (8.6) is a nonlinear self-consistent Equation for $V^*$, in general difficult to solve. However, if the fixed points $V^*$ are stable then we can use the dynamics (8.3) to automatically pick out the steady-state solution $V^*$. This solution depends on the pattern $x^{(\mu)}$, but in Equations (8.5) and (8.6) and also in the following I have left out the superscript $(\mu)$.

The goal is to find weights so that the outputs give the correct target values in the steady state, those associated with $x^{(\mu)}$. To this end we use stochastic gradient descent on the energy function (8.4). Consider first how to update the weights $w_{ij}^{(vv)}$. We must evaluate

$$\delta w_{mn}^{(vv)} = -\eta \frac{\partial H}{\partial w_{mn}^{(vv)}} = \eta \sum_k E_k^* \frac{\partial V_k^*}{\partial w_{mn}^{(vv)}}. \tag{8.7}$$
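The statement that the dynamics picks out the steady state (8.6) can be tried out numerically. The sketch below iterates the discrete rule (8.2) and integrates the continuous rule (8.3) with a simple Euler scheme, for a small network with hypothetical weights chosen small enough that the fixed point is stable. The sigmoid activation, the step size, and all numbers are assumptions.

```python
import math


def g(b):
    """Sigmoid activation function (an assumption for this sketch)."""
    return 1.0 / (1.0 + math.exp(-b))


def local_fields(V, x, w_vv, w_vx, theta):
    """Arguments of g in Equations (8.2) and (8.3)."""
    return [sum(w_vv[i][j] * V[j] for j in range(len(V)))
            + sum(w_vx[i][k] * x[k] for k in range(len(x))) - theta[i]
            for i in range(len(V))]


def discrete_dynamics(x, w_vv, w_vx, theta, steps=200):
    """Iterate the discrete rule (8.2) starting from V = 0."""
    V = [0.0] * len(theta)
    for _ in range(steps):
        V = [g(b) for b in local_fields(V, x, w_vv, w_vx, theta)]
    return V


def continuous_dynamics(x, w_vv, w_vx, theta, tau=1.0, dt=0.05, steps=4000):
    """Euler integration of the continuous rule (8.3) starting from V = 0."""
    V = [0.0] * len(theta)
    for _ in range(steps):
        b = local_fields(V, x, w_vv, w_vx, theta)
        V = [v + dt / tau * (-v + g(bi)) for v, bi in zip(V, b)]
    return V


# Hypothetical network with three units and two inputs.
w_vv = [[0.0, 0.5, -0.3], [0.2, 0.0, 0.4], [-0.5, 0.1, 0.0]]
w_vx = [[1.0, -1.0], [0.5, 0.5], [0.0, 1.0]]
theta = [0.1, -0.2, 0.3]
x = [1.0, 0.5]

V_discrete = discrete_dynamics(x, w_vv, w_vx, theta)
V_continuous = continuous_dynamics(x, w_vv, w_vx, theta)

# Residual of the self-consistent Equation (8.6) and difference between the two results.
residual = max(abs(v - g(b)) for v, b in
               zip(V_continuous, local_fields(V_continuous, x, w_vv, w_vx, theta)))
difference = max(abs(a - b) for a, b in zip(V_discrete, V_continuous))
```

For larger weights the fixed point may become unstable or the dynamics may not converge at all. In that case the sketch, like the training procedure described in the text, fails.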

To calculate the gradients of $V^*$ we use Equation (8.6):

$$\frac{\partial V_i^*}{\partial w_{mn}^{(vv)}} = g'(b_i^*)\,\frac{\partial}{\partial w_{mn}^{(vv)}}\underbrace{\Big(\sum_j w_{ij}^{(vv)} V_j^* + \sum_k w_{ik}^{(vx)} x_k - \theta_i^{(v)}\Big)}_{=\,b_i^*} = g'(b_i^*)\Big(\delta_{im} V_n^* + \sum_j w_{ij}^{(vv)} \frac{\partial V_j^*}{\partial w_{mn}^{(vv)}}\Big). \tag{8.8}$$

This is a self-consistent equation for the gradient, as opposed to the explicit equations we found in Chapters 5 to 7. The reason for the difference is that the recurrent network has feedbacks. Since Equation (8.8) is linear in the gradients, we can solve it by matrix inversion. To this end, define the matrix $L$ with elements $L_{ij} = \delta_{ij} - g'(b_i^*)\,w_{ij}^{(vv)}$. Equation (8.8) can be written as

$$\sum_j L_{ij} \frac{\partial V_j^*}{\partial w_{mn}^{(vv)}} = \delta_{im}\,g'(b_i^*)\,V_n^*. \tag{8.9}$$

Applying $\sum_i (L^{-1})_{ki}$ to both sides we find

$$\frac{\partial V_k^*}{\partial w_{mn}^{(vv)}} = (L^{-1})_{km}\,g'(b_m^*)\,V_n^*. \tag{8.10}$$

Inserting this result into (8.7) we finally obtain for the weight increments:

$$\delta w_{mn}^{(vv)} = \eta \sum_k E_k^* (L^{-1})_{km}\,g'(b_m^*)\,V_n^*. \tag{8.11}$$

This learning rule can be written in the form of the backpropagation rule by introducing the error

$$\Delta_m^* = g'(b_m^*) \sum_k E_k^* (L^{-1})_{km}. \tag{8.12}$$

Then the learning rule (8.11) takes the form

$$\delta w_{mn}^{(vv)} = \eta\,\Delta_m^* V_n^*, \tag{8.13}$$

compare the corresponding learning rule in Chapter 6. A problem is that a matrix inversion is required to compute the errors $\Delta_m^*$, an expensive operation. But as outlined in the beginning of Chapter 5, we can try to find the inverse iteratively. We can find a linear differential equation for

$$\Delta_j^* = g'(b_j^*) \sum_k E_k^* (L^{-1})_{kj} \tag{8.14}$$

that does not involve the inverse of $L$. I used the index $j$ here because it makes the following calculation a bit easier to follow. The first step is to multiply both sides of Equation (8.14) with $L_{ji}/g'(b_j^*)$ and to sum over $j$. This gives

$$\sum_j \Delta_j^* L_{ji}/g'(b_j^*) = \sum_{kj} E_k^* (L^{-1})_{kj} L_{ji} = E_i^*. \tag{8.15}$$

Using $L_{ji}/g'(b_j^*) = \delta_{ji}/g'(b_j^*) - w_{ji}^{(vv)}$ we find

$$\sum_j \Delta_j^* \big(\delta_{ji} - w_{ji}^{(vv)} g'(b_i^*)\big) = g'(b_i^*)\,E_i^*. \tag{8.16}$$

The trick is now to write down a dynamical equation for $\Delta_i$ that has a steady state at the solution of Equation (8.16):

$$\tau \frac{d\Delta_i}{dt} = -\Delta_i + \sum_j \Delta_j w_{ji}^{(vv)} g'(b_i^*) + g'(b_i^*)\,E_i^*. \tag{8.17}$$

Compare this with the dynamical rule (8.3). Equations (8.3) and (8.17) exhibit the same duality as Algorithm 2, between forward propagation of states of neurons (step 5) and backpropagation of errors (step 9). The sum in Equation (8.17) has the same form as the recursion for the errors in Algorithm 2 (step 9), except that there are no layer indices $l$ here.

It is clear that the solution of (8.16) is a fixed point of this Equation. But is it stable? To decide this we linearise the dynamical equations (8.3) and (8.17). To this end we write

$$V_i(t) = V_i^* + \delta V_i(t) \quad\text{and}\quad \Delta_i(t) = \Delta_i^* + \delta\Delta_i(t), \tag{8.18}$$

and insert this ansatz into (8.3) and (8.17). To leading order in the small displacements we find:

$$\tau \frac{d}{dt}\delta V_i = -\delta V_i + g'(b_i^*) \sum_j w_{ij}^{(vv)} \delta V_j = -\sum_j L_{ij}\,\delta V_j, \tag{8.19}$$

$$\tau \frac{d}{dt}\delta\Delta_i = -\delta\Delta_i + \sum_j \delta\Delta_j\,w_{ji}^{(vv)} g'(b_i^*) = -\sum_j \delta\Delta_j\,g'(b_i^*)\,L_{ji}/g'(b_j^*). \tag{8.20}$$

Since the matrices with elements $L_{ij}$ and $g'(b_i^*)L_{ji}/g'(b_j^*)$ have the same eigenvalues, $\Delta_i^*$ is a stable fixed point of (8.17) if $V_n^*$ is a stable fixed point of (8.3). This was assumed in the beginning, Equation (8.5). If this assumption does not hold, the algorithm does not converge.

Now consider the update formula for the weights $w_{mn}^{(vx)}$ from the inputs:

$$\delta w_{mn}^{(vx)} = -\eta \frac{\partial H}{\partial w_{mn}^{(vx)}} = \eta \sum_k E_k^* \frac{\partial V_k^*}{\partial w_{mn}^{(vx)}}, \tag{8.21}$$

with

$$\frac{\partial V_i^*}{\partial w_{mn}^{(vx)}} = g'(b_i^*)\Big(\delta_{im}\,x_n + \sum_j w_{ij}^{(vv)} \frac{\partial V_j^*}{\partial w_{mn}^{(vx)}}\Big). \tag{8.22}$$

This Equation is analogous to Equation (8.8). Consequently

$$\delta w_{mn}^{(vx)} = \eta\,\Delta_m^* x_n. \tag{8.23}$$

The algorithm for recurrent backpropagation is summarised in Algorithm 9.
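The steps derived above can be sketched in a few lines: relax the states with (8.3), relax the errors with (8.17), and form the increments (8.13) and (8.23). The sketch below also compares one increment with a finite-difference estimate of $-\eta\,\partial H^*/\partial w_{mn}^{(vv)}$. Euler integration, the sigmoid activation, the dictionary `targets` mapping output-unit indices to target values, and all numbers are assumptions made for this illustration.

```python
import math


def g(b):
    return 1.0 / (1.0 + math.exp(-b))   # sigmoid activation (assumption)


def dg(b):
    s = g(b)
    return s * (1.0 - s)                # derivative g'(b) of the sigmoid


def fields(V, x, w_vv, w_vx, theta):
    """Local fields b_i for given states V."""
    return [sum(w_vv[i][j] * V[j] for j in range(len(V)))
            + sum(w_vx[i][k] * x[k] for k in range(len(x))) - theta[i]
            for i in range(len(V))]


def relax_states(x, w_vv, w_vx, theta, dt=0.1, steps=3000):
    """Euler integration of (8.3) with tau = 1 until the steady state V* is reached."""
    V = [0.0] * len(theta)
    for _ in range(steps):
        b = fields(V, x, w_vv, w_vx, theta)
        V = [v + dt * (-v + g(bi)) for v, bi in zip(V, b)]
    return V


def relax_errors(E, b, w_vv, dt=0.1, steps=3000):
    """Euler integration of (8.17) with tau = 1 until the steady state Delta* is reached."""
    n = len(E)
    D = [0.0] * n
    for _ in range(steps):
        D = [D[i] + dt * (-D[i] + dg(b[i]) * (sum(D[j] * w_vv[j][i] for j in range(n)) + E[i]))
             for i in range(n)]
    return D


def energy(x, targets, w_vv, w_vx, theta):
    """Steady-state energy H* of Equation (8.4)."""
    V = relax_states(x, w_vv, w_vx, theta)
    return 0.5 * sum((y - V[k]) ** 2 for k, y in targets.items())


def increments(x, targets, w_vv, w_vx, theta, eta):
    """Weight increments (8.13) and (8.23)."""
    V = relax_states(x, w_vv, w_vx, theta)
    b = fields(V, x, w_vv, w_vx, theta)
    E = [targets[k] - V[k] if k in targets else 0.0 for k in range(len(V))]
    D = relax_errors(E, b, w_vv)
    d_vv = [[eta * D[m] * V[n] for n in range(len(V))] for m in range(len(V))]
    d_vx = [[eta * D[m] * x[n] for n in range(len(x))] for m in range(len(V))]
    return d_vv, d_vx


# Hypothetical network: three units, unit 2 is the only output unit.
w_vv = [[0.0, 0.5, -0.3], [0.2, 0.0, 0.4], [-0.5, 0.1, 0.0]]
w_vx = [[1.0, -1.0], [0.5, 0.5], [0.0, 1.0]]
theta = [0.1, -0.2, 0.3]
x = [1.0, 0.5]
targets = {2: 0.9}
eta = 0.1

d_vv, d_vx = increments(x, targets, w_vv, w_vx, theta, eta)

# Finite-difference estimate of dH*/dw for the weight w_vv[0][1].
eps = 1e-5
w_plus = [row[:] for row in w_vv]
w_plus[0][1] += eps
w_minus = [row[:] for row in w_vv]
w_minus[0][1] -= eps
num_grad = (energy(x, targets, w_plus, w_vx, theta)
            - energy(x, targets, w_minus, w_vx, theta)) / (2 * eps)
```

The quantity `d_vv[0][1]` should agree with `-eta * num_grad` up to discretisation errors. No matrix inversion is needed, at the price of a second relaxation.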

Algorithm 9 recurrent backpropagation
1: initialise all weights;
2: for $t = 1, \ldots, T$ do
3: choose a value of $\mu$ and apply $x^{(\mu)}$ to the inputs;
4: find $V_n^*$ by relaxing $\tau \frac{dV_n}{dt} = -V_n + g\big(\sum_j w_{nj}^{(vv)} V_j + \sum_k w_{nk}^{(vx)} x_k - \theta_n^{(v)}\big)$;
5: compute $E_k^* = y_k - V_k^*$ for all output units;
6: find $\Delta_m^*$ by relaxing $\tau \frac{d\Delta_m}{dt} = -\Delta_m + \sum_j \Delta_j w_{jm}^{(vv)} g'(b_m^*) + g'(b_m^*)\,E_m^*$;
7: update all weights: $w_{mn}^{(vv)} \leftarrow w_{mn}^{(vv)} + \delta w_{mn}^{(vv)}$ with $\delta w_{mn}^{(vv)} = \eta\,\Delta_m^* V_n^*$ and $w_{mn}^{(vx)} \leftarrow w_{mn}^{(vx)} + \delta w_{mn}^{(vx)}$ with $\delta w_{mn}^{(vx)} = \eta\,\Delta_m^* x_n$;
8: end for
9: end;

Figure 8.2: Left: recurrent network with one hidden neuron (green) and one output neuron (blue). The input terminal is drawn red. Right: same network but unfolded in time. Here the time arguments are written as subscripts (see Section 8.2). The weights $w^{(vv)}$ remain unchanged as drawn, also the weights $w^{(vx)}$ and $w^{(ov)}$ remain unchanged (not drawn).

8.2 Backpropagation through time

Recurrent networks can be used to learn sequential inputs, as in speech recognition and machine translation. The training set is a time sequence of inputs and targets $[x(t), y(t)]$. The network is trained on the sequence and learns to predict the targets. In this context the layout is changed a little bit compared with the one described in the previous Section. There are two main differences. Firstly, the inputs and targets depend on $t$ and one uses a discrete-time update rule. Secondly, separate output units $O_i(t)$ are added to the layout. The update rule takes the form

$$V_i(t) = g\Big(\sum_j w_{ij}^{(vv)} V_j(t-1) + \sum_k w_{ik}^{(vx)} x_k(t) - \theta_i^{(v)}\Big), \tag{8.24a}$$

$$O_i(t) = g\Big(\sum_j w_{ij}^{(ov)} V_j(t) - \theta_i^{(o)}\Big). \tag{8.24b}$$

The activation function of the outputs $O_i$ can be different from that of the hidden neurons $V_j$. Often the softmax function is used for the outputs [86, 87].

To train recurrent networks with time-dependent inputs and targets and with the dynamics (8.24) one uses backpropagation through time. The idea is to unfold the network in time to get rid of the feedbacks, at the expense of as many copies of the original neurons as there are time steps. This is illustrated in Figure 8.2 for a recurrent network with one hidden neuron, one input, and one output. The unfolded network has $T$ inputs and outputs, and it can be trained in the usual way with stochastic gradient descent. The errors are calculated using backpropagation as in Algorithm 2, but here the errors are propagated back in time, not from layer to layer. The energy function is the squared error summed over all time steps

$$H = \frac{1}{2}\sum_{t=1}^{T} E_t^2 \quad\text{with}\quad E_t = y_t - O_t. \tag{8.25}$$

One can also use the negative log-likelihood function (7.26). Note that I have written the time argument as a subscript. Consider first how to update the weight $w^{(vv)}$. The gradient-descent rule (5.26) gives a result that is of the same form as Equation (8.7):

$$\delta w^{(vv)} = \eta \sum_{t=1}^{T} E_t \frac{\partial O_t}{\partial w^{(vv)}} = \eta \sum_{t=1}^{T} \Delta_t\,w^{(ov)} \frac{\partial V_t}{\partial w^{(vv)}}. \tag{8.26}$$

Here $\Delta_t = E_t\,g'(B_t)$ is an output error, $B_t = w^{(ov)} V_t - \theta^{(o)}$ is the local field of the output neuron at time $t$ [Equation (8.24)], and $\partial V_t/\partial w^{(vv)}$ is evaluated with the chain rule, as usual. Equation (8.24a) yields the recursion

$$\frac{\partial V_t}{\partial w^{(vv)}} = g'(b_t)\Big(V_{t-1} + w^{(vv)} \frac{\partial V_{t-1}}{\partial w^{(vv)}}\Big) \tag{8.27}$$

for $t \ge 1$. Since $\partial V_0/\partial w^{(vv)} = 0$ we have:

$$\begin{aligned}
\frac{\partial V_1}{\partial w^{(vv)}} &= g'(b_1)V_0, \\
\frac{\partial V_2}{\partial w^{(vv)}} &= g'(b_2)V_1 + g'(b_2)w^{(vv)}g'(b_1)V_0, \\
\frac{\partial V_3}{\partial w^{(vv)}} &= g'(b_3)V_2 + g'(b_3)w^{(vv)}g'(b_2)V_1 + g'(b_3)w^{(vv)}g'(b_2)w^{(vv)}g'(b_1)V_0, \\
&\;\;\vdots \\
\frac{\partial V_{T-1}}{\partial w^{(vv)}} &= g'(b_{T-1})V_{T-2} + g'(b_{T-1})w^{(vv)}g'(b_{T-2})V_{T-3} + \ldots \\
\frac{\partial V_T}{\partial w^{(vv)}} &= g'(b_T)V_{T-1} + g'(b_T)w^{(vv)}g'(b_{T-1})V_{T-2} + \ldots
\end{aligned}$$

Equation (8.26) says that we must sum over $t$. Regrouping the terms in this sum yields:

$$\begin{aligned}
\Delta_1 \frac{\partial V_1}{\partial w^{(vv)}} &+ \Delta_2 \frac{\partial V_2}{\partial w^{(vv)}} + \Delta_3 \frac{\partial V_3}{\partial w^{(vv)}} + \ldots \\
&= [\Delta_1 g'(b_1) + \Delta_2 g'(b_2)w^{(vv)}g'(b_1) + \Delta_3 g'(b_3)w^{(vv)}g'(b_2)w^{(vv)}g'(b_1) + \ldots]\,V_0 \\
&+ [\Delta_2 g'(b_2) + \Delta_3 g'(b_3)w^{(vv)}g'(b_2) + \Delta_4 g'(b_4)w^{(vv)}g'(b_3)w^{(vv)}g'(b_2) + \ldots]\,V_1 \\
&+ [\Delta_3 g'(b_3) + \Delta_4 g'(b_4)w^{(vv)}g'(b_3) + \Delta_5 g'(b_5)w^{(vv)}g'(b_4)w^{(vv)}g'(b_3) + \ldots]\,V_2 \\
&\;\;\vdots \\
&+ [\Delta_{T-1} g'(b_{T-1}) + \Delta_T g'(b_T)w^{(vv)}g'(b_{T-1})]\,V_{T-2} \\
&+ [\Delta_T g'(b_T)]\,V_{T-1}.
\end{aligned}$$

To write the learning rule in the usual form, we define errors $\delta_t$ recursively:

$$\delta_t = \begin{cases} \Delta_T\,w^{(ov)} g'(b_T) & \text{for } t = T, \\ \Delta_t\,w^{(ov)} g'(b_t) + \delta_{t+1}\,w^{(vv)} g'(b_t) & \text{for } 0 < t < T. \end{cases} \tag{8.28}$$

Then the learning rule takes the form

$$\delta w^{(vv)} = \eta \sum_{t=1}^{T} \delta_t V_{t-1}, \tag{8.29}$$

just like Equation (6.10), or like the recursion in step 9 of Algorithm 2.

The factor $w^{(vv)} g'(b_t)$ in the recursion (8.28) gives rise to a product of many such factors in $\delta_t$ when $T$ is large, exactly as described in Section 7.2.1 for multilayer perceptrons. This means that the training of recurrent nets suffers from unstable gradients, as backpropagation of multilayer perceptrons does (Section 7.2.1). If the factors $|w^{(vv)} g'(b_p)|$ are smaller than unity then the errors $\delta_t$ become very small when $t$ becomes small (vanishing-gradient problem). This means that the early states of the hidden neuron no longer contribute to the learning, causing the network to forget what it has learned about early inputs. When $|w^{(vv)} g'(b_p)| > 1$, on the other hand, exploding gradients make learning impossible. In summary, unstable gradients in recurrent neural networks occur much in the same way as in multilayer perceptrons (Section 7.2.1). The resulting difficulties for training recurrent neural networks are discussed in more detail in Ref. [88].

A slight variation of the above algorithm (truncated backpropagation through time) suffers less from the exploding-gradient problem. The idea is that the exploding gradients are tamed by truncating the memory.
This is achieved by limiting the error propagation backwards in time: errors are computed back to $T - \tau$ and not further, where $\tau$ is the truncation time [2]. Naturally this implies that long-time correlations cannot be learnt.

Finally, the update formulae for the weights $w^{(vx)}$ are obtained in a similar fashion. Equation (8.24a) yields the recursion

$$\frac{\partial V_t}{\partial w^{(vx)}} = g'(b_t)\Big(x_t + w^{(vv)} \frac{\partial V_{t-1}}{\partial w^{(vx)}}\Big). \tag{8.30}$$

This looks just like Equation (8.27), except that $V_{t-1}$ is replaced by $x_t$. As a consequence we have

$$\delta w^{(vx)} = \eta \sum_{t=1}^{T} \delta_t\,x_t. \tag{8.31}$$
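The recursion (8.28) and the learning rules (8.29) and (8.31) can be checked against a brute-force numerical derivative of the energy function (8.25). The following sketch does this for one hidden and one output neuron. The tanh activation, the parameter dictionary `p`, the initial state `V0`, and the input and target sequences are all assumptions chosen for illustration.

```python
import math


def g(b):
    return math.tanh(b)                 # activation function (assumption)


def dg(b):
    return 1.0 - math.tanh(b) ** 2      # its derivative g'(b)


def forward(xs, p, V0=0.3):
    """Iterate (8.24) for t = 1..T. Index 0 of V holds the initial state V_0."""
    V, b, B, O = [V0], [None], [None], [None]
    for t in range(1, len(xs) + 1):
        b.append(p["w_vv"] * V[t - 1] + p["w_vx"] * xs[t - 1] - p["th_v"])
        V.append(g(b[t]))
        B.append(p["w_ov"] * V[t] - p["th_o"])
        O.append(g(B[t]))
    return V, b, B, O


def energy(xs, ys, p):
    """Squared error summed over all time steps, Equation (8.25)."""
    O = forward(xs, p)[3]
    return 0.5 * sum((ys[t - 1] - O[t]) ** 2 for t in range(1, len(xs) + 1))


def bptt_increments(xs, ys, p, eta):
    """Weight increments from backpropagation through time."""
    T = len(xs)
    V, b, B, O = forward(xs, p)
    # output errors Delta_t = E_t g'(B_t)
    Delta = [0.0] + [(ys[t - 1] - O[t]) * dg(B[t]) for t in range(1, T + 1)]
    # errors delta_t from the recursion (8.28); delta[T+1] = 0 gives the case t = T
    delta = [0.0] * (T + 2)
    for t in range(T, 0, -1):
        delta[t] = (Delta[t] * p["w_ov"] + delta[t + 1] * p["w_vv"]) * dg(b[t])
    return {
        "w_vv": eta * sum(delta[t] * V[t - 1] for t in range(1, T + 1)),   # (8.29)
        "w_vx": eta * sum(delta[t] * xs[t - 1] for t in range(1, T + 1)),  # (8.31)
        "w_ov": eta * sum(Delta[t] * V[t] for t in range(1, T + 1)),       # output weight
    }


def numerical_increment(xs, ys, p, name, eta, eps=1e-6):
    """Central finite-difference estimate of -eta dH/dw, used as an independent check."""
    up = dict(p)
    up[name] += eps
    dn = dict(p)
    dn[name] -= eps
    return -eta * (energy(xs, ys, up) - energy(xs, ys, dn)) / (2 * eps)


# Hypothetical parameters and a short sequence of inputs and targets.
p = {"w_vv": 0.8, "w_vx": 0.5, "w_ov": 1.2, "th_v": 0.1, "th_o": -0.2}
xs = [0.5, -1.0, 0.3, 0.9, -0.4]
ys = [0.2, -0.1, 0.4, 0.6, 0.0]
eta = 0.1

dw = bptt_increments(xs, ys, p, eta)
dw_num = {name: numerical_increment(xs, ys, p, name, eta) for name in dw}
```

The entries of `dw` and `dw_num` should agree up to the accuracy of the finite differences. For long sequences the factors $w^{(vv)} g'(b_t)$ accumulated in `delta` illustrate how the errors shrink or grow as they are propagated back in time.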

Figure 8.3: Schematic illustration of unfolded recurrent network for machine translation, after Refs. [86, 87]. The green rectangular boxes represent the hidden states in the form of long short term memory units (LSTM). Otherwise the network layout is like the one shown in Figure 8.2. Sutskever et al. [86] found that the network translates much better if the sentence is read in reverse order, from the end. The tag <EOS> denotes the end-of-sentence tag. Here it denotes the beginning of the sentence.

The update formula for $w^{(ov)}$ is simpler to derive. From Equation (8.24b) we find by differentiation w.r.t. $w^{(ov)}$:

$$\delta w^{(ov)} = \eta \sum_{t=1}^{T} E_t\,g'(B_t)\,V_t. \tag{8.32}$$

How are the thresholds updated? Going through the above derivation we see that we must replace $V_{t-1}$ and $x_t$ in Equations (8.29) and (8.31) by $-1$. It works in the same way for the output threshold.

In order to keep the formulae simple, I only described the algorithm for a single hidden and a single output neuron, so that I could leave out the indices referring to different hidden neurons and/or different output components. You can add those indices yourself, the structure of the Equations remains exactly the same, save for a number of extra sums over those indices:

$$\delta w_{mn}^{(vv)} = \eta \sum_{t=1}^{T} \delta_m^{(t)} V_n^{(t-1)}, \tag{8.33}$$

$$\delta_j^{(t)} = \begin{cases} \sum_i \Delta_i^{(t)} w_{ij}^{(ov)} g'(b_j^{(t)}) & \text{for } t = T, \\ \sum_i \Delta_i^{(t)} w_{ij}^{(ov)} g'(b_j^{(t)}) + \sum_i \delta_i^{(t+1)} w_{ij}^{(vv)} g'(b_j^{(t)}) & \text{for } 0 < t < T. \end{cases}$$

The second term in the recursion for $\delta_j^{(t)}$ is analogous to the recursion in step 9 of Algorithm 2. The time index $t$ here plays the role of the layer index $l$ in Algorithm 2. A difference is that the weights in Equation (8.33) are the same for all time steps.

In summary you see that backpropagation through time for recurrent networks is similar to backpropagation for multilayer perceptrons. After the recurrent network is unfolded to get rid of the feedback connections it can be trained by backpropagation. The time index $t$ takes the role of the layer index $l$.
Backpropagation through time is the standard approach for training recurrent nets, despite the fact that it suffers from the vanishing-gradient problem. The next Section describes how improvements to the layout make it possible to efficiently train recurrent networks.

8.3 Recurrent networks for machine translation

Recurrent networks are used for machine translation [87]. How does this work?

Basic network layout

The networks are trained using backpropagation through time. The vanishing-gradient problem is dealt with by improved network layouts. Hochreiter and Schmidhuber [89] suggested to replace the hidden neurons of the recurrent network with computation units that are specially designed to eliminate the vanishing-gradient problem. The method is referred to as long short-term memory (LSTM). The basic ingredient is the same as in residual networks (Section 7.6): short cuts reduce the vanishing-gradient problem. For our purposes we can think of LSTMs as units that replace the hidden neurons.

Representation of inputs and outputs

How are the network inputs and outputs represented? For machine translation one must represent words in the dictionary in terms of a code. The simplest code is a binary code where 100... represents the first word in the dictionary, 010... the second word, and so forth. Each input is a vector with as many components as there are words in the dictionary. A sentence corresponds to a sequence $x_1, x_2, \ldots, x_T$. Each sentence ends with an end-of-sentence tag, <EOS>. Softmax outputs give the probability $p(O_1, \ldots, O_{T'}\,|\,x_1, \ldots, x_T)$ of an output sequence of length $T'$ conditional on the input sequence. The translated sentence is the one with the highest probability (it also contains the end-of-sentence tag <EOS>). So both inputs and outputs are represented by high-dimensional vectors $x_t$ and $O_t$. Other encoding schemes are described in Ref. [87].

What is the role of the hidden states, represented in terms of an LSTM? The network encodes the input sequence $x_1, x_2, \ldots, x_T$ in these states. Upon encountering the <EOS> tag in the input sequence, the network outputs the first word of the translated sentence using the information about the input sequence stored in $V_T$ as shown in Figure 8.3. The first output is fed into the next input, and the network continues to translate until it produces an <EOS> tag for the output sequence.
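The binary code described above (one component per dictionary word) can be sketched as follows. The four-word dictionary, in which the <EOS> tag happens to be the first entry, and the function names are assumptions made for this example. Real systems use dictionaries with many thousands of entries.

```python
def one_hot(index, size):
    """Vector with a single 1 at position index, all other components 0."""
    vector = [0] * size
    vector[index] = 1
    return vector


# Hypothetical dictionary. The end-of-sentence tag is treated like a word.
dictionary = ["<EOS>", "the", "cat", "sleeps"]
word_index = {word: i for i, word in enumerate(dictionary)}


def encode(sentence):
    """Turn a sentence into the sequence x_1, ..., x_T, ending with the <EOS> tag."""
    words = sentence.split() + ["<EOS>"]
    return [one_hot(word_index[w], len(dictionary)) for w in words]


def decode(vectors):
    """Map each vector back to the word with the largest component."""
    return [dictionary[v.index(max(v))] for v in vectors]


x = encode("the cat sleeps")
```

The function `decode` also works for softmax outputs: taking the largest component of each output vector picks the most probable word at each step.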
In short, the network calculates the probabilities

$$
p(\boldsymbol{O}_1,\ldots,\boldsymbol{O}_{T'}\,|\,\boldsymbol{x}_1,\ldots,\boldsymbol{x}_T) = \prod_{t=1}^{T'} p(\boldsymbol{O}_t\,|\,\boldsymbol{O}_1,\ldots,\boldsymbol{O}_{t-1};\boldsymbol{x}_1,\ldots,\boldsymbol{x}_T), \tag{8.34}
$$

where $p(\boldsymbol{O}_t\,|\,\boldsymbol{O}_1,\ldots,\boldsymbol{O}_{t-1};\boldsymbol{x}_1,\ldots,\boldsymbol{x}_T)$ is the probability of the next word in the output sequence given the inputs and the output sequence up to $\boldsymbol{O}_{t-1}$ [7].

Advanced layouts

There is a large number of recent papers on machine translation with recurrent neural nets. Most studies are based on the training algorithm described in Section 8.2, backpropagation through time. Different algorithms mainly differ in their network layouts. Google's machine translation system uses a deep network with layers of LSTMs [7]. Different hidden states are unfolded forward as well as backwards in time, as illustrated in Figure 8.4. In this Figure the hidden states are represented by LSTMs. In the simplest case the hidden states are just encoded in hidden neurons, as in Figure 8.2 and Equation (8.24). If we represent the hidden states by neurons, as in Section 8.2, then the corresponding


bidirectional network has the dynamics

$$
\begin{aligned}
V_i(t) &= g\Big(\sum_j w^{(vv)}_{ij} V_j(t-1) + \sum_k w^{(vx)}_{ik} x_k(t) - \theta^{(v)}_i\Big),\\
U_i(t) &= g\Big(\sum_j w^{(uu)}_{ij} U_j(t+1) + \sum_k w^{(ux)}_{ik} x_k(t) - \theta^{(u)}_i\Big),\\
O_i(t) &= g\Big(\sum_j w^{(ov)}_{ij} V_j(t) + \sum_j w^{(ou)}_{ij} U_j(t) - \theta^{(o)}_i\Big).
\end{aligned}
\tag{8.35}
$$

Figure 8.4: Schematic illustration of a bidirectional recurrent network. The net consists of two hidden states that are unfolded in different ways. The hidden states are represented by LSTMs.

It is natural to use bidirectional nets for machine translation because correlations go either way in a sentence, forward and backwards. In German, for example, the finite verb form is usually at the end of the sentence.

Scores

Different schemes for scoring the accuracy of a translation are described by Lipton et al. [87]. One difficulty is that there are often several different valid translations of a given sentence, and the score must compare the machine translation with all of them. Recent papers on machine translation usually use the so-called BLEU score to evaluate the translation accuracy. The acronym stands for bilingual evaluation understudy. The scheme was proposed by Papineni et al. [90], and it is commonly judged to score not too differently from how humans would score.

8.4 Summary

It is sometimes said that recurrent networks learn dynamical systems while multilayer perceptrons learn input-output maps. This notion refers to backpropagation in time. I would emphasise, by contrast, that both networks are trained in similar ways, by backpropagation. Neither is it given that the tasks must differ: recurrent networks are also used to learn time-independent data. It is true though that tools from dynamical-systems theory have been used with success to analyse the dynamics of recurrent networks [88, 91]. Recurrent neural networks are trained by stochastic gradient descent after unfolding the network in time to get rid of feedback connections. This algorithm suffers from the vanishing-gradient problem. To overcome this difficulty, the hidden states in the recurrent network are usually represented by LSTMs. Recent layouts for machine translation use deep bidirectional networks with layers of LSTMs.
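The bidirectional dynamics of Equation (8.35) can be sketched as follows for the simplest case of one hidden neuron per direction. This is a minimal sketch, not the layout of any production system: the function name `bidirectional_forward`, the scalar weights, the choice $g = \tanh$, and the boundary values $V(0) = 0$ and $U(T+1) = 0$ are assumptions made for illustration.

```python
import math

def bidirectional_forward(xs, w_vv, w_vx, w_uu, w_ux, w_ov, w_ou,
                          theta_v=0.0, theta_u=0.0, theta_o=0.0, g=math.tanh):
    """Evaluate Equation (8.35) for scalar states V(t), U(t) and output O(t)."""
    T = len(xs)
    V = [0.0] * T
    U = [0.0] * T
    v_prev = 0.0                      # assumed initial value V(0) = 0
    for t in range(T):                # V is unfolded forward in time
        V[t] = g(w_vv * v_prev + w_vx * xs[t] - theta_v)
        v_prev = V[t]
    u_next = 0.0                      # assumed final value U(T+1) = 0
    for t in reversed(range(T)):      # U is unfolded backwards in time
        U[t] = g(w_uu * u_next + w_ux * xs[t] - theta_u)
        u_next = U[t]
    O = [g(w_ov * V[t] + w_ou * U[t] - theta_o) for t in range(T)]
    return V, U, O

# A single non-zero input at the first time step.
V, U, O = bidirectional_forward([1.0, 0.0, 0.0], w_vv=0.5, w_vx=1.0,
                                w_uu=0.5, w_ux=1.0, w_ov=1.0, w_ou=1.0)
```

In this example the forward state V still carries a trace of the first input at later times, while the backward state U is zero at the later times because it only sees inputs from the future.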


Figure 8.5: Recurrent network with one input unit $x(t)$ (red), one hidden neuron $V(t)$ (green) and one output neuron $O(t)$ (blue).

8.5 Further reading

The training of recurrent networks is discussed in Chapter 5 of Ref. [2]. Recurrent backpropagation is described by Hertz, Krogh and Palmer [1], for a slightly different network layout. For a recent review see Ref. [87]. This page [92] gives a very enthusiastic overview about what recurrent nets can do. A more pessimistic view is expressed in this blog.

8.6 Exercises

Recurrent backpropagation. Show that recurrent backpropagation is a special case of the backpropagation algorithm for layered feed-forward networks.

8.7 Exam questions

8.7.1 Recurrent network

Figure 8.5 shows a simple recurrent network with one hidden neuron $V(t)$, one input $x(t)$ and one output $O(t)$. The network learns a time series of input-output pairs $[x(t), y(t)]$ for $t = 1, 2, 3, \ldots, T$. Here $t$ is a discrete time index and $y(t)$ is the target value at time $t$ (the targets are denoted by $y$ to avoid confusion with the time index $t$). The hidden unit is initialised to a value $V(0)$ at $t = 0$. This network can be trained by backpropagation by unfolding it in time.

(a) Draw the unfolded network, label the connections using the labels shown in Figure 8.5, and discuss the layout (max half an A4 page). (0.5p).

(b) Write down the dynamical rules for this network, the rules that determine $V(t)$ in terms of $V(t-1)$ and $x(t)$, and $O(t)$ in terms of $V(t)$. Assume that both $V(t)$ and $O(t)$ have the same activation function $g(b)$. (0.5p).

(c) Derive the update rule for $w^{(ov)}$ for gradient descent on the energy function

$$
H = \frac{1}{2}\sum_{t=1}^{T} E(t)^2 \quad\text{where}\quad E(t) = y(t) - O(t). \tag{8.36}
$$

Denote the learning rate by $\eta$. Hint: the update rule for $w^{(ov)}$ is much simpler to derive than those for $w^{(vx)}$ and $w^{(vv)}$. (1p).

(d) Explain how recurrent networks are used for machine translation. Draw the layout, describe how the inputs are encoded. How is the unstable-gradient problem overcome? (Max one A4 page). (1p).

PART III UNSUPERVISED LEARNING

Figure 9.1: Supervised learning finds decision boundaries (left). Unsupervised learning can find clusters in the input data (right).

Chapters 5, 6, 7, and 8 described supervised learning where the networks are trained to produce the correct outputs. The remaining Chapters discuss unsupervised learning. In this case there is no feedback telling the network whether it has learnt correctly or not. The learning goal is not defined, so that the network must discover relevant ways of organising the input data. This requires redundancy in the input data. Possible tasks are to determine the familiarity of input patterns, or to find clusters (Figure 9.1) in high-dimensional input data. A further application is to determine spatial maps of spatially distributed inputs, so that nearby inputs activate nearby output units. Such unsupervised-learning algorithms are explained in Chapter 9. Chapter 10 introduces radial basis-function networks, which learn using a hybrid algorithm with supervised and unsupervised learning. A different hybrid algorithm is discussed in Chapter 11, reinforcement learning. Here the idea is that the network receives only partial feedback on its performance, it cannot access the full set of target values. For instance, the feedback may just be $+1$ (good solution) or $-1$ (not so good). In this case the network can learn by building up its own training set from its outputs with $+1$ feedback.

9 Unsupervised Hebbian learning

The material in this Chapter comes from the book by Hertz, Krogh, and Palmer [1]. The simplest example for unsupervised learning is given by a distribution $P(\boldsymbol{x})$ of input patterns

$$
\boldsymbol{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix} \tag{9.1}
$$

with continuous-valued components $x_i$. Patterns are drawn from this distribution and fed one after another to the network shown in Figure 9.2. It has one linear output unit $y = \boldsymbol{w}\cdot\boldsymbol{x}$ with weight vector

$$
\boldsymbol{w} = \begin{bmatrix} w_1 \\ \vdots \\ w_N \end{bmatrix}. \tag{9.2}
$$

The network can detect how familiar certain input patterns are. The idea is that the output is the larger the more frequently the input pattern occurs in $P(\boldsymbol{x})$. This learning goal is achieved by Hebb's rule:

$$
\boldsymbol{w}' = \boldsymbol{w} + \delta\boldsymbol{w} \quad\text{with}\quad \delta\boldsymbol{w} = \eta\, y\, \boldsymbol{x}, \tag{9.3}
$$

where $y = \boldsymbol{w}\cdot\boldsymbol{x}$ is the output. The rule (9.3) is also called Hebbian unsupervised learning. As usual, $\eta > 0$ is a small learning rate. How does this learning rule work? Since we keep adding multiples of the

pattern vectors $\boldsymbol{x}$ to the weights, the magnitude of the output $|y|$ becomes the larger the more often the input pattern occurs in the distribution $P(\boldsymbol{x})$. So the most familiar pattern produces the largest output. A potential problem is that the weight vector may continue to grow as we keep on adding increments. This usually happens, and this means that the simple Hebbian learning rule (9.3) does not converge to a steady state. To achieve definite learning outcomes we require the network to approach a steady state. Therefore the learning rule (9.3) must be modified. One possibility is to introduce weight decay, as described earlier in these notes. This is discussed in the next Section.

Figure 9.2: Network for unsupervised Hebbian learning, with a single linear output unit that has weight vector $\boldsymbol{w}$. The network output is denoted by $y$ in this Chapter.

Algorithm 10 Oja's rule
  initialise weights randomly;
  for t = 1, ..., T do
    draw an input pattern x from P(x) and apply it to the network;
    update all weights using δw = η y (x − y w);
  end for
  end;

9.1 Oja's rule

Adding a weight-decay term with coefficient proportional to $y^2$ to Equation (9.3),

$$
\delta\boldsymbol{w} = \eta\, y\,(\boldsymbol{x} - y\,\boldsymbol{w}) = \eta\,\big\{\boldsymbol{x}\boldsymbol{x}^{\mathsf T}\boldsymbol{w} - [\boldsymbol{w}\cdot(\boldsymbol{x}\boldsymbol{x}^{\mathsf T})\boldsymbol{w}]\,\boldsymbol{w}\big\}, \tag{9.4}
$$

ensures that the weights remain normalised. For the second equality I used that the output is given by $y = \boldsymbol{w}\cdot\boldsymbol{x} = \boldsymbol{w}^{\mathsf T}\boldsymbol{x} = \boldsymbol{x}^{\mathsf T}\boldsymbol{w}$. To see why Equation (9.4) does the trick, consider an analogy: a vector $\boldsymbol{q}$ that obeys the differential equation

$$
\frac{\mathrm d}{\mathrm dt}\boldsymbol{q} = \mathbb{A}(t)\,\boldsymbol{q}. \tag{9.5}
$$

For a general matrix $\mathbb{A}(t)$, the norm $|\boldsymbol{q}|$ may grow or shrink. We can ensure that the vector remains normalised by adding a term to Equation (9.5):

$$
\frac{\mathrm d}{\mathrm dt}\boldsymbol{w} = \mathbb{A}(t)\,\boldsymbol{w} - [\boldsymbol{w}\cdot\mathbb{A}(t)\boldsymbol{w}]\,\boldsymbol{w}. \tag{9.6}
$$

The vector $\boldsymbol{w}$ turns in the same way as $\boldsymbol{q}$, and if we set $|\boldsymbol{w}| = 1$ initially, then $\boldsymbol{w}$ remains normalised ($\boldsymbol{w} = \boldsymbol{q}/|\boldsymbol{q}|$). You can see this by noting that $\frac{\mathrm d}{\mathrm dt}|\boldsymbol{w}|^2 = 2\,\boldsymbol{w}\cdot\frac{\mathrm d}{\mathrm dt}\boldsymbol{w} = 0$. Equation (9.6) describes the

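dynamics of the normalised orientation vector of a small rod in turbulence [93], where $\mathbb{A}(t)$ is the matrix of fluid-velocity gradients.

But let us return to Equation (9.4). It is called Oja's rule [94]. Oja's learning algorithm is summarised in Algorithm 10. One draws a pattern $\boldsymbol{x}$ from the distribution $P(\boldsymbol{x})$ of input patterns, applies it to the network, and updates the weights as prescribed in Equation (9.4). This is repeated many times. In the following we denote the average over $T$ input patterns as $\langle\cdots\rangle = \frac{1}{T}\sum_{t=1}^{T}\cdots$.

Now we show that a steady state $\boldsymbol{w}^\ast$ of Algorithm 10 has the following properties:

1. $|\boldsymbol{w}^\ast| = 1$
2. $\boldsymbol{w}^\ast$ is the eigenvector of $\mathbb{C}' = \langle\boldsymbol{x}\boldsymbol{x}^{\mathsf T}\rangle$ with maximal eigenvalue
3. $\boldsymbol{w}^\ast$ maximises $\langle y^2\rangle$ over all $\boldsymbol{w}$ with $|\boldsymbol{w}| = 1$.

In particular, the weight vector remains normalised. We first prove statement 1, assuming that a steady state $\boldsymbol{w}^\ast$ has been reached. In a steady state the increments $\delta\boldsymbol{w}$ must average to zero because the weights would either grow or decrease otherwise:

$$
0 = \langle\delta\boldsymbol{w}\rangle_{\boldsymbol{w}^\ast}. \tag{9.7}
$$

Here $\langle\cdots\rangle_{\boldsymbol{w}^\ast}$ is an average at fixed $\boldsymbol{w}^\ast$ (the presumed steady state). So $\boldsymbol{w}^\ast$ is not averaged over, just $\boldsymbol{x}$. Equation (9.7) is a condition upon $\boldsymbol{w}^\ast$. Using the learning rule (9.4), Equation (9.7) implies

$$
0 = \mathbb{C}'\boldsymbol{w}^\ast - (\boldsymbol{w}^\ast\cdot\mathbb{C}'\boldsymbol{w}^\ast)\,\boldsymbol{w}^\ast. \tag{9.8}
$$

It follows that $\boldsymbol{w}^\ast$ must obey $\mathbb{C}'\boldsymbol{w}^\ast = \lambda\boldsymbol{w}^\ast$. In other words, $\boldsymbol{w}^\ast$ must be an eigenvector of $\mathbb{C}'$. We denote the eigenvalues and eigenvectors of $\mathbb{C}'$ by $\lambda_\alpha$ and $\boldsymbol{u}_\alpha$. Since $\mathbb{C}'$ is symmetric, its eigenvalues are real, the eigenvectors can be chosen orthonormal, $\boldsymbol{u}_\alpha\cdot\boldsymbol{u}_\beta = \delta_{\alpha\beta}$, and they form a basis. Moreover, $\mathbb{C}'$ is positive semidefinite. This means that the eigenvalues cannot be negative. It also follows from Equation (9.8) that $|\boldsymbol{w}^\ast| = 1$: inserting $\mathbb{C}'\boldsymbol{w}^\ast = \lambda\boldsymbol{w}^\ast$ gives $\lambda\boldsymbol{w}^\ast = \lambda|\boldsymbol{w}^\ast|^2\boldsymbol{w}^\ast$, so that $|\boldsymbol{w}^\ast|^2 = 1$ for $\lambda \neq 0$.

Second, we prove statement 2: only the eigenvector corresponding to $\lambda_{\max}$ represents a stable steady state. To demonstrate this, we investigate the linear stability of $\boldsymbol{w}^\ast$ (in the same way as in Section 8.1). We use the ansatz

$$
\boldsymbol{w} = \boldsymbol{w}^\ast + \boldsymbol{\varepsilon} \tag{9.9}
$$

where $\boldsymbol{\varepsilon}$ is a small initial displacement from $\boldsymbol{w}^\ast$. Now we determine the average change of $\boldsymbol{\varepsilon}$ after one iteration step:

$$
\langle\delta\boldsymbol{\varepsilon}\rangle = \langle\delta\boldsymbol{w}\rangle_{\boldsymbol{w}^\ast+\boldsymbol{\varepsilon}}. \tag{9.10}
$$

To this end we expand Equation (9.10) in $\boldsymbol{\varepsilon}$ to leading order:

$$
\langle\delta\boldsymbol{\varepsilon}\rangle \approx \eta\,\big[\mathbb{C}'\boldsymbol{\varepsilon} - 2(\boldsymbol{\varepsilon}\cdot\mathbb{C}'\boldsymbol{w}^\ast)\,\boldsymbol{w}^\ast - (\boldsymbol{w}^\ast\cdot\mathbb{C}'\boldsymbol{w}^\ast)\,\boldsymbol{\varepsilon}\big]. \tag{9.11}
$$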
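As a numerical illustration of Algorithm 10 and of the properties under discussion, the following sketch iterates Oja's rule on a small data set. The three patterns, the learning rate, the number of steps and the function name `oja` are assumptions chosen for illustration. For these patterns $\langle\boldsymbol{x}\boldsymbol{x}^{\mathsf T}\rangle = \frac{1}{3}\begin{bmatrix}2 & 1\\ 1 & 2\end{bmatrix}$, whose leading eigenvector is $(1,1)^{\mathsf T}/\sqrt{2}$.

```python
import math
import random

def oja(patterns, eta=0.01, steps=20000, seed=0):
    """Iterate Oja's rule  dw = eta * y * (x - y * w)  on randomly drawn patterns."""
    rng = random.Random(seed)
    n = len(patterns[0])
    w = [rng.uniform(-0.1, 0.1) for _ in range(n)]   # small random initial weights
    for _ in range(steps):
        x = rng.choice(patterns)                      # draw a pattern from P(x)
        y = sum(wi * xi for wi, xi in zip(w, x))      # linear output y = w . x
        w = [wi + eta * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w

# Assumed data set with non-zero mean.
patterns = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
w = oja(patterns)
norm = math.sqrt(sum(wi * wi for wi in w))
# |cos| of the angle between w and the direction (1, 1)/sqrt(2)
alignment = abs(w[0] + w[1]) / (math.sqrt(2.0) * norm)
```

Because single patterns are applied one at a time, the weight vector fluctuates slightly around the steady state. The norm should nevertheless stay close to unity, and the direction close to the leading eigenvector (up to an overall sign).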
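We know that $\boldsymbol{w}^\ast$ must be an eigenvector of $\mathbb{C}'$. So we choose a value of $\alpha$, put $\boldsymbol{w}^\ast = \boldsymbol{u}_\alpha$, and multiply with $\boldsymbol{u}_\beta$ from the left, to determine the $\beta$-th component of $\langle\delta\boldsymbol{\varepsilon}\rangle$:

$$
\boldsymbol{u}_\beta\cdot\langle\delta\boldsymbol{\varepsilon}\rangle \approx \eta\,[(\lambda_\beta - \lambda_\alpha) - 2\lambda_\alpha\delta_{\alpha\beta}]\,(\boldsymbol{u}_\beta\cdot\boldsymbol{\varepsilon}). \tag{9.12}
$$

Here we used that $\delta_{\alpha\beta}(\boldsymbol{u}_\alpha\cdot\boldsymbol{\varepsilon}) = \delta_{\alpha\beta}(\boldsymbol{u}_\beta\cdot\boldsymbol{\varepsilon})$. We conclude: if $\lambda_\beta > \lambda_\alpha$ the component of the displacement along $\boldsymbol{u}_\beta$ must grow, on average. So if $\lambda_\alpha$ is not the largest eigenvalue, the corresponding eigenvector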

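$\boldsymbol{u}_\alpha$ cannot be a steady-state direction. The eigenvector corresponding to $\lambda_{\max}$, on the other hand, represents a steady state. So only the choice $\lambda_\alpha = \lambda_{\max}$ leads to a steady state.

Third, we prove that these statements imply property 3. We need to demonstrate that

$$
\langle y^2\rangle = \boldsymbol{w}^\ast\cdot\mathbb{C}'\boldsymbol{w}^\ast \tag{9.13}
$$

is maximised when $\boldsymbol{w}^\ast$ is along the maximal eigenvalue direction. To this end we compute

$$
\boldsymbol{w}\cdot\mathbb{C}'\boldsymbol{w} = \sum_\alpha \lambda_\alpha\,(\boldsymbol{w}\cdot\boldsymbol{u}_\alpha)^2. \tag{9.14}
$$

We also need that $\sum_\alpha(\boldsymbol{w}\cdot\boldsymbol{u}_\alpha)^2 = 1$. This follows from $|\boldsymbol{w}| = 1$ (statement 1). Together, the two expressions show that no other direction $\boldsymbol{w}$ gives a larger value of $\langle y^2\rangle$. In other words: $\boldsymbol{w}^\ast$ maximises $\langle y^2\rangle$. This completes the proof.

For zero-mean inputs, Oja's rule finds the maximal principal direction of the input data by maximising $\langle y^2\rangle$ (note that $\langle y\rangle = 0$ for zero-mean input data). For inputs with non-zero means, maximising $\langle y^2\rangle$ still finds the maximal eigenvalue direction of $\mathbb{C}'$. But for inputs with non-zero means, this direction is different from the maximal principal direction (Section 6.3.1). Figure 9.3 illustrates this difference. The Figure shows three data points in a two-dimensional input plane. The elements of $\mathbb{C}' = \langle\boldsymbol{x}\boldsymbol{x}^{\mathsf T}\rangle$ are

$$
\mathbb{C}' = \frac{1}{3}\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}, \tag{9.15}
$$

with eigenvalues and eigenvectors

$$
\lambda_1 = 1,\quad \boldsymbol{u}_1 = \frac{1}{\sqrt 2}\begin{bmatrix} 1 \\ 1 \end{bmatrix} \quad\text{and}\quad \lambda_2 = \frac{1}{3},\quad \boldsymbol{u}_2 = \frac{1}{\sqrt 2}\begin{bmatrix} -1 \\ 1 \end{bmatrix}. \tag{9.16}
$$

Thus the maximal eigenvalue direction is $\boldsymbol{u}_1$. To compute the principal direction of the data we find the data-covariance matrix $\mathbb{C}$ with elements (6.9). The maximal-eigenvalue direction of $\mathbb{C}$ is $\boldsymbol{u}_2$. This is the maximal principal component of the data shown in Figure 9.3.

Figure 9.3: Maximal eigenvalue direction $\boldsymbol{u}_1$ of the matrix $\mathbb{C}'$ for input data with non-zero mean.

Oja's rule can be generalised in different ways to compute $M$ principal components of zero-mean input data using $M$ output neurons that compute $y_i = \boldsymbol{w}_i\cdot\boldsymbol{x}$ for $i = 1, \ldots, M$:

$$
\delta w_{ij} = \eta\, y_i\Big(x_j - \sum_{k=1}^{i} y_k w_{kj}\Big) \quad\text{Sanger's rule,} \tag{9.17}
$$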

or

$$
\delta w_{ij} = \eta\, y_i\Big(x_j - \sum_{k=1}^{M} y_k w_{kj}\Big) \quad\text{Oja's } M\text{-rule.} \tag{9.18}
$$

For $M = 1$ both rules reduce to Oja's rule.

9.2 Competitive learning

In Equations (9.17) and (9.18) several outputs can be active (non-zero) at the same time. An alternative approach is competitive learning where only one output is active at a time, as in Section 7.1. Such algorithms can categorise or cluster input data: similar inputs are classified to belong to the same category, and activate the same output unit. Figure 9.4 shows input patterns on the unit circle that cluster into two distinct clusters. The idea is to find weight vectors $\boldsymbol{w}_i$ that point into the direction of the clusters.

Figure 9.4: Detection of clusters by unsupervised learning.

To this end we take $M$ linear output units $i$ with weight vectors $\boldsymbol{w}_i$, $i = 1, \ldots, M$. We feed a pattern $\boldsymbol{x}$ from the distribution $P(\boldsymbol{x})$ into the units and define the winning unit $i_0$ as the one that has minimal angle between its weight and the pattern vector $\boldsymbol{x}$. This is illustrated in Figure 9.4, where $i_0 = 2$. Then only this weight vector is updated by adding a little bit of the difference $\boldsymbol{x} - \boldsymbol{w}_{i_0}$ between the pattern vector and the weight of the winning unit. The other weights remain unchanged:

$$
\delta\boldsymbol{w}_i = \begin{cases} \eta\,(\boldsymbol{x} - \boldsymbol{w}_i) & \text{for } i = i_0(\boldsymbol{x}, \boldsymbol{w}_1, \ldots, \boldsymbol{w}_M),\\ 0 & \text{otherwise.} \end{cases} \tag{9.19}
$$

In other words, only the winning unit is updated, $\boldsymbol{w}'_{i_0} = \boldsymbol{w}_{i_0} + \delta\boldsymbol{w}_{i_0}$. Equation (9.19) is called the competitive-learning rule.

The learning rule (9.19) has the following geometrical interpretation: the weight of the winning unit is drawn towards the pattern $\boldsymbol{x}$. Upon iterating (9.19), the weight vectors are drawn to clusters of inputs. So if the patterns are normalised as in Figure 9.4, the weights end up normalised on average, even though $|\boldsymbol{w}_{i_0}| = 1$ does not imply that $|\boldsymbol{w}_{i_0} + \delta\boldsymbol{w}_{i_0}| = 1$, in general. The algorithm for competitive learning is summarised in Algorithm 11.

When weight and input vectors are normalised, then the winning unit $i_0$ is the one with the largest scalar product $\boldsymbol{w}_i\cdot\boldsymbol{x}$. For linear output units $y_i = \boldsymbol{w}_i\cdot\boldsymbol{x}$ (Figure 9.2) this is simply the unit with the largest output. Equivalently, the winning unit is the one with the smallest distance $|\boldsymbol{w}_i - \boldsymbol{x}|$. Output units with weights $\boldsymbol{w}_i$ that are very far away from any pattern may never be updated (dead units). There are several strategies to avoid this problem [1]. One possibility is to initialise the weights to directions found in the inputs.
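The competitive-learning rule (9.19) can be tried out on synthetic data. The sketch below is an illustration under assumptions: the two cluster centres on the unit circle, their spread, the learning rate and the helper names `competitive_learning` and `make_clusters` are all hypothetical. The weights are initialised to directions found in the inputs, which is one way to avoid dead units.

```python
import math
import random

def competitive_learning(patterns, weights, eta=0.05, steps=2000, seed=0):
    """Iterate rule (9.19): only the winning unit moves towards the pattern."""
    rng = random.Random(seed)
    w = [list(wi) for wi in weights]
    for _ in range(steps):
        x = rng.choice(patterns)
        # winning unit i0: smallest distance |w_i - x|
        i0 = min(range(len(w)), key=lambda i: math.dist(w[i], x))
        w[i0] = [wj + eta * (xj - wj) for wj, xj in zip(w[i0], x)]
    return w

def make_clusters(centres=(0.3, 2.0), n=50, spread=0.1, seed=1):
    """Patterns on the unit circle, clustered around the given angles (radians)."""
    rng = random.Random(seed)
    return [(math.cos(a), math.sin(a))
            for c in centres
            for a in (c + rng.gauss(0.0, spread) for _ in range(n))]

patterns = make_clusters()
initial = [patterns[0], patterns[-1]]        # one input direction per cluster
weights = competitive_learning(patterns, initial)
angles = [math.atan2(w[1], w[0]) for w in weights]
```

After training, each weight vector points roughly towards one of the two cluster centres. Its norm is slightly smaller than unity, consistent with the remark that the weights are normalised only on average.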

The solution of the clustering problem is not uniquely defined. One possibility is to monitor progress by an energy function

$$
H = \frac{1}{2T}\sum_{ijt} M_{it}\,\big(x^{(t)}_j - w_{ij}\big)^2. \tag{9.20}
$$

Here $\boldsymbol{x}^{(t)}$ is the pattern fed in iteration number $t$, $T$ is the total number of iterations, and

$$
M_{it} = \begin{cases} 1 & \text{for } i = i_0(\boldsymbol{x}^{(t)}, \boldsymbol{w}_1, \ldots, \boldsymbol{w}_M),\\ 0 & \text{otherwise.} \end{cases} \tag{9.21}
$$

Note that $i_0$ is a function of the patterns $\boldsymbol{x}^{(t)}$ and of the weights $\boldsymbol{w}_1, \ldots, \boldsymbol{w}_M$. For given patterns, the indicator $M_{it}$ is a piecewise constant function of the weights. Gradient descent on the energy function (9.20) gives

$$
\langle\delta w_{ij}\rangle = -\eta\frac{\partial H}{\partial w_{ij}} = \frac{\eta}{T}\sum_t M_{it}\,\big(x^{(t)}_j - w_{ij}\big). \tag{9.22}
$$

Apart from the sum over patterns this is the same as the competitive learning rule (9.19). The angular brackets on the l.h.s. of this Equation indicate that the weight increments are summed over patterns. If we define

$$
y_i = \delta_{i i_0} = \begin{cases} 1 & \text{for } i = i_0,\\ 0 & \text{otherwise,} \end{cases} \tag{9.23}
$$

then the rule (9.19) can be written in the form of Oja's $M$-rule:

$$
\delta w_{ij} = \eta\, y_i\Big(x_j - \sum_{k=1}^{M} y_k w_{kj}\Big). \tag{9.24}
$$

This is again Hebb's rule with weight decay.

Algorithm 11 competitive learning
  initialise weights to vectors with random angles and norm |w_i| = 1;
  for t = 1, ..., T do
    draw a pattern x from P(x) and feed it to the network;
    find the winning unit i_0 (smallest angle between w_{i_0} and x);
    update only the winning unit δw_{i_0} = η(x − w_{i_0});
  end for
  end;

9.3 Kohonen's algorithm

Kohonen's algorithm can learn spatial maps. To this end one arranges the output units geometrically. The idea is that close inputs activate nearby outputs (Figure 9.5). Note that input and output space need not have the same dimension. The spatial map is learned with a competitive learning rule (9.24), similar to the previous Section. But an important difference is that the rule must be modified to

incorporate spatial information. In Kohonen's rule this is done by updating not only the winning unit, but also its neighbours in the output array:

$$
\delta\boldsymbol{w}_i = \eta\,\Lambda(i, i_0)\,(\boldsymbol{x} - \boldsymbol{w}_i). \tag{9.25}
$$

Here $\eta > 0$ is the learning rate, as before. The function $\Lambda(i, i_0)$ is called the neighbourhood function. A common choice is

$$
\Lambda(i, i_0) = \exp\Big(-\frac{|\boldsymbol{r}_i - \boldsymbol{r}_{i_0}|^2}{2\sigma^2}\Big). \tag{9.26}
$$

As a result, nearby output units respond to inputs that are close in input space. Kohonen's rule drags the winning weight vector $\boldsymbol{w}_{i_0}$ towards $\boldsymbol{x}$, just as the competitive learning rule (9.19), but it also drags the neighbouring weight vectors along.

Figure 9.5: Spatial map. If patterns $\boldsymbol{x}^{(1)}$ and $\boldsymbol{x}^{(2)}$ are close in input space, then the two patterns activate neighbouring outputs, $\boldsymbol{r}_1 \approx \boldsymbol{r}_2$.

Figure 9.6: Learning a shape with Kohonen's algorithm. (a) Input-pattern distribution. $P(\boldsymbol{x})$ is unity within a parallelogram with unit area, and zero outside. (b) to (d) Illustration of the dynamics in terms of an elastic net. (b) Initial condition. (c) Intermediate stage (note the kink). (d) In the steady state the elastic net resembles the shape defined by the input-pattern distribution.

Figure 9.6 illustrates a geometrical interpretation of Kohonen's rule. We can think of the weight vectors as pointing to the nodes of an elastic net that has the same layout as the output array. As one feeds patterns from the input distribution, the weights are updated, causing the nodes of the network to move. This changes the shape of the elastic net. In the steady state, this shape resembles the shape defined by the distribution of input patterns.

Kohonen's rule has two parameters: the learning rate $\eta$, and the width $\sigma$ of the neighbourhood function. Usually one adjusts these parameters as the learning proceeds. Typically one begins with large values for $\eta$ and $\sigma$ (ordering phase), and then reduces these parameters as the elastic net evolves (convergence phase): quickly at first and then in smaller steps, until the algorithm converges.
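Kohonen's rule (9.25) with the Gaussian neighbourhood function (9.26) can be sketched for the simplest case of a one-dimensional output array and one-dimensional inputs drawn uniformly from $[0, 1]$. The exponential schedules for $\eta$ and $\sigma$, all parameter values, and the helper names `kohonen_1d` and `kinks` are assumptions made for illustration, not a prescription from Ref. [1].

```python
import math
import random

def kohonen_1d(n_units=10, steps=5000, eta=(0.5, 0.01), sigma=(3.0, 0.3), seed=0):
    """Kohonen's rule on a chain of units, inputs uniform in [0, 1]."""
    rng = random.Random(seed)
    w = [rng.random() for _ in range(n_units)]       # random initial weights
    for t in range(steps):
        frac = t / steps
        # assumed schedule: both parameters decay exponentially from start to end value
        eta_t = eta[0] * (eta[1] / eta[0]) ** frac
        sigma_t = sigma[0] * (sigma[1] / sigma[0]) ** frac
        x = rng.random()                              # draw a pattern from P(x)
        i0 = min(range(n_units), key=lambda i: abs(w[i] - x))   # winning unit
        for i in range(n_units):
            # Gaussian neighbourhood function with output positions r_i = i
            lam = math.exp(-(i - i0) ** 2 / (2.0 * sigma_t ** 2))
            w[i] += eta_t * lam * (x - w[i])
    return w

def kinks(w):
    """Number of interior units where the slope of the map changes sign."""
    d = [b - a for a, b in zip(w, w[1:])]
    return sum(1 for a, b in zip(d, d[1:]) if a * b < 0)

w = kohonen_1d()
n_kinks = kinks(w)
```

With a sufficiently long ordering phase one typically finds `n_kinks == 0`, that is, the weights are monotonically ordered along the chain. This outcome is not guaranteed for every schedule or seed.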
Details are given by Hertz, Krogh and Palmer [1]. As for competitive learning one can monitor progress of the

learning with an energy function

$$
H = \frac{1}{2T}\sum_{it}\Lambda(i, i_0)\,\big(\boldsymbol{x}^{(t)} - \boldsymbol{w}_i\big)^2. \tag{9.27}
$$

Gradient descent yields

$$
\langle\delta\boldsymbol{w}_i\rangle = -\eta\frac{\partial H}{\partial\boldsymbol{w}_i} = \frac{\eta}{T}\sum_t \Lambda(i, i_0)\,\big(\boldsymbol{x}^{(t)} - \boldsymbol{w}_i\big). \tag{9.28}
$$

Figure 9.6 shows how Kohonen's network learns by unfolding the elastic net of weight vectors until the shape of the net resembles the form of the input distribution

$$
P(\boldsymbol{x}) = \begin{cases} 1 & \text{for } \boldsymbol{x} \text{ in the parallelogram in Figure 9.6(a),}\\ 0 & \text{otherwise.} \end{cases} \tag{9.29}
$$

In other words, Kohonen's algorithm learns by distributing the weight vectors to reflect the distribution of input patterns. For the distribution (9.29) of inputs one may hope that the weights end up distributed uniformly in the parallelogram. This is roughly how it works (Figure 9.6), but there are problems at the boundaries. Why this happens is quite clear: for the distribution (9.29) there are no patterns outside the parallelogram that can draw the elastic net very close to the boundary.

To analyse how the boundaries affect learning for Kohonen's rule, we consider the steady-state condition

$$
\langle\delta\boldsymbol{w}_i\rangle = \frac{\eta}{T}\sum_t \Lambda(i, i_0)\,\big(\boldsymbol{x}^{(t)} - \boldsymbol{w}^\ast_i\big) = 0. \tag{9.30}
$$

This is a condition for the steady state $\boldsymbol{w}^\ast_i$. The condition is more complicated than it looks at first sight, because $i_0$ depends on the weights and on the patterns, as mentioned above. The steady-state condition (9.30) is very difficult to analyse in general. One of the reasons is that global geometric information is difficult to learn. It is usually much easier to learn local structures. This is particularly true in the continuum limit where we can analyse local learning progress using Taylor expansions. For this reason we assume now that we have a very dense net of weights, so that we can approximate $i \to \boldsymbol{r}$, $i_0 \to \boldsymbol{r}_0$, $\boldsymbol{w}_i \to \boldsymbol{w}(\boldsymbol{r})$, $\Lambda(i, i_0) \to \Lambda\big(\boldsymbol{r} - \boldsymbol{r}_0(\boldsymbol{x})\big)$, and $\frac{1}{T}\sum_t \to \int\!\mathrm d\boldsymbol{x}\,P(\boldsymbol{x})$. In this continuum approximation, Equation (9.30) reads

$$
\int\!\mathrm d\boldsymbol{x}\,P(\boldsymbol{x})\,\Lambda\big(\boldsymbol{r} - \boldsymbol{r}_0(\boldsymbol{x})\big)\,\big[\boldsymbol{x} - \boldsymbol{w}^\ast(\boldsymbol{r})\big] = 0. \tag{9.31}
$$

This is an Equation for the spatial map $\boldsymbol{w}^\ast(\boldsymbol{r})$. Equation (9.31) is still quite difficult to analyse. So we specialise to one input and one output dimension, with spatial output coordinate $r$. This has the added advantage that we can easily draw the spatial map $w^\ast(r)$. It is the solution of

$$
\int\!\mathrm dx\,P(x)\,\Lambda\big(r - r_0(x)\big)\,\big[x - w^\ast(r)\big] = 0. \tag{9.32}
$$

The neighbourhood function is sharply peaked at $r = r_0(x)$. This means that the condition (9.32) yields the local properties of $w^\ast(r)$ around $r_0$, where $r_0$ is the coordinate of the winning unit, $x = w^\ast(r_0)$. Equation (9.32) involves an integral over patterns $x$. Using $x = w^\ast(r_0)$, this integral is expressed as an integral over $r_0$. Specifically we consider how $w^\ast(r_0)$ changes in the vicinity of a given point $r$, as $r_0(x)$ changes. To this end we expand $w^\ast$ around $r$:

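$$
w(r_0) = w(r) + w'(r)\,\delta r + \tfrac{1}{2} w''(r)\,\delta r^2 + \ldots \quad\text{with}\quad \delta r = r_0(x) - r. \tag{9.33}
$$

Figure 9.7: To find out how $w^\ast$ varies near $r$, we expand $w^\ast$ in $\delta r$ around $r$. This gives $w^\ast(r) + \frac{\mathrm dw^\ast}{\mathrm dr}\delta r + \frac{1}{2}\frac{\mathrm d^2 w^\ast}{\mathrm dr^2}\delta r^2 + \ldots$ (axes: output position, weights and patterns).

Here $w'$ denotes the derivative $\mathrm dw/\mathrm dr$ evaluated at $r$, and I have dropped the asterisk. Using this expansion and $x = w(r_0)$ we express $\mathrm dx$ in Equation (9.32) in terms of $\mathrm d\delta r$:

$$
\mathrm dx = \mathrm dw(r_0) = \mathrm dw(r + \delta r) \approx (w' + \delta r\, w'')\,\mathrm d\delta r. \tag{9.34a}
$$

To perform the integration we express the integrand in (9.32) in terms of $\delta r$:

$$
P(x) = P\big(w(r_0)\big) \approx P(w) + \delta r\, w' \frac{\mathrm d}{\mathrm dw}P(w), \tag{9.34b}
$$

and

$$
x - w(r) = w'\,\delta r + \tfrac{1}{2} w''(r)\,\delta r^2 + \ldots \tag{9.34c}
$$

Inserting these expressions into Equation (9.32) we find, to leading order,

$$
\begin{aligned}
0 &= \int\!\mathrm d\delta r\,(w' + \delta r\, w'')\Big[P + \delta r\, w' \frac{\mathrm d}{\mathrm dw}P\Big]\Lambda(\delta r)\,\big(\delta r\, w' + \tfrac{1}{2}\delta r^2 w''\big)\\
&= w'\Big[\tfrac{3}{2} w'' P(w) + (w')^2 \frac{\mathrm d}{\mathrm dw}P(w)\Big]\int\!\mathrm d\delta r\,\delta r^2\,\Lambda(\delta r).
\end{aligned}
\tag{9.35}
$$

The terms linear in $\delta r$ vanish upon integration because $\Lambda(\delta r)$ is even in $\delta r$. Since the last integral in Equation (9.35) is non-zero, we must either have

$$
w' = 0 \quad\text{or}\quad \tfrac{3}{2} w'' P(w) + (w')^2 \frac{\mathrm d}{\mathrm dw}P(w) = 0. \tag{9.36}
$$

The first solution can be excluded because it corresponds to a singular weight distribution [see Equation (9.38)] that does not contain any geometrical information about the input distribution $P(x)$. The second solution gives

$$
\frac{w''}{w'} = -\frac{2}{3}\,\frac{w' \frac{\mathrm d}{\mathrm dw}P(w)}{P(w)}. \tag{9.37}
$$

In other words, $\frac{\mathrm d}{\mathrm dr}\log|w'| = -\frac{2}{3}\frac{\mathrm d}{\mathrm dr}\log P(w)$. This means that $|w'| \propto P(w)^{-2/3}$. So the distribution $\varrho$ of output weights is

$$
\varrho(w) \equiv \Big|\frac{\mathrm dr}{\mathrm dw}\Big| = \frac{1}{|w'|} \propto P(w)^{2/3}. \tag{9.38}
$$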
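The step from Equation (9.37) to Equation (9.38) can be spelled out as follows. The sketch assumes $w' \neq 0$ and $P(w) > 0$ in the region considered, and uses the chain rule $\frac{\mathrm d}{\mathrm dr}P\big(w(r)\big) = w'\frac{\mathrm dP}{\mathrm dw}$.

```latex
\begin{align*}
\frac{w''}{w'} &= \frac{\mathrm{d}}{\mathrm{d}r}\log|w'| ,
\qquad
\frac{w'}{P(w)}\frac{\mathrm{d}P}{\mathrm{d}w}
 = \frac{\mathrm{d}}{\mathrm{d}r}\log P\big(w(r)\big), \\
\text{so (9.37) reads}\quad
\frac{\mathrm{d}}{\mathrm{d}r}\Big[\log|w'| + \tfrac{2}{3}\log P(w)\Big] &= 0
\quad\Longrightarrow\quad
|w'| = \mathrm{const.}\times P(w)^{-2/3}, \\
\varrho(w) = \Big|\frac{\mathrm{d}r}{\mathrm{d}w}\Big| &= \frac{1}{|w'|}
 \propto P(w)^{2/3}.
\end{align*}
```

The exponent $2/3$ instead of $1$ means that regions of high input density are under-represented by the weights, and regions of low density are over-represented.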

This tells us that the Kohonen net learns the input distribution in the following way: the distribution of output weights in the steady state reflects the distribution of input patterns. Equation (9.38) tells us that the two distributions are not equal (equality would have been a perfect outcome). The distribution of weights is instead proportional to $P(w)^{2/3}$. This is a consequence of the fact that the elastic net has difficulties reaching the corners and edges of the domain where the input distribution is non-zero.

Let us finally discuss the convergence of Kohonen's algorithm. The update rule (9.25) can be rewritten as

$$
(w_i - x) \leftarrow (1 - \eta\Lambda)\,(w_i - x). \tag{9.39}
$$

For small enough $\eta$ the factor $[1 - \eta\Lambda]$ is positive. Then it follows from Equation (9.39) that the order of weights in a monotonically increasing (decreasing) sequence does not change under an update. What happens at the boundary between two such regions, where a kink is formed? It turns out that kinks can only disappear in two ways. Either they move to one of the boundaries, or two kinks (a minimum and a maximum) annihilate each other if they collide. Both processes are slow. This means that convergence to the steady state can be very slow. Therefore one usually starts with a larger learning rate, to get rid of kinks. After this ordering phase, one continues with a smaller step size to get the details of the distribution right (convergence phase).

9.4 Summary

The unsupervised learning algorithms described above are based on Hebb's rule: certainly the Hebbian unsupervised learning rule (9.3) and Oja's rule (9.4). Also the competitive learning rule can be written in this form [Equation (9.24)]. Kohonen's algorithm is closely related to competitive learning, although the way in which Kohonen's rule learns spatial maps is better described by the notion of an elastic net that represents the values of the output weights, as well as their spatial location in the output array. Unsupervised learning rules can learn different features of the distribution of input patterns. They can discover which patterns occur most frequently, they can help to reduce the dimensionality of input space by finding the principal directions of the input distribution, detect clusters in the input data, compress data, and learn spatial input-output maps. The important point is that the algorithms learn without training, unlike the algorithms in Chapters 5 to 8.

Supervised-learning algorithms are now widely used for different applications. This is not really the case yet for unsupervised-learning algorithms, except that similar (sometimes equivalent) algorithms are used in Mathematical Statistics (k-means clustering) and Bioinformatics (structure [95]) where large data sets must be analysed, such as Human sequence data (HGDP) [96]. But the simple algorithms described in this Chapter provide a proof of concept: how machines can learn without feedback. In addition there is one significant application of unsupervised learning: where the network learns from incomplete feedback. This reinforcement learning is introduced in the next Chapter.

9.5 Exercises

Kohonen net. Write a computer program that implements Kohonen's algorithm with a two-dimensional output array, to learn the properties of a two-dimensional input distribution that is uniform inside an equilateral triangle with sides of unit length, and zero outside. Hint: to generate this distribution, sample at least 1000 points uniformly distributed over the smallest square that contains the triangle, and then accept only points that fall inside the triangle. Increase the number of weights and study

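how the two-dimensional density of weights near the boundary depends on the distance from the boundary.

9.6 Exam questions

9.6.1 Oja's rule

The aim of unsupervised learning is to construct a network that learns the properties of a distribution $P(\boldsymbol{x})$ of input patterns $\boldsymbol{x} = (x_1, \ldots, x_N)^{\mathsf T}$. Consider a network with one linear output that computes $y = \sum_{j=1}^{N} w_j x_j$. Under Oja's learning rule $\delta w_j = \eta\, y\,(x_j - y\, w_j)$ the weight vector $\boldsymbol{w}$ converges to a steady state $\boldsymbol{w}^\ast$ with components $w^\ast_j$. Show that the steady state has the following properties:

1. $|\boldsymbol{w}^\ast|^2 \equiv \sum_{j=1}^{N} (w^\ast_j)^2 = 1$.
2. $\boldsymbol{w}^\ast$ is the leading eigenvector of the matrix $\mathbb{C}'$ with elements $C'_{ij} = \langle x_i x_j\rangle$. Here $\langle\cdots\rangle$ denotes the average over $P(\boldsymbol{x})$.
3. $\boldsymbol{w}^\ast$ maximises $\langle y^2\rangle$.

All correct gives in total 3p.

9.6.2 Oja's rule

The output of Oja's rule for the input pattern $\boldsymbol{x}^{(\mu)}$ is

$$
y^{(\mu)} = \boldsymbol{w}\cdot\boldsymbol{x}^{(\mu)}, \tag{9.40}
$$

and the update rule based on this pattern is $\boldsymbol{w} \to \boldsymbol{w} + \delta\boldsymbol{w}^{(\mu)}$ with

$$
\delta\boldsymbol{w}^{(\mu)} = \eta\, y^{(\mu)}\big(\boldsymbol{x}^{(\mu)} - y^{(\mu)}\boldsymbol{w}\big) \tag{9.41}
$$

with targets $y^{(\mu)}$. Let $\langle\delta\boldsymbol{w}\rangle$ denote the update of $\boldsymbol{w}$ averaged over the input patterns.

(a) Show that $\langle\delta\boldsymbol{w}\rangle = 0$ implies that the weight vector in the steady state is normalised to unity. (1 p).

(b) Calculate the principal component of the patterns in Figure 9.8. (1 p).

9.6.3 Covariance matrices

A covariance matrix $\mathbb{C}$ has eigenvectors

$$
\boldsymbol{u} = \begin{bmatrix} 4 \\ -1 \end{bmatrix} \quad\text{and}\quad \boldsymbol{v} = \begin{bmatrix} 1 \\ 4 \end{bmatrix} \tag{9.42}
$$

and eigenvalues $\lambda_u = 4$ and $\lambda_v = 1$.

(a) Write down the matrix $\mathbb{C}$. (0.5 p).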

Figure 9.8: Calculate the principal component of this data set. Question 9.6.2.

(b) Illustrate a distribution with this covariance matrix, and indicate the principal component in your illustration. (0.5 p).

(c) $P(\boldsymbol{x})$ is a Gaussian distribution of two-dimensional patterns $\boldsymbol{x} = [x_1, x_2]^{\mathsf T}$. The distribution is determined by its mean $\langle\boldsymbol{x}\rangle$ and its covariance matrix

$$
\mathbb{C} = \begin{bmatrix} C_{11} & C_{21} \\ C_{21} & C_{22} \end{bmatrix}. \tag{9.43}
$$

Show how to draw an input pattern from this distribution using the following steps.

1. Draw a random pattern $\boldsymbol{z} = (z_1, z_2)^{\mathsf T}$, where $z_1$ and $z_2$ are two independent random numbers drawn from a Gaussian distribution with mean zero and unit variance.
2. Compute $\boldsymbol{x} = \langle\boldsymbol{x}\rangle + \mathbb{L}\boldsymbol{z}$, where

$$
\mathbb{L} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix}. \tag{9.44}
$$

Express $L_{11}$, $L_{21}$ and $L_{22}$ in terms of $C_{11}$, $C_{21}$ and $C_{22}$. (1p).

9.6.4 Kohonen net

The update rule for a Kohonen network reads: $\delta w_{ij} = \eta\,\Lambda(i, i_0)\,(x_j - w_{ij})$. Here $i_0$ labels the winning unit for pattern $\boldsymbol{x} = (x_1, \ldots, x_N)^{\mathsf T}$. The neighbourhood function $\Lambda(i, i_0) = \exp\big(-|\boldsymbol{r}_i - \boldsymbol{r}_{i_0}|^2/(2\sigma^2)\big)$ is Gaussian with width $\sigma$, and $\boldsymbol{r}_i$ denotes the position of the $i$-th output neuron in the output array.

(a) Explain the meaning of the parameter $\sigma$ in Kohonen's algorithm. Discuss the nature of the update rule in the limit of $\sigma \to 0$. (0.5p).

(b) Discuss and explain the implementation of Kohonen's algorithm in a computer program. In the discussion, refer to and explain the following terms: output array, neighbourhood function, ordering phase, convergence phase, kinks. Your answer must not be longer than one A4 page.

10 Radial basis-function networks

Problems that are not linearly separable can be solved by perceptrons with hidden layers, as we saw in Chapter 5. Figure 5.10, for example, shows a piecewise linear decision boundary that can be parameterised by hidden neurons.

Another approach is to map the coordinates of input space non-linearly so that the problem becomes linearly separable. It is usually easier to separate patterns in higher dimensions. To see this consider the XOR problem (Figure 10.1). It is not linearly separable in two-dimensional input space. The problem becomes separable when we embed the points in a three-dimensional space, for instance by assigning $x_3 = 0$ to the $t = +1$ patterns and $x_3 = 1$ to the $t = -1$ patterns. This example illustrates why it is often helpful to map input space to a higher-dimensional space because it is more likely that the resulting problem is linearly separable. But it may also be possible to achieve separability by a non-linear transformation of input space to a space of the same dimension. The example below shows how this works for the XOR problem. Either way we can apply a single perceptron to classify the data if they are linearly separable in the new coordinates.

Figure 10.1 shows how the XOR problem can be transformed into a linearly separable problem by the transformation

$$
u_1(\boldsymbol{x}) = (x_2 - x_1)^2 \quad\text{and}\quad u_2(\boldsymbol{x}) = x_2. \tag{10.1}
$$

The Figure shows the non-separable problem in input space (in the $x_1$-$x_2$ plane), and in the new coordinates $u_1$ and $u_2$. Since the problem is linearly separable in the $u_1$-$u_2$ plane we can solve it by a single McCulloch-Pitts neuron with weights $\boldsymbol{W}$ and threshold $\Theta$, parameterising the decision boundary as $\boldsymbol{W}\cdot\boldsymbol{u}(\boldsymbol{x}) = \Theta$. In fact, one does not need the threshold $\Theta$ because the function $\boldsymbol{u}$ can have a constant part. For instance, we could choose $u_1(\boldsymbol{x}) = 2(x_2 - x_1)^2 - 1$. In the following we therefore set $\Theta = 0$.

We expect that it should be easier to achieve linear separability the higher the embedding dimension is. This statement is quantified by Cover's theorem, discussed in Section 10.1.
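The effect of the transformation (10.1) can be checked directly. In the sketch below, the assignment of targets to the four XOR patterns, the weights `W` and the threshold `THETA` are assumptions made for illustration, and `u_shifted` is one possible way of absorbing the threshold into a constant part of $u_1$.

```python
def u(x):
    """Non-linear map of Equation (10.1): u1 = (x2 - x1)^2, u2 = x2."""
    x1, x2 = x
    return ((x2 - x1) ** 2, x2)

def u_shifted(x):
    """Variant with a constant part in u1, so that the threshold can be set to zero."""
    x1, x2 = x
    return (2 * (x2 - x1) ** 2 - 1, x2)

# XOR patterns; which class is labelled +1 is an assumption made here.
patterns = {(0, 0): -1, (1, 1): -1, (0, 1): +1, (1, 0): +1}

# Hypothetical weights and threshold of a single neuron in the u1-u2 plane.
W, THETA = (1.0, 0.0), 0.5

def classify(x, transform=u, theta=THETA):
    """Sign of W . u(x) - theta, as computed by a McCulloch-Pitts neuron."""
    u1, u2 = transform(x)
    return 1 if W[0] * u1 + W[1] * u2 - theta > 0 else -1

outputs = {x: classify(x) for x in patterns}
outputs_zero_threshold = {x: classify(x, u_shifted, 0.0) for x in patterns}
```

In the new coordinates the two classes are separated by the vertical line $u_1 = 1/2$, or equivalently by a line through the origin once the constant part is included in $u_1$.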
The question is of course how to find the non-linear mapping $\boldsymbol{u}(\boldsymbol{x})$. One possibility is to use radial basis functions. This is a way of parameterising the functions $u_j(\boldsymbol{x})$ in terms of weight vectors $\boldsymbol{w}_j$, and to determine suitable weight vectors iteratively. How this works is summarised in Section 10.2.

Figure 10.1: Left: input plane for the XOR function (Figure 5.8). The problem is not linearly separable. Right: in the $u_1$-$u_2$ plane the problem is linearly separable.

10.1 Separating capacity of a surface

In its simplest form, Cover's theorem [97] concerns a classification problem given by $p$ points with coordinate vectors $\boldsymbol{u}^{(\mu)}$ in $m$-dimensional space. It is assumed that the points are in general position (Figure 10.2). We assign random target values to these points,

$$t^{(\mu)} = \begin{cases} +1 & \text{with probability } \tfrac{1}{2}, \\ -1 & \text{with probability } \tfrac{1}{2}. \end{cases} \tag{10.2}$$

This random classification problem is homogeneously linearly separable if we can find an $m$-dimensional weight vector $\boldsymbol{W}$ with components $W_j$,

$$\boldsymbol{W} = \begin{bmatrix} W_1 \\ \vdots \\ W_m \end{bmatrix}, \tag{10.3}$$

so that $\boldsymbol{W} \cdot \boldsymbol{u} = 0$ is a valid decision boundary that goes through the origin:

$$\boldsymbol{W} \cdot \boldsymbol{u}^{(\mu)} > 0 \ \text{ if } \ t^{(\mu)} = +1 \quad\text{and}\quad \boldsymbol{W} \cdot \boldsymbol{u}^{(\mu)} < 0 \ \text{ if } \ t^{(\mu)} = -1. \tag{10.4}$$

So homogeneously linearly separable problems are classification problems that are linearly separable by a hyperplane that contains the origin (zero threshold, Chapter 5).

Figure 10.2: Left: 5 points in general position in the plane. Right: these points are not in general position because three points lie on a straight line.

Cover's theorem states the probability that the random classification problem of $p$ patterns in dimension $m$ is homogeneously linearly separable:

$$P(p,m) = \begin{cases} \left(\tfrac{1}{2}\right)^{p-1} \sum_{k=0}^{m-1} \binom{p-1}{k} & \text{for } p > m, \\ 1 & \text{otherwise}. \end{cases} \tag{10.5}$$

Here $\binom{l}{k} = \frac{l!}{(l-k)!\,k!}$ are the binomial coefficients. Equation (10.5) is proven by recursion, starting from a set of $p$ points in general position. Assume that the number $C(p,m)$ of homogeneously linearly separable classification problems given these points is known. After adding one more point, one can compute $C(p+1,m)$ in terms of $C(p,m)$. Recursion yields Equation (10.5).

To connect the result (10.5) to the discussion at the beginning of this Chapter, we take $\boldsymbol{u}^{(\mu)} = \boldsymbol{u}(\boldsymbol{x}^{(\mu)})$ where $\boldsymbol{x}^{(\mu)}$ are $p$ patterns in $N$-dimensional input space,

$$\boldsymbol{x}^{(\mu)} = \begin{bmatrix} x_1^{(\mu)} \\ \vdots \\ x_N^{(\mu)} \end{bmatrix} \quad \text{for } \mu = 1, \ldots, p, \tag{10.6}$$

and we assume that $\boldsymbol{u}$ is a set of $m$ polynomial functions of finite order. Then the probability that the problem of the $p$ points $\boldsymbol{x}^{(\mu)}$ in $N$-dimensional input space is separable by a polynomial decision boundary is given by Equation (10.5) [2, 97]. Note that the probability $P(p,m)$ is independent of the dimension $N$ of the input space. Figure 10.3 shows this probability for $p = \lambda m$ as a function of $\lambda$

for different values of $m$. Note that $P(2m, m) = \tfrac{1}{2}$. In the limit of large $m$, the function $P(\lambda m, m)$ approaches a step function. In this limit one can separate at most $2m$ patterns (separability threshold).

Figure 10.3: Probability (10.5) of separability for $p = \lambda m$ as a function of $\lambda$ for three different values of the embedding dimension $m$. Note the pronounced threshold near $\lambda = 2$, for large values of $m$.

Now consider a random sequence of patterns $\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots$ and targets $t_1, t_2, \ldots$ and ask [97]: what is the distribution of the largest integer $n$ so that the problem $\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_n$ is separable in embedding dimension $m$, but $\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_n, \boldsymbol{x}_{n+1}$ is not? $P(n,m)$ is the probability that $n$ patterns are linearly separable in embedding dimension $m$. We can write $P(n+1,m) = q(n+1|n)\,P(n,m)$ where $q(n+1|n)$ is the conditional probability that $n+1$ patterns are linearly separable if the $n$ patterns were. Then the probability that $n+1$ patterns are not separable (but $n$ patterns are) reads $(1-q)\,P(n,m) = P(n,m) - P(n+1,m)$. We can interpret the right-hand side of this Equation as a distribution $p_n$ of the random variable $n$, the maximal number of separable patterns in embedding dimension $m$:

$$p_n = P(n,m) - P(n+1,m) = \left(\tfrac{1}{2}\right)^{n} \binom{n-1}{m-1} \quad \text{for } n = 0, 1, 2, \ldots$$

It follows that the expected maximal number of separable patterns is

$$\langle n \rangle = \sum_{n=0}^{\infty} n\, p_n = 2m. \tag{10.7}$$

So the expected maximal number of separable patterns is twice the embedding dimension. This quantifies the notion that it is easier to separate patterns in higher embedding dimensions. Comparing with the discussion of linear separability in Chapter 5 we see that Cover's theorem determines the separation capacity of a single-layer perceptron [1].

10.2 Radial basis-function networks

The idea behind radial basis-function networks is to parameterise the functions $u_j(\boldsymbol{x})$ in terms of weight vectors $\boldsymbol{w}_j$, and to use an unsupervised-learning algorithm (Chapter 9) to find weights that separate the input data. A common choice [2] are radial basis functions of the form:

$$u_j(\boldsymbol{x}) = \exp\!\left(-\frac{1}{2 s_j^2}\, |\boldsymbol{x} - \boldsymbol{w}_j|^2\right). \tag{10.8}$$

These functions are not of the finite-order polynomial form that was assumed in Cover's theorem. This means that the theorem does not strictly apply. The parameters $s_j$ parameterise the widths of the


radial basis functions. In the simplest version of the algorithm they are set to unity. Other choices for radial basis functions are given by Haykin [2].

Figure 10.4: Radial basis-function network for $N = 2$ inputs and $m = 4$ radial basis functions (10.8). The output neuron has weights $\boldsymbol{W}$ and zero threshold; the linear output is $O = \sum_{j=1}^{m} W_j u_j$.

Figure 10.5: Comparison between radial basis-function network and perceptron. Left: the output of a radial basis function is localised in input space. Right: to achieve a localised output with sigmoid units one needs two hidden layers (Section 7.1). One layer determines the lightly shaded cross, the second layer localises the output to the darker square.

Figure 10.4 shows a radial basis-function network for $N = 2$ and $m = 4$. The four neurons in the hidden layer stand for the four radial basis functions (10.8) that map the inputs to four-dimensional $\boldsymbol{u}$-space. The network looks like a perceptron (Chapter 5). But here the hidden layers work in a different way. Perceptrons have hidden McCulloch-Pitts neurons that compute non-local outputs $\sigma(\boldsymbol{w}_j \cdot \boldsymbol{x} - \theta_j)$. The output of radial basis functions $u_j(\boldsymbol{x})$, by contrast, is localised in input space [Figure 10.5 (left)]. We saw in Section 7.1 how to make localised basis functions out of McCulloch-Pitts neurons with sigmoid activation functions $\sigma(b)$, but one needs two hidden layers to do that [Figure 10.5 (right)]. Radial basis functions produce localised outputs with a single hidden layer, and this makes it possible to divide up input space into localised regions, each corresponding to one radial basis function.

Imagine for a moment that we have as many radial basis functions as input patterns. In this case we can simply take $\boldsymbol{w}_\mu = \boldsymbol{x}^{(\mu)}$ for $\mu = 1, \ldots, p$. Then the classification problem can be written as

$$\mathbb{U}\,\boldsymbol{W} = \boldsymbol{t}, \tag{10.9}$$

where $\mathbb{U}$ is the symmetric $p \times p$ matrix with entries $U_{ij} = u_j(\boldsymbol{x}^{(i)})$. Here we used that the output unit is linear. If all patterns are pairwise different, $\boldsymbol{x}^{(\mu)} \neq \boldsymbol{x}^{(\nu)}$ for $\mu \neq \nu$, then the matrix $\mathbb{U}$ is invertible [2], and the solution of the classification problem reads $\boldsymbol{W} = \mathbb{U}^{-1} \boldsymbol{t}$.

In practice one can get away with fewer radial basis functions by choosing their weights to point in the directions of clusters of input data. To this end one can use unsupervised competitive learning (Algorithm 2), where the winning unit is defined to be the one with largest $u_j$. How are the widths $s_j$ determined? The width $s_j$ of radial basis function $u_j(\boldsymbol{x})$ is taken to be equal to the minimum distance
