Introduction to Neural Networks. David Stutz


RWTH Aachen University
Chair of Computer Science 6
Prof. Dr.-Ing. Hermann Ney
Selected Topics in Human Language Technology and Pattern Recognition, WS 13/14

Introduction to Neural Networks
David Stutz
Matriculation Number ######
February 10, 2014
Adviser: Pavel Golik

Contents

1 Abstract
2 Motivation
  2.1 Historical Background and Bibliographical Notes
3 Neural Networks
  3.1 The Perceptron
  3.2 Activation Functions
  3.3 Layered Networks
  3.4 Feed-Forward Networks
  3.5 Multilayer Perceptrons
  3.6 Expressive Power
4 Network Training
  4.1 Error Measures
  4.2 Training Approaches
  4.3 Parameter Optimization
    4.3.1 Linear Units
    4.3.2 Weight Initialization
    4.3.3 Gradient Descent
    4.3.4 Momentum
    4.3.5 Enhanced Gradient Descent
    4.3.6 Newton's Method
  4.4 Error Backpropagation
    4.4.1 Efficiency
  4.5 The Hessian
    4.5.1 Efficiency
  4.6 Regularization
    4.6.1 L2-Regularization
    4.6.2 Early Stopping
5 Pattern Classification
  5.1 Statistical Background
  5.2 Bayes Decision Rule
  5.3 Maximum Likelihood Estimation
    5.3.1 Derivation of Cross-Entropy
  5.4 Application: Recognizing Handwritten Digits
    5.4.1 Matrix Notation
    5.4.2 Implementation and Results
6 Conclusion
Appendices

A MatLab Implementation
  A.1 Logistic Sigmoid and its Derivative
  A.2 Training Procedure
  A.3 Validation Procedure
Literature

List of Figures

1 Single processing unit and its components
2 Network graph of a perceptron with D input units and C output units
3 The logistic sigmoid as activation function
4 Network graph of a two-layer perceptron with D input units, C output units and m hidden units
5 Single-layer perceptron for modeling boolean AND
6 The learning rate and its influence on the rate of convergence
7 Backpropagation of errors through the network
8 Exact evaluation of the Hessian
9 Early stopping based on a validation set
10 Error on the training set during training
11 Results of training a two-layer perceptron using the MNIST dataset

1 Abstract

In this seminar paper we study artificial neural networks, their training and their application to pattern recognition. We start by giving a general definition of artificial neural networks and introduce both the single-layer and the multilayer perceptron. After considering several activation functions we discuss network topology and the expressive power of multilayer perceptrons. The second part introduces supervised network training. To this end, we discuss gradient descent and Newton's method for parameter optimization. We derive the error backpropagation algorithm for evaluating the gradient of the error function and extend this approach to evaluate its Hessian. In addition, the concept of regularization is introduced. The third part introduces pattern classification. Using maximum likelihood estimation we derive the cross-entropy error function. As application, we train a two-layer perceptron to recognize handwritten digits based on the MNIST dataset.

2 Motivation

Theoretically, the human brain has a very low rate of operations per second when compared to a state-of-the-art computer [8, p. 28]. Nevertheless, all computers are still outperformed by the human brain with respect to one important factor: learning. The human brain is able to learn how to perform certain tasks based on experience and prior knowledge.

How can we teach computers to learn? To clarify the term learning with respect to computers, we assume a set of training data T = {(x_n, t_n) : 1 ≤ n ≤ N} for some N ∈ ℕ and an arbitrary target function g of which we know the target values t_n := g(x_n). Our goal is to teach the computer a reasonably good approximation of g(x) for x in the domain of g. Many classification^1 and regression^2 problems can be formulated this way. The target function may even be unknown. Considering noise within the given training data, such that we only have values t_n ≈ g(x_n), we require an additional property which we call the ability to generalize. Solving the problem by interpolation may reproduce the exact values g(x_n) but may be a very poor approximation of g(x) in general. This phenomenon is called over-fitting of the underlying training data.

2.1 Historical Background and Bibliographical Notes

In 1943 McCulloch and Pitts introduced the first mathematical models of networks of neurons, which we call artificial neural networks. But this first step did not include any results on network training [6]. The first work on how to train similar networks was Rosenblatt's perceptron in 1958, which we discuss in section 3.1 [19]. Only ten years later, Minsky and Papert showed that Rosenblatt's perceptron has many limitations, as we see in section 3.6, and can only model linearly separable^3 classification problems [6]. In [20] Rumelhart, Hinton and Williams introduced the idea of error backpropagation for pattern recognition. In the late 1980s it was shown that problems which are not linearly separable can be solved by multilayer perceptrons [10]. Weight decay was introduced in [9]. A diagonal approximation of the Hessian was introduced in [12]. A pruning method based on a diagonal approximation of the Hessian, called Optimal Brain Damage, is discussed in [16]. The exact evaluation of the Hessian is discussed in [2].

^1 The classification problem can be stated as follows: given a D-dimensional input vector x, assign it to one of C discrete classes (see section 5).
^2 The regression problem can be described as follows: given a D-dimensional input vector x, predict the value of C continuous target values.
^3 Considering a classification problem as introduced in section 5, we say a set of data points is not linearly separable if the classes cannot be separated by a hyperplane [5, p. 179].

3 Neural Networks

Figure 1: A single processing unit and its components. The activation function is denoted by f and applied to the actual input z of the unit to form its output y = f(z). x_1, ..., x_D represent input from other units within the network; w_0 is called bias and represents an external input to the unit. All inputs are mapped onto the actual input z using the propagation rule.

An artificial neural network, also referred to as neural network, is a set of interconnected processing units. A processing unit receives input from external sources and connected units and computes an output which may be propagated to other units. These units represent the neurons of the human brain, which are interconnected by synapses [8]. A processing unit consists of a propagation rule and an activation function. The propagation rule determines the actual input of the unit by mapping the output of all direct predecessors and additional external inputs onto a single input value. The activation function is then applied to the actual input and determines the output of the unit. The output of the processing unit is also called activation. This is illustrated in figure 1, showing a single processing unit where f denotes the activation function, z the actual input and y the output of the unit.

We distinguish input units and output units. Input units accept the input of the whole network and output units form the output of the network. Each input unit accepts a single input value x and we set its output to y := x. Altogether, a neural network models a function y(x) whose dimensions are determined by the number of input and output units.

To visualize neural networks we use directed graphs, which we call network graphs. As illustrated in figure 1, single processing units are represented by nodes and are interconnected by directed edges. The nodes of the graph are labeled according to their corresponding output.

3.1 The Perceptron

As an example we discuss Rosenblatt's perceptron, which was introduced in 1958 [6]. The perceptron consists of D input units and C output units. Every input unit is connected to every output unit, as shown in figure 2.

Figure 2: The perceptron consists of D input units and C output units. All units are labeled according to their output: y_i = f(z_i) in the case of output units; x_i in the case of input units. The input values x_i are propagated to each output unit using the weighted sum propagation rule. The additional input value x_0 := 1 is used to include the biases as weights. As suggested in section 3.3, the units are arranged in layers.

For 1 ≤ i ≤ C, the i-th output unit computes the output y_i = f(z_i) with

z_i = Σ_{j=1}^{D} w_{ij} x_j + w_{i0}    (1)

where x_j is the input of the j-th input unit. In this case the propagation rule is the weighted sum over all inputs with weights w_{ij} and biases w_{i0}. The bias can be included as a weight by considering an additional input x_0 := 1, such that the actual input z_i can be written as

z_i = Σ_{j=0}^{D} w_{ij} x_j.    (2)

Throughout this paper we use the weighted sum as propagation rule for all units except the input units, while activation functions may vary according to the discussion in the next section.

3.2 Activation Functions

The activation function determines the output of the unit. Often a threshold function is used, as for example the Heaviside function

h(z) = 1 if z ≥ 0, 0 if z < 0    (3)

[8]. In general, however, we want the activation function to have certain properties. Network training using error backpropagation, as discussed in section 4.4, requires the activation function to be differentiable. In addition, we may want to use nonlinear activation functions to increase the computational power of the network, as discussed in section 3.6 [6]. A sigmoid function is a commonly used s-shaped function. The logistic sigmoid is given by

σ(z) = 1 / (1 + exp(−z))    (4)

and is shown in figure 3.

Figure 3: The logistic sigmoid is a commonly used s-shaped activation function. It is both smooth and monotonic and allows a probabilistic interpretation, as its range is limited to [0, 1].

The logistic sigmoid can be considered a smooth version of the Heaviside function. The softmax function is given by

σ(z, i) = exp(z_i) / Σ_{k=1}^{C} exp(z_k)    (5)

where C is the dimension of the vector z. Both functions are smooth and monotonic, which means there are no additional local extrema. This is desirable for network training because multiple extrema within the activation function could cause additional extrema within the error surface [6]. In addition, they allow a probabilistic interpretation, which we use in section 5. The derivatives of both the logistic sigmoid and the softmax function take convenient forms for implementation:

∂σ(z)/∂z = σ(z) (1 − σ(z)),    (6)
∂σ(z, i)/∂z_j = σ(z, i) (δ(i, j) − σ(z, j))    (7)

where δ denotes the Kronecker delta^4.

3.3 Layered Networks

Arranging the units in layers results in a layered network. Figure 2 already introduced the perceptron as a layered network. In this case we have a layer of input units and a layer of output units. We may add additional units organized in so-called hidden layers. These units are called hidden units as they are not visible from the outside [8, p. 43]. When counting the number of layers we skip the input layer, because no real processing takes place there and it has no variable weights [5, p. 229]. Thus, the perceptron can be considered a single-layer perceptron without any hidden layers.

3.4 Feed-Forward Networks

Since we can model every neural network by means of a network graph, we obtain new neural networks by considering more complex network topologies [5]. We distinguish two network topologies:

Feed-forward: In a feed-forward topology we prohibit closed cycles within the network graph. This means that a unit in layer p may only propagate its output to a unit in layer l if l > p. Thus, the modeled function is deterministic.

Recurrent: Recurrent networks allow closed cycles. A connection establishing a closed cycle within a network graph is called a feedback connection.

In this paper we consider feed-forward networks only. As demonstrated in figure 2, the single-layer perceptron implements a feed-forward topology.

^4 The Kronecker delta δ(i, j) equals 1 if i = j and 0 otherwise.
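As an illustration, the following minimal MatLab sketch (not taken from the appendix; dimensions and values are chosen arbitrarily) propagates an input vector through a single layer using the weighted-sum propagation rule of equation (2) and the softmax activation of equation (5).

% Minimal sketch: forward pass of a single layer with weighted-sum
% propagation (equation (2)) and softmax activation (equation (5)).
D = 4; C = 3;                 % illustrative input and output dimensions
W = rand(C, D + 1);           % weights, first column holds the biases w_{i0}
x = rand(D, 1);               % some input vector
z = W * [1; x];               % actual inputs z_i with x_0 := 1, equation (2)
y = exp(z) ./ sum(exp(z));    % softmax outputs, equation (5); they sum to one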

3.5 Multilayer Perceptrons

Figure 4: Network graph of a two-layer perceptron with D input units, C output units and m := m^(1) hidden units. Again, we introduce units x_0 := 1 and y_0^(1) := 1 to include the biases as weights. To distinguish units and weights with the same indices in different layers, the number of the layer is written as superscript.

The multilayer perceptron has L ≥ 1 additional hidden layers. The l-th hidden layer consists of m^(l) hidden units. The output value y_i^(l) of the i-th hidden unit within this layer is

y_i^(l) = f( Σ_{k=1}^{m^(l−1)} w_{ik}^(l) y_k^(l−1) + w_{i0}^(l) ) = f( Σ_{k=0}^{m^(l−1)} w_{ik}^(l) y_k^(l−1) )    (8)

where the bias is again included as a weight using y_0^(l−1) := 1. Altogether, we count (L + 1) layers including the output layer, where we set D := m^(0) and C := m^(L+1). Then w_{ik}^(l) denotes the weighted connection between the k-th unit of layer (l − 1) and the i-th unit of layer l. Figure 4 shows a two-layer perceptron with m := m^(1) units in its hidden layer. As previously mentioned, we add additional units x_0 := y_0^(1) := 1 to represent the biases.

In this paper we discuss the feed-forward multilayer perceptron as model for neural networks. The modeled function takes the form

y(·, w) : R^D → R^C, x ↦ y(x, w) = (y_1(x, w), ..., y_C(x, w))^T    (9)

where w is the vector comprising all weights and y_i(x, w) := y_i^(L+1)(x, w) is the output of the i-th output unit.
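A minimal MatLab sketch of the forward pass in equations (8) and (9) for a two-layer perceptron could look as follows; the layer sizes and the use of the logistic sigmoid in both layers are illustrative assumptions.

% Minimal sketch: forward pass of a two-layer perceptron, equations (8), (9),
% with logistic sigmoid activations and biases folded into the weight matrices.
D = 784; m = 10; C = 10;                  % assumed layer sizes
W1 = rand(m, D + 1);                      % hidden layer weights incl. bias column
W2 = rand(C, m + 1);                      % output layer weights incl. bias column
x  = rand(D, 1);                          % input vector
y1 = 1 ./ (1 + exp(-(W1 * [1; x])));      % hidden activations, equation (8)
y2 = 1 ./ (1 + exp(-(W2 * [1; y1])));     % network output y(x, w), equation (9)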

3.6 Expressive Power

One of Minsky and Papert's results in 1969 showed that the single-layer perceptron has severe limitations, one of which is called the Exclusive-OR (XOR) problem [6]. As introduction we consider the target function g, which we want to model, given by

g : {0, 1}^2 → {0, 1}, x := (x_1, x_2) ↦ 1 if x_1 = x_2 = 1, 0 otherwise.    (10)

Figure 5: Figure 5(a) shows both boolean AND and boolean XOR represented as classification problems. As we can see, boolean XOR is not linearly separable. A simple single-layer perceptron capable of separating boolean AND is shown in figure 5(b).

Apparently, g describes boolean AND. With D = 2 input units and C = 1 output unit using the sign activation function, g can be modeled by a single-layer perceptron:

y = sgn(z) = sgn(w_1 x_1 + w_2 x_2 + w_0).    (11)

Figure 5(b) illustrates the corresponding network graph. By setting z = 0 we can interpret equation (11) as a straight line in two-dimensional space:

x_2 = −(w_1 / w_2) x_1 − w_0 / w_2.    (12)

A possible solution for modeling boolean AND is shown in figure 5(a). But, as we can see, boolean XOR cannot be modeled by this single-layer perceptron. We say boolean XOR is not linearly separable^5 [8]. This limitation can be overcome by adding at least one additional hidden layer. Theoretically, a two-layer perceptron is capable of modeling any continuous function when using appropriate nonlinear activation functions. This is a direct consequence of Kolmogorov's theorem. It states that for every continuous function f : [0, 1]^D → R there exist continuous functions ψ_j : [0, 1] → R such that f can be written in the form

f(x_1, ..., x_D) = Σ_{j=0}^{2D} g_f( Σ_{i=1}^{D} w_i ψ_j(x_i) ).    (13)

But according to [3] the function g_f depends on f, and the theorem is not applicable if the functions ψ_j are required to be smooth. In most applications we do not know the target function, and the theorem says nothing about how to find the functions ψ_j; it simply states their existence. Thus, the theorem is not of much practical use for network design [6]. Nevertheless, the theorem is important as it states that, theoretically, we are able to find an optimal solution using a single hidden layer. Otherwise, we could end up with a problem for which we cannot find a solution using a multilayer perceptron [8].

^5 When considered as a classification problem (see section 5) we say the problem is not linearly separable if the classes cannot be separated by a hyperplane [5, p. 179].
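The following small MatLab sketch uses one possible, hand-picked set of weights (a hypothetical choice, not taken from the paper) to reproduce boolean AND with the single-layer perceptron of equation (11); no such weights exist for boolean XOR, which is exactly the limitation discussed above.

% Minimal sketch: a single-layer perceptron with threshold activation,
% equation (11), reproducing boolean AND for one hypothetical weight choice.
w0 = -1.5; w1 = 1; w2 = 1;                        % separates AND, cf. figure 5(a)
X  = [0 0; 0 1; 1 0; 1 1];                        % all four input patterns
y  = (w1 * X(:, 1) + w2 * X(:, 2) + w0) >= 0;     % thresholded output per pattern
disp([X y]);                                      % last column equals x1 AND x2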

4 Network Training

Network training describes the problem of determining the parameters to model the target function. To this end, we use a set of training data representing the initial knowledge of the target function. Often the target function is unknown and we only know its values at specific data points. Depending on the training set we consider three different learning paradigms [8]:

Unsupervised learning: The training set provides only input values. The neural network has to find similarities by itself.

Reinforcement learning: After processing an input value of the training set, the neural network receives feedback on whether the result is considered good or bad.

Supervised learning: The training set provides both input values and desired target values (a labeled training set).

Although the human brain mostly learns by reinforcement learning (or even unsupervised learning), we discuss supervised learning only. Let T := {(x_n, t_n) : 1 ≤ n ≤ N} be a training set where the x_n are input values and the t_n the corresponding target values. As for any approximation problem, we can evaluate the performance of the neural network using some distance measure between the approximation and the target function.

4.1 Error Measures

We discuss mainly two error measures. The sum-of-squared error function takes the form

E(w) = Σ_{n=1}^{N} E_n(w) = (1/2) Σ_{n=1}^{N} Σ_{i=1}^{C} (y_i(x_n, w) − t_{ni})^2    (14)

where t_{ni} denotes the i-th entry of the n-th target value. The cross-entropy error function is given by

E(w) = Σ_{n=1}^{N} E_n(w) = − Σ_{n=1}^{N} Σ_{i=1}^{C} t_{ni} log(y_i(x_n, w)).    (15)

For further discussion we note that, using y_i(x_n, w) = f(z_i^(L+1)), the derivatives of both error functions with respect to the actual input z_i^(L+1) of the i-th output unit take the form

∂E_n/∂z_i^(L+1) = (∂E_n/∂y_i^(L+1)) (∂y_i^(L+1)/∂z_i^(L+1)) = y_i(x_n, w) − t_{ni}.    (16)

Here we use the softmax activation function for the cross-entropy error function and the identity as activation function for the sum-of-squared error function [3].
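For a single pattern with a 1-of-C encoded target, the two error measures can be computed as in the following MatLab sketch (illustrative values only).

% Minimal sketch: the two error measures for one pattern with C = 3 classes.
y = [0.7; 0.2; 0.1];                         % network output for one pattern
t = [1; 0; 0];                               % 1-of-C encoded target value
sumOfSquaredError = 0.5 * sum((y - t).^2);   % one summand E_n of equation (14)
crossEntropyError = -sum(t .* log(y));       % one summand E_n of equation (15)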

4.2 Training Approaches

Given an error function E(w) = Σ_{n=1}^{N} E_n(w), we distinguish mainly two training approaches [6]:

Stochastic training: Randomly choose an input value x_n and propagate it through the network. Update the weights based on the error E_n(w).

Batch training: Go through all input values x_n and compute y(x_n, w). Update the weights based on the overall error E(w) = Σ_{n=1}^{N} E_n(w).

Instead of choosing an input value x_n at random we could also select the input values in sequence, leading to sequential training. According to [5], stochastic training is considered to be faster, especially on redundant training sets.

4.3 Parameter Optimization

The weight update in both approaches is determined by a parameter optimization algorithm which tries to minimize the error. The error E(w) can be considered as an error surface above the weight space. In general, the error surface is a nonlinear function of the weights and may have many local minima and maxima. We want to find the global minimum. The necessary criterion for a minimum is

∇E(w) = 0    (17)

where ∇E denotes the gradient of the error function E evaluated at the point w. As an analytical solution is usually not possible, we use an iterative approach for minimizing the error. In each iteration step we choose a weight update Δw[t] and set

w[t + 1] = w[t] + Δw[t]    (18)

where w[t] denotes the weight vector w in the t-th iteration and w[0] is an appropriate starting vector. Optimization algorithms differ in the choice of the weight update Δw[t] [3]. Before discussing two common optimization algorithms, we consider the case of linear units and the problem of choosing a good starting vector w[0].

4.3.1 Linear Units

In the case of a single-layer perceptron with linear output units and the sum-of-squared error we obtain a linear problem. Thus, we can determine the weights exactly using singular value decomposition [3]. When considering a multilayer perceptron with linear output units and nonlinear hidden units, we can at least split up the minimization problem. The weights in the output layer can be determined exactly using singular value decomposition, whereas the weights in the hidden layers can be optimized using an iterative approach. Note that every time the weights of the nonlinear layers are updated, the exact solution for the weights in the output layer has to be recomputed [3].
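Determining a linear output layer exactly is an ordinary least-squares problem. A minimal sketch with made-up data, using MatLab's pseudo-inverse (which is itself computed via a singular value decomposition) rather than an explicit SVD, might look as follows.

% Minimal sketch: exact linear output weights by least squares.
Y = rand(10, 100);                     % hidden-layer outputs for 100 patterns
T = rand(3, 100);                      % corresponding target values
W = T * pinv(Y);                       % weights minimizing the squared error
E = 0.5 * norm(W * Y - T, 'fro')^2;    % resulting sum-of-squared error (14)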

4.3.2 Weight Initialization

Following [6], we want a starting vector w such that we have fast and uniform learning, that is, all components learn more or less equally fast. Assuming sigmoidal activation functions, the actual input of a unit should neither be too large, causing σ'(z) to be very small, nor too small, such that σ(z) is approximately linear. Both cases result in slow learning. For the logistic sigmoid an acceptable value for the unit input is of unity order [3]. Thus, we choose all weights randomly from the same distribution such that

−1/√(m^(l−1)) < w_{ij}^(l) < 1/√(m^(l−1))    (19)

for the weights in layer l. Assuming a normal distribution with zero mean and unity variance for the inputs of each unit, we then get on average the desired actual input of unity order [6].

4.3.3 Gradient Descent

Gradient descent is a basic first-order optimization algorithm. This means that in every iteration step we use information about the gradient ∇E at the current point. In iteration step [t + 1] the weight update Δw[t] is determined by taking a step in the direction of the negative gradient at position w[t], such that

Δw[t] = −γ ∂E/∂w[t]    (20)

where γ is called the learning rate. Gradient descent can be used both for batch training and stochastic training [5]. In the case of stochastic training we choose

Δw[t] = −γ ∂E_n/∂w[t]    (21)

as weight update.

4.3.4 Momentum

The choice of the learning rate can be crucial. We want to choose the learning rate such that learning is fast while avoiding the oscillation that occurs when the learning rate is too large [3]. Both cases are illustrated in figure 6, where we seek to minimize a convex quadratic function. One way to reduce oscillation while using a high learning rate, as described in [3], is to add a momentum term. The change in weight then depends on the previous change in weight, such that

Δw[t] = −γ ∂E/∂w[t] + λ Δw[t − 1]    (22)

where λ is called the momentum parameter.
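A minimal MatLab sketch of gradient descent with a momentum term, equation (22), applied to a toy quadratic error; the error function, learning rate and momentum parameter are illustrative assumptions.

% Minimal sketch: gradient descent with momentum on a toy error E(w) = w'*w.
gradE = @(w) 2 * w;                  % gradient of the toy error function
w = [1; -2]; dw = zeros(2, 1);       % starting vector w[0] and initial update
gamma = 0.1; lambda = 0.5;           % learning rate and momentum parameter
for t = 1:100
    dw = -gamma * gradE(w) + lambda * dw;   % weight update, equation (22)
    w = w + dw;                             % apply the step, equation (18)
end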

Figure 6: (a) High learning rate; (b) low learning rate. When using a large learning rate, the weight updates in each iteration tend to overstep the minimum. This causes oscillation around the minimum, as illustrated in figure 6(a). Choosing the learning rate too low results in slow convergence to the minimum, as shown in figure 6(b). Both scenarios are visualized using a quadratic function we seek to minimize.

4.3.5 Enhanced Gradient Descent

There is still a serious problem with gradient descent: how to choose the learning rate (and the momentum parameter). Currently, we are required to choose the parameters manually. Because the optimal values depend on the given problem, an automatic choice is not possible in general. We discuss a simple approach for choosing the learning rate automatically, considering no momentum term. Let γ[t] denote the learning rate in the t-th iteration. Then we have a simple criterion for whether the learning rate was chosen too large: if the error has increased after the weight update, we may have overstepped the minimum. In that case we can undo the weight update and choose a smaller learning rate until the error decreases [3]. Thus, we have

γ[t + 1] = ρ γ[t] if E(w[t + 1]) < E(w[t]),  µ γ[t] if E(w[t + 1]) > E(w[t])    (23)

where the update parameters can be chosen as suggested in [3]: set ρ := 1.1 to avoid frequent overstepping of the minimum, and set µ := 0.5 to ensure fast recovery to taking a step which decreases the error.
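A possible sketch of this adaptive learning rate, equation (23), again on a toy error function with illustrative parameter values:

% Minimal sketch: adaptive learning rate, equation (23), on a toy error.
E = @(w) w' * w; gradE = @(w) 2 * w;
w = [1; -2]; gamma = 1.0; rho = 1.1; mu = 0.5;
for t = 1:100
    wNew = w - gamma * gradE(w);        % tentative gradient descent step
    if E(wNew) < E(w)
        w = wNew; gamma = rho * gamma;  % error decreased: accept, grow the rate
    else
        gamma = mu * gamma;             % error increased: undo step, shrink rate
    end
end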

4.3.6 Newton's Method

Although gradient descent has been improved in many ways, it is considered a poor optimization algorithm, as pointed out in [3] and [5]. But optimization is a widely studied area and, thus, there are more efficient algorithms. Conjugate gradients makes implicit use of second-order information. Newton's method and so-called quasi-Newton methods explicitly use the Hessian of the error function in each iteration step. Some of these algorithms are described in detail in [7]. We discuss the general idea of Newton's method.

Newton's method is an iterative optimization algorithm based on a quadratic approximation of E using the Taylor series. Let H(w[t]) denote the Hessian of the error function evaluated at w[t]. Then, by taking the first three terms of the Taylor series^6, we obtain a quadratic approximation of E around the point w[t] [7]:

Ẽ(w[t] + Δw[t]) = E(w[t]) + ∇E(w[t])^T Δw[t] + (1/2) (Δw[t])^T H(w[t]) Δw[t].    (25)

We want to choose the weight update Δw[t] such that the quadratic approximation Ẽ is minimized. The necessary criterion gives us

∇Ẽ(w[t] + Δw[t]) = ∇E(w[t]) + H(w[t]) Δw[t] = 0.    (26)

The solution is given by

Δw[t] = −γ H(w[t])^{−1} ∇E(w[t])    (27)

where H(w[t])^{−1} is the inverse of the Hessian at the point w[t]. Because we use a quadratic approximation, equation (27) needs to be applied iteratively. When choosing w[0] sufficiently close to the global minimum of E, Newton's method converges quadratically [7]. Nevertheless, Newton's method has several drawbacks. As we see later, the evaluation of the Hessian matrix and its inversion are rather expensive^7. In addition, the Newton step in equation (27) may be a step in the direction of a local maximum or saddle point. This may happen if the Hessian matrix is not positive definite [3].

4.4 Error Backpropagation

Error backpropagation is an efficient algorithm to evaluate the gradient ∇E_n for multilayer perceptrons with differentiable activation functions. Following [5], we consider the i-th output unit first. The derivative of an arbitrary error function E_n with respect to the weight w_{ij}^(L+1) is given by the chain rule as

∂E_n/∂w_{ij}^(L+1) = (∂E_n/∂z_i^(L+1)) (∂z_i^(L+1)/∂w_{ij}^(L+1))    (28)

where z_i^(L+1) denotes the actual input of the i-th unit within the output layer. Using the chain rule again, the first factor in equation (28) can be written as

δ_i^(L+1) := ∂E_n/∂z_i^(L+1) = (∂E_n/∂y_i^(L+1)) (∂y_i^(L+1)/∂z_i^(L+1)) = (∂E_n/∂y_i^(L+1)) f'(z_i^(L+1))    (29)

^6 In general, the Taylor series of a smooth function f at the point x_0 is given by T(x) = Σ_{k=0}^{∞} (f^(k)(x_0) / k!) (x − x_0)^k.    (24)
^7 The inversion of an n × n matrix usually requires O(n^3) operations.

Figure 7: Once evaluated for all output units, the errors can be propagated backwards according to equation (34). The derivatives of all weights in layer l can then be determined using equation (31).

where δ_i^(L+1) is often called the error and describes the influence of the i-th output unit on the total error E_n. The second factor of equation (28) takes the form

∂z_i^(L+1)/∂w_{ij}^(L+1) = ∂/∂w_{ij}^(L+1) ( Σ_{k=0}^{m^(L)} w_{ik}^(L+1) y_k^(L) ) = y_j^(L)    (30)

because all summands with k ≠ j are constant with respect to w_{ij}^(L+1). By substituting both factors (29) and (30) into equation (28), we get

∂E_n/∂w_{ij}^(L+1) = δ_i^(L+1) y_j^(L).    (31)

The errors δ_i^(L+1) are usually easy to compute, for example when choosing the error function and output activation function as noted in section 4.1. Considering the i-th hidden unit within an arbitrary hidden layer l, we write

∂E_n/∂w_{ij}^(l) = (∂E_n/∂z_i^(l)) (∂z_i^(l)/∂w_{ij}^(l))    (32)

and notice that the second factor in equation (32) takes the same form as in equation (30):

∂z_i^(l)/∂w_{ij}^(l) = ∂/∂w_{ij}^(l) ( Σ_{k=0}^{m^(l−1)} w_{ik}^(l) y_k^(l−1) ) = y_j^(l−1).    (33)

Thus, only the error δ_i^(l) changes. Using the chain rule we get

δ_i^(l) := ∂E_n/∂z_i^(l) = Σ_{k=1}^{m^(l+1)} (∂E_n/∂z_k^(l+1)) (∂z_k^(l+1)/∂z_i^(l)).    (34)

In general, the sum runs over all units directly succeeding the i-th unit in layer l [5]. But we assume that every unit in layer l is connected to every unit in layer (l + 1), where we allow the corresponding weights w_{ki}^(l+1) to vanish. Altogether, by substituting the definition of δ_k^(l+1) from equation (34), we have

δ_i^(l) = f'(z_i^(l)) Σ_{k=1}^{m^(l+1)} w_{ki}^(l+1) δ_k^(l+1).    (35)

In conclusion, we have found a recursive algorithm for calculating the gradient ∇E_n by propagating the errors δ_i^(L+1) back through the network using equation (35) and evaluating the derivatives ∂E_n/∂w_{ij}^(l) using equation (32).
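The recursion can be summarized in a few lines of MatLab for a two-layer perceptron with logistic sigmoid units and the sum-of-squared error. This is only a sketch with assumed shapes and names, and it omits the biases for brevity; the full training procedure is shown in appendix A.2.

% Minimal sketch: error backpropagation for a two-layer perceptron using
% equations (29), (31), (33) and (35); shapes and names are assumptions.
sigma = @(z) 1 ./ (1 + exp(-z)); dsigma = @(z) sigma(z) .* (1 - sigma(z));
W1 = rand(10, 784); W2 = rand(10, 10);    % weights (biases omitted for brevity)
x = rand(784, 1); t = zeros(10, 1); t(3) = 1;
z1 = W1 * x;  y1 = sigma(z1);             % forward pass, hidden layer
z2 = W2 * y1; y2 = sigma(z2);             % forward pass, output layer
delta2 = dsigma(z2) .* (y2 - t);          % output errors, equation (29)
delta1 = dsigma(z1) .* (W2' * delta2);    % backpropagated errors, equation (35)
gradW2 = delta2 * y1';                    % dE_n/dW2, equation (31)
gradW1 = delta1 * x';                     % dE_n/dW1, equations (32) and (33)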

4.4.1 Efficiency

The number of operations required for error backpropagation scales with the total number of weights W and the number of units. Considering a sufficiently large multilayer perceptron such that every unit in layer l is connected to every unit in layer (l + 1), the cost of propagating an input value through the network is dominated by W. Here we neglect the evaluation cost of the activation functions. Because of the weighted sum in equation (34), backpropagating the error is dominated by W as well. This leads to an overall effort linear in the total number of weights: O(W) [5].

4.5 The Hessian

Figure 8: To derive an algorithm for evaluating the Hessian we consider two weights w_{ij}^(p) and w_{rs}^(l), where we assume that p ≤ l. The figure illustrates the case p < (l − 1).

Following [2], we derive an algorithm which allows us to evaluate the Hessian exactly by forward and backward propagation. Again, we assume differentiable activation functions within a multilayer perceptron. We start by picking two weights w_{ij}^(p) and w_{rs}^(l), where we assume the layer p to come before layer l, that is p ≤ l. The case p > l follows by the symmetry of the Hessian [2]. Figure 8 illustrates this setting for p < (l − 1). As the error E_n depends on w_{ij}^(p) only through the actual input z_i^(p) of the i-th unit in layer p, we start by writing

∂²E_n / (∂w_{ij}^(p) ∂w_{rs}^(l)) = ∂/∂w_{ij}^(p) ( ∂E_n/∂w_{rs}^(l) ) = y_j^(p−1) ∂/∂z_i^(p) ( ∂E_n/∂w_{rs}^(l) ).    (36)

Now we recall the definition of δ_r^(l) in equation (34) and add two more definitions:

g_s^(l−1,p) := ∂z_s^(l−1)/∂z_i^(p)  and  b_r^(l,p) := ∂δ_r^(l)/∂z_i^(p).    (37)

As E_n depends on w_{rs}^(l) only through the value of z_r^(l), we use the chain rule and write

∂²E_n / (∂w_{ij}^(p) ∂w_{rs}^(l)) = y_j^(p−1) ∂/∂z_i^(p) [ (∂z_r^(l)/∂w_{rs}^(l)) (∂E_n/∂z_r^(l)) ] = y_j^(p−1) ∂/∂z_i^(p) [ y_s^(l−1) δ_r^(l) ].    (38)

Because the output y_j^(p−1) of the j-th unit in layer (p − 1) does not depend on z_i^(p), we can use the product rule and plug in the definitions from equation (37):

∂²E_n / (∂w_{ij}^(p) ∂w_{rs}^(l)) = y_j^(p−1) f'(z_s^(l−1)) g_s^(l−1,p) δ_r^(l) + y_j^(p−1) y_s^(l−1) b_r^(l,p).    (39)

Following this derivation, Bishop suggests a simple algorithm to evaluate the g_s^(l−1,p) and the b_r^(l,p) by forward and backward propagation. Using the chain rule we can write

g_s^(l−1,p) = ∂z_s^(l−1)/∂z_i^(p) = Σ_{k=1}^{m^(l−2)} (∂z_s^(l−1)/∂z_k^(l−2)) (∂z_k^(l−2)/∂z_i^(p)) = Σ_{k=1}^{m^(l−2)} f'(z_k^(l−2)) w_{sk}^(l−1) g_k^(l−2,p).    (40)

Initially, we set g_i^(p,p) := 1 and g_s^(l,p) := 0 for l ≤ p and s ≠ i. Thus, the g_s^(l−1,p) can be obtained by forward propagating them through the network. Using the definition of δ_r^(l) from equation (34), the product rule and the definitions from equation (37), we write

b_r^(l,p) = ∂δ_r^(l)/∂z_i^(p) = f''(z_r^(l)) g_r^(l,p) Σ_{k=1}^{m^(l+1)} w_{kr}^(l+1) δ_k^(l+1) + f'(z_r^(l)) Σ_{k=1}^{m^(l+1)} w_{kr}^(l+1) b_k^(l+1,p).    (41)

Using equation (41) we can backpropagate the b_r^(l,p) after evaluating b_i^(L+1,p) for all output units i:

b_i^(L+1,p) = ∂δ_i^(L+1)/∂z_i^(p) = (∂²E_n / ∂(z_i^(L+1))²) g_i^(L+1,p) = [ f''(z_i^(L+1)) ∂E_n/∂y_i^(L+1) + (f'(z_i^(L+1)))² ∂²E_n/∂(y_i^(L+1))² ] g_i^(L+1,p).    (42)

The errors δ_r^(l) can be evaluated as seen in section 4.4 using error backpropagation. Altogether, we are able to evaluate the Hessian of E_n by forward propagating the g_s^(l−1,p) and backward propagating the b_r^(l,p) as well as the errors δ_r^(l) [2].

4.5.1 Efficiency

We require one step of forward propagation and backward propagation per unit within the network. Again, assuming a sufficiently large network such that the number of weights W dominates, the evaluation of equation (39) requires O(W²) operations.

4.6 Regularization

Now we focus on the ability to generalize. We saw that, theoretically, a neural network with a sufficiently large number of hidden units can model every continuous target function. The ability of a network to generalize can be observed using unseen test data not used for training, usually called a validation set. If the trained approximation of the target function works well on the validation set, the network is said to have good generalization capabilities [8]. Overtraining of the network may result in over-fitting of the training set, that is, the network memorizes the training set but performs very poorly on the validation set [8]. Regularization tries to avoid over-fitting and to give better generalization performance. This can be achieved by controlling the complexity of the network, which is mainly determined by the number of hidden units. The simplest form of regularization is adding a regularizer to the error function such that

Ê(w) = E(w) + η P(w)    (43)

where P(w) is a penalty term which influences the form and complexity of the solution [3, p. 338].

4.6.1 L2-Regularization

Large weights tend to result in an approximation with poor generalization capabilities. Therefore, we penalize large weights such that we have [3]:

Ê(w) = E(w) + η w^T w.    (44)

Figure 9: The error on the validation set (red) usually increases once the network begins to over-fit the training set. Thus, although the error on the training set (blue) decreases monotonically, we may want to stop training early to avoid over-fitting.

This regularizer is also referred to as weight decay. To understand why, we consider the derivative of the error function Ê, which is given by

∇Ê = ∇E + ηw.    (45)

Neglecting ∇E and considering gradient descent learning, the change in w with respect to the iteration step t can be described as

∂w[t]/∂t = −γηw[t].    (46)

Equation (46) has the solution w[t] = w[0] e^{−γηt}, such that the weights tend exponentially to zero, giving the method its name [3]. While weight decay represents L2-regularization, we could also consider L1-regularization.

4.6.2 Early Stopping

Early stopping describes the approach of stopping training before gradient descent has finished. The error on the training set is usually monotonically decreasing with respect to the iteration number. For unseen data the error is usually much higher and not monotonically decreasing; it tends to increase when we reach a state of over-fitting. Therefore, it seems reasonable to stop training at the point where the error on the validation set reaches a minimum [3]. This is illustrated in figure 9.
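A combined MatLab sketch of weight decay, equation (45), and early stopping might look as follows; the error functions and all parameter values are toy assumptions, and the validation error is assumed to be available as a function of the weights.

% Minimal sketch: gradient descent with the weight decay term of equation
% (45), keeping the weights with the lowest validation error (early stopping).
gradE = @(w) 2 * (w - 1); validationError = @(w) sum((w - 1.2).^2);
w = zeros(2, 1); gamma = 0.1; eta = 0.01;
bestError = inf; bestW = w;
for t = 1:1000
    w = w - gamma * (gradE(w) + eta * w);    % update with weight decay, eq. (45)
    if validationError(w) < bestError        % remember the best weights seen
        bestError = validationError(w); bestW = w;
    end
end
w = bestW;                                   % early stopping: revert to best w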

5 Pattern Classification

In this section we take a closer look at pattern classification. The classification problem can be defined as follows: given a D-dimensional input vector x, assign it to one of C discrete classes. We refer to a given class by its number c, which lies in the range 1 ≤ c ≤ C. The input vector x is also called pattern or observation.

5.1 Statistical Background

Using a statistical approach, we assume the pattern x and the class c to be random variables. Then p(x) is the probability of observing the pattern x and p(c) is the probability of observing a pattern belonging to class c [18]. The classification task is to assign a pattern x to its corresponding class. This can be accomplished by considering the class-conditional probability p(c|x), that is, the probability of pattern x belonging to class c [18]. By applying Bayes' theorem we can rewrite this probability to give

p(c|x) = p(x|c) p(c) / p(x).    (47)

Then p(c|x) can be interpreted as posterior probability, where p(c) is the prior probability. The prior probability represents the probability of class c before making an observation; the posterior probability describes the probability of class c after observing the pattern x [5].

5.2 Bayes Decision Rule

Given the posterior probabilities p(c|x) for all classes c, we can determine the class of x using a decision rule. We want to avoid decision errors: a decision error occurs if an observation vector x is assigned to the wrong class. Bayes' decision rule, given by

c : R^D → {1, ..., C}, x ↦ argmax_{1 ≤ c ≤ C} {p(c|x)},    (48)

results in the minimum number of decision errors [18, p. 109]. But it assumes the true posterior probabilities to be known. In practice the posterior probabilities p(c|x) are unknown. Therefore, we may model the probability p(c|x) directly (or indirectly by modeling p(x|c) first) by means of a so-called model distribution q_θ(c|x), which depends on some parameters θ [17]. Then we apply the model-based decision rule given by

c_θ : R^D → {1, ..., C}, x ↦ argmax_{1 ≤ c ≤ C} {q_θ(c|x)}.    (49)

Here we use the model distribution and, thus, the decision rule is fully dependent on the parameters θ [17, p. 638].
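Applied to the network outputs interpreted as q_θ(c|x), the model-based decision rule of equation (49) is a simple argmax; a minimal sketch with illustrative posteriors:

% Minimal sketch: the model-based decision rule of equation (49).
q = [0.1; 0.7; 0.2];    % modeled posteriors q_theta(c|x) for C = 3 classes
[~, c] = max(q);        % argmax over the classes, equation (49)
disp(c);                % the class the pattern is assigned to (here: 2)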

5.3 Maximum Likelihood Estimation

Using a maximum likelihood approach, we can estimate the unknown parameters of the model distribution. To this end, we assume that the posterior probabilities are given by the model distribution q_θ(c|x) and that the patterns x_n are drawn independently from this distribution. We say the data points x_n are independent and identically distributed [6]. Then the likelihood function is given by

L(θ) = Π_{n=1}^{N} q_θ(c_n|x_n)    (50)

where c_n denotes the corresponding class of pattern x_n. We want to maximize L(θ). This is equivalent to minimizing the negative log-likelihood function, which is given by

−log(L(θ)) = −Σ_{n=1}^{N} log(q_θ(c_n|x_n)).    (51)

Thus, we can use the negative log-likelihood function as error function to train a neural network. Both the sum-of-squared error function and the cross-entropy error function can be motivated using a maximum likelihood approach. We derive only the cross-entropy error function in the next section.

5.3.1 Derivation of Cross-Entropy

We follow [5] and consider the multiclass^8 classification problem. The target values t_n follow the 1-of-C encoding scheme, that is, the c-th entry of t_n equals 1 iff the pattern x_n belongs to class c. We interpret the output of the neural network as the posterior probabilities:

p(c|x) = y_c(x, w).    (52)

The posterior probability for class c can then be written as

p(c|x) = p(x|c) p(c) / Σ_{k=1}^{C} p(x|k) p(k).    (53)

As output activation function we use the softmax function, such that we can rewrite equation (53) to give

p(c|x) = exp(z_c) / Σ_{k=1}^{C} exp(z_k)  with  z_k = log(p(x|k) p(k)).    (54)

Here we model the posteriors by means of the network output. This means that the parameters of the model distribution are given by the weights of the network. Given the probabilities p(c|x), the likelihood function takes the form

L(w) = Π_{n=1}^{N} Π_{c=1}^{C} p(c|x_n)^{t_{nc}} = Π_{n=1}^{N} Π_{c=1}^{C} y_c(x_n, w)^{t_{nc}}    (55)

and we can derive the error function E(w), which is given by the cross-entropy error function already introduced in section 4.1:

E(w) = −log(L(w)) = −Σ_{n=1}^{N} Σ_{c=1}^{C} t_{nc} ln(y_c(x_n, w)).    (56)
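For 1-of-C encoded targets, the negative log-likelihood of equation (51) and the cross-entropy of equation (56) coincide, which the following small sketch with made-up network outputs illustrates numerically.

% Minimal sketch: negative log-likelihood (51) equals cross-entropy (56)
% for 1-of-C targets; outputs and classes are illustrative.
Y = [0.7 0.2; 0.2 0.5; 0.1 0.3];    % outputs y_c(x_n, w) for N = 2 patterns
T = [1 0; 0 1; 0 0];                % 1-of-C targets t_nc (classes 1 and 2)
c = [1 2];                          % the same classes as indices c_n
negLogLikelihood = -sum(log(Y(sub2ind(size(Y), c, 1:2))));   % equation (51)
crossEntropy = -sum(sum(T .* log(Y)));                       % equation (56)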

Figure 10: The error during training for different learning rates γ and numbers of hidden units: (a) 100 units, γ = 0.5; (b) 300 units, γ = 0.5; (c) 400 units, γ = 0.5; (d) 500 units, γ = 0.5; (e) 600 units, γ = 0.5; (f) 700 units, γ = 0.5; (g) 750 units, γ = 0.1; (h) 800 units, γ = 0.1. The two-layer perceptron was trained with a batch size of 100 randomly chosen images, iterated 500 times. The error was measured using the sum-of-squared error function and plotted after each iteration.

5.4 Application: Recognizing Handwritten Digits

Based on the MNIST dataset (available online), we want to train a two-layer perceptron to recognize handwritten digits. The dataset provides a training set of 60,000 images and a validation set of 10,000 images. The images have 28 × 28 pixels and, thus, we may write an image as a vector with 28 · 28 = 784 dimensions. The neural network is trained using gradient descent for parameter optimization and the sum-of-squared error function. The training procedure is implemented in MatLab.

5.4.1 Matrix Notation

For implementing error backpropagation in MatLab it is convenient to introduce a matrix notation. We discuss the general case of a multilayer perceptron with L layers. The output of all units in layer l and their actual inputs can both be combined into vectors:

y^(l) := (y_i^(l))_{i=1}^{m^(l)}  and  z^(l) := (z_i^(l))_{i=1}^{m^(l)}.    (57)

When combining all weights w_{ij}^(l) of layer l in a single weight matrix, we can express the propagation of an input vector through the network as a sequence of matrix-vector products. The weight matrix W^(l) is given by

W^(l) := (w_{ij}^(l))_{i,j=1}^{m^(l), m^(l−1)}.    (58)

Propagating the output vector y^(l−1) of layer (l − 1) through layer l can then be written as

y^(l) = f(W^(l) y^(l−1))    (59)

where f operates element by element.

^8 Generally, we distinguish the binary classification problem in the case of C = 2 and the multiclass classification problem for C > 2.

Given the error vector δ^(l+1) for layer (l + 1), defined as

δ^(l+1) := (δ_i^(l+1))_{i=1}^{m^(l+1)},    (60)

we can use the transpose of the weight matrix to give

δ^(l) = f'(z^(l)) ⊙ (W^(l+1))^T δ^(l+1)    (61)

where f' operates element by element and ⊙ denotes the component-wise multiplication.

5.4.2 Implementation and Results

The MNIST dataset is provided in the IDX file format. To get the images as vectors with 784 dimensions we use the two functions loadMNISTImages and loadMNISTLabels, which can be found online. Appendix A.2 shows the implementation of the training procedure. The activation function and its derivative are passed as parameters. We use the logistic sigmoid, which is shown in appendix A.1, as activation function for both the hidden and the output layer. The number of hidden units and the learning rate are adjustable. The weight matrices for both layers are initialized randomly and normalized afterwards (lines 29 to 35). We apply stochastic training in so-called epochs, that is, for each epoch we train the network on randomly selected input values (lines 42 to 61). The number of randomly selected input values used for training is defined by the parameter batchSize. After each epoch we plot the current error on the same input values the network was trained on (lines 63 to 73). For comparing different configurations we use the validation set. The validation procedure, which counts the number of correctly classified images, is shown in appendix A.3. The accuracy is measured as the quotient of the number of correctly classified images and the total number of images. Figure 11 shows some of the results.

Figure 11: Figure 11(a) plots the achieved accuracy depending on the number of hidden units for a fixed learning rate γ = 0.5, after 500 epochs with batch size 100. For 750 or more hidden units we need to lower the learning rate in order to ensure convergence, which may result in slow convergence such that we get worse results than for fewer hidden units. Figure 11(b) shows the impact of increasing the number of epochs for a fixed learning rate γ = 0.5; as we would expect, the accuracy increases with rising training set size.
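A hypothetical driver script, not part of the paper, shows how the procedures of appendices A.2 and A.3 might be called. The file names, the conversion of the labels to 1-of-10 target vectors, the shift of the validation labels to the range 1 to 10, and all parameter values are assumptions for illustration; loadMNISTImages and loadMNISTLabels are assumed to be on the MatLab path.

% Hypothetical driver script (assumptions throughout, see the note above).
inputValues = loadMNISTImages('train-images.idx3-ubyte');     % 784 x 60000
labels = loadMNISTLabels('train-labels.idx1-ubyte');          % 60000 x 1
targetValues = zeros(10, size(labels, 1));                    % 1-of-10 encoding
targetValues(sub2ind(size(targetValues), labels' + 1, 1:size(labels, 1))) = 1;
[hiddenWeights, outputWeights] = trainStochasticSquaredErrorTwoLayerPerceptron(...
    @logisticSigmoid, @dLogisticSigmoid, 700, inputValues, targetValues, ...
    500, 100, 0.5);                                            % epochs, batch size, rate
testImages = loadMNISTImages('t10k-images.idx3-ubyte');
testLabels = loadMNISTLabels('t10k-labels.idx1-ubyte');
[correct, errors] = validateTwoLayerPerceptron(@logisticSigmoid, ...
    hiddenWeights, outputWeights, testImages, testLabels + 1); % classes 1..10
accuracy = correct / (correct + errors);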

6 Conclusion

In the course of this paper we introduced the general concept of artificial neural networks and had a closer look at the multilayer perceptron. We discussed several activation functions and network topologies and explored the expressive power of neural networks.

Section 4 introduced the basic notions of network training. We focused on supervised training and introduced two different error measures to evaluate the performance of the network with respect to function approximation. To train the network we introduced gradient descent as an iterative parameter optimization algorithm, as well as Newton's method as a second-order optimization algorithm. The weights of the network are adjusted iteratively to minimize the chosen error measure. To evaluate the gradient of the error measure we introduced the error backpropagation algorithm and extended it to allow the exact evaluation of the Hessian as well.

Thereafter, we discussed the classification problem using a statistical approach. Based on maximum likelihood estimation we derived the cross-entropy error measure for multiclass classification. As application we considered digit recognition based on the MNIST dataset using a two-layer perceptron.

Unfortunately, this paper is far too short to cover the extensive topic of neural networks and their application in pattern recognition. Especially for network training there are more advanced techniques available (see for example [3]). Instead of the exact evaluation of the Hessian described in [2], approximations to the Hessian or its inverse are more efficient to compute (see for example [3]). Based on a diagonal approximation of the Hessian there are advanced regularization methods, such as the Optimal Brain Damage algorithm [16]. In conclusion, artificial neural networks are a powerful tool applicable to a wide range of problems, especially in the domain of pattern recognition.

Appendices

A MatLab Implementation

A.1 Logistic Sigmoid and its Derivative

function y = logisticSigmoid(x)
% logisticSigmoid Logistic sigmoid activation function.
%
% INPUT:
% x : Input vector.
%
% OUTPUT:
% y : Output vector where the logistic sigmoid was applied element by
%     element.
%

    y = 1./(1 + exp(-x));
end

function y = dLogisticSigmoid(x)
% dLogisticSigmoid Derivative of the logistic sigmoid.
%
% INPUT:
% x : Input vector.
%
% OUTPUT:
% y : Output vector where the derivative of the logistic sigmoid was
%     applied element by element.
%
    y = logisticSigmoid(x).*(1 - logisticSigmoid(x));
end

A.2 Training Procedure

1 function [hiddenWeights, outputWeights, error] = trainStochasticSquaredErrorTwoLayerPerceptron(activationFunction, dActivationFunction, numberOfHiddenUnits, inputValues, targetValues, epochs, batchSize, learningRate)
2 % trainStochasticSquaredErrorTwoLayerPerceptron Creates a two-layer perceptron
3 % and trains it on the MNIST dataset.
4 %
5 % INPUT:
6 % activationFunction  : Activation function used in both layers.
7 % dActivationFunction : Derivative of the activation
8 %                       function used in both layers.
9 % numberOfHiddenUnits : Number of hidden units.
10 % inputValues         : Input values for training (784 x 60000).
11 % targetValues        : Target values for training (10 x 60000).
12 % epochs              : Number of epochs to train.
13 % batchSize           : Plot error after batchSize images.
14 % learningRate        : Learning rate to apply.
15 %
16 % OUTPUT:
17 % hiddenWeights : Weights of the hidden layer.
18 % outputWeights : Weights of the output layer.
19 %
20

21     % The number of training vectors.
22     trainingSetSize = size(inputValues, 2);
23
24     % Input vector has 784 dimensions.
25     inputDimensions = size(inputValues, 1);
26     % We have to distinguish 10 digits.
27     outputDimensions = size(targetValues, 1);
28
29     % Initialize the weights for the hidden layer and the output layer.
30     hiddenWeights = rand(numberOfHiddenUnits, inputDimensions);
31     outputWeights = rand(outputDimensions, numberOfHiddenUnits);
32
33     hiddenWeights = hiddenWeights./size(hiddenWeights, 2);
34     outputWeights = outputWeights./size(outputWeights, 2);
35
36     n = zeros(batchSize);
37
38     figure; hold on;
39
40     for t = 1: epochs
41         for k = 1: batchSize
42             % Select which input vector to train on.
43             n(k) = floor(rand(1)*trainingSetSize + 1);
44
45             % Propagate the input vector through the network.
46             inputVector = inputValues(:, n(k));
47             hiddenActualInput = hiddenWeights*inputVector;
48             hiddenOutputVector = activationFunction(hiddenActualInput);
49             outputActualInput = outputWeights*hiddenOutputVector;
50             outputVector = activationFunction(outputActualInput);
51
52             targetVector = targetValues(:, n(k));
53
54             % Backpropagate the errors.
55             outputDelta = dActivationFunction(outputActualInput).*(outputVector - targetVector);
56             hiddenDelta = dActivationFunction(hiddenActualInput).*(outputWeights'*outputDelta);
57
58             outputWeights = outputWeights - learningRate.*outputDelta*hiddenOutputVector';
59             hiddenWeights = hiddenWeights - learningRate.*hiddenDelta*inputVector';
60         end;
61
62         % Calculate the error for plotting.
63         error = 0;
64         for k = 1: batchSize
65             inputVector = inputValues(:, n(k));
66             targetVector = targetValues(:, n(k));
67
68             error = error + norm(activationFunction(outputWeights*activationFunction(hiddenWeights*inputVector)) - targetVector, 2);
69         end;
70         error = error/batchSize;
71
72         plot(t, error, '*');
73     end;
74 end

A.3 Validation Procedure

function [correctlyClassified, classificationErrors] = validateTwoLayerPerceptron(activationFunction, hiddenWeights, outputWeights, inputValues, labels)
% validateTwoLayerPerceptron Validate the two-layer perceptron using the
% validation set.
%
% INPUT:
% activationFunction : Activation function used in both layers.
% hiddenWeights      : Weights of the hidden layer.
% outputWeights      : Weights of the output layer.
% inputValues        : Input values for validation (784 x 10000).
% labels             : Labels for validation (1 x 10000).
%
% OUTPUT:
% correctlyClassified  : Number of correctly classified values.
% classificationErrors : Number of classification errors.
%

    testSetSize = size(inputValues, 2);
    classificationErrors = 0;
    correctlyClassified = 0;

    for n = 1: testSetSize
        inputVector = inputValues(:, n);
        outputVector = evaluateTwoLayerPerceptron(activationFunction, hiddenWeights, outputWeights, inputVector);

        class = decisionRule(outputVector);
        if class == labels(n)
            correctlyClassified = correctlyClassified + 1;
        else
            classificationErrors = classificationErrors + 1;
        end;
    end;
end

function class = decisionRule(outputVector)
% decisionRule Model based decision rule.
%
% INPUT:
% outputVector : Output vector of the network.
%
% OUTPUT:
% class : Class the vector is assigned to.
%

    max = 0;
    class = 1;
    for i = 1: size(outputVector, 1)
        if outputVector(i) > max
            max = outputVector(i);
            class = i;
        end;
    end;
end

function outputVector = evaluateTwoLayerPerceptron(activationFunction, hiddenWeights, outputWeights, inputVector)
% evaluateTwoLayerPerceptron Evaluate the two-layer perceptron given by the
% weights using the given activation function.
%
% INPUT:
% activationFunction : Activation function used in both layers.
% hiddenWeights      : Weights of the hidden layer.
% outputWeights      : Weights of the output layer.
% inputVector        : Input vector to evaluate.
%
% OUTPUT:
% outputVector : Output of the perceptron.
%

    outputVector = activationFunction(outputWeights*activationFunction(hiddenWeights*inputVector));
end


More information

Chapter Newton s Method

Chapter Newton s Method Chapter 9. Newton s Method After readng ths chapter, you should be able to:. Understand how Newton s method s dfferent from the Golden Secton Search method. Understand how Newton s method works 3. Solve

More information

Online Classification: Perceptron and Winnow

Online Classification: Perceptron and Winnow E0 370 Statstcal Learnng Theory Lecture 18 Nov 8, 011 Onlne Classfcaton: Perceptron and Wnnow Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton In ths lecture we wll start to study the onlne learnng

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

A PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS

A PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS HCMC Unversty of Pedagogy Thong Nguyen Huu et al. A PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS Thong Nguyen Huu and Hao Tran Van Department of mathematcs-nformaton,

More information

Homework Assignment 3 Due in class, Thursday October 15

Homework Assignment 3 Due in class, Thursday October 15 Homework Assgnment 3 Due n class, Thursday October 15 SDS 383C Statstcal Modelng I 1 Rdge regresson and Lasso 1. Get the Prostrate cancer data from http://statweb.stanford.edu/~tbs/elemstatlearn/ datasets/prostate.data.

More information

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering /

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering / Theory and Applcatons of Pattern Recognton 003, Rob Polkar, Rowan Unversty, Glassboro, NJ Lecture 4 Bayes Classfcaton Rule Dept. of Electrcal and Computer Engneerng 0909.40.0 / 0909.504.04 Theory & Applcatons

More information

Radial-Basis Function Networks

Radial-Basis Function Networks Radal-Bass uncton Networs v.0 March 00 Mchel Verleysen Radal-Bass uncton Networs - Radal-Bass uncton Networs p Orgn: Cover s theorem p Interpolaton problem p Regularzaton theory p Generalzed RBN p Unversal

More information

SDMML HT MSc Problem Sheet 4

SDMML HT MSc Problem Sheet 4 SDMML HT 06 - MSc Problem Sheet 4. The recever operatng characterstc ROC curve plots the senstvty aganst the specfcty of a bnary classfer as the threshold for dscrmnaton s vared. Let the data space be

More information

Neural Networks & Learning

Neural Networks & Learning Neural Netorks & Learnng. Introducton The basc prelmnares nvolved n the Artfcal Neural Netorks (ANN) are descrbed n secton. An Artfcal Neural Netorks (ANN) s an nformaton-processng paradgm that nspred

More information

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve

More information

1 GSW Iterative Techniques for y = Ax

1 GSW Iterative Techniques for y = Ax 1 for y = A I m gong to cheat here. here are a lot of teratve technques that can be used to solve the general case of a set of smultaneous equatons (wrtten n the matr form as y = A), but ths chapter sn

More information

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z )

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z ) C4B Machne Learnng Answers II.(a) Show that for the logstc sgmod functon dσ(z) dz = σ(z) ( σ(z)) A. Zsserman, Hlary Term 20 Start from the defnton of σ(z) Note that Then σ(z) = σ = dσ(z) dz = + e z e z

More information

4DVAR, according to the name, is a four-dimensional variational method.

4DVAR, according to the name, is a four-dimensional variational method. 4D-Varatonal Data Assmlaton (4D-Var) 4DVAR, accordng to the name, s a four-dmensonal varatonal method. 4D-Var s actually a drect generalzaton of 3D-Var to handle observatons that are dstrbuted n tme. The

More information

Ensemble Methods: Boosting

Ensemble Methods: Boosting Ensemble Methods: Boostng Ncholas Ruozz Unversty of Texas at Dallas Based on the sldes of Vbhav Gogate and Rob Schapre Last Tme Varance reducton va baggng Generate new tranng data sets by samplng wth replacement

More information

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) Maxmum Lkelhood Estmaton (MLE) Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A Wnter 01 UCSD Statstcal Learnng Goal: Gven a relatonshp between a feature vector x and a vector y, and d data samples (x,y

More information

VQ widely used in coding speech, image, and video

VQ widely used in coding speech, image, and video at Scalar quantzers are specal cases of vector quantzers (VQ): they are constraned to look at one sample at a tme (memoryless) VQ does not have such constrant better RD perfomance expected Source codng

More information

Numerical Heat and Mass Transfer

Numerical Heat and Mass Transfer Master degree n Mechancal Engneerng Numercal Heat and Mass Transfer 06-Fnte-Dfference Method (One-dmensonal, steady state heat conducton) Fausto Arpno f.arpno@uncas.t Introducton Why we use models and

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

CHALMERS, GÖTEBORGS UNIVERSITET. SOLUTIONS to RE-EXAM for ARTIFICIAL NEURAL NETWORKS. COURSE CODES: FFR 135, FIM 720 GU, PhD

CHALMERS, GÖTEBORGS UNIVERSITET. SOLUTIONS to RE-EXAM for ARTIFICIAL NEURAL NETWORKS. COURSE CODES: FFR 135, FIM 720 GU, PhD CHALMERS, GÖTEBORGS UNIVERSITET SOLUTIONS to RE-EXAM for ARTIFICIAL NEURAL NETWORKS COURSE CODES: FFR 35, FIM 72 GU, PhD Tme: Place: Teachers: Allowed materal: Not allowed: January 2, 28, at 8 3 2 3 SB

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm The Expectaton-Maxmaton Algorthm Charles Elan elan@cs.ucsd.edu November 16, 2007 Ths chapter explans the EM algorthm at multple levels of generalty. Secton 1 gves the standard hgh-level verson of the algorthm.

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

The exam is closed book, closed notes except your one-page cheat sheet.

The exam is closed book, closed notes except your one-page cheat sheet. CS 89 Fall 206 Introducton to Machne Learnng Fnal Do not open the exam before you are nstructed to do so The exam s closed book, closed notes except your one-page cheat sheet Usage of electronc devces

More information

Singular Value Decomposition: Theory and Applications

Singular Value Decomposition: Theory and Applications Sngular Value Decomposton: Theory and Applcatons Danel Khashab Sprng 2015 Last Update: March 2, 2015 1 Introducton A = UDV where columns of U and V are orthonormal and matrx D s dagonal wth postve real

More information

Boostrapaggregating (Bagging)

Boostrapaggregating (Bagging) Boostrapaggregatng (Baggng) An ensemble meta-algorthm desgned to mprove the stablty and accuracy of machne learnng algorthms Can be used n both regresson and classfcaton Reduces varance and helps to avod

More information

Outline. Communication. Bellman Ford Algorithm. Bellman Ford Example. Bellman Ford Shortest Path [1]

Outline. Communication. Bellman Ford Algorithm. Bellman Ford Example. Bellman Ford Shortest Path [1] DYNAMIC SHORTEST PATH SEARCH AND SYNCHRONIZED TASK SWITCHING Jay Wagenpfel, Adran Trachte 2 Outlne Shortest Communcaton Path Searchng Bellmann Ford algorthm Algorthm for dynamc case Modfcatons to our algorthm

More information

Assortment Optimization under MNL

Assortment Optimization under MNL Assortment Optmzaton under MNL Haotan Song Aprl 30, 2017 1 Introducton The assortment optmzaton problem ams to fnd the revenue-maxmzng assortment of products to offer when the prces of products are fxed.

More information

Some Comments on Accelerating Convergence of Iterative Sequences Using Direct Inversion of the Iterative Subspace (DIIS)

Some Comments on Accelerating Convergence of Iterative Sequences Using Direct Inversion of the Iterative Subspace (DIIS) Some Comments on Acceleratng Convergence of Iteratve Sequences Usng Drect Inverson of the Iteratve Subspace (DIIS) C. Davd Sherrll School of Chemstry and Bochemstry Georga Insttute of Technology May 1998

More information

Fundamentals of Neural Networks

Fundamentals of Neural Networks Fundamentals of Neural Networks Xaodong Cu IBM T. J. Watson Research Center Yorktown Heghts, NY 10598 Fall, 2018 Outlne Feedforward neural networks Forward propagaton Neural networks as unversal approxmators

More information

Topic 5: Non-Linear Regression

Topic 5: Non-Linear Regression Topc 5: Non-Lnear Regresson The models we ve worked wth so far have been lnear n the parameters. They ve been of the form: y = Xβ + ε Many models based on economc theory are actually non-lnear n the parameters.

More information

CSE 546 Midterm Exam, Fall 2014(with Solution)

CSE 546 Midterm Exam, Fall 2014(with Solution) CSE 546 Mdterm Exam, Fall 014(wth Soluton) 1. Personal nfo: Name: UW NetID: Student ID:. There should be 14 numbered pages n ths exam (ncludng ths cover sheet). 3. You can use any materal you brought:

More information

Maximal Margin Classifier

Maximal Margin Classifier CS81B/Stat41B: Advanced Topcs n Learnng & Decson Makng Mamal Margn Classfer Lecturer: Mchael Jordan Scrbes: Jana van Greunen Corrected verson - /1/004 1 References/Recommended Readng 1.1 Webstes www.kernel-machnes.org

More information

Hopfield networks and Boltzmann machines. Geoffrey Hinton et al. Presented by Tambet Matiisen

Hopfield networks and Boltzmann machines. Geoffrey Hinton et al. Presented by Tambet Matiisen Hopfeld networks and Boltzmann machnes Geoffrey Hnton et al. Presented by Tambet Matsen 18.11.2014 Hopfeld network Bnary unts Symmetrcal connectons http://www.nnwj.de/hopfeld-net.html Energy functon The

More information

RBF Neural Network Model Training by Unscented Kalman Filter and Its Application in Mechanical Fault Diagnosis

RBF Neural Network Model Training by Unscented Kalman Filter and Its Application in Mechanical Fault Diagnosis Appled Mechancs and Materals Submtted: 24-6-2 ISSN: 662-7482, Vols. 62-65, pp 2383-2386 Accepted: 24-6- do:.428/www.scentfc.net/amm.62-65.2383 Onlne: 24-8- 24 rans ech Publcatons, Swtzerland RBF Neural

More information

Design and Optimization of Fuzzy Controller for Inverse Pendulum System Using Genetic Algorithm

Design and Optimization of Fuzzy Controller for Inverse Pendulum System Using Genetic Algorithm Desgn and Optmzaton of Fuzzy Controller for Inverse Pendulum System Usng Genetc Algorthm H. Mehraban A. Ashoor Unversty of Tehran Unversty of Tehran h.mehraban@ece.ut.ac.r a.ashoor@ece.ut.ac.r Abstract:

More information

CSC321 Tutorial 9: Review of Boltzmann machines and simulated annealing

CSC321 Tutorial 9: Review of Boltzmann machines and simulated annealing CSC321 Tutoral 9: Revew of Boltzmann machnes and smulated annealng (Sldes based on Lecture 16-18 and selected readngs) Yue L Emal: yuel@cs.toronto.edu Wed 11-12 March 19 Fr 10-11 March 21 Outlne Boltzmann

More information

Neural Networks. Perceptrons and Backpropagation. Silke Bussen-Heyen. 5th of Novemeber Universität Bremen Fachbereich 3. Neural Networks 1 / 17

Neural Networks. Perceptrons and Backpropagation. Silke Bussen-Heyen. 5th of Novemeber Universität Bremen Fachbereich 3. Neural Networks 1 / 17 Neural Networks Perceptrons and Backpropagaton Slke Bussen-Heyen Unverstät Bremen Fachberech 3 5th of Novemeber 2012 Neural Networks 1 / 17 Contents 1 Introducton 2 Unts 3 Network structure 4 Snglelayer

More information

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING 1 ADVANCED ACHINE LEARNING ADVANCED ACHINE LEARNING Non-lnear regresson technques 2 ADVANCED ACHINE LEARNING Regresson: Prncple N ap N-dm. nput x to a contnuous output y. Learn a functon of the type: N

More information

Estimating the Fundamental Matrix by Transforming Image Points in Projective Space 1

Estimating the Fundamental Matrix by Transforming Image Points in Projective Space 1 Estmatng the Fundamental Matrx by Transformng Image Ponts n Projectve Space 1 Zhengyou Zhang and Charles Loop Mcrosoft Research, One Mcrosoft Way, Redmond, WA 98052, USA E-mal: fzhang,cloopg@mcrosoft.com

More information

A Robust Method for Calculating the Correlation Coefficient

A Robust Method for Calculating the Correlation Coefficient A Robust Method for Calculatng the Correlaton Coeffcent E.B. Nven and C. V. Deutsch Relatonshps between prmary and secondary data are frequently quantfed usng the correlaton coeffcent; however, the tradtonal

More information

Solving Nonlinear Differential Equations by a Neural Network Method

Solving Nonlinear Differential Equations by a Neural Network Method Solvng Nonlnear Dfferental Equatons by a Neural Network Method Luce P. Aarts and Peter Van der Veer Delft Unversty of Technology, Faculty of Cvlengneerng and Geoscences, Secton of Cvlengneerng Informatcs,

More information

Other NN Models. Reinforcement learning (RL) Probabilistic neural networks

Other NN Models. Reinforcement learning (RL) Probabilistic neural networks Other NN Models Renforcement learnng (RL) Probablstc neural networks Support vector machne (SVM) Renforcement learnng g( (RL) Basc deas: Supervsed dlearnng: (delta rule, BP) Samples (x, f(x)) to learn

More information

Logistic Classifier CISC 5800 Professor Daniel Leeds

Logistic Classifier CISC 5800 Professor Daniel Leeds lon 9/7/8 Logstc Classfer CISC 58 Professor Danel Leeds Classfcaton strategy: generatve vs. dscrmnatve Generatve, e.g., Bayes/Naïve Bayes: 5 5 Identfy probablty dstrbuton for each class Determne class

More information

Inexact Newton Methods for Inverse Eigenvalue Problems

Inexact Newton Methods for Inverse Eigenvalue Problems Inexact Newton Methods for Inverse Egenvalue Problems Zheng-jan Ba Abstract In ths paper, we survey some of the latest development n usng nexact Newton-lke methods for solvng nverse egenvalue problems.

More information

Hidden Markov Models & The Multivariate Gaussian (10/26/04)

Hidden Markov Models & The Multivariate Gaussian (10/26/04) CS281A/Stat241A: Statstcal Learnng Theory Hdden Markov Models & The Multvarate Gaussan (10/26/04) Lecturer: Mchael I. Jordan Scrbes: Jonathan W. Hu 1 Hdden Markov Models As a bref revew, hdden Markov models

More information

Linear Feature Engineering 11

Linear Feature Engineering 11 Lnear Feature Engneerng 11 2 Least-Squares 2.1 Smple least-squares Consder the followng dataset. We have a bunch of nputs x and correspondng outputs y. The partcular values n ths dataset are x y 0.23 0.19

More information

Chapter - 2. Distribution System Power Flow Analysis

Chapter - 2. Distribution System Power Flow Analysis Chapter - 2 Dstrbuton System Power Flow Analyss CHAPTER - 2 Radal Dstrbuton System Load Flow 2.1 Introducton Load flow s an mportant tool [66] for analyzng electrcal power system network performance. Load

More information

A new Approach for Solving Linear Ordinary Differential Equations

A new Approach for Solving Linear Ordinary Differential Equations , ISSN 974-57X (Onlne), ISSN 974-5718 (Prnt), Vol. ; Issue No. 1; Year 14, Copyrght 13-14 by CESER PUBLICATIONS A new Approach for Solvng Lnear Ordnary Dfferental Equatons Fawz Abdelwahd Department of

More information

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family IOSR Journal of Mathematcs IOSR-JM) ISSN: 2278-5728. Volume 3, Issue 3 Sep-Oct. 202), PP 44-48 www.osrjournals.org Usng T.O.M to Estmate Parameter of dstrbutons that have not Sngle Exponental Famly Jubran

More information

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4) I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes

More information

We present the algorithm first, then derive it later. Assume access to a dataset {(x i, y i )} n i=1, where x i R d and y i { 1, 1}.

We present the algorithm first, then derive it later. Assume access to a dataset {(x i, y i )} n i=1, where x i R d and y i { 1, 1}. CS 189 Introducton to Machne Learnng Sprng 2018 Note 26 1 Boostng We have seen that n the case of random forests, combnng many mperfect models can produce a snglodel that works very well. Ths s the dea

More information

Relevance Vector Machines Explained

Relevance Vector Machines Explained October 19, 2010 Relevance Vector Machnes Explaned Trstan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introducton Ths document has been wrtten n an attempt to make Tppng s [1] Relevance Vector Machnes

More information

Time-Varying Systems and Computations Lecture 6

Time-Varying Systems and Computations Lecture 6 Tme-Varyng Systems and Computatons Lecture 6 Klaus Depold 14. Januar 2014 The Kalman Flter The Kalman estmaton flter attempts to estmate the actual state of an unknown dscrete dynamcal system, gven nosy

More information

MATH 567: Mathematical Techniques in Data Science Lab 8

MATH 567: Mathematical Techniques in Data Science Lab 8 1/14 MATH 567: Mathematcal Technques n Data Scence Lab 8 Domnque Gullot Departments of Mathematcal Scences Unversty of Delaware Aprl 11, 2017 Recall We have: a (2) 1 = f(w (1) 11 x 1 + W (1) 12 x 2 + W

More information

Chapter 12. Ordinary Differential Equation Boundary Value (BV) Problems

Chapter 12. Ordinary Differential Equation Boundary Value (BV) Problems Chapter. Ordnar Dfferental Equaton Boundar Value (BV) Problems In ths chapter we wll learn how to solve ODE boundar value problem. BV ODE s usuall gven wth x beng the ndependent space varable. p( x) q(

More information

A linear imaging system with white additive Gaussian noise on the observed data is modeled as follows:

A linear imaging system with white additive Gaussian noise on the observed data is modeled as follows: Supplementary Note Mathematcal bacground A lnear magng system wth whte addtve Gaussan nose on the observed data s modeled as follows: X = R ϕ V + G, () where X R are the expermental, two-dmensonal proecton

More information

Expectation Maximization Mixture Models HMMs

Expectation Maximization Mixture Models HMMs -755 Machne Learnng for Sgnal Processng Mture Models HMMs Class 9. 2 Sep 200 Learnng Dstrbutons for Data Problem: Gven a collecton of eamples from some data, estmate ts dstrbuton Basc deas of Mamum Lelhood

More information

APPENDIX A Some Linear Algebra

APPENDIX A Some Linear Algebra APPENDIX A Some Lnear Algebra The collecton of m, n matrces A.1 Matrces a 1,1,..., a 1,n A = a m,1,..., a m,n wth real elements a,j s denoted by R m,n. If n = 1 then A s called a column vector. Smlarly,

More information

C/CS/Phy191 Problem Set 3 Solutions Out: Oct 1, 2008., where ( 00. ), so the overall state of the system is ) ( ( ( ( 00 ± 11 ), Φ ± = 1

C/CS/Phy191 Problem Set 3 Solutions Out: Oct 1, 2008., where ( 00. ), so the overall state of the system is ) ( ( ( ( 00 ± 11 ), Φ ± = 1 C/CS/Phy9 Problem Set 3 Solutons Out: Oct, 8 Suppose you have two qubts n some arbtrary entangled state ψ You apply the teleportaton protocol to each of the qubts separately What s the resultng state obtaned

More information

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan Kernels n Support Vector Machnes Based on lectures of Martn Law, Unversty of Mchgan Non Lnear separable problems AND OR NOT() The XOR problem cannot be solved wth a perceptron. XOR Per Lug Martell - Systems

More information

Support Vector Machines

Support Vector Machines Separatng boundary, defned by w Support Vector Machnes CISC 5800 Professor Danel Leeds Separatng hyperplane splts class 0 and class 1 Plane s defned by lne w perpendcular to plan Is data pont x n class

More information

Statistical Foundations of Pattern Recognition

Statistical Foundations of Pattern Recognition Statstcal Foundatons of Pattern Recognton Learnng Objectves Bayes Theorem Decson-mang Confdence factors Dscrmnants The connecton to neural nets Statstcal Foundatons of Pattern Recognton NDE measurement

More information

The Order Relation and Trace Inequalities for. Hermitian Operators

The Order Relation and Trace Inequalities for. Hermitian Operators Internatonal Mathematcal Forum, Vol 3, 08, no, 507-57 HIKARI Ltd, wwwm-hkarcom https://doorg/0988/mf088055 The Order Relaton and Trace Inequaltes for Hermtan Operators Y Huang School of Informaton Scence

More information

Lecture 20: November 7

Lecture 20: November 7 0-725/36-725: Convex Optmzaton Fall 205 Lecturer: Ryan Tbshran Lecture 20: November 7 Scrbes: Varsha Chnnaobreddy, Joon Sk Km, Lngyao Zhang Note: LaTeX template courtesy of UC Berkeley EECS dept. Dsclamer:

More information