Introduction to Neural Networks. David Stutz


RWTH Aachen University
Chair of Computer Science 6
Prof. Dr.-Ing. Hermann Ney
Selected Topics in Human Language Technology and Pattern Recognition, WS 13/14

Introduction to Neural Networks
David Stutz
Matriculation Number ######
February 10, 2014
Adviser: Pavel Golik

Contents

1 Abstract
2 Motivation
  2.1 Historical Background and Bibliographical Notes
3 Neural Networks
  3.1 The Perceptron
  3.2 Activation Functions
  3.3 Layered Networks
  3.4 Feed-Forward Networks
  3.5 Multilayer Perceptrons
  3.6 Expressive Power
4 Network Training
  4.1 Error Measures
  4.2 Training Approaches
  4.3 Parameter Optimization
    4.3.1 Linear Units
    4.3.2 Weight Initialization
    4.3.3 Gradient Descent
    4.3.4 Momentum
    4.3.5 Enhanced Gradient Descent
    4.3.6 Newton's Method
  4.4 Error Backpropagation
    4.4.1 Efficiency
  4.5 The Hessian
    4.5.1 Efficiency
  4.6 Regularization
    4.6.1 L2-Regularization
    4.6.2 Early Stopping
5 Pattern Classification
  5.1 Statistical Background
  5.2 Bayes Decision Rule
  5.3 Maximum Likelihood Estimation
    5.3.1 Derivation of Cross-Entropy
  5.4 Application: Recognizing Handwritten Digits
    5.4.1 Matrix Notation
    5.4.2 Implementation and Results
6 Conclusion
Appendices

A MatLab Implementation
  A.1 Logistic Sigmoid and its Derivative
  A.2 Training Procedure
  A.3 Validation Procedure
Literature

List of Figures

1 Single processing unit and its components
2 Network graph of a perceptron with D input units and C output units
3 The logistic sigmoid as activation function
4 Network graph of a two-layer perceptron with D input units, C output units and m hidden units
5 Single-layer perceptron for modeling boolean AND
6 The learning rate and its influence on the rate of convergence
7 Backpropagation of errors through the network
8 Exact evaluation of the Hessian
9 Early stopping based on a validation set
10 Error on the training set during training
11 Results of training a two-layer perceptron using the MNIST dataset

1 Abstract

In this seminar paper we study artificial neural networks, their training and their application to pattern recognition. We start by giving a general definition of artificial neural networks and introduce both the single-layer and the multilayer perceptron. After considering several activation functions we discuss network topology and the expressive power of multilayer perceptrons. The second part introduces supervised network training. To this end, we discuss gradient descent and Newton's method for parameter optimization. We derive the error backpropagation algorithm for evaluating the gradient of the error function and extend this approach to evaluate its Hessian. In addition, the concept of regularization is introduced. The third part introduces pattern classification. Using maximum likelihood estimation we derive the cross-entropy error function. As application, we train a two-layer perceptron to recognize handwritten digits based on the MNIST dataset.

2 Motivation

Theoretically, the human brain has a very low rate of operations per second when compared to a state-of-the-art computer [8, p. 28]. Nevertheless, all computers are still outperformed by the human brain with respect to one important factor: learning. The human brain is able to learn how to perform certain tasks based on experience and prior knowledge.

How can we teach computers to learn? To clarify the term learning with respect to computers, we assume a set of training data T = {(x_n, t_n) : 1 ≤ n ≤ N} for some N ∈ ℕ and an arbitrary target function g of which we know the target values t_n := g(x_n). Our goal is to teach the computer a reasonably good approximation of g(x) for x in the domain of g. Many classification^1 and regression^2 problems can be formulated this way. The target function may even be unknown. Considering noise within the given training data, such that we only have values t_n ≈ g(x_n), we require an additional property which we call the ability to generalize. Solving the problem by interpolation may reproduce the exact values g(x_n) but may be a very poor approximation of g(x) in general. This phenomenon is called over-fitting of the underlying training data.

2.1 Historical Background and Bibliographical Notes

In 1943 McCulloch and Pitts introduced the first mathematical models of networks of neurons, which we call artificial neural networks. But this first step did not include any results on network training [6]. The first work on how to train similar networks was Rosenblatt's perceptron in 1958, which we discuss in section 3.1 [19]. Only ten years later, Minsky and Papert showed that Rosenblatt's perceptron has many limitations, as we see in section 3.6, and can only model linearly separable^3 classification problems [6]. In [20] Rumelhart, Hinton and Williams introduced the idea of error backpropagation for pattern recognition. In the late 1980s it was shown that problems which are not linearly separable can be solved by multilayer perceptrons [10]. Weight decay was introduced in [9]. A diagonal approximation of the Hessian was introduced in [12]. A pruning method based on a diagonal approximation of the Hessian, called Optimal Brain Damage, is discussed in [16]. The exact evaluation of the Hessian is discussed in [2].

^1 The classification problem can be stated as follows: given a D-dimensional input vector x, assign it to one of C discrete classes (see section 5).
^2 The regression problem can be described as follows: given a D-dimensional input vector x, predict the value of C continuous target values.
^3 Considering a classification problem as introduced in section 5, we say a set of data points is not linearly separable if the classes cannot be separated by a hyperplane [5, p. 179].

3 Neural Networks

Figure 1: A single processing unit and its components. The activation function is denoted by f and applied to the actual input z of the unit to form its output y = f(z). x_1, ..., x_D represent input from other units within the network; w_0 is called bias and represents an external input to the unit. All inputs are mapped onto the actual input z using the propagation rule.

An artificial neural network, also referred to as neural network, is a set of interconnected processing units. A processing unit receives input from external sources and connected units and computes an output which may be propagated to other units. These units represent the neurons of the human brain, which are interconnected by synapses [8]. A processing unit consists of a propagation rule and an activation function. The propagation rule determines the actual input of the unit by mapping the output of all direct predecessors and additional external inputs onto a single input value. The activation function is then applied to the actual input and determines the output of the unit. The output of the processing unit is also called activation. This is illustrated in figure 1, showing a single processing unit where f denotes the activation function, z the actual input and y the output of the unit.

We distinguish input units and output units. Input units accept the input of the whole network and output units form the output of the network. Each input unit accepts a single input value x and we set its output to y := x. Altogether, a neural network models a function y(x) whose dimensions are determined by the number of input and output units.

To visualize neural networks we use directed graphs, which we call network graphs. As illustrated in figure 1, single processing units are represented by nodes and are interconnected by directed edges. The nodes of the graph are labeled according to their corresponding output.

3.1 The Perceptron

As an example we discuss Rosenblatt's perceptron, which was introduced in 1958 [6]. The perceptron consists of D input units and C output units. Every input unit is connected to every output unit, as shown in figure 2.

Figure 2: The perceptron consists of D input units and C output units. All units are labeled according to their output: y_i = f(z_i) in the case of output units; x_i in the case of input units. The input values x_i are propagated to each output unit using the weighted sum propagation rule. The additional input value x_0 := 1 is used to include the biases as weights. As suggested in section 3.3, the units are arranged in layers.

For 1 ≤ i ≤ C, the i-th output unit computes the output y_i = f(z_i) with

z_i = Σ_{j=1}^{D} w_{ij} x_j + w_{i0}    (1)

where x_j is the input of the j-th input unit. In this case the propagation rule is the weighted sum over all inputs with weights w_{ij} and biases w_{i0}. The bias can be included as a weight by considering an additional input x_0 := 1, such that the actual input z_i can be written as

z_i = Σ_{j=0}^{D} w_{ij} x_j.    (2)

Throughout this paper we use the weighted sum as propagation rule for all units except the input units, while activation functions may vary according to the discussion in the next section.

3.2 Activation Functions

The activation function determines the output of the unit. Often a threshold function is used, as for example the Heaviside function

h(z) = 1 if z ≥ 0, 0 if z < 0    (3)

[8]. In general, however, we want the activation function to have certain properties. Network training using error backpropagation, as discussed in section 4.4, requires the activation function to be differentiable. In addition, we may want to use nonlinear activation functions to increase the computational power of the network, as discussed in section 3.6 [6]. A sigmoid function is a commonly used s-shaped function. The logistic sigmoid is given by

σ(z) = 1 / (1 + exp(−z))    (4)

and is shown in figure 3.

Figure 3: The logistic sigmoid is a commonly used s-shaped activation function. It is both smooth and monotonic and allows a probabilistic interpretation, as its range is limited to [0, 1].

The logistic sigmoid can be considered a smooth version of the Heaviside function. The softmax function is given by

σ(z, i) = exp(z_i) / Σ_{k=1}^{C} exp(z_k)    (5)

where C is the dimension of the vector z. Both functions are smooth and monotonic, which means there are no additional local extrema. This is desirable for network training because multiple extrema within the activation function could cause additional extrema within the error surface [6]. In addition, they allow a probabilistic interpretation, which we use in section 5. The derivatives of both the logistic sigmoid and the softmax function take convenient forms for implementation:

∂σ(z)/∂z = σ(z) (1 − σ(z)),    (6)
∂σ(z, i)/∂z_j = σ(z, i) (δ(i, j) − σ(z, j))    (7)

where δ denotes the Kronecker delta^4.

3.3 Layered Networks

Arranging the units in layers results in a layered network. Figure 2 already introduced the perceptron as a layered network. In this case we have a layer of input units and a layer of output units. We may add additional units organized in so-called hidden layers. These units are called hidden units as they are not visible from the outside [8, p. 43]. When counting the number of layers we skip the input layer, because no real processing takes place there and it has no variable weights [5, p. 229]. Thus, the perceptron can be considered a single-layer perceptron without any hidden layers.

3.4 Feed-Forward Networks

Since we can model every neural network by means of a network graph, we obtain new neural networks by considering more complex network topologies [5]. We distinguish two network topologies:

Feed-forward: In a feed-forward topology we prohibit closed cycles within the network graph. This means that a unit in layer p may only propagate its output to a unit in layer l if l > p. Thus, the modeled function is deterministic.

Recurrent: Recurrent networks allow closed cycles. A connection establishing a closed cycle within a network graph is called a feedback connection.

In this paper we consider feed-forward networks only. As demonstrated in figure 2, the single-layer perceptron implements a feed-forward topology.

^4 The Kronecker delta δ(i, j) equals 1 if i = j and 0 otherwise.
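As an illustration, the following minimal MatLab sketch (not taken from the appendix; dimensions and values are chosen arbitrarily) propagates an input vector through a single layer using the weighted-sum propagation rule of equation (2) and the softmax activation of equation (5).

% Minimal sketch: forward pass of a single layer with weighted-sum
% propagation (equation (2)) and softmax activation (equation (5)).
D = 4; C = 3;                 % illustrative input and output dimensions
W = rand(C, D + 1);           % weights, first column holds the biases w_{i0}
x = rand(D, 1);               % some input vector
z = W * [1; x];               % actual inputs z_i with x_0 := 1, equation (2)
y = exp(z) ./ sum(exp(z));    % softmax outputs, equation (5); they sum to one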

3.5 Multilayer Perceptrons

Figure 4: Network graph of a two-layer perceptron with D input units, C output units and m := m^(1) hidden units. Again, we introduce units x_0 := 1 and y_0^(1) := 1 to include the biases as weights. To distinguish units and weights with the same indices in different layers, the number of the layer is written as superscript.

The multilayer perceptron has L ≥ 1 additional hidden layers. The l-th hidden layer consists of m^(l) hidden units. The output value y_i^(l) of the i-th hidden unit within this layer is

y_i^(l) = f( Σ_{k=1}^{m^(l−1)} w_{ik}^(l) y_k^(l−1) + w_{i0}^(l) ) = f( Σ_{k=0}^{m^(l−1)} w_{ik}^(l) y_k^(l−1) )    (8)

where the bias is again included as a weight using y_0^(l−1) := 1. Altogether, we count (L + 1) layers including the output layer, where we set D := m^(0) and C := m^(L+1). Then w_{ik}^(l) denotes the weighted connection between the k-th unit of layer (l − 1) and the i-th unit of layer l. Figure 4 shows a two-layer perceptron with m := m^(1) units in its hidden layer. As previously mentioned, we add additional units x_0 := y_0^(1) := 1 to represent the biases.

In this paper we discuss the feed-forward multilayer perceptron as model for neural networks. The modeled function takes the form

y(·, w) : R^D → R^C, x ↦ y(x, w) = (y_1(x, w), ..., y_C(x, w))^T    (9)

where w is the vector comprising all weights and y_i(x, w) := y_i^(L+1)(x, w) is the output of the i-th output unit.
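A minimal MatLab sketch of the forward pass in equations (8) and (9) for a two-layer perceptron could look as follows; the layer sizes and the use of the logistic sigmoid in both layers are illustrative assumptions.

% Minimal sketch: forward pass of a two-layer perceptron, equations (8), (9),
% with logistic sigmoid activations and biases folded into the weight matrices.
D = 784; m = 10; C = 10;                  % assumed layer sizes
W1 = rand(m, D + 1);                      % hidden layer weights incl. bias column
W2 = rand(C, m + 1);                      % output layer weights incl. bias column
x  = rand(D, 1);                          % input vector
y1 = 1 ./ (1 + exp(-(W1 * [1; x])));      % hidden activations, equation (8)
y2 = 1 ./ (1 + exp(-(W2 * [1; y1])));     % network output y(x, w), equation (9)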

3.6 Expressive Power

One of Minsky and Papert's results in 1969 showed that the single-layer perceptron has severe limitations, one of which is called the Exclusive-OR (XOR) problem [6]. As introduction we consider the target function g, which we want to model, given by

g : {0, 1}^2 → {0, 1}, x := (x_1, x_2) ↦ 1 if x_1 = x_2 = 1, 0 otherwise.    (10)

Figure 5: Figure 5(a) shows both boolean AND and boolean XOR represented as classification problems. As we can see, boolean XOR is not linearly separable. A simple single-layer perceptron capable of separating boolean AND is shown in figure 5(b).

Apparently, g describes boolean AND. With D = 2 input units and C = 1 output unit using the sign activation function, g can be modeled by a single-layer perceptron:

y = sgn(z) = sgn(w_1 x_1 + w_2 x_2 + w_0).    (11)

Figure 5(b) illustrates the corresponding network graph. By setting z = 0 we can interpret equation (11) as a straight line in two-dimensional space:

x_2 = −(w_1 / w_2) x_1 − w_0 / w_2.    (12)

A possible solution for modeling boolean AND is shown in figure 5(a). But, as we can see, boolean XOR cannot be modeled by this single-layer perceptron. We say boolean XOR is not linearly separable^5 [8]. This limitation can be overcome by adding at least one additional hidden layer. Theoretically, a two-layer perceptron is capable of modeling any continuous function when using appropriate nonlinear activation functions. This is a direct consequence of Kolmogorov's theorem. It states that for every continuous function f : [0, 1]^D → R there exist continuous functions ψ_j : [0, 1] → R such that f can be written in the form

f(x_1, ..., x_D) = Σ_{j=0}^{2D} g_f( Σ_{i=1}^{D} w_i ψ_j(x_i) ).    (13)

But according to [3] the function g_f depends on f, and the theorem is not applicable if the functions ψ_j are required to be smooth. In most applications we do not know the target function, and the theorem says nothing about how to find the functions ψ_j; it simply states their existence. Thus, the theorem is not of much practical use for network design [6]. Nevertheless, the theorem is important as it states that, theoretically, we are able to find an optimal solution using a single hidden layer. Otherwise, we could end up with a problem for which we cannot find a solution using a multilayer perceptron [8].

^5 When considered as a classification problem (see section 5) we say the problem is not linearly separable if the classes cannot be separated by a hyperplane [5, p. 179].
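The following small MatLab sketch uses one possible, hand-picked set of weights (a hypothetical choice, not taken from the paper) to reproduce boolean AND with the single-layer perceptron of equation (11); no such weights exist for boolean XOR, which is exactly the limitation discussed above.

% Minimal sketch: a single-layer perceptron with threshold activation,
% equation (11), reproducing boolean AND for one hypothetical weight choice.
w0 = -1.5; w1 = 1; w2 = 1;                        % separates AND, cf. figure 5(a)
X  = [0 0; 0 1; 1 0; 1 1];                        % all four input patterns
y  = (w1 * X(:, 1) + w2 * X(:, 2) + w0) >= 0;     % thresholded output per pattern
disp([X y]);                                      % last column equals x1 AND x2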

4 Network Training

Network training describes the problem of determining the parameters to model the target function. To this end, we use a set of training data representing the initial knowledge of the target function. Often the target function is unknown and we only know its values at specific data points. Depending on the training set we consider three different learning paradigms [8]:

Unsupervised learning: The training set provides only input values. The neural network has to find similarities by itself.

Reinforcement learning: After processing an input value of the training set, the neural network receives feedback on whether the result is considered good or bad.

Supervised learning: The training set provides both input values and desired target values (a labeled training set).

Although the human brain mostly learns by reinforcement learning (or even unsupervised learning), we discuss supervised learning only. Let T := {(x_n, t_n) : 1 ≤ n ≤ N} be a training set where the x_n are input values and the t_n the corresponding target values. As for any approximation problem, we can evaluate the performance of the neural network using some distance measure between the approximation and the target function.

4.1 Error Measures

We discuss mainly two error measures. The sum-of-squared error function takes the form

E(w) = Σ_{n=1}^{N} E_n(w) = (1/2) Σ_{n=1}^{N} Σ_{i=1}^{C} (y_i(x_n, w) − t_{ni})^2    (14)

where t_{ni} denotes the i-th entry of the n-th target value. The cross-entropy error function is given by

E(w) = Σ_{n=1}^{N} E_n(w) = − Σ_{n=1}^{N} Σ_{i=1}^{C} t_{ni} log(y_i(x_n, w)).    (15)

For further discussion we note that, using y_i(x_n, w) = f(z_i^(L+1)), the derivatives of both error functions with respect to the actual input z_i^(L+1) of the i-th output unit take the form

∂E_n/∂z_i^(L+1) = (∂E_n/∂y_i^(L+1)) (∂y_i^(L+1)/∂z_i^(L+1)) = y_i(x_n, w) − t_{ni}.    (16)

Here we use the softmax activation function for the cross-entropy error function and the identity as activation function for the sum-of-squared error function [3].
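For a single pattern with a 1-of-C encoded target, the two error measures can be computed as in the following MatLab sketch (illustrative values only).

% Minimal sketch: the two error measures for one pattern with C = 3 classes.
y = [0.7; 0.2; 0.1];                         % network output for one pattern
t = [1; 0; 0];                               % 1-of-C encoded target value
sumOfSquaredError = 0.5 * sum((y - t).^2);   % one summand E_n of equation (14)
crossEntropyError = -sum(t .* log(y));       % one summand E_n of equation (15)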

4.2 Training Approaches

Given an error function E(w) = Σ_{n=1}^{N} E_n(w), we distinguish mainly two training approaches [6]:

Stochastic training: Randomly choose an input value x_n and propagate it through the network. Update the weights based on the error E_n(w).

Batch training: Go through all input values x_n and compute y(x_n, w). Update the weights based on the overall error E(w) = Σ_{n=1}^{N} E_n(w).

Instead of choosing an input value x_n at random we could also select the input values in sequence, leading to sequential training. According to [5], stochastic training is considered to be faster, especially on redundant training sets.

4.3 Parameter Optimization

The weight update in both approaches is determined by a parameter optimization algorithm which tries to minimize the error. The error E(w) can be considered as an error surface above the weight space. In general, the error surface is a nonlinear function of the weights and may have many local minima and maxima. We want to find the global minimum. The necessary criterion for a minimum is

∇E(w) = 0    (17)

where ∇E denotes the gradient of the error function E evaluated at the point w. As an analytical solution is usually not possible, we use an iterative approach for minimizing the error. In each iteration step we choose a weight update Δw[t] and set

w[t + 1] = w[t] + Δw[t]    (18)

where w[t] denotes the weight vector w in the t-th iteration and w[0] is an appropriate starting vector. Optimization algorithms differ in the choice of the weight update Δw[t] [3]. Before discussing two common optimization algorithms, we consider the case of linear units and the problem of choosing a good starting vector w[0].

4.3.1 Linear Units

In the case of a single-layer perceptron with linear output units and the sum-of-squared error we obtain a linear problem. Thus, we can determine the weights exactly using singular value decomposition [3]. When considering a multilayer perceptron with linear output units and nonlinear hidden units, we can at least split up the minimization problem. The weights in the output layer can be determined exactly using singular value decomposition, whereas the weights in the hidden layers can be optimized using an iterative approach. Note that every time the weights of the nonlinear layers are updated, the exact solution for the weights in the output layer has to be recomputed [3].
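Determining a linear output layer exactly is an ordinary least-squares problem. A minimal sketch with made-up data, using MatLab's pseudo-inverse (which is itself computed via a singular value decomposition) rather than an explicit SVD, might look as follows.

% Minimal sketch: exact linear output weights by least squares.
Y = rand(10, 100);                     % hidden-layer outputs for 100 patterns
T = rand(3, 100);                      % corresponding target values
W = T * pinv(Y);                       % weights minimizing the squared error
E = 0.5 * norm(W * Y - T, 'fro')^2;    % resulting sum-of-squared error (14)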

4.3.2 Weight Initialization

Following [6], we want a starting vector w such that we have fast and uniform learning, that is, all components learn more or less equally fast. Assuming sigmoidal activation functions, the actual input of a unit should neither be too large, causing σ'(z) to be very small, nor too small, such that σ(z) is approximately linear. Both cases result in slow learning. For the logistic sigmoid an acceptable value for the unit input is of unity order [3]. Thus, we choose all weights randomly from the same distribution such that

−1/√(m^(l−1)) < w_{ij}^(l) < 1/√(m^(l−1))    (19)

for the weights in layer l. Assuming a normal distribution with zero mean and unity variance for the inputs of each unit, we then get on average the desired actual input of unity order [6].

4.3.3 Gradient Descent

Gradient descent is a basic first-order optimization algorithm. This means that in every iteration step we use information about the gradient ∇E at the current point. In iteration step [t + 1] the weight update Δw[t] is determined by taking a step in the direction of the negative gradient at position w[t], such that

Δw[t] = −γ ∂E/∂w[t]    (20)

where γ is called the learning rate. Gradient descent can be used both for batch training and stochastic training [5]. In the case of stochastic training we choose

Δw[t] = −γ ∂E_n/∂w[t]    (21)

as weight update.

4.3.4 Momentum

The choice of the learning rate can be crucial. We want to choose the learning rate such that learning is fast while avoiding the oscillation that occurs when the learning rate is too large [3]. Both cases are illustrated in figure 6, where we seek to minimize a convex quadratic function. One way to reduce oscillation while using a high learning rate, as described in [3], is to add a momentum term. The change in weight then depends on the previous change in weight, such that

Δw[t] = −γ ∂E/∂w[t] + λ Δw[t − 1]    (22)

where λ is called the momentum parameter.
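A minimal MatLab sketch of gradient descent with a momentum term, equation (22), applied to a toy quadratic error; the error function, learning rate and momentum parameter are illustrative assumptions.

% Minimal sketch: gradient descent with momentum on a toy error E(w) = w'*w.
gradE = @(w) 2 * w;                  % gradient of the toy error function
w = [1; -2]; dw = zeros(2, 1);       % starting vector w[0] and initial update
gamma = 0.1; lambda = 0.5;           % learning rate and momentum parameter
for t = 1:100
    dw = -gamma * gradE(w) + lambda * dw;   % weight update, equation (22)
    w = w + dw;                             % apply the step, equation (18)
end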

Figure 6: (a) High learning rate; (b) low learning rate. When using a large learning rate, the weight updates in each iteration tend to overstep the minimum. This causes oscillation around the minimum, as illustrated in figure 6(a). Choosing the learning rate too low results in slow convergence to the minimum, as shown in figure 6(b). Both scenarios are visualized using a quadratic function we seek to minimize.

4.3.5 Enhanced Gradient Descent

There is still a serious problem with gradient descent: how to choose the learning rate (and the momentum parameter). Currently, we are required to choose the parameters manually. Because the optimal values depend on the given problem, an automatic choice is not possible in general. We discuss a simple approach for choosing the learning rate automatically, considering no momentum term. Let γ[t] denote the learning rate in the t-th iteration. Then we have a simple criterion for whether the learning rate was chosen too large: if the error has increased after the weight update, we may have overstepped the minimum. In that case we can undo the weight update and choose a smaller learning rate until the error decreases [3]. Thus, we have

γ[t + 1] = ρ γ[t] if E(w[t + 1]) < E(w[t]),  µ γ[t] if E(w[t + 1]) > E(w[t])    (23)

where the update parameters can be chosen as suggested in [3]: set ρ := 1.1 to avoid frequent overstepping of the minimum, and set µ := 0.5 to ensure fast recovery to taking a step which decreases the error.
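A possible sketch of this adaptive learning rate, equation (23), again on a toy error function with illustrative parameter values:

% Minimal sketch: adaptive learning rate, equation (23), on a toy error.
E = @(w) w' * w; gradE = @(w) 2 * w;
w = [1; -2]; gamma = 1.0; rho = 1.1; mu = 0.5;
for t = 1:100
    wNew = w - gamma * gradE(w);        % tentative gradient descent step
    if E(wNew) < E(w)
        w = wNew; gamma = rho * gamma;  % error decreased: accept, grow the rate
    else
        gamma = mu * gamma;             % error increased: undo step, shrink rate
    end
end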

4.3.6 Newton's Method

Although gradient descent has been improved in many ways, it is considered a poor optimization algorithm, as pointed out in [3] and [5]. But optimization is a widely studied area and, thus, there are more efficient algorithms. Conjugate gradients makes implicit use of second-order information. Newton's method and so-called quasi-Newton methods explicitly use the Hessian of the error function in each iteration step. Some of these algorithms are described in detail in [7]. We discuss the general idea of Newton's method.

Newton's method is an iterative optimization algorithm based on a quadratic approximation of E using the Taylor series. Let H(w[t]) denote the Hessian of the error function evaluated at w[t]. Then, by taking the first three terms of the Taylor series^6, we obtain a quadratic approximation of E around the point w[t] [7]:

Ẽ(w[t] + Δw[t]) = E(w[t]) + ∇E(w[t])^T Δw[t] + (1/2) (Δw[t])^T H(w[t]) Δw[t].    (25)

We want to choose the weight update Δw[t] such that the quadratic approximation Ẽ is minimized. The necessary criterion gives us

∇Ẽ(w[t] + Δw[t]) = ∇E(w[t]) + H(w[t]) Δw[t] = 0.    (26)

The solution is given by

Δw[t] = −γ H(w[t])^{−1} ∇E(w[t])    (27)

where H(w[t])^{−1} is the inverse of the Hessian at the point w[t]. Because we use a quadratic approximation, equation (27) needs to be applied iteratively. When choosing w[0] sufficiently close to the global minimum of E, Newton's method converges quadratically [7]. Nevertheless, Newton's method has several drawbacks. As we see later, the evaluation of the Hessian matrix and its inversion are rather expensive^7. In addition, the Newton step in equation (27) may be a step in the direction of a local maximum or saddle point. This may happen if the Hessian matrix is not positive definite [3].

4.4 Error Backpropagation

Error backpropagation is an efficient algorithm to evaluate the gradient ∇E_n for multilayer perceptrons with differentiable activation functions. Following [5], we consider the i-th output unit first. The derivative of an arbitrary error function E_n with respect to the weight w_{ij}^(L+1) is given by the chain rule as

∂E_n/∂w_{ij}^(L+1) = (∂E_n/∂z_i^(L+1)) (∂z_i^(L+1)/∂w_{ij}^(L+1))    (28)

where z_i^(L+1) denotes the actual input of the i-th unit within the output layer. Using the chain rule again, the first factor in equation (28) can be written as

δ_i^(L+1) := ∂E_n/∂z_i^(L+1) = (∂E_n/∂y_i^(L+1)) (∂y_i^(L+1)/∂z_i^(L+1)) = (∂E_n/∂y_i^(L+1)) f'(z_i^(L+1))    (29)

^6 In general, the Taylor series of a smooth function f at the point x_0 is given by T(x) = Σ_{k=0}^{∞} (f^(k)(x_0) / k!) (x − x_0)^k.    (24)
^7 The inversion of an n × n matrix usually requires O(n^3) operations.

Figure 7: Once evaluated for all output units, the errors can be propagated backwards according to equation (34). The derivatives of all weights in layer l can then be determined using equation (31).

where δ_i^(L+1) is often called the error and describes the influence of the i-th output unit on the total error E_n. The second factor of equation (28) takes the form

∂z_i^(L+1)/∂w_{ij}^(L+1) = ∂/∂w_{ij}^(L+1) ( Σ_{k=0}^{m^(L)} w_{ik}^(L+1) y_k^(L) ) = y_j^(L)    (30)

because all summands with k ≠ j are constant with respect to w_{ij}^(L+1). By substituting both factors (29) and (30) into equation (28), we get

∂E_n/∂w_{ij}^(L+1) = δ_i^(L+1) y_j^(L).    (31)

The errors δ_i^(L+1) are usually easy to compute, for example when choosing the error function and output activation function as noted in section 4.1. Considering the i-th hidden unit within an arbitrary hidden layer l, we write

∂E_n/∂w_{ij}^(l) = (∂E_n/∂z_i^(l)) (∂z_i^(l)/∂w_{ij}^(l))    (32)

and notice that the second factor in equation (32) takes the same form as in equation (30):

∂z_i^(l)/∂w_{ij}^(l) = ∂/∂w_{ij}^(l) ( Σ_{k=0}^{m^(l−1)} w_{ik}^(l) y_k^(l−1) ) = y_j^(l−1).    (33)

Thus, only the error δ_i^(l) changes. Using the chain rule we get

δ_i^(l) := ∂E_n/∂z_i^(l) = Σ_{k=1}^{m^(l+1)} (∂E_n/∂z_k^(l+1)) (∂z_k^(l+1)/∂z_i^(l)).    (34)

In general, the sum runs over all units directly succeeding the i-th unit in layer l [5]. But we assume that every unit in layer l is connected to every unit in layer (l + 1), where we allow the corresponding weights w_{ki}^(l+1) to vanish. Altogether, by substituting the definition of δ_k^(l+1) from equation (34), we have

δ_i^(l) = f'(z_i^(l)) Σ_{k=1}^{m^(l+1)} w_{ki}^(l+1) δ_k^(l+1).    (35)

In conclusion, we have found a recursive algorithm for calculating the gradient ∇E_n by propagating the errors δ_i^(L+1) back through the network using equation (35) and evaluating the derivatives ∂E_n/∂w_{ij}^(l) using equation (32).
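The recursion can be summarized in a few lines of MatLab for a two-layer perceptron with logistic sigmoid units and the sum-of-squared error. This is only a sketch with assumed shapes and names, and it omits the biases for brevity; the full training procedure is shown in appendix A.2.

% Minimal sketch: error backpropagation for a two-layer perceptron using
% equations (29), (31), (33) and (35); shapes and names are assumptions.
sigma = @(z) 1 ./ (1 + exp(-z)); dsigma = @(z) sigma(z) .* (1 - sigma(z));
W1 = rand(10, 784); W2 = rand(10, 10);    % weights (biases omitted for brevity)
x = rand(784, 1); t = zeros(10, 1); t(3) = 1;
z1 = W1 * x;  y1 = sigma(z1);             % forward pass, hidden layer
z2 = W2 * y1; y2 = sigma(z2);             % forward pass, output layer
delta2 = dsigma(z2) .* (y2 - t);          % output errors, equation (29)
delta1 = dsigma(z1) .* (W2' * delta2);    % backpropagated errors, equation (35)
gradW2 = delta2 * y1';                    % dE_n/dW2, equation (31)
gradW1 = delta1 * x';                     % dE_n/dW1, equations (32) and (33)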

4.4.1 Efficiency

The number of operations required for error backpropagation scales with the total number of weights W and the number of units. Considering a sufficiently large multilayer perceptron such that every unit in layer l is connected to every unit in layer (l + 1), the cost of propagating an input value through the network is dominated by W. Here we neglect the evaluation cost of the activation functions. Because of the weighted sum in equation (34), backpropagating the error is dominated by W as well. This leads to an overall effort linear in the total number of weights: O(W) [5].

4.5 The Hessian

Figure 8: To derive an algorithm for evaluating the Hessian we consider two weights w_{ij}^(p) and w_{rs}^(l), where we assume that p ≤ l. The figure illustrates the case p < (l − 1).

Following [2], we derive an algorithm which allows us to evaluate the Hessian exactly by forward and backward propagation. Again, we assume differentiable activation functions within a multilayer perceptron. We start by picking two weights w_{ij}^(p) and w_{rs}^(l), where we assume the layer p to come before layer l, that is p ≤ l. The case p > l follows by the symmetry of the Hessian [2]. Figure 8 illustrates this setting for p < (l − 1). As the error E_n depends on w_{ij}^(p) only through the actual input z_i^(p) of the i-th unit in layer p, we start by writing

∂²E_n / (∂w_{ij}^(p) ∂w_{rs}^(l)) = ∂/∂w_{ij}^(p) ( ∂E_n/∂w_{rs}^(l) ) = y_j^(p−1) ∂/∂z_i^(p) ( ∂E_n/∂w_{rs}^(l) ).    (36)

Now we recall the definition of δ_r^(l) in equation (34) and add two more definitions:

g_s^(l−1,p) := ∂z_s^(l−1)/∂z_i^(p)  and  b_r^(l,p) := ∂δ_r^(l)/∂z_i^(p).    (37)

As E_n depends on w_{rs}^(l) only through the value of z_r^(l), we use the chain rule and write

∂²E_n / (∂w_{ij}^(p) ∂w_{rs}^(l)) = y_j^(p−1) ∂/∂z_i^(p) [ (∂z_r^(l)/∂w_{rs}^(l)) (∂E_n/∂z_r^(l)) ] = y_j^(p−1) ∂/∂z_i^(p) [ y_s^(l−1) δ_r^(l) ].    (38)

Because the output y_j^(p−1) of the j-th unit in layer (p − 1) does not depend on z_i^(p), we can use the product rule and plug in the definitions from equation (37):

∂²E_n / (∂w_{ij}^(p) ∂w_{rs}^(l)) = y_j^(p−1) f'(z_s^(l−1)) g_s^(l−1,p) δ_r^(l) + y_j^(p−1) y_s^(l−1) b_r^(l,p).    (39)

Following this derivation, Bishop suggests a simple algorithm to evaluate the g_s^(l−1,p) and the b_r^(l,p) by forward and backward propagation. Using the chain rule we can write

g_s^(l−1,p) = ∂z_s^(l−1)/∂z_i^(p) = Σ_{k=1}^{m^(l−2)} (∂z_s^(l−1)/∂z_k^(l−2)) (∂z_k^(l−2)/∂z_i^(p)) = Σ_{k=1}^{m^(l−2)} f'(z_k^(l−2)) w_{sk}^(l−1) g_k^(l−2,p).    (40)

Initially, we set g_i^(p,p) := 1 and g_s^(l,p) := 0 for l ≤ p and s ≠ i. Thus, the g_s^(l−1,p) can be obtained by forward propagating them through the network. Using the definition of δ_r^(l) from equation (34), the product rule and the definitions from equation (37), we write

b_r^(l,p) = ∂δ_r^(l)/∂z_i^(p) = f''(z_r^(l)) g_r^(l,p) Σ_{k=1}^{m^(l+1)} w_{kr}^(l+1) δ_k^(l+1) + f'(z_r^(l)) Σ_{k=1}^{m^(l+1)} w_{kr}^(l+1) b_k^(l+1,p).    (41)

Using equation (41) we can backpropagate the b_r^(l,p) after evaluating b_i^(L+1,p) for all output units i:

b_i^(L+1,p) = ∂δ_i^(L+1)/∂z_i^(p) = (∂²E_n / ∂(z_i^(L+1))²) g_i^(L+1,p) = [ f''(z_i^(L+1)) ∂E_n/∂y_i^(L+1) + (f'(z_i^(L+1)))² ∂²E_n/∂(y_i^(L+1))² ] g_i^(L+1,p).    (42)

The errors δ_r^(l) can be evaluated as seen in section 4.4 using error backpropagation. Altogether, we are able to evaluate the Hessian of E_n by forward propagating the g_s^(l−1,p) and backward propagating the b_r^(l,p) as well as the errors δ_r^(l) [2].

4.5.1 Efficiency

We require one step of forward propagation and backward propagation per unit within the network. Again, assuming a sufficiently large network such that the number of weights W dominates, the evaluation of equation (39) requires O(W²) operations.

4.6 Regularization

Now we focus on the ability to generalize. We saw that, theoretically, a neural network with a sufficiently large number of hidden units can model every continuous target function. The ability of a network to generalize can be observed using unseen test data not used for training, usually called a validation set. If the trained approximation of the target function works well on the validation set, the network is said to have good generalization capabilities [8]. Overtraining of the network may result in over-fitting of the training set, that is, the network memorizes the training set but performs very poorly on the validation set [8]. Regularization tries to avoid over-fitting and to give better generalization performance. This can be achieved by controlling the complexity of the network, which is mainly determined by the number of hidden units. The simplest form of regularization is adding a regularizer to the error function such that

Ê(w) = E(w) + η P(w)    (43)

where P(w) is a penalty term which influences the form and complexity of the solution [3, p. 338].

4.6.1 L2-Regularization

Large weights tend to result in an approximation with poor generalization capabilities. Therefore, we penalize large weights such that we have [3]:

Ê(w) = E(w) + η w^T w.    (44)

Figure 9: The error on the validation set (red) usually increases once the network begins to over-fit the training set. Thus, although the error on the training set (blue) decreases monotonically, we may want to stop training early to avoid over-fitting.

This regularizer is also referred to as weight decay. To understand why, we consider the derivative of the error function Ê, which is given by

∇Ê = ∇E + ηw.    (45)

Neglecting ∇E and considering gradient descent learning, the change in w with respect to the iteration step t can be described as

∂w[t]/∂t = −γηw[t].    (46)

Equation (46) has the solution w[t] = w[0] e^{−γηt}, such that the weights tend exponentially to zero, giving the method its name [3]. While weight decay represents L2-regularization, we could also consider L1-regularization.

4.6.2 Early Stopping

Early stopping describes the approach of stopping training before gradient descent has finished. The error on the training set is usually monotonically decreasing with respect to the iteration number. For unseen data the error is usually much higher and not monotonically decreasing; it tends to increase when we reach a state of over-fitting. Therefore, it seems reasonable to stop training at the point where the error on the validation set reaches a minimum [3]. This is illustrated in figure 9.
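A combined MatLab sketch of weight decay, equation (45), and early stopping might look as follows; the error functions and all parameter values are toy assumptions, and the validation error is assumed to be available as a function of the weights.

% Minimal sketch: gradient descent with the weight decay term of equation
% (45), keeping the weights with the lowest validation error (early stopping).
gradE = @(w) 2 * (w - 1); validationError = @(w) sum((w - 1.2).^2);
w = zeros(2, 1); gamma = 0.1; eta = 0.01;
bestError = inf; bestW = w;
for t = 1:1000
    w = w - gamma * (gradE(w) + eta * w);    % update with weight decay, eq. (45)
    if validationError(w) < bestError        % remember the best weights seen
        bestError = validationError(w); bestW = w;
    end
end
w = bestW;                                   % early stopping: revert to best w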

5 Pattern Classification

In this section we take a closer look at pattern classification. The classification problem can be defined as follows: given a D-dimensional input vector x, assign it to one of C discrete classes. We refer to a given class by its number c, which lies in the range 1 ≤ c ≤ C. The input vector x is also called pattern or observation.

5.1 Statistical Background

Using a statistical approach, we assume the pattern x and the class c to be random variables. Then p(x) is the probability of observing the pattern x and p(c) is the probability of observing a pattern belonging to class c [18]. The classification task is to assign a pattern x to its corresponding class. This can be accomplished by considering the class-conditional probability p(c|x), that is, the probability of pattern x belonging to class c [18]. By applying Bayes' theorem we can rewrite this probability to give

p(c|x) = p(x|c) p(c) / p(x).    (47)

Then p(c|x) can be interpreted as posterior probability, where p(c) is the prior probability. The prior probability represents the probability of class c before making an observation; the posterior probability describes the probability of class c after observing the pattern x [5].

5.2 Bayes Decision Rule

Given the posterior probabilities p(c|x) for all classes c, we can determine the class of x using a decision rule. We want to avoid decision errors: a decision error occurs if an observation vector x is assigned to the wrong class. Bayes' decision rule, given by

c : R^D → {1, ..., C}, x ↦ argmax_{1 ≤ c ≤ C} {p(c|x)},    (48)

results in the minimum number of decision errors [18, p. 109]. But it assumes the true posterior probabilities to be known. In practice the posterior probabilities p(c|x) are unknown. Therefore, we may model the probability p(c|x) directly (or indirectly by modeling p(x|c) first) by means of a so-called model distribution q_θ(c|x), which depends on some parameters θ [17]. Then we apply the model-based decision rule given by

c_θ : R^D → {1, ..., C}, x ↦ argmax_{1 ≤ c ≤ C} {q_θ(c|x)}.    (49)

Here we use the model distribution and, thus, the decision rule is fully dependent on the parameters θ [17, p. 638].
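Applied to the network outputs interpreted as q_θ(c|x), the model-based decision rule of equation (49) is a simple argmax; a minimal sketch with illustrative posteriors:

% Minimal sketch: the model-based decision rule of equation (49).
q = [0.1; 0.7; 0.2];    % modeled posteriors q_theta(c|x) for C = 3 classes
[~, c] = max(q);        % argmax over the classes, equation (49)
disp(c);                % the class the pattern is assigned to (here: 2)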

5.3 Maximum Likelihood Estimation

Using a maximum likelihood approach, we can estimate the unknown parameters of the model distribution. To this end, we assume that the posterior probabilities are given by the model distribution q_θ(c|x) and that the patterns x_n are drawn independently from this distribution. We say the data points x_n are independent and identically distributed [6]. Then the likelihood function is given by

L(θ) = Π_{n=1}^{N} q_θ(c_n|x_n)    (50)

where c_n denotes the corresponding class of pattern x_n. We want to maximize L(θ). This is equivalent to minimizing the negative log-likelihood function, which is given by

−log(L(θ)) = −Σ_{n=1}^{N} log(q_θ(c_n|x_n)).    (51)

Thus, we can use the negative log-likelihood function as error function to train a neural network. Both the sum-of-squared error function and the cross-entropy error function can be motivated using a maximum likelihood approach. We derive only the cross-entropy error function in the next section.

5.3.1 Derivation of Cross-Entropy

We follow [5] and consider the multiclass^8 classification problem. The target values t_n follow the 1-of-C encoding scheme, that is, the c-th entry of t_n equals 1 iff the pattern x_n belongs to class c. We interpret the output of the neural network as the posterior probabilities:

p(c|x) = y_c(x, w).    (52)

The posterior probability for class c can then be written as

p(c|x) = p(x|c) p(c) / Σ_{k=1}^{C} p(x|k) p(k).    (53)

As output activation function we use the softmax function, such that we can rewrite equation (53) to give

p(c|x) = exp(z_c) / Σ_{k=1}^{C} exp(z_k)  with  z_k = log(p(x|k) p(k)).    (54)

Here we model the posteriors by means of the network output. This means that the parameters of the model distribution are given by the weights of the network. Given the probabilities p(c|x), the likelihood function takes the form

L(w) = Π_{n=1}^{N} Π_{c=1}^{C} p(c|x_n)^{t_{nc}} = Π_{n=1}^{N} Π_{c=1}^{C} y_c(x_n, w)^{t_{nc}}    (55)

and we can derive the error function E(w), which is given by the cross-entropy error function already introduced in section 4.1:

E(w) = −log(L(w)) = −Σ_{n=1}^{N} Σ_{c=1}^{C} t_{nc} ln(y_c(x_n, w)).    (56)
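For 1-of-C encoded targets, the negative log-likelihood of equation (51) and the cross-entropy of equation (56) coincide, which the following small sketch with made-up network outputs illustrates numerically.

% Minimal sketch: negative log-likelihood (51) equals cross-entropy (56)
% for 1-of-C targets; outputs and classes are illustrative.
Y = [0.7 0.2; 0.2 0.5; 0.1 0.3];    % outputs y_c(x_n, w) for N = 2 patterns
T = [1 0; 0 1; 0 0];                % 1-of-C targets t_nc (classes 1 and 2)
c = [1 2];                          % the same classes as indices c_n
negLogLikelihood = -sum(log(Y(sub2ind(size(Y), c, 1:2))));   % equation (51)
crossEntropy = -sum(sum(T .* log(Y)));                       % equation (56)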

Figure 10: The error during training for different learning rates γ and numbers of hidden units: (a) 100 units, γ = 0.5; (b) 300 units, γ = 0.5; (c) 400 units, γ = 0.5; (d) 500 units, γ = 0.5; (e) 600 units, γ = 0.5; (f) 700 units, γ = 0.5; (g) 750 units, γ = 0.1; (h) 800 units, γ = 0.1. The two-layer perceptron was trained with a batch size of 100 randomly chosen images, iterated 500 times. The error was measured using the sum-of-squared error function and plotted after each iteration.

5.4 Application: Recognizing Handwritten Digits

Based on the MNIST dataset (available online), we want to train a two-layer perceptron to recognize handwritten digits. The dataset provides a training set of 60,000 images and a validation set of 10,000 images. The images have 28 × 28 pixels and, thus, we may write an image as a vector with 28 · 28 = 784 dimensions. The neural network is trained using gradient descent for parameter optimization and the sum-of-squared error function. The training procedure is implemented in MatLab.

5.4.1 Matrix Notation

For implementing error backpropagation in MatLab it is convenient to introduce a matrix notation. We discuss the general case of a multilayer perceptron with L layers. The output of all units in layer l and their actual inputs can both be combined into vectors:

y^(l) := (y_i^(l))_{i=1}^{m^(l)}  and  z^(l) := (z_i^(l))_{i=1}^{m^(l)}.    (57)

When combining all weights w_{ij}^(l) of layer l in a single weight matrix, we can express the propagation of an input vector through the network as a sequence of matrix-vector products. The weight matrix W^(l) is given by

W^(l) := (w_{ij}^(l))_{i,j=1}^{m^(l), m^(l−1)}.    (58)

Propagating the output vector y^(l−1) of layer (l − 1) through layer l can then be written as

y^(l) = f(W^(l) y^(l−1))    (59)

where f operates element by element.

^8 Generally, we distinguish the binary classification problem in the case of C = 2 and the multiclass classification problem for C > 2.

Given the error vector δ^(l+1) for layer (l + 1), defined as

δ^(l+1) := (δ_i^(l+1))_{i=1}^{m^(l+1)},    (60)

we can use the transpose of the weight matrix to give

δ^(l) = f'(z^(l)) ⊙ (W^(l+1))^T δ^(l+1)    (61)

where f' operates element by element and ⊙ denotes the component-wise multiplication.

5.4.2 Implementation and Results

The MNIST dataset is provided in the IDX file format. To get the images as vectors with 784 dimensions we use the two functions loadMNISTImages and loadMNISTLabels, which can be found online. Appendix A.2 shows the implementation of the training procedure. The activation function and its derivative are passed as parameters. We use the logistic sigmoid, which is shown in appendix A.1, as activation function for both the hidden and the output layer. The number of hidden units and the learning rate are adjustable. The weight matrices for both layers are initialized randomly and normalized afterwards (lines 29 to 35). We apply stochastic training in so-called epochs, that is, for each epoch we train the network on randomly selected input values (lines 42 to 61). The number of randomly selected input values used for training is defined by the parameter batchSize. After each epoch we plot the current error on the same input values the network was trained on (lines 63 to 73). For comparing different configurations we use the validation set. The validation procedure, which counts the number of correctly classified images, is shown in appendix A.3. The accuracy is measured as the quotient of the number of correctly classified images and the total number of images. Figure 11 shows some of the results.

Figure 11: Figure 11(a) plots the achieved accuracy depending on the number of hidden units for a fixed learning rate γ = 0.5, after 500 epochs with batch size 100. For 750 or more hidden units we need to lower the learning rate in order to ensure convergence, which may result in slow convergence such that we get worse results than for fewer hidden units. Figure 11(b) shows the impact of increasing the number of epochs for a fixed learning rate γ = 0.5; as we would expect, the accuracy increases with rising training set size.
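A hypothetical driver script, not part of the paper, shows how the procedures of appendices A.2 and A.3 might be called. The file names, the conversion of the labels to 1-of-10 target vectors, the shift of the validation labels to the range 1 to 10, and all parameter values are assumptions for illustration; loadMNISTImages and loadMNISTLabels are assumed to be on the MatLab path.

% Hypothetical driver script (assumptions throughout, see the note above).
inputValues = loadMNISTImages('train-images.idx3-ubyte');     % 784 x 60000
labels = loadMNISTLabels('train-labels.idx1-ubyte');          % 60000 x 1
targetValues = zeros(10, size(labels, 1));                    % 1-of-10 encoding
targetValues(sub2ind(size(targetValues), labels' + 1, 1:size(labels, 1))) = 1;
[hiddenWeights, outputWeights] = trainStochasticSquaredErrorTwoLayerPerceptron(...
    @logisticSigmoid, @dLogisticSigmoid, 700, inputValues, targetValues, ...
    500, 100, 0.5);                                            % epochs, batch size, rate
testImages = loadMNISTImages('t10k-images.idx3-ubyte');
testLabels = loadMNISTLabels('t10k-labels.idx1-ubyte');
[correct, errors] = validateTwoLayerPerceptron(@logisticSigmoid, ...
    hiddenWeights, outputWeights, testImages, testLabels + 1); % classes 1..10
accuracy = correct / (correct + errors);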

6 Conclusion

In the course of this paper we introduced the general concept of artificial neural networks and had a closer look at the multilayer perceptron. We discussed several activation functions and network topologies and explored the expressive power of neural networks.

Section 4 introduced the basic notions of network training. We focused on supervised training and introduced two different error measures to evaluate the performance of the network with respect to function approximation. To train the network we introduced gradient descent as an iterative parameter optimization algorithm, as well as Newton's method as a second-order optimization algorithm. The weights of the network are adjusted iteratively to minimize the chosen error measure. To evaluate the gradient of the error measure we introduced the error backpropagation algorithm and extended it to allow the exact evaluation of the Hessian as well.

Thereafter, we discussed the classification problem using a statistical approach. Based on maximum likelihood estimation we derived the cross-entropy error measure for multiclass classification. As application we considered digit recognition based on the MNIST dataset using a two-layer perceptron.

Unfortunately, this paper is far too short to cover the extensive topic of neural networks and their application in pattern recognition. Especially for network training there are more advanced techniques available (see for example [3]). Instead of the exact evaluation of the Hessian described in [2], approximations to the Hessian or its inverse are more efficient to compute (see for example [3]). Based on a diagonal approximation of the Hessian there are advanced regularization methods, such as the Optimal Brain Damage algorithm [16]. In conclusion, artificial neural networks are a powerful tool applicable to a wide range of problems, especially in the domain of pattern recognition.

Appendices

A MatLab Implementation

A.1 Logistic Sigmoid and its Derivative

function y = logisticSigmoid(x)
% logisticSigmoid Logistic sigmoid activation function.
%
% INPUT:
% x : Input vector.
%
% OUTPUT:
% y : Output vector where the logistic sigmoid was applied element by
%     element.
%

    y = 1./(1 + exp(-x));
end

function y = dLogisticSigmoid(x)
% dLogisticSigmoid Derivative of the logistic sigmoid.
%
% INPUT:
% x : Input vector.
%
% OUTPUT:
% y : Output vector where the derivative of the logistic sigmoid was
%     applied element by element.
%
    y = logisticSigmoid(x).*(1 - logisticSigmoid(x));
end

A.2 Training Procedure

1 function [hiddenWeights, outputWeights, error] = trainStochasticSquaredErrorTwoLayerPerceptron(activationFunction, dActivationFunction, numberOfHiddenUnits, inputValues, targetValues, epochs, batchSize, learningRate)
2 % trainStochasticSquaredErrorTwoLayerPerceptron Creates a two-layer perceptron
3 % and trains it on the MNIST dataset.
4 %
5 % INPUT:
6 % activationFunction  : Activation function used in both layers.
7 % dActivationFunction : Derivative of the activation
8 %                       function used in both layers.
9 % numberOfHiddenUnits : Number of hidden units.
10 % inputValues         : Input values for training (784 x 60000).
11 % targetValues        : Target values for training (10 x 60000).
12 % epochs              : Number of epochs to train.
13 % batchSize           : Plot error after batchSize images.
14 % learningRate        : Learning rate to apply.
15 %
16 % OUTPUT:
17 % hiddenWeights : Weights of the hidden layer.
18 % outputWeights : Weights of the output layer.
19 %
20

21     % The number of training vectors.
22     trainingSetSize = size(inputValues, 2);
23
24     % Input vector has 784 dimensions.
25     inputDimensions = size(inputValues, 1);
26     % We have to distinguish 10 digits.
27     outputDimensions = size(targetValues, 1);
28
29     % Initialize the weights for the hidden layer and the output layer.
30     hiddenWeights = rand(numberOfHiddenUnits, inputDimensions);
31     outputWeights = rand(outputDimensions, numberOfHiddenUnits);
32
33     hiddenWeights = hiddenWeights./size(hiddenWeights, 2);
34     outputWeights = outputWeights./size(outputWeights, 2);
35
36     n = zeros(batchSize);
37
38     figure; hold on;
39
40     for t = 1: epochs
41         for k = 1: batchSize
42             % Select which input vector to train on.
43             n(k) = floor(rand(1)*trainingSetSize + 1);
44
45             % Propagate the input vector through the network.
46             inputVector = inputValues(:, n(k));
47             hiddenActualInput = hiddenWeights*inputVector;
48             hiddenOutputVector = activationFunction(hiddenActualInput);
49             outputActualInput = outputWeights*hiddenOutputVector;
50             outputVector = activationFunction(outputActualInput);
51
52             targetVector = targetValues(:, n(k));
53
54             % Backpropagate the errors.
55             outputDelta = dActivationFunction(outputActualInput).*(outputVector - targetVector);
56             hiddenDelta = dActivationFunction(hiddenActualInput).*(outputWeights'*outputDelta);
57
58             outputWeights = outputWeights - learningRate.*outputDelta*hiddenOutputVector';
59             hiddenWeights = hiddenWeights - learningRate.*hiddenDelta*inputVector';
60         end;
61
62         % Calculate the error for plotting.
63         error = 0;
64         for k = 1: batchSize
65             inputVector = inputValues(:, n(k));
66             targetVector = targetValues(:, n(k));
67
68             error = error + norm(activationFunction(outputWeights*activationFunction(hiddenWeights*inputVector)) - targetVector, 2);
69         end;
70         error = error/batchSize;
71
72         plot(t, error, '*');
73     end;
74 end

A.3 Validation Procedure

function [correctlyClassified, classificationErrors] = validateTwoLayerPerceptron(activationFunction, hiddenWeights, outputWeights, inputValues, labels)
% validateTwoLayerPerceptron Validate the two-layer perceptron using the
% validation set.
%
% INPUT:
% activationFunction : Activation function used in both layers.
% hiddenWeights      : Weights of the hidden layer.
% outputWeights      : Weights of the output layer.
% inputValues        : Input values for validation (784 x 10000).
% labels             : Labels for validation (1 x 10000).
%
% OUTPUT:
% correctlyClassified  : Number of correctly classified values.
% classificationErrors : Number of classification errors.
%

    testSetSize = size(inputValues, 2);
    classificationErrors = 0;
    correctlyClassified = 0;

    for n = 1: testSetSize
        inputVector = inputValues(:, n);
        outputVector = evaluateTwoLayerPerceptron(activationFunction, hiddenWeights, outputWeights, inputVector);

        class = decisionRule(outputVector);
        if class == labels(n)
            correctlyClassified = correctlyClassified + 1;
        else
            classificationErrors = classificationErrors + 1;
        end;
    end;
end

function class = decisionRule(outputVector)
% decisionRule Model based decision rule.
%
% INPUT:
% outputVector : Output vector of the network.
%
% OUTPUT:
% class : Class the vector is assigned to.
%

    max = 0;
    class = 1;
    for i = 1: size(outputVector, 1)
        if outputVector(i) > max
            max = outputVector(i);
            class = i;
        end;
    end;
end

function outputVector = evaluateTwoLayerPerceptron(activationFunction, hiddenWeights, outputWeights, inputVector)
% evaluateTwoLayerPerceptron Evaluate the two-layer perceptron given by the
% weights using the given activation function.
%
% INPUT:
% activationFunction : Activation function used in both layers.
% hiddenWeights      : Weights of the hidden layer.
% outputWeights      : Weights of the output layer.
% inputVector        : Input vector to evaluate.
%
% OUTPUT:
% outputVector : Output of the perceptron.
%

    outputVector = activationFunction(outputWeights*activationFunction(hiddenWeights*inputVector));
end


More information

Chapter Newton s Method

Chapter Newton s Method Chapter 9. Newton s Method After readng ths chapter, you should be able to:. Understand how Newton s method s dfferent from the Golden Secton Search method. Understand how Newton s method works 3. Solve

More information

Online Classification: Perceptron and Winnow

Online Classification: Perceptron and Winnow E0 370 Statstcal Learnng Theory Lecture 18 Nov 8, 011 Onlne Classfcaton: Perceptron and Wnnow Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton In ths lecture we wll start to study the onlne learnng

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

A PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS

A PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS HCMC Unversty of Pedagogy Thong Nguyen Huu et al. A PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS Thong Nguyen Huu and Hao Tran Van Department of mathematcs-nformaton,

More information

Homework Assignment 3 Due in class, Thursday October 15

Homework Assignment 3 Due in class, Thursday October 15 Homework Assgnment 3 Due n class, Thursday October 15 SDS 383C Statstcal Modelng I 1 Rdge regresson and Lasso 1. Get the Prostrate cancer data from http://statweb.stanford.edu/~tbs/elemstatlearn/ datasets/prostate.data.

More information

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering /

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering / Theory and Applcatons of Pattern Recognton 003, Rob Polkar, Rowan Unversty, Glassboro, NJ Lecture 4 Bayes Classfcaton Rule Dept. of Electrcal and Computer Engneerng 0909.40.0 / 0909.504.04 Theory & Applcatons

More information

Radial-Basis Function Networks

Radial-Basis Function Networks Radal-Bass uncton Networs v.0 March 00 Mchel Verleysen Radal-Bass uncton Networs - Radal-Bass uncton Networs p Orgn: Cover s theorem p Interpolaton problem p Regularzaton theory p Generalzed RBN p Unversal

More information

SDMML HT MSc Problem Sheet 4

SDMML HT MSc Problem Sheet 4 SDMML HT 06 - MSc Problem Sheet 4. The recever operatng characterstc ROC curve plots the senstvty aganst the specfcty of a bnary classfer as the threshold for dscrmnaton s vared. Let the data space be

More information

Neural Networks & Learning

Neural Networks & Learning Neural Netorks & Learnng. Introducton The basc prelmnares nvolved n the Artfcal Neural Netorks (ANN) are descrbed n secton. An Artfcal Neural Netorks (ANN) s an nformaton-processng paradgm that nspred

More information

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve

More information

1 GSW Iterative Techniques for y = Ax

1 GSW Iterative Techniques for y = Ax 1 for y = A I m gong to cheat here. here are a lot of teratve technques that can be used to solve the general case of a set of smultaneous equatons (wrtten n the matr form as y = A), but ths chapter sn

More information

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z )

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z ) C4B Machne Learnng Answers II.(a) Show that for the logstc sgmod functon dσ(z) dz = σ(z) ( σ(z)) A. Zsserman, Hlary Term 20 Start from the defnton of σ(z) Note that Then σ(z) = σ = dσ(z) dz = + e z e z

More information

4DVAR, according to the name, is a four-dimensional variational method.

4DVAR, according to the name, is a four-dimensional variational method. 4D-Varatonal Data Assmlaton (4D-Var) 4DVAR, accordng to the name, s a four-dmensonal varatonal method. 4D-Var s actually a drect generalzaton of 3D-Var to handle observatons that are dstrbuted n tme. The

More information

Ensemble Methods: Boosting

Ensemble Methods: Boosting Ensemble Methods: Boostng Ncholas Ruozz Unversty of Texas at Dallas Based on the sldes of Vbhav Gogate and Rob Schapre Last Tme Varance reducton va baggng Generate new tranng data sets by samplng wth replacement

More information

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) Maxmum Lkelhood Estmaton (MLE) Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A Wnter 01 UCSD Statstcal Learnng Goal: Gven a relatonshp between a feature vector x and a vector y, and d data samples (x,y

More information

VQ widely used in coding speech, image, and video

VQ widely used in coding speech, image, and video at Scalar quantzers are specal cases of vector quantzers (VQ): they are constraned to look at one sample at a tme (memoryless) VQ does not have such constrant better RD perfomance expected Source codng

More information

Numerical Heat and Mass Transfer

Numerical Heat and Mass Transfer Master degree n Mechancal Engneerng Numercal Heat and Mass Transfer 06-Fnte-Dfference Method (One-dmensonal, steady state heat conducton) Fausto Arpno f.arpno@uncas.t Introducton Why we use models and

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

CHALMERS, GÖTEBORGS UNIVERSITET. SOLUTIONS to RE-EXAM for ARTIFICIAL NEURAL NETWORKS. COURSE CODES: FFR 135, FIM 720 GU, PhD

CHALMERS, GÖTEBORGS UNIVERSITET. SOLUTIONS to RE-EXAM for ARTIFICIAL NEURAL NETWORKS. COURSE CODES: FFR 135, FIM 720 GU, PhD CHALMERS, GÖTEBORGS UNIVERSITET SOLUTIONS to RE-EXAM for ARTIFICIAL NEURAL NETWORKS COURSE CODES: FFR 35, FIM 72 GU, PhD Tme: Place: Teachers: Allowed materal: Not allowed: January 2, 28, at 8 3 2 3 SB

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm The Expectaton-Maxmaton Algorthm Charles Elan elan@cs.ucsd.edu November 16, 2007 Ths chapter explans the EM algorthm at multple levels of generalty. Secton 1 gves the standard hgh-level verson of the algorthm.

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

The exam is closed book, closed notes except your one-page cheat sheet.

The exam is closed book, closed notes except your one-page cheat sheet. CS 89 Fall 206 Introducton to Machne Learnng Fnal Do not open the exam before you are nstructed to do so The exam s closed book, closed notes except your one-page cheat sheet Usage of electronc devces

More information

Singular Value Decomposition: Theory and Applications

Singular Value Decomposition: Theory and Applications Sngular Value Decomposton: Theory and Applcatons Danel Khashab Sprng 2015 Last Update: March 2, 2015 1 Introducton A = UDV where columns of U and V are orthonormal and matrx D s dagonal wth postve real

More information

Boostrapaggregating (Bagging)

Boostrapaggregating (Bagging) Boostrapaggregatng (Baggng) An ensemble meta-algorthm desgned to mprove the stablty and accuracy of machne learnng algorthms Can be used n both regresson and classfcaton Reduces varance and helps to avod

More information

Outline. Communication. Bellman Ford Algorithm. Bellman Ford Example. Bellman Ford Shortest Path [1]

Outline. Communication. Bellman Ford Algorithm. Bellman Ford Example. Bellman Ford Shortest Path [1] DYNAMIC SHORTEST PATH SEARCH AND SYNCHRONIZED TASK SWITCHING Jay Wagenpfel, Adran Trachte 2 Outlne Shortest Communcaton Path Searchng Bellmann Ford algorthm Algorthm for dynamc case Modfcatons to our algorthm

More information

Assortment Optimization under MNL

Assortment Optimization under MNL Assortment Optmzaton under MNL Haotan Song Aprl 30, 2017 1 Introducton The assortment optmzaton problem ams to fnd the revenue-maxmzng assortment of products to offer when the prces of products are fxed.

More information

Some Comments on Accelerating Convergence of Iterative Sequences Using Direct Inversion of the Iterative Subspace (DIIS)

Some Comments on Accelerating Convergence of Iterative Sequences Using Direct Inversion of the Iterative Subspace (DIIS) Some Comments on Acceleratng Convergence of Iteratve Sequences Usng Drect Inverson of the Iteratve Subspace (DIIS) C. Davd Sherrll School of Chemstry and Bochemstry Georga Insttute of Technology May 1998

More information

Fundamentals of Neural Networks

Fundamentals of Neural Networks Fundamentals of Neural Networks Xaodong Cu IBM T. J. Watson Research Center Yorktown Heghts, NY 10598 Fall, 2018 Outlne Feedforward neural networks Forward propagaton Neural networks as unversal approxmators

More information

Topic 5: Non-Linear Regression

Topic 5: Non-Linear Regression Topc 5: Non-Lnear Regresson The models we ve worked wth so far have been lnear n the parameters. They ve been of the form: y = Xβ + ε Many models based on economc theory are actually non-lnear n the parameters.

More information

CSE 546 Midterm Exam, Fall 2014(with Solution)

CSE 546 Midterm Exam, Fall 2014(with Solution) CSE 546 Mdterm Exam, Fall 014(wth Soluton) 1. Personal nfo: Name: UW NetID: Student ID:. There should be 14 numbered pages n ths exam (ncludng ths cover sheet). 3. You can use any materal you brought:

More information

Maximal Margin Classifier

Maximal Margin Classifier CS81B/Stat41B: Advanced Topcs n Learnng & Decson Makng Mamal Margn Classfer Lecturer: Mchael Jordan Scrbes: Jana van Greunen Corrected verson - /1/004 1 References/Recommended Readng 1.1 Webstes www.kernel-machnes.org

More information

Hopfield networks and Boltzmann machines. Geoffrey Hinton et al. Presented by Tambet Matiisen

Hopfield networks and Boltzmann machines. Geoffrey Hinton et al. Presented by Tambet Matiisen Hopfeld networks and Boltzmann machnes Geoffrey Hnton et al. Presented by Tambet Matsen 18.11.2014 Hopfeld network Bnary unts Symmetrcal connectons http://www.nnwj.de/hopfeld-net.html Energy functon The

More information

RBF Neural Network Model Training by Unscented Kalman Filter and Its Application in Mechanical Fault Diagnosis

RBF Neural Network Model Training by Unscented Kalman Filter and Its Application in Mechanical Fault Diagnosis Appled Mechancs and Materals Submtted: 24-6-2 ISSN: 662-7482, Vols. 62-65, pp 2383-2386 Accepted: 24-6- do:.428/www.scentfc.net/amm.62-65.2383 Onlne: 24-8- 24 rans ech Publcatons, Swtzerland RBF Neural

More information

Design and Optimization of Fuzzy Controller for Inverse Pendulum System Using Genetic Algorithm

Design and Optimization of Fuzzy Controller for Inverse Pendulum System Using Genetic Algorithm Desgn and Optmzaton of Fuzzy Controller for Inverse Pendulum System Usng Genetc Algorthm H. Mehraban A. Ashoor Unversty of Tehran Unversty of Tehran h.mehraban@ece.ut.ac.r a.ashoor@ece.ut.ac.r Abstract:

More information

CSC321 Tutorial 9: Review of Boltzmann machines and simulated annealing

CSC321 Tutorial 9: Review of Boltzmann machines and simulated annealing CSC321 Tutoral 9: Revew of Boltzmann machnes and smulated annealng (Sldes based on Lecture 16-18 and selected readngs) Yue L Emal: yuel@cs.toronto.edu Wed 11-12 March 19 Fr 10-11 March 21 Outlne Boltzmann

More information

Neural Networks. Perceptrons and Backpropagation. Silke Bussen-Heyen. 5th of Novemeber Universität Bremen Fachbereich 3. Neural Networks 1 / 17

Neural Networks. Perceptrons and Backpropagation. Silke Bussen-Heyen. 5th of Novemeber Universität Bremen Fachbereich 3. Neural Networks 1 / 17 Neural Networks Perceptrons and Backpropagaton Slke Bussen-Heyen Unverstät Bremen Fachberech 3 5th of Novemeber 2012 Neural Networks 1 / 17 Contents 1 Introducton 2 Unts 3 Network structure 4 Snglelayer

More information

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING 1 ADVANCED ACHINE LEARNING ADVANCED ACHINE LEARNING Non-lnear regresson technques 2 ADVANCED ACHINE LEARNING Regresson: Prncple N ap N-dm. nput x to a contnuous output y. Learn a functon of the type: N

More information

Estimating the Fundamental Matrix by Transforming Image Points in Projective Space 1

Estimating the Fundamental Matrix by Transforming Image Points in Projective Space 1 Estmatng the Fundamental Matrx by Transformng Image Ponts n Projectve Space 1 Zhengyou Zhang and Charles Loop Mcrosoft Research, One Mcrosoft Way, Redmond, WA 98052, USA E-mal: fzhang,cloopg@mcrosoft.com

More information

A Robust Method for Calculating the Correlation Coefficient

A Robust Method for Calculating the Correlation Coefficient A Robust Method for Calculatng the Correlaton Coeffcent E.B. Nven and C. V. Deutsch Relatonshps between prmary and secondary data are frequently quantfed usng the correlaton coeffcent; however, the tradtonal

More information

Solving Nonlinear Differential Equations by a Neural Network Method

Solving Nonlinear Differential Equations by a Neural Network Method Solvng Nonlnear Dfferental Equatons by a Neural Network Method Luce P. Aarts and Peter Van der Veer Delft Unversty of Technology, Faculty of Cvlengneerng and Geoscences, Secton of Cvlengneerng Informatcs,

More information

Other NN Models. Reinforcement learning (RL) Probabilistic neural networks

Other NN Models. Reinforcement learning (RL) Probabilistic neural networks Other NN Models Renforcement learnng (RL) Probablstc neural networks Support vector machne (SVM) Renforcement learnng g( (RL) Basc deas: Supervsed dlearnng: (delta rule, BP) Samples (x, f(x)) to learn

More information

Logistic Classifier CISC 5800 Professor Daniel Leeds

Logistic Classifier CISC 5800 Professor Daniel Leeds lon 9/7/8 Logstc Classfer CISC 58 Professor Danel Leeds Classfcaton strategy: generatve vs. dscrmnatve Generatve, e.g., Bayes/Naïve Bayes: 5 5 Identfy probablty dstrbuton for each class Determne class

More information

Inexact Newton Methods for Inverse Eigenvalue Problems

Inexact Newton Methods for Inverse Eigenvalue Problems Inexact Newton Methods for Inverse Egenvalue Problems Zheng-jan Ba Abstract In ths paper, we survey some of the latest development n usng nexact Newton-lke methods for solvng nverse egenvalue problems.

More information

Hidden Markov Models & The Multivariate Gaussian (10/26/04)

Hidden Markov Models & The Multivariate Gaussian (10/26/04) CS281A/Stat241A: Statstcal Learnng Theory Hdden Markov Models & The Multvarate Gaussan (10/26/04) Lecturer: Mchael I. Jordan Scrbes: Jonathan W. Hu 1 Hdden Markov Models As a bref revew, hdden Markov models

More information

Linear Feature Engineering 11

Linear Feature Engineering 11 Lnear Feature Engneerng 11 2 Least-Squares 2.1 Smple least-squares Consder the followng dataset. We have a bunch of nputs x and correspondng outputs y. The partcular values n ths dataset are x y 0.23 0.19

More information

Chapter - 2. Distribution System Power Flow Analysis

Chapter - 2. Distribution System Power Flow Analysis Chapter - 2 Dstrbuton System Power Flow Analyss CHAPTER - 2 Radal Dstrbuton System Load Flow 2.1 Introducton Load flow s an mportant tool [66] for analyzng electrcal power system network performance. Load

More information

A new Approach for Solving Linear Ordinary Differential Equations

A new Approach for Solving Linear Ordinary Differential Equations , ISSN 974-57X (Onlne), ISSN 974-5718 (Prnt), Vol. ; Issue No. 1; Year 14, Copyrght 13-14 by CESER PUBLICATIONS A new Approach for Solvng Lnear Ordnary Dfferental Equatons Fawz Abdelwahd Department of

More information

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family IOSR Journal of Mathematcs IOSR-JM) ISSN: 2278-5728. Volume 3, Issue 3 Sep-Oct. 202), PP 44-48 www.osrjournals.org Usng T.O.M to Estmate Parameter of dstrbutons that have not Sngle Exponental Famly Jubran

More information

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4) I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes

More information

We present the algorithm first, then derive it later. Assume access to a dataset {(x i, y i )} n i=1, where x i R d and y i { 1, 1}.

We present the algorithm first, then derive it later. Assume access to a dataset {(x i, y i )} n i=1, where x i R d and y i { 1, 1}. CS 189 Introducton to Machne Learnng Sprng 2018 Note 26 1 Boostng We have seen that n the case of random forests, combnng many mperfect models can produce a snglodel that works very well. Ths s the dea

More information

Relevance Vector Machines Explained

Relevance Vector Machines Explained October 19, 2010 Relevance Vector Machnes Explaned Trstan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introducton Ths document has been wrtten n an attempt to make Tppng s [1] Relevance Vector Machnes

More information

Time-Varying Systems and Computations Lecture 6

Time-Varying Systems and Computations Lecture 6 Tme-Varyng Systems and Computatons Lecture 6 Klaus Depold 14. Januar 2014 The Kalman Flter The Kalman estmaton flter attempts to estmate the actual state of an unknown dscrete dynamcal system, gven nosy

More information

MATH 567: Mathematical Techniques in Data Science Lab 8

MATH 567: Mathematical Techniques in Data Science Lab 8 1/14 MATH 567: Mathematcal Technques n Data Scence Lab 8 Domnque Gullot Departments of Mathematcal Scences Unversty of Delaware Aprl 11, 2017 Recall We have: a (2) 1 = f(w (1) 11 x 1 + W (1) 12 x 2 + W

More information

Chapter 12. Ordinary Differential Equation Boundary Value (BV) Problems

Chapter 12. Ordinary Differential Equation Boundary Value (BV) Problems Chapter. Ordnar Dfferental Equaton Boundar Value (BV) Problems In ths chapter we wll learn how to solve ODE boundar value problem. BV ODE s usuall gven wth x beng the ndependent space varable. p( x) q(

More information

A linear imaging system with white additive Gaussian noise on the observed data is modeled as follows:

A linear imaging system with white additive Gaussian noise on the observed data is modeled as follows: Supplementary Note Mathematcal bacground A lnear magng system wth whte addtve Gaussan nose on the observed data s modeled as follows: X = R ϕ V + G, () where X R are the expermental, two-dmensonal proecton

More information

Expectation Maximization Mixture Models HMMs

Expectation Maximization Mixture Models HMMs -755 Machne Learnng for Sgnal Processng Mture Models HMMs Class 9. 2 Sep 200 Learnng Dstrbutons for Data Problem: Gven a collecton of eamples from some data, estmate ts dstrbuton Basc deas of Mamum Lelhood

More information

APPENDIX A Some Linear Algebra

APPENDIX A Some Linear Algebra APPENDIX A Some Lnear Algebra The collecton of m, n matrces A.1 Matrces a 1,1,..., a 1,n A = a m,1,..., a m,n wth real elements a,j s denoted by R m,n. If n = 1 then A s called a column vector. Smlarly,

More information

C/CS/Phy191 Problem Set 3 Solutions Out: Oct 1, 2008., where ( 00. ), so the overall state of the system is ) ( ( ( ( 00 ± 11 ), Φ ± = 1

C/CS/Phy191 Problem Set 3 Solutions Out: Oct 1, 2008., where ( 00. ), so the overall state of the system is ) ( ( ( ( 00 ± 11 ), Φ ± = 1 C/CS/Phy9 Problem Set 3 Solutons Out: Oct, 8 Suppose you have two qubts n some arbtrary entangled state ψ You apply the teleportaton protocol to each of the qubts separately What s the resultng state obtaned

More information

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan Kernels n Support Vector Machnes Based on lectures of Martn Law, Unversty of Mchgan Non Lnear separable problems AND OR NOT() The XOR problem cannot be solved wth a perceptron. XOR Per Lug Martell - Systems

More information

Support Vector Machines

Support Vector Machines Separatng boundary, defned by w Support Vector Machnes CISC 5800 Professor Danel Leeds Separatng hyperplane splts class 0 and class 1 Plane s defned by lne w perpendcular to plan Is data pont x n class

More information

Statistical Foundations of Pattern Recognition

Statistical Foundations of Pattern Recognition Statstcal Foundatons of Pattern Recognton Learnng Objectves Bayes Theorem Decson-mang Confdence factors Dscrmnants The connecton to neural nets Statstcal Foundatons of Pattern Recognton NDE measurement

More information

The Order Relation and Trace Inequalities for. Hermitian Operators

The Order Relation and Trace Inequalities for. Hermitian Operators Internatonal Mathematcal Forum, Vol 3, 08, no, 507-57 HIKARI Ltd, wwwm-hkarcom https://doorg/0988/mf088055 The Order Relaton and Trace Inequaltes for Hermtan Operators Y Huang School of Informaton Scence

More information

Lecture 20: November 7

Lecture 20: November 7 0-725/36-725: Convex Optmzaton Fall 205 Lecturer: Ryan Tbshran Lecture 20: November 7 Scrbes: Varsha Chnnaobreddy, Joon Sk Km, Lngyao Zhang Note: LaTeX template courtesy of UC Berkeley EECS dept. Dsclamer:

More information