4//06 IAI: Lecture 09: Feedforward Multi-Layer Perceptron (Dopredná viacvrstvová sieť)
Ľubica Beňušková
AIMA 3rd ed., Ch. 18.6.4 - 18.7.5

Classification (klasifikácia)

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (classes) a new observation belongs, on the basis of a training set of data containing observations whose category (class) membership is known.

Let us now consider the task of classification of points into two categories, i.e. our classifier must find a boundary that separates two classes of objects. We assume the boundary between the classes is not linear, but curved, i.e. nonlinear.

Perceptron (Frank Rosenblatt, 1957)

Perceptron: input/output formulas

a_i ∈ R is the activation of input i, a real number from (0, 1); w_{j,i} ∈ R is the weight of input i, where j is the index of the output.

Total input of unit j: in_j = Σ_{i=0}^{n} w_{j,i} a_i

Output = activation function g applied to the total input: a_j = g(in_j) = g(Σ_{i=0}^{n} w_{j,i} a_i)

Nonlinear activation function g(in_j)

A continuous, differentiable sigmoid (logistic) function, where λ is the slope of the sigmoid:

a_j = g(in_j) = 1 / (1 + e^{-λ in_j})

We will call such a perceptron a continuous perceptron (as opposed to a binary perceptron).

Perceptron training (learning)

The goal of a learning algorithm is to automatically find the values of the weights w to fit any training set of examples. The task can be any nonlinear problem.

For a single perceptron initialized with small random weights, upon presentation of each example a new weight array w' = w + Δw is calculated to move the output of the perceptron closer to the desired (target) output.
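The input/output formulas above can be sketched in a few lines of Python (a minimal illustration; the function names are my own, and the slope λ defaults to 1):

```python
import math

def sigmoid(total_in, slope=1.0):
    """Logistic activation g(in) = 1 / (1 + e^(-slope * in))."""
    return 1.0 / (1.0 + math.exp(-slope * total_in))

def perceptron_output(weights, inputs):
    """Compute in_j = sum_i w_i * a_i, then return a_j = g(in_j).

    weights[0] is the bias weight; a_0 = 1 is prepended to the inputs.
    """
    activations = [1.0] + list(inputs)  # a_0 = 1 handles the bias term
    total_in = sum(w * a for w, a in zip(weights, activations))
    return sigmoid(total_in)
```

The output always lies in (0, 1), which is what makes the unit usable as a soft two-class classifier.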
Training set and error function

Let the training set be A_train = {(x^1, y^1), (x^2, y^2), ..., (x^p, y^p), ..., (x^P, y^P)}

x^p is the input array or vector (also called a pattern); y^p is the target or desired output, being +1 for one class of inputs and 0 for the other class of inputs.

Error function: E = (1/2) Σ_{p=1}^{P} (y^p - g(x^p))^2, where g(x^p) is the actual output of the perceptron for pattern x^p.

Generalisation to a continuous perceptron

Let the error function be E = (1/2) Σ_p (y^p - g(x^p))^2. We want to adjust the perceptron's weights after each input pattern to minimize the error E step by step, in order to reach a (global) minimum at the end of training. This algorithm of step-like error minimisation is called gradient descent. (Figure: minimum of E.)

Gradient of a function

The gradient of a scalar function is a vector which points in the direction of the greatest rate of increase of the function and whose magnitude is the greatest rate of change. The gradient of an error function E(w) with respect to an independent vector variable w = (w_1, ..., w_n) is defined as a vector, the components of which are the partial derivatives of E according to the weights:

grad(E) = ∇E = (∂E/∂w_1, ..., ∂E/∂w_n)

Weights optimisation by gradient descent

Minimisation of the error function E proceeds by moving against the gradient of E:

Δw_j = -α ∂E/∂w_j, i.e. (w_1, w_2) → (w_1 + Δw_1, w_2 + Δw_2)

The second term, the partial derivative of E according to the weight, is the so-called generalised error signal.

Gradient descent rule

The weights are updated in the direction of the negative gradient, thus Δw_j = -α ∂E/∂w_j. It is guaranteed that this rule always finds the local minimum of E(w) which is nearest to the initial state (defined by the initial weights w_1, w_2, etc.).

Generalised (delta) rule

The weights are updated after each example x as Δw_j = α δ x_j. Delta is the error signal, i.e. δ = (y - g(x)) g', where g' is the derivative of the activation function. The constant α > 0 is the learning speed.

If x_j > 0, g' > 0 and the error signal (y - g) > 0, then the weight w_j is increased. If x_j > 0, g' > 0 and (y - g) < 0, then the weight w_j is decreased.
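A single delta-rule update can be sketched as follows (a hedged illustration with my own names; slope λ = 1, so g' = g(1 - g)):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def delta_rule_step(w, x, y, alpha=0.5):
    """One gradient-descent step Δw_j = α · δ · x_j for a continuous perceptron.

    δ = (y - g) · g · (1 - g) is the generalised error signal (λ = 1).
    x[0] = 1 serves as the bias input.
    """
    g = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))  # actual output
    delta = (y - g) * g * (1.0 - g)                    # error signal δ
    return [wj + alpha * delta * xj for wj, xj in zip(w, x)]
```

As the slide notes, for a positive input the weight grows when the output is below the target and shrinks when it is above it.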
Derivative of the activation function g

After presentation of each input pattern, each weight is updated as Δw_j = α δ x_j, where g is the sigmoid function g(x) = 1 / (1 + e^{-λx}).

From mathematics we know that the derivative of the sigmoid function (for λ = 1) is g' = g(1 - g). Thus:

Δw_j = α (y - g) g(1 - g) x_j

Perceptron training algorithm in pseudo-code

Start with random initial weights (e.g., in [-0.5, 0.5])
Do
  For all patterns from the training set
    Calculate activation
    Error = target value for pattern - activation
    For all input weights
      DeltaWeight_i = alpha * Error * Input_i * g_der
      Weight_i = Weight_i + DeltaWeight_i
Until "total error for all patterns < ε" or "time-out"

The same algorithm in more detail:

Start with random initial weights in [-0.5, 0.5] and alpha = 0.5
do
  epoch++
  CORRECT = 0
  for p = 1 to P                  // loop through all training samples
    in = 0
    for i = 0 to N
      in = in + w_i * x_i
    if (in > 0) out = 1 else out = 0
    if (out == desired) CORRECT++
    else
      for i = 0 to N
        w_i = w_i + alpha * (desired - out) * x_i * g_der
while (E(w) > ε)                  // training stops when the error is minimal

Continuous perceptron

A nonlinear unit with the sigmoid activation function g(in) = 1 / (1 + e^{-in}) has good properties (boundedness, monotonicity, differentiability). Gradient descent learning corresponds to minimisation of the total error; E(w) < ε is the necessary condition for stopping. This type of learning happens online and is deterministic.

Perceptron as a nonlinear classifier

A single continuous perceptron can provide only one sigmoid boundary. The ultimate goal is a complex nonlinear boundary. What needs to be done to be able to find a more complex nonlinear boundary? And what needs to be done to find several nonlinear boundaries?

Feedforward multi-layer perceptron (MLP)

Two (or more) layers of continuous perceptrons connected with feedforward connections: inputs x_1, x_2, ..., x_k feed a layer of hidden units, whose outputs feed the output unit(s).
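The pseudo-code above translates into runnable Python roughly as follows (a sketch, not the lecture's reference implementation; I use the continuous update Δw = α(y - g)g(1 - g)x throughout, and logical AND as an assumed toy training set):

```python
import math
import random

def train_perceptron(patterns, alpha=0.5, eps=0.01, max_epochs=20000):
    """Train a continuous perceptron by the delta rule.

    patterns: list of (x, y) pairs, where x[0] = 1 is the bias input.
    Stops when the total error over all patterns drops below eps
    ("total error < ε"), or on time-out (max_epochs).
    """
    rng = random.Random(0)
    w = [rng.uniform(-0.5, 0.5) for _ in range(len(patterns[0][0]))]
    for _ in range(max_epochs):
        total_error = 0.0
        for x, y in patterns:
            g = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            delta = (y - g) * g * (1.0 - g)
            w = [wi + alpha * delta * xi for wi, xi in zip(w, x)]
            total_error += 0.5 * (y - g) ** 2
        if total_error < eps:   # condition for stopping
            break
    return w

# toy training set: logical AND (bias input first)
and_data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
```

After training, thresholding the sigmoid output at 0.5 recovers the binary decision of the second pseudo-code variant.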
MLP (Multi-Layer Perceptron)

Architecture: an input layer (inputs x_1, ..., x_K), a hidden layer of units with activations a_j, and an output layer producing the output a_i ≈ y. Weights w_{j,k} connect the inputs to the hidden units and weights W_{i,j} connect the hidden units to the output unit(s). g(in) is a nonlinear differentiable function (sigmoid, hyperbolic tangent, Gaussian function, etc.).

Training set and error

Let the training set be A_train = {(x^1, y^1), ..., (x^p, y^p), ..., (x^P, y^P)}, where x^p is the input vector (also called a pattern) and y^p is the target or desired output value. The goal of training is to minimise the total error

E = (1/2) Σ_{p=1}^{P} (y^p - g(x^p))^2

where g(x^p) = a^p is the actual output for the current weight matrix.

Gradient of a function (recap)

The gradient of a scalar function is a vector which points in the direction of the greatest rate of increase of the function and whose magnitude is the greatest rate of change. For the error function E(w) with w = (w_1, ..., w_n):

grad(E) = ∇E = (∂E/∂w_1, ..., ∂E/∂w_n)

Weights optimisation by gradient descent (recap)

Minimisation of the error function E proceeds by moving against the gradient of E: Δw_j = -α ∂E/∂w_j. The partial derivative of E according to the weight is the so-called generalised error signal.

Gradient descent rule

All the weights are updated in the direction of the negative gradient: Δw = -α ∇E. It is guaranteed that this rule always finds the local minimum of E which is nearest to the initial state (defined by the initial weights).

Error-backpropagation algorithm

1. Choose α ∈ (0, 1], and randomly generate w(0) ∈ [-0.5, 0.5]. Set E = 0, input pattern counter p = 0, epoch counter k = 0.
2. For pattern p calculate the actual output of the MLP.
3. Calculate the generalised learning signal delta for the output unit(s).
4. Update each weight between the hidden and output unit(s).
5. Calculate the generalised learning signal delta for the hidden units.
6. Update each weight between the input and hidden units.
7. If p < P, go to step 2, else continue.
8. Freeze the weights and calculate the total error E.
9. If E < ε, stop. Else set E = 0, p = 0, k = k + 1, and go to step 2.
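The nine steps above can be sketched for a small one-hidden-layer MLP with sigmoid units and a single sigmoid output (a hedged sketch with my own names; following the slide's step ordering, the hidden deltas in step 5 use the already-updated output weights):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def backprop_epoch(patterns, w_h, w_o, alpha=0.5):
    """One epoch (steps 2-8) of error-backpropagation.

    patterns: list of (x, y) with x[0] = 1 as the bias input.
    w_h: one weight row per hidden unit; w_o: hidden-to-output weights.
    Returns the total error E accumulated over the epoch.
    """
    E = 0.0
    for x, y in patterns:                                   # steps 2-7
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_h]
        o = sigmoid(sum(w * hj for w, hj in zip(w_o, h)))   # step 2: output
        d_o = (y - o) * o * (1 - o)                         # step 3: output delta
        for j in range(len(w_o)):                           # step 4
            w_o[j] += alpha * d_o * h[j]
        d_h = [w_o[j] * d_o * h[j] * (1 - h[j])             # step 5: hidden deltas
               for j in range(len(h))]
        for j, row in enumerate(w_h):                       # step 6
            for k in range(len(row)):
                row[k] += alpha * d_h[j] * x[k]
        E += 0.5 * (y - o) ** 2                             # step 8
    return E
```

Step 9 is the outer loop: repeat epochs until E < ε.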
Nonlinear uni- or multivariate regression with an MLP

The complexity of the fitted curve depends on the number of hidden units. In this example, the green function is the unknown (target) function which generates the data points; the points have some random noise added to them. (Figure: the fitted function in red, for 1, 3 and 9 hidden units.)

Simple example of an MLP

Inputs are the x coordinates of points of some unknown function y = f(x). Two hidden units (No. 1 and No. 2) have hyperbolic tangent activation functions. One output unit (No. 3) sums linearly the outputs of the hidden units, minus an adjustable bias. Input: x; desired output: the functional value y.

Note on the input and target output

Task: approximate the nonlinear function y = f(x). Inputs are the x coordinates of points; the desired output is the value y. The input will be a single number, the value of the x coordinate of the point in the 2D space. The target output will be the value of the y coordinate of that point.

(Figures: MLP output after 8 sweeps through all the data points, and after further sweeps through all the data points.)
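The two-tanh-unit network of the simple example computes the following forward pass (a sketch; the parameter names are my own, and any concrete weight values are assumptions):

```python
import math

def tiny_regression_mlp(x, w1, b1, w2, b2, v1, v2, bias_out):
    """Toy regression network from the example: two tanh hidden units
    (No. 1 and No. 2) and one linear output unit (No. 3) that sums
    them minus an adjustable bias."""
    h1 = math.tanh(w1 * x + b1)           # hidden unit 1
    h2 = math.tanh(w2 * x + b2)           # hidden unit 2
    return v1 * h1 + v2 * h2 - bias_out   # linear output unit 3
```

Because tanh is bounded, the output is a weighted sum of two S-shaped pieces, which is why even two hidden units can already bend the fitted curve.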
Approximation of the nonlinear data has been achieved. (Figure.)

MLP representational power

Continuous functions: any bounded continuous function can be approximated with arbitrarily small error by a two-layer feedforward MLP. Sigmoid functions in the hidden layer act as a set of basis functions for composing more complex functions, like sine waves in Fourier analysis.

Arbitrary functions: any function can be approximated to arbitrary accuracy by a multiple-layer perceptron.

Boolean functions: any Boolean function can be represented by an MLP.

Generalisation (prediction)

A network is said to generalise if it gives correct results for inputs not in the training set, i.e. it predicts new values. Generalisation is tested by using an additional test set: after training, the weights are frozen, and for each test input we evaluate the MLP's prediction of the functional value. Often the training set and the test set are obtained by separating the original data set into two parts.

Example of good and bad generalisation

Two feedforward MLPs, one with 5 hidden sigmoid neurons and the other one with many more hidden sigmoid neurons; the output layer has one linear neuron.

F(x) = sin(x) on the interval -π/2 ≤ x ≤ π/2. Training set: the points (-π/2 + iπ/18, sin(-π/2 + iπ/18)) for i = 0, ..., 18.

Training results

Both MLPs give a good approximation to sin(x) at all training set points. (Example taken from Alexandra I. Cristea.)

Overtraining (overfitting, preučenie)

The small MLP generalises well, the big MLP very badly: it memorized the training set but gives wrong results for other inputs, i.e. overfitting. Too many neurons and weights lead to a polynomial of a high degree. Overtraining: a network that performs very well on the training set but very badly on test points is said to be overtrained.
Early stopping: how to avoid overfitting

(Figure: error on the training set and on the test set as training proceeds; "Stop the training here!")

Model selection

We do not know how many hidden units to use for an MLP to approximate well the given nonlinear function and obtain good generalisation. In model selection, we experimentally evaluate several MLPs with different numbers of hidden units on how well they perform on test data.

K-fold cross-validation: run K experiments, each time setting aside a different 1/K of the data to test on.

Leave-one-out: we leave only one example for testing and train on the remaining (N - 1) examples; the testing is repeated N times (for a set of N examples).

Pattern classification with an MLP

Circles and crosses are objects belonging to different classes (e.g. cats and dogs). During learning, the weights in the MLP are gradually adjusted to the values of the parameters of the separating boundaries between the classes. Number of output neurons = number of classes.

Summary

Supervised learning by error-backpropagation can be used either for nonlinear regression or for pattern (object) classification.

In the case of nonlinear regression: the training set consists of real data values, i.e. the set of pairs (x, F(x)), where F(x) is the unknown function and x is the input vector. The task is to approximate F(x) by the output of the MLP, G(x). The error signal is calculated based on the difference between G(x) and F(x) for the training set.

In the case of classification: the training set consists of pairs of input and desired output (i.e. class labels). The task is to learn how to correctly classify input vectors. We know which class the object/input falls into, and we provide an error signal based on the desired or target outputs.

The drawback of error-backpropagation

The quality of the solution depends on the starting values of the weights. Error-backpropagation ALWAYS converges to the nearest minimum of the total error in weight space. There are various ways to improve the chances of finding the global minimum, which we are not going to deal with in this course.

Some historical notes

Paul Werbos: Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, Ph.D. thesis, Harvard University, 1974.
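The K-fold scheme from the model-selection slide can be sketched as follows (my own helper names; each fold serves as the test set exactly once):

```python
import random

def k_fold_splits(data, k, seed=0):
    """Yield K (train, test) pairs; each experiment sets aside a different
    1/K of the data as the test set, as in K-fold cross-validation."""
    pts = list(data)
    random.Random(seed).shuffle(pts)
    folds = [pts[i::k] for i in range(k)]   # K roughly equal parts
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test
```

Leave-one-out is then the special case k = N, where each test set holds a single example.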
Rumelhart, Hinton, Williams: Learning internal representations by back-propagating errors, Nature 323, pp. 533-536, 1986.

The Rumelhart Prize is awarded annually to an individual or collaborative team making a significant contemporary contribution to the theoretical foundations of human cognition (US$100,000).

Mathematical proof of the theorem of universal approximation of functions: Hornik 1989 and Kurkova 1989.

Next lecture: practical applications of MLP