Supervised Learning in Multilayer Networks


44.1 Multilayer perceptrons

No course on neural networks could be complete without a discussion of supervised multilayer networks, also known as backpropagation networks.

The multilayer perceptron is a feedforward network. It has input neurons, hidden neurons and output neurons. The hidden neurons may be arranged in a sequence of layers. The most common multilayer perceptrons have a single hidden layer, and are known as two-layer networks, the number two counting the number of layers of neurons not including the inputs.

Such a feedforward network defines a nonlinear parameterized mapping from an input x to an output y = y(x; w, A). The output is a continuous function of the input and of the parameters w; the architecture of the net, i.e., the functional form of the mapping, is denoted by A. Feedforward networks can be trained to perform regression and classification tasks.

Regression networks

In the case of a regression problem, the mapping for a network with one hidden layer may have the form:

    Hidden layer:  a_j^(1) = Σ_l w_jl^(1) x_l + θ_j^(1);   h_j = f^(1)(a_j^(1))          (44.1)

    Output layer:  a_i^(2) = Σ_j w_ij^(2) h_j + θ_i^(2);   y_i = f^(2)(a_i^(2))          (44.2)

where, for example, f^(1)(a) = tanh(a), and f^(2)(a) = a. Here l runs over the inputs x_1, ..., x_L, j runs over the hidden units, and i runs over the outputs. The weights w and biases θ together make up the parameter vector w. The nonlinear sigmoid function f^(1) at the hidden layer gives the neural network greater computational flexibility than a standard linear regression model. Graphically, we can represent the neural network as a set of layers of connected neurons (figure 44.1).

Figure 44.1. A typical two-layer network, with six inputs, seven hidden units, and three outputs. Each line represents one weight.
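As a concrete illustration of equations (44.1) and (44.2), here is a minimal numpy sketch of the forward pass of such a two-layer regression network (tanh hidden units, linear outputs). The function and variable names are my own, not from the text.

```python
import numpy as np

def two_layer_net(x, W1, theta1, W2, theta2):
    """Forward pass of the two-layer regression network of (44.1)-(44.2)."""
    a1 = W1 @ x + theta1   # hidden activations: a_j = sum_l w_jl x_l + theta_j
    h = np.tanh(a1)        # f^(1)(a) = tanh(a)
    a2 = W2 @ h + theta2   # output activations
    return a2              # f^(2)(a) = a: identity outputs for regression

# Example with six inputs, seven hidden units and three outputs, as in figure 44.1.
rng = np.random.default_rng(0)
W1, theta1 = rng.normal(size=(7, 6)), rng.normal(size=7)
W2, theta2 = rng.normal(size=(3, 7)), rng.normal(size=3)
y = two_layer_net(rng.normal(size=6), W1, theta1, W2, theta2)
```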

What sorts of functions can these networks implement?

Just as we explored the weight space of the single neuron in Chapter 39, examining the functions it could produce, let us explore the weight space of a multilayer network. In figures 44.2 and 44.3 I take a network with one input and one output and a large number H of hidden units, set the biases and weights θ^(1), w^(1), θ^(2) and w^(2) to random values, and plot the resulting function y(x). I set the hidden units' biases θ_j^(1) to random values from a Gaussian with zero mean and standard deviation σ_bias; the input-to-hidden weights w_jl^(1) to random values with standard deviation σ_in; and the bias and output weights θ^(2) and w_ij^(2) to random values with standard deviation σ_out.

Figure 44.2. Samples from the prior over functions of a one-input network. For each of a sequence of values of σ_bias = 8, 6, 4, 3, 2, 1.6, 1.2, 0.8, 0.4, 0.3, 0.2, with σ_in = 5σ_bias, one random function is shown. The other hyperparameters of the network were H = 400, σ_out = 0.05.

The sort of functions that we obtain depends on the values of σ_bias, σ_in and σ_out. As the weights and biases are made bigger we obtain more complex functions with more features and a greater sensitivity to the input variable. The vertical scale of a typical function produced by the network with random weights is of order √H σ_out; the horizontal range in which the function varies significantly is of order σ_bias/σ_in; and the shortest horizontal length scale is of order 1/σ_in.

Figure 44.3. Properties of a function produced by a random network. The function shown was produced by making a random network with H = 400 hidden units and Gaussian weights with σ_bias = 4, σ_in = 8, and σ_out = 0.5.

Radford Neal (1996) has also shown that in the limit as H → ∞ the statistical properties of the functions generated by randomizing the weights are independent of the number of hidden units; so, interestingly, the complexity of the functions becomes independent of the number of parameters in the model. What determines the complexity of the typical functions is the characteristic magnitude of the weights. Thus we anticipate that when we fit these models to real data, an important way of controlling the complexity of the fitted function will be to control the characteristic magnitude of the weights.

Figure 44.4 shows one typical function produced by a network with two inputs and one output. This should be contrasted with the function produced by a traditional linear regression model, which is a flat plane. Neural networks can create functions with more complexity than a linear regression.

Figure 44.4. One sample from the prior of a two-input network with {H, σ_in, σ_bias, σ_out} = {400, 8.0, 8.0, 0.5}.

44.2 How a regression network is traditionally trained

This network is trained using a data set D = {x^(n), t^(n)} by adjusting w so as to minimize an error function, e.g.,

    E_D(w) = (1/2) Σ_n Σ_i ( t_i^(n) − y_i(x^(n); w) )^2.          (44.3)

This objective function is a sum of terms, one for each input/target pair {x, t}, measuring how close the output y(x; w) is to the target t.

This minimization is based on repeated evaluation of the gradient of E_D. This gradient can be efficiently computed using the backpropagation algorithm (Rumelhart et al., 1986), which uses the chain rule to find the derivatives.
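The backpropagation computation for this architecture fits in a few lines. The following is a minimal illustrative sketch (my own naming, plain batch gradient descent with a fixed step size η), not the book's code.

```python
import numpy as np

def backprop_grads(x, t, W1, theta1, W2, theta2):
    """Gradients of the one-case error 1/2 * sum_i (t_i - y_i)^2 for the
    network of (44.1)-(44.2), computed via the chain rule (backpropagation)."""
    a1 = W1 @ x + theta1
    h = np.tanh(a1)
    y = W2 @ h + theta2                    # linear outputs
    delta2 = y - t                         # dE/da^(2)
    delta1 = (W2.T @ delta2) * (1 - h**2)  # dE/da^(1); tanh'(a) = 1 - tanh(a)^2
    return np.outer(delta1, x), delta1, np.outer(delta2, h), delta2

def train(X, T, W1, theta1, W2, theta2, eta=0.01, n_steps=1000):
    """Batch gradient descent on E_D(w) = 1/2 sum_n |t^(n) - y(x^(n); w)|^2 (44.3)."""
    params = [W1, theta1, W2, theta2]
    for _ in range(n_steps):
        grads = [np.zeros_like(p) for p in params]
        for x, t in zip(X, T):
            for g, dg in zip(grads, backprop_grads(x, t, *params)):
                g += dg
        for p, g in zip(params, grads):
            p -= eta * g                   # gradient descent step
    return params
```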

Often, regularization (also known as weight decay) is included, modifying the objective function to:

    M(w) = βE_D + αE_W          (44.4)

where, for example, E_W = (1/2) Σ_i w_i^2. This additional term favours small values of w and decreases the tendency of a model to overfit noise in the training data.

Rumelhart et al. (1986) showed that multilayer perceptrons can be trained, by gradient descent on M(w), to discover solutions to non-trivial problems such as deciding whether an image is symmetric or not. These networks have been successfully applied to real-world tasks as varied as pronouncing English text (Sejnowski and Rosenberg, 1987) and focussing multiple-mirror telescopes (Angel et al., 1990).

44.3 Neural network learning as inference

The neural network learning process above can be given the following probabilistic interpretation. [Here we repeat and generalize the discussion of Chapter 41.]

The error function is interpreted as defining a noise model. βE_D is the negative log likelihood:

    P(D | w, β, H) = (1/Z_D(β)) exp(−βE_D).          (44.5)

Thus, the use of the sum-squared error E_D (44.3) corresponds to an assumption of Gaussian noise on the target variables, and the parameter β defines a noise level σ_ν^2 = 1/β.

Similarly the regularizer is interpreted in terms of a log prior probability distribution over the parameters:

    P(w | α, H) = (1/Z_W(α)) exp(−αE_W).          (44.6)

If E_W is quadratic as defined above, then the corresponding prior distribution is a Gaussian with variance σ_W^2 = 1/α.

The probabilistic model H specifies the architecture A of the network, the likelihood (44.5), and the prior (44.6).

The objective function M(w) then corresponds to the inference of the parameters w, given the data:

    P(w | D, α, β, H) = P(D | w, β, H) P(w | α, H) / P(D | α, β, H)          (44.7)
                      = (1/Z_M) exp(−M(w)).          (44.8)

The w found by (locally) minimizing M(w) is then interpreted as the (locally) most probable parameter vector, w_MP.
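To make the correspondence between the regularized objective and the log posterior concrete, here is a small sketch of M(w) from equation (44.4); the function names and the generic `net` argument are my own. Its negative exponential is proportional to the posterior (44.8), so the minimizer of M is w_MP.

```python
import numpy as np

def E_D(w, X, T, net):
    """Sum-squared data error (44.3); net(x, w) is the network's output y(x; w)."""
    return 0.5 * sum(np.sum((t - net(x, w))**2) for x, t in zip(X, T))

def E_W(w):
    """Quadratic regularizer E_W = 1/2 * sum_i w_i^2."""
    return 0.5 * np.sum(np.asarray(w)**2)

def M(w, X, T, net, alpha, beta):
    """Regularized objective (44.4).  exp(-M(w)) is proportional to the posterior
    P(w | D, alpha, beta, H) of (44.8): beta = 1/sigma_nu^2 is the inverse noise
    variance and alpha = 1/sigma_W^2 the inverse prior variance."""
    return beta * E_D(w, X, T, net) + alpha * E_W(w)
```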

The interpretation of M(w) as a log probability adds little new at this stage. But new tools will emerge when we proceed to other inferences. First, though, let us establish the probabilistic interpretation of classification networks, to which the same tools apply.

Binary classification networks

If the targets t in a data set are binary classification labels (0, 1), it is natural to use a neural network whose output y(x; w, A) is bounded between 0 and 1, and is interpreted as a probability P(t=1 | x, w, A). For example, a network with one hidden layer could be described by the feedforward equations (44.1) and (44.2), with f^(2)(a) = 1/(1 + e^(−a)). The error function βE_D is replaced by the negative log likelihood:

    G(w) = −Σ_n [ t^(n) ln y(x^(n); w) + (1 − t^(n)) ln(1 − y(x^(n); w)) ].          (44.9)

The total objective function is then M = G + αE_W. Note that this includes no parameter β (because there is no Gaussian noise).

Multi-class classification networks

For a multi-class classification problem, we can represent the targets by a vector, t, in which a single element is set to 1, indicating the correct class, and all other elements are set to 0. In this case it is appropriate to use a softmax network having coupled outputs which sum to one and are interpreted as class probabilities y_i = P(t_i=1 | x, w, A). The last part of equation (44.2) is replaced by:

    y_i = e^(a_i) / Σ_i' e^(a_i').          (44.10)

The negative log likelihood in this case is

    G(w) = −Σ_n Σ_i t_i^(n) ln y_i(x^(n); w).          (44.11)

As in the case of the regression network, the minimization of the objective function M(w) = G + αE_W corresponds to an inference of the form (44.8). A variety of useful results can be built on this interpretation.
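A brief numpy sketch of these two likelihoods (again with my own function names): the logistic output and negative log likelihood (44.9) for binary targets, and the softmax output (44.10) with its cross-entropy (44.11) for one-of-K targets.

```python
import numpy as np

def sigmoid(a):
    """Logistic output f^(2)(a) = 1/(1 + exp(-a)), interpreted as P(t=1 | x, w, A)."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """Coupled outputs that sum to one (44.10); max subtracted for numerical stability."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

def G_binary(Y, T):
    """Negative log likelihood (44.9); Y[n] = y(x^(n); w), T[n] in {0, 1}."""
    Y, T = np.asarray(Y), np.asarray(T)
    return -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y))

def G_multiclass(Y, T):
    """Negative log likelihood (44.11); rows of Y are softmax outputs,
    rows of T are one-of-K target vectors."""
    return -np.sum(np.asarray(T) * np.log(np.asarray(Y)))
```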

44.4 Benefits of the Bayesian approach to supervised feedforward neural networks

From the statistical perspective, supervised neural networks are nothing more than nonlinear curve-fitting devices. Curve fitting is not a trivial task, however. The effective complexity of an interpolating model is of crucial importance, as illustrated in figure 44.5.

Consider a control parameter that influences the complexity of a model, for example a regularization constant α (weight decay parameter). As the control parameter is varied to increase the complexity of the model (descending from figure 44.5a to c and going from left to right across figure 44.5d), the best fit to the training data that the model can achieve becomes increasingly good. However, the empirical performance of the model, the test error, first decreases then increases again. An over-complex model overfits the data and generalizes poorly. This problem may also complicate the choice of architecture in a multilayer perceptron, the radius of the basis functions in a radial basis function network, and the choice of the input variables themselves in any multidimensional regression problem. Finding values for model control parameters that are appropriate for the data is therefore an important and non-trivial problem.

Figure 44.5. Optimization of model complexity. Panels (a-c) show a radial basis function model interpolating a simple data set with one input variable and one output variable. As the regularization constant is varied to increase the complexity of the model (from (a) to (c)), the interpolant is able to fit the training data increasingly well, but beyond a certain point the generalization ability (test error) of the model deteriorates. Panel (d) shows the training error and test error as functions of the model control parameters; panel (e) shows the log probability of the training data given the control parameters. Probability theory allows us to optimize the control parameters without needing a test set.

The overfitting problem can be solved by using a Bayesian approach to control model complexity.

If we give a probabilistic interpretation to the model, then we can evaluate the evidence for alternative values of the control parameters. As was explained in Chapter 28, over-complex models turn out to be less probable, and the evidence P(Data | Control Parameters) can be used as an objective function for optimization of model control parameters (figure 44.5e). The setting of α that maximizes the evidence is displayed in figure 44.5b.

Bayesian optimization of model control parameters has four important advantages. (1) No test set or validation set is involved, so all available training data can be devoted to both model fitting and model comparison. (2) Regularization constants can be optimized on-line, i.e., simultaneously with the optimization of ordinary model parameters. (3) The Bayesian objective function is not noisy, in contrast to a cross-validation measure. (4) The gradient of the evidence with respect to the control parameters can be evaluated, making it possible to simultaneously optimize a large number of control parameters.

Probabilistic modelling also handles uncertainty in a natural manner. It offers a unique prescription, marginalization, for incorporating uncertainty about parameters into predictions; this procedure yields better predictions, as we saw in Chapter 41. Figure 44.6 shows error bars on the predictions of a trained neural network.

Figure 44.6. Error bars on the predictions of a trained regression network. The solid line gives the predictions of the best-fit parameters of a multilayer perceptron trained on the data points. The error bars (dotted lines) are those produced by the uncertainty of the parameters w. Notice that the error bars become larger where the data are sparse.

Implementation of Bayesian inference

As was mentioned in Chapter 41, Bayesian inference for multilayer networks may be implemented by Monte Carlo sampling, or by deterministic methods employing Gaussian approximations (Neal, 1996; MacKay, 1992c).
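As an illustration of what marginalization looks like in practice, here is a small sketch (my own, not from the text) that turns a set of parameter samples, assumed to have been drawn from the posterior P(w | D, α, β, H) by one of the Monte Carlo methods just mentioned, into a predictive mean and error bar at an input x.

```python
import numpy as np

def predictive_error_bars(x, w_samples, net):
    """Marginalize over parameter uncertainty: average the network's prediction
    over posterior samples {w^(s)} and report the spread, as in the error bars
    of figure 44.6.  net(x, w) is the network function y(x; w)."""
    ys = np.array([net(x, w) for w in w_samples])
    return ys.mean(axis=0), ys.std(axis=0)   # predictive mean and error bar
```

For a regression network with Gaussian output noise, the noise variance 1/β would be added to the squared error bar; the sketch above shows only the contribution from parameter uncertainty.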

Within the Bayesian framework for data modelling, it is easy to improve our probabilistic models. For example, if we believe that some input variables in a problem may be irrelevant to the predicted quantity, but we don't know which, we can define a new model with multiple hyperparameters that captures the idea of uncertain input variable relevance (MacKay, 1994b; Neal, 1996; MacKay, 1995b); these models then infer automatically from the data which are the relevant input variables for a problem.

44.5 Exercises

Exercise 44.1.[4] How to measure a classifier's quality. You've just written a new classification algorithm and want to measure how well it performs on a test set, and compare it with other classifiers. What performance measure should you use?

There are several standard answers. Let's assume the classifier gives an output y(x), where x is the input, which we won't discuss further, and that the true target value is t. In the simplest discussions of classifiers, both y and t are binary variables, but you might care to consider cases where y and t are more general objects also.

The most widely used measure of performance on a test set is the error rate: the fraction of misclassifications made by the classifier. This measure forces the classifier to give a 0/1 output and ignores any additional information that the classifier might be able to offer, for example an indication of the firmness of a prediction.

Unfortunately, the error rate does not necessarily measure how informative a classifier's output is. Consider frequency tables showing the joint frequency of the 0/1 output of a classifier (horizontal axis) and the true 0/1 variable t (vertical axis). The numbers that we'll show are percentages. The error rate e is the sum of the two off-diagonal numbers, which we could call the false positive rate e+ and the false negative rate e−. Of the following three classifiers, A and B have the same error rate of 10% and C has a greater error rate of 12%.

            Classifier A       Classifier B       Classifier C
             y=0   y=1          y=0   y=1          y=0   y=1
    t=0       90     0           80    10           78    12
    t=1       10     0            0    10            0    10

But clearly classifier A, which simply guesses that the outcome is 0 for all cases, is conveying no information at all about t; whereas classifier B has an informative output: if y = 0 then we are sure that t really is zero; and if y = 1 then there is a 50% chance that t = 1, as compared to the prior probability P(t = 1) = 0.1. Classifier C is slightly less informative than B, but is still much more useful than the information-free classifier A.

One way to improve on the error rate as a performance measure is to report the pair (e+, e−), the false positive error rate and the false negative error rate, which are (0, 0.1) and (0.1, 0) for classifiers A and B. It is especially important to distinguish between these two error probabilities in applications where the two sorts of error have different associated costs. However, there are a couple of problems with the 'error rate pair'.

    How common sense ranks the classifiers: (best) B > C > A (worst).
    How the error rate ranks the classifiers: (best) A = B > C (worst).

First, if I simply told you that classifier A has error rates (0, 0.1) and B has error rates (0.1, 0), it would not be immediately evident that classifier A is actually utterly worthless. Surely we should have a performance measure that gives the worst possible score to A!

Second, if we turn to a multiple-class classification problem such as digit recognition, then the number of types of error increases from two to 10 × 9 = 90, one for each possible confusion of class t with class t'. It would be nice to have some sensible way of collapsing these 90 numbers into a single rankable number that makes more sense than the error rate.

Another reason for not liking the error rate is that it doesn't give a classifier credit for accurately specifying its uncertainty. Consider classifiers that have three outputs available, 0, 1 and a rejection class, '?', which indicates that the classifier is not sure. Consider classifiers D and E with the following frequency tables, in percentages:

            Classifier D             Classifier E
             y=0   y=?   y=1          y=0   y=?   y=1
    t=0       74    10     6           78     6     6
    t=1        0     1     9            0     5     5

Both of these classifiers have (e+, e−, r) = (6%, 0%, 11%). But are they equally good classifiers? Compare classifier E with C. The two classifiers are equivalent. E is just C in disguise: we could make E by taking the output of C and tossing a coin when C says 1 in order to decide whether to give output 1 or '?'. So E is equal to C and thus inferior to B. Now compare D with B. Can you justify the suggestion that D is a more informative classifier than B, and thus is superior to E? Yet D and E have the same (e+, e−, r) scores.

People often plot error-reject curves (also known as ROC curves; ROC stands for 'receiver operating characteristic') which show the total error rate e = e+ + e− versus the rejection rate r as r is allowed to vary from 0 to 1, and use these curves to compare classifiers (figure 44.7). [In the special case of binary classification problems, e+ may be plotted versus e− instead.] But as we have seen, error rates can be undiscerning performance measures. Does plotting one error rate as a function of another make this weakness of error rates go away?

Figure 44.7. An error-reject curve (error rate versus rejection rate). Some people use the area under this curve as a measure of classifier quality.

For this exercise, either construct an explicit example demonstrating that the error-reject curve, and the area under it, are not necessarily good ways to compare classifiers; or prove that they are.

As a suggested alternative method for comparing classifiers, consider the mutual information between the output and the target,

    I(T; Y) ≡ H(T) − H(T | Y) = Σ_{y,t} P(y) P(t | y) log [ P(t | y) / P(t) ],          (44.12)

which measures how many bits the classifier's output conveys about the target. Evaluate the mutual information for classifiers A-E above.

Investigate this performance measure and discuss whether it is a useful one. Does it have practical drawbacks?
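A quick way to start on the exercise is to compute (44.12) directly from the joint frequency tables. The following numpy sketch (my own code, with the tables transcribed from above) prints the mutual information in bits for classifiers A-E.

```python
import numpy as np

def mutual_information(joint_percent):
    """I(T;Y) in bits (44.12), from a joint table with rows indexed by the
    target t and columns by the classifier output y (entries in percent)."""
    P = np.asarray(joint_percent, dtype=float)
    P = P / P.sum()                        # joint P(t, y)
    Pt = P.sum(axis=1, keepdims=True)      # marginal P(t)
    Py = P.sum(axis=0, keepdims=True)      # marginal P(y)
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = P * np.log2(P / (Pt * Py))
    return np.nansum(terms)                # 0 log 0 terms contribute zero

# Rows: t = 0, 1.  Columns: y = 0, 1 (A-C) or y = 0, ?, 1 (D, E).
tables = {
    'A': [[90, 0], [10, 0]],
    'B': [[80, 10], [0, 10]],
    'C': [[78, 12], [0, 10]],
    'D': [[74, 10, 6], [0, 1, 9]],
    'E': [[78, 6, 6], [0, 5, 5]],
}
for name, table in tables.items():
    print(name, mutual_information(table))
```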