Multi-layer feed-forward NN (FFNN). XOR problem. Neural Network for Speech. NETtalk (Sejnowski & Rosenberg, 1987).


Multi-layer feed-forward NN (FFNN)

We consider a more general network architecture: between the input and output layers there are hidden layers, as illustrated below. Hidden nodes do not directly send outputs to the external environment. FFNNs overcome the limitation of single-layer NNs: they can handle non-linearly separable learning tasks.
(Figure: Input layer, Hidden layer, Output layer.)

XOR problem

A typical example of a non-linearly separable function is XOR. This function takes two input arguments with values in {-1, 1} and returns one output in {-1, 1}, as specified by the following table:

  x1   x2   x1 XOR x2
  -1   -1      -1
  -1    1       1
   1   -1       1
   1    1      -1

If we think of -1 and 1 as encodings of the truth values false and true, respectively, then XOR computes the logical exclusive or, which yields true if and only if the two inputs have different truth values.

XOR problem (contd.)

In the graph of XOR, input pairs giving output 1 and -1 are depicted with green and red circles, respectively. These two classes (green and red) cannot be separated using one line, but they can with two lines. The NN below with two hidden nodes realizes this non-linear separation, where each hidden node represents one of the two blue lines. This specific NN uses the sign activation function. Each green arrow is characterized by the weights of one hidden node: it indicates the direction orthogonal to the corresponding line and points to the region where the neuron output is 1. The output node is used to combine the outputs of the two hidden nodes.
(Figure: the four XOR points in the x1-x2 plane and the two separating lines.)

NETtalk (Sejnowski & Rosenberg, 1987)

The task is to learn to pronounce English text from examples (text-to-speech).
Training data is 1024 words from a side-by-side English/phoneme source.
Input: 7 consecutive characters from written text presented in a moving window that scans the text.
Output: a phoneme code giving the pronunciation of the letter at the center of the input window.
Network topology: 7x29 binary inputs (26 characters + punctuation marks), 80 hidden units and 26 output units (phoneme code). Sigmoid units in the hidden and output layers.

NETtalk (contd.)

Training protocol: 95% accuracy on the training set after 50 epochs of training by full gradient descent; 78% accuracy on a set-aside test set.
Comparison against DECtalk (a rule-based expert system): DECtalk performs better; it represents a decade of analysis by linguists. NETtalk learns from examples alone and was constructed with little knowledge of the task.

Neural Network for Speech

(Figure: distinguishing nonlinear regions.)
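Going back to the XOR network described above: as a concrete check of that construction, here is a small Python sketch of a 2-2-1 network with sign activations that reproduces the XOR truth table. It is not from the slides; the specific weight values are illustrative choices, one valid assignment among many.

def sign(v):
    return 1 if v >= 0 else -1   # sign activation, as on the slide

def xor_net(x1, x2):
    # hidden node 1: fires (+1) only for (x1, x2) = (+1, -1)
    h1 = sign(1.0 * x1 - 1.0 * x2 - 1.0)
    # hidden node 2: fires (+1) only for (x1, x2) = (-1, +1)
    h2 = sign(-1.0 * x1 + 1.0 * x2 - 1.0)
    # output node combines the two hidden nodes (logical OR of the two half-planes)
    return sign(1.0 * h1 + 1.0 * h2 + 1.0)

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, xor_net(x1, x2))   # matches the XOR truth table above

Each hidden node implements one of the two separating lines; the output node returns 1 exactly when one (and only one) hidden node is active.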

Neural Net example: ALVINN

ALVINN (Autonomous Land Vehicle In a Neural Network) drives up to 70 mph on public highways.
Output layer: 30 output units corresponding to steering instructions (sharp left, straight ahead, sharp right). Learning means adjusting weight values.
Hidden layer: 4 hidden units.
Input: 30x32 pixels = 960 input values.
Note: most images are from the online slides for Tom Mitchell's book Machine Learning.

Neural Net example: ALVINN (contd.)

The output is an array of 30 values corresponding to steering instructions, e.g. hard left, hard right. The figure shows one hidden node. The input is a 30x32 array of pixel values = 960 values; note that no special visual processing is applied. In the figure, size/colour corresponds to the weight value.

Types of decision regions

A network with a single node separates the plane with one line: w0 + w1 x1 + w2 x2 >= 0 on one side, w0 + w1 x1 + w2 x2 < 0 on the other.
A one-hidden-layer network can realize a convex region: each hidden node realizes one of the lines bounding the convex region.
A two-hidden-layer network can realize the union of three convex regions: each box represents a one-hidden-layer network realizing one convex region.
(Figure: lines L1-L4 bounding a convex region; regions P1, P2, P3.)

FFNN neuron model

The classical learning algorithm of FFNN is based on the gradient descent method. For this reason the activation functions used in FFNN are continuous functions of the weights, differentiable everywhere. A typical activation function that can be viewed as a continuous approximation of the step (threshold) function is the sigmoid function. A sigmoid function for node j is:

  phi(v_j) = 1 / (1 + e^(-a v_j)),  with a > 0 and v_j = sum_i w_ji y_i

where w_ji is the weight of the link from node i to node j and y_i is the output of node i. Increasing a makes the sigmoid steeper; when a tends to infinity, phi tends to the step function.

Training: Backprop algorithm

The Backprop algorithm searches for weight values that minimize the total error of the network over the set of training examples (training set). Backprop consists of the repeated application of the following two passes:
Forward pass: in this step the network is activated on one example and the error of each neuron of the output layer is computed.
Backward pass: in this step the network error is used for updating the weights (credit assignment problem). This process is more complex than the LMS algorithm for Adaline, because hidden nodes are linked to the error not directly but by means of the nodes of the next layer. Therefore, starting at the output layer, the error is propagated backwards through the network, layer by layer. This is done by recursively computing the local gradient of each neuron.
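As a concrete reading of the sigmoid neuron model above, the following Python sketch computes v_j = sum_i w_ji y_i and the sigmoid output for a one-hidden-layer FFNN. The layer sizes, weights and input vector are arbitrary assumptions, not taken from the slides, and bias terms are omitted for brevity.

import numpy as np

def sigmoid(v, a=1.0):
    # phi(v) = 1 / (1 + exp(-a v)); larger a brings it closer to a step function
    return 1.0 / (1.0 + np.exp(-a * v))

rng = np.random.default_rng(0)
x = np.array([0.5, -0.2, 0.8])             # one input example (3 features)
W1 = rng.uniform(-0.5, 0.5, size=(4, 3))   # weights from input to 4 hidden nodes
W2 = rng.uniform(-0.5, 0.5, size=(1, 4))   # weights from hidden layer to 1 output node

v_hidden = W1 @ x                # v_j = sum_i w_ji * y_i for each hidden node j
y_hidden = sigmoid(v_hidden)     # hidden outputs y_j = phi(v_j)
v_out = W2 @ y_hidden
y_out = sigmoid(v_out)           # network output
print(y_out)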

Backprop

Back-propagation training algorithm illustrated: in the forward step, the network is activated and the error is computed; in the backward step, the error is propagated back through the network. Backprop adjusts the weights of the NN in order to minimize the network's total mean squared error.

Total mean squared error

The error of output neuron j after the activation of the network on the n-th training example (x(n), d(n)) is:

  e_j(n) = d_j(n) - y_j(n)

The network error is the sum of the squared errors of the output neurons:

  E(n) = 1/2 * sum over output nodes j of e_j(n)^2

The total mean squared error is the average of the network errors over the N training examples:

  E_AV = (1/N) * sum_{n=1..N} E(n)

Weight update rule

The Backprop weight update rule is based on the gradient descent method: take a step in the direction yielding the maximum decrease of the network error E. This direction is the opposite of the gradient of E:

  Delta w_ji = -eta * dE/dw_ji

Weight update rule (contd.)

The input of neuron j is v_j = sum_{i=0..m} w_ji y_i, where neurons i = 1, ..., m have a link to neuron j and y_i is the output of neuron i. Using the chain rule we can write:

  dE/dw_ji = (dE/dv_j) * (dv_j/dw_ji)

Moreover, if we define the local gradient of neuron j as

  delta_j = -dE/dv_j

then, since dv_j/dw_ji = y_i, we get

  Delta w_ji = eta * delta_j * y_i

Weight update of an output neuron

In order to compute the weight change we need to compute the local gradient of neuron j. There are two cases, depending on whether j is an output or a hidden neuron. If j is an output neuron then, using the chain rule, we obtain:

  delta_j = -dE/dv_j = -(dE/de_j)(de_j/dy_j)(dy_j/dv_j) = e_j * phi'(v_j)

because e_j = d_j - y_j and y_j = phi(v_j). So if j is an output node then the weight from neuron i to neuron j is updated by the following quantity:

  Delta w_ji = eta * e_j * phi'(v_j) * y_i = eta * (d_j - y_j) * phi'(v_j) * y_i

Weight update of a hidden neuron

If j is a hidden neuron then its local gradient is computed using the local gradients of all the neurons of the next layer (the set C of output-layer neurons fed by j). Using the chain rule we have:

  delta_j = -dE/dv_j = -(dE/dy_j)(dy_j/dv_j)

Since E = 1/2 * sum_{k in C} e_k^2, and e_k depends on y_j through v_k,

  dE/dy_j = sum_{k in C} e_k * (de_k/dv_k) * (dv_k/dy_j) = -sum_{k in C} e_k * phi'(v_k) * w_kj = -sum_{k in C} delta_k * w_kj

and dy_j/dv_j = phi'(v_j), so

  delta_j = phi'(v_j) * sum_{k in next layer} delta_k * w_kj

So if j is a hidden node then the weight from neuron i to neuron j is updated by:

  Delta w_ji = eta * phi'(v_j) * y_i * sum_{k in next layer} delta_k * w_kj
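Putting the two cases together, a minimal Python sketch of one forward and backward pass might look as follows. This is illustrative only: it assumes one sigmoid hidden layer with a = 1, no bias terms, and made-up weights and data; it is not the slides' own code.

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

eta = 0.5
x = np.array([0.1, 0.9])            # inputs y_i feeding the hidden layer
d = np.array([1.0])                 # desired output d_j
W1 = np.array([[ 0.2, -0.3],
               [ 0.4,  0.1]])       # input -> hidden weights w_ji
W2 = np.array([[ 0.5, -0.2]])       # hidden -> output weights w_kj

# forward pass
y_h = sigmoid(W1 @ x)               # hidden outputs
y_o = sigmoid(W2 @ y_h)             # network output

# backward pass
e = d - y_o                                        # e_j = d_j - y_j
delta_o = e * y_o * (1.0 - y_o)                    # output node: delta_j = e_j * phi'(v_j)
delta_h = (W2.T @ delta_o) * y_h * (1.0 - y_h)     # hidden node: phi'(v_j) * sum_k delta_k w_kj

# weight updates: Delta w_ji = eta * delta_j * y_i
W2 += eta * np.outer(delta_o, y_h)
W1 += eta * np.outer(delta_h, x)

Here y * (1 - y) is phi'(v) for the sigmoid with a = 1, matching the formula given later in the delta-rule summary.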

Error back-propagation

The flow-graph below illustrates how errors are back-propagated to a hidden neuron j: the errors e_1, ..., e_m of the next-layer neurons are passed through phi'(v_1), ..., phi'(v_m), weighted by w_1j, ..., w_mj, and summed.
(Figure: flow-graph of error back-propagation to hidden neuron j.)

Summary: Delta rule

  Delta w_ji = eta * delta_j * y_i, where v_j = sum_i w_ji y_i and

  delta_j = phi'(v_j) * (d_j - y_j)                          IF j is an output node
  delta_j = phi'(v_j) * sum_{k of next layer} delta_k w_kj   IF j is a hidden node

For sigmoid activation functions: phi'(v_j) = a * y_j * (1 - y_j).

Generalized delta rule

If eta is small then the algorithm learns the weights very slowly, while if eta is large then the large changes of the weights may cause unstable behavior, with oscillations of the weight values. A technique for tackling this problem is the introduction of a momentum term in the delta rule which takes into account previous updates. We obtain the following generalized delta rule:

  Delta w_ji(n) = alpha * Delta w_ji(n-1) + eta * delta_j(n) * y_i(n),   0 <= alpha < 1 (momentum constant)

The momentum term accelerates the descent in steady downhill directions and has a stabilizing effect in directions that oscillate in time.

Other techniques: eta adaptation

Other heuristics for accelerating the convergence of the back-prop algorithm through eta adaptation:
Heuristic 1: every weight has its own eta.
Heuristic 2: every eta is allowed to vary from one iteration to the next.

Backprop learning algorithm (incremental mode)

  n = 1; initialize the weights w(n) randomly;
  while (stopping criterion not satisfied and n < max_iterations)
      for each training example (x, d)
          - run the network with input x and compute the output y
          - update the weights in backward order, starting from those of the output layer:
            w_ji = w_ji + Delta w_ji, with Delta w_ji computed using the (generalized) delta rule
      end-for
      n = n + 1;
  end-while;

Backprop algorithm (batch mode)

In batch mode the weights are updated only after all examples have been processed, using the formula

  Delta w_ji = eta * sum over training examples x of delta_j(x) * y_i(x)

The learning process continues on an epoch-by-epoch basis until the stopping condition is satisfied. In the incremental mode, from one epoch to the next, choose a randomized ordering for selecting the examples in the training set in order to avoid poor performance.
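An illustrative Python sketch of the incremental-mode loop with the momentum term of the generalized delta rule is given below. The toy dataset (XOR with 0/1 targets), layer sizes and hyperparameter values are assumptions, not taken from the slides; a constant 1 is appended to the inputs and hidden outputs to stand in for bias weights.

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# toy XOR-style data; the last column of X is a constant bias input
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
D = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(1)
W1 = rng.uniform(-0.5, 0.5, (3, 3))     # 3 hidden nodes, 2 inputs + bias
W2 = rng.uniform(-0.5, 0.5, (1, 4))     # 1 output node, 3 hidden outputs + bias
dW1 = np.zeros_like(W1)                 # previous updates, for the momentum term
dW2 = np.zeros_like(W2)
eta, alpha = 0.3, 0.8                   # learning rate and momentum constant

for epoch in range(10000):
    for idx in rng.permutation(len(X)):                 # randomized ordering each epoch
        x, d = X[idx], D[idx]
        y_h = np.append(sigmoid(W1 @ x), 1.0)           # hidden outputs + bias input
        y_o = sigmoid(W2 @ y_h)
        delta_o = (d - y_o) * y_o * (1.0 - y_o)                          # output local gradient
        delta_h = (W2[:, :3].T @ delta_o) * y_h[:3] * (1.0 - y_h[:3])    # hidden local gradients
        # generalized delta rule: Dw(n) = alpha * Dw(n-1) + eta * delta * y
        dW2 = alpha * dW2 + eta * np.outer(delta_o, y_h)
        dW1 = alpha * dW1 + eta * np.outer(delta_h, x)
        W2 += dW2
        W1 += dW1

for x, d in zip(X, D):
    y_h = np.append(sigmoid(W1 @ x), 1.0)
    # outputs should move toward the XOR targets; exact values depend on the random initialization
    print(x[:2], d, sigmoid(W2 @ y_h))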

Stopping criteria

Sensible stopping criteria:
- Total mean squared error change: back-prop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (in the range [0.01, 0.1]).
- Generalization-based criterion: after each epoch the NN is tested for generalization using a different set of examples (validation set). If the generalization performance is adequate then stop.

NN design

Data representation. Network topology. Network parameters. Training. Validation.

Data representation

Data representation depends on the problem. In general, NNs work on continuous (real-valued) attributes, so symbolic attributes are encoded into continuous ones. Attributes of different types may have different ranges of values, which affects the training process. Normalization may be used, like the following one, which scales each attribute to assume values between 0 and 1:

  x' = (x - min) / (max - min)

for each value x of an attribute, where min and max are the minimum and maximum value of that attribute over the training set.

Network topology

The number of layers and of neurons depends on the specific task. In practice this issue is solved by trial and error. Two types of adaptive algorithms can be used:
- start from a large network and successively remove some neurons and links until network performance degrades;
- begin with a small network and introduce new neurons until performance is satisfactory.

Network parameters

How are the weights initialized? How is the learning rate chosen? How many hidden layers and how many neurons? How many examples in the training set?

Initialization of weights

In general, initial weights are randomly chosen, with typical values between -1.0 and 1.0 or -0.5 and 0.5. If some inputs are much larger than others, random initialization may bias the network to give much more importance to larger inputs. In such a case, weights can instead be initialized with a scaling that takes the magnitudes of the inputs x_1, ..., x_N into account (a 1/(2N) factor), with one formula for the weights from the input to the first layer and a corresponding one for the weights from the first to the second layer.
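The normalization described above under Data representation is ordinary min-max scaling. A small Python sketch, with made-up attribute values:

import numpy as np

X = np.array([[2.0, 150.0],
              [4.0, 300.0],
              [3.0, 225.0]])             # training set: 3 examples, 2 attributes

x_min = X.min(axis=0)                     # per-attribute minimum over the training set
x_max = X.max(axis=0)                     # per-attribute maximum over the training set
X_scaled = (X - x_min) / (x_max - x_min)  # each attribute now lies in [0, 1]
print(X_scaled)

The same min and max (computed on the training set) would be reused to scale validation and test examples.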

Choice of learning rate

The right value of eta depends on the application. Values between 0.1 and 0.9 have been used in many applications. Other heuristics adapt eta during the training with different methods.

Network training

How do you ensure that a network has been well trained? Objective: to achieve good generalization accuracy on new examples/cases.
- Establish a maximum acceptable error rate.
- Train the network using a validation test set to tune it.
- Validate the trained network against a separate test set, usually referred to as a production test set.

Network training - Approach #1: Large sample

When the amount of available data is large: divide the available examples randomly into 70% training and 30% test. The training set is used to develop one ANN model; the test set is used to compute the test error. Generalization error = test error.

Network training - Approach #2: Cross-validation

When the amount of available data is small: divide the available examples into 90% training and 10% test, and repeat 10 times. The training sets are used to develop 10 different ANN models and the test errors are accumulated. The generalization error is determined by the mean test error and its standard deviation.

Network training - selecting between designs

How do you select between two ANN designs? A statistical test of hypothesis is required to ensure that a significant difference exists between the error rates of the two ANN models. If the Large Sample method has been used, apply McNemar's test*. If Cross-validation has been used, use a paired t test for the difference of two proportions.
*We assume a classification problem; if this is function approximation, then use a paired t test for the difference of means.

Network training - mastering ANN parameters

  Parameter        Typical   Range
  learning rate    0.1       0.01 - 0.99
  momentum         0.8       0.1 - 0.9
  weight-cost      0.1       0.001 - 0.5

Fine tuning: adjust individual parameters at each node and/or connection weight; automatic adjustment during training.
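The two data-splitting approaches above can be sketched with plain index shuffling. This is illustrative Python only; the dataset size and the train_and_test routine are stand-ins for whatever model-building code is actually used, not something defined in the slides.

import numpy as np

# Approach #1 (large sample): one random 70% / 30% split of the available examples
n_examples = 100                                     # illustrative size
idx = np.random.default_rng(1).permutation(n_examples)
cut = int(0.7 * n_examples)
train_idx, test_idx = idx[:cut], idx[cut:]           # 70% training, 30% test

# Approach #2 (cross-validation, as described above): repeat a 90% / 10% split 10 times
def estimate_generalization_error(X, y, train_and_test, repeats=10, seed=0):
    # train_and_test(X_tr, y_tr, X_te, y_te) is a placeholder for any routine that
    # trains an ANN on the training part and returns its error on the test part.
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(repeats):
        order = rng.permutation(len(X))
        split = int(0.9 * len(X))
        tr, te = order[:split], order[split:]
        errors.append(train_and_test(X[tr], y[tr], X[te], y[te]))
    return np.mean(errors), np.std(errors)            # mean test error and its stddev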

Network training - weight initialization

Use random initial values within +/- some range, with smaller weight values for nodes with many incoming connections. Rule of thumb: the initial weight range should be approximately 1 / (# weights coming into a node).

Network training - typical problems during training

Would like: a steady, rapid decline in the total error E as the number of iterations grows.
But sometimes the error flattens out at a high value: this is seldom a local minimum; reduce the learning or momentum parameter. Or the error keeps fluctuating: reduce the learning parameters; this may indicate that the data is not learnable.
(Figure: three plots of E versus number of iterations.)

Training strategies

Online training: update the weights after each sample.
Offline (batch) training: compute the error over all samples, then update the weights.
Online training is noisy and sensitive to individual instances; however, it may escape local minima.

Training strategy

To avoid overfitting:
- Split the data into training, validation & test sets.
- Also, avoid excess weights (use fewer weights than samples).
- Initialize with small random weights, so that small changes have a noticeable effect.
- Use offline training until the validation set error reaches its minimum.
- Evaluate on the test set (no more weight changes).

Size of training set

Rule of thumb: the number of training examples should be at least five to ten times the number of weights of the network.
Other rule: N >= W / (1 - a), where W = number of weights and a = expected accuracy.

Validation - ML Manager

Sometimes you'll need to use a validation set (separate from the training or test set) for stopping criteria, etc. In these cases you should take the validation set out of the training set which has already been given by the previous routines. For example, you might use the random test set method to randomly break the original data set into an 80% training set and a 20% test set. Independent of and subsequent to the above routines, you would take n% of the training set to be a validation set for that particular training exercise.
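The weight-initialization and training-set-size rules of thumb above translate directly into code. A small illustrative Python sketch; the example numbers are arbitrary (the 3960 roughly matches the ALVINN-sized layers mentioned earlier, 960x4 + 4x30 weights).

import math

def init_weight_range(fan_in):
    # rule of thumb above: initial weight range roughly 1 / (number of weights coming into the node)
    # (other common variants scale by 1 / sqrt(fan_in))
    return 1.0 / fan_in

def min_training_examples(num_weights, expected_accuracy):
    # rule of thumb above: N >= W / (1 - a)
    return num_weights / (1.0 - expected_accuracy)

print(init_weight_range(30))              # a node with 30 incoming weights -> about +/- 0.033
print(min_training_examples(3960, 0.9))   # 3960 weights, 90% expected accuracy -> about 39600 examples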

Expressive power of FFNN

Boolean functions: every boolean function can be described by a network with a single hidden layer, but it might require a number of hidden neurons that is exponential in the number of inputs.
Continuous functions: every bounded piecewise-continuous function can be approximated with arbitrarily small error by a network with one hidden layer. Any continuous function can be approximated to arbitrary accuracy by a network with two hidden layers.

Applications of FFNN

Classification, pattern recognition: FFNN can be applied to tackle non-linearly separable learning tasks.
- Recognizing printed or handwritten characters
- Face recognition
- Classification of loan applications into credit-worthy and non-credit-worthy groups
- Analysis of sonar and radar signals to determine the nature of the source of a signal

Regression and forecasting: FFNN can be applied to learn non-linear functions (regression), and in particular functions whose input is a sequence of measurements over time (time series).