Multilayer Perceptrons
Informatics 1 CG: Lecture 6

Mirella Lapata
School of Informatics, University of Edinburgh
mlap@inf.ed.ac.uk

Reading: Kevin Gurney's Introduction to Neural Networks, Chapters 5-6.5
January 2016

Recap: Perceptrons

- Connectionism is a computer modelling approach inspired by neural networks.
- Anatomy of a connectionist model: units, connections.
- The Perceptron as a linear classifier.
- A learning algorithm for Perceptrons.
- Key limitation: only works for linearly separable data.

[Figure: a perceptron -- inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$ feed a unit that computes $u = \sum_{i=1}^{n} w_i x_i$ and outputs $y$.]
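As a concrete reminder of the recap above, here is a minimal sketch of the perceptron as a linear classifier together with its learning rule; the function names and the threshold at 0 are my own choices, not from the slides:

```python
import numpy as np

def perceptron_output(w, x):
    """Step activation applied to the weighted sum u = sum_i w_i * x_i."""
    return 1 if np.dot(w, x) > 0 else 0

def perceptron_update(w, x, t, eta=0.1):
    """One step of the perceptron learning rule: w_i <- w_i + eta * (t - o) * x_i."""
    o = perceptron_output(w, x)
    return np.asarray(w, dtype=float) + eta * (t - o) * np.asarray(x, dtype=float)
```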
Multilayer Perceptrons (MLPs)

[Figure: an MLP -- an input layer $x_1, \dots, x_n$, a hidden layer $h$, and an output layer $y$, with a weight $w$ on every connection.]

- MLPs are feed-forward neural networks, organized in layers.
- One input layer, one or more hidden layers, one output layer.
- Each node in a layer is connected to all the nodes in the next layer.
- Each connection has a weight (which can be zero).

Activation Functions

[Figure: plots of the step function and the sigmoid function, $h$ against $x$.]

- Step function: outputs 0 or 1.
- Sigmoid function: outputs a real value between 0 and 1.

Sigmoids

[Figure: an MLP whose hidden and output nodes all apply sigmoid activations.]

Learning with MLPs

- As with perceptrons, finding the right weights is very hard!
- Solution technique: learning!
- Learning: adjusting the weights based on training examples.
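To make the architecture concrete, here is a minimal sketch of the two activation functions above and of a fully connected forward pass through one hidden layer; the array shapes, the absence of bias terms, and names such as `mlp_forward` are my own simplifications:

```python
import numpy as np

def step(u):
    """Step function: outputs 0 or 1."""
    return np.where(u > 0, 1.0, 0.0)

def sigmoid(u, a=1.0):
    """Sigmoid function: outputs a real value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-a * u))

def mlp_forward(x, W_hid, W_out):
    """One input layer, one hidden layer, one output layer; each node is
    connected to every node in the next layer via a weight matrix."""
    h = sigmoid(W_hid @ x)      # hidden-layer activations
    return sigmoid(W_out @ h)   # output-layer activations
```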
Supervised Learning: General Idea

1. Send the MLP an input pattern, $x$, from the training set.
2. Get the output from the MLP, $y$.
3. Compare $y$ with the right answer, or target $t$, to get the error quantity.
4. Use the error quantity to modify the weights, so next time $y$ will be closer to $t$.
5. Repeat with another $x$ from the training set.

When updating weights after seeing $x$, the network doesn't just change the way it deals with $x$, but other inputs too... inputs it has not seen yet! Generalization is the ability to deal accurately with unseen inputs.

Learning and Error Minimization

Recall: Perceptron Learning Rule
Minimize the difference between the actual and desired outputs:
$w_i \leftarrow w_i + \eta (t - o) x_i$

Error Function: Mean Squared Error (MSE)
An error function represents such a difference over a set of inputs:
$E(\vec{w}) = \frac{1}{2N} \sum_{p=1}^{N} (t_p - o_p)^2$

- $N$ is the number of patterns
- $t_p$ is the target output for pattern $p$
- $o_p$ is the output obtained for pattern $p$
- the $\frac{1}{2}$ makes little difference, but makes life easier later on!

Gradient Descent

One technique that can be used for minimizing functions is gradient descent. Can we use this on our error function $E$?

We would like a learning rule that tells us how to update weights, like this:
$w_i = w_i + \Delta w_i$
But what should $\Delta w_i$ be?

Gradient and Derivatives: The Idea

- The derivative is a measure of the rate of change of a function as its input changes.
- For a function $y = f(x)$, the derivative $\frac{dy}{dx}$ indicates how much $y$ changes in response to changes in $x$.
- If $x$ and $y$ are real numbers, and if the graph of $y$ is plotted against $x$, the derivative measures the slope or gradient of the line at each point, i.e., it describes the steepness or incline.
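To make the error function concrete, here is a quick numerical sketch of the MSE defined above; the example targets and outputs are made up:

```python
import numpy as np

def mean_squared_error(t, o):
    """E(w) = 1/(2N) * sum_p (t_p - o_p)^2."""
    t, o = np.asarray(t, dtype=float), np.asarray(o, dtype=float)
    return np.sum((t - o) ** 2) / (2 * len(t))

# Three patterns: targets vs the outputs the network actually produced.
print(mean_squared_error([1, 0, 1], [0.9, 0.2, 0.8]))  # 0.015
```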
Gradient and Derivatives: The Idea

- $\frac{dy}{dx} > 0$ implies that $y$ increases as $x$ increases. If we want to find the minimum $y$, we should reduce $x$.
- $\frac{dy}{dx} < 0$ implies that $y$ decreases as $x$ increases. If we want to find the minimum $y$, we should increase $x$.
- $\frac{dy}{dx} = 0$ implies that we are at a minimum or maximum or a plateau.
- To get closer to the minimum: $x_{new} = x_{old} - \eta \frac{dy}{dx}$

So, we know how to use derivatives to adjust one input value. But we have several weights to adjust! We need to use partial derivatives. A partial derivative of a function of several variables is its derivative with respect to one of those variables, with the others held constant.

Example: if $y = f(x_1, x_2)$, then we can have $\frac{\partial y}{\partial x_1}$ and $\frac{\partial y}{\partial x_2}$.

In our learning rule case, if we can work out the partial derivatives, we can use this rule to update the weights:
$w_i = w_i + \Delta w_i$ where $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$

Summary So Far

- We learnt what a multilayer perceptron is.
- We know a learning rule for updating weights in order to minimise the error: $w_i = w_i + \Delta w_i$ where $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$.
- $\Delta w_i$ tells us in which direction and how much we should change each weight to roll down the slope (descend the gradient) of the error function $E$.

Using Gradient Descent to Minimize the Error

So, how do we calculate $\frac{\partial E}{\partial w_i}$? Recall the mean squared error function $E$, which we want to minimize:
$E(\vec{w}) = \frac{1}{2N} \sum_{p=1}^{N} (t_p - o_p)^2$
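Here is a minimal sketch of the one-variable update $x_{new} = x_{old} - \eta \frac{dy}{dx}$ in action; the example function, learning rate, and step count are my own choices:

```python
def gradient_descent_1d(dy_dx, x, eta=0.1, steps=100):
    """Walk downhill by repeatedly applying x_new = x_old - eta * dy/dx."""
    for _ in range(steps):
        x -= eta * dy_dx(x)
    return x

# Minimize y = (x - 3)^2, whose derivative is dy/dx = 2 * (x - 3).
print(gradient_descent_1d(lambda x: 2 * (x - 3), x=0.0))  # converges towards 3.0
```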
Using Gradient Descent to Minimize the Error

If we use a sigmoid activation function $f$, then the output of neuron $j$ for pattern $p$ is:
$o_{pj} = f(u_j) = \frac{1}{1 + e^{-a u_j}}$
where $a$ is a pre-defined constant and $u_j$ is the result of the input function of neuron $j$:
$u_j = \sum_i w_{ij} x_i$

For the $p$th pattern and the $j$th neuron, we use gradient descent on the error function:
$\Delta w_{ij} = -\eta \frac{\partial E_p}{\partial w_{ij}} = \eta (t_{pj} - o_{pj}) f'(u_j) x_i$
where $f'(u_j) = \frac{df}{du_j}$ is the derivative of $f$ with respect to $u_j$. If $f$ is the sigmoid function, $f'(u_j) = a f(u_j)(1 - f(u_j))$.

We can update weights after processing each pattern, using the rule:
$\Delta w_{ij} = \eta (t_{pj} - o_{pj}) f'(u_j) x_i = \eta \delta_{pj} x_i$
This is known as the generalized delta rule.

- We need to use the derivative of the activation function $f$, so $f$ must be differentiable!
- The threshold activation function is not continuous, thus not differentiable!
- The sigmoid has a derivative which is easy to calculate.

Updating Output vs Hidden Neurons

We can update output neurons using the generalized delta rule:
$\Delta w_{ij} = \eta \delta_{pj} x_i$ with $\delta_{pj} = (t_{pj} - o_{pj}) f'(u_j)$

This $\delta_{pj}$ is only good for the output neurons, since it relies on target outputs. But we don't have target outputs for the hidden nodes! What can we use instead?
$\delta_{pj} = \left( \sum_k w_{jk} \delta_{pk} \right) f'(u_j)$
where $k$ ranges over the nodes that hidden node $j$ feeds into. This rule propagates error back from output nodes to hidden nodes. In effect, it blames hidden nodes according to how much influence they had. So, now we have rules for updating both output and hidden neurons!
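A minimal sketch of the two deltas, assuming the sigmoid above with constant $a$; the helper names are mine:

```python
import numpy as np

def sigmoid(u, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * u))

def sigmoid_prime(u, a=1.0):
    """f'(u) = a * f(u) * (1 - f(u)) -- easy to compute from f(u) itself."""
    fu = sigmoid(u, a)
    return a * fu * (1.0 - fu)

def delta_output(t, o, u):
    """Output neuron: delta_pj = (t_pj - o_pj) * f'(u_j)."""
    return (t - o) * sigmoid_prime(u)

def delta_hidden(w_jk, deltas_k, u):
    """Hidden neuron: delta_pj = (sum_k w_jk * delta_pk) * f'(u_j), blaming
    node j in proportion to its influence on the nodes k it feeds into."""
    return np.dot(w_jk, deltas_k) * sigmoid_prime(u)
```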
raton: aton: Present the pattern at the nput layer. Propagate forward actvatons. 3 Propagate backward error. Present the pattern at the nput layer. Propagate forward actvatons. 3 Propagate backward error. 9 / 33 3 / 33 raton: Onlne Present the pattern at the nput layer. Propagate forward actvatons. 3 Propagate backward error. 4 Calculate E w 5 Repate for all patterns and sum up. : Intalze all weghts to small random values. : repeat 3: for each tranng example do 4: Forward propagate the nput features of the example to determne the MLP s outputs. 5: Back propagate error to generate w for all weghts w. 6: Update the weghts usng w. 7: end for 8: untl stoppng crtera reached. 3 / 33 3 / 33
Summary

- We learnt what a multilayer perceptron is.
- We have some intuition about using gradient descent on an error function.
- We know a learning rule for updating weights in order to minimize the error: $\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}$
- If we use the squared error, we get the generalized delta rule: $\Delta w_{ij} = \eta \delta_{pj} x_i$.
- We know how to calculate $\delta_{pj}$ for output and hidden layers.
- We can use this rule to learn an MLP's weights using the backpropagation algorithm.
- Next lecture: a neural network model of the past tense.