Admin BACKPROPAGATION. Neural network. Neural network 11/3/16. Assignment 7. Assignment 8 Goals today. David Kauchak CS158 Fall 2016



Admin
- Assignment 7
- Assignment 8
- Goals today

BACKPROPAGATION
David Kauchak, CS158, Fall 2016

Neural network
- Some inputs are provided/entered.
- Individual perceptrons/neurons

Neural network
- Each perceptron computes and calculates an answer.
- Those answers become inputs for the next level.
- Finally, we get the answer after all levels compute.

A neuron/perceptron
Inputs $x_1, x_2, x_3, x_4$ arrive with weights $w_1, w_2, w_3, w_4$. The neuron computes the weighted sum $in = \sum_i w_i x_i$ and passes it through an activation function $g$ to produce the output $y = g(in)$.
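As a minimal sketch of the neuron described above, the following computes the weighted sum and applies a hard threshold. The weights, inputs, and the function name `neuron_output` are made up for illustration and do not come from the slides.

```python
def neuron_output(weights, inputs, threshold=0.0):
    """Weighted sum of the inputs followed by a hard-threshold activation."""
    total = sum(w * x for w, x in zip(weights, inputs))  # in = sum_i w_i x_i
    return 1 if total > threshold else 0                 # y = g(in)


# Hypothetical weights and inputs, for illustration only.
weights = [0.5, -1.0, 0.25, 2.0]
inputs = [1, 1, 0, 1]
y = neuron_output(weights, inputs)  # in = 0.5 - 1.0 + 0.0 + 2.0 = 1.5
```

Swapping the last line of `neuron_output` for a smooth function such as the sigmoid gives the differentiable neuron that the rest of the lecture relies on.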

Activation functions
- Hard threshold: $g(in) = 1$ if $in > b$, $0$ otherwise
- Sigmoid: $g(x) = \frac{1}{1 + e^{-x}}$
- tanh: $g(x) = \tanh x$

Training
[Figure: a network with inputs $x_1$ and $x_2$ in which every weight and threshold is unknown ($b = ?$), with output $x_1$ xor $x_2$, beside the truth table for $x_1$ xor $x_2$.]
How do we learn the weights?

Learning in multilayer networks
Challenge: for multilayer networks, we don't know what the expected output/error is for the internal nodes!
[Figure: a perceptron/linear model beside a neural network. For the internal weights of the network: how do we learn these weights? What is the expected output?]

Backpropagation: intuition
Gradient descent method for learning weights by optimizing a loss function:
1. Calculate output of all nodes.
2. Calculate the weights for the output layer based on the error.
3. Backpropagate errors through hidden layers.

Backpropagation: intuition
- We can calculate the actual error at the output layer.
- Key idea: propagate the error back to the hidden layer.
- The error for a hidden node is roughly $w_i \cdot error$; for example, the node connected through $w_3$ gets $\sim w_3 \cdot error$.
- Calculate as normal, but weight the error.

Backpropagation: the details
Gradient descent method for learning weights by optimizing a loss function:
1. Calculate output of all nodes.
2. Calculate the updates directly for the output layer.
3. Backpropagate errors through hidden layers.

The loss is the squared error:
$$loss = \sum_x \frac{1}{2}(y - \hat{y})^2$$

Notation:
- $x_1, \ldots, x_m$: features/inputs
- $d$: number of hidden nodes
- $h_k$: output from hidden nodes
- $v_1, \ldots, v_d$: weights from the hidden nodes to the output node

How many weights (ignore bias for now)? From the hidden layer to the output there are $d$ weights: denote them $v_k$.
How many weights are there from the inputs to the hidden layer?

Backpropagation: the details
Notation:
- $x_1, \ldots, x_m$: features/inputs
- $d$: number of hidden nodes
- $h_k$: output from hidden nodes
- $d \cdot m$ weights from the inputs to the hidden layer: denote them $w_{kj}$, where the first index is the hidden node and the second index is the feature
- $w_{23}$: weight from input 3 to hidden node 2
- $w_4$: all the $m$ weights associated with hidden node 4

Gradient descent method for learning weights by optimizing a loss function:
$$\arg\min_{w,v} \sum_x \frac{1}{2}(y - \hat{y})^2$$
1. Calculate output of all nodes.
2. Calculate the updates directly for the output layer.
3. Backpropagate errors through hidden layers.

Backpropagation: the details
1. Calculate outputs of all nodes
What are the $h_k$ in terms of $x$ and $w$?
$$h_k = f(w_k \cdot x), \qquad w_k \cdot x = \sum_j w_{kj} x_j$$
where $f$ is the activation function.

Backpropagation: the details
1. Calculate outputs of all nodes
With the sigmoid as the activation function $f$:
$$h_k = f(w_k \cdot x) = \frac{1}{1 + e^{-w_k \cdot x}}$$
What is $out$ in terms of $h$ and $v$?
$$out = f(v \cdot h) = \frac{1}{1 + e^{-v \cdot h}}$$

Backpropagation: the details
2. Calculate new weights for output layer
$$\arg\min_{w,v} \sum_x \frac{1}{2}(y - \hat{y})^2$$
Want to take a small step towards decreasing loss.
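The forward computation for this single-hidden-layer network can be sketched as follows. The function names and the weight values are illustrative assumptions; only the formulas $h_k = f(w_k \cdot x)$ and $out = f(v \cdot h)$ come from the slides.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def forward(w, v, x):
    """w[k][j]: weight from feature j to hidden node k; v[k]: hidden k to output."""
    h = [sigmoid(dot(w_k, x)) for w_k in w]  # h_k = f(w_k . x)
    out = sigmoid(dot(v, h))                 # out = f(v . h)
    return h, out


# Hypothetical weights: d = 2 hidden nodes, m = 3 features (no bias).
w = [[0.1, -0.2, 0.3],
     [0.0, 0.5, -0.5]]
v = [1.0, -1.0]
h, out = forward(w, v, [1.0, 0.0, 1.0])
```

The hidden outputs `h` are returned alongside `out` because the weight updates in the later steps need both.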

Output layer weights
With $\hat{y} = f(v \cdot h)$ and $v \cdot h = \sum_k v_k h_k$:
$$\begin{aligned}
\frac{\partial\, loss}{\partial v_k} &= \frac{\partial}{\partial v_k}\left(\frac{1}{2}(y - \hat{y})^2\right) \\
&= \frac{\partial}{\partial v_k}\left(\frac{1}{2}(y - f(v \cdot h))^2\right) \\
&= (y - f(v \cdot h))\,\frac{\partial}{\partial v_k}\left(y - f(v \cdot h)\right) \\
&= -(y - f(v \cdot h))\,\frac{\partial}{\partial v_k} f(v \cdot h) \\
&= -(y - f(v \cdot h))\, f'(v \cdot h)\,\frac{\partial}{\partial v_k}(v \cdot h) \\
&= -(y - f(v \cdot h))\, f'(v \cdot h)\, h_k
\end{aligned}$$

The actual update is a step towards decreasing loss:
$$v_k = v_k + (y - f(v \cdot h))\, f'(v \cdot h)\, h_k$$

What are each of these? Do they make sense individually?
- $(y - f(v \cdot h))$: how far from correct and which direction
- $f'(v \cdot h)$: slope of the activation function where input is at
- $h_k$: size and direction of the feature associated with this weight

Output layer weights
$$v_k = v_k + (y - f(v \cdot h))\, f'(v \cdot h)\, h_k$$

$(y - f(v \cdot h))$: how far from correct and which direction
- $(y - f(v \cdot h)) > 0$: prediction < label: increase the weight
- $(y - f(v \cdot h)) < 0$: prediction > label: decrease the weight
- bigger difference = bigger change

$f'(v \cdot h)$: slope of the activation function where input is at
- bigger step where the activation function is steep
- smaller step where it is flat

$h_k$: size and direction of the feature associated with this weight. Compare with:
- perceptron update: $w_j = w_j + x_{ij} y_i$
- gradient descent update: $w_j = w_j + x_{ij} y_i c$
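A minimal sketch of one output-layer update, assuming a sigmoid activation (so $f'(z) = f(z)(1 - f(z))$) and no learning rate. The hidden outputs and starting weights are hypothetical values chosen so the arithmetic is easy to follow.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def update_output_weights(v, h, y):
    """One step of v_k = v_k + (y - f(v.h)) f'(v.h) h_k with a sigmoid f."""
    out = sigmoid(sum(vk * hk for vk, hk in zip(v, h)))
    error = y - out            # how far from correct, and which direction
    slope = out * (1.0 - out)  # slope of the sigmoid at the current input
    return [vk + error * slope * hk for vk, hk in zip(v, h)]


h = [0.6, 0.4]  # hypothetical hidden-node outputs
v = [0.0, 0.0]  # hypothetical starting weights
new_v = update_output_weights(v, h, y=1.0)
# out = 0.5, error = 0.5, slope = 0.25, so each weight moves by 0.125 * h_k
```

With label 1 and prediction 0.5 both weights increase, and the weight attached to the larger hidden output moves further, matching the three readings of the update given above.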

Backpropagation: the details
3. Backpropagate errors through hidden layers
$$\arg\min_{w,v} \sum_x \frac{1}{2}(y - \hat{y})^2$$
Want to take a small step towards decreasing loss.

Hidden layer weights
With $\hat{y} = f(v \cdot h)$ and $h_k = f(w_k \cdot x)$:
$$\begin{aligned}
\frac{\partial\, loss}{\partial w_{kj}} &= \frac{\partial}{\partial w_{kj}}\left(\frac{1}{2}(y - \hat{y})^2\right) \\
&= \frac{\partial}{\partial w_{kj}}\left(\frac{1}{2}(y - f(v \cdot h))^2\right) \\
&= (y - f(v \cdot h))\,\frac{\partial}{\partial w_{kj}}\left(y - f(v \cdot h)\right) \\
&= -(y - f(v \cdot h))\,\frac{\partial}{\partial w_{kj}} f(v \cdot h) \\
&= -(y - f(v \cdot h))\, f'(v \cdot h)\,\frac{\partial}{\partial w_{kj}}(v \cdot h) \\
&= -(y - f(v \cdot h))\, f'(v \cdot h)\, v_k\,\frac{\partial}{\partial w_{kj}} f(w_k \cdot x)
\end{aligned}$$
The last step keeps only the $v_k h_k$ term: the derivatives of the other $v \cdot h$ components are not affected by $w_{kj}$.

Hidden layer weights
Using $w_k \cdot x = \sum_j w_{kj} x_j$, the derivation finishes as:
$$\begin{aligned}
\frac{\partial\, loss}{\partial w_{kj}} &= -(y - f(v \cdot h))\, f'(v \cdot h)\,\frac{\partial}{\partial w_{kj}}(v \cdot h) \\
&= -(y - f(v \cdot h))\, f'(v \cdot h)\, v_k\,\frac{\partial}{\partial w_{kj}} f(w_k \cdot x) \\
&= -(y - f(v \cdot h))\, f'(v \cdot h)\, v_k\, f'(w_k \cdot x)\,\frac{\partial}{\partial w_{kj}}(w_k \cdot x) \\
&= -(y - f(v \cdot h))\, f'(v \cdot h)\, v_k\, f'(w_k \cdot x)\, x_j
\end{aligned}$$
What happened here? What is the slope of $v \cdot h$ with respect to $w_{kj}$? Only $h_k$ depends on $w_{kj}$, so the slope is $v_k\, f'(w_k \cdot x)\, x_j$.

Why all the math?
"I also wouldn't mind more math!"
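One way to sanity-check the hidden-layer derivative is to compare it against a finite-difference estimate of the loss. The sketch below does this for a single weight, assuming a sigmoid activation; the network weights, the example, and the chosen indices `k` and `j` are arbitrary illustrative values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def loss(w, v, x, y):
    h = [sigmoid(dot(w_k, x)) for w_k in w]
    return 0.5 * (y - sigmoid(dot(v, h))) ** 2


def hidden_gradient(w, v, x, y, k, j):
    """Analytic d loss / d w_kj: -(y - out) f'(v.h) v_k f'(w_k.x) x_j."""
    h = [sigmoid(dot(w_k, x)) for w_k in w]
    out = sigmoid(dot(v, h))
    return -(y - out) * out * (1 - out) * v[k] * h[k] * (1 - h[k]) * x[j]


# Hypothetical network and training example.
w = [[0.1, -0.2], [0.4, 0.3]]
v = [0.5, -0.7]
x, y = [1.0, 2.0], 1.0
k, j, eps = 1, 0, 1e-6

# Central finite difference on the single weight w[k][j].
w_plus = [row[:] for row in w]
w_plus[k][j] += eps
w_minus = [row[:] for row in w]
w_minus[k][j] -= eps
numeric = (loss(w_plus, v, x, y) - loss(w_minus, v, x, y)) / (2 * eps)
analytic = hidden_gradient(w, v, x, y, k, j)
```

If the derivation is right, `numeric` and `analytic` agree to many decimal places; a sign error or a dropped factor shows up immediately as a mismatch.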

Backpropagation
output layer:
$$\frac{\partial\, loss}{\partial v_k} = -(y - f(v \cdot h))\, f'(v \cdot h)\, h_k$$
hidden layer:
$$\frac{\partial\, loss}{\partial w_{kj}} = -(y - f(v \cdot h))\, f'(v \cdot h)\, v_k\, f'(w_k \cdot x)\, x_j$$

Both share the same leading factors:
- $(y - f(v \cdot h))$: error
- $f'(v \cdot h)$: output activation slope
and both end with the input to the weight ($h_k$ for the output layer, the input feature $x_j$ for the hidden layer).

What's different? The hidden layer has two extra factors:
- $v_k$: weight from hidden layer to output layer (how much of the error came from this hidden node)
- $f'(w_k \cdot x)$: slope of $w \cdot x$ (how much do we need to change)

Backpropagation generalization
output layer:
$$v_k = v_k + (y - f(v \cdot h))\, f'(v \cdot h)\, h_k$$
hidden layer:
$$w_{kj} = w_{kj} + (y - f(v \cdot h))\, f'(v \cdot h)\, v_k\, f'(w_k \cdot x)\, x_j = w_{kj} + x_j\, f'(w_k \cdot x)\, v_k\, f'(v \cdot h)(y - f(v \cdot h))$$

Can we write this more succinctly? Define the modified error at the output node as the derivative at the node's input times the error:
$$\Delta_{out} = f'(v \cdot h)(y - f(v \cdot h))$$
$$v_k = v_k + h_k \Delta_{out}$$
The modified error at hidden node $k$ is the slope at that node, times the weight to the output layer, times the modified error of the output layer:
$$\Delta_k = f'(w_k \cdot x)\, v_k\, f'(v \cdot h)(y - f(v \cdot h)) = f'(w_k \cdot x)\, v_k\, \Delta_{out}$$
$$w_{kj} = w_{kj} + x_j \Delta_k$$
In general:
$$\Delta_{current} = f'(current\_input)\, w_{output}\, \Delta_{output}$$
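The modified-error form maps almost line for line onto code. This is a sketch under the same assumptions as before (sigmoid activation, one output node, no bias, no learning rate); `backprop_step` and the weights are illustrative names and values, not part of the slides.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def predict(w, v, x):
    h = [sigmoid(dot(w_k, x)) for w_k in w]
    return h, sigmoid(dot(v, h))


def backprop_step(w, v, x, y):
    """One update of a single-hidden-layer network using modified errors."""
    h, out = predict(w, v, x)
    delta_out = out * (1 - out) * (y - out)          # Delta_out
    delta = [h_k * (1 - h_k) * v_k * delta_out       # Delta_k, with the old v_k
             for h_k, v_k in zip(h, v)]
    new_v = [v_k + h_k * delta_out for v_k, h_k in zip(v, h)]
    new_w = [[w_kj + x_j * d_k for w_kj, x_j in zip(w_k, x)]
             for w_k, d_k in zip(w, delta)]
    return new_w, new_v


# Hypothetical network and example.
w = [[0.1, -0.2], [0.4, 0.3]]
v = [0.5, -0.7]
x, y = [1.0, 2.0], 1.0
new_w, new_v = backprop_step(w, v, x, y)
```

Note that every `Delta_k` is computed from the old `v` before any weight is overwritten, which is why the function builds new lists instead of updating in place.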

Backprop on multilayer networks
Anything different here?
$$\Delta_{current} = f'(current\_input)\, w_{output}\, \Delta_{output}$$
$$w = w + input \cdot \Delta_{current}$$
Here $\Delta_{current}$ is the modified error of the node that the weight $w$ feeds into, and $input$ is the value arriving along that weight.

What errors at the next layer does the highlighted edge affect?

Backprop on multilayer networks
When a node feeds several nodes in the next layer, sum over all of them:
$$\Delta_{current} = f'(current\_input) \sum w_{output}\, \Delta_{output}$$
$$w = w + input \cdot \Delta_{current}$$

Backpropagation:
- Calculate new weights and modified errors at output layer
- Recursively calculate new weights and modified errors on hidden layers based on recursive relationship
- Update model with new weights

Multiple output nodes
How do multiple outputs change things?
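The recursive relationship can be sketched for an arbitrary number of layers and output nodes. This is one possible arrangement, assuming sigmoid activations, squared error, no bias, and no learning rate; the nested-list weight layout and the function names are assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(layers, x):
    """Return the outputs of every layer, starting with the input itself."""
    activations = [x]
    for layer in layers:
        prev = activations[-1]
        activations.append([sigmoid(sum(w * a for w, a in zip(node, prev)))
                            for node in layer])
    return activations


def backprop(layers, x, y):
    """layers[l][k][j]: weight from node j feeding layer l to node k in layer l.
    x: feature list, y: list of target outputs. Returns the updated weights."""
    activations = forward(layers, x)
    # Modified errors at the output layer: f'(input) * (y - out).
    deltas = [o * (1 - o) * (t - o) for o, t in zip(activations[-1], y)]
    new_layers = [None] * len(layers)
    for l in range(len(layers) - 1, -1, -1):
        inputs = activations[l]
        # w = w + input * Delta_current
        new_layers[l] = [[w + a * d for w, a in zip(node, inputs)]
                         for node, d in zip(layers[l], deltas)]
        if l > 0:
            # Delta_current = f'(current_input) * sum(w_output * Delta_output),
            # using the weights from before the update.
            deltas = [a * (1 - a) * sum(layers[l][k][j] * deltas[k]
                                        for k in range(len(layers[l])))
                      for j, a in enumerate(inputs)]
    return new_layers


# Hypothetical 2-2-2-1 network and one training example.
layers = [[[0.1, -0.2], [0.4, 0.3]],
          [[0.5, -0.7], [0.2, 0.1]],
          [[0.3, -0.4]]]
updated = backprop(layers, [1.0, 2.0], [1.0])
```

Multiple output nodes need no special handling in this layout: each output node gets its own modified error, and the sum in the hidden-layer step runs over all of them.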

Multiple output nodes
How do multiple outputs change things? The same rules apply, with the sum taken over every node the current node feeds:
$$\Delta_{current} = f'(current\_input) \sum w_{output}\, \Delta_{output}$$
$$w = w + input \cdot \Delta_{current}$$

Backpropagation implementation
Output layer update:
$$v_k = v_k + (y - f(v \cdot h))\, f'(v \cdot h)\, h_k$$
Hidden layer update:
$$w_{kj} = w_{kj} + x_j\, f'(w_k \cdot x)\, v_k\, f'(v \cdot h)(y - f(v \cdot h))$$
Any missing information for implementation?
1. What activation function are we using?
2. What is the derivative of that activation function?

Activation function derivatives
- sigmoid: $s(x) = \frac{1}{1 + e^{-x}}$, with $s'(x) = s(x)(1 - s(x))$
- tanh: $\frac{\partial}{\partial x} \tanh(x) = 1 - \tanh^2 x$
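Both derivative formulas are easy to check numerically. The sketch below compares each closed form against a central finite difference at a few arbitrary points; the helper names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # s'(x) = s(x)(1 - s(x))


def tanh_prime(x):
    return 1.0 - math.tanh(x) ** 2  # d/dx tanh(x) = 1 - tanh^2(x)


def numeric_slope(f, x, eps=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


# Arbitrary test points.
for x in (-2.0, 0.0, 0.5, 3.0):
    assert abs(sigmoid_prime(x) - numeric_slope(sigmoid, x)) < 1e-8
    assert abs(tanh_prime(x) - numeric_slope(math.tanh, x)) < 1e-8
```

Both derivatives can be computed from the activation's own output, so an implementation that stores each node's output during the forward pass does not need to recompute the exponential on the way back.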

Learning rate
Output layer update:
$$v_k = v_k + \eta\,(y - f(v \cdot h))\, f'(v \cdot h)\, h_k$$
Hidden layer update:
$$w_{kj} = w_{kj} + \eta\, x_j\, f'(w_k \cdot x)\, v_k\, f'(v \cdot h)(y - f(v \cdot h))$$
- Like gradient descent for linear classifiers, use a learning rate $\eta$.
- Often will start larger and then get smaller.

Backpropagation implementation
Just like gradient descent!
for some number of iterations:
  randomly shuffle training data
  for each example:
  - Compute all outputs going forward
  - Calculate new weights and modified errors at output layer
  - Recursively calculate new weights and modified errors on hidden layers based on recursive relationship
  - Update model with new weights

Handling bias
How should we learn the bias?
1. Add an extra feature hard-wired to 1 to all the examples (giving each hidden node an extra weight $w_{k(m+1)}$).
2. For other layers, add an extra parameter whose input is always 1 (for the output layer, $v_{d+1}$).
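Putting the loop, the learning rate, and the bias handling together gives roughly the following. This is a sketch, not the course's reference implementation: it assumes a sigmoid activation, one hidden layer, a fixed learning rate rather than a decaying one, and small uniform random initial weights. The names `train`, `predict`, and `total_loss`, the defaults, and the OR dataset are all illustrative.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def predict(w, v, x):
    # Bias handling: append a constant 1 to the features and to the hidden outputs.
    h = [sigmoid(dot(w_k, x + [1.0])) for w_k in w] + [1.0]
    return h, sigmoid(dot(v, h))


def total_loss(w, v, data):
    return sum(0.5 * (y - predict(w, v, x)[1]) ** 2 for x, y in data)


def train(data, d=2, eta=0.5, iterations=500, seed=0):
    rng = random.Random(seed)
    m = len(data[0][0])
    w = [[rng.uniform(-0.5, 0.5) for _ in range(m + 1)] for _ in range(d)]
    v = [rng.uniform(-0.5, 0.5) for _ in range(d + 1)]
    data = list(data)                      # avoid shuffling the caller's list
    for _ in range(iterations):
        rng.shuffle(data)                  # randomly shuffle training data
        for x, y in data:
            h, out = predict(w, v, x)      # compute all outputs going forward
            delta_out = out * (1 - out) * (y - out)
            deltas = [h[k] * (1 - h[k]) * v[k] * delta_out for k in range(d)]
            v = [v_k + eta * h_k * delta_out for v_k, h_k in zip(v, h)]
            xb = x + [1.0]
            w = [[w_kj + eta * x_j * deltas[k] for w_kj, x_j in zip(w[k], xb)]
                 for k in range(d)]
    return w, v


# Hypothetical training set: logical OR of two binary features.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
w, v = train(data)
```

The constant hidden output has no incoming weights, so only the `d` real hidden nodes get a modified error; the bias weight `v[d]` is still updated because its input is always 1.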

Online vs. batch learning
for some number of iterations:
  randomly shuffle training data
  for each example:
  - Compute all outputs going forward
  - Calculate new weights and modified errors at output layer
  - Recursively calculate new weights and modified errors on hidden layers based on recursive relationship
  - Update model with new weights

Online learning: update weights after each example.
Batch learning?

Batch learning
for some number of iterations:
  randomly shuffle training data
  initialize weight accumulators to 0 (one for each weight)
  for each example:
  - Compute all outputs going forward
  - Calculate new weights and modified errors at output layer
  - Recursively calculate new weights and modified errors on hidden layers based on recursive relationship
  - Add new weights to weight accumulators
  Divide weight accumulators by number of examples
  Update model weights by weight accumulators

Process all of the examples before updating the weights.

Many variations
- Momentum: include a factor in the weight update to keep moving in the direction of the previous update.
- Mini-batch: a compromise between online and batch. Avoids the noisiness of updates from online while making more educated weight updates.
- Simulated annealing: with some probability make a random weight update. Reduce this probability over time.

Challenges of neural networks?
- Picking network configuration
- Can be slow to train for large networks and large amounts of data
- Loss functions (including squared error) are generally not convex with respect to the parameter space
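The accumulator idea is independent of the network shape, so the sketch below keeps the model to a single sigmoid output unit for brevity. It reads "add new weights to weight accumulators" as accumulating each example's weight change. The momentum term uses one common form (learning rate times the averaged change, plus a fraction of the previous step), which is an assumption rather than something the slides specify; all names and data are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def example_update(v, h, y):
    """Weight changes (y - f(v.h)) f'(v.h) h_k for one example."""
    out = sigmoid(sum(vk * hk for vk, hk in zip(v, h)))
    delta = out * (1 - out) * (y - out)
    return [hk * delta for hk in h]


def batch_epoch(v, data, eta=1.0, momentum=0.0, previous=None):
    """Process all of the examples, then update once. Returns (new_v, step)."""
    accumulators = [0.0] * len(v)          # one accumulator per weight
    for h, y in data:
        for i, change in enumerate(example_update(v, h, y)):
            accumulators[i] += change
    previous = previous or [0.0] * len(v)
    # Average over the examples, then add the momentum term.
    step = [eta * a / len(data) + momentum * p
            for a, p in zip(accumulators, previous)]
    return [vk + s for vk, s in zip(v, step)], step


# Hypothetical inputs to the output unit and their labels.
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
v, step = [0.0, 0.0], None
for _ in range(3):
    v, step = batch_epoch(v, data, momentum=0.9, previous=step)
```

Calling `batch_epoch` on small slices of the data instead of the whole set gives the mini-batch variant.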


History of Neural Networks
- McCulloch and Pitts (1943) introduced a model of artificial neurons and suggested they could learn
- Hebb (1949): simple updating rule for learning
- Rosenblatt (1962): the perceptron model
- Minsky and Papert (1969) wrote Perceptrons
- Bryson and Ho (1969, but largely ignored until the 1980s, when Rumelhart et al. popularized it) invented backpropagation learning for multilayer networks

technology/in-a-big-network-of-computersevience-of-machine-learning.html?_r=0
