Neural Nets Supervised learning


6.034 Artificial Intelligence
Big idea: learning as acquiring a function on feature vectors.
Background: Nearest Neighbors, Identification Trees, Neural Nets.

Neural Nets: Supervised learning
[diagram: a single neuron; inputs x_1 … x_n plus a bias input of -1, with weights w_0, w_1, …, w_n, feed a weighted sum z that passes through s(z) to produce the output y]
Training: choose connection weights that minimize the error (Observed - Computed); this is error-driven learning.
Prediction: propagate input feature values through the network of artificial neurons.
Advantages: fast prediction; does feature weighting; very generally applicable.
Disadvantages: very slow training; overfitting is easy.

Multi-Layer Neural Net
[diagram: inputs x_1, x_2, x_3, …, x_u feed a hidden layer, then a second hidden layer, then outputs y_1, y_2]
More powerful than a single layer: the lower layers transform the input problem into more tractable (linearly separable) problems for subsequent layers.
Error-driven learning: minimize the error, the difference between the observed (data) value and the currently computed value, i.e., the error function E(w) = (Observed - Computed).
The goal is to make the error 0 by adjusting the weights on the neural net units.
So first we have to figure out how far off we are: we must make a forward propagation through the net with our current weights to see how close we are to the truth.
Once we know how close we are, we must figure out how to change the w's to minimize E(w).
So, how do we minimize a function? Set its derivative to 0! We do this by gradient descent: look at the slope of the error function E with respect to the weights, and move so as to decrease E; if the slope is negative, increase the weight, otherwise decrease it. (We want to get to the place where the slope is 0.) The backpropagation algorithm tells us how to adjust the w's.

Minimizing by Gradient Descent
Basic problem: given initial parameters w_0 with error E = G(w_0), find w that is a local minimum of G, that is, where all first derivatives of G are zero.
[diagram: error curve G(w) with slope dG/dw marked; starting points w_0 on either side of a local minimum each step downhill to w_1]
Approach:
1. Find the slope of G at w_0.
2. Move to w_1, which is down the slope.
The gradient is the n-dimensional slope. Gradient descent takes a small step of size r, the rate, against the gradient. Stop when the gradient is very small (or the change in G is very small).

The simplest possible net
Output y = w_1 x_1 + w_2 x_2 (a single linear unit, with a bias weight w_0 on an input fixed at -1).
The error or performance function is E = 1/2 (y* - y)^2, so what is the derivative of the error function with respect to w? It is just -(y* - y) dy/dw; note that when we step against this gradient, the two negative signs cancel, giving an update proportional to (y* - y).
For this linear case it is easy to figure out how changing w changes y, i.e., dy/dw: it is just the vector of inputs multiplying the w_i, i.e., [x_1 x_2]. (For example, dy/dw for y = wx + 2 is just x.) So credit/blame assignment is simple.
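The two-step approach above can be sketched as a loop. A minimal illustration, not from the notes: the quadratic G and the step size here are made-up choices.

```python
# Minimal gradient descent on a 1-D error function G(w).
# G and its derivative are illustrative choices, not from the notes.
def G(w):
    return (w - 3.0) ** 2          # error surface with its minimum at w = 3

def dG_dw(w):
    return 2.0 * (w - 3.0)         # slope of G at w

w = 0.0                            # initial parameter w0
r = 0.1                            # rate: a small step size
while abs(dG_dw(w)) > 1e-6:        # stop when the gradient is very small
    w = w - r * dG_dw(w)           # move down the slope

print(round(w, 4))                 # converges to the local minimum, 3.0
```

Each step moves w against the slope, so the loop slides down the curve and settles where the derivative is (nearly) zero.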

Simplest neural net: the learning rule
  w' = w + r(d - z) x
where w' is the new weight, w the original weight, r the learning rate (small), d the desired (correct) output, z the actual output, and x the input.
The error derivative, the derivative of the error function with respect to w, tells how the error changes if you change w a little bit; it is the slope of the error function at the point w. The factor (d - z) changes the weights on non-zero inputs in the direction that reduces the error; this is gradient descent.
Note that for a single layer the net is linear, so the derivative is simple: for y = w_1 x_1 + w_2 x_2, dy/dw = x = [x_1 x_2].
We update the weight by taking a step based on the size of the error times the derivative: a size, plus a direction in which to step. This is gradient descent.
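The learning rule translates directly into code. A sketch; the function and variable names are mine.

```python
def update_weights(w, x, d, z, r):
    """One step of the rule w' = w + r(d - z)x, applied per weight."""
    return [wi + r * (d - z) * xi for wi, xi in zip(w, x)]

# One step on a linear unit y = w1*x1 + w2*x2:
w = [0.5, -0.5]
x = [1.0, 2.0]
d, r = 1.0, 0.1
z = sum(wi * xi for wi, xi in zip(w, x))   # actual output: -0.5
w = update_weights(w, x, d, z, r)          # error d - z = 1.5
print(w)                                   # roughly [0.65, -0.2]
```

Each weight moves in proportion to the error times its own input, so weights on zero inputs are left alone, exactly as the rule says.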

[diagram: the current hyperplane vs. the correct hyperplane, with input vector x]
  Δw = r(y* - y) x
Note: the weight vector is changed in the direction of x.

Learning AND
The training sequence presents (0,1) -> 0, (1,0) -> 0, (1,1) -> 1, repeating, and ends with (0,0) -> 0. The unit outputs 1 when its activation exceeds the threshold of 1; the learning rate is 1.

  INPUT   WEIGHTS        ACTIVATION  OUTPUT  TARGET  LEARN?
  0 1     -0.1  +0.2        0.2        0       0     OK
  1 0     -0.1  +0.2       -0.1        0       0     OK
  1 1     -0.1  +0.2        0.1        0       1     CHANGE -> +0.9 +1.2
  0 1     +0.9  +1.2        1.2        1       0     CHANGE -> +0.9 +0.2
  1 0     +0.9  +0.2        0.9        0       0     OK
  1 1     +0.9  +0.2        1.1        1       1     OK
  0 1     +0.9  +0.2        0.2        0       0     OK
  0 0     +0.9  +0.2        0.0        0       0     OK
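The AND trace above can be replayed in code. A sketch: the threshold of 1 and the learning rate of 1 are my reading of the trace, and the names are mine.

```python
def step(activation, threshold=1.0):
    # Threshold unit: fire when the activation exceeds the threshold.
    return 1 if activation > threshold else 0

# Training sequence and targets for AND, as in the trace.
samples = [((0, 1), 0), ((1, 0), 0), ((1, 1), 1),
           ((0, 1), 0), ((1, 0), 0), ((1, 1), 1),
           ((0, 1), 0), ((0, 0), 0)]

w = [-0.1, 0.2]                    # initial weights from the trace
r = 1.0                            # learning rate
for (x1, x2), target in samples:
    act = w[0] * x1 + w[1] * x2
    out = step(act)
    if out != target:              # the CHANGE rows: w' = w + r(d - z)x
        w[0] += r * (target - out) * x1
        w[1] += r * (target - out) * x2

print([round(wi, 1) for wi in w])  # final weights from the trace: [0.9, 0.2]
```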

Learning XOR: this is not separable by a single line, which is all a 1-layer neural net can do.
[diagram: the four points (0,0), (1,0), (0,1), (1,1), with (0,1) and (1,0) in one class (green) and (0,0) and (1,1) in the other (red)]
Note that we need two lines through the plane to separate red from green; there isn't a single slice that will work to separate the two, but two slices will.
So we need a hidden layer with two units (one to make each slice), and then one output neuron to decide between these two slices.

XOR is solvable by a 2-layer net: one layer separates off/off and on/on together; the next layer can then separate on/off from off/on.
A more complex pattern will require more hidden units, depending on how many slices we need.
[diagram: a more complicated 2-D pattern of labeled points]
Think about what happens now. How many slices will we need? Therefore, how many hidden units? Will we need more than 2 layers?
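Here is one concrete set of weights that realizes the two slices (my own illustrative choice; the notes only argue that such weights exist): one hidden unit fires when at least one input is on, the other when both are on, and the output unit fires for the first case but not the second.

```python
def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden layer: two "slices" through the input plane.
    h1 = step(x1 + x2 - 0.5)       # fires for x1 OR x2 (cuts off (0,0))
    h2 = step(x1 + x2 - 1.5)       # fires for x1 AND x2 (cuts off (1,1))
    # Output: inside the OR slice but outside the AND slice.
    return step(h1 - h2 - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))   # 1 only for (0,1) and (1,0)
```

The two hidden units are the two lines through the plane; the output unit simply decides between the regions they carve out.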

Now we add 2 things, one at a time:
1. A sigmoid on the output of each unit (instead of just summing).
2. Intermediate (hidden) units between input and output.
The first just makes the computation of the derivative relating the output back to the input of a unit a bit more complicated. The second also complicates things: how do we allocate credit/blame?

Sigmoid unit
[diagram: inputs x_1 … x_n plus a bias input of -1, with weights w_0, w_1, …, w_n, feed a summing unit; the sum passes through s(z) to give the output]
  x_out = Σ_i w_i x_i
  s(x_out) = 1 / (1 + e^(-x_out))
[diagram: the resulting output surface over (x_1, x_2), showing the y = 1/2 contour]

Derivative of the sigmoid with respect to its input:
  s(x) = 1 / (1 + e^(-x))
  ds(x)/dx = d/dx [(1 + e^(-x))^(-1)]
           = [-(1 + e^(-x))^(-2)] [-e^(-x)]
           = [1 / (1 + e^(-x))] [e^(-x) / (1 + e^(-x))]
           = s(x)(1 - s(x)), or just z(1 - z)
where z is the output from the sigmoid function, i.e., the final output from this particular net unit. (Note: if this is the final output, we sometimes use o instead of z.)
So now the derivative (slope) relating changes in w to changes in the output is a bit more complicated. How this changes the computation of the way to twiddle the weights:
  Original:             Δw = r(d - z) x
  After adding sigmoid: Δw = r(d - z) x [z(1 - z)]
So far, this is all we've added: the extra factor takes care of computing the derivative correctly as we go through a sigmoid unit.
Note: we will use d for the true or desired value of y, and y_net or o (as in Winston) for the value output or computed by the net.
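The identity ds/dx = s(x)(1 - s(x)) is easy to check numerically; a quick sketch:

```python
import math

def s(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 0.7
h = 1e-6
numeric = (s(x + h) - s(x - h)) / (2 * h)   # central-difference slope
analytic = s(x) * (1 - s(x))                # z(1 - z) from the derivation
print(abs(numeric - analytic) < 1e-8)       # True
```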

Now we add hidden units
[diagram: a 2-2-1 network; inputs x_1, x_2 and bias inputs of -1 feed hidden units y_1 (weights w_01, w_11, w_21) and y_2 (weights w_02, w_12, w_22), whose outputs, with another bias input of -1, feed the output unit y_3 (weights w_03, w_13, w_23) producing z_3]
Given (y_obs - y_calc), how do we twiddle the w_ij?
The KEY POINT: hidden units have no correct target value, so how do we know what their error is? If we don't know the error, how can we know how to change the weights for them?
Answer: we use the error from the output and propagate it back, allocating the blame proportionately over the units in the next layer down.
Q: How much blame should each unit get?
A: It should be proportional to the weight on what it passes to the next unit.

Minimizing by Gradient Descent
Basic problem: given initial parameters w_0 with error P = G(w_0), find w that is a local minimum of G, that is, where all first derivatives of G are zero. (P is the performance function.)
Approach:
1. Find the slope of G at w_0.
2. Move to w_1, which is down the slope:
  ∇_w G = [∂G/∂w_0, ∂G/∂w_1, …, ∂G/∂w_n]
  w_(i+1) = w_i - r ∇_w G
[diagram: error curve G(w) with slope dG/dw; starting points on either side of a local minimum step downhill]
The gradient is the n-dimensional slope; gradient descent takes a small step of size r, the rate. Stop when the gradient is very small (or the change in G is very small).
(Recall that the minus sign in front will be canceled by the negative sign from ∇G of our performance function in our actual formula.)

The simplest net: just the output unit
[diagram: the input y enters the output unit with weight w_z; the product p_z = y w_z feeds a sigmoid, whose output is z]
(Note that the output unit could also include a bias weight, included as in the 1-layer net case to make the threshold come out to 0.)

The picture: propagating back one step, the delta rule.
Intuitively, ∂P/∂w_z tells us how much the P(erformance) function (the difference between the actual desired output and the output the net computes given its current weights) will change if we twiddle w_z a little bit. This calculation has three parts:
  P = -1/2 (observed - computed)^2 = -1/2 (d - z)^2
  dP/dz = (d[esired] - z), the derivative of the error function
  ∂P/∂w_z = ∂P/∂z · ∂z/∂p_z · ∂p_z/∂w_z, and since p_z = w_z y, we have ∂p_z/∂w_z = y
  ∂P/∂w_z = (d - z) · z(1 - z) · y
where the first two factors, (d - z) z(1 - z), form the delta, δ_z, and z = s(p_z) is the output value.

The three components
So, in ∂P/∂w_z = (d - z) · z(1 - z) · y:
  (d - z)   is the derivative of the performance function with respect to z;
  z(1 - z)  is the derivative of the output z with respect to the output from the neuron before passing it through the sigmoid function;
  y         is the input to the unit.

Intuition
[diagram: y flows through the weight w_z to give p_z, then through the sigmoid to give the output value z]
Intuitively, this makes sense, because the size of the effect on the error P is proportional to the amount of flow that comes into the node, y (which is the derivative of p_z with respect to w_z), times the derivative twiddle produced by the sigmoid, which is z(1 - z), times the derivative of P with respect to z.

Or, it divides into 3 parts, one for each link of the chain in the net; read them off the formula backward:
  ∂P/∂w_z = (d - z) · z(1 - z) · y
[diagram: y → (weight w_z) → p_z → sigmoid → output value z]
Part 1: ∂p_z/∂w_z = y.
Part 2: ∂z/∂p_z = z(1 - z), the derivative through the sigmoid.

Part 3: ∂P/∂z = (d - z).

Delta rule for the output unit
[diagram: y → (weight w_z) → p_z → sigmoid → output value z]
  ∂P/∂w_z = (d - z) · z(1 - z) · y, where the delta is δ_z = z(1 - z)(d - z)
For the output unit, where z = o, and rearranging:
  δ_output = δ_o = o(1 - o)(d - o)
Actual weight change:
  Δw = r · (input to unit) · δ_o, or Δw = r · y · o(1 - o)(d - o)
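The delta rule for the output unit translates directly into code; a sketch, with names of my own choosing:

```python
def output_delta(o, d):
    """delta_o = o(1 - o)(d - o) for a sigmoid output unit."""
    return o * (1 - o) * (d - o)

def weight_change(r, y, o, d):
    """Delta-w = r * (input to unit) * delta_o."""
    return r * y * output_delta(o, d)

# With output o = 0.71, desired d = 2, input y = 0.88, rate r = 0.2:
print(round(weight_change(0.2, 0.88, 0.71, 2.0), 4))   # 0.0467
```

(These particular numbers reappear in the worked example later in the notes.)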

So now we know how to change the output unit weight, given some (input, true output) training example pair:
  Δw = r · y · o(1 - o)(d - o)
  w_new = w_old + Δw = w_old + r · y · o(1 - o)(d - o)
where r is the learning rate; y is the input to the unit from the previous layer; d is the desired, true output; and o is the value actually computed by the net with its current weights.
But how do we know y, the input to this unit? Answer: because we had to compute the forward propagation first, from the very first input to the output at the end of the net, and this tells us, given the current weights, what the net outputs at each of its units, including the 'hidden' ones (like y).

Backpropagation
An efficient method of implementing gradient descent for neural networks.
1. Initialize weights to small random values.
2. Choose a random sample input feature vector.
3. Compute the total input and output for each unit (forward prop).
4. Compute δ for the output layer: δ_output = o(1 - o)(d - o).
5. Compute δ_j for the preceding layer by the backprop rule (repeat for all layers).
6. Compute each weight change by the descent rule, using the rate r (repeat for all weights).
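Steps 1-6 can be sketched for a small fully connected net of sigmoid units. An illustrative implementation only: the layer sizes, training sample, and rate are made up, and bias inputs are omitted for brevity.

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(weights, x):
    """Step 3: compute each layer's outputs; weights[k][j] is unit j's weight list."""
    outputs = [x]
    for layer in weights:
        x = [sigmoid(sum(w * xi for w, xi in zip(unit, x))) for unit in layer]
        outputs.append(x)
    return outputs

def backprop_step(weights, x, d, r):
    outputs = forward(weights, x)
    o = outputs[-1]
    # Step 4: deltas for the output layer, o(1 - o)(d - o).
    deltas = [[oj * (1 - oj) * (dj - oj) for oj, dj in zip(o, d)]]
    # Step 5: deltas for each preceding layer, from the deltas to its right.
    for k in range(len(weights) - 1, 0, -1):
        ok = outputs[k]
        deltas.insert(0, [ok[i] * (1 - ok[i]) *
                          sum(unit[i] * dr for unit, dr in zip(weights[k], deltas[0]))
                          for i in range(len(ok))])
    # Step 6: weight change, Delta-w = r * input * delta.
    for k, layer in enumerate(weights):
        for j, unit in enumerate(layer):
            for i in range(len(unit)):
                unit[i] += r * outputs[k][i] * deltas[k][j]

# Step 1: small random weights for a 2-2-1 net; steps 2-6 repeated on one sample.
random.seed(0)
weights = [[[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)],
           [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(1)]]
for _ in range(20):
    backprop_step(weights, [1.0, 0.0], [1.0], r=0.5)
print(forward(weights, [1.0, 0.0])[-1])   # output drifts toward the target 1.0
```

Note how step 5 only needs each unit's own output, the weights to its right, and the deltas already computed for the layer to its right: this is the recursive pattern the notes develop next.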

Now that we know this, we can find the weight changes required in the previous layer. Here we use Winston's notation:
  i_l = input to the left neuron; w_l = weight on the left neuron; p_l = product of i_l and w_l on the left neuron (input to its sigmoid); o_l = output from the left neuron (and input to the right neuron);
  i_r = input to the right neuron (from the left neuron); w_r = weight on the right neuron; p_r = product of i_r and w_r on the right neuron (input to its sigmoid); o_r = output from the right neuron.
We already know how to find the weight delta for the right-hand-side neuron (this is just the output layer, as before).

Now we compute the weight delta for the layer to the left of this neuron. Recall we called this quantity δ_z before. If we use the following notation, we can see the key recursive pattern better:
  δ_r = o_r(1 - o_r)(d - o_r) for the rightmost (output) neuron
  δ_l = o_l(1 - o_l) · w_r · δ_r for the neuron to its left
So we can write the weight derivatives more simply, as a function of the input to each neuron and the delta for each neuron:
  ∂P/∂w = (input to the neuron) · δ
Clearly, if we add another layer to the left, we can just repeat this computation: we take the delta for the new layer (which is given in terms of things we already know plus the delta to its right) and the input to that layer (which we can compute via forward propagation).

From deltas to weight adjustments
First, the output layer, the neuron on the right:
  δ_output = δ_right = o(1 - o)(d - o)
  Δw_right = r · o_l · δ_right
(Here I have used the subscript "right" instead of r to avoid confusion with the learning rate r; Winston uses alpha instead of r in his netnotes.pdf for just this reason. Note that o_l, the output from the left neuron, is the input to the right neuron, so this equation has the form r × input to final-layer neuron × delta for final neuron.)
Finding the next adjustment back, for w_l, involves using the delta from the right, the weight on the right unit, the delta for the left unit, and the input to the left unit:
  δ_left = δ_l = o_l(1 - o_l) · w_right · δ_right
  Δw_l = r · i_l · δ_l

Let's do an example! Suppose we have the training example (1, 2) and weights as shown: the input to the net is 1 and d (the true value) is 2. Let the learning rate r be 0.2.
[diagram: input 1 → weight w_l = 2.0 → left sigmoid neuron → weight w_r = 1.0 → right sigmoid neuron → output; desired output value d = 2.0]
Q: what are the new weights after one iteration of backprop?

Step 1: Do forward propagation to find all of p_l, o_l (hence i_r), p_r, o_r (this last being the computed output o).
(a) Left neuron: this gives p_l = 1 × 2.0 = 2.0; o_l = 1/(1 + e^(-2)) = 0.88.
(b) Right neuron: i_r = o_l, so p_r = 0.88 × 1.0 = 0.88; o_r = 1/(1 + e^(-0.88)) = 0.71.

Step 2: Now start backpropagation. First, compute the weight change for the output weight w_r:
  Δw = r × i_r × o(1 - o)(d - o) = 0.2 × 0.88 × 0.71 × (1 - 0.71) × (2 - 0.71) = 0.0467
  Note: o(1 - o)(d - o) = δ_r = 0.2656
  new weight w_r = 1.0 + 0.0467 = 1.0467

Step 3: Compute the weight change for the next level to the left, w_l:
  Δw = r × i_l × o_l(1 - o_l) × w_r × δ_r = 0.2 × 1.0 × 0.88 × (1 - 0.88) × 1.0 × 0.2656 = 0.0056
  new weight w_l = 2.0 + 0.0056 = 2.0056

So we've adjusted both weights upwards, as desired (compare to the 1-layer case), though very slowly. Now we could go back and iterate with this sample point until we get no further change in the weights, and then go on to the next training example.
For a much more detailed example in the same vein, please see my other handout on my web page, neuralnetex.pdf!
Let's sum up.
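The arithmetic above can be checked in code. (The notes round intermediate values to two places, so the unrounded answers differ slightly in the fourth decimal.)

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, d, r = 1.0, 2.0, 0.2            # training example (1, 2), learning rate 0.2
w_l, w_r = 2.0, 1.0                # initial weights

# Step 1: forward propagation.
p_l = x * w_l                      # 2.0
o_l = sigmoid(p_l)                 # ~0.88; this is also i_r, the right neuron's input
p_r = o_l * w_r
o_r = sigmoid(p_r)                 # ~0.71, the computed output o

# Step 2: weight change for the output weight w_r.
delta_r = o_r * (1 - o_r) * (d - o_r)
dw_r = r * o_l * delta_r           # ~0.047 (notes get 0.0467 with rounded values)

# Step 3: weight change for w_l, one level to the left.
delta_l = o_l * (1 - o_l) * w_r * delta_r
dw_l = r * x * delta_l             # ~0.0056

print(round(o_l, 2), round(o_r, 2), round(dw_r, 4), round(dw_l, 4))
```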

Backpropagation
An efficient method of implementing gradient descent for neural networks.
1. Initialize weights to small random values.
2. Choose a random sample input feature vector.
3. Compute the total input and output for each unit (forward prop).
4. Compute δ for the output layer: δ_output = o(1 - o)(d - o).
5. Compute δ_j for the preceding layer by the backprop rule (repeat for all layers).
6. Compute each weight change by the descent rule, using the rate r (repeat for all weights).

Training Neural Nets (without overfitting, hopefully)
Given: a data set, desired outputs, and a neural net with m weights. Find a setting for the weights that will give good predictive performance on new data, and estimate the expected performance on new data.
1. Split the data set (randomly) into three subsets:
   Training set: used for picking weights.
   Validation set: used to stop training.
   Test set: used to evaluate performance.
2. Pick random, small weights as initial values.
3. Perform iterative minimization of the error over the training set.
4. Stop when the error on the validation set reaches a minimum (to avoid overfitting).
5. Use the best weights to compute the error on the test set, which is the estimate of performance on new data. Do not repeat training to improve this.
Cross-validation can be used if the data set is too small to divide into three subsets.

On-line vs. off-line
There are two approaches to performing the error minimization:
  On-line training: present X_i and Y_i* (chosen randomly from the training set), change the weights to reduce the error on this instance, and repeat.
  Off-line training: change the weights to reduce the total error on the training set (the sum over all instances).
On-line training is a stochastic approximation to gradient descent, since the gradient based on one instance is noisy relative to the full gradient (based on all instances). This can be beneficial in pushing the system out of shallow local minima.

Classification vs. Regression
A neural net with a single sigmoid output unit is aimed at binary classification: the class is 0 if y < 0.5 and 1 otherwise. For multiway classification, use one output unit per class.
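The two training styles differ only in when the weights change. An illustrative comparison on a single linear weight; the data and rate are made up for the example.

```python
# Fit y = w*x to three instances with squared error, comparing the two styles.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]   # (x, y*) pairs, roughly y = 2x
r = 0.02

# On-line: update the weight after each instance.
w_on = 0.0
for _ in range(100):
    for x, y_star in data:                     # each instance nudges w
        w_on += r * (y_star - w_on * x) * x

# Off-line: sum the gradient over the whole training set, then update once.
w_off = 0.0
for _ in range(100):
    total = sum((y_star - w_off * x) * x for x, y_star in data)
    w_off += r * total

print(round(w_on, 2), round(w_off, 2))        # both settle near 2.0
```

On this tiny convex problem both styles reach nearly the same weight; the on-line path is noisier, which is exactly the property that can help escape shallow local minima in a real net.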

A sigmoid output unit is not suitable for regression, since sigmoids are designed to change quickly from 0 to 1. For regression, we want a linear output unit; that is, we remove the output non-linearity. The rest of the net still retains its sigmoid units.