Neural networks
Chapter 20
Outline
Brains
Neural networks
Perceptrons
Multilayer networks
Applications of neural networks
Brains
10^11 neurons of > 20 types, 10^14 synapses, 1ms-10ms cycle time
Signals are noisy spike trains of electrical potential
[Figure: neuron anatomy, showing the cell body (soma) with nucleus, dendrites, the axon and its axonal arborization, and synapses to and from other cells]
McCulloch-Pitts unit
Output is a squashed linear function of the inputs:
a_i <- g(in_i) = g(Σ_j W_{j,i} a_j)
[Figure: unit schematic, with a fixed bias input a_0 = -1 and bias weight W_{0,i}; input links carry activations a_j, weighted by W_{j,i}; the input function computes in_i = Σ_j W_{j,i} a_j; the activation function g produces the output a_i on the output links]
Activation functions
[Figure: two plots of g(in_i) against in_i, each saturating at +1: (a) and (b)]
(a) is a step function or threshold function
(b) is a sigmoid function 1/(1 + e^{-x})
Changing the bias weight W_{0,i} moves the threshold location
Implementing logical functions
McCulloch and Pitts: every Boolean function can be implemented (with large enough network)
AND? OR? NOT? MAJORITY?
Implementing logical functions
McCulloch and Pitts: every Boolean function can be implemented (with large enough network)
[Figure: three threshold units]
AND: W_0 = 1.5, W_1 = 1, W_2 = 1
OR: W_0 = 0.5, W_1 = 1, W_2 = 1
NOT: W_0 = -0.5, W_1 = -1
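As a quick check of these weights, here is a minimal sketch of threshold units implementing AND, OR and NOT. It assumes the bias convention a_0 = -1 (so W_0 acts as a threshold) and a step activation; the helper names are illustrative, not from the slides.

```python
# Minimal sketch of McCulloch-Pitts threshold units implementing AND, OR, NOT.
# Assumes the bias convention a_0 = -1 (so W_0 acts as a threshold) and a step activation.

def step(x):
    return 1 if x > 0 else 0

def unit(w0, weights, inputs):
    # in = W_0 * a_0 + sum_j W_j * a_j, with a_0 = -1
    total = -w0 + sum(w * a for w, a in zip(weights, inputs))
    return step(total)

AND = lambda x1, x2: unit(1.5, [1, 1], [x1, x2])
OR  = lambda x1, x2: unit(0.5, [1, 1], [x1, x2])
NOT = lambda x1:     unit(-0.5, [-1], [x1])

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", AND(x1, x2), "OR:", OR(x1, x2))
    print("NOT", x1, "=", NOT(x1))
```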
Network structures
Feed-forward networks:
    single-layer perceptrons
    multi-layer networks
Feed-forward networks implement functions, have no internal state
Recurrent networks:
    Hopfield networks have symmetric weights (W_{i,j} = W_{j,i})
        g(x) = sign(x), a_i = ±1; holographic associative memory
    Boltzmann machines use stochastic activation functions, ≈ MCMC in BNs
    recurrent neural nets have directed cycles with delays
        ⇒ have internal state (like flip-flops), can oscillate etc.
Feed-forward example
[Figure: input units 1 and 2 feed hidden units 3 and 4 (weights W_{1,3}, W_{1,4}, W_{2,3}, W_{2,4}), which feed output unit 5 (weights W_{3,5}, W_{4,5})]
Feed-forward network = a parameterized family of nonlinear functions:
a_5 = g(W_{3,5} a_3 + W_{4,5} a_4)
    = g(W_{3,5} g(W_{1,3} a_1 + W_{2,3} a_2) + W_{4,5} g(W_{1,4} a_1 + W_{2,4} a_2))
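To make the composition concrete, here is a minimal sketch that evaluates a_5 for this five-unit network. The sigmoid activation and the particular weight values are illustrative assumptions, not values given in the slides.

```python
import math

# Minimal sketch of the 2-input, 2-hidden-unit, 1-output feed-forward example.
# The weight values below are made up purely for illustration.

def g(x):
    return 1 / (1 + math.exp(-x))   # sigmoid activation

W = {(1, 3): 0.5, (2, 3): -0.4, (1, 4): 0.9, (2, 4): 0.1,
     (3, 5): 1.2, (4, 5): -0.7}

def forward(a1, a2):
    a3 = g(W[1, 3] * a1 + W[2, 3] * a2)
    a4 = g(W[1, 4] * a1 + W[2, 4] * a2)
    a5 = g(W[3, 5] * a3 + W[4, 5] * a4)
    return a5

print(forward(1.0, 0.0))
```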
Perceptrons
[Figure: (left) a single-layer perceptron network, with input units connected directly to output units by weights W_{j,i}; (right) a 3-D plot of the output of a two-input perceptron unit over its inputs]
Expressiveness of perceptrons
Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)
Can represent AND, OR, NOT, majority, etc.
Represents a linear separator in input space:
Σ_j W_j x_j > 0   or   W · x > 0
[Figure: points in the (I_1, I_2) plane for (a) I_1 and I_2, (b) I_1 or I_2, (c) I_1 xor I_2; the first two are linearly separable, the third is not]
Perceptron learning
Learn by adjusting weights to reduce error on training set
The squared error for an example with input x and true output y is
E = (1/2) Err^2 ≡ (1/2) (y - h_W(x))^2
Perceptron learning
Learn by adjusting weights to reduce error on training set
The squared error for an example with input x and true output y is
E = (1/2) Err^2 ≡ (1/2) (y - h_W(x))^2
Perform optimization search by gradient descent:
∂E/∂W_j = ?
Perceptron learning
Learn by adjusting weights to reduce error on training set
The squared error for an example with input x and true output y is
E = (1/2) Err^2 ≡ (1/2) (y - h_W(x))^2
Perform optimization search by gradient descent:
∂E/∂W_j = Err ∂Err/∂W_j = Err ∂/∂W_j ( y - g(Σ_{j=0}^n W_j x_j) )
Perceptron learning
Learn by adjusting weights to reduce error on training set
The squared error for an example with input x and true output y is
E = (1/2) Err^2 ≡ (1/2) (y - h_W(x))^2
Perform optimization search by gradient descent:
∂E/∂W_j = Err ∂Err/∂W_j = Err ∂/∂W_j ( y - g(Σ_{j=0}^n W_j x_j) )
         = -Err g'(in) x_j
Perceptron learning
Learn by adjusting weights to reduce error on training set
The squared error for an example with input x and true output y is
E = (1/2) Err^2 ≡ (1/2) (y - h_W(x))^2
Perform optimization search by gradient descent:
∂E/∂W_j = Err ∂Err/∂W_j = Err ∂/∂W_j ( y - g(Σ_{j=0}^n W_j x_j) )
         = -Err g'(in) x_j
Simple weight update rule:
W_j <- W_j + α Err g'(in) x_j
E.g., +ve error ⇒ increase network output ⇒ increase weights on +ve inputs, decrease on -ve inputs
Perceptron learning
W = random initial values
for iter = 1 to T
    for i = 1 to N (all examples)
        x = input for example i
        y = output for example i
        W_old = W
        Err = y - g(W_old · x)
        for j = 1 to M (all weights)
            W_j = W_j + α Err g'(W_old · x) x_j
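Below is a runnable sketch of this loop for a single sigmoid unit. The training data (OR of two inputs, with a fixed bias input x_0 = 1), the learning rate, and the number of iterations are illustrative choices, not part of the slides.

```python
import math
import random

# Runnable sketch of the perceptron (single sigmoid unit) learning loop above.
# The data set (2-input OR with a bias input x_0 = 1) and the learning rate
# alpha are illustrative choices, not taken from the slides.

def g(x):
    return 1 / (1 + math.exp(-x))     # sigmoid

def g_prime(x):
    return g(x) * (1 - g(x))          # derivative of the sigmoid

examples = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
W = [random.uniform(-0.5, 0.5) for _ in range(3)]
alpha = 0.5

for it in range(1000):                # iter = 1 to T
    for x, y in examples:             # all examples
        in_ = sum(wj * xj for wj, xj in zip(W, x))
        err = y - g(in_)
        W = [wj + alpha * err * g_prime(in_) * xj for wj, xj in zip(W, x)]

# Predictions after training
print([round(g(sum(wj * xj for wj, xj in zip(W, x))), 2) for x, _ in examples])
```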
Perceptron learning contd.
Derivative of sigmoid g(x) can be written in simple form:
g(x) = 1/(1 + e^{-x})
g'(x) = ?
Perceptron learning contd.
Derivative of sigmoid g(x) can be written in simple form:
g(x) = 1/(1 + e^{-x})
g'(x) = e^{-x}/(1 + e^{-x})^2 = e^{-x} g(x)^2
Also, 1/g(x) = 1 + e^{-x}, so e^{-x} = (1 - g(x))/g(x)
So g'(x) = [(1 - g(x))/g(x)] g(x)^2 = (1 - g(x)) g(x)
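A quick numerical sanity check of this identity; the test points and the finite-difference step h are arbitrary choices.

```python
import math

# Quick numerical check that g'(x) = (1 - g(x)) g(x) for the sigmoid.
# The test points and the finite-difference step h are arbitrary choices.

def g(x):
    return 1 / (1 + math.exp(-x))

h = 1e-6
for x in (-2.0, 0.0, 1.5):
    numeric = (g(x + h) - g(x - h)) / (2 * h)   # central difference
    analytic = (1 - g(x)) * g(x)
    print(x, round(numeric, 6), round(analytic, 6))
```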
Perceptron learning contd.
Perceptron learning rule converges to a consistent function for any linearly separable data set
[Figure: learning curves (proportion correct on test set vs. training set size, 0-100) for the perceptron and a decision tree, on (left) MAJORITY on 11 inputs and (right) the RESTAURANT data]
Multilayer networks
Layers are usually fully connected;
numbers of hidden units typically chosen by hand
[Figure: output units a_i, connected by weights W_{j,i} to hidden units a_j, connected by weights W_{k,j} to input units a_k]
Expressiveness of MLPs
All continuous functions w/ 1 hidden layer, all functions w/ 2 hidden layers
[Figure: two 3-D plots of h_W(x_1, x_2) over the inputs, illustrating surfaces representable by small multilayer networks]
Training a MLP
In general have n output nodes,
E ≡ (1/2) Σ_i Err_i^2,
where Err_i = (y_i - a_i) and i runs over all nodes in the output layer.
Need to calculate ∂E/∂W_{ij} for any W_{ij}.
Training a MLP contd.
Can approximate derivatives by:
f'(x) ≈ (f(x + h) - f(x)) / h
∂E/∂W_{ij}(W) ≈ (E(W + (0, ..., h, ..., 0)) - E(W)) / h
What would this entail for a network with n weights?
Training a MLP contd.
Can approximate derivatives by:
f'(x) ≈ (f(x + h) - f(x)) / h
∂E/∂W_{ij}(W) ≈ (E(W + (0, ..., h, ..., 0)) - E(W)) / h
What would this entail for a network with n weights?
- one iteration would take O(n^2) time
Complicated networks have tens of thousands of weights; O(n^2) time is intractable.
Back-propagation is a recursive method of calculating all of these derivatives in O(n) time.
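A sketch of that finite-difference scheme, just to make the O(n^2) cost concrete: each of the n weights needs its own perturbed evaluation of E, and each evaluation is itself an O(n) forward pass. The function name network_error is a hypothetical stand-in, not something defined in the slides.

```python
# Sketch of finite-difference gradient estimation. `network_error` is a
# hypothetical stand-in for a function that runs the network forward over the
# training set and returns E(W); it is not defined in the slides.

def finite_difference_gradient(network_error, W, h=1e-6):
    grad = []
    base = network_error(W)                    # E(W), one O(n) forward pass
    for j in range(len(W)):                    # n weights ...
        W_perturbed = list(W)
        W_perturbed[j] += h                    # W + (0, ..., h, ..., 0)
        grad.append((network_error(W_perturbed) - base) / h)   # ... each needing another O(n) pass
    return grad
```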
Back-propagation learning
In general have n output nodes,
E ≡ (1/2) Σ_i Err_i^2,
where Err_i = (y_i - a_i) and i runs over all nodes in the output layer.
Output layer: same as for single-layer perceptron,
W_{j,i} <- W_{j,i} + α a_j Δ_i
where Δ_i = Err_i g'(in_i)
Hidden layers: back-propagate the error from the output layer:
Δ_j = g'(in_j) Σ_i W_{j,i} Δ_i.
Update rule for weights in hidden layers:
W_{k,j} <- W_{k,j} + α a_k Δ_j.
Back-propagation derivation
For a node i in the output layer:
∂E/∂W_{j,i} = -(y_i - a_i) ∂a_i/∂W_{j,i}
Back-propagation derivation
For a node i in the output layer:
∂E/∂W_{j,i} = -(y_i - a_i) ∂a_i/∂W_{j,i}
            = -(y_i - a_i) ∂g(in_i)/∂W_{j,i}
Back-propagation derivation
For a node i in the output layer:
∂E/∂W_{j,i} = -(y_i - a_i) ∂a_i/∂W_{j,i}
            = -(y_i - a_i) ∂g(in_i)/∂W_{j,i}
            = -(y_i - a_i) g'(in_i) ∂in_i/∂W_{j,i}
Back-propagation derivation
For a node i in the output layer:
∂E/∂W_{j,i} = -(y_i - a_i) ∂a_i/∂W_{j,i}
            = -(y_i - a_i) ∂g(in_i)/∂W_{j,i}
            = -(y_i - a_i) g'(in_i) ∂in_i/∂W_{j,i}
            = -(y_i - a_i) g'(in_i) ∂/∂W_{j,i} ( Σ_k W_{k,i} a_k )
Back-propagation derivation
For a node i in the output layer:
∂E/∂W_{j,i} = -(y_i - a_i) ∂a_i/∂W_{j,i}
            = -(y_i - a_i) ∂g(in_i)/∂W_{j,i}
            = -(y_i - a_i) g'(in_i) ∂in_i/∂W_{j,i}
            = -(y_i - a_i) g'(in_i) ∂/∂W_{j,i} ( Σ_k W_{k,i} a_k )
            = -(y_i - a_i) g'(in_i) a_j = -a_j Δ_i
where Δ_i = (y_i - a_i) g'(in_i)
Back-propagation derivation: hidden layer
For a node j in a hidden layer:
∂E/∂W_{k,j} = ?
Reminder: Chain rule for partial derivatives
For f(x, y), with f differentiable wrt x and y, and x and y differentiable wrt u and v:
∂f/∂u = (∂f/∂x)(∂x/∂u) + (∂f/∂y)(∂y/∂u)
and
∂f/∂v = (∂f/∂x)(∂x/∂v) + (∂f/∂y)(∂y/∂v)
Back-propagation derivation: hidden layer
For a node j in a hidden layer:
∂E/∂W_{k,j} = ∂/∂W_{k,j} E(a_{j_1}, a_{j_2}, ..., a_{j_m})
where {j_i} are the indices of the nodes in the same layer as node j.
Back-propagation derivation: hidden layer
For a node j in a hidden layer:
∂E/∂W_{k,j} = (∂E/∂a_j)(∂a_j/∂W_{k,j}) + Σ_i (∂E/∂a_i)(∂a_i/∂W_{k,j})
where i runs over all other nodes i in the same layer as node j.
Back-propagation derivation: hidden layer
For a node j in a hidden layer:
∂E/∂W_{k,j} = (∂E/∂a_j)(∂a_j/∂W_{k,j}) + Σ_i (∂E/∂a_i)(∂a_i/∂W_{k,j})
            = (∂E/∂a_j)(∂a_j/∂W_{k,j})   since ∂a_i/∂W_{k,j} = 0 for i ≠ j
Back-propagation derivation: hidden layer
For a node j in a hidden layer:
∂E/∂W_{k,j} = (∂E/∂a_j)(∂a_j/∂W_{k,j}) + Σ_i (∂E/∂a_i)(∂a_i/∂W_{k,j})
            = (∂E/∂a_j)(∂a_j/∂W_{k,j})   since ∂a_i/∂W_{k,j} = 0 for i ≠ j
            = (∂E/∂a_j) g'(in_j) a_k
Back-propagation derivation: hidden layer
For a node j in a hidden layer:
∂E/∂W_{k,j} = (∂E/∂a_j)(∂a_j/∂W_{k,j}) + Σ_i (∂E/∂a_i)(∂a_i/∂W_{k,j})
            = (∂E/∂a_j)(∂a_j/∂W_{k,j})   since ∂a_i/∂W_{k,j} = 0 for i ≠ j
            = (∂E/∂a_j) g'(in_j) a_k
∂E/∂a_j = ?
Back-propagation derivation: hidden layer
For a node j in a hidden layer:
∂E/∂W_{k,j} = (∂E/∂a_j)(∂a_j/∂W_{k,j}) + Σ_i (∂E/∂a_i)(∂a_i/∂W_{k,j})
            = (∂E/∂a_j)(∂a_j/∂W_{k,j})   since ∂a_i/∂W_{k,j} = 0 for i ≠ j
            = (∂E/∂a_j) g'(in_j) a_k
∂E/∂a_j = ∂/∂a_j E(a_{k_1}, a_{k_2}, ..., a_{k_m})
where {k_i} are the indices of the nodes in the layer after node j.
Back-propagation derivation: hidden layer
For a node j in a hidden layer:
∂E/∂W_{k,j} = (∂E/∂a_j)(∂a_j/∂W_{k,j}) + Σ_i (∂E/∂a_i)(∂a_i/∂W_{k,j})
            = (∂E/∂a_j)(∂a_j/∂W_{k,j})   since ∂a_i/∂W_{k,j} = 0 for i ≠ j
            = (∂E/∂a_j) g'(in_j) a_k
∂E/∂a_j = Σ_k (∂E/∂a_k)(∂a_k/∂a_j)
where k runs over all nodes k that node j connects to.
Back-propagation derivation: hidden layer
For a node j in a hidden layer:
∂E/∂W_{k,j} = (∂E/∂a_j)(∂a_j/∂W_{k,j}) + Σ_i (∂E/∂a_i)(∂a_i/∂W_{k,j})
            = (∂E/∂a_j)(∂a_j/∂W_{k,j})   since ∂a_i/∂W_{k,j} = 0 for i ≠ j
            = (∂E/∂a_j) g'(in_j) a_k
∂E/∂a_j = Σ_k (∂E/∂a_k)(∂a_k/∂a_j)
        = Σ_k (∂E/∂a_k) g'(in_k) W_{j,k}
Back-propagation derivation: hidden layer
If we define
Δ_j ≡ g'(in_j) Σ_k W_{j,k} Δ_k
(where the k in the sum ranges over the nodes in the layer after node j)
then
∂E/∂W_{k,j} = -a_k Δ_j
Back-propagation pseudocode
for iter = 1 to T
    for e = 1 to N (all examples)
        x = input for example e
        y = output for example e
        run x forward through network, computing all {a_i}, {in_i}
        for all nodes i (in reverse order)
            compute Δ_i = (y_i - a_i) g'(in_i)        if i is output node
                          g'(in_i) Σ_k W_{i,k} Δ_k    o.w.
        for all weights W_{j,i}
            W_{j,i} = W_{j,i} + α a_j Δ_i
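For concreteness, here is a runnable sketch of this pseudocode for a network with one hidden layer of sigmoid units. The architecture (2 inputs, 4 hidden units, 1 output), the XOR training data, the learning rate, and all variable names are illustrative assumptions, not taken from the slides.

```python
import math
import random

# Runnable sketch of the back-propagation pseudocode for one hidden layer of
# sigmoid units. The architecture (2-4-1), the XOR training set, the learning
# rate and the variable names are illustrative choices, not from the slides.

def g(x):
    return 1 / (1 + math.exp(-x))        # sigmoid; note g'(in) = a (1 - a)

random.seed(0)
n_in, n_hid, n_out = 2, 4, 1
W1 = [[random.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hid)]   # input -> hidden (last column = bias)
W2 = [[random.uniform(-1, 1) for _ in range(n_hid + 1)] for _ in range(n_out)]  # hidden -> output (last column = bias)
examples = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
alpha = 0.5

def forward(x):
    xb = x + [1]                                                   # bias input
    a_hid = [g(sum(w * v for w, v in zip(W1[j], xb))) for j in range(n_hid)]
    a_out = [g(sum(w * v for w, v in zip(W2[i], a_hid + [1]))) for i in range(n_out)]
    return xb, a_hid, a_out

for epoch in range(5000):                        # for iter = 1 to T
    for x, y in examples:                        # for all examples
        xb, a_hid, a_out = forward(x)            # forward pass: all a_i
        hb = a_hid + [1]
        # Delta_i for output nodes: (y_i - a_i) g'(in_i)
        d_out = [(y[i] - a_out[i]) * a_out[i] * (1 - a_out[i]) for i in range(n_out)]
        # Delta_j for hidden nodes: g'(in_j) * sum_i W_{j,i} Delta_i
        d_hid = [a_hid[j] * (1 - a_hid[j]) * sum(W2[i][j] * d_out[i] for i in range(n_out))
                 for j in range(n_hid)]
        # Weight updates: W <- W + alpha * a_j * Delta_i
        for i in range(n_out):
            for j in range(n_hid + 1):
                W2[i][j] += alpha * hb[j] * d_out[i]
        for j in range(n_hid):
            for k in range(n_in + 1):
                W1[j][k] += alpha * xb[k] * d_hid[j]

for x, y in examples:
    print(x, y, round(forward(x)[2][0], 2))
```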
Back-propagation learning contd.
At each epoch, sum gradient updates for all examples and apply
Restaurant data:
[Figure: total error on training set vs. number of epochs (0-400)]
Usual problems with slow convergence, local minima
Back-propagation learning contd.
Restaurant data:
[Figure: % correct on test set vs. training set size, for a multilayer network and a decision tree]
Handwritten digit recognition
3-nearest-neighbor = 2.4% error
400-300-10 unit MLP = 1.6% error
LeNet: 768-192-30-10 unit MLP = 0.9% error
Summary
Most brains have lots of neurons; each neuron ≈ linear threshold unit (?)
Perceptrons (one-layer networks) insufficiently expressive
Multi-layer networks are sufficiently expressive; can be trained by gradient descent, i.e., error back-propagation
Many applications: speech, driving, handwriting, credit cards, etc.