Neural networks

Chapter 19, Sections 1-5
Outline

- Brains
- Neural networks
- Perceptrons
- Multilayer perceptrons
- Applications of neural networks
Brains

10^11 neurons of > 20 types, 10^14 synapses, 1 ms - 10 ms cycle time

Signals are noisy "spike trains" of electrical potential

[Figure: a neuron: cell body (soma) with nucleus, dendrites, an axon with
axonal arborization, and synapses connecting to axons from other cells]
McCulloch-Pitts unit

Output is a "squashed" linear function of the inputs:

    a_i ← g(in_i) = g( Σ_j W_{j,i} a_j )

[Figure: a single unit. Input links carry activations a_j, weighted by W_{j,i},
into the input function Σ, giving in_i; the activation function g produces the
output a_i on the output links. A fixed bias input a_0 = -1 enters through the
bias weight W_{0,i}.]
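As a concrete sketch (not part of the original slides), the unit's computation
is a few lines of Python; the step activation and the a_0 = -1 bias convention
follow the figure above:

```python
def unit_output(weights, inputs, g):
    """a_i = g(in_i), where in_i = sum_j W_{j,i} a_j.

    weights[0] is the bias weight W_{0,i}; the fixed bias input
    a_0 = -1 is prepended to the ordinary inputs a_1..a_n."""
    activations = [-1.0] + list(inputs)
    in_i = sum(w * a for w, a in zip(weights, activations))
    return g(in_i)

step = lambda x: 1 if x > 0 else 0            # threshold activation

print(unit_output([1.5, 1, 1], [1, 1], step))  # an AND unit: prints 1
```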
Activation functions

[Figure: (a) a step function or threshold function; (b) a sigmoid function
1/(1 + e^{-x})]

Changing the bias weight W_{0,i} moves the threshold location
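A minimal sketch of both activation functions, plus a check that raising the
bias weight shifts the firing threshold (the grid of test inputs is made up
for illustration):

```python
import math

def step(x):
    """(a) Step / threshold activation."""
    return 1 if x > 0 else 0

def sigmoid(x):
    """(b) Sigmoid activation, 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

# With bias input a_0 = -1, a larger W_0 moves the threshold right:
# the unit fires only when the remaining weighted input exceeds W_0.
for w0 in (0.5, 1.5):
    print(w0, [step(-w0 + x) for x in (0.0, 1.0, 2.0)])
# w0=0.5 fires for x in {1, 2}; w0=1.5 fires only for x=2
```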
Implementing logical functions

AND: W_0 = 1.5, W_1 = 1, W_2 = 1
OR:  W_0 = 0.5, W_1 = 1, W_2 = 1
NOT: W_0 = -0.5, W_1 = -1

McCulloch and Pitts: every Boolean function can be implemented
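A quick sketch verifying these weights, assuming the a_0 = -1 bias convention
and step activation from the earlier slides:

```python
def unit(weights, inputs):
    """Threshold unit with bias input a_0 = -1 and step activation."""
    in_ = sum(w * a for w, a in zip(weights, [-1] + list(inputs)))
    return 1 if in_ > 0 else 0

AND = lambda x1, x2: unit([1.5, 1, 1], [x1, x2])
OR  = lambda x1, x2: unit([0.5, 1, 1], [x1, x2])
NOT = lambda x1:     unit([-0.5, -1], [x1])

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, AND(x1, x2), OR(x1, x2))
print(NOT(0), NOT(1))   # prints: 1 0
```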
Network structures

Feed-forward networks:
- single-layer perceptrons
- multi-layer perceptrons

Feed-forward networks implement functions, have no internal state

Recurrent networks:
- Hopfield networks have symmetric weights (W_{i,j} = W_{j,i});
  g(x) = sign(x), a_i = ±1; holographic associative memory
- Boltzmann machines use stochastic activation functions, ≈ MCMC in BNs
- recurrent neural nets have directed cycles with delays
  ⇒ have internal state (like flip-flops), can oscillate etc.
Feed-forward example

[Figure: units 1 and 2 feed units 3 and 4 (weights W_{1,3}, W_{1,4}, W_{2,3},
W_{2,4}), which feed unit 5 (weights W_{3,5}, W_{4,5})]

Feed-forward network = a parameterized family of nonlinear functions:

    a_5 = g(W_{3,5} a_3 + W_{4,5} a_4)
        = g(W_{3,5} g(W_{1,3} a_1 + W_{2,3} a_2) + W_{4,5} g(W_{1,4} a_1 + W_{2,4} a_2))
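A sketch of this composition in Python; the sigmoid g and the numeric weights
are assumptions for illustration, and bias terms are omitted to match the
slide's equation:

```python
import math

g = lambda x: 1.0 / (1.0 + math.exp(-x))   # sigmoid activation

def a5(a1, a2, W):
    """Output of the 2-2-1 network: a_5 as a nested composition of g."""
    a3 = g(W[1, 3] * a1 + W[2, 3] * a2)
    a4 = g(W[1, 4] * a1 + W[2, 4] * a2)
    return g(W[3, 5] * a3 + W[4, 5] * a4)

# arbitrary example weights (assumed for illustration)
W = {(1, 3): 0.5, (2, 3): -0.4, (1, 4): 0.9, (2, 4): 0.1,
     (3, 5): -1.2, (4, 5): 0.8}
print(a5(1.0, 0.0, W))
```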
Perceptrons

[Figure: a single-layer perceptron (input units wired directly to output units
through weights W_{j,i}), and a surface plot of perceptron output as a
function of x_1 and x_2]
Expressiveness of perceptrons

Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)

Can represent AND, OR, NOT, majority, etc.

Represents a linear separator in input space:

    Σ_j W_j x_j > 0    or    W · x > 0

[Figure: (a) I_1 and I_2 and (b) I_1 or I_2 are linearly separable;
(c) I_1 xor I_2 is not, so no perceptron can represent it]
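A brute-force illustration of the separability point; the coarse weight grid
is an assumption, but it suffices for 0/1 inputs:

```python
from itertools import product

def separable(truth):
    """Search a coarse grid of weights (W0, W1, W2) for a linear separator:
    step(-W0 + W1*x1 + W2*x2) must match the given truth table."""
    grid = [i / 2 for i in range(-6, 7)]          # -3.0 .. 3.0 in steps of 0.5
    for w0, w1, w2 in product(grid, repeat=3):
        if all(((-w0 + w1 * x1 + w2 * x2) > 0) == out
               for (x1, x2), out in truth.items()):
            return (w0, w1, w2)
    return None

AND = {(0, 0): False, (0, 1): False, (1, 0): False, (1, 1): True}
XOR = {(0, 0): False, (0, 1): True,  (1, 0): True,  (1, 1): False}
print(separable(AND))   # prints some (W0, W1, W2) that works
print(separable(XOR))   # None: XOR is not linearly separable
```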
Perceptron learning

Learn by adjusting weights to reduce error on training set

The squared error for an example with input x and true output y is

    E = (1/2) Err² ≡ (1/2) (y − h_W(x))²

Perform optimization search by gradient descent:

    ∂E/∂W_j = Err × ∂Err/∂W_j
            = Err × ∂/∂W_j ( y − g(Σ_{j=0}^{n} W_j x_j) )
            = −Err × g′(in) × x_j

Simple weight update rule:

    W_j ← W_j + α × Err × g′(in) × x_j

E.g., +ve error ⇒ increase network output
⇒ increase weights on +ve inputs, decrease on -ve inputs
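A minimal sketch of this update rule with sigmoid g (so g′(in) = g(in)(1 − g(in)));
the OR training data, learning rate, and epoch count are assumptions for
illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def perceptron_update(W, x, y, alpha=0.1):
    """One step of W_j <- W_j + alpha * Err * g'(in) * x_j.
    x includes the bias input x_0 = -1."""
    in_ = sum(wj * xj for wj, xj in zip(W, x))
    out = sigmoid(in_)
    err = y - out                    # Err = y - h_W(x)
    gprime = out * (1.0 - out)
    return [wj + alpha * err * gprime * xj for wj, xj in zip(W, x)]

# train on OR for a few epochs
data = [((-1, 0, 0), 0), ((-1, 0, 1), 1), ((-1, 1, 0), 1), ((-1, 1, 1), 1)]
W = [0.0, 0.0, 0.0]
for _ in range(1000):
    for x, y in data:
        W = perceptron_update(W, x, y)
print([round(sigmoid(sum(wj * xj for wj, xj in zip(W, x))), 2)
       for x, _ in data])           # outputs approach 0, 1, 1, 1
```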
Perceptron learning contd.

Perceptron learning rule converges to a consistent function
for any linearly separable data set

[Figure: two learning curves (proportion correct on test set vs. training set
size) comparing the perceptron with a decision tree; one problem favors the
perceptron, the other the decision tree]
Multilayer perceptrons

Layers are usually fully connected;
numbers of hidden units typically chosen by hand

[Figure: a network with input units a_k, hidden units a_j, and output units
a_i, connected by weights W_{k,j} and W_{j,i}]
Expressiveness of MLPs

All continuous functions can be represented (to arbitrary accuracy) with
2 layers, all functions with 3 layers

[Figure: two surface plots of h_W(x_1, x_2) realized by small multilayer networks]
Back-propagation learning

Output layer: same as for single-layer perceptron,

    W_{j,i} ← W_{j,i} + α × a_j × Δ_i    where Δ_i = Err_i × g′(in_i)

Hidden layer: back-propagate the error from the output layer:

    Δ_j = g′(in_j) Σ_i W_{j,i} Δ_i

Update rule for weights in hidden layer:

    W_{k,j} ← W_{k,j} + α × a_k × Δ_j

(Most neuroscientists deny that back-propagation occurs in the brain)
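A minimal sketch of these three equations for one hidden layer; sigmoid units,
no bias terms, and list-of-lists weight matrices are assumptions made to stay
close to the slide's notation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(Wkj, Wji, x, y, alpha=0.5):
    """One back-propagation update for a one-hidden-layer network.
    Wkj[k][j]: input k -> hidden j; Wji[j][i]: hidden j -> output i."""
    # forward pass
    in_j = [sum(Wkj[k][j] * x[k] for k in range(len(x)))
            for j in range(len(Wkj[0]))]
    a_j = [sigmoid(v) for v in in_j]
    in_i = [sum(Wji[j][i] * a_j[j] for j in range(len(a_j)))
            for i in range(len(Wji[0]))]
    a_i = [sigmoid(v) for v in in_i]
    # output deltas: Delta_i = Err_i * g'(in_i)
    d_i = [(y[i] - a_i[i]) * a_i[i] * (1 - a_i[i]) for i in range(len(a_i))]
    # hidden deltas: Delta_j = g'(in_j) * sum_i W_{j,i} Delta_i
    d_j = [a_j[j] * (1 - a_j[j]) * sum(Wji[j][i] * d_i[i]
           for i in range(len(d_i))) for j in range(len(a_j))]
    # weight updates
    for j in range(len(a_j)):
        for i in range(len(d_i)):
            Wji[j][i] += alpha * a_j[j] * d_i[i]
    for k in range(len(x)):
        for j in range(len(d_j)):
            Wkj[k][j] += alpha * x[k] * d_j[j]
    return Wkj, Wji

# one update on a 2-2-1 net with made-up weights
Wkj = [[0.3, -0.2], [0.7, 0.5]]
Wji = [[0.6], [-0.4]]
backprop_step(Wkj, Wji, x=[1.0, 0.5], y=[1.0])
```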
Back-propagation derivation

The squared error on a single example is defined as

    E = (1/2) Σ_i (y_i − a_i)²,

where the sum is over the nodes in the output layer.

    ∂E/∂W_{j,i} = −(y_i − a_i) ∂a_i/∂W_{j,i}
                = −(y_i − a_i) ∂g(in_i)/∂W_{j,i}
                = −(y_i − a_i) g′(in_i) ∂in_i/∂W_{j,i}
                = −(y_i − a_i) g′(in_i) ∂/∂W_{j,i} ( Σ_j W_{j,i} a_j )
                = −(y_i − a_i) g′(in_i) a_j
                = −a_j Δ_i
Back-propagation derivation contd.

    ∂E/∂W_{k,j} = −Σ_i (y_i − a_i) ∂a_i/∂W_{k,j}
                = −Σ_i (y_i − a_i) ∂g(in_i)/∂W_{k,j}
                = −Σ_i (y_i − a_i) g′(in_i) ∂in_i/∂W_{k,j}
                = −Σ_i Δ_i ∂/∂W_{k,j} ( Σ_j W_{j,i} a_j )
                = −Σ_i Δ_i W_{j,i} ∂a_j/∂W_{k,j}
                = −Σ_i Δ_i W_{j,i} ∂g(in_j)/∂W_{k,j}
                = −Σ_i Δ_i W_{j,i} g′(in_j) ∂in_j/∂W_{k,j}
                = −Σ_i Δ_i W_{j,i} g′(in_j) ∂/∂W_{k,j} ( Σ_k W_{k,j} a_k )
                = −Σ_i Δ_i W_{j,i} g′(in_j) a_k
                = −a_k Δ_j
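One way to sanity-check the result ∂E/∂W_{k,j} = −a_k Δ_j is a
finite-difference comparison on a tiny network; the 2-2-1 shape and all
numbers below are assumptions for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# tiny 2-2-1 network with arbitrary example weights
Wkj = [[0.3, -0.2], [0.7, 0.5]]      # input k -> hidden j
Wji = [[0.6], [-0.4]]                # hidden j -> output i
x, y = [1.0, 0.5], [1.0]

def error(Wkj, Wji):
    """E = 1/2 * sum_i (y_i - a_i)^2 for the current weights."""
    a_j = [sigmoid(sum(Wkj[k][j] * x[k] for k in range(2))) for j in range(2)]
    a_i = sigmoid(sum(Wji[j][0] * a_j[j] for j in range(2)))
    return 0.5 * (y[0] - a_i) ** 2

# analytic gradient from the derivation: dE/dW_{k,j} = -a_k * Delta_j
a_j = [sigmoid(sum(Wkj[k][j] * x[k] for k in range(2))) for j in range(2)]
a_i = sigmoid(sum(Wji[j][0] * a_j[j] for j in range(2)))
d_i = (y[0] - a_i) * a_i * (1 - a_i)
d_j = [a_j[j] * (1 - a_j[j]) * Wji[j][0] * d_i for j in range(2)]
analytic = -x[0] * d_j[0]            # dE/dW_{0,0}

# numerical check by central finite differences
eps = 1e-6
Wkj[0][0] += eps;      e_plus = error(Wkj, Wji)
Wkj[0][0] -= 2 * eps;  e_minus = error(Wkj, Wji)
Wkj[0][0] += eps       # restore
print(analytic, (e_plus - e_minus) / (2 * eps))   # should agree closely
```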
Back-propagation learning contd.

At each epoch, sum gradient updates for all examples and apply

[Figure: total error on the training set vs. number of epochs; the error falls
from about 14 toward 0 over roughly 400 epochs]

Usual problems with slow convergence, local minima
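A sketch of that epoch structure (batch gradient descent); grad is a
hypothetical per-example gradient function, and the linear-unit demo data are
made up:

```python
def train_epoch(W, examples, grad, alpha=0.1):
    """Sum the per-example gradients over the whole training set,
    then apply one combined update (grad(W, x, y) returns dE/dW_j)."""
    total = [0.0] * len(W)
    for x, y in examples:
        g = grad(W, x, y)
        total = [t + gj for t, gj in zip(total, g)]
    return [wj - alpha * tj for wj, tj in zip(W, total)]  # descend summed gradient

# demo with a linear unit: E = 1/2 (y - W.x)^2, so dE/dW_j = -(y - W.x) x_j
def grad(W, x, y):
    err = y - sum(wj * xj for wj, xj in zip(W, x))
    return [-err * xj for xj in x]

W = [0.0, 0.0]
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)]
for _ in range(100):
    W = train_epoch(W, data, grad)
print(W)   # approaches [1.0, 2.0]
```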
Back-propagation learning contd.

[Figure: learning curves for a multilayer network and a decision tree;
% correct on the test set plotted against training set size]
Handwritten digit recognition

3-nearest-neighbor = 2.4% error
400-300-10 unit MLP = 1.6% error
LeNet: 768-192-30-10 unit MLP = 0.9% error
Summary

Most brains have lots of neurons; each neuron ≈ linear threshold unit (?)

Perceptrons (one-layer networks) insufficiently expressive

Multi-layer networks are sufficiently expressive; can be trained by gradient
descent, i.e., error back-propagation

Many applications: speech, driving, handwriting, credit cards, etc.