Neural networks (Chapter 20, Section 5)
Outline
- Brains
- Neural networks
- Perceptrons
- Multilayer perceptrons
- Applications of neural networks
Brains
10^11 neurons of > 20 types, 10^14 synapses, 1ms-10ms cycle time
Signals are noisy "spike trains" of electrical potential
[Figure: neuron anatomy - cell body (soma), nucleus, dendrites, axon and axonal arborization, synapses to and from other cells]
McCulloch-Pitts unit
Output is a "squashed" linear function of the inputs:
    a_i <- g(in_i) = g(Sum_j W_{j,i} a_j)
[Figure: a unit with a fixed bias input a_0 = -1 weighted by the bias weight W_{0,i}; input links a_j feed the input function in_i = Sum_j W_{j,i} a_j, the activation function g produces the output a_i on the output links]
A gross oversimplification of real neurons, but its purpose is to develop understanding of what networks of simple units can do
Activation functions
[Figure: two plots of g(in_i) against in_i: (a) a hard threshold, (b) a smooth S-curve]
(a) is a step function or threshold function
(b) is a sigmoid function 1/(1 + e^{-x})
Changing the bias weight W_{0,i} moves the threshold location
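The two activation functions above can be written in a few lines. This is a minimal sketch (the function names are mine, not from the slides); the derivative of the sigmoid is included because the later learning rules need g'(in):

```python
import math

def step(x):
    # threshold/step activation: fires (outputs 1) when input >= 0
    return 1.0 if x >= 0 else 0.0

def sigmoid(x):
    # smooth, differentiable squashing function 1/(1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # derivative g'(x) = g(x) * (1 - g(x)), used by gradient descent later
    s = sigmoid(x)
    return s * (1.0 - s)
```

Note that sigmoid(0) = 0.5 and sigmoid_prime(0) = 0.25, the maximum slope; this is why a sigmoid unit with small weights behaves almost linearly.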
Implementing logical functions
AND: W_0 = 1.5, W_1 = W_2 = 1
OR:  W_0 = 0.5, W_1 = W_2 = 1
NOT: W_0 = -0.5, W_1 = -1
McCulloch and Pitts: every Boolean function can be implemented
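The three gates above can be checked directly. A minimal sketch, using the convention from the unit diagram that the bias input is fixed at a_0 = -1 so the bias weight W_0 acts as a threshold (`unit` is an illustrative helper name, not from the slides):

```python
def unit(weights, inputs):
    # McCulloch-Pitts threshold unit with fixed bias input a_0 = -1:
    # fires (1) when W_1*x_1 + ... + W_n*x_n >= W_0
    w0, rest = weights[0], weights[1:]
    in_i = -w0 + sum(w * x for w, x in zip(rest, inputs))
    return 1 if in_i >= 0 else 0

# the weight settings from the slide
AND = lambda x1, x2: unit([1.5, 1, 1], [x1, x2])
OR  = lambda x1, x2: unit([0.5, 1, 1], [x1, x2])
NOT = lambda x1:     unit([-0.5, -1], [x1])
```

For example, AND fires only when x_1 + x_2 >= 1.5, i.e. when both inputs are 1.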
Network structures
Feed-forward networks:
- single-layer perceptrons
- multi-layer perceptrons
Feed-forward networks implement functions, have no internal state
Recurrent networks:
- Hopfield networks have symmetric weights (W_{i,j} = W_{j,i});
  g(x) = sign(x), a_i = +/-1; holographic associative memory
- Boltzmann machines use stochastic activation functions,
  ~ MCMC in Bayes nets
- recurrent neural nets have directed cycles with delays
  => have internal state (like flip-flops), can oscillate etc.
Feed-forward example
[Figure: input units 1, 2 feed hidden units 3, 4 via W_{1,3}, W_{1,4}, W_{2,3}, W_{2,4}; hidden units 3, 4 feed output unit 5 via W_{3,5}, W_{4,5}]
Feed-forward network = a parameterized family of nonlinear functions:
    a_5 = g(W_{3,5} a_3 + W_{4,5} a_4)
        = g(W_{3,5} g(W_{1,3} a_1 + W_{2,3} a_2) + W_{4,5} g(W_{1,4} a_1 + W_{2,4} a_2))
Adjusting weights changes the function: do learning this way!
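The nested expression for a_5 translates directly into code. A minimal sketch of the 2-2-1 network above, with sigmoid activation and no bias units, matching the formula as written (the `forward` name and the dictionary representation of weights are mine):

```python
import math

def g(x):
    # sigmoid activation
    return 1.0 / (1.0 + math.exp(-x))

def forward(a1, a2, W):
    # W maps an edge (j, i) to the weight W_{j,i};
    # units 1, 2 are inputs; 3, 4 are hidden; 5 is the output
    a3 = g(W[1, 3] * a1 + W[2, 3] * a2)
    a4 = g(W[1, 4] * a1 + W[2, 4] * a2)
    a5 = g(W[3, 5] * a3 + W[4, 5] * a4)
    return a5

# example weight setting (arbitrary values for illustration)
W = {(1, 3): 0.5, (2, 3): -0.5, (1, 4): 0.4, (2, 4): 0.6,
     (3, 5): 1.0, (4, 5): 1.0}
```

Changing any entry of W changes the function computed, which is exactly the handle that learning exploits.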
Single-layer perceptrons
[Figure: input units connected directly to output units by weights W_{j,i}; surface plot of perceptron output over (x_1, x_2) showing a sigmoid "cliff"]
Output units all operate separately - no shared weights
Adjusting weights moves the location, orientation, and steepness of cliff
Expressiveness of perceptrons
Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)
Can represent AND, OR, NOT, majority, etc., but not XOR
Represents a linear separator in input space:
    Sum_j W_j x_j > 0   or   W . x > 0
[Figure: in the (x_1, x_2) plane, (a) x_1 and x_2 and (b) x_1 or x_2 are linearly separable; (c) x_1 xor x_2 is not]
Minsky & Papert (1969) pricked the neural network balloon
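The separability claim can be probed empirically. The sketch below (all names are mine) searches a coarse grid of weight settings for a step perceptron: the grid contains weights realizing AND, but no weights anywhere realize XOR, since its positive and negative examples cannot be split by a line:

```python
from itertools import product

def fires(w0, w1, w2, x1, x2):
    # step perceptron with threshold w0: output 1 iff w1*x1 + w2*x2 >= w0
    return 1 if w1 * x1 + w2 * x2 - w0 >= 0 else 0

def representable(truth_table, grid):
    # brute-force search over a weight grid for a setting
    # that reproduces the whole truth table
    for w0, w1, w2 in product(grid, repeat=3):
        if all(fires(w0, w1, w2, x1, x2) == y
               for (x1, x2), y in truth_table.items()):
            return True
    return False

grid = [i / 2 for i in range(-4, 5)]   # weights -2.0 .. 2.0 in steps of 0.5
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
```

Of course a failed grid search is not a proof; the proof is geometric (no line separates the XOR points), but the search illustrates it concretely.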
Perceptron learning
Learn by adjusting weights to reduce error on training set
The squared error for an example with input x and true output y is
    E = (1/2) Err^2 = (1/2) (y - h_W(x))^2
Perform optimization search by gradient descent:
    dE/dW_j = Err * dErr/dW_j
            = Err * d/dW_j (y - g(Sum_{j=0}^n W_j x_j))
            = -Err * g'(in) * x_j
Simple weight update rule:
    W_j <- W_j + alpha * Err * g'(in) * x_j
E.g., +ve error => increase network output
   => increase weights on +ve inputs, decrease on -ve inputs
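The update rule above is a few lines of code. A minimal sketch, assuming a sigmoid g, the a_0 = -1 bias convention from the unit diagram, and targets in {0, 1} (the function name and default hyperparameters are mine):

```python
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def perceptron_learn(examples, n, alpha=0.5, epochs=1000):
    # examples: list of (x, y) with x a length-n input tuple, y in {0, 1}
    # weights include W_0 for the fixed bias input x_0 = -1
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, y in examples:
            xs = [-1.0] + list(x)                  # prepend bias input
            in_ = sum(wj * xj for wj, xj in zip(w, xs))
            err = y - g(in_)                       # Err = y - h_W(x)
            gprime = g(in_) * (1.0 - g(in_))
            for j in range(n + 1):                 # W_j <- W_j + alpha*Err*g'(in)*x_j
                w[j] += alpha * err * gprime * xs[j]
    return w
```

Training on the OR truth table, for instance, drives W_1 and W_2 positive and the threshold W_0 to a small positive value, matching the hand-set weights on the "Implementing logical functions" slide.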
Perceptron learning contd.
Perceptron learning rule converges to a consistent function for any linearly separable data set
[Figure: learning curves (proportion correct on test set vs. training set size) for perceptron vs. decision tree, on MAJORITY with 11 inputs and on the RESTAURANT data]
Perceptron learns majority function easily, DTL is hopeless
DTL learns restaurant function easily, perceptron cannot represent it
Multilayer perceptrons
Layers are usually fully connected; numbers of hidden units typically chosen by hand
[Figure: input units a_k connect to hidden units a_j via weights W_{k,j}; hidden units connect to output units a_i via weights W_{j,i}]
Expressiveness of MLPs
All continuous functions w/ 2 layers, all functions w/ 3 layers
[Figure: surface plots of h_W(x_1, x_2) over (x_1, x_2): a ridge and a bump]
Combine two opposite-facing threshold functions to make a ridge
Combine two perpendicular ridges to make a bump
Add bumps of various sizes and locations to fit any surface
Proof requires exponentially many hidden units (cf DTL proof)
Back-propagation learning
Output layer: same as for single-layer perceptron,
    W_{j,i} <- W_{j,i} + alpha * a_j * Delta_i
where Delta_i = Err_i * g'(in_i)
Hidden layer: back-propagate the error from the output layer:
    Delta_j = g'(in_j) Sum_i W_{j,i} Delta_i
Update rule for weights in hidden layer:
    W_{k,j} <- W_{k,j} + alpha * a_k * Delta_j
(Most neuroscientists deny that back-propagation occurs in the brain)
Back-propagation derivation
The squared error on a single example is defined as
    E = (1/2) Sum_i (y_i - a_i)^2
where the sum is over the nodes in the output layer.
    dE/dW_{j,i} = -(y_i - a_i) da_i/dW_{j,i}
                = -(y_i - a_i) dg(in_i)/dW_{j,i}
                = -(y_i - a_i) g'(in_i) din_i/dW_{j,i}
                = -(y_i - a_i) g'(in_i) d/dW_{j,i} (Sum_j W_{j,i} a_j)
                = -(y_i - a_i) g'(in_i) a_j
                = -a_j Delta_i
Back-propagation derivation contd.
    dE/dW_{k,j} = -Sum_i (y_i - a_i) da_i/dW_{k,j}
                = -Sum_i (y_i - a_i) dg(in_i)/dW_{k,j}
                = -Sum_i (y_i - a_i) g'(in_i) din_i/dW_{k,j}
                = -Sum_i Delta_i d/dW_{k,j} (Sum_j W_{j,i} a_j)
                = -Sum_i Delta_i W_{j,i} da_j/dW_{k,j}
                = -Sum_i Delta_i W_{j,i} dg(in_j)/dW_{k,j}
                = -Sum_i Delta_i W_{j,i} g'(in_j) din_j/dW_{k,j}
                = -Sum_i Delta_i W_{j,i} g'(in_j) d/dW_{k,j} (Sum_k W_{k,j} a_k)
                = -Sum_i Delta_i W_{j,i} g'(in_j) a_k
                = -a_k Delta_j
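The Delta rules above can be exercised on a tiny network. A minimal sketch for a 2-2-1 network with sigmoid units and no bias weights, performing one forward pass and one gradient step per call (the function name, weight layout, and default alpha are mine):

```python
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def gp(x):
    # g'(x) for the sigmoid
    s = g(x)
    return s * (1.0 - s)

def backprop_step(x, y, W_in, W_out, alpha=0.5):
    # One back-propagation step on a 2-2-1 network.
    # W_in[k][j]: input k -> hidden j; W_out[j]: hidden j -> single output.
    in_h = [sum(W_in[k][j] * x[k] for k in range(2)) for j in range(2)]
    a_h = [g(v) for v in in_h]
    in_o = sum(W_out[j] * a_h[j] for j in range(2))
    a_o = g(in_o)
    # Delta_i = Err_i * g'(in_i) at the output
    delta_o = (y - a_o) * gp(in_o)
    # Delta_j = g'(in_j) * Sum_i W_{j,i} Delta_i, computed BEFORE updating W_out
    delta_h = [gp(in_h[j]) * W_out[j] * delta_o for j in range(2)]
    for j in range(2):                    # W_{j,i} <- W_{j,i} + alpha * a_j * Delta_i
        W_out[j] += alpha * a_h[j] * delta_o
    for k in range(2):                    # W_{k,j} <- W_{k,j} + alpha * a_k * Delta_j
        for j in range(2):
            W_in[k][j] += alpha * x[k] * delta_h[j]
    return a_o
```

Repeated calls on the same example drive the output toward the target y, since each call is a gradient step on E = (1/2)(y - a_o)^2.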
Back-propagation learning contd.
At each epoch, sum gradient updates for all examples and apply
Training curve for 100 restaurant examples: finds exact fit
[Figure: total error on training set vs. number of epochs, falling toward zero by about 400 epochs]
Typical problems: slow convergence, local minima
Back-propagation learning contd.
Learning curve for MLP with 4 hidden units:
[Figure: proportion correct on test set vs. training set size, multilayer network vs. decision tree, on the RESTAURANT data]
MLPs are quite good for complex pattern recognition tasks, but resulting hypotheses cannot be understood easily
Handwritten digit recognition
3-nearest-neighbor = 2.4% error
400-300-10 unit MLP = 1.6% error
LeNet: 768-192-30-10 unit MLP = 0.9% error
Current best (kernel machines, vision algorithms) ~ 0.6% error
Summary
Most brains have lots of neurons; each neuron ~ linear threshold unit (?)
Perceptrons (one-layer networks) insufficiently expressive
Multi-layer networks are sufficiently expressive; can be trained by gradient descent, i.e., error back-propagation
Many applications: speech, driving, handwriting, fraud detection, etc.
Engineering, cognitive modelling, and neural system modelling subfields have largely diverged