Neural Networks. Chapter 18, Section 7. TB Artificial Intelligence. Slides from AIMA

Neural Networks. Chapter 18, Section 7. TB Artificial Intelligence. Slides from AIMA, http://aima.cs.berkeley.edu

Outline
- Brains
- Neural networks
- Perceptrons
- Multilayer perceptrons
- Applications of neural networks

Brains
$10^{11}$ neurons of $> 20$ types, $10^{14}$ synapses, 1ms-10ms cycle time
Signals are noisy spike trains of electrical potential
[Figure: biological neuron, labelled with axonal arborization, axon from another cell, synapse, dendrite, axon, nucleus, synapses, cell body or soma]

McCulloch-Pitts unit
Output is a squashed linear function of the inputs:
$a_i \leftarrow g(in_i) = g\Big(\textstyle\sum_j W_{j,i}\, a_j\Big)$
[Figure: unit diagram showing the bias input $a_0 = -1$ with bias weight $W_{0,i}$, input links carrying $a_j$ with weights $W_{j,i}$, the input function $in_i = \sum_j W_{j,i} a_j$, the activation function $g$, the output $a_i = g(in_i)$, and the output links]
A gross oversimplification of real neurons, but its purpose is to develop understanding of what networks of simple units can do

Activation functions
[Figure: two plots of $g(in_i)$ against $in_i$: (a) a hard threshold, (b) a smooth sigmoid]
(a) is a step function or threshold function
(b) is a sigmoid function $1/(1+e^{-x})$
Changing the bias weight $W_{0,i}$ moves the threshold location
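
As a concrete illustration (not from the slides), here is a minimal Python sketch of a single McCulloch-Pitts-style unit with either activation; the function and variable names are illustrative assumptions.

```python
import math

def step(x):
    """Step / threshold activation: fires iff the weighted input is >= 0."""
    return 1.0 if x >= 0 else 0.0

def sigmoid(x):
    """Sigmoid activation 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(weights, inputs, g=step):
    """Compute a_i = g(sum_j W_{j,i} a_j); inputs[0] = -1 acts as the bias input."""
    in_i = sum(w * a for w, a in zip(weights, inputs))
    return g(in_i)

# Illustrative weights: bias weight 0.5 and input weights 1, 1
print(unit_output([0.5, 1.0, 1.0], [-1.0, 0.0, 1.0]))           # step unit
print(unit_output([0.5, 1.0, 1.0], [-1.0, 0.0, 1.0], sigmoid))  # sigmoid unit
```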

Implementing logical functions
AND: $W_0 = 1.5$, $W_1 = W_2 = 1$
OR: $W_0 = 0.5$, $W_1 = W_2 = 1$
NOT: $W_0 = -0.5$, $W_1 = -1$
McCulloch and Pitts: every Boolean function can be implemented
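
A quick self-contained check (illustrative code, not from the slides) that these weight settings implement AND, OR, and NOT, using a threshold unit with bias input $a_0 = -1$:

```python
def step(x):
    return 1 if x >= 0 else 0

def threshold_unit(w0, weights, inputs):
    """Threshold unit with bias input a_0 = -1 and bias weight w0."""
    return step(-w0 + sum(w * a for w, a in zip(weights, inputs)))

# Weight settings from the slide
AND = lambda x1, x2: threshold_unit(1.5, [1, 1], [x1, x2])
OR = lambda x1, x2: threshold_unit(0.5, [1, 1], [x1, x2])
NOT = lambda x1: threshold_unit(-0.5, [-1], [x1])

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", AND(x1, x2), "OR:", OR(x1, x2))
print("NOT:", NOT(0), NOT(1))
```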

Network structures
Feed-forward networks:
- single-layer perceptrons
- multi-layer perceptrons
Feed-forward networks implement functions, have no internal state
Recurrent networks:
- Hopfield networks have symmetric weights ($W_{i,j} = W_{j,i}$), $g(x) = \mathrm{sign}(x)$, $a_i = \pm 1$; holographic associative memory
- Boltzmann machines use stochastic activation functions, $\approx$ MCMC in Bayes nets
- recurrent neural nets have directed cycles with delays $\Rightarrow$ have internal state (like flip-flops), can oscillate etc.

Feed-forward example
[Figure: network with input units 1 and 2, hidden units 3 and 4, output unit 5, and weights $W_{1,3}$, $W_{1,4}$, $W_{2,3}$, $W_{2,4}$, $W_{3,5}$, $W_{4,5}$]
Feed-forward network = a parameterized family of nonlinear functions:
$a_5 = g(W_{3,5}\, a_3 + W_{4,5}\, a_4) = g\big(W_{3,5}\, g(W_{1,3}\, a_1 + W_{2,3}\, a_2) + W_{4,5}\, g(W_{1,4}\, a_1 + W_{2,4}\, a_2)\big)$
Adjusting weights changes the function: do learning this way!
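
A small sketch of this computation in Python; the choice of sigmoid for $g$ and the weight values are illustrative assumptions, since the slide leaves both unspecified:

```python
import math

def g(x):
    """Activation function; a sigmoid is assumed here."""
    return 1.0 / (1.0 + math.exp(-x))

def feed_forward(a1, a2, W):
    """Compute a_5 for the 2-input, 2-hidden-unit, 1-output network on the slide.

    W maps (from_node, to_node) pairs to weights."""
    a3 = g(W[(1, 3)] * a1 + W[(2, 3)] * a2)
    a4 = g(W[(1, 4)] * a1 + W[(2, 4)] * a2)
    return g(W[(3, 5)] * a3 + W[(4, 5)] * a4)

# Illustrative weights (not from the slides)
W = {(1, 3): 0.5, (2, 3): -0.3, (1, 4): 0.8, (2, 4): 0.1, (3, 5): 1.0, (4, 5): -0.7}
print(feed_forward(1.0, 0.0, W))
```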

Single-layer perceptrons
[Figure: left, input units wired directly to output units by weights $W_{j,i}$; right, the output of a single perceptron plotted over $(x_1, x_2)$, forming a soft cliff]
Output units all operate separately; no shared weights
Adjusting weights moves the location, orientation, and steepness of cliff

Expressiveness of perceptrons
Consider a perceptron with $g$ = step function (Rosenblatt, 1957, 1960)
Can represent AND, OR, NOT, majority, etc., but not XOR
Represents a linear separator in input space:
$\textstyle\sum_j W_j x_j > 0$ or $\mathbf{W} \cdot \mathbf{x} > 0$
[Figure: (a) $x_1$ and $x_2$, (b) $x_1$ or $x_2$: linearly separable; (c) $x_1$ xor $x_2$: not linearly separable]
Minsky & Papert (1969) pricked the neural network balloon

Perceptron learning
Learn by adjusting weights to reduce error on training set
The squared error for an example with input $\mathbf{x}$ and true output $y$ is
$E = \frac{1}{2}\, Err^2 \equiv \frac{1}{2}\,\big(y - h_{\mathbf{W}}(\mathbf{x})\big)^2$
Perform optimization search by gradient descent:
$\dfrac{\partial E}{\partial W_j} = Err \cdot \dfrac{\partial Err}{\partial W_j} = Err \cdot \dfrac{\partial}{\partial W_j}\Big(y - g\big(\textstyle\sum_{j=0}^{n} W_j x_j\big)\Big) = -Err \cdot g'(in) \cdot x_j$
Simple weight update rule:
$W_j \leftarrow W_j + \alpha \cdot Err \cdot g'(in) \cdot x_j$
E.g., +ve error $\Rightarrow$ increase network output $\Rightarrow$ increase weights on +ve inputs, decrease on -ve inputs
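
A minimal sketch of this learning rule in Python, assuming a sigmoid unit (so that $g'$ is defined) and an illustrative training set; the helper names and data are assumptions, not from the slides:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def train_perceptron(examples, n_inputs, alpha=0.5, epochs=1000):
    """Gradient-descent perceptron rule: W_j <- W_j + alpha * Err * g'(in) * x_j.

    Each example is (x, y), where x includes a leading -1 bias input."""
    W = [random.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
    for _ in range(epochs):
        for x, y in examples:
            in_ = sum(w * xi for w, xi in zip(W, x))
            err = y - sigmoid(in_)
            for j in range(len(W)):
                W[j] += alpha * err * sigmoid_deriv(in_) * x[j]
    return W

# Illustrative, linearly separable data: the OR function (bias input -1 prepended)
data = [([-1, 0, 0], 0), ([-1, 0, 1], 1), ([-1, 1, 0], 1), ([-1, 1, 1], 1)]
W = train_perceptron(data, n_inputs=2)
for x, y in data:
    print(x[1:], y, round(sigmoid(sum(w * xi for w, xi in zip(W, x))), 2))
```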

Perceptron learning contd.
Perceptron learning rule converges to a consistent function for any linearly separable data set
[Figure: two learning curves, proportion correct on test set vs. training set size, comparing the perceptron with decision-tree learning; left, MAJORITY on 11 inputs; right, RESTAURANT data]
Perceptron learns majority function easily, DTL is hopeless
DTL learns restaurant function easily, perceptron cannot represent it

Multilayer perceptrons
Layers are usually fully connected; numbers of hidden units typically chosen by hand
[Figure: output units $a_i$, connected by weights $W_{j,i}$ to hidden units $a_j$, connected by weights $W_{k,j}$ to input units $a_k$]

Expressiveness of MLPs
All continuous functions w/ 2 layers, all functions w/ 3 layers
[Figure: two surface plots of $h_{\mathbf{W}}(x_1, x_2)$: a ridge and a bump]
Combine two opposite-facing threshold functions to make a ridge
Combine two perpendicular ridges to make a bump
Add bumps of various sizes and locations to fit any surface
Proof requires exponentially many hidden units (cf DTL proof)
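
The ridge-and-bump construction can be sketched directly with a few sigmoid units; the particular weights and the final squashing are illustrative assumptions, not values from the slides:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ridge(u):
    """Two opposite-facing soft thresholds in u, summed: high only for |u| < 1."""
    return sigmoid(5 * (u + 1)) + sigmoid(-5 * (u - 1)) - 1.0

def bump(x1, x2):
    """Combine two perpendicular ridges and squash: high only near the origin."""
    return sigmoid(5 * (ridge(x1) + ridge(x2) - 1.5))

# Sample the bump on a coarse grid: large near (0, 0), small elsewhere
for x1 in (-3, 0, 3):
    print([round(bump(x1, x2), 2) for x2 in (-3, 0, 3)])
```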

Back-propagation learning
Output layer: same as for single-layer perceptron,
$W_{j,i} \leftarrow W_{j,i} + \alpha \cdot a_j \cdot \Delta_i$ where $\Delta_i = Err_i \cdot g'(in_i)$
Hidden layer: back-propagate the error from the output layer:
$\Delta_j = g'(in_j) \textstyle\sum_i W_{j,i}\, \Delta_i$
Update rule for weights in hidden layer:
$W_{k,j} \leftarrow W_{k,j} + \alpha \cdot a_k \cdot \Delta_j$
(Most neuroscientists deny that back-propagation occurs in the brain)
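
A minimal Python sketch of these update rules for a one-hidden-layer network of sigmoid units (for which $g'(in) = a\,(1-a)$); the network shape, the weight initialization, and the single training example are illustrative assumptions:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_update(x, y, W_in, W_out, alpha=0.5):
    """One per-example update of a 1-hidden-layer net using the slide's rules.

    x: inputs with a leading -1 bias input; y: list of target outputs.
    W_in[k][j]: weight from input k to hidden unit j.
    W_out[j][i]: weight from hidden unit j to output unit i.
    Returns the outputs computed on the forward pass (before the update)."""
    # Forward pass
    in_hid = [sum(x[k] * W_in[k][j] for k in range(len(x)))
              for j in range(len(W_in[0]))]
    a_hid = [sigmoid(v) for v in in_hid]
    in_out = [sum(a_hid[j] * W_out[j][i] for j in range(len(a_hid)))
              for i in range(len(W_out[0]))]
    a_out = [sigmoid(v) for v in in_out]

    # Delta_i = Err_i * g'(in_i) for output units; sigmoid' = a * (1 - a)
    delta_out = [(y[i] - a_out[i]) * a_out[i] * (1 - a_out[i])
                 for i in range(len(a_out))]
    # Delta_j = g'(in_j) * sum_i W_{j,i} * Delta_i for hidden units
    delta_hid = [a_hid[j] * (1 - a_hid[j]) *
                 sum(W_out[j][i] * delta_out[i] for i in range(len(delta_out)))
                 for j in range(len(a_hid))]

    # W_{j,i} += alpha * a_j * Delta_i ; W_{k,j} += alpha * a_k * Delta_j
    for j in range(len(a_hid)):
        for i in range(len(delta_out)):
            W_out[j][i] += alpha * a_hid[j] * delta_out[i]
    for k in range(len(x)):
        for j in range(len(delta_hid)):
            W_in[k][j] += alpha * x[k] * delta_hid[j]
    return a_out

# Illustrative usage: repeatedly fit one example and watch the error shrink
random.seed(0)
W_in = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(3)]
W_out = [[random.uniform(-0.5, 0.5)] for _ in range(2)]
x, y = [-1.0, 1.0, 0.0], [0.8]
for step in range(5):
    out = backprop_update(x, y, W_in, W_out)
    print(step, round(0.5 * (y[0] - out[0]) ** 2, 4))
```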

Back-propagation derivation
The squared error on a single example is defined as
$E = \frac{1}{2} \sum_i (y_i - a_i)^2$,
where the sum is over the nodes in the output layer.
$\dfrac{\partial E}{\partial W_{j,i}} = -(y_i - a_i)\,\dfrac{\partial a_i}{\partial W_{j,i}} = -(y_i - a_i)\,\dfrac{\partial g(in_i)}{\partial W_{j,i}}$
$= -(y_i - a_i)\, g'(in_i)\,\dfrac{\partial in_i}{\partial W_{j,i}} = -(y_i - a_i)\, g'(in_i)\,\dfrac{\partial}{\partial W_{j,i}}\Big(\textstyle\sum_j W_{j,i}\, a_j\Big)$
$= -(y_i - a_i)\, g'(in_i)\, a_j = -a_j\, \Delta_i$

Back-propagation derivation contd.
$\dfrac{\partial E}{\partial W_{k,j}} = -\sum_i (y_i - a_i)\,\dfrac{\partial a_i}{\partial W_{k,j}} = -\sum_i (y_i - a_i)\,\dfrac{\partial g(in_i)}{\partial W_{k,j}}$
$= -\sum_i (y_i - a_i)\, g'(in_i)\,\dfrac{\partial in_i}{\partial W_{k,j}} = -\sum_i \Delta_i\,\dfrac{\partial}{\partial W_{k,j}}\Big(\textstyle\sum_j W_{j,i}\, a_j\Big)$
$= -\sum_i \Delta_i\, W_{j,i}\,\dfrac{\partial a_j}{\partial W_{k,j}} = -\sum_i \Delta_i\, W_{j,i}\,\dfrac{\partial g(in_j)}{\partial W_{k,j}}$
$= -\sum_i \Delta_i\, W_{j,i}\, g'(in_j)\,\dfrac{\partial in_j}{\partial W_{k,j}} = -\sum_i \Delta_i\, W_{j,i}\, g'(in_j)\,\dfrac{\partial}{\partial W_{k,j}}\Big(\textstyle\sum_k W_{k,j}\, a_k\Big)$
$= -\sum_i \Delta_i\, W_{j,i}\, g'(in_j)\, a_k = -a_k\, \Delta_j$

Back-propagation learning contd.
At each epoch, sum gradient updates for all examples and apply
Training curve for 100 restaurant examples: finds exact fit
[Figure: total error on training set vs. number of epochs, falling to zero within roughly 400 epochs]
Typical problems: slow convergence, local minima
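
A sketch of this epoch-level (batch) scheme; the `gradient(W, x, y)` callback, which is assumed to return the per-weight update terms (for example the $a_j \Delta_i$ terms of back-propagation), is hypothetical:

```python
def train_epochs(examples, W, gradient, alpha=0.1, epochs=400):
    """Per epoch: sum the gradient updates over all examples, then apply once."""
    for _ in range(epochs):
        total = [0.0] * len(W)
        for x, y in examples:
            for j, g_j in enumerate(gradient(W, x, y)):
                total[j] += g_j
        for j in range(len(W)):
            W[j] += alpha * total[j]
    return W

# Tiny illustration: batch-fit a single linear weight to examples with y = 2*x
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
grad = lambda W, x, y: [(y - W[0] * x) * x]  # update term for squared error, h(x) = W0*x
print(train_epochs(examples, [0.0], grad, alpha=0.01, epochs=200))  # approx [2.0]
```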

Back-propagation learning contd.
Learning curve for MLP with 4 hidden units:
[Figure: proportion correct on test set vs. training set size on the RESTAURANT data, comparing the decision tree and the multilayer network]
MLPs are quite good for complex pattern recognition tasks, but resulting hypotheses cannot be understood easily

Handwritten digit recognition
3-nearest-neighbor = 2.4% error
400-300-10 unit MLP = 1.6% error
LeNet: 768-192-30-10 unit MLP = 0.9% error
Current best (kernel machines, vision algorithms) $\approx$ 0.6% error

Summary
Most brains have lots of neurons; each neuron $\approx$ linear threshold unit (?)
Perceptrons (one-layer networks) insufficiently expressive
Multi-layer networks are sufficiently expressive; can be trained by gradient descent, i.e., error back-propagation
Many applications: speech, driving, handwriting, fraud detection, etc.
Engineering, cognitive modelling, and neural system modelling subfields have largely diverged