Neural networks. Chapter 20, Section 5


1 Neural networks (Chapter 20, Section 5)

2 Outline
Brains
Neural networks
Perceptrons
Multilayer perceptrons
Applications of neural networks

3 Brains
10^11 neurons of > 20 types, 10^14 synapses, 1ms-10ms cycle time
Signals are noisy "spike trains" of electrical potential
[Figure: neuron anatomy, showing axonal arborization, synapse, axon from another cell, dendrite, axon, nucleus, synapses, cell body or soma]

4 McCulloch-Pitts unit
Output is a "squashed" linear function of the inputs:
    a_i ← g(in_i) = g( Σ_j W_{j,i} a_j )
[Figure: unit diagram, with bias input a_0 = −1 and bias weight W_{0,i}; input links a_j with weights W_{j,i}; input function Σ giving in_i; activation function g; output a_i; output links]
A gross oversimplification of real neurons, but its purpose is to develop understanding of what networks of simple units can do

5 Activation functions
[Figure: two plots of g(in_i) against in_i, (a) and (b)]
(a) is a step function or threshold function
(b) is a sigmoid function 1/(1 + e^{−x})
Changing the bias weight W_{0,i} moves the threshold location
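The following is a minimal Python sketch (not from the slides) of a single McCulloch-Pitts unit with the two activation functions above; the names step, sigmoid, and unit_output are illustrative, and the bias is handled with the slides' convention of a fixed input a_0 = −1.

    import math

    def step(in_i):
        # Threshold activation: output 1 when the weighted input is positive
        return 1 if in_i > 0 else 0

    def sigmoid(in_i):
        # Sigmoid activation: 1 / (1 + e^-x)
        return 1.0 / (1.0 + math.exp(-in_i))

    def unit_output(weights, inputs, g=sigmoid):
        # McCulloch-Pitts unit: a_i = g(sum_j W_j,i * a_j)
        # inputs[0] is the fixed bias input a_0 = -1, so weights[0] is the bias weight W_0,i
        in_i = sum(w * a for w, a in zip(weights, inputs))
        return g(in_i)

    # Bias weight 0.5, two input weights of 1, inputs (1, 0)
    print(unit_output([0.5, 1.0, 1.0], [-1, 1, 0], g=step))     # 1
    print(unit_output([0.5, 1.0, 1.0], [-1, 1, 0], g=sigmoid))  # approx. 0.62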

6 Implementing logical functions
AND: W_0 = 1.5, W_1 = W_2 = 1
OR: W_0 = 0.5, W_1 = W_2 = 1
NOT: W_0 = −0.5, W_1 = −1
McCulloch and Pitts: every Boolean function can be implemented
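As an illustration (not part of the slides), this short Python sketch checks the three weight settings above with step units, again using the bias convention a_0 = −1; the helper names are hypothetical.

    def threshold_unit(w0, weights, inputs):
        # Step unit with bias input a_0 = -1, so the bias contributes -w0 to in_i
        in_i = -w0 + sum(w * x for w, x in zip(weights, inputs))
        return 1 if in_i > 0 else 0

    def AND(x1, x2): return threshold_unit(1.5, [1, 1], [x1, x2])
    def OR(x1, x2):  return threshold_unit(0.5, [1, 1], [x1, x2])
    def NOT(x1):     return threshold_unit(-0.5, [-1], [x1])

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, AND(x1, x2), OR(x1, x2))
    print(NOT(0), NOT(1))  # 1 0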

7 Network structures
Feed-forward networks: single-layer perceptrons, multi-layer perceptrons
Feed-forward networks implement functions, have no internal state
Recurrent networks:
Hopfield networks have symmetric weights (W_{i,j} = W_{j,i}); g(x) = sign(x), a_i = ±1; holographic associative memory
Boltzmann machines use stochastic activation functions, ≈ MCMC in Bayes nets
Recurrent neural nets have directed cycles with delays ⇒ have internal state (like flip-flops), can oscillate etc.

8 Feed-forward example
[Figure: network with input units 1 and 2, hidden units 3 and 4, output unit 5, and weights W_{1,3}, W_{1,4}, W_{2,3}, W_{2,4}, W_{3,5}, W_{4,5}]
Feed-forward network = a parameterized family of nonlinear functions:
    a_5 = g(W_{3,5} a_3 + W_{4,5} a_4)
        = g(W_{3,5} g(W_{1,3} a_1 + W_{2,3} a_2) + W_{4,5} g(W_{1,4} a_1 + W_{2,4} a_2))
Adjusting weights changes the function: do learning this way!
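A minimal sketch (not from the slides) of evaluating this 2-2-1 network in Python, with sigmoid g and arbitrary example weights; bias weights are omitted for brevity.

    import math

    def g(x):
        # Sigmoid activation
        return 1.0 / (1.0 + math.exp(-x))

    def forward(a1, a2, W):
        # Evaluate the network from the slide: inputs 1, 2; hidden units 3, 4; output unit 5
        a3 = g(W[(1, 3)] * a1 + W[(2, 3)] * a2)
        a4 = g(W[(1, 4)] * a1 + W[(2, 4)] * a2)
        a5 = g(W[(3, 5)] * a3 + W[(4, 5)] * a4)
        return a5

    W = {(1, 3): 0.5, (2, 3): -0.4, (1, 4): 0.3, (2, 4): 0.8, (3, 5): 1.2, (4, 5): -0.7}
    print(forward(1.0, 0.0, W))  # adjusting the entries of W changes the function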

9 Single-layer perceptrons
[Figure: input units connected directly to output units by weights W_{j,i}; plot of the perceptron output as a function of x_1 and x_2]
Output units all operate separately: no shared weights
Adjusting weights moves the location, orientation, and steepness of cliff

10 Expressiveness of perceptrons
Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)
Can represent AND, OR, NOT, majority, etc., but not XOR
Represents a linear separator in input space:
    Σ_j W_j x_j > 0  or  W · x > 0
[Figure: points in the x_1, x_2 plane for (a) x_1 and x_2, (b) x_1 or x_2, (c) x_1 xor x_2; no single line separates the positive and negative points of XOR]
Minsky & Papert (1969) pricked the neural network balloon

11 Perceptron learning
Learn by adjusting weights to reduce error on training set
The squared error for an example with input x and true output y is
    E = (1/2) Err^2 ≡ (1/2) (y − h_W(x))^2
Perform optimization search by gradient descent:
    ∂E/∂W_j = Err × ∂Err/∂W_j = Err × ∂/∂W_j ( y − g(Σ_{j=0}^{n} W_j x_j) )
            = −Err × g′(in) × x_j
Simple weight update rule:
    W_j ← W_j + α × Err × g′(in) × x_j
E.g., +ve error ⇒ increase network output ⇒ increase weights on +ve inputs, decrease on −ve inputs
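A minimal Python sketch (not from the slides) of the weight update rule above with a sigmoid unit, trained here on OR as a hypothetical example; the bias input −1 is prepended to each example, and alpha and the epoch count are arbitrary.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def sigmoid_deriv(x):
        s = sigmoid(x)
        return s * (1.0 - s)

    def perceptron_update(W, x, y, alpha):
        # W_j <- W_j + alpha * Err * g'(in) * x_j
        in_ = sum(wj * xj for wj, xj in zip(W, x))
        err = y - sigmoid(in_)
        return [wj + alpha * err * sigmoid_deriv(in_) * xj for wj, xj in zip(W, x)]

    # Linearly separable data (OR), each example prefixed with the bias input -1
    data = [([-1, 0, 0], 0), ([-1, 0, 1], 1), ([-1, 1, 0], 1), ([-1, 1, 1], 1)]
    W = [0.0, 0.0, 0.0]
    for epoch in range(5000):
        for x, y in data:
            W = perceptron_update(W, x, y, alpha=0.5)
    # Outputs should approach 0, 1, 1, 1
    print([round(sigmoid(sum(wj * xj for wj, xj in zip(W, x))), 2) for x, _ in data])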

12 Perceptron learning contd.
Perceptron learning rule converges to a consistent function for any linearly separable data set
[Figure: two learning curves of proportion correct on test set vs. training set size, perceptron vs. decision tree, on MAJORITY on 11 inputs and on RESTAURANT data]
Perceptron learns majority function easily, DTL is hopeless
DTL learns restaurant function easily, perceptron cannot represent it

13 Multilayer perceptrons
Layers are usually fully connected; numbers of hidden units typically chosen by hand
[Figure: output units a_i, weights W_{j,i}, hidden units a_j, weights W_{k,j}, input units a_k]

14 Expressiveness of MLPs
All continuous functions w/ 2 layers, all functions w/ 3 layers
[Figure: two plots of h_W(x_1, x_2): a ridge and a bump]
Combine two opposite-facing threshold functions to make a ridge
Combine two perpendicular ridges to make a bump
Add bumps of various sizes and locations to fit any surface
Proof requires exponentially many hidden units (cf DTL proof)
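The ridge-and-bump construction can be made concrete with a short sketch (not from the slides); ridge and bump are hypothetical helper names, and the steepness k and the thresholds are arbitrary.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def ridge(x, lo=-1.0, hi=1.0, k=10.0):
        # Two opposite-facing soft thresholds: close to 1 for lo < x < hi, close to 0 outside
        return sigmoid(k * (x - lo)) + sigmoid(-k * (x - hi)) - 1.0

    def bump(x1, x2, k=10.0):
        # Two perpendicular ridges combined and squashed: a bump localized around the origin
        return sigmoid(k * (ridge(x1) + ridge(x2) - 1.5))

    print(round(bump(0.0, 0.0), 3))  # close to 1: inside both ridges
    print(round(bump(3.0, 0.0), 3))  # close to 0: outside the x1 ridge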

15 Back-propagation learning
Output layer: same as for single-layer perceptron,
    W_{j,i} ← W_{j,i} + α × a_j × Δ_i
where Δ_i = Err_i × g′(in_i)
Hidden layer: back-propagate the error from the output layer:
    Δ_j = g′(in_j) Σ_i W_{j,i} Δ_i
Update rule for weights in hidden layer:
    W_{k,j} ← W_{k,j} + α × a_k × Δ_j
(Most neuroscientists deny that back-propagation occurs in the brain)
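A sketch (not from the slides) of one back-propagation update for a 2-2-1 network, applying the delta rules above; it assumes sigmoid g, omits bias weights, and uses hypothetical names.

    import math

    def g(x):
        return 1.0 / (1.0 + math.exp(-x))

    def g_prime(x):
        s = g(x)
        return s * (1.0 - s)

    def backprop_step(Wkj, Wji, x, y, alpha=0.5):
        # Forward pass through the hidden layer (units j) and the single output unit (i)
        in_h = [sum(Wkj[j][k] * x[k] for k in range(2)) for j in range(2)]
        a_h = [g(v) for v in in_h]
        in_o = sum(Wji[j] * a_h[j] for j in range(2))
        # Output delta: Delta_i = Err_i * g'(in_i)
        delta_o = (y - g(in_o)) * g_prime(in_o)
        # Hidden deltas: Delta_j = g'(in_j) * sum_i W_j,i * Delta_i
        delta_h = [g_prime(in_h[j]) * Wji[j] * delta_o for j in range(2)]
        # Updates: W_j,i += alpha * a_j * Delta_i and W_k,j += alpha * a_k * Delta_j
        Wji = [Wji[j] + alpha * a_h[j] * delta_o for j in range(2)]
        Wkj = [[Wkj[j][k] + alpha * x[k] * delta_h[j] for k in range(2)] for j in range(2)]
        return Wkj, Wji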

16 Back-propagation derivation
The squared error on a single example is defined as
    E = (1/2) Σ_i (y_i − a_i)^2,
where the sum is over the nodes in the output layer.
    ∂E/∂W_{j,i} = −(y_i − a_i) ∂a_i/∂W_{j,i} = −(y_i − a_i) ∂g(in_i)/∂W_{j,i}
                = −(y_i − a_i) g′(in_i) ∂in_i/∂W_{j,i}
                = −(y_i − a_i) g′(in_i) ∂( Σ_j W_{j,i} a_j )/∂W_{j,i}
                = −(y_i − a_i) g′(in_i) a_j = −a_j Δ_i

17 Back-propagation derivation contd.
    ∂E/∂W_{k,j} = −Σ_i (y_i − a_i) ∂a_i/∂W_{k,j} = −Σ_i (y_i − a_i) ∂g(in_i)/∂W_{k,j}
                = −Σ_i (y_i − a_i) g′(in_i) ∂in_i/∂W_{k,j} = −Σ_i Δ_i ∂( Σ_j W_{j,i} a_j )/∂W_{k,j}
                = −Σ_i Δ_i W_{j,i} ∂a_j/∂W_{k,j} = −Σ_i Δ_i W_{j,i} ∂g(in_j)/∂W_{k,j}
                = −Σ_i Δ_i W_{j,i} g′(in_j) ∂in_j/∂W_{k,j}
                = −Σ_i Δ_i W_{j,i} g′(in_j) ∂( Σ_k W_{k,j} a_k )/∂W_{k,j}
                = −Σ_i Δ_i W_{j,i} g′(in_j) a_k = −a_k Δ_j
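As a sanity check on the result ∂E/∂W_{k,j} = −a_k Δ_j, here is a short sketch (not from the slides) comparing the analytic gradient with a finite-difference estimate on a small 2-2-1 network; the weights, inputs, and names are arbitrary illustrations.

    import math

    def g(x):
        return 1.0 / (1.0 + math.exp(-x))

    def g_prime(x):
        s = g(x)
        return s * (1.0 - s)

    def error(Wkj, Wji, x, y):
        # E = 1/2 * (y - a_output)^2 for a 2-2-1 network of sigmoid units
        in_h = [sum(Wkj[j][k] * x[k] for k in range(2)) for j in range(2)]
        a_h = [g(v) for v in in_h]
        a_o = g(sum(Wji[j] * a_h[j] for j in range(2)))
        return 0.5 * (y - a_o) ** 2

    Wkj = [[0.4, -0.2], [0.7, 0.1]]
    Wji = [0.6, -0.3]
    x, y = [1.0, 0.5], 1.0

    # Analytic gradient for the weight from input k=0 to hidden unit j=0: dE/dW_k,j = -a_k * Delta_j
    in_h = [sum(Wkj[j][k] * x[k] for k in range(2)) for j in range(2)]
    a_h = [g(v) for v in in_h]
    in_o = sum(Wji[j] * a_h[j] for j in range(2))
    delta_o = (y - g(in_o)) * g_prime(in_o)        # Delta_i
    delta_0 = g_prime(in_h[0]) * Wji[0] * delta_o  # Delta_j for hidden unit 0
    analytic = -x[0] * delta_0

    # Numerical gradient by central finite differences on the same weight
    eps = 1e-6
    W_plus = [[Wkj[0][0] + eps, Wkj[0][1]], Wkj[1]]
    W_minus = [[Wkj[0][0] - eps, Wkj[0][1]], Wkj[1]]
    numerical = (error(W_plus, Wji, x, y) - error(W_minus, Wji, x, y)) / (2 * eps)

    print(analytic, numerical)  # the two values should agree to several decimal places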

18 Back-propagation learning contd.
At each epoch, sum gradient updates for all examples and apply
Training curve for 100 restaurant examples: finds exact fit
[Figure: total error on training set vs. number of epochs]
Typical problems: slow convergence, local minima
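Schematically (not from the slides), summing gradient updates over an epoch before applying them looks like the following; gradient is a hypothetical function returning ∂E/∂W for one example.

    def train(examples, weights, gradient, alpha=0.1, epochs=1000):
        # At each epoch, sum the gradient contributions of all examples, then apply one combined update
        for _ in range(epochs):
            total = [0.0] * len(weights)
            for x, y in examples:
                for j, g_j in enumerate(gradient(weights, x, y)):
                    total[j] += g_j
            weights = [w - alpha * t for w, t in zip(weights, total)]
        return weights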

19 Back-propagation learning contd.
Learning curve for MLP with 4 hidden units:
[Figure: proportion correct on test set vs. training set size, decision tree vs. multilayer network, RESTAURANT data]
MLPs are quite good for complex pattern recognition tasks, but resulting hypotheses cannot be understood easily

20 Handwritten digit recognition
3-nearest-neighbor = 2.4% error
400-300-10 unit MLP = 1.6% error
LeNet: 768-192-30-10 unit MLP = 0.9% error
Current best (kernel machines, vision algorithms) ≈ 0.6% error

21 Summary
Most brains have lots of neurons; each neuron ≈ linear threshold unit (?)
Perceptrons (one-layer networks) insufficiently expressive
Multi-layer networks are sufficiently expressive; can be trained by gradient descent, i.e., error back-propagation
Many applications: speech, driving, handwriting, fraud detection, etc.
Engineering, cognitive modelling, and neural system modelling subfields have largely diverged