Last update: October 26, 2017
Neural networks
CMSC 421: Section 18.7
Dana Nau
Outline
- Applications of neural networks
- Brains
- Neural network units
- Perceptrons
- Multilayer perceptrons
Example Applications
- pattern classification: virus/explosive detection, financial predictions, etc.
- image processing: character recognition, manufacturing inspection, etc.
- control: autonomous vehicles, industrial processes, etc.
- optimization: VLSI layout, scheduling, etc.
- bionics: prostheses, brain implants, etc.
- brain/cognitive models: memory, learning, disorders, etc.
Nature-Inspired Computation
- natural system (biology, physics, etc.) → formal models, theories (interdisciplinary computer science) → applications (engineering)
- Examples: neural networks, genetic programming, swarm intelligence, self-replicating machines, ...
- Inspiration for neural networks?
Brains
- ~10^11 neurons of > 20 types, ~10^14 synapses, 1 ms–10 ms cycle time
- Signals are noisy "spike trains" of electrical potential
[Figure: neuron anatomy — cell body (soma), nucleus, dendrites, axon, axonal arborization, synapses to other cells]
Brains as inspiration for neural networks
Computers vs. Brains
                     Computers        Brains
information access:  global           local
control:             centralized      decentralized
processing method:   sequential       massively parallel
how programmed:      by a programmer  self-organizing
adaptability:        minimal          prominent
McCulloch–Pitts unit
- A simple mathematical model inspired by neurons — much simpler than what neurons actually do
- Output of unit i depends on a weighted sum of its inputs:
  a_i = g(in_i) = g(Σ_j W_{j,i} a_j)
- A fixed bias input a_0 = −1 with bias weight W_{0,i} is included in the sum
[Figure: unit diagram — input links a_j with weights W_{j,i}, input function Σ producing in_i, activation function g, output a_i, output links]
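As a sketch (not from the slides — the function names are illustrative), the unit's input function and activation can be written directly:

```python
def unit_output(weights, inputs, g):
    """McCulloch-Pitts unit: apply activation g to the weighted sum of inputs.

    By convention weights[0] is the bias weight W_{0,i}, paired with the
    fixed bias input a_0 = -1 at inputs[0].
    """
    in_i = sum(w * a for w, a in zip(weights, inputs))  # in_i = sum_j W_{j,i} a_j
    return g(in_i)

# A unit with bias weight 0.5 and one input weight 1.0, using a step activation:
step = lambda x: 1 if x > 0 else 0
out = unit_output([0.5, 1.0], [-1, 1], step)  # in_i = -0.5 + 1.0 = 0.5, so it fires
```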
Activation functions
- (a) step function (threshold function): g(in_i) = 1 if in_i > 0; g(in_i) = 0 otherwise
- (b) sigmoid function: g(x) = 1/(1 + e^−x)
- Changing the bias weight W_{0,i} moves the threshold location
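The two activation functions above, written out as a minimal sketch:

```python
import math

def step(x):
    # (a) threshold/step activation: 1 if the input exceeds 0, else 0
    return 1 if x > 0 else 0

def sigmoid(x):
    # (b) sigmoid activation: 1 / (1 + e^-x), a smooth, differentiable step
    return 1.0 / (1.0 + math.exp(-x))
```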
Implementing logical functions
- Use the step function as the activation function:
  AND: W_0 = 1.5, W_1 = W_2 = 1
  OR:  W_0 = 0.5, W_1 = W_2 = 1
  NOT: W_0 = −0.5, W_1 = −1
- McCulloch and Pitts: every Boolean function can be implemented as a collection of units
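The three gates above can be checked with a small sketch (with the bias input a_0 = −1, a unit fires exactly when the weighted sum of the real inputs exceeds W_0):

```python
def threshold_unit(w0, ws, xs):
    # Bias input a_0 = -1 with weight w0: fires when sum(w*x) > w0.
    in_ = -w0 + sum(w * x for w, x in zip(ws, xs))
    return 1 if in_ > 0 else 0

def AND(x1, x2): return threshold_unit(1.5, [1, 1], [x1, x2])   # W_0 = 1.5
def OR(x1, x2):  return threshold_unit(0.5, [1, 1], [x1, x2])   # W_0 = 0.5
def NOT(x):      return threshold_unit(-0.5, [-1], [x])         # W_0 = -0.5, W_1 = -1
```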
Network structures
- Feed-forward networks: single-layer perceptrons, multi-layer perceptrons; implement functions, have no internal state
- Recurrent networks:
  - Hopfield networks have symmetric weights (W_{i,j} = W_{j,i}); g(x) = sign(x), a_i = ±1; holographic associative memory
  - Boltzmann machines use stochastic activation functions
  - recurrent neural nets have directed cycles with delays; have internal state (like flip-flops), can oscillate, etc.
Feed-forward example
[Figure: input units 1, 2 feed hidden units 3, 4 (weights W_{1,3}, W_{2,3}, W_{1,4}, W_{2,4}), which feed output unit 5 (weights W_{3,5}, W_{4,5})]
- A feed-forward network is a parameterized family of nonlinear functions:
  a_5 = g(W_{3,5} a_3 + W_{4,5} a_4)
      = g(W_{3,5} g(W_{1,3} a_1 + W_{2,3} a_2) + W_{4,5} g(W_{1,4} a_1 + W_{2,4} a_2))
- Adjusting the weights changes the function — learning is done this way
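A minimal sketch of this forward pass (a sigmoid g; the dictionary of weights keyed by edge is just one illustrative encoding):

```python
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))   # sigmoid activation

def forward(a1, a2, W):
    """Forward pass through the 2-2-1 network: inputs 1,2 -> hidden 3,4 -> output 5.
    W maps an edge (j, i) to the weight W_{j,i}."""
    a3 = g(W[1, 3] * a1 + W[2, 3] * a2)
    a4 = g(W[1, 4] * a1 + W[2, 4] * a2)
    a5 = g(W[3, 5] * a3 + W[4, 5] * a4)
    return a5
```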
Perceptrons
[Figure: perceptron network (input units connected directly to output units by weights W_{j,i}) and the surface plot of one unit's output over (x_1, x_2)]
- Output units all operate separately — no shared weights
- Adjusting the weights moves the location, orientation, and steepness of the cliff
Perceptron learning
- in = Σ_j W_j x_j = W · x, and a = g(in)
- Learn by adjusting weights to reduce error on the training set
- Squared error for an example with input x and true output y:
  E = (1/2) Err² = (1/2)(y − a)²
- Optimization search by gradient descent:
  ∂E/∂W_j = Err × ∂Err/∂W_j = Err × ∂/∂W_j ( y − g(Σ_{j=0}^n W_j x_j) ) = −Err × g′(in) × x_j
- Simple weight update rule: W_j ← W_j + α × Err × g′(in) × x_j
- E.g., positive error → increase network output → increase weights on positive inputs, decrease on negative inputs
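The update rule above can be sketched as follows (a sigmoid activation so that g′(in) = a(1 − a); the function names and hyperparameters are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_perceptron(examples, n, alpha=0.5, epochs=2000):
    """Gradient-descent perceptron learning with a sigmoid activation.

    examples: list of (x, y) pairs, x a length-n tuple; a fixed bias
    input of -1 is prepended, so W[0] is the bias weight."""
    W = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, y in examples:
            xs = [-1.0] + list(x)
            a = sigmoid(sum(w * xi for w, xi in zip(W, xs)))
            err = y - a
            gprime = a * (1.0 - a)              # g'(in) for the sigmoid
            for j in range(len(W)):             # W_j <- W_j + alpha*Err*g'(in)*x_j
                W[j] += alpha * err * gprime * xs[j]
    return W

def predict(W, x):
    xs = [-1.0] + list(x)
    return sigmoid(sum(w * xi for w, xi in zip(W, xs))) > 0.5
```

For a linearly separable target such as OR, this drives the weights to a separating hyperplane.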
Perceptron learning, continued
- The perceptron learning rule converges to a consistent function for any linearly separable data set
- A perceptron easily learns the majority function; decision-tree learning doesn't — majority is representable, but only as a very large decision tree, and DTL won't learn it without a very large data set
[Figure: proportion correct on test set vs. training set size (0–100), perceptron vs. decision tree, on MAJORITY-of-11-inputs data]
Perceptron learning, continued
- The perceptron learning rule converges to a consistent function for any linearly separable data set
- Decision-tree learning easily learns the restaurant function; the perceptron can't — it is not linearly separable, so a perceptron can't represent it
[Figure: proportion correct on test set vs. training set size (0–100), perceptron vs. decision tree, on RESTAURANT data]
Expressiveness of perceptrons
- Rosenblatt developed the perceptron algorithm (1957) and a physical device, described as a key to the future of mankind
- Minsky & Papert (1969) showed what perceptrons could not represent
- Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)
- Represents a linear separator in input space: Σ_j W_j x_j > 0, i.e., W · x > 0
- Can represent AND, OR, NOT, majority, etc., but not XOR
[Figure: (a) x_1 and x_2 and (b) x_1 or x_2 are linearly separable; (c) x_1 xor x_2 is not]
- Dark ages for neural network research: 1970–1985
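A small brute-force check (my illustration, not from the slides): enumerate single threshold units over a grid of weights and collect the Boolean truth tables they produce. AND and OR appear; XOR never can, since it is not linearly separable, so no choice of weights produces it:

```python
def step(x):
    return 1 if x > 0 else 0

def truth_table(w0, w1, w2):
    # Truth table of one threshold unit (bias input -1) over (0,0),(0,1),(1,0),(1,1)
    return tuple(step(-w0 + w1 * x1 + w2 * x2)
                 for x1 in (0, 1) for x2 in (0, 1))

grid = [i * 0.5 for i in range(-6, 7)]          # weights from -3.0 to 3.0
reachable = {truth_table(w0, w1, w2)
             for w0 in grid for w1 in grid for w2 in grid}

AND, OR, XOR = (0, 0, 0, 1), (0, 1, 1, 1), (0, 1, 1, 0)
```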
Multilayer feed-forward networks
- Renaissance for neural network research: 1985–1995
- Multilayer feed-forward network: layers usually fully connected; numbers of hidden units typically chosen by hand
[Figure: output units a_i, weights W_{j,i}, hidden units a_j, weights W_{k,j}, input units a_k]
Expressiveness of MLPs
[Figure: surfaces h_W(x_1, x_2) produced by small networks — a ridge and a bump]
- With 2 layers, can express all continuous functions; with 3 layers, all functions:
  - Combine two opposite-facing threshold functions to make a ridge
  - Combine two perpendicular ridges to make a bump
  - Add bumps of various sizes and locations to fit any surface
- Proof requires exponentially many hidden units (as in the proof for decision-tree learning)
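The ridge-and-bump construction can be sketched numerically (my illustrative choice of weights, using sigmoids as soft thresholds):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ridge(x):
    # Two opposite-facing soft thresholds along one axis, summed: high near x = 0
    return sigmoid(x + 4) + sigmoid(-x + 4) - 1

def bump(x1, x2):
    # Two perpendicular ridges combined by a steep output unit: high near (0, 0)
    return sigmoid(10 * (ridge(x1) + ridge(x2) - 1.5))
```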
Back-propagation learning
- Output layer: same as for a single-layer perceptron,
  W_{j,i} ← W_{j,i} + α × a_j × Δ_i, where Δ_i = Err_i × g′(in_i)
- Hidden layer: back-propagate the error from the output layer:
  Δ_j = g′(in_j) Σ_i W_{j,i} Δ_i
- Update rule for weights in the hidden layer:
  W_{k,j} ← W_{k,j} + α × a_k × Δ_j
- (Most neuroscientists deny that back-propagation occurs in the brain)
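One full update following the Δ rules above can be sketched for a single hidden layer (sigmoid activations so g′(in) = a(1 − a); bias weights omitted for brevity, and all names are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(x, y, W_in, W_out, alpha):
    """One back-propagation update for a net with one hidden layer.
    W_in[k][j]: weight from input k to hidden j; W_out[j][i]: hidden j to output i.
    Mutates the weight lists in place; returns the (pre-update) outputs."""
    # forward pass
    in_h = [sum(W_in[k][j] * x[k] for k in range(len(x)))
            for j in range(len(W_out))]
    a_h = [sigmoid(v) for v in in_h]
    in_o = [sum(W_out[j][i] * a_h[j] for j in range(len(a_h)))
            for i in range(len(y))]
    a_o = [sigmoid(v) for v in in_o]
    # output-layer deltas: Delta_i = Err_i * g'(in_i)
    d_o = [(y[i] - a_o[i]) * a_o[i] * (1 - a_o[i]) for i in range(len(y))]
    # hidden-layer deltas: Delta_j = g'(in_j) * sum_i W_{j,i} Delta_i
    d_h = [a_h[j] * (1 - a_h[j]) * sum(W_out[j][i] * d_o[i] for i in range(len(y)))
           for j in range(len(a_h))]
    # weight updates
    for j in range(len(a_h)):
        for i in range(len(y)):
            W_out[j][i] += alpha * a_h[j] * d_o[i]    # W_{j,i} += alpha * a_j * Delta_i
    for k in range(len(x)):
        for j in range(len(a_h)):
            W_in[k][j] += alpha * x[k] * d_h[j]       # W_{k,j} += alpha * a_k * Delta_j
    return a_o
```

Repeated steps on one example should drive its output toward the target.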
Back-propagation derivation
- Like perceptron learning, back-propagation is optimization search by gradient descent
- The squared error on a single example is E = (1/2) Σ_i (y_i − a_i)², where the sum is over the nodes in the output layer
  ∂E/∂W_{j,i} = −(y_i − a_i) ∂a_i/∂W_{j,i}
              = −(y_i − a_i) ∂g(in_i)/∂W_{j,i}
              = −(y_i − a_i) g′(in_i) ∂in_i/∂W_{j,i}
              = −(y_i − a_i) g′(in_i) ∂/∂W_{j,i} ( Σ_j W_{j,i} a_j )
              = −(y_i − a_i) g′(in_i) a_j
              = −a_j Δ_i
Back-propagation derivation, continued
  ∂E/∂W_{k,j} = −Σ_i (y_i − a_i) ∂a_i/∂W_{k,j}
              = −Σ_i (y_i − a_i) ∂g(in_i)/∂W_{k,j}
              = −Σ_i (y_i − a_i) g′(in_i) ∂in_i/∂W_{k,j}
              = −Σ_i Δ_i ∂/∂W_{k,j} ( Σ_j W_{j,i} a_j )
              = −Σ_i Δ_i W_{j,i} ∂a_j/∂W_{k,j}
              = −Σ_i Δ_i W_{j,i} ∂g(in_j)/∂W_{k,j}
              = −Σ_i Δ_i W_{j,i} g′(in_j) ∂in_j/∂W_{k,j}
              = −Σ_i Δ_i W_{j,i} g′(in_j) ∂/∂W_{k,j} ( Σ_k W_{k,j} a_k )
              = −Σ_i Δ_i W_{j,i} g′(in_j) a_k
              = −a_k Δ_j
Back-propagation learning, continued
- At each epoch, sum the gradient updates for all examples and apply them
- Training curve for 100 restaurant examples: finds an exact fit
[Figure: total error on training set vs. number of epochs (0–400)]
- Typical problems: slow convergence, local minima
Back-propagation learning, continued
- Learning curve for a multi-layer perceptron with 4 hidden units:
[Figure: proportion correct on test set vs. training set size (0–100), decision tree vs. multilayer network, on RESTAURANT data]
- Multi-layer perceptrons are quite good for complex pattern recognition tasks, but the resulting hypotheses cannot be understood easily
Computational Example
[Figure: input a_0 → unit 1 (weight W_{0,1} = 1); unit 1 → unit 2 (W_{1,2} = 1) and unit 3 (W_{1,3} = 2); targets y_2, y_3]
- a_i = output of unit i; W_{i,j} = weight of the link from unit i to unit j
- For output units, Δ_j = g′(in_j)(y_j − a_j); for hidden units, Δ_j = g′(in_j) Σ_i W_{j,i} Δ_i
- Suppose g(in_i) = in_i for every unit i; then g′(in_i) = 1 (note: a bad idea — one wouldn't use such a g in practice!)
- Problem 1: For what values of Δ_2 and Δ_3 will Δ_1 = 0?
  0 = Δ_1 = g′(in_1)(W_{1,2} Δ_2 + W_{1,3} Δ_3) = W_{1,2} Δ_2 + W_{1,3} Δ_3 = Δ_2 + 2Δ_3,
  i.e., Δ_2 = −2Δ_3
Computational Example, continued
- Problem 2: α = 0.1; just one training example: a_0 = 1, y_2 = 1, y_3 = 1
- First epoch (starting from W_{0,1} = 1, W_{1,2} = 1, W_{1,3} = 2):
  a_1 = g(in_1) = W_{0,1} a_0 = 1 × 1 = 1
  a_2 = g(in_2) = W_{1,2} a_1 = 1 × 1 = 1
  a_3 = g(in_3) = W_{1,3} a_1 = 2 × 1 = 2
  Δ_2 = y_2 − a_2 = 1 − 1 = 0
  Δ_3 = y_3 − a_3 = 1 − 2 = −1
  Δ_1 = W_{1,2} Δ_2 + W_{1,3} Δ_3 = 0 + 2 × (−1) = −2
  W_{1,2} ← W_{1,2} + α a_1 Δ_2 = 1 + 0.1 × 1 × 0 = 1
  W_{1,3} ← W_{1,3} + α a_1 Δ_3 = 2 + 0.1 × 1 × (−1) = 1.9
  W_{0,1} ← W_{0,1} + α a_0 Δ_1 = 1 + 0.1 × 1 × (−2) = 0.8
Computational Example, continued
- Second epoch (W_{0,1} = 0.8, W_{1,2} = 1, W_{1,3} = 1.9):
  a_1 = g(in_1) = W_{0,1} a_0 = 0.8 × 1 = 0.8
  a_2 = g(in_2) = W_{1,2} a_1 = 1 × 0.8 = 0.8
  a_3 = g(in_3) = W_{1,3} a_1 = 1.9 × 0.8 = 1.52
  Δ_2 = y_2 − a_2 = 1 − 0.8 = 0.2
  Δ_3 = y_3 − a_3 = 1 − 1.52 = −0.52
  Δ_1 = W_{1,2} Δ_2 + W_{1,3} Δ_3 = 1 × 0.2 + 1.9 × (−0.52) = 0.2 − 0.988 = −0.788
  W_{1,2} ← W_{1,2} + α a_1 Δ_2 = 1 + 0.1 × 0.8 × 0.2 = 1.016
  W_{1,3} ← W_{1,3} + α a_1 Δ_3 = 1.9 + 0.1 × 0.8 × (−0.52) = 1.8584
  W_{0,1} ← W_{0,1} + α a_0 Δ_1 = 0.8 + 0.1 × 1 × (−0.788) = 0.7212
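The two epochs above can be reproduced with a short sketch of this linear 4-unit network (g(x) = x, so g′(x) = 1; the function name is illustrative):

```python
def run_epoch(W01, W12, W13, a0, y2, y3, alpha):
    """One epoch of back-propagation on the linear network
    a0 -> unit 1 -> units 2, 3, with g(x) = x so g'(x) = 1."""
    a1 = W01 * a0                   # forward pass
    a2 = W12 * a1
    a3 = W13 * a1
    d2 = y2 - a2                    # output deltas (g' = 1)
    d3 = y3 - a3
    d1 = W12 * d2 + W13 * d3        # hidden delta, using pre-update weights
    W12 += alpha * a1 * d2          # weight updates
    W13 += alpha * a1 * d3
    W01 += alpha * a0 * d1
    return W01, W12, W13
```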
Summary
- Most brains have lots of neurons; each neuron ≈ a linear threshold unit (?)
- Perceptrons (one-layer networks) are insufficiently expressive
- Multi-layer networks are sufficiently expressive; can be trained by gradient descent, i.e., error back-propagation
- Many applications: speech, driving, handwriting, fraud detection, etc.
- The engineering, cognitive modeling, and neural system modeling subfields have largely diverged