Neural Networks: Basics
Darrell Whitley
Colorado State University
In the Beginning: The Perceptron

[Figure: a perceptron with inputs $X_1$ and $X_2$ and weights $W_{1,1}$, $W_{1,2}$, $W_{2,1}$, $W_{2,2}$; weights are indexed $W_{source,destination}$.]
The Perceptron Learning Rule

In   Out   Target   Weight   Threshold
0    0     1        n.a.     T-
1    0     1        W+       T-
0    1     0        n.a.     T+
1    1     0        W-       T+
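A minimal sketch of this rule in Python (the learning rate, epoch count, and function name are illustrative choices, not from the slides): weights on active inputs move up when the unit should have fired but did not, down when it fired but should not have, and the threshold moves the opposite way.

```python
# Perceptron learning rule, following the table above.
def train_perceptron(patterns, targets, n_inputs, lr=0.1, epochs=100):
    w = [0.0] * n_inputs
    threshold = 0.0
    for _ in range(epochs):
        for x, t in zip(patterns, targets):
            out = 1 if sum(xi * wi for xi, wi in zip(x, w)) >= threshold else 0
            if out == 0 and t == 1:      # should have fired: W+, T-
                w = [wi + lr * xi for wi, xi in zip(w, x)]  # inactive inputs unchanged
                threshold -= lr
            elif out == 1 and t == 0:    # should not have fired: W-, T+
                w = [wi - lr * xi for wi, xi in zip(w, x)]
                threshold += lr
    return w, threshold

# OR is linearly separable, so the rule converges:
w, t = train_perceptron([(0,0), (0,1), (1,0), (1,1)], [0, 1, 1, 1], n_inputs=2)
```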
The Perceptron Learning Rule

Some things are linearly separable and easy to learn.
The Perceptron Learning Rule

In general, IF a perceptron can learn something, it will. IF a perceptron cannot learn something (the problem is not linearly separable), no amount of training helps. A perceptron easily implements And, Or, and Not, so multi-layered networks are logically complete.
A Simple XOR Network

[Figure: inputs $X_1$ and $X_2$, each with weight 1.0, feed a hidden And unit (threshold 1.5) and a hidden Or unit (threshold 0.5); both hidden units feed the output unit (threshold 0.5), which computes XOR.]
A Simple XOR Network with Bias Nodes

[Figure: the same network with the unit thresholds converted into weights from bias nodes; the values 1.5, 0.5, and 0.0 are shown.]
A Simple XOR Network

Note that the hidden layer is a transformed representation that is now linearly separable.

X1   X2   H1   H2   OUT
0    0    0    0    0
1    0    0    1    1
0    1    0    1    1
1    1    1    1    0
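The table can be checked directly with threshold units. A minimal sketch, assuming a -1.0 weight from the And unit to the output (one standard way to complete the construction; the exact output weights are not recoverable from the figure):

```python
# Verify the XOR truth table with hard-threshold units.
def step(x, threshold):
    return 1 if x >= threshold else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        h1 = step(1.0*x1 + 1.0*x2, 1.5)    # hidden And unit
        h2 = step(1.0*x1 + 1.0*x2, 0.5)    # hidden Or unit
        out = step(-1.0*h1 + 1.0*h2, 0.5)  # output: Or but not And (assumed -1.0 weight)
        print(x1, x2, h1, h2, out)         # reproduces the table above
```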
Another Solution to XOR
Weights

Weights are indexed $W_{source,destination}$. In matrix form:

$[X_1 \; X_2] \begin{bmatrix} W_{1,1} & W_{1,2} \\ W_{2,1} & W_{2,2} \end{bmatrix}$
Weights

With a second weight layer $V$ (hidden to output):

$XWV = XM$ where $M = WV$, so two purely linear layers collapse into a single layer.

$S(XW)V \neq XM$: a nonlinear activation function $S$ between the layers prevents the collapse.
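A quick numerical check of this point (the shapes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))
W = rng.normal(size=(2, 3))
V = rng.normal(size=(3, 2))

M = W @ V
print(np.allclose((X @ W) @ V, X @ M))         # True: linear layers collapse

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(np.allclose(sigmoid(X @ W) @ V, X @ M))  # False: S(XW)V != XM
```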
Linear Separation

[Figure: the decision line $X_1 W_1 + X_2 W_2 = \mathrm{Threshold}$ in the unit square.]

Let $W_0 = -\mathrm{Threshold}$. Then:

$X_1 W_1 + X_2 W_2 + W_0 = 0$

$X_2 W_2 = -X_1 W_1 - W_0$

$X_2 = -(W_1/W_2) X_1 - (W_0/W_2)$
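As a worked example, using the And unit from the XOR network above ($W_1 = W_2 = 1.0$, Threshold $= 1.5$):

```python
# Rewrite X1*W1 + X2*W2 = Threshold as a line X2 = m*X1 + b.
W1, W2, threshold = 1.0, 1.0, 1.5   # the And unit from the XOR slide
W0 = -threshold                     # fold the threshold into a bias weight
m = -(W1 / W2)                      # slope:     -(W1/W2)
b = -(W0 / W2)                      # intercept: -(W0/W2)
print(f"X2 = {m} * X1 + {b}")       # X2 = -1.0 * X1 + 1.5; only (1,1) lies above it
```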
How Neurons Communicate

Warning: real neurons are complex; neural networks are simple.
Neural Spike Trains
How do Artificial Neural Networks Learn?

As with perceptrons, learning is (largely) accomplished by weight adjustments. Recall that we have also converted the neuron thresholds into weights. But we need a different kind of activation function. Activation function = Transfer function.
The Activation Model
Sigmoid

$\mathrm{Sigmoid}(Out) = (1 + e^{-Out/Temp})^{-1}$
Sigmoid, Temperature and Gain

$\mathrm{Sigmoid}(Out) = (1 + e^{-Out/Temp})^{-1}$

The gain can also be changed by rescaling all of the weights.
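A minimal check that the two are equivalent (the particular values are arbitrary): dividing the net input by $Temp$ gives exactly the same activations as scaling every weight by $1/Temp$.

```python
import numpy as np

sigmoid = lambda out, temp=1.0: 1.0 / (1.0 + np.exp(-out / temp))

x = np.array([0.5, -1.0])
w = np.array([2.0, 0.3])
temp = 4.0
# Temperature on the net input == rescaled weights at temperature 1:
print(np.allclose(sigmoid(x @ w, temp), sigmoid(x @ (w / temp))))  # True
```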
Logistic Sigmoid Derivative

$S(x) = \frac{1}{1 + e^{-x}} = (1 + e^{-x})^{-1}$

$S'(x) = \left(-(1 + e^{-x})^{-2}\right)(-e^{-x}) = \frac{e^{-x}}{(1 + e^{-x})^2}$

$S'(x) = \left\{\frac{1}{1 + e^{-x}}\right\}\left\{\frac{e^{-x}}{1 + e^{-x}}\right\}$

$S'(x) = S(x)\left\{\frac{1 + e^{-x}}{1 + e^{-x}} - \frac{1}{1 + e^{-x}}\right\}$

$S'(x) = S(x)(1 - S(x))$
Logistic Sigmoid and its Derivative

input      S(x)       S(x)(1 - S(x))
0.000000   0.500000   0.250000
0.500000   0.622459   0.235004
1.000000   0.731059   0.196612
1.500000   0.817574   0.149146
2.000000   0.880797   0.104994
2.500000   0.924142   0.070104
3.000000   0.952574   0.045177
3.500000   0.970688   0.028453
4.000000   0.982014   0.017663
4.500000   0.989013   0.010866
5.000000   0.993307   0.006648
5.500000   0.995930   0.004054
6.000000   0.997527   0.002467
6.500000   0.998499   0.001499
7.000000   0.999089   0.000910
7.500000   0.999447   0.000552
8.000000   0.999665   0.000335
8.500000   0.999797   0.000203
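A short script reproduces the table and checks the closed-form derivative against a finite-difference estimate (the step sizes are my own choices):

```python
import numpy as np

S = lambda x: 1.0 / (1.0 + np.exp(-x))

for x in np.arange(0.0, 8.51, 0.5):
    analytic = S(x) * (1.0 - S(x))                    # S'(x) = S(x)(1 - S(x))
    numeric = (S(x + 1e-6) - S(x - 1e-6)) / 2e-6      # central difference
    assert abs(analytic - numeric) < 1e-6
    print(f"{x:8.6f} {S(x):8.6f} {analytic:8.6f}")    # matches the table above
```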
Sigmoid Derivative

$S'(Out) = S(Out)(1 - S(Out))$

When the derivative is zero, there is no learning.
Sigmoid Derivative

Instead of targets of 0 and 1 (or values between 0 and 1), use targets of 0.1 and 0.9 (or values between 0.1 and 0.9). Because the derivative $S(1-S)$ vanishes as the outputs saturate, this can help to prevent network paralysis.
Other Sigmoids

Elliott's function and the hyperbolic tangent. These activate between -1 and 1. Some spread out the derivative.
The Delta Rule

Let $E_p$ be the error for a particular input pattern. We will just look at one pattern, and drop the index $p$. Let $T_j$ be the desired Target pattern for node $j$. The output of a simple linear net is given by:

$O_j = \sum_i X_i W_{i,j}$

$E = \frac{1}{2}(T_j - O_j)^2$

This is a composite function: $(Error(Out(W_{i,j})))$
The Delta Rule

From this composite function, $(Error(Out(W_{i,j})))$, for one layer we can apply the Chain Rule:

$\frac{\delta E}{\delta W_{i,j}} = \frac{\delta E}{\delta O_j} \frac{\delta O_j}{\delta W_{i,j}}$

$\frac{\delta E}{\delta O_j} = -(T_j - O_j) \qquad \frac{\delta O_j}{\delta W_{i,j}} = X_i$

$\frac{\delta E}{\delta W_{i,j}} = -(T_j - O_j) X_i$
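One delta-rule step for a single linear unit, as a sketch (the inputs, weights, target, and step size are illustrative):

```python
import numpy as np

# E = 1/2 (T - O)^2 with O = sum_i X_i W_i, so dE/dW_i = -(T - O) X_i.
X = np.array([1.0, 0.5])
W = np.array([0.2, -0.3])
T = 1.0
alpha = 0.1                  # step size

O = X @ W                    # linear output
grad = -(T - O) * X          # dE/dW from the chain rule above
W = W - alpha * grad         # step downhill on the error surface
```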
The Delta Rule

For networks with sigmoid units using the logistic function:

$S_j = \frac{1}{1 + e^{-O_j/t}} \qquad O_j = \sum_i X_i W_{i,j}$

Again, a composite function: $(Error(Sig(Out(W_{i,j}))))$

$\frac{\delta E}{\delta W_{i,j}} = \frac{\delta E}{\delta S_j} \frac{\delta S_j}{\delta O_j} \frac{\delta O_j}{\delta W_{i,j}}$

$\frac{\delta E}{\delta S_j} = -(T_j - S_j) \qquad \frac{\delta S_j}{\delta O_j} = S_j(1 - S_j) \qquad \frac{\delta O_j}{\delta W_{i,j}} = X_i$

$\frac{\delta E}{\delta W_{i,j}} = -(T_j - S_j) S_j (1 - S_j) X_i$
The Delta Rule: Back Propagation

Now consider a 2-layer network, with weights $W_{i,q}$ from input $i$ to hidden unit $q$, and $W_{q,j}$ from hidden unit $q$ to output $j$:

$(Error(Sig_j(Out_j(Sig_q(Out_q(W_{i,q}))))))$

$\frac{\delta E}{\delta W_{i,q}} = \frac{\delta E}{\delta S_j} \frac{\delta S_j}{\delta O_j} \frac{\delta O_j}{\delta S_q} \frac{\delta S_q}{\delta O_q} \frac{\delta O_q}{\delta W_{i,q}}$
The Delta Rule: Back Propagation

$\frac{\delta E}{\delta S_j} \frac{\delta S_j}{\delta O_j} = -(T_j - S_j) S_j (1 - S_j)$

$\frac{\delta O_j}{\delta S_q} = W_{q,j} \qquad \frac{\delta S_q}{\delta O_q} = S_q(1 - S_q) \qquad \frac{\delta O_q}{\delta W_{i,q}} = X_i$

$\frac{\delta E}{\delta W_{i,q}} = -\left\{\sum_j (T_j - S_j) S_j (1 - S_j) W_{q,j}\right\} S_q (1 - S_q) X_i$
Updating the weights

$\frac{\delta E}{\delta W_{i,q}} = -\left\{\sum_j (T_j - S_j) S_j (1 - S_j) W_{q,j}\right\} S_q (1 - S_q) X_i$

$\Delta W_{i,q} = -\frac{\delta E}{\delta W_{i,q}}$

$W_{i,q} = W_{i,q} + \alpha \, \Delta W_{i,q}$

where $\alpha$ is the step size.
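Putting the pieces together, a sketch of batch back propagation for a 2-2-1 sigmoid network trained on XOR; the architecture, step size, and seed are my own choices, and (as the later slides note) a given run can stall in a local optimum:

```python
import numpy as np

rng = np.random.default_rng(1)
S = lambda z: 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
Xb = np.hstack([X, np.ones((4, 1))])          # bias input fixed at 1

W_iq = rng.normal(size=(3, 2))                # input (+bias) -> hidden
W_qj = rng.normal(size=(3, 1))                # hidden (+bias) -> output
alpha = 0.5                                   # step size

for epoch in range(10000):
    S_q = S(Xb @ W_iq)                        # hidden activations S_q
    S_qb = np.hstack([S_q, np.ones((4, 1))])
    S_j = S(S_qb @ W_qj)                      # output activations S_j

    d_j = (T - S_j) * S_j * (1 - S_j)         # (T_j - S_j) S_j (1 - S_j)
    d_q = (d_j @ W_qj[:2].T) * S_q * (1 - S_q)  # sum over j via W_qj (bias row excluded)

    W_qj += alpha * (S_qb.T @ d_j)            # W = W + alpha * deltaW
    W_iq += alpha * (Xb.T @ d_q)

print(S_j.round(2))                           # approaches [[0], [1], [1], [0]]
```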
Momentum (one variation)

Assume $\frac{\delta E}{\delta W_{i,q}}$ is the current back prop error. Consider:

$\Delta W_{i,q}(t) = \beta \, \Delta W_{i,q}(t-1) + (1 - \beta)\left(-\frac{\delta E}{\delta W_{i,q}}\right)$

Again we update: $W_{i,q} = W_{i,q} + \alpha \, \Delta W_{i,q}$

(1) If $\beta = 0$, the update uses only the current back prop error.
(2) If $\beta = 1$, the update uses only the previous back prop error.

For $0 < \beta < 0.5$:
(1) If two successive steps are increasing, the step size increases.
(2) If two successive steps are decreasing, the step size decreases.
(3) If one step decreases and the next increases, momentum smooths the update.
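A tiny numerical illustration of the three cases (the gradient sequence is invented for the example):

```python
# deltaW(t) = beta * deltaW(t-1) + (1 - beta) * (-dE/dW), then W += alpha * deltaW.
beta = 0.3
deltaW = 0.0
for grad in [-1.0, -1.0, 1.0]:                 # pretend back prop gradients
    deltaW = beta * deltaW + (1 - beta) * (-grad)
    print(deltaW)  # 0.7, then 0.91 (agreeing steps grow), then -0.427 (sign flip smoothed)
```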
We are learning weights at different levels (layers) of the network.
Hyperplanes and Separation

[Figure: two clusters of points separated by a hyperplane.]
Margins and Support Vectors

[Figure: a separating hyperplane with its margin; the points on the margin are the support vectors.]
Incremental Learning versus Batch Learning

Consider XOR again:

X1   X2   OUT
0    0    0
1    0    1
0    1    1
1    1    0

You could update the weights after each pattern is presented (incremental or stochastic learning). Or you could present all of the patterns, accumulate the errors, and then update the weights (batch learning). A sketch of both schedules follows.
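Both schedules share the gradient computation; only the update timing differs. Here `grad_fn` stands in for the back prop gradient $\delta E / \delta W$ (a placeholder, not a library function):

```python
# Incremental (stochastic): update the weights after every pattern.
def incremental_epoch(W, patterns, targets, alpha, grad_fn):
    for x, t in zip(patterns, targets):
        W = W + alpha * (-grad_fn(W, x, t))   # one step per pattern
    return W

# Batch: accumulate the gradient over all patterns, then update once.
def batch_epoch(W, patterns, targets, alpha, grad_fn):
    g = sum(grad_fn(W, x, t) for x, t in zip(patterns, targets))
    return W + alpha * (-g)
```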
Convergence to local optima (almost) All of the Back Propagation solutions are local optima (almost). And it does not seem to matter (much).
Convergence to local optima (almost) If we use some kind of (Cross) Validation to stop training early, we are not reaching the lowest possible error. If the neural networks have a high degree of symmetry there can be many identical local optima.
Many local optima are the same

With 20 hidden units there are approximately $20!$ (about $2.4 \times 10^{18}$) symmetries in the associated search space.
Feedforward networks are input-output boxes

This is not at all biological. There is no memory. There are also associative memory neural networks. And there are recurrent neural networks.