Neural Networks: Basics. Darrell Whitley Colorado State University



2 In the Beginning: The Perceptron [Figure: inputs X1 and X2 connected to two output units by weights $W_{1,1}$, $W_{1,2}$, $W_{2,1}$, $W_{2,2}$, indexed as $W_{source,destination}$.]

3 In the Beginning: The Perceptron

4 The Perceptron Learning Rule [Table: columns In, Out, Target, Weight, Threshold. Adjustment entries (Weight, Threshold), row by row: (n.a., T), (W+, T), (n.a., T), (W-, T+).]

5 The Perceptron Learning Rule Some things are linear and easy to learn.

6 The Perceptron Learning Rule In general, if a perceptron can learn something, it will. If a perceptron cannot learn something... A perceptron easily implements AND, OR, and NOT, so multi-layered networks of perceptrons are logically complete.
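The learning rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the slides: the names `train_perceptron` and `predict`, the learning rate of 1.0, the zero starting values, and the convention that a unit fires when its weighted sum exceeds the threshold are all hypothetical choices. Weights on active inputs rise when the output is too low and fall when it is too high; the threshold moves the opposite way.

```python
def train_perceptron(patterns, n_inputs, rate=1.0, epochs=100):
    """Train one threshold unit with the perceptron learning rule.

    patterns: list of (inputs, target) pairs with 0/1 values.
    Returns (weights, threshold).
    """
    weights = [0.0] * n_inputs
    threshold = 0.0
    for _ in range(epochs):
        mistakes = 0
        for inputs, target in patterns:
            total = sum(x * w for x, w in zip(inputs, weights))
            out = 1 if total > threshold else 0
            error = target - out  # +1: output too low, -1: output too high
            if error != 0:
                mistakes += 1
                # Only weights on active inputs (x = 1) change.
                weights = [w + rate * error * x for x, w in zip(inputs, weights)]
                # The threshold moves in the opposite direction.
                threshold -= rate * error
        if mistakes == 0:  # every pattern is classified correctly
            break
    return weights, threshold


def predict(inputs, weights, threshold):
    """Output of the trained threshold unit for one input pattern."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
OR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
and_w, and_t = train_perceptron(AND, 2)
or_w, or_t = train_perceptron(OR, 2)
```

Both functions are linearly separable, so training stops after a handful of epochs. Running the same loop on XOR would never reach zero mistakes.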

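7 A Simple XOR Network [Figure: two inputs X feed a hidden AND unit (threshold = 1.5) and a hidden OR unit (threshold = 0.5); the output unit (threshold = 0.5) computes XOR. A connection weight of 1.0 is labeled.]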
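A small Python sketch of such a network follows. The thresholds (1.5, 0.5, 0.5) come from the slide; the input weights of 1.0, the output weights of +1.0 from the OR unit and -1.0 from the AND unit, and the names `step` and `xor_net` are assumptions made for illustration.

```python
def step(total, threshold):
    """Threshold unit: fire (1) when the weighted sum exceeds the threshold."""
    return 1 if total > threshold else 0


def xor_net(x1, x2):
    h_and = step(1.0 * x1 + 1.0 * x2, 1.5)  # hidden AND unit
    h_or = step(1.0 * x1 + 1.0 * x2, 0.5)   # hidden OR unit
    # Assumed output weights: +1.0 from OR, -1.0 from AND.
    return step(1.0 * h_or - 1.0 * h_and, 0.5)


# Truth table produced by the network for all four input patterns.
table = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
```

The output unit fires only when OR is on and AND is off, which is exactly XOR.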

8 A Simple XOR Network with Bias Nodes [Figure: the XOR network over the two inputs X, with bias nodes added.]

9 A Simple XOR Network Note that the hidden layer is a transformed representation that is now linearly separable. [Table: columns X1, X2, H1, H2, OUT.]

10 Another Solution to XOR

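11 Weights [Figure: inputs X1 and X2 with weights $W_{1,1}$, $W_{1,2}$, $W_{2,1}$, $W_{2,2}$, indexed as $W_{source,destination}$.] As a matrix product:
$$[X_1 \;\; X_2] \begin{bmatrix} W_{1,1} & W_{1,2} \\ W_{2,1} & W_{2,2} \end{bmatrix}$$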

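12 Weights [Figure: a two-layer network. Inputs X1 and X2 feed the first layer through weights $W_{1,1}$, $W_{1,2}$, $W_{2,1}$, $W_{2,2}$; the first layer feeds the second through weights $V_{1,1}$, $V_{1,2}$, $V_{2,1}$, $V_{2,2}$. Both are indexed as source, destination.] $XWV = XM$, but $S(XW)V \neq XM$.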
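The point can be checked numerically. In the sketch below, two purely linear layers collapse into a single matrix M = WV, while inserting a sigmoid S between the layers breaks that equivalence for the same M. The matrix values and the helper `matmul` are arbitrary, hypothetical choices for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


X = [[1.0, 2.0]]                 # one input pattern as a row vector
W = [[0.5, -1.0], [1.5, 2.0]]    # first-layer weights
V = [[1.0, 0.0], [-2.0, 3.0]]    # second-layer weights

M = matmul(W, V)                             # single equivalent matrix
linear_two_layer = matmul(matmul(X, W), V)   # (XW)V
linear_one_layer = matmul(X, M)              # XM

# With a sigmoid on the hidden layer, the result no longer equals XM.
hidden = [[sigmoid(v) for v in row] for row in matmul(X, W)]
nonlinear_two_layer = matmul(hidden, V)      # S(XW)V
```

This is why a multi-layer network needs a nonlinear activation: without it, the extra layer adds no representational power.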

13 Linear Separation [Plot: the line $X_1 W_1 + X_2 W_2 = Threshold$ in the (X1, X2) plane, with 1.0 marked on each axis.]
Let $W_0 = -Threshold$. Then:
$$X_1 W_1 + X_2 W_2 + W_0 = 0$$
$$X_2 W_2 = -X_1 W_1 - W_0$$
$$X_2 = -(W_1 / W_2) X_1 - (W_0 / W_2)$$
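As a worked example, the separating line $X_2 = -(W_1/W_2) X_1 - (W_0/W_2)$ with $W_0 = -Threshold$ can be computed directly. The function name `boundary_line` is hypothetical, and the example values (weights of 1.0 and a threshold of 0.5, as for an OR unit) are assumptions; the formula requires $W_2 \neq 0$.

```python
def boundary_line(w1, w2, threshold):
    """Return (slope, intercept) of the line X2 = slope * X1 + intercept.

    Assumes w2 is nonzero; the threshold is folded in as W0 = -threshold.
    """
    w0 = -threshold
    return -w1 / w2, -w0 / w2


# An OR-like unit: both weights 1.0, threshold 0.5 (assumed values).
slope, intercept = boundary_line(1.0, 1.0, 0.5)

# Any point on the line gives a weighted sum equal to the threshold.
x1 = 0.25
x2 = slope * x1 + intercept
on_line_sum = 1.0 * x1 + 1.0 * x2
```

Points above the line produce a weighted sum greater than the threshold, and points below it produce a smaller one.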

14 Linear Separation

15 How Neurons Communicate Warning: real neurons are complex; neural networks are simple.

16 Neural Spike Trains

17 How Do Artificial Neural Networks Learn? As with perceptrons, learning is (largely) accomplished by weight adjustments. Recall that we have also converted the neuron thresholds into weights. But we need a different kind of activation function. Activation function = transfer function.

18 The Activation Model

19 Sigmoid $Sigmoid(Out) = (1 + e^{-Out/Temp})^{-1}$

20 Sigmoid, Temperature and Gain $Sigmoid(Out) = (1 + e^{-Out/Temp})^{-1}$ The gain can also be changed by rescaling all of the weights.
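The sketch below illustrates both ways of changing the gain: lowering the temperature, or rescaling every weight by the same factor. The function names and the sample inputs and weights are hypothetical; only the formula $(1 + e^{-Out/Temp})^{-1}$ is taken from the slide.

```python
import math


def sigmoid(out, temp=1.0):
    """Logistic sigmoid with temperature: (1 + e^(-out/temp))^(-1)."""
    return 1.0 / (1.0 + math.exp(-out / temp))


def unit_output(inputs, weights, temp=1.0):
    """Weighted sum of the inputs passed through the sigmoid."""
    net = sum(x * w for x, w in zip(inputs, weights))
    return sigmoid(net, temp)


inputs = [1.0, -2.0, 0.5]
weights = [0.4, 0.1, -0.6]

# Halving the temperature ...
cool = unit_output(inputs, weights, temp=0.5)
# ... has the same effect as doubling every weight at temperature 1.
scaled = unit_output(inputs, [2.0 * w for w in weights], temp=1.0)
```

A lower temperature makes the curve steeper around zero, so the unit behaves more like a hard threshold.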


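22 Logistic Sigmoid Derivative
$$S(x) = \frac{1}{1 + e^{-x}}$$
$$S(x) = (1 + e^{-x})^{-1}$$
$$S'(x) = \left(-(1 + e^{-x})^{-2}\right)\left(-e^{-x}\right)$$
$$S'(x) = \frac{e^{-x}}{(1 + e^{-x})^{2}}$$
$$S'(x) = \left\{\frac{1}{1 + e^{-x}}\right\}\left\{\frac{e^{-x}}{1 + e^{-x}}\right\}$$
$$S'(x) = S(x)\left\{\frac{1 + e^{-x} - 1}{1 + e^{-x}}\right\}$$
$$S'(x) = S(x)\left\{\frac{1 + e^{-x}}{1 + e^{-x}} - \frac{1}{1 + e^{-x}}\right\}$$
$$S'(x) = S(x)(1 - S(x))$$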

23 Logistic Sigmoid and Its Derivative [Plot: $S(x)$ and $S(x)(1 - S(x))$ against the input $x$.]

24 Sigmoid Derivative $S(Out)(1 - S(Out))$ When the derivative is zero, there is no learning.
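A quick numerical check of the closed form $S(x)(1 - S(x))$ against a central finite difference also shows how the derivative shrinks toward zero for large inputs. The helper names and the sample points are hypothetical choices for this sketch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Closed form S(x)(1 - S(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def numeric_derivative(f, x, h=1e-5):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


# (input, closed-form derivative, finite-difference estimate)
checks = [(x, sigmoid_derivative(x), numeric_derivative(sigmoid, x))
          for x in (-10.0, -2.0, 0.0, 2.0, 10.0)]
```

The derivative peaks at 0.25 when the input is 0 and is nearly zero at an input of 10, which is where a saturated unit stops learning.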

25 Sigmoid Derivative Instead of targets of 0 and 1, or between 0 and 1, use targets of 0.1 and 0.9, or between 0.1 and 0.9. This can help to prevent network paralysis.

26 Other Sigmoids Elliott's function, hyperbolic tangent. These activate between -1 and 1. Some spread out the derivative.

27 The Delta Rule Let $E_p$ be the error for a particular input pattern. We will just look at one pattern, and drop the index $p$. Let $T_j$ be the desired target for node $j$. The output of a simple linear net is given by:
$$O_j = \sum_i X_i W_{i,j} \qquad E = \frac{1}{2}(T_j - O_j)^2$$
This is a composite function: (Error (Out ($W_{i,j}$)))

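28 The Delta Rule From this composite function: (Error (Out ($W_{i,j}$))). For one layer, we can apply the Chain Rule:
$$\frac{\partial E}{\partial W_{i,j}} = \frac{\partial E}{\partial O_j}\,\frac{\partial O_j}{\partial W_{i,j}}$$
$$\frac{\partial E}{\partial O_j} = -(T_j - O_j) \qquad \frac{\partial O_j}{\partial W_{i,j}} = X_i$$
$$\frac{\partial E}{\partial W_{i,j}} = -(T_j - O_j)\,X_i$$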
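For a single linear output unit, the gradient $-(T_j - O_j) X_i$ leads to the following update loop. This is a minimal sketch: the function name `delta_rule_step`, the step size of 0.1, the input pattern, and the target are all assumed values, and the weights move against the gradient so that the error shrinks.

```python
def delta_rule_step(inputs, weights, target, alpha=0.1):
    """One delta-rule update for a single linear output unit.

    Returns the new weights and the error E = 1/2 (T - O)^2 before the update.
    """
    out = sum(x * w for x, w in zip(inputs, weights))
    # dE/dW_i = -(T - O) * X_i
    grads = [-(target - out) * x for x in inputs]
    # Step against the gradient.
    new_weights = [w - alpha * g for w, g in zip(weights, grads)]
    return new_weights, 0.5 * (target - out) ** 2


weights = [0.0, 0.0]
errors = []
for _ in range(50):
    weights, err = delta_rule_step([1.0, 2.0], weights, target=3.0)
    errors.append(err)
```

With these values the residual T - O halves on every step, so the recorded error falls quickly toward zero.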

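29 The Delta Rule For networks with sigmoid units using the logistic function:
$$S_j = \frac{1}{1 + e^{-O_j/t}} \qquad O_j = \sum_i X_i W_{i,j}$$
Again, a composite function: (Error (Sig (Out ($W_{i,j}$)))). Taking $t = 1$:
$$\frac{\partial E}{\partial W_{i,j}} = \frac{\partial E}{\partial S_j}\,\frac{\partial S_j}{\partial O_j}\,\frac{\partial O_j}{\partial W_{i,j}}$$
$$\frac{\partial E}{\partial S_j} = -(T_j - S_j) \qquad \frac{\partial S_j}{\partial O_j} = S_j(1 - S_j) \qquad \frac{\partial O_j}{\partial W_{i,j}} = X_i$$
$$\frac{\partial E}{\partial W_{i,j}} = -(T_j - S_j)\,S_j(1 - S_j)\,X_i$$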

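30 The Delta Rule: Back Propagation Now consider a 2-layer network: [Figure: node $i$ connects to node $q$ through weight $W_{i,q}$, and node $q$ connects to node $j$ through weight $W_{q,j}$.] (Error (Sig.j (Out.j (Sig.q (Out.q ($W_{i,q}$))))))
$$\frac{\partial E}{\partial W_{i,q}} = \frac{\partial E}{\partial S_j}\,\frac{\partial S_j}{\partial O_j}\,\frac{\partial O_j}{\partial S_q}\,\frac{\partial S_q}{\partial O_q}\,\frac{\partial O_q}{\partial W_{i,q}}$$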

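31 The Delta Rule: Back Propagation
$$\frac{\partial E}{\partial W_{i,q}} = \frac{\partial E}{\partial S_j}\,\frac{\partial S_j}{\partial O_j}\,\frac{\partial O_j}{\partial S_q}\,\frac{\partial S_q}{\partial O_q}\,\frac{\partial O_q}{\partial W_{i,q}}$$
$$\frac{\partial E}{\partial O_j} = \frac{\partial E}{\partial S_j}\,\frac{\partial S_j}{\partial O_j} = -(T_j - S_j)\,S_j(1 - S_j)$$
$$\frac{\partial O_j}{\partial S_q} = W_{q,j} \qquad \frac{\partial S_q}{\partial O_q} = S_q(1 - S_q) \qquad \frac{\partial O_q}{\partial W_{i,q}} = X_i$$
$$\frac{\partial E}{\partial W_{i,q}} = -\Big\{\sum_j (T_j - S_j)\,S_j(1 - S_j)\,W_{q,j}\Big\}\,S_q(1 - S_q)\,X_i$$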

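32 Updating the weights
$$\frac{\partial E}{\partial W_{i,q}} = -\Big\{\sum_j (T_j - S_j)\,S_j(1 - S_j)\,W_{q,j}\Big\}\,S_q(1 - S_q)\,X_i$$
$$\Delta W_{i,q} = -\frac{\partial E}{\partial W_{i,q}}$$
$$W_{i,q} = W_{i,q} + \alpha\,\Delta W_{i,q}$$
where $\alpha$ is the step size.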
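The hidden-layer gradient and the weight update can be sketched as follows for a small sigmoid network. The layer sizes, weight values, step size, and function names are hypothetical, the error is taken as one half of the squared differences summed over the output nodes, and the sigmoid temperature is assumed to be 1. The gradient is written with the sign that makes $W + \alpha \Delta W$, with $\Delta W = -\partial E / \partial W$, reduce the error.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_iq, w_qj):
    """w_iq[i][q]: input-to-hidden weights; w_qj[q][j]: hidden-to-output."""
    s_q = [sigmoid(sum(x[i] * w_iq[i][q] for i in range(len(x))))
           for q in range(len(w_qj))]
    s_j = [sigmoid(sum(s_q[q] * w_qj[q][j] for q in range(len(s_q))))
           for j in range(len(w_qj[0]))]
    return s_q, s_j


def error(x, t, w_iq, w_qj):
    _, s_j = forward(x, w_iq, w_qj)
    return 0.5 * sum((t[j] - s_j[j]) ** 2 for j in range(len(t)))


def backprop_grad_w_iq(x, t, w_iq, w_qj):
    """dE/dW_iq = -{sum_j (T_j - S_j) S_j (1 - S_j) W_qj} S_q (1 - S_q) X_i."""
    s_q, s_j = forward(x, w_iq, w_qj)
    delta_j = [(t[j] - s_j[j]) * s_j[j] * (1.0 - s_j[j]) for j in range(len(t))]
    grad = [[0.0] * len(s_q) for _ in x]
    for q in range(len(s_q)):
        back = sum(delta_j[j] * w_qj[q][j] for j in range(len(t)))
        for i in range(len(x)):
            grad[i][q] = -back * s_q[q] * (1.0 - s_q[q]) * x[i]
    return grad


x, t = [1.0, 0.0], [1.0]
w_iq = [[0.3, -0.2], [0.1, 0.4]]
w_qj = [[0.5], [-0.6]]
alpha = 0.5

grad = backprop_grad_w_iq(x, t, w_iq, w_qj)
# W_iq <- W_iq + alpha * Delta W_iq, with Delta W_iq = -dE/dW_iq
new_w_iq = [[w_iq[i][q] - alpha * grad[i][q] for q in range(2)]
            for i in range(2)]
```

Comparing `grad` with a finite-difference estimate of `error` is a useful sanity check whenever the derivation is implemented by hand.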

33 Momentum (one variation) Assume $\frac{\partial E}{\partial W_{i,q}}$ is the current back prop error. Consider:
$$\Delta W_{i,q}(t) = \beta\,\Delta W_{i,q}(t-1) + (1 - \beta)\left(-\frac{\partial E}{\partial W_{i,q}}\right)$$
Again we update: $W_{i,q} = W_{i,q} + \alpha\,\Delta W_{i,q}$
(1) If $\beta = 0$, the update is only the current back prop error.
(2) If $\beta = 1$, the update uses only the previous step.
For $0 < \beta < 0.5$:
(1) If two steps are increasing, the step size increases.
(2) If two steps are decreasing, the step size decreases.
(3) If one step decreases and the next increases, the update is smoothed.
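A one-weight sketch of this momentum variation follows. The quadratic error $E = \frac{1}{2} w^2$, the starting weight, and the values of $\alpha$ and $\beta$ are assumptions chosen only to make the behavior easy to follow; the gradient enters with a minus sign so that the step moves downhill.

```python
def momentum_step(prev_delta, grad, beta):
    """Blend the previous step with the current downhill direction."""
    return beta * prev_delta + (1.0 - beta) * (-grad)


w, delta = 4.0, 0.0
alpha, beta = 0.5, 0.25
history = []
for _ in range(30):
    grad = w                      # dE/dw for the assumed error E = 0.5 * w**2
    delta = momentum_step(delta, grad, beta)
    w = w + alpha * delta         # W <- W + alpha * Delta W
    history.append(w)
```

Setting `beta` to 0 reproduces plain gradient descent, while `beta` equal to 1 repeats the previous step and ignores the current gradient.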

34 We are learning weights in different levels

35 Hyperplanes and Separation [Figure: a scatter of points separated by hyperplanes.]

36 Margins and Support Vectors [Figure: a scatter of points illustrating margins and support vectors.]

37 Incremental Learning versus Batch Learning Consider XOR again (X1, X2 → OUT: 0, 0 → 0; 0, 1 → 1; 1, 0 → 1; 1, 1 → 0). You could update the weights after each pattern is presented (incremental or stochastic). Or you could present all of the patterns, accumulate the errors, and then update the weights (batch).
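The two schedules differ only in when the weights change. The sketch below uses a single linear unit with the delta rule rather than an XOR network, so that both loops are short and certain to converge; the training patterns, the step size, and the function names are hypothetical.

```python
def output(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))


def incremental_epoch(patterns, w, alpha):
    """Update the weights immediately after each pattern."""
    for x, t in patterns:
        err = t - output(x, w)
        w = [wi + alpha * err * xi for xi, wi in zip(x, w)]
    return w


def batch_epoch(patterns, w, alpha):
    """Accumulate the error terms over all patterns, then update once."""
    total = [0.0] * len(w)
    for x, t in patterns:
        err = t - output(x, w)  # every error uses the same, old weights
        total = [g + err * xi for g, xi in zip(total, x)]
    return [wi + alpha * g for wi, g in zip(w, total)]


# Targets generated by the assumed rule T = 2 * X1 - X2.
patterns = [((0.0, 1.0), -1.0), ((1.0, 0.0), 2.0), ((1.0, 1.0), 1.0)]

w_inc = w_bat = [0.0, 0.0]
for _ in range(200):
    w_inc = incremental_epoch(patterns, w_inc, 0.1)
    w_bat = batch_epoch(patterns, w_bat, 0.1)
```

After a single epoch the two schedules give different weights, but on this small problem both end up at the same solution.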

38 Convergence to local optima (almost) All of the Back Propagation solutions are local optima (almost). And it does not seem to matter (much).

39 Convergence to local optima (almost) If we use some kind of (Cross) Validation to stop training early, we are not reaching the lowest possible error. If the neural networks have a high degree of symmetry there can be many identical local optima.

40 Many local optima are the same With 20 hidden units there are approximately 20! symmetries in the associated search space.
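The symmetry comes from the fact that hidden units can be reordered without changing what the network computes. The sketch below permutes three hidden units, together with their output weights, and confirms that the output is unchanged; it then counts the orderings of 20 hidden units. The weights, the input, and the function name `net_output` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def net_output(x, w_in, w_out):
    """w_in[q]: input weights of hidden unit q; w_out[q]: its output weight."""
    hidden = [sigmoid(sum(xi * wi for xi, wi in zip(x, w_q))) for w_q in w_in]
    return sigmoid(sum(h * v for h, v in zip(hidden, w_out)))


w_in = [[0.2, -0.4], [0.7, 0.1], [-0.5, 0.3]]
w_out = [0.6, -0.3, 0.9]

# Reorder the hidden units, keeping each unit's weights together.
order = [2, 0, 1]
w_in_perm = [w_in[q] for q in order]
w_out_perm = [w_out[q] for q in order]

x = [1.0, -1.0]
same = math.isclose(net_output(x, w_in, w_out),
                    net_output(x, w_in_perm, w_out_perm))

# Number of ways to order 20 hidden units.
n_orderings = math.factorial(20)
```

Each ordering is a different point in weight space with exactly the same error, which is why so many local optima are equivalent.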

41 Feedforward networks are input-output boxes This is not at all biological. There is no memory. There are also associative memory neural networks. And there are recurrent neural networks.
