Neural Networks. ICS 273A UC Irvine Instructor: Max Welling




2 Neurons

Neurons communicate by receiving signals on their dendrites, adding these signals, and firing off a new signal along the axon if the total input exceeds a threshold. The axon connects to new dendrites through synapses, which can learn how much signal is transmitted. McCulloch and Pitts ('43) built a first abstract model of a neuron:

$$y = g\Big(\sum_i W_i x_i + b\Big)$$

where $y$ is the output, $g$ the activation function, $W_i$ the weights, $x_i$ the inputs, and $b$ the bias.
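The abstract neuron above can be sketched in a few lines of Python. This is only an illustration: the function names (`neuron_output`, `sigmoid`, `step`) and the example weights are assumptions, not part of the slides.

```python
import math

def sigmoid(a):
    """Logistic activation, one common choice for g (an assumption here)."""
    return 1.0 / (1.0 + math.exp(-a))

def step(a):
    """Hard threshold: fire (1.0) only if the total input is positive."""
    return 1.0 if a > 0 else 0.0

def neuron_output(weights, inputs, bias, g=sigmoid):
    """Compute y = g(sum_i W_i x_i + b) for a single neuron."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return g(total)
```

With the hand-picked weights `[1.0, 1.0]`, bias `-1.5`, and `g=step`, the neuron fires only when both inputs are 1, so it acts as a logical AND.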

3 Neurons

We have about $10^{11}$ neurons, each connected to thousands of other neurons on average. Each neuron needs at least $10^{-3}$ seconds to transmit the signal. So we have many, slow neurons. Yet we recognize our grandmother in $10^{-1}$ sec. Computers have much faster switching times. Conclusion: brains compute in parallel! In fact, neurons are unreliable/noisy as well. But since things are encoded redundantly by many of them, their population can do computation reliably and fast.

4 Classification & Regression

Neural nets are a parameterized function $Y = f(X; W)$ from inputs ($X$) to outputs ($Y$). If $Y$ is continuous we have regression; if $Y$ is discrete, classification. For a single layer,

$$f_i(x) = \sum_j W_{ij} x_j + b_i$$

We adapt the weights so as to minimize the error between the data and the model predictions:

$$\text{error} = \sum_{n=1}^{N} \sum_{i=1}^{d_{out}} \Big( y_{in} - \sum_{j=1}^{d_{in}} W_{ij} x_{jn} - b_i \Big)^2$$

This is just a perceptron with a quadratic cost function.
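As a rough illustration of the quadratic cost, the sketch below evaluates the error for a single-layer linear model. The list-of-lists layout (`W[i][j]`, one row per output) and the names `predict` and `quadratic_error` are assumptions made for this example.

```python
def predict(W, b, x):
    """Single-layer prediction f_i(x) = sum_j W_ij x_j + b_i."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def quadratic_error(W, b, X, Y):
    """Sum of squared residuals over all data-items n and outputs i."""
    total = 0.0
    for x, y in zip(X, Y):
        f = predict(W, b, x)
        total += sum((y_i - f_i) ** 2 for y_i, f_i in zip(y, f))
    return total
```

For data generated exactly by the model, the error is zero; any mismatch adds its squared residual.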

5 Optimization

We use stochastic gradient descent: pick a single data-item, compute the contribution of that data-point to the overall gradient, and update the weights.

Repeat:
1. Pick a random data-item $(x_n, y_n)$.
2. Define $\delta_{in} = y_{in} - \sum_k W_{ik} x_{kn} - b_i$.
3. Update $W_{ij} \leftarrow W_{ij} + \eta\,\delta_{in} x_{jn}$ and $b_i \leftarrow b_i + \eta\,\delta_{in}$.
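The three steps above can be sketched as a short loop. This is a minimal sketch under stated assumptions: the function name `sgd_linear`, the zero initialization, and the default step size `eta`, number of `steps`, and `seed` are all illustrative choices, not values from the slides.

```python
import random

def sgd_linear(X, Y, eta=0.05, steps=5000, seed=0):
    """Fit y_i = sum_j W_ij x_j + b_i by stochastic gradient descent."""
    rng = random.Random(seed)
    d_in, d_out = len(X[0]), len(Y[0])
    W = [[0.0] * d_in for _ in range(d_out)]
    b = [0.0] * d_out
    for _ in range(steps):
        n = rng.randrange(len(X))          # 1. pick a random data-item
        x, y = X[n], Y[n]
        for i in range(d_out):
            # 2. residual delta_in = y_in - sum_k W_ik x_kn - b_i
            delta = y[i] - sum(W[i][k] * x[k] for k in range(d_in)) - b[i]
            # 3. update weights and bias along the residual
            for j in range(d_in):
                W[i][j] += eta * delta * x[j]
            b[i] += eta * delta
    return W, b
```

On noiseless data such as `y = 2x + 1`, the loop recovers a weight close to 2 and a bias close to 1; with noisy data it would keep fluctuating around the minimum.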

6 Stochastic Gradient Descent

[Figure: stochastic updates vs. full updates (averaged over all data-items)]

Stochastic gradient descent does not converge to the minimum, but dances around it. To get to the minimum, one needs to decrease the step-size as one gets closer to the minimum. Alternatively, one can obtain a few samples and average predictions over them (similar to bagging).
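One way to decrease the step-size is a simple decay schedule. The sketch below assumes the schedule $\eta_t = \eta_0 / (1 + t/\tau)$, which is a common choice but is not specified in the slides, and applies it to the smallest possible problem: fitting a single bias `b` to scalar targets. The names `step_size` and `sgd_mean` are hypothetical.

```python
import random

def step_size(t, eta0=1.0, tau=1.0):
    """Decaying step size eta_t = eta0 / (1 + t / tau) (assumed schedule)."""
    return eta0 / (1.0 + t / tau)

def sgd_mean(ys, steps, schedule, seed=0):
    """Minimize sum_n (y_n - b)^2 over a single bias b by SGD."""
    rng = random.Random(seed)
    b = 0.0
    for t in range(steps):
        y = rng.choice(ys)           # pick a random data-item
        delta = y - b                # residual for this item
        b += schedule(t) * delta     # step shrinks as t grows
    return b
```

With the default schedule, `b` is exactly the running average of the sampled targets, so it settles near the mean of `ys`. Passing a constant schedule such as `lambda t: 0.5` instead makes `b` keep jumping between the targets, which is the "dancing" behaviour described above.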


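7 Multi-Layer Nets

[Figure: network with input x, hidden layers h1 and h2, and output y, connected by parameters (W1,b1), (W2,b2), (W3,b3)]

$$\hat{y}_i = g\Big(\sum_j W^3_{ij} h^2_j + b^3_i\Big)$$

$$h^2_i = g\Big(\sum_j W^2_{ij} h^1_j + b^2_i\Big)$$

$$h^1_i = g\Big(\sum_j W^1_{ij} x_j + b^1_i\Big)$$

Single layers can only do linear things. If we want to learn non-linear decision surfaces, or non-linear regression curves, we need more than one layer. In fact, a NN with 1 hidden layer can approximate any boolean function and any continuous function.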
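The layered equations can be sketched as a loop over `(W, b)` pairs. The example below assumes a logistic sigmoid for `g`, and the weights in `XOR_NET` are hand-picked for illustration (not learned and not from the slides). They show a network with one hidden layer computing XOR, a boolean function that no single linear layer can represent.

```python
import math

def sigmoid(a):
    """Logistic activation, assumed here as the choice of g."""
    return 1.0 / (1.0 + math.exp(-a))

def layer(W, b, h):
    """One layer: out_i = g(sum_j W_ij h_j + b_i)."""
    return [sigmoid(sum(w * v for w, v in zip(row, h)) + b_i)
            for row, b_i in zip(W, b)]

def forward(params, x):
    """Apply the layers in order; params is a list of (W, b) pairs."""
    h = x
    for W, b in params:
        h = layer(W, b, h)
    return h

# Hand-picked (hypothetical) weights: hidden units act as OR and NAND,
# and the output unit acts as AND of the two, which gives XOR.
XOR_NET = [
    ([[20.0, 20.0], [-20.0, -20.0]], [-10.0, 30.0]),
    ([[20.0, 20.0]], [-30.0]),
]
```

Calling `forward(XOR_NET, [1.0, 0.0])` gives an output close to 1, while `[0.0, 0.0]` and `[1.0, 1.0]` give outputs close to 0.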

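8 Back-propagation

How do we learn the weights of a multi-layer network? Answer: stochastic gradient descent. But now the gradients are harder!

$$\text{error} = -\sum_{i,n} \Big[ y_{in} \log h^3_{in} + (1 - y_{in}) \log(1 - h^3_{in}) \Big]$$

$$\frac{d\,\text{error}_n}{d h^3_{in}} = -\frac{y_{in}}{h^3_{in}} + \frac{1 - y_{in}}{1 - h^3_{in}}$$

$$\begin{aligned}
\frac{d\,\text{error}_n}{dW^2_{jk}}
&= \sum_i \frac{d\,\text{error}_n}{dh^3_{in}} \frac{dh^3_{in}}{dW^2_{jk}} \\
&= \sum_i \frac{d\,\text{error}_n}{dh^3_{in}}\, h^3_{in}(1 - h^3_{in})\, \frac{d\big(\sum_s W^3_{is} h^2_{sn} + b^3_i\big)}{dW^2_{jk}} \\
&= \sum_i \frac{d\,\text{error}_n}{dh^3_{in}}\, h^3_{in}(1 - h^3_{in})\, W^3_{ij}\, \frac{dh^2_{jn}}{dW^2_{jk}} \\
&= \sum_i \frac{d\,\text{error}_n}{dh^3_{in}}\, h^3_{in}(1 - h^3_{in})\, W^3_{ij}\, h^2_{jn}(1 - h^2_{jn})\, \frac{d\big(\sum_s W^2_{js} h^1_{sn} + b^2_j\big)}{dW^2_{jk}} \\
&= \sum_i \frac{d\,\text{error}_n}{dh^3_{in}}\, h^3_{in}(1 - h^3_{in})\, W^3_{ij}\, h^2_{jn}(1 - h^2_{jn})\, h^1_{kn} \\
&= \sum_i \frac{d\,\text{error}_n}{dh^3_{in}}\, h^3_{in}(1 - h^3_{in})\, W^3_{ij}\, h^2_{jn}(1 - h^2_{jn})\, \sigma\Big(\sum_l W^1_{kl} x_{ln} + b^1_k\Big)
\end{aligned}$$

[Figure: network x → h1 → h2 → output, with parameters (W1,b1), (W2,b2), (W3,b3)]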
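A short worked step may help here. Assuming the sigmoid output and the cross-entropy error of this slide, the first two factors of the chain rule collapse to a simple residual:

```latex
\begin{aligned}
h^3_{in}(1 - h^3_{in})\,\frac{d\,\text{error}_n}{dh^3_{in}}
&= h^3_{in}(1 - h^3_{in})\left(-\frac{y_{in}}{h^3_{in}} + \frac{1 - y_{in}}{1 - h^3_{in}}\right) \\
&= -y_{in}(1 - h^3_{in}) + (1 - y_{in})\,h^3_{in} \\
&= h^3_{in} - y_{in}
\end{aligned}
```

So the signal sent down from the top layer is just the prediction minus the target.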

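9 Back Propagation

[Figure: the network drawn twice, once for the upward pass and once for the downward pass]

Upward pass:

$$h^1_{in} = \sigma\Big(\sum_j W^1_{ij} x_{jn} + b^1_i\Big)$$

$$h^2_{in} = \sigma\Big(\sum_j W^2_{ij} h^1_{jn} + b^2_i\Big)$$

$$h^3_{in} = \sigma\Big(\sum_j W^3_{ij} h^2_{jn} + b^3_i\Big)$$

Downward pass:

$$\delta^3_{in} = h^3_{in}(1 - h^3_{in})\, \frac{d\,\text{error}_n}{dh^3_{in}}$$

$$\delta^2_{jn} = h^2_{jn}(1 - h^2_{jn}) \sum_{\text{upstream } i} W^3_{ij}\, \delta^3_{in}$$

$$\delta^1_{kn} = h^1_{kn}(1 - h^1_{kn}) \sum_{\text{upstream } j} W^2_{jk}\, \delta^2_{jn}$$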
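The two passes can be sketched in plain Python as follows. This is an illustrative sketch, not the course's implementation: the function names (`forward`, `backward`, `error`), the list-based layout (`params` as a list of `(W, b)` pairs, bottom layer first), and the cross-entropy error at the top are assumptions chosen to match the equations on these slides.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(params, x):
    """Upward pass: return the activations [x, h1, h2, h3]."""
    acts = [x]
    for W, b in params:
        h = acts[-1]
        acts.append([sigmoid(sum(w * v for w, v in zip(row, h)) + b_i)
                     for row, b_i in zip(W, b)])
    return acts

def error(params, x, y):
    """Cross-entropy error of one data-item."""
    h3 = forward(params, x)[-1]
    return -sum(t * math.log(o) + (1 - t) * math.log(1 - o)
                for o, t in zip(h3, y))

def backward(params, acts, y):
    """Downward pass: return the deltas [delta1, delta2, delta3]."""
    # Top layer: delta3 = h3 (1 - h3) * d error / d h3
    delta = [h * (1 - h) * (-t / h + (1 - t) / (1 - h))
             for h, t in zip(acts[-1], y)]
    deltas = [delta]
    # Lower layers: delta_j = h_j (1 - h_j) * sum_i W_upper[i][j] * delta_i
    for level in range(len(params) - 1, 0, -1):
        W_upper = params[level][0]
        h = acts[level]
        delta = [h[j] * (1 - h[j]) *
                 sum(W_upper[i][j] * delta[i] for i in range(len(delta)))
                 for j in range(len(h))]
        deltas.insert(0, delta)
    return deltas
```

The gradient with respect to a weight such as $W^2_{jk}$ is then `deltas[1][j] * acts[1][k]`, which can be checked against a finite-difference estimate of `error`.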

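10 Back Propagation

[Figure: network x → h1 → h2 → y, with parameters (W1,b1), (W2,b2), (W3,b3)]

$$\delta^3_{in} = h^3_{in}(1 - h^3_{in})\, \frac{d\,\text{error}_n}{dh^3_{in}}$$

$$\delta^2_{jn} = h^2_{jn}(1 - h^2_{jn}) \sum_{\text{upstream } i} W^3_{ij}\, \delta^3_{in}$$

$$\delta^1_{kn} = h^1_{kn}(1 - h^1_{kn}) \sum_{\text{upstream } j} W^2_{jk}\, \delta^2_{jn}$$

$$\frac{d\,\text{error}_n}{dW^2_{jk}} = \delta^2_{jn}\, h^1_{kn} = \sum_i \frac{d\,\text{error}_n}{dh^3_{in}}\, h^3_{in}(1 - h^3_{in})\, W^3_{ij}\, h^2_{jn}(1 - h^2_{jn})\, h^1_{kn}$$

$$W^2_{jk} \leftarrow W^2_{jk} - \eta\,\delta^2_{jn}\, h^1_{kn}, \qquad b^2_j \leftarrow b^2_j - \eta\,\delta^2_{jn}$$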

11 ALVINN

Learning to drive a car.

This hidden unit detects a mildly left-sloping road and advises steering left. What would another hidden unit look like?

12 Weight Decay

NN can also overfit (of course). We can try to avoid this by initializing all weights and bias terms to very small random values and letting them grow during learning. One can then check performance on a validation set and stop early. Or one can change the update rule to discourage large weights:

$$W_{jk} \leftarrow W_{jk} - \eta\,\delta_{jn} h_{kn} - \lambda W_{jk}$$

$$b_j \leftarrow b_j - \eta\,\delta_{jn} - \lambda b_j$$

Now we need to set $\lambda$ using cross-validation. This is called weight decay in NN jargon.
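The modified update rule can be sketched for one layer as below. The function name `update_with_weight_decay` and the in-place list layout are assumptions for illustration. Here `delta` holds the back-propagated deltas of the layer's units and `h_below` the activations feeding into it; setting `lam=0` gives the plain gradient step.

```python
def update_with_weight_decay(W, b, delta, h_below, eta, lam):
    """In-place update: W_jk <- W_jk - eta*delta_j*h_k - lam*W_jk (same for b)."""
    for j in range(len(W)):
        for k in range(len(W[j])):
            W[j][k] = W[j][k] - eta * delta[j] * h_below[k] - lam * W[j][k]
        b[j] = b[j] - eta * delta[j] - lam * b[j]
```

Even with a zero gradient, each call shrinks every weight by a factor `(1 - lam)`, which is where the name "decay" comes from.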

13 Momentum

In the beginning of learning it is likely that the weights are changed in a consistent manner. Like a ball rolling down a hill, we should gain speed if we make consistent changes. It's like an adaptive stepsize. This idea is easily implemented by changing the gradient as follows:

$$\Delta W_{jk}(\text{new}) = \eta\,\delta_{jn} h_{kn} + \gamma\,\Delta W_{jk}(\text{old})$$

$$W_{jk} \leftarrow W_{jk} - \Delta W_{jk}(\text{new})$$

(and similarly for the biases).
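A minimal sketch of the momentum update for one layer follows. The name `momentum_step` and the separate `velocity` array (holding the previous $\Delta W$) are illustrative assumptions, as are the values of `eta` and `gamma` used in the note below.

```python
def momentum_step(W, velocity, delta, h_below, eta, gamma):
    """In-place momentum update; velocity stores Delta W from the last step."""
    for j in range(len(W)):
        for k in range(len(W[j])):
            # Delta W(new) = eta * delta_j * h_k + gamma * Delta W(old)
            velocity[j][k] = eta * delta[j] * h_below[k] + gamma * velocity[j][k]
            # W <- W - Delta W(new)
            W[j][k] -= velocity[j][k]
```

With a constant gradient of 1, `eta=0.1`, and `gamma=0.5`, the step sizes are 0.1, 0.15, 0.175, and so on, approaching `eta / (1 - gamma) = 0.2`: consistent changes gain speed, as in the rolling-ball picture.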

14 Conclusion

- NN are a flexible way to model input/output functions.
- They are robust against noisy data.
- It is hard to interpret the results (unlike DTs).
- Learning is fast on large datasets when using stochastic gradient descent plus momentum.
- Local minima in the optimization are a problem.
- Overfitting can be avoided using weight decay or early stopping.
- There are also NN which feed information back (recurrent NN).
- Many more interesting NNs: Boltzmann machines, self-organizing maps, ...
