Neural Networks
Intro to AI
Bert Huang, Virginia Tech
Outline
- Biological inspiration for artificial neural networks
- Linear vs. nonlinear functions
- Learning with neural networks: back propagation
[Image credits: https://en.wikipedia.org/wiki/neuron#/media/file:gfpneuron.png and https://en.wikipedia.org/wiki/neuron#/media/file:chemical_synapse_schema_cropped.jpg]
Parameterizing p(y | x)
p(y | x) := f(x), where f : R^d → [0, 1]
f(x) := 1 / (1 + exp(−w^T x))
Logistic Function
σ(x) = 1 / (1 + exp(−x))
lim_{x→∞} σ(x) = lim_{x→∞} 1 / (1 + exp(−x)) = 1 / 1 = 1.0
σ(0) = 1 / (1 + exp(0)) = 1 / (1 + 1) = 0.5
lim_{x→−∞} σ(x) = lim_{x→−∞} 1 / (1 + exp(−x)) = 0.0
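To make these values concrete, here is a minimal Python sketch (assuming numpy is available) that evaluates the logistic function at the midpoint and far into both tails:

```python
import numpy as np

def sigmoid(x):
    # Logistic function: sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))     # 0.5: the midpoint
print(sigmoid(100.0))   # ~1.0: approaching the limit as x -> +infinity
print(sigmoid(-100.0))  # ~0.0: approaching the limit as x -> -infinity
```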
From Features to Probability
f(x) := 1 / (1 + exp(−w^T x))
Parameterizing p(y | x)
p(y | x) := f(x), where f : R^d → [0, 1]
f(x) := 1 / (1 + exp(−w^T x))
[Diagram: inputs x1 … x5 connect through weights w1 … w5 to the output unit y]
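A minimal sketch of this model in code: logistic regression as a one-layer network. The weight and feature values below are made up to match the five inputs shown in the diagram:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_proba(w, x):
    # p(y | x) = sigma(w^T x)
    return sigmoid(w @ x)

w = np.array([0.5, -1.2, 0.3, 2.0, -0.7])  # hypothetical weights w1..w5
x = np.array([1.0, 0.0, 1.0, 1.0, 0.5])    # hypothetical features x1..x5
print(predict_proba(w, x))                 # a probability in [0, 1]
```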
Multi-Layered Perceptron
[Diagram: inputs x1 … x5 connect through weights w1 … w5 to a hidden unit h1]
Multi-Layered Perceptron
[Diagram: raw data (pixel values) x1 … x5 feed a learned representation (hidden units h1, h2 detecting shapes, shadows, round regions), which feeds the prediction y (faces)]
Multi-Layered Perceptron
h1 = σ(w11^T x)
h2 = σ(w12^T x)
h = [h1, h2]^T
p(y | x) = σ(w21^T h) = σ(w21^T [σ(w11^T x), σ(w12^T x)]^T)
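A minimal sketch of this two-hidden-unit network's forward pass; the weight vectors below are illustrative values, not from the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([1.0, 0.0, 1.0, 1.0, 0.5])      # hypothetical inputs x1..x5
w11 = np.array([0.4, -0.2, 0.1, 0.9, -0.5])  # weights into hidden unit h1
w12 = np.array([-0.3, 0.8, 0.2, -0.1, 0.6])  # weights into hidden unit h2
w21 = np.array([1.5, -2.0])                  # weights from h = [h1, h2] to y

h = np.array([sigmoid(w11 @ x), sigmoid(w12 @ x)])  # hidden activations
print(sigmoid(w21 @ h))                             # p(y | x)
```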
Decision Surface: Logistic Regression
Decision Surface: 2-Layer, 2 Hidden Units
Decision Surface: 2-Layer, More Hidden Units
[Panels: 3 hidden units; 10 hidden units]
Decision Surface: More Layers, More Hidden Units
[Panels: 10 layers with 5 hidden units per layer; 4 layers with 10 hidden units per layer]
Training Neural Networks
min_W (1/n) Σ_{i=1}^n ℓ(f(x_i, W), y_i) = L(W)   (average training error)
Gradient descent: W_j ← W_j − α (∂L/∂W_j)
Gradient Descent
W_j ← W_j − α (∂L/∂W_j)
[Figure: at W1 the slope L′(W1) is very positive, so take a big step left; at W2 the slope is almost zero, so take a tiny step left; at W3 the slope is slightly negative, so take a medium step right]
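A minimal gradient descent sketch on a made-up one-dimensional loss L(W) = (W − 3)², using the update rule above:

```python
def dL_dW(W):
    # Derivative of the toy loss L(W) = (W - 3)^2
    return 2.0 * (W - 3.0)

W, alpha = 0.0, 0.1
for _ in range(50):
    W = W - alpha * dL_dW(W)  # large gradient -> large step; near-zero -> tiny step
print(W)                      # converges toward the minimizer W = 3
```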
Approximate Q-Learning
Q̂(s, a) := g(s, a, θ) := θ1 f1(s, a) + θ2 f2(s, a) + … + θd fd(s, a)
θi ← θi + α [R(s) + γ max_{a′} Q̂(s′, a′) − Q̂(s, a)] ∂g/∂θi
θi ← θi + α [R(s) + γ max_{a′} Q̂(s′, a′) − Q̂(s, a)] fi(s, a)
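A hedged sketch of this update in code; the feature vectors, reward, and learning parameters below are hypothetical placeholders:

```python
import numpy as np

def q_hat(theta, f):
    # Qhat(s, a) = theta_1 f_1(s,a) + ... + theta_d f_d(s,a)
    return theta @ f

theta = np.zeros(3)
alpha, gamma = 0.1, 0.9
f_sa = np.array([1.0, 0.5, -0.2])        # f_i(s, a) for the action taken
f_next = [np.array([0.2, 0.1, 0.0]),     # f_i(s', a') for each action a' in s'
          np.array([0.9, -0.3, 0.4])]
reward = 1.0                             # R(s)

td_error = reward + gamma * max(q_hat(theta, f) for f in f_next) - q_hat(theta, f_sa)
theta = theta + alpha * td_error * f_sa  # each theta_i update uses dg/dtheta_i = f_i(s, a)
print(theta)
```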
Back Propagation
- Compute hidden unit activations: forward propagation
- Compute gradient at output layer: error
- Propagate error back one layer at a time
- Chain rule via dynamic programming
Chain Rule Review
d f(g(x)) / d x = [d f(g(x)) / d g(x)] · [d g(x) / d x]
[Diagram: x → g(x) → f(g(x))]
Chain Rule on More Complex Function
h(f(a) + g(b))
d h(f(a) + g(b)) / d a = [d h(f(a) + g(b)) / d (f(a) + g(b))] · [d (f(a) + g(b)) / d a]
d h(f(a) + g(b)) / d b = [d h(f(a) + g(b)) / d (f(a) + g(b))] · [d (f(a) + g(b)) / d b]
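One way to sanity-check this rule is to compare the analytic derivative against a finite-difference estimate; the functions f, g, and h below are arbitrary examples chosen for illustration:

```python
import math

f = lambda a: a ** 2       # f'(a) = 2a
g = lambda b: math.sin(b)
h = lambda u: math.exp(u)  # h'(u) = exp(u)

a, b, eps = 0.7, 1.3, 1e-6
analytic = math.exp(f(a) + g(b)) * (2 * a)               # h'(f(a)+g(b)) * f'(a)
numeric = (h(f(a + eps) + g(b)) - h(f(a) + g(b))) / eps  # finite difference in a
print(analytic, numeric)                                 # should agree closely
```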
Back to Neural Networks
[Diagram: raw input x → weights w1 → hidden layer h1 → … → weights w_{n−1} → hidden layer h_{n−1} → weights w_n → final output y]
h1 = f(x, w1)
h_{n−1} = f(h_{n−2}, w_{n−1})
y = f(h_{n−1}, w_n)
With loss L(y), the last layer's gradient follows from the chain rule:
d L / d w_n = (d L / d y) · (d y / d w_n)
Intuition of Back Propagation
- At each layer, calculate how much changing the input changes the final output (the derivative of the final output w.r.t. the layer's input)
- Calculate this directly for the last layer
- For preceding layers, use the calculation from the next layer and work backwards through the network
- Use that derivative to find how changing the weights affects the error of the final output
FYI: Matrix Form
h1 = s(W1 x)
h2 = s(W2 h1)
…
h_{m−1} = s(W_{m−1} h_{m−2})
f(x, W) = s(W_m h_{m−1})
J(W) = ℓ(f(x, W))
(You will not be tested on this matrix form in this course.)
FYI: Matrix Gradient Recipe
Feed-forward propagation:
h1 = s(W1 x), h_i = s(W_i h_{i−1}), f(x, W) = s(W_m h_{m−1}), J(W) = ℓ(f(x, W))
Back propagation:
δ_m = ℓ′(f(x, W)), ∇_{W_m} J = δ_m h_{m−1}^T
δ_i = (W_{i+1}^T δ_{i+1}) ∘ s′(W_i h_{i−1}), ∇_{W_i} J = δ_i h_{i−1}^T
δ_1 gives ∇_{W_1} J = δ_1 x^T
(You will not be tested on this matrix form in this course.)
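A minimal sketch of this recipe as a loop, for a made-up three-layer network; it assumes logistic activations and a cross-entropy loss, under which the output error δ_m simplifies to f(x, W) − y:

```python
import numpy as np

def s(z):
    return 1.0 / (1.0 + np.exp(-z))  # elementwise logistic activation

def s_prime(z):
    return s(z) * (1.0 - s(z))       # its derivative

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)),       # W1: 3 inputs -> 4 hidden units
      rng.normal(size=(4, 4)),       # W2: 4 hidden -> 4 hidden
      rng.normal(size=(1, 4))]       # W3: 4 hidden -> 1 output
x = rng.normal(size=(3, 1))
y = np.array([[1.0]])

# Feed-forward propagation: h_i = s(W_i h_{i-1}), with h_0 = x
hs = [x]
for W in Ws:
    hs.append(s(W @ hs[-1]))

# Back propagation: the error delta flows backward one layer at a time
delta = hs[-1] - y                   # delta_m under the assumed loss/activation
grads = [None] * len(Ws)
for i in range(len(Ws) - 1, -1, -1):
    grads[i] = delta @ hs[i].T       # grad_{W_i} J = delta_i h_{i-1}^T
    if i > 0:
        delta = (Ws[i].T @ delta) * s_prime(Ws[i - 1] @ hs[i - 1])

print([g.shape for g in grads])      # one gradient per weight matrix
```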
Other New Aspects of Deep Learning
- GPU computation
- Differentiable programming
- Automatic differentiation (see the sketch below)
- Neural network structures
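As one illustration of automatic differentiation, here is a toy forward-mode version built on dual numbers; real frameworks are far more general, and the Dual class below is purely illustrative:

```python
class Dual:
    # A number paired with its derivative; arithmetic applies the chain rule.
    def __init__(self, val, dot):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

x = Dual(3.0, 1.0)    # seed the derivative dx/dx = 1
y = x * x + x         # f(x) = x^2 + x
print(y.val, y.dot)   # 12.0 and f'(3) = 7.0
```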
Types of Neural Network Structures
- Feed-forward
- Recurrent neural networks (RNNs): good for analyzing sequences (text, time series)
- Convolutional neural networks (convnets, CNNs): good for analyzing spatial data (images, videos)
Outline
- Biological inspiration for artificial neural networks
- Linear vs. nonlinear functions
- Learning with neural networks: back propagation