9/6/2018
Gradient Descent (GD)
Hantao Zhang
Deep Learning with Python
Reading: https://en.wikipedia.org/wiki/gradient_descent

Artificial Neuron (Perceptron)
<w, x> = w^T x = w0 x0 + w1 x1 + w2 x2 + ... + wd xd, where x0 = 1.
Many monotonic functions can be used as the activation function f: y = f(<w, x>).
The value of the bias w0 decides the threshold at which the neuron fires.
Perceptron Learning
A perceptron learns linear decision boundaries, e.g. separating +'s from 0's in the (x1, x2) plane. But it cannot learn XOR-like patterns, where the +'s and 0's sit on opposite diagonals. With threshold 0.5, the XOR constraints are contradictory:
x1 = 0, x2 = 0: w1*0 + w2*0 = 0 < 0.5 (output 0)
x1 = 1, x2 = 0: w1*1 + w2*0 = w1 > 0.5 (output 1)
x1 = 0, x2 = 1: w1*0 + w2*1 = w2 > 0.5 (output 1)
x1 = 1, x2 = 1: w1 + w2 < 0.5 -- impossible!

Multilayer NNs are universal function approximators
Input, output, and an arbitrary number of hidden layers:
1 hidden layer is sufficient for a DNF representation of any Boolean function -- one hidden node per positive conjunct, output node set to the OR function.
2 hidden layers allow an arbitrary number of labeled clusters.
1 hidden layer is sufficient to approximate all bounded continuous functions.
1 hidden layer was the most common in practice, but recently deep networks show excellent results!
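The impossibility argument above can be checked mechanically. The sketch below (not from the slides; the weight grid and range are illustrative) brute-forces small weights to show a single threshold unit fits OR but never XOR:

```python
# Brute-force check: a single perceptron with threshold 0.5 can represent OR
# but no weight pair (w1, w2) in the grid satisfies the XOR constraints.
import itertools

def perceptron(w1, w2, x1, x2, threshold=0.5):
    return 1 if w1 * x1 + w2 * x2 > threshold else 0

def fits(target, w1, w2):
    return all(perceptron(w1, w2, x1, x2) == target[(x1, x2)]
               for x1 in (0, 1) for x2 in (0, 1))

OR  = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

grid = [i / 10 for i in range(-20, 21)]          # weights in [-2, 2]
or_ok  = any(fits(OR,  w1, w2) for w1, w2 in itertools.product(grid, grid))
xor_ok = any(fits(XOR, w1, w2) for w1, w2 in itertools.product(grid, grid))
print(or_ok, xor_ok)   # True False
```

Any finer grid or wider range gives the same result for XOR, since w1 > 0.5, w2 > 0.5, and w1 + w2 < 0.5 cannot hold simultaneously.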
Solving the XOR Problem
Network topology: 2 hidden nodes (y1, y2), 1 output node (y3).
Activation function: step(x) = 1 if x > 0; 0 otherwise.
Weights: w11 = w21 = 1, w12 = w22 = 1; w01 = -1.5; w02 = -0.5; w03 = -0.5; w13 = -1; w23 = 1.
y1 = step(w11 x1 + w21 x2 + w01)
y2 = step(w12 x1 + w22 x2 + w02)
y3 = step(w13 y1 + w23 y2 + w03)
Desired (XOR): (x1, x2, y) = (0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0); the actual outputs y3 match.

Feed Forward Computation
[Figure: neural network with sigmoid activation functions -- input, hidden layer, output.]
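A quick sketch verifying the hand-built two-layer XOR network. Where the source is illegible, w13 = -1 is assumed; hidden unit 1 then computes AND (threshold 1.5), hidden unit 2 computes OR (threshold 0.5), and the output computes "OR and not AND":

```python
# Feed-forward pass through the 2-2-1 step-activation XOR network.
def step(x):
    return 1 if x > 0 else 0

def xor_net(x1, x2):
    y1 = step(1 * x1 + 1 * x2 - 1.5)   # hidden unit 1: x1 AND x2
    y2 = step(1 * x1 + 1 * x2 - 0.5)   # hidden unit 2: x1 OR x2
    y3 = step(-1 * y1 + 1 * y2 - 0.5)  # output: OR but not AND
    return y3

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))   # truth table of XOR
```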
Neural Net Training
Goal: y = f(x, w). Determine how to change the weights to get the correct output; a weight whose change produces a large change in the error should receive a large adjustment.
Approach:
Compute the actual output y.
Compare it to the desired output y*.
Determine the effect of each weight w on the error = y* - y.
Adjust the weights w.

Cost (Loss, Error) Function
[Figure: neural network with sigmoid activation functions -- input, hidden layer, output.]
Backpropagation
The weights are the parameters to change.
Backpropagation: compute the current output, then work backward to correct the error.
If the activation is a smooth function, use gradient descent.
Linear activation functions (including the identity function) are not useful, as a combination of linear functions is still linear.

XOR Example
xi: i-th sample input vector; w: weight vector; yi*: desired output for the i-th sample; F: output of the neural network; s: the activation function; z1, z2, z3: the weighted sums feeding y1, y2, y3.
Sum-of-squares error over the training samples:
E = 1/2 * sum_i (yi* - F(xi, w))^2
We may use gradient descent to find w so that E is minimized. (From the 6.034 notes of Lozano-Perez.)
Full expression of the output in terms of the inputs and weights:
y3 = F(x, w) = s(w13 s(w11 x1 + w21 x2 + w01) + w23 s(w12 x1 + w22 x2 + w02) + w03)
Gradient Descent Method
Task: find a local minimum value of the function y = f(x).
Method: start at a given point and use the gradient to move toward the minimum.
The gradient is the slope of the function: dy/dx = f'(x). If x is a minimum, then f'(x) = 0.
The least of all the minimum points is called the global minimum; the global minimum is itself a local minimum.
[Figure: f(x) with a global maximum, an inflection point, a local minimum, and the global minimum marked.]

One Variable Function
Starting at x0, the next point is x1 = x0 - eta * f'(x0), where eta is a positive constant called the moving (learning) rate.
Rate parameter eta: large enough to learn quickly, small enough to reach (but not overshoot) the target values.
If looking for the maximum, the next point is x1 = x0 + eta * f'(x0).
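The one-variable update rule can be sketched in a few lines. The target function f(x) = (x - 3)^2, starting point, and rate are illustrative choices, not from the slides:

```python
# Repeated application of x <- x - eta * f'(x) on f(x) = (x - 3)**2,
# whose derivative is f'(x) = 2 * (x - 3) and whose minimizer is x = 3.
def grad_descent_1d(df, x0, eta=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x = x - eta * df(x)   # move against the slope
    return x

x_min = grad_descent_1d(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)   # converges to (very nearly) 3
```

Flipping the sign of the step, x <- x + eta * f'(x), would instead climb toward a maximum, as the slide notes.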
One Variable Function
Pick a random starting point on f(x).

One Variable Function
Compute the gradient at that point (by calculus).
One Variable Function
Move through parameter space in the direction of the negative gradient, one step at a time:
eta * f'(x) = amount to move; eta = learning rate.
One Variable Function
Stop when we don't move any more: eta * f'(x) ~ 0.

Two Variable Function
f(x1, x2) = x1^2 + 5 x2^2
Partial derivatives: df/dx1 = 2 x1, df/dx2 = 10 x2.
[Figure: contour plot of f over [-1, 1] x [-1, 1].]
The gradient descent step at the point (x1, x2):
(x1 - eta (2 x1), x2 - eta (10 x2))
Two Variable Function
[Figure: a descent path on the contour plot; a saddle point and a local minimum marked.]

Multi Variable Function
First compute the partial derivatives:
grad f = (df/dx1, ..., df/dxn)
Then, for any given point (x1, x2, ..., xn), change each xi by -eta * df/dxi (x1, x2, ..., xn) until satisfaction, or pick another point and start over.
Summary: the Gradient Descent Method is a greedy optimization algorithm. To find a local minimum of a function, each variable takes a step proportional to the negative of the partial derivative of the function at the current point.
The Gradient Descent Algorithm
Data: x0 in R^n.
Step 0: set i = 0.
Step 1: if grad f(xi) = 0, stop; else compute the search direction hi = -grad f(xi).
Step 2: compute the step size lambda_i = argmin over lambda >= 0 of f(xi + lambda * hi).
Step 3: set x(i+1) = xi + lambda_i * hi, go to Step 1.
In practice, various learning rates are tried in place of the exact minimization in Step 2.

Example
Given a function f(x1, x2) built from sine terms, find the minimum when x1 is allowed to vary from 0.5 to 1.5 and x2 is allowed to vary from 0 to 2.
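Steps 0-3 can be sketched directly. Since the slide's sine-based example is not fully legible, the quadratic f(x1, x2) = x1^2 + 5 x2^2 stands in as the test function, and a coarse grid search over lambda stands in for the exact argmin of Step 2:

```python
# Gradient descent with a per-iteration line search for the step size.
def gradient_descent(f, grad, x, tol=1e-8, max_iter=100):
    for _ in range(max_iter):
        g = grad(x)
        if sum(gi * gi for gi in g) < tol:        # Step 1: grad f ~ 0, stop
            break
        h = [-gi for gi in g]                     # Step 1: search direction
        lams = [k / 100 for k in range(1, 201)]   # Step 2: crude argmin grid
        lam = min(lams, key=lambda l: f([xi + l * hi for xi, hi in zip(x, h)]))
        x = [xi + lam * hi for xi, hi in zip(x, h)]   # Step 3: take the step
    return x

f = lambda x: x[0] ** 2 + 5 * x[1] ** 2
grad = lambda x: [2 * x[0], 10 * x[1]]
result = gradient_descent(f, grad, [1.0, 1.0])
print(result)   # close to the minimum at [0, 0]
```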
Gradient descent oscillations
We wish to descend smoothly, straight toward the optimum.
In practice the actual path may zig-zag, and is slow to converge to the (local) optimum.
Lowering the learning rate = smaller steps in SGD:
- less ping-ponging
- but it takes longer to get to the optimum

Learning Rate
[Figure: effect of the learning rate on the descent path.]
Picking the learning rate
Use grid search in log space over small values on a tuning set: e.g., 0.01, 0.001, ...
Sometimes, decrease the rate after each pass: e.g., by a factor of 1/(1 + d*t), where t is the pass number; sometimes by 1/sqrt(t).
Fancier techniques -- adaptive gradients that scale the gradient differently for each dimension (AdaGrad, Adam, ...).

Pros and Cons of Gradient Descent
Simple and often quite effective on machine learning tasks.
Often very scalable.
Only applies to smooth (differentiable) functions.
Better in general than other search methods, such as local search.
Might find a local minimum rather than a global one.
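A small sketch of the 1/(1 + d*t) decay schedule; the base rate 0.1 and decay constant d = 0.5 are illustrative values, not from the slides:

```python
# Learning rate decayed after each pass t: eta(t) = eta0 / (1 + d * t).
def decayed_rate(eta0, d, t):
    return eta0 / (1 + d * t)

rates = [decayed_rate(0.1, 0.5, t) for t in range(4)]
print(rates)   # 0.1, then 0.0666..., 0.05, 0.04
```

Early passes take large steps to make quick progress; later passes take small steps so the iterate can settle near the optimum instead of ping-ponging.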
Using Gradient Descent for NN
What functions are used in an NN?
Cost functions: e.g., f(xi, w) = 1/2 (y* - yi)^2
Activation functions: e.g., s(a) = 1/(1 + e^-a)
Linear functions: e.g., x . w
Composed functions: e.g., sigmoid(x . w)
How to compute derivatives with respect to w? Replace sign(x . w) with something differentiable: e.g., sigmoid(x . w).

Computation of the derivative
The derivative of f: R -> R is the function f': R -> R given by
f'(x) = df/dx = lim as h -> 0 of (f(x + h) - f(x)) / h
if the limit exists.
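The limit definition can be approximated numerically by fixing a small h; the function f(x) = x^2 and the point x = 3 below are illustrative choices:

```python
# Finite-difference approximation of the limit definition of the derivative.
def deriv(f, x, h=1e-7):
    return (f(x + h) - f(x)) / h

print(deriv(lambda x: x * x, 3.0))   # close to f'(3) = 6
```

Such numeric derivatives are a standard sanity check for the analytic gradients derived on the following slides.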
Rules for Differentiation
Constant: d/dx c = 0
Power: d/dx x^n = n x^(n-1)
Sum: d(u + v)/dx = du/dx + dv/dx
Exp: d/dx e^x = e^x
Product: d(uv)/dx = u dv/dx + v du/dx
Log: d/dx ln x = 1/x
Quotient: d(u/v)/dx = (v du/dx - u dv/dx) / v^2
Chain rule: if y = f(u) and u = g(x), i.e., y is the composite f(g(x)), then dy/dx = (dy/du)(du/dx) = f'(g(x)) g'(x).

Example: the sigmoid function y = s(x) = 1/(1 + e^-x)
y = 1/u, dy/du = -1/u^2 (by the quotient rule)
u = 1 + v, du/dv = 1 (by the sum and power rules)
v = e^w, dv/dw = e^w (by the exponential rule)
w = -x, dw/dx = -1 (by the product and power rules)
dy/dx = (dy/du)(du/dv)(dv/dw)(dw/dx) = (-1/u^2)(1)(e^w)(-1) = e^-x / u^2 = y(1 - y) = s(x)(1 - s(x))
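The chain-rule result s'(x) = s(x)(1 - s(x)) can be spot-checked against a central finite difference; the test points are arbitrary:

```python
# Numeric check of the sigmoid derivative identity ds/dx = s(x)(1 - s(x)).
import math

def s(x):
    return 1 / (1 + math.exp(-x))

def numeric_deriv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)   # central difference

for x in (-2.0, 0.0, 1.5):
    print(numeric_deriv(s, x), s(x) * (1 - s(x)))   # the two columns agree
```

At x = 0, both sides equal 0.25, the maximum slope of the sigmoid.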
Sigmoid Activation Function
s(u) = 1/(1 + e^-u), so P(Y | X) = 1/(1 + e^(-w . x)).
Derivative of the sigmoid: ds(z)/dz = s(z)(1 - s(z)).
[Figure: the sigmoid curve over [-5, 5], rising from 0 to 1; its derivative peaks at 0.25.]

Next: Derivative of Logistic Regression
<w, x> = w^T x = w0 x0 + w1 x1 + w2 x2 + ... + wn xn, where x0 = 1.
Sigmoid function: sigma(x), with d sigma(z)/dz = sigma(z)(1 - sigma(z)).
Logistic regression: f(x, w) = sigma(<w, x>).
grad f = (df/dw1, df/dw2, ..., df/dwn) = ?
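By the chain rule, the answer works out to grad f = sigma(w . x) (1 - sigma(w . x)) x. A quick numeric spot-check (the weight and input vectors are arbitrary illustration values):

```python
# Analytic vs numeric gradient of f(x, w) = sigmoid(w . x) with respect to w.
import math

def s(z):
    return 1 / (1 + math.exp(-z))

def f(w, x):
    return s(sum(wi * xi for wi, xi in zip(w, x)))

w, x = [0.3, -0.2, 0.5], [1.0, 2.0, -1.0]
a = f(w, x)
analytic = [a * (1 - a) * xi for xi in x]   # sigma(z)(1 - sigma(z)) * x

h = 1e-6
numeric = []
for j in range(len(w)):
    wp = w[:]; wp[j] += h
    wm = w[:]; wm[j] -= h
    numeric.append((f(wp, x) - f(wm, x)) / (2 * h))

print(analytic)
print(numeric)   # matches the analytic gradient closely
```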
Alternative Activation Functions
The logistic function is not widely used in modern NNs.
Hyperbolic tangent: t(z) = (1 - e^-2z)/(1 + e^-2z)
Derivative of the hyperbolic tangent: dt(z)/dz = (1 + t(z))(1 - t(z))
Like the logistic function, but shifted to the range [-1, +1].

Alternative Activation Functions
Rectified Linear Unit (ReLU): relu(a) = max(0, a)
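The two alternatives above can be sketched directly from their formulas; the check point z = 0.7 is arbitrary:

```python
# tanh via the slide's formula, its derivative identity, and ReLU.
import math

def t(z):
    return (1 - math.exp(-2 * z)) / (1 + math.exp(-2 * z))

def relu(a):
    return max(0.0, a)

z = 0.7
num = (t(z + 1e-6) - t(z - 1e-6)) / 2e-6          # numeric derivative
print(num, (1 + t(z)) * (1 - t(z)))               # matches (1+t)(1-t)
print(relu(-3.0), relu(2.5))                      # 0.0 2.5
```

The formula for t(z) agrees with the library function math.tanh; the derivative identity (1 + t)(1 - t) is just 1 - t^2 factored.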
Alternative Activation Functions
Soft version of relu: r(x) = ln(e^x + 1)
Doesn't saturate (at one end); helps with the vanishing gradient.
Derivative of the soft relu: dr(x)/dx = 1/(1 + e^-x) = s(x)

Test errors: sigmoid vs. tanh
[Figure: test errors for depth-4 networks, from Glorot & Bengio, AISTATS 2010.]
XOR Example: Gradient of Error
y3 = F(x, w) = s(w13 s(w11 x1 + w21 x2 + w01) + w23 s(w12 x1 + w22 x2 + w02) + w03)
E = 1/2 * sum_i (yi* - F(xi, w))^2
dE/dw13 = sum_i -(yi* - y3) * dy3/dw13
dy3/dw13 = (ds(z3)/dz3)(dz3/dw13) = (ds(z3)/dz3) * y1
If the sigmoid is used, ds(zi)/dzi = s(zi)(1 - s(zi)) = yi(1 - yi).
For a first-layer weight the chain rule runs through z3 and z1:
dy3/dw11 = (ds(z3)/dz3)(dz3/dw11) = (ds(z3)/dz3) * w13 * (ds(z1)/dz1)(dz1/dw11) = (ds(z3)/dz3) * w13 * (ds(z1)/dz1) * x1

Backprop Example: XOR
How to compute the updates for a general NN? Using the sigmoid and the quadratic error, with yi = s(zi) and ds(z)/dz = s(z)(1 - s(z)), the updates for all w are:
delta3 = (y3* - y3)
delta1 = w13 * y3(1 - y3) * delta3
delta2 = w23 * y3(1 - y3) * delta3
w13 <- w13 + eta * y1 * y3(1 - y3) * delta3
w23 <- w23 + eta * y2 * y3(1 - y3) * delta3
w03 <- w03 + eta * 1 * y3(1 - y3) * delta3
w11 <- w11 + eta * x1 * y1(1 - y1) * delta1
w21 <- w21 + eta * x2 * y1(1 - y1) * delta1
w01 <- w01 + eta * 1 * y1(1 - y1) * delta1
w12 <- w12 + eta * x1 * y2(1 - y2) * delta2
w22 <- w22 + eta * x2 * y2(1 - y2) * delta2
w02 <- w02 + eta * 1 * y2(1 - y2) * delta2
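The whole derivation can be exercised end to end. In the sketch below, the backprop signals are checked against a numeric derivative of E on one sample, then used to train the 2-2-1 sigmoid network on XOR; the random seed, learning rate, and epoch count are illustrative choices, not from the slides:

```python
# Backprop for the 2-2-1 sigmoid network, verified numerically, then trained.
import math
import random

def s(z):
    return 1 / (1 + math.exp(-z))

names = ("w11", "w21", "w01", "w12", "w22", "w02", "w13", "w23", "w03")
random.seed(1)
w = {k: random.uniform(-1, 1) for k in names}
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def forward(wts, x1, x2):
    y1 = s(wts["w11"] * x1 + wts["w21"] * x2 + wts["w01"])
    y2 = s(wts["w12"] * x1 + wts["w22"] * x2 + wts["w02"])
    y3 = s(wts["w13"] * y1 + wts["w23"] * y2 + wts["w03"])
    return y1, y2, y3

def grads(wts, x1, x2, y_star):
    # error signals with the sigmoid derivative folded in
    y1, y2, y3 = forward(wts, x1, x2)
    d3 = (y_star - y3) * y3 * (1 - y3)
    d1 = wts["w13"] * d3 * y1 * (1 - y1)
    d2 = wts["w23"] * d3 * y2 * (1 - y2)
    return {"w13": d3 * y1, "w23": d3 * y2, "w03": d3,
            "w11": d1 * x1, "w21": d1 * x2, "w01": d1,
            "w12": d2 * x1, "w22": d2 * x2, "w02": d2}

# numeric check on one sample: g[k] should equal -dE/dk for E = 1/2 (y* - y3)^2
(x1, x2), y_star = data[1]
g = grads(w, x1, x2, y_star)
h = 1e-6
diffs = []
for k in names:
    wp = dict(w); wp[k] += h
    wm = dict(w); wm[k] -= h
    Ep = 0.5 * (y_star - forward(wp, x1, x2)[2]) ** 2
    Em = 0.5 * (y_star - forward(wm, x1, x2)[2]) ** 2
    diffs.append(abs(g[k] - (-(Ep - Em) / (2 * h))))
max_diff = max(diffs)
print(max_diff)   # tiny: analytic and numeric gradients agree

# online training with the update w <- w + eta * g
eta, epochs = 0.5, 20000
for _ in range(epochs):
    for (a, b), y in data:
        g = grads(w, a, b, y)
        for k in names:
            w[k] += eta * g[k]
preds = [round(forward(w, a, b)[2]) for (a, b), _ in data]
print(preds)
```

Note the plus sign in the update: the step adds -dE/dw, i.e., it descends the error surface.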