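Gradient Descent etc.
EE 13: Networked estimation and control (Prof. Khan)

I. DERIVATIVE

Consider $f : \mathbb{R} \to \mathbb{R}$, $x \mapsto f(x)$. The derivative is defined as
\[ \frac{d}{dx} f(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}. \]
The chain rule states that
\[ \frac{d}{dx} f(g(x)) = \frac{d f(g(x))}{d g(x)} \cdot \frac{d}{dx} g(x). \]
How do we arrive at the chain rule from the definition?

Consider $z = f(x, y)$, $x = g(t)$, $y = h(t)$. Then
\[ \frac{d}{dt} f(x, y) = \frac{\partial}{\partial x} f(x, y)\,\frac{dx}{dt} + \frac{\partial}{\partial y} f(x, y)\,\frac{dy}{dt} = \frac{\partial f(x, y)}{\partial x}\frac{dx}{dt} + \frac{\partial f(x, y)}{\partial y}\frac{dy}{dt}. \]

II. DERIVATIVE AND DESCENT

Consider $f : \mathbb{R} \to \mathbb{R}$. The derivative is defined as
\[ f'(x) \triangleq \frac{d}{dx} f(x) = \lim_{\epsilon \to 0} \frac{f(x + \epsilon) - f(x)}{\epsilon}, \]
so that, for small $\epsilon$,
\[ f(x + \epsilon) \approx f(x) + \epsilon f'(x). \]
Hence, if $f'(x) > 0$, then $f$ decreases to the left. In other words, if $f'(x) \neq 0$ and $\epsilon$ is a small enough positive number,
\[ f\big(x - \epsilon\,\operatorname{sign}(f'(x))\big) < f(x). \]
It implies that an algorithm where $x$ is updated as
\[ x \leftarrow x - \epsilon\,\operatorname{sign}(f'(x)) \]
moves in the direction of decreasing $f(x)$. A plausible such algorithm is
\[ x_{k+1} = x_k - \epsilon f'(x_k), \]
which is called gradient descent. It does work for a small enough $\epsilon$, but what is a proper bound?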
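The question of how small $\epsilon$ must be can be explored numerically. The following is a minimal sketch, assuming the illustrative function $f(x) = x^2$ (so $f'(x) = 2x$); the function, the starting point, and the two step sizes are choices made here for illustration and do not come from the notes.

```python
def gradient_descent_1d(df, x0, eps, num_steps):
    """Iterate x_{k+1} = x_k - eps * f'(x_k) and return all iterates."""
    xs = [x0]
    for _ in range(num_steps):
        xs.append(xs[-1] - eps * df(xs[-1]))
    return xs


# Illustrative choice (an assumption, not from the notes): f(x) = x^2.
def f(x):
    return x * x


def df(x):
    return 2.0 * x


# Each step multiplies x by (1 - 2 * eps).
small = gradient_descent_1d(df, x0=1.0, eps=0.1, num_steps=50)  # factor 0.8
large = gradient_descent_1d(df, x0=1.0, eps=1.1, num_steps=50)  # factor -1.2

print(abs(small[-1]))  # shrinks toward the minimizer x = 0
print(abs(large[-1]))  # grows: this step is too large for this f
```

For this particular $f$ each update multiplies $x$ by $(1 - 2\epsilon)$, so the iterates contract only when $0 < \epsilon < 1$.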
III. CONVEX FUNCTIONS

A convex function $f : \mathbb{R} \to \mathbb{R}$ is such that, $\forall x_1, x_2 \in \mathbb{R}$ and $t \in [0, 1]$,
\[ f(t x_1 + (1 - t) x_2) \le t f(x_1) + (1 - t) f(x_2). \]
Let $f'$ be the derivative of $f$. First, $f$ may not be differentiable everywhere, e.g., $|x|$.

A differentiable function of one variable is convex on an interval if and only if its derivative is monotonically non-decreasing on that interval. For the basic case of a differentiable function from (a subset of) the real numbers to the real numbers, convexity is therefore equivalent to the slope never decreasing; the function itself need not be increasing.

A differentiable function of one variable is convex on an interval if and only if the function lies above all of its tangents:
\[ f(x) \ge f(y) + f'(y)(x - y) \]
for all $x$ and $y$ in the interval. In particular, if $f'(c) = 0$, then $c$ is a global minimum of $f(x)$.

Tangent: Suppose that a curve is given as the graph of a function $y = f(x)$. To find the tangent line at the point $p = (a, f(a))$, consider another nearby point $q = (a + h, f(a + h))$ on the curve. The slope of the secant line passing through $p$ and $q$ is equal to the difference quotient
\[ \frac{f(a + h) - f(a)}{h}. \]
Hence the secant line can be defined as
\[ y - f(a) = \frac{f(a + h) - f(a)}{(a + h) - a}\,(x - a). \]
As $h \to 0$ we get
\[ y - f(a) = f'(a)(x - a), \qquad y = f(a) + f'(a)(x - a). \]
As an example, consider $f(x) = x^2$. Then the tangent at a point $a \in \mathbb{R}$ is defined by the line
\[ y = a^2 + 2a(x - a) = a^2 + 2ax - 2a^2 = -a^2 + 2ax. \]
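The tangent characterization can be checked numerically for the example $f(x) = x^2$. The sketch below is only illustrative: the helper name `tangent_line` and the grid over $[-3, 3]$ are assumptions made here. It evaluates the gap $f(x) - \big(f(a) + f'(a)(x - a)\big)$ over all grid pairs $(a, x)$ and reports the smallest one, which should not be negative for a convex function.

```python
def tangent_line(f, df, a):
    """Return the tangent to f at a, i.e. x -> f(a) + f'(a) * (x - a)."""
    return lambda x: f(a) + df(a) * (x - a)


def f(x):
    return x * x


def df(x):
    return 2.0 * x


# Sample points on a grid over [-3, 3] (an arbitrary illustrative range).
grid = [i / 10.0 for i in range(-30, 31)]

# Smallest gap f(x) - tangent_a(x) over all pairs (a, x) on the grid.
min_gap = min(f(x) - tangent_line(f, df, a)(x) for a in grid for x in grid)
print(min_gap)  # non-negative up to rounding: f lies above every tangent
```

For $f(x) = x^2$ the gap equals $(x - a)^2$, so it vanishes only at the point of tangency $x = a$.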
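IV. METHOD OF STEEPEST DESCENT

Convex: A convex function $f : \mathbb{R}^n \to \mathbb{R}$ is such that, $\forall x_1, x_2 \in \mathbb{R}^n$ and $t \in [0, 1]$,
\[ f(t x_1 + (1 - t) x_2) \le t f(x_1) + (1 - t) f(x_2). \]
A function $f : \mathbb{R}^n \to \mathbb{R}$ with Lipschitz-continuous gradient is such that, for some $L > 0$,
\[ \|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2, \qquad \forall x, y \in \mathbb{R}^n. \]
Such a function is also called $L$-smooth.

Lemma 1. If $f$ is $L$-smooth, then
\[ f(x) - f(y) - \nabla f(y)^\top (x - y) \le \frac{L}{2} \|x - y\|_2^2. \]

Proof. Consider $g(t) = f(y + t(x - y))$. We have $g(0) = f(y)$, $g(1) = f(x)$, and
\[ g'(t) = \nabla f(y + t(x - y))^\top (x - y). \]
We have
\[ \int_0^1 g'(t)\,dt = g(1) - g(0). \]
Finally, we have
\begin{align*}
f(x) - f(y) - \nabla f(y)^\top (x - y)
&= g(1) - g(0) - \nabla f(y)^\top (x - y) \\
&= \int_0^1 \nabla f(y + t(x - y))^\top (x - y)\,dt - \int_0^1 \nabla f(y)^\top (x - y)\,dt \\
&= \int_0^1 \big(\nabla f(y + t(x - y)) - \nabla f(y)\big)^\top (x - y)\,dt \\
&\le \int_0^1 \|\nabla f(y + t(x - y)) - \nabla f(y)\|_2\,\|x - y\|_2\,dt \qquad \text{[by Cauchy-Schwarz]} \\
&\le \int_0^1 L\,\|y + t(x - y) - y\|_2\,\|x - y\|_2\,dt \\
&= L \|x - y\|_2^2 \int_0^1 t\,dt \\
&= \frac{L}{2} \|x - y\|_2^2,
\end{align*}
and the proof follows.

For twice-differentiable $f$: if $f$ is strongly convex with constant $m > 0$, then $\nabla^2 f - mI$ is positive semidefinite. If $f$ is $L$-smooth, then $\nabla^2 f - LI$ is negative semidefinite.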
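Lemma 1 can be sanity-checked on a simple quadratic. The sketch below assumes $f(x_1, x_2) = 4x_1^2 + x_2^2$, whose Hessian is $\operatorname{diag}(8, 2)$, so $L = 8$ is a valid smoothness constant; the helper name `lemma_gap`, the sampling range, and the seed are illustrative assumptions. It samples random pairs $(x, y)$ and reports the smallest value of the right-hand side minus the left-hand side of the lemma.

```python
import random


def f(x):
    return 4.0 * x[0] ** 2 + x[1] ** 2


def grad_f(x):
    return (8.0 * x[0], 2.0 * x[1])


L = 8.0  # largest eigenvalue of the Hessian diag(8, 2)


def lemma_gap(x, y):
    """Return (L/2)||x - y||^2 - [f(x) - f(y) - grad_f(y)^T (x - y)]."""
    d = (x[0] - y[0], x[1] - y[1])
    g = grad_f(y)
    lhs = f(x) - f(y) - (g[0] * d[0] + g[1] * d[1])
    rhs = 0.5 * L * (d[0] ** 2 + d[1] ** 2)
    return rhs - lhs


rng = random.Random(0)  # fixed seed so the run is repeatable


def random_point():
    return (rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0))


gaps = [lemma_gap(random_point(), random_point()) for _ in range(1000)]
print(min(gaps))  # non-negative up to rounding, as Lemma 1 predicts
```

For this quadratic the gap works out to $3(x_2 - y_2)^2$, so the bound is tight along the $x_1$ axis, the direction of largest curvature.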
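http://www.stat.cmu.edu/~ryantibs/convexopt-f13/scribes/lec6.pdf

Theorem 1. Let $f : \mathbb{R}^n \to \mathbb{R}$ be convex and differentiable with a Lipschitz-continuous gradient with constant $L > 0$. Then, if we run gradient descent for $k$ iterations with a fixed stepsize $\epsilon \le 1/L$, it will yield a solution $x_k$ such that
\[ f(x_k) - f(x^*) \le \frac{\|x_0 - x^*\|_2^2}{2 \epsilon k}. \]

Proof. Since $f$ has a Lipschitz-continuous gradient, we have from Lemma 1:
\[ f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{1}{2} L \|y - x\|_2^2. \]
Let us plug in the gradient descent update, $y = x_{k+1} = x_k - \epsilon \nabla f(x_k)$, to obtain:
\begin{align*}
f(x_{k+1}) &\le f(x_k) + \nabla f(x_k)^\top (x_{k+1} - x_k) + \frac{1}{2} L \|x_{k+1} - x_k\|_2^2 \\
&= f(x_k) + \nabla f(x_k)^\top \big(-\epsilon \nabla f(x_k)\big) + \frac{1}{2} L \|\epsilon \nabla f(x_k)\|_2^2 \\
&= f(x_k) - \epsilon \|\nabla f(x_k)\|_2^2 + \frac{1}{2} L \epsilon^2 \|\nabla f(x_k)\|_2^2 \\
&= f(x_k) + \Big(\frac{1}{2} L \epsilon - 1\Big) \epsilon \|\nabla f(x_k)\|_2^2.
\end{align*}
With $\epsilon \le 1/L$, this gives
\[ f(x_{k+1}) \le f(x_k) - \frac{1}{2} \epsilon \|\nabla f(x_k)\|_2^2, \tag{1} \]
which shows that $f(x_k)$ decreases at every $k$ unless $\nabla f(x_k) = 0$, which is true when $x_k = x^*$ because $f$ is convex and differentiable.

Since $f$ is convex (it lies above all of its tangents), we have
\[ f(x^*) \ge f(x) + \nabla f(x)^\top (x^* - x), \]
which leads to
\[ f(x_k) \le f(x^*) + \nabla f(x_k)^\top (x_k - x^*). \]
From earlier, we have
\begin{align*}
f(x_{k+1}) &\le f(x_k) - \frac{1}{2} \epsilon \|\nabla f(x_k)\|_2^2 \\
&\le f(x^*) + \nabla f(x_k)^\top (x_k - x^*) - \frac{1}{2} \epsilon \|\nabla f(x_k)\|_2^2,
\end{align*}
so that
\begin{align*}
f(x_{k+1}) - f(x^*) &\le \nabla f(x_k)^\top (x_k - x^*) - \frac{1}{2} \epsilon \|\nabla f(x_k)\|_2^2 \\
&= \frac{1}{2\epsilon} \Big( 2 \epsilon \nabla f(x_k)^\top (x_k - x^*) - \epsilon^2 \|\nabla f(x_k)\|_2^2 \Big) \\
&= \frac{1}{2\epsilon} \Big( \underbrace{2 \epsilon \nabla f(x_k)^\top (x_k - x^*) - \epsilon^2 \|\nabla f(x_k)\|_2^2 - \|x_k - x^*\|_2^2}_{= -\|x_k - x^* - \epsilon \nabla f(x_k)\|_2^2} + \|x_k - x^*\|_2^2 \Big) \\
&= \frac{1}{2\epsilon} \Big( -\|x_k - x^* - \epsilon \nabla f(x_k)\|_2^2 + \|x_k - x^*\|_2^2 \Big).
\end{align*}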
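Continuing, since $x_{k+1} = x_k - \epsilon \nabla f(x_k)$:
\[ f(x_{k+1}) - f(x^*) \le \frac{1}{2\epsilon} \Big( \|x_k - x^*\|_2^2 - \|x_{k+1} - x^*\|_2^2 \Big). \]
Sum both sides over $k = 0, \ldots, K - 1$:
\begin{align*}
\sum_{k=0}^{K-1} \big( f(x_{k+1}) - f(x^*) \big)
&\le \frac{1}{2\epsilon} \sum_{k=0}^{K-1} \Big( \|x_k - x^*\|_2^2 - \|x_{k+1} - x^*\|_2^2 \Big) \\
&= \frac{1}{2\epsilon} \Big( \|x_0 - x^*\|_2^2 - \|x_1 - x^*\|_2^2 + \|x_1 - x^*\|_2^2 - \|x_2 - x^*\|_2^2 + \cdots + \|x_{K-1} - x^*\|_2^2 - \|x_K - x^*\|_2^2 \Big) \\
&= \frac{1}{2\epsilon} \Big( \|x_0 - x^*\|_2^2 - \|x_K - x^*\|_2^2 \Big) \\
&\le \frac{1}{2\epsilon} \|x_0 - x^*\|_2^2.
\end{align*}
Consider
\begin{align*}
K \big( f(x_K) - f(x^*) \big) &= f(x_K) + f(x_K) + \cdots + f(x_K) - K f(x^*) \\
&\le f(x_K) + f(x_{K-1}) + \cdots + f(x_1) - K f(x^*) \qquad \text{[see (1)]} \\
&= \Big( \sum_{k=0}^{K-1} f(x_{k+1}) \Big) - K f(x^*) \\
&= \sum_{k=0}^{K-1} \big( f(x_{k+1}) - f(x^*) \big) \\
&\le \frac{1}{2\epsilon} \|x_0 - x^*\|_2^2.
\end{align*}
Finally, we obtain
\[ f(x_K) - f(x^*) \le \frac{\|x_0 - x^*\|_2^2}{2 \epsilon K}, \]
and the theorem follows.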
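The bound of Theorem 1 can be compared against an actual run. The following sketch assumes the quadratic $f(x_1, x_2) = 4x_1^2 + x_2^2$ with minimizer $x^* = (0, 0)$, smoothness constant $L = 8$, step $\epsilon = 1/L$, and an arbitrary starting point; all of these are illustrative choices, as are the helper names `dist2` and `run`. It records, for each $k$, the suboptimality $f(x_k) - f(x^*)$ next to the bound $\|x_0 - x^*\|_2^2 / (2 \epsilon k)$.

```python
def f(x):
    return 4.0 * x[0] ** 2 + x[1] ** 2


def grad_f(x):
    return (8.0 * x[0], 2.0 * x[1])


def dist2(x, y):
    """Squared Euclidean distance between two points in R^2."""
    return (x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2


L = 8.0              # smoothness constant of this f (Hessian is diag(8, 2))
eps = 1.0 / L        # fixed step size satisfying eps <= 1/L
x_star = (0.0, 0.0)  # minimizer of this f
x0 = (1.0, 3.0)      # arbitrary starting point


def run(x0, eps, K):
    """Run K gradient steps; return (k, f(x_k) - f(x*), bound) for each k."""
    x = x0
    history = []
    for k in range(1, K + 1):
        g = grad_f(x)
        x = (x[0] - eps * g[0], x[1] - eps * g[1])
        bound = dist2(x0, x_star) / (2.0 * eps * k)
        history.append((k, f(x) - f(x_star), bound))
    return history


history = run(x0, eps, K=20)
for k, gap, bound in history[:5]:
    print(k, gap, bound)  # the gap stays below the bound at every k
```

On this example the observed gap shrinks geometrically while the bound decays only like $1/k$; the theorem gives a worst-case guarantee over all convex $L$-smooth functions, so it is not expected to be tight here.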
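V. GRADIENT

For $f : \mathbb{R}^n \to \mathbb{R}$, we generalize the derivative with the gradient $\nabla f$, where
\[ \nabla f = \begin{bmatrix} \frac{\partial}{\partial x_1} f(x) \\ \vdots \\ \frac{\partial}{\partial x_n} f(x) \end{bmatrix}. \]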
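VI. PARTIAL DERIVATIVE

Consider $f : \mathbb{R}^2 \to \mathbb{R}$, where $\mathbb{R}^2$ is indexed by $x$ and $y$. Then, for some $(a, b) \in \mathbb{R}^2$, the partial derivative of $f$ with respect to $x$ at the point $(a, b)$ is defined as
\[ \frac{\partial}{\partial x} f(a, b) = \lim_{h \to 0} \frac{f(a + h, b) - f(a, b)}{h}. \]
Let us expand the notation to a vector $\mathbf{a} \in \mathbb{R}^2$:
\[ \frac{\partial}{\partial x} f(\mathbf{a}) = \lim_{h \to 0} \frac{f\big(\mathbf{a} + h \begin{bmatrix} 1 \\ 0 \end{bmatrix}\big) - f(\mathbf{a})}{h}. \]
Hence, for an arbitrary direction $\mathbf{u} \in \mathbb{R}^2$:
\[ \frac{\partial}{\partial \mathbf{u}} f(\mathbf{a}) = \lim_{h \to 0} \frac{f(\mathbf{a} + h \mathbf{u}) - f(\mathbf{a})}{h}. \]
Notice, however, that the direction is not scaled here. In other words, the definition changes for $2\mathbf{u}$ unless we throw in a normalization on the direction. The directional derivative is denoted as:
\[ \nabla_{\mathbf{u}} f(\mathbf{a}) = \frac{\partial}{\partial \mathbf{u}} f(\mathbf{a}). \]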
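VII. DIRECTIONAL DERIVATIVE: FORMAL

Definition 1 (Directional derivative). Consider a function $f : \mathbb{R}^n \to \mathbb{R}$. The directional derivative of $f$ in an arbitrary direction $\mathbf{u} \in \mathbb{R}^n$ is given by
\begin{align*}
\nabla_{\mathbf{u}} f(\mathbf{a}) = \frac{\partial}{\partial \mathbf{u}} f(\mathbf{a})
&= \lim_{h \to 0} \frac{f(\mathbf{a} + h \mathbf{u}) - f(\mathbf{a})}{h} \\
&= \lim_{h \to 0} \frac{f(a_1 + h u_1, a_2 + h u_2, \ldots, a_n + h u_n) - f(a_1, a_2, \ldots, a_n)}{h}.
\end{align*}

Lemma 2. The directional derivative is given by
\[ \nabla_{\mathbf{u}} f(\mathbf{x}) = \mathbf{u}^\top \nabla f(\mathbf{x}). \]

Proof. Consider $f : \mathbb{R}^2 \to \mathbb{R}$ for convenience. Let $\mathbb{R}^2$ be indexed by $(x, y)$. Fix an arbitrary point $(x_0, y_0) \in \mathbb{R}^2$ and an arbitrary direction $(a, b) \in \mathbb{R}^2$. At this point, define
\[ g(z) = f(x_0 + z a, y_0 + z b). \]
As $g$ is a function of a single variable $z$, we can define
\[ g'(z) \triangleq \frac{d}{dz} g(z) = \lim_{h \to 0} \frac{g(z + h) - g(z)}{h}, \]
leading to
\[ g'(0) \triangleq \frac{d}{dz} g(z) \Big|_{z=0} = \lim_{h \to 0} \frac{g(h) - g(0)}{h} = \lim_{h \to 0} \frac{f(x_0 + h a, y_0 + h b) - f(x_0, y_0)}{h} = \nabla_{(a,b)} f(x_0, y_0), \tag{2} \]
by definition. We thus have $g'(0) = \nabla_{(a,b)} f(x_0, y_0)$.

Now let $x = x_0 + z a$ and $y = y_0 + z b$. Then $g(z) = f(x, y)$. We have
\[ g'(z) = \frac{d}{dz} g(z) = \frac{d}{dz} f(x, y) = \frac{\partial}{\partial x} f(x, y) \frac{dx}{dz} + \frac{\partial}{\partial y} f(x, y) \frac{dy}{dz} = \frac{\partial}{\partial x} f(x, y)\, a + \frac{\partial}{\partial y} f(x, y)\, b. \]
With $z = 0$, we get $x = x_0$ and $y = y_0$, and
\[ g'(0) = \frac{\partial}{\partial x} f(x_0, y_0)\, a + \frac{\partial}{\partial y} f(x_0, y_0)\, b. \tag{3} \]
Combine (2) and (3) to get
\[ \nabla_{(a,b)} f(x_0, y_0) = \frac{\partial}{\partial x} f(x_0, y_0)\, a + \frac{\partial}{\partial y} f(x_0, y_0)\, b. \]
The above generalizes to
\[ \nabla_{\mathbf{u}} f(\mathbf{x}) = \frac{\partial}{\partial x_1} f(\mathbf{x})\, u_1 + \cdots + \frac{\partial}{\partial x_n} f(\mathbf{x})\, u_n = \mathbf{u}^\top \nabla f = \mathbf{u} \cdot \nabla f = \nabla f \cdot \mathbf{u}, \]
which completes the proof.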
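Lemma 2 can be checked by comparing the limit definition, approximated with a small finite $h$, against the dot product $\mathbf{u}^\top \nabla f$. The sketch below is illustrative only: the test function $f(x, y) = x^2 y + 3y$, the point $(1, 2)$, the direction $(0.6, 0.8)$, and the step $h = 10^{-6}$ are assumptions made here, and the helper names are hypothetical.

```python
def f(p):
    x, y = p
    return x * x * y + 3.0 * y


def grad_f(p):
    x, y = p
    # Partial derivatives of x^2 * y + 3 * y, computed by hand.
    return (2.0 * x * y, x * x + 3.0)


def directional_fd(f, p, u, h=1e-6):
    """Approximate the limit definition (f(p + h u) - f(p)) / h."""
    shifted = (p[0] + h * u[0], p[1] + h * u[1])
    return (f(shifted) - f(p)) / h


def directional_dot(grad_f, p, u):
    """Evaluate u^T grad f(p), the closed form from Lemma 2."""
    g = grad_f(p)
    return u[0] * g[0] + u[1] * g[1]


p = (1.0, 2.0)
u = (0.6, 0.8)  # a unit-length direction

fd = directional_fd(f, p, u)
dot = directional_dot(grad_f, p, u)
print(fd, dot)  # the two values agree closely

# Without normalization, doubling the direction doubles the derivative.
u2 = (2.0 * u[0], 2.0 * u[1])
print(directional_dot(grad_f, p, u2), 2.0 * dot)
```

At this point $\nabla f = (4, 4)$, so both quantities are close to $0.6 \cdot 4 + 0.8 \cdot 4 = 5.6$. The last line illustrates the scaling remark from the partial derivative section.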
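Lemma 3. Over unit-length directions $\mathbf{u}$, the maximum value of the directional derivative $\nabla_{\mathbf{u}} f(\mathbf{x})$ is attained in the direction of the gradient.

Proof. The directional derivative is a dot product, $\nabla_{\mathbf{u}} f(\mathbf{x}) = \|\mathbf{u}\|_2 \|\nabla f(\mathbf{x})\|_2 \cos\theta$, which is maximal when the angle $\theta$ between $\mathbf{u}$ and the gradient is $0$.

Lemma 4. The gradient vector is orthogonal to the level (hyper-)curve $f(\mathbf{x}) = c$ at the point $\mathbf{x}$.

Proof.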
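VIII. DIRECTIONAL DERIVATIVE

Consider $f : \mathbb{R}^n \to \mathbb{R}$. The scalar notion of derivative does not apply, and the derivative must be specified along a particular direction in $\mathbb{R}^n$, i.e., as the rate of change of $f$ in a direction $\mathbf{u}$. It is defined as
\[ \nabla_{\mathbf{u}} f(\mathbf{x}) = \lim_{\alpha \to 0} \frac{f(\mathbf{x} + \alpha \mathbf{u}) - f(\mathbf{x})}{\alpha}. \]
The directional derivative of a function in an arbitrary direction $\mathbf{u}$ is given by
\[ \nabla_{\mathbf{u}} f(\mathbf{x}) = \mathbf{u}^\top \nabla f(\mathbf{x}), \qquad \|\mathbf{u}\|_2^2 = \mathbf{u}^\top \mathbf{u} = 1. \]
Consider a function $f$ with domain $\mathbb{R}^3$, i.e., over three variables $x_1, x_2, x_3$. The three principal directions of this space are thus $\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3$. If we choose $\mathbf{u}$ to be $\mathbf{e}_1$, then
\[ \nabla_{\mathbf{e}_1} f(\mathbf{x}) = \frac{\partial}{\partial x_1} f(\mathbf{x}). \]
Because we have already specified a particular direction, the directional derivative is a scalar. We further have
\[ \nabla_{\mathbf{u}} f(\mathbf{x}) = \mathbf{u}^\top \nabla f(\mathbf{x}) = \mathbf{u} \cdot \nabla f(\mathbf{x}) = \|\mathbf{u}\|_2 \|\nabla f(\mathbf{x})\|_2 \cos\theta, \]
where $\theta$ is the angle between the gradient vector and the direction $\mathbf{u}$. The directional derivative is thus at its maximum when $\theta = 0$, i.e., when $\mathbf{u}$ is along the direction of the gradient; this is where $f$ increases the most. Similarly, the directional derivative is at its minimum when $\theta = \pi$, i.e., when $\mathbf{u}$ is opposite to the direction of the gradient; this is where $f$ decreases the most.

A. Example

Consider $f(x_1, x_2) = 4 x_1^2 + x_2^2$. For any value $f(x_1, x_2) = c$ with $c > 0$, the level set $4 x_1^2 + x_2^2 = c$ is an ellipse in $\mathbb{R}^2$. The gradient of this function is
\[ \nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial}{\partial x_1} f(\mathbf{x}) \\ \frac{\partial}{\partial x_2} f(\mathbf{x}) \end{bmatrix} = \begin{bmatrix} 8 x_1 \\ 2 x_2 \end{bmatrix}. \]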
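The claim that the directional derivative peaks along the gradient can be illustrated on this example. The sketch below assumes the evaluation point $\mathbf{x} = (1, 1)$ and a grid of 3600 unit directions $\mathbf{u}(\theta) = (\cos\theta, \sin\theta)$; both are arbitrary choices made here. It evaluates $\mathbf{u}^\top \nabla f(\mathbf{x})$ for every direction and locates the largest and smallest values.

```python
import math


def grad_f(x):
    # Gradient of f(x1, x2) = 4 * x1^2 + x2^2.
    return (8.0 * x[0], 2.0 * x[1])


def directional_derivative(x, u):
    """Evaluate u^T grad f(x) for a direction u."""
    g = grad_f(x)
    return u[0] * g[0] + u[1] * g[1]


x = (1.0, 1.0)  # illustrative evaluation point
g = grad_f(x)
g_norm = math.hypot(g[0], g[1])
g_angle = math.atan2(g[1], g[0])  # angle of the gradient direction

# Sweep unit directions u(theta) = (cos theta, sin theta).
angles = [2.0 * math.pi * i / 3600 for i in range(3600)]
values = [directional_derivative(x, (math.cos(t), math.sin(t))) for t in angles]

best = max(range(len(angles)), key=lambda i: values[i])
worst = min(range(len(angles)), key=lambda i: values[i])

print(angles[best], g_angle)   # maximizing angle is close to the gradient angle
print(values[best], g_norm)    # maximum is close to the gradient norm
print(values[worst], -g_norm)  # minimum is close to minus the gradient norm
```

Up to the resolution of the angle grid, the maximizing direction coincides with the gradient direction and the minimizing direction with its opposite, matching the $\cos\theta$ expression above.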