GRADIENT METHODS
GRADIENT = STEEPEST DESCENT

[Figure: iso-contours of a convex function; the gradient is perpendicular to the contours, and the negative gradient points in the direction of steepest descent.]
GRADIENT DESCENT METHOD

Gradient descent: x^{k+1} = x^k - \alpha \nabla f(x^k)

[Figure: iso-contours of a convex function with the negative gradient direction drawn at the current iterate.]
STEPSIZE RESTRICTION

f(x) = \frac{1}{2}x^2, \qquad \nabla f(x) = x

x^{k+1} = x^k - \alpha x^k = (1 - \alpha)\, x^k

Restriction: \alpha < 2

[Figure: iterates of gradient descent on the parabola for several stepsizes.]
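A minimal sketch of this slide's model problem: on f(x) = x^2/2 the iteration is x_{k+1} = (1 - alpha) x_k, so it converges exactly when |1 - alpha| < 1, i.e. 0 < alpha < 2. The helper name `iterate` is just for illustration.

```python
# Sketch: gradient descent on f(x) = x^2 / 2, where f'(x) = x, so the
# update x - alpha * f'(x) collapses to multiplication by (1 - alpha).
def iterate(alpha, x0=1.0, steps=50):
    x = x0
    for _ in range(steps):
        x = (1 - alpha) * x
    return x

assert abs(iterate(0.5)) < 1e-6   # 0 < alpha < 2: converges
assert iterate(1.0) == 0.0        # alpha = 1 jumps straight to the minimizer
assert abs(iterate(2.5)) > 1.0    # alpha > 2: diverges
```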
RECALL

Lipschitz constant for the gradient:

\|\nabla f(x) - \nabla f(y)\| \le M \|x - y\|

f(y) \le f(x) + (y - x)^T \nabla f(x) + \frac{M}{2}\|y - x\|^2

When the Hessian exists: M \ge \|\nabla^2 f(x)\|
STABILITY RESTRICTION

If you know the Lipschitz constant for the gradient:

f(y) \le f(x) + (y - x)^T \nabla f(x) + \frac{M}{2}\|y - x\|^2

Plug in the gradient step y = x - \alpha \nabla f(x):

f(x - \alpha \nabla f(x)) \le f(x) - \alpha \nabla f(x)^T \nabla f(x) + \frac{M \alpha^2}{2}\|\nabla f(x)\|^2
  = f(x) - \alpha \|\nabla f(x)\|^2 + \frac{M \alpha^2}{2}\|\nabla f(x)\|^2
  = f(x) + \left(\frac{M \alpha^2}{2} - \alpha\right)\|\nabla f(x)\|^2

\frac{M \alpha^2}{2} - \alpha < 0 \;\Rightarrow\; \alpha < \frac{2}{M}

This proves convergence in terms of the gradient.
LINE SEARCH METHODS

Choose search direction: d
SEARCH for a stepsize \alpha that satisfies SOME inequality
Update iterate: x^{k+1} = x^k + \alpha d

Armijo condition:

f(x^k + \alpha d) \le f(x^k) + c\,(\alpha d)^T \nabla f(x^k), \qquad c < 1

[Figure: f along the search ray from x^k, with the lines f(x^k) + \alpha d^T \nabla f(x^k) and f(x^k) + c\,\alpha d^T \nabla f(x^k) marking the acceptable stepsizes.]
WOLFE CONDITIONS

Choose search direction: d = -\nabla f(x^k)
Find a stepsize satisfying the Wolfe conditions
Update iterate: x^{k+1} = x^k + \alpha d

Armijo condition ("don't go too far"):

f(x^k + \alpha d) \le f(x^k) + c_1\,(\alpha d)^T \nabla f(x^k), \qquad c_1 < 1

Curvature condition ("don't stop too near"):

d^T \nabla f(x^k + \alpha d) \ge c_2\, d^T \nabla f(x^k), \qquad c_1 < c_2 < 1

Theorem: Suppose the search directions satisfy the uniform angle bound

\frac{-\nabla f(x^k)^T d}{\|\nabla f(x^k)\|\,\|d\|} > c > 0

for a convex function; then \lim_{k\to\infty} \nabla f(x^k) = 0.
LINE SEARCH IN REAL LIFE

Backtracking/Armijo line search
  Choose search direction: d = -\nabla f(x^k)
  While f(x^k + \alpha d) > f(x^k) + c\,(\alpha d)^T \nabla f(x^k): set \alpha \leftarrow \alpha/2
  Update iterate: x^{k+1} = x^k + \alpha d

Exact line search
  Choose search direction: d = -\nabla f(x^k)
  \alpha = \arg\min_\alpha f(x^k + \alpha d)
  Update iterate: x^{k+1} = x^k + \alpha d
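The backtracking loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not the slides' code; the quadratic test problem (D = diag(1, 10)) and the constant c = 1e-4 are hypothetical choices.

```python
import numpy as np

def backtracking_gradient_step(f, grad, x, c=1e-4, alpha0=1.0):
    """One gradient step with Armijo backtracking: halve alpha until
    the sufficient-decrease condition holds."""
    d = -grad(x)                     # steepest-descent direction
    alpha = alpha0
    while f(x + alpha * d) > f(x) + c * alpha * (d @ grad(x)):
        alpha /= 2
    return x + alpha * d

# Hypothetical test problem: f(x) = 0.5 x^T D x with D = diag(1, 10)
D = np.array([1.0, 10.0])
f = lambda x: 0.5 * x @ (D * x)
grad = lambda x: D * x

x = np.array([1.0, 1.0])
for _ in range(100):
    x = backtracking_gradient_step(f, grad, x)
assert f(x) < 1e-10   # converges despite the mismatched curvatures
```

The loop always terminates for smooth f, because sufficiently small steps along a descent direction satisfy the Armijo inequality.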
ADAPTIVE STEPSIZE METHOD

Consider this simple model problem:

f(x) = \frac{a}{2} x^T x, \qquad x^{k+1} = x^k - \alpha \nabla f(x^k) = (1 - \alpha a)\, x^k

Best choice: \alpha = 1/a

Barzilai-Borwein model: f(x) \approx \frac{1}{2\alpha} x^T x. Figure out the \alpha that gives the best approximation.
BB METHOD

Barzilai-Borwein model: f(x) \approx \frac{1}{2\alpha} x^T x, \qquad \nabla f(y) - \nabla f(x) \approx \frac{1}{\alpha}(y - x)

On each iteration, define:

\Delta g = \nabla f(x^{k+1}) - \nabla f(x^k), \qquad \Delta x = x^{k+1} - x^k

The model tells us to solve:

\min_\alpha \left\| \tfrac{1}{\alpha}\Delta x - \Delta g \right\|^2
BB METHOD

\min_\alpha \left\| \tfrac{1}{\alpha}\Delta x - \Delta g \right\|^2

Set the derivative with respect to 1/\alpha to zero:

\frac{1}{\alpha}\,\Delta x^T \Delta x - \Delta x^T \Delta g = 0
\;\Rightarrow\; \frac{1}{\alpha} = \frac{\Delta x^T \Delta g}{\Delta x^T \Delta x}
\;\Rightarrow\; \alpha = \frac{\Delta x^T \Delta x}{\Delta x^T \Delta g}
BB METHOD

Choose search direction: d = -\nabla f(x^k)
Compute BB step: \alpha^k = \frac{\Delta x^T \Delta x}{\Delta x^T \Delta g}
While f(x^k + \alpha^k d) > f(x^k) + c\,\alpha^k d^T \nabla f(x^k): set \alpha^k \leftarrow \alpha^k/2
Update iterate: x^{k+1} = x^k + \alpha^k d

For some smooth problems, the BB method is superlinear, i.e. for large enough k

f(x^{k+1}) - f^\star \le C\,(f(x^k) - f^\star)^p

for some C > 0 and p > 1. This rate is asymptotic, not global.
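A bare-bones sketch of the BB iteration, using the stepsize formula above but omitting the backtracking safeguard the slide adds. The quadratic test problem and all constants are hypothetical.

```python
import numpy as np

def bb_gradient_descent(grad, x0, alpha0=1e-3, iters=100, tol=1e-10):
    """Gradient descent with the Barzilai-Borwein stepsize
    alpha = (dx . dx) / (dx . dg).  Sketch only: no line-search safeguard."""
    x_prev, g_prev = x0, grad(x0)
    x = x_prev - alpha0 * g_prev          # one plain gradient step to start
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:       # stop once the gradient vanishes
            break
        dx, dg = x - x_prev, g - g_prev
        alpha = (dx @ dx) / (dx @ dg)     # BB stepsize from the slide
        x_prev, g_prev = x, g
        x = x - alpha * g
    return x

# Hypothetical quadratic: f(x) = 0.5 x^T D x, so grad(x) = D x
D = np.array([1.0, 10.0])
x = bb_gradient_descent(lambda x: D * x, np.array([1.0, 1.0]))
assert np.linalg.norm(x) < 1e-6
```

Note that BB steps are not monotone: the objective can go up on individual iterations, which is exactly why the slide pairs the method with a backtracking check.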
CONVERGENCE RATE: EXACT LINE SEARCH

Strong convexity: f(y) \ge f(x) + (y - x)^T \nabla f(x) + \frac{m}{2}\|y - x\|^2
Lipschitz bound:  f(y) \le f(x) + (y - x)^T \nabla f(x) + \frac{M}{2}\|y - x\|^2

Take y = x^{k+1}, x = x^k, so y - x = x^{k+1} - x^k = -\alpha \nabla f(x^k):

f(x^{k+1}) \le \min_\alpha \; f(x^k) - \alpha \nabla f(x^k)^T \nabla f(x^k) + \frac{M \alpha^2}{2}\|\nabla f(x^k)\|^2
CONVERGENCE RATE: EXACT LINE SEARCH

f(x^{k+1}) \le \min_\alpha \; f(x^k) - \alpha \|\nabla f(x^k)\|^2 + \frac{M \alpha^2}{2}\|\nabla f(x^k)\|^2

The minimum is at \alpha = 1/M:

f(x^{k+1}) \le f(x^k) - \frac{1}{2M}\|\nabla f(x^k)\|^2

f(x^{k+1}) - f^\star \le f(x^k) - f^\star - \frac{1}{2M}\|\nabla f(x^k)\|^2

Write in terms of the optimality gap f(x^k) - f^\star so we get relative change.
STRONG CONVEXITY BOUND

f(x^{k+1}) - f^\star \le f(x^k) - f^\star - \frac{1}{2M}\|\nabla f(x^k)\|^2

Strong convexity: f(y) \ge f(x) + (y - x)^T \nabla f(x) + \frac{m}{2}\|y - x\|^2

Optimality: minimize the right-hand side over y; the minimum is at y - x^k = -\frac{1}{m}\nabla f(x^k):

f(x^\star) \ge \min_y \; f(x^k) + (y - x^k)^T \nabla f(x^k) + \frac{m}{2}\|y - x^k\|^2
  = f(x^k) - \frac{1}{m}\|\nabla f(x^k)\|^2 + \frac{1}{2m}\|\nabla f(x^k)\|^2

Rearranging: 2m\,(f(x^k) - f(x^\star)) \le \|\nabla f(x^k)\|^2
STRONG CONVEXITY BOUND

f(x^{k+1}) - f^\star \le f(x^k) - f^\star - \frac{1}{2M}\|\nabla f(x^k)\|^2

Using 2m\,(f(x^k) - f(x^\star)) \le \|\nabla f(x^k)\|^2:

f(x^{k+1}) - f^\star \le f(x^k) - f^\star - \frac{m}{M}(f(x^k) - f^\star) = \left(1 - \frac{m}{M}\right)(f(x^k) - f^\star)

What's this factor?
STRONG CONVEXITY BOUND

f(x^{k+1}) - f^\star \le \left(1 - \frac{m}{M}\right)(f(x^k) - f^\star)

By induction:

f(x^k) - f^\star \le \left(1 - \frac{m}{M}\right)^k (f(x^0) - f^\star)

Theorem: Gradient descent with exact line search satisfies linear convergence when the objective is strongly convex:

f(x^k) - f^\star \le \left(1 - \frac{1}{\kappa}\right)^k (f(x^0) - f^\star), \qquad \kappa = \frac{M}{m} \; \text{(condition number)}
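The per-step contraction factor (1 - m/M) is easy to check numerically. This sketch runs exact line search on a hypothetical diagonal quadratic, where m and M are the extreme diagonal entries and the exact stepsize has a closed form.

```python
import numpy as np

# Check f(x_{k+1}) - f* <= (1 - m/M)(f(x_k) - f*) for gradient descent with
# exact line search on f(x) = 0.5 x^T D x (here f* = 0, m = min d_i, M = max d_i).
D = np.array([1.0, 4.0])
f = lambda x: 0.5 * x @ (D * x)

x = np.array([1.0, 1.0])
gaps = [f(x)]
for _ in range(20):
    g = D * x
    alpha = (g @ g) / (g @ (D * g))   # exact line search for a quadratic
    x = x - alpha * g
    gaps.append(f(x))

rate = 1 - D.min() / D.max()          # 1 - m/M = 0.75 here
assert all(gaps[k + 1] <= rate * gaps[k] + 1e-15 for k in range(20))
```

In fact the observed contraction is usually better than 1 - m/M; the Kantorovich bound ((M-m)/(M+m))^2 is the sharp per-step rate for quadratics.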
GRADIENT DESCENT USES CONTOURS

Gradients are perpendicular to contours.

[Figure: contours of a well-conditioned quadratic, \kappa = 2, with gradient descent iterates.]
POOR CONDITIONING: CONTOUR INTERPRETATION

\kappa = 100: contours are (almost) parallel!

[Figure: contours of an ill-conditioned quadratic; a zoomed-in panel shows the nearly parallel contour lines.]
GRADIENT DESCENT: WEAKLY CONVEX PROBLEMS

All rates so far assume strong convexity. For strongly convex problems (m is the strong convexity constant):

f(x^k) - f^\star \le \left(1 - \frac{m}{M}\right)^k (f(x^0) - f^\star)

For weakly convex problems:

f(x^{k+1}) - f^\star \le \frac{M\|x^0 - x^\star\|^2}{k + 1}

Slow (worst case).
WHAT'S THE BEST WE CAN DO?

Goal: put a worst-case bound on the convergence of first-order methods.

Definition: a method is first-order if

x^k \in x^0 + \mathrm{span}\{\nabla f(x^0), \nabla f(x^1), \ldots, \nabla f(x^{k-1})\}

Theorem: For any first-order method, there is a smooth function with Lipschitz constant M such that

f(x^k) - f(x^\star) \ge \frac{3M\|x^0 - x^\star\|^2}{32(k + 1)^2}

(Nemirovsky and Yudin, '82)
PROOF

The "worst function in the world":

f_n(x) = \frac{L}{4}\left( \frac{1}{2}\left( x_1^2 + \sum_{i=1}^{n-1}(x_i - x_{i+1})^2 + x_n^2 \right) - x_1 \right)

Minimizer: x^\star_i = 1 - \frac{i}{n+1}, \qquad f_n(x^\star) = \frac{L}{8}\left( \frac{1}{n+1} - 1 \right)
PROOF

Start at x^0 = 0. Because the gradient of f_n is tridiagonal, each iteration can "light up" only one new coordinate: after iteration k,

x_i = 0 \quad \text{for } i > k.

Consider the function with n = 2k+1 variables. Then

f_{2k+1}(x^k) = f_k(x^k) \ge \frac{L}{8}\left( \frac{1}{k+1} - 1 \right), \qquad f_{2k+1}(x^\star) = \frac{L}{8}\left( \frac{1}{2k+2} - 1 \right)
PROOF

Consider the function with n = 2k+1 variables. The lowest possible energy on step k, minus the true optimal value:

f_{2k+1}(x^k) - f_{2k+1}(x^\star) \ge \frac{L}{8}\left( \frac{1}{k+1} - \frac{1}{2k+2} \right) = \frac{L}{8}\cdot\frac{1}{2k+2}

Starting at x^0 = 0, one can bound \|x^0 - x^\star\|^2 \le \frac{2k+2}{3}, so

\frac{f_{2k+1}(x^k) - f_{2k+1}(x^\star)}{\|x^0 - x^\star\|^2} \ge \frac{3L}{32(k+1)^2}
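The key mechanism of the proof, that a gradient step spreads information only one coordinate per iteration, can be checked directly. This sketch builds the reconstructed f_n above (the tridiagonal structure is what matters) and verifies that, starting from zero, coordinates beyond k stay exactly zero after k steps.

```python
import numpy as np

# Worst-function sketch: f_n(x) = (L/4) * (0.5*(x_1^2
# + sum_i (x_i - x_{i+1})^2 + x_n^2) - x_1).  Its gradient is
# (L/4) * (A x - e_1) with A the tridiagonal (-1, 2, -1) matrix.
L, n = 1.0, 20

def grad_fn(x):
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    e1 = np.zeros(n); e1[0] = 1.0
    return (L / 4) * (A @ x - e1)

x = np.zeros(n)
for k in range(1, 6):
    x = x - (1 / L) * grad_fn(x)                  # plain gradient step
    assert np.all(x[k:] == 0) and x[k - 1] != 0   # only first k coords move
```

Any method whose iterates stay in the span of past gradients inherits this limitation, which is where the 1/(k+1)^2 lower bound comes from.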
NESTEROV'S METHOD ACHIEVES THE BOUND

(Nemirovski and Yudin, '83)

[Figure: iterate paths of gradient descent vs. Nesterov's method on the same contours; the Nesterov iterates overshoot and curve back toward the minimizer.]

Gradient: f(x^{k+1}) - f^\star \le \frac{M\|x^0 - x^\star\|^2}{k + 1}

Nesterov (optimal): f(x^{k+1}) - f^\star \le \frac{2M\|x^0 - x^\star\|^2}{(k + 1)^2}
NESTEROV'S METHOD

Stepsize \alpha < 1/L, where L is the Lipschitz constant of \nabla f; \lambda^0 = 0.

x^{k+1} = y^k - \alpha \nabla f(y^k)   (prediction step)

\lambda^{k+1} = \frac{1 + \sqrt{4(\lambda^k)^2 + 1}}{2}   (momentum parameter)

y^{k+1} = x^{k+1} + \frac{\lambda^k - 1}{\lambda^{k+1}}\left(x^{k+1} - x^k\right)   (momentum term)
OPTIMAL CONVERGENCE

Theorem: The objective error of the kth iterate of Nesterov's method is bounded by

f(x^k) - f(x^\star) \le \frac{2L\|x^0 - x^\star\|^2}{(k + 1)^2}

This worst case doesn't mean much in practice; adaptive convergence is faster.
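A direct transcription of the three update equations into NumPy. The quadratic test problem is hypothetical, and the stepsize is taken as 1/L for simplicity.

```python
import numpy as np

def nesterov(grad, x0, L, iters=200):
    """Nesterov's accelerated gradient method:
       x_{k+1}  = y_k - (1/L) grad(y_k)
       lam_{k+1} = (1 + sqrt(4 lam_k^2 + 1)) / 2
       y_{k+1}  = x_{k+1} + ((lam_k - 1)/lam_{k+1}) (x_{k+1} - x_k)"""
    x, y, lam = x0, x0, 0.0
    for _ in range(iters):
        x_next = y - grad(y) / L
        lam_next = (1 + np.sqrt(4 * lam**2 + 1)) / 2
        y = x_next + ((lam - 1) / lam_next) * (x_next - x)
        x, lam = x_next, lam_next
    return x

# Hypothetical quadratic: grad(x) = D x, with L = max d_i
D = np.array([1.0, 4.0])
x = nesterov(lambda x: D * x, np.array([1.0, 1.0]), L=4.0)
assert np.linalg.norm(x) < 1e-6
```

The momentum coefficient (lam_k - 1)/lam_{k+1} grows toward 1, which is what produces the characteristic overshoot-and-correct trajectories on the contour plots.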
POOR CONDITIONING: FUNCTION INTERPRETATION

f(x) = \sum_i \frac{d_i}{2} x_i^2 = \frac{1}{2} x^T D x

[Figure: two 1D slices of the objective, one shallow (d_i = 3) and one steep (d_i = 50).]
POOR CONDITIONING: FUNCTION INTERPRETATION

f(x) = \sum_i \frac{d_i}{2} x_i^2 = \frac{1}{2} x^T D x, \qquad x^{k+1}_i = x^k_i - \alpha\, d_i\, x^k_i

One stepsize to rule them all: with a single \alpha (set by the steepest direction, e.g. \alpha = 1/50), the step is the right size along the d_i = 50 slice but tiny along the d_i = 3 slice.

[Figure: big step on the shallow parabola vs. small relative progress on the steep one, both with the same \alpha.]
THIS IS ALWAYS HAPPENING!

minimize \frac{1}{2} x^T H x + g^T x

EVD: H = U D U^T

minimize \frac{1}{2} x^T U D U^T x + g^T x

Change variables y = U^T x:

minimize \frac{1}{2} y^T D y + (U^T g)^T y = \sum_i \frac{d_i}{2} y_i^2 + (U^T g)_i\, y_i
PRECONDITIONERS

Change variables x = P y:

minimize f(x) \;\longrightarrow\; minimize f(Py)

Hessians: H \;\longrightarrow\; P^T H P
THE BEST PRECONDITIONER

Change variables x = P y:

minimize f(x) \;\longrightarrow\; minimize f(Py)

Hessians: H \;\longrightarrow\; P^T H P

Choose P = H^{-1/2}:

P^T H P = H^{-1/2} H H^{-1/2} = I
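A quick numerical check of the identity above, on a hypothetical random SPD Hessian: forming H^{-1/2} from the eigendecomposition and verifying that P^T H P is the identity.

```python
import numpy as np

# Sketch: with P = H^{-1/2}, the preconditioned Hessian P^T H P equals I.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
H = A @ A.T + 4 * np.eye(4)            # a generic SPD "Hessian"

w, U = np.linalg.eigh(H)               # H = U diag(w) U^T
P = U @ np.diag(w**-0.5) @ U.T         # H^{-1/2} via the eigendecomposition
assert np.allclose(P.T @ H @ P, np.eye(4))
```

Of course this requires the full eigendecomposition of H, which is exactly why practical preconditioners settle for cheap approximations of H^{-1/2}.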
SVD INTERPRETATION

f(x) = \frac{1}{2} x^T H x = \frac{1}{2} x^T U D U^T x = \frac{1}{2} x^T U D^{1/2} D^{1/2} U^T x = \frac{1}{2} y^T y

where y = D^{1/2} U^T x and x = U D^{-1/2} y = P y.

[Figure: the preconditioner P acts as a rotation followed by a stretch, mapping the circular contours of \frac{1}{2}y^Ty onto the elliptical contours of f.]
NEWTON'S METHOD

Change variables x = H^{-1/2} y:

minimize f(x) \;\longrightarrow\; minimize f(H^{-1/2} y)

Gradient step: y^{k+1} = y^k - H^{-1/2} \nabla f(H^{-1/2} y^k)

Change variables back to x:

H^{-1/2} y^{k+1} = H^{-1/2} y^k - H^{-1/2} H^{-1/2} \nabla f(H^{-1/2} y^k)

x^{k+1} = x^k - H^{-1} \nabla f(x^k)   (Newton direction)

There are many interpretations.
GEOMETRIC INTERPRETATION

Minimize the quadratic model of f at x^k:

f(x^k) + (x - x^k)^T g + \frac{1}{2}(x - x^k)^T H (x - x^k)

Derivative: g + H(x - x^k) = 0

x^{k+1} = x^k - H^{-1} g

[Figure: f and its quadratic model at x^k; the minimizer of the model is x^{k+1}.]
CLASSICAL NEWTON METHOD

Newton's method = algorithm for root finding, g(x) = 0:

x^{k+1} = x^k - \frac{g(x^k)}{g'(x^k)}

For a nonlinear system of equations, linearize g(x) \approx g(x^k) + \nabla g(x^k)(x - x^k) and solve for the root:

x^{k+1} = x^k - \nabla g(x^k)^{-1} g(x^k)
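The scalar root-finding iteration above in a few lines, on the classic example g(x) = x^2 - 2 (root sqrt(2)); the example problem is mine, not the slides'.

```python
# Sketch: Newton root finding, x_{k+1} = x_k - g(x_k)/g'(x_k),
# applied to g(x) = x^2 - 2, whose positive root is sqrt(2).
g = lambda x: x**2 - 2
dg = lambda x: 2 * x

x = 1.0
for _ in range(6):
    x = x - g(x) / dg(x)
assert abs(x - 2**0.5) < 1e-12   # machine precision after a handful of steps
```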
NEWTON METHOD FOR OPTIMIZATION

We want \nabla f(x) = 0. Taylor's theorem:

f(x) \approx f(x^k) + (x - x^k)^T \nabla f(x^k) + \frac{1}{2}(x - x^k)^T \nabla^2 f(x^k)(x - x^k)

Differentiate to get a linear approximation of the gradient:

\nabla f(x) \approx \nabla f(x^k) + \nabla^2 f(x^k)(x - x^k) = g + H(x - x^k) = 0

x^{k+1} = x^k - H^{-1} g
SIMPLE RATE ANALYSIS

Bound the error e^k = x^k - x^\star. Assume f \in C^2, e^k is sufficiently small, and strong convexity.

0 = \nabla f(x^\star) = \nabla f(x^k - e^k) = \nabla f(x^k) - H e^k + O(\|e^k\|^2)

Multiply by the inverse Hessian:

0 = H^{-1}\nabla f(x^k) - e^k + O(\|e^k\|^2)

Note e^{k+1} = e^k - H^{-1}\nabla f(x^k), so

e^{k+1} = O(\|e^k\|^2)

(Fletcher, Practical Methods of Optimization)
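The error-squaring behavior is easy to see on a 1D example of my own choosing: for the hypothetical objective f(x) = x - log(1 + x), with f'(x) = x/(1+x), f''(x) = 1/(1+x)^2 and minimizer x* = 0, the Newton update simplifies algebraically to x_{k+1} = -x_k^2, so the error is squared exactly at every step.

```python
# Sketch: quadratic error decay e_{k+1} = O(||e_k||^2) for Newton's method
# on f(x) = x - log(1 + x).  Newton step: x - f'(x)/f''(x) = -x^2.
x = 0.5
errors = []
for _ in range(4):
    x = x - (x / (1 + x)) * (1 + x) ** 2   # Newton step; equals -x**2
    errors.append(abs(x))

# each error is the square of the previous one (up to float rounding)
assert all(abs(errors[k + 1] - errors[k] ** 2) < 1e-12 for k in range(3))
```

The digits of accuracy roughly double per iteration: 0.25, 0.0625, 0.0039, 1.5e-05.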
DAMPED NEWTON METHOD

Choose search direction: d = -\nabla^2 f(x^k)^{-1} \nabla f(x^k)
Find a stepsize satisfying the Wolfe conditions:

f(x^k + \alpha d) \le f(x^k) + c\,(\alpha d)^T \nabla f(x^k), \qquad c < 1

Update iterate: x^{k+1} = x^k + \alpha d

\alpha < 1: damped step. \;\; \alpha = 1: full Newton step.

[Figure: predicted objective change along the Newton direction.]
DAMPED NEWTON METHOD

Theorem: Suppose we have a Lipschitz constant for the Hessian:

\|\nabla^2 f(x) - \nabla^2 f(y)\| \le L_H \|x - y\|

When the gradient gets small enough to satisfy

\|\nabla f(x^k)\| \le 3(1 - 2c)\,\frac{m^2}{L_H},

the unit stepsize is an Armijo step, and (with \alpha = 1)

\|\nabla f(x^{k+1})\| \le \frac{L_H}{2m^2}\|\nabla f(x^k)\|^2

(see Boyd and Vandenberghe)
NEWTON IN THE WILD

In practice the Hessian might be indefinite. If you start at a point of negative curvature, the Newton step goes up!

The (negative) search direction must form an acute angle with the gradient: g^T d > 0.

We can use any SPD matrix \hat{H} to approximate the Hessian:

d = \hat{H}^{-1} g \;\Rightarrow\; g^T d = g^T \hat{H}^{-1} g > 0
MODIFIED HESSIAN

We can use any SPD matrix to approximate the Hessian.

Levenberg-Marquardt: \hat{H} = H + \lambda I
Add a large enough multiple of the identity to the Hessian that it becomes SPD.

Modified Cholesky: begin with a partial Cholesky factorization of H, which exposes a trailing block B, and replace B with an SPD matrix.

(See Fang & O'Leary, '06)
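A sketch of the Levenberg-Marquardt fix: keep increasing the diagonal shift until a Cholesky factorization succeeds (NumPy's `cholesky` raises `LinAlgError` when the matrix is not positive definite), then solve for the modified Newton direction. The indefinite test Hessian and the shift schedule are hypothetical.

```python
import numpy as np

def lm_direction(H, g, tau=2e-3):
    """Modified Newton direction d = -(H + tau*I)^{-1} g, with tau grown
    until H + tau*I is SPD (Levenberg-Marquardt-style shift)."""
    I = np.eye(len(g))
    while True:
        try:
            np.linalg.cholesky(H + tau * I)   # raises LinAlgError unless SPD
            break
        except np.linalg.LinAlgError:
            tau *= 10
    return np.linalg.solve(H + tau * I, -g)

H = np.diag([2.0, -1.0])       # indefinite: a plain Newton step may go uphill
g = np.array([1.0, 1.0])
d = lm_direction(H, g)
assert g @ d < 0               # descent: acute angle with the negative gradient
```

Since the code returns d = -Hhat^{-1} g, the descent test is g^T d < 0, which matches the slide's g^T Hhat^{-1} g > 0 written for the un-negated direction.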