Adaptive Restarting for First Order Optimization Methods



1 Adaptive Restarting for First Order Optimization Methods
Brendan O'Donoghue - Math 301
- Nesterov method for smooth convex optimization
- adaptive restarting schemes
- step-size insensitivity
- extension to non-smooth optimization
- continuation methods

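2 Smooth convex optimization
want to solve
    minimize $f(x)$
- $f$ smooth and strongly convex with parameter $\mu$: $f(x) \ge f(x_0) + \nabla f(x_0)^T (x - x_0) + (\mu/2)\|x - x_0\|_2^2$ for all $x, x_0 \in \mathrm{dom}(f)$
- $\nabla f$ Lipschitz with constant $L$: $\|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2$ for all $x, y \in \mathrm{dom}(f)$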

3 First order methods
first developed by Nesterov 83; this version is from Nesterov 03:
set $x_0 \in \mathbf{R}^n$, $\theta_0 = 1$, $y_0 = x_0$, and $q \in [0,1]$
for $k = 1, 2, \ldots$
    $x_k = y_{k-1} - t_k \nabla f(y_{k-1})$
    $\theta_k$ solves $\theta_k^2 = (1 - \theta_k)\theta_{k-1}^2 + q\theta_k$
    $\beta_k = \theta_{k-1}(1 - \theta_{k-1})/(\theta_{k-1}^2 + \theta_k)$
    $y_k = x_k + \beta_k (x_k - x_{k-1})$
- $t_k$ is a fixed step-size, or is selected by backtracking or exact line search
- not a descent method
- many variants (e.g. Tseng 08, Lan et al. 09, Auslender and Teboulle 06, Nesterov 83, 05, 07)
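The iteration on this slide can be sketched in a few lines of NumPy. This is a minimal sketch assuming a fixed step-size `t` and a user-supplied gradient callable; the function names `next_theta` and `accelerated_gradient` are hypothetical, chosen here for illustration.

```python
import numpy as np

def next_theta(theta_prev, q):
    """Positive root of theta^2 = (1 - theta) * theta_prev^2 + q * theta."""
    # Rearranged: theta^2 + (theta_prev^2 - q) * theta - theta_prev^2 = 0.
    b = theta_prev**2 - q
    return 0.5 * (-b + np.sqrt(b**2 + 4.0 * theta_prev**2))

def accelerated_gradient(grad, x0, t, q=0.0, iters=500):
    """Accelerated gradient iteration with fixed step-size t (illustrative sketch)."""
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()          # y_0 = x_0
    theta_prev = 1.0           # theta_0 = 1
    for _ in range(iters):
        x = y - t * grad(y)                        # gradient step from the auxiliary point
        theta = next_theta(theta_prev, q)
        beta = theta_prev * (1.0 - theta_prev) / (theta_prev**2 + theta)
        y = x + beta * (x - x_prev)                # momentum (extrapolation) step
        x_prev, theta_prev = x, theta
    return x_prev
```

With `q = 0` this reduces to the high-momentum variant; with `q = 1` the root is `theta = 1`, so `beta = 0` and the loop is plain gradient descent.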

4 First order methods contd.
- if $q = q^\star = \mu/L$ then $f(x_k) - f^\star \le C \min\left( (1 - \sqrt{q^\star})^k, \Theta(1/k^2) \right)$ (see e.g. Nesterov 03)
- $q$ controls the momentum in the auxiliary sequence: $\theta_k \to \sqrt{q}$, $\beta_k \to (1 - \sqrt{q})/(1 + \sqrt{q})$
- in practice $\mu$ and $L$ are not known, so $q$ is often set to zero (Nesterov 83 method)
- $q = 0$ corresponds to high momentum
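The limiting values of the momentum sequence follow from the fixed point of the $\theta$ recursion. A short derivation, assuming the recursion converges to a positive limit $\theta$ (an assumption of this sketch, not something the slides prove):

```latex
\begin{align*}
\theta^2 &= (1 - \theta)\theta^2 + q\theta
  && \text{(fixed point of } \theta_k^2 = (1-\theta_k)\theta_{k-1}^2 + q\theta_k\text{)} \\
\theta^3 &= q\theta
  \;\Longrightarrow\; \theta = \sqrt{q}
  && \text{(taking } \theta > 0\text{)} \\
\beta &= \frac{\theta(1 - \theta)}{\theta^2 + \theta}
       = \frac{1 - \theta}{1 + \theta}
       = \frac{1 - \sqrt{q}}{1 + \sqrt{q}}
\end{align*}
```

For small $q$ the limiting $\beta$ is close to 1, which is the high-momentum regime; as $q \to 1$ it tends to 0.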

5 Restart
- instead of using $q$, can use restarts (Nesterov 07)
- many possible restart schemes
- common to set $y_k = x_k$ and $\theta_k = 1$
- optimal restart interval is a function of $L/\mu$ (Gu et al. 2009) and of the step-size
- in fact $q^\star$ is often conservative; careful restarting can do better
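A fixed-interval restart can be sketched as follows. This is an illustrative sketch, not the code behind the slides: the function name and the `interval` parameter are hypothetical, and the restart sets the auxiliary point to the current iterate and resets theta to 1.

```python
import numpy as np

def accelerated_gradient_fixed_restart(grad, x0, t, interval, iters=500, q=0.0):
    """Accelerated gradient with a restart every `interval` iterations (sketch)."""
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    theta_prev = 1.0
    for k in range(1, iters + 1):
        x = y - t * grad(y)
        # Positive root of theta^2 = (1 - theta) * theta_prev^2 + q * theta.
        b = theta_prev**2 - q
        theta = 0.5 * (-b + np.sqrt(b**2 + 4.0 * theta_prev**2))
        beta = theta_prev * (1.0 - theta_prev) / (theta_prev**2 + theta)
        y = x + beta * (x - x_prev)
        if k % interval == 0:
            # Restart: discard the momentum and reset theta.
            y = x.copy()
            theta = 1.0
        x_prev, theta_prev = x, theta
    return x_prev
```

After a restart the next momentum coefficient is zero, because `theta_prev = 1` gives `beta = 0`, so the iteration following a restart is a plain gradient step.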

6 Example
    minimize $f(x) = \|Ax - b\|_2^2$
- $A \in \mathbf{R}^{m \times n}$
- $q^\star = 1/\mathrm{cond}(A^T A)$
- we shall test various restart intervals
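For this least-squares objective the Hessian is $2A^TA$, so $L$ and $\mu$ are twice its largest and smallest eigenvalues, and their ratio gives $q^\star$. The sketch below computes these constants for a randomly generated instance; the data, the seed, and the helper name `least_squares_constants` are hypothetical and do not reproduce the instance used in the slides.

```python
import numpy as np

def least_squares_constants(A):
    """Return (L, mu, q_star) for f(x) = ||A x - b||_2^2."""
    eigs = np.linalg.eigvalsh(A.T @ A)   # eigenvalues of A^T A in ascending order
    L = 2.0 * eigs[-1]                   # Lipschitz constant of the gradient
    mu = 2.0 * eigs[0]                   # strong convexity parameter
    return L, mu, mu / L                 # q_star = mu / L = 1 / cond(A^T A)

rng = np.random.default_rng(0)           # hypothetical random instance
A = rng.standard_normal((200, 100))
b = rng.standard_normal(200)

L, mu, q_star = least_squares_constants(A)
t = 1.0 / L                              # a standard fixed step-size choice

def f(x):
    r = A @ x - b
    return r @ r

def grad(x):
    return 2.0 * A.T @ (A @ x - b)
```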

7 Example contd.
$n = 100$, $m = 200$, $q^\star = 0.007$, fixed step-size (optimized)
[Figure: $(f_k - f^\star)/f^\star$ versus iteration $k$; legend entries include GRA and qstar.]
notice characteristic Nesterov ripples

8 Adaptive restarting
restarting scheme should have the following form:
- at each iteration make some (computationally cheap) observation
- if some condition is satisfied, restart:
    - set $y_k = x_k$ or $y_k = x_{k-1}$
    - set $\theta_k = \hat\theta$, e.g. $\hat\theta$ could be $1$, $2\theta_{k-1}$, etc.
thus we restart not at set intervals, but in a manner that takes into account local information and recent history
potential signals, restart if
- $f(x_k) > f(x_{k-1})$ (function scheme)
- $-\nabla f(y_k)^T (x_k - x_{k-1}) < 0$ (gradient scheme)
almost like forcing the algorithm to be a descent method
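Both restart signals can be sketched in one loop. This is a minimal sketch under stated assumptions: `q = 0`, a fixed step-size `t`, a restart that sets the auxiliary point to the current iterate and theta to 1, and a gradient test that reuses the gradient already computed at the previous auxiliary point (a choice made here to avoid an extra gradient evaluation). The function name and the `scheme` argument are hypothetical.

```python
import numpy as np

def adaptive_restart(f, grad, x0, t, scheme="function", iters=500):
    """Accelerated gradient (q = 0) with adaptive restarts (illustrative sketch)."""
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    theta_prev = 1.0
    restarts = 0
    for _ in range(iters):
        g = grad(y)
        x = y - t * g
        # Positive root of theta^2 = (1 - theta) * theta_prev^2 (the q = 0 case).
        theta = 0.5 * (-theta_prev**2 + np.sqrt(theta_prev**4 + 4.0 * theta_prev**2))
        beta = theta_prev * (1.0 - theta_prev) / (theta_prev**2 + theta)
        y = x + beta * (x - x_prev)
        if scheme == "function":
            restart = f(x) > f(x_prev)            # objective went up
        else:
            restart = g @ (x - x_prev) > 0.0      # last step points uphill w.r.t. g
        if restart:
            y = x.copy()                          # drop the momentum
            theta = 1.0
            restarts += 1
        x_prev, theta_prev = x, theta
    return x_prev, restarts
```

The function scheme costs one extra objective evaluation per iteration, while the gradient test as written here reuses quantities the iteration has already computed.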

9 Example
same example, function scheme in red, gradient scheme in green
[Figure: $(f_k - f^\star)/f^\star$ versus iteration $k$; legend entries include GRA, qstar, adap1, adap.]

10 First order methods and momentum
- Nesterov ripples are caused by a high momentum term
- near the optimum the momentum can be much larger than the gradient
- this leads to spiraling behavior
- restarting simply resets the momentum
- implies adaptive restarting is more effective for better conditioned functions (which need less momentum)

11 Numerical instance contd.
$m = 10$, $n = 2$, with $q = 0$
[Figure: state trajectory of the iterates; axis labels include $x_1$.]

12 Numerical instance contd.
now with function scheme restart in red
[Figure: state trajectory of the iterates; axis labels include $x_1$.]

13 Fixed step-size
- backtracking is prohibitively expensive for many applications
- a fixed step-size is often used in practice
- if $t_k$ is too small, the algorithm will be slow
- if $t_k$ is too large, it will diverge
- optimal restarting is dependent on the step-size
- adaptive restarting is somewhat step-size insensitive

14 first example again, small step-size, function restart scheme in red; fewer restarts do better
[Figure: $(f_k - f^\star)/f^\star$ versus iteration $k$; legend entries include GRA, qstar, adap.]

15 slightly larger step-size; restart every 100 does best
[Figure: $(f_k - f^\star)/f^\star$ versus iteration $k$; legend entries include GRA, qstar, adap.]

16 large step-size; restart every 10 does best, some others diverge
[Figure: $(f_k - f^\star)/f^\star$ versus iteration $k$; legend entries include GRA, qstar, adap.]

17 Non-smooth constrained optimization
follow the approach of Becker, Candes and Grant 10
want to solve
    minimize $f(x)$
    subject to $A(x) + b \in K$
- $f$ is not smooth and does not have a Lipschitz gradient
- dual problem
    maximize $g(\lambda)$
    subject to $\lambda \in K^*$
- we add a strongly convex prox term to the primal
    minimize $f(x) + \mu d(x - x_0)$
    subject to $A(x) + b \in K$
  e.g. $d(z) = (1/2)\|z\|_2^2$

18 equivalent to smoothing the dual
    maximize $g_\mu(\lambda)$
    subject to $\lambda \in K^*$
where $g_\mu$ is a smoothed approximation of $g$
- apply first order methods with projection to the dual problem
- reconstruct $x_\mu$ from the Lagrangian

19 Continuation methods
- inner loop: use a first order method to solve the smoothed dual problem
- in the outer loop can use:
    - homotopy methods, $\mu \to 0$
    - proximal point methods, $x_0 \to x^\star$
    - a hybrid of the two
- as we vary $\mu$ there is a clear trade-off between the number of inner and outer loop iterations

20 Lasso
want to solve
    minimize $\|x\|_1$
    subject to $\|Ax - b\|_2 \le \epsilon$
- add $(\mu/2)\|x - x_0\|_2^2$ to the objective, solve the smoothed dual (modified Lasso)
- see Becker et al. for exact algorithm details
- compare function scheme adaptive restarting to the Nesterov 83 method ($q = 0$)
- the exact dual function is not available to us, so use the estimate $g_\mu(z_k) \approx L_\mu(x_k, z_k)$
- no extra overhead to evaluate this estimate (no extra applications of $A$)

21 Modified Lasso example (inner loop)
$n = 200$, $m = 100$, $\mu = 5$
[Figure: $\|x_k - x_\mu^\star\| / \|x_\mu^\star\|$ versus iteration $k$; legend: nest, adap.]

22 Modified Lasso example (inner loop)
same example, $\mu = 1$
[Figure: $\|x_k - x_\mu^\star\| / \|x_\mu^\star\|$ versus iteration $k$; legend: nest, adap.]

23 Full Lasso
- run the inner loop until convergence (judged by $\|z_k - z_{k-1}\|$)
- fix $\mu$, use the proximal point method
- use accelerated continuation in the outer loop (adds a momentum term)
- no parameter tuning was done whatsoever; could do much better

24 Full Lasso example
same example, full solution, $\mu = 5$
[Figure: $\|x_k - x^\star\| / \|x^\star\|$ versus total iterations $k$; legend: nest, adap.]

25 Full Lasso example
same example, full solution, $\mu = 1$
[Figure: $\|x_k - x^\star\| / \|x^\star\|$ versus total iterations $k$; legend: nest, adap.]

26 Conclusions
- simple adaptive restarting can be remarkably effective
- can provide dramatic speed-ups for both smooth and non-smooth optimization problems
- robust performance over a wide range of fixed step-sizes
- empirically, it seems we do not require a priori knowledge of the function to achieve consistently good performance
