Optimized first-order minimization methods

1 Optimized first-order minimization methods
Donghwan Kim & Jeffrey A. Fessler
EECS Dept., BME Dept., Dept. of Radiology, University of Michigan
web.eecs.umich.edu/~fessler
UM AIM Seminar

2 Disclosure
Research support from GE Healthcare
Research support to GE Global Research
Supported in part by NIH grants R01 HL and P01 CA
Equipment support from Intel Corporation

3 Low-dose X-ray CT image reconstruction
Thin-slice FBP (seconds), ASIR (a bit longer), Statistical (much longer); same sinogram, so all at the same dose.
Image reconstruction as an optimization problem:
\hat{x} = \arg\min_x \|y - A x\|_W^2 + R(x)
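As a rough illustration of the data term above, here is a minimal NumPy sketch of the gradient of the weighted least-squares cost plus a smooth regularizer; the names (A, W, y, reg_grad) are illustrative placeholders, not the actual CT system model used in the talk.

```python
import numpy as np

def wls_cost_grad(x, A, W, y, reg_grad):
    """Gradient of f(x) = ||y - A x||_W^2 + R(x) for a smooth regularizer R.

    A: system matrix, W: diagonal statistical weights (stored as a vector), y: sinogram.
    reg_grad: callable returning the gradient of R at x. All names are illustrative.
    """
    residual = A @ x - y
    return 2.0 * (A.T @ (W * residual)) + reg_grad(x)
```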

4 Outline
Motivation (done)
Problem definition
Existing algorithms
  Gradient descent
  Nesterov's optimal first-order methods
  General first-order methods
Optimizing first-order minimization methods
  Drori & Teboulle's numerical bounds
  Donghwan Kim's analytically optimized ("more optimal") first-order methods
Examples: logistic regression for machine learning; CT image reconstruction
Summary / future work

5 Problem setting

6 Optimization problem setting
\hat{x} \in \arg\min_x f(x)
Unconstrained
Large-scale (Hessian too big to store): image reconstruction, big-data / machine learning, ...
Cost function assumptions (throughout):
f : \mathbb{R}^M \to \mathbb{R} convex (need not be strictly convex)
non-empty set of global minimizers: \hat{x} \in X_* = \{ x_* \in \mathbb{R}^M : f(x_*) \le f(x), \forall x \in \mathbb{R}^M \}
smooth (differentiable with L-Lipschitz gradient): \|\nabla f(x) - \nabla f(z)\|_2 \le L \|x - z\|_2, \forall x, z \in \mathbb{R}^M

7 Example: Machine learning
To learn weights x of a binary classifier given feature vectors \{v_i\} and labels \{y_i\}, where y_i = \pm 1:
f(x) = \sum_i \psi(y_i \langle x, v_i \rangle)
Loss functions (surrogates for the 0-1 loss):
0-1: \psi(t) = I\{t \le 0\}
exponential: \psi(t) = \exp(-t)
logistic: \psi(t) = \log(1 + \exp(-t))
hinge: \psi(t) = \max\{0, 1 - t\}
Which of these fit our conditions? (see the sketch below)
[Figure: the four loss functions \psi(t) plotted versus t]
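To make the comparison concrete, here is a small NumPy sketch of the four surrogates, with t = y_i \langle x, v_i \rangle; of the four, only the logistic loss is both convex and has an L-Lipschitz gradient, so it is the one that fits the assumptions above.

```python
import numpy as np

# t = y_i * <x, v_i>; the four surrogate losses from the slide.
def loss_01(t):        return (t <= 0).astype(float)      # non-convex, non-smooth
def loss_exp(t):       return np.exp(-t)                   # convex, but gradient not globally Lipschitz
def loss_logistic(t):  return np.logaddexp(0.0, -t)        # log(1 + exp(-t)): convex, 1/4-Lipschitz gradient
def loss_hinge(t):     return np.maximum(0.0, 1.0 - t)     # convex, non-smooth
```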

8 Algorithms

9 Gradient descent
Iteration with step size 1/L ensures monotonic descent of f:
x_{n+1} = x_n - \frac{1}{L} \nabla f(x_n)
Stacking the iterates:
[x_1; x_2; \ldots; x_{N-1}; x_N] = [x_0; x_1; \ldots; x_{N-2}; x_{N-1}] - \frac{1}{L} H_{GD} [\nabla f(x_0); \nabla f(x_1); \ldots; \nabla f(x_{N-2}); \nabla f(x_{N-1})], with H_{GD} = I.
Note: the N x N coefficient matrix H_{GD} is diagonal (a special case of lower triangular).
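A minimal NumPy sketch of the plain gradient descent iteration above; grad_f, x0, and the Lipschitz constant L are supplied by the user.

```python
import numpy as np

def gradient_descent(grad_f, x0, L, n_iter):
    """GD with the fixed step 1/L: x_{n+1} = x_n - (1/L) grad f(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - grad_f(x) / L
    return x
```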

10 Gradient descent convergence rate
Classic O(1/n) convergence rate of cost function descent (the left side is the "inaccuracy"):
f(x_n) - f(x_*) \le \frac{L \|x_0 - x_*\|_2^2}{2n}
Drori & Teboulle (2013) derive the tightest inaccuracy bound:
f(x_n) - f(x_*) \le \frac{L \|x_0 - x_*\|_2^2}{4n + 2}
They construct a Huber-like function f for which GD achieves that bound, so the case is closed for GD.
The O(1/n) rate is undesirably slow.

11 Heavy ball method
Iteration (Polyak, 1987), for implementation:
x_{n+1} = x_n - \frac{\alpha}{L} \nabla f(x_n) + \beta (x_n - x_{n-1})    (the last term is the momentum!)
Stacking, for analysis:
x_{n+1} = x_n - \frac{1}{L} \sum_{k=0}^{n} \alpha \beta^{n-k} \nabla f(x_k),
i.e., the N x N coefficient matrix H_{HB} is lower triangular with entries h_{n+1,k} = \alpha \beta^{n-k} (last row: \alpha\beta^{N-1}, \ldots, \alpha\beta^2, \alpha\beta, \alpha).
How to choose \alpha and \beta? How to optimize the N x N coefficient matrix H more generally?
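A sketch of the heavy-ball recursion in NumPy, using the implementation form rather than the stacked H_HB form; alpha and beta are left as user-chosen parameters, which is exactly the open question posed on the slide.

```python
import numpy as np

def heavy_ball(grad_f, x0, L, alpha, beta, n_iter):
    """Polyak heavy ball: x_{n+1} = x_n - (alpha/L) grad f(x_n) + beta (x_n - x_{n-1})."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(n_iter):
        x_next = x - (alpha / L) * grad_f(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x
```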

12 General first-order method class
General first-order (FO) iteration:
x_{n+1} = x_n - \frac{1}{L} \sum_{k=0}^{n} h_{n+1,k} \nabla f(x_k)
Stacking: the N x N coefficient matrix H_{FO} = (h_{n+1,k}) is lower triangular ("causal"), with row n+1 holding h_{n+1,0}, \ldots, h_{n+1,n}.
Primary goals:
Analyze the convergence rate of FO for any given H
Optimize the N x N lower-triangular step-size coefficient matrix H for fast convergence, an efficient recursive implementation, and universality (design prior to iterating).
(A sketch of the general FO iteration follows below.)
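The general FO class can be run directly from a given lower-triangular coefficient matrix H; this NumPy sketch stores every past gradient, which is precisely the O(N)-memory / O(N^2)-work "analysis" form (efficient recursions for specific H come later).

```python
import numpy as np

def first_order_method(grad_f, x0, L, H):
    """General FO iteration: x_{n+1} = x_n - (1/L) sum_{k<=n} H[n, k] * grad f(x_k).

    H is an N x N lower-triangular array; row n holds the coefficients h_{n+1,k}.
    """
    x = np.asarray(x0, dtype=float)
    grads = []
    for n in range(H.shape[0]):
        grads.append(grad_f(x))
        x = x - sum(H[n, k] * grads[k] for k in range(n + 1)) / L
    return x
```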

13 Not: Barzilai-Borwein gradient method
Barzilai & Borwein, 1988:
g^{(n)} \triangleq \nabla f(x_n),
\alpha_n = \frac{\|x_n - x_{n-1}\|_2^2}{\langle x_n - x_{n-1},\, g^{(n)} - g^{(n-1)} \rangle},
x_{n+1} = x_n - \alpha_n \nabla f(x_n).
Not in the first-order class FO (the step size depends on the iterates). Neither are methods like steepest descent (with line search), conjugate gradient, quasi-Newton, ...
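For contrast, here is a sketch of the BB1 iteration; note that alpha_n is computed from the iterates and gradients themselves, which is why this method (like line-search and quasi-Newton methods) falls outside the fixed-coefficient FO class.

```python
import numpy as np

def barzilai_borwein(grad_f, x0, L, n_iter):
    """Barzilai-Borwein (BB1) step sizes; one initial 1/L step provides the first difference pair."""
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad_f(x_prev)
    x = x_prev - g_prev / L
    for _ in range(n_iter):
        g = grad_f(x)
        s, r = x - x_prev, g - g_prev
        alpha = np.dot(s, s) / np.dot(s, r)   # alpha_n = ||s||^2 / <s, g^(n) - g^(n-1)>
        x_prev, g_prev = x, g
        x = x - alpha * g
    return x
```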

14 Nesterov's fast gradient method (FGM1)
Nesterov (1983) iteration:
Initialize: t_0 = 1, z_0 = x_0
z_{n+1} = x_n - \frac{1}{L} \nabla f(x_n)    (usual GD update)
t_{n+1} = \frac{1}{2}\left(1 + \sqrt{1 + 4 t_n^2}\right)    (magic momentum factors)
x_{n+1} = z_{n+1} + \frac{t_n - 1}{t_{n+1}} (z_{n+1} - z_n)    (update with momentum)
Reverts to GD if t_n = 1 for all n.
FGM1 is in class FO: x_{n+1} = x_n - \frac{1}{L} \sum_{k=0}^{n} h_{n+1,k} \nabla f(x_k), with
h_{n+1,k} = \frac{t_n - 1}{t_{n+1}} h_{n,k}, k = 0, \ldots, n-2;
h_{n+1,k} = \frac{t_n - 1}{t_{n+1}} (h_{n,n-1} - 1), k = n-1;
h_{n+1,k} = 1 + \frac{t_n - 1}{t_{n+1}}, k = n.
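A sketch of FGM1 in its efficient two-sequence form (the h_{n+1,k} recursion above is only needed for analysis, not for implementation).

```python
import numpy as np

def fgm1(grad_f, x0, L, n_iter):
    """Nesterov's fast gradient method (FGM1)."""
    x = np.asarray(x0, dtype=float)
    z = x.copy()
    t = 1.0
    for _ in range(n_iter):
        z_new = x - grad_f(x) / L                         # usual GD update
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t**2))   # momentum factor
        x = z_new + ((t - 1.0) / t_new) * (z_new - z)     # momentum update
        z, t = z_new, t_new
    return x
```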

15 Nesterov FGM1 optimal convergence rate
Shown by Nesterov to be O(1/n^2) for the auxiliary sequence:
f(z_n) - f(x_*) \le \frac{2 L \|x_0 - x_*\|_2^2}{(n+1)^2}.
Nesterov also constructed a function f for which any first-order method satisfies
\frac{3 L \|x_0 - x_*\|_2^2}{32 (n+1)^2} \le f(x_n) - f(x_*).
Thus the O(1/n^2) rate of FGM1 is optimal (up to a constant).
New results (Donghwan Kim, 2014): bound on the convergence rate of the primary sequence \{x_n\}:
f(x_n) - f(x_*) \le \frac{2 L \|x_0 - x_*\|_2^2}{(n+2)^2}.
This verifies a (numerically inspired) conjecture of Drori & Teboulle (2013).

16 Overview
General first-order (FO) iteration:
x_{n+1} = x_n - \frac{1}{L} \sum_{k=0}^{n} h_{n+1,k} \nabla f(x_k)
Analyze (i.e., bound) the convergence rate as a function of:
the number of iterations N
the Lipschitz constant L
the step-size coefficients H = \{h_{n+1,k}\}
the distance to a solution R = \|x_0 - x_*\|
Then optimize H by minimizing the bound.

17 Ideal universal bound for first-order methods
For a given number of iterations N, Lipschitz constant L, step-size coefficients H = \{h_{n+1,k}\}, and distance to a solution R = \|x_0 - x_*\|, bound the worst-case convergence rate of the FO algorithm:
B_1(H, R, L, N) \triangleq \max_{f \in F_L} \max_{x_0, x_1, \ldots, x_N \in \mathbb{R}^M} \max_{x_* \in X_*(f)} \; f(x_N) - f(x_*)
such that \|x_0 - x_*\| \le R and
x_{n+1} = x_n - \frac{1}{L} \sum_{k=0}^{n} h_{n+1,k} \nabla f(x_k), n = 0, \ldots, N-1.
Clearly, for any FO method: f(x_N) - f(x_*) \le B_1(H, R, L, N).

18 Towards practical bounds for first-order methods
For convex functions with L-Lipschitz gradients:
\frac{1}{2L} \|\nabla f(x) - \nabla f(z)\|^2 \le f(x) - f(z) - \langle \nabla f(z), x - z \rangle, \forall x, z \in \mathbb{R}^M.
Drori & Teboulle (2013) use this inequality to propose a more tractable bound, where g_n = \frac{1}{LR} \nabla f(x_n) and \delta_n = \frac{1}{L R^2} (f(x_n) - f(x_*)):
B_2(H, R, L, N) \triangleq \max_{g_0, \ldots, g_N \in \mathbb{R}^M} \max_{\delta_0, \ldots, \delta_N \in \mathbb{R}} \max_{x_0, x_1, \ldots, x_N \in \mathbb{R}^M} \max_{x_* : \|x_0 - x_*\| \le R} \; L R^2 \delta_N
such that x_{n+1} = x_n - R \sum_{k=0}^{n} h_{n+1,k} g_k, n = 0, \ldots, N-1, and
\frac{1}{2} \|g_i - g_j\|^2 \le \delta_i - \delta_j - \frac{1}{R} \langle g_j, x_i - x_j \rangle, i, j = 0, \ldots, N.
For any FO method: f(x_N) - f(x_*) \le B_1(H, R, L, N) \le B_2(H, R, L, N).
However, even B_2 has no known closed-form solution.

19 Numerical bounds for first-order methods
Drori & Teboulle (2013) further relax the bound, leading eventually to a still simpler optimization problem (with no known closed-form solution):
f(x_N) - f(x_*) \le B_1(H, R, L, N) \le B_2(H, R, L, N) \le B_3(H, R, L, N).
For given step-size coefficients H and a given number of iterations N, they use a semidefinite program (SDP) to compute B_3 numerically.
They find numerically that, for the FGM1 choice of H, the convergence bound B_3 is slightly tighter than \frac{2 L \|x_0 - x_*\|_2^2}{(N+1)^2}.

20 Optimizing step-size coefficients numerically
Drori & Teboulle (2013) also compute numerically the minimizer over H of their relaxed bound for a given N, again using a semidefinite program (SDP):
H_* = \arg\min_H B_3(H, R, L, N).
[Figure: numerical solution for H_* for N = 5 iterations, from Drori & Teboulle (2013)]
Drawbacks:
Must choose N in advance
Requires O(N) memory for all gradient vectors \{\nabla f(x_n)\}_{n=1}^{N} and O(N^2) computation for N iterations
Benefit: the convergence bound (for the specific N) is about 2x lower than for Nesterov's FGM1.

21 New analytical solution
Analytical solution H_* for the optimized step-size coefficients (Donghwan Kim, 2014):
h_{n+1,k} = \frac{\theta_n - 1}{\theta_{n+1}} h_{n,k}, k = 0, \ldots, n-2;
h_{n+1,k} = \frac{\theta_n - 1}{\theta_{n+1}} (h_{n,n-1} - 1), k = n-1;
h_{n+1,k} = 1 + \frac{2\theta_n - 1}{\theta_{n+1}}, k = n;
where \theta_0 = 1, \theta_n = \frac{1}{2}\left(1 + \sqrt{1 + 4\theta_{n-1}^2}\right) for n = 1, \ldots, N-1, and \theta_N = \frac{1}{2}\left(1 + \sqrt{1 + 8\theta_{N-1}^2}\right).
Analytical convergence bound for these optimized step-size coefficients:
f(x_N) - f(x_*) \le B_3(H_*, R, L, N) = \frac{L \|x_0 - x_*\|_2^2}{(N+1)(N+1+\sqrt{2})}.
Of course the bound is O(1/N^2), but the constant is twice better than Nesterov's.
No numerical SDP needed, so it is feasible for large N.
(History: we first sought a banded / structured lower-triangular form.)
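The theta sequence is all that is needed to generate the optimized coefficients; a small NumPy sketch of the recursion as written above (note the modified final step):

```python
import numpy as np

def ogm_theta(N):
    """theta_0, ..., theta_N: the FGM-style recursion for n < N and a modified final factor at n = N."""
    theta = np.empty(N + 1)
    theta[0] = 1.0
    for n in range(1, N):
        theta[n] = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * theta[n - 1]**2))
    theta[N] = 0.5 * (1.0 + np.sqrt(1.0 + 8.0 * theta[N - 1]**2))  # final step uses 8 instead of 4
    return theta
```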

22 Optimized gradient method (OGM1)
Donghwan Kim (2014) found an efficient recursive iteration:
Initialize: \theta_0 = 1, z_0 = x_0
z_{n+1} = x_n - \frac{1}{L} \nabla f(x_n)    (usual GD update)
\theta_n = \frac{1}{2}\left(1 + \sqrt{1 + 4\theta_{n-1}^2}\right) for n = 1, \ldots, N-1; \theta_N = \frac{1}{2}\left(1 + \sqrt{1 + 8\theta_{N-1}^2}\right)    (momentum factors)
x_{n+1} = z_{n+1} + \frac{\theta_n - 1}{\theta_{n+1}} (z_{n+1} - z_n) + \frac{\theta_n}{\theta_{n+1}} (z_{n+1} - x_n)    (the last term is the new momentum)
Reverts to Nesterov's FGM1 if the new terms are removed.
Very simple modification of existing Nesterov code.
No need to choose N in advance (or solve an SDP); use your favorite stopping rule, then run one last decreased-momentum step.
Factor-of-2 better upper bound than Nesterov's optimal FGM1. (Proofs omitted.)
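A sketch of OGM1 as stated above: FGM1 plus one extra momentum term, with the modified momentum factor applied on the last iteration only (so N matters only at the final step).

```python
import numpy as np

def ogm1(grad_f, x0, L, N):
    """OGM1: Nesterov FGM1 plus the extra (theta_n / theta_{n+1}) (z_{n+1} - x_n) term."""
    x = np.asarray(x0, dtype=float)
    z = x.copy()
    theta = 1.0
    for n in range(N):
        z_new = x - grad_f(x) / L                                   # usual GD update
        factor = 4.0 if n < N - 1 else 8.0                          # last iteration uses the modified factor
        theta_new = 0.5 * (1.0 + np.sqrt(1.0 + factor * theta**2))
        x = (z_new
             + ((theta - 1.0) / theta_new) * (z_new - z)            # Nesterov momentum
             + (theta / theta_new) * (z_new - x))                   # new OGM momentum term
        z, theta = z_new, theta_new
    return x
```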

23 Numerical Example(s)

24 Machine learning (logistic regression)
To learn weights x of a binary classifier given feature vectors \{v_i\} and labels \{y_i\}, where y_i = \pm 1:
\hat{x} = \arg\min_x f(x),    f(x) = \sum_i \psi(y_i \langle x, v_i \rangle) + \frac{\beta}{2} \|x\|_2^2,
logistic loss: \psi(t) = \log(1 + e^{-t}),    \dot\psi(t) = -\frac{1}{e^t + 1},    \ddot\psi(t) = \frac{e^t}{(e^t + 1)^2}
Gradient: \nabla f(x) = \sum_i y_i v_i \dot\psi(y_i \langle x, v_i \rangle) + \beta x
Hessian is positive definite, so f is strictly convex:
\nabla^2 f(x) = \sum_i v_i v_i' \ddot\psi(y_i \langle x, v_i \rangle) + \beta I \preceq \frac{1}{4} \sum_i v_i v_i' + \beta I, since \ddot\psi(t) \in (0, \tfrac{1}{4}].
L = \frac{1}{4} \rho\left(\sum_i v_i v_i'\right) + \beta \ge \max_x \rho\left(\nabla^2 f(x)\right)
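A sketch of the gradient and Lipschitz constant for the regularized logistic loss above; V stacks the feature vectors v_i as rows and y holds the +/-1 labels (names are illustrative).

```python
import numpy as np

def logistic_grad_and_L(x, V, y, beta):
    """Gradient of f(x) = sum_i psi(y_i <x, v_i>) + (beta/2) ||x||^2 and a Lipschitz constant."""
    t = y * (V @ x)                               # t_i = y_i <x, v_i>
    psi_dot = -1.0 / (np.exp(t) + 1.0)            # psi'(t) = -1 / (e^t + 1)
    grad = V.T @ (y * psi_dot) + beta * x
    L = 0.25 * np.linalg.norm(V.T @ V, 2) + beta  # psi'' <= 1/4, so (1/4) rho(V'V) + beta bounds the Hessian
    return grad, L
```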

25 Numerical Results: logistic regression
[Figure: training data in the (v_1, v_2) feature plane, with the initial decision boundary (red) and the final decision boundary (magenta)]

26 Numerical Results: convergence rates
[Figure: \|x_n - x_*\| versus iteration for GD, Nesterov's FGM1, and OGM1]

27 Numerical Results: adaptive restart
[Figure: \|x_n - x_*\| versus iteration for GD, Nesterov (restart), and OGM1 (restart)]
Adaptive restart: O'Donoghue & Candès, 2015.
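For reference, one common form of adaptive restart (the function-value rule of O'Donoghue & Candès) wrapped around FGM1; this is a sketch of the general idea, and the talk's experiments may use a different restart criterion.

```python
import numpy as np

def fgm1_function_restart(f, grad_f, x0, L, n_iter):
    """FGM1 with function-value adaptive restart: reset the momentum (t = 1) whenever f increases."""
    x = np.asarray(x0, dtype=float)
    z = x.copy()
    t = 1.0
    f_prev = f(x)
    for _ in range(n_iter):
        z_new = x - grad_f(x) / L
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t**2))
        x = z_new + ((t - 1.0) / t_new) * (z_new - z)
        f_cur = f(z_new)
        if f_cur > f_prev:       # objective increased: discard the momentum and restart
            t_new = 1.0
            x = z_new
        z, t, f_prev = z_new, t_new, f_cur
    return x
```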

28 Low-dose 2D X-ray CT image reconstruction simulation
[Figure: RMSD [HU] versus iteration for GD, FGM, and OGM]

29 Summary
New optimized first-order minimization algorithm:
Simple implementation akin to Nesterov's FGM
Analytical convergence rate bound
Bound is 2x better than Nesterov's
Future work:
Constraints
Non-smooth cost functions, e.g., \ell_1
Tighter bounds
Strongly convex case
Asymptotic / local convergence rates
Incremental gradients
Stochastic gradient descent
Adaptive restart
Low-dose 3D X-ray CT image reconstruction

30 Bibliography
[1] Y. Drori and M. Teboulle. Performance of first-order methods for smooth convex minimization: A novel approach. Mathematical Programming, 145(1-2):451-482, June 2014.
[2] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Dokl. Akad. Nauk. USSR, 269(3):543-547, 1983.
[3] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, May 2005.
[4] D. Kim and J. A. Fessler. Optimized first-order methods for smooth convex minimization. Mathematical Programming. Submitted.
[5] D. Böhning and B. G. Lindsay. Monotonicity of quadratic approximation algorithms. Ann. Inst. Stat. Math., 40(4):641-663, December 1988.
[6] B. O'Donoghue and E. Candès. Adaptive restart for accelerated gradient schemes. Found. Computational Math., 15(3):715-732, June 2015.
