A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization
Panos Parpas, Department of Computing, Imperial College London
www.doc.ic.ac.uk/~pp500 | p.parpas@imperial.ac.uk
Jointly with D.V. (Ryan) Luong, D. Rueckert, B. Rustem
Composite Convex Optimisation

$$\min_{x \in \Omega} f(x) + g(x)$$

- $f : \Omega \to \mathbb{R}$ convex with Lipschitz continuous gradient, $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$
- $g : \Omega \to \mathbb{R}$ convex, continuous, non-differentiable; $g$ is "simple" (its proximal operator is cheap to evaluate)
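A concrete instance (an illustration, not from the slides) is $\ell_1$-regularised least squares: $f(x) = \|Ax - b\|_2^2$ is the smooth part, and $g(x) = \mu\|x\|_1$ is "simple" because its proximal operator is the closed-form soft-thresholding map.

```python
import numpy as np

def f(x, A, b):
    # Smooth part: least-squares data fit; grad f is Lipschitz with L = 2*||A||_2^2.
    return np.sum((A @ x - b) ** 2)

def grad_f(x, A, b):
    return 2.0 * A.T @ (A @ x - b)

def g(x, mu):
    # Non-smooth but "simple" part: l1 penalty.
    return mu * np.sum(np.abs(x))

def prox_g(y, mu, L):
    # prox of g/L: soft-thresholding, available in closed form.
    return np.sign(y) * np.maximum(np.abs(y) - mu / L, 0.0)
```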
Algorithms: A Small Sample

First Order Algorithms for $\min_x f(x) + g(x)$:
- Iterative Shrinkage Thresholding Algorithm (ISTA, Proximal Point Algorithm) [Rockafellar, 1976]
- Accelerated Gradient Methods [Nesterov, 2013]
- Fast Iterative Shrinkage Thresholding Algorithm (FISTA) [Beck and Teboulle, 2009]
- Block Coordinate Descent [Nesterov, 2012]
- Incremental gradient/subgradient [Bertsekas, 2011]
- Mirror Descent [Ben-Tal et al., 2001]
- Smoothing Algorithms [Nesterov, 2005]
- Bundle Methods [Kiwiel, 1990]
- Dual Proximal Augmented Lagrangian Method [Yang and Zhang, 2011]
- Homotopy Methods [Donoho and Tsaig, 2008]
Composite Convex Optimisation (Multilevel Notation)

$$\min_{x \in \Omega_h} f_h(x) + g_h(x)$$

- $f_h : \Omega_h \to \mathbb{R}$ convex with Lipschitz continuous gradient, $\|\nabla f_h(x) - \nabla f_h(y)\| \le L \|x - y\|$
- $g_h : \Omega_h \to \mathbb{R}$ convex, continuous, non-differentiable; $g_h$ is simple

Multilevel notation: $h$ = fine (full) model, $H$ = coarse (approximate) model.
$$\min_{x \in \Omega_h} F_h(x) \triangleq f_h(x) + g_h(x)$$

Iterative Shrinkage Thresholding Algorithm (ISTA)

1. Iteration $k$: given $x_{h,k}$, $f_{h,k}$, $\nabla f_{h,k}$, $L_h$.
2. Quadratic Approximation (a computationally favorable model):
$$Q_L(x_{h,k}, x) = f_{h,k} + \langle \nabla f_{h,k}, x - x_{h,k} \rangle + \frac{L_h}{2} \|x - x_{h,k}\|^2 + g_h(x)$$
3. Compute the Gradient Map: solve (approximately) the favorable model,
$$D_{h,k} = x_{h,k} - \mathrm{prox}_h\!\left(x_{h,k} - \frac{1}{L_h} \nabla f_{h,k}\right) = x_{h,k} - \arg\min_x \left\{ \frac{L_h}{2} \left\| x - \left(x_{h,k} - \frac{1}{L_h} \nabla f_{h,k}\right) \right\|^2 + g_h(x) \right\} = x_{h,k} - \arg\min_x Q_L(x_{h,k}, x)$$
4. Error Correction Step: compute and apply the error correction,
$$x_{h,k+1} = x_{h,k} - s_{h,k} D_{h,k}.$$
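A minimal NumPy sketch of these four steps, assuming the $\ell_1$ instance above (so the prox is soft-thresholding) and the common choices of step $1/L_h$ and $s_{h,k} = 1$:

```python
import numpy as np

def soft_threshold(y, t):
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def ista(A, b, mu, num_iters=500):
    L = 2.0 * np.linalg.norm(A, 2) ** 2                # Lipschitz constant of grad f
    x = np.zeros(A.shape[1])
    for k in range(num_iters):
        grad = 2.0 * A.T @ (A @ x - b)                 # step 1: gradient at x_{h,k}
        x_plus = soft_threshold(x - grad / L, mu / L)  # steps 2-3: argmin of Q_L
        D = x - x_plus                                 # gradient map D_{h,k}
        x = x - 1.0 * D                                # step 4 with s_{h,k} = 1
    return x
```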
What is a Computationally Favorable Model?

Global convergence rate: $F_h(x_k) - F_h^* \le O(L_h / k)$. Two ways to make the model cheaper:

1. A smoother model (smaller Lipschitz constant)
2. A lower dimensional model (e.g. $x_H \in \mathbb{R}^H$ with $H < h$)

Numerical Experiment:
$$\min_{x \in \mathbb{R}^n} \|Ax - b\|_2^2 + \mu \|x\|_1, \qquad b \in \mathbb{R}^m, \; A \in \mathbb{R}^{m \times n} \text{ (Gaussian } \mathcal{N}(0,1)\text{)}$$
$10^4$ random models for each $n \in \{25, 50, 100, 200, 400\}$.
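A sketch of this experiment (the row count $m$ is not stated on the slide; the choice below is an assumption): the estimated Lipschitz constant, and hence the $O(L_h/k)$ bound, grows quickly with $n$.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [25, 50, 100, 200, 400]:
    m = n // 2                            # assumed row count; not stated on the slide
    A = rng.standard_normal((m, n))
    L = 2.0 * np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the smooth part
    print(f"n = {n:4d}   L = {L:9.1f}")
```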
What is a Computationally Favorable Model?

$$\min_{x \in \mathbb{R}^n} \|Ax - b\|_2^2 + \mu \|x\|_1$$

[Figure: Lipschitz constant ($\times 10^4$) and ISTA iterations ($\times 10^3$) vs. model dimension $n \in \{25, 50, 100, 200, 400\}$]

[Figure: Lipschitz constant ($\times 10^4$) and FISTA iterations ($\times 10^2$) vs. model dimension $n \in \{25, 50, 100, 200, 400\}$]
Background: Composite Convex Optimisation $\min_x f(x) + g(x)$
- Applications & Algorithms
- Quadratic Approximations & Search Directions
- Coarse Approximations: Motivation

A Multilevel Algorithm
- Information Transfer
- Coarse Model Construction
- Multilevel Iterative Shrinkage Thresholding (MISTA)
- Convergence Analysis
State-of-the-art in Multilevel Optimization

Nonlinear Optimization:
- Nash, S. G. A multigrid approach to discretized optimization problems. Optimization Methods and Software, 2000
- Gratton, S., Sartenaer, A., Toint, P. L. Recursive trust-region methods for multiscale nonlinear optimization. SIAM Journal on Optimization, 2008
- Wen, Zaiwen, and D. Goldfarb. A line search multigrid method for large-scale nonlinear optimization. SIAM Journal on Optimization, 2009

This work:
- Nonsmooth/constrained problems
- Beyond PDEs!
- Same complexity as ISTA but up to 10× faster, and 3× faster than FISTA

PP, Duy V. N. Luong, Daniel Rueckert, Berc Rustem, A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization, Optimization Online, March 2014
Paper & Code: http://www.doc.ic.ac.uk/~pp500/mista.html
Information Transfer Between Levels

- Coarse model design vector: $x_H \in \mathbb{R}^H$
- Fine model design vector: $x_h \in \mathbb{R}^h$, with $h > H$

Two standard techniques (I. Geometric, II. Algebraic):
- Restriction operator: $I_h^H : \mathbb{R}^h \to \mathbb{R}^H$
- Prolongation operator: $I_H^h : \mathbb{R}^H \to \mathbb{R}^h$

Main assumption: $I_h^H = c\,(I_H^h)^\top$, $c > 0$.
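A 1D geometric sketch (assuming the standard multigrid pair: linear-interpolation prolongation with full-weighting restriction, which satisfies the assumption with $c = 1/2$):

```python
import numpy as np

def prolongation(H, h):
    # Linear-interpolation prolongation I_H^h : R^H -> R^h (here h = 2H).
    P = np.zeros((h, H))
    for j in range(H):
        P[2 * j, j] = 1.0                 # coarse point copied to the fine grid
        if 2 * j + 1 < h:
            P[2 * j + 1, j] = 0.5         # in-between fine point: average of
            if j + 1 < H:                 # the two neighbouring coarse points
                P[2 * j + 1, j + 1] = 0.5
    return P

P = prolongation(4, 8)   # I_H^h
R = 0.5 * P.T            # I_h^H = c (I_H^h)^T with c = 1/2 (full weighting)
```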
Coarse Model Construction: Smooth Case

First Order Coherent Condition for $\min f_h(x_h)$. Start the coarse model at $x_{H,0} = I_h^H x_{h,k}$; we then require $\nabla f_{H,0} = I_h^H \nabla f_{h,k}$. Define

$$f_H(x_H) = \underbrace{\hat{f}_H(x_H)}_{\text{coarse representation of } f_h} + \underbrace{\langle I_h^H \nabla f_{h,k} - \nabla \hat{f}_{H,0},\; x_H \rangle}_{\text{first order coherent}}$$

[Lewis and Nash, 2005, Gratton et al., 2008, Wen and Goldfarb, 2009]
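A sketch of the construction in code (names are illustrative): the linear term cancels the coarse gradient at $x_{H,0}$ and replaces it with the restricted fine gradient, so the two models agree to first order there.

```python
def first_order_coherent(f_hat_H, grad_f_hat_H, R, grad_fhk, x_H0):
    # f_H(x) = f_hat_H(x) + <v, x> with v = I_h^H grad f_{h,k} - grad f_hat_H(x_H0),
    # so that grad f_H(x_H0) = I_h^H grad f_{h,k}.
    v = R @ grad_fhk - grad_f_hat_H(x_H0)
    f_H = lambda x: f_hat_H(x) + v @ x
    grad_f_H = lambda x: grad_f_hat_H(x) + v
    return f_H, grad_f_H
```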
Non-Smooth Case

$$\min f_h(x_h) + g_h(x_h)$$

Optimality conditions via the gradient mapping:
$$D_{h,k} = x_{h,k} - \mathrm{prox}_h\!\left(x_{h,k} - \frac{1}{L_h} \nabla f_{h,k}\right) = x_{h,k} - \arg\min_x \left\{ \frac{L_h}{2} \left\| x - \left(x_{h,k} - \frac{1}{L_h} \nabla f_{h,k}\right) \right\|^2 + g_h(x) \right\}$$

$D_{h,k} = 0$ if and only if $x_{h,k}$ is stationary.

First Order Coherent Condition: $D_{H,0} = I_h^H D_{h,k}$.
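In code, the gradient mapping doubles as a stationarity residual and as the quantity the coarse model must reproduce (a sketch; the prox signature is an assumption):

```python
def gradient_map(x, grad_f_x, L, prox):
    # D = x - prox(x - grad_f(x)/L); D == 0 iff x is stationary.
    return x - prox(x - grad_f_x / L, 1.0 / L)

# First-order coherence target for the coarse model:  D_H0 = R @ D_hk
```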
Three Ways to Satisfy First Order Coherency

Fine model ($h$): $\min_{x_h} f_h(x_h) + g_h(x_h)$

Coarse model ($H$): $\min_{x_H} \underbrace{f_H(x_H) + g_H(x_H)}_{\text{coarse representation of } f_h + g_h} + \underbrace{\langle v_H, x_H \rangle}_{\text{forces FOC}}$

1. Smooth coarse model: $v_H = L_H I_h^H D_{h,k} - (\nabla f_{H,0} + \nabla g_{H,0})$.
2. Non-smooth model, dual norm projection: $v_H = \frac{L_H}{L_h} I_h^H \nabla f_{h,k} - \nabla f_{H,0}$.
3. Non-smooth model, indicator projection: $v_H = L_H x_{H,0} - \left( \nabla f_{H,0} + L_H I_h^H \mathrm{prox}_h\!\left(x_{h,k} - \frac{1}{L_h} \nabla f_{h,k}\right) \right)$.
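For example, option 1 is a one-liner once $D_{h,k}$, the restriction $R = I_h^H$, and the coarse gradients at $x_{H,0}$ are available (a sketch):

```python
def v_smooth(L_H, R, D_hk, grad_fH0, grad_gH0):
    # Option 1: v_H = L_H I_h^H D_{h,k} - (grad f_{H,0} + grad g_{H,0})
    return L_H * (R @ D_hk) - (grad_fH0 + grad_gH0)
```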
Multilevel Iterative Shrinkage Thresholding Algorithm (MISTA)

1.0 If the condition to use the coarse model is satisfied at $x_{h,k}$:
  1.1 Set $x_{H,0} = I_h^H x_{h,k}$
  1.2 Perform $m$ coarse iterations with any monotone algorithm
  1.3 Compute the feasible coarse correction term $d_{h,k} = I_H^h (x_{H,0} - x_{H,m})$
  1.4 Update the fine model:
    $x_h^+ = \mathrm{prox}_h(x_{h,k} - \tau d_{h,k})$
    $x_{h,k+1} = x_{h,k} - s_{h,k} (x_{h,k} - x_h^+)$
  1.5 Go to 1.0
2.0 Otherwise do a fine iteration with any monotone algorithm; go to 1.0.
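A compact sketch of one MISTA iteration (helper names are illustrative; ISTA can serve as the monotone algorithm at either level, and the coarse-entry test is spelled out on the next slide):

```python
def mista_step(x_hk, R, P, coarse_step, fine_step, prox_h, tau, s_hk, m,
               use_coarse):
    if use_coarse(x_hk):
        x_H0 = R @ x_hk                       # 1.1 restrict to the coarse level
        x_H = x_H0
        for _ in range(m):                    # 1.2 m monotone coarse iterations
            x_H = coarse_step(x_H)
        d_hk = P @ (x_H0 - x_H)               # 1.3 prolong the coarse correction
        x_plus = prox_h(x_hk - tau * d_hk)    # 1.4 feasible projection step
        return x_hk - s_hk * (x_hk - x_plus)  #     and fine-level update
    return fine_step(x_hk)                    # 2.0 ordinary fine iteration
```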
Multilevel Iterative Shrinkage Thresholding Algorithm (MISTA)

Conditions to use the coarse model:
$$\|I_h^H D_{h,k}\| > \kappa \|D_{h,k}\|, \qquad \|x_{h,k} - \tilde{x}_h\| > \eta \|\tilde{x}_h\|,$$
where $\tilde{x}_h$ is the last point at which a coarse correction was used.

- Monotone algorithm: e.g. ISTA
- Step size: determined by the convergence analysis
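The two tests in code (a sketch; the thresholds $\kappa, \eta \in (0,1)$ shown are illustrative defaults, not the paper's values):

```python
import numpy as np

def use_coarse(R, D_hk, x_hk, x_tilde, kappa=0.5, eta=0.1):
    # Coarse correction is worthwhile only if the restricted gradient map keeps
    # enough mass and the iterate has moved since the last coarse correction.
    big_enough = np.linalg.norm(R @ D_hk) > kappa * np.linalg.norm(D_hk)
    moved = np.linalg.norm(x_hk - x_tilde) > eta * np.linalg.norm(x_tilde)
    return big_enough and moved
```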
Global Convergence Rate Analysis

Main Result: the coarse correction step is a contraction on the optimal point:
$$\|x_{h,k+1} - x_{h,*}\|^2 \le \sigma \|x_{h,k} - x_{h,*}\|^2, \qquad \sigma \in (0, 1).$$

Theorem [Rockafellar, 1976, Theorem 1]
Suppose that the coarse correction step satisfies the contraction property, and that $f(x)$ is convex and has Lipschitz-continuous gradients. Then any MISTA step is at least nonexpansive (coarse correction steps are contractions and gradient proximal steps are nonexpansive),
$$\|x_{h,k+1} - x_{h,*}\|^2 \le \|x_{h,k} - x_{h,*}\|^2,$$
and the sequence $\{x_{h,k}\}$ converges R-linearly.

Theorem [Rockafellar, 1976, Theorem 2]
Suppose that the coarse correction step satisfies the contraction property, and that $f(x)$ is strongly convex and has Lipschitz-continuous gradients. Then any MISTA step is always a contraction,
$$\|x_{h,k+1} - x_{h,*}\|^2 \le \sigma \|x_{h,k} - x_{h,*}\|^2, \qquad \sigma \in (0, 1),$$
and the sequence $\{x_{h,k}\}$ converges Q-linearly.
Key Lemma

Useful inequality: $\langle x - y, \nabla f(x) - \nabla f(y) \rangle \ge \frac{1}{L} \|\nabla f(x) - \nabla f(y)\|^2$.

Lemma
Consider two coarse correction terms generated by performing $m$ iterations at the coarse level, starting from the points $x_{h,k}$ and $x_{h,*}$:
$$d_{h,k} = I_H^h e_{H,m} = I_H^h \sum_{i=0}^{m-1} s_{H,i} D_{H,i}, \qquad d_{h,*} = I_H^h \sum_{i=0}^{m-1} s_{H,i} D_{H,i}^* = 0.$$
Then the following inequality holds:
$$\langle x_{h,k} - x_{h,*},\; d_{h,k} - d_{h,*} \rangle \ge \beta_m \|d_{h,k} - d_{h,*}\|^2,$$
where $\beta_m > 0$ is a constant independent of $x_{h,k}$ and $x_{h,*}$ (given in the paper).
Generalization of Lipschitz Continuity

$$\|\nabla f(x) - \nabla f(y)\|^2 \le L^2 \|x - y\|^2$$

Lemma
Suppose that a convergent algorithm with nonexpansive steps (e.g. ISTA) is applied at the coarse level. Then
$$\|d_{h,k} - d_{h,*}\|^2 \le \frac{16}{9} m^2 \omega_1^2 \omega_2^2 s_{H,0}^2 \|x_{h,k} - x_{h,*}\|^2,$$
where
- $\omega_1, \omega_2$ are known constants (depending on the prolongation/restriction operators),
- $m$ is the number of iterations at the coarse level,
- $s_{H,0}$ is the step size used in the coarse algorithm.
Main Theorem

Theorem (Contraction for coarse correction update)
Let $\tau$ denote the step size in the feasible projection step and $s$ the step size used when updating the fine model. Then
$$\|x_{h,k+1} - x_{h,*}\|^2 \le \sigma(s, \tau) \|x_{h,k} - x_{h,*}\|^2,$$
where $\sigma(\tau, s) = (1 - s)^2 + \Delta(\tau) s^2$ and
$$\Delta(\tau) = \frac{8}{9} m \omega_1^2 s_{H,0}^2 \left( 4 m \omega_2^2 \tau^2 - 2\tau (1 + 2m) \right).$$
There always exists $\tau > 0$ such that $\Delta(\tau) < 1$, and for
$$s \le \min\left\{ \frac{1}{2},\; \frac{1 - \Delta(\tau)}{\Delta(\tau)} \right\}$$
we have $\sigma(\tau, s) < 1$.
Numerical Experiments: Image Restoration

$$\min_{x_h} \|A_h x_h - b_h\|_2^2 + \mu_h \|W(x_h)\|_1$$

- $b_h$: input image
- $A_h$: blurring operator
- $W(\cdot)$: wavelet transform
- $x \in \mathbb{R}^h$: restored image, $h = 1024 \times 1024$

Algorithms run in Matlab on a standard laptop.
Image Restoration: Coarse Model

$$\min_{x_H} \|A_H x_H - b_H\|_2^2 + \langle v_H, x_H \rangle + \mu_H \sum_i \left( \sqrt{(W(x_H))_i^2 + \varrho} - \sqrt{\varrho} \right)$$

- $b_H = I_h^H b_h$: coarse input image
- $A_H = I_h^H A_h I_H^h$: coarse blurring operator
- $x_H \in \mathbb{R}^H$, $H = 512 \times 512$ or $H = 256 \times 256$
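The coarse data follow from the transfer operators alone; with explicit matrices the construction is a sketch like this (in practice $A_H$ would be applied matrix-free):

```python
def coarse_problem_data(A_h, b_h, R, P):
    # Galerkin coarse blurring operator and restricted input image:
    # A_H = I_h^H A_h I_H^h,   b_H = I_h^H b_h
    return R @ A_h @ P, R @ b_h
```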
Iteration Comparison

[Figure: objective value $F(x)$ vs. iteration $k$ (1 to 50), comparing MISTA gradient updates, MISTA coarse corrections, FISTA, and ISTA]

CPU Time Comparison (1% Noise)

[Figure: CPU time (secs) of ISTA, FISTA, and MISTA on the images Barb., Cam., Chess, Cog, Lady, Lena]
CPU Time Comparison Summary

[Figure: CPU time comparison summary]
Conclusions

- Current algorithms: quadratic approximations; exploit structure
- Multilevel algorithms: hierarchical approximations; exploit (more) structure
- Numerical results: 3-4× faster than the state of the art

PP, Duy V. N. Luong, Daniel Rueckert, Berc Rustem, A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization, Optimization Online, March 2014
Paper & Code: http://www.doc.ic.ac.uk/~pp500/mista.html
References

Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202.

Ben-Tal, A., Margalit, T., and Nemirovski, A. (2001). The ordered subsets mirror descent optimization method with applications to tomography. SIAM Journal on Optimization, 12(1):79-108.

Bertsekas, D. P. (2011). Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010:1-38.

Donoho, D. L. and Tsaig, Y. (2008). Fast solution of ℓ1-norm minimization problems when the solution may be sparse. IEEE Transactions on Information Theory, 54(11):4789-4812.

Gratton, S., Sartenaer, A., and Toint, P. (2008). Recursive trust-region methods for multiscale nonlinear optimization. SIAM Journal on Optimization, 19(1):414-444.

Kiwiel, K. C. (1990). Proximity control in bundle methods for convex nondifferentiable minimization. Mathematical Programming, 46(1-3):105-122.

Lewis, R. and Nash, S. (2005). Model problems for the multigrid optimization of systems governed by differential equations. SIAM Journal on Scientific Computing, 26(6):1811-1837.

Nesterov, Y. (2005). Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152.

Nesterov, Y. (2012). Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362.

Nesterov, Y. (2013). Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125-161.

Rockafellar, R. (1976). Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877-898.

Wen, Z. and Goldfarb, D. (2009). A line search multigrid method for large-scale nonlinear optimization. SIAM Journal on Optimization, 20(3):1478-1503.

Yang, J. and Zhang, Y. (2011). Alternating direction algorithms for ℓ1-problems in compressive sensing. SIAM Journal on Scientific Computing, 33(1):250-278.