A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization

A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization. Panos Parpas, Department of Computing, Imperial College London. www.doc.ic.ac.uk/~pp500, p.parpas@imperial.ac.uk. Joint work with D. V. (Ryan) Luong, D. Rueckert, and B. Rustem.

Composite Convex Optimisation

min_{x ∈ Ω} f(x) + g(x)

- f : Ω → R convex with Lipschitz-continuous gradient, ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖
- g : Ω → R convex, continuous, non-differentiable; g is "simple"

Algorithms: a small sample of state-of-the-art first-order algorithms for min_x f(x) + g(x)

- Iterative Shrinkage Thresholding Algorithm (ISTA; proximal point algorithm) [Rockafellar, 1976], [Beck and Teboulle, 2009]
- Accelerated gradient methods [Nesterov, 2013]
- Fast Iterative Shrinkage Thresholding Algorithm (FISTA) [Beck and Teboulle, 2009]
- Block coordinate descent [Nesterov, 2012]
- Incremental gradient/subgradient methods [Bertsekas, 2011]
- Mirror descent [Ben-Tal et al., 2001]
- Smoothing algorithms [Nesterov, 2005]
- Bundle methods [Kiwiel, 1990]
- Dual proximal augmented Lagrangian methods [Yang and Zhang, 2011]
- Homotopy methods [Donoho and Tsaig, 2008]

Composite Convex Optimisation (multilevel notation)

min_{x ∈ Ω_h} f_h(x) + g_h(x)

- f_h : Ω_h → R convex with Lipschitz-continuous gradient, ‖∇f_h(x) − ∇f_h(y)‖ ≤ L‖x − y‖
- g_h : Ω_h → R convex, continuous, non-differentiable; g_h is "simple"
- Multilevel notation: subscript h denotes the fine (full) model, H the coarse (approximate) model
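
The ℓ1-regularised least-squares problem used in the numerical experiments later in the talk is a standard instance of this template: f(x) = ‖Ax − b‖² is smooth with a Lipschitz-continuous gradient, and g(x) = μ‖x‖₁ is non-smooth but simple because its proximal operator is coordinate-wise soft-thresholding. A minimal Python sketch (the data A, b and the parameter mu are invented for illustration, not taken from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, mu = 50, 100, 0.1
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def f(x):                       # smooth part: least-squares fit
    return np.sum((A @ x - b) ** 2)

def grad_f(x):                  # gradient of f
    return 2.0 * A.T @ (A @ x - b)

L = 2.0 * np.linalg.norm(A, 2) ** 2    # Lipschitz constant of grad_f

def g(x):                       # non-smooth part: l1 penalty
    return mu * np.sum(np.abs(x))

def prox_g(x, t):               # "simple": prox of t*g is soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - t * mu, 0.0)
```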

min_{x ∈ Ω_h} F_h(x) := f_h(x) + g_h(x)

Iterative Shrinkage Thresholding Algorithm (ISTA)
1. Iteration k: x_{h,k}, f_{h,k}, ∇f_{h,k}, L_h.
2. Quadratic approximation (a computationally favorable model):
   Q_L(x_{h,k}, x) = f_{h,k} + ⟨∇f_{h,k}, x − x_{h,k}⟩ + (L_h/2)‖x − x_{h,k}‖² + g_h(x)
3. Gradient map (solve, at least approximately, the favorable model):
   D_{h,k} = x_{h,k} − prox_h(x_{h,k} − (1/L_h)∇f_{h,k})
           = x_{h,k} − argmin_x { (L_h/2)‖x − (x_{h,k} − (1/L_h)∇f_{h,k})‖² + g_h(x) }
           = x_{h,k} − argmin_x Q_L(x_{h,k}, x)
4. Error correction step (compute & apply the error correction):
   x_{h,k+1} = x_{h,k} − s_{h,k} D_{h,k}
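
A minimal sketch of the iteration above, reusing f, grad_f, prox_g and L from the previous snippet (all identifiers are illustrative, not the authors' code); steps 3 and 4 become two lines inside the loop:

```python
def ista(x0, grad_f, prox_g, L, s=1.0, iters=500):
    """Proximal gradient (ISTA) iterations with step size s in (0, 1]."""
    x = x0.copy()
    for _ in range(iters):
        # step 3: gradient map = x minus a prox-gradient step
        D = x - prox_g(x - grad_f(x) / L, 1.0 / L)
        # step 4: error correction step
        x = x - s * D
    return x

x_ista = ista(np.zeros(n), grad_f, prox_g, L)
```

With s = 1 this is the usual ISTA update x_{k+1} = prox_g(x_k − (1/L)∇f(x_k)).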

What is a Computationally Favorable Model?

Global convergence rate: F_h(x_k) − F_h* ≤ O(L_h / k), k ≥ 1.
1. A smoother model (smaller Lipschitz constant)
2. A lower-dimensional model (e.g. x_H ∈ R^H with H < h)

Numerical experiment: min_{x ∈ R^n} ‖Ax − b‖² + μ‖x‖₁, with b ∈ R^m and A ∈ R^{m×n} Gaussian N(0, 1); 10⁴ models for n = 25, 50, 100, 200, and 400.

[Figures: for min_{x ∈ R^n} ‖Ax − b‖² + μ‖x‖₁, the Lipschitz constant (scale 10⁴) and the number of ISTA iterations (scale 10³) and FISTA iterations (scale 10²) plotted against the model dimension n ∈ {25, 50, 100, 200, 400}.]
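
The first panel of this experiment can be reproduced in outline as follows. The sketch only measures how the Lipschitz constant of the smooth part grows with the dimension n for Gaussian data; the square shape m = n and the 10 random trials per size are guesses rather than the exact protocol behind the figures, and the ISTA/FISTA iteration counts to a fixed accuracy (the other panels) are omitted:

```python
dims = [25, 50, 100, 200, 400]
for n_dim in dims:
    # average the Lipschitz constant L = 2*||A||_2^2 over a few random instances
    Ls = [2.0 * np.linalg.norm(rng.standard_normal((n_dim, n_dim)), 2) ** 2
          for _ in range(10)]
    print(f"n = {n_dim:4d}   mean Lipschitz constant = {np.mean(Ls):.3e}")
```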

- Background: composite convex optimisation, min_x f(x) + g(x)
  - Applications & algorithms
  - Quadratic approximations & search directions
  - Coarse approximations: motivation
- A multilevel algorithm
  - Information transfer
  - Coarse model construction
  - Multilevel Iterative Shrinkage Thresholding (MISTA)
- Convergence analysis

State of the Art in Multilevel Optimization

Nonlinear optimization:
- Nash, S. G. A multigrid approach to discretized optimization problems. Optimization Methods and Software, 2000.
- Gratton, S., Sartenaer, A., Toint, P. L. Recursive trust-region methods for multiscale nonlinear optimization. SIAM Journal on Optimization, 2008.
- Wen, Z. and Goldfarb, D. A line search multigrid method for large-scale nonlinear optimization. SIAM Journal on Optimization, 2009.

This work:
- Nonsmooth/constrained problems
- Beyond PDEs!
- Same complexity as ISTA but (10×) faster, and 3× faster than FISTA

PP, Duy V. N. Luong, Daniel Rueckert, Berc Rustem, A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization, Optimization Online, March 2014.
Paper & code: http://www.doc.ic.ac.uk/~pp500/mista.html

Information Transfer Between Levels

- Coarse model design vector: x_H ∈ R^H; fine model design vector: x_h ∈ R^h, with h > H.
- Two standard techniques: (I) geometric, (II) algebraic.
- Restriction operator I_h^H : R^h → R^H; prolongation operator I_H^h : R^H → R^h.
- Main assumption: I_H^h = c (I_h^H)^T, with c > 0.
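
A sketch of one standard geometric construction (1-D linear interpolation for the prolongation, its scaled transpose for the restriction), shown only to make the assumption I_H^h = c (I_h^H)^T concrete; the operators used in the paper may differ:

```python
def prolongation_1d(H):
    """Linear interpolation from a coarse grid with H points to a fine grid with 2*H points."""
    h = 2 * H
    P = np.zeros((h, H))
    for j in range(H):
        P[2 * j, j] = 1.0              # coarse point copied onto the fine grid
        if j + 1 < H:
            P[2 * j + 1, j] = 0.5      # fine midpoints interpolate their coarse neighbours
            P[2 * j + 1, j + 1] = 0.5
        else:
            P[2 * j + 1, j] = 1.0      # boundary: repeat the last coarse value
    return P

H = 8
P = prolongation_1d(H)                 # I_H^h : R^H -> R^h
R = 0.5 * P.T                          # I_h^H : R^h -> R^H
assert np.allclose(P, 2.0 * R.T)       # main assumption holds with c = 2
```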

Coarse Model Construction: Smooth Case

First-order coherence condition for min f_h(x_h). Set the coarse starting point x_{H,0} = I_h^H x_{h,k}; the coarse model should then satisfy ∇f_{H,0} = I_h^H ∇f_{h,k}. This is achieved by

f_H(x_H) := f̂_H(x_H) + ⟨I_h^H ∇f_{h,k} − ∇f̂_{H,0}, x_H⟩,

where f̂_H is a coarse representation of f_h and the added linear term forces first-order coherence. [Lewis and Nash, 2005, Gratton et al., 2008, Wen and Goldfarb, 2009]
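
Continuing the Python sketches above (reusing rng, P, R and H), the construction can be checked numerically. Here the coarse representation f̂_H is taken to be the Galerkin-style model x_H ↦ ‖A_h I_H^h x_H − b_h‖², which is just one convenient illustrative choice; all names below are invented:

```python
h_dim = 2 * H                                    # fine dimension matching P and R
A_h = rng.standard_normal((20, h_dim))           # illustrative fine data
b_h = rng.standard_normal(20)

def grad_fh(x):                                  # fine gradient, f_h(x) = ||A_h x - b_h||^2
    return 2.0 * A_h.T @ (A_h @ x - b_h)

x_hk = rng.standard_normal(h_dim)                # current fine iterate
x_H0 = R @ x_hk                                  # x_{H,0} = I_h^H x_{h,k}

A_H = A_h @ P                                    # coarse representation of the data
def grad_fhatH(x):                               # gradient of the coarse representation
    return 2.0 * A_H.T @ (A_H @ x - b_h)

v_H = R @ grad_fh(x_hk) - grad_fhatH(x_H0)       # linear correction term
def grad_fH(x):                                  # gradient of the coherent coarse model
    return grad_fhatH(x) + v_H

assert np.allclose(grad_fH(x_H0), R @ grad_fh(x_hk))   # first-order coherence
```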

Non-Smooth Case

min f_h(x_h) + g_h(x_h)

Optimality conditions via the gradient mapping:
D_{h,k} = x_{h,k} − prox_h(x_{h,k} − (1/L_h)∇f_{h,k})
        = x_{h,k} − argmin_x { (L_h/2)‖x − (x_{h,k} − (1/L_h)∇f_{h,k})‖² + g_h(x) }
D_{h,k} = 0 if and only if x_{h,k} is stationary.

First-order coherence condition: D_{H,0} = I_h^H D_{h,k}.

Three Ways to Satisfy First-Order Coherence

Fine model (h): min_{x_h} f_h(x_h) + g_h(x_h)
Coarse model (H): min_{x_H} f_H(x_H) + g_H(x_H) + ⟨v_H, x_H⟩, where f_H + g_H is a coarse representation of f_h + g_h and the linear term v_H forces first-order coherence.

1. Smooth coarse model: v_H = L_H I_h^H D_{h,k} − (∇f_{H,0} + ∇g_{H,0}).
2. Non-smooth model, dual norm projection: v_H = (L_H / L_h) I_h^H ∇f_{h,k} − ∇f_{H,0}.
3. Non-smooth model, indicator projection: v_H = L_H x_{H,0} − (∇f_{H,0} + L_H I_h^H prox_h(x_{h,k} − (1/L_h)∇f_{h,k})).
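
A numerical sketch of option 1, continuing the snippets above (the smoothed ℓ1 term used for g_H is an illustrative stand-in, not necessarily the choice made in the paper). For a smooth coarse model the gradient mapping reduces to (1/L_H) times the gradient, so adding v_H makes it coincide with the restricted fine gradient mapping:

```python
mu_h = 0.1
L_h = 2.0 * np.linalg.norm(A_h, 2) ** 2          # Lipschitz constant of grad f_h
L_H = 2.0 * np.linalg.norm(A_H, 2) ** 2          # Lipschitz constant at the coarse level

def soft(x, t):                                  # prox of t * mu_h * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t * mu_h, 0.0)

D_hk = x_hk - soft(x_hk - grad_fh(x_hk) / L_h, 1.0 / L_h)   # fine gradient mapping

eps = 1e-2
def grad_gH(x):                                  # gradient of a smoothed l1 surrogate g_H
    return mu_h * x / np.sqrt(x ** 2 + eps ** 2)

v_H1 = L_H * (R @ D_hk) - (grad_fhatH(x_H0) + grad_gH(x_H0))     # option 1

D_H0 = (grad_fhatH(x_H0) + grad_gH(x_H0) + v_H1) / L_H           # coarse gradient mapping
assert np.allclose(D_H0, R @ D_hk)               # D_{H,0} = I_h^H D_{h,k}
```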

Multilevel Iterative Shrinkage Thresholding Algorithm (MISTA)

1.0 If the condition to use the coarse model is satisfied at x_{h,k}:
  1.1 Set x_{H,0} = I_h^H x_{h,k}.
  1.2 Perform m coarse iterations with any monotone algorithm, ending at x_{H,m}.
  1.3 Compute the feasible coarse correction term d_{h,k} = I_H^h (x_{H,0} − x_{H,m}).
  1.4 Update the fine model:
      x⁺_h = prox_h(x_{h,k} − τ d_{h,k}),
      x_{h,k+1} = x_{h,k} − s_{h,k}(x_{h,k} − x⁺_h).
  1.5 Go to 1.0.
2.0 Otherwise do a fine iteration with any monotone algorithm, and go to 1.0.

Multilevel Iterative Shrinkage Thresholding Algorithm (MISTA)

Conditions to use the coarse model:
‖I_h^H D_{h,k}‖ > κ ‖D_{h,k}‖ and ‖x_{h,k} − x̃_h‖ > η ‖x̃_h‖,
where x̃_h is the last point at which a coarse correction was used.

Monotone algorithm: e.g. ISTA. Step size: determined by the convergence analysis.
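
A structural sketch of MISTA in Python (not the authors' MATLAB code). The coarse solver is passed in as a callable because its construction is problem specific, for instance via the v_H terms above; the defaults for κ, η, m, τ and s are illustrative, not the values analysed in the paper, and the prox parameter in step 1.4 is an assumption:

```python
def mista(x0, grad_f, prox_g, L_h, R, P, coarse_step, m=5,
          kappa=0.49, eta=1e-3, tau=None, s=1.0, iters=100):
    """Proximal gradient iterations interleaved with coarse correction steps.
    coarse_step(x_H, x_H0, RD) performs one monotone iteration on the coarse model
    built at the restriction point x_H0 with restricted gradient map RD."""
    x = x0.copy()
    x_tilde = x0.copy()                    # last point that used a coarse correction
    tau = 1.0 / L_h if tau is None else tau
    for _ in range(iters):
        D = x - prox_g(x - grad_f(x) / L_h, 1.0 / L_h)    # fine gradient map
        use_coarse = (np.linalg.norm(R @ D) > kappa * np.linalg.norm(D) and
                      np.linalg.norm(x - x_tilde) > eta * np.linalg.norm(x_tilde))
        if use_coarse:
            x_H0 = R @ x                                  # 1.1 restrict the iterate
            x_H = x_H0.copy()
            for _ in range(m):                            # 1.2 m coarse iterations
                x_H = coarse_step(x_H, x_H0, R @ D)
            d = P @ (x_H0 - x_H)                          # 1.3 coarse correction term
            x_plus = prox_g(x - tau * d, tau)             # 1.4 feasible projection (prox parameter assumed)
            x = x - s * (x - x_plus)
            x_tilde = x.copy()
        else:
            x = x - s * D                                 # 2.0 plain fine (ISTA) step
    return x
```

Using the lasso objects from the earlier snippets, one would call mista(np.zeros(h_dim), grad_fh, soft, L_h, R, P, coarse_step) with a coarse_step that runs, for example, one gradient step on the coherent coarse model.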

Global Convergence Rate Analysis

Main result: the coarse correction step is a contraction towards the optimal point,
‖x_{h,k+1} − x_{h,*}‖² ≤ σ ‖x_{h,k} − x_{h,*}‖², where σ ∈ (0, 1).

Theorem [Rockafellar, 1976, Theorem 1]. Suppose that the coarse correction step satisfies the contraction property, and that f(x) is convex and has Lipschitz-continuous gradients. Then any MISTA step is at least nonexpansive (coarse correction steps are contractions and proximal gradient steps are nonexpansive),
‖x_{h,k+1} − x_{h,*}‖² ≤ ‖x_{h,k} − x_{h,*}‖²,
and the sequence {x_{h,k}} converges R-linearly.

Theorem [Rockafellar, 1976, Theorem 2]. Suppose that the coarse correction step satisfies the contraction property, and that f(x) is strongly convex and has Lipschitz-continuous gradients. Then any MISTA step is always a contraction,
‖x_{h,k+1} − x_{h,*}‖² ≤ σ ‖x_{h,k} − x_{h,*}‖², where σ ∈ (0, 1),
and the sequence {x_{h,k}} converges Q-linearly.

Key Lemma

Useful inequality: ⟨x − y, ∇f(x) − ∇f(y)⟩ ≥ (1/L)‖∇f(x) − ∇f(y)‖².

Lemma. Consider two coarse correction terms generated by performing m iterations at the coarse level, starting from the points x_{h,k} and x_{h,*}:
d_{h,k} = I_H^h e_{H,m} = I_H^h Σ_{i=0}^{m−1} s_{H,i} D_{H,i},    d_{h,*} = I_H^h Σ_{i=0}^{m−1} s_{H,i} D*_{H,i} = 0.
Then the following inequality holds:
⟨x_{h,k} − x_{h,*}, d_{h,k} − d_{h,*}⟩ ≥ β_m ‖d_{h,k} − d_{h,*}‖²,
where β_m > 0 is a constant independent of x_{h,k} and x_{h,*} (given in the paper).

Generalization of Lipschitz Continuity

For gradients: ‖∇f(x) − ∇f(y)‖² ≤ L²‖x − y‖².

Lemma. Suppose that a convergent algorithm with nonexpansive steps (e.g. ISTA) is applied at the coarse level. Then
‖d_{h,k} − d_{h,*}‖² ≤ (16/9) m² ω₁² ω₂² s²_{H,0} ‖x_{h,k} − x_{h,*}‖²,
where ω₁, ω₂ are known constants (depending on the prolongation/restriction operators), m is the number of iterations at the coarse level, and s_{H,0} is the step size used by the coarse algorithm.

Main Theorem

Theorem (Contraction for the coarse correction update). Let τ denote the step size in the feasible projection step and s the step size used when updating the fine model. Then
‖x_{h,k+1} − x_{h,*}‖² ≤ σ(s, τ) ‖x_{h,k} − x_{h,*}‖²,
where σ(τ, s) = 1 − 2s + Δ(τ)s² and
Δ(τ) = (8/9) m ω₁² s²_{H,0} (4 m ω₂² τ² − 2τ(1 + 2m)).
There always exists τ > 0 such that Δ(τ) < 1, and for s ≤ min{1/2, (1 − Δ(τ))/Δ(τ)} we have σ(τ, s) < 1.

Numerical Experiments: Image Restoration

min_{x_h} ‖A_h x_h − b_h‖² + μ_h ‖W(x_h)‖₁

- b_h: input image
- A_h: blurring operator
- W(·): wavelet transform
- x_h ∈ R^h: restored image, with h = 1024 × 1024

Algorithms run in MATLAB on a standard laptop.

Image Restoration: Coarse Model

min_{x_H} ‖A_H x_H − b_H‖² + ⟨v_H, x_H⟩ + μ_H Σ_i (√((W(x_H))_i² + ϱ) − √ϱ)

- b_H = I_h^H b_h: coarse input image
- A_H = I_h^H A_h I_H^h: coarse blurring operator
- x_H ∈ R^H, with H = 512 × 512 or H = 256 × 256
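
A small, self-contained sketch of how such a coarse model can be assembled for images, using 2×2 block averaging as the restriction and pixel replication as the prolongation; these particular operators, sizes and names are illustrative choices, not the ones used in the experiments:

```python
import numpy as np

def restrict_image(img):
    """I_h^H for images: average each 2x2 block of pixels."""
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] + img[0::2, 1::2] + img[1::2, 1::2])

def prolong_image(img):
    """I_H^h for images: replicate each coarse pixel into a 2x2 block."""
    return np.kron(img, np.ones((2, 2)))

rng = np.random.default_rng(0)
n_pix = 16                                   # tiny stand-in for a 1024 x 1024 image
b_h = rng.standard_normal((n_pix, n_pix))    # fine (blurred, noisy) input image
b_H = restrict_image(b_h)                    # coarse input image b_H = I_h^H b_h

# With matrix representations R and P of the two maps above, the coarse blurring
# operator is A_H = R @ A_h @ P; for large images it is applied implicitly:
def apply_A_H(x_H, apply_A_h):
    return restrict_image(apply_A_h(prolong_image(x_H)))
```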

[Figure: objective value F(x) versus iteration k (1 to 50) for ISTA, FISTA, and MISTA, with MISTA gradient updates and MISTA coarse correction steps marked separately.]

[Figure: CPU time (seconds) of ISTA, FISTA, and MISTA on the Barb., Cam., Chess, Cog, Lady, and Lena images with 1% noise.]

[Figure: CPU time comparison summary.]

Conclusions

- Current algorithms: quadratic approximations; exploit structure.
- Multilevel algorithms: hierarchical approximations; exploit (more) structure.
- Numerical results: 3-4× faster than the state of the art.

PP, Duy V. N. Luong, Daniel Rueckert, Berc Rustem, A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization, Optimization Online, March 2014.
Paper & code: http://www.doc.ic.ac.uk/~pp500/mista.html

References I

Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202.
Ben-Tal, A., Margalit, T., and Nemirovski, A. (2001). The ordered subsets mirror descent optimization method with applications to tomography. SIAM Journal on Optimization, 12(1):79-108.
Bertsekas, D. P. (2011). Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010:1-38.
Donoho, D. L. and Tsaig, Y. (2008). Fast solution of ℓ1-norm minimization problems when the solution may be sparse. IEEE Transactions on Information Theory, 54(11):4789-4812.
Gratton, S., Sartenaer, A., and Toint, P. (2008). Recursive trust-region methods for multiscale nonlinear optimization. SIAM Journal on Optimization, 19(1):414-444.
Kiwiel, K. C. (1990). Proximity control in bundle methods for convex nondifferentiable minimization. Mathematical Programming, 46(1-3):105-122.
Lewis, R. and Nash, S. (2005). Model problems for the multigrid optimization of systems governed by differential equations. SIAM Journal on Scientific Computing, 26(6):1811-1837.

References II

Nesterov, Y. (2005). Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152.
Nesterov, Y. (2012). Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362.
Nesterov, Y. (2013). Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125-161.
Rockafellar, R. (1976). Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877-898.
Wen, Z. and Goldfarb, D. (2009). A line search multigrid method for large-scale nonlinear optimization. SIAM Journal on Optimization, 20(3):1478-1503.
Yang, J. and Zhang, Y. (2011). Alternating direction algorithms for ℓ1-problems in compressive sensing. SIAM Journal on Scientific Computing, 33(1):250-278.