SIAM Conference on Imaging Science, Bologna, Italy: Adaptive FISTA. Peter Ochs, Saarland University


SIAM Conference on Imaging Science, Bologna, Italy, 2018
Adaptive FISTA
Peter Ochs, Saarland University, 07.06.2018
Joint work with Thomas Pock, TU Graz, Austria

Some Facts about FISTA:
- FISTA was developed in [Beck, Teboulle: A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences 2(1):183-202, 2009], about 5600 citations on Google Scholar.
- Motivated by [Nesterov: A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Mathematics Doklady, 1983], about 2000 citations on Google Scholar.
- Optimal method in the sense of [Nemirovskii, Yudin 83], [Nesterov 04].
- Accelerated Gradient Method:
    y^{(k)} = x^{(k)} + \beta_k (x^{(k)} - x^{(k-1)}),   (\beta_k)_{k \in \mathbb{N}} cleverly predefined,
    x^{(k+1)} = y^{(k)} - \tau \nabla f(y^{(k)}).
- FISTA extends the Accelerated Gradient Method of [Nesterov 83] to non-smooth problems.
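The "cleverly predefined" extrapolation parameters \beta_k are typically generated by the t_k recursion of [Beck, Teboulle 09]; a minimal sketch (the function name is illustrative):

```python
import math

def fista_betas(num_iters):
    """Classical FISTA extrapolation schedule:
    t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2,  beta_k = (t_k - 1) / t_{k+1}."""
    betas = []
    t = 1.0
    for _ in range(num_iters):
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        betas.append((t - 1.0) / t_next)
        t = t_next
    return betas  # beta_k grows towards 1, e.g. [0.0, 0.28..., 0.43..., ...]
```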

A Class of Structured Non-smooth Optimization Problems:
    \min_x g(x) + f(x),   g with simple proximal mapping, f smooth,
where "simple" means that \mathrm{argmin}_x g(x) + \frac{1}{2}\|x - \bar{x}\|^2 is easy to evaluate.

Update Scheme: FISTA (f, g convex)
    y^{(k)} = x^{(k)} + \beta_k (x^{(k)} - x^{(k-1)})
    x^{(k+1)} = \mathrm{argmin}_x g(x) + f(y^{(k)}) + \langle \nabla f(y^{(k)}), x - y^{(k)} \rangle + \frac{1}{2\tau_k} \|x - y^{(k)}\|^2

Equivalent to
    x^{(k+1)} = \mathrm{argmin}_{x \in \mathbb{R}^N} g(x) + \frac{1}{2\tau_k} \|x - (y^{(k)} - \tau_k \nabla f(y^{(k)}))\|^2 =: \mathrm{prox}_{\tau_k g}(y^{(k)} - \tau_k \nabla f(y^{(k)}))
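For concreteness, a minimal FISTA sketch in this proximal-gradient form (a sketch, not the reference implementation from the talk; grad_f and prox_g are user-supplied callables, with prox_g(v, tau) evaluating \mathrm{prox}_{\tau g}(v), and step is a constant step size, e.g. 1/L for L-Lipschitz \nabla f):

```python
import numpy as np

def fista(x0, grad_f, prox_g, step, num_iters):
    """Proximal gradient steps at the extrapolated point y^(k),
    using the classical FISTA extrapolation schedule."""
    x_prev = x0.copy()
    x = x0.copy()
    t = 1.0
    for _ in range(num_iters):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        beta = (t - 1.0) / t_next
        y = x + beta * (x - x_prev)                         # extrapolation
        x_prev, x = x, prox_g(y - step * grad_f(y), step)   # forward-backward step
        t = t_next
    return x
```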

Update Scheme: Adaptive FISTA (also non-convex)
    y^{(k)}(\beta) = x^{(k)} + \beta (x^{(k)} - x^{(k-1)})
    x^{(k+1)} = \mathrm{argmin}_x \min_\beta \; g(x) + f(y^{(k)}(\beta)) + \langle \nabla f(y^{(k)}(\beta)), x - y^{(k)}(\beta) \rangle + \frac{1}{2\tau_k} \|x - y^{(k)}(\beta)\|^2,
i.e. the extrapolation parameter \beta is optimized jointly with x instead of being predefined.

Update Scheme: Adaptive FISTA (f quadratic)
    y^{(k)}(\beta) = x^{(k)} + \beta (x^{(k)} - x^{(k-1)})
    x^{(k+1)} = \mathrm{argmin}_x \min_\beta \; g(x) + f(y^{(k)}(\beta)) + \langle \nabla f(y^{(k)}(\beta)), x - y^{(k)}(\beta) \rangle + \frac{1}{2\tau_k} \|x - y^{(k)}(\beta)\|^2
... Taylor expansion around x^{(k)} and optimize for \beta = \beta(x) ...
    x^{(k+1)} = \mathrm{argmin}_x g(x) + \frac{1}{2} \|x - (x^{(k)} - Q_k^{-1} \nabla f(x^{(k)}))\|_{Q_k}^2

Update Scheme: Adaptive FISTA (f quadratic)
    x^{(k+1)} = \mathrm{argmin}_{x \in \mathbb{R}^N} g(x) + \frac{1}{2} \|x - (x^{(k)} - Q_k^{-1} \nabla f(x^{(k)}))\|_{Q_k}^2 =: \mathrm{prox}_g^{Q_k}(x^{(k)} - Q_k^{-1} \nabla f(x^{(k)}))
with Q_k \in S_{++}(N) as in the (zero-memory) SR1 quasi-Newton method: Q = I - uu^\top (identity minus rank-1).
- The SR1 proximal quasi-Newton method was proposed by [Becker, Fadili 12] (convex case).
- A special setting is treated in [Karimi, Vavasis 17].
- Unified and extended in [Becker, Fadili, O. 18].
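One plausible way to obtain such a diagonal-plus-rank-one metric from the last iterate pair is the SR1 formula with a scaled-identity base; the sketch below makes that assumption and is not necessarily the exact zero-memory SR1 variant and safeguards of [Becker, Fadili 12]:

```python
import numpy as np

def zero_memory_sr1_metric(s, y, tau):
    """Return (d, u, sign) describing Q = diag(d) + sign * u u^T from the
    SR1 update with base B0 = (1/tau) * I, where s = x_k - x_{k-1} and
    y = grad f(x_k) - grad f(x_{k-1}) (float arrays assumed)."""
    d = np.full_like(s, 1.0 / tau)
    r = y - s / tau                      # residual of the secant equation B0 s = y
    denom = r @ s
    if abs(denom) <= 1e-12 * (1.0 + np.linalg.norm(r) * np.linalg.norm(s)):
        return d, np.zeros_like(s), 0    # skip the rank-1 correction
    u = r / np.sqrt(abs(denom))
    sign = 1 if denom > 0 else -1
    if sign < 0 and u @ u >= 1.0 / tau:  # keep Q = (1/tau) I - u u^T positive definite
        return d, np.zeros_like(s), 0
    return d, u, sign
```

The returned triple describes Q = diag(d) \pm u u^\top, which is exactly the structure required by the rank-1 proximal-mapping theorem on the next slide.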

Discussion about Solving the Proximal Mapping: (g convex)
For general Q, the main algorithmic step is hard to solve:
    \hat{x} = \mathrm{prox}_g^Q(\bar{x}) := \mathrm{argmin}_{x \in \mathbb{R}^N} g(x) + \frac{1}{2} \|x - \bar{x}\|_Q^2

Theorem [Becker, Fadili 12]: Let Q = D \pm uu^\top \in S_{++} with u \in \mathbb{R}^N and D diagonal. Then
    \mathrm{prox}_g^Q(\bar{x}) = D^{-1/2} \, \mathrm{prox}_{g \circ D^{-1/2}}(D^{1/2}\bar{x} - v^*),   v^* = \alpha^* D^{-1/2} u,
where \alpha^* is the unique root of
    l(\alpha) = \langle u, \bar{x} - D^{-1/2} \mathrm{prox}_{g \circ D^{-1/2}}(D^{1/2}(\bar{x} - \alpha D^{-1} u)) \rangle + \alpha,
which is strictly increasing and Lipschitz continuous with constant 1 + \sum_i u_i^2 / d_i.

Solving the rank-1 Proximal Mapping for the l_1-norm
Example (solving the rank-1 prox of the l_1-norm): the proximal mapping w.r.t. the diagonal metric is separable and simple:
    \mathrm{prox}_{g \circ D^{-1/2}}(z) = \mathrm{argmin}_{x \in \mathbb{R}^N} \|D^{-1/2} x\|_1 + \frac{1}{2}\|x - z\|^2
        = \mathrm{argmin}_{x \in \mathbb{R}^N} \sum_{i=1}^N |x_i| / \sqrt{d_i} + \frac{1}{2}(x_i - z_i)^2
        = \big( \mathrm{argmin}_{x_i \in \mathbb{R}} |x_i| / \sqrt{d_i} + \frac{1}{2}(x_i - z_i)^2 \big)_{i=1,\ldots,N}
        = \big( \max(0, |z_i| - 1/\sqrt{d_i}) \, \mathrm{sign}(z_i) \big)_{i=1,\ldots,N}

The root-finding problem in the rank-1 prox of the l_1-norm: \alpha^* is the root of the 1D function (i.e. l(\alpha^*) = 0)
    l(\alpha) = \langle u, \bar{x} - D^{-1/2} \mathrm{prox}_{g \circ D^{-1/2}}(D^{1/2}(\bar{x} - \alpha D^{-1} u)) \rangle + \alpha
              = \langle u, \bar{x} - \mathrm{PLin}(\bar{x} - \alpha D^{-1} u) \rangle + \alpha,
which is a piecewise linear function.
- Construct this function by sorting K \le N breakpoints. Cost: O(K \log K).
- The root is determined by binary search. Cost: O(\log K). (Remember: l(\alpha) is strictly increasing.)
- Computing l(\alpha) costs O(N).
- Total cost: O(K \log K).
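For illustration, a sketch that evaluates this rank-1 prox for Q = D + uu^\top by combining the soft-thresholding formula above with a generic bracketing root finder in place of the sorting-based exact search (function names are illustrative; for Q = D - uu^\top the sign of the inner-product term in l flips):

```python
import numpy as np
from scipy.optimize import brentq

def prox_l1_diag(z, d):
    """prox_{g o D^{-1/2}}(z) for g = ||.||_1: soft-thresholding with thresholds 1/sqrt(d_i)."""
    return np.sign(z) * np.maximum(np.abs(z) - 1.0 / np.sqrt(d), 0.0)

def prox_l1_rank1(x_bar, d, u):
    """prox_g^Q(x_bar) for g = ||.||_1 and Q = diag(d) + u u^T (assumed positive definite)."""
    sqrt_d = np.sqrt(d)

    def l(alpha):  # strictly increasing, slope in [1, 1 + sum(u_i^2 / d_i)]
        p = prox_l1_diag(sqrt_d * (x_bar - alpha * u / d), d) / sqrt_d
        return u @ (x_bar - p) + alpha

    lo, hi = -1.0, 1.0                   # expand a bracket until the sign changes
    while l(lo) > 0.0:
        lo *= 2.0
    while l(hi) < 0.0:
        hi *= 2.0
    alpha_star = brentq(l, lo, hi)
    return prox_l1_diag(sqrt_d * (x_bar - alpha_star * u / d), d) / sqrt_d
```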

[Figure: the piecewise linear function l(\alpha) with its breakpoints \hat{s}_1, \hat{s}_2, \hat{s}_3, \hat{s}_4; illustration from S. Becker.]

Discussion about Solving the Proximal Mapping (from [Becker, Fadili 12]):

    Function g              Algorithm
    l_1-norm                separable: exact
    Hinge                   separable: exact
    l_infinity-ball         separable: exact
    Box constraint          separable: exact
    Positivity constraint   separable: exact
    Linear constraint       non-separable: exact
    l_1-ball                non-separable: semi-smooth Newton + prox_{g \circ D^{-1/2}} (exact)
    l_infinity-norm         non-separable: Moreau identity
    Simplex                 non-separable: semi-smooth Newton + prox_{g \circ D^{-1/2}} (exact)

Rank-r Modified Metric: (g convex)
(L-)BFGS uses a rank-r update of the metric with r > 1.

Theorem [Becker, Fadili, O. 18]: Let Q = P \pm V \in S_{++} with P \in S_{++}, V = \sum_{i=1}^r u_i u_i^\top and \mathrm{rank}(V) = r. Then
    \mathrm{prox}_g^Q(\bar{x}) = P^{-1/2} \, \mathrm{prox}_{g \circ P^{-1/2}}(P^{1/2}(\bar{x} - P^{-1} U \alpha^*)),
where U = (u_1, \ldots, u_r) and \alpha^* is the unique root of
    l(\alpha) = U^\top \big( \bar{x} - P^{-1/2} \mathrm{prox}_{g \circ P^{-1/2}}(P^{1/2}(\bar{x} - P^{-1} U \alpha)) \big) + X\alpha,   X := U^\top V^{+} U \in S_{++}(r).
The mapping l : \mathbb{R}^r \to \mathbb{R}^r is Lipschitz continuous with constant \|X\| + \|P^{-1/2} U\|^2 and strongly monotone.

Adaptive FISTA: Variants with O(1/k^2) convergence rate (convex case):
- Adaptive FISTA cannot be proved to have the accelerated rate O(1/k^2): at each point x, aFISTA decreases the objective more than a FISTA step does, but the global view on the sequence is lost.
- However, aFISTA can be embedded into schemes with the accelerated rate O(1/k^2):
  - Monotone FISTA version (motivated by [Li, Lin 15], [Beck, Teboulle 09]).
  - Tseng-like version (motivated by [Tseng 08]).

Nesterov's Worst-Case Function:
[Figure: f(x^{(k)}) - \min f versus iteration k and versus time [sec] (log-log) for FISTA, MFISTA, aFISTA, aMFISTA, aTseng, CG, and the lower bound O(1/k^2).]

LASSO:
    \min_{x \in \mathbb{R}^N} h(x),   h(x) = \frac{1}{2}\|Ax - b\|^2 + \lambda \|x\|_1
[Figure: h(x^{(k)}) - h^* versus iteration k and versus time [sec] for FISTA, aFISTA, aTseng, MFISTA, aMFISTA.]
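As a concrete instance of the composite model, a minimal LASSO setup that plugs into the fista sketch given earlier (A, b and lam are placeholder data; the step size uses L = \|A\|_2^2):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 500))
b = rng.standard_normal(200)
lam = 0.1

grad_f = lambda x: A.T @ (A @ x - b)                                          # gradient of 0.5*||Ax - b||^2
prox_g = lambda v, tau: np.sign(v) * np.maximum(np.abs(v) - tau * lam, 0.0)   # prox of tau*lam*||.||_1
step = 1.0 / np.linalg.norm(A, 2) ** 2                                        # 1/L with L = ||A||_2^2

x = fista(np.zeros(500), grad_f, prox_g, step, num_iters=500)
```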

Proposed Algorithm: (non-convex setting)
Current iterate x^{(k)} \in \mathbb{R}^N. Step size: T \in S_{++}(N). Define the extrapolated point y^{(k)} that depends on \beta:
    y^{(k)} := x^{(k)} + \beta (x^{(k)} - x^{(k-1)}).

Exact version: compute x^{(k+1)} as
    x^{(k+1)} = \mathrm{argmin}_{x \in \mathbb{R}^N} \min_\beta \; l_f^g(x; y^{(k)}) + \frac{1}{2} \|x - y^{(k)}\|_T^2,
    l_f^g(x; y^{(k)}) := g(x) + f(y^{(k)}) + \langle \nabla f(y^{(k)}), x - y^{(k)} \rangle.

Inexact version: find x^{(k+1)} and \beta such that
    l_f^g(x^{(k+1)}; y^{(k)}) + \frac{1}{2} \|x^{(k+1)} - y^{(k)}\|_T^2 \le f(x^{(k)}) + g(x^{(k)}).
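One simple way to realize the inexact version is to try a few candidate values of \beta and accept the first forward-backward step that passes the acceptance test; a sketch assuming the scalar metric T = (1/\tau) I (the candidate list is an illustrative choice; \beta = 0 reduces to a plain forward-backward step and always satisfies the test):

```python
import numpy as np

def afista_inexact_step(x, x_prev, f, grad_f, g, prox_g, tau,
                        betas=(2.0, 1.0, 0.5, 0.25, 0.0)):
    """One inexact adaptive-FISTA step with T = (1/tau) * I.
    prox_g(v, tau) evaluates prox_{tau*g}(v)."""
    ref = f(x) + g(x)                      # right-hand side of the acceptance test
    for beta in betas:                     # beta = 0 is the forward-backward fallback
        y = x + beta * (x - x_prev)
        x_new = prox_g(y - tau * grad_f(y), tau)
        model = (g(x_new) + f(y) + grad_f(y) @ (x_new - y)
                 + np.sum((x_new - y) ** 2) / (2.0 * tau))
        if model <= ref:
            return x_new
    return x_new  # not reached when beta = 0 is among the candidates
```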

Neural network optimization problem / non-linear inverse problem:
    \min_{W_0, W_1, W_2, b_0, b_1, b_2} \sum_{i=1}^N \Big( \big\| \big( W_2 \sigma_2(W_1 \sigma_1(W_0 X + B_0) + B_1) + B_2 - \tilde{Y} \big)_{\cdot,i} \big\|^2 + \varepsilon^2 \Big)^{1/2} + \lambda \sum_{j=0}^{2} \|W_j\|_1
[Figure: h(x^{(k)}) / h(x^{(0)}) versus iteration k and versus time [sec] for FBS, MFISTA, iPiano, aFISTA.]

Conclusion:
- Proposed adaptive FISTA for solving problems of the form \min_{x \in \mathbb{R}^N} g(x) + f(x).
- Adaptive FISTA is locally better than FISTA.
- Convergence to a stationary point is proved.
- Equivalent to a proximal quasi-Newton method if f is quadratic.
- Often, the proximal mapping can be computed efficiently.
- Adaptive FISTA can be embedded into accelerated / optimal schemes.