SIAM Conference on Imaging Science, Bologna, Italy, 2018

Adaptive FISTA

Peter Ochs, Saarland University, 07.06.2018
Joint work with Thomas Pock, TU Graz, Austria
© 2018 Peter Ochs
Some Facts about FISTA:

FISTA was developed in [Beck, Teboulle: A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences 2(1):183–202, 2009], ~5600 citations on Google Scholar.

Motivated by [Nesterov: A method of solving a convex programming problem with convergence rate O(1/k²), Soviet Mathematics Doklady, 1983], ~2000 citations on Google Scholar.

Optimal method in the sense of [Nemirovskii, Yudin 83], [Nesterov 04].

Accelerated Gradient Method:
  y^(k)   = x^(k) + β_k (x^(k) − x^(k−1)),   β_k ∈ [0, 1) cleverly predefined,
  x^(k+1) = y^(k) − τ ∇f(y^(k)).

FISTA extends the Accelerated Gradient Method of [Nesterov 83] to non-smooth problems.
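The accelerated gradient update above fits in a few lines. A minimal NumPy sketch, assuming a user-supplied gradient, a step size τ ≤ 1/L, and the common choice β_k = k/(k+3) (the slides leave the β_k sequence abstract):

```python
import numpy as np

def accelerated_gradient(grad, x0, tau, beta_k, iters=100):
    """Nesterov's accelerated gradient method (smooth case).
    grad:   gradient of f
    tau:    step size, assumed <= 1/L with L the Lipschitz constant of grad
    beta_k: extrapolation sequence, e.g. lambda k: k / (k + 3)"""
    x_prev = x0.copy()
    x = x0.copy()
    for k in range(iters):
        y = x + beta_k(k) * (x - x_prev)   # extrapolation step
        x_prev, x = x, y - tau * grad(y)   # gradient step at the extrapolated point
    return x
```

For example, on f(x) = ½‖x‖² the iterates contract quickly toward the minimizer 0.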
A Class of Structured Non-smooth Optimization Problems:

  min_x  g(x) + f(x),    g with simple proximal mapping, f smooth,

where "simple" means that  argmin_x g(x) + (1/2)‖x − x̄‖²  is easy to evaluate.

Update Scheme: FISTA (f, g convex)
  y^(k)   = x^(k) + β_k (x^(k) − x^(k−1)),
  x^(k+1) = argmin_x  g(x) + f(y^(k)) + ⟨∇f(y^(k)), x − y^(k)⟩ + (1/(2τ_k))‖x − y^(k)‖².

Equivalent to
  x^(k+1) = argmin_{x∈R^N}  g(x) + (1/(2τ))‖x − (y^(k) − τ∇f(y^(k)))‖²
          =: prox_{τg}(y^(k) − τ∇f(y^(k))).

Update Scheme: Adaptive FISTA (also non-convex)
  y_β^(k) = x^(k) + β (x^(k) − x^(k−1)),
  x^(k+1) = argmin_x min_β  g(x) + f(y_β^(k)) + ⟨∇f(y_β^(k)), x − y_β^(k)⟩ + (1/(2τ_k))‖x − y_β^(k)‖².

Update Scheme: Adaptive FISTA (f quadratic)
  ... Taylor expansion around x^(k) and optimizing for β = β(x) yields ...
  x^(k+1) = argmin_x  g(x) + (1/2)‖x − (x^(k) − Q_k^{−1}∇f(x^(k)))‖²_{Q_k}.
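The FISTA update scheme above can be sketched directly. A minimal NumPy version, assuming the classical t_k-based extrapolation sequence (the slides leave β_k abstract) and a fixed step size τ ≤ 1/L:

```python
import numpy as np

def fista(grad_f, prox_g, x0, tau, iters=100):
    """FISTA sketch: forward (gradient) step on f, backward (prox) step on g,
    with Nesterov extrapolation.  prox_g(z, tau) evaluates prox_{tau g}(z)."""
    x_prev = x0.copy()
    x = x0.copy()
    t = 1.0
    for _ in range(iters):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        beta = (t - 1.0) / t_next          # classical FISTA extrapolation weight
        y = x + beta * (x - x_prev)
        x_prev, x = x, prox_g(y - tau * grad_f(y), tau)
        t = t_next
    return x
```

For instance, with f(x) = ½‖x − b‖² and g = λ‖·‖₁ the iterates converge to the soft-thresholded point.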
Update Scheme: Adaptive FISTA (f quadratic)
  x^(k+1) = argmin_{x∈R^N}  g(x) + (1/2)‖x − (x^(k) − Q_k^{−1}∇f(x^(k)))‖²_{Q_k}
          =: prox^{Q_k}_g (x^(k) − Q_k^{−1}∇f(x^(k))),

with Q_k ∈ S_{++}(N) as in the (zero-memory) SR1 quasi-Newton method: Q = Id − uu^⊤ (identity minus rank-1).

The SR1 proximal quasi-Newton method was proposed by [Becker, Fadili 12] (convex case). A special setting is treated in [Karimi, Vavasis 17]. Unified and extended in [Becker, Fadili, O. 18].
Discussion about Solving the Proximal Mapping: (g convex)

For general Q, the main algorithmic step is hard to solve:
  x̂ = prox^Q_g(x̄) := argmin_{x∈R^N}  g(x) + (1/2)‖x − x̄‖²_Q.

Theorem: [Becker, Fadili 12]
Let Q = D ± uu^⊤ ∈ S_{++} with u ∈ R^N and D diagonal. Then
  prox^Q_g(x̄) = D^{−1/2} prox_{g∘D^{−1/2}} (D^{1/2} x̄ − v*),
where v* = α* D^{−1/2} u and α* is the unique root of
  l(α) = ⟨u, x̄ − D^{−1/2} prox_{g∘D^{−1/2}}(D^{1/2}(x̄ − α D^{−1} u))⟩ + α,
which is strictly increasing and Lipschitz continuous with constant 1 + Σ_i u_i²/d_i.
Example: (Solving the rank-1 prox of the ℓ1-norm)

The proximal mapping w.r.t. the diagonal metric is separable and simple:
  prox_{g∘D^{−1/2}}(z) = argmin_{x∈R^N}  ‖D^{−1/2} x‖₁ + (1/2)‖x − z‖²
    = argmin_{x∈R^N}  Σ_{i=1}^N  |x_i|/√d_i + (1/2)(x_i − z_i)²
    = ( argmin_{x_i∈R}  |x_i|/√d_i + (1/2)(x_i − z_i)² )_{i=1,…,N}
    = ( max(0, |z_i| − 1/√d_i) · sign(z_i) )_{i=1,…,N}.
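The closed-form separable solution above is componentwise soft-thresholding with a per-coordinate threshold 1/√d_i; a minimal NumPy sketch:

```python
import numpy as np

def prox_l1_diag(z, d):
    """Prox of x -> ||D^{-1/2} x||_1 at z, with D = diag(d), d > 0:
    soft-thresholding with per-coordinate threshold 1/sqrt(d_i)."""
    thresh = 1.0 / np.sqrt(d)
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)
```

E.g. for z = (2, −0.5) and d = (1, 4) the thresholds are (1, 0.5), giving (1, 0).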
The root finding problem in the rank-1 prox of the ℓ1-norm:

α* is the root of the 1D function (i.e. l(α*) = 0)
  l(α) = ⟨u, x̄ − D^{−1/2} prox_{g∘D^{−1/2}}(D^{1/2}(x̄ − α D^{−1} u))⟩ + α
       = ⟨u, x̄ − PLin(x̄ − α D^{−1} u)⟩ + α,
which is a piecewise linear function of α.

Construct this function by sorting K ≤ N breakpoints. Cost: O(K log K).
The root is determined using binary search (recall: l(α) is strictly increasing). Cost: O(log K).
Computing l(α) costs O(N).
Total cost: O(K log K).
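Since l is strictly increasing, its root can also be found by a generic bracketing search; a sketch using bracket expansion plus bisection (the slide's O(log K) binary search over sorted breakpoints is the specialized, faster version for piecewise linear l):

```python
def root_increasing(l, lo=-1.0, hi=1.0, tol=1e-10):
    """Root of a strictly increasing scalar function l:
    expand an initial bracket [lo, hi] until it contains the root,
    then bisect.  Generic fallback; for piecewise linear l one would
    sort the breakpoints and binary-search instead."""
    while l(lo) > 0.0:
        lo *= 2.0                      # expand bracket to the left
    while l(hi) < 0.0:
        hi *= 2.0                      # expand bracket to the right
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if l(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```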
[Figure: the piecewise linear function l(α) with sorted breakpoints ŝ₁, ŝ₂, ŝ₃, ŝ₄; illustration from S. Becker.]
Solvability of the rank-1 proximal mapping for common functions g (from [Becker, Fadili 12]):

  Function              | Algorithm
  ----------------------|---------------------------------------------------------
  ℓ1-norm               | Separable: exact
  Hinge                 | Separable: exact
  ℓ∞-ball               | Separable: exact
  Box constraint        | Separable: exact
  Positivity constraint | Separable: exact
  Linear constraint     | Nonseparable: exact
  ℓ1-ball               | Nonseparable: semi-smooth Newton + prox_{g∘D^{−1/2}}, exact
  ℓ∞-norm               | Nonseparable: Moreau identity
  Simplex               | Nonseparable: semi-smooth Newton + prox_{g∘D^{−1/2}}, exact
Rank-r Modified Metric: (g convex)

(L-)BFGS uses a rank-r update of the metric with r > 1.

Theorem: [Becker, Fadili, O. 18]
Let Q = P ± V ∈ S_{++} with P ∈ S_{++} and V = Σ_{i=1}^r u_i u_i^⊤, rank(V) = r. Then
  prox^Q_g(x̄) = P^{−1/2} prox_{g∘P^{−1/2}} (P^{1/2}(x̄ − P^{−1} U α*)),
where U = (u_1, …, u_r) and α* is the unique root of
  l(α) = U^⊤ (x̄ − P^{−1/2} prox_{g∘P^{−1/2}}(P^{1/2}(x̄ − P^{−1} U α))) + X α,
with X := U^⊤ V^+ U ∈ S_{++}(r). The mapping l: R^r → R^r is Lipschitz continuous with constant ‖X‖ + ‖P^{−1/2} U‖² and strongly monotone.
Adaptive FISTA: Variants with O(1/k²)-convergence rate: (convex case)

Adaptive FISTA cannot be proved to have the accelerated rate O(1/k²): at each point x, afista decreases the objective more than a FISTA step does, but the global view on the sequence is lost.

However, afista can be embedded into schemes with accelerated rate O(1/k²):
  Monotone FISTA version (motivated by [Li, Lin 15], [Beck, Teboulle 09]).
  Tseng-like version (motivated by [Tseng 08]).
Nesterov's Worst Case Function:
[Figure: f(x^(k)) − min f versus iteration k (left) and time in seconds (right), comparing FISTA, MFISTA, afista, amfista, atseng, CG, and the lower bound O(1/k²).]
LASSO:
  min_{x∈R^N} h(x),   h(x) = (1/2)‖Ax − b‖² + λ‖x‖₁.
[Figure: h(x^(k)) − h* versus iteration k (left) and time in seconds (right), comparing FISTA, afista, atseng, MFISTA, amfista.]
Proposed Algorithm: (non-convex setting)

Current iterate x^(k) ∈ R^N. Step size: T ∈ S_{++}(N).

Define the extrapolated point y_β^(k), which depends on β:
  y_β^(k) := x^(k) + β (x^(k) − x^(k−1)).

Exact version: Compute x^(k+1) as follows:
  x^(k+1) = argmin_{x∈R^N} min_β  l^g_f(x; y_β^(k)) + (1/2)‖x − y_β^(k)‖²_T,
  l^g_f(x; y) := g(x) + f(y) + ⟨∇f(y), x − y⟩.

Inexact version: Find x^(k+1) and β such that
  l^g_f(x^(k+1); y_β^(k)) + (1/2)‖x^(k+1) − y_β^(k)‖²_T ≤ f(x^(k)) + g(x^(k)).
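One simple way to realize the inexact version is to try a few candidate β values and accept the first one satisfying the descent condition. A sketch, assuming T = (1/τ)·Id and a small hypothetical candidate list (the paper's β search is not specified on this slide); with β = 0 in the list the test always succeeds, since the forward-backward point minimizes its own model:

```python
import numpy as np

def afista_step(x, x_prev, f, grad_f, g, prox_g, tau, betas=(1.0, 0.5, 0.0)):
    """One inexact adaptive-FISTA step (sketch), with T = (1/tau) Id.
    Accepts the first beta whose model value at the prox point does not
    exceed the current objective f(x) + g(x).  Including beta = 0 among
    the candidates guarantees acceptance (plain forward-backward step)."""
    current = f(x) + g(x)
    for beta in betas:
        y = x + beta * (x - x_prev)                      # extrapolated point
        x_new = prox_g(y - tau * grad_f(y), tau)         # forward-backward step at y
        model = (g(x_new) + f(y) + grad_f(y) @ (x_new - y)
                 + np.dot(x_new - y, x_new - y) / (2.0 * tau))
        if model <= current:                             # inexact descent condition
            return x_new, beta
    return x_new, beta                                   # last candidate as fallback
```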
Neural network optimization problem / non-linear inverse problem:
  min_{W_0,W_1,W_2, B_0,B_1,B_2}  Σ_{i=1}^N ( ‖(W_2 σ_2(W_1 σ_1(W_0 X + B_0) + B_1) + B_2 − Ỹ)_{·,i}‖² + ε² )^{1/2} + λ Σ_{j=0}^{2} ‖W_j‖₁.
[Figure: h(x^(k))/h(x^(0)) versus iteration k (left) and time in seconds (right), comparing FBS, MFISTA, iPiano, afista.]
Conclusion:

Proposed adaptive FISTA for solving problems of the form
  min_{x∈R^N} g(x) + f(x).

Adaptive FISTA is locally better than FISTA.
Convergence to a stationary point is proved.
Equivalent to a proximal quasi-Newton method if f is quadratic.
Often, the proximal mapping can be computed efficiently.
Adaptive FISTA can be embedded into accelerated / optimal schemes.