SIAM Conference on Imaging Science, Bologna, Italy, 2018

Adaptive FISTA

Peter Ochs, Saarland University, 07.06.2018
Joint work with Thomas Pock, TU Graz, Austria
© 2018 Peter Ochs
Some Facts about FISTA:

FISTA was developed in [Beck, Teboulle: A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences 2(1):183–202, 2009], ~5600 citations on Google Scholar.

Motivated by [Nesterov: A method of solving a convex programming problem with convergence rate O(1/k²), Soviet Mathematics Doklady, 1983], ~2000 citations on Google Scholar.

Optimal method in the sense of [Nemirovskii, Yudin 83], [Nesterov 04].

Accelerated Gradient Method:
  y^(k)   = x^(k) + β_k (x^(k) − x^(k−1)),   β_k ∈ [0, 1) cleverly predefined,
  x^(k+1) = y^(k) − τ ∇f(y^(k)).

FISTA extends the Accelerated Gradient Method of [Nesterov 83] to non-smooth problems.
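The accelerated gradient update above fits in a few lines. A minimal NumPy sketch, assuming a user-supplied gradient, a step size τ ≤ 1/L, and the common choice β_k = k/(k+3) (the slides leave the β_k sequence abstract):

```python
import numpy as np

def accelerated_gradient(grad, x0, tau, beta_k, iters=100):
    """Nesterov's accelerated gradient method (smooth case).
    grad:   gradient of f
    tau:    step size, assumed <= 1/L with L the Lipschitz constant of grad
    beta_k: extrapolation sequence, e.g. lambda k: k / (k + 3)"""
    x_prev = x0.copy()
    x = x0.copy()
    for k in range(iters):
        y = x + beta_k(k) * (x - x_prev)   # extrapolation step
        x_prev, x = x, y - tau * grad(y)   # gradient step at the extrapolated point
    return x
```

For example, on f(x) = ½‖x‖² the iterates contract quickly toward the minimizer 0.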
A Class of Structured Non-smooth Optimization Problems:

  min_x  g(x) + f(x),    g with simple proximal mapping, f smooth,

where "simple" means that  argmin_x g(x) + (1/2)‖x − x̄‖²  is easy to evaluate.

Update Scheme: FISTA (f, g convex)
  y^(k)   = x^(k) + β_k (x^(k) − x^(k−1)),
  x^(k+1) = argmin_x  g(x) + f(y^(k)) + ⟨∇f(y^(k)), x − y^(k)⟩ + (1/(2τ_k))‖x − y^(k)‖².

Equivalent to
  x^(k+1) = argmin_{x∈R^N}  g(x) + (1/(2τ))‖x − (y^(k) − τ∇f(y^(k)))‖²
          =: prox_{τg}(y^(k) − τ∇f(y^(k))).

Update Scheme: Adaptive FISTA (also non-convex)
  y_β^(k) = x^(k) + β (x^(k) − x^(k−1)),
  x^(k+1) = argmin_x min_β  g(x) + f(y_β^(k)) + ⟨∇f(y_β^(k)), x − y_β^(k)⟩ + (1/(2τ_k))‖x − y_β^(k)‖².

Update Scheme: Adaptive FISTA (f quadratic)
  ... Taylor expansion around x^(k) and optimizing for β = β(x) yields ...
  x^(k+1) = argmin_x  g(x) + (1/2)‖x − (x^(k) − Q_k^{−1}∇f(x^(k)))‖²_{Q_k}.
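The FISTA update scheme above can be sketched directly. A minimal NumPy version, assuming the classical t_k-based extrapolation sequence (the slides leave β_k abstract) and a fixed step size τ ≤ 1/L:

```python
import numpy as np

def fista(grad_f, prox_g, x0, tau, iters=100):
    """FISTA sketch: forward (gradient) step on f, backward (prox) step on g,
    with Nesterov extrapolation.  prox_g(z, tau) evaluates prox_{tau g}(z)."""
    x_prev = x0.copy()
    x = x0.copy()
    t = 1.0
    for _ in range(iters):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        beta = (t - 1.0) / t_next          # classical FISTA extrapolation weight
        y = x + beta * (x - x_prev)
        x_prev, x = x, prox_g(y - tau * grad_f(y), tau)
        t = t_next
    return x
```

For instance, with f(x) = ½‖x − b‖² and g = λ‖·‖₁ the iterates converge to the soft-thresholded point.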
Update Scheme: Adaptive FISTA (f quadratic)
  x^(k+1) = argmin_{x∈R^N}  g(x) + (1/2)‖x − (x^(k) − Q_k^{−1}∇f(x^(k)))‖²_{Q_k}
          =: prox^{Q_k}_g (x^(k) − Q_k^{−1}∇f(x^(k))),

with Q_k ∈ S_{++}(N) as in the (zero-memory) SR1 quasi-Newton method: Q = Id − uu^⊤ (identity minus rank-1).

The SR1 proximal quasi-Newton method was proposed by [Becker, Fadili 12] (convex case). A special setting is treated in [Karimi, Vavasis 17]. Unified and extended in [Becker, Fadili, O. 18].
Discussion about Solving the Proximal Mapping: (g convex)

For general Q, the main algorithmic step is hard to solve:
  x̂ = prox^Q_g(x̄) := argmin_{x∈R^N}  g(x) + (1/2)‖x − x̄‖²_Q.

Theorem: [Becker, Fadili 12]
Let Q = D ± uu^⊤ ∈ S_{++} with u ∈ R^N and D diagonal. Then
  prox^Q_g(x̄) = D^{−1/2} prox_{g∘D^{−1/2}} (D^{1/2} x̄ − v*),
where v* = α* D^{−1/2} u and α* is the unique root of
  l(α) = ⟨u, x̄ − D^{−1/2} prox_{g∘D^{−1/2}}(D^{1/2}(x̄ − α D^{−1} u))⟩ + α,
which is strictly increasing and Lipschitz continuous with constant 1 + Σ_i u_i²/d_i.
Example: (Solving the rank-1 prox of the ℓ1-norm)

The proximal mapping w.r.t. the diagonal metric is separable and simple:
  prox_{g∘D^{−1/2}}(z) = argmin_{x∈R^N}  ‖D^{−1/2} x‖₁ + (1/2)‖x − z‖²
    = argmin_{x∈R^N}  Σ_{i=1}^N  |x_i|/√d_i + (1/2)(x_i − z_i)²
    = ( argmin_{x_i∈R}  |x_i|/√d_i + (1/2)(x_i − z_i)² )_{i=1,…,N}
    = ( max(0, |z_i| − 1/√d_i) · sign(z_i) )_{i=1,…,N}.
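The closed-form separable solution above is componentwise soft-thresholding with a per-coordinate threshold 1/√d_i; a minimal NumPy sketch:

```python
import numpy as np

def prox_l1_diag(z, d):
    """Prox of x -> ||D^{-1/2} x||_1 at z, with D = diag(d), d > 0:
    soft-thresholding with per-coordinate threshold 1/sqrt(d_i)."""
    thresh = 1.0 / np.sqrt(d)
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)
```

E.g. for z = (2, −0.5) and d = (1, 4) the thresholds are (1, 0.5), giving (1, 0).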
The root finding problem in the rank-1 prox of the ℓ1-norm:

α* is the root of the 1D function (i.e. l(α*) = 0)
  l(α) = ⟨u, x̄ − D^{−1/2} prox_{g∘D^{−1/2}}(D^{1/2}(x̄ − α D^{−1} u))⟩ + α
       = ⟨u, x̄ − PLin(x̄ − α D^{−1} u)⟩ + α,
which is a piecewise linear function of α.

Construct this function by sorting K ≤ N breakpoints. Cost: O(K log K).
The root is determined using binary search (recall: l(α) is strictly increasing). Cost: O(log K).
Computing l(α) costs O(N).
Total cost: O(K log K).
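Since l is strictly increasing, its root can also be found by a generic bracketing search; a sketch using bracket expansion plus bisection (the slide's O(log K) binary search over sorted breakpoints is the specialized, faster version for piecewise linear l):

```python
def root_increasing(l, lo=-1.0, hi=1.0, tol=1e-10):
    """Root of a strictly increasing scalar function l:
    expand an initial bracket [lo, hi] until it contains the root,
    then bisect.  Generic fallback; for piecewise linear l one would
    sort the breakpoints and binary-search instead."""
    while l(lo) > 0.0:
        lo *= 2.0                      # expand bracket to the left
    while l(hi) < 0.0:
        hi *= 2.0                      # expand bracket to the right
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if l(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```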
[Figure: the piecewise linear function l(α) with sorted breakpoints ŝ₁, ŝ₂, ŝ₃, ŝ₄; illustration from S. Becker.]
Solvability of the rank-1 proximal mapping for common functions g (from [Becker, Fadili 12]):

  Function              | Algorithm
  ----------------------|---------------------------------------------------------
  ℓ1-norm               | Separable: exact
  Hinge                 | Separable: exact
  ℓ∞-ball               | Separable: exact
  Box constraint        | Separable: exact
  Positivity constraint | Separable: exact
  Linear constraint     | Nonseparable: exact
  ℓ1-ball               | Nonseparable: semi-smooth Newton + prox_{g∘D^{−1/2}}, exact
  ℓ∞-norm               | Nonseparable: Moreau identity
  Simplex               | Nonseparable: semi-smooth Newton + prox_{g∘D^{−1/2}}, exact
Rank-r Modified Metric: (g convex)

(L-)BFGS uses a rank-r update of the metric with r > 1.

Theorem: [Becker, Fadili, O. 18]
Let Q = P ± V ∈ S_{++} with P ∈ S_{++} and V = Σ_{i=1}^r u_i u_i^⊤, rank(V) = r. Then
  prox^Q_g(x̄) = P^{−1/2} prox_{g∘P^{−1/2}} (P^{1/2}(x̄ − P^{−1} U α*)),
where U = (u_1, …, u_r) and α* is the unique root of
  l(α) = U^⊤ (x̄ − P^{−1/2} prox_{g∘P^{−1/2}}(P^{1/2}(x̄ − P^{−1} U α))) + X α,
with X := U^⊤ V^+ U ∈ S_{++}(r). The mapping l: R^r → R^r is Lipschitz continuous with constant ‖X‖ + ‖P^{−1/2} U‖² and strongly monotone.
Adaptive FISTA: Variants with O(1/k²)-convergence rate: (convex case)

Adaptive FISTA cannot be proved to have the accelerated rate O(1/k²): at each point x, afista decreases the objective more than a FISTA step does, but the global view on the sequence is lost.

However, afista can be embedded into schemes with accelerated rate O(1/k²):
  Monotone FISTA version (motivated by [Li, Lin 15], [Beck, Teboulle 09]).
  Tseng-like version (motivated by [Tseng 08]).
Nesterov's Worst Case Function:
[Figure: f(x^(k)) − min f versus iteration k (left) and time in seconds (right), comparing FISTA, MFISTA, afista, amfista, atseng, CG, and the lower bound O(1/k²).]
LASSO:
  min_{x∈R^N} h(x),   h(x) = (1/2)‖Ax − b‖² + λ‖x‖₁.
[Figure: h(x^(k)) − h* versus iteration k (left) and time in seconds (right), comparing FISTA, afista, atseng, MFISTA, amfista.]
Proposed Algorithm: (non-convex setting)

Current iterate x^(k) ∈ R^N. Step size: T ∈ S_{++}(N).

Define the extrapolated point y_β^(k), which depends on β:
  y_β^(k) := x^(k) + β (x^(k) − x^(k−1)).

Exact version: Compute x^(k+1) as follows:
  x^(k+1) = argmin_{x∈R^N} min_β  l^g_f(x; y_β^(k)) + (1/2)‖x − y_β^(k)‖²_T,
  l^g_f(x; y) := g(x) + f(y) + ⟨∇f(y), x − y⟩.

Inexact version: Find x^(k+1) and β such that
  l^g_f(x^(k+1); y_β^(k)) + (1/2)‖x^(k+1) − y_β^(k)‖²_T ≤ f(x^(k)) + g(x^(k)).
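One simple way to realize the inexact version is to try a few candidate β values and accept the first one satisfying the descent condition. A sketch, assuming T = (1/τ)·Id and a small hypothetical candidate list (the paper's β search is not specified on this slide); with β = 0 in the list the test always succeeds, since the forward-backward point minimizes its own model:

```python
import numpy as np

def afista_step(x, x_prev, f, grad_f, g, prox_g, tau, betas=(1.0, 0.5, 0.0)):
    """One inexact adaptive-FISTA step (sketch), with T = (1/tau) Id.
    Accepts the first beta whose model value at the prox point does not
    exceed the current objective f(x) + g(x).  Including beta = 0 among
    the candidates guarantees acceptance (plain forward-backward step)."""
    current = f(x) + g(x)
    for beta in betas:
        y = x + beta * (x - x_prev)                      # extrapolated point
        x_new = prox_g(y - tau * grad_f(y), tau)         # forward-backward step at y
        model = (g(x_new) + f(y) + grad_f(y) @ (x_new - y)
                 + np.dot(x_new - y, x_new - y) / (2.0 * tau))
        if model <= current:                             # inexact descent condition
            return x_new, beta
    return x_new, beta                                   # last candidate as fallback
```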
Neural network optimization problem / non-linear inverse problem:
  min_{W_0,W_1,W_2, B_0,B_1,B_2}  Σ_{i=1}^N ( ‖(W_2 σ_2(W_1 σ_1(W_0 X + B_0) + B_1) + B_2 − Ỹ)_{·,i}‖² + ε² )^{1/2} + λ Σ_{j=0}^{2} ‖W_j‖₁.
[Figure: h(x^(k))/h(x^(0)) versus iteration k (left) and time in seconds (right), comparing FBS, MFISTA, iPiano, afista.]
Conclusion:

Proposed adaptive FISTA for solving problems of the form
  min_{x∈R^N} g(x) + f(x).

Adaptive FISTA is locally better than FISTA.
Convergence to a stationary point is proved.
Equivalent to a proximal quasi-Newton method if f is quadratic.
Often, the proximal mapping can be computed efficiently.
Adaptive FISTA can be embedded into accelerated / optimal schemes.