SIAM Conference on Imaging Science, Bologna, Italy, 2018. Adaptive FISTA. Peter Ochs, Saarland University


Adaptive FISTA
Peter Ochs, Saarland University
Joint work with Thomas Pock, TU Graz, Austria
SIAM Conference on Imaging Science, Bologna, Italy, 2018

Some Facts about FISTA:

FISTA was developed in [Beck, Teboulle: A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences 2(1), 2009]; about 5600 citations on Google Scholar.

It is motivated by [Nesterov: A method of solving a convex programming problem with convergence rate O(1/k²), Soviet Mathematics Doklady, 1983]; about 2000 citations on Google Scholar.

It is an optimal method in the sense of [Nemirovskii, Yudin 83], [Nesterov 04].

Accelerated Gradient Method:

  y^(k) = x^(k) + β_k (x^(k) - x^(k-1)),   β_k, k ∈ N, cleverly predefined
  x^(k+1) = y^(k) - τ ∇f(y^(k))

FISTA extends the Accelerated Gradient Method of [Nesterov 83] to non-smooth problems.

A Class of Structured Non-smooth Optimization Problems:

  min_x g(x) + f(x)

where g is simple (its proximal mapping argmin_x g(x) + (1/2) ‖x - x̄‖² is easy to evaluate) and f is smooth.

Update Scheme: FISTA (f, g convex)

  y^(k) = x^(k) + β_k (x^(k) - x^(k-1))
  x^(k+1) = argmin_{x ∈ R^N} g(x) + f(y^(k)) + ⟨∇f(y^(k)), x - y^(k)⟩ + (1/(2τ_k)) ‖x - y^(k)‖²

Equivalent to

  x^(k+1) = argmin_{x ∈ R^N} g(x) + (1/(2τ_k)) ‖x - (y^(k) - τ_k ∇f(y^(k)))‖² =: prox_{τ_k g}(y^(k) - τ_k ∇f(y^(k)))
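The update above is a gradient step at the extrapolated point followed by a proximal step. A minimal Python sketch of this scheme, assuming a known Lipschitz constant L of ∇f and the standard choice β_k = (k-1)/(k+2); function and argument names here are illustrative, not from the talk:

```python
import numpy as np

def fista(grad_f, prox_g, x0, L, n_iter=500):
    """FISTA sketch: proximal gradient with Nesterov extrapolation.

    grad_f : callable, gradient of the smooth part f
    prox_g : callable (z, tau) -> argmin_x g(x) + ||x - z||^2 / (2 tau)
    L      : Lipschitz constant of grad_f; step size tau = 1/L
    """
    tau = 1.0 / L
    x_prev = x0.copy()
    x = x0.copy()
    for k in range(1, n_iter + 1):
        beta = (k - 1.0) / (k + 2.0)             # predefined extrapolation parameter beta_k
        y = x + beta * (x - x_prev)              # y^(k) = x^(k) + beta_k (x^(k) - x^(k-1))
        x_prev = x
        x = prox_g(y - tau * grad_f(y), tau)     # x^(k+1) = prox_{tau g}(y - tau grad_f(y))
    return x
```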

A Class of Structured Non-smooth Optimization Problems:

  min_x g(x) + f(x),   g simple (easy proximal mapping), f smooth

Update Scheme: Adaptive FISTA (also non-convex)

  y_α^(k) = x^(k) + α (x^(k) - x^(k-1))
  x^(k+1) = argmin_x min_α g(x) + f(y_α^(k)) + ⟨∇f(y_α^(k)), x - y_α^(k)⟩ + (1/(2τ_k)) ‖x - y_α^(k)‖²

A Class of Structured Non-smooth Optimization Problems:

  min_x g(x) + f(x),   g simple (easy proximal mapping), f smooth

Update Scheme: Adaptive FISTA (f quadratic)

  y_α^(k) = x^(k) + α (x^(k) - x^(k-1))
  x^(k+1) = argmin_x min_α g(x) + f(y_α^(k)) + ⟨∇f(y_α^(k)), x - y_α^(k)⟩ + (1/(2τ_k)) ‖x - y_α^(k)‖²

... Taylor expansion around x^(k) and optimizing for α = α(x) ...

  x^(k+1) = argmin_x g(x) + (1/2) ‖x - (x^(k) - Q_k^{-1} ∇f(x^(k)))‖²_{Q_k}

A Class of Structured Non-smooth Optimization Problems

Update Scheme: Adaptive FISTA (f quadratic)

  x^(k+1) = argmin_{x ∈ R^N} g(x) + (1/2) ‖x - (x^(k) - Q_k^{-1} ∇f(x^(k)))‖²_{Q_k} =: prox^{Q_k}_g(x^(k) - Q_k^{-1} ∇f(x^(k)))

with Q_k ∈ S_{++}(N) as in the (zero-memory) SR1 quasi-Newton method: Q = I - uu^⊤ (identity minus rank-1).

SR1 proximal quasi-Newton method proposed by [Becker, Fadili 12] (convex case). A special setting is treated in [Karimi, Vavasis 17]. Unified and extended in [Becker, Fadili, O. 18].
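For context, a small sketch of how a "diagonal ± rank-1" metric can be produced by a zero-memory SR1 update from a single pair of iterate and gradient differences; the base scaling B_0 = (1/τ) I, the skipping safeguard, and all names are illustrative assumptions (the talk does not specify these details). Together with the theorem below, such a Q keeps the proximal step cheap to evaluate.

```python
import numpy as np

def sr1_metric(x, x_prev, g, g_prev, tau=1.0, eps=1e-8):
    """Zero-memory SR1 metric Q = (1/tau) I + sigma * u u^T (sketch).

    Returns (d, u, sigma) with Q = diag(d) + sigma * outer(u, u);
    the rank-1 correction is skipped if the SR1 denominator is tiny.
    """
    n = x.size
    d = np.full(n, 1.0 / tau)          # diagonal part B_0 = (1/tau) I
    s = x - x_prev                      # iterate difference
    y = g - g_prev                      # gradient difference
    r = y - d * s                       # SR1 residual y - B_0 s
    denom = r @ s
    if abs(denom) < eps * np.linalg.norm(r) * np.linalg.norm(s):
        return d, np.zeros(n), 0.0      # skip the rank-1 correction
    sigma = np.sign(denom)              # + or - rank-1 term
    u = r / np.sqrt(abs(denom))         # rank-1 factor
    return d, u, sigma
```

Note that for sigma = -1 an additional safeguard is needed to keep Q positive definite; it is omitted in this sketch.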

Discussion about Solving the Proximal Mapping: (g convex)

For general Q, the main algorithmic step is hard to solve:

  x̂ = prox^Q_g(x̄) := argmin_{x ∈ R^N} g(x) + (1/2) ‖x - x̄‖²_Q

Theorem [Becker, Fadili 12]: Let Q = D ± uu^⊤ ∈ S_{++} for u ∈ R^N and D diagonal. Then

  prox^Q_g(x̄) = D^{-1/2} prox_{g∘D^{-1/2}}(D^{1/2} x̄ - v*),   v* = α* D^{-1/2} u,

where α* is the unique root of

  l(α) = ⟨u, x̄ - D^{-1/2} prox_{g∘D^{-1/2}}(D^{1/2}(x̄ - α D^{-1} u))⟩ + α,

which is strictly increasing and Lipschitz continuous with constant 1 + Σ_i u_i²/d_i.

Solving the rank-1 Proximal Mapping for the ℓ1-norm

Example (solving the rank-1 prox of the ℓ1-norm): The proximal mapping with respect to the diagonal metric is separable and simple:

  prox_{g∘D^{-1/2}}(z) = argmin_{x ∈ R^N} ‖D^{-1/2} x‖_1 + (1/2) ‖x - z‖²
                       = argmin_{x ∈ R^N} Σ_{i=1}^N |x_i|/√d_i + (1/2)(x_i - z_i)²
                       = ( argmin_{x_i ∈ R} |x_i|/√d_i + (1/2)(x_i - z_i)² )_{i=1,...,N}
                       = ( max(0, |z_i| - 1/√d_i) · sign(z_i) )_{i=1,...,N}
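In code this separable prox is one line of vectorized soft-thresholding. A sketch, where d holds the diagonal entries d_i of D:

```python
import numpy as np

def prox_l1_scaled(z, d):
    """prox of x -> ||D^{-1/2} x||_1: componentwise soft-thresholding
    with thresholds 1/sqrt(d_i)."""
    thresh = 1.0 / np.sqrt(d)
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)
```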

Solving the rank-1 Proximal Mapping for the ℓ1-norm

The root finding problem in the rank-1 prox of the ℓ1-norm: α* is the root of the 1D function (i.e. l(α*) = 0)

  l(α) = ⟨u, x̄ - D^{-1/2} prox_{g∘D^{-1/2}}(D^{1/2}(x̄ - α D^{-1} u))⟩ + α
       = ⟨u, x̄ - PLin(x̄ - α D^{-1} u)⟩ + α,

which is a piecewise linear function.

Construct this function by sorting K ≤ N breakpoints. Cost: O(K log(K)).
The root is determined using binary search (remember: l(α) is strictly increasing). Cost: O(log(K)).
Computing l(α) costs O(N).
Total cost: O(K log(K)).
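A compact sketch of the whole rank-1 prox for the ℓ1-norm, following the reconstruction of l(α) above and taking the "+" case Q = diag(d) + uu^⊤. For simplicity it brackets and bisects the root of the strictly increasing function l instead of sorting breakpoints, which is easier to check but costs O(N) per evaluation; it reuses prox_l1_scaled from the previous sketch.

```python
import numpy as np

def rank1_prox_l1(x_bar, d, u, tol=1e-10):
    """prox of ||.||_1 in the metric Q = diag(d) + u u^T (sketch, '+' case).

    The answer is D^{-1/2} prox_{||.||_1 o D^{-1/2}}(D^{1/2}(x_bar - alpha* D^{-1} u)),
    where alpha* is the root of the scalar, strictly increasing function l.
    """
    sqrt_d = np.sqrt(d)

    def inner(alpha):
        # D^{-1/2} prox_{g o D^{-1/2}}( D^{1/2} (x_bar - alpha * D^{-1} u) )
        z = sqrt_d * (x_bar - alpha * u / d)
        return prox_l1_scaled(z, d) / sqrt_d

    def l(alpha):
        return u @ (x_bar - inner(alpha)) + alpha

    # bracket the root of the increasing function l, then bisect
    lo, hi = -1.0, 1.0
    while l(lo) > 0:
        lo *= 2.0
    while l(hi) < 0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if l(mid) < 0:
            lo = mid
        else:
            hi = mid
    return inner(0.5 * (lo + hi))
```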

Solving the rank-1 Proximal Mapping for the ℓ1-norm

[Figure: the piecewise linear function l(α) with its breakpoints ŝ_1, ŝ_2, ŝ_3, ŝ_4; from S. Becker.]

Discussion about Solving the Proximal Mapping

Function g              Algorithm
ℓ1-norm                 separable: exact
Hinge                   separable: exact
ℓ∞-ball                 separable: exact
Box constraint          separable: exact
Positivity constraint   separable: exact
Linear constraint       nonseparable: exact
ℓ1-ball                 nonseparable: semi-smooth Newton + prox_{g∘D^{-1/2}}, exact
ℓ∞-norm                 nonseparable: Moreau identity
Simplex                 nonseparable: semi-smooth Newton + prox_{g∘D^{-1/2}}, exact

From [Becker, Fadili 12].

Rank-r Modified Metric: (g convex)

(L-)BFGS uses a rank-r update of the metric with r > 1.

Theorem [Becker, Fadili, O. 18]: Let Q = P ± V ∈ S_{++}, with P ∈ S_{++} and V = Σ_{i=1}^r u_i u_i^⊤, rank(V) = r. Then

  prox^Q_g(x̄) = P^{-1/2} prox_{g∘P^{-1/2}}(P^{1/2}(x̄ - P^{-1} U α*)),

where U = (u_1, ..., u_r) and α* is the unique root of

  l(α) = U^⊤ (x̄ - P^{-1/2} prox_{g∘P^{-1/2}}(P^{1/2}(x̄ - P^{-1} U α))) + X α,

where X := U^⊤ V^+ U ∈ S_{++}(r). The mapping l : R^r → R^r is Lipschitz continuous with constant ‖X‖ + ‖P^{-1/2} U‖² and strongly monotone.

Adaptive FISTA: Variants with O(1/k²) convergence rate (convex case):

Adaptive FISTA itself cannot be proved to have the accelerated rate O(1/k²): for each point x, aFISTA decreases the objective more than a FISTA step does, but the global view on the sequence is lost.

However, aFISTA can be embedded into schemes with the accelerated rate O(1/k²):

Monotone FISTA version (motivated by [Li, Lin 15], [Beck, Teboulle 09]).
Tseng-like version (motivated by [Tseng 08]).

Nesterov's Worst Case Function

[Plots: f(x^(k)) - min f versus iteration k and versus time in seconds, comparing FISTA, MFISTA, aFISTA, aMFISTA, aTseng, CG, and the lower bound O(1/k²).]

LASSO

  min_{x ∈ R^N} h(x),   h(x) = (1/2) ‖Ax - b‖² + λ ‖x‖_1

[Plots: h(x^(k)) - h* versus iteration k and versus time in seconds, comparing FISTA, aFISTA, aTseng, MFISTA, aMFISTA.]
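As a usage example, the LASSO objective plugs directly into the fista sketch from earlier; the data below are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 500))
b = rng.standard_normal(200)
lam = 0.1

grad_f = lambda x: A.T @ (A @ x - b)                  # gradient of 0.5 * ||Ax - b||^2
prox_g = lambda z, tau: np.sign(z) * np.maximum(np.abs(z) - lam * tau, 0.0)  # soft-thresholding
L = np.linalg.norm(A, 2) ** 2                         # Lipschitz constant of grad_f

x_hat = fista(grad_f, prox_g, np.zeros(500), L, n_iter=300)
```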

Proposed Algorithm: (non-convex setting)

Current iterate x^(k) ∈ R^N. Step size metric: T ∈ S_{++}(N). Define the extrapolated point y_α^(k) that depends on α:

  y_α^(k) := x^(k) + α (x^(k) - x^(k-1)).

Exact version: compute x^(k+1) as follows:

  x^(k+1) = argmin_{x ∈ R^N} min_α l^g_f(x; y_α^(k)) + (1/2) ‖x - y_α^(k)‖²_T,
  l^g_f(x; y_α^(k)) := g(x) + f(y_α^(k)) + ⟨∇f(y_α^(k)), x - y_α^(k)⟩.

Inexact version: find x^(k+1) and α such that

  l^g_f(x^(k+1); y_α^(k)) + (1/2) ‖x^(k+1) - y_α^(k)‖²_T ≤ f(x^(k)) + g(x^(k)).
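A sketch of one step of the inexact version, specialized to T = (1/τ) I and a fixed finite list of candidate extrapolation parameters α tried in order; the candidate list is an assumption for illustration, not the strategy of the talk. Since x^(k+1) minimizes the model built at the chosen y_α^(k), the candidate α = 0 always satisfies the descent condition when it is reached, so the loop terminates.

```python
import numpy as np

def afista_step_inexact(x, x_prev, f, grad_f, g, prox_g, tau,
                        alphas=(1.0, 0.5, 0.25, 0.0)):
    """One inexact adaptive-FISTA step (sketch), metric T = (1/tau) I.

    Accepts the first prox-gradient step from y_alpha whose model value
    l(x+; y_alpha) + ||x+ - y_alpha||^2 / (2 tau) does not exceed f(x) + g(x).
    """
    target = f(x) + g(x)
    for alpha in alphas:
        y = x + alpha * (x - x_prev)               # extrapolated point y_alpha
        gy = grad_f(y)
        x_new = prox_g(y - tau * gy, tau)          # minimizer of the model at y_alpha
        model = (g(x_new) + f(y) + gy @ (x_new - y)
                 + np.dot(x_new - y, x_new - y) / (2.0 * tau))
        if model <= target:
            return x_new, alpha                    # descent condition satisfied
    return x, 0.0                                  # defensive fallback (alpha = 0 always passes)
```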

Neural network optimization problem / non-linear inverse problem

  min_{W_0,W_1,W_2, b_0,b_1,b_2} Σ_{i=1}^N ( ‖(W_2 σ_2(W_1 σ_1(W_0 X + B_0) + B_1) + B_2 - Ỹ)_{·,i}‖² + ε² )^{1/2} + λ Σ_{j=0}^2 ‖W_j‖_1

[Plots: h(x^(k)) / h(x^(0)) versus iteration k and versus time in seconds, comparing FBS, MFISTA, iPiano, aFISTA.]

Conclusion:

Proposed adaptive FISTA for solving problems of the form min_{x ∈ R^N} g(x) + f(x).
Adaptive FISTA is locally better than FISTA.
Convergence to a stationary point is proved.
Adaptive FISTA is equivalent to a proximal quasi-Newton method if f is quadratic.
Often, the proximal mapping can be computed efficiently.
Adaptive FISTA can be embedded into accelerated / optimal schemes.
