Agenda: Fast proximal gradient methods
1. Accelerated first-order methods
2. Auxiliary sequences
3. Convergence analysis
4. Numerical examples
5. Optimality of Nesterov's scheme
Last time: the proximal gradient method converges at rate $O(1/k)$, and subgradient methods at rate $O(1/\sqrt{k})$. Can we do better for the nonsmooth problem

    \min f(x) = g(x) + h(x)

with the same computational effort as the proximal gradient method but with faster convergence? Answer: yes we can, with an equally simple scheme

    x^{k+1} = \arg\min_x Q_{1/t}(x, y^k)

Note that we use $y^k$ instead of $x^k$, where the new point $y^k$ is cleverly chosen. The original idea is due to Nesterov (1983) for minimization of a smooth objective; here we treat the nonsmooth problem.
Accelerated first-order methods

Choose $x^0$ and set $y^0 = x^0$. Repeat for $k = 1, 2, \ldots$:

    x^k = \mathrm{prox}_{t_k h}(y^{k-1} - t_k \nabla g(y^{k-1}))
    y^k = x^k + \frac{k-1}{k+2}(x^k - x^{k-1})

- same computational complexity as the proximal gradient method
- with $h = 0$, this is the accelerated gradient descent of Nesterov ('83)
- can be used with various stepsize rules: fixed, backtracking line search (BLS), ...
- interpretation: $\frac{k-1}{k+2}(x^k - x^{k-1})$ is a momentum term; it prevents zigzagging
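As a sketch (not from the slides), the iteration above is only a few lines of NumPy; the helper names `grad_g` and `prox_h` are assumptions for illustration.

```python
import numpy as np

def accelerated_prox_grad(grad_g, prox_h, x0, t, n_iters=100):
    """Accelerated proximal gradient with momentum weight (k-1)/(k+2).

    grad_g(x)    : gradient of the smooth part g
    prox_h(z, t) : prox operator of h with step t
    t            : fixed step size (e.g. 1/L when grad_g is L-Lipschitz)
    """
    x = np.asarray(x0, dtype=float)
    x_prev = x.copy()
    y = x.copy()
    for k in range(1, n_iters + 1):
        # proximal gradient step taken from the extrapolated point y
        x_prev, x = x, prox_h(y - t * grad_g(y), t)
        # momentum step
        y = x + (k - 1) / (k + 2) * (x - x_prev)
    return x
```

With $h = 0$ (so `prox_h` is the identity) this reduces to Nesterov's accelerated gradient descent.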
Other formulations: Beck and Teboulle (2009)

Fix step size $t = 1/L(g)$. Choose $x^0$, set $y^0 = x^0$, $\theta_0 = 1$. Loop: for $k = 1, 2, \ldots$

    (a) x^k = \mathrm{prox}_{t h}(y^{k-1} - t \nabla g(y^{k-1}))
    (b) \frac{1}{\theta_k} = \frac{1 + \sqrt{1 + 4/\theta_{k-1}^2}}{2}
    (c) y^k = x^k + \theta_k \Big[\frac{1}{\theta_{k-1}} - 1\Big](x^k - x^{k-1})
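A quick numerical sanity check of the recursion in step (b) (a sketch, not part of the slides): the sequence $\theta_k$ it generates satisfies the identity $(1-\theta_k)/\theta_k^2 = 1/\theta_{k-1}^2$ used in the convergence proof, along with $\theta_k \le 2/(k+2)$.

```python
import math

def theta_sequence(n):
    """theta_0 = 1, then 1/theta_k = (1 + sqrt(1 + 4/theta_{k-1}^2)) / 2."""
    thetas = [1.0]
    for _ in range(n):
        t = thetas[-1]
        thetas.append(2.0 / (1.0 + math.sqrt(1.0 + 4.0 / t ** 2)))
    return thetas
```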
With BLS (knowledge of the Lipschitz constant is not necessary):

Choose $x^0$, set $y^0 = x^0$, $\theta_0 = 1$. Loop: for $k = 1, 2, \ldots$, backtrack until (this gives $t_k$)

    f(y^{k-1} - t_k G_{t_k}(y^{k-1})) \le Q_{1/t_k}(y^{k-1} - t_k G_{t_k}(y^{k-1}),\, y^{k-1})

then apply steps (a)-(c) as before, noting that

    \mathrm{prox}_{t_k h}(y^{k-1} - t_k \nabla g(y^{k-1})) = y^{k-1} - t_k G_{t_k}(y^{k-1})
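A minimal backtracking sketch (assumed helper signatures): the sufficient-decrease test is applied to the smooth part $g$ only, which is equivalent to the $f$-versus-$Q$ test above since $h(x^+)$ appears on both sides.

```python
import numpy as np

def backtrack_prox_step(g, grad_g, prox_h, y, t0, beta=0.5):
    """Shrink t until g(x+) <= g(y) + <grad g(y), x+ - y> + ||x+ - y||^2/(2t),
    where x+ = prox_{t h}(y - t grad g(y)); returns (x+, accepted t)."""
    t = t0
    while True:
        x = prox_h(y - t * grad_g(y), t)
        d = x - y
        if g(x) <= g(y) + grad_g(y) @ d + (d @ d) / (2 * t):
            return x, t
        t *= beta  # geometric backtracking
```

Any $t \le 1/L$ passes the test, so the loop terminates whenever $\nabla g$ is Lipschitz.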
Convergence analysis

Theorem:

    f(x^k) - f^\star \le \frac{2\|x^0 - x^\star\|^2}{(k+1)^2\, t}

with $t = 1/L$ for the fixed step size and $t = \beta/L$ for BLS.

Other $1/k^2$ first-order methods:
- Nesterov (2007): two auxiliary sequences $\{y^k\}$, $\{z^k\}$; two prox operations at each iteration; convergence analysis
- Lu, Lan and Monteiro
- Tseng
- Auslender and Teboulle
- Unified analysis framework: Tseng (2008)
Proof (Beck and Teboulle's version)

Define

    v^k = \frac{1}{\theta_{k-1}}\, x^k - \Big[\frac{1}{\theta_{k-1}} - 1\Big] x^{k-1}

so that $y^k = \theta_k v^k + (1 - \theta_k) x^k$, and

    (i)  v^{k+1} = v^k + \frac{1}{\theta_k}\,[x^{k+1} - y^k]
    (ii) \frac{1 - \theta_k}{\theta_k^2} = \frac{1}{\theta_{k-1}^2}

Proof of (ii): with $u = \sqrt{4/\theta_{k-1}^2 + 1}$, so that $\theta_k = 2/(1+u)$,

    \frac{1 - \theta_k}{\theta_k^2} = \frac{(1+u)^2}{4}\cdot\frac{u-1}{u+1} = \frac{u^2 - 1}{4} = \frac{4/\theta_{k-1}^2 + 1 - 1}{4} = \frac{1}{\theta_{k-1}^2}
Increment in one iteration (Beck and Teboulle, Vandenberghe)

Notation: $x = x^{i-1}$, $x^+ = x^i$, $y = y^{i-1}$, $v = v^{i-1}$, $v^+ = v^i$, $\theta = \theta_{i-1}$.

Pillars of the analysis:

    (1) f(x^+) \le f(x) + G_t(y)^T (y - x) - \frac{t}{2}\|G_t(y)\|^2
    (2) f(x^+) \le f^\star + G_t(y)^T (y - x^\star) - \frac{t}{2}\|G_t(y)\|^2

Take the convex combination $(1-\theta)\cdot(1) + \theta\cdot(2)$:

    f(x^+) \le (1-\theta) f(x) + \theta f^\star + \langle G_t(y),\, y - (1-\theta)x - \theta x^\star\rangle - \frac{t}{2}\|G_t(y)\|^2
           = (1-\theta) f(x) + \theta f^\star + \theta\,\langle G_t(y),\, v - x^\star\rangle - \frac{t}{2}\|G_t(y)\|^2
Because $y = \theta v + (1-\theta)x$. Completing the square gives

    f(x^+) - f^\star \le (1-\theta)\,[f(x) - f^\star] + \frac{\theta^2}{2t}\Big[\|v - x^\star\|^2 - \big\|v - x^\star - \tfrac{t}{\theta} G_t(y)\big\|^2\Big]

Moreover, by (i),

    v^+ = v + \frac{1}{\theta}\,[x^+ - y] = v + \frac{1}{\theta}\,[y - t\, G_t(y) - y] = v - \frac{t}{\theta}\, G_t(y)

Therefore

    f(x^+) - f^\star \le (1-\theta)\,[f(x) - f^\star] + \frac{\theta^2}{2t}\big[\|v - x^\star\|^2 - \|v^+ - x^\star\|^2\big]

Conclusion: dividing by $\theta_{i-1}^2$,

    \frac{1}{\theta_{i-1}^2}\,[f(x^i) - f^\star] + \frac{1}{2t}\|v^i - x^\star\|^2 \le \frac{1-\theta_{i-1}}{\theta_{i-1}^2}\,[f(x^{i-1}) - f^\star] + \frac{1}{2t}\|v^{i-1} - x^\star\|^2
We have $\frac{1-\theta_{i-1}}{\theta_{i-1}^2} = \frac{1}{\theta_{i-2}^2}$ by (ii), so iterating gives

    \frac{1}{\theta_{k-1}^2}\,[f(x^k) - f^\star] + \frac{1}{2t}\|v^k - x^\star\|^2 \le \frac{1-\theta_0}{\theta_0^2}\,[f(x^0) - f^\star] + \frac{1}{2t}\|v^0 - x^\star\|^2

Since $\theta_0 = 1$ and $v^0 = x^0$,

    \frac{1}{\theta_{k-1}^2}\,(f(x^k) - f^\star) \le \frac{1}{2t}\|x^0 - x^\star\|^2

Since $\frac{1}{\theta_{k-1}^2} \ge \frac{(k+1)^2}{4}$,

    f(x^k) - f^\star \le \frac{2}{(k+1)^2\, t}\,\|x^0 - x^\star\|^2

The argument is similar with BLS; see Beck and Teboulle (2009).
Case study: LASSO

    \min f(x) = \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1

Choose $x^0$, set $y^0 = x^0$ and $\theta_0 = 1$, and repeat until convergence:

    x^k = S_{t_k\lambda}(y^{k-1} - t_k A^*(A y^{k-1} - b))
    \frac{1}{\theta_k} = \frac{1}{2}\Big[1 + \sqrt{1 + 4/\theta_{k-1}^2}\Big]
    y^k = x^k + \theta_k(\theta_{k-1}^{-1} - 1)(x^k - x^{k-1})

Here $S_\tau$ is soft-thresholding at level $\tau$. Dominant computational cost per iteration: one application of $A$ and one application of $A^*$.
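This iteration can be sketched compactly in NumPy (illustrative, with the fixed step $t = 1/L$ computed from $\|A\|_2^2$):

```python
import numpy as np

def soft_threshold(z, tau):
    """S_tau(z): elementwise soft-thresholding at level tau."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fista_lasso(A, b, lam, n_iters=500):
    """FISTA for min (1/2)||Ax - b||^2 + lam * ||x||_1 with fixed step 1/L."""
    t = 1.0 / np.linalg.norm(A, 2) ** 2      # 1/L with L = lambda_max(A^T A)
    x = np.zeros(A.shape[1])
    x_prev = x.copy()
    y = x.copy()
    theta = 1.0
    for _ in range(n_iters):
        # one application of A and one of A^T per iteration
        x_prev, x = x, soft_threshold(y - t * A.T @ (A @ y - b), t * lam)
        theta_prev, theta = theta, 2.0 / (1.0 + np.sqrt(1.0 + 4.0 / theta ** 2))
        y = x + theta * (1.0 / theta_prev - 1.0) * (x - x_prev)
    return x
```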
Example from Beck and Teboulle (FISTA)

[Figure 5 of Beck and Teboulle (2009): comparison of function value errors F(x^k) - F(x^\star) of ISTA, MTWIST, and FISTA over 10,000 iterations.]
Example from Vandenberghe (EE 236C, UCLA): 1-norm regularized least-squares

    \mathrm{minimize}\ \frac{1}{2}\|Ax - b\|_2^2 + \|x\|_1

with randomly generated $A \in \mathbb{R}^{2000 \times 1000}$ and step $t_k = 1/L$, $L = \lambda_{\max}(A^T A)$.

[Figure: relative error $(f(x^{(k)}) - f^\star)/f^\star$ versus iteration $k$.]
Nuclear norm regularization

    \min g(X) + \lambda\|X\|_*

General gradient update:

    X^+ = \arg\min_X \Big\{\frac{1}{2t}\big\|X - (X^0 - t\nabla g(X^0))\big\|_F^2 + \lambda\|X\|_*\Big\} = S_{t\lambda}(X^0 - t\nabla g(X^0))

where $S_\tau$ is the singular value soft-thresholding operator: if $X = \sum_{j=1}^r \sigma_j u_j v_j^*$, then

    S_\tau(X) := \sum_{j=1}^r \max(\sigma_j - \tau, 0)\, u_j v_j^*
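The operator $S_\tau$ can be sketched in a few lines of NumPy (a full SVD here for illustration; in practice one would use a partial SVD):

```python
import numpy as np

def sv_soft_threshold(X, tau):
    """Singular value soft-thresholding: shrink each singular value by tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # scale column j of U by max(sigma_j - tau, 0), then recombine
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```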
Example

    \min \frac{1}{2}\|\mathcal{A}(X) - b\|^2 + \lambda\|X\|_*

Choose $X^0$, set $Y^0 = X^0$, $\theta_0 = 1$ and repeat

    X^k = S_{t_k\lambda}\big[Y^{k-1} - t_k \mathcal{A}^*(\mathcal{A}(Y^{k-1}) - b)\big]
    \theta_k = \ldots
    Y^k = \ldots

Important remark: we only need to compute the top part of the SVD of $Y^{k-1} - t_k \mathcal{A}^*(\mathcal{A}(Y^{k-1}) - b)$, namely the singular values exceeding $t_k\lambda$ and their singular vectors.
Example from Vandenberghe (EE 236C, UCLA): matrix completion

    \mathrm{minimize}\ \sum_{(i,j)\ \mathrm{obs.}} (X_{ij} - M_{ij})^2 + \lambda\|X\|_*

with $X \in \mathbb{R}^{500 \times 500}$, 5,000 observed entries, and fixed step size $t = 1/L$.

[Figure: relative error $(f(X^{(k)}) - f^\star)/f^\star$ versus iteration $k$; convergence with the fixed step size.]
Optimality of Nesterov's method

    \min f(x), \quad f\ \text{convex},\ \nabla f\ \text{Lipschitz}

No method which updates $x^k$ in $\mathrm{span}\{x^0, \nabla f(x^0), \ldots, \nabla f(x^{k-1})\}$ can converge faster than $1/k^2$: that is, $1/k^2$ is the optimal rate for first-order methods.
Why? Consider

    f(x) = \frac{1}{2}\, x^T A x - e_1^T x

where $A \in \mathbb{R}^{n\times n}$ is the tridiagonal matrix with 2 on the diagonal and $-1$ on the off-diagonals, and $e_1 = (1, 0, \ldots, 0)^T$. Then $A \succeq 0$, $\|A\| \le 4$, and the solution obeys $Ax^\star = e_1$:

    x_i^\star = 1 - \frac{i}{n+1}, \qquad f^\star = -\frac{n}{2(n+1)}

Note that

    \|x^\star\|^2 = \frac{1}{(n+1)^2}\,(n^2 + \cdots + 1) \le \frac{n+1}{3}

since $(k+1)^3 - k^3 \ge 3k^2$ implies $\sum_{k=1}^n k^2 \le \frac{(n+1)^3}{3}$.
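These claims are easy to verify numerically (a sketch; the choice $n = 25$ is arbitrary):

```python
import numpy as np

n = 25
# tridiagonal A: 2 on the diagonal, -1 on the off-diagonals
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
e1 = np.zeros(n); e1[0] = 1.0
f = lambda x: 0.5 * x @ A @ x - e1 @ x

x_star = 1.0 - np.arange(1, n + 1) / (n + 1)   # claimed solution of A x = e_1
f_star = -n / (2 * (n + 1))                    # claimed optimal value
```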
Start a first-order algorithm at $x^0 = 0$. Since $A$ is tridiagonal,

    \mathrm{span}(\nabla f(x^0)) = \mathrm{span}(e_1) \Rightarrow x^1 \in \mathrm{span}(e_1)
    \mathrm{span}(\nabla f(x^1)) \subseteq \mathrm{span}(e_1, e_2) \Rightarrow x^2 \in \mathrm{span}(e_1, e_2)
    \mathrm{span}(\nabla f(x^2)) \subseteq \mathrm{span}(e_1, e_2, e_3) \Rightarrow x^3 \in \mathrm{span}(e_1, e_2, e_3)
    \ldots

For $n = 2k+1$ (so that $k \le n/2$),

    f(x^k) \ge \inf_{x:\, x_{k+1} = \cdots = x_n = 0} f(x) = -\frac{k}{2(k+1)}

and therefore

    f(x^k) - f^\star \ge \frac{n}{2(n+1)} - \frac{k}{2(k+1)} = \frac{1}{4(k+1)}

Using $\|x^\star\|^2 \le (n+1)/3$ and $n+1 = 2(k+1)$, this gives

    f(x^k) - f^\star \ge \frac{1}{4(k+1)} \ge \frac{3\,\|x^\star\|^2}{4(k+1)(n+1)} = \frac{3\,\|x^\star\|^2}{8(k+1)^2}
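As an illustration (a sketch, not from the slides), plain gradient descent started at $x^0 = 0$ indeed keeps $x^k$ supported on the first $k$ coordinates, and its error after $k$ steps obeys the $1/(4(k+1))$ lower bound:

```python
import numpy as np

k = 5
n = 2 * k + 1
A = 2 * np.eye(n, k=0) - np.eye(n, k=1) - np.eye(n, k=-1)  # worst-case tridiagonal
e1 = np.zeros(n); e1[0] = 1.0
f = lambda x: 0.5 * x @ A @ x - e1 @ x
f_star = -n / (2 * (n + 1))

x = np.zeros(n)
for _ in range(k):                 # k gradient steps with step size 1/4 <= 1/||A||
    x = x - 0.25 * (A @ x - e1)
```

Any other first-order method restricted to the gradient span (including Nesterov's) obeys the same lower bound, which is why $1/k^2$ cannot be beaten.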
References
1. Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, CORE, Université Catholique de Louvain, 2007.
2. A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2009.
3. M. Teboulle. First Order Algorithms for Convex Minimization. Optimization Tutorials, IPAM, UCLA, 2010.
4. L. Vandenberghe. EE 236C lecture notes (Spring 2011), UCLA.