Accelerating Nesterov's Method for Strongly Convex Functions
Hao Chen, Xiangrui Meng
MATH301, 2011
Outline
1. The Gap
2. Reducing the Gap for Quadratic Functions
3. Reducing the Gap for General Strongly Convex Functions
Our talk begins with a tiny gap

For any $x_0 \in \mathbb{R}^\infty$ and any constants $\mu > 0$, $L > \mu$, there exists a function $f \in \mathcal{S}^{\infty,1}_{\mu,L}$ such that for any first-order method we have
\[ f(x_k) - f^* \;\ge\; \frac{\mu}{2}\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{2k} \|x_0 - x^*\|^2, \qquad \kappa = \frac{L}{\mu}. \]

Nesterov's method generates a sequence $\{x_k\}_{k=0}^\infty$ such that
\[ f(x_k) - f^* \;\le\; L\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}}\right)^{k} \|x_0 - x^*\|^2, \qquad \kappa = \frac{L}{\mu}. \]
At a closer look, the gap is not tiny

Assume that $\kappa$ is large. Given a small tolerance $\epsilon > 0$, to make $f(x_k) - f^* < \epsilon$, the ideal first-order method needs
\[ K = \frac{\log \epsilon - \log \frac{\mu}{2}}{2 \log \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}} \approx \frac{\sqrt{\kappa}}{4} \log \frac{1}{\epsilon} \]
iterations. Nesterov's method needs
\[ K = \frac{\log \epsilon - \log L}{\log \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}}} \approx \sqrt{\kappa}\, \log \frac{1}{\epsilon} \]
iterations, which is 4 times as large as the ideal number.
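The factor of 4 comes from the first-order expansions of the two logarithms; a quick check (a standard approximation, not on the original slide):
\[
\log\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} = \log\!\left(1 - \frac{2}{\sqrt{\kappa}+1}\right) \approx -\frac{2}{\sqrt{\kappa}},
\qquad
\log\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}} = \log\!\left(1 - \frac{1}{\sqrt{\kappa}}\right) \approx -\frac{1}{\sqrt{\kappa}},
\]
so the denominator in the ideal count is about $-4/\sqrt{\kappa}$ (one factor of 2 from the exponent $2k$, one from the tighter contraction), while Nesterov's is about $-1/\sqrt{\kappa}$.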
Can we reduce the gap?

Can we reduce the gap for quadratic functions?
\[ \text{minimize } f(x) = \tfrac{1}{2} x^T A x - b^T x, \qquad \mu I_n \preceq A \preceq L I_n. \]
In this case we do have an ideal method, the conjugate gradient method, which attains the optimal convergence rate.

Can we reduce the gap for general strongly convex functions?
\[ \text{minimize } f(x), \qquad f \in \mathcal{S}_{\mu,L}. \]
Nesterov's constant step scheme, III

0. Choose $y_0 = x_0 \in \mathbb{R}^n$.
1. $k$-th iteration ($k \ge 0$):
\[ x_{k+1} = y_k - h \nabla f(y_k), \qquad y_{k+1} = x_{k+1} + \beta (x_{k+1} - x_k), \]
where $h = \frac{1}{L}$ and $\beta = \frac{1 - \sqrt{\mu h}}{1 + \sqrt{\mu h}}$.

Q: Is Nesterov's choice of $h$ and $\beta$ optimal?
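A minimal NumPy sketch of the scheme above; the function name, signature, and fixed iteration budget are our own choices, not part of the slides:

```python
import numpy as np

def nesterov_constant_step(grad_f, x0, L, mu, n_iter=1000):
    """Nesterov's constant step scheme III, as stated above.

    grad_f : callable returning the gradient of f at a point
    L, mu  : gradient Lipschitz constant and strong convexity parameter
    """
    h = 1.0 / L                                            # step size h = 1/L
    beta = (1 - np.sqrt(mu * h)) / (1 + np.sqrt(mu * h))   # momentum coefficient
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    for _ in range(n_iter):
        x = y - h * grad_f(y)              # x_{k+1} = y_k - h * grad f(y_k)
        y = x + beta * (x - x_prev)        # y_{k+1} = x_{k+1} + beta * (x_{k+1} - x_k)
        x_prev = x
    return x_prev
```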
On quadratic functions

When minimizing a quadratic function $f(x) = \frac{1}{2} x^T A x - b^T x$, Nesterov's updates become

0. Choose $y_0 = x_0 = 0$.
1. $k$-th iteration ($k \ge 0$):
\[ x_{k+1} = y_k - h (A y_k - b), \qquad y_{k+1} = x_{k+1} + \beta (x_{k+1} - x_k). \]
Eigendecomposition

Let $A = V \Lambda V^T$ be $A$'s eigendecomposition. Define $\bar{x}_k = V^T x_k$, $\bar{y}_k = V^T y_k$ for all $k$, and $\bar{b} = V^T b$. Then Nesterov's updates can be written as

0. Choose $\bar{y}_0 = \bar{x}_0 = 0$.
1. $k$-th iteration ($k \ge 0$):
\[ \bar{x}_{k+1} = \bar{y}_k - h (\Lambda \bar{y}_k - \bar{b}), \qquad \bar{y}_{k+1} = \bar{x}_{k+1} + \beta (\bar{x}_{k+1} - \bar{x}_k). \]

$\Lambda$ is diagonal, hence the updates are actually element-wise:
\[ \bar{x}_{k+1,i} = \bar{y}_{k,i} - h (\lambda_i \bar{y}_{k,i} - \bar{b}_i), \qquad \bar{y}_{k+1,i} = \bar{x}_{k+1,i} + \beta (\bar{x}_{k+1,i} - \bar{x}_{k,i}), \qquad i = 1, \ldots, n. \]
Recurrence relation

We can eliminate the sequence $\{\bar{y}_k\}$ from the update scheme:
\begin{align*}
\bar{x}_{k+1,i} &= \bar{y}_{k,i} - h (\lambda_i \bar{y}_{k,i} - \bar{b}_i) \\
&= \bar{x}_{k,i} + \beta(\bar{x}_{k,i} - \bar{x}_{k-1,i}) - h\big(\lambda_i (\bar{x}_{k,i} + \beta(\bar{x}_{k,i} - \bar{x}_{k-1,i})) - \bar{b}_i\big) \\
&= (1 + \beta)(1 - \lambda_i h)\, \bar{x}_{k,i} - \beta (1 - \lambda_i h)\, \bar{x}_{k-1,i} + h \bar{b}_i.
\end{align*}
Let $\bar{e}_k = V^T(x_k - x^*) = V^T(x_k - V \Lambda^{-1} V^T b) = \bar{x}_k - \Lambda^{-1} \bar{b}$ for all $k$. We have the following recurrence relation for the error:
\[ \bar{e}_{k+1,i} = (1 + \beta)(1 - \lambda_i h)\, \bar{e}_{k,i} - \beta (1 - \lambda_i h)\, \bar{e}_{k-1,i}. \]
Characteristic equation

The characteristic equation for the recurrence relation is
\[ \xi_i^2 = (1 + \beta)(1 - \lambda_i h)\, \xi_i - \beta (1 - \lambda_i h). \]
Denote the two roots by $\xi_{i,1}$ and $\xi_{i,2}$, and assume they are distinct for simplicity. The general solution is
\[ \bar{e}_{k,i} = C_{i,1} \xi_{i,1}^k + C_{i,2} \xi_{i,2}^k. \]
Let $C_i = |C_{i,1}| + |C_{i,2}|$ and $\theta_i = \max\{|\xi_{i,1}|, |\xi_{i,2}|\}$. We have
\[ |\bar{e}_{k,i}| \le C_i \theta_i^k. \]
Hence,
\[ \|x_k - x^*\|^2 = \|\bar{e}_k\|^2 = \sum_i \bar{e}_{k,i}^2 \le \sum_i C_i^2 \theta_i^{2k} \le C \theta^{2k}, \]
where $C = \sum_i C_i^2$ and $\theta = \max_i \theta_i$.
Finding the optimal convergence rate

Our problem becomes
\[
\begin{aligned}
\text{minimize} \quad & \theta \\
\text{subject to} \quad & \theta \ge |\xi_1(\lambda)|,\, |\xi_2(\lambda)| \quad \text{for all } \lambda \in [\mu, L],
\end{aligned}
\]
where $\xi_1(\lambda)$ and $\xi_2(\lambda)$ are the roots of
\[ \xi^2 = (1 + \beta)(1 - \lambda h)\, \xi - \beta (1 - \lambda h), \]
and $h$, $\beta$ and $\theta$ are the variables.
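The worst-case root modulus can also be evaluated numerically; here is a small NumPy sketch (our own helper, with an arbitrarily chosen grid resolution over $[\mu, L]$):

```python
import numpy as np

def rate(h, beta, mu, L, n_grid=10001):
    """Largest root modulus theta(h, beta) over lambda in [mu, L].

    For each lambda, the characteristic equation is
        xi^2 - (1 + beta)*(1 - lambda*h)*xi + beta*(1 - lambda*h) = 0,
    and theta is the worst-case modulus of its roots over the spectrum.
    """
    lam = np.linspace(mu, L, n_grid)
    a = (1 + beta) * (1 - lam * h)          # coefficient of xi
    c = beta * (1 - lam * h)                # constant term
    disc = np.sqrt(a**2 - 4*c + 0j)         # complex sqrt handles negative discriminants
    r1, r2 = (a + disc) / 2, (a - disc) / 2
    return float(np.max(np.maximum(np.abs(r1), np.abs(r2))))

# Example with hypothetical values mu = 1, L = 100:
#   beta(h) = (1 - sqrt(mu*h)) / (1 + sqrt(mu*h))
#   rate(1/L, beta(1/L), mu, L)                 is approximately 1 - 1/sqrt(kappa)
#   rate(4/(3*L + mu), beta(4/(3*L + mu)), mu, L)  is approximately 1 - 2/sqrt(3*kappa + 1)
```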
Special cases

If $\beta = 0$, we are doing gradient descent. The optimal rate is
\[ \theta = \frac{L - \mu}{L + \mu}, \quad \text{attained at } h = \frac{2}{L + \mu}. \]
If $h = \frac{1}{L}$, the optimal rate is
\[ \theta = 1 - \sqrt{\mu h} = 1 - \sqrt{\frac{\mu}{L}}, \quad \text{attained at } \beta = \frac{1 - \sqrt{\mu h}}{1 + \sqrt{\mu h}} = \frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}}, \]
which confirms Nesterov's choice.

Q: Why do we choose $h = \frac{1}{L}$? It guarantees the largest decrease in function value for a function whose gradient has Lipschitz constant $L$.
The optimal convergence rate

By considering all combinations of $h$ and $\beta$, we reach the following optimal solution:
\[
h = \frac{4}{3L + \mu} \;\left(\text{the harmonic mean of } \frac{1}{L} \text{ and } \frac{2}{L + \mu}\right), \qquad
\beta = \frac{1 - \sqrt{\mu h}}{1 + \sqrt{\mu h}}, \qquad
\theta = 1 - \sqrt{\mu h} = 1 - \frac{2}{\sqrt{3\kappa + 1}}.
\]
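A quick check that the stated rate follows from the stated step size:
\[
\sqrt{\mu h} = \sqrt{\frac{4\mu}{3L + \mu}} = \frac{2}{\sqrt{3L/\mu + 1}} = \frac{2}{\sqrt{3\kappa + 1}},
\qquad\text{so}\qquad
\theta = 1 - \sqrt{\mu h} = 1 - \frac{2}{\sqrt{3\kappa + 1}}.
\]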
Comparing the convergence rates

Nesterov's method ($h = \frac{1}{L}$):
\[ \|x_k - x^*\| \le C \left(1 - \frac{1}{\sqrt{\kappa}}\right)^k \|x_0 - x^*\|. \]
Note that this is better than the convergence rate we have on general strongly convex functions.

Nesterov's method ($h = \frac{4}{3L + \mu}$):
\[ \|x_k - x^*\| \le C \left(1 - \frac{2}{\sqrt{3\kappa + 1}}\right)^k \|x_0 - x^*\|. \]

Conjugate gradient:
\[ \|x_k - x^*\|_A \le 2 \left(1 - \frac{2}{\sqrt{\kappa} + 1}\right)^k \|x_0 - x^*\|_A. \]
What's happening on the eigenspace

Figure: Error along eigendirections ($|\bar{e}_{k,i}|$)
The model problem

\[ \text{minimize } f(x) = \tfrac{1}{2} x^T A x - b^T x, \]
where
\[ A = \begin{pmatrix} 2 & -1 & & \\ -1 & \ddots & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 2 \end{pmatrix} + \delta I_n \in \mathbb{R}^{n \times n}, \qquad b = \mathrm{randn}(n, 1) \in \mathbb{R}^n. \]
We chose $n = 10^6$ and $\delta = 0.05$.
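A sketch of how this instance can be assembled with SciPy sparse matrices; the helper name and the RNG seed are our additions:

```python
import numpy as np
import scipy.sparse as sp

def model_problem(n=10**6, delta=0.05, seed=0):
    """Build the model problem: A = tridiag(-1, 2, -1) + delta*I, b random."""
    rng = np.random.default_rng(seed)
    main = np.full(n, 2.0 + delta)               # diagonal entries 2 + delta
    off = np.full(n - 1, -1.0)                   # off-diagonal entries -1
    A = sp.diags([off, main, off], offsets=[-1, 0, 1], format="csr")
    b = rng.standard_normal(n)                   # b = randn(n, 1)
    return A, b
```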
Figure: $\|x_k - x^*\|$
Figure: $f(x_k) - f^*$
Back to Nesterov's proof

A pair of sequences $\{\phi_k(x)\}$ and $\{\lambda_k\}$, $\lambda_k \ge 0$, is called an estimate sequence of the function $f(x)$ if $\lambda_k \to 0$ and for any $x \in \mathbb{R}^n$ and all $k \ge 0$ we have
\[ \phi_k(x) \le (1 - \lambda_k) f(x) + \lambda_k \phi_0(x). \]
If for a sequence $\{x_k\}$ we have
\[ f(x_k) \le \phi_k^* \equiv \min_{x \in \mathbb{R}^n} \phi_k(x), \]
then
\[ f(x_k) - f^* \le \lambda_k \big[\phi_0(x^*) - f^*\big] \to 0. \]
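The chain of inequalities behind this conclusion, spelled out (a standard step, not on the original slide):
\[
f(x_k) \le \phi_k^* \le \phi_k(x^*) \le (1 - \lambda_k) f(x^*) + \lambda_k \phi_0(x^*)
\;\Longrightarrow\;
f(x_k) - f^* \le \lambda_k \big[\phi_0(x^*) - f^*\big] \to 0.
\]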
A useful estimate sequence provided by Nesterov

\[ \lambda_{k+1} = (1 - \alpha_k)\lambda_k, \]
\[ \phi_{k+1}(x) = (1 - \alpha_k)\phi_k(x) + \alpha_k \left[ f(y_k) + \langle \nabla f(y_k), x - y_k \rangle + \tfrac{\mu}{2}\|x - y_k\|^2 \right], \]
where
$\{y_k\}$ is an arbitrary sequence in $\mathbb{R}^n$;
$\alpha_k \in (0, 1)$, $\sum_{k=0}^\infty \alpha_k = \infty$;
$\lambda_0 = 1$;
$\phi_0$ is an arbitrary function on $\mathbb{R}^n$.
A specific choice of $\phi_0(x)$

Take $\phi_0(x) \equiv \phi_0^* + \frac{\gamma_0}{2}\|x - v_0\|^2$ and set $x_0 = v_0$, $\phi_0^* = f(x_0)$. The previous estimate sequence keeps the form
\[ \phi_k(x) \equiv \phi_k^* + \frac{\gamma_k}{2}\|x - v_k\|^2, \]
with
\begin{align*}
\gamma_{k+1} &= (1 - \alpha_k)\gamma_k + \alpha_k \mu, \\
v_{k+1} &= \big[(1 - \alpha_k)\gamma_k v_k + \alpha_k \mu y_k - \alpha_k \nabla f(y_k)\big]/\gamma_{k+1}, \\
\phi_{k+1}^* &= (1 - \alpha_k)\phi_k^* + \alpha_k f(y_k) - \frac{\alpha_k^2}{2\gamma_{k+1}}\|\nabla f(y_k)\|^2 + \frac{\alpha_k(1 - \alpha_k)\gamma_k}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_k - v_k\|^2 + \langle \nabla f(y_k), v_k - y_k \rangle\right).
\end{align*}
Let the update be $x_{k+1} = y_k - h_k \nabla f(y_k)$ and use the inequalities
\[ \phi_k^* \ge f(x_k) \ge f(y_k) + \langle \nabla f(y_k), x_k - y_k \rangle + \tfrac{\mu}{2}\|x_k - y_k\|^2, \]
\[ f(x_{k+1}) \le f(y_k) - \tfrac{h_k(2 - L h_k)}{2}\|\nabla f(y_k)\|^2. \]
We have
\begin{align*}
\phi_{k+1}^* \ge{}& f(x_{k+1}) + \left(\frac{h_k(2 - L h_k)}{2} - \frac{\alpha_k^2}{2\gamma_{k+1}}\right)\|\nabla f(y_k)\|^2 \\
&+ (1 - \alpha_k)\left\langle \nabla f(y_k),\; \frac{\alpha_k \gamma_k}{\gamma_{k+1}}(v_k - y_k) + (x_k - y_k) \right\rangle \\
&+ \frac{\mu(1 - \alpha_k)}{2}\left(\frac{\alpha_k \gamma_k}{\gamma_{k+1}}\|v_k - y_k\|^2 + \|x_k - y_k\|^2\right).
\end{align*}
With Nesterov's choice
\[ y_k = \frac{\alpha_k \gamma_k v_k + \gamma_{k+1} x_k}{\gamma_k + \alpha_k \mu}, \qquad h_k = \frac{1}{L}, \qquad \gamma_0 \ge \mu, \]
the inner-product term above vanishes. Since $\gamma_{k+1} = (1 - \alpha_k)\gamma_k + \alpha_k \mu$, we have $\gamma_k \ge \mu$, so $\alpha_k$ can be taken as large as $\sqrt{\mu/L}$ at each step, which leads to the convergence rate
\[ 1 - \sqrt{\mu/L} = 1 - \frac{1}{\sqrt{\kappa}}. \]
A simplified version

Take $\gamma_k \equiv \mu$ and $h_k \equiv \frac{1}{L}$. Then
\[ y_k = \frac{\alpha_k v_k + x_k}{\alpha_k + 1}, \qquad v_k - y_k = \frac{v_k - x_k}{\alpha_k + 1}, \qquad x_k - y_k = \frac{\alpha_k (x_k - v_k)}{\alpha_k + 1}, \]
and the bound becomes
\[ \phi_{k+1}^* \ge f(x_{k+1}) + \left(\frac{1}{2L} - \frac{\alpha_k^2}{2\mu}\right)\|\nabla f(y_k)\|^2 + \frac{\mu \alpha_k (1 - \alpha_k)}{2(1 + \alpha_k)}\|x_k - v_k\|^2. \]
Figure: $\|x_k - v_k\|^2 / \|\nabla f(y_k)\|^2$ for $f(x) = \frac{1}{2}\|Ax - b\|^2 + \lambda\,\mathrm{smooth}(\|x\|_1, \tau) + \frac{\mu}{2}\|x\|^2$
We need
\[ \frac{\mu \alpha_k (1 - \alpha_k)}{2(1 + \alpha_k)}\|x_k - v_k\|^2 \;\ge\; \left(\frac{\alpha_k^2}{2\mu} - \frac{1}{2L}\right)\left\|\nabla f\!\left(\frac{\alpha_k v_k + x_k}{\alpha_k + 1}\right)\right\|^2. \]
Since the decay rate is $\prod_k (1 - \alpha_k)$, we want to find a large $\alpha_k$ for which this inequality holds. Evaluating $\nabla f\!\left(\frac{\alpha_k v_k + x_k}{\alpha_k + 1}\right)$ is time consuming, so we hope our first guess of $\alpha_k$ is good. Since $\|\nabla f(y_k)\|$ tends to decrease, our procedure is to find an $\alpha_k \ge \sqrt{\mu/L}$ such that
\[ \frac{\mu \alpha_k (1 - \alpha_k)}{2(1 + \alpha_k)}\|x_k - v_k\|^2 - \frac{\alpha_k^2}{2\mu}\|\nabla f(y_{k-1})\|^2 \]
is large; such an $\alpha_k$ usually makes the inequality hold.
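A NumPy sketch of this first-guess rule; the grid search, its range, and the helper name are our own choices, since the slides do not spell out the exact procedure:

```python
import numpy as np

def pick_alpha(x_k, v_k, grad_prev_norm, mu, L, n_grid=50):
    """Heuristic first guess of alpha_k.

    Among candidates alpha >= sqrt(mu/L), pick the one maximizing
        mu*alpha*(1-alpha)/(2*(1+alpha)) * ||x_k - v_k||^2
        - alpha^2/(2*mu) * ||grad f(y_{k-1})||^2,
    using ||grad f(y_{k-1})|| as a cheap proxy for ||grad f(y_k)||.
    """
    dist2 = np.dot(x_k - v_k, x_k - v_k)
    alphas = np.linspace(np.sqrt(mu / L), 1.0, n_grid, endpoint=False)
    scores = (mu * alphas * (1 - alphas) / (2 * (1 + alphas)) * dist2
              - alphas**2 / (2 * mu) * grad_prev_norm**2)
    return alphas[np.argmax(scores)]
```

The guess would then be verified (and reduced if necessary) against the actual inequality with $\nabla f(y_k)$.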
Figure: $\|\nabla f(y_k)\|$ for $f(x) = \frac{1}{2}\|Ax - b\|^2 + \lambda\,\mathrm{smooth}(\|x\|_1, \tau) + \frac{\mu}{2}\|x\|^2$
Test 1: smooth-BPDN

The first test is a smooth version of Basis Pursuit De-Noising:
\[ \text{minimize } f(x) = \tfrac{1}{2}\|Ax - b\|^2 + \lambda\,\mathrm{smooth}(\|x\|_1, \tau) + \tfrac{\mu}{2}\|x\|^2, \]
where we set $A = \frac{1}{\sqrt{n}}\,\mathrm{randn}(m, n)$, $m = 1000$, $n = 3000$, $\lambda = 0.2$, $\tau = 0.001$, and $\mu = 0.01$. $x$ is a random sparse vector with 125 nonzeros and $b = Ax + \varepsilon$. We use the following estimate for $L$:
\[ \hat{L} = \left(1 + \sqrt{\frac{m}{n}}\right)^2 + \frac{\lambda}{\tau} + \mu \approx 202.50. \]
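The slides do not define $\mathrm{smooth}(\cdot, \tau)$; one common choice whose curvature is bounded by $1/\tau$, and hence consistent with the $\hat{L}$ estimate above, is the pseudo-Huber smoothing sketched below (an assumption on our part):

```python
import numpy as np

def smooth_l1(x, tau):
    """Pseudo-Huber smoothing of ||x||_1 with parameter tau (assumed, not from the slides).

    Its second derivative is bounded by 1/tau, matching the lambda/tau term in L_hat.
    """
    return np.sum(np.sqrt(x**2 + tau**2) - tau)

def smooth_l1_grad(x, tau):
    """Gradient of the pseudo-Huber smoothing above."""
    return x / np.sqrt(x**2 + tau**2)
```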
Figure: $\|x_k - x^*\|$
Figure: $f(x_k) - f^*$
Test 2: anisotropic bowl

The second test is
\[
\begin{aligned}
\text{minimize} \quad & f(x) = \sum_{i=1}^n i\, x_i^4 + \tfrac{1}{2}\|x\|^2 \\
\text{subject to} \quad & \|x\| \le \tau.
\end{aligned}
\]
We choose $n = 500$ and $\tau = 4$. $x_0$ is randomly chosen from the boundary $\{x : \|x\| = \tau\}$. For this problem, we have
\[ L = 12 n \tau^2 + 1 = 96001 \quad \text{and} \quad \mu = 1. \]
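A small sketch of the objective, its gradient, and a Euclidean projection one could use to enforce the ball constraint; the helper names are ours:

```python
import numpy as np

def bowl_value_grad(x):
    """Objective and gradient of f(x) = sum_i i*x_i^4 + 0.5*||x||^2."""
    i = np.arange(1, x.size + 1)
    f = np.sum(i * x**4) + 0.5 * np.dot(x, x)
    g = 4 * i * x**3 + x
    return f, g

def project_ball(x, tau):
    """Euclidean projection onto {x : ||x|| <= tau}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= tau else x * (tau / nrm)
```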
Figure: $\|x_k - x^*\|$
Figure: $f(x_k) - f^*$
Test 3: back to quadratic functions

Let's check the performance of the adaptive algorithm on quadratic functions:
\[ \text{minimize } f(x) = \tfrac{1}{2} x^T A x - b^T x. \]
We choose $A \sim \frac{1}{m} W_n(I_n, m)$, where $n = 4500$ and $m = 5000$. We use the following estimates for $L$ and $\mu$:
\[ \hat{L} = \left(1 + \sqrt{\frac{n}{m}}\right)^2, \qquad \hat{\mu} = \left(1 - \sqrt{\frac{n}{m}}\right)^2. \]
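A sketch of how such a test instance can be generated; the estimates are the Marchenko-Pastur bulk edges, which bracket the spectrum of $A$ for large $n$ and $m$ (helper name and seed are ours):

```python
import numpy as np

def wishart_quadratic(n=4500, m=5000, seed=0):
    """Sample A ~ (1/m) W_n(I_n, m) and a random right-hand side b."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n, m))
    A = G @ G.T / m                         # (1/m) * Wishart(I_n, m)
    b = rng.standard_normal(n)
    L_hat = (1 + np.sqrt(n / m)) ** 2       # upper Marchenko-Pastur edge
    mu_hat = (1 - np.sqrt(n / m)) ** 2      # lower Marchenko-Pastur edge
    return A, b, L_hat, mu_hat
```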
Figure: $\|x_k - x^*\|$
Figure: $f(x_k) - f^*$
Comparing with TFOCS (AT)

Figure: $\|x_k - x^*\|$
Figure: $f(x_k) - f^*$
Final thoughts

The convergence rate of Nesterov's method depends on the problem type; for quadratic problems, the speed can be doubled.
There is room to improve Nesterov's optimal gradient method on strongly convex functions.
Whether Nesterov's method can be improved universally, with a theoretical proof, is still an open question.