Liege University: Francqui Chair 2011-2012
Lecture 3: Huge-scale optimization problems
Yurii Nesterov, CORE/INMA (UCL)
March 9, 2012

Outline
1. Problem sizes
2. Random coordinate search
3. Confidence level of solutions
4. Sparse optimization problems
5. Sparse updates for linear operators
6. Fast updates in computational trees
7. Simple subgradient methods
8. Application examples

Nonlinear Optimization: problem sizes

  Class        Operations  Dimension      Iteration cost  Memory
  Small-size   All         10^0 - 10^2    n^4 -> n^3      Kilobyte: 10^3
  Medium-size  A^{-1}      10^3 - 10^4    n^3 -> n^2      Megabyte: 10^6
  Large-scale  A x         10^5 - 10^7    n^2 -> n        Gigabyte: 10^9
  Huge-scale   x + y       10^8 - 10^12   n -> log n      Terabyte: 10^12

Sources of huge-scale problems:
- Internet (new)
- Telecommunications (new)
- Finite-element schemes (old)
- Partial differential equations (old)

Very old optimization idea: Coordinate Search

Problem: min_{x ∈ R^n} f(x)   (f is convex and differentiable).

Coordinate relaxation algorithm. For k ≥ 0 iterate:
1. Choose an active coordinate i_k.
2. Update x_{k+1} = x_k − h_k ∇_{i_k} f(x_k) e_{i_k}, ensuring f(x_{k+1}) ≤ f(x_k).
(e_i is the i-th coordinate vector in R^n.)

Main advantage: very simple implementation.

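A minimal Python sketch of this scheme (the partial-derivative oracle grad_i, the constant L and the coordinate-choice rule are placeholders, not part of the lecture; the step size h_k = 1/L anticipates the next slide):

    import numpy as np

    def coordinate_relaxation(grad_i, x0, L, n_iters=1000, rule="cyclic", seed=0):
        """Coordinate relaxation: x_{k+1} = x_k - h_k * grad_i(x_k) * e_i with h_k = 1/L."""
        rng = np.random.default_rng(seed)
        x = np.array(x0, dtype=float)
        n = x.size
        for k in range(n_iters):
            i = k % n if rule == "cyclic" else rng.integers(n)   # strategies 1 and 2 below
            x[i] -= grad_i(x, i) / L                             # move only along coordinate i
        return x

    # example oracle: f(x) = 0.5*||x - b||^2 has grad_i(x, i) = x[i] - b[i] and L = 1
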
Possible strategies
1. Cyclic moves. (Difficult to analyze.)
2. Random choice of coordinate. (Why?)
3. Choose the coordinate with the maximal directional derivative.

Complexity estimate: assume ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖ for all x, y ∈ R^n, and choose h_k = 1/L. Then

f(x_k) − f(x_{k+1}) ≥ (1/(2L)) [∇_{i_k} f(x_k)]² ≥ (1/(2nL)) ‖∇f(x_k)‖² ≥ (1/(2nLR²)) (f(x_k) − f*)².

Hence, f(x_k) − f* ≤ 2nLR²/k, k ≥ 1. (For the gradient method, drop the factor n.)

This is the only known theoretical result for CDM!

Criticism

Theoretical justification: complexity bounds are not known for most of the schemes. The only justified scheme needs computation of the whole gradient. (Why not use the gradient method then?)

Computational complexity: fast differentiation: if the function is defined by a sequence of operations, then C(∇f) ≤ 4 C(f). Can we do anything without computing the function's values?

Result: CDM are almost out of computational practice.

Google problem

Let E ∈ R^{n×n} be the incidence matrix of a graph. Denote e = (1, ..., 1)^T and Ē = E · diag(E^T e)^{−1}. Thus, Ē^T e = e.

Our problem is as follows: find x* ≥ 0 such that Ē x* = x*.

Optimization formulation:

f(x) := (1/2) ‖Ēx − x‖² + (γ/2) [⟨e, x⟩ − 1]²  →  min_{x ∈ R^n}.

Huge-scale problems

Main features:
- The size is very big (n ≥ 10^7).
- The data is distributed in space.
- The requested parts of the data are not always available.
- The data is changing in time.

Consequences: the simplest operations become expensive or infeasible:
- Update of the full vector of variables.
- Matrix-vector multiplication.
- Computation of the objective function's value, etc.

Structure of the Google problem

Let us look at the gradient of the objective:

∇_i f(x) = ⟨a_i, g(x)⟩ + γ [⟨e, x⟩ − 1], i = 1, ..., n, where g(x) = Ēx − x ∈ R^n and Ē = (a_1, ..., a_n).

Main observations:
- The coordinate move x_+ = x − h_i ∇_i f(x) e_i needs O(p_i) a.o. (p_i is the number of nonzero elements in a_i).
- The diagonal elements d_i := (diag(Ē^T Ē + γ e e^T))_i = 1/p_i + γ are available. We can use them for choosing the step sizes (h_i = 1/d_i).

Reasonable coordinate choice strategy? Random!

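A possible sketch of this coordinate move in Python/scipy, maintaining the residual g(x) = Ēx − x and the sum s = ⟨e, x⟩ incrementally (the CSC storage of Ē and all names are illustrative assumptions, not part of the lecture):

    import numpy as np
    from scipy.sparse import csc_matrix

    def coord_step(E_bar, x, g, s, i, gamma, h_i):
        """One coordinate move for f(x) = 0.5*||E_bar x - x||^2 + 0.5*gamma*(<e,x> - 1)^2.
        The residual g = E_bar x - x and the sum s = <e, x> are updated in O(p_i) operations."""
        start, end = E_bar.indptr[i], E_bar.indptr[i + 1]       # sparse column a_i of E_bar
        idx, val = E_bar.indices[start:end], E_bar.data[start:end]
        grad_i = g[idx] @ val + gamma * (s - 1.0)               # <a_i, g> + gamma*(<e,x> - 1)
        delta = -h_i * grad_i
        x[i] += delta
        g[idx] += delta * val                                   # g += delta * a_i
        g[i] -= delta                                           # g -= delta * e_i
        return s + delta                                        # caller reassigns s

    # usage: E_bar stored as csc_matrix; initialize g = E_bar @ x - x, s = x.sum()
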
Random coordinate descent methods (RCDM)

min_{x ∈ R^N} f(x)   (f is convex and differentiable).

Main assumption: |∇_i f(x + h_i e_i) − ∇_i f(x)| ≤ L_i |h_i| for all h_i ∈ R, i = 1, ..., N, where e_i is a coordinate vector. Then

f(x + h_i e_i) ≤ f(x) + ∇_i f(x) h_i + (L_i/2) h_i², x ∈ R^N, h_i ∈ R.

Define the coordinate steps: T_i(x) := x − (1/L_i) ∇_i f(x) e_i. Then

f(x) − f(T_i(x)) ≥ (1/(2L_i)) [∇_i f(x)]², i = 1, ..., N.

Random coordinate choice

We need a special random counter R_α, α ∈ R:

Prob[i] = p_α^(i) = L_i^α [Σ_{j=1}^N L_j^α]^{−1}, i = 1, ..., N.

Note: R_0 generates the uniform distribution.

Method RCDM(α, x_0). For k ≥ 0 iterate:
1) Choose i_k = R_α.
2) Update x_{k+1} = T_{i_k}(x_k).

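A compact sketch of RCDM in Python (the coordinate-derivative oracle grad_i and the array of constants L are assumptions; np.random.Generator.choice plays the role of the counter R_α here, at O(N) cost per draw, while the O(log N) counter is described later):

    import numpy as np

    def rcdm(grad_i, L, x0, alpha=0.0, n_iters=10**5, seed=0):
        """RCDM(alpha, x0): pick i with Prob[i] = L_i^alpha / S_alpha, then apply T_i(x)."""
        rng = np.random.default_rng(seed)
        L = np.asarray(L, dtype=float)
        p = L**alpha / np.sum(L**alpha)          # the counter R_alpha
        x = np.array(x0, dtype=float)
        for _ in range(n_iters):
            i = rng.choice(L.size, p=p)
            x[i] -= grad_i(x, i) / L[i]          # coordinate step T_i(x)
        return x
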
Complexity bounds for RCDM

We need the following norms for x, g ∈ R^N:

‖x‖_α = [Σ_{i=1}^N L_i^α (x^(i))²]^{1/2},   ‖g‖*_α = [Σ_{i=1}^N (1/L_i^α) (g^(i))²]^{1/2}.

Denote S_α = Σ_{j=1}^N L_j^α. After k iterations, RCDM(α, x_0) generates a random output x_k, which depends on ξ_k = {i_0, ..., i_k}. Denote φ_k = E_{ξ_{k−1}} f(x_k).

Theorem. For any k ≥ 1 we have

φ_k − f* ≤ (2/k) S_α R²_{1−α}(x_0),

where R_β(x_0) = max_{x: f(x) ≤ f(x_0)} max_{x* ∈ X*} ‖x − x*‖_β.

Interpretation

1. α = 0. Then S_0 = N, and we get

φ_k − f* ≤ (2N/k) R²_1(x_0).

Note: we use the metric ‖x‖²_1 = Σ_{i=1}^N L_i (x^(i))². A matrix with diagonal {L_i}_{i=1}^N can have its norm in this metric equal to n. Hence, for the gradient method we can guarantee the same bound, but its cost of iteration is much higher!

Interpretation

2. α = 1/2. Denote

D_∞(x_0) = max_{x: f(x) ≤ f(x_0)} max_{y ∈ X*} max_{1≤i≤N} |x^(i) − y^(i)|.

Then R²_{1/2}(x_0) ≤ S_{1/2} D²_∞(x_0), and we obtain

φ_k − f* ≤ (2/k) [Σ_{i=1}^N L_i^{1/2}]² D²_∞(x_0).

Note: for first-order methods, the worst-case complexity of minimizing over a box depends on N. Since S_{1/2} can be bounded, RCDM can be applied in situations where the usual gradient method fails.

Interpretation

3. α = 1. Then R_0(x_0) is the size of the initial level set in the standard Euclidean norm. Hence,

φ_k − f* ≤ (2/k) [Σ_{i=1}^N L_i] R²_0(x_0) = (2N/k) [(1/N) Σ_{i=1}^N L_i] R²_0(x_0).

The rate of convergence of the gradient method can be estimated as

f(x_k) − f* ≤ (γ/k) R²_0(x_0), where γ satisfies the condition ∇²f(x) ⪯ γ I, x ∈ R^N.

Note: the maximal eigenvalue of a symmetric matrix can reach its trace. So, in the worst case, the rate of convergence of GM is the same as that of RCDM.

Minimizing strongly convex functions

Theorem. Let f be strongly convex with respect to ‖·‖_{1−α} with convexity parameter σ_{1−α} > 0. Then, for {x_k} generated by RCDM(α, x_0), we have

φ_k − f* ≤ (1 − σ_{1−α}/S_α)^k (f(x_0) − f*).

Proof: Let x_k be generated by RCDM after k iterations. Let us estimate the expected result of the next iteration:

f(x_k) − E_{i_k}[f(x_{k+1})] = Σ_{i=1}^N p_α^(i) [f(x_k) − f(T_i(x_k))]
  ≥ Σ_{i=1}^N (p_α^(i)/(2L_i)) [∇_i f(x_k)]²
  = (1/(2S_α)) (‖∇f(x_k)‖*_{1−α})²
  ≥ (σ_{1−α}/S_α) (f(x_k) − f*).

It remains to compute the expectation in ξ_{k−1}.

Confidence level of the answers

Note: we have proved that the expected values of the random f(x_k) are good. Can we guarantee anything after a single run?

Confidence level: probability β ∈ (0, 1) that some statement about the random output is correct.

Main tool: Chebyshev inequality (ξ ≥ 0): Prob[ξ ≥ T] ≤ E(ξ)/T.

Our situation: Prob[f(x_k) − f* ≥ ε] ≤ (1/ε) [φ_k − f*] ≤ 1 − β. Hence we need φ_k − f* ≤ ε (1 − β). Too expensive for β → 1?

Regularization technique

Consider f_μ(x) = f(x) + (μ/2) ‖x − x_0‖²_{1−α}. It is strongly convex. Therefore, we can obtain φ_k − f*_μ ≤ ε (1 − β) in O((S_α/μ) ln [1/(ε(1 − β))]) iterations.

Theorem. Define α = 1, μ = ε/(4 R²_0(x_0)), and choose

k ≥ 1 + (8 S_1 R²_0(x_0)/ε) [ln(2 S_1 R²_0(x_0)/ε) + ln(1/(1 − β))].

Let x_k be generated by RCDM(1, x_0) as applied to f_μ. Then Prob(f(x_k) − f* ≤ ε) ≥ β.

Note: for β = 1 − 10^{−p}, the last term is ln 10^p ≈ 2.3 p.

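A small helper that simply evaluates the formulas of the theorem for given values of S_1, R_0²(x_0), ε and β (all inputs are hypothetical, supplied by the user):

    import math

    def rcdm_confidence_bound(S1, R0_sq, eps, beta):
        """Return mu and an iteration count guaranteeing Prob(f(x_k) - f* <= eps) >= beta."""
        mu = eps / (4.0 * R0_sq)
        k = 1 + (8.0 * S1 * R0_sq / eps) * (math.log(2.0 * S1 * R0_sq / eps)
                                            + math.log(1.0 / (1.0 - beta)))
        return mu, math.ceil(k)

    # beta = 1 - 10**(-p) contributes only ln(10**p) ~ 2.3*p to the bracketed factor
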
Implementation details: random counter

Given the values L_i, i = 1, ..., N, generate efficiently a random i ∈ {1, ..., N} with probabilities Prob[i = k] = L_k / Σ_{j=1}^N L_j.

Solution:
a) Trivial: O(N) operations.
b) Assume N = 2^p. Define p + 1 vectors S_k ∈ R^{2^{p−k}}, k = 0, ..., p:

S_0^(i) = L_i, i = 1, ..., N;
S_k^(i) = S_{k−1}^(2i) + S_{k−1}^(2i−1), i = 1, ..., 2^{p−k}, k = 1, ..., p.

Algorithm: make the choice in p steps, from top to bottom. If element i of S_k is chosen, then choose in S_{k−1} either 2i or 2i − 1 with probabilities S_{k−1}^(2i)/S_k^(i) or S_{k−1}^(2i−1)/S_k^(i).

Difference: for N = 2^20 > 10^6 we need only p = log_2 N = 20 steps.

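A sketch of this counter in Python (0-based indexing, so the children of node i are 2i and 2i+1 rather than 2i and 2i−1; as on the slide, N is assumed to be a power of two):

    import numpy as np

    def build_counter(L):
        """Partial-sum tree: levels[0] = L (length N = 2^p), each next level sums pairs."""
        levels = [np.asarray(L, dtype=float)]
        while len(levels[-1]) > 1:
            prev = levels[-1]
            levels.append(prev[0::2] + prev[1::2])
        return levels                              # levels[-1][0] = sum of all L_i

    def sample(levels, rng=np.random.default_rng()):
        """Draw i with Prob[i] = L_i / sum_j L_j in O(log N) by descending the tree."""
        i = 0
        for level in reversed(levels[:-1]):        # from just below the root down to the leaves
            left = level[2 * i]
            total = level[2 * i] + level[2 * i + 1]
            i = 2 * i if rng.random() < left / total else 2 * i + 1
        return i

Building the tree costs O(N) once; afterwards each draw touches one root-to-leaf path, and a change of a single L_i can be propagated upwards along the same path in O(log N).
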
Sparse problems

Problem: min_{x ∈ Q} f(x), where Q is closed and convex in R^N, and f(x) = Ψ(Ax), where Ψ is a simple convex function:

Ψ(y_1) ≥ Ψ(y_2) + ⟨Ψ'(y_2), y_1 − y_2⟩, y_1, y_2 ∈ R^M,

and A: R^N → R^M is a sparse matrix.

Let p(x) := number of nonzeros in x. Sparsity coefficient: γ(A) := p(A)/(MN).

Example 1: matrix-vector multiplication. Computation of the vector Ax needs p(A) operations: the initial complexity MN is reduced by the factor γ(A).

Gradient Method

x_0 ∈ Q, x_{k+1} = π_Q(x_k − h f'(x_k)), k ≥ 0.

Main computational expenses:
- Projection onto a simple set Q needs O(N) operations.
- Displacement x_k → x_k − h f'(x_k) needs O(N) operations.
- f'(x) = A^T Ψ'(Ax): if Ψ is simple, then the main effort is spent on two matrix-vector multiplications: 2 p(A) operations.

Conclusion: compared with full matrices, we accelerate by the factor γ(A).

Note: for Large- and Huge-scale problems, we often have γ(A) ≈ 10^{−4} ... 10^{−6}. Can we get more?

Sparse updating strategy

Main idea: after the update x_+ = x + d we have y_+ := A x_+ = Ax + Ad = y + Ad. What happens if d is sparse?

Denote σ(d) = {j: d^(j) ≠ 0}. Then y_+ = y + Σ_{j ∈ σ(d)} d^(j) A e_j.

Its complexity, κ_A(d) := Σ_{j ∈ σ(d)} p(A e_j), can be VERY small:

κ_A(d) = M Σ_{j ∈ σ(d)} γ(A e_j) = γ(d) [(1/p(d)) Σ_{j ∈ σ(d)} γ(A e_j)] MN ≤ γ(d) max_{1≤j≤N} γ(A e_j) · MN.

If γ(d) ≤ c γ(A) and γ(A e_j) ≤ c γ(A), then κ_A(d) ≤ c² γ²(A) MN.

Expected acceleration: (10^{−6})² = 10^{−12}, i.e. 1 sec instead of 32 000 years.

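A Python/scipy sketch of this update (the CSC storage is an assumption; the slide only requires fast access to the columns A e_j):

    import numpy as np
    from scipy.sparse import csc_matrix

    def sparse_update(A_csc, y, d_idx, d_vals):
        """Update y = A x after the step x_+ = x + d, where d has nonzeros d_vals at d_idx.
        Cost: sum over j in sigma(d) of p(A e_j), instead of a full product A x_+."""
        for j, v in zip(d_idx, d_vals):
            start, end = A_csc.indptr[j], A_csc.indptr[j + 1]
            y[A_csc.indices[start:end]] += v * A_csc.data[start:end]   # y += d^(j) * A e_j
        return y

    # usage: A_csc = csc_matrix(A); y = A_csc @ x; then call sparse_update after each sparse step
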
When can it work?

Simple methods: no full-vector operations! (Is it possible?)

Simple problems: functions with sparse gradients.

Examples:
1. Quadratic function f(x) = (1/2)⟨Ax, x⟩ − ⟨b, x⟩. Its gradient f'(x) = Ax − b, x ∈ R^N, is not sparse even if A is sparse.
2. Piece-wise linear function g(x) = max_{1≤i≤m} [⟨a_i, x⟩ − b^(i)]. Its subgradient g'(x) = a_{i(x)}, with i(x) such that g(x) = ⟨a_{i(x)}, x⟩ − b^{(i(x))}, can be sparse if the a_i are sparse!

But: we need a fast procedure for updating max-operations.

Fast updates in short computational trees

Definition: a function f(x), x ∈ R^n, is short-tree representable if it can be computed by a short binary tree with height ≈ ln n.

Let n = 2^k and the tree have k + 1 levels: v_{0,i} = x^(i), i = 1, ..., n. The size of each next level is half the size of the previous one:

v_{i+1,j} = ψ_{i+1,j}(v_{i,2j−1}, v_{i,2j}), j = 1, ..., 2^{k−i−1}, i = 0, ..., k − 1,

where the ψ_{i,j} are some bivariate functions.

(Diagram: binary tree with leaves v_{0,1}, ..., v_{0,n}, intermediate levels v_{1,1}, ..., v_{1,n/2}, v_{2,1}, ..., v_{2,n/4}, ..., and root v_{k,1}.)

Main advantages

Important examples (symmetric functions):
- f(x) = ‖x‖_p, p ≥ 1, with ψ_{i,j}(t_1, t_2) = [|t_1|^p + |t_2|^p]^{1/p};
- f(x) = ln(Σ_{i=1}^n e^{x^(i)}), with ψ_{i,j}(t_1, t_2) = ln(e^{t_1} + e^{t_2});
- f(x) = max_{1≤i≤n} x^(i), with ψ_{i,j}(t_1, t_2) = max{t_1, t_2}.

The binary tree requires only n − 1 auxiliary cells. Computing its value needs n − 1 applications of ψ_{i,j}(·,·) (operations). If x_+ differs from x in one entry only, then re-computing f(x_+) needs only k = log_2 n operations.

Thus, we can have pure subgradient minimization schemes with sublinear iteration cost.

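A sketch of such a tree in Python (assuming n is a power of two and the same bivariate function ψ is used at every node, as in the symmetric examples above; the names are illustrative):

    import numpy as np

    def build_tree(x, psi):
        """Binary tree: levels[0] = x (n = 2^k leaves), parent = psi(left child, right child)."""
        levels = [np.asarray(x, dtype=float)]
        while len(levels[-1]) > 1:
            prev = levels[-1]
            levels.append(psi(prev[0::2], prev[1::2]))
        return levels                                 # levels[-1][0] is f(x)

    def update_entry(levels, i, value, psi):
        """Change x^(i) and recompute f(x) in log2(n) applications of psi (one path to the root)."""
        levels[0][i] = value
        for l in range(1, len(levels)):
            i //= 2
            levels[l][i] = psi(levels[l - 1][2 * i], levels[l - 1][2 * i + 1])
        return levels[-1][0]

For instance, build_tree(x, np.maximum) stores f(x) = max_i x^(i) at the root, and build_tree(x, np.logaddexp) gives the log-sum-exp example; update_entry then refreshes the value after a single-coordinate change at sublinear cost.
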
Simple subgradient methods

I. Problem: f* := min_{x ∈ Q} f(x), where Q is closed and convex, ‖f'(x)‖ ≤ L(f) for all x ∈ Q, and the optimal value f* is known.

Consider the following optimization scheme (B. Polyak, 1967):

x_0 ∈ Q,  x_{k+1} = π_Q( x_k − [(f(x_k) − f*)/‖f'(x_k)‖²] f'(x_k) ), k ≥ 0.

Denote f*_k = min_{0≤i≤k} f(x_i). Then for any k ≥ 0 we have:

f*_k − f* ≤ L(f) ‖x_0 − π_{X*}(x_0)‖ / (k + 1)^{1/2},   ‖x_k − x*‖ ≤ ‖x_0 − x*‖ for all x* ∈ X*.

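A direct Python sketch of Polyak's scheme (f, the subgradient oracle and the projection π_Q are placeholders supplied by the user):

    import numpy as np

    def polyak_subgradient(f, subgrad, proj_Q, x0, f_star, n_iters=10**4):
        """Polyak step: x_{k+1} = proj_Q(x_k - (f(x_k) - f*)/||g_k||^2 * g_k)."""
        x = proj_Q(np.array(x0, dtype=float))
        f_rec, x_rec = np.inf, x.copy()
        for _ in range(n_iters):
            fx, g = f(x), subgrad(x)
            if fx < f_rec:
                f_rec, x_rec = fx, x.copy()       # track the record value f*_k
            if fx <= f_star:
                break                             # already optimal
            x = proj_Q(x - (fx - f_star) / np.dot(g, g) * g)
        return x_rec, f_rec
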
Proof: Let us fix x* ∈ X*. Denote r_k(x*) = ‖x_k − x*‖. Then

r²_{k+1}(x*) ≤ ‖x_k − [(f(x_k) − f*)/‖f'(x_k)‖²] f'(x_k) − x*‖²
  = r²_k(x*) − 2 [(f(x_k) − f*)/‖f'(x_k)‖²] ⟨f'(x_k), x_k − x*⟩ + (f(x_k) − f*)²/‖f'(x_k)‖²
  ≤ r²_k(x*) − (f(x_k) − f*)²/‖f'(x_k)‖²
  ≤ r²_k(x*) − (f*_k − f*)²/L²(f).

From this reasoning, ‖x_{k+1} − x*‖² ≤ ‖x_k − x*‖² for all x* ∈ X*.

Corollary: Assume X* has a recession direction d*. Then

‖x_k − π_{X*}(x_0)‖ ≤ ‖x_0 − π_{X*}(x_0)‖,   ⟨d*, x_k⟩ ≥ ⟨d*, x_0⟩.

(Proof: consider x* = π_{X*}(x_0) + α d*, α ≥ 0.)

Constrained minimization (N. Shor (1964) & B. Polyak)

II. Problem: min {f(x): g(x) ≤ 0, x ∈ Q}, where Q is closed and convex and f, g have uniformly bounded subgradients.

Consider the following method with step-size parameter h > 0:

If g(x_k) > h ‖g'(x_k)‖, then (A): x_{k+1} = π_Q( x_k − [g(x_k)/‖g'(x_k)‖²] g'(x_k) ),
else (B): x_{k+1} = π_Q( x_k − [h/‖f'(x_k)‖] f'(x_k) ).

Let F_k ⊆ {0, ..., k} be the set of (B)-iterations, and f*_k = min_{i ∈ F_k} f(x_i).

Theorem: If k > ‖x_0 − x*‖²/h², then F_k ≠ ∅ and

f*_k − f* ≤ h L(f),   max_{i ∈ F_k} g(x_i) ≤ h L(g).

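A sketch of the switching scheme (A)/(B) under the same placeholder oracles (f, g, their subgradients and π_Q are user-supplied):

    import numpy as np

    def shor_polyak(f, fsub, g, gsub, proj_Q, x0, h, n_iters=10**4):
        """Step (A) on g when g(x_k) > h*||g'(x_k)||, otherwise step (B) on f with length h."""
        x = proj_Q(np.array(x0, dtype=float))
        f_rec, x_rec = np.inf, x.copy()            # record over the (B)-iterations F_k
        for _ in range(n_iters):
            gx, gs = g(x), gsub(x)
            if gx > h * np.linalg.norm(gs):        # (A): reduce infeasibility
                x = proj_Q(x - gx / np.dot(gs, gs) * gs)
            else:                                  # (B): objective step of normalized length h
                fx, fs = f(x), fsub(x)
                if fx < f_rec:
                    f_rec, x_rec = fx, x.copy()
                x = proj_Q(x - h / np.linalg.norm(fs) * fs)
        return x_rec, f_rec
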
Computational strategies

1. The constants L(f), L(g) are known (e.g., Linear Programming). We can take h = ε / max{L(f), L(g)}; then we only need to decide on the number of steps N (easy!). Note: the standard advice is h = R/(N + 1)^{1/2}, which requires fixing N in advance (much more difficult!).

2. The constants L(f), L(g) are not known. Start from a guess. Restart from scratch each time we see that the guess is wrong. The guess is doubled after each restart.

3. Tracking the record value f*_k. Double run. Other ideas are welcome!

Application examples

Observations:
1. Very often, Large- and Huge-scale problems have repetitive sparsity patterns and/or limited connectivity:
 - social networks;
 - mobile phone networks;
 - truss topology design (local bars);
 - finite-element models (2D: four neighbors, 3D: six neighbors).
2. For p-diagonal matrices, κ(A) ≤ p².

Nonsmooth formulation of the Google problem

Main property of the spectral radius (A ≥ 0): if A ∈ R^{n×n}_+, then

ρ(A) = min_{x > 0} max_{1≤i≤n} (1/x^(i)) ⟨e_i, Ax⟩.

The minimum is attained at the corresponding eigenvector.

Since ρ(Ē) = 1, our problem is as follows:

f(x) := max_{1≤i≤N} [⟨e_i, Ēx⟩ − x^(i)]  →  min_{x ≥ 0}.

Interpretation: maximizing the self-esteem!

Since f* = 0, we can apply Polyak's method with sparse updates.

Additional features: the optimal set X* is a convex cone. If x_0 = e, then the whole sequence is separated from zero:

⟨x*, e⟩ ≤ ⟨x*, x_k⟩ ≤ ‖x*‖_1 ‖x_k‖_∞ = ⟨x*, e⟩ ‖x_k‖_∞, hence ‖x_k‖_∞ ≥ 1.

Goal: find x̄ ≥ 0 such that ‖x̄‖_∞ ≥ 1 and f(x̄) ≤ ε. (The first condition is satisfied automatically.)

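For illustration, a non-incremental Python sketch of this objective and one of its subgradients (a real huge-scale implementation would maintain Ēx and the max in a computational tree, as described earlier; the scipy CSR storage and the function name are assumptions):

    import numpy as np
    from scipy.sparse import csr_matrix

    def google_f_and_subgrad(E_bar_csr, x):
        """f(x) = max_i [(E_bar x)_i - x^(i)]; a subgradient is (row i* of E_bar) - e_{i*},
        which is sparse whenever node i* has few incoming links."""
        r = E_bar_csr.dot(x) - x                 # recomputed from scratch here, for clarity
        i_star = int(np.argmax(r))
        sub = E_bar_csr.getrow(i_star).toarray().ravel()
        sub[i_star] -= 1.0
        return r[i_star], sub

    # usage: E_bar_csr = csr_matrix(E_bar); plugging f and this subgradient into Polyak's
    # method with f* = 0 and x_0 = e gives the minimization scheme described above.
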
Computational experiments: iteration cost

We compare Polyak's GM with sparse update (GM_s) with the standard one (GM).

Setup: each agent has exactly p random friends. Thus, κ(A) ≈ p².

Iteration cost: GM_s: p² log_2 N;  GM: pN.

Time for 10^4 iterations (p = 32):

  N      κ(A)   GM_s   GM
  1024   1632   3.00   2.98
  2048   1792   3.36   6.41
  4096   1888   3.75   15.11
  8192   1920   4.20   139.92
  16384  1824   4.69   408.38

Time for 10^3 iterations (p = 16):

  N        κ(A)  GM_s   GM
  131072   576   0.19   213.9
  262144   592   0.25   477.8
  524288   592   0.32   1095.5
  1048576  608   0.40   2590.8

1 sec vs 100 min!