Limited Memory Kelley's Method Converges for Composite Convex and Submodular Objectives
Madeleine Udell, Operations Research and Information Engineering, Cornell University
Based on joint work with Song Zhou (Cornell) and Swati Gupta (Georgia Tech)
Rutgers, 11/9/2018 1 / 33
My old 2013 MacBook Pro: 16 GB RAM 2 / 33
Gonna buy a new model with more RAM... nope! RAM in the 13-inch MacBook Pro: 16 GB. 3 / 33
Ok, so RAM isn't smaller. Is it cheaper? Nope! 4 / 33
RIP Moore's Law for RAM (circa 2013)
[figure: cost of RAM in $/MB (log scale) vs. year, 1960-2010] 5 / 33
Why low memory convex optimization?
- low memory: Moore's law is running out, and low memory algorithms are often fast
- convex optimization: robust convergence, elegant analysis 6 / 33
Memory plateau and the conditional gradient method
- (Frank & Wolfe 1956) An algorithm for quadratic programming
- (Levitin & Poljak 1966) Conditional gradient method
- (Clarkson 2010) Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm
- (Jaggi 2013) Revisiting Frank-Wolfe: projection-free sparse convex optimization
[figure: cost of RAM in $/MB (log scale) vs. year, 1960-2010] 7 / 33
Example: smooth minimization over the ℓ1 ball
For g: R^n → R smooth and α ∈ R, find an iterative method to solve
    minimize g(w) subject to ‖w‖₁ ≤ α
What kinds of subproblems are easy?
- projection is complicated (Duchi et al. 2008)
- linear optimization is easy: −α·sign(x_i)·e_i = argmin { x⊤w : ‖w‖₁ ≤ α }, where i = indmax(|x|) 8 / 33
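As an illustrative sketch (not taken from the slides), the linear minimization oracle over the ℓ1 ball takes only a few lines; the name lmo_l1_ball and the convention of minimizing x⊤w are assumptions made here.

```python
import numpy as np

def lmo_l1_ball(x, alpha):
    """Linear minimization oracle: argmin of <x, w> over ||w||_1 <= alpha.

    The minimizer puts all of its mass on the coordinate of x with the
    largest magnitude, with the opposite sign.
    """
    i = np.argmax(np.abs(x))            # coordinate with the largest |x_i|
    w = np.zeros_like(x, dtype=float)
    w[i] = -alpha * np.sign(x[i])       # move against that component of x
    return w
```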
Conditional gradient method (Frank-Wolfe)
    minimize g(w) subject to w ∈ P
[figure: polytope P with vertices v1, ..., v5, iterates w(0), w(1), and gradients ∇g(w(0)), ∇g(w(1))] 9 / 33
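A minimal sketch of the conditional gradient method under stated assumptions: grad_g and lmo are caller-supplied oracles (names chosen here, not from the slides), and the step size is the standard 2/(k+2) schedule.

```python
import numpy as np

def conditional_gradient(grad_g, lmo, w0, num_iters=100):
    """Conditional gradient (Frank-Wolfe) for minimize g(w) subject to w in P.

    grad_g: returns the gradient of g at w
    lmo:    returns argmin of <x, v> over v in P for a given direction x
    """
    w = np.asarray(w0, dtype=float)
    for k in range(num_iters):
        v = lmo(grad_g(w))                  # linear optimization over P
        gamma = 2.0 / (k + 2.0)             # open-loop step size
        w = (1.0 - gamma) * w + gamma * v   # convex combination stays in P
    return w
```

For the ℓ1-ball example above, setting lmo = lambda x: lmo_l1_ball(x, alpha) makes this a projection-free method for the constrained problem.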
What's wrong with CGM?
- slow
- the complexity of the iterate grows with the number of iterations 10 / 33
Outline: Limited Memory Kelley's Method 11 / 33
Submodularity primer
- ground set V = {1, ..., n}; identify subsets of V with Boolean vectors in {0, 1}^n
- F: {0, 1}^n → R is submodular if F(A ∪ {v}) − F(A) ≥ F(B ∪ {v}) − F(B) for all A ⊆ B and v ∈ V \ B
- linear functions are (sub)modular: for w ∈ R^n, define w(A) = Σ_{i ∈ A} w_i 12 / 33
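As a small illustration (not from the slides), the diminishing-returns inequality can be checked by brute force for a tiny set function; here F(A) = min(|A|, 1), which also appears in the table of examples a couple of slides below.

```python
from itertools import combinations

def F(A):
    """Example submodular function: F(A) = min(|A|, 1)."""
    return min(len(A), 1)

def is_submodular(F, V):
    """Brute-force check: F(A u {v}) - F(A) >= F(B u {v}) - F(B) for all A subset of B, v not in B."""
    subsets = [set(c) for r in range(len(V) + 1) for c in combinations(V, r)]
    for A in subsets:
        for B in subsets:
            if A <= B:
                for v in V - B:
                    if F(A | {v}) - F(A) < F(B | {v}) - F(B):
                        return False
    return True

print(is_submodular(F, {1, 2, 3}))   # True
```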
Submodular function: example 13 / 33
Submodular polyhedra
For a submodular function F, define
- the submodular polyhedron P(F) = {w ∈ R^n : w(A) ≤ F(A) for all A ⊆ V}
- the base polytope B(F) = {w ∈ P(F) : w(V) = F(V)}
Both (generically) have exponentially many facets! 14 / 33
Lovász extension
Define the (piecewise-linear) Lovász extension as
    f(x) = max_{w ∈ B(F)} w⊤x = sup { w⊤x : w(A) ≤ F(A) for all A ⊆ V }
The Lovász extension is the convex envelope of F.
Examples:
    F(A)                               f(x)                         f(|x|)
    |A|                                1⊤x                          ‖x‖₁
    min(|A|, 1)                        max_i x_i                    ‖x‖_∞
    Σ_{i=1}^j min(|A ∩ S_i|, 1)        Σ_{i=1}^j max(x_{S_i})       Σ_{i=1}^j ‖x_{S_i}‖_∞ 15 / 33
Linear optimization on B(F) is easy
- the Lovász extension f(x) = max_{w ∈ B(F)} x⊤w and the indicator 1_{B(F)} are Fenchel duals: f(x) = 1_{B(F)}*(x)
- linear optimization over B(F) takes O(n log n) (Edmonds 1970): choose a permutation π so that x_{π_1} ≥ ... ≥ x_{π_n}; then
    max_{w ∈ B(F)} x⊤w = Σ_{k=1}^n x_{π_k} [F({π_1, π_2, ..., π_k}) − F({π_1, π_2, ..., π_{k−1}})]
- computing subgradients of f requires only O(n log n) too: ∂f(x) = argmax_{w ∈ B(F)} x⊤w 16 / 33
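A sketch of Edmonds' greedy algorithm, under the assumptions that F is given as a Python callable on sets with F(∅) = 0 and that the name greedy_lmo is ours; the returned vector maximizes x⊤w over B(F) and is a subgradient of the Lovász extension at x, whose value is x⊤w.

```python
import numpy as np

def greedy_lmo(F, x):
    """Edmonds' greedy algorithm: argmax of <x, w> over w in B(F), O(n log n) sort time.

    F: set function on subsets of {0, ..., n-1}, with F(empty set) = 0
    Returns w in B(F); it is also a subgradient of the Lovasz extension f at x,
    and f(x) = <x, w>.
    """
    n = len(x)
    order = np.argsort(-np.asarray(x))     # sort so x[order[0]] >= ... >= x[order[-1]]
    w = np.zeros(n)
    prefix, prev = set(), 0.0
    for k in order:
        prefix.add(int(k))
        val = F(prefix)
        w[k] = val - prev                  # marginal gain of adding element k
        prev = val
    return w

# example: for F(A) = min(|A|, 1), the Lovasz extension is max_i x_i on x >= 0
x = np.array([0.2, 0.9, 0.5])
w = greedy_lmo(lambda A: min(len(A), 1), x)
print(w, x @ w)                            # f(x) = 0.9
```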
Primal problem
    minimize g(x) + f(x)    (P)
- g: R^n → R strongly convex
- f: R^n → R the Lovász extension of a submodular F: piecewise linear, homogeneous, with (generically) exponentially many pieces, but subgradients are easy: O(n log n) 17 / 33
Primal problem: example (Source: http://mistis.inrialpes.fr/learninria/slides/bach.pdf) 18 / 33
Original Simplicial Method (OSM) (Bach 2013)
Algorithm 1: OSM (to minimize g(x) + f(x))
initialize V; repeat:
  1. define f̂(x) = max_{w ∈ V} w⊤x
  2. solve the subproblem x ∈ argmin g(x) + f̂(x)
  3. compute v ∈ ∂f(x): v = argmax_{w ∈ B(F)} x⊤w
  4. V ← V ∪ {v}
Problem: V keeps growing! No known rate of convergence (Bach 2013). 19 / 33
Limited Memory Kelley's Method (LM-KM)
Algorithm 2: LM-KM (to minimize g(x) + f(x))
initialize V; repeat:
  1. define f̂(x) = max_{w ∈ V} w⊤x
  2. solve the subproblem x ∈ argmin g(x) + f̂(x)
  3. compute v ∈ ∂f(x): v = argmax_{w ∈ B(F)} x⊤w
  4. V ← {w ∈ V : w⊤x = f̂(x)} ∪ {v}
Does it converge or cycle? How large could V grow? 20 / 33
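A rough sketch of LM-KM under simplifying assumptions, not the authors' implementation: g(x) = ½‖x − b‖² so the nonsmooth subproblem can be handed to a generic convex solver (cvxpy here), and greedy_lmo is the oracle sketched earlier. Dropping the pruning in step 4 (keeping all of V) would give OSM instead.

```python
import numpy as np
import cvxpy as cp

def lm_km(F, b, n, num_iters=20, tol=1e-9):
    """Sketch of LM-KM for minimize g(x) + f(x), with g(x) = 0.5*||x - b||^2.

    F: submodular set function with F(empty set) = 0; f is its Lovasz extension.
    Keeps a limited set V of vertices of B(F) (at most n + 1 in theory).
    """
    V = [greedy_lmo(F, np.zeros(n))]              # start with one vertex of B(F)
    for _ in range(num_iters):
        W = np.array(V)                           # current cutting planes, shape (|V|, n)
        x = cp.Variable(n)
        fhat = cp.max(W @ x)                      # piecewise-linear model of f
        cp.Problem(cp.Minimize(0.5 * cp.sum_squares(x - b) + fhat)).solve()
        xk = x.value
        # keep only planes active at xk, then add a new subgradient of f at xk
        active = [w for w in V if abs(w @ xk - np.max(W @ xk)) <= tol]
        V = active + [greedy_lmo(F, xk)]
    return xk, V
```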
LM-KM: intuition
[figure: g + f and the successive piecewise-linear models g + f̂^(1), ..., g + f̂^(4), with subproblem solutions x^(1), ..., x^(4) and contact points z^(0), ..., z^(3)] 21 / 33
L-KM converges linearly with bounded memory
Theorem (Zhou, Gupta, Udell 2018)
- L-KM has bounded memory: |V| ≤ n + 1
- L-KM converges when g is strongly convex
- L-KM converges linearly when g is smooth and strongly convex
(Corollary: OSM converges linearly, too.) 22 / 33
Dual problem
    minimize g*(−w) subject to w ∈ B(F)    (D)
- g*: R^n → R smooth (the conjugate of the strongly convex g)
- B(F): base polytope of the submodular F, with (generically) exponentially many facets, but linear optimization over B(F) is easy: O(n log n) 23 / 33
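A one-line sketch of where (D) comes from (our addition, using the representation f(x) = max_{w ∈ B(F)} w⊤x from the earlier slide and swapping min and max, justified here by strong duality):
    min_x g(x) + f(x) = min_x max_{w ∈ B(F)} g(x) + w⊤x = max_{w ∈ B(F)} min_x g(x) + w⊤x = max_{w ∈ B(F)} −g*(−w),
so minimizing g*(−w) over B(F) is equivalent to (P). The same computation with B(F) replaced by Conv(V) is what links the FCFW subproblem to the LM-KM subproblem later in the deck.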
Conditional gradient methods for the dual
- linear optimization over the constraint set is easy, so use a conditional gradient method!
- away-step FW, pairwise FW, and fully corrective FW (FCFW) all converge linearly (Lacoste-Julien & Jaggi 2015)
- FCFW has limited memory
- (Garber & Hazan 2015) gives linear convergence with one gradient + one linear optimization per iteration 24 / 33
Dual to primal
- suppose g is α-strongly convex and β-smooth
- solve a dual subproblem inexactly to obtain ŷ ∈ B(F) with g*(−ŷ) ≤ g*(−y*) + ε
- g* is (1/β)-strongly convex, so ‖ŷ − y*‖² ≤ 2βε
- define x̂ = ∇g*(−ŷ) = argmin_x g(x) + ŷ⊤x
- since g* is (1/α)-smooth, ‖x̂ − x*‖² ≤ (1/α²) ‖ŷ − y*‖² ≤ 2βε/α²
- if the dual iterates converge linearly, so do the primal iterates 25 / 33
Fully corrective Frank-Wolfe
    minimize g(w) subject to w ∈ P
[figure: polytope P with vertices v1, ..., v5, iterates w(0), w(1), w(2), and gradients ∇g(w(0)), ∇g(w(1))] 26 / 33
Limited-memory Fully Corrective Frank-Wolfe (L-FCFW)
Algorithm 3: L-FCFW (to minimize g*(−y) over y ∈ B(F))
initialize V; repeat:
  1. solve the subproblem: minimize g*(−y) subject to y ∈ Conv(V); write the solution as y = Σ_{w ∈ V} λ_w w with λ_w > 0 and Σ_{w ∈ V} λ_w = 1
  2. compute the gradient map x = ∇g*(−y)
  3. solve the linear optimization v = argmax_{w ∈ B(F)} x⊤w
  4. V ← {w ∈ V : λ_w > 0} ∪ {v} 27 / 33
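A sketch of L-FCFW under the same illustrative assumptions as the LM-KM sketch above (not the authors' code): g(x) = ½‖x − b‖², so g*(−y) = ½‖y‖² − b⊤y and the primal map is x = ∇g*(−y) = b − y; the subproblem over Conv(V) is then a small QP in λ.

```python
import numpy as np
import cvxpy as cp

def l_fcfw(F, b, n, num_iters=20, tol=1e-9):
    """Sketch of L-FCFW on the dual: minimize g*(-y) over y in B(F),
    specialized to g(x) = 0.5*||x - b||^2."""
    V = [greedy_lmo(F, b)]                        # start with one vertex of B(F)
    for _ in range(num_iters):
        W = np.array(V)                           # shape (|V|, n)
        lam = cp.Variable(len(V), nonneg=True)
        y = W.T @ lam                             # point in Conv(V)
        cp.Problem(cp.Minimize(0.5 * cp.sum_squares(y) - b @ y),
                   [cp.sum(lam) == 1]).solve()
        y_val = W.T @ lam.value
        x = b - y_val                             # primal iterate x = grad g*(-y)
        v = greedy_lmo(F, x)                      # linear optimization over B(F)
        V = [w for w, l in zip(V, lam.value) if l > tol] + [v]
    return y_val, x, V
```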
Fully corrective Frank-Wolfe (FCFW): properties
- bounded memory: by Carathéodory, we can choose λ_w so that |{w ∈ V : λ_w > 0}| ≤ n + 1
- finite convergence: the active set changes at each iteration, and a vertex that exits the active set is never added again
- converges linearly for smooth, strongly convex objectives
- useful if linear optimization over B(F) is hard (so the convex subproblem is comparatively cheap)
- ok to solve the subproblem inexactly (Lacoste-Julien & Jaggi 2015)
- compare to vanilla CGM: the memory cost is similar (w ∈ R^n vs. n (very simple) vertices), and solving subproblems is not much harder than evaluating g 28 / 33
Dual subproblems
The FCFW subproblem
    maximize −g*(−y) subject to y = Σ_{w ∈ V} λ_w w, 1⊤λ = 1, λ ≥ 0
has dual
    minimize g(x) + max_{w ∈ V} x⊤w
which is our primal subproblem! First-order optimality conditions show the active sets match:
    λ_w > 0 ⟹ w⊤x = max_{w ∈ V} x⊤w
Hence FCFW has a corresponding primal algorithm: LM-KM! 29 / 33
LM-KM: numerical experiment
- g(x) = x⊤Ax + b⊤x + n‖x‖₂² for x ∈ R^n
- f is the Lovász extension of F(A) = |A|(2n − |A| + 1)/2
- entries of A ∈ R^{n×n} sampled uniformly from [−1, 1]
- entries of b ∈ R^n sampled uniformly from [0, n] 30 / 33
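A sketch of how this test problem could be generated (our reading of the slide, not the authors' code); the division by 2 in F and the squared norm in g are reconstructions, as noted above.

```python
import numpy as np

n = 100
rng = np.random.default_rng(0)
A = rng.uniform(-1.0, 1.0, size=(n, n))      # entries of A uniform on [-1, 1]
b = rng.uniform(0.0, n, size=n)              # entries of b uniform on [0, n]

def g(x):
    """x'Ax + b'x + n*||x||_2^2; the n*||x||^2 term makes g strongly convex."""
    return x @ A @ x + b @ x + n * float(x @ x)

def F(S):
    """Submodular set function F(S) = |S| (2n - |S| + 1) / 2."""
    k = len(S)
    return k * (2 * n - k + 1) / 2.0
```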
LM-KM: numerical experiment
[figure: convergence plots; dimension n = 10 in the upper left panel, n = 100 in the others] 31 / 33
Conclusion
LM-KM gives a new algorithm for composite convex and submodular optimization with bounded (O(n)) storage, built from two old ideas:
- duality
- Carathéodory 32 / 33
References
S. Zhou, S. Gupta, and M. Udell. Limited Memory Kelley's Method Converges for Composite Convex and Submodular Objectives. NIPS 2018. 33 / 33