Limited Memory Kelley's Method Converges for Composite Convex and Submodular Objectives

Limited Memory Kelley's Method Converges for Composite Convex and Submodular Objectives. Madeleine Udell, Operations Research and Information Engineering, Cornell University. Based on joint work with Song Zhou (Cornell) and Swati Gupta (Georgia Tech). Rutgers, 11/9/2018. 1 / 33

My old 2013 MacBook Pro: 16 GB RAM 2 / 33

Gonna buy a new model with more RAM... 3 / 33

Gonna buy a new model with more RAM... nope! RAM in the 13-inch MacBook Pro: 16 GB. 3 / 33

Ok, so RAM isn't smaller. Is it cheaper? 4 / 33

Ok, so RAM isn't smaller. Is it cheaper? Nope! 4 / 33

RIP Moore's Law for RAM (circa 2013). [plot: cost of RAM ($/MB, log scale) vs. year, 1960-2010] 5 / 33

Why low-memory convex optimization? Low memory: Moore's law is running out, and low-memory algorithms are often fast. Convex optimization: robust convergence and elegant analysis. 6 / 33

Memory plateau and conditional gradient method: (Frank & Wolfe 1956) An algorithm for quadratic programming; (Levitin & Poljak 1966) Conditional gradient method; (Clarkson 2010) Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm; (Jaggi 2013) Revisiting Frank-Wolfe: projection-free sparse convex optimization. [plot: cost of RAM ($/MB, log scale) vs. year, 1960-2010] 7 / 33

Example: smooth minimization over the ℓ₁ ball. For g : Rⁿ → R smooth and α ∈ R, find an iterative method to solve
minimize g(w) subject to ‖w‖₁ ≤ α.
What kinds of subproblems are easy? Projection onto the ℓ₁ ball is complicated (Duchi et al. 2008), but linear optimization is easy:
±α e_i ∈ argmin { x^⊤w : ‖w‖₁ ≤ α }, where i = indmax(|x|). 8 / 33
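To make the "linear optimization is easy" point concrete, here is a minimal NumPy sketch of the linear minimization oracle over the ℓ₁ ball. The function name and example values are mine, not from the slides.

```python
import numpy as np

def l1_ball_lmo(x, alpha):
    """Return a minimizer of x^T w over {w : ||w||_1 <= alpha}.

    The minimum of a linear function over the l1 ball is attained at a
    signed, scaled coordinate vector: w = -alpha * sign(x_i) * e_i,
    where i is the coordinate of x with largest magnitude.
    """
    i = np.argmax(np.abs(x))               # i = indmax(|x|)
    w = np.zeros_like(x, dtype=float)
    w[i] = -alpha * np.sign(x[i])
    return w

x = np.array([0.3, -2.0, 1.1])
print(l1_ball_lmo(x, alpha=1.0))           # [0. 1. 0.]: all mass on coordinate 1
```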

Conditional gradient method (Frank-Wolfe): minimize g(w) subject to w ∈ P. [figure: polytope P with vertices v1,...,v5, iterates w(0) and w(1), and gradient directions ∇g(w(0)), ∇g(w(1))] 9 / 33
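The figure itself is not recoverable from the transcript, but the iteration it illustrates is short. Below is a generic conditional-gradient sketch, assuming access to a gradient oracle and a linear minimization oracle over P; the names, step-size schedule, and toy example are mine.

```python
import numpy as np

def conditional_gradient(grad_g, lmo, w0, iters=200):
    """Frank-Wolfe / conditional gradient sketch: min g(w) over a polytope P.

    grad_g : callable, gradient of g
    lmo    : callable, returns argmin_{v in P} <c, v> for a cost vector c
    w0     : a feasible starting point in P
    """
    w = np.asarray(w0, dtype=float)
    for k in range(iters):
        v = lmo(grad_g(w))                  # call the linear minimization oracle
        gamma = 2.0 / (k + 2.0)             # standard open-loop step size
        w = (1.0 - gamma) * w + gamma * v   # convex combination stays feasible
    return w

# Toy example: minimize 0.5*||w - c||^2 over the l1 ball of radius 1.
def l1_lmo(x, alpha=1.0):
    w = np.zeros_like(x, dtype=float)
    i = np.argmax(np.abs(x))
    w[i] = -alpha * np.sign(x[i])
    return w

c = np.array([0.8, -0.1, 0.4])
w_hat = conditional_gradient(lambda w: w - c, l1_lmo, np.zeros(3))
```

Note that after k iterations the iterate is a convex combination of up to k + 1 vertices, so its representation as a sum of atoms grows; this is the complexity issue the talk raises next.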

What's wrong with CGM? It is slow, and the complexity of the iterate grows with the number of iterations. 10 / 33

Outline: Limited Memory Kelley's Method 11 / 33

Submodularity primer: ground set V = {1,..., n}; identify subsets of V with Boolean vectors in {0, 1}ⁿ. F : {0, 1}ⁿ → R is submodular if
F(A ∪ {v}) − F(A) ≥ F(B ∪ {v}) − F(B) for all A ⊆ B and v ∈ V \ B (diminishing returns).
Linear functions are (sub)modular: for w ∈ Rⁿ, define w(A) = Σ_{i ∈ A} w_i. 12 / 33
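As a quick illustration of the diminishing-returns definition, one can check the inequality exhaustively on a tiny ground set. The coverage function below is my own example, not one from the slides.

```python
from itertools import combinations

# Illustrative coverage function: F(A) = size of the union of the sets S[i], i in A.
S = {1: {0, 1}, 2: {1, 2}, 3: {2, 3, 4}}
V = list(S)

def F(A):
    return len(set().union(*(S[i] for i in A))) if A else 0

def subsets(X):
    X = sorted(X)
    return (set(c) for r in range(len(X) + 1) for c in combinations(X, r))

def is_submodular(F, V):
    # Check F(A + v) - F(A) >= F(B + v) - F(B) for all A ⊆ B ⊆ V and v not in B.
    return all(F(A | {v}) - F(A) >= F(B | {v}) - F(B)
               for B in subsets(V) for A in subsets(B) for v in set(V) - B)

print(is_submodular(F, V))   # True: coverage has diminishing returns
```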

Submodular function: example 13 / 33

Submodular polyhedra: for a submodular function F, define the submodular polyhedron P(F) = {w ∈ Rⁿ : w(A) ≤ F(A) for all A ⊆ V} and the base polytope B(F) = {w ∈ P(F) : w(V) = F(V)}. Both (generically) have exponentially many facets! 14 / 33

Lovász extension: define the (piecewise-linear) Lovász extension as
f(x) = max_{w ∈ B(F)} w^⊤x = sup { w^⊤x : w(A) ≤ F(A) for all A ⊆ V }.
The Lovász extension is the convex envelope of F. Examples:
F(A) = |A|: f(x) = 1^⊤x, f(|x|) = ‖x‖₁
F(A) = min(|A|, 1): f(x) = max(x), f(|x|) = ‖x‖_∞
F(A) = Σ_{i=1}^j min(|A ∩ S_i|, 1): f(x) = Σ_{i=1}^j max(x_{S_i}), f(|x|) = Σ_{i=1}^j ‖x_{S_i}‖_∞
15 / 33

Linear optimization on B(F) is easy. Recall the (piecewise-linear) Lovász extension f(x) = max_{w ∈ B(F)} x^⊤w; f and the indicator 1_{B(F)} are Fenchel duals: f(x) = (1_{B(F)})*(x).
Linear optimization over B(F) is O(n log n) (Edmonds 1970): define a permutation π so that x_{π₁} ≥ ... ≥ x_{πₙ}; then
max_{w ∈ B(F)} x^⊤w = Σ_{k=1}^n x_{πₖ} [ F({π₁, π₂,..., πₖ}) − F({π₁, π₂,..., π_{k−1}}) ].
Computing subgradients of f requires O(n log n) too: ∂f(x) = argmax_{w ∈ B(F)} x^⊤w. 16 / 33
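The greedy formula above translates directly into code: sort the coordinates of x in decreasing order and assign each coordinate its marginal gain. This is a minimal sketch; the function name and the cardinality-based example F are mine.

```python
import numpy as np

def greedy_base_polytope(x, F):
    """Edmonds' greedy algorithm: return w in argmax { x^T w : w in B(F) }.

    F maps a Python set to a real number, with F(empty set) = 0.
    The returned w is also a subgradient of the Lovász extension f at x,
    and x @ w equals f(x).
    """
    order = np.argsort(-x)                 # permutation pi: x[pi_1] >= ... >= x[pi_n]
    w = np.zeros(len(x))
    A, prev = set(), 0.0
    for i in order:
        A = A | {int(i)}
        cur = F(A)
        w[i] = cur - prev                  # marginal gain F(A_k) - F(A_{k-1})
        prev = cur
    return w

# Example with the concave-of-cardinality function F(A) = sqrt(|A|).
F = lambda A: np.sqrt(len(A))
x = np.array([0.2, 1.5, -0.7])
w = greedy_base_polytope(x, F)
f_x = x @ w                                # value of the Lovász extension at x
```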

Primal problem: minimize g(x) + f(x) (P), where g : Rⁿ → R is strongly convex and f : Rⁿ → R is the Lovász extension of a submodular F: piecewise linear, homogeneous, with (generically) exponentially many pieces, but subgradients are easy, O(n log n). 17 / 33

Primal problem: example (Source: http://mistis.inrialpes.fr/learninria/slides/bach.pdf) 18 / 33

Original Simplicial Method (OSM) (Bach 2013). Algorithm 1 OSM (to minimize g(x) + f(x)): initialize V. Repeat:
1. define f̂(x) = max_{w ∈ V} w^⊤x
2. solve subproblem x ∈ argmin g(x) + f̂(x)
3. compute v ∈ ∂f(x) = argmax_{w ∈ B(F)} x^⊤w
4. V ← V ∪ {v}
Problem: V keeps growing! No known rate of convergence (Bach 2013). 19 / 33

Limited Memory Kelley's Method (LM-KM). Algorithm 2 LM-KM (to minimize g(x) + f(x)): initialize V. Repeat:
1. define f̂(x) = max_{w ∈ V} w^⊤x
2. solve subproblem x ∈ argmin g(x) + f̂(x)
3. compute v ∈ ∂f(x) = argmax_{w ∈ B(F)} x^⊤w
4. V ← {w ∈ V : w^⊤x = f̂(x)} ∪ {v}
Does it converge or cycle? How large could V grow? 20 / 33
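A minimal sketch of this loop, assuming g is smooth and strongly convex with an available gradient, a greedy oracle over B(F) as in the earlier sketch, and SciPy's SLSQP to solve the piecewise-linear subproblem in epigraph form. The tolerance, initialization, and stopping rule are my choices, not the paper's.

```python
import numpy as np
from scipy.optimize import minimize

def lm_km(g, grad_g, greedy_oracle, n, iters=50, tol=1e-8):
    """Sketch of LM-KM for min_x g(x) + f(x), f the Lovász extension of F.

    greedy_oracle(x) returns argmax_{w in B(F)} x^T w, a subgradient of f at x.
    The subproblem min_x g(x) + max_{w in V} w^T x is solved in epigraph form
    over z = (x, t): minimize g(x) + t subject to w^T x <= t for all w in V.
    """
    x = np.zeros(n)
    V = [greedy_oracle(x)]                                  # initial cutting plane
    for _ in range(iters):
        cons = [{"type": "ineq", "fun": lambda z, w=w: z[n] - w @ z[:n]} for w in V]
        z0 = np.append(x, max(w @ x for w in V))            # feasible start for (x, t)
        res = minimize(lambda z: g(z[:n]) + z[n], z0,
                       jac=lambda z: np.append(grad_g(z[:n]), 1.0),
                       constraints=cons, method="SLSQP")
        x, t = res.x[:n], res.x[n]
        v = greedy_oracle(x)                                # new vertex of B(F)
        if v @ x <= t + tol:                                # model is exact at x: done
            return x
        V = [w for w in V if abs(w @ x - t) <= tol] + [v]   # keep only active cuts
    return x
```

The pruning in the last line is what keeps the model small; the theorem stated below makes the bound |V| ≤ n + 1 precise.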

LM-KM: intuition. [figure: successive piecewise-linear models g + f̂(1),..., g + f̂(4) of g + f, with iterates x(1),..., x(4) and lower-bound points z(0),..., z(3)] 21 / 33

L-KM converges linearly with bounded memory. Theorem (Zhou, Gupta & Udell 2018): L-KM has bounded memory, |V| ≤ n + 1; L-KM converges when g is strongly convex; L-KM converges linearly when g is smooth and strongly convex. (Corollary: OSM converges linearly, too.) 22 / 33

Dual problem: minimize g*(−w) subject to w ∈ B(F) (D), where g* : Rⁿ → R is smooth (the conjugate of the strongly convex g) and B(F) is the base polytope of the submodular F, with (generically) exponentially many facets, but linear optimization over B(F) is easy, O(n log n). 23 / 33

Conditional gradient methods for the dual: linear optimization over the constraint set is easy, so use a conditional gradient method! Away-step FW, pairwise FW, and fully corrective FW (FCFW) all converge linearly (Lacoste-Julien & Jaggi 2015), and FCFW has limited memory (Garber & Hazan 2015). This gives linear convergence with one gradient + one linear optimization per iteration. 24 / 33

Dual to primal: suppose g is α-strongly convex and β-smooth. Solve a dual subproblem inexactly to obtain ŷ ∈ B(F) with g*(−ŷ) − g*(−y*) ≤ ε. Since g* is 1/β-strongly convex, ‖ŷ − y*‖² ≤ 2βε. Define x̂ = ∇g*(−ŷ) = argmin_x g(x) + ŷ^⊤x. Since g* is 1/α-smooth, ‖x̂ − x*‖² ≤ (1/α²) ‖ŷ − y*‖² ≤ 2βε/α². So if the dual iterates converge linearly, so do the primal iterates. 25 / 33

Fully corrective Frank-Wolfe: minimize g(w) subject to w ∈ P. [figure: polytope P with vertices v1,...,v5, iterates w(0), w(1), w(2), and gradient directions ∇g(w(0)), ∇g(w(1))] 26 / 33

Limited-memory Fully Corrective Frank-Wolfe (L-FCFW). Algorithm 3 FCFW (to minimize g*(−y) over y ∈ B(F)): initialize V. Repeat:
1. solve subproblem: minimize g*(−y) subject to y ∈ Conv(V); write the solution as y = Σ_{w ∈ V} λ_w w with λ_w > 0 and Σ_{w ∈ V} λ_w = 1
2. compute gradient x = ∇g*(−y)
3. solve linear optimization v = argmax_{w ∈ B(F)} x^⊤w
4. V ← {w ∈ V : λ_w > 0} ∪ {v}
27 / 33
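As a concrete, deliberately simplified instance (my own specialization, not the slides'), take g(x) = ½‖x‖², so g*(−y) = ½‖y‖² and the dual problem asks for the point of B(F) closest to the origin. The correction step over Conv(V) then becomes a small simplex-constrained quadratic program, which the sketch below hands to SciPy's SLSQP; the general method allows any smooth g*, and all names here are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def l_fcfw_min_norm(greedy_oracle, n, iters=50, tol=1e-9):
    """L-FCFW sketch for min_{y in B(F)} g*(-y) with g(x) = 0.5*||x||^2,
    so g*(-y) = 0.5*||y||^2 and x = grad g*(-y) = -y."""
    V = [greedy_oracle(np.zeros(n))]                     # one vertex of B(F)
    for _ in range(iters):
        W = np.array(V)                                  # rows are the kept vertices
        # Correction step: minimize 0.5*||lam @ W||^2 over the probability simplex.
        res = minimize(lambda lam: 0.5 * np.sum((lam @ W) ** 2),
                       np.full(len(V), 1.0 / len(V)),
                       jac=lambda lam: W @ (W.T @ lam),
                       bounds=[(0.0, 1.0)] * len(V),
                       constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}],
                       method="SLSQP")
        lam = res.x
        y = lam @ W                                      # dual iterate in Conv(V)
        x = -y                                           # primal point: x = grad g*(-y)
        v = greedy_oracle(x)                             # linear optimization over B(F)
        if x @ v <= x @ y + tol:                         # no improving vertex: y optimal
            break
        V = [w for w, l in zip(V, lam) if l > tol] + [v] # drop zero-weight vertices
    return y
```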

Fully corrective Frank-Wolfe (FCFW): properties. Bounded memory: by Carathéodory, we can choose λ_w so that |{w ∈ V : λ_w > 0}| ≤ n + 1. Finite convergence: the active set changes at each iteration, and a vertex that exits the active set is never added again. Converges linearly for smooth, strongly convex objectives. Useful if linear optimization over B(F) is hard (so the convex subproblem is comparatively cheap); it is ok to solve the subproblem inexactly (Lacoste-Julien & Jaggi 2015). Compare to vanilla CGM: the memory cost is similar (w ∈ Rⁿ vs. n very simple vertices), and solving the subproblems is not much harder than evaluating g. 28 / 33

Dual subproblems: the FCFW subproblem
maximize −g*(−y) subject to y = Σ_{w ∈ V} λ_w w, 1^⊤λ = 1, λ ≥ 0
has dual
minimize_x g(x) + max_{w ∈ V} x^⊤w,
which is our primal subproblem! First-order optimality conditions show the active sets match: λ_w > 0 ⟹ w^⊤x = max_{w' ∈ V} x^⊤w'. Hence FCFW has a corresponding primal algorithm: LM-KM! 29 / 33
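The duality between the two subproblems follows from a standard minimax argument (this derivation is my reconstruction, not from the slides): since Conv(V) is compact and the inner function is linear in y and concave in x, the min and sup can be exchanged,

```latex
\min_{y \in \operatorname{Conv}(V)} g^*(-y)
  = \min_{y \in \operatorname{Conv}(V)} \, \sup_{x} \bigl( -y^\top x - g(x) \bigr)
  = \sup_{x} \, \min_{y \in \operatorname{Conv}(V)} \bigl( -y^\top x - g(x) \bigr)
  = - \min_{x} \Bigl( g(x) + \max_{w \in V} w^\top x \Bigr).
```

Complementary slackness for the inner maximization over Conv(V) then forces λ_w > 0 only for vertices achieving max_{w ∈ V} w^⊤x, which is exactly the active-set correspondence stated above.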

LM-KM: numerical experiment. g(x) = x^⊤Ax + b^⊤x + n‖x‖² for x ∈ Rⁿ; f is the Lovász extension of F(A) = |A|(2n − |A| + 1)/2; entries of A ∈ Mₙ sampled uniformly from [−1, 1]; entries of b ∈ Rⁿ sampled uniformly from [0, n]. 30 / 33
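A sketch of how one might generate this instance, based on my reading of the transcribed formulas (g(x) = x^⊤Ax + b^⊤x + n‖x‖², F(A) = |A|(2n − |A| + 1)/2); the random seed and helper names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10

A = rng.uniform(-1.0, 1.0, size=(n, n))    # entries of A uniform in [-1, 1]
b = rng.uniform(0.0, n, size=n)            # entries of b uniform in [0, n]

def g(x):
    # Smooth, strongly convex part: x^T A x + b^T x + n * ||x||^2.
    return x @ A @ x + b @ x + n * (x @ x)

def grad_g(x):
    return (A + A.T) @ x + b + 2.0 * n * x

def F(S):
    # Cardinality-based submodular function: |S| * (2n - |S| + 1) / 2.
    k = len(S)
    return k * (2 * n - k + 1) / 2.0

def greedy_oracle(x):
    # argmax_{w in B(F)} x^T w; here the marginal gain at step k is n - k + 1.
    order = np.argsort(-x)
    w = np.zeros(n)
    for k, i in enumerate(order, start=1):
        w[i] = F(range(k)) - F(range(k - 1))
    return w
```

With these pieces, the lm_km sketch from the earlier slide can be run as lm_km(g, grad_g, greedy_oracle, n).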

LM-KM: numerical experiment. [figure: convergence plots; dimension n = 10 in upper left, n = 100 in others] 31 / 33

Conclusion: LM-KM gives a new algorithm for composite convex and submodular optimization with bounded (O(n)) storage, built from two old ideas: duality and Carathéodory. 32 / 33

References: S. Zhou, S. Gupta, and M. Udell. Limited Memory Kelley's Method Converges for Composite Convex and Submodular Objectives. NIPS 2018. 33 / 33