Expanding the reach of optimal methods


Expanding the reach of optimal methods
Dmitriy Drusvyatskiy, Mathematics, University of Washington
Joint work with C. Kempton (UW), M. Fazel (UW), A.S. Lewis (Cornell), and S. Roy (UW)
BURKAPALOOZA! WCOM Spring 2016. AFOSR YIP

Notation (2/15)
A function f : ℝⁿ → ℝ is α-convex and β-smooth if q_x ≤ f ≤ Q_x, where
  Q_x(y) = f(x) + ⟨∇f(x), y − x⟩ + (β/2)‖y − x‖²
  q_x(y) = f(x) + ⟨∇f(x), y − x⟩ + (α/2)‖y − x‖²
Condition number: κ = β/α.
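
The sandwich q_x ≤ f ≤ Q_x is easy to check numerically. Below is a minimal sketch (not from the talk) for a quadratic f(x) = ½ xᵀAx, whose α and β are the extreme eigenvalues of A; the matrix and tolerances are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([1.0, 4.0])            # eigenvalues 1 and 4, so alpha=1, beta=4
alpha, beta = 1.0, 4.0

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

def q(x, y, m):
    """Quadratic model at x with curvature m: the slide's q_x (m=alpha) or Q_x (m=beta)."""
    return f(x) + grad(x) @ (y - x) + 0.5 * m * (y - x) @ (y - x)

# verify q_x <= f <= Q_x at random pairs of points
for _ in range(100):
    x, y = rng.standard_normal(2), rng.standard_normal(2)
    assert q(x, y, alpha) - 1e-9 <= f(y) <= q(x, y, beta) + 1e-9
```

For this f the check is exact: f(y) − q(x, y, m) = ½ (y − x)ᵀ(A − mI)(y − x), which is nonnegative for m = α and nonpositive for m = β.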

Complexity of first-order methods (3/15)
Gradient descent: x_{k+1} = x_k − (1/β) ∇f(x_k)
Majorization view: x_{k+1} = argmin_y Q_{x_k}(y)

Table: iterations until f(x_k) − f* < ε
                        Gradient Descent   Optimal Methods
  β-smooth              β/ε                √(β/ε)
  β-smooth & α-convex   κ ln(1/ε)          √κ ln(1/ε)
(Nesterov 83, Yudin-Nemirovsky 83)

Optimal methods have downsides:
  Not intuitive
  Non-monotone
  Difficult to augment with memory
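
The gradient-descent update is one line in code. A minimal sketch (the quadratic test function and iteration count are illustrative, not from the talk):

```python
import numpy as np

def gradient_descent(grad, x0, beta, iters=100):
    """x_{k+1} = x_k - (1/beta) grad f(x_k), i.e. the majorization step
    x_{k+1} = argmin Q_{x_k}."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - grad(x) / beta
    return x

# f(x) = 0.5*(x1^2 + 10*x2^2): alpha = 1, beta = 10, kappa = 10
x = gradient_descent(lambda x: np.array([x[0], 10.0 * x[1]]),
                     [1.0, 1.0], beta=10.0, iters=200)
```

Consistent with the κ ln(1/ε) entry in the table, each step shrinks the error by a factor of at most (1 − 1/κ) on this example.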

Optimal method by optimal averaging (4/15)
Idea: maximize lower models of f.
Notation: x⁺ = x − (1/β)∇f(x) and x⁺⁺ = x − (1/α)∇f(x).
Convexity bound f ≥ q_x in canonical form:
  f(y) ≥ ( f(x) − ‖∇f(x)‖²/(2α) ) + (α/2)‖y − x⁺⁺‖²
Lower models:
  Q_A(x) = v_A + (α/2)‖x − x_A‖²
  Q_B(x) = v_B + (α/2)‖x − x_B‖²
⟹ for any λ ∈ [0, 1], a new lower model Q_λ := λQ_A + (1 − λ)Q_B = v_λ + (α/2)‖x − x_λ‖²
Key observation: v_λ ≤ f*

Optimal method by optimal averaging (5/15)
The minimum v_λ is maximized when
  λ = proj_[0,1]( 1/2 + (v_A − v_B) / (α‖x_A − x_B‖²) ).
The quadratic Q_λ is the optimal averaging of (Q_A, Q_B).
Related to cutting plane, bundle methods, geometric descent (Bubeck-Lee-Singh 15).
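
The optimal averaging of two quadratics with common curvature α admits this closed form. A minimal sketch (the function name and the tie-breaking when the centers coincide are my own conventions):

```python
import numpy as np

def optimal_average(vA, xA, vB, xB, alpha):
    """Return (v, c) describing Q(x) = v + (alpha/2)*||x - c||^2, the convex
    combination lam*Q_A + (1-lam)*Q_B whose minimum value v is largest."""
    xA, xB = np.asarray(xA, float), np.asarray(xB, float)
    d2 = float((xA - xB) @ (xA - xB))
    if d2 == 0.0:
        return max(vA, vB), xA                # models share a center
    lam = float(np.clip(0.5 + (vA - vB) / (alpha * d2), 0.0, 1.0))
    # min of the averaged quadratic: mix of values plus a concavity bonus
    v = lam * vA + (1 - lam) * vB + 0.5 * alpha * lam * (1 - lam) * d2
    return v, lam * xA + (1 - lam) * xB
```

For two models with equal minima v_A = v_B and distinct centers, λ = 1/2 and the averaged minimum rises by (α/2)·λ(1 − λ)·‖x_A − x_B‖², so averaging genuinely improves the lower bound.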

Optimal method by optimal averaging (6/15)
Algorithm: Optimal averaging
  for k = 1, …, K do
    Set Q(x) = ( f(x_k) − ‖∇f(x_k)‖²/(2α) ) + (α/2)‖x − x_k⁺⁺‖²
    Let Q_k(x) = v_k + (α/2)‖x − c_k‖² be the optimal average of (Q, Q_{k−1}).
    Set x_{k+1} = line_search(c_k, x_k⁺)
  end
The line search is due to (Bubeck-Lee-Singh 15).

Optimal linear rate (D-Fazel-Roy 16): f(x_k⁺) − v_k ≤ ε after O(√(β/α) ln(1/ε)) iterations.
  Intuitive
  Monotone in f(x_k⁺) and in v_k
  Memory by optimally averaging (Q, Q_{k−1}, …, Q_{k−t})
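
The loop above can be sketched end to end. This is a rough illustration, not the talk's implementation: in particular the exact line search between c_k and the gradient step x_k⁺ is replaced here by sampling the segment, and the test function is my own.

```python
import numpy as np

def optimal_averaging(f, grad, x0, alpha, beta, iters=200):
    """Sketch of the optimal-averaging scheme: maintain one aggregated lower
    model v_k + (alpha/2)*||x - c_k||^2 and average in a new model each step."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    v_k = f(x) - g @ g / (2 * alpha)       # minimum of the first lower model
    c_k = x - g / alpha                    # its center, x^{++}
    for _ in range(iters):
        xp = x - g / beta                  # gradient step x^+
        ts = np.linspace(0.0, 1.0, 33)     # crude stand-in for the line search
        x = min(((1 - t) * c_k + t * xp for t in ts), key=f)
        g = grad(x)
        v_new = f(x) - g @ g / (2 * alpha)
        c_new = x - g / alpha
        # optimal averaging of (new model, aggregated model)
        d2 = float((c_k - c_new) @ (c_k - c_new))
        if d2 > 0.0:
            lam = float(np.clip(0.5 + (v_k - v_new) / (alpha * d2), 0.0, 1.0))
        else:
            lam = 1.0 if v_k >= v_new else 0.0
        v_k = lam * v_k + (1 - lam) * v_new + 0.5 * alpha * lam * (1 - lam) * d2
        c_k = lam * c_k + (1 - lam) * c_new
    return x, v_k

# f(x) = 0.5*(x1^2 + 10*x2^2): alpha = 1, beta = 10, f* = 0
f = lambda x: 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)
grad = lambda x: np.array([x[0], 10.0 * x[1]])
x, v_k = optimal_averaging(f, grad, [1.0, 1.0], alpha=1.0, beta=10.0)
```

The method is monotone in both senses the slide mentions: f(x) decreases while the certified lower bound v_k increases toward f*, and v_k ≤ f* holds at every iteration.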

Optimal method by optimal averaging (7/15)
Figure: logistic regression with regularization α = 10⁻⁴.

Nonsmooth & nonconvex minimization (8/15)
Convex composite problem:
  min_x g(x) + h(c(x))
where
  g : ℝⁿ → ℝ ∪ {+∞} is closed and convex,
  h : ℝᵐ → ℝ is convex and L-Lipschitz,
  c : ℝⁿ → ℝᵐ is C¹-smooth and ∇c is β-Lipschitz.
For convenience, set μ = Lβ.
Main examples:
  Additive composite minimization: min_x g(x) + c(x)
  Nonlinear least squares: min_x { ‖c(x)‖ : l_i ≤ x_i ≤ u_i for i = 1, …, m }
  Exact penalty subproblem: min_x g(x) + dist_K(c(x))
(Burke 85, 91; Fletcher 82; Powell 84; Wright 90; Yuan 83)

Prox-linear algorithm (9/15)
Prox-linear mapping:
  x⁺ = argmin_y g(y) + h( c(x) + ∇c(x)(y − x) ) + (μ/2)‖y − x‖²
and the prox-gradient
  G(x) = μ(x − x⁺).
Prox-linear method (Burke, Fletcher, Osborne, Powell, … 80s): x_{k+1} = x_k⁺.
Examples: proximal gradient, Levenberg-Marquardt methods.
Convergence rate: ‖G(x_k)‖² < ε after O(μ²/ε) iterations.
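In the special case g = λ‖·‖₁ with h the identity and c smooth, the prox-linear step reduces to the proximal gradient update x⁺ = prox_{g/μ}(x − ∇c(x)/μ). A minimal sketch on a tiny instance (the data and function names are illustrative, not from the talk):

```python
import numpy as np

def prox_l1(z, t):
    """Proximal map of t*||.||_1: soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_linear_l1(grad_c, x0, mu, lam, iters=100):
    """Prox-linear method for min_x lam*||x||_1 + c(x).  Each step is the
    proximal gradient update; G = mu*(x - x_plus) is the prox-gradient."""
    x = np.asarray(x0, dtype=float)
    G = np.zeros_like(x)
    for _ in range(iters):
        x_plus = prox_l1(x - grad_c(x) / mu, lam / mu)
        G = mu * (x - x_plus)
        x = x_plus
    return x, G

# c(x) = 0.5*||x - b||^2, so grad c(x) = x - b and mu = 1 suffices
b = np.array([3.0, 0.1])
x, G = prox_linear_l1(lambda x: x - b, np.zeros(2), mu=1.0, lam=1.0)
```

On this instance the minimizer is the soft-thresholding of b, and the prox-gradient G vanishes once the iterate lands there, illustrating ‖G‖ as a stationarity measure.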

Stopping criterion (10/15)
What does ‖G(x)‖² < ε actually mean?
Stationarity for the target problem: 0 ∈ ∂g(x) + ∇c(x)*∂h(c(x))
Stationarity for the prox-subproblem: G(x) ∈ ∂g(x⁺) + ∇c(x)*∂h( c(x) + ∇c(x)(x⁺ − x) )
Thm (D-Lewis 16): x⁺ is nearly stationary, because there exists a pair (x̂, v̂) with
  v̂ ∈ ∂g(x̂) + ∇c(x̂)*∂h(c(x̂)),
where ‖x̂ − x⁺‖ ≤ (1/μ)‖G(x)‖ and ‖v̂‖ ≤ 5‖G(x)‖.
(Proof: Ekeland's variational principle.)

Local linear convergence (11/15)
Error bound property (Luo-Tseng 93):
  dist(x, {soln. set}) ≤ (1/α)‖G(x)‖ for x near x̄
⟹ local linear convergence:
  F(x_{k+1}) − F* ≤ (1 − α²/μ²)(F(x_k) − F*)
The following are essentially equivalent (D-Lewis 16):
  EB property
  Subregularity: dist(x, {soln. set}) ≤ (1/α) dist(0, ∂F(x)) for x near x̄
  Quadratic growth: F(x) ≥ F(x̄) + (α/2) dist²(x, {soln. set}) for x near x̄
The rate improves to 1 − α/μ under tilt-stability (Poliquin-Rockafellar 98).
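
The linear rate under the error bound can be observed numerically. A sketch on a diagonal lasso instance, where the minimizer and F* are available in closed form (the instance, constants, and the choice of α as the smallest squared diagonal entry are illustrative assumptions, not from the talk):

```python
import numpy as np

# Lasso: F(x) = lam*||x||_1 + 0.5*||a*x - b||^2 with diagonal data a, so the
# smoothness constant is mu = max(a)^2 and the error-bound constant alpha
# can be taken as min(a)^2.
a = np.array([1.0, 3.0]); b = np.array([3.0, 3.0]); lam = 1.0
alpha, mu = a.min() ** 2, a.max() ** 2

soft = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
F = lambda x: lam * np.abs(x).sum() + 0.5 * np.sum((a * x - b) ** 2)

x_star = soft(b / a, lam / a ** 2)      # closed-form minimizer (diagonal case)
F_star = F(x_star)

x = np.zeros(2)
gap0 = F(x) - F_star
for _ in range(200):                    # proximal gradient with step 1/mu
    x = soft(x - a * (a * x - b) / mu, lam / mu)
gap = F(x) - F_star                     # theory: gap <= (1 - alpha^2/mu^2)^k * gap0
```

The observed contraction is in fact faster than the worst-case (1 − α²/μ²) factor the slide guarantees.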

Acceleration (12/15)
For nonconvex problems, the rate ‖G(x_k)‖² < ε after O(μ²/ε) iterations is essentially best possible.
Measuring non-convexity:
  h(c(x)) = sup_w { ⟨w, c(x)⟩ − h*(w) }
Fact: x ↦ ⟨w, c(x)⟩ + (μ/2)‖x‖² is convex for all w ∈ dom h*.
Defn: ρ ∈ [0, 1] is a parameter such that x ↦ ⟨w, c(x)⟩ + (ρμ/2)‖x‖² is convex for all w ∈ dom h*.

Acceleration (13/15)
Accelerated prox-linear method!
Thm (D-Kempton 16):
  min_{i=1,…,k} ‖G(x_i)‖² ≤ O(μ²R²/k³) + ρ · O(μ²R²/k),
where R = diam(dom g).
Generalizes (Ghadimi-Lan 16) for the additive composite setting.

Conclusions (14/15)
A balanced approach:
  Computational complexity
  Acceleration
  Variational analysis

Happy Birthday, Jim! 15/15