Taylor-like models in nonsmooth optimization
Dmitriy Drusvyatskiy (Mathematics, University of Washington)
Joint work with Ioffe (Technion), Lewis (Cornell), and Paquette (UW)
SIAM Optimization 2017. Supported by AFOSR and NSF.

Fix a closed function $f : \mathbb{R}^n \to \overline{\mathbb{R}}$.

Slope: the fastest instantaneous rate of decrease,
  $|\nabla f|(\bar x) := \limsup_{x \to \bar x} \dfrac{(f(\bar x) - f(x))^+}{\|\bar x - x\|}.$

If $f$ is convex, then $|\nabla f|(x) = \mathrm{dist}(0; \partial f(x))$.

Critical points: $x$ is critical for $f$ $\iff$ $|\nabla f|(x) = 0$.

Deficiency: $|\nabla f|$ is discontinuous $\implies$ it cannot be used to terminate.
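
A quick worked instance of this deficiency (a standard example, not from the slides): for $f(x) = |x|$ on $\mathbb{R}$, at the minimizer $\bar x = 0$ we get $|\nabla f|(0) = \limsup_{x \to 0} \frac{(|0| - |x|)^+}{|x|} = 0$, while $|\nabla f|(\bar x) = 1$ at every $\bar x \neq 0$. Points arbitrarily close to the critical point have slope one, so a small slope is never observed along typical iterates and cannot drive a stopping test.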

Basic question: Is there a computable continuous surrogate $G$ for $|\nabla f|$?

Desirable properties:
1. $G$ is continuous,
2. $G(x) = 0 \iff |\nabla f|(x) = 0$,
3. $\mathrm{epi}\, G$ and $\mathrm{epi}\, f$ are close.

Various contexts: cutting planes (Kelley), bundle methods (Lemaréchal, Noll, Sagastizábal, Wolfe), gradient sampling (Goldstein, Burke-Lewis-Overton).

Outline
1. Taylor-like models: step-size, stationarity, error bounds
2. Convex composite $g + h \circ c$: prox-linear method, local linear/quadratic rates

Taylor-like models

Task: Determine the quality of a point $x \in \mathbb{R}^n$ for $\min_y f(y)$.

Structural assumption: a Taylor-like model $f_x$ is available:
  $|f_x(y) - f(y)| \le \frac{\eta}{2}\|x - y\|^2$ for all $y$.

Slope surrogate:
  $x^+ \in \operatorname{argmin}_y f_x(y)$  and  $G(x) := \|x - x^+\|$.

Thm (D-Ioffe-Lewis '16): There exists $\hat x$ satisfying
  (point proximity)     $\tfrac{1}{2}\|x - \hat x\| \le G(x)$,
  (value proximity)     $\tfrac{1}{\eta}\bigl(f(\hat x) - f(x)\bigr) \le G(x)$,
  (near stationarity)   $\tfrac{1}{\eta}\,|\nabla f|(\hat x) \le G(x)$.
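
One standard instance of such a model (stated here for illustration, with crude constants): if $f = g + h$ with $g$ closed convex and $h$ $C^1$-smooth with $\beta$-Lipschitz gradient, then
  $f_x(y) := g(y) + h(x) + \langle \nabla h(x), y - x\rangle + \frac{\beta}{2}\|y - x\|^2$
satisfies $0 \le f_x(y) - f(y) \le \beta\|y - x\|^2$ for all $y$ (by the two-sided Lipschitz-gradient bound), so it is a Taylor-like model with $\eta = 2\beta$. Its minimizer is $x^+ = \operatorname{prox}_{g/\beta}\!\bigl(x - \tfrac{1}{\beta}\nabla h(x)\bigr)$, so $G(x)$ is the length of a proximal gradient step.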

Error bounds and linear rates

Thm (D-Ioffe-Lewis '16): Let $S \subset \mathbb{R}^n$ be arbitrary and fix $\bar x \in S$. Suppose the slope error bound holds:
  (Slope EB)      $\mathrm{dist}(x; S) \le \kappa\, |\nabla f|(x)$  for all $x$ near $\bar x$.
Then the step-size error bound holds:
  (Step-size EB)  $\mathrm{dist}(x; S) \le (3\kappa\eta + 2)\, G(x)$  for all $x$ near $\bar x$.

The slope EB is the phenomenon underlying linear rates.
The step-size EB aids linear rate analysis (Luo-Tseng '93).

Rem: A similar statement holds for the surrogate $G(x) := f(x) - f_x(x^+)$.
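
A familiar setting where the slope EB holds (a standard fact, added for illustration): if $f$ is convex and $\mu$-strongly convex with minimizer $x^*$, then for any $v \in \partial f(x)$ strong convexity gives $\mu\|x - x^*\|^2 \le \langle v, x - x^*\rangle \le \|v\|\,\|x - x^*\|$, hence $\mathrm{dist}(x; S) = \|x - x^*\| \le \tfrac{1}{\mu}\,\mathrm{dist}(0; \partial f(x)) = \tfrac{1}{\mu}\,|\nabla f|(x)$, which is the slope EB with $\kappa = 1/\mu$.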

Convex composite minimization: $g + h \circ c$

Nonsmooth and nonconvex minimization

Convex composite problem:
  $\min_x\; f(x) = g(x) + h(c(x))$,
where
  $g : \mathbb{R}^d \to \overline{\mathbb{R}}$ is closed and convex,
  $h : \mathbb{R}^m \to \mathbb{R}$ is convex and $L$-Lipschitz,
  $c : \mathbb{R}^d \to \mathbb{R}^m$ is $C^1$-smooth and $\nabla c$ is $\beta$-Lipschitz.
For convenience, set $\eta = L\beta$.

(Burke '85, '91; Cartis-Gould-Toint '11; Fletcher '82; Lewis-Wright '15; Powell '84; Wright '90; Yuan '83)
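
The role of $\eta = L\beta$ (a one-line check, filling in the step behind the model bound used below): since $\nabla c$ is $\beta$-Lipschitz, $\|c(y) - c(x) - \nabla c(x)(y - x)\| \le \frac{\beta}{2}\|y - x\|^2$, and since $h$ is $L$-Lipschitz,
  $\bigl|h(c(y)) - h\bigl(c(x) + \nabla c(x)(y - x)\bigr)\bigr| \le \frac{L\beta}{2}\|y - x\|^2 = \frac{\eta}{2}\|y - x\|^2$,
so the composite term agrees with its linearization up to a quadratic error of modulus $\eta$.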

Composite examples

Convex composite problem: $\min_x\; f(x) = g(x) + h(c(x))$.

Examples:
- Additive composite minimization: $\min_x\; g(x) + c(x)$
- Nonlinear least squares: $\min_x\; \{\, \|c(x)\| : l_i \le x_i \le u_i \text{ for } i = 1, \dots, m \,\}$
- Nonnegative matrix factorization: $\min_{X, Y}\; \|XY^T - D\|$ s.t. $X, Y \ge 0$
- Robust phase retrieval (Duchi-Ruan '17): $\min_x\; \big\|\langle a, x\rangle^2 - b\big\|_1$
- Exact penalty subproblem: $\min_x\; g(x) + \mathrm{dist}_K(c(x))$

Prox-linear algorithm

Prox-linear model:
  $f_x(y) = g(y) + h\bigl(c(x) + \nabla c(x)(y - x)\bigr) + \frac{\eta}{2}\|y - x\|^2$,
  $x^+ = \operatorname{argmin}_y f_x(y)$,   $G(x) = \eta\, \|x - x^+\|$.

Justification:
  $f_x(y) - \eta\|y - x\|^2 \;\le\; f(y) \;\le\; f_x(y)$  for all $x, y$.

Prox-linear method (Burke, Fletcher, Osborne, Powell, ... '80s):
  $x_{k+1} = x_k^+$.

E.g., the proximal gradient and Levenberg-Marquardt methods.

Convergence rate: $G(x_k) < \epsilon$ after $O\!\left(\frac{\eta}{\epsilon^2}\right)$ iterations.
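
A minimal numerical sketch of the prox-linear iteration, written for the robust phase retrieval instance ($g \equiv 0$, $h = \|\cdot\|_1$, $c(x) = (Ax)^2 - b$) and solving the convex subproblem with cvxpy; the function names, the choice of subproblem solver, and the default tolerances are illustrative assumptions, not from the talk.

import numpy as np
import cvxpy as cp

def prox_linear_step(x, A, b, eta):
    # Minimize the prox-linear model
    #   f_x(y) = || c(x) + \nabla c(x)(y - x) ||_1 + (eta/2) ||y - x||^2,
    # where c(x) = (Ax)^2 - b and \nabla c(x) = 2 diag(Ax) A.
    Ax = A @ x
    y = cp.Variable(x.size)
    linearization = (Ax**2 - b) + 2 * cp.multiply(Ax, A @ (y - x))
    model = cp.norm1(linearization) + (eta / 2) * cp.sum_squares(y - x)
    cp.Problem(cp.Minimize(model)).solve()
    return y.value

def prox_linear(x0, A, b, eta, tol=1e-6, max_iter=50):
    # Iterate x_{k+1} = x_k^+ and stop once the surrogate G(x) = eta * ||x - x^+|| is small.
    x = x0
    for _ in range(max_iter):
        x_plus = prox_linear_step(x, A, b, eta)
        if eta * np.linalg.norm(x - x_plus) < tol:
            return x_plus
        x = x_plus
    return x

With measurements b = (A @ x_true)**2 and a spectral starting point (sketched after the phase retrieval slide below), this loop is the pipeline the later slides describe.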

Stopping criterion

What does $G(x) < \epsilon$ actually mean?

Stationarity for the target problem:
  $0 \in \partial g(x) + \nabla c(x)^* \partial h(c(x))$.

Stationarity for the prox-subproblem:
  $G(x) \ge \mathrm{dist}\Bigl(0;\; \partial g(x^+) + \nabla c(x)^* \partial h\bigl(c(x) + \nabla c(x)(x^+ - x)\bigr)\Bigr)$.

Thm (D-Lewis '16): $x^+$ is nearly stationary, because there exists $\hat x$ with
  $\|\hat x - x\| \lesssim \tfrac{1}{\eta}\, G(x)$   and   $|\nabla f|(\hat x) \lesssim G(x)$.

Thm (D-Paquette '16): $G(x) \asymp 2\eta\,\bigl\|x - \operatorname{prox}_{f/2\eta}(x)\bigr\|$.
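
Where the prox-subproblem bound comes from (a one-line check): first-order optimality of $x^+$ in $\min_y f_x(y)$ gives $0 \in \partial g(x^+) + \nabla c(x)^* \partial h\bigl(c(x) + \nabla c(x)(x^+ - x)\bigr) + \eta (x^+ - x)$, so the distance of $0$ to the sum of the first two terms is at most $\eta\|x^+ - x\| = G(x)$.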

Local quadratic convergence

Let $S = \{\text{stationary points}\}$ and fix $\bar x \in S$.

Thm (Burke-Ferris '95): A weak sharp minimum,
  $0 < \alpha \le |\nabla f|(x)$ for all $x \notin S$ near $\bar x$,
implies local quadratic convergence:
  $\mathrm{dist}(x_{k+1}; S) \le O\bigl(\mathrm{dist}^2(x_k; S)\bigr)$.

Growth interpretation: weak sharp minimum
  $\iff$ $f(x) \ge f(\operatorname{proj}(x; S)) + \alpha\, \mathrm{dist}(x, S)$ for all $x$ near $\bar x$.
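
A simple instance of a weak sharp minimum (standard, included for illustration): $f(x) = \|x\|$ has $S = \{0\}$ and $|\nabla f|(x) = 1$ for every $x \neq 0$, so the condition holds with $\alpha = 1$, matching the growth bound $\|x\| \ge 0 + 1 \cdot \mathrm{dist}(x, S)$. The phase retrieval objective on the later slides is shown there to be sharp in exactly this sense.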

Local linear convergence

Thm (D-Lewis '16): The error bound property,
  $\mathrm{dist}(x; S) \le \tfrac{1}{\alpha}\, |\nabla f|(x)$ for $x$ near $\bar x$,
implies local linear convergence:
  $f(x_{k+1}) - f^* \le \Bigl(1 - \tfrac{\alpha^2}{\eta^2}\Bigr)\bigl(f(x_k) - f^*\bigr)$.

Growth interpretation (D-Mordukhovich-Nghia '14): the EB property
  $\iff$ $f(x) \ge f(\operatorname{proj}(x, S)) + \tfrac{\alpha}{2}\, \mathrm{dist}^2(x, S)$ for $x$ near $\bar x$.

The rate becomes $\tfrac{\alpha}{\eta}$ under tilt-stability (Poliquin-Rockafellar '98).

Robust phase retrieval (Duchi-Ruan '17)

Problem: Find $x \in \mathbb{R}^n$ satisfying $\langle a_i, x\rangle^2 \approx b_i$ for given $a_1, \dots, a_m \in \mathbb{R}^n$ and $b_1, \dots, b_m \in \mathbb{R}_+$.

Defn (Eldar-Mendelson '12): $A \in \mathbb{R}^{m \times n}$ is stable if
  $\|(Ax)^2 - (Ay)^2\|_1 \ge \lambda\, \|x - y\|\, \|x + y\|$.

Thm (Duchi-Ruan '17): If $a_i \sim N(0, I_n)$ and $m/n \gtrsim 1$, then $A$ is stable with high probability.

Two ingredients:
1) With exact measurements $b = (A\bar x)^2$, the problem $\min_x \|(Ax)^2 - b\|_1 = \min_x \|(Ax)^2 - (A\bar x)^2\|_1$ has a weak sharp minimum, hence local quadratic convergence!
2) One can find $x_0$ in the attraction region w.h.p. using the spectrum.
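
A sketch of ingredient 2), using the standard spectral initialization for phase retrieval (the precise initializer in Duchi-Ruan '17 differs in its details; this version is an illustrative assumption): take the leading eigenvector of $\frac{1}{m}\sum_i b_i\, a_i a_i^T$, rescaled to the measurement magnitude.

import numpy as np

def spectral_init(A, b):
    # Leading eigenvector of (1/m) * sum_i b_i a_i a_i^T = A^T diag(b) A / m,
    # rescaled so the returned point has roughly the right norm.
    m = A.shape[0]
    M = (A.T * b) @ A / m
    eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    direction = eigvecs[:, -1]             # eigenvector of the largest eigenvalue
    scale = np.sqrt(np.mean(b))            # E[<a, x>^2] = ||x||^2 for Gaussian a
    return scale * direction

Chaining spectral_init with the prox_linear loop sketched earlier gives the two-step pipeline this slide describes.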

RNA reconstruction (Duchi-Ruan '17)

[Figure: reconstruction example with $n = 222$, $m = 3n$; panels show (a) the initialization $x_0$, (b) 10 inaccurate solves, (c) one accurate solve, (d) the original image.]

Summary
1. Taylor-like models: step-size, stationarity, error bounds
2. Convex composite $g + h \circ c$: prox-linear method, local linear/quadratic rates

Other recent works:
1. First-order complexity and acceleration (Paquette, talk 2:00-2:25)
2. Stochastic prox-linear algorithms (Duchi-Ruan '17)
3. Robust phase retrieval (Duchi-Ruan '17)

Thank you!

References

Nonsmooth optimization using Taylor-like models: error bounds, convergence, and termination criteria. D-Ioffe-Lewis, 2016, arXiv:1610.03446.

Error bounds, quadratic growth, and linear convergence of proximal methods. D-Lewis, 2016, arXiv:1602.06661.

Efficiency of minimizing compositions of convex functions and smooth maps. D-Paquette, 2016, arXiv:1605.00125.