An Optimal Affine Invariant Smooth Minimization Algorithm.


An Optimal Affine Invariant Smooth Minimization Algorithm

Alexandre d'Aspremont, CNRS & École Polytechnique.
Joint work with Martin Jaggi. Support from ERC SIPA.

A Basic Convex Problem

Solve
    minimize    $f(x)$
    subject to  $x \in Q$,
in the variable $x \in \mathbb{R}^n$. Here, $f(x)$ is convex and smooth. Assume $Q \subset \mathbb{R}^n$ is compact, convex and simple.

Complexity: Newton's method

At each iteration, take a step in the direction
    $\Delta x_{\mathrm{nt}} = -\nabla^2 f(x)^{-1} \nabla f(x).$
Assume that
- the function $f(x)$ is self-concordant, i.e. $|f'''(x)| \leq 2 f''(x)^{3/2}$,
- the set $Q$ has a self-concordant barrier $g(x)$.

[Nesterov and Nemirovskii, 1994] Newton's method produces an $\epsilon$-optimal solution to the barrier problem
    $\min_x \; h(x) \triangleq f(x) + t\, g(x),$
for some $t > 0$, in at most
    $\frac{20 - 8\alpha}{\alpha\beta(1-2\alpha)^2}\,\big(h(x_0) - h^\star\big) + \log_2\log_2(1/\epsilon)$
iterations, where $0 < \alpha < 0.5$ and $0 < \beta < 1$ are line search parameters.

Complexity: Newton's method

Basically,
    # Newton iterations $\leq 375\,\big(h(x_0) - h^\star\big) + 6$.
Empirically valid, up to constants. Independent from the dimension $n$. Affine invariant.

In practice, implementation mostly requires efficient linear algebra...
- Form the Hessian.
- Solve the Newton (or KKT) system $\nabla^2 f(x)\, \Delta x_{\mathrm{nt}} = -\nabla f(x)$.
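The two linear algebra steps above are essentially all an implementation needs. Below is a minimal Python sketch of a damped Newton iteration with backtracking line search; the box-barrier test objective, the random data and the parameter values are illustrative assumptions, not objects from the talk.

# Minimal sketch: damped Newton iteration with backtracking line search.
import numpy as np

def newton(f, grad, hess, x0, alpha=0.1, beta=0.7, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        dx = np.linalg.solve(H, -g)          # solve the Newton system H dx = -g
        lam2 = -g @ dx                       # squared Newton decrement
        if lam2 / 2.0 <= tol:
            break
        t, fx, slope = 1.0, f(x), g @ dx
        while f(x + t * dx) > fx + alpha * t * slope:
            t *= beta                        # backtracking line search
        x = x + t * dx
    return x

# Toy self-concordant objective: f(x) = c'x - sum_i log(b_i - a_i'x), a log
# barrier for the box |x_i| <= 1 (f returns +inf outside the domain).
n = 5
rng = np.random.default_rng(0)
c = rng.standard_normal(n)
A = np.vstack([np.eye(n), -np.eye(n)])
b = np.ones(2 * n)
f = lambda x: (c @ x - np.sum(np.log(b - A @ x))) if np.all(b - A @ x > 0) else np.inf
grad = lambda x: c + A.T @ (1.0 / (b - A @ x))
hess = lambda x: A.T @ np.diag(1.0 / (b - A @ x) ** 2) @ A
x_star = newton(f, grad, hess, np.zeros(n))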

Affine Invariance

Set $x = Ay$ where $A \in \mathbb{R}^{n \times n}$ is nonsingular;
    minimize    $f(x)$
    subject to  $x \in Q$,
becomes
    minimize    $\hat f(y)$
    subject to  $y \in \hat Q$,
in the variable $y \in \mathbb{R}^n$, where $\hat f(y) \triangleq f(Ay)$ and $\hat Q \triangleq A^{-1} Q$.

- Identical Newton steps, with $\Delta x_{\mathrm{nt}} = A\, \Delta y_{\mathrm{nt}}$.
- Identical complexity bounds $375\,\big(h(x_0) - h^\star\big) + 6$, since $h^\star = \hat h^\star$.

Newton's method is invariant w.r.t. an affine change of coordinates. The same is true for its complexity analysis.
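This invariance is easy to verify numerically. The sketch below uses a generic quadratic objective and a random nonsingular A (both illustrative assumptions) and checks that the Newton step for $\hat f(y) = f(Ay)$ maps back to the step for $f$ through $\Delta x_{\mathrm{nt}} = A\,\Delta y_{\mathrm{nt}}$.

# Numerical check that Newton steps transform affinely: dx_nt = A dy_nt.
import numpy as np

rng = np.random.default_rng(1)
n = 6
P = rng.standard_normal((n, n))
P = P @ P.T + n * np.eye(n)                  # positive definite quadratic term
q = rng.standard_normal(n)
grad = lambda x: P @ x + q                   # gradient of f(x) = x'Px/2 + q'x
hess = lambda x: P

A = rng.standard_normal((n, n))              # nonsingular with probability one
y = rng.standard_normal(n)

dx = np.linalg.solve(hess(A @ y), -grad(A @ y))   # Newton step for f at x = Ay
ghat = A.T @ grad(A @ y)                          # gradient of fhat(y) = f(Ay)
Hhat = A.T @ hess(A @ y) @ A                      # Hessian of fhat
dy = np.linalg.solve(Hhat, -ghat)                 # Newton step for fhat at y

print(np.allclose(dx, A @ dy))               # True: dx_nt = A dy_nt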

Large-Scale Problems

The challenge now is scaling.

    "Real men/women solve optimization problems with terabytes of data." (Michael Jordan, Paris, 2013.)

- Newton's method (and its derivatives) solves all reasonably large problems.
- Beyond a certain scale, second order information is out of reach.

Question today: clean complexity bounds for first order methods?

Frank-Wolfe

Conditional gradient. At each iteration, solve
    minimize    $\langle \nabla f(x_k), u \rangle$
    subject to  $u \in Q$,
in $u \in \mathbb{R}^n$. Define the curvature
    $C_f \triangleq \sup_{s, x \in Q,\; \alpha \in [0,1],\; y = x + \alpha(s - x)} \; \frac{1}{\alpha^2}\big(f(y) - f(x) - \langle y - x, \nabla f(x) \rangle\big).$
The Frank-Wolfe algorithm will then produce an $\epsilon$-solution after
    $N_{\max} = \frac{4 C_f}{\epsilon}$
iterations.

$C_f$ is affine invariant, but the bound is suboptimal in $\epsilon$: if $f(x)$ has a Lipschitz gradient, the lower bound is $O(1/\sqrt{\epsilon})$.
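For intuition, here is a minimal sketch of the conditional gradient iteration on the unit simplex, where the linear minimization oracle simply picks the vertex with the smallest gradient coordinate. The least-squares objective and the standard 2/(k+2) step size are illustrative assumptions.

# Minimal sketch of Frank-Wolfe (conditional gradient) on the unit simplex.
import numpy as np

def frank_wolfe(grad, n, iters=500):
    x = np.ones(n) / n                       # start at the simplex barycenter
    for k in range(iters):
        g = grad(x)
        u = np.zeros(n)
        u[np.argmin(g)] = 1.0                # LP oracle: best vertex of the simplex
        gamma = 2.0 / (k + 2.0)              # standard step size
        x = (1 - gamma) * x + gamma * u
    return x

rng = np.random.default_rng(2)
M = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
grad = lambda x: M.T @ (M @ x - b)           # gradient of f(x) = ||Mx - b||^2 / 2
x_hat = frank_wolfe(grad, n=10)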

Optimal First-Order Methods

Smooth minimization algorithm in [Nesterov, 1983] to solve
    minimize    $f(x)$
    subject to  $x \in Q$.

- Choose a norm $\|\cdot\|$. $\nabla f(x)$ is Lipschitz with constant $L$ w.r.t. $\|\cdot\|$:
    $f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{1}{2} L \|y - x\|^2, \quad x, y \in Q.$
- Choose a prox function $d(x)$ for the set $Q$, with
    $\tfrac{\sigma}{2}\|x - x_0\|^2 \leq d(x)$
  for some $\sigma > 0$.

Optimal First-Order Methods

Smooth minimization algorithm [Nesterov, 2005]

Input: $x_0$, the prox center of the set $Q$.
1: for $k = 0, \ldots, N$ do
2:   Compute $\nabla f(x_k)$.
3:   Compute $y_k = \mathrm{argmin}_{y \in Q} \left\{ \langle \nabla f(x_k), y - x_k \rangle + \tfrac{1}{2} L \|y - x_k\|^2 \right\}$.
4:   Compute $z_k = \mathrm{argmin}_{x \in Q} \left\{ \sum_{i=0}^k \alpha_i \big[ f(x_i) + \langle \nabla f(x_i), x - x_i \rangle \big] + \tfrac{L}{\sigma} d(x) \right\}$.
5:   Set $x_{k+1} = \tau_k z_k + (1 - \tau_k) y_k$.
6: end for
Output: $x_N, y_N \in Q$.

Produces an $\epsilon$-solution in at most
    $\sqrt{\frac{8 L\, d(x^\star)}{\sigma\, \epsilon}}$
iterations. Optimal in $\epsilon$, but not affine invariant.

Heavily used: TFOCS, NESTA, structured $\ell_1$, ...
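A minimal sketch of this scheme, specialized to the Euclidean ball $Q = \{x : \|x\|_2 \le 1\}$ with $d(x) = \|x\|_2^2/2$ (so $\sigma = 1$ and both argmin steps reduce to projections). The weights $\alpha_i = (i+1)/2$ and $\tau_k = 2/(k+3)$ are the standard choices from [Nesterov, 2005]; the quadratic objective is an illustrative assumption.

# Minimal sketch of the smooth minimization scheme on the Euclidean ball.
import numpy as np

def proj_ball(x):
    nrm = np.linalg.norm(x)
    return x if nrm <= 1.0 else x / nrm

def nesterov_smooth(grad, L, n, N=200):
    x = np.zeros(n)                          # prox center of Q
    s = np.zeros(n)                          # running sum of alpha_i * grad f(x_i)
    y = x
    for k in range(N):
        g = grad(x)
        y = proj_ball(x - g / L)             # step 3: gradient step, then project
        s = s + (k + 1) / 2.0 * g
        z = proj_ball(-s / L)                # step 4: minimize linear model + (L/sigma) d(x)
        tau = 2.0 / (k + 3.0)
        x = tau * z + (1 - tau) * y          # step 5
    return y

rng = np.random.default_rng(3)
M = rng.standard_normal((40, 15))
b = rng.standard_normal(40)
grad = lambda x: M.T @ (M @ x - b)           # gradient of f(x) = ||Mx - b||^2 / 2
L = np.linalg.norm(M, 2) ** 2                # Lipschitz constant of the gradient
x_hat = nesterov_smooth(grad, L, n=15)

Swapping the projection and the prox term (e.g. the simplex with an entropy prox) adapts the same scheme to other simple sets Q.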

Optimal First-Order Methods

Choosing the norm and the prox can have a big impact. Consider the following matrix game problem
    $\min_{\{1^T x = 1,\, x \geq 0\}} \; \max_{\{1^T y = 1,\, y \geq 0\}} \; x^T A y.$

- Euclidean prox: pick $\|\cdot\|_2$ and $d(x) = \|x\|_2^2/2$; after regularization, the bound on the gap after $N$ iterations is
    $\frac{4 \|A\|_2}{N + 1}.$
- Entropy prox: pick $\|\cdot\|_1$ and $d(x) = \sum_i x_i \log x_i + \log n$; the bound becomes
    $\frac{4 \sqrt{\log n \log m}\; \max_{ij} |A_{ij}|}{N + 1},$
  which can be significantly smaller.

The speedup is roughly $\sqrt{n}$ when $A$ is Bernoulli...
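Part of what makes the entropy setup attractive is that its prox-mapping over the simplex is available in closed form (a softmax), so each iteration stays cheap. The sketch below illustrates only that single step, not the full saddle-point scheme; the input vector g is arbitrary.

# The entropy prox-mapping over the simplex has a closed form: for
# d(x) = sum_i x_i log x_i, the minimizer of <g, x> + d(x) over the simplex
# is a softmax of -g.
import numpy as np

def entropy_prox(g):
    # argmin_{x in simplex} <g, x> + sum_i x_i log x_i  =  softmax(-g)
    w = np.exp(-g - np.max(-g))              # shift for numerical stability
    return w / w.sum()

g = np.array([0.3, -1.2, 0.5, 0.0])
x = entropy_prox(g)
# First-order (KKT) check: g_i + log x_i + 1 is constant across coordinates.
print(np.ptp(g + np.log(x) + 1.0) < 1e-12)   # True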

Choosing the Norm

Invariance means $\|\cdot\|$ and $d(x)$ are constructed using only $f$ and the set $Q$.

Minkowski gauge. Assume $Q$ is centrally symmetric with nonempty interior. The Minkowski gauge of $Q$ is the norm
    $\|x\|_Q \triangleq \inf\{\lambda \geq 0 : x \in \lambda Q\}.$

Lemma (Affine invariance). The function $f(x)$ has a Lipschitz continuous gradient with respect to the norm $\|\cdot\|_Q$ with constant $L_Q > 0$, i.e.
    $f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{1}{2} L_Q \|y - x\|_Q^2, \quad x, y \in Q,$
if and only if the function $f(Aw)$ has a Lipschitz continuous gradient with respect to the norm $\|\cdot\|_{A^{-1}Q}$ with the same constant $L_Q$.

A similar result holds for strong convexity. Note that $\|x\|_Q = \|-x\|_Q$ since $Q$ is centrally symmetric.
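The gauge $\|x\|_Q$ can be evaluated with nothing more than a membership oracle for $Q$, by bisection on $\lambda$. The sketch below does this for an ellipsoidal $Q$, an illustrative assumption chosen because its gauge has the closed form $\sqrt{x^T P x}$ against which the bisection can be checked.

# Minkowski gauge ||x||_Q = inf{lam >= 0 : x in lam*Q} via bisection on lam,
# given only a membership oracle for a centrally symmetric convex Q with
# 0 in its interior. The initial hi must be an upper bound on the gauge.
import numpy as np

def gauge(x, member, hi=1e6, tol=1e-10):
    lo = 0.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if member(x / mid):
            hi = mid                         # x in mid*Q, so ||x||_Q <= mid
        else:
            lo = mid
    return hi

rng = np.random.default_rng(4)
B = rng.standard_normal((5, 5))
P = B.T @ B + np.eye(5)
member = lambda y: y @ P @ y <= 1.0          # Q = {x : x'Px <= 1}
x = rng.standard_normal(5)
print(gauge(x, member), np.sqrt(x @ P @ x))  # the two values agree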

Choosing the Prox

How do we choose the prox? Start with two definitions.

Definition (Banach-Mazur distance). Suppose $\|\cdot\|_X$ and $\|\cdot\|_Y$ are two norms on a space $E$; the distortion $d(\|\cdot\|_X, \|\cdot\|_Y)$ is the smallest product $ab > 0$ such that
    $\tfrac{1}{b}\|x\|_Y \leq \|x\|_X \leq a\|x\|_Y, \quad \text{for all } x \in E.$
Then $\log d(\|\cdot\|_X, \|\cdot\|_Y)$ is the Banach-Mazur distance between $\|\cdot\|_X$ and $\|\cdot\|_Y$.
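As a concrete instance of the definition, for $\|\cdot\|_X = \ell_1$ and $\|\cdot\|_Y = \ell_2$ on $\mathbb{R}^n$ the distortion is $\sqrt{n}$, since $\|x\|_2 \le \|x\|_1 \le \sqrt{n}\|x\|_2$ with both bounds attained. The small sketch below samples random points to approach this value; the sample size is an arbitrary choice.

# Numerical illustration of the distortion between ||.||_1 and ||.||_2 on R^n.
# Random sampling only approaches the exact value sqrt(n), which is attained
# at a coordinate vector and at the all-ones vector.
import numpy as np

n = 10
rng = np.random.default_rng(5)
X = rng.standard_normal((100000, n))
ratio = np.linalg.norm(X, 1, axis=1) / np.linalg.norm(X, 2, axis=1)
a_hat = ratio.max()                          # estimate of a in ||x||_1 <= a ||x||_2
b_hat = 1.0 / ratio.min()                    # estimate of b in ||x||_2 <= b ||x||_1
print(a_hat * b_hat, np.sqrt(n))             # sampled distortion vs sqrt(n)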

Choosing the Prox

Regularity constant. The regularity constant of $(E, \|\cdot\|)$ was defined in [Juditsky and Nemirovski, 2008] to study large deviations of vector-valued martingales.

Definition [Juditsky and Nemirovski, 2008] (Regularity constant of a Banach space $(E, \|\cdot\|)$). The smallest constant $D > 0$ for which there exists a smooth norm $p(x)$ such that
- the prox $p(x)^2/2$ has a Lipschitz continuous gradient w.r.t. the norm $p(x)$, with constant $\mu$ where $1 \leq \mu \leq D$;
- the norm $p(x)$ satisfies
    $\|x\| \leq p(x) \leq \left(\tfrac{D}{\mu}\right)^{1/2} \|x\|, \quad \text{for all } x \in E,$
  i.e. $d(p(x), \|\cdot\|) \leq (D/\mu)^{1/2}$.

Complexity

Using the algorithm in [Nesterov, 2005] to solve
    minimize    $f(x)$
    subject to  $x \in Q$.

Proposition [d'Aspremont and Jaggi, 2013] (Affine invariant complexity bounds). Suppose $f(x)$ has a Lipschitz continuous gradient with constant $L_Q$ with respect to the norm $\|\cdot\|_Q$ and the space $(\mathbb{R}^n, \|\cdot\|_Q^*)$ is $D_Q$-regular; then the smooth algorithm in [Nesterov, 2005] will produce an $\epsilon$-solution in at most
    $N_{\max} = \sqrt{\frac{4 L_Q D_Q}{\epsilon}}$
iterations. Furthermore, the constants $L_Q$ and $D_Q$ are affine invariant.

We can show $C_f \leq L_Q D_Q$, but it is not clear if the bound is attained...

Complexity

A few more facts about $L_Q$ and $D_Q$...

Suppose we scale $Q \to \alpha Q$, with $\alpha > 0$; then
- the Lipschitz constant satisfies $\alpha^2 L_Q \leq L_{\alpha Q}$,
- the smoothness term $D_Q$ remains unchanged.

Given our choice of norm (hence $L_Q$), $L_Q D_Q$ is the best possible bound.

Also, from [Juditsky and Nemirovski, 2008], working in the dual space:
- the regularity constant decreases on a subspace $F$, i.e. $D_{Q \cap F} \leq D_Q$,
- from $D$-regular spaces $(E_i, \|\cdot\|_i)$, we can construct a $(2D + 2)$-regular product space $E_1 \times \cdots \times E_m$.

Complexity: $\ell_1$ example

Minimizing a smooth convex function over the unit simplex:
    minimize    $f(x)$
    subject to  $1^T x \leq 1,\; x \geq 0$,
in $x \in \mathbb{R}^n$.

Choosing $\|\cdot\|_1$ as the norm and $d(x) = \log n + \sum_{i=1}^n x_i \log x_i$ as the prox function, the complexity is bounded by
    $\sqrt{\frac{8 L_1 \log n}{\epsilon}}$
(note $L_1$ is the lowest Lipschitz constant among all $\ell_p$ norm choices).

Symmetrizing the simplex into the $\ell_1$ ball: the space $(\mathbb{R}^n, \|\cdot\|_\infty)$ is $2 \log n$ regular [Juditsky and Nemirovski, 2008, Ex. 3.2]. The prox function chosen here is $\|\cdot\|_\alpha^2/2$, with $\alpha = 2\log n/(2\log n - 1)$, and our complexity bound is
    $\sqrt{\frac{16 L_1 \log n}{\epsilon}}.$
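To see why $L_1$ tends to be small, note that for a quadratic $f(x) = x^T H x / 2$ the gradient's Lipschitz constant w.r.t. $\|\cdot\|_1$ is the $\ell_1 \to \ell_\infty$ operator norm $\max_{ij}|H_{ij}|$, while w.r.t. $\|\cdot\|_2$ it is the spectral norm $\|H\|_2$. The sketch below compares the two on a random positive semidefinite $H$ (an illustrative assumption).

# Lipschitz constants of grad f for a quadratic, under the l1 and l2 norms.
import numpy as np

rng = np.random.default_rng(6)
G = rng.standard_normal((200, 200))
H = G @ G.T / 200.0                          # random positive semidefinite matrix
L1 = np.max(np.abs(H))                       # Lipschitz constant w.r.t. ||.||_1
L2 = np.linalg.norm(H, 2)                    # Lipschitz constant w.r.t. ||.||_2
print(L1, L2)                                # L1 is typically noticeably smaller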

In Practice

Easy and hard problems. The parameter $L_Q$ satisfies
    $f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{1}{2} L_Q \|y - x\|_Q^2, \quad x, y \in Q.$
On easy problems, $\|\cdot\|_Q$ is large in the directions where $f$ varies most, i.e. the sublevel sets of $f(x)$ and $Q$ are aligned.

For $\ell_p$ spaces with $p \in [2, \infty]$, the unit balls $B_p$ have low regularity constants,
    $D_{B_p} \leq \min\{p - 1,\; 2 \log n\},$
while $D_{B_1} = n$ (worst case). By duality, problems over unit balls $B_q$ for $q \in [1, 2]$ are easier. Optimizing over cubes is harder.

Conclusion

An affine invariant complexity bound for the optimal algorithm in [Nesterov, 1983]:
    $N_{\max} = \sqrt{\frac{4 L_Q D_Q}{\epsilon}}.$
Matches the best known bounds on key examples.

Open problems:
- Prove optimality of the product $L_Q D_Q$. Does it match the curvature $C_f$?
- Symmetrize non-symmetric sets $Q$.
- Systematic, tractable procedure for smoothing $Q$.

References

Alexandre d'Aspremont and Martin Jaggi. An optimal affine invariant smooth minimization algorithm. arXiv preprint arXiv:1301.0465, 2013.

A. Juditsky and A. S. Nemirovski. Large deviations of vector-valued martingales in 2-smooth normed spaces. arXiv preprint arXiv:0809.0813, 2008.

Y. Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady, 27(2):372-376, 1983.

Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.

Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. Society for Industrial and Applied Mathematics, Philadelphia, 1994.