Optimization for Machine Learning


Optimization for Machine Learning (Problems; Algorithms - A)
Suvrit Sra, Massachusetts Institute of Technology
PKU Summer School on Data Science (July 2017)

Course materials: http://suvrit.de/teaching.html
Some references: Introductory Lectures on Convex Optimization (Nesterov); Convex Optimization (Boyd & Vandenberghe); Nonlinear Programming (Bertsekas); Convex Analysis (Rockafellar); Fundamentals of Convex Analysis (Hiriart-Urruty, Lemaréchal); Lectures on Modern Convex Optimization (Nemirovski); Optimization for Machine Learning (Sra, Nowozin, Wright); Theory of Convex Optimization for Machine Learning (Bubeck); NIPS 2016 Optimization Tutorial (Bach, Sra).
Some related courses: EE227A, Spring 2013 (Sra, UC Berkeley); 10-801, Spring 2014 (Sra, CMU); EE364a,b (Boyd, Stanford); EE236b,c (Vandenberghe, UCLA).
Venues: NIPS, ICML, UAI, AISTATS, SIOPT, Math. Prog.
Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 2 / 36

Lecture Plan: Introduction (3 lectures); Problems and algorithms (5 lectures); Non-convex optimization, perspectives (2 lectures).

Constrained problems

Optimality (constrained)
For every $x, y \in \operatorname{dom} f$, we have $f(y) \ge f(x) + \langle \nabla f(x),\, y - x\rangle$.
Thus, $x^*$ is optimal if and only if $\langle \nabla f(x^*),\, y - x^*\rangle \ge 0$ for all $y \in \mathcal{X}$.
If $\mathcal{X} = \mathbb{R}^n$, this reduces to $\nabla f(x^*) = 0$.
If $\nabla f(x^*) \neq 0$, it defines a supporting hyperplane to $\mathcal{X}$ at $x^*$.

Optimality conditions (constrained)
Proof: Suppose there is $y \in \mathcal{X}$ such that $\langle \nabla f(x^*),\, y - x^*\rangle < 0$.
Using the mean-value theorem of calculus, there is $\xi \in [0, 1]$ s.t.
$f(x^* + t(y - x^*)) = f(x^*) + \langle \nabla f(x^* + \xi t(y - x^*)),\, t(y - x^*)\rangle$
(we applied the MVT to $g(t) := f(x^* + t(y - x^*))$).
For sufficiently small $t$, since $\nabla f$ is continuous, by the assumption on $y$,
$\langle \nabla f(x^* + \xi t(y - x^*)),\, y - x^*\rangle < 0$.
This in turn implies that $f(x^* + t(y - x^*)) < f(x^*)$.
Since $\mathcal{X}$ is convex, $x^* + t(y - x^*) \in \mathcal{X}$ is also feasible.
Contradiction to local optimality of $x^*$.

Example: projection operator
$P_{\mathcal{X}}(z) := \operatorname{argmin}_{x \in \mathcal{X}} \|x - z\|_2$.
(Let $\mathcal{X}$ be nonempty, closed and convex; then the projection is unique.)
Optimality condition: $x^* = P_{\mathcal{X}}(z)$ iff $\langle x^* - z,\, y - x^*\rangle \ge 0$ for all $y \in \mathcal{X}$.
Exercise: Prove that projection is nonexpansive, i.e., $\|P_{\mathcal{X}}(x) - P_{\mathcal{X}}(y)\|_2 \le \|x - y\|_2$ for all $x, y \in \mathbb{R}^n$.
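As a small sketch (not from the slides), the two properties above can be checked numerically for the Euclidean ball, where the projection has a closed form; the helper name `project_ball` is my own:

```python
import numpy as np

def project_ball(z, radius=1.0):
    """Euclidean projection onto {x : ||x||_2 <= radius}: rescale if outside."""
    nrm = np.linalg.norm(z)
    return z if nrm <= radius else (radius / nrm) * z

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)

# Optimality condition: <x* - z, w - x*> >= 0 for all feasible w (sampled check)
xs = project_ball(x)
for _ in range(100):
    w = project_ball(rng.standard_normal(5))     # an arbitrary feasible point
    assert np.dot(xs - x, w - xs) >= -1e-9

# Nonexpansiveness: ||Px - Py|| <= ||x - y||
assert np.linalg.norm(project_ball(x) - project_ball(y)) <= np.linalg.norm(x - y) + 1e-12
print("projection checks passed")
```

The same sampled-inequality pattern works for any convex set whose projection you can compute.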

Feasible descent
$\min\ f(x)$ s.t. $x \in \mathcal{X}$; optimality: $\langle \nabla f(x^*),\, x - x^*\rangle \ge 0$ for all $x \in \mathcal{X}$.

Feasible descent
$x^{k+1} = x^k + \alpha_k d^k$
$d^k$ is a feasible direction, i.e., $x^k + \alpha_k d^k \in \mathcal{X}$.
$d^k$ must also be a descent direction, i.e., $\langle \nabla f(x^k),\, d^k\rangle < 0$.
Stepsize $\alpha_k$ is chosen to ensure feasibility and descent.
Since $\mathcal{X}$ is convex, all feasible directions are of the form $d^k = \gamma(z - x^k)$, $\gamma > 0$, where $z \in \mathcal{X}$ is any feasible vector. Hence
$x^{k+1} = x^k + \alpha_k(z^k - x^k)$, $\alpha_k \in (0, 1]$.

Cone of feasible directions
(Figure: the cone of feasible directions $d$ at a point $x \in \mathcal{X}$.)

Frank-Wolfe / conditional gradient method
Optimality: $\langle \nabla f(x^k),\, z - x^k\rangle \ge 0$ for all $z \in \mathcal{X}$.
Aim: If not optimal, generate a feasible direction $d^k = z^k - x^k$ that obeys the descent condition $\langle \nabla f(x^k),\, d^k\rangle < 0$.
Frank-Wolfe (conditional gradient) method:
Let $z^k \in \operatorname{argmin}_{x \in \mathcal{X}}\ \langle \nabla f(x^k),\, x - x^k\rangle$;
$x^{k+1} = x^k + \alpha_k(z^k - x^k)$ (use different methods to select $\alpha_k$).
Due to M. Frank and P. Wolfe (1956).
Practical when solving a linear problem over $\mathcal{X}$ is easy.
Very popular in machine learning in recent years.
Refinements, several variants (including nonconvex).
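A minimal sketch of the iteration above, on a toy quadratic over the $\ell_1$ ball (whose linear oracle returns a signed coordinate vector); the names `frank_wolfe` and `lo_l1` and the example data are my own, not from the slides:

```python
import numpy as np

def frank_wolfe(grad, lo, x0, iters=200):
    """Frank-Wolfe: x_{k+1} = x_k + a_k (z_k - x_k) with a_k = 2/(k+2)."""
    x = x0.copy()
    for k in range(iters):
        z = lo(grad(x))                  # linear oracle: argmin_{z in X} <g, z>
        x = x + 2.0 / (k + 2) * (z - x)
    return x

def lo_l1(g):
    """LO over the l1 ball: minimize <g, z> at a signed coordinate vertex."""
    z = np.zeros_like(g)
    i = np.argmax(np.abs(g))
    z[i] = -np.sign(g[i])
    return z

c = np.array([0.9, 0.1, 0.0])            # target on the boundary of the l1 ball
x = frank_wolfe(lambda x: x - c, lo_l1, np.zeros(3))
print(np.round(x, 3))                    # iterate should approach c
```

Each iterate stays feasible automatically, since it is a convex combination of points in $\mathcal{X}$.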

Frank-Wolfe: convergence
Assumption: There is a $C \ge 0$ s.t. for all $x, z \in \mathcal{X}$ and $\alpha \in (0, 1)$:
$f((1-\alpha)x + \alpha z) \le f(x) + \alpha\langle \nabla f(x),\, z - x\rangle + \tfrac12 C\alpha^2$.
Let $\alpha_k = \frac{2}{k+2}$. Recall $x^{k+1} = (1-\alpha_k)x^k + \alpha_k z^k$; thus,
$f(x^{k+1}) - f(x^*) = f((1-\alpha_k)x^k + \alpha_k z^k) - f(x^*)$
$\le f(x^k) - f(x^*) + \alpha_k\langle \nabla f(x^k),\, z^k - x^k\rangle + \tfrac12\alpha_k^2 C$
$\le f(x^k) - f(x^*) + \alpha_k\langle \nabla f(x^k),\, x^* - x^k\rangle + \tfrac12\alpha_k^2 C$
$\le f(x^k) - f(x^*) - \alpha_k(f(x^k) - f(x^*)) + \tfrac12\alpha_k^2 C$
$= (1-\alpha_k)(f(x^k) - f(x^*)) + \tfrac12\alpha_k^2 C$.
A simple induction (Verify!) then shows that $f(x^k) - f(x^*) \le \frac{2C}{k+2}$, $k \ge 0$.
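The "simple induction" invoked above can be written out as follows; this is a sketch with $h_k := f(x^k) - f(x^*)$, using the recursion just derived and the inductive hypothesis $h_k \le \frac{2C}{k+2}$:

```latex
\begin{align*}
h_{k+1} &\le (1-\alpha_k)\, h_k + \tfrac12 \alpha_k^2 C,
  \qquad \alpha_k = \tfrac{2}{k+2} \\
&\le \Bigl(1 - \frac{2}{k+2}\Bigr)\frac{2C}{k+2} + \frac{2C}{(k+2)^2}
  \qquad \text{(inductive hypothesis)} \\
&= \frac{2Ck}{(k+2)^2} + \frac{2C}{(k+2)^2}
  = \frac{2C(k+1)}{(k+2)^2}
  \le \frac{2C}{k+3},
\end{align*}
```

where the last inequality holds because $(k+1)(k+3) = k^2 + 4k + 3 \le (k+2)^2$; the base case follows from the recursion with $\alpha_0 = 1$, which gives $h_1 \le \tfrac12 C$.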

Example: linear oracle
Suppose $\mathcal{X} = \{x : \|x\|_p \le 1\}$ for $p > 1$. Write the linear oracle (LO) as a maximization problem:
$\max_z\ \langle g, z\rangle$ s.t. $\|z\|_p \le 1$.
What is the answer?
Hint: Recall (Hölder) $\langle g, z\rangle \le \|z\|_p \|g\|_q$. Pick $z$ to obtain equality.
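One way to check the Hölder-equality answer numerically (a spoiler for the exercise: $z_i \propto \operatorname{sign}(g_i)|g_i|^{q-1}$, normalized onto the $\ell_p$ sphere); the helper name `lo_lp_ball` is my own:

```python
import numpy as np

def lo_lp_ball(g, p):
    """Maximize <g, z> over ||z||_p <= 1 via the Hoelder equality case."""
    q = p / (p - 1.0)                        # conjugate exponent: 1/p + 1/q = 1
    z = np.sign(g) * np.abs(g) ** (q - 1)
    return z / np.linalg.norm(z, p)          # normalize onto the lp sphere

rng = np.random.default_rng(1)
g = rng.standard_normal(6)
for p in (1.5, 2.0, 3.0):
    z = lo_lp_ball(g, p)
    q = p / (p - 1.0)
    assert np.isclose(np.linalg.norm(z, p), 1.0)
    assert np.isclose(g @ z, np.linalg.norm(g, q))   # attains the Hoelder bound
print("lp linear-oracle checks passed")
```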

Example: linear oracle (trace norm)
Trace-norm LO: $\max_Z\ \langle G, Z\rangle$ s.t. $\sum_i \sigma_i(Z) \le 1$.
$\max\{\langle G, Z\rangle : \|Z\|_{\mathrm{tr}} \le 1\}$ is just the dual norm to the trace norm; see Lectures 1-3 for more on dual norms.
It can be shown that $\|G\|_2$ (the operator norm) is the dual norm here.
An optimal $Z$ satisfies $\langle G, Z\rangle = \|G\|_2$; use Lanczos (or the power method) to compute the top singular vectors.
(For more examples: Jaggi, "Revisiting Frank-Wolfe: ...")
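The power-method route mentioned above can be sketched as follows: iterate with $G^T G$ to get the top right singular vector, form the rank-one $Z = u_1 v_1^T$, and check that $\langle G, Z\rangle = \|G\|_2$. The function name `top_singular_pair` and the test matrix are my own:

```python
import numpy as np

def top_singular_pair(G, iters=500):
    """Power method on G^T G: returns (u1, v1, sigma1)."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(G.shape[1])
    for _ in range(iters):
        v = G.T @ (G @ v)                # one step of the power iteration
        v /= np.linalg.norm(v)
    u = G @ v
    sigma = np.linalg.norm(u)
    return u / sigma, v, sigma

G = np.random.default_rng(2).standard_normal((8, 5))
u, v, sigma = top_singular_pair(G)
Z = np.outer(u, v)                       # LO answer: rank one, trace norm 1
assert np.isclose(np.trace(G.T @ Z), sigma, atol=1e-5)   # <G, Z> = sigma1
assert np.isclose(sigma, np.linalg.norm(G, 2), atol=1e-5)
print("trace-norm LO check passed")
```

In practice Lanczos converges faster than plain power iteration, but the output is the same top singular pair.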

Extensions
How about FW for nonconvex objective functions?
What about FW methods that can converge faster than $O(1/k)$?
Nonconvex FW is possible. It works, i.e., it satisfies first-order optimality conditions to $\epsilon$-accuracy in $O(1/\epsilon^2)$ iterations (Lacoste-Julien 2016; Reddi et al. 2016).
Linear convergence holds under quite strong assumptions on both $f$ and $\mathcal{X}$; alternatively, use a more complicated method: FW with away steps (Guélat-Marcotte 1986); more recently (Lacoste-Julien, Jaggi 2015).

Quadratic oracle: projection methods
FW can be quite slow.
If $\mathcal{X}$ is not compact, the LO does not make sense.
A possible alternative (with other weaknesses, though!): if the constraint set $\mathcal{X}$ is simple, i.e., we can easily solve projections
$\min\ \tfrac12\|x - y\|^2$ s.t. $x \in \mathcal{X}$.
Projected gradient: $x^{k+1} = P_{\mathcal{X}}\bigl(x^k - \alpha_k \nabla f(x^k)\bigr)$, $k = 0, 1, \ldots$, where $P_{\mathcal{X}}$ denotes the above orthogonal projection.
PG can be much faster than the $O(1/k)$ of FW (e.g., $O(e^{-k})$ for strongly convex $f$); but the LO can sometimes be much faster than projections.

Projected gradient: convergence
Depends on the following crucial properties of $P$:
Nonexpansivity: $\|Px - Py\|_2 \le \|x - y\|_2$
Firm nonexpansivity: $\|Px - Py\|_2^2 \le \langle Px - Py,\, x - y\rangle$
Using projections, essentially the convergence analysis with $\alpha_k = 1/L$ for the unconstrained case works.
Exercise: Let $f(x) = \tfrac12\|Ax - b\|_2^2$. Write a Matlab/Python script to minimize this function over the convex set $\mathcal{X} := \{-1 \le x_i \le 1\}$ using projected gradient as well as Frank-Wolfe. Compare the two.
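A sketch of the exercise above; the problem dimensions, random data, and iteration counts are my choices, not the slides'. Over the box, the projection is a coordinate-wise clip and the FW linear oracle returns a sign vector (a vertex):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient

# Projected gradient over X = [-1, 1]^n: the projection is np.clip
x_pg = np.zeros(10)
for _ in range(2000):
    x_pg = np.clip(x_pg - grad(x_pg) / L, -1.0, 1.0)

# Frank-Wolfe: LO over the box is z = -sign(gradient)
x_fw = np.zeros(10)
for k in range(2000):
    z = -np.sign(grad(x_fw))
    x_fw = x_fw + 2.0 / (k + 2) * (z - x_fw)

print(f"PG objective: {f(x_pg):.5f}   FW objective: {f(x_fw):.5f}")
```

On this well-conditioned quadratic, PG's linear rate typically reaches the optimum to machine precision while FW still carries its $O(1/k)$ gap.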

Duality

Primal problem
Let $f_i : \mathbb{R}^n \to \mathbb{R}$ ($0 \le i \le m$). Generic nonlinear program:
$\min\ f_0(x)$ s.t. $f_i(x) \le 0$, $1 \le i \le m$, $x \in \mathcal{D}$.  (P)
Def. Domain: the set $\mathcal{D} := \operatorname{dom} f_0 \cap \operatorname{dom} f_1 \cap \cdots \cap \operatorname{dom} f_m$.
We call (P) the primal problem; the variable $x$ is the primal variable.
We will attach to (P) a dual problem.
In our initial derivation: no restriction to convexity.

Lagrangian
To the primal problem, associate the Lagrangian $L : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$,
$L(x, \lambda) := f_0(x) + \sum_{i=1}^m \lambda_i f_i(x)$.
The variables $\lambda \in \mathbb{R}^m$ are called Lagrange multipliers.
If $x$ is feasible and $\lambda \ge 0$, then we get the lower bound
$f_0(x) \ge L(x, \lambda)$ for all $x \in \mathcal{X}$, $\lambda \in \mathbb{R}^m_+$.
The Lagrangian helps write the problem in unconstrained form.

Lagrangian
Claim: Since $f_0(x) \ge L(x, \lambda)$ for $x \in \mathcal{X}$, $\lambda \in \mathbb{R}^m_+$, the primal optimal value is
$p^* = \inf_{x \in \mathcal{X}} \sup_{\lambda \ge 0} L(x, \lambda)$.
Proof: If $x$ is not feasible, then some $f_i(x) > 0$; in this case, the inner sup is $+\infty$, so the claim is true by definition.
If $x$ is feasible, each $f_i(x) \le 0$, so $\sup_\lambda \sum_i \lambda_i f_i(x) = 0$.

Lagrange dual function
Def. We define the Lagrangian dual as $g(\lambda) := \inf_x L(x, \lambda)$.
Observations: $g$ is a pointwise inf of affine functions of $\lambda$; thus, $g$ is concave. It may take the value $-\infty$.
Recall: $f_0(x) \ge L(x, \lambda)$ for all $x \in \mathcal{X}$, $\lambda \ge 0$; thus, for any $x \in \mathcal{X}$,
$f_0(x) \ge \inf_x L(x, \lambda) =: g(\lambda)$.
Now minimize over $x$ on the lhs, to obtain:
$p^* \ge g(\lambda)$ for all $\lambda \in \mathbb{R}^m_+$.

Lagrange dual problem
$\sup_\lambda\ g(\lambda)$ s.t. $\lambda \ge 0$.
Dual feasible: $\lambda \ge 0$ and $g(\lambda) > -\infty$.
Dual optimal: $\lambda^*$ if the sup is achieved.
The Lagrange dual is always concave, regardless of the original problem.

Weak duality
Def. Denote the dual optimal value by $d^*$, i.e., $d^* := \sup_{\lambda \ge 0} g(\lambda)$.
Theorem (weak duality). For problem (P), we have $p^* \ge d^*$.
Proof: We showed that for all $\lambda \in \mathbb{R}^m_+$, $p^* \ge g(\lambda)$. Thus, it follows that $p^* \ge \sup g(\lambda) = d^*$.
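Weak duality is easy to see on a one-dimensional example (mine, not from the slides): for $\min x^2$ s.t. $1 - x \le 0$, the Lagrangian minimizer is $x = \lambda/2$, giving a closed-form dual function that never exceeds $p^* = 1$:

```python
import numpy as np

# Primal: min x^2  s.t.  1 - x <= 0,  so p* = 1 at x* = 1.
# L(x, lam) = x^2 + lam*(1 - x); inner minimizer x = lam/2 gives:
def g(lam):
    return lam - lam ** 2 / 4.0             # dual function, concave in lam

p_star = 1.0
lams = np.linspace(0.0, 10.0, 1001)
assert np.all(g(lams) <= p_star + 1e-12)    # weak duality: g(lam) <= p*
assert np.isclose(g(lams).max(), p_star)    # Slater holds, so d* = p* here
print("weak duality verified; d* =", g(lams).max())
```

The maximum is attained at $\lambda^* = 2$, where $g(2) = 1 = p^*$: this problem also exhibits strong duality.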

Duality gap
$p^* - d^* \ge 0$.
Strong duality if the duality gap is zero: $p^* = d^*$.
Notice: both $p^*$ and $d^*$ may be $+\infty$.
Several sufficient conditions are known, especially for convex optimization. Easy necessary and sufficient conditions: unknown.

Example: Slater's sufficient conditions
$\min\ f_0(x)$ s.t. $f_i(x) \le 0$, $1 \le i \le m$, $Ax = b$.
Constraint qualification: There exists $x \in \operatorname{ri}\mathcal{D}$ s.t. $f_i(x) < 0$, $Ax = b$. That is, there is a strictly feasible point.
Theorem. Let the primal problem be convex. If there is a point that is strictly feasible for the non-affine constraints (and merely feasible for the affine ones), then strong duality holds. Moreover, the dual optimum is attained (i.e., $d^* > -\infty$).
Reading: Read BV 5.3.2 for a proof.

Example: failure of strong duality
$\min_{x,y}\ e^{-x}$ s.t. $x^2/y \le 0$, over the domain $\mathcal{D} = \{(x, y) : y > 0\}$.
Clearly, the only feasible points have $x = 0$. So $p^* = 1$.
$L(x, y, \lambda) = e^{-x} + \lambda x^2/y$, so the dual function is
$g(\lambda) = \inf_{x,\, y > 0}\ \bigl(e^{-x} + \lambda x^2/y\bigr) = 0$ if $\lambda \ge 0$, and $-\infty$ if $\lambda < 0$.
Dual problem: $d^* = \max_\lambda\ 0$ s.t. $\lambda \ge 0$.
Thus, $d^* = 0$, and the gap is $p^* - d^* = 1$. Here, we had no strictly feasible solution.
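The claim $g(\lambda) = 0$ for $\lambda \ge 0$ can be made concrete numerically: along the path $y = e^x$ the Lagrangian tends to 0 but is never attained. The path choice and tolerances below are mine:

```python
import numpy as np

# min e^{-x}  s.t.  x^2 / y <= 0,  y > 0.  Feasible only at x = 0, so p* = 1.
def L(x, y, lam):
    return np.exp(-x) + lam * x ** 2 / y

# For lam >= 0 the infimum is 0: push x -> inf with y growing fast enough.
xs = np.linspace(1.0, 50.0, 200)
vals = L(xs, np.exp(xs), lam=1.0)        # (1 + x^2) e^{-x} -> 0
assert vals.min() < 1e-6                 # infimum approached, never attained

p_star, d_star = 1.0, 0.0
print("duality gap:", p_star - d_star)   # 1.0
```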

Zero duality gap: nonconvex example
Trust region subproblem (TRS): $\min\ x^T A x + 2b^T x$ s.t. $x^T x \le 1$.
$A$ is symmetric but not necessarily semidefinite!
Theorem. TRS always has zero duality gap.
Remark: The above theorem is extremely important; it is part of a family of related results for certain quadratic nonconvex problems.

Example: maxent
$\min\ \sum_i x_i \log x_i$ s.t. $Ax \le b$, $\mathbf{1}^T x = 1$, $x > 0$.
The convex conjugate of $f(x) = x\log x$ is $f^*(y) = e^{y-1}$.
Dual problem:
$\max_{\lambda,\nu}\ g(\lambda, \nu) = -b^T\lambda - \nu - \sum_{i=1}^n e^{-(A^T\lambda)_i - \nu - 1}$ s.t. $\lambda \ge 0$.
If there is $x > 0$ with $Ax \le b$ and $\mathbf{1}^T x = 1$, strong duality holds.
Exercise: Simplify the above dual by optimizing out $\nu$.

Example: dual for support vector machine
$\min_{x,\xi}\ \tfrac12\|x\|_2^2 + C\sum_i \xi_i$ s.t. $Ax \ge \mathbf{1} - \xi$, $\xi \ge 0$.
$L(x, \xi, \lambda, \nu) = \tfrac12\|x\|_2^2 + C\mathbf{1}^T\xi - \lambda^T(Ax - \mathbf{1} + \xi) - \nu^T\xi$
$g(\lambda, \nu) := \inf_{x,\xi}\ L(x, \xi, \lambda, \nu) = \lambda^T\mathbf{1} - \tfrac12\|A^T\lambda\|_2^2$ if $\lambda + \nu = C\mathbf{1}$, and $-\infty$ otherwise.
$d^* = \max_{\lambda \ge 0,\, \nu \ge 0}\ g(\lambda, \nu)$
Exercise: Using $\nu \ge 0$, eliminate $\nu$ from the above problem.

Dual via Fenchel conjugates
$\min\ f_0(x)$ s.t. $f_i(x) \le 0$, $Ax = b$.
$L(x, \lambda, \nu) := f_0(x) + \sum_i \lambda_i f_i(x) + \nu^T(Ax - b)$
$g(\lambda, \nu) = \inf_x\ L(x, \lambda, \nu)$
$= -\nu^T b + \inf_x\ \bigl(x^T A^T\nu + F(x)\bigr)$, where $F(x) := f_0(x) + \sum_i \lambda_i f_i(x)$
$= -\nu^T b - \sup_x\ \bigl(\langle x,\, -A^T\nu\rangle - F(x)\bigr)$
$= -\nu^T b - F^*(-A^T\nu)$.
Not so useful! $F^*$ is hard to compute.

Example: norm regularized problems
$\min_x\ f(x) + \|Ax\|$
Dual problem: $\min_y\ f^*(-A^T y)$ s.t. $\|y\|_* \le 1$ (with $\|\cdot\|_*$ the dual norm).
Say there is $\bar{y}$ with $\|\bar{y}\|_* < 1$ such that $-A^T\bar{y} \in \operatorname{ri}(\operatorname{dom} f^*)$; then we have strong duality (for instance, if $0 \in \operatorname{ri}(\operatorname{dom} f^*)$).

Example: Lasso-like problem

    p* := min_x  ‖Ax − b‖_2 + λ‖x‖_1.

Use the dual-norm representations

    ‖x‖_1 = max { x^T v : ‖v‖_∞ ≤ 1 },
    ‖x‖_2 = max { x^T u : ‖u‖_2 ≤ 1 }.

Saddle-point formulation:

    p* = min_x max_{u,v} { u^T (b − Ax) + v^T x : ‖u‖_2 ≤ 1, ‖v‖_∞ ≤ λ }
       = max_{u,v} min_x { u^T (b − Ax) + x^T v : ‖u‖_2 ≤ 1, ‖v‖_∞ ≤ λ }
       = max_{u,v} { u^T b : A^T u = v, ‖u‖_2 ≤ 1, ‖v‖_∞ ≤ λ }
       = max_u { u^T b : ‖u‖_2 ≤ 1, ‖A^T u‖_∞ ≤ λ }.

(The inner min over x is −∞ unless v = A^T u, which forces the equality constraint in the third line.)

Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 33 / 36
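Any u that is feasible for the final dual problem lower-bounds the primal objective at any x (weak duality), which gives a cheap correctness check. A minimal sketch, assuming NumPy; the helper names `primal`, `dual`, and `feasible` are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.5

def primal(x):
    # ||Ax - b||_2 + lam * ||x||_1
    return np.linalg.norm(A @ x - b) + lam * np.abs(x).sum()

def dual(u):
    # dual objective u^T b (valid lower bound only for feasible u)
    return u @ b

def feasible(u, tol=1e-9):
    # ||u||_2 <= 1 and ||A^T u||_inf <= lam
    return np.linalg.norm(u) <= 1 + tol and np.abs(A.T @ u).max() <= lam + tol

# Build a dual-feasible u from the residual at an arbitrary x:
# scale it so both constraints hold simultaneously.
x0 = rng.standard_normal(5)
r = b - A @ x0
u = r / max(np.linalg.norm(r), np.abs(A.T @ r).max() / lam)
assert feasible(u)
# weak duality: every dual-feasible u lower-bounds every primal value
assert dual(u) <= primal(x0) + 1e-9
```

At the optimum the derivation on the slide says this gap closes; away from it, the gap dual(u) ≤ p* ≤ primal(x0) brackets p* from both sides.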

Example: KKT conditions

    min f_0(x)   s.t.  f_i(x) ≤ 0,  i = 1, …, m.

Recall: ⟨∇f_0(x*), x − x*⟩ ≥ 0 for all feasible x ∈ X.

Can we simplify this using the Lagrangian?

    g(λ) = inf_x L(x, λ) := f_0(x) + ∑_i λ_i f_i(x).

Assume strong duality holds and that both p* and d* are attained. Then there exists a pair (x*, λ*) such that

    p* = f_0(x*) = d* = g(λ*) = min_x L(x, λ*) ≤ L(x*, λ*) ≤ f_0(x*) = p*.

(The last inequality uses λ_i* ≥ 0 and f_i(x*) ≤ 0.) Thus, equalities hold throughout the chain; in particular, x* ∈ argmin_x L(x, λ*).

Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 34 / 36
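The chain of equalities can be traced on a toy problem where everything is available in closed form. A sketch for min x² s.t. 1 − x ≤ 0 (so x* = 1, p* = 1; the Lagrangian L(x, λ) = x² + λ(1 − x) is minimized at x = λ/2, giving g(λ) = λ − λ²/4, maximized at λ* = 2):

```python
import numpy as np

# Toy problem: min x^2  s.t.  1 - x <= 0.
lam_star = 2.0
x_star = lam_star / 2               # argmin_x L(x, lam*) recovers x* = 1
p_star = x_star ** 2                # f_0(x*) = 1
g = lambda lam: lam - lam ** 2 / 4  # dual function for this problem

assert np.isclose(p_star, g(lam_star))         # p* = d*: strong duality
assert np.isclose(lam_star * (1 - x_star), 0)  # lam* f_1(x*) = 0
```

Note that the unconstrained minimizer of L(·, λ*) lands exactly on the primal solution, which is the content of x* ∈ argmin_x L(x, λ*).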

Example: KKT conditions

    x* ∈ argmin_x L(x, λ*).

If f_0, f_1, …, f_m are differentiable, this implies

    ∇_x L(x, λ*)|_{x = x*} = ∇f_0(x*) + ∑_i λ_i* ∇f_i(x*) = 0.

Moreover, since L(x*, λ*) = f_0(x*), we also have

    ∑_i λ_i* f_i(x*) = 0.

But λ_i* ≥ 0 and f_i(x*) ≤ 0, so every term in this sum is nonpositive, and the sum can vanish only if each term is zero. This yields complementary slackness:

    λ_i* f_i(x*) = 0,   i = 1, …, m.

Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 35 / 36

KKT conditions

    f_i(x*) ≤ 0,        i = 1, …, m      (primal feasibility)
    λ_i* ≥ 0,           i = 1, …, m      (dual feasibility)
    λ_i* f_i(x*) = 0,   i = 1, …, m      (complementary slackness)
    ∇_x L(x, λ*)|_{x = x*} = 0           (Lagrangian stationarity)

We showed: if strong duality holds and (x*, λ*) exist, then the KKT conditions are necessary for the pair (x*, λ*) to be optimal.

If the problem is convex, the KKT conditions are also sufficient.

Exercise: Prove the sufficiency of the KKT conditions. Hint: use that L(x, λ*) is convex in x, and conclude from the KKT conditions that g(λ*) = f_0(x*), so that (x*, λ*) is an optimal primal-dual pair.

Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 36 / 36
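The four conditions are mechanical to verify once a candidate pair (x*, λ*) is in hand. A sketch for projection onto the nonpositive orthant, min ½‖x − c‖² s.t. x_i ≤ 0, whose known solution is x* = min(c, 0) with multipliers λ* = max(c, 0):

```python
import numpy as np

c = np.array([1.5, -0.3, 0.0, 2.0])
x = np.minimum(c, 0.0)    # candidate primal solution
lam = np.maximum(c, 0.0)  # candidate multipliers

assert np.all(x <= 1e-12)             # primal feasibility: f_i(x*) <= 0
assert np.all(lam >= 0)               # dual feasibility:   lam_i* >= 0
assert np.allclose(lam * x, 0)        # complementary slackness
assert np.allclose((x - c) + lam, 0)  # stationarity: grad f_0 + sum_i lam_i* grad f_i
```

Since the problem is convex, by the sufficiency direction above these four checks certify that x* is globally optimal.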