Lecture: Smoothing.


Lecture: Smoothing
http://bicmr.pku.edu.cn/~wenzw/opt-2018-fall.html
Acknowledgement: these slides are based on Prof. Lieven Vandenberghe's lecture notes

Outline
- introduction
- smoothing via conjugate
- examples

First-order convex optimization methods

complexity of finding an ε-suboptimal point of f:
- subgradient method: f nondifferentiable with Lipschitz constant G: O((G/ε)^2) iterations
- proximal gradient method: f = g + h, where h is a simple nondifferentiable function and g is differentiable with L-Lipschitz continuous gradient: O(L/ε) iterations
- fast proximal gradient methods: O(√(L/ε)) iterations

Non-differentiable optimization by smoothing

for nondifferentiable f that cannot be handled by the proximal gradient method:
- replace f with a differentiable approximation f_µ (parametrized by µ)
- minimize f_µ by the (fast) gradient method

complexity: #iterations for the (fast) gradient method depends on L_µ/ε_µ
- L_µ is the Lipschitz constant of ∇f_µ
- ε_µ is the accuracy with which the smooth problem is solved

trade-off in the amount of smoothing (choice of µ):
- large L_µ (less smoothing) gives a more accurate approximation
- small L_µ (more smoothing) gives faster convergence

Example: Huber penalty as smoothed absolute value

    φ_µ(z) = { z²/(2µ)      |z| ≤ µ
             { |z| − µ/2    |z| ≥ µ

µ controls accuracy and smoothness:
- accuracy: |z| − µ/2 ≤ φ_µ(z) ≤ |z|
- smoothness: φ_µ''(z) ≤ 1/µ
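A quick numerical check of the accuracy bound above (a sketch using NumPy; the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def huber(z, mu):
    """Huber penalty phi_mu(z): quadratic near 0, linear in the tails."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= mu, z**2 / (2 * mu), np.abs(z) - mu / 2)

mu = 0.5
z = np.linspace(-3, 3, 1001)
phi = huber(z, mu)

# accuracy bound from the slide: |z| - mu/2 <= phi_mu(z) <= |z|
assert np.all(phi <= np.abs(z) + 1e-12)
assert np.all(phi >= np.abs(z) - mu / 2 - 1e-12)
```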

Huber penalty approximation of 1-norm minimization

    f(x) = ||Ax − b||_1,    f_µ(x) = Σ_{i=1}^m φ_µ(a_i^T x − b_i)

accuracy: from f(x) − mµ/2 ≤ f_µ(x) ≤ f(x),

    f(x) − f* ≤ f_µ(x) − f_µ* + mµ/2

to achieve f(x) − f* ≤ ε, we need f_µ(x) − f_µ* ≤ ε_µ = ε − mµ/2

Lipschitz constant of ∇f_µ is L_µ = ||A||_2² / µ

complexity: for µ = ε/m (so ε_µ = ε − mµ/2 = ε/2),

    L_µ/ε_µ = ||A||_2² / (µ(ε − mµ/2)) = 2m ||A||_2² / ε²

i.e., O(√(L_µ/ε_µ)) = O(1/ε) iteration complexity for the fast gradient method
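The accuracy sandwich f(x) − mµ/2 ≤ f_µ(x) ≤ f(x) can be verified on random data (a minimal sketch, assuming NumPy; problem sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, mu = 30, 10, 0.1
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

def huber(z, mu):
    return np.where(np.abs(z) <= mu, z**2 / (2 * mu), np.abs(z) - mu / 2)

f = np.linalg.norm(A @ x - b, 1)          # f(x) = ||Ax - b||_1
f_mu = huber(A @ x - b, mu).sum()         # f_mu(x) = sum of Huber penalties

# accuracy bound: f(x) - m*mu/2 <= f_mu(x) <= f(x)
assert f - m * mu / 2 - 1e-9 <= f_mu <= f + 1e-9

# Lipschitz constant of grad f_mu: ||A||_2^2 / mu (spectral norm squared)
L_mu = np.linalg.norm(A, 2)**2 / mu
```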

Outline
- introduction
- smoothing via conjugate
- examples

Minimum of strongly convex function

if x is a minimizer of a strongly convex function f, then it is unique and

    f(y) ≥ f(x) + (µ/2) ||y − x||_2²    for all y ∈ dom f

(µ is the strong convexity constant of f)

proof: if some y does not satisfy the inequality, then for some small θ > 0,

    f((1−θ)x + θy) ≤ (1−θ)f(x) + θf(y) − (µ/2) θ(1−θ) ||y − x||_2²
                   = f(x) + θ( f(y) − f(x) − (µ/2)||y − x||_2² ) + (µ/2) θ² ||x − y||_2²
                   < f(x)
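For a strongly convex quadratic this inequality can be checked directly (an illustrative sketch, assuming NumPy; here µ is taken as the smallest eigenvalue of Q):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)            # Q >= I, so f is strongly convex
mu = np.linalg.eigvalsh(Q).min()   # strong convexity constant of f
c = rng.standard_normal(n)

f = lambda x: 0.5 * x @ Q @ x + c @ x
x_star = np.linalg.solve(Q, -c)    # unique minimizer: Q x + c = 0

# f(y) >= f(x*) + mu/2 * ||y - x*||^2 for random y
for _ in range(100):
    y = x_star + rng.standard_normal(n)
    gap = np.dot(y - x_star, y - x_star)
    assert f(y) >= f(x_star) + mu / 2 * gap - 1e-9
```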

Conjugate of strongly convex function

suppose f is closed and strongly convex with constant µ, with conjugate

    f*(y) = sup_{x ∈ dom f} ( y^T x − f(x) )

- f* is defined and differentiable at all y, with gradient

    ∇f*(y) = argmax_x ( y^T x − f(x) )

- ∇f* is Lipschitz continuous with constant 1/µ:

    ||∇f*(u) − ∇f*(v)||_2 ≤ (1/µ) ||u − v||_2

outline of proof:
- y^T x − f(x) has a unique maximizer x_y for every y (follows from closedness and strong convexity of f(x) − y^T x)
- ∇f*(y) = x_y
- from strong convexity (with x_u = ∇f*(u), x_v = ∇f*(v)):

    f(x_u) − v^T x_u ≥ f(x_v) − v^T x_v + (µ/2) ||x_u − x_v||_2²
    f(x_v) − u^T x_v ≥ f(x_u) − u^T x_u + (µ/2) ||x_u − x_v||_2²

adding the left- and right-hand sides of the inequalities gives

    µ ||x_u − x_v||_2² ≤ (x_u − x_v)^T (u − v)

by the Cauchy-Schwarz inequality, µ ||x_u − x_v||_2 ≤ ||u − v||_2
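The 1/µ-Lipschitz bound is easy to observe for a quadratic f(x) = x^T Q x / 2, where ∇f*(y) = Q^{-1} y in closed form (an illustrative sketch, assuming NumPy; names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)            # f(x) = x'Qx/2 is strongly convex
mu = np.linalg.eigvalsh(Q).min()   # strong convexity constant

# grad f*(y) = argmax_x (y'x - f(x)) = Q^{-1} y
grad_conj = lambda y: np.linalg.solve(Q, y)

# check ||grad f*(u) - grad f*(v)|| <= (1/mu) ||u - v||
for _ in range(100):
    u, v = rng.standard_normal(n), rng.standard_normal(n)
    lhs = np.linalg.norm(grad_conj(u) - grad_conj(v))
    assert lhs <= np.linalg.norm(u - v) / mu + 1e-9
```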

Proximity function

d is a proximity function for a closed convex set C if
- d is continuous and strongly convex
- C ⊆ dom d

d(x) measures the distance of x to the center x_d = argmin_{x ∈ C} d(x) of C

normalization: we will assume the strong convexity constant is 1 and inf_{x ∈ C} d(x) = 0

for a normalized proximity function,

    d(x) ≥ (1/2) ||x − x_d||_2²    for all x ∈ C

common proximity functions:
- d(x) = (1/2) ||x − u||_2² with x_d = u ∈ C
- d(x) = (1/2) Σ_{i=1}^n w_i (x_i − u_i)² with w_i ≥ 1 and x_d = u ∈ C
- d(x) = Σ_{i=1}^n x_i log x_i + log n for C = {x ⪰ 0 : 1^T x = 1}, with x_d = (1/n)1

example (probability simplex): the entropy d(x) = Σ x_i log x_i + log n and d(x) = (1/2) ||x − (1/n)1||_2²
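The entropy on the simplex really is a normalized proximity function: by Pinsker's inequality it dominates (1/2)||x − (1/n)1||_2². A quick numerical check (a sketch assuming NumPy; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
u = np.ones(n) / n                 # center of the simplex, x_d = (1/n)1

def entropy_prox(x):
    """d(x) = sum_i x_i log x_i + log n, normalized so d(u) = 0."""
    return np.sum(x * np.log(x)) + np.log(n)

for _ in range(200):
    x = rng.dirichlet(np.ones(n))
    x = np.maximum(x, 1e-300)      # guard against exact zeros in log
    # normalized proximity bound: d(x) >= (1/2) ||x - x_d||_2^2
    assert entropy_prox(x) >= 0.5 * np.sum((x - u)**2) - 1e-12
```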

Smoothing via conjugate

conjugate (dual) representation: suppose f can be expressed as

    f(x) = sup_{y ∈ dom h} ( (Ax + b)^T y − h(y) ) = h*(Ax + b)

where h is closed and convex with bounded domain

smooth approximation: choose a proximity function d for C = cl dom h and set

    f_µ(x) = sup_{y ∈ dom h} ( (Ax + b)^T y − h(y) − µ d(y) ) = (h + µd)*(Ax + b)

f_µ is differentiable because h + µd is strongly convex

Example: absolute value

conjugate representation:

    |x| = sup_{−1 ≤ y ≤ 1} xy = h*(x),    h(y) = I_{[−1,1]}(y)

proximity function: choosing d(y) = y²/2 gives the Huber penalty

    f_µ(x) = sup_{−1 ≤ y ≤ 1} ( xy − µy²/2 ) = { x²/(2µ)     |x| ≤ µ
                                               { |x| − µ/2   |x| > µ

proximity function: choosing d(y) = 1 − √(1 − y²) gives

    f_µ(x) = sup_{−1 ≤ y ≤ 1} ( xy + µ√(1 − y²) − µ ) = √(x² + µ²) − µ
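One can confirm that the sup over y ∈ [−1, 1] with d(y) = y²/2 really collapses to the Huber formula, by brute-forcing the sup on a fine grid (an illustrative sketch, assuming NumPy):

```python
import numpy as np

def huber(x, mu):
    return np.where(np.abs(x) <= mu, x**2 / (2 * mu), np.abs(x) - mu / 2)

mu = 0.5
y = np.linspace(-1, 1, 20001)              # grid over dom h = [-1, 1]
for x in np.linspace(-2, 2, 41):
    # f_mu(x) = sup_y ( x*y - mu*d(y) ), d(y) = y^2/2
    direct = np.max(x * y - mu * y**2 / 2)
    assert abs(direct - huber(x, mu)) < 1e-6
```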

another conjugate representation of |x|:

    |x| = sup_{y1 + y2 = 1, y ⪰ 0} x(y1 − y2)

i.e., |x| = h*(Ax) for h = I_C, with

    C = {y ⪰ 0 : y1 + y2 = 1},    A = [1  −1]^T

proximity function for C:

    d(y) = y1 log y1 + y2 log y2 + log 2

smooth approximation:

    f_µ(x) = sup_{y1 + y2 = 1, y ⪰ 0} ( x y1 − x y2 − µ(y1 log y1 + y2 log y2 + log 2) )
           = µ log( (e^{x/µ} + e^{−x/µ}) / 2 )
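This log-cosh approximation obeys the bound |x| − µ log 2 ≤ f_µ(x) ≤ |x| (here D = sup d = log 2). A numerically stable check (a sketch assuming NumPy; the stable rewrite is mine):

```python
import numpy as np

mu = 0.2
x = np.linspace(-5, 5, 1001)

# mu*log((e^{x/mu} + e^{-x/mu})/2), rewritten stably as
# |x| + mu*log1p(exp(-2|x|/mu)) - mu*log(2)
f_mu = np.abs(x) + mu * np.log1p(np.exp(-2 * np.abs(x) / mu)) - mu * np.log(2)

# accuracy: |x| - mu*log(2) <= f_mu(x) <= |x|
assert np.all(f_mu <= np.abs(x) + 1e-12)
assert np.all(f_mu >= np.abs(x) - mu * np.log(2) - 1e-12)
```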

(figure: comparison of the three smooth approximations of the absolute value)

Gradient of smooth approximation

    f_µ(x) = (h + µd)*(Ax + b) = sup_{y ∈ dom h} ( (Ax + b)^T y − h(y) − µ d(y) )

from the properties of the conjugate of a strongly convex function (see "Conjugate of strongly convex function" above):
- f_µ is differentiable, with gradient

    ∇f_µ(x) = A^T argmax_{y ∈ dom h} ( (Ax + b)^T y − h(y) − µ d(y) )

- ∇f_µ(x) is Lipschitz continuous with constant L_µ = ||A||_2² / µ

Accuracy of smooth approximation

    f(x) − µD ≤ f_µ(x) ≤ f(x),    where D = sup_{y ∈ dom h} d(y)

note D < +∞ because dom h is bounded and dom h ⊆ dom d

lower bound follows from

    f_µ(x) = sup_{y ∈ dom h} ( (Ax + b)^T y − h(y) − µ d(y) )
           ≥ sup_{y ∈ dom h} ( (Ax + b)^T y − h(y) − µD )
           = f(x) − µD

upper bound follows from

    f_µ(x) ≤ sup_{y ∈ dom h} ( (Ax + b)^T y − h(y) ) = f(x)

Complexity

to find a solution of the nondifferentiable problem with accuracy f(x) − f* ≤ ε:
solve the smoothed problem with accuracy ε_µ = ε − µD, so that

    f(x) − f* ≤ f_µ(x) + µD − f_µ* ≤ ε_µ + µD = ε

Lipschitz constant of ∇f_µ is L_µ = ||A||_2² / µ

complexity: for µ = ε/(2D),

    L_µ/ε_µ = ||A||_2² / (µ(ε − µD)) = 4D ||A||_2² / ε²

this gives an O(1/ε) iteration bound for the fast gradient method

efficiency in practice can be improved by decreasing µ gradually

Outline
- introduction
- smoothing via conjugate
- examples

Piecewise-linear approximation

    f(x) = max_{i=1,...,m} ( a_i^T x + b_i )

conjugate representation:

    f(x) = sup_{y ⪰ 0, 1^T y = 1} (Ax + b)^T y

proximity function:

    d(y) = Σ_{i=1}^m y_i log y_i + log m

smooth approximation:

    f_µ(x) = µ log( Σ_{i=1}^m e^{(a_i^T x + b_i)/µ} ) − µ log m
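The log-sum-exp smoothing satisfies f(x) − µ log m ≤ f_µ(x) ≤ f(x) (here D = log m). A numerical check with a stable log-sum-exp (a sketch assuming NumPy; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, mu = 8, 3, 0.1
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def f(x):
    """piecewise-linear max"""
    return np.max(A @ x + b)

def f_mu(x):
    """mu * logsumexp((Ax+b)/mu) - mu*log(m), computed stably"""
    z = (A @ x + b) / mu
    zmax = z.max()
    return mu * (zmax + np.log(np.exp(z - zmax).sum())) - mu * np.log(m)

for _ in range(50):
    x = rng.standard_normal(n)
    assert f(x) - mu * np.log(m) - 1e-12 <= f_mu(x) <= f(x) + 1e-12
```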

1-Norm approximation

    f(x) = ||Ax − b||_1

conjugate representation:

    f(x) = sup_{||y||_∞ ≤ 1} (Ax − b)^T y

proximity function:

    d(y) = (1/2) Σ_i w_i y_i²    (with w_i ≥ 1)

smooth approximation: Huber approximation

    f_µ(x) = Σ_{i=1}^m φ_{µ w_i}(a_i^T x − b_i)

Maximum eigenvalue

conjugate representation: for X ∈ S^n,

    f(X) = λ_max(X) = sup_{Y ⪰ 0, tr Y = 1} tr(XY)

proximity function: negative matrix entropy

    d(Y) = Σ_{i=1}^n λ_i(Y) log λ_i(Y) + log n

smooth approximation:

    f_µ(X) = sup_{Y ⪰ 0, tr Y = 1} ( tr(XY) − µ d(Y) )
           = µ log( Σ_{i=1}^n e^{λ_i(X)/µ} ) − µ log n
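The same log-sum-exp bound holds over the eigenvalues: λ_max(X) − µ log n ≤ f_µ(X) ≤ λ_max(X). An illustrative sketch (assuming NumPy; the test matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n, mu = 6, 0.05
M = rng.standard_normal((n, n))
X = (M + M.T) / 2                      # random symmetric test matrix
lam = np.linalg.eigvalsh(X)            # eigenvalues of X

# f_mu(X) = mu * logsumexp(lam/mu) - mu*log(n), computed stably
z = lam / mu
f_mu = mu * (z.max() + np.log(np.exp(z - z.max()).sum())) - mu * np.log(n)

# accuracy: lambda_max - mu*log(n) <= f_mu <= lambda_max
assert lam.max() - mu * np.log(n) - 1e-12 <= f_mu <= lam.max() + 1e-12
```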

Nuclear norm

nuclear norm: f(X) = ||X||_* is the sum of the singular values of X ∈ R^{m×n}

conjugate representation:

    f(X) = sup_{||Y||_2 ≤ 1} tr(X^T Y)

proximity function: d(Y) = (1/2) ||Y||_F²

smooth approximation:

    f_µ(X) = sup_{||Y||_2 ≤ 1} ( tr(X^T Y) − µ d(Y) ) = Σ_i φ_µ(σ_i(X))

the sum of the Huber penalties applied to the singular values of X
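Since d(Y) = ||Y||_F²/2 satisfies D = min(m, n)/2 on the spectral-norm ball, the general accuracy bound specializes to ||X||_* − min(m, n) µ/2 ≤ f_µ(X) ≤ ||X||_*. A quick check (a sketch assuming NumPy; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, mu = 5, 4, 0.3
X = rng.standard_normal((m, n))
sigma = np.linalg.svd(X, compute_uv=False)   # singular values of X

def huber(z, mu):
    return np.where(np.abs(z) <= mu, z**2 / (2 * mu), np.abs(z) - mu / 2)

nuc = sigma.sum()                 # nuclear norm ||X||_*
f_mu = huber(sigma, mu).sum()     # smooth approximation

# accuracy: ||X||_* - min(m,n)*mu/2 <= f_mu <= ||X||_*
assert nuc - min(m, n) * mu / 2 - 1e-12 <= f_mu <= nuc + 1e-12
```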

Lagrange dual function

    minimize    f_0(x)
    subject to  f_i(x) ≤ 0,  i = 1,...,m
                x ∈ C

with f_i(x) convex, C closed and bounded

smooth approximation of the dual function: choose a proximity function d for C,

    g_µ(λ) = inf_{x ∈ C} ( f_0(x) + Σ_{i=1}^m λ_i f_i(x) + µ d(x) )

g_µ is the dual function of the perturbed problem

    minimize    f_0(x) + µ d(x)
    subject to  f_i(x) ≤ 0,  i = 1,...,m
                x ∈ C

References
- D. Bertsekas, Nonlinear Programming (1995), §5.4.5
- Yu. Nesterov, Smooth minimization of non-smooth functions, Mathematical Programming (2005)