Coordinate Update Algorithm Short Course: Subgradients and Subgradient Methods

Coordinate Update Algorithm Short Course: Subgradients and Subgradient Methods
Instructor: Wotao Yin (UCLA Math), Summer 2016

Notation
$f : \mathcal{H} \to \mathbb{R} \cup \{+\infty\}$ is a closed proper convex function
$\operatorname{dom} f := \{x \in \mathbb{R}^n : f(x) < +\infty\}$
$f$ is closed if $\operatorname{epi} f$ is a closed set
$f$ is proper if $\operatorname{dom} f \neq \emptyset$

Definition of subgradient

Differentiable function
$\nabla f$ (the gradient of $f$) is the vector of partial derivatives:
$$\nabla f(x) = \Big[\tfrac{\partial f}{\partial x_1}(x);\ \dots;\ \tfrac{\partial f}{\partial x_n}(x)\Big]$$
If $f$ is convex, then
$$f(y) \ge f(x) + \langle \nabla f(x),\, y - x\rangle, \quad \forall x, y \in \mathcal{H}.^{1,2}$$
1. Restricting $x, y \in \operatorname{dom} f$ is unnecessary since $f(x) = +\infty$ if $x \notin \operatorname{dom} f$.
2. Figure taken from Boyd and Vandenberghe, Convex Optimization.

Non-differentiable function
Assumption: $f$ is a proper function (nonconvexity is allowed)
At $\bar x \in \operatorname{dom} f$, the subdifferential of $f$ is
$$\partial f(\bar x) := \big\{ g \in \mathbb{R}^n : f(y) \ge f(\bar x) + \langle g,\, y - \bar x\rangle,\ \forall y \in \operatorname{dom} f \big\}$$
(defined via a global inequality, not locally or by taking limits)
$g \in \partial f(\bar x)$ is called a subgradient of $f$ at $\bar x$; in some contexts we write $\tilde\nabla f(\bar x)$ for a subgradient
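
Because the subdifferential is defined by a global inequality, a candidate subgradient can be tested numerically by sampling points $y$ and checking $f(y) \ge f(\bar x) + \langle g, y - \bar x\rangle$. Below is a minimal sketch in Python with NumPy; the helper name `is_subgradient` and the test parameters are illustrative, not from the slides. A failed test disproves membership, while passing is only evidence.

```python
import numpy as np

def is_subgradient(f, x_bar, g, trials=2000, radius=10.0, tol=1e-9):
    """Sample points y and test f(y) >= f(x_bar) + <g, y - x_bar>.
    Returns False as soon as the inequality is violated."""
    x_bar = np.asarray(x_bar, dtype=float)
    g = np.asarray(g, dtype=float)
    rng = np.random.default_rng(0)
    for _ in range(trials):
        y = x_bar + radius * rng.standard_normal(x_bar.shape)
        if f(y) < f(x_bar) + g @ (y - x_bar) - tol:
            return False
    return True

# f(x) = |x| on R: every g in [-1, 1] is a subgradient at 0, g = 2 is not
f = lambda x: np.abs(x).sum()
print(is_subgradient(f, [0.0], [0.5]))   # expected: True
print(is_subgradient(f, [0.0], [2.0]))   # expected: False
```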

Existence
If $f \in C^1$ is proper convex, then $\nabla f(x) \in \partial f(x)$ for $x \in \operatorname{dom} f$; in fact, $\partial f(x) = \{\nabla f(x)\}$
If $f$ is proper closed convex and $x \in \operatorname{ri}(\operatorname{dom} f)$, then $\partial f(x)$ is nonempty
Conversely, if the set $\partial f(x)$ is nonempty for all $x \in \operatorname{dom} f$, then $f$ is convex

Computing subgradients

General rules
smooth functions: $\partial f(x) = \{\nabla f(x)\}$
chain rule: $\varphi(x) = f(Ax + b)$ $\Rightarrow$ $\partial\varphi(x) = A^T \partial f(Ax + b)$
positive scaling: $\lambda > 0$ $\Rightarrow$ $\partial(\lambda f)(x) = \lambda\, \partial f(x)$
positive sums: $\alpha, \beta > 0$, $f(x) = \alpha f_1(x) + \beta f_2(x)$ $\Rightarrow$ $\partial f(x) \supseteq \alpha\, \partial f_1(x) + \beta\, \partial f_2(x)$; under additional conditions, e.g. $0 \in \operatorname{sri}(\operatorname{dom} f_1 - \operatorname{dom} f_2)$, $\partial f(x) = \alpha\, \partial f_1(x) + \beta\, \partial f_2(x)$

maximums: $f(x) = \max_{i \in \{1,\dots,n\}} f_i(x)$ $\Rightarrow$ $\partial f(x) = \operatorname{conv}\{\partial f_i(x) : f_i(x) = f(x)\}$
separable sum: $f(x) = \sum_{i=1}^n f_i(x_i)$ $\Rightarrow$ $\partial_x f(x) = \partial_{x_1} f_1(x_1) \times \cdots \times \partial_{x_n} f_n(x_n)$
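
As a quick illustration of the maximum rule: for a pointwise maximum of affine functions, $f(x) = \max_i(\langle a_i, x\rangle + b_i)$, the subdifferential is $\operatorname{conv}\{a_i : i \text{ active}\}$, so any convex combination of active rows is a valid subgradient. A hedged Python sketch (function name and data are illustrative):

```python
import numpy as np

def subgrad_max_affine(A, b, x, tol=1e-12):
    """One subgradient of f(x) = max_i (<a_i, x> + b_i).
    By the maximum rule, any convex combination of the active rows works;
    the average of the active rows is returned for definiteness."""
    vals = A @ x + b
    active = np.flatnonzero(vals >= vals.max() - tol)
    return A[active].mean(axis=0)

A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.zeros(3)
print(subgrad_max_affine(A, b, np.array([2.0, -1.0])))  # only row 0 active -> [1, 0]
print(subgrad_max_affine(A, b, np.zeros(2)))            # all rows active -> their average [0, 0]
```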

Examples
$f(x) = |x|$, $x \in \mathbb{R}$:
$$\partial f(x) = \begin{cases} \{\operatorname{sign}(x)\}, & x \neq 0;\\ [-1, 1], & \text{otherwise} \end{cases}$$
$f(x) = \|x\|_1$, $x \in \mathbb{R}^n$:
$$\partial f(x) = \partial|x_1| \times \cdots \times \partial|x_n|$$

Examples
$f(x) = \sum_{i=1}^n |\langle a_i, x\rangle - b_i|$. Define
$$I_-(x) = \{i : \langle a_i, x\rangle - b_i < 0\},\quad I_+(x) = \{i : \langle a_i, x\rangle - b_i > 0\},\quad I_0(x) = \{i : \langle a_i, x\rangle - b_i = 0\}.$$
Then
$$\partial f(x) = \sum_{i \in I_+(x)} a_i \;-\; \sum_{i \in I_-(x)} a_i \;+\; \sum_{i \in I_0(x)} [-a_i, a_i]$$
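
The formula above gives a concrete recipe for one subgradient: take $\operatorname{sign}(\langle a_i, x\rangle - b_i)\, a_i$ for the nonzero residuals and any element of $[-a_i, a_i]$ (for instance $0$) for the zero ones. A hedged Python sketch (names and data are illustrative):

```python
import numpy as np

def subgrad_abs_residuals(A, b, x, tol=1e-12):
    """One element of the subdifferential of f(x) = sum_i |<a_i, x> - b_i|.
    Rows with zero residual contribute 0, a valid choice from [-a_i, a_i]."""
    r = A @ x - b
    s = np.sign(r)
    s[np.abs(r) <= tol] = 0.0        # pick 0 from the interval for I_0(x)
    return A.T @ s

A = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
b = np.array([1.0, 0.0, 0.0])
print(subgrad_abs_residuals(A, b, np.array([1.0, -1.0])))  # -> [0., -2.]
```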

Examples
$f(x) = \max_{i \in \{1,\dots,n\}} x_i$. Then
$$\partial f(x) = \operatorname{conv}\{e_i : x_i = f(x)\}$$
For example, at the origin $0$, $\partial f(0) = \operatorname{conv}\{e_i : i \in \{1,\dots,n\}\}$

Examples
$f(x) = \|x\|_2$. $f$ is differentiable away from $0$, so
$$\partial f(x) = \Big\{\tfrac{x}{\|x\|_2}\Big\}, \quad x \neq 0.$$
At $0$, go back to the subgradient inequality: $\|y\|_2 \ge 0 + \langle g, y - 0\rangle$.
Thus $g \in \partial f(0)$ if and only if $\tfrac{\langle g, y\rangle}{\|y\|_2} \le 1$ for all $y \neq 0$, i.e., $g$ is in the dual ball $B_2(0, 1)$: $\partial f(0) = B_2(0, 1)$. This is a common pattern!

Examples
$f(x) = \|x\|_\infty = \max_{i \in \{1,\dots,n\}} |x_i|$.
$$\partial f(x) = \operatorname{conv}\{\operatorname{sign}(x_i)\, e_i : |x_i| = f(x)\}, \quad x \neq 0.$$
At $0$, going back to the subgradient inequality: $\|y\|_\infty \ge 0 + \langle g, y\rangle$.
Thus $g \in \partial f(0)$ if and only if $\tfrac{\langle g, y\rangle}{\|y\|_\infty} \le 1$ for all $y \neq 0$. Thus $\partial f(0)$ is the dual ball to the $\ell_\infty$ norm: the $\ell_1$ ball $B_1(0, 1)$.
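
The dual-ball pattern at $0$ can be checked numerically: for $\|\cdot\|_2$ any $g$ with $\|g\|_2 \le 1$ satisfies the subgradient inequality at $0$, and for $\|\cdot\|_\infty$ any $g$ with $\|g\|_1 \le 1$ does. A hedged Python sketch (a brute-force sampling test, not from the slides):

```python
import numpy as np

def passes_at_zero(norm, g, trials=5000, radius=5.0):
    """Test the subgradient inequality norm(y) >= <g, y> at random y."""
    rng = np.random.default_rng(0)
    g = np.asarray(g, dtype=float)
    for _ in range(trials):
        y = radius * rng.standard_normal(g.shape)
        if norm(y) < g @ y - 1e-9:
            return False
    return True

l2 = lambda y: np.linalg.norm(y, 2)
linf = lambda y: np.linalg.norm(y, np.inf)

print(passes_at_zero(l2, [0.6, 0.8]))    # ||g||_2 = 1  -> True
print(passes_at_zero(l2, [0.9, 0.9]))    # ||g||_2 > 1  -> False
print(passes_at_zero(linf, [0.5, 0.5]))  # ||g||_1 = 1  -> True
print(passes_at_zero(linf, [0.8, 0.8]))  # ||g||_1 > 1  -> False
```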

Examples
Let $C$ be a nonempty closed convex set. Define the indicator function
$$\iota_C(x) = \begin{cases} 0, & \text{if } x \in C,\\ +\infty, & \text{otherwise.} \end{cases}$$
Subdifferential of $\iota_C$: let $x \in C$,
$$\partial\iota_C(x) = \{g : \iota_C(y) \ge \iota_C(x) + \langle g, y - x\rangle,\ \forall y\} = \{g : \langle g, y - x\rangle \le 0,\ \forall y \in C\}$$
which is a cone, called the normal cone $N_C(x)$
By convention, if $x \notin C$, then $\partial\iota_C(x) = \emptyset$
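
For a concrete case, take $C$ to be a box $[l, u]$. The condition $\langle g, y - x\rangle \le 0$ for all $y \in C$ separates coordinate-wise: $g_i \ge 0$ where $x_i = u_i$, $g_i \le 0$ where $x_i = l_i$, and $g_i = 0$ where $l_i < x_i < u_i$. A hedged Python sketch of that membership test (names are illustrative, not from the slides):

```python
import numpy as np

def in_normal_cone_box(l, u, x, g, tol=1e-12):
    """Check g in N_C(x) for the box C = {x : l <= x <= u}.
    At a lower bound g_i <= 0, at an upper bound g_i >= 0, and strictly
    inside g_i = 0; this is the coordinate-wise normal-cone condition."""
    l, u, x, g = map(np.asarray, (l, u, x, g))
    at_lower, at_upper = np.isclose(x, l), np.isclose(x, u)
    interior = ~at_lower & ~at_upper
    return (np.all(g[interior] == 0)
            and np.all(g[at_lower] <= tol)
            and np.all(g[at_upper] >= -tol))

l, u = np.zeros(3), np.ones(3)
print(in_normal_cone_box(l, u, [1.0, 0.0, 0.5], [2.0, -3.0, 0.0]))  # True
print(in_normal_cone_box(l, u, [1.0, 0.0, 0.5], [0.0, 0.0, 1.0]))   # False (nonzero at an interior coordinate)
```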

(figures taken from D. Bertsekas, MIT 6.253, Spring 2012)

Comparisons
Top left: $\ell_1(x) = |x_1| + |x_2|$; top right: $f(x) = |x_1| + 2|x_2|$
Bottom left: $\tilde f(x) = f(Rx)$ (the $\pi/4$-rotated $f$); bottom right: $\{x : \tilde f(x) \le 2\}$

1-norm function: $\ell_1(x) = |x_1| + |x_2|$
pick any $\alpha, \beta > 0$:
$\partial\|(\alpha, 0)\|_1 = \{1\} \times [-1, 1] = \{(1, g_2) : g_2 \in [-1, 1]\}$
$\partial\|(0, \beta)\|_1 = [-1, 1] \times \{1\} = \{(g_1, 1) : g_1 \in [-1, 1]\}$
$\partial\|(\alpha, \beta)\|_1 = \{(1, 1)\}$

Weighted 1-norm function: $f(x) = |x_1| + 2|x_2|$
pick any $\alpha, \beta > 0$:
$\partial f(\alpha, 0) = \{1\} \times [-2, 2]$
$\partial f(0, \beta) = [-1, 1] \times \{2\}$
$\partial f(\alpha, \beta) = \{(1, 2)\}$

Weighted 1-norm, rotated
rotation matrix: $R := \begin{bmatrix} \cos\frac{\pi}{4} & \sin\frac{\pi}{4} \\ -\sin\frac{\pi}{4} & \cos\frac{\pi}{4} \end{bmatrix}$, function: $\tilde f(x) = f(Rx)$
pick any $\alpha > 0$:
$\partial\tilde f(\alpha, \alpha) = R^T\big(\{1\} \times [-2, 2]\big) = \big\{(g_1, g_2) : g_1 + g_2 = \sqrt{2},\ g_1 \in [-\tfrac{1}{\sqrt{2}}, \tfrac{3}{\sqrt{2}}]\big\}$
$\partial\tilde f(-\alpha, \alpha) = R^T\big([-1, 1] \times \{2\}\big) = \big\{(g_1, g_2) : g_2 - g_1 = 2\sqrt{2},\ g_1 \in [-\tfrac{3}{\sqrt{2}}, -\tfrac{1}{\sqrt{2}}]\big\}$

Weighted 1-norm ball, rotated
set: $C = \{x : \tilde f(x) \le 2\}$
function: $\iota_C(x) = 0$ if $x \in C$ and $\iota_C(x) = +\infty$ if $x \notin C$
At the corner of $C$ of the form $(\alpha, \alpha)$, $\alpha > 0$:
$\partial\iota_C(\alpha, \alpha) = N_C(\alpha, \alpha) = \{\theta_1(-1, 3) + \theta_2(3, -1) : \theta_1, \theta_2 \ge 0\}$
At the corner of $C$ of the form $(-\alpha, \alpha)$, $\alpha > 0$:
$\partial\iota_C(-\alpha, \alpha) = N_C(-\alpha, \alpha) = \{\theta_1(-1, 3) + \theta_2(-3, 1) : \theta_1, \theta_2 \ge 0\}$

Comparisons
left: $f(x) = \|x\|_2$. If $x \neq 0$, $\partial f(x) = \big\{\tfrac{x}{\|x\|_2}\big\}$, which contains only one point
right: $f(x) = \iota_{\{\|\cdot\|_2 \le C\}}(x)$. If $\|x\|_2 = C$, $\partial f(x) = \big\{\theta \tfrac{x}{\|x\|_2} : \theta \ge 0\big\}$, which is a ray

Partial subgradient
Let $f(x_1, x_2)$ be a proper closed convex function
If $f(x_1, x_2)$ is differentiable:
$$p_1 = \nabla_1 f(x_1, x_2),\ p_2 = \nabla_2 f(x_1, x_2) \;\Longrightarrow\; \begin{bmatrix} p_1 \\ p_2 \end{bmatrix} = \nabla f(x_1, x_2)$$
If $f(x_1, x_2)$ is non-differentiable:
$$p_1 \in \partial_1 f(x_1, x_2),\ p_2 \in \partial_2 f(x_1, x_2) \;\not\Longrightarrow\; \begin{bmatrix} p_1 \\ p_2 \end{bmatrix} \in \partial f(x_1, x_2)$$
In general, $\partial f(x_1, x_2) \subseteq \partial_1 f(x_1, x_2) \times \partial_2 f(x_1, x_2)$
exception: equality holds for separable $f(x_1, x_2) = f_1(x_1) + f_2(x_2)$

called the rotated weighted $\ell_1$ function $\tilde f$
(contour plot of $\tilde f$ omitted)
Take $(x_1, x_2) = (\alpha, \alpha)$ for arbitrary $\alpha > 0$:
$0 \in \partial_{x_1} \tilde f(\alpha, \alpha)$ and $0 \in \partial_{x_2} \tilde f(\alpha, \alpha)$, but
$$\begin{bmatrix} 0 \\ 0 \end{bmatrix} \notin \partial \tilde f(\alpha, \alpha) = \big\{ g : g_1 + g_2 = \sqrt{2},\ g_1 \in [-\tfrac{1}{\sqrt{2}}, \tfrac{3}{\sqrt{2}}] \big\}$$
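
This gap between partial and full subgradients can be checked numerically: fixing one coordinate at $\alpha$, the point $\alpha$ minimizes the restricted one-variable function (so $0$ is a partial subgradient in each coordinate), yet $(0, 0)$ violates the full subgradient inequality, e.g. at $y = (0, 0)$. A hedged Python sketch (it assumes the rotation and weights above; names are illustrative):

```python
import numpy as np

# rotated weighted l1: f_tilde(x) = |(Rx)_1| + 2 |(Rx)_2|
c = s = np.sqrt(0.5)
R = np.array([[c, s], [-s, c]])
w = np.array([1.0, 2.0])
f_tilde = lambda x: w @ np.abs(R @ np.asarray(x, dtype=float))

alpha = 1.0
ts = np.linspace(-10, 10, 10001)

# 0 is a partial subgradient in each coordinate: alpha minimizes each restriction
print(min(f_tilde([t, alpha]) for t in ts) >= f_tilde([alpha, alpha]) - 1e-9)  # True
print(min(f_tilde([alpha, t]) for t in ts) >= f_tilde([alpha, alpha]) - 1e-9)  # True

# but (0, 0) is not a full subgradient: the inequality fails at y = (0, 0)
y = np.zeros(2)
print(f_tilde(y) >= f_tilde([alpha, alpha]))  # False
```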

Subgradient optimality condition

$0 \in \partial f(x)$ for unconstrained minimization
Let $f$ be a proper function. Convexity is not required.
The set of minimizers $\operatorname{arg\,min} f$ can be empty, a singleton, or a set with infinitely many points.
Lemma: $x^* \in \operatorname{arg\,min} f$ if and only if $0 \in \partial f(x^*)$
Proof: ($\Leftarrow$) Let $0 \in \partial f(x^*)$. For all $y$, $f(y) \ge f(x^*) + \langle 0, y - x^*\rangle = f(x^*)$.
($\Rightarrow$) Let $x^* \in \operatorname{arg\,min} f$. Then for all $y$, $f(y) \ge f(x^*) = f(x^*) + \langle 0, y - x^*\rangle$, thus $0 \in \partial f(x^*)$.
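
A quick one-dimensional illustration of the lemma, sketched in Python (the function and test points are my own, not from the slides): $f(x) = |x - 3| + |x + 1|$ is minimized exactly on $[-1, 3]$, and the subgradient inequality with $g = 0$ holds at a minimizer and fails elsewhere.

```python
import numpy as np

f = lambda x: abs(x - 3) + abs(x + 1)
ys = np.linspace(-50, 50, 100001)

def zero_is_subgradient_at(x):
    """Check f(y) >= f(x) + <0, y - x> = f(x) over a grid of y."""
    return bool(np.all(f(ys) >= f(x) - 1e-12))

print(zero_is_subgradient_at(1.0))   # True:  1 is a minimizer (f = 4 on [-1, 3])
print(zero_is_subgradient_at(5.0))   # False: 5 is not a minimizer (f(5) = 8 > 4)
```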

Variational inequality for constrained minimization
Let $f$ be a proper closed convex function and $C$ a nonempty closed convex set.
Lemma: Under proper regularity conditions (e.g., $0 \in \operatorname{sri}(\operatorname{dom} f - C)$),
$$0 \in \partial f(x^*) + N_C(x^*) \iff x^* \in \operatorname{arg\,min}\{ f(x) : x \in C \}.$$
Interpretation of the first condition: there exists a subgradient $\tilde\nabla f(x^*) \in \partial f(x^*)$ such that the following variational inequality holds:
$$\langle \tilde\nabla f(x^*),\, y - x^*\rangle \ge 0, \quad \forall y \in C.$$

(figures taken from D. Bertsekas, MIT 6.253, Spring 2012)

Subgradient method

Negative subgradient is not necessarily a descent direction
Consider $f(x) = |x|$, $x \in \mathbb{R}$. Recall $\partial f(0) = \{g : |g| \le 1\}$.
Subgradients do not vanish near the minimizer: no matter how close $x \neq 0$ is to $0$, $\nabla f(x) = \operatorname{sign}(x)$, and many $g \in \partial f(0)$ point in an ascent direction.
Consider $f(x) = |x_1| + 2|x_2|$. At $x = (1, 0)$, $g = (1, 2) \in \partial f(x)$, but $-g$ is not a descent direction.
Seemingly only a measure-zero set of points causes this issue, but solutions often lie exactly at such points!
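
A small numeric check of the second example, sketched in Python: stepping from $x = (1, 0)$ along $-g$ with $g = (1, 2)$ increases $f$ for every small step size, so $-g$ is not a descent direction (the step sizes below are illustrative).

```python
import numpy as np

f = lambda x: abs(x[0]) + 2 * abs(x[1])
x = np.array([1.0, 0.0])
g = np.array([1.0, 2.0])           # a valid subgradient of f at x

for t in [1e-3, 1e-2, 1e-1]:
    print(t, f(x - t * g) - f(x))  # positive for every small t: f increases along -g
```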

Subgradient method
applications:
find a point in the intersection of convex sets
minimize nonsmooth functions
dual ascent method (the dual function is often nonsmooth)
iteration:
$$x^{k+1} \gets x^k - \alpha_k \tilde\nabla f(x^k), \qquad \tilde\nabla f(x^k) \in \partial f(x^k)$$
the objective sequence $f^k := f(x^k)$ is typically non-monotonic (monotonicity is difficult to ensure since $\tilde\nabla f$ is not continuous)
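
A minimal sketch of the iteration in Python, applied to $f(x) = \|Ax - b\|_1$ with the subgradient $A^T \operatorname{sign}(Ax - b)$ from the earlier example; the problem data, step-size rule, and function names are illustrative assumptions, not from the slides. It also tracks $f^k_{\text{best}}$, since the raw objective values need not decrease.

```python
import numpy as np

def subgradient_method(f, subgrad, x0, stepsize, iters=500):
    """x_{k+1} = x_k - alpha_k * g_k with g_k in the subdifferential of f at x_k.
    Returns the best point seen, since f(x_k) is typically non-monotonic."""
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for k in range(iters):
        x = x - stepsize(k) * subgrad(x)
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

# illustrative problem: minimize f(x) = ||Ax - b||_1
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
f = lambda x: np.abs(A @ x - b).sum()
subgrad = lambda x: A.T @ np.sign(A @ x - b)

x_best, f_best = subgradient_method(f, subgrad, np.zeros(5),
                                    stepsize=lambda k: 0.1 / np.sqrt(k + 1))
print(f_best)
```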

step size choices:
Define $f_{\text{best}}^k := \min\{f^0, f^1, \dots, f^k\}$
Fixed $\alpha_k \equiv \alpha$: for $k = 0, \dots, O(\alpha^{-2} G^{-2})$, $f_{\text{best}}^k - f^* = O(\alpha^{-1} k^{-1})$
Diminishing $\alpha_k$ with $\lim_k \alpha_k = 0$ and $\sum_k \alpha_k = \infty$: $f_{\text{best}}^k - f^* \to 0$; e.g. $\alpha_k \propto 1/\sqrt{k}$ gives $f_{\text{best}}^k - f^* = O(k^{-1/2})$
Several other choices
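
For reference, both claims follow from the standard estimate $f_{\text{best}}^k - f^* \le \big(R^2 + G^2 \sum_{i \le k} \alpha_i^2\big) \big/ \big(2 \sum_{i \le k} \alpha_i\big)$, where $G$ bounds the subgradient norms and $R \ge \|x^0 - x^*\|$; this derivation is standard but not shown on the slide. With a fixed step $\alpha$:

$$ f_{\text{best}}^k - f^* \;\le\; \frac{R^2}{2\alpha (k+1)} + \frac{G^2 \alpha}{2}, $$

so the first term dominates, giving $O(\alpha^{-1} k^{-1})$, until $k \approx R^2/(G^2\alpha^2)$, after which the error plateaus at $O(G^2\alpha)$; a diminishing step such as $\alpha_k \propto 1/\sqrt{k}$ avoids the plateau and yields the $O(k^{-1/2})$ rate.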

Summary
Subgradients lose several good features of gradients
Subgradients are easy to compute for some convex functions
They help characterize solutions
The subgradient method works but is slow

Not covered
Limiting subdifferential, defined by taking limits
Subgradients of dual functions, computed by minimizing the Lagrangian
Methods based on subgradients: the cutting-plane method and bundle methods
The proximal map