Math 273a: Optimization Subgradient Methods


Math 273a: Optimization, Subgradient Methods. Instructor: Wotao Yin, Department of Mathematics, UCLA. Fall 2015. Online discussions on piazza.com.

Nonsmooth convex function
Recall: for $\bar{x} \in \mathbb{R}^n$,
$$\partial f(\bar{x}) := \{ g \in \mathbb{R}^n : f(y) \ge f(\bar{x}) + \langle g, y - \bar{x} \rangle \ \forall y \}.$$
If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$ is a singleton. For any $x, y$, $g_x \in \partial f(x)$, and $g_y \in \partial f(y)$, we have $\langle g_x - g_y, x - y \rangle \ge 0$. ($\partial f$ is a monotone point-to-set map.)
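The monotonicity claim follows by adding the two subgradient inequalities; spelled out:
$$f(y) \ge f(x) + \langle g_x, y - x \rangle, \qquad f(x) \ge f(y) + \langle g_y, x - y \rangle.$$
Adding them and canceling $f(x) + f(y)$ gives $0 \ge \langle g_x - g_y, y - x \rangle$, i.e., $\langle g_x - g_y, x - y \rangle \ge 0$.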

Which functions have subgradients?
Theorem (Nesterov Thm 3.1.13). Let $f$ be a closed convex function and $x_0 \in \operatorname{int}(\operatorname{dom}(f))$. Then $\partial f(x_0)$ is a nonempty bounded set.
The proof uses supporting hyperplanes of the epigraph to show existence, and the local Lipschitz continuity of convex functions to show boundedness. (This slide and the next are taken from Damek's lecture; the figure is taken from Boyd and Vandenberghe, http://see.stanford.edu/materials/lsocoee364b/01-subgradients_notes.pdf.)

The converse
Lemma (Nesterov Lm 3.1.6). If $\partial f(x) \neq \emptyset$ for all $x \in \operatorname{dom}(f)$, then $f$ is convex.
Proof. Take $x, y \in \operatorname{dom}(f)$ and $\alpha \in [0, 1]$, and let $y_\alpha = (1-\alpha)x + \alpha y = x + \alpha(y - x)$ and $g \in \partial f(y_\alpha)$. Then
$$f(y) \ge f(y_\alpha) + \langle g, y - y_\alpha \rangle = f(y_\alpha) + (1-\alpha) \langle g, y - x \rangle, \quad (1)$$
$$f(x) \ge f(y_\alpha) + \langle g, x - y_\alpha \rangle = f(y_\alpha) - \alpha \langle g, y - x \rangle. \quad (2)$$
Multiply inequality (1) by $\alpha$ and inequality (2) by $(1-\alpha)$, and add them together to get $\alpha f(y) + (1-\alpha) f(x) \ge f(y_\alpha)$.

Compute subgradients: general rules
Differentiable functions: $\partial f(x) = \{\nabla f(x)\}$.
Composition with an affine map: $\varphi(x) = f(Ax + b)$ gives $\partial \varphi(x) = A^T \partial f(Ax + b)$.
Positive sums: for $\alpha, \beta > 0$ and $f(x) = \alpha f_1(x) + \beta f_2(x)$, $\partial f(x) = \alpha \partial f_1(x) + \beta \partial f_2(x)$.
Maximums: $f(x) = \max_{i \in \{1,\dots,n\}} f_i(x)$ gives $\partial f(x) = \operatorname{conv}\{ \partial f_i(x) : f_i(x) = f(x) \}$.
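A minimal numerical sketch of the max rule for a piecewise-linear function $f(x) = \max_i (\langle a_i, x \rangle + b_i)$: any row $a_i$ whose term attains the maximum is a valid subgradient. The function name and example data below are ours, for illustration.

```python
import numpy as np

def piecewise_max_subgradient(A, b, x):
    """Return f(x) and one subgradient of f(x) = max_i (<a_i, x> + b_i).

    By the max rule, any a_i with an active (maximal) i-th term is a
    valid subgradient; we simply pick the first active index.
    """
    values = A @ x + b
    i = int(np.argmax(values))          # an active index
    return values[i], A[i]              # f(x), g in ∂f(x)

# Example: f(x) = max(x1 + x2, x1 - x2, -x1)
A = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 0.0]])
b = np.zeros(3)
fx, g = piecewise_max_subgradient(A, b, np.array([0.5, -0.2]))  # g = [1., -1.]
```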

Examples
$f(x) = |x|$.
$$\partial f(x) = \begin{cases} \{\operatorname{sign}(x)\} & x \neq 0; \\ [-1, 1] & \text{otherwise}. \end{cases}$$
(Figure taken from Boyd and Vandenberghe, http://see.stanford.edu/materials/lsocoee364b/01-subgradients_notes.pdf.)

Examples
$f(x) = \sum_{i=1}^n |\langle a_i, x \rangle - b_i|$. Define
$$I_-(x) = \{ i : \langle a_i, x \rangle - b_i < 0 \}, \quad I_+(x) = \{ i : \langle a_i, x \rangle - b_i > 0 \}, \quad I_0(x) = \{ i : \langle a_i, x \rangle - b_i = 0 \}.$$
Then
$$\partial f(x) = \sum_{i \in I_+(x)} a_i - \sum_{i \in I_-(x)} a_i + \sum_{i \in I_0(x)} [-a_i, a_i].$$
$f(x) = \max_{i \in \{1,\dots,n\}} x_i$. Then
$$\partial f(x) = \operatorname{conv}\{ e_i : x_i = f(x) \}, \qquad \partial f(0) = \operatorname{conv}\{ e_i : i \in \{1,\dots,n\} \}.$$

Examples
$f(x) = \|x\|_2$. $f$ is differentiable away from $0$, so:
$$\partial f(x) = \left\{ \frac{x}{\|x\|_2} \right\}, \quad x \neq 0.$$
At $0$, go back to the subgradient inequality: $\|y\|_2 \ge 0 + \langle g, y - 0 \rangle$ for all $y \neq 0$. Thus, $g \in \partial f(0)$ if, and only if, $\frac{\langle g, y \rangle}{\|y\|_2} \le 1$ for all $y \neq 0$. Thus, $g$ is in the dual ball, which for the $\ell_2$ norm is again $B_2(0, 1)$: $\partial f(0) = B_2(0, 1)$. This is a common pattern!

Examples
$f(x) = \|x\|_\infty = \max_{i \in \{1,\dots,n\}} |x_i|$. For $x \neq 0$,
$$\partial f(x) = \operatorname{conv}\{ \operatorname{sign}(x_i)\, e_i : |x_i| = f(x) \}.$$
Going back to the subgradient inequality at $0$: $\|y\|_\infty \ge 0 + \langle g, y \rangle$. Thus, $g \in \partial f(0)$ if, and only if, $\frac{\langle g, y \rangle}{\|y\|_\infty} \le 1$ for all $y \neq 0$. Thus, $\partial f(0)$ is the dual ball to the $\ell_\infty$ norm: $B_1(0, 1)$.

Examples
$f(x) = \|x\|_1 = \sum_{i=1}^n |x_i|$. Then
$$\partial f(x) = \sum_{x_i > 0} e_i - \sum_{x_i < 0} e_i + \sum_{x_i = 0} [-e_i, e_i]$$
for all $x$. Then
$$\partial f(0) = \sum_{i=1}^n [-e_i, e_i] = B_\infty(0, 1).$$
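A direct translation of this formula into code; a minimal sketch (the function name `l1_subgradient` is ours):

```python
import numpy as np

def l1_subgradient(x, tol=1e-12):
    """One subgradient of f(x) = ||x||_1, following the formula above.

    Coordinates with x_i > 0 contribute +1, x_i < 0 contribute -1, and
    x_i = 0 may contribute any value in [-1, 1]; we pick 0 there.
    """
    g = np.sign(x)
    g[np.abs(x) < tol] = 0.0   # any value in [-1, 1] is valid here
    return g

g = l1_subgradient(np.array([1.5, 0.0, -2.0]))   # -> [1., 0., -1.]
```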

Optimality condition: $0 \in$ subdifferential $\iff$ minimum
Suppose that $0 \in \partial f(x)$. If $f$ is smooth and convex, $0 \in \partial f(x) = \{\nabla f(x)\} \implies \nabla f(x) = 0$. In general: if $0 \in \partial f(x)$, then
$$f(y) \ge f(x) + \langle 0, y - x \rangle = f(x) \quad \text{for all } y \in \mathbb{R}^n$$
$\implies x$ is a minimizer! The converse is also true: if $x^*$ is a minimizer, then $f(y) \ge f(x^*) = f(x^*) + \langle 0, y - x^* \rangle$, so $0 \in \partial f(x^*)$.

The subgradient method
Iteration: $x^{k+1} \leftarrow x^k - \alpha_k g^k$, where $g^k \in \partial f(x^k)$.
Questions:
Applications?
Are $f(x^k)$ and $\|x^k - x^*\|$ monotonic?
How to choose $\alpha_k$?

Applications
Finding a point in the intersection of closed convex sets:
$$\text{minimize } f(x) = \max\{ \operatorname{dist}(x, C_1), \dots, \operatorname{dist}(x, C_m) \}.$$
Subgradient: if $f(x) = \operatorname{dist}(x, C_j)$ and $x \notin C_j$, then
$$g = \frac{x - \operatorname{Proj}_{C_j}(x)}{\| x - \operatorname{Proj}_{C_j}(x) \|} = \frac{x - \operatorname{Proj}_{C_j}(x)}{\operatorname{dist}(x, C_j)}.$$
Minimizing non-smooth convex functions, e.g., piecewise-linear convex functions.
Dual subgradient method (generalizes the Uzawa algorithm), dual decomposition (more to come in this lecture). A code sketch of the feasibility application follows below.
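A minimal sketch of this subgradient computation when the $C_j$ are Euclidean balls (the helper names and the choice of sets are ours, for illustration):

```python
import numpy as np

def proj_ball(x, center, radius):
    """Euclidean projection onto the ball B(center, radius)."""
    d = x - center
    n = np.linalg.norm(d)
    return x.copy() if n <= radius else center + radius * d / n

def feasibility_subgradient(x, balls):
    """For f(x) = max_j dist(x, C_j) over balls C_j = (center, radius),
    return f(x) and one subgradient: the unit vector from Proj_{C_j}(x)
    to x, for a ball C_j attaining the maximum distance."""
    projs = [proj_ball(x, c, r) for c, r in balls]
    dists = [np.linalg.norm(x - p) for p in projs]
    j = int(np.argmax(dists))
    if dists[j] == 0.0:
        return 0.0, np.zeros_like(x)   # x lies in every set, so 0 ∈ ∂f(x)
    return dists[j], (x - projs[j]) / dists[j]

# e.g. balls = [(np.zeros(2), 1.0), (np.array([1.5, 0.0]), 1.0)]
```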

Convergence overview
Typically, convergence is established by identifying a monotonically nonincreasing sequence, such as $f(x^k) - f^*$ or $\|x^k - x^*\|^2$. However, since the subgradient $g(x)$ is not continuous in $x$, neither sequence is monotonic here. Instead, we will define the total descent and use it to bound $f(x^k)$. The choice of step sizes $\alpha_k$ is critical.

Monotonicity of $f(x^k)$?
The definition $f(y) \ge f(x) + \langle g, y - x \rangle$, $g \in \partial f(x)$, applied at $x^{k+1}$ with $y = x^k$ and $g^{k+1} \in \partial f(x^{k+1})$, yields
$$f(x^{k+1}) \le f(x^k) - \langle g^{k+1}, x^k - x^{k+1} \rangle = f(x^k) - \alpha_k \langle g^{k+1}, g^k \rangle.$$
It is generally difficult to estimate $\langle g^{k+1}, g^k \rangle$ since $g$ is not continuous. (No matter how close $x^{k+1}$ is to $x^k$, their subgradients $g^{k+1}$ and $g^k$ can be very different.) Therefore, we cannot guarantee $f(x^{k+1}) < f(x^k)$.
Note: taking the implicit iteration $x^{k+1} = x^k - \alpha_k g^{k+1}$, $g^{k+1} \in \partial f(x^{k+1})$ (the proximal method), we would ensure $f(x^{k+1}) \le f(x^k)$. It is more expensive to compute, though.

Monotonicity of $\|x^k - x^*\|^2$
Let us assume that $x^*$ exists. Then
$$\|x^{k+1} - x^*\|^2 = \|(x^k - x^*) - \alpha_k g^k\|^2 = \|x^k - x^*\|^2 - 2 \alpha_k \langle g^k, x^k - x^* \rangle + \alpha_k^2 \|g^k\|^2.$$
To have monotonicity, $\|x^{k+1} - x^*\|^2 \le \|x^k - x^*\|^2$, we need
$$-2 \alpha_k \langle g^k, x^k - x^* \rangle + \alpha_k^2 \|g^k\|^2 \le 0 \iff \langle g^k, x^k - x^* \rangle \ge \frac{\alpha_k}{2} \|g^k\|^2.$$
However, even as $x^k \to x^*$, $g^k$ may not vanish. Therefore, $\|x^k - x^*\|^2$ is generally not monotonic unless $\|g^k\| \le G$ (commonly assumed for the subgradient method) and $\alpha_k = O(\|x^k - x^*\|)$, which is unrealistic to ensure since $x^*$ is unknown.

However, it is often easy to have an estimate of $f^*$. For example, $f^* = 0$ in many applications. The definition $f(y) \ge f(x) + \langle g, y - x \rangle$, $g \in \partial f(x)$, applied with $x = x^k$ and $y = x^*$, yields
$$\alpha_k \langle g^k, x^k - x^* \rangle \ge \alpha_k (f(x^k) - f^*) \ge 0.$$
Interpretation: the term $\alpha_k \langle g^k, x^k - x^* \rangle$ guarantees a sufficient descent by at least $\alpha_k (f(x^k) - f^*)$. However, the ascending term $\alpha_k^2 \|g^k\|^2$ can be as large as $\alpha_k^2 G^2$.

After substitution, we get the bound
$$\|x^{k+1} - x^*\|^2 \le \|x^k - x^*\|^2 - 2 \alpha_k (f(x^k) - f^*) + \alpha_k^2 \|g^k\|^2.$$
A telescoping sum over $k$ gives
$$\|x^{k+1} - x^*\|^2 + 2 \sum_{i=0}^k \alpha_i (f(x^i) - f^*) \le \|x^0 - x^*\|^2 + \sum_{i=0}^k \alpha_i^2 \|g^i\|^2.$$
Therefore,
$$2 \sum_{i=0}^k \alpha_i (f(x^i) - f^*) \le \|x^0 - x^*\|^2 + \sum_{i=0}^k \alpha_i^2 \|g^i\|^2,$$
which bounds the total descent $\sum_{i=0}^k \alpha_i (f(x^i) - f^*)$ by the total ascent $\sum_{i=0}^k \alpha_i^2 \|g^i\|^2$. Clearly, the $\alpha_k$ play the key role.

Step size and convergence
By
$$\|x^{k+1} - x^*\|^2 + 2 \sum_{i=0}^k \alpha_i (f(x^i) - f^*) \le \|x^0 - x^*\|^2 + \sum_{i=0}^k \alpha_i^2 \|g^i\|^2$$
and letting
$f_{\text{best}}^k = \min\{ f(x^i) : i = 0, 1, \dots, k \}$ (thus, $f_{\text{best}}^k - f^* \le f(x^i) - f^*$, $\forall i \le k$),
$\|g\| \le G$ (equivalent to Lipschitz $f$: $|f(x) - f(y)| \le G \|x - y\|$ $\forall x, y$),
we have
$$f_{\text{best}}^k - f^* \le \frac{\|x^0 - x^*\|^2 + G^2 \sum_{i=0}^k \alpha_i^2}{2 \sum_{i=0}^k \alpha_i}.$$
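The last step, spelled out: since $f(x^i) - f^* \ge f_{\text{best}}^k - f^*$ for every $i \le k$,
$$2 \left( \sum_{i=0}^k \alpha_i \right) (f_{\text{best}}^k - f^*) \le 2 \sum_{i=0}^k \alpha_i (f(x^i) - f^*) \le \|x^0 - x^*\|^2 + G^2 \sum_{i=0}^k \alpha_i^2,$$
and dividing through by $2 \sum_{i=0}^k \alpha_i$ gives the displayed bound.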

We need $\sum_{i=0}^k \alpha_i$ unbounded and $\sum_{i=0}^k \alpha_i^2$ bounded as $k \to \infty$. To have $f_{\text{best}}^k - f^* \to 0$, we require, for example,
$$\sum_{i=0}^\infty \alpha_i = \infty \quad \text{and} \quad \sum_{i=0}^\infty \alpha_i^2 < \infty;$$
or, more weakly, $\sum_{i=0}^\infty \alpha_i = \infty$ and $\lim_{k \to \infty} \alpha_k = 0$ (the truncation trick). Otherwise, $f_{\text{best}}^k \not\to f^*$ in general.

Fixed step size
Fixing $\alpha_k \equiv \alpha$, we get
$$f_{\text{best}}^k - f^* \le \frac{\|x^0 - x^*\|^2 + G^2 \sum_{i=0}^k \alpha^2}{2 \sum_{i=0}^k \alpha} = \frac{\|x^0 - x^*\|^2}{2 \alpha (k+1)} + \frac{\alpha G^2}{2}.$$
Hence $f_{\text{best}}^k - f^* \to \alpha G^2 / 2 = O(\alpha)$, while in the early stage, $k < \frac{\|x^0 - x^*\|^2}{\alpha^2 G^2}$, we have
$$\frac{\|x^0 - x^*\|^2}{2 \alpha (k+1)} + \frac{\alpha G^2}{2} \le \frac{\|x^0 - x^*\|^2}{\alpha (k+1)}$$
and thus the (non-asymptotic, conditional) rate of convergence $O(\frac{1}{\alpha k})$.
Larger $\alpha$: faster convergence, lower final accuracy. Smaller $\alpha$: slower convergence, higher final accuracy.
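A tiny numerical illustration of this tradeoff on $f(x) = |x|$ (so $G = 1$ and $f^* = 0$); the script is our illustration, not part of the lecture:

```python
import numpy as np

def subgrad_fixed_step(x0, alpha, iters):
    """Fixed-step subgradient method on f(x) = |x| (f* = 0, G = 1)."""
    x = x0
    f_best = abs(x)
    for _ in range(iters):
        g = np.sign(x)            # g ∈ ∂|x| (g = 0 is valid at x = 0)
        x = x - alpha * g
        f_best = min(f_best, abs(x))
    return f_best

# f_best stalls at an O(alpha) level (at most alpha*G^2/2 asymptotically)
# instead of converging to 0:
for alpha in (0.1, 0.01, 0.001):
    print(alpha, subgrad_fixed_step(x0=np.pi / 4, alpha=alpha, iters=10000))
```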

Fixed step length
$\alpha_k = \alpha / \|g^k\|$, so that each step has length $\|x^{k+1} - x^k\| = \alpha$. We have
$$f_{\text{best}}^k - f^* \le \frac{\|x^0 - x^*\|^2 + \sum_{i=0}^k \alpha_i^2 \|g^i\|^2}{2 \sum_{i=0}^k \alpha_i} \le \frac{\|x^0 - x^*\|^2 G}{2 \alpha (k+1)} + \frac{\alpha G}{2}.$$
Hence $f_{\text{best}}^k - f^* \to \alpha G / 2 = O(\alpha)$, slightly better than with a fixed step size. While $k < \frac{\|x^0 - x^*\|^2}{\alpha^2}$, we have the (non-asymptotic, conditional) rate of convergence $O(\frac{G}{\alpha k})$.
Larger $\alpha$: faster convergence, lower final accuracy. Smaller $\alpha$: slower convergence, higher final accuracy.

Polyak step size
Assume that $f^*$ is known (not $x^*$, though). Example: $f^* = 0$ in projection problems. Set:
$$\alpha_k = \frac{f(x^k) - f^*}{\|g^k\|^2} = \operatorname*{arg\,min}_\alpha \left\{ \|x^k - x^*\|^2 - 2 \alpha (f(x^k) - f^*) + \alpha^2 \|g^k\|^2 \right\}.$$
Then,
$$\|x^k - x^*\|^2 - 2 \alpha_k (f(x^k) - f^*) + \alpha_k^2 \|g^k\|^2 = \|x^k - x^*\|^2 - \frac{(f(x^k) - f^*)^2}{\|g^k\|^2}$$
(so $\|x^k - x^*\|^2$ is monotonic), and thus, after adding over $k$,
$$\|x^{k+1} - x^*\|^2 \le \|x^0 - x^*\|^2 - \frac{1}{G^2} \sum_{i=0}^k (f(x^i) - f^*)^2.$$
Finally,
$$f_{\text{best}}^k - f^* \le \frac{\|x^0 - x^*\| G}{\sqrt{k+1}} = O\left(\frac{1}{\sqrt{k}}\right).$$
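A minimal sketch of the Polyak step on a feasibility problem: find $x$ with $Ax \le b$ by minimizing $f(x) = \max_i \max(\langle a_i, x \rangle - b_i, 0)$, for which $f^* = 0$ whenever the system is feasible. The example data and names are ours:

```python
import numpy as np

def polyak_feasibility(A, b, x0, iters=500):
    """Polyak-step subgradient method on f(x) = max_i max(a_i.x - b_i, 0).

    If Ax <= b is feasible then f* = 0, so the Polyak step is
    alpha_k = f(x^k) / ||g^k||^2.
    """
    x = x0.astype(float)
    for _ in range(iters):
        viol = A @ x - b
        j = int(np.argmax(viol))
        fx = max(viol[j], 0.0)
        if fx == 0.0:                     # feasible: 0 ∈ ∂f(x), stop
            break
        g = A[j]                          # subgradient via the max rule
        x = x - (fx / (g @ g)) * g        # Polyak step
    return x

# Example: a random system that is feasible by construction (x = 0 works).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.random(20)                        # b > 0, so A @ 0 <= b
x = polyak_feasibility(A, b, x0=10 * rng.standard_normal(5))
print(np.max(A @ x - b))                  # <= 0 (or tiny) at exit
```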

Oracle optimality
For an $O(1/\sqrt{k})$ method to guarantee $f_{\text{best}}^k - f^* \le \epsilon$, we need $O(1/\epsilon^2)$ iterations. Is this tight for the subgradient method? Answer: yes. Suppose $x^{k+1}$ is computed by an arbitrary method as
$$x^{k+1} \in x^0 + \operatorname{span}\{ g^0, g^1, \dots, g^k \},$$
where the oracle gives arbitrary $g^k \in \partial f(x^k)$ and $f(x^k)$.
Theorem (Nesterov Thm 3.2.1). There is a nonsmooth convex function with $\|g\| \le G$ uniformly such that the above method obeys
$$f(x^k) - f(x^*) \ge \frac{\|x^0 - x^*\| G}{2 (1 + \sqrt{k+1})}.$$

The subgradient algorithm
$f$ is a proper closed convex function. Problem: $\operatorname{minimize}_x f(x)$.
Algorithm:
pick any starting point $x^0$
pick $g^k \in \partial f(x^k)$
set $\alpha_k$ ($\alpha_0 / k$, fixed size, fixed length, or Polyak)
$x^{k+1} \leftarrow x^k - \alpha_k g^k$
$k \leftarrow k + 1$
(monitor $f(x^k)$ and $f_{\text{best}}^k$, especially if using fixed size or fixed length; a runnable sketch follows below)
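A minimal, generic implementation of this loop; the objective, subgradient oracle, and step-size rule are supplied by the caller (the names are ours), shown here on $f(x) = \|x\|_1$ with diminishing steps $\alpha_0 / k$:

```python
import numpy as np

def subgradient_method(f, subgrad, x0, step, iters=1000):
    """Generic subgradient method: x^{k+1} = x^k - alpha_k * g^k.

    `step(k, x, g)` returns alpha_k. We track f_best since f(x^k)
    itself need not decrease monotonically.
    """
    x = x0.astype(float)
    x_best, f_best = x.copy(), f(x)
    for k in range(1, iters + 1):
        g = subgrad(x)
        x = x - step(k, x, g) * g
        fx = f(x)
        if fx < f_best:
            x_best, f_best = x.copy(), fx
    return x_best, f_best

# Example: minimize f(x) = ||x||_1 with diminishing steps alpha_0 / k.
f = lambda x: np.abs(x).sum()
subgrad = lambda x: np.sign(x)         # one element of ∂f(x)
step = lambda k, x, g: 1.0 / k
x_best, f_best = subgradient_method(f, subgrad, np.array([3.0, -2.0]), step)
```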

Variant: projected subgradient method
$f$ is a proper closed convex function. $C$ is a nonempty closed convex set. Problem:
$$\operatorname{minimize}_x f(x), \quad \text{subject to } x \in C.$$
Iteration: pick $g^k \in \partial f(x^k)$ and $\alpha_k$, and apply
$$x^{k+1} \leftarrow \operatorname{Proj}_C(x^k - \alpha_k g^k).$$
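A minimal sketch with $C$ a box, where the projection is a coordinatewise clip (the problem instance and names are ours, for illustration):

```python
import numpy as np

def projected_subgradient(subgrad, proj, x0, alphas):
    """Projected subgradient: x^{k+1} = Proj_C(x^k - alpha_k g^k)."""
    x = proj(x0.astype(float))
    for alpha in alphas:
        x = proj(x - alpha * subgrad(x))
    return x

# Minimize f(x) = ||x - c||_1 over the box C = [0, 1]^3.
c = np.array([2.0, -1.0, 0.5])
subgrad = lambda x: np.sign(x - c)              # one element of ∂f(x)
proj = lambda x: np.clip(x, 0.0, 1.0)           # Proj onto the box
x = projected_subgradient(subgrad, proj, np.zeros(3),
                          alphas=[0.5 / k for k in range(1, 200)])
# Expected minimizer: clip(c, 0, 1) = [1.0, 0.0, 0.5]
```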

Since projection is nonexpansive, $\|\operatorname{Proj}_C(x) - \operatorname{Proj}_C(y)\| \le \|x - y\|$, $\forall x, y \in \mathbb{R}^n$, the analysis remains the same:
$$\|x^{k+1} - x^*\|^2 = \|\operatorname{Proj}_C(x^k - \alpha_k g^k) - \operatorname{Proj}_C(x^*)\|^2 \le \|(x^k - \alpha_k g^k) - x^*\|^2 = \|(x^k - x^*) - \alpha_k g^k\|^2 = \|x^k - x^*\|^2 - 2 \alpha_k \langle g^k, x^k - x^* \rangle + \alpha_k^2 \|g^k\|^2 = \cdots$$

Summary for subgradient methods
Universal: it handles non-differentiable convex problems and, in particular, the dual problem of linearly constrained convex problems (later lectures).
No monotonicity for either $f(x^k)$ or $\operatorname{dist}(x^k, X^*)$, except with Polyak's step size (which requires a known $f^*$).
Convergence relies on a uniformly bounded subgradient (or Lipschitz $f$).
The rate of convergence of $f_{\text{best}}^k - f^*$ depends on the step size.
A constant step size (or length) does not guarantee $f_{\text{best}}^k \to f^*$.
If we need $f_{\text{best}}^k \to f^*$, use diminishing step sizes; the rate is at best $O(1/\sqrt{k})$.
Convergence is quite slow (but the method is widely applicable).
Some non-smooth problems have better rates with other methods, e.g., prox-linear iteration, operator splitting, dual smoothing (later lectures).