
Subgradient Method
Ryan Tibshirani, Convex Optimization 10-725

Last time: gradient descent

Consider the problem
$$\min_x f(x)$$
for $f$ convex and differentiable, $\mathrm{dom}(f) = \mathbb{R}^n$.

Gradient descent: choose initial $x^{(0)} \in \mathbb{R}^n$, repeat:
$$x^{(k)} = x^{(k-1)} - t_k \nabla f(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$
Step sizes $t_k$ chosen to be fixed and small, or by backtracking line search.

If $\nabla f$ is Lipschitz, gradient descent has convergence rate $O(1/\epsilon)$.

Downsides:
- Requires $f$ differentiable (addressed this lecture)
- Can be slow to converge (addressed next lecture)

Subgradient method

Now consider $f$ convex, having $\mathrm{dom}(f) = \mathbb{R}^n$, but not necessarily differentiable.

Subgradient method: like gradient descent, but replacing gradients with subgradients. I.e., initialize $x^{(0)}$, repeat:
$$x^{(k)} = x^{(k-1)} - t_k g^{(k-1)}, \quad k = 1, 2, 3, \ldots$$
where $g^{(k-1)} \in \partial f(x^{(k-1)})$, any subgradient of $f$ at $x^{(k-1)}$.

The subgradient method is not necessarily a descent method, so we keep track of the best iterate $x^{(k)}_{\mathrm{best}}$ among $x^{(0)}, \ldots, x^{(k)}$ so far, i.e.,
$$f\big(x^{(k)}_{\mathrm{best}}\big) = \min_{i=0,\ldots,k} f\big(x^{(i)}\big)$$
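For concreteness, a minimal Python sketch of this loop, tracking the best iterate. The oracle `subgrad`, the schedule `steps`, and all names here are illustrative, not part of the original slides:

```python
import numpy as np

def subgradient_method(f, subgrad, x0, steps, max_iter=1000):
    """Sketch of the subgradient method.

    f       : callable returning f(x)
    subgrad : callable returning any g in the subdifferential of f at x
    steps   : callable k -> t_k, the pre-specified step sizes (k = 1, 2, ...)
    """
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for k in range(1, max_iter + 1):
        x = x - steps(k) * subgrad(x)   # not a descent step in general,
        fx = f(x)                       # so track the best iterate seen
        if fx < f_best:
            x_best, f_best = x.copy(), fx
    return x_best, f_best
```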

Outline

Today:
- How to choose step sizes
- Convergence analysis
- Intersection of sets
- Projected subgradient method

Step size choices

Fixed step sizes: $t_k = t$ for all $k = 1, 2, 3, \ldots$

Diminishing step sizes: choose step sizes satisfying
$$\sum_{k=1}^\infty t_k^2 < \infty, \qquad \sum_{k=1}^\infty t_k = \infty,$$
i.e., square summable but not summable. It is important here that the step sizes go to zero, but not too fast.

There are several other options too, but the key difference from gradient descent is that step sizes are pre-specified, not adaptively computed. Both basic schedules are sketched below.
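As a small illustration, two schedules matching these rules, in the interface of the sketch above (the constants are arbitrary):

```python
def fixed_step(t=0.001):
    """Fixed step size: t_k = t for all k."""
    return lambda k: t

def diminishing_step(t=0.001):
    """Diminishing step sizes t_k = t/k: square summable
    (sum of t^2/k^2 is finite) but not summable (sum of t/k diverges)."""
    return lambda k: t / k
```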

Convergence analysis

Assume that $f$ is convex, $\mathrm{dom}(f) = \mathbb{R}^n$, and also that $f$ is Lipschitz continuous with constant $G > 0$, i.e.,
$$|f(x) - f(y)| \le G \|x - y\|_2 \quad \text{for all } x, y$$

Theorem: For a fixed step size $t$, the subgradient method satisfies
$$\lim_{k \to \infty} f\big(x^{(k)}_{\mathrm{best}}\big) \le f^\star + G^2 t / 2$$

Theorem: For diminishing step sizes, the subgradient method satisfies
$$\lim_{k \to \infty} f\big(x^{(k)}_{\mathrm{best}}\big) = f^\star$$

Basic inequality

Can prove both results from the same basic inequality. Key steps:

Using the definition of subgradient,
$$\|x^{(k)} - x^\star\|_2^2 \le \|x^{(k-1)} - x^\star\|_2^2 - 2 t_k \big( f(x^{(k-1)}) - f(x^\star) \big) + t_k^2 \|g^{(k-1)}\|_2^2$$

Iterating the last inequality,
$$\|x^{(k)} - x^\star\|_2^2 \le \|x^{(0)} - x^\star\|_2^2 - 2 \sum_{i=1}^k t_i \big( f(x^{(i-1)}) - f(x^\star) \big) + \sum_{i=1}^k t_i^2 \|g^{(i-1)}\|_2^2$$
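Spelling out the first step (not shown on the original slide): expand the square in the update $x^{(k)} = x^{(k-1)} - t_k g^{(k-1)}$, then apply the subgradient inequality $f(x^\star) \ge f(x^{(k-1)}) + g^{(k-1)\,T}(x^\star - x^{(k-1)})$:
$$\begin{aligned}
\|x^{(k)} - x^\star\|_2^2
  &= \|x^{(k-1)} - x^\star\|_2^2 - 2 t_k\, g^{(k-1)\,T}\big(x^{(k-1)} - x^\star\big) + t_k^2 \|g^{(k-1)}\|_2^2 \\
  &\le \|x^{(k-1)} - x^\star\|_2^2 - 2 t_k \big( f(x^{(k-1)}) - f(x^\star) \big) + t_k^2 \|g^{(k-1)}\|_2^2
\end{aligned}$$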

Using $\|x^{(k)} - x^\star\|_2^2 \ge 0$, and letting $R = \|x^{(0)} - x^\star\|_2$,
$$0 \le R^2 - 2 \sum_{i=1}^k t_i \big( f(x^{(i-1)}) - f(x^\star) \big) + G^2 \sum_{i=1}^k t_i^2$$

Introducing $f\big(x^{(k)}_{\mathrm{best}}\big) = \min_{i=0,\ldots,k} f\big(x^{(i)}\big)$, and rearranging, we have the basic inequality
$$f\big(x^{(k)}_{\mathrm{best}}\big) - f(x^\star) \le \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i}$$

For different step size choices, convergence results can be obtained directly from this bound; e.g., the theorems for fixed and diminishing step sizes follow.

Convergence rate

The basic inequality tells us that after $k$ steps, we have
$$f\big(x^{(k)}_{\mathrm{best}}\big) - f(x^\star) \le \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i}$$

With fixed step size $t$, this gives
$$f\big(x^{(k)}_{\mathrm{best}}\big) - f^\star \le \frac{R^2}{2kt} + \frac{G^2 t}{2}$$

For this to be $\le \epsilon$, let's make each term $\le \epsilon/2$. So we can choose $t = \epsilon/G^2$, and $k = R^2/t \cdot 1/\epsilon = R^2 G^2/\epsilon^2$.

I.e., the subgradient method has convergence rate $O(1/\epsilon^2)$ ... compare this to the $O(1/\epsilon)$ rate of gradient descent.
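To make the scaling concrete, a tiny helper computing this step size and iteration count (illustrative only; the formulas come straight from the bound above):

```python
import math

def fixed_step_plan(R, G, eps):
    """Step size t and iteration count k guaranteeing
    f(x_best) - f* <= eps under the bound R^2/(2kt) + G^2 t/2."""
    t = eps / G**2
    k = math.ceil(R**2 * G**2 / eps**2)
    return t, k

# e.g., fixed_step_plan(1, 1, 1e-3) -> (0.001, 1_000_000):
# three digits of accuracy already costs a million iterations
```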

Example: regularized logistic regression

Given $(x_i, y_i) \in \mathbb{R}^p \times \{0, 1\}$ for $i = 1, \ldots, n$, the logistic regression loss is
$$f(\beta) = \sum_{i=1}^n \Big( -y_i x_i^T \beta + \log\big(1 + \exp(x_i^T \beta)\big) \Big)$$

This is a smooth and convex function, with
$$\nabla f(\beta) = \sum_{i=1}^n \big( p_i(\beta) - y_i \big) x_i$$
where $p_i(\beta) = \exp(x_i^T \beta)/\big(1 + \exp(x_i^T \beta)\big)$, $i = 1, \ldots, n$.

Consider the regularized problem:
$$\min_\beta \; f(\beta) + \lambda P(\beta)$$
where $P(\beta) = \|\beta\|_2^2$ (ridge penalty) or $P(\beta) = \|\beta\|_1$ (lasso penalty).
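A sketch of a subgradient oracle for the lasso-penalized objective (the interface is hypothetical; note `np.sign` returns 0 at 0, which is a valid element of the subdifferential of the absolute value there):

```python
import numpy as np

def lasso_logistic_subgrad(beta, X, y, lam):
    """One subgradient of f(beta) + lam * ||beta||_1 for logistic loss.
    X is the n x p design matrix, y in {0,1}^n."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))   # p_i(beta)
    grad_f = X.T @ (p - y)                # gradient of the smooth part
    # sign(beta_j) where beta_j != 0; 0 is a valid subgradient at 0
    return grad_f + lam * np.sign(beta)
```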

Ridge: use gradients; lasso: use subgradients. The example here has $n = 1000$, $p = 20$:

[Figure: $f - f^\star$ (log scale) versus iteration $k$, for gradient descent on the ridge problem with $t = 0.001$, and the subgradient method on the lasso problem with $t = 0.001$ and $t = 0.001/k$.]

Step sizes were hand-tuned to be favorable for each method (of course the comparison is imperfect, but it reveals the convergence behaviors).

Polyak step sizes

Polyak step sizes: when the optimal value $f^\star$ is known, take
$$t_k = \frac{f(x^{(k-1)}) - f^\star}{\|g^{(k-1)}\|_2^2}, \quad k = 1, 2, 3, \ldots$$

This can be motivated from the first step in the subgradient proof:
$$\|x^{(k)} - x^\star\|_2^2 \le \|x^{(k-1)} - x^\star\|_2^2 - 2 t_k \big( f(x^{(k-1)}) - f(x^\star) \big) + t_k^2 \|g^{(k-1)}\|_2^2$$
The Polyak step size minimizes the right-hand side.

With Polyak step sizes, one can show that the subgradient method converges to the optimal value. The convergence rate is still $O(1/\epsilon^2)$.
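In the interface of the earlier sketch, the Polyak rule is one line (hypothetical helper; since it needs $f^\star$ and the current subgradient, it would replace the fixed `steps(k)` schedule inside the loop):

```python
import numpy as np

def polyak_step(f_x, f_star, g):
    """Polyak step size: t_k = (f(x) - f*) / ||g||_2^2."""
    return (f_x - f_star) / np.dot(g, g)
```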

Example: intersection of sets

Suppose we want to find $x^\star \in C_1 \cap \cdots \cap C_m$, i.e., find a point in the intersection of closed, convex sets $C_1, \ldots, C_m$.

First define
$$f_i(x) = \mathrm{dist}(x, C_i), \quad i = 1, \ldots, m, \qquad f(x) = \max_{i=1,\ldots,m} f_i(x)$$
and now solve
$$\min_x f(x)$$

Check: is this convex?

Note that $f^\star = 0 \iff x^\star \in C_1 \cap \cdots \cap C_m$.

Recall the distance function $\mathrm{dist}(x, C) = \min_{y \in C} \|y - x\|_2$. Last time we computed its gradient:
$$\nabla \mathrm{dist}(x, C) = \frac{x - P_C(x)}{\|x - P_C(x)\|_2}$$
where $P_C(x)$ is the projection of $x$ onto $C$.

Also recall the subgradient rule: if $f(x) = \max_{i=1,\ldots,m} f_i(x)$, then
$$\partial f(x) = \mathrm{conv}\Big( \bigcup_{i : f_i(x) = f(x)} \partial f_i(x) \Big)$$

So if $f_i(x) = f(x)$ and $g_i \in \partial f_i(x)$, then $g_i \in \partial f(x)$.

Putting these two facts together for the intersection of sets problem, with $f_i(x) = \mathrm{dist}(x, C_i)$: if $C_i$ is the farthest set from $x$ (so that $f_i(x) = f(x)$), and
$$g_i = \nabla f_i(x) = \frac{x - P_{C_i}(x)}{\|x - P_{C_i}(x)\|_2}$$
then $g_i \in \partial f(x)$.

Now apply the subgradient method with Polyak step size $t_k = f(x^{(k-1)})$ (this is indeed the Polyak rule, since here $f^\star = 0$ and $\|g_i\|_2 = 1$). At iteration $k$, with $C_i$ farthest from $x^{(k-1)}$, we perform the update
$$x^{(k)} = x^{(k-1)} - f(x^{(k-1)}) \cdot \frac{x^{(k-1)} - P_{C_i}(x^{(k-1)})}{\|x^{(k-1)} - P_{C_i}(x^{(k-1)})\|_2} = P_{C_i}(x^{(k-1)})$$

For two sets, this is the famous alternating projections algorithm [1], i.e., just keep projecting back and forth.

[Figure: alternating projections between two convex sets, from Boyd's lecture notes]

[1] von Neumann (1950), "Functional operators, volume II: The geometry of orthogonal spaces"
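A minimal sketch of alternating projections for two sets, assuming both projection operators are available (the halfspace and orthant examples are illustrative choices, not from the slides):

```python
import numpy as np

def proj_halfspace(x, a, b):
    """Project x onto the halfspace {z : a^T z <= b}."""
    viol = a @ x - b
    return x if viol <= 0 else x - (viol / (a @ a)) * a

def proj_orthant(x):
    """Project x onto the nonnegative orthant."""
    return np.maximum(x, 0)

def alternating_projections(x0, proj1, proj2, iters=100):
    """Keep projecting back and forth; for closed convex sets with
    nonempty intersection, the iterates approach a common point."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = proj2(proj1(x))
    return x
```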

Projected subgradient method

To optimize a convex function $f$ over a convex set $C$,
$$\min_x f(x) \quad \text{subject to } x \in C$$
we can use the projected subgradient method. It is just like the usual subgradient method, except that we project onto $C$ at each iteration:
$$x^{(k)} = P_C\big( x^{(k-1)} - t_k g^{(k-1)} \big), \quad k = 1, 2, 3, \ldots$$

Assuming we can do this projection, we get the same convergence guarantees as the usual subgradient method, with the same step size choices.
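Relative to the earlier loop sketch, the change is a single line (hypothetical `proj` operator standing in for $P_C$):

```python
import numpy as np

def projected_subgradient(f, subgrad, proj, x0, steps, max_iter=1000):
    """Projected subgradient method sketch: subgradient step, then project."""
    x = proj(np.asarray(x0, dtype=float))
    x_best, f_best = x.copy(), f(x)
    for k in range(1, max_iter + 1):
        x = proj(x - steps(k) * subgrad(x))   # the only change: apply P_C
        fx = f(x)
        if fx < f_best:
            x_best, f_best = x.copy(), fx
    return x_best, f_best
```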

What sets $C$ are easy to project onto? Lots, e.g.:
- Affine images: $\{Ax + b : x \in \mathbb{R}^n\}$
- Solution set of a linear system: $\{x : Ax = b\}$
- Nonnegative orthant: $\mathbb{R}^n_+ = \{x : x \ge 0\}$
- Some norm balls: $\{x : \|x\|_p \le 1\}$ for $p = 1, 2, \infty$
- Some simple polyhedra and simple cones

A few of these projections are sketched after this list.

Warning: it is easy to write down a seemingly simple set $C$ for which $P_C$ turns out to be very hard! E.g., it is generally hard to project onto an arbitrary polyhedron $C = \{x : Ax \le b\}$.

Note: projected gradient descent works too; more next time ...
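For concreteness, closed-form projections onto a few of the easy sets above (a sketch; these clip/rescale formulas are standard):

```python
import numpy as np

def proj_nonneg(x):
    """Nonnegative orthant: clip negative coordinates to zero."""
    return np.maximum(x, 0)

def proj_l2_ball(x, r=1.0):
    """Euclidean ball of radius r: rescale if outside."""
    nrm = np.linalg.norm(x)
    return x if nrm <= r else (r / nrm) * x

def proj_linf_ball(x, r=1.0):
    """l_inf ball of radius r: clip each coordinate to [-r, r]."""
    return np.clip(x, -r, r)

def proj_linear_system(x, A, b):
    """Solution set {z : Az = b}, with A full row rank:
    x - A^T (A A^T)^{-1} (A x - b)."""
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)
```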

Can we do better?

Upside of the subgradient method: broad applicability. Downside: the $O(1/\epsilon^2)$ convergence rate over the problem class of convex, Lipschitz functions is really slow.

Nonsmooth first-order methods: iterative methods that update $x^{(k)}$ in
$$x^{(0)} + \mathrm{span}\{g^{(0)}, g^{(1)}, \ldots, g^{(k-1)}\}$$
where the subgradients $g^{(0)}, g^{(1)}, \ldots, g^{(k-1)}$ come from a weak oracle.

Theorem (Nesterov): For any $k \le n - 1$ and starting point $x^{(0)}$, there is a function in the problem class such that any nonsmooth first-order method satisfies
$$f(x^{(k)}) - f^\star \ge \frac{RG}{2(1 + \sqrt{k+1})}$$

Improving on the subgradient method

In words, we cannot do better than the $O(1/\epsilon^2)$ rate of the subgradient method (unless we go beyond nonsmooth first-order methods).

So instead of trying to improve across the board, we will focus on minimizing composite functions of the form
$$f(x) = g(x) + h(x)$$
where $g$ is convex and differentiable, and $h$ is convex and nonsmooth but "simple".

For a lot of problems (i.e., functions $h$), we can recover the $O(1/\epsilon)$ rate of gradient descent with a simple algorithm, which has important practical consequences.

References and further reading

- S. Boyd, Lecture notes for EE 364B, Stanford University, Spring 2010-2011
- Y. Nesterov (1998), "Introductory lectures on convex optimization: a basic course", Chapter 3
- B. Polyak (1987), "Introduction to optimization", Chapter 5
- L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012