Subgradient Method
Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh. Convex Optimization 10-725/36-725. Adapted from slides by Ryan Tibshirani.

Recall: gradient descent
Consider the problem $\min_x f(x)$ for $f$ convex and differentiable, with $\mathrm{dom}(f) = \mathbb{R}^n$. Gradient descent: choose an initial $x^{(0)} \in \mathbb{R}^n$ and repeat
$$x^{(k)} = x^{(k-1)} - t_k \nabla f(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$
Step sizes $t_k$ are chosen to be fixed and small, or by backtracking line search. If $\nabla f$ is Lipschitz, gradient descent has convergence rate $O(1/\epsilon)$. Downsides:
- Requires $f$ differentiable (this lecture)
- Can be slow to converge (next lecture)

Subgradient method
Now consider $f$ convex with $\mathrm{dom}(f) = \mathbb{R}^n$, but not necessarily differentiable. Subgradient method: like gradient descent, but replacing gradients with subgradients. That is, initialize $x^{(0)}$ and repeat
$$x^{(k)} = x^{(k-1)} - t_k g^{(k-1)}, \quad k = 1, 2, 3, \ldots$$
where $g^{(k-1)} \in \partial f(x^{(k-1)})$ is any subgradient of $f$ at $x^{(k-1)}$. The subgradient method is not necessarily a descent method, so we keep track of the best iterate $x^{(k)}_{\mathrm{best}}$ among $x^{(0)}, \ldots, x^{(k)}$ so far, i.e.,
$$f(x^{(k)}_{\mathrm{best}}) = \min_{i=0,\ldots,k} f(x^{(i)}).$$
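As a concrete illustration, here is a minimal NumPy sketch of this update, tracking the best iterate. The test problem $f(x) = \|x\|_1$ and the choice of subgradient are illustrative, not from the slides.

```python
import numpy as np

def subgradient_method(f, subgrad, x0, step, iters=1000):
    """Subgradient method: x^(k) = x^(k-1) - t_k * g^(k-1).

    subgrad(x) returns any g in the subdifferential of f at x;
    step(k) returns the step size t_k for iteration k = 1, 2, 3, ...
    Not a descent method, so we track the best iterate seen so far.
    """
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for k in range(1, iters + 1):
        x = x - step(k) * subgrad(x)
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

# Illustrative nonsmooth problem: f(x) = ||x||_1, minimized at x = 0.
f = lambda x: np.abs(x).sum()
subgrad = lambda x: np.sign(x)  # a valid subgradient of the l1 norm
x_best, f_best = subgradient_method(
    f, subgrad, x0=[3.0, -2.0, 1.5],
    step=lambda k: 1.0 / k)     # diminishing: square summable, not summable
print(x_best, f_best)
```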

Outline
Today:
- How to choose step sizes
- Convergence analysis
- Intersection of sets
- Stochastic subgradient method

Step size choices
- Fixed step sizes: $t_k = t$ for all $k = 1, 2, 3, \ldots$
- Diminishing step sizes: choose $t_k$ to satisfy
$$\sum_{k=1}^{\infty} t_k^2 < \infty, \qquad \sum_{k=1}^{\infty} t_k = \infty,$$
i.e., square summable but not summable. It is important that the step sizes go to zero, but not too fast (e.g., $t_k = c/k$ works).

There are other options too, but an important difference from gradient descent: step sizes are typically pre-specified, not adaptively computed.

Convergence analysis
Assume that $f$ is convex with $\mathrm{dom}(f) = \mathbb{R}^n$, and that $f$ is Lipschitz continuous with constant $G > 0$, i.e.,
$$|f(x) - f(y)| \le G \|x - y\|_2 \quad \text{for all } x, y.$$

Theorem: For a fixed step size $t$, the subgradient method satisfies
$$\lim_{k \to \infty} f(x^{(k)}_{\mathrm{best}}) \le f^\star + G^2 t / 2.$$

Theorem: For diminishing step sizes, the subgradient method satisfies
$$\lim_{k \to \infty} f(x^{(k)}_{\mathrm{best}}) = f^\star.$$

Basic inequality
Both results can be proved from the same basic inequality. Key steps: using the definition of subgradient,
$$\|x^{(k)} - x^\star\|_2^2 \le \|x^{(k-1)} - x^\star\|_2^2 - 2 t_k \big(f(x^{(k-1)}) - f(x^\star)\big) + t_k^2 \|g^{(k-1)}\|_2^2.$$
Iterating the last inequality,
$$\|x^{(k)} - x^\star\|_2^2 \le \|x^{(0)} - x^\star\|_2^2 - 2 \sum_{i=1}^{k} t_i \big(f(x^{(i-1)}) - f(x^\star)\big) + \sum_{i=1}^{k} t_i^2 \|g^{(i-1)}\|_2^2.$$

Using $\|x^{(k)} - x^\star\|_2^2 \ge 0$, and letting $R = \|x^{(0)} - x^\star\|_2$,
$$0 \le R^2 - 2 \sum_{i=1}^{k} t_i \big(f(x^{(i-1)}) - f(x^\star)\big) + G^2 \sum_{i=1}^{k} t_i^2.$$
Introducing $f(x^{(k)}_{\mathrm{best}}) = \min_{i=0,\ldots,k} f(x^{(i)})$ and rearranging, we have the basic inequality
$$f(x^{(k)}_{\mathrm{best}}) - f(x^\star) \le \frac{R^2 + G^2 \sum_{i=1}^{k} t_i^2}{2 \sum_{i=1}^{k} t_i}.$$
For different step size choices, convergence results can be obtained directly from this bound; e.g., the theorems for fixed and diminishing step sizes follow.

Convergence rate
The basic inequality tells us that after $k$ steps,
$$f(x^{(k)}_{\mathrm{best}}) - f(x^\star) \le \frac{R^2 + G^2 \sum_{i=1}^{k} t_i^2}{2 \sum_{i=1}^{k} t_i}.$$
With fixed step size $t$, this gives
$$f(x^{(k)}_{\mathrm{best}}) - f^\star \le \frac{R^2}{2kt} + \frac{G^2 t}{2}.$$
For this to be $\le \epsilon$, let's make each term $\le \epsilon/2$. Therefore choose $t = \epsilon / G^2$ and $k = (R^2/t) \cdot (1/\epsilon) = R^2 G^2 / \epsilon^2$. That is, the subgradient method has convergence rate $O(1/\epsilon^2)$... compare this to the $O(1/\epsilon)$ rate of gradient descent.
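A quick sanity check on this arithmetic, with illustrative values of $R$, $G$, and $\epsilon$:

```python
R, G, eps = 1.0, 10.0, 1e-3     # illustrative values, not from the slides
t = eps / G**2                  # makes the term G^2 * t / 2 equal eps / 2
k = R**2 * G**2 / eps**2        # makes the term R^2 / (2*k*t) equal eps / 2
print(t, k)                     # t = 1e-05, k = 1e8 iterations
```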

Example: regularized logistic regression
Given $(x_i, y_i) \in \mathbb{R}^p \times \{0, 1\}$ for $i = 1, \ldots, n$, consider the logistic regression loss
$$f(\beta) = \sum_{i=1}^{n} \Big( -y_i x_i^T \beta + \log\big(1 + \exp(x_i^T \beta)\big) \Big).$$
This is smooth and convex, with
$$\nabla f(\beta) = \sum_{i=1}^{n} \big(p_i(\beta) - y_i\big) x_i,$$
where $p_i(\beta) = \exp(x_i^T \beta) / \big(1 + \exp(x_i^T \beta)\big)$, $i = 1, \ldots, n$. We will consider the regularized problem
$$\min_\beta \; f(\beta) + \lambda P(\beta),$$
where $P(\beta) = \|\beta\|_2^2$ (ridge penalty) or $P(\beta) = \|\beta\|_1$ (lasso penalty).

Ridge problem: use gradients; lasso problem: use subgradients. Data example with $n = 1000$, $p = 20$:
[Figure: $f - f^\star$ (log scale) versus iteration $k$, over 200 iterations. Left panel: gradient descent on the ridge problem with $t = 0.001$. Right panel: subgradient method on the lasso problem with $t = 0.001$ and $t = 0.001/k$.]
Step sizes were hand-tuned to be favorable for each method (of course the comparison is imperfect, but it reveals the convergence behaviors).
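A sketch of the lasso case in NumPy, on simulated data (the slides' actual data are not reproduced here, and $\lambda$ is an illustrative choice); note the subgradient is the gradient of the smooth loss plus $\lambda \, \mathrm{sign}(\beta)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 1000, 20, 0.1                  # lam is an illustrative choice
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, size=n).astype(float)

def obj(beta):                             # f(beta) + lam * ||beta||_1
    z = X @ beta
    return np.sum(-y * z + np.logaddexp(0.0, z)) + lam * np.abs(beta).sum()

beta, f_best = np.zeros(p), np.inf
for k in range(1, 201):
    z = X @ beta
    grad = X.T @ (1.0 / (1.0 + np.exp(-z)) - y)  # gradient of smooth loss
    g = grad + lam * np.sign(beta)               # + subgradient of penalty
    beta = beta - 0.001 * g                      # fixed step t = 0.001
    f_best = min(f_best, obj(beta))
print(f_best)
```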

Polyak step sizes
Polyak step sizes: when the optimal value $f^\star$ is known, take
$$t_k = \frac{f(x^{(k-1)}) - f^\star}{\|g^{(k-1)}\|_2^2}, \quad k = 1, 2, 3, \ldots$$
This can be motivated from the first step in the subgradient proof:
$$\|x^{(k)} - x^\star\|_2^2 \le \|x^{(k-1)} - x^\star\|_2^2 - 2 t_k \big(f(x^{(k-1)}) - f(x^\star)\big) + t_k^2 \|g^{(k-1)}\|_2^2;$$
the Polyak step size minimizes the right-hand side over $t_k$. With Polyak step sizes, the subgradient method can be shown to converge to the optimal value. The convergence rate is still $O(1/\epsilon^2)$.
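As a sketch, the Polyak rule is a one-line change relative to the pre-specified step sizes in the earlier sketch (assuming NumPy vectors):

```python
def polyak_step(f_x, f_star, g):
    """t_k = (f(x^(k-1)) - f*) / ||g^(k-1)||_2^2, for a NumPy vector g."""
    return (f_x - f_star) / (g @ g)
```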

Example: intersection of sets
Suppose we want to find $x^\star \in C_1 \cap \cdots \cap C_m$, i.e., a point in the intersection of closed, convex sets $C_1, \ldots, C_m$. First define
$$f_i(x) = \mathrm{dist}(x, C_i), \quad i = 1, \ldots, m, \qquad f(x) = \max_{i=1,\ldots,m} f_i(x),$$
and now solve $\min_x f(x)$. Note that $f^\star = 0 \iff x^\star \in C_1 \cap \cdots \cap C_m$. Check: is this problem convex?

Recall the distance function $\mathrm{dist}(x, C) = \min_{y \in C} \|y - x\|_2$. Last time we computed its gradient,
$$\nabla \mathrm{dist}(x, C) = \frac{x - P_C(x)}{\|x - P_C(x)\|_2},$$
where $P_C(x)$ is the projection of $x$ onto $C$. Also recall the subgradient rule: if $f(x) = \max_{i=1,\ldots,m} f_i(x)$, then
$$\partial f(x) = \mathrm{conv}\Big( \bigcup_{i : f_i(x) = f(x)} \partial f_i(x) \Big).$$
So if $f_i(x) = f(x)$ and $g_i \in \partial f_i(x)$, then $g_i \in \partial f(x)$.

Put these two facts together for the intersection-of-sets problem, with $f_i(x) = \mathrm{dist}(x, C_i)$: if $C_i$ is the farthest set from $x$ (so $f_i(x) = f(x)$), and
$$g_i = \nabla f_i(x) = \frac{x - P_{C_i}(x)}{\|x - P_{C_i}(x)\|_2},$$
then $g_i \in \partial f(x)$. Now apply the subgradient method with the Polyak step size $t_k = f(x^{(k-1)})$ (recall $f^\star = 0$, and note $\|g_i\|_2 = 1$). At iteration $k$, with $C_i$ farthest from $x^{(k-1)}$, we perform the update
$$x^{(k)} = x^{(k-1)} - f(x^{(k-1)}) \, \frac{x^{(k-1)} - P_{C_i}(x^{(k-1)})}{\|x^{(k-1)} - P_{C_i}(x^{(k-1)})\|_2} = P_{C_i}(x^{(k-1)}),$$
since $f(x^{(k-1)}) = \|x^{(k-1)} - P_{C_i}(x^{(k-1)})\|_2$.

For two sets, this is the famous alternating projections algorithm: just keep projecting back and forth.
[Figure: alternating projections between two convex sets; from Boyd's lecture notes.]
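A minimal sketch of alternating projections for two illustrative sets in $\mathbb{R}^2$, the unit ball and a halfspace (these particular sets and the fixed iteration count are assumptions, not from the slides):

```python
import numpy as np

a = np.array([1.0, 1.0]) / np.sqrt(2.0)   # unit normal of the halfspace

def proj_ball(x):                          # P_C1, C1 = {x : ||x||_2 <= 1}
    nrm = np.linalg.norm(x)
    return x if nrm <= 1.0 else x / nrm

def proj_halfspace(x):                     # P_C2, C2 = {x : a^T x >= 1}
    return x if a @ x >= 1.0 else x + (1.0 - a @ x) * a

x = np.array([3.0, -2.0])
for _ in range(100):                       # project back and forth
    x = proj_halfspace(proj_ball(x))
print(x, np.linalg.norm(x), a @ x)         # x approaches C1 ∩ C2 = {a}
```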

Projected subgradient method
To minimize a convex function $f$ over a convex set $C$,
$$\min_x f(x) \quad \text{subject to } x \in C,$$
we can use the projected subgradient method. It is just like the usual subgradient method, except we project onto $C$ at each iteration:
$$x^{(k)} = P_C\big( x^{(k-1)} - t_k g^{(k-1)} \big), \quad k = 1, 2, 3, \ldots$$
Assuming we can do this projection, we get the same convergence guarantees as the usual subgradient method, with the same step size choices.
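A minimal sketch of the projected update (the objective, the set $C$, and the step sizes below are illustrative assumptions):

```python
import numpy as np

def projected_subgradient(subgrad, proj, x0, step, iters=500):
    """x^(k) = P_C(x^(k-1) - t_k * g^(k-1))."""
    x = proj(np.asarray(x0, dtype=float))
    for k in range(1, iters + 1):
        x = proj(x - step(k) * subgrad(x))
    return x

# Illustrative: minimize ||x - c||_1 over the unit l2 ball C.
c = np.array([2.0, 0.5])
subgrad = lambda x: np.sign(x - c)                 # subgradient of ||x - c||_1
proj = lambda x: x / max(1.0, np.linalg.norm(x))   # projection onto C
x = projected_subgradient(subgrad, proj, np.zeros(2), step=lambda k: 1.0 / k)
```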

What sets $C$ are easy to project onto? Lots, e.g.,
- Affine images: $\{Ax + b : x \in \mathbb{R}^n\}$
- Solution set of a linear system: $\{x : Ax = b\}$
- Nonnegative orthant: $\mathbb{R}^n_+ = \{x : x \ge 0\}$
- Some norm balls: $\{x : \|x\|_p \le 1\}$ for $p = 1, 2, \infty$
- Some simple polyhedra and simple cones

Warning: it is easy to write down a seemingly simple set $C$ for which $P_C$ turns out to be very hard! E.g., it is generally hard to project onto an arbitrary polyhedron $C = \{x : Ax \le b\}$.

Note: projected gradient descent works too; more next time...
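Closed-form projections for a few of the sets above (a sketch; `proj_linear_system` assumes $A$ has full row rank):

```python
import numpy as np

def proj_nonneg(x):                 # onto R^n_+ = {x : x >= 0}
    return np.maximum(x, 0.0)

def proj_l2_ball(x, r=1.0):         # onto {x : ||x||_2 <= r}
    nrm = np.linalg.norm(x)
    return x if nrm <= r else (r / nrm) * x

def proj_linear_system(x, A, b):    # onto {x : Ax = b}, A full row rank
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)
```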

Stochastic subgradient method
The setup is similar to that for stochastic gradient descent. Consider a sum of convex functions,
$$\min_x \; \sum_{i=1}^{m} f_i(x).$$
The stochastic subgradient method repeats
$$x^{(k)} = x^{(k-1)} - t_k \, g_{i_k}^{(k-1)}, \quad k = 1, 2, 3, \ldots$$
where $i_k \in \{1, \ldots, m\}$ is an index chosen at iteration $k$, by either the random or the cyclic rule, and $g_i^{(k-1)} \in \partial f_i(x^{(k-1)})$ (this update direction is used in place of the usual $\sum_{i=1}^{m} g_i^{(k-1)}$). Note that when each $f_i$, $i = 1, \ldots, m$, is differentiable, this reduces to stochastic gradient descent (SGD).
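A minimal sketch of the update supporting both index rules (the interface and random seed are assumptions):

```python
import numpy as np

def stochastic_subgradient(subgrad_i, m, x0, step, iters=1000, rule="random"):
    """subgrad_i(i, x) returns a subgradient of f_i at x."""
    rng = np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    for k in range(1, iters + 1):
        i = rng.integers(m) if rule == "random" else (k - 1) % m
        x = x - step(k) * subgrad_i(i, x)   # uses g_i instead of the full sum
    return x
```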

Convergence of stochastic methods
Assume each $f_i$, $i = 1, \ldots, m$, is convex and Lipschitz with constant $G > 0$. For fixed step sizes $t_k = t$, $k = 1, 2, 3, \ldots$, the cyclic and randomized stochastic subgradient methods both satisfy
$$\lim_{k \to \infty} f(x^{(k)}_{\mathrm{best}}) \le f^\star + 5 m^2 G^2 t / 2.$$
(For the randomized rule, results hold with probability 1.) Note: $mG$ can be viewed as a Lipschitz constant for the whole function $\sum_{i=1}^{m} f_i$, so this is comparable to the batch bound. For diminishing step sizes, the cyclic and randomized methods satisfy
$$\lim_{k \to \infty} f(x^{(k)}_{\mathrm{best}}) = f^\star.$$

How about convergence rates? This is where things get interesting. Looking back carefully, the batch subgradient method rate was $O(G_{\mathrm{batch}}^2 / \epsilon^2)$, where the Lipschitz constant $G_{\mathrm{batch}}$ is for the whole function.
- Cyclic rule: iteration complexity is $O(m^3 G^2 / \epsilon^2)$. Therefore the number of cycles needed is $O(m^2 G^2 / \epsilon^2)$, comparable to batch.
- Randomized rule: iteration complexity is $O(m^2 G^2 / \epsilon^2)$ (the result holds in expectation, i.e., the bound is on the expected number of iterations). Thus the number of random cycles needed is $O(m G^2 / \epsilon^2)$, reduced by a factor of $m$!
This is a convincing reason to use randomized stochastic methods for problems where $m$ is big.

Example: stochastic logistic regression
Back to the logistic regression problem (now we're talking SGD):
$$\min_\beta \; f(\beta) = \sum_{i=1}^{n} \underbrace{\Big( -y_i x_i^T \beta + \log\big(1 + \exp(x_i^T \beta)\big) \Big)}_{f_i(\beta)}.$$
The gradient computation $\nabla f(\beta) = \sum_{i=1}^{n} \big(p_i(\beta) - y_i\big) x_i$ is doable when $n$ is moderate, but not when $n \approx 500$ million. Recall:
- One batch update costs $O(np)$
- One stochastic update costs $O(p)$
So clearly, e.g., 10K stochastic steps are much more affordable. Also, we often take the fixed step size for stochastic updates to be $n$ times what we use for batch updates. (Why?)
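For reference, a sketch of a single stochastic update for one term $f_i$, which touches only one row of the data and hence costs $O(p)$:

```python
import numpy as np

def sgd_logistic_update(beta, x_i, y_i, t):
    """One O(p) stochastic step: beta - t * grad f_i(beta)."""
    p_i = 1.0 / (1.0 + np.exp(-(x_i @ beta)))   # p_i(beta)
    return beta - t * (p_i - y_i) * x_i          # grad f_i = (p_i - y_i) x_i
```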

The classic picture:
[Figure: iterates of the batch method (blue, $O(np)$ per step) and the stochastic/random method (red, $O(p)$ per step) converging toward the optimum $x^\star$.]
Rule of thumb for stochastic methods: they generally thrive far from the optimum, and generally struggle close to the optimum. (Even more on stochastic methods later in the course...)

Can we do better?
Upside of the subgradient method: broad applicability. Downside: the $O(1/\epsilon^2)$ convergence rate over the problem class of convex, Lipschitz functions is really slow. Nonsmooth first-order methods are iterative methods that update $x^{(k)}$ in
$$x^{(0)} + \mathrm{span}\{g^{(0)}, g^{(1)}, \ldots, g^{(k-1)}\},$$
where the subgradients $g^{(0)}, g^{(1)}, \ldots, g^{(k-1)}$ come from a weak oracle.

Theorem (Nesterov): For any $k \le n - 1$ and starting point $x^{(0)}$, there is a function in the problem class such that any nonsmooth first-order method satisfies
$$f(x^{(k)}) - f^\star \ge \frac{RG}{2(1 + \sqrt{k+1})}.$$

Improving on the subgradient method
In words, we cannot do better than the $O(1/\epsilon^2)$ rate of the subgradient method (unless we go beyond nonsmooth first-order methods). So instead of trying to improve across the board, we will focus on minimizing composite functions of the form
$$f(x) = g(x) + h(x),$$
where $g$ is convex and differentiable, and $h$ is convex and nonsmooth but "simple". For a lot of problems (i.e., functions $h$), we can recover the $O(1/\epsilon)$ rate of gradient descent with a simple algorithm, which has important practical consequences.

References and further reading
- D. Bertsekas (2010), "Incremental gradient, subgradient, and proximal methods for convex optimization: a survey"
- S. Boyd, Lecture notes for EE 364b, Stanford University, Spring 2010-2011
- Y. Nesterov (1998), "Introductory lectures on convex optimization: a basic course", Chapter 3
- B. Polyak (1987), "Introduction to optimization", Chapter 5
- L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012