Stochastic Subgradient Method


Lingjie Weng, Yutian Chen
Bren School of Information and Computer Science, UC Irvine

Subgradient

Recall the basic inequality for a convex differentiable $f$: $f(y) \ge f(x) + \nabla f(x)^T (y - x)$. The gradient $\nabla f(x)$ at $x$ determines a global under-estimator of $f$. What if $f$ is not differentiable? We can still construct a global under-estimator: a vector $g$ is a subgradient of $f$ at $x$ if $f(y) \ge f(x) + g^T (y - x)$ for all $y$. There can be more than one subgradient at a given point: the collection of all subgradients of $f$ at $x$ is the subdifferential $\partial f(x)$.

An example: a non-differentiable function and its subgradients. Let $f = \max\{f_1, f_2\}$, with $f_1, f_2$ convex and differentiable:
- $f_1(x) > f_2(x)$: unique subgradient $g = \nabla f_1(x)$
- $f_1(x) < f_2(x)$: unique subgradient $g = \nabla f_2(x)$
- $f_1(x) = f_2(x)$: the subgradients form the line segment $[\nabla f_1(x), \nabla f_2(x)]$

Subgradient method

The subgradient method is a simple algorithm for minimizing a nondifferentiable convex function $f$:
$$x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}$$
where $x^{(k)}$ is the $k$th iterate, $g^{(k)}$ is any subgradient of $f$ at $x^{(k)}$, and $\alpha_k > 0$ is the $k$th step size. It is not a descent method, so we keep track of the best point so far:
$$f_{\text{best}}^{(k)} = \min\{f(x^{(1)}), \dots, f(x^{(k)})\}$$
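To make the iteration concrete, here is a minimal Python sketch (my illustration, not from the slides); the toy objective $f(x) = |x_1| + 2|x_2|$ and the step-size rule are assumptions chosen for the example.

```python
import numpy as np

def subgradient_method(f, subgrad, x0, step, n_iters=1000):
    """Minimize a convex, possibly nondifferentiable, function f.

    Not a descent method, so we track the best point seen so far.
    step(k) must return the k-th positive step size alpha_k.
    """
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for k in range(1, n_iters + 1):
        x = x - step(k) * subgrad(x)   # x^(k+1) = x^(k) - alpha_k g^(k)
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

# Toy example: f(x) = |x_1| + 2|x_2|; sign(0) = 0 is a valid
# subgradient at the kink. Nonsummable diminishing step sizes.
f = lambda x: abs(x[0]) + 2 * abs(x[1])
subgrad = lambda x: np.array([np.sign(x[0]), 2 * np.sign(x[1])])
x_best, f_best = subgradient_method(f, subgrad, [1.0, -2.0],
                                    step=lambda k: 0.5 / np.sqrt(k))
```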

Why does it work? Convergence

Assumptions:
- $f^\star = \inf_x f(x)$ is attained at $x^\star$, i.e. $f(x^\star) = f^\star$
- $\|g^{(k)}\|_2 \le G$ for all $k$ (holds if $f$ satisfies the Lipschitz condition $|f(u) - f(v)| \le G\|u - v\|_2$)
- $\|x^{(1)} - x^\star\|_2 \le R$

Convergence results for different step-size rules:
- Constant step size $\alpha_k = \alpha$ (a positive constant): converges to a suboptimal value,
$$f_{\text{best}}^{(k)} - f^\star \le \frac{R^2 + G^2 k \alpha^2}{2k\alpha} \;\to\; \frac{G^2\alpha}{2}$$
- Constant step length $\alpha_k = \gamma / \|g^{(k)}\|_2$ (i.e. $\|x^{(k+1)} - x^{(k)}\|_2 = \gamma$): converges to a suboptimal value,
$$f_{\text{best}}^{(k)} - f^\star \le \frac{R^2 + \gamma^2 k}{2\gamma k / G} \;\to\; \frac{G\gamma}{2}$$
- Square summable but not summable: $\sum_{k=1}^\infty \alpha_k^2 < \infty$, $\sum_{k=1}^\infty \alpha_k = \infty$ (e.g. $\alpha_k = \frac{a}{b+k}$)
- Nonsummable diminishing step size: $\lim_{k\to\infty} \alpha_k = 0$ and $\sum_{k=1}^\infty \alpha_k = \infty$ (e.g. $\alpha_k = a/\sqrt{k}$)
- Nonsummable diminishing step length: $\alpha_k = \gamma_k / \|g^{(k)}\|_2$ with $\gamma_k$ nonsummable and diminishing

For the last three rules the method converges to the optimal value: $\lim_{k\to\infty} f_{\text{best}}^{(k)} - f^\star = 0$.
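All of these bounds come from one standard inequality: expanding $\|x^{(k+1)} - x^\star\|_2^2$ and applying the subgradient inequality $f^\star \ge f(x^{(k)}) + g^{(k)T}(x^\star - x^{(k)})$ gives
$$\|x^{(k+1)} - x^\star\|_2^2 \le \|x^{(k)} - x^\star\|_2^2 - 2\alpha_k\big(f(x^{(k)}) - f^\star\big) + \alpha_k^2 \|g^{(k)}\|_2^2,$$
and summing over $k$ and rearranging yields the bounds above.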

Stopping criterion

Terminating when $\frac{R^2 + G^2 \sum_{i=1}^k \alpha_i^2}{2\sum_{i=1}^k \alpha_i} \le \varepsilon$ is really, really slow. The choice $\alpha_i = (R/G)/\sqrt{k}$, $i = 1, \dots, k$, minimizes this suboptimality bound. The number of steps needed to achieve a guaranteed accuracy of $\varepsilon$ is at least $(RG/\varepsilon)^2$, no matter what step sizes we use. There really isn't a good stopping criterion for the subgradient method; subgradient methods are used only for problems in which very high accuracy is not required, typically around 10%.
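To see where the $(RG/\varepsilon)^2$ count comes from, plug $\alpha_i = (R/G)/\sqrt{k}$ into the bound:
$$\frac{R^2 + G^2 \sum_{i=1}^k \alpha_i^2}{2\sum_{i=1}^k \alpha_i} = \frac{R^2 + G^2 \cdot k \cdot \frac{R^2}{G^2 k}}{2k \cdot \frac{R/G}{\sqrt{k}}} = \frac{2R^2}{2R\sqrt{k}/G} = \frac{RG}{\sqrt{k}},$$
so guaranteeing $RG/\sqrt{k} \le \varepsilon$ requires $k \ge (RG/\varepsilon)^2$.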

Subgradient method for unconstrained problem

1. Set $k \leftarrow 1$ and choose an initial point $x^{(1)}$. Choose an infinite sequence of positive step sizes $\{\alpha_k\}_{k=1}^\infty$.
2. Compute a subgradient $g^{(k)} \in \partial f(x^{(k)})$.
3. Update $x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}$.
4. If the algorithm has not converged, set $k \leftarrow k + 1$ and go to step 2.

Properties:
- Simple and general: any subgradient from $\partial f(x^{(k)})$ is sufficient.
- Convergence theorems exist for properly selected step sizes.
- Slow convergence: it usually requires a lot of iterations.
- No monotonic convergence: $-g^{(k)}$ need not be a descent direction, so line search cannot be used to find an optimal $\alpha_k$.
- No good stopping condition.

Example: $L_2$-norm Support Vector Machines

$$(w^\star, b^\star, \xi^\star) = \operatorname*{argmin}_{w,b,\xi} \; \tfrac{1}{2}\|w\|_2^2 + C \sum_{i=1}^m \xi_i$$
$$\text{s.t. } y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, m$$

Note that $\xi_i \ge \max\{0, 1 - y_i(w^T x_i + b)\}$ and equality is attained at the optimum, i.e., we can optimize with $\xi_i = \max\{0, 1 - y_i(w^T x_i + b)\} = \ell(\langle w, x_i \rangle; y_i)$ (the hinge loss). We can therefore reformulate the problem as an unconstrained convex problem:
$$(w^\star, b^\star) = \operatorname*{argmin}_{w,b} \; \tfrac{1}{2}\|w\|_2^2 + C \sum_{i=1}^m \ell(\langle w, x_i \rangle; y_i)$$
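A subgradient of this unconstrained objective is easy to form, as the next slide explains; here is a minimal sketch (my illustration, with assumed array shapes: X is m-by-d, y has entries in {-1, +1}):

```python
import numpy as np

def svm_subgradient(w, b, X, y, C):
    """A subgradient of 0.5*||w||^2 + C*sum_i max(0, 1 - y_i*(w.x_i + b)).

    On the flat part of the hinge (margin >= 1) we pick the zero
    subgradient, which is always a valid choice.
    """
    margins = y * (X @ w + b)
    active = margins < 1                                  # hinge is "on"
    gw = w - C * (y[active, None] * X[active]).sum(axis=0)
    gb = -C * y[active].sum()
    return gw, gb
```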

Example: $L_2$-norm Support Vector Machines (cont'd)

The function $f_0(w, b) = \tfrac{1}{2}\|w\|^2$ is differentiable, thus $\nabla_w f_0(w, b) = w$ and $\nabla_b f_0(w, b) = 0$. The function $f_i(w, b) = \max\{0, 1 - y_i(w^T x_i + b)\} = \ell(y_i(w^T x_i + b))$ is a point-wise maximum, thus its subdifferential is the set of convex combinations of the gradients of the active functions.

[Figure: the 0/1 loss and the hinge loss $\ell(x)$.]

Constant step length $\alpha_t = \gamma / \|g^{(t)}\|_2$: converges to a suboptimal value, $\lim_{t\to\infty} f_{\text{best}}^{(t)} - f^\star \le G\gamma/2$. Diminishing step size $\alpha_t = \gamma/\sqrt{t}$: converges to the optimal value, $\lim_{t\to\infty} f_{\text{best}}^{(t)} - f^\star = 0$.

[Figures: $f_{\text{best}}^{(t)} - f^\star$ versus iteration number $t$ for each step-size rule.]

Trade-off: larger $\gamma$ gives faster convergence, but larger final suboptimality.

Subgradient method for constrained problem

Consider the constrained optimization problem: minimize $f(x)$ s.t. $x \in S$, where $S$ is a convex set. The projected subgradient method is given by
$$x^{(k+1)} = P_S\big(x^{(k)} - \alpha_k g^{(k)}\big)$$
where $P_S$ is the (Euclidean) projection onto $S$.

1. Set $k \leftarrow 1$ and choose an initial point $x^{(1)} \in S$. Choose an infinite sequence of positive step sizes $\{\alpha_k\}_{k=1}^\infty$.
2. Compute a subgradient $g^{(k)} \in \partial f(x^{(k)})$.
3. Update $x^{(k+1)} = P_S(x^{(k)} - \alpha_k g^{(k)})$.
4. If the algorithm has not converged, set $k \leftarrow k + 1$ and go to step 2.
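In code the only change from the unconstrained method is the projection after each update. A minimal sketch (my illustration, using an assumed Euclidean-ball constraint set, for which the projection has a closed form):

```python
import numpy as np

def project_l2_ball(x, radius):
    """Euclidean projection onto S = {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def projected_subgradient(subgrad, project, x0, step, n_iters=1000):
    """Projected subgradient method: project each iterate back onto S."""
    x = np.asarray(x0, dtype=float)
    for k in range(1, n_iters + 1):
        x = project(x - step(k) * subgrad(x))
    return x
```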

Example: $L_2$-norm Support Vector Machines

$$(w^\star, b^\star) = \operatorname*{argmin}_{w,b} \; \tfrac{1}{2}\|w\|_2^2 + C \sum_{i=1}^m \max\{0, 1 - y_i(w^T x_i + b)\}$$

An alternative formulation of the same problem is
$$(w^\star, b^\star) = \operatorname*{argmin}_{w,b} \; \sum_{i=1}^m \max\{0, 1 - y_i(w^T x_i + b)\} \quad \text{s.t. } \|w\| \le W$$

Note that the regularization constant $C > 0$ is replaced here by a less abstract constant $W > 0$. To implement the projected subgradient method we need only a projection operator
$$P_S(w, b) = \Big(w \cdot \min\big\{1, \tfrac{W}{\|w\|}\big\},\; b\Big)$$
So after the updates $w \leftarrow w - \alpha_k g_w$ and $b \leftarrow b - \alpha_k g_b$, we project: if $\|w\| > W$, set $w \leftarrow w \cdot \frac{W}{\|w\|}$.
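Putting the pieces together, one iteration of this scheme might look like the following sketch (my illustration; the helper name and shapes are hypothetical):

```python
import numpy as np

def constrained_svm_step(w, b, X, y, W, alpha):
    """One projected-subgradient step for
    min_{w,b} sum_i max(0, 1 - y_i*(w.x_i + b))  s.t.  ||w||_2 <= W."""
    margins = y * (X @ w + b)
    active = margins < 1
    gw = -(y[active, None] * X[active]).sum(axis=0)  # hinge subgradient in w
    gb = -y[active].sum()                            # hinge subgradient in b
    w, b = w - alpha * gw, b - alpha * gb
    norm = np.linalg.norm(w)
    if norm > W:                                     # project w onto the ball
        w = w * (W / norm)
    return w, b
```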

[Figures: convergence on the SVM problem with constant step length $\alpha_t = \gamma/\|g^{(t)}\|_2$ and with diminishing step size $\alpha_t = \gamma/\sqrt{t}$.]

Noisy unbiased subgradient

A random vector $g \in \mathbf{R}^n$ is a noisy unbiased subgradient of $f: \mathbf{R}^n \to \mathbf{R}$ at $x$ if for all $z$
$$f(z) \ge f(x) + (\mathbf{E} g)^T (z - x),$$
i.e., $\bar g = \mathbf{E} g \in \partial f(x)$. The noise can represent error in computing $g$, measurement noise, Monte Carlo sampling error, etc. If $x$ is itself random, $g$ is a noisy unbiased subgradient of $f$ at $x$ if
$$f(z) \ge f(x) + \mathbf{E}(g \mid x)^T (z - x) \quad \text{for all } z$$
holds almost surely. This is the same as $\mathbf{E}(g \mid x) \in \partial f(x)$ (a.s.).

Stochastic subgradient method

The stochastic subgradient method is simply the subgradient method run with noisy unbiased subgradients:
$$x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}$$
where $x^{(k)}$ is the $k$th iterate, $g^{(k)}$ is any noisy unbiased subgradient of (convex) $f$ at $x^{(k)}$, and $\alpha_k > 0$ is the $k$th step size. As before, define $f_{\text{best}}^{(k)} = \min\{f(x^{(1)}), \dots, f(x^{(k)})\}$.
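As a toy illustration (mine, not from the slides): adding zero-mean noise to an exact subgradient yields a valid noisy unbiased subgradient, and the same iteration still drives $f$ toward its minimum under a square-summable step-size rule.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.abs(x).sum()    # f(x) = ||x||_1, minimized at x = 0
# sign(x) is an exact subgradient of ||x||_1; zero-mean noise keeps it unbiased
noisy_subgrad = lambda x: np.sign(x) + rng.normal(scale=0.5, size=x.shape)

x = np.array([3.0, -2.0])
for k in range(1, 5001):
    x = x - (1.0 / k) * noisy_subgrad(x)   # square-summable, not summable
print(f(x))   # close to the optimal value 0
```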

Assumptions

- $f^\star = \inf_x f(x) > -\infty$, with $f(x^\star) = f^\star$
- $\mathbf{E}\|g^{(k)}\|_2^2 \le G^2$ for all $k$
- $\mathbf{E}\|x^{(1)} - x^\star\|_2^2 \le R^2$
- Step sizes are square-summable but not summable:
$$\alpha_k \ge 0, \qquad \sum_{k=1}^\infty \alpha_k^2 = \|\alpha\|_2^2 < \infty, \qquad \sum_{k=1}^\infty \alpha_k = \infty$$

These assumptions are stronger than needed; they are made just to simplify the proofs.

Convergence results

- Convergence in expectation:
$$\mathbf{E} f_{\text{best}}^{(k)} - f^\star \le \frac{R^2 + G^2\|\alpha\|_2^2}{2\sum_{i=1}^k \alpha_i}, \qquad \lim_{k\to\infty} \mathbf{E} f_{\text{best}}^{(k)} = f^\star$$
- Convergence in probability (via Markov's inequality): for any $\varepsilon > 0$,
$$\lim_{k\to\infty} \operatorname{Prob}\big(f_{\text{best}}^{(k)} \ge f^\star + \varepsilon\big) = 0$$
- Almost sure convergence:
$$\lim_{k\to\infty} f_{\text{best}}^{(k)} = f^\star \quad \text{(a.s.)}$$

Example: Machine learning problem

Minimizing a loss function:
$$\min_w L(w) = \frac{1}{m} \sum_{i=1}^m \text{loss}(w, x_i, y_i)$$
Subgradient estimate, from an example $i$ drawn uniformly at random: $g^{(k)} = \partial_w \, \text{loss}(w^{(k)}, x_i, y_i)$.

Example: $L_2$-regularized linear classification with the hinge loss, a.k.a. SVM, in constrained or penalized form:
$$\min_{\|w\|_2 \le B} \frac{1}{m} \sum_{i=1}^m \ell(\langle w, x_i \rangle; y_i) \qquad \text{or} \qquad \min_w \frac{1}{m} \sum_{i=1}^m \ell(\langle w, x_i \rangle; y_i) + \lambda \|w\|_2^2$$

Algorithm (constrained form; see the sketch below): initialize at $w^{(0)}$; at iteration $k$, update $w^{(k+1)} \leftarrow w^{(k)} - \alpha_k g^{(k)}$; if $\|w^{(k+1)}\| > B$, rescale $w^{(k+1)} \leftarrow B \, w^{(k+1)} / \|w^{(k+1)}\|$.
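A minimal sketch of this single-example procedure (my illustration; the step-size rule $B/\sqrt{k}$ is an assumption in the spirit of the rate quoted below):

```python
import numpy as np

def stochastic_subgradient_svm(X, y, B, n_iters=10000, seed=0):
    """Single-example stochastic subgradient method for
    min_{||w||_2 <= B} (1/m) * sum_i max(0, 1 - y_i*<w, x_i>)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for k in range(1, n_iters + 1):
        i = rng.integers(m)                 # uniform i => unbiased subgradient
        g = -y[i] * X[i] if y[i] * (X[i] @ w) < 1 else np.zeros(d)
        w = w - (B / np.sqrt(k)) * g        # assumed diminishing step size
        norm = np.linalg.norm(w)
        if norm > B:                        # rescale back onto ||w|| <= B
            w = w * (B / norm)
    return w
```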

Stochastic vs Batch Gradient Descent

The returned solution is the average of the iterates: $\bar w = \frac{1}{m} \sum_{i=1}^m w^{(i)}$.

Stochastic vs Batch Gradient Descent

Runtime to guarantee an error $\varepsilon$ (with $\alpha_k = B/(G\sqrt{k})$):
- Stochastic gradient descent: $O\!\big(\frac{G^2 B^2}{\varepsilon^2} \, d\big)$
- Batch gradient descent: $O\!\big(\frac{G^2 B^2}{\varepsilon^2} \, m d\big)$

This is the optimal runtime when only gradients are used and only a Lipschitz condition is assumed; it is slower than second-order methods.

Stochastic programming

minimize $\mathbf{E} f_0(x, \omega)$ subject to $\mathbf{E} f_i(x, \omega) \le 0$, $i = 1, \dots, m$

If $f_i(x, \omega)$ is convex in $x$ for each $\omega$, the problem is convex. The certainty-equivalent problem is

minimize $f_0(x, \mathbf{E}\omega)$ subject to $f_i(x, \mathbf{E}\omega) \le 0$, $i = 1, \dots, m$

(if $f_i(x, \omega)$ is convex in $\omega$, this gives a lower bound on the optimal value of the stochastic problem, since by Jensen's inequality $\mathbf{E} f_i(x, \omega) \ge f_i(x, \mathbf{E}\omega)$).

Machine learning as stochastic programming

Minimizing generalization error: $\min_{w \in \mathcal{W}} \mathbf{E}[f(w, z)]$. Two approaches:

Sample Average Approximation (SAA), i.e., empirical risk minimization:
- Collect a sample $z_1, \dots, z_m$
- Minimize $F(w) = \frac{1}{m}\sum_{i=1}^m f(w, z_i)$: $\hat w = \operatorname*{argmin}_w F(w)$
- Solved by stochastic/batch gradient descent, Newton's method, etc.

Stochastic Approximation (SA):
- Update $w^{(k)}$ based on a weak (unbiased) estimator of a subgradient of $F$ at $w^{(k)}$: $g^{(k)} = \partial f(w^{(k)}, z_k)$
- Stochastic gradient descent, returning the averaged iterate $\bar w_m = \frac{1}{m}\sum_{i=1}^m w^{(i)}$


SAA vs SA

Runtime to guarantee an error $\varepsilon$ (i.e., $L(\hat w) \le L(w^\star) + \varepsilon$):

SAA (empirical risk minimization):
- Sample size needed: $m = \Omega\!\big(\frac{G^2 B^2}{\varepsilon^2}\big)$
- Using any method (stochastic/batch subgradient descent, Newton's method, ...), runtime: $\Omega\!\big(\frac{G^2 B^2}{\varepsilon^2} \, d\big)$

SA (single-pass stochastic gradient descent):
- After $k$ iterations, $L(\bar w_k) \le L(w^\star) + O\Big(\sqrt{\tfrac{G^2 B^2}{k}}\Big)$
- Runtime: $O\!\big(\frac{G^2 B^2}{\varepsilon^2} \, d\big)$

References

- S. Boyd et al. Subgradient Methods; Stochastic Subgradient Methods. Lecture slides and notes for EE364b, Stanford University, Spring Quarter 2007-08.
- Vojtech Franc. A Short Tutorial on Subgradient Methods. Talk at the IDA retreat, May 16, 2007.
- Nathan Srebro and Ambuj Tewari. Stochastic Optimization. ICML 2010 Tutorial, July 2010.