Convex Optimization Lecture 16


Today: Projected Gradient Descent, Conditional Gradient Descent, Stochastic Gradient Descent, Random Coordinate Descent

Recall: Gradient Descent (Steepest Descent w.r.t. the Euclidean Norm)

Gradient descent algorithm:
Init x^(0) ∈ dom(f)
Iterate x^(k+1) = x^(k) − t^(k) ∇f(x^(k)),  i.e., the step is Δx = −∇f(x^(k))

Note: Δx is not normalized (i.e., we don't require ‖Δx‖₂ = 1); this just changes the meaning of the step size t^(k). How do we choose the step size t^(k)?

Reminder: we are violating here the distinction between the primal space and the dual (gradient) space; we implicitly link them by matching representations w.r.t. a chosen basis.

Some upper bounds on #iterations for ε-optimality (κ = M/μ; each iteration costs one gradient oracle call plus O(n) operations):

        μ-strongly convex, M-smooth | M-smooth
GD      κ log(1/ε)                  | M‖x^(1) − x*‖²/ε
A-GD    √κ log(1/ε)                 | √(M‖x^(1) − x*‖²/ε)

Smoothness and Strong Convexity

Def: f is μ-strongly convex if f(x + Δx) ≥ f(x) + ⟨∇f(x), Δx⟩ + (μ/2)‖Δx‖²
Def: f is M-smooth if f(x + Δx) ≤ f(x) + ⟨∇f(x), Δx⟩ + (M/2)‖Δx‖²

Both can be viewed as conditions on the directional second derivatives:
μ ≤ (d²/dt²) f(x + tv) = vᵀ∇²f(x)v ≤ M   (for ‖v‖ = 1)

That is, f is sandwiched between two quadratics:
f(x) + ⟨∇f(x), Δx⟩ + (μ/2)‖Δx‖² ≤ f(x + Δx) ≤ f(x) + ⟨∇f(x), Δx⟩ + (M/2)‖Δx‖²
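To make the constants concrete, here is a minimal sketch (my own example, not from the lecture): for a quadratic f(x) = ½xᵀQx − vᵀx with symmetric Q ⪰ 0, the directional second derivative is vᵀQv, so μ = λ_min(Q) and M = λ_max(Q).

```python
import numpy as np

# Sketch (assumed example): for f(x) = 0.5*x^T Q x - v^T x with symmetric Q >= 0,
# the directional second derivative along a unit vector v is v^T Q v, so
# mu = lambda_min(Q) and M = lambda_max(Q).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q = A.T @ A + 0.1 * np.eye(5)          # symmetric positive definite

eigvals = np.linalg.eigvalsh(Q)         # ascending eigenvalues
mu, M = eigvals[0], eigvals[-1]
print(f"strong convexity mu ~ {mu:.3f}, smoothness M ~ {M:.3f}, kappa ~ {M/mu:.1f}")
```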

What about constraints?

min_x f(x)   s.t.   x ∈ X,   where X is convex

Projected Gradient Descent

Idea: make sure the iterates stay feasible by projecting onto X.

Algorithm:
y^(k+1) = x^(k) − t^(k) g^(k),  where g^(k) ∈ ∂f(x^(k))   [gradient step]
x^(k+1) = Π_X(y^(k+1))   [projection]

The projection operator Π_X onto X:
Π_X(x) = argmin_{z ∈ X} ‖x − z‖

Notice: the update is stated with a subgradient in place of the gradient (for differentiable f it is simply ∇f(x^(k))).
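As an illustration, here is a minimal sketch of projected (sub)gradient descent, under my own assumptions (not the lecture's code): X is taken to be a Euclidean ball, where the projection has a closed form.

```python
import numpy as np

def project_l2_ball(y, radius=1.0):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def projected_gradient_descent(grad, x0, step, n_iters=500, radius=1.0):
    """x_{k+1} = Pi_X(x_k - t_k * g_k), with X the Euclidean ball."""
    x = np.asarray(x0, dtype=float)
    for k in range(n_iters):
        g = grad(x)                     # (sub)gradient oracle
        y = x - step(k) * g             # gradient step
        x = project_l2_ball(y, radius)  # projection step
    return x

# Hypothetical usage: minimize f(x) = 0.5*||x - c||^2 over the unit ball.
c = np.array([2.0, -1.0, 0.5])
x_hat = projected_gradient_descent(grad=lambda x: x - c,
                                   x0=np.zeros(3),
                                   step=lambda k: 0.5)
print(x_hat, c / np.linalg.norm(c))     # both should be ~ c/||c||
```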

Projected gradient descent convergence rates (#iterations for ε-optimality):

        μ-strongly convex, M-smooth | M-smooth                                | L-Lipschitz          | L-Lipschitz, μ-strongly convex
        κ log(1/ε)                  | (M‖x^(1) − x*‖² + f(x^(1)) − f(x*))/ε   | L²‖x^(1) − x*‖²/ε²   | L²/(με)

Same as the unconstrained case! But it requires a projection... how expensive is that?

Examples: Euclidean ball, PSD constraints, linear constraints Ax ≤ b.
Sometimes the projection is as expensive as solving the original optimization problem!

Conditional Gradient Descent

A projection-free algorithm! Introduced for QPs by Marguerite Frank and Philip Wolfe (1956).

Algorithm:
Initialize: x^(0) ∈ X
Iterate:
s^(k) = argmin_{s ∈ X} ⟨∇f(x^(k)), s⟩
x^(k+1) = x^(k) + t^(k)(s^(k) − x^(k))
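Before the analysis, a minimal sketch under my own assumptions (not the lecture's code): the feasible set is the ℓ1 ball, where the linear minimization step simply returns a signed coordinate vertex, and the step size is the standard t^(k) = 2/(k+1) choice discussed below.

```python
import numpy as np

def lmo_l1_ball(grad, radius=1.0):
    """Linear minimization oracle over the l1 ball:
    argmin_{||s||_1 <= radius} <grad, s> is a signed coordinate vertex."""
    i = np.argmax(np.abs(grad))
    s = np.zeros_like(grad)
    s[i] = -radius * np.sign(grad[i])
    return s

def frank_wolfe(grad, x0, n_iters=200, radius=1.0):
    """x_{k+1} = x_k + t_k (s_k - x_k), with t_k = 2/(k+1) (0-indexed: 2/(k+2))."""
    x = np.asarray(x0, dtype=float)
    for k in range(n_iters):
        s = lmo_l1_ball(grad(x), radius)
        t = 2.0 / (k + 2)
        x = x + t * (s - x)             # convex combination: iterate stays feasible
    return x

# Hypothetical usage: least squares over the l1 ball (a LASSO-style constraint).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((30, 10)), rng.standard_normal(30)
x_fw = frank_wolfe(grad=lambda x: A.T @ (A @ x - b), x0=np.zeros(10))
print(np.linalg.norm(x_fw, 1))          # stays within the l1 ball
```

Here the linear minimization step costs O(n), which is exactly the point of the method when projection onto X is expensive.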

Notice:
f assumed M-smooth
X assumed bounded
First-order oracle
Linear optimization over X (in place of projection)
Sparse iterates (e.g., for polytope constraints)

Convergence rate: for M-smooth f with step size t^(k) = 2/(k+1),
f(x^(k)) − f(x*) ≤ 2MR²/(k+1),   where R = sup_{x,y ∈ X} ‖x − y‖

# iterations required for ε-optimality: MR²/ε

Proof:
f(x^(k+1)) ≤ f(x^(k)) + ⟨∇f(x^(k)), x^(k+1) − x^(k)⟩ + (M/2)‖x^(k+1) − x^(k)‖²   [smoothness]
= f(x^(k)) + t^(k)⟨∇f(x^(k)), s^(k) − x^(k)⟩ + (M/2)(t^(k))²‖s^(k) − x^(k)‖²   [update]
≤ f(x^(k)) + t^(k)⟨∇f(x^(k)), x* − x^(k)⟩ + (M/2)(t^(k))²R²   [choice of s^(k); definition of R]
≤ f(x^(k)) + t^(k)(f(x*) − f(x^(k))) + (M/2)(t^(k))²R²   [convexity]

Define δ^(k) = f(x^(k)) − f(x*); we have:
δ^(k+1) ≤ (1 − t^(k))δ^(k) + (M/2)(t^(k))²R²

A simple induction shows that for t^(k) = 2/(k+1):
δ^(k) ≤ 2MR²/(k+1)

Same rate as projected gradient descent, but without projection! (It does need linear optimization over X.)
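For completeness, the inductive step behind the last bound (spelled out here, not on the slide): assuming δ^(k) ≤ 2MR²/(k+1) and plugging in t^(k) = 2/(k+1),

δ^(k+1) ≤ (1 − 2/(k+1)) · 2MR²/(k+1) + (M/2)(2/(k+1))²R²
        = 2MR²(k − 1)/(k+1)² + 2MR²/(k+1)²
        = 2MR² · k/(k+1)²
        ≤ 2MR²/(k+2),

since k(k+2) ≤ (k+1)².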

What about strong convexity? Not helpful! It does not yield a linear rate (κ log(1/ε)) for conditional gradient. Active research.

Randomness in Convex Optimization

Insight: first-order methods are robust; inexact gradients are sufficient. As long as the gradients are correct on average, the error will vanish. Long history (Robbins & Monro, 1951).

Stochastic Gradient Descent: Motivation

Many machine learning problems have the form of empirical risk minimization:
min_{x ∈ ℝⁿ} Σ_{i=1}^m f_i(x) + λΩ(x)
where the f_i are convex and λ is the regularization constant.
Classification: SVM, logistic regression
Regression: least squares, ridge regression, LASSO

Cost of computing the gradient? O(mn). What if m is VERY large? We want cheaper iterations.

Idea: use a stochastic first-order oracle: for each point x ∈ dom(f) it returns a stochastic gradient ĝ(x) s.t. E[ĝ(x)] ∈ ∂f(x). That is, ĝ is an unbiased estimator of the subgradient.

Example: for the objective
min_{x ∈ ℝⁿ} (1/m) Σ_{i=1}^m F_i(x),   where F_i(x) = f_i(x) + λΩ(x),
select j ∈ {1, ..., m} uniformly at random and return ∇F_j(x). Then
E[ĝ(x)] = (1/m) Σ_i ∇F_i(x) = ∇f(x).

SGD iterates:
x^(k+1) = x^(k) − t^(k) ĝ(x^(k))

How to choose the step size t^(k)?
Lipschitz case: t^(k) ∝ 1/√k
μ-strongly-convex case: t^(k) ∝ 1/(μk)
Note: decaying step size!

Stochastic vs. deterministic methods: minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, θᵀΦ(x_i)) + μΩ(θ):
Batch gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) = θ_{t−1} − (γ_t/n) Σ_{i=1}^n f_i′(θ_{t−1})
Stochastic gradient descent: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})

[Figure: log(excess cost) vs. time for stochastic vs. deterministic methods; goal = best of both worlds, a linear rate with O(1) iteration cost. Figures borrowed from Francis Bach's slides.]
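A minimal SGD sketch under my own assumptions (not the lecture's code): a ridge-regression finite sum, uniform sampling of one term per step, the strongly-convex step size t^(k) ∝ 1/(μk) shifted so the first steps are no larger than 1/L (a common practical safeguard), and iterate averaging, for which the guarantees are stated.

```python
import numpy as np

def sgd(grad_i, x0, m, mu, L, n_iters=50_000, seed=0):
    """Sketch: pick i uniformly at random and set x <- x - t_k * grad_i(x, i),
    with t_k = 1/(mu*(k + 1 + L/mu)), i.e. roughly 1/(mu*k) but never above 1/L."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    x_avg = np.zeros_like(x)                  # averaged iterate
    shift = L / mu
    for k in range(n_iters):
        i = rng.integers(m)                   # uniform index => unbiased gradient estimate
        x = x - grad_i(x, i) / (mu * (k + 1 + shift))
        x_avg += (x - x_avg) / (k + 1)        # running average of the iterates
    return x_avg

# Hypothetical usage: ridge regression,
# f(x) = (1/m) sum_i [0.5*(a_i^T x - b_i)^2 + 0.5*mu*||x||^2].
rng = np.random.default_rng(1)
A, b, mu = rng.standard_normal((200, 10)), rng.standard_normal(200), 0.1
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i] + mu * x
L = np.max(np.sum(A * A, axis=1)) + mu        # per-sample smoothness bound
x_sgd = sgd(grad_i, np.zeros(10), m=len(b), mu=mu, L=L)

x_star = np.linalg.solve(A.T @ A / len(b) + mu * np.eye(10), A.T @ b / len(b))
print(np.linalg.norm(x_sgd - x_star))         # approaches x_star, but only slowly
```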

Convergence rates (#iterations for ε-optimality):

        μ-strongly convex, M-smooth | M-smooth            | L-Lipschitz          | L-Lipschitz, μ-strongly convex
GD      κ log(1/ε)                  | M‖x^(1) − x*‖²/ε    | L²‖x^(1) − x*‖²/ε²   | L²/(με)
SGD     ?                           | ?                   | B²‖x^(1) − x*‖²/ε²   | B²/(με)

Additional assumption: E[‖ĝ(x)‖²] ≤ B² for all x ∈ dom(f).

Comment: the SGD guarantee holds in expectation, for the averaged iterate:
E[f((1/K) Σ_{k=1}^K x^(k))] − f(x*) ≤ ...

Similar rates as with exact gradients!

        μ-strongly convex, M-smooth | M-smooth                                    | L-Lipschitz          | L-Lipschitz, μ-strongly convex
GD      κ log(1/ε)                  | M‖x^(1) − x*‖²/ε                            | L²‖x^(1) − x*‖²/ε²   | L²/(με)
A-GD    √κ log(1/ε)                 | √(M‖x^(1) − x*‖²/ε)                         |                      |
SGD     ?                           | (σ‖x^(1) − x*‖/ε)² + √(M‖x^(1) − x*‖²/ε)    | B²‖x^(1) − x*‖²/ε²   | B²/(με)

where E[‖∇f(x) − ĝ(x)‖²] ≤ σ².

Smoothness? Not helpful! (same rate as in the non-smooth case). Lower bounds (Nemirovski & Yudin, 1983). Active research.
Acceleration? Cannot be easily accelerated! Mini-batch acceleration. Active research.

Random Coordinate Descent

Recall: the cost of computing the exact GD update is O(mn). What if n is VERY large? We want cheaper iterations.

Random coordinate descent algorithm:
Initialize: x^(0) ∈ dom(f)
Iterate: pick i(k) ∈ {1, ..., n} randomly
x^(k+1) = x^(k) − t^(k) ∇_{i(k)} f(x^(k)) e_{i(k)}
where we denote ∇_i f(x) = ∂f/∂x_i (x)

Assumption: f is convex and differentiable.

Q: What if f is not differentiable; does this still work? A: No! Coordinate descent can get stuck at a non-optimal point; see the counterexample contour plots (figures borrowed from Ryan Tibshirani's slides).

Q: Same question again, but now f(x) = g(x) + Σ_{i=1}^n h_i(x_i), with g convex and smooth and each h_i convex? A: Yes, coordinate descent does work when the non-smooth part is separable.

Iteration cost? Computing ∇_i f(x) plus O(1) for the update; compare to computing ∇f(x) plus O(n) for GD.

Example: quadratic f(x) = ½ xᵀQx − vᵀx
∇f(x) = Qx − v (costs O(n²)),   ∇_i f(x) = q_iᵀx − v_i (costs O(n))

Can view CD as SGD with the oracle ĝ(x) = n ∇_i f(x) e_i, with i chosen uniformly at random. Clearly,
E[ĝ(x)] = (1/n) Σ_i n ∇_i f(x) e_i = ∇f(x).

Can replace individual coordinates with blocks of coordinates.
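A minimal sketch of random coordinate descent on the quadratic example, under my own assumptions (not the lecture's code): for a quadratic, the coordinate-wise smoothness constant is M_i = Q_ii, and the step 1/M_i exactly minimizes f along coordinate i.

```python
import numpy as np

def random_coordinate_descent(Q, v, n_iters=50_000, seed=0):
    """Sketch: minimize f(x) = 0.5*x^T Q x - v^T x by updating one random
    coordinate per iteration with step 1/M_i, where M_i = Q[i, i]."""
    rng = np.random.default_rng(seed)
    n = len(v)
    x = np.zeros(n)
    for _ in range(n_iters):
        i = rng.integers(n)
        grad_i = Q[i] @ x - v[i]        # O(n): one row of Q, vs O(n^2) for the full gradient
        x[i] -= grad_i / Q[i, i]        # exact minimization along coordinate i
    return x

# Hypothetical usage: compare against the exact solution Q^{-1} v.
rng = np.random.default_rng(2)
A = rng.standard_normal((20, 20))
Q = A.T @ A + np.eye(20)                # symmetric positive definite
v = rng.standard_normal(20)
x_cd = random_coordinate_descent(Q, v)
print(np.linalg.norm(x_cd - np.linalg.solve(Q, v)))   # should be small
```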

Example: SVM

Primal:   min_w (λ/2)‖w‖² + Σ_i max(1 − y_i wᵀz_i, 0)
Dual:     min_α (1/2) αᵀQα − 1ᵀα   s.t. 0 ≤ α_i ≤ 1/λ,   where Q_ij = y_i y_j z_iᵀz_j

[Figures: primal sub-optimality vs. number of epochs for SDCA, SDCA-Perm, and SGD on several datasets.]
(Shalev-Shwartz & Zhang, 2013): comparing the primal sub-optimality of SDCA and SGD for the non-smooth hinge loss.

Convergence rate

Directional (coordinate-wise) smoothness of f: there exist M_1, ..., M_n s.t. for any i ∈ {1, ..., n}, x ∈ ℝⁿ, and u ∈ ℝ:
|∇_i f(x + u e_i) − ∇_i f(x)| ≤ M_i |u|

Note: this implies f is M-smooth with M ≤ Σ_i M_i.

Consider the update:
x^(k+1) = x^(k) − (1/M_{i(k)}) ∇_{i(k)} f(x^(k)) e_{i(k)}

No need to know the M_i's; they can be adjusted dynamically.

Rates (Nesterov, 2012), #iterations for ε-optimality:

        μ-strongly convex, M-smooth            | M-smooth                                    | L-Lipschitz          | L-Lipschitz, μ-strongly convex
GD      κ log(1/ε)                             | M‖x^(1) − x*‖²/ε                            | L²‖x^(1) − x*‖²/ε²   | L²/(με)
CD      nκ̃ log(1/ε),  κ̃ = (Σ_i M_i)/(nμ)       | n M̄ ‖x^(1) − x*‖²/ε,  M̄ = (1/n) Σ_i M_i     |                      |

Same total cost as GD, but with much cheaper iterations.

Comment: holds in expectation:
E[f(x^(k))] − f(x*) ≤ ...

Acceleration? Yes! Active research.