Accelerating Stochastic Optimization


Shai Shalev-Shwartz
School of CS and Engineering, The Hebrew University of Jerusalem, and Mobileye
Master Class at Tel-Aviv University, November 2014

Learning Goal (informal): learn an accurate mapping $h : \mathcal{X} \to \mathcal{Y}$ based on examples $((x_1, y_1), \ldots, (x_n, y_n)) \in (\mathcal{X} \times \mathcal{Y})^n$.

Deep learning: each mapping $h : \mathcal{X} \to \mathcal{Y}$ is parameterized by a weight vector $w \in \mathbb{R}^d$, so our goal is to learn the vector $w$.

Regularized Loss Minimization

A popular learning approach is Regularized Loss Minimization (RLM) with Euclidean regularization: sample $S = ((x_1, y_1), \ldots, (x_n, y_n)) \sim \mathcal{D}^n$ and approximately solve the RLM problem

$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^n \phi_i(w) + \frac{\lambda}{2}\|w\|^2$$

where $\phi_i(w) = \ell_{y_i}(h_w(x_i))$ is the loss of predicting $h_w(x_i)$ when the true target is $y_i$.
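To fix notation, here is a minimal sketch of the RLM objective for a linear model $h_w(x) = \langle w, x \rangle$ with logistic loss; the concrete loss and all names are illustrative assumptions, not part of the slides:

```python
import numpy as np

def rlm_objective(w, X, y, lam):
    """P(w) = (1/n) * sum_i phi_i(w) + (lam/2) * ||w||^2."""
    margins = y * (X @ w)                  # y_i * <w, x_i>, with y_i in {-1, +1}
    losses = np.logaddexp(0.0, -margins)   # logistic loss, numerically stable
    return losses.mean() + 0.5 * lam * np.dot(w, w)
```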

How to solve RLM for Deep Learning?

Stochastic Gradient Descent (SGD):
Advantages: works well in practice; per-iteration cost independent of n.
Disadvantage: slow convergence.

[Figure: objective (log scale) vs. number of backpropagations (up to 1.5 × 10^7), showing SGD's progress flattening out.]

How to improve the SGD convergence rate?

1. Stochastic Dual Coordinate Ascent (SDCA): same per-iteration cost as SGD, but converges exponentially faster. Designed for convex problems, but can be adapted to deep learning.

2. SelfieBoost: AdaBoost with SGD as the weak learner converges exponentially faster than vanilla SGD, but yields an ensemble of networks, which is very expensive at prediction time. I'll describe a new boosting algorithm that boosts the performance of a single network, and I'll show faster convergence under an assumption that SGD succeeds on each sub-problem.

SDCA vs. SGD

On the CCAT dataset, shallow architecture.

[Figure: log-scale convergence curves (down to 10^-6) over 25 epochs for SDCA, SDCA-Perm, and SGD.]

SelfieBoost vs. SGD

On the MNIST dataset, depth-5 network.

[Figure: error (log scale) vs. number of backpropagations for SGD and SelfieBoost.]

Gradient Descent vs. Stochastic Gradient Descent

Define $P_I(w) = \frac{1}{|I|}\sum_{i \in I}\phi_i(w) + \frac{\lambda}{2}\|w\|^2$.

GD: rule $w_{t+1} = w_t - \eta\,\nabla P(w_t)$; per-iteration cost $O(n)$; convergence rate $\log(1/\epsilon)$.
SGD: rule $w_{t+1} = w_t - \eta\,\nabla P_I(w_t)$ for random $I \subseteq [n]$; per-iteration cost $O(1)$; convergence rate $1/\epsilon$.
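A minimal sketch of the two update rules, assuming a callable grad_phi(i, w) that returns $\nabla\phi_i(w)$; this is an illustration, not the speaker's code:

```python
import numpy as np

def gd_step(w, grad_phi, lam, eta, n):
    # Full gradient: touches all n examples, O(n) per iteration.
    g = np.mean([grad_phi(i, w) for i in range(n)], axis=0) + lam * w
    return w - eta * g

def sgd_step(w, grad_phi, lam, eta, n, rng):
    # Stochastic gradient: touches one random example, O(1) per iteration.
    i = rng.integers(n)
    return w - eta * (grad_phi(i, w) + lam * w)   # unbiased estimate of the full gradient
```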

Hey, wait, but what about...
Decaying learning rates?
Nesterov's momentum?

SGD: a more powerful oracle is crucial

Theorem: Any algorithm for solving RLM that accesses the objective only through a stochastic gradient oracle and has a $\log(1/\epsilon)$ rate must perform $\Omega(n^2)$ iterations.

Proof idea: Consider two objectives (in both, $\lambda = 1$): for $i \in \{\pm 1\}$,

$$P_i(w) = \frac{1}{2n}\left((n+1)\,\frac{(w-i)^2}{2} + (n-1)\,\frac{(w+i)^2}{2}\right)$$

A stochastic gradient oracle returns $w \mp i$ with probability $\frac{1}{2} \pm \frac{1}{2n}$. It is easy to see that $w_i^* = i/n$, $P_i(0) = 1/2$, and $P_i(w_i^*) = 1/2 - 1/(2n^2)$. Therefore, solving to accuracy $\epsilon < 1/(2n^2)$ amounts to determining the bias of the coin.
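As a sanity check on the reconstruction above (my arithmetic, not from the slides): setting the expected gradient to zero,

$$\nabla P_i(w) = \frac{n+1}{2n}(w - i) + \frac{n-1}{2n}(w + i) = w - \frac{i}{n} = 0 \quad\Longrightarrow\quad w_i^* = \frac{i}{n},$$

and, since $i^2 = 1$,

$$P_i(0) = \frac{(n+1) + (n-1)}{2n}\cdot\frac{1}{2} = \frac{1}{2}, \qquad P_i(w_i^*) = \frac{n^2 - 1}{2n^2} = \frac{1}{2} - \frac{1}{2n^2}.$$

Distinguishing a coin of bias $\frac{1}{2} + \frac{1}{2n}$ from one of bias $\frac{1}{2} - \frac{1}{2n}$ requires $\Omega(n^2)$ tosses, which yields the iteration lower bound.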

Outline

1. SDCA: description and analysis for convex problems; SDCA for deep learning; Accelerated SDCA
2. SelfieBoost

Stochastic Dual Coordinate Ascent

Primal problem:

$$\min_{w \in \mathbb{R}^d} P(w) := \frac{1}{n}\sum_{i=1}^n \phi_i(w) + \frac{\lambda}{2}\|w\|^2$$

(Fenchel) Dual problem:

$$\max_{\alpha \in \mathbb{R}^{d \times n}} D(\alpha) := -\frac{1}{n}\sum_{i=1}^n \phi_i^*(-\alpha_i) - \frac{1}{2\lambda n^2}\left\|\sum_{i=1}^n \alpha_i\right\|^2$$

(where $\alpha_i$ is the $i$-th column of $\alpha$).

DCA: at each iteration, optimize $D(\alpha)$ with respect to a single column of $\alpha$, keeping the remaining columns intact.
Stochastic Dual Coordinate Ascent (SDCA): choose the updated column uniformly at random.
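To make the coordinate step concrete, here is a minimal SDCA sketch for ridge regression, i.e., $\phi_i(w) = \frac{1}{2}(x_i^\top w - y_i)^2$, where each dual column reduces to a scalar and the per-coordinate maximization has a closed form. The closed-form step is my derivation for this special case, not code from the talk:

```python
import numpy as np

def sdca_ridge(X, y, lam, epochs, rng):
    """SDCA for ridge regression: a minimal sketch, not the general algorithm."""
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                       # maintains w = (1/(lam*n)) * X^T alpha
    for _ in range(epochs * n):
        i = rng.integers(n)               # pick a coordinate uniformly at random
        # Closed-form maximization of the dual over coordinate i:
        delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + (X[i] @ X[i]) / (lam * n))
        alpha[i] += delta
        w += delta / (lam * n) * X[i]
    return w

# Usage sketch:
# rng = np.random.default_rng(0)
# w = sdca_ridge(X, y, lam=1e-4, epochs=10, rng=rng)
```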

Fenchel Conjugate

Two equivalent representations of a convex function: as a set of points $(w, f(w))$, or as a set of tangents $(\theta, f^*(\theta))$, where

$$f^*(\theta) = \max_w \; \langle w, \theta\rangle - f(w)$$

[Figure: a convex curve $f(w)$ with a tangent line of slope $\theta$ whose intercept encodes $f^*(\theta)$.]
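A standard worked example (not from the slides): for the squared loss $\phi(a) = \frac{1}{2}(a - y)^2$,

$$\phi^*(\theta) = \max_a\left[a\theta - \tfrac{1}{2}(a - y)^2\right] = y\theta + \tfrac{1}{2}\theta^2,$$

attained at $a = y + \theta$. This is exactly the conjugate behind the closed-form coordinate step sketched above.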

SDCA Analysis

Theorem: Assume each $\phi_i$ is convex and smooth. Then, after

$$\tilde{O}\left(\left(n + \frac{1}{\lambda}\right)\log\frac{1}{\epsilon}\right)$$

iterations of SDCA we have, with high probability, $P(w_t) - P(w^*) \le \epsilon$.

GD: iteration cost $nd$; convergence rate $\frac{1}{\lambda}\log(1/\epsilon)$; runtime $\frac{nd}{\lambda}\log(1/\epsilon)$; for $\lambda = 1/n$: $n^2 d$ (ignoring log factors).
SGD: iteration cost $d$; convergence rate $\frac{1}{\lambda\epsilon}$; runtime $\frac{d}{\lambda\epsilon}$; for $\lambda = 1/n$: $\frac{nd}{\epsilon}$.
SDCA: iteration cost $d$; convergence rate $\left(n + \frac{1}{\lambda}\right)\log(1/\epsilon)$; runtime $d\left(n + \frac{1}{\lambda}\right)\log(1/\epsilon)$; for $\lambda = 1/n$: $nd$ (ignoring log factors).

SDCA vs. SGD: experimental observations

On the CCAT dataset, shallow architecture, $\lambda = 10^{-6}$.

[Figure: log-scale convergence curves over 25 epochs for SDCA, SDCA-Perm, and SGD.]

SDCA vs. DCA: randomization is crucial

[Figure: log-scale convergence curves for SDCA, cyclic DCA, and SDCA-Perm, together with the theoretical bound.]

Outline

1. SDCA: description and analysis for convex problems; SDCA for deep learning; Accelerated SDCA
2. SelfieBoost

Deep Networks are Non-Convex

A 2-dimensional slice of the objective of a network with hidden layers {10, 10, 10, 10}, on MNIST, with the clamped ReLU activation function and logistic loss. The slice is defined by finding a global minimum (using SGD) and creating two random permutations of the first hidden layer.

[Figure: surface plot of the loss over this slice; the permuted copies of the minimum make the surface visibly non-convex.]

But Deep Networks Seem Convex Near a Minimum

Now the slice is based on two random points at distance 1 around a global minimum.

[Figure: surface plot of the loss over this slice, which looks locally convex.]

SDCA for Deep Learning

For non-convex $\phi_i$, the Fenchel conjugate often becomes meaningless. But our analysis implies that an approximate dual update suffices, that is,

$$\alpha_i^{(t)} = \alpha_i^{(t-1)} - \eta\lambda n\left(\nabla\phi_i(w^{(t-1)}) + \alpha_i^{(t-1)}\right)$$

The relation between the primal and dual vectors is $w^{(t-1)} = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i^{(t-1)}$, so the corresponding primal update is

$$w^{(t)} = w^{(t-1)} - \eta\left(\nabla\phi_i(w^{(t-1)}) + \alpha_i^{(t-1)}\right)$$

These updates can be implemented for deep learning as well and do not require computing the Fenchel conjugate.
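A minimal sketch of one such dual-free step, assuming a callable grad_phi(i, w) that returns $\nabla\phi_i(w)$ (e.g., via backpropagation); the names are illustrative:

```python
import numpy as np

def dual_free_sdca_step(w, alpha, grad_phi, eta, lam, n, rng):
    """One dual-free SDCA step; maintains w = (1/(lam*n)) * sum_i alpha[:, i]."""
    i = rng.integers(n)                    # pick an example uniformly at random
    v = grad_phi(i, w) + alpha[:, i]       # update direction (unbiased gradient estimate)
    alpha[:, i] -= eta * lam * n * v       # dual update (in place)
    return w - eta * v                     # matching primal update
```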

Intuition: why SDCA is better than SGD

Recall that the SDCA primal update rule is

$$w^{(t)} = w^{(t-1)} - \eta \underbrace{\left(\nabla\phi_i(w^{(t-1)}) + \alpha_i^{(t-1)}\right)}_{v^{(t)}}$$

and that $w^{(t-1)} = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i^{(t-1)}$.

Observe: $v^{(t)}$ is an unbiased estimate of the gradient:

$$\mathbb{E}\left[v^{(t)} \mid w^{(t-1)}\right] = \frac{1}{n}\sum_{i=1}^n\left(\nabla\phi_i(w^{(t-1)}) + \alpha_i^{(t-1)}\right) = \nabla P(w^{(t-1)}) - \lambda w^{(t-1)} + \lambda w^{(t-1)} = \nabla P(w^{(t-1)})$$

Intuition: why SDCA is better than SGD

The update step of both SGD and SDCA is $w^{(t)} = w^{(t-1)} - \eta v^{(t)}$, where

$$v^{(t)} = \begin{cases}\nabla\phi_i(w^{(t-1)}) + \lambda w^{(t-1)} & \text{for SGD}\\[2pt] \nabla\phi_i(w^{(t-1)}) + \alpha_i^{(t-1)} & \text{for SDCA}\end{cases}$$

In both cases, $\mathbb{E}[v^{(t)} \mid w^{(t-1)}] = \nabla P(w^{(t-1)})$. What about the variance? For SGD, even if $w^{(t-1)} = w^*$, the variance of $v^{(t)}$ is still a constant. For SDCA, we will show that the variance of $v^{(t)}$ goes to zero as $w^{(t-1)} \to w^*$.

SDCA as a variance reduction technique (Johnson & Zhang)

For $w^*$ we have $\nabla P(w^*) = 0$, which means

$$\frac{1}{n}\sum_{i=1}^n \nabla\phi_i(w^*) + \lambda w^* = 0 \quad\Longrightarrow\quad w^* = \frac{1}{\lambda n}\sum_{i=1}^n\left(-\nabla\phi_i(w^*)\right) = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i^*$$

Therefore, if $\alpha_i^{(t-1)} \to \alpha_i^*$ and $w^{(t-1)} \to w^*$, the update vector satisfies

$$v^{(t)} = \nabla\phi_i(w^{(t-1)}) + \alpha_i^{(t-1)} \;\to\; \nabla\phi_i(w^*) + \alpha_i^* = 0$$

Issues with SDCA for Deep Learning

SDCA needs to maintain the matrix $\alpha$; storage can be reduced significantly by working with mini-batches. Another approach is the SVRG algorithm of Johnson and Zhang.
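For reference, a minimal sketch of one SVRG epoch in the sense of Johnson & Zhang; the callables grad_phi and full_grad and the schedule are illustrative assumptions:

```python
import numpy as np

def svrg_epoch(w, grad_phi, full_grad, eta, m, n, rng):
    """One outer SVRG epoch: variance reduction without storing per-example duals."""
    w_snap = w.copy()
    mu = full_grad(w_snap)                            # full gradient at the snapshot
    for _ in range(m):
        i = rng.integers(n)
        v = grad_phi(i, w) - grad_phi(i, w_snap) + mu  # variance-reduced direction
        w = w - eta * v
    return w
```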

Accelerated SDCA

Nesterov's Accelerated (deterministic) Gradient Descent (AGD) combines two gradients to accelerate the convergence rate.

AGD runtime: $\tilde{O}\left(d\,n\,\sqrt{1/\lambda}\right)$
SDCA runtime: $\tilde{O}\left(d\left(n + \frac{1}{\lambda}\right)\right)$

Can we accelerate SDCA? Yes! The main idea is to iterate:
Use SDCA to approximately minimize $P_t(w) = P(w) + \frac{\kappa}{2}\|w - y^{(t-1)}\|^2$
Update $y^{(t)} = w^{(t)} + \beta\left(w^{(t)} - w^{(t-1)}\right)$

Accelerated SDCA runtime: $\tilde{O}\left(d\left(n + \sqrt{\frac{n}{\lambda}}\right)\right)$
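A sketch of the outer loop, assuming a helper sdca_solve(y, kappa) that approximately minimizes $P_t$ with SDCA; kappa and beta are parameters the slides leave unspecified:

```python
def accelerated_sdca(y0, sdca_solve, beta, kappa, T):
    """Outer acceleration loop: inner SDCA solves plus an extrapolation step."""
    y = w_prev = y0
    for _ in range(T):
        w = sdca_solve(y, kappa)          # approx. minimize P(w) + (kappa/2)||w - y||^2
        y = w + beta * (w - w_prev)       # momentum / extrapolation
        w_prev = w
    return w_prev
```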

Experimental Demonstration

Smoothed hinge loss with $\ell_1$-$\ell_2$ regularization, $\lambda = 10^{-7}$.

[Figure: three panels (astro-ph, cov1, CCAT) comparing AccProxSDCA, ProxSDCA, and FISTA.]

Outline

1. SDCA: description and analysis for convex problems; SDCA for deep learning; Accelerated SDCA
2. SelfieBoost

SelfieBoost Motivation

[Figure: SGD objective (log scale) vs. number of backpropagations, as shown earlier.]

Why is SGD slow at the end?
High variance, even close to the optimum.
Rare mistakes: suppose all but 1% of the examples are classified correctly. SGD will now waste 99% of its time on examples that the model already gets right.

SelfieBoost Motivation

For simplicity, consider a binary classification problem in the realizable case. For a fixed $\epsilon_0$ (not too small), a few SGD iterations find a solution with $P(w) - P(w^*) \le \epsilon_0$. However, for a small $\epsilon$, SGD requires many iterations. Smells like we need to use boosting...

First idea: learn an ensemble using AdaBoost

Fix $\epsilon_0$ (say 0.05), and assume SGD can find a solution with error $< \epsilon_0$ quite fast. Let's apply AdaBoost with the SGD learner as a weak learner:
At iteration $t$, sub-sample a training set based on a distribution $D_t$ over $[n]$.
Feed the sub-sample to an SGD learner and get a weak classifier $h_t$.
Update $D_{t+1}$ based on the predictions of $h_t$.
The output of AdaBoost is an ensemble with prediction $\sum_{t=1}^T \alpha_t h_t(x)$.
The celebrated Freund & Schapire theorem states that if $T = O(\log(1/\epsilon))$, then the error of the ensemble classifier is at most $\epsilon$.
Observe that each boosting iteration involves calling SGD on relatively little data and then updating the distribution over the entire big data set; the latter step can be performed in parallel (see the sketch below).
Disadvantage of learning an ensemble: at prediction time, we need to apply many networks.
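A minimal sketch of this scheme with textbook AdaBoost weights; the weak-learner interface weak_sgd and the specific $\alpha_t$ formula are standard AdaBoost, not details given in the talk:

```python
import numpy as np

def adaboost_with_sgd(X, y, T, sample_size, weak_sgd, rng):
    """AdaBoost with an SGD-trained network as the weak learner; y in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(T):
        S = rng.choice(n, size=sample_size, p=D)    # sub-sample according to D_t
        h = weak_sgd(X[S], y[S])                    # weak classifier with error < eps_0
        preds = h(X)                                # predictions on the full (big) data
        err = D @ (preds != y)
        a = 0.5 * np.log((1.0 - err) / err)         # standard AdaBoost weight
        D = D * np.exp(-a * y * preds)              # reweighting is embarrassingly parallel
        D /= D.sum()
        ensemble.append((a, h))
    return ensemble                                 # predict with sign(sum_t a_t * h_t(x))
```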

Boosting the Same Network

Can we obtain boosting-like convergence while learning a single network?

The SelfieBoost Algorithm:
Start with an initial network $f_1$.
At iteration $t$, define weights over the $n$ examples according to $D_i \propto e^{-y_i f_t(x_i)}$.
Sub-sample a training set $S \sim D$.
Use SGD to approximately solve the problem

$$f_{t+1} \approx \operatorname*{argmin}_g \; \sum_{i \in S} y_i\left(f_t(x_i) - g(x_i)\right) + \frac{1}{2}\sum_{i \in S}\left(g(x_i) - f_t(x_i)\right)^2$$
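One round of this procedure might look as follows; f_t is a callable network, and sgd_fit (which runs SGD on the objective above, warm-started from f_t) is an assumed helper:

```python
import numpy as np

def selfieboost_round(f_t, X, y, sample_size, sgd_fit, rng):
    """One SelfieBoost round: reweight, sub-sample hard examples, refit."""
    scores = f_t(X)                                  # current network outputs f_t(x_i)
    D = np.exp(-y * scores)
    D /= D.sum()                                     # weights concentrate on mistakes
    S = rng.choice(len(y), size=sample_size, p=D)    # sub-sample S ~ D
    # sgd_fit approximately solves, over networks g:
    #   min_g sum_{i in S} y_i*(f_t(x_i) - g(x_i)) + 0.5*(g(x_i) - f_t(x_i))^2
    return sgd_fit(X[S], y[S], scores[S])            # returns f_{t+1}
```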

Analysis of the SelfieBoost Algorithm

Lemma: At each iteration, with high probability over the choice of $S$, there exists a network $g$ with objective value at most $-1/4$.

Theorem: If at each iteration the SGD algorithm finds a solution with objective value at most $-\rho$, then after $\frac{\log(1/\epsilon)}{\rho}$ SelfieBoost iterations the error of $f_t$ will be at most $\epsilon$.

To summarize: we obtain $\log(1/\epsilon)$ convergence, assuming the SGD algorithm can solve each sub-problem to a fixed accuracy (which seems to hold in practice).

SelfieBoost vs. SGD

On the MNIST dataset, depth-5 network.

[Figure: error (log scale) vs. number of backpropagations for SGD and SelfieBoost.]

Summary

SGD converges quickly to an o.k. solution, but then slows down:
1. High variance, even at $w^*$
2. Wastes time on already-solved cases

There is a need for stochastic methods that have a per-iteration complexity similar to SGD's but converge faster:
1. SDCA reduces the variance and therefore converges faster
2. SelfieBoost focuses on the hard cases

Future Work and Open Questions:
Evaluate the empirical performance of SDCA and SelfieBoost on challenging deep learning tasks.
Bridge the gap between the empirical success of SGD and worst-case hardness results.