Accelerating Stochastic Optimization

Accelerating Stochastic Optimization
Shai Shalev-Shwartz, School of CS and Engineering, The Hebrew University of Jerusalem, and Mobileye
Master Class at Tel-Aviv, Tel-Aviv University, November 2014

Learning
Goal (informal): learn an accurate mapping $h : \mathcal{X} \to \mathcal{Y}$ based on examples $((x_1, y_1), \dots, (x_n, y_n)) \in (\mathcal{X} \times \mathcal{Y})^n$.
Deep learning: each mapping $h : \mathcal{X} \to \mathcal{Y}$ is parameterized by a weight vector $w \in \mathbb{R}^d$, so our goal is to learn the vector $w$.

Regularized Loss Minimization
A popular learning approach is Regularized Loss Minimization (RLM) with Euclidean regularization: sample $S = ((x_1, y_1), \dots, (x_n, y_n)) \sim \mathcal{D}^n$ and approximately solve the RLM problem
$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^n \phi_i(w) + \frac{\lambda}{2} \|w\|^2$
where $\phi_i(w) = \ell_{y_i}(h_w(x_i))$ is the loss of predicting $h_w(x_i)$ when the true target is $y_i$.
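
As a concrete illustration (not part of the slides), here is a minimal NumPy sketch of the RLM objective for a linear predictor with logistic loss; the names `X`, `y`, and `lam` are assumptions standing for the data matrix, the ±1 labels, and the regularization parameter λ.

```python
import numpy as np

def rlm_objective(w, X, y, lam):
    """(1/n) * sum_i phi_i(w) + (lam/2) * ||w||^2, where phi_i is the
    logistic loss of the linear predictor <w, x_i> on example (x_i, y_i)."""
    margins = y * (X @ w)                   # y_i * <w, x_i>
    losses = np.log1p(np.exp(-margins))     # logistic loss per example
    return losses.mean() + 0.5 * lam * np.dot(w, w)

# Toy usage:
# X = np.random.randn(100, 5); y = np.random.choice([-1.0, 1.0], size=100)
# print(rlm_objective(np.zeros(5), X, y, lam=0.1))
```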

How to solve RLM for Deep Learning?
Stochastic Gradient Descent (SGD):
- Advantages: works well in practice; per-iteration cost independent of $n$
- Disadvantage: slow convergence
[Plot: objective (log scale) vs. number of backpropagations]

How to improve SGD convergence rate?
1. Stochastic Dual Coordinate Ascent (SDCA): same per-iteration cost as SGD, but converges exponentially faster; designed for convex problems, but can be adapted to deep learning.
2. SelfieBoost: AdaBoost with SGD as the weak learner converges exponentially faster than vanilla SGD, but yields an ensemble of networks, which is very expensive at prediction time. I'll describe a new boosting algorithm that boosts the performance of the same network, and show faster convergence under a certain SGD-success assumption.

SDCA vs. SGD
On the CCAT dataset, shallow architecture. [Plot comparing SDCA, SDCA-Perm, and SGD]

SelfieBoost vs. SGD
On the MNIST dataset, depth-5 network. [Plot: error (log scale) vs. number of backpropagations for SGD and SelfieBoost]

Gradient Descent vs. Stochastic Gradient Descent
Define $P_I(w) = \frac{1}{|I|} \sum_{i \in I} \phi_i(w) + \frac{\lambda}{2} \|w\|^2$.
- Update rule: GD: $w_{t+1} = w_t - \eta \nabla P(w_t)$; SGD: $w_{t+1} = w_t - \eta \nabla P_I(w_t)$ for a random $I \subseteq [n]$
- Per-iteration cost: GD: $O(n)$; SGD: $O(1)$
- Convergence rate: GD: $\log(1/\epsilon)$; SGD: $1/\epsilon$
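
To make the two columns of the comparison concrete, here is a minimal sketch of the two update rules; it is an illustration, not code from the talk, and it assumes a hypothetical helper `grad_phi(w, i)` that returns $\nabla\phi_i(w)$.

```python
import numpy as np

def gd_step(w, grad_phi, n, lam, eta):
    """Full gradient step: w <- w - eta * grad P(w); costs n gradient evaluations."""
    g = np.mean([grad_phi(w, i) for i in range(n)], axis=0) + lam * w
    return w - eta * g

def sgd_step(w, grad_phi, n, lam, eta, batch_size=1, rng=np.random):
    """Stochastic step on a random index set I: w <- w - eta * grad P_I(w); costs |I|."""
    I = rng.randint(0, n, size=batch_size)
    g = np.mean([grad_phi(w, i) for i in I], axis=0) + lam * w
    return w - eta * g
```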

Hey, wait, but what about...
- Decaying learning rate
- Nesterov's momentum

SGD: a more powerful oracle is crucial
Theorem: Any algorithm for solving RLM that accesses the objective only through a stochastic gradient oracle and has a $\log(1/\epsilon)$ rate must perform $\Omega(n^2)$ iterations.
Proof idea: Consider two objectives (in both, $\lambda = 1$): for $i \in \{\pm 1\}$,
$P_i(w) = \frac{1}{2n} \left( \frac{n-1}{2} (w - i)^2 + \frac{n+1}{2} (w + i)^2 \right)$
A stochastic gradient oracle returns $w \pm i$ with probability $\frac{1}{2} \pm \frac{1}{2n}$. It is easy to see that $w_i^* = -i/n$, $P_i(0) = 1/2$, and $P_i(w_i^*) = 1/2 - 1/(2n^2)$. Therefore, solving to accuracy $\epsilon < 1/(2n^2)$ amounts to determining the bias of the coin.

Outline
1. SDCA
   - Description and analysis for convex problems
   - SDCA for Deep Learning
   - Accelerated SDCA
2. SelfieBoost

Stochastic Dual Coordinate Ascent
Primal problem:
$\min_{w \in \mathbb{R}^d} P(w) := \frac{1}{n} \sum_{i=1}^n \phi_i(w) + \frac{\lambda}{2} \|w\|^2$
(Fenchel) Dual problem:
$\max_{\alpha \in \mathbb{R}^{d \times n}} D(\alpha) := \frac{1}{n} \sum_{i=1}^n -\phi_i^*(-\alpha_i) - \frac{1}{2 \lambda n^2} \Big\| \sum_{i=1}^n \alpha_i \Big\|^2$
(where $\alpha_i$ is the $i$-th column of $\alpha$)
DCA: at each iteration, optimize $D(\alpha)$ with respect to a single column of $\alpha$, while the rest of the columns are kept intact.
Stochastic Dual Coordinate Ascent (SDCA): choose the updated column uniformly at random.

Fenchel Conjugate
Two equivalent representations of a convex function: as a set of points $(w, f(w))$, or as a set of tangents with slope $\theta$ and intercept $-f^*(\theta)$, where
$f^*(\theta) = \max_w \langle w, \theta \rangle - f(w)$
[Figure: a convex curve $f(w)$ with a tangent line of slope $\theta$ and intercept $-f^*(\theta)$]
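
A worked example (not from the slides) of computing a conjugate directly from this definition, for the Euclidean regularizer used throughout:

```latex
% Conjugate of f(w) = (1/2)||w||^2.
% The maximand is concave in w; its gradient theta - w vanishes at w = theta.
f^*(\theta) \;=\; \max_{w}\,\Big(\langle w,\theta\rangle - \tfrac12\|w\|^2\Big)
           \;=\; \|\theta\|^2 - \tfrac12\|\theta\|^2
           \;=\; \tfrac12\|\theta\|^2 .
```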

SDCA Analysis
Theorem: Assume each $\phi_i$ is convex and smooth. Then, after $\tilde{O}\big( (n + \frac{1}{\lambda}) \log \frac{1}{\epsilon} \big)$ iterations of SDCA we have, with high probability, $P(w_t) - P(w^*) \le \epsilon$.
- Iteration cost: GD $nd$; SGD $d$; SDCA $d$
- Convergence rate: GD $\frac{1}{\lambda} \log \frac{1}{\epsilon}$; SGD $\frac{1}{\lambda \epsilon}$; SDCA $(n + \frac{1}{\lambda}) \log \frac{1}{\epsilon}$
- Runtime: GD $\frac{nd}{\lambda}$; SGD $\frac{d}{\lambda \epsilon}$; SDCA $d (n + \frac{1}{\lambda})$
- Runtime for $\lambda = \frac{1}{n}$: GD $n^2 d$; SGD $\frac{nd}{\epsilon}$; SDCA $nd$

SDCA vs. SGD: experimental observations
On the CCAT dataset, shallow architecture, $\lambda =$ (value not shown). [Plot comparing SDCA, SDCA-Perm, and SGD]

SDCA vs. DCA: randomization is crucial
[Plot comparing SDCA, cyclic DCA, SDCA-Perm, and the theoretical bound]

Outline
1. SDCA
   - Description and analysis for convex problems
   - SDCA for Deep Learning
   - Accelerated SDCA
2. SelfieBoost

Deep Networks are Non-Convex
[Figure] A 2-dim slice of a network with hidden layers {10, 10, 10, 10}, on MNIST, with the clamped ReLU activation function and logistic loss. The slice is defined by finding a global minimum (using SGD) and creating two random permutations of the first hidden layer.

But Deep Networks Seem Convex Near a Minimum
[Figure] Now the slice is based on two random points at distance 1 around a global minimum.

SDCA for Deep Learning
When $\phi_i$ is non-convex, the Fenchel conjugate often becomes meaningless. But our analysis implies that an approximate dual update suffices, that is,
$\alpha_i^{(t)} = \alpha_i^{(t-1)} - \eta \lambda n \left( \nabla \phi_i(w^{(t-1)}) + \alpha_i^{(t-1)} \right)$
The relation between the primal and dual vectors is $w^{(t-1)} = \frac{1}{\lambda n} \sum_{i=1}^n \alpha_i^{(t-1)}$, and therefore the corresponding primal update is
$w^{(t)} = w^{(t-1)} - \eta \left( \nabla \phi_i(w^{(t-1)}) + \alpha_i^{(t-1)} \right)$
These updates can be implemented for deep learning as well and do not require computing the Fenchel conjugate.
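
A minimal NumPy sketch of this dual-free update pair, assuming a hypothetical `grad_phi(w, i)` that returns $\nabla\phi_i(w)$ (e.g. via backpropagation); it maintains the matrix $\alpha$ and keeps the relation $w = \frac{1}{\lambda n}\sum_i \alpha_i$ in sync.

```python
import numpy as np

def dual_free_sdca_epoch(w, alpha, grad_phi, lam, eta, rng=np.random):
    """One pass of dual-free SDCA updates over randomly chosen examples.

    alpha is an (n, d) matrix of dual vectors satisfying
    w == alpha.sum(axis=0) / (lam * n) on entry and on exit."""
    n = alpha.shape[0]
    for _ in range(n):
        i = rng.randint(n)
        v = grad_phi(w, i) + alpha[i]     # update direction for example i
        alpha[i] -= eta * lam * n * v     # dual step
        w = w - eta * v                   # matching primal step preserves the relation
    return w, alpha
```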

Intuition: Why SDCA is better than SGD
Recall that the SDCA primal update rule is
$w^{(t)} = w^{(t-1)} - \eta \underbrace{\left( \nabla \phi_i(w^{(t-1)}) + \alpha_i^{(t-1)} \right)}_{v^{(t)}}$
and that $w^{(t-1)} = \frac{1}{\lambda n} \sum_{i=1}^n \alpha_i^{(t-1)}$.
Observe that $v^{(t)}$ is an unbiased estimate of the gradient:
$\mathbb{E}[v^{(t)} \mid w^{(t-1)}] = \frac{1}{n} \sum_{i=1}^n \left( \nabla \phi_i(w^{(t-1)}) + \alpha_i^{(t-1)} \right) = \nabla P(w^{(t-1)}) - \lambda w^{(t-1)} + \lambda w^{(t-1)} = \nabla P(w^{(t-1)})$

Intuition: Why SDCA is better than SGD
The update step of both SGD and SDCA is $w^{(t)} = w^{(t-1)} - \eta v^{(t)}$, where
$v^{(t)} = \nabla \phi_i(w^{(t-1)}) + \lambda w^{(t-1)}$ for SGD, and $v^{(t)} = \nabla \phi_i(w^{(t-1)}) + \alpha_i^{(t-1)}$ for SDCA.
In both cases $\mathbb{E}[v^{(t)} \mid w^{(t-1)}] = \nabla P(w^{(t-1)})$. What about the variance?
- For SGD, even if $w^{(t-1)} = w^*$, the variance of $v^{(t)}$ is still a constant.
- For SDCA, we'll show that the variance of $v^{(t)}$ goes to zero as $w^{(t-1)} \to w^*$.

SDCA as a variance reduction technique (Johnson & Zhang)
At $w^*$ we have $\nabla P(w^*) = 0$, which means
$\frac{1}{n} \sum_{i=1}^n \nabla \phi_i(w^*) + \lambda w^* = 0 \;\Rightarrow\; w^* = \frac{1}{\lambda n} \sum_{i=1}^n \big( -\nabla \phi_i(w^*) \big) = \frac{1}{\lambda n} \sum_{i=1}^n \alpha_i^*$
Therefore, if $\alpha_i^{(t-1)} \to \alpha_i^*$ and $w^{(t-1)} \to w^*$, then the update vector satisfies
$v^{(t)} = \nabla \phi_i(w^{(t-1)}) + \alpha_i^{(t-1)} \;\to\; \nabla \phi_i(w^*) + \alpha_i^* = 0$

Issues with SDCA for Deep Learning
- Needs to maintain the matrix $\alpha$
- Storage can be significantly reduced when working with mini-batches
- Another approach is the SVRG algorithm of Johnson and Zhang
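
For reference (not from the slides), a minimal sketch of the SVRG update of Johnson and Zhang: keep a snapshot point with its full gradient and use it to correct each stochastic gradient. `grad_i(w, i)` is an assumed helper returning the gradient of the $i$-th regularized loss term.

```python
import numpy as np

def svrg_outer_iteration(w, grad_i, n, eta, num_inner, rng=np.random):
    """One outer SVRG iteration: snapshot, full gradient, then variance-reduced inner steps."""
    w_snap = w.copy()
    full_grad = np.mean([grad_i(w_snap, i) for i in range(n)], axis=0)
    for _ in range(num_inner):
        i = rng.randint(n)
        # variance-reduced direction: unbiased, and it vanishes as w, w_snap -> w*
        v = grad_i(w, i) - grad_i(w_snap, i) + full_grad
        w = w - eta * v
    return w
```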

Accelerated SDCA
Nesterov's Accelerated (deterministic) Gradient Descent (AGD): combine two gradients to accelerate the convergence rate.
AGD runtime: $\tilde{O}\big( d\, n \sqrt{\tfrac{1}{\lambda}} \big)$. SDCA runtime: $\tilde{O}\big( d (n + \tfrac{1}{\lambda}) \big)$.
Can we accelerate SDCA? Yes! The main idea is to iterate:
- Use SDCA to approximately minimize $P_t(w) = P(w) + \frac{\kappa}{2} \|w - y^{(t-1)}\|^2$
- Update $y^{(t)} = w^{(t)} + \beta (w^{(t)} - w^{(t-1)})$
Accelerated SDCA runtime: $\tilde{O}\big( d (n + \sqrt{\tfrac{n}{\lambda}}) \big)$
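
A schematic sketch of this inner-outer acceleration loop (an illustration, not the exact accelerated SDCA algorithm): `approx_min_sdca(y, kappa)` stands in for an assumed routine that runs SDCA on the proximal problem $P(w) + \frac{\kappa}{2}\|w - y\|^2$, and $\beta$, $\kappa$ are treated as given constants.

```python
def accelerated_sdca(w0, approx_min_sdca, kappa, beta, num_outer):
    """Outer loop: approximately solve a kappa-regularized subproblem, then extrapolate."""
    w_prev = w0
    y = w0
    for _ in range(num_outer):
        # inner solver: approximately minimize P_t(w) = P(w) + (kappa/2) * ||w - y||^2
        w = approx_min_sdca(y, kappa)
        # momentum / extrapolation step on the prox center
        y = w + beta * (w - w_prev)
        w_prev = w
    return w_prev
```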

Experimental Demonstration
Smoothed hinge loss with $\ell_1$-$\ell_2$ regularization, $\lambda = 10^{-7}$. [Plots on the astro-ph, cov1, and CCAT datasets comparing AccProxSDCA, ProxSDCA, and FISTA]

Outline
1. SDCA
   - Description and analysis for convex problems
   - SDCA for Deep Learning
   - Accelerated SDCA
2. SelfieBoost

SelfieBoost Motivation
[Plot: objective (log scale) vs. number of backpropagations]
Why is SGD slow at the end?
- High variance, even close to the optimum
- Rare mistakes: suppose all but 1% of the examples are correctly classified. SGD will now waste 99% of its time on examples that the model already gets right.

SelfieBoost Motivation
For simplicity, consider a binary classification problem in the realizable case.
- For a fixed $\epsilon_0$ (not too small), a few SGD iterations find a solution with $P(w) - P(w^*) \le \epsilon_0$
- However, for a small $\epsilon$, SGD requires many iterations
Smells like we need to use boosting...

First idea: learn an ensemble using AdaBoost
Fix $\epsilon_0$ (say 0.05), and assume SGD can find a solution with error $< \epsilon_0$ quite fast. Let's apply AdaBoost with the SGD learner as a weak learner:
- At iteration $t$, we sub-sample a training set based on a distribution $D_t$ over $[n]$
- We feed the sub-sample to an SGD learner and get a weak classifier $h_t$
- Update $D_{t+1}$ based on the predictions of $h_t$
- The output of AdaBoost is an ensemble with prediction $\sum_{t=1}^T \alpha_t h_t(x)$
The celebrated Freund & Schapire theorem states that if $T = O(\log(1/\epsilon))$ then the error of the ensemble classifier is at most $\epsilon$. Observe that each boosting iteration involves calling SGD on a relatively small dataset and updating the distribution over the entire (big) dataset; the latter step can be performed in parallel.
Disadvantage of learning an ensemble: at prediction time, we need to apply many networks.
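
A rough sketch of this boosting loop with the standard AdaBoost reweighting; `train_sgd_weak_learner` and `sample_size` are assumed stand-ins for training a network with SGD on the sub-sample, and labels are in {-1, +1}.

```python
import numpy as np

def adaboost_with_sgd(X, y, train_sgd_weak_learner, T, sample_size, rng=np.random):
    """AdaBoost with an SGD-trained classifier as the weak learner."""
    n = len(y)
    D = np.full(n, 1.0 / n)                          # distribution over the n examples
    hypotheses, alphas = [], []
    for t in range(T):
        idx = rng.choice(n, size=sample_size, replace=True, p=D)
        h = train_sgd_weak_learner(X[idx], y[idx])   # weak classifier, h(X) in {-1, +1}
        pred = h(X)
        eps = np.sum(D * (pred != y))                # weighted training error
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        D = D * np.exp(-alpha * y * pred)            # up-weight the mistakes
        D /= D.sum()
        hypotheses.append(h)
        alphas.append(alpha)
    return hypotheses, alphas   # ensemble prediction: sign(sum_t alpha_t * h_t(x))
```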

Boosting the Same Network
Can we obtain boosting-like convergence while learning a single network?
The SelfieBoost algorithm:
- Start with an initial network $f_1$
- At iteration $t$, define weights over the $n$ examples according to $D_i \propto e^{-y_i f_t(x_i)}$
- Sub-sample a training set $S \sim D$
- Use SGD to approximately solve the problem
$f_{t+1} \approx \operatorname{argmin}_g \sum_{i \in S} y_i \big( f_t(x_i) - g(x_i) \big) + \frac{1}{2} \sum_{i \in S} \big( g(x_i) - f_t(x_i) \big)^2$
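
A sketch of one SelfieBoost iteration as described above, with networks represented abstractly as functions returning real-valued scores; `solve_with_sgd` is an assumed routine that approximately minimizes the surrogate on the sub-sample, and the weight and objective formulas follow the slide.

```python
import numpy as np

def selfieboost_weights(f_t, X, y):
    """Weights D_i proportional to exp(-y_i * f_t(x_i)): hard examples get more mass."""
    D = np.exp(-y * f_t(X))
    return D / D.sum()

def selfieboost_objective(g_scores, f_scores, y):
    """Surrogate solved (approximately, by SGD) at each iteration:
    sum_i y_i * (f_t(x_i) - g(x_i)) + 0.5 * sum_i (g(x_i) - f_t(x_i))^2."""
    diff = g_scores - f_scores
    return np.sum(-y * diff) + 0.5 * np.sum(diff ** 2)

def selfieboost_step(f_t, X, y, sample_size, solve_with_sgd, rng=np.random):
    """One boosting iteration: reweight, sub-sample, and refit the same network."""
    D = selfieboost_weights(f_t, X, y)
    idx = rng.choice(len(y), size=sample_size, replace=True, p=D)
    return solve_with_sgd(X[idx], y[idx], f_t)   # returns f_{t+1}
```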

Analysis of the SelfieBoost Algorithm
Lemma: At each iteration, with high probability over the choice of $S$, there exists a network $g$ with objective value of at most $-1/4$.
Theorem: If at each iteration the SGD algorithm finds a solution with objective value of at most $-\rho$, then after $\frac{\log(1/\epsilon)}{\rho}$ SelfieBoost iterations the error of $f_t$ will be at most $\epsilon$.
To summarize: we have obtained $\log(1/\epsilon)$ convergence assuming that the SGD algorithm can solve each sub-problem to a fixed accuracy (which seems to hold in practice).

SelfieBoost vs. SGD
On the MNIST dataset, depth-5 network. [Plot: error (log scale) vs. number of backpropagations for SGD and SelfieBoost]

Summary
SGD converges quickly to an o.k. solution, but then slows down:
1. High variance, even at $w^*$
2. Wastes time on already-solved cases
There's a need for stochastic methods that have a similar per-iteration complexity to SGD but converge faster:
1. SDCA reduces the variance and therefore converges faster
2. SelfieBoost focuses on the hard cases
Future work and open questions:
- Evaluate the empirical performance of SDCA and SelfieBoost on challenging deep learning tasks
- Bridge the gap between the empirical success of SGD and worst-case hardness results
