Parallel and Distributed Stochastic Learning - Towards Scalable Learning for Big Data Intelligence

1 Parallel and Distributed Stochastic Learning - Towards Scalable Learning for Big Data Intelligence
Wu-Jun Li
LAMDA Group, National Key Laboratory for Novel Software Technology,
Department of Computer Science and Technology, Nanjing University
liwujun@nju.edu.cn
Dec 10, 2016

2 Outline
1 Introduction
2 AsySVRG
3 SCOPE
4 Conclusion

3 Introduction: Machine Learning
Supervised Learning: Given a set of training examples $\{(x_i, y_i)\}_{i=1}^n$, supervised learning tries to solve the following regularized empirical risk minimization problem:
$\min_w f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w)$,
where $f_i(w)$ is the loss function (plus some regularization term) defined on example $i$, and $w$ is the parameter to learn.
Examples:
Logistic regression: $f(w) = \frac{1}{n} \sum_{i=1}^n [\log(1 + e^{-y_i x_i^T w}) + \frac{\lambda}{2} \|w\|^2]$
SVM: $f(w) = \frac{1}{n} \sum_{i=1}^n [\max\{0, 1 - y_i x_i^T w\} + \frac{\lambda}{2} \|w\|^2]$
Deep learning models
Unsupervised Learning: Many unsupervised learning models, such as PCA and matrix factorization, can also be reformulated as similar problems.
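To make the objective concrete, the following minimal NumPy sketch (function names and synthetic data are illustrative, not from the slides) evaluates the L2-regularized logistic-regression objective above and its gradient:

import numpy as np

def logistic_objective(w, X, y, lam):
    # f(w) = (1/n) * sum_i [ log(1 + exp(-y_i x_i^T w)) + (lam/2) * ||w||^2 ]
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * np.dot(w, w)

def logistic_gradient(w, X, y, lam):
    # Gradient of the objective above: (1/n) * sum_i [ -y_i x_i / (1 + exp(y_i x_i^T w)) ] + lam * w
    n = X.shape[0]
    margins = y * (X @ w)
    coef = -y / (1.0 + np.exp(margins))
    return (X.T @ coef) / n + lam * w

# Tiny synthetic check (illustrative only): the objective at w = 0 equals log(2).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.sign(X @ rng.standard_normal(5))
print(logistic_objective(np.zeros(5), X, y, lam=0.01))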

4 Introduction: Machine Learning for Big Data
For big data applications, first-order methods have become much more popular than higher-order methods for learning (optimization). Gradient descent methods are the most representative first-order methods.
(Deterministic) gradient descent (GD):
$w_{t+1} \leftarrow w_t - \eta_t [\frac{1}{n} \sum_{i=1}^n \nabla f_i(w_t)]$,
where $t$ is the iteration number.
Linear convergence rate: $O(\rho^t)$
Iteration cost is $O(n)$
Stochastic gradient descent (SGD): In the $t$-th iteration, randomly choose an example $i_t \in \{1, 2, \ldots, n\}$, then update
$w_{t+1} \leftarrow w_t - \eta_t \nabla f_{i_t}(w_t)$
Iteration cost is $O(1)$
The convergence rate is sublinear: $O(1/t)$
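A minimal sketch of both update rules is given below, written against hypothetical callables full_grad(w) (the full gradient $\frac{1}{n}\sum_i \nabla f_i(w)$) and grad_i(w, i) (the gradient of a single $f_i$); the decaying step size for SGD is an assumed choice, since the slides only write $\eta_t$:

import numpy as np

def gd(full_grad, w0, eta, T):
    # Deterministic GD: one full gradient per iteration, O(n) cost per iteration.
    w = w0.copy()
    for _ in range(T):
        w = w - eta * full_grad(w)
    return w

def sgd(grad_i, w0, n, eta, T, seed=0):
    # SGD: pick i_t uniformly at random, one per-example gradient per iteration, O(1) cost.
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for t in range(T):
        i = rng.integers(n)
        w = w - (eta / (1 + t)) * grad_i(w, i)   # decaying step size eta_t = eta / (t + 1)
    return w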

5 Introduction: Stochastic Learning for Big Data
Researchers recently proposed improved versions of SGD: SAG [Roux et al., NIPS 2012], SDCA [Shalev-Shwartz and Zhang, JMLR 2013], SVRG [Johnson and Zhang, NIPS 2013].
Number of gradient ($\nabla f_i$) evaluations to reach $\epsilon$ for smooth and strongly convex problems:
GD: $O(n\kappa \log(\frac{1}{\epsilon}))$
SGD: $O(\kappa \frac{1}{\epsilon})$
SAG: $O(n \log(\frac{1}{\epsilon}))$ when $n \geq 8\kappa$
SDCA: $O((n + \kappa) \log(\frac{1}{\epsilon}))$
SVRG: $O((n + \kappa) \log(\frac{1}{\epsilon}))$
where $\kappa = \frac{L}{\mu} > 1$ is the condition number of the objective function.
Stochastic Learning:
Stochastic GD
Stochastic coordinate descent
Stochastic dual coordinate ascent
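Since both AsySVRG and SCOPE build on SVRG [Johnson and Zhang, NIPS 2013], a serial SVRG sketch is included here for reference; grad_i and full_grad are the same hypothetical callables as in the previous sketch, and taking the last inner iterate as $w_{t+1}$ is one of the standard options:

import numpy as np

def svrg(grad_i, full_grad, w0, n, eta, T, M, seed=0):
    # Serial SVRG: each outer iteration computes one full gradient at the snapshot w_t,
    # then runs M stochastic steps with the variance-reduced gradient
    #   v = grad_i(u) - grad_i(w_t) + full_grad(w_t).
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(T):
        mu = full_grad(w)          # snapshot full gradient
        u = w.copy()
        for _ in range(M):
            i = rng.integers(n)
            v = grad_i(u, i) - grad_i(w, i) + mu
            u = u - eta * v
        w = u                      # take the last inner iterate as w_{t+1}
    return w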

6 Introduction: Parallel and Distributed Stochastic Learning
To further improve the learning scalability (speed):
Parallel stochastic learning: one machine with multiple cores and a shared memory
Distributed stochastic learning: a cluster with multiple machines
Key issue: cooperation
Parallel stochastic learning: lock vs. lock-free (waiting cost and lock cost)
Distributed stochastic learning: synchronous vs. asynchronous (waiting cost and communication cost)

7 Introduction: Our Contributions
Parallel stochastic learning: AsySVRG
Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee
Distributed stochastic learning: SCOPE
Scalable Composite Optimization for Learning

8 Outline
1 Introduction
2 AsySVRG
3 SCOPE
4 Conclusion

9 AsySVRG: Motivation and Contribution
Motivation:
Existing asynchronous parallel SGD methods: Hogwild! [Recht et al. 2011] and PASSCoDe [Hsieh, Yu, and Dhillon 2015]
No parallel methods for SVRG.
Lock-free: empirically effective, but no theoretical proof.
Contribution:
A fast asynchronous method to parallelize SVRG, called AsySVRG.
A lock-free parallel strategy for both read and write
Linear convergence rate with theoretical proof
Outperforms Hogwild! in experiments
AsySVRG is the first lock-free parallel SGD method with a theoretical proof of convergence.

10 AsySVRG: a multi-thread version of SVRG
Initialization: p threads, initialize $w_0$, $\eta$;
for $t = 0, 1, 2, \ldots$ do
  $u_0 = w_t$;
  All threads parallelly compute the full gradient $\nabla f(u_0) = \frac{1}{n} \sum_{i=1}^n \nabla f_i(u_0)$;
  $u = w_t$;
  For each thread, do:
  for $m = 1$ to $M$ do
    Read the current value of $u$, denoted $\hat{u}$, from the shared memory, and randomly pick an $i$ from $\{1, \ldots, n\}$;
    Compute the update vector: $\hat{v} = \nabla f_i(\hat{u}) - \nabla f_i(u_0) + \nabla f(u_0)$;
    $u \leftarrow u - \eta \hat{v}$;
  end for
  Take $w_{t+1}$ to be the current value of $u$ in the shared memory;
end for
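The following multi-threaded sketch mirrors the algorithm above using Python threads on a shared NumPy array. It is only schematic (CPython's GIL prevents the genuine parallel speedup the method targets), and grad_i / full_grad are the same hypothetical callables used in the earlier sketches:

import threading
import numpy as np

def asysvrg(grad_i, full_grad, w0, n, eta, T, M, p):
    # Lock-free AsySVRG sketch: p threads read and update the shared iterate u without locks.
    w = w0.copy()
    for t in range(T):
        u0 = w.copy()
        mu = full_grad(u0)                 # full gradient at the snapshot u_0 = w_t
        u = w.copy()                       # shared iterate ("shared memory")

        def worker(seed):
            rng = np.random.default_rng(seed)
            for _ in range(M):
                u_hat = u.copy()           # lock-free read of u (may be inconsistent)
                i = rng.integers(n)
                v = grad_i(u_hat, i) - grad_i(u0, i) + mu
                u[:] -= eta * v            # lock-free in-place write to the shared u

        threads = [threading.Thread(target=worker, args=(t * p + k,)) for k in range(p)]
        for th in threads:
            th.start()
        for th in threads:
            th.join()
        w = u.copy()                       # w_{t+1}: current value of u in shared memory
    return w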

11 AsySVRG: Lock-free Analysis
In all the GD or SGD methods for solving the objective function, the key step can be written as $u \leftarrow u + \Delta$.
Notation:
$\Delta_{i,j}$: the $j$-th update vector computed by the $i$-th thread;
$u \in \mathbb{R}^d$ and $u = (u^{(1)}, u^{(2)}, \ldots, u^{(d)})$;
$t^{(k)}_{i,j}$: the time when the operation $u^{(k)} \leftarrow u^{(k)} + \Delta^{(k)}_{i,j}$ has been completed (not the time when the operation begins);
Assuming $\forall i, j$: $t^{(1)}_{i,j} < t^{(2)}_{i,j} < \ldots < t^{(d)}_{i,j}$, which can be easily guaranteed by programming.

12 AsySVRG: Lock-free Analysis (update sequence)
Since $u^{(1)}$ can only be changed by at most one thread at any absolute time, the $t^{(1)}_{i,j}$ are different from each other. So we can:
Sort these $t^{(1)}_{i,j}$ as $t^{(1)}_0 < t^{(1)}_1 < \ldots < t^{(1)}_{\tilde{M}-1}$ ($\tilde{M} = p \times M$);
$\Delta_0, \Delta_1, \ldots, \Delta_{\tilde{M}-1}$ are the corresponding update vectors.
Since it is lock-free, for each update vector $\Delta_m$ the real update vector is $B_m \Delta_m$ because of over-writing. $B_m$ is a diagonal matrix whose diagonal elements are 0 or 1 (an element is 0 for a coordinate of $\Delta_m$ that is over-written and never takes effect).
After all the inner loops stop, we get:
$w_{t+1} = u_0 + \sum_{m=0}^{\tilde{M}-1} B_m \Delta_m$   (1)

13 AsySVRG: Lock-free Analysis (update sequence)
According to (1), we define a sequence $\{u_m\}$ as follows:
$u_m = u_0 + \sum_{i=0}^{m-1} B_i \Delta_i$   (2)
which means $u_{m+1} = u_m + B_m \Delta_m$.
Note: The sequence $\{u_m\}$ ($m = 1, 2, \ldots, \tilde{M}-1$) is synthetic, and the whole $u_m$ may never occur in the shared memory. What we can get is only the final value $u_{\tilde{M}}$.

14 AsySVRG: Lock-free Analysis (read sequence)
Assume the old update vectors $\Delta_0, \Delta_1, \ldots, \Delta_{a(m)-1}$ have been completely applied to $u$ when one thread is reading the shared variable. At the same time, some new update vectors might be updating $u$. So we can write $\hat{u}_m$, read by the thread to compute $\Delta_m$, as:
$\hat{u}_m = u_{a(m)} + \sum_{i=a(m)}^{b(m)} P_{m, i-a(m)} \Delta_i$
where $P_{m, i-a(m)}$ is a diagonal matrix whose diagonal elements are 0 or 1. According to the ordering principle, $\Delta_i$ ($i \geq m$) should not be read into $\hat{u}_m$. So $b(m) < m$, which means:
$\hat{u}_m = u_{a(m)} + \sum_{i=a(m)}^{m-1} P_{m, i-a(m)} \Delta_i$

15 AsySVRG: Convergence Result for Strongly Convex Problems
With some assumptions, our algorithm gets a linear convergence rate for strongly convex problems:
$\mathbb{E} f(w_{t+1}) - f(w^*) \leq (c_1^{\tilde{M}} + \frac{c_2}{1 - c_1}) (\mathbb{E} f(w_t) - f(w^*))$,
where $c_1 = 1 - \alpha\eta\mu + c_2$, $c_2 = \eta^2 (\frac{8\tau L^3 \eta \rho^2 (\rho^\tau - 1)}{\rho - 1} + 2L^2 \rho)$, and $\tilde{M} = p \times M$ is the total number of iterations of the inner loop.
Note: Since it is lock-free, we do not know the exact $B_m$, so we cannot take the average of the $B_m u_m$ to be $w_{t+1}$.

16 AsySVRG: Convergence Result for Non-Convex Problems
With some assumptions, our algorithm gets a sublinear convergence rate for non-convex problems:
$\frac{1}{T\tilde{M}} \sum_{t=0}^{T-1} \sum_{m=0}^{\tilde{M}-1} \mathbb{E} \|\nabla f(u_{t,m})\|^2 \leq \frac{\mathbb{E} f(w_0) - \mathbb{E} f(w_T)}{T \tilde{M} \gamma}$.
Similar to the analysis for strongly convex problems, we construct an equivalent write sequence $\{u_{t,m}\}$ for the $t$-th outer loop:
$u_{t,0} = w_t$, $u_{t,m+1} = u_{t,m} - \eta B_{t,m} \hat{v}_{t,m}$,
where $\hat{v}_{t,m} = \nabla f_{i_{t,m}}(\hat{u}_{t,m}) - \nabla f_{i_{t,m}}(u_{t,0}) + \nabla f(u_{t,0})$. $B_{t,m}$ is a diagonal matrix whose diagonal entries are 0 or 1, and $\hat{u}_{t,m}$ is the value read by the thread that computes $\hat{v}_{t,m}$.

17 AsySVRG: Experiments
Experimental platform: a server with 12 Intel cores and 64G memory.
Model: logistic regression with L2-norm regularization
$f(w) = \frac{1}{n} \sum_{i=1}^n \log(1 + e^{-y_i x_i^T w}) + \frac{\lambda}{2} \|w\|^2$
We set $\lambda =$
Data sets:
dataset    instances   features    memory   type
rcv1       20,242      47,236      36M      sparse
real-sim   72,309      20,958      90M      sparse
news20     19,996      1,355,191   M        sparse
epsilon    400,000     2,000       11G      dense

18 AsySVRG: Experiments (Computation Cost)
[Figure: log(objective value - min) versus number of effective passes for AsySVRG and Hogwild!, with and without locks, on (a) rcv1, (b) real-sim, (c) news20, (d) epsilon.]

19 AsySVRG: Experiments (Total Time Cost)
[Figure: log(objective value - min) versus wall-clock time for AsySVRG and Hogwild! on (a) rcv1, (b) real-sim, (c) news20, (d) epsilon.]

20 AsySVRG: Experiments (Speedup)
[Figure: speedup versus number of threads for lock-free AsySVRG and AsySVRG with locks on (a) rcv1, (b) real-sim, (c) news20, (d) epsilon.]

21 Outline
1 Introduction
2 AsySVRG
3 SCOPE
4 Conclusion

22 SCOPE: Motivation and Contribution
Motivation:
Bulk synchronous parallel (BSP) models, such as MapReduce, are commonly considered to be inefficient for distributed stochastic learning.
Is there any technique to solve the issues of BSP models?
Contribution:
A novel distributed stochastic learning method, called scalable composite optimization for learning (SCOPE), on BSP models
Both computation-efficient and communication-efficient
Linear convergence rate with theoretical proof
Can be easily integrated into the data processing pipeline of Spark
Outperforms other state-of-the-art distributed learning methods on Spark

23 SCOPE: Framework of SCOPE
[Figure: Distributed framework of SCOPE. A Master holds the parameter $w$; Workers 1 through $p$ each hold a local data partition.]

24 SCOPE: Optimization Algorithm (Master)
Task of the Master in SCOPE:
Initialization: p Workers, $w_0$;
for $t = 0, 1, 2, \ldots, T$ do
  Send $w_t$ to the Workers;
  Wait until it receives $z_1, z_2, \ldots, z_p$ from the p Workers;
  Compute the full gradient $z = \frac{1}{n} \sum_{k=1}^p z_k$, and then send $z$ to each Worker;
  Wait until it receives $\tilde{u}_1, \tilde{u}_2, \ldots, \tilde{u}_p$ from the p Workers;
  Compute $w_{t+1} = \frac{1}{p} \sum_{k=1}^p \tilde{u}_k$;
end for

25 SCOPE: Optimization Algorithm (Workers)
Task of the Workers in SCOPE:
Initialization: initialize $\eta$ and $c > 0$;
For Worker $k$:
for $t = 0, 1, 2, \ldots, T$ do
  Wait until it gets the newest parameter $w_t$ from the Master;
  Let $u_{k,0} = w_t$, compute the local gradient sum $z_k = \sum_{i \in D_k} \nabla f_i(w_t)$, and then send $z_k$ to the Master;
  Wait until it gets the full gradient $z$ from the Master;
  for $m = 0$ to $M - 1$ do
    Randomly pick an instance with index $i_{k,m}$ from $D_k$;
    $u_{k,m+1} = u_{k,m} - \eta (\nabla f_{i_{k,m}}(u_{k,m}) - \nabla f_{i_{k,m}}(w_t) + z + c(u_{k,m} - w_t))$;
  end for
  Send $u_{k,M}$ or $\frac{1}{M} \sum_{m=1}^M u_{k,m}$, which is called the locally updated parameter and denoted $\tilde{u}_k$, to the Master;
end for
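To make the Master/Worker message flow concrete, here is a serial simulation of one SCOPE outer iteration. The assumptions are: grad_i is the per-example gradient callable used in the earlier sketches, partitions is a list of index arrays $D_1, \ldots, D_p$, and each Worker sends back $u_{k,M}$; real SCOPE runs this distributed on Spark.

import numpy as np

def scope_round(grad_i, w_t, partitions, n, eta, c, M, seed=0):
    # One outer iteration of SCOPE, simulated serially on p local data partitions.
    rng = np.random.default_rng(seed)

    # Workers compute local gradient sums z_k; the Master combines them into the full gradient z.
    z_locals = [sum(grad_i(w_t, i) for i in D_k) for D_k in partitions]
    z = sum(z_locals) / n

    # Each Worker runs M local steps with the variance-reduced gradient plus the c*(u - w_t) term.
    u_tilde = []
    for D_k in partitions:
        u = w_t.copy()
        for _ in range(M):
            i = D_k[rng.integers(len(D_k))]
            u = u - eta * (grad_i(u, i) - grad_i(w_t, i) + z + c * (u - w_t))
        u_tilde.append(u)                  # the locally updated parameter sent to the Master

    # The Master averages the locally updated parameters to get w_{t+1}.
    return sum(u_tilde) / len(partitions)

Iterating $w_{t+1} =$ scope_round(grad_i, $w_t$, partitions, n, eta, c, M) for $t = 0, \ldots, T-1$ mirrors the Master loop on the previous slide.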

26 SCOPE: Convergence
Let $\alpha = 1 - \eta(2\mu + c) < 1$, $\beta = c\eta + 3L^2\eta^2$, and $\alpha + \beta < 1$. We have the following theorems:
Theorem. If we take $w_{t+1} = \frac{1}{p} \sum_{k=1}^p u_{k,M}$, then we get the following convergence result:
$\mathbb{E} \|w_{t+1} - w^*\|^2 \leq (\alpha^M + \frac{\beta}{1 - \alpha}) \mathbb{E} \|w_t - w^*\|^2$.
Theorem. If we take $w_{t+1} = \frac{1}{p} \sum_{k=1}^p \tilde{u}_k$ with $\tilde{u}_k = \frac{1}{M} \sum_{m=1}^M u_{k,m}$, then we get the following convergence result:
$\mathbb{E} \|w_{t+1} - w^*\|^2 \leq (\frac{1}{M(1 - \alpha)} + \frac{\beta}{1 - \alpha}) \mathbb{E} \|w_t - w^*\|^2$.

27 SCOPE: Communication Cost
Traditional mini-batch based methods: $O(Tn)$
SCOPE: $O(T)$
SCOPE communicates only a constant number of times per outer iteration (the local gradient sums and the locally updated parameters), independent of $n$.

28 SCOPE: Experiment
Logistic regression (LR) with an L2-norm regularization term:
$f(w) = \frac{1}{n} \sum_{i=1}^n [\log(1 + e^{-y_i x_i^T w}) + \frac{\lambda}{2} \|w\|^2]$.
Table: Datasets for evaluation.
dataset     instances     features    memory   λ
MNIST-8M    8,100,000                 G        1e-4
epsilon     400,000       2,000       11G      1e-4
KDD12       73,209,277    1,427,495   21G      1e-4
Data-A      106,691,                  G        1e-6
Two Spark clusters with Intel CPUs:
small: 1 Master and 16 Workers
large: 1 Master and 128 Workers

29 SCOPE: Experiment (Baselines)
MLlib [Meng et al., 2015]: MLlib is an open-source library for distributed machine learning on Spark. We compare our method with the distributed LBFGS of MLlib, which is a batch learning method and faster than the SGD version of MLlib.
LibLinear [Lin et al., 2014a]: LibLinear is a distributed Newton method, which is also a batch learning method.
Splash [Zhang and Jordan, 2015]: Splash is a distributed SGD method that uses a local learning strategy to reduce communication cost.
CoCoA [Jaggi et al., 2014]: CoCoA is a distributed dual coordinate ascent method.
CoCoA+ [Ma et al., 2015]: CoCoA+ is an improved version of CoCoA; it adopts adding rather than averaging to combine local updates.

30 SCOPE: Experiment
[Figure: objective value - optimal versus CPU time (milliseconds) for SCOPE, LibLinear, CoCoA, MLlib (LBFGS), Splash, and CoCoA+ on (a) MNIST-8M with 16 cores, (b) epsilon with 16 cores, (c) KDD12 with 16 cores, (d) Data-A with 128 cores.]

31 SCOPE: Experiment
[Figure: (e) speedup of SCOPE versus the ideal speedup as the number of cores grows; (f) synchronization cost: per-iteration CPU time (milliseconds), split into computation time, waiting time, and communication time, versus the number of cores.]

32 Outline
1 Introduction
2 AsySVRG
3 SCOPE
4 Conclusion

33 Conclusion
Stochastic learning is becoming popular for big data machine learning.
A lock-free strategy is the key to getting a good speedup in parallel stochastic learning.
With properly designed techniques, BSP models are also efficient for distributed stochastic learning.

34 Future Work
Open source project: LIBBLE, a library for big learning
LIBBLE-Spark:
  Classification: LR, SVM, LR with L1-norm regularization
  Regression: linear regression, Lasso
  Generalized linear models: with L2-norm/L1-norm regularization
  Dimensionality reduction: PCA, SVD
  Matrix factorization
  Clustering: K-Means
LIBBLE-MPI
LIBBLE-GPU
LIBBLE-MultiThread

35 References
Shen-Yi Zhao, Ru Xiang, Ying-Hao Shi, Peng Gao, Wu-Jun Li. SCOPE: Scalable Composite Optimization for Learning on Spark. To appear in Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI), 2017.
Shen-Yi Zhao, Gong-Duo Zhang, Wu-Jun Li. Lock-Free Optimization for Non-Convex Problems. To appear in Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI), 2017.
Shen-Yi Zhao, Wu-Jun Li. Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee. Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI), 2016.

36 Q & A
Thanks!
