Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence
|
|
- Joella Henry
- 5 years ago
- Views:
Transcription
1 Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence oé LAMDA Group H ŒÆOŽÅ Æ EâX ^ #EâI[ : liwujun@nju.edu.cn Dec 10, 2016 Wu-Jun Li ( PDSL CS, NJU 1 / 36
2 Outline Introduction 1 Introduction 2 AsySVRG 3 SCOPE 4 Conclusion Wu-Jun Li ( PDSL CS, NJU 2 / 36
3 Machine Learning Introduction Supervised Learning: Given a set of training examples {(x i, y i )} n i=1, supervised learning tries to solve the following regularized empirical risk minimization problem: min f(w) = 1 n f i (w), w n where f i (w) is the loss function (plus some regularization term) defined on example i, and w is the parameter to learn. Examples: Logistic regression: f(w) = 1 n n i=1 [log(1 + e yixt i w ) + λ 2 w 2 ] SVM: f(w) = 1 n n i=1 [max{0, 1 y ix T i w} + λ 2 w 2 ] Deep learning models Unsupervised Learning: Many unsupervised learning models, such as PCA and matrix factorization, can also be reformulated as similar problems. Wu-Jun Li ( PDSL CS, NJU 3 / 36 i=1
4 Introduction Machine Learning for Big Data For big data applications, first-order methods have become much more popular than other higher-order methods for learning (optimization). Gradient descent methods are the most representative first-order methods. (Deterministic) gradient descent (GD): w t+1 w t η t [ 1 n f i (w t )], n where t is the iteration number. Linear convergence rate: O(ρ t ) Iteration cost is O(n) Stochastic gradient descent (SGD): In the t th iteration, randomly choosing an example i t {1, 2,..., n}, then update i=1 w t+1 w t η t f it (w t ) Iteration cost is O(1) The convergence rate is sublinear: O(1/t) Wu-Jun Li ( PDSL CS, NJU 4 / 36
5 Introduction Stochastic Learning for Big Data Researchers recently proposed improved versions of SGD: SAG [Roux et al., NIPS 2012], SDCA [Shalev-Shwartz and Zhang, JMLR 2013], SVRG [Johnson and Zhang, NIPS 2013] Number of gradient ( f i ) evaluation to reach ɛ for smooth and strongly convex problems: GD: O(nκ log( 1 ɛ )) SGD: O(κ( 1 ɛ )) SAG: O(n log( 1 ɛ )) when n 8κ SDCA: O((n + κ) log( 1 ɛ )) SVRG: O((n + κ) log( 1 ɛ )) where κ = L µ > 1 is the condition number of the objective function. Stochastic Learning: Stochastic GD Stochastic coordinate descent Stochastic dual coordinate ascent Wu-Jun Li ( PDSL CS, NJU 5 / 36
6 Introduction Parallel and Distributed Stochastic Learning To further improve the learning scalability (speed): Parallel stochastic learning: One machine with multiple cores and a shared memory Distributed stochastic learning: A cluster with multiple machines Key issues: cooperation Parallel stochastic learning: lock vs. lock-free: waiting cost and lock cost Distributed stochastic learning: synchronous vs. asynchronous: waiting cost and communication cost Wu-Jun Li ( PDSL CS, NJU 6 / 36
7 Our Contributions Introduction Parallel stochastic learning: AsySVRG Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee. Distributed stochastic learning: SCOPE Scalable Composite Optimization for Learning Wu-Jun Li ( PDSL CS, NJU 7 / 36
8 Outline AsySVRG 1 Introduction 2 AsySVRG 3 SCOPE 4 Conclusion Wu-Jun Li ( PDSL CS, NJU 8 / 36
9 AsySVRG Motivation and Contribution Motivation: Existing asynchronous parallel SGD: Hogwild! [Recht et al. 2011], and PASSCoDe [Hsieh, Yu, and Dhillon 2015] No parallel methods for SVRG. Lock-free: empirically effective, but no theoretical proof. Contribution: A fast asynchronous method to parallelize SVRG, called AsySVRG. A lock-free parallel strategy for both read and write Linear convergence rate with theoretical proof Outperforms Hogwild! in experiments AsySVRG is the first lock-free parallel SGD method with theoretical proof of convergence. Wu-Jun Li ( PDSL CS, NJU 9 / 36
10 AsySVRG AsySVRG: a multi-thread version of SVRG Initialization: p threads, initialize w 0, η; for t = 0, 1, 2,... do u 0 = w t ; All threads parallelly compute the full gradient f(u 0 ) = 1 n n i=1 f i(u 0 ); u = w t ; For each thread, do: for m = 1 to M do Read current value of u, denoted as û, from the shared memory. And randomly pick up an i from {1,..., n}; Compute the update vector: ˆv = f i (û) f i (u 0 ) + f(u 0 ); u u ηˆv; end for Take w t+1 to be the current value of u in the shared memory; end for Wu-Jun Li ( PDSL CS, NJU 10 / 36
11 Lock-free Analysis AsySVRG In all the GD or SGD methods to solve the objective function, the key step can be written as Notation u u + i,j : the j th update vector computed by the i th thread; u R d and u = (u (1), u (2),..., u (d) ); t (k) i,j : the time when the operation u(k) u (k) + (k) i,j has been completed (Not the time when the operation begins); Assuming i, j, t (1) i,j < t(2) i,j <... < t(d) i,j, which can be easily guaranteed by programming Wu-Jun Li ( PDSL CS, NJU 11 / 36
12 AsySVRG Lock-free Analysis: update sequence Since u (1) can only be changed by at most one thread at any absolute time, these t (1) i,j are different from each other. So we can: Sort these t (1) i,j as t(1) 0 < t (1) 1 <... < t (1) M 1 ( M = p M); 0, 1,..., M 1 are the corresponding update vectors. Since it is lock-free, for each update vector m, the real update vector is B m m because of over-written. The B m is a diagonal matrix whose diagonal elements are 0 or 1. After all the inner-loop stop, we can get: w t+1 = u 0 + M 1 m=0 B m m (1) Wu-Jun Li ( PDSL CS, NJU 12 / 36
13 AsySVRG Lock-free Analysis: update sequence According to (1), we define a sequence {u m } as follows: which means u m+1 = u m + B m m. m 1 u m = u 0 + B i i (2) Note The sequence {u m } (m = 1, 2,..., M 1) is synthetic, and the whole u m may never occur in the shared memory. What we can get is only the final value of u M. i=0 Wu-Jun Li ( PDSL CS, NJU 13 / 36
14 AsySVRG Lock-free Analysis: read sequence Assume the old update vectors 0, 1,..., a(m) 1 have been completely applied to u when one thread is reading the shared variable. At the same time, some new update vectors might be updating u. So we can write û m read by the thread to compute m as follows: û m = u a(m) + b(m) i=a(m) P m,i a(m) i where P m,i a(m) is a diagonal matrix whose diagonal elements are 0 or 1. According to the principle of the order, i (i m) should not be read by û m. So b(m) < m, which means: û m = u a(m) + m 1 i=a(m) P m,i a(m) i Wu-Jun Li ( PDSL CS, NJU 14 / 36
15 AsySVRG Convergence Result for Strongly Convex Problems With some assumptions, our algorithm gets a linear convergence rate for strongly convex problems: Ef(w t+1 ) f(w ) (c M 1 + c 2 1 c 1 )(Ef(w t ) f(w )), where c 1 = 1 αηµ + c 2 and c 2 = η 2 ( 8τL3 ηρ 2 (ρ τ 1) ρ 1 + 2L 2 ρ), M = p M is the total number of iterations of the inner-loop. Note Since it is lock-free, we do not know the exact B m and we can not take the average sum of B m u m to be w t+1. Wu-Jun Li ( PDSL CS, NJU 15 / 36
16 AsySVRG Convergence Result for Non-Convex Problems With some assumptions, our algorithm gets a sub-linear convergence rate for non-convex problems: 1 T M T 1 M 1 t=0 m=0 E f(u t,m ) 2 Ef(w 0) Ef(w T ) T Mγ. Similar to the analysis for strongly convex problems, we construct an equivalent write sequence {u t,m } for the t th outer-loop: u t,0 = w t, u t,m+1 = u t,m ηb t,mˆv t,m, where ˆv t,m = f it,m (û t,m ) f it,m (u t,0 ) + f(u t,0 ). B t,m is a diagonal matrix whose diagonal entries are 0 or 1. And û t,m is read by the thread who computes ˆv t,m. Wu-Jun Li ( PDSL CS, NJU 16 / 36
17 Experiments AsySVRG Experimental platform: A server with 12 Intel cores and 64G memory. Model: Logistic regression with L2-norm f(w) = 1 n n log(1 + e y ix T i w ) + λ 2 w 2 i=1 Data set We set λ = dataset instances features memory type rcv1 20,242 47,236 36M sparse real-sim 72,309 20,958 90M sparse news20 19,996 1,355, M sparse epsilon 400,000 2,000 11G dense Wu-Jun Li ( PDSL CS, NJU 17 / 36
18 AsySVRG Experiments: Computation Cost log(objective value min) log(objective value min) 10 0 rcv AsySVRG 1 AsySVRG lock 10 AsySVRG Hogwild lock 10 Hogwild number of effective passes log(objective value min) 10 0 real sim AsySVRG 1 AsySVRG lock AsySVRG 10 Hogwild lock 10 Hogwild number of effective passes (a) rcv1 (b) realsim 10 0 news AsySVRG 1 AsySVRG lock AsySVRG 10 Hogwild lock 10 Hogwild number of effective passes log(objective value min) 10 0 epsilon AsySVRG 1 AsySVRG lock AsySVRG 10 Hogwild lock 10 Hogwild number of effective passes (c) news20 (d) epsilon Wu-Jun Li ( PDSL CS, NJU 18 / 36
19 AsySVRG Experiments: Total Time Cost log(objective value min) log(objective value min) rcv1 real sim 10 0 AsySVRG 10 Hogwild time log(objective value min) 10 0 AsySVRG 10 Hogwild time (a) rcv1 (b) realsim news20 epsilon 10 0 AsySVRG 10 Hogwild time log(objective value min) 10 0 AsySVRG 10 Hogwild time (c) news20 (d) epsilon Wu-Jun Li ( PDSL CS, NJU 19 / 36
20 Experiments: Speed up 6 rcv1 5 AsySVRG 6 real sim AsySVRG lock AsySVRG AsySVRG lock AsySVRG 5 speedup 4 3 speedup number of threads number of threads (a) rcv1 (b) realsim 8 news20 6 epsilon 7 AsySVRG lock AsySVRG 5 AsySVRG lock AsySVRG unlock 6 speedup 5 4 speedup number of threads (c) news number of threads (d) epsilon Wu-Jun Li ( PDSL CS, NJU 20 / 36
21 Outline SCOPE 1 Introduction 2 AsySVRG 3 SCOPE 4 Conclusion Wu-Jun Li ( PDSL CS, NJU 21 / 36
22 SCOPE Motivation and Contribution Motivation: Bulk synchronous parallel (BSP) models, such as MapReduce, are commonly considered to be inefficient for distributed stochastic learning. Is there any technique to solve the issues of BSP models? Contribution: A novel distributed stochastic learning method, called scalable composite optimization for learning (SCOPE), on BSP models Both computation-efficient and communication-efficient Linear convergence rate with theoretical proof Can be easily integrated into the data processing pipeline of Spark Outperform other state-of-the-art distributed learning methods on Spark Wu-Jun Li ( PDSL CS, NJU 22 / 36
23 Framework of SCOPE SCOPE Master w Worker_1 Worker_ Worker_p D! D! D! Figure: Distributed framework of SCOPE. Wu-Jun Li ( PDSL CS, NJU 23 / 36
24 SCOPE Optimization Algorithm: Master Task of Master in SCOPE: Initialization: p Workers, w 0 ; for t = 0, 1, 2,..., T do Send w t to the Workers; Wait until it receives z 1, z 2,..., z p from the p Workers; Compute the full gradient z = 1 n p k=1 z k, and then send z to each Worker; Wait until it receives ũ 1, ũ 2,..., ũ p from the p Workers; Compute w t+1 = 1 p p k=1 ũk; end for Wu-Jun Li ( PDSL CS, NJU 24 / 36
25 SCOPE Optimization Algorithm: Workers Task of Workers in SCOPE: Initialization: initialize η and c > 0; For the Worker k: for t = 0, 1, 2,..., T do Wait until it gets the newest parameter w t from the Master; Let u k,0 = w t, compute the local gradient sum z k = i D k f i (w t ), and then send z k to the Master; Wait until it gets the full gradient z from the Master; for m = 0 to M 1 do Randomly pick up an instance with index i k,m from D k ; u k,m+1 = u k,m η( f ik,m (u k,m ) f ik,m (w t )+z+c(u k,m w t )); end for Send u k,m or 1 M M m=1 u k,m, which is called the locally updated parameter and denoted as ũ k, to the Master; end for Wu-Jun Li ( PDSL CS, NJU 25 / 36
26 Convergence SCOPE Let α = 1 η(2µ + c) < 1, β = cη + 3L 2 η 2 and α + β < 1. We have the following theorems: Theorem If we take w t+1 = 1 p p k=1 u k,m, then we can get the following convergence result: Theorem E w t+1 w 2 (α M + β 1 α )E w t w 2. If we take w t+1 = 1 p p k=1 ũk with ũ k = 1 M M m=1 u k,m, then we can get the following convergence result: E w t+1 w 2 1 ( M(1 α) + β 1 α )E w t w 2. Wu-Jun Li ( PDSL CS, NJU 26 / 36
27 Communication Cost SCOPE Traditional mini-batch based methods: O(T n) SCOPE: O(T ) Wu-Jun Li ( PDSL CS, NJU 27 / 36
28 SCOPE Experiment Logistic regression (LR) with a L 2 -norm regularization term: f(w) = 1 ] n n i=1 [log(1 + e y ix T i w ) + λ 2 w 2. Table: Datasets for evaluation. instances features memory λ MNIST-8M 8,100, G 1e-4 epsilon 400,000 2,000 11G 1e-4 KDD12 73,209,277 1,427,495 21G 1e-4 Data-A 106,691, G 1e-6 Two Spark clusters with Intel CPUs: small: 1 Master and 16 Workers large: 1 Master and 128 Workers Wu-Jun Li ( PDSL CS, NJU 28 / 36
29 SCOPE Experiment Baselines: MLlib [Meng et al., 2015]: MLlib is an open source library for distributed machine learning on Spark. We compare our method with distributed lbfgs for MLlib, which is a batch learning method and faster than the SGD version of MLlib. LibLinear [Lin et al., 2014a]: LibLinear is a distributed Newton method, which is also a batch learning method. Splash [Zhang and Jordan, 2015]: Splash is a distributed SGD method by using the local learning strategy to reduce communication cost. CoCoA [Jaggi et al., 2014]: CoCoA is a distributed dual coordinate ascent method. CoCoA+ [Ma et al., 2015]: CoCoA+ is an improved version of CoCoA. CoCoA+ adopts adding rather than average to combine local updates. Wu-Jun Li ( PDSL CS, NJU 29 / 36
30 Experiment SCOPE 10 0 MNIST 8M with 16 cores 10 0 epsilon with 16 cores objective value optimal SCOPE LibLinear CoCoA MLlib(lbfgs) Splash CoCoA CPU Time(millisecond) x 10 4 objective value optimal SCOPE LibLinear CoCoA MLlib(lbfgs) Splash CoCoA CPU Time(millisecond) x 10 4 (a) MNIST-8M (b) epsilon 10 0 KDD12 with 16 cores 10 0 Data A with 128 cores objective value optimal SCOPE Liblinear CoCoA MLlib(lbfgs) Splash CoCoA+ objective value optimal SCOPE LibLinear CoCoA MLlib(lbfgs) Splash CoCoA CPU Time(millisecond) x 10 5 (c) KDD CPU Time(millisecond) x 10 4 (d) Data-A Wu-Jun Li ( PDSL CS, NJU 30 / 36
31 Experiment SCOPE speedup SCOPE Ideal CPU Time(milllisecond) per iteration computation time waiting time communication time #cores #cores (e) Speedup (f) Synchronization cost Wu-Jun Li ( PDSL CS, NJU 31 / 36
32 Outline Conclusion 1 Introduction 2 AsySVRG 3 SCOPE 4 Conclusion Wu-Jun Li ( PDSL CS, NJU 32 / 36
33 Conclusion Conclusion Stochastic learning is becoming popular for big data machine learning. Lock-free strategy is the key to get a good speedup in parallel stochastic learning. With properly designed techniques, BSP models are also efficient for distributed stochastic learning. Wu-Jun Li ( PDSL CS, NJU 33 / 36
34 Future Work Conclusion Open source project: LIBBLE: A library for big learning LIBBLE-Spark: Classification: LR, SVM, LR with L1-norm Regularization Regression: Linear Regression, Lasso Generalized Linear Models: with L2-norm/L1-norm Regularization Dimensionality Reduction: PCA, SVD Matrix Factorization Clustering: K-Means LIBBLE-MPI LIBBLE-GPU LIBBLE-MultiThread Wu-Jun Li ( PDSL CS, NJU 34 / 36
35 References Conclusion Shen-Yi Zhao, Ru Xiang, Ying-Hao Shi, Peng Gao, Wu-Jun Li. SCOPE: Scalable Composite Optimization for Learning on Spark. To Appear in Proceedings of the 31th AAAI Conference on Artificial Intelligence (AAAI), Shen-Yi Zhao, Gong-Duo Zhang, Wu-Jun Li. Lock-Free Optimization for Non-Convex Problems. To Appear in Proceedings of the 31th AAAI Conference on Artificial Intelligence (AAAI), Shen-Yi Zhao, Wu-Jun Li. Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee. Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI), Wu-Jun Li ( PDSL CS, NJU 35 / 36
36 Q & A Conclusion Thanks! Wu-Jun Li ( PDSL CS, NJU 36 / 36
Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and
More informationFast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee
Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and Wu-Jun Li National Key Laboratory for Novel Software Technology Department of Computer
More informationLAMDA Group. Oct 19, Wu-Jun Li ( Big Learning CS, NJU 1 / 125
ŒêâÅìÆS oé LAMDA Group H ŒÆOŽÅ Æ EâX ^ #EâI[ : liwujun@nju.edu.cn Oct 19, 2016 Wu-Jun Li (http://cs.nju.edu.cn/lwj) Big Learning CS, NJU 1 / 125 Outline 1 Introduction 2 Learning to Hash Isotropic Hashing
More informationAccelerating Stochastic Optimization
Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationSVRG++ with Non-uniform Sampling
SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationScalable Asynchronous Gradient Descent Optimization for Out-of-Core Models
Scalable Asynchronous Gradient Descent Optimization for Out-of-Core Models Chengjie Qin 1, Martin Torres 2, and Florin Rusu 2 1 GraphSQL, Inc. 2 University of California Merced August 31, 2017 Machine
More informationStochastic Optimization Algorithms Beyond SG
Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods
More informationCoordinate Descent and Ascent Methods
Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Nov 2, 2016 Outline SGD-typed algorithms for Deep Learning Parallel SGD for deep learning Perceptron Prediction value for a training data: prediction
More informationLarge-scale Stochastic Optimization
Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation
More informationSimple Optimization, Bigger Models, and Faster Learning. Niao He
Simple Optimization, Bigger Models, and Faster Learning Niao He Big Data Symposium, UIUC, 2016 Big Data, Big Picture Niao He (UIUC) 2/26 Big Data, Big Picture Niao He (UIUC) 3/26 Big Data, Big Picture
More informationParaGraphE: A Library for Parallel Knowledge Graph Embedding
ParaGraphE: A Library for Parallel Knowledge Graph Embedding Xiao-Fan Niu, Wu-Jun Li National Key Laboratory for Novel Software Technology Department of Computer Science and Technology, Nanjing University,
More informationAsynchronous Doubly Stochastic Sparse Kernel Learning
The Thirty-Second AAAI Conference on Artificial Intelligence AAAI-8 Asynchronous Doubly Stochastic Sparse Kernel Learning Bin Gu, Xin Miao, Zhouyuan Huo, Heng Huang * Department of Electrical & Computer
More informationOptimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method
Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors
More informationAccelerating SVRG via second-order information
Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 11, 2016 Paper presentations and final project proposal Send me the names of your group member (2 or 3 students) before October 15 (this Friday)
More informationStochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization
Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine
More informationStochastic Gradient Descent with Variance Reduction
Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction
More informationarxiv: v1 [stat.ml] 15 Mar 2018
Proximal SCOPE for Distributed Sparse Learning: Better Data Partition Implies Faster Convergence Rate Shen-Yi Zhao 1 Gong-Duo Zhang 1 Ming-Wei Li 1 Wu-Jun Li * 1 arxiv:1803.05621v1 [stat.ml] 15 Mar 2018
More informationParallel Coordinate Optimization
1 / 38 Parallel Coordinate Optimization Julie Nutini MLRG - Spring Term March 6 th, 2018 2 / 38 Contours of a function F : IR 2 IR. Goal: Find the minimizer of F. Coordinate Descent in 2D Contours of a
More informationDistributed Box-Constrained Quadratic Optimization for Dual Linear SVM
Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM Lee, Ching-pei University of Illinois at Urbana-Champaign Joint work with Dan Roth ICML 2015 Outline Introduction Algorithm Experiments
More informationarxiv: v5 [cs.lg] 23 Dec 2017 Abstract
Nonconvex Sparse Learning via Stochastic Optimization with Progressive Variance Reduction Xingguo Li, Raman Arora, Han Liu, Jarvis Haupt, and Tuo Zhao arxiv:1605.02711v5 [cs.lg] 23 Dec 2017 Abstract We
More informationConvex Optimization Lecture 16
Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean
More informationA Parallel SGD method with Strong Convergence
A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,
More informationARock: an algorithmic framework for asynchronous parallel coordinate updates
ARock: an algorithmic framework for asynchronous parallel coordinate updates Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin ( UCLA Math, U.Waterloo DCO) UCLA CAM Report 15-37 ShanghaiTech SSDS 15 June 25,
More informationAn Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal
An Evolving Gradient Resampling Method for Machine Learning Jorge Nocedal Northwestern University NIPS, Montreal 2015 1 Collaborators Figen Oztoprak Stefan Solntsev Richard Byrd 2 Outline 1. How to improve
More informationOptimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade
Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for
More informationBarzilai-Borwein Step Size for Stochastic Gradient Descent
Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong
More informationStochastic Quasi-Newton Methods
Stochastic Quasi-Newton Methods Donald Goldfarb Department of IEOR Columbia University UCLA Distinguished Lecture Series May 17-19, 2016 1 / 35 Outline Stochastic Approximation Stochastic Gradient Descent
More informationAccelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization
Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Lin Xiao (Microsoft Research) Joint work with Qihang Lin (CMU), Zhaosong Lu (Simon Fraser) Yuchen Zhang (UC Berkeley)
More informationSub-Sampled Newton Methods
Sub-Sampled Newton Methods F. Roosta-Khorasani and M. W. Mahoney ICSI and Dept of Statistics, UC Berkeley February 2016 F. Roosta-Khorasani and M. W. Mahoney (UCB) Sub-Sampled Newton Methods Feb 2016 1
More informationLogistic Regression. Stochastic Gradient Descent
Tutorial 8 CPSC 340 Logistic Regression Stochastic Gradient Descent Logistic Regression Model A discriminative probabilistic model for classification e.g. spam filtering Let x R d be input and y { 1, 1}
More informationStochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure
Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,
More informationAdaptive Probabilities in Stochastic Optimization Algorithms
Research Collection Master Thesis Adaptive Probabilities in Stochastic Optimization Algorithms Author(s): Zhong, Lei Publication Date: 2015 Permanent Link: https://doi.org/10.3929/ethz-a-010421465 Rights
More informationAsynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization
Proceedings of the hirty-first AAAI Conference on Artificial Intelligence (AAAI-7) Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization Zhouyuan Huo Dept. of Computer
More informationMinimizing Finite Sums with the Stochastic Average Gradient Algorithm
Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale
More informationAccelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization
Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Jinghui Chen Department of Systems and Information Engineering University of Virginia Quanquan Gu
More informationLecture 3: Minimizing Large Sums. Peter Richtárik
Lecture 3: Minimizing Large Sums Peter Richtárik Graduate School in Systems, Op@miza@on, Control and Networks Belgium 2015 Mo@va@on: Machine Learning & Empirical Risk Minimiza@on Training Linear Predictors
More informationOptimization for Machine Learning
Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS
More informationFaster Machine Learning via Low-Precision Communication & Computation. Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich)
Faster Machine Learning via Low-Precision Communication & Computation Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich) 2 How many bits do you need to represent a single number in machine
More informationLarge Scale Semi-supervised Linear SVMs. University of Chicago
Large Scale Semi-supervised Linear SVMs Vikas Sindhwani and Sathiya Keerthi University of Chicago SIGIR 2006 Semi-supervised Learning (SSL) Motivation Setting Categorize x-billion documents into commercial/non-commercial.
More informationNesterov s Acceleration
Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x
More informationTrade-Offs in Distributed Learning and Optimization
Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed
More informationContents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016
ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................
More informationConvex Optimization Algorithms for Machine Learning in 10 Slides
Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,
More informationBlock stochastic gradient update method
Block stochastic gradient update method Yangyang Xu and Wotao Yin IMA, University of Minnesota Department of Mathematics, UCLA November 1, 2015 This work was done while in Rice University 1 / 26 Stochastic
More informationFast Stochastic Optimization Algorithms for ML
Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2
More informationStreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory
StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory S.V. N. (vishy) Vishwanathan Purdue University and Microsoft vishy@purdue.edu October 9, 2012 S.V. N. Vishwanathan (Purdue,
More informationA Distributed Solver for Kernelized SVM
and A Distributed Solver for Kernelized Stanford ICME haoming@stanford.edu bzhe@stanford.edu June 3, 2015 Overview and 1 and 2 3 4 5 Support Vector Machines and A widely used supervised learning model,
More informationMini-Batch Primal and Dual Methods for SVMs
Mini-Batch Primal and Dual Methods for SVMs Peter Richtárik School of Mathematics The University of Edinburgh Coauthors: M. Takáč (Edinburgh), A. Bijral and N. Srebro (both TTI at Chicago) arxiv:1303.2314
More informationBundle CDN: A Highly Parallelized Approach for Large-scale l 1 -regularized Logistic Regression
Bundle : A Highly Parallelized Approach for Large-scale l 1 -regularized Logistic Regression Yatao Bian, Xiong Li, Mingqi Cao, and Yuncai Liu Institute of Image Processing and Pattern Recognition, Shanghai
More informationLock-Free Approaches to Parallelizing Stochastic Gradient Descent
Lock-Free Approaches to Parallelizing Stochastic Gradient Descent Benjamin Recht Department of Computer Sciences University of Wisconsin-Madison with Feng iu Christopher Ré Stephen Wright minimize x f(x)
More informationProximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization
Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R
More informationECS171: Machine Learning
ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass
More informationFinite-sum Composition Optimization via Variance Reduced Gradient Descent
Finite-sum Composition Optimization via Variance Reduced Gradient Descent Xiangru Lian Mengdi Wang Ji Liu University of Rochester Princeton University University of Rochester xiangru@yandex.com mengdiw@princeton.edu
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationDistributed Inexact Newton-type Pursuit for Non-convex Sparse Learning
Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology
More informationComparison of Modern Stochastic Optimization Algorithms
Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,
More informationStochastic Gradient Descent. CS 584: Big Data Analytics
Stochastic Gradient Descent CS 584: Big Data Analytics Gradient Descent Recap Simplest and extremely popular Main Idea: take a step proportional to the negative of the gradient Easy to implement Each iteration
More informationDistributed Stochastic Optimization of Regularized Risk via Saddle-point Problem
Distributed Stochastic Optimization of Regularized Risk via Saddle-point Problem Shin Matsushima 1, Hyokun Yun 2, Xinhua Zhang 3, and S.V.N. Vishwanathan 2,4 1 The University of Tokyo, Tokyo, Japan shin
More informationPrimal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions
Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions Olivier Fercoq and Pascal Bianchi Problem Minimize the convex function
More informationMachine Learning: Chenhao Tan University of Colorado Boulder LECTURE 5
Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 5 Slides adapted from Jordan Boyd-Graber, Tom Mitchell, Ziv Bar-Joseph Machine Learning: Chenhao Tan Boulder 1 of 27 Quiz question For
More informationCS60021: Scalable Data Mining. Large Scale Machine Learning
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance
More informationLinearly-Convergent Stochastic-Gradient Methods
Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure
More informationRandomized Smoothing for Stochastic Optimization
Randomized Smoothing for Stochastic Optimization John Duchi Peter Bartlett Martin Wainwright University of California, Berkeley NIPS Big Learn Workshop, December 2011 Duchi (UC Berkeley) Smoothing and
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 18, 2016 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass
More informationDistributed Machine Learning: A Brief Overview. Dan Alistarh IST Austria
Distributed Machine Learning: A Brief Overview Dan Alistarh IST Austria Background The Machine Learning Cambrian Explosion Key Factors: 1. Large s: Millions of labelled images, thousands of hours of speech
More informationAsaga: Asynchronous Parallel Saga
Rémi Leblond Fabian Pedregosa Simon Lacoste-Julien INRIA - Sierra Project-team INRIA - Sierra Project-team Department of CS & OR (DIRO) École normale supérieure, Paris École normale supérieure, Paris Université
More informationProximal Minimization by Incremental Surrogate Optimization (MISO)
Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine
More informationJournal Club. A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) March 8th, CMAP, Ecole Polytechnique 1/19
Journal Club A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) CMAP, Ecole Polytechnique March 8th, 2018 1/19 Plan 1 Motivations 2 Existing Acceleration Methods 3 Universal
More informationModern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization
Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where
More informationImportance Sampling for Minibatches
Importance Sampling for Minibatches Dominik Csiba School of Mathematics University of Edinburgh 07.09.2016, Birmingham Dominik Csiba (University of Edinburgh) Importance Sampling for Minibatches 07.09.2016,
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms Chapter 6 Structured and Low Rank Matrices Section 6.3 Numerical Optimization Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at
More informationImproved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations
Improved Optimization of Finite Sums with Miniatch Stochastic Variance Reduced Proximal Iterations Jialei Wang University of Chicago Tong Zhang Tencent AI La Astract jialei@uchicago.edu tongzhang@tongzhang-ml.org
More informationCS260: Machine Learning Algorithms
CS260: Machine Learning Algorithms Lecture 4: Stochastic Gradient Descent Cho-Jui Hsieh UCLA Jan 16, 2019 Large-scale Problems Machine learning: usually minimizing the training loss min w { 1 N min w {
More informationOptimization Methods for Machine Learning
Optimization Methods for Machine Learning Sathiya Keerthi Microsoft Talks given at UC Santa Cruz February 21-23, 2017 The slides for the talks will be made available at: http://www.keerthis.com/ Introduction
More informationMotivation Subgradient Method Stochastic Subgradient Method. Convex Optimization. Lecture 15 - Gradient Descent in Machine Learning
Convex Optimization Lecture 15 - Gradient Descent in Machine Learning Instructor: Yuanzhang Xiao University of Hawaii at Manoa Fall 2017 1 / 21 Today s Lecture 1 Motivation 2 Subgradient Method 3 Stochastic
More informationAsynchronous Parallel Computing in Signal Processing and Machine Learning
Asynchronous Parallel Computing in Signal Processing and Machine Learning Wotao Yin (UCLA Math) joint with Zhimin Peng (UCLA), Yangyang Xu (IMA), Ming Yan (MSU) Optimization and Parsimonious Modeling IMA,
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Predictive Models Predictive Models 1 / 34 Outline
More informationarxiv: v2 [stat.ml] 16 Jun 2015
Semi-Stochastic Gradient Descent Methods Jakub Konečný Peter Richtárik arxiv:1312.1666v2 [stat.ml] 16 Jun 2015 School of Mathematics University of Edinburgh United Kingdom June 15, 2015 (first version:
More informationAccelerate Subgradient Methods
Accelerate Subgradient Methods Tianbao Yang Department of Computer Science The University of Iowa Contributors: students Yi Xu, Yan Yan and colleague Qihang Lin Yang (CS@Uiowa) Accelerate Subgradient Methods
More informationLecture 1: Supervised Learning
Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)
More informationStochastic and online algorithms
Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem
More informationLearning in a Distributed and Heterogeneous Environment
Learning in a Distributed and Heterogeneous Environment Martin Jaggi EPFL Machine Learning and Optimization Laboratory mlo.epfl.ch Inria - EPFL joint Workshop - Paris - Feb 15 th Machine Learning Methods
More informationProximal average approximated incremental gradient descent for composite penalty regularized empirical risk minimization
Mach earn 207 06:595 622 DOI 0.007/s0994-06-5609- Proximal average approximated incremental gradient descent for composite penalty regularized empirical risk minimization Yiu-ming Cheung Jian ou Received:
More informationCPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017
CPSC 340: Machine Learning and Data Mining Stochastic Gradient Fall 2017 Assignment 3: Admin Check update thread on Piazza for correct definition of trainndx. This could make your cross-validation code
More informationHOMEWORK #4: LOGISTIC REGRESSION
HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2018 Due: Friday, February 23rd, 2018, 11:55 PM Submit code and report via EEE Dropbox You should submit a
More informationEfficient Learning on Large Data Sets
Department of Computer Science and Engineering Hong Kong University of Science and Technology Hong Kong Joint work with Mu Li, Chonghai Hu, Weike Pan, Bao-liang Lu Chinese Workshop on Machine Learning
More informationProximal Newton Method. Ryan Tibshirani Convex Optimization /36-725
Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h
More informationBarzilai-Borwein Step Size for Stochastic Gradient Descent
Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan Shiqian Ma Yu-Hong Dai Yuqiu Qian May 16, 2016 Abstract One of the major issues in stochastic gradient descent (SGD) methods is how
More informationPASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent
Cho-Jui Hsieh Hsiang-Fu Yu Inderjit S. Dhillon Department of Computer Science, The University of Texas, Austin, TX 787, USA CJHSIEH@CS.UTEXAS.EDU ROFUYU@CS.UTEXAS.EDU INDERJIT@CS.UTEXAS.EDU Abstract Stochastic
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 27, 2015 Outline Linear regression Ridge regression and Lasso Time complexity (closed form solution) Iterative Solvers Regression Input: training
More informationLecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron
CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer
More informationarxiv: v2 [cs.na] 3 Apr 2018
Stochastic variance reduced multiplicative update for nonnegative matrix factorization Hiroyui Kasai April 5, 208 arxiv:70.078v2 [cs.a] 3 Apr 208 Abstract onnegative matrix factorization (MF), a dimensionality
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationarxiv: v1 [cs.lg] 10 Jan 2019
Quantized Epoch-SGD for Communication-Efficient Distributed Learning arxiv:1901.0300v1 [cs.lg] 10 Jan 2019 Shen-Yi Zhao Hao Gao Wu-Jun Li Department of Computer Science and Technology Nanjing University,
More information