Asynchronous Parallel Computing in Signal Processing and Machine Learning
1 Asynchronous Parallel Computing in Signal Processing and Machine Learning. Wotao Yin (UCLA Math), joint with Zhimin Peng (UCLA), Yangyang Xu (IMA), Ming Yan (MSU). Optimization and Parsimonious Modeling, IMA, Jan 25. 1 / 49
2 Do we need parallel computing? 2 / 49
3 Back in / 49
6 35 Years of CPU Trend. [figure: number of CPUs, performance per core, and cores per CPU over time] D. Henty, Emerging Architectures and Programming Models for Parallel Computing. In May 2004, Intel cancelled its Tejas project (single-core) and announced a new multi-core project. 6 / 49
7 Today: 4x AMD 16-core 3.5GHz CPUs (64 cores total) 7 / 49
8 Today: Tesla K80 GPU (2496 cores) 8 / 49
9 Today: Octa-Core Handsets 9 / 49
10 Free lunch was over before 2005: no single-threaded algorithm automatically gets faster anymore. Now, new algorithms must be developed for faster speeds by exploiting problem structures, taking advantage of dataset properties, and using all the cores available. 10 / 49
11 How to use all the cores available? 11 / 49
12 Parallel computing. [diagram: one problem split among agents working over times $t_1, t_2, \dots, t_N$] 12 / 49
13 Parallel speedup. Definition (time is in the wall-clock sense): speedup = serial time / parallel time. Amdahl's Law: with N agents, no overhead, and ρ = percentage of the computation that parallelizes, ideal speedup $= \frac{1}{\rho/N + (1-\rho)}$. [figure: speedup vs. number of processors for several values of ρ (50%, 90%, 95%)] 13 / 49
14 Parallel speedup. Let ε := parallel overhead (startup, synchronization, collection) in the real world. Then actual speedup $= \frac{1}{\rho/N + (1-\rho) + \varepsilon}$. [figures: speedup vs. number of processors for ρ = 50%, 90%, 95%, when ε grows like N and when ε grows like log(N)] 14 / 49
15 Sync-parallel versus async-parallel. [diagram: agents 1-3 with idle gaps under the synchronous schedule; no gaps under the asynchronous one] Synchronous (wait for the slowest); Asynchronous (non-stop, no wait). 15 / 49
16 Async-parallel coordinate updates 16 / 49
17 Fixed point iteration and its parallel version. Let $H = H_1 \times \cdots \times H_m$. Original iteration: $x^{k+1} = Tx^k =: (I - \eta S)x^k$. All agents run in parallel: agent $i$ updates $x_i^{k+1} \leftarrow T_i(x^k) = x_i^k - \eta S_i(x^k)$, for $i = 1, \dots, m$. Assumptions: 1. coordinate friendliness: cost of $S_i x \approx \frac{1}{m} \times$ cost of $Sx$; 2. synchronization after each iteration. 17 / 49
18 Comparison. [timeline: agents 1-3 over $t_0, t_1, \dots, t_{10}$] Synchronous: new iteration = all agents finish. Asynchronous: new iteration = any agent finishes. 18 / 49
19 ARock¹: Async-parallel coordinate update. $H = H_1 \times \cdots \times H_m$; p agents, possibly $p \ll m$. Each agent randomly picks $i \in \{1, \dots, m\}$ and updates just $x_i$: $x_j^{k+1} = x_j^k$ for $j \neq i$, and $x_i^{k+1} = x_i^k - \eta_k S_i x^{k - d_k}$, where $0 \le d_k \le \tau$, the maximum delay. ¹ Peng-Xu-Yan-Yin. 19 / 49
20 Two ways to model $x^{k-d_k}$. Definitions: let $x^0, \dots, x^k, \dots$ be the states of x in the memory. 1. $x^{k-d_k}$ is consistent if $d_k$ is a scalar. 2. $x^{k-d_k}$ is possibly inconsistent if $d_k$ is a vector and different components are delayed by different amounts. ARock allows both consistent and inconsistent reads. 20 / 49
21 Memory lock illustration. Agent 1 reads $[0, 0, 0, 0]^T = x^0$: consistent read. Agent 1 reads $[0, 0, 0, 2]^T \notin \{x^0, x^1, x^2\}$: inconsistent read. 21 / 49
22 History and recent literature 22 / 49
23 Brief history of async-parallel algorithms (mostly worst-case analysis). 1969: a linear equation solver by Chazan and Miranker; 1978: extended to the fixed-point problem by Baudet under the absolute-contraction² type of assumption. For years, mainly used to solve linear, nonlinear, and differential equations by many people. 1989: Parallel and Distributed Computation: Numerical Methods by Bertsekas and Tsitsiklis. Review by Frommer and Szyld. Gradient-projection iteration assuming a local linear-error bound by Tseng. 2001: domain decomposition assuming strong convexity by Tai & Tseng. ² An operator $T : \mathbb{R}^n \to \mathbb{R}^n$ is absolute-contractive if $|T(x) - T(y)| \le P|x - y|$ component-wise, where $|x|$ denotes the vector with components $|x_i|$, $i = 1, \dots, n$, and $P \in \mathbb{R}^{n \times n}_+$ with $\rho(P) < 1$. 23 / 49
24 Absolute-contraction. An operator $T : \mathbb{R}^n \to \mathbb{R}^n$ is absolute-contractive if $|T(x) - T(y)| \le P|x - y|$ component-wise, where $|x|$ denotes the vector with components $|x_i|$, $i = 1, \dots, n$, and $P \in \mathbb{R}^{n \times n}_+$ with $\rho(P) < 1$. Interpretation: a series of nested rectangular boxes for $x^{k+1} = Tx^k$. Applications: diagonally dominant A for Ax = b; diagonally dominant $\nabla^2 f$ for $\min_x f(x)$ (strong convexity alone is not enough); some network flow problems. 24 / 49
25 Recent work (stochastic analysis). AsySCD for convex smooth and composite minimization by Liu et al. 14 and Liu-Wright 14. Async dual CD (regression problems) by Hsieh et al. 15. Async randomized (splitting/distributed/incremental) methods: Wei-Ozdaglar 13, Iutzeler et al. 13, Zhang-Kwok 14, Hong 14, Chang et al. 15. Async SGD: Hogwild!, Lian 15, etc. Async operator sample and CD: SMART by Davis. 25 / 49
26 Random coordinate selection. Select $x_i$ to update with probability $p_i$, where $\min_i p_i > 0$. Drawbacks: agents cannot cache data, so either global memory or communication is required; pseudo-random number generation takes time. Benefits: often faster than a fixed cyclic order; automatic load balance; simplifies certain analysis. 26 / 49
27 Convergence summary 27 / 49
28 Convergence guarantees. m is the # of coordinates, τ is the maximum delay, uniform selection $p_i \equiv \frac{1}{m}$. Theorem (almost sure convergence): Assume that T is nonexpansive and has a fixed point. Use step sizes $\eta_k \in \big[\epsilon, \frac{1}{2m^{-1/2}\tau + 1}\big)$, $\forall k$. Then, with probability one, $x^k \rightharpoonup x^* \in \mathrm{Fix}\,T$. In addition, rates can be derived. Consequence: the step size is O(1) if $\tau \sim \sqrt{m}$. Under equal agents and updates, linear speedup is attained when using $p = O(\sqrt{m})$ agents; p can be bigger if T is sparse. 28 / 49
29 Sketch of proof. Typical inequality: $\|x^{k+1} - x^*\|^2 \le \|x^k - x^*\|^2 - c\,\|Tx^k - x^k\|^2 + \text{harmful terms}(x^{k-1}, \dots, x^{k-\tau})$. Descent inequality under a new metric: $\mathbb{E}\big(\|\mathbf{x}^{k+1} - \mathbf{x}^*\|_M^2 \,\big|\, \mathcal{X}^k\big) \le \|\mathbf{x}^k - \mathbf{x}^*\|_M^2 - c\,\|Tx^k - x^k\|^2$, where the history up to iteration k is $\mathbf{x}^k = (x^k, x^{k-1}, \dots, x^{k-\tau}) \in H^{\tau+1}$, $k \ge 0$; any $\mathbf{x}^* = (x^*, x^*, \dots, x^*) \in X^* \subseteq H^{\tau+1}$; M is a positive definite matrix; $c = c(\eta_k, m, \tau)$. 29 / 49
30 Apply the Robbins-Siegmund theorem: if $\mathbb{E}(\alpha_{k+1} \mid \mathcal{F}_k) + v_k \le (1 + \xi_k)\alpha_k + \eta_k$, where all quantities are nonnegative, $\alpha_k$ is random, and $\xi_k, \eta_k$ are summable, then $\alpha_k$ converges a.s. Prove that weak cluster points are fixed points; assume H is separable and apply results of [Combettes, Pesquet 2014]. 30 / 49
31 Applications and numerical results 31 / 49
32 Linear equations (asynchronous Jacobi). Require: an invertible square matrix A with nonzero diagonal entries. Let D be the diagonal part of A; then $Ax = b \iff \underbrace{(I - D^{-1}A)x + D^{-1}b}_{Tx} = x$. T is nonexpansive if $\|I - D^{-1}A\|_2 \le 1$, i.e., A is diagonally dominant. $x^{k+1} = Tx^k$ recovers the Jacobi algorithm. 32 / 49
33 Algorithm 1: ARock for linear equations.
Input: shared variables $x \in \mathbb{R}^n$, K > 0; set global iteration counter k = 0.
While k < K, every agent asynchronously and continuously does:
  sample $i \in \{1, \dots, m\}$ uniformly at random;
  add $-\frac{\eta_k}{a_{ii}}\big(\sum_j a_{ij}\hat{x}^k_j - b_i\big)$ to shared variable $x_i$;
  update the global counter $k \leftarrow k + 1$.
33 / 49
34 Sample code

  loaddata(A, data_file_name);
  loaddata(b, label_file_name);
  #pragma omp parallel num_threads(p) shared(A, b, x, para)
  {
    // A, b, x, and para are passed by reference
    call Jacobi(A, b, x, para) or ARock(A, b, x, para);
  }

p: the number of threads; A, b, x: shared variables; para: other parameters. 34 / 49
35 Jacobi worker function

  for (int itr = 0; itr < max_itr; itr++) {
    // compute the update for the assigned x[i]
    // ...
    #pragma omp barrier
    {
      // write x[i] in global memory
    }
    #pragma omp barrier
  }

Jacobi needs the barrier directive for synchronization. 35 / 49
36 ARock worker function

  for (int itr = 0; itr < max_itr; itr++) {
    // pick i at random
    // compute the update for x[i]
    // ...
    // write x[i] in global memory
  }

ARock has no synchronization barrier directive. 36 / 49
37 Minimizing smooth functions. Require: a convex and Lipschitz differentiable function f. If $\nabla f$ is L-Lipschitz, then minimizing f(x) over x is equivalent to $x = \underbrace{\big(I - \tfrac{2}{L}\nabla f\big)}_{T}\,x$, where T is nonexpansive. ARock will be efficient when $\nabla_i f(x)$ is easy to compute. 37 / 49
38 Minimizing composite functions. Require: convex smooth g(·) and convex (possibly nonsmooth) f(·). Proximal map: $\mathrm{prox}_{\gamma f}(y) = \arg\min_x f(x) + \frac{1}{2\gamma}\|x - y\|^2$. Minimizing f(x) + g(x) over x is equivalent to $x = \underbrace{\mathrm{prox}_{\gamma f}(I - \gamma \nabla g)}_{T}\,x$. ARock will be fast if $\nabla_i g(x)$ is easy to compute and f(·) is separable (e.g., $\ell_1$ and $\ell_{1,2}$). 38 / 49
39 Example: sparse logistic regression. n features, N labeled samples; each sample $a_i \in \mathbb{R}^n$ has its label $b_i \in \{1, -1\}$. $\ell_1$-regularized logistic regression:
$\underset{x \in \mathbb{R}^n}{\text{minimize}}\ \lambda \|x\|_1 + \frac{1}{N}\sum_{i=1}^N \log\big(1 + \exp(-b_i a_i^T x)\big), \quad (1)$
Compare sync-parallel and ARock (async-parallel) on two datasets:
Name   | N (# samples) | n (# features) | # nonzeros in {a_1, ..., a_N}
rcv1   | 20,242        | 47,236         | 1,498,952
news20 | 19,996        | 1,355,191      | 9,097,
39 / 49
40 Speedup tests. Implemented in C++ and OpenMP; 32-core shared-memory machine. [table: wall-clock time (s) and speedup of async (ARock) vs. sync for 1-32 cores on rcv1 and news20; the numeric entries did not survive transcription] 40 / 49
41 More applications 41 / 49
42 Minimizing composite functions. Require: both f and g are convex (possibly nonsmooth) functions. Minimizing f(x) + g(x) is equivalent to $z = \underbrace{\mathrm{refl}_{\gamma f} \circ \mathrm{refl}_{\gamma g}}_{T_{PRS}}(z)$; recover $x = \mathrm{prox}_{\gamma g}(z)$. $T_{PRS}$ is known as the Peaceman-Rachford splitting operator³; the Douglas-Rachford splitting operator is $\frac{1}{2}I + \frac{1}{2}T_{PRS}$. ARock runs fast when $\mathrm{refl}_{\gamma f}$ is separable and $(\mathrm{refl}_{\gamma g})_i$ is easy to compute/maintain. ³ Reflective proximal map: $\mathrm{refl}_{\gamma f} := 2\,\mathrm{prox}_{\gamma f} - I$. The maps $\mathrm{refl}_{\gamma f}$, $\mathrm{refl}_{\gamma g}$, and thus $\mathrm{refl}_{\gamma f} \circ \mathrm{refl}_{\gamma g}$ are nonexpansive. 42 / 49
43 Parallel/distributed ADMM. Require: m convex functions $f_i$. Consensus problem: $\underset{x}{\text{minimize}}\ \sum_{i=1}^m f_i(x) + g(x)$, reformulated as $\underset{x_i, y}{\text{minimize}}\ \sum_{i=1}^m f_i(x_i) + g(y)$ subject to $x_i - y = 0$, $i = 1, \dots, m$ (stacked as a block matrix of identity blocks acting on $(x_1, \dots, x_m, y)$). Apply Douglas-Rachford-ARock to the dual problem to obtain async-parallel ADMM: the m subproblems are solved in the async-parallel fashion; y and the $z_i$ (dual variables) are updated in global memory (no lock). 43 / 49
44 Decentralized computing. n agents in a connected network G = (V, E) with bi-directional links E; each agent i has a private function $f_i$. Problem: find a consensus solution x: $\underset{x \in \mathbb{R}^p}{\text{minimize}}\ f(x) := \sum_{i=1}^n f_i(x_i)$ subject to $x_i = x_j,\ \forall i, j$. Challenges: no center, only between-neighbor communication. Benefits: fault tolerance, no long-distance communication, privacy. 44 / 49
45 Async-parallel decentralized ADMM. A graph of connected agents: G = (V, E). Decentralized consensus optimization problem: $\underset{x_i \in \mathbb{R}^d, i \in V}{\text{minimize}}\ f(x) := \sum_{i \in V} f_i(x_i)$ subject to $x_i = x_j,\ \forall (i, j) \in E$. ADMM reformulation: constraints $x_i = y_{ij}$, $x_j = y_{ij}$, $\forall (i, j) \in E$. Apply ARock: version 1, nodes asynchronously activate; version 2, edges asynchronously activate. No global clock, no central controller; each agent keeps $f_i$ private and talks only to its neighbors. 45 / 49
46 Notation: E(i), all edges of agent i, with $E(i) = L(i) \cup R(i)$; L(i), neighbors j of agent i with j < i; R(i), neighbors j of agent i with j > i.
Algorithm 2: ARock for the decentralized consensus problem.
Input: each agent i sets $x_i^0 \in \mathbb{R}^d$, dual variables $z_{e,i}^0$ for $e \in E(i)$, K > 0.
While k < K, any activated agent i does:
  receive $\hat{z}^k_{li,l}$ from neighbors $l \in L(i)$ and $\hat{z}^k_{ir,r}$ from neighbors $r \in R(i)$;
  update local $\hat{x}_i^k$, $z^{k+1}_{li,i}$, and $z^{k+1}_{ir,i}$ according to (2a)-(2c), respectively;
  send $z^{k+1}_{li,i}$ to neighbors $l \in L(i)$ and $z^{k+1}_{ir,i}$ to neighbors $r \in R(i)$.
$\hat{x}_i^k \leftarrow \arg\min_{x_i} f_i(x_i) + \big(\textstyle\sum_{l \in L(i)} \hat{z}^k_{li,l} + \sum_{r \in R(i)} \hat{z}^k_{ir,r}\big)^T x_i + \frac{\gamma}{2}|E(i)|\,\|x_i\|^2, \quad \forall i \quad (2a)$
$z^{k+1}_{ir,i} = z^k_{ir,i} - \eta_k\big((\hat{z}^k_{ir,i} + \hat{z}^k_{ir,r})/2 + \gamma \hat{x}^k_i\big), \quad r \in R(i), \quad (2b)$
$z^{k+1}_{li,i} = z^k_{li,i} - \eta_k\big((\hat{z}^k_{li,i} + \hat{z}^k_{li,l})/2 + \gamma \hat{x}^k_i\big), \quad l \in L(i). \quad (2c)$
46 / 49
47 Summary 47 / 49
48 Summary of async-parallel coordinate descent. Benefits: eliminate idle time; reduce communication/memory-access congestion; random job selection gives load balance. Mathematics: analysis of disorderly (partial) updates, asynchronous delay, inconsistent reads and writes. 48 / 49
49 Thank you! Acknowledgements: NSF DMS Reference: Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin. UCLA CAM Website: wotaoyin/arock 49 / 49
ADMM and Fast Gradient Methods for Distributed Optimization João Xavier Instituto Sistemas e Robótica (ISR), Instituto Superior Técnico (IST) European Control Conference, ECC 13 July 16, 013 Joint work
More informationStochastic Quasi-Newton Methods
Stochastic Quasi-Newton Methods Donald Goldfarb Department of IEOR Columbia University UCLA Distinguished Lecture Series May 17-19, 2016 1 / 35 Outline Stochastic Approximation Stochastic Gradient Descent
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationDistributed Optimization and Statistics via Alternating Direction Method of Multipliers
Distributed Optimization and Statistics via Alternating Direction Method of Multipliers Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato Stanford University Stanford Statistics Seminar, September 2010
More informationWE consider an undirected, connected network of n
On Nonconvex Decentralized Gradient Descent Jinshan Zeng and Wotao Yin Abstract Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been
More informationPreconditioning via Diagonal Scaling
Preconditioning via Diagonal Scaling Reza Takapoui Hamid Javadi June 4, 2014 1 Introduction Interior point methods solve small to medium sized problems to high accuracy in a reasonable amount of time.
More informationImportance Sampling for Minibatches
Importance Sampling for Minibatches Dominik Csiba School of Mathematics University of Edinburgh 07.09.2016, Birmingham Dominik Csiba (University of Edinburgh) Importance Sampling for Minibatches 07.09.2016,
More informationDLM: Decentralized Linearized Alternating Direction Method of Multipliers
1 DLM: Decentralized Linearized Alternating Direction Method of Multipliers Qing Ling, Wei Shi, Gang Wu, and Alejandro Ribeiro Abstract This paper develops the Decentralized Linearized Alternating Direction
More informationDistributed Optimization via Alternating Direction Method of Multipliers
Distributed Optimization via Alternating Direction Method of Multipliers Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato Stanford University ITMANET, Stanford, January 2011 Outline precursors dual decomposition
More informationCYCLIC COORDINATE-UPDATE ALGORITHMS FOR FIXED-POINT PROBLEMS: ANALYSIS AND APPLICATIONS
SIAM J. SCI. COMPUT. Vol. 39, No. 4, pp. A80 A300 CYCLIC COORDINATE-UPDATE ALGORITHMS FOR FIXED-POINT PROBLEMS: ANALYSIS AND APPLICATIONS YAT TIN CHOW, TIANYU WU, AND WOTAO YIN Abstract. Many problems
More informationOn the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,
Math 30 Winter 05 Solution to Homework 3. Recognizing the convexity of g(x) := x log x, from Jensen s inequality we get d(x) n x + + x n n log x + + x n n where the equality is attained only at x = (/n,...,
More informationAsynchronous Parallel Algorithms for Nonconvex Big-Data Optimization Part I: Model and Convergence
Noname manuscript No. (will be inserted by the editor) Asynchronous Parallel Algorithms for Nonconvex Big-Data Optimization Part I: Model and Convergence Loris Cannelli Francisco Facchinei Vyacheslav Kungurtsev
More informationConvex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013
Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for
More informationSEMI-SMOOTH SECOND-ORDER TYPE METHODS FOR COMPOSITE CONVEX PROGRAMS
SEMI-SMOOTH SECOND-ORDER TYPE METHODS FOR COMPOSITE CONVEX PROGRAMS XIANTAO XIAO, YONGFENG LI, ZAIWEN WEN, AND LIWEI ZHANG Abstract. The goal of this paper is to study approaches to bridge the gap between
More informationFirst-order methods for structured nonsmooth optimization
First-order methods for structured nonsmooth optimization Sangwoon Yun Department of Mathematics Education Sungkyunkwan University Oct 19, 2016 Center for Mathematical Analysis & Computation, Yonsei University
More informationSubgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725
Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:
More informationEE364b Convex Optimization II May 30 June 2, Final exam
EE364b Convex Optimization II May 30 June 2, 2014 Prof. S. Boyd Final exam By now, you know how it works, so we won t repeat it here. (If not, see the instructions for the EE364a final exam.) Since you
More informationAlternating Direction Method of Multipliers. Ryan Tibshirani Convex Optimization
Alternating Direction Method of Multipliers Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last time: dual ascent min x f(x) subject to Ax = b where f is strictly convex and closed. Denote
More informationOptimization methods
Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to
More informationA Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization
A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization Panos Parpas Department of Computing Imperial College London www.doc.ic.ac.uk/ pp500 p.parpas@imperial.ac.uk jointly with D.V.
More informationTrade-Offs in Distributed Learning and Optimization
Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed
More informationON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS
ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS WEI DENG AND WOTAO YIN Abstract. The formulation min x,y f(x) + g(y) subject to Ax + By = b arises in
More informationAccelerating Nesterov s Method for Strongly Convex Functions
Accelerating Nesterov s Method for Strongly Convex Functions Hao Chen Xiangrui Meng MATH301, 2011 Outline The Gap 1 The Gap 2 3 Outline The Gap 1 The Gap 2 3 Our talk begins with a tiny gap For any x 0
More information