ARock: an algorithmic framework for asynchronous parallel coordinate updates


ARock: an algorithmic framework for asynchronous parallel coordinate updates. Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin (UCLA Math, U. Waterloo DCO). UCLA CAM Report 15-37. ShanghaiTech SSDS 15, June 25, 2015. 1 / 40

Background 2 / 40

Serial computing: (diagram) a single CPU processes tasks t_1, t_2, ..., t_N one after another. 3 / 40

Parallel computing: (diagram) several CPUs process tasks t_1, t_2, ..., t_N at the same time. 4 / 40

Sync-parallel versus async-parallel: (diagram of three agents) Synchronous: agents sit idle because a new iteration starts only after the last agent finishes. Asynchronous: all agents run non-stop. 5 / 40

ARock: an algorithmic framework of async-parallel coordinate updates 6 / 40

The fixed-point problem. Hilbert space H, operator T : H → H. Find x ∈ H such that x = Tx. Equivalent problem: let S := I - T; find x ∈ H such that 0 = Sx. This abstraction covers many problems: convex optimization; statistical regression; optimal control; linear and nonlinear systems of equations; ordinary and partial differential equations. 7 / 40

Krasnosel'skii-Mann (KM) iteration. Require: a nonexpansive operator T, that is, ‖Tx - Ty‖ ≤ ‖x - y‖ for all x, y ∈ H. Iteration: x^{k+1} = (1 - λ)x^k + λTx^k, or in the equivalent form with S = I - T: x^{k+1} = x^k - λSx^k. Special cases: gradient descent, the proximal-point algorithm, and many operator-splitting algorithms such as Douglas-Rachford and ADMM. 8 / 40
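To make the iteration concrete, here is a minimal Python sketch (not part of the slides; the operator T, its size, and the iteration count are made up for illustration) that applies x^{k+1} = (1 - λ)x^k + λTx^k to a toy nonexpansive affine map:

    import numpy as np

    def km_iterate(T, x0, lam=0.5, iters=300):
        """Krasnosel'skii-Mann iteration: x <- (1 - lam) * x + lam * T(x)."""
        x = np.asarray(x0, dtype=float).copy()
        for _ in range(iters):
            x = (1.0 - lam) * x + lam * T(x)
        return x

    # Toy nonexpansive map: T(x) = Q x + b with ||Q||_2 = 0.9 < 1 (a scaled rotation),
    # so T has a unique fixed point.
    theta = 0.3
    Q = 0.9 * np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta),  np.cos(theta)]])
    b = np.array([1.0, -2.0])
    T = lambda x: Q @ x + b

    x_fix = km_iterate(T, np.zeros(2))
    print(np.linalg.norm(T(x_fix) - x_fix))   # fixed-point residual, close to 0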

Parallel coordinate update. Suppose H = H_1 × ... × H_m and there are m agents in total (workstations, CPUs, cores). Agent i updates x_i ∈ H_i in parallel: x_i^{k+1} = x_i^k - η_k (Sx^k)_i, for i = 1, ..., m. Require: each (Sx)_i is much easier to compute than Sx (otherwise, parallel computing does not save time). 9 / 40

ARock: async-parallel coordinate KM. Suppose H = H_1 × ... × H_m and there are p agents in total; each agent randomly picks i ∈ {1, ..., m} and updates only that coordinate: x_i^{k+1} = x_i^k - η_k (Sx̂^k)_i, while x_j^{k+1} = x_j^k for j ≠ i. Here x̂^k is the result of reading x from global memory, and x^k is the state of x in global memory right before it is updated. (Diagram: agents 1-3 read and write asynchronously.) 10 / 40
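A sequential Python simulation of this update rule is sketched below; it is only an illustration (real ARock runs agents in parallel on shared memory), and the operator S, step size, and delay model are assumptions made for the example. A snapshot x̂ is drawn from a short history of past states to mimic reading an outdated x.

    import numpy as np

    def arock_simulated(S, x0, eta=0.5, iters=2000, max_delay=5, rng=None):
        """Sequentially simulate x_i <- x_i - eta * (S(x_hat))_i with a delayed x_hat."""
        rng = rng if rng is not None else np.random.default_rng(0)
        m = len(x0)
        x = np.asarray(x0, dtype=float).copy()
        history = [x.copy()]                 # past states of x in "global memory"
        for _ in range(iters):
            delay = int(rng.integers(0, min(max_delay, len(history) - 1) + 1))
            x_hat = history[-1 - delay]      # possibly stale snapshot of x
            i = int(rng.integers(m))         # random coordinate
            x[i] -= eta * S(x_hat)[i]
            history.append(x.copy())
        return x

    # Toy usage with T(x) = 0.5 * x + c, so S = I - T and the fixed point is 2 * c.
    c = np.array([1.0, 2.0, 3.0])
    S = lambda x: x - (0.5 * x + c)
    print(arock_simulated(S, np.zeros(3)))   # approaches [2, 4, 6]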

Random coordinate selection. Each coordinate x_i is selected with probability p_i, where min_i p_i > 0. Cost of randomness: agents cannot cache data; global memory is required (with exceptions). Benefits of randomness: it enforces the update frequencies p_i (even if the agents have different speeds and the coordinates have different complexities), giving automatic load balance, and it breaks patterns, which is often faster than a fixed cyclic order. 11 / 40
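As a trivial illustration, non-uniform coordinate selection can be realized with a weighted random draw (the probabilities below are made up for the example):

    import numpy as np

    rng = np.random.default_rng()
    p = np.array([0.4, 0.3, 0.2, 0.1])   # selection probabilities with min_i p_i > 0
    i = rng.choice(len(p), p=p)          # index of the coordinate to update next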

Applications and numerical results 12 / 40

Linear equations (asynchronous Jacobi). Require: an invertible square matrix A with nonzero diagonal entries. Let D be the diagonal part of A; then Ax = b ⟺ (I - D^{-1}A)x + D^{-1}b = x =: Tx. T is nonexpansive if ‖I - D^{-1}A‖_2 ≤ 1, i.e., when A is diagonally dominant. The iteration x^{k+1} = Tx^k recovers the Jacobi algorithm. 13 / 40

Algorithm 1: ARock for linear equations
Input: shared variable x ∈ R^n, K > 0; set the global iteration counter k = 0;
while k < K, every agent asynchronously and continuously do
    sample i ∈ {1, ..., m} uniformly at random;
    add -(η_k / a_ii)(Σ_j a_ij x̂_j^k - b_i) to the shared variable x_i;
    update the global counter k ← k + 1;
14 / 40
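A Python sketch of Algorithm 1 with threads sharing x without locks follows. It only illustrates the structure: Python's global interpreter lock prevents genuine parallel speedup, the element-wise write is not a single atomic instruction, and the matrix, right-hand side, and step size are made-up test data rather than the datasets used in the experiments.

    import threading
    import numpy as np

    n, num_agents, updates_per_agent = 200, 4, 20000
    rng = np.random.default_rng(1)

    # Made-up diagonally dominant system, so the Jacobi map T = I - D^{-1} A is nonexpansive.
    A = rng.standard_normal((n, n)) + n * np.eye(n)
    b = rng.standard_normal(n)
    x = np.zeros(n)        # shared iterate, read and written without any lock
    eta = 0.9

    def agent(seed):
        local_rng = np.random.default_rng(seed)
        for _ in range(updates_per_agent):
            i = int(local_rng.integers(n))   # random coordinate
            x_hat = x.copy()                 # possibly inconsistent snapshot of x
            # coordinate write (not truly atomic in Python; fine for this illustration)
            x[i] -= eta / A[i, i] * (A[i] @ x_hat - b[i])

    threads = [threading.Thread(target=agent, args=(s,)) for s in range(num_agents)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print(np.linalg.norm(A @ x - b))   # residual; should be tiny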

Numerical comparison. Problem: solve Ax = b, where b ∈ R^n and A ∈ R^{n×n} are taken from two datasets:

    Name         Type     Size (n)     Bandwidth (w)
    Dataset I    sparse   1,000,000    5
    Dataset II   dense    5,000        N/A

We compare ARock (async) and Jacobi (sync) running on 1, 2, 4, ..., 32 cores of a workstation. 15 / 40

Residual-vs-time plot. (Two panels: residual versus time (s) for async and sync on 1-32 cores; left: size 1 million, bandwidth 5, sparse A, 100 epochs; right: size 5,000, dense A, 50 epochs.) ARock (async) and Jacobi (sync) both show almost linear speedup, but ARock is much faster due to asynchronicity and its Gauss-Seidel kind of efficiency (next slide). 16 / 40

Residual-vs-epoch plot. (Two panels: residual versus number of epochs for async on 1-32 cores, sync Jacobi, and Gauss-Seidel; left: sparse A, right: dense A.) ARock matches Gauss-Seidel's epoch efficiency. 17 / 40

Minimizing smooth functions. Require: a convex function f with L-Lipschitz gradient. Then minimize_x f(x) ⟺ x = (I - (2/L)∇f)x =: Tx, where T is nonexpansive. ARock is very fast when each ∇_{x_i} f(x) is easy to compute. 18 / 40
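A small sketch (with a made-up quadratic f) of the resulting fixed-point iteration; with KM weight λ = 1/2, the averaged step x ← (1/2)x + (1/2)Tx is exactly gradient descent with step size 1/L.

    import numpy as np

    # Made-up quadratic f(x) = 0.5 x^T P x - q^T x; grad f(x) = P x - q is L-Lipschitz with L = ||P||_2.
    rng = np.random.default_rng(2)
    M = rng.standard_normal((5, 5))
    P = M @ M.T + np.eye(5)
    q = rng.standard_normal(5)
    L = np.linalg.norm(P, 2)

    T = lambda x: x - (2.0 / L) * (P @ x - q)   # nonexpansive since f is convex with L-Lipschitz gradient

    x = np.zeros(5)
    for _ in range(500):
        x = 0.5 * x + 0.5 * T(x)                # KM averaging; equals gradient descent with step 1/L
    print(np.linalg.norm(P @ x - q))            # gradient norm at x, close to 0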

Minimizing composite functions. Require: a convex smooth function g and a convex (possibly nonsmooth) function f. Proximal map: prox_{γf}(y) = argmin_x f(x) + (1/(2γ))‖x - y‖². Then minimize_x f(x) + g(x) ⟺ x = prox_{γf}(I - γ∇g)x =: Tx. ARock is very fast given easy-to-compute ∇_{x_i} g(x) and an f that is either separable or has an easy-to-compute proximal map (e.g., ℓ_1 and ℓ_{1,2}). 19 / 40

Example: sparse logistic regression. n features, N labeled samples; each sample a_i ∈ R^n has a label b_i ∈ {1, -1}. ℓ_1-regularized logistic regression:

    minimize_{x ∈ R^n}  λ‖x‖_1 + (1/N) Σ_{i=1}^N log(1 + exp(-b_i a_i^T x)).   (1)

We compare sync-parallel and ARock (async-parallel) on two datasets:

    Name     N (# samples)   n (# features)   # nonzeros in {a_1, ..., a_N}
    rcv1     20,242          47,236           1,498,952
    news20   19,996          1,355,191        9,097,916

20 / 40
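The forward-backward coordinate update behind problem (1) can be sketched on synthetic data as follows (this is not the paper's implementation; the data, step sizes, and iteration count are illustrative assumptions). Each step takes a partial gradient of the logistic loss in a random coordinate and then soft-thresholds, i.e., it applies (Sx)_i = x_i - (Tx)_i with T = prox_{γf}(I - γ∇g).

    import numpy as np

    rng = np.random.default_rng(3)
    N, n, lam = 200, 50, 0.1
    A = rng.standard_normal((N, n))                  # rows are the samples a_i
    b = np.sign(A @ rng.standard_normal(n))          # labels b_i in {-1, +1}
    L = np.linalg.norm(A, 2) ** 2 / (4 * N)          # Lipschitz constant of the smooth part
    gamma, eta = 1.0 / L, 0.8

    def grad_i(x, i):
        """i-th partial derivative of (1/N) sum_j log(1 + exp(-b_j a_j^T x))."""
        sig = 1.0 / (1.0 + np.exp(b * (A @ x)))      # = exp(-b a^T x) / (1 + exp(-b a^T x))
        return np.mean(-b * sig * A[:, i])

    soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    x = np.zeros(n)
    for _ in range(20000):
        i = int(rng.integers(n))                              # random coordinate
        ti = soft(x[i] - gamma * grad_i(x, i), gamma * lam)   # (T x)_i, forward-backward step
        x[i] -= eta * (x[i] - ti)                             # x_i <- x_i - eta * (S x)_i
    obj = lam * np.abs(x).sum() + np.mean(np.log1p(np.exp(-b * (A @ x))))
    print("objective:", obj, " nonzeros:", int(np.count_nonzero(x)))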

Speedup tests.

                 rcv1                              news20
    #cores   Time (s)        Speedup          Time (s)         Speedup
             async   sync    async   sync     async   sync     async   sync
    1        122.0   122.0   1.0     1.0      591.1   591.3    1.0     1.0
    2        63.4    104.1   1.9     1.2      304.2   590.1    1.9     1.0
    4        32.7    83.7    3.7     1.5      150.4   557.0    3.9     1.1
    8        16.8    63.4    7.3     1.9      78.3    525.1    7.5     1.1
    16       9.1     45.4    13.5    2.7      41.6    493.2    14.2    1.2
    32       4.9     30.3    24.6    4.0      22.6    455.2    26.1    1.3

Reasons for sync's poor speedup: load imbalance (next slide). As more cores are used in parallel, it becomes more likely that one of them handles a coordinate corresponding to a large number of nonzeros in the samples, and before each new iteration all cores wait for the last core to finish. ARock (async) has nearly linear speedup and is not affected by load imbalance. 21 / 40

Sparsity pattern and load imbalance. (Two scatter plots of # nonzeros per coordinate, each coordinate grouping about 50 features; left: rcv1, right: news20.) Each dot gives the number of nonzeros in one coordinate. Left: the range of # nonzeros is 10^2 to 10^4; right: the range is 10^{1.8} to 10^5. A larger ratio means worse load balance. 22 / 40

More applications 23 / 40

Minimizing composite functions. Require: both f and g are convex (possibly nonsmooth) functions. Reflective proximal map: refl_{γf} := 2 prox_{γf} - I. The maps refl_{γf}, refl_{γg}, and thus T_PRS := refl_{γf} ∘ refl_{γg}, are nonexpansive. Then minimize_x f(x) + g(x) ⟺ z = T_PRS(z), with x = prox_{γg}(z). T_PRS is known as the Peaceman-Rachford splitting operator; ARock also works with the Douglas-Rachford splitting operator (1/2)I + (1/2)T_PRS. ARock is very fast given a separable refl_{γf} and easy-to-compute (refl_{γg})_i. 24 / 40
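A small serial sketch of this splitting, with made-up closed-form proximal maps f = λ‖·‖_1 and g = (1/2)‖· - c‖²; since the Douglas-Rachford operator (1/2)I + (1/2)T_PRS is averaged, plain iteration on z converges, and the solution is recovered as x = prox_{γg}(z).

    import numpy as np

    gamma, lam = 1.0, 0.5
    c = np.array([2.0, -0.3, 0.1, -4.0])

    prox_f = lambda y: np.sign(y) * np.maximum(np.abs(y) - gamma * lam, 0.0)  # prox of gamma * lam * ||.||_1
    prox_g = lambda y: (y + gamma * c) / (1.0 + gamma)                        # prox of gamma * 0.5 * ||. - c||^2
    refl = lambda prox: (lambda y: 2.0 * prox(y) - y)                         # reflective proximal map
    T_prs = lambda z: refl(prox_f)(refl(prox_g)(z))                           # Peaceman-Rachford operator

    z = np.zeros_like(c)
    for _ in range(300):
        z = 0.5 * z + 0.5 * T_prs(z)     # Douglas-Rachford iteration: (1/2) I + (1/2) T_PRS
    x = prox_g(z)                        # recover the primal solution
    print(x)                             # soft-thresholding of c at lam, i.e. about [1.5, 0, 0, -3.5]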

Parallel/distributed ADMM. Require: m convex (possibly nonsmooth) functions f_i. Consensus problem: minimize_x Σ_{i=1}^m f_i(x) + g(x), reformulated as minimize_{x_i, y} Σ_{i=1}^m f_i(x_i) + g(y) subject to x_i - y = 0 for i = 1, ..., m. Applying Douglas-Rachford ARock to the dual problem gives an async-parallel ADMM: the m f_i-subproblems are solved in an async-parallel fashion, while y and the dual variables z_i are updated in global memory. 25 / 40
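For orientation, here is a minimal sketch of the standard synchronous consensus ADMM for this reformulation (not the asynchronous Algorithm 2 on the next slide), with the made-up choice f_i(x) = (1/2)(x - d_i)² and g = 0, so every subproblem has a closed form and the consensus solution is the average of the d_i.

    import numpy as np

    m, rho, iters = 5, 1.0, 100
    d = np.array([1.0, 4.0, -2.0, 0.5, 3.0])   # made-up local data d_i

    x = np.zeros(m)      # local copies x_i
    u = np.zeros(m)      # scaled dual variables
    y = 0.0              # shared consensus variable

    for _ in range(iters):
        # x_i-subproblems (closed form for quadratic f_i); these are the pieces that
        # an async-parallel scheme would let agents solve independently.
        x = (d + rho * (y - u)) / (1.0 + rho)
        y = np.mean(x + u)               # shared-variable update
        u = u + x - y                    # dual updates

    print(y, d.mean())   # y approaches mean(d)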

Algorithm 2: ARock (async-parallel ADMM) for consensus optimization
Input: shared variables y^0, z_i^0 for all i, and K > 0
while k < K, every agent asynchronously and continuously do
    sample i from {1, ..., m} with equal probability;
    locally compute (ŵ_{d_g}^k)_i, x̂_i^k, and (ŵ_{d_f}^k)_i by (2a)-(2c), respectively;
    update the global z_i^{k+1} and ŷ^{k+1} by (3a) and (3b), respectively;
    update the global counter k ← k + 1;
Local computation:
    (ŵ_{d_g}^k)_i = ẑ_i^k + γŷ^k,                                                  (2a)
    x̂_i^k = argmin_{x_i} f_i(x_i) - ⟨2(ŵ_{d_g}^k)_i - ẑ_i^k, x_i⟩ + (γ/2)‖x_i‖²,    (2b)
    (ŵ_{d_f}^k)_i = 2(ŵ_{d_g}^k)_i - ẑ_i^k - γx̂_i^k.                                (2c)
Global update:
    z_i^{k+1} = z_i^k + η_k ((ŵ_{d_f}^k)_i - (ŵ_{d_g}^k)_i),                         (3a)
    ŷ^{k+1} = ŷ^k + (1/(γm)) (ẑ_i^k - ẑ_i^{k+1}).                                   (3b)
26 / 40

Async-parallel decentralized ADMM. A graph of connected agents: G = (V, E). Decentralized consensus optimization problem: minimize_{x_i ∈ R^d, i ∈ V} f(x) := Σ_{i∈V} f_i(x_i) subject to x_i = x_j for all (i, j) ∈ E. ADMM reformulation: constraints x_i = y_{ij}, x_j = y_{ij} for all (i, j) ∈ E. Apply ARock: in version 1, nodes asynchronously activate; in version 2, edges (and the two nodes of each edge) asynchronously activate. In both versions, each agent keeps f_i private and talks only to its neighbors. 27 / 40

Notation: E(i) is the set of all edges of agent i, with E(i) = L(i) ∪ R(i), where L(i) contains the neighbors j of agent i with j < i and R(i) contains the neighbors j with j > i.
Algorithm 3: ARock for the decentralized consensus problem
Input: each agent i sets x_i^0 ∈ R^d, dual variables z_{e,i}^0 for e ∈ E(i), and K > 0.
while k < K, any activated agent i do
    receive ẑ_{li,l}^k from neighbors l ∈ L(i) and ẑ_{ir,r}^k from neighbors r ∈ R(i);
    update the local x̂_i^k, z_{li,i}^{k+1}, and z_{ir,i}^{k+1} according to (4a)-(4c), respectively;
    send z_{li,i}^{k+1} to neighbors l ∈ L(i) and z_{ir,i}^{k+1} to neighbors r ∈ R(i).
Updates:
    x̂_i^k = argmin_{x_i} f_i(x_i) + ⟨Σ_{l∈L(i)} ẑ_{li,l}^k + Σ_{r∈R(i)} ẑ_{ir,r}^k, x_i⟩ + (γ/2)|E(i)| ‖x_i‖²,   (4a)
    z_{ir,i}^{k+1} = z_{ir,i}^k - η_k ((ẑ_{ir,i}^k + ẑ_{ir,r}^k)/2 + γx̂_i^k),  r ∈ R(i),   (4b)
    z_{li,i}^{k+1} = z_{li,i}^k - η_k ((ẑ_{li,i}^k + ẑ_{li,l}^k)/2 + γx̂_i^k),  l ∈ L(i).   (4c)
28 / 40

Literature 29 / 40

Brief history. The first async-parallel algorithm appeared in 1969 for solving linear equations. It was extended to fixed-point problems under absolute-contraction¹ type assumptions. For 20-30 years, the focus was mainly on solving linear, nonlinear, and differential equations. Some recent work solves statistical regression, machine learning, and sensor-network problems.
¹ An operator T : R^n → R^n is Lipschitz contractive if |T(x) - T(y)| ≤ A|x - y| component-wise, where |x| denotes the vector with components |x_i|, i = 1, ..., n, and A ∈ R^{n×n} is a matrix with spectral radius strictly less than 1. 30 / 40

Recent work. Bertsekas-Tsitsiklis '89: async-parallel gradient-projection method. Liu et al. '13: async-parallel stochastic coordinate descent for minimizing convex smooth functions. Liu and Wright '14: async-parallel stochastic proximal coordinate descent for minimizing convex composite objective functions. Hsieh et al. '15: async-parallel implementation of LIBLINEAR (for ℓ_2-regularized empirical risk minimization). Other async-parallel / async-ADMM methods: Wei-Ozdaglar '13, Iutzeler et al. '13, Zhang-Kwok '14, Hong '14, ... 31 / 40

ARock contributions. A framework for nonexpansive operators that have fixed points. Applications: async-parallel algorithms for linear equations, (smooth and nonsmooth) function minimization, distributed and decentralized optimization, ... Similar to recent work, random coordinate updates give automatic load balance. Analysis: almost sure convergence of x^k to x* ∈ Fix T; linear convergence (when S is strongly monotone); fixed step sizes. Open-source C code for reproducible research. 32 / 40

Under the hood 33 / 40

Iteration is redefined. Synchronous: new iteration = all agents finish. Asynchronous: new iteration = any agent finishes. 34 / 40

Reading consistency. Multiple agents simultaneously read and write x in global memory; while an agent reads x into its cache, x might be updated by other agents. Definitions: let x^0, ..., x^k, ... be the states of x in memory. x̂^k is called consistent if x̂^k = x^j for some j ≤ k; x̂^k is called inconsistent if x̂^k ≠ x^j for every j ≤ k. 35 / 40

Reading consistency and memory lock. (Example diagram.) Agent 1 reads [0, 0, 0, 0]^T = x^0: a consistent read. Agent 1 reads [0, 0, 0, 2]^T ∉ {x^0, x^1, x^2}: an inconsistent read. ARock allows inconsistent reads. 36 / 40

Atomic coordinate update. When each coordinate update is atomic (a single CPU instruction), the read of each single coordinate is consistent, that is, x_i^k = x̂_i^k + Σ_{d ∈ J_i(k)} (x_i^{d+1} - x_i^d), where the sum collects the interim changes of x_i. Here x̂_i^k is the result of the read, x_i^k is the state of x_i right before it is updated, and J_i(k) is the index set of the interim changes of x_i. Since k increases at each coordinate update, we have J_i(k) ∩ J_j(k) = ∅ for i ≠ j. Therefore, letting J(k) = ∪_{i=1}^m J_i(k), we have x^k = x̂^k + Σ_{d ∈ J(k)} (x^{d+1} - x^d). We assume that |J(k)| ≤ τ for all k. 37 / 40

Special cases of ARock. If p = m = 1 (one agent and one coordinate), ARock reduces to the KM iteration. If p = m > 1 and τ = 0 (no delay), ARock reduces to the sync-parallel coordinate update. If p = 1 (only one agent), ARock reduces to Nesterov's randomized coordinate update. 38 / 40

Analysis challenges and techniques.
Challenges: asynchrony: stale information is used in the update. Inconsistency: x̂^k may not equal any state of x that ever existed. Coordinate update: the search direction lives on only one coordinate. No objective function: the analysis must work with ‖z^k - z*‖² and ‖Tz^k - z^k‖².
Techniques: bounded delay, or infinite delay with a light tail; a new metric; a non-negative almost supermartingale; the stale x̂^k is related to the current x^k through atomic updates; random selection yields expected progress over all coordinates. 39 / 40

Thank you! Reference: Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin. UCLA CAM Report 15-37. Website: http://www.math.ucla.edu/~wotaoyin/arock 40 / 40