Tradeoffs between synchronization, communication, and work in parallel linear algebra computations

Edgar Solomonik, Erin Carson, Nicholas Knight, and James Demmel
Department of EECS, UC Berkeley
February 2014

The problem, the algorithm, and the parallelization
- A problem asks for a solution that satisfies a set of properties with respect to a set of inputs (e.g., given $A$, find triangular matrices $L$ and $U$ such that $A = LU$).
- An algorithm is a sequence of computer operations that determines a solution to the problem.
- A parallelization is a schedule of the algorithm, which partitions the operations amongst processors and defines the necessary data movement between processors.

Graphical representation of a parallel algorithm
- We can represent an algorithm as a graph $G = (V, E)$, where
  - $V$ includes the input, intermediate, and output values used by the algorithm
  - $E$ represents the dependencies between pairs of values
  - e.g., to compute $c = a \cdot b$, we have $a, b, c \in V$ and $(a, c), (b, c) \in E$
- We can represent a set of algorithms by working with hypergraph representations $\bar{H} = (V, \bar{E})$
  - $\bar{E}$ may represent the dependency of a value on a set of vertices (e.g., a reduction tree)
  - e.g., to compute $d = \sum_{i=1}^{n} c_i$, we have $d, c_i \in V$ and the hyperedge $(\{c_1, \ldots, c_n\}, \{d\}) \in \bar{E}$
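As a concrete illustration (not part of the original slides), the two examples above can be written down directly as a vertex set, an edge set, and one hyperedge; the variable names and helper function below are ad hoc.

```python
# Illustrative sketch (ad hoc names, not from the talk) of the slide's two
# examples as a dependency graph plus one hyperedge.

# c = a * b : vertices a, b, c with edges (a, c) and (b, c)
V = {"a", "b", "c", "d", "c1", "c2", "c3"}
E = {("a", "c"), ("b", "c")}

# d = sum_i c_i : a single hyperedge relates the whole summand set to the
# output, leaving the summation (reduction) order unspecified.
hyperedges = [({"c1", "c2", "c3"}, {"d"})]

def predecessors(v, edges):
    """Direct dependencies of v, i.e. values that must exist before v is computed."""
    return {u for (u, w) in edges if w == v}

print(predecessors("c", E))  # {'a', 'b'}
```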

Parallelization schedules
- Our goal will be to bound the cost of any parallel schedule for a given algorithm.
- The schedule must give a unique assignment/partitioning of the vertices amongst the $p$ processors: $V = \bigcup_{i=1}^{p} C_i$.
- The schedule should give, for each processor $i$, a sequence of $m_i$ computation and communication operations:
  - $F_{ij}$ is the set of values computed by processor $i$ at timestep $j$
  - $R_{ij}$ is the set of values received by processor $i$ at timestep $j$
  - $M_{ij}$ is the set of values sent by processor $i$ at timestep $j$

Cost model
- We quantify the computation, interprocessor communication, and synchronization costs of a parallelization via a flat network model:
  - $\gamma$: cost of a single computation (flop)
  - $\beta$: cost of transferring one byte between any pair of processors
  - $\alpha$: cost of a synchronization between any pair of processors
- We measure the cost of a parallelization along the longest sequence of dependent computations and data transfers (the critical path):
  - $F$: computation cost along the critical path
  - $W$: communication (bandwidth) cost along the critical path
  - $S$: synchronization cost along the critical path
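Under this model the three critical-path quantities are typically combined into a single runtime estimate; one common form (an illustrative summary, not a formula taken from the slides) is:

```latex
% Total critical-path runtime under the flat (alpha, beta, gamma) network model:
% each critical-path flop costs gamma, each byte moved costs beta, and each
% synchronization costs alpha.
T \;=\; \gamma \cdot F \;+\; \beta \cdot W \;+\; \alpha \cdot S
```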

Parallel schedule example (figure)

Critical path for computational cost (figure)

Critical path for synchronization cost (figure)

Critical path for communication cost (figure)

The duality of upper and lower bounds
Given an algorithm for a problem:
- A parallelization provides a schedule whose cost is an upper bound on the cost of the algorithm.
- A parallelization of an algorithm is optimal under a given cost model if there exists a matching lower bound on the cost of the algorithm.
- We will first give some parallel algorithms (upper bounds) and then show their optimality via a new technique.
- The focus of our technique will be on lower bounds for synchronization cost, which manifest themselves as tradeoffs.

Solving a dense triangular system
Problem: for a lower triangular dense matrix $L$ and a vector $y$ of dimension $n$, solve $Lx = y$, i.e.,
$$\sum_{j=1}^{i} L_{ij} x_j = y_i, \quad \text{for } i \in \{1, \ldots, n\}.$$

Algorithm: $x = \mathrm{TRSV}(L, y, n)$
  for i = 1 to n
    for j = 1 to i - 1
      Z_ij = L_ij * x_j
    x_i = (y_i - sum_{j=1}^{i-1} Z_ij) / L_ii

This algorithm does not immediately yield a graph representation, since a summation order is not defined, but we can infer a hypergraph dependency representation.
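A direct sequential transcription of this pseudocode, as a sketch only: it fixes a left-to-right summation order, which the hypergraph formulation deliberately leaves open. The function name and the small test are illustrative.

```python
import numpy as np

def trsv(L, y):
    """Forward substitution: solve L x = y for lower triangular L (sequential reference)."""
    n = len(y)
    x = np.zeros(n)
    for i in range(n):
        # Z_ij = L_ij * x_j for j < i, accumulated here in a fixed left-to-right order
        x[i] = (y[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

L = np.tril(np.random.rand(4, 4)) + np.eye(4)  # well-conditioned lower triangular matrix
y = np.random.rand(4)
assert np.allclose(L @ trsv(L, y), y)
```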

Dependency hypergraph: triangular solve (figure)

Schedules for triangular solve
- One instance of the above triangular solve algorithm is a diamond DAG:
  - wavefront algorithms [Heath 1988]
  - also algorithms given by [Papadimitriou and Ullman 1987] and [Tiskin 1998]
- These diamond DAG schedules achieve the following costs for $x$ of dimension $n$ and some number of processors $p \in [1, n]$:
  - computation: $F_{\mathrm{TRSV}} = O(n^2/p)$
  - bandwidth: $W_{\mathrm{TRSV}} = O(n)$
  - synchronization: $S_{\mathrm{TRSV}} = O(p)$
- We see a tradeoff between computational and synchronization cost: their product $F_{\mathrm{TRSV}} \cdot S_{\mathrm{TRSV}} = O(n^2)$ is independent of $p$.

Obtaining a Cholesky factorization
Problem: the Cholesky factorization of a symmetric positive definite matrix $A$ is $A = L L^T$, for a lower triangular matrix $L$.

Algorithm: $L = \mathrm{Cholesky}(A, n)$
  for j = 1 to n
    L_jj = sqrt(A_jj - sum_{k=1}^{j-1} L_jk * L_jk)
    for i = j + 1 to n
      for k = 1 to j - 1
        Z_ijk = L_ik * L_jk
      L_ij = (A_ij - sum_{k=1}^{j-1} Z_ijk) / L_jj

Again, this algorithm yields only a hypergraph dependency representation.
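A direct sequential transcription of this pseudocode as a sketch, assuming $A$ is symmetric positive definite; as with TRSV, the summation order is fixed here even though the hypergraph representation leaves it open.

```python
import numpy as np

def cholesky(A):
    """Unblocked Cholesky: return lower triangular L with A = L L^T (sequential reference)."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])
        for i in range(j + 1, n):
            # Z_ijk = L_ik * L_jk for k < j, accumulated here in a fixed order
            L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L

A = np.random.rand(5, 5)
A = A @ A.T + 5 * np.eye(5)  # make A symmetric positive definite
L = cholesky(A)
assert np.allclose(L @ L.T, A)
```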

Cholesky dependency hypergraph (figure)
These diagrams show (a) the vertices $Z_{ijk}$ in $V_{\mathrm{GE}}$ with $n = 16$ and (b) the hyperplane $x_{12}$ and hyperedge $e_{12,6}$ on $H_{\mathrm{GE}}$.

Parallelizations of Cholesky
For $A$ of dimension $n$, with $p$ processors and some $c \in [1, p^{1/3}]$, [Tiskin 2002] in BSP and 2.5D algorithms [ES, JD 2011] achieve the costs
- computation: $F_{\mathrm{GE}} = O(n^3/p)$
- bandwidth: $W_{\mathrm{GE}} = O(n^2/\sqrt{cp})$
- synchronization: $S_{\mathrm{GE}} = O(\sqrt{cp})$
We see a tradeoff between computational and synchronization cost, as well as a tradeoff between bandwidth and synchronization cost (the latter independent of the number of processors $p$). Algorithms with the same asymptotic costs also exist for LU with pairwise or tournament pivoting, as well as for QR factorization.

Dependency bubble
Definition (Dependency bubble). Given two vertices $u, v$ in a directed acyclic graph $G = (V, E)$, the dependency bubble $\mathcal{B}(G, (u, v))$ is the union of all paths in $G$ from $u$ to $v$.

Path-expander graph
Definition ($(\epsilon, \sigma)$-path-expander). A graph $G = (V, E)$ is an $(\epsilon, \sigma)$-path-expander if there exists a path $(u_1, \ldots, u_n) \subseteq V$ such that the dependency bubble $\mathcal{B}(G, (u_i, u_{i+b}))$ has size $\Theta(\sigma(b))$ and a minimum cut of size $\Omega(\epsilon(b))$.

Scheduling tradeoffs of path-expander graphs
Theorem (Path-expander communication lower bound). Any parallel schedule of an algorithm with an $(\epsilon, \sigma)$-path-expander dependency graph about a path of length $n$ incurs the following computation ($F$), bandwidth ($W$), and latency ($S$) costs, for some $b \in [1, n]$:
$$F = \Omega(\sigma(b) \cdot n/b), \quad W = \Omega(\epsilon(b) \cdot n/b), \quad S = \Omega(n/b).$$
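To see how the tradeoff products used below follow from this theorem, note that multiplying the bounds cancels the free parameter $b$; for example, for a $(b, b^2)$-path-expander (the triangular solve case), with $\sigma(b) = b^2$ and $\epsilon(b) = b$:

```latex
% Eliminating b for sigma(b) = b^2, epsilon(b) = b (the TRSV case below):
F \cdot S \;=\; \Omega\!\left(b^{2}\cdot\frac{n}{b}\right)\cdot
                \Omega\!\left(\frac{n}{b}\right)
        \;=\; \Omega\!\left(n^{2}\right),
\qquad
W \;=\; \Omega\!\left(b\cdot\frac{n}{b}\right) \;=\; \Omega(n).
```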

Demonstration of the proof for a $(b, b^2)$-path-expander (figure)

Tradeoffs for triangular solve
Theorem. Any parallelization of any dependency graph $G_{\mathrm{TRSV}}(n)$ incurs the following computation ($F$), bandwidth ($W$), and latency ($S$) costs, for some $b \in [1, n]$:
$$F_{\mathrm{TRSV}} = \Omega(n b), \quad W_{\mathrm{TRSV}} = \Omega(n), \quad S_{\mathrm{TRSV}} = \Omega(n/b),$$
and furthermore,
$$F_{\mathrm{TRSV}} \cdot S_{\mathrm{TRSV}} = \Omega(n^2).$$
Proof. By application of the path-expander tradeoffs, since $G_{\mathrm{TRSV}}(n)$ is a $(b, b^2)$-path-expander dependency graph.

Attainability and related work on triangular solve
- Diamond DAG lower bounds were also given by Papadimitriou and Ullman [P.U. 1987] and Tiskin [T. 1998].
- The previously noted algorithms for triangular solve attain these lower bounds with the number of processors $p = n/b$:
  - computation: $F_{\mathrm{TRSV}} = \Theta(n^2/p)$
  - bandwidth: $W_{\mathrm{TRSV}} = \Theta(n)$
  - synchronization: $S_{\mathrm{TRSV}} = \Theta(p)$

Tradeoffs for Cholesky
Theorem. Any parallelization of any dependency graph $G_{\mathrm{GE}}(n)$ incurs the following computation ($F$), bandwidth ($W$), and latency ($S$) costs, for some $b \in [1, n]$:
$$F_{\mathrm{GE}} = \Omega(n b^2), \quad W_{\mathrm{GE}} = \Omega(n b), \quad S_{\mathrm{GE}} = \Omega(n/b),$$
and furthermore,
$$F_{\mathrm{GE}} \cdot S_{\mathrm{GE}}^2 = \Omega(n^3), \quad W_{\mathrm{GE}} \cdot S_{\mathrm{GE}} = \Omega(n^2).$$
Proof. By showing that $G_{\mathrm{GE}}(n)$ is a $(b^2, b^3)$-path-expander about the path corresponding to the calculation of the diagonal elements of $L$.
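As a quick check of the stated products, substituting the three bounds for the $(b^2, b^3)$ case and eliminating $b$:

```latex
% b cancels in both products:
F_{GE}\, S_{GE}^{2} \;=\; \Omega\!\left(n b^{2}\right)\cdot
                          \Omega\!\left(\frac{n}{b}\right)^{2}
                    \;=\; \Omega\!\left(n^{3}\right),
\qquad
W_{GE}\, S_{GE} \;=\; \Omega\!\left(n b\right)\cdot
                      \Omega\!\left(\frac{n}{b}\right)
                \;=\; \Omega\!\left(n^{2}\right).
```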

Parallelizations of Cholesky
- These lower bounds are attainable for $b \in [n/\sqrt{p}, n/p^{2/3}]$ by existing algorithms, so for $c \in [1, p^{1/3}]$:
  - computation: $F_{\mathrm{GE}} = \Theta(n^3/p)$
  - bandwidth: $W_{\mathrm{GE}} = \Theta(n^2/\sqrt{cp})$
  - synchronization: $S_{\mathrm{GE}} = \Theta(\sqrt{cp})$
- Therefore, we have a tradeoff between communication and synchronization: $W_{\mathrm{GE}} \cdot S_{\mathrm{GE}} = \Theta(n^2)$.
- We conjecture that our lower bound technique is extensible to QR and symmetric eigensolver algorithms.

Krylov subspace methods
We consider the $s$-step Krylov subspace basis computation
$$x^{(l)} = A \cdot x^{(l-1)}, \quad \text{for } l \in \{1, \ldots, s\},$$
where the graph of the symmetric sparse matrix $A$ is a $(2m+1)^d$-point stencil.
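For concreteness, a minimal sketch (not from the talk) of what is being computed, for the toy case $m = 1$, $d = 1$ with a periodic 3-point stencil whose weights are all 1: each step applies the sparse operator to the previous basis vector, and a straightforward parallelization would exchange boundary (ghost) values with neighbors at every one of the $s$ steps.

```python
# Toy instance (assumption: periodic 1D 3-point stencil, m = 1, d = 1, unit weights).
import numpy as np

def stencil_matvec(v):
    """y = A v, where A is the periodic 3-point stencil tridiag(1, 1, 1)."""
    return np.roll(v, 1) + v + np.roll(v, -1)

def s_step_basis(x, s):
    basis = []
    for _ in range(s):
        x = stencil_matvec(x)   # in a naive parallelization, each step would
        basis.append(x)         # require a neighbor (ghost-cell) exchange
    return np.array(basis)

V = s_step_basis(np.random.rand(16), s=4)
print(V.shape)  # (4, 16)
```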

Cost tradeoffs for Krylov subspace methods on stencils
Theorem. Any parallel execution of an $s$-step Krylov subspace basis computation for a $(2m+1)^d$-point stencil requires the following computation, bandwidth, and latency costs, for some $b \in \{1, \ldots, s\}$:
$$F_{\mathrm{Kr}} = \Omega(m^d b^d s), \quad W_{\mathrm{Kr}} = \Omega(m^d b^{d-1} s), \quad S_{\mathrm{Kr}} = \Omega(s/b),$$
and furthermore,
$$F_{\mathrm{Kr}} \cdot S_{\mathrm{Kr}}^d = \Omega(m^d s^{d+1}), \quad W_{\mathrm{Kr}} \cdot S_{\mathrm{Kr}}^{d-1} = \Omega(m^d s^d).$$

Proof of tradeoffs for Krylov subspace methods
Proof. Done by showing that the dependency graph of an $s$-step $(2m+1)^d$-point stencil computation is an $(m^d b^d, m^d b^{d+1})$-path-expander.

Attainability
The lower bounds may be attained via communication-avoiding $s$-step algorithms (PA1 in Demmel, Hoemmen, Mohiyuddin, and Yelick 2007):
$$F_{\mathrm{Kr}} = O(m^d b^d s), \quad W_{\mathrm{Kr}} = O(m^d b^{d-1} s), \quad S_{\mathrm{Kr}} = O(s/b),$$
under the assumption $n/p^{1/d} = O(bm)$.
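A hedged sketch of the ghost-zone idea behind such communication-avoiding $s$-step (matrix powers) computations, again for the toy periodic 1D 3-point stencil and with $b = s$: each block fetches $s$ ghost cells per side once, then performs all $s$ stencil applications locally (doing some redundant work), so only $O(s/b) = O(1)$ message rounds are needed per $s$ steps. This is an illustrative simplification, not the PA1 algorithm as specified in the cited paper; the loop over `pid` mimics what each processor would do independently.

```python
import numpy as np

def s_step_basis_pa1(x, s, nblocks):
    """Blocked s-step basis for the periodic 1D 3-point stencil tridiag(1,1,1):
    each block fetches s ghost cells per side once, then takes all s steps locally."""
    n = len(x)
    blk = n // nblocks              # assume nblocks divides n for simplicity
    out = np.zeros((s, n))
    for pid in range(nblocks):      # each iteration mimics one processor
        lo, hi = pid * blk, (pid + 1) * blk
        idx = np.arange(lo - s, hi + s)
        local = x[idx % n]          # owned cells + s ghost cells per side (one message)
        for l in range(1, s + 1):
            local = local[:-2] + local[1:-1] + local[2:]   # one stencil application
            # after l steps the valid window has shrunk by l cells on each side;
            # the owned range [lo, hi) starts at local offset s - l
            out[l - 1, lo:hi] = local[s - l: s - l + blk]
    return out

n, s, p = 12, 3, 4
x = np.random.rand(n)
ref, v = [], x                      # unblocked reference recurrence
for _ in range(s):
    v = np.roll(v, 1) + v + np.roll(v, -1)
    ref.append(v)
assert np.allclose(s_step_basis_pa1(x, s, p), np.array(ref))
```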

Counterexample: all-pairs shortest paths (APSP)
- The APSP problem seeks the shortest paths between all pairs of vertices in a graph $G$.
- The Floyd-Warshall algorithm is analogous to Gaussian elimination on a different semiring and achieves the costs
  - $F_{\mathrm{FW}} = \Theta(n^3/p)$
  - $W_{\mathrm{FW}} = \Theta(n^2/\sqrt{cp})$
  - $S_{\mathrm{FW}} = \Theta(\sqrt{cp})$
- Alexander Tiskin demonstrated that it is possible to augment path-doubling to achieve the costs
  - $F_{\mathrm{APSP}} = O(n^3/p)$
  - $W_{\mathrm{APSP}} = O(n^2/\sqrt{cp})$
  - $S_{\mathrm{APSP}} = O(\log p)$
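To make the analogy concrete, a sequential reference sketch of Floyd-Warshall: it has the same triple-loop structure as Gaussian elimination but operates over the $(\min, +)$ semiring. The example distance matrix is made up.

```python
# Reference (sequential) Floyd-Warshall over the (min, +) semiring.
import numpy as np

def floyd_warshall(D):
    """D[i, j] = edge weight (np.inf if absent); returns all-pairs shortest path lengths."""
    D = D.copy()
    n = D.shape[0]
    for k in range(n):
        # relax every pair (i, j) through intermediate vertex k
        D = np.minimum(D, D[:, k, None] + D[None, k, :])
    return D

INF = np.inf
D = np.array([[0.0, 3.0, INF, 7.0],
              [8.0, 0.0, 2.0, INF],
              [5.0, INF, 0.0, 1.0],
              [2.0, INF, INF, 0.0]])
print(floyd_warshall(D))
```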

Summary and conclusion
- Proved tight synchronization lower bounds on schedules for algorithms:
  - triangular solve: $F_{\mathrm{TRSV}} \cdot S_{\mathrm{TRSV}} = \Omega(n^2)$
  - Gaussian elimination: $F_{\mathrm{GE}} \cdot S_{\mathrm{GE}}^2 = \Omega(n^3)$ and $W_{\mathrm{GE}} \cdot S_{\mathrm{GE}} = \Omega(n^2)$
  - $s$-step Krylov subspace methods on $(2m+1)^d$-point stencils: $F_{\mathrm{Kr}} \cdot S_{\mathrm{Kr}}^d = \Omega(m^d s^{d+1})$ and $W_{\mathrm{Kr}} \cdot S_{\mathrm{Kr}}^{d-1} = \Omega(m^d s^d)$
- Our lower bounds also extend naturally to a number of graph algorithms:
  - Floyd-Warshall is analogous to Gaussian elimination
  - Bellman-Ford is analogous to Krylov subspace methods
- However, it may be possible to find new algorithms, rather than new parallel schedules, that solve these problems with less synchronization and communication cost (e.g., Tiskin's APSP algorithm, 2001).