PARALLEL MATRIX MULTIPLICATION: A SYSTEMATIC JOURNEY


PARALLEL MATRIX MULTIPLICATION: A SYSTEMATIC JOURNEY

MARTIN D. SCHATZ, ROBERT A. VAN DE GEIJN, AND JACK POULSON

Abstract. We expose a systematic approach for developing distributed memory parallel matrix-matrix multiplication algorithms. The journey starts with a description of how matrices are distributed to meshes of nodes (e.g., MPI processes), relates these distributions to scalable parallel implementation of matrix-vector multiplication and rank-1 update, continues on to reveal a family of matrix-matrix multiplication algorithms that view the nodes as a two-dimensional mesh, and finishes with extending these 2D algorithms to so-called 3D algorithms that view the nodes as a three-dimensional mesh. A cost analysis shows that the 3D algorithms can attain the (order of magnitude) lower bound for the cost of communication. The paper introduces a taxonomy for the resulting family of algorithms and explains how all algorithms have merit depending on parameters like the sizes of the matrices and architecture parameters. The techniques described in this paper are at the heart of the Elemental distributed memory linear algebra library. Performance results from implementation within and with this library are given on a representative distributed memory architecture, the IBM Blue Gene/P supercomputer.

1. Introduction. This paper serves a number of purposes:
- Parallel implementation of matrix-matrix multiplication is a standard topic in a course on parallel high-performance computing. However, rarely is the student exposed to the algorithms that are used in practice in cutting-edge parallel dense linear algebra (DLA) libraries. This paper exposes a systematic path that leads from parallel algorithms for matrix-vector multiplication and rank-1 update to a practical, scalable family of parallel algorithms for matrix-matrix multiplication, including the classic result in [1] and those implemented in the Elemental parallel DLA library [22].
- This paper introduces a set notation for describing the data distributions that underlie the Elemental library. The notation is motivated using parallelization of matrix-vector operations and matrix-matrix multiplication as the driving examples.
- Recently, research on parallel matrix-matrix multiplication algorithms has revisited so-called 3D algorithms, which view (processing) nodes as a logical three-dimensional mesh. These algorithms are known to attain theoretical (order of magnitude) lower bounds on communication. This paper exposes a systematic path from algorithms for two-dimensional meshes to their extensions for three-dimensional meshes. Among the resulting algorithms are classic results [2].
- A taxonomy is given for the resulting family of algorithms, which are all related to what is often called the Scalable Universal Matrix Multiplication Algorithm (SUMMA) [27].
- This paper is a first step towards understanding how the presented techniques extend to the much more complicated domain of tensor contractions (of which matrix multiplication is a special case).

Department of Computer Science, Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, TX. E-mails: martinschatz@utexas.edu, ltm@cs.utexas.edu, rvdg@cs.utexas.edu. Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA 94305. E-mail: poulson@stanford.edu. "Parallel" in this paper implicitly means distributed memory parallel.

Thus, the paper simultaneously serves a pedagogical role, explains abstractions that underlie the Elemental library, advances the state of science for parallel matrix-matrix multiplication by providing a framework to systematically derive known and new algorithms for matrix-matrix multiplication when computing on two-dimensional or three-dimensional meshes, and lays the foundation for extensions to tensor computations.

2. Background. The parallelization of dense matrix-matrix multiplication is a well-studied subject. Cannon's algorithm (sometimes called roll-roll-compute) dates back to 1969 [7] and Fox's algorithm (sometimes called broadcast-roll-compute) dates back to the 1980s [12]. Both suffer from a number of shortcomings:
- They assume that the p processes are viewed as a d0 x d1 grid, with d0 = d1 = sqrt(p). Removing this constraint on d0 and d1 is nontrivial for these algorithms.
- They do not deal well with the case where one of the matrix dimensions becomes relatively small. This is the most commonly encountered case in libraries like LAPACK [3] and libflame [29, 30], and their distributed-memory counterparts: ScaLAPACK [9], PLAPACK [28], and Elemental [22].
- Attempts to generalize [10, 18, 19] led to implementations that were neither simple nor effective.

A practical algorithm, which also results from the systematic approach discussed in this paper, can be described as allgather-allgather-multiply [1]. It does not suffer from the shortcomings of Cannon's and Fox's algorithms. It did not gain in popularity in part because libraries like ScaLAPACK and PLAPACK used a 2D block-cyclic distribution, rather than the 2D elemental distribution advocated by that paper. The arrival of the Elemental library, together with what we believe is our more systematic and extensible explanation, will hopefully elevate awareness of this result.

The Scalable Universal Matrix Multiplication Algorithm (SUMMA) [27] is another algorithm that overcomes all shortcomings of Cannon's algorithm and Fox's algorithm. We believe it is a more widely known result in part because it can already be explained for a matrix that is distributed with a 2D blocked (but not cyclic) distribution and in part because it was easy to support in the ScaLAPACK and PLAPACK libraries. The original SUMMA paper gives four algorithms:
- For C := AB + C, SUMMA casts the computation in terms of multiple rank-k updates. This algorithm is sometimes called the broadcast-broadcast-multiply algorithm, a label we will see is somewhat limiting. We also call this algorithm stationary C for reasons that will become clear later. By design, this algorithm continues to perform well in the case where the width of A is small relative to the dimensions of C.
- For C := A^T B + C, SUMMA casts the computation in terms of multiple panel-of-rows times matrix multiplies, so performance is not degraded in the case where the height of A is small relative to the dimensions of B. We have also called this algorithm stationary B for reasons that will become clear later.
- For C := AB^T + C, SUMMA casts the computation in terms of multiple matrix times panel-of-columns multiplies, and so performance does not deteriorate when the width of C is small relative to the dimensions of A. We call this algorithm stationary A for reasons that will become clear later.
- For C := A^T B^T + C, the paper sketches an algorithm that is actually not practical.

In the 1990s, it was observed that for the case where matrices were relatively small (or, equivalently, a relatively large number of nodes were available), better theoretical and practical performance resulted from viewing the p nodes as a d0 x d1 x d2 mesh, yielding a 3D algorithm [2].

In [14], it was shown how stationary A, B, and C algorithms can be formulated for each of the four cases of matrix-matrix multiplication, including for C := A^T B^T + C. This then yielded a general, practical family of 2D matrix-matrix multiplication algorithms, all of which were incorporated into PLAPACK and Elemental, and some of which are supported by ScaLAPACK. Some of the observations about developing 2D algorithms in the current paper can already be found in that paper, but our exposition is much more systematic and we use the matrix distribution that underlies Elemental to illustrate the basic principles. Although the work by Agarwal et al. describes algorithms for the different matrix-matrix multiplication transpose variants, it does not describe how to create stationary A and B variants.

Recently, a 3D algorithm for computing the LU factorization of a matrix was devised [25]. In the same work, a 3D algorithm for matrix-matrix multiplication building upon Cannon's algorithm was given for nodes arranged as a d0 x d1 x d2 mesh, with d0 = d1 and 0 <= d2 < p^(1/3). This was labeled a 2.5D algorithm. Although the primary contribution of that work was LU-related, the 2.5D algorithm for matrix-matrix multiplication is the relevant portion to this paper. The focus of that study on 3D algorithms was the simplest case of matrix-matrix multiplication, C := AB.

3. Notation. Although the focus of this paper is parallel distributed-memory matrix-matrix multiplication, the notation used is designed to be extensible to computation with higher-dimensional objects (tensors), on higher-dimensional grids. Because of this, the notation used may seem overly complex when restricted to matrix-matrix multiplication only. In this section, we describe the notation used and the reasoning behind the choice of notation.

Grid dimension, d_x: Since we focus on algorithms for distributed-memory architectures, we must describe information about the grid on which we are computing. To support arbitrary-dimensional grids, we must express the shape of the grid in an extensible way. For this reason, we have chosen the subscripted letter d to indicate the size of a particular dimension of the grid. Thus, d_x refers to the number of processes comprising the x-th dimension of the grid. In this paper, the grid is typically d0 x d1.

Process location, s_x: In addition to describing the shape of the grid, it is useful to be able to refer to a particular process's location within the mesh of processes. For this, we use the subscripted letter s to refer to a process's location within some given dimension of the mesh of processes. Thus, s_x refers to a particular process's location within the x-th dimension of the mesh of processes. In this paper, a typical process is labeled with (s0, s1).

Distribution, D_(x0, x1, ..., xk): In subsequent sections, we will introduce a notation for describing how data is distributed among the processes of the grid. This notation will require a description of which dimensions of the grid are involved in defining the distribution. We use the symbol D_(x0, x1, ..., xk) to indicate a distribution which involves dimensions x0, x1, ..., xk of the mesh. For example, when describing a distribution which involves the column and row dimension of the grid, we refer to this distribution as D_(0,1). Later, we will explain why the symbol D_(0,1) describes a different distribution from D_(1,0).

4. Of Matrix-Vector Operations and Distribution. In this section, we discuss how matrix and vector distributions can be linked to parallel 2D matrix-vector multiplication and rank-1 update operations, which then allows us to eventually describe the stationary C, A, and B 2D algorithms for matrix-matrix multiplication that are part of the Elemental library.

4.1. Collective communication. Collectives are fundamental to the parallelization of dense matrix operations. Thus, the reader must be (or become) familiar with the basics of these communications and is encouraged to read Chan et al. [8], which presents collectives in a systematic way that dovetails with the present paper. To make this paper self-contained, Figure 4.1 (similar to Figure 1 in [8]) summarizes the collectives. In Figure 4.2 we summarize lower bounds on the cost of the collective communications, under basic assumptions explained in [8] (see [6] for an analysis of all-to-all), and the cost expressions that we will use in our analyses.

4.2. Motivation: matrix-vector multiplication. Suppose A in R^{m x n}, x in R^n, and y in R^m, and label their individual elements so that

  A = [ alpha_{0,0}     alpha_{0,1}     ...  alpha_{0,n-1}
        alpha_{1,0}     alpha_{1,1}     ...  alpha_{1,n-1}
        ...
        alpha_{m-1,0}   alpha_{m-1,1}   ...  alpha_{m-1,n-1} ],
  x = (chi_0, chi_1, ..., chi_{n-1})^T,  and  y = (psi_0, psi_1, ..., psi_{m-1})^T.

Recalling that y = Ax (matrix-vector multiplication) is computed as

  psi_0     = alpha_{0,0} chi_0     + alpha_{0,1} chi_1     + ... + alpha_{0,n-1} chi_{n-1}
  psi_1     = alpha_{1,0} chi_0     + alpha_{1,1} chi_1     + ... + alpha_{1,n-1} chi_{n-1}
  ...
  psi_{m-1} = alpha_{m-1,0} chi_0   + alpha_{m-1,1} chi_1   + ... + alpha_{m-1,n-1} chi_{n-1},

we notice that element alpha_{i,j} multiplies chi_j and contributes to psi_i. Thus we may summarize the interactions of the elements of x, y, and A by

(4.1)              chi_0           chi_1           ...   chi_{n-1}
       psi_0       alpha_{0,0}     alpha_{0,1}     ...   alpha_{0,n-1}
       psi_1       alpha_{1,0}     alpha_{1,1}     ...   alpha_{1,n-1}
       ...
       psi_{m-1}   alpha_{m-1,0}   alpha_{m-1,1}   ...   alpha_{m-1,n-1}

which is meant to indicate that chi_j must be multiplied by the elements in the j-th column of A while the i-th row of A contributes to psi_i.

4.3. Two-Dimensional Elemental Cyclic Distribution. It is well established that (weakly) scalable implementations of DLA operations require nodes to be logically viewed as a two-dimensional mesh [26, 17]. It is also well established that to achieve load balance for a wide range of matrix operations, matrices should be cyclically wrapped onto this logical mesh. We start with these insights and examine the simplest of matrix distributions that result: the 2D elemental cyclic distribution [22, 16]. Denoting the number of nodes by p, a d0 x d1 mesh must be chosen such that p = d0 d1.

Fig. 4.1. Collective communications considered in this paper: permute, broadcast, reduce(-to-one), scatter, gather, allgather, reduce-scatter, allreduce, and all-to-all, each illustrated by the before and after contents of four nodes.
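To make the before/after patterns of Figure 4.1 concrete, the following minimal Python sketch (ours, not part of the paper) models a few of the collectives among four nodes; plain Python lists stand in for per-node buffers, and a real implementation would of course use MPI collectives instead.

  # Minimal semantic models of some collectives from Figure 4.1,
  # with one Python list entry per node.

  def allgather(per_node_pieces):
      # Before: node q owns piece x_q. After: every node owns [x_0, ..., x_{p-1}].
      gathered = list(per_node_pieces)
      return [list(gathered) for _ in per_node_pieces]

  def reduce_scatter(per_node_vectors):
      # Before: node q owns a full-length vector of contributions x^(q).
      # After: node q owns only entry q of the elementwise sum over all nodes.
      p = len(per_node_vectors)
      totals = [sum(vec[i] for vec in per_node_vectors) for i in range(p)]
      return [totals[q] for q in range(p)]

  def allreduce(per_node_vectors):
      # After: every node owns the full elementwise sum.
      p = len(per_node_vectors)
      totals = [sum(vec[i] for vec in per_node_vectors) for i in range(p)]
      return [list(totals) for _ in range(p)]

  if __name__ == "__main__":
      print(allgather([10, 11, 12, 13]))
      print(reduce_scatter([[1, 2, 3, 4]] * 4))   # node q receives 4*(q+1)
      print(allreduce([[1, 0, 0, 0], [0, 1, 0, 0],
                       [0, 0, 1, 0], [0, 0, 0, 1]]))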

  Operation        Latency     Bandwidth          Computation       Cost used for analysis
  Permute          alpha       n beta             -                 alpha + n beta
  Broadcast        log2(p) alpha   n beta         -                 log2(p) alpha + n beta
  Reduce(-to-one)  log2(p) alpha   n beta         ((p-1)/p) n gamma log2(p) alpha + n(beta + gamma)
  Scatter          log2(p) alpha   ((p-1)/p) n beta   -             log2(p) alpha + ((p-1)/p) n beta
  Gather           log2(p) alpha   ((p-1)/p) n beta   -             log2(p) alpha + ((p-1)/p) n beta
  Allgather        log2(p) alpha   ((p-1)/p) n beta   -             log2(p) alpha + ((p-1)/p) n beta
  Reduce-scatter   log2(p) alpha   ((p-1)/p) n beta   ((p-1)/p) n gamma   log2(p) alpha + ((p-1)/p) n(beta + gamma)
  Allreduce        log2(p) alpha   2((p-1)/p) n beta  ((p-1)/p) n gamma   2 log2(p) alpha + ((p-1)/p) n(2 beta + gamma)
  All-to-all       log2(p) alpha   ((p-1)/p) n beta   -             log2(p) alpha + ((p-1)/p) n beta

Fig. 4.2. Lower bounds for the different components of communication cost. Conditions for the lower bounds are given in [8] and [6]. The last column gives the cost functions that we use in our analyses. For architectures with sufficient connectivity, simple algorithms exist with costs that remain within a small constant factor of all but one of the given formulae. The exception is the all-to-all, for which there are algorithms that achieve the lower bound for the alpha and beta terms separately, but it is not clear whether an algorithm that consistently achieves performance within a constant factor of the given cost function exists.

Matrix distribution. The elements of A are assigned using an elemental cyclic (round-robin) distribution in which alpha_{i,j} is assigned to node (i mod d0, j mod d1). Thus, node (s0, s1) stores the submatrix

  A(s0 : d0 : m-1, s1 : d1 : n-1) = [ alpha_{s0,s1}       alpha_{s0,s1+d1}       ...
                                      alpha_{s0+d0,s1}    alpha_{s0+d0,s1+d1}    ...
                                      ...                                           ],

where the left-hand side of the expression uses the MATLAB convention for expressing submatrices, starting indexing from zero instead of one. This is illustrated in Figure 4.3.

Column-major vector distribution. A column-major vector distribution views the d0 x d1 mesh of nodes as a linear array of nodes, numbered in column-major order. A vector is distributed with this distribution if it is assigned to this linear array of nodes in a round-robin fashion, one element at a time. In other words, consider vector y. Its element psi_i is assigned to node (i mod d0, (i/d0) mod d1), where / denotes integer division. Or, equivalently in MATLAB-like notation, node (s0, s1) stores the subvector y(u(s0, s1) : p : m-1), where u(s0, s1) = s0 + s1 d0 equals the rank of node (s0, s1) when the nodes are viewed as a one-dimensional array, indexed in column-major order. This distribution of y is illustrated in Figure 4.3.

Row-major vector distribution. Similarly, a row-major vector distribution views the d0 x d1 mesh of nodes as a linear array of nodes, numbered in row-major order. In other words, consider vector x. Its element chi_j is assigned to node ((j/d1) mod d0, j mod d1). Or, equivalently, node (s0, s1) stores the subvector x(v(s0, s1) : p : n-1), where v(s0, s1) = s0 d1 + s1 equals the rank of node (s0, s1) when the nodes are viewed as a one-dimensional array, indexed in row-major order. The distribution of x is illustrated in Figure 4.3.
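The assignment rules just described are easy to check numerically. The short NumPy sketch below (our own illustration, not code from the Elemental library) builds the local submatrix of A owned by node (s0, s1) via strided slicing, along with the local pieces of y (column-major vector distribution) and x (row-major vector distribution).

  import numpy as np

  m, n, d0, d1 = 6, 9, 2, 3          # matrix size and a 2 x 3 mesh
  p = d0 * d1
  A = np.arange(m * n, dtype=float).reshape(m, n)
  x = np.arange(n, dtype=float)       # row-major vector distribution
  y = np.arange(m, dtype=float)       # column-major vector distribution

  def local_matrix(A, s0, s1):
      # alpha_{i,j} lives on node (i mod d0, j mod d1):
      # node (s0, s1) stores A(s0:d0:m-1, s1:d1:n-1).
      return A[s0::d0, s1::d1]

  def local_y(y, s0, s1):
      # psi_i lives on node (i mod d0, (i/d0) mod d1); equivalently node
      # (s0, s1) stores y(u(s0,s1) : p : m-1) with u = s0 + s1*d0.
      return y[s0 + s1 * d0::p]

  def local_x(x, s0, s1):
      # chi_j lives on the node of row-major rank j mod p; node (s0, s1)
      # stores x(v(s0,s1) : p : n-1) with v = s0*d1 + s1.
      return x[s0 * d1 + s1::p]

  for s0 in range(d0):
      for s1 in range(d1):
          print((s0, s1), local_y(y, s0, s1), local_x(x, s0, s1))
          # e.g. node (0,0) of the 2 x 3 mesh holds psi_0 and chi_0, chi_6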

Fig. 4.3. Distribution of A, x, and y within a 2 x 3 mesh of nodes (each node is shown with the elements alpha_{i,j}, chi_j, and psi_i that it stores). Redistributing a column of A in the same manner as y requires simultaneous scatters within rows of nodes, while redistributing a row of A consistently with x requires simultaneous scatters within columns of nodes. In the notation of Section 5, here the distributions of x and y are given by x[(1,0),()] and y[(0,1),()], respectively, and A by A[(0),(1)].

Fig. 4.4. Vectors x and y respectively redistributed as row-projected and column-projected vectors: node (s0, s1) now stores the chi_j with j = s1 (mod d1), duplicated within each column of nodes, and contributions to the psi_i with i = s0 (mod d0), duplicated within each row of nodes. The column-projected vector y[(0),()] here is to be used to compute local results that will become contributions to a column vector y[(0,1),()] which will result from adding these local contributions within rows of nodes. By comparing and contrasting this figure with Figure 4.3 it becomes obvious that redistributing x[(1,0),()] to x[(1),()] requires an allgather within columns of nodes, while y[(0,1),()] results from scattering y[(0),()] within process rows.

4.4. Parallelizing matrix-vector operations. In the following discussion, we assume that A, x, and y are distributed as discussed above. At this point, we suggest comparing (4.1) with Figure 4.3. (We suggest the reader print copies of Figures 4.3 and 4.4 for easy referral while reading the rest of this section.)

Computing y := Ax. The relation between the distributions of a matrix, a column-major vector, and a row-major vector is illustrated by revisiting the most fundamental of computations in linear algebra, y := Ax, already discussed in Section 4.2. An examination of Figure 4.3 suggests that the elements of x must be gathered within columns of nodes (allgather within columns), leaving elements of x distributed as illustrated in Figure 4.4. Next, each node computes the partial contribution to vector y with its local matrix and copy of x. Thus, in Figure 4.4, psi_i in each node becomes a contribution to the final psi_i. These must be added together, which is accomplished by a summation of contributions to y within rows of nodes. An experienced MPI programmer will recognize this as a reduce-scatter within each row of nodes. Under our communication cost model, the cost of this parallel algorithm is given by

  T_{y:=Ax}(m, n, d0, d1) = 2 (m/d0)(n/d1) gamma                                        (local mvmult)
                            + log2(d0) alpha + ((d0-1)/d0)(n/d1) beta                    (allgather x)
                            + log2(d1) alpha + ((d1-1)/d1)(m/d0) beta + ((d1-1)/d1)(m/d0) gamma   (reduce-scatter y)
                          <= 2 (mn/p) gamma + C0 (m/d0) gamma + C1 (n/d1) gamma          (load imbalance)
                            + log2(p) alpha + ((d0-1)/d0)(n/d1) beta + ((d1-1)/d1)(m/d0) beta + ((d1-1)/d1)(m/d0) gamma

for some constants C0 and C1. We simplify this further to

  2 (mn/p) gamma + log2(p) alpha + ((d0-1)/d0)(n/d1) beta + ((d1-1)/d1)(m/d0) beta + ((d1-1)/d1)(m/d0) gamma,

where the last four terms constitute T^+_{y:=Ax}(m, n, d0, d1), since the load imbalance contributes a cost similar to that of the communication. Here, T^+_{y:=Ax}(m, n, d0, d1) is used to refer to the overhead associated with the above algorithm for the y := Ax operation. (It is tempting to approximate (x-1)/x by 1, but this would yield formulae for the cases where the mesh is p x 1 (d1 = 1) or 1 x p (d0 = 1) that are overly pessimistic.) In Appendix A we use these estimates to show that this parallel matrix-vector multiplication is, for practical purposes, weakly scalable if d0/d1 is kept constant, but not if d0 x d1 = 1 x p or d0 x d1 = p x 1.
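As a sanity check on the algorithm just described, the following NumPy sketch (ours, not from the paper or the Elemental library) mimics the three phases on a d0 x d1 mesh: the allgather of x within columns of nodes, the local matrix-vector multiplies, and the summation of contributions to y within rows of nodes. Data movement is only simulated by index arithmetic on the global arrays.

  import numpy as np

  m, n, d0, d1 = 8, 9, 2, 3
  rng = np.random.default_rng(0)
  A = rng.standard_normal((m, n))
  x = rng.standard_normal(n)

  # Phase 1: allgather of x within columns of nodes. Afterwards, node (s0, s1)
  # owns all chi_j with j = s1 (mod d1), the column indices of its local A.
  # Phase 2: local matrix-vector multiply with the local part of A.
  contrib = {}
  for s0 in range(d0):
      for s1 in range(d1):
          A_loc = A[s0::d0, s1::d1]          # local matrix on node (s0, s1)
          x_dup = x[s1::d1]                  # x after the allgather in columns
          contrib[s0, s1] = A_loc @ x_dup    # contributions to psi_i, i = s0 (mod d0)

  # Phase 3: a reduce-scatter within rows of nodes sums the contributions;
  # here we only check that the row-wise sums reproduce y = A x.
  y = A @ x
  for s0 in range(d0):
      summed = sum(contrib[s0, s1] for s1 in range(d1))
      assert np.allclose(summed, y[s0::d0])
  print("parallel y := A x phases reproduce the sequential result")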

Computing x := A^T y. Let us next discuss an algorithm for computing x := A^T y, where A is an m x n matrix and x and y are distributed as before (x with a row-major vector distribution and y with a column-major vector distribution). Recall that x = A^T y (transpose matrix-vector multiplication) means

(4.2)  chi_0     = alpha_{0,0} psi_0     + alpha_{1,0} psi_1     + ... + alpha_{m-1,0} psi_{m-1}
       chi_1     = alpha_{0,1} psi_0     + alpha_{1,1} psi_1     + ... + alpha_{m-1,1} psi_{m-1}
       ...
       chi_{n-1} = alpha_{0,n-1} psi_0   + alpha_{1,n-1} psi_1   + ... + alpha_{m-1,n-1} psi_{m-1}.

An examination of (4.2) and Figure 4.3 suggests that the elements of y must be gathered within rows of nodes (allgather within rows), leaving elements of y distributed as illustrated in Figure 4.4. Next, each node computes the partial contribution to vector x with its local matrix and copy of y. Thus, in Figure 4.4, chi_j in each node becomes a contribution to the final chi_j. These must be added together, which is accomplished by a summation of contributions to x within columns of nodes. We again recognize this as a reduce-scatter, but this time within each column of nodes. The cost for this algorithm, approximating as we did when analyzing the algorithm for y = Ax, is

  2 (mn/p) gamma + log2(p) alpha + ((d1-1)/d1)(m/d0) beta + ((d0-1)/d0)(n/d1) beta + ((d0-1)/d0)(n/d1) gamma,

where the communication terms constitute T^+_{x:=A^T y}(m, n, d1, d0), and where, as before, we ignore overhead due to load imbalance since terms of the same order appear in the terms that capture communication overhead.

Computing y := A^T x. What if we wish to compute y := A^T x, where A is an m x n matrix, y is distributed with a column-major vector distribution, and x with a row-major vector distribution? Now x must first be redistributed to a column-major vector distribution, after which the algorithm that we just discussed can be executed, and finally the result (in row-major vector distribution) must be redistributed to leave it as y in column-major vector distribution. This adds the cost of the permutation that redistributes x and the cost of the permutation that redistributes the result to y to the cost analyzed for x := A^T y.

Other cases. What if, when computing y := Ax, the vector x is distributed like a row of matrix A? What if the vector y is distributed like a column of matrix A? We leave these cases as an exercise to the reader. The point is that understanding the basic algorithms for multiplying with A and A^T allows one to systematically derive and analyze algorithms when the vectors that are involved are distributed to the nodes in different ways.

Computing A := yx^T + A. A second commonly encountered matrix-vector operation is the rank-1 update: A := alpha y x^T + A. We will discuss the case where alpha = 1. Recall that

  A + y x^T = [ alpha_{0,0} + psi_0 chi_0        alpha_{0,1} + psi_0 chi_1        ...  alpha_{0,n-1} + psi_0 chi_{n-1}
                alpha_{1,0} + psi_1 chi_0        alpha_{1,1} + psi_1 chi_1        ...  alpha_{1,n-1} + psi_1 chi_{n-1}
                ...
                alpha_{m-1,0} + psi_{m-1} chi_0  alpha_{m-1,1} + psi_{m-1} chi_1  ...  alpha_{m-1,n-1} + psi_{m-1} chi_{n-1} ],

which, when considering Figures 4.3 and 4.4, suggests the following parallel algorithm: an all-gather of y within rows, an all-gather of x within columns, and an update of the local matrix on each node. The cost for this algorithm, approximating as we did when analyzing the algorithm for y = Ax, yields

  2 (mn/p) gamma + log2(p) alpha + ((d0-1)/d0)(n/d1) beta + ((d1-1)/d1)(m/d0) beta,

where the communication terms constitute T^+_{A:=yx^T+A}(m, n, d0, d1), and where, as before, we ignore overhead due to load imbalance since terms of the same order appear in the terms that capture communication overhead. Notice that the cost is the same as that of a parallel matrix-vector multiplication, except for the gamma term that, for the matrix-vector multiplication, results from the reduction within rows. As before, one can modify this algorithm when the vectors start with different distributions, building on intuition from matrix-vector multiplication. A pattern is emerging.

5. Generalizing the Theme. The reader should now have an understanding of how vector and matrix distribution are related to the parallelization of basic matrix-vector operations. We generalize the insights using sets of indices as filters to indicate what parts of a matrix or vector a given process owns. The insights in this section are similar to those that underlie Physically Based Matrix Distribution [11], which itself also underlies PLAPACK. However, we formalize the notation beyond that used by PLAPACK. The link between distribution of vectors and matrices was first observed by Bisseling [4, 5] and, around the same time, in [21].

5.1. Vector distribution. The basic idea is to use two different partitions of the natural numbers as a means of describing the distribution of the row and column indices of a matrix.

Definition 5.1 (Subvectors and submatrices). Let x in R^n and S a subset of N. Then x[S] equals the vector with elements from x with indices in the set S, in the order in which they appear in vector x. If A in R^{m x n} and S, T are subsets of N, then A[S, T] is the submatrix formed by keeping only the elements of A whose row indices are in S and column indices are in T, in the order in which they appear in matrix A.

We illustrate this idea with simple examples.

Example 1. Let

  x = (chi_0, chi_1, chi_2, chi_3)^T  and  A = [ alpha_{0,0} alpha_{0,1} alpha_{0,2} alpha_{0,3} alpha_{0,4}
                                                 alpha_{1,0} alpha_{1,1} alpha_{1,2} alpha_{1,3} alpha_{1,4}
                                                 alpha_{2,0} alpha_{2,1} alpha_{2,2} alpha_{2,3} alpha_{2,4}
                                                 alpha_{3,0} alpha_{3,1} alpha_{3,2} alpha_{3,3} alpha_{3,4}
                                                 alpha_{4,0} alpha_{4,1} alpha_{4,2} alpha_{4,3} alpha_{4,4} ].

If S = {0, 2, 4, ...} and T = {1, 3, 5, ...}, then

  x[S] = (chi_0, chi_2)^T  and  A[S, T] = [ alpha_{0,1} alpha_{0,3}
                                            alpha_{2,1} alpha_{2,3}
                                            alpha_{4,1} alpha_{4,3} ].
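Definition 5.1 and Example 1 translate directly into NumPy index filtering; the sketch below (ours, for illustration only) forms x[S] and A[S, T] for the finite parts of these index sets that apply to a 4-element vector and a 5 x 5 matrix.

  import numpy as np

  x = np.array([10.0, 11.0, 12.0, 13.0])             # chi_0, ..., chi_3
  A = np.arange(25, dtype=float).reshape(5, 5)        # alpha_{i,j} = 5*i + j

  # S = {0, 2, 4, ...} and T = {1, 3, 5, ...}, intersected with the valid ranges.
  S_x = [i for i in range(len(x)) if i % 2 == 0]      # indices of x kept: 0, 2
  S_A = [i for i in range(A.shape[0]) if i % 2 == 0]  # rows of A kept: 0, 2, 4
  T_A = [j for j in range(A.shape[1]) if j % 2 == 1]  # columns of A kept: 1, 3

  x_S = x[S_x]                   # x[S] = (chi_0, chi_2)
  A_ST = A[np.ix_(S_A, T_A)]     # A[S, T]: rows {0,2,4}, columns {1,3}, order preserved

  print(x_S)      # [10. 12.]
  print(A_ST)     # [[ 1.  3.] [11. 13.] [21. 23.]]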

We now introduce two fundamental ways to distribute vectors relative to a logical d0 x d1 process grid.

Definition 5.2 (Column-major vector distribution). Suppose that p processes are available, and define

  V_sigma(q) = { N in N : N = q + sigma (mod p) },  q in {0, 1, ..., p-1},

where sigma in {0, 1, ..., p-1} is an arbitrary alignment parameter. When p is implied from context and sigma is unimportant to the discussion, we will simply denote the above set by V(q). If the p processes have been configured into a logical d0 x d1 grid, a vector x is said to be in a column-major vector distribution if process (s0, s1), where s0 in {0, ..., d0-1} and s1 in {0, ..., d1-1}, is assigned the subvector x(V_sigma(s0 + s1 d0)). This distribution is represented via the d0 x d1 array of index sets

  D_(0,1)(s0, s1) = V(s0 + s1 d0),  (s0, s1) in {0, ..., d0-1} x {0, ..., d1-1},

and the shorthand x[(0,1)] will refer to the vector x distributed such that process (s0, s1) stores x(D_(0,1)(s0, s1)).

Definition 5.3 (Row-major vector distribution). Similarly, the d0 x d1 array

  D_(1,0)(s0, s1) = V(s1 + s0 d1),  (s0, s1) in {0, ..., d0-1} x {0, ..., d1-1},

is said to define a row-major vector distribution. The shorthand y[(1,0)] will refer to the vector y distributed such that process (s0, s1) stores y(D_(1,0)(s0, s1)).

The members of any column-major vector distribution, D_(0,1), or row-major vector distribution, D_(1,0), form a partition of N. The names "column-major vector distribution" and "row-major vector distribution" come from the fact that the mappings (s0, s1) -> s0 + s1 d0 and (s0, s1) -> s1 + s0 d1 respectively label the d0 x d1 grid with a column-major and a row-major ordering.

As row-major and column-major distributions differ only by which dimension of the grid is considered first when assigning an order to the processes in the grid, we can give one general definition for a vector distribution on two-dimensional grids. We give this definition now.

Definition 5.4 (Vector distribution). We call the d0 x d1 array D_(i,j) a vector distribution if i, j in {0, 1}, i != j, and there exists some alignment parameter sigma in {0, ..., p-1} such that, for every grid position (s0, s1) in {0, ..., d0-1} x {0, ..., d1-1},

(5.1)  D_(i,j)(s0, s1) = V_sigma(s_i + s_j d_i).

The shorthand y[(i,j)] will refer to the vector y distributed such that process (s0, s1) stores y(D_(i,j)(s0, s1)).
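A small helper (again ours, not from the paper) makes Definitions 5.2-5.4 concrete by listing, for each grid position (s0, s1), the finite part of V_sigma(q) that falls inside {0, ..., N-1}, for both the column-major ordering q = s0 + s1 d0 and the row-major ordering q = s1 + s0 d1.

  def V(q, p, N, sigma=0):
      # Finite part of V_sigma(q) = {i : i = q + sigma (mod p)} below N.
      return [i for i in range(N) if i % p == (q + sigma) % p]

  def vector_distribution(d0, d1, N, order, sigma=0):
      # order "col" gives D_(0,1) (column-major); "row" gives D_(1,0) (row-major).
      p = d0 * d1
      dist = {}
      for s0 in range(d0):
          for s1 in range(d1):
              q = s0 + s1 * d0 if order == "col" else s1 + s0 * d1
              dist[s0, s1] = V(q, p, N, sigma)
      return dist

  if __name__ == "__main__":
      d0, d1, N = 2, 3, 12
      print("D_(0,1):", vector_distribution(d0, d1, N, "col"))
      print("D_(1,0):", vector_distribution(d0, d1, N, "row"))
      # For example, D_(0,1)(0,1) = V(2) = {2, 8} while D_(1,0)(0,1) = V(1) = {1, 7}:
      # the two distributions assign different index sets to the same grid position.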

Figure 4.3 illustrates that to redistribute y[(0,1)] to y[(1,0)], and vice versa, requires a permutation communication (simultaneous point-to-point communications). The effect of this redistribution can be seen in Figure 5.2: via a permutation communication, the vector y distributed as y[(0,1)] can be redistributed as y[(1,0)], which is the same distribution as that of the vector x.

In the preceding discussions, our definitions of D_(0,1) and D_(1,0) allowed for arbitrary alignment parameters. For the rest of the paper, we will only treat the case where all alignments are zero, i.e., the top-left entry of every (global) matrix and the top entry of every (global) vector is owned by the process in the top-left of the process grid.

5.2. Induced matrix distribution. We are now ready to discuss how matrix distributions are induced by the vector distributions. For this, it pays to again consider Figure 4.3. The element alpha_{i,j} of matrix A is assigned to the row of processes in which psi_i exists and the column of processes in which chi_j exists. This means that in y = Ax elements of x need only be communicated within columns of processes and local contributions to y need only be summed within rows of processes. This induces a Cartesian matrix distribution: column j of A is assigned to the same column of processes as is chi_j, and row i of A is assigned to the same row of processes as is psi_i. We now answer the related questions "What is the set D_(0)(s0) of matrix row indices assigned to process row s0?" and "What is the set D_(1)(s1) of matrix column indices assigned to process column s1?"

Definition 5.5. Let

  D_(0)(s0) = Union over s1 = 0, ..., d1-1 of D_(0,1)(s0, s1)   and   D_(1)(s1) = Union over s0 = 0, ..., d0-1 of D_(1,0)(s0, s1).

Given matrix A, A[D_(0)(s0), D_(1)(s1)] denotes the submatrix of A with row indices in the set D_(0)(s0) and column indices in D_(1)(s1). Finally, A[(0),(1)] denotes the distribution of A that assigns A[D_(0)(s0), D_(1)(s1)] to process (s0, s1).

We say that D_(0) and D_(1) are induced respectively by D_(0,1) and D_(1,0) because the process to which alpha_{i,j} is assigned is determined by the row of processes, s0, to which psi_i is assigned and the column of processes, s1, to which chi_j is assigned, so that it is ensured that in the matrix-vector multiplication y = Ax communication needs only be within rows and columns of processes. Notice in Figure 5.2 that redistributing the vector y as a column of the matrix A requires a communication within rows of processes. Similarly, redistributing the vector x as a row of the matrix requires a communication within columns of processes. The above definition lies at the heart of our communication scheme.

5.3. Vector duplication. Two vector distributions, encountered in Section 4.4 and illustrated in Figure 4.4, still need to be specified with our notation. The vector x, duplicated as needed for the matrix-vector multiplication y = Ax, can be specified as x[(1)] or, viewing x as an n x 1 matrix, x[(1),()]. The vector y, duplicated so as to store local contributions for y = Ax, can be specified as y[(0)] or, viewing y as an m x 1 matrix, y[(0),()]. Here the () should be interpreted as "all indices"; in other words, D_() = N.
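The induced sets of Definition 5.5 can likewise be enumerated. The sketch below (ours) forms D_(0)(s0) as the union of D_(0,1)(s0, s1) over s1 and D_(1)(s1) as the union of D_(1,0)(s0, s1) over s0, and checks that they reduce to the simple "mod" rules of the elemental distribution of Section 4.3.

  d0, d1, m, n = 2, 3, 8, 9
  p = d0 * d1

  def V(q, N):
      return {i for i in range(N) if i % p == q % p}

  # D_(0)(s0): union over s1 of D_(0,1)(s0, s1); these are matrix ROW indices.
  # D_(1)(s1): union over s0 of D_(1,0)(s0, s1); these are matrix COLUMN indices.
  for s0 in range(d0):
      D0 = set().union(*(V(s0 + s1 * d0, m) for s1 in range(d1)))
      assert D0 == {i for i in range(m) if i % d0 == s0}
  for s1 in range(d1):
      D1 = set().union(*(V(s1 + s0 * d1, n) for s0 in range(d0)))
      assert D1 == {j for j in range(n) if j % d1 == s1}
  print("D_(0)(s0) = {i : i mod d0 = s0} and D_(1)(s1) = {j : j mod d1 = s1}")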

Fig. 5.1. The relationships between distribution symbols found in the Elemental library implementation and those introduced here: M_C corresponds to (0), M_R to (1), V_C to (0,1), V_R to (1,0), and the star symbol to (). For instance, the distribution A[M_C, M_R] found in the Elemental library implementation corresponds to the distribution A[(0),(1)].

Fig. 5.2. Summary of the communication patterns for redistributing a vector x among the distributions x[(0),()], x[(1),()], x[(0,1),()], x[(1,0),()], x[(0),(1)], and x[(1),(0)], with the edges of the diagram labeled by the collectives (permutation, broadcast, reduce(-to-one), scatter, gather, allgather, reduce-scatter) that realize them. For instance, a method for redistributing x from a matrix column to a matrix row is found by tracing from the bottom-left to the bottom-right of the diagram.

5.4. Notation in the Elemental library. Readers familiar with the Elemental library will notice that the distribution symbols defined within that library's implementation follow a different convention than that used for the distribution symbols introduced in the previous subsections. This is due to the fact that the notation used in this paper was devised after the implementation of the Elemental library, and we wanted the notation to be extensible to tensors. However, for every symbol utilized in the Elemental library implementation, there exists a unique symbol in the notation introduced here. In Figure 5.1, the relationships between the distribution symbols utilized in the Elemental library implementation and the symbols used in this paper are defined.

5.5. Of vectors, columns, and rows. A matrix-vector multiplication or rank-1 update may take as its input/output vectors (x and y) the rows and/or columns of matrices, as we will see in Section 6. This motivates us to briefly discuss the different communications needed to redistribute vectors to and from columns and rows. In our discussion, it will help to refer back to Figures 4.3 and 4.4.

Fig. 5.3. Parallel algorithm for computing y := Ax (gemv):
  x[(1),()] <- x[(1,0),()]                      (redistribute x: allgather in columns)
  y^(1)[(0),()] := A[(0),(1)] x[(1),()]          (local matrix-vector multiply)
  y[(0,1),()] := Sum_1 y^(1)[(0),()]             (sum contributions: reduce-scatter in rows)

Fig. 5.4. Parallel algorithm for computing A := A + x y^T (ger):
  x[(0,1),()] <- x[(1,0),()]                     (redistribute x as a column-major vector: permutation)
  x[(0),()] <- x[(0,1),()]                       (redistribute x: allgather in rows)
  y[(1,0),()] <- y[(0,1),()]                     (redistribute y as a row-major vector: permutation)
  y[(1),()] <- y[(1,0),()]                       (redistribute y: allgather in columns)
  A[(0),(1)] := A[(0),(1)] + x[(0),()] [y[(1),()]]^T   (local rank-1 update)

Column to/from column-major vector. Consider Figure 4.3 and let a_j be a typical column in A. It exists within one single process column. Redistributing a_j[(0),(1)] to y[(0,1),()] requires simultaneous scatters within process rows. Inversely, redistributing y[(0,1),()] to a_j[(0),(1)] requires simultaneous gathers within process rows.

Column to/from row-major vector. Redistributing a_j[(0),(1)] to x[(1,0),()] can be accomplished by first redistributing to y[(0,1),()] (simultaneous scatters within rows) followed by a redistribution of y[(0,1),()] to x[(1,0),()] (a permutation). Redistributing x[(1,0),()] to a_j[(0),(1)] reverses these communications.

Column to/from column-projected vector. Redistributing a_j[(0),(1)] to a_j[(0),()] (duplicated y in Figure 4.4) can be accomplished by first redistributing to y[(0,1),()] (simultaneous scatters within rows) followed by a redistribution of y[(0,1),()] to y[(0),()] (simultaneous allgathers within rows). However, recognize that a scatter followed by an allgather is equivalent to a broadcast. Thus, redistributing a_j[(0),(1)] to a_j[(0),()] can be more directly accomplished by broadcasting within rows. Similarly, summing duplicated vectors y[(0),()], leaving the result as a_j[(0),(1)] (a column in A), can be accomplished by first summing them into y[(0,1),()] (reduce-scatters within rows) followed by a redistribution to a_j[(0),(1)] (gathers within rows). But a reduce-scatter followed by a gather is equivalent to a reduce(-to-one) collective communication.

All communication patterns with vectors, rows, and columns. We summarize all the communication patterns that will be encountered when performing various matrix-vector multiplications or rank-1 updates, with vectors, columns, or rows as input, in Figure 5.2.

5.6. Parallelizing matrix-vector operations (revisited). We now show how the notation discussed in the previous subsection pays off when describing algorithms for matrix-vector operations. Assume that A, x, and y are respectively distributed as A[(0),(1)], x[(1,0),()], and y[(0,1),()]. Algorithms for computing y := Ax and A := A + xy^T are given in Figures 5.3 and 5.4.
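To complement Figure 5.4, here is a NumPy sketch (ours, a semantic model only) of the rank-1 update A := A + x y^T on a d0 x d1 mesh: after the redistributions of Figure 5.4, node (s0, s1) holds the entries of x with index = s0 (mod d0) and the entries of y with index = s1 (mod d1), so the local update is an outer product of those two pieces and no communication of A is needed.

  import numpy as np

  m, n, d0, d1 = 8, 9, 2, 3
  rng = np.random.default_rng(1)
  A = rng.standard_normal((m, n))
  x = rng.standard_normal(m)      # multiplies into the rows of A
  y = rng.standard_normal(n)      # multiplies into the columns of A

  A_ref = A + np.outer(x, y)      # what the distributed update must produce

  # After the permutation/allgather steps of Figure 5.4, node (s0, s1) holds
  # x[(0),()]: the entries x_i with i = s0 (mod d0), and y[(1),()]: the entries
  # y_j with j = s1 (mod d1). The update of the local matrix is a local outer product.
  for s0 in range(d0):
      for s1 in range(d1):
          A[s0::d0, s1::d1] += np.outer(x[s0::d0], y[s1::d1])

  assert np.allclose(A, A_ref)
  print("local outer products on the mesh reproduce A + x y^T")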

Fig. 5.5. Parallel algorithm for computing ĉ_i := A b_j, where ĉ_i is a row of a matrix C and b_j is a column of a matrix B:
  Redistribute b_j as x[(1),()]:
    x[(0,1),()] <- b_j[(0),(1)]          (scatter in rows)
    x[(1,0),()] <- x[(0,1),()]           (permutation)
    x[(1),()] <- x[(1,0),()]             (allgather in columns)
  y^(1)[(0),()] := A[(0),(1)] x[(1),()]   (local matrix-vector multiply)
  Sum contributions and leave the result as ĉ_i[(0),(1)]:
    y[(0,1),()] := Sum_1 y^(1)[(0),()]    (reduce-scatter in rows)
    y[(1,0),()] <- y[(0,1),()]            (permutation)
    ĉ_i[(0),(1)] <- y[(1,0),()]           (gather in rows)

The discussion in Section 5.5 provides the insights to generalize these parallel matrix-vector operations to the cases where the vectors are rows and/or columns of matrices. For example, in Figure 5.5 we show how to compute ĉ_i := A b_j, where the result is stored as a row ĉ_i of a matrix C and the input b_j is a column of a matrix B. Certain steps in Figures 5.3-5.5 have superscripts associated with the outputs of local computations. These superscripts indicate that contributions, rather than final results, are computed by the operation. Further, the subscript to Sum indicates along which dimension of the processing grid a reduction of contributions must occur.

5.7. Similar operations. What we have described is a general method. We leave it as an exercise to the reader to derive parallel algorithms for x := A^T y and A := yx^T + A, starting with vectors that are distributed in various ways.

6. Elemental SUMMA: 2D algorithms (eSUMMA2D). We have now arrived at the point where we can discuss parallel matrix-matrix multiplication on a d0 x d1 mesh, with p = d0 d1. In our discussion, we will assume an elemental distribution, but the ideas clearly generalize to other Cartesian distributions.

This section exposes a systematic path from the parallel rank-1 update and matrix-vector multiplication algorithms to highly efficient 2D parallel matrix-matrix multiplication algorithms. The strategy is to first recognize that a matrix-matrix multiplication can be performed by a series of rank-1 updates or matrix-vector multiplications. This gives us parallel algorithms that are inefficient. By then recognizing that the order of operations can be changed so that communication and computation can be separated and consolidated, these inefficient algorithms are transformed into efficient algorithms. While only explained for some of the cases of matrix multiplication, we believe the exposition is such that the reader can derive algorithms for the remaining cases by applying the ideas in a straightforward manner. To fully understand how to attain high performance on a single processor, the reader should familiarize him/herself with, for example, the techniques in [13].

Fig. 6.1. Summary of the communication patterns for redistributing a matrix A among the distributions A[(0),(1)], A[(1),(0)], A[(0),()], A[(1),()], A[(),(0)], A[(),(1)], A[(0,1),()], A[(1,0),()], A[(),(0,1)], and A[(),(1,0)], with the edges of the diagram labeled by the collectives (allgather, reduce-scatter, all-to-all, permutation) that realize them.

6.1. Elemental stationary C algorithms (eSUMMA2D-C). We first discuss the case C := C + AB, where A has k columns and B has k rows, with k relatively small. We call this a rank-k update or panel-panel multiplication [13]. We will assume the distributions C[(0),(1)], A[(0),(1)], and B[(0),(1)]. Partition

  A = ( a_0 | a_1 | ... | a_{k-1} )   and   B = ( b̂_0^T ; b̂_1^T ; ... ; b̂_{k-1}^T )   (B partitioned by rows),

so that

  C := ( ( (C + a_0 b̂_0^T) + a_1 b̂_1^T ) + ... ) + a_{k-1} b̂_{k-1}^T.

There is an algorithmic block size, b_alg, for which a local rank-k update achieves peak performance [13]. Think of k as being that algorithmic block size for now.
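Before parallelizing, it is worth verifying the purely algebraic statement that a rank-k update is an accumulation of k rank-1 updates; the NumPy sketch below (ours) does just that, with columns a_p of A and rows b̂_p^T of B.

  import numpy as np

  m, n, k = 6, 7, 4
  rng = np.random.default_rng(2)
  A = rng.standard_normal((m, k))     # k columns a_0, ..., a_{k-1}
  B = rng.standard_normal((k, n))     # k rows    b_0^T, ..., b_{k-1}^T
  C = rng.standard_normal((m, n))

  C_ref = C + A @ B                    # C := A B + C, done in one call

  # Accumulate the same result as k successive rank-1 updates C += a_p b_p^T.
  for p_idx in range(k):
      C = C + np.outer(A[:, p_idx], B[p_idx, :])

  assert np.allclose(C, C_ref)
  print("C := A B + C equals the accumulation of k rank-1 updates")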

Fig. 6.2. Algorithms for computing C := AB + C. Left: stationary C. Right: stationary A.

Algorithm: C := Gemm_C(C, A, B)   (stationary C)
  Partition A -> ( A_L | A_R ), B -> ( B_T ; B_B ), where A_L has 0 columns and B_T has 0 rows
  while n(A_L) < n(A) do
    Determine block size b
    Repartition ( A_L | A_R ) -> ( A_0 | A_1 | A_2 ), ( B_T ; B_B ) -> ( B_0 ; B_1 ; B_2 ),
      where A_1 has b columns and B_1 has b rows
    A_1[(0),()] <- A_1[(0),(1)]
    B_1[(),(1)] <- B_1[(0),(1)]
    C[(0),(1)] := C[(0),(1)] + A_1[(0),()] B_1[(),(1)]
    Continue with ( A_L | A_R ) <- ( A_0 A_1 | A_2 ), ( B_T ; B_B ) <- ( B_0 B_1 ; B_2 )
  endwhile

Algorithm: C := Gemm_A(C, A, B)   (stationary A)
  Partition C -> ( C_L | C_R ), B -> ( B_L | B_R ), where C_L and B_L have 0 columns
  while n(C_L) < n(C) do
    Determine block size b
    Repartition ( C_L | C_R ) -> ( C_0 | C_1 | C_2 ), ( B_L | B_R ) -> ( B_0 | B_1 | B_2 ),
      where C_1 and B_1 have b columns
    B_1[(1),()] <- B_1[(0),(1)]
    C_1^(1)[(0),()] := A[(0),(1)] B_1[(1),()]
    C_1[(0),(1)] := C_1[(0),(1)] + Sum_1 C_1^(1)[(0),()]
    Continue with ( C_L | C_R ) <- ( C_0 C_1 | C_2 ), ( B_L | B_R ) <- ( B_0 B_1 | B_2 )
  endwhile

The following loop computes C := AB + C:

  for p = 0, ..., k-1
    a_p[(0),()] <- a_p[(0),(1)]                           (broadcasts within rows)
    b̂_p^T[(),(1)] <- b̂_p^T[(0),(1)]                       (broadcasts within cols)
    C[(0),(1)] := C[(0),(1)] + a_p[(0),()] b̂_p^T[(),(1)]   (local rank-1 updates)
  endfor

While Section 5.6 gives a parallel algorithm for ger, the problem with this algorithm is that (1) it creates a lot of messages and (2) the local computation is a rank-1 update, which inherently does not achieve high performance since it is memory bandwidth bound. The algorithm can be rewritten as

  for p = 0, ..., k-1
    a_p[(0),()] <- a_p[(0),(1)]                           (broadcasts within rows)
  endfor
  for p = 0, ..., k-1
    b̂_p^T[(),(1)] <- b̂_p^T[(0),(1)]                       (broadcasts within cols)
  endfor
  for p = 0, ..., k-1
    C[(0),(1)] := C[(0),(1)] + a_p[(0),()] b̂_p^T[(),(1)]   (local rank-1 updates)
  endfor

and finally, equivalently, as

  A[(0),()] <- A[(0),(1)]                           (allgather within rows)
  B[(),(1)] <- B[(0),(1)]                           (allgather within cols)
  C[(0),(1)] := C[(0),(1)] + A[(0),()] B[(),(1)]     (local rank-k update)

Now the local computation is cast in terms of a local matrix-matrix multiplication (rank-k update), which can achieve high performance. Here (given that we assume an elemental distribution) A[(0),()] <- A[(0),(1)] broadcasts, within each row, k columns of A from different roots: an allgather if an elemental distribution is assumed! Similarly, B[(),(1)] <- B[(0),(1)] broadcasts, within each column, k rows of B from different roots: another allgather if an elemental distribution is assumed!

Based on this observation, the SUMMA-like algorithm can be expressed as a loop around such rank-k updates, as given in Figure 6.2 (left). The purpose of the loop is to reduce the workspace required to store duplicated data. Notice that, if an elemental distribution is assumed, the SUMMA-like algorithm should not be called a broadcast-broadcast-compute algorithm. Instead, it becomes an allgather-allgather-compute algorithm. We will also call it a stationary C algorithm, since C is not communicated (and hence "owner computes" is determined by what processor owns what element of C). The primary benefit from having a loop around rank-k updates is that it reduces the required local workspace at the expense of an increase only in the alpha term of the communication cost.

We label this algorithm eSUMMA2D-C, an elemental SUMMA-like algorithm targeting a two-dimensional mesh of nodes, stationary C variant. It is not hard to extend the insights to non-elemental distributions (as, for example, used by ScaLAPACK or PLAPACK). An approximate cost for the described algorithm is given by

  T_eSUMMA2D-C(m, n, k, d0, d1) = 2 (mnk/p) gamma + (k/b_alg) log2(d1) alpha + ((d1-1)/p) mk beta
                                                  + (k/b_alg) log2(d0) alpha + ((d0-1)/p) nk beta
                                = 2 (mnk/p) gamma + (k/b_alg) log2(p) alpha + ((d1-1)/p) mk beta + ((d0-1)/p) nk beta,

where the last three terms constitute T^+_eSUMMA2D-C(m, n, k, d0, d1). This estimate ignores load imbalance (which leads to a gamma term of the same order as the beta terms) and the fact that the allgathers may be unbalanced if b_alg is not an integer multiple of both d0 and d1. As before and throughout this paper, T^+ refers to the communication overhead of the proposed algorithm (e.g., T^+_eSUMMA2D-C refers to the communication overhead of the eSUMMA2D-C algorithm).

It is not hard to see that, for practical purposes, the weak scalability of the eSUMMA2D-C algorithm mirrors that of the parallel matrix-vector multiplication algorithm analyzed in Appendix A: it is weakly scalable when m = n and d0 = d1, for arbitrary k. (The very slowly growing factor log2(p) prevents weak scalability unless it is treated as a constant.)

At this point it is important to mention that this resulting algorithm may seem similar to an approach described in prior work [1]. Indeed, this allgather-allgather-compute approach to parallel matrix-matrix multiplication is described in that paper
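The allgather-allgather-compute step is easy to validate with the same index gymnastics used earlier. The sketch below (ours, and only a semantic model: the "allgathers" are plain NumPy slices of the global matrices) forms, on each node (s0, s1), the local pieces A[(0),()] and B[(),(1)], performs the local rank-k update, and checks that the local results collectively equal C + AB under the elemental distribution.

  import numpy as np

  m, n, k, d0, d1 = 8, 9, 5, 2, 3
  rng = np.random.default_rng(3)
  A = rng.standard_normal((m, k))
  B = rng.standard_normal((k, n))
  C = rng.standard_normal((m, n))
  C_ref = C + A @ B

  for s0 in range(d0):
      for s1 in range(d1):
          # A[(0),()] on node (s0, s1): its d0-strided rows of A, all k columns
          # (the result of the allgather within its row of nodes).
          A_dup = A[s0::d0, :]
          # B[(),(1)] on node (s0, s1): all k rows of B, its d1-strided columns
          # (the result of the allgather within its column of nodes).
          B_dup = B[:, s1::d1]
          # Local rank-k update of the local part of C; C itself is never moved.
          C_loc = C[s0::d0, s1::d1] + A_dup @ B_dup
          assert np.allclose(C_loc, C_ref[s0::d0, s1::d1])

  print("allgather-allgather-compute reproduces C + A B on every node")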

for the matrix-matrix multiplication variants C = AB, C = AB^T, C = A^T B, and C = A^T B^T, under the assumption that all matrices are approximately the same size (a surmountable limitation). We use FLAME notation to express the algorithm, which has been used in our papers for more than a decade [15]. As we have argued previously, the allgather-allgather-compute approach is particularly well-suited for situations where we wish not to communicate the matrix C. In the next section, we describe how to systematically derive algorithms for situations where we wish to avoid communicating the matrix A.

6.2. Elemental stationary A algorithms (eSUMMA2D-A). Next, we discuss the case C := C + AB, where C and B have n columns each, with n relatively small. For simplicity, we also call that parameter b_alg. We call this a matrix-panel multiplication [13]. We again assume that the matrices are distributed as C[(0),(1)], A[(0),(1)], and B[(0),(1)]. Partition

  C = ( c_0 | c_1 | ... | c_{n-1} )   and   B = ( b_0 | b_1 | ... | b_{n-1} ),

so that c_j = A b_j + c_j. The following loop will compute C = AB + C:

  for j = 0, ..., n-1
    b_j[(0,1),()] <- b_j[(0),(1)]             (scatters within rows)
    b_j[(1,0),()] <- b_j[(0,1),()]            (permutation)
    b_j[(1),()] <- b_j[(1,0),()]              (allgathers within cols)
    c_j[(0),()] := A[(0),(1)] b_j[(1),()]      (local matvec mult)
    c_j[(0),(1)] <- Sum_1 c_j[(0),()]          (reduce-to-one within rows)
  endfor

While Section 5.6 gives a parallel algorithm for gemv, the problem again is that (1) it creates a lot of messages and (2) the local computation is a matrix-vector multiply, which inherently does not achieve high performance since it is memory bandwidth bound. This can be restructured as

  for j = 0, ..., n-1
    b_j[(0,1),()] <- b_j[(0),(1)]             (scatters within rows)
  endfor
  for j = 0, ..., n-1
    b_j[(1,0),()] <- b_j[(0,1),()]            (permutation)
  endfor
  for j = 0, ..., n-1
    b_j[(1),()] <- b_j[(1,0),()]              (allgathers within cols)
  endfor
  for j = 0, ..., n-1
    c_j[(0),()] := A[(0),(1)] b_j[(1),()]      (local matvec mult)
  endfor
  for j = 0, ..., n-1
    c_j[(0),(1)] <- Sum_1 c_j[(0),()]          (simultaneous reduce-to-one within rows)
  endfor

and finally, equivalently, as

  B[(1),()] <- B[(0),(1)]                               (all-to-all within rows, permutation, allgather within cols)
  C[(0),()] := A[(0),(1)] B[(1),()] + C[(0),()]          (simultaneous local matrix multiplications)
  C[(0),(1)] <- C[(0),()]                                (reduce-scatter within rows)

Now the local computation is cast in terms of a local matrix-matrix multiplication (matrix-panel multiply), which can achieve high performance. A stationary A algorithm for arbitrary n can now be expressed as a loop around such parallel matrix-panel multiplies, as given in Figure 6.2 (right).

An approximate cost for the described algorithm is given by

  T_eSUMMA2D-A(m, n, k, d0, d1) = (n/b_alg) log2(d1) alpha + ((d1-1)/d1)(nk/d0) beta          (all-to-all within rows)
                                + (n/b_alg) alpha + (nk/(d1 d0)) beta                          (permutation)
                                + (n/b_alg) log2(d0) alpha + ((d0-1)/d0)(nk/d1) beta           (allgather within cols)
                                + 2 (mnk/p) gamma                                              (simultaneous local matrix-panel mult)
                                + (n/b_alg) log2(d1) alpha + ((d1-1)/d1)(mn/d0) beta + ((d1-1)/d1)(mn/d0) gamma   (reduce-scatter within rows)

As we discussed earlier, the cost function for the all-to-all operation is somewhat suspect. Still, if an algorithm that attains the lower bound for the alpha term is employed, the beta term must at most increase by a factor of log2(d1) [6], meaning that it is not the dominant communication cost. The estimate ignores load imbalance (which leads to a gamma term of the same order as the beta terms) and the fact that various collective communications may be unbalanced if b_alg is not an integer multiple of both d0 and d1.

While the overhead is clearly greater than that of the eSUMMA2D-C algorithm when m = n = k, the overhead is comparable to that of the eSUMMA2D-C algorithm, so the weak scalability results are, asymptotically, the same. Also, it is not hard to see that if m and k are large while n is small, this algorithm achieves better parallelism since less communication is required: the stationary matrix, A, is then the largest matrix, and not communicating it is beneficial. Similarly, if m and n are large while k is small, then the eSUMMA2D-C algorithm does not communicate the largest matrix, C, which is beneficial.

6.3. Communicating submatrices. In Figure 6.1 we illustrate the redistributions of submatrices from one distribution to another and the collective communications required to implement them.

6.4. Other cases. We leave it as an exercise to the reader to propose and analyze the remaining stationary A, B, and C algorithms for the other cases of matrix-matrix multiplication: C := A^T B + C, C := AB^T + C, and C := A^T B^T + C. The point is that we have presented a systematic framework for deriving a family of parallel matrix-matrix multiplication algorithms.

7. Elemental SUMMA: 3D algorithms (eSUMMA3D). We now view the p processors as forming a d0 x d1 x h mesh, which one should visualize as h stacked layers, where each layer consists of a d0 x d1 mesh. The extra dimension will be used to gain an extra level of parallelism, which reduces the overhead of the 2D SUMMA algorithms within each layer at the expense of communications between the layers.

The approach for generalizing elemental SUMMA 2D algorithms to elemental SUMMA 3D algorithms can be easily modified to use Cannon's or Fox's algorithms (with the constraints and complications that come from using those algorithms), or any other distribution for which SUMMA can be used (pretty much any Cartesian distribution). While much of the new innovation for this paper is in this section, we believe the way we built up the intuition of the reader renders most of this new innovation much like a corollary to a theorem.

7.1. 3D stationary C algorithms (eSUMMA3D-C). Partition A and B so that

  A = ( A_0 | A_1 | ... | A_{h-1} )   and   B = ( B_0 ; B_1 ; ... ; B_{h-1} )   (B partitioned by blocks of rows),

where A_H and B_H have approximately k/h columns and rows, respectively. Then

  C + AB = (C + A_0 B_0)  +  (0 + A_1 B_1)  +  ...  +  (0 + A_{h-1} B_{h-1}),
           [by layer 0]      [by layer 1]              [by layer h-1]

This suggests the following 3D algorithm:
- Duplicate C to each of the layers, initializing the duplicates assigned to layers 1 through h-1 to zero. This requires no communication. We will ignore the cost of setting the duplicates to zero.
- Scatter A and B so that layer H receives A_H and B_H. This means that all processors (I, J, 0) simultaneously scatter approximately (m + n)k/(d0 d1) data to processors (I, J, 0) through (I, J, h-1). The cost of such a scatter can be approximated by

(7.1)  log2(h) alpha + ((h-1)/h) ((m+n)k/(d0 d1)) beta = log2(h) alpha + ((h-1)(m+n)k/p) beta.

- Compute C := C + A_H B_H simultaneously on all the layers. If eSUMMA2D-C is used for this in each layer, the cost is approximated by

(7.2)  2 (mnk/p) gamma + (k/(h b_alg)) (log2(p) - log2(h)) alpha + ((d1-1) mk/p) beta + ((d0-1) nk/p) beta,

where the communication terms constitute T^+_eSUMMA2D-C(m, n, k/h, d0, d1).
- Perform reduce operations to sum the contributions from the different layers to the copy of C in layer 0. This means that contributions from processors (I, J, 0) through (I, J, h-1) are reduced to processor (I, J, 0). An estimate for this reduce(-to-one) is

(7.3)  log2(h) alpha + (mn/(d0 d1)) beta + (mn/(d0 d1)) gamma = log2(h) alpha + (mnh/p) beta + (mnh/p) gamma.

Thus, an estimate for the total cost of this eSUMMA3D-C algorithm for this case of gemm results from adding (7.1)-(7.3). Let us analyze the case where m = n = k and d0 = d1 = sqrt(p/h) in detail. The cost becomes

  C_eSUMMA3D-C(n, n, n, d0, d0, h) = 2 (n^3/p) gamma + (n/(h b_alg)) (log2(p) - log2(h)) alpha + 2 (d0-1)(n^2/p) beta
                                     + log2(h) alpha + (h-1)(n^2/p) beta + ...
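The layer-by-layer splitting that drives eSUMMA3D-C is itself a purely algebraic identity that can be checked directly. In the sketch below (ours), A is split into h column blocks and B into h row blocks, each "layer" computes its own product starting from its duplicate of C, and a final reduction over layers recovers C + AB; the communication (the scatter of A and B and the reduce of the per-layer copies of C) is not modeled.

  import numpy as np

  m, n, k, h = 6, 7, 12, 3            # h = number of layers; k divisible by h here
  rng = np.random.default_rng(4)
  A = rng.standard_normal((m, k))
  B = rng.standard_normal((k, n))
  C = rng.standard_normal((m, n))
  C_ref = C + A @ B

  kb = k // h
  layer_results = []
  for H in range(h):
      A_H = A[:, H * kb:(H + 1) * kb]               # column block of A sent to layer H
      B_H = B[H * kb:(H + 1) * kb, :]               # row block of B sent to layer H
      C_H = C.copy() if H == 0 else np.zeros((m, n))   # duplicate of C; zero on layers 1..h-1
      layer_results.append(C_H + A_H @ B_H)         # each layer runs its own 2D algorithm

  C_out = sum(layer_results)                        # reduce over layers back to layer 0
  assert np.allclose(C_out, C_ref)
  print("summing the per-layer results reproduces C + A B")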


More information

Hidden Predictors: A Factor Analysis Primer

Hidden Predictors: A Factor Analysis Primer Hidden Predictors: A Factor Analysis Primer Ryan C Sanchez Western Washington University Factor Analysis is a owerful statistical method in the modern research sychologist s toolbag When used roerly, factor

More information

RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES

RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES AARON ZWIEBACH Abstract. In this aer we will analyze research that has been recently done in the field of discrete

More information

Reconstructing Householder Vectors from Tall-Skinny QR

Reconstructing Householder Vectors from Tall-Skinny QR Reconstructing Householder Vectors from Tall-Skinny QR Grey Ballard James Demmel Laura Grigori Mathias Jacquelin Hong Die Nguyen Edgar Solomonik Electrical Engineering and Comuter Sciences University of

More information

Eigenanalysis of Finite Element 3D Flow Models by Parallel Jacobi Davidson

Eigenanalysis of Finite Element 3D Flow Models by Parallel Jacobi Davidson Eigenanalysis of Finite Element 3D Flow Models by Parallel Jacobi Davidson Luca Bergamaschi 1, Angeles Martinez 1, Giorgio Pini 1, and Flavio Sartoretto 2 1 Diartimento di Metodi e Modelli Matematici er

More information

1. INTRODUCTION. Fn 2 = F j F j+1 (1.1)

1. INTRODUCTION. Fn 2 = F j F j+1 (1.1) CERTAIN CLASSES OF FINITE SUMS THAT INVOLVE GENERALIZED FIBONACCI AND LUCAS NUMBERS The beautiful identity R.S. Melham Deartment of Mathematical Sciences, University of Technology, Sydney PO Box 23, Broadway,

More information

Positive decomposition of transfer functions with multiple poles

Positive decomposition of transfer functions with multiple poles Positive decomosition of transfer functions with multile oles Béla Nagy 1, Máté Matolcsi 2, and Márta Szilvási 1 Deartment of Analysis, Technical University of Budaest (BME), H-1111, Budaest, Egry J. u.

More information

Outline. EECS150 - Digital Design Lecture 26 Error Correction Codes, Linear Feedback Shift Registers (LFSRs) Simple Error Detection Coding

Outline. EECS150 - Digital Design Lecture 26 Error Correction Codes, Linear Feedback Shift Registers (LFSRs) Simple Error Detection Coding Outline EECS150 - Digital Design Lecture 26 Error Correction Codes, Linear Feedback Shift Registers (LFSRs) Error detection using arity Hamming code for error detection/correction Linear Feedback Shift

More information

q-ary Symmetric Channel for Large q

q-ary Symmetric Channel for Large q List-Message Passing Achieves Caacity on the q-ary Symmetric Channel for Large q Fan Zhang and Henry D Pfister Deartment of Electrical and Comuter Engineering, Texas A&M University {fanzhang,hfister}@tamuedu

More information

Sets of Real Numbers

Sets of Real Numbers Chater 4 Sets of Real Numbers 4. The Integers Z and their Proerties In our revious discussions about sets and functions the set of integers Z served as a key examle. Its ubiquitousness comes from the fact

More information

An Investigation on the Numerical Ill-conditioning of Hybrid State Estimators

An Investigation on the Numerical Ill-conditioning of Hybrid State Estimators An Investigation on the Numerical Ill-conditioning of Hybrid State Estimators S. K. Mallik, Student Member, IEEE, S. Chakrabarti, Senior Member, IEEE, S. N. Singh, Senior Member, IEEE Deartment of Electrical

More information

arxiv: v2 [quant-ph] 2 Aug 2012

arxiv: v2 [quant-ph] 2 Aug 2012 Qcomiler: quantum comilation with CSD method Y. G. Chen a, J. B. Wang a, a School of Physics, The University of Western Australia, Crawley WA 6009 arxiv:208.094v2 [quant-h] 2 Aug 202 Abstract In this aer,

More information

5. PRESSURE AND VELOCITY SPRING Each component of momentum satisfies its own scalar-transport equation. For one cell:

5. PRESSURE AND VELOCITY SPRING Each component of momentum satisfies its own scalar-transport equation. For one cell: 5. PRESSURE AND VELOCITY SPRING 2019 5.1 The momentum equation 5.2 Pressure-velocity couling 5.3 Pressure-correction methods Summary References Examles 5.1 The Momentum Equation Each comonent of momentum

More information

c Copyright by Helen J. Elwood December, 2011

c Copyright by Helen J. Elwood December, 2011 c Coyright by Helen J. Elwood December, 2011 CONSTRUCTING COMPLEX EQUIANGULAR PARSEVAL FRAMES A Dissertation Presented to the Faculty of the Deartment of Mathematics University of Houston In Partial Fulfillment

More information

Multilayer Perceptron Neural Network (MLPs) For Analyzing the Properties of Jordan Oil Shale

Multilayer Perceptron Neural Network (MLPs) For Analyzing the Properties of Jordan Oil Shale World Alied Sciences Journal 5 (5): 546-552, 2008 ISSN 1818-4952 IDOSI Publications, 2008 Multilayer Percetron Neural Network (MLPs) For Analyzing the Proerties of Jordan Oil Shale 1 Jamal M. Nazzal, 2

More information

Multi-Operation Multi-Machine Scheduling

Multi-Operation Multi-Machine Scheduling Multi-Oeration Multi-Machine Scheduling Weizhen Mao he College of William and Mary, Williamsburg VA 3185, USA Abstract. In the multi-oeration scheduling that arises in industrial engineering, each job

More information

Cryptanalysis of Pseudorandom Generators

Cryptanalysis of Pseudorandom Generators CSE 206A: Lattice Algorithms and Alications Fall 2017 Crytanalysis of Pseudorandom Generators Instructor: Daniele Micciancio UCSD CSE As a motivating alication for the study of lattice in crytograhy we

More information

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Technical Sciences and Alied Mathematics MODELING THE RELIABILITY OF CISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Cezar VASILESCU Regional Deartment of Defense Resources Management

More information

Towards understanding the Lorenz curve using the Uniform distribution. Chris J. Stephens. Newcastle City Council, Newcastle upon Tyne, UK

Towards understanding the Lorenz curve using the Uniform distribution. Chris J. Stephens. Newcastle City Council, Newcastle upon Tyne, UK Towards understanding the Lorenz curve using the Uniform distribution Chris J. Stehens Newcastle City Council, Newcastle uon Tyne, UK (For the Gini-Lorenz Conference, University of Siena, Italy, May 2005)

More information

Unit 1 - Computer Arithmetic

Unit 1 - Computer Arithmetic FIXD-POINT (FX) ARITHMTIC Unit 1 - Comuter Arithmetic INTGR NUMBRS n bit number: b n 1 b n 2 b 0 Decimal Value Range of values UNSIGND n 1 SIGND D = b i 2 i D = 2 n 1 b n 1 + b i 2 i n 2 i=0 i=0 [0, 2

More information

Multiplicative group law on the folium of Descartes

Multiplicative group law on the folium of Descartes Multilicative grou law on the folium of Descartes Steluţa Pricoie and Constantin Udrişte Abstract. The folium of Descartes is still studied and understood today. Not only did it rovide for the roof of

More information

DMS: Distributed Sparse Tensor Factorization with Alternating Least Squares

DMS: Distributed Sparse Tensor Factorization with Alternating Least Squares DMS: Distributed Sarse Tensor Factorization with Alternating Least Squares Shaden Smith, George Karyis Deartment of Comuter Science and Engineering, University of Minnesota {shaden, karyis}@cs.umn.edu

More information

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO)

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO) Combining Logistic Regression with Kriging for Maing the Risk of Occurrence of Unexloded Ordnance (UXO) H. Saito (), P. Goovaerts (), S. A. McKenna (2) Environmental and Water Resources Engineering, Deartment

More information

RUN-TO-RUN CONTROL AND PERFORMANCE MONITORING OF OVERLAY IN SEMICONDUCTOR MANUFACTURING. 3 Department of Chemical Engineering

RUN-TO-RUN CONTROL AND PERFORMANCE MONITORING OF OVERLAY IN SEMICONDUCTOR MANUFACTURING. 3 Department of Chemical Engineering Coyright 2002 IFAC 15th Triennial World Congress, Barcelona, Sain RUN-TO-RUN CONTROL AND PERFORMANCE MONITORING OF OVERLAY IN SEMICONDUCTOR MANUFACTURING C.A. Bode 1, B.S. Ko 2, and T.F. Edgar 3 1 Advanced

More information

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests 009 American Control Conference Hyatt Regency Riverfront, St. Louis, MO, USA June 0-, 009 FrB4. System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests James C. Sall Abstract

More information

State Estimation with ARMarkov Models

State Estimation with ARMarkov Models Deartment of Mechanical and Aerosace Engineering Technical Reort No. 3046, October 1998. Princeton University, Princeton, NJ. State Estimation with ARMarkov Models Ryoung K. Lim 1 Columbia University,

More information

VIBRATION ANALYSIS OF BEAMS WITH MULTIPLE CONSTRAINED LAYER DAMPING PATCHES

VIBRATION ANALYSIS OF BEAMS WITH MULTIPLE CONSTRAINED LAYER DAMPING PATCHES Journal of Sound and Vibration (998) 22(5), 78 85 VIBRATION ANALYSIS OF BEAMS WITH MULTIPLE CONSTRAINED LAYER DAMPING PATCHES Acoustics and Dynamics Laboratory, Deartment of Mechanical Engineering, The

More information

Notes on Instrumental Variables Methods

Notes on Instrumental Variables Methods Notes on Instrumental Variables Methods Michele Pellizzari IGIER-Bocconi, IZA and frdb 1 The Instrumental Variable Estimator Instrumental variable estimation is the classical solution to the roblem of

More information

Modeling and Estimation of Full-Chip Leakage Current Considering Within-Die Correlation

Modeling and Estimation of Full-Chip Leakage Current Considering Within-Die Correlation 6.3 Modeling and Estimation of Full-Chi Leaage Current Considering Within-Die Correlation Khaled R. eloue, Navid Azizi, Farid N. Najm Deartment of ECE, University of Toronto,Toronto, Ontario, Canada {haled,nazizi,najm}@eecg.utoronto.ca

More information

Factorability in the ring Z[ 5]

Factorability in the ring Z[ 5] University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Dissertations, Theses, and Student Research Paers in Mathematics Mathematics, Deartment of 4-2004 Factorability in the ring

More information

8 STOCHASTIC PROCESSES

8 STOCHASTIC PROCESSES 8 STOCHASTIC PROCESSES The word stochastic is derived from the Greek στoχαστικoς, meaning to aim at a target. Stochastic rocesses involve state which changes in a random way. A Markov rocess is a articular

More information

The Graph Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule

The Graph Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule The Grah Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule STEFAN D. BRUDA Deartment of Comuter Science Bisho s University Lennoxville, Quebec J1M 1Z7 CANADA bruda@cs.ubishos.ca

More information

MODEL-BASED MULTIPLE FAULT DETECTION AND ISOLATION FOR NONLINEAR SYSTEMS

MODEL-BASED MULTIPLE FAULT DETECTION AND ISOLATION FOR NONLINEAR SYSTEMS MODEL-BASED MULIPLE FAUL DEECION AND ISOLAION FOR NONLINEAR SYSEMS Ivan Castillo, and homas F. Edgar he University of exas at Austin Austin, X 78712 David Hill Chemstations Houston, X 77009 Abstract A

More information

Probability Estimates for Multi-class Classification by Pairwise Coupling

Probability Estimates for Multi-class Classification by Pairwise Coupling Probability Estimates for Multi-class Classification by Pairwise Couling Ting-Fan Wu Chih-Jen Lin Deartment of Comuter Science National Taiwan University Taiei 06, Taiwan Ruby C. Weng Deartment of Statistics

More information

Solution sheet ξi ξ < ξ i+1 0 otherwise ξ ξ i N i,p 1 (ξ) + where 0 0

Solution sheet ξi ξ < ξ i+1 0 otherwise ξ ξ i N i,p 1 (ξ) + where 0 0 Advanced Finite Elements MA5337 - WS7/8 Solution sheet This exercise sheets deals with B-slines and NURBS, which are the basis of isogeometric analysis as they will later relace the olynomial ansatz-functions

More information

An Analysis of TCP over Random Access Satellite Links

An Analysis of TCP over Random Access Satellite Links An Analysis of over Random Access Satellite Links Chunmei Liu and Eytan Modiano Massachusetts Institute of Technology Cambridge, MA 0239 Email: mayliu, modiano@mit.edu Abstract This aer analyzes the erformance

More information

ON THE LEAST SIGNIFICANT p ADIC DIGITS OF CERTAIN LUCAS NUMBERS

ON THE LEAST SIGNIFICANT p ADIC DIGITS OF CERTAIN LUCAS NUMBERS #A13 INTEGERS 14 (014) ON THE LEAST SIGNIFICANT ADIC DIGITS OF CERTAIN LUCAS NUMBERS Tamás Lengyel Deartment of Mathematics, Occidental College, Los Angeles, California lengyel@oxy.edu Received: 6/13/13,

More information

arxiv: v1 [physics.data-an] 26 Oct 2012

arxiv: v1 [physics.data-an] 26 Oct 2012 Constraints on Yield Parameters in Extended Maximum Likelihood Fits Till Moritz Karbach a, Maximilian Schlu b a TU Dortmund, Germany, moritz.karbach@cern.ch b TU Dortmund, Germany, maximilian.schlu@cern.ch

More information

The Noise Power Ratio - Theory and ADC Testing

The Noise Power Ratio - Theory and ADC Testing The Noise Power Ratio - Theory and ADC Testing FH Irons, KJ Riley, and DM Hummels Abstract This aer develos theory behind the noise ower ratio (NPR) testing of ADCs. A mid-riser formulation is used for

More information

Statics and dynamics: some elementary concepts

Statics and dynamics: some elementary concepts 1 Statics and dynamics: some elementary concets Dynamics is the study of the movement through time of variables such as heartbeat, temerature, secies oulation, voltage, roduction, emloyment, rices and

More information

Chapter 3 - From Gaussian Elimination to LU Factorization

Chapter 3 - From Gaussian Elimination to LU Factorization Chapter 3 - From Gaussian Elimination to LU Factorization Maggie Myers Robert A. van de Geijn The University of Texas at Austin Practical Linear Algebra Fall 29 http://z.cs.utexas.edu/wiki/pla.wiki/ 1

More information

Scaling Multiple Point Statistics for Non-Stationary Geostatistical Modeling

Scaling Multiple Point Statistics for Non-Stationary Geostatistical Modeling Scaling Multile Point Statistics or Non-Stationary Geostatistical Modeling Julián M. Ortiz, Steven Lyster and Clayton V. Deutsch Centre or Comutational Geostatistics Deartment o Civil & Environmental Engineering

More information

Chapter 7 Rational and Irrational Numbers

Chapter 7 Rational and Irrational Numbers Chater 7 Rational and Irrational Numbers In this chater we first review the real line model for numbers, as discussed in Chater 2 of seventh grade, by recalling how the integers and then the rational numbers

More information

Elementary Analysis in Q p

Elementary Analysis in Q p Elementary Analysis in Q Hannah Hutter, May Szedlák, Phili Wirth November 17, 2011 This reort follows very closely the book of Svetlana Katok 1. 1 Sequences and Series In this section we will see some

More information

Generalized analysis method of engine suspensions based on bond graph modeling and feedback control theory

Generalized analysis method of engine suspensions based on bond graph modeling and feedback control theory Generalized analysis method of ine susensions based on bond grah modeling and feedback ntrol theory P.Y. RICHARD *, C. RAMU-ERMENT **, J. BUION *, X. MOREAU **, M. LE FOL *** *Equie Automatique des ystèmes

More information

p-adic Measures and Bernoulli Numbers

p-adic Measures and Bernoulli Numbers -Adic Measures and Bernoulli Numbers Adam Bowers Introduction The constants B k in the Taylor series exansion t e t = t k B k k! k=0 are known as the Bernoulli numbers. The first few are,, 6, 0, 30, 0,

More information

End-to-End Delay Minimization in Thermally Constrained Distributed Systems

End-to-End Delay Minimization in Thermally Constrained Distributed Systems End-to-End Delay Minimization in Thermally Constrained Distributed Systems Pratyush Kumar, Lothar Thiele Comuter Engineering and Networks Laboratory (TIK) ETH Zürich, Switzerland {ratyush.kumar, lothar.thiele}@tik.ee.ethz.ch

More information

Approximating min-max k-clustering

Approximating min-max k-clustering Aroximating min-max k-clustering Asaf Levin July 24, 2007 Abstract We consider the roblems of set artitioning into k clusters with minimum total cost and minimum of the maximum cost of a cluster. The cost

More information

Improved Capacity Bounds for the Binary Energy Harvesting Channel

Improved Capacity Bounds for the Binary Energy Harvesting Channel Imroved Caacity Bounds for the Binary Energy Harvesting Channel Kaya Tutuncuoglu 1, Omur Ozel 2, Aylin Yener 1, and Sennur Ulukus 2 1 Deartment of Electrical Engineering, The Pennsylvania State University,

More information

Topic 7: Using identity types

Topic 7: Using identity types Toic 7: Using identity tyes June 10, 2014 Now we would like to learn how to use identity tyes and how to do some actual mathematics with them. By now we have essentially introduced all inference rules

More information

A Closed-Form Solution to the Minimum V 2

A Closed-Form Solution to the Minimum V 2 Celestial Mechanics and Dynamical Astronomy manuscrit No. (will be inserted by the editor) Martín Avendaño Daniele Mortari A Closed-Form Solution to the Minimum V tot Lambert s Problem Received: Month

More information

Fault Tolerant Quantum Computing Robert Rogers, Thomas Sylwester, Abe Pauls

Fault Tolerant Quantum Computing Robert Rogers, Thomas Sylwester, Abe Pauls CIS 410/510, Introduction to Quantum Information Theory Due: June 8th, 2016 Sring 2016, University of Oregon Date: June 7, 2016 Fault Tolerant Quantum Comuting Robert Rogers, Thomas Sylwester, Abe Pauls

More information

An Analysis of Reliable Classifiers through ROC Isometrics

An Analysis of Reliable Classifiers through ROC Isometrics An Analysis of Reliable Classifiers through ROC Isometrics Stijn Vanderlooy s.vanderlooy@cs.unimaas.nl Ida G. Srinkhuizen-Kuyer kuyer@cs.unimaas.nl Evgueni N. Smirnov smirnov@cs.unimaas.nl MICC-IKAT, Universiteit

More information

Convex Optimization methods for Computing Channel Capacity

Convex Optimization methods for Computing Channel Capacity Convex Otimization methods for Comuting Channel Caacity Abhishek Sinha Laboratory for Information and Decision Systems (LIDS), MIT sinhaa@mit.edu May 15, 2014 We consider a classical comutational roblem

More information

Matching Partition a Linked List and Its Optimization

Matching Partition a Linked List and Its Optimization Matching Partition a Linked List and Its Otimization Yijie Han Deartment of Comuter Science University of Kentucky Lexington, KY 40506 ABSTRACT We show the curve O( n log i + log (i) n + log i) for the

More information

ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM

ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM JOHN BINDER Abstract. In this aer, we rove Dirichlet s theorem that, given any air h, k with h, k) =, there are infinitely many rime numbers congruent to

More information

DIFFERENTIAL GEOMETRY. LECTURES 9-10,

DIFFERENTIAL GEOMETRY. LECTURES 9-10, DIFFERENTIAL GEOMETRY. LECTURES 9-10, 23-26.06.08 Let us rovide some more details to the definintion of the de Rham differential. Let V, W be two vector bundles and assume we want to define an oerator

More information

CHAPTER 5 TANGENT VECTORS

CHAPTER 5 TANGENT VECTORS CHAPTER 5 TANGENT VECTORS In R n tangent vectors can be viewed from two ersectives (1) they cature the infinitesimal movement along a ath, the direction, and () they oerate on functions by directional

More information

Solved Problems. (a) (b) (c) Figure P4.1 Simple Classification Problems First we draw a line between each set of dark and light data points.

Solved Problems. (a) (b) (c) Figure P4.1 Simple Classification Problems First we draw a line between each set of dark and light data points. Solved Problems Solved Problems P Solve the three simle classification roblems shown in Figure P by drawing a decision boundary Find weight and bias values that result in single-neuron ercetrons with the

More information

Information collection on a graph

Information collection on a graph Information collection on a grah Ilya O. Ryzhov Warren Powell February 10, 2010 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements

More information

CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules

CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules. Introduction: The is widely used in industry to monitor the number of fraction nonconforming units. A nonconforming unit is

More information

Participation Factors. However, it does not give the influence of each state on the mode.

Participation Factors. However, it does not give the influence of each state on the mode. Particiation Factors he mode shae, as indicated by the right eigenvector, gives the relative hase of each state in a articular mode. However, it does not give the influence of each state on the mode. We

More information

arxiv: v1 [quant-ph] 3 Feb 2015

arxiv: v1 [quant-ph] 3 Feb 2015 From reversible comutation to quantum comutation by Lagrange interolation Alexis De Vos and Stin De Baerdemacker 2 arxiv:502.0089v [quant-h] 3 Feb 205 Cmst, Imec v.z.w., vakgroe elektronica en informatiesystemen,

More information

Characteristics of Beam-Based Flexure Modules

Characteristics of Beam-Based Flexure Modules Shorya Awtar e-mail: shorya@mit.edu Alexander H. Slocum e-mail: slocum@mit.edu Precision Engineering Research Grou, Massachusetts Institute of Technology, Cambridge, MA 039 Edi Sevincer Omega Advanced

More information

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek Use of Transformations and the Reeated Statement in PROC GLM in SAS Ed Stanek Introduction We describe how the Reeated Statement in PROC GLM in SAS transforms the data to rovide tests of hyotheses of interest.

More information

Averaging sums of powers of integers and Faulhaber polynomials

Averaging sums of powers of integers and Faulhaber polynomials Annales Mathematicae et Informaticae 42 (20. 0 htt://ami.ektf.hu Averaging sums of owers of integers and Faulhaber olynomials José Luis Cereceda a a Distrito Telefónica Madrid Sain jl.cereceda@movistar.es

More information

STA 250: Statistics. Notes 7. Bayesian Approach to Statistics. Book chapters: 7.2

STA 250: Statistics. Notes 7. Bayesian Approach to Statistics. Book chapters: 7.2 STA 25: Statistics Notes 7. Bayesian Aroach to Statistics Book chaters: 7.2 1 From calibrating a rocedure to quantifying uncertainty We saw that the central idea of classical testing is to rovide a rigorous

More information

Introduction to MVC. least common denominator of all non-identical-zero minors of all order of G(s). Example: The minor of order 2: 1 2 ( s 1)

Introduction to MVC. least common denominator of all non-identical-zero minors of all order of G(s). Example: The minor of order 2: 1 2 ( s 1) Introduction to MVC Definition---Proerness and strictly roerness A system G(s) is roer if all its elements { gij ( s)} are roer, and strictly roer if all its elements are strictly roer. Definition---Causal

More information

A Qualitative Event-based Approach to Multiple Fault Diagnosis in Continuous Systems using Structural Model Decomposition

A Qualitative Event-based Approach to Multiple Fault Diagnosis in Continuous Systems using Structural Model Decomposition A Qualitative Event-based Aroach to Multile Fault Diagnosis in Continuous Systems using Structural Model Decomosition Matthew J. Daigle a,,, Anibal Bregon b,, Xenofon Koutsoukos c, Gautam Biswas c, Belarmino

More information

A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split

A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split A Bound on the Error of Cross Validation Using the Aroximation and Estimation Rates, with Consequences for the Training-Test Slit Michael Kearns AT&T Bell Laboratories Murray Hill, NJ 7974 mkearns@research.att.com

More information

Meshless Methods for Scientific Computing Final Project

Meshless Methods for Scientific Computing Final Project Meshless Methods for Scientific Comuting Final Project D0051008 洪啟耀 Introduction Floating island becomes an imortant study in recent years, because the lands we can use are limit, so eole start thinking

More information

The non-stochastic multi-armed bandit problem

The non-stochastic multi-armed bandit problem Submitted for journal ublication. The non-stochastic multi-armed bandit roblem Peter Auer Institute for Theoretical Comuter Science Graz University of Technology A-8010 Graz (Austria) auer@igi.tu-graz.ac.at

More information

Mobility-Induced Service Migration in Mobile. Micro-Clouds

Mobility-Induced Service Migration in Mobile. Micro-Clouds arxiv:503054v [csdc] 7 Mar 205 Mobility-Induced Service Migration in Mobile Micro-Clouds Shiiang Wang, Rahul Urgaonkar, Ting He, Murtaza Zafer, Kevin Chan, and Kin K LeungTime Oerating after ossible Deartment

More information

On the Toppling of a Sand Pile

On the Toppling of a Sand Pile Discrete Mathematics and Theoretical Comuter Science Proceedings AA (DM-CCG), 2001, 275 286 On the Toling of a Sand Pile Jean-Christohe Novelli 1 and Dominique Rossin 2 1 CNRS, LIFL, Bâtiment M3, Université

More information

Neural network models for river flow forecasting

Neural network models for river flow forecasting Neural network models for river flow forecasting Nguyen Tan Danh Huynh Ngoc Phien* and Ashim Das Guta Asian Institute of Technology PO Box 4 KlongLuang 220 Pathumthani Thailand Abstract In this study back

More information

Plotting the Wilson distribution

Plotting the Wilson distribution , Survey of English Usage, University College London Setember 018 1 1. Introduction We have discussed the Wilson score interval at length elsewhere (Wallis 013a, b). Given an observed Binomial roortion

More information

Construction of High-Girth QC-LDPC Codes

Construction of High-Girth QC-LDPC Codes MITSUBISHI ELECTRIC RESEARCH LABORATORIES htt://wwwmerlcom Construction of High-Girth QC-LDPC Codes Yige Wang, Jonathan Yedidia, Stark Draer TR2008-061 Setember 2008 Abstract We describe a hill-climbing

More information

A communication-avoiding parallel algorithm for the symmetric eigenvalue problem

A communication-avoiding parallel algorithm for the symmetric eigenvalue problem A communication-avoiding arallel algorithm for the symmetric eigenvalue roblem Edgar Solomonik ETH Zurich Email: solomonik@inf.ethz.ch Grey Ballard Sandia National Laboratory Email: gmballa@sandia.gov

More information