Parallel Sparse Matrix Vector Multiplication (PSC 4.3)

Size: px

Start display at page:

Download "Parallel Sparse Matrix Vector Multiplication (PSC 4.3)"

Edwina Richardson
6 years ago
Views:

1 Parallel Sparse Matrix Vector Multiplication (PSC 4.) original slides by Rob Bisseling, Universiteit Utrecht, accompanying the textbook Parallel Scientific Computation adapted for the lecture HPC Algorithms and Applications, Michael Bader, TUM, Winter 06/7 / 0

2 How to build a distributed sparse-matrix data structure Basically two options:. Start with a global data structure for the sparse matrix; then add&distribute elements over the processors. (Logically) distribute the matrix first, and then let each processor represent the local nonzeros Preferred approach: distribute first This assigns subsets of nonzeros to processors. The subsets form a partitioning of the nonzero set: subsets are disjoint and together they contain all nonzeros. Sequential sparse data structures can be used. Simple operations remain simple: insertion and deletion are local operations without communication. / 0

3 Most general matrix distribution Crucial question for distribute first approach: how to define partitioning of the matrix elements? The most general scheme maps nonzeros to processors, a ij P(φ(i, j)), for 0 i, j < n and a ij 0, where 0 φ(i, j) < p. Notation: a ij is stored on the processor given by φ(i, j). Zeros are not assigned to processors. For convenience, we define φ(i, j) = if a ij = 0. Here, we use a D processor numbering. / 0

4 Example: Distribution of matrix and vectors Global view Local view 4 v u A 8 P(0) P() (a) (b) p =, n =, nz =. P(0) grey cells, P() black cells. 4 / 0

5 Where to compute a ij v j? Usually, there are many more nonzeros than vector components, nz(a) n, so move the vector components v j to the nonzeros a ij, not vice versa. Accumulate/Add local products a ij v j belonging to the same row i. Result on P(s) is the local contribution u is to u i Result u is is sent to the owner of u i. Thus, we do not communicate elements of A, but only components of v and contributions to components of u. / 0

6 How to distribute the vectors? In many iterative solvers, the same vector is repeatedly multiplied by a matrix A. Usually, a few vector operations are interspersed, such as DAXPYs y := αx + y or inner products α := x T y. All vectors should then be distributed in the same way: distr(u) = distr(v) for the operation u := Av. Sometimes, however, we compute A T Av. The output vector for A is then taken as the input vector for A T. Now, we do not need to revert immediately to the input distribution. We allow distr(u) distr(v). We map vector components to processors by u i P(φ u (i)), for 0 i < n. 6 / 0

7 Deriving a parallel algorithm Once we have chosen the data distribution and decided to compute the products a ij v j on the processor that owns a ij, the parallel algorithm follows naturally. The main computation for processor P(s) is multiplying each local nonzero element a ij by v j and adding the result into a local partial sum, u is = a ij v j. 0 j<n, φ(i,j)=s Sparsity is exploited: only terms with a ij 0 are summed; only local partial sums uis are computed for which {j : 0 j < n φ(i, j) = s}. 7 / 0

8 Row index set 4 v u A 8 P(0) P() (a) (b) On P(0), row 4 is empty. On P(), row is empty. Index set I s of rows that are locally nonempty in P(s) is I s = {i : 0 i < n ( j : 0 j < n φ(i, j) = s)} I 0 = {0,,, } and I = {0,,, 4}. 8 / 0

9 Local sparse matrix vector multiplication I s = {i : 0 i < n ( j : 0 j < n φ(i, j) = s)} () { Local sparse matrix vector multiplication } for all i I s do u is := 0; for all j : 0 j < n φ(i, j) = s do u is := u is + a ij v j ; / 0

10 Data structure for local sparse matrix Compressed row storage (CRS) suits row-oriented local matrix vector multiplication. CRS must be adapted to avoid overhead of many empty rows, which typically occurs if c p. We number the nonempty local rows from 0 to I s. The corresponding indices i are the local indices. The original global indices from the set I s are stored in increasing order in an array rowindex of length I s : i = rowindex[i]. Address of first local nonzero of row i is start[i]. 0 / 0

11 Column index set 4 v u A 8 P(0) P() (a) (b) Index set J s of columns that are locally nonempty in P(s) is J s = {j : 0 j < n ( i : 0 i < n φ(i, j) = s)} J 0 = {0,, } and J = {,, 4}. / 0

12 Fanout J s = {j : 0 j < n ( i : 0 i < n φ(i, j) = s)} (0) { Fanout } for all j J s do get v j from P(φ v (j)); The receiver knows (from its index set J s ) that it needs the vector component v j. The sender is unaware of this. Thus, the receiver needs to initiate the communication use get primitive. Compare dense algorithms: communication patterns are predictable and thus known to every processor, so that we may use get, put or send/receive primitives. / 0

13 Fanin and final summation () { Fanin } for all i I s do put u is in P(φ u (i)); () { Summation of nonzero partial sums } for all i : 0 i < n φ u (i) = s do u i := 0; for all t : 0 t < p u it 0 do u i := u i + u it ; / 0

14 Communication 4 v u A Vertical arrows: communication of components v j. v 0 is sent from P() to P(0), because of the nonzeros a 0 = 4 and a 0 = 6 owned by P(0). v, v, v 4 need not be sent. Horizontal arrows: communication of partial sums u is. 4 / 0

15 Cost analysis What is the computational/parallel complexity of this algorithm? Answer depends on matrix A and distributions φ, φ v, φ u. We can obtain an upper bound on the BSP cost, assuming: the matrix nonzeros are evenly spread over the processors, each processor having cn p nonzeros; the vector components are also evenly spread, each processor having n p components. Bound may be far too pessimistic for particular distributions that are well-tailored to the matrix. / 0

16 Cost of separate supersteps (0): Cost for get s: P(s) must receive at most all n components v j, but not the n p local components, so that h r = n n p. n Cost for remote get s that affect P(s): p local components must be sent to all the other p processors, thus h s = n p (p ). T (0) = ( )ng + l. p (): flops are needed for each local nonzero. T () = cn p + l. 6 / 0

17 Cost of separate supersteps (cont d) (): Similar to (0). T () = ( )ng + l. p (): Each of the n p local vector components is computed by adding at most p partial sums. Total BSP cost is bounded by T () = n + l. T MV cn p + n + ( )ng + 4l. p 7 / 0

18 Efficient computation T MV cn p + n + ( )ng + 4l. p Computation is efficient if cn p > ng, i.e., c > pg. But this happens rarely: only for very dense matrices. To make the computation efficient for smaller c, we can: use a Cartesian distribution (see later) and exploit its D nature; refine the general distribution using an automatic procedure to detect the underlying matrix structure; exploit properties of specific matrix classes, such as random sparse matrices and Laplacian matrices. 8 / 0

19 Communication volume Communication volume V of an algorithm is the total number of data words sent. It depends on φ, φ u, φ v. For a given φ, obtain a lower bound V φ on V by counting: the number of processors pi that have a nonzero a ij in matrix row i; at least p i processors must send a contribution u is. the number of processors qj that have a nonzero a ij in matrix column j. V φ = (p i ) + (q j ). 0 i<n, p i 0 j<n, q j An upper bound is V φ + n, because in the worst case all n components u i are owned by processors without a nonzero in row i, and similar for the components v j. / 0

20 Summary Distribute first, represent later. Most general mapping of nonzeros and vector components to processors: a ij P(φ(i, j)), for 0 i, j < n and a ij 0, u i P(φ u (i)), for 0 i < n. We have derived a parallel algorithm with 4 supersteps: fanout, local matrix vector multiplication, fanin, summation of partial sums. The row index set I s and column index set J s are used for exploiting the sparsity in the algorithm. We encountered an absolutely necessary use of a get primitive. 0 / 0

Parallel LU Decomposition (PSC 2.3) Lecture 2.3 Parallel LU

Parallel LU Decomposition (PSC 2.3) 1 / 20 Designing a parallel algorithm Main question: how to distribute the data? What data? The matrix A and the permutation π. Data distribution + sequential algorithm