PARALLEL MATRIX MULTIPLICATION: A SYSTEMATIC JOURNEY


PARALLEL MATRIX MULTIPLICATION: A SYSTEMATIC JOURNEY

MARTIN D. SCHATZ, ROBERT A. VAN DE GEIJN, AND JACK POULSON

Abstract. We expose a systematic approach for developing distributed memory parallel matrix-matrix multiplication algorithms. The journey starts with a description of how matrices are distributed to meshes of nodes (e.g., MPI processes), relates these distributions to scalable parallel implementation of matrix-vector multiplication and rank-1 update, continues on to reveal a family of matrix-matrix multiplication algorithms that view the nodes as a two-dimensional mesh, and finishes with extending these 2D algorithms to so-called 3D algorithms that view the nodes as a three-dimensional mesh. A cost analysis shows that the 3D algorithms can attain the (order of magnitude) lower bound for the cost of communication. The paper introduces a taxonomy for the resulting family of algorithms and explains how all algorithms have merit depending on parameters like the sizes of the matrices and architecture parameters. The techniques described in this paper are at the heart of the Elemental distributed memory linear algebra library. Performance results from implementation within and with this library are given on a representative distributed memory architecture, the IBM Blue Gene/P supercomputer.

1. Introduction. This paper serves a number of purposes:
- Parallel implementation of matrix-matrix multiplication is a standard topic in a course on parallel high-performance computing. However, rarely is the student exposed to the algorithms that are used in practice in cutting-edge parallel dense linear algebra (DLA) libraries. This paper exposes a systematic path that leads from parallel algorithms for matrix-vector multiplication and rank-1 update to a practical, scalable family of parallel algorithms for matrix-matrix multiplication, including the classic result in [1] and those implemented in the Elemental parallel DLA library [22].
- This paper introduces a set notation for describing the data distributions that underlie the Elemental library. The notation is motivated using parallelization of matrix-vector operations and matrix-matrix multiplication as the driving examples.
- Recently, research on parallel matrix-matrix multiplication algorithms has revisited so-called 3D algorithms, which view (processing) nodes as a logical three-dimensional mesh. These algorithms are known to attain theoretical (order of magnitude) lower bounds on communication. This paper exposes a systematic path from algorithms for two-dimensional meshes to their extensions for three-dimensional meshes. Among the resulting algorithms are classic results [2].
- A taxonomy is given for the resulting family of algorithms, which are all related to what is often called the Scalable Universal Matrix Multiplication Algorithm (SUMMA) [27].
- This paper is a first step towards understanding how the presented techniques extend to the much more complicated domain of tensor contractions (of which matrix multiplication is a special case).

Department of Computer Science, Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, TX. E-mails: martinschatz@utexas.edu, ltm@cs.utexas.edu, rvdg@cs.utexas.edu. Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA 94305. E-mail: poulson@stanford.edu. "Parallel" in this paper implicitly means distributed memory parallel.

Thus, the paper simultaneously serves a pedagogical role, explains abstractions that underlie the Elemental library, advances the state of science for parallel matrix-matrix multiplication by providing a framework to systematically derive known and new algorithms for matrix-matrix multiplication when computing on two-dimensional or three-dimensional meshes, and lays the foundation for extensions to tensor computations.

2. Background. The parallelization of dense matrix-matrix multiplication is a well-studied subject. Cannon's algorithm (sometimes called roll-roll-compute) dates back to 1969 [7] and Fox's algorithm (sometimes called broadcast-roll-compute) dates back to the 1980s [12]. Both suffer from a number of shortcomings:
- They assume that the p processes are viewed as a d0 x d1 grid, with d0 = d1 = sqrt(p). Removing this constraint on d0 and d1 is nontrivial for these algorithms.
- They do not deal well with the case where one of the matrix dimensions becomes relatively small. This is the most commonly encountered case in libraries like LAPACK [3] and libflame [29, 30], and their distributed-memory counterparts: ScaLAPACK [9], PLAPACK [28], and Elemental [22].
- Attempts to generalize [10, 18, 19] led to implementations that were neither simple nor effective.

A practical algorithm, which also results from the systematic approach discussed in this paper, can be described as allgather-allgather-multiply [1]. It does not suffer from the shortcomings of Cannon's and Fox's algorithms. It did not gain in popularity in part because libraries like ScaLAPACK and PLAPACK used a 2D block-cyclic distribution, rather than the 2D elemental distribution advocated by that paper. The arrival of the Elemental library, together with what we believe is our more systematic and extensible explanation, will hopefully elevate awareness of this result.

The Scalable Universal Matrix Multiplication Algorithm (SUMMA) [27] is another algorithm that overcomes all shortcomings of Cannon's algorithm and Fox's algorithm. We believe it is a more widely known result in part because it can already be explained for a matrix that is distributed with a 2D blocked (but not cyclic) distribution and in part because it was easy to support in the ScaLAPACK and PLAPACK libraries. The original SUMMA paper gives four algorithms:
- For C := AB + C, SUMMA casts the computation in terms of multiple rank-k updates. This algorithm is sometimes called the broadcast-broadcast-multiply algorithm, a label we will see is somewhat limiting. We also call this algorithm stationary C for reasons that will become clear later. By design, this algorithm continues to perform well in the case where the width of A is small relative to the dimensions of C.
- For C := A^T B + C, SUMMA casts the computation in terms of multiple panel-of-rows times matrix multiplies, so performance is not degraded in the case where the height of A is small relative to the dimensions of B. We have also called this algorithm stationary B for reasons that will become clear later.
- For C := AB^T + C, SUMMA casts the computation in terms of multiple matrix times panel-of-columns multiplies, and so performance does not deteriorate when the width of C is small relative to the dimensions of A. We call this algorithm stationary A for reasons that will become clear later.
- For C := A^T B^T + C, the paper sketches an algorithm that is actually not practical.

In the 1990s, it was observed that for the case where matrices were relatively small (or, equivalently, a relatively large number of nodes were available), better theoretical and practical performance resulted from viewing the p nodes as a d0 x d1 x d2 mesh, yielding a 3D algorithm [2].

In [14], it was shown how stationary A, B, and C algorithms can be formulated for each of the four cases of matrix-matrix multiplication, including for C := A^T B^T + C. This then yielded a general, practical family of 2D matrix-matrix multiplication algorithms, all of which were incorporated into PLAPACK and Elemental, and some of which are supported by ScaLAPACK. Some of the observations about developing 2D algorithms in the current paper can already be found in that paper, but our exposition is much more systematic and we use the matrix distribution that underlies Elemental to illustrate the basic principles. Although the work by Agarwal et al. describes algorithms for the different matrix-matrix multiplication transpose variants, it does not describe how to create stationary A and B variants.

Recently, a 3D algorithm for computing the LU factorization of a matrix was devised [25]. In the same work, a 3D algorithm for matrix-matrix multiplication building upon Cannon's algorithm was given for nodes arranged as a d0 x d1 x d2 mesh, with d0 = d1 and 0 <= d2 < p^(1/3). This was labeled a 2.5D algorithm. Although the primary contribution of that work was LU-related, the 2.5D algorithm for matrix-matrix multiplication is the relevant portion to this paper. The focus of that study on 3D algorithms was the simplest case of matrix-matrix multiplication, C := AB.

3. Notation. Although the focus of this paper is parallel distributed-memory matrix-matrix multiplication, the notation used is designed to be extensible to computation with higher-dimensional objects (tensors), on higher-dimensional grids. Because of this, the notation used may seem overly complex when restricted to matrix-matrix multiplication only. In this section, we describe the notation used and the reasoning behind the choice of notation.

Grid dimension, d_x: Since we focus on algorithms for distributed-memory architectures, we must describe information about the grid on which we are computing. To support arbitrary-dimensional grids, we must express the shape of the grid in an extensible way. For this reason, we have chosen the subscripted letter d to indicate the size of a particular dimension of the grid. Thus, d_x refers to the number of processes comprising the x-th dimension of the grid. In this paper, the grid is typically d0 x d1.

Process location, s_x: In addition to describing the shape of the grid, it is useful to be able to refer to a particular process's location within the mesh of processes. For this, we use the subscripted letter s to refer to a process's location within some given dimension of the mesh of processes. Thus, s_x refers to a particular process's location within the x-th dimension of the mesh of processes. In this paper, a typical process is labeled with (s0, s1).

Distribution, D_(x0, x1, ..., xk): In subsequent sections, we will introduce a notation for describing how data is distributed among the processes of the grid. This notation will require a description of which dimensions of the grid are involved in defining the distribution. We use the symbol D_(x0, x1, ..., xk) to indicate a distribution which involves dimensions x0, x1, ..., xk of the mesh. For example, when describing a distribution which involves the column and row dimension of the grid, we refer to this distribution as D_(0,1). Later, we will explain why the symbol D_(0,1) describes a different distribution from D_(1,0).

4. Of Matrix-Vector Operations and Distribution. In this section, we discuss how matrix and vector distributions can be linked to parallel 2D matrix-vector multiplication and rank-1 update operations, which then allows us to eventually describe the stationary C, A, and B 2D algorithms for matrix-matrix multiplication that are part of the Elemental library.

4.1. Collective communication. Collectives are fundamental to the parallelization of dense matrix operations. Thus, the reader must be (or become) familiar with the basics of these communications and is encouraged to read Chan et al. [8], which presents collectives in a systematic way that dovetails with the present paper. To make this paper self-contained, Figure 4.1 (similar to Figure 1 in [8]) summarizes the collectives. In Figure 4.2 we summarize lower bounds on the cost of the collective communications, under basic assumptions explained in [8] (see [6] for an analysis of all-to-all), and the cost expressions that we will use in our analyses.

4.2. Motivation: matrix-vector multiplication. Suppose A in R^{m x n}, x in R^n, and y in R^m, and label their individual elements so that

  A = [ alpha_{0,0}     alpha_{0,1}     ...  alpha_{0,n-1}
        alpha_{1,0}     alpha_{1,1}     ...  alpha_{1,n-1}
        ...
        alpha_{m-1,0}   alpha_{m-1,1}   ...  alpha_{m-1,n-1} ],
  x = (chi_0, chi_1, ..., chi_{n-1})^T,  and  y = (psi_0, psi_1, ..., psi_{m-1})^T.

Recalling that y = Ax (matrix-vector multiplication) is computed as

  psi_0     = alpha_{0,0} chi_0     + alpha_{0,1} chi_1     + ... + alpha_{0,n-1} chi_{n-1}
  psi_1     = alpha_{1,0} chi_0     + alpha_{1,1} chi_1     + ... + alpha_{1,n-1} chi_{n-1}
  ...
  psi_{m-1} = alpha_{m-1,0} chi_0   + alpha_{m-1,1} chi_1   + ... + alpha_{m-1,n-1} chi_{n-1},

we notice that element alpha_{i,j} multiplies chi_j and contributes to psi_i. Thus we may summarize the interactions of the elements of x, y, and A by

(4.1)              chi_0           chi_1           ...   chi_{n-1}
       psi_0       alpha_{0,0}     alpha_{0,1}     ...   alpha_{0,n-1}
       psi_1       alpha_{1,0}     alpha_{1,1}     ...   alpha_{1,n-1}
       ...
       psi_{m-1}   alpha_{m-1,0}   alpha_{m-1,1}   ...   alpha_{m-1,n-1}

which is meant to indicate that chi_j must be multiplied by the elements in the j-th column of A while the i-th row of A contributes to psi_i.

4.3. Two-Dimensional Elemental Cyclic Distribution. It is well established that (weakly) scalable implementations of DLA operations require nodes to be logically viewed as a two-dimensional mesh [26, 17]. It is also well established that to achieve load balance for a wide range of matrix operations, matrices should be cyclically wrapped onto this logical mesh. We start with these insights and examine the simplest of matrix distributions that result: the 2D elemental cyclic distribution [22, 16]. Denoting the number of nodes by p, a d0 x d1 mesh must be chosen such that p = d0 d1.

Fig. 4.1. Collective communications considered in this paper: permute, broadcast, reduce(-to-one), scatter, gather, allgather, reduce-scatter, allreduce, and all-to-all, each illustrated by the before and after contents of four nodes.
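To make the before/after patterns of Figure 4.1 concrete, the following minimal Python sketch (ours, not part of the paper) models a few of the collectives among four nodes; plain Python lists stand in for per-node buffers, and a real implementation would of course use MPI collectives instead.

  # Minimal semantic models of some collectives from Figure 4.1,
  # with one Python list entry per node.

  def allgather(per_node_pieces):
      # Before: node q owns piece x_q. After: every node owns [x_0, ..., x_{p-1}].
      gathered = list(per_node_pieces)
      return [list(gathered) for _ in per_node_pieces]

  def reduce_scatter(per_node_vectors):
      # Before: node q owns a full-length vector of contributions x^(q).
      # After: node q owns only entry q of the elementwise sum over all nodes.
      p = len(per_node_vectors)
      totals = [sum(vec[i] for vec in per_node_vectors) for i in range(p)]
      return [totals[q] for q in range(p)]

  def allreduce(per_node_vectors):
      # After: every node owns the full elementwise sum.
      p = len(per_node_vectors)
      totals = [sum(vec[i] for vec in per_node_vectors) for i in range(p)]
      return [list(totals) for _ in range(p)]

  if __name__ == "__main__":
      print(allgather([10, 11, 12, 13]))
      print(reduce_scatter([[1, 2, 3, 4]] * 4))   # node q receives 4*(q+1)
      print(allreduce([[1, 0, 0, 0], [0, 1, 0, 0],
                       [0, 0, 1, 0], [0, 0, 0, 1]]))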

  Operation        Latency     Bandwidth          Computation       Cost used for analysis
  Permute          alpha       n beta             -                 alpha + n beta
  Broadcast        log2(p) alpha   n beta         -                 log2(p) alpha + n beta
  Reduce(-to-one)  log2(p) alpha   n beta         ((p-1)/p) n gamma log2(p) alpha + n(beta + gamma)
  Scatter          log2(p) alpha   ((p-1)/p) n beta   -             log2(p) alpha + ((p-1)/p) n beta
  Gather           log2(p) alpha   ((p-1)/p) n beta   -             log2(p) alpha + ((p-1)/p) n beta
  Allgather        log2(p) alpha   ((p-1)/p) n beta   -             log2(p) alpha + ((p-1)/p) n beta
  Reduce-scatter   log2(p) alpha   ((p-1)/p) n beta   ((p-1)/p) n gamma   log2(p) alpha + ((p-1)/p) n(beta + gamma)
  Allreduce        log2(p) alpha   2((p-1)/p) n beta  ((p-1)/p) n gamma   2 log2(p) alpha + ((p-1)/p) n(2 beta + gamma)
  All-to-all       log2(p) alpha   ((p-1)/p) n beta   -             log2(p) alpha + ((p-1)/p) n beta

Fig. 4.2. Lower bounds for the different components of communication cost. Conditions for the lower bounds are given in [8] and [6]. The last column gives the cost functions that we use in our analyses. For architectures with sufficient connectivity, simple algorithms exist with costs that remain within a small constant factor of all but one of the given formulae. The exception is the all-to-all, for which there are algorithms that achieve the lower bound for the alpha and beta terms separately, but it is not clear whether an algorithm that consistently achieves performance within a constant factor of the given cost function exists.

Matrix distribution. The elements of A are assigned using an elemental cyclic (round-robin) distribution in which alpha_{i,j} is assigned to node (i mod d0, j mod d1). Thus, node (s0, s1) stores the submatrix

  A(s0 : d0 : m-1, s1 : d1 : n-1) = [ alpha_{s0,s1}       alpha_{s0,s1+d1}       ...
                                      alpha_{s0+d0,s1}    alpha_{s0+d0,s1+d1}    ...
                                      ...                                           ],

where the left-hand side of the expression uses the MATLAB convention for expressing submatrices, starting indexing from zero instead of one. This is illustrated in Figure 4.3.

Column-major vector distribution. A column-major vector distribution views the d0 x d1 mesh of nodes as a linear array of nodes, numbered in column-major order. A vector is distributed with this distribution if it is assigned to this linear array of nodes in a round-robin fashion, one element at a time. In other words, consider vector y. Its element psi_i is assigned to node (i mod d0, (i/d0) mod d1), where / denotes integer division. Or, equivalently in MATLAB-like notation, node (s0, s1) stores the subvector y(u(s0, s1) : p : m-1), where u(s0, s1) = s0 + s1 d0 equals the rank of node (s0, s1) when the nodes are viewed as a one-dimensional array, indexed in column-major order. This distribution of y is illustrated in Figure 4.3.

Row-major vector distribution. Similarly, a row-major vector distribution views the d0 x d1 mesh of nodes as a linear array of nodes, numbered in row-major order. In other words, consider vector x. Its element chi_j is assigned to node ((j/d1) mod d0, j mod d1). Or, equivalently, node (s0, s1) stores the subvector x(v(s0, s1) : p : n-1), where v(s0, s1) = s0 d1 + s1 equals the rank of node (s0, s1) when the nodes are viewed as a one-dimensional array, indexed in row-major order. The distribution of x is illustrated in Figure 4.3.
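The assignment rules just described are easy to check numerically. The short NumPy sketch below (our own illustration, not code from the Elemental library) builds the local submatrix of A owned by node (s0, s1) via strided slicing, along with the local pieces of y (column-major vector distribution) and x (row-major vector distribution).

  import numpy as np

  m, n, d0, d1 = 6, 9, 2, 3          # matrix size and a 2 x 3 mesh
  p = d0 * d1
  A = np.arange(m * n, dtype=float).reshape(m, n)
  x = np.arange(n, dtype=float)       # row-major vector distribution
  y = np.arange(m, dtype=float)       # column-major vector distribution

  def local_matrix(A, s0, s1):
      # alpha_{i,j} lives on node (i mod d0, j mod d1):
      # node (s0, s1) stores A(s0:d0:m-1, s1:d1:n-1).
      return A[s0::d0, s1::d1]

  def local_y(y, s0, s1):
      # psi_i lives on node (i mod d0, (i/d0) mod d1); equivalently node
      # (s0, s1) stores y(u(s0,s1) : p : m-1) with u = s0 + s1*d0.
      return y[s0 + s1 * d0::p]

  def local_x(x, s0, s1):
      # chi_j lives on the node of row-major rank j mod p; node (s0, s1)
      # stores x(v(s0,s1) : p : n-1) with v = s0*d1 + s1.
      return x[s0 * d1 + s1::p]

  for s0 in range(d0):
      for s1 in range(d1):
          print((s0, s1), local_y(y, s0, s1), local_x(x, s0, s1))
          # e.g. node (0,0) of the 2 x 3 mesh holds psi_0 and chi_0, chi_6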

Fig. 4.3. Distribution of A, x, and y within a 2 x 3 mesh of nodes (each node is shown with the elements alpha_{i,j}, chi_j, and psi_i that it stores). Redistributing a column of A in the same manner as y requires simultaneous scatters within rows of nodes, while redistributing a row of A consistently with x requires simultaneous scatters within columns of nodes. In the notation of Section 5, here the distributions of x and y are given by x[(1,0),()] and y[(0,1),()], respectively, and A by A[(0),(1)].

Fig. 4.4. Vectors x and y respectively redistributed as row-projected and column-projected vectors: node (s0, s1) now stores the chi_j with j = s1 (mod d1), duplicated within each column of nodes, and contributions to the psi_i with i = s0 (mod d0), duplicated within each row of nodes. The column-projected vector y[(0),()] here is to be used to compute local results that will become contributions to a column vector y[(0,1),()] which will result from adding these local contributions within rows of nodes. By comparing and contrasting this figure with Figure 4.3 it becomes obvious that redistributing x[(1,0),()] to x[(1),()] requires an allgather within columns of nodes, while y[(0,1),()] results from scattering y[(0),()] within process rows.

4.4. Parallelizing matrix-vector operations. In the following discussion, we assume that A, x, and y are distributed as discussed above. At this point, we suggest comparing (4.1) with Figure 4.3. (We suggest the reader print copies of Figures 4.3 and 4.4 for easy referral while reading the rest of this section.)

Computing y := Ax. The relation between the distributions of a matrix, a column-major vector, and a row-major vector is illustrated by revisiting the most fundamental of computations in linear algebra, y := Ax, already discussed in Section 4.2. An examination of Figure 4.3 suggests that the elements of x must be gathered within columns of nodes (allgather within columns), leaving elements of x distributed as illustrated in Figure 4.4. Next, each node computes the partial contribution to vector y with its local matrix and copy of x. Thus, in Figure 4.4, psi_i in each node becomes a contribution to the final psi_i. These must be added together, which is accomplished by a summation of contributions to y within rows of nodes. An experienced MPI programmer will recognize this as a reduce-scatter within each row of nodes. Under our communication cost model, the cost of this parallel algorithm is given by

  T_{y:=Ax}(m, n, d0, d1) = 2 (m/d0)(n/d1) gamma                                        (local mvmult)
                            + log2(d0) alpha + ((d0-1)/d0)(n/d1) beta                    (allgather x)
                            + log2(d1) alpha + ((d1-1)/d1)(m/d0) beta + ((d1-1)/d1)(m/d0) gamma   (reduce-scatter y)
                          <= 2 (mn/p) gamma + C0 (m/d0) gamma + C1 (n/d1) gamma          (load imbalance)
                            + log2(p) alpha + ((d0-1)/d0)(n/d1) beta + ((d1-1)/d1)(m/d0) beta + ((d1-1)/d1)(m/d0) gamma

for some constants C0 and C1. We simplify this further to

  2 (mn/p) gamma + log2(p) alpha + ((d0-1)/d0)(n/d1) beta + ((d1-1)/d1)(m/d0) beta + ((d1-1)/d1)(m/d0) gamma,

where the last four terms constitute T^+_{y:=Ax}(m, n, d0, d1), since the load imbalance contributes a cost similar to that of the communication. Here, T^+_{y:=Ax}(m, n, d0, d1) is used to refer to the overhead associated with the above algorithm for the y := Ax operation. (It is tempting to approximate (x-1)/x by 1, but this would yield formulae for the cases where the mesh is p x 1 (d1 = 1) or 1 x p (d0 = 1) that are overly pessimistic.) In Appendix A we use these estimates to show that this parallel matrix-vector multiplication is, for practical purposes, weakly scalable if d0/d1 is kept constant, but not if d0 x d1 = 1 x p or d0 x d1 = p x 1.
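As a sanity check on the algorithm just described, the following NumPy sketch (ours, not from the paper or the Elemental library) mimics the three phases on a d0 x d1 mesh: the allgather of x within columns of nodes, the local matrix-vector multiplies, and the summation of contributions to y within rows of nodes. Data movement is only simulated by index arithmetic on the global arrays.

  import numpy as np

  m, n, d0, d1 = 8, 9, 2, 3
  rng = np.random.default_rng(0)
  A = rng.standard_normal((m, n))
  x = rng.standard_normal(n)

  # Phase 1: allgather of x within columns of nodes. Afterwards, node (s0, s1)
  # owns all chi_j with j = s1 (mod d1), the column indices of its local A.
  # Phase 2: local matrix-vector multiply with the local part of A.
  contrib = {}
  for s0 in range(d0):
      for s1 in range(d1):
          A_loc = A[s0::d0, s1::d1]          # local matrix on node (s0, s1)
          x_dup = x[s1::d1]                  # x after the allgather in columns
          contrib[s0, s1] = A_loc @ x_dup    # contributions to psi_i, i = s0 (mod d0)

  # Phase 3: a reduce-scatter within rows of nodes sums the contributions;
  # here we only check that the row-wise sums reproduce y = A x.
  y = A @ x
  for s0 in range(d0):
      summed = sum(contrib[s0, s1] for s1 in range(d1))
      assert np.allclose(summed, y[s0::d0])
  print("parallel y := A x phases reproduce the sequential result")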

Computing x := A^T y. Let us next discuss an algorithm for computing x := A^T y, where A is an m x n matrix and x and y are distributed as before (x with a row-major vector distribution and y with a column-major vector distribution). Recall that x = A^T y (transpose matrix-vector multiplication) means

(4.2)  chi_0     = alpha_{0,0} psi_0     + alpha_{1,0} psi_1     + ... + alpha_{m-1,0} psi_{m-1}
       chi_1     = alpha_{0,1} psi_0     + alpha_{1,1} psi_1     + ... + alpha_{m-1,1} psi_{m-1}
       ...
       chi_{n-1} = alpha_{0,n-1} psi_0   + alpha_{1,n-1} psi_1   + ... + alpha_{m-1,n-1} psi_{m-1}.

An examination of (4.2) and Figure 4.3 suggests that the elements of y must be gathered within rows of nodes (allgather within rows), leaving elements of y distributed as illustrated in Figure 4.4. Next, each node computes the partial contribution to vector x with its local matrix and copy of y. Thus, in Figure 4.4, chi_j in each node becomes a contribution to the final chi_j. These must be added together, which is accomplished by a summation of contributions to x within columns of nodes. We again recognize this as a reduce-scatter, but this time within each column of nodes. The cost for this algorithm, approximating as we did when analyzing the algorithm for y = Ax, is

  2 (mn/p) gamma + log2(p) alpha + ((d1-1)/d1)(m/d0) beta + ((d0-1)/d0)(n/d1) beta + ((d0-1)/d0)(n/d1) gamma,

where the communication terms constitute T^+_{x:=A^T y}(m, n, d1, d0), and where, as before, we ignore overhead due to load imbalance since terms of the same order appear in the terms that capture communication overhead.

Computing y := A^T x. What if we wish to compute y := A^T x, where A is an m x n matrix, y is distributed with a column-major vector distribution, and x with a row-major vector distribution? Now x must first be redistributed to a column-major vector distribution, after which the algorithm that we just discussed can be executed, and finally the result (in row-major vector distribution) must be redistributed to leave it as y in column-major vector distribution. This adds the cost of the permutation that redistributes x and the cost of the permutation that redistributes the result to y to the cost analyzed for x := A^T y.

Other cases. What if, when computing y := Ax, the vector x is distributed like a row of matrix A? What if the vector y is distributed like a column of matrix A? We leave these cases as an exercise to the reader. The point is that understanding the basic algorithms for multiplying with A and A^T allows one to systematically derive and analyze algorithms when the vectors that are involved are distributed to the nodes in different ways.

Computing A := yx^T + A. A second commonly encountered matrix-vector operation is the rank-1 update: A := alpha y x^T + A. We will discuss the case where alpha = 1. Recall that

  A + y x^T = [ alpha_{0,0} + psi_0 chi_0        alpha_{0,1} + psi_0 chi_1        ...  alpha_{0,n-1} + psi_0 chi_{n-1}
                alpha_{1,0} + psi_1 chi_0        alpha_{1,1} + psi_1 chi_1        ...  alpha_{1,n-1} + psi_1 chi_{n-1}
                ...
                alpha_{m-1,0} + psi_{m-1} chi_0  alpha_{m-1,1} + psi_{m-1} chi_1  ...  alpha_{m-1,n-1} + psi_{m-1} chi_{n-1} ],

which, when considering Figures 4.3 and 4.4, suggests the following parallel algorithm: an all-gather of y within rows, an all-gather of x within columns, and an update of the local matrix on each node. The cost for this algorithm, approximating as we did when analyzing the algorithm for y = Ax, yields

  2 (mn/p) gamma + log2(p) alpha + ((d0-1)/d0)(n/d1) beta + ((d1-1)/d1)(m/d0) beta,

where the communication terms constitute T^+_{A:=yx^T+A}(m, n, d0, d1), and where, as before, we ignore overhead due to load imbalance since terms of the same order appear in the terms that capture communication overhead. Notice that the cost is the same as that of a parallel matrix-vector multiplication, except for the gamma term that, for the matrix-vector multiplication, results from the reduction within rows. As before, one can modify this algorithm when the vectors start with different distributions, building on intuition from matrix-vector multiplication. A pattern is emerging.

5. Generalizing the Theme. The reader should now have an understanding of how vector and matrix distribution are related to the parallelization of basic matrix-vector operations. We generalize the insights using sets of indices as filters to indicate what parts of a matrix or vector a given process owns. The insights in this section are similar to those that underlie Physically Based Matrix Distribution [11], which itself also underlies PLAPACK. However, we formalize the notation beyond that used by PLAPACK. The link between distribution of vectors and matrices was first observed by Bisseling [4, 5] and, around the same time, in [21].

5.1. Vector distribution. The basic idea is to use two different partitions of the natural numbers as a means of describing the distribution of the row and column indices of a matrix.

Definition 5.1 (Subvectors and submatrices). Let x in R^n and S a subset of N. Then x[S] equals the vector with elements from x with indices in the set S, in the order in which they appear in vector x. If A in R^{m x n} and S, T are subsets of N, then A[S, T] is the submatrix formed by keeping only the elements of A whose row indices are in S and column indices are in T, in the order in which they appear in matrix A.

We illustrate this idea with simple examples.

Example 1. Let

  x = (chi_0, chi_1, chi_2, chi_3)^T  and  A = [ alpha_{0,0} alpha_{0,1} alpha_{0,2} alpha_{0,3} alpha_{0,4}
                                                 alpha_{1,0} alpha_{1,1} alpha_{1,2} alpha_{1,3} alpha_{1,4}
                                                 alpha_{2,0} alpha_{2,1} alpha_{2,2} alpha_{2,3} alpha_{2,4}
                                                 alpha_{3,0} alpha_{3,1} alpha_{3,2} alpha_{3,3} alpha_{3,4}
                                                 alpha_{4,0} alpha_{4,1} alpha_{4,2} alpha_{4,3} alpha_{4,4} ].

If S = {0, 2, 4, ...} and T = {1, 3, 5, ...}, then

  x[S] = (chi_0, chi_2)^T  and  A[S, T] = [ alpha_{0,1} alpha_{0,3}
                                            alpha_{2,1} alpha_{2,3}
                                            alpha_{4,1} alpha_{4,3} ].
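Definition 5.1 and Example 1 translate directly into NumPy index filtering; the sketch below (ours, for illustration only) forms x[S] and A[S, T] for the finite parts of these index sets that apply to a 4-element vector and a 5 x 5 matrix.

  import numpy as np

  x = np.array([10.0, 11.0, 12.0, 13.0])             # chi_0, ..., chi_3
  A = np.arange(25, dtype=float).reshape(5, 5)        # alpha_{i,j} = 5*i + j

  # S = {0, 2, 4, ...} and T = {1, 3, 5, ...}, intersected with the valid ranges.
  S_x = [i for i in range(len(x)) if i % 2 == 0]      # indices of x kept: 0, 2
  S_A = [i for i in range(A.shape[0]) if i % 2 == 0]  # rows of A kept: 0, 2, 4
  T_A = [j for j in range(A.shape[1]) if j % 2 == 1]  # columns of A kept: 1, 3

  x_S = x[S_x]                   # x[S] = (chi_0, chi_2)
  A_ST = A[np.ix_(S_A, T_A)]     # A[S, T]: rows {0,2,4}, columns {1,3}, order preserved

  print(x_S)      # [10. 12.]
  print(A_ST)     # [[ 1.  3.] [11. 13.] [21. 23.]]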

We now introduce two fundamental ways to distribute vectors relative to a logical d0 x d1 process grid.

Definition 5.2 (Column-major vector distribution). Suppose that p processes are available, and define

  V_sigma(q) = { N in N : N = q + sigma (mod p) },  q in {0, 1, ..., p-1},

where sigma in {0, 1, ..., p-1} is an arbitrary alignment parameter. When p is implied from context and sigma is unimportant to the discussion, we will simply denote the above set by V(q). If the p processes have been configured into a logical d0 x d1 grid, a vector x is said to be in a column-major vector distribution if process (s0, s1), where s0 in {0, ..., d0-1} and s1 in {0, ..., d1-1}, is assigned the subvector x(V_sigma(s0 + s1 d0)). This distribution is represented via the d0 x d1 array of index sets

  D_(0,1)(s0, s1) = V(s0 + s1 d0),  (s0, s1) in {0, ..., d0-1} x {0, ..., d1-1},

and the shorthand x[(0,1)] will refer to the vector x distributed such that process (s0, s1) stores x(D_(0,1)(s0, s1)).

Definition 5.3 (Row-major vector distribution). Similarly, the d0 x d1 array

  D_(1,0)(s0, s1) = V(s1 + s0 d1),  (s0, s1) in {0, ..., d0-1} x {0, ..., d1-1},

is said to define a row-major vector distribution. The shorthand y[(1,0)] will refer to the vector y distributed such that process (s0, s1) stores y(D_(1,0)(s0, s1)).

The members of any column-major vector distribution, D_(0,1), or row-major vector distribution, D_(1,0), form a partition of N. The names "column-major vector distribution" and "row-major vector distribution" come from the fact that the mappings (s0, s1) -> s0 + s1 d0 and (s0, s1) -> s1 + s0 d1 respectively label the d0 x d1 grid with a column-major and a row-major ordering.

As row-major and column-major distributions differ only by which dimension of the grid is considered first when assigning an order to the processes in the grid, we can give one general definition for a vector distribution on two-dimensional grids. We give this definition now.

Definition 5.4 (Vector distribution). We call the d0 x d1 array D_(i,j) a vector distribution if i, j in {0, 1}, i != j, and there exists some alignment parameter sigma in {0, ..., p-1} such that, for every grid position (s0, s1) in {0, ..., d0-1} x {0, ..., d1-1},

(5.1)  D_(i,j)(s0, s1) = V_sigma(s_i + s_j d_i).

The shorthand y[(i,j)] will refer to the vector y distributed such that process (s0, s1) stores y(D_(i,j)(s0, s1)).
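A small helper (again ours, not from the paper) makes Definitions 5.2-5.4 concrete by listing, for each grid position (s0, s1), the finite part of V_sigma(q) that falls inside {0, ..., N-1}, for both the column-major ordering q = s0 + s1 d0 and the row-major ordering q = s1 + s0 d1.

  def V(q, p, N, sigma=0):
      # Finite part of V_sigma(q) = {i : i = q + sigma (mod p)} below N.
      return [i for i in range(N) if i % p == (q + sigma) % p]

  def vector_distribution(d0, d1, N, order, sigma=0):
      # order "col" gives D_(0,1) (column-major); "row" gives D_(1,0) (row-major).
      p = d0 * d1
      dist = {}
      for s0 in range(d0):
          for s1 in range(d1):
              q = s0 + s1 * d0 if order == "col" else s1 + s0 * d1
              dist[s0, s1] = V(q, p, N, sigma)
      return dist

  if __name__ == "__main__":
      d0, d1, N = 2, 3, 12
      print("D_(0,1):", vector_distribution(d0, d1, N, "col"))
      print("D_(1,0):", vector_distribution(d0, d1, N, "row"))
      # For example, D_(0,1)(0,1) = V(2) = {2, 8} while D_(1,0)(0,1) = V(1) = {1, 7}:
      # the two distributions assign different index sets to the same grid position.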

Figure 4.3 illustrates that to redistribute y[(0,1)] to y[(1,0)], and vice versa, requires a permutation communication (simultaneous point-to-point communications). The effect of this redistribution can be seen in Figure 5.2: via a permutation communication, the vector y distributed as y[(0,1)] can be redistributed as y[(1,0)], which is the same distribution as that of the vector x.

In the preceding discussions, our definitions of D_(0,1) and D_(1,0) allowed for arbitrary alignment parameters. For the rest of the paper, we will only treat the case where all alignments are zero, i.e., the top-left entry of every (global) matrix and the top entry of every (global) vector is owned by the process in the top-left of the process grid.

5.2. Induced matrix distribution. We are now ready to discuss how matrix distributions are induced by the vector distributions. For this, it pays to again consider Figure 4.3. The element alpha_{i,j} of matrix A is assigned to the row of processes in which psi_i exists and the column of processes in which chi_j exists. This means that in y = Ax elements of x need only be communicated within columns of processes and local contributions to y need only be summed within rows of processes. This induces a Cartesian matrix distribution: column j of A is assigned to the same column of processes as is chi_j, and row i of A is assigned to the same row of processes as is psi_i. We now answer the related questions "What is the set D_(0)(s0) of matrix row indices assigned to process row s0?" and "What is the set D_(1)(s1) of matrix column indices assigned to process column s1?"

Definition 5.5. Let

  D_(0)(s0) = Union over s1 = 0, ..., d1-1 of D_(0,1)(s0, s1)   and   D_(1)(s1) = Union over s0 = 0, ..., d0-1 of D_(1,0)(s0, s1).

Given matrix A, A[D_(0)(s0), D_(1)(s1)] denotes the submatrix of A with row indices in the set D_(0)(s0) and column indices in D_(1)(s1). Finally, A[(0),(1)] denotes the distribution of A that assigns A[D_(0)(s0), D_(1)(s1)] to process (s0, s1).

We say that D_(0) and D_(1) are induced respectively by D_(0,1) and D_(1,0) because the process to which alpha_{i,j} is assigned is determined by the row of processes, s0, to which psi_i is assigned and the column of processes, s1, to which chi_j is assigned, so that it is ensured that in the matrix-vector multiplication y = Ax communication needs only be within rows and columns of processes. Notice in Figure 5.2 that redistributing the vector y as a column of the matrix A requires a communication within rows of processes. Similarly, redistributing the vector x as a row of the matrix requires a communication within columns of processes. The above definition lies at the heart of our communication scheme.

5.3. Vector duplication. Two vector distributions, encountered in Section 4.4 and illustrated in Figure 4.4, still need to be specified with our notation. The vector x, duplicated as needed for the matrix-vector multiplication y = Ax, can be specified as x[(1)] or, viewing x as an n x 1 matrix, x[(1),()]. The vector y, duplicated so as to store local contributions for y = Ax, can be specified as y[(0)] or, viewing y as an m x 1 matrix, y[(0),()]. Here the () should be interpreted as "all indices"; in other words, D_() = N.
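The induced sets of Definition 5.5 can likewise be enumerated. The sketch below (ours) forms D_(0)(s0) as the union of D_(0,1)(s0, s1) over s1 and D_(1)(s1) as the union of D_(1,0)(s0, s1) over s0, and checks that they reduce to the simple "mod" rules of the elemental distribution of Section 4.3.

  d0, d1, m, n = 2, 3, 8, 9
  p = d0 * d1

  def V(q, N):
      return {i for i in range(N) if i % p == q % p}

  # D_(0)(s0): union over s1 of D_(0,1)(s0, s1); these are matrix ROW indices.
  # D_(1)(s1): union over s0 of D_(1,0)(s0, s1); these are matrix COLUMN indices.
  for s0 in range(d0):
      D0 = set().union(*(V(s0 + s1 * d0, m) for s1 in range(d1)))
      assert D0 == {i for i in range(m) if i % d0 == s0}
  for s1 in range(d1):
      D1 = set().union(*(V(s1 + s0 * d1, n) for s0 in range(d0)))
      assert D1 == {j for j in range(n) if j % d1 == s1}
  print("D_(0)(s0) = {i : i mod d0 = s0} and D_(1)(s1) = {j : j mod d1 = s1}")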

Fig. 5.1. The relationships between distribution symbols found in the Elemental library implementation and those introduced here: M_C corresponds to (0), M_R to (1), V_C to (0,1), V_R to (1,0), and the star symbol to (). For instance, the distribution A[M_C, M_R] found in the Elemental library implementation corresponds to the distribution A[(0),(1)].

Fig. 5.2. Summary of the communication patterns for redistributing a vector x among the distributions x[(0),()], x[(1),()], x[(0,1),()], x[(1,0),()], x[(0),(1)], and x[(1),(0)], with the edges of the diagram labeled by the collectives (permutation, broadcast, reduce(-to-one), scatter, gather, allgather, reduce-scatter) that realize them. For instance, a method for redistributing x from a matrix column to a matrix row is found by tracing from the bottom-left to the bottom-right of the diagram.

5.4. Notation in the Elemental library. Readers familiar with the Elemental library will notice that the distribution symbols defined within that library's implementation follow a different convention than that used for the distribution symbols introduced in the previous subsections. This is due to the fact that the notation used in this paper was devised after the implementation of the Elemental library, and we wanted the notation to be extensible to tensors. However, for every symbol utilized in the Elemental library implementation, there exists a unique symbol in the notation introduced here. In Figure 5.1, the relationships between the distribution symbols utilized in the Elemental library implementation and the symbols used in this paper are defined.

5.5. Of vectors, columns, and rows. A matrix-vector multiplication or rank-1 update may take as its input/output vectors (x and y) the rows and/or columns of matrices, as we will see in Section 6. This motivates us to briefly discuss the different communications needed to redistribute vectors to and from columns and rows. In our discussion, it will help to refer back to Figures 4.3 and 4.4.

Fig. 5.3. Parallel algorithm for computing y := Ax (gemv):
  x[(1),()] <- x[(1,0),()]                      (redistribute x: allgather in columns)
  y^(1)[(0),()] := A[(0),(1)] x[(1),()]          (local matrix-vector multiply)
  y[(0,1),()] := Sum_1 y^(1)[(0),()]             (sum contributions: reduce-scatter in rows)

Fig. 5.4. Parallel algorithm for computing A := A + x y^T (ger):
  x[(0,1),()] <- x[(1,0),()]                     (redistribute x as a column-major vector: permutation)
  x[(0),()] <- x[(0,1),()]                       (redistribute x: allgather in rows)
  y[(1,0),()] <- y[(0,1),()]                     (redistribute y as a row-major vector: permutation)
  y[(1),()] <- y[(1,0),()]                       (redistribute y: allgather in columns)
  A[(0),(1)] := A[(0),(1)] + x[(0),()] [y[(1),()]]^T   (local rank-1 update)

Column to/from column-major vector. Consider Figure 4.3 and let a_j be a typical column in A. It exists within one single process column. Redistributing a_j[(0),(1)] to y[(0,1),()] requires simultaneous scatters within process rows. Inversely, redistributing y[(0,1),()] to a_j[(0),(1)] requires simultaneous gathers within process rows.

Column to/from row-major vector. Redistributing a_j[(0),(1)] to x[(1,0),()] can be accomplished by first redistributing to y[(0,1),()] (simultaneous scatters within rows) followed by a redistribution of y[(0,1),()] to x[(1,0),()] (a permutation). Redistributing x[(1,0),()] to a_j[(0),(1)] reverses these communications.

Column to/from column-projected vector. Redistributing a_j[(0),(1)] to a_j[(0),()] (duplicated y in Figure 4.4) can be accomplished by first redistributing to y[(0,1),()] (simultaneous scatters within rows) followed by a redistribution of y[(0,1),()] to y[(0),()] (simultaneous allgathers within rows). However, recognize that a scatter followed by an allgather is equivalent to a broadcast. Thus, redistributing a_j[(0),(1)] to a_j[(0),()] can be more directly accomplished by broadcasting within rows. Similarly, summing duplicated vectors y[(0),()], leaving the result as a_j[(0),(1)] (a column in A), can be accomplished by first summing them into y[(0,1),()] (reduce-scatters within rows) followed by a redistribution to a_j[(0),(1)] (gathers within rows). But a reduce-scatter followed by a gather is equivalent to a reduce(-to-one) collective communication.

All communication patterns with vectors, rows, and columns. We summarize all the communication patterns that will be encountered when performing various matrix-vector multiplications or rank-1 updates, with vectors, columns, or rows as input, in Figure 5.2.

5.6. Parallelizing matrix-vector operations (revisited). We now show how the notation discussed in the previous subsection pays off when describing algorithms for matrix-vector operations. Assume that A, x, and y are respectively distributed as A[(0),(1)], x[(1,0),()], and y[(0,1),()]. Algorithms for computing y := Ax and A := A + xy^T are given in Figures 5.3 and 5.4.
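To complement Figure 5.4, here is a NumPy sketch (ours, a semantic model only) of the rank-1 update A := A + x y^T on a d0 x d1 mesh: after the redistributions of Figure 5.4, node (s0, s1) holds the entries of x with index = s0 (mod d0) and the entries of y with index = s1 (mod d1), so the local update is an outer product of those two pieces and no communication of A is needed.

  import numpy as np

  m, n, d0, d1 = 8, 9, 2, 3
  rng = np.random.default_rng(1)
  A = rng.standard_normal((m, n))
  x = rng.standard_normal(m)      # multiplies into the rows of A
  y = rng.standard_normal(n)      # multiplies into the columns of A

  A_ref = A + np.outer(x, y)      # what the distributed update must produce

  # After the permutation/allgather steps of Figure 5.4, node (s0, s1) holds
  # x[(0),()]: the entries x_i with i = s0 (mod d0), and y[(1),()]: the entries
  # y_j with j = s1 (mod d1). The update of the local matrix is a local outer product.
  for s0 in range(d0):
      for s1 in range(d1):
          A[s0::d0, s1::d1] += np.outer(x[s0::d0], y[s1::d1])

  assert np.allclose(A, A_ref)
  print("local outer products on the mesh reproduce A + x y^T")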

Fig. 5.5. Parallel algorithm for computing ĉ_i := A b_j, where ĉ_i is a row of a matrix C and b_j is a column of a matrix B:
  Redistribute b_j as x[(1),()]:
    x[(0,1),()] <- b_j[(0),(1)]          (scatter in rows)
    x[(1,0),()] <- x[(0,1),()]           (permutation)
    x[(1),()] <- x[(1,0),()]             (allgather in columns)
  y^(1)[(0),()] := A[(0),(1)] x[(1),()]   (local matrix-vector multiply)
  Sum contributions and leave the result as ĉ_i[(0),(1)]:
    y[(0,1),()] := Sum_1 y^(1)[(0),()]    (reduce-scatter in rows)
    y[(1,0),()] <- y[(0,1),()]            (permutation)
    ĉ_i[(0),(1)] <- y[(1,0),()]           (gather in rows)

The discussion in Section 5.5 provides the insights to generalize these parallel matrix-vector operations to the cases where the vectors are rows and/or columns of matrices. For example, in Figure 5.5 we show how to compute ĉ_i := A b_j, where the result is stored as a row ĉ_i of a matrix C and the input b_j is a column of a matrix B. Certain steps in Figures 5.3-5.5 have superscripts associated with the outputs of local computations. These superscripts indicate that contributions, rather than final results, are computed by the operation. Further, the subscript to Sum indicates along which dimension of the processing grid a reduction of contributions must occur.

5.7. Similar operations. What we have described is a general method. We leave it as an exercise to the reader to derive parallel algorithms for x := A^T y and A := yx^T + A, starting with vectors that are distributed in various ways.

6. Elemental SUMMA: 2D algorithms (eSUMMA2D). We have now arrived at the point where we can discuss parallel matrix-matrix multiplication on a d0 x d1 mesh, with p = d0 d1. In our discussion, we will assume an elemental distribution, but the ideas clearly generalize to other Cartesian distributions.

This section exposes a systematic path from the parallel rank-1 update and matrix-vector multiplication algorithms to highly efficient 2D parallel matrix-matrix multiplication algorithms. The strategy is to first recognize that a matrix-matrix multiplication can be performed by a series of rank-1 updates or matrix-vector multiplications. This gives us parallel algorithms that are inefficient. By then recognizing that the order of operations can be changed so that communication and computation can be separated and consolidated, these inefficient algorithms are transformed into efficient algorithms. While only explained for some of the cases of matrix multiplication, we believe the exposition is such that the reader can derive algorithms for the remaining cases by applying the ideas in a straightforward manner. To fully understand how to attain high performance on a single processor, the reader should familiarize him/herself with, for example, the techniques in [13].

Fig. 6.1. Summary of the communication patterns for redistributing a matrix A among the distributions A[(0),(1)], A[(1),(0)], A[(0),()], A[(1),()], A[(),(0)], A[(),(1)], A[(0,1),()], A[(1,0),()], A[(),(0,1)], and A[(),(1,0)], with the edges of the diagram labeled by the collectives (allgather, reduce-scatter, all-to-all, permutation) that realize them.

6.1. Elemental stationary C algorithms (eSUMMA2D-C). We first discuss the case C := C + AB, where A has k columns and B has k rows, with k relatively small. We call this a rank-k update or panel-panel multiplication [13]. We will assume the distributions C[(0),(1)], A[(0),(1)], and B[(0),(1)]. Partition

  A = ( a_0 | a_1 | ... | a_{k-1} )   and   B = ( b̂_0^T ; b̂_1^T ; ... ; b̂_{k-1}^T )   (B partitioned by rows),

so that

  C := ( ( (C + a_0 b̂_0^T) + a_1 b̂_1^T ) + ... ) + a_{k-1} b̂_{k-1}^T.

There is an algorithmic block size, b_alg, for which a local rank-k update achieves peak performance [13]. Think of k as being that algorithmic block size for now.
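Before parallelizing, it is worth verifying the purely algebraic statement that a rank-k update is an accumulation of k rank-1 updates; the NumPy sketch below (ours) does just that, with columns a_p of A and rows b̂_p^T of B.

  import numpy as np

  m, n, k = 6, 7, 4
  rng = np.random.default_rng(2)
  A = rng.standard_normal((m, k))     # k columns a_0, ..., a_{k-1}
  B = rng.standard_normal((k, n))     # k rows    b_0^T, ..., b_{k-1}^T
  C = rng.standard_normal((m, n))

  C_ref = C + A @ B                    # C := A B + C, done in one call

  # Accumulate the same result as k successive rank-1 updates C += a_p b_p^T.
  for p_idx in range(k):
      C = C + np.outer(A[:, p_idx], B[p_idx, :])

  assert np.allclose(C, C_ref)
  print("C := A B + C equals the accumulation of k rank-1 updates")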

Fig. 6.2. Algorithms for computing C := AB + C. Left: stationary C. Right: stationary A.

Algorithm: C := Gemm_C(C, A, B)   (stationary C)
  Partition A -> ( A_L | A_R ), B -> ( B_T ; B_B ), where A_L has 0 columns and B_T has 0 rows
  while n(A_L) < n(A) do
    Determine block size b
    Repartition ( A_L | A_R ) -> ( A_0 | A_1 | A_2 ), ( B_T ; B_B ) -> ( B_0 ; B_1 ; B_2 ),
      where A_1 has b columns and B_1 has b rows
    A_1[(0),()] <- A_1[(0),(1)]
    B_1[(),(1)] <- B_1[(0),(1)]
    C[(0),(1)] := C[(0),(1)] + A_1[(0),()] B_1[(),(1)]
    Continue with ( A_L | A_R ) <- ( A_0 A_1 | A_2 ), ( B_T ; B_B ) <- ( B_0 B_1 ; B_2 )
  endwhile

Algorithm: C := Gemm_A(C, A, B)   (stationary A)
  Partition C -> ( C_L | C_R ), B -> ( B_L | B_R ), where C_L and B_L have 0 columns
  while n(C_L) < n(C) do
    Determine block size b
    Repartition ( C_L | C_R ) -> ( C_0 | C_1 | C_2 ), ( B_L | B_R ) -> ( B_0 | B_1 | B_2 ),
      where C_1 and B_1 have b columns
    B_1[(1),()] <- B_1[(0),(1)]
    C_1^(1)[(0),()] := A[(0),(1)] B_1[(1),()]
    C_1[(0),(1)] := C_1[(0),(1)] + Sum_1 C_1^(1)[(0),()]
    Continue with ( C_L | C_R ) <- ( C_0 C_1 | C_2 ), ( B_L | B_R ) <- ( B_0 B_1 | B_2 )
  endwhile

The following loop computes C := AB + C:

  for p = 0, ..., k-1
    a_p[(0),()] <- a_p[(0),(1)]                           (broadcasts within rows)
    b̂_p^T[(),(1)] <- b̂_p^T[(0),(1)]                       (broadcasts within cols)
    C[(0),(1)] := C[(0),(1)] + a_p[(0),()] b̂_p^T[(),(1)]   (local rank-1 updates)
  endfor

While Section 5.6 gives a parallel algorithm for ger, the problem with this algorithm is that (1) it creates a lot of messages and (2) the local computation is a rank-1 update, which inherently does not achieve high performance since it is memory bandwidth bound. The algorithm can be rewritten as

  for p = 0, ..., k-1
    a_p[(0),()] <- a_p[(0),(1)]                           (broadcasts within rows)
  endfor
  for p = 0, ..., k-1
    b̂_p^T[(),(1)] <- b̂_p^T[(0),(1)]                       (broadcasts within cols)
  endfor
  for p = 0, ..., k-1
    C[(0),(1)] := C[(0),(1)] + a_p[(0),()] b̂_p^T[(),(1)]   (local rank-1 updates)
  endfor

and finally, equivalently, as

  A[(0),()] <- A[(0),(1)]                           (allgather within rows)
  B[(),(1)] <- B[(0),(1)]                           (allgather within cols)
  C[(0),(1)] := C[(0),(1)] + A[(0),()] B[(),(1)]     (local rank-k update)

Now the local computation is cast in terms of a local matrix-matrix multiplication (rank-k update), which can achieve high performance. Here (given that we assume an elemental distribution) A[(0),()] <- A[(0),(1)] broadcasts, within each row, k columns of A from different roots: an allgather if an elemental distribution is assumed! Similarly, B[(),(1)] <- B[(0),(1)] broadcasts, within each column, k rows of B from different roots: another allgather if an elemental distribution is assumed!

Based on this observation, the SUMMA-like algorithm can be expressed as a loop around such rank-k updates, as given in Figure 6.2 (left). The purpose of the loop is to reduce the workspace required to store duplicated data. Notice that, if an elemental distribution is assumed, the SUMMA-like algorithm should not be called a broadcast-broadcast-compute algorithm. Instead, it becomes an allgather-allgather-compute algorithm. We will also call it a stationary C algorithm, since C is not communicated (and hence "owner computes" is determined by what processor owns what element of C). The primary benefit from having a loop around rank-k updates is that it reduces the required local workspace at the expense of an increase only in the alpha term of the communication cost.

We label this algorithm eSUMMA2D-C, an elemental SUMMA-like algorithm targeting a two-dimensional mesh of nodes, stationary C variant. It is not hard to extend the insights to non-elemental distributions (as, for example, used by ScaLAPACK or PLAPACK). An approximate cost for the described algorithm is given by

  T_eSUMMA2D-C(m, n, k, d0, d1) = 2 (mnk/p) gamma + (k/b_alg) log2(d1) alpha + ((d1-1)/p) mk beta
                                                  + (k/b_alg) log2(d0) alpha + ((d0-1)/p) nk beta
                                = 2 (mnk/p) gamma + (k/b_alg) log2(p) alpha + ((d1-1)/p) mk beta + ((d0-1)/p) nk beta,

where the last three terms constitute T^+_eSUMMA2D-C(m, n, k, d0, d1). This estimate ignores load imbalance (which leads to a gamma term of the same order as the beta terms) and the fact that the allgathers may be unbalanced if b_alg is not an integer multiple of both d0 and d1. As before and throughout this paper, T^+ refers to the communication overhead of the proposed algorithm (e.g., T^+_eSUMMA2D-C refers to the communication overhead of the eSUMMA2D-C algorithm).

It is not hard to see that, for practical purposes, the weak scalability of the eSUMMA2D-C algorithm mirrors that of the parallel matrix-vector multiplication algorithm analyzed in Appendix A: it is weakly scalable when m = n and d0 = d1, for arbitrary k. (The very slowly growing factor log2(p) prevents weak scalability unless it is treated as a constant.)

At this point it is important to mention that this resulting algorithm may seem similar to an approach described in prior work [1]. Indeed, this allgather-allgather-compute approach to parallel matrix-matrix multiplication is described in that paper
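The allgather-allgather-compute step is easy to validate with the same index gymnastics used earlier. The sketch below (ours, and only a semantic model: the "allgathers" are plain NumPy slices of the global matrices) forms, on each node (s0, s1), the local pieces A[(0),()] and B[(),(1)], performs the local rank-k update, and checks that the local results collectively equal C + AB under the elemental distribution.

  import numpy as np

  m, n, k, d0, d1 = 8, 9, 5, 2, 3
  rng = np.random.default_rng(3)
  A = rng.standard_normal((m, k))
  B = rng.standard_normal((k, n))
  C = rng.standard_normal((m, n))
  C_ref = C + A @ B

  for s0 in range(d0):
      for s1 in range(d1):
          # A[(0),()] on node (s0, s1): its d0-strided rows of A, all k columns
          # (the result of the allgather within its row of nodes).
          A_dup = A[s0::d0, :]
          # B[(),(1)] on node (s0, s1): all k rows of B, its d1-strided columns
          # (the result of the allgather within its column of nodes).
          B_dup = B[:, s1::d1]
          # Local rank-k update of the local part of C; C itself is never moved.
          C_loc = C[s0::d0, s1::d1] + A_dup @ B_dup
          assert np.allclose(C_loc, C_ref[s0::d0, s1::d1])

  print("allgather-allgather-compute reproduces C + A B on every node")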

for the matrix-matrix multiplication variants C = AB, C = AB^T, C = A^T B, and C = A^T B^T, under the assumption that all matrices are approximately the same size (a surmountable limitation). We use FLAME notation to express the algorithm, which has been used in our papers for more than a decade [15]. As we have argued previously, the allgather-allgather-compute approach is particularly well-suited for situations where we wish not to communicate the matrix C. In the next section, we describe how to systematically derive algorithms for situations where we wish to avoid communicating the matrix A.

6.2. Elemental stationary A algorithms (eSUMMA2D-A). Next, we discuss the case C := C + AB, where C and B have n columns each, with n relatively small. For simplicity, we also call that parameter b_alg. We call this a matrix-panel multiplication [13]. We again assume that the matrices are distributed as C[(0),(1)], A[(0),(1)], and B[(0),(1)]. Partition

  C = ( c_0 | c_1 | ... | c_{n-1} )   and   B = ( b_0 | b_1 | ... | b_{n-1} ),

so that c_j = A b_j + c_j. The following loop will compute C = AB + C:

  for j = 0, ..., n-1
    b_j[(0,1),()] <- b_j[(0),(1)]             (scatters within rows)
    b_j[(1,0),()] <- b_j[(0,1),()]            (permutation)
    b_j[(1),()] <- b_j[(1,0),()]              (allgathers within cols)
    c_j[(0),()] := A[(0),(1)] b_j[(1),()]      (local matvec mult)
    c_j[(0),(1)] <- Sum_1 c_j[(0),()]          (reduce-to-one within rows)
  endfor

While Section 5.6 gives a parallel algorithm for gemv, the problem again is that (1) it creates a lot of messages and (2) the local computation is a matrix-vector multiply, which inherently does not achieve high performance since it is memory bandwidth bound. This can be restructured as

  for j = 0, ..., n-1
    b_j[(0,1),()] <- b_j[(0),(1)]             (scatters within rows)
  endfor
  for j = 0, ..., n-1
    b_j[(1,0),()] <- b_j[(0,1),()]            (permutation)
  endfor
  for j = 0, ..., n-1
    b_j[(1),()] <- b_j[(1,0),()]              (allgathers within cols)
  endfor
  for j = 0, ..., n-1
    c_j[(0),()] := A[(0),(1)] b_j[(1),()]      (local matvec mult)
  endfor
  for j = 0, ..., n-1
    c_j[(0),(1)] <- Sum_1 c_j[(0),()]          (simultaneous reduce-to-one within rows)
  endfor

and finally, equivalently, as

  B[(1),()] <- B[(0),(1)]                               (all-to-all within rows, permutation, allgather within cols)
  C[(0),()] := A[(0),(1)] B[(1),()] + C[(0),()]          (simultaneous local matrix multiplications)
  C[(0),(1)] <- C[(0),()]                                (reduce-scatter within rows)

Now the local computation is cast in terms of a local matrix-matrix multiplication (matrix-panel multiply), which can achieve high performance. A stationary A algorithm for arbitrary n can now be expressed as a loop around such parallel matrix-panel multiplies, as given in Figure 6.2 (right).

An approximate cost for the described algorithm is given by

  T_eSUMMA2D-A(m, n, k, d0, d1) = (n/b_alg) log2(d1) alpha + ((d1-1)/d1)(nk/d0) beta          (all-to-all within rows)
                                + (n/b_alg) alpha + (nk/(d1 d0)) beta                          (permutation)
                                + (n/b_alg) log2(d0) alpha + ((d0-1)/d0)(nk/d1) beta           (allgather within cols)
                                + 2 (mnk/p) gamma                                              (simultaneous local matrix-panel mult)
                                + (n/b_alg) log2(d1) alpha + ((d1-1)/d1)(mn/d0) beta + ((d1-1)/d1)(mn/d0) gamma   (reduce-scatter within rows)

As we discussed earlier, the cost function for the all-to-all operation is somewhat suspect. Still, if an algorithm that attains the lower bound for the alpha term is employed, the beta term must at most increase by a factor of log2(d1) [6], meaning that it is not the dominant communication cost. The estimate ignores load imbalance (which leads to a gamma term of the same order as the beta terms) and the fact that various collective communications may be unbalanced if b_alg is not an integer multiple of both d0 and d1.

While the overhead is clearly greater than that of the eSUMMA2D-C algorithm when m = n = k, the overhead is comparable to that of the eSUMMA2D-C algorithm, so the weak scalability results are, asymptotically, the same. Also, it is not hard to see that if m and k are large while n is small, this algorithm achieves better parallelism since less communication is required: the stationary matrix, A, is then the largest matrix, and not communicating it is beneficial. Similarly, if m and n are large while k is small, then the eSUMMA2D-C algorithm does not communicate the largest matrix, C, which is beneficial.

6.3. Communicating submatrices. In Figure 6.1 we illustrate the redistributions of submatrices from one distribution to another and the collective communications required to implement them.

6.4. Other cases. We leave it as an exercise to the reader to propose and analyze the remaining stationary A, B, and C algorithms for the other cases of matrix-matrix multiplication: C := A^T B + C, C := AB^T + C, and C := A^T B^T + C. The point is that we have presented a systematic framework for deriving a family of parallel matrix-matrix multiplication algorithms.

7. Elemental SUMMA: 3D algorithms (eSUMMA3D). We now view the p processors as forming a d0 x d1 x h mesh, which one should visualize as h stacked layers, where each layer consists of a d0 x d1 mesh. The extra dimension will be used to gain an extra level of parallelism, which reduces the overhead of the 2D SUMMA algorithms within each layer at the expense of communications between the layers.

The approach for generalizing elemental SUMMA 2D algorithms to elemental SUMMA 3D algorithms can be easily modified to use Cannon's or Fox's algorithms (with the constraints and complications that come from using those algorithms), or any other distribution for which SUMMA can be used (pretty much any Cartesian distribution). While much of the new innovation for this paper is in this section, we believe the way we built up the intuition of the reader renders most of this new innovation much like a corollary to a theorem.

7.1. 3D stationary C algorithms (eSUMMA3D-C). Partition A and B so that

  A = ( A_0 | A_1 | ... | A_{h-1} )   and   B = ( B_0 ; B_1 ; ... ; B_{h-1} )   (B partitioned by blocks of rows),

where A_H and B_H have approximately k/h columns and rows, respectively. Then

  C + AB = (C + A_0 B_0)  +  (0 + A_1 B_1)  +  ...  +  (0 + A_{h-1} B_{h-1}),
           [by layer 0]      [by layer 1]              [by layer h-1]

This suggests the following 3D algorithm:
- Duplicate C to each of the layers, initializing the duplicates assigned to layers 1 through h-1 to zero. This requires no communication. We will ignore the cost of setting the duplicates to zero.
- Scatter A and B so that layer H receives A_H and B_H. This means that all processors (I, J, 0) simultaneously scatter approximately (m + n)k/(d0 d1) data to processors (I, J, 0) through (I, J, h-1). The cost of such a scatter can be approximated by

(7.1)  log2(h) alpha + ((h-1)/h) ((m+n)k/(d0 d1)) beta = log2(h) alpha + ((h-1)(m+n)k/p) beta.

- Compute C := C + A_H B_H simultaneously on all the layers. If eSUMMA2D-C is used for this in each layer, the cost is approximated by

(7.2)  2 (mnk/p) gamma + (k/(h b_alg)) (log2(p) - log2(h)) alpha + ((d1-1) mk/p) beta + ((d0-1) nk/p) beta,

where the communication terms constitute T^+_eSUMMA2D-C(m, n, k/h, d0, d1).
- Perform reduce operations to sum the contributions from the different layers to the copy of C in layer 0. This means that contributions from processors (I, J, 0) through (I, J, h-1) are reduced to processor (I, J, 0). An estimate for this reduce(-to-one) is

(7.3)  log2(h) alpha + (mn/(d0 d1)) beta + (mn/(d0 d1)) gamma = log2(h) alpha + (mnh/p) beta + (mnh/p) gamma.

Thus, an estimate for the total cost of this eSUMMA3D-C algorithm for this case of gemm results from adding (7.1)-(7.3). Let us analyze the case where m = n = k and d0 = d1 = sqrt(p/h) in detail. The cost becomes

  C_eSUMMA3D-C(n, n, n, d0, d0, h) = 2 (n^3/p) gamma + (n/(h b_alg)) (log2(p) - log2(h)) alpha + 2 (d0-1)(n^2/p) beta
                                     + log2(h) alpha + (h-1)(n^2/p) beta + ...
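The layer-by-layer splitting that drives eSUMMA3D-C is itself a purely algebraic identity that can be checked directly. In the sketch below (ours), A is split into h column blocks and B into h row blocks, each "layer" computes its own product starting from its duplicate of C, and a final reduction over layers recovers C + AB; the communication (the scatter of A and B and the reduce of the per-layer copies of C) is not modeled.

  import numpy as np

  m, n, k, h = 6, 7, 12, 3            # h = number of layers; k divisible by h here
  rng = np.random.default_rng(4)
  A = rng.standard_normal((m, k))
  B = rng.standard_normal((k, n))
  C = rng.standard_normal((m, n))
  C_ref = C + A @ B

  kb = k // h
  layer_results = []
  for H in range(h):
      A_H = A[:, H * kb:(H + 1) * kb]               # column block of A sent to layer H
      B_H = B[H * kb:(H + 1) * kb, :]               # row block of B sent to layer H
      C_H = C.copy() if H == 0 else np.zeros((m, n))   # duplicate of C; zero on layers 1..h-1
      layer_results.append(C_H + A_H @ B_H)         # each layer runs its own 2D algorithm

  C_out = sum(layer_results)                        # reduce over layers back to layer 0
  assert np.allclose(C_out, C_ref)
  print("summing the per-layer results reproduces C + A B")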


More information

Hidden Predictors: A Factor Analysis Primer

Hidden Predictors: A Factor Analysis Primer Hidden Predictors: A Factor Analysis Primer Ryan C Sanchez Western Washington University Factor Analysis is a owerful statistical method in the modern research sychologist s toolbag When used roerly, factor

More information

RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES

RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES AARON ZWIEBACH Abstract. In this aer we will analyze research that has been recently done in the field of discrete

More information

Reconstructing Householder Vectors from Tall-Skinny QR

Reconstructing Householder Vectors from Tall-Skinny QR Reconstructing Householder Vectors from Tall-Skinny QR Grey Ballard James Demmel Laura Grigori Mathias Jacquelin Hong Die Nguyen Edgar Solomonik Electrical Engineering and Comuter Sciences University of

More information

Eigenanalysis of Finite Element 3D Flow Models by Parallel Jacobi Davidson

Eigenanalysis of Finite Element 3D Flow Models by Parallel Jacobi Davidson Eigenanalysis of Finite Element 3D Flow Models by Parallel Jacobi Davidson Luca Bergamaschi 1, Angeles Martinez 1, Giorgio Pini 1, and Flavio Sartoretto 2 1 Diartimento di Metodi e Modelli Matematici er

More information

1. INTRODUCTION. Fn 2 = F j F j+1 (1.1)

1. INTRODUCTION. Fn 2 = F j F j+1 (1.1) CERTAIN CLASSES OF FINITE SUMS THAT INVOLVE GENERALIZED FIBONACCI AND LUCAS NUMBERS The beautiful identity R.S. Melham Deartment of Mathematical Sciences, University of Technology, Sydney PO Box 23, Broadway,

More information

Positive decomposition of transfer functions with multiple poles

Positive decomposition of transfer functions with multiple poles Positive decomosition of transfer functions with multile oles Béla Nagy 1, Máté Matolcsi 2, and Márta Szilvási 1 Deartment of Analysis, Technical University of Budaest (BME), H-1111, Budaest, Egry J. u.

More information

Outline. EECS150 - Digital Design Lecture 26 Error Correction Codes, Linear Feedback Shift Registers (LFSRs) Simple Error Detection Coding

Outline. EECS150 - Digital Design Lecture 26 Error Correction Codes, Linear Feedback Shift Registers (LFSRs) Simple Error Detection Coding Outline EECS150 - Digital Design Lecture 26 Error Correction Codes, Linear Feedback Shift Registers (LFSRs) Error detection using arity Hamming code for error detection/correction Linear Feedback Shift

More information

q-ary Symmetric Channel for Large q

q-ary Symmetric Channel for Large q List-Message Passing Achieves Caacity on the q-ary Symmetric Channel for Large q Fan Zhang and Henry D Pfister Deartment of Electrical and Comuter Engineering, Texas A&M University {fanzhang,hfister}@tamuedu

More information

Sets of Real Numbers

Sets of Real Numbers Chater 4 Sets of Real Numbers 4. The Integers Z and their Proerties In our revious discussions about sets and functions the set of integers Z served as a key examle. Its ubiquitousness comes from the fact

More information

An Investigation on the Numerical Ill-conditioning of Hybrid State Estimators

An Investigation on the Numerical Ill-conditioning of Hybrid State Estimators An Investigation on the Numerical Ill-conditioning of Hybrid State Estimators S. K. Mallik, Student Member, IEEE, S. Chakrabarti, Senior Member, IEEE, S. N. Singh, Senior Member, IEEE Deartment of Electrical

More information

arxiv: v2 [quant-ph] 2 Aug 2012

arxiv: v2 [quant-ph] 2 Aug 2012 Qcomiler: quantum comilation with CSD method Y. G. Chen a, J. B. Wang a, a School of Physics, The University of Western Australia, Crawley WA 6009 arxiv:208.094v2 [quant-h] 2 Aug 202 Abstract In this aer,

More information

5. PRESSURE AND VELOCITY SPRING Each component of momentum satisfies its own scalar-transport equation. For one cell:

5. PRESSURE AND VELOCITY SPRING Each component of momentum satisfies its own scalar-transport equation. For one cell: 5. PRESSURE AND VELOCITY SPRING 2019 5.1 The momentum equation 5.2 Pressure-velocity couling 5.3 Pressure-correction methods Summary References Examles 5.1 The Momentum Equation Each comonent of momentum

More information

c Copyright by Helen J. Elwood December, 2011

c Copyright by Helen J. Elwood December, 2011 c Coyright by Helen J. Elwood December, 2011 CONSTRUCTING COMPLEX EQUIANGULAR PARSEVAL FRAMES A Dissertation Presented to the Faculty of the Deartment of Mathematics University of Houston In Partial Fulfillment

More information

Multilayer Perceptron Neural Network (MLPs) For Analyzing the Properties of Jordan Oil Shale

Multilayer Perceptron Neural Network (MLPs) For Analyzing the Properties of Jordan Oil Shale World Alied Sciences Journal 5 (5): 546-552, 2008 ISSN 1818-4952 IDOSI Publications, 2008 Multilayer Percetron Neural Network (MLPs) For Analyzing the Proerties of Jordan Oil Shale 1 Jamal M. Nazzal, 2

More information

Multi-Operation Multi-Machine Scheduling

Multi-Operation Multi-Machine Scheduling Multi-Oeration Multi-Machine Scheduling Weizhen Mao he College of William and Mary, Williamsburg VA 3185, USA Abstract. In the multi-oeration scheduling that arises in industrial engineering, each job

More information

Cryptanalysis of Pseudorandom Generators

Cryptanalysis of Pseudorandom Generators CSE 206A: Lattice Algorithms and Alications Fall 2017 Crytanalysis of Pseudorandom Generators Instructor: Daniele Micciancio UCSD CSE As a motivating alication for the study of lattice in crytograhy we

More information

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Technical Sciences and Alied Mathematics MODELING THE RELIABILITY OF CISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Cezar VASILESCU Regional Deartment of Defense Resources Management

More information

Towards understanding the Lorenz curve using the Uniform distribution. Chris J. Stephens. Newcastle City Council, Newcastle upon Tyne, UK

Towards understanding the Lorenz curve using the Uniform distribution. Chris J. Stephens. Newcastle City Council, Newcastle upon Tyne, UK Towards understanding the Lorenz curve using the Uniform distribution Chris J. Stehens Newcastle City Council, Newcastle uon Tyne, UK (For the Gini-Lorenz Conference, University of Siena, Italy, May 2005)

More information

Unit 1 - Computer Arithmetic

Unit 1 - Computer Arithmetic FIXD-POINT (FX) ARITHMTIC Unit 1 - Comuter Arithmetic INTGR NUMBRS n bit number: b n 1 b n 2 b 0 Decimal Value Range of values UNSIGND n 1 SIGND D = b i 2 i D = 2 n 1 b n 1 + b i 2 i n 2 i=0 i=0 [0, 2

More information

Multiplicative group law on the folium of Descartes

Multiplicative group law on the folium of Descartes Multilicative grou law on the folium of Descartes Steluţa Pricoie and Constantin Udrişte Abstract. The folium of Descartes is still studied and understood today. Not only did it rovide for the roof of

More information

DMS: Distributed Sparse Tensor Factorization with Alternating Least Squares

DMS: Distributed Sparse Tensor Factorization with Alternating Least Squares DMS: Distributed Sarse Tensor Factorization with Alternating Least Squares Shaden Smith, George Karyis Deartment of Comuter Science and Engineering, University of Minnesota {shaden, karyis}@cs.umn.edu

More information

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO)

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO) Combining Logistic Regression with Kriging for Maing the Risk of Occurrence of Unexloded Ordnance (UXO) H. Saito (), P. Goovaerts (), S. A. McKenna (2) Environmental and Water Resources Engineering, Deartment

More information

RUN-TO-RUN CONTROL AND PERFORMANCE MONITORING OF OVERLAY IN SEMICONDUCTOR MANUFACTURING. 3 Department of Chemical Engineering

RUN-TO-RUN CONTROL AND PERFORMANCE MONITORING OF OVERLAY IN SEMICONDUCTOR MANUFACTURING. 3 Department of Chemical Engineering Coyright 2002 IFAC 15th Triennial World Congress, Barcelona, Sain RUN-TO-RUN CONTROL AND PERFORMANCE MONITORING OF OVERLAY IN SEMICONDUCTOR MANUFACTURING C.A. Bode 1, B.S. Ko 2, and T.F. Edgar 3 1 Advanced

More information

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests 009 American Control Conference Hyatt Regency Riverfront, St. Louis, MO, USA June 0-, 009 FrB4. System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests James C. Sall Abstract

More information

State Estimation with ARMarkov Models

State Estimation with ARMarkov Models Deartment of Mechanical and Aerosace Engineering Technical Reort No. 3046, October 1998. Princeton University, Princeton, NJ. State Estimation with ARMarkov Models Ryoung K. Lim 1 Columbia University,

More information

VIBRATION ANALYSIS OF BEAMS WITH MULTIPLE CONSTRAINED LAYER DAMPING PATCHES

VIBRATION ANALYSIS OF BEAMS WITH MULTIPLE CONSTRAINED LAYER DAMPING PATCHES Journal of Sound and Vibration (998) 22(5), 78 85 VIBRATION ANALYSIS OF BEAMS WITH MULTIPLE CONSTRAINED LAYER DAMPING PATCHES Acoustics and Dynamics Laboratory, Deartment of Mechanical Engineering, The

More information

Notes on Instrumental Variables Methods

Notes on Instrumental Variables Methods Notes on Instrumental Variables Methods Michele Pellizzari IGIER-Bocconi, IZA and frdb 1 The Instrumental Variable Estimator Instrumental variable estimation is the classical solution to the roblem of

More information

Modeling and Estimation of Full-Chip Leakage Current Considering Within-Die Correlation

Modeling and Estimation of Full-Chip Leakage Current Considering Within-Die Correlation 6.3 Modeling and Estimation of Full-Chi Leaage Current Considering Within-Die Correlation Khaled R. eloue, Navid Azizi, Farid N. Najm Deartment of ECE, University of Toronto,Toronto, Ontario, Canada {haled,nazizi,najm}@eecg.utoronto.ca

More information

Factorability in the ring Z[ 5]

Factorability in the ring Z[ 5] University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Dissertations, Theses, and Student Research Paers in Mathematics Mathematics, Deartment of 4-2004 Factorability in the ring

More information

8 STOCHASTIC PROCESSES

8 STOCHASTIC PROCESSES 8 STOCHASTIC PROCESSES The word stochastic is derived from the Greek στoχαστικoς, meaning to aim at a target. Stochastic rocesses involve state which changes in a random way. A Markov rocess is a articular

More information

The Graph Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule

The Graph Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule The Grah Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule STEFAN D. BRUDA Deartment of Comuter Science Bisho s University Lennoxville, Quebec J1M 1Z7 CANADA bruda@cs.ubishos.ca

More information

MODEL-BASED MULTIPLE FAULT DETECTION AND ISOLATION FOR NONLINEAR SYSTEMS

MODEL-BASED MULTIPLE FAULT DETECTION AND ISOLATION FOR NONLINEAR SYSTEMS MODEL-BASED MULIPLE FAUL DEECION AND ISOLAION FOR NONLINEAR SYSEMS Ivan Castillo, and homas F. Edgar he University of exas at Austin Austin, X 78712 David Hill Chemstations Houston, X 77009 Abstract A

More information

Probability Estimates for Multi-class Classification by Pairwise Coupling

Probability Estimates for Multi-class Classification by Pairwise Coupling Probability Estimates for Multi-class Classification by Pairwise Couling Ting-Fan Wu Chih-Jen Lin Deartment of Comuter Science National Taiwan University Taiei 06, Taiwan Ruby C. Weng Deartment of Statistics

More information

Solution sheet ξi ξ < ξ i+1 0 otherwise ξ ξ i N i,p 1 (ξ) + where 0 0

Solution sheet ξi ξ < ξ i+1 0 otherwise ξ ξ i N i,p 1 (ξ) + where 0 0 Advanced Finite Elements MA5337 - WS7/8 Solution sheet This exercise sheets deals with B-slines and NURBS, which are the basis of isogeometric analysis as they will later relace the olynomial ansatz-functions

More information

An Analysis of TCP over Random Access Satellite Links

An Analysis of TCP over Random Access Satellite Links An Analysis of over Random Access Satellite Links Chunmei Liu and Eytan Modiano Massachusetts Institute of Technology Cambridge, MA 0239 Email: mayliu, modiano@mit.edu Abstract This aer analyzes the erformance

More information

ON THE LEAST SIGNIFICANT p ADIC DIGITS OF CERTAIN LUCAS NUMBERS

ON THE LEAST SIGNIFICANT p ADIC DIGITS OF CERTAIN LUCAS NUMBERS #A13 INTEGERS 14 (014) ON THE LEAST SIGNIFICANT ADIC DIGITS OF CERTAIN LUCAS NUMBERS Tamás Lengyel Deartment of Mathematics, Occidental College, Los Angeles, California lengyel@oxy.edu Received: 6/13/13,

More information

arxiv: v1 [physics.data-an] 26 Oct 2012

arxiv: v1 [physics.data-an] 26 Oct 2012 Constraints on Yield Parameters in Extended Maximum Likelihood Fits Till Moritz Karbach a, Maximilian Schlu b a TU Dortmund, Germany, moritz.karbach@cern.ch b TU Dortmund, Germany, maximilian.schlu@cern.ch

More information

The Noise Power Ratio - Theory and ADC Testing

The Noise Power Ratio - Theory and ADC Testing The Noise Power Ratio - Theory and ADC Testing FH Irons, KJ Riley, and DM Hummels Abstract This aer develos theory behind the noise ower ratio (NPR) testing of ADCs. A mid-riser formulation is used for

More information

Statics and dynamics: some elementary concepts

Statics and dynamics: some elementary concepts 1 Statics and dynamics: some elementary concets Dynamics is the study of the movement through time of variables such as heartbeat, temerature, secies oulation, voltage, roduction, emloyment, rices and

More information

Chapter 3 - From Gaussian Elimination to LU Factorization

Chapter 3 - From Gaussian Elimination to LU Factorization Chapter 3 - From Gaussian Elimination to LU Factorization Maggie Myers Robert A. van de Geijn The University of Texas at Austin Practical Linear Algebra Fall 29 http://z.cs.utexas.edu/wiki/pla.wiki/ 1

More information

Scaling Multiple Point Statistics for Non-Stationary Geostatistical Modeling

Scaling Multiple Point Statistics for Non-Stationary Geostatistical Modeling Scaling Multile Point Statistics or Non-Stationary Geostatistical Modeling Julián M. Ortiz, Steven Lyster and Clayton V. Deutsch Centre or Comutational Geostatistics Deartment o Civil & Environmental Engineering

More information

Chapter 7 Rational and Irrational Numbers

Chapter 7 Rational and Irrational Numbers Chater 7 Rational and Irrational Numbers In this chater we first review the real line model for numbers, as discussed in Chater 2 of seventh grade, by recalling how the integers and then the rational numbers

More information

Elementary Analysis in Q p

Elementary Analysis in Q p Elementary Analysis in Q Hannah Hutter, May Szedlák, Phili Wirth November 17, 2011 This reort follows very closely the book of Svetlana Katok 1. 1 Sequences and Series In this section we will see some

More information

Generalized analysis method of engine suspensions based on bond graph modeling and feedback control theory

Generalized analysis method of engine suspensions based on bond graph modeling and feedback control theory Generalized analysis method of ine susensions based on bond grah modeling and feedback ntrol theory P.Y. RICHARD *, C. RAMU-ERMENT **, J. BUION *, X. MOREAU **, M. LE FOL *** *Equie Automatique des ystèmes

More information

p-adic Measures and Bernoulli Numbers

p-adic Measures and Bernoulli Numbers -Adic Measures and Bernoulli Numbers Adam Bowers Introduction The constants B k in the Taylor series exansion t e t = t k B k k! k=0 are known as the Bernoulli numbers. The first few are,, 6, 0, 30, 0,

More information

End-to-End Delay Minimization in Thermally Constrained Distributed Systems

End-to-End Delay Minimization in Thermally Constrained Distributed Systems End-to-End Delay Minimization in Thermally Constrained Distributed Systems Pratyush Kumar, Lothar Thiele Comuter Engineering and Networks Laboratory (TIK) ETH Zürich, Switzerland {ratyush.kumar, lothar.thiele}@tik.ee.ethz.ch

More information

Approximating min-max k-clustering

Approximating min-max k-clustering Aroximating min-max k-clustering Asaf Levin July 24, 2007 Abstract We consider the roblems of set artitioning into k clusters with minimum total cost and minimum of the maximum cost of a cluster. The cost

More information

Improved Capacity Bounds for the Binary Energy Harvesting Channel

Improved Capacity Bounds for the Binary Energy Harvesting Channel Imroved Caacity Bounds for the Binary Energy Harvesting Channel Kaya Tutuncuoglu 1, Omur Ozel 2, Aylin Yener 1, and Sennur Ulukus 2 1 Deartment of Electrical Engineering, The Pennsylvania State University,

More information

Topic 7: Using identity types

Topic 7: Using identity types Toic 7: Using identity tyes June 10, 2014 Now we would like to learn how to use identity tyes and how to do some actual mathematics with them. By now we have essentially introduced all inference rules

More information

A Closed-Form Solution to the Minimum V 2

A Closed-Form Solution to the Minimum V 2 Celestial Mechanics and Dynamical Astronomy manuscrit No. (will be inserted by the editor) Martín Avendaño Daniele Mortari A Closed-Form Solution to the Minimum V tot Lambert s Problem Received: Month

More information

Fault Tolerant Quantum Computing Robert Rogers, Thomas Sylwester, Abe Pauls

Fault Tolerant Quantum Computing Robert Rogers, Thomas Sylwester, Abe Pauls CIS 410/510, Introduction to Quantum Information Theory Due: June 8th, 2016 Sring 2016, University of Oregon Date: June 7, 2016 Fault Tolerant Quantum Comuting Robert Rogers, Thomas Sylwester, Abe Pauls

More information

An Analysis of Reliable Classifiers through ROC Isometrics

An Analysis of Reliable Classifiers through ROC Isometrics An Analysis of Reliable Classifiers through ROC Isometrics Stijn Vanderlooy s.vanderlooy@cs.unimaas.nl Ida G. Srinkhuizen-Kuyer kuyer@cs.unimaas.nl Evgueni N. Smirnov smirnov@cs.unimaas.nl MICC-IKAT, Universiteit

More information

Convex Optimization methods for Computing Channel Capacity

Convex Optimization methods for Computing Channel Capacity Convex Otimization methods for Comuting Channel Caacity Abhishek Sinha Laboratory for Information and Decision Systems (LIDS), MIT sinhaa@mit.edu May 15, 2014 We consider a classical comutational roblem

More information

Matching Partition a Linked List and Its Optimization

Matching Partition a Linked List and Its Optimization Matching Partition a Linked List and Its Otimization Yijie Han Deartment of Comuter Science University of Kentucky Lexington, KY 40506 ABSTRACT We show the curve O( n log i + log (i) n + log i) for the

More information

ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM

ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM JOHN BINDER Abstract. In this aer, we rove Dirichlet s theorem that, given any air h, k with h, k) =, there are infinitely many rime numbers congruent to

More information

DIFFERENTIAL GEOMETRY. LECTURES 9-10,

DIFFERENTIAL GEOMETRY. LECTURES 9-10, DIFFERENTIAL GEOMETRY. LECTURES 9-10, 23-26.06.08 Let us rovide some more details to the definintion of the de Rham differential. Let V, W be two vector bundles and assume we want to define an oerator

More information

CHAPTER 5 TANGENT VECTORS

CHAPTER 5 TANGENT VECTORS CHAPTER 5 TANGENT VECTORS In R n tangent vectors can be viewed from two ersectives (1) they cature the infinitesimal movement along a ath, the direction, and () they oerate on functions by directional

More information

Solved Problems. (a) (b) (c) Figure P4.1 Simple Classification Problems First we draw a line between each set of dark and light data points.

Solved Problems. (a) (b) (c) Figure P4.1 Simple Classification Problems First we draw a line between each set of dark and light data points. Solved Problems Solved Problems P Solve the three simle classification roblems shown in Figure P by drawing a decision boundary Find weight and bias values that result in single-neuron ercetrons with the

More information

Information collection on a graph

Information collection on a graph Information collection on a grah Ilya O. Ryzhov Warren Powell February 10, 2010 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements

More information

CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules

CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules. Introduction: The is widely used in industry to monitor the number of fraction nonconforming units. A nonconforming unit is

More information

Participation Factors. However, it does not give the influence of each state on the mode.

Participation Factors. However, it does not give the influence of each state on the mode. Particiation Factors he mode shae, as indicated by the right eigenvector, gives the relative hase of each state in a articular mode. However, it does not give the influence of each state on the mode. We

More information

arxiv: v1 [quant-ph] 3 Feb 2015

arxiv: v1 [quant-ph] 3 Feb 2015 From reversible comutation to quantum comutation by Lagrange interolation Alexis De Vos and Stin De Baerdemacker 2 arxiv:502.0089v [quant-h] 3 Feb 205 Cmst, Imec v.z.w., vakgroe elektronica en informatiesystemen,

More information

Characteristics of Beam-Based Flexure Modules

Characteristics of Beam-Based Flexure Modules Shorya Awtar e-mail: shorya@mit.edu Alexander H. Slocum e-mail: slocum@mit.edu Precision Engineering Research Grou, Massachusetts Institute of Technology, Cambridge, MA 039 Edi Sevincer Omega Advanced

More information

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek Use of Transformations and the Reeated Statement in PROC GLM in SAS Ed Stanek Introduction We describe how the Reeated Statement in PROC GLM in SAS transforms the data to rovide tests of hyotheses of interest.

More information

Averaging sums of powers of integers and Faulhaber polynomials

Averaging sums of powers of integers and Faulhaber polynomials Annales Mathematicae et Informaticae 42 (20. 0 htt://ami.ektf.hu Averaging sums of owers of integers and Faulhaber olynomials José Luis Cereceda a a Distrito Telefónica Madrid Sain jl.cereceda@movistar.es

More information

STA 250: Statistics. Notes 7. Bayesian Approach to Statistics. Book chapters: 7.2

STA 250: Statistics. Notes 7. Bayesian Approach to Statistics. Book chapters: 7.2 STA 25: Statistics Notes 7. Bayesian Aroach to Statistics Book chaters: 7.2 1 From calibrating a rocedure to quantifying uncertainty We saw that the central idea of classical testing is to rovide a rigorous

More information

Introduction to MVC. least common denominator of all non-identical-zero minors of all order of G(s). Example: The minor of order 2: 1 2 ( s 1)

Introduction to MVC. least common denominator of all non-identical-zero minors of all order of G(s). Example: The minor of order 2: 1 2 ( s 1) Introduction to MVC Definition---Proerness and strictly roerness A system G(s) is roer if all its elements { gij ( s)} are roer, and strictly roer if all its elements are strictly roer. Definition---Causal

More information

A Qualitative Event-based Approach to Multiple Fault Diagnosis in Continuous Systems using Structural Model Decomposition

A Qualitative Event-based Approach to Multiple Fault Diagnosis in Continuous Systems using Structural Model Decomposition A Qualitative Event-based Aroach to Multile Fault Diagnosis in Continuous Systems using Structural Model Decomosition Matthew J. Daigle a,,, Anibal Bregon b,, Xenofon Koutsoukos c, Gautam Biswas c, Belarmino

More information

A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split

A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split A Bound on the Error of Cross Validation Using the Aroximation and Estimation Rates, with Consequences for the Training-Test Slit Michael Kearns AT&T Bell Laboratories Murray Hill, NJ 7974 mkearns@research.att.com

More information

Meshless Methods for Scientific Computing Final Project

Meshless Methods for Scientific Computing Final Project Meshless Methods for Scientific Comuting Final Project D0051008 洪啟耀 Introduction Floating island becomes an imortant study in recent years, because the lands we can use are limit, so eole start thinking

More information

The non-stochastic multi-armed bandit problem

The non-stochastic multi-armed bandit problem Submitted for journal ublication. The non-stochastic multi-armed bandit roblem Peter Auer Institute for Theoretical Comuter Science Graz University of Technology A-8010 Graz (Austria) auer@igi.tu-graz.ac.at

More information

Mobility-Induced Service Migration in Mobile. Micro-Clouds

Mobility-Induced Service Migration in Mobile. Micro-Clouds arxiv:503054v [csdc] 7 Mar 205 Mobility-Induced Service Migration in Mobile Micro-Clouds Shiiang Wang, Rahul Urgaonkar, Ting He, Murtaza Zafer, Kevin Chan, and Kin K LeungTime Oerating after ossible Deartment

More information

On the Toppling of a Sand Pile

On the Toppling of a Sand Pile Discrete Mathematics and Theoretical Comuter Science Proceedings AA (DM-CCG), 2001, 275 286 On the Toling of a Sand Pile Jean-Christohe Novelli 1 and Dominique Rossin 2 1 CNRS, LIFL, Bâtiment M3, Université

More information

Neural network models for river flow forecasting

Neural network models for river flow forecasting Neural network models for river flow forecasting Nguyen Tan Danh Huynh Ngoc Phien* and Ashim Das Guta Asian Institute of Technology PO Box 4 KlongLuang 220 Pathumthani Thailand Abstract In this study back

More information

Plotting the Wilson distribution

Plotting the Wilson distribution , Survey of English Usage, University College London Setember 018 1 1. Introduction We have discussed the Wilson score interval at length elsewhere (Wallis 013a, b). Given an observed Binomial roortion

More information

Construction of High-Girth QC-LDPC Codes

Construction of High-Girth QC-LDPC Codes MITSUBISHI ELECTRIC RESEARCH LABORATORIES htt://wwwmerlcom Construction of High-Girth QC-LDPC Codes Yige Wang, Jonathan Yedidia, Stark Draer TR2008-061 Setember 2008 Abstract We describe a hill-climbing

More information

A communication-avoiding parallel algorithm for the symmetric eigenvalue problem

A communication-avoiding parallel algorithm for the symmetric eigenvalue problem A communication-avoiding arallel algorithm for the symmetric eigenvalue roblem Edgar Solomonik ETH Zurich Email: solomonik@inf.ethz.ch Grey Ballard Sandia National Laboratory Email: gmballa@sandia.gov

More information