PARALLEL MATRIX MULTIPLICATION: A SYSTEMATIC JOURNEY
MARTIN D. SCHATZ, ROBERT A. VAN DE GEIJN, AND JACK POULSON

Abstract. We expose a systematic approach for developing distributed-memory parallel matrix-matrix multiplication algorithms. The journey starts with a description of how matrices are distributed to meshes of nodes (e.g., MPI processes), relates these distributions to scalable parallel implementation of matrix-vector multiplication and rank-1 update, continues on to reveal a family of matrix-matrix multiplication algorithms that view the nodes as a two-dimensional mesh, and finishes with extending these 2D algorithms to so-called 3D algorithms that view the nodes as a three-dimensional mesh. A cost analysis shows that the 3D algorithms can attain the (order of magnitude) lower bound for the cost of communication. The paper introduces a taxonomy for the resulting family of algorithms and explains how all algorithms have merit depending on parameters like the sizes of the matrices and architecture parameters. The techniques described in this paper are at the heart of the Elemental distributed-memory linear algebra library. Performance results from implementation within and with this library are given on a representative distributed-memory architecture, the IBM Blue Gene/P supercomputer.

1. Introduction. This paper serves a number of purposes:
- Parallel implementation of matrix-matrix multiplication is a standard topic in a course on parallel high-performance computing. However, rarely is the student exposed to the algorithms that are used in practical cutting-edge parallel dense linear algebra (DLA) libraries. This paper exposes a systematic path that leads from parallel algorithms for matrix-vector multiplication and rank-1 update to a practical, scalable family of parallel algorithms for matrix-matrix multiplication, including the classic result in [1] and those implemented in the Elemental parallel DLA library [28].
- This paper introduces a set notation for describing the data distributions that underlie the Elemental
library. The notation is motivated using parallelization of matrix-vector operations and matrix-matrix multiplication as the driving examples.
- Recently, research on parallel matrix-matrix multiplication algorithms has revisited so-called 3D algorithms, which view (processing) nodes as a logical three-dimensional mesh. These algorithms are known to attain theoretical (order of magnitude) lower bounds on communication. This paper exposes a systematic path from algorithms for two-dimensional meshes to their extensions for three-dimensional meshes. Among the resulting algorithms are classic results [2].
- A taxonomy is given for the resulting family of algorithms, which are all related to what is often called the Scalable Universal Matrix Multiplication Algorithm (SUMMA) [33].

Thus, the paper simultaneously serves a pedagogical role, explains abstractions that underlie the Elemental library, and advances the state of science for parallel matrix-matrix multiplication by providing a framework to systematically derive known and new algorithms for matrix-matrix multiplication when computing on two-dimensional or three-dimensional meshes. While much of the new innovation for this paper concerns the extension of parallel matrix-matrix multiplication algorithms from two-dimensional to three-dimensional meshes, we believe that developing the reader's intuition for algorithms on two-dimensional meshes renders most of this new innovation much like a corollary to a theorem.

[Affiliations: Department of Computer Science, Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, TX; emails: martinschatz@utexas.edu, ltm@cs.utexas.edu, rvdg@cs.utexas.edu. Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA 94305; poulson@stanford.edu. "Parallel" in this paper implicitly means distributed-memory parallel.]

2. Background. The parallelization of dense matrix-matrix multiplication is a well-studied subject. Cannon's algorithm (sometimes called roll-roll-compute) dates back to 1969 [9] and Fox's algorithm (sometimes called broadcast-roll-compute) dates back to the 1980s [15]. Both suffer from a number of shortcomings:
- They assume that processes are viewed as a d0 × d1 grid, with d0 = d1 = √p. Removing this constraint on d0 and d1 is nontrivial for these algorithms.
- They do not deal well with the case where one of the matrix dimensions becomes relatively small. This is the most commonly encountered case in libraries like LAPACK [3] and libflame [35, 36], and their distributed-memory counterparts: ScaLAPACK [11], PLAPACK [34], and Elemental [28].
Attempts to generalize [12, 21, 22] led to implementations that were neither simple nor effective. A practical algorithm, which also results from the systematic approach discussed in this paper, can be described as allgather-allgather-multiply [1]. It does not suffer from the shortcomings of Cannon's and Fox's algorithms. It did not gain in popularity in part because libraries like ScaLAPACK and PLAPACK used a 2D block-cyclic distribution, rather than the 2D elemental distribution advocated by that paper. The arrival of the Elemental library, together with what we believe is our more systematic and extensible explanation, will hopefully elevate awareness of this result. The Scalable Universal Matrix Multiplication Algorithm (SUMMA) [33] is another algorithm that overcomes all shortcomings of Cannon's algorithm and Fox's algorithm. We believe it is a more widely known result in part because it can already be explained for a matrix
that is distributed with a 2D blocked (but not cyclic) distribution, and in part because it was easy to support in the ScaLAPACK and PLAPACK libraries. The original SUMMA paper gives four algorithms:
- For C := AB + C, SUMMA casts the computation in terms of multiple rank-k updates. This algorithm is sometimes called the broadcast-broadcast-multiply algorithm, a label we will see is somewhat limiting. We also call this algorithm stationary C for reasons that will become clear later. By design, this algorithm continues to perform well in the case where the width of A is small relative to the dimensions of C.
- For C := A^T B + C, SUMMA casts the computation in terms of multiple panel-of-rows times matrix multiplies, so performance is not degraded in the case where the height of A is small relative to the dimensions of B. We have also called this algorithm stationary B for reasons that will become clear later.
- For C := AB^T + C, SUMMA casts the computation in terms of multiple matrix-panel (of columns) multiplies, and so performance does not deteriorate when the width of C is small relative to the dimensions of A. We call this algorithm stationary A for reasons that will become clear later.
- For C := A^T B^T + C, the paper sketches an algorithm that is actually not practical.
In [17], it was shown how stationary A, B, and C algorithms can be formulated for each of the four cases of matrix-matrix multiplication, including for C := A^T B^T + C. This then yielded a general, practical family of 2D matrix-matrix multiplication algorithms, all of which were incorporated into PLAPACK and Elemental, and some of which are supported by ScaLAPACK. Some of the observations about developing 2D algorithms in the current paper can already be found in that paper, but our exposition is much more systematic and we use the matrix distribution that underlies Elemental to illustrate the basic principles. Although the work by Agarwal et al. describes algorithms for the different matrix-matrix multiplication transpose variants, it does not describe how to create stationary A and B variants. In the 1990s, it was observed that for the case where matrices were relatively small (or, equivalently, a relatively large number of nodes were available), better theoretical and practical performance resulted from viewing the nodes as a d0 × d1 × d2 mesh, yielding a 3D algorithm [2]. More recently, a 3D algorithm for computing the LU factorization of a matrix was devised in Tiskin [27] and Solomonik and Demmel [31]. In addition to the LU factorization algorithm devised in [31], a 3D algorithm for matrix-matrix multiplication was given for nodes arranged as a d0 × d1 × d2 mesh, with d0 = d1 and 0 < d2 ≤ ∛p. This was labeled a 2.5D algorithm. Although the primary contribution of that work was LU-related, the 2.5D algorithm for matrix-matrix multiplication is the relevant portion to this paper. The focus of that study on 3D algorithms was the simplest case of matrix-matrix multiplication, C := AB. In [25], an early attempt was made to combine multiple algorithms for computing C = AB into a poly-algorithm, which refers to the use of two or more algorithms to solve the same problem, with a high-level decision-making process determining which of a set of algorithms performs best in a given situation. That paper was published right at the time when SUMMA algorithms first became popular and when it was not yet completely understood that these SUMMA algorithms are inherently more practical than the
Cannon's and Fox's algorithms. It already talked about stationary A, B, and C algorithms. In the paper, an attempt was made to combine all these approaches, including SUMMA, targeting general 2D Cartesian data distributions, which was (and still would be) a very ambitious goal. Our paper benefits from decades of experience with the more practical SUMMA algorithms and their variants. It purposely limits the data distribution to simple distributions, namely elemental distributions. This, we hope, allows the reader to gain a deep understanding in a simpler setting, so that even if elemental distribution is not best for an encountered situation, a generalization can be easily derived. The family of presented 2D algorithms is a poly-algorithm implemented in Elemental.

3. Notation. Although the focus of this paper is parallel distributed-memory matrix-matrix multiplication, the notation used is designed to be extensible to computation with higher-dimensional objects (tensors), on higher-dimensional grids. Because of this, the notation used may seem overly complex when restricted to matrix-matrix multiplication only. In this section, we describe the notation used and the reasoning behind the choice of notation.

Grid dimension: d_x. Since we focus on algorithms for distributed-memory architectures, we must describe information about the grid on which we are computing. To support arbitrary-dimensional grids, we must express the shape of the grid in an extensible way. For this reason, we have chosen the subscripted letter d to indicate the size of a particular dimension of the grid. Thus, d_x refers to the number of processes comprising the x-th dimension of the grid. In this paper, the grid is typically d0 × d1.
Process location: s_x. In addition to describing the shape of the grid, it is useful to be able to refer to a particular process's location within the mesh of processes. For this, we use the subscripted letter s to refer to a process's location within some given dimension of the mesh of processes. Thus, s_x refers to a particular process's location within the x-th dimension of the mesh of processes. In this paper, a typical process is labeled with (s0, s1).

Distribution: D_(x0, x1, ..., x(k−1)). In subsequent sections, we will introduce a notation for describing how data is distributed among processes of the grid. This notation will require a description of which dimensions of the grid are involved in defining the distribution. We use the symbol D_(x0, x1, ..., x(k−1)) to indicate a distribution which involves dimensions x0, x1, ..., x(k−1) of the mesh. For example, when describing a distribution which involves the column and row dimension of the grid, we refer to this distribution as D_(0,1). Later, we will explain why the symbol D_(0,1) describes a different distribution from D_(1,0).

4. Of Matrix-Vector Operations and Distribution. In this section, we discuss how matrix and vector distributions can be linked to parallel 2D matrix-vector multiplication and rank-1 update operations, which then allows us to eventually describe the stationary C, A, and B 2D algorithms for matrix-matrix multiplication that are part of the Elemental library.

4.1. Collective communication. Collectives are fundamental to the parallelization of dense matrix operations. Thus, the reader must be (or become) familiar with the basics of these communications and is encouraged to read Chan et al. [10], which presents collectives in a systematic way that dovetails with the present paper. To make this paper self-contained, Figure 4.1 (similar to Figure 1 in [10]) summarizes the collectives. In Figure 4.2 we summarize lower bounds on the cost of the collective communications, under basic assumptions explained in [10] (see [8] for an analysis of all-to-all), and
the cost expressions that we will use in our analyses.

4.2. Motivation: matrix-vector multiplication. Suppose A ∈ R^{m×n}, x ∈ R^n, and y ∈ R^m, and label their individual elements so that

A = \begin{pmatrix} \alpha_{0,0} & \alpha_{0,1} & \cdots & \alpha_{0,n-1} \\ \alpha_{1,0} & \alpha_{1,1} & \cdots & \alpha_{1,n-1} \\ \vdots & \vdots & & \vdots \\ \alpha_{m-1,0} & \alpha_{m-1,1} & \cdots & \alpha_{m-1,n-1} \end{pmatrix}, \quad x = \begin{pmatrix} \chi_0 \\ \chi_1 \\ \vdots \\ \chi_{n-1} \end{pmatrix}, \quad \text{and} \quad y = \begin{pmatrix} \psi_0 \\ \psi_1 \\ \vdots \\ \psi_{m-1} \end{pmatrix}.

Recalling that y = Ax (matrix-vector multiplication) is computed as

ψ_0 = α_{0,0} χ_0 + α_{0,1} χ_1 + ⋯ + α_{0,n−1} χ_{n−1}
ψ_1 = α_{1,0} χ_0 + α_{1,1} χ_1 + ⋯ + α_{1,n−1} χ_{n−1}
⋮
ψ_{m−1} = α_{m−1,0} χ_0 + α_{m−1,1} χ_1 + ⋯ + α_{m−1,n−1} χ_{n−1},
[Figure 4.1: Collective communications considered in this paper — the before and after data layouts on four nodes for Permute, Broadcast, Reduce(-to-one), Scatter, Gather, Allgather, Reduce-scatter, Allreduce, and All-to-all.]
Communication     Latency      Bandwidth        Computation      Cost used for analysis
Permute           α            nβ               —                α + nβ
Broadcast         log2(p)α     nβ               —                log2(p)α + nβ
Reduce(-to-one)   log2(p)α     nβ               nγ               log2(p)α + n(β + γ)
Scatter           log2(p)α     (p−1)/p nβ       —                log2(p)α + (p−1)/p nβ
Gather            log2(p)α     (p−1)/p nβ       —                log2(p)α + (p−1)/p nβ
Allgather         log2(p)α     (p−1)/p nβ       —                log2(p)α + (p−1)/p nβ
Reduce-scatter    log2(p)α     (p−1)/p nβ       (p−1)/p nγ       log2(p)α + (p−1)/p n(β + γ)
Allreduce         log2(p)α     2(p−1)/p nβ      (p−1)/p nγ       2 log2(p)α + (p−1)/p n(2β + γ)
All-to-all        log2(p)α     (p−1)/p nβ       —                log2(p)α + (p−1)/p nβ

Fig. 4.2. Lower bounds for the different components of communication cost. Conditions for the lower bounds are given in [10] and [8]. The last column gives the cost functions that we use in our analyses. For architectures with sufficient connectivity, simple algorithms exist with costs that remain within a small constant factor of all but one of the given formulae. The exception is the all-to-all, for which there are algorithms that achieve the lower bound for the α and β term separately, but it is not clear whether an algorithm that consistently achieves performance within a constant factor of the given cost function exists.

We notice that element α_{i,j} multiplies χ_j and contributes to ψ_i. Thus we may summarize the interactions of the elements of x, y, and A by

(4.1)
            χ_0        χ_1      ⋯    χ_{n−1}
ψ_0      α_{0,0}    α_{0,1}    ⋯   α_{0,n−1}
ψ_1      α_{1,0}    α_{1,1}    ⋯   α_{1,n−1}
⋮           ⋮          ⋮             ⋮
ψ_{m−1}  α_{m−1,0}  α_{m−1,1}  ⋯   α_{m−1,n−1}

which is meant to indicate that χ_j must be multiplied by the elements in the j-th column of A while the i-th row of A contributes to ψ_i.

4.3. Two-Dimensional Elemental Cyclic Distribution. It is well established that (weakly) scalable implementations of DLA operations require nodes to be logically viewed as a two-dimensional mesh [32, 20]. It is also well established that to achieve load balance for a wide range of matrix operations, matrices should be cyclically wrapped onto this logical mesh. We start with these insights and examine the simplest of matrix distributions that result: 2D elemental cyclic distribution [28, 19]. Denoting the number of nodes by p, a d0 × d1 mesh must be chosen such
that p = d0 d1.

Matrix distribution. The elements of A are assigned using an elemental cyclic (round-robin) distribution in which α_{i,j} is assigned to node (i mod d0, j mod d1). Thus, node (s0, s1) stores the submatrix

A(s0 : d0 : m−1, s1 : d1 : n−1) = \begin{pmatrix} \alpha_{s_0,s_1} & \alpha_{s_0,s_1+d_1} & \cdots \\ \alpha_{s_0+d_0,s_1} & \alpha_{s_0+d_0,s_1+d_1} & \cdots \\ \vdots & \vdots & \end{pmatrix},

where the left-hand side of the expression uses the MATLAB convention for expressing submatrices, starting indexing from zero instead of one. This is illustrated in Figure 4.3.
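The assignment rule above is easy to express in code. A minimal sketch in plain Python (no MPI; `owner` and `local_matrix` are our own names, not Elemental API):

```python
def owner(i, j, d0, d1):
    # elemental cyclic rule: alpha_{i,j} lives on node (i mod d0, j mod d1)
    return (i % d0, j % d1)

def local_matrix(A, s0, s1, d0, d1):
    # node (s0, s1) keeps rows s0, s0+d0, ... and columns s1, s1+d1, ...,
    # i.e. the MATLAB-style A(s0:d0:m-1, s1:d1:n-1)
    return [row[s1::d1] for row in A[s0::d0]]

# 6 x 9 matrix of (row, col) index pairs on a 2 x 3 mesh, as in Figure 4.3
A = [[(i, j) for j in range(9)] for i in range(6)]
A00 = local_matrix(A, 0, 0, 2, 3)  # node (0, 0): rows 0, 2, 4; columns 0, 3, 6
```

Every element lands on exactly one node, and the six local matrices together tile A.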
[Figure 4.3: Distribution of A, x, and y within a 2 × 3 mesh. Redistributing a column of A in the same manner as y requires simultaneous scatters within rows of nodes, while redistributing a row of A consistently with x requires simultaneous scatters within columns of nodes. In the notation of Section 5, here the distributions of x and y are given by x[(1,0),()] and y[(0,1),()], respectively, and that of A by A[(0),(1)].]

[Figure 4.4: Vectors x and y respectively redistributed as row-projected and column-projected vectors. The column-projected vector y[(0),()] here is to be used to compute local results that will become contributions to a column vector y[(0,1),()], which will result from adding these local contributions within rows of nodes. By comparing and contrasting this figure with Figure 4.3, it becomes obvious that redistributing x[(1,0),()] to x[(1),()] requires an allgather within columns of nodes, while y[(0,1),()] results from scattering y[(0),()] within process rows.]
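Figures 4.3 and 4.4 already suggest the parallel algorithm analyzed below: allgather x within process columns, multiply locally, then sum contributions to y within process rows. A toy simulation in plain Python (no MPI; function name is ours; summing into the global y stands in for the reduce-scatter within process rows):

```python
def parallel_mv(A, x, d0, d1):
    """Simulate y := Ax over a d0 x d1 grid with elemental distributions."""
    m, n = len(A), len(A[0])
    y = [0] * m
    for s0 in range(d0):
        for s1 in range(d1):
            # allgather of x within process columns: node (s0, s1) ends up
            # with every chi_j for which j mod d1 == s1 (x[(1), ()])
            xloc = {j: x[j] for j in range(n) if j % d1 == s1}
            # local mvmult with the node's elemental submatrix of A
            for i in range(s0, m, d0):
                for j, xj in xloc.items():
                    # adding across s1 here plays the role of the
                    # reduce-scatter of contributions within process rows
                    y[i] += A[i][j] * xj
    return y

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
x = [1, 0, 2]
y_par = parallel_mv(A, x, 2, 3)
y_seq = [sum(a * b for a, b in zip(row, x)) for row in A]
```

The simulated result matches the sequential product, illustrating that each contribution is computed on exactly one node and summed exactly once.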
Column-major vector distribution. A column-major vector distribution views the d0 × d1 mesh of nodes as a linear array of nodes, numbered in column-major order. A vector is distributed with this distribution if it is assigned to this linear array of nodes in a round-robin fashion, one element at a time. In other words, consider vector y. Its element ψ_i is assigned to node (i mod d0, (i/d0) mod d1), where / denotes integer division. Or, equivalently in MATLAB-like notation, node (s0, s1) stores subvector y(u(s0, s1) : p : m−1), where u(s0, s1) = s0 + s1 d0 equals the rank of node (s0, s1) when the nodes are viewed as a one-dimensional array, indexed in column-major order. This distribution of y is illustrated in Figure 4.3.

Row-major vector distribution. Similarly, a row-major vector distribution views the d0 × d1 mesh of nodes as a linear array of nodes, numbered in row-major order. In other words, consider vector x. Its element χ_j is assigned to node ((j/d1) mod d0, j mod d1). Or, equivalently, node (s0, s1) stores subvector x(v(s0, s1) : p : n−1), where v(s0, s1) = s0 d1 + s1 equals the rank of node (s0, s1) when the nodes are viewed as a one-dimensional array, indexed in row-major order. The distribution of x is illustrated in Figure 4.3.

4.4. Parallelizing matrix-vector operations. In the following discussion, we assume that A, x, and y are distributed as discussed above. At this point, we suggest comparing (4.1) with Figure 4.3.

Computing y := Ax. The relation between the distributions of a matrix, column-major vector, and row-major vector is illustrated by revisiting the most fundamental of computations in linear algebra, y := Ax, already discussed in Section 4.2. An examination of Figure 4.3 suggests that the elements of x must be gathered within columns of nodes (allgather within columns), leaving elements of x distributed as illustrated in Figure 4.4. Next, each node computes the partial contribution to vector y with its local matrix and copy of x. Thus, in Figure 4.4,
ψ_i in each node becomes a contribution to the final ψ_i. These must be added together, which is accomplished by a summation of contributions to y within rows of nodes. An experienced MPI programmer will recognize this as a reduce-scatter within each row of nodes. Under our communication cost model, the cost of this parallel algorithm is given by

T_{y=Ax}(m, n, d_0, d_1) = \underbrace{2 \left\lceil \frac{m}{d_0} \right\rceil \left\lceil \frac{n}{d_1} \right\rceil \gamma}_{\text{local mvmult}} + \underbrace{\log_2(d_0)\alpha + \frac{d_0-1}{d_0}\frac{n}{d_1}\beta}_{\text{allgather } x} + \underbrace{\log_2(d_1)\alpha + \frac{d_1-1}{d_1}\frac{m}{d_0}\beta + \frac{d_1-1}{d_1}\frac{m}{d_0}\gamma}_{\text{reduce-scatter } y}

\leq 2\frac{mn}{p}\gamma + \underbrace{C_0\frac{m}{d_0}\gamma + C_1\frac{n}{d_1}\gamma}_{\text{load imbalance}} + \log_2(p)\alpha + \frac{d_0-1}{d_0}\frac{n}{d_1}\beta + \frac{d_1-1}{d_1}\frac{m}{d_0}\beta + \frac{d_1-1}{d_1}\frac{m}{d_0}\gamma

(We suggest the reader print copies of Figures 4.3 and 4.4 for easy referral while reading the rest of this section.)
for some constants C0 and C1. We simplify this further to

(4.2)  T^+_{y:=Ax}(m, n, d_0, d_1) = 2\frac{mn}{p}\gamma + \log_2(p)\alpha + \frac{d_0-1}{d_0}\frac{n}{d_1}\beta + \frac{d_1-1}{d_1}\frac{m}{d_0}\beta + \frac{d_1-1}{d_1}\frac{m}{d_0}\gamma,

since the load imbalance contributes a cost similar to that of the communication. Here, T^+_{y:=Ax}(m, n, d_0, d_1) is used to refer to the overhead associated with the above algorithm for the y = Ax operation. In Appendix A we use these estimates to show that this parallel matrix-vector multiplication is, for practical purposes, weakly scalable if d0/d1 is kept constant, but not if d0 × d1 = p × 1 or d0 × d1 = 1 × p.

Computing x := A^T y. Let us next discuss an algorithm for computing x := A^T y, where A is an m × n matrix and x and y are distributed as before (x with a row-major vector distribution and y with a column-major vector distribution). Recall that x = A^T y (transposed matrix-vector multiplication) means

χ_0 = α_{0,0} ψ_0 + α_{1,0} ψ_1 + ⋯ + α_{m−1,0} ψ_{m−1}
χ_1 = α_{0,1} ψ_0 + α_{1,1} ψ_1 + ⋯ + α_{m−1,1} ψ_{m−1}
⋮
χ_{n−1} = α_{0,n−1} ψ_0 + α_{1,n−1} ψ_1 + ⋯ + α_{m−1,n−1} ψ_{m−1}

or, aligning the terms so that each column of the array collects the contributions of a single ψ_i,

(4.3)
χ_0     = α_{0,0} ψ_0     + α_{1,0} ψ_1     + ⋯ + α_{m−1,0} ψ_{m−1}
χ_1     = α_{0,1} ψ_0     + α_{1,1} ψ_1     + ⋯ + α_{m−1,1} ψ_{m−1}
⋮
χ_{n−1} = α_{0,n−1} ψ_0   + α_{1,n−1} ψ_1   + ⋯ + α_{m−1,n−1} ψ_{m−1}.

An examination of (4.3) and Figure 4.3 suggests that the elements of y must be gathered within rows of nodes (allgather within rows), leaving elements of y distributed as illustrated in Figure 4.4. Next, each node computes the partial contribution to vector x with its local matrix and copy of y. Thus, in Figure 4.4, χ_j in each node becomes a contribution to the final χ_j. These must be added together, which is accomplished by a summation of contributions to x within columns of nodes. We again recognize this as a reduce-scatter, but this time within each column of nodes. The cost for this algorithm, approximating as we did when analyzing the algorithm for y = Ax, is

T^+_{x:=A^T y}(m, n, d_1, d_0) = 2\frac{mn}{p}\gamma + \log_2(p)\alpha + \frac{d_1-1}{d_1}\frac{n}{d_0}\beta + \frac{d_0-1}{d_0}\frac{m}{d_1}\beta + \frac{d_0-1}{d_0}\frac{m}{d_1}\gamma,

where, as before, we ignore overhead due to load imbalance since terms
of the same order appear in the terms that capture communication overhead. It is tempting to approximate (d_x − 1)/d_x by 1, but this would yield formulae for the cases where the mesh is p × 1 (d1 = 1) or 1 × p (d0 = 1) that are overly pessimistic.
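The overhead formula (4.2) is simple enough to tabulate numerically. A sketch under our cost-model assumptions (α, β, γ are the per-message latency, per-item transfer, and per-flop compute costs; the function name is ours):

```python
import math

def overhead_mv(m, n, d0, d1, alpha, beta, gamma):
    # T+_{y:=Ax}(m, n, d0, d1) from (4.2), minus the 2mn/p gamma useful work
    p = d0 * d1
    return (math.log2(p) * alpha
            + (d0 - 1) / d0 * (n / d1) * beta
            + (d1 - 1) / d1 * (m / d0) * beta
            + (d1 - 1) / d1 * (m / d0) * gamma)

# p = 16 processes, 1000 x 1000 matrix, all machine parameters set to 1
wide = overhead_mv(1000, 1000, 16, 1, 1.0, 1.0, 1.0)    # p x 1 mesh
tall = overhead_mv(1000, 1000, 1, 16, 1.0, 1.0, 1.0)    # 1 x p mesh
square = overhead_mv(1000, 1000, 4, 4, 1.0, 1.0, 1.0)   # sqrt(p) x sqrt(p)
```

With these toy parameters the square mesh incurs far less overhead than either degenerate mesh, which is the weak-scalability point the text attributes to Appendix A.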
Computing y := A^T x. What if we wish to compute y := A^T x, where A is an m × n matrix and y is distributed with a column-major vector distribution and x with a row-major vector distribution? Now x must first be redistributed to a column-major vector distribution, after which the algorithm that we just discussed can be executed, and finally the result (in row-major vector distribution) must be redistributed to leave it as y in column-major vector distribution. This adds the cost of the permutation that redistributes x, and the cost of the permutation that redistributes the result to y, to the cost of y := A^T x.

Other cases. What if, when computing y := Ax, the vector x is distributed like a row of matrix A? What if the vector y is distributed like a column of matrix A? We leave these cases as an exercise to the reader. The point is that understanding the basic algorithms for multiplying with A and A^T allows one to systematically derive and analyze algorithms when the vectors that are involved are distributed to the nodes in different ways.

Computing A := yx^T + A. A second commonly encountered matrix-vector operation is the rank-1 update: A := αyx^T + A. We will discuss the case where α = 1. Recall that

A + yx^T = \begin{pmatrix} \alpha_{0,0} + \psi_0\chi_0 & \alpha_{0,1} + \psi_0\chi_1 & \cdots & \alpha_{0,n-1} + \psi_0\chi_{n-1} \\ \alpha_{1,0} + \psi_1\chi_0 & \alpha_{1,1} + \psi_1\chi_1 & \cdots & \alpha_{1,n-1} + \psi_1\chi_{n-1} \\ \vdots & \vdots & & \vdots \\ \alpha_{m-1,0} + \psi_{m-1}\chi_0 & \alpha_{m-1,1} + \psi_{m-1}\chi_1 & \cdots & \alpha_{m-1,n-1} + \psi_{m-1}\chi_{n-1} \end{pmatrix},

which, when considering Figures 4.3 and 4.4, suggests the following parallel algorithm:
- allgather of y within rows;
- allgather of x within columns;
- update of the local matrix on each node.
The cost for this algorithm, approximating as we did when analyzing the algorithm for y = Ax, yields

T^+_{A:=yx^T+A}(m, n, d_0, d_1) = 2\frac{mn}{p}\gamma + \log_2(p)\alpha + \frac{d_0-1}{d_0}\frac{n}{d_1}\beta + \frac{d_1-1}{d_1}\frac{m}{d_0}\beta,

where, as before, we ignore overhead due to load imbalance since terms of the same order appear in the terms that capture communication overhead. Notice that the cost is the same as a parallel matrix-vector multiplication, except for the γ term
that results from the reduction within rows. As before, one can modify this algorithm when the vectors start with different distributions, building on intuition from matrix-vector multiplication. A pattern is emerging.

5. Generalizing the Theme. The reader should now have an understanding of how vector and matrix distribution are related to the parallelization of basic matrix-vector operations. We generalize the insights using sets of indices as filters to indicate what parts of a matrix or vector a given process owns. The insights in this section are similar to those that underlie Physically Based Matrix Distribution [14], which itself also underlies PLAPACK. However, we formalize the notation beyond that used by PLAPACK. The link between distribution of vectors and matrices was first observed by Bisseling [6, 7] and, around the same time, in [24].

5.1. Vector distribution. The basic idea is to use two different partitions of the natural numbers as a means of describing the distribution of the row and column indices of a matrix.
Definition 5.1 (Subvectors and submatrices). Let x ∈ R^n and S ⊂ N. Then x[S] equals the vector with elements from x with indices in the set S, in the order in which they appear in vector x. If A ∈ R^{m×n} and S, T ⊂ N, then A[S, T] is the submatrix formed by keeping only the elements of A whose row indices are in S and column indices are in T, in the order in which they appear in matrix A.

We illustrate this idea with simple examples.

Example 1. Let

x = \begin{pmatrix} \chi_0 \\ \chi_1 \\ \chi_2 \\ \chi_3 \end{pmatrix} \quad \text{and} \quad A = \begin{pmatrix} \alpha_{0,0} & \alpha_{0,1} & \alpha_{0,2} & \alpha_{0,3} & \alpha_{0,4} \\ \alpha_{1,0} & \alpha_{1,1} & \alpha_{1,2} & \alpha_{1,3} & \alpha_{1,4} \\ \alpha_{2,0} & \alpha_{2,1} & \alpha_{2,2} & \alpha_{2,3} & \alpha_{2,4} \\ \alpha_{3,0} & \alpha_{3,1} & \alpha_{3,2} & \alpha_{3,3} & \alpha_{3,4} \\ \alpha_{4,0} & \alpha_{4,1} & \alpha_{4,2} & \alpha_{4,3} & \alpha_{4,4} \end{pmatrix}.

If S = {0, 2, 4, ...} and T = {1, 3, 5, ...}, then

x[S] = \begin{pmatrix} \chi_0 \\ \chi_2 \end{pmatrix} \quad \text{and} \quad A[S, T] = \begin{pmatrix} \alpha_{0,1} & \alpha_{0,3} \\ \alpha_{2,1} & \alpha_{2,3} \\ \alpha_{4,1} & \alpha_{4,3} \end{pmatrix}.

We now introduce two fundamental ways to distribute vectors relative to a logical d0 × d1 process grid.

Definition 5.2 (Column-major vector distribution). Suppose that p processes are available, and define

V_σ(q) = {N ∈ N : N ≡ q + σ (mod p)},  q ∈ {0, 1, ..., p − 1},

where σ ∈ {0, 1, ..., p − 1} is an arbitrary alignment parameter. When p is implied from context and σ is unimportant to the discussion, we will simply denote the above set by V(q). If the p processes have been configured into a logical d0 × d1 grid, a vector x is said to be in a column-major vector distribution if process (s0, s1), where s0 ∈ {0, ..., d0 − 1} and s1 ∈ {0, ..., d1 − 1}, is assigned the subvector x(V_σ(s0 + s1 d0)). This distribution is represented via the d0 × d1 array of indices

D_(0,1)(s0, s1) = V(s0 + s1 d0),  (s0, s1) ∈ {0, ..., d0 − 1} × {0, ..., d1 − 1},

and the shorthand x[(0,1)] will refer to the vector x distributed such that process (s0, s1) stores x(D_(0,1)(s0, s1)).

Definition 5.3 (Row-major vector distribution). Similarly, the d0 × d1 array

D_(1,0)(s0, s1) = V(s1 + s0 d1),  (s0, s1) ∈ {0, ..., d0 − 1} × {0, ..., d1 − 1},

is said to define a row-major vector distribution. The shorthand y[(1,0)] will refer to the vector y distributed such that process (s0, s1) stores y(D_(1,0)(s0, s1)).
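Definition 5.1 and the sets V_σ(q) translate directly into code. A minimal sketch (finite index ranges stand in for the infinite sets; the function names are ours):

```python
def filter_vector(x, S):
    # x[S]: keep elements of x whose index lies in S, in order
    return [xi for i, xi in enumerate(x) if i in S]

def filter_matrix(A, S, T):
    # A[S, T]: keep rows with index in S and columns with index in T
    return [[aij for j, aij in enumerate(row) if j in T]
            for i, row in enumerate(A) if i in S]

def V(q, p, sigma=0, limit=100):
    # finite prefix of V_sigma(q) = {N in N : N == q + sigma (mod p)}
    return {N for N in range(limit) if N % p == (q + sigma) % p}

# Example 1: S = {0, 2, 4, ...} and T = {1, 3, 5, ...} are V(0) and V(1) for p = 2
S = V(0, 2)
T = V(1, 2)
x = ['chi0', 'chi1', 'chi2', 'chi3']
A = [[(i, j) for j in range(5)] for i in range(5)]
```

Applying the filters reproduces the subvector and submatrix of Example 1.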
The members of any column-major vector distribution, D_(0,1), or row-major vector distribution, D_(1,0), form a partition of N. The names column-major vector distribution and row-major vector distribution come from the fact that the mappings (s0, s1) → s0 + s1 d0 and (s0, s1) → s1 + s0 d1 respectively label the d0 × d1 grid with a column-major and a row-major ordering. As row-major and column-major distributions differ only by which dimension of the grid is considered first when assigning an order to the processes in the grid, we can give one general definition for a vector distribution with two-dimensional grids. We give this definition now.

Definition 5.4 (Vector distribution). We call the d0 × d1 array D_(i,j) a vector distribution if i, j ∈ {0, 1}, i ≠ j, and there exists some alignment parameter σ ∈ {0, ..., p − 1} such that, for every grid position (s0, s1) ∈ {0, ..., d0 − 1} × {0, ..., d1 − 1},

(5.1)  D_(i,j)(s0, s1) = V_σ(s_i + s_j d_i).

The shorthand y[(i,j)] will refer to the vector y distributed such that process (s0, s1) stores y(D_(i,j)(s0, s1)).

Figure 4.3 illustrates that to redistribute y[(0,1)] to y[(1,0)], and vice versa, requires a permutation communication (simultaneous point-to-point communications). The effect of this redistribution can be seen in Figure 5.2: via a permutation communication, the vector y distributed as y[(0,1)] can be redistributed as y[(1,0)], which is the same distribution as that of the vector x. In the preceding discussions, our definitions of D_(0,1) and D_(1,0) allowed for arbitrary alignment parameters. For the rest of the paper, we will only treat the case where all alignments are zero, i.e., the top-left entry of every (global) matrix and the top entry of every (global) vector is owned by the process in the top-left of the process grid.

5.2. Induced matrix distribution. We are now ready to discuss how matrix distributions are induced by the vector distributions. For this, it pays to again consider Figure 4.3. The element α_{i,j} of matrix A is assigned to the row of
processes in which ψ_i exists and the column of processes in which χ_j exists. This means that in y = Ax elements of x need only be communicated within columns of processes, and local contributions to y need only be summed within rows of processes. This induces a Cartesian matrix distribution:
- Column j of A is assigned to the same column of processes as χ_j.
- Row i of A is assigned to the same row of processes as ψ_i.
We now answer the related questions: What is the set D_(0)(s0) of matrix row indices assigned to process row s0? and What is the set D_(1)(s1) of matrix column indices assigned to process column s1?
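These questions can be explored with a short numerical check: taking the union over a process row of the column-major index sets collapses back to the simple rule of Section 4.3, that row i lives in process row i mod d0. A sketch (our names; finite ranges stand in for N):

```python
def D_col_major(s0, s1, d0, d1, limit=60):
    # D_(0,1)(s0, s1) = V(s0 + s1*d0): the column-major vector distribution
    p = d0 * d1
    return {i for i in range(limit) if i % p == (s0 + s1 * d0) % p}

def induced_rows(s0, d0, d1, limit=60):
    # union over s1 of D_(0,1)(s0, s1): the row indices held by process row s0
    out = set()
    for s1 in range(d1):
        out |= D_col_major(s0, s1, d0, d1, limit)
    return out

rows_of_process_row_1 = induced_rows(1, 2, 3)  # 2 x 3 grid, as in Figure 4.3
```

The union reduces to {i : i mod d0 = s0}, exactly the matrix row indices that the elemental distribution of Section 4.3 assigns to process row s0.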
Elemental symbol    Introduced symbol
M_C                 (0)
M_R                 (1)
V_C                 (0, 1)
V_R                 (1, 0)
★                   ()

Fig. 5.1. The relationships between distribution symbols found in the Elemental library implementation and those introduced here. For instance, the distribution A[M_C, M_R] found in the Elemental library implementation corresponds to the distribution A[(0),(1)].

Definition 5.5. Let

D_(0)(s0) = ⋃_{s1=0}^{d1−1} D_(0,1)(s0, s1) and D_(1)(s1) = ⋃_{s0=0}^{d0−1} D_(1,0)(s0, s1).

Given matrix A, A[D_(0)(s0), D_(1)(s1)] denotes the submatrix of A with row indices in the set D_(0)(s0) and column indices in D_(1)(s1). Finally, A[(0),(1)] denotes the distribution of A that assigns A[D_(0)(s0), D_(1)(s1)] to process (s0, s1).

We say that D_(0) and D_(1) are induced respectively by D_(0,1) and D_(1,0) because the process to which α_{i,j} is assigned is determined by the row of processes, s0, to which ψ_i is assigned and the column of processes, s1, to which χ_j is assigned, so that it is ensured that in the matrix-vector multiplication y = Ax communication needs only be within rows and columns of processes. Notice in Figure 5.2 that to redistribute indices of the vector y as the matrix column indices in A requires a communication within rows of processes. Similarly, to redistribute indices of the vector x as matrix row indices requires a communication within columns of processes. The above definition lies at the heart of our communication scheme.

5.3. Vector duplication. Two vector distributions, encountered in Section 4.4 and illustrated in Figure 4.4, still need to be specified with our notation. The vector x, duplicated as needed for the matrix-vector multiplication y = Ax, can be specified as x[(1)] or, viewing x as an n × 1 matrix, x[(1),()]. The vector y, duplicated so as to store local contributions for y = Ax, can be specified as y[(0)] or, viewing y as an m × 1 matrix, y[(0),()]. Here the () should be interpreted as "all indices"; in other words, D_() = N.

5.4. Notation in the Elemental library. Readers
familiar with the Elemental library will notice that the distribution symbols defined within that library s imlementation follow a different convention than that used for distribution symbols introduced in the revious subsections This is due to the fact that the notation used in this aer was devised after the imlementation of the Elemental library and we wanted the notation to be extensible to higher-dimensional objects (tensors) However, for every symbol utilized in the Elemental library imlementation, there exists a unique symbol in the notation introduced here In Figure 51, the relationshis between distribution symbols utilized in the Elemental library imlementation and the symbols used in this aer are defined 13
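As a concrete illustration, under the elemental distribution process (s_0, s_1) owns the entries whose row index satisfies i mod d_0 = s_0 and whose column index satisfies j mod d_1 = s_1, and a column-major (V_C-style) vector distribution assigns element i to the process of rank i mod p. The sketch below (helper names are illustrative, not Elemental's API) checks that the induced set D^(0)(s_0) collapses to a simple congruence class:

```python
# Sketch of the induced index sets for an elemental distribution on a
# d0 x d1 process mesh. Helper names are illustrative, not Elemental's API.

def D01(s0, s1, m, d0, d1):
    """D^(0,1)(s0, s1): indices i with i mod (d0*d1) == s0 + s1*d0
    (column-major ordering of the mesh, as in a V_C distribution)."""
    return {i for i in range(m) if i % (d0 * d1) == s0 + s1 * d0}

def D0(s0, m, d0, d1):
    """D^(0)(s0): the union over s1 of D^(0,1)(s0, s1)."""
    out = set()
    for s1 in range(d1):
        out |= D01(s0, s1, m, d0, d1)
    return out

m, d0, d1 = 12, 2, 3
# The induced set equals the congruence class of s0 modulo d0:
assert D0(0, m, d0, d1) == {i for i in range(m) if i % d0 == 0}
# The D^(0,1) sets partition all row indices across the mesh:
all_rows = set()
for s0 in range(d0):
    for s1 in range(d1):
        all_rows |= D01(s0, s1, m, d0, d1)
assert all_rows == set(range(m))
```

This mirrors the text: the process row owning row index i is determined solely by i mod d_0, regardless of which process within that row owns a particular vector entry.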
[Fig. 5.2. Summary of the communication patterns for redistributing a vector x, relating the distributions x[(0), ()], x[(1), ()], x[(0, 1), ()], x[(1, 0), ()], x[(0), (1)], and x[(1), (0)] via broadcast, scatter, gather, allgather, reduce(-to-one), reduce-scatter, and permutation collectives. For instance, a method for redistributing x from a matrix column to a matrix row is found by tracing from the bottom-left to the bottom-right of the diagram.]

5.5. Of vectors, columns, and rows. A matrix-vector multiplication or rank-1 update may take as its input/output vectors (x and y) the rows and/or columns of matrices, as we will see in Section 6. This motivates us to briefly discuss the different communications needed to redistribute vectors to and from columns and rows. In our discussion, it will help to refer back to Figures 4.3 and 4.4.

Column to/from column-major vector. Consider Figure 4.3 and let a_j be a typical column in A. It exists within one single process column. Redistributing a_j[(0), (1)] to y[(0, 1), ()] requires simultaneous scatters within process rows. Inversely, redistributing y[(0, 1), ()] to a_j[(0), (1)] requires simultaneous gathers within process rows.

Column to/from row-major vector. Redistributing a_j[(0), (1)] to x[(1, 0), ()] can be accomplished by first redistributing to y[(0, 1), ()] (simultaneous scatters within rows) followed by a redistribution of y[(0, 1), ()] to x[(1, 0), ()] (a permutation). Redistributing x[(1, 0), ()] to a_j[(0), (1)] reverses these communications.

Column to/from column-projected vector. Redistributing a_j[(0), (1)] to a_j[(0), ()] (duplicated y in Figure 4.4) can be accomplished by first redistributing to y[(0, 1), ()] (simultaneous scatters within rows) followed by a redistribution of y[(0, 1), ()] to y[(0), ()] (simultaneous allgathers within rows). However, recognize that a scatter followed by an allgather is equivalent to a broadcast. Thus, redistributing a_j[(0), (1)] to a_j[(0), ()] can be more directly accomplished by broadcasting within rows. Similarly, summing duplicated vectors y[(0), ()], leaving the result as a_j[(0), (1)] (a column in A), can be accomplished by first summing them into y[(0, 1), ()] (reduce-scatters within rows) followed by a redistribution to a_j[(0), (1)] (gathers within rows). But a reduce-scatter followed by a gather is equivalent to a reduce(-to-one) collective communication.

All communication patterns with vectors, rows, and columns. We summarize all the communication patterns that will be encountered when performing various matrix-vector multiplications or rank-1 updates, with vectors, columns, or rows as input, in Figure 5.6.
[Fig. 5.3. Parallel algorithm for computing y := Ax:
   x[(1), ()] ← x[(1, 0), ()]                    Redistribute x (allgather in columns)
   y^(1)[(0), ()] := A[(0), (1)] x[(1), ()]      Local matrix-vector multiply
   y[(0, 1), ()] := Σ_1 y^(1)[(0), ()]           Sum contributions (reduce-scatter in rows)]

[Fig. 5.4. Parallel algorithm for computing A := A + x y^T:
   x[(0, 1), ()] ← x[(1, 0), ()]                 Redistribute x as a column-major vector (permutation)
   x[(0), ()] ← x[(0, 1), ()]                    Redistribute x (allgather in rows)
   y[(1, 0), ()] ← y[(0, 1), ()]                 Redistribute y as a row-major vector (permutation)
   y[(1), ()] ← y[(1, 0), ()]                    Redistribute y (allgather in cols)
   A[(0), (1)] := A[(0), (1)] + x[(0), ()] [y[(1), ()]]^T    Local rank-1 update]

[Fig. 5.5. Parallel algorithm for computing ĉ_i := A b_j, where ĉ_i is a row of a matrix C and b_j is a column of a matrix B:
   Redistribute b_j:
     x[(0, 1), ()] ← b_j[(0), (1)]               (scatter in rows)
     x[(1, 0), ()] ← x[(0, 1), ()]               (permutation)
     x[(1), ()] ← x[(1, 0), ()]                  (allgather in columns)
   y^(1)[(0), ()] := A[(0), (1)] x[(1), ()]      Local matrix-vector multiply
   Sum contributions:
     y[(0, 1), ()] := Σ_1 y^(1)[(0), ()]         (reduce-scatter in rows)
     y[(1, 0), ()] ← y[(0, 1), ()]               (permutation)
     ĉ_i[(0), (1)] ← y[(1, 0), ()]               (gather in rows)]

5.6. Parallelizing matrix-vector operations (revisited). We now show how the notation discussed in the previous subsection pays off when describing algorithms for matrix-vector operations. Assume that A, x, and y are respectively distributed as A[(0), (1)], x[(1, 0), ()], and y[(0, 1), ()]. Algorithms for computing y := Ax and A := A + xy^T are given in Figures 5.3 and 5.4. The discussion in Section 5.5 provides the insights to generalize these parallel matrix-vector operations to the cases where the vectors are rows and/or columns of matrices. For example, in Figure 5.5 we show how to compute a row of matrix C, ĉ_i, as the product of a matrix A times the column of a matrix B, b_j. Certain steps in Figures 5.3-5.5 have superscripts associated with outputs of local computations. These superscripts indicate that contributions rather than final results are computed by the operation. Further, the subscript to Σ indicates along which dimension of the processing grid a reduction of contributions must occur.

[Fig. 6.1. Summary of the communication patterns for redistributing a matrix A, relating the distributions A[(0), ()], A[(0), (1)], A[(1), ()], A[(1), (0)], A[(0, 1), ()], A[(1, 0), ()], A[(), (0, 1)], A[(), (1, 0)], A[(), (0)], and A[(), (1)] via all-to-all, allgather, reduce-scatter, scatter, gather, and permutation collectives.]

5.7. Similar operations. What we have described is a general method. We leave it as an exercise to the reader to derive parallel algorithms for x := A^T y and A := yx^T + A, starting with vectors that are distributed in various ways.

6. Elemental SUMMA: 2D algorithms (eSUMMA2D). We have now arrived at the point where we can discuss parallel matrix-matrix multiplication on a d_0 × d_1 mesh, with p = d_0 d_1. In our discussion, we will assume an elemental distribution, but the ideas clearly generalize to other Cartesian distributions.
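Before proceeding, the parallel y := Ax of Figure 5.3 can be simulated serially to check that the distributed pieces recombine correctly. The NumPy sketch below (elemental ownership taken as i mod d_0 / j mod d_1; not Elemental code) mimics the allgather of x within process columns, the local gemv on each process, and the summation of contributions within process rows:

```python
import numpy as np

# Serial simulation of the parallel y := A x of Figure 5.3 on a d0 x d1 mesh
# with an elemental distribution: process (s0, s1) owns A[i, j] for
# i % d0 == s0 and j % d1 == s1. Names and layout are illustrative.
rng = np.random.default_rng(0)
m, n, d0, d1 = 8, 6, 2, 3
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

y = np.zeros(m)
for s0 in range(d0):
    rows = np.arange(m) % d0 == s0
    contribs = np.zeros(m)  # y^(1)[(0), ()]: contributions within process row s0
    for s1 in range(d1):
        cols = np.arange(n) % d1 == s1
        # After the allgather within process columns, process (s0, s1) holds
        # the entries of x matching its local columns and does a local gemv.
        contribs[rows] += A[np.ix_(rows, cols)] @ x[cols]
    # The reduce(-scatter) within the process row sums the d1 contributions.
    y[rows] = contribs[rows]

assert np.allclose(y, A @ x)
```

The outer loops stand in for processes that would execute concurrently; only the communication steps are collapsed into array indexing.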
This section exposes a systematic path from the parallel rank-1 update and matrix-vector multiplication algorithms to highly efficient 2D parallel matrix-matrix multiplication algorithms. The strategy is to first recognize that a matrix-matrix multiplication can be performed by a series of rank-1 updates or matrix-vector multiplications. This gives us parallel algorithms that are inefficient. By then recognizing that the order of operations can be changed so that communication and computation can be separated and consolidated, these inefficient algorithms are transformed into efficient algorithms. While only explained for some of the cases of matrix multiplication, we believe the exposition is such that the reader can derive algorithms for the remaining cases by applying the ideas in a straightforward manner. To fully understand how to attain high performance on a single processor, the reader should familiarize him/herself with, for example, the techniques in [16].

6.1. Elemental stationary C algorithms (eSUMMA2D-C). We first discuss the case C := C + AB, where A has k columns and B has k rows, with k relatively small. We call this a rank-k update or panel-panel multiplication [16]. We will assume the distributions C[(0), (1)], A[(0), (1)], and B[(0), (1)]. Partition

  A = ( a_0  a_1  ···  a_{k−1} )   and   B = ( b̂_0^T ; b̂_1^T ; ··· ; b̂_{k−1}^T )

so that

  C := ( ··· ((C + a_0 b̂_0^T) + a_1 b̂_1^T) + ··· ) + a_{k−1} b̂_{k−1}^T.

The following loop computes C := AB + C:

  for p = 0, ..., k−1
    a_p[(0), ()] ← a_p[(0), (1)]            (broadcasts within rows)
    b̂_p^T[(), (1)] ← b̂_p^T[(0), (1)]        (broadcasts within cols)
    C[(0), (1)] := C[(0), (1)] + a_p[(0), ()] b̂_p^T[(), (1)]   (local rank-1 updates)
  endfor

While Section 5.6 gives a parallel algorithm for ger, the problem with this algorithm is that (1) it creates a lot of messages and (2) the local computation is a rank-1 update, which inherently does not achieve high performance since it is memory-bandwidth bound. The algorithm can be rewritten as

  for p = 0, ..., k−1
    a_p[(0), ()] ← a_p[(0), (1)]            (broadcasts within rows)
  endfor
  for p = 0, ..., k−1
    b̂_p^T[(), (1)] ← b̂_p^T[(0), (1)]        (broadcasts within cols)
  endfor
  for p = 0, ..., k−1
    C[(0), (1)] := C[(0), (1)] + a_p[(0), ()] b̂_p^T[(), (1)]   (local rank-1 updates)
  endfor

There is an algorithmic block size, b_alg, for which a local rank-k update achieves peak performance [16]. Think of k as being that algorithmic block size for now.
Algorithm: C := Gemm_C(C, A, B)
  Partition A → ( A_L | A_R ), B → ( B_T ; B_B ), where A_L has 0 columns and B_T has 0 rows
  while n(A_L) < n(A) do
    Determine block size b
    Repartition ( A_L | A_R ) → ( A_0 | A_1 | A_2 ), ( B_T ; B_B ) → ( B_0 ; B_1 ; B_2 ),
      where A_1 has b columns and B_1 has b rows
    A_1[(0), ()] ← A_1[(0), (1)]
    B_1[(), (1)] ← B_1[(0), (1)]
    C[(0), (1)] := C[(0), (1)] + A_1[(0), ()] B_1[(), (1)]
    Continue with ( A_L | A_R ) ← ( A_0 A_1 | A_2 ), ( B_T ; B_B ) ← ( B_0 B_1 ; B_2 )
  endwhile

Algorithm: C := Gemm_A(C, A, B)
  Partition C → ( C_L | C_R ), B → ( B_L | B_R ), where C_L and B_L have 0 columns
  while n(C_L) < n(C) do
    Determine block size b
    Repartition ( C_L | C_R ) → ( C_0 | C_1 | C_2 ), ( B_L | B_R ) → ( B_0 | B_1 | B_2 ),
      where C_1 and B_1 have b columns
    B_1[(1), ()] ← B_1[(0), (1)]
    C_1^(1)[(0), ()] := A[(0), (1)] B_1[(1), ()]
    C_1[(0), (1)] := Σ_1 C_1^(1)[(0), ()]
    Continue with ( C_L | C_R ) ← ( C_0 C_1 | C_2 ), ( B_L | B_R ) ← ( B_0 B_1 | B_2 )
  endwhile

Fig. 6.2. Algorithms for computing C := AB + C. Left: stationary C. Right: stationary A.

and finally, equivalently,

  A[(0), ()] ← A[(0), (1)]        (allgather within rows)
  B[(), (1)] ← B[(0), (1)]        (allgather within cols)
  C[(0), (1)] := C[(0), (1)] + A[(0), ()] B[(), (1)]   (local rank-k update)

Now the local computation is cast in terms of a local matrix-matrix multiplication (rank-k update), which can achieve high performance. Here (given that we assume an elemental distribution) A[(0), ()] ← A[(0), (1)] broadcasts, within each row, k columns of A from different roots: an allgather if an elemental distribution is assumed! Similarly, B[(), (1)] ← B[(0), (1)] broadcasts, within each column, k rows of B from different roots: another allgather if an elemental distribution is assumed!
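The allgather-allgather-compute step can be checked serially: once each process has gathered its k columns of A within its process row and its k rows of B within its process column, updating its block of C is purely local. A minimal NumPy sketch, assuming elemental ownership (i mod d_0, j mod d_1):

```python
import numpy as np

# Serial check of the allgather-allgather-compute (stationary-C) step.
# Process (s0, s1) owns C[i, j] with i % d0 == s0 and j % d1 == s1; after
# the two allgathers it can perform its local rank-k update independently.
rng = np.random.default_rng(1)
m, n, k, d0, d1 = 8, 9, 4, 2, 3
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))
C = rng.standard_normal((m, n))
C_ref = C + A @ B  # what the distributed updates should reproduce

for s0 in range(d0):
    for s1 in range(d1):
        rows = np.arange(m) % d0 == s0
        cols = np.arange(n) % d1 == s1
        # A[rows, :] plays the role of the allgathered A_1[(0), ()] panel and
        # B[:, cols] that of B_1[(), (1)]; the update touches only local C.
        C[np.ix_(rows, cols)] += A[rows, :] @ B[:, cols]

assert np.allclose(C, C_ref)
```

Note that C itself is never moved: each process reads remote pieces of A and B but writes only the block of C it owns, which is exactly the "owner computes" property of the stationary C variant.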
Based on this observation, the SUMMA-like algorithm can be expressed as a loop around such rank-k updates, as given in Figure 6.2 (left). (We use FLAME notation to express the algorithm, which has been used in our papers for more than a decade [18].) The purpose of the loop is to reduce the workspace required to store duplicated data. Notice that, if an elemental distribution is assumed, the SUMMA-like algorithm should not be called a broadcast-broadcast-compute algorithm. Instead, it becomes an allgather-allgather-compute algorithm. We will also call it a stationary C algorithm, since C is not communicated (and hence "owner computes" is determined by which processor owns which element of C). The primary benefit of having a loop around rank-k updates is that it reduces the required local workspace at the expense of an increase only in the α term of the communication cost. We label this algorithm eSUMMA2D-C, an elemental SUMMA-like algorithm targeting a two-dimensional mesh of nodes, stationary C variant. It is not hard to extend the insights to nonelemental distributions (as, for example, used by ScaLAPACK or PLAPACK).

An approximate cost for the described algorithm is given by

  T_eSUMMA2D-C(m, n, k, d_0, d_1)
    = (2mnk/p) γ + (k/b_alg) log_2(d_1) α + ((d_1−1)/d_1)(mk/d_0) β + (k/b_alg) log_2(d_0) α + ((d_0−1)/d_0)(nk/d_1) β
    = (2mnk/p) γ + [ (k/b_alg) log_2(p) α + ((d_1−1)mk/p) β + ((d_0−1)nk/p) β ],

where the bracketed term is the communication overhead T⁺_eSUMMA2D-C(m, n, k, d_0, d_1). This estimate ignores load imbalance (which leads to a γ term of the same order as the β terms) and the fact that the allgathers may be unbalanced if b_alg is not an integer multiple of both d_0 and d_1. As before and throughout this paper, T⁺ refers to the communication overhead of the proposed algorithm (e.g., T⁺_eSUMMA2D-C refers to the communication overhead of the eSUMMA2D-C algorithm). It is not hard to see that, for practical purposes, the weak scalability of the eSUMMA2D-C algorithm mirrors that of the parallel matrix-vector multiplication algorithm analyzed in Appendix A: it is weakly scalable when m = n and d_0 = d_1, for arbitrary k. (The very slowly growing factor log_2(p) prevents weak scalability unless it is treated as a constant.)

At this point it is important to mention that the resulting algorithm may seem similar to an approach described in prior work [1]. Indeed, this allgather-allgather-compute approach to parallel matrix-matrix multiplication is described in that paper for the matrix-matrix multiplication variants C = AB, C = AB^T, C = A^T B, and C = A^T B^T under the assumption that all matrices are approximately the same size; a surmountable limitation. As we have argued previously, the allgather-allgather-compute approach is particularly well suited for situations where we wish not to communicate the matrix C. In the next section, we describe how to systematically derive algorithms for situations where we wish to avoid communicating the matrix A.

6.2. Elemental stationary A algorithms (eSUMMA2D-A). Next, we discuss the case C := C + AB, where C and B have n columns each, with n relatively small. For simplicity, we also call that parameter b_alg. We call this a matrix-panel multiplication [16]. We again assume that the matrices are distributed as C[(0), (1)], A[(0), (1)], and B[(0), (1)]. Partition C = ( c_0  c_1  ···  c_{n−1} ) and B = ( b_0  b_1  ···  b_{n−1} ) so that c_j = A b_j + c_j. The following loop will compute C = AB + C:

  for j = 0, ..., n−1
    b_j[(0, 1), ()] ← b_j[(0), (1)]    (scatters within rows)
    b_j[(1, 0), ()] ← b_j[(0, 1), ()]  (permutation)
    b_j[(1), ()] ← b_j[(1, 0), ()]     (allgathers within cols)
    c_j[(0), ()] := A[(0), (1)] b_j[(1), ()]   (local matvec mult)
    c_j[(0), (1)] ← Σ_1 c_j[(0), ()]   (reduce-to-one within rows)
  endfor
While Section 5.6 gives a parallel algorithm for gemv, the problem again is that (1) it creates a lot of messages and (2) the local computation is a matrix-vector multiply, which inherently does not achieve high performance since it is memory-bandwidth bound. This can be restructured as

  for j = 0, ..., n−1
    b_j[(0, 1), ()] ← b_j[(0), (1)]    (scatters within rows)
  endfor
  for j = 0, ..., n−1
    b_j[(1, 0), ()] ← b_j[(0, 1), ()]  (permutation)
  endfor
  for j = 0, ..., n−1
    b_j[(1), ()] ← b_j[(1, 0), ()]     (allgathers within cols)
  endfor
  for j = 0, ..., n−1
    c_j[(0), ()] := A[(0), (1)] b_j[(1), ()]   (local matvec mult)
  endfor
  for j = 0, ..., n−1
    c_j[(0), (1)] ← Σ_1 c_j[(0), ()]   (simultaneous reduce-to-one within rows)
  endfor

and finally, equivalently,

  B[(1), ()] ← B[(0), (1)]     (all-to-all within rows, permutation, allgather within cols)
  C[(0), ()] := A[(0), (1)] B[(1), ()] + C[(0), ()]   (simultaneous local matrix multiplications)
  C[(0), (1)] ← C[(0), ()]     (reduce-scatter within rows)

Now the local computation is cast in terms of a local matrix-matrix multiplication (matrix-panel multiply), which can achieve high performance. A stationary A algorithm for arbitrary n can now be expressed as a loop around such parallel matrix-panel multiplies, as given in Figure 6.2 (right). An approximate cost for the described algorithm is given by

  T_eSUMMA2D-A(m, n, k, d_0, d_1)
    = (n/b_alg) log_2(d_1) α + ((d_1−1)/d_1)(nk/d_0) β      (all-to-all within rows)
    + (n/b_alg) α + (nk/(d_0 d_1)) β                        (permutation)
    + (n/b_alg) log_2(d_0) α + ((d_0−1)/d_0)(nk/d_1) β      (allgather within cols)
    + (2mnk/p) γ                                            (simultaneous local matrix-panel mult)
    + (n/b_alg) log_2(d_1) α + ((d_1−1)/d_1)(mn/d_0) β + ((d_1−1)/d_1)(mn/d_0) γ   (reduce-scatter within rows)

As we discussed earlier, the cost function for the all-to-all operation is somewhat suspect. Still, if an algorithm that attains the lower bound for the α term is employed, the β term increases by at most a factor of log_2(d_1) [8], meaning that it is not the dominant communication cost. The estimate ignores load imbalance (which leads to a γ term of the same order as the β terms) and the fact that various collective communications may be unbalanced if b_alg is not an integer multiple of both d_0 and d_1.
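The overhead terms of the two variants can be compared numerically. The sketch below codes only the α and β terms of the two cost estimates (γ terms omitted, illustrative machine parameters), showing that the stationary A variant communicates less when m and k are large but n is small:

```python
import math

# Communication-overhead models (alpha/beta terms only) for eSUMMA2D-C and
# eSUMMA2D-A, transcribed from the cost estimates in the text.
# alpha: per-message latency; beta: per-item transfer cost (illustrative values).
def overhead_C(m, n, k, d0, d1, b_alg, alpha, beta):
    p = d0 * d1
    return (k / b_alg * math.log2(p) * alpha
            + (d1 - 1) * m * k / p * beta
            + (d0 - 1) * n * k / p * beta)

def overhead_A(m, n, k, d0, d1, b_alg, alpha, beta):
    a = n / b_alg * (math.log2(d1) + 1 + math.log2(d0) + math.log2(d1)) * alpha
    b = ((d1 - 1) / d1 * n * k / d0      # all-to-all within rows
         + n * k / (d0 * d1)             # permutation
         + (d0 - 1) / d0 * n * k / d1    # allgather within cols
         + (d1 - 1) / d1 * m * n / d0    # reduce-scatter within rows
         ) * beta
    return a + b

d0 = d1 = 4
alpha, beta, b_alg = 1e-6, 1e-9, 128
# Large m and k, small n: the stationary A variant avoids moving the big A.
big, small = 8192, 128
assert overhead_A(big, small, big, d0, d1, b_alg, alpha, beta) < \
       overhead_C(big, small, big, d0, d1, b_alg, alpha, beta)
```

Swapping the roles (large m and n, small k) reverses the inequality, matching the discussion that follows.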
While the overhead is clearly greater than that of the eSUMMA2D-C algorithm when m = n = k, the overhead is comparable to that of the eSUMMA2D-C algorithm, so the weak scalability results are, asymptotically, the same. Also, it is not hard to see that if m and k are large while n is small, this algorithm achieves better parallelism since less communication is required: the stationary matrix, A, is then the largest matrix, and not communicating it is beneficial. Similarly, if m and n are large while k is small, then the eSUMMA2D-C algorithm does not communicate the largest matrix, C, which is beneficial.

6.3. Communicating submatrices. In Figure 6.1 we illustrate the collective communications required to redistribute submatrices from one distribution to another and the collective communications required to implement them.

6.4. Other cases. We leave it as an exercise to the reader to propose and analyze the remaining stationary A, B, and C algorithms for the other cases of matrix-matrix multiplication: C := A^T B + C, C := A B^T + C, and C := A^T B^T + C. The point is that we have presented a systematic framework for deriving a family of parallel matrix-matrix multiplication algorithms.

7. Elemental SUMMA: 3D algorithms (eSUMMA3D). We now view the processors as forming a d_0 × d_1 × h mesh, which one should visualize as h stacked layers, where each layer consists of a d_0 × d_1 mesh (so that p = d_0 d_1 h). The extra dimension is used to gain an extra level of parallelism, which reduces the overhead of the 2D SUMMA algorithms within each layer at the expense of communications between the layers. The approach used to generalize Elemental SUMMA 2D algorithms to Elemental SUMMA 3D algorithms can be easily modified to use Cannon's or Fox's algorithms (with the constraints and complications that come from using those algorithms), or any other distribution for which SUMMA can be used (pretty much any Cartesian distribution).

7.1. 3D stationary C algorithms (eSUMMA3D-C). Partition A and B so that

  A = ( A_0  ···  A_{h−1} )   and   B = ( B_0 ; ··· ; B_{h−1} ),

where A_H and B_H have approximately k/h columns and rows, respectively. Then

  C + AB = (C + A_0 B_0) + (0 + A_1 B_1) + ··· + (0 + A_{h−1} B_{h−1}),

where the Hth term is computed by layer H. This suggests the following 3D algorithm:

Duplicate C to each of the layers, initializing the duplicates assigned to layers 1 through h−1 to zero. This requires no communication. We will ignore the cost of setting the duplicates to zero.

Scatter A and B so that layer H receives A_H and B_H. This means that all processors (I, J, 0) simultaneously scatter approximately (m + n)k/(d_0 d_1) data to processors (I, J, 0) through (I, J, h−1). The cost of such a scatter can be approximated by

(7.1)  log_2(h) α + ((h−1)/h) ((m + n)k/(d_0 d_1)) β = log_2(h) α + ((h−1)(m + n)k/p) β.
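The layer-wise splitting is easy to verify serially: partitioning the k dimension into h pieces, the per-layer products (layer 0 starting from C, the rest from zero) sum back to C + AB. A minimal NumPy sketch:

```python
import numpy as np

# Check of the eSUMMA3D-C splitting: partition A column-wise and B row-wise
# into h pieces; layer 0 starts from the duplicate of C, layers 1..h-1 from
# zero, and the reduction across layers reproduces C + A B.
rng = np.random.default_rng(2)
m, n, k, h = 6, 5, 12, 3
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))
C = rng.standard_normal((m, n))

layer_results = []
for H in range(h):
    ks = slice(H * k // h, (H + 1) * k // h)       # columns/rows owned by layer H
    start = C if H == 0 else np.zeros((m, n))      # duplicated C, zeroed on layers > 0
    layer_results.append(start + A[:, ks] @ B[ks, :])

assert np.allclose(sum(layer_results), C + A @ B)
```

Each layer then runs an independent 2D (eSUMMA2D-C) multiplication on its piece, and a final reduction across layers combines the duplicates.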
SIAM J. Sci. Comput., Vol. 38, No. 6, pp. C748-C781. © 2016 Society for Industrial and Applied Mathematics.
More informationUnit 1 - Computer Arithmetic
FIXD-POINT (FX) ARITHMTIC Unit 1 - Comuter Arithmetic INTGR NUMBRS n bit number: b n 1 b n 2 b 0 Decimal Value Range of values UNSIGND n 1 SIGND D = b i 2 i D = 2 n 1 b n 1 + b i 2 i n 2 i=0 i=0 [0, 2
More informationEnd-to-End Delay Minimization in Thermally Constrained Distributed Systems
End-to-End Delay Minimization in Thermally Constrained Distributed Systems Pratyush Kumar, Lothar Thiele Comuter Engineering and Networks Laboratory (TIK) ETH Zürich, Switzerland {ratyush.kumar, lothar.thiele}@tik.ee.ethz.ch
More informationChapter 7 Rational and Irrational Numbers
Chater 7 Rational and Irrational Numbers In this chater we first review the real line model for numbers, as discussed in Chater 2 of seventh grade, by recalling how the integers and then the rational numbers
More informationStatics and dynamics: some elementary concepts
1 Statics and dynamics: some elementary concets Dynamics is the study of the movement through time of variables such as heartbeat, temerature, secies oulation, voltage, roduction, emloyment, rices and
More informationThe Graph Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule
The Grah Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule STEFAN D. BRUDA Deartment of Comuter Science Bisho s University Lennoxville, Quebec J1M 1Z7 CANADA bruda@cs.ubishos.ca
More informationSTA 250: Statistics. Notes 7. Bayesian Approach to Statistics. Book chapters: 7.2
STA 25: Statistics Notes 7. Bayesian Aroach to Statistics Book chaters: 7.2 1 From calibrating a rocedure to quantifying uncertainty We saw that the central idea of classical testing is to rovide a rigorous
More informationScaling Multiple Point Statistics for Non-Stationary Geostatistical Modeling
Scaling Multile Point Statistics or Non-Stationary Geostatistical Modeling Julián M. Ortiz, Steven Lyster and Clayton V. Deutsch Centre or Comutational Geostatistics Deartment o Civil & Environmental Engineering
More informationMODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL
Technical Sciences and Alied Mathematics MODELING THE RELIABILITY OF CISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Cezar VASILESCU Regional Deartment of Defense Resources Management
More informationUncorrelated Multilinear Discriminant Analysis with Regularization and Aggregation for Tensor Object Recognition
TNN-2007-P-0332.R1 1 Uncorrelated Multilinear Discriminant Analysis with Regularization and Aggregation for Tensor Object Recognition Haiing Lu, K.N. Plataniotis and A.N. Venetsanooulos The Edward S. Rogers
More informationConstruction of High-Girth QC-LDPC Codes
MITSUBISHI ELECTRIC RESEARCH LABORATORIES htt://wwwmerlcom Construction of High-Girth QC-LDPC Codes Yige Wang, Jonathan Yedidia, Stark Draer TR2008-061 Setember 2008 Abstract We describe a hill-climbing
More informationFactorability in the ring Z[ 5]
University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Dissertations, Theses, and Student Research Paers in Mathematics Mathematics, Deartment of 4-2004 Factorability in the ring
More informationInformation collection on a graph
Information collection on a grah Ilya O. Ryzhov Warren Powell February 10, 2010 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements
More informationOn the Toppling of a Sand Pile
Discrete Mathematics and Theoretical Comuter Science Proceedings AA (DM-CCG), 2001, 275 286 On the Toling of a Sand Pile Jean-Christohe Novelli 1 and Dominique Rossin 2 1 CNRS, LIFL, Bâtiment M3, Université
More informationAn Analysis of TCP over Random Access Satellite Links
An Analysis of over Random Access Satellite Links Chunmei Liu and Eytan Modiano Massachusetts Institute of Technology Cambridge, MA 0239 Email: mayliu, modiano@mit.edu Abstract This aer analyzes the erformance
More informationA Qualitative Event-based Approach to Multiple Fault Diagnosis in Continuous Systems using Structural Model Decomposition
A Qualitative Event-based Aroach to Multile Fault Diagnosis in Continuous Systems using Structural Model Decomosition Matthew J. Daigle a,,, Anibal Bregon b,, Xenofon Koutsoukos c, Gautam Biswas c, Belarmino
More informationTests for Two Proportions in a Stratified Design (Cochran/Mantel-Haenszel Test)
Chater 225 Tests for Two Proortions in a Stratified Design (Cochran/Mantel-Haenszel Test) Introduction In a stratified design, the subects are selected from two or more strata which are formed from imortant
More informationA communication-avoiding parallel algorithm for the symmetric eigenvalue problem
A communication-avoiding arallel algorithm for the symmetric eigenvalue roblem Edgar Solomonik ETH Zurich Email: solomonik@inf.ethz.ch Grey Ballard Sandia National Laboratory Email: gmballa@sandia.gov
More informationON POLYNOMIAL SELECTION FOR THE GENERAL NUMBER FIELD SIEVE
MATHEMATICS OF COMPUTATIO Volume 75, umber 256, October 26, Pages 237 247 S 25-5718(6)187-9 Article electronically ublished on June 28, 26 O POLYOMIAL SELECTIO FOR THE GEERAL UMBER FIELD SIEVE THORSTE
More informationMeshless Methods for Scientific Computing Final Project
Meshless Methods for Scientific Comuting Final Project D0051008 洪啟耀 Introduction Floating island becomes an imortant study in recent years, because the lands we can use are limit, so eole start thinking
More informationSystem Reliability Estimation and Confidence Regions from Subsystem and Full System Tests
009 American Control Conference Hyatt Regency Riverfront, St. Louis, MO, USA June 0-, 009 FrB4. System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests James C. Sall Abstract
More informationON THE LEAST SIGNIFICANT p ADIC DIGITS OF CERTAIN LUCAS NUMBERS
#A13 INTEGERS 14 (014) ON THE LEAST SIGNIFICANT ADIC DIGITS OF CERTAIN LUCAS NUMBERS Tamás Lengyel Deartment of Mathematics, Occidental College, Los Angeles, California lengyel@oxy.edu Received: 6/13/13,
More informationAveraging sums of powers of integers and Faulhaber polynomials
Annales Mathematicae et Informaticae 42 (20. 0 htt://ami.ektf.hu Averaging sums of owers of integers and Faulhaber olynomials José Luis Cereceda a a Distrito Telefónica Madrid Sain jl.cereceda@movistar.es
More informationMarch 4, :21 WSPC/INSTRUCTION FILE FLSpaper2011
International Journal of Number Theory c World Scientific Publishing Comany SOLVING n(n + d) (n + (k 1)d ) = by 2 WITH P (b) Ck M. Filaseta Deartment of Mathematics, University of South Carolina, Columbia,
More informationChapter 3 - From Gaussian Elimination to LU Factorization
Chapter 3 - From Gaussian Elimination to LU Factorization Maggie Myers Robert A. van de Geijn The University of Texas at Austin Practical Linear Algebra Fall 29 http://z.cs.utexas.edu/wiki/pla.wiki/ 1
More information8 STOCHASTIC PROCESSES
8 STOCHASTIC PROCESSES The word stochastic is derived from the Greek στoχαστικoς, meaning to aim at a target. Stochastic rocesses involve state which changes in a random way. A Markov rocess is a articular
More informationMonopolist s mark-up and the elasticity of substitution
Croatian Oerational Research Review 377 CRORR 8(7), 377 39 Monoolist s mark-u and the elasticity of substitution Ilko Vrankić, Mira Kran, and Tomislav Herceg Deartment of Economic Theory, Faculty of Economics
More informationMorten Frydenberg Section for Biostatistics Version :Friday, 05 September 2014
Morten Frydenberg Section for Biostatistics Version :Friday, 05 Setember 204 All models are aroximations! The best model does not exist! Comlicated models needs a lot of data. lower your ambitions or get
More informationMODEL-BASED MULTIPLE FAULT DETECTION AND ISOLATION FOR NONLINEAR SYSTEMS
MODEL-BASED MULIPLE FAUL DEECION AND ISOLAION FOR NONLINEAR SYSEMS Ivan Castillo, and homas F. Edgar he University of exas at Austin Austin, X 78712 David Hill Chemstations Houston, X 77009 Abstract A
More informationMultiplicative group law on the folium of Descartes
Multilicative grou law on the folium of Descartes Steluţa Pricoie and Constantin Udrişte Abstract. The folium of Descartes is still studied and understood today. Not only did it rovide for the roof of
More informationImproved Capacity Bounds for the Binary Energy Harvesting Channel
Imroved Caacity Bounds for the Binary Energy Harvesting Channel Kaya Tutuncuoglu 1, Omur Ozel 2, Aylin Yener 1, and Sennur Ulukus 2 1 Deartment of Electrical Engineering, The Pennsylvania State University,
More informationYixi Shi. Jose Blanchet. IEOR Department Columbia University New York, NY 10027, USA. IEOR Department Columbia University New York, NY 10027, USA
Proceedings of the 2011 Winter Simulation Conference S. Jain, R. R. Creasey, J. Himmelsach, K. P. White, and M. Fu, eds. EFFICIENT RARE EVENT SIMULATION FOR HEAVY-TAILED SYSTEMS VIA CROSS ENTROPY Jose
More informationUncorrelated Multilinear Discriminant Analysis with Regularization and Aggregation for Tensor Object Recognition
Uncorrelated Multilinear Discriminant Analysis with Regularization and Aggregation for Tensor Object Recognition Haiing Lu, K.N. Plataniotis and A.N. Venetsanooulos The Edward S. Rogers Sr. Deartment of
More informationFault Tolerant Quantum Computing Robert Rogers, Thomas Sylwester, Abe Pauls
CIS 410/510, Introduction to Quantum Information Theory Due: June 8th, 2016 Sring 2016, University of Oregon Date: June 7, 2016 Fault Tolerant Quantum Comuting Robert Rogers, Thomas Sylwester, Abe Pauls
More informationDeriving Indicator Direct and Cross Variograms from a Normal Scores Variogram Model (bigaus-full) David F. Machuca Mory and Clayton V.
Deriving ndicator Direct and Cross Variograms from a Normal Scores Variogram Model (bigaus-full) David F. Machuca Mory and Clayton V. Deutsch Centre for Comutational Geostatistics Deartment of Civil &
More informationAn Analysis of Reliable Classifiers through ROC Isometrics
An Analysis of Reliable Classifiers through ROC Isometrics Stijn Vanderlooy s.vanderlooy@cs.unimaas.nl Ida G. Srinkhuizen-Kuyer kuyer@cs.unimaas.nl Evgueni N. Smirnov smirnov@cs.unimaas.nl MICC-IKAT, Universiteit
More informationConvex Optimization methods for Computing Channel Capacity
Convex Otimization methods for Comuting Channel Caacity Abhishek Sinha Laboratory for Information and Decision Systems (LIDS), MIT sinhaa@mit.edu May 15, 2014 We consider a classical comutational roblem
More informationPrincipal Components Analysis and Unsupervised Hebbian Learning
Princial Comonents Analysis and Unsuervised Hebbian Learning Robert Jacobs Deartment of Brain & Cognitive Sciences University of Rochester Rochester, NY 1467, USA August 8, 008 Reference: Much of the material
More informationSolving Cyclotomic Polynomials by Radical Expressions Andreas Weber and Michael Keckeisen
Solving Cyclotomic Polynomials by Radical Exressions Andreas Weber and Michael Keckeisen Abstract: We describe a Male ackage that allows the solution of cyclotomic olynomials by radical exressions. We
More informationPlotting the Wilson distribution
, Survey of English Usage, University College London Setember 018 1 1. Introduction We have discussed the Wilson score interval at length elsewhere (Wallis 013a, b). Given an observed Binomial roortion
More informationMA3H1 TOPICS IN NUMBER THEORY PART III
MA3H1 TOPICS IN NUMBER THEORY PART III SAMIR SIKSEK 1. Congruences Modulo m In quadratic recirocity we studied congruences of the form x 2 a (mod ). We now turn our attention to situations where is relaced
More informationKeywords: pile, liquefaction, lateral spreading, analysis ABSTRACT
Key arameters in seudo-static analysis of iles in liquefying sand Misko Cubrinovski Deartment of Civil Engineering, University of Canterbury, Christchurch 814, New Zealand Keywords: ile, liquefaction,
More informationAI*IA 2003 Fusion of Multiple Pattern Classifiers PART III
AI*IA 23 Fusion of Multile Pattern Classifiers PART III AI*IA 23 Tutorial on Fusion of Multile Pattern Classifiers by F. Roli 49 Methods for fusing multile classifiers Methods for fusing multile classifiers
More informationObserver/Kalman Filter Time Varying System Identification
Observer/Kalman Filter Time Varying System Identification Manoranjan Majji Texas A&M University, College Station, Texas, USA Jer-Nan Juang 2 National Cheng Kung University, Tainan, Taiwan and John L. Junins
More informationMatching Partition a Linked List and Its Optimization
Matching Partition a Linked List and Its Otimization Yijie Han Deartment of Comuter Science University of Kentucky Lexington, KY 40506 ABSTRACT We show the curve O( n log i + log (i) n + log i) for the
More informationNotes on pressure coordinates Robert Lindsay Korty October 1, 2002
Notes on ressure coordinates Robert Lindsay Korty October 1, 2002 Obviously, it makes no difference whether the quasi-geostrohic equations are hrased in height coordinates (where x, y,, t are the indeendent
More informationDetermining Momentum and Energy Corrections for g1c Using Kinematic Fitting
CLAS-NOTE 4-17 Determining Momentum and Energy Corrections for g1c Using Kinematic Fitting Mike Williams, Doug Alegate and Curtis A. Meyer Carnegie Mellon University June 7, 24 Abstract We have used the
More informationParticipation Factors. However, it does not give the influence of each state on the mode.
Particiation Factors he mode shae, as indicated by the right eigenvector, gives the relative hase of each state in a articular mode. However, it does not give the influence of each state on the mode. We
More informationp-adic Measures and Bernoulli Numbers
-Adic Measures and Bernoulli Numbers Adam Bowers Introduction The constants B k in the Taylor series exansion t e t = t k B k k! k=0 are known as the Bernoulli numbers. The first few are,, 6, 0, 30, 0,
More information