
REDISTRIBUTION OF TENSORS FOR DISTRIBUTED CONTRACTIONS

THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of the Ohio State University

By Akshay Nikam, B.Tech.

Graduate Program in Computer Science and Engineering

The Ohio State University, 2014

Master's Examination Committee: Dr. P. Sadayappan, Advisor; Dr. Atanas Rountev

Copyright by Akshay Nikam, 2014

ABSTRACT

In computational quantum chemistry and nuclear physics, tensor contractions are frequently required and computationally expensive operations. Methods involving tensor contractions often require them to run in series. Efficient contraction algorithms require the tensors to be distributed in certain ways. Hence, a redistribution operation on tensors is essential between two contractions in series. In this thesis, an efficient method to redistribute tensors on a multi-dimensional torus grid of processors is proposed. Multiple ways of redistribution exist, and the most efficient one for the given pair of distributions can be picked. The choice is mainly driven by replication of tensor data in the processor grid. Redistribution approaches are developed based on the replication scheme; they involve division of the processor grid into smaller hyperplanar sub-grids that can handle the communication in parallel. Some approaches involve data broadcasts along grid dimensions to replicate data as required by the new distribution. Wherever such efficient schemes are not possible, a point-to-point communication approach is proposed.

I dedicate my work to my parents, sisters and friends

ACKNOWLEDGMENTS

I would like to convey my gratitude to my advisor Dr. P. Sadayappan for giving me the opportunity to be a part of his vibrant and active research group. He has been a great source of inspiration for me throughout my master's program, and his guidance has been of immense help in learning about High Performance Computing, especially Distributed and Parallel Computing. I express my deep gratitude to Samyam Rajbhandari, with whom I worked for most of my time in the High Performance Computing Lab. I have lost track of the countless days and nights we worked together on implementing contractions of distributed tensors, and I will always appreciate his hard work and willingness to help me with the hurdles I came across. I would also like to thank Kevin Stock for being a good critic and helping me improve my programming practices. I cannot thank enough my friends Sonali and Ashwin for keeping me motivated about the dreams of my life and always being there for me. I am short of words to appreciate the efforts my parents have taken to help me achieve my dreams. I thank my sisters Manjusha, Archana and Kranti, who always motivated me to keep up the hard work. I express my deep gratitude to my brother-in-law Mr. Baban Shinde; I attribute a significant portion of my success to the help he has provided for my education.

VITA

B.Tech. in Computer Engineering, College of Engineering, Pune, India.
Software Development Engineering Intern, IBM India Software Labs, Pune, India.
Associate Software Engineer, IBM India Software Labs, Pune, India.
Graduate Research Associate, High Performance Computing Lab, The Ohio State University, Columbus OH, USA.
Software Development Engineering Intern, Microsoft Corporation, Redmond WA, USA.

PUBLICATIONS

Hukerikar, Saumil; Tumma, Ashwin; Nikam, Akshay; Attar, Vahida. SkewBoost: An Algorithm for Classifying Imbalanced Datasets. Proceedings of ICCCT 2011, IEEE, pp. 45-52.

FIELDS OF STUDY

Major Field: Computer Science and Engineering
Specialization: High Performance Computing

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Figures
List of Algorithms

1 Introduction
2 Background and Motivation
    Matrices and Tensors
    Contractions of Tensors
    Symmetry in Tensors
    Tensor Contraction on a Processor Grid
        Tensor Mapping on Processor Grid
        Tensor Index Distribution On Grid Dimensions
        Contraction of Distributed Tensors
    Need for Tensor Redistribution
3 Redistribution of Tensors
    Identifying Communication Patterns
    Parallelizing Communication
        Replication in Current Distribution
    Redistribution within a Hyperplane
        Broadcast Communication
        Point-to-Point Communication
4 Experimental Setup and Results
    Experimental Setup
    Results
        Grid-wide Broadcast Communication
        Grid-wide Point-to-Point Communication
        Broadcast Communication in Hyperplanes
        Point-to-Point Communication in Hyperplanes
5 Conclusion and Future Work
Bibliography

LIST OF FIGURES

2.1 3D Torus grid by Fujitsu [5]
2.2 One-to-one mapping of tensor A[i, j, k] on a 3D grid
2.3 Mapping tensor A[i, j, k, l] on a 3D grid with l serialized
2.4 Mapping tensor A[i, j] on a 3D grid with one of the dimensions replicated
2.5 Mapping tensor A[i, j, k] on a 3D grid with k serialized and one of the dimensions replicated
2.6 12 elements to be distributed along 3 nodes
2.7 Block Distribution
2.8 Cyclic Distribution
2.9 Block-Cyclic Distribution
3.1 Virtual splitting of grid into 2D planes along the replicated dimension
3.2 Broadcast groups of the form P[I,*,K]
3.3 Broadcast in an Intra-Communicator
3.4 Broadcast in an Inter-Communicator
4.1 Bandwidth achieved for different pairs of distributions for broadcast communication across the entire grid
4.2 Bandwidth achieved for different pairs of distributions for point-to-point communication across the entire grid
4.3 Bandwidth achieved for different pairs of distributions for broadcast communication in hyperplanes
4.4 Bandwidth achieved for different pairs of distributions for point-to-point communication in hyperplanes

LIST OF ALGORITHMS

2.1 Tensor Contraction: C[a,b,c,d] = A[a,b,k,l] B[l,k,c,d]
3.1 Redistribution with Broadcast Communication
3.2 Redistribution with Point-to-point Communication
3.3 Redistribution Scheme

CHAPTER 1

INTRODUCTION

The topic of this thesis stems from the area of Quantum Chemistry. This field applies the notions of quantum mechanics to chemical many-body systems. It involves the use of a variety of computational methods to solve problems. One of the widely used methods in Quantum Chemistry is the Coupled Cluster family of methods [1] [2]. Used for modeling many-body systems in chemistry, coupled cluster methods are computationally expensive and often require the computational power of supercomputers. With the benefit of high performance computing, multi-electron wavefunctions can be more accurately modeled for molecules. The types of coupled cluster methods are decided by the number of excitations permitted: Coupled Cluster Doubles (CCD) allows only double excitations, Coupled Cluster Singles and Doubles (CCSD) allows single and double excitations, and so on. In CCSD, a series of equations is derived using algebraic and diagrammatic techniques [2]. These equations involve operations on tensor objects. Tensors can be stored as high-dimensional matrices in computer memory. The frequent operations on tensors in CCSD include contractions. From a computational perspective, a contraction of two tensors is essentially a higher dimensional generalization of a matrix multiplication. Since the dimensionality and size of tensors can be large in CCSD, contraction operations tend to be compute-intensive.

14 architectures to accurately compute CCSD equations. Tensors are distributed across processor grids and contractions take place in distributed fashion. Efficient contraction algorithms require tensors to be distributed and mapped to the processor grid in a certain manner. Since tensors are often reused in multiple equations in one CCSD sequence of operations, efficient execution of CCSD needs to redistribute tensors as required by the contraction algorithm for efficient execution [4]. This thesis presents an algorithm for redistributing tensors on a multi-dimensional torus grid based on the data replication patterns in the grid. This algorithm assists the CCSD contractions for an efficient execution. 2

CHAPTER 2

BACKGROUND AND MOTIVATION

In Coupled Cluster Singles and Doubles (CCSD), a large number of contractions need to be performed on tensors efficiently. This section defines various concepts such as tensors, their mapping on a processor grid, data distribution and tensor contractions, and thereafter shows why an efficient redistribution scheme is required.

2.1 Matrices and Tensors

A matrix A[M × N] can be defined as a set of M × N numbers organized in a rectangle (2 dimensions) with M rows and N columns. From a computer storage viewpoint, matrix A is essentially a rectangular array of size M × N, with M rows and N columns. Whereas a matrix is a 2-dimensional array, tensors can be defined as higher dimensional generalizations of matrices from the storage and computational perspective. Since tensor is a general term, a 2-dimensional matrix can also be called a 2-dimensional tensor. For the CCSD method, we only have to work with tensors of dimensionality 4 or lower. A dimension of a tensor is referred to as an index. To represent matrix A of size M × N as a tensor, we need to use two indices representing its dimensions, as A[i, j]. Here there are M elements along index i and N elements along index j.

Although an order is decided for the indices of a tensor for the purpose of storage, the order does not hold any significance in the physical interpretation of the tensor. A 4D tensor B[i, j, k, l] has four indices, namely i, j, k and l, and the tensor can also be represented as B[k, i, l, j] or any other permutation of the indices. However, the storage of the tensor only follows one of all the possible permutations.

2.2 Contractions of Tensors

The CCSD equations involve addition and contraction operations between multiple tensors. Addition of tensors is analogous to matrix addition. A contraction operation between tensors is a slightly more complicated higher dimensional generalization of matrix multiplication. It is helpful to discuss matrix multiplication from another point of view before discussing tensor contractions. Matrix multiplication is an operation on two matrices A[M × K] and B[K × N] that generates an output matrix C[M × N], such that the element of C at the intersection of the i-th row and j-th column, identified as C[i][j], is a dot product of the i-th row of A and the j-th column of B. (The notation C[i][j] is used to denote an element, where i and j mean values of the respective indices, while C[i, j] is used to denote the tensor, where i and j indicate the indices of the tensor.) The rows of A and the columns of B are vectors of length K. These vectors represent one of the two dimensions or indices of their parent matrix. Thus one of the two indices of both input matrices (the index with size K) vanishes due to the dot product after all values in C are computed. The remaining indices (of sizes M and N) appear in the output matrix C. In tensor terminology, we say that the index with size K has contracted. Thus this matrix multiplication also represents a contraction of tensors A[i, k] and B[k, j] that contracts the index k and yields an output tensor C[i, j]. From the indices of the tensors, it can be observed that index i of A and index j of B are retained in C after the contraction. These are called external indices of the input tensors.

However, the index k of both input tensors is contracted and thus does not appear in the output tensor. It is termed a contracting index of the input tensors. This tensor contraction can be represented by the equation

C[i, j] = Σ_k A[i, k] B[k, j]    (2.2.1)

In higher dimensional tensor contractions, there can be more than one contracting index in each input tensor. For example, contraction of two tensors A[a, b, k, l] and B[l, k, c, d] will yield an output tensor C[a, b, c, d] such that the contracting indices k and l are contracted, while the indices a, b, c and d are retained in C as external indices. Listing 2.1 shows a code snippet for contracting tensors A and B to yield C. Note that there is one loop for each external index as well as for each contracting index. The loops for the contracting indices k and l are innermost in the code. Suppose A is a d_a-dimensional tensor and B is a d_b-dimensional tensor, and they are contracted to yield a d_c-dimensional tensor C. Then the total number of loops or iterators in a contraction algorithm is equal to (d_a + d_b + d_c)/2. If n is the size of each tensor (number of elements) in each dimension, their contraction takes O(n^((d_a + d_b + d_c)/2)) time. For example, contracting two 4D tensors into a 4D tensor (d_a = d_b = d_c = 4) costs O(n^6).

Listing 2.1: Tensor Contraction: C[a,b,c,d] = A[a,b,k,l] B[l,k,c,d]

for (int a = 0; a < n; a++) {
  for (int b = 0; b < n; b++) {
    for (int c = 0; c < n; c++) {
      for (int d = 0; d < n; d++) {
        for (int k = 0; k < n; k++) {
          for (int l = 0; l < n; l++) {
            C[a,b,c,d] += A[a,b,k,l] * B[l,k,c,d];
          }
        }
      }
    }
  }
}
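Listing 2.1 uses tensor-style index notation rather than compilable C++. Purely as an illustration of the same loop nest, a runnable version over linearized arrays might look like the following sketch; the helper name idx4, the function name contract4d and the row-major layout are assumptions made here, not part of the thesis code.

#include <vector>

// Hypothetical helper: linear offset of element (a,b,k,l) in a row-major
// n x n x n x n array. The layout choice is an assumption for illustration.
inline std::size_t idx4(std::size_t a, std::size_t b, std::size_t k,
                        std::size_t l, std::size_t n) {
  return ((a * n + b) * n + k) * n + l;
}

// C[a,b,c,d] += sum over k,l of A[a,b,k,l] * B[l,k,c,d]
void contract4d(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, std::size_t n) {
  for (std::size_t a = 0; a < n; ++a)
    for (std::size_t b = 0; b < n; ++b)
      for (std::size_t c = 0; c < n; ++c)
        for (std::size_t d = 0; d < n; ++d)
          for (std::size_t k = 0; k < n; ++k)
            for (std::size_t l = 0; l < n; ++l)
              C[idx4(a, b, c, d, n)] +=
                  A[idx4(a, b, k, l, n)] * B[idx4(l, k, c, d, n)];
}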

2.3 Symmetry in Tensors

Symmetry in tensors can be generalized from that in 2-dimensional matrices. A 2D symmetric matrix can be defined as a matrix A[M × N] such that A[i][j] = A[j][i] for all i, j. Thus if we divide the matrix into two triangles separated by the diagonal, the triangles are mirror images of each other and the diagonal elements are unique. This type of symmetry in a 2D matrix can also be represented by the notation A[i > j], indicating that only elements with indices i ≥ j are stored and the elements with indices j > i are the same as those with i > j. Only the lower triangle of the matrix needs to be stored in memory to avoid redundancy, and it can be used for all full-matrix operations. Thus, with a 2-index symmetry in a 2D matrix, we can save half of the total storage space required for a non-symmetric 2D matrix. When the number of dimensions increases, the concept of symmetry can be further generalized for tensors. A group of two or more indices can be involved in a symmetry. Moreover, there can be more than one independent symmetry group in a higher dimensional tensor. For example, there are five cases for a 4D tensor A[i, j, k, l]:

CASE I: No symmetry. The tensor can be represented simply as A[i, j, k, l].

CASE II: Two of the indices are symmetric with each other. Without loss of generality, consider i and j to be symmetric. Tensor A can be represented as A[i > j, k, l].

CASE III: Three of the indices are symmetric with each other. Without loss of generality, consider i, j and k to be symmetric. Tensor A can be represented as A[i > j > k, l].

CASE IV: Two disjoint symmetry groups, each involving two indices of the tensor. Consider i and j to be symmetric with each other while k and l are symmetric with each other. In this case, tensor A is represented as A[i > j, k > l].

CASE V: All four indices are symmetric with each other. Although this case does not occur in the tensors in CCSD equations, it is a possible scenario for symmetry in a 4D tensor. In this case, tensor A can be represented as A[i > j > k > l].

With d dimensions involved in a symmetry group, we only need about 1/d! of the storage space required for a non-symmetric tensor.

2.4 Tensor Contraction on a Processor Grid

Contraction of tensors can be parallelized on a cluster of processor nodes with distributed memory. The processor grid considered for this study is a multi-dimensional torus. A torus-shaped grid has nodes linked in series in each dimension, with a link between the two extreme nodes. In a d-dimensional torus, every node has two neighbor nodes in each dimension and a total of 2d neighbors. Each processor in the grid can be identified by its d-dimensional coordinates. Thus for a 3D grid, a processor P[I, J, K] is the I-th processor in the 0-th dimension, J-th in the 1st dimension and K-th in the 2nd dimension, where dimension indexing starts at 0. Figure 2.1 shows an example of a 3D torus grid of 64 processors by Fujitsu. There are four nodes along each dimension of the grid. Each of these nodes is connected to two neighbors per dimension and thus each one has a total of six neighbors. Note that the nodes at either end of each dimension are connected to each other by a link, forming a loop of four nodes in each dimension. An advantage of a torus grid over a non-torus grid of the same dimensionality is that it is efficient to shift and rotate data in all dimensions. This type of communication pattern is highly desirable for the most efficient tensor contraction algorithms.
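To make the torus topology above concrete, the two neighbors of a node in each dimension can be found with wrap-around (modular) arithmetic on its coordinates. The following is only a small illustrative sketch under the assumption of p nodes per dimension; the function name is not part of the thesis code.

#include <array>

// Neighbors of a node in a d-dimensional torus with p nodes per dimension.
// Each coordinate wraps around modulo p, so every node has exactly two
// neighbors per dimension (2d in total).
template <int d>
std::array<std::array<int, d>, 2 * d> torus_neighbors(
    const std::array<int, d>& coord, int p) {
  std::array<std::array<int, d>, 2 * d> result{};
  for (int dim = 0; dim < d; ++dim) {
    std::array<int, d> minus = coord, plus = coord;
    minus[dim] = (coord[dim] - 1 + p) % p;  // wrap-around "previous" node
    plus[dim] = (coord[dim] + 1) % p;       // wrap-around "next" node
    result[2 * dim] = minus;
    result[2 * dim + 1] = plus;
  }
  return result;
}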

Figure 2.1: 3D Torus grid by Fujitsu [5]

2.4.1 Tensor Mapping on Processor Grid

A tensor can be stored in a processor grid of the same or different dimensionality as that of the tensor. Usually, tensor indices are mapped to the grid dimensions in order to distribute the tensor. This mapping is also referred to as an index-dimension mapping from here onwards. It can be represented by a vector of the same dimensionality as the tensor, where the value at the i-th position gives the physical grid dimension to which the i-th index of the tensor is mapped. We need to consider some constraints and general facts about distributing a tensor on a processor grid.

One-to-one Index-Dimension Mapping Constraint

Only one index from a tensor can be mapped to a grid dimension.

This is because the set of elements of a tensor is the product set formed by a cross product between the indices of the tensor. Thus, if there are 2 indices and the size of the tensor is n along each index, the total number of elements formed by the product is n × n = n². Generalizing to p indices, the total number of elements is n × n × ... × n (p times) = n^p. It is not possible to have the entire product set of elements if we map more than one index to a physical grid dimension. If we map two indices i and j to a dimension and distribute the index values in the same fashion along the dimension, the only elements that can be formed are the diagonal elements where i = j. It is not possible to store the elements where i ≠ j, since each node along the dimension gets the same values of i and j.

Distribution or Serialization of a Tensor Index

There are two ways to deal with each tensor index when distributing a tensor on a grid. An index can either be distributed across some physical dimension or it can be serialized. There are several ways of distributing an index along a dimension, which will be discussed in a later section. However, when an index is serialized, it means the index is not mapped to any of the physical dimensions, i.e. all the elements along this index are fully stored in one processor rather than being spread across several processors in a dimension. It is possible to serialize all indices of the tensor, as a result of which we will have the entire tensor stored in each processor node of the grid.

Distribution or Replication along a Grid Dimension

Another perspective for understanding tensor mapping on a grid is from the viewpoint of a grid dimension. Again, there are two possibilities with respect to a physical grid dimension: either a tensor index is distributed across the dimension, or the tensor data is replicated along it.

If an index is mapped to this dimension, data along that index is divided into chunks and each chunk is stored in one node along this grid dimension. However, if no index is mapped to this dimension, all processors along the dimension hold the same data, resulting in replication and redundant storage. Keeping the above facts and constraints in mind, there are various ways of mapping a tensor. Let us consider a few examples of mapping a d-dimensional tensor on a δ-dimensional grid.

Example 1: One-to-one index-dimension mapping. Since the mapping is exhaustive in this example, consider a d-dimensional tensor mapped on a d-dimensional grid such that each index is mapped to one and only one dimension. No two indices are mapped to the same dimension. The index-dimension map acts as a one-to-one function. Consider an example dimensionality d = 3; now 3 indices are mapped to 3 dimensions. There are 6 possible permutations of the index-dimension map and thus 6 possible ways of mapping indices to the dimensions in a one-to-one fashion. Figure 2.2 shows one such possible mapping of the tensor A[i, j, k] on a 3D physical grid. The 3D cube represents the processor grid, while the arrows indicate the mapping of a tensor index i, j or k to a physical dimension. The index-dimension map for this example can be written as <0, 1, 2>, meaning index i is mapped to dimension 0, index j to dimension 1 and k to dimension 2.

Example 2: Serialization. Consider the case when the grid dimensionality is less than that of the tensor. Now, only δ of the d indices can be mapped to the δ-dimensional grid, while the remaining d - δ indices have to be serialized. For example, consider a 4D tensor A[i, j, k, l] being distributed along a 3D processor grid. We will try to map as many indices as possible to the various dimensions, but only 3 of them can be mapped in a one-to-one mapping; the fourth one has to be serialized. There are 4 ways of choosing the serialized index and 6 ways of mapping the rest to the 3 dimensions.

Figure 2.2: One-to-one mapping of tensor A[i, j, k] on a 3D grid

Hence we have 4 × 6 = 24 ways of distributing the tensor (ignoring possibilities involving replication along dimensions). Note that there are more possible distributions if we allow replication along dimensions. One of the 24 possible ways of mapping A[i, j, k, l] on a 3D grid is shown in Figure 2.3. Here the index l is serialized since there is no dimension that it can be mapped to. The index-dimension map for this case is <0, 1, 2, serial>, meaning indices i, j and k are mapped to the dimensions 0, 1 and 2 respectively, while index l is serialized.

Example 3: Replication. Consider the case when the tensor dimensionality is less than that of the grid. We can map all the tensor indices to the dimensions in a one-to-one fashion, but some of the dimensions will be left unmapped. Data will be replicated along these dimensions. For example, consider a 2D tensor A[i, j] being mapped to a 3D grid. If we try to map as many indices as possible to different dimensions, we can map both tensor indices to any 2 of the 3 grid dimensions and leave one of the dimensions replicated.

Figure 2.3: Mapping tensor A[i, j, k, l] on a 3D grid with l serialized

There are 3 ways to choose a replicated dimension and 2 ways to map the indices onto the remaining 2 dimensions. Hence, we have 3 × 2 = 6 ways of distributing the tensor (ignoring the possibilities involving serialization of tensor indices). One of the possible ways of mapping A[i, j] on a 3D grid is shown in Figure 2.4. Note that data is replicated along one of the physical dimensions since there is no index that can be mapped there. The index-dimension map can be represented as <0, 1>. Note that, unlike serialization of indices, it is not evident from the index-dimension map whether a physical dimension is replicated or not.

Example 4: Serialization and Replication. Although we saw an exhaustive one-to-one mapping of a 3D tensor A[i, j, k] on a 3D grid, there are more possibilities. For example, even though the three indices can be mapped to the dimensions in a one-to-one fashion, we can choose to serialize one or more indices, and as a result the same number of dimensions will have replication along them.

Figure 2.4: Mapping tensor A[i, j] on a 3D grid with one of the dimensions replicated

Figure 2.5 shows an example where the index k is serialized, while i and j are distributed along two of the dimensions. Since one of the dimensions is left unmapped, data is replicated along it. The index-dimension map for this example is <0, 1, serial>.

Figure 2.5: Mapping tensor A[i, j, k] on a 3D grid with k serialized and one of the dimensions replicated
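As an illustration of the index-dimension maps used in the examples above, one possible in-memory representation is a vector with one entry per tensor index and a sentinel for serialized indices; a grid dimension to which no index is mapped is then replicated. The struct and the SERIAL sentinel below are illustrative assumptions, not the thesis's actual data structures.

#include <vector>

constexpr int SERIAL = -1;  // marks a tensor index that is not mapped (serialized)

// map[i] = grid dimension that tensor index i is mapped to, or SERIAL.
// Example 4 above, A[i, j, k] with map <0, 1, serial> on a 3D grid:
//   IndexDimensionMap m{{0, 1, SERIAL}};
struct IndexDimensionMap {
  std::vector<int> map;

  // A grid dimension is replicated if no tensor index is mapped to it.
  bool is_replicated(int grid_dim) const {
    for (int d : map)
      if (d == grid_dim) return false;
    return true;
  }
};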

2.4.2 Tensor Index Distribution On Grid Dimensions

Before delving into tensor redistribution algorithms, it is necessary to understand how a tensor index can be distributed on a grid dimension. Assume the size of the tensor along an index to be n, while the size of the processor grid in one dimension is p. Now, the problem of distributing a tensor index along a grid dimension can be reduced to the problem of distributing n elements across p nodes. There are several possible ways of doing this, a subset of which is defined below.

Figure 2.6: 12 elements to be distributed along 3 nodes

Block Distribution

In this type of distribution, we divide the sequence of n elements equally into p parts, so that each part contains n/p elements (assuming p divides n). Each part is stored on one node. As an example, consider n = 12 and p = 3. The 12 elements to be distributed are shown in Figure 2.6. After distributing, each of the three nodes gets n/p = 4 elements: the first node gets the first four, the second node gets the next four, and the third node gets the last four elements. Figure 2.7 illustrates this example. Although block distribution is easy to understand and implement, it does not provide good load balancing in the case of symmetric tensors, where you would only want to store the unique elements.

Figure 2.7: Block Distribution

Cyclic Distribution

Here, n elements are cyclically divided among p nodes, such that the first element is stored in the first node, the second element in the second node, ..., and the p-th element in the p-th node. The next, i.e. (p + 1)-th, element is stored in the first node, and further elements are similarly distributed in a cyclic fashion. In general, the i-th element is stored on the (i % p)-th node (with 0-based numbering). Figure 2.8 shows how the same 12 elements can be distributed across 3 nodes in a cyclic fashion. Cyclic distribution is a good fit for symmetric tensors, since it can properly load balance the storage across the nodes in the dimension where a symmetric index is mapped.

Figure 2.8: Cyclic Distribution

Block-Cyclic Distribution

This distribution combines the concepts of the above two distributions into one. The elements are first divided into s equal blocks, where s is an integer that divides the number of elements and is divisible by the number of processors p. These s blocks of elements are then cyclically distributed across the p processors. Figure 2.9 shows how blocks of 2 elements can be cyclically distributed across 3 nodes. This scheme is good for load balancing of symmetric tensors and is a good fit for distributed contractions, since the blocks can be utilized directly in the parallel matrix multiplication kernel of tensor contractions. The implementation for this thesis uses a block-cyclic distribution of tensors.

Figure 2.9: Block-Cyclic Distribution
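To make the three schemes concrete, the sketch below computes which node owns a given element index under each distribution, assuming 0-based numbering of elements, blocks and nodes; the function names are illustrative assumptions rather than thesis code.

// Owner node of element i under a block distribution of n elements
// over p nodes (assumes p divides n): consecutive chunks of n/p elements.
int block_dist_owner(int i, int n, int p) { return i / (n / p); }

// Owner node of element i under a cyclic distribution over p nodes.
int cyclic_owner(int i, int p) { return i % p; }

// Owner node under a block-cyclic distribution: elements are grouped into
// s blocks of size n/s, and the blocks are dealt out cyclically to p nodes.
int block_cyclic_owner(int i, int n, int s, int p) {
  int block_id = i / (n / s);  // which of the s blocks element i falls in
  return block_id % p;         // blocks are distributed cyclically
}

For the example of Figure 2.9 (n = 12, s = 6 blocks of 2 elements, p = 3), block_cyclic_owner places elements 0-1 on node 0, elements 2-3 on node 1, elements 4-5 on node 2, elements 6-7 back on node 0, and so on.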

2.4.3 Contraction of Distributed Tensors

CCSD equations involve computations with very large tensors. Contraction of large tensors is a computationally demanding operation. Contracting a d_a-dimensional and a d_b-dimensional tensor, resulting in a d_c-dimensional tensor, takes O(n^((d_a + d_b + d_c)/2)) time, where n is the size of the tensor in each dimension. Since a large percentage of CCSD equations involve up to 4D tensors, this cost often turns out exorbitant for single-processor execution. However, with distributed and parallel computing this task can be made much more efficient. The computationally intensive contraction operation can be effectively parallelized on a torus-shaped grid of processors. As discussed before, the tensors can be stored on the grid in a distributed fashion with various possible index-dimension mappings. The mapping of input tensors on the processor grid is the primary decisive factor in the choice of the most efficient contraction algorithm. The tensor distribution across the processor grid dictates the inter-processor data communication pattern for the contraction algorithm. The data communication patterns tend to differ based on the contraction algorithm. Some tensor distributions make the most efficient contraction algorithms possible, while others tend to force a more time-intensive algorithm. Hence, the initial distribution of the input and output tensors in the contraction process is very important.

2.5 Need for Tensor Redistribution

CCSD equations involve repeated contraction operations on several tensors. These operations are broken down into a series of contractions and additions of two intermediate tensors. The output of one tensor contraction operation is often used as input to the next contraction. However, although this tensor is involved in two different contraction operations, it often needs to have different data distributions on the grid. The contraction that generates the tensor could generate it in a certain index-dimension mapping, but for another contraction where it is used as input, its current distribution may not be optimal for the contraction to work most efficiently. Hence, the tensor needs to be redistributed with a new index-dimension mapping on the grid. As an example, let us consider a scenario that often mandates a redistribution of the input tensors depending on the initial distribution of the output tensors.

External indices (those that are retained in the output tensor after the contraction) are mapped to a physical dimension or serialized in the initial distribution of the output tensor. If an external index in the output tensor is mapped along a particular dimension, we normally would like that external index in the input tensor to be mapped along the same dimension before starting the contraction, unless contractions can be done with rotational communication patterns. Also, we usually require the external indices from different input tensors to be orthogonal to each other, i.e. they are to be mapped to different physical dimensions (those where the same external indices in the output tensor are mapped). Now, consider five 4D tensors distributed on a 4D processor grid, all with the same index-dimension map <0, 1, 2, 3>, i.e. the i-th index of each tensor is mapped to the i-th dimension:

A[a1, l1, k1, k2]
B[k1, k2, l2, b1]
C[a1, l1, l2, b1]
D[d1, l1, l2, d2]
E[d1, a1, d2, b1]

These tensors undergo the following two contractions:

1. A[a1, l1, k1, k2] B[k1, k2, l2, b1] = C[a1, l1, l2, b1]
2. C[a1, l1, l2, b1] D[d1, l1, l2, d2] = E[d1, a1, d2, b1]

In contraction 1, according to the given index-dimension map, all the external indices of tensors A and B (the input tensors) are mapped to the same dimension where the corresponding external indices in tensor C (the output tensor) are mapped. For example, according to the index-dimension map <0, 1, 2, 3>, index l1 in C is mapped to dimension 1; since tensor A has the same map, index l1 in A is mapped to dimension 1 as well. This type of initial distribution is highly desirable for the contraction algorithm to be efficient. However, in contraction 2, index a1 is mapped to dimension 1 in E (the output tensor), while in tensor C it is mapped to dimension 0. Similarly, index d2 in E is mapped to dimension 2, but in D it is mapped to dimension 3. Since we would normally want them to be mapped to the same dimension in both input and output tensors, it is desirable to redistribute the input tensors so that the condition is fulfilled. Hence, before starting contraction 2, a preferable index-dimension map for tensor C would be <1, 0, 2, 3> and that for D would be <0, 3, 1, 2>. With such a distribution, the external indices in C and D are mapped to the dimensions where those in E are mapped. To achieve this, we need to redistribute the tensors C and D after contraction 1 and before contraction 2.
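One simple way to derive such a preferred map for an input tensor is to copy, for each external index, the grid dimension assigned to it by the output tensor's map and to place the contracting indices on whatever dimensions remain. The sketch below only illustrates that reasoning; the names and the arbitrary placement of contracting indices are assumptions, not the thesis's actual procedure.

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Given the output tensor's index names and index-dimension map, derive a map
// for an input tensor: external indices inherit the output's dimension, and
// the remaining (contracting) indices get the leftover dimensions.
std::vector<int> preferred_input_map(const std::vector<std::string>& in_idx,
                                     const std::vector<std::string>& out_idx,
                                     const std::vector<int>& out_map) {
  std::vector<int> result(in_idx.size(), -1);
  std::vector<int> used;
  for (std::size_t i = 0; i < in_idx.size(); ++i) {
    auto it = std::find(out_idx.begin(), out_idx.end(), in_idx[i]);
    if (it != out_idx.end()) {                    // external index
      result[i] = out_map[it - out_idx.begin()];  // copy the output's dimension
      used.push_back(result[i]);
    }
  }
  int next = 0;
  for (std::size_t i = 0; i < result.size(); ++i) {  // contracting indices
    if (result[i] != -1) continue;
    while (std::find(used.begin(), used.end(), next) != used.end()) ++next;
    result[i] = next;
    used.push_back(next);
  }
  return result;
}

For tensor D in the example this reproduces d1 mapped to dimension 0 and d2 to dimension 2; the contracting indices l1 and l2 may land on different dimensions than in the map <0, 3, 1, 2> quoted above, since only the external indices are constrained here.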

CHAPTER 3

REDISTRIBUTION OF TENSORS

Redistribution of tensors is an essential operation that occurs between two contraction operations in CCSD. Such frequent occurrence of redistribution increases the need for it to be performed efficiently. This section describes the details of the redistribution algorithm. The redistribution algorithm takes the tensor to be redistributed and a new index-dimension map as inputs and redistributes the tensor according to the new index-dimension map. The algorithm considers both the current (or old) and the new index-dimension maps to decide how to communicate data in the grid in order to redistribute the tensor. The details are mainly driven by the idea of replication along physical dimensions.

3.1 Identifying Communication Patterns

Since redistribution of tensors is mainly a data communication task, the first problem encountered is which processors send data and which ones receive it. Since a block-cyclic distribution of tensors is used, the best way to identify the data held by a processor is in terms of the blocks of the tensor. In block-cyclic distribution, a tensor is split into blocks along each index and the blocks are cyclically distributed along the dimensions the respective indices are mapped to.

Each block has a unique vector address of the same dimensionality as the tensor. For example, a block in a 4D tensor A[i, j, k, l] can be identified by an address of the form <i, j, k, l>, where i, j, k and l represent the position of the block along each index. Given an index-dimension map, the dimensionality and size of the processor grid and the tensor block size, we can determine which blocks a particular node in the grid holds. It is possible to enumerate all the block addresses held at a processor given the processor's address in the grid. Thus we can find out which blocks each processor will hold after redistribution, given the new index-dimension map. From these block addresses, each processor can determine where each of them currently resides based on the current index-dimension map. Hence, each processor can find out which other processors will send blocks to it. Also, the addresses of currently held blocks are already stored in each processor. From these addresses, the processor can compute at which other processor each block will reside after redistribution. Hence, each processor can compute where to send each block, as sketched below.
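The owner computation described above can be sketched as follows, assuming the block-cyclic scheme of Section 2.4.2, i.e. block b of an index mapped to grid dimension d resides at coordinate b % p along d, while serialized indices impose no constraint. The function and variable names are illustrative assumptions rather than the thesis's actual code.

#include <cstddef>
#include <vector>

constexpr int SERIAL = -1;  // tensor index not mapped to any grid dimension

// Grid coordinates of the processor that owns a tensor block, given the
// block's vector address and an index-dimension map. Dimensions with no
// index mapped to them are replicated; coordinate 0 is returned for them
// here, standing for "every coordinate along that dimension holds a copy".
std::vector<int> owner_coords(const std::vector<int>& block_addr,
                              const std::vector<int>& idx_dim_map,
                              int grid_dims, int nodes_per_dim) {
  std::vector<int> coords(grid_dims, 0);
  for (std::size_t i = 0; i < block_addr.size(); ++i) {
    int dim = idx_dim_map[i];
    if (dim == SERIAL) continue;                  // serialized index: no constraint
    coords[dim] = block_addr[i] % nodes_per_dim;  // cyclic placement of blocks
  }
  return coords;
}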

3.2 Parallelizing Communication

Although redistribution overall seems to be a grid-wide communication algorithm, in some cases it can be parallelized by dividing the grid into groups of nodes that handle the communication within each group. This idea of parallelizing communication stems from the replication scenario of the current distribution in the grid.

3.2.1 Replication in Current Distribution

Communication can be parallelized and handled independently by dividing the grid into groups of nodes such that each group contains a full copy of the tensor in the current distribution. Multiple copies of the tensor exist in the grid only if there is replication along at least one dimension. The division of the grid based on the replication scenario is discussed below.

Replication along one dimension

If there is replication along one of the dimensions as per the current index-dimension map, we can divide the grid into hyperplanes of d - 1 dimensions, where d is the dimensionality of the grid. The division is such that each processor along the replicated dimension is in a different hyperplane and each hyperplane contains a copy of the entire tensor. Hence, if there are p nodes along each dimension of the grid, there will be p hyperplanes. Figure 3.1 shows a 3D grid with p = 4. One of the dimensions of the grid has replication along it, while the other two dimensions each have some tensor index distributed along them. As shown, the grid can be virtually split into 2D planes along the replicated dimension. Since all the dimensions that have some index distributed along them form the plane together, each plane contains a copy of the full tensor. An interesting outcome of this is that we do not need to send data from nodes in one plane to those in any other plane in order to redistribute the tensor. Each plane can handle communication independently of the other planes. This not only greatly simplifies the communication pattern for redistribution but also makes sure that nodes only communicate with the nearest other nodes for redistribution.

Replication along multiple dimensions

If there are d_r replicated dimensions in the grid according to the current index-dimension map, the hyperplane splits will occur along all d_r dimensions. This will yield hyperplanes of d - d_r dimensions, formed by those dimensions that have indices mapped to them. Again, each hyperplane will have a copy of the entire tensor, and communication can be worked out in each hyperplane independently of the others. If there are p nodes along each dimension of the grid, there will be p^(d_r) different hyperplanes formed.

Figure 3.1: Virtual splitting of grid into 2D planes along the replicated dimension

No Replication in Current Distribution

If there is no replication of data in the grid, the entire grid has only one copy of the tensor, distributed across all nodes. In this case, since no two nodes hold the same block of tensor data, communication cannot be parallelized; it has to be grid-wide. One point to note is that this communication will resemble the communication happening within a hyperplane, if the grid were divided into hyperplanes each containing one and only one copy of the tensor. Since the entire d-dimensional grid contains only one complete copy of the tensor, it can be viewed as a d-dimensional hyperplane itself. Hence, all arguments made for a hyperplane are valid for the whole grid when there is no replication in it with respect to the current distribution of the tensor.
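As a sketch of how the hyperplane groups of this section could be formed with MPI (the implementation environment named in Section 3.3), the grid communicator can be split using a color built from a processor's coordinates along the replicated dimensions, so that processors sharing those coordinates, i.e. members of the same hyperplane, end up in the same sub-communicator. The use of MPI_Comm_split and the variable names are assumptions, not necessarily how the thesis implementation builds its groups.

#include <mpi.h>
#include <cstddef>
#include <vector>

// Split the grid communicator into one sub-communicator per hyperplane.
// Processors that share the same coordinates along all replicated dimensions
// belong to the same hyperplane, so those coordinates are flattened into the
// "color" passed to MPI_Comm_split.
MPI_Comm make_hyperplane_comm(MPI_Comm grid_comm,
                              const std::vector<int>& my_coords,
                              const std::vector<bool>& is_replicated,
                              int nodes_per_dim) {
  int color = 0;
  for (std::size_t d = 0; d < my_coords.size(); ++d)
    if (is_replicated[d])
      color = color * nodes_per_dim + my_coords[d];  // flatten replicated coords

  int my_rank;
  MPI_Comm_rank(grid_comm, &my_rank);

  MPI_Comm hyperplane_comm;
  MPI_Comm_split(grid_comm, color, my_rank, &hyperplane_comm);
  return hyperplane_comm;
}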

3.3 Redistribution within a Hyperplane

Irrespective of whether the hyperplane spans the entire grid or a subset of it, the algorithm applies in the same manner. Redistribution can be performed in one of two possible ways inside a hyperplane. Since we know the new index-dimension map to redistribute the tensor to, we can tell which dimensions of the hyperplane (or grid) will have replication after redistribution. The replication scenario in the new distribution of the tensor decides which algorithm to pick.

3.3.1 Broadcast Communication

If a dimension is replicated in the new index-dimension map while some index is distributed along it in the current map, then there will be a broadcast of the data that is to be replicated along this dimension. All the processors in this dimension are supposed to receive the same data from the only processor in the hyperplane (or grid) that currently holds it. This sender processor may or may not be a part of the group of processors that will be holding this data after the broadcast. The previously defined notation to identify processors (P[I, J, K, ...]) will be used to discuss data broadcasts. This notation can also be used to identify multiple processors along a dimension by using a wildcard (*). For example, in a 3D grid, a processor can be identified by its coordinates as P[I, J, K]. The processors in dimension 1 that share the other coordinates with this processor can be represented by P[I, *, K], which means any processor that is I-th in the 0-th dimension and K-th in the 2nd dimension is included, regardless of what value it takes in place of the wildcard (*). This group of processors represented by P[I, *, K] forms a 1-dimensional torus in the grid, which can also be seen as a ring of processors.

Broadcast Groups

A group of processors that will receive the broadcast (also called a broadcast group) can be represented using the above defined notation. If the 1st dimension in a 3D grid has replication in the new distribution but not in the old one, it becomes a broadcast dimension. A broadcast group along the dimension can be represented as P[I, *, K]. If there are p processors in each dimension of the grid, there will be p × p = p² disjoint broadcast groups, because I and K can each take p possible values and form p² possible combinations for P[I, *, K]. Mathematically generalizing the idea, consider d to be the dimensionality of the torus grid, with p nodes in each dimension. If there are b broadcast dimensions, then each broadcast group will have p^b nodes and there will be p^(d-b) such broadcast groups. Figure 3.2 shows these broadcast groups of processors for p = 4. The groups look like lines of processors (1D) along the 1st dimension, which is the dimension that will have replication in the new index-dimension map. Although we discussed only one replicated dimension, there can be more than one dimension that has replication in the new distribution but not in the old one. In this case, the broadcast groups span all these dimensions, forming hyperplanes instead of just one line of processors. For example, consider a 4D processor grid [I, J, K, L] where two of the dimensions, J and L, have replication in the new distribution but some indices are mapped to them in the current distribution. Now the broadcast groups can be represented as P[I, *, K, *]. Again, the processor that sends the broadcast may or may not be a part of the broadcast group.

Implementation of Broadcast Groups

C++ and the Message Passing Interface (MPI) were used for implementing redistribution of tensors.

Figure 3.2: Broadcast groups of the form P[I,*,K]

Some MPI features allow broadcast data communication within the same group or between different groups. Since broadcast is a collective operation in MPI, it needs to be done within a defined group of processes. A communicator is created for the group of processors involved in the collective operation, and all the group members participate in the operation. An MPI communicator can be of two types:

Intra-Communicator: allows communication within a single group of processes. A process that is a part of the sole group of an intra-communicator can broadcast data to all the processes in that group. This scenario is depicted in Figure 3.3. The cloud represents an MPI group of processes and the circles in the cloud represent processes. The large rectangle that wraps the group is an intra-communicator. As shown, process 0 sends a broadcast message to all the other processes in the group.

Figure 3.3: Broadcast in an Intra-Communicator

Inter-Communicator: allows communication between two disjoint groups of processes. A process in one group can broadcast data to all the processes in the other group that is part of the inter-communicator. Figure 3.4 represents this setup. The two clouds represent two groups that together form an inter-communicator. Process 0 from Group1 broadcasts data to all processes in Group2.

If a processor P[I1, J1, K1] in a 3D processor grid needs to broadcast data to a group of processors P[I2, *, K2] where I1 = I2 and K1 = K2, then the sender processor is a part of the broadcast group. In this case, an intra-communicator is used. However, if I1 ≠ I2 or K1 ≠ K2, then the sender processor is not a part of the broadcast receiver group. Here, an inter-communicator is created such that one of the groups contains only the sender processor P[I1, J1, K1], while the other group contains all the processors P[I2, *, K2] that will receive the broadcast.
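The following is a minimal sketch of the intra-communicator case described above, assuming the broadcast group is carved out of the grid communicator with MPI_Comm_split; the color computation, buffer type and function name are illustrative assumptions, and the inter-communicator case (sender outside the group) is not shown.

#include <mpi.h>
#include <vector>

// Broadcast one block of tensor data within a broadcast group P[I, *, K].
// All processors sharing coordinates I and K get the same color, so
// MPI_Comm_split places each ring of processors in its own communicator.
void broadcast_block(MPI_Comm grid_comm, int coord_I, int coord_K,
                     int nodes_per_dim, int root_rank_in_group,
                     std::vector<double>& block) {
  int my_rank;
  MPI_Comm_rank(grid_comm, &my_rank);

  // One color per (I, K) pair: p * p disjoint broadcast groups in a 3D grid.
  int color = coord_I * nodes_per_dim + coord_K;

  MPI_Comm group_comm;
  MPI_Comm_split(grid_comm, color, my_rank, &group_comm);

  // The processor holding the block (rank root_rank_in_group inside the
  // group) sends it; every other member of the group receives a copy.
  MPI_Bcast(block.data(), static_cast<int>(block.size()), MPI_DOUBLE,
            root_rank_in_group, group_comm);

  MPI_Comm_free(&group_comm);
}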

Figure 3.4: Broadcast in an Inter-Communicator

Redistribution with Broadcast

Listing 3.1 presents the algorithm for redistribution with broadcast. In this scheme, all processors that receive tensor blocks always receive them as part of a broadcast message. A broadcast message is directed from a sender processor to a broadcast group. The sender can have several blocks that may need to be broadcast to different broadcast groups. Thus the sender needs to find out which broadcast group needs which set of blocks. This can be computed with the help of the current (or old) and new index-dimension maps, the processor grid size, the tensor block size, etc. Each of the broadcast groups can be represented as a generic processor address with a wildcard for the broadcast dimension. This generic address can be translated into a single number that can act as a generic rank of the broadcast group.

A send_map is created that maps this generic rank of the broadcast group to the list of addresses (or numbers) of blocks that need to be sent to that broadcast group. Since the sender processor knows the receiver broadcast groups it will broadcast its blocks to, it can determine whether it is a part of those broadcast groups or not. Based on this information, it can decide whether to form an intra-communicator or an inter-communicator with the receiver group. Similarly, each processor finds out which blocks it will receive based on the new index-dimension map, the processor grid size, the tensor block size, etc. From the block addresses to receive, the receiver processor can find which processor will send each of them. The receiver also knows the broadcast group it is part of, and thus it can determine whether the sender is in the broadcast group or not. Thus, the receiver can decide whether to form an intra-communicator with the broadcast group, or an inter-communicator with the broadcast group as one group and the sender processor as the other group with a sole member. Now that each processor knows which blocks to receive from which processor, a recv_map is created that maps the sender processor's rank to the list of addresses (or numbers) of blocks this processor will receive from that sender. Once the send_map and recv_map are ready on each processor, the communicators are created. When the sender is a part of the broadcast group, all the processors in the broadcast group as well as the sender know about it and they mutually form an intra-communicator. When the sender is outside the receiving broadcast group, both parties know about this and they mutually form an inter-communicator. Since creation of a communicator involves internal message passing between all the processors involved, a haphazard order of communicator creation may result in a deadlock. Hence, the function call to create a communicator is made by each processor involved (sender and broadcast receivers) in ascending order of the sender processor's rank in the world (or global) communicator.
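The generic rank mentioned above can be obtained, for example, by flattening the non-wildcard coordinates of the broadcast group's address into a single integer; this is only one possible encoding, and the WILDCARD sentinel and function name are assumptions made for illustration rather than the thesis's actual scheme.

#include <vector>

constexpr int WILDCARD = -1;  // stands for '*' in an address such as P[I, *, K]

// Translate a broadcast-group address with wildcards into a single integer
// that can serve as the group's generic rank (a key of send_map). Wildcard
// positions contribute nothing; the fixed coordinates are flattened in
// mixed-radix fashion with p nodes per dimension.
int generic_rank(const std::vector<int>& group_addr, int p) {
  int rank = 0;
  for (int c : group_addr)
    if (c != WILDCARD) rank = rank * p + c;
  return rank;
}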

Listing 3.1: Redistribution with Broadcast Communication

// send_map:  maps the generic ranks of processor groups that will receive the
//            broadcast to the list of block numbers to broadcast
// recv_map:  maps the ranks of processors to receive blocks from to the list
//            of block numbers to receive
// rank:      rank of this processor in the processor grid
// send_comm: array of communicators for messages to send
// recv_comm: array of communicators for messages to receive

redistribute_broadcast(bcast_dims):
    send_map = generate_send_map();
    recv_map = generate_recv_map();
    send_comm = create_bcast_send_comms(send_map, bcast_dims);
    recv_comm = create_bcast_recv_comms(recv_map, bcast_dims);
    sends_done = false;
    for (r in range(0, recv_map.size))
        sender = recv_map[r].key;
        block_ids_to_recv = recv_map[r].value;
        if (sender >= rank AND sends_done == false)
            send_bcast(send_map);
            sends_done = true;
        if (sender != rank)
            received_blocks = recv_bcast(recv_comm[r]);
    if (sends_done == false)
        send_bcast(send_map);
    tensor.blocks = received_blocks;

send_bcast(send_map):
    for (s in range(0, send_map.size))
        recv_group = send_map[s].key;
        block_ids_to_send = send_map[s].value;
        blocks_to_send = gather_blocks(block_ids_to_send);
        bcast(blocks_to_send, send_comm[s]);
        if (this processor is in recv_group)
            received_blocks.add(blocks_to_send);

After all the maps and communicators are ready, the processors are ready to exchange blocks. Again, since each processor can be involved in multiple broadcasts as a sender or a receiver, a haphazard ordering of the calls to send and receive broadcasts can result in a deadlock. A scheme similar to the one used for the creation of communicators is used: the calls to send or receive broadcasts are made in ascending order of the sender's rank in the world (or global) communicator. Since we are following the sender's rank order, we would like to access each key in the recv_map (which is nothing but the sender's rank) in ascending order. Iterating over the map yields the keys in ascending order. The processor starts posting broadcast receive calls for each entry in recv_map. When the processor finds that it has received data from all the processors with rank less than its own, it is time to send the blocks it holds to the processors expecting them. Hence, the processor iterates through the send_map and posts broadcast sends for each entry. Once it is done with all sends, the remaining broadcast receives are posted in ascending order of senders. This finishes the broadcast of blocks, and at the end of this exchange every processor holds the blocks that it should as per the new index-dimension map.

3.3.2 Point-to-Point Communication

This scenario arises when none of the dimensions in the hyperplane have replication along them in the new distribution. The statement is also valid for the entire grid when there is no replication in the current distribution either, meaning there is only one copy of the tensor in the entire grid. This idea can also be represented with the following cases:

Case I: There is no dimension that has replication as per the new distribution, irrespective of its replication scenario in the current distribution. However, it


More information

Performance Evaluation of the Matlab PCT for Parallel Implementations of Nonnegative Tensor Factorization

Performance Evaluation of the Matlab PCT for Parallel Implementations of Nonnegative Tensor Factorization Performance Evaluation of the Matlab PCT for Parallel Implementations of Nonnegative Tensor Factorization Tabitha Samuel, Master s Candidate Dr. Michael W. Berry, Major Professor Abstract: Increasingly

More information

LINEAR ALGEBRA KNOWLEDGE SURVEY

LINEAR ALGEBRA KNOWLEDGE SURVEY LINEAR ALGEBRA KNOWLEDGE SURVEY Instructions: This is a Knowledge Survey. For this assignment, I am only interested in your level of confidence about your ability to do the tasks on the following pages.

More information

A comparison of sequencing formulations in a constraint generation procedure for avionics scheduling

A comparison of sequencing formulations in a constraint generation procedure for avionics scheduling A comparison of sequencing formulations in a constraint generation procedure for avionics scheduling Department of Mathematics, Linköping University Jessika Boberg LiTH-MAT-EX 2017/18 SE Credits: Level:

More information

Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano

Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano Introduction Introduction We wanted to parallelize a serial algorithm for the pivoted Cholesky factorization

More information

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Chapter 14 SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Today we continue the topic of low-dimensional approximation to datasets and matrices. Last time we saw the singular

More information

How to Optimally Allocate Resources for Coded Distributed Computing?

How to Optimally Allocate Resources for Coded Distributed Computing? 1 How to Optimally Allocate Resources for Coded Distributed Computing? Qian Yu, Songze Li, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr Department of Electrical Engineering, University of Southern

More information

B-Spline Interpolation on Lattices

B-Spline Interpolation on Lattices B-Spline Interpolation on Lattices David Eberly, Geometric Tools, Redmond WA 98052 https://www.geometrictools.com/ This work is licensed under the Creative Commons Attribution 4.0 International License.

More information

Clojure Concurrency Constructs, Part Two. CSCI 5828: Foundations of Software Engineering Lecture 13 10/07/2014

Clojure Concurrency Constructs, Part Two. CSCI 5828: Foundations of Software Engineering Lecture 13 10/07/2014 Clojure Concurrency Constructs, Part Two CSCI 5828: Foundations of Software Engineering Lecture 13 10/07/2014 1 Goals Cover the material presented in Chapter 4, of our concurrency textbook In particular,

More information

Divisible Load Scheduling

Divisible Load Scheduling Divisible Load Scheduling Henri Casanova 1,2 1 Associate Professor Department of Information and Computer Science University of Hawai i at Manoa, U.S.A. 2 Visiting Associate Professor National Institute

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

Lab 2 Worksheet. Problems. Problem 1: Geometry and Linear Equations

Lab 2 Worksheet. Problems. Problem 1: Geometry and Linear Equations Lab 2 Worksheet Problems Problem : Geometry and Linear Equations Linear algebra is, first and foremost, the study of systems of linear equations. You are going to encounter linear systems frequently in

More information

3 Matrix Algebra. 3.1 Operations on matrices

3 Matrix Algebra. 3.1 Operations on matrices 3 Matrix Algebra A matrix is a rectangular array of numbers; it is of size m n if it has m rows and n columns. A 1 n matrix is a row vector; an m 1 matrix is a column vector. For example: 1 5 3 5 3 5 8

More information

CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform

CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform Edgar Solomonik University of Illinois at Urbana-Champaign September 21, 2016 Fast

More information

Linear Algebra I. Ronald van Luijk, 2015

Linear Algebra I. Ronald van Luijk, 2015 Linear Algebra I Ronald van Luijk, 2015 With many parts from Linear Algebra I by Michael Stoll, 2007 Contents Dependencies among sections 3 Chapter 1. Euclidean space: lines and hyperplanes 5 1.1. Definition

More information

Roberto s Notes on Linear Algebra Chapter 10: Eigenvalues and diagonalization Section 3. Diagonal matrices

Roberto s Notes on Linear Algebra Chapter 10: Eigenvalues and diagonalization Section 3. Diagonal matrices Roberto s Notes on Linear Algebra Chapter 10: Eigenvalues and diagonalization Section 3 Diagonal matrices What you need to know already: Basic definition, properties and operations of matrix. What you

More information

SDS developer guide. Develop distributed and parallel applications in Java. Nathanaël Cottin. version

SDS developer guide. Develop distributed and parallel applications in Java. Nathanaël Cottin. version SDS developer guide Develop distributed and parallel applications in Java Nathanaël Cottin sds@ncottin.net http://sds.ncottin.net version 0.0.3 Copyright 2007 - Nathanaël Cottin Permission is granted to

More information

Chapter Two Elements of Linear Algebra

Chapter Two Elements of Linear Algebra Chapter Two Elements of Linear Algebra Previously, in chapter one, we have considered single first order differential equations involving a single unknown function. In the next chapter we will begin to

More information

Cofactors and Laplace s expansion theorem

Cofactors and Laplace s expansion theorem Roberto s Notes on Linear Algebra Chapter 5: Determinants Section 3 Cofactors and Laplace s expansion theorem What you need to know already: What a determinant is. How to use Gauss-Jordan elimination to

More information

Linear Algebra Section 2.6 : LU Decomposition Section 2.7 : Permutations and transposes Wednesday, February 13th Math 301 Week #4

Linear Algebra Section 2.6 : LU Decomposition Section 2.7 : Permutations and transposes Wednesday, February 13th Math 301 Week #4 Linear Algebra Section. : LU Decomposition Section. : Permutations and transposes Wednesday, February 1th Math 01 Week # 1 The LU Decomposition We learned last time that we can factor a invertible matrix

More information

THESIS. Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

THESIS. Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University The Hasse-Minkowski Theorem in Two and Three Variables THESIS Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University By

More information

Preconditioned Parallel Block Jacobi SVD Algorithm

Preconditioned Parallel Block Jacobi SVD Algorithm Parallel Numerics 5, 15-24 M. Vajteršic, R. Trobec, P. Zinterhof, A. Uhl (Eds.) Chapter 2: Matrix Algebra ISBN 961-633-67-8 Preconditioned Parallel Block Jacobi SVD Algorithm Gabriel Okša 1, Marián Vajteršic

More information

LAKELAND COMMUNITY COLLEGE COURSE OUTLINE FORM

LAKELAND COMMUNITY COLLEGE COURSE OUTLINE FORM LAKELAND COMMUNITY COLLEGE COURSE OUTLINE FORM ORIGINATION DATE: 8/2/99 APPROVAL DATE: 3/22/12 LAST MODIFICATION DATE: 3/28/12 EFFECTIVE TERM/YEAR: FALL/ 12 COURSE ID: COURSE TITLE: MATH2800 Linear Algebra

More information

Computational Approaches to Finding Irreducible Representations

Computational Approaches to Finding Irreducible Representations Computational Approaches to Finding Irreducible Representations Joseph Thomas Research Advisor: Klaus Lux May 16, 2008 Introduction Among the various branches of algebra, linear algebra has the distinctions

More information

Parallel Singular Value Decomposition. Jiaxing Tan

Parallel Singular Value Decomposition. Jiaxing Tan Parallel Singular Value Decomposition Jiaxing Tan Outline What is SVD? How to calculate SVD? How to parallelize SVD? Future Work What is SVD? Matrix Decomposition Eigen Decomposition A (non-zero) vector

More information

Efficient implementation of the overlap operator on multi-gpus

Efficient implementation of the overlap operator on multi-gpus Efficient implementation of the overlap operator on multi-gpus Andrei Alexandru Mike Lujan, Craig Pelissier, Ben Gamari, Frank Lee SAAHPC 2011 - University of Tennessee Outline Motivation Overlap operator

More information

Lecture 19. Architectural Directions

Lecture 19. Architectural Directions Lecture 19 Architectural Directions Today s lecture Advanced Architectures NUMA Blue Gene 2010 Scott B. Baden / CSE 160 / Winter 2010 2 Final examination Announcements Thursday, March 17, in this room:

More information

Communication Lower Bounds for Programs that Access Arrays

Communication Lower Bounds for Programs that Access Arrays Communication Lower Bounds for Programs that Access Arrays Nicholas Knight, Michael Christ, James Demmel, Thomas Scanlon, Katherine Yelick UC-Berkeley Scientific Computing and Matrix Computations Seminar

More information

Numerical Methods Lecture 2 Simultaneous Equations

Numerical Methods Lecture 2 Simultaneous Equations CGN 42 - Computer Methods Numerical Methods Lecture 2 Simultaneous Equations Topics: matrix operations solving systems of equations Matrix operations: Adding / subtracting Transpose Multiplication Adding

More information

STRONG FORMS OF ORTHOGONALITY FOR SETS OF HYPERCUBES

STRONG FORMS OF ORTHOGONALITY FOR SETS OF HYPERCUBES The Pennsylvania State University The Graduate School Department of Mathematics STRONG FORMS OF ORTHOGONALITY FOR SETS OF HYPERCUBES A Dissertation in Mathematics by John T. Ethier c 008 John T. Ethier

More information

Parallelization of the QC-lib Quantum Computer Simulator Library

Parallelization of the QC-lib Quantum Computer Simulator Library Parallelization of the QC-lib Quantum Computer Simulator Library Ian Glendinning and Bernhard Ömer VCPC European Centre for Parallel Computing at Vienna Liechtensteinstraße 22, A-19 Vienna, Austria http://www.vcpc.univie.ac.at/qc/

More information

Katholieke Universiteit Leuven Department of Computer Science

Katholieke Universiteit Leuven Department of Computer Science On the maximal cycle and transient lengths of circular cellular automata Kim Weyns, Bart Demoen Report CW 375, December 2003 Katholieke Universiteit Leuven Department of Computer Science Celestijnenlaan

More information

INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR. NPTEL National Programme on Technology Enhanced Learning. Probability Methods in Civil Engineering

INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR. NPTEL National Programme on Technology Enhanced Learning. Probability Methods in Civil Engineering INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR NPTEL National Programme on Technology Enhanced Learning Probability Methods in Civil Engineering Prof. Rajib Maity Department of Civil Engineering IIT Kharagpur

More information

Parallel Programming. Parallel algorithms Linear systems solvers

Parallel Programming. Parallel algorithms Linear systems solvers Parallel Programming Parallel algorithms Linear systems solvers Terminology System of linear equations Solve Ax = b for x Special matrices Upper triangular Lower triangular Diagonally dominant Symmetric

More information

A GENETIC ALGORITHM FOR FINITE STATE AUTOMATA

A GENETIC ALGORITHM FOR FINITE STATE AUTOMATA A GENETIC ALGORITHM FOR FINITE STATE AUTOMATA Aviral Takkar Computer Engineering Department, Delhi Technological University( Formerly Delhi College of Engineering), Shahbad Daulatpur, Main Bawana Road,

More information

On queueing in coded networks queue size follows degrees of freedom

On queueing in coded networks queue size follows degrees of freedom On queueing in coded networks queue size follows degrees of freedom Jay Kumar Sundararajan, Devavrat Shah, Muriel Médard Laboratory for Information and Decision Systems, Massachusetts Institute of Technology,

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra By: David McQuilling; Jesus Caban Deng Li Jan.,31,006 CS51 Solving Linear Equations u + v = 8 4u + 9v = 1 A x b 4 9 u v = 8 1 Gaussian Elimination Start with the matrix representation

More information

Review of linear algebra

Review of linear algebra Review of linear algebra 1 Vectors and matrices We will just touch very briefly on certain aspects of linear algebra, most of which should be familiar. Recall that we deal with vectors, i.e. elements of

More information

Algebra Exam. Solutions and Grading Guide

Algebra Exam. Solutions and Grading Guide Algebra Exam Solutions and Grading Guide You should use this grading guide to carefully grade your own exam, trying to be as objective as possible about what score the TAs would give your responses. Full

More information

Chapter 2. Matrix Arithmetic. Chapter 2

Chapter 2. Matrix Arithmetic. Chapter 2 Matrix Arithmetic Matrix Addition and Subtraction Addition and subtraction act element-wise on matrices. In order for the addition/subtraction (A B) to be possible, the two matrices A and B must have the

More information

Class President: A Network Approach to Popularity. Due July 18, 2014

Class President: A Network Approach to Popularity. Due July 18, 2014 Class President: A Network Approach to Popularity Due July 8, 24 Instructions. Due Fri, July 8 at :59 PM 2. Work in groups of up to 3 3. Type up the report, and submit as a pdf on D2L 4. Attach the code

More information

Additional Constructions to Solve the Generalized Russian Cards Problem using Combinatorial Designs

Additional Constructions to Solve the Generalized Russian Cards Problem using Combinatorial Designs Additional Constructions to Solve the Generalized Russian Cards Problem using Combinatorial Designs Colleen M. Swanson Computer Science & Engineering Division University of Michigan Ann Arbor, MI 48109,

More information

Contents. Preface... xi. Introduction...

Contents. Preface... xi. Introduction... Contents Preface... xi Introduction... xv Chapter 1. Computer Architectures... 1 1.1. Different types of parallelism... 1 1.1.1. Overlap, concurrency and parallelism... 1 1.1.2. Temporal and spatial parallelism

More information

c 2011 Nisha Somnath

c 2011 Nisha Somnath c 2011 Nisha Somnath HIERARCHICAL SUPERVISORY CONTROL OF COMPLEX PETRI NETS BY NISHA SOMNATH THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Aerospace

More information

Modelling and implementation of algorithms in applied mathematics using MPI

Modelling and implementation of algorithms in applied mathematics using MPI Modelling and implementation of algorithms in applied mathematics using MPI Lecture 3: Linear Systems: Simple Iterative Methods and their parallelization, Programming MPI G. Rapin Brazil March 2011 Outline

More information

Counting Clusters on a Grid

Counting Clusters on a Grid Dartmouth College Undergraduate Honors Thesis Counting Clusters on a Grid Author: Jacob Richey Faculty Advisor: Peter Winkler May 23, 2014 1 Acknowledgements There are a number of people who have made

More information

Lecture 5: Web Searching using the SVD

Lecture 5: Web Searching using the SVD Lecture 5: Web Searching using the SVD Information Retrieval Over the last 2 years the number of internet users has grown exponentially with time; see Figure. Trying to extract information from this exponentially

More information

Vector Spaces. 9.1 Opening Remarks. Week Solvable or not solvable, that s the question. View at edx. Consider the picture

Vector Spaces. 9.1 Opening Remarks. Week Solvable or not solvable, that s the question. View at edx. Consider the picture Week9 Vector Spaces 9. Opening Remarks 9.. Solvable or not solvable, that s the question Consider the picture (,) (,) p(χ) = γ + γ χ + γ χ (, ) depicting three points in R and a quadratic polynomial (polynomial

More information

MATH 2030: MATRICES ,, a m1 a m2 a mn If the columns of A are the vectors a 1, a 2,...,a n ; A is represented as A 1. .

MATH 2030: MATRICES ,, a m1 a m2 a mn If the columns of A are the vectors a 1, a 2,...,a n ; A is represented as A 1. . MATH 030: MATRICES Matrix Operations We have seen how matrices and the operations on them originated from our study of linear equations In this chapter we study matrices explicitely Definition 01 A matrix

More information

5.1 Banded Storage. u = temperature. The five-point difference operator. uh (x, y + h) 2u h (x, y)+u h (x, y h) uh (x + h, y) 2u h (x, y)+u h (x h, y)

5.1 Banded Storage. u = temperature. The five-point difference operator. uh (x, y + h) 2u h (x, y)+u h (x, y h) uh (x + h, y) 2u h (x, y)+u h (x h, y) 5.1 Banded Storage u = temperature u= u h temperature at gridpoints u h = 1 u= Laplace s equation u= h u = u h = grid size u=1 The five-point difference operator 1 u h =1 uh (x + h, y) 2u h (x, y)+u h

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Power System Analysis Prof. A. K. Sinha Department of Electrical Engineering Indian Institute of Technology, Kharagpur. Lecture - 21 Power Flow VI

Power System Analysis Prof. A. K. Sinha Department of Electrical Engineering Indian Institute of Technology, Kharagpur. Lecture - 21 Power Flow VI Power System Analysis Prof. A. K. Sinha Department of Electrical Engineering Indian Institute of Technology, Kharagpur Lecture - 21 Power Flow VI (Refer Slide Time: 00:57) Welcome to lesson 21. In this

More information

ALGEBRA AND GEOMETRY. Cambridge University Press Algebra and Geometry Alan F. Beardon Frontmatter More information

ALGEBRA AND GEOMETRY. Cambridge University Press Algebra and Geometry Alan F. Beardon Frontmatter More information ALGEBRA AND GEOMETRY This text gives a basic introduction and a unified approach to algebra and geometry. It covers the ideas of complex numbers, scalar and vector products, determinants, linear algebra,

More information

Cyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions

Cyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions Cyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions Edgar Solomonik 1, Devin Matthews 3, Jeff Hammond 4, James Demmel 1,2 1 Department of

More information

26 Group Theory Basics

26 Group Theory Basics 26 Group Theory Basics 1. Reference: Group Theory and Quantum Mechanics by Michael Tinkham. 2. We said earlier that we will go looking for the set of operators that commute with the molecular Hamiltonian.

More information

STUDY OF PERMUTATION MATRICES BASED LDPC CODE CONSTRUCTION

STUDY OF PERMUTATION MATRICES BASED LDPC CODE CONSTRUCTION EE229B PROJECT REPORT STUDY OF PERMUTATION MATRICES BASED LDPC CODE CONSTRUCTION Zhengya Zhang SID: 16827455 zyzhang@eecs.berkeley.edu 1 MOTIVATION Permutation matrices refer to the square matrices with

More information

Gershgorin s Circle Theorem for Estimating the Eigenvalues of a Matrix with Known Error Bounds

Gershgorin s Circle Theorem for Estimating the Eigenvalues of a Matrix with Known Error Bounds Gershgorin s Circle Theorem for Estimating the Eigenvalues of a Matrix with Known Error Bounds Author: David Marquis Advisors: Professor Hans De Moor Dr. Kathryn Porter Reader: Dr. Michael Nathanson May

More information

13 Searching the Web with the SVD

13 Searching the Web with the SVD 13 Searching the Web with the SVD 13.1 Information retrieval Over the last 20 years the number of internet users has grown exponentially with time; see Figure 1. Trying to extract information from this

More information

Fortran program + Partial data layout specifications Data Layout Assistant.. regular problems. dynamic remapping allowed Invoked only a few times Not part of the compiler Can use expensive techniques HPF

More information

PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM

PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM Proceedings of ALGORITMY 25 pp. 22 211 PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM GABRIEL OKŠA AND MARIÁN VAJTERŠIC Abstract. One way, how to speed up the computation of the singular value

More information

2 - Strings and Binomial Coefficients

2 - Strings and Binomial Coefficients November 14, 2017 2 - Strings and Binomial Coefficients William T. Trotter trotter@math.gatech.edu Basic Definition Let n be a positive integer and let [n] = {1, 2,, n}. A sequence of length n such as

More information

Core Connections Algebra 2 Checkpoint Materials

Core Connections Algebra 2 Checkpoint Materials Core Connections Algebra 2 Note to Students (and their Teachers) Students master different skills at different speeds. No two students learn eactly the same way at the same time. At some point you will

More information

MATH 433 Applied Algebra Lecture 22: Review for Exam 2.

MATH 433 Applied Algebra Lecture 22: Review for Exam 2. MATH 433 Applied Algebra Lecture 22: Review for Exam 2. Topics for Exam 2 Permutations Cycles, transpositions Cycle decomposition of a permutation Order of a permutation Sign of a permutation Symmetric

More information

In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required.

In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required. In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required. In humans, association is known to be a prominent feature of memory.

More information

An Integrative Model for Parallelism

An Integrative Model for Parallelism An Integrative Model for Parallelism Victor Eijkhout ICERM workshop 2012/01/09 Introduction Formal part Examples Extension to other memory models Conclusion tw-12-exascale 2012/01/09 2 Introduction tw-12-exascale

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Experimental designs for multiple responses with different models

Experimental designs for multiple responses with different models Graduate Theses and Dissertations Graduate College 2015 Experimental designs for multiple responses with different models Wilmina Mary Marget Iowa State University Follow this and additional works at:

More information

Parallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco

Parallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Parallel programming using MPI Analysis and optimization Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Outline l Parallel programming: Basic definitions l Choosing right algorithms: Optimal serial and

More information

SUPERDENSE CODING AND QUANTUM TELEPORTATION

SUPERDENSE CODING AND QUANTUM TELEPORTATION SUPERDENSE CODING AND QUANTUM TELEPORTATION YAQIAO LI This note tries to rephrase mathematically superdense coding and quantum teleportation explained in [] Section.3 and.3.7, respectively (as if I understood

More information