Transposition Mechanism for Sparse Matrices on Vector Processors
Pyrrhos Stathis, Stamatis Vassiliadis, Sorin Cotofana
Electrical Engineering Department, Delft University of Technology, Delft, The Netherlands

Abstract— Many scientific applications involve operations on sparse matrices. However, due to irregularities induced by the sparsity patterns, many operations on sparse matrices execute inefficiently on traditional scalar and vector architectures. To tackle this problem a scheme has been proposed consisting of two parts: (a) an extension to a vector architecture to support sparse matrix-vector multiplication using (b) a novel Blocked Based sparse matrix Compression Storage (BBCS) format. Within this context, in this paper we propose and describe a hardware mechanism for the extended vector architecture that performs the transposition A^T of a sparse matrix A using a hierarchical variation of the aforementioned sparse matrix compression format. The proposed Sparse matrix Transposition Mechanism (STM) is used as a Functional Unit for a vector processor and requires an s×s-word in-processor memory, where s is the vector processor's section size. In this paper we provide a full description of the STM and show an expected performance increase of one order of magnitude.

Keywords— Vector processor, matrix transpose, sparse matrix, functional unit

I. INTRODUCTION

In many scientific computing areas the manipulation of sparse matrices constitutes the kernel of the solvers. The irregularities of the matrix sparsity patterns, i.e. the distribution of the non-zeros within the matrix, make many operations on sparse matrices execute inefficiently on traditional scalar and vector architectures. This problem has been tackled by both software and hardware approaches. Most of the approaches are in software [2], [3], because they are less costly. However, research focused on hardware approaches [4], [5], [6], [7] indicates that much greater improvements can be obtained.
In [1] the authors report a substantial speedup (depending on the sparsity pattern) using an Augmented Vector Architecture (AVA) and an associated sparse matrix storage scheme (BBCS) when performing sparse matrix-vector multiplication, compared to the JD (Jagged Diagonal) method on a conventional vector processor. The sparse matrix related problem that we address here is that of matrix transposition, i.e. the construction of the transpose A^T from a sparse matrix A on a vector processor. It is not possible to perform the transposition of a sparse matrix using the instruction set of a traditional vector processor. Therefore, in this paper we propose a mechanism to enable the transposition of a sparse matrix within the context of the aforementioned AVA. The contributions of this paper can be summarized as follows: We propose and describe a novel mechanism, the Sparse matrix Transposition Mechanism (STM), implemented as a functional unit for a vector processor, that can perform the transposition of a sparse matrix stored in a hierarchical sparse matrix storage format. We evaluate the timing properties of the STM and show an expected performance increase of one order of magnitude when compared to a scalar implementation of sparse matrix transposition. The remainder of the paper is organized as follows: In the next Section we provide some background information on transposition, vector processors and the hierarchical sparse matrix storage format. In Section III we describe and evaluate the proposed mechanism and finally, in Section IV we draw some conclusions.

II. BACKGROUND

This section provides some background information and assumptions made throughout the paper. The transposition of an M×N matrix A is the calculation of the N×M matrix A^T. The operation consists of the exchange of the rows and columns of the matrix. Thus, essentially it is an operation that does not alter the values of the elements but only their positions.
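The exchange-of-positions view above can be stated directly; the following minimal sketch (not from the paper) shows that transposition moves the element at (i, j) to (j, i) without changing its value:

```python
# Minimal illustration of the definition above: the M x N matrix A
# becomes the N x M matrix A^T, with A^T[j][i] == A[i][j].
def transpose(a):
    """a: M x N matrix as a list of rows; returns the N x M transpose."""
    m, n = len(a), len(a[0])
    return [[a[i][j] for i in range(m)] for j in range(n)]

a = [[1, 2, 3],
     [4, 5, 6]]
t = transpose(a)
assert t == [[1, 4], [2, 5], [3, 6]]
# Values are untouched; only positions are exchanged:
assert all(a[i][j] == t[j][i] for i in range(2) for j in range(3))
```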
For a dense matrix the problem is trivial: a row-wise stored matrix can simply be addressed with a column-wise stride (or vice versa). Sparse matrices, however, are usually stored in a more complex way that involves storing the non-zero values together with their positional information [2], [3]. This results in the need to use costly sorting algorithms in order to perform the transposition.
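The sorting cost can be made concrete with a sketch (not the paper's storage format): with a CSR-like scheme that keeps, per row, the column indices and values of the non-zeros, transposition amounts to re-sorting every entry by its column index — the irregular O(nnz·log nnz) step that motivates the STM.

```python
# Hypothetical CSR-style transpose via sorting; row_ptr/col_idx/vals are
# the usual compressed-row arrays, n_cols the matrix width.
def csr_transpose(row_ptr, col_idx, vals, n_cols):
    """Return (row_ptr, col_idx, vals) of the transposed matrix."""
    entries = []
    for r in range(len(row_ptr) - 1):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            entries.append((col_idx[k], r, vals[k]))   # exchange (row, col)
    entries.sort()                                     # the costly, irregular step
    t_ptr = [0] * (n_cols + 1)
    t_col, t_val = [], []
    for c, r, v in entries:
        t_ptr[c + 1] += 1
        t_col.append(r)
        t_val.append(v)
    for c in range(n_cols):                            # prefix-sum the row counts
        t_ptr[c + 1] += t_ptr[c]
    return t_ptr, t_col, t_val

# 2x3 matrix [[5, 0, 1], [0, 2, 0]] in CSR form:
ptr, col, val = csr_transpose([0, 2, 3], [0, 2, 1], [5, 1, 2], 3)
assert (ptr, col, val) == ([0, 1, 2, 3], [0, 1, 0], [5, 2, 1])
```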
With the proposed mechanism we attempt to streamline this operation and make it suitable for a vector processor. The proposed transposition mechanism functions as a Functional Unit of a vector processor. Vector processors, such as the one depicted in Figure 1, are based on architectures that support the execution of vector instructions. On most current vector architectures [8], the vectors are copied from the main memory into vector registers within the processor before they are operated upon. Vector registers are arrays of scalar registers that hold (parts of) the vectors to be processed. Because the vector register length cannot be arbitrarily large, large vectors have to be divided into smaller parts, a technique usually called strip mining; each part can be no larger than the maximum number of elements a vector register can hold, i.e. the architecturally defined section size s of the VP. In a VP the operations are carried out by (usually) pipelined Functional Units (FUs) that are able to fetch one or more new elements per cycle from each of the source vector register(s) involved, operate on them, and return the result(s) to the result (vector) register.

Fig. 1. Vector Architecture (main memory; load/store unit; vector unit with vector controller, vector register file and functional units 1..N; scalar unit with scalar controller, cache, scalar registers and scalar pipeline)

Before proceeding with the description of the STM functional unit we first give a brief description of the hierarchical storage format (HiSM), the sparse matrix format that we will assume for the remainder of the paper and which is a hierarchical variation of the aforementioned BBCS format: To obtain the HiSM representation, an M×N sparse matrix is partitioned into ⌈M/s⌉ × ⌈N/s⌉ square s×s sub-matrices, where s is the Section Size of the vector architecture.
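The strip-mining technique mentioned above can be sketched as follows (an illustration, not from the paper; the section size S = 8 is an assumed value):

```python
# Strip mining: a long vector operation is processed s elements at a
# time, one vector-register-sized strip per "vector instruction".
S = 8  # assumed section size for illustration

def vector_add(x, y):
    """Element-wise add of two long vectors, at most S elements per pass."""
    result = []
    for start in range(0, len(x), S):           # one strip per iteration
        xs = x[start:start + S]                 # load strip into a vector register
        ys = y[start:start + S]
        result.extend(a + b for a, b in zip(xs, ys))  # one vector add
    return result

assert vector_add(list(range(20)), [1] * 20) == list(range(1, 21))
```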
Each of these sub-matrices, which we will call s²-blocks, is then stored separately in memory in the following way: all the non-zero values, together with their positional information, are stored in a row-wise fashion in an array (the s²-blockarray) in memory. In Figure 2 (bottom left) we can observe how such a blockarray is formed, containing both the position and value data of the top-left s²-block of a sparse matrix; the section size in the example is 8. Note that the positional data consists of only the column position of each non-zero element within the sub-matrix plus an extra bit that indicates whether the non-zero element is the last element in its row. This bit is incorporated in the position data. We will not elaborate further on the exact bit-by-bit configuration of the s²-blockarray. The s²-blockarrays can contain up to s² non-zero elements and we will assume that an AVA can operate on them in the same way as described in [1]. These s²-blockarrays, which describe the non-empty s²-blocks, form the lowest (zero) level of the hierarchical structure of our format. As can be observed in Figure 2, the non-empty s²-blocks form a similar sparsity pattern as the non-zero values within an s²-block. Therefore, the next level of the hierarchy, level-1, is formed in exactly the same way as level zero, with the difference that the values of non-zero elements are replaced by pointers to the s²-blockarrays in memory that describe non-empty s²-blocks. This new array, which contains the pointers to the lower level, is stored in exactly the same fashion in memory (see Figure 2, bottom right). Notice that at level-1 the pointers are stored in a column-wise fashion. In this way an access pattern is provided where the s-element-wide columns are accessed row-wise. This is favorable for operations such as matrix-vector multiplication (refer to [1] for a more elaborate discussion).
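The level-0 storage just described can be sketched functionally as below. The exact bit layout is deliberately left unspecified by the paper, so a tuple (column, end-of-row bit, value) stands in for each packed entry; note also that, as the paper's footnote observes, fully empty rows need extra handling that is not shown here.

```python
# Sketch of building an s2-blockarray from a dense s x s block: for each
# non-zero, store its column position, an end-of-row flag (1 for the last
# non-zero of its row), and its value, scanning the block row-wise.
def block_to_blockarray(block):
    """block: s x s list of lists; returns [(col, end_of_row, value), ...]."""
    arr = []
    for row in block:
        nz = [(j, v) for j, v in enumerate(row) if v != 0]
        for k, (j, v) in enumerate(nz):
            arr.append((j, 1 if k == len(nz) - 1 else 0, v))
    return arr

blk = [[5, 0, 0, 0],
       [0, 0, 7, 2],
       [0, 0, 0, 0],   # empty row: needs the special handling of the footnote
       [1, 0, 0, 0]]
assert block_to_blockarray(blk) == [(0, 1, 5), (2, 0, 7), (3, 1, 2), (0, 1, 1)]
```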
The next level, level-2, if there is one (in the example of Figure 2 there is none), is formed in the same way as level-1, with the pointers pointing at the s²-blockarrays of level-1. Further, as in any hierarchical structure, the higher levels are formed in the same way, and we proceed until we have covered the entire matrix in ⌈max(log_s M, log_s N)⌉ levels. We can summarize the description of the Hierarchical Sparse Matrix storage format as follows: The entire matrix is divided hierarchically into blocks of size s×s (called s²-blocks), with the lowest level containing the actual values of the non-zero elements and the higher levels containing pointers to the non-empty s²-blocks of one level lower. The s²-blocks at all levels are represented as an array (called an s²-blockarray) whose entries are non-zero values (for level-0) or pointers to non-empty lower-level s²-blockarrays (for all higher levels), along with their corresponding positional information within the block.¹ The formats are identical for all levels.

¹The careful reader will notice that when there are empty rows within the s²-block this format will not suffice. We have incorporated this detail in our format in the same way as in [1]; however, as it is of no further consequence to what will be discussed in the remainder of the paper, we omit a detailed description for simplicity.

Fig. 2. Example of the Hierarchical Sparse Matrix Storage Format (legend: non-zero element; end of row; end of column; pointer to an 8×8 submatrix one level lower in the matrix hierarchy; the level-0 s²-blockarray holds positional data and value data, the level-1 s²-blockarray holds positional data and pointer data)

III. THE TRANSPOSITION MECHANISM

As mentioned previously, the proposed Sparse matrix Transposition Mechanism (STM) is implemented as a functional unit of a vector processor. The STM for a section size s is depicted in Figure 3. The main part of the unit consists of the s×s-memory, which is used to store one s²-block of the hierarchically stored matrix. The mechanism can transpose one s²-block at a time. The procedure is as follows: First, the s²-block is stored in the s×s-memory one section at a time. When the complete s²-block is stored, it is read back from the s×s-memory in the fashion transposed to the one used for storing, i.e. row-wise if stored column-wise and vice versa. More specifically: assume that a part of an s²-block is stored in a vector register R. The contents of register R can be stored in the s×s-memory via the column-wise I/O-buffer. The depth b of this buffer defines how many elements can maximally be stored per clock cycle. We will call this the s×s-memory bandwidth, which in the case of Figure 3 is 4. At each cycle the I/O-buffer is filled with non-zero elements of the same column along with their corresponding row positions.

Fig. 3. The Sparse matrix Transposition Mechanism (STM): vector register file; row-wise and column-wise I/O buffers; row-buffer and column-buffer with non-zero indicators ("1" when non-zero, "0" when zero); Non-zero Locators; storage cells of the s×s memory

In the next cycle, the row position is used by the Non-zero Locator unit to store the non-zero values at the correct row positions in the column-buffer. The non-zero indicators at the corresponding cells of the buffer are then set accordingly to indicate a non-zero or a zero value. This process is repeated until there are no more non-zero elements for the current column. Subsequently, the entire column-buffer is copied into the s×s-memory using the column position information (not shown in Figure 3). We can then read the transpose of the s²-block by reversing the order used for storing, at the row-wise section of the STM. Column by column, the s²-block is moved into the row-buffer. There, using the Non-zero Locator, the non-zero values and their column positions are copied into the I/O-buffer (maximally b at a time) and then stored into a register in the register file. When reading the s×s-memory, however, the working of the Non-zero Locator is not trivial, so we describe it in further detail. The Non-zero Locator is graphically depicted in Figure 4. The function of this circuit is to extract from a string of input bits (the non-zero indicators) the positions of the first b ones. When there are more than b non-zero elements, the located non-zeros are set to zero (not depicted in Figure 4) and the process is repeated in order to locate the following non-zero elements. When there are fewer than b non-zero elements, one or more of the counters will produce an overflow.
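Functionally, the Non-zero Locator can be sketched as below (a software model, not the circuit itself; the bandwidth B = 4 is the assumed I/O-buffer depth from the figure). A final group shorter than b models the case where the counters overflow and the control logic fetches the next row or column.

```python
# Software model of the Non-zero Locator: from a vector of non-zero
# indicator bits, emit the positions of the first b ones, then repeat
# until the current row/column buffer is drained.
B = 4  # assumed s x s-memory bandwidth (depth of the I/O-buffer)

def locate_nonzeros(indicators, b=B):
    """Yield successive groups of at most b positions whose indicator is 1."""
    positions = [i for i, bit in enumerate(indicators) if bit]
    for start in range(0, len(positions), b):
        yield positions[start:start + b]

# 8-bit indicator string with five ones: two passes are needed.
groups = list(locate_nonzeros([1, 0, 0, 1, 1, 0, 1, 1]))
assert groups == [[0, 3, 4, 6], [7]]
```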
This overflow indicates to the control logic that a new row or column needs to be fetched from the s×s-memory.

Fig. 4. The Non-zero Locator

As we have mentioned, the STM can only transpose one s²-block at a time. However, because of the similar structure of the HiSM at all hierarchy levels, we can apply the same transposition mechanism at all levels in order to achieve the transposition of the entire matrix. Figure 5 graphically illustrates this principle. Observe that when the matrix is transposed, every s²-block is also transposed. Additionally, one level higher, at level-1, the positions of the non-empty blocks (depicted darker in the figure) are also transposed. This means that if we can transpose an s²-blockarray at level-0, we can apply the same algorithm to the s²-blockarrays at all levels to transpose the entire matrix.

Fig. 5. Matrix Transposition

A. Timing Evaluation

In this section we provide performance estimations of the proposed mechanism. Because the s×s-memory has to be filled before it can be read back, the STM unit cannot be fully pipelined. However, the write and read phases can each be pipelined in three stages. This means that 3 cycles are required for the last elements to enter the s×s-memory and, similarly, 3 cycles are needed for the last results to be returned to the vector register. This results in a functional unit that has a latency of 2·(n_z/T) + 6, where n_z is the number of non-zero elements in the s²-block and T is the throughput of the I/O-buffer, i.e. the average number of elements in the I/O-buffer per cycle. The throughput T varies from 1 to b, where b is the previously mentioned STM bandwidth that is equal to the depth of the I/O-buffer. The precise value of T, and thus the performance of the STM, depends on the sparsity pattern of the matrix to be transposed. Therefore we provide only the worst and best case scenarios to evaluate the performance:

Best case (T = b): 2·(n_z/b) + 6
Worst case (T = 1): 2·n_z + 6

To perform the same operation on a scalar machine we would need a sorting loop of on average n_z·log(n_z) iterations. The operations within this loop are highly dependent and unpredictable, and therefore no advantage can be expected from ILP techniques such as pipelining, dependence checking and branch prediction. This results in a sustained execution time of several cycles per iteration. Compared to our scheme we can expect an order of magnitude of improvement.

IV. CONCLUSIONS

In this paper we have proposed and described a novel mechanism, the Sparse matrix Transposition Mechanism (STM), implemented as a functional unit for a vector processor, that can perform the transposition of a sparse matrix stored in a hierarchical sparse matrix storage format. We have evaluated the timing properties of the STM and showed an expected performance increase of one order of magnitude when compared to a scalar implementation of sparse matrix transposition.

REFERENCES

[1] S. Vassiliadis, S.
Cotofana, and P. Stathis, "Vector ISA extension for sparse matrix multiplication," in Euro-Par '99 Parallel Processing, 1999, Lecture Notes in Computer Science No. 1685, pp. 708-715, Springer-Verlag.
[2] Victor Eijkhout, "LAPACK working note 50: Distributed sparse data structures for linear algebra operations," Tech. Rep., Department of Computer Science, University of Tennessee, Sept. 1992.
[3] Yousef Saad, "SPARSKIT: A basic tool kit for sparse matrix computations," Tech. Rep., Computer Science Department, University of Minnesota, Minneapolis, MN 55455, June 1994, Version 2.
[4] Hideharu Amano, Taisuke Boku, Tomohiro Kudoh, and Hideo Aiso, "(SM)²-II: A new version of the sparse matrix solving machine," in Proceedings of the 12th Annual International Symposium on Computer Architecture, Boston, Massachusetts, June 1985, IEEE Computer Society TCA and ACM SIGARCH.
[5] Valerie E. Taylor, Abhiram Ranade, and David G. Messerschmitt, "SPAR: A New Architecture for Large Finite Element Computations," IEEE Transactions on Computers, vol. 44, no. 4, April 1995.
[6] Pyrrhos Stathis, Stamatis Vassiliadis, and Sorin Cotofana, "Sparse matrix vector multiplication evaluation using the BBCS scheme," to appear in 8th Panhellenic Conference on Informatics (PCI), Nov. 2001.
[7] A. Wolfe, M. Breternitz, Jr., C. Stephens, A. L. Ting, D. B. Kirk, R. P. Bianchini, Jr., and J. P. Shen, "The White Dwarf: A high-performance application-specific processor," in Proceedings of the 15th Annual International Symposium on Computer Architecture, H. J. Siegel, Ed., Honolulu, Hawaii, May-June 1988, IEEE Computer Society Press.
[8] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Mateo, California, 1990.
Author manuscript, published in "2nd Multidisciplinary International Conference on Scheduling : Theory and Applications (MISTA 2005), New York, NY. : United States (2005)" 2 More formally, we denote by
More informationDesigning Information Devices and Systems I Spring 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way
EECS 16A Designing Information Devices and Systems I Spring 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate
More informationChapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1>
Chapter 5 Digital Design and Computer Architecture, 2 nd Edition David Money Harris and Sarah L. Harris Chapter 5 Chapter 5 :: Topics Introduction Arithmetic Circuits umber Systems Sequential Building
More informationECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University
ECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University Prof. Mi Lu TA: Ehsan Rohani Laboratory Exercise #4 MIPS Assembly and Simulation
More information5.1 Banded Storage. u = temperature. The five-point difference operator. uh (x, y + h) 2u h (x, y)+u h (x, y h) uh (x + h, y) 2u h (x, y)+u h (x h, y)
5.1 Banded Storage u = temperature u= u h temperature at gridpoints u h = 1 u= Laplace s equation u= h u = u h = grid size u=1 The five-point difference operator 1 u h =1 uh (x + h, y) 2u h (x, y)+u h
More informationHalting and Equivalence of Program Schemes in Models of Arbitrary Theories
Halting and Equivalence of Program Schemes in Models of Arbitrary Theories Dexter Kozen Cornell University, Ithaca, New York 14853-7501, USA, kozen@cs.cornell.edu, http://www.cs.cornell.edu/~kozen In Honor
More informationDigital Logic: Boolean Algebra and Gates. Textbook Chapter 3
Digital Logic: Boolean Algebra and Gates Textbook Chapter 3 Basic Logic Gates XOR CMPE12 Summer 2009 02-2 Truth Table The most basic representation of a logic function Lists the output for all possible
More informationarxiv: v1 [cs.sc] 17 Apr 2013
EFFICIENT CALCULATION OF DETERMINANTS OF SYMBOLIC MATRICES WITH MANY VARIABLES TANYA KHOVANOVA 1 AND ZIV SCULLY 2 arxiv:1304.4691v1 [cs.sc] 17 Apr 2013 Abstract. Efficient matrix determinant calculations
More informationIntroducing a Bioinformatics Similarity Search Solution
Introducing a Bioinformatics Similarity Search Solution 1 Page About the APU 3 The APU as a Driver of Similarity Search 3 Similarity Search in Bioinformatics 3 POC: GSI Joins Forces with the Weizmann Institute
More informationB œ c " " ã B œ c 8 8. such that substituting these values for the B 3 's will make all the equations true
System of Linear Equations variables Ð unknowns Ñ B" ß B# ß ÞÞÞ ß B8 Æ Æ Æ + B + B ÞÞÞ + B œ, "" " "# # "8 8 " + B + B ÞÞÞ + B œ, #" " ## # #8 8 # ã + B + B ÞÞÞ + B œ, 3" " 3# # 38 8 3 ã + 7" B" + 7# B#
More informationCache-Oblivious Algorithms
Cache-Oblivious Algorithms 1 Cache-Oblivious Model 2 The Unknown Machine Algorithm C program gcc Object code linux Execution Can be executed on machines with a specific class of CPUs Algorithm Java program
More informationComputer Science. Questions for discussion Part II. Computer Science COMPUTER SCIENCE. Section 4.2.
COMPUTER SCIENCE S E D G E W I C K / W A Y N E PA R T I I : A L G O R I T H M S, T H E O R Y, A N D M A C H I N E S Computer Science Computer Science An Interdisciplinary Approach Section 4.2 ROBERT SEDGEWICK
More informationDigital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.
Digital Integrated Circuits A Design Perspective Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic Arithmetic Circuits January, 2003 1 A Generic Digital Processor MEMORY INPUT-OUTPUT CONTROL DATAPATH
More informationCMP N 301 Computer Architecture. Appendix C
CMP N 301 Computer Architecture Appendix C Outline Introduction Pipelining Hazards Pipelining Implementation Exception Handling Advanced Issues (Dynamic Scheduling, Out of order Issue, Superscalar, etc)
More informationOutline. policies for the first part. with some potential answers... MCS 260 Lecture 10.0 Introduction to Computer Science Jan Verschelde, 9 July 2014
Outline 1 midterm exam on Friday 11 July 2014 policies for the first part 2 questions with some potential answers... MCS 260 Lecture 10.0 Introduction to Computer Science Jan Verschelde, 9 July 2014 Intro
More informationCache-Oblivious Algorithms
Cache-Oblivious Algorithms 1 Cache-Oblivious Model 2 The Unknown Machine Algorithm C program gcc Object code linux Execution Can be executed on machines with a specific class of CPUs Algorithm Java program
More informationHybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC
Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,
More informationLet s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc.
Finite State Machines Introduction Let s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc. Such devices form
More informationFinite-choice algorithm optimization in Conjugate Gradients
Finite-choice algorithm optimization in Conjugate Gradients Jack Dongarra and Victor Eijkhout January 2003 Abstract We present computational aspects of mathematically equivalent implementations of the
More informationIntroduction to Matrices and Linear Systems Ch. 3
Introduction to Matrices and Linear Systems Ch. 3 Doreen De Leon Department of Mathematics, California State University, Fresno June, 5 Basic Matrix Concepts and Operations Section 3.4. Basic Matrix Concepts
More informationEECS150 - Digital Design Lecture 21 - Design Blocks
EECS150 - Digital Design Lecture 21 - Design Blocks April 3, 2012 John Wawrzynek Spring 2012 EECS150 - Lec21-db3 Page 1 Fixed Shifters / Rotators fixed shifters hardwire the shift amount into the circuit.
More informationImplementation Of Digital Fir Filter Using Improved Table Look Up Scheme For Residue Number System
Implementation Of Digital Fir Filter Using Improved Table Look Up Scheme For Residue Number System G.Suresh, G.Indira Devi, P.Pavankumar Abstract The use of the improved table look up Residue Number System
More informationParallel Sparse Tensor Decompositions using HiCOO Format
Figure sources: A brief survey of tensors by Berton Earnshaw and NVIDIA Tensor Cores Parallel Sparse Tensor Decompositions using HiCOO Format Jiajia Li, Jee Choi, Richard Vuduc May 8, 8 @ SIAM ALA 8 Outline
More informationSome long-period random number generators using shifts and xors
ANZIAM J. 48 (CTAC2006) pp.c188 C202, 2007 C188 Some long-period random number generators using shifts and xors Richard P. Brent 1 (Received 6 July 2006; revised 2 July 2007) Abstract Marsaglia recently
More informationDependence Analysis. Dependence Examples. Last Time: Brief introduction to interprocedural analysis. do I = 2, 100 A(I) = A(I-1) + 1 enddo
Dependence Analysis Dependence Examples Last Time: Brief introduction to interprocedural analysis Today: Optimization for parallel machines and memory hierarchies Dependence analysis Loop transformations
More informationConsider the following example of a linear system:
LINEAR SYSTEMS Consider the following example of a linear system: Its unique solution is x + 2x 2 + 3x 3 = 5 x + x 3 = 3 3x + x 2 + 3x 3 = 3 x =, x 2 = 0, x 3 = 2 In general we want to solve n equations
More informationDesign at the Register Transfer Level
Week-7 Design at the Register Transfer Level Algorithmic State Machines Algorithmic State Machine (ASM) q Our design methodologies do not scale well to real-world problems. q 232 - Logic Design / Algorithmic
More informationLecture 11. Advanced Dividers
Lecture 11 Advanced Dividers Required Reading Behrooz Parhami, Computer Arithmetic: Algorithms and Hardware Design Chapter 15 Variation in Dividers 15.3, Combinational and Array Dividers Chapter 16, Division
More information4th year Project demo presentation
4th year Project demo presentation Colm Ó héigeartaigh CASE4-99387212 coheig-case4@computing.dcu.ie 4th year Project demo presentation p. 1/23 Table of Contents An Introduction to Quantum Computing The
More informationDistributed Data Storage with Minimum Storage Regenerating Codes - Exact and Functional Repair are Asymptotically Equally Efficient
Distributed Data Storage with Minimum Storage Regenerating Codes - Exact and Functional Repair are Asymptotically Equally Efficient Viveck R Cadambe, Syed A Jafar, Hamed Maleki Electrical Engineering and
More informationECE 3401 Lecture 23. Pipeline Design. State Table for 2-Cycle Instructions. Control Unit. ISA: Instruction Specifications (for reference)
ECE 3401 Lecture 23 Pipeline Design Control State Register Combinational Control Logic New/ Modified Control Word ISA: Instruction Specifications (for reference) P C P C + 1 I N F I R M [ P C ] E X 0 PC
More informationPerformance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu
Performance Metrics for Computer Systems CASS 2018 Lavanya Ramapantulu Eight Great Ideas in Computer Architecture Design for Moore s Law Use abstraction to simplify design Make the common case fast Performance
More information211: Computer Architecture Summer 2016
211: Computer Architecture Summer 2016 Liu Liu Topic: Storage Project3 Digital Logic - Storage: Recap - Review: cache hit rate - Project3 - Digital Logic: - truth table => SOP - simplification: Boolean
More informationFPGA Implementation of a Predictive Controller
FPGA Implementation of a Predictive Controller SIAM Conference on Optimization 2011, Darmstadt, Germany Minisymposium on embedded optimization Juan L. Jerez, George A. Constantinides and Eric C. Kerrigan
More informationALU A functional unit
ALU A functional unit that performs arithmetic operations such as ADD, SUB, MPY logical operations such as AND, OR, XOR, NOT on given data types: 8-,16-,32-, or 64-bit values A n-1 A n-2... A 1 A 0 B n-1
More informationParallel Sparse Matrix Vector Multiplication (PSC 4.3)
Parallel Sparse Matrix Vector Multiplication (PSC 4.) original slides by Rob Bisseling, Universiteit Utrecht, accompanying the textbook Parallel Scientific Computation adapted for the lecture HPC Algorithms
More informationDesigning Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way
EECS 16A Designing Information Devices and Systems I Fall 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate it
More informationChapter 2. Divide-and-conquer. 2.1 Strassen s algorithm
Chapter 2 Divide-and-conquer This chapter revisits the divide-and-conquer paradigms and explains how to solve recurrences, in particular, with the use of the master theorem. We first illustrate the concept
More informationWord-length Optimization and Error Analysis of a Multivariate Gaussian Random Number Generator
Word-length Optimization and Error Analysis of a Multivariate Gaussian Random Number Generator Chalermpol Saiprasert, Christos-Savvas Bouganis and George A. Constantinides Department of Electrical & Electronic
More informationRandomized Simultaneous Messages: Solution of a Problem of Yao in Communication Complexity
Randomized Simultaneous Messages: Solution of a Problem of Yao in Communication Complexity László Babai Peter G. Kimmel Department of Computer Science The University of Chicago 1100 East 58th Street Chicago,
More informationECE/CS 250 Computer Architecture
ECE/CS 250 Computer Architecture Basics of Logic Design: Boolean Algebra, Logic Gates (Combinational Logic) Tyler Bletsch Duke University Slides are derived from work by Daniel J. Sorin (Duke), Alvy Lebeck
More informationMATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 1 x 2. x n 8 (4) 3 4 2
MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS SYSTEMS OF EQUATIONS AND MATRICES Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a
More information