Symmetric rank-2k update on GPUs and/or multi-cores
Assignment 2, 5DV050, Spring 2012
Due on May 4 (soft) or May 11 (hard) at 16.00
Version 1.0

1 Background and motivation

Quoting from Beresford N. Parlett's classic book The Symmetric Eigenvalue Problem:

    Vibrations are everywhere [...] and so too are the eigenvalues (or frequencies) associated with them. The concert-goer unconsciously analyzes the quivering of her eardrum, the spectroscopist identifies the constituents of a gas by looking at eigenvalues, and in California the State Building Department requires that the natural frequencies of many buildings should lie outside the earthquake band. Indeed, as mathematical models invade more and more disciplines we can anticipate a demand for eigenvalue calculations in an ever richer variety of contexts.
        Beresford N. Parlett

Eigenvalues are found everywhere, and they can reveal important properties such as the resonance frequencies of a physical system. They can be used to find commonalities in a set of data, e.g., images or documents, which enables effective lossy data compression or feature extraction. The applications of eigenvalues are numerous and span most disciplines.

Regardless of the interpretation of the eigenvalues in a specific application, many eigenvalue problems eventually boil down to the fundamental problem of determining the eigenvalues and the associated eigenvectors of a square (real or complex) matrix A. Sometimes all the eigenvalues and eigenvectors are needed, but often only a subset of them is of interest.

Actually computing eigenvalues λ and eigenvectors x, hereafter referred to as eigenpairs {λ, x}, is, however, not a simple task. Using finite precision arithmetic, round-off errors might propagate and accumulate. If the problem itself is very sensitive to perturbations (the technical term is ill-conditioned) and/or if the numerical algorithm is not stable, in the sense that small perturbations due to, e.g., round-off errors accumulate too much, then the computed eigenpairs might have little to do with the actual eigenpairs. Some of the primary goals of the research into algorithms for eigenvalue computations are the following:

1. Construct stable and efficient algorithms for matrices having various combinations of properties, e.g., real, complex, symmetric, etc.

2. Analyze the strengths and limitations of the algorithms in the presence of round-off errors.

3. Implement the algorithms on various parallel architectures and maximize the performance without sacrificing the numerical stability of the algorithms.

In the present assignment, we will look into the third goal and get a glimpse of the computational and parallel aspects of eigenvalue computations. We hope you find it interesting!

Sections 1 to 4 give an introduction to symmetric eigenvalue problems and finally define the operation that you will implement: the symmetric rank-2k update. Sections 5 and 6 explain the operation in more detail and present a reference implementation as well as the flop count for the operation. In Section 7, a recursive blocked algorithm is presented, and finally the assignment is explained in more detail in Section 8.

In principle, you can skip everything up to (but not including) Section 5 and still complete the assignment without problems. However, the first sections provide important context that puts the operation in some perspective and partially explains its relevance in the greater scheme of things.

We assume some prior experience with linear algebra, in particular the definitions of matrix multiplication and matrix transpose. This assignment does not, however, require advanced knowledge of linear algebra, and everything you need can be learned rather quickly. Please contact us if you have trouble understanding the assignment specification and we will help you as best we can.

1.1 Real symmetric matrices

An important class of matrices are the real and symmetric ones. A symmetric matrix A of size n-by-n has the property that the entry in the i-th row and j-th column, i.e., [A]_ij or A(i,j) or a_ij depending on how you want to denote it, is equal to the entry in the j-th row and i-th column, i.e., [A]_ji. In geometric terms, the matrix is symmetric with respect to a reflection along the northwest-to-southeast diagonal, the so-called main diagonal of the matrix. Formally, a matrix A is symmetric if and only if

    [A]_ij = [A]_ji    (1)

for all i, j = 1, 2, ..., n. In matrix notation, a matrix is symmetric if and only if

    A^T = A,    (2)

where A^T is the transpose of A defined by its entries

    [A^T]_ij := [A]_ji.    (3)

From a computational point of view, one can often exploit symmetry to save half of the storage and half of the arithmetic operations, since only the upper or lower triangular part of the matrix needs to be represented explicitly. In this assignment, we will look at an algorithm that saves half the operations, but we will ignore the potential for saving memory.
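As a small illustration of how symmetry halves the arithmetic, the C sketch below scales only the upper triangular part of a symmetric matrix; the lower triangle is neither read nor written. It assumes column-major storage with a leading dimension, the layout used by the BLAS interface later in this text, and the function name and parameters are chosen here purely for illustration.

    #include <stddef.h>

    /* Scale the upper triangular part of an n-by-n symmetric matrix C in place:
     * C(i,j) = beta * C(i,j) for all i <= j. Column-major storage is assumed,
     * i.e., entry (i,j) (0-based) is stored at C[i + j*ldc]. The strictly lower
     * triangle is not touched, so only about half of the entries are visited. */
    static void scale_upper(size_t n, float beta, float *C, size_t ldc)
    {
        for (size_t j = 0; j < n; ++j)        /* loop over columns          */
            for (size_t i = 0; i <= j; ++i)   /* only rows 0..j of column j */
                C[i + j * ldc] *= beta;
    }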

1.2 The real symmetric eigenvalue problem

The real symmetric eigenvalue problem consists of finding all eigenvalues λ and the corresponding eigenvectors x of a real symmetric matrix A of size n-by-n. Each of the n eigenpairs {λ, x} of A satisfies a linear equation of the form

    Ax = λx.    (4)

A trivial solution to (4) is x = 0, but this is hardly interesting. Therefore, we require that the eigenvector x be non-zero. The geometric interpretation of this is as follows. The vector x is a vector in n-space represented in some basis. The matrix A represents a linear transformation (e.g., rotation, reflection, scaling, etc.) on this space, and the matrix-vector product Ax applies A to x and produces the image of x under the linear transformation encoded by A. What (4) says is that this image is equal to a simple scaling of x itself, so x and its image are parallel to each other.

For example, one of the eigenpairs of the real symmetric matrix

    A :=  1  2  3
          2  1  2
          3  2  1

is {λ, x} where λ := -2 and

    x :=  -1
           0
           1

While it is difficult to compute the eigenpairs in general, it is straightforward to verify that a given λ and x form an eigenpair. For the example above we get

    Ax =  1  2  3   -1      2        -1
          2  1  2    0  =   0  = -2   0  = λx,
          3  2  1    1     -2         1

which verifies that {λ, x} is indeed an eigenpair.
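The same check can be done numerically. The short C program below is a standalone sketch (not part of the assignment code) that multiplies the example matrix by the example vector and confirms that Ax = λx holds entry by entry.

    #include <math.h>
    #include <stdio.h>

    /* Verify that {lambda, x} = {-2, (-1, 0, 1)^T} is an eigenpair of the
     * 3-by-3 example matrix by checking that every entry of A*x - lambda*x
     * is (numerically) zero. */
    int main(void)
    {
        const double A[3][3] = { { 1, 2, 3 },
                                 { 2, 1, 2 },
                                 { 3, 2, 1 } };
        const double x[3]    = { -1, 0, 1 };
        const double lambda  = -2.0;

        for (int i = 0; i < 3; ++i) {
            double Ax_i = 0.0;
            for (int j = 0; j < 3; ++j)
                Ax_i += A[i][j] * x[j];                /* i-th entry of A*x    */
            if (fabs(Ax_i - lambda * x[i]) > 1e-12) {  /* compare with lambda*x */
                printf("not an eigenpair\n");
                return 1;
            }
        }
        printf("{-2, (-1, 0, 1)} is an eigenpair\n");
        return 0;
    }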

2 Algorithm for solving the real symmetric eigenvalue problem

When A is a large and dense matrix (n in the hundreds or thousands and most entries of A non-zero), the stable computation of all eigenpairs of A is typically broken up into three phases. First, the matrix A is reduced to a condensed form with far fewer non-zero entries, in such a way that the eigenvalues of the condensed matrix are the same as those of A and the eigenvectors of the condensed matrix are related to those of A in a simple way. Next, the eigenpairs of the condensed matrix are computed, and, finally, its eigenvectors are transformed into the eigenvectors of A. Specifically, the computation of all eigenpairs of a real symmetric matrix A consists of three steps:

1. Reduce A to an equivalent (the technical term is similar) tri-diagonal matrix T.

2. Compute the eigenpairs of T.

3. Transform the eigenvectors of T into the eigenvectors of A.

Numerically, the second step is the most challenging and relies on iterative algorithms that may or may not converge. Many advanced techniques are used to improve the accuracy and speed of these algorithms. Computationally, however, the second step is almost negligible in terms of computation time, and the third step is expensive but more or less straightforward to implement with high efficiency. Therefore, the most challenging step from a computational point of view is the initial reduction to tri-diagonal form.

3 Tri-diagonal reduction

Since the tri-diagonal reduction is both a challenging and an expensive part of the symmetric eigenvalue solver, we will consider only this part of the algorithm in this assignment. However, the best tri-diagonal reduction algorithm used today is a bit too complicated to implement in this course.

A tri-diagonal matrix has non-zero entries only on the main diagonal and directly below it (first sub-diagonal) and above it (first super-diagonal), as illustrated below for a 9-by-9 example:

    2 1
    1 2 1
      1 2 1
        1 2 1
          1 2 1
            1 2 1
              1 2 1
                1 2 1
                  1 2

It follows that a tri-diagonal matrix of size n-by-n has at most 3n - 2 non-zero entries, and if the matrix is also symmetric, then the number of (non-redundant) non-zero entries reduces to 2n - 1.

The goal of tri-diagonal reduction is to start with a dense symmetric matrix A and end up with an equivalent matrix T that is still symmetric but also tri-diagonal. In order to preserve the eigenvalues of A we are only allowed to perform what are called similarity transformations of A. A similarity transformation takes the form

    A ← P^{-1} A P,    (5)

where P is an n-by-n invertible matrix. To see that the eigenvalues are preserved by a similarity transformation, note that if {λ, x} is an eigenpair of A, then

    Ax = λx   implies   (P^{-1} A P)(P^{-1} x) = λ (P^{-1} x),

which implies that {λ, P^{-1} x} is an eigenpair of the transformed (similar) matrix B := P^{-1} A P. The geometric interpretation of a similarity transformation is that both A and the transformed matrix B represent the same linear transformation but in different coordinate systems. The matrix P, then, encodes the (linear) coordinate transformation.

In general, the transformation matrix P can be very ill-conditioned, and as a consequence the computed tri-diagonal form of A can have very different eigenvalues compared to A. Moreover, a general similarity transformation does not necessarily preserve the symmetry property. To improve the numerical stability of the tri-diagonal reduction algorithm and preserve symmetry, we restrict the set of allowed transformations to the nicest ones: the so-called orthogonal matrices. Each column of an orthogonal matrix has unit length and the columns are pairwise orthogonal (perpendicular). It follows that the inverse, Q^{-1}, of an orthogonal matrix Q is equal to its transpose, Q^T, i.e., Q^T Q = Q Q^T = I. An orthogonal similarity transformation, i.e., one where P is orthogonal, takes the form

    A ← Q^T A Q,    (6)

where Q (instead of P) is used to denote the orthogonal matrix.
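As a concrete illustration of the defining property Q^T Q = I, the short C sketch below builds a 2-by-2 rotation matrix (a simple example of an orthogonal matrix, chosen here purely for illustration) and checks numerically that Q^T Q equals the identity.

    #include <math.h>
    #include <stdio.h>

    /* Check numerically that Q^T Q = I for the 2-by-2 rotation matrix
     * Q = [ cos t  -sin t ; sin t  cos t ], which is orthogonal for any angle t. */
    int main(void)
    {
        const double t = 0.3;
        const double Q[2][2] = { { cos(t), -sin(t) },
                                 { sin(t),  cos(t) } };

        for (int i = 0; i < 2; ++i) {
            for (int j = 0; j < 2; ++j) {
                double qtq = 0.0;                    /* (Q^T Q)(i,j)          */
                for (int s = 0; s < 2; ++s)
                    qtq += Q[s][i] * Q[s][j];        /* column i dot column j */
                const double expect = (i == j) ? 1.0 : 0.0;
                if (fabs(qtq - expect) > 1e-14) {
                    printf("Q is not orthogonal\n");
                    return 1;
                }
            }
        }
        printf("Q^T Q = I, so Q^{-1} = Q^T\n");
        return 0;
    }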

4 Tri-diagonal reduction algorithm

The tri-diagonal reduction algorithm relies on a fundamental type of orthogonal matrices known as Householder reflections, named after the numerical linear algebra pioneer Alston Scott Householder. A Householder reflection of order n is defined by a vector v of length n that is scaled such that the matrix

    Q := I - v v^T    (7)

becomes orthogonal. Using specially crafted reflections, it is possible to systematically reduce A to tri-diagonal form by introducing zeros one column/row at a time. Without going into the details (they are not important for the sake of this assignment), we can construct a vector v_j defining a reflection Q_j such that when the matrix is updated according to the formula

    A ← Q_j^T A Q_j,    (8)

then the j-th column/row of the matrix is in tri-diagonal form. By systematically reducing the columns/rows from left/top to right/bottom we eventually end up with a full tri-diagonal matrix. The matrix below illustrates the zero/non-zero (sparsity) pattern of A after the update (8) with j = 3 and n = 6:

    Q_3^T A Q_3 =  x x 0 0 0 0
                   x x x 0 0 0
                   0 x x x 0 0
                   0 0 x x x x
                   0 0 0 x x x
                   0 0 0 x x x

Note that the matrix is tri-diagonal in its top left 3-by-3 sub-matrix and fully dense in its bottom right (n-3)-by-(n-3) sub-matrix.

The special structure of the reflection Q must be exploited in order to implement the update (8) efficiently. As written, (8) requires two matrix multiplications of order n for a total of 4n^3 floating point operations (in addition to the cost of constructing Q, which is negligible in comparison). However, if we exploit the structure and apply the update as in

    A ← Q_j^T A Q_j = (I - v v^T) A (I - v v^T)
                    = (I - v v^T)(A - A v v^T)
                    = A - A v v^T - v v^T A + v v^T A v v^T
                    = A - w v^T - v w^T,    (9)

where we have introduced the auxiliary vector

    w := A v - (1/2) v (v^T A v),    (10)

then the operation count of (8) is reduced to Θ(n^2).
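To make the savings in (9) and (10) concrete, the C sketch below applies one Householder update to a square symmetric matrix stored in column-major order, first forming w and then subtracting w v^T + v w^T. It is a plain illustration of the formulas, not the blocked algorithm used in practice; the function name and the decision to update the full matrix (rather than only one triangle) are choices made here for clarity.

    #include <stdlib.h>

    /* Apply the update A <- A - w v^T - v w^T of equation (9), where
     * w = A*v - 0.5 * (v^T A v) * v as in equation (10). The matrix A is n-by-n,
     * symmetric, and stored in column-major order with leading dimension lda.
     * Cost: Theta(n^2) instead of the 4n^3 flops of two explicit products. */
    static void householder_update(size_t n, double *A, size_t lda, const double *v)
    {
        double *w = malloc(n * sizeof *w);
        if (!w) return;                       /* allocation failed; nothing done */

        /* w = A*v */
        for (size_t i = 0; i < n; ++i) {
            double s = 0.0;
            for (size_t j = 0; j < n; ++j)
                s += A[i + j * lda] * v[j];
            w[i] = s;
        }

        /* gamma = v^T (A v) = v^T w, then w = w - 0.5 * gamma * v */
        double gamma = 0.0;
        for (size_t i = 0; i < n; ++i)
            gamma += v[i] * w[i];
        for (size_t i = 0; i < n; ++i)
            w[i] -= 0.5 * gamma * v[i];

        /* A = A - w v^T - v w^T (rank-2 update, preserves symmetry) */
        for (size_t j = 0; j < n; ++j)
            for (size_t i = 0; i < n; ++i)
                A[i + j * lda] -= w[i] * v[j] + v[i] * w[j];

        free(w);
    }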

Again without going into the details, a significant portion of the work in the blocked tri-diagonal reduction algorithm lies in applying a generalized (blocked) version of the update (9) that takes the form

    A ← A - W V^T - V W^T.    (11)

We call this type of update a symmetric rank-2k update. The name stems from the facts that A is symmetric and that each term (W V^T and V W^T) has rank (at most) k, so the two terms together form a symmetric matrix of rank at most 2k.

5 Symmetric rank-2k update

We have finally covered enough background material to put the topic of this assignment into some context. You do not have to understand a word of what has been written above unless you want to get the bigger picture and understand the relevance and origins of the symmetric rank-2k update.

Your task is to implement and evaluate a parallel version of the symmetric rank-2k update on a GPU and/or a set of multi-core CPUs with shared memory. The symmetric rank-2k update is a part of the BLAS interface/library, which contains most of the fundamental linear algebra operations, and goes by the name of xsyr2k, where x should be replaced by one of S, D, C, and Z for single precision real, double precision real, single precision complex, and double precision complex, respectively. We limit ourselves to the single precision real case and therefore from this point on we assume x = S.

The corresponding BLAS subroutine SSYR2K is fairly general and implements the following variants of the update:

    C ← α A B^T + α B A^T + β C,    (12)
    C ← α A^T B + α B^T A + β C.    (13)

Both α and β are (real) scalars. The matrix C is square and symmetric of size n-by-n. Only the upper or the lower triangular part of C is touched, depending on an argument passed to the function. In variant (12), both A and B are n-by-k matrices, and in variant (13) they are both k-by-n. We limit ourselves to variant (12).

The prototype of SSYR2K in the FORTRAN binding of the BLAS library reads as follows:

    subroutine ssyr2k(uplo,trans,n,k,alpha,a,lda,b,ldb,beta,c,ldc)
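For reference, the same operation can also be invoked from C through the CBLAS interface, which most BLAS implementations ship alongside the FORTRAN binding. The fragment below is a minimal sketch of a call corresponding to variant (12) with the upper triangle of C updated; it assumes a CBLAS library (providing cblas.h and cblas_ssyr2k) is available and linked, and the wrapper function name is illustrative only.

    #include <cblas.h>

    /* Compute C <- alpha*A*B^T + alpha*B*A^T + beta*C on the upper triangle of C,
     * i.e., variant (12) with uplo = 'U' and trans = 'N', via the CBLAS binding.
     * A and B are n-by-k, C is n-by-n, all in column-major storage. */
    static void syr2k_upper_notrans(int n, int k, float alpha,
                                    const float *A, int lda,
                                    const float *B, int ldb,
                                    float beta, float *C, int ldc)
    {
        cblas_ssyr2k(CblasColMajor, CblasUpper, CblasNoTrans,
                     n, k, alpha, A, lda, B, ldb, beta, C, ldc);
    }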

The arguments are described in detail in the ssyr2k.f source file of the reference BLAS implementation [2], but for convenience we also briefly describe them below.

    uplo   String that determines whether to touch the upper ('U') or the lower ('L') triangular part of C.
    trans  String that chooses either variant (12) ('N') or variant (13) ('T').
    n      The order of the matrix C and either the number of rows or the number of columns of A and B, depending on trans.
    k      Either the number of columns or the number of rows of A and B, depending on trans.
    alpha  The scalar α.
    A      The matrix A stored in column-major format.
    lda    The column stride of A (the so-called leading dimension).
    B      The matrix B stored in column-major format.
    ldb    The column stride of B.
    beta   The scalar β.
    C      The matrix C stored in column-major format.
    ldc    The column stride of C.

The operation is illustrated pictorially in Figure 1 for the case alpha = beta = 1, trans = 'N', and uplo = 'U'. On the entry level, the update takes the form

    [C]_ij ← β [C]_ij + α Σ_{s=1}^{k} [A]_is [B]_js + α Σ_{s=1}^{k} [B]_is [A]_js.    (14)

However, we prefer the more compact and less cluttered form (12).

Figure 1: Graphical depiction of the xsyr2k operation with uplo='U' and trans='N'; the upper triangle of C is overwritten with C + A B^T + B A^T.

6 The reference implementation

The reference implementation of SSYR2K [2] is structured as follows. We assume that α ≠ 0, since otherwise the operation degenerates into a simple scaling of C. We consider only the case uplo='U' and trans='N', as illustrated in Figure 1, since the other three cases are very similar.

The upper triangular part of the matrix C is updated one column at a time from left to right. The column index j runs from 1 to n in an outer loop. The column is first scaled in place by the factor β (unless β = 1) as illustrated by the pseudo-code below:

    do i = 1 to j
        C(i,j) = beta * C(i,j)

Next, the j-th columns of the two terms in (12) are applied to the j-th column of C in k steps as illustrated by the following pseudo-code:

    do s = 1 to k
        t1 = alpha * B(j,s)
        t2 = alpha * A(j,s)
        do i = 1 to j
            C(i,j) = C(i,j) + A(i,s) * t1
            C(i,j) = C(i,j) + B(i,s) * t2

We can understand the snippet above by looking at (14) for one value of s at a time. In this context, a column of C is updated by adding a scaled column of A and a scaled column of B. The scaling factors correspond to the variables t1 and t2 above. If we glue the two pieces above together with the outer loop over columns, we end up with (a simplified version of) the reference implementation expressed in pseudo-code:

    do j = 1 to n
        do i = 1 to j
            C(i,j) = beta * C(i,j)
        do s = 1 to k
            t1 = alpha * B(j,s)
            t2 = alpha * A(j,s)
            do i = 1 to j
                C(i,j) = C(i,j) + A(i,s) * t1
                C(i,j) = C(i,j) + B(i,s) * t2

Let us count the number of floating point operations (flops) performed by the pseudo-code above as a function f(n, k) of the parameters n and k. The initial scaling of C accounts for

    Σ_{j=1}^{n} Σ_{i=1}^{j} 1 = n^2/2 + n/2 flops.    (15)

The computations of t1 and t2 account for

    Σ_{j=1}^{n} Σ_{s=1}^{k} 2 = 2kn flops.    (16)

Finally, the inner loop accounts for

    Σ_{j=1}^{n} Σ_{s=1}^{k} Σ_{i=1}^{j} 4 = 2kn^2 + 2kn flops.    (17)

In total, the flop count of the xsyr2k operation is

    f(n, k) = (2k + 1/2) n^2 + 4kn + n/2 ≈ 2kn^2.    (18)

Note that the flop count grows rapidly with k, much faster than the amount of data (A, B, and C). Therefore, we expect that the operation can be implemented with a high computation-to-communication ratio (operational intensity) and thereby overcome the limited main memory bandwidth and the slow PCIe bus that connects the host to the GPU. In the next section, we will devise a blocked algorithm that relies heavily on matrix multiplication and should therefore be possible to implement efficiently on GPUs and multi-cores alike.
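Before moving on, the simplified reference pseudo-code above translates directly into C as follows. This is a sketch assuming 0-based indexing, column-major storage with leading dimensions, and the uplo='U', trans='N' case; the function name is chosen here for illustration and is not part of the provided skeleton code.

    #include <stddef.h>

    /* Simplified reference SSYR2K, variant (12) with uplo='U', trans='N':
     * C <- alpha*A*B^T + alpha*B*A^T + beta*C, touching only the upper triangle
     * of C. A and B are n-by-k, C is n-by-n, all column-major with leading
     * dimensions lda, ldb, ldc. Flop count roughly 2*k*n^2, as in (18). */
    static void ssyr2k_reference(size_t n, size_t k, float alpha,
                                 const float *A, size_t lda,
                                 const float *B, size_t ldb,
                                 float beta, float *C, size_t ldc)
    {
        for (size_t j = 0; j < n; ++j) {
            /* Scale column j of the upper triangle of C by beta. */
            for (size_t i = 0; i <= j; ++i)
                C[i + j * ldc] *= beta;

            /* Apply the j-th column of A*B^T and of B*A^T in k rank-1 steps. */
            for (size_t s = 0; s < k; ++s) {
                const float t1 = alpha * B[j + s * ldb];
                const float t2 = alpha * A[j + s * lda];
                for (size_t i = 0; i <= j; ++i)
                    C[i + j * ldc] += A[i + s * lda] * t1
                                    + B[i + s * ldb] * t2;
            }
        }
    }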

7 Recursive blocked algorithm

The reference implementation does not use the cache hierarchy effectively. In this section, we develop a recursive blocked algorithm that automatically adapts to all levels of the cache hierarchy. Let us partition the matrix C according to

    C =: [ C_11    C_12 ]
         [ C_12^T  C_22 ]    (19)

such that C_11 and C_22 are square symmetric sub-matrices of order n_1 and n_2, respectively. Here n_1 and n_2 can be chosen arbitrarily, but we prefer to choose either n_1 = b for some fixed block size b, which results in a traditional blocked algorithm, or n_1 = n_2 = n/2 for a divide-and-conquer style recursive blocked algorithm. Next, let us partition A and B into two row blocks each, conformally with C. The first row block consists of the first n_1 rows and the other row block consists of the remaining n_2 rows. Using the block partitionings of A, B, and C, the update (12) can now be written as

    [ C_11    C_12 ]     [ C_11    C_12 ]     [ A_1 ]                      [ B_1 ]
    [ C_12^T  C_22 ] ← β [ C_12^T  C_22 ] + α [ A_2 ] [ B_1^T  B_2^T ] + α [ B_2 ] [ A_1^T  A_2^T ].    (20)

On the block level, the update (20) decomposes into three (actually four, but one is redundant) block updates:

    C_11 ← β C_11 + α A_1 B_1^T + α B_1 A_1^T,    (21)
    C_22 ← β C_22 + α A_2 B_2^T + α B_2 A_2^T,    (22)
    C_12 ← β C_12 + α A_1 B_2^T + α B_1 A_2^T.    (23)

If one looks at the updates and the properties of the blocks carefully enough, one will notice that the first two block updates are themselves instances of the symmetric rank-2k update, only smaller, while the third block update is a sequence of two general matrix multiplication updates, or xgemm operations. By applying the recursive template (20) to the two sub-operations (21) and (22), we obtain a recursive formulation of the xsyr2k operation. The flop count of this recursive formulation is essentially the same as that of the entry-wise formulation analyzed in the previous section, i.e., approximately 2kn^2.

A strength of the recursive formulation is that at each level of the recursion it exposes two general matrix multiplication updates, or xgemm operations, which are regular operations that are well suited for high-performance implementation on both GPUs and multi-cores. This section has illustrated how rewriting an operation using recursion can expose highly desirable xgemm operations. It turns out that this technique can be applied to all of the fundamental matrix operations, as was first demonstrated by Kågström, van Loan, and Ling with their GEMM-based BLAS project [3]. A sketch of the resulting recursive xsyr2k algorithm is given below.
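The C sketch below outlines the divide-and-conquer variant for uplo='U', trans='N'. It is only meant to illustrate the structure of the recursion: the base case reuses the entry-wise reference algorithm, the off-diagonal block is handled by a simple, unoptimized loop-based C += alpha*X*Y^T kernel standing in for SGEMM, and the crossover size BASE_N is a placeholder you would tune. All names are illustrative and not part of the provided skeleton code.

    #include <stddef.h>

    /* gemm_nt: C <- C + alpha * X * Y^T, where X is m-by-k, Y is n-by-k, and the
     * m-by-n block C has leading dimension ldc. A simple loop nest stands in for
     * an optimized SGEMM kernel here. */
    static void gemm_nt(size_t m, size_t n, size_t k, float alpha,
                        const float *X, size_t ldx,
                        const float *Y, size_t ldy,
                        float *C, size_t ldc)
    {
        for (size_t j = 0; j < n; ++j)
            for (size_t s = 0; s < k; ++s) {
                const float t = alpha * Y[j + s * ldy];
                for (size_t i = 0; i < m; ++i)
                    C[i + j * ldc] += X[i + s * ldx] * t;
            }
    }

    /* scale_full_block: C <- beta * C on a full m-by-n block (the C_12 block). */
    static void scale_full_block(size_t m, size_t n, float beta, float *C, size_t ldc)
    {
        for (size_t j = 0; j < n; ++j)
            for (size_t i = 0; i < m; ++i)
                C[i + j * ldc] *= beta;
    }

    #define BASE_N 64  /* illustrative crossover size; tune experimentally */

    /* ssyr2k_recursive: C <- alpha*A*B^T + alpha*B*A^T + beta*C, upper triangle
     * only (uplo='U', trans='N'). A and B are n-by-k, C is n-by-n, column-major. */
    static void ssyr2k_recursive(size_t n, size_t k, float alpha,
                                 const float *A, size_t lda,
                                 const float *B, size_t ldb,
                                 float beta, float *C, size_t ldc)
    {
        if (n <= BASE_N) {
            /* Base case: the entry-wise reference algorithm on a small block. */
            for (size_t j = 0; j < n; ++j) {
                for (size_t i = 0; i <= j; ++i)
                    C[i + j * ldc] *= beta;
                for (size_t s = 0; s < k; ++s) {
                    const float t1 = alpha * B[j + s * ldb];
                    const float t2 = alpha * A[j + s * lda];
                    for (size_t i = 0; i <= j; ++i)
                        C[i + j * ldc] += A[i + s * lda] * t1 + B[i + s * ldb] * t2;
                }
            }
            return;
        }

        const size_t n1 = n / 2, n2 = n - n1;
        const float *A1 = A, *A2 = A + n1;         /* row blocks of A           */
        const float *B1 = B, *B2 = B + n1;         /* row blocks of B           */
        float *C11 = C;                            /* n1-by-n1, update (21)     */
        float *C22 = C + n1 + n1 * ldc;            /* n2-by-n2, update (22)     */
        float *C12 = C + n1 * ldc;                 /* n1-by-n2, update (23)     */

        /* (21) and (22): smaller symmetric rank-2k updates (recursion).        */
        ssyr2k_recursive(n1, k, alpha, A1, lda, B1, ldb, beta, C11, ldc);
        ssyr2k_recursive(n2, k, alpha, A2, lda, B2, ldb, beta, C22, ldc);

        /* (23): C12 <- beta*C12 + alpha*A1*B2^T + alpha*B1*A2^T, i.e., two
         * xgemm-like updates, which are the regular, GPU-friendly operations.  */
        scale_full_block(n1, n2, beta, C12, ldc);
        gemm_nt(n1, n2, k, alpha, A1, lda, B2, ldb, C12, ldc);
        gemm_nt(n1, n2, k, alpha, B1, ldb, A2, lda, C12, ldc);
    }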

8 The assignment

Read this section carefully. The requirements below are minimal requirements, and they are deliberately a bit fuzzy. Use your own judgment to select suitable experiments and suitable ways of presenting and analyzing your results.

You should, given a piece of skeleton code, implement the SSYR2K operation variant with trans='N' and uplo='U' on either a GPU, a set of traditional multi-core processors, or a combination of the two. Base your implementation on the recursive blocked algorithm in its divide-and-conquer variant. The performance evaluation logic is already present in the skeleton code, so you can concentrate on developing the computational kernels. For the sake of this assignment you should implement the SGEMM kernel(s) yourself instead of relying on any BLAS library. A minimal requirement is that you use the cache (CPU) and shared memory (GPU) resources effectively.

For your reference, a sequential implementation relying on the BLAS has been provided in the skeleton code. You are encouraged to implement a second version of your parallel code(s) that calls the BLAS routines and to compare its performance to that of your own code.

Perform and report carefully chosen experiments. Use information about the cache sizes (CPU) and graphics memory size (GPU) to choose an appropriate range of parameters (n, k, and block sizes). Hint: a reasonable choice in the context of tri-diagonal reduction would be n in the few thousands and k at least 32 or so, but perhaps as large as a few hundred.

Some questions to think about:

- What is/are the optimal block size/sizes?
- How does k = 1, 2, ... affect the performance? In the context of tri-diagonal reduction, k is a parameter that can be tweaked to improve performance.
- Is the code sensitive to NUMA effects?
- Can a tiled data layout improve performance?
- When should the recursion be aborted?
- Does it help to regularize the computation by computing the (small) SSYR2K operations in the recursive base case using SGEMM instead of SSYR2K? The SGEMM operation is presumably easier to optimize.

References

[1] http://www.netlib.org/lapack. The LAPACK interface specification and reference implementation.
[2] http://www.netlib.org/blas. The BLAS interface specification and reference implementation.
[3] http://www.netlib.org/blas/gemm_based/. The GEMM-based BLAS by Kågström, van Loan, and Ling.