Lazy Householder Decomposition of Sparse Matrices


G.W. Howell
North Carolina State University
Raleigh, North Carolina

August 26, 2010

Supported by NIH Molecular Libraries Roadmap for Medical Research, Grant 1 P20 Hg

Abstract

This paper describes Householder reduction of a rectangular sparse matrix to small band upper triangular form B_{k+1}. B_{k+1} is upper triangular with nonzero entries only on the diagonal and on the nearest k superdiagonals. The algorithm is similar to the Householder reduction used as part of the standard dense SVD computation. For the sparse lazy algorithm, matrix updates are deferred until a row or column block is eliminated. The original sparse matrix is accessed only for sparse matrix dense matrix (SMDM) multiplications and to extract row and column blocks. For a triangular bandwidth of k+1, the SMDM operations multiply the sparse matrix by dense matrices consisting of the k rows or columns of a block Householder transformation. Block Householder transformations are reliably orthogonal, computationally efficient, and have good potential for parallelization. Numeric results presented here indicate that using an initial random block Householder transformation allows computation of a collection of the largest singular values. Some potential applications are in finding low rank matrix approximations and in solving least squares problems.

1 Introduction

In 1965, Golub and Kahan proposed Householder bidiagonalization A = U B_2 V as a first step in determining the singular values of dense matrices. For sparse matrices they proposed Lanczos bidiagonalization as a means of determining a few singular values [8].

Rearranging the order of computation to avoid filling a sparse matrix allows a natural extension of Householder decomposition to the sparse case. Householder transformations are scalably stable and, if blocked, the reduction algorithm is almost entirely BLAS-3, efficient on a variety of computer architectures. Given a sparse matrix A, applications include solving Ax = b and finding x to minimize ||Ax - b||_2. In this paper, we apply the algorithm to finding singular values of A. Key points are:

- Algorithm stability is desirable for reliable solution of large problems. Householder and block Householder transformations are very nearly orthogonal when implemented with rounding arithmetic, enabling simple run-time convergence tests (see Section 4).

- Householder reductions can be applied to sparse matrices by deferring updates of blocks of the original matrix. Updates are not performed until the step on which a column or row block is to be eliminated. Multiplications are accomplished by expressing the updated matrix as a sum of the original sparse matrix and a low rank update (see Footnote 1). The work here is essentially an extension of the dense Grösser and Lang algorithm [10], [16] to the sparse case.

- Block Householder transformations are BLAS-3. On current computer architectures, whether cache-based multicore, GPUs or other hardware accelerators connected to general processors, or distributed parallel computing, dense BLAS-3 matrix matrix multiplies are significantly faster than BLAS-2 matrix vector multiplications. In all these cases BLAS-3 operations are advantageous; Section 3 discusses NUMA (shared memory Non-Uniform Memory Access) performance for wide or tall BLAS-3. Similarly, multiplying a sparse matrix by a dense matrix (SMDM) is faster than multiplying a sparse matrix by a dense vector. If the reduction is to an upper triangular matrix B_{k+1} with nonzero entries on the diagonal and nearest k superdiagonals, then the SMDM operations AX and A^T Y entail dense matrices X and Y with k columns.

Footnote 1: For reduction to Hessenberg form or upper triangular form, the idea of deferring updates has been repeatedly used. Kaufman implemented deferral of updates in sparse Householder QR factorization [15]. For other applications of this idea, see for example ARPACK [19], Sosonkina, Allison, and Watson [25], and Dubrulle [6]. For Householder reduction to bidiagonal form, the idea of deferring updates is implicit in the LAPACK reduction GEBRD (for the dense case) [3], with the extension to the sparse case explicitly outlined in Howell, Demmel, Fulton, Hammarling, and Marmol [11]. The lazy functional language Haskell defers updates.
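To make the deferred-update idea concrete, the following is a minimal sketch (in Python with NumPy/SciPy, not the paper's Fortran code) of how the implicitly updated matrix is applied to a block of dense vectors; the function and variable names are illustrative only.

    import numpy as np
    import scipy.sparse as sp

    def updated_matmul(A, U, Z, W, V, X):
        """Multiply the implicitly updated matrix (A - U Z - W V) by a dense block X.
        A stays in sparse storage; U, Z, W, V hold the low rank update accumulated
        from the row and column blocks eliminated so far, so the dense updated
        matrix is never formed."""
        return A @ X - U @ (Z @ X) - W @ (V @ X)

    # toy usage: a 1000 x 800 sparse matrix, a rank-8 accumulated update, 8 right-hand vectors
    rng = np.random.default_rng(0)
    A = sp.random(1000, 800, density=0.01, format="csr", random_state=0)
    U, Z = rng.standard_normal((1000, 8)), rng.standard_normal((8, 800))
    W, V = rng.standard_normal((1000, 8)), rng.standard_normal((8, 800))
    X = rng.standard_normal((800, 8))
    Y = updated_matmul(A, U, Z, W, V, X)   # equals forming (A - UZ - WV) X densely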

With 64 GBytes of RAM, 48 GBytes of which are allocated to basis storage, then for matrices of size up to about a million square (as illustrated by testing against matrices from Tim Davis's UF collection of matrices [4]), the UB_{k+1}V algorithm usually determined many singular values (see Section 6). For comparison, if one dense 20K by 20K matrix can be stored on a given processor (3.2 GBytes of storage), then 2500 processors would be needed for a dense SCALAPACK computation of singular values for a one million square matrix.

Section 2 compares basis orthogonality and storage requirements for several methods of finding singular values of sparse matrices. Section 3 gives some numeric results justifying the assertion that AX, A^T Y (sparse matrix dense matrix, or SMDM) operations are likely to be faster than Ax, A^T y and discusses shared memory parallelization of wide or tall BLAS-3 operations. Section 4 summarizes some theory, justifying some run-time error estimates. Section 5 is an explicit presentation of the algorithm, provides comparisons of the sparse and dense algorithms, and shows how implicit fill can be useful. Section 6 describes numeric experiments with the Davis UF collection [4] of sparse matrices.

2 Comparison to Other Sparse Methods

The sparse and dense UB_{k+1}V Householder based decompositions are BLAS-3 algorithms with U and V scalably orthogonal. For comparison, consider Lanczos bidiagonalization for finding a few singular values of a sparse matrix, proposed by Golub and Kahan as a sparse alternative to Householder bidiagonalization [8]. Lanczos bidiagonalization can proceed without storing multipliers, relying on a three term recursion, so that only the last few left and right multipliers are needed. Storage requirements are minimal. In exact arithmetic the Lanczos bases would be orthogonal. In rounding arithmetic, there is a rapid loss of orthogonality, and even of linear independence, as illustrated in Figure 1. A matrix of size a few hundred was randomly generated in the program octave, the left and right multipliers were saved, and the numeric rank of the right multiplier basis was calculated as its number of nonzero singular values.

As memory per available node grows, using the memory to get better stability becomes feasible. In order to preserve orthogonality, multiplier vectors are frequently stored, and in a Lanczos algorithm, reorthogonalized. Table 1 compares L_2 condition numbers of various methods for constructing bases for the columns of Hilbert matrices, computed as the ratio of largest to smallest singular values of the basis.
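The loss of orthogonality illustrated in Figure 1 can be reproduced along the following lines. This is a minimal sketch (Python/NumPy rather than the octave script actually used, and with an arbitrary random test matrix, since the exact matrix is not specified): plain Golub-Kahan Lanczos bidiagonalization with no reorthogonalization, followed by a check of the numerical rank of the right multiplier basis, which in exact arithmetic would equal the number of columns.

    import numpy as np

    def golub_kahan_lanczos(A, steps, rng):
        """Golub-Kahan Lanczos bidiagonalization with no reorthogonalization.
        Returns the right multiplier basis V (n x steps); in exact arithmetic
        V would be orthonormal."""
        m, n = A.shape
        V = np.zeros((n, steps))
        v = rng.standard_normal(n)
        v /= np.linalg.norm(v)
        u = np.zeros(m)
        beta = 0.0
        for j in range(steps):
            V[:, j] = v
            u = A @ v - beta * u          # three-term recursion, left vector
            alpha = np.linalg.norm(u)
            u /= alpha
            w = A.T @ u - alpha * v       # three-term recursion, right vector
            beta = np.linalg.norm(w)
            v = w / beta
        return V

    rng = np.random.default_rng(0)
    A = rng.standard_normal((300, 300))   # stand-in for the "size a few hundred" test matrix
    V = golub_kahan_lanczos(A, 150, rng)
    s = np.linalg.svd(V, compute_uv=False)
    print("columns:", V.shape[1], " numerical rank:", int(np.sum(s > 1e-8 * s[0])))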

[Figure 1: Lanczos bases suffer loss of numeric rank. (Plot: rank deficiency of the "orthogonal" Lanczos basis vs. number of columns.)]

[Table 1: L_2 Condition Numbers of Bases. (Columns: matrix sizes; Householder QR; L in LU; MGS QR.)]

Here we factored Hilbert matrices of various sizes and compared condition numbers of Q for QR factorization or L for LU decomposition. If modified Gram-Schmidt as opposed to Householder orthogonalization is used, the number of flops can be halved. Since modified Gram-Schmidt is primarily a BLAS-1 algorithm, use of BLAS-3 block Householder transformations is typically faster, as well as more nearly orthogonal. Alternatively, Jalby and Philippe [13], Vanderstraten [28], Stewart [26], Giraud and Langou [7] and others have designed block Gram-Schmidt algorithms to be of comparable stability to modified Gram-Schmidt, which gives a well-conditioned but not orthogonal basis for an ill-conditioned set of vectors such as those obtained by a Lanczos method. Using block Gram-Schmidt in combination with block Lanczos methods, such as that proposed by Golub, Luk, and Overton [9], would be another possible means of obtaining a stable BLAS-3 algorithm. More usually, Lanczos bidiagonalization with reorthogonalization is used in SVDPACK [2] and PROPACK [17], with theoretical development in Larsen's thesis [18] and work by Simon and Zha [24]. Each of these methods is appropriate for finding some singular values. Sparse MATLAB instead uses ARPACK [19], an Arnoldi method based on BLAS-2 non-blocked Householder transformations.

Table 2 compares some sparse decompositions in terms of required storage and level of BLAS.

[Figure 2: GFlop rates for tall or wide BLAS-3 with 16 cores (plot title: Smallest Dimension vs. Gflop Rate, Skinny BLAS-3 with 16 cores; curves: tall*small, small*wide, wide*tall; x-axis: smallest dimension; y-axis: GFlop rate). The long matrix dimension is held fixed; the x-axis is the smallest dimension. If the OpenMP loop used during initialization imitates the loop used during computation, parallelization is relatively good, presumably because of data locality. Best performance is about 62% of peak. For matrices of practical width 8, performance is only about 10% of peak.]

3 Sparse Matrix Dense Matrix (SMDM) and Wide or Tall BLAS-3

Sparse matrix algorithms often use multiplications of the sparse matrix times a dense vector and BLAS-1 or BLAS-2 operations. On cache-based computer architectures these execute orders of magnitude slower than the peak machine speed, slowed by repeated fetches of the sparse matrix from RAM. Block algorithms replace sparse matrix dense vector multiplications by sparse matrix dense matrix multiplications, and replace BLAS-1 inner products and daxpys by wide or tall BLAS-3 operations. Compared to sparse matrix dense vector and BLAS-1 operations, SMDM multiplications and wide or tall BLAS-3 operations allow more floating point operations for each fetch of a floating point number from RAM.

                 Lanczos        PROPACK         ARPACK            UB_{k+1}V
                 UB_2 V         UB_2 V          GMRES, UHU^T
    Basis Vecs   O(1)           2N              2N                4N
                 Loss of Rank   Uses Re-orthog  Keeps Orthog      Keeps Orthog
    BLAS         BLAS-1         BLAS-1          BLAS-2            BLAS-3
    flops        4Nn_z + O(N)   4Nn_z + 4N^2 n  4Nn_z + 4N^2 n    4Nn_z + 4(n+m)N^2

    Table 2: Summary Chart Comparing Sparse Decompositions.

3.1 Shared Memory Parallelization for Wide or Tall BLAS-3

When dense matrices A and B are too large to fit in fast memory and have smallest dimension smaller than about 100, our tests on multi-core processors indicate that the matrix multiplication AB has a computational rate roughly proportional to the smallest dimension. For a few shared-memory or multi-core processors, we get good parallel speedups with a few OpenMP calls or merely by using a multi-threaded BLAS library. For more than about four cores, a more careful parallelization is needed for the wide or tall BLAS-3 operations which are the predominant calculation in the UB_{k+1}V decomposition. Three special cases were parallelized using the OpenMP library. These were Wide, Tall, and WideTall, which respectively parallelize the cases of small*wide, tall*small, and wide*tall products. The 4-socket, 16-core architecture is NUMA (Non-Uniform Memory Access): each socket has faster access to its own RAM than to the RAM associated with the other sockets. The computational rates illustrated in Figure 2 were obtained by using the same OpenMP loops for matrix initialization as for computation, thereby improving data locality. This numeric experiment was on a four-socket Opteron system running Linux. The same code also produces good data locality and performance on Intel chips (see Footnote 2).

Footnote 2: Using the same OpenMP loops for matrix initialization and computation may fail to produce data locality on other architectures and operating systems. Lack of explicit control over data locality may limit the portability of OpenMP NUMA parallelism.
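The qualitative behavior in Figure 2 (rate growing roughly with the smallest dimension until it reaches a few dozen) can be observed with a generic multi-threaded BLAS through NumPy. The sketch below is illustrative only; it makes no attempt at the OpenMP/NUMA initialization strategy the paper's Fortran code uses, and the chosen long dimension is an assumption.

    import time
    import numpy as np

    def gemm_rate(m, k, n, reps=3):
        """Best-of-reps GFlop/s for a dense (m x k) by (k x n) product."""
        A = np.random.standard_normal((m, k))
        B = np.random.standard_normal((k, n))
        best = float("inf")
        for _ in range(reps):
            t0 = time.perf_counter()
            A @ B
            best = min(best, time.perf_counter() - t0)
        return 2.0 * m * k * n / best / 1e9

    long_dim = 100_000                            # stands in for the fixed long dimension of Figure 2
    for small in (2, 4, 8, 16, 32, 64):
        tall = gemm_rate(long_dim, small, small)  # tall * small
        wide = gemm_rate(small, small, long_dim)  # small * wide
        print(f"smallest dimension {small:3d}: tall*small {tall:6.1f}  small*wide {wide:6.1f} GFlop/s")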

3.2 Sparse Matrix Dense Matrix Products

Many classic iterative schemes for solving systems of sparse linear equations rely on multiplications Ax, y^T A, and BLAS-1 (vector-vector) operations. Accessing A to perform block multiplications AX gives significantly better performance. Figure 3 indicates the relative effects of block size and matrix storage in speeding sparse matrix multiplications. Column blocking is effective in improving performance when nonzero entries are uniformly distributed. For other sparse matrices, different matrix storage schemes can improve performance of the AX kernel. For example, Toledo [27], Angeli et al. [1], and Im's Ph.D. dissertation [12] offer some guidance in arranging storage of A to speed the computation Ax. The OSKI package (Vuduc, Demmel, and Yelick [29]) automates the process of choosing storage of A. Nishtala, Vuduc, Demmel, and Yelick [22] offer some guidance as to when OSKI is likely to be effective.

[Figure 3: Speeding multiplication of a randomly generated sparse matrix: blocking speeds sparse matrix dense matrix multiplies ((Sparse A)*(Dense X), up to a 20X speedup). Curves: 16 vectors in X for A*X and tr(A)*X; 1 vector in x for A*x and tr(A)*x; y-axis: speed in megaflops per second; x-axis: number of column blocks. Blocking the matrix and multiplying by multiple vectors reduce cache misses. The matrix here is 100K by 100K with 500 randomly distributed nonzero entries per row. The computation was on a 64-bit 2.4 GHz two-processor Pentium with 512 MByte L2 cache, compiled with an Intel Fortran compiler. Parallelization is provided with one OpenMP parallel loop for AX, Ax and one for X^T A, x^T A.]
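A minimal sketch of the effect Figure 3 measures (more useful flops per fetch of A when multiplying by several dense vectors at once). It uses SciPy's generic CSR multiply rather than the blocked Fortran kernel described here, and a smaller, sparser random matrix; the sizes are assumptions for illustration.

    import time
    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(0)
    n = 100_000
    # roughly 50 nonzeros per row (smaller than the 500 per row of Figure 3)
    A = sp.random(n, n, density=50 / n, format="csr", random_state=0)

    def mflop_rate(k, reps=3):
        """MFlop/s for the product A @ X with X dense and k columns."""
        X = rng.standard_normal((n, k))
        best = float("inf")
        for _ in range(reps):
            t0 = time.perf_counter()
            A @ X
            best = min(best, time.perf_counter() - t0)
        return 2.0 * A.nnz * k / best / 1e6

    print("A*x,  1 vector :", round(mflop_rate(1)), "MFlop/s")
    print("A*X, 16 vectors:", round(mflop_rate(16)), "MFlop/s")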

For the UB_{k+1}V decomposition, access to A is only for the multiplications AX, Y^T A and for extractions of blocks of A. Almost all other operations are BLAS-3 with minimal dimension k.

4 Some Theory

Using block Householder transformations, we expect the overall condition number of the transformations to be very near one, and expect that if we compute B_{k+1} to satisfy A = U B_{k+1} V, then the singular values of B_{k+1} and A will be closely matched. For a practical sparse algorithm only a partial decomposition is made, i.e., for A of m rows and n columns we compute only the first N rows and columns of B_{k+1}, N < n. The following subsections discuss interlacing of singular values and approximation of A by U_N B^N_{k+1} V_N in the Frobenius norm.

4.1 Interlacing of Singular Values

In the dense case, singular values are typically found by reducing an m × n matrix A to a condensed m × n matrix B (upper triangular, or with banded structure such as bidiagonal) with the same singular values, then finding the singular values of the condensed form by some iterative procedure. For reduction to small band (or upper triangular) form in the sparse case, obtaining an m × n reduced matrix is impractical if the transformations are stored, unstable if they are not stored. Transformations must be stored to maintain orthogonality and linear independence. It's natural to try to use the singular values of an N × N reduced matrix B^N_{k+1}, obtained after eliminating N columns, as approximate singular values of the original matrix A. The singular values of B^N_{k+1} are sometimes called Ritz values of A. Cauchy's interlacing property relates the Ritz values to the singular values of A.

Cauchy's Interlace Theorem [20]: Let C be an n × n Hermitian matrix partitioned as

    C = [ H   B^* ]
        [ B   U   ],

where H is N × N. Let C have eigenvalues α_1 ≥ α_2 ≥ ... ≥ α_n and H have eigenvalues θ_1 ≥ θ_2 ≥ ... ≥ θ_N. Then for j = 1, ..., N,

    α_j ≥ θ_j ≥ α_{j+n-N},        (4.1)

and for l = 1, 2, ..., n,

    θ_{l-n+N} ≥ α_l ≥ θ_l.        (4.2)

Suppose A has been transformed to

    A_N = [ R_N   T_N       ]
          [ 0     B^{k+1}_N ].

Then the Hermitian matrix

    A_N^T A_N = [ R_N^T R_N    R_N^T T_N                           ]
                [ T_N^T R_N    T_N^T T_N + (B^{k+1}_N)^T B^{k+1}_N ]

has the same eigenvalues as A^T A. For the symmetric matrix A_N^T A_N, Cauchy's interlacing theorem implies that the eigenvalues of the symmetric matrix R_N^T R_N interlace with those of A^T A. Since the singular values σ_i of A have the same ordering in size as the eigenvalues λ_i = σ_i^2 of A^T A, Cauchy's interlacing theorem relates the singular values of A and R_N.

As an example of interlacing, consider the singular values of a 4 by 4 upper triangular matrix T. Let T_1, T_2, T_3 be the upper left 1 × 1, 2 × 2, 3 × 3 matrices respectively; the singular values of T_1, T_2, T_3, T then interlace as in (4.1).

Actually, we can do a bit more. When reducing to banded upper triangular form, we get A_N of the form

    A_N = [ R_N   L_N   0 ]
          [ 0     B     C ]        (4.3)
          [ 0     D     E ]

and we naturally wonder whether singular values of R̂ = [ R_N  L_N  0 ] are related to those of A. The following result of Kahan from p. 196 of [20] is applicable.

The Residual Interlace Theorem. Let F be an n × n Hermitian matrix of the form

    F = [ H   C^*   0   ]
        [ C   V     Z^* ]
        [ 0   Z     W   ]

where H is N × N and V is j × j. Define

    M(X) = [ H   C^* ]
           [ C   X   ]        (4.4)

where V - X is assumed to be invertible. Denote the eigenvalues of M(X) as µ_1 ≤ µ_2 ≤ ... ≤ µ_{N+j}. Then each interval [µ_i, µ_{i+N}], i = 1, ..., j, contains a different eigenvalue of F. Also, outside each open interval (µ_l, µ_{l+j}), l = 1, ..., N, there is a different eigenvalue of F.

The residual interlace theorem applies to A_N^T A_N, as it is of the form (suppressing the N subscripts)

    A_N^T A_N = [ R^T R   R^T L                   0             ]
                [ L^T R   L^T L + B^T B + D^T D   B^T C + D^T E ]
                [ 0       C^T B + E^T D           C^T C + E^T E ]

Taking X = L^T L gives V - X = B^T B + D^T D. The theorem will apply if V - X is nonsingular, which will be the case when either the columns of B or the columns of D are linearly independent. We conclude that the j + N singular values α_i of R̂, taken as the square roots of the eigenvalues of M(L^T L), are lower bounds for the top j + N singular values of A. In particular, if α_i, i ≤ N, is the ith largest singular value of R̂, then α_i < σ_i, where σ_i is the ith largest singular value of A. Applying the interlacing theorem to R and R̂, the ith of the N largest singular values of R̂ is larger than the ith singular value η_i of R. Since η_i ≤ α_i ≤ σ_i, the singular values α_i of R̂ are better estimates of the singular values of A than are the singular values η_i of R.

For example, take a matrix A and let R_1, R_2, R_3 be the upper left 2 × 2, 2 × 3, 2 × 4 submatrices respectively. The first two singular values of R_1, R_2, R_3, and A are then nondecreasing in that order, so each wider submatrix gives better lower bounds for the two largest singular values of A.
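The interlacing bounds are easy to check numerically. Since the numerical entries of the examples above were not reproduced here, the short sketch below (Python/NumPy, with an arbitrary random matrix) verifies the column-block form of the bound: B = A[:, :N] gives B^T B equal to the leading N × N principal submatrix of A^T A, so by Cauchy interlacing the singular values of B are bounded above by the corresponding singular values of A, which is the mechanism used in the argument above.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((10, 8))
    sigma = np.linalg.svd(A, compute_uv=False)          # singular values of A, descending
    for N in (2, 4, 6, 8):
        theta = np.linalg.svd(A[:, :N], compute_uv=False)
        assert np.all(theta <= sigma[:N] + 1e-12)       # theta_j <= sigma_j for j = 1..N
        print(f"N={N}:  sigma_1(A[:, :N]) = {theta[0]:.4f}  <=  sigma_1(A) = {sigma[0]:.4f}")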

4.2 Approximation of A by J_{kl} = U_{kl} B̂_{k+1} V^T_{kl}

Suppose that A_{kl} is related to A by the orthogonal transformations U_{kl} and V_{kl} as A_{kl} = U^T_{kl} A V_{kl}. Due to the orthogonality of U_{kl} and V_{kl}, we have ||A_{kl}||_F = ||A||_F. For the algorithm described in the next section, A_{kl} has the form

    A_{kl} = [ B_{k+1}   C_k   0 ]
             [ 0        Â_{kl}   ]        (4.5)

where B_{k+1} is kl × kl and C_k is kl × k. In our instance, C_k has nonzero entries only in its lower triangular k × k block. We're interested in the case that Â_{kl} is not computed, as it would be dense and large and likely to overflow the RAM. Since we have

    ||A||_F^2 = ||A_{kl}||_F^2 = ||B_{k+1}||_F^2 + ||C_k||_F^2 + ||Â_{kl}||_F^2,

it follows that

    ||Â_{kl}||_F^2 = ||A||_F^2 - ||B_{k+1}||_F^2 - ||C_k||_F^2.        (4.6)

Take B̂_{k+1} = [ B_{k+1}  C_k ] as a kl × k(l+1) matrix and J_{kl} = U_{kl} B̂_{k+1} V^T_{kl} as a rank kl approximation to A. The approximation is good if ||Â_{kl}||_F is small, and the quantities on the right hand side of (4.6) are easily computable during a computation.

5 The lazy UB_{k+1}V partial decomposition

We adapt the BLAS-3 algorithm for reduction to bandwidth k+1 using Householder reductions of block size k. Dense implementations were given by Grösser and Lang [10], [16]. Using deferred updates to convert the dense algorithm to a sparse algorithm for k = 1 (the bidiagonal case) is discussed in Howell, Demmel, Fulton, and Marmol [11].
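The run-time residual estimate (4.6) is what the reduction can carry along to monitor the quality of the partial decomposition as blocks are eliminated. A minimal sketch (Python/NumPy; the function name is illustrative, and the computed blocks B and C are assumed to be held as dense arrays):

    import numpy as np

    def trailing_norm_estimate(norm_A, B, C):
        """Run-time use of (4.6): estimate the Frobenius norm of the never-formed
        trailing block A_hat from ||A||_F (available directly from the sparse
        entries) and the norms of the computed banded block B and coupling block C."""
        resid_sq = norm_A**2 - np.linalg.norm(B, "fro")**2 - np.linalg.norm(C, "fro")**2
        return np.sqrt(max(resid_sq, 0.0))   # clamp tiny negatives caused by rounding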

5.1 Notes on Lazy 2-Sided Block Householder Reduction

The pseudo-code below (see Footnote 3) should allow the reader to verify the following points.

- Entries of A_old are not changed. The only accesses to A_old are for SMDM multiplications and extractions of matrix sub-blocks. If l pairs of row and column blocks of size k are eliminated, then A_old is accessed for 2l SMDM operations consisting of multiplication of A_old by blocks of k vectors.

- Block Householder transformations can be used for the reduction. The implementations of qlt and qrl used, but not specifically detailed here, use the algorithms due to Schreiber and Van Loan [23]; a sketch of such a kernel is given after the pseudo-code below. Alternately, the method proposed by Joffrain, Low, Quintana-Orti, Van de Geijn, and Van Zee [14] or by Puglisi [21] could be used. Householder transformations are reliably orthogonal. Blocking the transformations enables use of BLAS-3.

- Operations updating column and row blocks and forming the update matrices are BLAS-3. BLAS-2 operations occur only in initializations and copies, and in the qlt and qrl formation of block Householder transformations. If the qlt and qrl operations are BLAS-2, the total number of BLAS-2 flops is O(mnk) for elimination of all columns, or O((m + n)k^2 l) for elimination of l k-sized blocks. In comparison (see (5.14)), there are 6(lk)^2(m + n) - 8(lk)^3 BLAS-3 flops for eliminating l blocks of k rows and columns.

Footnote 3: As presented, and given appropriate qrl and qlt functions, the algorithm closely follows an octave script implementation.

As presented, the block Householder reduction runs to completion, which is useful in that the returned matrix B_{k+1} can be observed to have very nearly the same singular values as the input matrix, enabling a test for correct implementation. More usually, for a large sparse matrix, the returned matrix B_{k+1} has dimension kl × kl, kl << n, where l blocks have been eliminated. l is chosen either a priori as an input or from a convergence criterion, and the algorithm is ended at the "! End of loop on blocks" comment in the pseudo-code. The returned matrix B_{k+1} then has nonzero entries confined to the diagonal and k superdiagonals, B_{k+1} satisfying

    [ B_{k+1}   C_k   0 ]
    [ 0       A_updated ]  =  ( Π_{i=1}^{l} (I - U_i L_i U_i^T) ) (A_old + E) ( Π_{i=1}^{l} (I - V_i^T T_i V_i) )        (5.1)

In (5.1), C_k is a lower triangular k × k matrix. Denote by ε the largest number satisfying 1 = fl(1 + ε). Due to the use of block Householder transformations, E satisfies

    ||E|| / ||A|| = O(ε),        (5.2)

i.e., the UB_{k+1}V decomposition is backward stable. When the algorithm is not run to completion, so that the never actually computed A_updated in (5.1) has size (m - kl) × (n - kl), we typically assume the E term in (5.1) to be negligible compared to A_updated. As indicated by (4.6), runtime estimates of ||A_updated||_F enable estimates of ||E||_F.

5.2 Pseudo-code for lazy UB_{k+1}V

As given here, the code runs to completion, returning an upper triangular banded matrix with bandwidth k + 1 (see (5.5) below). In exact arithmetic, the returned matrix has the same singular values as the original matrix and is related to the original matrix by

    A_return = ( Π_{i=1}^{l} (I - U_i L_i U_i^T) ) A_orig ( Π_{i=1}^{l-1} (I - V_i^T T_i V_i) )        (5.3)

or by

    A_return = (I - U_l L_l U_l^T)(A_orig - UZ - WV)        (5.4)

where (I - U_i L_i U_i^T) and (I - V_i^T T_i V_i) are block Householder transformations. On return the blocks U_i are stored in the ith block of k columns of U (lower triangular) and the blocks V_i are stored in the ith block of k rows of V (upper triangular). Similarly, W is lower triangular and Z upper triangular (see Footnote 4).

Footnote 4: The block letters U, V, W, Z refer to matrices used in the algorithm, while U and V in A = U B_{k+1} V refer to generic orthogonal matrices. In exact arithmetic, A - UZ - WV is an orthogonal transformation of A, but none of the matrices U, Z, W, V are themselves orthogonal.

The algorithm proceeds by alternately eliminating blocks of k columns and k rows. When no more blocks of size k can be made, then the rest of the columns are eliminated as one block. Hence the last block of L can be up to twice as large as the others.

For A = U B_4 V, m = 10, n = 8, k = 3, the returned B_4 has the following form:

    B_4 = [ x x x x 0 0 0 0 ]
          [ 0 x x x x 0 0 0 ]
          [ 0 0 x x x x 0 0 ]
          [ 0 0 0 x x x x x ]
          [ 0 0 0 0 x x x x ]
          [ 0 0 0 0 0 x x x ]
          [ 0 0 0 0 0 0 x x ]
          [ 0 0 0 0 0 0 0 x ]
          [ 0 0 0 0 0 0 0 0 ]
          [ 0 0 0 0 0 0 0 0 ]        (5.5)

Capital letters are used below to indicate that variables are matrices (as opposed to vectors). Accesses to the original sparse matrix are commented as either extractions of blocks or as SMDM operations.

Pseudo-Code for UB_{k+1}V

Function [B, U, W, V, Z, L, L_temp, T] = band(m, n, k, A_old)
! Assume m > n
! Input Variables
!   m      number of rows
!   n      number of columns
!   k      number of superdiagonals in the returned matrix
!          (also the block size for multiplications by A_old)
!   A_old  input matrix in sparse storage
! Output Variables
!   B is an m by n matrix with upper bandwidth k+1
!   U (m x n), W (m x n), V (n x n), Z (n x n)
!   L (m x k), T (k x n), L_temp (2k x 2k)
! where (compare to (5.3), (5.4)) the extra term here is from
! eliminating all remaining columns as a final block; for a large
! sparse problem, this final block will not be eliminated:
!   B = (I - U_last L_temp U_last^T) [ prod_i (I - U_i L_i U_i^T) ] A_old [ prod_i (I - V_i^T T_i V_i) ]
!     = (I - U_last L_temp U_last^T) (A_old - UZ - WV)
! where W = [W_1 W_2 ... W_l], U = [U_1 U_2 ... U_l],
!       V^T = [V_1^T V_2^T ... V_l^T], Z^T = [Z_1^T Z_2^T ... Z_l^T],
! where each of the blocks U_i, W_i has k columns,
! and each block V_i, Z_i has k rows

! and U_last may have up to 2k columns.

! Initializations
B ← 0_{m,n} ;  W ← 0_{m,n} ;  U ← 0_{m,n} ;  V ← 0_{n,n} ;  Z ← 0_{n,n} ;
L ← 0_{m,k} ;  T ← 0_{n,k} ;
b_lks ← floor((n-k)/k) ;
for i = 1:b_lks,
    i_l(i) ← (i-1)k + 1 ;   i_h(i) ← i k ;
end
if ( k b_lks != n )
    i_l(b_lks+1) ← k b_lks + 1 ;   i_h(b_lks+1) ← n ;
end
m_now ← m ;   n_now ← k ;
A_temp ← A_old( : , 1:k) ;          ! Extract first column block of A_old
i_low ← 1 ;   i_hi ← k ;   i_hp1 ← k+1 ;
C ← A_old( : , 1:k) ;               ! Extract first column block of A_old
! qrl returns the QR factorization of the
! first column block of A_old, where
!   R = (I - U_temp L_temp U_temp^T) C
[ U_temp, R, L_temp ] ← qrl(m, k, C) ;
U(i_low:m, i_low:i_hi) ← U_temp ;
L(i_low:i_hi, : ) ← L_temp ;
B(i_low:m, i_low:i_hi) ← R ;
! C will be the update of the first row block of A_old
L_ua ← L_temp (U_temp^T) A_old(i_low:m, i_hp1:n) ;    ! SMDM multiplication with A_old
C ← A_old(i_low:i_hi, i_hp1:n) ;                      ! Extract a row block of A_old
C ← C - U(1:k, 1:k) L_ua ;
! qlt performs the QL factorization of C so that
!   L_r = C ( I - V_temp^T T_temp V_temp )
[ V_temp, L_r, T_temp ] ← qlt(k, n-k, C) ;
B(1:k, k+1:n) ← L_r ;
T(1:k, : ) ← T_temp ;
! Get the first blocks for U, V, Z, W
T_emp ← T_temp V_temp ;

Z(1:k, k+1:n) ← L_ua - (L_ua V_temp^T) T_emp ;
T_emp2 ← V_temp^T T_temp ;
W( : , 1:k) ← A_old( : , k+1:n) T_emp2 ;
V(1:k, k+1:n) ← V_temp ;

! Now loop through all but the end block.
! In the usual application to large sparse matrices
! the number of loops b_lks is constrained by available RAM
! or by satisfaction of a convergence requirement.
for i = 2 : b_lks,
    i_low ← i_l(i) ;   i_hi ← i_h(i) ;   i_hp1 ← i_h(i) + 1 ;
    ! To proceed with a reduction to banded form,
    ! we need to multiply the updated A
    !   A_updated ← A_old - U Z - W V
    ! by a block of vectors X. Since A_updated is presumed dense,
    ! A_updated X is accomplished as
    !   A_old X - U (Z X) - W (V X)

    ! Update the current column block of A
    C ← A_old(i_low:m, i_low:i_hi) ;                  ! Extract a column block of A_old
    C ← C - U(i_low:m, 1:i_low-1) Z(1:i_low-1, i_low:i_hi) ;
    C ← C - W(i_low:m, 1:i_low-1) V(1:i_low-1, i_low:i_hi) ;
    ! qrl performs the QR factorization of the current column block.
    [ U_temp, R, L_temp ] ← qrl(m-i_low+1, k, C) ;
    U(i_low:m, i_low:i_hi) ← U_temp ;
    L(i_low:i_hi, : ) ← L_temp ;
    ! Multiply (L_i U_i^T) A_updated
    B(i_low:m, i_low:i_hi) ← R ;
    L_up ← L_temp U_temp^T ;
    L_ua ← L_up A_old(i_low:m, i_hp1:n) ;             ! SMDM with A_old
    L_ua ← L_ua - (L_up U(i_low:m, 1:i_low-1)) Z(1:i_low-1, i_hp1:n) ;
    L_ua ← L_ua - (L_up W(i_low:m, 1:i_low-1)) V(1:i_low-1, i_hp1:n) ;
    ! Update the current row block of B
    C ← A_old(i_low:i_hi, i_hp1:n) ;                  ! Extract a row block of A_old
    C ← C - U(i_low:i_hi, 1:i_low-1) Z(1:i_low-1, i_hp1:n) ;
    C ← C - W(i_low:i_hi, 1:i_low-1) V(1:i_low-1, i_hp1:n) ;

    ! The row block also needs the update from the current column block
    C ← C - U(i_low:i_hi, i_low:i_hi) L_ua ;
    ! Having updated the current row block, get its
    ! QL factorization by calling qlt
    [ V_temp, L_r, T_temp ] ← qlt(k, n-i_hi, C) ;
    B(i_low:i_hi, i_hp1:n) ← L_r ;
    T(i_low:i_hi, : ) ← T_temp ;
    ! Get the next blocks for the U, V, Z, W matrices
    T_emp ← T_temp V_temp ;
    Z(i_low:i_hi, i_hp1:n) ← L_ua - (L_ua V_temp^T) T_emp ;
    T_emp2 ← V_temp^T T_temp ;
    T_emp3 ← A_old(i_low:m, i_hp1:n) T_emp2 ;         ! SMDM with A_old
    T_emp3 ← T_emp3 - U(i_low:m, 1:i_low-1) (Z(1:i_low-1, i_hp1:n) T_emp2) ;
    T_emp3 ← T_emp3 - W(i_low:m, 1:i_low-1) (V(1:i_low-1, i_hp1:n) T_emp2) ;
    W(i_low:m, i_low:i_hi) ← T_emp3 ;
    V(i_low:i_hi, i_hp1:n) ← V_temp ;
end      ! End of loop on blocks

! We've eliminated all the row blocks of width k.
! Eliminate the rest of the columns as one block
i_low ← i_h(b_lks) + 1 ;   i_hi ← n ;
! Update the current column block from A_old
C ← A_old(i_low:m, i_low:i_hi) ;                      ! Extract a block of A_old
C ← C - U(i_low:m, 1:i_low-1) Z(1:i_low-1, i_low:i_hi) ;
C ← C - W(i_low:m, 1:i_low-1) V(1:i_low-1, i_low:i_hi) ;
! qrl for QR factorization of the last column block.
[ U_temp, R, L_temp ] ← qrl(m-i_low+1, n-i_low+1, C) ;
U(i_low:m, i_low:i_hi) ← U_temp ;
L_2 ← L_temp ;
B(i_low:m, i_low:i_hi) ← R ;
Endfunction
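The qrl and qlt kernels called above are not spelled out beyond the reference to Schreiber and Van Loan [23]. The following is a rough sketch, in Python with NumPy/SciPy rather than the octave/Fortran of the actual implementation, of what a qrl-style kernel can return: the Householder vectors U, the triangular factor R, and a small matrix L with (I - U L U^T) C = [R; 0]. The function name, argument list, and the exact form of L are assumptions for illustration, not the paper's code.

    import numpy as np
    from scipy.linalg import qr

    def qrl(C):
        """Compact WY block Householder factorization of a tall block C:
        returns (U, R, L) with (I - U L U^T) C = [R; 0], R upper triangular."""
        m, k = C.shape
        (QR, tau), _ = qr(C, mode="raw")               # raw LAPACK geqrf output
        U = np.tril(QR[:, :k], -1) + np.eye(m, k)      # unit-lower-triangular Householder vectors
        T = np.zeros((k, k))                           # accumulate T as in LAPACK dlarft (forward)
        for j in range(k):
            T[:j, j] = -tau[j] * (T[:j, :j] @ (U[:, :j].T @ U[:, j]))
            T[j, j] = tau[j]
        R = np.triu(QR[:k, :k])
        return U, R, T.T                               # Q^T = I - U T^T U^T applies the reduction

    # quick check that the block reflector reproduces the triangular factor
    C = np.random.default_rng(0).standard_normal((12, 4))
    U, R, L = qrl(C)
    reduced = C - U @ (L @ (U.T @ C))
    assert np.allclose(reduced[:4], R)
    assert np.allclose(reduced[4:], 0.0)

A qlt-style kernel would be the analogous factorization applied from the right to a short, wide block, returning the lower triangular factor L_r and the block transform T_temp.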

5.3 Comparison to Dense 2-Sided Block Householder Reduction

This section makes explicit some differences between the sparse algorithm presented in the pseudo-code and the more usual dense algorithm. The dense algorithm proceeds by alternately eliminating column and row blocks. Consider a partitioning of the original matrix

    A = [ A_11   A_12   A_13 ]
        [ A_21   A_22   A_23 ]        (5.6)
        [ A_31   A_32   A_33 ]

For the sparse algorithm, elimination of a block of columns corresponding to A_21 and A_31 and an initial row block corresponding to A_12 and A_13 has changed no entries of A. A_11 corresponds to an upper triangular matrix B_11. A_21, A_31, A_13, and the upper triangular part of A_12 would have been eliminated in the dense algorithm. The dense algorithm would update the trailing matrix

    [ A_22   A_23 ]
    [ A_32   A_33 ]        (5.7)

5.3.1 Multiplying ÂX

In the case of a large sparse matrix, we can't actually form the updated Â = A - UZ - WV, as it would be dense and exhaust RAM. Instead compute

    ÂX = AX - U(ZX) - W(VX)        (5.8)

Where the dense algorithm would have an already updated block to eliminate, the sparse algorithm extracts the corresponding block of the original matrix and performs a just-in-time update (a small sketch follows this list). For example, for the block A_32:

- Perform the sequence of block Householder eliminations which had already been made to eliminate A_21, A_11, A_12. These are all BLAS-3:

    Â_32 ← A_32 - U(3, :) Z(:, 2) - W(3, :) V(:, 2)        (5.9)

- Perform a QR factorization of Â_32.

- New blocks of W and Z are formed by multiplying the currently produced blocks of dense vectors by the sparse A; e.g., the block W(:, 3) is formed from a product of A with a block of dense vectors derived from V(3, :).
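The just-in-time update of an extracted block can be sketched as follows (Python/SciPy, with illustrative names; the not-yet-filled parts of U, Z, W, V are assumed to be zero, so multiplying by the full arrays is equivalent to the partial products written in the pseudo-code).

    import numpy as np
    import scipy.sparse as sp

    def jit_block(A_old, U, Z, W, V, rows, cols):
        """Extract a block of the original sparse matrix and subtract the accumulated
        low-rank correction restricted to those rows and columns, as the
        C <- A_old(...) - U(...)Z(...) - W(...)V(...) lines of the pseudo-code do."""
        r0, r1 = rows          # half-open Python ranges, e.g. (i_low-1, m)
        c0, c1 = cols
        C = A_old[r0:r1, c0:c1].toarray()
        C -= U[r0:r1, :] @ Z[:, c0:c1]
        C -= W[r0:r1, :] @ V[:, c0:c1]
        return C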

5.3.2 Storage and Flop Comparisons

Reducing an m × n, m ≥ n, matrix to upper bandwidth k + 1 by Householder transformations requires 4mn^2 - (4/3)n^3 flops. For the ith elimination of row and column blocks of size k, the dense algorithm requires 4(m - ik)(n - ik)k flops for updates and 4(m - ik)(n - ik)k flops for multiplications of A by blocks of k row and column multipliers, so that the ith block requires

    8k(m - ik)(n - ik)        (5.10)

flops for elimination. The sparse algorithm as illustrated in the pseudo-code differs in that, instead of multiplications of the form A U_i and V_i^T A with a dense updated A, we perform the multiplications

    A_updated U_i = A_old U_i - U(ik+1:m, 1:ik) Z(1:ik, ik+1:n) U_i
                              - W(ik+1:m, 1:ik) V(1:ik, ik+1:n) U_i        (5.11)

and

    V_i^T A_updated = V_i^T A_old - V_i^T U(ik+1:m, 1:ik) Z(1:ik, ik+1:n)
                                  - V_i^T W(ik+1:m, 1:ik) V(1:ik, ik+1:n)        (5.12)

Neglecting the sparse matrix dense matrix flops, the flop count for a completed reduction would be 6mn^2 - 2n^3, with the incremental number of flops for the ith pair of row and column blocks being

    12k(ik)[m + n - 2ik],        (5.13)

requiring a total of

    6(lk)^2(m + n) - 8(lk)^3        (5.14)

flops to eliminate l row and column blocks of size k.

For the dense algorithm, required storage is independent of the number of row-column pairs eliminated. As seen in (5.10), initial row-column eliminations require more flops than later ones. Conversely, for the sparse algorithm, the number of flops for the next block eliminated is proportional to i (when ik << n + m), so that the flop count is proportional to the square of the total number l of eliminated blocks. For m = n, the incremental flop counts for the sparse and dense algorithms are equal at ik = n/4, so that at n/4 the difference in required flops is maximal. For ik > n/4, the dense algorithm becomes more competitive in terms of required flops. In the dense serial algorithm, the size of matrix which can be reduced to small band form (on a single processor) depends on how large a dense matrix will fit in available RAM.
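The incremental counts (5.10) and (5.13) are easy to tabulate; the sketch below (Python, with assumed example sizes) shows the crossover near ik = n/4 noted above.

    def dense_step_flops(m, n, k, i):
        """Dense algorithm: flops to eliminate the i-th row/column block pair, as in (5.10)."""
        return 8 * k * (m - i * k) * (n - i * k)

    def sparse_step_flops(m, n, k, i):
        """Lazy sparse algorithm: incremental flops for the i-th pair, as in (5.13),
        neglecting the sparse-matrix dense-matrix products."""
        return 12 * k * (i * k) * (m + n - 2 * i * k)

    m = n = 100_000
    k = 8
    for frac in (0.10, 0.25, 0.50):
        i = int(frac * n / k)
        print(f"ik = {frac:.2f} n : dense {dense_step_flops(m, n, k, i):.3e}  "
              f"sparse {sparse_step_flops(m, n, k, i):.3e} flops")
    # at ik = n/4 (with m = n) the two incremental counts coincide, as noted in the text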

[Figure 4: Sparse and Dense Flops vs. Columns Eliminated for a Matrix for Which the Dense Algorithm Requires 2 GBytes of Storage. (y-axis: flops, on the order of 10^12; x-axis: row-column pairs eliminated; curves: dense algorithm, sparse algorithm.)]

In the sparse serial algorithm, available storage limits the number of blocks that can be eliminated. Neglecting the storage of A, eliminating l rows and columns requires 2(m + n)l double precision numbers stored in W, U, V and Z. When n/4 rows and columns have been eliminated (lk = n/4), the total storage for U, V, W, and Z is nm/2 + n^2/2, comparable to the total storage required for the dense algorithm. For a double precision in-core serial dense computation, the largest matrix we can expect to reduce with 2 GBytes of RAM is at most 16K square (see Footnote 5). For the sparse matrix with a fixed quantity of RAM, the number l of eliminated rows and columns is inversely proportional to m + n. For example, with 2 GBytes allocated for storage of U, V, W, and Z, then with m + n = 10^5 (one hundred thousand), at most 1250 row-column pairs can be eliminated in-core; for m + n = 10^6 (one million), at most 125 (see Footnote 6).

Footnote 5: K here means 2^10; 16K * 16K * 8 = 2 GBytes, with 8 bytes per double precision number. This assumes that only U, V are stored, overwriting part of A.

Footnote 6: For both the sparse and dense case, this discussion neglects the RAM needed for system requirements, which might typically reduce RAM available for storage to three quarters of installed RAM. When storage use exceeds available RAM, execution times may markedly increase as data must be written to and read from hard disk drives.
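The storage arithmetic above is a one-line computation; the sketch below (Python, assuming 2 GBytes is taken as 2 x 10^9 bytes) reproduces the counts quoted for m + n = 10^5 and 10^6.

    def max_block_pairs(m, n, ram_bytes, bytes_per_double=8):
        """Number of row/column pairs l that fit when W, U, V, Z (about 2(m+n)l
        double precision numbers) must live in ram_bytes, neglecting storage of
        the sparse matrix itself."""
        return int(ram_bytes // (2 * (m + n) * bytes_per_double))

    print(max_block_pairs(50_000, 50_000, 2e9))      # m + n = 10^5  ->  1250, as in the text
    print(max_block_pairs(500_000, 500_000, 2e9))    # m + n = 10^6  ->  125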

[Figure 5: Number of Rows and Columns that can be Eliminated with 2 GBytes of Array Storage. (x-axis: total rows plus columns, in units of 10^5; y-axis: number of row-column pairs that can be eliminated.)]

5.4 Apply a preliminary random block Householder transformation

As so far discussed, sparse block Householder reduction may perform wasted block eliminations. For example, if A is already of banded upper triangular form, Householder elimination of l rows and columns merely extracts the upper left l × l matrix, whose singular values may not be representative of the singular values of A. Taking a preliminary random block Householder transformation ensures that the SMDM operations AX and A^T Y are made with X and Y dense, so that the products are impacted by all nonzero entries of A.

Modification of the UB_{k+1}V algorithm is straightforward. The only access to the original sparse matrix A was for block extraction and multiplications by blocks of dense vectors (SMDMs). For the SMDM operation, denote

    A_0 = A - U_0 Z_0 - W_0 V_0.

Then

    A_0 X = AX - U_0(Z_0 X) - W_0(V_0 X)

and Equation (5.8) becomes

    ÂX = A_0 X - U(ZX) - W(VX).

Each extraction of a row or column block of A is replaced by an extraction and a preliminary update.

So instead of the column extraction C ← A(i_low:m, i_low:i_hi) we have

    C ← A(i_low:m, i_low:i_hi) - U_0(i_low:m, 1:i_low-1) Z_0(1:i_low-1, i_low:i_hi)
                               - W_0(i_low:m, 1:i_low-1) V_0(1:i_low-1, i_low:i_hi)        (5.15)

The preliminary random Householder transformation was used in the numeric experiments discussed in the next section.

6 Numerical Experiments

Numerical tests were with matrices from the Davis UF Sparse Collection [4]. As with Householder implementations of GMRES and ARPACK, UB_{k+1}V is a stable algorithm. Limiting the computation to use 2 GBytes of dimensioned space, then for sufficiently small matrices, singular values of B_{k+1} (all columns eliminated) and of A can both be computed by a 32-bit version of the standard LAPACK program dgesvd. As expected, LAPACK dgesvd gives the same singular values for A and B_{k+1} to high accuracy (see Footnote 7).

[Figure 6: Converged Singular Values for Bandwidth 2. (Legend: converged, 2 GBytes storage; compared; converged, 1 GByte storage. x-axis: matrix size from 10K to 400K, 249 matrices; y-axis: number of converged singular values.) For each of 261 matrices, some singular values were determined. When 2 GBytes of storage could be used without relocation error or segmentation faults, at least 20 singular values were found.]

Footnote 7: Our implementation of lazy Householder computation reduces to upper banded form, and can be run to completion only for matrices with at least as many rows as columns.

[Figure 7: Converged Singular Values for Bandwidth 6. (Legend: converged, 2 GBytes storage; compared; converged, 1 GByte storage. x-axis: matrix size from 10K to 400K, 307 matrices; y-axis: number of converged singular values.) For one case of 307, no singular values were determined. In several other cases, only a few singular values were determined.]

For matrices A large enough that 32-bit LAPACK cannot be used, the UB_{k+1}V algorithm cannot be run to completion, as the storage requirements would be too high. For these larger matrices, singular values of B^N_{k+1} were computed by two calls to LAPACK dgesvd. A first call to dgesvd was for the entire N × N matrix. A second dgesvd call was for the upper left square submatrix B_1 of dimension min(N-k, N-6). The largest L = 2√N + 10 singular values were compared to one another (the number of eliminated rows and columns lk = N is roughly inversely proportional to m + n). Let σ_1 ≥ σ_2 ≥ ... ≥ σ_L be the largest L singular values of B^N_{k+1}, and σ̂_1 ≥ σ̂_2 ≥ ... ≥ σ̂_L the largest singular values of B_1. σ_i was said to be converged if

    |σ_j - σ̂_j| / σ_j < 10^-8   for all j, 1 ≤ j ≤ i.

Converged singular values for different bandwidths agreed to high accuracy.

Figures 6, 7, and 8 were tested on the June 2008 collection. These tests used only 2 GBytes of RAM, compiled with 32-bit integers. For bandwidths 2 (Figure 6), 6 (Figure 7), and 12 (Figure 8) we computed singular values for all the unsymmetric or rectangular matrices with 10^4 < (m + n)/2 < 4 × 10^5. The results for bandwidths 6 and 12 include integer valued matrices. In each case, the number of steps lk = N is calculated so that the 2N(m + n) double precision numbers allocated for W, U, V, Z plus the n_z elements of the sparse matrix A require less than 2 GBytes of storage (taking 8 bytes of storage per double precision number).

[Figure 8: Converged Singular Values for Bandwidth 12. (Legend: converged, 2 GBytes storage; compared; converged, 1 GByte storage. x-axis: matrix size from 10K to 400K, 315 matrices; y-axis: number of converged singular values.) Using block size 12 for matrices larger than 100 thousand was not effective with only 1 GByte of storage. For these matrices, the number of multiplications by A was relatively small.]

For each matrix the code was recompiled to reset parameters for matrix dimensioning. For some of the larger matrices, Fortran code compiled with the g77 compiler suffers relocation errors at compile time, or run-time segmentation faults. These instances were recompiled to use 1 GByte for matrix storage, and rerun. For each bandwidth k = 2, 6, 12, around 300 matrices successfully ran. In each instance, the Frobenius norm of B^N_{k+1} was less than or equal to the Frobenius norm of A, with near equality in some cases. The algorithm used a preliminary random block Householder transformation of block size k.

Figures 6, 7 and 8 plot the number of singular values converged for matrices of sizes 10^4 < (m + n)/2 < 4 × 10^5. In the legend, '.' represents the number of singular values compared, 'o' represents the number converged with 2 GBytes of storage, and 'x' represents the number of converged singular values with 1 GByte of storage. For the largest matrices represented, only about 40 right and left basis vectors could be computed. The maximal number of computed basis vectors was 1500 (representing the flat part of the plots). When an 'o' encloses a '.' or an 'x', all compared singular values converged. The isolated 'o's and 'x's indicate instances for which fewer singular values converged than were compared.
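The convergence test described above reduces to a simple counting loop; a minimal sketch (Python/NumPy, with illustrative names) follows.

    import numpy as np

    def count_converged(sigma, sigma_hat, tol=1e-8):
        """sigma: largest L singular values of the N x N banded matrix; sigma_hat:
        largest L singular values of its leading principal submatrix B_1.
        sigma_i counts as converged only if every sigma_j with j <= i agrees with
        sigma_hat_j to relative tolerance tol, matching the test described above."""
        agree = np.abs(np.asarray(sigma) - np.asarray(sigma_hat)) / np.asarray(sigma) < tol
        converged = 0
        for ok in agree:
            if not ok:
                break
            converged += 1
        return converged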

[Figure 9: Converged Singular Values for Bandwidth 8, 48 GByte basis. (Legend: converged, 64 GBytes storage; compared. x-axis: matrix size from 10 thousand to 10 million, 420 matrices; y-axis: number of converged singular values.) Singular values were determined for 420 matrices, size 10 thousand to 10 million. 7 matrices of size less than 1 million failed to return at least a dozen converged singular values. 64 GBytes of RAM were not always enough to find multiple singular values for matrices of size greater than a million.]

For a high proportion of the test matrices, all the L = 2√N + 10 singular values converged. For bandwidth 2, for 238 of 250 matrices, all the compared singular values converged. The minimal number of converged singular values was 10 of 30 compared. For bandwidth 6, for 278 of 308 matrices, all compared singular values converged. In one instance there were no converged singular values of 36 compared. Two other instances of poor convergence were 2 of 36 and 3 of 28. All the worst cases occurred when only 1 GByte of storage was used. These cases also tended to be the matrices of higher dimension (for which the size N of B_7 was relatively small). The next worst was 9 of 53. For a bandwidth of 12, singular values were computed for 315 matrices. 32 of these had suffered relocation errors or runtime segmentation faults with 2 GBytes of storage, so were rerun allowing 1 GByte for storage. For 259 of 315 matrices, all the compared singular values converged. There were many instances of no converged singular values, especially for large matrices and in the case that only 1 GByte of storage could be used.

[Figure 10: Banded matrix norms as a fraction of the original matrix norm: ratio of Frobenius norms vs. proportion of converged singular values. (x-axis: FrobeniusNorm(A_banded)/FrobeniusNorm(A); y-axis: proportion of compared singular values that converged; 462 matrices.) If the Frobenius norm of the banded matrix is of the same order as the norm of the original matrix, compared singular values are likely to be converged. The two circles at the lower right are exceptional cases.]

6.1 48 GBytes of RAM

The preceding runs were tested against the Davis collection of June 2008. In November of 2009, additional large matrices, several dozen of size greater than a million, had entered the collection. At the same time, 16-core (4 quad-core Opteron) blades had become available in the NC State blade center (see Footnote 8). Using 48 GBytes of RAM for matrix storage, then for test matrices to size about a million, the UB_{k+1}V algorithm with k = 8 determined some singular values in all but a few cases. Figure 9 plots the number of converged singular values vs. the matrix size. For this plot, the number of compared singular values is taken as L = 2√N + 10. For matrices of size greater than a million, fewer rows and columns can be eliminated, and fewer singular values are determined.

Footnote 8: These blades have 64 GBytes of RAM. OpenMP BLAS performance on these machines was plotted in Figure 2. Using 8-byte integers and the ACML 8-byte-integer BLAS library with a PGI Fortran compiler, segmentation faults did not occur.

[Figure 11: Clustered singular values slow convergence; equal singular values are slow to converge. (441 matrices; x-axis: (smallest converged singular value)/(2nd largest singular value); y-axis: fraction of compared singular values that converged.) If the 2nd largest singular value is nearly equal to the smallest converged singular value, few singular values may converge. A small ratio σ_min/σ_2 was a good predictor that most compared singular values would converge. There was one exceptional case with a small ratio for which only about 1/5 of the compared singular values converged.]

Table 3 shows some timing results. Times ranged from about 40 seconds for a matrix of size 50 thousand to about 500 seconds for a matrix of size 322 thousand. For these matrices and the matrix of size 160 thousand, 1250 rows and columns were eliminated, so that singular values were determined from a triangular matrix of size 1250 with bandwidth 8. For the matrix of size 1.96 million, about 200 rows and columns were eliminated. Though the BLAS-3 operations are reasonably fast, the Sparse Matrix Dense Matrix (SMDM) multiplications and other computations (largely skinny QR) are a significant proportion of the 16-core time. Getting good parallel performance for more than 16 processors will require parallelization of skinny QR (see Demmel, Grigori, Hoemmen, and Langou [5] for a successful approach) and more work on the SMDM operations.

6.2 Observations on Convergence

It's natural to expect that the number of converged singular values tends to increase with the basis size and the number of multiplications by the sparse matrix.

We would also expect the number of converged singular values to increase with the fraction of the original matrix Frobenius norm captured in the Frobenius norm of the banded matrix. Conversely, if many large singular values are nearly equal in size, then convergence is likely to be slow, so that the number of converged singular values will tend to be less. These tendencies were evident in experiments with the Davis matrix collection. Figures 10 and 11 are from the same test (48 GBytes of basis vectors and the December 2009 Davis collection) as Figure 9.

[Table 3: Runs of Spar3Bnd, Bandwidth 8: GFlop rates with 16 cores. (Columns: matrix size in thousands; BLAS3 seconds; BLAS3 Gflop rate; SMDM seconds; SMDM Gflop rate; other seconds.)]

The number of converged singular values increases with the size of the Householder basis. Since the number of basis vectors and of eliminated rows and columns is inversely proportional to matrix size (see Figure 5), a decrease in computed singular values with increased matrix size is expected; see Figures 6, 7 and 8. For a fixed number of rows and columns eliminated (fixed usage of RAM), the number of multiplications AX and Y^T A is inversely proportional to the bandwidth k. For a fixed allocation of storage, the number of computed singular values decreases somewhat as k increases (again, see Figures 6, 7 and 8). Conversely, increasing k increases the speed of the computation (see Figure 2).

When the Frobenius norm of the reduced matrix B_{k+1} in Equation (5.1) is near that of the original matrix A, i.e., when

    R = ||B^N_{k+1}||_F / ||A||_F ≈ 1,

convergence of a significant fraction of singular values is likely. Figure 10 plots the proportion of compared singular values that converged vs. R.

When the largest singular values are nearly equal, relatively few singular values may converge. Figure 11 plots the proportion of converged to compared singular values vs. the ratio σ_min/σ_2, where σ_min is the smallest converged singular value and σ_2 is the next-to-largest converged singular value: σ_min ≤ ... ≤ σ_2 ≤ σ_1.

7 Conclusions and Acknowledgements

We report good success in using the lazy UB_{k+1}V decomposition to compute a collection of the largest singular values of sparse matrices. Ongoing work is in

- computing singular vectors and low rank approximations
- comparing performance to other methods of computing sparse matrix singular values
- simplifying and modernizing the code
- improving multi-core performance

Some current work is in using a UB_{k+1}V decomposition for solving sparse least squares problems.

The author wishes to offer thanks for advice and encouragement from Gene Golub and Jim Demmel. He is grateful to Franc Brglez for aid in automating numerical experiments over a fairly large collection of matrices and to Noura Howell for help in editing the manuscript.

References

[1] J. Angeli, O. Basset, C. Fulton, G. Howell, R. Hsu and A. Sawetprawhickal, M. Schuster, D. Richardson, H. Thompson, and S. Wilberscheid. Some issues in efficient implementation of a vector based module for document retrieval, June.

[2] M. Berry, T. Do, G. O'Brien, V. Krishna, and S. Varadhan. SVDPACKC: Version 1.0 user's guide. Technical Report CS, University of Tennessee, Knoxville, TN, October.

[3] J. Choi, J. Dongarra, and D. Walker. The design of a parallel dense linear algebra software library: Reduction to Hessenberg, tridiagonal, and bidiagonal form; Cholesky factorization routines. Num. Alg., 10, LAPACK Working Note #92.

[4] T. Davis. University of Florida sparse matrix collection.

[5] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-optimal parallel and sequential QR and LU factorizations. Technical Report UCB/EECS, LAWN 204, University of California, August.

[6] A. A. Dubrulle. On block Householder algorithms for the reduction of a matrix to Hessenberg form. Supercomputing 88, Vol. II: Science and Applications, Proceedings (IEEE Xplore), 2, Nov.

[7] L. Giraud and J. Langou. Robust selective Gram-Schmidt reorthogonalization. Technical Report TR/PA/02/52, CERFACS, Toulouse, FR.

[8] G. Golub and W. Kahan. Calculating the singular values and pseudo-inverse of a matrix. SIAM J. Num. Anal., 2.

[9] G. Golub, F. Luk, and M. Overton. A block Lanczos method for computing the singular values and corresponding singular vectors of a matrix. ACM Trans. Math. Soft., 7.

[10] B. Grösser and B. Lang. Efficient parallel reduction to bidiagonal form. Preprint BUGHW-SC 98/2 (available online).

[11] G. Howell, J. Demmel, C. Fulton, S. Hammarling, and K. Marmol. BLAS 2.5 Householder bidiagonalization. ACM Transactions on Mathematical Software, 34(3):13-46, May.

[12] E. Im. Optimizing the Performance of Sparse Matrix-Vector Multiplication. PhD thesis, University of California, Berkeley.

[13] W. Jalby and B. Philippe. Stability analysis and improvement of the block Gram-Schmidt algorithm. SIAM J. Sci. Stat. Comput., 12(5).

[14] T. Joffrain, T. M. Low, E. S. Quintana-Orti, R. Van de Geijn, and F. G. Van Zee. Accumulating Householder transformations, revisited. ACM Trans. on Math. Software, 32(2).

[15] L. Kaufman. Application of dense Householder transformation to a sparse matrix. ACM Trans. on Math. Software, 5(4).

[16] B. Lang. Parallel reduction of banded matrices to bidiagonal form. Parallel Comput., 22:1-18.

[17] R. Larsen. PROPACK, software package for sparse SVD. Available from rmunk/propack/.


More information

Scientific Computing

Scientific Computing Scientific Computing Direct solution methods Martin van Gijzen Delft University of Technology October 3, 2018 1 Program October 3 Matrix norms LU decomposition Basic algorithm Cost Stability Pivoting Pivoting

More information

I-v k e k. (I-e k h kt ) = Stability of Gauss-Huard Elimination for Solving Linear Systems. 1 x 1 x x x x

I-v k e k. (I-e k h kt ) = Stability of Gauss-Huard Elimination for Solving Linear Systems. 1 x 1 x x x x Technical Report CS-93-08 Department of Computer Systems Faculty of Mathematics and Computer Science University of Amsterdam Stability of Gauss-Huard Elimination for Solving Linear Systems T. J. Dekker

More information

Index. Copyright (c)2007 The Society for Industrial and Applied Mathematics From: Matrix Methods in Data Mining and Pattern Recgonition By: Lars Elden

Index. Copyright (c)2007 The Society for Industrial and Applied Mathematics From: Matrix Methods in Data Mining and Pattern Recgonition By: Lars Elden Index 1-norm, 15 matrix, 17 vector, 15 2-norm, 15, 59 matrix, 17 vector, 15 3-mode array, 91 absolute error, 15 adjacency matrix, 158 Aitken extrapolation, 157 algebra, multi-linear, 91 all-orthogonality,

More information

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for 1 Algorithms Notes for 2016-10-31 There are several flavors of symmetric eigenvalue solvers for which there is no equivalent (stable) nonsymmetric solver. We discuss four algorithmic ideas: the workhorse

More information

Avoiding Communication in Distributed-Memory Tridiagonalization

Avoiding Communication in Distributed-Memory Tridiagonalization Avoiding Communication in Distributed-Memory Tridiagonalization SIAM CSE 15 Nicholas Knight University of California, Berkeley March 14, 2015 Joint work with: Grey Ballard (SNL) James Demmel (UCB) Laura

More information

Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem

Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National

More information

Tile QR Factorization with Parallel Panel Processing for Multicore Architectures

Tile QR Factorization with Parallel Panel Processing for Multicore Architectures Tile QR Factorization with Parallel Panel Processing for Multicore Architectures Bilel Hadri, Hatem Ltaief, Emmanuel Agullo, Jack Dongarra Department of Electrical Engineering and Computer Science, University

More information

Enhancing Scalability of Sparse Direct Methods

Enhancing Scalability of Sparse Direct Methods Journal of Physics: Conference Series 78 (007) 0 doi:0.088/7-6596/78//0 Enhancing Scalability of Sparse Direct Methods X.S. Li, J. Demmel, L. Grigori, M. Gu, J. Xia 5, S. Jardin 6, C. Sovinec 7, L.-Q.

More information

Finite-choice algorithm optimization in Conjugate Gradients

Finite-choice algorithm optimization in Conjugate Gradients Finite-choice algorithm optimization in Conjugate Gradients Jack Dongarra and Victor Eijkhout January 2003 Abstract We present computational aspects of mathematically equivalent implementations of the

More information

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National

More information

Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222

Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222 Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222 Bilel Hadri 1, Hatem Ltaief 1, Emmanuel Agullo 1, and Jack Dongarra 1,2,3 1 Department

More information

The Lanczos and conjugate gradient algorithms

The Lanczos and conjugate gradient algorithms The Lanczos and conjugate gradient algorithms Gérard MEURANT October, 2008 1 The Lanczos algorithm 2 The Lanczos algorithm in finite precision 3 The nonsymmetric Lanczos algorithm 4 The Golub Kahan bidiagonalization

More information

WHEN MODIFIED GRAM-SCHMIDT GENERATES A WELL-CONDITIONED SET OF VECTORS

WHEN MODIFIED GRAM-SCHMIDT GENERATES A WELL-CONDITIONED SET OF VECTORS IMA Journal of Numerical Analysis (2002) 22, 1-8 WHEN MODIFIED GRAM-SCHMIDT GENERATES A WELL-CONDITIONED SET OF VECTORS L. Giraud and J. Langou Cerfacs, 42 Avenue Gaspard Coriolis, 31057 Toulouse Cedex

More information

Analysis of Block LDL T Factorizations for Symmetric Indefinite Matrices

Analysis of Block LDL T Factorizations for Symmetric Indefinite Matrices Analysis of Block LDL T Factorizations for Symmetric Indefinite Matrices Haw-ren Fang August 24, 2007 Abstract We consider the block LDL T factorizations for symmetric indefinite matrices in the form LBL

More information

14.2 QR Factorization with Column Pivoting

14.2 QR Factorization with Column Pivoting page 531 Chapter 14 Special Topics Background Material Needed Vector and Matrix Norms (Section 25) Rounding Errors in Basic Floating Point Operations (Section 33 37) Forward Elimination and Back Substitution

More information

QR Decomposition in a Multicore Environment

QR Decomposition in a Multicore Environment QR Decomposition in a Multicore Environment Omar Ahsan University of Maryland-College Park Advised by Professor Howard Elman College Park, MD oha@cs.umd.edu ABSTRACT In this study we examine performance

More information

CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform

CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform Edgar Solomonik University of Illinois at Urbana-Champaign September 21, 2016 Fast

More information

1 Multiply Eq. E i by λ 0: (λe i ) (E i ) 2 Multiply Eq. E j by λ and add to Eq. E i : (E i + λe j ) (E i )

1 Multiply Eq. E i by λ 0: (λe i ) (E i ) 2 Multiply Eq. E j by λ and add to Eq. E i : (E i + λe j ) (E i ) Direct Methods for Linear Systems Chapter Direct Methods for Solving Linear Systems Per-Olof Persson persson@berkeleyedu Department of Mathematics University of California, Berkeley Math 18A Numerical

More information

ANONSINGULAR tridiagonal linear system of the form

ANONSINGULAR tridiagonal linear system of the form Generalized Diagonal Pivoting Methods for Tridiagonal Systems without Interchanges Jennifer B. Erway, Roummel F. Marcia, and Joseph A. Tyson Abstract It has been shown that a nonsingular symmetric tridiagonal

More information

A Backward Stable Hyperbolic QR Factorization Method for Solving Indefinite Least Squares Problem

A Backward Stable Hyperbolic QR Factorization Method for Solving Indefinite Least Squares Problem A Backward Stable Hyperbolic QR Factorization Method for Solving Indefinite Least Suares Problem Hongguo Xu Dedicated to Professor Erxiong Jiang on the occasion of his 7th birthday. Abstract We present

More information

Computation of eigenvalues and singular values Recall that your solutions to these questions will not be collected or evaluated.

Computation of eigenvalues and singular values Recall that your solutions to these questions will not be collected or evaluated. Math 504, Homework 5 Computation of eigenvalues and singular values Recall that your solutions to these questions will not be collected or evaluated 1 Find the eigenvalues and the associated eigenspaces

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 7: More on Householder Reflectors; Least Squares Problems Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 15 Outline

More information

Math 411 Preliminaries

Math 411 Preliminaries Math 411 Preliminaries Provide a list of preliminary vocabulary and concepts Preliminary Basic Netwon s method, Taylor series expansion (for single and multiple variables), Eigenvalue, Eigenvector, Vector

More information

Applied Numerical Linear Algebra. Lecture 8

Applied Numerical Linear Algebra. Lecture 8 Applied Numerical Linear Algebra. Lecture 8 1/ 45 Perturbation Theory for the Least Squares Problem When A is not square, we define its condition number with respect to the 2-norm to be k 2 (A) σ max (A)/σ

More information

Matrices, Moments and Quadrature, cont d

Matrices, Moments and Quadrature, cont d Jim Lambers CME 335 Spring Quarter 2010-11 Lecture 4 Notes Matrices, Moments and Quadrature, cont d Estimation of the Regularization Parameter Consider the least squares problem of finding x such that

More information

ON ORTHOGONAL REDUCTION TO HESSENBERG FORM WITH SMALL BANDWIDTH

ON ORTHOGONAL REDUCTION TO HESSENBERG FORM WITH SMALL BANDWIDTH ON ORTHOGONAL REDUCTION TO HESSENBERG FORM WITH SMALL BANDWIDTH V. FABER, J. LIESEN, AND P. TICHÝ Abstract. Numerous algorithms in numerical linear algebra are based on the reduction of a given matrix

More information

This can be accomplished by left matrix multiplication as follows: I

This can be accomplished by left matrix multiplication as follows: I 1 Numerical Linear Algebra 11 The LU Factorization Recall from linear algebra that Gaussian elimination is a method for solving linear systems of the form Ax = b, where A R m n and bran(a) In this method

More information

Index. for generalized eigenvalue problem, butterfly form, 211

Index. for generalized eigenvalue problem, butterfly form, 211 Index ad hoc shifts, 165 aggressive early deflation, 205 207 algebraic multiplicity, 35 algebraic Riccati equation, 100 Arnoldi process, 372 block, 418 Hamiltonian skew symmetric, 420 implicitly restarted,

More information

A Review of Matrix Analysis

A Review of Matrix Analysis Matrix Notation Part Matrix Operations Matrices are simply rectangular arrays of quantities Each quantity in the array is called an element of the matrix and an element can be either a numerical value

More information

Preconditioned Parallel Block Jacobi SVD Algorithm

Preconditioned Parallel Block Jacobi SVD Algorithm Parallel Numerics 5, 15-24 M. Vajteršic, R. Trobec, P. Zinterhof, A. Uhl (Eds.) Chapter 2: Matrix Algebra ISBN 961-633-67-8 Preconditioned Parallel Block Jacobi SVD Algorithm Gabriel Okša 1, Marián Vajteršic

More information

AM 205: lecture 8. Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition

AM 205: lecture 8. Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition AM 205: lecture 8 Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition QR Factorization A matrix A R m n, m n, can be factorized

More information

Linear System of Equations

Linear System of Equations Linear System of Equations Linear systems are perhaps the most widely applied numerical procedures when real-world situation are to be simulated. Example: computing the forces in a TRUSS. F F 5. 77F F.

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences)

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) Lecture 1: Course Overview; Matrix Multiplication Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical

More information

Numerical Analysis Lecture Notes

Numerical Analysis Lecture Notes Numerical Analysis Lecture Notes Peter J Olver 8 Numerical Computation of Eigenvalues In this part, we discuss some practical methods for computing eigenvalues and eigenvectors of matrices Needless to

More information

Roundoff Error. Monday, August 29, 11

Roundoff Error. Monday, August 29, 11 Roundoff Error A round-off error (rounding error), is the difference between the calculated approximation of a number and its exact mathematical value. Numerical analysis specifically tries to estimate

More information

Generalized interval arithmetic on compact matrix Lie groups

Generalized interval arithmetic on compact matrix Lie groups myjournal manuscript No. (will be inserted by the editor) Generalized interval arithmetic on compact matrix Lie groups Hermann Schichl, Mihály Csaba Markót, Arnold Neumaier Faculty of Mathematics, University

More information

Using Godunov s Two-Sided Sturm Sequences to Accurately Compute Singular Vectors of Bidiagonal Matrices.

Using Godunov s Two-Sided Sturm Sequences to Accurately Compute Singular Vectors of Bidiagonal Matrices. Using Godunov s Two-Sided Sturm Sequences to Accurately Compute Singular Vectors of Bidiagonal Matrices. A.M. Matsekh E.P. Shurina 1 Introduction We present a hybrid scheme for computing singular vectors

More information

Numerical Methods I Non-Square and Sparse Linear Systems

Numerical Methods I Non-Square and Sparse Linear Systems Numerical Methods I Non-Square and Sparse Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 25th, 2014 A. Donev (Courant

More information

Hybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC

Hybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,

More information

Parallel Singular Value Decomposition. Jiaxing Tan

Parallel Singular Value Decomposition. Jiaxing Tan Parallel Singular Value Decomposition Jiaxing Tan Outline What is SVD? How to calculate SVD? How to parallelize SVD? Future Work What is SVD? Matrix Decomposition Eigen Decomposition A (non-zero) vector

More information

The geometric mean algorithm

The geometric mean algorithm The geometric mean algorithm Rui Ralha Centro de Matemática Universidade do Minho 4710-057 Braga, Portugal email: r ralha@math.uminho.pt Abstract Bisection (of a real interval) is a well known algorithm

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 19: More on Arnoldi Iteration; Lanczos Iteration Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis I 1 / 17 Outline 1

More information

Last Time. Social Network Graphs Betweenness. Graph Laplacian. Girvan-Newman Algorithm. Spectral Bisection

Last Time. Social Network Graphs Betweenness. Graph Laplacian. Girvan-Newman Algorithm. Spectral Bisection Eigenvalue Problems Last Time Social Network Graphs Betweenness Girvan-Newman Algorithm Graph Laplacian Spectral Bisection λ 2, w 2 Today Small deviation into eigenvalue problems Formulation Standard eigenvalue

More information

Singular Value Decompsition

Singular Value Decompsition Singular Value Decompsition Massoud Malek One of the most useful results from linear algebra, is a matrix decomposition known as the singular value decomposition It has many useful applications in almost

More information

Key words. conjugate gradients, normwise backward error, incremental norm estimation.

Key words. conjugate gradients, normwise backward error, incremental norm estimation. Proceedings of ALGORITMY 2016 pp. 323 332 ON ERROR ESTIMATION IN THE CONJUGATE GRADIENT METHOD: NORMWISE BACKWARD ERROR PETR TICHÝ Abstract. Using an idea of Duff and Vömel [BIT, 42 (2002), pp. 300 322

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Decompositions, numerical aspects Gerard Sleijpen and Martin van Gijzen September 27, 2017 1 Delft University of Technology Program Lecture 2 LU-decomposition Basic algorithm Cost

More information

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects Numerical Linear Algebra Decompositions, numerical aspects Program Lecture 2 LU-decomposition Basic algorithm Cost Stability Pivoting Cholesky decomposition Sparse matrices and reorderings Gerard Sleijpen

More information

A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems

A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems Outline A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems Azzam Haidar CERFACS, Toulouse joint work with Luc Giraud (N7-IRIT, France) and Layne Watson (Virginia Polytechnic Institute,

More information

CME 302: NUMERICAL LINEAR ALGEBRA FALL 2005/06 LECTURE 6

CME 302: NUMERICAL LINEAR ALGEBRA FALL 2005/06 LECTURE 6 CME 302: NUMERICAL LINEAR ALGEBRA FALL 2005/06 LECTURE 6 GENE H GOLUB Issues with Floating-point Arithmetic We conclude our discussion of floating-point arithmetic by highlighting two issues that frequently

More information

CALU: A Communication Optimal LU Factorization Algorithm

CALU: A Communication Optimal LU Factorization Algorithm CALU: A Communication Optimal LU Factorization Algorithm James Demmel Laura Grigori Hua Xiang Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-010-9

More information

The Future of LAPACK and ScaLAPACK

The Future of LAPACK and ScaLAPACK The Future of LAPACK and ScaLAPACK Jason Riedy, Yozo Hida, James Demmel EECS Department University of California, Berkeley November 18, 2005 Outline Survey responses: What users want Improving LAPACK and

More information

6.4 Krylov Subspaces and Conjugate Gradients

6.4 Krylov Subspaces and Conjugate Gradients 6.4 Krylov Subspaces and Conjugate Gradients Our original equation is Ax = b. The preconditioned equation is P Ax = P b. When we write P, we never intend that an inverse will be explicitly computed. P

More information

APPLIED NUMERICAL LINEAR ALGEBRA

APPLIED NUMERICAL LINEAR ALGEBRA APPLIED NUMERICAL LINEAR ALGEBRA James W. Demmel University of California Berkeley, California Society for Industrial and Applied Mathematics Philadelphia Contents Preface 1 Introduction 1 1.1 Basic Notation

More information

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11 Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would

More information

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 Instructions Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 The exam consists of four problems, each having multiple parts. You should attempt to solve all four problems. 1.

More information

Intel Math Kernel Library (Intel MKL) LAPACK

Intel Math Kernel Library (Intel MKL) LAPACK Intel Math Kernel Library (Intel MKL) LAPACK Linear equations Victor Kostin Intel MKL Dense Solvers team manager LAPACK http://www.netlib.org/lapack Systems of Linear Equations Linear Least Squares Eigenvalue

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 6 Matrix Models Section 6.2 Low Rank Approximation Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Edgar

More information

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is

More information

Chapter 4 No. 4.0 Answer True or False to the following. Give reasons for your answers.

Chapter 4 No. 4.0 Answer True or False to the following. Give reasons for your answers. MATH 434/534 Theoretical Assignment 3 Solution Chapter 4 No 40 Answer True or False to the following Give reasons for your answers If a backward stable algorithm is applied to a computational problem,

More information

Matrix Algorithms. Volume II: Eigensystems. G. W. Stewart H1HJ1L. University of Maryland College Park, Maryland

Matrix Algorithms. Volume II: Eigensystems. G. W. Stewart H1HJ1L. University of Maryland College Park, Maryland Matrix Algorithms Volume II: Eigensystems G. W. Stewart University of Maryland College Park, Maryland H1HJ1L Society for Industrial and Applied Mathematics Philadelphia CONTENTS Algorithms Preface xv xvii

More information

Lecture 2: Numerical linear algebra

Lecture 2: Numerical linear algebra Lecture 2: Numerical linear algebra QR factorization Eigenvalue decomposition Singular value decomposition Conditioning of a problem Floating point arithmetic and stability of an algorithm Linear algebra

More information

Direct solution methods for sparse matrices. p. 1/49

Direct solution methods for sparse matrices. p. 1/49 Direct solution methods for sparse matrices p. 1/49 p. 2/49 Direct solution methods for sparse matrices Solve Ax = b, where A(n n). (1) Factorize A = LU, L lower-triangular, U upper-triangular. (2) Solve

More information

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) Schodinger equation: Hψ = Eψ Choose a basis set of wave functions Two cases: Orthonormal

More information

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel AGULLO (INRIA / LaBRI) Camille COTI (Iowa State University) Jack DONGARRA (University of Tennessee) Thomas HÉRAULT

More information

Lecture 2 INF-MAT : , LU, symmetric LU, Positve (semi)definite, Cholesky, Semi-Cholesky

Lecture 2 INF-MAT : , LU, symmetric LU, Positve (semi)definite, Cholesky, Semi-Cholesky Lecture 2 INF-MAT 4350 2009: 7.1-7.6, LU, symmetric LU, Positve (semi)definite, Cholesky, Semi-Cholesky Tom Lyche and Michael Floater Centre of Mathematics for Applications, Department of Informatics,

More information

REORTHOGONALIZATION FOR GOLUB KAHAN LANCZOS BIDIAGONAL REDUCTION: PART II SINGULAR VECTORS

REORTHOGONALIZATION FOR GOLUB KAHAN LANCZOS BIDIAGONAL REDUCTION: PART II SINGULAR VECTORS REORTHOGONALIZATION FOR GOLUB KAHAN LANCZOS BIDIAGONAL REDUCTION: PART II SINGULAR VECTORS JESSE L. BARLOW Department of Computer Science and Engineering, The Pennsylvania State University, University

More information

Implementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint:

Implementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint: Implementing QR Factorization Updating Algorithms on GPUs Andrew, Robert and Dingle, Nicholas J. 214 MIMS EPrint: 212.114 Manchester Institute for Mathematical Sciences School of Mathematics The University

More information

A DIVIDE-AND-CONQUER METHOD FOR THE TAKAGI FACTORIZATION

A DIVIDE-AND-CONQUER METHOD FOR THE TAKAGI FACTORIZATION SIAM J MATRIX ANAL APPL Vol 0, No 0, pp 000 000 c XXXX Society for Industrial and Applied Mathematics A DIVIDE-AND-CONQUER METHOD FOR THE TAKAGI FACTORIZATION WEI XU AND SANZHENG QIAO Abstract This paper

More information

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 9

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 9 STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 9 1. qr and complete orthogonal factorization poor man s svd can solve many problems on the svd list using either of these factorizations but they

More information

Linear System of Equations

Linear System of Equations Linear System of Equations Linear systems are perhaps the most widely applied numerical procedures when real-world situation are to be simulated. Example: computing the forces in a TRUSS. F F 5. 77F F.

More information

Communication-avoiding LU and QR factorizations for multicore architectures

Communication-avoiding LU and QR factorizations for multicore architectures Communication-avoiding LU and QR factorizations for multicore architectures DONFACK Simplice INRIA Saclay Joint work with Laura Grigori INRIA Saclay Alok Kumar Gupta BCCS,Norway-5075 16th April 2010 Communication-avoiding

More information