Lazy Householder Decomposition of Sparse Matrices


G.W. Howell
North Carolina State University
Raleigh, North Carolina

August 26, 2010

Supported by NIH Molecular Libraries Roadmap for Medical Research, Grant 1 P20 Hg

Abstract

This paper describes Householder reduction of a rectangular sparse matrix to small band upper triangular form B_{k+1}. B_{k+1} is upper triangular with nonzero entries only on the diagonal and on the nearest k superdiagonals. The algorithm is similar to the Householder reduction used as part of the standard dense SVD computation. For the sparse lazy algorithm, matrix updates are deferred until a row or column block is eliminated. The original sparse matrix is accessed only for sparse matrix dense matrix (SMDM) multiplications and to extract row and column blocks. For a triangular bandwidth of k+1, the SMDM operations multiply the sparse matrix by dense matrices consisting of the k rows or columns of a block Householder transformation. Block Householder transformations are reliably orthogonal, computationally efficient, and have good potential for parallelization. Numeric results presented here indicate that using an initial random block Householder transformation allows computation of a collection of the largest singular values. Some potential applications are in finding low rank matrix approximations and in solving least squares problems.

1 Introduction

In 1965, Golub and Kahan proposed Householder bidiagonalization A = U B_2 V as a first step in determining the singular values of dense matrices. For sparse matrices they proposed Lanczos bidiagonalization as a means of determining a few singular values [8].

Rearranging the order of computation to avoid filling a sparse matrix allows a natural extension of Householder decomposition to the sparse case. Householder transformations are scalably stable and, if blocked, the reduction algorithm is almost entirely BLAS-3, efficient on a variety of computer architectures. Given a sparse matrix A, applications include solving Ax = b and finding x to minimize ||Ax - b||_2. In this paper, we apply the algorithm to finding singular values of A. Key points are:

- Algorithm stability is desirable for reliable solution of large problems. Householder and block Householder transformations are very nearly orthogonal when implemented with rounding arithmetic, enabling simple run-time convergence tests (see Section 4).

- Householder reductions can be applied to sparse matrices by deferring updates of blocks of the original matrix. Updates are not performed until the step on which a column or row block is to be eliminated. Multiplications are accomplished by expressing the updated matrix as a sum of the original sparse matrix and a low rank update (see Footnote 1). The work here is essentially an extension of the dense Grösser and Lang algorithm [10], [16] to the sparse case.

- Block Householder transformations are BLAS-3. On current computer architectures, whether cache-based multicore, GPUs or other hardware accelerators connected to general processors, or distributed parallel computing, dense BLAS-3 matrix matrix multiplies are significantly faster than BLAS-2 matrix vector multiplications. In all these cases BLAS-3 operations are advantageous; Section 3 discusses NUMA (shared memory Non-Uniform Memory Access) performance for wide or tall BLAS-3. Similarly, multiplying a sparse matrix by a dense matrix (SMDM) is faster than multiplying a sparse matrix by a dense vector. If the reduction is to an upper triangular matrix B_{k+1} with nonzero entries on the diagonal and nearest k superdiagonals, then the SMDM operations AX and A^T Y entail dense matrices X and Y with k columns.

Footnote 1: For reduction to Hessenberg form or upper triangular form, the idea of deferring updates has been repeatedly used. Kaufman implemented deferral of updates in sparse Householder QR factorization [15]. For other applications of this idea, see for example ARPACK [19], Sosonkina, Allison, and Watson [25], and Dubrulle [6]. For Householder reduction to bidiagonal form, the idea of deferring updates is implicit in the LAPACK reduction GEBRD (for the dense case) [3], with the extension to the sparse case explicitly outlined in Howell, Demmel, Fulton, Hammarling, and Marmol [11]. The lazy functional language Haskell defers updates.
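To make the deferred-update idea concrete, the following is a minimal sketch (in Python with NumPy/SciPy, not the paper's Fortran code) of how the implicitly updated matrix is applied to a block of dense vectors; the function and variable names are illustrative only.

    import numpy as np
    import scipy.sparse as sp

    def updated_matmul(A, U, Z, W, V, X):
        """Multiply the implicitly updated matrix (A - U Z - W V) by a dense block X.
        A stays in sparse storage; U, Z, W, V hold the low rank update accumulated
        from the row and column blocks eliminated so far, so the dense updated
        matrix is never formed."""
        return A @ X - U @ (Z @ X) - W @ (V @ X)

    # toy usage: a 1000 x 800 sparse matrix, a rank-8 accumulated update, 8 right-hand vectors
    rng = np.random.default_rng(0)
    A = sp.random(1000, 800, density=0.01, format="csr", random_state=0)
    U, Z = rng.standard_normal((1000, 8)), rng.standard_normal((8, 800))
    W, V = rng.standard_normal((1000, 8)), rng.standard_normal((8, 800))
    X = rng.standard_normal((800, 8))
    Y = updated_matmul(A, U, Z, W, V, X)   # equals forming (A - UZ - WV) X densely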

With 64 GBytes of RAM, 48 GBytes of which are allocated to basis storage, then for matrices of size up to about a million square (as illustrated by testing against matrices from Tim Davis's UF collection of matrices [4]), the UB_{k+1}V algorithm usually determined many singular values (see Section 6). For comparison, if one dense 20K by 20K matrix can be stored on a given processor (3.2 GBytes of storage), then 2500 processors would be needed for a dense SCALAPACK computation of singular values for a one million square matrix.

Section 2 compares basis orthogonality and storage requirements for several methods of finding singular values of sparse matrices. Section 3 gives some numeric results justifying the assertion that AX, A^T Y (sparse matrix dense matrix, or SMDM) operations are likely to be faster than Ax, A^T y and discusses shared memory parallelization of wide or tall BLAS-3 operations. Section 4 summarizes some theory, justifying some run-time error estimates. Section 5 is an explicit presentation of the algorithm, provides comparisons of the sparse and dense algorithms, and shows how implicit fill can be useful. Section 6 describes numeric experiments with the Davis UF collection [4] of sparse matrices.

2 Comparison to Other Sparse Methods

The sparse and dense UB_{k+1}V Householder based decompositions are BLAS-3 algorithms with U and V scalably orthogonal. For comparison, consider Lanczos bidiagonalization for finding a few singular values of a sparse matrix, proposed by Golub and Kahan as a sparse alternative to Householder bidiagonalization [8]. Lanczos bidiagonalization can proceed without storing multipliers, relying on a three term recursion, so that only the last few left and right multipliers are needed. Storage requirements are minimal. In exact arithmetic the Lanczos bases would be orthogonal. In rounding arithmetic, there is a rapid loss of orthogonality, and even of linear independence, as illustrated in Figure 1. A matrix of size a few hundred was randomly generated in the program octave, the left and right multipliers were saved, and the numeric rank of the right multiplier basis was calculated as its number of nonzero singular values.

As memory per available node grows, using the memory to get better stability becomes feasible. In order to preserve orthogonality, multiplier vectors are frequently stored, and in a Lanczos algorithm, reorthogonalized. Table 1 compares L_2 condition numbers of various methods for constructing bases for the columns of Hilbert matrices, computed as the ratio of largest to smallest singular values of the basis.
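The loss of orthogonality illustrated in Figure 1 can be reproduced along the following lines. This is a minimal sketch (Python/NumPy rather than the octave script actually used, and with an arbitrary random test matrix, since the exact matrix is not specified): plain Golub-Kahan Lanczos bidiagonalization with no reorthogonalization, followed by a check of the numerical rank of the right multiplier basis, which in exact arithmetic would equal the number of columns.

    import numpy as np

    def golub_kahan_lanczos(A, steps, rng):
        """Golub-Kahan Lanczos bidiagonalization with no reorthogonalization.
        Returns the right multiplier basis V (n x steps); in exact arithmetic
        V would be orthonormal."""
        m, n = A.shape
        V = np.zeros((n, steps))
        v = rng.standard_normal(n)
        v /= np.linalg.norm(v)
        u = np.zeros(m)
        beta = 0.0
        for j in range(steps):
            V[:, j] = v
            u = A @ v - beta * u          # three-term recursion, left vector
            alpha = np.linalg.norm(u)
            u /= alpha
            w = A.T @ u - alpha * v       # three-term recursion, right vector
            beta = np.linalg.norm(w)
            v = w / beta
        return V

    rng = np.random.default_rng(0)
    A = rng.standard_normal((300, 300))   # stand-in for the "size a few hundred" test matrix
    V = golub_kahan_lanczos(A, 150, rng)
    s = np.linalg.svd(V, compute_uv=False)
    print("columns:", V.shape[1], " numerical rank:", int(np.sum(s > 1e-8 * s[0])))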

[Figure 1: Lanczos bases suffer loss of numeric rank. (Plot: rank deficiency of the "orthogonal" Lanczos basis vs. number of columns.)]

[Table 1: L_2 Condition Numbers of Bases. (Columns: matrix sizes; Householder QR; L in LU; MGS QR.)]

Here we factored Hilbert matrices of various sizes and compared condition numbers of Q for QR factorization or L for LU decomposition. If modified Gram-Schmidt as opposed to Householder orthogonalization is used, the number of flops can be halved. Since modified Gram-Schmidt is primarily a BLAS-1 algorithm, use of BLAS-3 block Householder transformations is typically faster, as well as more nearly orthogonal. Alternatively, Jalby and Philippe [13], Vanderstraten [28], Stewart [26], Giraud and Langou [7] and others have designed block Gram-Schmidt algorithms to be of comparable stability to modified Gram-Schmidt, which gives a well-conditioned but not orthogonal basis for an ill-conditioned set of vectors such as those obtained by a Lanczos method. Using block Gram-Schmidt in combination with block Lanczos methods, such as that proposed by Golub, Luk, and Overton [9], would be another possible means of obtaining a stable BLAS-3 algorithm. More usually, Lanczos bidiagonalization with reorthogonalization is used in SVDPACK [2] and PROPACK [17], with theoretical development in Larsen's thesis [18] and work by Simon and Zha [24]. Each of these methods is appropriate for finding some singular values. Sparse MATLAB instead uses ARPACK [19], an Arnoldi method based on BLAS-2 non-blocked Householder transformations.

Table 2 compares some sparse decompositions in terms of required storage and level of BLAS.

[Figure 2: GFlop rates for tall or wide BLAS-3 with 16 cores (plot title: Smallest Dimension vs. Gflop Rate, Skinny BLAS-3 with 16 cores; curves: tall*small, small*wide, wide*tall; x-axis: smallest dimension; y-axis: GFlop rate). The long matrix dimension is held fixed; the x-axis is the smallest dimension. If the OpenMP loop used during initialization imitates the loop used during computation, parallelization is relatively good, presumably because of data locality. Best performance is about 62% of peak. For matrices of practical width 8, performance is only about 10% of peak.]

3 Sparse Matrix Dense Matrix (SMDM) and Wide or Tall BLAS-3

Sparse matrix algorithms often use multiplications of the sparse matrix times a dense vector and BLAS-1 or BLAS-2 operations. On cache-based computer architectures these execute orders of magnitude slower than the peak machine speed, slowed by repeated fetches of the sparse matrix from RAM. Block algorithms replace sparse matrix dense vector multiplications by sparse matrix dense matrix multiplications, and replace BLAS-1 inner products and daxpys by wide or tall BLAS-3 operations. Compared to sparse matrix dense vector and BLAS-1 operations, SMDM multiplications and wide or tall BLAS-3 operations allow more floating point operations for each fetch of a floating point number from RAM.

                 Lanczos        PROPACK         ARPACK            UB_{k+1}V
                 UB_2 V         UB_2 V          GMRES, UHU^T
    Basis Vecs   O(1)           2N              2N                4N
                 Loss of Rank   Uses Re-orthog  Keeps Orthog      Keeps Orthog
    BLAS         BLAS-1         BLAS-1          BLAS-2            BLAS-3
    flops        4Nn_z + O(N)   4Nn_z + 4N^2 n  4Nn_z + 4N^2 n    4Nn_z + 4(n+m)N^2

    Table 2: Summary Chart Comparing Sparse Decompositions.

3.1 Shared Memory Parallelization for Wide or Tall BLAS-3

When dense matrices A and B are too large to fit in fast memory and have smallest dimension smaller than about 100, our tests on multi-core processors indicate that the matrix multiplication AB has a computational rate roughly proportional to the smallest dimension. For a few shared-memory or multi-core processors, we get good parallel speedups with a few OpenMP calls or merely by using a multi-threaded BLAS library. For more than about four cores, a more careful parallelization is needed for the wide or tall BLAS-3 operations which are the predominant calculation in the UB_{k+1}V decomposition. Three special cases were parallelized using the OpenMP library. These were Wide, Tall, and WideTall, which respectively parallelize the cases of small*wide, tall*small, and wide*tall products. The 4-socket, 16-core architecture is NUMA (Non-Uniform Memory Access): each socket has faster access to its own RAM than to the RAM associated with the other sockets. The computational rates illustrated in Figure 2 were obtained by using the same OpenMP loops for matrix initialization as for computation, thereby improving data locality. This numeric experiment was on a four-socket Opteron system running Linux. The same code also produces good data locality and performance on Intel chips (see Footnote 2).

Footnote 2: Using the same OpenMP loops for matrix initialization and computation may fail to produce data locality on other architectures and operating systems. Lack of explicit control over data locality may limit the portability of OpenMP NUMA parallelism.
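The qualitative behavior in Figure 2 (rate growing roughly with the smallest dimension until it reaches a few dozen) can be observed with a generic multi-threaded BLAS through NumPy. The sketch below is illustrative only; it makes no attempt at the OpenMP/NUMA initialization strategy the paper's Fortran code uses, and the chosen long dimension is an assumption.

    import time
    import numpy as np

    def gemm_rate(m, k, n, reps=3):
        """Best-of-reps GFlop/s for a dense (m x k) by (k x n) product."""
        A = np.random.standard_normal((m, k))
        B = np.random.standard_normal((k, n))
        best = float("inf")
        for _ in range(reps):
            t0 = time.perf_counter()
            A @ B
            best = min(best, time.perf_counter() - t0)
        return 2.0 * m * k * n / best / 1e9

    long_dim = 100_000                            # stands in for the fixed long dimension of Figure 2
    for small in (2, 4, 8, 16, 32, 64):
        tall = gemm_rate(long_dim, small, small)  # tall * small
        wide = gemm_rate(small, small, long_dim)  # small * wide
        print(f"smallest dimension {small:3d}: tall*small {tall:6.1f}  small*wide {wide:6.1f} GFlop/s")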

3.2 Sparse Matrix Dense Matrix Products

Many classic iterative schemes for solving systems of sparse linear equations rely on multiplications Ax, y^T A, and BLAS-1 (vector-vector) operations. Accessing A to perform block multiplications AX gives significantly better performance. Figure 3 indicates the relative effects of block size and matrix storage in speeding sparse matrix multiplications. Column blocking is effective in improving performance when nonzero entries are uniformly distributed. For other sparse matrices, different matrix storage schemes can improve performance of the AX kernel. For example, Toledo [27], Angeli et al. [1], and Im's Ph.D. dissertation [12] offer some guidance in arranging storage of A to speed the computation Ax. The OSKI package (Vuduc, Demmel, and Yelick [29]) automates the process of choosing storage of A. Nishtala, Vuduc, Demmel, and Yelick [22] offer some guidance as to when OSKI is likely to be effective.

[Figure 3: Speeding multiplication of a randomly generated sparse matrix: blocking speeds sparse matrix dense matrix multiplies ((Sparse A)*(Dense X), up to a 20X speedup). Curves: 16 vectors in X for A*X and tr(A)*X; 1 vector in x for A*x and tr(A)*x; y-axis: speed in megaflops per second; x-axis: number of column blocks. Blocking the matrix and multiplying by multiple vectors reduce cache misses. The matrix here is 100K by 100K with 500 randomly distributed nonzero entries per row. The computation was on a 64-bit 2.4 GHz two-processor Pentium with 512 MByte L2 cache, compiled with an Intel Fortran compiler. Parallelization is provided with one OpenMP parallel loop for AX, Ax and one for X^T A, x^T A.]
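A minimal sketch of the effect Figure 3 measures (more useful flops per fetch of A when multiplying by several dense vectors at once). It uses SciPy's generic CSR multiply rather than the blocked Fortran kernel described here, and a smaller, sparser random matrix; the sizes are assumptions for illustration.

    import time
    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(0)
    n = 100_000
    # roughly 50 nonzeros per row (smaller than the 500 per row of Figure 3)
    A = sp.random(n, n, density=50 / n, format="csr", random_state=0)

    def mflop_rate(k, reps=3):
        """MFlop/s for the product A @ X with X dense and k columns."""
        X = rng.standard_normal((n, k))
        best = float("inf")
        for _ in range(reps):
            t0 = time.perf_counter()
            A @ X
            best = min(best, time.perf_counter() - t0)
        return 2.0 * A.nnz * k / best / 1e6

    print("A*x,  1 vector :", round(mflop_rate(1)), "MFlop/s")
    print("A*X, 16 vectors:", round(mflop_rate(16)), "MFlop/s")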

For the UB_{k+1}V decomposition, access to A is only for the multiplications AX, Y^T A and for extractions of blocks of A. Almost all other operations are BLAS-3 with minimal dimension k.

4 Some Theory

Using block Householder transformations, we expect the overall condition number of the transformations to be very near one, and expect that if we compute B_{k+1} to satisfy A = U B_{k+1} V, then the singular values of B_{k+1} and A will be closely matched. For a practical sparse algorithm only a partial decomposition is made, i.e., for A of m rows and n columns we compute only the first N rows and columns of B_{k+1}, N < n. The following subsections discuss interlacing of singular values and approximation of A by U_N B^N_{k+1} V_N in the Frobenius norm.

4.1 Interlacing of Singular Values

In the dense case, singular values are typically found by reducing an m × n matrix A to a condensed m × n matrix B (upper triangular, or with banded structure such as bidiagonal) with the same singular values, then finding the singular values of the condensed form by some iterative procedure. For reduction to small band (or upper triangular) form in the sparse case, obtaining an m × n reduced matrix is impractical if the transformations are stored, unstable if they are not stored. Transformations must be stored to maintain orthogonality and linear independence. It's natural to try to use the singular values of an N × N reduced matrix B^N_{k+1}, obtained after eliminating N columns, as approximate singular values of the original matrix A. The singular values of B^N_{k+1} are sometimes called Ritz values of A. Cauchy's interlacing property relates the Ritz values to the singular values of A.

Cauchy's Interlace Theorem [20]: Let C be an n × n Hermitian matrix partitioned as

    C = [ H   B^* ]
        [ B   U   ],

where H is N × N. Let C have eigenvalues α_1 ≥ α_2 ≥ ... ≥ α_n and H have eigenvalues θ_1 ≥ θ_2 ≥ ... ≥ θ_N. Then for j = 1, ..., N,

    α_j ≥ θ_j ≥ α_{j+n-N},        (4.1)

and for l = 1, 2, ..., n,

    θ_{l-n+N} ≥ α_l ≥ θ_l.        (4.2)

Suppose A has been transformed to

    A_N = [ R_N   T_N       ]
          [ 0     B^{k+1}_N ].

Then the Hermitian matrix

    A_N^T A_N = [ R_N^T R_N    R_N^T T_N                           ]
                [ T_N^T R_N    T_N^T T_N + (B^{k+1}_N)^T B^{k+1}_N ]

has the same eigenvalues as A^T A. For the symmetric matrix A_N^T A_N, Cauchy's interlacing theorem implies that the eigenvalues of the symmetric matrix R_N^T R_N interlace with those of A^T A. Since the singular values σ_i of A have the same ordering in size as the eigenvalues λ_i = σ_i^2 of A^T A, Cauchy's interlacing theorem relates the singular values of A and R_N.

As an example of interlacing, consider the singular values of a 4 by 4 upper triangular matrix T. Let T_1, T_2, T_3 be the upper left 1 × 1, 2 × 2, 3 × 3 matrices respectively; the singular values of T_1, T_2, T_3, T then interlace as in (4.1).

Actually, we can do a bit more. When reducing to banded upper triangular form, we get A_N of the form

    A_N = [ R_N   L_N   0 ]
          [ 0     B     C ]        (4.3)
          [ 0     D     E ]

and we naturally wonder whether singular values of R̂ = [ R_N  L_N  0 ] are related to those of A. The following result of Kahan from p. 196 of [20] is applicable.

The Residual Interlace Theorem. Let F be an n × n Hermitian matrix of the form

    F = [ H   C^*   0   ]
        [ C   V     Z^* ]
        [ 0   Z     W   ]

where H is N × N and V is j × j. Define

    M(X) = [ H   C^* ]
           [ C   X   ]        (4.4)

where V - X is assumed to be invertible. Denote the eigenvalues of M(X) as µ_1 ≤ µ_2 ≤ ... ≤ µ_{N+j}. Then each interval [µ_i, µ_{i+N}], i = 1, ..., j, contains a different eigenvalue of F. Also, outside each open interval (µ_l, µ_{l+j}), l = 1, ..., N, there is a different eigenvalue of F.

The residual interlace theorem applies to A_N^T A_N, as it is of the form (suppressing the N subscripts)

    A_N^T A_N = [ R^T R   R^T L                   0             ]
                [ L^T R   L^T L + B^T B + D^T D   B^T C + D^T E ]
                [ 0       C^T B + E^T D           C^T C + E^T E ]

Taking X = L^T L gives V - X = B^T B + D^T D. The theorem will apply if V - X is nonsingular, which will be the case when either the columns of B or the columns of D are linearly independent. We conclude that the j + N singular values α_i of R̂, taken as the square roots of the eigenvalues of M(L^T L), are lower bounds for the top j + N singular values of A. In particular, if α_i, i ≤ N, is the ith largest singular value of R̂, then α_i < σ_i, where σ_i is the ith largest singular value of A. Applying the interlacing theorem to R and R̂, the ith of the N largest singular values of R̂ is larger than the ith singular value η_i of R. Since η_i ≤ α_i ≤ σ_i, the singular values α_i of R̂ are better estimates of the singular values of A than are the singular values η_i of R.

For example, take a matrix A and let R_1, R_2, R_3 be the upper left 2 × 2, 2 × 3, 2 × 4 submatrices respectively. The first two singular values of R_1, R_2, R_3, and A are then nondecreasing in that order, so each wider submatrix gives better lower bounds for the two largest singular values of A.
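The interlacing bounds are easy to check numerically. Since the numerical entries of the examples above were not reproduced here, the short sketch below (Python/NumPy, with an arbitrary random matrix) verifies the column-block form of the bound: B = A[:, :N] gives B^T B equal to the leading N × N principal submatrix of A^T A, so by Cauchy interlacing the singular values of B are bounded above by the corresponding singular values of A, which is the mechanism used in the argument above.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((10, 8))
    sigma = np.linalg.svd(A, compute_uv=False)          # singular values of A, descending
    for N in (2, 4, 6, 8):
        theta = np.linalg.svd(A[:, :N], compute_uv=False)
        assert np.all(theta <= sigma[:N] + 1e-12)       # theta_j <= sigma_j for j = 1..N
        print(f"N={N}:  sigma_1(A[:, :N]) = {theta[0]:.4f}  <=  sigma_1(A) = {sigma[0]:.4f}")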

4.2 Approximation of A by J_{kl} = U_{kl} B̂_{k+1} V^T_{kl}

Suppose that A_{kl} is related to A by the orthogonal transformations U_{kl} and V_{kl} as A_{kl} = U^T_{kl} A V_{kl}. Due to the orthogonality of U_{kl} and V_{kl}, we have ||A_{kl}||_F = ||A||_F. For the algorithm described in the next section, A_{kl} has the form

    A_{kl} = [ B_{k+1}   C_k   0 ]
             [ 0        Â_{kl}   ]        (4.5)

where B_{k+1} is kl × kl and C_k is kl × k. In our instance, C_k has nonzero entries only in its lower triangular k × k block. We're interested in the case that Â_{kl} is not computed, as it would be dense and large and likely to overflow the RAM. Since we have

    ||A||_F^2 = ||A_{kl}||_F^2 = ||B_{k+1}||_F^2 + ||C_k||_F^2 + ||Â_{kl}||_F^2,

it follows that

    ||Â_{kl}||_F^2 = ||A||_F^2 - ||B_{k+1}||_F^2 - ||C_k||_F^2.        (4.6)

Take B̂_{k+1} = [ B_{k+1}  C_k ] as a kl × k(l+1) matrix and J_{kl} = U_{kl} B̂_{k+1} V^T_{kl} as a rank kl approximation to A. The approximation is good if ||Â_{kl}||_F is small, and the quantities on the right hand side of (4.6) are easily computable during a computation.

5 The lazy UB_{k+1}V partial decomposition

We adapt the BLAS-3 algorithm for reduction to bandwidth k+1 using Householder reductions of block size k. Dense implementations were given by Grösser and Lang [10], [16]. Using deferred updates to convert the dense algorithm to a sparse algorithm for k = 1 (the bidiagonal case) is discussed in Howell, Demmel, Fulton, and Marmol [11].
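The run-time residual estimate (4.6) is what the reduction can carry along to monitor the quality of the partial decomposition as blocks are eliminated. A minimal sketch (Python/NumPy; the function name is illustrative, and the computed blocks B and C are assumed to be held as dense arrays):

    import numpy as np

    def trailing_norm_estimate(norm_A, B, C):
        """Run-time use of (4.6): estimate the Frobenius norm of the never-formed
        trailing block A_hat from ||A||_F (available directly from the sparse
        entries) and the norms of the computed banded block B and coupling block C."""
        resid_sq = norm_A**2 - np.linalg.norm(B, "fro")**2 - np.linalg.norm(C, "fro")**2
        return np.sqrt(max(resid_sq, 0.0))   # clamp tiny negatives caused by rounding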

5.1 Notes on Lazy 2-Sided Block Householder Reduction

The pseudo-code below (see Footnote 3) should allow the reader to verify the following points.

- Entries of A_old are not changed. The only accesses to A_old are for SMDM multiplications and extractions of matrix sub-blocks. If l pairs of row and column blocks of size k are eliminated, then A_old is accessed for 2l SMDM operations consisting of multiplication of A_old by blocks of k vectors.

- Block Householder transformations can be used for the reduction. The implementations of qlt and qrl used, but not specifically detailed here, use the algorithms due to Schreiber and Van Loan [23]; a sketch of such a kernel is given after the pseudo-code below. Alternately, the method proposed by Joffrain, Low, Quintana-Orti, Van de Geijn, and Van Zee [14] or by Puglisi [21] could be used. Householder transformations are reliably orthogonal. Blocking the transformations enables use of BLAS-3.

- Operations updating column and row blocks and forming the update matrices are BLAS-3. BLAS-2 operations occur only in initializations and copies, and in the qlt and qrl formation of block Householder transformations. If the qlt and qrl operations are BLAS-2, the total number of BLAS-2 flops is O(mnk) for elimination of all columns, or O((m + n)k^2 l) for elimination of l k-sized blocks. In comparison (see (5.14)), there are 6(lk)^2(m + n) - 8(lk)^3 BLAS-3 flops for eliminating l blocks of k rows and columns.

Footnote 3: As presented, and given appropriate qrl and qlt functions, the algorithm closely follows an octave script implementation.

As presented, the block Householder reduction runs to completion, which is useful in that the returned matrix B_{k+1} can be observed to have very nearly the same singular values as the input matrix, enabling a test for correct implementation. More usually, for a large sparse matrix, the returned matrix B_{k+1} has dimension kl × kl, kl << n, where l blocks have been eliminated. l is chosen either a priori as an input or from a convergence criterion, and the algorithm is ended at the "! End of loop on blocks" comment in the pseudo-code. The returned matrix B_{k+1} then has nonzero entries confined to the diagonal and k superdiagonals, B_{k+1} satisfying

    [ B_{k+1}   C_k   0 ]
    [ 0       A_updated ]  =  ( Π_{i=1}^{l} (I - U_i L_i U_i^T) ) (A_old + E) ( Π_{i=1}^{l} (I - V_i^T T_i V_i) )        (5.1)

In (5.1), C_k is a lower triangular k × k matrix. Denote by ε the largest number satisfying 1 = fl(1 + ε). Due to the use of block Householder transformations, E satisfies

    ||E|| / ||A|| = O(ε),        (5.2)

i.e., the UB_{k+1}V decomposition is backward stable. When the algorithm is not run to completion, so that the never actually computed A_updated in (5.1) has size (m - kl) × (n - kl), we typically assume the E term in (5.1) to be negligible compared to A_updated. As indicated by (4.6), runtime estimates of ||A_updated||_F enable estimates of ||E||_F.

5.2 Pseudo-code for lazy UB_{k+1}V

As given here, the code runs to completion, returning an upper triangular banded matrix with bandwidth k + 1 (see (5.5) below). In exact arithmetic, the returned matrix has the same singular values as the original matrix and is related to the original matrix by

    A_return = ( Π_{i=1}^{l} (I - U_i L_i U_i^T) ) A_orig ( Π_{i=1}^{l-1} (I - V_i^T T_i V_i) )        (5.3)

or by

    A_return = (I - U_l L_l U_l^T)(A_orig - UZ - WV)        (5.4)

where (I - U_i L_i U_i^T) and (I - V_i^T T_i V_i) are block Householder transformations. On return the blocks U_i are stored in the ith block of k columns of U (lower triangular) and the blocks V_i are stored in the ith block of k rows of V (upper triangular). Similarly, W is lower triangular and Z upper triangular (see Footnote 4).

Footnote 4: The block letters U, V, W, Z refer to matrices used in the algorithm, while U and V in A = U B_{k+1} V refer to generic orthogonal matrices. In exact arithmetic, A - UZ - WV is an orthogonal transformation of A, but none of the matrices U, Z, W, V are themselves orthogonal.

The algorithm proceeds by alternately eliminating blocks of k columns and k rows. When no more blocks of size k can be made, then the rest of the columns are eliminated as one block. Hence the last block of L can be up to twice as large as the others.

For A = U B_4 V, m = 10, n = 8, k = 3, the returned B_4 has the following form:

    B_4 = [ x x x x 0 0 0 0 ]
          [ 0 x x x x 0 0 0 ]
          [ 0 0 x x x x 0 0 ]
          [ 0 0 0 x x x x x ]
          [ 0 0 0 0 x x x x ]
          [ 0 0 0 0 0 x x x ]
          [ 0 0 0 0 0 0 x x ]
          [ 0 0 0 0 0 0 0 x ]
          [ 0 0 0 0 0 0 0 0 ]
          [ 0 0 0 0 0 0 0 0 ]        (5.5)

Capital letters are used below to indicate that variables are matrices (as opposed to vectors). Accesses to the original sparse matrix are commented as either extractions of blocks or as SMDM operations.

Pseudo-Code for UB_{k+1}V

Function [B, U, W, V, Z, L, L_temp, T] = band(m, n, k, A_old)
! Assume m > n
! Input Variables
!   m      number of rows
!   n      number of columns
!   k      number of superdiagonals in the returned matrix
!          (also the block size for multiplications by A_old)
!   A_old  input matrix in sparse storage
! Output Variables
!   B is an m by n matrix with upper bandwidth k+1
!   U (m x n), W (m x n), V (n x n), Z (n x n)
!   L (m x k), T (k x n), L_temp (2k x 2k)
! where (compare to (5.3), (5.4)) the extra term here is from
! eliminating all remaining columns as a final block; for a large
! sparse problem, this final block will not be eliminated:
!   B = (I - U_last L_temp U_last^T) [ prod_i (I - U_i L_i U_i^T) ] A_old [ prod_i (I - V_i^T T_i V_i) ]
!     = (I - U_last L_temp U_last^T) (A_old - UZ - WV)
! where W = [W_1 W_2 ... W_l], U = [U_1 U_2 ... U_l],
!       V^T = [V_1^T V_2^T ... V_l^T], Z^T = [Z_1^T Z_2^T ... Z_l^T],
! where each of the blocks U_i, W_i has k columns,
! and each block V_i, Z_i has k rows

! and U_last may have up to 2k columns.

! Initializations
B ← 0_{m,n} ;  W ← 0_{m,n} ;  U ← 0_{m,n} ;  V ← 0_{n,n} ;  Z ← 0_{n,n} ;
L ← 0_{m,k} ;  T ← 0_{n,k} ;
b_lks ← floor((n-k)/k) ;
for i = 1:b_lks,
    i_l(i) ← (i-1)k + 1 ;   i_h(i) ← i k ;
end
if ( k b_lks != n )
    i_l(b_lks+1) ← k b_lks + 1 ;   i_h(b_lks+1) ← n ;
end
m_now ← m ;   n_now ← k ;
A_temp ← A_old( : , 1:k) ;          ! Extract first column block of A_old
i_low ← 1 ;   i_hi ← k ;   i_hp1 ← k+1 ;
C ← A_old( : , 1:k) ;               ! Extract first column block of A_old
! qrl returns the QR factorization of the
! first column block of A_old, where
!   R = (I - U_temp L_temp U_temp^T) C
[ U_temp, R, L_temp ] ← qrl(m, k, C) ;
U(i_low:m, i_low:i_hi) ← U_temp ;
L(i_low:i_hi, : ) ← L_temp ;
B(i_low:m, i_low:i_hi) ← R ;
! C will be the update of the first row block of A_old
L_ua ← L_temp (U_temp^T) A_old(i_low:m, i_hp1:n) ;    ! SMDM multiplication with A_old
C ← A_old(i_low:i_hi, i_hp1:n) ;                      ! Extract a row block of A_old
C ← C - U(1:k, 1:k) L_ua ;
! qlt performs the QL factorization of C so that
!   L_r = C ( I - V_temp^T T_temp V_temp )
[ V_temp, L_r, T_temp ] ← qlt(k, n-k, C) ;
B(1:k, k+1:n) ← L_r ;
T(1:k, : ) ← T_temp ;
! Get the first blocks for U, V, Z, W
T_emp ← T_temp V_temp ;

Z(1:k, k+1:n) ← L_ua - (L_ua V_temp^T) T_emp ;
T_emp2 ← V_temp^T T_temp ;
W( : , 1:k) ← A_old( : , k+1:n) T_emp2 ;
V(1:k, k+1:n) ← V_temp ;

! Now loop through all but the end block.
! In the usual application to large sparse matrices
! the number of loops b_lks is constrained by available RAM
! or by satisfaction of a convergence requirement.
for i = 2 : b_lks,
    i_low ← i_l(i) ;   i_hi ← i_h(i) ;   i_hp1 ← i_h(i) + 1 ;
    ! To proceed with a reduction to banded form,
    ! we need to multiply the updated A
    !   A_updated ← A_old - U Z - W V
    ! by a block of vectors X. Since A_updated is presumed dense,
    ! A_updated X is accomplished as
    !   A_old X - U (Z X) - W (V X)

    ! Update the current column block of A
    C ← A_old(i_low:m, i_low:i_hi) ;                  ! Extract a column block of A_old
    C ← C - U(i_low:m, 1:i_low-1) Z(1:i_low-1, i_low:i_hi) ;
    C ← C - W(i_low:m, 1:i_low-1) V(1:i_low-1, i_low:i_hi) ;
    ! qrl performs the QR factorization of the current column block.
    [ U_temp, R, L_temp ] ← qrl(m-i_low+1, k, C) ;
    U(i_low:m, i_low:i_hi) ← U_temp ;
    L(i_low:i_hi, : ) ← L_temp ;
    ! Multiply (L_i U_i^T) A_updated
    B(i_low:m, i_low:i_hi) ← R ;
    L_up ← L_temp U_temp^T ;
    L_ua ← L_up A_old(i_low:m, i_hp1:n) ;             ! SMDM with A_old
    L_ua ← L_ua - (L_up U(i_low:m, 1:i_low-1)) Z(1:i_low-1, i_hp1:n) ;
    L_ua ← L_ua - (L_up W(i_low:m, 1:i_low-1)) V(1:i_low-1, i_hp1:n) ;
    ! Update the current row block of B
    C ← A_old(i_low:i_hi, i_hp1:n) ;                  ! Extract a row block of A_old
    C ← C - U(i_low:i_hi, 1:i_low-1) Z(1:i_low-1, i_hp1:n) ;
    C ← C - W(i_low:i_hi, 1:i_low-1) V(1:i_low-1, i_hp1:n) ;

    ! The row block also needs the update from the current column block
    C ← C - U(i_low:i_hi, i_low:i_hi) L_ua ;
    ! Having updated the current row block, get its
    ! QL factorization by calling qlt
    [ V_temp, L_r, T_temp ] ← qlt(k, n-i_hi, C) ;
    B(i_low:i_hi, i_hp1:n) ← L_r ;
    T(i_low:i_hi, : ) ← T_temp ;
    ! Get the next blocks for the U, V, Z, W matrices
    T_emp ← T_temp V_temp ;
    Z(i_low:i_hi, i_hp1:n) ← L_ua - (L_ua V_temp^T) T_emp ;
    T_emp2 ← V_temp^T T_temp ;
    T_emp3 ← A_old(i_low:m, i_hp1:n) T_emp2 ;         ! SMDM with A_old
    T_emp3 ← T_emp3 - U(i_low:m, 1:i_low-1) (Z(1:i_low-1, i_hp1:n) T_emp2) ;
    T_emp3 ← T_emp3 - W(i_low:m, 1:i_low-1) (V(1:i_low-1, i_hp1:n) T_emp2) ;
    W(i_low:m, i_low:i_hi) ← T_emp3 ;
    V(i_low:i_hi, i_hp1:n) ← V_temp ;
end      ! End of loop on blocks

! We've eliminated all the row blocks of width k.
! Eliminate the rest of the columns as one block
i_low ← i_h(b_lks) + 1 ;   i_hi ← n ;
! Update the current column block from A_old
C ← A_old(i_low:m, i_low:i_hi) ;                      ! Extract a block of A_old
C ← C - U(i_low:m, 1:i_low-1) Z(1:i_low-1, i_low:i_hi) ;
C ← C - W(i_low:m, 1:i_low-1) V(1:i_low-1, i_low:i_hi) ;
! qrl for QR factorization of the last column block.
[ U_temp, R, L_temp ] ← qrl(m-i_low+1, n-i_low+1, C) ;
U(i_low:m, i_low:i_hi) ← U_temp ;
L_2 ← L_temp ;
B(i_low:m, i_low:i_hi) ← R ;
Endfunction
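The qrl and qlt kernels called above are not spelled out beyond the reference to Schreiber and Van Loan [23]. The following is a rough sketch, in Python with NumPy/SciPy rather than the octave/Fortran of the actual implementation, of what a qrl-style kernel can return: the Householder vectors U, the triangular factor R, and a small matrix L with (I - U L U^T) C = [R; 0]. The function name, argument list, and the exact form of L are assumptions for illustration, not the paper's code.

    import numpy as np
    from scipy.linalg import qr

    def qrl(C):
        """Compact WY block Householder factorization of a tall block C:
        returns (U, R, L) with (I - U L U^T) C = [R; 0], R upper triangular."""
        m, k = C.shape
        (QR, tau), _ = qr(C, mode="raw")               # raw LAPACK geqrf output
        U = np.tril(QR[:, :k], -1) + np.eye(m, k)      # unit-lower-triangular Householder vectors
        T = np.zeros((k, k))                           # accumulate T as in LAPACK dlarft (forward)
        for j in range(k):
            T[:j, j] = -tau[j] * (T[:j, :j] @ (U[:, :j].T @ U[:, j]))
            T[j, j] = tau[j]
        R = np.triu(QR[:k, :k])
        return U, R, T.T                               # Q^T = I - U T^T U^T applies the reduction

    # quick check that the block reflector reproduces the triangular factor
    C = np.random.default_rng(0).standard_normal((12, 4))
    U, R, L = qrl(C)
    reduced = C - U @ (L @ (U.T @ C))
    assert np.allclose(reduced[:4], R)
    assert np.allclose(reduced[4:], 0.0)

A qlt-style kernel would be the analogous factorization applied from the right to a short, wide block, returning the lower triangular factor L_r and the block transform T_temp.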

5.3 Comparison to Dense 2-Sided Block Householder Reduction

This section makes explicit some differences between the sparse algorithm presented in the pseudo-code and the more usual dense algorithm. The dense algorithm proceeds by alternately eliminating column and row blocks. Consider a partitioning of the original matrix

    A = [ A_11   A_12   A_13 ]
        [ A_21   A_22   A_23 ]        (5.6)
        [ A_31   A_32   A_33 ]

For the sparse algorithm, elimination of a block of columns corresponding to A_21 and A_31 and an initial row block corresponding to A_12 and A_13 has changed no entries of A. A_11 corresponds to an upper triangular matrix B_11. A_21, A_31, A_13, and the upper triangular part of A_12 would have been eliminated in the dense algorithm. The dense algorithm would update the trailing matrix

    [ A_22   A_23 ]
    [ A_32   A_33 ]        (5.7)

5.3.1 Multiplying ÂX

In the case of a large sparse matrix, we can't actually form the updated Â = A - UZ - WV, as it would be dense and exhaust RAM. Instead compute

    ÂX = AX - U(ZX) - W(VX)        (5.8)

Where the dense algorithm would have an already updated block to eliminate, the sparse algorithm extracts the corresponding block of the original matrix and performs a just-in-time update (a small sketch follows this list). For example, for the block A_32:

- Perform the sequence of block Householder eliminations which had already been made to eliminate A_21, A_11, A_12. These are all BLAS-3:

    Â_32 ← A_32 - U(3, :) Z(:, 2) - W(3, :) V(:, 2)        (5.9)

- Perform a QR factorization of Â_32.

- New blocks of W and Z are formed by multiplying the currently produced blocks of dense vectors by the sparse A; e.g., the block W(:, 3) is formed from a product of A with a block of dense vectors derived from V(3, :).
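The just-in-time update of an extracted block can be sketched as follows (Python/SciPy, with illustrative names; the not-yet-filled parts of U, Z, W, V are assumed to be zero, so multiplying by the full arrays is equivalent to the partial products written in the pseudo-code).

    import numpy as np
    import scipy.sparse as sp

    def jit_block(A_old, U, Z, W, V, rows, cols):
        """Extract a block of the original sparse matrix and subtract the accumulated
        low-rank correction restricted to those rows and columns, as the
        C <- A_old(...) - U(...)Z(...) - W(...)V(...) lines of the pseudo-code do."""
        r0, r1 = rows          # half-open Python ranges, e.g. (i_low-1, m)
        c0, c1 = cols
        C = A_old[r0:r1, c0:c1].toarray()
        C -= U[r0:r1, :] @ Z[:, c0:c1]
        C -= W[r0:r1, :] @ V[:, c0:c1]
        return C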

5.3.2 Storage and Flop Comparisons

Reducing an m × n, m ≥ n, matrix to upper bandwidth k + 1 by Householder transformations requires 4mn^2 - (4/3)n^3 flops. For the ith elimination of row and column blocks of size k, the dense algorithm requires 4(m - ik)(n - ik)k flops for updates and 4(m - ik)(n - ik)k flops for multiplications of A by blocks of k row and column multipliers, so that the ith block requires

    8k(m - ik)(n - ik)        (5.10)

flops for elimination. The sparse algorithm as illustrated in the pseudo-code differs in that, instead of multiplications of the form A U_i and V_i^T A with a dense updated A, we perform the multiplications

    A_updated U_i = A_old U_i - U(ik+1:m, 1:ik) Z(1:ik, ik+1:n) U_i
                              - W(ik+1:m, 1:ik) V(1:ik, ik+1:n) U_i        (5.11)

and

    V_i^T A_updated = V_i^T A_old - V_i^T U(ik+1:m, 1:ik) Z(1:ik, ik+1:n)
                                  - V_i^T W(ik+1:m, 1:ik) V(1:ik, ik+1:n)        (5.12)

Neglecting the sparse matrix dense matrix flops, the flop count for a completed reduction would be 6mn^2 - 2n^3, with the incremental number of flops for the ith pair of row and column blocks being

    12k(ik)[m + n - 2ik],        (5.13)

requiring a total of

    6(lk)^2(m + n) - 8(lk)^3        (5.14)

flops to eliminate l row and column blocks of size k.

For the dense algorithm, required storage is independent of the number of row-column pairs eliminated. As seen in (5.10), initial row-column eliminations require more flops than later ones. Conversely, for the sparse algorithm, the number of flops for the next block eliminated is proportional to i (when ik << n + m), so that the flop count is proportional to the square of the total number l of eliminated blocks. For m = n, the incremental flop counts for the sparse and dense algorithms are equal at ik = n/4, so that at n/4 the difference in required flops is maximal. For ik > n/4, the dense algorithm becomes more competitive in terms of required flops. In the dense serial algorithm, the size of matrix which can be reduced to small band form (on a single processor) depends on how large a dense matrix will fit in available RAM.
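The incremental counts (5.10) and (5.13) are easy to tabulate; the sketch below (Python, with assumed example sizes) shows the crossover near ik = n/4 noted above.

    def dense_step_flops(m, n, k, i):
        """Dense algorithm: flops to eliminate the i-th row/column block pair, as in (5.10)."""
        return 8 * k * (m - i * k) * (n - i * k)

    def sparse_step_flops(m, n, k, i):
        """Lazy sparse algorithm: incremental flops for the i-th pair, as in (5.13),
        neglecting the sparse-matrix dense-matrix products."""
        return 12 * k * (i * k) * (m + n - 2 * i * k)

    m = n = 100_000
    k = 8
    for frac in (0.10, 0.25, 0.50):
        i = int(frac * n / k)
        print(f"ik = {frac:.2f} n : dense {dense_step_flops(m, n, k, i):.3e}  "
              f"sparse {sparse_step_flops(m, n, k, i):.3e} flops")
    # at ik = n/4 (with m = n) the two incremental counts coincide, as noted in the text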

[Figure 4: Sparse and Dense Flops vs. Columns Eliminated for a Matrix for Which the Dense Algorithm Requires 2 GBytes of Storage. (y-axis: flops, on the order of 10^12; x-axis: row-column pairs eliminated; curves: dense algorithm, sparse algorithm.)]

In the sparse serial algorithm, available storage limits the number of blocks that can be eliminated. Neglecting the storage of A, eliminating l rows and columns requires 2(m + n)l double precision numbers stored in W, U, V and Z. When n/4 rows and columns have been eliminated (lk = n/4), the total storage for U, V, W, and Z is nm/2 + n^2/2, comparable to the total storage required for the dense algorithm. For a double precision in-core serial dense computation, the largest matrix we can expect to reduce with 2 GBytes of RAM is at most 16K square (see Footnote 5). For the sparse matrix with a fixed quantity of RAM, the number l of eliminated rows and columns is inversely proportional to m + n. For example, with 2 GBytes allocated for storage of U, V, W, and Z, then with m + n = 10^5 (one hundred thousand), at most 1250 row-column pairs can be eliminated in-core; for m + n = 10^6 (one million), at most 125 (see Footnote 6).

Footnote 5: K here means 2^10; 16K * 16K * 8 = 2 GBytes, with 8 bytes per double precision number. This assumes that only U, V are stored, overwriting part of A.

Footnote 6: For both the sparse and dense case, this discussion neglects the RAM needed for system requirements, which might typically reduce RAM available for storage to three quarters of installed RAM. When storage use exceeds available RAM, execution times may markedly increase as data must be written to and read from hard disk drives.
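The storage arithmetic above is a one-line computation; the sketch below (Python, assuming 2 GBytes is taken as 2 x 10^9 bytes) reproduces the counts quoted for m + n = 10^5 and 10^6.

    def max_block_pairs(m, n, ram_bytes, bytes_per_double=8):
        """Number of row/column pairs l that fit when W, U, V, Z (about 2(m+n)l
        double precision numbers) must live in ram_bytes, neglecting storage of
        the sparse matrix itself."""
        return int(ram_bytes // (2 * (m + n) * bytes_per_double))

    print(max_block_pairs(50_000, 50_000, 2e9))      # m + n = 10^5  ->  1250, as in the text
    print(max_block_pairs(500_000, 500_000, 2e9))    # m + n = 10^6  ->  125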

[Figure 5: Number of Rows and Columns that can be Eliminated with 2 GBytes of Array Storage. (x-axis: total rows plus columns, in units of 10^5; y-axis: number of row-column pairs that can be eliminated.)]

5.4 Apply a preliminary random block Householder transformation

As so far discussed, sparse block Householder reduction may perform wasted block eliminations. For example, if A is already of banded upper triangular form, Householder elimination of l rows and columns merely extracts the upper left l × l matrix, whose singular values may not be representative of the singular values of A. Taking a preliminary random block Householder transformation ensures that the SMDM operations AX and A^T Y are made with X and Y dense, so that the products are impacted by all nonzero entries of A.

Modification of the UB_{k+1}V algorithm is straightforward. The only access to the original sparse matrix A was for block extraction and multiplications by blocks of dense vectors (SMDMs). For the SMDM operation, denote

    A_0 = A - U_0 Z_0 - W_0 V_0.

Then

    A_0 X = AX - U_0(Z_0 X) - W_0(V_0 X)

and Equation (5.8) becomes

    ÂX = A_0 X - U(ZX) - W(VX).

Each extraction of a row or column block of A is replaced by an extraction and a preliminary update.

So instead of the column extraction C ← A(i_low:m, i_low:i_hi) we have

    C ← A(i_low:m, i_low:i_hi) - U_0(i_low:m, 1:i_low-1) Z_0(1:i_low-1, i_low:i_hi)
                               - W_0(i_low:m, 1:i_low-1) V_0(1:i_low-1, i_low:i_hi)        (5.15)

The preliminary random Householder transformation was used in the numeric experiments discussed in the next section.

6 Numerical Experiments

Numerical tests were with matrices from the Davis UF Sparse Collection [4]. As with Householder implementations of GMRES and ARPACK, UB_{k+1}V is a stable algorithm. Limiting the computation to use 2 GBytes of dimensioned space, then for sufficiently small matrices, singular values of B_{k+1} (all columns eliminated) and of A can both be computed by a 32-bit version of the standard LAPACK program dgesvd. As expected, LAPACK dgesvd gives the same singular values for A and B_{k+1} to high accuracy (see Footnote 7).

[Figure 6: Converged Singular Values for Bandwidth 2. (Legend: converged, 2 GBytes storage; compared; converged, 1 GByte storage. x-axis: matrix size from 10K to 400K, 249 matrices; y-axis: number of converged singular values.) For each of 261 matrices, some singular values were determined. When 2 GBytes of storage could be used without relocation error or segmentation faults, at least 20 singular values were found.]

Footnote 7: Our implementation of lazy Householder computation reduces to upper banded form, and can be run to completion only for matrices with at least as many rows as columns.

[Figure 7: Converged Singular Values for Bandwidth 6. (Legend: converged, 2 GBytes storage; compared; converged, 1 GByte storage. x-axis: matrix size from 10K to 400K, 307 matrices; y-axis: number of converged singular values.) For one case of 307, no singular values were determined. In several other cases, only a few singular values were determined.]

For matrices A large enough that 32-bit LAPACK cannot be used, the UB_{k+1}V algorithm cannot be run to completion, as the storage requirements would be too high. For these larger matrices, singular values of B^N_{k+1} were computed by two calls to LAPACK dgesvd. A first call to dgesvd was for the entire N × N matrix. A second dgesvd call was for the upper left square submatrix B_1 of dimension min(N-k, N-6). The largest L = 2√N + 10 singular values were compared to one another (the number of eliminated rows and columns lk = N is roughly inversely proportional to m + n). Let σ_1 ≥ σ_2 ≥ ... ≥ σ_L be the largest L singular values of B^N_{k+1}, and σ̂_1 ≥ σ̂_2 ≥ ... ≥ σ̂_L the largest singular values of B_1. σ_i was said to be converged if

    |σ_j - σ̂_j| / σ_j < 10^-8   for all j, 1 ≤ j ≤ i.

Converged singular values for different bandwidths agreed to high accuracy.

Figures 6, 7, and 8 were tested on the June 2008 collection. These tests used only 2 GBytes of RAM, compiled with 32-bit integers. For bandwidths 2 (Figure 6), 6 (Figure 7), and 12 (Figure 8) we computed singular values for all the unsymmetric or rectangular matrices with 10^4 < (m + n)/2 < 4 × 10^5. The results for bandwidths 6 and 12 include integer valued matrices. In each case, the number of steps lk = N is calculated so that the 2N(m + n) double precision numbers allocated for W, U, V, Z plus the n_z elements of the sparse matrix A require less than 2 GBytes of storage (taking 8 bytes of storage per double precision number).

[Figure 8: Converged Singular Values for Bandwidth 12. (Legend: converged, 2 GBytes storage; compared; converged, 1 GByte storage. x-axis: matrix size from 10K to 400K, 315 matrices; y-axis: number of converged singular values.) Using block size 12 for matrices larger than 100 thousand was not effective with only 1 GByte of storage. For these matrices, the number of multiplications by A was relatively small.]

For each matrix the code was recompiled to reset parameters for matrix dimensioning. For some of the larger matrices, Fortran code compiled with the g77 compiler suffers relocation errors at compile time, or run-time segmentation faults. These instances were recompiled to use 1 GByte for matrix storage, and rerun. For each bandwidth k = 2, 6, 12, around 300 matrices successfully ran. In each instance, the Frobenius norm of B^N_{k+1} was less than or equal to the Frobenius norm of A, with near equality in some cases. The algorithm used a preliminary random block Householder transformation of block size k.

Figures 6, 7 and 8 plot the number of singular values converged for matrices of sizes 10^4 < (m + n)/2 < 4 × 10^5. In the legend, '.' represents the number of singular values compared, 'o' represents the number converged with 2 GBytes of storage, and 'x' represents the number of converged singular values with 1 GByte of storage. For the largest matrices represented, only about 40 right and left basis vectors could be computed. The maximal number of computed basis vectors was 1500 (representing the flat part of the plots). When an 'o' encloses a '.' or an 'x', all compared singular values converged. The isolated 'o's and 'x's indicate instances for which fewer singular values converged than were compared.
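The convergence test described above reduces to a simple counting loop; a minimal sketch (Python/NumPy, with illustrative names) follows.

    import numpy as np

    def count_converged(sigma, sigma_hat, tol=1e-8):
        """sigma: largest L singular values of the N x N banded matrix; sigma_hat:
        largest L singular values of its leading principal submatrix B_1.
        sigma_i counts as converged only if every sigma_j with j <= i agrees with
        sigma_hat_j to relative tolerance tol, matching the test described above."""
        agree = np.abs(np.asarray(sigma) - np.asarray(sigma_hat)) / np.asarray(sigma) < tol
        converged = 0
        for ok in agree:
            if not ok:
                break
            converged += 1
        return converged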

[Figure 9: Converged Singular Values for Bandwidth 8, 48 GByte basis. (Legend: converged, 64 GBytes storage; compared. x-axis: matrix size from 10 thousand to 10 million, 420 matrices; y-axis: number of converged singular values.) Singular values were determined for 420 matrices, size 10 thousand to 10 million. 7 matrices of size less than 1 million failed to return at least a dozen converged singular values. 64 GBytes of RAM were not always enough to find multiple singular values for matrices of size greater than a million.]

For a high proportion of the test matrices, all the L = 2√N + 10 singular values converged. For bandwidth 2, for 238 of 250 matrices, all the compared singular values converged. The minimal number of converged singular values was 10 of 30 compared. For bandwidth 6, for 278 of 308 matrices, all compared singular values converged. In one instance there were no converged singular values of 36 compared. Two other instances of poor convergence were 2 of 36 and 3 of 28. All the worst cases occurred when only 1 GByte of storage was used. These cases also tended to be the matrices of higher dimension (for which the size N of B_7 was relatively small). The next worst was 9 of 53. For a bandwidth of 12, singular values were computed for 315 matrices. 32 of these had suffered relocation errors or runtime segmentation faults with 2 GBytes of storage, so were rerun allowing 1 GByte for storage. For 259 of 315 matrices, all the compared singular values converged. There were many instances of no converged singular values, especially for large matrices and in the case that only 1 GByte of storage could be used.

[Figure 10: Banded matrix norms as a fraction of the original matrix norm: ratio of Frobenius norms vs. proportion of converged singular values. (x-axis: FrobeniusNorm(A_banded)/FrobeniusNorm(A); y-axis: proportion of compared singular values that converged; 462 matrices.) If the Frobenius norm of the banded matrix is of the same order as the norm of the original matrix, compared singular values are likely to be converged. The two circles at the lower right are exceptional cases.]

6.1 48 GBytes of RAM

The preceding runs were tested against the Davis collection of June 2008. In November of 2009, additional large matrices, several dozen of size greater than a million, had entered the collection. At the same time, 16-core (4 quad-core Opteron) blades had become available in the NC State blade center (see Footnote 8). Using 48 GBytes of RAM for matrix storage, then for test matrices to size about a million, the UB_{k+1}V algorithm with k = 8 determined some singular values in all but a few cases. Figure 9 plots the number of converged singular values vs. the matrix size. For this plot, the number of compared singular values is taken as L = 2√N + 10. For matrices of size greater than a million, fewer rows and columns can be eliminated, and fewer singular values are determined.

Footnote 8: These blades have 64 GBytes of RAM. OpenMP BLAS performance on these machines was plotted in Figure 2. Using 8-byte integers and the ACML 8-byte-integer BLAS library with a PGI Fortran compiler, segmentation faults did not occur.

[Figure 11: Clustered singular values slow convergence; equal singular values are slow to converge. (441 matrices; x-axis: (smallest converged singular value)/(2nd largest singular value); y-axis: fraction of compared singular values that converged.) If the 2nd largest singular value is nearly equal to the smallest converged singular value, few singular values may converge. A small ratio σ_min/σ_2 was a good predictor that most compared singular values would converge. There was one exceptional case with a small ratio for which only about 1/5 of the compared singular values converged.]

Table 3 shows some timing results. Times ranged from about 40 seconds for a matrix of size 50 thousand to about 500 seconds for a matrix of size 322 thousand. For these matrices and the matrix of size 160 thousand, 1250 rows and columns were eliminated, so that singular values were determined from a triangular matrix of size 1250 with bandwidth 8. For the matrix of size 1.96 million, about 200 rows and columns were eliminated. Though the BLAS-3 operations are reasonably fast, the Sparse Matrix Dense Matrix (SMDM) multiplications and other computations (largely skinny QR) are a significant proportion of the 16-core time. Getting good parallel performance for more than 16 processors will require parallelization of skinny QR (see Demmel, Grigori, Hoemmen, and Langou [5] for a successful approach) and more work on the SMDM operations.

6.2 Observations on Convergence

It's natural to expect that the number of converged singular values tends to increase with the basis size and the number of multiplications by the sparse matrix.

We would also expect the number of converged singular values to increase with the fraction of the original matrix Frobenius norm captured in the Frobenius norm of the banded matrix. Conversely, if many large singular values are nearly equal in size, then convergence is likely to be slow, so that the number of converged singular values will tend to be less. These tendencies were evident in experiments with the Davis matrix collection. Figures 10 and 11 are from the same test (48 GBytes of basis vectors and the December 2009 Davis collection) as Figure 9.

[Table 3: Runs of Spar3Bnd, Bandwidth 8: GFlop rates with 16 cores. (Columns: matrix size in thousands; BLAS3 seconds; BLAS3 Gflop rate; SMDM seconds; SMDM Gflop rate; other seconds.)]

The number of converged singular values increases with the size of the Householder basis. Since the number of basis vectors and of eliminated rows and columns is inversely proportional to matrix size (see Figure 5), a decrease in computed singular values with increased matrix size is expected; see Figures 6, 7 and 8. For a fixed number of rows and columns eliminated (fixed usage of RAM), the number of multiplications AX and Y^T A is inversely proportional to the bandwidth k. For a fixed allocation of storage, the number of computed singular values decreases somewhat as k increases (again, see Figures 6, 7 and 8). Conversely, increasing k increases the speed of the computation (see Figure 2).

When the Frobenius norm of the reduced matrix B_{k+1} in Equation (5.1) is near that of the original matrix A, i.e., when

    R = ||B^N_{k+1}||_F / ||A||_F ≈ 1,

convergence of a significant fraction of singular values is likely. Figure 10 plots the proportion of compared singular values that converged vs. R.

When the largest singular values are nearly equal, relatively few singular values may converge. Figure 11 plots the proportion of converged to compared singular values vs. the ratio σ_min/σ_2, where σ_min is the smallest converged singular value and σ_2 is the next-to-largest converged singular value: σ_min ≤ ... ≤ σ_2 ≤ σ_1.

7 Conclusions and Acknowledgements

We report good success in using the lazy UB_{k+1}V decomposition to compute a collection of the largest singular values of sparse matrices. Ongoing work is in

- computing singular vectors and low rank approximations
- comparing performance to other methods of computing sparse matrix singular values
- simplifying and modernizing the code
- improving multi-core performance

Some current work is in using a UB_{k+1}V decomposition for solving sparse least squares problems.

The author wishes to offer thanks for advice and encouragement from Gene Golub and Jim Demmel. He is grateful to Franc Brglez for aid in automating numerical experiments over a fairly large collection of matrices and to Noura Howell for help in editing the manuscript.

References

[1] J. Angeli, O. Basset, C. Fulton, G. Howell, R. Hsu and A. Sawetprawhickal, M. Schuster, D. Richardson, H. Thompson, and S. Wilberscheid. Some issues in efficient implementation of a vector based module for document retrieval, June.

[2] M. Berry, T. Do, G. O'Brien, V. Krishna, and S. Varadhan. SVDPACKC: Version 1.0 user's guide. Technical Report CS, University of Tennessee, Knoxville, TN, October.

[3] J. Choi, J. Dongarra, and D. Walker. The design of a parallel dense linear algebra software library: Reduction to Hessenberg, tridiagonal, and bidiagonal form; Cholesky factorization routines. Num. Alg., 10, LAPACK Working Note #92.

[4] T. Davis. University of Florida sparse matrix collection.

[5] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-optimal parallel and sequential QR and LU factorizations. Technical Report UCB/EECS, LAWN 204, University of California, August.

[6] A. A. Dubrulle. On block Householder algorithms for the reduction of a matrix to Hessenberg form. Supercomputing 88, Vol. II: Science and Applications, Proceedings (IEEE Xplore), 2, Nov.

[7] L. Giraud and J. Langou. Robust selective Gram-Schmidt reorthogonalization. Technical Report TR/PA/02/52, CERFACS, Toulouse, FR.

[8] G. Golub and W. Kahan. Calculating the singular values and pseudo-inverse of a matrix. SIAM J. Num. Anal., 2.

[9] G. Golub, F. Luk, and M. Overton. A block Lanczos method for computing the singular values and corresponding singular vectors of a matrix. ACM Trans. Math. Soft., 7.

[10] B. Grösser and B. Lang. Efficient parallel reduction to bidiagonal form. Preprint BUGHW-SC 98/2 (available online).

[11] G. Howell, J. Demmel, C. Fulton, S. Hammarling, and K. Marmol. BLAS 2.5 Householder bidiagonalization. ACM Transactions on Mathematical Software, 34(3):13-46, May.

[12] E. Im. Optimizing the Performance of Sparse Matrix-Vector Multiplication. PhD thesis, University of California, Berkeley.

[13] W. Jalby and B. Philippe. Stability analysis and improvement of the block Gram-Schmidt algorithm. SIAM J. Sci. Stat. Comput., 12(5).

[14] T. Joffrain, T. M. Low, E. S. Quintana-Orti, R. Van de Geijn, and F. G. Van Zee. Accumulating Householder transformations, revisited. ACM Trans. on Math. Software, 32(2).

[15] L. Kaufman. Application of dense Householder transformation to a sparse matrix. ACM Trans. on Math. Software, 5(4).

[16] B. Lang. Parallel reduction of banded matrices to bidiagonal form. Parallel Comput., 22:1-18.

[17] R. Larsen. PROPACK, software package for sparse SVD. Available from rmunk/propack/.


More information

Scientific Computing

Scientific Computing Scientific Computing Direct solution methods Martin van Gijzen Delft University of Technology October 3, 2018 1 Program October 3 Matrix norms LU decomposition Basic algorithm Cost Stability Pivoting Pivoting

More information

I-v k e k. (I-e k h kt ) = Stability of Gauss-Huard Elimination for Solving Linear Systems. 1 x 1 x x x x

I-v k e k. (I-e k h kt ) = Stability of Gauss-Huard Elimination for Solving Linear Systems. 1 x 1 x x x x Technical Report CS-93-08 Department of Computer Systems Faculty of Mathematics and Computer Science University of Amsterdam Stability of Gauss-Huard Elimination for Solving Linear Systems T. J. Dekker

More information

Index. Copyright (c)2007 The Society for Industrial and Applied Mathematics From: Matrix Methods in Data Mining and Pattern Recgonition By: Lars Elden

Index. Copyright (c)2007 The Society for Industrial and Applied Mathematics From: Matrix Methods in Data Mining and Pattern Recgonition By: Lars Elden Index 1-norm, 15 matrix, 17 vector, 15 2-norm, 15, 59 matrix, 17 vector, 15 3-mode array, 91 absolute error, 15 adjacency matrix, 158 Aitken extrapolation, 157 algebra, multi-linear, 91 all-orthogonality,

More information

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for 1 Algorithms Notes for 2016-10-31 There are several flavors of symmetric eigenvalue solvers for which there is no equivalent (stable) nonsymmetric solver. We discuss four algorithmic ideas: the workhorse

More information

Avoiding Communication in Distributed-Memory Tridiagonalization

Avoiding Communication in Distributed-Memory Tridiagonalization Avoiding Communication in Distributed-Memory Tridiagonalization SIAM CSE 15 Nicholas Knight University of California, Berkeley March 14, 2015 Joint work with: Grey Ballard (SNL) James Demmel (UCB) Laura

More information

Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem

Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National

More information

Tile QR Factorization with Parallel Panel Processing for Multicore Architectures

Tile QR Factorization with Parallel Panel Processing for Multicore Architectures Tile QR Factorization with Parallel Panel Processing for Multicore Architectures Bilel Hadri, Hatem Ltaief, Emmanuel Agullo, Jack Dongarra Department of Electrical Engineering and Computer Science, University

More information

Enhancing Scalability of Sparse Direct Methods

Enhancing Scalability of Sparse Direct Methods Journal of Physics: Conference Series 78 (007) 0 doi:0.088/7-6596/78//0 Enhancing Scalability of Sparse Direct Methods X.S. Li, J. Demmel, L. Grigori, M. Gu, J. Xia 5, S. Jardin 6, C. Sovinec 7, L.-Q.

More information

Finite-choice algorithm optimization in Conjugate Gradients

Finite-choice algorithm optimization in Conjugate Gradients Finite-choice algorithm optimization in Conjugate Gradients Jack Dongarra and Victor Eijkhout January 2003 Abstract We present computational aspects of mathematically equivalent implementations of the

More information

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National

More information

Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222

Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222 Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222 Bilel Hadri 1, Hatem Ltaief 1, Emmanuel Agullo 1, and Jack Dongarra 1,2,3 1 Department

More information

The Lanczos and conjugate gradient algorithms

The Lanczos and conjugate gradient algorithms The Lanczos and conjugate gradient algorithms Gérard MEURANT October, 2008 1 The Lanczos algorithm 2 The Lanczos algorithm in finite precision 3 The nonsymmetric Lanczos algorithm 4 The Golub Kahan bidiagonalization

More information

WHEN MODIFIED GRAM-SCHMIDT GENERATES A WELL-CONDITIONED SET OF VECTORS

WHEN MODIFIED GRAM-SCHMIDT GENERATES A WELL-CONDITIONED SET OF VECTORS IMA Journal of Numerical Analysis (2002) 22, 1-8 WHEN MODIFIED GRAM-SCHMIDT GENERATES A WELL-CONDITIONED SET OF VECTORS L. Giraud and J. Langou Cerfacs, 42 Avenue Gaspard Coriolis, 31057 Toulouse Cedex

More information

Analysis of Block LDL T Factorizations for Symmetric Indefinite Matrices

Analysis of Block LDL T Factorizations for Symmetric Indefinite Matrices Analysis of Block LDL T Factorizations for Symmetric Indefinite Matrices Haw-ren Fang August 24, 2007 Abstract We consider the block LDL T factorizations for symmetric indefinite matrices in the form LBL

More information

14.2 QR Factorization with Column Pivoting

14.2 QR Factorization with Column Pivoting page 531 Chapter 14 Special Topics Background Material Needed Vector and Matrix Norms (Section 25) Rounding Errors in Basic Floating Point Operations (Section 33 37) Forward Elimination and Back Substitution

More information

QR Decomposition in a Multicore Environment

QR Decomposition in a Multicore Environment QR Decomposition in a Multicore Environment Omar Ahsan University of Maryland-College Park Advised by Professor Howard Elman College Park, MD oha@cs.umd.edu ABSTRACT In this study we examine performance

More information

CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform

CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform Edgar Solomonik University of Illinois at Urbana-Champaign September 21, 2016 Fast

More information

1 Multiply Eq. E i by λ 0: (λe i ) (E i ) 2 Multiply Eq. E j by λ and add to Eq. E i : (E i + λe j ) (E i )

1 Multiply Eq. E i by λ 0: (λe i ) (E i ) 2 Multiply Eq. E j by λ and add to Eq. E i : (E i + λe j ) (E i ) Direct Methods for Linear Systems Chapter Direct Methods for Solving Linear Systems Per-Olof Persson persson@berkeleyedu Department of Mathematics University of California, Berkeley Math 18A Numerical

More information

ANONSINGULAR tridiagonal linear system of the form

ANONSINGULAR tridiagonal linear system of the form Generalized Diagonal Pivoting Methods for Tridiagonal Systems without Interchanges Jennifer B. Erway, Roummel F. Marcia, and Joseph A. Tyson Abstract It has been shown that a nonsingular symmetric tridiagonal

More information

A Backward Stable Hyperbolic QR Factorization Method for Solving Indefinite Least Squares Problem

A Backward Stable Hyperbolic QR Factorization Method for Solving Indefinite Least Squares Problem A Backward Stable Hyperbolic QR Factorization Method for Solving Indefinite Least Suares Problem Hongguo Xu Dedicated to Professor Erxiong Jiang on the occasion of his 7th birthday. Abstract We present

More information

Computation of eigenvalues and singular values Recall that your solutions to these questions will not be collected or evaluated.

Computation of eigenvalues and singular values Recall that your solutions to these questions will not be collected or evaluated. Math 504, Homework 5 Computation of eigenvalues and singular values Recall that your solutions to these questions will not be collected or evaluated 1 Find the eigenvalues and the associated eigenspaces

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 7: More on Householder Reflectors; Least Squares Problems Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 15 Outline

More information

Math 411 Preliminaries

Math 411 Preliminaries Math 411 Preliminaries Provide a list of preliminary vocabulary and concepts Preliminary Basic Netwon s method, Taylor series expansion (for single and multiple variables), Eigenvalue, Eigenvector, Vector

More information

Applied Numerical Linear Algebra. Lecture 8

Applied Numerical Linear Algebra. Lecture 8 Applied Numerical Linear Algebra. Lecture 8 1/ 45 Perturbation Theory for the Least Squares Problem When A is not square, we define its condition number with respect to the 2-norm to be k 2 (A) σ max (A)/σ

More information

Matrices, Moments and Quadrature, cont d

Matrices, Moments and Quadrature, cont d Jim Lambers CME 335 Spring Quarter 2010-11 Lecture 4 Notes Matrices, Moments and Quadrature, cont d Estimation of the Regularization Parameter Consider the least squares problem of finding x such that

More information

ON ORTHOGONAL REDUCTION TO HESSENBERG FORM WITH SMALL BANDWIDTH

ON ORTHOGONAL REDUCTION TO HESSENBERG FORM WITH SMALL BANDWIDTH ON ORTHOGONAL REDUCTION TO HESSENBERG FORM WITH SMALL BANDWIDTH V. FABER, J. LIESEN, AND P. TICHÝ Abstract. Numerous algorithms in numerical linear algebra are based on the reduction of a given matrix

More information

This can be accomplished by left matrix multiplication as follows: I

This can be accomplished by left matrix multiplication as follows: I 1 Numerical Linear Algebra 11 The LU Factorization Recall from linear algebra that Gaussian elimination is a method for solving linear systems of the form Ax = b, where A R m n and bran(a) In this method

More information

Index. for generalized eigenvalue problem, butterfly form, 211

Index. for generalized eigenvalue problem, butterfly form, 211 Index ad hoc shifts, 165 aggressive early deflation, 205 207 algebraic multiplicity, 35 algebraic Riccati equation, 100 Arnoldi process, 372 block, 418 Hamiltonian skew symmetric, 420 implicitly restarted,

More information

A Review of Matrix Analysis

A Review of Matrix Analysis Matrix Notation Part Matrix Operations Matrices are simply rectangular arrays of quantities Each quantity in the array is called an element of the matrix and an element can be either a numerical value

More information

Preconditioned Parallel Block Jacobi SVD Algorithm

Preconditioned Parallel Block Jacobi SVD Algorithm Parallel Numerics 5, 15-24 M. Vajteršic, R. Trobec, P. Zinterhof, A. Uhl (Eds.) Chapter 2: Matrix Algebra ISBN 961-633-67-8 Preconditioned Parallel Block Jacobi SVD Algorithm Gabriel Okša 1, Marián Vajteršic

More information

AM 205: lecture 8. Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition

AM 205: lecture 8. Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition AM 205: lecture 8 Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition QR Factorization A matrix A R m n, m n, can be factorized

More information

Linear System of Equations

Linear System of Equations Linear System of Equations Linear systems are perhaps the most widely applied numerical procedures when real-world situation are to be simulated. Example: computing the forces in a TRUSS. F F 5. 77F F.

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences)

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) Lecture 1: Course Overview; Matrix Multiplication Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical

More information

Numerical Analysis Lecture Notes

Numerical Analysis Lecture Notes Numerical Analysis Lecture Notes Peter J Olver 8 Numerical Computation of Eigenvalues In this part, we discuss some practical methods for computing eigenvalues and eigenvectors of matrices Needless to

More information

Roundoff Error. Monday, August 29, 11

Roundoff Error. Monday, August 29, 11 Roundoff Error A round-off error (rounding error), is the difference between the calculated approximation of a number and its exact mathematical value. Numerical analysis specifically tries to estimate

More information

Generalized interval arithmetic on compact matrix Lie groups

Generalized interval arithmetic on compact matrix Lie groups myjournal manuscript No. (will be inserted by the editor) Generalized interval arithmetic on compact matrix Lie groups Hermann Schichl, Mihály Csaba Markót, Arnold Neumaier Faculty of Mathematics, University

More information

Using Godunov s Two-Sided Sturm Sequences to Accurately Compute Singular Vectors of Bidiagonal Matrices.

Using Godunov s Two-Sided Sturm Sequences to Accurately Compute Singular Vectors of Bidiagonal Matrices. Using Godunov s Two-Sided Sturm Sequences to Accurately Compute Singular Vectors of Bidiagonal Matrices. A.M. Matsekh E.P. Shurina 1 Introduction We present a hybrid scheme for computing singular vectors

More information

Numerical Methods I Non-Square and Sparse Linear Systems

Numerical Methods I Non-Square and Sparse Linear Systems Numerical Methods I Non-Square and Sparse Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 25th, 2014 A. Donev (Courant

More information

Hybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC

Hybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,

More information

Parallel Singular Value Decomposition. Jiaxing Tan

Parallel Singular Value Decomposition. Jiaxing Tan Parallel Singular Value Decomposition Jiaxing Tan Outline What is SVD? How to calculate SVD? How to parallelize SVD? Future Work What is SVD? Matrix Decomposition Eigen Decomposition A (non-zero) vector

More information

The geometric mean algorithm

The geometric mean algorithm The geometric mean algorithm Rui Ralha Centro de Matemática Universidade do Minho 4710-057 Braga, Portugal email: r ralha@math.uminho.pt Abstract Bisection (of a real interval) is a well known algorithm

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 19: More on Arnoldi Iteration; Lanczos Iteration Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis I 1 / 17 Outline 1

More information

Last Time. Social Network Graphs Betweenness. Graph Laplacian. Girvan-Newman Algorithm. Spectral Bisection

Last Time. Social Network Graphs Betweenness. Graph Laplacian. Girvan-Newman Algorithm. Spectral Bisection Eigenvalue Problems Last Time Social Network Graphs Betweenness Girvan-Newman Algorithm Graph Laplacian Spectral Bisection λ 2, w 2 Today Small deviation into eigenvalue problems Formulation Standard eigenvalue

More information

Singular Value Decompsition

Singular Value Decompsition Singular Value Decompsition Massoud Malek One of the most useful results from linear algebra, is a matrix decomposition known as the singular value decomposition It has many useful applications in almost

More information

Key words. conjugate gradients, normwise backward error, incremental norm estimation.

Key words. conjugate gradients, normwise backward error, incremental norm estimation. Proceedings of ALGORITMY 2016 pp. 323 332 ON ERROR ESTIMATION IN THE CONJUGATE GRADIENT METHOD: NORMWISE BACKWARD ERROR PETR TICHÝ Abstract. Using an idea of Duff and Vömel [BIT, 42 (2002), pp. 300 322

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Decompositions, numerical aspects Gerard Sleijpen and Martin van Gijzen September 27, 2017 1 Delft University of Technology Program Lecture 2 LU-decomposition Basic algorithm Cost

More information

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects Numerical Linear Algebra Decompositions, numerical aspects Program Lecture 2 LU-decomposition Basic algorithm Cost Stability Pivoting Cholesky decomposition Sparse matrices and reorderings Gerard Sleijpen

More information

A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems

A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems Outline A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems Azzam Haidar CERFACS, Toulouse joint work with Luc Giraud (N7-IRIT, France) and Layne Watson (Virginia Polytechnic Institute,

More information

CME 302: NUMERICAL LINEAR ALGEBRA FALL 2005/06 LECTURE 6

CME 302: NUMERICAL LINEAR ALGEBRA FALL 2005/06 LECTURE 6 CME 302: NUMERICAL LINEAR ALGEBRA FALL 2005/06 LECTURE 6 GENE H GOLUB Issues with Floating-point Arithmetic We conclude our discussion of floating-point arithmetic by highlighting two issues that frequently

More information

CALU: A Communication Optimal LU Factorization Algorithm

CALU: A Communication Optimal LU Factorization Algorithm CALU: A Communication Optimal LU Factorization Algorithm James Demmel Laura Grigori Hua Xiang Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-010-9

More information

The Future of LAPACK and ScaLAPACK

The Future of LAPACK and ScaLAPACK The Future of LAPACK and ScaLAPACK Jason Riedy, Yozo Hida, James Demmel EECS Department University of California, Berkeley November 18, 2005 Outline Survey responses: What users want Improving LAPACK and

More information

6.4 Krylov Subspaces and Conjugate Gradients

6.4 Krylov Subspaces and Conjugate Gradients 6.4 Krylov Subspaces and Conjugate Gradients Our original equation is Ax = b. The preconditioned equation is P Ax = P b. When we write P, we never intend that an inverse will be explicitly computed. P

More information

APPLIED NUMERICAL LINEAR ALGEBRA

APPLIED NUMERICAL LINEAR ALGEBRA APPLIED NUMERICAL LINEAR ALGEBRA James W. Demmel University of California Berkeley, California Society for Industrial and Applied Mathematics Philadelphia Contents Preface 1 Introduction 1 1.1 Basic Notation

More information

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11 Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would

More information

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 Instructions Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 The exam consists of four problems, each having multiple parts. You should attempt to solve all four problems. 1.

More information

Intel Math Kernel Library (Intel MKL) LAPACK

Intel Math Kernel Library (Intel MKL) LAPACK Intel Math Kernel Library (Intel MKL) LAPACK Linear equations Victor Kostin Intel MKL Dense Solvers team manager LAPACK http://www.netlib.org/lapack Systems of Linear Equations Linear Least Squares Eigenvalue

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 6 Matrix Models Section 6.2 Low Rank Approximation Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Edgar

More information

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is

More information

Chapter 4 No. 4.0 Answer True or False to the following. Give reasons for your answers.

Chapter 4 No. 4.0 Answer True or False to the following. Give reasons for your answers. MATH 434/534 Theoretical Assignment 3 Solution Chapter 4 No 40 Answer True or False to the following Give reasons for your answers If a backward stable algorithm is applied to a computational problem,

More information

Matrix Algorithms. Volume II: Eigensystems. G. W. Stewart H1HJ1L. University of Maryland College Park, Maryland

Matrix Algorithms. Volume II: Eigensystems. G. W. Stewart H1HJ1L. University of Maryland College Park, Maryland Matrix Algorithms Volume II: Eigensystems G. W. Stewart University of Maryland College Park, Maryland H1HJ1L Society for Industrial and Applied Mathematics Philadelphia CONTENTS Algorithms Preface xv xvii

More information

Lecture 2: Numerical linear algebra

Lecture 2: Numerical linear algebra Lecture 2: Numerical linear algebra QR factorization Eigenvalue decomposition Singular value decomposition Conditioning of a problem Floating point arithmetic and stability of an algorithm Linear algebra

More information

Direct solution methods for sparse matrices. p. 1/49

Direct solution methods for sparse matrices. p. 1/49 Direct solution methods for sparse matrices p. 1/49 p. 2/49 Direct solution methods for sparse matrices Solve Ax = b, where A(n n). (1) Factorize A = LU, L lower-triangular, U upper-triangular. (2) Solve

More information

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) Schodinger equation: Hψ = Eψ Choose a basis set of wave functions Two cases: Orthonormal

More information

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel AGULLO (INRIA / LaBRI) Camille COTI (Iowa State University) Jack DONGARRA (University of Tennessee) Thomas HÉRAULT

More information

Lecture 2 INF-MAT : , LU, symmetric LU, Positve (semi)definite, Cholesky, Semi-Cholesky

Lecture 2 INF-MAT : , LU, symmetric LU, Positve (semi)definite, Cholesky, Semi-Cholesky Lecture 2 INF-MAT 4350 2009: 7.1-7.6, LU, symmetric LU, Positve (semi)definite, Cholesky, Semi-Cholesky Tom Lyche and Michael Floater Centre of Mathematics for Applications, Department of Informatics,

More information

REORTHOGONALIZATION FOR GOLUB KAHAN LANCZOS BIDIAGONAL REDUCTION: PART II SINGULAR VECTORS

REORTHOGONALIZATION FOR GOLUB KAHAN LANCZOS BIDIAGONAL REDUCTION: PART II SINGULAR VECTORS REORTHOGONALIZATION FOR GOLUB KAHAN LANCZOS BIDIAGONAL REDUCTION: PART II SINGULAR VECTORS JESSE L. BARLOW Department of Computer Science and Engineering, The Pennsylvania State University, University

More information

Implementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint:

Implementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint: Implementing QR Factorization Updating Algorithms on GPUs Andrew, Robert and Dingle, Nicholas J. 214 MIMS EPrint: 212.114 Manchester Institute for Mathematical Sciences School of Mathematics The University

More information

A DIVIDE-AND-CONQUER METHOD FOR THE TAKAGI FACTORIZATION

A DIVIDE-AND-CONQUER METHOD FOR THE TAKAGI FACTORIZATION SIAM J MATRIX ANAL APPL Vol 0, No 0, pp 000 000 c XXXX Society for Industrial and Applied Mathematics A DIVIDE-AND-CONQUER METHOD FOR THE TAKAGI FACTORIZATION WEI XU AND SANZHENG QIAO Abstract This paper

More information

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 9

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 9 STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 9 1. qr and complete orthogonal factorization poor man s svd can solve many problems on the svd list using either of these factorizations but they

More information

Linear System of Equations

Linear System of Equations Linear System of Equations Linear systems are perhaps the most widely applied numerical procedures when real-world situation are to be simulated. Example: computing the forces in a TRUSS. F F 5. 77F F.

More information

Communication-avoiding LU and QR factorizations for multicore architectures

Communication-avoiding LU and QR factorizations for multicore architectures Communication-avoiding LU and QR factorizations for multicore architectures DONFACK Simplice INRIA Saclay Joint work with Laura Grigori INRIA Saclay Alok Kumar Gupta BCCS,Norway-5075 16th April 2010 Communication-avoiding

More information