Lazy Householder Decomposition of Sparse Matrices
G.W. Howell
North Carolina State University
Raleigh, North Carolina
August 26, 2010

Supported by NIH Molecular Libraries Roadmap for Medical Research, Grant 1 P20 Hg

Abstract

This paper describes Householder reduction of a rectangular sparse matrix to small band upper triangular form $B_{k+1}$. $B_{k+1}$ is upper triangular with nonzero entries only on the diagonal and on the nearest $k$ superdiagonals. The algorithm is similar to the Householder reduction used as part of the standard dense SVD computation. For the sparse lazy algorithm, matrix updates are deferred until a row or column block is eliminated. The original sparse matrix is accessed only for sparse matrix dense matrix (SMDM) multiplications and to extract row and column blocks. For a triangular bandwidth of $k + 1$, the SMDM operations multiply the sparse matrix by dense matrices consisting of the $k$ rows or columns of a block Householder transformation. Block Householder transformations are reliably orthogonal, computationally efficient, and have good potential for parallelization. Numeric results presented here indicate that using an initial random block Householder transformation allows computation of a collection of largest singular values. Some potential applications are in finding low rank matrix approximations and in solving least squares problems.

1 Introduction

In 1965, Golub and Kahan proposed Householder bidiagonalization $A = UB_2V$ as a first step in determining the singular values of dense matrices. For sparse
matrices they proposed Lanczos bidiagonalization as a means of determining a few singular values [8]. Rearranging the order of computation to avoid filling a sparse matrix allows a natural extension of the use of Householder decomposition to the sparse case. Householder transformations are scalably stable and, if they are blocked, the reduction algorithm is almost entirely BLAS-3 and so efficient on a variety of computer architectures.

Given a sparse matrix $A$, applications include solving $Ax = b$ and finding $x$ to minimize $\|Ax - b\|_2$. In this paper, we apply the algorithm to finding singular values of $A$. Key points are

- Algorithm stability is desirable for reliable solution of large problems. Householder and block Householder transformations are very nearly orthogonal when implemented with rounding arithmetic, enabling simple run-time convergence tests (see Section 4).
- Householder reductions can be applied to sparse matrices by deferring updates of blocks of the original matrix. Updates are not performed until the step on which a column or row block is to be eliminated. Multiplications are accomplished by expressing the updated matrix as a sum of the original sparse matrix and a low rank update (see footnote 1). The work here is essentially an extension of the dense Grösser and Lang algorithm [10], [16] to apply in the sparse case.
- Block Householder transformations are BLAS-3. On current computer architectures, whether cache-based multicore, GPU or other hardware accelerators connected to general processors, or distributed parallel computing, dense BLAS-3 matrix matrix multiplies are significantly faster than BLAS-2 matrix vector multiplications. Section 3 discusses NUMA (shared memory Non-Uniform Memory Access) performance for wide or tall BLAS-3.
- Similarly, multiplying a sparse matrix by a dense matrix (SMDM) is faster than multiplying a sparse matrix by a dense vector.
If the reduction is to an upper triangular matrix $B_{k+1}$ with nonzero entries on the diagonal and nearest $k$ superdiagonals, then the SMDM operations $AX$ and $A^TY$ entail dense matrices $X$ and $Y$ with $k$ columns.

Footnote 1: For reduction to Hessenberg form or upper triangular form, the idea of deferring updates has been repeatedly used. Kaufman implemented deferral of updates in sparse Householder QR factorization [15]. For other applications of this idea, see for example ARPACK [19], Sosonkina, Allison, and Watson [25], and Dubrulle [6]. For Householder reduction to bidiagonal form, the idea of deferring updates is implicit in the LAPACK reduction GEBRD (for the dense case) [3], with the extension to the sparse case explicitly outlined in Howell, Demmel, Fulton, Hammarling, and Marmol [11]. The lazy functional language Haskell defers updates.
With 64 GBytes of RAM, 48 GBytes of which were allocated to basis storage, the $UB_{k+1}V$ algorithm usually determined many singular values for matrices of size up to about a million square, as illustrated by testing against matrices from Tim Davis's UF collection of matrices [4] (see Section 6). For comparison, if one dense 20K by 20K matrix can be stored on a given processor (3.2 GBytes of storage), then 2500 processors would be needed for a dense SCALAPACK computation of singular values for a one million square matrix.

Section 2 compares basis orthogonality and storage requirements for several methods of finding singular values of sparse matrices. Section 3 gives some numeric results justifying the assertion that $AX$, $A^TY$ (sparse matrix dense matrix, or SMDM) operations are likely to be faster than $Ax$, $A^Ty$ and discusses shared memory parallelization of wide or tall BLAS-3 operations. Section 4 summarizes some theory, justifying some run-time error estimates. Section 5 is an explicit presentation of the algorithm, provides comparisons of the sparse and dense algorithms, and shows how implicit fill can be useful. Section 6 describes numeric experiments with the Davis UF collection [4] of sparse matrices.

2 Comparison to Other Sparse Methods

The sparse and dense $UB_{k+1}V$ Householder based decompositions are BLAS-3 algorithms with $U$ and $V$ scalably orthogonal. For comparison, consider Lanczos bidiagonalization for finding a few singular values of a sparse matrix, proposed by Golub and Kahan as a sparse alternative to Householder bidiagonalization [8]. Lanczos bidiagonalization can proceed without storing multipliers, relying on a three term recursion, so that only the last few left and right multipliers are needed. Storage requirements are minimal. In exact arithmetic the Lanczos bases would be orthogonal. In rounding arithmetic, there is a rapid loss of orthogonality, and even of linear independence, as illustrated in Figure 1.
For Figure 1, a matrix of size a few hundred was randomly generated in Octave, the left and right multipliers were saved, and the numeric rank of the right multiplier basis was calculated as its number of nonzero singular values. As memory per available node grows, using the memory to get better stability becomes feasible. In order to preserve orthogonality, multiplier vectors are frequently stored and, in a Lanczos algorithm, reorthogonalized. Table 1 compares $L_2$ condition numbers of various methods for constructing bases for the columns of Hilbert matrices, computed as the ratio of largest to smallest singular values of the basis.
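The comparison behind Table 1 can be imitated on a small scale. The sketch below is not the experiment reported in the paper: it assumes NumPy, uses `numpy.linalg.qr` (which calls LAPACK's Householder QR) as the Householder basis, and uses a plain modified Gram-Schmidt loop. The function names `hilbert`, `mgs_basis`, `basis_condition`, and `compare_bases` are hypothetical.

```python
import numpy as np

def hilbert(n):
    """n-by-n Hilbert matrix, H[i, j] = 1 / (i + j + 1) with 0-based indices."""
    idx = np.arange(n)
    return 1.0 / (idx[:, None] + idx[None, :] + 1.0)

def mgs_basis(A):
    """Basis for the columns of A computed by modified Gram-Schmidt."""
    Q = np.array(A, dtype=float)
    for j in range(Q.shape[1]):
        for i in range(j):
            # Remove the component along the already normalized column i.
            Q[:, j] -= (Q[:, i] @ Q[:, j]) * Q[:, i]
        Q[:, j] /= np.linalg.norm(Q[:, j])
    return Q

def basis_condition(Q):
    """Ratio of largest to smallest singular value of the basis."""
    s = np.linalg.svd(Q, compute_uv=False)
    return s[0] / s[-1]

def compare_bases(n):
    """Condition numbers of the Householder and MGS bases for a Hilbert matrix."""
    H = hilbert(n)
    Q_house, _ = np.linalg.qr(H)  # LAPACK Householder QR
    Q_mgs = mgs_basis(H)
    return basis_condition(Q_house), basis_condition(Q_mgs)
```

Calling `compare_bases(10)` returns the two condition numbers; the Householder basis should stay at one to within rounding, while the Gram-Schmidt basis drifts measurably away from one as the Hilbert matrix becomes more ill-conditioned.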
[Figure 1: Loss of linear independence of the "orthogonal" Lanczos basis, plotted as rank deficiency vs. number of columns. Lanczos bases suffer loss of numeric rank.]

Table 1: $L_2$ Condition Numbers of Bases (columns: Sizes, Householder QR, L in LU, MGS QR). Here we factored Hilbert matrices of various sizes and compared condition numbers of Q for QR factorization or L for LU decomposition.

If modified Gram-Schmidt as opposed to Householder orthogonalization is used, the number of flops can be halved. Since modified Gram-Schmidt is primarily a BLAS-1 algorithm, use of BLAS-3 block Householder transformations is typically faster, as well as more nearly orthogonal. Alternatively, Jalby and Philippe [13], Vanderstraeten [28], Stewart [26], Giraud and Langou [7] and others have designed block Gram-Schmidt algorithms to be of comparable stability to modified Gram-Schmidt, which gives a well-conditioned but not orthogonal basis for an ill-conditioned set of vectors such as those obtained by a Lanczos method. Using block Gram-Schmidt in combination with block Lanczos methods, such as that proposed by Golub, Luk, and Overton [9], would be another possible means of obtaining a stable BLAS-3 algorithm.

More usually, Lanczos bidiagonalization with reorthogonalization is used in SVDPACK [2] and PROPACK [17], with theoretical development in Larsen's thesis [18] and work by Simon and Zha [24]. Each of these methods is appropriate for finding some singular values. Sparse MATLAB instead uses ARPACK
[19], an Arnoldi method based on BLAS-2 non-blocked Householder transformations. Table 2 compares some sparse decompositions in terms of required storage and level of BLAS.

[Figure 2: Smallest dimension vs. Gflop rate for skinny BLAS-3 with 16 cores (tall*small, small*wide, and wide*tall products). For the plot, the long matrix dimension is fixed. The x-axis is the smallest dimension. If the OpenMP loop used during initialization imitates the loop used during computation, parallelization is relatively good, presumably because of data locality. Best performance is about 62% of peak. For matrices of practical width of 8, performance is only about 10% of peak.]

3 Sparse Matrix Dense Matrix (SMDM) and Wide or Tall BLAS-3

Sparse matrix algorithms often use multiplications of the sparse matrix times a dense vector and BLAS-1 or BLAS-2 operations. On cache based computer architectures these execute orders of magnitude slower than the peak machine speed, slowed by repeated fetches of the sparse matrix from RAM. Block algorithms replace sparse matrix dense vector multiplications by sparse matrix dense matrix multiplications, and replace BLAS-1 inner products and daxpys by Wide or Tall BLAS-3 operations. Compared to sparse matrix dense vector and BLAS-1 operations, SMDM multiplications and Wide or
Tall BLAS-3 allow more floating point operations for each fetch of a floating point number from RAM.

Table 2: Summary Chart Comparing Sparse Decompositions.

Basis     Lanczos           PROPACK            ARPACK, GMRES
          $UB_2V$           $UB_2V$            $UHU^T$            $UB_{k+1}V$
Vecs      $O(1)$            $2N$               $2N$               $4N$
          Loss of Rank      Uses Re-orthog     Keeps Orthog       Keeps Orthog
BLAS      BLAS-1            BLAS-1             BLAS-2             BLAS-3
flops     $4Nn_z + O(N)$    $4Nn_z + 4N^2n$    $4Nn_z + 4N^2n$    $4Nn_z + 4(n+m)n^2$

3.1 Shared Memory Parallelization for Wide or Tall BLAS-3

When dense matrices $A$ and $B$ are too large to fit in fast memory and have smallest dimension smaller than about 100, our tests on multi-core processors indicate that the matrix multiplication $AB$ has computational rate roughly proportional to the smallest dimension. For a few shared memory (or multi-core) processors, we get good parallel speedups with a few OpenMP calls or merely by using a multi-threaded BLAS library. For more than about four cores, a more careful parallelization is needed for the wide or tall BLAS-3 operations which are the predominant calculation in $UB_{k+1}V$ decomposition.

Three special cases were parallelized using the OpenMP library. These were Wide, Tall, and WideTall, which respectively parallelize the cases of small*wide, tall*small, and wide*tall. The 4 socket, 16 core architecture is NUMA (Non Uniform Memory Access). Each socket has faster access to its own RAM than to the RAM associated with the other sockets. The computational rates illustrated in Figure 2 were obtained by using the same OpenMP loops for matrix initialization as for computation, thereby improving data locality. This numeric experiment was on a four motherboard Opteron running Linux. The same code also produces good data locality and performance for Intel chips (see footnote 2).

Footnote 2: Using the same OpenMP loops for matrix initialization and computation may fail to produce data locality on other architectures and operating systems. Lack of explicit control over data locality may limit the portability of OpenMP NUMA parallelism.
3.2 Sparse Matrix Dense Matrix Products

Many classic iterative schemes for solving systems of sparse linear equations rely on multiplications $Ax$, $y^TA$, and BLAS-1 (vector vector) operations. Accessing $A$ to perform multiplications $AX$ gives significantly better performance. Figure 3 indicates the relative effects of block size vs. matrix storage in speeding sparse matrix multiplications. Column blocking is effective in improving performance when nonzero entries are uniformly distributed.

[Figure 3: (Sparse A)*(Dense X), 20X speedup: blocking speeds sparse matrix dense matrix multiplies. Speed in Megaflops per second vs. number of column blocks, for 16 vectors in X (A*X and tr(A)*X) and 1 vector in x (A*x and tr(A)*x). Speeding multiplication of a randomly generated sparse matrix. Blocking the matrix and multiplying by multiple vectors reduce cache misses. The matrix here is 100K by 100K with 500 randomly distributed nonzero entries per row. The computation was with a 64 bit 2.4 GHz two Pentium processor with 512 MByte L2 cache compiled with an Intel Fortran compiler. Parallelization is provided with one OpenMP parallel loop for $AX$, $Ax$ and one also for $X^TA$, $x^TA$.]

For other sparse matrices, different matrix storages can improve performance of the $AX$ kernel. For example, Toledo [27], Angeli et al. [1] and Im's Ph.D. dissertation [12] offer some guidance in arranging storage of $A$ to speed the computation $Ax$. The OSKI package (Vuduc, Demmel, and Yelick [29]) automates the process of choosing storage of $A$. Nishtala, Vuduc, Demmel, and Yelick [22] offer some guidance as to when OSKI is likely to be effective. For the $UB_{k+1}V$ decomposition, access to $A$ is only for the multiplications
$AX$, $Y^TA$ and extractions of blocks of $A$. Almost all other operations are BLAS-3 with minimal dimension $k$.

4 Some Theory

Using block Householder transformations, we expect the overall condition number of transformations to be very near one, and expect that if we compute $B_{k+1}$ to satisfy $A = UB_{k+1}V$, then the singular values of $B_{k+1}$ and $A$ will be closely matched. For a practical sparse algorithm only a partial decomposition is made, i.e., for $A$ of $m$ rows and $n$ columns we compute only the first $N$ rows and columns of $B_{k+1}$, $N < n$. The following subsections discuss interlacing of singular values and approximation of $A$ by $U_N B_{k+1}^N V_N$ in the Frobenius norm.

4.1 Interlacing of Singular Values

In the dense case, singular values are typically found by reducing an $m \times n$ matrix $A$ to a condensed $m \times n$ matrix $B$ (upper triangular, or with banded structure such as bidiagonal) with the same singular values, then finding the singular values of the condensed form by some iterative procedure. For reduction to small band (or upper triangular) form in the sparse case, obtaining an $m \times n$ reduced matrix is impractical if the transformations are stored, and unstable if they are not stored. Transformations must be stored to maintain orthogonality and linear independence.

It is natural to try to use the singular values of an $N \times N$ reduced matrix $B_{k+1}^N$ obtained after eliminating $N$ columns as approximate singular values of the original matrix $A$. The singular values of $B_{k+1}^N$ are sometimes called Ritz values of $A$. Cauchy's interlacing property relates the Ritz values to the singular values of $A$.

Cauchy's Interlace Theorem [20]: Let $C$ be a Hermitian matrix partitioned as
\[ C = \begin{bmatrix} H & B^* \\ B & U \end{bmatrix}, \]
where $C$ is $n \times n$ and $H$ is $N \times N$, $C$ has eigenvalues $\alpha_1 \le \alpha_2 \le \dots \le \alpha_n$, and $H$ has eigenvalues $\theta_1 \le \theta_2 \le \dots \le \theta_N$. Then for $j = 1, \dots, N$,
\[ \alpha_j \le \theta_j \le \alpha_{j+n-N} \tag{4.1} \]
and for $l = 1, 2, \dots, n$,
\[ \theta_{l-n+N} \le \alpha_l \le \theta_l. \tag{4.2} \]

Suppose $A$ has been transformed to $A_N$,
\[ A_N = \begin{bmatrix} R_N & T_N \\ 0 & B_{k+1}^N \end{bmatrix}. \]
Then the Hermitian matrix
\[ A_N^T A_N = \begin{bmatrix} R_N^T R_N & R_N^T T_N \\ T_N^T R_N & T_N^T T_N + (B_{k+1}^N)^T B_{k+1}^N \end{bmatrix} \]
has the same eigenvalues as $A^TA$. For the symmetric matrix $A_N^T A_N$, Cauchy's interlacing theorem implies that the eigenvalues of the symmetric matrix $R_N^T R_N$ interlace with those of $A^TA$. Since the singular values $\sigma_i$ of $A$ have the same ordering in size as the eigenvalues $\lambda_i = \sigma_i^2$ of $A^TA$, Cauchy's interlacing theorem interlaces the singular values of $A$ and $R_N$.

As an example of interlacing consider singular values of the 4 by 4 upper triangular matrix $T =$ Let $T_1, T_2, T_3$ be the upper left $1 \times 1$, $2 \times 2$, $3 \times 3$ matrices respectively. The singular values of $T_1, T_2, T_3, T$ are respectively

Actually, we can do a bit more. When reducing to banded upper triangular form, we get $A_N$ of the form
\[ A_N = \begin{bmatrix} R_N & L_N & 0 \\ 0 & B & C \\ 0 & D & E \end{bmatrix} \tag{4.3} \]
and we naturally wonder whether singular values of $\hat{R} = [\, R_N \;\; L_N \;\; 0 \,]$ are related to those of $A$. The following result of Kahan from p. 196 of [20] is applicable.
The Residual Interlace Theorem. Let $F$ be a Hermitian matrix of the form
\[ F = \begin{bmatrix} H & C^* & 0 \\ C & V & Z^* \\ 0 & Z & W \end{bmatrix} \]
where $H$ is $N \times N$, $V$ is $j \times j$, and $F$ is $n \times n$. Define
\[ M(X) = \begin{bmatrix} H & C^* \\ C & X \end{bmatrix} \tag{4.4} \]
where $V - X$ is assumed to be invertible. Denote the eigenvalues of $M(X)$ as $\mu_1 \le \mu_2 \le \dots \le \mu_{j+N}$. Then each interval $[\mu_i, \mu_{i+N}]$, $i = 1, \dots, j$, contains a different eigenvalue of $F$. Also, outside each open interval $(\mu_l, \mu_{l+j})$, $l = 1, \dots, N$, there is a different eigenvalue of $F$.

The residual interlace theorem applies to $A_N^T A_N$ as it is of the form (suppressing the $N$ subscripts)
\[ A_N^T A_N = \begin{bmatrix} R^TR & R^TL & 0 \\ L^TR & L^TL + B^TB + D^TD & B^TC + D^TE \\ 0 & C^TB + E^TD & C^TC + E^TE \end{bmatrix}. \]
Taking $X = L^TL$ gives $V - X = B^TB + D^TD$. The theorem will apply if $V - X$ is nonsingular, which will be the case when either the columns of $B$ or the columns of $D$ are linearly independent. We conclude that the $j + N$ singular values $\alpha_i$ of $\hat{R}$, taken as the square roots of the eigenvalues of $M(L^TL)$, are lower bounds for the top $j + N$ singular values of $A$. In particular if $\alpha_i$, $i \le N$, is the $i$th largest singular value of $\hat{R}$, then $\alpha_i < \sigma_i$, where $\sigma_i$ is the $i$th largest singular value of $A$. Applying the interlacing theorem to $R$ and $\hat{R}$, the $i$th of the $N$ largest singular values of $\hat{R}$ is larger than the $i$th singular value $\eta_i$ of $R$. Since $\eta_i \le \alpha_i \le \sigma_i$, the singular values $\alpha_i$ of $\hat{R}$ are better estimates of singular values of $A$ than are the singular values $\eta_i$ of $R$.

For example, take $A =$ Let $R_1, R_2, R_3$ be the upper left $2 \times 2$, $2 \times 3$, $2 \times 4$ matrices respectively. The first two singular values of $R_1, R_2, R_3, A$ are respectively
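The ordering $\eta_i \le \alpha_i \le \sigma_i$ can be checked numerically. The sketch below does not reproduce the example matrices of the paper; it assumes NumPy and builds a hypothetical random upper triangular test matrix whose first $N$ rows have the form $[\, R \;\; L \;\; 0 \,]$ of (4.3). The names `interlace_demo` and `leading_singular_values` are illustrative only.

```python
import numpy as np

def leading_singular_values(M, count):
    """The `count` largest singular values of M, in decreasing order."""
    return np.linalg.svd(M, compute_uv=False)[:count]

def interlace_demo(n=8, N=3, j=2, seed=0):
    """Return (eta, alpha, sigma) for R, R_hat = [R L 0], and A."""
    rng = np.random.default_rng(seed)
    # Hypothetical test matrix: upper triangular, so the block below R is zero.
    A = np.triu(rng.standard_normal((n, n)))
    # Zero the first N rows beyond column N + j so they read [R L 0] as in (4.3).
    A[:N, N + j:] = 0.0
    R = A[:N, :N]        # N x N leading block
    R_hat = A[:N, :]     # first N rows, [R L 0]
    eta = leading_singular_values(R, N)
    alpha = leading_singular_values(R_hat, N)
    sigma = leading_singular_values(A, N)
    return eta, alpha, sigma
```

In this sketch `eta[i] <= alpha[i] <= sigma[i]` should hold for every `i` up to rounding, since $R$ is a column block of $\hat{R}$ and $\hat{R}$ is a row block of $A$.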
4.2 Approximation of $A$ by $J_{kl} = U_{kl} \hat{B}_{k+1} V_{kl}^T$

Suppose that $A_{kl}$ is related to $A$ by the orthogonal transformations $U_{kl}$ and $V_{kl}$ as $A_{kl} = U_{kl}^T A V_{kl}$. Due to the orthogonality of $U_{kl}$ and $V_{kl}$, we have $\|A_{kl}\|_F = \|A\|_F$. For the algorithm described in the next section $A_{kl}$ has the form
\[ A_{kl} = \begin{bmatrix} B_{k+1} & \begin{matrix} C_k & 0 \end{matrix} \\ 0 & \hat{A}_{kl} \end{bmatrix} \tag{4.5} \]
where $B_{k+1}$ is $kl \times kl$ and $C_k$ is $kl \times k$. In our instance, $C_k$ has nonzero entries only in its lower triangular $k \times k$ block. We are interested in the case that $\hat{A}_{kl}$ is not computed, as it would be dense and large and likely to overflow the RAM. Since we have
\[ \|A\|_F^2 = \|A_{kl}\|_F^2 = \|B_{k+1}\|_F^2 + \|C_k\|_F^2 + \|\hat{A}_{kl}\|_F^2, \]
it follows that
\[ \|\hat{A}_{kl}\|_F^2 = \|A\|_F^2 - \|B_{k+1}\|_F^2 - \|C_k\|_F^2. \tag{4.6} \]
Take $\hat{B}_{k+1} = [\, B_{k+1} \;\; C_k \,]$ as a $kl \times k(l + 1)$ matrix and $J_{kl} = U_{kl} \hat{B}_{k+1} V_{kl}^T$ as a rank $kl$ approximation to $A$. The approximation is good if $\|\hat{A}_{kl}\|_F$ is small, and the quantities on the right hand side of (4.6) are easily computable during a computation.

5 The lazy $UB_{k+1}V$ partial decomposition

We adapt the BLAS-3 algorithm for reduction to bandwidth $k+1$ using Householder reductions of block size $k$. Dense implementations were by Grösser and Lang [10], [16]. Using deferred updates to convert a dense to a sparse algorithm for $k = 1$ (the bidiagonal case) is discussed in Howell, Demmel, Fulton, and Marmol [11].
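For the bidiagonal case $k = 1$ just mentioned, the run-time estimate (4.6) can be sketched in a small dense setting. This is not the lazy sparse algorithm: the code below assumes NumPy, updates a dense copy of the matrix eagerly, and stops after `steps` column and row eliminations. The helper names `reflector`, `partial_bidiagonalize`, and `residual_norm_estimate` are hypothetical.

```python
import numpy as np

def reflector(x):
    """Unit vector v such that (I - 2 v v^T) x is a multiple of e_1."""
    v = np.array(x, dtype=float)
    norm_x = np.linalg.norm(v)
    if norm_x == 0.0:
        v[0] = 1.0  # x is already zero; any unit vector gives a valid reflection
        return v
    v[0] += np.copysign(norm_x, v[0])
    return v / np.linalg.norm(v)

def partial_bidiagonalize(A, steps):
    """Dense Householder bidiagonalization stopped after `steps` row-column pairs."""
    A = np.array(A, dtype=float)
    for i in range(steps):
        v = reflector(A[i:, i])            # eliminate column i below the diagonal
        A[i:, i:] -= 2.0 * np.outer(v, v @ A[i:, i:])
        w = reflector(A[i, i + 1:])        # eliminate row i right of the superdiagonal
        A[i:, i + 1:] -= 2.0 * np.outer(A[i:, i + 1:] @ w, w)
    return A

def residual_norm_estimate(A, A_l, l):
    """Right hand side of (4.6) with k = 1: B is l x l, C is l x 1."""
    B = A_l[:l, :l]
    C = A_l[:l, l:l + 1]
    total = np.linalg.norm(A, "fro") ** 2
    return np.sqrt(total - np.linalg.norm(B, "fro") ** 2 - np.linalg.norm(C, "fro") ** 2)
```

Comparing `residual_norm_estimate(A, A_l, l)` with the Frobenius norm of the trailing block `A_l[l:, l:]` shows that the norm of the uncomputed part is available from quantities the reduction already holds. `steps` is assumed to be at most `n - 1` for an `m` by `n` input with `m >= n`.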
5.1 Notes on Lazy 2-Sided Block Householder Reduction

The pseudo-code below (see footnote 3) should allow the reader to verify the following points.

- Entries of A_old are not changed. The only accesses to A_old are for SMDM multiplications and extractions of matrix sub-blocks. If $l$ pairs of row and column blocks of size $k$ are eliminated, then A_old is accessed for $2l$ SMDM operations consisting of multiplication of A_old by blocks of $k$ vectors.
- Block Householder transformations can be used for the reduction. The implementations of qlt and qlr, used but not specifically detailed, use the algorithms due to Schreiber and Van Loan [23]. Alternately, the method proposed by Joffrain, Low, Quintana-Orti, Van de Geijn, and Van Zee [14] or Puglisi [21] could be used. Householder transformations are reliably orthogonal. Blocking the transformations enables use of BLAS-3.
- Operations updating column and row blocks and in forming the update matrices are BLAS-3. BLAS-2 operations are only in initializations and copies, and in the qlt and qlr formation of block Householder transformations. If the qlt and qlr operations are BLAS-2, the total number of BLAS-2 flops is $O(mnk)$ for elimination of all columns, $O((m + n)k^2 l)$ for elimination of $l$ $k$-sized blocks. In comparison (see (5.14)), there are $6(lk)^2(m + n) - 8(lk)^3$ BLAS-3 flops for eliminating $l$ blocks of $k$ rows and columns.
- As presented, the block Householder reduction runs to completion, useful in that the returned matrix $B_{k+1}$ can be observed to have very nearly the same singular values as the input matrix, enabling a test for correct implementation. More usually, for a large sparse matrix, the returned matrix $B_{k+1}$ has dimension $kl \times kl$, $kl \ll n$, where $l$ blocks have been eliminated. $l$ is chosen (either a priori as an input, or from a convergence criterion) and the algorithm is ended at the comment "! End of loop on blocks".
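The block Householder factor used in these notes can be sketched with the compact WY accumulation associated with Schreiber and Van Loan [23]. The code below is an assumption about what a `qrl`-like routine might return, not the paper's implementation: it assumes NumPy, uses unblocked Householder steps to build the block, and returns `U`, `R`, and a lower triangular `L` with `R = (I - U L U^T) C`. The function name `block_householder_qr` is hypothetical.

```python
import numpy as np

def block_householder_qr(C):
    """Return U, R, L with R = (I - U L U^T) C, for C of size m x k, m >= k."""
    R = np.array(C, dtype=float)
    m, k = R.shape
    U = np.zeros((m, k))
    T = np.zeros((k, k))  # upper triangular factor of Q = I - U T U^T
    for j in range(k):
        v = R[j:, j].copy()
        norm_x = np.linalg.norm(v)
        if norm_x == 0.0:
            continue  # nothing to eliminate; column j of U and T stays zero
        v[0] += np.copysign(norm_x, v[0])
        tau = 2.0 / (v @ v)
        # Apply the single reflection H_j = I - tau v v^T to the trailing block.
        R[j:, j:] -= tau * np.outer(v, v @ R[j:, j:])
        U[j:, j] = v
        # Accumulate Q_j = Q_{j-1} H_j in compact WY form.
        T[:j, j] = -tau * (T[:j, :j] @ (U[:, :j].T @ U[:, j]))
        T[j, j] = tau
    return U, R, T.T  # L = T^T is lower triangular, giving Q^T = I - U L U^T
```

Forming `np.eye(m) - U @ L @ U.T` explicitly, as one might in a test, shows a matrix that is orthogonal to rounding error and maps `C` to the upper triangular `R`; in a real computation the product would be applied as two skinny BLAS-3 multiplications rather than formed.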
The returned matrix $B_{k+1}$ then has nonzero entries confined to the diagonal and $k$ superdiagonals, with $B_{k+1}$ satisfying
\[ \begin{bmatrix} B_{k+1} & \begin{matrix} C_k & 0 \end{matrix} \\ 0 & A_{updated} \end{bmatrix} = \left[ \prod_{i=1}^{l} (I - U_i L_i U_i^T) \right] (A_{old} + E) \left[ \prod_{i=1}^{l} (I - V_i^T T_i V_i) \right] \tag{5.1} \]

Footnote 3: As presented, and given appropriate qlr and qlt functions, the algorithm closely follows an Octave script implementation.

In (5.1), $C_k$ is a lower triangular $k \times k$ matrix. Denote $\epsilon$ as the largest number satisfying $1 = fl(1 + \epsilon)$. Due to the use of block Householder transformations, $E$ satisfies
\[ \|E\| / \|A\| = O(\epsilon), \tag{5.2} \]
i.e., the $UB_{k+1}V$ decomposition is backward stable. When the algorithm is not run to completion, so that the not actually computed $A_{updated}$ in (5.1) has size $(m - kl) \times (n - kl)$, we typically assume the $E$ term in (5.1) to be negligible compared to $A_{updated}$. As indicated by (4.6), runtime estimates of $\|A_{updated}\|_F$ enable estimates of $\|E\|_F$.

5.2 Pseudo-code for lazy $UB_{k+1}V$

As given here, the code runs to completion, returning an upper triangular banded matrix with bandwidth $k + 1$ (see (5.5) below). In exact arithmetic, the returned matrix has the same singular values as the original matrix and is related to the original matrix by
\[ A_{return} = \prod_{i=1}^{l} (I - U_i L_i U_i^T) \, A_{orig} \prod_{i=1}^{l-1} (I - V_i^T T_i V_i) \tag{5.3} \]
or by
\[ A_{return} = (I - U_l L_l U_l^T)(A_{orig} - UZ - WV) \tag{5.4} \]
where $(I - U_i L_i U_i^T)$ and $(I - V_i^T T_i V_i)$ are block Householder transformations. On return the blocks $U_i$ are stored in the $i$th block of $k$ columns in U (lower triangular) and the blocks $V_i$ are stored in the $i$th block of $k$ rows in V (upper triangular). Similarly, W is lower triangular and Z upper triangular (see footnote 4).

Footnote 4: The block (upright) letters U, V, W, Z refer to matrices used in the algorithm. The italic $U$, $V$ refer to generic orthogonal matrices. In exact arithmetic, A - UZ - WV is an orthogonal transformation of A, but none of the matrices U, Z, W, V are orthogonal.

The algorithm proceeds by alternately eliminating blocks of $k$ columns and $k$ rows. When no more blocks of size $k$ can be made, then the rest of the
columns are eliminated as one block. Hence the last block of L can be up to twice as large as the others. For $A = UB_4V$, $m = 10$, $n = 8$, $k = 3$, the returned $B_4$ has the following form
\[ B_4 = \begin{bmatrix}
x & x & x & x &   &   &   &   \\
  & x & x & x & x &   &   &   \\
  &   & x & x & x & x &   &   \\
  &   &   & x & x & x & x & x \\
  &   &   &   & x & x & x & x \\
  &   &   &   &   & x & x & x \\
  &   &   &   &   &   & x & x \\
  &   &   &   &   &   &   & x \\
  &   &   &   &   &   &   &   \\
  &   &   &   &   &   &   &
\end{bmatrix} \tag{5.5} \]

Capital letters are used below to indicate that variables are matrices (as opposed to vectors). Accesses to the original sparse matrix are commented as either extractions of blocks or as SMDM operations.

Pseudo-Code for $UB_{k+1}V$

Function [B, U, W, V, Z, L, L_temp, T] ← band(m, n, k, A_old)
! Assume m > n
! Input Variables
!   m      number of rows
!   n      number of columns
!   k      number of superdiagonals in returned matrix
!          (also the block size for multiplications by A_old)
!   A_old  input matrix in sparse storage
! Output Variables
!   B is an m by n matrix with upper bandwidth k+1
!   U (m x n), W (m x n), V (n x n), Z (n x n)
!   L (m x k), T (k x n), L_temp (2k x 2k)
! where (compare to 5.4, 5.3; the extra
! term here is from eliminating all remaining columns
! as a final block; for a large sparse problem, this final
! block will not be eliminated)
!   B = (I - U_last L_temp U_last^T) prod_i (I - U_i L_i U_i^T) A_old prod_i (I - V_i^T T_i V_i)
!     = (I - U_last L_temp U_last^T)(A_old - U Z - W V)
! where W = [W_1 W_2 ... W_l], U = [U_1 U_2 ... U_l],
! V^T = [V_1^T V_2^T ... V_l^T], Z^T = [Z_1^T Z_2^T ... Z_l^T],
! where each of the blocks U_i, W_i has k columns,
! and where each block V_i, Z_i has k rows
! and U_last may have up to 2k columns.
! Initializations
B ← 0(m,n) ; W ← 0(m,n) ; U ← 0(m,n) ; V ← 0(n,n) ; Z ← 0(n,n) ; L ← 0(m,k) ; T ← 0(n,k) ;
b_lks ← floor((n - k)/k) ;
for i = 1:b_lks,
  i_l(i) ← (i - 1)k + 1 ; i_h(i) ← i k ;
end
if ( k b_lks != n )
  i_l(b_lks + 1) ← k b_lks + 1 ; i_h(b_lks + 1) ← n ;
end
m_now ← m ; n_now ← k ;
A_temp ← A_old( :, 1:k) ; ! Extract first column block of A_old
i_low ← 1 ; i_hi ← k ; i_hp1 ← k+1 ;
C ← A_old( :, 1:k) ; ! Extract first column block of A_old
! qrl returns the QR factorization of the
! first column block of A_old where
! R = (I - U_temp L_temp U_temp^T) C
[ U_temp, R, L_temp ] ← qrl(m, k, C) ;
U(i_low:m, i_low:i_hi) ← U_temp ; L(i_low:i_hi, : ) ← L_temp ;
B(i_low:m, i_low:i_hi) ← R ;
! C will be the update of the first row block of A_old
L_ua ← L_temp (U_temp^T) A_old(i_low:m, i_hp1:n) ; ! SMDM multiplication with A_old
C ← A_old(i_low:i_hi, i_hp1:n) ; ! Extract a row block of A_old
C ← C - U(1:k, 1:k) L_ua ;
! qlt performs the QL factorization of C so that
! L_r = C ( I - V_temp^T T_temp V_temp )
[ V_temp, L_r, T_temp ] ← qlt(k, n-k, C) ;
B(1:k, k+1:n) ← L_r ; T(1:k, : ) ← T_temp ;
! Get the first blocks for U, V, Z, W
T_emp ← T_temp V_temp ;
Z(1:k, k+1:n) ← L_ua - (L_ua V_temp^T) T_emp ;
T_emp2 ← V_temp^T T_temp ;
W( :, 1:k) ← A_old( :, k+1:n) T_emp2 ;
V(1:k, k+1:n) ← V_temp ;
! Now loop through all but the end block
! In the usual application to large sparse matrices
! the number of loops b_lks is constrained by available RAM
! or by satisfaction of a convergence requirement.
for i = 2 : b_lks,
  i_low ← i_l(i) ; i_hi ← i_h(i) ; i_hp1 ← i_h(i) + 1 ;
  ! To proceed with a reduction to banded form,
  ! we need to multiply the updated A
  !   A_updated = A_old - U Z - W V
  ! by a block of vectors X. Since A_updated is presumed dense,
  ! A_updated X is accomplished as
  !   A_old X - U (Z X) - W (V X)
  ! Update the current column block of A
  C ← A_old(i_low:m, i_low:i_hi) ; ! Extract a column block of A_old
  C ← C - U(i_low:m, 1:i_low-1) Z(1:i_low-1, i_low:i_hi) ;
  C ← C - W(i_low:m, 1:i_low-1) V(1:i_low-1, i_low:i_hi) ;
  ! qrl performs the QR factorization of the current column block.
  [ U_temp, R, L_temp ] ← qrl(m-i_low+1, k, C) ;
  U(i_low:m, i_low:i_hi) ← U_temp ; L(i_low:i_hi, : ) ← L_temp ;
  ! Multiply (L_i U_i^T) A_updated
  B(i_low:m, i_low:i_hi) ← R ;
  L_up ← L_temp U_temp^T ;
  L_ua ← L_up A_old(i_low:m, i_hp1:n) ; ! SMDM with A_old
  L_ua ← L_ua - (L_up U(i_low:m, 1:i_low-1)) Z(1:i_low-1, i_hp1:n) ;
  L_ua ← L_ua - (L_up W(i_low:m, 1:i_low-1)) V(1:i_low-1, i_hp1:n) ;
  ! Update the current row block of B
  C ← A_old(i_low:i_hi, i_hp1:n) ; ! Extract row block of A_old
  C ← C - U(i_low:i_hi, 1:i_low-1) Z(1:i_low-1, i_hp1:n) ;
  C ← C - W(i_low:i_hi, 1:i_low-1) V(1:i_low-1, i_hp1:n) ;
  ! The row block also needs the update from the current column block
  C ← C - U(i_low:i_hi, i_low:i_hi) L_ua ;
  ! Having updated the current row block, get its
  ! QL factorization by calling qlt
  [ V_temp, L_r, T_temp ] ← qlt(k, n-i_hi, C) ;
  B(i_low:i_hi, i_hp1:n) ← L_r ; T(i_low:i_hi, : ) ← T_temp ;
  ! Get the next blocks for the U, V, Z, W matrices
  T_emp ← T_temp V_temp ;
  Z(i_low:i_hi, i_hp1:n) ← L_ua - (L_ua V_temp^T) T_emp ;
  T_emp2 ← V_temp^T T_temp ;
  T_emp3 ← A_old(i_low:m, i_hp1:n) T_emp2 ; ! SMDM with A_old
  T_emp3 ← T_emp3 - U(i_low:m, 1:i_low-1) (Z(1:i_low-1, i_hp1:n) T_emp2) ;
  T_emp3 ← T_emp3 - W(i_low:m, 1:i_low-1) (V(1:i_low-1, i_hp1:n) T_emp2) ;
  W(i_low:m, i_low:i_hi) ← T_emp3 ;
  V(i_low:i_hi, i_hp1:n) ← V_temp ;
end ! End of loop on blocks
! We have eliminated all the row blocks of width k.
! Eliminate the rest of the columns as one block
i_low ← i_h(b_lks) + 1 ; i_hi ← n ;
! Update the current column block from A_old
C ← A_old(i_low:m, i_low:i_hi) ; ! Extract block of A_old
C ← C - U(i_low:m, 1:i_low-1) Z(1:i_low-1, i_low:i_hi) ;
C ← C - W(i_low:m, 1:i_low-1) V(1:i_low-1, i_low:i_hi) ;
! qrl for QR factorization of the last column block.
[ U_temp, R, L_temp ] ← qrl(m-i_low+1, n-i_low+1, C) ;
U(i_low:m, i_low:i_hi) ← U_temp ;
L_2 ← L_temp ;
B(i_low:m, i_low:i_hi) ← R ;
Endfunction
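The deferral pattern of the pseudo-code (extract a block of the unchanged input, apply the stored transformations just in time, then factor) can be seen in a much simplified one-sided analogue. The sketch below is not the two-sided $UB_{k+1}V$ algorithm: it is a dense, left-looking block QR, assumes NumPy and `m >= n`, and stores each block's full orthogonal factor from `numpy.linalg.qr` rather than a compact block Householder form. The name `lazy_block_qr` is hypothetical.

```python
import numpy as np

def lazy_block_qr(A_old, k):
    """Upper triangular B with the singular values of A_old; A_old is only read."""
    m, n = A_old.shape
    B = np.zeros((m, n))
    factors = []  # (row offset, orthogonal factor) for each eliminated block
    for lo in range(0, n, k):
        hi = min(lo + k, n)
        # Extract a column block of the original matrix (astype makes a copy).
        C = A_old[:, lo:hi].astype(float)
        # Just-in-time update: apply the deferred transformations in order.
        for off, Q in factors:
            C[off:, :] = Q.T @ C[off:, :]
        # Factor the part of the block on and below the diagonal.
        Q, R = np.linalg.qr(C[lo:, :], mode="complete")
        B[:lo, lo:hi] = C[:lo, :]
        B[lo:, lo:hi] = R
        factors.append((lo, Q))
    return B
```

As with the pseudo-code when run to completion, the returned matrix can be checked against the input by comparing singular values, which gives a simple test for a correct implementation.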
5.3 Comparison to Dense 2-Sided Block Householder Reduction

This section makes explicit some differences between the sparse algorithm presented in the pseudo-code and the more usual dense algorithm. The dense algorithm proceeds by alternately eliminating column and row blocks. Consider a partitioning of the original matrix
\[ A = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}. \tag{5.6} \]
For the sparse algorithm, elimination of a block of columns corresponding to $A_{21}$ and $A_{31}$ and an initial row block corresponding to $A_{12}$ and $A_{13}$ has changed no entries of $A$. $A_{11}$ corresponds to an upper triangular matrix $B_{11}$. $A_{21}$, $A_{13}$, $A_{31}$, and the upper triangular part of $A_{12}$ would have been eliminated in the dense algorithm. The dense algorithm would update the trailing matrix
\[ \begin{pmatrix} A_{22} & A_{23} \\ A_{32} & A_{33} \end{pmatrix}. \tag{5.7} \]

5.3.1 Multiplying $\hat{A}X$

In the case of a large sparse matrix, we can't actually form the updated $\hat{A} = A - UZ - WV$, as it would be dense and exhaust RAM. Instead compute
\[ \hat{A}X = AX - U(ZX) - W(VX) \tag{5.8} \]
Where the dense algorithm would have an already updated block to eliminate, the sparse algorithm extracts the corresponding block of the original matrix and performs a just in time update. For example, for the block $A_{32}$:

- Perform the sequence of block Householder eliminations which had already been made to eliminate $A_{21}$, $A_{11}$, $A_{12}$. These are all BLAS-3.
\[ \hat{A}_{32} \leftarrow A_{32} - U(3, :) Z(:, 2) - W(3, :) V(:, 2) \tag{5.9} \]
- Perform a QR factorization of $\hat{A}_{32}$.
- New blocks of W and Z are formed by multiplying the currently produced blocks of dense vectors by the sparse $A$, e.g. W(:, 3) V(3, :) A
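The product with the never-formed updated matrix can be sketched as follows, using the sign convention of (5.4) and of the pseudo-code, $A_{updated} = A_{old} - UZ - WV$. The code assumes NumPy and stores the sparse matrix as coordinate triplets; this storage choice and the names `smdm` and `lazy_product` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def smdm(rows, cols, vals, shape, X):
    """Sparse matrix (coordinate triplets) times a dense block of vectors X."""
    out = np.zeros((shape[0], X.shape[1]))
    # Each nonzero a_ij adds a_ij * X[j, :] to row i of the result.
    np.add.at(out, rows, vals[:, None] * X[cols])
    return out

def lazy_product(rows, cols, vals, shape, U, Z, W, V, X):
    """(A - U Z - W V) X without forming the dense updated matrix."""
    # Parenthesization matters: Z @ X and V @ X are small, so every dense
    # product involves a block with only as many columns as X.
    return smdm(rows, cols, vals, shape, X) - U @ (Z @ X) - W @ (V @ X)
```

The only access to the sparse matrix is the single SMDM call; the two correction terms are tall or wide dense products whose cost grows with the number of columns already stored in U and W.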
5.3.2 Storage and Flop Comparisons

Reducing an $m \times n$, $m \ge n$, matrix to upper bandwidth $k + 1$ by Householder transformations requires $4mn^2 - \frac{4}{3}n^3$ flops. For the $i$th elimination of row and column blocks of size $k$, the dense algorithm requires $4(m - ik)(n - ik)k$ flops for updates and $4(m - ik)(n - ik)k$ flops for multiplications of $A$ by blocks of $k$ row and column multipliers, so that the $i$th block requires
\[ 8k(m - ik)(n - ik) \tag{5.10} \]
flops for elimination. The sparse algorithm as illustrated in the pseudo-code differs in that, instead of multiplications of the form $AU_i$, $V_i^TA$ with $A$ dense, we perform multiplications
\[ A_{updated} U_i = A_{old} U_i - U(ik+1\!:\!m, 1\!:\!ik)\, Z(1\!:\!ik, ik+1\!:\!n)\, U_i - W(ik+1\!:\!m, 1\!:\!ik)\, V(1\!:\!ik, ik+1\!:\!n)\, U_i \tag{5.11} \]
and
\[ V_i^T A_{updated} = V_i^T A_{old} - V_i^T U(ik+1\!:\!m, 1\!:\!ik)\, Z(1\!:\!ik, ik+1\!:\!n) - V_i^T W(ik+1\!:\!m, 1\!:\!ik)\, V(1\!:\!ik, ik+1\!:\!n) \tag{5.12} \]
Neglecting the sparse matrix dense matrix flops, the flop count for a completed reduction would be $6mn^2 - 2n^3$, with the incremental number of flops for the $i$th pair of row column blocks being
\[ 12k(ik)[m + n - 2ik], \tag{5.13} \]
requiring a total of
\[ 6(lk)^2(m + n) - 8(lk)^3 \tag{5.14} \]
flops to eliminate $l$ row and column blocks of size $k$.

For the dense algorithm, required storage is independent of the number of row-column pairs eliminated. As seen in (5.10), initial row-column eliminations require more flops than later ones. Conversely, for the sparse algorithm, the number of flops for the next block eliminated is proportional to $i$ (when $ik \ll n + m$), so that the flop count is proportional to the square of the total number $l$ of eliminated blocks. For $m = n$, the incremental flop counts for the sparse and dense algorithms are equal for $ik = n/4$, so that at $n/4$ the difference in required flops is maximal. For $ik > n/4$, the dense algorithm becomes more competitive in terms of required flops.
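The flop counts (5.10), (5.13), and (5.14) are easy to tabulate. The sketch below simply encodes those three formulas in plain Python, with hypothetical function names, so that the crossover at $ik = n/4$ for $m = n$ and the agreement between the summed increments (5.13) and the total (5.14) can be checked for specific sizes.

```python
def dense_increment(m, n, k, i):
    """Flops for the i-th block elimination in the dense algorithm, (5.10)."""
    return 8 * k * (m - i * k) * (n - i * k)

def sparse_increment(m, n, k, i):
    """Flops for the i-th block elimination in the sparse algorithm, (5.13)."""
    return 12 * k * (i * k) * (m + n - 2 * i * k)

def sparse_total(m, n, k, l):
    """Flops to eliminate l row and column blocks of size k, (5.14)."""
    return 6 * (l * k) ** 2 * (m + n) - 8 * (l * k) ** 3
```

For example, with `m = n = 4000` and `k = 10`, the two incremental counts coincide at `i = 100`, that is at `ik = n/4`, and summing `sparse_increment` over `i = 1, ..., 100` agrees with `sparse_total` to within about one percent, the difference being the usual gap between a sum and the integral it approximates.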
In the dense serial algorithm, the size of the matrix that can be reduced to small band form (on a single processor) depends on how large a dense matrix will fit in available RAM.
[Figure 4: Sparse and Dense Flops vs. Columns Eliminated for a Matrix for Which the Dense Algorithm Requires 2 GBytes Storage. Plot title: Flops to Eliminate L Columns, Sparse vs. Dense Algorithms; x-axis: Row Column Pairs Eliminated; y-axis: Flops (×10^12); curves: Dense Algorithm, Sparse Algorithm.]

In the sparse serial algorithm, available storage limits the number of blocks that can be eliminated. Neglecting the storage of $A$, eliminating $l$ rows and columns requires $2(m + n)l$ double precision numbers stored in $W$, $U$, $V$ and $Z$. When the number of eliminated rows and columns is $n/4$, the total storage for $U$, $V$, $W$, and $Z$ is $nm/2 + n^2/2$, so comparable to the total storage required for the dense algorithm.

For a double precision in-core serial dense computation, the largest matrix we can expect to reduce with 2 GBytes of RAM is at most 16K square.⁵ For the sparse matrix with a fixed quantity of RAM, the number $l$ of eliminated rows and columns is inversely proportional to $m + n$. For example, with 2 GBytes allocated for storage of $U$, $V$, $W$, and $Z$, then with $m + n = 10^5$ (one hundred thousand), at most 1250 row-column pairs can be eliminated in-core; for $m + n = 10^6$ (one million), at most 125.⁶

⁵ K here means $2^{10}$; 16K × 16K × 8 = 2 GBytes, with 8 bytes per double precision number. This assumes that only $U$, $V$ are stored, overwriting part of $A$.

⁶ For both the sparse and dense case, this discussion neglects the RAM needed for system requirements, which might typically reduce RAM available for storage to three quarters of installed RAM. When storage use exceeds available RAM, execution times may markedly increase as data must be written to and read from hard disk drives.
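The storage bound can be turned into a short calculation. The sketch below assumes, as in the text, that eliminating $l$ rows and columns stores $2(m + n)l$ double precision numbers of 8 bytes each, and it reads "2 GBytes" as $2 \times 10^9$ bytes, which is the reading that reproduces the figure of 1250 quoted for $m + n = 10^5$. The function name `max_eliminated` is illustrative.

```python
BYTES_PER_DOUBLE = 8


def max_eliminated(m_plus_n, ram_bytes):
    """Largest l such that 2*(m+n)*l doubles (U, V, W, Z) fit in ram_bytes."""
    return ram_bytes // (BYTES_PER_DOUBLE * 2 * m_plus_n)


ram = 2 * 10**9                       # assumption: 2 GBytes taken as 2e9 bytes
print(max_eliminated(10**5, ram))     # 1250
print(max_eliminated(10**6, ram))     # 125
```

Because the bound depends only on $m + n$ and not on the number of nonzeros, it expresses the inverse proportionality between matrix size and the number of row-column pairs that can be eliminated in-core.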
[Figure 5: Number of Rows and Columns that can be Eliminated with 2 GBytes of Array Storage. Plot title: Columns that Can be Eliminated vs. Matrix Size; x-axis: Total Rows plus Columns (×10^5); y-axis: Number of Row Column Pairs That Can Be Eliminated.]

5.4 Apply a preliminary random block Householder transformation

As so far discussed, sparse block Householder reduction may perform wasted block eliminations. For example, if $A$ is already of banded upper triangular form, Householder elimination of $l$ rows and columns merely extracts the upper left $l \times l$ matrix, the singular values of which may not be representative of the singular values of $A$. Taking a preliminary random Householder transformation ensures that the SMDM operations $AX$ and $A^T Y$ are made with $X$ and $Y$ dense, so that the products are impacted by all nonzero entries of $A$.

Modification of the $UB_{k+1}V$ algorithm is straightforward. The only access to the original sparse matrix $A$ was for block extraction and multiplications by blocks of dense vectors (SMDMs). For the SMDM operation, denote

\[
A_0 = A - U_0 Z_0 - W_0 V_0 .
\]

Then

\[
A_0 X = AX - U_0 (Z_0 X) - W_0 (V_0 X)
\]

and Equation 5.8 becomes

\[
\hat{A}X = A_0 X + U(ZX) + W(VX).
\]

Each extraction of a row or column block of $A$ is replaced by an extraction and a preliminary update. So instead of the column extraction

\[
C \leftarrow A(i_{low}{:}m,\ i_{low}{:}i_{hi})
\]
we have

\[
C \leftarrow A(i_{low}{:}m,\ i_{low}{:}i_{hi}) - U_0(i_{low}{:}m,\ 1{:}i_{low}{-}1)\, Z_0(1{:}i_{low}{-}1,\ i_{low}{:}i_{hi}) - W_0(i_{low}{:}m,\ 1{:}i_{low}{-}1)\, V_0(1{:}i_{low}{-}1,\ i_{low}{:}i_{hi}). \tag{5.15}
\]

The preliminary random Householder transformation was used in the numeric experiments discussed in the next section.

[Figure 6: For each of 261 matrices, some singular values were determined. When 2 GBytes of storage could be used without relocation error or segmentation faults, at least 20 singular values were found. Plot title: Converged Singular Values for Bandwidth 2; x-axis: Matrix Size from 10K to 400K, 249 Matrices; y-axis: Number of converged singular values; legend: Converged, 2 GBytes Storage; Compared; Converged, 1 GByte Storage.]

6 Numerical Experiments

Numerical tests were with matrices from the Davis UF Sparse Collection [4]. As with Householder implementations of GMRES and ARPACK, $UB_{k+1}V$ is a stable algorithm. Limiting the computation to use 2 GBytes of dimensioned space, then for matrices of size up to around …, singular values of $B_{k+1}$ (all columns eliminated) and of $A$ can both be computed by a 32 bit version of the standard LAPACK program dgesvd. As expected, LAPACK dgesvd gives the same singular values for $A$ and $B_{k+1}$ to high accuracy.⁷

⁷ Our implementation of lazy Householder computation reduces to upper banded form, and can be run to completion only for matrices with at least as many rows as columns.
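The agreement of singular values rests on the fact that orthogonal transformations applied on the left and right leave singular values unchanged, which is also why a preliminary random Householder transformation is harmless. A minimal NumPy sketch of that fact follows. It is a simplification under stated assumptions: single random reflectors are used instead of block Householder transformations, they are formed explicitly (feasible only for a tiny example), and the matrix is a small thresholded random matrix standing in for a sparse test matrix. The helper name `random_householder` is illustrative.

```python
import numpy as np


def random_householder(n, rng):
    """Return the n x n reflector H = I - 2 v v^T / (v^T v) for a random v."""
    v = rng.standard_normal(n)
    return np.eye(n) - 2.0 * np.outer(v, v) / (v @ v)


rng = np.random.default_rng(1)
m, n = 9, 6                        # hypothetical sizes, m >= n
A = rng.standard_normal((m, n))
A[np.abs(A) < 1.0] = 0.0           # crude stand-in for a sparse matrix

Q = random_householder(m, rng)     # left transformation
P = random_householder(n, rng)     # right transformation
A0 = Q @ A @ P                     # generally dense, same singular values

s_A = np.linalg.svd(A, compute_uv=False)
s_A0 = np.linalg.svd(A0, compute_uv=False)
assert np.allclose(s_A, s_A0)
assert np.allclose(Q @ Q.T, np.eye(m))   # reflectors are orthogonal
```

In the lazy algorithm the transformed matrix is never formed; the sketch forms `A0` only to compare singular values directly.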
[Figure 7: For one case of 307, no singular values were determined. In several other cases, only a few singular values were determined. Plot title: Converged Singular Values for Bandwidth 6; x-axis: Matrix Size from 10K to 400K, 307 Matrices; y-axis: Number of converged singular values; legend: Converged, 2 GBytes Storage; Compared; Converged, 1 GByte Storage.]

For matrices $A$ large enough that 32 bit LAPACK cannot be used, the $UB_{k+1}V$ algorithm cannot be run to completion, as the storage requirements would be too high. For these larger matrices, singular values of $B^N_{k+1}$ were computed by two calls to LAPACK dgesvd. A first call to dgesvd was for the entire $N \times N$ matrix. A second dgesvd call was for the upper left square submatrix $B_1$ of dimension $\min(N - k, N - 6)$. The largest $L = 2\sqrt{N} + 10$ (with $lk = N \propto 1/(m + n)$) singular values were compared to one another. Let $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_L$ be the largest $L$ singular values of $B^N_{k+1}$, and $\hat\sigma_1 \ge \hat\sigma_2 \ge \ldots \ge \hat\sigma_L$ the largest singular values of $B_1$. $\sigma_i$ was said to be converged if

\[
\frac{|\sigma_j - \hat\sigma_j|}{\sigma_j} < 10^{-8}, \quad \forall j,\ 1 \le j \le i .
\]

Converged singular values for different bandwidths agreed to high accuracy.

The tests plotted in Figures 6, 7, and 8 used the June 2008 collection. These tests used only 2 GBytes of RAM, compiled with 32 bit integers. For bandwidths 2 (Figure 6), 6 (Figure 7), and 12 (Figure 8) we computed singular values for all the unsymmetric or rectangular matrices with $10^4 < (m + n)/2 < \ldots$ The results for bandwidth 6 and 12 include integer valued matrices. In each case, the number of steps $lk = N$ is calculated so that the $2N(m + n)$ double precision numbers allocated for $W$, $U$, $V$, $Z$ plus the $n_z$ elements of the sparse matrix $A$ require less than 2 GBytes of storage (taking 8 bytes of storage per
double precision number). For each matrix the code was recompiled to reset parameters for matrix dimensioning. For some of the larger matrices, Fortran code compiled with the g77 compiler suffered relocation errors at compile time, or run time segmentation faults. These instances were recompiled to use 1 GByte for matrix storage, and rerun. For each bandwidth $k = 2, 6, 12$, around 300 matrices ran successfully. In each instance, the Frobenius norm of $B^N_{k+1}$ was less than or equal to the Frobenius norm of $A$, with near equality in some cases. The algorithm used a preliminary random block Householder transformation of block size $k$.

[Figure 8: Using block size 12 for matrices larger than 100 thousand was not effective with only 1 GByte of storage. For these matrices, the number of multiplications by A was relatively small. Plot title: Converged Singular Values for Bandwidth 12; x-axis: Matrix Size from 10K to 400K, 315 Matrices; y-axis: Number of converged singular values; legend: Converged, 2 GBytes Storage; Compared; Converged, 1 GByte Storage.]

Figures 6, 7 and 8 plot the number of singular values converged for matrices of sizes $\ldots < (m + n)/2 < \ldots$ In the legend, "." represents the number of singular values compared, "o" represents the number converged for 2 GBytes of storage, and "x" represents the number of converged singular values for 1 GByte of storage. For the largest matrices represented, only about 40 right and left basis vectors could be computed. The maximal number of computed basis vectors was 1500 (representing the flat part of the plots). When the "o" encloses a "." or an "x", all compared singular values converged. The isolated "o"s and "x"s indicate instances for which fewer singular values converged than were compared.
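The convergence test used in these experiments counts $\sigma_i$ as converged only when the relative difference $|\sigma_j - \hat\sigma_j| / \sigma_j$ is below $10^{-8}$ for every $j \le i$. A minimal Python sketch of that counting rule follows; the function name `count_converged` and the sample values are hypothetical, and both inputs are assumed to be sorted in decreasing order, as returned by an SVD routine.

```python
def count_converged(sigma, sigma_hat, tol=1e-8):
    """Count leading singular values passing the relative test for every j <= i.

    sigma:     largest singular values of the full banded matrix
    sigma_hat: largest singular values of its upper left submatrix
    """
    count = 0
    for s, s_hat in zip(sigma, sigma_hat):
        # Stop at the first failure: later values cannot count as converged.
        if s == 0 or abs(s - s_hat) / s >= tol:
            break
        count += 1
    return count


# Hypothetical data: the third value differs by a relative 1e-5.
sigma = [10.0, 7.5, 3.2, 1.1]
sigma_hat = [10.0, 7.5 * (1 - 1e-10), 3.2 * (1 - 1e-5), 1.1]
print(count_converged(sigma, sigma_hat))
```

In this example the fourth pair agrees exactly but is not counted, because the test requires agreement for all smaller indices as well.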
[Figure 9: Singular values were determined for 420 Matrices, size 10 thousand to 10 million. 7 matrices of size less than 1 Million failed to return at least a dozen converged singular values. 64 GBytes of RAM were not always enough to find multiple singular values for matrices of size greater than a million. Plot title: Converged Singular Values for Bandwidth 8, 48 GByte basis; x-axis: Matrix Size from 10 Thousand to 10 Million, 420 Matrices; y-axis: Number of converged singular values; legend: Converged, 64 GBytes Storage; Compared.]

For a high proportion of the test matrices, all the $L = 2\sqrt{N} + 10$ singular values converged. For bandwidth 2, for 238 of 250 matrices, all the compared singular values converged. The minimal number of converged singular values was 10 of 30 compared.

For bandwidth 6, for 278 of 308 matrices, all compared singular values converged. In one instance there were no converged singular values of 36 compared. Two other instances of poor convergence were 2 of 36 and 3 of 28. All the worst cases were when only 1 GByte of storage was used. Also, these cases tended to be the matrices of higher dimension (for which the size $N$ of $B_7$ was relatively small). The next worst was 9 of 53.

For a bandwidth of 12, singular values were computed for 315 matrices. 32 of these had suffered relocation errors or runtime segmentation faults for 2 GBytes of storage, so were rerun allowing 1 GByte for storage. For 259 of 315 matrices, all the compared singular values converged. There were many instances of no converged singular values, especially
for large matrices and in the case that only 1 GByte of storage could be used.

[Figure 10: If the Frobenius norm of the banded matrix is of the same order as the norm of the original matrix, compared singular values are likely to be converged. The two circles at the lower right are exceptional cases. Plot title: Banded matrix norms as a fraction of the original matrix norm, Ratio of Frobenius Norm vs. Proportion of Converged Singular Values; x-axis: FrobeniusNorm(Abanded)/FrobeniusNorm(A), 462 Matrices; y-axis: Proportion of Compared Singular Values that Converged.]

6.1 48 GBytes of RAM

The preceding runs were tested against the Davis collection of June 2008. In November of 2009, additional large matrices, several dozen of size greater than a million, had entered the collection. At the same time, 16 core (4 quad core Opteron) blades had become available in the NC State blade center.⁸ Using 48 GBytes of RAM for matrix storage, then for test matrices of size up to about a million, the $UB_{k+1}V$ algorithm with $k = 8$ determined some singular values in all but a few cases. Figure 9 plots the number of converged singular values vs. the matrix size. For this plot, the number of compared singular values is taken as $L = 2\sqrt{N} + 10$. For matrices of size greater than a million, fewer rows and columns can be eliminated, and fewer singular values are determined.

⁸ These blades have 64 GBytes of RAM. OpenMP BLAS performance on these machines was plotted in Figure 2. Using 8 byte integers and the ACML … byte integer BLAS library with a PGI Fortran compiler, segmentation faults did not occur.
[Figure 11: If the 2nd largest singular value is nearly equal to the smallest converged singular value, few singular values may converge. A small ratio $\sigma_{min}/\sigma_2$ was a good predictor that most compared singular values would converge. There was one exceptional case with a small ratio for which only about 1/5 of the compared singular values converged. Plot title: Clustered Singular Values Slow Convergence, Equal Singular Values Slow to Converge, 441 Matrices; x-axis: (Smallest Converged Singular Value)/(2nd Largest Singular Value); y-axis: Fraction of Compared Singular Values that Converged.]

Table 3 shows some timing results. Times ranged from about 40 seconds for a matrix of size 50 thousand to about 500 seconds for a matrix of size 322 thousand. For these matrices and the matrix of size 160 thousand, 1250 rows and columns were eliminated, so that singular values were determined from a triangular matrix of size 1250 with bandwidth 8. For the matrix of size 1.96 million, about 200 rows and columns were eliminated. Though the BLAS-3 operations are reasonably fast, the Sparse Matrix Dense Matrix (SMDM) multiplications and other computations (largely skinny QR) are a significant proportion of the 16 core time. Getting good parallel performance for more than 16 processors will require parallelization of skinny QR (see Demmel, Grigori, Hoemmen, and Langou [5] for a successful approach) and more work on the SMDM operations.

6.2 Observations on Convergence

It is natural to expect that the number of converged singular values tends to increase with the basis size and the number of multiplications by the sparse
matrix. We would also expect the number of converged singular values to increase with the fraction of the original matrix Frobenius norm captured in the Frobenius norm of the banded matrix. Conversely, if many large singular values are nearly equal in size, then convergence is likely to be slow, so that the number of converged singular values will tend to be less.

[Table 3: Runs of Spar3Bnd Bandwidth 8. GFlop rates with 16 cores. Columns: Size in K; BLAS3 Secs; BLAS3 Gflop; SMDM Secs; SMDM Gflop; Other Secs.]

These tendencies were evident in experiments with the Davis matrix collection. Figures 10 and 11 are from the same test (48 GBytes of basis vectors and the December Davis collection) as Figure 9.

- The number of converged singular values increases with the size of the Householder basis. Since the number of basis vectors (rows and columns eliminated) is inversely proportional to matrix size (see Figure 5), a decrease in computed singular values with increased matrix size is expected. See Figures 6, 7 and 8.
- For a fixed number of rows and columns eliminated (fixed usage of RAM), the number of multiplications $AX$ and $Y^T A$ is inversely proportional to the bandwidth $k$. For a fixed allocation of storage, the number of computed singular values decreases somewhat as $k$ increases. Again, see Figures 6, 7 and 8. Conversely, increasing $k$ increases the speed of the computation (see Figure 2).
- When Frobenius norms of the reduced matrix $B_{k+1}$ in Equation (5.1) are near those of the original matrix $A$, i.e., when

\[
R = \frac{\|B^N_{k+1}\|_F}{\|A\|_F} \approx 1,
\]

convergence of a significant fraction of singular values is likely. Figure 10 plots the proportion of converged to compared singular values vs. $R$.
- When the largest singular values are nearly equal, relatively few singular values may converge. Figure 11 plots the proportion of converged to
compared singular values vs. the ratio $\sigma_{min}/\sigma_2$, where $\sigma_{min}$ is the smallest converged singular value and $\sigma_2$ is the next to largest converged singular value, $\sigma_{min} \le \ldots \le \sigma_2 \le \sigma_1$.

7 Conclusions and Acknowledgements

We report good success in using the lazy $UB_{k+1}V$ decomposition to compute a collection of largest singular values for sparse matrices. Ongoing work is in

- computing singular vectors and low rank approximations
- comparing performance to other methods of computing sparse matrix singular values
- simplifying and modernizing the code
- improving multi-core performance

Some current work is in using a $UB_{k+1}V$ decomposition for solving a sparse least squares problem.

The author wishes to offer thanks for advice and encouragement from Gene Golub and Jim Demmel. He is grateful to Franc Brglez for aid in automating numerical experiments over a fairly large collection of matrices and to Noura Howell for help in editing the manuscript.

References

[1] J. Angeli, O. Basset, C. Fulton, G. Howell, R. Hsu, A. Sawetprawhickal, M. Schuster, D. Richardson, H. Thompson, and S. Wilberscheid. Some issues in efficient implementation of a vector based model for document retrieval, June.
[2] M. Berry, T. Do, G. O'Brien, V. Krishna, and S. Varadhan. SVDPACKC: Version 1.0 user's guide. Technical Report CS, University of Tennessee, Knoxville, TN, October.
[3] J. Choi, J. Dongarra, and D. Walker. The design of a parallel dense linear algebra software library: Reduction to Hessenberg, tridiagonal, and bidiagonal form. Num. Alg., 10. LAPACK Working Note #92.
[4] T. Davis. University of Florida sparse matrix collection.
[5] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-optimal parallel and sequential QR and LU factorizations. Technical Report UCB/EECS, LAWN 204, University of California, August.
[6] A. A. Dubrulle. On block Householder algorithms for the reduction of a matrix to Hessenberg form. Supercomputing 88. Vol. II: Science and Applications. Proceedings, IEEE Explore, 2, Nov.
[7] L. Giraud and J. Langou. Robust selective Gram-Schmidt reorthogonalization. Technical Report TR/PA/02/52, CERFACS, Toulouse, FR.
[8] G. Golub and W. Kahan. Calculating the singular values and pseudo-inverse of a matrix. SIAM J. Num. Anal., 2.
[9] G. Golub, F. Luk, and M. Overton. A block Lanczos method for computing the singular values and corresponding singular vectors of a matrix. ACM Trans. Math. Soft., 7.
[10] B. Grösser and B. Lang. Efficient parallel reduction to bidiagonal form. Preprint BUGHW-SC 98/2 (Available from …).
[11] G. Howell, J. Demmel, C. Fulton, S. Hammarling, and K. Marmol. Cache efficient bidiagonalization using BLAS 2.5 operators. ACM Transactions on Mathematical Software, 34(3):13-46, May.
[12] E. Im. Optimizing the Performance of Sparse Matrix-Vector Multiplication. PhD thesis, University of California, Berkeley.
[13] W. Jalby and B. Philippe. Stability analysis and improvement of the block Gram-Schmidt algorithm. SIAM J. Sci. Stat. Comput., 12(5).
[14] T. Joffrain, T. M. Low, E. S. Quintana-Orti, R. Van de Geijn, and F. G. Van Zee. Accumulating Householder transformations, revisited. ACM Trans. on Math. Software, 32(2).
[15] L. Kaufman. Application of dense Householder transformation to a sparse matrix. ACM Trans. on Math. Software, 5(4).
[16] B. Lang. Parallel reduction of banded matrices to bidiagonal form. Parallel Comput., 22:1-18.
[17] R. Larsen. PROPACK, software package for sparse SVD. Available from rmunk/propack/.
1 Numerical Linear Algebra 11 The LU Factorization Recall from linear algebra that Gaussian elimination is a method for solving linear systems of the form Ax = b, where A R m n and bran(a) In this method
More informationIndex. for generalized eigenvalue problem, butterfly form, 211
Index ad hoc shifts, 165 aggressive early deflation, 205 207 algebraic multiplicity, 35 algebraic Riccati equation, 100 Arnoldi process, 372 block, 418 Hamiltonian skew symmetric, 420 implicitly restarted,
More informationA Review of Matrix Analysis
Matrix Notation Part Matrix Operations Matrices are simply rectangular arrays of quantities Each quantity in the array is called an element of the matrix and an element can be either a numerical value
More informationPreconditioned Parallel Block Jacobi SVD Algorithm
Parallel Numerics 5, 15-24 M. Vajteršic, R. Trobec, P. Zinterhof, A. Uhl (Eds.) Chapter 2: Matrix Algebra ISBN 961-633-67-8 Preconditioned Parallel Block Jacobi SVD Algorithm Gabriel Okša 1, Marián Vajteršic
More informationAM 205: lecture 8. Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition
AM 205: lecture 8 Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition QR Factorization A matrix A R m n, m n, can be factorized
More informationLinear System of Equations
Linear System of Equations Linear systems are perhaps the most widely applied numerical procedures when real-world situation are to be simulated. Example: computing the forces in a TRUSS. F F 5. 77F F.
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences)
AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) Lecture 1: Course Overview; Matrix Multiplication Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical
More informationNumerical Analysis Lecture Notes
Numerical Analysis Lecture Notes Peter J Olver 8 Numerical Computation of Eigenvalues In this part, we discuss some practical methods for computing eigenvalues and eigenvectors of matrices Needless to
More informationRoundoff Error. Monday, August 29, 11
Roundoff Error A round-off error (rounding error), is the difference between the calculated approximation of a number and its exact mathematical value. Numerical analysis specifically tries to estimate
More informationGeneralized interval arithmetic on compact matrix Lie groups
myjournal manuscript No. (will be inserted by the editor) Generalized interval arithmetic on compact matrix Lie groups Hermann Schichl, Mihály Csaba Markót, Arnold Neumaier Faculty of Mathematics, University
More informationUsing Godunov s Two-Sided Sturm Sequences to Accurately Compute Singular Vectors of Bidiagonal Matrices.
Using Godunov s Two-Sided Sturm Sequences to Accurately Compute Singular Vectors of Bidiagonal Matrices. A.M. Matsekh E.P. Shurina 1 Introduction We present a hybrid scheme for computing singular vectors
More informationNumerical Methods I Non-Square and Sparse Linear Systems
Numerical Methods I Non-Square and Sparse Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 25th, 2014 A. Donev (Courant
More informationHybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC
Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,
More informationParallel Singular Value Decomposition. Jiaxing Tan
Parallel Singular Value Decomposition Jiaxing Tan Outline What is SVD? How to calculate SVD? How to parallelize SVD? Future Work What is SVD? Matrix Decomposition Eigen Decomposition A (non-zero) vector
More informationThe geometric mean algorithm
The geometric mean algorithm Rui Ralha Centro de Matemática Universidade do Minho 4710-057 Braga, Portugal email: r ralha@math.uminho.pt Abstract Bisection (of a real interval) is a well known algorithm
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra)
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 19: More on Arnoldi Iteration; Lanczos Iteration Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis I 1 / 17 Outline 1
More informationLast Time. Social Network Graphs Betweenness. Graph Laplacian. Girvan-Newman Algorithm. Spectral Bisection
Eigenvalue Problems Last Time Social Network Graphs Betweenness Girvan-Newman Algorithm Graph Laplacian Spectral Bisection λ 2, w 2 Today Small deviation into eigenvalue problems Formulation Standard eigenvalue
More informationSingular Value Decompsition
Singular Value Decompsition Massoud Malek One of the most useful results from linear algebra, is a matrix decomposition known as the singular value decomposition It has many useful applications in almost
More informationKey words. conjugate gradients, normwise backward error, incremental norm estimation.
Proceedings of ALGORITMY 2016 pp. 323 332 ON ERROR ESTIMATION IN THE CONJUGATE GRADIENT METHOD: NORMWISE BACKWARD ERROR PETR TICHÝ Abstract. Using an idea of Duff and Vömel [BIT, 42 (2002), pp. 300 322
More informationNumerical Linear Algebra
Numerical Linear Algebra Decompositions, numerical aspects Gerard Sleijpen and Martin van Gijzen September 27, 2017 1 Delft University of Technology Program Lecture 2 LU-decomposition Basic algorithm Cost
More informationProgram Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects
Numerical Linear Algebra Decompositions, numerical aspects Program Lecture 2 LU-decomposition Basic algorithm Cost Stability Pivoting Cholesky decomposition Sparse matrices and reorderings Gerard Sleijpen
More informationA High-Performance Parallel Hybrid Method for Large Sparse Linear Systems
Outline A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems Azzam Haidar CERFACS, Toulouse joint work with Luc Giraud (N7-IRIT, France) and Layne Watson (Virginia Polytechnic Institute,
More informationCME 302: NUMERICAL LINEAR ALGEBRA FALL 2005/06 LECTURE 6
CME 302: NUMERICAL LINEAR ALGEBRA FALL 2005/06 LECTURE 6 GENE H GOLUB Issues with Floating-point Arithmetic We conclude our discussion of floating-point arithmetic by highlighting two issues that frequently
More informationCALU: A Communication Optimal LU Factorization Algorithm
CALU: A Communication Optimal LU Factorization Algorithm James Demmel Laura Grigori Hua Xiang Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-010-9
More informationThe Future of LAPACK and ScaLAPACK
The Future of LAPACK and ScaLAPACK Jason Riedy, Yozo Hida, James Demmel EECS Department University of California, Berkeley November 18, 2005 Outline Survey responses: What users want Improving LAPACK and
More information6.4 Krylov Subspaces and Conjugate Gradients
6.4 Krylov Subspaces and Conjugate Gradients Our original equation is Ax = b. The preconditioned equation is P Ax = P b. When we write P, we never intend that an inverse will be explicitly computed. P
More informationAPPLIED NUMERICAL LINEAR ALGEBRA
APPLIED NUMERICAL LINEAR ALGEBRA James W. Demmel University of California Berkeley, California Society for Industrial and Applied Mathematics Philadelphia Contents Preface 1 Introduction 1 1.1 Basic Notation
More informationMatrix Computations: Direct Methods II. May 5, 2014 Lecture 11
Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would
More informationPreliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012
Instructions Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 The exam consists of four problems, each having multiple parts. You should attempt to solve all four problems. 1.
More informationIntel Math Kernel Library (Intel MKL) LAPACK
Intel Math Kernel Library (Intel MKL) LAPACK Linear equations Victor Kostin Intel MKL Dense Solvers team manager LAPACK http://www.netlib.org/lapack Systems of Linear Equations Linear Least Squares Eigenvalue
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms Chapter 6 Matrix Models Section 6.2 Low Rank Approximation Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Edgar
More informationJacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA
Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is
More informationChapter 4 No. 4.0 Answer True or False to the following. Give reasons for your answers.
MATH 434/534 Theoretical Assignment 3 Solution Chapter 4 No 40 Answer True or False to the following Give reasons for your answers If a backward stable algorithm is applied to a computational problem,
More informationMatrix Algorithms. Volume II: Eigensystems. G. W. Stewart H1HJ1L. University of Maryland College Park, Maryland
Matrix Algorithms Volume II: Eigensystems G. W. Stewart University of Maryland College Park, Maryland H1HJ1L Society for Industrial and Applied Mathematics Philadelphia CONTENTS Algorithms Preface xv xvii
More informationLecture 2: Numerical linear algebra
Lecture 2: Numerical linear algebra QR factorization Eigenvalue decomposition Singular value decomposition Conditioning of a problem Floating point arithmetic and stability of an algorithm Linear algebra
More informationDirect solution methods for sparse matrices. p. 1/49
Direct solution methods for sparse matrices p. 1/49 p. 2/49 Direct solution methods for sparse matrices Solve Ax = b, where A(n n). (1) Factorize A = LU, L lower-triangular, U upper-triangular. (2) Solve
More informationA model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)
A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) Schodinger equation: Hψ = Eψ Choose a basis set of wave functions Two cases: Orthonormal
More informationQR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel AGULLO (INRIA / LaBRI) Camille COTI (Iowa State University) Jack DONGARRA (University of Tennessee) Thomas HÉRAULT
More informationLecture 2 INF-MAT : , LU, symmetric LU, Positve (semi)definite, Cholesky, Semi-Cholesky
Lecture 2 INF-MAT 4350 2009: 7.1-7.6, LU, symmetric LU, Positve (semi)definite, Cholesky, Semi-Cholesky Tom Lyche and Michael Floater Centre of Mathematics for Applications, Department of Informatics,
More informationREORTHOGONALIZATION FOR GOLUB KAHAN LANCZOS BIDIAGONAL REDUCTION: PART II SINGULAR VECTORS
REORTHOGONALIZATION FOR GOLUB KAHAN LANCZOS BIDIAGONAL REDUCTION: PART II SINGULAR VECTORS JESSE L. BARLOW Department of Computer Science and Engineering, The Pennsylvania State University, University
More informationImplementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint:
Implementing QR Factorization Updating Algorithms on GPUs Andrew, Robert and Dingle, Nicholas J. 214 MIMS EPrint: 212.114 Manchester Institute for Mathematical Sciences School of Mathematics The University
More informationA DIVIDE-AND-CONQUER METHOD FOR THE TAKAGI FACTORIZATION
SIAM J MATRIX ANAL APPL Vol 0, No 0, pp 000 000 c XXXX Society for Industrial and Applied Mathematics A DIVIDE-AND-CONQUER METHOD FOR THE TAKAGI FACTORIZATION WEI XU AND SANZHENG QIAO Abstract This paper
More informationSTAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 9
STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 9 1. qr and complete orthogonal factorization poor man s svd can solve many problems on the svd list using either of these factorizations but they
More informationLinear System of Equations
Linear System of Equations Linear systems are perhaps the most widely applied numerical procedures when real-world situation are to be simulated. Example: computing the forces in a TRUSS. F F 5. 77F F.
More informationCommunication-avoiding LU and QR factorizations for multicore architectures
Communication-avoiding LU and QR factorizations for multicore architectures DONFACK Simplice INRIA Saclay Joint work with Laura Grigori INRIA Saclay Alok Kumar Gupta BCCS,Norway-5075 16th April 2010 Communication-avoiding
More information