Sparse BLAS-3 Reduction

Sparse BLAS-3 Reduction to Banded Upper Triangular (Spar3Bnd)
Gary Howell, HPC/OIT, NC State University
gary howell@ncsu.edu

Acknowledgements
James Demmel, Gene Golub, Franc Brglez.
Supported by NIH Molecular Libraries Roadmap for Medical Research, Grant 1 P20 Hg003900-01.

Multi-Core, Lots of RAM
Flops are cheap (multi-core); transferring data from RAM to cache is expensive. Hence algorithms should use BLAS-3 operations (limited by CPU speed), which reuse data in cache, as opposed to BLAS-1 and BLAS-2 algorithms (limited by the rate of reads and writes to RAM).
RAM is cheap. Last year at NC State we had only 4 GBytes per blade; this year, 16 GBytes. So we can afford to store more data than in the past.
Can we change the algorithm to make better use of the architecture?

Times change
Can we improve the algorithm for multi-core, cache-based architectures?
Stability: to avoid the spurious singular values associated with loss of basis rank, we need a linearly independent basis. For accuracy, we need an orthogonal basis.
Use the RAM: Ax and y^T A are relatively slow computations on modern computers; AX and Y^T A are faster. We can use BLAS-3 operations with blocks X, Y as opposed to BLAS-1 with single vectors x, y, and keep the cores busy by reusing data in cache.

Lanczos bidiagonalization minimizes storage and flops. It uses sparse matrix times dense vector multiplications y ← Ax, w^T ← z^T A and some dense vector BLAS-1 operations α ← u^T v, c ← βu + d. It needs O(nz) + O(n + m) total storage and about the same number of flops per iteration. But what if the vectors are too large for cache? And what about stability?
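
To make the BLAS-1/BLAS-2 character concrete, here is a minimal sketch of Golub-Kahan-Lanczos bidiagonalization without reorthogonalization. It is not the Spar3Bnd code; the function name and SciPy interface are illustrative assumptions. Each step costs one sparse product with A, one with A^T, and a few dense vector operations.

```python
import numpy as np
import scipy.sparse as sp

def lanczos_bidiag(A, k, rng=np.random.default_rng(0)):
    """Return the k x k lower-bidiagonal Lanczos matrix A_k for sparse A."""
    m, n = A.shape
    alphas, betas = [], []
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    u = A @ v                       # sparse matrix * dense vector
    alpha = np.linalg.norm(u)
    u /= alpha
    alphas.append(alpha)
    for _ in range(k - 1):
        w = A.T @ u - alpha * v     # second sparse mat-vec plus a BLAS-1 axpy
        beta = np.linalg.norm(w)
        v = w / beta
        w = A @ v - beta * u
        alpha = np.linalg.norm(w)
        u = w / alpha
        alphas.append(alpha)
        betas.append(beta)
    # alphas on the diagonal, betas on the first subdiagonal
    return sp.diags([alphas, betas], [0, -1], shape=(k, k)).toarray()
```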

Computation of a few singular values
After k steps a bidiagonal k × k matrix A_k is produced. If at each step we compute the singular values ρ_{i,k}, i = 1, ..., k of A_k, then we get a triangular array of trial singular values:

\[
\begin{pmatrix}
\rho_{1,k-1} & \rho_{1,k} & \rho_{1,k+1} \\
\rho_{2,k-1} & \rho_{2,k} & \rho_{2,k+1} \\
\vdots & \vdots & \vdots \\
\rho_{k-1,k-1} & \rho_{k-1,k} & \rho_{k-1,k+1} \\
0 & \rho_{k,k} & \rho_{k,k+1} \\
0 & 0 & \rho_{k+1,k+1}
\end{pmatrix}
\tag{1}
\]
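
A small illustration of building the array (1), using a hypothetical helper on top of the bidiagonalization sketch above: take the SVD of each leading k x k block and stack the results column by column.

```python
import numpy as np

def trial_singular_values(B):
    """Column k-1 holds the k singular values of the leading k x k block of B."""
    kmax = B.shape[0]
    table = np.zeros((kmax, kmax))
    for k in range(1, kmax + 1):
        table[:k, k - 1] = np.linalg.svd(B[:k, :k], compute_uv=False)
    return table
```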

Interlacing singular values
The computed singular values of A_k (column k) relate to those of A_{k+1} (column k + 1) by

\[
\rho_{i,k} \le \rho_{i,k+1}, \qquad \rho_{i,k} \ge \rho_{i+1,k+1}.
\]

In exact arithmetic, as k goes to M = min(m, n), the singular values of A_k go to the singular values of A. But in practice we run the Lanczos algorithm only for k ≪ M, and can recover at most a few singular values of A.
Problem: poor conditioning of the Lanczos bases.

Loss of rank of the Lanczos "basis"
[Figure: "Loss of Linear Independence of 'Orthogonal' Lanczos Basis" — rank deficiency versus number of columns.]
Of the 50 Lanczos vectors produced after 50 steps, 45 are linearly independent (rank deficiency of 5).

Orthogonality for Stability
For accurate singular values we need to preserve the orthogonality of the bases. Some possibilities: use Modified Gram-Schmidt to orthogonalize (Simon and Zha; Berry; Larsen); use an LU decomposition (Sadok; Stephens); use Householder transformations (ARPACK, Spar3Bnd).

L_2 Condition Numbers of Bases

    Size            10      20      30      40      50
    Householder QR  1.0     1.0     1.0     1.0     1.0
    L in LU         9.70    18.7    46.5    53.0    72.2
    MGS QR          1.001   98.5    170.    122.5   3631

Here we factored Hilbert matrices of various sizes and compared condition numbers of Q from the QR factorization (Householder and MGS) or L from the LU decomposition.
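
For reference, here is a rough way to reproduce numbers like those in the table. This is an illustrative sketch, not the original experiment's code: Hilbert test matrices, 2-norm condition numbers of the computed factors, and the values obtained need not match the slide's exactly.

```python
import numpy as np
from scipy.linalg import hilbert, qr, lu

def mgs(A):
    """Modified Gram-Schmidt; returns only the (possibly non-orthogonal) Q."""
    Q = A.astype(float).copy()
    n = Q.shape[1]
    for j in range(n):
        Q[:, j] /= np.linalg.norm(Q[:, j])
        for k in range(j + 1, n):
            Q[:, k] -= (Q[:, j] @ Q[:, k]) * Q[:, j]
    return Q

for n in (10, 20, 30, 40, 50):
    H = hilbert(n)
    Q_house, _ = qr(H)     # Householder QR: Q orthogonal to machine precision
    _, L, _ = lu(H)        # LU with partial pivoting: unit lower-triangular L
    Q_mgs = mgs(H)         # MGS Q loses orthogonality for ill-conditioned H
    print(n, np.linalg.cond(Q_house), np.linalg.cond(L), np.linalg.cond(Q_mgs))
```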

BLAS-3 Block Householder
By working with blocks X we can use BLAS-3 block Householder transformations, as opposed to single vectors x and BLAS-2 Householder transformations (as in, e.g., ARPACK).
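
A compact sketch of the WY-form idea behind block Householder (my own illustration, not the Spar3Bnd or LAPACK code): accumulate K reflectors as Q = I + W Y^T, so that applying Q^T to a block X needs only two dense matrix-matrix products instead of K rank-one BLAS-2 updates.

```python
import numpy as np

def block_householder_qr(A):
    """Return W, Y with Q = I + W @ Y.T orthogonal and Q.T @ A upper triangular."""
    m, k = A.shape
    R = A.astype(float).copy()
    W = np.zeros((m, k))
    Y = np.zeros((m, k))
    for j in range(k):
        v = R[:, j].copy()
        v[:j] = 0.0
        v[j] += np.sign(v[j] if v[j] != 0 else 1.0) * np.linalg.norm(v[j:])
        v /= np.linalg.norm(v)
        R -= 2.0 * np.outer(v, v @ R)                    # apply H_j = I - 2 v v^T
        Y[:, j] = v
        W[:, j] = -2.0 * (v + W[:, :j] @ (Y[:, :j].T @ v))  # WY update for Q = H_1...H_j
    return W, Y

# Applying Q^T to a block X then costs two GEMMs:  X + Y @ (W.T @ X)
```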

Replace Ax by AX
In the Lanczos algorithm we bidiagonalize. If instead we reduce to an upper triangular matrix of bandwidth K + 1, then we can replace Ax by AX and y^T A by Y^T A. Dragging A from RAM to cache then allows 2K flops per element of A, as opposed to 2 flops. When x or X is too big to fit in cache, we can take a drastic hit from cache misses; the solution here is to store A in column blocks.
Example: 200K square matrix, 200 randomly distributed nonzero entries per row, dual-core OpenMP.
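
An illustrative sketch of the column-blocking idea (the helper name, block count, and test sizes are assumptions, not the Fortran routine): only one column block of A and the matching rows of X are streamed at a time, so the dense data being reused stays in cache.

```python
import numpy as np
import scipy.sparse as sp

def blocked_spmm(A, X, nblocks):
    """Compute A @ X by column blocks of A (and matching row blocks of X)."""
    m, n = A.shape
    Y = np.zeros((m, X.shape[1]))
    edges = np.linspace(0, n, nblocks + 1, dtype=int)
    for lo, hi in zip(edges[:-1], edges[1:]):
        Y += A[:, lo:hi] @ X[lo:hi, :]      # sparse block * dense block
    return Y

# e.g. A = sp.random(20_000, 20_000, density=1e-3, format="csc")
#      X = np.random.rand(20_000, 16)
#      Y = blocked_spmm(A, X, 20)
```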

(Sparse A)*(Dense X): 20X Speedup
[Figure: "Blocking Speeds Sparse Matrix * Dense Matrix Multiplies" — speed in megaflops versus number of column blocks, for A*X and tr(A)*X with 16 vectors in X and for A*x and tr(A)*x with a single vector x.]

Reducing to upper bandwidth K + 1
Sparse BLAS-3 Reduction to Banded Upper Triangular form (Spar3Bnd): Spar3Bnd adapts the dense BLAS-3 algorithm for reduction to bandwidth K + 1 using Householder reductions of block size K. Dense implementations were by Chris Bischof and Bruno Lang, with related work by Xiaobai Sun; the current Fortran implementation is from scratch.
The trick for a sparse algorithm: do not update the trailing part of the matrix.

Untouched sparse matrix
Partitioning before a row and block elimination:

\[
A = \begin{pmatrix}
A_{11} & A_{12} & A_{13} \\
A_{21} & A_{22} & A_{23} \\
A_{31} & A_{32} & A_{33}
\end{pmatrix}
\tag{2}
\]

The only operations on the original sparse matrix are: extract blocks, and multiply the sparse matrix by dense matrices.

Computing ÂX
We can't actually form the updated Â = A + UZ + WV: it would be dense and hence exhaust RAM. Instead compute ÂX = AX + U(ZX) + W(VX).
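
A minimal sketch of this implicit product (shapes and names are illustrative): one sparse multiply plus four small dense GEMMs, and Â is never formed.

```python
import numpy as np

def ahat_times_block(A, U, Z, W, V, X):
    """(A + U @ Z + W @ V) @ X without forming the dense update of A."""
    # Parenthesization matters: Z @ X and V @ X are small, while U @ Z would be a dense m x n matrix.
    return A @ X + U @ (Z @ X) + W @ (V @ X)
```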

On the extracted blocks of A
Perform the sequence of block Householder eliminations which have already been applied to eliminate A_21, A_11, A_12. For example,

Â_23 ← A_23 + U(2,:) Z(:,3) + W(2,:) V(:,3).

These are all BLAS-3 dense matrix * dense matrix operations. Then perform a QR factorization of Â_23 (actually, since this is the row block, an LQ factorization). Also BLAS-3.
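
A sketch of the per-block work, with hypothetical block arguments and the sign convention of Â = A + UZ + WV from the previous slide; the LQ factorization is done here as a QR of the transpose.

```python
import numpy as np
from scipy.linalg import qr

def eliminate_row_block(A23, U2, Z3, W2, V3):
    """Update the extracted block, then take its LQ factorization."""
    A23_hat = A23 + U2 @ Z3 + W2 @ V3       # BLAS-3 update of the extracted block
    Q, R = qr(A23_hat.T, mode='economic')   # A23_hat = (R.T)(Q.T), an LQ factorization
    return R.T, Q.T                         # lower-triangular L and the orthogonal factor
```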

Append new blocks to U, V, W, Z
To get the new blocks of W and Z, multiply the currently produced blocks of dense vectors by the sparse A, e.g. W(:,3) ← V(3,:) A, with a similar multiplication by U(:,3).
The downside here is the storage of the four dense matrices U, W, Z, V: eliminating l rows and columns requires 2l(m + n) storage.

Test 1: Spar3Bnd is Stable
Spar3Bnd is a stable algorithm which, in the sparse case, is not run to completion (due to storage constraints). On my 2 GByte RAM desktop we can test matrices up to around 3K square against LAPACK. LAPACK gives the same singular values for A and Â to high accuracy.

Practical for real problems?
We are taking a standard dense algorithm and, at the cost of storing multipliers, applying it to the sparse case. Experiment: given a desktop with 2 GBytes of RAM, can we handle reasonably sized sparse matrices? Take the largest sparse matrices from the UF Sparse Matrix Collection: download all the unsymmetric and rectangular matrices (more rows than columns) with more than 1M nonzero entries.

Description of Experiment
For each matrix, we ran as many steps as 2 GBytes of storage on my desktop permitted. The number of steps N was taken as N = min(4.5e7/m, 1200), where m is the number of rows of the sparse matrix. Spar3Bnd returns an N × N upper triangular matrix A_0 with bandwidth K + 1.

Description 2
LAPACK dgesvd is used to compute singular values of the returned block. If the block size is K, the input matrices A_2, A_1, A_0 to dgesvd are square of sizes N - 2K, N - K, and N. The singular values of A_2, A_1, A_0 interlace. We claim observed convergence if the relative difference is less than 1.e-10.
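
An illustrative version of this convergence test (the function name and tolerance handling are assumptions): compare the i-th singular values of the nested leading blocks and count those that agree to the stated relative tolerance.

```python
import numpy as np

def count_converged(A0, K, tol=1e-10):
    """Count singular values agreeing across the (N-2K)-, (N-K)-, and N-sized leading blocks."""
    N = A0.shape[0]
    s2 = np.linalg.svd(A0[:N - 2 * K, :N - 2 * K], compute_uv=False)
    s1 = np.linalg.svd(A0[:N - K, :N - K], compute_uv=False)
    s0 = np.linalg.svd(A0, compute_uv=False)
    agree = (np.abs(s0[:len(s2)] - s2) <= tol * np.abs(s0[:len(s2)])) & \
            (np.abs(s0[:len(s2)] - s1[:len(s2)]) <= tol * np.abs(s0[:len(s2)]))
    return int(np.count_nonzero(agree))
```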

106 Sparse Matrices
17 were too large (m > 450k): segmentation fault in LAPACK or BLAS.

    Converged singular values   Many (> N/5)   Some   A Few (1-4)   None (0)
    Number of matrices          38             21     12            18

Table 1: N was the number of steps of Spar3Bnd. 38 matrices had many converged singular values.

Notes
The problems with fewer converged singular values tended to be the larger ones (for which we could perform relatively few steps before exhausting the 2 GBytes of RAM).
Strictly diagonal rows and columns should be permuted to the top and left, and those elimination steps should be skipped. A preconditioning step should select which rows and columns to eliminate.

Is BLAS-3 Worth It?

                  Lanczos        PROPACK         ARPACK         Spar3Bnd
    Basis Vecs    O(1)           2N              2N             4N
    Basis         Loss of Rank   Loss of Orthog  Keeps Orthog   Keeps Orthog
    BLAS          BLAS-1         BLAS-1          BLAS-2         BLAS-3

Need some experimentation comparing Spar3Bnd to other algorithms.

Parallelization of Spar3Bnd
Distributed memory case: BLAS-3 operations are expected to cope better with latency and bandwidth limitations than BLAS-2 operations. Similarly, the sparse matrix * dense matrix operations AX and Y^T A are likely to be preferable to Ax and y^T A.
For shared memory (or multi-core processors), we already get good parallelization with a few OpenMP calls in the AX and Y^T A routines and by linking to a multi-threaded BLAS.

Next Stuff
In another talk, I'll show how to use Spar3Bnd to find x to minimize ||Ax - b|| (a modification of LSQR). Preconditioning: permute diagonal entries to the upper left and choose wisely which elements to eliminate. Some data mining applications.