BLAS Specification Revisited. Linda Kaufman, William Paterson University


Wish list:
- Simultaneous orthogonal transformations: generation and application
- Simultaneous elementary transformations
- Simultaneous GEMVs with different matrices
- Simultaneous Householder transformations
- A symmetric rank-k update that manufacturers would not balk at implementing

Simultaneous orthogonal transformations:
A. QZHES: reduction of A to Hessenberg form and B to triangular form.
B. QR iteration for symmetric tridiagonal eigenvalues.
C. Reduction of a narrow banded matrix to tridiagonal form in order to solve an eigenvalue problem Ax = λx:
   - approximation of a 1-dimensional PDE with a Rayleigh-Ritz-Galerkin approach using cubic or quintic B-splines;
   - periodic boundary conditions of a 1-dimensional PDE with finite elements;
   - coupling several 1-dimensional PDE problems, as in designing optical fibers.
D. Ax = λBx with A and B symmetric, B positive definite, A and B banded:
   - B a tridiagonal mass matrix in a finite element approximation of a 1-D problem.
E. Banded singular value decomposition (S. Rajamanickam's thesis under Tim Davis).
F. Preventing fill-in when pivoting in a symmetric indefinite banded factorization:
   - optimization problems with negative curvature.

QZ phase 1: reduction of A to upper Hessenberg and B to triangular, for solving the nonsymmetric problem Ax = λBx. Assume we have used orthogonal transformations to reduce B to triangular form and have applied them to A. In LAPACK the elements of A are eliminated one at a time in a fixed order, but there are independent operations that can be done simultaneously with a different ordering; in general 2n simultaneous operations (Kågström and Dackland, 1999). One can look at these as blocks or as individual elements. (The matrix diagrams showing the two elimination orderings are not reproduced in this transcription.)
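The building block throughout these reductions is the Givens rotation: generate (c, s) from two entries, then apply the rotation to a pair of rows. A minimal sketch (not the LAPACK routine; the matrix entries are made up):

```python
import numpy as np

def givens(a, b):
    """Return (c, s) so that [[c, s], [-s, c]] @ [a, b] = [r, 0]."""
    if b == 0.0:
        return 1.0, 0.0
    r = np.hypot(a, b)
    return a / r, b / r

# Annihilate A[2, 0] with a rotation of rows 1 and 2, applied from the left.
A = np.array([[4.0, 1.0, 2.0],
              [3.0, 5.0, 1.0],
              [2.0, 1.0, 6.0]])
c, s = givens(A[1, 0], A[2, 0])
G = np.array([[c, s], [-s, c]])
A[1:3, :] = G @ A[1:3, :]
```

Generating many such rotations at once is exactly the "simultaneous generation" item on the wish list.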

IMTQL1: finding the eigenvalues of Ax = λx for A symmetric tridiagonal.
1. Compute a shift μ and form B = A - μI.
2. Find Q1 that annihilates B21 and form Q1 B Q1^T.
3. Chase the unwanted element down the matrix.
Parallel QR: keep determining shifts and do simultaneous chases. Van de Geijn (1993) needs 3 times as many chases, but Kaufman (1994) showed a factor-of-2 reduction can be obtained.
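One implicitly shifted QR step of the kind IMTQL1 performs can be sketched in dense storage (an illustrative sketch, not the LAPACK code; the Wilkinson shift and the Givens bulge chase follow steps 1-3 above):

```python
import numpy as np

def qr_step_tridiag(T):
    """One implicitly shifted QR step on a symmetric tridiagonal matrix
    (dense storage for clarity): Wilkinson shift, then a Givens bulge chase."""
    n = T.shape[0]
    # Wilkinson shift from the trailing 2 x 2 block
    d = (T[n - 2, n - 2] - T[n - 1, n - 1]) / 2.0
    b = T[n - 1, n - 2]
    sign = 1.0 if d >= 0 else -1.0
    mu = T[n - 1, n - 1] - b * b / (d + sign * np.hypot(d, b))
    x, z = T[0, 0] - mu, T[1, 0]             # first rotation from the shifted column
    for k in range(n - 1):
        r = np.hypot(x, z)
        c, s = (1.0, 0.0) if r == 0.0 else (x / r, z / r)
        G = np.array([[c, s], [-s, c]])
        T[k:k + 2, :] = G @ T[k:k + 2, :]    # similarity transform: rows ...
        T[:, k:k + 2] = T[:, k:k + 2] @ G.T  # ... and columns
        if k < n - 2:
            x, z = T[k + 1, k], T[k + 2, k]  # the bulge to chase next
    return T

T = (np.diag([4.0, 3.0, 2.0, 1.0])
     + np.diag([1.0, 1.0, 1.0], 1) + np.diag([1.0, 1.0, 1.0], -1))
eigs_before = np.linalg.eigvalsh(T)
T2 = qr_step_tridiag(T.copy())
```

The parallel variant in the slide runs several such chases, each with its own shift, down the matrix at the same time.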

Diagram of annihilation using Givens rotations for the banded eigenvalue problem. Eventually every kth row has an element that could be annihilated, for r ≥ 2k-1 diagonals (Kaufman, 1984); implemented in LAPACK (Christian Bischof, Bruno Lang, and Xiaobai Sun, the SBR toolbox, 2000). Saw a reduction by a factor of 5 for narrow bands on a Cray. We would have liked to be able to generate Givens rotations simultaneously, but this was killed by the manufacturers.

Diagrams of parallel Crawford for reducing the symmetric tridiagonal problem Ax = λBx to a standard eigenvalue problem. (Matrix diagrams not reproduced in this transcription.)

Simultaneous stabilized elementary transformations:
- Simultaneously factoring banded linear systems.
- TINVIT: given several eigenvalues (λ1, λ2, ..., λk) of a tridiagonal system, determine their eigenvectors. Solve with A - λ1 I, A - λ2 I, ..., A - λk I simultaneously.
- Shotgun bisection.
- Two-dimensional separable elliptic PDEs solved using marching (Bank), Rayleigh-Ritz-Galerkin (Kaufman and Warner), or collocation (Fairweather). The matrix has a tensor-product form in which the S's and M's are banded.
- Queueing problems leading to separable matrices (Kaufman, 1983).
- Symmetric indefinite factorization using stabilized elementary transformations to prevent fill-in.
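The TINVIT pattern above, one banded factorization and solve per shift with all the solves independent of one another, can be sketched with inverse iteration (an illustrative sketch: the test matrix is made up, and SciPy's banded solver stands in for the stabilized elementary transformations):

```python
import numpy as np
from scipy.linalg import solve_banded

# A symmetric tridiagonal test matrix (entries made up for illustration).
n = 100
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)

def inverse_iteration(shift, iters=3):
    """An eigenvector of A by inverse iteration: repeatedly solve
    (A - shift*I) v_new = v with a banded solver and renormalize."""
    ab = np.zeros((3, n))        # banded storage: super-, main, and subdiagonal
    ab[0, 1:] = off
    ab[1, :] = main - shift
    ab[2, :-1] = off
    v = np.random.default_rng(0).standard_normal(n)
    for _ in range(iters):
        v = solve_banded((1, 1), ab, v)
        v /= np.linalg.norm(v)
    return v

# The solves for the different shifts never interact, so they could run
# simultaneously, which is the TINVIT case on the wish list.
lams = np.linalg.eigvalsh(A)[:3]
vecs = [inverse_iteration(lam + 1e-8) for lam in lams]   # offset avoids singularity
resids = [float(np.linalg.norm(A @ v - lam * v)) for lam, v in zip(lams, vecs)]
```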

Separable matrix in solving a 2-dimensional problem. The matrix has the tensor-product form A = Sx ⊗ My + Mx ⊗ Sy, where Sx and Mx are m x m banded, Sy and My are n x n banded, Mx, My, and Sy are symmetric, and My is positive definite. We need to solve Av = f, where A has mn rows and columns.
Algorithm:
(1) Find D and Z such that Z^T My Z = I and Z^T Sy Z = D (a generalized eigenvalue problem).
(2) Compute g = (I ⊗ Z^T) f.
(3) Solve the n banded systems (Sx + dj Mx) hj = gj.
(4) Compute v = (I ⊗ Z) h.
Steps (2) and (4) just use matrix-matrix multiplies and are fast. Sometimes Z and D are known a priori, as for Poisson's equation on a uniform grid. One can reduce step (3) by a factor of 4 by using simultaneous AXPYs.
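Assuming the tensor-product form above (with row-major vectorization, so that (P ⊗ Q) vec(V) = vec(P V Q^T)), the four steps can be sketched densely; the banded structure is ignored for brevity and the matrices are random stand-ins:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
m, n = 5, 4

def spd(k):
    """A random symmetric positive definite k x k matrix."""
    X = rng.standard_normal((k, k))
    return X @ X.T + k * np.eye(k)

Sx, Mx = spd(m), spd(m)              # x-direction matrices (m x m)
Sy, My = spd(n), spd(n)              # y-direction matrices (n x n), My SPD
A = np.kron(Sx, My) + np.kron(Mx, Sy)
f = rng.standard_normal(m * n)

# (1) Z^T My Z = I and Z^T Sy Z = D: a generalized symmetric eigenproblem.
d, Z = eigh(Sy, My)
# (2) g = (I kron Z^T) f, done as one matrix-matrix multiply.
G = f.reshape(m, n) @ Z
# (3) Solve the n (banded, here dense) systems (Sx + d_j Mx) h_j = g_j.
H = np.column_stack([np.linalg.solve(Sx + d[j] * Mx, G[:, j]) for j in range(n)])
# (4) v = (I kron Z) h, again one matrix-matrix multiply.
v = (H @ Z.T).reshape(-1)
```

The n solves in step (3) are mutually independent, which is where the simultaneous AXPYs come in.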

Queueing problems. Often one works with singular matrices A of a separable, tensor-product form (the formula is not reproduced in this transcription) in which the B's are not symmetric but may have symmetric zero structure, and there exist matrices Q and Z such that Q Bj Z is diagonal. Usually all the B's are the identity matrix except for one. The variable q denotes the number of queues. The problems get large quickly: with 10 waiting spaces in q queues, the number of variables is O(10^q). As in the PDE case, one can reduce the problem using generalized eigendecompositions to diagonal blocks containing tridiagonal matrices, which here could be unsymmetric.

Symmetric rank-k updates. Originally updates could have the form A = A + α X X^T. This could not accommodate the quasi-Newton BFGS update of the approximate Hessian in optimization. LAPACK did not use it and instead treated the lower triangular part blockwise, where the rectangular blocks used GEMM and the thin strips used DGEMV (diagram not reproduced in this transcription). One could use simultaneous DGEMVs here.

Workarounds for symmetric updates. For symmetric indefinite linear systems, the reduction uses A = A + Y D Y^T, where D is either 1 x 1 or 2 x 2; for a block version, D would be a sequence of 1 x 1's and 2 x 2's. The 2002 BLAS standard suggested A = A + Y J Y^T, where J is tridiagonal; I don't know of any implementations. Perhaps it would be better to have A = A + X Y^T but update only the triangular part of A. At the 2011 Householder conference, Jennifer Scott suggested adding extra space so that one can use GEMMs throughout.
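The BFGS update mentioned above is a symmetric rank-2 modification whose two rank-1 terms carry different scalars, so it fits the A + Y D Y^T form but not A + α X X^T. A small sketch (the notation here is mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
H = np.eye(n)                        # approximate Hessian
s = rng.standard_normal(n)           # quasi-Newton step
y = rng.standard_normal(n)
y = y if y @ s > 0 else -y           # enforce the curvature condition y^T s > 0

# The BFGS update written directly as two rank-1 terms:
Hs = H @ s
H_direct = H - np.outer(Hs, Hs) / (s @ Hs) + np.outer(y, y) / (y @ s)

# The same update in A + Y D Y^T form with a 2 x 2 diagonal D, exactly
# the shape a richer symmetric rank-k BLAS update would need to cover:
Y = np.column_stack([Hs, y])
D = np.diag([-1.0 / (s @ Hs), 1.0 / (y @ s)])
H_ydyt = H + Y @ D @ Y.T
```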

Symmetric banded factorization. For symmetric banded matrices, Kaufman's retraction algorithm requires (2m+1)n space even though the original matrix can be specified using (m+1)n. The extra space is needed to store the complications arising from 2 x 2 pivots; when only 1 x 1 pivots occur, the space above the diagonal is just scratch. (Image not reproduced in this transcription.)

Bunch-Kaufman for symmetric indefinite, non-banded matrices. Partition A with a leading pivot block D, a border Y, and a trailing block B, where D is either 1 x 1 or 2 x 2. Reset B to B = B - Y D^{-1} Y^T when deleting the Y's. The choice of the dimension of D depends on the magnitude of a11 versus the other elements. Continue with B, partitioning it as above. On a banded matrix, Bunch-Kaufman spreads the bandwidth because of the pivoting needed for stability.

Banded algorithm based on Bunch-Kaufman:
1) Let c = |ar1| be the largest element in absolute value in column 1.
2) If |a11| >= w c, use a 1 x 1 pivot. Here w is a scalar chosen to balance element growth, like 1/3. Else:
3) Let f be the largest element in absolute value in column r.
4) If w c^2 <= |a11| f, use a 1 x 1 pivot. Else:
5) Interchange the rth and second rows and columns of A.
6) Do a sequence of orthogonal or elementary transformations to prevent fill-in while performing a 2 x 2 pivot.
7) Perform the 2 x 2 pivot.
An interchange is never done for a 1 x 1 pivot.
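The pivot test in steps 1-4 can be sketched directly (a sketch under my own conventions; off-diagonal maxima exclude the diagonal entry, and w = 1/3 as on the slide):

```python
import numpy as np

def choose_pivot(A, w=1.0 / 3.0):
    """Decide between a 1 x 1 and a 2 x 2 pivot for the leading column of a
    symmetric matrix, following steps 1-4 above. Returns ('1x1', None) or
    ('2x2', r), where r is the row to interchange with row 2."""
    r = 1 + int(np.argmax(np.abs(A[1:, 0])))   # step 1: largest entry below a11
    c = abs(A[r, 0])
    if abs(A[0, 0]) >= w * c:                  # step 2: a11 is large enough
        return '1x1', None
    f = np.max(np.abs(np.delete(A[:, r], r)))  # step 3: largest off-diagonal in col r
    if w * c * c <= abs(A[0, 0]) * f:          # step 4: growth still bounded
        return '1x1', None
    return '2x2', r                            # steps 5-7: interchange, then 2 x 2

A = np.array([[0.0, 1.0, 0.1],
              [1.0, 0.0, 0.2],
              [0.1, 0.2, 3.0]])
kind, r = choose_pivot(A)
```

With a zero diagonal entry, as here, the test falls through to the 2 x 2 branch.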

Pivoting for stability can ruin bandwidth. The worst case is r = m: interchanging the second and rth rows and columns moves entries (labeled a, b, c, d on the slide) outside the band. (Matrix diagrams not reproduced in this transcription.)

Partition A as before; let Z = D^{-1} Y^T and reset B = B - Y D^{-1} Y^T = B - Y Z. The rank-2 change produces entries (products such as bp, cq, dr) outside the band, but eliminating them one element at a time retracts the fill, and continuing in this way gets us back to band form. (The matrix diagrams showing the fill-in and its elimination are not reproduced in this transcription.)

In practice, pretreat Z to create zeroes so that the rank-2 change does not produce nonzeroes outside the band. Partition A as before, where D is 2 x 2, and let Z = D^{-1} Y^T. Reset B = Q^T (B - Y Z) Q = Q^T B Q - H G, with Q from the fix-up, H = Q^T Y, and G = Z Q. Construct Q so that G = Z Q has nonzeroes only in its leading columns, using a sequence of Givens transformations or stabilized planar elementary transformations chosen so that the banded structure of Q^T B Q is not upset. Because H = Q^T Y has the corresponding form, H G will not extend beyond the band. (Matrix sketches not reproduced in this transcription.)
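Constructing a Q for which G = Z Q has only its leading columns nonzero is, in dense terms, an LQ-type factorization of Z. A small sketch via the QR factorization of Z^T (a dense orthogonal Q stands in for the sequence of Givens or planar transformations described above):

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.standard_normal((2, 7))      # Z = D^{-1} Y^T has two rows

# QR of Z^T gives Z^T = Qf R, so G = Z Qf = R^T has nonzeroes only in
# its first two columns; the band structure of B is then preserved.
Qf, R = np.linalg.qr(Z.T, mode='complete')
G = Z @ Qf
```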

Comparison with LAPACK on positive definite matrices, n = 2000:

  m    mine (no block)   dgbtrf   dgbtf2   dpbtrf   dpbtf2   mine (nb=16)
 100        0.223         0.266    0.389    0.17     0.218      0.13
 200        0.782         0.873    1.46     0.67     0.773      0.382
 300        1.64          1.65     3.12     1.49     1.62       0.834
 400        2.78          2.41     5.28     2.59     2.76       1.25
 500        4.2           3.78     8.14     3.9      4.13       1.76
 600        5.61          4.81    11.31     5.62     5.63       2.343
 700        7.35          6.3     15.36     7.88     7.33       2.977

(The accompanying chart plotting these columns against m is not reproduced in this transcription.)

Block version on random matrices, n = 2000:

  m    nonblock retraction   retraction   dgbtf2   block dgbtrf   2x2 maxr   ave
 100         0.327              0.315      0.682       0.451         444       49
 200         0.986              0.81       2.58        1.3           315       98
 300         2.08               1.79       5.22        2.58          365      141
 400         3.37               2.6        8.92        3.93          327      201
 500         5.45               4.38      13.97        5.85          370      231
 600         7.19               5.61      24.84        7.64          344      299
 700        10.23               8.46      37.03        9.85          421      293

Only blocking for 1 x 1 pivots; stop accumulating when a 2 x 2 is reached. Elementary transformations are used for pretreating Z with a 2 x 2. (The accompanying chart, time as a function of the number of planar transformations for m = 400, n = 2000, comparing retraction and dgbtrf, is not reproduced in this transcription.)

Possible ways to speed up retraction for consecutive 2 x 2's. Each column involves 2 full DAXPYs plus orthogonal transformations, or cut-up DAXPYs to the same column.
(1) Marching:
    (a) work on column i+j when elimination starts at i;
    (b) work on column i+j-1 with elimination starting at i+2;
    (c) work on column i+j-2 with elimination starting at i+4.
    Requires simultaneous DGEMVs or DAXPYs.
(2) Two sets of transformations:
    (a) cut-up DAXPY or orthogonal transformation applied to column i+j, stemming from i;
    (b) DGEMV involving 4 columns (i, i+1, i+2, i+3) applied to column i+j;
    (c) cut-up DAXPY applied to column i+j, stemming from i+2.
Back to requesting simultaneous orthogonal or elementary transformations. Van de Geijn to the rescue.