Parallel Numerical Algorithms

Parallel Numerical Algorithms
Chapter 6 - Matrix Models, Section 6.2 - Low Rank Approximation
Edgar Solomonik
Department of Computer Science, University of Illinois at Urbana-Champaign
CS 554 / CSE 512

Outline
1. Truncated SVD: Fast Algorithms with Truncated SVD
2. Computing the Truncated SVD: Direct Computation; Indirect Computation
3. Randomized Approximation: Basics; Structured Randomized Factorization
4. HSS Matrix-Vector Multiplication: Parallel HSS Matrix-Vector Multiplication

Rank-k Singular Value Decomposition (SVD)
For any matrix A ∈ R^(m×n) of rank k there exists a factorization
  A = U D V^T
where
- U ∈ R^(m×k) is a matrix of orthonormal left singular vectors
- D ∈ R^(k×k) is a nonnegative diagonal matrix of singular values in decreasing order, σ_1 ≥ ... ≥ σ_k
- V ∈ R^(n×k) is a matrix of orthonormal right singular vectors

Truncated SVD
Given A ∈ R^(m×n), seek its best rank-k approximation, k < rank(A):
  B = argmin_{B ∈ R^(m×n), rank(B) ≤ k} ||A − B||_2
Eckart-Young theorem: given the SVD
  A = [U_1 U_2] [D_1 0; 0 D_2] [V_1 V_2]^T
where D_1 is k×k, the minimizer is B = U_1 D_1 V_1^T. This U_1 D_1 V_1^T is the rank-k truncated SVD of A, and
  ||A − U_1 D_1 V_1^T||_2 = min_{B ∈ R^(m×n), rank(B) ≤ k} ||A − B||_2 = σ_{k+1}
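To make the Eckart-Young statement concrete, here is a minimal NumPy sketch (ours, not from the slides) that forms the rank-k truncated SVD by truncating a full SVD and checks that the 2-norm error equals σ_{k+1}:

```python
import numpy as np

def truncated_svd(A, k):
    """Rank-k truncated SVD via a full SVD (reference implementation)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 60))
k = 10
U1, s1, V1t = truncated_svd(A, k)
B = U1 @ np.diag(s1) @ V1t
# Eckart-Young: the 2-norm error of the best rank-k approximation is sigma_{k+1}.
sigma = np.linalg.svd(A, compute_uv=False)
assert np.isclose(np.linalg.norm(A - B, 2), sigma[k])
```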

Computational Cost
Given a rank-k truncated SVD A ≈ U D V^T of A ∈ R^(m×n) with m ≥ n:
- approximating y = Ax requires O(mk) work, via y ≈ U(D(V^T x))
- approximately solving Ax = b requires O(mk) work, via x ≈ V D^(−1) U^T b
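A small sketch of these costs (ours; the factors below are synthetic, with U and V built to have orthonormal columns): applying the truncated factors touches O((m + n)k) data rather than O(mn):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 60, 10
U = np.linalg.qr(rng.standard_normal((m, k)))[0]   # orthonormal left factor
V = np.linalg.qr(rng.standard_normal((n, k)))[0]   # orthonormal right factor
d = np.sort(rng.random(k))[::-1]                   # singular values, decreasing

x = rng.standard_normal(n)
y = U @ (d * (V.T @ x))        # y = U(D(V^T x)), O((m+n)k) work

b = rng.standard_normal(m)
x_ls = V @ ((U.T @ b) / d)     # approximate solve, x = V D^{-1} U^T b
```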

Computing the Truncated SVD
- Reduction to bidiagonal form via two-sided orthogonal updates can be used to compute the full SVD.
- Given the full SVD, the truncated SVD is obtained by keeping only the largest singular value/vector pairs.
- Given the set of transformations Q_1, ..., Q_s so that U = Q_1 ⋯ Q_s, the leading k columns of U can be obtained by computing
    U_1 = Q_1 ( ⋯ ( Q_s [I; 0] ) )
- This method requires O(mn^2) work for the computation of the singular values and O(mnk) work for k singular vectors.

Computing the Truncated SVD by Krylov Subspace Methods
- Seek the k ≪ m, n leading right singular vectors of A.
- Find a basis for the Krylov subspace of B = A^T A.
- Rather than computing B, compute the products Bx = A^T (Ax).
- For instance, k' = k + O(1) iterations of Lanczos yield BQ = QH.
- Left singular vectors can be obtained by working with C = A A^T.
- This method requires O(mnk) work for k singular vectors.
- However, Θ(k) sparse-matrix-vector multiplications are needed (high latency and low flop/byte ratio).
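One way to realize this matrix-free approach is SciPy's ARPACK-backed svds, a related Lanczos-type method (a sketch under that substitution, not the lecture's own code); the LinearOperator below only ever applies Ax and A^T y, so B = A^T A is never formed:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, svds

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 400))
k = 5

# Matrix-free operator: only the products Ax and A^T y are ever computed.
Aop = LinearOperator(A.shape, matvec=lambda x: A @ x, rmatvec=lambda y: A.T @ y)

# ARPACK-based Lanczos-type iteration; singular values return in ascending order.
U, s, Vt = svds(Aop, k=k)
```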

Generic Low-Rank Factorizations
A matrix A ∈ R^(m×n) has rank k if A = X Y^T for some X ∈ R^(m×k), Y ∈ R^(n×k) with k ≤ min(m, n).
If A = X Y^T (an exact low-rank factorization), we can obtain the reduced SVD A = U D V^T via
1. [U_1, R] = QR(X)
2. [U_2, D, V] = SVD(R Y^T)
3. U = U_1 U_2
with cost O(mk^2) for m ≥ n, using an SVD of a small k×n matrix rather than the full m×n matrix.
If instead ||A − X Y^T||_2 ≤ ε, then ||A − U D V^T||_2 ≤ ε. So we can obtain a truncated SVD from any optimal generic low-rank approximation.
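A direct transcription of the three steps into NumPy (a sketch; the helper name lowrank_to_svd is ours):

```python
import numpy as np

def lowrank_to_svd(X, Y):
    """Given A = X @ Y.T with X (m,k) and Y (n,k), return reduced SVD factors."""
    U1, R = np.linalg.qr(X)                                   # step 1: O(m k^2)
    U2, d, Vt = np.linalg.svd(R @ Y.T, full_matrices=False)   # step 2: SVD of k x n matrix
    return U1 @ U2, d, Vt                                     # step 3: U = U1 U2

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((200, 8)), rng.standard_normal((120, 8))
U, d, Vt = lowrank_to_svd(X, Y)
assert np.allclose(U @ np.diag(d) @ Vt, X @ Y.T)
```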

Rank-Revealing QR
If A is of rank k and its first k columns are linearly independent,
  A = Q [R_11 R_12; 0 0]
where R_11 is upper-triangular and k×k, and Q is stored in the compact form Q = I − Y T Y^T, where Y has k columns (one Householder vector per column).
For arbitrary A we need a column permutation P, giving AP = QR.
QR with column pivoting (due to Gene Golub) is an effective method for this:
- pivot so that the leading column has the largest 2-norm
- the method can fail in the presence of roundoff error (see the Kahan matrix), but it is very robust in practice
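SciPy exposes Businger-Golub column pivoting via qr(..., pivoting=True); in the sketch below (ours, with an ad hoc relative threshold), the decaying diagonal of R reveals the numerical rank:

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((100, 8)), rng.standard_normal((50, 8))
A = X @ Y.T + 1e-10 * rng.standard_normal((100, 50))   # numerically rank 8

Q, R, piv = qr(A, mode='economic', pivoting=True)      # A[:, piv] == Q @ R
# |R[i,i]| decays sharply after the numerical rank for this well-separated example.
rank = np.sum(np.abs(np.diag(R)) > 1e-8 * np.abs(R[0, 0]))
print(rank)  # 8
```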

Low-Rank Factorization by QR with Column Pivoting
QR with column pivoting can be used to either
- determine the (numerical) rank of A, or
- compute a low-rank approximation with a bounded error,
and, stopped after k pivots, it performs only O(mnk) work rather than the O(mn^2) needed for a full QR or SVD.

Parallel QR with Column Pivoting
- In distributed memory, column pivoting poses further challenges.
- At least one message is needed to decide on each pivot column, which leads to Ω(k) synchronizations.
- Existing work tries to pivot many columns at a time by finding subsets of them that are sufficiently linearly independent.
- Randomized approaches provide alternatives and flexibility.

Randomization Basics
Intuition: consider a random vector w of dimension n; all of the following hold with high probability in exact arithmetic.
- Given any basis Q for the n-dimensional space, a random w is not orthogonal to any row of Q^T.
- Let A = U D V^T with V ∈ R^(n×k). Then w is not orthogonal to any row of V^T; consequently z = V^T w is a random vector, and Aw = U D z is a random linear combination of the columns of U D.
- Given k random vectors, i.e., a random matrix W ∈ R^(n×k), B = A W gives k random linear combinations of the columns of U D.
- B has the same span as U!
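A quick numerical check of this intuition on a synthetic exactly-rank-k matrix (ours, not from the slides): the sample matrix B = AW lies in span(U), so projecting it onto span(U) changes nothing:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 50, 40, 5
U = np.linalg.qr(rng.standard_normal((m, k)))[0]
V = np.linalg.qr(rng.standard_normal((n, k)))[0]
A = U @ np.diag(np.arange(k, 0, -1.0)) @ V.T       # exact rank-k matrix
B = A @ rng.standard_normal((n, k))                # k random combinations of cols of UD
assert np.allclose(U @ (U.T @ B), B)               # span(B) == span(U)
```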

Using the Basis to Compute a Factorization
If B has the same span as the range of A:
- [Q, R] = QR(B) gives an orthogonal basis Q for B = A W
- QQ^T A = QQ^T U D V^T = (QQ^T U) D V^T; Q^T U is orthogonal, so QQ^T U is a basis for the range of A
- so compute H = Q^T A, H ∈ R^(k×n), and then [U_1, D, V] = SVD(H)
- finally compute U = Q U_1, and we have a rank-k truncated SVD of A: A = U D V^T
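Put together, this is the basic randomized SVD of Halko, Martinsson, and Tropp (see the references); a compact NumPy sketch, with the oversampling parameter l anticipating the next slides:

```python
import numpy as np

def randomized_svd(A, k, l=None, seed=0):
    """Randomized rank-k truncated SVD (a sketch of the HMT scheme)."""
    m, n = A.shape
    l = l if l is not None else k + 10        # oversampling, see next slides
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, l))           # Gaussian random matrix
    Q, _ = np.linalg.qr(A @ W)                # orthogonal basis for span(AW)
    H = Q.T @ A                               # l x n
    U1, d, Vt = np.linalg.svd(H, full_matrices=False)
    return (Q @ U1)[:, :k], d[:k], Vt[:k, :]  # U = Q U1, truncated to rank k
```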

Cost of the Randomized Method
- The matrix multiplications (e.g., A W) all require O(mnk) operations.
- The QR and SVD require O((m + n) k^2) operations.
- If k ≪ min(m, n), the bulk of the computation is in matrix multiplication, which can be done with fewer synchronizations and higher efficiency than QR with column pivoting or Arnoldi.

Randomized Approximate Factorization
Now let's consider the case A = U D V^T + E, where D ∈ R^(k×k) and E is a small perturbation.
- E may be noise in the data or numerical error.
- To obtain a basis for U it is insufficient to multiply by a random W ∈ R^(n×k), due to the influence of E.
- However, oversampling, for instance l = k + 10, with a random W ∈ R^(n×l), gives good results.
- A Gaussian random distribution provides particularly good accuracy.
- So far the dimension of W has assumed knowledge of the target approximate rank k; to determine it dynamically, generate vectors (columns of W) one at a time or a block at a time, which results in a provably accurate basis.

Cost Analysis of Randomized Low-Rank Factorization
The cost of the randomized algorithm is
  T_p^MM(m, n, k) + T_p^QR(m, k, k)
which means that the work is O(mnk) and the algorithm is well-parallelizable. This assumes we factorize the basis by a QR followed by an SVD of R.

Fast Algorithms via Pseudo-Randomness
We can lower the number of operations needed by the randomized algorithm by generating W so that A W can be computed more rapidly. Generate W as a pseudo-random matrix
  W = D F R
- D is diagonal with random elements
- F can be applied to a vector in O(n log n) operations, e.g., the DFT or the Hadamard matrix H_2n = [H_n H_n; H_n −H_n]
- R consists of k columns of the n×n identity matrix
This computes A W with O(mn log n) operations (if m > n).
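A sketch of such a subsampled randomized Fourier transform in NumPy (ours; the unitary normalization and the complex unimodular choice of D are our assumptions, and a Hadamard-based variant is analogous):

```python
import numpy as np

def srft_sketch(A, l, seed=0):
    """Compute B = A @ (D F R) in O(m n log n) work via row FFTs.

    D: random unimodular diagonal; F: DFT; R: l columns of the identity."""
    m, n = A.shape
    rng = np.random.default_rng(seed)
    d = np.exp(2j * np.pi * rng.random(n))        # diagonal of D
    cols = rng.choice(n, size=l, replace=False)   # columns kept by R
    return np.fft.fft(A * d, axis=1)[:, cols] / np.sqrt(n)
```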

Cost of Pseudo-Randomized Factorization
- Instead of a matrix multiplication, apply m FFTs of dimension n.
- Each FFT is independent, so it suffices to perform a single transpose.
- So we have the following overall cost (assuming m > n):
    O((mn log n / p) γ) + T_p^all-to-all(mn/p) + T_p^QR(m, k, k)
- This is lower than the unstructured/randomized version; however, the idea does not extend well to the case when A is sparse.

Hierarchical Low-Rank Structure
- Consider a two-way partitioning of the vertices of a graph.
- The connectivity within each partition is given by a block-diagonal matrix [A_1 0; 0 A_2].
- If the graph is nicely separable, there is little connectivity between vertices in the two partitions.
- Consequently, it is often possible to approximate the off-diagonal blocks by low-rank factorizations:
    A ≈ [A_1, U_1 D_1 V_1^T; U_2 D_2 V_2^T, A_2]
- Doing this recursively to A_1 and A_2 yields a matrix with hierarchical low-rank structure.

HSS Matrix, Two Levels
[Figure: a hierarchically semi-separable (HSS) matrix with two levels of block partitioning; off-diagonal blocks are stored in low-rank form, diagonal blocks are subdivided.]

HSS Matrix, Three Levels
[Figure: an HSS matrix with three levels of block partitioning.]

HSS Matrix Formal Definition
The l-level HSS factorization is described by
  H_l(A) = {U, V, T_12, T_21, A_11, A_22}                       : l = 1
  H_l(A) = {U, V, T_12, T_21, H_{l−1}(A_11), H_{l−1}(A_22)}     : l > 1
The low-rank representation of the off-diagonal blocks is given by
  A_21 = Ū_2 T_21 V̄_1^T,   A_12 = Ū_1 T_12 V̄_2^T
where, for a ∈ {1, 2},
  Ū_a = U_a(H_l(A)) = U_a                                                       : l = 1
  Ū_a = U_a(H_l(A)) = [U_1(H_{l−1}(A_aa)) 0; 0 U_2(H_{l−1}(A_aa))] U_a          : l > 1
  V̄_a = V_a(H_l(A)) = V_a                                                       : l = 1
  V̄_a = V_a(H_l(A)) = [V_1(H_{l−1}(A_aa)) 0; 0 V_2(H_{l−1}(A_aa))] V_a          : l > 1

HSS Matrix-Vector Multiplication
We now consider computing y = Ax.
- With H_1(A) we would just compute
    y_1 = A_11 x_1 + U_1 (T_12 (V_2^T x_2))
    y_2 = A_22 x_2 + U_2 (T_21 (V_1^T x_1))
- For a general H_l(A), perform an up-sweep and a down-sweep:
  - the up-sweep computes w = [V̄_1^T x_1; V̄_2^T x_2] at every tree node
  - the down-sweep computes a tree sum of [Ū_1 T_12 w_2; Ū_2 T_21 w_1]
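A one-level (H_1) sketch in NumPy, checked against the explicitly assembled matrix (ours; dimensions are illustrative):

```python
import numpy as np

def hss1_matvec(A11, A22, U1, U2, V1, V2, T12, T21, x1, x2):
    """One-level HSS matvec: the off-diagonal blocks U1 T12 V2^T and
    U2 T21 V1^T are never formed explicitly."""
    y1 = A11 @ x1 + U1 @ (T12 @ (V2.T @ x2))
    y2 = A22 @ x2 + U2 @ (T21 @ (V1.T @ x1))
    return y1, y2

rng = np.random.default_rng(0)
n, k = 8, 2
A11, A22 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
U1, U2, V1, V2 = (rng.standard_normal((n, k)) for _ in range(4))
T12, T21 = rng.standard_normal((k, k)), rng.standard_normal((k, k))
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)

y1, y2 = hss1_matvec(A11, A22, U1, U2, V1, V2, T12, T21, x1, x2)
A = np.block([[A11, U1 @ T12 @ V2.T], [U2 @ T21 @ V1.T, A22]])
assert np.allclose(np.concatenate([y1, y2]), A @ np.concatenate([x1, x2]))
```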

HSS Matrix-Vector Multiplication, Up-Sweep
The up-sweep is performed by using the nested structure of V̄:
  w = W(H_l(A), x) = [V_1^T 0; 0 V_2^T] x                                                     : l = 1
  w = W(H_l(A), x) = [V_1^T 0; 0 V_2^T] [W(H_{l−1}(A_11), x_1); W(H_{l−1}(A_22), x_2)]        : l > 1

HSS Matrix-Vector Multiplication, Down-Sweep
Use w = W(H_l(A), x) from the root to the leaves to get
  y = Ax = [Ū_1 T_12 w_2; Ū_2 T_21 w_1] + [A_11 x_1; A_22 x_2]
         = [Ū_1 0; 0 Ū_2] [0 T_12; T_21 0] w + [A_11 0; 0 A_22] x
Using the nested structure of Ū_a, and with v = [U_1 0; 0 U_2] [0 T_12; T_21 0] w,
  y_a = [U_1(H_{l−1}(A_aa)) 0; 0 U_2(H_{l−1}(A_aa))] v_a + A_aa x_a   for a ∈ {1, 2}
which gives the down-sweep recurrence
  y = Ax + z = Y(H_l(A), x, z) = [U_1 q_1 + A_11 x_1; U_2 q_2 + A_22 x_2]                          : l = 1
  y = Ax + z = Y(H_l(A), x, z) = [Y(H_{l−1}(A_11), x_1, U_1 q_1); Y(H_{l−1}(A_22), x_2, U_2 q_2)]  : l > 1
where q = [0 T_12; T_21 0] w + z.

Prefix Sum as HSS Matrix-Vector Multiplication
We can express the n-element (exclusive) prefix sum y(i) = Σ_{j=1}^{i−1} x(j) as y = Lx, where
  L = [L_11 0; L_21 L_22]
is the strictly lower-triangular matrix of ones (zeros on and above the diagonal, ones below).
L is an H-matrix, since L_21 = 1 1^T is an all-ones block (an outer product of ones vectors, hence rank 1).
L also has rank-1 HSS structure; in particular,
  H_l(L) = {1_2, 1_2, [0], [1], 0, 0}                               : l = 1
  H_l(L) = {1_2, 1_2, [0], [1], H_{l−1}(L_11), H_{l−1}(L_22)}       : l > 1
so each U, V, Ū, V̄ is a vector of ones, T_12 = [0], and T_21 = [1].

Prefix Sum HSS Up-Sweep
We can use the HSS structure of L to compute the prefix sum of x. Recall that the up-sweep recurrence has the general form
  w = W(H_l(A), x) = [V_1^T 0; 0 V_2^T] x                                                     : l = 1
  w = W(H_l(A), x) = [V_1^T 0; 0 V_2^T] [W(H_{l−1}(A_11), x_1); W(H_{l−1}(A_22), x_2)]        : l > 1
For the prefix sum this becomes
  w = W(H_l(L), x) = x                                                                        : l = 1
  w = W(H_l(L), x) = [1 1 0 0; 0 0 1 1] [W(H_{l−1}(L_11), x_1); W(H_{l−1}(L_22), x_2)]        : l > 1
so the up-sweep computes w = [S(x_1); S(x_2)], where S(y) = Σ_i y_i.

Prefix Sum HSS Down-Sweep
The down-sweep has the general structure
  y = Y(H_l(A), x, z) = [U_1 0; 0 U_2] q + [A_11 0; 0 A_22] x                                      : l = 1
  y = Y(H_l(A), x, z) = [Y(H_{l−1}(A_11), x_1, U_1 q_1); Y(H_{l−1}(A_22), x_2, U_2 q_2)]           : l > 1
where q = [0 T_12; T_21 0] W(H_l(A), x) + z. For the prefix sum,
  [0 T_12; T_21 0] W(H_l(L), x) = [0 0; 1 0] [S(x_1); S(x_2)] = [0; S(x_1)]
so q = z + [0; S(x_1)], and
  y = Y(H_l(L), x, z) = [z_1; x_1 + z_2]                                                               : l = 1
  y = Y(H_l(L), x, z) = [Y(H_{l−1}(L_11), x_1, 1_2 z_1); Y(H_{l−1}(L_22), x_2, 1_2 (S(x_1) + z_2))]    : l > 1
Initially the prefix z = 0, and it is always the case that z_1 = z_2.
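The following sketch (ours, assuming n is a power of two and leaf blocks of size 2) runs exactly this up-sweep/down-sweep pair for L and checks the result against an exclusive cumulative sum:

```python
import numpy as np

def up_sweep(x):
    """Up-sweep: at every node compute w = (S(x1), S(x2)), the sums of the halves."""
    n = len(x)
    if n == 2:
        return {'w': (x[0], x[1])}
    l, r = up_sweep(x[:n // 2]), up_sweep(x[n // 2:])
    return {'w': (sum(l['w']), sum(r['w'])), 'l': l, 'r': r}

def down_sweep(node, x, z=0.0):
    """Down-sweep: q = [[0,0],[1,0]] w + z, so the right child inherits z + S(x1)."""
    n = len(x)
    if n == 2:
        return [z, x[0] + z]                 # leaf: y = [z1, x1 + z2] with z1 = z2 = z
    s1 = node['w'][0]                        # S(x1), computed by the up-sweep
    return (down_sweep(node['l'], x[:n // 2], z) +
            down_sweep(node['r'], x[n // 2:], z + s1))

x = np.arange(1.0, 9.0)
y = down_sweep(up_sweep(x), x)
assert np.allclose(y, np.concatenate([[0.0], np.cumsum(x)[:-1]]))
```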

Cost of HSS Matrix-Vector Multiplication
- The down-sweep and the up-sweep perform small dense matrix-vector multiplications at each recursive step.
- Let's assume k is both the dimension of the leaf blocks and the rank at each level (the number of columns in each U_a, V_a).
- The work for both the down-sweep and the up-sweep is
    Q(n, k) = 2 Q(n/2, k) + O(k^2 γ),   Q(k, k) = O(k^2 γ)
  so Q(n, k) = O(nk γ).
- The depth of the algorithm scales as D = Θ(log n) for fixed k.

Parallel HSS Matrix-Vector Multiplication
If we assign each tree node to a single processor for the first log_2(p) levels, and execute each leaf subtree on a different processor,
  T_p(n, k) = 2 T_{p/2}(n, k) + O(k^2 γ + k β + α)
            = O((nk/p + k^2 log p) γ + k log(p) β + log(p) α)

Synchronization-Efficient HSS Multiplication
- The leaf subtrees can be computed independently:
    T_p^leaf-subtrees(n, k) = O((nk/p) γ + k β + α)
- Focus on the up-sweep and down-sweep over the log_2(p) top levels.
- Executing the root subtree sequentially takes time
    T_p^root-subtree(pk, k) = O(pk^2 γ + pk β + α)
- Instead, have p^r (r < 1) processors compute subtrees with p^(1−r) leaves each, then recurse on the p^r roots:
    T_p^rec-tree(pk, k) = T_{p^r}^rec-tree(p^r k, k) + O(p^(1−r) k^2 γ + p^(1−r) k β + α)
                        = O(p^(1−r) k^2 γ + p^(1−r) k β + log_{1/r}(log p) α)

Synchronization-Efficient HSS Multiplication (continued)
- Focus on the top tree with p leaves (the leaf subtrees).
- Assign each processor a unique path from a leaf to the root.
- Given w = W(H_l(A), x) at every node, each processor can compute a down-sweep path in the subtree independently.
- For the up-sweep, note that the tree applies a linear transformation, so the results computed along each path can be summed.
- For each tree node, there is a contribution from every processor assigned a leaf of that node's subtree.
- Therefore, there are p − 1 sums over a total of O(p log p) contributions, i.e., O(kp log p) elements in total.
- Performing these with min(p, k log_2 p) processors, each obtaining max(p, k log_2 p) contributions, gives
    T_p^root-paths(k) = O(k^2 log(p) γ + (k log(p) + p) β + log(p) α)

HSS Multiplication by Multiple Vectors
Consider the multiplication C = AB, where A ∈ R^(n×n) is HSS and B ∈ R^(n×b).
- Let's consider the case p ≤ b ≤ n.
- If we assign all of A to each processor, each can compute a column of C simultaneously, but this requires a prohibitive amount of memory.
- Instead, perform the leaf-level multiplications, processing n/p rows of B with each processor (call the intermediate result C̄).
- Then transpose C̄ and apply the log_2(p) root levels of the HSS tree to columns of C̄ independently.
- This algorithm requires replication of only the root O(log p) levels of the HSS tree, i.e., O(pb) data.
- For large k or larger p, different algorithms may be desirable.

References
- M. E. Hochstenbach, A Jacobi-Davidson type SVD method, SIAM Journal on Scientific Computing, 23(2), 606-628, 2001.
- T. F. Chan, Rank revealing QR factorizations, Linear Algebra and its Applications, 88, 67-82, 1987.
- P. Businger and G. H. Golub, Linear least squares solutions by Householder transformations, Numerische Mathematik, 7(3), 269-276, 1965.

- G. Quintana-Ortí, X. Sun, and C. H. Bischof, A BLAS-3 version of the QR factorization with column pivoting, SIAM Journal on Scientific Computing, 19(5), 1486-1494, 1998.
- J. W. Demmel, L. Grigori, M. Gu, and H. Xiang, Communication avoiding rank revealing QR factorization with column pivoting, SIAM Journal on Matrix Analysis and Applications, 36(1), 55-89, 2015.
- C. H. Bischof, A parallel QR factorization algorithm with controlled local pivoting, SIAM Journal on Scientific and Statistical Computing, 12(1), 36-57, 1991.
- N. Halko, P. G. Martinsson, and J. A. Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, SIAM Review, 53(2), 217-288, 2011.