Rank revealing factorizations, and low rank approximations

Rank revealing factorizations, and low rank approximations. L. Grigori, Inria Paris, UPMC. January 2018.

Plan
Low rank matrix approximation
Rank revealing QR factorization
LU_CRTP: Truncated LU factorization with column and row tournament pivoting
Experimental results, LU_CRTP
Randomized algorithms for low rank approximation
2 of 70

Low rank matrix approximation Problem: given an m × n matrix A, compute a rank-k approximation ZW^T, where Z is m × k and W^T is k × n. A problem with diverse applications, from scientific computing (fast solvers for integral equations, H-matrices) to data analytics (principal component analysis, image processing, ...). Flops: computing Ax costs 2mn, while computing ZW^T x costs 2(m + n)k. 4 of 70
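As a quick illustration (a minimal MATLAB/Octave sketch, not from the slides; the factors Z, W are taken from a truncated SVD purely for convenience), applying the factored form Z(W^T x) costs about 2(m + n)k flops per product instead of the 2mn needed when the rank-k matrix is formed explicitly:

m = 2000; n = 1500; k = 20;
A = randn(m, n);
[U, S, V] = svd(A, 'econ');
Z = U(:, 1:k) * S(1:k, 1:k);     % Z is m x k
W = V(:, 1:k);                   % W is n x k, so W' is k x n
x = randn(n, 1);
Ak = Z * W';                     % explicit rank-k matrix: 2mn flops per product
y1 = Ak * x;
y2 = Z * (W' * x);               % factored form: 2(m + n)k flops per product
fprintf('agreement of the two ways of applying ZW^T x: %.1e\n', norm(y1 - y2) / norm(y1));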

Singular value decomposition Given A ∈ R^{m×n}, m ≥ n, its singular value decomposition is
A = UΣV^T = (U_1 U_2 U_3) [Σ_1 0; 0 Σ_2; 0 0] (V_1 V_2)^T
where
U is an m × m orthogonal matrix (the left singular vectors of A); U_1 is m × k, U_2 is m × (n − k), U_3 is m × (m − n),
Σ is m × n, its diagonal is formed by σ_1(A) ≥ ... ≥ σ_n(A) ≥ 0; Σ_1 is k × k, Σ_2 is (n − k) × (n − k),
V is an n × n orthogonal matrix (the right singular vectors of A); V_1 is n × k, V_2 is n × (n − k).
5 of 70

Norms
‖A‖_F = √( Σ_{i=1}^m Σ_{j=1}^n |a_ij|² ) = √( σ_1²(A) + ... + σ_n²(A) ),
‖A‖_2 = σ_max(A) = σ_1(A).
Some properties: ‖A‖_2 ≤ ‖A‖_F ≤ √(min(m, n)) ‖A‖_2.
Orthogonal invariance: if Q ∈ R^{m×m} and Z ∈ R^{n×n} are orthogonal, then ‖QAZ‖_F = ‖A‖_F and ‖QAZ‖_2 = ‖A‖_2.
6 of 70

Low rank matrix approximation Best rank-k approximation A_k = U_k Σ_k V_k^T is the rank-k truncated SVD of A [Eckart and Young, 1936]:
min_{rank(Ã_k) ≤ k} ‖A − Ã_k‖_2 = ‖A − A_k‖_2 = σ_{k+1}(A)  (1)
min_{rank(Ã_k) ≤ k} ‖A − Ã_k‖_F = ‖A − A_k‖_F = √( Σ_{j=k+1}^n σ_j²(A) )  (2)
[Figure: original image of size 919 × 707; rank-38 approximation, SVD; rank-75 approximation, SVD. Image source: https://upload.wikimedia.org/wikipedia/commons/a/a1/alan_turing_aged_16.jpg]
7 of 70
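A small MATLAB/Octave check of property (1) (a sketch, not from the slides): the 2-norm error of the truncated SVD equals σ_{k+1}(A).

n = 200; k = 20;
A = randn(n) * diag(logspace(0, -8, n)) * randn(n);   % matrix with decaying singular values
[U, S, V] = svd(A);
Ak = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';            % best rank-k approximation
s = diag(S);
fprintf('||A - A_k||_2 = %.3e,  sigma_{k+1}(A) = %.3e\n', norm(A - Ak), s(k+1));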

Large data sets Matrix A might not exist entirely at a given time: rows or columns are added progressively. Streaming algorithm: can solve an arbitrarily large problem with one pass over the data (a row or a column at a time). Weakly streaming algorithm: can solve a problem with O(1) passes over the data. Matrix A might exist only implicitly, and it is never formed explicitly. 8 of 70

Low rank matrix approximation: trade-offs 9 of 70

Plan Low rank matrix approximation Rank revealing QR factorization LU CRTP: Truncated LU factorization with column and row tournament pivoting Experimental results, LU CRTP Randomized algorithms for low rank approximation 10 of 70

Rank revealing QR factorization Given A of size m × n, consider the decomposition
A P_c = QR = Q [R_11 R_12; 0 R_22],  (3)
where R_11 is k × k, and P_c and k are chosen such that ‖R_22‖_2 is small and R_11 is well-conditioned.
By the interlacing property of singular values [Golub, Van Loan, 4th edition, page 487],
σ_i(R_11) ≤ σ_i(A) and σ_j(R_22) ≥ σ_{k+j}(A), for 1 ≤ i ≤ k and 1 ≤ j ≤ n − k.
In particular, σ_{k+1}(A) ≤ σ_max(R_22) = ‖R_22‖_2.
11 of 70
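A small MATLAB/Octave check of the last inequality (a sketch, using the built-in QR with column pivoting):

m = 300; n = 200; k = 30;
A = randn(m, n);
[Q, R, P] = qr(A);                     % QR with column pivoting: A*P = Q*R
R22 = R(k+1:end, k+1:end);
s = svd(A);
fprintf('sigma_{k+1}(A) = %.3e <= ||R22||_2 = %.3e\n', s(k+1), norm(R22));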

Rank revealing QR factorization Given A of size m × n, consider the decomposition
A P_c = QR = Q [R_11 R_12; 0 R_22].  (4)
If ‖R_22‖_2 is small, then:
Q(:, 1:k) forms an approximate orthogonal basis for the range of A, since a(:, j) = Σ_{i=1}^{min(j,k)} r_ij Q(:, i) ∈ span{Q(:, 1), ..., Q(:, k)} and ran(A) ≈ span{Q(:, 1), ..., Q(:, k)};
P_c [−R_11^{-1} R_12; I] is an approximate right null space of A.
12 of 70

Rank revealing QR factorization The factorization from equation (4) is rank revealing if
1 ≤ σ_i(A)/σ_i(R_11), σ_j(R_22)/σ_{k+j}(A) ≤ q_1(n, k),
for 1 ≤ i ≤ k and 1 ≤ j ≤ min(m, n) − k, where σ_max(A) = σ_1(A) ≥ ... ≥ σ_min(A) = σ_n(A).
It is strong rank revealing [Gu and Eisenstat, 1996] if in addition ‖R_11^{-1} R_12‖_max ≤ q_2(n, k).
Gu and Eisenstat show that, given k and f, there exists a P_c such that q_1(n, k) = √(1 + f² k(n − k)) and q_2(n, k) = f. The factorization is computed in 4mnk flops (QRCP) plus O(mnk) flops.
13 of 70

QR with column pivoting [Businger and Golub, 1965]
Idea: at the first iteration, each trailing column is decomposed into a part parallel to the first column (or to e_1) and an orthogonal part (in rows 2:m). The column of maximum norm is the column with the largest component orthogonal to the first column.
Implementation: find at each step of the QR factorization the column of maximum norm and permute it into the leading position. If rank(A) = k, at step k + 1 the maximum norm is 0.
There is no need to recompute the column norms at each step; they can be downdated, since for w = Q^T v = [w_1; w(2:m)] we have ‖w(2:m)‖_2² = ‖v‖_2² − w_1².
14 of 70

QR with column pivoting [Businger and Golub, 1965] Sketch of the algorithm:
Initialize the column norm vector: colnrm(j) = ‖A(:, j)‖_2, j = 1:n.
for j = 1:n do
  Find the column p of largest norm among the trailing columns.
  if colnrm(p) > ɛ then
    1. Pivot: swap columns j and p in A and modify colnrm.
    2. Compute the Householder matrix H_j s.t. H_j A(j:m, j) = ±‖A(j:m, j)‖_2 e_1.
    3. Update A(j:m, j+1:n) = H_j A(j:m, j+1:n).
    4. Norm downdate: colnrm(j+1:n)² = colnrm(j+1:n)² − A(j, j+1:n)².
  else
    Break
  end if
end for
If the algorithm stops after k steps, then σ_max(R_22) ≤ √(n − k) max_{1 ≤ j ≤ n−k} ‖R_22(:, j)‖_2 ≤ √(n − k) ɛ.
15 of 70
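A compact MATLAB/Octave sketch of this algorithm (the function name qrcp_sketch and its tolerance argument are illustrative, not from the slides):

function [Q, R, piv] = qrcp_sketch(A, tol)
  % QR with column pivoting and norm downdating: A(:, piv) = Q * R (up to the tolerance).
  [m, n] = size(A);
  Q = eye(m); piv = 1:n;
  colnrm = zeros(1, n);
  for j = 1:n, colnrm(j) = norm(A(:, j)); end
  for j = 1:n
    [mx, p] = max(colnrm(j:n)); p = p + j - 1;   % trailing column of largest norm
    if mx <= tol, break; end
    A(:, [j p]) = A(:, [p j]);                   % pivot: swap columns j and p
    colnrm([j p]) = colnrm([p j]);
    piv([j p]) = piv([p j]);
    x = A(j:m, j);                               % Householder reflector for column j
    v = x; v(1) = v(1) + sign(x(1) + (x(1) == 0)) * norm(x);
    v = v / norm(v);
    A(j:m, j:n) = A(j:m, j:n) - 2 * v * (v' * A(j:m, j:n));
    Q(:, j:m) = Q(:, j:m) - 2 * (Q(:, j:m) * v) * v';
    % norm downdate of the trailing columns
    colnrm(j+1:n) = sqrt(max(colnrm(j+1:n).^2 - A(j, j+1:n).^2, 0));
  end
  R = triu(A);
end

For example, [Q, R, piv] = qrcp_sketch(A, 1e-12) stops as soon as all remaining column norms drop below the tolerance, matching the break in the sketch above.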

Strong RRQR [Gu and Eisenstat, 1996]
Since det(R_11) = Π_{i=1}^k σ_i(R_11) = √(det(A^T A)) / Π_{i=1}^{n−k} σ_i(R_22),
a strong RRQR is related to a large det(R_11). The following algorithm interchanges columns that increase det(R_11), given f and k.
Compute a strong RRQR factorization, given k:
  Compute AΠ = QR by using QRCP
  while there exist i and j such that det(R̃_11)/det(R_11) > f, where R_11 = R(1:k, 1:k), Π_{i,j+k} permutes columns i and j + k, RΠ_{i,j+k} = Q̃R̃, and R̃_11 = R̃(1:k, 1:k) do
    Find such i and j
    Compute RΠ_{i,j+k} = Q̃R̃ and Π = ΠΠ_{i,j+k}
  end while
16 of 70

Strong RRQR (contd) It can be shown that
det(R̃_11)/det(R_11) = √( (R_11^{-1} R_12)_{i,j}² + ω_i²(R_11) χ_j²(R_22) )  (5)
for any 1 ≤ i ≤ k and 1 ≤ j ≤ n − k (here χ_j(A) denotes the 2-norm of the j-th column of A, and ω_j(A) the 2-norm of the j-th row of A^{-1}).
Compute a strong RRQR factorization, given k:
  Compute AΠ = QR by using QRCP
  while max_{1 ≤ i ≤ k, 1 ≤ j ≤ n−k} √( (R_11^{-1} R_12)_{i,j}² + ω_i²(R_11) χ_j²(R_22) ) > f do
    Find i and j such that √( (R_11^{-1} R_12)_{i,j}² + ω_i²(R_11) χ_j²(R_22) ) > f
    Compute RΠ_{i,j+k} = Q̃R̃ and Π = ΠΠ_{i,j+k}
  end while
17 of 70
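A minimal MATLAB/Octave sketch of this loop (names are illustrative; for clarity it recomputes a full QR after every swap, whereas the actual algorithm of Gu and Eisenstat updates the factors):

function [Q, R, P] = strong_rrqr_sketch(A, k, f)
  % Start from QR with column pivoting, then swap columns while det(R11) can grow by more than f.
  [Q, R, P] = qr(A);                       % QRCP: A*P = Q*R
  n = size(A, 2);
  while true
    R11 = R(1:k, 1:k); R12 = R(1:k, k+1:n); R22 = R(k+1:end, k+1:n);
    T = R11 \ R12;                         % R11^{-1} R12
    omega = sqrt(sum(inv(R11).^2, 2));     % omega_i(R11): 2-norms of the rows of R11^{-1}
    chi = sqrt(sum(R22.^2, 1));            % chi_j(R22): 2-norms of the columns of R22
    rho = sqrt(T.^2 + (omega * chi).^2);   % growth of det(R11) if columns i and j+k are swapped
    [mx, idx] = max(rho(:));
    if mx <= f, break; end
    [i, j] = ind2sub(size(rho), idx);
    Pswap = eye(n); Pswap(:, [i, j+k]) = Pswap(:, [j+k, i]);
    P = P * Pswap;                         % Pi = Pi * Pi_{i,j+k}
    [Q, R] = qr(A * P);                    % recompute A*P = Q*R (no pivoting)
  end
end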

Strong RRQR (contd) det(r 11 ) strictly increases with every permutation, no permutation repeats, hence there is a finite number of permutations to be performed. 18 of 70

Strong RRQR (contd) Theorem [Gu and Eisenstat, 1996] If the QR factorization with column pivoting as in equation (4) satisfies the inequality
√( (R_11^{-1} R_12)_{i,j}² + ω_i²(R_11) χ_j²(R_22) ) < f for any 1 ≤ i ≤ k and 1 ≤ j ≤ n − k,
then
1 ≤ σ_i(A)/σ_i(R_11), σ_j(R_22)/σ_{k+j}(A) ≤ √(1 + f² k(n − k)),
for any 1 ≤ i ≤ k and 1 ≤ j ≤ min(m, n) − k.
19 of 70

Tournament pivoting [Demmel et al., 2015] One step of CA-RRQR; tournament pivoting is used to select k columns.
Partition A = (A_1, A_2, A_3, A_4). Select k cols from each column block, by using QR with column pivoting.
At each level i of the tree, at each node j do in parallel:
  Let A_{v,i−1}, A_{w,i−1} be the cols selected by the children of node j.
  Select k cols from (A_{v,i−1}, A_{w,i−1}), by using QR with column pivoting.
Permute the selected columns A_{ji} into the leading positions, compute QR with no pivoting:
A P_{c1} = Q_1 [R_11 R_12; 0 R_22].
20 of 70
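A minimal MATLAB/Octave sketch of this column selection with a binary reduction tree (the helper names tournament_cols and qrcp_select are illustrative; the built-in QR with column pivoting stands in for the strong RRQR used at each node):

function idx = tournament_cols(A, k, nblocks)
  % Return the indices of k columns of A selected by binary-tree tournament pivoting.
  n = size(A, 2);
  bounds = round(linspace(0, n, nblocks + 1));
  cand = cell(1, nblocks);
  for b = 1:nblocks                        % leaves: k candidate columns from each block
    cols = bounds(b)+1 : bounds(b+1);
    cand{b} = cols(qrcp_select(A(:, cols), k));
  end
  while numel(cand) > 1                    % reduction tree: merge pairs, keep k columns
    next = cell(1, ceil(numel(cand) / 2));
    for p = 1:floor(numel(cand) / 2)
      merged = [cand{2*p-1}, cand{2*p}];
      next{p} = merged(qrcp_select(A(:, merged), k));
    end
    if mod(numel(cand), 2) == 1, next{end} = cand{end}; end
    cand = next;
  end
  idx = cand{1};
end

function sel = qrcp_select(B, k)
  % Indices of the first min(k, size(B,2)) columns chosen by QR with column pivoting.
  [~, ~, P] = qr(B);                       % B*P = Q*R
  [sel, ~] = find(P(:, 1:min(k, size(B, 2))));
  sel = sel(:)';
end

For example, idx = tournament_cols(A, 16, 8) selects 16 columns using 8 leaf blocks; the selected columns can then be permuted to the front before the unpivoted QR above.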

Select k columns from a tall and skinny matrix Given W of size m × 2k, m >> k, k columns are selected as follows:
W = Q R_02 using TSQR,
R_02 P_c = Q_2 R_2 using QRCP,
return W P_c(:, 1:k).
21 of 70
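In MATLAB/Octave this step can be sketched as follows (an economy QR stands in for TSQR):

% W is m x 2k with m >> k; return the k selected columns of W.
[~, R02] = qr(W, 0);                 % economy QR in place of TSQR: W = Q * R02
[~, ~, Pc] = qr(R02);                % QRCP on the small 2k x 2k factor: R02*Pc = Q2*R2
[sel, ~] = find(Pc(:, 1:k));         % original positions of the k leading pivoted columns
Wk = W(:, sel);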

Reduction trees Any shape of reduction tree can be used during CA-RRQR, depending on the underlying architecture.
Binary tree: A_00, A_10, A_20, A_30 → f(A_00), f(A_10), f(A_20), f(A_30) → f(A_01), f(A_11) → f(A_02).
Flat tree: A_00, A_10, A_20, A_30 → f(A_00) → f(A_01) → f(A_02) → f(A_03).
Notation: at each node of the reduction tree, f(A_ij) returns the first b columns obtained after performing (strong) RRQR of A_ij.
22 of 70

Rank revealing properties of CA-RRQR It is shown in [Demmel et al., 2015] that the column permutation computed by CA-RRQR satisfies
χ_j²(R_11^{-1} R_12) + (χ_j(R_22)/σ_min(R_11))² ≤ F_TP²,  for j = 1, ..., n − k,  (6)
where F_TP depends on k, f, n, the shape of the reduction tree used during tournament pivoting, and the number of iterations of CA-RRQR.
23 of 70

CA-RRQR - bounds for one tournament Selecting k columns by using tournament pivoting reveals the rank of A with the following bounds:
1 ≤ σ_i(A)/σ_i(R_11), σ_j(R_22)/σ_{k+j}(A) ≤ √(1 + F_TP² (n − k)),
‖R_11^{-1} R_12‖_max ≤ F_TP.
Binary tree of depth log_2(n/k):
F_TP ≤ (1/√(2k)) (n/k)^{log_2(√2 f k)}.  (7)
The upper bound is a decreasing function of k when k > n/(√2 f).
Flat tree of depth n/k:
F_TP ≤ (1/√(2k)) (√2 f k)^{n/k}.  (8)
24 of 70

Cost of CA-RRQR Cost of CA-RRQR vs QR with column pivoting, for an n × n matrix on a √P × √P processor grid, block size k:
Flops: 4n³/P + O(n²k log P/√P) vs (4/3)n³/P
Bandwidth: O(n² log P/√P) vs same
Latency: O(n log P/k) vs O(n log P)
Communication optimal, modulo polylogarithmic factors, by choosing k = n/(2 √P log_2 P).
25 of 70

Numerical results Stability close to QRCP for many tested matrices. Absolute value of diagonals of R, L referred to as R-values, L-values. Methods compared RRQR: QR with column pivoting CA-RRQR-B with tournament pivoting based on binary tree CA-RRQR-F with tournament pivoting based on flat tree SVD 26 of 70

Numerical results - devil's stairs Devil's stairs (Stewart), a matrix with multiple gaps in the singular values.
Matlab code:
Length = 20; s = zeros(n,1);
Nst = floor(n/Length);
for i = 1:Nst
  s(1+Length*(i-1):Length*i) = -0.6*(i-1);
end
s(Length*Nst:end) = -0.6*(Nst-1);
s = 10.^s;
A = orth(rand(n)) * diag(s) * orth(randn(n));
QLP decomposition (Stewart):
A P_{c1} = Q_1 R_1 using ca_rrqr,
R_1^T = Q_2 R_2.
[Left plot: R-values of QRCP, CARRQR-B, CARRQR-F and the singular values (SVD), with the trustworthiness check, vs. column index. Right plot: L-values of QRCP+QLP, CARRQR-B+QLP, CARRQR-F+QLP and the singular values (SVD) vs. column index.]
27 of 70
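A minimal MATLAB/Octave sketch of the QLP step (the built-in QR with column pivoting is used here in place of ca_rrqr):

[Q1, R1, Pc1] = qr(A);               % A*Pc1 = Q1*R1  (QRCP in place of ca_rrqr)
[Q2, R2] = qr(R1');                  % R1' = Q2*R2, so A*Pc1 = Q1*R2'*Q2'
Lvalues = abs(diag(R2));             % L-values: diagonal of the lower triangular factor L = R2'
s = svd(A);
semilogy(1:numel(Lvalues), Lvalues, 'o', 1:numel(s), s, '-');
legend('L-values (QLP)', 'singular values');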

Numerical results (contd)
[Plots of R-values, singular values, and the trustworthiness check vs. column index, for QRCP, CARRQR-B, CARRQR-F, and SVD.]
Left: exponent - exponential distribution, σ_1 = 1, σ_i = α^{i−1} (i = 2, ..., n), α = 10^{−1/11} [Bischof, 1991].
Right: shaw - 1D image restoration model [Hansen, 2007].
The trustworthiness check plots
ɛ · min{ ‖(AΠ_0)(:, i)‖_2, ‖(AΠ_1)(:, i)‖_2, ‖(AΠ_2)(:, i)‖_2 }  (9)
ɛ · max{ ‖(AΠ_0)(:, i)‖_2, ‖(AΠ_1)(:, i)‖_2, ‖(AΠ_2)(:, i)‖_2 }  (10)
where Π_j (j = 0, 1, 2) are the permutation matrices obtained by QRCP, CARRQR-B, and CARRQR-F, and ɛ is the machine precision.
28 of 70

Numerical results - a set of 18 matrices Ratios |R(i, i)|/σ_i(R), for QRCP (top plot), CARRQR-B (second plot), and CARRQR-F (third plot). The numbers along the x-axis represent the indices of the test matrices. 29 of 70

Plan Low rank matrix approximation Rank revealing QR factorization LU CRTP: Truncated LU factorization with column and row tournament pivoting Experimental results, LU CRTP Randomized algorithms for low rank approximation 30 of 70

LU versus QR - filled graph G⁺(A) Consider A SPD and A = LL^T.
Given G(A) = (V, E), the filled graph G⁺(A) = (V, E⁺) is defined as follows: there is an edge (i, j) ∈ G⁺(A) iff there is a path from i to j in G(A) going through lower numbered vertices.
G(L + L^T) = G⁺(A), ignoring cancellations.
The definition also holds for directed graphs (LU factorization).
[Example: 9 × 9 sparsity patterns of A and of L + L^T, showing the fill-in predicted by G⁺(A).]
31 of 70

LU versus QR Filled column intersection graph G⁺_∩(A): the graph of the Cholesky factor of A^T A.
G(R) ⊆ G⁺_∩(A).
A^T A can have many more nonzeros than A.
32 of 70

LU versus QR Numerical stability: let L̂ and Û be the computed factors of the block LU factorization. Then
L̂Û = A + E,  ‖E‖_max ≤ c(n) ɛ ( ‖A‖_max + ‖L̂‖_max ‖Û‖_max ).  (11)
For partial pivoting, ‖L‖_max ≤ 1 and ‖U‖_max ≤ 2^{n−1} ‖A‖_max. In practice, ‖U‖_max ≤ n ‖A‖_max.
33 of 70
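A quick MATLAB/Octave check of the backward error in (11) for LU with partial pivoting (a sketch; the constant c(n) is omitted and the max-norm is taken entrywise):

n = 500;
A = randn(n);
[L, U, P] = lu(A);                           % partial pivoting: P*A = L*U
E = L*U - P*A;
bnd = eps * (max(abs(A(:))) + max(abs(L(:))) * max(abs(U(:))));
fprintf('||E||_max = %.2e,  eps*(||A||_max + ||L||_max*||U||_max) = %.2e\n', ...
        max(abs(E(:))), bnd);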

Low rank approximation based on LU factorization Given the desired rank k, the factorization has the form
P_r A P_c = [Ā_11 Ā_12; Ā_21 Ā_22] = [I 0; Ā_21 Ā_11^{-1} I] [Ā_11 Ā_12; 0 S(Ā_11)],  (12)
where A ∈ R^{m×n}, Ā_11 ∈ R^{k×k}, and S(Ā_11) = Ā_22 − Ā_21 Ā_11^{-1} Ā_12.
The rank-k approximation matrix Ã_k is
Ã_k = [I; Ā_21 Ā_11^{-1}] (Ā_11 Ā_12) = [Ā_11; Ā_21] Ā_11^{-1} (Ā_11 Ā_12).  (13)
Ā_11^{-1} is never formed; its factorization is used when Ã_k is applied to a vector.
In randomized algorithms, U = C⁺ A R⁺, where C⁺, R⁺ are Moore-Penrose generalized inverses.
34 of 70
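A minimal MATLAB/Octave sketch of applying Ã_k to a vector without forming Ā_11^{-1} (the function name and the assumption that the k selected row and column indices are already available, e.g. from tournament pivoting, are illustrative):

function y = apply_lowrank_lu(A, rows, cols, x)
  % y = Atilde_k * x, with Atilde_k = A(:, cols) * inv(A(rows, cols)) * A(rows, :), as in (13).
  C = A(:, cols);                  % m x k, plays the role of (A11; A21)
  R = A(rows, :);                  % k x n, plays the role of (A11  A12)
  A11 = A(rows, cols);             % k x k
  [L, U, p] = lu(A11, 'vector');   % LU with partial pivoting of the small block
  t = R * x;
  y = C * (U \ (L \ t(p)));        % apply inv(A11) through its LU factors
end

The approximation could also be formed explicitly as A(:, cols) * (A(rows, cols) \ A(rows, :)), but for large sparse A only the factored application above is used.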

Design space Non-exhaustive list of options for selecting k columns and rows:
1. Select k linearly independent columns of A (call the result B), by using
  1.1 (strong) QRCP / tournament pivoting using QR,
  1.2 LU / tournament pivoting based on LU, with some form of pivoting (column, complete, rook),
  1.3 randomization: premultiply X = ZA where the random matrix Z is short and fat, then pick k rows from X^T, by some method from 2) below,
  1.4 tournament pivoting based on randomized algorithms to select columns at each step.
2. Select k linearly independent rows of B, by using
  2.1 (strong) QRCP / tournament pivoting based on QR, on B^T or on Q^T, the rows of the thin Q factor of B,
  2.2 LU / tournament pivoting based on LU, with pivoting (row, complete, rook) on B,
  2.3 tournament pivoting based on randomized algorithms to select rows.
35 of 70

Select k cols using tournament pivoting Partition A = (A_1, A_2, A_3, A_4). Select k cols from each column block, by using QR with column pivoting.
At each level i of the tree, at each node j do in parallel:
  Let A_{v,i−1}, A_{w,i−1} be the cols selected by the children of node j.
  Select k cols from (A_{v,i−1}, A_{w,i−1}), by using QR with column pivoting.
Return the columns in A_{ji}.
36 of 70

LU_CRTP factorization - one block step One step of truncated block LU based on column/row tournament pivoting on a matrix A of size m × n:
1. Select k columns by using tournament pivoting and permute them in front; the bounds for the singular values are governed by q_1(n, k):
A P_c = Q [R_11 R_12; 0 R_22] = [Q_11 Q_12; Q_21 Q_22] [R_11 R_12; 0 R_22].
2. Select k rows from (Q_11; Q_21) (of size m × k), by applying tournament pivoting to its transpose:
P_r Q = [Q̄_11 Q̄_12; Q̄_21 Q̄_22]
such that ‖Q̄_21 Q̄_11^{-1}‖_max ≤ F_TP; the bounds for the singular values are governed by q_2(m, k).
37 of 70
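A minimal MATLAB/Octave sketch of one such block step (plain QRCP on the transposed thin Q factor replaces the row tournament pivoting of step 2, and tournament_cols is the column-selection sketch given earlier; both names are illustrative):

function [rows, cols] = lu_crtp_step_sketch(A, k, nblocks)
  cols = tournament_cols(A, k, nblocks);   % step 1: k column indices
  [Qk, ~] = qr(A(:, cols), 0);             % thin m x k orthogonal factor of the selected columns
  [~, ~, Pr] = qr(Qk');                    % step 2 (sketch): QRCP on Qk' picks k well-conditioned rows
  [rows, ~] = find(Pr(:, 1:k));
  rows = rows(:)';
end

The selected rows and cols can then be passed to apply_lowrank_lu above.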

Orthogonal matrices Given an orthogonal matrix Q ∈ R^{m×m} and its partitioning
Q = [Q_11 Q_12; Q_21 Q_22],  (14)
the selection of k cols by tournament pivoting from (Q_11; Q_21)^T leads to the factorization
P_r Q = [Q̄_11 Q̄_12; Q̄_21 Q̄_22] = [I 0; Q̄_21 Q̄_11^{-1} I] [Q̄_11 Q̄_12; 0 S(Q̄_11)],  (15)
where S(Q̄_11) = Q̄_22 − Q̄_21 Q̄_11^{-1} Q̄_12 = Q̄_22^{-T}.
38 of 70

Orthogonal matrices (contd) The factorization satisfies:
P_r Q = [Q̄_11 Q̄_12; Q̄_21 Q̄_22] = [I 0; Q̄_21 Q̄_11^{-1} I] [Q̄_11 Q̄_12; 0 S(Q̄_11)],  (16)
ρ_j(Q̄_21 Q̄_11^{-1}) ≤ F_TP,  (17)
1/q_2(m, k) ≤ σ_i(Q̄_11) ≤ 1,  (18)
σ_min(Q̄_11) = σ_min(Q̄_22),  (19)
for all 1 ≤ i ≤ k, 1 ≤ j ≤ m − k, where ρ_j(A) is the 2-norm of the j-th row of A and q_2(m, k) = √(1 + F_TP² (m − k)).
39 of 70

Sketch of the proof
P_r A P_c = [Ā_11 Ā_12; Ā_21 Ā_22] = [I 0; Ā_21 Ā_11^{-1} I] [Ā_11 Ā_12; 0 S(Ā_11)]
          = [I 0; Q̄_21 Q̄_11^{-1} I] [Q̄_11 Q̄_12; 0 S(Q̄_11)] [R_11 R_12; 0 R_22],  (20)
where Q̄_21 Q̄_11^{-1} = Ā_21 Ā_11^{-1} and S(Ā_11) = S(Q̄_11) R_22 = Q̄_22^{-T} R_22.
40 of 70

Sketch of the proof (contd) We obtain
Ā_11 = Q̄_11 R_11,  (21)
S(Ā_11) = S(Q̄_11) R_22 = Q̄_22^{-T} R_22.  (22)
Hence
σ_i(A) ≥ σ_i(Ā_11) ≥ σ_min(Q̄_11) σ_i(R_11) ≥ σ_i(A) / (q_1(n, k) q_2(m, k)).
We also have that
σ_{k+j}(A) ≤ σ_j(S(Ā_11)) = σ_j(S(Q̄_11) R_22) ≤ ‖S(Q̄_11)‖_2 σ_j(R_22) ≤ q_1(n, k) q_2(m, k) σ_{k+j}(A),
where q_1(n, k) = √(1 + F_TP² (n − k)), q_2(m, k) = √(1 + F_TP² (m − k)).
41 of 70

LU_CRTP factorization - bounds if rank = k Given A of size m × n, one step of LU_CRTP computes the decomposition
Ā = P_r A P_c = [Ā_11 Ā_12; Ā_21 Ā_22] = [I 0; Q̄_21 Q̄_11^{-1} I] [Ā_11 Ā_12; 0 S(Ā_11)],  (23)
where Ā_11 is of size k × k and
S(Ā_11) = Ā_22 − Ā_21 Ā_11^{-1} Ā_12 = Ā_22 − Q̄_21 Q̄_11^{-1} Ā_12.  (24)
It satisfies the following properties:
ρ_l(Ā_21 Ā_11^{-1}) = ρ_l(Q̄_21 Q̄_11^{-1}) ≤ F_TP,  (25)
‖S(Ā_11)‖_max ≤ min( (1 + F_TP √k) ‖A‖_max, F_TP √(1 + F_TP² (m − k)) σ_k(A) ),
1 ≤ σ_i(A)/σ_i(Ā_11), σ_j(S(Ā_11))/σ_{k+j}(A) ≤ q(m, n, k),  (26)
for any 1 ≤ l ≤ m − k, 1 ≤ i ≤ k, and 1 ≤ j ≤ min(m, n) − k, where q(m, n, k) = √( (1 + F_TP² (n − k)) (1 + F_TP² (m − k)) ).
42 of 70

Arithmetic complexity - arbitrary sparse matrices Let d_i be the number of nonzeros in column i of A, nnz(A) = Σ_{i=1}^n d_i. A is permuted such that d_1 ≤ ... ≤ d_n. A = [A_00, ..., A_{n/k,0}] is partitioned into n/k blocks of columns.
At the first step of TP, pick k cols from A_1 = [A_00, A_10]:
nnz(A_1) ≤ Σ_{i=1}^{2k} d_i,  flops_QR(A_1) ≤ 8k² Σ_{i=1}^{2k} d_i.
At the second step of TP, pick k cols from A_2:
nnz(A_2) ≤ Σ_{i=k+1}^{3k} d_i,  flops_QR(A_2) ≤ 8k² Σ_{i=k+1}^{3k} d_i.
[Sketch of a sparsity pattern for which these bounds are attained.]
43 of 70

Arithmetic complexity - arbitrary sparse matrices (2) nnz max (TP FT ) 4d n k 2 ( 2k nnz total (TP FT ) 2k d i + 4k i=1 3k ) n d i +... + d i i=k+1 i=n 2k+1 n d i = 4nnz(A)k, i=1 flops(tp FT ) 16nnz(A)k 2, 44 of 70

Tournament pivoting for sparse matrices Arithmetic complexity:
If A has an arbitrary sparsity structure:
  flops(TP_FT) ≤ 16 nnz(A) k²,  flops(TP_BT) ≤ 8 (nnz(A)/P) k² log(n/k).
If G(A^T A) is an n^{1/2}-separable graph:
  flops(TP_FT) ≤ O(nnz(A) k^{3/2}),  flops(TP_BT) ≤ O( (nnz(A)/P) k^{3/2} log(n/k) ).
Randomized algorithm by Clarkson and Woodruff, STOC '13: given an n × n matrix A, it computes LDW^T, where D is k × k, such that with failure probability 1/10,
‖A − LDW^T‖_F ≤ (1 + ɛ) ‖A − A_k‖_F,
where A_k is the best rank-k approximation. The cost of this algorithm is
O(nnz(A)) + n k² ɛ^{−4} log^{O(1)}(n k² ɛ^{−4}) flops.
Tournament pivoting is faster if ɛ ≤ 1/(nnz(A)/n)^{1/4}, or if ɛ = 0.1 and nnz(A)/n ≤ 10^4.
45 of 70

Plan Low rank matrix approximation Rank revealing QR factorization LU CRTP: Truncated LU factorization with column and row tournament pivoting Experimental results, LU CRTP Randomized algorithms for low rank approximation 46 of 70

Numerical results
[Plots: evolution of the singular values and of the values computed by QRCP, LU_CRQRCP, LU_CRTP, and SVD, for exponential (left) and foxgood (right).]
Left: exponent - exponential distribution, σ_1 = 1, σ_i = α^{i−1} (i = 2, ..., n), α = 10^{−1/11} [Bischof, 1991].
Right: foxgood - severely ill-posed test problem of the first-kind Fredholm integral equation used by Fox and Goodwin.
47 of 70

Numerical results Here k = 16 and the factorization is truncated at K = 128 (bars) or K = 240 (red lines). LU_CTP: column tournament pivoting + partial pivoting. All singular values smaller than the machine precision ɛ are replaced by ɛ. The numbers along the x-axis represent the indices of the test matrices. 48 of 70

Results for an image of size 919 × 707
[Figure panels: original image; singular value distribution; rank-38 approximation, SVD; rank-38 approximation, LUPP; rank-38 approximation, LU_CRTP; rank-75 approximation, LU_CRTP.]
49 of 70

Results for an image of size 691 × 505
[Figure panels: original image; singular value distribution; rank-105 approximation, SVD; rank-105 approximation, LUPP; rank-105 approximation, LU_CRTP; rank-209 approximation, LU_CRTP.]
50 of 70

Comparing nnz in the factors L, U versus Q, R

Name (size)          Nnz A(:, 1:K)   Rank K   Nnz QRCP / Nnz LU_CRTP   Nnz LU_CRTP / Nnz LUPP
gemat11 (4929)                1232      128                      2.1                      2.2
                              4895      512                      3.3                      2.6
                              9583     1024                     11.5                      3.2
wang3 (26064)                  896      128                      3.0                      2.1
                              3536      512                      2.9                      2.1
                              7120     1024                      2.9                      1.2
Rfdevice (74104)               633      128                     10.0                      1.1
                              2255      512                     82.6                      0.9
                              4681     1024                    207.2                      0.0
Parab_fem (525825)             896      128                        -                      0.5
                              3584      512                        -                      0.3
                              7168     1024                        -                      0.2
Mac_econ (206500)              384      128                        -                      0.3
                              1535      512                        -                      0.3
                              5970     1024                        -                      0.2
51 of 70

Performance results Selection of 256 columns by tournament pivoting.
Edison, Cray XC30 (NERSC): 2 × 12-core Intel Ivy Bridge (2.4 GHz).
Tournament pivoting uses SPQR (T. Davis) + dgeqp3 (LAPACK); times in seconds.
Matrices (dimensions, and dimension at the leaves on 32 processes):
  Parab_fem: 525825 × 525825, at the leaves 525825 × 16432
  Mac_econ: 206500 × 206500, at the leaves 206500 × 6453
Time for 2k cols on 32 procs, and time at the leaves (SPQR + dgeqp3):
  Parab_fem: 0.26, and 0.26 + 1129
  Mac_econ: 0.46, and 25.4 + 510
Time vs. number of MPI processes:
  Parab_fem (16, 32, 64, 128, 256, 512, 1024 procs): 46.7, 24.5, 13.7, 8.4, 5.9, 4.8, 4.4
  Mac_econ: 132.7, 86.3, 111.4, 59.6, 27.2
52 of 70

More details on CA deterministic algorithms [Demmel et al., 2015] Communication avoiding rank revealing QR factorization with column pivoting Demmel, Grigori, Gu, Xiang, SIAM J. Matrix Analysis and Applications, 2015. Low rank approximation of a sparse matrix based on LU factorization with column and row tournament pivoting, with S. Cayrols and J. Demmel, Inria TR 8910. 53 of 70

References (1)
Bischof, C. H. (1991). A parallel QR factorization algorithm with controlled local pivoting. SIAM J. Sci. Stat. Comput., 12:36-57.
Businger, P. A. and Golub, G. H. (1965). Linear least squares solutions by Householder transformations. Numer. Math., 7:269-276.
Demmel, J., Grigori, L., Gu, M., and Xiang, H. (2015). Communication-avoiding rank-revealing QR decomposition. SIAM Journal on Matrix Analysis and its Applications, 36(1):55-89.
Eckart, C. and Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1:211-218.
Eisenstat, S. C. and Ipsen, I. C. F. (1995). Relative perturbation techniques for singular value problems. SIAM J. Numer. Anal., 32(6):1972-1988.
Gu, M. and Eisenstat, S. C. (1996). Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM J. Sci. Comput., 17(4):848-869.
Hansen, P. C. (2007). Regularization tools: A Matlab package for analysis and solution of discrete ill-posed problems. Numerical Algorithms, (46):189-194.
54 of 70

Results used in the proofs
Interlacing property of singular values [Golub, Van Loan, 4th edition, page 487]: Let A = [a_1 ... a_n] be a column partitioning of an m × n matrix with m ≥ n. If A_r = [a_1 ... a_r], then for r = 1 : n − 1,
σ_1(A_{r+1}) ≥ σ_1(A_r) ≥ σ_2(A_{r+1}) ≥ ... ≥ σ_r(A_{r+1}) ≥ σ_r(A_r) ≥ σ_{r+1}(A_{r+1}).
Given an n × n matrix B and an n × k matrix C, then ([Eisenstat and Ipsen, 1995], p. 1977)
σ_min(B) σ_j(C) ≤ σ_j(BC) ≤ σ_max(B) σ_j(C),  j = 1, ..., k.
55 of 70