Rank revealing factorizations, and low rank approximations

Rank revealing factorizations, and low rank approximations. L. Grigori, Inria Paris, UPMC. January 2018.

Plan
Low rank matrix approximation
Rank revealing QR factorization
LU_CRTP: Truncated LU factorization with column and row tournament pivoting
Experimental results, LU_CRTP
Randomized algorithms for low rank approximation
2 of 70

Low rank matrix approximation Problem: given an m × n matrix A, compute a rank-k approximation ZW^T, where Z is m × k and W^T is k × n. A problem with diverse applications, from scientific computing (fast solvers for integral equations, H-matrices) to data analytics (principal component analysis, image processing, ...). Flops: computing Ax costs 2mn, while computing ZW^T x costs 2(m + n)k. 4 of 70
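As a quick illustration (a minimal MATLAB/Octave sketch, not from the slides; the factors Z, W are taken from a truncated SVD purely for convenience), applying the factored form Z(W^T x) costs about 2(m + n)k flops per product instead of the 2mn needed when the rank-k matrix is formed explicitly:

m = 2000; n = 1500; k = 20;
A = randn(m, n);
[U, S, V] = svd(A, 'econ');
Z = U(:, 1:k) * S(1:k, 1:k);     % Z is m x k
W = V(:, 1:k);                   % W is n x k, so W' is k x n
x = randn(n, 1);
Ak = Z * W';                     % explicit rank-k matrix: 2mn flops per product
y1 = Ak * x;
y2 = Z * (W' * x);               % factored form: 2(m + n)k flops per product
fprintf('agreement of the two ways of applying ZW^T x: %.1e\n', norm(y1 - y2) / norm(y1));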

Singular value decomposition Given A ∈ R^{m×n}, m ≥ n, its singular value decomposition is
A = UΣV^T = (U_1 U_2 U_3) [Σ_1 0; 0 Σ_2; 0 0] (V_1 V_2)^T
where
U is an m × m orthogonal matrix (the left singular vectors of A); U_1 is m × k, U_2 is m × (n − k), U_3 is m × (m − n),
Σ is m × n, its diagonal is formed by σ_1(A) ≥ ... ≥ σ_n(A) ≥ 0; Σ_1 is k × k, Σ_2 is (n − k) × (n − k),
V is an n × n orthogonal matrix (the right singular vectors of A); V_1 is n × k, V_2 is n × (n − k).
5 of 70

Norms
‖A‖_F = √( Σ_{i=1}^m Σ_{j=1}^n |a_ij|² ) = √( σ_1²(A) + ... + σ_n²(A) ),
‖A‖_2 = σ_max(A) = σ_1(A).
Some properties: ‖A‖_2 ≤ ‖A‖_F ≤ √(min(m, n)) ‖A‖_2.
Orthogonal invariance: if Q ∈ R^{m×m} and Z ∈ R^{n×n} are orthogonal, then ‖QAZ‖_F = ‖A‖_F and ‖QAZ‖_2 = ‖A‖_2.
6 of 70

Low rank matrix approximation Best rank-k approximation A_k = U_k Σ_k V_k^T is the rank-k truncated SVD of A [Eckart and Young, 1936]:
min_{rank(Ã_k) ≤ k} ‖A − Ã_k‖_2 = ‖A − A_k‖_2 = σ_{k+1}(A)  (1)
min_{rank(Ã_k) ≤ k} ‖A − Ã_k‖_F = ‖A − A_k‖_F = √( Σ_{j=k+1}^n σ_j²(A) )  (2)
[Figure: original image of size 919 × 707; rank-38 approximation, SVD; rank-75 approximation, SVD. Image source: https://upload.wikimedia.org/wikipedia/commons/a/a1/alan_turing_aged_16.jpg]
7 of 70
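A small MATLAB/Octave check of property (1) (a sketch, not from the slides): the 2-norm error of the truncated SVD equals σ_{k+1}(A).

n = 200; k = 20;
A = randn(n) * diag(logspace(0, -8, n)) * randn(n);   % matrix with decaying singular values
[U, S, V] = svd(A);
Ak = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';            % best rank-k approximation
s = diag(S);
fprintf('||A - A_k||_2 = %.3e,  sigma_{k+1}(A) = %.3e\n', norm(A - Ak), s(k+1));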

Large data sets Matrix A might not exist entirely at a given time: rows or columns are added progressively. Streaming algorithm: can solve an arbitrarily large problem with one pass over the data (a row or a column at a time). Weakly streaming algorithm: can solve a problem with O(1) passes over the data. Matrix A might exist only implicitly, and it is never formed explicitly. 8 of 70

Low rank matrix approximation: trade-offs 9 of 70

Plan Low rank matrix approximation Rank revealing QR factorization LU CRTP: Truncated LU factorization with column and row tournament pivoting Experimental results, LU CRTP Randomized algorithms for low rank approximation 10 of 70

Rank revealing QR factorization Given A of size m × n, consider the decomposition
A P_c = QR = Q [R_11 R_12; 0 R_22],  (3)
where R_11 is k × k, and P_c and k are chosen such that ‖R_22‖_2 is small and R_11 is well-conditioned.
By the interlacing property of singular values [Golub, Van Loan, 4th edition, page 487],
σ_i(R_11) ≤ σ_i(A) and σ_j(R_22) ≥ σ_{k+j}(A), for 1 ≤ i ≤ k and 1 ≤ j ≤ n − k.
In particular, σ_{k+1}(A) ≤ σ_max(R_22) = ‖R_22‖_2.
11 of 70
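A small MATLAB/Octave check of the last inequality (a sketch, using the built-in QR with column pivoting):

m = 300; n = 200; k = 30;
A = randn(m, n);
[Q, R, P] = qr(A);                     % QR with column pivoting: A*P = Q*R
R22 = R(k+1:end, k+1:end);
s = svd(A);
fprintf('sigma_{k+1}(A) = %.3e <= ||R22||_2 = %.3e\n', s(k+1), norm(R22));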

Rank revealing QR factorization Given A of size m × n, consider the decomposition
A P_c = QR = Q [R_11 R_12; 0 R_22].  (4)
If ‖R_22‖_2 is small, then:
Q(:, 1:k) forms an approximate orthogonal basis for the range of A, since a(:, j) = Σ_{i=1}^{min(j,k)} r_ij Q(:, i) ∈ span{Q(:, 1), ..., Q(:, k)} and ran(A) ≈ span{Q(:, 1), ..., Q(:, k)};
P_c [−R_11^{-1} R_12; I] is an approximate right null space of A.
12 of 70

Rank revealing QR factorization The factorization from equation (4) is rank revealing if
1 ≤ σ_i(A)/σ_i(R_11), σ_j(R_22)/σ_{k+j}(A) ≤ q_1(n, k),
for 1 ≤ i ≤ k and 1 ≤ j ≤ min(m, n) − k, where σ_max(A) = σ_1(A) ≥ ... ≥ σ_min(A) = σ_n(A).
It is strong rank revealing [Gu and Eisenstat, 1996] if in addition ‖R_11^{-1} R_12‖_max ≤ q_2(n, k).
Gu and Eisenstat show that, given k and f, there exists a P_c such that q_1(n, k) = √(1 + f² k(n − k)) and q_2(n, k) = f. The factorization is computed in 4mnk flops (QRCP) plus O(mnk) flops.
13 of 70

QR with column pivoting [Businger and Golub, 1965]
Idea: at the first iteration, each trailing column is decomposed into a part parallel to the first column (or to e_1) and an orthogonal part (in rows 2:m). The column of maximum norm is the column with the largest component orthogonal to the first column.
Implementation: find at each step of the QR factorization the column of maximum norm and permute it into the leading position. If rank(A) = k, at step k + 1 the maximum norm is 0.
There is no need to recompute the column norms at each step; they can be downdated, since for w = Q^T v = [w_1; w(2:m)] we have ‖w(2:m)‖_2² = ‖v‖_2² − w_1².
14 of 70

QR with column pivoting [Businger and Golub, 1965] Sketch of the algorithm:
Initialize the column norm vector: colnrm(j) = ‖A(:, j)‖_2, j = 1:n.
for j = 1:n do
  Find the column p of largest norm among the trailing columns.
  if colnrm(p) > ɛ then
    1. Pivot: swap columns j and p in A and modify colnrm.
    2. Compute the Householder matrix H_j s.t. H_j A(j:m, j) = ±‖A(j:m, j)‖_2 e_1.
    3. Update A(j:m, j+1:n) = H_j A(j:m, j+1:n).
    4. Norm downdate: colnrm(j+1:n)² = colnrm(j+1:n)² − A(j, j+1:n)².
  else
    Break
  end if
end for
If the algorithm stops after k steps, then σ_max(R_22) ≤ √(n − k) max_{1 ≤ j ≤ n−k} ‖R_22(:, j)‖_2 ≤ √(n − k) ɛ.
15 of 70
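A compact MATLAB/Octave sketch of this algorithm (the function name qrcp_sketch and its tolerance argument are illustrative, not from the slides):

function [Q, R, piv] = qrcp_sketch(A, tol)
  % QR with column pivoting and norm downdating: A(:, piv) = Q * R (up to the tolerance).
  [m, n] = size(A);
  Q = eye(m); piv = 1:n;
  colnrm = zeros(1, n);
  for j = 1:n, colnrm(j) = norm(A(:, j)); end
  for j = 1:n
    [mx, p] = max(colnrm(j:n)); p = p + j - 1;   % trailing column of largest norm
    if mx <= tol, break; end
    A(:, [j p]) = A(:, [p j]);                   % pivot: swap columns j and p
    colnrm([j p]) = colnrm([p j]);
    piv([j p]) = piv([p j]);
    x = A(j:m, j);                               % Householder reflector for column j
    v = x; v(1) = v(1) + sign(x(1) + (x(1) == 0)) * norm(x);
    v = v / norm(v);
    A(j:m, j:n) = A(j:m, j:n) - 2 * v * (v' * A(j:m, j:n));
    Q(:, j:m) = Q(:, j:m) - 2 * (Q(:, j:m) * v) * v';
    % norm downdate of the trailing columns
    colnrm(j+1:n) = sqrt(max(colnrm(j+1:n).^2 - A(j, j+1:n).^2, 0));
  end
  R = triu(A);
end

For example, [Q, R, piv] = qrcp_sketch(A, 1e-12) stops as soon as all remaining column norms drop below the tolerance, matching the break in the sketch above.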

Strong RRQR [Gu and Eisenstat, 1996]
Since det(R_11) = Π_{i=1}^k σ_i(R_11) = √(det(A^T A)) / Π_{i=1}^{n−k} σ_i(R_22),
a strong RRQR is related to a large det(R_11). The following algorithm interchanges columns that increase det(R_11), given f and k.
Compute a strong RRQR factorization, given k:
  Compute AΠ = QR by using QRCP
  while there exist i and j such that det(R̃_11)/det(R_11) > f, where R_11 = R(1:k, 1:k), Π_{i,j+k} permutes columns i and j + k, RΠ_{i,j+k} = Q̃R̃, and R̃_11 = R̃(1:k, 1:k) do
    Find such i and j
    Compute RΠ_{i,j+k} = Q̃R̃ and Π = ΠΠ_{i,j+k}
  end while
16 of 70

Strong RRQR (contd) It can be shown that
det(R̃_11)/det(R_11) = √( (R_11^{-1} R_12)_{i,j}² + ω_i²(R_11) χ_j²(R_22) )  (5)
for any 1 ≤ i ≤ k and 1 ≤ j ≤ n − k (here χ_j(A) denotes the 2-norm of the j-th column of A, and ω_j(A) the 2-norm of the j-th row of A^{-1}).
Compute a strong RRQR factorization, given k:
  Compute AΠ = QR by using QRCP
  while max_{1 ≤ i ≤ k, 1 ≤ j ≤ n−k} √( (R_11^{-1} R_12)_{i,j}² + ω_i²(R_11) χ_j²(R_22) ) > f do
    Find i and j such that √( (R_11^{-1} R_12)_{i,j}² + ω_i²(R_11) χ_j²(R_22) ) > f
    Compute RΠ_{i,j+k} = Q̃R̃ and Π = ΠΠ_{i,j+k}
  end while
17 of 70
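A minimal MATLAB/Octave sketch of this loop (names are illustrative; for clarity it recomputes a full QR after every swap, whereas the actual algorithm of Gu and Eisenstat updates the factors):

function [Q, R, P] = strong_rrqr_sketch(A, k, f)
  % Start from QR with column pivoting, then swap columns while det(R11) can grow by more than f.
  [Q, R, P] = qr(A);                       % QRCP: A*P = Q*R
  n = size(A, 2);
  while true
    R11 = R(1:k, 1:k); R12 = R(1:k, k+1:n); R22 = R(k+1:end, k+1:n);
    T = R11 \ R12;                         % R11^{-1} R12
    omega = sqrt(sum(inv(R11).^2, 2));     % omega_i(R11): 2-norms of the rows of R11^{-1}
    chi = sqrt(sum(R22.^2, 1));            % chi_j(R22): 2-norms of the columns of R22
    rho = sqrt(T.^2 + (omega * chi).^2);   % growth of det(R11) if columns i and j+k are swapped
    [mx, idx] = max(rho(:));
    if mx <= f, break; end
    [i, j] = ind2sub(size(rho), idx);
    Pswap = eye(n); Pswap(:, [i, j+k]) = Pswap(:, [j+k, i]);
    P = P * Pswap;                         % Pi = Pi * Pi_{i,j+k}
    [Q, R] = qr(A * P);                    % recompute A*P = Q*R (no pivoting)
  end
end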

Strong RRQR (contd) det(r 11 ) strictly increases with every permutation, no permutation repeats, hence there is a finite number of permutations to be performed. 18 of 70

Strong RRQR (contd) Theorem [Gu and Eisenstat, 1996] If the QR factorization with column pivoting as in equation (4) satisfies the inequality
√( (R_11^{-1} R_12)_{i,j}² + ω_i²(R_11) χ_j²(R_22) ) < f for any 1 ≤ i ≤ k and 1 ≤ j ≤ n − k,
then
1 ≤ σ_i(A)/σ_i(R_11), σ_j(R_22)/σ_{k+j}(A) ≤ √(1 + f² k(n − k)),
for any 1 ≤ i ≤ k and 1 ≤ j ≤ min(m, n) − k.
19 of 70

Tournament pivoting [Demmel et al., 2015] One step of CA-RRQR; tournament pivoting is used to select k columns.
Partition A = (A_1, A_2, A_3, A_4). Select k cols from each column block, by using QR with column pivoting.
At each level i of the tree, at each node j do in parallel:
  Let A_{v,i−1}, A_{w,i−1} be the cols selected by the children of node j.
  Select k cols from (A_{v,i−1}, A_{w,i−1}), by using QR with column pivoting.
Permute the selected columns A_{ji} into the leading positions, compute QR with no pivoting:
A P_{c1} = Q_1 [R_11 R_12; 0 R_22].
20 of 70
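A minimal MATLAB/Octave sketch of this column selection with a binary reduction tree (the helper names tournament_cols and qrcp_select are illustrative; the built-in QR with column pivoting stands in for the strong RRQR used at each node):

function idx = tournament_cols(A, k, nblocks)
  % Return the indices of k columns of A selected by binary-tree tournament pivoting.
  n = size(A, 2);
  bounds = round(linspace(0, n, nblocks + 1));
  cand = cell(1, nblocks);
  for b = 1:nblocks                        % leaves: k candidate columns from each block
    cols = bounds(b)+1 : bounds(b+1);
    cand{b} = cols(qrcp_select(A(:, cols), k));
  end
  while numel(cand) > 1                    % reduction tree: merge pairs, keep k columns
    next = cell(1, ceil(numel(cand) / 2));
    for p = 1:floor(numel(cand) / 2)
      merged = [cand{2*p-1}, cand{2*p}];
      next{p} = merged(qrcp_select(A(:, merged), k));
    end
    if mod(numel(cand), 2) == 1, next{end} = cand{end}; end
    cand = next;
  end
  idx = cand{1};
end

function sel = qrcp_select(B, k)
  % Indices of the first min(k, size(B,2)) columns chosen by QR with column pivoting.
  [~, ~, P] = qr(B);                       % B*P = Q*R
  [sel, ~] = find(P(:, 1:min(k, size(B, 2))));
  sel = sel(:)';
end

For example, idx = tournament_cols(A, 16, 8) selects 16 columns using 8 leaf blocks; the selected columns can then be permuted to the front before the unpivoted QR above.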

Select k columns from a tall and skinny matrix Given W of size m × 2k, m >> k, k columns are selected as follows:
W = Q R_02 using TSQR,
R_02 P_c = Q_2 R_2 using QRCP,
return W P_c(:, 1:k).
21 of 70
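In MATLAB/Octave this step can be sketched as follows (an economy QR stands in for TSQR):

% W is m x 2k with m >> k; return the k selected columns of W.
[~, R02] = qr(W, 0);                 % economy QR in place of TSQR: W = Q * R02
[~, ~, Pc] = qr(R02);                % QRCP on the small 2k x 2k factor: R02*Pc = Q2*R2
[sel, ~] = find(Pc(:, 1:k));         % original positions of the k leading pivoted columns
Wk = W(:, sel);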

Reduction trees Any shape of reduction tree can be used during CA-RRQR, depending on the underlying architecture.
Binary tree: A_00, A_10, A_20, A_30 → f(A_00), f(A_10), f(A_20), f(A_30) → f(A_01), f(A_11) → f(A_02).
Flat tree: A_00, A_10, A_20, A_30 → f(A_00) → f(A_01) → f(A_02) → f(A_03).
Notation: at each node of the reduction tree, f(A_ij) returns the first b columns obtained after performing (strong) RRQR of A_ij.
22 of 70

Rank revealing properties of CA-RRQR It is shown in [Demmel et al., 2015] that the column permutation computed by CA-RRQR satisfies
χ_j²(R_11^{-1} R_12) + (χ_j(R_22)/σ_min(R_11))² ≤ F_TP²,  for j = 1, ..., n − k,  (6)
where F_TP depends on k, f, n, the shape of the reduction tree used during tournament pivoting, and the number of iterations of CA-RRQR.
23 of 70

CA-RRQR - bounds for one tournament Selecting k columns by using tournament pivoting reveals the rank of A with the following bounds:
1 ≤ σ_i(A)/σ_i(R_11), σ_j(R_22)/σ_{k+j}(A) ≤ √(1 + F_TP² (n − k)),
‖R_11^{-1} R_12‖_max ≤ F_TP.
Binary tree of depth log_2(n/k):
F_TP ≤ (1/√(2k)) (n/k)^{log_2(√2 f k)}.  (7)
The upper bound is a decreasing function of k when k > n/(√2 f).
Flat tree of depth n/k:
F_TP ≤ (1/√(2k)) (√2 f k)^{n/k}.  (8)
24 of 70

Cost of CA-RRQR Cost of CA-RRQR vs QR with column pivoting, for an n × n matrix on a √P × √P processor grid, block size k:
Flops: 4n³/P + O(n²k log P/√P) vs (4/3)n³/P
Bandwidth: O(n² log P/√P) vs same
Latency: O(n log P/k) vs O(n log P)
Communication optimal, modulo polylogarithmic factors, by choosing k = n/(2 √P log_2 P).
25 of 70

Numerical results Stability close to QRCP for many tested matrices. Absolute value of diagonals of R, L referred to as R-values, L-values. Methods compared RRQR: QR with column pivoting CA-RRQR-B with tournament pivoting based on binary tree CA-RRQR-F with tournament pivoting based on flat tree SVD 26 of 70

Numerical results - devil's stairs Devil's stairs (Stewart), a matrix with multiple gaps in the singular values.
Matlab code:
Length = 20; s = zeros(n,1);
Nst = floor(n/Length);
for i = 1:Nst
  s(1+Length*(i-1):Length*i) = -0.6*(i-1);
end
s(Length*Nst:end) = -0.6*(Nst-1);
s = 10.^s;
A = orth(rand(n)) * diag(s) * orth(randn(n));
QLP decomposition (Stewart):
A P_{c1} = Q_1 R_1 using ca_rrqr,
R_1^T = Q_2 R_2.
[Left plot: R-values of QRCP, CARRQR-B, CARRQR-F and the singular values (SVD), with the trustworthiness check, vs. column index. Right plot: L-values of QRCP+QLP, CARRQR-B+QLP, CARRQR-F+QLP and the singular values (SVD) vs. column index.]
27 of 70
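A minimal MATLAB/Octave sketch of the QLP step (the built-in QR with column pivoting is used here in place of ca_rrqr):

[Q1, R1, Pc1] = qr(A);               % A*Pc1 = Q1*R1  (QRCP in place of ca_rrqr)
[Q2, R2] = qr(R1');                  % R1' = Q2*R2, so A*Pc1 = Q1*R2'*Q2'
Lvalues = abs(diag(R2));             % L-values: diagonal of the lower triangular factor L = R2'
s = svd(A);
semilogy(1:numel(Lvalues), Lvalues, 'o', 1:numel(s), s, '-');
legend('L-values (QLP)', 'singular values');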

Numerical results (contd)
[Plots of R-values, singular values, and the trustworthiness check vs. column index, for QRCP, CARRQR-B, CARRQR-F, and SVD.]
Left: exponent - exponential distribution, σ_1 = 1, σ_i = α^{i−1} (i = 2, ..., n), α = 10^{−1/11} [Bischof, 1991].
Right: shaw - 1D image restoration model [Hansen, 2007].
The trustworthiness check plots
ɛ · min{ ‖(AΠ_0)(:, i)‖_2, ‖(AΠ_1)(:, i)‖_2, ‖(AΠ_2)(:, i)‖_2 }  (9)
ɛ · max{ ‖(AΠ_0)(:, i)‖_2, ‖(AΠ_1)(:, i)‖_2, ‖(AΠ_2)(:, i)‖_2 }  (10)
where Π_j (j = 0, 1, 2) are the permutation matrices obtained by QRCP, CARRQR-B, and CARRQR-F, and ɛ is the machine precision.
28 of 70

Numerical results - a set of 18 matrices Ratios |R(i, i)|/σ_i(R), for QRCP (top plot), CARRQR-B (second plot), and CARRQR-F (third plot). The numbers along the x-axis represent the indices of the test matrices. 29 of 70

Plan Low rank matrix approximation Rank revealing QR factorization LU CRTP: Truncated LU factorization with column and row tournament pivoting Experimental results, LU CRTP Randomized algorithms for low rank approximation 30 of 70

LU versus QR - filled graph G⁺(A) Consider A SPD and A = LL^T.
Given G(A) = (V, E), the filled graph G⁺(A) = (V, E⁺) is defined as follows: there is an edge (i, j) ∈ G⁺(A) iff there is a path from i to j in G(A) going through lower numbered vertices.
G(L + L^T) = G⁺(A), ignoring cancellations.
The definition also holds for directed graphs (LU factorization).
[Example: 9 × 9 sparsity patterns of A and of L + L^T, showing the fill-in predicted by G⁺(A).]
31 of 70

LU versus QR Filled column intersection graph G⁺_∩(A): the graph of the Cholesky factor of A^T A.
G(R) ⊆ G⁺_∩(A).
A^T A can have many more nonzeros than A.
32 of 70

LU versus QR Numerical stability: let L̂ and Û be the computed factors of the block LU factorization. Then
L̂Û = A + E,  ‖E‖_max ≤ c(n) ɛ ( ‖A‖_max + ‖L̂‖_max ‖Û‖_max ).  (11)
For partial pivoting, ‖L‖_max ≤ 1 and ‖U‖_max ≤ 2^{n−1} ‖A‖_max. In practice, ‖U‖_max ≤ n ‖A‖_max.
33 of 70
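A quick MATLAB/Octave check of the backward error in (11) for LU with partial pivoting (a sketch; the constant c(n) is omitted and the max-norm is taken entrywise):

n = 500;
A = randn(n);
[L, U, P] = lu(A);                           % partial pivoting: P*A = L*U
E = L*U - P*A;
bnd = eps * (max(abs(A(:))) + max(abs(L(:))) * max(abs(U(:))));
fprintf('||E||_max = %.2e,  eps*(||A||_max + ||L||_max*||U||_max) = %.2e\n', ...
        max(abs(E(:))), bnd);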

Low rank approximation based on LU factorization Given the desired rank k, the factorization has the form
P_r A P_c = [Ā_11 Ā_12; Ā_21 Ā_22] = [I 0; Ā_21 Ā_11^{-1} I] [Ā_11 Ā_12; 0 S(Ā_11)],  (12)
where A ∈ R^{m×n}, Ā_11 ∈ R^{k×k}, and S(Ā_11) = Ā_22 − Ā_21 Ā_11^{-1} Ā_12.
The rank-k approximation matrix Ã_k is
Ã_k = [I; Ā_21 Ā_11^{-1}] (Ā_11 Ā_12) = [Ā_11; Ā_21] Ā_11^{-1} (Ā_11 Ā_12).  (13)
Ā_11^{-1} is never formed; its factorization is used when Ã_k is applied to a vector.
In randomized algorithms, U = C⁺ A R⁺, where C⁺, R⁺ are Moore-Penrose generalized inverses.
34 of 70
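A minimal MATLAB/Octave sketch of applying Ã_k to a vector without forming Ā_11^{-1} (the function name and the assumption that the k selected row and column indices are already available, e.g. from tournament pivoting, are illustrative):

function y = apply_lowrank_lu(A, rows, cols, x)
  % y = Atilde_k * x, with Atilde_k = A(:, cols) * inv(A(rows, cols)) * A(rows, :), as in (13).
  C = A(:, cols);                  % m x k, plays the role of (A11; A21)
  R = A(rows, :);                  % k x n, plays the role of (A11  A12)
  A11 = A(rows, cols);             % k x k
  [L, U, p] = lu(A11, 'vector');   % LU with partial pivoting of the small block
  t = R * x;
  y = C * (U \ (L \ t(p)));        % apply inv(A11) through its LU factors
end

The approximation could also be formed explicitly as A(:, cols) * (A(rows, cols) \ A(rows, :)), but for large sparse A only the factored application above is used.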

Design space Non-exhaustive list of options for selecting k columns and rows:
1. Select k linearly independent columns of A (call the result B), by using
  1.1 (strong) QRCP / tournament pivoting using QR,
  1.2 LU / tournament pivoting based on LU, with some form of pivoting (column, complete, rook),
  1.3 randomization: premultiply X = ZA where the random matrix Z is short and fat, then pick k rows from X^T, by some method from 2) below,
  1.4 tournament pivoting based on randomized algorithms to select columns at each step.
2. Select k linearly independent rows of B, by using
  2.1 (strong) QRCP / tournament pivoting based on QR, on B^T or on Q^T, the rows of the thin Q factor of B,
  2.2 LU / tournament pivoting based on LU, with pivoting (row, complete, rook) on B,
  2.3 tournament pivoting based on randomized algorithms to select rows.
35 of 70

Select k cols using tournament pivoting Partition A = (A_1, A_2, A_3, A_4). Select k cols from each column block, by using QR with column pivoting.
At each level i of the tree, at each node j do in parallel:
  Let A_{v,i−1}, A_{w,i−1} be the cols selected by the children of node j.
  Select k cols from (A_{v,i−1}, A_{w,i−1}), by using QR with column pivoting.
Return the columns in A_{ji}.
36 of 70

LU_CRTP factorization - one block step One step of truncated block LU based on column/row tournament pivoting on a matrix A of size m × n:
1. Select k columns by using tournament pivoting and permute them in front; the bounds for the singular values are governed by q_1(n, k):
A P_c = Q [R_11 R_12; 0 R_22] = [Q_11 Q_12; Q_21 Q_22] [R_11 R_12; 0 R_22].
2. Select k rows from (Q_11; Q_21) (of size m × k), by applying tournament pivoting to its transpose:
P_r Q = [Q̄_11 Q̄_12; Q̄_21 Q̄_22]
such that ‖Q̄_21 Q̄_11^{-1}‖_max ≤ F_TP; the bounds for the singular values are governed by q_2(m, k).
37 of 70
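A minimal MATLAB/Octave sketch of one such block step (plain QRCP on the transposed thin Q factor replaces the row tournament pivoting of step 2, and tournament_cols is the column-selection sketch given earlier; both names are illustrative):

function [rows, cols] = lu_crtp_step_sketch(A, k, nblocks)
  cols = tournament_cols(A, k, nblocks);   % step 1: k column indices
  [Qk, ~] = qr(A(:, cols), 0);             % thin m x k orthogonal factor of the selected columns
  [~, ~, Pr] = qr(Qk');                    % step 2 (sketch): QRCP on Qk' picks k well-conditioned rows
  [rows, ~] = find(Pr(:, 1:k));
  rows = rows(:)';
end

The selected rows and cols can then be passed to apply_lowrank_lu above.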

Orthogonal matrices Given an orthogonal matrix Q ∈ R^{m×m} and its partitioning
Q = [Q_11 Q_12; Q_21 Q_22],  (14)
the selection of k cols by tournament pivoting from (Q_11; Q_21)^T leads to the factorization
P_r Q = [Q̄_11 Q̄_12; Q̄_21 Q̄_22] = [I 0; Q̄_21 Q̄_11^{-1} I] [Q̄_11 Q̄_12; 0 S(Q̄_11)],  (15)
where S(Q̄_11) = Q̄_22 − Q̄_21 Q̄_11^{-1} Q̄_12 = Q̄_22^{-T}.
38 of 70

Orthogonal matrices (contd) The factorization satisfies:
P_r Q = [Q̄_11 Q̄_12; Q̄_21 Q̄_22] = [I 0; Q̄_21 Q̄_11^{-1} I] [Q̄_11 Q̄_12; 0 S(Q̄_11)],  (16)
ρ_j(Q̄_21 Q̄_11^{-1}) ≤ F_TP,  (17)
1/q_2(m, k) ≤ σ_i(Q̄_11) ≤ 1,  (18)
σ_min(Q̄_11) = σ_min(Q̄_22),  (19)
for all 1 ≤ i ≤ k, 1 ≤ j ≤ m − k, where ρ_j(A) is the 2-norm of the j-th row of A and q_2(m, k) = √(1 + F_TP² (m − k)).
39 of 70

Sketch of the proof
P_r A P_c = [Ā_11 Ā_12; Ā_21 Ā_22] = [I 0; Ā_21 Ā_11^{-1} I] [Ā_11 Ā_12; 0 S(Ā_11)]
          = [I 0; Q̄_21 Q̄_11^{-1} I] [Q̄_11 Q̄_12; 0 S(Q̄_11)] [R_11 R_12; 0 R_22],  (20)
where Q̄_21 Q̄_11^{-1} = Ā_21 Ā_11^{-1} and S(Ā_11) = S(Q̄_11) R_22 = Q̄_22^{-T} R_22.
40 of 70

Sketch of the proof (contd) We obtain
Ā_11 = Q̄_11 R_11,  (21)
S(Ā_11) = S(Q̄_11) R_22 = Q̄_22^{-T} R_22.  (22)
Hence
σ_i(A) ≥ σ_i(Ā_11) ≥ σ_min(Q̄_11) σ_i(R_11) ≥ σ_i(A) / (q_1(n, k) q_2(m, k)).
We also have that
σ_{k+j}(A) ≤ σ_j(S(Ā_11)) = σ_j(S(Q̄_11) R_22) ≤ ‖S(Q̄_11)‖_2 σ_j(R_22) ≤ q_1(n, k) q_2(m, k) σ_{k+j}(A),
where q_1(n, k) = √(1 + F_TP² (n − k)), q_2(m, k) = √(1 + F_TP² (m − k)).
41 of 70

LU_CRTP factorization - bounds if rank = k Given A of size m × n, one step of LU_CRTP computes the decomposition
Ā = P_r A P_c = [Ā_11 Ā_12; Ā_21 Ā_22] = [I 0; Q̄_21 Q̄_11^{-1} I] [Ā_11 Ā_12; 0 S(Ā_11)],  (23)
where Ā_11 is of size k × k and
S(Ā_11) = Ā_22 − Ā_21 Ā_11^{-1} Ā_12 = Ā_22 − Q̄_21 Q̄_11^{-1} Ā_12.  (24)
It satisfies the following properties:
ρ_l(Ā_21 Ā_11^{-1}) = ρ_l(Q̄_21 Q̄_11^{-1}) ≤ F_TP,  (25)
‖S(Ā_11)‖_max ≤ min( (1 + F_TP √k) ‖A‖_max, F_TP √(1 + F_TP² (m − k)) σ_k(A) ),
1 ≤ σ_i(A)/σ_i(Ā_11), σ_j(S(Ā_11))/σ_{k+j}(A) ≤ q(m, n, k),  (26)
for any 1 ≤ l ≤ m − k, 1 ≤ i ≤ k, and 1 ≤ j ≤ min(m, n) − k, where q(m, n, k) = √( (1 + F_TP² (n − k)) (1 + F_TP² (m − k)) ).
42 of 70

Arithmetic complexity - arbitrary sparse matrices Let d_i be the number of nonzeros in column i of A, nnz(A) = Σ_{i=1}^n d_i. A is permuted such that d_1 ≤ ... ≤ d_n. A = [A_00, ..., A_{n/k,0}] is partitioned into n/k blocks of columns.
At the first step of TP, pick k cols from A_1 = [A_00, A_10]:
nnz(A_1) ≤ Σ_{i=1}^{2k} d_i,  flops_QR(A_1) ≤ 8k² Σ_{i=1}^{2k} d_i.
At the second step of TP, pick k cols from A_2:
nnz(A_2) ≤ Σ_{i=k+1}^{3k} d_i,  flops_QR(A_2) ≤ 8k² Σ_{i=k+1}^{3k} d_i.
[Sketch of a sparsity pattern for which these bounds are attained.]
43 of 70

Arithmetic complexity - arbitrary sparse matrices (2) nnz max (TP FT ) 4d n k 2 ( 2k nnz total (TP FT ) 2k d i + 4k i=1 3k ) n d i +... + d i i=k+1 i=n 2k+1 n d i = 4nnz(A)k, i=1 flops(tp FT ) 16nnz(A)k 2, 44 of 70

Tournament pivoting for sparse matrices Arithmetic complexity:
If A has an arbitrary sparsity structure:
  flops(TP_FT) ≤ 16 nnz(A) k²,  flops(TP_BT) ≤ 8 (nnz(A)/P) k² log(n/k).
If G(A^T A) is an n^{1/2}-separable graph:
  flops(TP_FT) ≤ O(nnz(A) k^{3/2}),  flops(TP_BT) ≤ O( (nnz(A)/P) k^{3/2} log(n/k) ).
Randomized algorithm by Clarkson and Woodruff, STOC '13: given an n × n matrix A, it computes LDW^T, where D is k × k, such that with failure probability 1/10,
‖A − LDW^T‖_F ≤ (1 + ɛ) ‖A − A_k‖_F,
where A_k is the best rank-k approximation. The cost of this algorithm is
O(nnz(A)) + n k² ɛ^{−4} log^{O(1)}(n k² ɛ^{−4}) flops.
Tournament pivoting is faster if ɛ ≤ 1/(nnz(A)/n)^{1/4}, or if ɛ = 0.1 and nnz(A)/n ≤ 10^4.
45 of 70

Plan Low rank matrix approximation Rank revealing QR factorization LU CRTP: Truncated LU factorization with column and row tournament pivoting Experimental results, LU CRTP Randomized algorithms for low rank approximation 46 of 70

Numerical results
[Plots: evolution of the singular values and of the values computed by QRCP, LU_CRQRCP, LU_CRTP, and SVD, for exponential (left) and foxgood (right).]
Left: exponent - exponential distribution, σ_1 = 1, σ_i = α^{i−1} (i = 2, ..., n), α = 10^{−1/11} [Bischof, 1991].
Right: foxgood - severely ill-posed test problem of the first-kind Fredholm integral equation used by Fox and Goodwin.
47 of 70

Numerical results Here k = 16 and the factorization is truncated at K = 128 (bars) or K = 240 (red lines). LU_CTP: column tournament pivoting + partial pivoting. All singular values smaller than the machine precision ɛ are replaced by ɛ. The numbers along the x-axis represent the indices of the test matrices. 48 of 70

Results for an image of size 919 × 707
[Figure panels: original image; singular value distribution; rank-38 approximation, SVD; rank-38 approximation, LUPP; rank-38 approximation, LU_CRTP; rank-75 approximation, LU_CRTP.]
49 of 70

Results for an image of size 691 × 505
[Figure panels: original image; singular value distribution; rank-105 approximation, SVD; rank-105 approximation, LUPP; rank-105 approximation, LU_CRTP; rank-209 approximation, LU_CRTP.]
50 of 70

Comparing nnz in the factors L, U versus Q, R

Name (size)          Nnz A(:, 1:K)   Rank K   Nnz QRCP / Nnz LU_CRTP   Nnz LU_CRTP / Nnz LUPP
gemat11 (4929)                1232      128                      2.1                      2.2
                              4895      512                      3.3                      2.6
                              9583     1024                     11.5                      3.2
wang3 (26064)                  896      128                      3.0                      2.1
                              3536      512                      2.9                      2.1
                              7120     1024                      2.9                      1.2
Rfdevice (74104)               633      128                     10.0                      1.1
                              2255      512                     82.6                      0.9
                              4681     1024                    207.2                      0.0
Parab_fem (525825)             896      128                        -                      0.5
                              3584      512                        -                      0.3
                              7168     1024                        -                      0.2
Mac_econ (206500)              384      128                        -                      0.3
                              1535      512                        -                      0.3
                              5970     1024                        -                      0.2
51 of 70

Performance results Selection of 256 columns by tournament pivoting.
Edison, Cray XC30 (NERSC): 2 × 12-core Intel Ivy Bridge (2.4 GHz).
Tournament pivoting uses SPQR (T. Davis) + dgeqp3 (LAPACK); times in seconds.
Matrices (dimensions, and dimension at the leaves on 32 processes):
  Parab_fem: 525825 × 525825, at the leaves 525825 × 16432
  Mac_econ: 206500 × 206500, at the leaves 206500 × 6453
Time for 2k cols on 32 procs, and time at the leaves (SPQR + dgeqp3):
  Parab_fem: 0.26, and 0.26 + 1129
  Mac_econ: 0.46, and 25.4 + 510
Time vs. number of MPI processes:
  Parab_fem (16, 32, 64, 128, 256, 512, 1024 procs): 46.7, 24.5, 13.7, 8.4, 5.9, 4.8, 4.4
  Mac_econ: 132.7, 86.3, 111.4, 59.6, 27.2
52 of 70

More details on CA deterministic algorithms [Demmel et al., 2015] Communication avoiding rank revealing QR factorization with column pivoting Demmel, Grigori, Gu, Xiang, SIAM J. Matrix Analysis and Applications, 2015. Low rank approximation of a sparse matrix based on LU factorization with column and row tournament pivoting, with S. Cayrols and J. Demmel, Inria TR 8910. 53 of 70

References (1)
Bischof, C. H. (1991). A parallel QR factorization algorithm with controlled local pivoting. SIAM J. Sci. Stat. Comput., 12:36-57.
Businger, P. A. and Golub, G. H. (1965). Linear least squares solutions by Householder transformations. Numer. Math., 7:269-276.
Demmel, J., Grigori, L., Gu, M., and Xiang, H. (2015). Communication-avoiding rank-revealing QR decomposition. SIAM Journal on Matrix Analysis and its Applications, 36(1):55-89.
Eckart, C. and Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1:211-218.
Eisenstat, S. C. and Ipsen, I. C. F. (1995). Relative perturbation techniques for singular value problems. SIAM J. Numer. Anal., 32(6):1972-1988.
Gu, M. and Eisenstat, S. C. (1996). Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM J. Sci. Comput., 17(4):848-869.
Hansen, P. C. (2007). Regularization tools: A Matlab package for analysis and solution of discrete ill-posed problems. Numerical Algorithms, (46):189-194.
54 of 70

Results used in the proofs
Interlacing property of singular values [Golub, Van Loan, 4th edition, page 487]: Let A = [a_1 ... a_n] be a column partitioning of an m × n matrix with m ≥ n. If A_r = [a_1 ... a_r], then for r = 1 : n − 1,
σ_1(A_{r+1}) ≥ σ_1(A_r) ≥ σ_2(A_{r+1}) ≥ ... ≥ σ_r(A_{r+1}) ≥ σ_r(A_r) ≥ σ_{r+1}(A_{r+1}).
Given an n × n matrix B and an n × k matrix C, then ([Eisenstat and Ipsen, 1995], p. 1977)
σ_min(B) σ_j(C) ≤ σ_j(BC) ≤ σ_max(B) σ_j(C),  j = 1, ..., k.
55 of 70