LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version


LU Factorization with Panel Rank Revealing Pivoting and its Communication Avoiding Version. Amal Khabou. Advisor: Laura Grigori. Université Paris Sud 11, INRIA Saclay, France. SIAMPP12, February 17, 2012.

Collaborators: Jim Demmel (UC Berkeley), Ming Gu (UC Berkeley).

Plan
- Motivation: just work, less talk
- Communication Avoiding LU (CALU)
- LU with Panel Rank Revealing Pivoting (LU_PRRP)
- Communication Avoiding LU_PRRP (CALU_PRRP)
- Conclusions and future work

Lower bounds for all direct linear algebra

Let M = local/fast memory size. General lower bounds [Ballard et al., 2011]:

# words moved = Ω(# flops / M^{1/2}),    # messages = Ω(# flops / M^{3/2})

Goal: reorganize linear algebra so as to
- minimize # words moved,
- minimize # messages,
- preserve numerical stability (better: improve it),
- not increase the flop count (too much).
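To make the bounds concrete, here is a small illustration (not from the slides) that plugs the LU flop count 2n³/3 into these formulas; the problem size n, processor count P, and memory M = n²/P are hypothetical values chosen only for the example.

```python
import math

def communication_lower_bounds(flops, M):
    """General lower bounds for direct linear algebra [Ballard et al., 2011]:
    words moved = Omega(flops / sqrt(M)), messages = Omega(flops / M**1.5)."""
    return flops / math.sqrt(M), flops / M**1.5

# Hypothetical example: an n x n LU spread over P processors,
# each holding M = n^2 / P words of fast memory.
n, P = 8192, 64
M = n**2 / P
flops_per_proc = (2.0 / 3.0) * n**3 / P   # LU flops divided evenly over the processors

words, messages = communication_lower_bounds(flops_per_proc, M)
print(f"per-processor lower bounds: {words:.2e} words, {messages:.2e} messages")
```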

LU factorization (as in ScaLAPACK pdgetrf)

LU factorization on a P = P_r × P_c grid of processors.

For ib = 1 to n−1 step b, with A^(ib) = A(ib:n, ib:n):

1. Compute the panel factorization (# messages = O(n log2 P_r)): find the pivot in each column, swap rows.
2. Apply all row permutations (# messages = O(n/b (log2 P_r + log2 P_c))): broadcast the pivot information along the rows; swap rows at left and right.
3. Compute the block row of U (# messages = O(n/b log2 P_c)): broadcast right the diagonal block of L of the current panel.
4. Update the trailing matrix (# messages = O(n/b (log2 P_r + log2 P_c))): broadcast right the block column of L; broadcast down the block row of U.
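The following is a minimal sequential numpy sketch (my own illustration, not ScaLAPACK code) of the blocked right-looking LU with partial pivoting whose four steps are listed above; it ignores the process grid and all communication.

```python
import numpy as np

def blocked_lu_partial_pivoting(A, b):
    """Sequential sketch of the blocked right-looking LU with partial pivoting
    that pdgetrf distributes over a P_r x P_c process grid.
    Returns a row permutation piv and factors L, U with A[piv] = L @ U."""
    A = np.asarray(A, dtype=float).copy()
    n = A.shape[0]
    piv = np.arange(n)
    for ib in range(0, n, b):
        e = min(ib + b, n)
        # Steps 1-2: panel factorization with partial pivoting; in this sequential
        # sketch each row swap is applied to the whole rows immediately.
        for k in range(ib, e):
            p = k + np.argmax(np.abs(A[k:, k]))
            if p != k:
                A[[k, p], :] = A[[p, k], :]
                piv[[k, p]] = piv[[p, k]]
            A[k+1:, k] /= A[k, k]                                 # multipliers (column of L)
            A[k+1:, k+1:e] -= np.outer(A[k+1:, k], A[k, k+1:e])   # update inside the panel only
        if e < n:
            # Step 3: block row of U, i.e. solve L11 * U12 = A12.
            L11 = np.tril(A[ib:e, ib:e], -1) + np.eye(e - ib)
            A[ib:e, e:] = np.linalg.solve(L11, A[ib:e, e:])
            # Step 4: trailing matrix update A22 <- A22 - L21 * U12.
            A[e:, e:] -= A[e:, ib:e] @ A[ib:e, e:]
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return piv, L, U

# Quick check on a small random matrix.
A = np.random.randn(12, 12)
piv, L, U = blocked_lu_partial_pivoting(A, b=4)
assert np.allclose(A[piv], L @ U)
```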

LU factorization (ScaLAPACK): upper bounds

2D algorithm, P = P_r × P_c, M = O(n²/P). Lower bounds [Ballard et al., 2011]:

# words moved = Ω(n² / P^{1/2}),    # messages = Ω(P^{1/2})

LU with partial pivoting (ScaLAPACK), with P = √P × √P and b = n/√P (upper bound):
- factor exceeding the lower bound for # words moved: log P;
- factor exceeding the lower bound for # messages: (n/√P) log P.

The lower bounds are attained by CALU, an LU factorization with tournament pivoting [Grigori et al., 2011].

Stability of the LU factorization

Consider the growth factor [Wilkinson, 1961]

g_W = max_{i,j,k} |a_{i,j}^{(k)}| / max_{i,j} |a_{i,j}|,

where a_{i,j}^{(k)} denotes the entry in position (i, j) obtained after k steps of elimination.

Worst case growth factor with partial pivoting: g_W ≤ 2^{n−1}, and this upper bound is attained for the Wilkinson matrix.
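As an illustration (my own, not from the slides), the following numpy sketch runs GEPP while tracking the largest intermediate entry, and evaluates g_W on the classic worst-case matrix for partial pivoting (the "Wilkinson matrix" of the slides), for which the bound 2^{n−1} is attained.

```python
import numpy as np

def growth_factor_gepp(A):
    """Run Gaussian elimination with partial pivoting on A and return
    Wilkinson's growth factor g_W = max_{i,j,k} |a_ij^(k)| / max_{i,j} |a_ij|."""
    A = np.asarray(A, dtype=float).copy()
    n = A.shape[0]
    max_orig = np.abs(A).max()
    max_any = max_orig
    for k in range(n - 1):
        p = k + np.argmax(np.abs(A[k:, k]))       # partial pivoting
        A[[k, p], :] = A[[p, k], :]
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
        max_any = max(max_any, np.abs(A[k+1:, k+1:]).max())
    return max_any / max_orig

# Worst-case matrix for GEPP: 1 on the diagonal and in the last column, -1 below the diagonal.
n = 24
W = np.eye(n) - np.tril(np.ones((n, n)), -1)
W[:, -1] = 1.0
print(growth_factor_gepp(W), 2.0 ** (n - 1))   # both values equal 2^(n-1) = 8388608
```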

CALU: tournament pivoting, the overall idea

Consider a block algorithm that factors an n × n matrix A by traversing panels of b columns:

A = [ A11  A12 ; A21  A22 ],   where the current panel is W = [ A11 ; A21 ].

For each panel W:
- find, at low communication cost, good pivots for the LU factorization of the panel W and return a permutation matrix Π (preprocessing step);
- apply Π to the input matrix A;
- compute LU with no pivoting of ΠW, and update the trailing matrix.

Benefit: b times fewer messages overall for the panel factorization, hence faster.
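Here is a minimal sequential sketch (my own illustration) of the tournament used in this preprocessing step: each block proposes b candidate rows via GEPP, and candidate sets are merged pairwise up a binary tree; the helper names and the selector argument are made up for the example.

```python
import numpy as np

def gepp_pivot_rows(block, b):
    """Indices (within `block`) of the b rows that GEPP would pick as pivots
    for this tall block of b columns."""
    A = np.asarray(block, dtype=float).copy()
    rows = np.arange(A.shape[0])
    for k in range(b):
        p = k + np.argmax(np.abs(A[k:, k]))
        A[[k, p], :] = A[[p, k], :]
        rows[[k, p]] = rows[[p, k]]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k] / A[k, k], A[k, k+1:])
    return rows[:b]

def tournament_pivoting(W, b, P, select_rows=gepp_pivot_rows):
    """Tournament over a panel W split into P row blocks (binary reduction tree):
    each leaf proposes b candidate rows, pairs of candidate sets are merged and
    reduced again, until b winning pivot rows remain (sequential sketch)."""
    blocks = np.array_split(np.arange(W.shape[0]), P)
    candidates = [blk[select_rows(W[blk], b)] for blk in blocks]
    while len(candidates) > 1:
        merged = []
        for i in range(0, len(candidates), 2):
            pair = np.concatenate(candidates[i:i + 2])
            merged.append(pair[select_rows(W[pair], b)])
        candidates = merged
    return candidates[0]          # global indices of the b selected pivot rows

# Usage: pick pivots for a 64 x 4 panel split across 4 "processors".
W = np.random.randn(64, 4)
print(tournament_pivoting(W, b=4, P=4))
```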

Stability of CALU

Worst case analysis of the growth factor for a reduction tree of height H, CALU vs GEPP:
- matrix of size m × (b+1): g_W upper bound is 2^{b(H+1)} for TSLU(b,H) vs 2^b for GEPP;
- matrix of size m × n: g_W upper bound is 2^{n(H+1)−1} for CALU vs 2^{n−1} for GEPP.

The growth factor upper bound of CALU is worse than that of GEPP, but CALU is still stable in practice [Grigori et al., 2011].

Motivation

CALU does an optimal amount of communication, and CALU is as stable as GEPP in practice [Grigori et al., 2011]. However, CALU is worse than GEPP in terms of its theoretical growth factor upper bound. Our goal: improve the stability of both GEPP and CALU.

Strong Rank Revealing QR

To improve the stability of LU, we use a pivoting strategy based on the strong RRQR factorization

A^T Π = Q R = Q [ R11  R12 ; 0  R22 ].

This factorization satisfies three conditions [Gu and Eisenstat, 1996]:
- every singular value of R11 is large;
- every singular value of R22 is small;
- every element of R11^{-1} R12 is bounded in absolute value by a given threshold τ.

Strong Rank Revealing QR of the panel W

Result: W^T Π = Q R with max_{i,j} |(R11^{-1} R12)_{i,j}| ≤ τ.

    Compute W^T Π = Q R (QR with column pivoting);
    while there exist i and j such that |(R11^{-1} R12)_{i,j}| > τ do
        compute the QR factorization of R Π_{i,j} (QR updates);
    end

After each swap the determinant of R11 increases by at least a factor τ, and there are only finitely many permutations, so the loop terminates. Strong RRQR does O(m b²) flops in the worst case.
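A minimal scipy-based sketch (my own, not the authors' code) of this pivot selection for a panel W: QR with column pivoting of W^T plays the role of the RRQR, and the τ bound on R11^{-1} R12 is simply checked rather than enforced by extra swaps (the experiments slide further below notes that column pivoting already attains the bound in practice).

```python
import numpy as np
import scipy.linalg as la

def panel_pivots_rrqr(W, tau=2.0):
    """Select b pivot rows of the tall panel W (m x b) from a rank revealing QR
    of W^T.  Returns the pivot row indices and L21 = (R11^{-1} R12)^T, whose
    entries should satisfy |L21_ij| <= tau."""
    m, b = W.shape
    _, R, perm = la.qr(W.T, mode='economic', pivoting=True)      # W^T Pi = Q R
    R11, R12 = R[:, :b], R[:, b:]
    L21 = la.solve_triangular(R11, R12, lower=False).T            # (m-b) x b
    if L21.size and np.abs(L21).max() > tau:
        # A full strong RRQR would now swap columns of W^T Pi (QR updates)
        # until every |(R11^{-1} R12)_ij| <= tau; omitted in this sketch.
        print("warning: bound tau exceeded, additional swaps would be needed")
    return perm[:b], L21

W = np.random.randn(200, 8)
pivot_rows, L21 = panel_pivots_rrqr(W)
print(pivot_rows, float(np.abs(L21).max()))
```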

LU_PRRP: LU with Panel Rank Revealing Pivoting

Consider a block algorithm that factors an n × n matrix A by traversing panels of b columns:

A = [ A11  A12 ; A21  A22 ],   where W = [ A11 ; A21 ].

For each panel W:
- perform the strong RRQR factorization of the transpose of the panel W, finding a permutation matrix Π such that W^T Π = Q [ R11  R12 ] with L21 = R12^T (R11^{-1})^T and |L21|_{i,j} ≤ τ;
- apply Π^T to the input matrix A;
- update the trailing matrix:

  Â = Π^T A = [ I_b  0 ; L21  I_{m−b} ] · [ Â11  Â12 ; 0  Â22^s ],   where Â22^s = Â22 − L21 Â12.

LU_PRRP: LU with Panel Rank Revealing Pivoting (continued)

- Perform an additional GEPP on the b × b diagonal block Â11.
- Update the corresponding block Â12:

  Â = [ I_b  0 ; L21  I_{m−b} ] · [ L11  0 ; 0  I_{m−b} ] · [ U11  U12 ; 0  Â22^s ].

LU_PRRP has the same memory requirements as the standard LU decomposition. The total cost is about (2/3) n³ + O(n² b) flops, which is only O(n² b) flops more than GEPP.
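Putting the pieces together, here is a minimal numpy/scipy sketch (my own illustration, not the authors' implementation) of a single LU_PRRP panel step: select pivot rows from an RRQR of W^T, form L21, update the Schur complement, and finish with GEPP on the b × b diagonal block.

```python
import numpy as np
import scipy.linalg as la

def lu_prrp_panel_step(A, b, tau=2.0):
    """One panel step of LU_PRRP on the leading b columns of A (sketch).
    Returns the row-permuted matrix with L21 stored below the diagonal block
    and the trailing block replaced by the Schur complement A22 - L21 @ A12,
    plus the GEPP factors of the b x b diagonal block."""
    W = A[:, :b]
    # Rank revealing QR of W^T (column pivoting, as used in practice).
    _, R, perm = la.qr(W.T, mode='economic', pivoting=True)
    L21 = la.solve_triangular(R[:, :b], R[:, b:], lower=False).T   # |entries| <= tau in practice
    order = np.concatenate([perm[:b], perm[b:]])                    # pivot rows first: Ahat = Pi^T A
    Ahat = A[order].copy()
    Ahat[b:, b:] -= L21 @ Ahat[:b, b:]                              # A22^s = A22 - L21 @ A12
    Ahat[b:, :b] = L21                                              # store L21 in place (A21 = L21 @ A11)
    lu11, piv11 = la.lu_factor(Ahat[:b, :b])                        # additional GEPP on A11
    return Ahat, order, (lu11, piv11)

A = np.random.randn(16, 16)
Ahat, order, _ = lu_prrp_panel_step(A, b=4)
print(float(np.abs(Ahat[4:, :4]).max()))   # max |L21| entry, small (<= tau) in practice
```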

Stability of LU_PRRP (1/4)

Worst case analysis of the growth factor, LU_PRRP vs GEPP:
- matrix of size m × (b+1): g_W upper bound is 1 + τ b for LU_PRRP vs 2^b for GEPP;
- matrix of size m × n: g_W upper bound is (1 + τ b)^{n/b} for LU_PRRP vs 2^{n−1} for GEPP.

LU_PRRP is more stable than GEPP in terms of the theoretical growth factor upper bound.

Stability of LU_PRRP (2/4)

The upper bound (1 + τ b)^{n/b} for different panel sizes, with τ = 2:

b = 8:   (1.425)^n
b = 16:  (1.244)^n
b = 32:  (1.139)^n
b = 64:  (1.078)^n
b = 128: (1.044)^n

For all of these panel sizes, LU_PRRP is more stable than GEPP.
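The bases in this table are just (1 + τ b)^{1/b} for τ = 2; a quick check (my own):

```python
# Base of the LU_PRRP bound: (1 + tau*b)^(n/b) = ((1 + tau*b)^(1/b))^n, with tau = 2.
tau = 2
for b in (8, 16, 32, 64, 128):
    print(b, round((1 + tau * b) ** (1 / b), 3))
# -> 1.425, 1.244, 1.139, 1.079, 1.044 (cf. the table above)
```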

Stability of LU_PRRP (3/4)

LU_PRRP is more resistant to pathological matrices on which GEPP fails. First example: the Wilkinson matrix (n = 2048):

b      g_W    ||U||_1     ||U^{-1}||_1   ||L||_1   ||L^{-1}||_1   ||PA − LU||_F / ||A||_F
128    1      1.02e+03    6.09e+00       1         1.95e+00       4.25e-20
64     1      1.02e+03    6.09e+00       1         1.95e+00       5.29e-20
32     1      1.02e+03    6.09e+00       1         1.95e+00       8.63e-20
16     1      1.02e+03    6.09e+00       1         1.95e+00       1.13e-19
8      1      1.02e+03    6.09e+00       1         1.95e+00       1.57e-19

Second example: the Foster matrix, a concrete physical example that arises from using the quadrature method to solve a certain Volterra integral equation [Foster, 1994].

Stability of LU_PRRP (4/4)

Results for the Foster matrix (n = 2048):

b      g_W     ||U||_1     ||U^{-1}||_1   ||L||_1     ||L^{-1}||_1   ||PA − LU||_F / ||A||_F
128    2.66    1.28e+03    1.87e+00       1.92e+03    1.92e+03       4.67e-16
64     2.66    1.19e+03    1.87e+00       1.98e+03    1.79e+03       2.64e-16
32     2.66    4.33e+01    1.87e+00       2.01e+03    3.30e+01       2.83e-16
16     2.66    1.35e+03    1.87e+00       2.03e+03    2.03e+00       2.38e-16
8      2.66    1.35e+03    1.87e+00       2.04e+03    2.02e+00       5.36e-17

Third example: the Wright matrix, which arises from two-point boundary value problems solved with the multiple shooting algorithm [Wright, 1993] (n = 2048):

b      g_W    ||U||_1     ||U^{-1}||_1   ||L||_1     ||L^{-1}||_1   ||PA − LU||_F / ||A||_F
128    1      3.25e+00    8.00e+00       2.00e+00    2.00e+00       4.08e-17
64     1      3.25e+00    8.00e+00       2.00e+00    2.00e+00       4.08e-17
32     1      3.25e+00    8.00e+00       2.05e+00    2.07e+00       6.65e-17
16     1      3.25e+00    8.00e+00       2.32e+00    2.44e+00       1.04e-16
8      1      3.40e+00    8.00e+00       2.62e+00    3.65e+00       1.26e-16

CALU_PRRP: Communication-Avoiding LU_PRRP

Consider a block algorithm that factors an n × n matrix A by traversing panels of b columns:

A = [ A11  A12 ; A21  A22 ],   where W = [ A11 ; A21 ].

For each panel W:
- find, at low communication cost, b pivot rows for the QR factorization of W^T and return a permutation matrix Π (preprocessing step);
- apply Π^T to the input matrix A;
- compute QR with no pivoting of W^T Π, and update the trailing matrix as in the LU_PRRP algorithm;
- perform GEPP on the b × b diagonal block and update the corresponding block row (to obtain the LU decomposition).

CALU_PRRP: preprocessing step

Consider the panel W, partitioned over P = 4 processors as W = [ A00 ; A10 ; A20 ; A30 ]. The panel factorization uses a binary reduction tree: A00 and A10 are reduced to A01, A20 and A30 are reduced to A11, and A01 and A11 are reduced to A02.

1. Compute the strong RRQR factorization of the transpose of each block A_i0 of the panel W, finding permutations Π_i0:

   A00^T Π00 = Q00 R00,   A10^T Π10 = Q10 R10,   A20^T Π20 = Q20 R20,   A30^T Π30 = Q30 R30.

2. Perform log P levels of strong RRQR factorizations of 2b × b blocks, finding permutations Π01, Π11, and Π02:

   A01^T Π01 = [ (A00^T Π00)(:, 1:b) ; (A10^T Π10)(:, 1:b) ]^T Π01 = Q01 R01,
   A11^T Π11 = [ (A20^T Π20)(:, 1:b) ; (A30^T Π30)(:, 1:b) ]^T Π11 = Q11 R11,
   A02^T Π02 = [ (A01^T Π01)(:, 1:b) ; (A11^T Π11)(:, 1:b) ]^T Π02 = Q02 R02.
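By analogy with the CALU tournament sketched earlier, here is a minimal sequential sketch (my own illustration) of this preprocessing step: the same binary reduction, but each node selects its b candidate rows with QR with column pivoting on the block's transpose instead of GEPP.

```python
import numpy as np
import scipy.linalg as la

def rrqr_candidate_rows(block, b):
    """Indices (within `block`) of b rows selected by QR with column pivoting
    applied to block^T -- the rank revealing selection used at each tree node."""
    _, _, perm = la.qr(block.T, mode='economic', pivoting=True)
    return perm[:b]

def calu_prrp_preprocessing(W, b, P=4):
    """Binary reduction over P row blocks of the panel W (A00..A30 in the slide):
    leaves propose b candidate rows each, then pairs of candidate sets are
    reduced again (A01, A11, finally A02) until b pivot rows remain."""
    blocks = np.array_split(np.arange(W.shape[0]), P)
    candidates = [blk[rrqr_candidate_rows(W[blk], b)] for blk in blocks]
    while len(candidates) > 1:
        merged = []
        for i in range(0, len(candidates), 2):
            pair = np.concatenate(candidates[i:i + 2])
            merged.append(pair[rrqr_candidate_rows(W[pair], b)])
        candidates = merged
    return candidates[0]    # global indices of the b pivot rows defining Pi

W = np.random.randn(64, 4)
print(calu_prrp_preprocessing(W, b=4, P=4))
```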

Stability of CALU_PRRP

Worst case analysis of the growth factor for a reduction tree of height H, CALU_PRRP vs CALU:
- matrix of size m × (b+1): g_W upper bound is (1 + τ b)^{H+1} for TSLU_PRRP(b,H) vs 2^{b(H+1)} for TSLU(b,H);
- matrix of size m × n: g_W upper bound is (1 + τ b)^{(n/b)(H+1)−1} for CALU_PRRP vs 2^{n(H+1)−1} for CALU.

CALU_PRRP is more stable than CALU in terms of the theoretical growth factor upper bound. For the binary reduction tree, it is also more stable than GEPP when b ≥ log(τ b) · log P (with b = n/√P).
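To see what this condition means numerically, here is a tiny check (my own, with made-up n, b, and P) comparing the base-2 logarithms of the three upper bounds for a binary reduction tree of height H = log2 P:

```python
import math

def log2_growth_bounds(n, b, P, tau=2.0):
    """log2 of the worst-case growth factor upper bounds for a binary tree of
    height H = log2(P): CALU_PRRP (1+tau*b)^((n/b)(H+1)-1), CALU 2^(n(H+1)-1),
    and GEPP 2^(n-1)."""
    H = math.log2(P)
    calu_prrp = ((n / b) * (H + 1) - 1) * math.log2(1 + tau * b)
    calu = n * (H + 1) - 1
    gepp = n - 1
    return calu_prrp, calu, gepp

n, P = 4096, 64
b = n // int(math.sqrt(P))                 # b = n / sqrt(P) = 512 >= log(tau*b) * log(P) ~ 60
print([round(x) for x in log2_growth_bounds(n, b, P)])
# -> roughly [550, 28671, 4095]: the CALU_PRRP bound is far below CALU's and GEPP's here.
```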

CALU_PRRP: experimental results

CALU_PRRP is as stable as GEPP in practice [Khabou et al., 2012]. QR with column pivoting is sufficient to attain the bound τ in practice.

Overall summary

LU_PRRP:
- more stable than GEPP in terms of the growth factor upper bound;
- more resistant to pathological matrices;
- same memory requirements as the standard LU.

CALU_PRRP:
- an optimal amount of communication, at the cost of some redundant computation;
- more stable than CALU in terms of the growth factor upper bound;
- more stable than GEPP in terms of the growth factor upper bound, under certain conditions.

Future work

- Estimate the performance of CALU_PRRP on parallel machines based on multicore processors, and compare it with the performance of CALU.
- Design a communication avoiding algorithm with smaller bounds on the growth factor than those of GEPP in general.

References

Ballard, G., Demmel, J., Holtz, O., and Schwartz, O. (2011). Minimizing communication in linear algebra. SIAM J. Matrix Anal. Appl., 32:866–901.

Foster, L. V. (1994). Gaussian elimination with partial pivoting can fail in practice. SIAM J. Matrix Anal. Appl., 15:1354–1362.

Grigori, L., Demmel, J., and Xiang, H. (2011). CALU: a communication optimal LU factorization algorithm. SIAM J. Matrix Anal. Appl.

Gu, M. and Eisenstat, S. C. (1996). Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM J. Sci. Comput., 17:848–869.

Khabou, A., Demmel, J., Grigori, L., and Gu, M. (2012). LU factorization with panel rank revealing pivoting and its communication avoiding version. Technical Report UCB/EECS-2012-15, EECS Department, University of California, Berkeley.

Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion. J. Assoc. Comput. Mach., 8:281–330.

Wright, S. J. (1993). A collection of problems for which Gaussian elimination with partial pivoting is unstable. SIAM J. Sci. Statist. Comput., 14:231–238.