Communication-avoiding LU and QR factorizations for multicore architectures

Simplice Donfack, INRIA Saclay
Joint work with Laura Grigori (INRIA Saclay) and Alok Kumar Gupta (BCCS, Norway)
16th April 2010

1 Introduction
2 CALU and CAQR factorization
3 Multithreaded CALU and CAQR
4 Experimental section
5 Conclusion


Introduction
- Architectural trends show that the cost of communication keeps increasing relative to the time needed to perform arithmetic operations.
- This has motivated the design of communication-avoiding algorithms, which minimize communication.
- The first results are CAQR [Demmel, Grigori, Hoemmen, Langou 08] and CALU [Grigori, Demmel, Xiang 08], both implemented for distributed memory.
- Our goal is to design multithreaded QR and LU factorizations for multicore processors, based on communication-avoiding algorithms.

LU factorization with partial pivoting
Factorization on a $P_r \times P_c$ grid of processors, as implemented in ScaLAPACK:

For ib = 1 to n-1 step b
  A(ib) = A(ib:n, ib:n)
  1. Compute panel factorization (pdgetf2): $O(n \log_2 P_r)$
     - find pivot in each column, swap rows
  2. Apply all row permutations (pdlaswp): $O(\frac{n}{b}(\log_2 P_c + \log_2 P_r))$
     - broadcast pivot information along the rows
     - swap rows at left and right
  3. Compute block row of U (pdtrsm): $O(\frac{n}{b} \log_2 P_c)$
     - broadcast right diagonal block of L of current panel
  4. Update trailing matrix (pdgemm): $O(\frac{n}{b}(\log_2 P_c + \log_2 P_r))$
     - broadcast right block column of L
     - broadcast down block row of U

Pivoting requires communication among processors on distributed memory, and synchronisation between threads on multicores.
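The four steps of this loop can be sketched sequentially as follows. This is a minimal single-process illustration in pure Python, not the ScaLAPACK implementation: there is no processor grid and no broadcast, and the function name `blocked_lu` and the list-of-rows storage are assumptions made for this sketch. The comments only indicate which ScaLAPACK routine plays the corresponding role.

```python
def blocked_lu(A, b):
    """Right-looking blocked LU with partial pivoting, in place.

    A is a square matrix stored as a list of rows (lists of floats).
    On return, the strict lower triangle of A holds L (unit diagonal
    implied) and the upper triangle holds U. The returned list perm
    satisfies: original row perm[i] ended up in position i.
    """
    n = len(A)
    perm = list(range(n))
    for ib in range(0, n, b):
        jb = min(ib + b, n)  # one past the last column of the panel
        # Step 1: panel factorization (role of pdgetf2).
        for k in range(ib, jb):
            # Pivot search: largest entry in column k, on or below the diagonal.
            p = max(range(k, n), key=lambda i: abs(A[i][k]))
            if A[p][k] == 0.0:
                raise ValueError("matrix is singular")
            # Step 2: swap whole rows, i.e. left and right of the panel
            # (role of pdlaswp, applied immediately in this sequential sketch).
            A[k], A[p] = A[p], A[k]
            perm[k], perm[p] = perm[p], perm[k]
            for i in range(k + 1, n):
                A[i][k] /= A[k][k]  # multiplier, an entry of L
                for j in range(k + 1, jb):
                    A[i][j] -= A[i][k] * A[k][j]
        # Step 3: block row of U (role of pdtrsm): solve L11 * U12 = A12.
        for k in range(ib, jb):
            for i in range(k + 1, jb):
                for j in range(jb, n):
                    A[i][j] -= A[i][k] * A[k][j]
        # Step 4: trailing matrix update (role of pdgemm): A22 -= L21 * U12.
        for i in range(jb, n):
            for k in range(ib, jb):
                lik = A[i][k]
                for j in range(jb, n):
                    A[i][j] -= lik * A[k][j]
    return perm
```

In this sketch every column of the panel needs its own pivot search over all remaining rows. On a processor grid, each of those searches becomes a reduction along a processor column, which is the source of the $O(n \log_2 P_r)$ term in step 1.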

CALU and CAQR approach
Communication-avoiding algorithms [Demmel, Grigori, Hoemmen, Langou, Xiang 08]:
- Decrease the communication required for pivoting and overcome the latency bottleneck of classic algorithms by performing the factorization of a block column (a tall and skinny matrix) as a reduction operation, at the cost of some redundant computations.
- They are communication optimal in terms of both latency and bandwidth.
- They lead to significant speedups on distributed memory computers.

Goal
Combine the main ideas that reduce communication in CALU and CAQR with:
- appropriate blocking
- task identification
- dynamic scheduling
The reduction operation used for a block-column factorization is based on a binary tree with asynchronous tasks, which:
- reduces synchronisation between threads (only $O(\log_2 P_r)$)
- avoids bus contention
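To make the $O(\log_2 P_r)$ synchronisation count concrete, here is a small sketch of a pairwise (binary tree) reduction that also reports how many levels it needed. The helper name `tree_reduce` is hypothetical, and the levels run one after another here, whereas the slides describe asynchronous tasks.

```python
def tree_reduce(items, combine):
    """Reduce items pairwise along a binary tree.

    Returns (result, levels), where levels is the number of tree levels
    traversed, i.e. ceil(log2(len(items))) for a non-empty input.
    """
    level = list(items)
    if not level:
        raise ValueError("nothing to reduce")
    levels = 0
    while len(level) > 1:
        # Combine neighbours two by two; all pairs of one level are independent.
        nxt = [combine(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired item moves up unchanged
        level = nxt
        levels += 1
    return level[0], levels
```

For example, `tree_reduce(range(8), max)` combines eight leaves in three levels, where a left-to-right sweep over the same leaves would need seven dependent steps.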


CAQR
Figure: Parallel TSQR
- Each panel factorization is computed as a reduction operation in which a QR factorization is performed at each node.
- The reduction tree is chosen depending on the underlying architecture.
- For a binary tree, $\log_2 P_r$ steps are used.
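The reduction described here can be sketched in pure Python as follows. This is only an illustration under several assumptions: it computes the R factor alone and never forms Q, it uses modified Gram-Schmidt for the small QR at each node (a production code would more likely use Householder reflections), it assumes every block has at least as many rows as columns and full column rank, and the names `qr_r` and `tsqr_r` are hypothetical.

```python
import math


def qr_r(A):
    """R factor of an m x n matrix (m >= n, full column rank assumed),
    computed with modified Gram-Schmidt. The diagonal of R is positive."""
    m, n = len(A), len(A[0])
    V = [row[:] for row in A]  # working copy, columns become orthonormal
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = math.sqrt(sum(V[i][j] ** 2 for i in range(m)))
        for i in range(m):
            V[i][j] /= R[j][j]
        for k in range(j + 1, n):
            R[j][k] = sum(V[i][j] * V[i][k] for i in range(m))
            for i in range(m):
                V[i][k] -= R[j][k] * V[i][j]
    return R


def tsqr_r(A, P):
    """R factor of a tall and skinny matrix via a binary reduction tree.

    The rows of A are split into P contiguous blocks (len(A) must be
    divisible by P). Each leaf factors its block; each inner node stacks
    the two R factors of its children and factors the stacked matrix.
    """
    m = len(A)
    if m % P:
        raise ValueError("number of rows must be divisible by P")
    rows = m // P
    level = [qr_r(A[p * rows:(p + 1) * rows]) for p in range(P)]
    while len(level) > 1:
        nxt = [qr_r(level[i] + level[i + 1])  # stack two R factors
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0]
```

Because the R factor with positive diagonal is unique for a full-rank matrix, `tsqr_r(A, P)` should agree with `qr_r(A)` up to rounding, which gives a simple way to check the sketch.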

CAQR
- Update the submatrix using the tree in $\log_2 P_r$ steps.
Figure: The update of the trailing submatrix is triggered by the reduction tree used during panel factorization

CALU [Grigori, Demmel, Xiang 08]
The panel factorization is performed in two steps:
- A preprocessing step identifies good pivot rows at low communication cost.
- The pivot rows are permuted into the first positions of the panel, and LU without pivoting of the panel is performed.
Figure: Stable parallel panel factorization
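The preprocessing step can be illustrated with the sketch below, which selects b pivot rows from a panel through a binary tree: each leaf runs Gaussian elimination with partial pivoting on its own block of rows and keeps the b rows chosen as pivots, and each inner node repeats the selection on the 2b candidate rows of its children. This is a sequential pure-Python illustration; the function names, the list-of-rows storage, and the contiguous split of the panel into P blocks are assumptions of the sketch, not details taken from the slides.

```python
def select_pivots(cands, b):
    """Pick up to b pivot rows among the candidates with GEPP.

    cands is a list of (global_row_index, row_values). Elimination runs
    on a copy; the returned candidates carry their original row values.
    """
    originals = dict(cands)
    work = [(idx, row[:]) for idx, row in cands]
    width = len(work[0][1])
    count = min(b, len(work), width)
    for k in range(count):
        # Partial pivoting: largest entry of column k among remaining rows.
        p = max(range(k, len(work)), key=lambda i: abs(work[i][1][k]))
        work[k], work[p] = work[p], work[k]
        pivot = work[k][1][k]
        if pivot != 0.0:
            for i in range(k + 1, len(work)):
                f = work[i][1][k] / pivot
                for j in range(k, width):
                    work[i][1][j] -= f * work[k][1][j]
    return [(idx, originals[idx]) for idx, _ in work[:count]]


def tournament_pivots(panel, b, P):
    """Global indices of the pivot rows chosen for an m x b panel,
    using P leaf blocks and a binary reduction tree."""
    m = len(panel)
    rows = -(-m // P)  # ceiling division: rows per leaf block
    level = []
    for start in range(0, m, rows):
        block = [(i, panel[i]) for i in range(start, min(start + rows, m))]
        level.append(select_pivots(block, b))
    while len(level) > 1:
        nxt = [select_pivots(level[i] + level[i + 1], b)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return [idx for idx, _ in level[0]]
```

Once the indices are known, the selected rows are moved to the top of the panel and the panel is factored without further pivoting, which is the second step listed on the slide.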

CALU (Stability)
Figure: Stability of binary tree based CALU factorization for random matrices (average growth factor for CALU with P = 64, 128, 256 and b from 16 to 128, compared with GEPP and with reference curves including $n^{2/3}$ and $2n^{2/3}$)
Extensive tests on random matrices and on a set of special matrices, using a binary tree and a flat tree, show that CALU is as stable as GEPP in practice.



Multithreaded CALU
- The matrix is partitioned into blocks of size $T_r \times b$.
- The computation of each block is associated with a task.
- The task dependency graph is scheduled using a dynamic scheduler.
Figure: A matrix of $4 \times 4$ blocks with $T_r = 2$, and the corresponding task dependency graph
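As a rough illustration of dynamic scheduling over a task dependency graph, the sketch below submits a task to a thread pool as soon as all of its prerequisites have completed. It uses only Python's standard `concurrent.futures` module. The function name `run_task_graph`, the dictionary-based graph representation, and the task names in the usage note are hypothetical and are not the scheduler used in the presented work.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_task_graph(tasks, deps, workers=4):
    """Run a task dependency graph with a dynamic scheduler.

    tasks maps a task name to a callable; deps maps a task name to the
    set of task names that must finish first. Returns the names in
    completion order.
    """
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    running = {}  # future -> task name
    order = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while remaining or running:
            # Submit every task whose prerequisites have all completed.
            ready = [t for t, d in remaining.items() if not d]
            for t in ready:
                del remaining[t]
                running[pool.submit(tasks[t])] = t
            if not running:
                raise ValueError("dependency cycle or unknown prerequisite")
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in done:
                t = running.pop(fut)
                fut.result()  # re-raise any exception from the task
                order.append(t)
                for d in remaining.values():
                    d.discard(t)  # release tasks that waited on t
    return order
```

A graph for three block columns could, for instance, make `update_1_2` and `update_1_3` depend on `panel_1`, `panel_2` depend on `update_1_2`, and `update_2_3` depend on both `panel_2` and `update_1_3`. With such a graph the second panel can start while `update_1_3` is still running, which is the kind of overlap a dynamic scheduler is meant to expose.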

Multithreaded CALU
- Panel factorization is performed in two steps: find good pivots at low communication cost, then permute them and compute the LU factorization of the panel without pivoting.
- The panel factorization stays on the critical path, but it is done more efficiently.

Multithreaded CALU (Execution)
Figure: Example of execution of CALU for a tall and skinny matrix, using b = 100 and $T_r = 1$, on 8 cores
Figure: Example of execution of CALU for a tall and skinny matrix, using b = 100 and $T_r = 8$, on 8 cores

Multithreaded CAQR
Same approach as CALU, but:
- Panel factorization is performed only once.
- The update of the trailing matrix is triggered by the binary tree used for the panel factorization.


Environments
- Tests were performed on a two-socket, quad-core machine based on the Intel Xeon EM64T processor running Linux, and on a four-socket, quad-core machine based on the AMD Opteron processor.
- Comparison with MKL and PLASMA 2.0 (with default parameters).
- b = MIN(n, 100) was chosen as the block size.

Performance of CALU
Figure: Performance of CALU, MKL_dgetrf, and PLASMA_dgetrf on 8 cores for a tall and skinny matrix with $m = 10^5$ and n varying from 10 upward (x-axis: $\log_2(n)$; y-axis: GFlops/s; series: MKL_dgetf2, MKL_dgetrf, PLASMA_dgetrf, CALU(Tr=4), CALU(Tr=8))

Performance of CALU
Figure: Performance of CALU, MKL_dgetrf, and PLASMA_dgetrf on 16 cores for a tall and skinny matrix with $m = 10^5$ and n varying from 10 upward (x-axis: $\log_2(n)$; y-axis: GFlops/s; series: ACML_dgeqrf, PLASMA_dgeqrf, CALU(Tr=8), CALU(Tr=16))

Performance of CAQR
Figure: Performance of CAQR, MKL_dgeqrf, and PLASMA_dgeqrf on 8 cores for a tall and skinny matrix with $m = 10^5$ and n varying from 10 upward (x-axis: $\log_2(n)$; y-axis: GFlops/s; series: MKL_dgeqrf, PLASMA_dgeqrf, CAQR(Tr=2), CAQR(Tr=4), CAQR(Tr=8), TSQR)


Conclusion
- Multithreaded CALU and CAQR lead to significant improvements for tall and skinny matrices with respect to the corresponding routines in MKL and PLASMA.
- PLASMA becomes more efficient as the number of columns increases.
- No significant improvements have been obtained so far for square matrices.
Prospects:
- Improve the performance of the trailing matrix update by increasing the block size to optimize BLAS3 operations.
- Compare with the recent approach of [Hadri, Ltaief, Agullo, Dongarra 09] for QR factorization, which uses a different reduction tree during panel factorization.

Thank you
