Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors
1 Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors J.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1 1 Department of Computer Science and Engineering, Univ. Jaume I (Spain) {aliaga,martina,quintana}@icc.uji.es 2 Institute of Computational Mathematics, TU Braunschweig (Germany) m.bollhoefer@tu-braunschweig.de September, 2007
2 The solution of SPARSE linear systems is ubiquitous; see, e.g., the Matrix Market collection and the UF sparse matrix collection. Two main approaches to attack the problem: Direct methods (sparse LU, sparse Cholesky, etc.) may fail due to excessive fill-in of the factors. Iterative methods (PCG, GMRES, etc.) may present slow convergence or even divergence.
3 Convergence of iterative methods can be accelerated via preconditioning: incomplete LU, sparse approximate inverse, ... The temporal cost of preconditioning can be reduced using high-performance computing and parallel platforms: Distributed-memory: difficult to program; the high cost of communication especially hurts sparse linear algebra. Shared-memory: lack of scalability.
4 Mid-term goal: computation of parallel preconditioners on shared-memory architectures. These platforms provide enough computational power (16 to 64 processors) for many practical applications. With the advent of manycore, scalability will improve. Focus: first results obtained from a parallel preconditioner that employs the same techniques as ILUPACK, using OpenMP on a CC-NUMA platform.
5 Outline Motivation and Introduction 1 Motivation and Introduction 2 Incomplete LU factorization ILUPACK 3 Trees and sparse matrices Parallel algorithm Parallel implementation 4 5
7 Incomplete LU factorization Given Ax = b, compute L and U^T (sparse) unit lower triangular and D diagonal so that M = LDU approximates A, i.e., A = M + R with R the residual introduced by dropping, while controlling the fill-in by means of dropping techniques. Applying the iterative method to the preconditioned system M^{-1} A x = M^{-1} b accelerates its convergence.
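The mechanics of the preconditioned iteration can be sketched with SciPy's threshold-based ILU (`spilu`); this is not ILUPACK's inverse-based ILU, merely an illustration of how the action of M^{-1} is plugged into an iterative solver:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spilu, LinearOperator, gmres

# 1-D Poisson matrix as a simple stand-in for a sparse system
n = 200
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

# Threshold ILU (scipy's spilu); ILUPACK's inverse-based dropping differs,
# but the preconditioned-iteration mechanics are the same
ilu = spilu(A, drop_tol=1e-4, fill_factor=10)
M = LinearOperator((n, n), matvec=ilu.solve)  # action of M^{-1}

x, info = gmres(A, b, M=M)  # iterates on M^{-1} A x = M^{-1} b
assert info == 0            # 0 signals convergence
```

For this tridiagonal example the incomplete factorization is nearly exact, so GMRES converges almost immediately; the payoff of a good ILU is exactly this collapse of the iteration count.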
8 Incomplete LU factorization ILUPACK Many "successful" ILU factorization preconditioners: ILU(0), ILU(1), ..., ILU(p): level-of-fill ILU with parameter p; ILUT: ILU with threshold; ILUTP: ILUT with partial pivoting; ILU with inverse-based dropping; multilevel ILU; ... The ILUPACK library implements multilevel inverse-based ILUs.
9 ILUPACK In each level: an initial static ordering (+ scaling) of the coefficient matrix, P^T A Q = Ã (MLND, RCM, MMD, ...), followed by an incomplete LU factorization with controlled dropping. It also uses diagonal pivoting to control the growth of L^{-1} and U^{-1}. The pivoted part forms the input matrix of the next level.
10 Incomplete LU factorization ILUPACK Crout variant of the LU factorization, with two different criteria for dropping entries in the factors: small entries, and entries with small influence on the growth of L^{-1}, U^{-1}. Diagonal pivoting is used to control the growth of L^{-1}, U^{-1}. (Figures: the trailing "untouched" part of the matrix, before and after a pivoting step.)
11 Incomplete LU factorization ILUPACK If the untouched part only contains pivoted elements, the partially factorized matrix has the block form [L11 D11 U11, U12; L21, C]. Obtain the approximate Schur complement of C, S = C - L21 D11 U12, applying certain dropping techniques. Repeat the procedure to factorize S (multilevel recursion).
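The Schur-complement step with dropping can be sketched densely as follows; the absolute-value threshold `tau` is a deliberately simple stand-in for ILUPACK's inverse-based dropping rules:

```python
import numpy as np

def approx_schur(C, L21, D11, U12, tau=1e-2):
    """Approximate Schur complement S = C - L21 @ D11 @ U12,
    with entries below tau dropped (simple threshold rule,
    not ILUPACK's inverse-based criterion)."""
    S = C - L21 @ D11 @ U12
    S[np.abs(S) < tau] = 0.0
    return S

# Tiny example with hypothetical 2x2 / 1x1 blocks
D11 = np.diag([2.0, 4.0])
L21 = np.array([[0.5, 0.0]])
U12 = np.array([[0.5], [0.0]])
C = np.array([[3.0]])
S = approx_schur(C, L21, D11, U12)  # exact value: 3 - 0.5*2*0.5 = 2.5
```

In ILUPACK the blocks are sparse and S becomes the coefficient matrix of the next level of the recursion.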
12 Incomplete LU factorization ILUPACK A three-level inverse-based multilevel ILU example. (Figure: block structure with L11 D11 U11, U12 on the first level; L22 D22 U22, U23 and the Schur complement S1 on the second level; L33 D33 U33 and S2 on the third level.)
14 Elimination tree (Figure: matrix sparsity pattern and its elimination tree.) The elimination tree shows groups of rows/columns which can be factorized independently, e.g., {1,2,3} and {4,5,6}.
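For a symmetric sparsity pattern the elimination tree can be computed with Liu's classic algorithm; a minimal sketch, where the assumed input format `upper_cols[j]` lists the row indices i < j with A(i, j) nonzero:

```python
def elimination_tree(upper_cols):
    """Liu's algorithm with path compression via 'ancestor' pointers.
    upper_cols[j] = row indices i < j with A[i, j] != 0 (symmetric pattern).
    Returns parent[j] for each column (None for roots)."""
    n = len(upper_cols)
    parent = [None] * n
    ancestor = [None] * n
    for j in range(n):
        for i in upper_cols[j]:
            r = i
            # climb the partially built tree, redirecting ancestors to j
            while ancestor[r] is not None and ancestor[r] != j:
                t = ancestor[r]
                ancestor[r] = j
                r = t
            if ancestor[r] is None:
                ancestor[r] = j
                parent[r] = j
    return parent

# Arrow pattern: columns 0..2 couple only to column 3, so
# {0}, {1}, {2} are independent leaves under root 3
print(elimination_tree([[], [], [], [0, 1, 2]]))  # [3, 3, 3, None]
```

Subtrees with disjoint vertex sets, like the three leaves above, are exactly the groups that can be factorized independently.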
16 Task (dependency) tree (Figure: elimination tree condensed into a task tree with tasks T1, ..., T7 and independent subtrees.) The task tree shows tasks which can be executed concurrently (i.e., in parallel).
17 The task tree defines a block partition of the matrix. Consider a task tree of height one, with leaves T2, T3 and root T1: A = [A11, 0, A13; 0, A22, A23; A13^T, A23^T, A33]. To compute a preconditioner of A in parallel, we assign a submatrix to each leaf of the task tree: [A11, A13; A13^T, A33^1] to T2 and [A22, A23; A23^T, A33^2] to T3, with A33^1 + A33^2 = A33.
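Building the two leaf submatrices requires splitting the separator block A33 so that A33^1 + A33^2 = A33; an even split is one simple choice (the slides do not fix a rule, so the 50/50 split below is an assumption). A dense sketch:

```python
import numpy as np

def leaf_submatrices(A11, A22, A13, A23, A33):
    """Assemble the bordered submatrices assigned to leaf tasks T2 and T3."""
    A33_1 = 0.5 * A33       # simple even split of the separator block
    A33_2 = A33 - A33_1     # guarantees A33_1 + A33_2 == A33 exactly
    B2 = np.block([[A11, A13], [A13.T, A33_1]])
    B3 = np.block([[A22, A23], [A23.T, A33_2]])
    return B2, B3
```

Each leaf then factorizes its bordered submatrix independently; only the separator contributions must be combined afterwards.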
18 Parallel algorithm Tasks T2, T3 use ILUPACK to factorize, respectively, [A11, A13; A13^T, A33^1] and [A22, A23; A23^T, A33^2], yielding the factors (L11, D11, U11) and (L22, D22, U22), the coupling blocks, and the local Schur complements S33^1 and S33^2.
20 Parallel algorithm Tasks T2, T3 interact with T1 to send, respectively, S33^1 and S33^2. Task T1 then uses ILUPACK to factorize the submatrix S33^1 + S33^2 into L33, D33, U33.
21 One forced level appears in the parallel preconditioner: the first level holds (L11, D11, U11), (L22, D22, U22) and the coupling blocks U13, U23, L31, L32; the second level holds (L33, D33, U33) from the Schur complement. Hence, the parallel preconditioner is not the same as the preconditioner computed by ILUPACK!
22 Task (dependency) tree wishes In a good task tree (from the parallelization viewpoint): Parallelism: the leaves should concentrate the major part of the cost; the number of leaves should enable a high degree of concurrency (taking into account that the overhead usually increases with their number). Load balancing: in general, the leaves should have homogeneous costs.
23 Achieving task tree wishes Our approach combines: MLND (METIS), which yields a best-effort balanced elimination tree with a high degree of concurrency; and task splitting, which divides the leaves with higher computational cost into finer-grain tasks to improve load balancing.
24 Achieving task tree wishes (Figure: with 2 processors, a task tree with leaves T2, T3 suffers bad load balancing; task splitting refines the heavy leaf into finer tasks T4, ..., T7, improving load balancing.) Static task generation (i.e., before execution)!
25 Achieving task tree wishes ... but the costs of the tasks are unknown a priori, so we use a heuristic of the cost (weight): currently, the number of nonzeros in the submatrix to factorize. The costs of the tasks can be heterogeneous, so we use dynamic scheduling for load balancing.
26 Parallel implementation OpenMP implementation: maintain a queue of tasks ready for execution (dependencies fulfilled); prioritize the execution of leaves with larger weights; idle threads extract tasks from this structure and execute them; when a task is completed, the task tree is examined to find out whether a new task has become ready, which is then enqueued.
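The original implementation uses OpenMP; the same scheduling idea can be sketched in Python with a heap-based ready queue ordered by task weight and a condition variable standing in for the OpenMP synchronization (class and parameter names are illustrative, not ILUPACK's API):

```python
import heapq
import threading

class TaskScheduler:
    """Ready-queue scheduler over a task (dependency) tree: a task becomes
    ready once all of its children have completed; idle threads always pick
    the heaviest ready task first."""

    def __init__(self, children, weights, work):
        self.weights, self.work = weights, work
        self.pending = {t: len(c) for t, c in children.items()}
        self.parent = {c: p for p, cs in children.items() for c in cs}
        self.remaining = len(children)
        self.cv = threading.Condition()
        # max-heap by weight (negated, since heapq is a min-heap)
        self.ready = [(-weights[t], t) for t, n in self.pending.items() if n == 0]
        heapq.heapify(self.ready)

    def run(self, nthreads):
        def worker():
            while True:
                with self.cv:
                    while not self.ready and self.remaining > 0:
                        self.cv.wait()
                    if self.remaining == 0:
                        return
                    _, task = heapq.heappop(self.ready)
                self.work[task]()              # execute outside the lock
                with self.cv:
                    self.remaining -= 1
                    p = self.parent.get(task)
                    if p is not None:
                        self.pending[p] -= 1
                        if self.pending[p] == 0:   # dependencies fulfilled
                            heapq.heappush(self.ready, (-self.weights[p], p))
                    self.cv.notify_all()
        threads = [threading.Thread(target=worker) for _ in range(nthreads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

# One root (the separator task) over two leaves; run with two threads
order = []
children = {"T1": ["T2", "T3"], "T2": [], "T3": []}
weights = {"T1": 1, "T2": 5, "T3": 9}
work = {t: (lambda t=t: order.append(t)) for t in children}
TaskScheduler(children, weights, work).run(2)
assert order[-1] == "T1"   # the root always runs after its children
```

With a single thread the schedule is deterministic: the heavier leaf T3 runs first, then T2, then the root T1, mirroring the "prioritize heavier leaves" rule of the slides.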
28 Miscellaneous SGI Altix 350, a CC-NUMA shared-memory multiprocessor with: 16 Intel processors and 32 GBytes of RAM shared via a SGI NUMAlink interconnect. No attempt to exploit CC-NUMA data locality yet. IEEE double precision; Intel compiler, optimization level -O3; one thread scheduled per processor. In both the serial and parallel algorithms: MLND as initial static ordering, ILUPACK default values.
29 Benchmark Matrices SPD matrices obtained from the UF sparse matrix collection (code: matrix name): M1: GHS_psdef/Apache; M2: Schmid/Thermal; M3: Schenk_AFE/Af_0_k; M4: GHS_psdef/Inline_; M5: GHS_psdef/Apache; M6: GHS_psdef/Audikw_. (Table also lists rows/columns and nonzeros per matrix.)
30 Task splitting Guided by the leaf tasks' weights (heuristic cost). Splitting criteria: let w_i be the weight of leaf task T_i, W the overall matrix weight (i.e., the number of nonzero elements of the matrix), and p the number of threads. Split those leaves with ratio w_i / W larger than 1/p (option A, num_leaves ~ p) or larger than 1/(2p) (option B, num_leaves ~ 2p).
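The two splitting criteria reduce to a one-line threshold test; a sketch, with hypothetical nnz-based leaf weights:

```python
def leaves_to_split(leaf_weights, W, p, option="A"):
    """Return indices of leaf tasks whose relative weight w_i / W exceeds
    the threshold: 1/p for option A (~p leaves), 1/(2p) for option B (~2p)."""
    thr = 1.0 / p if option == "A" else 1.0 / (2 * p)
    return [i for i, w in enumerate(leaf_weights) if w / W > thr]

# Hypothetical weights for 4 leaves, W = 1000 total nonzeros, p = 4 threads
print(leaves_to_split([500, 130, 260, 110], 1000, 4))        # [0, 2]
print(leaves_to_split([500, 130, 260, 110], 1000, 4, "B"))   # [0, 1, 2]
```

Option B's lower threshold flags more leaves for splitting, producing a finer task tree at the price of more scheduling overhead.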
31 In order to measure how the w_i are balanced among threads, we use c_w = σ_w / w̄, with σ_w and w̄ the std. dev. and the avg. of the thread weights, assuming a static mapping (i.e., before execution) of leaf tasks to threads. The closer c_w is to 0, the better the weight balancing!
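The balance metric is just the coefficient of variation of the per-thread weights; a sketch (the population std. dev. is an assumption, since the slides do not say which variant is used):

```python
import statistics

def weight_imbalance(thread_weights):
    """c_w = std. dev. / mean of per-thread weights; 0 means perfect
    balance (population std. dev. assumed)."""
    return statistics.pstdev(thread_weights) / statistics.mean(thread_weights)

print(weight_imbalance([4, 4, 4]))          # 0.0 (perfect balance)
print(round(weight_imbalance([6, 2]), 3))   # 0.5
```

The same formula with measured leaf-task times in place of weights gives the a-posteriori load-balance metric c_t used later in the talk.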
32 (Tables: #leaf tasks and c_w (%) per matrix under splitting options A and B, for varying #procs.) Halving #procs. for the same task tree improves weight balancing!
38 Speed-up vs. Load balancing Speed-up computed with respect to the parallel algorithm using the same task tree on a single processor. In order to measure how the t_i (costs of leaf tasks) are balanced among threads, we use c_t = σ_t / t̄, with σ_t and t̄ the std. dev. and the avg. of the thread costs, taking into account how tasks have been (dynamically) mapped to threads (i.e., after execution). The closer c_t is to 0, the better the load balancing in the computation of leaves!
39 Speed-up vs. Load balancing (Tables: speed-up and c_t (%) per matrix under options A and B, for varying #procs.) Halving #procs. for the same task tree improves load balancing!
45 Preconditioner execution time (Tables: execution time in seconds of the serial algorithm vs. the parallel algorithm under options A and B, per matrix and #procs.)
47 Speed-up vs. Load balancing With option A, num_leaves ~ p combined with heterogeneous costs leads to bad performance; with option B, good load balancing leads to good performance. Good news: the leaves concentrate the major part of the cost!
49 Conclusions We have presented a parallel preconditioner for the iterative solution of sparse linear systems on shared-memory parallel architectures. The preconditioner shares its principles with ILUPACK. MLND ordering, task splitting, and dynamic scheduling are employed to improve load balancing, raise the degree of concurrency, and reduce the execution time. In most cases, the parallel performance is satisfactory.
50 Future Work To compare and contrast the numerical properties of the preconditioner in ILUPACK and our parallel preconditioner. To parallelize the application of the preconditioner to the system. To exploit data locality on CC-NUMA architectures. To design parallel preconditioners for non-SPD linear systems. To develop MPI parallel preconditioners.
51 Questions?
Model Order Reduction via Matlab Parallel Computing Toolbox E. Fatih Yetkin & Hasan Dağ Istanbul Technical University Computational Science & Engineering Department September 21, 2009 E. Fatih Yetkin (Istanbul
More informationAccelerating Model Reduction of Large Linear Systems with Graphics Processors
Accelerating Model Reduction of Large Linear Systems with Graphics Processors P. Benner 1, P. Ezzatti 2, D. Kressner 3, E.S. Quintana-Ortí 4, Alfredo Remón 4 1 Max-Plank-Institute for Dynamics of Complex
More informationAccelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers
UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric
More informationSCALABLE HYBRID SPARSE LINEAR SOLVERS
The Pennsylvania State University The Graduate School Department of Computer Science and Engineering SCALABLE HYBRID SPARSE LINEAR SOLVERS A Thesis in Computer Science and Engineering by Keita Teranishi
More informationGaussian Elimination without/with Pivoting and Cholesky Decomposition
Gaussian Elimination without/with Pivoting and Cholesky Decomposition Gaussian Elimination WITHOUT pivoting Notation: For a matrix A R n n we define for k {,,n} the leading principal submatrix a a k A
More informationMatrix Assembly in FEA
Matrix Assembly in FEA 1 In Chapter 2, we spoke about how the global matrix equations are assembled in the finite element method. We now want to revisit that discussion and add some details. For example,
More informationSolving the Inverse Toeplitz Eigenproblem Using ScaLAPACK and MPI *
Solving the Inverse Toeplitz Eigenproblem Using ScaLAPACK and MPI * J.M. Badía and A.M. Vidal Dpto. Informática., Univ Jaume I. 07, Castellón, Spain. badia@inf.uji.es Dpto. Sistemas Informáticos y Computación.
More informationPreconditioning techniques for highly indefinite linear systems Yousef Saad Department of Computer Science and Engineering University of Minnesota
Preconditioning techniques for highly indefinite linear systems Yousef Saad Department of Computer Science and Engineering University of Minnesota Institut C. Jordan, Lyon, 25/03/08 Introduction: Linear
More informationMODIFICATION AND COMPENSATION STRATEGIES FOR THRESHOLD-BASED INCOMPLETE FACTORIZATIONS
MODIFICATION AND COMPENSATION STRATEGIES FOR THRESHOLD-BASED INCOMPLETE FACTORIZATIONS S. MACLACHLAN, D. OSEI-KUFFUOR, AND YOUSEF SAAD Abstract. Standard (single-level) incomplete factorization preconditioners
More informationSparse linear solvers
Sparse linear solvers Laura Grigori ALPINES INRIA and LJLL, UPMC On sabbatical at UC Berkeley March 2015 Plan Sparse linear solvers Sparse matrices and graphs Classes of linear solvers Sparse Cholesky
More informationAdaptive Spike-Based Solver 1.0 User Guide
Intel r Adaptive Spike-Based Solver 10 User Guide I V 1 W 2 I V 2 W 3 I V 3 W 4 I 1 Contents 1 Overview 4 11 A Quick What, Why, and How 4 12 A Hello World Example 7 13 Future Developments 8 14 User Guide
More informationIII. Applications in convex optimization
III. Applications in convex optimization nonsymmetric interior-point methods partial separability and decomposition partial separability first order methods interior-point methods Conic linear optimization
More informationAn Empirical Comparison of Graph Laplacian Solvers
An Empirical Comparison of Graph Laplacian Solvers Kevin Deweese 1 Erik Boman 2 John Gilbert 1 1 Department of Computer Science University of California, Santa Barbara 2 Scalable Algorithms Department
More informationIterative Methods. Splitting Methods
Iterative Methods Splitting Methods 1 Direct Methods Solving Ax = b using direct methods. Gaussian elimination (using LU decomposition) Variants of LU, including Crout and Doolittle Other decomposition
More informationLevel-3 BLAS on a GPU
Level-3 BLAS on a GPU Picking the Low Hanging Fruit Francisco Igual 1 Gregorio Quintana-Ortí 1 Robert A. van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores. University Jaume I. Castellón
More information9.1 Preconditioned Krylov Subspace Methods
Chapter 9 PRECONDITIONING 9.1 Preconditioned Krylov Subspace Methods 9.2 Preconditioned Conjugate Gradient 9.3 Preconditioned Generalized Minimal Residual 9.4 Relaxation Method Preconditioners 9.5 Incomplete
More informationParallel Circuit Simulation. Alexander Hayman
Parallel Circuit Simulation Alexander Hayman In the News LTspice IV features multi-threaded solvers designed to better utilize current multicore processors. Also included are new SPARSE matrix solvers
More informationComputation of the mtx-vec product based on storage scheme on vector CPUs
BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines Analysis of the Matrix Computation of the mtx-vec product based on storage scheme on
More informationLU Factorization. Marco Chiarandini. DM559 Linear and Integer Programming. Department of Mathematics & Computer Science University of Southern Denmark
DM559 Linear and Integer Programming LU Factorization Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark [Based on slides by Lieven Vandenberghe, UCLA] Outline
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 18 Outline
More informationHighly-scalable branch and bound for maximum monomial agreement
Highly-scalable branch and bound for maximum monomial agreement Jonathan Eckstein (Rutgers) William Hart Cynthia A. Phillips Sandia National Laboratories Sandia National Laboratories is a multi-program
More informationParallel Computation of the Eigenstructure of Toeplitz-plus-Hankel matrices on Multicomputers
Parallel Computation of the Eigenstructure of Toeplitz-plus-Hankel matrices on Multicomputers José M. Badía * and Antonio M. Vidal * Departamento de Sistemas Informáticos y Computación Universidad Politécnica
More informationParallel Transposition of Sparse Data Structures
Parallel Transposition of Sparse Data Structures Hao Wang, Weifeng Liu, Kaixi Hou, Wu-chun Feng Department of Computer Science, Virginia Tech Niels Bohr Institute, University of Copenhagen Scientific Computing
More informationMUMPS. The MUMPS library: work done during the SOLSTICE project. MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux
The MUMPS library: work done during the SOLSTICE project MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux Sparse Days and ANR SOLSTICE Final Workshop June MUMPS MUMPS Team since beg. of SOLSTICE (2007) Permanent
More informationThe new challenges to Krylov subspace methods Yousef Saad Department of Computer Science and Engineering University of Minnesota
The new challenges to Krylov subspace methods Yousef Saad Department of Computer Science and Engineering University of Minnesota SIAM Applied Linear Algebra Valencia, June 18-22, 2012 Introduction Krylov
More informationDynamic Scheduling for Work Agglomeration on Heterogeneous Clusters
Dynamic Scheduling for Work Agglomeration on Heterogeneous Clusters Jonathan Lifflander, G. Carl Evans, Anshu Arya, Laxmikant Kale University of Illinois Urbana-Champaign May 25, 2012 Work is overdecomposed
More informationInformation Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes and
Accelerating the Multifrontal Method Information Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes {rflucas,genew,ddavis}@isi.edu and grimes@lstc.com 3D Finite Element
More informationOn the design of parallel linear solvers for large scale problems
On the design of parallel linear solvers for large scale problems ICIAM - August 2015 - Mini-Symposium on Recent advances in matrix computations for extreme-scale computers M. Faverge, X. Lacoste, G. Pichon,
More informationSolving Ax = b, an overview. Program
Numerical Linear Algebra Improving iterative solvers: preconditioning, deflation, numerical software and parallelisation Gerard Sleijpen and Martin van Gijzen November 29, 27 Solving Ax = b, an overview
More informationMatrix Computations: Direct Methods II. May 5, 2014 Lecture 11
Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would
More informationParallelism in Structured Newton Computations
Parallelism in Structured Newton Computations Thomas F Coleman and Wei u Department of Combinatorics and Optimization University of Waterloo Waterloo, Ontario, Canada N2L 3G1 E-mail: tfcoleman@uwaterlooca
More informationarxiv: v1 [cs.na] 20 Jul 2015
AN EFFICIENT SOLVER FOR SPARSE LINEAR SYSTEMS BASED ON RANK-STRUCTURED CHOLESKY FACTORIZATION JEFFREY N. CHADWICK AND DAVID S. BINDEL arxiv:1507.05593v1 [cs.na] 20 Jul 2015 Abstract. Direct factorization
More information