Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors
1 Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors J.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1 1 Department of Computer Science and Engineering, Univ. Jaume I (Spain) {aliaga,martina,quintana}@icc.uji.es 2 Institute of Computational Mathematics, TU Braunschweig (Germany) m.bollhoefer@tu-braunschweig.de September, 2007
2 The solution of SPARSE linear systems is ubiquitous; see, e.g., the Matrix Market collection and the UF sparse matrix collection. Two main approaches to attack the problem: Direct methods (sparse LU, sparse Cholesky, etc.) may fail due to excessive fill-in of the factors. Iterative methods (PCG, GMRES, etc.) may present slow convergence or even divergence.
3 Convergence of iterative methods can be accelerated via preconditioning: incomplete LU, sparse approximate inverse, ... The temporal cost of preconditioning can be reduced using high-performance computing and parallel platforms: Distributed-memory: difficult to program; the high cost of communication especially hurts sparse linear algebra. Shared-memory: lack of scalability.
4 Mid-term goal: computation of parallel preconditioners on shared-memory architectures. These platforms provide enough computational power (16 to 64 processors) for many practical applications. With the advent of manycore, scalability will improve. Focus: first results obtained from a parallel preconditioner that employs the same techniques as ILUPACK, using OpenMP on a CC-NUMA platform.
5 Outline Motivation and Introduction 1 Motivation and Introduction 2 Incomplete LU factorization ILUPACK 3 Trees and sparse matrices Parallel algorithm Parallel implementation 4 5
7 Incomplete LU factorization Given Ax = b, compute L and U^T (sparse) unit lower triangular and D diagonal so that M = LDU approximates A, i.e., A = M + R with R the residual introduced by dropping, while controlling the fill-in by means of dropping techniques. Applying the iterative method to the preconditioned system M^{-1} A x = M^{-1} b accelerates its convergence.
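The mechanics of the preconditioned iteration can be sketched with SciPy's threshold-based ILU (`spilu`); this is not ILUPACK's inverse-based ILU, merely an illustration of how the action of M^{-1} is plugged into an iterative solver:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spilu, LinearOperator, gmres

# 1-D Poisson matrix as a simple stand-in for a sparse system
n = 200
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

# Threshold ILU (scipy's spilu); ILUPACK's inverse-based dropping differs,
# but the preconditioned-iteration mechanics are the same
ilu = spilu(A, drop_tol=1e-4, fill_factor=10)
M = LinearOperator((n, n), matvec=ilu.solve)  # action of M^{-1}

x, info = gmres(A, b, M=M)  # iterates on M^{-1} A x = M^{-1} b
assert info == 0            # 0 signals convergence
```

For this tridiagonal example the incomplete factorization is nearly exact, so GMRES converges almost immediately; the payoff of a good ILU is exactly this collapse of the iteration count.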
8 Incomplete LU factorization ILUPACK Many "successful" ILU factorization preconditioners: ILU(0), ILU(1), ..., ILU(p): level-of-fill ILU with parameter p; ILUT: ILU with threshold; ILUTP: ILUT with partial pivoting; ILU with inverse-based dropping; multilevel ILU; ... The ILUPACK library implements multilevel inverse-based ILUs.
9 ILUPACK In each level: an initial static ordering (+ scaling) of the coefficient matrix, P^T A Q = Ã (MLND, RCM, MMD, ...), followed by an incomplete LU factorization with controlled dropping. It also uses diagonal pivoting to control the growth of L^{-1} and U^{-1}. The pivoted part forms the input matrix of the next level.
10 Incomplete LU factorization ILUPACK Crout variant of the LU factorization, with two different criteria for dropping entries in the factors: small entries, and entries with small influence on the growth of L^{-1}, U^{-1}. Diagonal pivoting is used to control the growth of L^{-1}, U^{-1}. (Figures: the trailing "untouched" part of the matrix, before and after a pivoting step.)
11 Incomplete LU factorization ILUPACK If the untouched part only contains pivoted elements, the partially factorized matrix has the block form [L11 D11 U11, U12; L21, C]. Obtain the approximate Schur complement of C, S = C - L21 D11 U12, applying certain dropping techniques. Repeat the procedure to factorize S (multilevel recursion).
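The Schur-complement step with dropping can be sketched densely as follows; the absolute-value threshold `tau` is a deliberately simple stand-in for ILUPACK's inverse-based dropping rules:

```python
import numpy as np

def approx_schur(C, L21, D11, U12, tau=1e-2):
    """Approximate Schur complement S = C - L21 @ D11 @ U12,
    with entries below tau dropped (simple threshold rule,
    not ILUPACK's inverse-based criterion)."""
    S = C - L21 @ D11 @ U12
    S[np.abs(S) < tau] = 0.0
    return S

# Tiny example with hypothetical 2x2 / 1x1 blocks
D11 = np.diag([2.0, 4.0])
L21 = np.array([[0.5, 0.0]])
U12 = np.array([[0.5], [0.0]])
C = np.array([[3.0]])
S = approx_schur(C, L21, D11, U12)  # exact value: 3 - 0.5*2*0.5 = 2.5
```

In ILUPACK the blocks are sparse and S becomes the coefficient matrix of the next level of the recursion.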
12 Incomplete LU factorization ILUPACK A three-level inverse-based multilevel ILU example. (Figure: block structure with L11 D11 U11, U12 on the first level; L22 D22 U22, U23 and the Schur complement S1 on the second level; L33 D33 U33 and S2 on the third level.)
14 Elimination tree (Figure: matrix sparsity pattern and its elimination tree.) The elimination tree shows groups of rows/columns which can be factorized independently, e.g., {1,2,3} and {4,5,6}.
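For a symmetric sparsity pattern the elimination tree can be computed with Liu's classic algorithm; a minimal sketch, where the assumed input format `upper_cols[j]` lists the row indices i < j with A(i, j) nonzero:

```python
def elimination_tree(upper_cols):
    """Liu's algorithm with path compression via 'ancestor' pointers.
    upper_cols[j] = row indices i < j with A[i, j] != 0 (symmetric pattern).
    Returns parent[j] for each column (None for roots)."""
    n = len(upper_cols)
    parent = [None] * n
    ancestor = [None] * n
    for j in range(n):
        for i in upper_cols[j]:
            r = i
            # climb the partially built tree, redirecting ancestors to j
            while ancestor[r] is not None and ancestor[r] != j:
                t = ancestor[r]
                ancestor[r] = j
                r = t
            if ancestor[r] is None:
                ancestor[r] = j
                parent[r] = j
    return parent

# Arrow pattern: columns 0..2 couple only to column 3, so
# {0}, {1}, {2} are independent leaves under root 3
print(elimination_tree([[], [], [], [0, 1, 2]]))  # [3, 3, 3, None]
```

Subtrees with disjoint vertex sets, like the three leaves above, are exactly the groups that can be factorized independently.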
16 Task (dependency) tree (Figure: elimination tree condensed into a task tree with tasks T1, ..., T7 and independent subtrees.) The task tree shows tasks which can be executed concurrently (i.e., in parallel).
17 The task tree defines a block partition of the matrix. Consider a task tree of height one, with leaves T2, T3 and root T1: A = [A11, 0, A13; 0, A22, A23; A13^T, A23^T, A33]. To compute a preconditioner of A in parallel, we assign a submatrix to each leaf of the task tree: [A11, A13; A13^T, A33^1] to T2 and [A22, A23; A23^T, A33^2] to T3, with A33^1 + A33^2 = A33.
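Building the two leaf submatrices requires splitting the separator block A33 so that A33^1 + A33^2 = A33; an even split is one simple choice (the slides do not fix a rule, so the 50/50 split below is an assumption). A dense sketch:

```python
import numpy as np

def leaf_submatrices(A11, A22, A13, A23, A33):
    """Assemble the bordered submatrices assigned to leaf tasks T2 and T3."""
    A33_1 = 0.5 * A33       # simple even split of the separator block
    A33_2 = A33 - A33_1     # guarantees A33_1 + A33_2 == A33 exactly
    B2 = np.block([[A11, A13], [A13.T, A33_1]])
    B3 = np.block([[A22, A23], [A23.T, A33_2]])
    return B2, B3
```

Each leaf then factorizes its bordered submatrix independently; only the separator contributions must be combined afterwards.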
18 Parallel algorithm Tasks T2, T3 use ILUPACK to factorize, respectively, [A11, A13; A13^T, A33^1] and [A22, A23; A23^T, A33^2], yielding the factors (L11, D11, U11) and (L22, D22, U22), the coupling blocks, and the local Schur complements S33^1 and S33^2.
20 Parallel algorithm Tasks T2, T3 interact with T1 to send, respectively, S33^1 and S33^2. Task T1 then uses ILUPACK to factorize the submatrix S33^1 + S33^2 into L33, D33, U33.
21 One forced level appears in the parallel preconditioner: the first level holds (L11, D11, U11), (L22, D22, U22) and the coupling blocks U13, U23, L31, L32; the second level holds (L33, D33, U33) from the Schur complement. Hence, the parallel preconditioner is not the same as the preconditioner computed by ILUPACK!
22 Task (dependency) tree wishes In a good task tree (from the parallelization viewpoint): Parallelism: the leaves should concentrate the major part of the cost; the number of leaves should enable a high degree of concurrency (taking into account that the overhead usually increases with their number). Load balancing: in general, the leaves should have homogeneous costs.
23 Achieving task tree wishes Our approach combines: MLND (METIS), which yields a best-effort balanced elimination tree with a high degree of concurrency; and task splitting, which divides the leaves with higher computational cost into finer-grain tasks to improve load balancing.
24 Achieving task tree wishes (Figure: with 2 processors, a task tree with leaves T2, T3 suffers bad load balancing; task splitting refines the heavy leaf into finer tasks T4, ..., T7, improving load balancing.) Static task generation (i.e., before execution)!
25 Achieving task tree wishes ... but the costs of the tasks are unknown a priori, so we use a heuristic of the cost (weight): currently, the number of nonzeros in the submatrix to factorize. The costs of the tasks can be heterogeneous, so we use dynamic scheduling for load balancing.
26 Parallel implementation OpenMP implementation: maintain a queue of tasks ready for execution (dependencies fulfilled); prioritize the execution of leaves with larger weights; idle threads extract tasks from this structure and execute them; when a task is completed, the task tree is examined to find out whether a new task has become ready, which is then enqueued.
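The original implementation uses OpenMP; the same scheduling idea can be sketched in Python with a heap-based ready queue ordered by task weight and a condition variable standing in for the OpenMP synchronization (class and parameter names are illustrative, not ILUPACK's API):

```python
import heapq
import threading

class TaskScheduler:
    """Ready-queue scheduler over a task (dependency) tree: a task becomes
    ready once all of its children have completed; idle threads always pick
    the heaviest ready task first."""

    def __init__(self, children, weights, work):
        self.weights, self.work = weights, work
        self.pending = {t: len(c) for t, c in children.items()}
        self.parent = {c: p for p, cs in children.items() for c in cs}
        self.remaining = len(children)
        self.cv = threading.Condition()
        # max-heap by weight (negated, since heapq is a min-heap)
        self.ready = [(-weights[t], t) for t, n in self.pending.items() if n == 0]
        heapq.heapify(self.ready)

    def run(self, nthreads):
        def worker():
            while True:
                with self.cv:
                    while not self.ready and self.remaining > 0:
                        self.cv.wait()
                    if self.remaining == 0:
                        return
                    _, task = heapq.heappop(self.ready)
                self.work[task]()              # execute outside the lock
                with self.cv:
                    self.remaining -= 1
                    p = self.parent.get(task)
                    if p is not None:
                        self.pending[p] -= 1
                        if self.pending[p] == 0:   # dependencies fulfilled
                            heapq.heappush(self.ready, (-self.weights[p], p))
                    self.cv.notify_all()
        threads = [threading.Thread(target=worker) for _ in range(nthreads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

# One root (the separator task) over two leaves; run with two threads
order = []
children = {"T1": ["T2", "T3"], "T2": [], "T3": []}
weights = {"T1": 1, "T2": 5, "T3": 9}
work = {t: (lambda t=t: order.append(t)) for t in children}
TaskScheduler(children, weights, work).run(2)
assert order[-1] == "T1"   # the root always runs after its children
```

With a single thread the schedule is deterministic: the heavier leaf T3 runs first, then T2, then the root T1, mirroring the "prioritize heavier leaves" rule of the slides.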
28 Miscellaneous SGI Altix 350, a CC-NUMA shared-memory multiprocessor with: 16 Intel processors and 32 GBytes of RAM shared via a SGI NUMAlink interconnect. No attempt to exploit CC-NUMA data locality yet. IEEE double precision; Intel compiler, optimization level -O3; one thread scheduled per processor. In both the serial and parallel algorithms: MLND as initial static ordering, ILUPACK default values.
29 Benchmark Matrices SPD matrices obtained from the UF sparse matrix collection (code: matrix name): M1: GHS_psdef/Apache; M2: Schmid/Thermal; M3: Schenk_AFE/Af_0_k; M4: GHS_psdef/Inline_; M5: GHS_psdef/Apache; M6: GHS_psdef/Audikw_. (Table also lists rows/columns and nonzeros per matrix.)
30 Task splitting Guided by the leaf tasks' weights (heuristic cost). Splitting criteria: let w_i be the weight of leaf task T_i, W the overall matrix weight (i.e., the number of nonzero elements of the matrix), and p the number of threads. Split those leaves with ratio w_i / W larger than 1/p (option A, num_leaves ~ p) or larger than 1/(2p) (option B, num_leaves ~ 2p).
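The two splitting criteria reduce to a one-line threshold test; a sketch, with hypothetical nnz-based leaf weights:

```python
def leaves_to_split(leaf_weights, W, p, option="A"):
    """Return indices of leaf tasks whose relative weight w_i / W exceeds
    the threshold: 1/p for option A (~p leaves), 1/(2p) for option B (~2p)."""
    thr = 1.0 / p if option == "A" else 1.0 / (2 * p)
    return [i for i, w in enumerate(leaf_weights) if w / W > thr]

# Hypothetical weights for 4 leaves, W = 1000 total nonzeros, p = 4 threads
print(leaves_to_split([500, 130, 260, 110], 1000, 4))        # [0, 2]
print(leaves_to_split([500, 130, 260, 110], 1000, 4, "B"))   # [0, 1, 2]
```

Option B's lower threshold flags more leaves for splitting, producing a finer task tree at the price of more scheduling overhead.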
31 In order to measure how the w_i are balanced among threads, we use c_w = σ_w / w̄, with σ_w and w̄ the std. dev. and the avg. of the thread weights, assuming a static mapping (i.e., before execution) of leaf tasks to threads. The closer c_w is to 0, the better the weight balancing!
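The balance metric is just the coefficient of variation of the per-thread weights; a sketch (the population std. dev. is an assumption, since the slides do not say which variant is used):

```python
import statistics

def weight_imbalance(thread_weights):
    """c_w = std. dev. / mean of per-thread weights; 0 means perfect
    balance (population std. dev. assumed)."""
    return statistics.pstdev(thread_weights) / statistics.mean(thread_weights)

print(weight_imbalance([4, 4, 4]))          # 0.0 (perfect balance)
print(round(weight_imbalance([6, 2]), 3))   # 0.5
```

The same formula with measured leaf-task times in place of weights gives the a-posteriori load-balance metric c_t used later in the talk.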
32 (Tables: #leaf tasks and c_w (%) per matrix under splitting options A and B, for varying #procs.) Halving #procs. for the same task tree improves weight balancing!
38 Speed-up vs. Load balancing Speed-up computed with respect to the parallel algorithm using the same task tree on a single processor. In order to measure how the t_i (costs of leaf tasks) are balanced among threads, we use c_t = σ_t / t̄, with σ_t and t̄ the std. dev. and the avg. of the thread costs, taking into account how tasks have been (dynamically) mapped to threads (i.e., after execution). The closer c_t is to 0, the better the load balancing in the computation of leaves!
39 Speed-up vs. Load balancing (Tables: speed-up and c_t (%) per matrix under options A and B, for varying #procs.) Halving #procs. for the same task tree improves load balancing!
45 Preconditioner execution time (Tables: execution time in seconds of the serial algorithm vs. the parallel algorithm under options A and B, per matrix and #procs.)
47 Speed-up vs. Load balancing With option A, num_leaves ~ p combined with heterogeneous costs leads to bad performance; with option B, good load balancing leads to good performance. Good news: the leaves concentrate the major part of the cost!
49 Conclusions We have presented a parallel preconditioner for the iterative solution of sparse linear systems on shared-memory parallel architectures. The preconditioner shares its principles with ILUPACK. MLND ordering, task splitting, and dynamic scheduling are employed to improve load balancing, raise the degree of concurrency, and reduce the execution time. In most cases, the parallel performance is satisfactory.
50 Future Work To compare and contrast the numerical properties of the preconditioner in ILUPACK and our parallel preconditioner. To parallelize the application of the preconditioner to the system. To exploit data locality on CC-NUMA architectures. To design parallel preconditioners for non-SPD linear systems. To develop MPI parallel preconditioners.
51 Questions?
Model Order Reduction via Matlab Parallel Computing Toolbox E. Fatih Yetkin & Hasan Dağ Istanbul Technical University Computational Science & Engineering Department September 21, 2009 E. Fatih Yetkin (Istanbul
More informationAccelerating Model Reduction of Large Linear Systems with Graphics Processors
Accelerating Model Reduction of Large Linear Systems with Graphics Processors P. Benner 1, P. Ezzatti 2, D. Kressner 3, E.S. Quintana-Ortí 4, Alfredo Remón 4 1 Max-Plank-Institute for Dynamics of Complex
More informationAccelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers
UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric
More informationSCALABLE HYBRID SPARSE LINEAR SOLVERS
The Pennsylvania State University The Graduate School Department of Computer Science and Engineering SCALABLE HYBRID SPARSE LINEAR SOLVERS A Thesis in Computer Science and Engineering by Keita Teranishi
More informationGaussian Elimination without/with Pivoting and Cholesky Decomposition
Gaussian Elimination without/with Pivoting and Cholesky Decomposition Gaussian Elimination WITHOUT pivoting Notation: For a matrix A R n n we define for k {,,n} the leading principal submatrix a a k A
More informationMatrix Assembly in FEA
Matrix Assembly in FEA 1 In Chapter 2, we spoke about how the global matrix equations are assembled in the finite element method. We now want to revisit that discussion and add some details. For example,
More informationSolving the Inverse Toeplitz Eigenproblem Using ScaLAPACK and MPI *
Solving the Inverse Toeplitz Eigenproblem Using ScaLAPACK and MPI * J.M. Badía and A.M. Vidal Dpto. Informática., Univ Jaume I. 07, Castellón, Spain. badia@inf.uji.es Dpto. Sistemas Informáticos y Computación.
More informationPreconditioning techniques for highly indefinite linear systems Yousef Saad Department of Computer Science and Engineering University of Minnesota
Preconditioning techniques for highly indefinite linear systems Yousef Saad Department of Computer Science and Engineering University of Minnesota Institut C. Jordan, Lyon, 25/03/08 Introduction: Linear
More informationMODIFICATION AND COMPENSATION STRATEGIES FOR THRESHOLD-BASED INCOMPLETE FACTORIZATIONS
MODIFICATION AND COMPENSATION STRATEGIES FOR THRESHOLD-BASED INCOMPLETE FACTORIZATIONS S. MACLACHLAN, D. OSEI-KUFFUOR, AND YOUSEF SAAD Abstract. Standard (single-level) incomplete factorization preconditioners
More informationSparse linear solvers
Sparse linear solvers Laura Grigori ALPINES INRIA and LJLL, UPMC On sabbatical at UC Berkeley March 2015 Plan Sparse linear solvers Sparse matrices and graphs Classes of linear solvers Sparse Cholesky
More informationAdaptive Spike-Based Solver 1.0 User Guide
Intel r Adaptive Spike-Based Solver 10 User Guide I V 1 W 2 I V 2 W 3 I V 3 W 4 I 1 Contents 1 Overview 4 11 A Quick What, Why, and How 4 12 A Hello World Example 7 13 Future Developments 8 14 User Guide
More informationIII. Applications in convex optimization
III. Applications in convex optimization nonsymmetric interior-point methods partial separability and decomposition partial separability first order methods interior-point methods Conic linear optimization
More informationAn Empirical Comparison of Graph Laplacian Solvers
An Empirical Comparison of Graph Laplacian Solvers Kevin Deweese 1 Erik Boman 2 John Gilbert 1 1 Department of Computer Science University of California, Santa Barbara 2 Scalable Algorithms Department
More informationIterative Methods. Splitting Methods
Iterative Methods Splitting Methods 1 Direct Methods Solving Ax = b using direct methods. Gaussian elimination (using LU decomposition) Variants of LU, including Crout and Doolittle Other decomposition
More informationLevel-3 BLAS on a GPU
Level-3 BLAS on a GPU Picking the Low Hanging Fruit Francisco Igual 1 Gregorio Quintana-Ortí 1 Robert A. van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores. University Jaume I. Castellón
More information9.1 Preconditioned Krylov Subspace Methods
Chapter 9 PRECONDITIONING 9.1 Preconditioned Krylov Subspace Methods 9.2 Preconditioned Conjugate Gradient 9.3 Preconditioned Generalized Minimal Residual 9.4 Relaxation Method Preconditioners 9.5 Incomplete
More informationParallel Circuit Simulation. Alexander Hayman
Parallel Circuit Simulation Alexander Hayman In the News LTspice IV features multi-threaded solvers designed to better utilize current multicore processors. Also included are new SPARSE matrix solvers
More informationComputation of the mtx-vec product based on storage scheme on vector CPUs
BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines Analysis of the Matrix Computation of the mtx-vec product based on storage scheme on
More informationLU Factorization. Marco Chiarandini. DM559 Linear and Integer Programming. Department of Mathematics & Computer Science University of Southern Denmark
DM559 Linear and Integer Programming LU Factorization Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark [Based on slides by Lieven Vandenberghe, UCLA] Outline
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 18 Outline
More informationHighly-scalable branch and bound for maximum monomial agreement
Highly-scalable branch and bound for maximum monomial agreement Jonathan Eckstein (Rutgers) William Hart Cynthia A. Phillips Sandia National Laboratories Sandia National Laboratories is a multi-program
More informationParallel Computation of the Eigenstructure of Toeplitz-plus-Hankel matrices on Multicomputers
Parallel Computation of the Eigenstructure of Toeplitz-plus-Hankel matrices on Multicomputers José M. Badía * and Antonio M. Vidal * Departamento de Sistemas Informáticos y Computación Universidad Politécnica
More informationParallel Transposition of Sparse Data Structures
Parallel Transposition of Sparse Data Structures Hao Wang, Weifeng Liu, Kaixi Hou, Wu-chun Feng Department of Computer Science, Virginia Tech Niels Bohr Institute, University of Copenhagen Scientific Computing
More informationMUMPS. The MUMPS library: work done during the SOLSTICE project. MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux
The MUMPS library: work done during the SOLSTICE project MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux Sparse Days and ANR SOLSTICE Final Workshop June MUMPS MUMPS Team since beg. of SOLSTICE (2007) Permanent
More informationThe new challenges to Krylov subspace methods Yousef Saad Department of Computer Science and Engineering University of Minnesota
The new challenges to Krylov subspace methods Yousef Saad Department of Computer Science and Engineering University of Minnesota SIAM Applied Linear Algebra Valencia, June 18-22, 2012 Introduction Krylov
More informationDynamic Scheduling for Work Agglomeration on Heterogeneous Clusters
Dynamic Scheduling for Work Agglomeration on Heterogeneous Clusters Jonathan Lifflander, G. Carl Evans, Anshu Arya, Laxmikant Kale University of Illinois Urbana-Champaign May 25, 2012 Work is overdecomposed
More informationInformation Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes and
Accelerating the Multifrontal Method Information Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes {rflucas,genew,ddavis}@isi.edu and grimes@lstc.com 3D Finite Element
More informationOn the design of parallel linear solvers for large scale problems
On the design of parallel linear solvers for large scale problems ICIAM - August 2015 - Mini-Symposium on Recent advances in matrix computations for extreme-scale computers M. Faverge, X. Lacoste, G. Pichon,
More informationSolving Ax = b, an overview. Program
Numerical Linear Algebra Improving iterative solvers: preconditioning, deflation, numerical software and parallelisation Gerard Sleijpen and Martin van Gijzen November 29, 27 Solving Ax = b, an overview
More informationMatrix Computations: Direct Methods II. May 5, 2014 Lecture 11
Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would
More informationParallelism in Structured Newton Computations
Parallelism in Structured Newton Computations Thomas F Coleman and Wei u Department of Combinatorics and Optimization University of Waterloo Waterloo, Ontario, Canada N2L 3G1 E-mail: tfcoleman@uwaterlooca
More informationarxiv: v1 [cs.na] 20 Jul 2015
AN EFFICIENT SOLVER FOR SPARSE LINEAR SYSTEMS BASED ON RANK-STRUCTURED CHOLESKY FACTORIZATION JEFFREY N. CHADWICK AND DAVID S. BINDEL arxiv:1507.05593v1 [cs.na] 20 Jul 2015 Abstract. Direct factorization
More information