Sparse factorizations: Towards optimal complexity and resilience at exascale


1 Sparse factorizations: Towards optimal complexity and resilience at exascale
Xiaoye Sherry Li, Lawrence Berkeley National Laboratory
Challenges in 21st Century Experimental Mathematical Computation Workshop, ICERM, Brown Univ., July 21-25, 2014.

2 Introduction
- DOE SciDAC programs (Scientific Discovery through Advanced Computing)
  - FASTMath Institute (Frameworks, Algorithms, and Scalable Technologies for Mathematics). Software: SuperLU, PETSc, Trilinos, Chombo, mesh, (3 other Institutes)
  - Science applications (many, mostly partial differential equations): CEMM (Center for Extended MHD Modeling, fusion energy); ComPASS (Community Petascale Project for Accelerator Science and Simulation)
- LBNL focuses
  - Direct solvers (SuperLU): scaling to 1000s of cores
  - Hybrid solvers (direct & iterative): scaling to 10,000 cores
  - Low-rank HSS preconditioner: nearly linear complexity for certain PDEs

3 Application 1: Burning plasma for fusion energy
- ITER: a new fusion reactor being constructed in Cadarache, France
  - International collaboration: China, the European Union, India, Japan, Korea, Russia, and the United States
  - Study how to harness fusion, creating clean energy using nearly inexhaustible hydrogen as the fuel. ITER promises to produce 10 times as much energy as it uses, but that success hinges on accurately designing the device.
- One major simulation goal is to predict microscopic MHD instabilities of burning plasma in ITER. This involves solving extended and nonlinear magnetohydrodynamics equations.

4 Application 1: ITER modeling
- Center for Extended Magnetohydrodynamic Modeling (CEMM), PI: S. Jardin, PPPL.
- Develop simulation codes to predict microscopic MHD instabilities of burning magnetized plasma in a confinement device (e.g., the tokamak used in ITER experiments).
  - Efficiency of the fusion configuration increases with the ratio of thermal to magnetic pressure, but MHD instabilities are more likely at a higher ratio.
- Code suite includes M3D-C1, NIMROD
[Figure (S. Jardin): torus in (R, ϕ, Z) coordinates. At each ϕ = constant plane, scalar 2D data is represented using 18-degree-of-freedom quintic triangular finite elements Q_18; coupling along the toroidal direction.]

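5 ITER modeling: 2-Fluid 3D MHD Equations

$$
\begin{aligned}
&\frac{\partial n}{\partial t} + \nabla\cdot(nV) = 0 && \text{continuity} \\
&\frac{\partial B}{\partial t} = -\nabla\times E, \quad \nabla\cdot B = 0, \quad \mu_0 J = \nabla\times B && \text{Maxwell} \\
&nm\left(\frac{\partial V}{\partial t} + V\cdot\nabla V\right) + \nabla p = J\times B - \nabla\cdot\Pi_{GV} - \nabla\cdot\Pi_{\mu} && \text{momentum} \\
&E + V\times B = \eta J + \frac{1}{ne}\left(J\times B - \nabla p_e - \nabla\cdot\Pi_e\right) && \text{Ohm's law} \\
&\frac{3}{2}\left(\frac{\partial p_e}{\partial t} + \nabla\cdot(p_e V)\right) = -p_e\,\nabla\cdot V + \eta J^2 - \nabla\cdot q_e + Q_\Delta && \text{electron energy} \\
&\frac{3}{2}\left(\frac{\partial p_i}{\partial t} + \nabla\cdot(p_i V)\right) = -p_i\,\nabla\cdot V - \Pi_\mu : \nabla V - \nabla\cdot q_i - Q_\Delta && \text{ion energy}
\end{aligned}
$$

The objective of the M3D-C1 code is to solve these equations as accurately as possible in 3D toroidal geometry with realistic B.C., optimized for a low-β torus with a strong toroidal field.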

6 Application 2: particle accelerator cavity design
- Community Petascale Project for Accelerator Science and Simulation (ComPASS), PI: P. Spentzouris, Fermilab
- Development of a comprehensive computational infrastructure for accelerator modeling and optimization
- RF cavity: Maxwell equations in electromagnetic field
- FEM in frequency domain leads to a large sparse eigenvalue problem; needs to solve shifted linear systems (L.-Q. Lee)
  - Closed cavity: linear eigenvalue problem, $(K_0 - \sigma^2 M_0)\,x = M_0\,b$
  - Open cavity (waveguide BC): nonlinear complex eigenvalue problem, $(K_0 + i\sigma W - \sigma^2 M_0)\,x = b$
[Figure: RF unit in ILC; closed cavity with boundary labels Γ_E and Γ_M; open cavity with waveguide boundary conditions]

7 Sparse: lots of zeros in matrix
- Fluid dynamics, structural mechanics, chemical process simulation, circuit simulation, electromagnetic fields, magneto-hydrodynamics, seismic imaging, economic modeling, optimization, data analysis, statistics, ...
- Example: A of dimension 10^6, 10~100 nonzeros per row
- Matlab: >> spy(A)
[Figures: Boeing/msc00726 (structural eng.); Mallya/lhr01 (chemical eng.)]

8 Strategies of sparse linear solvers
- Solving a system of linear equations Ax = b
  - Sparse: many zeros in A; worth special treatment
- Iterative methods (CG, GMRES, ...)
  - A is not changed (read-only)
  - Key kernel: sparse matrix-vector multiply
  - Easier to optimize and parallelize
  - Low algorithmic complexity, but may not converge
- Direct methods
  - A is modified (factorized)
  - Harder to optimize and parallelize
  - Numerically robust, but higher algorithmic complexity
- Often use direct method (factorization) to precondition iterative method
  - Solve an easy system: M^{-1} A x = M^{-1} b
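The sparse matrix-vector multiply named above as the key kernel of iterative methods can be sketched in a few lines. This is only an illustration: the compressed sparse row (CSR) layout and the helper names `dense_to_csr` and `csr_matvec` are assumptions made for this example, not taken from the slides or from any particular library.

```python
def dense_to_csr(rows):
    """Convert a dense list-of-lists matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in rows:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row's nonzeros
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A x, touching only the stored nonzeros of A."""
    y = []
    for i in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]  # indirect access into x
        y.append(s)
    return y


A = [[4.0, 0.0, 1.0],
     [0.0, 3.0, 0.0],
     [2.0, 0.0, 5.0]]
values, col_idx, row_ptr = dense_to_csr(A)
y = csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

The matrix is only read, never modified, which is the property that makes this kernel comparatively easy to parallelize by rows.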

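9 Gaussian Elimination (GE)
- Solving a system of linear equations Ax = b
- First step of GE:

$$
A = \begin{bmatrix} \alpha & w^T \\ v & B \end{bmatrix}
  = \begin{bmatrix} 1 & 0 \\ v/\alpha & I \end{bmatrix}
    \begin{bmatrix} \alpha & w^T \\ 0 & C \end{bmatrix},
\qquad C = B - \frac{v\,w^T}{\alpha}
$$

- Repeat GE on C
- Result is the LU factorization (A = LU): L lower triangular with unit diagonal, U upper triangular
- Then, x is obtained by solving two triangular systems with L and U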
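The recursion on C can be written out as a short dense sketch. It assumes NumPy and applies no pivoting, so it only works when every pivot α is nonzero; the function name `lu_no_pivot` is made up for this example.

```python
import numpy as np


def lu_no_pivot(A):
    """LU factorization by repeating the first step of GE on the trailing block C."""
    A = np.array(A, dtype=float)  # work on a copy
    n = A.shape[0]
    L = np.eye(n)                 # unit lower triangular
    U = np.zeros((n, n))
    for k in range(n):
        alpha = A[k, k]
        if alpha == 0.0:
            raise ZeroDivisionError("zero pivot; pivoting would be required")
        w = A[k, k + 1:].copy()   # row to the right of the pivot (w^T)
        v = A[k + 1:, k].copy()   # column below the pivot (v)
        U[k, k] = alpha
        U[k, k + 1:] = w
        L[k + 1:, k] = v / alpha
        A[k + 1:, k + 1:] -= np.outer(v, w) / alpha  # C = B - v w^T / alpha
    return L, U


A = np.array([[4.0, 3.0, 2.0],
              [2.0, 4.0, 1.0],
              [1.0, 2.0, 5.0]])
L, U = lu_no_pivot(A)
b = np.array([1.0, 2.0, 3.0])
y = np.linalg.solve(L, b)  # forward solve (a triangular solver would be used in practice)
x = np.linalg.solve(U, y)  # backward solve
```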

10 Sparse factorization
- Store A explicitly: many sparse compressed formats
- Fill-in: new nonzeros in L & U
- Graph algorithms: directed/undirected graphs, bipartite graphs, paths, elimination trees, depth-first search, heuristics for NP-hard problems, cliques, graph partitioning, ...
- Unfriendly to high-performance, parallel computing
  - Irregular memory access, indirect addressing, strong task/data dependency
[Figure: L and U factors]

11 Graph tool: reachable set, fill-path
- Edge (x, y) exists in the filled graph G+ due to the path: x → 7 → 3 → 9 → y
- Finding fill-ins ↔ finding the transitive closure of G(A)
[Figure: graph with vertices x and y joined by a fill edge (+); o marks original edges]
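The fill-path idea can be checked on a small undirected graph. The sketch below is an assumption-laden illustration, not the algorithm used in any solver: it treats the symmetric case, eliminates vertices in the order 0, 1, 2, ..., and uses made-up function names. It compares the edges produced by the elimination game with a search for paths whose interior vertices are all numbered lower than both endpoints.

```python
def build_adj(n, edges):
    adj = {v: set() for v in range(n)}
    for x, y in edges:
        adj[x].add(y)
        adj[y].add(x)
    return adj


def eliminate(n, edges):
    """Elimination game: return the edge set of the filled graph G+."""
    adj = build_adj(n, edges)
    filled = {frozenset(e) for e in edges}
    for v in range(n):
        higher = sorted(u for u in adj[v] if u > v)  # not yet eliminated neighbors
        for i, a in enumerate(higher):
            for b in higher[i + 1:]:
                if b not in adj[a]:                  # new fill edge (a, b)
                    adj[a].add(b)
                    adj[b].add(a)
                    filled.add(frozenset((a, b)))
    return filled


def has_fill_path(n, edges, x, y):
    """True if a path joins x and y through vertices all numbered below min(x, y)."""
    adj = build_adj(n, edges)
    limit = min(x, y)
    stack, seen = [x], {x}
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w == y:
                return True
            if w < limit and w not in seen:
                seen.add(w)
                stack.append(w)
    return False


# Path 4 - 1 - 0 - 2 - 5: the interior vertices 1, 0, 2 are all below 4 and 5.
edges = [(4, 1), (1, 0), (0, 2), (2, 5)]
filled = eliminate(6, edges)
```

In this example the pair (4, 5) is not an edge of the original graph but appears in the filled graph, for the same reason as the edge (x, y) on the slide.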

12 Algorithmic phases in sparse GE
1. Minimize number of fill-ins, maximize parallelism
  - Sparsity structure of L & U depends on that of A, which can be changed by row/column permutations (vertex re-labeling of the underlying graph)
  - Ordering (combinatorial algorithms; NP-complete to find optimum [Yannakakis 83]; use heuristics)
2. Predict the fill-in positions in L & U
  - Symbolic factorization (combinatorial algorithms)
3. Design efficient data structure for storage and quick retrieval of the nonzeros
  - Compressed storage schemes
4. Perform factorization and triangular solutions
  - Numerical algorithms (F.P. operations only on nonzeros)
  - Usually dominate the total runtime
- For sparse Cholesky and QR, the steps can be separate; for sparse LU with pivoting, steps 2 and 4 may be interleaved.
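A classic way to see why the ordering phase matters is the "arrow" (star) pattern, where one vertex is connected to all others. The sketch below counts fill edges for two elimination orders of the same graph; the function name `count_fill` and the example graph are illustrative choices, not part of the talk.

```python
def count_fill(n, edges, order):
    """Count fill edges created when vertices are eliminated in the given order."""
    adj = {v: set() for v in range(n)}
    for x, y in edges:
        adj[x].add(y)
        adj[y].add(x)
    eliminated = set()
    fill = 0
    for v in order:
        nbrs = sorted(u for u in adj[v] if u not in eliminated)
        # Eliminating v turns its remaining neighbors into a clique.
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
        eliminated.add(v)
    return fill


n = 6
star = [(0, i) for i in range(1, n)]                 # vertex 0 is the hub
fill_hub_first = count_fill(n, star, [0, 1, 2, 3, 4, 5])
fill_hub_last = count_fill(n, star, [1, 2, 3, 4, 5, 0])
```

Eliminating the hub first connects all five remaining vertices pairwise, while eliminating it last creates no fill at all; the matrix and its solution are unchanged, only the labeling differs.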

13 Distributed-memory parallelization
- 2D block-cyclic matrix distribution
- For j = 1, 2, 3, ..., number of supernodes:
  1. Block LU factorization: L(j,j) U(j,j) ← LU(A(j,j))
  2. L update: L(k,j) ← A(k,j) U(j,j)^{-1}, k > j
  3. U update: U(j,k) ← L(j,j)^{-1} A(j,k), k > j
  4. Rank-k update: A(i,k) ← A(i,k) − L(i,j) U(j,k), i, k > j
- Scalability challenges:
  - High degree of data & task dependency (DAG)
  - Irregular, indirect memory access
  - Low arithmetic intensity
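The four steps of the loop can be followed in a serial, dense sketch. It assumes NumPy, uniform blocks of size `nb` in place of supernodes, and no pivoting; `lu_no_pivot` and `block_lu` are names invented for this example, and nothing here reflects how SuperLU_DIST distributes the work.

```python
import numpy as np


def lu_no_pivot(A):
    """Unblocked LU without pivoting, used for the diagonal blocks."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    for k in range(n):
        U[k, k:] = A[k, k:]
        L[k + 1:, k] = A[k + 1:, k] / A[k, k]
        A[k + 1:, k + 1:] -= np.outer(L[k + 1:, k], U[k, k + 1:])
    return L, U


def block_lu(A, nb):
    """Right-looking block LU following steps 1-4 of the slide."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L, U = np.zeros((n, n)), np.zeros((n, n))
    for j in range(0, n, nb):
        e = min(j + nb, n)
        # 1. Block LU factorization of the diagonal block
        Ljj, Ujj = lu_no_pivot(A[j:e, j:e])
        L[j:e, j:e], U[j:e, j:e] = Ljj, Ujj
        if e < n:
            # 2. L update: L(k,j) = A(k,j) U(j,j)^{-1}, k > j
            L[e:, j:e] = np.linalg.solve(Ujj.T, A[e:, j:e].T).T
            # 3. U update: U(j,k) = L(j,j)^{-1} A(j,k), k > j
            U[j:e, e:] = np.linalg.solve(Ljj, A[j:e, e:])
            # 4. Rank-k update of the trailing submatrix
            A[e:, e:] -= L[e:, j:e] @ U[j:e, e:]
    return L, U


rng = np.random.default_rng(1)
A = rng.uniform(-1.0, 1.0, (7, 7)) + 8.0 * np.eye(7)  # diagonally dominant, so no pivoting needed
L, U = block_lu(A, nb=3)
```

Step 4 is a matrix-matrix multiply and carries most of the floating-point work, which is why the later slides focus on making it large and regular.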

14 SuperLU_DIST 2.5 on Cray XE6
- Profiling using IPM
- Synchronization dominates on a large number of cores: up to 96% of factorization time
[Figure: factorization and communication time (s) vs. number of cores; panels: Accelerator (sym), n = 2.7M, fill-ratio = 12; DNA, n = 445K, fill-ratio=]

15 SuperLU_DIST 3.0: better DAG scheduling
[Figure: look-ahead window. Factorization/communication time (s) vs. number of cores, version 2.5 vs. version 3.0; panels: Accelerator, n = 2.7M, fill-ratio = 12; DNA, n = 445K, fill-ratio = 609]
- Implemented new static scheduling and flexible look-ahead algorithms that shortened the length of the critical path.
- Idle time was significantly reduced (speedup up to 2.6x)
- To further improve performance:
  - more sophisticated scheduling schemes
  - hybrid programming paradigms

16 Performance of larger matrices

Name        Application                     Data type  N
matrix211   Fusion, MHD eqns (M3D-C1)       Real       801,
cc_linear2  Fusion, MHD eqns (NIMROD)       Complex    259,
matick      Circuit sim. MNA method (IBM)   Complex    16,
cage13      DNA electrophoresis             Real       445,

[Further columns: A / N (sparsity), L\U (10^6), fill-ratio; values missing]
- Sparsity ordering: METIS applied to the structure of A' + A

17 Strong scaling: MPI, Cray XE6
- 2 x 12-core AMD 'Magny-Cours' per node, 2.1 GHz processor
- Up to 1.4 Tflops factorization rate

18 Variety of node architectures
- Cray XE6: dual-socket x 2-die x 6-core, 24 cores
- Cray XC30: dual-socket x 8-core, 16 cores
- Cray XK7: 16-core AMD + K20X GPU
- Intel MIC: 16-core host cores co-processor

19 Multicore / GPU-Aware SuperLU
- New hybrid programming code: MPI+OpenMP+CUDA, able to use all the CPUs and GPUs on manycore computers.
- Algorithmic changes:
  - Aggregate small BLAS operations into larger ones.
  - CPU multithreading of Scatter/Gather operations.
  - Hide long-latency operations.
- Results: using 100-node GPU clusters, up to 2.7x faster, 2x-5x memory saving.
- New SuperLU_DIST 4.0 release, August

20 CPU + GPU algorithm
- Aggregate small blocks
- GEMM of large blocks
- Scatter
- GPU acceleration: software pipelining to overlap GPU execution with CPU Scatter and data transfer.
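The aggregate / GEMM / scatter sequence can be mimicked on the CPU with NumPy. This is a sketch under simplifying assumptions: the block sizes are arbitrary, the "scatter" is just slicing the large product back into per-block results, and no GPU or pipelining is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4                                                      # shared inner dimension
L_blocks = [rng.standard_normal((m, k)) for m in (2, 3, 1)]  # small blocks of an L panel
U_blocks = [rng.standard_normal((k, n)) for n in (3, 2)]     # small blocks of a U panel

# Baseline: one small GEMM per (L block, U block) pair.
small = [[Li @ Uj for Uj in U_blocks] for Li in L_blocks]

# Aggregate: stack the small blocks and do a single large GEMM.
L_big = np.vstack(L_blocks)
U_big = np.hstack(U_blocks)
V = L_big @ U_big

# Scatter: cut the large product back into the per-pair results.
row_off = np.cumsum([0] + [b.shape[0] for b in L_blocks])
col_off = np.cumsum([0] + [b.shape[1] for b in U_blocks])
scattered = [[V[row_off[i]:row_off[i + 1], col_off[j]:col_off[j + 1]]
              for j in range(len(U_blocks))]
             for i in range(len(L_blocks))]
```

The two versions perform the same arithmetic; the aggregated form simply hands the BLAS library one large, regular operation instead of many small ones.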

21 Software issues
- Use preprocessing to produce 4 versions {s, d, c, z}
  - Creating the macro-enabled base file the first time is clumsy; later maintenance is easier.
  - Templates in C++ would be better.
- Performance portability?
  - Need to adjust block size for each architecture
  - Larger blocks are better for a uniprocessor
  - Smaller blocks are better for parallelism and load balance
  - Open problem: automatic tuning for block size?
- Flexible interface?
  - Example: block diagonal preconditioner, M^{-1} A x = M^{-1} b with M = diag(A_11, A_22, A_33) → use SuperLU_DIST for each diagonal block
- No explicit funding for user support (other than SciDAC apps.)
[Figure: block diagonal matrix with blocks A_11, A_22, A_33]
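The block diagonal preconditioner example can be sketched with dense NumPy solves standing in for the per-block direct solver. Everything concrete here is an assumption for illustration: the test matrix, the block size, the helper name `apply_Minv`, and the use of a plain preconditioned Richardson iteration as the outer method.

```python
import numpy as np

rng = np.random.default_rng(0)
nb, bs = 3, 2                      # 3 diagonal blocks of size 2
n = nb * bs
A = 4.0 * np.eye(n) + 0.1 * rng.uniform(-1.0, 1.0, (n, n))  # weakly coupled blocks
b = rng.uniform(-1.0, 1.0, n)

# M = diag(A_11, A_22, A_33): keep only the diagonal blocks of A.
blocks = [A[i * bs:(i + 1) * bs, i * bs:(i + 1) * bs] for i in range(nb)]


def apply_Minv(r):
    """Apply M^{-1} by solving one independent system per diagonal block."""
    z = np.empty_like(r)
    for i, Aii in enumerate(blocks):
        s = slice(i * bs, (i + 1) * bs)
        z[s] = np.linalg.solve(Aii, r[s])  # a sparse direct solver would go here
    return z


# Preconditioned Richardson iteration: x <- x + M^{-1} (b - A x)
x = np.zeros(n)
for _ in range(30):
    x = x + apply_Minv(b - A @ x)
residual = np.linalg.norm(b - A @ x)
```

Each block solve is independent of the others, so in a parallel setting each one could be assigned to its own group of processes.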

23 Towards exascale
- Exascale machines will have hierarchical organization
  - Hierarchical memory, NUMA nodes: multicore, manycore
- Exascale applications will encompass multiphysics (coupled PDEs) and multiscale (time and space)
- Hierarchical algorithms and parallelism match machine and application features
- Studying two classes of algorithms for sparse linear systems:
  1. Domain decomposition hybrid method
    - General algebraic solver
  2. Low-rank factorization employing hierarchical matrices and randomization
    - Target PDE applications

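24 1. Domain decomposition, Schur-complement (PDSLin)
1. Graph-partition into subdomains; A_11 is block diagonal:

$$
\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} =
\begin{bmatrix} b_1 \\ b_2 \end{bmatrix},
\qquad
\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} =
\begin{bmatrix}
D_1 & & & & E_1 \\
 & D_2 & & & E_2 \\
 & & \ddots & & \vdots \\
 & & & D_k & E_k \\
F_1 & F_2 & \cdots & F_k & A_{22}
\end{bmatrix}
$$

2. Schur complement (S acts on the interface (separator) variables; no need to form it explicitly):

$$
S = A_{22} - A_{21} A_{11}^{-1} A_{12}
  = A_{22} - (U_{11}^{-T} A_{21}^T)^T (L_{11}^{-1} A_{12})
  = A_{22} - W\,G,
\qquad \text{where } A_{11} = L_{11} U_{11}
$$

3. Hybrid solution methods:
  (1) $x_2 = S^{-1}(b_2 - A_{21} A_{11}^{-1} b_1)$: iterative solver
  (2) $x_1 = A_{11}^{-1}(b_1 - A_{12} x_2)$: direct solver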
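The three steps can be traced on a tiny dense example with k = 2 subdomains. This is a sketch only: it assumes NumPy, forms S explicitly, and solves the interface system with a dense direct solve, whereas the slide uses an iterative solver for step (1) and does not form S explicitly. All sizes and the random test data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
k, nd, ns = 2, 3, 2                # subdomains, subdomain size, separator size

# Blocks of the bordered block-diagonal matrix (made diagonally dominant).
D = [rng.uniform(-1, 1, (nd, nd)) + 6.0 * np.eye(nd) for _ in range(k)]
E = [rng.uniform(-1, 1, (nd, ns)) for _ in range(k)]
F = [rng.uniform(-1, 1, (ns, nd)) for _ in range(k)]
A22 = rng.uniform(-1, 1, (ns, ns)) + 10.0 * np.eye(ns)
b1 = [rng.uniform(-1, 1, nd) for _ in range(k)]
b2 = rng.uniform(-1, 1, ns)

# Schur complement S = A22 - sum_l F_l D_l^{-1} E_l and the reduced right-hand side.
S = A22.copy()
rhs = b2.copy()
for l in range(k):
    S -= F[l] @ np.linalg.solve(D[l], E[l])
    rhs -= F[l] @ np.linalg.solve(D[l], b1[l])

x2 = np.linalg.solve(S, rhs)                                        # interface unknowns
x1 = [np.linalg.solve(D[l], b1[l] - E[l] @ x2) for l in range(k)]   # subdomain unknowns

# Assemble the full system only to check the result.
n = k * nd + ns
A = np.zeros((n, n))
for l in range(k):
    r = slice(l * nd, (l + 1) * nd)
    A[r, r] = D[l]
    A[r, k * nd:] = E[l]
    A[k * nd:, r] = F[l]
A[k * nd:, k * nd:] = A22
x = np.concatenate(x1 + [x2])
b = np.concatenate(b1 + [b2])
```

Because A_11 is block diagonal, every solve with D_l is independent of the other subdomains; only the small interface system couples them.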

25 Hierarchical parallelism
- Multiple processors per subdomain
  - One subdomain with 2x3 procs (e.g. SuperLU_DIST)
[Figure: block matrix with processor assignment. D_1, E_1, F_1: P(0:5); D_2, E_2, F_2: P(6:11); D_3, E_3, F_3: P(12:17); D_4, E_4, F_4: P(18:23); A_22]
- Advantages:
  - Constant number of subdomains, Schur complement size, and convergence rate, regardless of core count.
  - Need only a modest level of parallelism from the direct solver.
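The processor assignment in the figure, six consecutive ranks per subdomain arranged as a 2x3 grid, can be expressed as a small index calculation. The function name `locate` and the row-major numbering inside each grid are assumptions made for this sketch.

```python
def locate(rank, procs_per_subdomain=6, grid=(2, 3)):
    """Map a global rank to (subdomain number, (grid row, grid column))."""
    nprow, npcol = grid
    assert nprow * npcol == procs_per_subdomain
    subdomain, local = divmod(rank, procs_per_subdomain)
    row, col = divmod(local, npcol)   # row-major numbering within the grid (assumed)
    return subdomain + 1, (row, col)  # subdomains numbered from 1, as D_1 ... D_4


# Ranks 0..23 cover four subdomains: P(0:5), P(6:11), P(12:17), P(18:23).
owners = [locate(r)[0] for r in range(24)]
```

For example, `locate(7)` places rank 7 in subdomain 2 at grid position (0, 1).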

26 PDSLin in Omega3P: Cryomodule
- Computation parameters: 2.3M elements
- PIP2 cryomodule consisting of 8 cavities
- First order finite element (p = 1)
  - 39M non-zeroes, 2.5M DOFs
  - Solution time on hopper using 50 nodes and 600 cores: 863 ms (total)
- Second order finite element (p = 2)
  - 590M non-zeroes, 14M DOFs
  - Solution time on edison using 400 nodes, 4800 cores: 5:40 min (wall)
  - Using MUMPS with 400 nodes, 800 cores, solution time: 6:46 min (wall)

27 New mathematical algorithms
- K-way, multi-constraint graph partitioning
  - Small separator, similar subdomains, similar connectivity
  - Both intra- and inter-group load balance
- Sparse triangular solution with many sparse RHS (intra-subdomain)

$$
S = A_{22} - \sum_l (U_l^{-T} F_l^T)^T (L_l^{-1} E_l) = A_{22} - \sum_l W_l G_l,
\qquad \text{where } D_l = L_l U_l
$$

- Sparse matrix-matrix multiplication (inter-subdomain)

$$
W \leftarrow \mathrm{sparsify}(W, \sigma_1); \quad
G \leftarrow \mathrm{sparsify}(G, \sigma_1); \quad
T^{(p)} \leftarrow W^{(p)} G^{(p)}; \quad
\hat{S}^{(p)} \leftarrow A_{22}^{(p)} - \sum_q T^{(q)}(p); \quad
S \leftarrow \mathrm{sparsify}(\hat{S}, \sigma_2)
$$

I. Yamazaki, F.-H. Rouet, X.S. Li, B. Ucar, On partitioning and reordering problems in a hierarchically parallel hybrid linear solver, IPDPS / PDSEC Workshop, May 24,

28 2. HSS-embedded sparse factorization
- Dense, but data-sparse: hierarchically semi-separable (HSS) structure
  - PDEs with smooth kernels: off-diagonal blocks are rank deficient
  - Recursion leads to hierarchical partitioning
  - Key to low complexity: nested bases

$$
A \approx \begin{bmatrix}
\begin{bmatrix} D_1 & U_1 B_1 V_2^T \\ U_2 B_2 V_1^T & D_2 \end{bmatrix}
& \begin{bmatrix} U_1 R_1 \\ U_2 R_2 \end{bmatrix} B_3
  \begin{bmatrix} W_4^T V_4^T & W_5^T V_5^T \end{bmatrix} \\
\begin{bmatrix} U_4 R_4 \\ U_5 R_5 \end{bmatrix} B_6
  \begin{bmatrix} W_1^T V_1^T & W_2^T V_2^T \end{bmatrix}
& \begin{bmatrix} D_4 & U_4 B_4 V_5^T \\ U_5 B_5 V_4^T & D_5 \end{bmatrix}
\end{bmatrix}
$$

[Figure: HSS tree]
- Sparse: apply HSS to dense separators/supernodes
- Nested tree-parallelism:
  - Outer tree: separator tree
  - Inner tree: HSS tree
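The claim that off-diagonal blocks are rank deficient for smooth kernels can be checked numerically on a toy matrix. The sketch assumes NumPy, and the kernel 1 / (0.1 + |x - y|) on a uniform 1D grid, the tolerance 1e-8, and the single two-way split are illustrative choices rather than anything taken from the talk.

```python
import numpy as np

n = 64
x = np.linspace(0.0, 1.0, n)
A = 1.0 / (0.1 + np.abs(x[:, None] - x[None, :]))   # dense kernel matrix

# Off-diagonal block coupling the first half of the points to the second half.
B = A[: n // 2, n // 2:]
U, s, Vt = np.linalg.svd(B)

tol = 1e-8
r = int(np.sum(s > tol * s[0]))          # numerical rank at relative tolerance tol

# Truncated factors: store r columns of U and r rows of Vt instead of the full block.
B_r = (U[:, :r] * s[:r]) @ Vt[:r, :]
err = np.linalg.norm(B - B_r, 2)
```

The block has 32 x 32 entries, but its numerical rank `r` is much smaller than 32, so the truncated factors need far less storage. An HSS representation applies this compression recursively and additionally shares (nests) the bases between levels, which this sketch does not attempt.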

29 3D Helmholtz
- Helmholtz equation with PML boundary:

$$
\left(-\Delta - \frac{\omega^2}{v(x)^2}\right) u(x,\omega) = s(x,\omega)
$$

- N = 27M, procs = 1024
- Max rank = 1391 (tolerance = 1e-4)
[Table: columns Times (s), Gflops (peak %), Comm %, Mem (GB); rows MF, MF + HSS, HSS-compr. Surviving entries: MF (27.7%) 32.6 % 3144; MF + HSS, HSS-compr (29.2%) 41.2 % 15.3 %]

30 New compression kernel: Randomized Sampling
- Traditional methods: SVD, rank-revealing QR
  - Difficult to scale up
  - Extend-add HSS structures of different shapes
- Randomized sampling:
  1. Pick random matrix Ω of size n x (k+p), with p small
  2. Sample matrix S = A Ω, with slight oversampling p
  3. Compute Q = ON-basis(S), orthonormal basis of S
- Accuracy: with high probability $1 - 6\,p^{-p}$,

$$
\|A - QQ^*A\| \le \left(1 + 11\sqrt{k+p}\,\sqrt{\min(m,n)}\right)\sigma_{k+1}
$$

- Benefits: kernel becomes dense matrix-matrix multiply
  - Extend-add tall-skinny dense matrices of conforming shapes
  - Scalable and resilient algorithms exist
  - Even faster, if fast matrix-vector multiply is available (e.g. FMM)
  - Matrix-free solver, if only the matrix-vector action is available
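The three sampling steps and the accuracy bound can be tried on a matrix with known singular values. The sketch assumes NumPy; the matrix sizes, the geometric decay of the singular values, the Gaussian choice for Ω, and the values k = 10, p = 5 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, p = 80, 60, 10, 5

# Test matrix A = U0 diag(sigma) V0^T with prescribed singular values.
U0, _ = np.linalg.qr(rng.standard_normal((m, n)))
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = 2.0 ** -np.arange(n)               # sigma[i] is the (i+1)-th singular value
A = (U0 * sigma) @ V0.T

Omega = rng.standard_normal((n, k + p))    # 1. random matrix, n x (k+p)
S = A @ Omega                              # 2. sample matrix (a dense matrix-matrix multiply)
Q, _ = np.linalg.qr(S)                     # 3. orthonormal basis of the sample

err = np.linalg.norm(A - Q @ (Q.T @ A), 2)
bound = (1.0 + 11.0 * np.sqrt(k + p) * np.sqrt(min(m, n))) * sigma[k]
```

Only products with A are needed to build Q, which is why the approach also works when A is available solely through a matrix-vector routine.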

31 Summary, forward looking
- Direct solvers can scale to 1000s of cores
- Domain-decomposition type of hybrid solvers can scale to 10,000s of cores
  - Can also maintain robustness
- Expect to scale more with low-rank structured factorization methods
  - Extend to general solver framework, examine feasibility with a wider class of problems


More information

Parallelism in FreeFem++.

Parallelism in FreeFem++. Parallelism in FreeFem++. Guy Atenekeng 1 Frederic Hecht 2 Laura Grigori 1 Jacques Morice 2 Frederic Nataf 2 1 INRIA, Saclay 2 University of Paris 6 Workshop on FreeFem++, 2009 Outline 1 Introduction Motivation

More information

Fine-grained Parallel Incomplete LU Factorization

Fine-grained Parallel Incomplete LU Factorization Fine-grained Parallel Incomplete LU Factorization Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Sparse Days Meeting at CERFACS June 5-6, 2014 Contribution

More information

Open-source finite element solver for domain decomposition problems

Open-source finite element solver for domain decomposition problems 1/29 Open-source finite element solver for domain decomposition problems C. Geuzaine 1, X. Antoine 2,3, D. Colignon 1, M. El Bouajaji 3,2 and B. Thierry 4 1 - University of Liège, Belgium 2 - University

More information

Computing least squares condition numbers on hybrid multicore/gpu systems

Computing least squares condition numbers on hybrid multicore/gpu systems Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning

More information

Petascale Quantum Simulations of Nano Systems and Biomolecules

Petascale Quantum Simulations of Nano Systems and Biomolecules Petascale Quantum Simulations of Nano Systems and Biomolecules Emil Briggs North Carolina State University 1. Outline of real-space Multigrid (RMG) 2. Scalability and hybrid/threaded models 3. GPU acceleration

More information

ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU

ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU STEVE RENNICH, SR. ENGINEER, NVIDIA DEVELOPER TECHNOLOGY DARKO STOSIC, PHD CANDIDATE, UNIV. FEDERAL DE PERNAMBUCO TIM DAVIS, PROFESSOR, CSE, TEXAS

More information

Computational Linear Algebra

Computational Linear Algebra Computational Linear Algebra PD Dr. rer. nat. habil. Ralf Peter Mundani Computation in Engineering / BGU Scientific Computing in Computer Science / INF Winter Term 2017/18 Part 2: Direct Methods PD Dr.

More information

AN INDEPENDENT LOOPS SEARCH ALGORITHM FOR SOLVING INDUCTIVE PEEC LARGE PROBLEMS

AN INDEPENDENT LOOPS SEARCH ALGORITHM FOR SOLVING INDUCTIVE PEEC LARGE PROBLEMS Progress In Electromagnetics Research M, Vol. 23, 53 63, 2012 AN INDEPENDENT LOOPS SEARCH ALGORITHM FOR SOLVING INDUCTIVE PEEC LARGE PROBLEMS T.-S. Nguyen *, J.-M. Guichon, O. Chadebec, G. Meunier, and

More information

Direct and Incomplete Cholesky Factorizations with Static Supernodes

Direct and Incomplete Cholesky Factorizations with Static Supernodes Direct and Incomplete Cholesky Factorizations with Static Supernodes AMSC 661 Term Project Report Yuancheng Luo 2010-05-14 Introduction Incomplete factorizations of sparse symmetric positive definite (SSPD)

More information

Using an Auction Algorithm in AMG based on Maximum Weighted Matching in Matrix Graphs

Using an Auction Algorithm in AMG based on Maximum Weighted Matching in Matrix Graphs Using an Auction Algorithm in AMG based on Maximum Weighted Matching in Matrix Graphs Pasqua D Ambra Institute for Applied Computing (IAC) National Research Council of Italy (CNR) pasqua.dambra@cnr.it

More information

Solving PDEs with CUDA Jonathan Cohen

Solving PDEs with CUDA Jonathan Cohen Solving PDEs with CUDA Jonathan Cohen jocohen@nvidia.com NVIDIA Research PDEs (Partial Differential Equations) Big topic Some common strategies Focus on one type of PDE in this talk Poisson Equation Linear

More information

Leveraging Task-Parallelism in Energy-Efficient ILU Preconditioners

Leveraging Task-Parallelism in Energy-Efficient ILU Preconditioners Leveraging Task-Parallelism in Energy-Efficient ILU Preconditioners José I. Aliaga Leveraging task-parallelism in energy-efficient ILU preconditioners Universidad Jaime I (Castellón, Spain) José I. Aliaga

More information

An Efficient Solver for Sparse Linear Systems based on Rank-Structured Cholesky Factorization

An Efficient Solver for Sparse Linear Systems based on Rank-Structured Cholesky Factorization An Efficient Solver for Sparse Linear Systems based on Rank-Structured Cholesky Factorization David Bindel and Jeffrey Chadwick Department of Computer Science Cornell University 30 October 2015 (Department

More information

FINDING PARALLELISM IN GENERAL-PURPOSE LINEAR PROGRAMMING

FINDING PARALLELISM IN GENERAL-PURPOSE LINEAR PROGRAMMING FINDING PARALLELISM IN GENERAL-PURPOSE LINEAR PROGRAMMING Daniel Thuerck 1,2 (advisors Michael Goesele 1,2 and Marc Pfetsch 1 ) Maxim Naumov 3 1 Graduate School of Computational Engineering, TU Darmstadt

More information

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Nano-scale Integrated Circuit and System (NICS) Laboratory Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Xiaoming Chen PhD Candidate Department of Electronic Engineering Tsinghua University,

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Decompositions, numerical aspects Gerard Sleijpen and Martin van Gijzen September 27, 2017 1 Delft University of Technology Program Lecture 2 LU-decomposition Basic algorithm Cost

More information

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects Numerical Linear Algebra Decompositions, numerical aspects Program Lecture 2 LU-decomposition Basic algorithm Cost Stability Pivoting Cholesky decomposition Sparse matrices and reorderings Gerard Sleijpen

More information

A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems

A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems Outline A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems Azzam Haidar CERFACS, Toulouse joint work with Luc Giraud (N7-IRIT, France) and Layne Watson (Virginia Polytechnic Institute,

More information

A communication-avoiding thick-restart Lanczos method on a distributed-memory system

A communication-avoiding thick-restart Lanczos method on a distributed-memory system A communication-avoiding thick-restart Lanczos method on a distributed-memory system Ichitaro Yamazaki and Kesheng Wu Lawrence Berkeley National Laboratory, Berkeley, CA, USA Abstract. The Thick-Restart

More information

A DISTRIBUTED-MEMORY RANDOMIZED STRUCTURED MULTIFRONTAL METHOD FOR SPARSE DIRECT SOLUTIONS

A DISTRIBUTED-MEMORY RANDOMIZED STRUCTURED MULTIFRONTAL METHOD FOR SPARSE DIRECT SOLUTIONS A DISTRIBUTED-MEMORY RANDOMIZED STRUCTURED MULTIFRONTAL METHOD FOR SPARSE DIRECT SOLUTIONS ZIXING XIN, JIANLIN XIA, MAARTEN V. DE HOOP, STEPHEN CAULEY, AND VENKATARAMANAN BALAKRISHNAN Abstract. We design

More information

CME342 Parallel Methods in Numerical Analysis. Matrix Computation: Iterative Methods II. Sparse Matrix-vector Multiplication.

CME342 Parallel Methods in Numerical Analysis. Matrix Computation: Iterative Methods II. Sparse Matrix-vector Multiplication. CME342 Parallel Methods in Numerical Analysis Matrix Computation: Iterative Methods II Outline: CG & its parallelization. Sparse Matrix-vector Multiplication. 1 Basic iterative methods: Ax = b r = b Ax

More information

Dynamic Scheduling within MAGMA

Dynamic Scheduling within MAGMA Dynamic Scheduling within MAGMA Emmanuel Agullo, Cedric Augonnet, Jack Dongarra, Mathieu Faverge, Julien Langou, Hatem Ltaief, Samuel Thibault and Stanimire Tomov April 5, 2012 Innovative and Computing

More information

Linear Solvers. Andrew Hazel

Linear Solvers. Andrew Hazel Linear Solvers Andrew Hazel Introduction Thus far we have talked about the formulation and discretisation of physical problems...... and stopped when we got to a discrete linear system of equations. Introduction

More information

Communication-avoiding LU and QR factorizations for multicore architectures

Communication-avoiding LU and QR factorizations for multicore architectures Communication-avoiding LU and QR factorizations for multicore architectures DONFACK Simplice INRIA Saclay Joint work with Laura Grigori INRIA Saclay Alok Kumar Gupta BCCS,Norway-5075 16th April 2010 Communication-avoiding

More information

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 6 Matrix Models Section 6.2 Low Rank Approximation Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Edgar

More information

IMPROVING THE PERFORMANCE OF SPARSE LU MATRIX FACTORIZATION USING A SUPERNODAL ALGORITHM

IMPROVING THE PERFORMANCE OF SPARSE LU MATRIX FACTORIZATION USING A SUPERNODAL ALGORITHM IMPROVING THE PERFORMANCE OF SPARSE LU MATRIX FACTORIZATION USING A SUPERNODAL ALGORITHM Bogdan OANCEA PhD, Associate Professor, Artife University, Bucharest, Romania E-mail: oanceab@ie.ase.ro Abstract:

More information

Parallel Preconditioning Methods for Ill-conditioned Problems

Parallel Preconditioning Methods for Ill-conditioned Problems Parallel Preconditioning Methods for Ill-conditioned Problems Kengo Nakajima Information Technology Center, The University of Tokyo 2014 Conference on Advanced Topics and Auto Tuning in High Performance

More information

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros

More information

MULTI-LAYER HIERARCHICAL STRUCTURES AND FACTORIZATIONS

MULTI-LAYER HIERARCHICAL STRUCTURES AND FACTORIZATIONS MULTI-LAYER HIERARCHICAL STRUCTURES AND FACTORIZATIONS JIANLIN XIA Abstract. We propose multi-layer hierarchically semiseparable MHS structures for the fast factorizations of dense matrices arising from

More information

A Sparse QS-Decomposition for Large Sparse Linear System of Equations

A Sparse QS-Decomposition for Large Sparse Linear System of Equations A Sparse QS-Decomposition for Large Sparse Linear System of Equations Wujian Peng 1 and Biswa N. Datta 2 1 Department of Math, Zhaoqing University, Zhaoqing, China, douglas peng@yahoo.com 2 Department

More information

Scalable Domain Decomposition Preconditioners For Heterogeneous Elliptic Problems

Scalable Domain Decomposition Preconditioners For Heterogeneous Elliptic Problems Scalable Domain Decomposition Preconditioners For Heterogeneous Elliptic Problems Pierre Jolivet, F. Hecht, F. Nataf, C. Prud homme Laboratoire Jacques-Louis Lions Laboratoire Jean Kuntzmann INRIA Rocquencourt

More information

An Efficient Graph Sparsification Approach to Scalable Harmonic Balance (HB) Analysis of Strongly Nonlinear RF Circuits

An Efficient Graph Sparsification Approach to Scalable Harmonic Balance (HB) Analysis of Strongly Nonlinear RF Circuits Design Automation Group An Efficient Graph Sparsification Approach to Scalable Harmonic Balance (HB) Analysis of Strongly Nonlinear RF Circuits Authors : Lengfei Han (Speaker) Xueqian Zhao Dr. Zhuo Feng

More information

V C V L T I 0 C V B 1 V T 0 I. l nk

V C V L T I 0 C V B 1 V T 0 I. l nk Multifrontal Method Kailai Xu September 16, 2017 Main observation. Consider the LDL T decomposition of a SPD matrix [ ] [ ] [ ] [ ] B V T L 0 I 0 L T L A = = 1 V T V C V L T I 0 C V B 1 V T, 0 I where

More information

Fast matrix algebra for dense matrices with rank-deficient off-diagonal blocks

Fast matrix algebra for dense matrices with rank-deficient off-diagonal blocks CHAPTER 2 Fast matrix algebra for dense matrices with rank-deficient off-diagonal blocks Chapter summary: The chapter describes techniques for rapidly performing algebraic operations on dense matrices

More information

Efficient implementation of the overlap operator on multi-gpus

Efficient implementation of the overlap operator on multi-gpus Efficient implementation of the overlap operator on multi-gpus Andrei Alexandru Mike Lujan, Craig Pelissier, Ben Gamari, Frank Lee SAAHPC 2011 - University of Tennessee Outline Motivation Overlap operator

More information

Exploiting hyper-sparsity when computing preconditioners for conjugate gradients in interior point methods

Exploiting hyper-sparsity when computing preconditioners for conjugate gradients in interior point methods Exploiting hyper-sparsity when computing preconditioners for conjugate gradients in interior point methods Julian Hall, Ghussoun Al-Jeiroudi and Jacek Gondzio School of Mathematics University of Edinburgh

More information

5.1 Banded Storage. u = temperature. The five-point difference operator. uh (x, y + h) 2u h (x, y)+u h (x, y h) uh (x + h, y) 2u h (x, y)+u h (x h, y)

5.1 Banded Storage. u = temperature. The five-point difference operator. uh (x, y + h) 2u h (x, y)+u h (x, y h) uh (x + h, y) 2u h (x, y)+u h (x h, y) 5.1 Banded Storage u = temperature u= u h temperature at gridpoints u h = 1 u= Laplace s equation u= h u = u h = grid size u=1 The five-point difference operator 1 u h =1 uh (x + h, y) 2u h (x, y)+u h

More information

Solving PDEs with Multigrid Methods p.1

Solving PDEs with Multigrid Methods p.1 Solving PDEs with Multigrid Methods Scott MacLachlan maclachl@colorado.edu Department of Applied Mathematics, University of Colorado at Boulder Solving PDEs with Multigrid Methods p.1 Support and Collaboration

More information

Solving linear systems (6 lectures)

Solving linear systems (6 lectures) Chapter 2 Solving linear systems (6 lectures) 2.1 Solving linear systems: LU factorization (1 lectures) Reference: [Trefethen, Bau III] Lecture 20, 21 How do you solve Ax = b? (2.1.1) In numerical linear

More information

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is

More information

Scalable Non-blocking Preconditioned Conjugate Gradient Methods

Scalable Non-blocking Preconditioned Conjugate Gradient Methods Scalable Non-blocking Preconditioned Conjugate Gradient Methods Paul Eller and William Gropp University of Illinois at Urbana-Champaign Department of Computer Science Supercomputing 16 Paul Eller and William

More information

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters H. Köstler 2nd International Symposium Computer Simulations on GPU Freudenstadt, 29.05.2013 1 Contents Motivation walberla software concepts

More information

Solving Large Nonlinear Sparse Systems

Solving Large Nonlinear Sparse Systems Solving Large Nonlinear Sparse Systems Fred W. Wubs and Jonas Thies Computational Mechanics & Numerical Mathematics University of Groningen, the Netherlands f.w.wubs@rug.nl Centre for Interdisciplinary

More information

Integration of PETSc for Nonlinear Solves

Integration of PETSc for Nonlinear Solves Integration of PETSc for Nonlinear Solves Ben Jamroz, Travis Austin, Srinath Vadlamani, Scott Kruger Tech-X Corporation jamroz@txcorp.com http://www.txcorp.com NIMROD Meeting: Aug 10, 2010 Boulder, CO

More information

Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning

Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology, USA SPPEXA Symposium TU München,

More information

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros

More information

LU Factorization. Marco Chiarandini. DM559 Linear and Integer Programming. Department of Mathematics & Computer Science University of Southern Denmark

LU Factorization. Marco Chiarandini. DM559 Linear and Integer Programming. Department of Mathematics & Computer Science University of Southern Denmark DM559 Linear and Integer Programming LU Factorization Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark [Based on slides by Lieven Vandenberghe, UCLA] Outline

More information

MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors

MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors J. Dongarra, M. Gates, A. Haidar, Y. Jia, K. Kabir, P. Luszczek, and S. Tomov University of Tennessee, Knoxville 05 / 03 / 2013 MAGMA:

More information