Large Scale Sparse Linear Algebra


P. Amestoy (INP-N7, IRIT), A. Buttari (CNRS, IRIT), T. Mary (University of Toulouse, IRIT), A. Guermouche (Univ. Bordeaux, LaBRI), J.-Y. L'Excellent (INRIA, LIP, ENS-Lyon), B. Uçar (CNRS, LIP, ENS-Lyon), F.-H. Rouet (LSTC, Livermore, USA), C. Weisbecker (LSTC, Livermore, USA)

Main principle

Principle: build an approximate factorization A_ε = L_ε U_ε at a given accuracy ε.

Part I: asymptotic complexity reduction. Theoretical proof and experimental validation that (3D case):
- Operations: O(N^6) → O(N^5) → O(N^4)
- Memory: O(N^4) → O(N^3 log N)

Part II: efficient and scalable algorithms. How to design algorithms that efficiently translate the theoretical complexity reduction into actual performance and memory gains for large-scale systems and applications?

Impact on industrial applications

[Figure: example application models; axes Dip (km), Cross (km), Depth (km), color scale in m/s.]

- Structural mechanics: matrix of order 8M, required accuracy 10^-9.
- Seismic imaging: matrix of order 17M, required accuracy 10^-3.
- Electromagnetism: matrix of order 30M, required accuracy 10^-7.

Results on 900 cores: [Table: factorization time (s) with MUMPS vs. BLR and their ratio, and memory per process (GB) with MUMPS vs. BLR and the resulting gain (%), for the structural, seismic, and electromagnetism applications.]

Introduction

Rank and rank-k approximation

In the following, B is a dense matrix of size m × n.

Definition 1 (Rank). The rank k of B is defined as the smallest integer such that there exist matrices X and Y, of sizes m × k and n × k, such that B = XY^T.

Definition 2. We call a rank-k approximation of B at accuracy ε any matrix B̃ of rank k such that ‖B − B̃‖ ≤ ε.

Optimal rank-k approximation

Theorem 3 (Eckart-Young). Let UΣV^T be the SVD of B and let σ_i = Σ_{i,i} denote its singular values. Then B̃ = U_{1:m,1:k} Σ_{1:k,1:k} V_{1:n,1:k}^T is the optimal rank-k approximation of B, and ‖B − B̃‖_2 = σ_{k+1}.
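The theorem is easy to check numerically. Below is a minimal NumPy sketch (not from the slides; the matrix sizes are arbitrary) that builds the optimal rank-k approximation from the SVD and verifies that the 2-norm error is σ_{k+1}.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 60, 40, 10
# A matrix that is nearly rank k: a rank-k product plus small noise
B = rng.standard_normal((m, k)) @ rng.standard_normal((k, n)) \
    + 1e-6 * rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(B, full_matrices=False)   # B = U diag(s) V^T
Bk = (U[:, :k] * s[:k]) @ Vt[:k, :]                # optimal rank-k approximation

# Eckart-Young: the 2-norm error equals the (k+1)-th singular value
assert np.isclose(np.linalg.norm(B - Bk, 2), s[k])
```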

Numerical rank

Definition 4 (Numerical rank). The numerical rank Rk_ε(B) of B at accuracy ε is defined as the smallest integer k_ε such that there exists a matrix B̃ of rank k_ε with ‖B − B̃‖ ≤ ε.

Theorem 5. Let UΣV^T be the SVD of B and let σ_i = Σ_{i,i} denote its singular values. Then the numerical rank of B at accuracy ε is given by

k_ε = min { k : 1 ≤ k ≤ min(m, n), σ_{k+1} ≤ ε }.

Proof: in exercise.

Low-rank matrices

If the numerical rank of B is equal to min(m, n), then B is said to be full-rank. Conversely, if Rk_ε(B) < min(m, n), then B is said to be rank-deficient. A class of rank-deficient matrices of particular interest are low-rank matrices, defined as follows.

Definition 6 (Low-rank matrix). B is said to be low-rank (for a given accuracy ε) if its numerical rank k_ε is small enough that its rank-k_ε approximation B̃ = XY^T requires less storage than the full-rank matrix B, i.e., if k_ε (m + n) < mn. In that case, B̃ is said to be a low-rank approximation of B, and ε is called the low-rank threshold.

In the following, for the sake of simplicity, we refer to the numerical rank of a matrix at accuracy ε simply as its rank.
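A direct translation of Theorem 5 and Definition 6 into NumPy (a sketch; the helper names are ours, not the solver's):

```python
import numpy as np

def numerical_rank(B, eps):
    """Smallest k such that sigma_{k+1} <= eps (Theorem 5, 2-norm)."""
    s = np.linalg.svd(B, compute_uv=False)
    below = np.nonzero(s <= eps)[0]
    return int(below[0]) if below.size else min(B.shape)

def is_low_rank(B, eps):
    """Storage test of Definition 6: k_eps * (m + n) < m * n."""
    m, n = B.shape
    return numerical_rank(B, eps) * (m + n) < m * n
```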

Compression kernels

The act of computing B̃ from B is called the compression of B. What are the different methods to compress B?

- SVD: optimal but expensive: O(mn min(m, n)) operations.
- Truncated QR factorization: slightly less accurate but much cheaper: O(mn k_ε) operations; widely used.
- Multiple other methods: randomized algorithms, adaptive cross approximation, interpolative decomposition, CUR, etc.

In the following, we assume truncated QR is used as the compression kernel.

Low-rank subblocks

Frontal matrices are not low-rank, but in some applications they exhibit low-rank blocks. A block B represents the interaction between two subdomains σ and τ. If they have a small diameter and are far away from each other, their interaction is weak and the rank is low. The block-admissibility condition formalizes this intuition:

σ × τ is admissible ⇔ max(diam(σ), diam(τ)) ≤ η dist(σ, τ)

[Figure: the rank of the block σ × τ decreases as the distance between σ and τ increases.]
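As a concrete illustration of the truncated QR kernel, here is a sketch based on SciPy's column-pivoted QR (the function name `compress` is ours; a real O(mn k_ε) kernel would stop the factorization at rank k_ε instead of truncating a full one, and the diagonal-of-R criterion is a common heuristic, not the slides' exact rule):

```python
import numpy as np
from scipy.linalg import qr

def compress(B, eps):
    """Truncated (column-pivoted) QR compression: B ~ X @ Yt, X orthonormal."""
    Q, R, piv = qr(B, mode='economic', pivoting=True)
    k = int(np.sum(np.abs(np.diag(R)) > eps))   # truncation rank
    X = Q[:, :k]
    Yt = np.empty((k, B.shape[1]))
    Yt[:, piv] = R[:k, :]                       # undo the column permutation
    return X, Yt
```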

Block Low-Rank matrices

A BLR matrix is defined by a partition P = S × S, with S = {σ_1, ..., σ_p}.

[Figure: a BLR matrix partitioned by σ_1, ..., σ_6; gray blocks are full-rank, white blocks are low-rank.]

- Gray blocks are non-admissible and therefore kept full-rank.
- White blocks are admissible and therefore compressed to low-rank.

Standard BLR factorization: FSCU (Factor, Solve, Compress, Update). To evaluate the complexity of the BLR factorization, we must compute the cost of these four main steps.
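To fix ideas, a toy sketch of building a BLR representation (assumptions: square matrix, uniform block size b dividing the order, and the crude rule "diagonal blocks are non-admissible, all other blocks are admissible"; a real partition uses the admissibility condition above):

```python
def to_blr(A, b, eps, compress):
    """Compress every off-diagonal b x b block of A with `compress`
    (e.g. the truncated-QR kernel sketched earlier); keep diagonal
    blocks full-rank. Returns the blocks and the storage in entries."""
    p = A.shape[0] // b
    blocks, storage = {}, 0
    for i in range(p):
        for j in range(p):
            blk = A[i*b:(i+1)*b, j*b:(j+1)*b]
            if i == j:                    # non-admissible: keep full-rank
                blocks[i, j] = blk
                storage += blk.size
            else:                         # admissible: store X, Y^T
                X, Yt = compress(blk, eps)
                blocks[i, j] = (X, Yt)
                storage += X.size + Yt.size
    return blocks, storage
```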

Part I: complexity of the BLR factorization

Cost analysis of the involved steps

Let us consider two blocks A and B of size b × b and of rank bounded by r.

step      type    operation        cost
Factor    FR      A ← LU           O(b^3)
Solve     FR-FR   B ← B U^{-1}     O(b^3)
Compress  LR      A → Ã            O(b^2 r)
Update    FR-FR   C ← C − AB       O(b^3)
Update    LR-FR   C ← C − ÃB       O(b^2 r)
Update    FR-LR   C ← C − AB̃      O(b^2 r)
Update    LR-LR   C ← C − ÃB̃      O(b^2 r)

This is not enough to compute the complexity: we need to bound the number of FR blocks!

Bounding the number of FR blocks

BLR-admissibility condition of a partition P:

P is admissible ⇔ for every σ and every τ: #{τ : σ × τ ∈ P is not admissible} ≤ q and #{σ : σ × τ ∈ P is not admissible} ≤ q

[Figure: a non-admissible vs. an admissible partition.]

Main result: for any matrix, we can build an admissible P for q = O(1), such that the maximal rank of the admissible blocks of A is r.

Amestoy, Buttari, L'Excellent, and Mary. On the Complexity of the Block Low-Rank Multifrontal Factorization, SIAM J. Sci. Comput., 2017.

Memory complexity of the dense BLR factorization

Let us consider a dense (frontal) matrix of order m divided into p × p blocks of order b, with p = m/b. The memory complexity to store the matrix can be computed as

M_total(b, p, r) = M_FR(b, p) + M_LR(b, p, r)

Ex. 1: compute M_FR(b, p) and M_LR(b, p, r).
Ex. 2: assuming b = O(m^x) and r = O(m^α), compute M_total(m, x, α).
Ex. 3: compute the optimal block size b = O(m^x) and the resulting optimal complexity M_opt(m, r).

(A worked sketch of these exercises is given below.)
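A sketch of one way to carry out the exercises, consistent with the O(m^{3/2} r^{1/2}) bound proven in the SIAM paper cited above:

```latex
% Ex. 1: q = O(1) full-rank blocks per block-row, p block-rows of b^2 entries:
M_{FR}(b,p) = O(p\,b^2) = O(mb); \qquad
% all O(p^2) admissible blocks stored as X Y^T (2br entries each):
M_{LR}(b,p,r) = O(p^2\,br) = O(m^2 r / b).
% Ex. 2: with b = O(m^x) and r = O(m^\alpha):
M_{total}(m,x,\alpha) = O(m^{1+x} + m^{2+\alpha-x}).
% Ex. 3: the two terms balance when 1+x = 2+\alpha-x, i.e. x = (1+\alpha)/2,
% so b = O(\sqrt{mr}) and M_{opt}(m,r) = O(m^{3/2} r^{1/2}).
```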

Flop complexity of the dense BLR factorization

Let us consider a dense (frontal) matrix of order m divided into p × p blocks of order b, with p = m/b.

step      type    cost       number   C_step(b, p, r)   C_step(m, x, α)
Factor    FR      O(b^3)
Solve     FR-FR   O(b^3)
Compress  LR      O(b^2 r)
Update    FR-FR   O(b^3)
Update    LR-FR   O(b^2 r)
Update    LR-LR   O(b^2 r)

Ex. 1: compute C_step(b, p, r) = cost × number.
Ex. 2: compute C_step(m, x, α) with b = O(m^x) and r = O(m^α).
Ex. 3: compute the total complexity (sum of all steps) C_total(m, x, α).
Ex. 4: compute the optimal block size b = O(m^x) and the resulting optimal complexity C_opt(m, r).

Complexity of the sparse multifrontal BLR factorization

Sparse multifrontal complexity with ND: for a dense complexity C_opt(m, r), the sparse complexity is computed as

C_mf = O( Σ_{l=0}^{log₂ N} 2^{dl} C_opt( (N/2^l)^{d−1} ) ),

where d is the dimension (2 or 3). (A numerical check of this sum is sketched after the table below.)

                    operations (OPC)        factor size (NNZ)
N × N grid      FR  O(N^3)                  O(N^2 log N)
                BLR O(N^{5/2} r^{1/2})      O(N^2)
N × N × N grid  FR  O(N^6)                  O(N^4)
                BLR (exercise)              (exercise)
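A small numerical check of the nested-dissection sum (a sketch; taking the dense cost C(m) = m^2, i.e. the best dense BLR cost with r = O(1), is an assumption made only for illustration):

```python
import numpy as np

def C_mf(N, d, C_dense):
    """Sparse multifrontal cost from a dense per-front cost, via the
    nested dissection sum of the slide above."""
    L = int(np.log2(N))
    return sum(2**(d*l) * C_dense((N / 2**l)**(d - 1)) for l in range(L + 1))

Ns = np.array([32, 64, 128, 256])
costs = np.array([C_mf(N, 3, lambda m: m**2) for N in Ns])
slope = np.polyfit(np.log(Ns), np.log(costs), 1)[0]
print(f"fitted exponent: {slope:.2f}")   # close to 4, i.e. O(N^4)
```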

Experimental setting: matrices

1. Poisson: N^3 grid with a 7-point stencil, Δu = f on Ω, with u = 1 on the boundary. The rank bound is theoretically proven to be r = O(1). (A sketch of assembling this matrix is given at the end of this slide.)

2. Helmholtz: N^3 grid with a 27-point stencil, where ω is the angular frequency, v(x) is the seismic velocity field, and u(x, ω) is the time-harmonic wavefield solution to the forcing term s(x, ω):

(−Δ − ω²/v(x)²) u(x, ω) = s(x, ω)

ω is fixed and equal to 4 Hz. Heuristically, the rank bound can be expected to behave as r = O(N).

Experimental MF flop complexity: Poisson (ε = 10^-10)

[Plots: flop count vs. mesh size N. Nested Dissection ordering (geometric): fits FR: 5n^2.02, BLR (FSCU): 2105n^1.xx. METIS ordering (purely algebraic): fits FR: 3n^2.05, BLR (FSCU): 1068n^1.50.]

- Good agreement with the theoretical complexity.
- Remains close to the ND complexity with the METIS ordering.
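For reference, the 7-point Poisson test matrix from the experimental setting above can be assembled with SciPy as follows (a sketch; the Dirichlet boundary handling is simplified):

```python
import scipy.sparse as sp

def poisson3d(N):
    """7-point finite-difference Laplacian on an N x N x N grid."""
    I = sp.identity(N, format='csr')
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(N, N), format='csr')
    return (sp.kron(sp.kron(T, I), I)
            + sp.kron(sp.kron(I, T), I)
            + sp.kron(sp.kron(I, I), T)).tocsr()

A = poisson3d(16)   # order N^3 = 4096, at most 7 nonzeros per row
```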

Experimental MF flop complexity: Helmholtz (ε = 10^-4)

[Plots: flop count vs. mesh size N. Nested Dissection ordering (geometric): fits FR: 12n^2.xx, BLR (FSCU): 31n^1.xx. METIS ordering (purely algebraic): fits FR: 8n^2.xx, BLR (FSCU): 22n^1.87.]

- Good agreement with the theoretical complexity.
- Remains close to the ND complexity with the METIS ordering.

Experimental MF complexity: factor size

[Plots: factor size (NNZ) vs. mesh size N. Poisson: fits FR: 3n^1.40, BLR: 12n^1.05 log n. Helmholtz: fits FR: 15n^1.xx, BLR: 32n^1.xx.]

- Good agreement with the theoretical complexity.
- Remains close to the ND complexity with the METIS ordering (not shown).

Experimental MF complexity: low-rank threshold ε

[Plots: flop count (OPC) vs. mesh size N. Poisson fits: ε = 10^-14: 905n^1.xx, ε = 10^-10: 1068n^1.50, ε = 10^-6: 1045n^1.43, ε = 10^-2: 851n^1.xx. Helmholtz fits: ε = 10^-5: 23n^1.89, ε = 10^-4: 22n^1.87, ε = 10^-3: 14n^1.xx.]

Theory states that ε should only play a role in the constant factor. This holds for Helmholtz, but not for Poisson: why?

Influence of zero-rank blocks on the complexity

[Table: number of full-rank (N_FR), low-rank (N_LR), and zero-rank (N_ZR) blocks, as a percentage of the total number of blocks, for several mesh sizes N and thresholds ε ∈ {10^-14, 10^-10, 10^-6, 10^-2} (Poisson problem).]

- N_FR decreases with N: asymptotically negligible.
- N_ZR increases with ε (as one would expect) but also with N: asymptotically dominant.

Influence of the block size b on the complexity

Analysis on the root node (of size m = N^2):

[Plot: normalized flops vs. block size b, for three root-node sizes m.]

- There is a large range of acceptable block sizes around the optimal b: flexibility to tune the block size for performance.
- That range increases with the size of the matrix: necessity to have variable block sizes.

Part II: performance of the BLR factorization

Sequential result (matrix S3)

[Bar charts: normalized flops and normalized time, FR vs. BLR, broken down into LAI (low arithmetic intensity) parts, Factor+Solve, Update, and Compress.]

The 7.7× gain in flops only translates into a 3.3× gain in time: why?
- lower granularity of the Update;
- higher relative weight of the FR parts;
- inefficient Compress.

Multithreaded result on 24 threads

[Bar charts: normalized sequential and multithreaded time, FR vs. BLR, with the same breakdown.]

The 3.3× gain in sequential becomes 1.7× in multithreaded: why?
- the LAI parts have become critical;
- the Update and Compress are memory-bound.

Exploiting tree-based multithreading in MF solvers

[Figure: node parallelism above the L0 layer, tree parallelism below, with threads 0-3 assigned per subtree.]

L'Excellent and Sid-Lakhdar. A study of shared-memory parallelism in a multifrontal solver, Parallel Computing, 2014.

How big an impact can tree-based multithreading make?

Impact of tree-based multithreading on BLR (24 threads)

[Table: time and percentage of low-AI (lai) work, for FR and BLR, under node-only and node + tree multithreading.]

- In FR, the top of the tree is dominant: tree multithreading brings little gain.
- In BLR, the bottom of the tree compresses less and becomes important: the 1.7× gain becomes 1.9× thanks to tree-based multithreading.

Theoretical speedup

                    tree only   node only   node + tree
N × N grid      FR  O(1)        O(N)        O(N^2)
                BLR O(log N)    O(log N)    O(N log N)
N × N × N grid  FR  O(1)        O(N^3)      O(N^4)
                BLR (exercise)

Right-looking vs. left-looking analysis (24 threads)

[Table: FR and BLR Update and total times for the right-looking (RL) and left-looking (LL) factorizations. Figures: in RL, factored panels are read once and the trailing matrix is written at each step; in LL, previous panels are read at each step and the current panel is written once.]

- Lower volume of memory transfers in LL (more critical in multithreaded runs).
- The Update is now less memory-bound: the 1.9× gain becomes 2.4× in LL. (A code sketch of the two access patterns is given at the end of this slide.)

LUAR variant: accumulation and recompression

FSCU (Factor, Solve, Compress, Update) → FSCU+LUAR (Low-rank Updates Accumulation and Recompression):
- better granularity in the Update operations;
- potential recompression: asymptotic complexity reduction?

We designed and compared several recompression strategies.
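To make the RL/LL distinction concrete, here is a self-contained sketch of both variants of a blocked LU factorization (an illustration of the access patterns, not the solver's actual kernels; no pivoting, so the test matrix is made diagonally dominant):

```python
import numpy as np
from scipy.linalg import solve_triangular

def lu_unblocked(A):
    """In-place LU without pivoting; unit-lower L and U share A's storage."""
    for i in range(A.shape[0] - 1):
        A[i+1:, i] /= A[i, i]
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])

def blocked_lu(A, b, left_looking):
    """Same arithmetic, different data access: RL pushes each panel's updates
    onto the whole trailing matrix (written at each step); LL pulls all
    pending updates into the current block column (written once)."""
    n = A.shape[0]
    for K in range(0, n, b):
        k, e = slice(K, K + b), slice(K + b, n)
        if left_looking:                  # pull updates into block column K
            for J in range(0, K, b):
                j, below = slice(J, J + b), slice(J + b, n)
                A[j, k] = solve_triangular(A[j, j], A[j, k],
                                           lower=True, unit_diagonal=True)
                A[below, k] -= A[below, j] @ A[j, k]
        lu_unblocked(A[k, k])                                  # Factor
        if K + b < n:
            A[e, k] = solve_triangular(A[k, k], A[e, k].T,     # L panel
                                       lower=False, trans='T').T
            if not left_looking:          # push updates onto trailing matrix
                A[k, e] = solve_triangular(A[k, k], A[k, e],
                                           lower=True, unit_diagonal=True)
                A[e, e] -= A[e, k] @ A[k, e]

n, b = 12, 4
A0 = np.random.default_rng(1).standard_normal((n, n)) + n * np.eye(n)
A_rl, A_ll = A0.copy(), A0.copy()
blocked_lu(A_rl, b, left_looking=False)
blocked_lu(A_ll, b, left_looking=True)
assert np.allclose(A_rl, A_ll)   # identical factors, different memory traffic
```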

Performance of Outer Product with LUA(R) (24 threads)

[Plot: Outer Product benchmark, GF/s vs. size of the outer product, for two block sizes (b = 256, …). Table: average size of the outer product, flops, and time (s) of the Outer Product and of the total factorization, for the LL, LUA, and LUAR variants; all metrics include the recompression overhead.]

- Higher granularity and lower flops in the Update: the 2.4× gain becomes 2.6×.

Impact of machine properties on BLR: roofline model

[Table: peak GF/s and bandwidth GB/s of the grunch (28 threads) and brunch (24 threads) machines, and time (s) of the RL, LL, and LUA BLR factorizations of the S3 matrix.]

Arithmetic intensity in BLR:
- LL > RL (lower volume of memory transfers);
- LUA > LL (higher granularities, more efficient cache use).

[Plot: arithmetic intensity of the outer product for RL, LL, and LUA on brunch and grunch.]
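The roofline argument in one line (a sketch; the machine numbers below are made up, since the grunch/brunch specs are only referenced, not reproduced, above):

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline model: performance is capped by min(peak, AI x bandwidth)."""
    return min(peak_gflops, flops_per_byte * bandwidth_gbs)

# A kernel must raise its arithmetic intensity (AI) to leave the
# bandwidth-bound regime, which is what LL and LUA do for the Update:
for ai in (0.5, 2.0, 8.0, 32.0):
    print(f"AI={ai:5.1f} -> {attainable_gflops(800.0, 60.0, ai):6.1f} GF/s")
```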

FCSU variant: compress before solve

FSCU (Factor, Solve, Compress, Update) → FSCU+LUAR:
- better granularity in the Update operations;
- potential recompression: asymptotic complexity reduction?

→ FCSU(+LUAR):
- restricted pivoting, e.g. to the diagonal blocks;
- low-rank Solve: asymptotic complexity reduction?

Performance and accuracy of FCSU vs. FSCU

[Table: flops, time (s), and residual for FR and FSCU+LUAR under standard pivoting, and FR, FSCU+LUAR, and FCSU+LUAR under restricted pivoting.]

- On this problem, restricted pivoting is enough to ensure stability, and gives a better BLAS-3/BLAS-2 ratio.
- Compressing before the Solve has little impact on the residual.
- Thanks to the flop reduction, the 2.6× gain becomes 3.7×.

Variants improve the asymptotic complexity

We have theoretically proven that:

              FSCU                 FSCU+LUAR             FCSU+LUAR
dense         O(m^{5/2} r^{1/2})   O(m^{7/3} r^{2/3})    O(m^2 r)
sparse (3D)   O(N^5 r^{1/2})       O(N^{14/3} r^{2/3})   O(N^4 r)

Amestoy, Buttari, L'Excellent, and Mary. On the Complexity of the Block Low-Rank Multifrontal Factorization, SIAM J. Sci. Comput., 2017.

[Plots: flop count vs. mesh size N. Poisson (ε = 10^-10): fits FSCU: 1068n^1.50, FSCU+LUAR: 2235n^1.42, FCSU+LUAR: 6175n^1.xx. Helmholtz (ε = 10^-4): fits FSCU: 22n^1.87, FSCU+LUAR: 34n^1.82, FCSU+LUAR: 60n^1.xx.]

Multicore performance results (24 threads)

[Bar chart: normalized time (FR = 1) of BLR and BLR+ on the matrices …Hz, 7Hz, 10Hz, E3, E4, S3, S4, p8d, p8ar, p8cr.]

- BLR: FSCU, right-looking, node-only multithreading.
- BLR+: FCSU+LUAR, left-looking, node + tree multithreading.

Amestoy, Buttari, L'Excellent, and Mary. Performance and Scalability of the Block Low-Rank Multifrontal Factorization on Multicore Architectures, submitted to ACM Trans. Math. Soft., 2017.

The problem with FCSU

FSCU → FSCU+LUAR → FCSU(+LUAR): the restricted pivoting (e.g. to the diagonal blocks) that FCSU requires is not acceptable in many applications.

Compress before Solve + pivoting: CFSU variant

[Figure: pivoting within a panel compressed as B ≈ XY^T, with candidate pivot column k.]

What is straightforward:
- column swaps on B can be performed as row swaps of Y;
- the triangular solve and update can also be performed on Y.

What is less straightforward:
- How to assess the quality of pivot k? We need to estimate ‖B_{:,k}‖_max, which can be done from Y alone assuming X is orthonormal (e.g., from RRQR or SVD).
- How to deal with postponed/delayed pivots? Several strategies to merge them with the next panel.
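A small NumPy check (not from the slides) of the "straightforward" part: with B = XY^T and X orthonormal, column swaps and column norms of B can be handled entirely on Y.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 30, 20, 5
X = np.linalg.qr(rng.standard_normal((m, k)))[0]   # orthonormal basis
Y = rng.standard_normal((n, k))
B = X @ Y.T

perm = rng.permutation(n)
# Column swaps on B = X Y^T are row swaps on Y
assert np.allclose(B[:, perm], X @ Y[perm, :].T)

# Column 2-norms of B are readable from the rows of Y alone, which is the
# kind of information CFSU uses to assess pivot quality without forming B
assert np.allclose(np.linalg.norm(B, axis=0), np.linalg.norm(Y, axis=1))
```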

FSCU vs. FCSU vs. CFSU

- FSCU: standard pivoting, compress after the Solve.
- FCSU: restricted pivoting, compress before the Solve.
- CFSU: standard pivoting, compress before the Solve.

[Bar charts: normalized flops and residual of FR, FSCU, FCSU, and CFSU on the matrices barrier2-10, Lin, para-10, kkt_power, perf009d, perf009ar.]

- When FCSU is enough (left group), CFSU does not degrade the compression.
- When FCSU fails (right group), CFSU achieves both a good residual and good compression.

Distributed-memory parallelism

[Figure: 2D block-cyclic mapping of a front over processes P0-P5, with LU messages (factor panels) and CB messages (contribution block).]

- The volume of LU messages is reduced in BLR (compressed factors).
- The volume of CB messages can be reduced by compressing the CB, but this is an overhead cost.

Strong scalability analysis

[Plot: time (s) of the FR and BLR factorizations vs. number of MPI processes × number of cores, up to 90×10.]

- The compression rate is not significantly impacted by the number of processes.
- Flops are reduced by 12.8× but the volume of communications only by 2.2×: higher relative weight of the communications.
- Load imbalance (ratio between the most and least loaded processes) increases from 1.28 to 2.57.

Communication analysis

[Plot: total bytes sent in LU and CB messages as a function of the front size.]

- FR case: the LU messages dominate.
- BLR case: the CB messages dominate: underwhelming reduction of the communications.
- Compressing the CB allows truly reducing the communications; it is an overhead cost but may lead to speedups depending on the network speed relative to the processor speed.

Theoretical communication analysis bounds

              W_LU           W_CB          W_tot
FR            O(n^{4/3} p)   O(n^{4/3})    O(n^{4/3} p)
BLR (CB FR)   (exercise)
BLR (CB LR)   (exercise)

Result on a very large problem

Result on matrix 15Hz (order …, nnz …) on 900 cores:

[Table: flops (PF), factor size (TB), average and maximum memory per process (GB), and elapsed time (s) for the analysis, factorization, and solve (per RHS); full-rank MUMPS runs out of memory (OOM), while BLR completes.]

References

Amestoy, Buttari, L'Excellent, and Mary. On the Complexity of the Block Low-Rank Multifrontal Factorization, SIAM J. Sci. Comput., 39(4):A1710-A1740, 2017.
Amestoy, Buttari, L'Excellent, and Mary. Performance and Scalability of the Block Low-Rank Multifrontal Factorization on Multicore Architectures, submitted to ACM Trans. Math. Soft., 2017.
Amestoy, Brossier, Buttari, L'Excellent, Mary, Métivier, Miniussi, and Operto. Fast 3D frequency-domain full waveform inversion with a parallel Block Low-Rank multifrontal direct solver: application to OBC data from the North Sea, Geophysics, 2016.
Shantsev, Jaysaval, de la Kethulle de Ryhove, Amestoy, Buttari, L'Excellent, and Mary. Large-scale 3D EM modeling with a Block Low-Rank multifrontal direct solver, Geophysical Journal International, 209(3):1558-1571, 2017.
