B. Reps 1, P. Ghysels 2, O. Schenk 4, K. Meerbergen 3,5, W. Vanroose 1,3. IHP 2015, Paris FRx
|
|
- Debra Wilkinson
- 6 years ago
- Views:
Transcription
1 Communication Avoiding and Hiding in preconditioned Krylov solvers 1 Applied Mathematics, University of Antwerp, Belgium 2 Future Technology Group, Berkeley Lab, USA 3 Intel ExaScience Lab, Belgium 4 USI, Lugano, CH 5 KU Leuven, Belgium B. Reps 1, P. Ghysels 2, O. Schenk 4, K. Meerbergen 3,5, W. Vanroose 1,3 IHP 2015, Paris FRx
2 Overview Arithmetic Intensity in Krylov Arithmetic Intensity in Multigrid Pipelining communication and computation in Krylov
3 GMRES, classical Gram-Schmidt 1: r 0 b Ax 0 2: v 0 r 0 / r 0 2 3: for i = 0,..., m 1 do 4: w Av i 5: for j = 0,..., i do 6: h j,i w, v j 7: end for 8: ṽ i+1 w i j=1 h j,iv j 9: h i+1,i ṽ i : v i+1 ṽ i+1 /h i+1,i 11: {apply Givens rotations to h :,i } 12: end for 13: y m argmin H m+1,m y m r 0 2 e : x x 0 + V m y m Sparse Matrix-Vector product Only communication with neighbors Good scaling Dot-product Global communication Scales as log(p) Scalar vector multiplication, vector-vector addition No communication
4 Part I Arithmetic Intensity and Sparse Matrix vector product
5 Performance of Kernel of dependent SpMV SIMD: SSE AVX mm512 on Xeon Phi. Similar with NEON on ARM. Bandwidth is the bottleneck and will remain even with 3D stacked memory. attainable GFLOP/sec peak memory BW with vectorization peak floating-point 1 1/8 1/4 1/ Arithmetic Intensity FLOP/Byte Arithmetic intensity (Flop/byte) S. Williams, A. Waterman and D. Patterson (2008)
6 Arithmetic intensity of k dependent SpMVs 1 SpMV k SpMVs k SpMVs in place flops 2n nz 2k n nz 2k n nz words moved n nz + 2n kn nz + 2kn n nz + 2n q 2 2 2k J. Demmel, CS on Communication-Avoiding algorithms M. Hoemmen, Communication-avoiding Krylov subspace methods. (2010).
7 Increasing the arithmetic intensity Divide the domain in tiles which fit in the cache Ground surface is loaded in cache and reused k times Redundant work at the tile edges s+1 u s+1 i,j 3 s u s i 1,j u s i,j u s i,j+1 u s i+1,j s 2 1 u s i,j B 0 B P. Ghysels, P. Klosiewicz, and W. Vanroose. Improving the arithmetic intensity of multigrid with the help of polynomial smoothers. Num. Lin. Alg. Appl. 19 (2012):
8 Stencil Compilers. (polytope model) Pluto Automatic loop parallelization, Locality optimizations based on the polyhedral model, add OpenMP pragma s around the outer loop, ivdep and vector always. compilers (guided) auto-vectorization. U. Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan, Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model, Int. Conf. Compiler Construction (ETAPS CC), Apr 2008, Budapest, Hungary.
9 Increasing the arithmetic intensity with a stencil compiler (Pluto) naive naive, no-vec pluto pluto, no-vec peak roofline fitted roofline naive naive, no-vec pluto pluto, no-vec peak roofline fitted roofline GFlop/s 64 GFlop/s arithmetic intensity (flop/byte) Dual socket Intel Sandy Bridge, 16 2 threads arithmetic intensity (flop/byte) Intel Xeon Phi, 61 4 threads
10 Krylov methods Classical Krylov K k (A, v) = span{v, Av, A 2 v,..., A k 1 v} the residual and error can then be written as r (k) = P k (A)r (0) (1)
11 Krylov methods Classical Krylov K k (A, v) = span{v, Av, A 2 v,..., A k 1 v} the residual and error can then be written as r (k) = P k (A)r (0) (1) Polynomial Preconditioning P m (A)x = q(a)ax = q(a)b K k (P m (A), v) = span{v, P m (A)v, P m (A) 2 v,..., P m (A) k 1 v} where P m (A) = (A σ 1 )(A σ 2 )... (A σ m ) is a fixed low-order polynomial r (k) = Q k (P m (A))r (0) (2)
12 Incomplete list of the literature on polynomial preconditioning Y. Saad, Iterative methods for sparse linear systems Chapter 12, SIAM (2003). D. O Leary, Yet another polynomials preconditioner for the Conjugate Gradient Algorithm, Lin. Alg. Appl. (1991) p377 A. van der Ploeg, Preconditioning for sparse matrices with applications, (1994) M. Van Gijzen, A polynomial preconditioner for the GMRES algorithm J. Comp. Appl. Math 59 (1995): A. Basermann, B. Reichel, C. Schelthoff, Preconditioned CG methods for sparse matrices on massively parallel machines, 23 (1997),
13 Convergence of CG with different short Polynomials
14 Time to solution is reduced
15 Recursive calculation of w k+1 = p k+1 (A)v 1: σ 1 = θ/δ 2: ρ 0 = 1/σ 1 3: w 0 = 0, w 1 = 1 θ Av 4: w 1 = w 1 w 0 5: for k=1,... do 6: ρ k = 1/(2σ 1 ρ k 1 ) 7: w k+1 = ρ k [ 2 δ A(v w k) + ρ k 1 w k ] 8: w k+1 = w k + w k+1 9: end for Mangled with Pluto to raise arithmetic intensity.
16 Average cost for each matvec averaged time for each matvec total matvec dot-product axpy order of polynomial 2D finite difference poisson problem on i7-2860qm
17 attainable GFLOP/sec peak memory BW with vectorization peak floating-point 1 1/8 1/4 1/ Arithmetic Intensity FLOP/Byte Arithmetic intensity (Flop/byte) S. Williams, A. Waterman and D. Patterson (2008)
18 Relying on compiler autovectorization averaged time for each matvec total matvec dot-product axpy total SIMD matvec SIMD order of polynomial Intel i7-2860qm
19 Part II Arithmetic Intensity in Multigrid
20 Arithmetic Intensity Multigrid where MG = (I I2h h 2h (...)Ih A)S ν 1, (3) S is the smoother, for example ω-jacobi where S = I ωd 1 A, The interpolation I2h h and restriction I 2h h. These are all low arithmetic intensity.
21 Multigrid Cost Model Increasing the number of smoothing steps per level 1 60 convergence rate # V-cycles smoothing steps smoothing steps MG iterations = log(10 8 )/ log(ρ(v-cycle)) where ρ(v-cycle) is the spectral radius of the V-cycle, can be rounded to higher integer
22 Work Unit Cost model Work Unit cost model ignores memory bandwidth 1 WU = smoother cost = O(n) (9ν + 19)( ) log(tol) log(ρ) WU (9ν + 19) 4 log(tol) 3 log(ρ) WU # V-cycles work units smoothing steps smoothing steps
23 avg perf (Gflop/s) peak BW, peak Flops no SIMD stream BW # smoothing steps Attainable GFlop/sec avg cost (s/gflop) peak BW, peak Flops no SIMD stream BW # smoothing steps Average cost in sec/gflop ν 1 ω-jac = flops (ν 1 ω-jac) roof (q(ν 1 ω-jac)) ν 1flops (ω-jac) roof (ν 1 q 1 (ω-jac)) (4)
24 Roofline-based vs naive cost model roofline model naive model work units # smoothing steps
25 Timings for 2D (Sandy Bridge) time (s) naive pluto naive model roofline model smoothing steps P. Ghysels and W. Vanroose, Modelling the performance of geometric multigrid on many-core computer architectures, Exascience.com techreport Sept, 2013,
26 3d results (Sandy Bridge)
27 Current Krylov Solvers Krylov Solver User Matvec User provides a matvec routine and selects Krylov method at command line. No opportunities to increase arithmetic intensity.
28 Current Krylov Solvers Krylov Solver Future Krylov Solvers User provides a stencil routine. User Matvec Stencil compiler increases arithmetic intensity. Krylov Solver User provides a matvec routine and selects Krylov method at command line. No opportunities to increase arithmetic intensity. Stencil Compiler (Pochoir, Pluto,...) W.Vanroose, P. Ghysels, D. Roose and K. Meerbergen, Position paper at DOE Exascale Mathematics workshop User Stencil
29 Part III Pipelining communication and computations
30 GMRES, classical Gram-Schmidt 1: r 0 := b Ax 0 2: v 0 := r 0 / r 0 2 3: for i = 0,..., m 1 do 4: w := Av i 5: for j = 0,..., i do 6: h j,i := w, v j 7: end for 8: ṽ i+1 := w i j=1 h j,iv j 9: h i+1,i := ṽ i : v i+1 := ṽ i+1 /h i+1,i 11: {apply Givens rotations to h :,i } 12: end for 13: y m := argmin H m+1,m y m r 0 2 e : x := x 0 + V m y m Sparse Matrix-Vector product Only communication with neighbors Good scaling but BW limited. Dot-product Global communication Scales as log(p) Scalar vector multiplication, vector-vector addition No communication
31 GMRES vs Pipelined GMRES iteration on 4 nodes Classical GMRES p4 p3 + + p2 p SpMV local dot reduction b-cast Gram-Schmidt axpy norm reduction b-castscale H update Normalization Pipelined GMRES p4 p3 + p2 p1 + + reduction bcastscale axpy H update correction local dot P. Ghysels, T. Ashby, K. Meerbergen and W. Vanroose Hiding global communication latency in the GMRES algorithm on massively parallel machines. SIAM J. Scientific Computing, 35(1):C48C71, (2013).
32 Better Scaling x1000 iterations/s GMRES cgs GMRES mgs l 1 -GMRES p 1 -GMRES p(1)-gmres p(2)-gmres p(3)-gmres x1000 nodes Prediction of a strong scaling experiment for GMRES on XT4 part of Cray Jaguar.
33 Are there similar opportunities in preconditioned Conjugate Gradients? Preconditioned Conjugate Gradient 1: r 0 := b Ax 0 ; u 0 := M 1 r 0 ; p 0 := u 0 2: for i = 0,..., m 1 do 3: s := Ap i 4: α := r i, u i / s, p i 5: x i+1 := x i + αp i 6: r i+1 := r i αs 7: u i+1 := M 1 r i+1 8: β := r i+1, u i+1 / r i, u i 9: p i+1 := u i+1 + βp i 10: end for
34 Chronopoulos/Gear CG Only one global reduction each iteration. 1: r 0 := b Ax 0 ; u 0 := M 1 r 0 ; w 0 := Au 0 2: α 0 := r 0, u 0 / w 0, u 0 ; β := 0; γ 0 := r 0, u 0 3: for i = 0,..., m 1 do 4: p i := u i + β i p i 1 5: s i := w i + β i s i 1 6: x i+1 := x i + αp i 7: r i+1 := r i αs i 8: u i+1 := M 1 r i+1 9: w i+1 := Au i+1 10: γ i+1 := r i+1, u i+1 11: δ := w i+1, u i+1 12: β i+1 := γ i+1 /γ i 13: α i+1 := γ i+1 /(δ β i+1 γ i+1 /α i ) 14: end for
35 pipelined Chronopoulos/Gear CG Global reduction overlaps with matrix vector product. 1: r 0 := b Ax 0 ; w 0 := Au 0 2: for i = 0,..., m 1 do 3: γ i := r i, r i 4: δ := w i, r i 5: q i := Aw i 6: if i > 0 then 7: β i := γ i /γ i 1 ; α i := γ i /(δ β i γ i /α i 1 ) 8: else 9: β i := 0; α i := γ i /δ 10: end if 11: z i := q i + β i z i 1 12: s i := w i + β i s i 1 13: p i := r i + β i p i 1 14: x i+1 := x i + α i p i 15: r i+1 := r i α i s i 16: w i+1 := w i α i z i 17: end for
36 Preconditioned pipelined CG 1: r 0 := b Ax 0 ; u 0 := M 1 r 0 ; w 0 := Au 0 2: for i = 0,... do 3: γ i := r i, u i 4: δ := w i, u i 5: m i := M 1 w i 6: n i := Am i 7: if i > 0 then 8: β i := γ i /γ i 1 ; α i := γ i / (δ β i γ i /α i 1 ) 9: else 10: β i := 0; α i := γ i /δ 11: end if 12: z i := n i + β i z i 1 13: q i := m i + β i q i 1 14: s i := w i + β i s i 1 15: p i := u i + β i p i 1 16: x i+1 := x i + α i p i 17: r i+1 := r i α i s i 18: u i+1 := u i α i q i 19: w i+1 := w i α i z i 20: end for
37 Preconditioned pipelined CR 1: r 0 := b Ax 0 ; u 0 := M 1 r 0 ; w 0 := Au 0 2: for i = 0,... do 3: m i := M 1 w i 4: γ i := w i, u i 5: δ := m i, w i 6: n i := Am i 7: if i > 0 then 8: β i := γ i /γ i 1 ; α i := γ i / (δ β i γ i /α i 1 ) 9: else 10: β i := 0; α i := γ i /δ 11: end if 12: z i := n i + β i z i 1 13: q i := m i + β i q i 1 14: p i := u i + β i p i 1 15: x i+1 := x i + α i p i 16: u i+1 := u i α i q i 17: w i+1 := w i α i z i 18: end for
38 Cost Model G := time for a global reduction SpMV := time for a sparse-matrix vector product PC := time for preconditioner application local work such as AXPY is neglected flops time (excl, axpys, dots) #glob syncs memory CG 10 2G + SpMV + PC 2 4 Chro/Gea 12 G + SpMV + PC 1 5 CR 12 2G + SpMV + PC 2 5 pipe-cg 20 max(g, SpMV + PC) 1 9 pipe-cr 16 max(g, SpMV) + PC 1 7 Gropp-CG 14 max(g, SpMV) + max(g, PC) 2 6 P. Ghysels and W. Vanroose, Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm, Parallel Computing, 40, (2014), Pages
39 Better Scaling Hydrostatic ice sheet flow, Q1 finite elements line search Newton method (rtol = 10 8, atol = ) CG with block Jacobi ICC(0) precond (rtol = 10 5, atol = ) Measured speedup over standard CG for different variations of pipelined CG.
40 MPI trace
41 Pipelined methods in PETSc (from 3.4.2) Krylov methods: KSPPIPECR, KSPPIPECG, KSPGROPPCG, KSPPGMRES Uses MPI-3 non-blocking collectives export MPICH ASYNC PROGRES=1 1:... 2: KSP PCApply(ksp,W,M); /* m Bw */ 3: if (i > 0 && ksp normtype == KSP NORM PRECONDITIONED) 4: VecNormBegin(U,NORM 2,&dp); 5: VecDotBegin(W,U,&gamma); 6: VecDotBegin(M,W,&delta); 7: PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)U)); 8: KSP MatMult(ksp,Amat,M,N); /* n Am */ 9: if (i > 0 && ksp normtype == KSP NORM PRECONDITIONED) 10: VecNormEnd(U,NORM 2,&dp); 11: VecDotEnd(W,U,&gamma); 12: VecDotEnd(M,W,&delta); 13:...
42 Krylov subspace methods with additional global reductions Deflation: Remove a few known annoying eigenvectors. Helmholtz. FETI methods.... Augmenting: adds a subspace to the Krylov subspace, e.g. recycling. Newton-Krylov methods. Numerical Continuation. Coarse Solver in multigrid.... Sn := Kn(A, v) + U (5) where U forms a basis for U. x n = x 0 + V n y n + Uu n
43 Deflation with eigenvectors with smallest eigenvalues Smallest eigenvalues and vectors: A [w 1, w 2,..., w m ] = [λ 1 w 1, λ 2 w 2,..., λ m w m ] + ɛ[θ] where λ 1 < λ 2 < λ 3 <... < λ m <... and Θ 1. W := [w 1, w 2,..., w m ] Correction step e = W (W T AW ) 1 W T r
44 Deflated CG 1: r 1 := b Ax 1 2: x 0 := x 1 + W (W T AW ) 1 W T r 1 3: r 0 := b Ax 0 4: p 0 := r 0 W (W T AW ) 1 W T Ar 0 5: for i = 0,... do 6: s := Ap i 7: α i := r i, r i / s, p i 8: x i+1 := x i + α i p i 9: r i+1 := r i α i s 10: β i := r i+1, r i+1 / r i, r i 11: w := Ar i+1 12: σ := W, w 13: p i+1 := r i+1 + β i p i W (W T AW ) 1 σ 14: end for x 0 such that W T r 0 = 0 with r 0 = b Ax 0 (cfr init-cg) σ = W, w = W, Ar i+1 = AW, r i+1 : store AW? CG on H T AH x = H T b with H = I W (W T AW ) 1 (AW ) T
45 Pipelined Deflation CG
46 Selective Deflation
47
48
49
50 Conclusions Two communication bottlenecks: limited BW in Av. No benefit from SIMD. Latencies in (u, v). Poor strong scaling.
51 Conclusions Two communication bottlenecks: limited BW in Av. No benefit from SIMD. Latencies in (u, v). Poor strong scaling. Replace Av with a P m (A)v with a fixed coefficients. Execute with a stencil compiler. Combine with Multigrid. Many open question for unstructured matrices.
52 Conclusions Two communication bottlenecks: limited BW in Av. No benefit from SIMD. Latencies in (u, v). Poor strong scaling. Replace Av with a P m (A)v with a fixed coefficients. Execute with a stencil compiler. Combine with Multigrid. Many open question for unstructured matrices. Pipelining of Krylov algorithms reduce the number of global reductions and hide their latencies. They can scale as the Sparse Matrix-Vector product. The pipelined GMRES and CG code is in PETSc-current. Extension to Deflated and Augmented Krylov methods. Algorithms with additional global reductions.
53 Conclusions Two communication bottlenecks: limited BW in Av. No benefit from SIMD. Latencies in (u, v). Poor strong scaling. Replace Av with a P m (A)v with a fixed coefficients. Execute with a stencil compiler. Combine with Multigrid. Many open question for unstructured matrices. Pipelining of Krylov algorithms reduce the number of global reductions and hide their latencies. They can scale as the Sparse Matrix-Vector product. The pipelined GMRES and CG code is in PETSc-current. Extension to Deflated and Augmented Krylov methods. Algorithms with additional global reductions. Acknowledgements: EU s Seventh Framework Programme (FP7/ ) under grant agreement n Intel and the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT).
EXASCALE COMPUTING. Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm. P. Ghysels W.
ExaScience Lab Intel Labs Europe EXASCALE COMPUTING Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm P. Ghysels W. Vanroose Report 12.2012.1, December 2012 This
More informationCommunication Avoiding Strategies for the Numerical Kernels in Coupled Physics Simulations
ExaScience Lab Intel Labs Europe EXASCALE COMPUTING Communication Avoiding Strategies for the Numerical Kernels in Coupled Physics Simulations SIAM Conference on Parallel Processing for Scientific Computing
More informationScalable Non-blocking Preconditioned Conjugate Gradient Methods
Scalable Non-blocking Preconditioned Conjugate Gradient Methods Paul Eller and William Gropp University of Illinois at Urbana-Champaign Department of Computer Science Supercomputing 16 Paul Eller and William
More informationERLANGEN REGIONAL COMPUTING CENTER
ERLANGEN REGIONAL COMPUTING CENTER Making Sense of Performance Numbers Georg Hager Erlangen Regional Computing Center (RRZE) Friedrich-Alexander-Universität Erlangen-Nürnberg OpenMPCon 2018 Barcelona,
More informationEfficient Deflation for Communication-Avoiding Krylov Subspace Methods
Efficient Deflation for Communication-Avoiding Krylov Subspace Methods Erin Carson Nicholas Knight, James Demmel Univ. of California, Berkeley Monday, June 24, NASCA 2013, Calais, France Overview We derive
More informationMS 28: Scalable Communication-Avoiding and -Hiding Krylov Subspace Methods I
MS 28: Scalable Communication-Avoiding and -Hiding Krylov Subspace Methods I Organizers: Siegfried Cools, University of Antwerp, Belgium Erin C. Carson, New York University, USA 10:50-11:10 High Performance
More informationFinite-choice algorithm optimization in Conjugate Gradients
Finite-choice algorithm optimization in Conjugate Gradients Jack Dongarra and Victor Eijkhout January 2003 Abstract We present computational aspects of mathematically equivalent implementations of the
More informationChapter 7 Iterative Techniques in Matrix Algebra
Chapter 7 Iterative Techniques in Matrix Algebra Per-Olof Persson persson@berkeley.edu Department of Mathematics University of California, Berkeley Math 128B Numerical Analysis Vector Norms Definition
More informationLecture 17: Iterative Methods and Sparse Linear Algebra
Lecture 17: Iterative Methods and Sparse Linear Algebra David Bindel 25 Mar 2014 Logistics HW 3 extended to Wednesday after break HW 4 should come out Monday after break Still need project description
More informationAlgebraic Multigrid as Solvers and as Preconditioner
Ò Algebraic Multigrid as Solvers and as Preconditioner Domenico Lahaye domenico.lahaye@cs.kuleuven.ac.be http://www.cs.kuleuven.ac.be/ domenico/ Department of Computer Science Katholieke Universiteit Leuven
More informationSolving Ax = b, an overview. Program
Numerical Linear Algebra Improving iterative solvers: preconditioning, deflation, numerical software and parallelisation Gerard Sleijpen and Martin van Gijzen November 29, 27 Solving Ax = b, an overview
More informationHigh performance Krylov subspace method variants
High performance Krylov subspace method variants and their behavior in finite precision Erin Carson New York University HPCSE17, May 24, 2017 Collaborators Miroslav Rozložník Institute of Computer Science,
More information- Part 4 - Multicore and Manycore Technology: Chances and Challenges. Vincent Heuveline
- Part 4 - Multicore and Manycore Technology: Chances and Challenges Vincent Heuveline 1 Numerical Simulation of Tropical Cyclones Goal oriented adaptivity for tropical cyclones ~10⁴km ~1500km ~100km 2
More informationCommunication-avoiding Krylov subspace methods
Motivation Communication-avoiding Krylov subspace methods Mark mhoemmen@cs.berkeley.edu University of California Berkeley EECS MS Numerical Libraries Group visit: 28 April 2008 Overview Motivation Current
More information9.1 Preconditioned Krylov Subspace Methods
Chapter 9 PRECONDITIONING 9.1 Preconditioned Krylov Subspace Methods 9.2 Preconditioned Conjugate Gradient 9.3 Preconditioned Generalized Minimal Residual 9.4 Relaxation Method Preconditioners 9.5 Incomplete
More informationContents. Preface... xi. Introduction...
Contents Preface... xi Introduction... xv Chapter 1. Computer Architectures... 1 1.1. Different types of parallelism... 1 1.1.1. Overlap, concurrency and parallelism... 1 1.1.2. Temporal and spatial parallelism
More informationIterative Methods for Linear Systems of Equations
Iterative Methods for Linear Systems of Equations Projection methods (3) ITMAN PhD-course DTU 20-10-08 till 24-10-08 Martin van Gijzen 1 Delft University of Technology Overview day 4 Bi-Lanczos method
More informationCME342 Parallel Methods in Numerical Analysis. Matrix Computation: Iterative Methods II. Sparse Matrix-vector Multiplication.
CME342 Parallel Methods in Numerical Analysis Matrix Computation: Iterative Methods II Outline: CG & its parallelization. Sparse Matrix-vector Multiplication. 1 Basic iterative methods: Ax = b r = b Ax
More informationLecture 18 Classical Iterative Methods
Lecture 18 Classical Iterative Methods MIT 18.335J / 6.337J Introduction to Numerical Methods Per-Olof Persson November 14, 2006 1 Iterative Methods for Linear Systems Direct methods for solving Ax = b,
More information1 Conjugate gradients
Notes for 2016-11-18 1 Conjugate gradients We now turn to the method of conjugate gradients (CG), perhaps the best known of the Krylov subspace solvers. The CG iteration can be characterized as the iteration
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences)
AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) Lecture 19: Computing the SVD; Sparse Linear Systems Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical
More informationAn Efficient Low Memory Implicit DG Algorithm for Time Dependent Problems
An Efficient Low Memory Implicit DG Algorithm for Time Dependent Problems P.-O. Persson and J. Peraire Massachusetts Institute of Technology 2006 AIAA Aerospace Sciences Meeting, Reno, Nevada January 9,
More informationPreface to the Second Edition. Preface to the First Edition
n page v Preface to the Second Edition Preface to the First Edition xiii xvii 1 Background in Linear Algebra 1 1.1 Matrices................................. 1 1.2 Square Matrices and Eigenvalues....................
More informationSome Geometric and Algebraic Aspects of Domain Decomposition Methods
Some Geometric and Algebraic Aspects of Domain Decomposition Methods D.S.Butyugin 1, Y.L.Gurieva 1, V.P.Ilin 1,2, and D.V.Perevozkin 1 Abstract Some geometric and algebraic aspects of various domain decomposition
More informationIterative Methods for Solving A x = b
Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http
More informationComputational Linear Algebra
Computational Linear Algebra PD Dr. rer. nat. habil. Ralf-Peter Mundani Computation in Engineering / BGU Scientific Computing in Computer Science / INF Winter Term 2018/19 Part 4: Iterative Methods PD
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 18 Outline
More informationFine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning
Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology, USA SPPEXA Symposium TU München,
More informationMultilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota
Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota SIAM CSE Boston - March 1, 2013 First: Joint work with Ruipeng Li Work
More informationFAS and Solver Performance
FAS and Solver Performance Matthew Knepley Mathematics and Computer Science Division Argonne National Laboratory Fall AMS Central Section Meeting Chicago, IL Oct 05 06, 2007 M. Knepley (ANL) FAS AMS 07
More informationDeflation for inversion with multiple right-hand sides in QCD
Deflation for inversion with multiple right-hand sides in QCD A Stathopoulos 1, A M Abdel-Rehim 1 and K Orginos 2 1 Department of Computer Science, College of William and Mary, Williamsburg, VA 23187 2
More informationTopics. The CG Algorithm Algorithmic Options CG s Two Main Convergence Theorems
Topics The CG Algorithm Algorithmic Options CG s Two Main Convergence Theorems What about non-spd systems? Methods requiring small history Methods requiring large history Summary of solvers 1 / 52 Conjugate
More informationM.A. Botchev. September 5, 2014
Rome-Moscow school of Matrix Methods and Applied Linear Algebra 2014 A short introduction to Krylov subspaces for linear systems, matrix functions and inexact Newton methods. Plan and exercises. M.A. Botchev
More informationStabilization and Acceleration of Algebraic Multigrid Method
Stabilization and Acceleration of Algebraic Multigrid Method Recursive Projection Algorithm A. Jemcov J.P. Maruszewski Fluent Inc. October 24, 2006 Outline 1 Need for Algorithm Stabilization and Acceleration
More informationConjugate gradient method. Descent method. Conjugate search direction. Conjugate Gradient Algorithm (294)
Conjugate gradient method Descent method Hestenes, Stiefel 1952 For A N N SPD In exact arithmetic, solves in N steps In real arithmetic No guaranteed stopping Often converges in many fewer than N steps
More informationScalable Domain Decomposition Preconditioners For Heterogeneous Elliptic Problems
Scalable Domain Decomposition Preconditioners For Heterogeneous Elliptic Problems Pierre Jolivet, F. Hecht, F. Nataf, C. Prud homme Laboratoire Jacques-Louis Lions Laboratoire Jean Kuntzmann INRIA Rocquencourt
More informationThe Lanczos and conjugate gradient algorithms
The Lanczos and conjugate gradient algorithms Gérard MEURANT October, 2008 1 The Lanczos algorithm 2 The Lanczos algorithm in finite precision 3 The nonsymmetric Lanczos algorithm 4 The Golub Kahan bidiagonalization
More informationImplementation of a preconditioned eigensolver using Hypre
Implementation of a preconditioned eigensolver using Hypre Andrew V. Knyazev 1, and Merico E. Argentati 1 1 Department of Mathematics, University of Colorado at Denver, USA SUMMARY This paper describes
More informationMinisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu
Minisymposia 9 and 34: Avoiding Communication in Linear Algebra Jim Demmel UC Berkeley bebop.cs.berkeley.edu Motivation (1) Increasing parallelism to exploit From Top500 to multicores in your laptop Exponentially
More informationParallel Iterative Methods for Sparse Linear Systems. H. Martin Bücker Lehrstuhl für Hochleistungsrechnen
Parallel Iterative Methods for Sparse Linear Systems Lehrstuhl für Hochleistungsrechnen www.sc.rwth-aachen.de RWTH Aachen Large and Sparse Small and Dense Outline Problem with Direct Methods Iterative
More informationSome minimization problems
Week 13: Wednesday, Nov 14 Some minimization problems Last time, we sketched the following two-step strategy for approximating the solution to linear systems via Krylov subspaces: 1. Build a sequence of
More informationSolving Symmetric Indefinite Systems with Symmetric Positive Definite Preconditioners
Solving Symmetric Indefinite Systems with Symmetric Positive Definite Preconditioners Eugene Vecharynski 1 Andrew Knyazev 2 1 Department of Computer Science and Engineering University of Minnesota 2 Department
More informationParallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics)
Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics) Eftychios Sifakis CS758 Guest Lecture - 19 Sept 2012 Introduction Linear systems
More informationEfficient implementation of the overlap operator on multi-gpus
Efficient implementation of the overlap operator on multi-gpus Andrei Alexandru Mike Lujan, Craig Pelissier, Ben Gamari, Frank Lee SAAHPC 2011 - University of Tennessee Outline Motivation Overlap operator
More informationCourse Notes: Week 1
Course Notes: Week 1 Math 270C: Applied Numerical Linear Algebra 1 Lecture 1: Introduction (3/28/11) We will focus on iterative methods for solving linear systems of equations (and some discussion of eigenvalues
More informationAlgorithms and Methods for Fast Model Predictive Control
Algorithms and Methods for Fast Model Predictive Control Technical University of Denmark Department of Applied Mathematics and Computer Science 13 April 2016 Background: Model Predictive Control Model
More informationEXASCALE COMPUTING. Implementation of a 2D Electrostatic Particle in Cell algorithm in UniÞed Parallel C with dynamic load-balancing
ExaScience Lab Intel Labs Europe EXASCALE COMPUTING Implementation of a 2D Electrostatic Particle in Cell algorithm in UniÞed Parallel C with dynamic load-balancing B. Verleye P. Henry R. Wuyts G. Lapenta
More informationA Domain Decomposition Based Jacobi-Davidson Algorithm for Quantum Dot Simulation
A Domain Decomposition Based Jacobi-Davidson Algorithm for Quantum Dot Simulation Tao Zhao 1, Feng-Nan Hwang 2 and Xiao-Chuan Cai 3 Abstract In this paper, we develop an overlapping domain decomposition
More informationBootstrap AMG. Kailai Xu. July 12, Stanford University
Bootstrap AMG Kailai Xu Stanford University July 12, 2017 AMG Components A general AMG algorithm consists of the following components. A hierarchy of levels. A smoother. A prolongation. A restriction.
More informationA decade of fast and robust Helmholtz solvers
A decade of fast and robust Helmholtz solvers Werkgemeenschap Scientific Computing Spring meeting Kees Vuik May 11th, 212 1 Delft University of Technology Contents Introduction Preconditioning (22-28)
More informationParallel Numerics, WT 2016/ Iterative Methods for Sparse Linear Systems of Equations. page 1 of 1
Parallel Numerics, WT 2016/2017 5 Iterative Methods for Sparse Linear Systems of Equations page 1 of 1 Contents 1 Introduction 1.1 Computer Science Aspects 1.2 Numerical Problems 1.3 Graphs 1.4 Loop Manipulations
More informationSpectral analysis of complex shifted-laplace preconditioners for the Helmholtz equation
Spectral analysis of complex shifted-laplace preconditioners for the Helmholtz equation C. Vuik, Y.A. Erlangga, M.B. van Gijzen, and C.W. Oosterlee Delft Institute of Applied Mathematics c.vuik@tudelft.nl
More informationApplied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic
Applied Mathematics 205 Unit V: Eigenvalue Problems Lecturer: Dr. David Knezevic Unit V: Eigenvalue Problems Chapter V.4: Krylov Subspace Methods 2 / 51 Krylov Subspace Methods In this chapter we give
More informationITERATIVE METHODS BASED ON KRYLOV SUBSPACES
ITERATIVE METHODS BASED ON KRYLOV SUBSPACES LONG CHEN We shall present iterative methods for solving linear algebraic equation Au = b based on Krylov subspaces We derive conjugate gradient (CG) method
More informationAdaptive algebraic multigrid methods in lattice computations
Adaptive algebraic multigrid methods in lattice computations Karsten Kahl Bergische Universität Wuppertal January 8, 2009 Acknowledgements Matthias Bolten, University of Wuppertal Achi Brandt, Weizmann
More informationReview: From problem to parallel algorithm
Review: From problem to parallel algorithm Mathematical formulations of interesting problems abound Poisson s equation Sources: Electrostatics, gravity, fluid flow, image processing (!) Numerical solution:
More informationParallel sparse linear solvers and applications in CFD
Parallel sparse linear solvers and applications in CFD Jocelyne Erhel Joint work with Désiré Nuentsa Wakam () and Baptiste Poirriez () SAGE team, Inria Rennes, France journée Calcul Intensif Distribué
More informationArnoldi Methods in SLEPc
Scalable Library for Eigenvalue Problem Computations SLEPc Technical Report STR-4 Available at http://slepc.upv.es Arnoldi Methods in SLEPc V. Hernández J. E. Román A. Tomás V. Vidal Last update: October,
More informationHPMPC - A new software package with efficient solvers for Model Predictive Control
- A new software package with efficient solvers for Model Predictive Control Technical University of Denmark CITIES Second General Consortium Meeting, DTU, Lyngby Campus, 26-27 May 2015 Introduction Model
More informationMultigrid absolute value preconditioning
Multigrid absolute value preconditioning Eugene Vecharynski 1 Andrew Knyazev 2 (speaker) 1 Department of Computer Science and Engineering University of Minnesota 2 Department of Mathematical and Statistical
More informationNotes on PCG for Sparse Linear Systems
Notes on PCG for Sparse Linear Systems Luca Bergamaschi Department of Civil Environmental and Architectural Engineering University of Padova e-mail luca.bergamaschi@unipd.it webpage www.dmsa.unipd.it/
More informationNumerical Methods I Non-Square and Sparse Linear Systems
Numerical Methods I Non-Square and Sparse Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 25th, 2014 A. Donev (Courant
More informationEfficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization
Efficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization Martin Takáč The University of Edinburgh Based on: P. Richtárik and M. Takáč. Iteration complexity of randomized
More information4.8 Arnoldi Iteration, Krylov Subspaces and GMRES
48 Arnoldi Iteration, Krylov Subspaces and GMRES We start with the problem of using a similarity transformation to convert an n n matrix A to upper Hessenberg form H, ie, A = QHQ, (30) with an appropriate
More informationLinear Solvers. Andrew Hazel
Linear Solvers Andrew Hazel Introduction Thus far we have talked about the formulation and discretisation of physical problems...... and stopped when we got to a discrete linear system of equations. Introduction
More informationFPGA Implementation of a Predictive Controller
FPGA Implementation of a Predictive Controller SIAM Conference on Optimization 2011, Darmstadt, Germany Minisymposium on embedded optimization Juan L. Jerez, George A. Constantinides and Eric C. Kerrigan
More informationRecent advances in HPC with FreeFem++
Recent advances in HPC with FreeFem++ Pierre Jolivet Laboratoire Jacques-Louis Lions Laboratoire Jean Kuntzmann Fourth workshop on FreeFem++ December 6, 2012 With F. Hecht, F. Nataf, C. Prud homme. Outline
More informationVariants of BiCGSafe method using shadow three-term recurrence
Variants of BiCGSafe method using shadow three-term recurrence Seiji Fujino, Takashi Sekimoto Research Institute for Information Technology, Kyushu University Ryoyu Systems Co. Ltd. (Kyushu University)
More informationKrylov Space Solvers
Seminar for Applied Mathematics ETH Zurich International Symposium on Frontiers of Computational Science Nagoya, 12/13 Dec. 2005 Sparse Matrices Large sparse linear systems of equations or large sparse
More informationIDR(s) A family of simple and fast algorithms for solving large nonsymmetric systems of linear equations
IDR(s) A family of simple and fast algorithms for solving large nonsymmetric systems of linear equations Harrachov 2007 Peter Sonneveld en Martin van Gijzen August 24, 2007 1 Delft University of Technology
More informationParallel Preconditioning Methods for Ill-conditioned Problems
Parallel Preconditioning Methods for Ill-conditioned Problems Kengo Nakajima Information Technology Center, The University of Tokyo 2014 Conference on Advanced Topics and Auto Tuning in High Performance
More informationContribution of Wo¹niakowski, Strako²,... The conjugate gradient method in nite precision computa
Contribution of Wo¹niakowski, Strako²,... The conjugate gradient method in nite precision computations ªaw University of Technology Institute of Mathematics and Computer Science Warsaw, October 7, 2006
More informationOpen-Source Parallel FE Software : FrontISTR -- Performance Considerations about B/F (Byte per Flop) of SpMV on K-Supercomputer and GPU-Clusters --
Parallel Processing for Energy Efficiency October 3, 2013 NTNU, Trondheim, Norway Open-Source Parallel FE Software : FrontISTR -- Performance Considerations about B/F (Byte per Flop) of SpMV on K-Supercomputer
More informationPreliminary Results of GRAPES Helmholtz solver using GCR and PETSc tools
Preliminary Results of GRAPES Helmholtz solver using GCR and PETSc tools Xiangjun Wu (1),Lilun Zhang (2),Junqiang Song (2) and Dehui Chen (1) (1) Center for Numerical Weather Prediction, CMA (2) School
More informationTR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems
TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a
More informationThe amount of work to construct each new guess from the previous one should be a small multiple of the number of nonzeros in A.
AMSC/CMSC 661 Scientific Computing II Spring 2005 Solution of Sparse Linear Systems Part 2: Iterative methods Dianne P. O Leary c 2005 Solving Sparse Linear Systems: Iterative methods The plan: Iterative
More informationPreconditioned Locally Minimal Residual Method for Computing Interior Eigenpairs of Symmetric Operators
Preconditioned Locally Minimal Residual Method for Computing Interior Eigenpairs of Symmetric Operators Eugene Vecharynski 1 Andrew Knyazev 2 1 Department of Computer Science and Engineering University
More informationAn Accelerated Block-Parallel Newton Method via Overlapped Partitioning
An Accelerated Block-Parallel Newton Method via Overlapped Partitioning Yurong Chen Lab. of Parallel Computing, Institute of Software, CAS (http://www.rdcps.ac.cn/~ychen/english.htm) Summary. This paper
More informationSparse linear solvers: iterative methods and preconditioning
Sparse linear solvers: iterative methods and preconditioning L Grigori ALPINES INRIA and LJLL, UPMC March 2017 Plan Sparse linear solvers Sparse matrices and graphs Classes of linear solvers Krylov subspace
More informationEIGIFP: A MATLAB Program for Solving Large Symmetric Generalized Eigenvalue Problems
EIGIFP: A MATLAB Program for Solving Large Symmetric Generalized Eigenvalue Problems JAMES H. MONEY and QIANG YE UNIVERSITY OF KENTUCKY eigifp is a MATLAB program for computing a few extreme eigenvalues
More informationError Bounds for Iterative Refinement in Three Precisions
Error Bounds for Iterative Refinement in Three Precisions Erin C. Carson, New York University Nicholas J. Higham, University of Manchester SIAM Annual Meeting Portland, Oregon July 13, 018 Hardware Support
More informationA parameter tuning technique of a weighted Jacobi-type preconditioner and its application to supernova simulations
A parameter tuning technique of a weighted Jacobi-type preconditioner and its application to supernova simulations Akira IMAKURA Center for Computational Sciences, University of Tsukuba Joint work with
More informationA communication-avoiding thick-restart Lanczos method on a distributed-memory system
A communication-avoiding thick-restart Lanczos method on a distributed-memory system Ichitaro Yamazaki and Kesheng Wu Lawrence Berkeley National Laboratory, Berkeley, CA, USA Abstract. The Thick-Restart
More informationA Parallel Scalable PETSc-Based Jacobi-Davidson Polynomial Eigensolver with Application in Quantum Dot Simulation
A Parallel Scalable PETSc-Based Jacobi-Davidson Polynomial Eigensolver with Application in Quantum Dot Simulation Zih-Hao Wei 1, Feng-Nan Hwang 1, Tsung-Ming Huang 2, and Weichung Wang 3 1 Department of
More informationMultigrid Methods and their application in CFD
Multigrid Methods and their application in CFD Michael Wurst TU München 16.06.2009 1 Multigrid Methods Definition Multigrid (MG) methods in numerical analysis are a group of algorithms for solving differential
More informationConjugate Gradient Method
Conjugate Gradient Method direct and indirect methods positive definite linear systems Krylov sequence spectral analysis of Krylov sequence preconditioning Prof. S. Boyd, EE364b, Stanford University Three
More informationAccelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers
UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric
More informationA Balancing Algorithm for Mortar Methods
A Balancing Algorithm for Mortar Methods Dan Stefanica Baruch College, City University of New York, NY 11, USA Dan Stefanica@baruch.cuny.edu Summary. The balancing methods are hybrid nonoverlapping Schwarz
More informationIDR(s) Master s thesis Goushani Kisoensingh. Supervisor: Gerard L.G. Sleijpen Department of Mathematics Universiteit Utrecht
IDR(s) Master s thesis Goushani Kisoensingh Supervisor: Gerard L.G. Sleijpen Department of Mathematics Universiteit Utrecht Contents 1 Introduction 2 2 The background of Bi-CGSTAB 3 3 IDR(s) 4 3.1 IDR.............................................
More informationLab 1: Iterative Methods for Solving Linear Systems
Lab 1: Iterative Methods for Solving Linear Systems January 22, 2017 Introduction Many real world applications require the solution to very large and sparse linear systems where direct methods such as
More informationParallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2
1 / 23 Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 Maison de la Simulation Lille 1 University CNRS March 18, 2013
More informationLecture 8: Fast Linear Solvers (Part 7)
Lecture 8: Fast Linear Solvers (Part 7) 1 Modified Gram-Schmidt Process with Reorthogonalization Test Reorthogonalization If Av k 2 + δ v k+1 2 = Av k 2 to working precision. δ = 10 3 2 Householder Arnoldi
More informationCase Study: Quantum Chromodynamics
Case Study: Quantum Chromodynamics Michael Clark Harvard University with R. Babich, K. Barros, R. Brower, J. Chen and C. Rebbi Outline Primer to QCD QCD on a GPU Mixed Precision Solvers Multigrid solver
More informationKasetsart University Workshop. Multigrid methods: An introduction
Kasetsart University Workshop Multigrid methods: An introduction Dr. Anand Pardhanani Mathematics Department Earlham College Richmond, Indiana USA pardhan@earlham.edu A copy of these slides is available
More informationTowards a highly-parallel PDE-Solver using Adaptive Sparse Grids on Compute Clusters
Towards a highly-parallel PDE-Solver using Adaptive Sparse Grids on Compute Clusters HIM - Workshop on Sparse Grids and Applications Alexander Heinecke Chair of Scientific Computing May 18 th 2011 HIM
More informationNumerical Methods in Matrix Computations
Ake Bjorck Numerical Methods in Matrix Computations Springer Contents 1 Direct Methods for Linear Systems 1 1.1 Elements of Matrix Theory 1 1.1.1 Matrix Algebra 2 1.1.2 Vector Spaces 6 1.1.3 Submatrices
More informationComputers and Mathematics with Applications
Computers and Mathematics with Applications 68 (2014) 1151 1160 Contents lists available at ScienceDirect Computers and Mathematics with Applications journal homepage: www.elsevier.com/locate/camwa A GPU
More informationarxiv: v1 [hep-lat] 28 Aug 2014 Karthee Sivalingam
Clover Action for Blue Gene-Q and Iterative solvers for DWF arxiv:18.687v1 [hep-lat] 8 Aug 1 Department of Meteorology, University of Reading, Reading, UK E-mail: K.Sivalingam@reading.ac.uk P.A. Boyle
More informationSOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS. Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA
1 SOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA 2 OUTLINE Sparse matrix storage format Basic factorization
More informationExploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning
Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Erin C. Carson, Nicholas Knight, James Demmel, Ming Gu U.C. Berkeley SIAM PP 12, Savannah, Georgia, USA, February
More information