Efficient multigrid solvers for mixed finite element discretisations in NWP models

Size: px

Start display at page:

Download "Efficient multigrid solvers for mixed finite element discretisations in NWP models"

Melina Welch
5 years ago
Views:

1 1/20 Efficient multigrid solvers for mixed finite element discretisations in NWP models Colin Cotter, David Ham, Lawrence Mitchell, Eike Hermann Müller *, Robert Scheichl * * University of Bath, Imperial College London 16th ECMWF Workshop on High Performance Computing in Meteorology Oct 2014

2/20 Motivation Separation of concerns and high-level abstractions Equations, Algorithms discretisation Parallel Optimised Code Domain specialist Computational Scientist Equations, Algorithms High

2 2/20 Motivation Separation of concerns and high-level abstractions Equations, Algorithms discretisation Parallel Optimised Code Domain specialist Computational Scientist Equations, Algorithms High level code PETSc firedrake PyOP2 * Parallel Optimised Code Use case: complex iterative solver for pressure correction eqn. * All credit for developing firedrake/pyop2 goes to David Ham and his group at Imperial, I m giving this talk as a user

3/20 Motivation Computational challenge in numerical forecasts: Solver for elliptic pressure correction equation

theoretical peak 10 15 PetaFLOP Total solution time [s] 1 FLOPs 10 14 10 13 10 12 10 11 single GPU peak TeraFLOP 0.

scaling, total solution time 10 10 256 CG 512 Multigrid 10 9 GigaFLOP 768 1 4 16 64 256 Number of GPUs 1024 4096 16384

3 3/20 Motivation Computational challenge in numerical forecasts: Solver for elliptic pressure correction equation (algorithmic + parallel) scalability & performance Number of CPU cores Titan theoretical peak PetaFLOP Total solution time [s] 1 FLOPs single GPU peak TeraFLOP CG Titan [nvidia Kepler] 512 Multigrid Hector [AMD Opteron] Number of GPUs Weak scaling, total solution time CG 512 Multigrid 10 9 GigaFLOP Number of GPUs Absolute performance dofs on GPUs of TITAN, 0.78 PFLOPs, 20%-50% BW Finite volume discretisation, simpler geometry

4 4/20 Motivation Challenges Finite element discretisation (Met Office next generation dyn. core), unstructured grids more complicated Duplicate effort for CPU, GPU (Xeon Phi,... ) implementation & optimisation, reinvent the wheel? Mix two issues: algorithmic (domain specialist/numerical analyst) parallelisation & low level optimisation (computational scientist) Goal Separation of concerns (algorithmic computational) rapid algorithm development and testing at high abstraction level Performance portability Reuse existing, tested and optimised tools and libraries Choice of correct abstraction(s)

5 5/20 This talk Iterative solver for Helmholtz equation: φ + ω (φ u) = r φ u ω φ = r u Mixed finite element discretisation, icosahedral grid Matrix-free multigrid preconditioner for pressure correction Performance portable firedrake/pyop2 toolchain [Rathgeber et al., (2012 & 2014)] Python automatic generation of optimised C code Collaboration between Numerical algorithm developers (Colin, Rob, Eike) Computational scientists, Library developers (David, Lawrence, {PETSc,PyOP2,firedrake} teams)

6 6/20 Shallow water equations Model system: shallow water equations u t φ t + (φu) = 0 + (u ) u + g φ = 0 Implicit timestepping linear system for velocity u and height perturbation φ φ + ω (φ u) = r φ u ω φ = r u reference state φ 1 ω = c h t = gφ t

7 7/20 Mixed system Finite element discretisation: φ V 2, u V 1 Matrix system for dof-vectors Φ R n V 2, U R n V 1 A ( ) ( ) ( ) Φ Mφ ωb Φ = U ωb T = M u U ( Rφ R u ) Mixed finite elements DG 0 + RT 1 (lowest order) DG 1 + BDFM 1 (higher order) (M φ ) ij β i (x)β j (x) dx, Ω B ie v e (x)β i (x) dx Ω (M u ) ef v e (x) v f (x) dx Ω (derivative matrix) (mass matrix)

8 8/20 Iterative solver Iterative solver 1 Outer iteration (mixed system): GMRES, Richardson, BiCGStab,... Preconditioner: ( ) Mφ ωb P ωb T Mu lumped velocity mass matrix ( ) ( ) ( ) P H ωb(m = u ) 1 (Mu) 1 ωb T 1 0 (Mu) Elliptic positive definite Helmholtz operator H = M φ + ω 2 B (M u ) 1 B T 2 Inner iteration (pressure system Hφ = R φ ): CG with geometric multigrid preconditioner

DH 1 Rφ0 H φ0 H = Mφ + ω2 B (Mu ) 1 BT Intrinsic length scale Λ/ x = cs t / x νcfl = O(10)

9 9/20 Multigrid Multigrid V-cycle smooth smooth smooth x 1/4 smooth smooth smooth smooth x1 x 1/16 x 1/64 Computational bottleneck Smoother and operator application φ0 7 φ0 + ρrelax DH 1 Rφ0 H φ0 H = Mφ + ω2 B (Mu ) 1 BT Intrinsic length scale Λ/ x = cs t / x νcfl = O(10) grid spacings ( resolutions x) small number of multigrid levels/coarse smoother sufficient

10 10/20 Grid iteration in FEM codes Computational bottlenecks in finite element codes u m e (1) u m e (3) ɸ e um (2) e Example: Weak divergence φ = u Φ = BU φ e φ e + b (e) 1 u me(1) + b (e) 2 u me(2) + b (e) 3 u me(3) Loop over topological entities (e.g. cells). In each cell Access local dofs (e.g. pressure and velocity) Execute local kernel Write data back Manage halo exchanges, thread parallelism

11 11/20 Firedrake/PyOP2 framework Algorithms developers, num. analysts Eike, Rob, Colin Library developers, comp. scientists Lawrence, David, {PyOP2 firedrake, PETSc}developers Firedrake Interface Geometry, (non)linear solves assembly, compiled expressions PETSc4py (KSP, SNES, DMPlex) Meshes, matrices, vectors data structures (Set, Map, Dat) Parallel loops: kernels executed over mesh Unified Form Language (UFL) modified FFC PyOP2 Interface Parallel scheduling, code generation MPI parallel loop CPU GPU (OpenMP/ (PyCUDA / OpenCL) PyOpenCL) parallel loop Future arch. Problem definition in FEM weak form FIAT Local assembly kernels (AST) COFFEE AST optimizer Explicitly parallel hardwarespecific implementation Firedrake Performanceportable Finite-element computation framework PyOP2 Parallel unstructured mesh computation framework [Florian Rathgeber, PhD thesis, Imperical College (2014)]

12 12/20 UFL Unified Form Language [Alnæs et al. (2014)] Example: Bilinear form a(u, v) = u(x)v(x) dx Python code from firedrake import * mesh = UnitSquareMesh(8,8) V = FunctionSpace(mesh, CG,1) u = Function(V) v = Function(V) a_uv = assemble(u*v*dx) Ω

13 13/20 Generated code C-Code generated by Firedrake/PyOP2 Kernel (below) execution over grid (right) OpenMP backend

14 14/20 Putting it all together 1, 600 lines of highly modular python code (incl. comments) 1 Use UFL wherever possible φ( u) dx assemble(phi*div(u)*dx) (weak divergence operator) 2 Use PETSc for iterative solvers provide operators/preconditioners as callbacks 3 Escape hatch: Write direct PyOP2 parallel loops lumped mass matrix division: U (M u ) 1 U kernel_code = void lumped_mass_divide(double **m,double **U) { for (int i=0; i<4; ++i) U[i][0] /= m[0][i]; } kernel = op2.kernel(kernel_code,"lumped_mass_divide") op2.par_loop(kernel,facetset, m.dat(op2.read,self.facet2dof_map_facets), u.dat(op2.rw,self.facet2dof_map_bdfm1))

15 15/20 Helmholtz operator Helmholtz operator Hφ = M φ φ + ω 2 B(M u ) 1 B T φ Operator application class Operator(object): [...] def apply(self,phi): psi = TestFunction(V_pressure) w = TestFunction(V_velocity) B_phi = assemble(div(w)*phi*dx) self.velocity_mass_matrix.divide(b_phi) BT_B_phi = assemble(psi*div(b_phi)*dx) M_phi = assemble(psi*phi*dx) return assemble(m_phi + self.omega**2*bt_b_phi) Interface with PETSc petsc_op = PETSc.Mat().create() petsc_op.setpythoncontext(operator(omega)) petsc_op.setup() ksp = PETSc.KSP() ksp.create() ksp.setoperators(petsc_op)

16 16/20 Algorithmic performance Convergence history Inner Solve: 3 CG iterations with multigrid preconditioner Icosahedral grid, 327,680 cells 4 multigrid levels relative residual r 2 / r DG 0 +RT 0 Richardson DG 0 +RT 0 GMRES DG 1 +BDFM 1 Richardson DG 1 +BDFM 1 GMRES iteration

17 17/20 Efficiency Single node run on ARCHER, lowest order, cells [preliminary] Total solve Pressure solve Helmholtz operator 17.1GB/s 1.3GB/s 4.7GB/s 10.6GB/s time [s] STREAM triad: 74.1GB/s per node up to 23% of peak BW

18 18/20 Parallel scalability Weak scaling on ARCHER [preliminary] 10 2 Number of cells hypre BoomerAMG matrix-free MG DG 1 +BDFM 1 Solution time [s] mio 5.2mio DG 0 +RT mio 83.9mio 335.5mio Number of cores

19 19/20 Conclusion Summary Outlook Matrix-free multigrid-preconditioner for pressure correction equation Implementation in firedrake/pyop2/petsc framework: Performance portability Correct abstractions Separation of concerns Test on GPU backend Extend to 3d: Regular vertical grid Improved caching Improved memory BW Improve and extend parallel scaling Full 3d dynamical core (Colin Cotter)

20 20/20 References The firedrake project: F. Rathgeber et al.: Firedrake: automating the finite element method by composing abstractions. in preparation F. Rathgeber et al.: PyOP2: A High-Level Framework for Performance-Portable Simulations on Unstructured Meshes. HPC, Networking Storage and Analysis, SC Companion, p , Los Alamitos, CA, USA, IEEE Computer Society E. Müller et al.: Petascale elliptic solvers for anisotropic PDEs on GPU clusters. Submitted to Parallel Computing (2014) [arxiv: ] E. Müller, R. Scheichl: Massively parallel solvers for elliptic PDEs in Numerical Weather- and Climate Prediction. QJRMS (2013) [arxiv: ] Helmholtz solver code on github

A primal-dual mixed finite element method. for accurate and efficient atmospheric. modelling on massively parallel computers

A primal-dual mixed finite element method for accurate and efficient atmospheric modelling on massively parallel computers John Thuburn (University of Exeter, UK) Colin Cotter (Imperial College, UK) AMMW03,