Solving PDEs on Supercomputers I: modern supercomputer architecture

1 Supercomputer architecture Solving PDEs on Supercomputers I: modern supercomputer architecture Patrick Farrell MMSC: Python in Scientific Computing May 17, 2015 P. E. Farrell (Oxford) SPS I May 17, / 17

2 Supercomputer architecture Moore's Law Moore's Law: The number of transistors per unit area on integrated circuits doubles every two years. (1965) P. E. Farrell (Oxford) SPS I May 17, / 17

3 Supercomputer architecture Moore's Law The consequence: Individual computers aren't getting faster: we're getting more of them. P. E. Farrell (Oxford) SPS I May 17, / 17

4 Supercomputer architecture A modern supercomputer In this lecture we will give a brief overview of modern supercomputer architecture. ARCHER is composed of 4920 nodes, each with 24 cores, for a total of 118,080 cores. P. E. Farrell (Oxford) SPS I May 17, / 17

5 Supercomputer architecture A node P. E. Farrell (Oxford) SPS I May 17, / 17

6 Supercomputer architecture A node Algorithmic consequence Extreme pressure on memory and memory bandwidth. P. E. Farrell (Oxford) SPS I May 17, / 17

7 Supercomputer architecture A socket P. E. Farrell (Oxford) SPS I May 17, / 17

8 Supercomputer architecture A socket Algorithmic consequence Want to have multiple cores working on the same data. P. E. Farrell (Oxford) SPS I May 17, / 17

9 Supercomputer architecture A core P. E. Farrell (Oxford) SPS I May 17, / 17

10 Supercomputer architecture A core Algorithmic consequence Vectorisation essential for maximum floating point performance. P. E. Farrell (Oxford) SPS I May 17, / 17

11 Supercomputer architecture Hardware properties Some relative timings On a 3.0 GHz Intel Core 2 Duo E8400: One clock cycle: 1/3 of a nanosecond (≈ 10 light-cm!). Accessing L1 data cache (32 KB): 3 cycles. Accessing L2 cache (6 MB): 14 cycles. Accessing main memory: 250 cycles. Accessing disk: 40 million cycles. P. E. Farrell (Oxford) SPS I May 17, / 17

12 Supercomputer architecture Hardware properties Some relative timings On a 3.0 GHz Intel Core 2 Duo E8400: One clock cycle: 1/3 of a nanosecond (≈ 10 light-cm!). Accessing L1 data cache (32 KB): 3 cycles. Accessing L2 cache (6 MB): 14 cycles. Accessing main memory: 250 cycles. Accessing disk: 40 million cycles. Analogy Register: the data is on your working paper. L1 cache: the data is on your desk (3 seconds). L2 cache: the data is on your bookshelf (14 seconds). Main memory: the data is in the library (a 4 minute walk). P. E. Farrell (Oxford) SPS I May 17, / 17

13 Supercomputer architecture Hardware properties Some relative timings On a 3.0 GHz Intel Core 2 Duo E8400: One clock cycle: 1/3 of a nanosecond (≈ 10 light-cm!). Accessing L1 data cache (32 KB): 3 cycles. Accessing L2 cache (6 MB): 14 cycles. Accessing main memory: 250 cycles. Accessing disk: 40 million cycles. Analogy Register: the data is on your working paper. L1 cache: the data is on your desk (3 seconds). L2 cache: the data is on your bookshelf (14 seconds). Main memory: the data is in the library (a 4 minute walk). Disk: go backpacking for 1.2 years. P. E. Farrell (Oxford) SPS I May 17, / 17
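You can see the memory hierarchy even from Python. A minimal numpy sketch (not from the slides): summing the same data through a random permutation defeats the caches and the prefetcher, so the second sum is typically several times slower than the sequential one.

    import time
    import numpy as np

    x = np.random.rand(2**24)            # ~128 MB of doubles, far larger than any cache
    seq = np.arange(x.size)              # sequential (cache-friendly) index
    rnd = np.random.permutation(x.size)  # random (cache-hostile) index

    t0 = time.perf_counter(); s1 = x[seq].sum(); t1 = time.perf_counter()
    t2 = time.perf_counter(); s2 = x[rnd].sum(); t3 = time.perf_counter()
    print("sequential: %.3fs, random: %.3fs" % (t1 - t0, t3 - t2))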

14 Supercomputer architecture Hardware properties The interconnect P. E. Farrell (Oxford) SPS I May 17, / 17

15 Supercomputer architecture Hardware properties Some more timings On the Cray Aries interconnect, to send a message: Within a socket: 800 cycles Within a node: 1600 cycles Across the machine: 8000 cycles P. E. Farrell (Oxford) SPS I May 17, / 17

16 Supercomputer architecture Hardware properties Some more timings On the Cray Aries interconnect, to send a message: Within a socket: 800 cycles Within a node: 1600 cycles Across the machine: 8000 cycles Algorithmic consequence Interleave communication and computation. P. E. Farrell (Oxford) SPS I May 17, / 17
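One way to interleave communication and computation from Python is with mpi4py's nonblocking calls. A minimal sketch (not from the slides, assuming mpi4py and numpy are available): the messages are started, local work proceeds while they are in flight, and only then do we wait.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    sendbuf = np.full(100000, rank, dtype='d')
    recvbuf = np.empty(100000, dtype='d')

    # start the messages, then compute while they are in flight
    reqs = [comm.Isend(sendbuf, dest=(rank + 1) % size),
            comm.Irecv(recvbuf, source=(rank - 1) % size)]
    local_work = np.sin(sendbuf).sum()
    MPI.Request.Waitall(reqs)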

17 MPI and OpenMP Domain decomposition The coarsest level of parallelism used is domain decomposition over MPI.
    from dolfin import *
    mesh = UnitCubeMesh(32, 32, 32)
    partitioning = CellFunction("size_t", mesh)
    partitioning.set_all(MPI.rank(mpi_comm_world()))
    File("output/partitioning.xdmf") << partitioning
$ mpiexec -n 4 python partition.py
P. E. Farrell (Oxford) SPS I May 17, / 17

18 MPI and OpenMP MPI: basic model MPI Separate processes with separate memory spaces communicate via message passing. MPI concepts:
- communicator
- collective
- rank
- blocking and nonblocking communication
- reductions
Each subdomain is assigned to one MPI rank.
P. E. Farrell (Oxford) SPS I May 17, / 17
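A minimal mpi4py illustration of these concepts (not from the slides): every process holds its own data, and a collective reduction over the communicator combines the contributions of all ranks.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD          # the communicator
    rank = comm.Get_rank()         # this process's rank

    local = (rank + 1)**2          # data private to this process
    total = comm.allreduce(local, op=MPI.SUM)   # collective reduction
    print("rank %d: local %d, global sum %d" % (rank, local, total))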

19 MPI and OpenMP Main communication patterns in finite elements Assembly Assembly requires exchanging halo data with your neighbours. [Figure: the cells of processor 0 and processor 1, partitioned into core, owned, and exec/non-exec halo regions.] P. E. Farrell (Oxford) SPS I May 17, / 17

20 MPI and OpenMP Main communication patterns in finite elements Krylov solvers
- Neighbour communications for the sparse matrix-vector product.
- Global reductions (allreduce for dot products).
- Preconditioner application. Multigrid: extremely complicated.
P. E. Farrell (Oxford) SPS I May 17, / 17

21 MPI and OpenMP OpenMP: basic model OpenMP Separate threads operate on the same memory space.
- Less overhead in parallel execution
- Multiple cores can act on the same data
- Less pressure on memory and memory bandwidth
- Easier load balancing
- Extremely difficult to program correctly
- Subtle race conditions possible
- Colouring and locks required to synchronise
P. E. Farrell (Oxford) SPS I May 17, / 17

22 MPI and OpenMP DOLFIN can also run in OpenMP mode for assembly:
    from dolfin import *
    parameters["num_threads"] = 4
    # ...
    solve(f == 0, u)  # must use a threaded solver (e.g. pastix)!
You can't use MPI and OpenMP at the same time (yet).
P. E. Farrell (Oxford) SPS I May 17, / 17

23 Algorithmic consequences General algorithmic consequences Need algorithms with high arithmetical intensity. P. E. Farrell (Oxford) SPS I May 17, / 17

24 Algorithmic consequences General algorithmic consequences Need algorithms with high arithmetical intensity. Caches greatly dislike unstructured memory accesses. P. E. Farrell (Oxford) SPS I May 17, / 17

25 Algorithmic consequences General algorithmic consequences Need algorithms with high arithmetical intensity. Caches greatly dislike unstructured memory accesses. Flops are (approximately) free. P. E. Farrell (Oxford) SPS I May 17, / 17

26 Algorithmic consequences General algorithmic consequences Need algorithms with high arithmetical intensity. Caches greatly dislike unstructured memory accesses. Flops are (approximately) free. Large stencils induce extra communication. P. E. Farrell (Oxford) SPS I May 17, / 17

27 Algorithmic consequences General algorithmic consequences Need algorithms with high arithmetical intensity. Caches greatly dislike unstructured memory accesses. Flops are (approximately) free. Large stencils induce extra communication. Must overlap communication and computation. P. E. Farrell (Oxford) SPS I May 17, / 17

28 Algorithmic consequences General algorithmic consequences Need algorithms with high arithmetical intensity. Caches greatly dislike unstructured memory accesses. Flops are (approximately) free. Large stencils induce extra communication. Must overlap communication and computation. Solver algorithms must be O(n) or O(n log n). P. E. Farrell (Oxford) SPS I May 17, / 17

29 Algorithmic consequences General algorithmic consequences Need algorithms with high arithmetical intensity. Caches greatly dislike unstructured memory accesses. Flops are (approximately) free. Large stencils induce extra communication. Must overlap communication and computation. Solver algorithms must be O(n) or O(n log n). General algorithmic trends Domain-decomposed high-order FE on semi-structured meshes. Multigrid/multilevel solvers with Krylov accelerators. Hybrid parallelism strategies (MPI/OpenMP/AVX). P. E. Farrell (Oxford) SPS I May 17, / 17

30 Solving PDEs on Supercomputers II: practical matters of using supercomputers Patrick Farrell MMSC: Python in Scientific Computing May 17, 2015 P. E. Farrell (Oxford) SPS 2 May 17, / 7

31 Logging on Supercomputers are accessed by sshing to the login nodes.
$ ssh mmschpcxx@arcus.oerc.ox.ac.uk
You configure your environment with modules:
$ module list
No Modulefiles Currently Loaded.
$ module avail
...
$ module use -a /data/math-farrellp/crichardson/modules
$ module load fenics/1.5.0
$ module list
Modules are generally awful, but nothing better exists yet.
P. E. Farrell (Oxford) SPS 2 May 17, / 7

32 Running jobs interactively The simplest way to run a job is interactively. This is mainly used for debugging.
$ qsub -I -l nodes=1:ppn=16 -l walltime=0:10:00 -q develq
qsub: waiting for job headnode1.arcus.osc.local to start
# wait until PBS allocates us the resources we asked for...
qsub: job headnode1.arcus.osc.local ready
$ cd $PBS_O_WORKDIR
$ module use -a /data/math-farrellp/crichardson/modules
$ module load fenics/1.5.0
$ mpirun $MPI_HOSTS python poisson.py
P. E. Farrell (Oxford) SPS 2 May 17, / 7

33 Running jobs in batch mode ARCUS-A and ARCHER are managed using PBS, the Portable Batch System. Users submit jobs to the batch system which decides when and where they get executed. The main PBS commands: qsub qdel qstat The argument to qsub is a PBS script. P. E. Farrell (Oxford) SPS 2 May 17, / 7

34 Running jobs in batch mode
#!/bin/bash
# set the number of nodes and processes per node
#PBS -l nodes=1:ppn=16
# set max wallclock time
#PBS -l walltime=1:00:00
# set name of job
#PBS -N poisson
# mail alert at start, end and abortion of execution
#PBS -m bea
# send mail to this address
#PBS -M patrick.farrell@maths.ox.ac.uk
# start job from the directory it was submitted
cd $PBS_O_WORKDIR
module use -a /data/math-farrellp/crichardson/modules
module load fenics/1.5.0
enable_arcus_mpi.sh
mpirun $MPI_HOSTS python poisson.py | tee poisson.log
P. E. Farrell (Oxford) SPS 2 May 17, / 7

35 HPC 02 Challenge! Investigate the weak scaling of the 2D Poisson solver with parallel LU that you developed last week: Have the code refine the mesh once each time the number of cores quadruples (see the sketch below). Hint:
    size = MPI.size(mpi_comm_world())
    ...
    for i in range(nrefine):
        mesh = refine(mesh, redistribute=False)
Run the code on 1, 4 and 16 cores. What happens to the runtime as the problem is scaled weakly? ...
P. E. Farrell (Oxford) SPS 2 May 17, / 7
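One plausible way to fill in the hint (a sketch, not the official solution; the base mesh size and the log-base-4 computation of nrefine are assumptions):

    from dolfin import *
    import math

    size = MPI.size(mpi_comm_world())
    nrefine = int(round(math.log(size, 4)))   # one refinement per 4x increase in cores

    mesh = UnitSquareMesh(64, 64)             # assumed base mesh
    for i in range(nrefine):
        mesh = refine(mesh, redistribute=False)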

36 HPC 02 Challenge! Which components of the solver are taking the longest? Profile the code with the DOLFIN timing system:
    list_timings()
and the PETSc timing system:
    import petsc4py
    petsc4py.init("-log_summary summary.log".split())
    from dolfin import *
Now switch to HYPRE algebraic multigrid and compare the timings again. Hint: to get more details about the AMG solve, call
    PETScOptions.set("pc_hypre_boomeramg_print_statistics", 1)
P. E. Farrell (Oxford) SPS 2 May 17, / 7

37 Solving PDEs on Supercomputers III: an introduction to PETSc Patrick Farrell MMSC: Python in Scientific Computing May 17, 2015 P. E. Farrell (Oxford) SPS 3 May 17, / 5

38 PETSc PETSc is a library of linear and nonlinear solvers for sparse PDEs. It has won most awards going: SIAM/ACM Prize in Computational Science and Engineering, R&D Award Gordon Bell Prizes in 2009, 2004, 2003, PETSc makes it easy to express complex hierarchical composed solvers as compactly as possible. P. E. Farrell (Oxford) SPS 3 May 17, / 5

39 Fundamental objects [Vec, Mat, PC, KSP, SNES] Vec Vec represents a dense vector, decomposed in parallel. Example
    ierr = VecCreateMPI(PETSC_COMM_WORLD, local, global, &x);
    ierr = VecDuplicate(x, &y);
    ierr = VecDotBegin(x, y, &xty);
    /* other computations */
    ierr = VecDotEnd(x, y, &xty);
P. E. Farrell (Oxford) SPS 3 May 17, / 5

40 Fundamental objects [Vec, Mat, PC, KSP, SNES] Mat Mat represents a sparse matrix, decomposed in parallel. Example
    ierr = MatCreateAIJ(PETSC_COMM_WORLD, ..., &mat);
    for (i = 0; i < local_rows; i++)
      ierr = MatSetValues(mat, ...);
    ierr = MatAssemblyBegin(mat, MAT_FINAL_ASSEMBLY);
    ierr = MatAssemblyEnd(mat, MAT_FINAL_ASSEMBLY);
    ierr = MatMult(mat, x, y);
P. E. Farrell (Oxford) SPS 3 May 17, / 5

41 Fundamental objects [Vec, Mat, PC, KSP, SNES] PC PC represents a linear preconditioner (Jacobi, Gauss-Seidel, ILU, ICC, AMG, additive Schwarz,...) Example
    ierr = PCCreate(PETSC_COMM_WORLD, &pc);
    ierr = PCSetOperators(pc, A, P);
    ierr = PCSetType(pc, PCILU);
    ierr = PCSetUp(pc);
    ierr = PCApply(pc, x, y);
P. E. Farrell (Oxford) SPS 3 May 17, / 5

42 Fundamental objects [Vec, Mat, PC, KSP, SNES] KSP KSP represents a linear solver (CG, GMRES, TFQMR, BICGSTAB, MINRES, GCR, Richardson, Chebyshev,...) Example
    ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);
    ierr = KSPSetOperators(ksp, A, P);
    ierr = KSPSetType(ksp, KSPCG);
    ierr = KSPSetUp(ksp);
    ierr = KSPSolve(ksp, b, x);
P. E. Farrell (Oxford) SPS 3 May 17, / 5

43 Fundamental objects [Vec, Mat, PC, KSP, SNES] SNES SNES represents a nonlinear solver (Newton, reduced-space Newton, NGMRES, NCG, Anderson acceleration, FAS,...) Example
    ierr = SNESCreate(PETSC_COMM_WORLD, &snes);
    ierr = SNESSetFunction(snes, r, residual);
    ierr = SNESSetJacobian(snes, J, P, jacobian);
    ierr = SNESSetType(snes, SNESVINEWTONRSLS);
    ierr = SNESSetVariableBounds(snes, xl, xu);
    ierr = SNESSetUp(snes);
    ierr = SNESSolve(snes, b, x);
P. E. Farrell (Oxford) SPS 3 May 17, / 5
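The same objects are available from Python via petsc4py, which is how the later lectures drive them from FEniCS. A minimal serial sketch (not from the slides) that builds a 1D Laplacian and solves it with CG/ILU:

    from petsc4py import PETSc

    n = 100
    A = PETSc.Mat().createAIJ([n, n], nnz=3)   # tridiagonal: 3 nonzeros per row
    A.setUp()
    for i in range(n):
        A.setValue(i, i, 2.0)
        if i > 0:     A.setValue(i, i - 1, -1.0)
        if i < n - 1: A.setValue(i, i + 1, -1.0)
    A.assemble()

    b = A.createVecRight(); b.set(1.0)
    x = A.createVecLeft()

    ksp = PETSc.KSP().create()
    ksp.setOperators(A)
    ksp.setType("cg")
    ksp.getPC().setType("ilu")
    ksp.setFromOptions()
    ksp.solve(b, x)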

44 Hierarchical composition Principle All objects are composable. P. E. Farrell (Oxford) SPS 3 May 17, / 5

45 Hierarchical composition Principle All objects are composable. Principle All objects are configurable. P. E. Farrell (Oxford) SPS 3 May 17, / 5

46 Hierarchical composition Principle All objects are composable. Principle All objects are configurable. (example from variational fracture mechanics) P. E. Farrell (Oxford) SPS 3 May 17, / 5

47 Wiring PETSc and FEniCS We're going to need fine control to design our solvers. A simple interface between FEniCS and PETSc: $ git clone P. E. Farrell (Oxford) SPS 3 May 17, / 5

48 Solving PDEs on Supercomputers IV: algebraic multigrid Patrick Farrell MMSC: Python in Scientific Computing May 18, 2015 P. E. Farrell (Oxford) SPS 4 May 18, / 13

49 Multilevel solvers At the core of most PDE solvers is the solution of a linear system. Linear system: Ax = b. The most powerful solvers for PDEs exploit the fact that there exists an infinite hierarchy of discretisations, all approximating the same problem. Hierarchy of linear systems: A_h x_h = b_h, A_{2h} x_{2h} = b_{2h}, A_{4h} x_{4h} = b_{4h}, ... P. E. Farrell (Oxford) SPS 4 May 18, / 13

50 Geometric multigrid: review Geometric multigrid algorithm Begin with an initial guess. P. E. Farrell (Oxford) SPS 4 May 18, / 13

51 Geometric multigrid: review Geometric multigrid algorithm Begin with an initial guess. Apply a relaxation method to smooth the error. P. E. Farrell (Oxford) SPS 4 May 18, / 13

52 Geometric multigrid: review Geometric multigrid algorithm Begin with an initial guess. Apply a relaxation method to smooth the error. Solve for the smooth error on a coarse grid. P. E. Farrell (Oxford) SPS 4 May 18, / 13

53 Why did geometric multigrid work? Geometric multigrid worked on the Laplacian because: simple relaxation methods yielded geometrically smooth errors; those errors could be well-represented on coarse grids. What about problems where the error isn't smooth after relaxation? P. E. Farrell (Oxford) SPS 4 May 18, / 13

54 Why did geometric multigrid work? Geometric multigrid worked on the Laplacian because: simple relaxation methods yielded geometrically smooth errors; those errors could be well-represented on coarse grids. What about problems where the error isn't smooth after relaxation? Anisotropic Laplacian: −a u_xx − b u_yy = f in Ω = [0, 1]², u = g on ∂Ω, with a = b if x < 1/2 and a ≫ b if x ≥ 1/2. P. E. Farrell (Oxford) SPS 4 May 18, / 13

55 Why did geometric multigrid work? Geometric multigrid worked on the Laplacian because: simple relaxation methods yielded geometrically smooth errors; those errors could be well-represented on coarse grids. What about problems where the error isn't smooth after relaxation? P. E. Farrell (Oxford) SPS 4 May 18, / 13

56 Two responses GMG: design increasingly arcane relaxation methods that do smooth; semi-coarsening, multi-coarsening, etc. P. E. Farrell (Oxford) SPS 4 May 18, / 13

57 Two responses GMG: design increasingly arcane relaxation methods that do smooth; semi-coarsening, multi-coarsening, etc. AMG: fix a simple relaxation method; algebraically construct coarse grids and interpolation operators; demand that these can well represent the error after relaxation. P. E. Farrell (Oxford) SPS 4 May 18, / 13

58 Two responses GMG: design increasingly arcane relaxation methods that do smooth; semi-coarsening, multi-coarsening, etc. AMG: fix a simple relaxation method; algebraically construct coarse grids and interpolation operators; demand that these can well represent the error after relaxation. A nice side effect: AMG requires much less infrastructure:
- No need to supply coarse grids
- No need to supply interpolation operators
- Only applies to linear problems
- Requires global linearisation (memory)
- Requires near-nullspace of operator
P. E. Farrell (Oxford) SPS 4 May 18, / 13

59 Anisotropic Laplacian again P. E. Farrell (Oxford) SPS 4 May 18, / 13

60 Anisotropic Laplacian again P. E. Farrell (Oxford) SPS 4 May 18, / 13

61 Fundamental principles of AMG I: relaxation and error Recall Richardson iteration with a preconditioner P. Richardson iteration: x_{k+1} = x_k + P^{-1}(b − Ax_k). P. E. Farrell (Oxford) SPS 4 May 18, / 13

62 Fundamental principles of AMG I: relaxation and error Recall Richardson iteration with a preconditioner P. Richardson iteration: x_{k+1} = x_k + P^{-1}(b − Ax_k). A simple error analysis shows: e_{k+1} = (I − P^{-1}A) e_k. P. E. Farrell (Oxford) SPS 4 May 18, / 13

63 Fundamental principles of AMG I: relaxation and error Recall Richardson iteration with a preconditioner P. Richardson iteration: x_{k+1} = x_k + P^{-1}(b − Ax_k). A simple error analysis shows: e_{k+1} = (I − P^{-1}A) e_k. Now if e_{k+1} ≈ e_k then P^{-1}A e_k ≈ 0, i.e. A e_k ≈ 0: the surviving error lies in the near-nullspace of A. P. E. Farrell (Oxford) SPS 4 May 18, / 13

64 Fundamental principles of AMG I: relaxation and error Error after relaxation The error after relaxation is related to the near-nullspace of the operator. P. E. Farrell (Oxford) SPS 4 May 18, / 13
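A tiny numpy experiment (not from the slides) that illustrates this: after many damped Jacobi sweeps on a 1D Laplacian, the remaining error e satisfies ||Ae|| << ||e||, i.e. it lies close to the near-nullspace of A.

    import numpy as np

    n = 50
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Laplacian stencil
    P = np.diag(np.diag(A))                              # Jacobi preconditioner

    x_exact = np.random.rand(n)
    b = A @ x_exact
    x = np.zeros(n)
    for k in range(100):                                  # damped Richardson/Jacobi
        x = x + 0.8 * np.linalg.solve(P, b - A @ x)

    e = x_exact - x
    print(np.linalg.norm(A @ e) / np.linalg.norm(e))      # much smaller than ||A||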

65 Fundamental principles of AMG II: interpolation Recall that in one multigrid cycle we approximate the fine error as e_h ≈ P^h_H e_H. Thus, we want the near-nullspace to be in the range of P^h_H. P. E. Farrell (Oxford) SPS 4 May 18, / 13

66 Coarse grid generation: an example Classical AMG: coarse-grid generation 1. Select C-point with maximal measure 2. Select neighbours as F-points 3. Update measures of neighbours P. E. Farrell (Oxford) SPS 4 May 18, / 13

67 Coarse grid generation: an example Classical AMG: coarse-grid generation 1. Select C-point with maximal measure 2. Select neighbours as F-points 3. Update measures of neighbours P. E. Farrell (Oxford) SPS 4 May 18, / 13

68 Coarse grid generation: an example Classical AMG: coarse-grid generation 1. Select C-point with maximal measure 2. Select neighbours as F-points 3. Update measures of neighbours P. E. Farrell (Oxford) SPS 4 May 18, / 13

69 Coarse grid generation: an example Classical AMG: coarse-grid generation 1. Select C-point with maximal measure 2. Select neighbours as F-points 3. Update measures of neighbours P. E. Farrell (Oxford) SPS 4 May 18, / 13

70 Coarse grid generation: an example Classical AMG: coarse-grid generation 1. Select C-point with maximal measure 2. Select neighbours as F-points 3. Update measures of neighbours P. E. Farrell (Oxford) SPS 4 May 18, / 13

71 Coarse grid generation: an example Classical AMG: coarse-grid generation 1. Select C-point with maximal measure 2. Select neighbours as F-points 3. Update measures of neighbours P. E. Farrell (Oxford) SPS 4 May 18, / 13

72 Coarse grid generation: an example Classical AMG: coarse-grid generation 1. Select C-point with maximal measure 2. Select neighbours as F-points 3. Update measures of neighbours P. E. Farrell (Oxford) SPS 4 May 18, / 13

73 Coarse grid generation: an example Classical AMG: coarse-grid generation 1. Select C-point with maximal measure 2. Select neighbours as F-points 3. Update measures of neighbours P. E. Farrell (Oxford) SPS 4 May 18, / 13

74 Coarse grid generation: an example Classical AMG: coarse-grid generation 1. Select C-point with maximal measure 2. Select neighbours as F-points 3. Update measures of neighbours P. E. Farrell (Oxford) SPS 4 May 18, / 13

75 Coarse grid generation: an example Classical AMG: coarse-grid generation 1. Select C-point with maximal measure 2. Select neighbours as F-points 3. Update measures of neighbours P. E. Farrell (Oxford) SPS 4 May 18, / 13

76 Coarse grid generation: an example Classical AMG: coarse-grid generation 1. Select C-point with maximal measure 2. Select neighbours as F-points 3. Update measures of neighbours P. E. Farrell (Oxford) SPS 4 May 18, / 13

77 Coarse grid generation: an example Smoothed-aggregation AMG: coarse-grid generation Phase 1: 1. Pick a root point not adjacent to an aggregation 2. Aggregate root and neighbours Phase 2: Move points into nearby aggregations P. E. Farrell (Oxford) SPS 4 May 18, / 13

78 Coarse grid generation: an example Smoothed-aggregation AMG: coarse-grid generation Phase 1: 1. Pick a root point not adjacent to an aggregation 2. Aggregate root and neighbours Phase 2: Move points into nearby aggregations P. E. Farrell (Oxford) SPS 4 May 18, / 13

79 Coarse grid generation: an example Smoothed-aggregation AMG: coarse-grid generation Phase 1: 1. Pick a root point not adjacent to an aggregation 2. Aggregate root and neighbours Phase 2: Move points into nearby aggregations P. E. Farrell (Oxford) SPS 4 May 18, / 13

80 Coarse grid generation: an example Smoothed-aggregation AMG: coarse-grid generation Phase 1: 1. Pick a root point not adjacent to an aggregation 2. Aggregate root and neighbours Phase 2: Move points into nearby aggregations P. E. Farrell (Oxford) SPS 4 May 18, / 13

81 Coarse grid generation: an example Smoothed-aggregation AMG: coarse-grid generation Phase 1: 1. Pick a root point not adjacent to an aggregation 2. Aggregate root and neighbours Phase 2: Move points into nearby aggregations P. E. Farrell (Oxford) SPS 4 May 18, / 13

82 Coarse grid generation: an example Smoothed-aggregation AMG: coarse-grid generation Phase 1: 1. Pick a root point not adjacent to an aggregation 2. Aggregate root and neighbours Phase 2: Move points into nearby aggregations P. E. Farrell (Oxford) SPS 4 May 18, / 13

83 Coarse grid generation: an example Smoothed-aggregation AMG: coarse-grid generation Phase 1: 1. Pick a root point not adjacent to an aggregation 2. Aggregate root and neighbours Phase 2: Move points into nearby aggregations P. E. Farrell (Oxford) SPS 4 May 18, / 13

84 Coarse grid generation: an example Smoothed-aggregation AMG: coarse-grid generation Phase 1: 1. Pick a root point not adjacent to an aggregation 2. Aggregate root and neighbours Phase 2: Move points into nearby aggregations P. E. Farrell (Oxford) SPS 4 May 18, / 13

85 Coarse grid generation: an example Smoothed-aggregation AMG: coarse-grid generation Phase 1: 1. Pick a root point not adjacent to an aggregation 2. Aggregate root and neighbours Phase 2: Move points into nearby aggregations P. E. Farrell (Oxford) SPS 4 May 18, / 13

86 HPC 04 Challenge! Consider the linear elasticity equation
−∇·σ(u) = f in Ω,
u = 0 on ∂Ω_D,
σ·n = 0 on ∂Ω_N,
on the pulley mesh, where
ε(u) = (1/2)(∇u + ∇u^T),
σ(u) = 2µε(u) + λ tr(ε(u)) I,
f = (ρω²x, ρω²y, 0),
∂Ω_D = {(x, y, z) ∈ ∂Ω : x² + y² < ( z)²},
∂Ω_N = ∂Ω \ ∂Ω_D,
E = 10⁹, ν = 0.3, ρ = 10, ω = 300.
P. E. Farrell (Oxford) SPS 4 May 18, / 13

87 HPC 04 Challenge! Solve this problem using only smoothed aggregation algebraic multigrid (no Krylov accelerator: -ksp_type richardson -ksp_monitor_true_residual -pc_type gamg). How many iterations does it take to converge to atol (a) without the near-nullspace (b) with the near-nullspace? Here the near-nullspace is the rigid body translations and rotations. Now investigate the configuration of the smoothed aggregation AMG solver and the Krylov accelerator. (Hint: -help, -snes_view). By tuning the solver, can you achieve faster convergence? P. E. Farrell (Oxford) SPS 4 May 18, / 13
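One way to supply the rigid body modes to GAMG from DOLFIN is to interpolate them into the displacement space and attach them to the PETSc matrix. A hedged sketch (the vector function space V and the assembled matrix A are assumed to exist, and you may wish to orthonormalise the basis; the exact petsc4py calls can vary between versions):

    from dolfin import *
    from petsc4py import PETSc

    # rigid body modes in 3D: three translations and three rotations
    modes = [Constant((1, 0, 0)), Constant((0, 1, 0)), Constant((0, 0, 1)),
             Expression(("-x[1]", "x[0]", "0.0")),
             Expression(("-x[2]", "0.0", "x[0]")),
             Expression(("0.0", "-x[2]", "x[1]"))]
    basis = [as_backend_type(interpolate(m, V).vector()).vec() for m in modes]

    nsp = PETSc.NullSpace().create(vectors=basis)
    as_backend_type(A).mat().setNearNullSpace(nsp)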

88 Solving PDEs on Supercomputers V: algebraic multigrid on nonsymmetric problems Patrick Farrell MMSC: Python in Scientific Computing May 19, 2015 P. E. Farrell (Oxford) SPS 5 May 19, / 4

89 HPC 05 Challenge! (1/3) Implement a solver for the Yamabe equation 8 2 u + 1 r 3 u u = 0 on the doughnut mesh with boundary conditions u = 1. Initialise Newton with the initial guess u = 1. P. E. Farrell (Oxford) SPS 5 May 19, / 4

90 HPC 05 Challenge! (2/3) Next, develop an efficient linear solver:
1. First use Newton + LU.
2. Next, try GMRES + GAMG. Does it work well?
3. Try increasing the maximum size of the coarse grid (pc_gamg_coarse_eq_limit).
4. Ah! Now we're getting somewhere. Does changing the smoother help (mg_levels_ksp_monitor_true_residual)?
5. Increase the quality of the smoothed aggregation basis (pc_gamg_agg_nsmooths).
P. E. Farrell (Oxford) SPS 5 May 19, / 4

91 HPC 05 Challenge! (3/3) Profile the code. Where is it spending most of its time? How can the preconditioner construction cost be reduced? Once that is done, compare the memory usage of GMRES, FGMRES, GCR and CGS. P. E. Farrell (Oxford) SPS 5 May 19, / 4

92 Solving PDEs on Supercomputers VI: fieldsplit preconditioners Patrick Farrell MMSC: Python in Scientific Computing May 19, 2015 P. E. Farrell (Oxford) SPS 6 May 19, / 8

93 Block triangular factorisations A block matrix with nonsingular A has a block triangular factorisation:
J = [[A, B], [C, D]] = [[I, 0], [CA^{-1}, I]] [[A, 0], [0, S]] [[I, A^{-1}B], [0, I]],
where S = D − CA^{-1}B is the (dense!) Schur complement. This gives us an expression for its inverse:
[[A, B], [C, D]]^{-1} = [[I, −A^{-1}B], [0, I]] [[A^{-1}, 0], [0, S^{-1}]] [[I, 0], [−CA^{-1}, I]].
P. E. Farrell (Oxford) SPS 6 May 19, / 8
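A quick numpy check of the factorisation (a sketch, not from the slides):

    import numpy as np

    n = 4
    A = np.random.rand(n, n) + n*np.eye(n)          # make A comfortably nonsingular
    B, C, D = (np.random.rand(n, n) for _ in range(3))
    J = np.block([[A, B], [C, D]])

    Ainv = np.linalg.inv(A)
    S = D - C @ Ainv @ B                            # the Schur complement
    I, Z = np.eye(n), np.zeros((n, n))

    L = np.block([[I, Z], [C @ Ainv, I]])
    M = np.block([[A, Z], [Z, S]])
    U = np.block([[I, Ainv @ B], [Z, I]])
    print(np.allclose(J, L @ M @ U))                # True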

94 Fieldsplit preconditioners This gives rise to four related theorems. Theorem (full): the choice P = [[I, 0], [CA^{-1}, I]] [[A, 0], [0, S]] [[I, A^{-1}B], [0, I]] will induce Krylov convergence in 1 iteration. P. E. Farrell (Oxford) SPS 6 May 19, / 8

95 Fieldsplit preconditioners This gives rise to four related theorems. Theorem (lower): the choice P = [[I, 0], [CA^{-1}, I]] [[A, 0], [0, S]] will induce Krylov convergence in 2 iterations. P. E. Farrell (Oxford) SPS 6 May 19, / 8

96 Fieldsplit preconditioners This gives rise to four related theorems. Theorem (upper): the choice P = [[A, 0], [0, S]] [[I, A^{-1}B], [0, I]] will induce Krylov convergence in 2 iterations. P. E. Farrell (Oxford) SPS 6 May 19, / 8

97 Fieldsplit preconditioners This gives rise to four related theorems. Theorem (diag): the choice P = [[A, 0], [0, S]] will induce Krylov convergence in 3 iterations, if D = 0. P. E. Farrell (Oxford) SPS 6 May 19, / 8

98 Fieldsplit preconditioners This gives rise to four related theorems. Theorem (diag): the choice P = [[A, 0], [0, S]] will induce Krylov convergence in 3 iterations, if D = 0. How do you use this? Cheaply approximate A^{-1} and S^{-1} (problem specific)! P. E. Farrell (Oxford) SPS 6 May 19, / 8

99 Spectral equivalence Definition (spectral equivalence): A_h and B_h ∈ R^{n×n} are spectrally equivalent, A_h ∼ B_h, iff there exist constants c, C independent of h such that c ≤ λ(B_h^{-1} A_h) ≤ C. P. E. Farrell (Oxford) SPS 6 May 19, / 8

100 Spectral equivalence Definition (spectral equivalence): A_h and B_h ∈ R^{n×n} are spectrally equivalent, A_h ∼ B_h, iff there exist constants c, C independent of h such that c ≤ λ(B_h^{-1} A_h) ≤ C. Solving block-structured systems: find an approximation Ŝ ∼ S or Ŝ^{-1} ≈ S^{-1}. P. E. Farrell (Oxford) SPS 6 May 19, / 8

101 Stokes equations The Stokes equations are −ν∇²u + ∇p = 0, ∇·u = 0. P. E. Farrell (Oxford) SPS 6 May 19, / 8

102 Stokes equations The Stokes equations are −ν∇²u + ∇p = 0, ∇·u = 0. A stable discretisation yields J = [[A, B^T], [B, 0]], with S = −BA^{-1}B^T. P. E. Farrell (Oxford) SPS 6 May 19, / 8

103 Stokes equations The Stokes equations are −ν∇²u + ∇p = 0, ∇·u = 0. Spectral equivalence (e.g. Elman, Silvester and Wathen, 2005): let Q be the viscosity-weighted pressure mass matrix, Q_ij = ∫_Ω (1/ν) φ_i φ_j. Then S ∼ Q. P. E. Farrell (Oxford) SPS 6 May 19, / 8

104 Coding tools Creating PETSc index sets to extract dofs:
    u_dofs = SubSpace(Z, 0).dofmap().dofs()
    u_is = PETSc.IS().createGeneral(u_dofs)
P. E. Farrell (Oxford) SPS 6 May 19, / 8

105 Coding tools Creating PETSc index sets to extract dofs:
    u_dofs = SubSpace(Z, 0).dofmap().dofs()
    u_is = PETSc.IS().createGeneral(u_dofs)
Configuring the dofs to split:
    fields = [("0", u_is), ("1", p_is)]
    snes.ksp.pc.setFieldSplitIS(*fields)
P. E. Farrell (Oxford) SPS 6 May 19, / 8

106 Coding tools Creating PETSc index sets to extract dofs:
    u_dofs = SubSpace(Z, 0).dofmap().dofs()
    u_is = PETSc.IS().createGeneral(u_dofs)
Configuring the dofs to split:
    fields = [("0", u_is), ("1", p_is)]
    snes.ksp.pc.setFieldSplitIS(*fields)
Setting the matrix for building a preconditioner for the Schur complement:
    schur = (1.0/nu) * inner(p, q)*dx
    schur_full = assemble(schur)
    schur_fmat = as_backend_type(schur_full).mat()
    schur_mat = schur_fmat.getSubMatrix(p_is, p_is)
    snes.ksp.pc.setFieldSplitSchurPreType(PETSc.PC.SchurPreType.USER, schur_mat)
P. E. Farrell (Oxford) SPS 6 May 19, / 8

107 Configuring fieldsplit
--petsc.ksp_converged_reason
--petsc.ksp_type fgmres
--petsc.ksp_monitor_true_residual
--petsc.ksp_atol 1.0e-10
--petsc.ksp_rtol
--petsc.pc_type fieldsplit
--petsc.pc_fieldsplit_type schur
--petsc.pc_fieldsplit_schur_factorization_type full
--petsc.pc_fieldsplit_schur_precondition user
--petsc.fieldsplit_0_ksp_type richardson
--petsc.fieldsplit_0_ksp_max_it 1
--petsc.fieldsplit_0_pc_type lu
--petsc.fieldsplit_0_pc_factor_mat_solver_package mumps
--petsc.fieldsplit_1_ksp_type bcgs
--petsc.fieldsplit_1_ksp_rtol 1.0e-10
--petsc.fieldsplit_1_ksp_monitor_true_residual
--petsc.fieldsplit_1_pc_type lu
--petsc.fieldsplit_1_pc_factor_mat_solver_package mumps
P. E. Farrell (Oxford) SPS 6 May 19, / 8

108 HPC 06 Challenge! Solve the Stokes equations with ν = 1/100 on the dolphin.xml mesh, with boundary conditions
u = (0, 0) on ∂Ω_0,
u = (−sin πy, 0) on ∂Ω_1,
ν ∇u·n = pn on ∂Ω_2,
with colours taken from dolphin_subdomains.xml.
0. Discretise the equation with a stable finite element pair. Integrate both terms in the momentum equation by parts.
1. Solve the problem with LU (UMFPACK/MUMPS).
2. Implement the fieldsplit preconditioner with ideal inner solvers (LU).
3. Now replace the inner solvers with Krylov solvers (CG/ML/5 for A, BCGS/HYPRE/5 for S).
4. What configuration is fastest? full with strong inner solvers? diag with weak inner solvers?
P. E. Farrell (Oxford) SPS 6 May 19, / 8

109 Solving PDEs on Supercomputers VII: PDE-constrained optimisation Patrick Farrell MMSC: Python in Scientific Computing May 17, 2015 P. E. Farrell (Oxford) SPS 7 May 17, / 9

110 The mother problem Consider again the mother problem of PDE-constrained optimisation: min_{y,u} (1/2)∫_Ω (y − y_d)² dx + (β/2)∫_Ω u² dx, subject to −∇²y = u in Ω, y = 0 on ∂Ω. P. E. Farrell (Oxford) SPS 7 May 17, / 9

111 The mother problem Consider again the mother problem of PDE-constrained optimisation: min_{y,u} (1/2)∫_Ω (y − y_d)² dx + (β/2)∫_Ω u² dx, subject to −∇²y = u in Ω, y = 0 on ∂Ω. We form the Lagrangian: L(y, u, λ) = (1/2)∫_Ω (y − y_d)² dx + (β/2)∫_Ω u² dx + ∫_Ω ∇λ·∇y dx − ∫_Ω λu dx. P. E. Farrell (Oxford) SPS 7 May 17, / 9

112 The optimality conditions Taking the optimality conditions yields the system: find (y, u, λ) ∈ H¹_0 × L² × H¹_0 such that
∫_Ω ȳ(y − y_d) dx + ∫_Ω ∇λ·∇ȳ dx = 0 for all ȳ,
β∫_Ω ūu dx − ∫_Ω λū dx = 0 for all ū,
∫_Ω ∇λ̄·∇y dx − ∫_Ω λ̄u dx = 0 for all λ̄.
P. E. Farrell (Oxford) SPS 7 May 17, / 9

113 The optimality conditions Taking the optimality conditions yields the system: find (y, u, λ) ∈ H¹_0 × L² × H¹_0 such that
∫_Ω ȳ(y − y_d) dx + ∫_Ω ∇λ·∇ȳ dx = 0 for all ȳ,
β∫_Ω ūu dx − ∫_Ω λū dx = 0 for all ū,
∫_Ω ∇λ̄·∇y dx − ∫_Ω λ̄u dx = 0 for all λ̄.
On discretisation, this yields the system
[[M, 0, K], [0, βM, −M], [K, −M, 0]] [y, u, λ]^T = [z, 0, 0]^T.
P. E. Farrell (Oxford) SPS 7 May 17, / 9

114 Ingredients of a fieldsplit Remember, to fieldsplit you need two things: 1. A diagonal block you can cheaply invert 2. A Schur complement you can cheaply approximate P. E. Farrell (Oxford) SPS 7 May 17, / 9

115 Ingredients of a fieldsplit Remember, to fieldsplit you need two things: 1. A diagonal block you can cheaply invert 2. A Schur complement you can cheaply approximate If we take A = [[M, 0], [0, βm]], the first is satisfied. P. E. Farrell (Oxford) SPS 7 May 17, / 9

116 Ingredients of a fieldsplit Remember, to fieldsplit you need two things: 1. A diagonal block you can cheaply invert 2. A Schur complement you can cheaply approximate If we take A = [[M, 0], [0, βM]], the first is satisfied. How about the Schur complement? Calculating, we find S = KM^{-1}K + (1/β)M. P. E. Farrell (Oxford) SPS 7 May 17, / 9

117 Ingredients of a fieldsplit Remember, to fieldsplit you need two things: 1. A diagonal block you can cheaply invert 2. A Schur complement you can cheaply approximate If we take A = [[M, 0], [0, βM]], the first is satisfied. How about the Schur complement? Calculating, we find S = KM^{-1}K + (1/β)M. Bad news: approximating the inverse of sums is hard. P. E. Farrell (Oxford) SPS 7 May 17, / 9

118 Two approaches Approach one: ignore one of the terms (Rees, Dollar, Wathen 2010). S = KM^{-1}K + (1/β)M ≈ KM^{-1}K, with inverse Ŝ^{-1} = K^{-1}MK^{-1}. P. E. Farrell (Oxford) SPS 7 May 17, / 9

119 Two approaches Approach one: ignore one of the terms (Rees, Dollar, Wathen 2010). S = KM^{-1}K + (1/β)M ≈ KM^{-1}K, with inverse Ŝ^{-1} = K^{-1}MK^{-1}. Approach two: approximate the sum with a product (Pearson and Wathen, 2012). S = (K + (1/√β)M) M^{-1} (K + (1/√β)M) − (2/√β)K ≈ (K + (1/√β)M) M^{-1} (K + (1/√β)M), with inverse Ŝ^{-1} = K̂^{-1}MK̂^{-1}, where K̂ = K + (1/√β)M. P. E. Farrell (Oxford) SPS 7 May 17, / 9
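To apply Pearson and Wathen's approximation you only need the mass matrix and the shifted stiffness matrix K̂. A hedged UFL sketch (the state space V and the regularisation parameter beta are assumed to be defined already):

    from dolfin import *

    y = TrialFunction(V)
    v = TestFunction(V)

    M = assemble(inner(y, v)*dx)                          # mass matrix
    Khat = assemble(inner(grad(y), grad(v))*dx
                    + (1.0/beta**0.5)*inner(y, v)*dx)     # K + (1/sqrt(beta)) M

    # Shat^{-1} is then applied as Khat^{-1} M Khat^{-1}, e.g. via two inner KSP solves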

120 Coding tools No need to pass index sets with scalar fields:
    """
    --petsc.pc_fieldsplit_0_fields 0,1
    --petsc.pc_fieldsplit_1_fields 2
    """
P. E. Farrell (Oxford) SPS 7 May 17, / 9

121 Coding tools No need to pass index sets with scalar fields:
    """
    --petsc.pc_fieldsplit_0_fields 0,1
    --petsc.pc_fieldsplit_1_fields 2
    """
You do need index sets to extract submatrices:
    trial = split(TrialFunction(Z))[0]
    test = split(TestFunction(Z))[0]
    bc = DirichletBC(Z.sub(0), 0.0, "on_boundary")
    mass_full = assemble(inner(trial, test)*dx)
    bc.apply(mass_full)
    ...
    mass_mat = mass_fmat.getSubMatrix(is_0, is_0)
P. E. Farrell (Oxford) SPS 7 May 17, / 9

122 Coding tools Creating a KSP to handle the solve:
    ksp_kbm = PETSc.KSP()
    ksp_kbm.create()
    ksp_kbm.setType("richardson")
    ksp_kbm.pc.setType("lu")
    ksp_kbm.setOperators(kbm)
    ksp_kbm.setOptionsPrefix("fieldsplit_1_kbm_")
    ksp_kbm.setFromOptions()
    ksp_kbm.setUp()
P. E. Farrell (Oxford) SPS 7 May 17, / 9

123 Coding tools Using an approximate inverse action with PCMAT: """ --petsc.fieldsplit_1_pc_type mat """ P. E. Farrell (Oxford) SPS 7 May 17, / 9

124 Coding tools Using an approximate inverse action with PCMAT:
    """
    --petsc.fieldsplit_1_pc_type mat
    """
Configuring a shell matrix:
    class SchurInv(object):
        def mult(self, mat, x, y):
            ksp_kbm.solve(x, tmp1)
            mass.mult(tmp1, tmp2)
            ksp_kbm.solve(tmp2, y)

    schur = PETSc.Mat()
    schur.createPython(mass.getSizes(), SchurInv())
    schur.setUp()
P. E. Farrell (Oxford) SPS 7 May 17, / 9

125 HPC 07 Challenge! Solve the mother problem on Ω = [0, 1]² with y_d(x, y) = 1 if (x, y) ∈ [0, 0.5]², and 0 otherwise, and homogeneous Dirichlet boundary conditions.
0. Discretise the equation with [P_1]³.
1. Solve the problem with LU.
2. Implement the two fieldsplit preconditioners with ideal inner solvers.
3. Which performs best as β → 0?
4. Now choose scalable inner solvers.
5. Which configuration is fastest on the machine?
P. E. Farrell (Oxford) SPS 7 May 17, / 9

126 Solving PDEs on Supercomputers VIII: advanced nonlinear solvers Patrick Farrell MMSC: Python in Scientific Computing May 18, 2015 P. E. Farrell (Oxford) SPS 8 May 18, / 13

127 Globalisation of Newton's method Consider again the p-Laplace equation −∇·(γ(u)∇u) = f in Ω, u = g on ∂Ω, where γ(u) = (ε + |∇u|²)^{(p−2)/2}. The configuration we considered (p = 5) took 121 iterations to converge. Why? P. E. Farrell (Oxford) SPS 8 May 18, / 13

128 Newton steps near singular Jacobians Recall that at our initial guess u = 0, our Jacobian is nearly singular. If J = UΣV^T, then J^{-1} = VΣ^{-1}U^T, and if σ_min ≈ 0, then the Newton step δu = −J^{-1}F is enormous. P. E. Farrell (Oxford) SPS 8 May 18, / 13

129 Newton steps near singular Jacobians Recall that at our initial guess u = 0, our Jacobian is nearly singular. If J = UΣV^T, then J^{-1} = VΣ^{-1}U^T, and if σ_min ≈ 0, then the Newton step δu = −J^{-1}F is enormous. This explains
    0 SNES Function norm  e-02
    1 SNES Function norm  e+56
    2 SNES Function norm  e+56
P. E. Farrell (Oxford) SPS 8 May 18, / 13

130 Responses A few possible responses: 1. Start with a better initial guess (continuation) P. E. Farrell (Oxford) SPS 8 May 18, / 13

131 Responses A few possible responses: 1. Start with a better initial guess (continuation) 2. Regularise further (undesirable) P. E. Farrell (Oxford) SPS 8 May 18, / 13

132 Responses A few possible responses: 1. Start with a better initial guess (continuation) 2. Regularise further (undesirable) 3. Take a smaller step (damping with α < 1)! P. E. Farrell (Oxford) SPS 8 May 18, / 13

133 Responses A few possible responses: 1. Start with a better initial guess (continuation) 2. Regularise further (undesirable) 3. Take a smaller step (damping with α < 1)! Newton fractal for z³ − 1 = 0 with α = 1. P. E. Farrell (Oxford) SPS 8 May 18, / 13

134 Responses A few possible responses: 1. Start with a better initial guess (continuation) 2. Regularise further (undesirable) 3. Take a smaller step (damping with α < 1)! Newton fractal for z³ − 1 = 0 with α = P. E. Farrell (Oxford) SPS 8 May 18, / 13

135 Responses A few possible responses: 1. Start with a better initial guess (continuation) 2. Regularise further (undesirable) 3. Take a smaller step (damping with α < 1)! Newton fractal for z³ − 1 = 0 with α = 0.5. P. E. Farrell (Oxford) SPS 8 May 18, / 13

136 Responses A few possible responses: 1. Start with a better initial guess (continuation) 2. Regularise further (undesirable) 3. Take a smaller step (damping with α < 1)! Newton fractal for z³ − 1 = 0 with α = P. E. Farrell (Oxford) SPS 8 May 18, / 13

137 Responses A few possible responses: 1. Start with a better initial guess (continuation) 2. Regularise further (undesirable) 3. Take a smaller step (damping with α < 1)! Newton fractal for z³ − 1 = 0 with α = 0.1. P. E. Farrell (Oxford) SPS 8 May 18, / 13

138 Linesearch schemes in PETSc Backtracking linesearch (bt) Finds the minimum of a polynomial fit to the ℓ2 norm in [0, 1]. Demands monotonic and sufficient decrease. If decrease is insufficient, the interval is reduced. P. E. Farrell (Oxford) SPS 8 May 18, / 13

139 Linesearch schemes in PETSc Backtracking linesearch (bt) Finds the minimum of a polynomial fit to the ℓ2 norm in [0, 1]. Demands monotonic and sufficient decrease. If decrease is insufficient, the interval is reduced. Good for: convex problems, occasional near-singular Jacobians. P. E. Farrell (Oxford) SPS 8 May 18, / 13

140 Linesearch schemes in PETSc Backtracking linesearch (bt) Finds the minimum of a polynomial fit to the ℓ2 norm in [0, 1]. Demands monotonic and sufficient decrease. If decrease is insufficient, the interval is reduced. Good for: convex problems, occasional near-singular Jacobians. Bad for: nonconvex problems where the residual must increase before convergence. P. E. Farrell (Oxford) SPS 8 May 18, / 13

141 Linesearch schemes in PETSc Critical point linesearch (cp) Many PDEs have an energy function to be minimised. Suppose F(u) is the gradient of some (unknown) E(u). E(u + αdu) can be minimised by looking for roots of du^T F(u + αdu) = 0 with a secant method. P. E. Farrell (Oxford) SPS 8 May 18, / 13

142 Linesearch schemes in PETSc Critical point linesearch (cp) Many PDEs have an energy function to be minimised. Suppose F(u) is the gradient of some (unknown) E(u). E(u + αdu) can be minimised by looking for roots of du^T F(u + αdu) = 0 with a secant method. Good for: problems with an energy functional. P. E. Farrell (Oxford) SPS 8 May 18, / 13

143 Linesearch schemes in PETSc Affine-covariant linesearch (nleqerr) Undamped Newton's method is affine covariant. This observation fundamentally changes convergence theorems for Newton (Deuflhard, 2011). Convergence criteria are expressed in terms of affine-covariant Lipschitz constants. This linesearch estimates these constants and uses them to decide step lengths. P. E. Farrell (Oxford) SPS 8 May 18, / 13

144 Linesearch schemes in PETSc Affine-covariant linesearch (nleqerr) Undamped Newton's method is affine covariant. This observation fundamentally changes convergence theorems for Newton (Deuflhard, 2011). Convergence criteria are expressed in terms of affine-covariant Lipschitz constants. This linesearch estimates these constants and uses them to decide step lengths. Good for: problems where you can start within singular manifolds; the hardest nonlinear problems. P. E. Farrell (Oxford) SPS 8 May 18, / 13
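These linesearches are selected from the options database. A sketch of the relevant options, in the same style as the earlier option listings (exact availability depends on your PETSc version):

    --petsc.snes_type newtonls
    --petsc.snes_linesearch_type bt        # or: cp, nleqerr, basic, l2
    --petsc.snes_linesearch_monitor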

145 Nonlinear preconditioning For a linear problem Ax = b we apply an approximate solver P^{-1} on the left: P^{-1}Ax = P^{-1}b. P. E. Farrell (Oxford) SPS 8 May 18, / 13

146 Nonlinear preconditioning For a linear problem Ax = b we apply an approximate solver P^{-1} on the left: P^{-1}Ax = P^{-1}b. Write one step of a nonlinear solver for F(x) = b as x_{i+1} = N(F, x_i, b). P. E. Farrell (Oxford) SPS 8 May 18, / 13

147 Nonlinear preconditioning In nonlinear left preconditioning, we define a new residual R(x) = x − N(F, x, b) and apply an outer nonlinear solver to R. P. E. Farrell (Oxford) SPS 8 May 18, / 13

148 Nonlinear preconditioning In nonlinear left preconditioning, we define a new residual R(x) = x − N(F, x, b) and apply an outer nonlinear solver to R. In the linear case this is equivalent, since R(x) = x − N(F, x, b) = x + P^{-1}(Ax − b) − x = P^{-1}(Ax − b). P. E. Farrell (Oxford) SPS 8 May 18, / 13

149 Nonlinear preconditioning In nonlinear left preconditioning, we define a new residual R(x) = x − N(F, x, b) and apply an outer nonlinear solver to R. In the linear case this is equivalent, since R(x) = x − N(F, x, b) = x + P^{-1}(Ax − b) − x = P^{-1}(Ax − b). Can accelerate an inner solver with an outer solver! P. E. Farrell (Oxford) SPS 8 May 18, / 13
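In PETSc this composition is driven from the options database through the npc_ prefix. A sketch (not from the slides) that wraps a single Newton step inside nonlinear GMRES:

    --petsc.snes_type ngmres          # outer accelerator
    --petsc.npc_snes_type newtonls    # inner nonlinear solver used as preconditioner
    --petsc.npc_snes_max_it 1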

150 Examples of nonlinear preconditioning Hyperelasticity (Brune et al, 2013) Inner solver: Newton. Outer solver: nonlinear conjugate gradients. P. E. Farrell (Oxford) SPS 8 May 18, / 13

151 Examples of nonlinear preconditioning Hyperelasticity (Brune et al, 2013) Inner solver: Newton. Outer solver: nonlinear conjugate gradients. High-Reynolds number Navier Stokes (Cai and Keyes, 2002) Inner solver: nonlinear additive Schwarz. Outer solver: Newton Krylov. P. E. Farrell (Oxford) SPS 8 May 18, / 13

152 Examples of nonlinear preconditioning Hyperelasticity (Brune et al, 2013) Inner solver: Newton. Outer solver: nonlinear conjugate gradients. High-Reynolds number Navier Stokes (Cai and Keyes, 2002) Inner solver: nonlinear additive Schwarz. Outer solver: Newton Krylov. High-Prandtl number Navier Stokes (Brune et al, 2013) Inner solver: nonlinear multigrid. Outer solver: nonlinear GMRES. P. E. Farrell (Oxford) SPS 8 May 18, / 13

153 Nonlinear preconditioning: a remark The design space for nonlinear solvers is vast. At the moment we have very little theory to guide us. There are very large potential gains, however. P. E. Farrell (Oxford) SPS 8 May 18, / 13

154 Nonlinear multigrid The main bottleneck for massive problems is the linear system. P. E. Farrell (Oxford) SPS 8 May 18, / 13

155 Nonlinear multigrid The main bottleneck for massive problems is the linear system. What if we didn't have to solve (large) linear systems? P. E. Farrell (Oxford) SPS 8 May 18, / 13

156 Nonlinear multigrid The main bottleneck for massive problems is the linear system. What if we didn't have to solve (large) linear systems? FAS uses fine-grid residuals to correct coarse-grid equations. P. E. Farrell (Oxford) SPS 8 May 18, / 13

157 Full Approximation Scheme (FAS) Given: a problem (F^h, x^h, b^h), a smoother S and coarse solver M, and restriction, prolongation and injection operators R, P and R̂.
while not converged:
    x^h_s = S(F^h, x^h_i, b^h)
    x^H = R̂ x^h_s
    b^H = R[b^h − F^h(x^h_s)] + F^H(x^H)
    x^H_c = M(F^H, x^H, b^H)
    x^h_c = x^h_s + P[x^H_c − x^H]
    x^h_{i+1} = S(F^h, x^h_c, b^h)
P. E. Farrell (Oxford) SPS 8 May 18, / 13
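The cycle above transcribes almost directly into Python. A generic sketch (not from the slides) in which the fine and coarse problem operators and the grid-transfer operators are supplied by the user as callables:

    def fas_cycle(F_h, F_H, smooth, coarse_solve, R, P, Rhat, x_h, b_h):
        """One FAS two-grid cycle, following the pseudocode above."""
        x_s = smooth(F_h, x_h, b_h)             # pre-smooth on the fine grid
        x_H = Rhat(x_s)                         # inject the smoothed iterate
        b_H = R(b_h - F_h(x_s)) + F_H(x_H)      # FAS coarse-grid right-hand side
        x_Hc = coarse_solve(F_H, x_H, b_H)      # solve the coarse problem
        x_c = x_s + P(x_Hc - x_H)               # prolong and apply the correction
        return smooth(F_h, x_c, b_h)            # post-smooth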

158 Nonlinear multigrid You can use a high-flop smoother on the fine grids, and Newton-LU on the coarse grids! P. E. Farrell (Oxford) SPS 8 May 18, / 13

159 Nonlinear multigrid You can use a high-flop smoother on the fine grids, and Newton-LU on the coarse grids! (see firedrake Yamabe demo) P. E. Farrell (Oxford) SPS 8 May 18, / 13

160 HPC 08 Challenge! Consider again the p-Laplace equation (FEniCS lecture III).
1. Investigate the performance of different linesearch schemes on the p-Laplace problem.
2. Using only the basic linesearch for the inner solver, accelerate the convergence of Newton's method with left-preconditioning with ncg/cp.
3. Now use the optimal inner linesearch to beat the unaccelerated solver.
4. Choose sensible Krylov solvers and scale the code on ARCUS.
P. E. Farrell (Oxford) SPS 8 May 18, / 13

161 Solving PDEs on Supercomputers IV: a final challenge Patrick Farrell MMSC: Python in Scientific Computing May 17, 2015 P. E. Farrell (Oxford) SPS 8 May 17, / 3

162 HPC 09 Challenge! (1/2) Consider the Cahn–Hilliard equation
∂c/∂t − ∇·( M ∇(df/dc − λ∇²c) ) = 0 in Ω,
M ∇(df/dc − λ∇²c)·n = 0 on ∂Ω,
Mλ ∇c·n = 0 on ∂Ω,
where c is the unknown field, f(c) = 100c²(c − 1)², n is the unit normal, and M is a scalar parameter. To solve this with standard C⁰ elements, write it as two coupled second-order problems.
P. E. Farrell (Oxford) SPS 8 May 17, / 3

163 HPC 09 Challenge! (2/2) Discretise and solve the equation on Ω = [0, 1]² for M = 1, λ = 10⁻², and initial condition
    class InitialConditions(Expression):
        def __init__(self):
            random.seed(2 + MPI.rank(mpi_comm_world()))
        def eval(self, values, x):
            values[0] = 0.63 + 0.02*(0.5 - random.random())
            values[1] = 0.0
        def value_shape(self):
            return (2,)
Make sure your scheme is at least second-order. Sensible values are Δt = , θ = 0.5. An excellent preconditioner is discussed in doi: /
P. E. Farrell (Oxford) SPS 8 May 17, / 3
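A hedged sketch of the mixed (c, μ) weak form for this challenge, closely following the standard DOLFIN Cahn–Hilliard demo (the mixed space ME, the previous solution u0, and the constants dt, theta, lmbda and M are assumed to be set up already):

    from dolfin import *

    q, v = TestFunctions(ME)
    u = Function(ME)                   # current solution (c, mu)
    c, mu = split(u)
    c0, mu0 = split(u0)                # previous time step

    c = variable(c)
    f = 100*c**2*(1 - c)**2            # the double-well potential
    dfdc = diff(f, c)

    mu_mid = (1.0 - theta)*mu0 + theta*mu   # theta-weighted chemical potential

    F = ((c - c0)*q*dx + dt*M*dot(grad(mu_mid), grad(q))*dx      # mass balance
         + mu*v*dx - dfdc*v*dx - lmbda*dot(grad(c), grad(v))*dx)  # definition of mu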


A Domain Decomposition Based Jacobi-Davidson Algorithm for Quantum Dot Simulation

A Domain Decomposition Based Jacobi-Davidson Algorithm for Quantum Dot Simulation A Domain Decomposition Based Jacobi-Davidson Algorithm for Quantum Dot Simulation Tao Zhao 1, Feng-Nan Hwang 2 and Xiao-Chuan Cai 3 Abstract In this paper, we develop an overlapping domain decomposition

More information

High Performance Nonlinear Solvers

High Performance Nonlinear Solvers What is a nonlinear system? High Performance Nonlinear Solvers Michael McCourt Division Argonne National Laboratory IIT Meshfree Seminar September 19, 2011 Every nonlinear system of equations can be described

More information

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for 1 Iteration basics Notes for 2016-11-07 An iterative solver for Ax = b is produces a sequence of approximations x (k) x. We always stop after finitely many steps, based on some convergence criterion, e.g.

More information

Parallel sparse linear solvers and applications in CFD

Parallel sparse linear solvers and applications in CFD Parallel sparse linear solvers and applications in CFD Jocelyne Erhel Joint work with Désiré Nuentsa Wakam () and Baptiste Poirriez () SAGE team, Inria Rennes, France journée Calcul Intensif Distribué

More information

Using an Auction Algorithm in AMG based on Maximum Weighted Matching in Matrix Graphs

Using an Auction Algorithm in AMG based on Maximum Weighted Matching in Matrix Graphs Using an Auction Algorithm in AMG based on Maximum Weighted Matching in Matrix Graphs Pasqua D Ambra Institute for Applied Computing (IAC) National Research Council of Italy (CNR) pasqua.dambra@cnr.it

More information

Discretization of PDEs and Tools for the Parallel Solution of the Resulting Systems

Discretization of PDEs and Tools for the Parallel Solution of the Resulting Systems Discretization of PDEs and Tools for the Parallel Solution of the Resulting Systems Stan Tomov Innovative Computing Laboratory Computer Science Department The University of Tennessee Wednesday April 4,

More information

Multilevel Preconditioning of Graph-Laplacians: Polynomial Approximation of the Pivot Blocks Inverses

Multilevel Preconditioning of Graph-Laplacians: Polynomial Approximation of the Pivot Blocks Inverses Multilevel Preconditioning of Graph-Laplacians: Polynomial Approximation of the Pivot Blocks Inverses P. Boyanova 1, I. Georgiev 34, S. Margenov, L. Zikatanov 5 1 Uppsala University, Box 337, 751 05 Uppsala,

More information

Multipole-Based Preconditioners for Sparse Linear Systems.

Multipole-Based Preconditioners for Sparse Linear Systems. Multipole-Based Preconditioners for Sparse Linear Systems. Ananth Grama Purdue University. Supported by the National Science Foundation. Overview Summary of Contributions Generalized Stokes Problem Solenoidal

More information

Iterative Methods and Multigrid

Iterative Methods and Multigrid Iterative Methods and Multigrid Part 3: Preconditioning 2 Eric de Sturler Preconditioning The general idea behind preconditioning is that convergence of some method for the linear system Ax = b can be

More information

Scalable Non-blocking Preconditioned Conjugate Gradient Methods

Scalable Non-blocking Preconditioned Conjugate Gradient Methods Scalable Non-blocking Preconditioned Conjugate Gradient Methods Paul Eller and William Gropp University of Illinois at Urbana-Champaign Department of Computer Science Supercomputing 16 Paul Eller and William

More information

CLASSICAL ITERATIVE METHODS

CLASSICAL ITERATIVE METHODS CLASSICAL ITERATIVE METHODS LONG CHEN In this notes we discuss classic iterative methods on solving the linear operator equation (1) Au = f, posed on a finite dimensional Hilbert space V = R N equipped

More information

Lecture 9 Approximations of Laplace s Equation, Finite Element Method. Mathématiques appliquées (MATH0504-1) B. Dewals, C.

Lecture 9 Approximations of Laplace s Equation, Finite Element Method. Mathématiques appliquées (MATH0504-1) B. Dewals, C. Lecture 9 Approximations of Laplace s Equation, Finite Element Method Mathématiques appliquées (MATH54-1) B. Dewals, C. Geuzaine V1.2 23/11/218 1 Learning objectives of this lecture Apply the finite difference

More information

Nonlinear Preconditioning in PETSc

Nonlinear Preconditioning in PETSc Nonlinear Preconditioning in PETSc Matthew Knepley PETSc Team Computation Institute University of Chicago Challenges in 21st Century Experimental Mathematical Computation ICERM, Providence, RI July 22,

More information

Computers and Mathematics with Applications

Computers and Mathematics with Applications Computers and Mathematics with Applications 68 (2014) 1151 1160 Contents lists available at ScienceDirect Computers and Mathematics with Applications journal homepage: www.elsevier.com/locate/camwa A GPU

More information

K.S. Kang. The multigrid method for an elliptic problem on a rectangular domain with an internal conductiong structure and an inner empty space

K.S. Kang. The multigrid method for an elliptic problem on a rectangular domain with an internal conductiong structure and an inner empty space K.S. Kang The multigrid method for an elliptic problem on a rectangular domain with an internal conductiong structure and an inner empty space IPP 5/128 September, 2011 The multigrid method for an elliptic

More information

New Multigrid Solver Advances in TOPS

New Multigrid Solver Advances in TOPS New Multigrid Solver Advances in TOPS R D Falgout 1, J Brannick 2, M Brezina 2, T Manteuffel 2 and S McCormick 2 1 Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, P.O.

More information

Integration of PETSc for Nonlinear Solves

Integration of PETSc for Nonlinear Solves Integration of PETSc for Nonlinear Solves Ben Jamroz, Travis Austin, Srinath Vadlamani, Scott Kruger Tech-X Corporation jamroz@txcorp.com http://www.txcorp.com NIMROD Meeting: Aug 10, 2010 Boulder, CO

More information

Composing Nonlinear Solvers

Composing Nonlinear Solvers Composing Nonlinear Solvers Matthew Knepley Computational and Applied Mathematics Rice University MIT Aeronautics and Astronautics Boston, MA May 10, 2016 Matt (Rice) PETSc MIT 1 / 69 What is PETSc? PETSc

More information

PETSc for Python. Lisandro Dalcin

PETSc for Python.  Lisandro Dalcin PETSc for Python http://petsc4py.googlecode.com Lisandro Dalcin dalcinl@gmail.com Centro Internacional de Métodos Computacionales en Ingeniería Consejo Nacional de Investigaciones Científicas y Técnicas

More information

Preconditioners for the incompressible Navier Stokes equations

Preconditioners for the incompressible Navier Stokes equations Preconditioners for the incompressible Navier Stokes equations C. Vuik M. ur Rehman A. Segal Delft Institute of Applied Mathematics, TU Delft, The Netherlands SIAM Conference on Computational Science and

More information

A Numerical Study of Some Parallel Algebraic Preconditioners

A Numerical Study of Some Parallel Algebraic Preconditioners A Numerical Study of Some Parallel Algebraic Preconditioners Xing Cai Simula Research Laboratory & University of Oslo PO Box 1080, Blindern, 0316 Oslo, Norway xingca@simulano Masha Sosonkina University

More information

Chapter 7 Iterative Techniques in Matrix Algebra

Chapter 7 Iterative Techniques in Matrix Algebra Chapter 7 Iterative Techniques in Matrix Algebra Per-Olof Persson persson@berkeley.edu Department of Mathematics University of California, Berkeley Math 128B Numerical Analysis Vector Norms Definition

More information

7.4 The Saddle Point Stokes Problem

7.4 The Saddle Point Stokes Problem 346 CHAPTER 7. APPLIED FOURIER ANALYSIS 7.4 The Saddle Point Stokes Problem So far the matrix C has been diagonal no trouble to invert. This section jumps to a fluid flow problem that is still linear (simpler

More information

Lecture 8: Fast Linear Solvers (Part 7)

Lecture 8: Fast Linear Solvers (Part 7) Lecture 8: Fast Linear Solvers (Part 7) 1 Modified Gram-Schmidt Process with Reorthogonalization Test Reorthogonalization If Av k 2 + δ v k+1 2 = Av k 2 to working precision. δ = 10 3 2 Householder Arnoldi

More information

University of Illinois at Urbana-Champaign. Multigrid (MG) methods are used to approximate solutions to elliptic partial differential

University of Illinois at Urbana-Champaign. Multigrid (MG) methods are used to approximate solutions to elliptic partial differential Title: Multigrid Methods Name: Luke Olson 1 Affil./Addr.: Department of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801 email: lukeo@illinois.edu url: http://www.cs.uiuc.edu/homes/lukeo/

More information

CME342 Parallel Methods in Numerical Analysis. Matrix Computation: Iterative Methods II. Sparse Matrix-vector Multiplication.

CME342 Parallel Methods in Numerical Analysis. Matrix Computation: Iterative Methods II. Sparse Matrix-vector Multiplication. CME342 Parallel Methods in Numerical Analysis Matrix Computation: Iterative Methods II Outline: CG & its parallelization. Sparse Matrix-vector Multiplication. 1 Basic iterative methods: Ax = b r = b Ax

More information

Distributed Memory Parallelization in NGSolve

Distributed Memory Parallelization in NGSolve Distributed Memory Parallelization in NGSolve Lukas Kogler June, 2017 Inst. for Analysis and Scientific Computing, TU Wien From Shared to Distributed Memory Shared Memory Parallelization via threads (

More information

Parallel Numerics, WT 2016/ Iterative Methods for Sparse Linear Systems of Equations. page 1 of 1

Parallel Numerics, WT 2016/ Iterative Methods for Sparse Linear Systems of Equations. page 1 of 1 Parallel Numerics, WT 2016/2017 5 Iterative Methods for Sparse Linear Systems of Equations page 1 of 1 Contents 1 Introduction 1.1 Computer Science Aspects 1.2 Numerical Problems 1.3 Graphs 1.4 Loop Manipulations

More information

J.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1. March, 2009

J.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1. March, 2009 Parallel Preconditioning of Linear Systems based on ILUPACK for Multithreaded Architectures J.I. Aliaga M. Bollhöfer 2 A.F. Martín E.S. Quintana-Ortí Deparment of Computer Science and Engineering, Univ.

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

Background. Background. C. T. Kelley NC State University tim C. T. Kelley Background NCSU, Spring / 58

Background. Background. C. T. Kelley NC State University tim C. T. Kelley Background NCSU, Spring / 58 Background C. T. Kelley NC State University tim kelley@ncsu.edu C. T. Kelley Background NCSU, Spring 2012 1 / 58 Notation vectors, matrices, norms l 1 : max col sum... spectral radius scaled integral norms

More information

Parallel Discontinuous Galerkin Method

Parallel Discontinuous Galerkin Method Parallel Discontinuous Galerkin Method Yin Ki, NG The Chinese University of Hong Kong Aug 5, 2015 Mentors: Dr. Ohannes Karakashian, Dr. Kwai Wong Overview Project Goal Implement parallelization on Discontinuous

More information

PDE Solvers for Fluid Flow

PDE Solvers for Fluid Flow PDE Solvers for Fluid Flow issues and algorithms for the Streaming Supercomputer Eran Guendelman February 5, 2002 Topics Equations for incompressible fluid flow 3 model PDEs: Hyperbolic, Elliptic, Parabolic

More information

ANALYSIS OF AUGMENTED LAGRANGIAN-BASED PRECONDITIONERS FOR THE STEADY INCOMPRESSIBLE NAVIER-STOKES EQUATIONS

ANALYSIS OF AUGMENTED LAGRANGIAN-BASED PRECONDITIONERS FOR THE STEADY INCOMPRESSIBLE NAVIER-STOKES EQUATIONS ANALYSIS OF AUGMENTED LAGRANGIAN-BASED PRECONDITIONERS FOR THE STEADY INCOMPRESSIBLE NAVIER-STOKES EQUATIONS MICHELE BENZI AND ZHEN WANG Abstract. We analyze a class of modified augmented Lagrangian-based

More information

Algebraic Multigrid Methods for the Oseen Problem

Algebraic Multigrid Methods for the Oseen Problem Algebraic Multigrid Methods for the Oseen Problem Markus Wabro Joint work with: Walter Zulehner, Linz www.numa.uni-linz.ac.at This work has been supported by the Austrian Science Foundation Fonds zur Förderung

More information

Computational Linear Algebra

Computational Linear Algebra Computational Linear Algebra PD Dr. rer. nat. habil. Ralf-Peter Mundani Computation in Engineering / BGU Scientific Computing in Computer Science / INF Winter Term 2018/19 Part 4: Iterative Methods PD

More information

Universität Dortmund UCHPC. Performance. Computing for Finite Element Simulations

Universität Dortmund UCHPC. Performance. Computing for Finite Element Simulations technische universität dortmund Universität Dortmund fakultät für mathematik LS III (IAM) UCHPC UnConventional High Performance Computing for Finite Element Simulations S. Turek, Chr. Becker, S. Buijssen,

More information

An advanced ILU preconditioner for the incompressible Navier-Stokes equations

An advanced ILU preconditioner for the incompressible Navier-Stokes equations An advanced ILU preconditioner for the incompressible Navier-Stokes equations M. ur Rehman C. Vuik A. Segal Delft Institute of Applied Mathematics, TU delft The Netherlands Computational Methods with Applications,

More information

INTRODUCTION TO MULTIGRID METHODS

INTRODUCTION TO MULTIGRID METHODS INTRODUCTION TO MULTIGRID METHODS LONG CHEN 1. ALGEBRAIC EQUATION OF TWO POINT BOUNDARY VALUE PROBLEM We consider the discretization of Poisson equation in one dimension: (1) u = f, x (0, 1) u(0) = u(1)

More information

MULTIGRID METHODS FOR NONLINEAR PROBLEMS: AN OVERVIEW

MULTIGRID METHODS FOR NONLINEAR PROBLEMS: AN OVERVIEW MULTIGRID METHODS FOR NONLINEAR PROBLEMS: AN OVERVIEW VAN EMDEN HENSON CENTER FOR APPLIED SCIENTIFIC COMPUTING LAWRENCE LIVERMORE NATIONAL LABORATORY Abstract Since their early application to elliptic

More information

An Introduction to Algebraic Multigrid (AMG) Algorithms Derrick Cerwinsky and Craig C. Douglas 1/84

An Introduction to Algebraic Multigrid (AMG) Algorithms Derrick Cerwinsky and Craig C. Douglas 1/84 An Introduction to Algebraic Multigrid (AMG) Algorithms Derrick Cerwinsky and Craig C. Douglas 1/84 Introduction Almost all numerical methods for solving PDEs will at some point be reduced to solving A

More information

Modelling and implementation of algorithms in applied mathematics using MPI

Modelling and implementation of algorithms in applied mathematics using MPI Modelling and implementation of algorithms in applied mathematics using MPI Lecture 3: Linear Systems: Simple Iterative Methods and their parallelization, Programming MPI G. Rapin Brazil March 2011 Outline

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning

AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 18 Outline

More information

Geometric Multigrid Methods

Geometric Multigrid Methods Geometric Multigrid Methods Susanne C. Brenner Department of Mathematics and Center for Computation & Technology Louisiana State University IMA Tutorial: Fast Solution Techniques November 28, 2010 Ideas

More information

A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems

A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems Outline A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems Azzam Haidar CERFACS, Toulouse joint work with Luc Giraud (N7-IRIT, France) and Layne Watson (Virginia Polytechnic Institute,

More information

On domain decomposition preconditioners for finite element approximations of the Helmholtz equation using absorption

On domain decomposition preconditioners for finite element approximations of the Helmholtz equation using absorption On domain decomposition preconditioners for finite element approximations of the Helmholtz equation using absorption Ivan Graham and Euan Spence (Bath, UK) Collaborations with: Paul Childs (Emerson Roxar,

More information

The Removal of Critical Slowing Down. Lattice College of William and Mary

The Removal of Critical Slowing Down. Lattice College of William and Mary The Removal of Critical Slowing Down Lattice 2008 College of William and Mary Michael Clark Boston University James Brannick, Rich Brower, Tom Manteuffel, Steve McCormick, James Osborn, Claudio Rebbi 1

More information

A User Friendly Toolbox for Parallel PDE-Solvers

A User Friendly Toolbox for Parallel PDE-Solvers A User Friendly Toolbox for Parallel PDE-Solvers Gundolf Haase Institut for Mathematics and Scientific Computing Karl-Franzens University of Graz Manfred Liebmann Mathematics in Sciences Max-Planck-Institute

More information

On nonlinear adaptivity with heterogeneity

On nonlinear adaptivity with heterogeneity On nonlinear adaptivity with heterogeneity Jed Brown jed@jedbrown.org (CU Boulder) Collaborators: Mark Adams (LBL), Matt Knepley (UChicago), Dave May (ETH), Laetitia Le Pourhiet (UPMC), Ravi Samtaney (KAUST)

More information

Domain decomposition on different levels of the Jacobi-Davidson method

Domain decomposition on different levels of the Jacobi-Davidson method hapter 5 Domain decomposition on different levels of the Jacobi-Davidson method Abstract Most computational work of Jacobi-Davidson [46], an iterative method suitable for computing solutions of large dimensional

More information

SOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS. Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA

SOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS. Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA 1 SOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA 2 OUTLINE Sparse matrix storage format Basic factorization

More information

A Review of Preconditioning Techniques for Steady Incompressible Flow

A Review of Preconditioning Techniques for Steady Incompressible Flow Zeist 2009 p. 1/43 A Review of Preconditioning Techniques for Steady Incompressible Flow David Silvester School of Mathematics University of Manchester Zeist 2009 p. 2/43 PDEs Review : 1984 2005 Update

More information

Aggregation-based algebraic multigrid

Aggregation-based algebraic multigrid Aggregation-based algebraic multigrid from theory to fast solvers Yvan Notay Université Libre de Bruxelles Service de Métrologie Nucléaire CEMRACS, Marseille, July 18, 2012 Supported by the Belgian FNRS

More information

Toward less synchronous composable multilevel methods for implicit multiphysics simulation

Toward less synchronous composable multilevel methods for implicit multiphysics simulation Toward less synchronous composable multilevel methods for implicit multiphysics simulation Jed Brown 1, Mark Adams 2, Peter Brune 1, Matt Knepley 3, Barry Smith 1 1 Mathematics and Computer Science Division,

More information

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros

More information

M.A. Botchev. September 5, 2014

M.A. Botchev. September 5, 2014 Rome-Moscow school of Matrix Methods and Applied Linear Algebra 2014 A short introduction to Krylov subspaces for linear systems, matrix functions and inexact Newton methods. Plan and exercises. M.A. Botchev

More information

Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning

Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology, USA SPPEXA Symposium TU München,

More information

Multigrid finite element methods on semi-structured triangular grids

Multigrid finite element methods on semi-structured triangular grids XXI Congreso de Ecuaciones Diferenciales y Aplicaciones XI Congreso de Matemática Aplicada Ciudad Real, -5 septiembre 009 (pp. 8) Multigrid finite element methods on semi-structured triangular grids F.J.

More information

Review of matrices. Let m, n IN. A rectangle of numbers written like A =

Review of matrices. Let m, n IN. A rectangle of numbers written like A = Review of matrices Let m, n IN. A rectangle of numbers written like a 11 a 12... a 1n a 21 a 22... a 2n A =...... a m1 a m2... a mn where each a ij IR is called a matrix with m rows and n columns or an

More information