FAS and Solver Performance

FAS and Solver Performance
Matthew Knepley
Mathematics and Computer Science Division, Argonne National Laboratory
Fall AMS Central Section Meeting, Chicago, IL, Oct 05-06, 2007
M. Knepley (ANL) FAS AMS 07

Payoff: Why should I care?
1. Optimal multilevel solvers are necessary
2. Processor flops are increasing much faster than bandwidth
3. Nonlinear algorithms can be more efficient than linear algorithms
4. Presents an opportunity for numerical algebraic geometry

Outline
1. Simulation Basics
2. Newton-Multigrid
3. Machine Performance
4. FAS and Multigrid-Newton
5. Possible Extensions

Simulation Basics: Necessity of Simulation
Experiments are...
  Expensive
  Difficult
  Impossible
  Dangerous

Simulation Basics: Why Optimal Algorithms?
The more powerful the computer, the greater the importance of optimality.
Example:
  Suppose Alg 1 solves a problem in time $C N^2$, where $N$ is the input size
  Suppose Alg 2 solves the same problem in time $C N$
  Suppose Alg 1 and Alg 2 are able to use 10,000 processors
In constant time compared to serial,
  Alg 1 can run a problem 100X larger
  Alg 2 can run a problem 10,000X larger
Alternatively, filling the machine's memory,
  Alg 1 requires 100X time
  Alg 2 runs in constant time
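
The constant-time figures above follow from a short calculation. The sketch below assumes perfect parallel speedup, so that run time behaves like $C N^p / P$ on $P$ processors; the function name is illustrative and not from the talk.

```python
import math


def size_growth_at_constant_time(processors: int, exponent: float) -> float:
    """Factor by which N can grow while C * N**exponent / processors stays fixed.

    Assumes ideal (perfect) parallel speedup, which is an idealization.
    """
    return processors ** (1.0 / exponent)


alg1_growth = size_growth_at_constant_time(10_000, 2)  # O(N^2) algorithm
alg2_growth = size_growth_at_constant_time(10_000, 1)  # O(N) algorithm
print(f"Alg 1: {alg1_growth:.0f}X larger, Alg 2: {alg2_growth:.0f}X larger")
```

Under this model the quadratic algorithm gains only the square root of the processor count in problem size, which is the gap the slide points to.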

Newton-Multigrid: What Is Optimal?
I will define optimal as an $O(N)$ solution algorithm.
These are generally hierarchical, so we need
  hierarchy generation
  assembly on subdomains
  restriction and prolongation

Newton-Multigrid: The Bratu Problem
$\Delta u + \lambda e^{u} = f$ in $\Omega$ (1)
$u = g$ on $\partial\Omega$ (2)
Also called the Solid-Fuel Ignition equation
Can be treated as a nonlinear eigenvalue problem
Has two solution branches until $\lambda = 6.28$

Newton-Multigrid: Newton's Method
$0 = F(u + \delta u) \approx F(u) + J(u)\,\delta u$ (3)
so that
$u + \delta u = u - J(u)^{-1} F(u)$ (4)
Quadratic convergence
J can be solved approximately (Dembo-Eisenstat-Steihaug)
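
As a concrete illustration of update (4), here is a minimal sketch of Newton's method on a 1D analogue of the Bratu problem. The 1D setting, $f = 0$, $g = 0$, $\lambda = 1$, the grid size, and the dense Jacobian are all assumptions made to keep the example short; they are not taken from the talk.

```python
import numpy as np


def bratu_residual(u, lam, h):
    """F_i(u) = (u_{i-1} - 2 u_i + u_{i+1}) / h^2 + lam * exp(u_i), with u = 0 at both ends."""
    padded = np.concatenate(([0.0], u, [0.0]))
    return (padded[:-2] - 2.0 * padded[1:-1] + padded[2:]) / h**2 + lam * np.exp(u)


def bratu_jacobian(u, lam, h):
    """Tridiagonal Jacobian of bratu_residual, stored densely only for brevity."""
    n = u.size
    off_diagonals = (np.eye(n, k=1) + np.eye(n, k=-1)) / h**2
    return off_diagonals + np.diag(-2.0 / h**2 + lam * np.exp(u))


def newton(u, lam, h, tol=1e-10, max_it=20):
    """Iterate u <- u - J(u)^{-1} F(u), recording the residual norm before each step."""
    norms = []
    for _ in range(max_it):
        F = bratu_residual(u, lam, h)
        norms.append(np.linalg.norm(F))
        if norms[-1] < tol:
            break
        u = u - np.linalg.solve(bratu_jacobian(u, lam, h), F)
    return u, norms


n = 63
h = 1.0 / (n + 1)
u, norms = newton(np.zeros(n), lam=1.0, h=h)
print(norms)  # the residual norms should drop rapidly once the iterate is close
```

The direct solve stands in for the linear solver; replacing it with an inexact Krylov or multigrid solve is where the Newton-Multigrid combination of the following slides enters.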

Newton-Multigrid: Linear Multigrid
Smoothing (typically Gauss-Seidel)
$x^{new} = S(x^{old}, b)$ (5)
Coarse-grid Correction
$J_c\,\delta x_c = R\,(b - J x^{old})$ (6)
$x^{new} = x^{old} + R^T \delta x_c$ (7)
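
Steps (5) through (7) can be sketched as a two-grid cycle for the 1D Poisson matrix. Several choices here are assumptions rather than anything stated in the talk: the model problem, linear interpolation for $R^T$, the Galerkin coarse operator $J_c = R J R^T$, dense storage, and the extra post-smoothing sweep.

```python
import numpy as np


def poisson_matrix(n):
    """Dense 1D Laplacian tridiag(-1, 2, -1) / h^2 on n interior points."""
    h = 1.0 / (n + 1)
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2


def interpolation(nc):
    """Linear interpolation R^T from nc coarse points to 2 * nc + 1 fine points."""
    P = np.zeros((2 * nc + 1, nc))
    for j in range(nc):
        P[2 * j, j] = 0.5
        P[2 * j + 1, j] = 1.0
        P[2 * j + 2, j] = 0.5
    return P


def gauss_seidel(J, x, b, sweeps=1):
    """Smoother S(x_old, b): forward Gauss-Seidel sweeps on a copy of x."""
    x = x.copy()
    for _ in range(sweeps):
        for i in range(x.size):
            x[i] += (b[i] - J[i] @ x) / J[i, i]
    return x


def two_grid(J, x, b, R):
    x = gauss_seidel(J, x, b)                    # (5) smoothing
    Jc = R @ J @ R.T                             # Galerkin coarse operator (assumed)
    dxc = np.linalg.solve(Jc, R @ (b - J @ x))   # (6) coarse-grid correction
    x = x + R.T @ dxc                            # (7) update
    return gauss_seidel(J, x, b)                 # post-smoothing (an addition)


nc = 31
n = 2 * nc + 1
J = poisson_matrix(n)
R = interpolation(nc).T
b = np.ones(n)
x = np.zeros(n)
residuals = [np.linalg.norm(b - J @ x)]
for _ in range(10):
    x = two_grid(J, x, b, R)
    residuals.append(np.linalg.norm(b - J @ x))
print(residuals)
```

A full multigrid method would apply the same cycle recursively in place of the direct coarse solve.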

Newton-Multigrid: Linear Convergence
Convergence to $\|r\| < 10^{-9}\,\|b\|$ using GMRES(30)/ILU
[Table: Elements vs. Iterations; entries not recovered]
Convergence to $\|r\| < 10^{-9}\,\|b\|$ using GMRES(30)/MG
[Table: Elements vs. Iterations; entries not recovered]

Newton-Multigrid: Linear Multigrid Memory Access
[Figure: diagram labeled Memory, J, Processor]

Machine Performance: STREAM Benchmark
Simple benchmark program measuring sustainable memory bandwidth
Prototypical operation is Triad (WAXPY): $w = y + \alpha x$
Measures the memory bandwidth bottleneck (much below peak)
Datasets outstrip cache
Table: Bandwidth limited machine performance
  Columns: Machine, Peak (MF/s), Triad (MB/s), MF/MW, Eq. MF/s
  Matt's Laptop (5.5%)
  Intel Core2 Quad (1.2%)
  Tesla 1060C * (0.8%)
  [numeric entries not recovered; only the percentages survive]
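
For readers who want to try the Triad operation, the following is a rough sketch in NumPy rather than the STREAM benchmark itself. The array length, repeat count, and function name are arbitrary choices. Because NumPy evaluates the triad in two passes, the real memory traffic exceeds the nominal 24 bytes per element that STREAM counts, so the reported figure is only indicative.

```python
import time

import numpy as np


def triad_bandwidth(n=2_000_000, alpha=3.0, repeats=5):
    """Time w = y + alpha * x and report nominal bandwidth in MB/s (best of several runs)."""
    x = np.random.rand(n)
    y = np.random.rand(n)
    w = np.empty(n)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.multiply(x, alpha, out=w)  # w = alpha * x
        np.add(w, y, out=w)           # w = y + alpha * x
        best = min(best, time.perf_counter() - start)
    nominal_bytes = 3 * 8 * n  # STREAM convention: two reads and one write of 8-byte words
    return w, x, y, nominal_bytes / best / 1e6


alpha = 3.0
w, x, y, mb_per_s = triad_bandwidth(alpha=alpha)
print(f"nominal Triad bandwidth: {mb_per_s:.0f} MB/s")
```

The arrays are sized well beyond typical cache capacity, matching the slide's requirement that the datasets outstrip cache.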

Machine Performance: Analysis of Sparse Matvec (SpMV)
Assumptions
  No cache misses
  No waits on memory references
Notation
  m: Number of matrix rows
  nz: Number of nonzero matrix elements
  V: Number of vectors to multiply
We can look at the bandwidth needed for peak performance,
$\left(8 + \frac{2}{V}\right)\frac{m}{nz} + \frac{6}{V}$ byte/flop (8)
or the achievable performance given a bandwidth $BW$,
$\frac{V\,nz}{(8V + 2)\,m + 6\,nz}\,BW$ Mflop/s (9)
Towards Realistic Performance Bounds for Implicit CFD Codes, Gropp, Kaushik, Keyes, and Smith.

Machine Performance: Improving Serial Performance
For a single matvec ($V = 1$) with 3D FD Poisson ($nz = 7m$), Matt's laptop can achieve at most
$\frac{1}{(8 + 2)\frac{1}{7} + 6\ \text{bytes/flop}}\,(BW\ \text{MB/s}) = 151$ MFlops/s, (10)
where $BW$ is the laptop's measured Triad bandwidth, which is a dismal 8.8% of peak.
Can improve performance by
  Blocking
  Multiple vectors
but operation issue limitations take over.
Better approaches:
  Unassembled operator application (Spectral elements, FMM): N data, $N^2$ computation
  Nonlinear evaluation (Picard, FAS, Exact Polynomial Solvers): N data, $N^k$ computation
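
The bandwidth model in (8) and (9) is easy to evaluate directly. In the sketch below the function names are illustrative, and the 1000 MB/s bandwidth is a hypothetical value chosen for the example, not a measurement from the talk.

```python
import math


def spmv_bytes_per_flop(m, nz, V=1):
    """Bandwidth needed for peak performance, Eq. (8), in bytes per flop."""
    return (8.0 + 2.0 / V) * m / nz + 6.0 / V


def spmv_achievable_mflops(m, nz, bw_mb_per_s, V=1):
    """Achievable performance for a given bandwidth, Eq. (9), in Mflop/s."""
    return V * nz / ((8.0 * V + 2.0) * m + 6.0 * nz) * bw_mb_per_s


m = 1_000_000   # matrix rows (arbitrary)
nz = 7 * m      # 7-point stencil for 3D finite-difference Poisson
intensity = spmv_bytes_per_flop(m, nz)
bound = spmv_achievable_mflops(m, nz, bw_mb_per_s=1000.0)  # hypothetical bandwidth
print(f"{intensity:.3f} bytes/flop, at most {bound:.1f} Mflop/s at 1000 MB/s")
```

For this stencil the model needs about 7.43 bytes per flop, so the 151 MFlops/s quoted in (10) corresponds to a Triad bandwidth of roughly 1.12 GB/s. Raising `V` lowers the bytes-per-flop requirement, which is the "multiple vectors" improvement listed above.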

FAS and Multigrid-Newton: Matrix-Free Smoothing
We can use point Jacobi
$x_i^{new} = x_i^{old} + J_{ii}^{-1}\,(b_i - J_i^T x^{old})$ (11)
In the nonlinear case,
$J_i^T x^{old} = e_i^T F'(x)\,x^{old}$ (12)
which might be calculated automatically using AD.

FAS and Multigrid-Newton: Nonlinear Gauss-Seidel
If we have an initial guess $u$, $b = -F(u)$,
$x_i^{new} = x_i^{old} + J_{ii}^{-1}\,(b_i - J_i^T x^{old})$ (13)
$x_i^{new} = x_i^{old} + J_{ii}^{-1}\,(-F_i(u) - \nabla F_i(u)^T x^{old})$ (14)
$x_i^{new} = x_i^{old} - J_{ii}^{-1}\,(F_i(u) + \nabla F_i(u)^T x^{old})$ (15)
$x_i^{new} = x_i^{old} - J_{ii}^{-1}\,F_i(u + x^{old})$ (16)
$u_i^{new} = u_i^{old} - J_{ii}^{-1}\,F_i(u^{old})$ (17)
This is just Newton's method on a single equation at a time... which is Nonlinear Gauss-Seidel.
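
Update (17) can be sketched on a 1D analogue of the Bratu problem, taking one scalar Newton step per unknown and never assembling a matrix. The 1D setting with zero boundary values, $\lambda = 1$, the small grid, and the sweep count are assumptions made for illustration.

```python
import numpy as np


def residual_at(u, i, lam, h):
    """F_i(u) = (u_{i-1} - 2 u_i + u_{i+1}) / h^2 + lam * exp(u_i), zero boundary values."""
    left = u[i - 1] if i > 0 else 0.0
    right = u[i + 1] if i < u.size - 1 else 0.0
    return (left - 2.0 * u[i] + right) / h**2 + lam * np.exp(u[i])


def nonlinear_gauss_seidel(u, lam, h, sweeps=1):
    """Apply u_i <- u_i - F_i(u) / J_ii for each i in turn, using updated neighbors."""
    u = u.copy()
    for _ in range(sweeps):
        for i in range(u.size):
            J_ii = -2.0 / h**2 + lam * np.exp(u[i])  # dF_i / du_i
            u[i] -= residual_at(u, i, lam, h) / J_ii
    return u


def residual_norm(u, lam, h):
    return np.linalg.norm([residual_at(u, i, lam, h) for i in range(u.size)])


n = 15
h = 1.0 / (n + 1)
u0 = np.zeros(n)
norm_before = residual_norm(u0, 1.0, h)
u = nonlinear_gauss_seidel(u0, lam=1.0, h=h, sweeps=1000)
norm_after = residual_norm(u, 1.0, h)
print(norm_before, norm_after)
```

Used alone, the sweep converges slowly, just as linear Gauss-Seidel does; its role in FAS is as a smoother, where only a few sweeps are applied per level.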

FAS and Multigrid-Newton: Nonlinear Multigrid
Most authors just offer an ansatz with nonlinear smoothing
$x^{new} = S(x^{old}, b)$ (18)
and coarse-grid correction
$F_c(x_c) = F_c(\tilde{x}_c) + \gamma R\,(b - F(x^{old}))$ (19)
$x^{new} = x^{old} + \frac{1}{\gamma} R^T (x_c - \tilde{x}_c)$ (20)
where $\tilde{x}_c$ is an approximate solution.
If $F$ is a linear operator $L$, the correction reduces to
$L_c x_c = L_c \tilde{x}_c + \gamma R\,(b - L x^{old})$
$L_c (x_c - \tilde{x}_c) = \gamma R\,(b - L x^{old})$
$L_c\,\delta x_c = \gamma R r$
and the update becomes
$x^{new} = x^{old} + \frac{1}{\gamma} R^T \delta x_c$ (21)
$x^{new} = x^{old} + R^T \hat{L}_c^{-1} R r$ (22)

FAS and Multigrid-Newton: Nonlinear Multigrid
It is instructive to look at the alternate derivation of Barry Smith.
Begin with
$J_c x_c = R\,(b - J x^{old})$ (23)
the nonlinear generalization $F(u) = 0$, for a correction
$J_c x_c = -R\,(F(u) + J x^{old})$ (24)
and then using Taylor series
$F(u^{old}) = F(u) + J\,(u^{old} - u) + \ldots$ (25)
$F_c(u_c^{old} + x_c) = F_c(u_c^{old}) + J_c x_c + \ldots$ (26)
we have the correction
$F_c(u_c^{old} + x_c) - F_c(u_c^{old}) = -R\,F(u^{old})$
$F_c(u_c^{old} + x_c) = F_c(u_c^{old}) - R\,F(u^{old})$
and the same update
$x^{new} = x^{old} + R^T x_c$ (27)
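
The coarse equation $F_c(u_c^{old} + x_c) = F_c(u_c^{old}) - R\,F(u^{old})$ derived above can be sketched as a two-grid FAS cycle. Everything problem-specific here is an assumption: a 1D analogue of the Bratu problem with zero boundary values and $\lambda = 1$, a rediscretized coarse operator, injection of the iterate, full-weighting restriction of the residual, and linear interpolation for the update (which equals $2R^T$ for that restriction rather than $R^T$ itself).

```python
import numpy as np


def F(u, lam, h):
    """1D Bratu residual (u_{i-1} - 2 u_i + u_{i+1}) / h^2 + lam * exp(u_i)."""
    p = np.concatenate(([0.0], u, [0.0]))
    return (p[:-2] - 2.0 * p[1:-1] + p[2:]) / h**2 + lam * np.exp(u)


def ngs(u, rhs, lam, h, sweeps):
    """Nonlinear Gauss-Seidel for F(u) = rhs: one scalar Newton step per unknown."""
    u = u.copy()
    n = u.size
    for _ in range(sweeps):
        for i in range(n):
            left = u[i - 1] if i > 0 else 0.0
            right = u[i + 1] if i < n - 1 else 0.0
            F_i = (left - 2.0 * u[i] + right) / h**2 + lam * np.exp(u[i]) - rhs[i]
            u[i] -= F_i / (-2.0 / h**2 + lam * np.exp(u[i]))
    return u


def coarse_solve(v, rhs, lam, h, tol=1e-10, max_it=50):
    """Newton's method for the small coarse system F_c(v) = rhs."""
    n = v.size
    for _ in range(max_it):
        r = F(v, lam, h) - rhs
        if np.linalg.norm(r) < tol:
            break
        J = (np.eye(n, k=1) + np.eye(n, k=-1) - 2.0 * np.eye(n)) / h**2
        J += np.diag(lam * np.exp(v))
        v = v - np.linalg.solve(J, r)
    return v


def restrict(r):
    """Full-weighting restriction R with stencil [1/4, 1/2, 1/4]."""
    return 0.25 * r[0:-2:2] + 0.5 * r[1:-1:2] + 0.25 * r[2::2]


def prolong(xc):
    """Linear interpolation from the coarse grid to the fine grid."""
    x = np.zeros(2 * xc.size + 1)
    x[1::2] = xc
    p = np.concatenate(([0.0], xc, [0.0]))
    x[0::2] = 0.5 * (p[:-1] + p[1:])
    return x


def fas_two_grid(u, lam, h):
    zero = np.zeros_like(u)
    u = ngs(u, zero, lam, h, sweeps=2)                        # nonlinear pre-smoothing
    uc_old = u[1::2].copy()                                   # inject the iterate
    rhs_c = F(uc_old, lam, 2.0 * h) - restrict(F(u, lam, h))  # coarse right-hand side
    uc_new = coarse_solve(uc_old, rhs_c, lam, 2.0 * h)        # u_c^{old} + x_c
    u = u + prolong(uc_new - uc_old)                          # coarse-grid update
    return ngs(u, zero, lam, h, sweeps=2)                     # nonlinear post-smoothing


n, lam = 63, 1.0
h = 1.0 / (n + 1)
u = np.zeros(n)
history = [np.linalg.norm(F(u, lam, h))]
for _ in range(10):
    u = fas_two_grid(u, lam, h)
    history.append(np.linalg.norm(F(u, lam, h)))
print(history)
```

No fine-grid Jacobian is formed anywhere in the cycle; only the small coarse system is linearized, and a multilevel version would replace `coarse_solve` with a recursive call.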

FAS and Multigrid-Newton: Spectrum of Methods
[Figure: spectrum running from Newton-Multigrid to FAS]
When does linearization happen?
Which Jacobian entries are updated?

FAS and Multigrid-Newton: Nonlinear Convergence
Convergence to $\|r\| < 10^{-9}\,\|r_0\|$ using Newton/GMRES(30)/ILU
[Table: Elements vs. Iterations; entries not recovered]

Possible Extensions: Polynomial Solvers
A great opportunity exists for polynomial solvers
Better performance
  Bandwidth considerations only intensify on multicore chips
  Petascale systems will need these improvements
More robust
  Most practical engineering calculations are quadratic
New algorithms
  Can multiple solutions speed up convergence?

Possible Extensions: Conclusions
Newton-Multigrid provides
  Good nonlinear solves
  Simple interface for software libraries
  Low computational efficiency
Multigrid-FAS provides
  Good nonlinear solves
  Lower memory bandwidth and storage
  Potentially high computational efficiency
  Requires formation of small systems on the fly

Possible Extensions: PETSc Resources
Can download tarballs or clone a Mercurial repository
Hyperlinked documentation
  Manual
  Manual pages for every method
  HTML of all example code (linked to manual pages)
FAQ
Full support at petsc-maint@mcs.anl.gov
