Scalability of Partial Differential Equations Preconditioner Resilient to Soft and Hard Faults

Size: px

Start display at page:

Download "Scalability of Partial Differential Equations Preconditioner Resilient to Soft and Hard Faults"

Bethany Preston
5 years ago
Views:

1 Scalability of Partial Differential Equations Preconditioner Resilient to Soft and Hard Faults K.Morris, F.Rizzi, K.Sargsyan, K.Dahlgren, P.Mycek, C.Safta, O.LeMaitre, O.Knio, B.Debusschere Sandia National Laboratories, Livermore, CA, USA Duke University, Durham, NC, USA UC Santa Cruz, USA ISC High Performance, Frankfurt June 2016 Supported by the US Department of Energy (DOE) Advanced Scientific Computing Research (ASCR) Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy s National Nuclear Security Administration under contract DE-AC04-94AL85000.

2 Outline 1 Motivation 2 Algorithm 3 Implementation 4 Results 5 Conclusions

Do We Need Resilience? Resilience Collection of techniques to keep applications running to a correct solution in a timely and efficient manner despite underlying system faults.

3 Do We Need Resilience? Resilience Collection of techniques to keep applications running to a correct solution in a timely and efficient manner despite underlying system faults. Addressing Failures in Exascale Computing, Snir et al., Toward Exascale Resilience, Cappello et al., IEEE, Feb.2016, Al Geist (ORNL) As a child, were you ever afraid that a monster lurking in your bedroom would leap out of the dark and get you? Lack of resilience is a similar monster, hiding in the steel cabinets of the supercomputers and threatening to crash the largest computing machines.

4 Faults: Broadly Speaking Hard: activation is systematically reproducible. Soft: activation is not systematically reproducible. Active: fault causes an error. Dormant: fault does not cause an error. The dormant fault is activated when it causes an error. Permanent: presence is continuous in time. Transient: presence is temporary. Intermittent: fault is transient and reappears. Addressing Failures in Exascale Computing, Snir et al., 2012.

5 Challenges & Needs Hypothetically hardware can take care of most of it......at the expense of energy consumption, money, and asynchrony. Power is a big obstacle towards exascale. High tradeoffs between power, resiliency and performance. For instance: if an application can tolerate memory bit flips for certain parts of its memory, it can ask the OS to turn off ECC checks for those memory regions and potentially improve power and performance. Cross-cutting research needed to explore these areas.

6 The Lorenz attractor Ed N. Lorenz, meterologist (1963): simplified model of convection in the earths atmosphere (also found in models of lasers and dynamos). System of coupled non-linear differential equations: dx = dt (y x), dy = x( dt z) y, dz = xy dt z. Lorenz used = 10, = 8/3, and = 28. Systems exhibits chaotic behavior for these (and nearby) values. Chaotic means highly sensitive to initial conditions. Small differences in initial conditions (such as those due to rounding errors in numerical computation) yield widely diverging outcomes for such dynamical systems, rendering long-term prediction impossible in general.

Motivation Algorithm Implementation Results Conclusions The Lorenz attractor: effect of Silent Data Corruption (SDCs) SDC: data corruption not raising errors/warnings.

7 Motivation Algorithm Implementation Results Conclusions The Lorenz attractor: effect of Silent Data Corruption (SDCs) SDC: data corruption not raising errors/warnings. Start from x0 = 8, y0 = 5, z0 = 1. Solve over 0 t 40. Simulate SDC by randomly selecting a bit to flip in x at t = th bit, before = , after = Small change, but completely different solution for this simple test! nominal, corrupted nominal, corrupted K. Morris Resilient Approach for PDE Min:

8 Algorithm

9 Resilient EXtreme Scale Scientific Simulations (REXSSS) What? Domain-decomposition-based preconditioner for PDEs. Currently for elliptic equations (1D, 2D). Other PDEs (in progress). How? Recasting the original PDE problem as a sampling problem, followed by a resilient data manipulation to achieve the final solution update. Why? We do not characterize all types of system faults that can occur, but focus solely on the information that a simulation provides. Target: silent data corruptions (SDCs) and nodes/cores failing.

10 Algorithm Overview 1D problem: 8 >< Ly(x) =g(x), in =(x, x + ) y(x )=y, >: y(x + )=y +, L is a linear, elliptic operator. The solution at point x 0 linearly depends on the boundary conditions: y(x 0 )=a + by + cy +.

11 Algorithm Overview: Domain Decomposition & Boundary Maps Grid with current state. Partition space with overlapping subdomains. Treat subdomains independently. Define sampling range for each boundary. Sample and solve PDE locally. From samples build maps: y 1 = f (y 2, y L )=a + by 2 + cy L y 2 = g(y 1, y R )=d + ey 1 + fy R Maps link the subdomains. y L, y R are known BC. Solve, new state: (y 1, y 2 ). If needed: update range, repeat loop. y x L x x R

12 Algorithm Overview: Domain Decomposition & Boundary Maps Grid with current state. Partition space with overlapping subdomains. Treat subdomains independently. Define sampling range for each boundary. Sample and solve PDE locally. From samples build maps: y 1 = f (y 2, y L )=a + by 2 + cy L y 2 = g(y 1, y R )=d + ey 1 + fy R Maps link the subdomains. y L, y R are known BC. Solve, new state: (y 1, y 2 ). If needed: update range, repeat loop. y A B x L x x x 1 x 2 R

13 Algorithm Overview: Domain Decomposition & Boundary Maps Grid with current state. Partition space with overlapping subdomains. Treat subdomains independently. Define sampling range for each boundary. Sample and solve PDE locally. From samples build maps: y 1 = f (y 2, y L )=a + by 2 + cy L y 2 = g(y 1, y R )=d + ey 1 + fy R Maps link the subdomains. y L, y R are known BC. Solve, new state: (y 1, y 2 ). If needed: update range, repeat loop. y A y L y 2 x L y x 1 y 1 x 2 B x y R R

14 Algorithm Overview: Domain Decomposition & Boundary Maps Grid with current state. Partition space with overlapping subdomains. Treat subdomains independently. Define sampling range for each boundary. Sample and solve PDE locally. From samples build maps: y 1 = f (y 2, y L )=a + by 2 + cy L y 2 = g(y 1, y R )=d + ey 1 + fy R Maps link the subdomains. y L, y R are known BC. Solve, new state: (y 1, y 2 ). If needed: update range, repeat loop. y y L x L A y x 1 Sampling range x 2 B x y R R

15 Algorithm Overview: Domain Decomposition & Boundary Maps Grid with current state. Partition space with overlapping subdomains. Treat subdomains independently. Define sampling range for each boundary. Sample and solve PDE locally. From samples build maps: y 1 = f (y 2, y L )=a + by 2 + cy L y 2 = g(y 1, y R )=d + ey 1 + fy R Maps link the subdomains. y L, y R are known BC. Solve, new state: (y 1, y 2 ). If needed: update range, repeat loop. y y L x L A y Build: y 2 =g(y 1,y R ) {y 1 } x 1 x 2 {y 2 } Build: y 1 =f(y L,y 2 ) B x y R R

16 Algorithm Overview: Domain Decomposition & Boundary Maps Grid with current state. Partition space with overlapping subdomains. Treat subdomains independently. Define sampling range for each boundary. Sample and solve PDE locally. From samples build maps: y 1 = f (y 2, y L )=a + by 2 + cy L y 2 = g(y 1, y R )=d + ey 1 + fy R Maps link the subdomains. y L, y R are known BC. Solve, new state: (y 1, y 2 ). If needed: update range, repeat loop. y y L x L A y Build: y 2 =g(y 1,y R ) {y 1 } x 1 x 2 {y 2 } Build: y 1 =f(y L,y 2 ) B x y R R

17 Algorithm Overview: Domain Decomposition & Boundary Maps Grid with current state. Partition space with overlapping subdomains. Treat subdomains independently. Define sampling range for each boundary. Sample and solve PDE locally. From samples build maps: y 1 = f (y 2, y L )=a + by 2 + cy L y 2 = g(y 1, y R )=d + ey 1 + fy R Maps link the subdomains. y L, y R are known BC. Solve, new state: (y 1, y 2 ). If needed: update range, repeat loop. y y L x L A y x 1 y 1 * x 2 * y 2 Update state & range B x y R R

18 Extension y y L y 1 y 2 y 3 y 4 y 5 y 6 y R A B C D x L x 1 x 2 x 3 x 4 x 5 x 6 x R x y 1 = f 1 (y L, y 2 ) y 2 = f 2 (y 1, y 4 ) y 3 = f 3 (y 1, y 4 ) y 4 = f 4 (y 3, y 6 ) y 5 = f 5 (y 3, y 6 ) y 6 = f 6 (y 5, y R )

19 Extension

Robust Regression: Resilience to SDCs High-dim boundary maps: u(x )=a + bu 1 + cu 2 +.

20 Robust Regression: Resilience to SDCs High-dim boundary maps: u(x )=a + bu 1 + cu Generate samples, suppose some are corrupted, run regression. `1 noise model is robust against presence of corrupted data.

21 Implementation

22 Server/Client-based Implementation server client cluster Cluster: 1 server + n clients. Servers: Communicate between each other. Safe data/state storage (sandboxed). Clients: Independent from one another. Only serve as computing units. Separates state from computation: reduces the overall vulnerability. Resilient to clients crashing because even if tasks are lost, state is safe. It aligns with the vision of future exascale architectures involving heterogeneous and hierarchical hardware required to meet energy and cost constraints. C++ code, two external dependencies: Trilinos and Boost.

23 Server/Client-based Implementation Node failures are modeled as clients crashing. Server simply continues the execution using only the clients that are alive. Servers probe the corresponding cluster communicator using MPI_ANY_SOURCE to assess whether a new message is arriving from one of the clients.

24 Resilience Results

25 Injecting Faults Selective reliability Inject/perturb applications at target points and evaluate how it behaves. Some parts of the algorithm are assumed to be handled in a more reliable manner than others. M.Hoemmen,M.Heroux,2012 Silent Data Corruptions (SDC) Selective reliability: only affect sampling stage. Poisson process with rates: r 1 =4 10 5, r 2 = (n faults /sec). Corrupt the solution of a task. Bit-flip model: random bit-flip in binary representation. Node/Cores failing Selective reliability: only affect sampling stage. Poisson process with rates: r 1 =5 10 5, r 2 = (n faults /sec). Simulate by setting idle the MPI ranks of a client (as if client was dead).

26 Details Oversampling Oversampling: >1, such that N = N s nom. N s nom: number of samples for the fault-free scenario. Silent Data Corruptions (SDC) Resilience condition: out of the samples used in the regression, the number of uncorrupted samples has to be greater than the minimum set needed to have a well-posed regression problem. Node/Cores failing Server continues the execution using only the clients that are alive. No need to rebuild broken communicators, or clients.

27 Test Problem for Resilience 2D elliptic PDE. Consider 48 2 subdomains, distributed among 48 servers. Each server-client cluster includes 48 clients each made of 1 rank. The local grid within each subdomain is We consider three runs: one with no faults that is used as a reference two runs with faults based on the rates defined before. The number of cores is in all cases 2352, and all runs are performed on Edison. To have a more statistically meaningful analysis an ensemble of runs would be needed, but this objective was prohibitive due to time limitations.

28 Results (left) The figure shows that in all cases, the solution converges, regardless of the number of faults occurring. (right) the results show that the runtime stays approximately constant. The negligible effect of the faults on the runtime is due to the overhead caused by the Boost serialization/mpi library, which tends to make the effective communication costs much higher than with pure MPI. root-mean-square (RMS) residual. t =time for the no faults case.

29 Nominal Scalability

SST/macro Coarse-grained simulator of distributed-memory applications. Uses machine models to estimate performance of processing and network components.

30 SST/macro Coarse-grained simulator of distributed-memory applications. Uses machine models to estimate performance of processing and network components. Estimates the performance without incurring the cost of doing actual message passing or computation operations. Uses a skeleton of the target application: all operations are stripped out, only left communication patterns. Estimate performance of a given application on new or existing architectures without using the actual machine. Useful for assessing communication behavior. How does it work?

31 SST/macro: Overview Replace/model computation with equivalent timings. Various ways to estimate timings: operations counts, memory, etc. No need for data to exist for real, but need size. 1 if (rank==0){ const int N = 10000; 4 std :: vector<double> data(n, 0.0); for ( int i=0; i<n; ++i){ 7 if (rank==0) 8 data [ i ] =... ; 9 else 10 data [ i ] =... ; 11 } 12 double mean = compute_mean ( data ) ; MPI_Send(&mean,1,DBL,1,0,COMM) ; 15 } 16 else 17 { 18 double mean = 0. 0 ; 19 MPI_Status st ; 20 MPI_Recv(&mean,1,DBL,0,0,COMM,& st ) ; 21 } 1 if (rank==0) 2 { double time1 = 3; //seconds 8 SSTMAC_compute ( time1 ) ; MPI_Send (NULL,1,DBL,1,0,COMM); 15 } 16 else 17 { 18 MPI_Status st ; 19 MPI_Recv (NULL,1,DBL,0,0,COMM, &st ); 20 }

32 Scalability based on SST/macro skeleton Weak Strong Subdomains 12 2, 18 2, 24 2, 30 2, Total Cores 3088, 6948, 12352, 19300, , 12352, Subdomain grid size Overlap (# cells) Servers 16, 36, 64, 100, , 64, 256 Number of clients/server Size of each client 4 4

33 Scalability Weak Strong

34 Conclusions Domain-decomposition PDE preconditioner that is resilient to SDC and node failures. Novel reformulation of the PDE into a sampling problem over a set of subdomains such that data is generated, and then suitably manipulated to yield the final updating of the solution state. SST/macro skeleton of our application showing excellent scalability up to 50k cores. Resilience with simulated SDCs and node failures. The results show convergence in all cases. Overhead due to the presence of faults is negligible. This effect, however, is due to the use of the Boost serialization/mpi library.

35 Acknowledgments This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Award Number Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy s National Nuclear Security Administration under contract DE-AC04-94AL This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH THANK YOU!

Agreement Protocols. CS60002: Distributed Systems. Pallab Dasgupta Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur

Agreement Protocols. CS60002: Distributed Systems. Pallab Dasgupta Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur Agreement Protocols CS60002: Distributed Systems Pallab Dasgupta Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur Classification of Faults Based on components that failed Program