Solving RODEs on GPU clusters


Christoph Riesinger, Technische Universität München
HIGH SCIENCE, March 4, 2016

Motivation - Parallel Computing


Motivation - Multiple Levels of Parallelism


Building Blocks
Each GPU (GPU 0, GPU 1, ..., GPU N-1) runs the same four building blocks; the Monte Carlo level spans all GPUs:
- Pseudo Random Number Generation
- Ornstein-Uhlenbeck Process
- Averaging
- Numerical Solver

Pseudo Random Number Generation - Ziggurat
- The area under the Gaussian function is approximated by strips R_i.
- These strips are further subdivided into central (green), tail (purple), and cap (red) regions and a base strip (blue).
[Figure: Gaussian curve covered by strips R_0, ..., R_7 with edges x_1, ..., x_7 = r and heights y_0, ..., y_7; the last strip R_7 = R_B is the base strip]

Pseudo Random Number Generation - Ziggurat (continued)
- To do the transformation, a strip is selected at random.
- A uniform random number u ∈ [0, 1[ is stretched by a lookup table value that depends on the selected strip.
- If a central region is hit, the transformation is very cheap; otherwise it is much more expensive.
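The sampling steps above can be sketched on the CPU as follows. This is a minimal sketch of the classic Marsaglia-Tsang form of the Ziggurat method, not the GPU implementation from the talk. The strip count of 128 and the constants `R` and `V` are assumptions taken from the published method, and the names `build_table` and `sample_normal` are hypothetical. The sampler also reports whether the cheap central-region path was taken.

```python
import math
import random

# Assumed constants for a 128-strip ziggurat over f(x) = exp(-x^2 / 2),
# as published for the classic Marsaglia-Tsang method.
N_STRIPS = 128
R = 3.442619855899        # right edge of the base strip
V = 9.91256303526217e-3   # common area of every strip


def f(x):
    """Unnormalized Gaussian density."""
    return math.exp(-0.5 * x * x)


def build_table(n=N_STRIPS, r=R, v=V):
    """Return strip edges xs[0..n]; xs[0] is the pseudo-width of the base strip."""
    xs = [0.0] * (n + 1)
    xs[0] = v / f(r)
    xs[1] = r
    for i in range(1, n - 1):
        # Every strip has area v: xs[i] * (f(xs[i+1]) - f(xs[i])) = v.
        xs[i + 1] = math.sqrt(-2.0 * math.log(v / xs[i] + f(xs[i])))
    xs[n] = 0.0
    return xs


def sample_normal(rng, xs):
    """Return (sample, hit_central_region) for one standard normal draw."""
    n = len(xs) - 1
    while True:
        i = rng.randrange(n)              # select a strip at random
        u = 2.0 * rng.random() - 1.0      # uniform number, here with a sign
        x = u * xs[i]                     # stretch by the lookup table value
        if abs(x) < xs[i + 1]:            # central region: cheap accept
            return x, True
        if i == 0:                        # tail beyond the base strip
            while True:
                a = -math.log(1.0 - rng.random()) / xs[1]
                b = -math.log(1.0 - rng.random())
                if 2.0 * b >= a * a:
                    return math.copysign(xs[1] + a, u), False
        # Cap region: accept only if a random height lies under the curve.
        y = f(xs[i]) + rng.random() * (f(xs[i + 1]) - f(xs[i]))
        if y < f(x):
            return x, False
```

A call such as `sample_normal(random.Random(1), build_table())` returns one sample together with a flag; averaging the flag over many draws estimates how often the cheap path is taken.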

Pseudo Random Number Generation - Trade-off
- The more strips the Ziggurat uses, the larger the ratio of the summed area of all central regions to the summed area of all strips.
- The larger this ratio, the more likely a sample hits a (cheap) central region.
- On GPUs, this also reduces the likelihood of warp divergence.
- Runtime can therefore be reduced by using more strips, at the cost of larger lookup tables: a runtime/memory trade-off.
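The growth of the central-region ratio with the number of strips can be checked numerically. The sketch below assumes the classic equal-area Ziggurat construction over f(x) = exp(-x^2/2) and counts the rectangular part of the base strip as cheap; both are assumptions, and the talk's variant may partition the regions differently. The helper names `strip_area`, `closure_error`, `solve_r`, and `central_ratio` are hypothetical.

```python
import math


def f(x):
    """Unnormalized Gaussian density."""
    return math.exp(-0.5 * x * x)


def strip_area(r):
    """Area of the base strip (rectangle plus tail) for right edge r."""
    tail = math.sqrt(math.pi / 2.0) * math.erfc(r / math.sqrt(2.0))
    return r * f(r) + tail


def closure_error(r, n):
    """Positive if n equal-area strips starting at r overshoot f(0) = 1."""
    v = strip_area(r)
    x = r
    for _ in range(n - 2):
        y = v / x + f(x)
        if y >= 1.0:
            return 1.0  # the top is reached too early: r is too small
        x = math.sqrt(-2.0 * math.log(y))
    return v / x + f(x) - 1.0


def solve_r(n, lo=0.5, hi=10.0, iters=200):
    """Bisection for the base-strip edge r that makes n strips fit exactly."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if closure_error(mid, n) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def central_ratio(n):
    """Summed central-region area divided by summed strip area."""
    r = solve_r(n)
    v = strip_area(r)
    edges = [v / f(r), r]
    for _ in range(n - 2):
        x = edges[-1]
        edges.append(math.sqrt(-2.0 * math.log(v / x + f(x))))
    edges.append(0.0)
    # All strips share area v, so the ratio is the mean of edge quotients.
    return sum(edges[i + 1] / edges[i] for i in range(n)) / n
```

Evaluating `central_ratio` for, say, 8, 32, 128, and 512 strips shows the ratio rising towards 1, while the lookup table grows linearly with the strip count.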

Pseudo Random Number Generation - Results 1/2
Performance of the Ziggurat method

GPU architecture    Fermi    Kepler        Maxwell
Model               M2090    Tesla K40m    GTX 750 Ti

[Table rows without surviving values: #Processing elements, Peak performance SP (TFLOPS), Peak performance DP (TFLOPS), Peak memory bandwidth (GByte/s)]
[Figure: giga pseudo random numbers per second vs. number of strips, one panel each for Tesla M2090 (Fermi), Tesla K40m (Kepler), and GTX 750 Ti (Maxwell); series labelled "local" and "shared"]

Pseudo Random Number Generation - Results 2/2
Comparison with other normal PRNGs
[Figure: giga pseudo random numbers per second vs. grid configuration, one panel each for Tesla M2090 (Fermi), Tesla K40m (Kepler), and GTX 750 Ti (Maxwell); legend: Ziggurat, Inverse CDF, Rational Polynomial, curand, Wallace, XORWOW, MKL on Xeon E v2]

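Ornstein-Uhlenbeck process - Link to Prefix Sum 1/2

$$
\begin{aligned}
O_{t+h}  &= \mu O_t + \sigma_X n^{(1)} \\
O_{t+2h} &= \mu O_{t+h} + \sigma_X n^{(2)}
          = \mu \left( \mu O_t + \sigma_X n^{(1)} \right) + \sigma_X n^{(2)} \\
         &= \mu^2 O_t + \sigma_X \left( \mu n^{(1)} + n^{(2)} \right) \\
O_{t+3h} &= \mu O_{t+2h} + \sigma_X n^{(3)}
          = \mu \left( \mu O_{t+h} + \sigma_X n^{(2)} \right) + \sigma_X n^{(3)} \\
         &= \mu \left( \mu \left( \mu O_t + \sigma_X n^{(1)} \right) + \sigma_X n^{(2)} \right) + \sigma_X n^{(3)} \\
         &= \mu^3 O_t + \sigma_X \left( \mu^2 n^{(1)} + \mu n^{(2)} + n^{(3)} \right) \\
         &\;\;\vdots \\
O_{t+ih} &= \mu^i O_t + \sigma_X \sum_{k=1}^{i} \mu^{i-k} n^{(k)}
\end{aligned}
$$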
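The closed form at the end of this derivation can be checked against the step-by-step recurrence. The sketch below is an illustration only: the function names `ou_sequential` and `ou_closed_form` and the parameter names are hypothetical, and `noise[k - 1]` stands for the normal random number n^(k).

```python
def ou_sequential(o_t, mu, sigma_x, noise):
    """Apply O_{t+(k)h} = mu * O_{t+(k-1)h} + sigma_x * n^(k) step by step."""
    path = []
    o = o_t
    for n_k in noise:
        o = mu * o + sigma_x * n_k
        path.append(o)
    return path


def ou_closed_form(o_t, mu, sigma_x, noise):
    """Evaluate O_{t+ih} = mu^i O_t + sigma_x * sum_k mu^(i-k) n^(k) directly."""
    path = []
    for i in range(1, len(noise) + 1):
        weighted = sum(mu ** (i - k) * noise[k - 1] for k in range(1, i + 1))
        path.append(mu ** i * o_t + sigma_x * weighted)
    return path
```

Both functions return the same path up to rounding. The closed form has no dependency between successive entries, which is what makes a parallel evaluation possible.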

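Ornstein-Uhlenbeck process - Link to Prefix Sum 2/2
This looks very similar to the prefix sum (scan) operation:

$$
O_{t+ih} = \mu^i O_t + \sigma_X \sum_{k=1}^{i} \mu^{i-k} n^{(k)}
\qquad \text{(Ornstein-Uhlenbeck process)}
$$

$$
O_{t+ih} = \sum_{k=1}^{i} n^{(k)}
\qquad \text{(prefix sum)}
$$

The second line is the special case $\mu = 1$, $\sigma_X = 1$, $O_t = 0$ of the first.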

[Figure: two rows of array elements x_1, ..., x_15 illustrating the scan operation]
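One generic way to make the link to the scan operation concrete is to treat every time step as an affine map o -> a * o + b and to scan over the composition of these maps, which is associative. This is a standard technique for linear recurrences and not necessarily the formulation used in the talk; the names `combine` and `ou_by_scan` are hypothetical.

```python
from itertools import accumulate


def combine(first, second):
    """Compose two affine maps: apply `first`, then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a1 * a2, a2 * b1 + b2)


def ou_by_scan(o_t, mu, sigma_x, noise):
    """OU path computed as an inclusive scan over per-step affine maps."""
    steps = [(mu, sigma_x * n_k) for n_k in noise]
    # After i steps the pair (a, b) equals (mu^i, sigma_x * sum mu^(i-k) n^(k)).
    return [a * o_t + b for a, b in accumulate(steps, combine)]
```

For example, `ou_by_scan(1.0, 0.5, 2.0, [1.0, 1.0, 1.0])` yields `[2.5, 3.25, 3.625]`, the same values as the step-by-step recurrence. Because `combine` is associative, the sequential `accumulate` could be replaced by any parallel scan.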

Ornstein-Uhlenbeck process - Parallel Prefix Sum Up-Sweep

Algorithm 1: Up-sweep phase
    for d = 1; d ≤ log_2(n); d++ do
        for i = 0; i < n / 2^d; i++ do
            x_{(i+1) 2^d} ← x_{(i+1) 2^d} + x_{(i+1/2) 2^d}
        end for
    end for

[Figure: up-sweep tree over the levels d = 0 to d = 4; the partial sums of the n^(k) are weighted by powers of μ]
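The up-sweep phase can be written out sequentially as below. This is a sketch that assumes the 1-based indexing and loop bounds of the standard up-sweep and a power-of-two input length; the function name `up_sweep` is hypothetical. On a GPU, the iterations of the inner loop are independent and would each be handled by one thread.

```python
def up_sweep(values):
    """Return the array after the up-sweep phase (plain sums, no weights)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    levels = n.bit_length() - 1          # log2(n)
    x = [None] + list(values)            # pad so x[1..n] is 1-based
    for d in range(1, levels + 1):
        stride = 1 << d                  # 2^d
        half = stride >> 1               # 2^(d-1)
        for i in range(n // stride):     # independent iterations
            right = (i + 1) * stride
            left = right - half          # equals (i + 1/2) * 2^d
            x[right] = x[right] + x[left]
    return x[1:]
```

For the input `[1, 2, 3, 4, 5, 6, 7, 8]` the result is `[1, 3, 3, 10, 5, 11, 7, 36]`: the last element holds the total, and every position that is a power of two already holds its final prefix sum.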

Ornstein-Uhlenbeck process - Parallel Prefix Sum Down-Sweep

Algorithm 2: Down-sweep phase
    for d = log_2(n) - 1; d > 0; d-- do
        for i = 0; i < n / 2^d - 1; i++ do
            x_{(i+3/2) 2^d} ← x_{(i+1) 2^d} + x_{(i+3/2) 2^d}
        end for
    end for

[Figure: down-sweep tree over the levels d = 3 to d = 0; the partial sums of the n^(k) are weighted by powers of μ]
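Both phases together give an inclusive scan, and with weights that are powers of μ they produce the sums needed for the Ornstein-Uhlenbeck process. The sketch below is a sequential illustration under assumptions: 1-based indexing, a power-of-two number of elements, and a weight of μ^(2^(d-1)) on the left operand at level d, which is inferred from the closed form rather than taken from the talk's kernels. The names `weighted_scan` and `ou_path` are hypothetical.

```python
def weighted_scan(noise, mu):
    """Return s_i = sum_{k<=i} mu^(i-k) * noise_k for i = 1..n."""
    n = len(noise)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    levels = n.bit_length() - 1
    x = [0.0] + list(noise)                  # x[1..n], 1-based
    for d in range(1, levels + 1):           # up-sweep phase
        stride, half = 1 << d, 1 << (d - 1)
        w = mu ** half                       # shift the left block by `half` steps
        for i in range(n // stride):
            right = (i + 1) * stride
            x[right] = w * x[right - half] + x[right]
    for d in range(levels - 1, 0, -1):       # down-sweep phase
        stride, half = 1 << d, 1 << (d - 1)
        w = mu ** half
        for i in range(n // stride - 1):
            src = (i + 1) * stride           # already holds a complete prefix
            x[src + half] = w * x[src] + x[src + half]
    return x[1:]


def ou_path(o_t, mu, sigma_x, noise):
    """O_{t+ih} = mu^i * O_t + sigma_x * s_i for i = 1..n."""
    sums = weighted_scan(noise, mu)
    return [mu ** (i + 1) * o_t + sigma_x * s for i, s in enumerate(sums)]
```

With `mu = 1` the function reduces to the plain prefix sum, for example `weighted_scan([1, 2, 3, 4, 5, 6, 7, 8], 1)` gives `[1, 3, 6, 10, 15, 21, 28, 36]`.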

Ornstein-Uhlenbeck process - Results
[Figure: giga realizations of the OU process vs. elements per thread, one panel each for Tesla M2090 (Fermi), Tesla K40m (Kepler), and GTX 750 Ti (Maxwell); series: float and double with 2^7, 2^8, 2^9, and 2^10 threads/block]

Averaging
[Figure: stepwise combination of array elements x_i]

Averaging - Results
[Figure: ratio of maximum bandwidth vs. threads per block, one panel each for Tesla M2090 (Fermi), Tesla K40m (Kepler), and GTX 750 Ti (Maxwell); series: float and double for single averaging, double averaging, 3-tridiagonal, and 4-tridiagonal]

architecture                               Tesla M2090    Tesla K40m     GTX 750 Ti
ratio of peak memory bandwidth             88.5%          72.%           8.%
configuration (threads/block, precision)   28, double     2 8, double    2 6, double

Solving one instance of the RODE on a single GPU
[Figure: runtime breakdown per kernel for float and double, one panel each for Tesla M2090 (Fermi), Tesla K40m (Kepler), and GTX 750 Ti (Maxwell); colours (purple, blue, green, red) group the kernels into numerical solver, averaging, Ornstein-Uhlenbeck process, and pseudo random number generation]
Kernels shown:
- initstatesnormalkernel()
- getrandomnumbersnormalkernel()
- scanexclusiveoukernel()
- scanoufixkernel()
- realizeouprocesskernel()
- singleaveragekernel()
- averagedeulerkernel()

Solving several instances of the RODE on multiple GPUs

cluster         JuDGE             Hydra             TSUBAME 2.5
location        FZJ               RZG               GSIC
interconnect    QDR InfiniBand    FDR InfiniBand    QDR InfiniBand

[Table rows with incomplete values: GPUs per node (2, 3), total # of GPUs]
[Figure: efficiency vs. number of GPUs, one panel each for JuDGE, Hydra, and TSUBAME; series: float and double for GPU computations, MPI_Reduce(), and total]
