HIGH TEA @ SCIENCE
Solving RODEs on GPU clusters
Christoph Riesinger, Technische Universität München
March 4, 2016

Motivation - Parallel Computing

Motivation - Multiple Levels of Parallelism

Building Blocks
[Figure: Monte Carlo approach distributed over GPU 0 to GPU N-1. Each GPU runs the four building blocks: Pseudo Random Number Generation, Ornstein-Uhlenbeck Process, Averaging, Numerical Solver.]

Pseudo Random Number Generation - Ziggurat
The area under the Gaussian function is approximated by strips R_i.
These strips are further subdivided into central (green), tail (purple), and cap (red) regions and a base strip (blue).
[Figure: Ziggurat with strips R_0, ..., R_7 = R_B, strip edges x_1, ..., x_7 = r, and heights y_0, ..., y_7.]
To do the transformation, a strip is selected at random.
A uniform random number u ∈ [0, 1[ is stretched by a lookup table value that depends on the selected strip.
If a central region is hit, the transformation is very cheap; otherwise it is much more expensive.
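The steps on this slide can be sketched on the CPU as follows. This is a minimal sketch, assuming a Marsaglia-Tsang style construction with equal-area strips over the unnormalized density exp(-x^2/2); it is not the GPU implementation from the talk. All names (`build_ziggurat`, `sample_normal`, `_top_error`) are hypothetical, and the tables are found by a simple bisection on the base-strip edge r.

```python
import math
import random


def pdf(x):
    """Unnormalized Gaussian density exp(-x^2 / 2)."""
    return math.exp(-0.5 * x * x)


def tail_area(r):
    """Area under pdf() to the right of r."""
    return math.sqrt(math.pi / 2.0) * math.erfc(r / math.sqrt(2.0))


def _top_error(r, n):
    """Stack n - 1 equal-area layers on the base strip; > 0 means r is too small."""
    v = r * pdf(r) + tail_area(r)
    x, y = r, pdf(r)
    for k in range(n - 1):
        y += v / x
        if k == n - 2:
            return y - 1.0
        if y >= 1.0:
            return 1.0
        x = math.sqrt(-2.0 * math.log(y))


def build_ziggurat(n):
    """Return (r, v, xs, ys) for n >= 2 strips covering the half Gaussian."""
    lo, hi = 0.1, 10.0
    for _ in range(100):  # bisection on the base-strip edge r
        mid = 0.5 * (lo + hi)
        if _top_error(mid, n) > 0.0:
            lo = mid
        else:
            hi = mid
    r = hi
    v = r * pdf(r) + tail_area(r)
    xs, ys = [r], [pdf(r)]
    for _ in range(n - 2):
        ys.append(ys[-1] + v / xs[-1])
        xs.append(math.sqrt(-2.0 * math.log(ys[-1])))
    xs.append(0.0)  # the topmost layer ends at the peak of the Gaussian
    ys.append(1.0)
    return r, v, xs, ys


def sample_normal(zig, rng):
    """Draw one standard normal sample from the lookup tables in zig."""
    r, v, xs, ys = zig
    n = len(xs)
    while True:
        i = rng.randrange(n)  # randomly select a strip
        sign = 1.0 if rng.random() < 0.5 else -1.0
        u = rng.random()  # uniform number to be stretched
        if i == 0:  # base strip: rectangle plus tail
            x = u * v / pdf(r)
            if x < r:
                return sign * x
            while True:  # tail region: expensive
                a = -math.log(1.0 - rng.random()) / r
                b = -math.log(1.0 - rng.random())
                if 2.0 * b > a * a:
                    return sign * (r + a)
        x = u * xs[i - 1]
        if x < xs[i]:  # central region: cheap, no density evaluation
            return sign * x
        y = ys[i - 1] + rng.random() * (ys[i] - ys[i - 1])
        if y < pdf(x):  # cap region: accept/reject test
            return sign * x


zig = build_ziggurat(128)
rng = random.Random(0)
samples = [sample_normal(zig, rng) for _ in range(5)]
```

Only the central branch avoids logarithms and exponentials, which is why hitting it is the cheap case.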

Pseudo Random Number Generation - Trade-off
The more strips the Ziggurat uses, the larger the ratio of the summed central regions to the summed area of all strips.
The larger this ratio, the more likely a sample hits a (cheap) central region.
On GPUs, this also makes warp divergence less likely.
So runtime can be reduced by using more strips, which in turn requires larger lookup tables: a runtime/memory trade-off.
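The ratio described above can be computed numerically. The following is a rough sketch, assuming equal-area strips over the unnormalized density exp(-x^2/2) and counting only the central rectangles of the layers above the base strip; the function name `central_fraction` and the bisection bounds are hypothetical choices for this illustration.

```python
import math


def central_fraction(n):
    """Summed central regions divided by summed strip area, for n >= 2 strips."""
    def f(x):
        return math.exp(-0.5 * x * x)

    def stack(r):
        """Stack n - 1 equal-area layers on a base strip ending at r."""
        v = r * f(r) + math.sqrt(math.pi / 2.0) * math.erfc(r / math.sqrt(2.0))
        x, y, central = r, f(r), 0.0
        for _ in range(n - 2):
            y_top = y + v / x
            if y_top >= 1.0:  # layers overshoot the peak: r is too small
                return None
            x_inner = math.sqrt(-2.0 * math.log(y_top))
            central += x_inner * (y_top - y)  # rectangle fully below the curve
            x, y = x_inner, y_top
        return y + v / x, central / (n * v)

    lo, hi = 0.1, 10.0  # bisection for the base-strip edge r
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        result = stack(mid)
        if result is None or result[0] > 1.0:
            lo = mid
        else:
            hi = mid
    return stack(hi)[1]


for strips in (16, 64, 256, 1024):
    print(strips, round(central_fraction(strips), 4))
```

The printed ratio grows with the number of strips, while the lookup tables grow linearly with it.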

Pseudo Random Number Generation - Results 1/2
Performance of the Ziggurat Method

GPU architecture                  Fermi          Kepler        Maxwell
Model                             Tesla M2090    Tesla K40m    GTX 750 Ti
#Processing elements              16 × 32        15 × 192      5 × 128
Peak performance SP (TFLOPS)      1.3312         3.84192       1.6384
Peak performance DP (TFLOPS)      0.6656         1.28064       0.0512
Peak memory bandwidth (GByte/s)   177.4          288.384       96.128

[Figure: giga pseudo random numbers per second over the number of strips, one panel each for Tesla M2090 (Fermi), Tesla K40m (Kepler), and GTX 750 Ti (Maxwell); lookup tables in local or shared memory, for several grid configurations.]

Pseudo Random Number Generation - Results 2/2
Comparison with other Normal PRNGs
[Figure: giga pseudo random numbers per second over the grid configuration, one panel each for Tesla M2090 (Fermi), Tesla K40m (Kepler), and GTX 750 Ti (Maxwell). Compared generators: Ziggurat, Inverse CDF, Rational Polynomial, Wallace, curand XORWOW, and MKL on Xeon E5-2680 v2.]

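Ornstein-Uhlenbeck process - Link to Prefix Sum 1/2
\begin{align*}
O_{t+h}  &= \mu O_t + \sigma_X n^{(1)} \\
O_{t+2h} &= \mu O_{t+h} + \sigma_X n^{(2)}
          = \mu \left( \mu O_t + \sigma_X n^{(1)} \right) + \sigma_X n^{(2)}
          = \mu^2 O_t + \sigma_X \left( \mu n^{(1)} + n^{(2)} \right) \\
O_{t+3h} &= \mu O_{t+2h} + \sigma_X n^{(3)}
          = \mu \left( \mu \left( \mu O_t + \sigma_X n^{(1)} \right) + \sigma_X n^{(2)} \right) + \sigma_X n^{(3)}
          = \mu^3 O_t + \sigma_X \left( \mu^2 n^{(1)} + \mu n^{(2)} + n^{(3)} \right) \\
         &\;\;\vdots \\
O_{t+ih} &= \mu^i O_t + \sigma_X \sum_{k=1}^{i} \mu^{i-k} n^{(k)}
\end{align*}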

Ornstein-Uhlenbeck process - Link to Prefix Sum 2/2
This looks very similar to the prefix sum or scan operation:
\[
O_{t+ih} = \mu^i O_t + \sigma_X \sum_{k=1}^{i} \mu^{i-k} n^{(k)}
\qquad \text{vs.} \qquad
O_{t+ih} = \sum_{k=1}^{i} n^{(k)}
\]
[Figure: two rows of elements x_1, ..., x_15 illustrating the scan operation.]
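The link between the step-by-step recurrence and the weighted sum can be checked numerically. This is a small sketch with hypothetical function names (`ou_sequential`, `ou_closed_form`); the values chosen for mu, sigma_x, and the start value are arbitrary illustration values, not parameters from the talk.

```python
import random


def ou_sequential(o_t, mu, sigma_x, noise):
    """Apply O_{t+h} = mu * O_t + sigma_x * n step by step."""
    path, o = [], o_t
    for n_k in noise:
        o = mu * o + sigma_x * n_k
        path.append(o)
    return path


def ou_closed_form(o_t, mu, sigma_x, noise):
    """Evaluate mu^i * O_t + sigma_x * sum_k mu^(i-k) * n^(k) for every i."""
    path = []
    for i in range(1, len(noise) + 1):
        weighted = sum(mu ** (i - k) * noise[k - 1] for k in range(1, i + 1))
        path.append(mu ** i * o_t + sigma_x * weighted)
    return path


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(16)]
a = ou_sequential(0.5, 0.9, 0.3, noise)
b = ou_closed_form(0.5, 0.9, 0.3, noise)
print(max(abs(p - q) for p, q in zip(a, b)))  # difference is rounding error only
```

With mu = 1, sigma_x = 1, and a start value of 0, both functions reduce to the plain prefix sum of the noise values.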

Ornstein-Uhlenbeck process - Parallel Prefix Sum Up-Sweep
Algorithm 1 Up-sweep phase
for d = 1; d ≤ log_2(n); d++ do
    for i = 0; i < n / 2^d; i++ do
        x_{(i+1) 2^d - 1} ← x_{(i+1) 2^d - 1} + x_{(i+1/2) 2^d - 1}
    end for
end for
[Figure: up-sweep tree over 16 elements, levels d = 0 to d = 4, with edges weighted by powers of μ.]
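A sequential rendering of the up-sweep phase may help to follow the index arithmetic. This is a sketch of the unweighted variant only; the function name `up_sweep` is hypothetical, and the inner loop, which a GPU would run in parallel, is executed serially here.

```python
def up_sweep(x):
    """In-place up-sweep of a parallel prefix sum; len(x) must be a power of two."""
    n = len(x)
    levels = n.bit_length() - 1
    assert n == 1 << levels, "n must be a power of two"
    for d in range(1, levels + 1):
        stride = 1 << d
        for i in range(n // stride):  # iterations are independent of each other
            right = (i + 1) * stride - 1  # index (i + 1) * 2^d - 1
            left = right - stride // 2  # index (i + 1/2) * 2^d - 1
            x[right] += x[left]
    return x


print(up_sweep([1] * 8))  # → [1, 2, 1, 4, 1, 2, 1, 8]
```

After the up-sweep, the last element holds the total sum and the inner tree nodes hold partial sums of their blocks; the remaining prefix sums are filled in by the down-sweep.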

Ornstein-Uhlenbeck process - Parallel Prefix Sum Down-Sweep
Algorithm 2 Down-sweep phase
for d = log_2(n) - 1; d > 0; d-- do
    for i = 0; i < n / 2^d - 1; i++ do
        x_{(i+3/2) 2^d - 1} ← x_{(i+1) 2^d - 1} + x_{(i+3/2) 2^d - 1}
    end for
end for
[Figure: down-sweep tree over 16 elements, levels d = 3 to d = 0, with edges weighted by powers of μ.]
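Both phases together can be sketched as one function and checked against the sequential recurrence. The weighting by powers of mu is an assumption read off the figure labels, since the pseudocode itself shows the unweighted additions; with mu = 1 the function reduces to the plain inclusive prefix sum. The name `weighted_scan` is hypothetical, and the loops that would run in parallel on a GPU are serial here.

```python
def weighted_scan(x, mu):
    """Return y with y[j] = sum over k <= j of mu^(j-k) * x[k]; len(x) a power of two."""
    n = len(x)
    levels = n.bit_length() - 1
    assert n == 1 << levels, "n must be a power of two"
    y = list(x)
    for d in range(1, levels + 1):  # up-sweep phase
        half = 1 << (d - 1)
        for i in range(n >> d):
            right = (i + 1) * (1 << d) - 1
            y[right] += mu ** half * y[right - half]
    for d in range(levels - 1, 0, -1):  # down-sweep phase
        half = 1 << (d - 1)
        for i in range((n >> d) - 1):
            src = (i + 1) * (1 << d) - 1  # already holds a complete prefix
            y[src + half] += mu ** half * y[src]
    return y


def recurrence(x, mu):
    """Sequential reference: y[j] = mu * y[j-1] + x[j], starting from 0."""
    out, acc = [], 0.0
    for value in x:
        acc = mu * acc + value
        out.append(acc)
    return out


print(weighted_scan([1, 2, 3, 4, 5, 6, 7, 8], 1))  # → [1, 3, 6, 10, 15, 21, 28, 36]
```

In both phases the two combined elements are 2^(d-1) positions apart, which is why the same factor mu^(2^(d-1)) appears in each.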

Ornstein-Uhlenbeck process - Results
[Figure: giga realizations of the OU process over the number of elements per thread (1, 2, 4, 8, 16), one panel each for Tesla M2090 (Fermi), Tesla K40m (Kepler), and GTX 750 Ti (Maxwell); float and double, each with 2^7, 2^8, 2^9, and 2^10 threads/block.]

Averaging
[Figure: reduction scheme over the elements x_1, ..., x_23.]

Averaging - Results
[Figure: ratio of maximum bandwidth over threads per block (2^5 to 2^10), one panel each for Tesla M2090 (Fermi), Tesla K40m (Kepler), and GTX 750 Ti (Maxwell); float and double, each for single averaging, double averaging, 3-tridiagonal, and 4-tridiagonal.]

architecture                        Tesla M2090    Tesla K40m     GTX 750 Ti
ratio of peak memory bandwidth      88.5%          72.%           8.%
configuration (threads per block)   28, double     2^8, double    2^6, double

Solving one instance of the RODE on a single GPU
[Figure: runtime shares of the kernels for float and double over several problem configurations, one panel each for Tesla M2090 (Fermi), Tesla K40m (Kepler), and GTX 750 Ti (Maxwell). Kernels shown: initstatesnormalkernel(), getrandomnumbersnormalkernel(), scanexclusiveoukernel(), scanoufixkernel(), realizeouprocesskernel(), singleaveragekernel(), averagedeulerkernel(). Colors (purple, blue, green, red) group the kernels by building block: pseudo random number generation, Ornstein-Uhlenbeck process, averaging, numerical solver.]

Solving several instances of the RODE on multiple GPUs

cluster            JuDGE             Hydra             TSUBAME 2.5
location           FZJ               RZG               GSIC
GPUs per node      2                 2                 3
total # of GPUs    206               338               4224
Interconnect       QDR InfiniBand    FDR InfiniBand    QDR InfiniBand

[Figure: efficiency over the number of GPUs, one panel each for JuDGE, Hydra, and TSUBAME 2.5; float and double, each for GPU computations, MPI_Reduce(), and total.]
