Interpolation with Radial Basis Functions on GPGPUs using CUDA

1 Interpolation with Radial Basis Functions on GPGPUs using CUDA
Gundolf Haase, in cooperation with Dirk Martin [VRV Vienna] and Günter Offner [AVL Graz]
Institute for Mathematics and Scientific Computing, University of Graz, Austria
Jena, July 15, 2014

2 Outline
Moving mesh
  Motivation
  Radial basis function interpolation [Dirk Martin, AVL Graz]
  RBF Evaluation
  Numerical results
More results
  Some nice speedups
Conclusions
  Accelerator programming

3 Moving mesh: Motivation

4 Moving Mesh - Motivation [Dirk Martin, AVL Graz]
- Changing geometry caused by boundary/interface displacement.
- Preserve the topology of the overall mesh ⇒ map the boundary displacement onto all nodes in the mesh.
- One opportunity: RBF interpolation. Requirement: it runs on the GPU.
- We look for the harmonic solution of $\nabla^2 u = 0$ in $\Omega$ with $u = f$ on the set of points ⇒ approximate it by fundamental solutions in the points of interest in $\Omega$.

5 Moving mesh: Radial basis function interpolation [Dirk Martin, AVL Graz]

6 Radial basis functions
Definition: If a univariate (one variable) real-valued function $\varphi : [0, \infty) \to \mathbb{R}$ is used as a symmetric multivariate function $\Phi : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ via
$$ \Phi(x, y) = \varphi(\|x - y\|_2) \quad \text{for all } x, y \in \mathbb{R}^d, $$
then $\varphi$ is called a radial basis function (RBF) and $\Phi$ is called the associated kernel.
Definition: The support of a function $u$ defined on $\Omega \subset \mathbb{R}^d$ is defined as
$$ \operatorname{supp} u := \{x \in \Omega : u(x) \neq 0\}. $$

7 RBF interpolation
General approximation:
- a set of points $X = \{x_i\}_{i=1}^{N}$ is given
- function values $f_i = f(x_i)$ are given ($f$ unknown)
- search for an approximating function $s$ with $s|_X = f|_X$.
In the context of RBF interpolation we seek an interpolant of the form
$$ s(x) = \sum_{i=1}^{N} \lambda_i \, \varphi(\|x_i - x\|) + p(x), \qquad \lambda_i \in \mathbb{R},\ p \in P_M. \tag{1} $$
The polynomial term $p$ is required for the existence and uniqueness of a solution.

8 RBF Setup: System of equations
Requiring the interpolation condition $s|_X = f|_X$ in all given points and the unisolvency of the set $X$ for $P_M^d$ (thus $p|_X = 0 \Rightarrow p \equiv 0$), and demanding a side condition on the coefficients of the polynomial term, leads to a system of linear equations for the determination of the coefficients $\lambda$ and $\pi$:
$$ \sum_{i=1}^{N} \lambda_i \, \varphi(\|x_i - x_k\|) + \sum_{j=1}^{M} \pi_j \, p_j(x_k) = f(x_k), \qquad 1 \le k \le N, $$
$$ \sum_{i=1}^{N} \lambda_i \, p_l(x_i) = 0, \qquad 1 \le l \le M, \tag{2} $$
or, in short notation,
$$ \begin{pmatrix} \Phi & \Pi \\ \Pi^T & 0 \end{pmatrix} \begin{pmatrix} \lambda \\ \pi \end{pmatrix} = \begin{pmatrix} f \\ 0 \end{pmatrix}. \tag{3} $$
Solving (3) provides all information to evaluate the RBF approximant $s(x)$.

9 Solving the RBF system of equations
RBF: $\varphi(r) = \sqrt{r^2 + c^2}$ (multiquadric biharmonics) and constant polynomial terms.
The system, with $i = 1, \dots, N$ and $j = 1, \dots, M$,
$$ \begin{pmatrix} \Phi & \Pi \\ \Pi^T & 0 \end{pmatrix} \begin{pmatrix} \lambda \\ \pi \end{pmatrix} = \begin{pmatrix} f \\ 0 \end{pmatrix}, \qquad \Phi_{ij} = \Phi(x_i, x_j), \quad \Pi_{ij} = p_j(x_i), $$
is solved via the FGP algorithm, a special Krylov subspace algorithm for our RBF [Faul/Goodsell/Powell 05]:
- No matrix is stored; the matrix-vector operation is implemented directly.
- Brute force: direct implementation of $\Phi \cdot \lambda$, $O(N^2)$.
- Multipole: an approximation $\tilde{\Phi}$ of $\Phi$ is used, $O(N \log N)$.
- The preconditioning in FGP, appropriate for our RBF with constant polynomial, approximates $\Phi^{-1}$ by 51 entries per row. An octree is used for the neighborhood relations: $O(N \log N)$.
- Fewer than 10 iterations are needed to solve the system.

10 Moving mesh: RBF Evaluation

11 Evaluating the sums
$\lambda$ has been calculated in the setup.
The evaluation of the sums
$$ s(x) = \sum_{i=1}^{N} \lambda_i \, \varphi(\|x_i - x\|) = \sum_{i=1}^{N} \lambda_i \, \varphi_i(x) $$
with kernel $\varphi(r) = \sqrt{r^2 + c^2}$ uses a series expansion (Laurent) for the far field (octree distance 2):
$$ \sum_{i=1}^{N} \lambda_i \, \varphi_i(x) = \sum_{j=1}^{\#\,\text{evaluation boxes}} \ \sum_{l=0}^{p+1} \frac{G_l^j(x)}{\|x\|^{2l-1}} + \sum_{i \,\text{in near field}} \lambda_i \, \varphi_i(x). $$
The Laurent series coefficients $G_l(x)$ are precomputed.
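The odd powers $\|x\|^{2l-1}$ in the far-field sum can be motivated by the simplest case, a single source at the origin. For $r > c$ the binomial series gives $\sqrt{r^2 + c^2} = \sum_{l \ge 0} \binom{1/2}{l} \, c^{2l} / r^{2l-1}$. The sketch below truncates this series at $l = p + 1$. It covers only this one-source case and does not reproduce the box coefficients $G_l^j$ of the talk; the function name is hypothetical.

```c
#include <assert.h>
#include <math.h>

/* Truncated far-field series of phi(r) = sqrt(r^2 + c^2) for r > c:
   sum_{l=0}^{p+1} binom(1/2, l) * c^(2l) / r^(2l-1).
   Single source at the origin (assumption of this sketch). */
double multiquadric_far_field(double r, double c, int p)
{
    assert(r > c && c >= 0.0 && p >= 0);
    const double t = (c * c) / (r * r);   /* expansion variable, t < 1 */
    double coeff = 1.0;                   /* binom(1/2, 0) */
    double power = 1.0;                   /* t^0 */
    double sum = 0.0;
    for (int l = 0; l <= p + 1; ++l) {
        sum += coeff * power;
        coeff *= (0.5 - l) / (l + 1.0);   /* binom(1/2, l+1) from binom(1/2, l) */
        power *= t;
    }
    return r * sum;                       /* r * t^l = c^(2l) / r^(2l-1) */
}
```

The truncation error shrinks like $(c/r)^{2(p+2)}$, so the expansion is only accurate when the evaluation point is well separated from the source. This matches the slide's split into a far field handled by series and a near field summed directly.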

12 Moving mesh: Numerical results

13 Test examples [D. Martin, AVL]
- 3D geometry of a combustion chamber (AVL List GmbH) with boundary deformation, and spheres with various numbers of boundary nodes.
- Boundary nodes: 603,..., equivalent to approx. 27 million nodes overall (similar to heart).
- Heart geometry already in the Fire™ pipeline [© D. Martin, AVL].

14 Test I
- The setup of the multipole method is measured (multipole setup).
- Solving the system contains the application of one matrix-vector product (time $t_m$) per iteration ⇒ it dominates the iterations in the FGP algorithm.
- Overall setup approx. 60 $t_m$; interpolation 6 $t_m$.
- The sequential (+shm) CPU code was already accelerated by a factor of 2-4.
Hardware: Intel Core i7 2600K with 4+4HT cores; Nvidia GTX 680.
Table: columns CPU [s], GPU [s], Speedup; rows multipole setup, M·v brute force, M·v multipole, each for SP and DP.
Figure: 89,702 boundary nodes

15 Test II
Hardware (mephisto): 2 Intel Xeon X5650 with 12+12HT cores; Nvidia Tesla C2070.
Table: columns CPU [s], GPU [s], Speedup, Sp(1/12 cores); rows multipole setup, M·v brute force, M·v multipole, each for SP and DP.
Figure: 89,702 boundary nodes
GPU mesh moving/smoothing in consumer release, AMG on GPU in internal release [Max Emans, Manfred Liebmann]

16 Example: ellipsoid [July 6, 2014]
Hardware (mephisto): 2 Intel Xeon E with 2 × 8 cores; Nvidia K20m; Xeon Phi 60 cores.
Compilers: gcc 4.4.7; nvcc V6.0.1; pgc++ v14.6.
Table: columns CPU-1, CPU-16, MIC, CUDA, OpenACC; rows multipole setup, M·v brute force, M·v multipole, each for SP and DP.
Figure: DP: 128,000 boundary nodes; SP: 32,000 boundary nodes; timings in seconds.
MIC: unmodified OpenMP code in native mode.

17 Many-core RBF: Goods and Odds
- OpenMP: times faster on 16 CPU cores.
- CUDA: up to 15 times faster than 16 CPU cores.
- MIC: very easy in native mode; up to 4 times faster than 16 CPU cores for simple structured code. MIC needs more tuning.
- OpenACC: 7 times slower than CUDA, even for the brute-force algorithm. It is hard to convince the compiler to use vector (thread) instead of gang (block) parallelism.

18 More results: Some nice speedups

19 Results wrt. GPU acceleration I: Seminar work
- Andreas Windisch ['12] (QCD, Fortran, CUFFT): 95×. Analytic Structure of Scalar Glueball Operators
- Markus Hopfer ['12] (Fortran, CUDA, MPI): 55×. The Ghost-Gluon System of Yang-Mills Theory
- Michael Reisecker: Parallel computing in the Potts model
- Mario Schröck: Gauge fixing (maximization problem) in Quantum Electrodynamics
- Ydalia del Pilar Delgado: Random walk generation in Lattice QCD
- Martin Holler: (Matlab 3400) C++ 54. Total Variation based JPEG decompression model

20 Results wrt. GPU acceleration II: Project work
- Andreas Kucher: 65×. GPU accelerated optimization in a pill identification problem
- Kristian Bredies: 50×. TGV minimization for MRI
- Manfred Liebmann/Aurel Neic: 10×. AMG solver for unstructured sparse systems
- Manfred Liebmann: 70×. Mixed gas flow (Euler equations; explicit)

21 Conclusions: Accelerator programming

22 Accelerator programming
Available accelerators:
- NVIDIA GPUs: Tesla K cores; 1.43 TFLOPS-DP (1.43); 288 GB/s
- AMD GPU: FirePro S cores; 1.48 TFLOPS-DP; 480 GB/s
- Intel: Xeon Phi SE10P (new: 7120X); 1.07 TFLOPS-DP (3); 61 cores (x86); 352 GB/s (500 GB/s)
- Intel: Xeon E7-8870; 10 cores; 96 GFLOPS; 43 GB/s
Which language should be used for programming accelerators?
- CUDA: hardware specific but free compilers; additional code
- OpenMP 4.0 pragma: new standard (?); support by the next gcc compiler
- MIC pragma: special for Intel; no free compiler yet (→ OpenMP 4.0)
- Will OpenACC converge to OpenMP 4.0?
- hUMA (heterogeneous UMA): one address space

23 Scalar product - OpenACC

    #ifdef _OPENACC
    #include <accel.h>   // OpenACC
    #endif

    double scalar(const unsigned int N, const double *const x, const double *const y)
    {
        double sum = 0.0;
        unsigned int i;
    #pragma omp parallel for private(i) shared(x, y) schedule(static) reduction(+:sum)
    #pragma acc kernels loop present(x[0:N], y[0:N]) independent reduction(+:sum)
        for (i = 0; i < N; ++i) {
            sum += x[i] * y[i];
        }
        return sum;
    }

    int main(int argc, char **argv)
    {
        // data allocation and initialization on CPU
    #pragma acc data copy(x[0:N], y[0:N])   // copy data from CPU to GPU
        {   // the braces are important: they delimit the data region
            for (i = 0; i < 10000; ++i) {
                sk = scalar(N, x, y);
            }
        }
    }

24 Matrix assembling - OpenACC [June 2014]

    #pragma acc data copyin(elem_color[0:nelems]) ... copyout(val[0:nnz])
    {
        double ske[3][3], fe[3];                        // !! private but outside of loop !!
        for (int k = 0; k < nnz; ++k) val[k] = 0.0;     // set values in matrix to 0
        for (int k = 0; k < nrows; ++k) rhs[k] = 0.0;

        for (int c = 0; c < ncolors; ++c)               // loop on all colors
        {
            const int c0 = color_idx[c], c1 = color_idx[c + 1];
    #pragma omp parallel for default(none) shared(elem_color, conn, offset, xc, rhs, val, id) private(ske, fe)
    #pragma acc kernels loop \
        pcopyin(elem_color[0:nelems], conn[0:nelems*3], offset[0:nelems*3*3], xc[0:nnodes*2], id[0:nrows+1]) \
        pcopy(rhs[0:nrows], val[0:nnz]) private(ske, fe) independent
            for (int jc = c0; jc < c1; ++jc)            // all elements of this color
            {
                const int je = elem_color[jc];          // one element of that color
                CalcElem_new_inline(conn + 3*je, xc, ske, fe);   // !! pragma acc routine seq !! no atomic !!
                for (int i = 0; i < nsize; ++i)
                {
                    const int ig = conn[3*je + i];      // global row index in CRS matrix
                    const int irow = id[ig];            // start of that row in the value vector
                    for (int j = 0; j < nsize; ++j)
                    {
                        val[irow + offset[3*3*je + 3*i + j]] += ske[i][j];
                    }
                    rhs[ig] += fe[i];
                }
            }
        }   // colors
    }
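The assembly loop needs no atomic updates because all elements of one color can be processed concurrently, which presupposes that no two elements of the same color share a node. One common way to obtain such a coloring is a greedy pass over the elements. The sketch below assumes triangles (3 nodes per element, as in the `conn` layout above) and at most 64 colors. The function name and the bitmask bookkeeping are choices made for this illustration; the talk does not say how its coloring was computed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Greedy element coloring for a triangle mesh.
   conn[3*e + i] is the i-th node of element e. On return, color[e] is the
   color of element e, and elements that share a node get different colors.
   Returns the number of colors used, or -1 on allocation failure or if more
   than 64 colors would be needed (limit of this sketch). */
int color_elements_greedy(const int *conn, int nelems, int nnodes, int *color)
{
    assert(conn != NULL && color != NULL && nelems >= 0 && nnodes > 0);
    /* used[n]: bit c is set if some element touching node n has color c */
    unsigned long long *used = calloc((size_t)nnodes, sizeof *used);
    if (used == NULL) return -1;
    int ncolors = 0;
    for (int e = 0; e < nelems; ++e) {
        unsigned long long taken = 0ULL;
        for (int i = 0; i < 3; ++i) taken |= used[conn[3 * e + i]];
        int c = 0;                                   /* smallest free color */
        while (c < 64 && ((taken >> c) & 1ULL)) ++c;
        if (c == 64) { free(used); return -1; }
        color[e] = c;
        for (int i = 0; i < 3; ++i) used[conn[3 * e + i]] |= 1ULL << c;
        if (c + 1 > ncolors) ncolors = c + 1;
    }
    free(used);
    return ncolors;
}
```

Sorting the element indices by color then yields arrays with the roles of `elem_color` and `color_idx` in the loop above: within one color, every row of `val` and every entry of `rhs` is written by at most one element.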

25 OpenACC and C++
OpenACC in PGI [June 2014]:
1. support for C++
2. simple classes can be transferred to the GPU
3. problems with virtual methods
4. inlining of functions is now supported (gang vs. vector vs. seq)
5. mixing of CUDA and OpenACC is possible via deviceptr
6. good performance for simple problems, but quite tricky for advanced problems (private data, forcing vector parallelization)

26 How will we continue with accelerator programming?
- OpenMP 4.0 will be the new standard.
- OpenACC will probably merge into OpenMP 4.0.
- Intel icpc 14.0 and gcc-4.9 support OpenMP 4.0.
- Hope: one code that runs on several devices.
- Less flexible than CUDA regarding data transfer. C++ data are a problem.
- The success depends on the available frontends of the compilers.
Hint: Start with OpenMP (-openmp), then run on MIC (-mmic) and OpenACC (-acc -ta=nvidia).
Supported by the FWF project F32-N18 and by NAWI Graz.

27 Thank you!
Figure: Jan 3, 2014; near Graz
