Journal of Computational Information Systems 11: 24 (2015) 7805-7812
Available at http://www.jofcis.com

A CUDA Solver for Helmholtz Equation

Mingming REN 1,2,*, Xiaoguang LIU 1,2, Gang WANG 1,2
1 College of Computer and Control Engineering, Nankai University, Tianjin 300071, China
2 College of Software, Nankai University, Tianjin 300071, China
* Corresponding author. Email address: renmingming@nbjl.nankai.edu.cn (Mingming REN).
ISSN 1553-9105 / Copyright 2015 Binary Information Press. DOI: 10.12733/jcis15C0261. December 15, 2015.

Abstract

In this paper, we present a CUDA based solver for the Helmholtz equation Δu + λu = f on a three dimensional rectangular domain. The solver follows the algorithm of FISHPACK, a Fortran package for solving elliptic equations. It uses the DFT and its variants, the DST and DCT, to compute the numerical solution efficiently, and it can handle five different kinds of boundary conditions in each coordinate direction. Experiments show that it achieves a speedup of up to around 30 compared with the corresponding solver in FISHPACK. The source code can be downloaded from https://github.com/rmingming/cudahelmholtz/.

Keywords: CUDA; Helmholtz Equation; DFT; DST; DCT; FISHPACK

1 Introduction

The Helmholtz equation Δu + λu = f is a special kind of partial differential equation (PDE) that often arises in the study of physical problems; the Laplace and Poisson equations are special cases of it. Finding a numerical solution to a single Helmholtz equation may not take much time, but solving it as fast as possible can be essential in some numerical simulations, for example when such a solve is required at every time step of a larger computation.

Usually, when solving a PDE, we first discretize it with a finite difference or finite element method to obtain a very large and often sparse linear system; solving this linear system gives the numerical solution of the original equation [1]. Because the Helmholtz equation has a special structure, however, it can also be solved by Fourier based methods, which are usually much faster. FISHPACK [2] is a well-known and highly efficient Fortran software package for solving elliptic equations, including the Helmholtz equation. Its solver for the three dimensional Helmholtz equation is called hw3crt. The algorithm is based on Fourier methods: it uses the Discrete Fourier Transform (DFT) and its variants, the Discrete Sine/Cosine Transform (DST/DCT) [3, 4], to deal with the different boundary conditions. The most important feature of hw3crt is that it can handle several different kinds of boundary conditions in each coordinate direction.

In recent years, researchers and engineers have increasingly used Graphics Processing Units (GPUs) to accelerate their codes. To address the computational challenges in many disciplines, NVIDIA introduced the Compute Unified Device Architecture (CUDA) [5], which is now supported on most NVIDIA graphics devices. CUDA is a minimal extension of the C language that lets software developers exploit the GPU's horsepower easily, and it is currently the most popular way of using GPUs for general-purpose computing. Using GPUs to solve PDEs can drastically reduce the execution time. Since CUDA provides the Fast Fourier Transform (FFT) library CUFFT, some special PDEs such as the Poisson equation can be solved with it [6, 7]. On the other hand, CUFFT does not contain any function to compute the DST or DCT, so PDEs with boundary conditions other than the periodic one are not easy to solve with Fourier based methods on GPUs.

In this paper, we present a CUDA based Helmholtz equation solver. It is a Fourier based solver and can deal with five different boundary conditions that are common in practice. Our experiments show that the solver is sufficiently accurate and achieves a speedup of up to around 30 compared with hw3crt. Since the code runs entirely on the GPU, it can be a good replacement for hw3crt, especially when the user's code already runs on the GPU.

2 Algorithm and Implementation

2.1 Problem

The three dimensional Helmholtz equation in Cartesian coordinates is

    Δu + λu = f,

where u is the unknown function, λ is a constant, f is a known function, and Δ = ∂²/∂x² + ∂²/∂y² + ∂²/∂z². The functions u and f are defined on a rectangular domain x_s ≤ x ≤ x_f, y_s ≤ y ≤ y_f, z_s ≤ z ≤ z_f.

To solve the Helmholtz equation numerically, we divide the x, y, z directions into N_x, N_y, N_z panels respectively and compute the value of u at the grid points. Usually we enlarge N_x, N_y, N_z to obtain a more accurate solution. In this paper we let N_x = N_y = N_z = N just for simplicity, but they can be different in our code. We also let N be a power of 2 in this paper (this is in order to compute the order of the error), but it can be any positive integer in our code.

2.2 Boundary condition

Boundary conditions are essential in solving a partial differential equation. Three kinds of boundary conditions are often used in numerical simulations: the periodic, the Dirichlet and the Neumann boundary condition. As an illustration, we describe them in the one dimensional case. The periodic boundary condition means that u(x + T) = u(x), where T := x_f - x_s; we follow the convention and call it a boundary condition, although it is not a true boundary condition. The Dirichlet boundary condition means that the values of u at the boundary are given. The Neumann boundary condition means that the values of the derivative u_x = ∂u/∂x at the boundary are given. Since each coordinate direction has two boundaries, there are five different boundary cases per coordinate, as shown in Fig. 1.

Fig. 1: Five different boundary cases
    Case 0: periodic, u(x + T) = u(x) with T := x_f - x_s
    Case 1: u(x_s) and u(x_f) are given (Dirichlet-Dirichlet)
    Case 2: u(x_s) and u_x(x_f) are given (Dirichlet-Neumann)
    Case 3: u_x(x_s) and u_x(x_f) are given (Neumann-Neumann)
    Case 4: u_x(x_s) and u(x_f) are given (Neumann-Dirichlet)

When using Fourier methods to solve PDEs, different boundary conditions require different transforms. For the periodic case we can use the DFT directly, but for the other four cases we need one of the variants of the DFT, i.e., the DST or DCT. Before the DST or DCT is applied, non-zero Dirichlet and Neumann boundary values are processed in our code; after that step the values of u or u_x at the boundary are all zero. In the following implementation of the DST and DCT we therefore always assume zero Dirichlet and Neumann boundary conditions.

2.3 Implementation of DST and DCT

The CUDA library CUFFT only provides FFT functions, so the DST or DCT cannot be computed with CUFFT directly. However, if we extend the original data array in a specific way, the DST/DCT of the original array can be obtained from the DFT (FFT) of the extended array. The extension method depends on the boundary conditions, and its purpose is to extend the data array to a whole period; Fig. 2 illustrates the extensions for N = 4. For case 1, i.e., the Dirichlet boundary condition on both sides, we double the size of the data array and extend it through u(x + T) = -u(T - x). For case 3, i.e., the Neumann boundary condition on both sides, we also double the size of the data array, but extend it through u(x + T) = u(T - x). For the other two cases we quadruple the size of the data array and extend it in a combination of symmetric and skew-symmetric ways.

Fig. 2: Extending the original data array. (a) Case 1, left: Dirichlet, right: Dirichlet. (b) Case 3, left: Neumann, right: Neumann. (c) Case 2, left: Dirichlet, right: Neumann. (d) Case 4, left: Neumann, right: Dirichlet.

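The paper does not reproduce its kernels, so the listing below is only a minimal sketch of how the case-1 (odd) extension of Section 2.3 might look as a CUDA kernel working along one coordinate direction. The kernel name, the line-major data layout and the assumption that each line stores N + 1 samples with homogenized (zero) boundary values are our own illustrative choices, not details taken from the released code.

Listing 1 (illustrative sketch): odd extension of each 1-D line for the case-1 (Dirichlet-Dirichlet) DST

    #include <cuda_runtime.h>

    // in : numLines lines of N+1 samples u[0..N] with u[0] = u[N] = 0
    // out: numLines lines of 2*N samples, the odd (period-2T) extension,
    //      ready for a length-2N FFT along each line
    __global__ void odd_extend_lines(const double* __restrict__ in,
                                     double* __restrict__ out,
                                     int N, int numLines)
    {
        long long total = (long long)numLines * (2 * N);
        for (long long t = blockIdx.x * (long long)blockDim.x + threadIdx.x;
             t < total;
             t += (long long)gridDim.x * blockDim.x)
        {
            int line = (int)(t / (2 * N));
            int j    = (int)(t % (2 * N));
            const double* u  = in  + (long long)line * (N + 1);
            double*       ue = out + (long long)line * (2 * N);
            if (j <= N)
                ue[j] = u[j];            // copy u(x); includes ue[N] = u(x_f) = 0
            else
                ue[j] = -u[2 * N - j];   // odd reflection: ue(N + s) = -u(N - s)
        }
    }

A length-2N CUFFT transform of each extended line then contains the DST of the original samples (up to a constant factor) in the imaginary parts of its coefficients; changing the minus sign in the reflection branch to a plus gives the even extension used for the case-3 (Neumann) DCT.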
2.4 Sketch of CUDA Implementation

Algorithm 1 shows the sketch of our code. A few constant values are needed by every GPU kernel, so we wrap them into a single struct and transfer it to GPU constant memory. In this way the kernels do not occupy too many registers, so more threads can be launched on each CUDA multiprocessor to hide the global memory access latency as much as possible. For every GPU kernel we also try to maximize the memory throughput by adjusting the memory access pattern. These are all common optimization strategies for CUDA kernels, so we omit the details here.

Algorithm 1: CUDA based Helmholtz solver

    function solver_3d_gpu( )
        Prepare a const struct (wrapping the constant values) in CPU memory
        Transfer the const struct to GPU constant memory by calling cudaMemcpyToSymbol( )
        for all coordinate directions do
            Launch a GPU kernel to deal with the non-zero Dirichlet and Neumann boundary conditions
        end for
        Allocate enough GPU memory to store the extended data array
        for all coordinate directions do
            Launch a GPU kernel to extend the original data array
        end for
        Call CUFFT to execute an FFT on the extended data array
        Divide each element of the extended array by a value generated from λ and its position in the
            array (this is the essential step in solving the Helmholtz equation)
        Call CUFFT to execute an inverse FFT on the extended data array
        Launch a GPU kernel to store the results back in the original array
    end function

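Algorithm 1 leaves the "divide by a value generated from λ and its position" step abstract. The sketch below shows how that step is commonly realized for the all-periodic case with a second-order central-difference discretization, together with the constant-memory struct and the CUFFT calls mentioned in Algorithm 1. The names (SolverConsts, scale_spectrum, solve_periodic), the in-place Z2Z transform and the eigenvalue formula are our own illustrative assumptions, not details taken from the paper or the released code.

Listing 2 (illustrative sketch): periodic-case spectral division around the CUFFT calls

    #include <cufft.h>
    #include <cuda_runtime.h>
    #include <math.h>

    struct SolverConsts {                 // wrapped once and copied to constant memory
        int    Nx, Ny, Nz;                // panels per direction
        double hx, hy, hz;                // grid spacings
        double lambda;
    };
    __constant__ SolverConsts c;

    // For Fourier mode (i, j, k) of the second-order central-difference operator,
    // the divisor is  sigma = lambda - (4/hx^2) sin^2(pi*i/Nx)
    //                               - (4/hy^2) sin^2(pi*j/Ny) - (4/hz^2) sin^2(pi*k/Nz).
    __global__ void scale_spectrum(cufftDoubleComplex* spec)
    {
        const double PI = 3.141592653589793;
        long long total = (long long)c.Nx * c.Ny * c.Nz;
        for (long long t = blockIdx.x * (long long)blockDim.x + threadIdx.x;
             t < total; t += (long long)gridDim.x * blockDim.x)
        {
            int i = (int)(t % c.Nx);
            int j = (int)((t / c.Nx) % c.Ny);
            int k = (int)(t / ((long long)c.Nx * c.Ny));
            double sx = sin(PI * i / c.Nx);
            double sy = sin(PI * j / c.Ny);
            double sz = sin(PI * k / c.Nz);
            double sigma = c.lambda - 4.0 * (sx * sx / (c.hx * c.hx)
                                           + sy * sy / (c.hy * c.hy)
                                           + sz * sz / (c.hz * c.hz));
            // For the (0,0,0) mode sigma == lambda; the singular case lambda == 0
            // (a pure periodic Poisson problem) must be handled separately.
            spec[t].x /= sigma;
            spec[t].y /= sigma;
        }
    }

    void solve_periodic(cufftDoubleComplex* d_f, SolverConsts h)   // d_f: RHS already on the GPU
    {
        cudaMemcpyToSymbol(c, &h, sizeof(SolverConsts));

        cufftHandle plan;                 // x is the fastest-varying index, so Nz goes first
        cufftPlan3d(&plan, h.Nz, h.Ny, h.Nx, CUFFT_Z2Z);

        cufftExecZ2Z(plan, d_f, d_f, CUFFT_FORWARD);     // forward FFT
        scale_spectrum<<<256, 256>>>(d_f);               // the essential division step
        cufftExecZ2Z(plan, d_f, d_f, CUFFT_INVERSE);     // inverse FFT (unnormalized)
        // A final kernel (omitted) must divide by Nx*Ny*Nz and copy the solution
        // back into the original array, as in the last line of Algorithm 1.
        cufftDestroy(plan);
    }

For the non-periodic cases the same kind of division is applied to the spectrum of the extended array produced in Section 2.3, with the sine arguments taken relative to the extended lengths.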
3 Experiments and Analysis

3.1 Testing platform

The accuracy and performance tests of our CUDA based Helmholtz solver are conducted on a computer equipped with an NVIDIA GeForce GTX TITAN card. Table 1 shows the main configuration of the testing platform.

Table 1: Testing platform

    Operating system   CentOS
    CPU                1x Intel(R) i7-4770K, 4 cores, 3.5 GHz
    Host memory        32 GB
    GPU                1x NVIDIA(R) GeForce GTX TITAN, 2688 CUDA cores, 837 MHz
    GPU memory         6 GB

We compile the CPU side code with GCC using the -O2 flag. For the comparison with FISHPACK, we compile FISHPACK with gfortran using the -O2 -fdefault-real-8 flags, and we implement a C wrapper for the FISHPACK routine hw3crt. The source code can be downloaded from https://github.com/rmingming/cudahelmholtz/ (last accessed Jun 10, 2015).

3.2 Experiments and discussion

We have tested a variety of functions with different boundary conditions; in this paper we only report the tests for one function under three different boundary conditions. The test function is

    u = (1/3) sin(2πx) sin(2πy) sin(2πz)

on the rectangular domain 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, 0 ≤ z ≤ 1, with λ = -1, so the Helmholtz equation becomes

    Δu - u = -(1/3)(1 + 12π²) sin(2πx) sin(2πy) sin(2πz).

We test three different boundary cases. The first uses periodic boundary conditions in every coordinate direction. The second uses a Dirichlet boundary condition in the y direction and periodic boundary conditions in the x and z directions. The third uses a Neumann boundary condition in the x direction and periodic boundary conditions in the y and z directions. In the following we briefly call them Periodic, Dirichlet and Neumann respectively.

First we show that our code produces a sufficiently accurate solution. The discretization error is O(h²), where h is the grid panel width, for instance h := (x_f - x_s)/N_x in the x direction. For N = 32, 64, 128, 256, 512, Table 2 gives the maximum error over all grid points. It shows that the solution is accurate enough and that the order of the error is indeed O(h²). The order of the error is computed through the formula

    Error order(N) = log₂( Error(N) / Error(2N) ).

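To make the convergence check concrete, the short host-side snippet below (our own illustration, not part of the released solver) applies this formula to the Periodic GPU errors for N = 32 and N = 64 reported in Table 2.

Listing 3 (illustrative sketch): computing the observed order of accuracy

    #include <math.h>
    #include <stdio.h>

    /* Observed order of accuracy from the maximum errors at N and 2N. */
    static double error_order(double errN, double err2N)
    {
        return log2(errN / err2N);
    }

    int main(void)
    {
        /* Periodic (GPU) maximum errors for N = 32 and N = 64, see Table 2. */
        printf("order(32) = %.4f\n", error_order(0.00106398, 0.00026561));
        return 0;   /* prints about 2.0021; Table 2 reports 2.00205807,
                       computed from the unrounded errors */
    }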
We list the CPU results (obtained with FISHPACK) only for the Dirichlet case; the other two cases are basically the same, so we omit them. Compared with the results of FISHPACK we can observe a tiny difference, because we use a different method than FISHPACK to compute the DFT and its variants; the roundoff errors of the two methods differ and accumulate differently, but both give sufficiently accurate solutions.

Table 2: Maximum error and order of error

         Periodic (GPU)           Dirichlet (CPU)                Dirichlet (GPU)                Neumann (GPU)
    N    Error        Order       Error            Order         Error            Order         Error        Order
    32   0.00106398   2.00205807  0.0010639756230  2.00205807    0.0010639756230  2.00205807    0.00098352   1.99375369
    64   0.00026561   2.00051425  0.0002656147244  2.00051425    0.0002656147244  2.00051425    0.00024695   2.00048120
    128  0.00006638   2.00012854  0.0000663800155  2.00012854    0.0000663800155  2.00012854    0.00006172   2.00012031
    256  0.00001659   2.00003220  0.0000165935252  2.00003219    0.0000165935253  2.00003220    0.00001543   2.00003014
    512  0.00000415   -           0.0000041482891  -             0.0000041482888  -             0.00000386   -

Table 3 shows the running time for solving the test problem. Both on the CPU and on the GPU, the Periodic case is solved faster than the Dirichlet and Neumann cases; for example, at N = 512 the CPU times for the Periodic and Dirichlet cases are 12.9 and 20.4 seconds respectively. In our algorithm this is because computing a DST or DCT is more expensive than computing a DFT; hw3crt also needs more time for the Dirichlet and Neumann cases, although it adopts a different algorithm. It is clear that when N is large, our CUDA based Helmholtz solver saves a great deal of execution time.

Table 3: Execution time (in seconds)

         Periodic                Dirichlet               Neumann
    N    CPU         GPU         CPU         GPU         CPU         GPU
    32   0.001523    0.001060    0.001679    0.001055    0.001593    0.001019
    64   0.012632    0.002667    0.013869    0.002727    0.012892    0.002862
    128  0.108058    0.008140    0.119173    0.014381    0.121248    0.012493
    256  0.890860    0.050599    1.002024    0.105117    1.321821    0.093812
    512  12.936950   0.392494    20.426926   0.717741    21.320410   0.750475

Fig. 3 shows the speedup of our code over hw3crt. For N = 512 we achieve a speedup of about 30. For N = 32 and N = 64 the speedup is basically the same for all three cases: when N is small, some kernels cannot use all the GPU cores and the extra overhead is relatively high, so the total execution time is dominated by this overhead.

Fig. 3: Speedup

4 Conclusion and Future Work

We have implemented a CUDA based Helmholtz equation solver in three dimensional space and achieved a speedup of about 30 compared with hw3crt from the FISHPACK software package. Our algorithm can deal with five different boundary conditions that are common in practice. This is achieved by transforming the computation of the DST/DCT into a DFT (FFT), since CUDA only provides an FFT library.

Some improvements can be made in the next step. First, the CUDA code can be further optimized; the present code only applies some obvious optimization techniques that are common in CUDA programming.

Second, the algorithm for the DST/DCT can be replaced. In our code we compute the DST/DCT by extending the original data array and then computing the DFT of the extended array. The drawback of this strategy is that we need to compute the FFT of a much larger data array, which takes more GPU time and occupies more device memory. There are other ways to compute the DST/DCT that do not need to extend the data array, but these methods usually require pre- and post-processing steps on the data array. We will evaluate these strategies and implement them in the future.

Acknowledgement

This work was partially supported by the NSF of China (grant numbers 61373018 and 11301288), the Program for New Century Excellent Talents in University (grant number NCET130301), and the Fundamental Research Funds for the Central Universities (grant number 65141021). This work was also partially supported by the NVIDIA GPU Education Center Program.

References

[1] W. F. Ames, Numerical Methods for Partial Differential Equations, 2nd ed., Academic Press, 1977.
[2] J. Adams, P. Swarztrauber and R. Sweet, Elliptic Problem Solvers, Academic Press, 1981, pp. 187-190.
[3] W. Press, S. Teukolsky, W. Vetterling and B. Flannery, Numerical Recipes: The Art of Scientific Computing, 3rd ed., Cambridge University Press, 2007.
[4] M. Frigo and S. Johnson, The Design and Implementation of FFTW3, Proceedings of the IEEE 93 (2), 2005, pp. 216-231.
[5] NVIDIA CUDA Toolkit Documentation. Available at http://docs.nvidia.com/cuda/ (last accessed Jun 10, 2015).
[6] J. Wu and J. JaJa, An Optimized FFT-based Direct Poisson Solver on CUDA GPUs, IEEE Transactions on Parallel and Distributed Systems 25 (3), 2014, pp. 550-559.

[7] J. Wu and J. JaJa, High Performance FFT Based Poisson Solver on a CPU-GPU Heterogeneous Platform, International Parallel and Distributed Processing Symposium (IPDPS), Boston, Massachusetts, May 2013.