COMPARISON OF CPU AND GPU IMPLEMENTATIONS OF THE LATTICE BOLTZMANN METHOD


XVIII International Conference on Water Resources, CMWR 2010, J. Carrera (Ed), CIMNE, Barcelona, 2010

COMPARISON OF CPU AND GPU IMPLEMENTATIONS OF THE LATTICE BOLTZMANN METHOD

James E. McClure, Jan F. Prins and Cass T. Miller

Department of Environmental Sciences and Engineering, CB 7431, 148 Rosenau Hall, School of Public Health, University of North Carolina, Chapel Hill, North Carolina 27599-7431

Department of Computer Science, University of North Carolina, Chapel Hill, North Carolina 27599-7431

Key words: Lattice Boltzmann Methods, Permeability Estimation, Graphics Processor, GPU, Multi-relaxation time, MRT

Summary. The lattice Boltzmann method (LBM) has become a standard tool for estimating porous medium permeabilities from image data and numerically generated packings. We consider implementations of the single-relaxation-time BGK scheme for single-phase flow as well as a more computationally intensive multi-relaxation-time (MRT) scheme. Results demonstrate that a considerable performance increase is achieved for both methods by implementing them on a graphics processing unit (GPU). The MRT scheme is shown to provide a more efficient means for permeability estimation on the GPU relative to the BGK approach: the increased accuracy of the MRT scheme allows accurate permeability measurements to be obtained at lower resolutions, more than offsetting its increased computational cost.

1 INTRODUCTION

Utilization of graphics processing units (GPUs) for high-performance computing can yield significant performance benefits for memory-bandwidth-limited applications [1, 2]. Many LBM schemes rely on local information only, a computational structure that is ideally suited to GPUs. Relative to CPU implementations, GPU implementations of the lattice Boltzmann method (LBM) often achieve performance increases of an order of magnitude.

Single-component LBM schemes solve the Navier-Stokes equations by exploiting the relationship between kinetic and continuum theory. Such schemes offer significant advantages over numerical Navier-Stokes solutions when massively parallel implementations are required or where complex boundary conditions must be accommodated. These requirements have resulted in a proliferation in the use of the LBM for porous medium applications, where large domains and complex solid boundary conditions are requisite. The LBM has become a standard tool for the estimation of porous medium permeabilities [3]. Multi-relaxation-time (MRT) schemes provide much more accurate permeability estimates than the less computationally intensive BGK scheme [4]. In this work, we compare the computational behavior of the MRT and BGK LBM on CPU and GPU. These results motivate the use of GPUs for permeability estimation in porous media. We consider the tradeoff between the increased accuracy of MRT and its increased computational cost relative to the BGK scheme.

2 METHODS

2.1 The Lattice Boltzmann Method

In this work, we consider two commonly used implementations of the LBM for modeling single-phase flow, both of which use a three-dimensional, nineteen-velocity (D3Q19) structure. The computational domain is a rectangular prism with sides of length N_x, N_y and N_z, discretized by regularly spaced lattice nodes x_i, where i = 0, 1, ..., N-1 and N = N_x N_y N_z. The D3Q19 velocity set is:

\xi_q = \begin{cases}
\{0,0,0\}^T & \text{for } q = 0, \\
\{\pm 1,0,0\}^T,\ \{0,\pm 1,0\}^T,\ \{0,0,\pm 1\}^T & \text{for } q = 1, 2, \dots, 6, \\
\{\pm 1,\pm 1,0\}^T,\ \{\pm 1,0,\pm 1\}^T,\ \{0,\pm 1,\pm 1\}^T & \text{for } q = 7, 8, \dots, 18.
\end{cases}   (1)

Each velocity vector ξ_q is associated with a distribution f_q. The fluid density and velocity are obtained from these distributions:

\rho = \sum_{q=0}^{Q-1} f_q,   (2)

u = \frac{1}{\rho} \sum_{q=0}^{Q-1} \xi_q f_q.   (3)

Since density and velocity can be obtained directly from the distributions, a numerical solution for f_q implies a solution for ρ and u.
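The D3Q19 velocity set of Eq. (1) and the moments of Eqs. (2)-(3) can be sketched in plain Python. This is an illustrative sketch, not the paper's CUDA code; all function names are ours:

```python
from itertools import product

def d3q19_velocities():
    """Build the 19 velocity vectors of Eq. (1), in q = 0..18 order."""
    xi = [(0, 0, 0)]                          # q = 0: rest vector
    for a in range(3):                        # q = 1..6: face neighbors
        for s in (1, -1):
            v = [0, 0, 0]
            v[a] = s
            xi.append(tuple(v))
    for a, b in ((0, 1), (0, 2), (1, 2)):     # q = 7..18: edge neighbors
        for sa, sb in product((1, -1), repeat=2):
            v = [0, 0, 0]
            v[a], v[b] = sa, sb
            xi.append(tuple(v))
    return xi

def moments(f, xi):
    """Eqs. (2)-(3): rho = sum_q f_q and u = (1/rho) sum_q xi_q f_q."""
    rho = sum(f)
    u = tuple(sum(v[d] * fq for v, fq in zip(xi, f)) / rho for d in range(3))
    return rho, u

XI = d3q19_velocities()
# A uniform set of distributions carries unit density and zero velocity,
# since the velocity set is symmetric:
rho, u = moments([1.0 / 19] * 19, XI)
```

Because every nonzero ξ_q appears together with its negative, momentum contributions cancel pairwise for a uniform distribution, which is a quick sanity check on the ordering.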

The solution for the LBM is provided by solving the lattice Boltzmann equation (LBE), which may be expressed generally in the form:

f_q(x_i + \xi_q, t+1) - f_q(x_i, t) = J_q,   (4)

where J_q is a collision operator that accounts for changes in f_q due to intermolecular collisions. Solution of Eq. (4) is usually accomplished in two steps, referred to as streaming:

f_q(x_i + \xi_q, t+1) = f_q(x_i, t),   (5)

and collision:

f_q(x_i, t+1) = f_q(x_i, t+1) + J_q.   (6)

Permeability estimates are obtained by assuming that the macroscopic flow obeys Darcy's law:

-\frac{\partial p}{\partial x} + \rho g = \frac{\rho \eta v}{\kappa},   (7)

where p is the pressure, ρg is the external force, η is the fluid viscosity, v is the mass-averaged velocity over the domain and κ is the permeability.

2.1.1 Single-Component BGK Model

The BGK model assumes that the distributions relax at a constant rate toward equilibrium values f_q^{eq} prescribed by the Maxwellian distribution. The collision term for the LBGK model is:

J_q = \frac{1}{\tau}\left(f_q^{eq} - f_q\right).   (8)

The relaxation rate is specified by the parameter τ, which relates to the fluid viscosity:

\eta = \frac{1}{3}\left(\tau - \frac{1}{2}\right).   (9)

The equilibrium distributions approximate the Maxwellian distribution. For the D3Q19 model, they take the form:

f_q^{eq} = w_q \rho \left[1 + 3(\xi_q \cdot u) + \frac{9}{2}(\xi_q \cdot u)^2 - \frac{3}{2}(u \cdot u)\right],   (10)

where w_0 = 1/3, w_{1,...,6} = 1/18 and w_{7,...,18} = 1/36. This choice of equilibrium conditions ensures that mass and momentum will be conserved.
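The BGK update of Eqs. (8)-(10), together with the viscosity relation (9) and the Darcy fit of Eq. (7), can be sketched in plain Python. This is a minimal single-node sketch with hypothetical names and illustrative values, not the paper's CUDA kernels:

```python
# D3Q19 velocities (Eq. 1) and weights w_q (Eq. 10).
XI = [(0, 0, 0),
      (1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1),
      (1, 1, 0), (1, -1, 0), (-1, 1, 0), (-1, -1, 0),
      (1, 0, 1), (1, 0, -1), (-1, 0, 1), (-1, 0, -1),
      (0, 1, 1), (0, 1, -1), (0, -1, 1), (0, -1, -1)]
W = [1.0 / 3] + [1.0 / 18] * 6 + [1.0 / 36] * 12

def f_eq(q, rho, u):
    """Equilibrium distribution of Eq. (10)."""
    cu = sum(XI[q][d] * u[d] for d in range(3))   # xi_q . u
    uu = sum(ud * ud for ud in u)                 # u . u
    return W[q] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * uu)

def bgk_collide(f, rho, u, tau):
    """Eq. (8): relax each f_q toward f_q^eq at rate 1/tau."""
    return [fq + (f_eq(q, rho, u) - fq) / tau for q, fq in enumerate(f)]

def viscosity(tau):
    """Eq. (9): eta = (tau - 1/2)/3."""
    return (tau - 0.5) / 3.0

def permeability(rho, tau, v, dpdx, g):
    """Darcy fit of Eq. (7): kappa = rho*eta*v / (-dp/dx + rho*g)."""
    return rho * viscosity(tau) * v / (-dpdx + rho * g)

# Collision leaves an equilibrium state unchanged, so mass and momentum
# are conserved exactly at this fixed point:
u0 = (0.05, 0.0, 0.0)
feq = [f_eq(q, 1.0, u0) for q in range(19)]
post = bgk_collide(feq, 1.0, u0, tau=0.8)

# Illustrative permeability for a body-force-driven flow (dp/dx = 0);
# the numbers are made up, not the paper's data:
kappa = permeability(rho=1.0, tau=1.0, v=2.5e-4, dpdx=0.0, g=1.0e-4)
```

In the paper's GPU implementation these per-node operations are fused with the streaming read so that each distribution is touched once per time step.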

2.1.2 Single-Component Multi-Relaxation Time Model

In the BGK approximation to the collision term given by Eq. (8), all non-conserved hydrodynamic modes relax toward equilibrium at the same rate. Multi-relaxation-time schemes are constructed so that different hydrodynamic modes may relax at different rates. In moment space, many physically significant hydrodynamic modes are associated with linear combinations of the distributions f_q:

\hat{f}_m = \sum_{q=0}^{Q-1} \alpha_{m,q} f_q,   (11)

where α_{m,q} is a set of constant coefficients associated with a particular mode m. Since we may specify Q independent linear combinations of f_q, coefficients must be defined for m = 0, 1, 2, ..., Q-1. The coefficients must be chosen carefully to ensure that the moments correspond to hydrodynamically significant physical modes. The coefficients α_{m,q} are obtained by applying a Gram-Schmidt orthogonalization to polynomials of the discrete velocities [5]. The resulting set of moments includes density, momentum and kinetic energy modes, as well as modes associated with elements of the stress tensor. The relaxation process is carried out in moment space, with each mode relaxing at its own rate λ_m:

J_q = \sum_{m=0}^{Q-1} \alpha^{-1}_{q,m} \lambda_m \left(\hat{f}^{eq}_m - \hat{f}_m\right).   (12)

The inverse transformation coefficients α^{-1}_{q,m} map the moments back to distribution space and represent the matrix inverse of the transformation coefficients α_{m,q}. The equilibrium moments \hat{f}^{eq}_m are functions of the local density ρ and momentum j. The relaxation parameters are chosen to minimize the viscosity dependence of the permeability:

\lambda_1 = \lambda_2 = \lambda_9 = \lambda_{10} = \lambda_{11} = \lambda_{12} = \lambda_{13} = \lambda_{14} = \lambda_{15} = \lambda_A = \frac{1}{\tau},   (13)

\lambda_4 = \lambda_6 = \lambda_8 = \lambda_{16} = \lambda_{17} = \lambda_{18} = \lambda_B = \frac{8(2 - \lambda_A)}{8 - \lambda_A}.   (14)

2.1.3 GPU Implementation of the LBM

Each GPU consists of a very large number of cores capable of handling computations. The GPU is utilized most efficiently when each core is able to execute computations independently of the other cores.
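The relaxation-rate choice of Eqs. (13)-(14) above can be checked numerically: it fixes the combination (1/λ_A − 1/2)(1/λ_B − 1/2) at 3/16, the value commonly associated with viscosity-independent bounce-back wall locations. That framing is ours; the paper states only that the choice minimizes the viscosity dependence of the permeability. A short sketch:

```python
def lambda_a(tau):
    """Eq. (13): the relaxation rate tied to the viscosity."""
    return 1.0 / tau

def lambda_b(la):
    """Eq. (14): the rate assigned to the remaining modes."""
    return 8.0 * (2.0 - la) / (8.0 - la)

def magic(la, lb):
    """The combination (1/la - 1/2)(1/lb - 1/2) controlled by Eq. (14)."""
    return (1.0 / la - 0.5) * (1.0 / lb - 0.5)

# Eq. (14) pins this combination at 3/16 for any tau:
checks = [magic(lambda_a(t), lambda_b(lambda_a(t))) for t in (0.6, 0.8, 1.0, 1.5)]
```

Algebraically, with a = λ_A, one finds 1/λ_B − 1/2 = 3a / (8(2 − a)) and 1/λ_A − 1/2 = (2 − a)/(2a), whose product is 3/16 independent of τ.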
CUDA is a programming model designed with

GPU applications in mind. Computations are divided among the GPU cores by decomposing the computational domain into a grid of thread blocks which may be executed in any order. This ensures code that scales in proportion to the number of cores, while imposing algorithmic constraints on the programmer. Unlike CPU applications, which can take advantage of specialized computations to achieve optimal performance, CUDA relies on computational kernels executed by each thread.

Figure 1: Schematic illustration of array structure and kernel execution in CUDA. [Figure: the distribution array Dist stores f_0, f_1, ..., f_18 in contiguous blocks of N entries; the collision kernel for node n reads entries n, n+s_1, ..., n+s_18 into registers and writes the post-collision values to a copy of the array.]

The structure of the kernel is illustrated by Fig. 1. The nineteen distributions are packed into a single array which is indexed such that Dist[Nq + i] = f_q(x_i, t). In order to fuse the streaming and collision operations into a single memory access, the two-array streaming (TAS) algorithm was used. Streaming is accomplished by reading registers from the array using a set of offsets s_q to pull distributions from the adjacent lattice nodes. The streaming offsets are defined by:

s_q = \xi_q \cdot \{1, N_x, N_x N_y\}^T.   (15)

The registers, denoted by f_0, f_1, ..., f_18 in Fig. 1, are used to carry out the collision step. For BGK, additional registers must be allocated to store the density ρ and velocity u. These moments are included in the nineteen additional registers required by MRT. At the implementation level, only two relaxation parameters, λ_A and λ_B,

are passed to the MRT kernel in CUDA. After the collision process has been applied, f_0, f_1, ..., f_18 are written to a copy of the distribution array, indexed in the same way.

3 RESULTS

Performance of the BGK and MRT LBM was evaluated for dense lattice sizes ranging from N = 16^3 to N = 192^3 on the GPU and from N = 8^3 to N = 128^3 on the CPU. Nvidia Quadro FX 5600 cards were used to perform the GPU calculations, whereas the CPU calculations were performed on a 3.0 GHz Intel Clovertown processor. A highly optimized swap implementation, shown to be far superior to TAS on the CPU [6], was used to execute the streaming step on the CPU.

Figure 2: GPU performance of the LBM for various thread counts for (a) BGK and (b) MRT. [Figure: MLUPS versus domain size N for nthreads = 16, 32, 64 and 128.]

Fig. 2 shows the performance of the BGK and MRT implementations as a function of domain size for a range of thread counts. Performance is evaluated in terms of million lattice updates per second (MLUPS). Performance of the LBM on the GPU depends on the number of threads executing at a time; optimal results are achieved with thread counts of 64 or 128. GPU performance is very sensitive to coalescence of the read and write operations performed for each thread block. For certain domain sizes, lack of coalescence can lead to a significant drop in the number of MLUPS observed. The number of MLUPS observed for BGK shows excellent agreement with other results published for the BGK LBM on the GPU [1]. Results for the MRT LBM demonstrate that the MRT approach is competitive with BGK in terms of MLUPS in spite of the increased computations.
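The MLUPS throughput metric, and the linear indexing of Eq. (15) on which memory coalescence depends, can be sketched in plain Python. The names are illustrative; the paper's implementation is a CUDA kernel:

```python
# D3Q19 velocities in the order of Eq. (1).
XI = [(0, 0, 0),
      (1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1),
      (1, 1, 0), (1, -1, 0), (-1, 1, 0), (-1, -1, 0),
      (1, 0, 1), (1, 0, -1), (-1, 0, 1), (-1, 0, -1),
      (0, 1, 1), (0, 1, -1), (0, -1, 1), (0, -1, -1)]

def node_index(x, y, z, nx, ny):
    """Linear index of lattice node (x, y, z): i = x + y*Nx + z*Nx*Ny."""
    return x + y * nx + z * nx * ny

def stream_offsets(nx, ny):
    """Eq. (15): s_q = xi_q . (1, Nx, Nx*Ny)^T."""
    return [v[0] + v[1] * nx + v[2] * nx * ny for v in XI]

def mlups(n_nodes, iterations, seconds):
    """Million lattice updates per second, the metric of Figs. 2-3."""
    return n_nodes * iterations / seconds / 1.0e6

nx = ny = nz = 8
s = stream_offsets(nx, ny)
# For an interior node i, Dist[N*q + i] shifted by s_q addresses the
# neighbor that exchanges f_q with node i during streaming:
i = node_index(4, 4, 4, nx, ny)
# Hypothetical throughput: a 128^3 lattice advanced 1000 steps in 12 s
rate = mlups(128 ** 3, 1000, 12.0)
```

Because s_q is constant for each q, consecutive threads read consecutive addresses within each velocity block, which is what makes the layout of Fig. 1 amenable to coalesced access.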
This is consistent with the observation that the LBM is primarily limited by memory bandwidth, such that more computationally intensive methods may be accommodated easily. Comparison of the GPU and CPU implementations verifies that GPU performance is an order of magnitude faster for lattice sizes N ≥ 32^3, as shown in Fig. 3(a) and

Figure 3: Performance of the MRT and BGK schemes on (a) GPU and (b) CPU. [Figure: MLUPS versus N for BGK and MRT; GPU curves use nthreads = 128.]

(b). On CPU, a significant performance dropoff is observed once the size of the distribution arrays exceeds the cache size. Due to the fixed overhead associated with kernel execution, the GPU is more advantageous for larger domain sizes, which favors porous medium applications. For a dense lattice, the GPU implementation delivers nearly fifty times more MLUPS than the CPU. This performance increase is attributed to the higher memory bandwidth of the GPU, as well as to the fact that the GPU calculations are performed in single precision (float) rather than double precision. Algorithm differences impose slightly different memory demands for the two schemes.

Figure 4: Dimensionless permeability estimates κ/D^2 obtained via the LBM for a range of relaxation parameters τ. The length scale D is the number of pixels per sphere radius. [Figure: κ/D^2 versus τ for BGK and MRT at N = 64^3, 80^3 and 100^3.]

The impact of method choice on permeability estimation was investigated using a random packing of 125 equally sized spheres in a cubic domain. The porosity of the packing was 0.37642. Permeability results were obtained via the LBM by discretizing the packing to obtain domains of size 64^3, 80^3 and 100^3. The results are shown in Fig. 4 for both BGK and MRT at each discretization level. The permeability estimates demonstrate that the computational cost of the MRT scheme is more than

compensated for by the increased accuracy of the scheme. The results demonstrate that larger domains must be considered to mitigate the strong viscosity dependence observed for the BGK approach. The cost associated with simulating these larger domains far exceeds the increased computational expense per iteration associated with MRT.

4 CONCLUSIONS

Implementations of the BGK and MRT lattice Boltzmann schemes demonstrate the advantages of the GPU: performance on the GPU is fifty times faster than on the CPU. Additional registers are required for the MRT scheme, but this does not inhibit its application on the GPU. As with the CPU implementation, execution of the streaming step is the primary performance bottleneck on the GPU. The viscosity dependence of BGK permeability estimates implies that much larger domain sizes must be considered to achieve reliable estimates. MRT offers significant benefits even for the current generation of GPUs, where register space is at a premium.

REFERENCES

[1] A. Kaufman, Z. Fan, K. Petkov, Implementing the lattice Boltzmann model on commodity graphics hardware, Journal of Statistical Mechanics: Theory and Experiment (2009), doi:10.1088/1742-5468/2009/06/P06016.

[2] Z. Fan, F. Qiu, A. E. Kaufman, Zippy: A framework for computation and visualization on a GPU cluster, Computer Graphics Forum 27 (2) (2008) 341-350.

[3] C. Pan, M. Hilpert, C. T. Miller, Pore-scale modeling of saturated permeabilities in random sphere packings, Physical Review E 64 (6) (2001).

[4] C. Pan, L.-S. Luo, C. T. Miller, An evaluation of lattice Boltzmann schemes for porous medium flow simulation, Computers & Fluids 35 (8-9) (2006) 898-909.

[5] D. d'Humières, I. Ginzburg, M. Krafczyk, P. Lallemand, L.-S. Luo, Multiple-relaxation-time lattice Boltzmann models in three dimensions, Philosophical Transactions of the Royal Society of London, Series A: Mathematical, Physical and Engineering Sciences 360 (2002) 437-451.

[6] K. Mattila, J. Hyväluoma, J. Timonen, T. Rossi, Comparison of implementations of the lattice Boltzmann method, Computers & Mathematics with Applications 55 (7) (2008) 1514-1524.