A CUDA-Based Implementation of a Fluid-Solid Interaction Solver: The Immersed Boundary Lattice-Boltzmann Lattice-Spring Method

Commun. Comput. Phys., Vol. 23, No. 4, pp. 980-1011, April 2018. doi: 10.4208/cicp.OA-2016-0251

Tai-Hsien Wu 1, Mohammadreza Khani 2, Lina Sawalha 2, James Springstead 1, John Kapenga 3 and Dewei Qi 1

1 Department of Chemical and Paper Engineering, Western Michigan University, Michigan 49009, USA.
2 Department of Electrical and Computer Engineering, Western Michigan University, Michigan 49009, USA.
3 Department of Computer Science, Western Michigan University, Michigan 49009, USA.

Received 15 December 2016; Accepted (in revised version) 22 March 2017

Abstract. The immersed boundary lattice-Boltzmann lattice-spring method (IBLLM) has previously been implemented to solve several systems involving deformable and moving solid bodies suspended in Navier-Stokes fluids, but these studies have generally been limited in scope by a lack of computing power. In this study, the IBLLM is implemented on a Graphics Processing Unit (GPU) in CUDA Fortran to solve a variety of systems, including a flexible beam, stretching of a red blood cell (RBC), and an ellipsoid under shear flow. A series of simulations is run to validate the implementation of the IBLLM and to analyze computing performance. Results demonstrate that, on an Intel Xeon E5645 fitted with an NVIDIA Tesla K40 graphics card, the GPU code runs up to more than 80 times faster than the CPU code on the same processor for a moderately sized system of solid and fluid particles. These studies represent the first report on using a single GPU device with CUDA Fortran in the implementation of the IBLLM solver. Combining a GPU with the versatile IBLLM technique will expand the range of complex fluid-solid interaction (FSI) problems that can be solved in a variety of fields.
AMS subject classifications: 68U20, 74S30, 76Z99, 92C99

Key words: Lattice Boltzmann method, lattice spring model, immersed boundary method, CUDA, red blood cell, fluid-solid interaction.

Corresponding author. Email addresses: tai-hsien.wu@wmich.edu (T.-H. Wu), khani@csu.fullerton.edu (M. Khani), lina.sawalha@wmich.edu (L. Sawalha), james.springstead@wmich.edu (J. Springstead), john.kapenga@wmich.edu (J. Kapenga), dewei.qi@wmich.edu (D. Qi)

http://www.global-sci.com/ © 2018 Global-Science Press

1 Introduction

Recent advances in numerical techniques have made it possible to solve a wide range of fluid-solid interaction (FSI) problems, including problems in engineering [1, 14] and biology [3-5]. Experiments involving FSI have proven extremely challenging, so computers have been used to simulate FSI and to predict the fluid dynamics and the movement of the solid under flow. Computer simulation has proven particularly useful in problems involving deformable solids, including biological materials and flexible fibers such as those found in the paper industry. These advantages have led to rapid growth in the use of simulation to study FSI over the past decade.

An FSI simulation model, also known as an FSI solver, consists of three main elements: a fluid, a solid, and the interaction between the solid and the fluid. The complexity of solving the Navier-Stokes equations directly has contributed to the popularity of more straightforward methods, including the lattice Boltzmann method (LBM), which solves the Boltzmann equation to simulate fluid behavior. It has been demonstrated that the LBM is equivalent to solving the Navier-Stokes equations for Mach numbers below 0.3 [6]. The LBM avoids solving nonlinear partial differential equations [6-14] and is considerably more efficient than classic computational fluid dynamics (CFD) approaches such as the finite difference method (FDM) and the finite element method (FEM).

The structure of the LBM also allows seamless integration of Graphics Processing Unit (GPU) parallel computing with the Compute Unified Device Architecture (CUDA), a GPU parallel computing platform developed by NVIDIA Corporation. Owing to its simplicity, CUDA has quickly gained attention and has been widely used in CFD, particularly for solutions using the LBM. In 2008, J. Tölke and M.
Krafczyk reported the use of CUDA on a desktop PC to implement the LBM and obtained TeraFLOP computing speed for the first time [15]. Later, J. Habich et al. presented optimization approaches for D3Q19 based on J. Tölke's strategy [16]. Many other authors have studied further implementation strategies of CUDA for LBM calculations [17-20]. For an FSI system with a low solid concentration, the computing load is dominated by the solution of the fluid element. As shown in these studies, LBM calculations have been greatly accelerated by improvements in computational power. However, when the solid concentration is very high, as in a simulation of red blood cells in blood flow where fluid-solid interfaces are prevalent, FSI solvers using these methods are much slower. As a result, the efficiency of the entire solver is severely limited, and more computational power is needed for the solid and interaction elements.

In recent publications, we have presented an FSI solver that combines the LBM, the lattice spring model (LSM) [21, 22], and the immersed boundary method (IBM) [23], called the immersed boundary lattice-Boltzmann lattice-spring method (IBLLM). The IBLLM is composed of the LBM for fluid dynamics, the LSM for solid movement, and the IBM for the

interaction between fluid and solids. The LSM is a convenient solid model that handles solid deformation easily, and it has been combined with the LBM to solve several FSI problems [22, 24-27]. The IBM is the most common interaction solver used for FSI problems [28-30]. In 2004, Feng and Michaelides reported how to couple the IBM with the LBM, presenting it as the IB-LBM [31], and several later studies have used the IB-LBM [4, 5, 27, 32-34]. These studies led up to our recent implementation of the IBLLM to model complex problems such as the swimming of microbes [24, 26] and the deformation of an elastic blood vessel [25]. However, the computing load in these problems is dominated by the solids and their interaction with the fluid, which slows the computation. Slow computation has limited research on complex systems, and in this paper we focus on accelerating the computation of the solid and of the fluid-solid interaction in systems with a high solid concentration. Specifically, we use a GPU with CUDA Fortran to accelerate computations involving the LSM for solids and the IBM for FSI.

GPUs and CUDA have already been used in studies involving the LSM and the IBM. Zhao and Khalili reported a GPU-based parallelization of the distinct lattice spring model for geomechanics simulation [35]. In contrast to their work, which used shear springs to mimic shear deformation, we use a three-body elastic force for shear deformation [24]. In this paper, we also implement the coarse-grained red blood cell (RBC) model presented by Fedosov et al. [36, 37] to simulate RBC stretching, accelerating the calculations with a GPU and CUDA. Additionally, CUDA has previously been used with the IBM by others [38-40], but their methods did not address the motion of deformable solids in fluids.
In this paper, we focus on accelerating calculations using the IBLLM solver for the motion of deformable solid particles in Navier-Stokes fluids by implementing it on a GPU with CUDA Fortran. Specifically, we implement the LBM for fluids, the LSM for solids, and the IBM for FSI on a single GPU to improve computational speed. The results demonstrate that the GPU code running on an Intel Xeon E5645 fitted with an NVIDIA Tesla K40 graphics card is over 80 times faster than the CPU code running on the same processor. Furthermore, although we did not conduct an extremely large-scale simulation, our results indicate that the speedup factor increases with simulation size, suggesting that this implementation will be even more useful for more intricate systems. This improvement in computational power will allow more complex FSI problems to be solved, particularly large-scale simulations involving deformable solids such as elastic fibers or biological cells.

We present the relevant numerical methods in Section 2 and their implementation with CUDA in Section 3. In Section 4 we apply the method to several systems, including a flexible beam model, stretching of a red blood cell (RBC), and an ellipsoid under shear flow. Finally, in Section 5 we highlight the key conclusions of this work.

2 Simulation methods

We have previously reported a fluid-solid interaction approach, the immersed boundary lattice-Boltzmann lattice-spring method (IBLLM) [24]. This approach is a combination of the lattice Boltzmann method for the fluid phase, the lattice spring model for the solid dynamics, and the immersed boundary method for fluid-solid interactions. The LSM used in our IBLLM is similar to a coarse-grained red blood cell (RBC) model, which is used in the calculations involving RBC stretching. This previously reported RBC model [36, 37] is also introduced in this section.

2.1 Lattice Boltzmann method

The multiple-relaxation-time (MRT) lattice Boltzmann (LB) method with D3Q19 is used in this paper. The MRT-LB evolution equation [6, 41] can be expressed as

$$\mathbf{f}(\mathbf{x}+\mathbf{e}\,\delta t,\, t+\delta t) - \mathbf{f}(\mathbf{x},t) = -\mathbf{M}^{-1}\mathbf{S}\left(\mathbf{m}(\mathbf{x},t) - \mathbf{m}^{eq}(\mathbf{x},t)\right), \tag{2.1}$$

where $\mathbf{f}(\mathbf{x},t)$ and $\mathbf{m}(\mathbf{x},t)$ are the fluid distribution functions at position $\mathbf{x}$ and time $t$ in the velocity and moment spaces, respectively; the superscript $eq$ denotes the equilibrium state; $\mathbf{e}$ is the discrete velocity set and $\delta t$ is the time interval; $\mathbf{M}$ is the transformation matrix, which maps the distribution functions from the velocity space into the moment space; and $\mathbf{S}$ is the diagonal collision matrix [41]. In the D3Q19 model, the discrete velocities $\mathbf{e}_i$ are given by

$$\mathbf{e}_i = \begin{cases} (0,0,0), & i = 0, \\ (\pm 1,0,0),\,(0,\pm 1,0),\,(0,0,\pm 1), & i = 1\text{-}6, \\ (\pm 1,\pm 1,0),\,(\pm 1,0,\pm 1),\,(0,\pm 1,\pm 1), & i = 7\text{-}18, \end{cases} \tag{2.2}$$

where $i \in \{0,1,2,\ldots,18\}$ are the discrete directions, and the distribution functions in the moment space $\mathbf{m}$ are

$$\mathbf{m} = (\rho, e, \epsilon, j_x, q_x, j_y, q_y, j_z, q_z, 3p_{xx}, 3\pi_{xx}, p_{ww}, \pi_{ww}, p_{xy}, p_{yz}, p_{zx}, t_x, t_y, t_z)^T, \tag{2.3}$$

where $\rho$, $e$ and $\epsilon$ represent density, energy and energy squared; $j_{x,y,z}$ are components of the momentum; $q_{x,y,z}$ are components of the heat flux; $p_{xy,yz,zx}$ are components of the symmetric and traceless strain-rate tensor; $\pi_{xx,ww}$ are fourth-order moments and $t_{x,y,z}$ are third-order moments [6, 41].
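As a small illustration of Eqs. (2.2) and (2.3), the following Python sketch enumerates a D3Q19 velocity set and builds the first few moment rows as polynomials in $\mathbf{e}_i$. It is a CPU-side sketch, not the CUDA Fortran code of this work: the function names are hypothetical, the ordering inside each velocity group is whatever `itertools.product` yields, and the polynomial definitions of the $e$ and $\epsilon$ rows are assumed to follow d'Humières et al. [41].

```python
from itertools import product


def d3q19_velocities():
    """Return 19 discrete velocities grouped as in Eq. (2.2).

    The order inside each group is an arbitrary choice of this sketch.
    """
    candidates = list(product((1, 0, -1), repeat=3))
    rest = [(0, 0, 0)]                                                # i = 0
    axis = [e for e in candidates if sum(c * c for c in e) == 1]      # i = 1-6
    diagonal = [e for e in candidates if sum(c * c for c in e) == 2]  # i = 7-18
    return rest + axis + diagonal


def moment_rows(velocities):
    """Rows of M for rho, e, epsilon and j_x (polynomials assumed from [41])."""
    sq = [sum(c * c for c in e) for e in velocities]  # |e_i|^2 = 0, 1 or 2
    return {
        "rho": [1 for _ in velocities],
        "e": [19 * s - 30 for s in sq],
        "eps": [(21 * s * s - 53 * s + 24) // 2 for s in sq],
        "jx": [e[0] for e in velocities],
    }


def dot(a, b):
    """Plain dot product of two equally long sequences."""
    return sum(x * y for x, y in zip(a, b))


E = d3q19_velocities()
ROWS = moment_rows(E)
```

With this construction the rows are mutually orthogonal, which is the property that makes the moment basis convenient for assigning separate relaxation rates.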
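Moreover, the corresponding equilibria $\mathbf{m}^{eq}$ are

$$\mathbf{m}^{eq} = \rho_f \left(1,\; -11+19u^2,\; \alpha+\beta u^2,\; u_x,\; -\tfrac{2}{3}u_x,\; u_y,\; -\tfrac{2}{3}u_y,\; u_z,\; -\tfrac{2}{3}u_z,\; 3u_x^2-u^2,\; \frac{\gamma p_{xx}^{eq}}{\rho},\; u_y^2-u_z^2,\; \frac{\gamma p_{ww}^{eq}}{\rho},\; u_x u_y,\; u_y u_z,\; u_z u_x,\; 0,\; 0,\; 0\right)^T, \tag{2.4}$$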

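where $\rho_f$ is the fluid density, and $u^2 = u_x^2+u_y^2+u_z^2$ denotes the fluid velocity squared; the parameters $\alpha=\gamma=0$ and $\beta=-\frac{475}{63}$ are chosen for optimized stability [41]. Furthermore, the transformation matrix $\mathbf{M}$ of D3Q19 is

$$\mathbf{M} = \left(\begin{array}{rrrrrrrrrrrrrrrrrrr}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
-30 & -11 & -11 & -11 & -11 & -11 & -11 & 8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 \\
12 & -4 & -4 & -4 & -4 & -4 & -4 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
0 & 1 & -1 & 0 & 0 & 0 & 0 & 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 & 0 & 0 & 0 & 0 \\
0 & -4 & 4 & 0 & 0 & 0 & 0 & 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & -1 & 0 & 0 & 1 & 1 & -1 & -1 & 0 & 0 & 0 & 0 & 1 & -1 & 1 & -1 \\
0 & 0 & 0 & -4 & 4 & 0 & 0 & 1 & 1 & -1 & -1 & 0 & 0 & 0 & 0 & 1 & -1 & 1 & -1 \\
0 & 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 & 1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 \\
0 & 0 & 0 & 0 & 0 & -4 & 4 & 0 & 0 & 0 & 0 & 1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 \\
0 & 2 & 2 & -1 & -1 & -1 & -1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & -2 & -2 & -2 & -2 \\
0 & -4 & -4 & 2 & 2 & 2 & 2 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & -2 & -2 & -2 & -2 \\
0 & 0 & 0 & 1 & 1 & -1 & -1 & 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -2 & -2 & 2 & 2 & 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & -1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & -1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & -1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & 1 & -1 & -1 & 1 & -1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & -1 & 1 & 1 & 0 & 0 & 0 & 0 & 1 & -1 & 1 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & -1 & -1 & -1 & -1 & 1 & 1
\end{array}\right), \tag{2.5}$$

and the set of relaxation rates $\mathbf{S}$ is

$$\mathbf{S} = \mathrm{diag}(0, s_e, s_\epsilon, 0, s_q, 0, s_q, 0, s_q, s_\nu, s_\pi, s_\nu, s_\pi, s_\nu, s_\nu, s_\nu, s_t, s_t, s_t), \tag{2.6}$$

while the shear and bulk viscosities are given by

$$\nu = \frac{1}{3}\left(\frac{1}{s_\nu} - \frac{1}{2}\right)\delta t \tag{2.7}$$

and

$$\zeta = \frac{2}{9}\left(\frac{1}{s_e} - \frac{1}{2}\right)\delta t. \tag{2.8}$$

While $s_\nu$ is related to the shear viscosity, the other relaxation rates can be set as constants as follows: $s_e = 1.19$, $s_\epsilon = s_\pi = 1.4$, $s_q = 1.2$ and $s_t = 1.98$. These values are optimized and reported by d'Humières et al. [41].

2.2 Lattice spring model

There are many types of lattice spring models (LSMs). An LSM rests on two basic principles: (1) particles arranged in a designed shape are used to generate a solid body, and (2) each particle is connected to its adjacent particles via spring bonds. Buxton et al. calculated bond energies between not only the two neighboring particles but also the next neighboring particles [21, 22].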
In their method, only one spring coefficient is used to control the deformation, and thus it cannot capture the bending deformation properly. Zhao et al. reported the distinct LSM in geomechanics to mimic

rock behaviors. They introduced shear springs (a second coefficient) to deal with shear deformation [42]. Recently, an LSM with three-body angular bond forces was reported by Wu et al., in which the angular bond is employed to capture bending deformation [24]. The LSM presented by Wu et al. is used in this work, and a brief introduction is provided as follows. A harmonic spring exists between two neighboring particles, and its potential energy $U^s$ is given by

$$U_i^s = \frac{1}{2} k_s \sum_j (r_{ij} - r_{0ij})^2, \tag{2.9}$$

where $k_s$ is the spring coefficient; $r_{0ij}$ is the equilibrium length of the spring between two neighboring particles $i$ and $j$; $j$ is the nearest neighboring solid particle of the $i$th solid particle; and $\mathbf{r}_{ij} = \mathbf{r}_i - \mathbf{r}_j$ is the bond vector, with length $r_{ij}$. Moreover, an angular bond exists between two adjacent springs, and the angular bond potential energy $U^a$ is

$$U_i^a = \frac{1}{2} k_a \sum_j \sum_{k,\,k \neq j} (\theta_{ijk} - \theta_{0ijk})^2, \tag{2.10}$$

where $k_a$ is the angular coefficient; $j$ and $k$ are the nearest neighboring solid particles of the $i$th solid particle; $\theta_{ijk}$ is the angle between the bonding vectors $\mathbf{r}_{ij}$ and $\mathbf{r}_{ik}$; and $\theta_{0ijk}$ is the corresponding equilibrium angle. In this article, we use cubic structures and set $r_{0ij} = h = 1$ and $\theta_{0ijk} = 90^\circ$ for solids initially at rest, where $h$ is the lattice length in the lattice Boltzmann method. Once both potential energies have been calculated for the $i$th particle, the elastic force $\mathbf{F}_i^e$ is evaluated as the negative gradient of the energies,

$$\mathbf{F}_i^e = -\nabla (U_i^s + U_i^a). \tag{2.11}$$

The gradient is calculated analytically, and the resulting expressions are used in the code. The total force acting on the $i$th particle is

$$\mathbf{F}_i^T = \mathbf{F}_i^e + \mathbf{F}_i^h, \tag{2.12}$$

where $\mathbf{F}_i^h$ is the fluid-solid interaction force, which only exists in the solid boundary domain. The fluid-solid interaction force will be discussed in Section 2.4.
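To make Eqs. (2.9)-(2.11) concrete, the following Python sketch evaluates the spring and angular energies for a single bond and a single angle, and compares the analytic spring force with a central-difference gradient. It is only an illustration under stated assumptions: the function names are hypothetical, one bond and one angle stand in for the sums over neighbors, and the finite-difference gradient is used here as a check, whereas this work uses analytic gradients throughout.

```python
import math


def spring_energy(ri, rj, ks, r0):
    """Harmonic bond energy, one term of Eq. (2.9)."""
    return 0.5 * ks * (math.dist(ri, rj) - r0) ** 2


def spring_force_on_i(ri, rj, ks, r0):
    """Analytic -grad_i of the harmonic bond energy."""
    r = math.dist(ri, rj)
    return tuple(-ks * (r - r0) * (a - b) / r for a, b in zip(ri, rj))


def bond_angle(ri, rj, rk):
    """Angle at particle i between the bonds i-j and i-k."""
    a = [x - y for x, y in zip(rj, ri)]
    b = [x - y for x, y in zip(rk, ri)]
    cos_angle = sum(p * q for p, q in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))
    return math.acos(max(-1.0, min(1.0, cos_angle)))


def angular_energy(ri, rj, rk, ka, theta0):
    """Angular bond energy, one term of Eq. (2.10)."""
    return 0.5 * ka * (bond_angle(ri, rj, rk) - theta0) ** 2


def numeric_force_on_i(energy_of_ri, ri, eps=1e-6):
    """Central-difference estimate of -grad_i U (a check, not the paper's method)."""
    force = []
    for d in range(3):
        plus, minus = list(ri), list(ri)
        plus[d] += eps
        minus[d] -= eps
        force.append(-(energy_of_ri(plus) - energy_of_ri(minus)) / (2.0 * eps))
    return tuple(force)
```

For example, `numeric_force_on_i(lambda p: spring_energy(p, rj, ks, r0), ri)` should agree with `spring_force_on_i(ri, rj, ks, r0)` up to the finite-difference error, and both energies vanish for a cubic rest configuration with $r_{0ij}=1$ and $\theta_{0ijk}=90^\circ$.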
The leap-frog algorithm is used to update the position and velocity of each solid particle during the simulation run. According to Wu et al. [24], if the structure is isotropic, the relationships between the macroscopic moduli and the microscopic coefficients are written as

$$E = \frac{k_s}{r_0}, \tag{2.13}$$

$$G \approx \frac{4k_a}{r_0^2}, \tag{2.14}$$

where $E$ is the Young's modulus and $G$ is the shear modulus.

2.3 Coarse-grained red blood cell model

A coarse-grained red blood cell (RBC) model has been presented by Fedosov et al. [36, 37]. Krüger also reported a similar RBC model [5, 43]. The membrane can be considered as a two-dimensional triangulated network during the meshing process. In this study, the RBC model is meshed with an open source MATLAB code by Persson et al. [44], using the average shape of a single RBC [45],

$$z = \pm D_0 \sqrt{1 - \frac{4(x^2+y^2)}{D_0^2}} \left[ 0.0518 + 2.0026\,\frac{x^2+y^2}{D_0^2} - 4.481 \left(\frac{x^2+y^2}{D_0^2}\right)^2 \right], \tag{2.15}$$

where $D_0 = 7.82\,\mu\mathrm{m}$ is the RBC diameter. The area and volume of the RBC are equal to $135\,\mu\mathrm{m}^2$ and $94\,\mu\mathrm{m}^3$, respectively. In the triangulated network, $N_p$ denotes the number of particles, $N_b$ denotes the number of bonds (edges), and $N_e$ denotes the number of elements (triangles). Four types of potential energies are calculated between particles, so the potential energy of the membrane can be written as

$$U^{total} = U^{in\text{-}plane} + U^{area} + U^{volume} + U^{bending}. \tag{2.16}$$

The in-plane potential energy term $U^{in\text{-}plane}$ can take several forms built from two-body and three-body energies. We take $U^{in\text{-}plane}$ to be the combination of the finitely extensible nonlinear elastic (FENE) and power law (POW) potential energies,

$$U^{in\text{-}plane} = \sum_{j=1}^{N_b} \left[ -\frac{1}{2} k_F R_0^2 \ln\left(1 - \left(\frac{r}{R_0}\right)^2\right) + \frac{k_P}{(m-1)\,r^{m-1}} \right], \quad m>0 \text{ and } m \neq 1, \tag{2.17}$$

where $k_F$ is a constant coefficient of the FENE potential, $R_0$ and $r$ are the maximum and instantaneous distances between two particles, $k_P$ is the POW potential coefficient, and $m$ is the exponent of the power law. On the right-hand side of Eq. (2.17), the first and second terms are the FENE and POW potential energies, which play the attractive and repulsive roles, respectively. In this form, $U^{in\text{-}plane}$ involves only two-body energies and is similar to the spring bond energy $U^s$ in the LSM.
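The attractive and repulsive roles of the two terms in Eq. (2.17) can be illustrated with a short Python sketch for a single bond. The function names are hypothetical, and the parameter values used in the usage note below are illustrative only; `balancing_kP` simply solves $-dU/dr = 0$ at a chosen rest length, which follows directly from differentiating Eq. (2.17).

```python
import math


def fene_pow_energy(r, kF, R0, kP, m):
    """FENE + POW energy of one bond, Eq. (2.17), valid for 0 < r < R0."""
    if not 0.0 < r < R0:
        raise ValueError("bond length must satisfy 0 < r < R0")
    fene = -0.5 * kF * R0 ** 2 * math.log(1.0 - (r / R0) ** 2)  # attractive part
    power = kP / ((m - 1) * r ** (m - 1))                        # repulsive part
    return fene + power


def fene_pow_force(r, kF, R0, kP, m):
    """Radial force -dU/dr: FENE pulls the bond in, POW pushes it out."""
    return -kF * r / (1.0 - (r / R0) ** 2) + kP / r ** m


def balancing_kP(r_eq, kF, R0, m):
    """POW coefficient for which the two terms cancel at r = r_eq."""
    return kF * r_eq ** (m + 1) / (1.0 - (r_eq / R0) ** 2)
```

For instance, with the illustrative values `kF = 1.0`, `R0 = 1.75`, `m = 6` and `r_eq = 1.0`, choosing `kP = balancing_kP(r_eq, kF, R0, m)` makes `fene_pow_force(r_eq, ...)` vanish, so the bond energy has its minimum at the rest length.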
Moreover, the area conservation energy $U^{area}$ and volume conservation energy $U^{volume}$ are three-body energies that exist in a triangle formed by three neighboring particles. The area and volume conservation energies are defined as

$$U^{area} = \frac{k_A (A - A_0^{tot})^2}{2A_0^{tot}} + \sum_{j=1}^{N_e} \frac{k_D (A_j - A_j^0)^2}{2A_j^0}, \tag{2.18}$$

$$U^{volume} = \frac{k_V (V - V_0^{tot})^2}{2V_0^{tot}}, \tag{2.19}$$

where $k_A$, $k_D$, and $k_V$ are the global area, local area, and volume constraint constants, respectively. The terms $A$ and $V$ represent the instantaneous total area and volume, while

$A_0^{tot}$ and $V_0^{tot}$ are the initial total area and volume; $A_j^0$ and $A_j$ are the initial and instantaneous local areas of the $j$th triangle. Furthermore, the bending energy $U^{bending}$ exists between two adjacent elements (four adjacent particles) and is given by

$$U^{bending} = \sum_{j=1}^{N_b} k_B \left[1 - \cos(\theta_j - \theta_0)\right], \tag{2.20}$$

where $k_B$ is the bending coefficient, and $\theta_j$ and $\theta_0$ are the instantaneous and initial angles between two adjacent elements sharing the common edge $j$. Similar to Eq. (2.11), the total elastic force $\mathbf{F}_i^e$ on the $i$th particle of the RBC model can be computed by

$$\mathbf{F}_i^e = -\nabla \left(U_i^{in\text{-}plane} + U_i^{area} + U_i^{volume} + U_i^{bending}\right). \tag{2.21}$$

Then, Eq. (2.12) and the leap-frog algorithm are used to obtain the new position and velocity of every particle on the membrane. According to [36, 37], the shear modulus $\mu_0$ based on $U^{in\text{-}plane}$ is given by

$$\mu_0 = \frac{\sqrt{3}}{4} \left\{ \frac{2k_F \left(\frac{r}{R_0}\right)^2}{\left[1 - \left(\frac{r}{R_0}\right)^2\right]^2} + \frac{k_P (m+1)}{r^{m+1}} \right\}, \tag{2.22}$$

and the compression modulus $\kappa$ in this model is given by

$$\kappa = 2\mu_0 + k_A + k_D. \tag{2.23}$$

The linear Young's modulus $Y$ and the Poisson's ratio $\nu_p$ can be expressed as

$$Y = \frac{4\kappa\mu_0}{\kappa+\mu_0}, \tag{2.24}$$

$$\nu_p = \frac{\kappa-\mu_0}{\kappa+\mu_0}. \tag{2.25}$$

Generally, we assume $\mu_0 = 123.28$, $k_A = 6041.02$, $k_D = 123.28$ and $k_V = 6164.31$ to achieve a nearly incompressible membrane when one grid size is $1\,\mu\mathrm{m}$. Based on the three dimensionless parameters, the linear Young's modulus becomes $Y = 483.82$ in dimensionless units, corresponding to $Y = 18.9 \times 10^{-6}$ N/m. Thus, a scaling factor between the dimensionless energy unit and the real physical energy unit can be calculated. Moreover, the bending coefficient $k_B$ is given by

$$k_B = \frac{2}{\sqrt{3}} k_C, \tag{2.26}$$

where $k_C$ is the bending rigidity, equal to $2.369 \times 10^{-19}$ J at room temperature (296 K). Note that $k_C$ must first be converted to the dimensionless unit by the scaling factor discussed above.
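The arithmetic behind Eqs. (2.23)-(2.26) can be reproduced with a few lines of Python. The function and variable names are hypothetical. The energy scaling at the end is one plausible reading of the scaling factor mentioned above: it assumes that the dimensionless $Y$ has units of energy per squared grid length with a grid size of $1\,\mu\mathrm{m}$, which is an assumption of this sketch rather than a statement of how the conversion is done in the code.

```python
import math


def membrane_moduli(mu0, kA, kD):
    """Compression modulus, Young's modulus and Poisson's ratio, Eqs. (2.23)-(2.25)."""
    kappa = 2.0 * mu0 + kA + kD
    young = 4.0 * kappa * mu0 / (kappa + mu0)
    poisson = (kappa - mu0) / (kappa + mu0)
    return kappa, young, poisson


# Dimensionless parameters quoted in the text.
kappa, Y, nu_p = membrane_moduli(mu0=123.28, kA=6041.02, kD=123.28)

# Assumed conversion: Y [energy / grid^2] <-> 18.9e-6 N/m with a 1 micrometre grid.
GRID_LENGTH_M = 1.0e-6
Y_PHYSICAL = 18.9e-6                                  # N/m, from the text
energy_unit_J = Y_PHYSICAL * GRID_LENGTH_M ** 2 / Y   # joules per dimensionless energy unit

kC_dimensionless = 2.369e-19 / energy_unit_J          # bending rigidity in lattice units
kB = 2.0 / math.sqrt(3.0) * kC_dimensionless          # Eq. (2.26)
```

With the quoted parameters, `Y` rounds to the value 483.82 given in the text, and `nu_p` is close to one, consistent with a nearly incompressible membrane.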

Furthermore, since $U^{in\text{-}plane}$ has its minimum at the equilibrium point, the attractive term should balance the repulsive term there. If we set the exponent of the power law to $m=6$ and $R_0 = 1.75r$, the potential coefficient of the power law $k_P$ is given by

$$k_P = 1.4848\,k_F\,r^{m+1} = 1.4848\,k_F\,r^7. \tag{2.27}$$

Substituting into Eq. (2.22) with $\mu_0 = 123.28$, $k_F$ can be expressed as

$$k_F \approx 24.0584. \tag{2.28}$$

Based on these parameters, the properties of the modeled RBC are a diameter $D_0 = 7.82\,\mu\mathrm{m}$, a linear Young's modulus $Y = 18.9\,\mu\mathrm{N/m}$, a shear modulus $\mu_0 = 6.3\,\mu\mathrm{N/m}$, and a bending rigidity $k_C = 2.369 \times 10^{-19}$ J.

2.4 Immersed boundary-lattice Boltzmann method

Guo et al. presented a body force scheme for the Bhatnagar-Gross-Krook (BGK) model [46]. Later, Guo and Zheng extended the work to MRT-LBE models [47]. The evolution equation, Eq. (2.1), with a forcing term $\delta t\,\mathbf{M}^{-1}\hat{\mathbf{F}}$ can be expressed as

$$\mathbf{f}(\mathbf{x}+\mathbf{e}\,\delta t,\, t+\delta t) - \mathbf{f}(\mathbf{x},t) = -\mathbf{M}^{-1}\mathbf{S}\left(\mathbf{m}(\mathbf{x},t) - \mathbf{m}^{eq}(\mathbf{x},t)\right) + \delta t\,\mathbf{M}^{-1}\hat{\mathbf{F}}, \tag{2.29}$$

where $\hat{\mathbf{F}}$ is the forcing term in the moment space and can be written as

$$\hat{\mathbf{F}} = \left(\mathbf{I} - \frac{1}{2}\mathbf{S}\right)\mathbf{M}\bar{\mathbf{F}}, \tag{2.30}$$

where $\mathbf{I}$ is the identity matrix and $\bar{\mathbf{F}}$ is related to the body force $\mathbf{F}$ (the fluid-solid interaction force) as

$$\bar{F}_i = \omega_i \left( \frac{\mathbf{e}_i \cdot \mathbf{F}}{c_s^2} + \frac{\mathbf{u}\mathbf{F} : (\mathbf{e}_i\mathbf{e}_i - c_s^2\mathbf{I})}{c_s^4} \right), \tag{2.31}$$

where $c_s$ is the lattice speed of sound and $\mathbf{u}$ represents the fluid velocity defined by

$$\rho_f \mathbf{u} = \sum_i f_i \mathbf{e}_i + \frac{1}{2}\delta t\,\mathbf{F}, \tag{2.32}$$

and the weights $\omega_i$ in the D3Q19 model are

$$\omega_i = \begin{cases} \frac{1}{3}, & i = 0, \\ \frac{1}{18}, & i = 1\text{-}6, \\ \frac{1}{36}, & i = 7\text{-}18. \end{cases} \tag{2.33}$$

It has been reported [25, 48] that the Navier-Stokes equations are fully recovered by this forcing scheme when body forces are present.
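The velocity-space forcing term of Eq. (2.31) can be checked numerically with the weights of Eq. (2.33). The Python sketch below is illustrative only: the names are hypothetical, $c_s^2 = 1/3$ is the standard D3Q19 value, and the velocities are enumerated in an arbitrary order rather than in the grouped order of Eq. (2.2). It verifies two properties of this forcing form: the terms sum to zero (no mass source) and their first moment equals the body force.

```python
from itertools import product

CS2 = 1.0 / 3.0  # c_s^2 for D3Q19 (standard value, assumed here)


def d3q19_with_weights():
    """Velocities and weights of Eq. (2.33), in arbitrary order."""
    velocities, weights = [], []
    for e in product((0, 1, -1), repeat=3):
        s = sum(c * c for c in e)
        if s > 2:
            continue  # corner velocities are not part of D3Q19
        velocities.append(e)
        weights.append({0: 1.0 / 3.0, 1: 1.0 / 18.0, 2: 1.0 / 36.0}[s])
    return velocities, weights


def dot(a, b):
    """Plain dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def forcing_terms(u, F):
    """Velocity-space forcing of Eq. (2.31) for fluid velocity u and body force F."""
    velocities, weights = d3q19_with_weights()
    terms = []
    for e, w in zip(velocities, weights):
        first = dot(e, F) / CS2
        # uF : (e e - c_s^2 I) = (e.u)(e.F) - c_s^2 (u.F)
        second = (dot(e, u) * dot(e, F) - CS2 * dot(u, F)) / CS2 ** 2
        terms.append(w * (first + second))
    return velocities, terms
```

A typical check is to call `forcing_terms((0.05, -0.02, 0.01), (1e-3, 2e-3, -1e-3))` and confirm that the terms sum to zero while `sum(term * e)` over all directions reproduces the body force.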

In addition, the fluid density and un-forced fluid momentum are given as follows:

$$\rho_f = \sum_i f_i, \tag{2.34}$$

$$\rho_f \mathbf{u}^* = \sum_i f_i \mathbf{e}_i, \tag{2.35}$$

where $\mathbf{u}^*$ represents the un-forced fluid velocity. To describe the immersed boundary method (IBM), the solid boundary domain $\Gamma$ and the fluid boundary domain $\Pi$ are defined. Since the fluid nodes lie on a regular Eulerian grid, the position of a solid particle may not coincide with its adjacent fluid nodes. Therefore, the discrete Dirac delta function $\delta$ is used to interpolate the fluid velocity at the position of the solid boundary from the surrounding fluid grids. The discrete Dirac delta function [23] is given by

$$\delta(\Delta\mathbf{r}) = \begin{cases} \dfrac{1}{64h^3}\left(1+\cos\dfrac{\pi\Delta x}{2h}\right)\left(1+\cos\dfrac{\pi\Delta y}{2h}\right)\left(1+\cos\dfrac{\pi\Delta z}{2h}\right), & |\Delta\mathbf{r}| \le 2h, \\ 0, & \text{otherwise}, \end{cases} \tag{2.36}$$

where $h$ is the lattice length and $\Delta\mathbf{r} = (\Delta x, \Delta y, \Delta z)$ is the distance between the position of the solid boundary particle and one of its surrounding fluid nodes. The fluid boundary domain $\Pi$ is a spherical volume with a radius of $2h$ centered at a given solid particle position $\mathbf{r}_b$. The un-forced fluid velocity $\mathbf{u}^*$ at the position of the solid boundary particle $\mathbf{r}_b$ is

$$\mathbf{u}^*(\mathbf{r}_b,t) = \int_\Pi \mathbf{u}^*(\mathbf{r},t)\,\delta(\mathbf{r}-\mathbf{r}_b)\,d\mathbf{r}, \tag{2.37}$$

where $\mathbf{r}$ is the integration variable and $\mathbf{r}_b$ denotes the position of a solid particle in the solid boundary domain $\Gamma$. Due to the no-slip boundary condition, the forced fluid velocity $\mathbf{u}(\mathbf{r}_b,t)$ in the solid boundary domain should be equal to the solid velocity $\mathbf{U}_b(\mathbf{r}_b,t)$, that is,

$$\mathbf{u}(\mathbf{r}_b,t) = \mathbf{u}^*(\mathbf{r}_b,t) + \delta t\,\frac{\mathbf{F}(\mathbf{r}_b,t)}{\rho_f} = \mathbf{U}_b(\mathbf{r}_b,t), \tag{2.38}$$

where the forced fluid velocity $\mathbf{u}$ can be calculated from the un-forced velocity $\mathbf{u}^*$ and the interaction force $\mathbf{F}$. Therefore, the interaction force on the fluid at the solid boundary position can be calculated from the momentum difference

$$\mathbf{F}(\mathbf{r}_b,t) = \frac{\rho_f\left(\mathbf{U}_b(\mathbf{r}_b,t) - \mathbf{u}^*(\mathbf{r}_b,t)\right)}{\delta t}, \tag{2.39}$$

and the interaction force exerted on the solid particles by the fluid is therefore $\mathbf{F}^h(\mathbf{r}_b,t) = -\mathbf{F}(\mathbf{r}_b,t)$.
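The discrete Dirac delta function is used again to distribute the interaction force $\mathbf{F}(\mathbf{r}_b,t)$ to the surrounding fluid nodes,

$$\mathbf{F}(\mathbf{r},t) = \int_\Gamma \mathbf{F}(\mathbf{r}_b,t)\,\delta(\mathbf{r}-\mathbf{r}_b)\,d\mathbf{r}_b, \tag{2.40}$$

where $\mathbf{F}(\mathbf{r},t)$ is the distributed body force that is used in Eq. (2.30).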
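The interpolation, forcing and spreading steps of Eqs. (2.36)-(2.40) can be sketched serially in Python for a single boundary particle. This is a CPU illustration under stated assumptions, not the GPU implementation: the names are hypothetical, $h=1$ and $\delta t = 1$ in lattice units, the cutoff is applied per axis on a $5\times5\times5$ block of nodes around the particle (rather than on a sphere), and the Lagrangian integration weight of the particle in Eq. (2.40) is taken as one.

```python
import math

H = 1.0  # lattice length h in lattice units


def delta_1d(d):
    """One-dimensional factor of Eq. (2.36), with a per-axis cutoff at 2h."""
    if abs(d) > 2.0 * H:
        return 0.0
    return (1.0 + math.cos(math.pi * d / (2.0 * H))) / (4.0 * H)


def delta_3d(dx, dy, dz):
    """Product form: (1/64h^3)(1+cos)(1+cos)(1+cos)."""
    return delta_1d(dx) * delta_1d(dy) * delta_1d(dz)


def support_nodes(rb):
    """The 5 x 5 x 5 block of fluid nodes centered on the node nearest to rb."""
    cx, cy, cz = (math.floor(c / H + 0.5) for c in rb)
    for i in range(cx - 2, cx + 3):
        for j in range(cy - 2, cy + 3):
            for k in range(cz - 2, cz + 3):
                yield (i, j, k)


def node_weight(node, rb):
    """delta(r - r_b) times the cell volume h^3."""
    dx, dy, dz = (n * H - c for n, c in zip(node, rb))
    return delta_3d(dx, dy, dz) * H ** 3


def interpolate(velocity_at, rb):
    """Discrete version of Eq. (2.37): un-forced velocity at the particle."""
    u = [0.0, 0.0, 0.0]
    for node in support_nodes(rb):
        w = node_weight(node, rb)
        un = velocity_at(node)
        for d in range(3):
            u[d] += w * un[d]
    return u


def interaction_force(rho_f, U_b, u_star, dt=1.0):
    """Eq. (2.39): force on the fluid that enforces no-slip at the particle."""
    return [rho_f * (U - u) / dt for U, u in zip(U_b, u_star)]


def spread(force_b, rb):
    """Discrete version of Eq. (2.40) for one particle (unit Lagrangian weight)."""
    return {node: [f * node_weight(node, rb) / H ** 3 for f in force_b]
            for node in support_nodes(rb)}
```

For a uniform fluid velocity, `interpolate` returns that same velocity because the weights over the support add up to one, and summing the spread force density times $h^3$ over the nodes returns the particle force.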

3 CUDA implementations

First, we provide a brief introduction to the CUDA Fortran programming model. We then introduce three single-GPU strategies in the following sections: the lattice Boltzmann method (LBM) in Section 3.2, the lattice spring model (LSM) in Section 3.3, and the immersed boundary method (IBM) in Section 3.4.

3.1 CUDA Fortran programming model

CUDA Fortran extends Fortran by allowing the programmer to define Fortran subroutines, called kernel functions, that are executed in parallel on the device [49]. A kernel function is defined using the attributes(global) specifier on the subroutine statement. When invoking a kernel function, at least two arguments have to be given in the chevron <<<>>> syntax. The first argument specifies the grid size and the second argument specifies the thread block size.

The memory types available on NVIDIA GPU devices in CUDA are global memory, shared memory, registers, texture memory, local memory, and various caches. Global memory is the largest device memory and is declared with the device attribute in host code. It can be read and written from both host and device, and it is available to all threads launched on the device. Local variables defined in device code are stored in on-chip registers if sufficient registers are available; otherwise, the data are stored off-chip in local memory. Both registers and local memory can be accessed only by the thread that owns them. Shared memory is allocated per block and can be accessed by all threads in the block. It is declared in device code using the shared variable qualifier and acts like a low-latency, high-bandwidth, software-managed cache. When an operation is needed between the threads in a block, shared memory can be used to avoid inefficient direct global memory access patterns.
Constant memory can be read and written from host code but is read-only for threads in device code. Constant data reside in off-chip device memory but are cached on the chip. Therefore, constant memory is effective when threads that execute at the same time access the same value [50].

3.2 CUDA strategy of the lattice Boltzmann method

In this work, we adopt the single-GPU implementation of the D3Q19 lattice-Boltzmann method (LBM) presented by Habich et al. [16]. Their CUDA strategy is a modified and optimized version of the one in Tölke and Krafczyk's 2008 study [15]. We briefly introduce the CUDA-based LBM implementation in the following. First, each thread is in charge of the calculations (collision and streaming) of one fluid node. The threads in a block are arranged in 1D and the grid of blocks is 2D. With this setup, one thread maps to the fluid distribution functions of one fluid grid, and the values of the distribution function for all fluid grids lie consecutively in memory, starting in the X-direction, as sketched in Fig. 5 of [16]. The layout of the arrays

that store the fluid distribution functions is structure of arrays (SoA). In addition, shared memory is used during the streaming. A more detailed discussion of this single-GPU LBM implementation has been previously published [15, 16].

3.3 CUDA strategy of the lattice spring model

In this section, we discuss in detail how to implement the lattice spring model (LSM) using CUDA. At the beginning of the simulation, 9 one-dimensional arrays are declared on the host to store 9 variables: positions (sx,sy,sz), velocities (vx,vy,vz), and forces (fx,fy,fz) of all solid particles. The size of each array is equal to the number of solid particles (the variable number_of_particles in the program), and every particle has its own ID. An input file that records the detailed solid information in several zones is read before the calculation cycle. For example, Zone 1 records the number of solid particles and the particle ID, initial position $(x_0^s, y_0^s, z_0^s)$, and density $\rho^s$ of each particle. Zone 2 records the number of spring bonds, the bond ID, the particle IDs in each spring bond, the initial bond length $r_{0ij}$, and the spring coefficient $k_s$. Similarly, Zone 3 records the number of angular bonds, the angular bond ID and the three particle IDs in each angular bond, the initial angle $\theta_{0ijk}$, and the angular coefficient $k_a$.

In an LSM simulation, there are 4 main iterations: Loop 0, Loop 1, Loop 2, and Loop 3. The whole calculation cycle of the LSM is handled by Loop 0, and one iteration of this loop is equivalent to one time step. At each time step, we use Loop 1 and Loop 2 to execute all calculations of the spring bonds ($\nabla U^s$ in Eq. (2.11)) and angular bonds ($\nabla U^a$ in Eq. (2.11)), respectively. After Loop 1 and Loop 2, all elastic forces (see Eq. (2.11)) at this time step have been collected, and then Loop 3 is conducted for the leapfrog algorithm.
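The loop structure described above can be sketched as a serial Python reference, which may be useful for checking a GPU port on small cases. It is a minimal sketch under stated assumptions: the function name is hypothetical, all particles share one mass, Loop 2 (angular bonds) is omitted for brevity, no fluid force is applied, and the leapfrog update is written in its half-step-velocity form, which is one common way to implement it.

```python
import math


def run_lsm(sx, sy, sz, bonds, ks, mass, dt, steps):
    """Serial reference of Loops 0, 1 and 3; positions are updated in place.

    bonds is a list of (i, j, r0) with 0-based particle indices.
    Returns the length of the first bond after every time step.
    """
    n = len(sx)
    vhx, vhy, vhz = [0.0] * n, [0.0] * n, [0.0] * n   # half-step velocities
    lengths = []
    for _ in range(steps):                             # Loop 0: time steps
        fx, fy, fz = [0.0] * n, [0.0] * n, [0.0] * n   # forces reset each step
        for i, j, r0 in bonds:                         # Loop 1: spring bonds
            dx, dy, dz = sx[i] - sx[j], sy[i] - sy[j], sz[i] - sz[j]
            r = math.sqrt(dx * dx + dy * dy + dz * dz)
            c = -ks * (r - r0) / r                     # force on i is c * (r_i - r_j)
            fx[i] += c * dx; fy[i] += c * dy; fz[i] += c * dz
            fx[j] -= c * dx; fy[j] -= c * dy; fz[j] -= c * dz
        # Loop 2 (angular bonds) would accumulate into fx, fy, fz here.
        for p in range(n):                             # Loop 3: leapfrog update
            vhx[p] += fx[p] / mass * dt
            vhy[p] += fy[p] / mass * dt
            vhz[p] += fz[p] / mass * dt
            sx[p] += vhx[p] * dt
            sy[p] += vhy[p] * dt
            sz[p] += vhz[p] * dt
        i, j, _ = bonds[0]
        lengths.append(math.dist((sx[i], sy[i], sz[i]), (sx[j], sy[j], sz[j])))
    return lengths
```

For two particles joined by one stretched spring, the bond length oscillates about its rest length with a bounded amplitude while the center of mass stays fixed, which gives a quick sanity check of the force accumulation and the integrator.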
In the LSM simulation, three kernel functions (kernel_computing_spring_bonds, kernel_computing_angular_bonds, and kernel_computing_leapfrog) are used in Loop 1, Loop 2, and Loop 3. We set a one-dimensional grid of blocks and one-dimensional threads in a block for the spring bonds, angular bonds, and single particles in the three kernel functions as follows.

    blocks = number_threads   ! number_threads is how many threads in a block
    grid = dim3(ceiling(real(number_of_spring_bonds)/number_threads))
    call kernel_computing_spring_bonds<<<grid,blocks>>>

    grid = dim3(ceiling(real(number_of_angular_bonds)/number_threads))
    call kernel_computing_angular_bonds<<<grid,blocks>>>

    grid = dim3(ceiling(real(number_of_particles)/number_threads))
    call kernel_computing_leapfrog<<<grid,blocks>>>

As above, the variable number_threads is the number of threads per block. These three kernel functions are in charge of the calculations of the spring bonds, angular bonds, and the leapfrog algorithm, respectively. The arrays in host may be declared

again in the device global memory. A variable with the suffix _d in its name is on the device. For example, the position arrays (sx_d,sy_d,sz_d) are in device global memory whereas (sx,sy,sz) are in host memory. It is worth noting that three more arrays, the half-step velocities (vhx_d,vhy_d,vhz_d), have to be declared on the device for the leapfrog integration. At the beginning of this single-GPU implementation, all data are transferred from host to device and kept on the device during the entire calculation. Except for output results, the data on the device are not transferred back to the host, which reduces the time spent on memory transfers. The memory layout of these arrays on the device is structure of arrays (SoA).

In the kernel function kernel_computing_spring_bonds, a thread is mapped to a spring bond ID (sb_id in our CUDA Fortran code) through a CUDA index as sb_id = (blockidx%x-1)*blockdim%x + threadidx%x, where blockdim, blockidx and threadidx are predefined variables. The one-dimensional threads in a one-dimensional block use the CUDA indices to access the IDs of the bonded particles and calculate the forces. We emphasize that an atomic function, atomicadd, is used to accumulate the force data correctly. During the calculation of the spring bonds, a thread is mapped to one spring bond, and two forces are obtained at the end of the calculation. Consequently, this thread needs to write the data to the force arrays (fx_d,fy_d,fz_d) of the two particles of the bond. Since a particle may belong to more than one bond, multiple threads corresponding to different bonds may read and update the same memory location without any synchronization, leading to incorrect values in the force arrays. The atomicadd function is therefore necessary to avoid this problem in this CUDA implementation.
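The index mapping and the need for accumulation can be illustrated with a serial Python emulation. This is a hypothetical sketch, not the kernel itself: blocks and threads are visited one after another, so no race can occur, and the guard that skips surplus threads in the last block is an assumption of this sketch, since the grid size is rounded up with a ceiling. Each bond carries a single scalar force for simplicity.

```python
import math


def launch_1d(n_items, threads_per_block):
    """Yield the 1-based item ID handled by each emulated thread."""
    n_blocks = math.ceil(n_items / threads_per_block)
    for blockidx in range(1, n_blocks + 1):
        for threadidx in range(1, threads_per_block + 1):
            item_id = (blockidx - 1) * threads_per_block + threadidx
            if item_id <= n_items:   # surplus threads in the last block do nothing
                yield item_id


def accumulate_bond_forces(n_particles, bonds, bond_force, threads_per_block=4):
    """Add each bond's force to both of its particles (1-based particle IDs).

    On the GPU the two '+=' updates are the ones that need atomicadd,
    because several bonds can share a particle.
    """
    force = [0.0] * (n_particles + 1)   # index 0 is unused
    for sb_id in launch_1d(len(bonds), threads_per_block):
        i, j = bonds[sb_id - 1]
        force[i] += bond_force[sb_id - 1]
        force[j] -= bond_force[sb_id - 1]
    return force[1:]
```

In a three-particle chain with bonds (1,2) and (2,3), the middle particle receives contributions from both bonds. If the second contribution overwrote the first instead of being added to it, the result would be wrong, which is the failure mode that atomicadd prevents when the two bonds are processed concurrently.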
Similar to kernel_computing_spring_bonds, kernel_computing_angular_bonds uses CUDA indices to map one-dimensional threads to the angular bond ID ab_id as ab_id = (blockidx%x-1)*blockdim%x + threadidx%x. In the kernel function kernel_computing_leapfrog, a thread is associated with a particle ID, and every particle has a different memory location. Therefore, the threads never access the same memory address simultaneously, and atomicadd is not needed. The same approach applies to Eqs. (2.17)-(2.20) for the coarse-grained RBC model, where CUDA indices are mapped to the FENE-POW bond ID, element ID, dihedral angle ID, and particle ID. It is worth noting that a parallel sum reduction can be used for the total surface area and total volume in Eqs. (2.18) and (2.19), enhancing performance.

3.4 CUDA strategy of the immersed boundary method

As noted in Section 2.4, Eulerian fluid grids may not coincide with Lagrangian solid grids. To realize the no-slip boundary condition, the discrete Dirac δ function [23] is employed to

interpolate the un-forced fluid velocity u* at the position r_b of a solid boundary particle from the velocities at all fluid grid nodes, called fluid supports, within a sphere Π of diameter D = 4 centered at r_b. Thus, every boundary particle in the solid boundary domain Γ has its own fluid supports, which contribute to the un-forced fluid velocity through Eq. (2.37). When a solid particle moves, its fluid supports change at each time step. To use the GPU effectively, we extend the spherical fluid domain Π to a larger cubic volume of 5 × 5 × 5 centered at r_b and use the fluid supports within the cube for the velocity interpolation. The cube must be larger than the diameter of the spherical fluid domain; that is, it must be slightly larger than 4 × 4 × 4 to ensure that the integral of the Dirac delta function over the spherical fluid domain Π equals one. In practice, we use a diameter of D = 4.2 in the simulations. If the cube were exactly 4 × 4 × 4, some fluid grid nodes on the cube edge would be missed during the integration, because the center of the fluid domain may not coincide with the position r_b of the solid boundary particle, and the interaction forces between the fluid and the solid might then be incorrect. The effect of the size of the fluid supports on computational speed is investigated in Section 4.3.2; the results show that the 5 × 5 × 5 size gives the best computing speed with accurate results. Therefore, in this implementation, a one-dimensional grid of blocks is mapped to the solid boundary particles, and the three-dimensional 5 × 5 × 5 threads in each block are mapped to the fluid supports. The relevant part of the host source code is shown below:

    ! one thread per fluid support of a boundary particle
    blocks = dim3(5,5,5)  ! for 3D
    ! one block per solid boundary particle
    grid = dim3(number_solid_boundary_particles,1,1)

    ! the argument list is not shown in the original listing
    call kernel_computing_ibm<<<grid,blocks>>>

Using a 2D case as a schematic illustration, the layout dim3(5,5,1) gives each block 25 threads. The 25 threads map to the fluid supports of a solid boundary particle, as shown in Fig. 1. The red dashed rectangle denotes the area of the fluid supports of a solid boundary particle (big black point); the 5 × 5 small green circles represent the fluid supports of the solid boundary particle; the gray line is the solid boundary interface; and the big gray points on the line are neighboring solid boundary particles. All results in this paper are obtained in three dimensions, where each block has 125 threads.

Our CUDA-based IBM implementation can be divided into five steps. The source code is given in Appendix A. In Step 1, the particle ID p_id of a solid boundary particle is assigned from blockidx%x, and the threads are mapped to the fluid supports. The thread array layout starts at the corner r_0 = (x0, y0, z0) of the cubic area shown in Fig. 1. In Step 2, the three-dimensional thread index is flattened into a one-dimensional index tid = threadidx%x + (threadidx%y-1)*5 + (threadidx%z-1)*5*5, and every thread calculates the term u*(r) δ(r - r_b) of Eq. (2.37), where δ(r - r_b) is the discrete δ function of Eq. (2.36). Three arrays (partial_ux, partial_uy, partial_uz) of size 125 (the number of threads per block) are used to store the value of this term and are declared as the

shared memory, which greatly reduces the computational time. After Step 2, syncthreads is needed to make sure that all threads in a block have saved their values to the three shared memory arrays. In Step 3, we use a multi-thread reduction technique to add all partial un-forced velocity values and obtain the un-forced velocity u*(r_b), which is the term on the left-hand side of Eq. (2.37). In Step 4, we evaluate Eq. (2.39) to obtain the interaction force F(r_b, t). Another syncthreads command is needed after Step 4. In Step 5, all threads in the block multiply the interaction force F(r_b, t) by the value of the discrete δ function to obtain the body force F(r, t) and add it to the corresponding global memory location by using atomicadd. The reason for using atomicadd is the same as in Section 3.3: although different threads correspond to different fluid supports of different solid boundary particles, they might point to the same fluid node. Since multiple blocks and threads execute at the same time, atomicadd is needed to avoid simultaneous reads and writes at the same memory location. Unlike the work of Wu et al. [51], who used two kernel functions, the present single-GPU implementation needs only one kernel function, kernel_computing_ibm, to interpolate the un-forced fluid velocity u*(r_b) and to redistribute the interaction force F(r_b, t) to the surrounding fluid supports.

Figure 1: A 2D illustration of the single-GPU IBM implementation. The red dashed rectangle denotes the frame of the fluid supports of a solid boundary particle (big black point). The gray line represents the solid boundary domain, and the big gray points on the line are the other solid boundary particles. The 5 × 5 small green circles within the red rectangle are the surrounding fluid nodes (i.e., fluid supports) of this solid boundary particle.

4 Results

The platform specifications in this study are given in detail in Table 1.
We conduct three simulation problems: a cantilever beam for the lattice spring model (LSM), RBC stretching for the RBC model, and a rigid ellipsoid in shear flow for the immersed boundary lattice-Boltzmann lattice-spring method (IBLLM). All data are single precision in this study. In each case, we validate the CUDA implementation and discuss its optimization and computing performance.

Table 1: Platform specifications in this study.

Item            Details
CPU             Intel(R) Xeon(R) E5645 @ 2.40 GHz
System memory   24 GB
GPU             NVIDIA Tesla K40
GPU memory      12 GB GDDR5
OS              CentOS 6.6, 64-bit
CUDA version    CUDA 7.5
Compiler        PGI Accelerator Fortran 16.3, 64-bit

4.1 Cantilever beam model

4.1.1 Validation

The first simulation problem is the dynamic beam model. A rectangular cantilever beam of length L, width W, and height H is simulated. The left end of the beam is fixed, whereas the right end is subjected to an external force F, as shown in Fig. 2. The deflection z_b can be predicted by Timoshenko theory [52] as

\[ z_b = \frac{F L^3}{3EI}\left[1 + \frac{3EI}{\kappa G A L^2}\right], \tag{4.1} \]

Figure 2: A rectangular cantilever beam under a force F at its right end and fixed at its left end. The deflection changes with the force F. Reprinted from [24] with the permission of the authors and Elsevier.

where E and G are the Young's and shear moduli, respectively; κ = 5/6 is the shear coefficient for the rectangular beam; and I and A are the second moment of area and the cross-sectional area, respectively. The lattice length and the unit time interval are set to h = 8.064 × 10^-5 m and Δt = 1.301 × 10^-4 s. The elastic coefficients k_s = 7.8 and k_a = 0.75 are used in the simulations. We simulate three different sizes of the cantilever beam and compare the results with the theoretical solution. The normalized deflection as a function of the normalized force is given in Fig. 3. The red line is the theoretical solution, and the symbols represent the numerical results for the different beam sizes. The comparison shows that the simulation results from the CUDA code are accurate, especially for small deflections.

Figure 3: The normalized deflection z_b/L as a function of the normalized force F L^2/(EI) for three sizes of cantilever beam, (L,W,H) = (30,3,3), (50,5,5), and (100,10,10). The red solid line represents the analytical solution from Timoshenko theory [52], and the symbols are the simulation results from the CUDA code for the various beam sizes.

4.1.2 Optimization and speedup

The grid size and thread size may influence the computing performance. In this CUDA strategy, thread indices map to the corresponding spring bond, angular bond, and solid particle indices. The total numbers of spring bonds, angular bonds, and solid particles are constant in each simulation case, so only the number of threads per block is adjustable. To find the optimal number, we ran the simulations with different numbers of threads per block: 16, 32, 64, 128, 256, 512, and 1,024. Fig. 4 displays the speedup for different numbers of threads per block.
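Because the number of work items is fixed in each case, choosing the number of threads per block also fixes the number of blocks. The short Python sketch below works through that arithmetic. It assumes the usual ceiling-division launch rule, which the text does not state explicitly for the LSM kernels, and launch_config is a hypothetical helper name.

```python
def launch_config(n_items, threads_per_block):
    """Return (blocks, idle_threads) for a 1D launch that covers n_items.

    Assumption: the block count is the ceiling of n_items / threads_per_block,
    and threads whose index exceeds n_items simply do no work.
    """
    blocks = (n_items + threads_per_block - 1) // threads_per_block
    idle = blocks * threads_per_block - n_items
    return blocks, idle


# Smallest and largest beams of this study (270 and 10,000 solid particles).
for n_particles in (270, 10000):
    for tpb in (16, 32, 64, 128, 256, 512, 1024):
        blocks, idle = launch_config(n_particles, tpb)
        print(f"{n_particles:6d} particles, {tpb:4d} threads/block: "
              f"{blocks:4d} blocks, {idle:4d} idle threads")
```

For 270 particles, 16 threads per block yields 17 blocks with 2 idle threads, whereas 1,024 threads per block yields a single block with 754 idle threads. This arithmetic alone does not determine the measured speedup, which also depends on the hardware scheduling.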
The speedup is defined as follows:

\[ \text{speedup} = \frac{\text{CPU computing time}}{\text{GPU computing time}}. \tag{4.2} \]

Figure 4: The speedup of the CUDA implementation of the LSM for different numbers of threads per block for three different sizes of the solid. The red color denotes the maximum speedup in each case. The suggested number of threads per block is 128.

Each case runs for 10,000 time steps to average the computing time and calculate the speedup. Fig. 4(a) shows that when the simulation is small (270 particles), 16 threads per block gives the maximum speedup of 5.02 times. The speedup increases as the number of solid particles (the size of the solid) increases, and the optimal number of

threads becomes 128, as shown in Fig. 4(b) and (c). The maximum speedup of 62.97 times is achieved in the 10,000-particle case. Even in the small case, however, the speedup with 128 threads per block is close to the optimal one. Therefore, we suggest that 128 threads per block is the best choice on the NVIDIA Tesla K40. This optimal number also agrees with the setup suggested in [35].

4.2 RBC stretching

4.2.1 Validation

The second simulation problem is a single RBC stretched by a pair of external forces F, as shown in Fig. 5(a). The setup corresponds to the real RBC stretching experiments conducted by Dao et al. in 2003 [53]. The numerical parameters in this model correspond to the membrane Young's modulus Y = 18.9 µN/m, the membrane shear modulus µ_0 = 4.75 µN/m, and the bending rigidity k_c = 2.369 × 10^-19 J. The longitudinal diameter D_L and the transverse diameter D_T are defined as shown in Fig. 5(b). Two grids are simulated, a coarse one and a fine one. The coarse grid uses 270 particles and 536 triangular elements, and the fine grid uses 1,230 particles and 2,456 elements to constitute a single RBC. The unit length and unit time interval are h = 1 µm and Δt = 1.6 × 10^-4 s for both grids. A damping force is applied to every particle to drive the RBC toward a steady state.

Figure 5: (a) A sketch of the stretching of a red blood cell. (b) Definitions of the longitudinal diameter D_L and the transverse diameter D_T.

The diameters as functions of the external force are given in Fig. 6. The red rhombus line and the blue square line represent the numerical results with the coarse and fine grids, respectively. The black circles represent the experimental data of Dao et al. [53]. The top and bottom lines represent the longitudinal diameter D_L and the transverse diameter D_T, respectively. Fig. 6 shows that the numerical results from the CUDA code agree well with the experimental data.

Figure 6: The longitudinal diameter D_L and the transverse diameter D_T (in µm) as functions of the external force F (in pN). The red rhombus and blue square lines represent the numerical results from the CUDA code with the coarse and fine grids, respectively. The black circles represent the experimental data of Dao et al. [53]. The top and bottom lines represent the longitudinal diameter D_L and the transverse diameter D_T, respectively.

4.2.2 Optimization and speedup

As before, the number of threads per block is varied among 16, 32, 64, 128, 256, 512, and 1,024 in this CUDA strategy for the RBC model, and the speedup is defined as in Eq. (4.2). The speedup is maximal when 128 threads per block are used. Although Fig. 7(a) shows that the speedup is only 2.415 times for the coarse grid, it increases to 9.51 times for the fine grid, as shown in Fig. 7(b). In many studies of RBC models [4, 5, 54, 55], the grids usually have more than 2,000 triangular elements to guarantee that the deformation of the RBCs is captured precisely. Therefore, this speedup is effective in solving practical problems.

4.3 An ellipsoid in shear flow

4.3.1 Validation

The third simulation problem is a rigid ellipsoid rotating in a simple shear flow. According to Jeffery theory [56], an ellipsoid at zero Reynolds number has a rotation angle φ and an angular velocity ω as functions of time t as follows:

\[ \varphi = \tan^{-1}\left(\frac{b}{c}\tan\frac{b c k t}{b^2 + c^2}\right), \tag{4.3} \]

\[ \omega = \frac{k}{b^2 + c^2}\left(b^2\cos^2\varphi + c^2\sin^2\varphi\right), \tag{4.4} \]

where (a, b, c) are the lengths of the semi-principal axes of the ellipsoid, and k is the shear rate. The simulation box (N_x, N_y, N_z) is (62, 62, 62), and the kinematic viscosity ν is 0.32

in simulation units (8 × 10^-6 m^2/s in real units). A pair of equal and opposite velocities in the Y-direction, V_0y and -V_0y, are set at Z = 0 and Z = N_z + 1 to simulate a simple shear flow. The shear rate k and the Reynolds number Re in the simulation are defined as

\[ k = \frac{2 V_{0y}}{N_z + 2}, \tag{4.5} \]

\[ Re = \frac{k c^2}{\nu}. \tag{4.6} \]

Figure 7: The speedup of the CUDA implementation of the RBC model for different numbers of threads per block with the coarse and fine grids. The red color denotes the maximum speedup in each case. The suggested number of threads per block is 128.

The size of the ellipsoid and the shear rate are set to (a, b, c) = (3.5, 3.5, 9.5) and k = 1.5625 × 10^-4, which gives Re = 0.044 and approximates a very low Reynolds number. Fig. 8 shows the simulation results for the normalized angular velocity ω/k as a function of the normalized time kt, where t is the simulation time step. The simulation results from both the CUDA code and the sequential code agree well with the theoretical solution, Eq. (4.4).
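The quoted parameters can be checked directly against Eqs. (4.3)-(4.6). The Python sketch below is only a worked evaluation of those formulas, not part of the authors' solver; the function names are hypothetical, and the wall speed v0y is not quoted in the text but is inferred here from Eq. (4.5).

```python
import math


def shear_rate(v0y, nz):
    """Eq. (4.5): k = 2*V0y / (Nz + 2)."""
    return 2.0 * v0y / (nz + 2)


def reynolds(k, c, nu):
    """Eq. (4.6): Re = k*c**2 / nu."""
    return k * c * c / nu


def jeffery_phi(t, k, b, c):
    """Eq. (4.3): rotation angle on the principal branch of atan."""
    return math.atan((b / c) * math.tan(b * c * k * t / (b * b + c * c)))


def jeffery_omega(t, k, b, c):
    """Eq. (4.4): angular velocity evaluated at phi(t)."""
    phi = jeffery_phi(t, k, b, c)
    return k / (b * b + c * c) * (
        b * b * math.cos(phi) ** 2 + c * c * math.sin(phi) ** 2
    )


# Values quoted in the text (simulation units).
b, c = 3.5, 9.5
nu, nz = 0.32, 62
k = 1.5625e-4

# Assumption: wall speed implied by Eq. (4.5); it is not given in the text.
v0y = k * (nz + 2) / 2.0

# omega depends on phi only through cos^2 and sin^2, so it repeats
# whenever phi advances by pi, i.e. after this increment of kt.
kt_repeat = math.pi * (b * b + c * c) / (b * c)

print("Re =", round(reynolds(k, c, nu), 3))
print("implied V0y =", v0y)
print("omega/k at t = 0:", jeffery_omega(0.0, k, b, c) / k)
print("kt after which omega repeats:", kt_repeat)
```

With these inputs Re rounds to 0.044, in agreement with the stated value, and the theoretical ω/k oscillates between b^2/(b^2 + c^2) and c^2/(b^2 + c^2), repeating about every kt ≈ 9.7.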

Figure 8: Comparison of the normalized angular velocity ω/k as a function of the normalized time kt. The ellipsoid has the size (a, b, c) = (3.5, 3.5, 9.5), and the CUDA simulation uses 463 particles. The green solid line is the theoretical result from Jeffery theory, and the red circles and blue triangles are the simulation results from the sequential and CUDA codes, respectively. The fluid grid is 64 × 64 × 64, the shear rate is k = 0.00015625, and the Reynolds number Re is 0.044.

4.3.2 Size of fluid supports

In this subsection, two sizes of fluid supports, 8 × 8 × 8 and 5 × 5 × 5, are compared to optimize the GPU-based IBM implementation, while the sphere diameter of the integration is kept at D = 4.2. The computing times of the two sizes are compared in Fig. 9. Although 8 × 8 × 8 is a power of 2, its computing time is still longer than that of 5 × 5 × 5, which is not a power of 2, because the former uses about four times as many threads per block (512 versus 125) and therefore costs more time in this GPU-based IBM implementation. The 5 × 5 × 5 fluid support thus provides the better computing speed of the two sizes tested.

4.3.3 Performance investigation

The computing performance of the CUDA implementations of the LSM and IBM is investigated in this section. The speedup defined in Eq. (4.2) is used to quantify the performance. We conduct four cases that keep the fluid grid at 64 × 64 × 64 but vary the size of the solid. Table 2 lists the simulation sizes of each case. The number of threads per block is set to 128 to optimize the CUDA implementation of the LSM. Fig. 10 shows the speedups of the three numerical methods and of the entire FSI solver in this study.
Since all fluid grids are the same, all cases have almost the same speedup of the LBM. The number of solid particles as a function of the speedup of the LSM is given

Figure 9: Comparison of the computing time per simulation step between the two sizes of fluid supports, 5 × 5 × 5 and 8 × 8 × 8. The 5 × 5 × 5 size takes less computing time than the 8 × 8 × 8 size.

Table 2: The detailed sizes of the four cases. The fluid grid is kept the same while the size of the solid is varied in each case.

                                      Case 1         Case 2         Case 3         Case 4
Fluid grid                            64 × 64 × 64   64 × 64 × 64   64 × 64 × 64   64 × 64 × 64
Number of solid particles             463            2,081          5,695          12,015
Number of solid boundary particles    250            742            1,482          2,486

Figure 10: The speedups of the three numerical methods and of the entire solver in this study for the four cases. The maximum overall speedup of 84.14 times is obtained in Case 4.