Performance of Parallel Conjugate Gradient Solvers in Meshfree Analysis


Youngjoon Kim, Graduate Research Assistant, Mechanical Engineering, Center for Computer-Aided Design, The University of Iowa
Colby C. Swan, Associate Professor, Civil & Environmental Engineering, Center for Computer-Aided Design, The University of Iowa
Jiun-Shyan Chen, Associate Professor, Civil & Environmental Engineering, University of California, Los Angeles

ABSTRACT: Meshfree analysis methods, on a per degree of freedom basis, are typically more computationally expensive and yet more accurate than finite element methods. For very large models, whether meshfree or finite element, the memory and computational effort associated with direct equation solvers makes them prohibitively expensive. In this work, the performance of different linear equation solvers with meshfree analysis methods is explored. In particular, parallel conjugate gradient solvers with both Jacobi diagonal preconditioning and incomplete Cholesky factorization preconditioning are tested on a number of different meshfree analysis applications and compared against the performance of a fast direct sparse equation solver. It is found in these exploratory computations that as the support size of meshfree shape functions increases, the condition number of the associated stiffness matrices increases, and the relative efficiency of iterative solvers suffers somewhat. Nevertheless, for normalized support sizes between one and two, the performance of both conjugate gradient solvers compares very favorably with that of the sparse direct solver for intermediate ($N \sim O(10^5)$) and large problems.

Keywords: meshfree methods; conjugate gradient; parallel solvers; preconditioners; high-performance computing.

Corresponding author. E-mail: colby-swan@uiowa.edu; Dept. of Civil & Environmental Engineering, 420 SC, The University of Iowa, Iowa City, Iowa 52242, USA.

1. INTRODUCTION

Conventional finite element methods suffer a number of difficulties in dealing with specific classes of problems such as those involving large deformation and those involving crack propagation, which require the treatment of discontinuities that do not coincide with the original mesh lines [5]. The objective of recently developed meshfree methods is to eliminate these difficulties by constructing the function approximations entirely in terms of nodes. That is, meshfree methods do not require any explicit meshes for approximation, and the problem domain is completely described by particles, with the approximate solution constructed using a set of meshfree shape functions [9,32].

Despite these significant benefits, the high computational cost of meshfree methods has been perceived as problematic. The unique aspects of meshfree methods compared to finite element methods that result in greater computational expense are: larger bandwidths of the stiffness matrix, nontrivial imposition of essential boundary conditions, lack of simple element-based data structures, more expensive spatial numerical integration procedures, and nontrivial computation of nodal shape functions. In recent years, many of these efficiency issues with meshfree methods have been successfully addressed. For example, using singular kernel shape functions for the essential boundary nodes greatly simplifies the imposition of essential boundary conditions [8], and usage of stabilized conforming nodal integration dramatically reduces the cost of spatial integration [9]. Furthermore, for large and nonlinear problems, the initial nontrivial cost of computing shape functions becomes a relatively insignificant fraction of the overall computing effort in comparison to the nonlinear solving operations. With greatly improved efficiency in these aspects of meshfree analysis, the need to address efficiency issues in nonlinear solving operations becomes even more important.

To further reduce the computational expense associated with meshfree methods, this effort will focus on the reduction of linear solving times, which lie at the heart of most nonlinear solving algorithms. Many efficient linear equation solvers that utilize sparse storage schemes in solving $\mathbf{A}\mathbf{x} = \mathbf{b}$ have been used successfully with finite element methods. However, few of these solvers have yet been applied or explored with meshfree methods due to the a priori difficulty in determining the sparse storage requirements. Part of this difficulty arose from the need to apply essential boundary conditions using transformation procedures that modified the sparsity characteristics in a manner that was very difficult to predict [8]. However, with the boundary singular kernel shape functions mentioned previously, the essential boundary nodes recover Kronecker delta properties, obviating the need for transformation procedures. Accordingly, the sparsity characteristics of meshfree equation systems become very tractable, making it very easy to use sparse solvers in linear solving operations.

In large-scale structural analyses, the computational expense is often dominated by the cost of solving systems of coupled linear equations. In the past two decades much research has therefore been done on fast and efficient linear solving algorithms for large, potentially ill-conditioned systems. In the 1980s and early 1990s numerous very fast vector and parallel direct equation solvers were developed based on different implementations of Cholesky factorization [29,30] of symmetric positive definite systems of equations into triangular factors ($\mathbf{L}\mathbf{L}^T$), followed by the forward and backward solution of the resulting triangular systems. Notable works have been accomplished in the development of Cholesky factorization algorithms that use skyline, profile and sparse matrix storage schemes [30], of which sparse solvers tend to show the greatest performance.
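The factor-then-substitute procedure mentioned above can be sketched in a few lines of Python. This is only a dense, unoptimized illustration under the assumption of a small symmetric positive definite matrix; it does not reproduce the skyline, profile or sparse storage schemes of the cited solvers, and the names `cholesky` and `solve_spd` are hypothetical.

```python
import math


def cholesky(a):
    """Return the lower-triangular factor L with a = L L^T (a must be SPD)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            l[i][j] = (a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
    return l


def solve_spd(a, b):
    """Solve a x = b by Cholesky factorization and two triangular solves."""
    l = cholesky(a)
    n = len(b)
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = b
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # backward substitution: L^T x = y
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x


# Small illustrative system (values chosen arbitrarily for the sketch).
A = [[4.0, 2.0, 0.0], [2.0, 5.0, 1.0], [0.0, 1.0, 3.0]]
b = [8.0, 15.0, 11.0]
x = solve_spd(A, b)
```

In a dense setting the factorization costs on the order of $n^3$ operations, which is why the storage scheme and equation ordering matter so much for large systems.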
A key aspect of fast and efficient sparse solvers is that of reordering the system of equations in such a way as to minimize the amount of fill-in that occurs during factorization, since this can significantly reduce both the number of factorization operations and the memory storage requirements for the factorization of the stiffness matrix. In spite of the significant gains in efficiency with optimum re-ordering of sparse systems, the asymptotic performance of even the best sparse solvers indicates that their required memory and cpu-operations grow in proportion to $N^2$ to $N^{2.5}$ as N becomes large, and thus will not be competitive with scalable iterative solvers for sufficiently large problem sizes. Nevertheless, numerous domain decomposition methods (e.g. [3]) have been investigated with the objective of breaking extremely large systems into moderately sized and coupled subdomains on which the fast, direct solvers can still be effectively employed.

Since development of efficient solving methods appropriate for large nonlinear meshfree analysis problems is the goal of this research, parallel iterative conjugate gradient equation solvers with different preconditioning methods are explored here as potential solutions and their performance is compared with that of a very fast direct, sparse solver. It is believed that CPU and memory requirements that scale like N as N becomes large can be achieved by employing conjugate gradient solvers in conjunction with appropriate preconditioning methods and parallelization. The recognized advantages of conjugate gradient solvers that make them potentially attractive here are: (1) cpu-effort scales linearly with problem size when used with diagonal preconditioning; (2) explicit storage of the assembled global tangent stiffness operator with fill-in is not required; (3) they are easy to parallelize; and (4) they have been used successfully with FEM. Therefore, the performance of conjugate gradient solvers with meshfree methods is investigated and issues arising from the nature of meshfree methods are noted. In addition, alternative preconditioning methods in conjunction with a conjugate gradient solver are implemented and tested.
All the operations of the preconditioned conjugate gradient (PCG) solvers are parallelized for shared memory environments, and the parallel performance characteristics of the PCG solvers are tested.

The remainder of this paper is organized as follows. Section 2 presents a brief review of meshfree methods including shape functions, variational continuum formulation, and notable characteristics of meshfree methods. In Section 3 preconditioned conjugate gradient solvers and their implementation for meshfree analyses are discussed along with the details of an incomplete Cholesky factorization preconditioning algorithm. Issues pertaining to parallel implementation of the PCG solver with IC factorization are covered in Section 4. Many numerical examples are presented in Section 5 to compare the performance of direct and different PCG solvers. The manuscript closes with both discussion and conclusions in Section 6.

2. OVERVIEW OF MESHFREE METHODS

2.1 Shape Functions

The major difference between finite element methods and meshfree methods lies in the spaces from which the approximation functions are constructed. In FEM, approximation functions are developed with shape functions that are both node and element based. In meshfree methods, the shape functions are only node-based. The Reproducing Kernel (RK) approximation in the Reproducing Kernel Particle Method (RKPM) [24] is one of the most commonly used function approximation methods, and it is employed for this research. The discrete RK approximation of a variable field $u$ is

$$u^h(\mathbf{x}) = \sum_{I=1}^{NP} \bar{\Phi}_a(\mathbf{x}; \mathbf{x}-\mathbf{x}_I)\, d_I \tag{2.1}$$

where NP is the number of discrete points used in the approximation, $d_I$ are the coefficients of the approximation functions, and $\bar{\Phi}_a(\mathbf{x}; \mathbf{x}-\mathbf{x}_I)$ is the reproducing kernel, which is expressed by

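$$\bar{\Phi}_a(\mathbf{x}; \mathbf{x}-\mathbf{x}_I) = C(\mathbf{x}; \mathbf{x}-\mathbf{x}_I)\, \Phi_a(\mathbf{x}-\mathbf{x}_I) \tag{2.2}$$

where $\Phi_a(\mathbf{x}-\mathbf{x}_I)$ is a kernel or weight function, $a$ is a measure of the size of the support, and $C(\mathbf{x}; \mathbf{x}-\mathbf{x}_I)$ is a correction function that is used to satisfy the n-th order reproducing conditions

$$\sum_{I=1}^{NP} \bar{\Phi}_a(\mathbf{x}; \mathbf{x}-\mathbf{x}_I)\, x_{1I}^p\, x_{2I}^q\, x_{3I}^r = x_1^p\, x_2^q\, x_3^r \quad \text{for } p+q+r = 0, 1, \ldots, n;\ p \ge 0;\ q \ge 0;\ r \ge 0 \tag{2.3}$$

where $x_{iI}$ is the nodal value of the i-th Cartesian component $x_i$ at node I. The correction function $C(\mathbf{x}; \mathbf{x}-\mathbf{x}_I)$ is constructed by a linear combination of complete n-th order monomial functions

$$C(\mathbf{x}; \mathbf{x}-\mathbf{x}_I) = \sum_{p+q+r=0}^{n} (x_1 - x_{1I})^p (x_2 - x_{2I})^q (x_3 - x_{3I})^r\, b_{pqr}(\mathbf{x}) = \mathbf{H}^T(\mathbf{x}-\mathbf{x}_I)\, \mathbf{b}(\mathbf{x}) \tag{2.4}$$

where $b_{pqr}(\mathbf{x})$ are the coefficients of the monomial basis functions that are functions of $\mathbf{x}$, $\mathbf{b}(\mathbf{x})$ is a vector of $b_{pqr}(\mathbf{x})$, and $\mathbf{H}(\mathbf{x}-\mathbf{x}_I)$ is a vector containing the monomial basis functions

$$\mathbf{H}^T(\mathbf{x}-\mathbf{x}_I) = \left[\,1,\ x_1 - x_{1I},\ x_2 - x_{2I},\ x_3 - x_{3I},\ \ldots,\ (x_3 - x_{3I})^n\,\right] \tag{2.5}$$

Substituting Eq. (2.4) into Eq. (2.3), the coefficients $\mathbf{b}(\mathbf{x})$ are solved by

$$\mathbf{M}(\mathbf{x})\, \mathbf{b}(\mathbf{x}) = \mathbf{H}(\mathbf{0}) \tag{2.6}$$

where

$$\mathbf{M}(\mathbf{x}) = \sum_{I=1}^{NP} \mathbf{H}(\mathbf{x}-\mathbf{x}_I)\, \mathbf{H}^T(\mathbf{x}-\mathbf{x}_I)\, \Phi_a(\mathbf{x}-\mathbf{x}_I) \tag{2.7}$$

is the moment matrix of $\Phi_a$. Using the solution of Eqs. (2.2), (2.4) and (2.6), the reproducing kernel function approximation is obtained by

$$u^h(\mathbf{x}) = \sum_{I=1}^{NP} \mathbf{H}^T(\mathbf{0})\, \mathbf{M}^{-1}(\mathbf{x})\, \mathbf{H}(\mathbf{x}-\mathbf{x}_I)\, \Phi_a(\mathbf{x}-\mathbf{x}_I)\, d_I = \sum_{I=1}^{NP} \Psi_I(\mathbf{x})\, d_I \tag{2.8}$$

in which $\Psi_I(\mathbf{x})$, $I = 1, 2, \ldots, NP$, are the RK shape functions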

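$$\Psi_I(\mathbf{x}) = \mathbf{H}^T(\mathbf{0})\, \mathbf{M}^{-1}(\mathbf{x})\, \mathbf{H}(\mathbf{x}-\mathbf{x}_I)\, \Phi_a(\mathbf{x}-\mathbf{x}_I). \tag{2.9}$$

A cubic spline function is employed as the kernel function in this research such that

$$\Phi_a(z) = \begin{cases} \dfrac{2}{3} - 4z^2 + 4z^3 & \text{for } 0 \le z \le \dfrac{1}{2} \\[2mm] \dfrac{4}{3} - 4z + 4z^2 - \dfrac{4}{3}z^3 & \text{for } \dfrac{1}{2} < z \le 1 \\[2mm] 0 & \text{for } z > 1 \end{cases} \tag{2.10}$$

where $z$ is $\|\mathbf{x}-\mathbf{x}_I\| / a$. The quantity $a$ is commonly called the shape function support size, and is a principal factor in determining the connectivity among discrete nodal points of the model.

2.2 Variational Continuum Formulation

Meshfree analysis has been implemented for a variety of linear and nonlinear structural computations of static problems using reproducing kernel particle methods [7]. We can consider a body that is deformed from initial configuration $\Omega_X$ with boundary $\Gamma_X$ to a deformed configuration $\Omega_x$ with deformed boundary $\Gamma_x$. The body is subjected to a body force field $b_i$ in $\Omega_x$, a surface traction field $h_i$ on the natural boundary $\Gamma_x^{h_i}$, and a prescribed displacement field $g_i$ on the essential boundary $\Gamma_x^{g_i}$. We denote the particle positions in the initial configuration $\Omega_X$ by $\mathbf{X}$ and those in the deformed configuration $\Omega_x$ at time $t$ by a mapping function $\varphi$ where $\mathbf{x} = \varphi(\mathbf{X}, t)$. The strong form of equilibrium for general linear or nonlinear continuum problems is as follows: Given $b_i(\mathbf{X}, t)$, $h_i(\mathbf{X}, t)$, and $g_i(\mathbf{X}, t)$, find $\mathbf{u}: \Omega_x \times [0, T] \to \mathbb{R}^3$ such that

$$\tau_{ij,j} + b_i = 0 \quad \text{in } \Omega_x \quad \text{(Equilibrium condition)} \tag{2.11}$$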

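$$\tau_{ij}\, n_j = h_i \quad \text{on } \Gamma_x^{h_i} \quad \text{(Natural boundary condition)} \tag{2.12}$$

$$u_i = g_i \quad \text{on } \Gamma_x^{g_i} \quad \text{(Essential boundary condition)} \tag{2.13}$$

Above, $\tau_{ij}$ is the Cauchy stress obtained from a general constitutive law of the form $\boldsymbol{\tau} = \hat{\boldsymbol{\tau}}(\mathbf{F})$ in which $\mathbf{F} = \partial\mathbf{x}/\partial\mathbf{X}$ is the deformation gradient. In this paper, a Saint Venant-Kirchhoff elastic material is assumed for which the second Piola-Kirchhoff stress $S_{IJ}$ is calculated by

$$S_{IJ} = \lambda E_{KK}\, \delta_{IJ} + 2\mu E_{IJ} = C_{IJKL}\, E_{KL} \tag{2.14}$$

in which $\mathbf{E}$ is the Green-Lagrange strain tensor $E_{KL} = \frac{1}{2}\left[F_{kK} F_{kL} - \delta_{KL}\right]$. The second Piola-Kirchhoff stress is related to the Cauchy stress by

$$\tau_{ij} = \frac{1}{J}\, F_{iM}\, S_{MN}\, F_{jN} \tag{2.15}$$

where $J = \det(\mathbf{F})$. Using standard methods in vector calculus, it can be shown that the weak or variational form corresponding to the strong form above is:

$$\int_{\Omega_x} \delta u_{i,j}\, \tau_{ij}\, d\Omega = \int_{\Omega_x} \delta u_i\, b_i\, d\Omega + \int_{\Gamma_x^{h_i}} \delta u_i\, h_i\, d\Gamma. \tag{2.16}$$

2.3 Introduction of Meshfree Shape Functions

By introducing the RK approximation from Eq. (2.8) into Eq. (2.16), the following matrix equilibrium equations at the A-th unrestrained node are obtained

$$\mathbf{r}_A = \mathbf{f}_A^{int} - \mathbf{f}_A^{ext} = \mathbf{0} \tag{2.17}$$

in which

$$\mathbf{f}_A^{int} = \int_{\Omega_x} \mathbf{B}_A^T : \boldsymbol{\tau}\, d\Omega \tag{2.18}$$

$$\mathbf{f}_A^{ext} = \int_{\Omega_x} \Psi_A\, \mathbf{b}\, d\Omega + \int_{\Gamma_x^{h}} \Psi_A\, \mathbf{h}\, d\Gamma \tag{2.19}$$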

where: $\mathbf{B}_A$ is the nodal strain-displacement matrix; $\Psi_A$ denotes the nodal shape function for the A-th node; $\mathbf{f}_A^{int}$ represents the internal forces on node A; and $\mathbf{f}_A^{ext}$ represents the external forces on node A due to body forces and surface tractions. The tangent stiffness matrix associated with the system (2.17) is written:

$$\frac{d f^{int}_{Ai}}{d u_{Bl}} = \left(K_{AB}\right)_{il} = \int_{\Omega_x} B^A_{ji}\, c_{jk}\, B^B_{kl}\, d\Omega + \int_{\Omega_x} \Psi_{A,j}\, \tau_{jk}\, \Psi_{B,k}\, \delta_{il}\, d\Omega \tag{2.20}$$

Above, $c_{jk}$ is the condensed form of the spatial elasticity tensor associated with (2.14) and (2.15).

2.4 General Newton's Methods

For linear static problems, the internal force vector is the product of the linear stiffness matrix $\mathbf{K}$ and the displacement vector $\mathbf{d}$, and only a single factorization is required. For nonlinear static problems, the system of nonlinear global force-balance equations at a given time or load step n+1 has the form

$$\mathbf{r}(\mathbf{d}_{n+1}) = \mathbf{0} \tag{2.21}$$

which is generally solved by Newton's method with line searching for a fixed time step ([5,3], for example). The sequence to update the global displacement vector at the (n+1)-th time step is shown in the algorithm of Box 2.1, in which $\alpha_\nu$ is the line search parameter chosen to satisfy the standard line search criterion

$$\left|\boldsymbol{\delta}^{\nu} \cdot \mathbf{r}\!\left(\mathbf{d}^{\nu+1}_{n+1}\right)\right| < \mathrm{STOL}\, \left|\boldsymbol{\delta}^{\nu} \cdot \mathbf{r}\!\left(\mathbf{d}^{\nu}_{n+1}\right)\right| \tag{2.22}$$

where STOL is a tolerance parameter controlling the accuracy of the search. For the linear

solving phase, the global stiffness matrix $\mathbf{K}$ needs to be updated each iteration in a full Newton-Raphson method, but only occasionally if a modified Newton's method is used.

2.5 Definition of Nodal Neighbors and Data Structure

For enhanced accuracy and efficiency, the global stiffness matrix and force vector in Eqns. (2.18) to (2.20) are integrated in this work using a stabilized conforming nodal integration method (SCNI) [9]. In SCNI, the nodal representative areas are defined using Voronoi cells (as shown in Figure 2.1) for local strain smoothing to meet linear exactness in the Galerkin approximation for optimum rate of integration [9]. It follows that the connective topology of the global stiffness matrix structure in this work is determined by such factors as Lagrangian nodal coordinates, Voronoi integration cell specifications, and shape function support sizes.

It is necessary to explicitly quantify the connective topology of the stiffness matrix in order to store and manipulate it with the data structures used by sparse solvers. To facilitate this discussion, a number of definitions are presented here. First, the set of all nodes in a meshfree model, with each identified by an integer value, is denoted $\eta$. Each node will have its own support region, which is the space covered by its nodal shape function. For example, the support region of node I is denoted by $\Omega_I = \{\mathbf{X} \in \Omega_X \mid \Psi_I(\mathbf{X}) \ne 0\}$. In addition, each node I in the model will have its own Voronoi integration cell region denoted $V_I$. Furthermore, each node I in the model will have its own set of coupled nodes $C_I$ and its own set of neighbor nodes $N_I$, both of which are defined mathematically as follows:

$$C_I = \{J \in \eta \mid \Omega_I \cap \Omega_J \ne \emptyset\}, \qquad N_I = \{J \in \eta \mid V_I \cap \Omega_J \ne \emptyset\} \tag{2.23}$$

In other words, $C_I$ is node I's set of coupled nodes whose support regions overlap that of node I.
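The two set definitions in Eq. (2.23) can be illustrated with a one-dimensional sketch in Python. The simplifications here are assumptions made for illustration only: supports are taken as open intervals of a given radius about each node, and the Voronoi cell of a node runs between the midpoints to its adjacent nodes. The function name `coupled_and_neighbor_sets` and the support radius of 1.2 nodal spacings are hypothetical, not values from the paper.

```python
def coupled_and_neighbor_sets(coords, radius):
    """Return (coupled, neighbors) lists for sorted 1D nodal coordinates.

    coupled[i]   : nodes J whose support overlaps the support of node i
    neighbors[i] : nodes J whose support overlaps the Voronoi cell of node i
    """
    n = len(coords)
    cells = []
    for i, x in enumerate(coords):
        # 1D Voronoi cell: bounded by midpoints to the adjacent nodes.
        left = coords[0] if i == 0 else 0.5 * (coords[i - 1] + x)
        right = coords[-1] if i == n - 1 else 0.5 * (x + coords[i + 1])
        cells.append((left, right))

    coupled, neighbors = [], []
    for i, xi in enumerate(coords):
        # Two open intervals of equal radius overlap when centers are < 2*radius apart.
        coupled.append([j for j, xj in enumerate(coords) if abs(xi - xj) < 2.0 * radius])
        lo, hi = cells[i]
        # Support (xj - radius, xj + radius) intersects the cell [lo, hi].
        neighbors.append([j for j, xj in enumerate(coords)
                          if xj - radius < hi and xj + radius > lo])
    return coupled, neighbors


# Seven equally spaced nodes with unit spacing (illustrative values only).
coords = [float(i) for i in range(7)]
coupled, neighbors = coupled_and_neighbor_sets(coords, radius=1.2)
```

Under these definitions each node belongs to its own sets, and because a node's integration cell lies inside its own support, every neighbor set in this sketch is a subset of the corresponding coupled set.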

In a similar fashion, $N_I$ is the set of node I's neighbor nodes, with members being those whose support region overlaps node I's Voronoi integration cell $V_I$. Figure 2.1(a) shows a typical node I with its support region $\Omega_I$ covered by node I's nodal shape function and Figure 2.1(b) shows the typical Voronoi integration cell for a node B. Since a Lagrangian continuum formulation is used, these overlap computations of each node's set of coupled nodes $C_I$ and neighbor nodes $N_I$ can be done once and for all in the undeformed configuration. Each node's sets of coupled and neighbor nodes are stored in the following data structures:

P(i), (i=1,numnp): number of nodes whose support covers the integration cell of node i. This is the number of neighbor nodes for node i.
NODE(i, j), (i=1,numnp; j=1,P(i)): list of neighbor nodes for node i.
NB(i), (i=1,numnp): number of coupled nodes for each node i.
NBL(i, j), (i=1,numnp; j=1,NB(i)): list of coupled nodes for each node i.

The algorithm for assembly of the global stiffness matrix from nodal stiffness matrices is provided in Box 2.2.

2.6 Special Characteristics of Meshfree Methods

The special characteristics of meshfree methods that contribute to the high computational cost of solving linear systems $\mathbf{K}\boldsymbol{\delta} = \mathbf{r}$ are large bandwidths of the global stiffness matrix and severe ill conditioning of the global stiffness matrix with increasing shape function support size. Figure 2.2 shows an equally spaced discretization in order to illustrate why meshfree methods have global stiffness matrices with large bandwidths. When a normalized shape function support size of unity is used, the number of coupled nodes at the i-th integration point in Figure 2.2 is 5, whereas the number of neighbor nodes with piecewise linear finite element shape functions is 3. The number of neighbor nodes is 25 in two-dimensional problems and 125 in three-dimensional

problems with meshfree methods, whereas it is 9 in two-dimensional problems and 27 in three-dimensional problems with finite element methods using bilinear and tri-linear shape functions. When it is considered that normalized support sizes of 1.5 to 2.5 are commonly used with meshfree methods, it is clear that the bandwidth of $\mathbf{K}$ is typically much larger than that for FEM. This larger bandwidth of the global stiffness matrix consequently requires more CPU time in solving linear systems $\mathbf{K}\boldsymbol{\delta} = \mathbf{r}$. With this characteristic of meshfree methods in mind, the performance of direct equation solvers and iterative equation solvers is investigated and compared.

The ill conditioning of the global stiffness matrix with larger shape function support sizes is another unique feature of meshfree methods that can result in higher computational cost. The reason for this characteristic is that as the normalized support sizes increase, the resulting shape functions become more and more linearly dependent if the order of basis functions stays unchanged. While ill conditioning of the stiffness matrix does not generally present much of a problem for direct equation solvers, it can present a very serious problem with iterative equation solvers.

3. PRECONDITIONED CONJUGATE GRADIENT SOLVERS

3.1 Jacobi Preconditioned Conjugate Gradient Solvers

The conjugate gradient method, originally developed by Hestenes and Stiefel [2], is an iterative algorithm for solving systems of linear equations $\mathbf{A}\mathbf{x} = \mathbf{b}$, for real symmetric positive definite matrices $\mathbf{A} \in \mathbb{R}^{n \times n}$, without assembling or factorizing the matrix $\mathbf{A}$. The conjugate gradient method works well on matrices that are either well conditioned or have just a few distinct eigenvalues, and so we often attempt to precondition a linear system so that the matrix of coefficients assumes one of these nice forms. Given a symmetric positive definite $\mathbf{A} \in \mathbb{R}^{n \times n}$, $\mathbf{b}$

$\in \mathbb{R}^n$, a symmetric positive definite preconditioner $\mathbf{B}$, and initial guess $\mathbf{x}_0$ ($\mathbf{A}\mathbf{x}_0 \approx \mathbf{b}$), the algorithm of Box 3.1 for modified PCG, based upon the conjugate gradient algorithm, solves the linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$.

The advantage of iterative equation solvers such as Preconditioned Conjugate Gradient (PCG) methods (Box 3.1) and memoryless quasi-Newton methods [3] is that they require less storage than sparse or banded solvers, since the global stiffness matrix $\mathbf{K}$ need not be formed and factorized. A goal that is sometimes achievable with PCG iterative solvers is to have the asymptotic number of operations grow in proportion to the problem size N as N becomes large. PCG solvers are very attractive due to their relative ease of parallelization on most computing architectures and the fact that when a high degree of accuracy is not required in the solution, the number of iterations can be dramatically reduced, thus trimming down the computational expense accordingly [26]. Despite these strong points, which can lead iterative equation solvers to sometimes dramatically outperform even the fastest direct solvers on moderately sized problems, the major disadvantage of iterative methods is that when the system of equations to be solved is ill-conditioned, the convergence characteristics of the method suffer quite dramatically.

The equation for forming the incremental force for each node in the PCG algorithm is

$$\mathbf{s}_{m+1} = \mathbf{s}_m - \alpha_m\, \mathbf{A}\, \mathbf{g}_m \tag{3.1}$$

where the computation of the product $\mathbf{A}\cdot\mathbf{g}$ is performed not at the global level, but at the nodal level by storing the nodal contributions to $\mathbf{A}$, and is composed of the following steps: (1) localize the global incremental displacement vector $\mathbf{g}$ to nodal level; (2) gather all coupled nodal stiffness matrices $\mathbf{A}$ from storage; (3) apply nodal stiffness and nodal displacements to compute the product $\mathbf{A}\cdot\mathbf{g}$; (4) assemble all nodal contributions into the global force vector.
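The four steps above can be sketched in Python together with a textbook Jacobi-preconditioned conjugate gradient loop. This is not the algorithm of Box 3.1, which is not reproduced in the text; it is a minimal sketch assuming that every nodal block is stored explicitly in a dictionary keyed by node pairs. The names `nodal_matvec`, `jacobi_pcg` and `build_chain`, and the small chain model, are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def nodal_matvec(blocks, coupled, g, ndof):
    """Form s = A g node by node; blocks[(i, j)] is the ndof x ndof nodal block."""
    s = [0.0] * len(g)
    for i, nodes in enumerate(coupled):          # node i owns its own rows of s
        for j in nodes:
            g_j = g[j * ndof:(j + 1) * ndof]     # (1) localize g to node j
            a_ij = blocks[(i, j)]                # (2) gather the coupled nodal block
            for r in range(ndof):                # (3) multiply and (4) assemble
                s[i * ndof + r] += dot(a_ij[r], g_j)
    return s


def jacobi_pcg(matvec, diag, b, tol=1e-10, max_iter=500):
    """Standard PCG with the preconditioner B = diag(A); returns (x, iterations)."""
    x = [0.0] * len(b)
    r = list(b)                                  # residual for a zero initial guess
    b_norm = math.sqrt(dot(b, b))
    if b_norm == 0.0:
        return x, 0
    z = [ri / di for ri, di in zip(r, diag)]
    p = list(z)
    rz = dot(r, z)
    for k in range(1, max_iter + 1):
        ap = matvec(p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return x, k
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


def build_chain(n):
    """Illustrative SPD model: n nodes, 1 dof each, each coupled to adjacent nodes."""
    blocks, coupled, diag = {}, [], []
    for i in range(n):
        nodes = [j for j in (i - 1, i, i + 1) if 0 <= j < n]
        coupled.append(nodes)
        for j in nodes:
            blocks[(i, j)] = [[2.0 + 0.5 * i]] if i == j else [[-1.0]]
        diag.append(2.0 + 0.5 * i)
    return blocks, coupled, diag


blocks, coupled, diag = build_chain(8)
b = [1.0] * 8
x, iterations = jacobi_pcg(lambda v: nodal_matvec(blocks, coupled, v, 1), diag, b)
residual = [bi - si for bi, si in zip(b, nodal_matvec(blocks, coupled, x, 1))]
```

Because each pass of the outer loop in `nodal_matvec` writes only to the rows owned by one node, the passes are independent of one another, which is the property a shared memory implementation can exploit.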
Different preconditioning methods for conjugate gradient solvers lead to quite a range of different performance characteristics, making the choice of an appropriate preconditioner $\mathbf{B}$ quite important for the practical applications of the PCG algorithm. The rate of convergence in the energy norm ($\|\mathbf{x}\|_A^2 = \mathbf{x}^T\mathbf{A}\mathbf{x}$) is bounded as follows:

$$\|\mathbf{x} - \mathbf{x}_m\|_A \le 2\, \|\mathbf{x} - \mathbf{x}_0\|_A \left(\frac{\sqrt{C} - 1}{\sqrt{C} + 1}\right)^m \tag{3.2}$$

where $C$ is the condition number of $\mathbf{B}^{-1}\mathbf{A}$. Since the right hand side increases with growing condition number, lower condition numbers usually accelerate convergence. It is therefore desired to have $\mathbf{B}^{-1}$ approximate $\mathbf{A}^{-1}$ in the sense that $C \to 1$, while retaining a computationally efficient structure. An extremely simple and yet sometimes very effective preconditioner is diagonal scaling in which $\mathbf{B} = \mathrm{diag}(\mathbf{K})$ [4,5]. This is often referred to as Jacobi acceleration or the Jacobi conjugate gradient method (JCG).

3.2 Alternative Pre-conditioning Strategies

When the convergence rate of the JCG method is slow, other more robust preconditioning strategies should be used that do not compromise the inherent benefits of iterative equation solvers. Among popular preconditioning methods are Incomplete Cholesky Factorization (IC) preconditioning [3,6], element-by-element preconditioning [9], polynomial preconditioning [3,6] and multigrid preconditioning []. IC preconditioning methods are very effective in terms of accelerating the convergence in accordance with (3.2). However, IC factorization algorithms can be difficult to parallelize due to the recursive nature of the computation. On the other hand, polynomial preconditioners are easy to parallelize since they only involve the computation of matrix-vector operations, but this method is not as powerful as the IC preconditioning method. To overcome difficulties with parallelization of IC factorization, incomplete block Cholesky factorization preconditioners using matrix blocks as basic entities were proposed [2,10,25].
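To make the role of the condition number in the bound (3.2) concrete, the following Python sketch evaluates the right hand side factor and counts how many iterations the bound requires before it falls below a chosen tolerance. It evaluates the worst-case bound only, so actual PCG iteration counts are often lower. The function names and the sample condition numbers are hypothetical.

```python
import math


def cg_error_bound(cond, k):
    """Relative energy-norm error bound after k iterations for condition number cond."""
    root = math.sqrt(cond)
    return 2.0 * ((root - 1.0) / (root + 1.0)) ** k


def iterations_for_tolerance(cond, tol):
    """Smallest k for which the bound drops to tol or below."""
    if cond < 1.0 or tol <= 0.0:
        raise ValueError("require cond >= 1 and tol > 0")
    k = 0
    while cg_error_bound(cond, k) > tol:
        k += 1
    return k


# Compare a well-preconditioned system with a poorly conditioned one
# (condition numbers chosen arbitrarily for illustration).
k_good = iterations_for_tolerance(1.0e2, 1.0e-6)
k_poor = iterations_for_tolerance(1.0e4, 1.0e-6)
```

For large condition numbers the required iteration count in this bound grows roughly in proportion to $\sqrt{C}$, which is why a preconditioner that lowers the condition number of $\mathbf{B}^{-1}\mathbf{A}$ pays off even when each iteration becomes more expensive.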

The Crout element-by-element preconditioner was proposed by Hughes et al. [9] to maintain the element-based data structure in finite element codes. With this approach, the additional memory required for the preconditioner is reduced and no CPU effort is needed to define a new data structure. Recently Grote and Huckle [7] proposed a more sophisticated method for parallel preconditioning with sparse approximate inverses. With this method, a sparse approximate inverse is computed explicitly in parallel and it provides full control over the sparsity and the quality of the preconditioner. These capabilities enable one to choose an optimal preconditioner for a particular problem and architecture.

3.3 The Incomplete Cholesky Factorization Preconditioner

One of the most important preconditioning strategies involves computing an IC factorization of $\mathbf{A}$. The idea behind this approach is to calculate a lower triangular matrix $\mathbf{H}$ with the property that $\mathbf{H}$ has some tractable sparsity structure and is somehow close to $\mathbf{A}$'s exact Cholesky factor $\mathbf{G}$. The preconditioner is then taken to be $\mathbf{B} = \mathbf{H}\mathbf{H}^T$. To appreciate this choice, consider the two following facts. First, there exists a unique symmetric positive definite matrix $\mathbf{C}$ such that $\mathbf{B} = \mathbf{C}\mathbf{C}^T$ and $\mathbf{B} = \mathbf{C}^T\mathbf{C}$. Second, there exists an orthogonal transformation $\mathbf{Q}$ such that $\mathbf{C} = \mathbf{Q}\mathbf{H}^T$. Accordingly, $\mathbf{B}^{-1}\mathbf{A}$ is similar to

$$\mathbf{C}^{-1}\mathbf{A}\,\mathbf{C}^{-T} = (\mathbf{H}\mathbf{Q}^T)^{-1}\, \mathbf{A}\, (\mathbf{Q}\mathbf{H}^T)^{-1} = \mathbf{Q}\left(\mathbf{H}^{-1}\mathbf{G}\mathbf{G}^T\mathbf{H}^{-T}\right)\mathbf{Q}^T \tag{3.3}$$

Thus, the better $\mathbf{H}$ approximates $\mathbf{G}$, the smaller the condition number of $\mathbf{B}^{-1}\mathbf{A}$, and the better the performance of the PCG algorithm.

A simple, effective $\mathbf{H}$ that approximates $\mathbf{G}$ is computed by stepping through the sparse Cholesky factorization neglecting fill-in, and it is described in the algorithm of Box 3.2. The

number of operations required for the IC factorization neglecting fill-in is as follows

$$\sum_{k=1}^{N} \left(P_k \cdot NDOF\right)^2 \tag{3.4}$$

where N is the number of integration points or nodes in the model, and $P_k$ is the number of neighbor nodes coupled to the k-th node.

For what is called here the Incomplete Cholesky Conjugate Gradient (ICCG) method, the global stiffness matrix is formed and stored in the so-called compressed row sparse format, which stores only the non-zero terms of the stiffness matrix row by row. Box 3.3 is an example of matrix storage with the compressed row sparse format. Since the proposed algorithm for IC factorization neglects fill-in, the data structure for the lower triangular matrix $\mathbf{H}$ is identical to that of the stiffness matrix in sparse format. The size of the diagonal preconditioner is very small in comparison with the global stiffness matrix in large-scale problems, and thus the memory required for the ICCG method is almost twice as big as that for the JCG (Jacobi preconditioned conjugate gradient) method.

3.4 Required Storage with PCG Solvers

In the finite element method (FEM), storage for the element level stiffness matrices is well defined in that one can predict the size according to the element type used (i.e. the size of the element stiffness) and the number of elements in the model. In addition, symmetry in element level stiffness matrices enables further savings in storage. Typically, the storage of all the element level stiffness matrices requires far less memory than that of the assembled global stiffness matrix in skyline, profile, or banded form.

Since there are no elements in meshfree methods, the nodal stiffness matrix assumes a role

analogous to that of the element level stiffness matrix in FEM. The nodal stiffness matrix stores the contributions to the global stiffness matrix for any two coupled nodes. To illustrate, consider the simple one-element four-node model shown in Figure 3.1 in which each node has two degrees of freedom. In FEM, the element level stiffness matrix is symmetric, and thus only the upper triangle needs to be stored. The size of the element level stiffness matrix is NEE (NEE + 1) / 2 = 36 words, where NEE = NEN × NDOF is the number of element equations (8).

If the very simple model in Figure 3.1 were treated as a meshfree model, each contribution to the global stiffness matrix by two coupled nodes would be given by $\mathbf{K}_{ab}$ where a and b indicate specific nodes. The sixteen nodal stiffness matrices for this model are: $\mathbf{K}_{aa}$, $\mathbf{K}_{ab}$, $\mathbf{K}_{ac}$, $\mathbf{K}_{ad}$, $\mathbf{K}_{ba}$, $\mathbf{K}_{bb}$, $\mathbf{K}_{bc}$, $\mathbf{K}_{bd}$, $\mathbf{K}_{ca}$, $\mathbf{K}_{cb}$, $\mathbf{K}_{cc}$, $\mathbf{K}_{cd}$, $\mathbf{K}_{da}$, $\mathbf{K}_{db}$, $\mathbf{K}_{dc}$, $\mathbf{K}_{dd}$, and except for $\mathbf{K}_{aa}$, $\mathbf{K}_{bb}$, $\mathbf{K}_{cc}$, $\mathbf{K}_{dd}$ they are generally nonsymmetric, unlike element level stiffness matrices in FEM. However, it is not necessary to store the entire set of nodal stiffness matrices, since $\mathbf{K}_{ij}$ is equal to $\mathbf{K}_{ji}^T$ where i and j are node numbers of coupled nodes. Consequently, the minimum set of nodal stiffness matrices to be constructed and stored is: $\mathbf{K}_{ab}$, $\mathbf{K}_{ac}$, $\mathbf{K}_{ad}$, $\mathbf{K}_{bc}$, $\mathbf{K}_{bd}$, $\mathbf{K}_{cd}$ as well as the upper (or lower) triangular parts of $\mathbf{K}_{aa}$, $\mathbf{K}_{bb}$, $\mathbf{K}_{cc}$, and $\mathbf{K}_{dd}$. The cumulative size of all the nodal stiffness matrices here would be 6 NDOF² + 4 NDOF (NDOF + 1) / 2 = 36 words, which is exactly the same as that for the stiffness matrix in sparse form without fill-in. In general, the memory S required to store the nodal stiffness matrices in a meshfree model would be computed as follows

$$S = \sum_{k=1}^{N} \left[\frac{P_k}{2}\, NDOF^2 + \frac{1}{2}\, NDOF\,(NDOF + 1)\right] = \frac{1}{2}\, NDOF \left[\, NDOF \sum_{k=1}^{N} P_k + N\,(NDOF + 1) \right] \tag{3.6}$$

where N is the number of unrestrained nodal points in the model and $P_k$ is the number of nodes coupled with the k-th node. In the current example, each of the four nodes has three coupled nodes, yielding S = 36 words.
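The storage counts of the four-node example can be checked with a short Python sketch. It assumes, as in the example above, that each off-diagonal nodal block is stored once per coupled pair and that only one triangle of each diagonal block is kept; the function names are hypothetical.

```python
def fem_element_storage(nen, ndof):
    """Words needed for the upper triangle of one symmetric element stiffness matrix."""
    nee = nen * ndof                      # number of element equations
    return nee * (nee + 1) // 2


def meshfree_nodal_storage(coupled_counts, ndof):
    """Words needed for nodal stiffness blocks; coupled_counts[k] = P_k (other nodes)."""
    total = sum(coupled_counts)
    if total % 2 != 0:
        raise ValueError("coupling must be mutual, so the sum of P_k must be even")
    off_diagonal = (total // 2) * ndof * ndof                        # one full block per pair
    diagonal = len(coupled_counts) * ndof * (ndof + 1) // 2          # one triangle per node
    return off_diagonal + diagonal


# One-element, four-node model with two degrees of freedom per node.
fem_words = fem_element_storage(nen=4, ndof=2)
meshfree_words = meshfree_nodal_storage([3, 3, 3, 3], ndof=2)
```

Both counts come out to 36 words for this model, matching the figures in the text; the two diverge once the meshfree support couples nodes that would not share an element.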

4. PARALLEL SOLUTION SCHEMES

4.1 Paradigms and Overview

Many recently developed structural analysis programs take full advantage of parallelism. Those codes in which parallelism is considered from the outset of development can take greatest advantage of parallelism, with the capacity to use a large number of heterogeneous CPUs with mesh partitioning and message passing. On the other hand, when codes that were initially developed to run serially are made parallel, the shared memory paradigm with a relatively small number of CPUs is typically adopted, and significant gains in performance can be achieved with relative ease.

Message passing is used in parallel finite element solution methods based on domain decomposition approaches [2]. The finite element domain is decomposed into a set of subdomains, and each of these is assigned to a processor. It is critical to balance the computational load among processors while also minimizing the number of partition interfaces. Toward this end, software packages like Metis [22] are commonly used to provide finite element mesh partitions of nearly equal size that achieve load balancing. Danielson et al. [] presented a parallel computational implementation of meshfree methods for explicit dynamic analysis. Several partitioning schemes were considered and explicit MPI message passing statements were used for all communications among partitions on different processors.

Many compiler directives and libraries have been developed for parallel computing. Among them, MPI and OpenMP are widely used due to their portability. OpenMP is a specification for a set of compiler directives, library routines, and environment variables that can be used to specify shared memory parallelism. It is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications. Message Passing Interface (MPI) is a library specification for message passing. It is a practical, portable,

efficient, and flexible standard for message passing in a distributed memory communication environment. Parallel programming languages such as Charm++ [20,2] also provide portability, dynamic load balancing, and many advanced features.

4.2 Parallel JCG Solver

The procedure of forming the incremental force in the conjugate gradient algorithm is parallelized as shown in Box 4.1. When all nodal contributions are assembled into a global vector, simultaneous updating of the global vector by many different processors can cause waiting time by these processors, leading to a reduction in the parallel speed-up rate. To avoid this concurrency, each iteration of the outer loop in Box 4.1 is responsible for generating specific entries in the vector $\mathbf{A}\cdot\mathbf{g}$. This procedure for a simple four-node model with three degrees of freedom per node is illustrated in Figure 4.1. One iteration of the inner loop of Box 4.1 computes the product of nodal stiffness matrix $\mathbf{A}_{IJ}$ and localized nodal displacement $\mathbf{g}_J$ for each neighbor node, and one iteration of the outer loop assembles three rows of $\mathbf{A}\cdot\mathbf{g}$. [OpenMP, the specification for shared memory parallelism, is employed to parallelize this algorithm and an HP Exemplar S-class 16-processor computer is used to test the parallel performance. For wall-clock timing of processes, Fortran 90 offers an intrinsic function call date_and_time, with a resolution of one millisecond.]

Although this parallel algorithm is based on inner loop-level parallelism in order to avoid critical sections that can cause significant processor waiting time, the workload of the inner loop cannot be perfectly balanced among the processors and some processors will typically be idle while $\mathbf{s} = \mathbf{A}\,\mathbf{g}$ is computed for the last few nodes. This idle time will be proportional to the maximum number of coupled nodes for the last few nodes processed in each conjugate gradient iteration and naturally accumulates with the number of required conjugate gradient

iterations. Consequently, this algorithm will not show ideal scalability. The inner loop in the algorithm above is typically not used (at least by the authors) when using conjugate gradient solvers with FEM. Instead, critical sections are used to avoid updating the global vector simultaneously. The total number of element stiffness matrices, which are much larger than nodal stiffness matrices, is much less than the number of nodal stiffness matrices in a model of the same size, and therefore the number of assembly operations into the global vector in FEM is less than that in meshfree methods. As a result, critical sections tend not to have too deleterious an effect on the scalability of parallel conjugate gradient FEM computations.

4.3 Parallel IC Factorization

The parallel Cholesky factorization method for shared memory machines was discussed in [4,6], while for distributed memory machines it was presented in [4]. Since it is easy to implement with the data structures described in Section 3, the IC factorization algorithm implemented and tested in this research is based on the submatrix-Cholesky method, which is just one of three forms of Cholesky decomposition illustrated in Figure 4.2 [4]. The "Modified" region in Figure 4.2 is updated in parallel, and this procedure corresponds to the parallel inner loop in the algorithm of Box 3.2. As such, parallelism of IC factorization is easily obtained while avoiding the dependency that could exist if the outer loop were parallelized. In general, the CPU waiting time in inner loop parallelization is longer than in outer loop parallelization, and therefore speed-up is less than ideal as increasing numbers of processors are used. It is noted that outer loop parallelism can be used [23], resulting in reduced processor waiting time, but this is dependent upon sophisticated reordering. Due to the high degree of nodal coupling in meshfree

methods, it is not clear that such reordering would make outer loop parallelism in the IC factorization process more efficient than the inner loop parallelism used.

5. NUMERICAL EXAMPLES

5.1 The Boussinesq Problem

The Boussinesq problem involves application of a point load to a linear elastic half-space. The analytical solution for the displacement beneath the point load is singular and can be expressed by the following equation:

u_z = \frac{P}{2\pi\mu R}\left[(1-\nu) + \frac{z^2}{2R^2}\right]    (5.1)

where R is the distance from the point of load application. The simple material model and uniform mesh for a cube form a well-conditioned problem that is ideal for testing the upper limit of PCG performance. Beyond these factors, the spectrum of the solution field for this problem is dominated by very high wavenumbers (or very short wavelengths), and iterative solvers such as the conjugate gradient method tend to converge very rapidly for such problems, as they first obtain the high wavenumber contributions to a solution spectrum, followed subsequently by the lower wavenumber contributions [3,6]. By first studying the comparative performance of different solvers on this ideal problem for iterative solvers, convergence rates and crossover points where iterative solvers begin to outperform direct solvers can be identified.

The problem described in Figure 5.1 is solved with six different levels of model refinement having discretizations of 3 3 3, 5 5 5, 9 9 9,, 3 3 3, and equally spaced nodes. Using the sparse direct solver VSS [33], the computational costs and solution accuracies associated with both meshfree (with normalized support size a=1) and FEM (with trilinear shape

functions) are compared in Table 5.1 and Figure 5.2. When accuracy of solutions is considered, meshfree analysis is actually very competitive. Table 5.2 compares the storage required to solve the linear problem K u = f with the sparse direct solver (VSS), the JCG solver, and the ICCG solver for normalized support sizes of a=1, a=2, and a=3. The VSS solver uses a very effective reordering algorithm that minimizes the fill that occurs during factorization, thereby minimizing both the required memory and the number of processor operations for the solving process. Nevertheless, analysis of the results in Table 5.2 indicates that the required memory with VSS grows in proportion to N^1.5 as the problem size N increases, whereas the required memory for the JCG and ICCG solvers grows in proportion to N^1.1 and N^1.3, respectively. For a given problem size, the required memory for the ICCG solver is greater than that for the JCG solver, since both require the equivalent of the sparse matrix K but the ICCG solver also requires the factorized K without fill-in. For small and moderate sized problems, the ICCG solver actually requires more memory than the VSS solver, since the storage required for two sparse matrices without fill-in can exceed that of one sparse matrix with fill-in. However, with increasing problem size, the trend is clearly for the ICCG solver to use substantially less memory than the VSS solver.

With normalized support sizes of a=1 and a=2, the JCG solver becomes faster than VSS on the Boussinesq problem for problems greater in size than 777 nodes and 666 nodes, respectively. However, for a normalized support size of a=3, the crossover point at which the JCG solver overtakes the direct VSS solver was not observed, because with a support size of a=3 the stiffness matrix of the problem becomes extremely ill-conditioned, greatly slowing down the rate of convergence within the Jacobi-preconditioned conjugate gradient algorithm.
While such a crossover point should exist for sufficiently large problem sizes, it is not worth pursuing, due to the slow convergence rate of the JCG algorithm. The ICCG solver provides a much faster rate of convergence than the JCG solver, especially when normalized

support sizes of a=2 or a=3 are used. As shown in Table 5.2 and Figure 5.3, the ICCG solver begins to outperform the VSS solver in terms of speed on the Boussinesq problem with support size a=2 for a problem of approximately NEQ=10^4 equations, and extrapolating the results of Figure 5.3c, the ICCG solver would begin to outperform the VSS solver with support size a=3 for problem sizes well before NEQ=10^5.

Both the JCG and ICCG solving algorithms were parallelized for a shared memory environment and tested with a small number of CPUs (up to 2 for this example). The computations were performed on an HP Exemplar S-class 16-processor computer, and a normalized support size of a=2 was used. The speed-up factors associated with parallelism are shown in Tables 5.3 and 5.4, and while they are reasonably good, they are a little less than ideal as increasing numbers of CPUs are used. It is especially noteworthy that the parallel speed-up factors with meshfree methods diminish somewhat with increasing problem sizes. This is believed by the authors to be due in part to the increase of idle time during stiffness and force assembly caused by increasing numbers of neighbor nodes with increasing problem size, as was mentioned in Section 4.2. The Boussinesq problem was also solved by FEM with a parallel JCG solver for comparison of speed-up factors between the two methods. As shown in Figure 5.4(a), almost perfect scalability is obtained for sufficiently large FEM problems because the proportion of idle CPU time decreases with increasing problem size, while, unlike the proposed implementation of the meshfree method, the load associated with each parallel loop iteration stays the same.

The parallel ICCG solver with parallelization of the inner loop of the IC factorization was also tested with a small number of CPUs in a shared memory environment. In general, the occurrence of idle CPUs in inner loop parallelization is more frequent than that with outer loop parallelization. Consequently, the speed-up factors shown in Figure 5.4c are lower than ideal.
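The effect of unequal work per loop iteration on speed-up can be illustrated with a toy model. The sketch below is only an illustration under stated assumptions, not the authors' implementation or measurement: the function name static_speedup is hypothetical, loop iterations are assumed to be split into contiguous chunks of nearly equal length (one chunk per processor), and the work of one iteration is assumed proportional to a given weight such as the number of coupled nodes.

```python
def static_speedup(work, p):
    """Speed-up of a loop whose iterations are split into p contiguous chunks.

    work : list of per-iteration costs (assumed proxy: coupled nodes per node)
    p    : number of processors (hypothetical static partition)
    """
    n = len(work)
    # Chunk boundaries for a nearly equal split of the iteration range.
    bounds = [(n * q) // p for q in range(p + 1)]
    chunk = [sum(work[bounds[q]:bounds[q + 1]]) for q in range(p)]
    # All processors wait for the slowest chunk, so it sets the parallel time.
    return sum(work) / max(chunk)


# Equal work per iteration gives the ideal factor; one heavy iteration lowers it.
balanced = static_speedup([1, 1, 1, 1], 2)
unbalanced = static_speedup([3, 1, 1, 1], 2)
```

In this model the processors holding the lighter chunk sit idle until the heavier chunk finishes, which is the kind of waiting time discussed above.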

5.2 The 3D Cylinder Problem

Cylindrical structures subjected to end loadings have been considered by Nour-Omid [26] to assess the performance of various linear solving algorithms. While this problem is essentially two-dimensional, it is solved as a 3D problem as shown in Figure 5.5 to produce a larger bandwidth of the stiffness matrix. Two different refinements with 2 and equally spaced nodes have been considered. The material is linearly elastic, and a static loading is applied. The results in Table 5.5 and Figure 5.6 share the same tendencies as the preceding results obtained for the Boussinesq problem. When a normalized support size of a=1 is used, both the JCG and ICCG solvers outperform the VSS solver in terms of memory and CPU time. When a normalized support size of a=2 is used, the JCG solver again converges very slowly due to the somewhat ill-conditioned stiffness matrix. In contrast, the ICCG solver appears superior to both of the other solvers in terms of both memory and CPU time for this problem.

5.3 The 3D Beam Problem

A cantilever beam test problem is described in Figure 5.7. All the degrees of freedom at the clamped end are fixed, and uniform loading is applied at the tip of the beam in the transverse direction. Three cases are considered: a linear elastic beam, an elastic beam with geometric nonlinearity, and an elasto-plastic beam with geometric and material nonlinearity. The material properties are Young's modulus E = 2,000 MPa, Poisson's ratio ν = 0.3, and elastoplastic isotropic hardening σ y (ē p = ē p MPa. The beam is discretized into two levels of refinement with 6 2 and equally spaced nodes, and analyzed using ten equal loading steps for the nonlinear cases. This example is employed to see how the different conjugate gradient solvers work for nonlinear problems. First, however, the performance of the

JCG and ICCG solvers for this cantilever beam geometry is compared to the performance on the Boussinesq problem (see Table 5.6). While the problem sizes for the 3D linear elastic beam problem are comparable to those of the largest Boussinesq problems solved (Table 5.2), it is seen that a considerably larger number of conjugate gradient iterations is required to solve these problems. Nevertheless, even though the ICCG solver requires a larger number of iterations in this problem, it still shows a clear trend toward outperforming both the JCG and VSS solvers.

The relative performance of the different solvers on the nonlinear 3D beam problems is summarized in Table 5.7 and Figure 5.8. For both the geometrically nonlinear problem and the fully nonlinear problem, the magnitude of loading applied to the beam was selected so that the maximal deflection at the tip is approximately one third of the beam length. For the case involving material nonlinearity in the form of hardening elasto-plasticity, approximately 79 percent of the nodes in the model undergo plastic deformation. The results in Table 5.7 and Figure 5.8 indicate that both types of nonlinearity, geometric and material, significantly degrade the performance of the JCG solver, with the convergence rates for the elasto-plastic problem being worse than those for the nonlinear elastic case. For example, Figure 5.8(c) shows the performance of the JCG solver as comparable to that of VSS for the linear elastic case, but it becomes 1.4 times slower than VSS with the introduction of geometric nonlinearity, and 3.8 times slower with the further introduction of material nonlinearity. The relative performance of the ICCG solver also deteriorates somewhat in solving nonlinear problems, although not as severely as that of the JCG solver. Nevertheless, the ICCG solver remains competitive with the VSS direct solver for large-scale nonlinear problems in terms of CPU time and required memory.

6. DISCUSSION and CONCLUSIONS

In this paper two different parallel, preconditioned conjugate gradient (PCG) solvers have been implemented and tested in a meshfree analysis framework on both linear and nonlinear continuum mechanics problems. The first iterative solver was based on using the inverted diagonal of the stiffness matrix as a preconditioner (JCG), while the second used the Cholesky factorization of the sparse stiffness matrix neglecting fill-in (ICCG) as the preconditioner. In addition, the high-performance direct sparse solver VSS was also used in this study as a baseline against which to measure the performance of the PCG solvers.

The required memory and CPU time for in-core meshfree solution of the Boussinesq problem with normalized shape function support size a=2 tend to grow roughly in proportion to N^1.5 and N^2.25, respectively, with the VSS solver; in proportion to N^1.1 and N^1.72, respectively, for the JCG solver; and in proportion to N^1.3 and N^1.52, respectively, for the ICCG solver. Similar relative trends for the three solvers are observed on different test problems, and with different normalized shape function support sizes. It can therefore be stated that above a certain size of analysis problem, PCG solvers will always be more efficient than direct sparse solvers based on complete Cholesky factorization, both in terms of required CPU operations and required memory. Moreover, by exploiting relatively simple shared-memory parallelism within the PCG solvers, significant additional computational efficiency is easily gained with relatively small numbers of processors. Here the parallel performance of the solvers has been explored with up to 2 processors.

Many of the well-established performance characteristics of iterative PCG solvers within FEM frameworks have been found to be apparent within a meshfree analysis framework as well. In particular, the performance of iterative solvers is quite problem dependent.
On problems where the solution field has primarily a high wavenumber content, in a spectral sense, iterative PCG solvers can be extremely efficient and outperform direct solvers even on fairly small sized

applications. The Boussinesq problem solved in Section 5.1 above is a classic example of such a problem. However, in problems where the solution field has a significant low wavenumber spectrum, the iterative PCG solvers can require substantially more CPU effort and iterations. This was demonstrated in the cylinder and beam problems in Sections 5.2 and 5.3 above.

A new aspect of using PCG solvers within a meshfree analysis framework was also identified in this work. The usage of relatively large shape-function support sizes in meshfree methods can result in ill-conditioning of the stiffness matrix due to loss of linear independence between shape functions. Iterative PCG solvers are much more sensitive to this ill-conditioning than direct equation solvers. In particular, the rate of convergence of the JCG solver was found to be greatly diminished with larger normalized shape function support sizes such as a=3. In contrast, the ICCG solver was found to be significantly more robust and able to deal with this problem. Nevertheless, even with the ICCG solver, the analyst should avoid choosing a normalized support size a>2 without good reason, since doing so will entail more computational effort.

7. ACKNOWLEDGEMENTS

The authors gratefully acknowledge a grant NSF DMS that facilitated this research.

8. REFERENCES

1. Adams, M. and Taylor, R. L. Parallel multigrid solvers for 3D-unstructured large deformation elasticity and plasticity finite element problems. Finite Elements in Analysis and Design, 36.
2. Axelsson, O. (1986), A general incomplete block-matrix factorization method, Linear Algebra Appl., 74.
3. Axelsson, O. (1994), Iterative solution methods, Cambridge University Press.

4. Basermann, A., Reichel, B., and Schelthoff, C. (1997), Preconditioned CG methods for sparse matrices on massively parallel machines, Parallel Computing, 23.
5. Belytschko, T., Krongauz, Y., Organ, D., and Fleming, M. (1996), Meshless methods: An overview and recent developments, Comput. Methods Appl. Mech. Engrg., 139.
6. Briggs, W. L., Henson, V. E., and McCormick, S. F. A Multigrid Tutorial. 2nd ed. SIAM.
7. Chen, J. S., Pan, C., Wu, C. T., and Liu, W. K. (1996), Reproducing kernel particle methods for large deformation analysis of non-linear structures, Comput. Methods Appl. Mech. Engrg., 139.
8. Chen, J. S. and Wang, H. P. (2000), New boundary condition treatments in meshfree computation of contact problems, Comput. Methods Appl. Mech. Engrg., 187.
9. Chen, J. S., Wu, C. T., Yoon, S., and You, Y. (2001), A stabilized conforming nodal integration for Galerkin mesh-free methods, Int. J. Numer. Meth. Engng, 50.
10. Concus, P., Golub, G. H., and Meurant, G. (1985), Block preconditioning for the CG method, SIAM J. Sci. Stat. Comput., 6.
11. Danielson, K. T., Hao, S., Liu, W. K., Uras, A., and Li, S. (2000), Parallel computation of meshless methods for explicit dynamic analysis, Int. J. Numer. Meth. Engng, 47, 7.
12. Danielson, K. T. and Namburu, R. R. (1998), Nonlinear dynamic finite element analysis on parallel computers using FORTRAN 90 and MPI, Advances in Engineering Software, 29.
13. Farhat, C. and Roux, F. X. (1991), A method of finite element tearing and interconnecting and its parallel solution algorithm, Int. J. Numer. Meth. Engng, 32.
14. George, A., Heath, M. T., and Liu, J. (1986), Parallel Cholesky factorization on a shared-memory multiprocessor, Lin. Alg. and Its Applic., 77.
15. Gerardin, M. and Hogge, M. (1987), Solving systems of nonlinear equations, H. Kardestuncer (ed.), The Finite Element Handbook, McGraw-Hill, New York.
16. Golub, G. H. and Van Loan, C. F. (1996), Matrix Computations, The Johns Hopkins University Press, Baltimore, MD.
17. Grote, M. J. and Huckle, T. (1997), Parallel preconditioning with sparse approximate inverses, SIAM J. Sci. Comput., 18.
18. Hestenes, M.
R. and Stiefel, E. (1952), Methods of conjugate gradients for solving linear systems, J. Comput. Appl. Math., 25.
19. Hughes, T. J. R., Ferencz, R. M., and Hallquist, J. O. (1987), Large-scale vectorized implicit calculations in solid mechanics on a Cray X-MP/48 utilizing EBE preconditioned conjugate gradients, Comput. Meth. Appl. Mech. Engng, 61.
20. Kale, L. V., Ramkumar, B., Sinha, A. B., and Gursoy, A. (1994a), The Charm Parallel Programming Language and System: Part I - Description of Language Features, IEEE

Transactions on Parallel and Distributed Systems.
21. Kale, L. V., Ramkumar, B., Sinha, A. B., and Gursoy, A. (1994b), The Charm Parallel Programming Language and System: Part II - The Runtime System, IEEE Transactions on Parallel and Distributed Systems.
22. Karypis, G. and Kumar, V. (1995), A fast and high quality multilevel scheme for partitioning irregular graphs, Technical Report TR , Department of Computer Science, University of Minnesota.
23. Lin, W. Y. and Chen, C. L. (1999), Minimum communication cost reordering for parallel sparse Cholesky factorization, Parallel Computing, 25.
24. Liu, W. K., Jun, S., and Zhang, Y. F. (1995), Reproducing Kernel Particle Method, Int. J. Numer. Methods Fluids, 20.
25. Meurant, G. (1984), The block preconditioned conjugate gradient method on vector computer, BIT, 24.
26. Nour-Omid, B. (1984), A preconditioned conjugate gradient method for solution of finite element equations, in: W. K. Liu, T. Belytschko and K. C. Park, eds., Innovative Methods for Nonlinear Problems, Pineridge Press, Swansea, U.K.
27. Ortega, J. M. (1988), Introduction to parallel and vector solution of linear systems, Plenum Press, New York, London.
28. Pini, G. and Gambolati, G. (1990), Is a simple diagonal scaling the best preconditioner for conjugate gradients on supercomputers?, Adv. Water Resour., 13.
29. Poole, E. L., Knight, N. F., and Davis, D. D. (1992), High-performance equation solvers and their impact on finite element analysis, Int. J. Numer. Meth. Engng, 33.
30. Storaasli, O. O., Nguyen, D. T., and Agarwal, T. K. (1990), A parallel-vector algorithm for rapid structural analysis on high-performance computers, NASA Technical Memo, 102614, Hampton, VA.
31. Swan, C. C. and Kosaka, I. (1997), Homogenization-based analysis and design of composites, Computers and Structures, 64.
32. Yoon, S. (2001), Stabilized conforming nodal integration for Galerkin meshfree methods, Ph.D. Thesis, University of Iowa.
33. Solversoft (2002).

Box 2.1 Global Newton solving algorithm

Predictor phase
  v = 0 : counter initialization
  d^v_{n+1} = ~d_{n+1} : displacement predictor
  form r^v_{n+1}(d^v_{n+1}) : initial residual
Corrector phase
  while ( ||r^v_{n+1}|| > RTOL )
    K δ^v = r^v_{n+1} : linear solving phase for δ^v
    p^v = α δ^v : line search for step size α
    d^{v+1}_{n+1} = d^v_{n+1} + p^v : displacement update
    form r^{v+1}_{n+1}(d^{v+1}_{n+1}) : residual update
    v = v + 1 : counter update
  end-while

Box 2.2 Algorithm for assembling global K on a node-by-node basis

For each nodal (integration) point I
  For each member J ∈ N_I
    For each member L ∈ C_J
      Assemble K_JL → global K
    End-for
  End-for
End-for
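The predictor-corrector loop of Box 2.1 can be sketched in a few lines. This is a minimal sketch under several assumptions that are not taken from the paper: the names newton_solve, residual, and tangent are hypothetical; the residual is assumed to be external minus internal force, so that the correction solves K δ = r; a unit step (α = 1) stands in for the line search; and a dense direct solve stands in for the PCG or VSS linear solver.

```python
import numpy as np


def newton_solve(residual, tangent, d0, rtol=1e-10, max_iter=50):
    """Newton iteration in the spirit of Box 2.1 (unit step instead of line search).

    residual(d) -> r, assumed to be f_ext - f_int(d)
    tangent(d)  -> K, assumed to be the derivative of f_int with respect to d
    """
    d = np.array(d0, dtype=float)        # predictor: initial displacement guess
    r = residual(d)                      # initial residual
    v = 0                                # iteration counter
    while np.linalg.norm(r) > rtol and v < max_iter:
        delta = np.linalg.solve(tangent(d), r)   # linear solving phase
        d = d + delta                            # displacement update, alpha = 1
        r = residual(d)                          # residual update
        v += 1                                   # counter update
    return d, v


# Hypothetical two-degree-of-freedom example with a cubic stiffening law.
f_ext = np.array([2.0, 10.0])
d, iterations = newton_solve(
    lambda d: f_ext - (d + d**3),
    lambda d: np.diag(1.0 + 3.0 * d**2),
    np.zeros(2),
)
```

In a large analysis the call to np.linalg.solve is the step where an iterative or sparse direct solver would be substituted.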

Box 3.1 Algorithm of preconditioned conjugate gradient solver (PCG)

Step 1: Initialize
  k = 0; x_0 = 0; s_0 = b; z_0 = B s_0; g_0 = z_0
Step 2: Line search and updates
  α = (z_k · s_k) / (g_k · A g_k); x_{k+1} = x_k + α g_k; s_{k+1} = s_k - α A g_k
Step 3: Convergence check
  if ||s_{k+1}||_{L2} / ||s_0||_{L2} ≤ δ, then return
Step 4: Update conjugate search direction
  z_{k+1} = B s_{k+1}; β = (z_{k+1} · s_{k+1}) / (z_k · s_k); g_{k+1} = z_{k+1} + β g_k
  Go to step 2
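The steps of Box 3.1 can be written compactly with dense arrays. The following is a sketch only, assuming a symmetric positive definite matrix and a Jacobi (inverted diagonal) preconditioner; the function names pcg and jacobi_preconditioner, the tolerance, and the small test matrix are illustrative choices, not the authors' code.

```python
import numpy as np


def jacobi_preconditioner(A):
    """Return a function applying B = inverse of diag(A), as in the JCG solver."""
    d = np.diag(A).astype(float)
    return lambda s: s / d


def pcg(A, b, apply_B, tol=1e-10, max_iter=1000):
    """Preconditioned conjugate gradients for a symmetric positive definite A."""
    x = np.zeros_like(b, dtype=float)    # Step 1: zero initial guess
    s = np.array(b, dtype=float)         # residual s = b - A x = b
    z = apply_B(s)                       # preconditioned residual
    g = z.copy()                         # first search direction
    s0_norm = np.linalg.norm(s)
    if s0_norm == 0.0:
        return x, 0
    for k in range(1, max_iter + 1):
        Ag = A @ g
        zs = z @ s
        alpha = zs / (g @ Ag)            # Step 2: line search
        x += alpha * g
        s -= alpha * Ag
        if np.linalg.norm(s) <= tol * s0_norm:   # Step 3: convergence check
            return x, k
        z = apply_B(s)                   # Step 4: new conjugate direction
        beta = (z @ s) / zs
        g = z + beta * g
    return x, max_iter


A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x, iterations = pcg(A, b, jacobi_preconditioner(A))
```

Replacing apply_B with forward and backward substitution on an incomplete Cholesky factor would give the ICCG variant; the product A @ g is the operation that Box 4.1 parallelizes.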

Box 3.2: Algorithm for incomplete Cholesky factorization with parallel region.

for k = 1 : n
  A(k, k) = sqrt(A(k, k))
  for i = k + 1 : n
    if A(i, k) ≠ 0
      A(i, k) = A(i, k) / A(k, k)
    end
  end
  for j = k + 1 : n
    DO_PARALLEL                          (parallel region)
    for i = j : n
      if A(i, j) ≠ 0
        A(i, j) = A(i, j) - A(i, k) A(j, k)
      end
    end
    END_PARALLEL
  end
end
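A dense-array version of Box 3.2 makes the zero-fill rule easy to check. This is a sketch under assumptions, not the authors' sparse implementation: the function name incomplete_cholesky is hypothetical, the matrix is stored as a full NumPy array rather than in the sparse format of Box 3.3, and the loops run serially.

```python
import numpy as np


def incomplete_cholesky(A):
    """Zero-fill incomplete Cholesky factor L (lower triangular) of an SPD matrix.

    Only entries that are nonzero in the lower triangle of A are updated,
    so L keeps the sparsity pattern of A.
    """
    L = np.tril(np.asarray(A, dtype=float))
    n = L.shape[0]
    for k in range(n):
        L[k, k] = np.sqrt(L[k, k])
        for i in range(k + 1, n):
            if L[i, k] != 0.0:
                L[i, k] /= L[k, k]
        for j in range(k + 1, n):
            # The iterations over i are mutually independent; this is the
            # loop marked as the parallel region in Box 3.2.
            for i in range(j, n):
                if L[i, j] != 0.0:
                    L[i, j] -= L[i, k] * L[j, k]
    return L


# Tridiagonal matrix: no fill occurs, so the factor is the exact Cholesky factor.
T = np.array([[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]])
L_exact = incomplete_cholesky(T)

# Arrow matrix: exact factorization would fill entry (2, 1); here it stays zero.
W = np.array([[4.0, 1.0, 1.0], [1.0, 4.0, 0.0], [1.0, 0.0, 4.0]])
L_incomplete = incomplete_cholesky(W)
```

The second example shows the approximation at work: the product of the factor with its transpose no longer reproduces the matrix exactly, which is why the factor is used as a preconditioner rather than as a direct solver.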

Box 3.3: Sparse storage system for a small linear system

KDIAG (diagonal terms) = { 100, 200, 300, 400, 500, 600 }
KPTRS (number of off-diagonal coefficients in each row) = { 3, 3, 3, 2, 1, 0 }
KNDXS (column index of each coefficient) = { 2, 3, 6, 3, 4, 6, 4, 5, 6, 5, 6, 6 }
KCOEFS (off-diagonal coefficients in row format) = { 1, 2, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15 }

Box 4.1: Algorithm for parallel computation of s_I = s_I - α A_IJ g_J.

BEGIN_PARALLEL
For each node I ∈ η
  For each coupled node J ∈ N_I
    Localize incremental displacement vector g_J from global to nodal level
    Retrieve nodal stiffness matrix A_IJ from storage
    Apply nodal stiffness - nodal displacement product s_I = A_IJ g_J
    Assemble all nodal contributions into global vector s
  end-for
end-for
END_PARALLEL
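The storage arrays of Box 3.3 and the row-owned product of Box 4.1 can be combined in a short sketch. Several things here are assumptions made for illustration: the matrix is taken to be symmetric with only its upper triangle stored, the helper names full_rows, matvec_by_rows, and to_dense are hypothetical, scalar rows are used in place of the paper's nodal 3 x 3 blocks, and the loop over rows runs serially although each iteration writes only its own output entry.

```python
import numpy as np

# Arrays from Box 3.3 (column indices are 1-based, upper triangle by rows).
KDIAG = [100, 200, 300, 400, 500, 600]
KPTRS = [3, 3, 3, 2, 1, 0]
KNDXS = [2, 3, 6, 3, 4, 6, 4, 5, 6, 5, 6, 6]
KCOEFS = [1, 2, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15]


def full_rows(kptrs, kndxs, kcoefs):
    """Expand upper-triangle storage into per-row lists of (column, value)."""
    n = len(kptrs)
    rows = [[] for _ in range(n)]
    pos = 0
    for i in range(n):
        for _ in range(kptrs[i]):
            j = kndxs[pos] - 1           # convert to 0-based column index
            c = kcoefs[pos]
            rows[i].append((j, c))       # stored entry K[i, j]
            rows[j].append((i, c))       # mirrored entry K[j, i] (symmetry assumed)
            pos += 1
    return rows


def matvec_by_rows(kdiag, rows, g):
    """Compute A g so that iteration i writes only entry i of the result."""
    out = np.zeros(len(kdiag))
    for i in range(len(kdiag)):          # iterations are independent of each other
        acc = kdiag[i] * g[i]
        for j, c in rows[i]:
            acc += c * g[j]
        out[i] = acc
    return out


def to_dense(kdiag, rows):
    """Build the dense matrix, used here only to check the sparse product."""
    K = np.diag(np.asarray(kdiag, dtype=float))
    for i, row in enumerate(rows):
        for j, c in row:
            K[i, j] = c
    return K


rows = full_rows(KPTRS, KNDXS, KCOEFS)
g = np.arange(1.0, 7.0)
Ag = matvec_by_rows(KDIAG, rows, g)
```

Because each outer iteration owns one output entry, no two iterations write to the same location, which is the property Box 4.1 relies on to avoid critical sections. The cost is that every row must have access to all of its coefficients, not only those stored in the upper triangle.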


Figure 2.1. Relationship between shape function support size and nodal connectivity: (a) a node and its support Ω; (b) Voronoi integration cell for a node B.

Figure 2.2. Nodal connectivity in one-dimensional equally spaced discretization. With piecewise linear FEM shape functions, each three consecutive nodes are coupled. With meshfree shape functions with normalized support size of unity, each five consecutive nodes are coupled.

Figure 3.1. Simple four-node model (nodes a, b, c, d).

Figure 4.1. The procedure to assemble all nodal contributions into the global vector A·g: processor P1 handles the stiffness matrix lines of nodal point 1 (Kaa, Kab, Kac, Kad), and processor P2 handles the stiffness matrix lines of nodal point 2.

Figure 4.2. Three forms of Cholesky factorization: Row-Cholesky, Column-Cholesky, and Submatrix-Cholesky (legend: modified; used for modification).

Figure 5.1. Problem description of a Boussinesq problem; Length = 6, P = 500; Young's modulus: E = 30,000; Poisson's ratio: ν = 0.3.
