Some experiments with massively parallel computation for Monte Carlo simulation of stochastic dynamical systems

E. A. Johnson, S. F. Wojtkiewicz and L. A. Bergman
University of Illinois, Urbana, Illinois

ABSTRACT: The advantages and disadvantages of several numerical solution methods for the transition probability density functions of stochastic dynamical systems are discussed. Monte Carlo simulation is superior for some problems of this class. The drawbacks and benefits of its use on several computer architectures, including massively parallel and distributed-network computers, are examined. The effort required and the gains realized are discussed. Furthermore, a brief comparison of the results from MCS and finite element solutions is given.

1 INTRODUCTION

The evolution of stochastic dynamical systems is governed by Fokker-Planck equations if the response process is Markovian. Analytical solutions for the transient response exist for only the simplest of systems. The evolution of the transition probability density function over the phase space has been solved numerically for various two-dimensional systems subjected to additive and multiplicative random excitation using the finite element method (Spencer and Bergman 1993). Systems of higher order, however, pose significant difficulty for standard finite element formulations because of their memory requirements and computational expense.

Direct Monte Carlo simulation (MCS), while often regarded as less elegant than other methods, can be used to solve problems of significantly higher complexity. Low order systems are often solved more efficiently by other methods (e.g., the finite element method, cell mapping, or path integral methods).
For example, a standard finite element solution with a grid of $n$ points in each spatial dimension and a uniform time step requires, for a $d$-dimensional problem, a single reduction to upper triangular form of $n^d$ equations, followed by forward and backward substitution at each time step. Thus, the required number of computations and the memory allocation grow exponentially with the dimensionality of the problem. Granted, these matrix equations are not fully populated, and in fact have relatively narrow bandwidth if node numbering is done optimally; but the number of calculations required to solve them grows at least as $n^d$ and usually much faster. In contrast, a Monte Carlo simulation requires a number of computations proportional to $d$ and to the number of realizations. Furthermore, the accuracy of the Monte Carlo simulation depends not on the dimensionality of the system but on the number of realizations used to characterize it (Pradlwarter, Schuëller, and Melnik-Melnikov 1993). The number of realizations required to accurately produce the transition probability density function over the entire phase space, especially in the tails, is large; but since each realization is entirely independent of the others, the Monte Carlo simulation is easily and efficiently adapted to parallel computation. The advent of high-speed, massively parallel computers permits a large number of realizations of a complex dynamical system to be determined. Consequently, Monte Carlo simulation may be more efficient for higher dimensional systems than other solution methods currently in use. The purpose of this investigation is therefore to confirm the above observations and to compare the performance of MCS on various platforms, including a massively parallel supercomputer and distributed-network workstations, with

special focus on the advantages and disadvantages of each platform for this class of problems.

2 SYSTEM DESCRIPTIONS

2.1 Duffing Oscillator

Monte Carlo simulation is readily used for any number of stochastic systems. For the sake of comparison with previous solutions of the Fokker-Planck equation by the finite element method, one system to be examined herein is a Duffing oscillator subjected to external white noise. The equation of motion is given by

\[ \ddot{X} + 2\zeta\dot{X} + (1 + \varepsilon X^2)\,X = W(t) \tag{1} \]

or, in state equation form,

\[ \dot{X}_1 = X_2, \qquad \dot{X}_2 = -2\zeta X_2 - (1 + \varepsilon X_1^2)\,X_1 + W(t) \tag{2} \]

where $W(t)$ is zero mean white noise,

\[ E[W(t)] = 0, \qquad E[W(t_1)W(t_2)] = 4\zeta\,\delta(t_1 - t_2) \tag{3} \]

the initial conditions are

\[ X(0) = X_1(0) = X_0, \qquad \dot{X}(0) = X_2(0) = \dot{X}_0 \tag{4} \]

and $\delta(\cdot)$ is the Dirac delta function. The stationary probability density function for this system is given by

\[ f_{X_1X_2}(x_1, x_2) = C \exp\left\{ -\tfrac{1}{2}x_1^2 - \tfrac{\varepsilon}{4}x_1^4 - \tfrac{1}{2}x_2^2 \right\} \tag{5} \]

where $C$ is chosen such that the integral of Eq. (5) over the domain is unity. The two parameters are chosen to be ζ =. and ε =., and the initial joint probability distribution is bivariate Gaussian with covariance $\Gamma$ (Eq. (6); matrix entries illegible in source).

2.2 Earthquake-Excited Linear Oscillator

In order to see the advantages of the MCS for higher-dimensional systems, a two degree of freedom oscillator will also be examined herein. Without loss of generality, this four-dimensional system is taken to be a linear oscillator driven by the Kanai-Tajimi stationary model of earthquake-induced ground acceleration (Soong and Grigoriu 1993), as shown in Figure 1. The equations of motion of this system are given by the configuration space equations

\[ \ddot{X} + 2\zeta\omega\dot{X} + \omega^2 X = -\left[\ddot{X}_g + \ddot{W}(t)\right], \qquad \ddot{X}_g + 2\zeta_g\omega_g\dot{X}_g + \omega_g^2 X_g = -\ddot{W}(t) \tag{7} \]

where $\ddot{W}(t)$ is zero mean white noise with $E[\ddot{W}(t)] = 0$ and
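\[ E[\ddot{W}(t_1)\ddot{W}(t_2)] = 2\pi\Phi_0\,\delta(t_1 - t_2). \tag{8} \]

The equivalent state space system is

\[ \dot{\mathbf{X}}(t) = A\,\mathbf{X}(t) + G\,\ddot{W}(t) \tag{9} \]

where

\[ \mathbf{X} = [X_1\ X_2\ X_3\ X_4]^T = [X\ \dot{X}\ X_g\ \dot{X}_g]^T, \qquad G = [0\ 0\ 0\ {-1}]^T \tag{10} \]

\[ A = \begin{bmatrix} 0 & 1 & 0 & 0 \\ -\omega^2 & -2\zeta\omega & \omega_g^2 & 2\zeta_g\omega_g \\ 0 & 0 & 0 & 1 \\ 0 & 0 & -\omega_g^2 & -2\zeta_g\omega_g \end{bmatrix} \tag{11} \]

Figure 1: Earthquake model (bedrock motion $w(t)$; ground mass $m_g$ with stiffness $k_g$ and damping $c_g$; structure mass $m$ at the surface). Note that $m_g \gg m$, and thus there is no coupling of the structure dynamics into that of the ground.

Since this is a linear, time-invariant system subjected to a white noise input, the stationary response is Gaussian with zero mean and covariance $\Gamma_{XX} = \lim_{t\to\infty} E[\mathbf{X}(t)\mathbf{X}^T(t)]$ given by the algebraic Riccati equation,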

\[ A\,\Gamma_{XX} + \Gamma_{XX}A^T + 2\pi\Phi_0\,GG^T = 0, \tag{12} \]

the solution of which is the symmetric matrix

\[ \Gamma_{XX} = \big[\,E[X_iX_j]\,\big]_{i,j=1,\dots,4} \tag{13} \]

[Eqs. (14)-(19): closed-form expressions for $E[X_1^2]$, $E[X_2^2]$, $E[X_1X_3]$, $E[X_1X_4]$, $E[X_2X_4]$ and the common denominator polynomial $\Omega$, all functions of $\omega$, $\zeta$, $\omega_g$, $\zeta_g$ and $\Phi_0$; expressions illegible in source. The filter entries are $E[X_3^2] = \pi\Phi_0/(2\zeta_g\omega_g^3)$ and $E[X_4^2] = \pi\Phi_0/(2\zeta_g\omega_g)$.]

For the parameters used in this study, ω = π, ζ =., ω_g =.3, ζ_g =.3, and Φ_0 =, the covariance matrix becomes Γ_XX = (Eq. (20); numerical entries illegible in source).

The initial density is chosen to be zero mean multivariate Gaussian with diagonal covariance

\[ \Gamma_{X_0X_0} = \mathrm{diag}\left(\ \cdot\ ,\ \cdot\ ,\ \frac{\pi\Phi_0}{2\zeta_g\omega_g^3},\ \frac{\pi\Phi_0}{2\zeta_g\omega_g}\right) \tag{21} \]

(The initial oscillator variances are ; the filter variances are chosen to be the stationary filter variances.)

The number of realizations required for an accurate representation of the PDF is a topic that needs further study, but a quick measure is the expected number of realizations that fall into a given bin. At stationarity for the 4-D system given above, the 2-D marginal PDF of the structure states $x_1$ and $x_2$ is given by

\[ f_{X_1X_2}(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2} \exp\left\{ -\frac{1}{2}\left[ \left(\frac{x_1}{\sigma_1}\right)^2 + \left(\frac{x_2}{\sigma_2}\right)^2 \right] \right\} \tag{22} \]

At an $\alpha\sigma$ radius, that is, for the locus of points for which $\alpha^2 = (x_1/\sigma_1)^2 + (x_2/\sigma_2)^2$, the density is given by

\[ f_{X_1X_2}(\alpha) = \frac{1}{2\pi\sigma_1\sigma_2} \exp\left\{ -\tfrac{1}{2}\alpha^2 \right\} \tag{23} \]

The expected number of realizations that would fall into a bin of size $\Delta x_1 \times \Delta x_2$ at a radius of $\alpha\sigma$ is

\[ E[\text{number of realizations in bin}] = n\, f_{X_1X_2}(\alpha)\, \Delta x_1\, \Delta x_2 \tag{24} \]

For bin size..8, total number of realizations n =, and the stationary variance values given in Eq. (20), these values are charted in Table 1.

Table 1: Expected number of realizations in a bin. Columns: α; $f_{X_1X_2}(\alpha)$; E[# in bin]. (Tabulated values illegible in source.)

3 PLATFORM DESCRIPTIONS

Four computing platforms were used in this study: a Cray Y-MP, a Convex C, a Thinking Machines CM-5, and a network of Sun SPARCstation computers.
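As a numerical cross-check on the stationary covariance and the bin-count estimate discussed for the earthquake-excited oscillator, the following is a minimal sketch, not the authors' code. It assumes the state ordering [X, Xdot, Xg, Xgdot], a white-noise intensity of 2*pi*Phi_0, and the sign convention G = [0, 0, 0, -1]; the parameter values, bin size, and realization count are hypothetical placeholders, since the values used in the study are not legible here.

```python
import numpy as np

def kanai_tajimi_system(omega, zeta, omega_g, zeta_g):
    """State matrices for X = [X, Xdot, Xg, Xgdot] (assumed sign convention)."""
    A = np.array([
        [0.0, 1.0, 0.0, 0.0],
        [-omega**2, -2.0 * zeta * omega, omega_g**2, 2.0 * zeta_g * omega_g],
        [0.0, 0.0, 0.0, 1.0],
        [0.0, 0.0, -omega_g**2, -2.0 * zeta_g * omega_g],
    ])
    G = np.array([[0.0], [0.0], [0.0], [-1.0]])
    return A, G

def stationary_covariance(A, G, phi0):
    """Solve A P + P A^T + 2*pi*phi0 * G G^T = 0 as a linear system in vec(P)."""
    n = A.shape[0]
    Q = 2.0 * np.pi * phi0 * (G @ G.T)
    eye = np.eye(n)
    # Row-major vec: vec(A P) = kron(A, I) vec(P), vec(P A^T) = kron(I, A) vec(P)
    K = np.kron(A, eye) + np.kron(eye, A)
    return np.linalg.solve(K, -Q.reshape(-1)).reshape(n, n)

def expected_bin_count(n_real, sigma1, sigma2, alpha, dx1, dx2):
    """Expected realizations in a dx1-by-dx2 bin on the alpha-sigma ellipse."""
    density = np.exp(-0.5 * alpha**2) / (2.0 * np.pi * sigma1 * sigma2)
    return n_real * density * dx1 * dx2

# Hypothetical parameter values, for illustration only
OMEGA, ZETA, OMEGA_G, ZETA_G, PHI0 = 6.0, 0.05, 3.0, 0.3, 1.0
A, G = kanai_tajimi_system(OMEGA, ZETA, OMEGA_G, ZETA_G)
P = stationary_covariance(A, G, PHI0)
sigma1, sigma2 = np.sqrt(P[0, 0]), np.sqrt(P[1, 1])

for alpha in (0.0, 1.0, 2.0, 3.0, 4.0):
    count = expected_bin_count(100_000, sigma1, sigma2, alpha, 0.1, 0.1)
    print(f"alpha = {alpha:.0f}: expected count per bin = {count:.3g}")
```

The expected count falls off as exp(-alpha^2/2), which is why the tails of the PDF dominate the number of realizations needed.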
Table 2 is a speed comparison of these systems, showing the maximum theoretical speed, the speed found using the FLOPS benchmark (Aburto 99), the speed reported by NCSA research scientist Fouad Ahmad (Cohen 1993), and the speed found in the current study (discussed further in Performance on Various Platforms, below).

Table 2: MFLOPS ratings of the various platforms. (* single-processor rating)
Columns: System; Max. Theoretical; FLOPS benchmark; Ahmad study results; Current study averages.
Surviving entries, in column order (lost entries are not marked, so column positions are uncertain):
SPARC: n/a, .7, n/a, .
Convex C*: n/a, 8
Cray Y-MP*: (entries illegible)
-node CM-5: 9, n/a, n/a
-node CM-5: 89, n/a, ~
8-node CM-5: 38, n/a, n/a
-node CM-5: 378, n/a, ~7
-node CM-5: 3, n/a, 97

The Cray Y-MP/, operated by the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign, is a four-processor vectorized system running on a 7MHz clock cycle with MB of central memory and GB of secondary memory used primarily for I/O caching (UNICOS User's Guide 99). The maximum theoretical speed of this system is 333 million floating point operations per second (MFLOPS) per processor, but speeds of MFLOPS/processor are more typical.

The Convex C, run by the Computing and Communications Services Office at the University of Illinois at Urbana-Champaign, is a MB, four-processor vectorized system with a maximum theoretical performance of MFLOPS per processor.

NCSA also operates a Thinking Machines Connection Machine CM-5. This is a massively parallel supercomputer with nodes; each node has one processor, vector units, and 3MB of memory. The peak theoretical speed of this system is 8 MFLOPS/node for a total theoretical speed of 3 MFLOPS (NCSA Connection Machine User Guide 1993). In practice, however, 3- MFLOPS/node is more realistic (CM-5 CM Fortran Performance Guide 99). The CM-5 was run in SIMD mode for this study (single instruction stream, multiple data: each processor executes the same set of instructions concurrently on different data), but it can also be run in MIMD mode (multiple instruction stream, multiple data: each processor works independently and passes messages to the other processors when required).

A cluster of workstations administered by the College of Engineering at the University of Illinois was used as a distributed network platform. These workstations are Sun SPARCstation (/7, MHz) computers.
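Because every realization is advanced by the same arithmetic on different data, the SIMD style of execution described above maps directly onto array operations. The following is a minimal sketch of a Monte Carlo simulation of a Duffing oscillator written that way, not the authors' Fortran code. It assumes an Euler-Maruyama integrator (the integration scheme used in the study is not stated), unit linear stiffness, damping 2*zeta, and white-noise intensity 4*zeta; the values of `zeta`, `eps`, `dt`, and the initial spread `x0_std` are hypothetical.

```python
import numpy as np

def simulate_duffing(n_real, n_steps, dt, zeta, eps, x0_std=0.5, seed=0):
    """Advance n_real independent realizations of
        dX1 = X2 dt
        dX2 = [-2*zeta*X2 - (1 + eps*X1**2)*X1] dt + dB,   Var[dB] = 4*zeta*dt
    with Euler-Maruyama steps (assumed scheme). Every array has one entry
    per realization, so each step is a single data-parallel update."""
    rng = np.random.default_rng(seed)
    x1 = x0_std * rng.standard_normal(n_real)   # hypothetical initial spread
    x2 = x0_std * rng.standard_normal(n_real)
    noise_scale = np.sqrt(4.0 * zeta * dt)
    for _ in range(n_steps):
        dB = noise_scale * rng.standard_normal(n_real)
        drift2 = -2.0 * zeta * x2 - (1.0 + eps * x1**2) * x1
        # Both updates use the states from the start of the step
        x1, x2 = x1 + x2 * dt, x2 + drift2 * dt + dB
    return x1, x2

def second_moments(x1, x2):
    """Sample estimates of E[X1^2], E[X1 X2], E[X2^2]."""
    return np.mean(x1 * x1), np.mean(x1 * x2), np.mean(x2 * x2)

# Hypothetical run: 10,000 realizations out to t = 20
x1, x2 = simulate_duffing(n_real=10_000, n_steps=2_000, dt=0.01, zeta=0.2, eps=0.1)
m11, m12, m22 = second_moments(x1, x2)
print(f"E[X1^2] ~ {m11:.3f}, E[X1 X2] ~ {m12:.3f}, E[X2^2] ~ {m22:.3f}")
```

Under these assumptions the stationary velocity variance is 1, so the printed E[X2^2] gives a quick sanity check on the step size and the number of realizations.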
4 COMPARISON OF FEM AND MCS FOR A 2-D DUFFING SYSTEM

In order to assess the accuracy of the Monte Carlo simulations for the 2-D Duffing oscillator, the evolutionary second moments of the system will be examined. Figures 2-4 show the evolution of the second moments as found from FEM and by,,,, and,, realization Monte Carlo simulations. The, realization MCS does rather well over the entire analysis. In fact, the differences among the MCS runs are hardly distinguishable except in the zoomed inset graphs, which show further detail near the end of the analysis.

Figure 2: Evolution of a second moment $E[X_i(t)X_j(t)]$ for the 2-D Duffing system computed by MCS and FEM. (Curves: exact stationary value, FEM, MCS at three realization counts; horizontal axis: time [secs].)

Figure 3: Evolution of a second moment $E[X_i(t)X_j(t)]$ for the 2-D Duffing system computed by MCS and FEM. (Same curves and axis as Figure 2.)

Note that for the variances of $X_1(t)$ and $X_2(t)$, the FEM converges to a value slightly below the exact stationary variances (shown in the inset);

the MCS, however, while still fluctuating at t = π, is doing so about the exact value.

The MCS does not do quite as well in determining the response probability density function with few realizations as it does for the moments. The PDFs at three instants in time ( t =, π, π secs ) are shown in Figs. 5-7 as computed by FEM and by,,,, and,, realization Monte Carlo simulations. The, realization MCS is relatively close to the FEM solution.

For determining the evolution of the second moments of this system, the MCS is significantly more attractive, since even a, realization simulation characterizes the moments well and required less than 3% of the 7 minutes of CPU time and less than % of the MB required by the FEM. Furthermore, the MCS appears to converge to the correct variance values. For the evolutionary PDF, however, the MCS is somewhat less attractive but still a viable option. For this system, the, realization MCS gives the same order of performance as the FEM. The CPU time and memory requirements are summarized in Table 3.

Figure 4: Evolution of a second moment $E[X_i(t)X_j(t)]$ for the 2-D Duffing system computed by MCS and FEM. (Curves: exact stationary value, FEM, MCS at three realization counts; horizontal axis: time [secs].)

Table 3: MCS and FEM computational expense on a Cray Y-MP for the 2-D Duffing system. Columns: Method; CPU time [min]; Memory [MB]. Rows: FEM, and MCS at five realization counts. (Tabulated values illegible in source.)

Figure 5: PDFs of 2-D Duffing at t = secs. (Panels: MCS at four realization counts and the FEM solution; axes: displacement $x_1$, velocity $x_2$.)
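The PDF estimates compared with the FEM solution are, in essence, two-dimensional histograms of the realizations. The following is a minimal sketch of that estimate, not the authors' implementation; the mesh extent, bin width, and the use of standard normal samples as stand-in realizations are all assumptions made for illustration.

```python
import numpy as np

def histogram_pdf(x1, x2, edges1, edges2):
    """Estimate a 2-D PDF by counting realizations per bin and dividing by
    (total number of realizations * bin area). Realizations outside the mesh
    are ignored, so the estimate integrates to slightly less than one."""
    counts, _, _ = np.histogram2d(x1, x2, bins=[edges1, edges2])
    area = np.outer(np.diff(edges1), np.diff(edges2))
    return counts / (x1.size * area), counts

# Stand-in realizations: independent standard normal states (assumption)
rng = np.random.default_rng(1)
n_real = 200_000
x1 = rng.standard_normal(n_real)
x2 = rng.standard_normal(n_real)

edges = np.linspace(-4.0, 4.0, 17)          # hypothetical 16 x 16 mesh, 0.5 wide
pdf, counts = histogram_pdf(x1, x2, edges, edges)

# A bin count behaves roughly like a Poisson variable, so its relative
# scatter is about 1/sqrt(count): small near the peak, large in the tails.
rel_scatter = 1.0 / np.sqrt(np.maximum(counts, 1.0))
print(f"peak bin:   pdf ~ {pdf[8, 8]:.4f}, relative scatter ~ {rel_scatter[8, 8]:.3f}")
print(f"corner bin: count = {counts[0, 0]:.0f}")
```

The contrast between the peak and corner bins shows why moments converge with far fewer realizations than the PDF does.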

Figure 6: PDFs of 2-D Duffing at t = π secs. (Panels: MCS at four realization counts and the FEM solution; axes: displacement $x_1$, velocity $x_2$.)

Figure 7: PDFs of 2-D Duffing at t = π secs. (Same panels and axes as Figure 6.)

For these analyses, a mesh (. apart in each dimension) was used. It must be noted that for a finer mesh (i.e., more nodes or bins), the finite element solution will increase in computational expense, requiring the solution of a number of equations equal to the number of nodes. In order to retain the same accuracy with smaller bins, the MCS would require that the number of realizations grow with the number of bins. Thus the MCS should be more efficient, compared to the FEM, when the mesh is finer.

5 RESULTS OF EARTHQUAKE-EXCITED OSCILLATOR SYSTEM

The earthquake-excited oscillator system is a 4-D linear system in which two of the states, the earthquake filter states, are not of primary interest. The evolution of the second moments of the structure (oscillator) states is shown in Figs. 8-10. Monte Carlo simulations with,,,,,, and,, realizations were performed. The variances of $X_1(t)$ and $X_2(t)$ are fairly accurate even for a small number of realizations. The same is true of the marginal density functions of $X_1(t)$ and $X_2(t)$ for small $t$, i.e., while the PDF is relatively concentrated near the origin. Figure 11 shows this marginal PDF at.7 secs into the simulation; even the, realization MCS is quite good.

Due to the parameters chosen for this system, the marginal PDF rapidly disperses across the phase plane. The marginal PDF is shown in Fig. 12 for t = secs. Here, the, realization simulation is hardly recognizable, and the, realization MCS is only marginally better. The reason is that, at stationarity, the magnitude of the marginal PDF is sufficiently small that the coefficient of variation of the number of realizations that fall in a given bin at a given time is high. If, however, a coarser mesh is used to determine the marginal PDF, so that each bin is larger in area, the number of realizations falling in a given bin at a given time will be larger and its coefficient of variation smaller. This is apparent in the marginal PDF contour plots shown in Fig.
13, where the thin contour lines are the PDFs over a mesh (the centers of the..8 bins are at the small dots), and the bold contour lines are over a mesh (the centers of the.7. bins are at the large dots).

Figure 8: Evolution of a second moment $E[X_i(t)X_j(t)]$ for the 4-D linear system. (Curves: exact stationary value and MCS at four realization counts; horizontal axis: time [secs].)

Figure 9: Evolution of a second moment $E[X_i(t)X_j(t)]$ for the 4-D linear system. (Same curves and axis as Figure 8.)

Figure 10: Evolution of a second moment $E[X_i(t)X_j(t)]$ for the 4-D linear system. (Same curves and axis as Figure 8.)

Note that with the coarser mesh, the

Figure 11: PDFs of 4-D Linear at t =.7 secs. (Panels: MCS at four realization counts; axes: displacement $x_1$, velocity $x_2$.)

Figure 12: PDFs of 4-D Linear at t = secs. (Same panels and axes as Figure 11.)

, realization simulation is significantly more usable than its fine-mesh counterpart. Further study is needed here to quantify the trade-off between PDF accuracy and mesh coarseness in some general way.

Figure 13: PDFs of 4-D Linear at t = secs. Bold contours are on the coarse mesh represented by the bold dots; light contours are on the fine mesh represented by the small dots. (Panels: MCS at four realization counts; axes: displacement $x_1$, velocity $x_2$.)

6 PERFORMANCE ON VARIOUS PLATFORMS

A total of simulations of the earthquake-excited oscillator system were performed on the four computing platforms, varying the number of realizations, the duration of the simulation, the frequency of the storage of the PDFs, and the size of the mesh. One set of parameters ( time steps of. secs, storing the mesh of size every time steps) was chosen as the basis for comparison. (A complete simulation of this system actually requires time steps, but the performance is comparable for shorter simulations.)

The memory required on the various platforms is shown as a function of the number of realizations in Fig. 14. The Sparc (single and network) memory requirements are constant and large because the Fortran compiler on that platform does not allow dynamic memory allocation dependent on the system parameters, so the array sizes must be hard-coded large enough to accommodate any problem given.

Figure 14: Memory requirements on various platforms for the 4-D linear system. (Curves: Convex C, Cray Y-MP, -node CM-5, 3-node CM-5, Sparc, -Sparc network; axes: number of realizations, required memory [MB].)

Another observation is that the massively parallel CM-5 has significantly higher memory requirements because, in order to parallelize the integration of

the state equations, a number of temporary variables that are scalars on other platforms must be arrays of the same length as the vector of state variables of all of the realizations. The result is that the CM-5 implementation uses at least three times as much memory as the Cray Y-MP or Convex C. Note that the CM-5 allocates at least MB per processor regardless of the problem size. This per-processor overhead is one drawback of massively parallel systems.

The performance of the MCS on these platforms, as measured by the average number of millions of floating-point operations per second (MFLOPS), is shown in Fig. 15. The 3-node CM-5 performance values vary considerably. This is because the timing of CM-5 codes that do extensive I/O is quite inaccurate, owing to some quirks in the operating system software. Simulations at anywhere from half to double the average performance were observed. (The -node performance would be expected to have the same variations, but this was not verified by running multiple simulations of the same size.)

The parallel implementations (CM-5 and Sparc network) do not reach peak efficiencies until the number of realizations is quite large; thus, for problems requiring large numbers of realizations, the parallel implementations appear superior. This is even more true on the CM-5 if the PDFs are not needed; Fig. 16 shows the effect on CM-5 performance when PDF calculation and storage are removed. The speed increase, which is negligible on the other machines, is as large as a factor of six, as is shown in Fig. 17. On the other machines, finding the PDFs every time steps results in less than a % performance loss over the case where no PDFs are computed; finding them every time steps and every step result in less than % and less than % performance losses, respectively. On the CM-5, however, even computing and storing every th time step results in a tremendous performance loss.
The explanation for this is that computing the PDF, essentially the calculation of a histogram, requires that the states of each realization be passed to the front-end machine to be put in a given bin. This interprocessor communication is quite slow compared to the in-processor integration of the states. (Note: the histogram algorithm is currently under investigation by NCSA to determine its performance bottlenecks, so it is possible that the CM-5 performance loss may be partially ameliorated in future versions of the CMSSL libraries.)

Figure 15: Floating-point performance on various platforms for the 4-D linear system. (* single-processor performance) (Curves: Cray Y-MP*, Sparc, Convex C*, -node CM-5, 3-node CM-5, -Sparc network; axes: number of realizations, performance [MFLOPS].)

Figure 16: The effect of computing the PDFs of the 4-D linear system on performance of the CM-5. (Curves: 3-node CM-5 and -node CM-5, each with PDFs every time steps and with no PDFs; axes: number of realizations, performance [MFLOPS].)

Figure 17: The speed increase when not computing the PDFs of the 4-D linear system on the CM-5. (Curves: -node CM-5 and -node CM-5; axes: number of realizations, speed increase factor.)
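To make the communication cost concrete, the sketch below contrasts the data volume of shipping every state to one place for binning with a hypothetical alternative in which each partition bins its own realizations and only the bin counts are combined. This alternative is an illustration only: it is neither the CMSSL histogram routine nor the approach used in this study, and the chunk count, mesh, and sample data are assumptions.

```python
import numpy as np

def chunked_histogram(x1, x2, edges1, edges2, n_chunks):
    """Bin each chunk of realizations separately (standing in for one node's
    local data) and sum the per-chunk counts. Only counts cross chunks."""
    total = np.zeros((len(edges1) - 1, len(edges2) - 1))
    for c1, c2 in zip(np.array_split(x1, n_chunks), np.array_split(x2, n_chunks)):
        local, _, _ = np.histogram2d(c1, c2, bins=[edges1, edges2])
        total += local
    return total

def values_moved(n_real, n_states, n_bins, n_chunks):
    """Rough count of numbers transferred under each strategy."""
    ship_states = n_real * n_states      # every state goes to the front end
    ship_counts = n_chunks * n_bins      # one count array per chunk
    return ship_states, ship_counts

# Hypothetical problem size
rng = np.random.default_rng(2)
n_real, n_chunks = 100_000, 32
x1 = rng.standard_normal(n_real)
x2 = rng.standard_normal(n_real)
edges = np.linspace(-4.0, 4.0, 21)       # 20 x 20 mesh (assumption)

combined = chunked_histogram(x1, x2, edges, edges, n_chunks)
reference, _, _ = np.histogram2d(x1, x2, bins=[edges, edges])
states, counts_moved = values_moved(n_real, 2, 20 * 20, n_chunks)
print(f"values moved: states = {states}, per-chunk counts = {counts_moved}")
```

Summing local counts gives the same histogram as binning everything centrally; which strategy moves less data depends on the ratio of realizations to bins, and for fine meshes with few realizations the advantage can reverse.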

Another performance issue is the significant percentage of the time-step iteration CPU time spent in generating the uniform random values used to compute the white noise input to the system (as low as 3% on the Sparcs, % on the Cray Y-MP, -% on the CM-5, and 8% on the Convex C). Performance could be increased significantly with faster random number generation routines.

One model of speed increase in parallel computation is to define the speed increase factor by

\[ S_n = \frac{1}{\alpha + \dfrac{1-\alpha}{n} + f(n)} \tag{25} \]

where $n$ is the number of processors, $\alpha$ is the fraction of the code that cannot be processed in parallel, and $f(n)$, a function of the number of processors, is the parallel overhead (Sues et al. 99). The parallel efficiency is defined to be $S_n^{\text{observed}} / ST_n$, where $ST_n$ is the theoretical speed increase defined by $ST_n = S_n(\alpha = 0, f(n) = 0)$. The parallel speed increase and efficiency for the CM-5 are shown in Fig. 18. The efficiency of the -node CM-5 is only 83% of the theoretical compared to the 3-node CM-5.

7 DISCUSSION AND CONCLUSIONS

The CM-5 has obvious advantages in its parallel architecture for problems requiring little interprocessor communication. Thus, if the PDF is not required, or only the stationary probability density is of interest, then the massively parallel architecture is perfectly suited to the MCS. The fast vector machines, on the other hand, are well suited to performing the PDF computation.

The effort required in porting code to the CM-5 is minimal if one is already familiar with Fortran 90 array operations (similar to the method used by MATLAB). The authors had a working port of the MCS for this study in about a week. Performance was increased two-fold with an additional week or so of investigation, reading, and re-coding. The CM-5 does offer CMAX (Using the CMAX Converter 1993), a program that attempts to convert existing Fortran 77 code to CM Fortran. In the opinion of the authors, the converter performs only marginally well for general codes.
For the code used in this study, the converter did rather poorly; the generated CM Fortran code ran at only a fraction (less than %) of the speed of even the first CM Fortran code written by the authors, and less than % of the latest and most optimized version.

One pitfall found here was strange behavior of the MCS on the Cray Y-MP. When the number of realizations was to a power greater than 7, the PDFs were far too peaked. For any other number of realizations, the system behaves as expected, with the error declining as the number of realizations increases. Figure 19 shows the rms error of the PDF at stationarity for the 2-D Duffing system discussed above. This problem was determined by Cray Research to be an error in the vectorization of the Cray random number generator that caused correlation in the white noise.

Figure 18: The parallel speed increase and efficiency on the CM-5. (Curves: theoretical speedup, observed speedup, observed efficiency; horizontal axis: number of processors.)

Figure 19: The error in the stationary PDF of the 2-D Duffing system on the Cray Y-MP. (Series: #realizations = ^n-, ^n, ^n+, and other; axes: number of realizations, RMS error.)
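The speed-increase model and the parallel efficiency plotted in the speedup figure can be sketched as follows. The arrangement of terms, S_n = 1 / (alpha + (1 - alpha)/n + f(n)), is an assumption chosen so that the theoretical speed increase with alpha = 0 and f(n) = 0 equals n, as the text defines it; the serial fraction, overhead function, and processor counts below are hypothetical, not measurements from this study.

```python
def speedup(n, alpha, overhead=0.0):
    """Speed increase on n processors for serial fraction alpha and
    parallel overhead f(n) = overhead (assumed form of the model)."""
    return 1.0 / (alpha + (1.0 - alpha) / n + overhead)

def efficiency(observed_speedup, n):
    """Observed speed increase relative to the theoretical one,
    ST_n = speedup(n, alpha=0, overhead=0) = n."""
    return observed_speedup / speedup(n, 0.0, 0.0)

# Hypothetical serial fraction and a simple linear overhead model
ALPHA = 0.002
for n in (32, 64, 128, 256, 512):
    s = speedup(n, ALPHA, overhead=1e-6 * n)
    print(f"n = {n:3d}: speedup ~ {s:6.1f}, efficiency ~ {efficiency(s, n):.2f}")
```

Even a very small serial fraction pulls the efficiency well below one at large processor counts, which is the qualitative behavior the efficiency curve illustrates.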

8 ACKNOWLEDGMENT

This project has been supported in part by National Science Foundation contracts ECS- 988, CEE-9-N, and MSS-9-N, the latter two through the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign.

9 REFERENCES

Aburto, A. (aburto@marlin.nosc.mil) 99. FLOPS v.. benchmark code flops.c.
Bergman, L.A., Spencer, B.F., Wojtkiewicz, S.F., and E.A. Johnson 99. Robust Numerical Solution of the Fokker-Planck Equation for Second Order Dynamical Systems Under Parametric and External White Noise Excitations. Proceedings of the International Symposium on Nonlinear Dynamics and Stochastic Mechanics, Waterloo, Ontario, 1993 (in press).
CM-5 CM Fortran Performance Guide 99. Version. (January 99). Cambridge, Mass.: Thinking Machines Corporation.
CM-5 Technical Summary 1993. Nov. Cambridge, Mass.: Thinking Machines Corporation.
Cohen, Jarrett 1993. NCSA and Structural Mechanics. access 7:-.
NCSA Connection Machine User Guide 1993. Version.. Board of Trustees of the University of Illinois.
Pradlwarter, H.J., G.I. Schuëller, and P.G. Melnik-Melnikov 99. Reliability of MDOF-Systems. Journal of Probabilistic Engineering Mechanics (in review).
Soong, T. and M. Grigoriu 1993. Random Vibration of Mechanical and Structural Systems. Englewood Cliffs, New Jersey: Prentice Hall.
Spencer, B.F., Jr. and L.A. Bergman 1993. On the Numerical Solution of the Fokker-Planck Equation for Nonlinear Stochastic Systems. Nonlinear Dynamics :
Sues, R.H., H.-C. Chen, and L.A. Twisdale 99. Probabilistic Structural Mechanics Research for Parallel Processing Computers. NASA CR-87.
Sues, R.H., Y.J. Lua, and M.D. Smith 99. Parallel Computing for Probabilistic Response Analysis of High Temperature Composites. NASA CR-97.
UNICOS User's Guide 99. Version 3., Board of Trustees of the University of Illinois.
Using the CMAX Converter 1993. Version. (July 1993). Cambridge, Mass.: Thinking Machines Corporation.
