Analysis of the Weather Research and Forecasting (WRF) Model on Large-Scale Systems


John von Neumann Institute for Computing

Analysis of the Weather Research and Forecasting (WRF) Model on Large-Scale Systems
Darren J. Kerbyson, Kevin J. Barker, Kei Davis

Published in Parallel Computing: Architectures, Algorithms and Applications, C. Bischof, M. Bücker, P. Gibbon, G. R. Joubert, T. Lippert, B. Mohr, F. Peters (Eds.), John von Neumann Institute for Computing, Jülich, NIC Series, Vol. 38, ISBN 978-3-9810843-4-4, pp. 89-98, 2007. Reprinted in: Advances in Parallel Computing, Volume 15, ISSN 0927-5452, ISBN 978-1-58603-796-3 (IOS Press), 2008.

© 2007 by John von Neumann Institute for Computing. Permission to make digital or hard copies of portions of this work for personal or classroom use is granted provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise requires prior specific permission by the publisher mentioned above.

http://www.fz-juelich.de/nic-series/volume38

Analysis of the Weather Research and Forecasting (WRF) Model on Large-Scale Systems

Darren J. Kerbyson, Kevin J. Barker, and Kei Davis
Performance and Architecture Lab, Los Alamos National Laboratory, Los Alamos, NM, USA
E-mail: {djk, kjbarker, kei.davis}@lanl.gov

In this work we analyze the performance of the Weather Research and Forecasting (WRF) model using both empirical data and an accurate analytic performance model. WRF is a large-scale mesoscale numerical weather prediction system designed for both operational forecasting and atmospheric research. It is in active development at the National Center for Atmospheric Research (NCAR), and can use 1,000s of processors in parallel. In this work we compare the performance of WRF on a cluster-based system (AMD Opteron processors interconnected with 4x SDR Infiniband) to that on a mesh-based system (IBM Blue Gene/L interconnected with a proprietary 3-D torus). In addition, we develop a performance model of WRF that is validated against these two systems and that exhibits high prediction accuracy. The model is then used to examine the performance of a near-term future generation supercomputer.

1 Introduction

The Weather Research and Forecasting (WRF) model is a community mesoscale numerical weather prediction system with nearly 5,000 users, developed by a consortium of government agencies together with the research community. It is used for both operational forecasting and atmospheric research, particularly at the 1-10 km scale, and is capable of modelling events such as storm systems and hurricanes [10]. It is also being used for regional climate modelling, chemistry and air-quality research and prediction, large eddy simulations, cloud and storm modelling, and data assimilation. Features of WRF include dynamical cores based on finite difference methods and many options for physical parameterizations (microphysics, cumulus parameterization, planetary boundary layers, turbulence, radiation, and surface models) that are being developed by various groups. It includes two-way moving nests and can be coupled with other models including hydrology, land-surface, and ocean models.

WRF has been ported to various platforms and can utilize thousands of processors in parallel. Future computational requirements are expected to increase as a consequence of both increased resolution and the use of increasingly sophisticated physics models. Our performance model of WRF allows accurate prediction of the performance of WRF on near-future large-scale systems that may contain many hundreds of thousands of processors.

In this work we analyze the performance of WRF (version 2.2) on two very different large-scale systems: a cluster of 256 Opteron nodes (1,024 processing cores) interconnected using Infiniband, and a small Blue Gene/L system containing 1,024 nodes (2,048 processing cores) interconnected by a proprietary 3-D torus [1]. This comparison allows us to draw conclusions concerning the system sizes required to achieve an equivalent level of performance on WRF.

An important aspect of this work is the capture of key performance characteristics into an analytical performance model. This model is parameterized in terms of the main application inputs (iteration count, number of grid points in each dimension, the computation load per grid point, etc.) as well as system parameters (processor count, communication topology, latencies and bandwidths, etc.). The model also takes as input the time per cell when using all processing cores in a single node; this can be measured on an available system, or determined for a future system using a processor simulator. The utility of the model is its capability of predicting for larger-scale systems that are not available for measurement, and for predicting for future or hypothetical systems.

In Section 2 we provide an overview of WRF, two commonly used input decks, and the measured performance on the two systems. In Section 3 we detail the performance model and show that it has high prediction accuracy when compared to the measured data. In Section 4 we compare the performance of the two systems and quantify the size of the systems that are required to achieve equivalent performance. We then extend this work to a future large-scale system architecture using the analytic performance model we develop.

2 Overview of WRF

WRF uses a three-dimensional grid to represent the atmosphere at scales ranging from meters to thousands of kilometers, topographical land information, and observational data to define initial conditions for a forecasting simulation. It features multiple dynamical cores, of which one is chosen for a particular simulation. In this work we have analyzed two simulations that are defined by separate input decks, standard.input and large.input, as used in the Department of Defense technology insertion benchmark suite (TI-06). Both input decks are commonly used in the performance assessment of WRF and are representative of real-world forecasting scenarios. The main characteristics of these input decks are listed in Table 1. Both inputs define a weather forecast for the continental United States but at different resolutions, resulting in differences in the global problem size (number of cells) processed. As a consequence a smaller time step is necessary for large.input. Different physics are also used in the two forecasts, the details of which are not of concern to this work but nevertheless impact the processing requirements (and so the processing time per cell).

The global grid is partitioned in the two horizontal dimensions across a logical 2-D processor array. Each processor is assigned a subgrid of approximately equal size. Strong scaling is used to achieve faster time to solution, thus the subgrid on each processor becomes increasingly smaller with increased processor count, and the proportion of time spent in parallel activities increases. The main parallel activities in WRF consist of boundary exchanges that occur in all four logical directions: 39 such exchanges occur in each direction in each iteration when using large.input, and 35 when using standard.input. All message sizes are a function of the two horizontal dimensions of the subgrid as well as the subgrid depth, and can range from 10s to 100s of KB at a 512-processor scale. Each iteration on large.input advances the simulation time by 30 seconds. A total of 5,760 iterations performs a 48-hour weather forecast. In addition, every 10 minutes in the simulation (every 20th iteration) a radiation physics step occurs, and every hour in the simulation (every 120th iteration) a history file is generated. These latter iterations involve large I/O operations and are excluded from our analysis.
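To make this decomposition and iteration bookkeeping concrete, the following Python sketch (illustrative only, not part of WRF or of the authors' model code) computes the per-processor subgrid for large.input under strong scaling on a logical 2-D processor array, and counts how many iterations include radiation physics (every 20th) or history output (every 120th). The grid dimensions and intervals are taken from the text and Table 1; the 32 x 16 processor arrangement is an assumed example.

```python
import math

def subgrid(nx, ny, px, py):
    """Per-processor subgrid size under strong scaling:
    roughly ceil(N/P) cells in each horizontal dimension."""
    return math.ceil(nx / px), math.ceil(ny / py)

# large.input: 1010 x 720 horizontal cells, 38 vertical levels (Table 1).
NX, NY, NZ = 1010, 720, 38
TIME_STEP_S = 30                     # simulated seconds per iteration
FORECAST_HOURS = 48                  # 48-hour forecast

n_iter = FORECAST_HOURS * 3600 // TIME_STEP_S    # 5,760 iterations in total
n_radiation = n_iter // 20                       # every 10 simulated minutes
n_history = n_iter // 120                        # every simulated hour

# Assumed example: 512 processors arranged as a 32 x 16 logical 2-D array.
sx, sy = subgrid(NX, NY, 32, 16)

print(f"iterations: {n_iter} (radiation physics: {n_radiation}, history output: {n_history})")
print(f"per-processor subgrid: {sx} x {sy} x {NZ} cells")
```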

                             standard.input       large.input
  Simulation
    Resolution               12 km                5 km
    Duration                 2 days               2 days
    Iteration time-step      72 s                 30 s
    Total no. iterations     2,400                5,760
  Grid dimensions
    East-west                425                  1010
    North-south              300                  720
    Vertical                 35                   38
    Total grid points        4.5 M                27.6 M
  Physics
    Microphysics             WSM 3-class simple   WSM 5-class
    Land surface             Thermal diffusion    Noah model
    Radiation physics        10 min               10 min
    Lateral boundary updates 6 hr                 3 hr
    History file gen.        1 hr                 1 hr

Table 1. Characteristics of the two commonly used WRF input decks.

Figure 1. Variation in the iteration time on a 512-processor run for large.input.

An example of the measured iteration time, for the first 720 iterations, taken from a 512-processor run using large.input on the Opteron cluster is shown in Fig. 1. Large peaks in the iteration time can clearly be seen every 120 iterations because of the history file generation. Note that the range on the vertical axis in Fig. 1 is 0 to 2 s, while the history file generation (off scale) takes 160 s. The smaller peaks every 20th iteration result from the extra processing by the radiation physics. It should also be noted that the time for a typical iteration is nearly constant.

3 Performance Model Overview

An overview of the performance model we have developed is described below, followed by a validation of the model's predictive capability.

3.1 Model Description

The runtime of WRF is modeled as

  T_{RunTime} = N_{Iter} \cdot T_{CompIter} + N_{IterRP} \cdot T_{CompIterRP} + (N_{Iter} + N_{IterRP}) \cdot T_{CommIter}    (1)

where N_{Iter} is the number of normal iterations, N_{IterRP} is the number of iterations with radiation physics, T_{CompIter} and T_{CompIterRP} are the modeled computation times per iteration for the two types of iterations, respectively, and T_{CommIter} is the communication time per iteration. The computation times are a function of the number of grid points assigned to each processor, and the measured time per cell on a processing node of the system:

  T_{CompIter}   = \lceil N_x / P_x \rceil \cdot \lceil N_y / P_y \rceil \cdot N_z \cdot T_{CompPerCell}    (2)
  T_{CompIterRP} = \lceil N_x / P_x \rceil \cdot \lceil N_y / P_y \rceil \cdot N_z \cdot T_{CompRPPerCell}

Note that the computation time per cell for both types of iteration may also be a function of the number of cells in a subgrid.

The communication time consists of two components: one for the east-west (horizontal x dimension) and one for the north-south (horizontal y dimension) boundary exchanges. Each exchange is done by two calls to MPI_Isend and two calls to MPI_Irecv followed by an MPI_Waitall. From an analysis of an execution of WRF, the number of boundary exchanges, NumBX, was found to be 35 and 39 for standard.input and large.input, respectively. The total communication time per iteration is then modeled as

  T_{CommIter} = \sum_{i=1}^{NumBX} \left( T_{comm}(Size_{x_i}, C_x) + T_{comm}(Size_{y_i}, C_y) \right)    (3)

where the sizes of the messages, Size_{x_i} and Size_{y_i}, vary over the NumBX boundary exchanges. A piece-wise linear model for the communication time is assumed, which uses the latency, L_c, and bandwidth, B_c, of the communication network in the system. The effective communication latency and bandwidth vary depending on the size of a message and also on the number of processors used (for example, in the cases of intra-node and inter-node communications for an SMP-based machine):

  T_{comm}(S, C) = L_c(S) + C \cdot S \cdot \frac{1}{B_c(S)}    (4)

The communication model uses the bandwidths and latencies of the communication network observed in a single direction when performing bi-directional communications, as is the case in WRF for the boundary exchanges. They are obtained from a ping-pong type communication micro-benchmark that is independent of the application and in which the round-trip time required for bi-directional communications is measured as a function of the message size. This should not be confused with the peak uni-directional performance of the network or with peak measured bandwidths from a performance evaluation exercise.
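The model above translates almost line-for-line into code. The Python sketch below is a minimal rendering of equations (1)-(4): latency and bandwidth are passed in as functions of message size to reflect the piece-wise linear model, and the per-cell times, contention factors, and message-size list are placeholders to be filled in from measurement. It illustrates the structure of the model rather than reproducing the authors' implementation.

```python
import math
from typing import Callable, Sequence, Tuple

def t_comm(size_bytes: float, contention: float,
           latency: Callable[[float], float],
           bandwidth: Callable[[float], float]) -> float:
    """Equation (4): time for one boundary exchange of size_bytes."""
    return latency(size_bytes) + contention * size_bytes / bandwidth(size_bytes)

def t_comm_iter(msg_sizes: Sequence[Tuple[float, float]],
                c_x: float, c_y: float,
                latency: Callable[[float], float],
                bandwidth: Callable[[float], float]) -> float:
    """Equation (3): sum over the NumBX exchanges, one x and one y message each."""
    return sum(t_comm(sx, c_x, latency, bandwidth) +
               t_comm(sy, c_y, latency, bandwidth)
               for sx, sy in msg_sizes)

def t_comp_iter(nx: int, ny: int, nz: int, px: int, py: int,
                t_per_cell: float) -> float:
    """Equation (2): computation time over one processor's subgrid."""
    return math.ceil(nx / px) * math.ceil(ny / py) * nz * t_per_cell

def t_runtime(n_iter: int, n_iter_rp: int,
              t_comp: float, t_comp_rp: float, t_comm_per_iter: float) -> float:
    """Equation (1): total modeled runtime."""
    return (n_iter * t_comp + n_iter_rp * t_comp_rp
            + (n_iter + n_iter_rp) * t_comm_per_iter)
```

Supplying measured per-cell times and a measured latency/bandwidth curve for a given machine then yields predictions of the kind discussed in Sections 4 and 5.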

The contention that occurs during inter-node communication is dependent on the communication direction (x or y dimension), the arrangement of subgrids across processing nodes, the network topology and routing mechanism, and the number of processing cores per node. The contention within the network is denoted by the parameters C_x and C_y in equation (4). For example, a logical 2-D array of subgrids can be folded into the 3-D topology of the Blue Gene interconnection network, but at certain scales this will result in more than one message requiring the use of the same communication channel, resulting in contention [3, 8]. Contention can also occur on a fat-tree network due to static routing (as used in Infiniband, for example) but can be eliminated through routing table optimization [6]. The node size also determines the number of processing cores that share the connections to the network and hence also impacts the contention.

3.2 Model Validation

Two systems were used to validate the performance model of WRF. The first contains 256 nodes, each containing two dual-core AMD Opteron processors running at 2.0 GHz. Each node contains a single Mellanox Infiniband HCA having a single 4x SDR connection to a 288-port Voltaire ISR9288 switch. In this switch 264 ports are populated for the 256 compute nodes and also for a single head node. The Voltaire ISR9288 implements a two-level 12-ary fat tree. All communication channels have a peak of 10 Gb/s per direction. This system is physically located at Los Alamos National Laboratory.

The second system is a relatively small-scale Blue Gene/L system located at Lawrence Livermore National Laboratory (a sister to the 64K-node system used for classified computing). It consisted of two mid-planes, each containing 512 nodes with dual-core embedded PowerPC 440 processors running at 700 MHz. The main communication network arranges these nodes in a 3-D torus, and a further network is available for some collective operations and for global interrupts.

The characteristics of both of these systems are listed in Table 2. Note that the stated MPI performance is for near-neighbour uni-directional communication [4]. The quantity n_1/2 is the message size that achieves half of the peak bandwidth. It effectively indicates when a message is latency bound (when its size is less than n_1/2) or bandwidth bound.

                             Opteron cluster   Blue Gene/L
  System
    Peak                     4.1 Tflops        5.7 Tflops
    Node count               256               1,024
    Core count               1,024             2,048
    Core speed               2.0 GHz           0.7 GHz
  Nodes
    Peak                     16 Gflops         5.6 Gflops
    Cores                    4                 2
    Memory                   8 GB              512 MB
  Network
    Topology                 12-ary fat tree   3-D torus
    MPI (zero-byte) latency  4.0 µs            2.8 µs
    MPI (1 MB) bandwidth     950 MB/s          154 MB/s
    n_1/2                    15,000 B          1,400 B

Table 2. Characteristics of the two systems used in the validation of the WRF model.
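As a small aside on the n_1/2 values quoted in Table 2, the quantity can be estimated directly from ping-pong measurements using its definition: the message size at which the effective bandwidth S/T(S) reaches half of its peak. The sketch below does this by linear interpolation over a set of (size, time) samples; the sample data here are invented for illustration and do not correspond to either machine.

```python
def n_half(samples):
    """Estimate n_1/2 from (message_size_bytes, one_way_time_seconds) samples,
    sorted by message size: the size at which the effective bandwidth
    S / T(S) first reaches half of the peak effective bandwidth."""
    eff_bw = [(s, s / t) for s, t in samples]
    target = max(bw for _, bw in eff_bw) / 2.0
    prev_s, prev_bw = eff_bw[0]
    for s, bw in eff_bw[1:]:
        if prev_bw < target <= bw:
            frac = (target - prev_bw) / (bw - prev_bw)   # linear interpolation
            return prev_s + frac * (s - prev_s)
        prev_s, prev_bw = s, bw
    return None  # half of the peak was never reached in the sampled range

# Invented measurements for illustration (bytes, seconds):
samples = [(8, 4.0e-6), (1_000, 5.0e-6), (10_000, 14e-6),
           (100_000, 110e-6), (1_000_000, 1.05e-3)]
print(f"estimated n_1/2 ~ {n_half(samples):.0f} bytes")
```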

Figure 2. Measured and modeled performance of WRF for both typical and radiation physics iterations: (a) Opteron cluster; (b) Blue Gene/L.

4 Performance Comparison of Current Systems

An interesting aspect of this work is the direct performance comparison of Blue Gene/L with the Opteron cluster on the WRF workload. We consider this in two steps, the first based on measured data alone and the second comparing larger-scale systems using the performance model described in Section 3. In this way we show the utility of the performance model by exploring the performance of systems that could not be measured.

The time for the typical iteration of WRF on standard.input is shown in Fig. 3(a) for both systems, using the measured performance up to 256 nodes of the Opteron system and up to 1,024 nodes of the Blue Gene/L system. The model is used to predict the performance up to 16K nodes of an Opteron system and up to 64K nodes of the Blue Gene/L system. It is clear from this data that the Blue Gene/L system has a much lower performance (longer run-time) when using the same number of nodes. It should also be noted that the performance of WRF is expected to improve only up to 32K nodes of Blue Gene/L; this limitation is due to the small subgrid sizes that occur at this scale for standard.input, and the resulting high communication-to-computation ratio.

The relative performance between the two systems is shown in Fig. 3(b). When comparing performance based on an equal number of processors, the Opteron cluster is between 3 and 6 times faster than Blue Gene/L. When comparing performance based on an equal node count, the Opteron cluster is between 5 and 6 times faster than Blue Gene/L. Note that for larger problem sizes we would expect the runtime on Blue Gene to continue to decrease at larger scales, and thus the additional parallelism available in the system would improve performance.

Figure 3. Predicted performance of WRF on large-scale Blue Gene/L and Opteron/Infiniband systems: (a) time for typical iteration; (b) relative performance (Opteron to BG/L).

5 Performance of Possible Future Blue Gene Systems

The utility of the performance model lies in its ability to explore the performance of systems that cannot be directly measured. To illustrate this we consider a potential next-generation configuration of Blue Gene (Blue Gene/P).

The characteristics of Blue Gene/P that are used in the following analysis are listed in Table 3. It should be noted that this analysis was undertaken prior to any actual Blue Gene/P hardware being available for measurement. We also assume the same logical arrangement as the largest Blue Gene/L system presently installed (a 32x32x64-node 3-D torus), but containing quad-core processors with an 850 MHz clock speed and increased communication performance. Note that the peak of this system is 891 Tflops.

                             Blue Gene/P
  System
    Peak                     891 Tflops
    Node count               65,536
    Core count               262,144
    Core speed               850 MHz
  Nodes
    Peak                     13.6 Gflops
    Cores                    4
    Memory                   4 GB
  Network
    Topology                 3-D torus
    MPI (zero-byte) latency  1.5 µs
    MPI (1 MB) bandwidth     500 MB/s
    n_1/2                    750 B

Table 3. Characteristics of the potential BG/P system.

The predicted performance of WRF using standard.input is shown in Fig. 4 and compared with that of Blue Gene/L (as presented earlier in Fig. 3). It was assumed that the processing rate per cell on a Blue Gene/P processing core would be the same as that on a Blue Gene/L processing core. It can be seen that we expect Blue Gene/P to result in improved performance (reduced processing time) when using up to approximately 10,000 nodes. For larger input decks the performance should improve to an even larger scale.
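As a quick arithmetic check of the configuration in Table 3, the node count, core count, and system peak follow directly from the assumed 32x32x64 torus of quad-core, 13.6 Gflops nodes (a worked check, not additional data):

```python
nodes = 32 * 32 * 64                 # assumed 32x32x64 3-D torus arrangement
cores = nodes * 4                    # quad-core nodes
peak_tflops = nodes * 13.6 / 1000.0  # 13.6 Gflops per node

print(nodes, cores, round(peak_tflops))   # 65536 262144 891
```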

Figure 4. Comparison of Blue Gene/P and Blue Gene/L performance.

The expected performance of WRF on Blue Gene/P is also analyzed in terms of its sensitivity to the compute speed on a single node, the MPI latency, and the MPI bandwidth. The expected performance has a degree of uncertainty because the inputs to the performance model are assumed rather than measured for the possible configuration of Blue Gene/P. A range of values for each of the compute performance, MPI latency, and MPI bandwidth is used, as listed in Table 4. Each of these values is varied from -20% to +20% of the baseline configuration. Each of the three values is varied independently, that is, one quantity is varied while the other two quantities are fixed at the baseline value.

              Processing time/cell (µs)   MPI latency (µs)   MPI bandwidth (MB/s)
  -20%        24                          1.80               400
  -10%        22                          1.65               450
  Baseline    20                          1.50               500
  +10%        18                          1.35               550
  +20%        16                          1.20               600

Table 4. Performance characteristics used in the sensitivity analysis of Blue Gene/P.

Two graphs are presented: the range in computational processing rates in Fig. 5(a), and the range in MPI bandwidths in Fig. 5(b). The sensitivity due to MPI latency was not included since the performance varied by at most 0.1%, i.e. WRF is not sensitive to latency. It can be seen that WRF is mostly sensitive to the processing rate up to 4K nodes (i.e. compute bound in this range), and mostly sensitive to the communication bandwidth at higher scales (i.e. bandwidth bound). Improvements to the processing performance would be most beneficial for coarsely partitioned jobs (large cell counts per processor), whereas increased network bandwidth would be most beneficial for jobs executing on larger processor counts.
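The one-at-a-time variation of Table 4 is straightforward to script. The sketch below sweeps each of the three inputs independently around the baseline while holding the other two fixed, and feeds them to a stand-in per-iteration cost (compute on the local subgrid plus NumBX latency-plus-size/bandwidth exchanges). The subgrid and message sizes are illustrative placeholders, so the printed numbers only indicate the shape of the sensitivity, not the model's actual predictions.

```python
# Baseline Blue Gene/P inputs from Table 4.
BASELINE = {"t_cell_us": 20.0, "latency_us": 1.50, "bandwidth_MBps": 500.0}
VARIATIONS = [-0.20, -0.10, 0.0, +0.10, +0.20]

def iter_time_us(t_cell_us, latency_us, bandwidth_MBps,
                 cells_per_proc=2_000, n_exchanges=35, msg_bytes=50_000):
    """Stand-in per-iteration cost: local compute plus n_exchanges boundary
    exchanges, each modeled as latency + size/bandwidth (1 MB/s = 1 byte/us)."""
    compute = cells_per_proc * t_cell_us
    comm = n_exchanges * (latency_us + msg_bytes / bandwidth_MBps)
    return compute + comm

for name in BASELINE:
    for v in VARIATIONS:
        params = dict(BASELINE)
        # A +20% performance change means less time per cell, lower latency,
        # and higher bandwidth, matching the rows of Table 4.
        if name == "bandwidth_MBps":
            params[name] = BASELINE[name] * (1 + v)
        else:
            params[name] = BASELINE[name] * (1 - v)
        print(f"{name:>14} {v:+.0%}: {iter_time_us(**params):,.0f} us/iteration")
```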

Figure 5. WRF sensitivity analysis on Blue Gene/P: (a) sensitivity to compute performance; (b) sensitivity to MPI bandwidth.

6 Conclusions

We have developed and validated an analytic performance model for the Weather Research and Forecasting (WRF) application and used it, in conjunction with empirical data, to quantitatively study application performance on two current-generation and one near-term future-generation supercomputer. Our analytic performance model was developed through careful study of the dynamic execution behaviour of the WRF application and subsequently validated using performance measurements on two current systems: a 256-node (1,024-core) AMD Opteron cluster using a 4x SDR Infiniband interconnection network, and a 1,024-node (2,048-core) IBM Blue Gene/L system utilizing a custom 3-D torus network. In each case the average performance prediction error was less than 5%.

With a validated performance model in place, we are able to extend our analysis to larger-scale current systems and near-term future machines. At small node counts, we can see that overall application performance is tied most closely to single-processor performance. At this scale, roughly four times as many Blue Gene/L nodes are required to match the performance of the Opteron/Infiniband cluster. At larger scale, communication performance becomes critical; in fact, WRF performance on Blue Gene/L improves very slowly beyond roughly 10K nodes due to the communication contention caused by folding the logical 2-D processor array onto the physical 3-D network. This work is part of an ongoing project at Los Alamos to develop modelling techniques which facilitate analysis of workloads of interest to the scientific computing community on large-scale parallel systems [5].

Acknowledgements

This work was funded in part by the Department of Energy Accelerated Strategic Computing (ASC) program and by the Office of Science. Los Alamos National Laboratory is operated by Los Alamos National Security LLC for the US Department of Energy under contract DE-AC52-06NA25396.

References

1. N. R. Adiga, et al., An Overview of the Blue Gene/L Supercomputer, in: Proc. IEEE/ACM Supercomputing (SC'02), Baltimore, MD, (2002).
2. K. J. Barker and D. J. Kerbyson, A Performance Model and Scalability Analysis of the HYCOM Ocean Simulation Application, in: Proc. IASTED Int. Conf. on Parallel and Distributed Computing (PDCS), Las Vegas, NV, (2005).
3. G. Bhanot, A. Gara, P. Heidelberger, E. Lawless, J. C. Sexton, and R. Walkup, Optimizing Task Layout on the Blue Gene/L Supercomputer, IBM J. Research and Development, 49, 489-500, (2005).
4. K. Davis, A. Hoisie, G. Johnson, D. J. Kerbyson, M. Lang, S. Pakin, and F. Petrini, A Performance and Scalability Analysis of the Blue Gene/L Architecture, in: Proc. IEEE/ACM Supercomputing (SC'04), Pittsburgh, PA, (2004).
5. A. Hoisie, G. Johnson, D. J. Kerbyson, M. Lang, and S. Pakin, A Performance Comparison through Benchmarking and Modeling of Three Leading Supercomputers: Blue Gene/L, Red Storm, and Purple, in: Proc. IEEE/ACM Supercomputing (SC'06), Tampa, FL, (2006).
6. G. Johnson, D. J. Kerbyson, and M. Lang, Application Specific Optimization of Infiniband Networks, Los Alamos Unclassified Report LA-UR-06-7234, (2006).
7. D. J. Kerbyson and A. Hoisie, Performance Modeling of the Blue Gene Architecture, in: Proc. IEEE John Atanasoff Conf. on Modern Computing, Sofia, Bulgaria, (2006).
8. D. J. Kerbyson and P. W. Jones, A Performance Model of the Parallel Ocean Program, Int. J. of High Performance Computing Applications, 19, 1-16, (2005).
9. V. Salapura, R. Walkup, and A. Gara, Exploiting Workload Parallelism for Performance and Power Optimization in Blue Gene, IEEE Micro, 26, 67-81, (2006).
10. Weather Research and Forecasting (WRF) model, http://www.wrf-model.org.