WRF performance tuning for the Intel Woodcrest Processor

1 WRF performance tuning for the Intel Woodcrest Processor
A. Semenov, T. Kashevarova, P. Mankevich, D. Shkurko, K. Arturov, N. Panov
Intel Corp., pr. ak. Lavrentieva 6/1, Novosibirsk, Russia
{alexander.l.semenov,tamara.p.kashevarova,pavel.v.mankevich,

2 Notations and abbreviations
Woodcrest processor: Dual-Core Intel Xeon Processor model 5160
ppn: processes per node (the number of cores used on a node)
MVAPICH: an MPI implementation from the Ohio State University
WRF CONUS 12km: 48-hour, 12 km resolution case over the Continental U.S. (CONUS) domain, October 24, 2001, that uses the Eulerian Mass dynamics
WS8, WS9: benchmarks based on WRF 1.2 and correspondingly and described on

3 WRF-based codes that we worked on
WRF v
NOAA WRF-based benchmark workstreams
WRF v
WS8: WRF chemistry module
WS9: WRF 2.0.2

4 Hardware and software
256 Woodcrest nodes, 2 sockets with 2 cores each, 3.0 GHz; 4 MB L2; 8 GB RAM; InfiniBand interconnect
Red Hat Enterprise Linux 3.0
Intel Fortran and C compilers v9.1 for Linux
Intel MPI 2.0, MVAPICH
WRF v
Workload: CONUS 12km; 48-hour forecast (2400 time steps)

The results of running WRF CONUS 12km on 32 cores with ppn = 4:

Options                                   Compile time (secs)   Time (secs)   Speed up to base
-O3                                                                           base
-O3 -ip                                                                       1.3%
-O3 -xT                                                                       4.6%
-O3 -ip -xT                                                                   6.8%
-O3 -ip -xT -no-prec-div -no-prec-sqrt                                        10.3%
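The "Speed up to base" figures compare each build against the plain -O3 build. The slide does not state the formula, so the following minimal sketch assumes speed-up = (t_base / t - 1) * 100. The function name and the timings are hypothetical placeholders, not the measured values.

```python
def speedup_percent(base_secs, secs):
    """Percent speed-up relative to the base run.

    Assumed definition (not given on the slide): t_base / t - 1.
    """
    if base_secs <= 0 or secs <= 0:
        raise ValueError("timings must be positive")
    return (base_secs / secs - 1.0) * 100.0


# Hypothetical wall times in seconds; NOT the values from the slide.
timings = {
    "-O3": 1000.0,
    "-O3 -ip": 990.0,
}

base = timings["-O3"]
for flags, secs in timings.items():
    print(f"{flags:<10} {speedup_percent(base, secs):5.1f}%")
```

With this definition a build that runs in the same time as the base reports 0%, and a faster build reports a positive percentage.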

5 Why use Intel Compilers?
Efficiency: inherent ability to highly optimize codes for all Intel processors
Ease of use: automatic optimization features make it easier to obtain highly optimized target code
Intel Premier Support: training, best known methods, problem fixes and workarounds
Intel compilers use:
  Speculative memory accesses
  Advanced branch prediction
  Software pipelining for Intel Itanium
There are other useful Intel software tools:
  Performance Analyzer VTune
  Threading Tools
  Cluster Tools
Specific optimization for Woodcrest

6 Useful Intel Compiler Options for Woodcrest
-O2: turns on default optimizations for speed
-O3: enables the -O2 optimization level and performs more aggressive optimizations, in particular loop transformations
-ip / -ipo: enables single-file / multi-file interprocedural optimizations
-no-prec-div, -no-prec-sqrt: enable faster but slightly less accurate algorithms for division and square root (may affect floating-point accuracy)
-xT: enables optimizations specific to Woodcrest
-unroll0: disables loop unrolling in the file
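To make the option combinations from the benchmark concrete, the sketch below assembles the five flag sets and prints one compile line for each. It is only an illustration: the helper names and the source file name are hypothetical, the Woodcrest switch is spelled -xT as in Intel's compiler documentation, and the commands are printed rather than executed. In a real WRF build the flags would go into the optimization settings of the generated configure file instead.

```python
# Flag sets from the 32-core benchmark, each built on top of -O3.
BASE = ["-O3"]
EXTRAS = [
    [],
    ["-ip"],
    ["-xT"],
    ["-ip", "-xT"],
    ["-ip", "-xT", "-no-prec-div", "-no-prec-sqrt"],
]


def flag_sets():
    """Return each option combination as a single string."""
    return [" ".join(BASE + extra) for extra in EXTRAS]


def compile_command(flags, source="module_example.f90"):
    """Build an ifort command line; the source file name is hypothetical."""
    return f"ifort {flags} -c {source}"


for flags in flag_sets():
    # Printed only; nothing is compiled here.
    print(compile_command(flags))
```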

7 Profiling with Intel Performance Analyzer VTune
Performs exhaustive data collection
Has multiple useful display options that help a developer quickly locate hotspots in the code and decide on a strategy for performance improvement
Multiple data views
Very intuitive user interface
Easy switching to assembly view and assembly instruction events

8 Decompositions of WRF 2.1.1 CONUS 12km
Options: -O3 -ip -xT; ppn = 4; 0 I/O servers
Columns: Number of cores; Decomposition; Wall time (sec); Speed up to default decompositions

Number of cores   Default decomposition
16                4x4
32                4x8
64                8x8
128               8x16
256               16x16
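A decomposition here is the process grid that splits the horizontal domain into patches, one per MPI process. WRF lets it be overridden from the default through the nproc_x and nproc_y namelist entries. The sketch below enumerates the candidate grids for a given core count and the resulting patch sizes. The 425 x 300 horizontal grid is an assumption about the CONUS 12km case, not a figure from the slides, and the function names are illustrative.

```python
import math


def decompositions(ncores):
    """All (nproc_x, nproc_y) pairs whose product is ncores."""
    return [(px, ncores // px) for px in range(1, ncores + 1) if ncores % px == 0]


def patch_size(nx, ny, px, py):
    """Largest patch, in grid points, owned by one process on a px-by-py grid."""
    return math.ceil(nx / px), math.ceil(ny / py)


# Assumed horizontal grid of the CONUS 12km case; not stated on the slide.
NX, NY = 425, 300

for px, py in decompositions(32):
    sx, sy = patch_size(NX, NY, px, py)
    print(f"{px:>2} x {py:<2} -> patch up to {sx} x {sy} points")
```

Comparing wall times across these candidates for a fixed core count is the kind of experiment the slide reports; which shape wins depends on cache behaviour and halo-exchange cost, so it has to be measured.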

9 Timings for WRF CONUS 12km, 2400 time steps
[Figure: WRF CONUS 12km, time of full run (secs) vs. number of cores]

10 Scalability for WRF CONUS 12km, 2400 time steps
[Figure: WRF CONUS 12km, scalability for average time per step vs. number of cores]
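Scalability based on average time per step is commonly summarized as speed-up and parallel efficiency relative to the smallest run. The sketch below shows that calculation under this common convention, which may differ from the one used for the original chart. The sample timings are invented for illustration.

```python
def scaling_table(step_times):
    """Return (cores, speedup, efficiency) rows relative to the smallest core count.

    step_times maps a core count to the average seconds per time step.
    """
    base_cores = min(step_times)
    base_time = step_times[base_cores]
    rows = []
    for cores in sorted(step_times):
        speedup = base_time / step_times[cores]
        # Efficiency of 1.0 means perfect scaling from the base run.
        efficiency = speedup / (cores / base_cores)
        rows.append((cores, speedup, efficiency))
    return rows


# Hypothetical average seconds per step; NOT measured values.
sample = {16: 4.0, 32: 2.1, 64: 1.2}

for cores, speedup, efficiency in scaling_table(sample):
    print(f"{cores:>4} cores  speed-up {speedup:5.2f}  efficiency {efficiency:6.1%}")
```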

11 Comparison of Woodcrest and a previous Xeon processor, Irwindale
[Figure: average time per step (secs) vs. number of CPUs; Woodcrest 3.0 GHz compared with Xeon DP 3.6 GHz, 2 MB L2, 8 GB RAM]

12 NOAA WRF-based codes
Workstream 8: WRF 5KM CHEM. This benchmark uses the WRF version under development in cooperation with multiple government and academic agencies. This version of WRF is based on the Advanced Research WRF Eulerian mass coordinate. The benchmark includes code to produce chemical tracers and incorporates cloud chemistry code to predict chemical interaction and dispersion.
Workstream 9: WRF 5KM SI. This benchmark is a test of the WRF Advanced Research version (ARW). The test contains six individual WRF tests with sample output and results for each. The six tests are: squall2d_x, squall2d_y, 3D quarter-circle shear supercell simulation, 2D flow over a bell-shaped hill, 3D baroclinic wave, and 2D gravity current.

13 Timings for NOAA WS8 & WS9 benchmarks
[Figure: WS8, 36-hour simulation runs, time (secs) vs. number of cores]
[Figure: WS9, 6-hour simulation runs, time (secs) vs. number of cores]

14 Conclusions
Running WRF and WRF-based applications on the Woodcrest processor showed very high efficiency of the processor in both computation and scalability.
On WRF, Woodcrest beats previous Intel Xeon processors.
All benchmarks passed validation without any special effort.
A number of hotspots have not been addressed yet.
We are going to explore Intel MKL more intensively for WRF optimization, using not only its vectorized math functions but also more complex routines and solvers.
