Performance Evaluation of MPI on Weather and Hydrological Models

Transcription:

NCAR/RAL - Performance Evaluation of MPI on Weather and Hydrological Models. Alessandro Fanfarillo (elfanfa@ucar.edu), August 8th, 2018.

Cheyenne - NCAR Supercomputer. Cheyenne is a 5.34-petaflops high-performance computer built for NCAR by SGI. 2.3-GHz Intel Xeon E5-2697V4 (Broadwell) processors, 36 cores per node. Mellanox EDR InfiniBand: bandwidth 25 GB/s bidirectional per link; latency: MPI ping-pong < 1 µs, hardware link 130 ns. Five MPI implementations installed: MPT (SGI), Intel MPI, Open MPI, MVAPICH2-2.3, MPICH-3.2. Casper: heterogeneous system of specialized data analysis and visualization resources and large-memory, multi-GPU nodes (NVIDIA V100).
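For context, sub-microsecond MPI ping-pong latencies like the one quoted above are typically measured with a small round-trip microbenchmark. The sketch below is a minimal, illustrative version in C; it is not the benchmark actually used on Cheyenne.

```c
/* Minimal MPI ping-pong latency sketch (illustrative only).
 * Run with exactly 2 ranks, e.g.: mpiexec -n 2 ./pingpong */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 10000;
    char byte = 0;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)  /* one-way latency is half the average round-trip time */
        printf("avg one-way latency: %.3f us\n", 1e6 * (t1 - t0) / (2.0 * iters));
    MPI_Finalize();
    return 0;
}
```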

Weather Research and Forecast Model (WRF). Parallel numerical mesoscale weather forecasting application used in both research and operational environments. It is a supported community model, i.e. a free and shared resource with distributed development and centralized support. Used for: numerical weather prediction, meteorological case studies, regional climate, air quality, wind energy, hydrology, etc. Implemented mostly in Fortran90, with some parts in C and Perl. It uses MPI for inter-process communication.
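WRF's MPI parallelism decomposes the horizontal domain into per-rank patches that exchange halo regions with their neighbors every time step. The sketch below shows that general pattern for one direction, under the assumption of a simple row-major patch layout; it is not WRF source code and all names are hypothetical.

```c
/* Minimal north/south halo-exchange sketch (illustrative only).
 * The local patch is nx*ny interior points stored row-major, with one
 * extra halo row above (row 0) and below (row ny+1). north/south are
 * neighbor ranks (MPI_PROC_NULL at domain boundaries). */
#include <mpi.h>

void exchange_ns_halos(double *field, int nx, int ny,
                       int north, int south, MPI_Comm comm) {
    MPI_Request req[4];
    /* receive neighbor rows into the halo rows */
    MPI_Irecv(&field[0],             nx, MPI_DOUBLE, north, 0, comm, &req[0]);
    MPI_Irecv(&field[(ny + 1) * nx], nx, MPI_DOUBLE, south, 1, comm, &req[1]);
    /* send the first and last interior rows to the matching neighbors */
    MPI_Isend(&field[1 * nx],        nx, MPI_DOUBLE, north, 1, comm, &req[2]);
    MPI_Isend(&field[ny * nx],       nx, MPI_DOUBLE, south, 0, comm, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}
```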

Summary of WRF benchmarks. All tests were run by Akira Aiko Kyle (Carnegie Mellon University) and Dixit Patel (CU Boulder), interns at NCAR.

Region | Resolution | Grid (west-east x south-north) | Total gridpoints | Time step | Run time
CONUS  | 12 km      | 425 x 300                      | 127,500          | 72 s      | 6 hr
CONUS  | 2.5 km     | 1901 x 1301                    | 2,473,201        | 15 s      | 6 hr
Maria  | 3 km       | 1396 x 1384                    | 1,932,064        | 9 s       | 3 hr
Maria  | 1 km       | 3665 x 2894                    | 10,606,510       | 3 s       | 1 hr

(Total gridpoints is the product of the two horizontal dimensions.)

Overall Scaling (plots: CONUS 12 km and Maria 1 km).

Computation time scaling (plot).

MVAPICH2 - Collectives and IO. Configurations tested: BIND (MV2_CPU_BINDING_POLICY=hybrid, MV2_HYBRID_BINDING_POLICY=bunch) and HW (MV2_USE_MCAST=1, MV2_ENABLE_SHARP=1).

MVAPICH2 - Collectives and IO (plot: Maria 3 km).

WRF-Hydro and the National Water Model. (Figure: Colorado Flood of 11-15 Sept. 2013.) WRF-Hydro was originally designed as a model coupling framework to facilitate coupling between WRF and components of terrestrial hydrological models. WRF-Hydro is both a stand-alone hydrological modeling architecture and a coupling architecture for coupling hydrological models with atmospheric models. Although it was originally designed to be used within the WRF model, it has evolved over time to possess many additional attributes. WRF-Hydro has adopted a community-based development process. An important aspect of WRF-Hydro has been serving as the core model for the new National Water Model.

NWM Analysis Cycle: Before Cache Optimization. 500 processes over 14 nodes on Cheyenne: 560 sec. 32 processes over 2 nodes on Cheyenne: 1 h 10 min.

NWM Analysis Cycle: After Cache Optimization. 500 processes over 14 nodes on Cheyenne: 238 sec. 32 processes over 4 nodes (8 procs per node) on Cheyenne: 18 min. Speedup, 500 vs. 32 processes: ~4.5x. Ideal linear speedup would be ~15.6x.
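For reference, a quick check of those speedup figures from the times quoted above (taking the 18 min figure as the 32-process wall-clock time):

18 min = 1080 s
observed speedup = 1080 s / 238 s ≈ 4.5x
ideal linear speedup = 500 / 32 ≈ 15.6x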

USA River Map (map slide).

Improving the Ramps. The lack of a better data structure to represent the channel topology forces the code to scan the global links list several times. Unbalanced situations penalize performance (a lot!), and all the ramps represent unbalanced situations. There are two possible solutions: (a) using a multi-threaded solution (OpenMP); (b) changing the data structures (a sketch of this idea follows below).
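A minimal sketch of option (b), assuming the channel network is stored as a flat list of links: build a CSR-style index from each node to its inflowing links once, so later lookups never rescan the global list. All names here (Link, to_node, build_inflow_index) are illustrative and not taken from the WRF-Hydro source.

```c
/* Hypothetical channel-topology index (illustrative only). */
#include <stdlib.h>

typedef struct {
    int from_node;   /* upstream node id   */
    int to_node;     /* downstream node id */
} Link;

typedef struct {
    int *offset;     /* inflows of node n are inflow[offset[n] .. offset[n+1]-1] */
    int *inflow;     /* link ids grouped by destination node                     */
} InflowIndex;

/* Build the index in O(nlinks + nnodes); afterwards the inflows of any
 * node are available without scanning the global links list. */
InflowIndex build_inflow_index(const Link *links, int nlinks, int nnodes) {
    InflowIndex idx;
    idx.offset = calloc(nnodes + 1, sizeof(int));
    idx.inflow = malloc(nlinks * sizeof(int));

    for (int l = 0; l < nlinks; l++)          /* count inflows per node  */
        idx.offset[links[l].to_node + 1]++;
    for (int n = 0; n < nnodes; n++)          /* prefix sums -> offsets  */
        idx.offset[n + 1] += idx.offset[n];

    int *cursor = calloc(nnodes, sizeof(int));
    for (int l = 0; l < nlinks; l++) {        /* fill link ids per node  */
        int n = links[l].to_node;
        idx.inflow[idx.offset[n] + cursor[n]++] = l;
    }
    free(cursor);
    return idx;
}
```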

Walltime vs. core count (plot).

Walltime and core-hours (plot).

Improving the IO idle time. Process 0 is in charge of computing AND dealing with IO. This double duty on process 0 slows down the entire application. There are two (and a half) possible solutions: 1. implementing parallel IO; 2. creating two groups of processes, one group for IO and one for computing (asynchronous; see the sketch after this paragraph); 2½. overlapping communication with IO.
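A minimal sketch of option 2, assuming a single dedicated IO rank: split MPI_COMM_WORLD into an IO group and a compute group, so output no longer stalls the ranks doing computation. This is illustrative only, not the WRF-Hydro implementation.

```c
/* Split the job into an IO server and compute ranks (illustrative only). */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* color 0 = IO server (world rank 0), color 1 = compute ranks */
    int color = (rank == 0) ? 0 : 1;
    MPI_Comm role_comm;                     /* communicator within each role */
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &role_comm);

    if (color == 0) {
        /* IO server: receive completed output fields from compute ranks
         * over MPI_COMM_WORLD and write them to disk while the compute
         * group keeps advancing the model. */
    } else {
        /* Compute ranks: advance the model on role_comm and ship finished
         * output to the IO server with non-blocking sends. */
    }

    MPI_Comm_free(&role_comm);
    MPI_Finalize();
    return 0;
}
```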

Strategies for Asynchronous Progress. 1. Manual progress. 2. Thread-based progress (polling-based vs. interrupt-based). 3. Communication offload. Manual progress is the most portable but requires the user to call MPI functions (e.g. MPI_Test); a sketch follows below. Using a dedicated thread to poll MPI requires a thread-safe implementation and implies lots of resource contention and context switching. Offloading does not look promising and may become a bottleneck because of the low performance of the embedded processor on the NIC.
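A sketch of strategy 1 (manual progress): post a non-blocking send and keep calling MPI_Test from the compute loop so the MPI library can progress the transfer while useful work is done. Illustrative only; compute_one_chunk and send_with_manual_progress are hypothetical names, not application code.

```c
/* Manual-progress pattern with MPI_Isend + MPI_Test (illustrative only). */
#include <mpi.h>

static void compute_one_chunk(void) { /* stand-in for real model work */ }

void send_with_manual_progress(const double *buf, int count, int dest,
                               MPI_Comm comm, int nchunks) {
    MPI_Request req;
    int done = 0;
    MPI_Isend(buf, count, MPI_DOUBLE, dest, /* tag */ 0, comm, &req);

    for (int i = 0; i < nchunks; i++) {
        compute_one_chunk();                            /* useful work    */
        if (!done)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);   /* drive progress */
    }
    if (!done)
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* complete before buf is reused */
}
```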

Improvements - 4 Advances (plot).

Improvements - 12 Advances (chart: % improvement for MVAPICH and MPT (Intel)).

Conclusions and Future Development. MVAPICH2 looks promising (needs some tuning on Cheyenne). Overlapping communication and IO represents a good non-solution for WRF-Hydro. Need to implement some sort of parallel IO. Looking forward to using MVAPICH2-X-2.3 (better async progress).

Thanks! Please ask your questions!