Some thoughts about energy efficient application execution on NEC LX Series compute clusters

G. Wellein, G. Hager, J. Treibig, M. Wittmann
Erlangen Regional Computing Center & Department of Computer Science
Friedrich-Alexander-University Erlangen-Nuremberg, Germany

Erlangen Regional Computing Center (RRZE): regional HPC service provider and HPC research center, located in Erlangen. [Slide shows a map of German HPC sites: JuQueen at FZ Jülich (5 PF/s), Hermit at HLRS Stuttgart (1 PF/s), SuperMUC at LRZ München (3 PF/s), plus Hannover and Berlin.]

Erlangen Regional Computing Center

A broad range of users: biology, chemistry, CFD, material science, physics, medicine, economics, ...

A broad range of clusters:
- LINUX (NEC): 560 nodes (234 TF/s), installed 2013
- LINUX (NEC): 500 nodes (64 TF/s), installed 2010
- LINUX (others): 300 nodes (2007-2011)
- WINDOWS (other): 16 nodes (2009)

Installation of a new LINUX cluster every 3 years:
- Decision based on benchmarks from users
- Production nodes: CPU only (benchmark commitments for applications on GPGPU / Phi cards)
- Budget: ~2.5-3 million USD

NEC LX-Cluster@RRZE: dedicated to Emmy Noether
- #210 in TOP500 as of Nov. 2013
- 191.5 TF/s LINPACK (CPU only)
- LINPACK efficiency: 97.1% of 197.1 TF/s peak (based on 2.2 GHz)

Emmy cluster: 234 TF/s peak
- 560 compute nodes: 2x Intel Xeon E5-2660v2 (10-core Ivy Bridge @ 2.2 GHz), 64 GB DDR3 RAM
- 6 GPGPU nodes: 2x NVIDIA K20c
- 6 Phi nodes: 2x Intel Xeon Phi
- 4 mixed nodes: 1x K20c + 1x Phi
- QDR InfiniBand, no local disks

HPC research objectives
- Performance engineering for multi-/manycore architectures
- Efficient programming on hybrid parallel systems
- Fault tolerance
- Multicore tooling
- Applications: sparse matrix schemes and lattice Boltzmann methods

SC13 pointers:
- Tutorial: "The Practitioner's Cookbook for Good Parallel Performance on Multi- and Many-Core Systems" (G. Wellein, G. Hager, J. Treibig)
- Tutorial: "Hybrid MPI and OpenMP Parallel Programming" (G. Jost, R. Rabenseifner, G. Hager)
- Poster: "Pattern-Driven Node-Level Performance Engineering" (J. Treibig, G. Hager, G. Wellein). See you there at 5:15-7:00 today!
- Doctoral Showcase: "A Unified Sparse Matrix Format for Heterogeneous Systems" (M. Kreutzer). Don't miss it Thursday afternoon!

Energy efficient application execution

Best energy efficiency? There are so many parameters to consider! Clock speed? Code variants? SMT? Cores per chip?

What kind of application do you run? Consider scalability within a single multicore processor chip:
- LINPACK type -- limiting factor: core execution; performance scales with the number of cores.
- STREAM type -- limiting factor: saturation of the memory bandwidth; performance levels off after a few cores.

[Slide shows core-scaling plots for both types at different clock speeds, annotated "1.5 X" and "0.6 X": changing the clock speed shifts LINPACK-type performance proportionally, while the STREAM-type saturation level is far less sensitive.]

Simple model for energy to solution: clock speeds and core counts (1)

Performance when running t cores at clock speed f:

    P(f, t) = min( (f / f_0) * P_0 * t , P_max )

    f_0:   baseline clock speed
    P_0:   baseline single-core performance
    P_max: maximum (saturated) chip performance

Power consumption when running t cores at clock speed f:

    W(f, t) = W_0 + (W_1 * f + W_2 * f^2) * t

    W_0: baseline power (memory, IO, network, ...)
    W_0, W_1, W_2: determined by benchmarks

For Intel SNB: W_2 = 1 W/GHz^2; W_0 = 32 W for the chip alone, or W_0 = 73 W per socket for the whole system.
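The model fits in a few lines of code. A minimal sketch, assuming the Intel SNB constants quoted above; the slide gives no value for W_1 and no concrete P_0 / P_max, so those numbers are illustrative placeholders:

```python
# Hedged sketch of the slide's performance/power model. W0 and W2 are the
# quoted SNB values; W1, P0, Pmax and f0 are illustrative assumptions.

def performance(f, t, f0=2.2, P0=1.0, Pmax=6.0):
    """P(f,t) = min((f/f0) * P0 * t, Pmax), in arbitrary work units per second."""
    return min((f / f0) * P0 * t, Pmax)

def power(f, t, W0=73.0, W1=10.0, W2=1.0):
    """W(f,t) = W0 + (W1*f + W2*f^2) * t, in watts (f in GHz)."""
    return W0 + (W1 * f + W2 * f * f) * t

if __name__ == "__main__":
    f = 2.2
    for t in (4, 8, 10):
        print(f"t={t:2d} cores @ {f} GHz: P = {performance(f, t):5.2f}, W = {power(f, t):6.1f} W")
```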

Simple model for energy to solution: clock speeds and core counts (2)

Energy to solution when running t cores at clock speed f:

    E(f, t) = W(f, t) / P(f, t) = ( W_0 + (W_1 * f + W_2 * f^2) * t ) / min( (f / f_0) * P_0 * t , P_max )

Code optimization increases P_0 and/or P_max and proportionally reduces E.

LINPACK type apps: use all cores at clock speed f_opt = sqrt( W_0 / (t * W_2) ).
STREAM type apps: minimum energy at the saturation point.
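For a scalable code below saturation, the optimal clock speed follows from setting dE/df = 0; a short derivation, consistent with the model above:

```latex
E(f,t) = \frac{W_0 + (W_1 f + W_2 f^2)\,t}{(f/f_0)\,P_0\,t}
       = \frac{f_0}{P_0}\left(\frac{W_0}{f\,t} + W_1 + W_2 f\right),
\qquad
\frac{\partial E}{\partial f}
  = \frac{f_0}{P_0}\left(-\frac{W_0}{f^2\,t} + W_2\right) = 0
\;\Longrightarrow\;
f_{\mathrm{opt}} = \sqrt{\frac{W_0}{t\,W_2}}.
```

With the quoted numbers (W_0 = 73 W per socket, W_2 = 1 W/GHz^2) and t = 10 cores, f_opt = sqrt(7.3) GHz ≈ 2.7 GHz: the constant baseline power pushes scalable codes toward high clock speeds.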

Energy to solution (model, with W_0 = 73 W and W_2 = 1 W/GHz^2)

[Slide shows energy-to-solution curves for the base and opt code variants of both application types, with marked optimal clock speeds of 2 GHz and 3 GHz.]

LINPACK type: use all cores and a high clock speed!
STREAM type: run all cores at the clock speed which still saturates performance.
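The same conclusions drop out of a brute-force sweep of the model over the (f, t) plane. A hedged sketch; as before, P_0, P_max and W_1 are illustrative assumptions, not measured values:

```python
# Sweep clock speed and core count of the model and report the
# minimum-energy operating point for both application types.

def energy(f, t, P0, Pmax, f0=2.2, W0=73.0, W1=10.0, W2=1.0):
    perf = min((f / f0) * P0 * t, Pmax)      # P(f,t)
    watts = W0 + (W1 * f + W2 * f * f) * t   # W(f,t)
    return watts / perf                      # E = W/P, energy per unit of work

def best(P0, Pmax, max_cores=10):
    freqs = [1.2 + 0.1 * i for i in range(16)]   # 1.2 ... 2.7 GHz
    return min((energy(f, t, P0, Pmax), f, t)
               for f in freqs for t in range(1, max_cores + 1))

if __name__ == "__main__":
    e, f, t = best(P0=1.0, Pmax=1e9)  # effectively never saturates: LINPACK type
    print(f"LINPACK type: {t} cores @ {f:.1f} GHz, E = {e:.1f}")
    e, f, t = best(P0=1.0, Pmax=4.0)  # saturates at 4 cores @ f_0: STREAM type
    print(f"STREAM type:  {t} cores @ {f:.1f} GHz, E = {e:.1f}")
```

With these constants the scalable case lands on all cores at the highest allowed clock speed, while the saturating case lands near the saturation point at the lowest clock speed.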

Energy to solution: a different way of presentation

[Slide plots energy vs. performance, with isolines of constant energy-delay product (E * t) for orientation.]

A real world example: lattice Boltzmann CFD solver
- STREAM type code
- Different levels of optimization (P_0): scalar, SSE, and AVX code
- Not included in the model: bandwidth degradation at lower clock speeds (2.7 GHz -> 1.2 GHz)

A real world example: lattice Boltzmann CFD solver

Realistic model for LBM performance. [Slide compares the model against measurements.]

Optimal point of operation: 1.2 GHz with the AVX code at the saturation point (7 cores).
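The saturation point itself follows from the min() in the performance model: the smallest core count t with (f/f_0) * P_0 * t >= P_max. A small sketch with hypothetical LBM-like numbers, chosen only to reproduce the 7-cores-at-1.2-GHz operating point:

```python
import math

# Hypothetical LBM figures: P0 in MLUP/s per core at the baseline clock,
# Pmax the saturated chip performance. Not measured values.

def saturation_cores(f, f0=2.2, P0=220.0, Pmax=830.0):
    """Smallest t such that (f/f0) * P0 * t >= Pmax."""
    return math.ceil(Pmax / ((f / f0) * P0))

if __name__ == "__main__":
    for f in (1.2, 1.6, 2.2, 2.7):
        print(f"{f:.1f} GHz: saturates at {saturation_cores(f)} cores")
```

At lower clock speeds more cores are needed to saturate the bandwidth, which is why the energy-optimal point sits at the lowest clock speed whose saturation point still fits on the chip.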

A real world example: lattice Boltzmann CFD solver

Be aware: lowering the clock speed may lower the MPI bandwidth between nodes! [Slide shows IMB sendrecv bandwidth between two nodes over FDR InfiniBand.] When using all cores, the network bandwidth may drop by 40%!
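To check this effect on one's own system, a sendrecv ring in the spirit of the IMB benchmark can be timed at different clock settings. A minimal mpi4py sketch; the message size and iteration count are arbitrary choices, and this is not the actual Intel MPI Benchmarks code:

```python
from mpi4py import MPI
import numpy as np

def sendrecv_bandwidth(nbytes=4 * 1024 * 1024, iters=100):
    """Time a simultaneous send/receive ring and return GB/s per rank."""
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    sbuf = np.zeros(nbytes, dtype=np.uint8)
    rbuf = np.empty_like(sbuf)
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(iters):
        comm.Sendrecv(sbuf, dest=(rank + 1) % size,
                      recvbuf=rbuf, source=(rank - 1) % size)
    dt = MPI.Wtime() - t0
    return 2 * nbytes * iters / dt / 1e9  # sent + received bytes

if __name__ == "__main__":
    bw = sendrecv_bandwidth()
    print(f"rank {MPI.COMM_WORLD.Get_rank()}: {bw:.2f} GB/s")
```

Running this across two nodes (one rank per core), once at the nominal clock and once at a reduced clock, makes the degradation directly visible.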

Lessons to learn

Code optimization is a must!
- LINPACK-type codes: run as fast as possible.
- STREAM-type codes: run at the saturation point of the lowest clock speed which still saturates.
- Check degradation of main memory bandwidth and interconnect bandwidth.

Things to consider at the system administration level:
- Allow users to specify clock speeds (a simple modification of the NEC prolog).
- Install the LIKWID toolkit (http://code.google.com/p/likwid/): it lets users measure power and energy consumption (likwid-powermeter) and works well with the NEC software stack.

LIKWID toolbox: small, flexible and easy-to-use tools
- likwid-topology
- likwid-pin
- likwid-bench
- likwid-perfctr
- likwid-powermeter
- likwid-mpirun

References
- An analysis of energy-optimized lattice-Boltzmann CFD simulations from the chip to the highly parallel level. Submitted. Preprint: arXiv:1304.7664
- Exploring performance and power properties of modern multicore chips via simple machine models. Accepted for publication in CCPE. http://arxiv.org/abs/1208.2908

Thank you!

Question: name two hardware properties that may depend on the clock speed (besides the clock speed and peak performance themselves).