Some thoughts about energy efficient application execution on NEC LX Series compute clusters

G. Wellein, G. Hager, J. Treibig, M. Wittmann
Erlangen Regional Computing Center & Department of Computer Science
Friedrich-Alexander-University Erlangen-Nuremberg, Germany

Erlangen Regional Computing Center (RRZE)
RRZE: Regional HPC service provider and HPC research center
Map of German HPC sites: FZ Jülich (JuQueen: 5 PF/s), HLRS-Stuttgart (Hermit: 1 PF), LRZ-München (SuperMUC: 3 PF), Hannover, Berlin, Erlangen

Erlangen Regional Computing Center
A broad range of users: Biology, Chemistry, CFD, Material Science, Physics, Medicine, Economics
A broad range of clusters:
- LINUX (NEC): 560 nodes (234 TF/s), installation: 2013
- LINUX (NEC): 500 nodes (64 TF/s), installation: 2010
- LINUX (others): 300 nodes (2007-2011)
- WINDOWS (other): 16 nodes (2009)
Installation of a new LINUX cluster every 3 years:
- Decision based on benchmarks from users
- Production nodes: CPU only (benchmark commitments for applications on GPGPU / Phi cards)
- Budget: ~2.5-3 million USD

NEC LX-Cluster@RRZE: Dedicated to Emmy Noether
- #210 in TOP500 as of Nov. 2013
- 191.5 TF/s LINPACK (CPU only)
- LINPACK efficiency: 97.1 % of 197.1 TF/s peak (based on 2.2 GHz)
Emmy cluster
- 234 TF/s peak
- 560 compute nodes: 2x Intel Xeon E5-2660v2 (10-core Ivy Bridge @ 2.2 GHz), 64 GB DDR3 RAM
- 6 GPGPU nodes: 2x NVIDIA K20c
- 6 Phi nodes: 2x Intel Xeon Phi
- 4 mixed nodes: 1x K20c + 1x Phi
- QDR InfiniBand
- No local disks
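The quoted CPU-only peak and LINPACK efficiency can be cross-checked with a few lines of arithmetic. The sketch below assumes 8 double-precision floating-point operations per cycle and core (one 4-wide AVX add plus one 4-wide AVX multiply on Ivy Bridge), which is not stated on the slide; all other numbers are taken from it.

```python
# Cross-check of the CPU-only peak and LINPACK efficiency quoted for Emmy.
# Assumption (not on the slide): 8 double-precision flops per cycle and core.

NODES = 560
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 10
CLOCK_GHZ = 2.2
FLOPS_PER_CYCLE = 8          # assumed: AVX add + AVX multiply, 4 doubles each
LINPACK_TFLOPS = 191.5       # measured LINPACK result from the slide


def peak_tflops(nodes, sockets, cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in TF/s (GHz * flops/cycle gives Gflop/s per core)."""
    gflops = nodes * sockets * cores * clock_ghz * flops_per_cycle
    return gflops / 1000.0


peak = peak_tflops(NODES, SOCKETS_PER_NODE, CORES_PER_SOCKET,
                   CLOCK_GHZ, FLOPS_PER_CYCLE)
efficiency = LINPACK_TFLOPS / peak

print(f"peak = {peak:.1f} TF/s, LINPACK efficiency = {100 * efficiency:.1f} %")
```

Under this assumption the result reproduces the 197.1 TF/s and 97.1 % figures given above.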

HPC-Research objectives
- Performance Engineering for multi-/manycore architectures
- Efficient programming on hybrid parallel systems
- Fault Tolerance
- Multicore tooling
- Application: Sparse matrix schemes and Lattice Boltzmann methods
At SC13:
- Tutorial: The Practitioner's Cookbook for Good Parallel Performance on Multi- and Many-Core Systems. Presenter(s): G. Wellein, G. Hager, J. Treibig
- Poster: Pattern-Driven Node-Level Performance Engineering. Author(s): J. Treibig, G. Hager, G. Wellein. See you there at 5:15-7:00 today!
- Tutorial: Hybrid MPI and OpenMP Parallel Programming. Presenter(s): G. Jost, R. Rabenseifner, G. Hager
- Doctoral Showcase: A Unified Sparse Matrix Format for Heterogeneous Systems. Presenter: M. Kreutzer. Don't miss it Thursday afternoon!

Energy efficient application execution
Best energy efficiency? There are so many parameters to consider!
- Clock speed?
- Code variants?
- SMT?
- Cores per chip?

What kind of application do you run?
Consider scalability within a single multicore processor chip:
- LINPACK type. Limiting factor: core execution
- STREAM type. Limiting factor: saturation (bandwidth)
Change clock speed: 1.5x, 0.6x

Simple model for Energy to solution: Clock speeds and core counts (1)
Performance using t cores at clock speed of f:
$$P(f,t) = \min\left(\frac{f}{f_0}\,P_0\,t,\; P_{\max}\right)$$
- $f_0$: Baseline clock speed
- $P_0$ ($P_{\max}$): Baseline single core (max. chip) performance
Power consumption for running t cores at clock speed of f:
$$W(f,t) = W_0 + \left(W_1 f + W_2 f^2\right) t$$
- $W_0$: Baseline power (memory, IO, network)
- $W_0$, $W_1$, $W_2$: Determined by benchmarks
- For Intel SNB: $W_2 = 1\ \mathrm{W/GHz^2}$; $W_0 = 32$ W for chip; $W_0 = 73$ W per socket for whole system
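A minimal Python sketch of the two model equations, assuming the reading P(f,t) = min((f/f0) P0 t, Pmax) and W(f,t) = W0 + (W1 f + W2 f^2) t. W0 and W2 are the Sandy Bridge values quoted on the slide; f0, P0, Pmax and W1 = 0 are hypothetical placeholders chosen only for illustration.

```python
import math

# Power parameters from the slide (Intel SNB, whole system, per socket).
W0 = 73.0   # baseline power in W
W2 = 1.0    # W / GHz^2
# Hypothetical illustration values (not from the slide):
W1 = 0.0    # W / GHz, linear term assumed negligible here
F0 = 2.7    # baseline clock speed in GHz
P0 = 1.0    # single-core performance at F0, arbitrary units
PMAX = 4.5  # saturated chip performance, same units as P0


def performance(f, t, p0=P0, pmax=PMAX, f0=F0):
    """P(f, t) = min((f / f0) * P0 * t, Pmax)."""
    return min(f / f0 * p0 * t, pmax)


def power(f, t, w0=W0, w1=W1, w2=W2):
    """W(f, t) = W0 + (W1 * f + W2 * f**2) * t, in watts."""
    return w0 + (w1 * f + w2 * f ** 2) * t


def saturation_cores(f, p0=P0, pmax=PMAX, f0=F0):
    """Smallest core count t at which P(f, t) reaches Pmax."""
    return math.ceil(pmax / (f / f0 * p0))


for f in (1.2, 2.0, 2.7):
    t_sat = saturation_cores(f)
    print(f"f = {f} GHz: saturates at {t_sat} cores, "
          f"P = {performance(f, t_sat):.2f}, W = {power(f, t_sat):.1f} W")
```

Lowering the clock speed pushes the saturation point to a higher core count. This is the trade-off the STREAM-type discussion relies on.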

Simple model for Energy to solution: Clock speeds and core counts (2)
Energy to solution if running t cores at clock speed of f:
$$E(f,t) = \frac{W(f,t)}{P(f,t)} = \frac{W_0 + \left(W_1 f + W_2 f^2\right) t}{\min\left(\frac{f}{f_0}\,P_0\,t,\; P_{\max}\right)}$$
- Code optimization increases $P_0$ and/or $P_{\max}$ and proportionally reduces E
- LINPACK type apps: Use all cores at clock speed of $f_{\mathrm{opt}} = \sqrt{W_0 / (W_2\,t)}$
- STREAM type apps: Minimum energy at saturation point
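The optimal clock speed for LINPACK-type code follows from minimizing E in the non-saturated regime, where P(f,t) = (f/f_0) P_0 t. A short derivation, assuming the power model W(f,t) = W_0 + (W_1 f + W_2 f^2) t:

```latex
\begin{align*}
E(f,t) &= \frac{W_0 + (W_1 f + W_2 f^2)\,t}{(f/f_0)\,P_0\,t}
        = \frac{f_0}{P_0}\left(\frac{W_0}{f\,t} + W_1 + W_2 f\right) \\
\frac{\partial E}{\partial f} &= \frac{f_0}{P_0}\left(-\frac{W_0}{f^2\,t} + W_2\right) = 0
\quad\Longrightarrow\quad
f_{\mathrm{opt}} = \sqrt{\frac{W_0}{W_2\,t}}
\end{align*}
```

The linear term W_1 only adds a constant to E and therefore does not shift f_opt. In the saturated regime the performance is fixed at P_max, so E grows with both f and t. At a given clock speed, the minimum for STREAM-type code therefore sits at the saturation point.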

Energy to Solution
Model parameters: $W_0 = 73$ W, $W_2 = 1\ \mathrm{W/GHz^2}$
- LINPACK type ($f_{\mathrm{base}} = 2$ GHz, $f_{\mathrm{opt}} = 3$ GHz): Use all cores and high clock speed!
- STREAM type: Run all cores at the lowest clock speed which still saturates performance
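The optimum for LINPACK-type code can be checked numerically from the two quoted parameters. The sketch below assumes a core count of 8 per chip and W1 = 0; neither is stated on this slide, so both are illustration values.

```python
import math

W0 = 73.0  # W, baseline power per socket (whole system), from the slide
W2 = 1.0   # W / GHz^2, from the slide
T = 8      # assumed core count per chip (not stated on the slide)


def f_opt(t, w0=W0, w2=W2):
    """Energy-optimal clock speed in GHz for non-saturating (LINPACK-type) code."""
    return math.sqrt(w0 / (w2 * t))


def relative_energy(f, t, w0=W0, w2=W2):
    """Energy to solution up to the constant factor f0 / P0, with W1 = 0."""
    return (w0 + w2 * f ** 2 * t) / (f * t)


print(f"f_opt for {T} cores: {f_opt(T):.2f} GHz")
for f in (2.0, 3.0):
    print(f"f = {f} GHz: relative energy = {relative_energy(f, T):.2f}")
```

With these values the optimum lands close to 3 GHz, and running at 3 GHz needs less energy than running at 2 GHz. That matches the advice to use all cores at a high clock speed for LINPACK-type code.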

Energy to Solution: A different way of presentation
Plot: Energy vs. Performance
Isoline of constant energy-delay product ($E \cdot t$, where t is the runtime here, not the core count)
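The energy-delay product weights energy by runtime. For a fixed amount of work the runtime is proportional to 1/P, so a constant product means E is proportional to P. The isolines in an energy-vs-performance plot are therefore straight lines through the origin. The sketch below evaluates both metrics on a grid of operating points; f0, P0, Pmax, the grid itself and W1 = 0 are hypothetical illustration values, only W0 and W2 come from the slides.

```python
W0, W2 = 73.0, 1.0            # W and W/GHz^2, values from the slides
F0, P0, PMAX = 2.7, 1.0, 4.5  # hypothetical illustration values


def performance(f, t):
    """P(f, t) = min((f / F0) * P0 * t, PMAX)."""
    return min(f / F0 * P0 * t, PMAX)


def power(f, t):
    """W(f, t) with the linear term W1 assumed to be zero."""
    return W0 + W2 * f ** 2 * t


def energy(f, t, work=1.0):
    """Energy to solution for `work` units: power times runtime."""
    return power(f, t) * work / performance(f, t)


def energy_delay_product(f, t, work=1.0):
    """Energy to solution multiplied by the runtime."""
    runtime = work / performance(f, t)
    return energy(f, t, work) * runtime


# Scan clock speeds 1.2 .. 2.7 GHz and 1 .. 8 cores, report both minima.
points = [(f / 10, t) for f in range(12, 28) for t in range(1, 9)]
best_e = min(points, key=lambda p: energy(*p))
best_edp = min(points, key=lambda p: energy_delay_product(*p))
print("minimum energy at (f in GHz, cores):", best_e)
print("minimum EDP at (f in GHz, cores):   ", best_edp)
```

Because the product penalizes long runtimes, its minimum never lies at a slower operating point than the plain energy minimum. The two may coincide.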

A real world example: Lattice Boltzmann CFD solver
- STREAM type code
- Different levels of optimization ($P_0$): scalar, SSE, AVX code
- Not included in model: Bandwidth degradation with lower clock speed (2.7 GHz → 1.2 GHz)

A real world example: Lattice Boltzmann CFD solver
Realistic model for LBM performance (plot panels: MODEL, MEASUREMENT)
Optimal point of operation: 1.2 GHz with AVX code at saturation point (7 cores)

A real world example: Lattice Boltzmann CFD solver
Be aware! Lowering clock speed may lower MPI bandwidth between nodes!
- IMB sendrecv between two nodes (FDR IB)
- Using all cores, network bandwidth may drop by 40%!

Lessons to learn
- Code optimization is a must!
- LINPACK-type codes: Run as fast as possible
- STREAM-type codes: Run at the saturation point of the lowest clock speed which still saturates
- Check degradation of main memory bandwidth and interconnect bandwidth
Things to consider at system administration level:
- Allow users to specify clock speeds (simple modification in Prolog NEC)
- Install the LIKWID toolkit (http://code.google.com/p/likwid/): allows users to measure power and energy consumption (likwid-powermeter); works well with the NEC software stack

LIKWID toolbox: small, flexible and easy-to-use tools
- likwid-topology
- likwid-pin
- likwid-bench
- likwid-perfctr
- likwid-powermeter
- likwid-mpirun
References
- An analysis of energy-optimized lattice-Boltzmann CFD simulations from the chip to the highly parallel level. Submitted. Preprint: arXiv:1304.7664
- Exploring performance and power properties of modern multicore chips via simple machine models. Accepted for publication in CCPE. http://arxiv.org/abs/1208.2908
Thank you!

Question: Name two hardware properties that may depend on clock speed (besides the clock speed itself and peak performance).