Recent successes in high-end modelling for materials design in Europe. Thomas C. Schulthess


1 Recent successes in high-end modelling for materials design in Europe Thomas C. Schulthess 1

2 September 15, 2015: "Today's Outlook: GPU-accelerated Weather Forecasting" (John Russell)
- 2010: start investing in software
- 2012/13: co-design for Piz Daint
- 2014: COSMO in production on GPUs ("Piz Daint")
- co-design of Piz Kesch (specialised for MeteoSwiss)
- Oct. 2015: Piz Kesch in production
- Apr. 2016: new model operational

3 ASCR Computing Upgrades At a Glance

Current systems: Edison (NERSC), Titan (OLCF), Mira (ALCF). Planned upgrades: Cori (NERSC, 2016), Summit (OLCF), Theta (ALCF, 2016), Aurora (ALCF).
- System peak (PF) and peak power (MW): values missing in the source text
- Total system memory: Edison 357 TB; Titan 710 TB; Mira 768 TB; Cori ~1 PB DDR4 + High Bandwidth Memory (HBM) + 1.5 PB persistent memory; Summit >1.74 PB DDR4 + HBM + 2.8 PB persistent memory; Theta >480 TB DDR4 + HBM; Aurora >7 PB high-bandwidth on-package memory, local memory and persistent memory
- Node performance (TF): Cori >3; Summit >40; Theta >3; Aurora >17 times Mira
- Node processors: Edison Intel Ivy Bridge; Titan AMD Opteron + Nvidia Kepler; Mira 64-bit PowerPC A2; Cori Intel Knights Landing many-core CPUs + Intel Haswell CPUs in data partition; Summit multiple IBM Power9 CPUs & multiple Nvidia Volta GPUs; Theta Intel Knights Landing Xeon Phi many-core CPUs; Aurora Knights Hill Xeon Phi many-core CPUs
- System size (nodes): Edison 5,600; Titan 18,688; Mira 49,152; Cori 9,300 + 1,900 in data partition; Summit ~3,500; Theta >2,500; Aurora >50,000
- System interconnect: Edison Aries; Titan Gemini; Mira 5D Torus; Cori Aries; Summit dual-rail EDR-IB; Theta Aries; Aurora 2nd-generation Intel Omni-Path Architecture
- File system: Edison 7.6 PB, 168 GB/s, Lustre; Titan 32 PB, 1 TB/s, Lustre; Mira 26 PB, 300 GB/s, GPFS; Cori 28 PB, 744 GB/s, Lustre; Summit 120 PB, 1 TB/s, GPFS; Theta 10 PB, 210 GB/s, Lustre (initial); Aurora 150 PB, 1 TB/s, Lustre

4 Architectural tracks toward 2017 and beyond (diagram): GPU-accelerated hybrid (Summit), Xeon Phi accelerated (Aurora), multi-core (post-K, Tokyo-1, Tokyo-2; DARPA HPCS). Both architectures have heterogeneous memory!

5 Architectural diversity is here to stay, because it is a consequence of the end of CMOS scaling (Moore's Law). What are the implications? Complexity in software is one, but we don't understand all implications. The physics of the computer matters more than ever.

6 Three European Centers of Excellence in Materials Science have recently been funded
- NoMaD: Novel Materials Discovery (PIs: Claudia Draxl and Matthias Scheffler) — a Materials Encyclopedia and big-data analytics tools for materials science and engineering
- MaX: Materials design at the eXascale (PI: Elisa Molinari) — applications and tools for electronic structure simulations on future exascale architectures
- E-CAM: a CoE based at CECAM (PI: Dominic Tildesley) — molecular simulation tools with emphasis on education

7 A 12-year NCCR project funded by the Swiss National Science Foundation: 33 investigators from 11 Swiss institutions (universities, national labs, industry) and various disciplines (physics, chemistry, materials science, computational science, computer science, engineering). First phase funded at CHF 34.4M (18M SNSF, 6.6M EPFL, 9.8M others). EPFL (Marzari, Pasquarello, Roethlisberger, Koch, Andreoni, Corminboeuf, Yazyev, Ceriotti), ETHZ (Spaldin, Troyer, VandeVondele), Basel (Goedecker, von Lilienfeld), Fribourg (Werner), Geneva (Georges), Svizzera Italiana (Parrinello), Zurich (Hutter), IBM (Curioni), CSCS (Schulthess), EMPA (Gröning, Passerone), PSI (Kenzelmann, Nolting)

8 Serendipitous discovery & Edisonian development

Most new materials are discovered serendipitously (particularly true for complex materials), or through very laborious searches:
- Edison tested 3,000 materials for his filament and settled on burned sewing thread
- Haber-Bosch ammonia synthesis began with osmium as a catalyst; Mittasch (BASF) tested ~22,000 materials to find the iron-based catalyst still in use today
- Nørskov showed in 2009 that CoMo is a more efficient & inexpensive catalyst

[Figure: turnover frequency TOF (s⁻¹) vs. relative nitrogen binding energy (J mol⁻¹) for Fe, CoMo, Ru, Os, Co, Mo, Ni — Nicola Marzari]

9 Systematic searches with high-throughput & capability runs

There are ~150,000 known inorganic materials with published structures. Very basic properties computed with DFT-based quantum simulations take ~10 minutes on a powerful workstation (e.g. hybrid CPU-GPU). Piz Daint, with 5272 hybrid CPU-GPU nodes, could scan ~5000 structures / 10 minutes. But we want to study more complex, harder-to-compute properties. How complex?
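The throughput claim above is simple arithmetic; a quick sketch, assuming (as the slide does) ~10 minutes per structure on one hybrid node and 5272 such nodes:

```python
# Back-of-the-envelope throughput for high-throughput DFT screening.
# Assumptions from the slide: ~10 minutes per structure on one hybrid
# CPU-GPU node, 5272 such nodes on Piz Daint, ~150,000 known structures.
nodes = 5272
minutes_per_structure = 10
known_structures = 150_000

structures_per_window = nodes                  # one per node per 10-minute window
windows = -(-known_structures // nodes)        # ceiling division
total_hours = windows * minutes_per_structure / 60

print(f"{structures_per_window} structures / 10 minutes")
print(f"full scan of basic properties in about {total_hours:.0f} hours")
```

At this rate a basic-property scan of every known structure fits in a few hours, which is why the slide pivots to harder-to-compute properties.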

10 Approaching the problem from the other end

Start with the most reliable (and expensive) approach to electronic structure, the Linearised Augmented Plane Wave method (LAPW), and the largest problem that is reasonable* for materials searches, ~1000 atoms in a unit cell: the "1000-atom problem"**. Then bet on future improvements in extreme-scale computing: novel architectures and exa-scale computing.

(*) Using W. Kohn's arguments on the nearsightedness of electronic matter
(**) Proposed by Claudia Draxl at a PRACE project meeting in spring

11 Solving the Kohn-Sham equations is the bottleneck in most DFT-based materials science codes

Kohn-Sham equation: $\left(-\frac{\hbar^2}{2m}\nabla^2 + v_s[n](\vec r)\right)\psi_i(\vec r) = \varepsilon_i\,\psi_i(\vec r)$

Ansatz (the basis is not orthogonal): $\psi_i(\vec r) = \sum_\mu c_{i\mu}\,\phi_\mu(\vec r)$

Hermitian matrices: $H_{\mu\nu} = \int \phi_\mu^*(\vec r)\left(-\frac{\hbar^2}{2m}\nabla^2 + v_s[n](\vec r)\right)\phi_\nu(\vec r)\,d\vec r$ and $S_{\mu\nu} = \int \phi_\mu^*(\vec r)\,\phi_\nu(\vec r)\,d\vec r$

Solve the generalised eigenvalue problem $(H - \varepsilon_i S)\,c_i = 0$, where we are usually interested in about 10-50% of the spectrum. We need the eigenvectors as well, to compute the density: $n(\vec r) = \sum_{i=1}^N \psi_i^*(\vec r)\,\psi_i(\vec r)$
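At the dense-matrix level this structure can be sketched with SciPy's generalised Hermitian eigensolver; the matrices below are synthetic stand-ins for H and S (not a real LAPW basis), and `scipy.linalg.eigh` plays the role of the LAPACK drivers discussed later:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n = 120

# Synthetic Hermitian "Hamiltonian" H and positive-definite "overlap" S,
# standing in for the LAPW matrices (toy sizes, not a real basis).
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
H = (M + M.conj().T) / 2
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
S = B @ B.conj().T + n * np.eye(n)      # shift keeps S well conditioned

# Generalised problem H c = eps S c; only the lowest ~10-50% of the
# spectrum is needed for the density, so restrict the index range.
k = n // 5
eps, C = eigh(H, S, subset_by_index=[0, k - 1])

# Eigenpairs satisfy H C = S C diag(eps), and C is S-orthonormal.
residual = np.abs(H @ C - S @ C * eps).max()
ortho = np.abs(C.conj().T @ S @ C - np.eye(k)).max()
print(residual < 1e-8, ortho < 1e-8)
```

Restricting `subset_by_index` mirrors the point that only a fraction of the spectrum, eigenvectors included, is actually needed.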

12 Generalised eigenvalue problem in the LAPW basis

$H_{GG'}\,C_i = \varepsilon_i\,O_{GG'}\,C_i$

Overlap: $O_{GG'} = \langle\varphi_G|\varphi_{G'}\rangle$
Hamiltonian: $H_{GG'} = \langle\varphi_G|\hat H|\varphi_{G'}\rangle$
LAPW basis: $\varphi_G(\vec r) = \sum_L \sum_{\nu=1}^{O_\ell^\alpha} A^{\alpha}_{L\nu}(G)\,u^{\alpha}_{\ell\nu}(r)\,Y_L(\hat r)$ for $r \in \mathrm{MT}_\alpha$ (muffin-tin spheres), and $\varphi_G(\vec r) = \frac{1}{\sqrt\Omega}\,e^{i(G+k)\vec r}$ for $r \in I$ (interstitial region)

13 Generalised eigenvalue problem in the LAPW (cont.)

$H_{GG'}\,C_i = \varepsilon_i\,O_{GG'}\,C_i$

Overlap: $O_{GG'} = \langle\varphi_G|\varphi_{G'}\rangle = \sum_{\alpha L} A^{\alpha*}_{L}(G)\,A^{\alpha}_{L}(G') + \tilde\Theta(G-G')$
Hamiltonian: $H_{GG'} = \langle\varphi_G|\hat H|\varphi_{G'}\rangle = \sum_{\alpha L} A^{\alpha*}_{L}(G)\,B^{\alpha}_{L}(G') + \tfrac12(G+k)\cdot(G'+k)\,\tilde\Theta(G-G') + \tilde V_s(G-G')$
LAPW basis: $\varphi_G(\vec r) = \sum_L \sum_{\nu=1}^{O_\ell^\alpha} A^{\alpha}_{L\nu}(G)\,u^{\alpha}_{\ell\nu}(r)\,Y_L(\hat r)$ for $r \in \mathrm{MT}_\alpha$, and $\varphi_G(\vec r) = \frac{1}{\sqrt\Omega}\,e^{i(G+k)\vec r}$ for $r \in I$

14 Generalised eigenvalue problem in the LAPW (cont.)

$H_{GG'}\,C_i = \varepsilon_i\,O_{GG'}\,C_i$ — solved with LAPACK / ScaLAPACK (overlap and Hamiltonian matrix elements as on the previous slide), with

$B^{\alpha}_{L}(G) = \sum_{L_3 L_2} A^{\alpha}_{L_2}(G)\,h^{\alpha}_{\ell L_3 \ell_2}\,\langle Y_L|R_{L_3}|Y_{L_2}\rangle + \tfrac12 \sum_{L_2} A^{\alpha}_{L_2}(G)\,u_{\ell}(R_\alpha)\,u'_{\ell_2}(R_\alpha)\,R_\alpha^2$

Buried in thousands of lines of FORTRAN code.

15 Generalised eigenvalue problem in the LAPW (cont.)

$H_{GG'}\,C_i = \varepsilon_i\,O_{GG'}\,C_i$ — $O(N^3)$ complexity

Overlap: $O_{GG'} = \sum_{\alpha L} A^{\alpha*}_{L}(G)\,A^{\alpha}_{L}(G') + \tilde\Theta(G-G')$
Hamiltonian: $H_{GG'} = \sum_{\alpha L} A^{\alpha*}_{L}(G)\,B^{\alpha}_{L}(G') + \tfrac12(G+k)\cdot(G'+k)\,\tilde\Theta(G-G') + \tilde V_s(G-G')$, with $B^{\alpha}_{L}(G)$ as on the previous slide.

16 Generalised eigenvalue problem in the LAPW (cont.)

$\sum_{G'} H_{GG'}\,C^i_{G'} = \varepsilon_i \sum_{G'} O_{GG'}\,C^i_{G'}$, with overlap and Hamiltonian as on the previous slides.

Initial data is distributed in a block-cyclic fashion: each MPI rank gets a panel of tiles ([0,0], [0,1]; [1,0], [1,1] on a 2x2 process grid).

[Fig. 1 (color online): Panel and slice storage of the data. For parallel linear algebra operations the array has to be distributed in a block-cyclic fashion over a 2D grid of MPI ranks. In order to perform a local operation on a whole vector, the slices of vectors are gathered from panels, or created locally, on the corresponding row ranks of the MPI grid. To perform a distributed operation with PBLAS …]

Thus the LAPW basis functions are given by $\varphi_G(\vec r) = \sum_L \sum_{\nu=1}^{O_\ell^\alpha} A^{\alpha}_{L\nu}(G)\,u^{\alpha}_{\ell\nu}(r)\,Y_L(\hat r)$ for $r \in \mathrm{MT}_\alpha$ and $\frac{1}{\sqrt\Omega}\,e^{i(G+k)\vec r}$ for $r \in I$, where $L \equiv \{\ell, m\}$ denotes the angular momentum and azimuthal quantum numbers and $\sum_L \equiv \sum_{\ell=0}^{\ell_{\max}}\sum_{m=-\ell}^{\ell}$. The matching coefficients $A^{\alpha}_{L\nu}(G)$ are chosen to ensure continuity of the basis functions (and if possible of their derivatives).

International Workshop on CO-DESIGN, Wuxi, Monday, November 9, 2015

17 Generalised eigenvalue problem in the LAPW (cont.)

$\sum_{G'} H_{GG'}\,C^i_{G'} = \varepsilon_i \sum_{G'} O_{GG'}\,C^i_{G'}$, with overlap and Hamiltonian as on the previous slides.

Panel-to-slice redistribution (cf. Fig. 1): the initial data is distributed in a block-cyclic fashion, each MPI rank holding a panel of tiles. The MPI ranks of each column swap blocks of panels (MPI communication), and the slices of whole vectors are then gathered on each MPI rank.

18 Solving the generalised eigenvalue problem $A x = \lambda B x$

Standard 1-stage solver:
- xpotrf: $B = L L^H$
- xhegst: $A' = L^{-1} A L^{-H}$ (standard problem $A' y = \lambda y$)
- xhetrd: $T = Q^H A' Q$ — the most time-consuming step, dominated by level-2 BLAS (memory bound)
- xstexx: $T y' = \lambda y'$
- xunmtr: $y = Q y'$
- xtrsm: $x = L^{-H} y$
(xheevx is the driver that combines xhetrd, xstexx and xunmtr to solve $A' y = \lambda y$.)
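The same pipeline can be mimicked step by step with SciPy: `cholesky` for xpotrf, triangular solves for xhegst and xtrsm, and one `eigh` call bundling the tridiagonalisation, tridiagonal solve and back-transformation. A sketch, not the LAPACK/ScaLAPACK implementation itself:

```python
import numpy as np
from scipy.linalg import cholesky, eigh, solve_triangular

rng = np.random.default_rng(1)
n = 100
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (M + M.conj().T) / 2                     # Hermitian A
Bm = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
B = Bm @ Bm.conj().T + n * np.eye(n)         # Hermitian positive-definite B

# xpotrf: B = L L^H
L = cholesky(B, lower=True)

# xhegst: A' = L^{-1} A L^{-H} turns A x = lam B x into A' y = lam y
Ap = solve_triangular(L, solve_triangular(L, A, lower=True).conj().T,
                      lower=True)

# xhetrd + xstexx + xunmtr in one call: solve the standard problem
lam, Y = eigh(Ap)

# xtrsm: back-transform x = L^{-H} y to the generalised eigenvectors
X = solve_triangular(L.conj().T, Y, lower=False)

residual = np.abs(A @ X - B @ X * lam).max()
print(residual < 1e-8)
```

The two triangular solves in the xhegst step exploit that $(L^{-1}A)^H = A\,L^{-H}$ for Hermitian $A$, so no explicit inverse of $L$ is ever formed.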

19 Solving the generalised eigenvalue problem (cont.): two-stage solver

$A x = \lambda B x$
- xpotrf: $B = L L^H$
- xhegst: $A' = L^{-1} A L^{-H}$ ($A' y = \lambda y$)
- reduction to banded: $A'' = Q_1^H A' Q_1$ — the most time-consuming step, but dominated by BLAS-3
- tri-diagonalise: $T = Q_2^H A'' Q_2$
- solve: $T y' = \lambda y'$
- back-transform: $y'' = Q_2 y'$, then $y = Q_1 y''$ — needs two eigenvector transformations (but easy to parallelise)
- xtrsm: $x = L^{-H} y$
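The two back-transformations are just two unitary multiplications. A toy illustration of that structure: a random unitary $Q_1$ stands in for the reduction-to-banded stage, and `scipy.linalg.hessenberg` (which yields a tridiagonal form for Hermitian input) stands in for the band-to-tridiagonal stage. This shows only the transformation chain, not the blocked algorithms of ELPA/MAGMA:

```python
import numpy as np
from scipy.linalg import hessenberg, eigh, qr

rng = np.random.default_rng(2)
n = 80
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Ap = (M + M.conj().T) / 2              # standard problem A' y = lam y

# Stage-1 stand-in: a unitary Q1 (from a QR factorisation) plays the
# role of the reduction to banded form, A'' = Q1^H A' Q1.
Q1, _ = qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
App = Q1.conj().T @ Ap @ Q1

# Stage 2: for Hermitian input the Hessenberg form is tridiagonal,
# App = Q2 T Q2^H, i.e. T = Q2^H A'' Q2.
T, Q2 = hessenberg(App, calc_q=True)

# Solve the tridiagonal problem, then apply both back-transformations:
# y'' = Q2 y', y = Q1 y''.
lam, Yp = eigh(T)
Y = Q1 @ (Q2 @ Yp)

residual = np.abs(Ap @ Y - Y * lam).max()
print(residual < 1e-8)
```

Both back-transformations are dense matrix-matrix products, which is why they parallelise well despite being an extra step relative to the one-stage solver.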

20 Implementations of two-stage eigensolvers for our problem (i.e. with back-transformation of eigenvectors)

For multi-core systems: the ELPA library
- T. Auckenthaler et al., Parallel Comput., vol. 37, no. 12 (2011)
- A. Marek et al., Psi-K Research Highlight, vol. 2014, no. 1, Jan. 2014
- Remark: the implementation relies on intrinsics

For hybrid CPU-GPU systems: integrated into the MAGMA library
- A. Haidar et al., Lecture Notes in Comp. Sci. 7905 (2013)
- A. Haidar et al., Int. J. of High Perf. Comp. App. (2013)
- R. Solcà et al., Proceedings of SC '15, New York, ACM (2015)
- Remark: relies on a PBLAS that is aware of heterogeneous memory

21 Accelerated hybrid systems: heterogeneous memory

A 4-node blade of a GPU-accelerated Cray XC30: the nodes share a network fabric, and each node combines DDR memory, serving a few low-latency threads, with high-bandwidth memory, serving many throughput threads.

22 PBLAS for accelerated hybrid (memory) systems

Many ScaLAPACK routines rely on distributed PBLAS routines, operating on a block-cyclic decomposition of a matrix ([0,0], [0,1]; [1,0], [1,1]):
- matrix distributed over DDR memory: execute PBLAS on low-latency threads (CPUs)
- matrix distributed over high-BW memory: execute PBLAS on throughput threads (GPUs), and move data directly between the high-BW memories of different nodes

On a distributed-memory accelerated hybrid system we therefore need two types of PBLAS routines, depending on which memory subsystem the matrix is located in.
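The block-cyclic decomposition used by ScaLAPACK/PBLAS has a simple closed-form index mapping. A sketch with 0-based indices (the function names here are made up for illustration; `nb` is the block size and `P` the number of process rows or columns in a dimension):

```python
def owner_and_local(g: int, nb: int, P: int) -> tuple[int, int]:
    """Map a global index g to (owning process, local index)."""
    block = g // nb
    proc = block % P              # blocks are dealt out cyclically
    local_block = block // P      # how many of this process's blocks precede g
    return proc, local_block * nb + g % nb

def owner_2d(i, j, nb, Pr, Pc):
    """2D distribution of matrix entry (i, j) on a Pr x Pc process grid."""
    (pr, li), (pc, lj) = owner_and_local(i, nb, Pr), owner_and_local(j, nb, Pc)
    return (pr, pc), (li, lj)

# First 8 rows with block size nb=2 on Pr=2 process rows: 0,0,1,1,0,0,1,1
print([owner_and_local(g, 2, 2)[0] for g in range(8)])
print(owner_2d(5, 3, 2, 2, 2))
```

This is exactly the mapping a panel of tiles encodes: contiguous blocks of `nb` rows/columns, dealt out round-robin over the process grid in each dimension.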

23 1000-atom test problem

~115,000 basis functions (matrix size). Running on a Cray XC30:
- CPU runs on Xeon E5 (Sandy Bridge)
- hybrid: same CPU + Nvidia K20 GPU
Use a comparable number of sockets.

Li-intercalated CoO2: 432 formula units of CoO2 + 205 Li atoms = 1501 atoms in total.

24 Results for the full runs (one SCF iteration)

Columns: MPI grid; MPI ranks / socket; OpenMP threads / rank; active sockets; setup of O, H (sec.); solve (sec.); rest (sec.); total (sec.); energy (Wh).
Configurations compared: 28x28 (2R:4T) ScaLAPACK; 28x28 (2R:4T) ELPA2; 20x20 (1R:8T) ELPA2; 14x14 (1R:8T) hybrid; 20x20 (1R:8T) hybrid.

25 Resources used for the 1000-atom design problem
- Time: ~15 minutes / iteration, i.e. ~3 hours for ~10 iterations
- Footprint: ~400 hybrid nodes on a Cray XC30 (Sandy Bridge + K20)
- Scan ~13 materials in 3 hours, or 5,000 in ~16 days (and performance will improve by the end of the decade)

26 A note on MPI+OpenMP

Hybrid: (n/10,240)^2 two-socket nodes; CPU-only: 2 (n/10,240)^2 sockets. On CPU nodes, MPI-only runs perform best! OpenMP is nevertheless necessary to reduce the memory footprint.

27 SIRIUS: a (prototype) domain-specific library

Low-level LAPW (and PW) library that supports multiple codes: ~30k lines of C++ code (incl. documentation) with F90 bindings. Anton Kozhevnikov, with Claudia Draxl, Andris Gulans, and Georg Huhs.

Codes built on top: Exciting, Elk, Quantum ESPRESSO, and others, all calling into the SIRIUS library. Distributed hybrid-memory model; MPI + OpenMP and others.
- Density class: distributed charge density and magnetisation generation
- Potential class: distributed XC potential and magnetic field generation, distributed Poisson solver
- Band class: second-variational and full diagonalisation of the Hamiltonian, with support for GPUs and distributed eigenvalue solvers
- Force class: atomic forces, with support for a distributed Hamiltonian matrix

Underlying libraries: GNU Scientific Library, FFTW3, HDF5, ELPA, MAGMA, Spglib, LAPACK and BLAS, ScaLAPACK and PBLAS, libc.

28 References and Collaborators

Peter Messmer and his team at the NVIDIA co-design lab at ETH Zurich; teams at CSCS.

- A. Haidar, R. Solcà, M. Gates, S. Tomov, T. C. Schulthess, J. Dongarra, "Leading Edge Hybrid Multi-GPU Algorithms for Generalized Eigenproblems in Electronic Structure Calculations," Supercomputing, Springer, Berlin/Heidelberg (2013)
- A. Haidar, S. Tomov, J. Dongarra, R. Solcà, T. C. Schulthess, "A novel hybrid CPU-GPU generalised eigensolver for electronic structure calculations based on fine-grained memory aware tasks," International Journal of High Performance Computing Applications, August 2013
- R. Solcà, A. Kozhevnikov, A. Haidar, S. Tomov, J. Dongarra, T. C. Schulthess, "Efficient implementation of quantum materials simulations on distributed CPU-GPU systems," to be published in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15), New York, NY, USA (2015), ACM


Piz Daint & Piz Kesch : from general purpose supercomputing to an appliance for weather forecasting. Thomas C. Schulthess

Piz Daint & Piz Kesch : from general purpose supercomputing to an appliance for weather forecasting. Thomas C. Schulthess Piz Daint & Piz Kesch : from general purpose supercomputing to an appliance for weather forecasting Thomas C. Schulthess 1 Cray XC30 with 5272 hybrid, GPU accelerated compute nodes Piz Daint Compute node:

More information

A hybrid Hermitian general eigenvalue solver

A hybrid Hermitian general eigenvalue solver Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe A hybrid Hermitian general eigenvalue solver Raffaele Solcà *, Thomas C. Schulthess Institute fortheoretical Physics ETHZ,

More information

ELSI: A Unified Software Interface for Kohn-Sham Electronic Structure Solvers

ELSI: A Unified Software Interface for Kohn-Sham Electronic Structure Solvers ELSI: A Unified Software Interface for Kohn-Sham Electronic Structure Solvers Victor Yu and the ELSI team Department of Mechanical Engineering & Materials Science Duke University Kohn-Sham Density-Functional

More information

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) Schodinger equation: Hψ = Eψ Choose a basis set of wave functions Two cases: Orthonormal

More information

From Piz Daint to Piz Kesch : the making of a GPU-based weather forecasting system. Oliver Fuhrer and Thomas C. Schulthess

From Piz Daint to Piz Kesch : the making of a GPU-based weather forecasting system. Oliver Fuhrer and Thomas C. Schulthess From Piz Daint to Piz Kesch : the making of a GPU-based weather forecasting system Oliver Fuhrer and Thomas C. Schulthess 1 Piz Daint Cray XC30 with 5272 hybrid, GPU accelerated compute nodes Compute node:

More information

Reflecting on the Goal and Baseline of Exascale Computing

Reflecting on the Goal and Baseline of Exascale Computing Reflecting on the Goal and Baseline of Exascale Computing Thomas C. Schulthess!1 Tracking supercomputer performance over time? Linpack benchmark solves: Ax = b!2 Tracking supercomputer performance over

More information

Parallel Eigensolver Performance on High Performance Computers

Parallel Eigensolver Performance on High Performance Computers Parallel Eigensolver Performance on High Performance Computers Andrew Sunderland Advanced Research Computing Group STFC Daresbury Laboratory CUG 2008 Helsinki 1 Summary (Briefly) Introduce parallel diagonalization

More information

ab initio Electronic Structure Calculations

ab initio Electronic Structure Calculations ab initio Electronic Structure Calculations New scalability frontiers using the BG/L Supercomputer C. Bekas, A. Curioni and W. Andreoni IBM, Zurich Research Laboratory Rueschlikon 8803, Switzerland ab

More information

SPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics

SPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics SPARSE SOLVERS FOR THE POISSON EQUATION Margreet Nool CWI, Multiscale Dynamics November 9, 2015 OUTLINE OF THIS TALK 1 FISHPACK, LAPACK, PARDISO 2 SYSTEM OVERVIEW OF CARTESIUS 3 POISSON EQUATION 4 SOLVERS

More information

Accelerating linear algebra computations with hybrid GPU-multicore systems.

Accelerating linear algebra computations with hybrid GPU-multicore systems. Accelerating linear algebra computations with hybrid GPU-multicore systems. Marc Baboulin INRIA/Université Paris-Sud joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory)

More information

INITIAL INTEGRATION AND EVALUATION

INITIAL INTEGRATION AND EVALUATION INITIAL INTEGRATION AND EVALUATION OF SLATE PARALLEL BLAS IN LATTE Marc Cawkwell, Danny Perez, Arthur Voter Asim YarKhan, Gerald Ragghianti, Jack Dongarra, Introduction The aim of the joint milestone STMS10-52

More information

Parallel Eigensolver Performance on the HPCx System

Parallel Eigensolver Performance on the HPCx System Parallel Eigensolver Performance on the HPCx System Andrew Sunderland, Elena Breitmoser Terascaling Applications Group CCLRC Daresbury Laboratory EPCC, University of Edinburgh Outline 1. Brief Introduction

More information

MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors

MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors J. Dongarra, M. Gates, A. Haidar, Y. Jia, K. Kabir, P. Luszczek, and S. Tomov University of Tennessee, Knoxville 05 / 03 / 2013 MAGMA:

More information

Opportunities for ELPA to Accelerate the Solution of the Bethe-Salpeter Eigenvalue Problem

Opportunities for ELPA to Accelerate the Solution of the Bethe-Salpeter Eigenvalue Problem Opportunities for ELPA to Accelerate the Solution of the Bethe-Salpeter Eigenvalue Problem Peter Benner, Andreas Marek, Carolin Penke August 16, 2018 ELSI Workshop 2018 Partners: The Problem The Bethe-Salpeter

More information

Comparing the Efficiency of Iterative Eigenvalue Solvers: the Quantum ESPRESSO experience

Comparing the Efficiency of Iterative Eigenvalue Solvers: the Quantum ESPRESSO experience Comparing the Efficiency of Iterative Eigenvalue Solvers: the Quantum ESPRESSO experience Stefano de Gironcoli Scuola Internazionale Superiore di Studi Avanzati Trieste-Italy 0 Diagonalization of the Kohn-Sham

More information

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric

More information

materials modelling and discovery: the high-performance compucng way

materials modelling and discovery: the high-performance compucng way materials modelling and discovery: the high-performance compucng way Stefano Baroni Scuola Internazionale Superiore di Studi AvanzaC & IsCtuto Officina dei Materiali del CNR, Trieste QUANTUM ESPRESSO FoundaCon,

More information

Performance evaluation of scalable optoelectronics application on large-scale Knights Landing cluster

Performance evaluation of scalable optoelectronics application on large-scale Knights Landing cluster Performance evaluation of scalable optoelectronics application on large-scale Knights Landing cluster Yuta Hirokawa Graduate School of Systems and Information Engineering, University of Tsukuba hirokawa@hpcs.cs.tsukuba.ac.jp

More information

Supercomputers: instruments for science or dinosaurs that haven t gone extinct yet? Thomas C. Schulthess

Supercomputers: instruments for science or dinosaurs that haven t gone extinct yet? Thomas C. Schulthess Supercomputers: instruments for science or dinosaurs that haven t gone extinct yet? Thomas C. Schulthess 1 Do you really mean dinosaurs? We must be in the wrong movie 2 Not much has changed since the late

More information

CP2K. New Frontiers. ab initio Molecular Dynamics

CP2K. New Frontiers. ab initio Molecular Dynamics CP2K New Frontiers in ab initio Molecular Dynamics Jürg Hutter, Joost VandeVondele, Valery Weber Physical-Chemistry Institute, University of Zurich Ab Initio Molecular Dynamics Molecular Dynamics Sampling

More information

A Computation- and Communication-Optimal Parallel Direct 3-body Algorithm

A Computation- and Communication-Optimal Parallel Direct 3-body Algorithm A Computation- and Communication-Optimal Parallel Direct 3-body Algorithm Penporn Koanantakool and Katherine Yelick {penpornk, yelick}@cs.berkeley.edu Computer Science Division, University of California,

More information

Parallel Sparse Tensor Decompositions using HiCOO Format

Parallel Sparse Tensor Decompositions using HiCOO Format Figure sources: A brief survey of tensors by Berton Earnshaw and NVIDIA Tensor Cores Parallel Sparse Tensor Decompositions using HiCOO Format Jiajia Li, Jee Choi, Richard Vuduc May 8, 8 @ SIAM ALA 8 Outline

More information

ELECTRONIC STRUCTURE CALCULATIONS FOR THE SOLID STATE PHYSICS

ELECTRONIC STRUCTURE CALCULATIONS FOR THE SOLID STATE PHYSICS FROM RESEARCH TO INDUSTRY 32 ème forum ORAP 10 octobre 2013 Maison de la Simulation, Saclay, France ELECTRONIC STRUCTURE CALCULATIONS FOR THE SOLID STATE PHYSICS APPLICATION ON HPC, BLOCKING POINTS, Marc

More information

MARCH 24-27, 2014 SAN JOSE, CA

MARCH 24-27, 2014 SAN JOSE, CA MARCH 24-27, 2014 SAN JOSE, CA Sparse HPC on modern architectures Important scientific applications rely on sparse linear algebra HPCG a new benchmark proposal to complement Top500 (HPL) To solve A x =

More information

Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano

Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano Introduction Introduction We wanted to parallelize a serial algorithm for the pivoted Cholesky factorization

More information

Computing least squares condition numbers on hybrid multicore/gpu systems

Computing least squares condition numbers on hybrid multicore/gpu systems Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning

More information

Quantum ESPRESSO Performance Benchmark and Profiling. February 2017

Quantum ESPRESSO Performance Benchmark and Profiling. February 2017 Quantum ESPRESSO Performance Benchmark and Profiling February 2017 2 Note The following research was performed under the HPC Advisory Council activities Compute resource - HPC Advisory Council Cluster

More information

Performance of the fusion code GYRO on three four generations of Crays. Mark Fahey University of Tennessee, Knoxville

Performance of the fusion code GYRO on three four generations of Crays. Mark Fahey University of Tennessee, Knoxville Performance of the fusion code GYRO on three four generations of Crays Mark Fahey mfahey@utk.edu University of Tennessee, Knoxville Contents Introduction GYRO Overview Benchmark Problem Test Platforms

More information

HYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017

HYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017 HYCOM and Navy ESPC Future High Performance Computing Needs Alan J. Wallcraft COAPS Short Seminar November 6, 2017 Forecasting Architectural Trends 3 NAVY OPERATIONAL GLOBAL OCEAN PREDICTION Trend is higher

More information

Large Scale Electronic Structure Calculations

Large Scale Electronic Structure Calculations Large Scale Electronic Structure Calculations Jürg Hutter University of Zurich 8. September, 2008 / Speedup08 CP2K Program System GNU General Public License Community Developers Platform on "Berlios" (cp2k.berlios.de)

More information

Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2

Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 1 / 23 Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 Maison de la Simulation Lille 1 University CNRS March 18, 2013

More information

Large-scale Electronic Structure Simulations with MVAPICH2 on Intel Knights Landing Manycore Processors

Large-scale Electronic Structure Simulations with MVAPICH2 on Intel Knights Landing Manycore Processors Large-scale Electronic Structure Simulations with MVAPICH2 on Intel Knights Landing Manycore Processors Hoon Ryu, Ph.D. (E: elec1020@kisti.re.kr) Principal Researcher / Korea Institute of Science and Technology

More information

The ELPA Library Scalable Parallel Eigenvalue Solutions for Electronic Structure Theory and Computational Science

The ELPA Library Scalable Parallel Eigenvalue Solutions for Electronic Structure Theory and Computational Science TOPICAL REVIEW The ELPA Library Scalable Parallel Eigenvalue Solutions for Electronic Structure Theory and Computational Science Andreas Marek 1, Volker Blum 2,3, Rainer Johanni 1,2 ( ), Ville Havu 4,

More information

Performance optimization of WEST and Qbox on Intel Knights Landing

Performance optimization of WEST and Qbox on Intel Knights Landing Performance optimization of WEST and Qbox on Intel Knights Landing Huihuo Zheng 1, Christopher Knight 1, Giulia Galli 1,2, Marco Govoni 1,2, and Francois Gygi 3 1 Argonne National Laboratory 2 University

More information

VASP: running on HPC resources. University of Vienna, Faculty of Physics and Center for Computational Materials Science, Vienna, Austria

VASP: running on HPC resources. University of Vienna, Faculty of Physics and Center for Computational Materials Science, Vienna, Austria VASP: running on HPC resources University of Vienna, Faculty of Physics and Center for Computational Materials Science, Vienna, Austria The Many-Body Schrödinger equation 0 @ 1 2 X i i + X i Ĥ (r 1,...,r

More information

Direct Self-Consistent Field Computations on GPU Clusters

Direct Self-Consistent Field Computations on GPU Clusters Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd

More information

Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem

Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National

More information

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,

More information

Domain specific libraries. Material science codes on innovative HPC architectures Anton Kozhevnikov, CSCS December 5, 2016

Domain specific libraries. Material science codes on innovative HPC architectures Anton Kozhevnikov, CSCS December 5, 2016 Domain specific libraries Material science codes on innovative HPC architectures Anton Kozhevnikov, CSCS December 5, 2016 Part 1: Introduction Kohn-Shame equations 1 2 Eigen-value problem + v eff (r) j(r)

More information

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a

More information

Parallel Eigensolver Performance on High Performance Computers 1

Parallel Eigensolver Performance on High Performance Computers 1 Parallel Eigensolver Performance on High Performance Computers 1 Andrew Sunderland STFC Daresbury Laboratory, Warrington, UK Abstract Eigenvalue and eigenvector computations arise in a wide range of scientific

More information

Dynamic Scheduling within MAGMA
Emmanuel Agullo, Cedric Augonnet, Jack Dongarra, Mathieu Faverge, Julien Langou, Hatem Ltaief, Samuel Thibault and Stanimire Tomov. April 5, 2012. Innovative Computing

All-electron density functional theory on Intel MIC: Elk
W. Scott Thornton, R.J. Harrison. Abstract: We present the results of the porting of the full potential linear augmented plane-wave solver, Elk [1],

Performance Analysis of Lattice QCD Application with APGAS Programming Model
Koichi Shirahata (Tokyo Institute of Technology), Jun Doi, Mikio Takeuchi (IBM Research - Tokyo). Programming Models

Verbundprojekt ELPA-AEO (joint project "ELPA-AEO: eigenvalue solvers for petaflop applications; algorithmic extensions and optimizations")
http://elpa-aeo.mpcdf.mpg.de. BMBF project 01IH15001, Feb 2016 - Jan 2019. 7. HPC-Statustagung,

Exascale computing: endgame or new beginning for climate modelling. Thomas C. Schulthess
17th Workshop on HPC in Meteorology @ ECMWF, Reading, Wednesday, October 26, 2016. Operational system

CP2K: Past, Present, Future. Jürg Hutter, Department of Chemistry, University of Zurich
Outline: Past (history of CP2K, development of features); Present (Quickstep DFT code, post-HF methods (RPA, MP2), libraries

First, a look at using OpenACC on WRF subroutine advance_w dynamics routine
Second, an estimate of WRF multi-node performance on Cray XK6 with GPU accelerators. Based on performance of WRF kernels, what

The ELPA library: scalable parallel eigenvalue solutions for electronic structure theory and computational science

Claude Tadonki. MINES ParisTech, PSL Research University, Centre de Recherche Informatique
claude.tadonki@mines-paristech.fr. Monthly CRI Seminar, MINES ParisTech - CRI, June 06, 2016, Fontainebleau (France)

GPU Computing and Alternative Architecture. Julian Merten
Future Directions of Cosmological Simulations, Edinburgh. Institut für Theoretische Astrophysik, Zentrum für Astronomie, Universität Heidelberg

Parallelization of the Molecular Orbital Program MOS-F
Akira Asato, Satoshi Onodera, Yoshie Inada, Elena Akhmatskaya, Ross Nobes, Azuma Matsuura, Atsuya Takahashi. November 2003, Fujitsu Laboratories of

Petascale Quantum Simulations of Nano Systems and Biomolecules
Emil Briggs, North Carolina State University. 1. Outline of real-space Multigrid (RMG) 2. Scalability and hybrid/threaded models 3. GPU acceleration

Exascale challenges for Numerical Weather Prediction: the ESCAPE project
Olivier Marsden. This project has received funding from the European Union's Horizon 2020 research and innovation programme under

Parallelization of Molecular Dynamics (with focus on Gromacs). SeSE 2014
Outline: a few words on MD applications and the GROMACS package; the main work in an MD simulation; parallelization; stream computing

Massively parallel electronic structure calculations with Python software. Jussi Enkovaara, Software Engineering, CSC, the Finnish IT center for science
GPAW: software package for electronic structure calculations

Analysis of the Tradeoffs between Energy and Run Time for Multilevel Checkpointing
Prasanna Balaprakash, Leonardo A. Bautista Gomez, Slim Bouguerra, Stefan M. Wild, Franck Cappello, and Paul D. Hovland

Static-scheduling and hybrid-programming in SuperLU_DIST on multicore cluster systems
Ichitaro Yamazaki (University of Tennessee, Knoxville), Xiaoye Sherry Li (Lawrence Berkeley National Laboratory). MS49: Sparse
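As a single-node reference point for what SuperLU computes, SciPy wraps the sequential SuperLU library as `scipy.sparse.linalg.splu`; SuperLU_DIST, the subject of the talk, is its distributed-memory variant. A sketch with a 1-D Poisson matrix of my own choosing:

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

# 1-D Poisson (tridiagonal) matrix in CSC format, the layout SuperLU expects.
n = 100
dense = (np.diag(np.full(n, 2.0))
         + np.diag(np.full(n - 1, -1.0), 1)
         + np.diag(np.full(n - 1, -1.0), -1))
A = csc_matrix(dense)

lu = splu(A)              # sparse LU with column permutation for sparsity
x = lu.solve(np.ones(n))  # triangular solves reuse the factorization
assert np.allclose(A @ x, np.ones(n))
```

Factorizing once and reusing `lu.solve` for many right-hand sides is the usual pattern; the distributed version adds the static scheduling and hybrid MPI/OpenMP issues the talk discusses.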

Introduction to Benchmark Test for Multi-scale Computational Materials Software
Shun Xu*, Jian Zhang, Zhong Jin (xushun@sccas.cn). Computer Network Information Center, Chinese Academy of Sciences (IPCC member)

Large Scale Parallelism. Carlo Cavazzoni, HPC department, CINECA
Parallel architectures: two basic architectural schemes, distributed memory and shared memory; now most computers have a mixed architecture + accelerators

Linear algebra tasks in Materials Science: optimization and portability (Mitglied der Helmholtz-Gemeinschaft / Member of the Helmholtz Association)
Edoardo Di Napoli. ADAC Workshop, July 17-19, 2017. Outline: Jülich Supercomputing Center, Chebyshev

A knowledge-based approach to high-performance computing in ab initio simulations.
Edoardo Di Napoli. AICES Advisory Board Meeting, July 14th, 2014. Academic background

Improving the performance of applied science numerical simulations: an application to Density Functional Theory
Edoardo Di Napoli. Jülich Supercomputing Center - Institute for Advanced Simulation, Forschungszentrum

On the Paths to Exascale: Crossing the Chasm. Presented by Mike Rezny, Monash University, Australia
michael.rezny@monash.edu. Crossing the Chasm meeting, Reading, 24th October 2016, Version 0.1. In collaboration

HPC Infrastructure and GPU Computing Activities in KISTI
Hongsuk Yi (hsyi@kisti.re.kr). International Advanced Research Workshop on High Performance Computing, Grids and Clouds 2010, June 21-25, 2010, Cetraro, Italy

MATERIALS ARE KEY TO SOCIETAL WELL-BEING
Human ages are named after materials: stone, bronze, iron, nuclear, silicon. We need novel materials for: energy harvesting,

Scalable and Power-Efficient Data Mining Kernels
Alok Choudhary, John G. Searle Professor, Dept. of Electrical Engineering and Computer Science, and Professor, Kellogg School of Management; Director of the

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA (lchien@nvidia.com)
Outline: symmetric eigenvalue solver, experiment, applications, conclusions. Symmetric eigenvalue solver: the standard form is
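The snippet breaks off at the standard form, A x = &lambda; x with A symmetric. A compact, unoptimized CPU sketch of the Jacobi rotation idea that such a GPU solver parallelizes; the sweep order and convergence test are my own simplifications, not the talk's implementation:

```python
import numpy as np

def jacobi_eigenvalues(A, tol=1e-10, max_sweeps=50):
    """Cyclic Jacobi: zero each off-diagonal entry in turn with a 2x2
    rotation; the off-diagonal norm shrinks until A is numerically diagonal."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for _ in range(max_sweeps):
        if np.sqrt(np.sum(np.tril(A, -1) ** 2)) < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p, q]) < tol:
                    continue
                # Rotation angle that annihilates A[p, q]:
                # tan(2*theta) = 2*A[p,q] / (A[q,q] - A[p,p]).
                theta = 0.5 * np.arctan2(2.0 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J   # similarity transform preserves eigenvalues
    return np.sort(np.diag(A))

S = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
assert np.allclose(jacobi_eigenvalues(S), np.linalg.eigvalsh(S))
```

Jacobi is attractive on GPUs because disjoint (p, q) pairs can be rotated concurrently, unlike the sequential bulge-chasing in tridiagonalization-based solvers.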

Efficient implementation of the overlap operator on multi-GPUs
Andrei Alexandru, Mike Lujan, Craig Pelissier, Ben Gamari, Frank Lee. SAAHPC 2011, University of Tennessee. Outline: motivation, overlap operator

Scaling the Software and Advancing the Science of Global Modeling and Assimilation Systems at NASA. Bill Putman
Global Modeling and Assimilation Office. Max Suarez, Lawrence Takacs, Atanas Trayanov and Hamid

Massively scalable computing method to tackle large eigenvalue problems for nanoelectronics modeling
Hoon Ryu, Ph.D. (E: elec1020@kisti.re.kr). 2019 Intel eXtreme Performance Users Group (IXPUG) meeting

MAGMA: Matrix Algebra on GPU and Multicore Architectures. Mark Gates, February 2012
Hardware trends: scale number of cores instead of clock speed; the hardware issue became a software issue; multicore, hybrid

Some thoughts about energy efficient application execution on NEC LX Series compute clusters
G. Wellein, G. Hager, J. Treibig, M. Wittmann. Erlangen Regional Computing Center & Department of Computer Science

CRYSTAL in parallel: replicated and distributed (MPP) data. Why parallel?
Roberto Orlando, Dipartimento di Chimica, Università di Torino, Via Pietro Giuria 5, 10125 Torino (Italy). roberto.orlando@unito.it

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
Emmanuel Agullo (INRIA / LaBRI), Camille Coti (Iowa State University), Jack Dongarra (University of Tennessee), Thomas Hérault
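The communication-avoiding idea behind tall-and-skinny QR (TSQR) fits in a few lines: factor row blocks independently (one block per node in the grid setting), then QR the stacked small R factors. This serial NumPy sketch only mimics that data flow; the block count is arbitrary and Q is left implicit:

```python
import numpy as np

def tsqr(A, nblocks=4):
    """Two-stage TSQR sketch for a tall-and-skinny A: independent block
    QRs, then one reduction QR of the stacked R factors. Returns R only."""
    blocks = np.array_split(A, nblocks, axis=0)
    Rs = [np.linalg.qr(b, mode='r') for b in blocks]   # local factorizations
    return np.linalg.qr(np.vstack(Rs), mode='r')       # reduction step

A = np.random.default_rng(1).standard_normal((400, 8))
R = tsqr(A)
# R agrees with the direct QR factor up to row signs, so R^T R = A^T A.
assert np.allclose(R.T @ R, A.T @ A)
```

In a real grid deployment the reduction step is a tree over nodes, which replaces the many small messages of Householder QR with one small R factor per node.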

Advancing Weather Prediction at NOAA. 18 November 2015, Tom Henderson, NOAA / ESRL / GSD
The U.S. needs better global numerical weather prediction. Hurricane Sandy, October 28, 2012: a European forecast that

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters
H. Köstler. 2nd International Symposium Computer Simulations on GPU, Freudenstadt, 29.05.2013. Contents: motivation, waLBerla, software concepts

Extreme scale simulations of high-temperature superconductivity. Thomas C. Schulthess
Superconductivity: a state of matter with zero electrical resistivity. Heike Kamerlingh Onnes (1853-1926). Discovery

Matrix Eigensystem Tutorial For Parallel Computation
High Performance Computing Center (HPC), http://www.hpc.unm.edu, 5/21/2003. Topic outline: main purpose of this tutorial; the assumptions made

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem
Mark Gates, Azzam Haidar, and Jack Dongarra. University of Tennessee, Knoxville, TN, USA; Oak Ridge National

Before starting: a few words
Now, in many places on planet Earth, materials data are produced! New and novel materials. New materials from materials data?

MagmaDNN: High-Performance Data Analytics for Manycore GPUs and CPUs
Lucien Ng (The Chinese University of Hong Kong), Kwai Wong (The Joint Institute for Computational Sciences (JICS), UTK and ORNL), Azzam Haidar,

PFEAST: A High Performance Sparse Eigenvalue Solver Using Distributed-Memory Linear Solvers
James Kestyn, Vasileios Kalantzis, Eric Polizzi, Yousef Saad. Electrical and Computer Engineering Department,

Acceleration of WRF on the GPU
Daniel Abdi, Sam Elliott, Iman Gohari, Don Berchoff, Gene Pache, John Manobianco. TempoQuest, 1434 Spruce Street, Boulder, CO 80302. TempoQuest.com

Welcome to MCS 572
1. About the course: content and organization; expectations of the course. 2. Supercomputing: definition and classification. 3. Measuring performance: speedup and efficiency; Amdahl's Law; Gustafson
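The two speedup laws named in the outline are one-liners; a small sketch (the 5% serial fraction and 1024 processors are an arbitrary illustration):

```python
def amdahl_speedup(serial_fraction, p):
    """Fixed-size speedup on p processors (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

def gustafson_speedup(serial_fraction, p):
    """Scaled speedup when the parallel work grows with p (Gustafson's law)."""
    return serial_fraction + p * (1.0 - serial_fraction)

# 5% serial work caps fixed-size speedup near 20 regardless of core count,
# while the scaled-problem view keeps growing with p.
print(amdahl_speedup(0.05, 1024))     # ~19.6
print(gustafson_speedup(0.05, 1024))  # ~972.9
```

Efficiency is then speedup divided by p, which is why the same 5% serial fraction looks catastrophic under Amdahl's fixed-size assumption and benign under Gustafson's scaled one.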

Dynamical Variation of Eigenvalue Problems in Density-Matrix Renormalization-Group Code
Susumu Yamada, Toshiyuki Imamura, Masahiko Machida. PP12, Feb. 15, 2012. Center for Computational Science and e-systems, Japan Atomic Energy Agency; The University

FEAST eigenvalue algorithm and solver: review and perspectives
Eric Polizzi, Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, USA. Sparse Days, CERFACS, June 25, 2012

A CUDA Solver for Helmholtz Equation
Mingming Ren, Xiaoguang Liu, Gang Wang. Journal of Computational Information Systems 11:24 (2015) 7805-7812. Available at http://www.jofcis.com. College

ESLW_Drivers, 10-21 July 2017
Volker Blum (ELSI), Viktor Yu (ELSI), William Huhn (ELSI), David Lopez (Siesta), Yann Pouillon (Abinit), Micael Oliveira (Octopus & Abinit), Fabiano Corsetti (Siesta & Onetep), Paolo

Some notes on efficient computing and setting up high performance computing environments
Andrew O. Finley, Department of Forestry, Michigan State University, Lansing, Michigan. April 17, 2017. Efficient

Part 4: Multicore and Manycore Technology: Chances and Challenges. Vincent Heuveline
Numerical Simulation of Tropical Cyclones: goal-oriented adaptivity for tropical cyclones

MPI at MPI. Jens Saak, Max Planck Institute for Dynamics of Complex Technical Systems, Computational Methods in Systems and Control Theory
November 5, 2010

Block Iterative Eigensolvers for Sequences of Dense Correlated Eigenvalue Problems
Edoardo Di Napoli. Birkbeck University, London, June 29th, 2012. Motivation and Goals

ACCELERATING WEATHER PREDICTION WITH NVIDIA GPUS
Alan Gray, Developer Technology Engineer, NVIDIA. ECMWF 18th Workshop on high performance computing in meteorology, 28th September 2018. ESCAPE. NVIDIA's

Scalable Systems for Computational Biology
Ch. Pospiech, John von Neumann Institute for Computing. Published in From Computational Biophysics to Systems Biology (CBSB08), Proceedings of the NIC Workshop

Sparse BLAS-3 Reduction to Banded Upper Triangular (Spar3Bnd)
Gary Howell, HPC/OIT, NC State University (gary howell@ncsu.edu). Acknowledgements: James Demmel, Gene Golub, Franc

Porting a sphere optimization program from LAPACK to ScaLAPACK
Mathematical Sciences Institute, Australian National University. For presentation at Computational Techniques and Applications Conference

Performance and Application of Observation Sensitivity to Global Forecasts on the KMA Cray XE6
Sangwon Joo, Yoonjae Kim, Hyuncheol Shin, Eunhee Lee, Eunjung Kim (Korea Meteorological Administration), Tae-Hun

A Parallel Bisection and Inverse Iteration Solver for a Subset of Eigenpairs of Symmetric Band Matrices
Hiroyuki Ishigami, Hidehiko Hasegawa, Kinji Kimura, and Yoshimasa Nakamura. Abstract: The tridiagonalization
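For a serial reference of the same computation, SciPy's LAPACK-backed `eig_banded` can return an index-selected subset of eigenpairs of a symmetric band matrix, which LAPACK obtains with bisection plus inverse iteration. A sketch with a tridiagonal 1-D Laplacian of my own choosing:

```python
import numpy as np
from scipy.linalg import eig_banded

# Symmetric tridiagonal 1-D Laplacian in upper banded storage:
# row 0 holds the superdiagonal (shifted right), row 1 the main diagonal.
n = 50
a_band = np.zeros((2, n))
a_band[0, 1:] = -1.0
a_band[1, :] = 2.0

# Request only eigenpairs 0..2 by index: bisection locates the eigenvalues,
# inverse iteration recovers the corresponding eigenvectors.
w, v = eig_banded(a_band, select='i', select_range=(0, 2))

# The 1-D Laplacian spectrum is known in closed form, so we can check it.
exact = 2.0 - 2.0 * np.cos(np.pi * np.arange(1, 4) / (n + 1))
assert np.allclose(w, exact)
```

Computing only a subset this way is much cheaper than a full diagonalization when the band is narrow and few eigenpairs are needed, which is exactly the regime the paper parallelizes.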