Domain specific libraries. Material science codes on innovative HPC architectures Anton Kozhevnikov, CSCS December 5, 2016

Size: px

Start display at page:

Download "Domain specific libraries. Material science codes on innovative HPC architectures Anton Kozhevnikov, CSCS December 5, 2016"

Joel Eaton
5 years ago
Views:

1 Domain specific libraries Material science codes on innovative HPC architectures Anton Kozhevnikov, CSCS December 5, 2016

2 Part 1: Introduction

3 Kohn-Shame equations 1 2 Eigen-value problem + v eff (r) j(r) =" j j (r) Effective potential construction Z (r 0 ) v eff (r) = r 0 r dr0 + v XC [ ](r) + v ext (r) Density generation new (r) = X j j (r) 2 Density mixing (r) = new (r)+(1 ) old (r) Output: wave-functions j(r) and eigen energies " j charge density (r) and magnetization m(r) total energy E tot and atomic forces F Derived objects: Spectral functions Linear response functions Phonons, thermodynamic functions Parameters of lattice models

4 Basis expansion Expand Kohn-Sham wave functions in a convenient basis set: j(r) = X µ C µj µ (r) Convert differential equation to an eigen-value problem: HC j = " j OC j H µµ 0 = Z µ(r) O µµ 0 = Z where v eff (r) µ(r) µ 0(r)dr µ 0 (r)dr

5 Electronic structure codes Code Quantum ESPRESSO VASP ABINIT PEtot CPMD QBOX CASTEP SIESTA CRYSTAL FHI-AIMS FPLO OpenMX TB-LMTO CP2K BigDFT WIEN2k FLEUR Exciting Elk GPAW Octopus Basis set plane-waves localized atom-centered orbitals gaussians + plane-waves wavelets linarized augmented plane-waves real space grid / plane waves / atomic orbitals

Plane-wave basis Plane-wave with the wave-vector G e igr = e i(xg x+yg y +zg z ) = cos(gr)+i sin(gr) = 2 G Complete orthogonal basis Eigen-functions of the Laplace operator @ 2 @ 2 x + @2 @ 2 y + @2

6 Plane-wave basis Plane-wave with the wave-vector G e igr = e i(xg x+yg y +zg z ) = cos(gr)+i sin(gr) = 2 G Complete orthogonal basis Eigen-functions of the Laplace z e igr = G 2 e igr Fourier transformations can be rapidly evaluated with FFT Real-space integrals between periodic functions are represented as sums Z f (r)g(r)dr = N X f (r j )g(r j )= X j G f (G)g(G)

7 Pseudopotential Plane-wave basis requires a pseudopotential Unlike the true (all-electron) potential, pseudopotential has an additional nonlocal term: ˆV PS = V loc (r)+ X X 0 id 0 h 0 Action of the pseudopotential on the wave-function: ˆV PS (r) =V loc (r) (r)+ X X 0 Z (r)d 0 0 (r) (r)dr

8 Eigen-value problem H j = " j j H j = " j S j in case of norm-conserving pseudopotential in case of ultrasoft pseudopotential H and S are dense Hermitian matrices of large dimension (~1000 PW / atom).

9 Eigen-value problem H j = " j j H j = " j S j in case of norm-conserving pseudopotential in case of ultrasoft pseudopotential H and S are dense Hermitian matrices of large dimension (~1000 PW / atom). Iterative subspace diagonalziation methods are used. Conjugate gradient Davidson algorithm RMM-DIIS LOBPCG Chebyshev filering method

10 Hamiltonian operator application Any iterative solver requires operation Ĥ i

11 Hamiltonian operator application Any iterative solver requires operation Ĥ i Result of Hamiltonian application h j = H j = Z e igr Ĥ j (r)dr

12 Hamiltonian operator application Any iterative solver requires operation Ĥ i Result of Hamiltonian application h j = H j = Z e igr Ĥ j (r)dr Hamiltonian operator with pseudopotential Ĥ = v eff (r)+ X X 0 id 0 h 0

13 Hamiltonian operator application Application of the local part of potential (Vloc kernel) j(g) FFT 1! j (r)! v eff (r) j (r) FFT! h j (G)

14 Hamiltonian operator application Application of the local part of potential (Vloc kernel) j(g) FFT 1! j (r)! v eff (r) j (r) FFT! h j (G) do j = 1, num_psi! transform psi to real-space domain call inverse_fft(psi(:,j), psi_of_r(:))! multiply by effective potential do ir = 1, num_r_points psi_of_r(ir) = psi_of_r(ir) * veff(ir) end do! tranform back to PW domain call forward_fft(psi_of_r(:), hpsi(:,j)) end do Two FFTs for each wave-function!

15 Hamiltonian operator application 1 2 j(r) = X G Application of Laplace operator 1 2 eigr j(g) = X G e igr G2 2 j(g)

16 Hamiltonian operator application 1 2 j(r) = X G Application of Laplace operator 1 2 eigr j(g) = X G e igr G2 2 j(g) do j = 1, num_psi! add kinetic energy contribution do ig = 1, num_gvec hpsi(ig, j) = hpsi(ig, j) + pw_ekin(ig) * psi(ig, j) end do end do

17 Hamiltonian operator application X Application of non-local part of potential: (G) X 0 D 0 X G 0 zgemm#1 zgemm#2 zgemm#3 X 0 id 0h 0 0 (G 0 ) j(g 0 )! h j (G)

18 Hamiltonian operator application X Application of non-local part of potential: (G) X 0 D 0 X G 0 zgemm#1 zgemm#2 zgemm#3 X 0 id 0h 0 0 (G 0 ) j(g 0 )! h j (G)! compute <beta psi> call zgemm('c', 'N', num_beta_tot, num_psi, num_pw, beta_pw, psi, beta_psi)! apply block-diagonal D matrix do ia = 1, num_atoms call zgemm('n', 'N', num_beta(ia), num_psi, num_beta(ia), D(:, :, ia), beta_psi(offset(ia), 1), d_beta_psi(offset(ia), 1)) end do! compute beta*d*<beta psi> call zgemm('n', 'N', num_pw, num_psi, num_beta_tot, beta_pw, d_beta_psi, hpsi)

19 Hamiltonian operator application X Application of non-local part of potential: (G) X 0 D 0 X G 0 zgemm#1 zgemm#2 zgemm#3 X 0 id 0h 0 0 (G 0 ) j(g 0 )! h j (G) plane-wave index G h j (G) j += index of atom (G) and orbital D 0 D 1 D 2 D 3 0 (G 0 ) j(g) j

20 Davidson iterative solver We want to compute H j = " j j We know how to compute h j = H j

21 Davidson iterative solver We want to compute H j = " j j We know how to compute h j = H j Key idea of Davidson iterative solver: start with a subspace spanned by a guess to and expand it with preconditioned residuals j

22 Davidson iterative solver 0 Initialize the trial basis set: j ( j

23 Davidson iterative solver 0 Initialize the trial basis set: j ( j Apply Hamiltonian to the basis functions: h m j = H m j

24 Davidson iterative solver 0 Initialize the trial basis set: j ( j Apply Hamiltonian to the basis functions: h m j = H m j Compute reduced Hamiltonian matrix: hm jj 0 = X G m j (G) h m (G) j 0

25 Davidson iterative solver 0 Initialize the trial basis set: j ( j Apply Hamiltonian to the basis functions: h m j = H m j Compute reduced Hamiltonian matrix: hm jj 0 = X G Diagonalize reduced Hamiltonian matrix and get N lowest eigen pairs: h m Z m = j Z m m j (G) h m (G) j 0

26 Davidson iterative solver 0 Initialize the trial basis set: j ( j Apply Hamiltonian to the basis functions: h m j = H m j Compute reduced Hamiltonian matrix: hm jj 0 = X G Diagonalize reduced Hamiltonian matrix and get N lowest eigen pairs: h m Z m = j Z m m j (G) h m (G) j 0 Compute residuals ( ): Rm j = h m j Z m m R j j Z m j = Ĥ j j j

27 Davidson iterative solver 0 Initialize the trial basis set: j ( j Apply Hamiltonian to the basis functions: h m j = H m j Compute reduced Hamiltonian matrix: hm jj 0 = X G Diagonalize reduced Hamiltonian matrix and get N lowest eigen pairs: m j (G) h m (G) j 0 Compute residuals ( ): Rm j = h m j Z m m R j j Z m j = Ĥ j j j Apply preconditioner to the unconverged residuals, orthogonalize and add the resulting functions to the basis: h m Z m = j Z m { m+1 j } = { m j } M {P R m j }

28 Davidson iterative solver 0 Initialize the trial basis set: j ( j Apply Hamiltonian to the basis functions: h m j = H m j Compute reduced Hamiltonian matrix: hm jj 0 = X G Diagonalize reduced Hamiltonian matrix and get N lowest eigen pairs: m j (G) h m (G) j 0 Compute residuals ( ): Rm j = h m j Z m m R j j Z m j = Ĥ j j j Apply preconditioner to the unconverged residuals, orthogonalize and add the resulting functions to the basis: h m Z m = j Z m { m+1 j } = { m j } M {P R m j } Recompute the wave-functions: j = m j Z m

29 Davidson iterative solver Initialize subspace basis functions and apply Hamiltonian 0 j h 0j = H 0 j plane-wave index G

30 Davidson iterative solver Compute reduced Hamiltonian matrix h 0 jj 0 0 j h 0j = Compute eigen-value probelem = " j

31 Davidson iterative solver Compute residuals R 0 j h 0 j Z 0 Z 0 j jj 0 0 j = - Apply preconditioner: P R 0 j (G) =(H GG " j ) 1 R0 j (G)

32 Davidson iterative solver Expand variational space and apply Hamiltonian to new basis functions 1 j = 0 j M P R0 j h 1 j = H 1 j plane-wave index G

33 Davidson iterative solver Compute reduced Hamiltonian matrix h 1 jj 0 1 j h1 j = Compute eigen-value probelem = " j

34 Davidson iterative solver Compute residuals R 1 j h 1 j Z 1 Z 1 j jj 0 1 j = - Apply preconditioner: P R 1 j (G) =(H GG " j ) 1 R1 j (G)

35 Davidson iterative solver Continue to expand variational space 2 j = 1 j M P R1 j h 2j = H 2 j plane-wave index G

36 Davidson iterative solver Continue to expand variational space 2 j = 1 j M P R1 j h 2j = H 2 j plane-wave index G Iterate until the convergence (all residuals are zero) is reached

37 Davidson iterative solver Recompute the wave-functions m j j = Z m Take " j from the last subspace diagonalization

38 Part 2: Implementation details

39 Common operations FFTs (Vloc kernel, density summation) Subspace Hamiltonian construction = Wave-functions and residuals update = Wave-function orthogonalization

40 Common operations FFTs (Vloc kernel, density summation) Subspace Hamiltonian construction = Wave-functions and residuals update = Wave-function orthogonalization Need scalable parallel implementation!

41 Wave-function storage and distribution Wave-functions i(g) represents a single wave-function. are stored as a matrix: each column of the matrix band index i Number of bands: ~100 10K plane-wave index G Number of plane-waves: ~100K 10M

42 Wave-function storage and distribution How to distribute the matrix of plane-wave coefficients? Split bands Split G-vectors Split bands and G-vectors 2D block-cyclic distribution band index i band index i band index i band index i plane-wave index G plane-wave index G plane-wave index G plane-wave index G

43 Wave-function storage and distribution How to distribute the matrix of plane-wave coefficients? Split bands Split G-vectors Split bands and G-vectors 2D block-cyclic distribution band index i band index i band index i band index i plane-wave index G plane-wave index G plane-wave index G plane-wave index G

44 Wave-function storage and distribution How to distribute the matrix of plane-wave coefficients? Split G-vectors Split bands band index i Split bands and G-vectors 2D block-cyclic distribution band index i band index i band index i plane-wave index G plane-wave index G plane-wave index G plane-wave index G

45 FFT kernel do i=1, Nb enddo i(g) FFT 1! i(r)! i(r) V loc (r) FFT! [ iv ](G)

46 FFT kernel do i=1, Nb enddo i(g) FFT 1! i(r)! i(r) V loc (r) FFT! [ iv ](G) Wave-functions in the plane-wave domain are defined inside a sphere with a given energy cutoff: G 2 max i(g) 6= 0 for G apple G max ; = E cut 2

47 FFT kernel do i=1, Nb enddo i(g) FFT 1! i(r)! i(r) V loc (r) FFT! [ iv ](G) Wave-functions in the plane-wave domain are defined inside a sphere with a given energy cutoff: G 2 max i(g) 6= 0 for G apple G max ; = E cut 2 FFT box for the Vloc kernel typically circumscribes a sphere of radius 2Gmax. z x y

48 FFT decomposition to 1D and 2D The full 3D transformation is decomposed into a 1D transformation along the z-direction and a 2D transformation in the xy-plane: (G x,g y,g z ) (G x,g y,z) (x, y, z) FFT -1 (z) FFT(z) FFT -1 (xy) FFT(xy)

49 FFT parallelization

50 FFT parallelization In real space z-dimension of the FFT buffer is split among MPI ranks MPI rank#0 z MPI rank#1 MPI rank#2 y x

FFT parallelization In real space z-dimension of the FFT buffer is split among MPI ranks In reciprocal space z-sticks of the G-vector sphere are distributed among MPI ranks

51 FFT parallelization In real space z-dimension of the FFT buffer is split among MPI ranks In reciprocal space z-sticks of the G-vector sphere are distributed among MPI ranks such that: local number of z-sticks is roughly equivalent sum of lengths of z-sticks (local number of G- vectors) is roughly equivalent MPI rank#0 z MPI rank#1 MPI rank#2 y x

52 Parallel transformation of z-sticks FFT -1 (z) FFT(z)

53 Parallel transformation of z-sticks FFT -1 (z) FFT(z) each rank executes 1D transformations of the local fraction of z-sticks FFT -1 (z) p0 p1 p2 FFT(z) p0 p1 p2 {full-z, partial-xy} distribution of a cylinder

54 Parallel transformation of z-sticks FFT -1 (z) FFT(z) each rank executes 1D transformations of the local fraction of z-sticks z-sticks are swapped between MPI ranks using MPI_Alltoall p0 p0 p1 p2 FFT -1 (z) FFT(z) MPI_Alltoall p1 p2 p0 p1 p2 {full-z, partial-xy} distribution of a cylinder {partial-z, full-xy} distribution of a cylinder

55 Data redistribution Np ranks band index i p0 p1 p2 p3 p4 p5 p6 p7 plane-wave index G

56 Data redistribution Using all available MPI ranks for a single FFT is not efficient! Np ranks band index i p0 p1 p2 p3 p4 p5 p6 p7 plane-wave index G

57 Data redistribution Using all available MPI ranks for a single FFT is not efficient! Remap wave-functions to a 2D MPI grid where one dimension is dedicated to a parallel FFT and second dimension is used to distribute wave-function band index Np ranks band index i p0 p1 p2 p3 p4 p5 p6 p7 plane-wave index G 2D MPI grid of Nr x Nc ranks p0 p1 p2 p3 p4 p5 p6 p7

58 Data redistribution Using all available MPI ranks for a single FFT is not efficient! Remap wave-functions to a 2D MPI grid where one dimension is dedicated to a parallel FFT and second dimension is used to distribute wave-function band index Np ranks band index i p0 p1 p2 p3 p4 p5 p6 p7 plane-wave index G MPI A2A between Nc ranks MPI A2A between Nc ranks 2D MPI grid of Nr x Nc ranks p0 p1 p2 p3 p4 p5 p6 p7 Groups of Nc ranks contain the fat slab of plane-wave coefficients for all bands

59 FFT Vloc kernel Change wave-function distribution from slab to 2D grid (MPI_A2A between Nc ranks) Execute 1D backward FFTs along z for the local set of G-vector sticks swap z-columns (MPI_A2A between Nr ranks) execute 2D backward FFTs in the xy-plane multiply by effective potential execute 2D forward FFTs in the xy-plane swap z-columns (MPI_A2A between Nr ranks) Execute 1D forward FFTs along z for the local set of z-sticks Change wave-function distribution from 2D grid to slab (MPI_A2A between Nc ranks)

60 FFT kernel benchmark on Piz Daint Number of bands is 3000, number of times Vloc is applied is 4, number of G-vectors is ~9.2M, FFT grid size is Total Vloc time Total Vloc efficiency FFT time FFT efficiency 1 Time (sec.) Number of nodes Efficiency

61 Inner product O ij = X G i (G) j(g)

62 Inner product O ij = X G i (G) j(g) Slab data distribution of band index i i(g) plane-wave index G MPI rank #0 MPI rank #1 MPI rank #2 MPI rank #3

63 Inner product Slab data distribution of band index i O ij = X G i(g) i (G) j(g) 2D block-cyclic distribution of Oij band index j plane-wave index G MPI rank #0 MPI rank #1 MPI rank #2 MPI rank #3 band index i

64 Inner product: simple implementation

65 Inner product: simple implementation 1. Each rank computes a contribution to the global matrix p NGX O p ij = i (G) j(g) G j i MPI rank#0 MPI rank#1

66 Inner product: simple implementation 1. Each rank computes a contribution to the global matrix p NGX O p ij = i (G) j(g) 2. MPI_Allreduce of global matrix is performed N X p O ij = G p=1 O p ij j i + + = MPI rank#0 MPI rank#1

67 Inner product: simple implementation j 1. Each rank computes a contribution to the global matrix p NGX O p ij = i (G) j(g) 2. MPI_Allreduce of global matrix is performed N X p O ij = G p=1 O p ij 3. Each rank picks the local fraction of matrix elements from Oij i + + = MPI rank#0 MPI rank#1

68 Inner product: simple implementation j 1. Each rank computes a contribution to the global matrix p NGX O p ij = i (G) j(g) 2. MPI_Allreduce of global matrix is performed N X p O ij = G p=1 O p ij 3. Each rank picks the local fraction of matrix elements from Oij i + + = MPI rank#0 MPI rank#1 Requires MPI_Allreduce of global matrix! Each MPI rank receives a lot of unnecessary data!

69 Inner product: reduce-to-one implementation

70 Inner product: reduce-to-one implementation 1. Each rank computes a contribution to the local panel of (pr, pc) Cartesian rank NGX p O p i r j c = i r (G) j c (G) G jc ir MPI rank#0 MPI rank#1 MPI rank#2

71 Inner product: reduce-to-one implementation 1. Each rank computes a contribution to the local panel of (pr, pc) Cartesian rank NGX p O p i r j c = i r (G) j c (G) G 2. MPI_Reduce of local panel is performed to (pr, pc) Cartesian rank N X p O ir j c = p=1 O p i r j c jc ir MPI rank#0 MPI rank#1 MPI rank#2

72 Inner product: reduce-to-one implementation 1. Each rank computes a contribution to the local panel of (pr, pc) Cartesian rank NGX p O p i r j c = i r (G) j c (G) G 2. MPI_Reduce of local panel is performed to (pr, pc) Cartesian rank N X p O ir j c = p=1 O p i r j c 3. Steps 1-2 are repeated for all Cartesians rank of a 2D BLACS grid jc ir MPI rank#0 MPI rank#1 MPI rank#2

73 Inner product: reduce-to-one implementation synchronous zgemm MPI_Reduce asynchronous MPI_Wait zgemm MPI_Ireduce

74 Inner product: reduce-to-one implementation synchronous zgemm MPI_Reduce asynchronous MPI_Wait zgemm MPI_Ireduce A lot of MPI_Reduce! The resulting sub-matrices are small!

75 Inner product: block-allreduce implementation 1. Each rank computes a contribution to the block of global matrix O p i b j b = N p GX G i b (G) j b (G) 2. MPI_Allreduce of block is performed O ib j b = N p X p=1 O p i b j b 3. Each rank picks the local fraction of matrix elements from block O ib j b b=0 b=1 b=2 b=3 b=4 b=5 b=6 b=7 b=8

76 Inner product: block-allreduce implementation 1. Each rank computes a contribution to the block of global matrix O p i b j b = N p GX G i b (G) j b (G) 2. MPI_Allreduce of block is performed O ib j b = N p X p=1 O p i b j b 3. Each rank picks the local fraction of matrix elements from block O ib j b b=0 b=1 b=2 b=3 b=4 b=5 4. Steps 1-3 are repeated for all blocks b=6 b=7 b=8

77 Inner product: block-allreduce implementation synchronous zgemm MPI_Allreduce update local panels asynchronous MPI_Wait update local panels zgemm MPI_Ireduce overlap computation and communication single OMP thread is executing MPI_Allreduce and update of local panels remaining OMP threads execute zgemm

78 Example: communication and computation overlap omp_set_nested(1); /* allow nested OMP */ int nt = omp_get_max_threads(); /* get total number of threds */ #pragma omp parallel num_threads(2) shared(buf_state) /* spawn two threds */ { if (omp_get_thread_num() == 1) { /* thread #1 executes zgemms */ int s{0}; omp_set_num_threads(nt - 1); /* set number of nested threads to nt - 1 */ for (...) { /* loop over blocks */ int state = 1; while (state == 1) { /* wait for the release of the buffer */ #pragma omp atomic read state = buf_state[s % 2]; /* read from shared varray */ } zgemm(...,buf[s % 2]); /* execute local zgemm */ #pragma omp atomic write buf_state[s % 2] = 1; /* lock the buffer */ s++; } } else { /* thread #0 is doing communication */ int s{0}; for (...) { /* loop over blocks */ int state = 0; while (state == 0) { /* wait for the lock of the buffer */ #pragma omp atomic read state = buf_state[s % 2]; /* read from shared array */ } MPI_Allreduce(...,buf[s % 2]); /* execute global reduction */ /* do something with the result, for example, store it */ #pragma omp atomic write buf_state[s % 2] = 0; /* release the buffer */ s++; } } } omp_set_nested(0); /* forbid nested */

79 Inner product kernel benchmark Piz Daint, Intel SB (8 CPU cores) M=1 000 N=1 000 K=

80 Inner product kernel benchmark Piz Daint, Intel SB (8 CPU cores) M= N= K=

81 Wave-function transformation i(g) = X j j(g)z ji

82 Wave-function transformation i(g) = X j j(g)z ji Slab data distribution of and i(g) j(g) wave-function index plane-wave index G MPI rank #0 MPI rank #1 MPI rank #2 MPI rank #3

83 Wave-function transformation plane-wave index G Slab data distribution of and i(g) j(g) wave-function index MPI rank #0 MPI rank #1 MPI rank #2 MPI rank #3 i(g) = X j j(g)z ji 2D block-cyclic distribution of Zji band index i basis function index j

84 Wave-function transformation i0 i1 For each block of the transformation matrix: each MPI rank packs its contribution to the block elements of block are collected with MPI_Allgatherv elements are unpacked and the whole block is formed the updated of the wave-functions i 0 :i 1 (G) from the wave-functions j 0 :j 1 (G) is done by the local zgemm j0 j1

85 Wave-function transformation i0 i1 j0 j1 i0 i1 = j0 i(g) j(g) j1

86 Wave-function orthogonalization For the orthonormal basis { 0 j}, h 0 j functions { new j }, { 0 j} M { new j } = { 1 j} 0 j 0i = jj 0 the task is to expand it with the new such that h 1 1 j j 0i = jj 0

87 Wave-function orthogonalization For the orthonormal basis { 0 j}, h 0 j functions { new j }, { 0 j} M { new j } = { 1 j} 0 j 0i = jj 0 the task is to expand it with the new such that h 1 1 j j 0i = jj 0 Orthogonalize to the old basis functions ii = 1 X j 0 jih 0 j new i i = new i i X j 0 jih 0 j new i i inner product wave-function transformation

88 Wave-function orthogonalization For the orthonormal basis { 0 j}, h 0 j functions { new j }, { 0 j} M { new j } = { 1 j} Orthogonalize to the old basis functions ii = 1 X Orthonormalize new functions Oii0 j 0 jih 0 j 0 j 0i = jj 0 the task is to expand it with the new such that h 1 1 j j 0i = jj 0 new i = h i i0i i = new i i X j inner product 0 jih 0 j new i i inner product wave-function transformation O = LL H Cholesky decomposition new i i = X 1 i0ili 0 i i 0 wave-function transformation

89 Wave-function orthogonalization Test case: orthogonalization and transformation of i i, Ĥ i i, Ŝ ii for i 2 [1,N] orthogonalization and transformation of i i, Ĥ i i, Ŝ ii for i 2 [N +1, 2N] w.r.t to the first N wave-functions Upgraded Piz Daint, Intel HW (12 CPU cores) + P100 GPU, N=4096 K=

90 Wave-function orthogonalization Test case: orthogonalization and transformation of i i, Ĥ i i, Ŝ ii for i 2 [1,N] orthogonalization and transformation of i i, Ĥ i i, Ŝ ii for i 2 [N +1, 2N] w.r.t to the first N wave-functions Upgraded Piz Daint, Intel HW (12 CPU cores) + P100 GPU, N=4096 K= Average node performance Efficiency Performance (GFlop/s) Efficiency Number of nodes

91 Wave-function orthogonalization Test case: orthogonalization and transformation of orthogonalization and transformation of w.r.t to the first N wave-functions i i, H i i, i i, H i i, S i i for i 2 [N + 1, 2N ] S ii for i 2 [1, N ] 0 0 j S j 0 i Transformations: 0 1 j il H 0 1 j il S h 0 1 j S j 0 i Transformations: 1 ii H 0 1 j il X j 1 ii S j ih j S i i X j 1 ii H j ih j S i i X j S j ih j S i i h Cholesky factorization and inversion (CPU) h Cholesky factorization and inversion (CPU) Upgraded Piz Daint, Intel HW (12 CPU cores) + P100 GPU, N=4096 K= Transformations: 1 1 j S j 0 i 1 1 j il H 1 1 j il S 1 1 j il

92 Conclusion Domain specific libraries allow a separation of concerns between HPC and scientific communities Several compute-intensive kernels that are one level above the standard BLAS/LAPCK/ScaLAPACK/FFTW were identified MaX centre of excellence is working on the initial DSL implementation and API (WP4)

93 Thank you for your attention.

VASP: running on HPC resources. University of Vienna, Faculty of Physics and Center for Computational Materials Science, Vienna, Austria

VASP: running on HPC resources University of Vienna, Faculty of Physics and Center for Computational Materials Science, Vienna, Austria The Many-Body Schrödinger equation 0 @ 1 2 X i i + X i Ĥ (r 1,...,r