1 Dynamical Variation of Eigenvalue Problems in Density-Matrix Renormalization-Group Code
PP12, Feb. 15, 2012
Susumu YAMADA (1,3), Toshiyuki IMAMURA (2,3), Masahiko MACHIDA (1,3)
(1) Center for Computational Science and e-systems, Japan Atomic Energy Agency
(2) The University of Electro-Communications
(3) CREST (JST)

2 Outline
- Strongly correlated quantum systems
- Parallelization scheme for the density matrix renormalization group method
- Communication strategy for a massively parallel computer
- Numerical experiment
- Auto-tuning for the parallel DMRG method
- Conclusion

3 Strongly Correlated Quantum Systems
A typical example: high-Tc cuprate superconductors, e.g. Bi2Sr2CaCu2O8-δ.
[Figure: crystalline structure. Layers BiO, SrO, CuO2, Ca, CuO2, SrO, BiO are stacked so that superconducting layers (SL) alternate with insulating layers (IL); the CuO2 plane is drawn with its Cu and O sites.]
The simple model: the Hubbard model. Its Hamiltonian has two parameters:
- U: Coulomb interaction
- t: hopping parameter

4 Density Matrix Renormalization Group
[Figure: a superblock made of a system part (L) and an environment part (R), each updated by renormalization; for the 2-D case the figure marks the leg direction and the 2-D direction.]
Direct extension of the DMRG method toward 2-D models: the dimension of the Hamiltonian increases exponentially. This motivates the parallelization of DMRG.

5 Target of Parallelization
The time-consuming operations of the DMRG method:
- Computing all eigenpairs of the density matrix. The density matrix is dense, so ScaLAPACK is used.
- Computing the ground state of the Hamiltonian for the superblock. The Hamiltonian is a large sparse matrix, so an iterative method is generally used (Lanczos method, LOBPCG method, ...).
The most time-consuming operation of the iterative method is the multiplication of the Hamiltonian (a large sparse matrix) by a vector.

6 Parallelization Using a Feature of the Model
Superblock for a quasi-2-D model: divide the model into 3 blocks.
[Figure: superblock consisting of Block 1 to Block 4 with state indices i1 to i4; labels H, H_l, H_c and H_r.]
The Hamiltonian H is decomposed as
$H = I_4 \otimes I_3 \otimes H_l + I_4 \otimes H_c \otimes I_1 + H_r \otimes I_2 \otimes I_1$,
where $I_i$ is the identity matrix whose dimension equals the number of states of block i.
The Hamiltonian-vector multiplication therefore consists of 3 matrix-vector multiplications:
$Hv = (I_4 \otimes I_3 \otimes H_l)\,v + (I_4 \otimes H_c \otimes I_1)\,v + (H_r \otimes I_2 \otimes I_1)\,v$.

7 Parallelization of Matrix-Vector Multiplication
Convert the vector v into matrices V_l, V_c and V_r, taking the direct product with the identity matrices into account:
$(I_4 \otimes I_3 \otimes H_l)\,v \to H_l V_l$
$(I_4 \otimes H_c \otimes I_1)\,v \to H_c V_c$
$(H_r \otimes I_2 \otimes I_1)\,v \to H_r V_r$
The three sparse matrix-vector multiplications become three sparse matrix-dense matrix multiplications.
Parallelization of the sparse matrix-dense matrix multiplication: partition the dense matrix columnwise, so that the computation cost is divided equally.
Transforming the partitioned data between the matrices V_l, V_c and V_r requires all-to-all communication.
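The reshaping step on slide 7 can be checked on a toy case. The following is a minimal sketch, not the authors' code: it assumes the usual Kronecker ordering in which the rightmost factor's index runs fastest, uses small dense lists instead of sparse storage, and all function names and matrix values are made up for illustration. It shows only the H_l term, where (I (x) H_l) v equals H_l applied to every column of V_l.

```python
def kron(B, A):
    """Dense Kronecker product B (x) A for matrices stored as lists of rows."""
    return [
        [B[ib][jb] * A[ia][ja] for jb in range(len(B[0])) for ja in range(len(A[0]))]
        for ib in range(len(B)) for ia in range(len(A))
    ]


def matvec(M, x):
    """Plain matrix-vector product."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def apply_left_term(H_l, v, n_cols):
    """Compute (I_{n_cols} (x) H_l) v without forming the Kronecker product.

    v is viewed as a dense matrix V_l with n_cols columns of length n;
    column k is v[k*n:(k+1)*n]. Each column is multiplied by H_l on its own,
    so the columns can be distributed over processes.
    """
    n = len(H_l)
    out = []
    for k in range(n_cols):
        out.extend(matvec(H_l, v[k * n:(k + 1) * n]))
    return out


# Toy data (assumed values, for illustration only).
H_l = [[2, -1], [-1, 2]]
v = [1, 2, 3, 4, 5, 6]

explicit = matvec(kron(identity(3), H_l), v)   # forms the 6x6 matrix
reshaped = apply_left_term(H_l, v, 3)          # works column by column
assert explicit == reshaped
```

The H_c and H_r terms need a different ordering of the same entries of v before the column-by-column product applies, which is why the partitioned data has to be redistributed between V_l, V_c and V_r.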

8 Communication for Transformation Between Partitioned Matrices
All-to-all communication realizes the transformation between the partitioned data of the matrices V_l, V_c and V_r.
[Figure: example of all-to-all communication on 4 processes (process 0 to process 3); the communication steps COM1 to COM4 connect V_l, V_c and V_r, and a conflict is marked.]
Communication conflicts occur because all processes communicate simultaneously. All-to-all communication is therefore not suitable for a massively parallel computer.

9 2-Step Communication
All-to-all communication on all processes can be avoided by doubling the communication, that is, by performing it in two steps.
[Figure: example of 2-step communication on 4 processes (process 0 to process 3), with the communication steps between V_l, V_c and V_r.]
Each step moves the same amount of data as the all-to-all communication. Communication conflicts decrease, but the total amount of communication data doubles.
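One common way to split an all-to-all exchange into two group-local steps is to arrange the processes on a rows x cols grid, exchange inside each row first and inside each column second. The sketch below simulates that idea in plain Python, without MPI. It is an assumption about how such a scheme can look, not necessarily the grouping used in the slides, and the function names are illustrative.

```python
def direct_all_to_all(send):
    """Reference result: recv[q][p] is the message process p addressed to q."""
    n = len(send)
    return [[send[p][q] for p in range(n)] for q in range(n)]


def two_step_all_to_all(send, rows, cols):
    """Simulate an all-to-all on rows*cols processes in two group-local steps."""
    n = rows * cols
    assert len(send) == n

    def rank(r, c):
        return r * cols + c

    # Step 1: exchange inside each row group. Process (r, c) hands every
    # message to the process in its own row that sits in the destination's
    # column.
    stage = [[] for _ in range(n)]
    for src in range(n):
        r = src // cols
        for dst in range(n):
            stage[rank(r, dst % cols)].append((src, dst, send[src][dst]))

    # Step 2: exchange inside each column group. Every held message is
    # forwarded to its final destination, which shares the holder's column.
    recv = [[None] * n for _ in range(n)]
    for holder in range(n):
        # Each process forwards n messages here and sent n in step 1, so each
        # step carries as much data as one direct all-to-all.
        assert len(stage[holder]) == n
        for src, dst, payload in stage[holder]:
            assert holder % cols == dst % cols
            recv[dst][src] = payload
    return recv


# 16 processes as a 4x4 grid; message (p, q) travels from p to q.
send = [[(p, q) for q in range(16)] for p in range(16)]
assert two_step_all_to_all(send, 4, 4) == direct_all_to_all(send)
```

In this arrangement no process ever talks to more than rows or cols partners at a time, which is the property that reduces conflicts, at the price of sending every message twice.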

10 Numerical Experiment
T2K Open Supercomputer (Todai Combined Cluster), The University of Tokyo
- Processor: AMD Opteron 8356, quad-core (2.3 GHz)
- Number of processors per node: 4 (16 cores)
- Network: Myrinet-10G
- Link bandwidth: 5 GB/s full-duplex
- Compiler: Intel Fortran Compiler 11.0
- Options: -O3 -ip
- Parallelization: flat MPI

11 Numerical Experiment
4x10-site Hubbard model, 19 up-spins, 19 down-spins, U/t=10.
[Figure: two panels, "Conventional all-to-all communication" and "2-step communication". Each plots the total elapsed time (sec) against the number of states kept (m) for 64, 128, 256, 512 and 1024 cores.]
Annotation on the figure: speed-down on 1024 cores.

12 Reason for Speed-Down
[Figure: communication and calculation time distribution for the matrix-vector multiplication (m=200). Two panels, one of them for 1024 cores, compare conventional communication with 2-step communication; each bar is split into COM1, COM2, COM3, COM4 and calculation. Vertical axis: elapsed time (sec).]
Annotations on the two panels: all communication times decrease; COM2 and COM3 increase.
[Figure: communication steps between V_l, V_c and V_r.]
- COM1 and COM4: no problem.
- COM2 and COM3: the factor in the speed-down.

13 Reason for Speed-Down
Example: a parallel computer with 8 dual-core processors.
[Figure: cores P0 to P15 grouped into dual-core processors, drawn once for COM1 and COM4 and once for COM2 and COM3.]
- COM1, COM4: conflicts hardly occur, because the communication is local.
- COM2, COM3: conflicts may occur frequently, because the communication is global.

14 Scheduling for Overlapping the Calculation and the Communication
[Figure: cores P0 to P15 arranged in communication groups.]
- Execute the communication group by group, one group at a time. Communication conflicts can then be avoided.
- The cores that are not communicating become idle. Execute the calculation on these idle cores.
- The calculation and the communication overlap.
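The schedule described on slide 14 can be written down as a small table. This is a sketch under simplifying assumptions: rounds are idealized as equally long, each group communicates exactly once, and the function name and the "communicate"/"calculate" labels are invented for illustration.

```python
def overlap_schedule(n_groups):
    """Round k: group k communicates while every other group calculates."""
    return [
        {g: ("communicate" if g == k else "calculate") for g in range(n_groups)}
        for k in range(n_groups)
    ]


schedule = overlap_schedule(4)
for round_no, actions in enumerate(schedule):
    busy = [g for g, a in actions.items() if a == "communicate"]
    print(f"round {round_no}: group {busy[0]} communicates, the others calculate")
```

Only one group uses the network in any round, so groups do not compete for links, and no group sits idle while it waits for its turn.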

15 Effect of Overlapping the Calculation and the Communication
4x10-site Hubbard model, 19 up-spins, 19 down-spins, U/t=
T2K Open Supercomputer (Todai Combined Cluster), parallelization: flat MPI.
[Figure: two panels. "Total elapsed time" plots the total elapsed time (sec) against the number of states kept (m) for core counts up to 1024 and marks the speedup. "Matrix-vector multiplication time (overlap method)" compares the conventional, 2-step communication and overlap methods; bars are split into COM1, COM2, COM3, COM4, calculation, and calculation+communication. Vertical axis: elapsed time (sec).]
Speedup is obtained up to 1024 cores.

16 Targets of Auto-Tuning for the Parallel DMRG Method
In our parallel strategy, the performance of two operations strongly depends on the computer architecture:
- the pattern of communication groups for the 2-step all-to-all communication
- the eigenvalue problem for the density matrix

17 Pattern of Communication Groups for the 2-Step All-to-All Communication
The network architecture of a multi-core parallel computer system is complex and often heterogeneous. We can choose among various patterns of communication groups for the 2-step all-to-all communication.
Example patterns of communication groups for COM1 and COM4 on a parallel computer with 8 dual-core processors:
- 4 groups of 4 processes
- 8 groups of 2 processes
[Figure: cores P0 to P15 grouped according to each pattern.]
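The candidate patterns on slide 17 can be enumerated mechanically. The sketch below assumes equal-sized groups made of consecutive ranks, which is only one possible family of patterns; the function name is illustrative.

```python
def group_patterns(n_procs):
    """List every split of ranks 0..n_procs-1 into equal groups of consecutive ranks."""
    patterns = []
    for size in range(1, n_procs + 1):
        if n_procs % size == 0:
            groups = [list(range(start, start + size))
                      for start in range(0, n_procs, size)]
            patterns.append(groups)
    return patterns


# 16 processes, as on the machine with 8 dual-core processors.
for groups in group_patterns(16):
    print(f"{len(groups)} groups of {len(groups[0])} processes")
```

For 16 processes this family contains the two patterns shown on the slide, 4 groups of 4 and 8 groups of 2, along with the other divisors of 16.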

18 Elapsed Time for Patterns of Communication Groups
4x10-site Hubbard model, 18 up-spins, 18 down-spins, U/t=10, number of states kept: 400.
FUJITSU PRIMERGY BX900 (Japan Atomic Energy Agency), 1024 cores.
[Figure: total elapsed time (sec) against the number of communication groups of COM1 and COM4; the optimal case is marked.]
The performance strongly depends on the number of communication groups. The pattern has to be optimized by executing the DMRG method on various patterns, so auto-tuning is required.
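Optimizing the pattern "by executing on various patterns" amounts to an empirical search: run each candidate, measure it, keep the fastest. The sketch below is a generic version of that loop, not the authors' tuner. `timed` shows how a real measurement could be taken; `toy_cost` is a made-up stand-in used only so the example runs, and it carries no information about real machines.

```python
import time


def timed(fn, *args, repeats=3):
    """Best wall-clock time of fn(*args) over a few repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def autotune(candidates, measure):
    """Measure every candidate and return the cheapest one with all results."""
    results = {c: measure(c) for c in candidates}
    best = min(results, key=results.get)
    return best, results


def toy_cost(n_groups):
    """Made-up cost curve (NOT measured data): few large groups and many
    small groups are both penalized."""
    return n_groups / 64 + 64 / n_groups


candidates = [2 ** k for k in range(11)]   # 1, 2, 4, ..., 1024 groups
best, results = autotune(candidates, toy_cost)
print("selected number of groups:", best)
```

In a real run, `measure` would be something like `lambda g: timed(run_one_sweep, g)`, where `run_one_sweep` is a hypothetical function that executes part of the DMRG calculation with `g` communication groups.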

19 Eigenvalue Problem for the Density Matrix
The density matrix is a block diagonal matrix, and the dimensions of the block matrices vary.
Two ways of assigning processors (PEs) to the blocks:
- Assign all processors to the large matrix. [Figure: example with blocks A, B, C, D; labels "All PEs" and "Serial computing (1 PE)".]
- Assign the optimal number of processors to each problem. [Figure: example with blocks A, B, C, D; labels "1000 PEs", "100 PEs", "10 PEs", "1120 PEs".]
It is very difficult to estimate the optimal number of processors theoretically, so auto-tuning is required.

20 Conclusion
We proposed a parallelization strategy of the DMRG method for quasi-2-dimensional quantum models.
Key points:
- The Hamiltonian (sparse matrix)-vector multiplication is carried out as a sparse matrix-dense matrix multiplication, using the property of the quantum model.
- The multiplication is parallelized by decomposing the dense matrix, which requires all-to-all communication.
- Strategy for avoiding conflicts: 2-step communication, and overlapping of communication and calculation.
Our method maintains parallel efficiency up to 1024 cores.
Future work: we will develop auto-tuning schemes to optimize
- the dividing pattern of communication groups for the 2-step communication,
- the parallel eigenvalue solver for the density matrix.
