Parallelization Strategies for Density Matrix Renormalization Group algorithms on Shared-Memory Systems


Slide 1: Title
Parallelization Strategies for Density Matrix Renormalization Group Algorithms on Shared-Memory Systems
G. Hager, HPC Services, Computing Center Erlangen, Germany
E. Jeckelmann, Theoretical Physics, Univ. Mainz, Germany
H. Fehske, Theoretical Physics, Univ. Greifswald, Germany
G. Wellein, HPC Services, Computing Center Erlangen, Germany

Slide 2: Motivation - From Physics to Supercomputers
- Physics: a microscopic model, i.e. a Hamiltonian H
- Targets: ground-state properties, thermodynamics, excitation spectra, dynamical correlation functions
- Strongly correlated electron/spin-phonon systems: high-Tc cuprates, CMR manganites, quasi-1D metal/spin systems
- "The whole is greater than its parts": exact numerical studies of finite systems using supercomputers

Slide 3: Motivation - Microscopic Models
- Microscopic Hamiltonians in second quantization, e.g. the Hubbard model (hopping -t, on-site repulsion U) and the Holstein-Hubbard model (HHM)
- HHM: coupling between electrons and phonons (lattice oscillations)
- The Hilbert space (number of quantum states) grows exponentially: for the HHM on an N-site lattice the dimension is 4^N * (M+1)^N, with 4 electronic states and at most M phonons per site (N ~ 10-100; M ~ 10)

Slide 4: Motivation - Numerical Approaches
- Traditional approaches: Quantum Monte Carlo (QMC) with massively parallel codes; Exact Diagonalization (ED) on supercomputers
- New approach: the Density Matrix Renormalization Group (DMRG) method, originally introduced by White in 1992
- A large sequential C++ package is in wide use (quantum physics and quantum chemistry); elapsed times: hours to weeks
- No parallel implementation available to date
- Aim: a parallelization strategy for the DMRG package
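
A quick worked example of this exponential growth (my arithmetic, using the ED parameters quoted on slide 6, N = 8 and M = 7):

    dim = 4^N * (M+1)^N = 4^8 * 8^8 = 2^40 ≈ 1.1 * 10^12

basis states, before conservation laws and symmetries (slide 6) reduce the problem to the matrix dimension of roughly 10 billion quoted there.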

Slide 5: Algorithms - Overview
- Exact Diagonalization
- DMRG algorithm

Slide 6: Algorithms - Exact Diagonalization (ED)
- Choose a COMPLETE basis set (e.g. localized states in real space): sparse matrix representation of H
- Exploit conservation laws and symmetries to shrink the effective Hilbert space by a factor of ~N (the growth in N remains exponential)
- Perform ONE exact diagonalization step using iterative algorithms, e.g. Lanczos or Davidson
- The sparse matrix-vector multiply determines the computational effort; do not store the matrix!
- ED on TFlop/s computers: N = 8, M = 7 -> matrix dimension ~ 10 billion
- No approximations!
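
The "do not store the matrix" point means the iterative eigensolver is matrix-free: it only ever needs the action of H on a vector. A minimal Lanczos sketch of that structure (my illustration, not code from the DMRG package; Vec, MatVec and lanczos are names made up for this example):

    #include <cmath>
    #include <cstddef>
    #include <functional>
    #include <vector>

    // Minimal matrix-free Lanczos sketch (illustration only; the DMRG package
    // itself uses a Davidson solver). The Hamiltonian enters solely through the
    // user-supplied matvec H(v), so the matrix is never stored explicitly.
    using Vec = std::vector<double>;
    using MatVec = std::function<Vec(const Vec&)>;

    void lanczos(const MatVec& H, Vec v, int steps,
                 std::vector<double>& alpha, std::vector<double>& beta) {
        const std::size_t n = v.size();
        double nrm = 0.0;                        // normalize the start vector
        for (double x : v) nrm += x * x;
        nrm = std::sqrt(nrm);
        for (double& x : v) x /= nrm;

        Vec v_prev(n, 0.0);
        double b = 0.0;
        for (int j = 0; j < steps; ++j) {
            Vec w = H(v);                        // the only place H is needed
            double a = 0.0;
            for (std::size_t i = 0; i < n; ++i) a += w[i] * v[i];
            alpha.push_back(a);                  // diagonal of the tridiagonal matrix
            for (std::size_t i = 0; i < n; ++i) w[i] -= a * v[i] + b * v_prev[i];
            b = 0.0;
            for (double x : w) b += x * x;
            b = std::sqrt(b);
            if (b == 0.0) break;                 // hit an invariant subspace
            beta.push_back(b);                   // off-diagonal entries
            v_prev = v;
            for (std::size_t i = 0; i < n; ++i) v[i] = w[i] / b;
        }
        // The extremal eigenvalues of the (alpha, beta) tridiagonal matrix
        // (e.g. via LAPACK dstev) converge to the extremal eigenvalues of H.
    }

The Davidson algorithm used later in the talk relies on the same matrix-free matvec, which is why the sparse MVM dominates the runtime.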

Slide 7: Algorithms - DMRG Basic Idea
- Find an appropriate (reduced) basis set that describes the ground state of H with high accuracy
- Basic quantities:
  - Superblock = system + environment
  - Superblock state (product of system and environment states)
  - Reduced density matrix (DM), obtained by summing over the environment states
- Eigenstates of the DM with the largest eigenvalues have the most impact on observables!

Slide 8: Algorithm - DMRG (finite-size algorithm; left-to-right sweep)
1. Diagonalize the reduced DM for a system block of size l and extract the m eigenvectors with the largest eigenvalues
2. Construct all relevant operators (system block & environment, ...) for a system block of size l+1 in the reduced density-matrix eigenbasis
3. Form a superblock Hamiltonian from the system and environment Hamiltonians plus two single sites
4. Diagonalize the new superblock Hamiltonian
Accuracy depends mainly on m (m ~ 100-10000)
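
For completeness, the "summation over environment states" written out (the standard DMRG construction, not specific to this code): expanding the superblock ground state as

    ψ = Σ_{i,j} Ψ_ij |i⟩_sys |j⟩_env

the reduced density matrix of the system is

    (ρ_sys)_{i i'} = Σ_j Ψ_ij Ψ*_{i'j},   i.e.   ρ_sys = Ψ Ψ†,

and step 1 of slide 8 keeps the m eigenvectors of ρ_sys with the largest eigenvalues. This is the dense symmetric eigenproblem handled by LAPACK dsyev on slide 9.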

Slide 9: Algorithm - DMRG Implementation
- Start-up with the infinite-size algorithm
- DM diagonalization with LAPACK (dsyev): about 5 % of the runtime
- Superblock diagonalization (Davidson algorithm): about 90 % of the runtime
- Most time-consuming step: the sparse matrix-vector multiply (MVM) inside Davidson, about 85 % of the runtime
- The sparse matrix H is built from the transformations of each operator in H: one contribution from the system block and one from the environment

Slide 10: Algorithm - DMRG Implementation of the sparse MVM (1)
- The sparse MVM is a sum over dense matrix-matrix multiplies!
- However, A and B may contain only a few nonzero elements, e.g. if conservation laws (quantum numbers) have to be obeyed
- To minimize overhead, an additional loop running over the nonzero blocks only is introduced
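
The "sum over dense matrix-matrix multiplies" comes from the tensor-product structure of the superblock Hamiltonian (a standard DMRG identity; the exact index and transpose conventions depend on how the wavevector blocks are stored):

    H_SB = Σ_α A_α ⊗ B_α,   and   (A_α ⊗ B_α) w  ↔  A_α W B_α^T

once the superblock vector w is reshaped into a matrix W with a system row index and an environment column index. Each Hamiltonian term therefore costs two dense GEMMs, which is exactly the pattern of the pseudocode on slide 16.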

Slide 11: Outline
- Benchmark systems
- Benchmark cases
- Single-processor performance
- Potential parallelization approaches: parallel BLAS, OpenMP parallelization

Slide 12: DMRG Benchmark Systems

    Processor        Frequency   Peak perf.    System        Memory arch.   #Processors (size / max)
    IBM Power4       1.3 GHz     5.2 GFlop/s   IBM p690      ccNUMA         32 / 32
    Intel Xeon DP    2.4 GHz     4.8 GFlop/s   Intel860      UMA            2 / 2
    Intel Itanium2   1.0 GHz     4.0 GFlop/s   HP rx5670     UMA            4 / 4
    UltraSparc III   0.9 GHz     1.8 GFlop/s   SunFire 6800  ccNUMA         24 / 24
    MIPS R14000      0.5 GHz     1.0 GFlop/s   SGI O3400     ccNUMA         28 / 32
    MIPS R14000      0.5 GHz     1.0 GFlop/s   SGI O3800     ccNUMA         128 / 512

Slide 13: DMRG Benchmark Cases
- Case 1: 2D 4x4 periodic Hubbard model at half filling, U = 4, t_x = t_y = 1
  - Small number of lattice sites (16), but large values of m (m ~ 1000-10000) are required to achieve convergence in 2D
- Case 2: 1D 8-site periodic Holstein-Hubbard model at half filling, U = 3, t = 1, ω_0 = 1, g^2 = 2
  - At most 6 phonons per site; phonons are implemented as pseudo-sites -> large effective lattice size (~50); m is at most 1000
- Metrics: P(N) is the total performance on N CPUs; speed-up S(N) = P(N) / P(1); parallel efficiency ε(N) = S(N) / N

Slide 14: DMRG Single-Processor Performance
- Numerical core: DGEMM -> high sustained single-processor performance
[Bar chart: MFlop/s (scale 0-3000) for the Hubbard 4x4 case on IBM Power4, Intel Itanium2, Intel Xeon, UltraSparc III, MIPS R14000]
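
To make the metrics concrete with a number that appears later in the talk: the parallel-BLAS run on slide 18 with m = 7000 reaches S(4) = 3.2 on 4 CPUs, i.e. a parallel efficiency of ε(4) = 3.2 / 4 = 0.8.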

Slide 15: DMRG Single-Processor Efficiency
- Processor efficiency: fraction of peak performance achieved
- Determining parameters: DGEMM implementation (proprietary), C++ compiler (proprietary)
[Bar chart: efficiency in % of peak (scale 0-100) for the Hubbard 4x4 case on IBM Power4, Intel Itanium2, Intel Xeon, UltraSparc III, MIPS R14000]

Slide 16: DMRG Potential Parallelization Approaches - sparse MVM pseudocode

    // W: wavevector ; R: result
    for (α = 0; α < number_of_hamiltonian_terms; α++) {     // parallel loop!?
      term = hamiltonian_terms[α];
      for (k = 0; k < term.number_of_blocks; k++) {         // parallel loop!?
        li = term[k].left_index;
        ri = term[k].right_index;
        temp_matrix = term[k].b.transpose() * W[ri];        // matrix-matrix multiply (parallel DGEMM?!)
        R[li] += term[k].a * temp_matrix;                   // data dependency!
      }
    }

Slide 17: DMRG Potential Parallelization Approaches - strategies
1. Linking with parallel BLAS (DGEMM)
   - Does not require restructuring of the code
   - Significant speed-up only for large (transformation) matrices A, B
2. Shared-memory parallelization of the outer loops
   - OpenMP chosen for portability reasons
   - Requires some restructuring & directives
   - Speed-up should not depend on the size of the (transformation) matrices
   - Expected maximum speed-up for the total program: ~6-8 if only the MVM is parallelized, ~10 if the Davidson algorithm is parallelized as well
3. MPI parallelization
   - Requires a complete restructuring of the algorithm -> Ian?

Slide 18: DMRG - Parallel BLAS
- Linking with parallel BLAS:
  - Useless on IBM for more than 4 CPUs
  - Best scalability on SGI (network, BLAS implementation)
  - Dual-processor nodes can reduce the elapsed runtime by about 30 %
  - Increasing m to 7000: S(4) = 3.2
  - Small m (~600) with the HHM: no speed-up

Slide 19: DMRG OpenMP Parallelization - strategy
- Parallelizing the innermost k loop: scales badly (loop too short, collective thread operations inside the outer loop)
- Parallelizing the outer α loop: scales badly (even shorter loop, load imbalance: the trip count of the k loop depends on α)
- Solution:
  - Fuse both loops (α & k)
  - Protect the write operation on R[li] with a lock mechanism
  - Use a list of OpenMP locks, one for each block li

Slide 20: DMRG OpenMP Parallelization - parallel sparse MVM pseudocode (prologue loops)

    // store all block references in block_array
    ics = 0;
    for (α = 0; α < number_of_hamiltonian_terms; α++) {
      term = hamiltonian_terms[α];
      for (k = 0; k < term.number_of_blocks; k++) {
        block_array[ics] = &term[k];
        ics++;
      }
    }
    icsmax = ics;

    // set up per-thread temporary matrices and the lock list
    for (i = 0; i < MAX_NUMBER_OF_THREADS; i++)
      mm[i] = new Matrix;                        // temporary matrix, one per thread
    for (i = 0; i < MAX_NUMBER_OF_LOCKS; i++) {
      locks[i] = new omp_lock_t;
      omp_init_lock(locks[i]);
    }

Slide 21: DMRG OpenMP Parallelization - parallel sparse MVM pseudocode (main loop)

    // W: wavevector ; R: result
    #pragma omp parallel private(mytmat, li, ri, myid, ics)
    {
      myid = omp_get_thread_num();
      mytmat = mm[myid];                                   // thread-local temporary matrix
      #pragma omp for
      for (ics = 0; ics < icsmax; ics++) {                 // fused (α,k) loop
        li = block_array[ics]->left_index;
        ri = block_array[ics]->right_index;
        mytmat = block_array[ics]->b.transpose() * W[ri];
        omp_set_lock(locks[li]);                           // protect each block of the result vector R
        R[li] += block_array[ics]->a * mytmat;
        omp_unset_lock(locks[li]);
      }
    }

Slide 22: DMRG OpenMP Parallelization - compiler experiences
- The parallel code is compliant with the OpenMP standard
- However: NO system compiled the initial MVM implementation and produced correct results out of the box!
  - IBM xlc V6.0: OpenMP locks prevented omp for parallelization (fixed by IBM)
  - Intel efc V7 / ifc V7: the omp for loop is not distributed (IA64) or produces garbage (IA32); reported to Intel (does not work)
  - SUN forte7: does not allow omp critical inside C++ classes (does not work)
  - SGI MIPSpro 7.3.1.3m: complex data structures cannot be allocated inside omp parallel regions (workaround: allocate everything outside the loop)
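
Slides 23 and 25 switch the loop distribution via the OMP_SCHEDULE environment variable (STATIC vs. "dynamic,2"). That mechanism only applies to loops declared with schedule(runtime); a minimal, self-contained sketch of the idea (illustration only, not the DMRG code; the array and loop bounds are made up):

    #include <cstdio>

    // schedule(runtime) defers the scheduling choice to the OMP_SCHEDULE
    // environment variable, e.g.
    //   export OMP_SCHEDULE="static"      or      export OMP_SCHEDULE="dynamic,2"
    int main() {
        const int icsmax = 16;             // stand-in for the number of nonzero blocks
        double result[icsmax] = {0.0};

        #pragma omp parallel
        {
            #pragma omp for schedule(runtime)
            for (int ics = 0; ics < icsmax; ics++) {
                result[ics] = 2.0 * ics;   // stand-in for the per-block GEMMs
            }
        }

        for (int ics = 0; ics < icsmax; ics++)
            std::printf("%g ", result[ics]);
        std::printf("\n");
        return 0;
    }

With "dynamic,2" the runtime hands out chunks of two blocks at a time, which helps when the blocks have very different sizes; slide 25 reports that this choice reduces the overhead in the MVM from about 5 % to 2 %.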

Slide 23: DMRG OpenMP Parallelization - scalability on the SGI Origin (OMP_SCHEDULE=STATIC)
- OpenMP scales significantly better than parallel DGEMM
- Serial overhead in the parallel MVM is only about 5 %
- Parallelizing Davidson should be the next step

Slide 24: DMRG OpenMP Parallelization - scalability & performance: SGI Origin vs. IBM p690
- Scalability is pretty much the same on both systems
- A single-processor run and a run with OMP_NUM_THREADS=1 differ by approx. 5 % on IBM; hardly any difference on SGI
- Total performance: 1 Power4 ≈ 8 MIPS R14000; 8 Power4 CPUs deliver > 12 GFlop/s sustained!
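
These overhead numbers are consistent with a simple Amdahl's-law estimate (my arithmetic, using the runtime shares from slide 9): if only the MVM (~85 % of the runtime) is parallelized, the total speed-up is bounded by 1 / (1 - 0.85) ≈ 6.7, and if the whole Davidson part (~90 %) is parallelized the bound becomes 1 / (1 - 0.90) = 10, matching the ~6-8 and ~10 expectations on slide 17. The ~5 % serial overhead inside the MVM itself similarly caps the MVM-only speed-up at about 1 / 0.05 = 20.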

Slide 25: DMRG OpenMP Parallelization - further improvement of total performance/scalability
- Choose the best distribution strategy for the parallel for loop: OMP_SCHEDULE="dynamic,2" (reduces the serial overhead in the MVM to 2 %)
- Re-link with parallel LAPACK/BLAS to speed up the density-matrix diagonalization (DSYEV); good thing to do: OMP_NESTED=FALSE
- The HHM spends much more time in serial superblock generation than the Hubbard case
- To do: parallelization of Davidson & superblock generation!

Slide 26: Application - Peierls insulator (PI) / Mott insulator (MI) transition in the Holstein-Hubbard model
- Use the charge-structure factor S_c(π) to determine the transition point between PI and MI (U = 4)
- S_c(π) may depend significantly on the lattice size N
- Exact diagonalization: at most N = 10 sites; DMRG allows a study of finite-size effects (see next slide)
[Plot: S_c(π) vs. 1/N]
- At small lattice sizes ED and DMRG are in good agreement!

Slide 27: Application - Finite-Size Study of Spin & Charge Structure Factors in the half-filled 1D periodic HHM
- Parameters: U = 4, t = 1, ω_0 = 1, g^2 = 2 (the quantum critical point, see previous slide); 5 boson pseudo-sites
- DMRG can reach very large lattices (up to 32 sites)
- Strong support for the conjecture that S_c(π) and S_s(π) vanish at the quantum critical point in the thermodynamic limit (N -> ∞)

Slide 28: Application - PI/MI transition in the Holstein-Hubbard model: DMRG computational requirements compared to ED

    Method (parameters)               Resources                    Elapsed time   Memory
    ED   (N=8; matrix dim ~10^10)     1024 CPUs (Hitachi SR8000)   ~12 hours      600 GB
    DMRG (N=8;  m=600)                1 CPU (SGI Origin)           ~18 hours      2 GB
    DMRG (N=24; m=1000)               4 CPUs (SGI Origin)          ~72 hours      10 GB

Slide 29: Summary
- Existing DMRG code from quantum physics/chemistry; kernel: sparse matrix-vector multiply (MVM)
- Two approaches for shared-memory parallelization of the MVM were compared: parallel BLAS vs. OpenMP
- Fusing the inner and outer loops yields a scalable OpenMP implementation of the MVM routine with a parallel efficiency of 98 %
- Ground-state properties of the 1D Holstein-Hubbard model can be computed at high accuracy with minimal computational requirements (compared to ED)
- The SMP parallelization still has some optimization potential, but using more than 8 CPUs will presumably never be reasonable

Slide 30: Acknowledgement
- Our work is funded by the Bavarian Network for High Performance Computing (KONWIHR)