Multicore Parallelization of Determinant Quantum Monte Carlo Simulations

1 Multicore Parallelization of Determinant Quantum Monte Carlo Simulations
Andrés Tomás, Che-Rung Lee, Zhaojun Bai, Richard Scalettar (UC Davis)
SIAM Conference on Computational Science & Engineering, Reno, March 1st, 2011

2 Multicore / manycore Parallelization of Determinant Quantum Monte Carlo Simulations

3 Outline
- Introduction
- QUEST
- Hubbard model
- Determinant quantum Monte Carlo
- Green's function in QUEST
- Multicore parallelization
- GPU acceleration

4 QUantum Electron Simulation Toolbox (QUEST)
- Fortran 90 package that implements the Determinant Quantum Monte Carlo (DQMC) method for quantum electron simulations
- PETAMAT (DoE SciDAC project): Next Generation Multi-Scale Quantum Simulation Software for Strongly Correlated Materials

5 Hubbard model
1. Length parameters
   - N: spatial lattice size (number of electrons)
   - L: number of temperature discretization steps
2. Energy parameters
   - t: electron hopping between different atoms (kinetic energy)
   - U: strength of interactions between electrons (potential energy)
   - β: inverse temperature, $\beta = 1/(k_B T)$
3. $\Delta\tau = \beta/L$ connects length and energy scales

Reference: Numerical Methods for Quantum Monte Carlo Simulations of the Hubbard Model, in Multi-Scale Phenomena in Complex Fluids, Thomas Y. Hou, Chun Liu and Jian-Guo Liu, eds., pp. 1-110, Higher Education Press and World Scientific, 2009

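6 Hubbard matrix
\[
M_\sigma(h) =
\begin{bmatrix}
I & & & & B_{1,\sigma}(h_1) \\
-B_{2,\sigma}(h_2) & I & & & \\
 & -B_{3,\sigma}(h_3) & I & & \\
 & & \ddots & \ddots & \\
 & & & -B_{L,\sigma}(h_L) & I
\end{bmatrix}
\]
\[
B_{\ell,\sigma}(h_\ell) = e^{t \Delta\tau K}\, e^{\sigma \nu\, \mathrm{diag}(h_\ell)}
\]
$N = n_x \times n_y$, $K \in \mathbb{R}^{N \times N}$, $h = (h_1, h_2, \ldots, h_L) \in \mathbb{R}^{N \times L}$, $\sigma = \pm$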

7 Determinant Quantum Monte Carlo
1. Flip: $h'_{\ell,i} = -h_{\ell,i}$
2. Compute the Metropolis ratio
   \[ r_{\ell,i} = \frac{\det[M_+(h')]\,\det[M_-(h')]}{\det[M_+(h)]\,\det[M_-(h)]} \]
3. Acceptance condition (random number $r$): $h_{\ell,i} \leftarrow h'_{\ell,i}$ if $r \le r_{\ell,i}$
4. Update $i$ and $\ell$, go to step 1
5. Compute physical measurements
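The flip and acceptance steps above can be sketched in a few lines of C. This is a minimal illustration, not QUEST code: the function names `accept_flip` and `metropolis_step` are hypothetical, the field is stored as an `int` array of ±1 values, and the uniform random number is passed in by the caller so that the logic stays deterministic and easy to check.

```c
#include <assert.h>

/* Hypothetical helper: accept a proposed flip when the uniform random
   number u in [0, 1) does not exceed the Metropolis ratio. */
int accept_flip(double ratio, double u)
{
    assert(u >= 0.0 && u < 1.0);
    return u <= ratio;
}

/* One Metropolis step on field entry h[idx], which holds +1 or -1.
   Returns 1 if the flip was accepted and applied, 0 otherwise. */
int metropolis_step(int *h, int idx, double ratio, double u)
{
    assert(h != 0 && (h[idx] == 1 || h[idx] == -1));
    if (accept_flip(ratio, u)) {
        h[idx] = -h[idx];   /* h_{l,i} <- h'_{l,i} = -h_{l,i} */
        return 1;
    }
    return 0;
}
```

A ratio of 1 or more is always accepted, because `u` is strictly below 1.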

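8 Green's function
\[
G^\sigma(h) = \left(I + B_{L,\sigma}(h_L)\, B_{L-1,\sigma}(h_{L-1}) \cdots B_{1,\sigma}(h_1)\right)^{-1}
\]
The Metropolis ratio can be computed from $G^\sigma(h)$, for example, $r_{11} = d_+ d_-$, where for $\sigma = \pm$
\[
d_\sigma = 1 + \left(e^{-2\sigma\nu h_{1,1}} - 1\right)\left(1 - G^\sigma_{1,1}(h)\right)
\]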
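The expression for $d_\sigma$ can be recovered from the matrix determinant lemma. The derivation below is a sketch under two assumptions: that $\det M_\sigma(h) = \det(I + B_{L,\sigma} \cdots B_{1,\sigma})$, and that flipping $h_{1,1}$ only rescales the first column of $B_{1,\sigma}$. The symbols $A$, $A'$ and $\alpha$ are introduced here for illustration.

```latex
\begin{align*}
A &= B_{L,\sigma}(h_L) \cdots B_{1,\sigma}(h_1), \qquad
G = (I + A)^{-1}, \qquad
\alpha = e^{-2\sigma\nu h_{1,1}} - 1, \\
A' &= A\,(I + \alpha\, e_1 e_1^T) \quad \text{(after flipping } h_{1,1}\text{)}, \\
\frac{\det(I + A')}{\det(I + A)}
  &= \det\!\left(I + \alpha\, G A\, e_1 e_1^T\right)
   = 1 + \alpha\, e_1^T G A\, e_1 \\
  &= 1 + \alpha\, e_1^T (I - G)\, e_1
   = 1 + \alpha\,(1 - G_{1,1}) = d_\sigma .
\end{align*}
```

The step $GA = I - G$ follows from $G(I + A) = I$, so only the single entry $G_{1,1}$ is needed to evaluate the ratio.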

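9 Green's function
Flipping the spatial site $h_{\ell,i+1}$ is a rank-1 update of $G$, for example,
\[
G^\sigma(h) \leftarrow G^\sigma(h) - \frac{\alpha_{1,\sigma}}{r_{11}}\, u_\sigma w_\sigma^T
\]
where
\[
u_\sigma = (I - G^\sigma(h))\, e_1, \qquad
w_\sigma = (G^\sigma(h))^T e_1, \qquad
\alpha_{1,\sigma} = e^{-2\sigma\nu h_{1,1}} - 1
\]
Flipping the temporal site $h_{\ell+1,1}$ is a similarity transformation of $G$, for example,
\[
G^\sigma(h) \leftarrow B_{1,\sigma}^{-1}(h_1)\, G^\sigma(h)\, B_{1,\sigma}(h_1)
\]
Also used for physical measurements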
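The rank-1 update can be written as a short C routine. This is a sketch, not the QUEST implementation: the name `green_rank1_update` is hypothetical, $G$ is assumed to be stored column-wise, and the caller supplies $\alpha_{1,\sigma}$. The sketch divides by the per-spin factor $d_\sigma = 1 + \alpha_{1,\sigma}(1 - G^\sigma_{1,1})$, which is what the Sherman-Morrison formula gives for a single spin; reading the denominator on the slide this way is an assumption of this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Rank-1 update of the n x n Green's function G (column-major) after
   flipping the first site. alpha = exp(-2*sigma*nu*h_11) - 1 is computed
   by the caller. Returns d = 1 + alpha * (1 - G_11), the Sherman-Morrison
   denominator (assumed to be the per-spin factor of the Metropolis ratio). */
double green_rank1_update(int n, double *G, double alpha)
{
    assert(n > 0 && G != NULL);
    double d = 1.0 + alpha * (1.0 - G[0]);
    double *u = malloc((size_t)n * sizeof *u);
    double *w = malloc((size_t)n * sizeof *w);
    assert(u != NULL && w != NULL);

    for (int i = 0; i < n; i++) {
        u[i] = (i == 0 ? 1.0 : 0.0) - G[i];  /* u = (I - G) e_1: e_1 minus first column */
        w[i] = G[0 + i * n];                 /* w = G^T e_1: first row of G */
    }
    /* G <- G - (alpha / d) * u * w^T */
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            G[i + j * n] -= (alpha / d) * u[i] * w[j];

    free(u);
    free(w);
    return d;
}
```

The update costs O(n²) operations, compared with O(n³) for recomputing the inverse from scratch, which is why the algorithm performs many of these small updates between full recomputations.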

10 DQMC in a nutshell

    for s = 1, 2, 3, ...
        for i = 1, 2, 3, ..., L
            if i mod ℓ = 0
                $G \leftarrow (I + B_{i,\sigma}(h_i) \cdots B_{1,\sigma}(h_1)\, B_{L,\sigma}(h_L) \cdots B_{i+1,\sigma}(h_{i+1}))^{-1}$
            else
                $G \leftarrow B_{1,\sigma}^{-1}(h_1)\, G^\sigma(h)\, B_{1,\sigma}(h_1)$
            end
            for j = 1, 2, 3, ..., N
                Compute $G^\sigma_{j,j}(h)$
                Update $h$ and $G^\sigma(h)$ if accepted
            end
        end
        Compute physical measurements
    end

Huge number of consecutive operations with small matrices

12 Stratification method
Green's function $G = (I + B_L B_{L-1} \cdots B_2 B_1)^{-1}$

    $Q_0 = D_0 = T_0 = I$
    for i = 1, 2, ..., L/k
        $C_i = B_{(i-1)k+1} B_{(i-1)k+2} \cdots B_{ik}$          (DGEMM)
        $C_i = (C_i Q_{i-1}) D_{i-1}$                            (DORMQR)
        $C_i = Q_i R_i P_i^T$                                    (DGEQP3)
        $D_i = \mathrm{diag}(R_i)$
        $T_i = (D_i^{-1} R_i)(P_i^T T_{i-1})$                    (DTRMM)
    end
    $G = T_{L/k}^{-1} (Q_{L/k}^T T_{L/k}^{-1} + D_{L/k})^{-1} Q_{L/k}^T$    (DGETRF, DGETRI, ...)
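The last line of the stratification method follows from a simple factorization. The sketch below assumes that the loop leaves the product in the form $B_L \cdots B_1 = Q D T$, with $Q = Q_{L/k}$ orthogonal, $D = D_{L/k}$ diagonal and $T = T_{L/k}$ invertible.

```latex
\begin{align*}
I + Q D T &= Q\,\left(Q^T T^{-1} + D\right)\,T, \\
G = (I + Q D T)^{-1} &= T^{-1}\,\left(Q^T T^{-1} + D\right)^{-1} Q^T .
\end{align*}
```

Written this way, the widely varying scales held in $D$ are added to $Q^T T^{-1}$ only once, just before the single inversion, which is presumably why the diagonal is kept as a separate factor throughout the loop.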

13 Stratification performance
N = 32 × 32, L = 48, k = 12
NERSC Carver (2 quad-core Intel Nehalem 2.67 GHz)
[Table, values not preserved. Two configurations: MKL, and netlib + MKL BLAS. Columns for each: Routine, #, time (1 core), time (8 cores), speedup. Rows: DGEMM, DTRMM, DGETRF, DGETRI, DGEQP3, DORMQR, Overall]
- Avoid QR with pivoting
- Improve the performance of the chain of matrix multiplications

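14 Structured Orthogonal Factorization (SOF)
Green's function $G = (I + B_L B_{L-1} \cdots B_2 B_1)^{-1}$

    $C_1 = I$
    $A_1 = B_1 B_2 \cdots B_k$                                   (DGEMM)
    for i = 2, 3, ..., L/k
        $D_i = B_{(i-1)k+1} B_{(i-1)k+2} \cdots B_{ik}$          (DGEMM)
        $\begin{bmatrix} C_{i-1} \\ D_i \end{bmatrix} = \begin{bmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{bmatrix} \begin{bmatrix} R_i \\ 0 \end{bmatrix}$    (DGEQRF, DORGQR)
        $A_i = -Q_{12}^T A_{i-1}$                                (DGEMM)
        $C_i = Q_{22}^T$
    end
    $G = (C_{L/k} + A_{L/k})^{-1} C_{L/k}$                       (DGEQRF, DORMQR, DGEMM)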
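The SOF recursion can be checked with a short derivation. This is a sketch under stated assumptions: $P_i$ (a symbol introduced here) denotes the product of the first $i$ blocks, accumulated as $P_i = D_i P_{i-1}$, and the loop maintains the invariant $P_i = C_i^{-1} A_i$. The sign of $A_i$ is the one this derivation produces for a QR factorization of $[C_{i-1};\, D_i]$; the slide may place the sign elsewhere.

```latex
\begin{align*}
\begin{bmatrix} Q_{11}^T & Q_{21}^T \\ Q_{12}^T & Q_{22}^T \end{bmatrix}
\begin{bmatrix} C_{i-1} \\ D_i \end{bmatrix}
= \begin{bmatrix} R_i \\ 0 \end{bmatrix}
&\;\Longrightarrow\; Q_{12}^T C_{i-1} + Q_{22}^T D_i = 0, \\
D_i C_{i-1}^{-1} &= -\,Q_{22}^{-T} Q_{12}^T, \\
P_i = D_i C_{i-1}^{-1} A_{i-1}
 &= \left(Q_{22}^T\right)^{-1} \left(-Q_{12}^T A_{i-1}\right) = C_i^{-1} A_i, \\
G = \left(I + P_{L/k}\right)^{-1}
 &= \left(C_{L/k} + A_{L/k}\right)^{-1} C_{L/k} .
\end{align*}
```

No explicit inverse of $C_{i-1}$ is ever formed: each step only multiplies by blocks of an orthogonal matrix, and the only inversion is the final solve.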

15 SOF performance
N = 32 × 32, L = 48, k = 12
NERSC Carver (2 quad-core Intel Nehalem 2.67 GHz)
[Table, values not preserved. Columns: Routine, #, time (1 core), time (8 cores), speedup. Rows: DGEMM, DTRSM, DORMQR, DGEQRF, DORGQR, Overall]
- Better scalability but slightly slower than stratification

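16 Improved SOF
Green's function $G = (I + B_L B_{L-1} \cdots B_2 B_1)^{-1}$

    $C_1 = I$
    $A_1 = B_1 B_2 \cdots B_k$                                   (DGEMM)
    for i = 2, 3, ..., L/k
        $D_i = B_{(i-1)k+1} B_{(i-1)k+2} \cdots B_{ik}$          (DGEMM)
        $\begin{bmatrix} C_{i-1} \\ D_i \end{bmatrix} = \left( I - \begin{bmatrix} V_u \\ V_d \end{bmatrix} T \begin{bmatrix} V_u \\ V_d \end{bmatrix}^T \right) \begin{bmatrix} R_i \\ 0 \end{bmatrix}$    (DGEQRF, DLARFT)
        $A_i = V_d T^T V_u^T A_{i-1}$                            (DGEMM, DTRMM)
        $C_i = I - V_d T^T V_d^T$                                (DGEMM, DTRMM)
    end
    $G = (C_{L/k} + A_{L/k})^{-1} C_{L/k}$                       (DGEQRF, DORMQR, DGEMM)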

17 Improved SOF performance
N = 32 × 32, L = 48, k = 12
NERSC Carver (2 quad-core Intel Nehalem 2.67 GHz)
[Table, values not preserved. Columns: Routine, #, time (1 core), time (8 cores), speedup. Rows: DGEMM, DTRMM, DTRSM, DORMQR, DGEQRF, DLARFT, Overall]

18 Compute G performance
L = 48, k = 12
NERSC Carver (2 quad-core Intel Nehalem 2.67 GHz)
[Figure not preserved]
- $B_i B_{i+1} \cdots B_{i+k-1}$ takes more than half of the time

19 CUDA parallel model for Tesla C2050
- Improved double precision performance
- 14 multiprocessors (up to 448 threads)
- Blocks of threads
  - All threads share the same code (kernel)
  - Up to 8 blocks are scheduled to each processor
  - Threads are executed in parallel via warps (32), synchronized at instruction level

20 CUDA memory model for Tesla C2050
- Memory hierarchy
  - Register: private to each thread (128 KB)
  - Shared: private to each processor (up to 48 KB)
  - Global: shared among all threads and CPU (3 GB)
- Cache memories
  - L1: up to 48 KB per processor
  - L2: 768 KB
  - Explicit read-only: texture (8 KB) and constant (64 KB)

21 CUBLAS
Chain of matrix multiplications $A \leftarrow B_1 B_2 \cdots B_k A$, with $B_i = h_i B$ (diagonal $h_i$)

    cublasSetMatrix
    for i = 1, 2, ..., k
        $C \leftarrow B A$                                 (cublasDgemm)
        for j = 1, 2, ..., n
            $C_{j,1:n} \leftarrow h_{i,j}\, C_{j,1:n}$     (cublasDscal)
        end
        $A \leftarrow C$                                   (cublasDcopy)
    end
    cublasGetMatrix
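A plain C version of this loop is useful as a CPU reference when checking GPU results. The sketch below is not the authors' code: `chain_multiply` is a hypothetical name, matrices are assumed to be n × n and stored column-wise, the k diagonals are packed one after another in `h`, and a naive triple loop stands in for `cublasDgemm`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* CPU reference for the chain A <- (h_i B) A, i = 1..k.
   B and A are n x n, column-major; h holds k diagonals of length n. */
void chain_multiply(int n, int k, const double *B, const double *h, double *A)
{
    assert(n > 0 && k >= 0 && B != NULL && h != NULL && A != NULL);
    double *C = malloc((size_t)n * n * sizeof *C);
    assert(C != NULL);

    for (int s = 0; s < k; s++) {
        /* C = B * A (the role played by cublasDgemm) */
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int p = 0; p < n; p++)
                    sum += B[i + p * n] * A[p + j * n];
                C[i + j * n] = sum;
            }
        /* Scale row i of C by h[s][i] (one cublasDscal call per row) */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                C[i + j * n] *= h[s * n + i];
        /* A = C (the role played by cublasDcopy) */
        memcpy(A, C, (size_t)n * n * sizeof *A);
    }
    free(C);
}
```

The row scaling walks memory with stride n, and on the GPU it needs n separate `cublasDscal` calls per step, which is the overhead a fused scale-and-copy kernel can remove.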

22 CUBLAS + CUDA
Chain of matrix multiplications $A \leftarrow B_1 B_2 \cdots B_k A$, with $B_i = h_i B$ (diagonal $h_i$)

    cublasSetMatrix
    for i = 1, 2, ..., k
        $C \leftarrow B A$          (cublasDgemm)
        $A \leftarrow h_i C$        (scalerow)
    end
    cublasGetMatrix

23 CUBLAS + CUDA

    __global__ void scalerow_kernel(int n, double *h, double *C, double *A)
    {
        /* One thread per row: i is the global thread index. */
        int j, i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            double f = h[i];    /* scale factor for row i, read once */
            for (j = 0; j < n; j++)
                A[i + j * n] = f * C[i + j * n];
        }
    }

- Each thread computes one row of A
- h is read once
- A and C are stored column-wise so memory access is coalesced
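To check the kernel's output on the host, a serial reference and a launch-size helper are enough. Both functions below are hypothetical additions for illustration: `scalerow_reference` repeats the kernel body with an explicit loop over rows, and `grid_size` uses the usual round-up division so that the blocks cover all n rows.

```c
#include <assert.h>

/* Serial counterpart of scalerow_kernel: A = diag(h) * C,
   with A and C stored column-wise as n x n matrices. */
void scalerow_reference(int n, const double *h, const double *C, double *A)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {       /* one GPU thread per value of i */
        double f = h[i];
        for (int j = 0; j < n; j++)
            A[i + j * n] = f * C[i + j * n];
    }
}

/* Number of thread blocks needed so that blocks * block_size >= n. */
int grid_size(int n, int block_size)
{
    assert(n >= 0 && block_size > 0);
    return (n + block_size - 1) / block_size;
}
```

With n = 1024 rows and 256 threads per block, `grid_size` returns 4. When n is not a multiple of the block size, the last block has idle threads, which is why the kernel guards its work with `if (i < n)`.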

24 GPU performance
L = 48, k = 12
NERSC Carver + Tesla C2050 (double precision)
[Figure not preserved: chain of matrix multiplications $A \leftarrow B_1 B_2 \cdots B_k A$]

26 GPU performance
L = 48, k = 12
NERSC Carver + Tesla C2050 (double precision)
[Figure not preserved: Green's function (improved SOF), $G = (I + B_L B_{L-1} \cdots B_2 B_1)^{-1}$]

27 Conclusions
How to accelerate a huge number of consecutive operations with small matrices?
- Multicore: MKL BLAS + netlib LAPACK
- GPU acceleration: CUBLAS + small kernel
Future work
- Fully implement SOF on the GPU (MAGMA QR, TSQR, ...)
Acknowledgments
- NERSC computing resources
