Simulating spin models on GPU. Lecture 4: Advanced Techniques


Simulating spin models on GPU
Lecture 4: Advanced Techniques

Martin Weigel
Applied Mathematics Research Centre, Coventry University, Coventry, United Kingdom
and Institut für Physik, Johannes Gutenberg-Universität Mainz, Germany

IMPRS School 2012: GPU Computing, Wroclaw, Poland, November 1, 2012

Ising model: Measurements

Consider the Metropolis kernel for the 2D Ising model discussed before:

GPU code v3 - kernel

    __global__ void metro_checkerboard_three(spin_t *s, int *ranvec, int offset)
    {
      int n = blockDim.x*blockIdx.x + threadIdx.x;
      int cur = blockDim.x*blockIdx.x + threadIdx.x + offset*(N/2);
      int north = cur + (1-2*offset)*(N/2);
      int east = ((north+1)%L) ? north+1 : north-L+1;
      int west = (north%L) ? north-1 : north+L-1;
      int south = (n - (1-2*offset)*L + N/2)%(N/2) + (1-offset)*(N/2);
      int ide = s[cur]*(s[west]+s[north]+s[east]+s[south]);
      if(fabs(RAN(ranvec[n])*4.656612e-10f) < tex1Dfetch(boltzT, ide+2*DIM)) {
        s[cur] = -s[cur];
      }
    }

How can measurements of the internal energy, say, be incorporated? Local changes can be tracked.

Ising model: Measurements (cont'd)

Energy changes:

    __global__ void metro_checkerboard_three(spin_t *s, int *ranvec, int offset)
    {
      int n = blockDim.x*blockIdx.x + threadIdx.x;
      ...
      int south = (n - (1-2*offset)*L + N/2)%(N/2) + (1-offset)*(N/2);
      int ide = s[cur]*(s[west]+s[north]+s[east]+s[south]);
      int ie = 0;
      if(fabs(RAN(ranvec[n])*4.656612e-10f) < tex1Dfetch(boltzT, ide+2*DIM)) {
        s[cur] = -s[cur];
        ie -= 2*ide;           // energy change of the accepted local flip
      }
    }

Ising model: Measurements (cont'd)

Access pattern for reduction: butterfly sum.

[Diagram: tree-like reduction combining the per-thread values 0..7 of a block in ld(THREADS) steps.]

    __shared__ int delta[THREADS];
    delta[n] = ie;
    for(int stride = THREADS>>1; stride > 0; stride >>= 1) {
      __syncthreads();
      if(n < stride)
        delta[n] += delta[n+stride];
    }
    if(n == 0)
      result[blockIdx.y*GRIDL + blockIdx.x] += delta[0];
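The per-block partial sums written into result[] still have to be combined into a single number. A minimal host-side sketch of that final step, with hypothetical names (d_result, the GRIDL value) and the kernel launches elided, assuming one int partial sum per block as above:

    // Host-side sketch: finish the energy measurement by summing the
    // per-block partial results written by the kernel.
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    const int GRIDL   = 128;             // blocks per grid dimension (assumption)
    const int NBLOCKS = GRIDL * GRIDL;   // one partial sum per block

    int main()
    {
      int *d_result;
      cudaMalloc(&d_result, NBLOCKS * sizeof(int));
      cudaMemset(d_result, 0, NBLOCKS * sizeof(int));

      // ... launch metro_checkerboard_three for both checkerboard
      //     sublattices here ...

      std::vector<int> h_result(NBLOCKS);
      cudaMemcpy(h_result.data(), d_result, NBLOCKS * sizeof(int),
                 cudaMemcpyDeviceToHost);
      long long dE = 0;                  // total energy change of the sweep
      for (int b = 0; b < NBLOCKS; ++b)
        dE += h_result[b];
      std::printf("Delta E = %lld\n", dE);
      cudaFree(d_result);
      return 0;
    }

Keeping a running total E += dE on the host avoids a full O(N) recomputation of the energy after every sweep.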

Ising spin glass

Recall the Hamiltonian: $\mathcal{H} = -\sum_{\langle i,j \rangle} J_{ij} s_i s_j$, where the $J_{ij}$ are quenched random variables. For reasonable equilibrium results, an average over thousands of disorder realizations is necessary.

- same domain decomposition (checkerboard)
- slightly bigger effort due to non-constant couplings
- higher performance due to larger independence?
- very simple to combine with parallel tempering

Spin glass: performance

[Plot: time per spin flip t_flip in ns versus the number of replicas N_replica for CPU, Tesla C1060 and GTX 480.]

Spin glasses: continued

Seems to work well, with 15 ns per spin flip on CPU and 70 ps per spin flip on GPU, but not better than the ferromagnetic Ising model. Further improvement: use multi-spin coding.

- Synchronous multi-spin coding: different spins of a single configuration in one word.
- Asynchronous multi-spin coding: spins from different realizations in one word.

This brings us down to about 2 ps per spin flip.

Implementation

    for(int i = 0; i < SWEEPS_LOCAL; ++i) {
      float r = RAN(ranvecS[n]);
      if(r < boltzd[4])                      // below the smallest Boltzmann
        ss(x1,y) = ~ss(x1,y);                // factor: flip in all replicas
      else {
        // bit k is set where the bond is unsatisfied in replica k
        p1 = JSx(x1m,y) ^ ss(x1,y) ^ ss(x1m,y);
        p2 = JSx(x1,y)  ^ ss(x1,y) ^ ss(x1p,y);
        p3 = JSy(x1,ym) ^ ss(x1,y) ^ ss(x1,ym);
        p4 = JSy(x1,y)  ^ ss(x1,y) ^ ss(x1,yp);
        if(r < boltzd[2]) {
          ido = p1 | p2 | p3 | p4;           // >= 1 unsatisfied bond
          ss(x1,y) = ido ^ ss(x1,y);
        } else {
          ido1 = p1 & p2; ido2 = p1 ^ p2;
          ido3 = p3 & p4; ido4 = p3 ^ p4;
          ido = ido1 | ido3 | (ido2 & ido4); // >= 2 unsatisfied bonds
          ss(x1,y) = ido ^ ss(x1,y);
        }
      }
      __syncthreads();
      // ... the same update is repeated for the second sublattice
      //     site (x2,y), followed by another __syncthreads() ...
    }
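The bit-parallel logic is easier to see in isolation. A minimal sketch of the asynchronous idea (the names here are hypothetical, not the slide's macros): bit k of every word belongs to replica k, an XOR with the coupling sign flags an unsatisfied bond, and simple carry logic classifies all 32 replicas at once:

    typedef unsigned int word_t;        // bit k = sign bit of replica k

    // Masks of replicas with at least one and at least two unsatisfied
    // bonds around a site (sketch with hypothetical names).
    __host__ __device__ void unsat_masks(word_t s, const word_t sn[4],
                                         const word_t J[4],
                                         word_t *ge1, word_t *ge2)
    {
      word_t p[4];
      for (int i = 0; i < 4; ++i)
        p[i] = J[i] ^ s ^ sn[i];        // 1 = bond i unsatisfied
      *ge1 = p[0] | p[1] | p[2] | p[3]; // >= 1 unsatisfied bond
      word_t c01 = p[0] & p[1], s01 = p[0] ^ p[1];
      word_t c23 = p[2] & p[3], s23 = p[2] ^ p[3];
      *ge2 = c01 | c23 | (s01 & s23);   // >= 2 unsatisfied bonds
    }

Replicas with at least two unsatisfied bonds have a flip energy change of zero or less and always flip; comparing a single random number against the two Boltzmann thresholds then selects which mask is applied, exactly as in the boltzd[2]/boltzd[4] branches of the kernel above.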

Janus

JANUS: a modular, massively parallel and reconfigurable FPGA-based computing system.

Costs:
- Janus: 256 units, total cost about 700,000 Euros.
- Same performance with GPUs: 64 PCs (2,000 Euros each) with 2 GTX 295 cards each (500 Euros per card), about 200,000 Euros.
- Same performance with CPUs only (assuming a speedup of 50): 800 blade servers with two dual quad-core sub-units (3,500 Euros each), about 2,800,000 Euros.

Critical configuration

[Figure: snapshot of a critical spin configuration.]

Cluster algorithms

Beat critical slowing down: spatial correlations close to a continuous phase transition impede decorrelation with local spin flips. One needs to update non-local variables of characteristic size, i.e., percolating at the transition, that have the right geometric properties (fractal dimensions etc.).

[Plot: integrated autocorrelation time $\tau_{\rm int}(L) = A L^z$ versus system size L for local updates.]

These are strong requirements, only fulfilled for a few algorithms, most notably for the Potts, O(n) and related lattice models.

Cluster algorithms are needed for efficient equilibrium simulation of spin models at criticality:

1. Activate bonds between like spins with probability $p = 1 - e^{-2\beta J}$.
2. Construct (Swendsen-Wang) spin clusters from domains connected by active bonds.
3. Flip independent clusters with probability 1/2.
4. Goto 1.

Steps 1 and 3 are local and can be efficiently ported to GPU. What about step 2? Domain decomposition into tiles:

- labeling inside of domains: Hoshen-Kopelman, breadth-first search, self-labeling, union-find algorithms
- relabeling across domains: self-labeling, hierarchical approach, iterative relaxation

Swendsen-Wang update

[Figures: successive snapshots of a Swendsen-Wang cluster update.]
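For reference, a minimal single-threaded sketch of one full Swendsen-Wang update of the 2D Ising model, implementing steps 1-3 above directly (this is not the GPU code of these slides; rand01() and the union-find helper are assumptions):

    #include <cmath>
    #include <cstdlib>
    #include <vector>

    static double rand01() { return std::rand() / (RAND_MAX + 1.0); }

    // union-find root lookup with path halving
    static int find(std::vector<int> &p, int i)
    {
      while (p[i] != i) { p[i] = p[p[i]]; i = p[i]; }
      return i;
    }

    void swendsen_wang_sweep(std::vector<int> &s, int L, double beta, double J)
    {
      const int N = L * L;
      const double padd = 1.0 - std::exp(-2.0 * beta * J); // bond probability
      std::vector<int> parent(N);
      for (int i = 0; i < N; ++i) parent[i] = i;

      // steps 1+2: activate bonds between like spins, merge clusters
      for (int y = 0; y < L; ++y)
        for (int x = 0; x < L; ++x) {
          int i = y * L + x;
          int right = y * L + (x + 1) % L;
          int up = ((y + 1) % L) * L + x;
          if (s[i] == s[right] && rand01() < padd)
            parent[find(parent, i)] = find(parent, right);
          if (s[i] == s[up] && rand01() < padd)
            parent[find(parent, i)] = find(parent, up);
        }

      // step 3: flip every cluster with probability 1/2
      std::vector<int> flip(N, -1);
      for (int i = 0; i < N; ++i) {
        int r = find(parent, i);
        if (flip[r] < 0) flip[r] = (rand01() < 0.5);
        if (flip[r]) s[i] = -s[i];
      }
    }

On the GPU, it is exactly the cluster construction in the middle that has to be replaced by the tiled labeling schemes discussed next.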

Label activation

    __global__ void prepare_bonds_small(spin_t *s, bond_t *bond,
                                        int *inclus, int *ranvec)
    {
      int x = blockIdx.x*(L/gridDim.x) + threadIdx.x;
      int y = blockIdx.y*(L/gridDim.y) + threadIdx.y;
      bond[y*L+x]   = (spin(x,y) == spin((x==L-1) ? 0 : x+1, y)
                       && RAN(ranvec[y*L+x]) > BOLTZ) ? 1 : 0;
      bond[y*L+x+N] = (spin(x,y) == spin(x, (y==L-1) ? 0 : y+1)
                       && RAN(ranvec[y*L+x]) > BOLTZ) ? 1 : 0;
      inclus[y*L+x] = y*L+x;    // each site starts with its own label
    }

- one thread per site
- each thread updates two bonds
- low load, therefore updating more bonds per thread is beneficial

BFS, or "Ants in the Labyrinth"

[Diagram: 8x8 tile in which a breadth-first wave propagates one cluster label.]

- only wave-front vectorization would be possible
- many idle threads

BFS code

Breadth-first search on tiles:

    __global__ void localbfs(int *inclus, bond_t *bond, int *next)
    {
      int clus = 0, clus_leng, tested;
      int xmin = blockIdx.x*(L/gridDim.x), xmax = (blockIdx.x+1)*(L/gridDim.x);
      int ymin = blockIdx.y*(L/gridDim.y), ymax = (blockIdx.y+1)*(L/gridDim.y);
      int offset = ymin*L + xmin;

      for(int y = ymin; y < ymax; ++y) {
        for(int x = xmin; x < xmax; ++x) {
          if(inclus[y*L+x] <= clus) continue;   // site already labeled
          clus_leng = tested = 0;
          clus = next[offset] = y*L+x;          // seed a new cluster
          while(tested <= clus_leng) {
            int xx = next[offset+tested] % L;
            int yy = next[offset+tested] / L;
            int k1 = yy*L + (xx+1);             // right neighbor
            if(xx < xmax-1 && inclus[k1] > clus && bond[yy*L+xx]) {
              ++clus_leng; next[offset+clus_leng] = k1; inclus[k1] = clus;
            }
            k1 = yy*L + (xx-1);                 // left neighbor
            if(xx > xmin && inclus[k1] > clus && bond[k1]) {
              ++clus_leng; next[offset+clus_leng] = k1; inclus[k1] = clus;
            }
            // ... other directions ...
            ++tested;
          }
        }
      }
    }

Self-labeling

[Diagram: 8x8 tile in which every site iteratively adopts the minimum label among its connected neighbors.]

The effort is $O(L^3)$ at the critical point, but it can be vectorized with $O(L^2)$ threads.

Self-labeling code

Self-labeling on tiles, atomic version:

    do {
      changed = 0;
      if(bnd[me] && in[me] != in[right]) {
        atomicMin(in+me, in[right]);
        atomicMin(in+right, in[me]);
        changed = 1;
      }
      if(bnd[me+BLOCKL*BLOCKL] && in[me] != in[up]) {
        atomicMin(in+me, in[up]);
        atomicMin(in+up, in[me]);
        changed = 1;
      }
    } while(__syncthreads_or(changed));

- atomic operations guarantee no interference by other threads
- fast intrinsic implementations: atomicAdd(), atomicMin(), atomicInc(), and others
- __syncthreads_or() synchronizes the threads of a block and returns true if the expression evaluates to true for any of them

Self-labeling on tiles, non-deterministic version:

    do {
      changed = 0;
      if(bnd[me] && in[me] != in[right]) {
        in[me] = in[right] = min(in[me], in[right]);
        changed = 1;
      }
      if(bnd[me+BLOCKL*BLOCKL] && in[me] != in[up]) {
        in[me] = in[up] = min(in[me], in[up]);
        changed = 1;
      }
    } while(__syncthreads_or(changed));

- synchronization is not strictly necessary here
- the non-deterministic version is faster
- it might need a couple more iterations

Union-find

[Diagram: 8x8 tile labeled via union-find trees.]

Tree structure with two optimizations:
- balanced trees
- path compression
Root finding and cluster union then become essentially O(1) operations.

Comparison

[Plot: local labeling time $T_p^{\rm local}/L^2$ in ns versus tile size B for breadth-first search, union-and-find and self-labeling, with fitted power laws indicated.]
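As a sketch of how union-find can be made safe for concurrent GPU threads, one common lock-free approach (an assumption here, not necessarily the exact scheme of these slides) replaces balanced trees by label ordering and performs the union with atomicCAS:

    // Lock-free union-find primitives (sketch). Labels live in a global
    // array f[]; f[i] == i marks a root.
    __device__ int find_root(int *f, int i)
    {
      while (f[i] != i) {
        int g = f[f[i]];
        f[i] = g;          // path halving; racy but benign, since
        i = g;             // pointers only ever move closer to a root
      }
      return i;
    }

    __device__ void unite(int *f, int a, int b)
    {
      while (true) {
        a = find_root(f, a);
        b = find_root(f, b);
        if (a == b) return;                      // already merged
        if (a > b) { int t = a; a = b; b = t; }  // keep the smaller label
        // succeeds only if b is still a root; otherwise retry
        if (atomicCAS(&f[b], b, a) == b) return;
      }
    }

Because roots are only ever re-pointed to smaller labels, concurrent path halving never invalidates the tree, and a failed atomicCAS simply retries with the updated roots.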

Label consolidation

Label relaxation:
- boundary sites repeatedly adopt the minimum label among their connected neighbors until a fixed point is reached
- run-time estimate (n concurrent threads, m-fold memory parallelism):
  $T_p^{\rm global} \sim C_{\rm rel}\,\dfrac{L^2\,B}{\min(L^2,n)\,\min(B,m)}\,{\rm ld}(\cdots)$

Hierarchical sewing:
- neighboring tiles are merged pairwise on successively coarser levels

[Diagram: hierarchical sewing of tile labels over levels 1, 2, 3.]

- run-time estimate:
  $T_p^{\rm global} = C_{\rm sew}\,L^2\left[\dfrac{1}{nB} + \dfrac{4}{5}\,\dfrac{1}{nL}\right]$

Performance

[Plot: ns per spin flip versus L (32 to 2048) for self-labeling/global, union-find/hierarchical, union-find/relaxation and self-labeling/relaxation, compared to the CPU.]

[Plot: breakdown of the GTX 480 relaxation run time into flip, relax, setup, boundary label and local bonds contributions, versus L.]

[Plot: speedup versus L (32 to 8192) for GTX 580, GTX 480, and Tesla M2070 with ECC on and off.]
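A minimal sketch of one global relaxation pass over the tile boundaries (the data layout with an explicit list of boundary site pairs is an assumption, not the slides' exact scheme); the host relaunches it until the changed flag stays zero:

    // One label-relaxation pass: every active bond crossing a tile
    // boundary propagates the smaller of the two labels.
    __global__ void relax_boundaries(int *label, const bond_t *bond,
                                     const int2 *pairs, int npairs,
                                     int *changed)
    {
      int k = blockDim.x*blockIdx.x + threadIdx.x;
      if (k >= npairs) return;
      if (!bond[k]) return;                 // boundary bond not active
      int a = pairs[k].x, b = pairs[k].y;   // sites on either side
      int la = label[a], lb = label[b];
      if (la == lb) return;
      int m = min(la, lb);
      atomicMin(&label[a], m);              // labels only ever decrease,
      atomicMin(&label[b], m);              // so repeated passes converge
      *changed = 1;
    }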

Generalized ensembles

Instead of simulating the canonical distribution,

$P_K(E) = \dfrac{1}{Z_K}\,\Omega(E)\,e^{-KE}$,

consider using a more general distribution,

$P_{\rm muca}(E) = \dfrac{\Omega(E)/W(E)}{Z_{\rm muca}} = \dfrac{\Omega(E)\,e^{-\omega(E)}}{Z_{\rm muca}}$,

engineered to overcome barriers, improve sampling speed and extend the reweighting range.

[Plot: canonical energy distribution P(E).]

Choice of weights

To overcome barriers, we need to broaden P(E), in the extremal case to a constant distribution,

$P_{\rm muca}(E) = Z_{\rm muca}^{-1}\,\Omega(E)/W(E) = Z_{\rm muca}^{-1}\,e^{S(E)-\omega(E)} \overset{!}{=} {\rm const}$,

where $S(E) = \ln \Omega(E)$ is the microcanonical entropy. Under these assumptions, $W(E) = \Omega(E)$ is optimal, i.e., we again desire to estimate the density of states. This is not known a priori, so (again) we use the histogram estimator

$\hat\Omega(E) = Z_{\rm muca}\,[\hat H_{\rm muca}(E)/N]\,e^{\omega(E)}$.

Canonical averages can be recovered at any time by reweighting:

$\langle A \rangle_K = \dfrac{\langle A(E)\,P_K(E)/P_{\rm muca}(E)\rangle}{\langle P_K(E)/P_{\rm muca}(E)\rangle}$.

Muca iteration

Determine the muca weights/density of states iteratively:

1. Use, e.g., a K = 0 canonical simulation to get an initial estimate $\hat S_0(E) = \ln \hat\Omega_0(E)$.
2. Choose the multicanonical weights $\omega_1(E) = \hat S_0(E)$ for the next simulation.
3. Iterate.

[Plots: $P_{\rm muca}(E)$ after the zeroth and first iterations.]
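A minimal host-side sketch of this iteration and of the final reweighting step (run_histogram() is a hypothetical MC driver returning the energy histogram accumulated under the current weights):

    #include <cmath>
    #include <vector>

    // Hypothetical MC driver: simulate with weights W(E) = exp(-omega[E])
    // and return the sampled energy histogram H(E).
    std::vector<long> run_histogram(const std::vector<double> &omega);

    // Weight iteration: omega_{i+1}(E) = omega_i(E) + ln H_i(E) + const.
    void muca_iterate(std::vector<double> &omega, int niter)
    {
      for (int it = 0; it < niter; ++it) {
        std::vector<long> H = run_histogram(omega);
        for (std::size_t e = 0; e < omega.size(); ++e)
          if (H[e] > 0)
            omega[e] += std::log((double)H[e]); // flattens the histogram
      }
    }

    // Reweight a muca time series (E_t, A_t) to the canonical ensemble:
    // <A>_K = sum_t A_t e^{omega(E_t)-K E_t} / sum_t e^{omega(E_t)-K E_t}.
    double canonical_average(const std::vector<double> &A,
                             const std::vector<double> &E,
                             const std::vector<double> &omega_t, double K)
    {
      double num = 0.0, den = 0.0;
      for (std::size_t t = 0; t < A.size(); ++t) {
        // (production code would subtract the maximal exponent first)
        double w = std::exp(omega_t[t] - K * E[t]);
        num += A[t] * w;
        den += w;
      }
      return num / den;
    }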

Muca iteration (cont'd)

[Plots: $P_{\rm muca}(E)$ after the nth iteration, approaching the flat limiting distribution.]

Advantages:
- always in equilibrium
- arbitrary distributions possible
- the system ideally performs an unbiased random walk in energy space
- fast(er) dynamics

Variants:
- umbrella sampling, entropic sampling (identical)
- multiple Gaussian modified ensemble
- (broad histogram method)
- transition-matrix Monte Carlo
- metadynamics
- Wang-Landau sampling
- ...

Wang-Landau sampling

The muca weights are updated as

$\omega_{i+1}(E) - \omega_i(E) = {\rm const} + \ln \hat H_i(E)$,

i.e., if an energy is visited more often than others, it receives less weight in future iterations. This behavior can be imitated in a one-step iteration: simulate

$P_{\rm WL}(E) \propto \Omega(E)\,e^{-\omega(E)}$,

but update

$\omega_{i+1}(E) - \omega_i(E) = \phi$

each time an energy E is seen.
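A minimal sketch of one such update step (propose_move(), undo_move() and rand01() are hypothetical helpers; omega plays the role of the running estimate of $\ln \Omega(E)$):

    #include <cmath>
    #include <vector>

    int propose_move();  // hypothetical: trial flip, returns new energy bin
    void undo_move();    // hypothetical: restore the previous configuration
    double rand01();     // hypothetical: uniform random number in [0,1)

    // One Wang-Landau update: a Metropolis step with weights e^{-omega(E)},
    // followed by the modification omega(E) += phi at the current energy.
    void wl_step(std::vector<double> &omega, std::vector<long> &H,
                 int &ebin, double phi)
    {
      int enew = propose_move();
      // accept with min(1, e^{omega(E_old) - omega(E_new)})
      if (rand01() < std::exp(omega[ebin] - omega[enew]))
        ebin = enew;
      else
        undo_move();
      omega[ebin] += phi;   // penalize the energy just visited
      ++H[ebin];            // visit histogram, e.g., for flatness checks
    }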

WL iteration

1. Start with $\omega(E) = 0$.
2. Simulate sufficiently long while continuously updating $\omega(E)$.
3. Reduce the modification factor, e.g., $\phi \to \phi/2$.
4. Iterate till $\phi < \phi_{\rm thres}$.

Parallelization

Problem: the dependence on the global quantity E (or M, q, ...) effectively serializes all calculations. But:
- one can subdivide the range of E into windows
- one can increase the statistics (and estimate statistical errors) with independent runs

Wang-Landau simulations

[Plot: time t/N to estimate S(E) of the N = L² two-dimensional Ising model with the windowed Wang-Landau algorithm, combining results from energy windows of width Δ = 16; CPU data from 10 runs on a 3.0 GHz CPU, GPU data (multiplied by 10 in the plot) from 10 runs on a 1.4 GHz, 480-core GTX 480, with each GPU run processing many energy windows concurrently, independently of L.]

[Plot: relative error ε(E) = S(E)/S_exact(E) − 1 of the microcanonical entropy of the N = 64² two-dimensional Ising model from the windowed Wang-Landau algorithm (Δ = 16, tile = 16), for a single example estimate and for S(E) averaged over 32 estimates, with an inset comparing S(E) from MC to the exact result.]
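Windowing needs one extra step: the pieces of $S(E) = \ln \Omega(E)$ from different windows are only determined up to additive constants and have to be matched. A minimal sketch of one way to stitch overlapping window estimates together (the array layout and non-empty overlaps are assumptions):

    #include <vector>

    // Join window estimates Sw[w] of S(E), where window w covers the
    // bins [off[w], off[w] + Sw[w].size()), by shifting each window to
    // agree with the already-joined curve in the overlap region.
    std::vector<double> join_windows(
        const std::vector<std::vector<double> > &Sw,
        const std::vector<int> &off, int nbins)
    {
      std::vector<double> S(nbins, 0.0);
      double shift = 0.0;
      for (std::size_t w = 0; w < Sw.size(); ++w) {
        if (w > 0) {
          // average offset to the previous windows in the overlap
          int lo = off[w], hi = off[w-1] + (int)Sw[w-1].size();
          double d = 0.0; int cnt = 0;
          for (int e = lo; e < hi; ++e, ++cnt)
            d += S[e] - Sw[w][e - off[w]];
          shift = d / cnt;
        }
        for (std::size_t i = 0; i < Sw[w].size(); ++i)
          S[off[w] + i] = Sw[w][i] + shift;
      }
      return S;
    }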

Wang-Landau simulations (cont'd)

[Plot: the same relative error ε(E) for several window widths Δ; each curve is from one GPU run and represents an average over 32 estimates.]

[Plot: speedup versus L for the multicanonical (MUCA) and Wang-Landau (WL) GPU implementations, L = 8 to 64.]

Performance

Benchmark results for the various models considered:

    System         Algorithm             L       CPU       C1060     GTX 480   speed-up
                                                 [ns/flip] [ns/flip] [ns/flip]
    2D Ising       Metropolis            32      8.3       2.58      1.60      3/5
    2D Ising       Metropolis            16 384  8.0       0.077     0.034     103/235
    2D Ising       Metropolis, k = 100   16 384  8.0       0.292     0.133     28/60
    3D Ising       Metropolis            512     14.0      0.13      0.067     107/209
    2D Heisenberg  Metro., double        4096    183.7     4.66      1.94      39/95
    2D Heisenberg  Metro., single        4096    183.2     0.74      0.50      248/366
    2D Heisenberg  Metro., fast math     4096    183.2     0.30      0.18      611/1018
    2D spin glass  Metropolis            32      14.6      0.15      0.07      97/209
    2D spin glass  Metro., multi-spin    32      0.18      0.0075    0.0023    24/78
    2D Ising       Swendsen-Wang         1024    77.4      -         2.97      -/26
    2D Ising       multicanonical        64      42.1      -         0.33      -/128
    2D Ising       Wang-Landau           64      43.6      -         0.94      -/46

All the rest

What I haven't talked about:
- no-copy pinning of system memory
- asynchronous execution and data transfers: streams etc.
- surface memory
- GPU-GPU communication
- C++ language features: new and delete, virtual functions
- inline PTX
- Thrust library (similar to the STL), with data structures (device_vector, host_vector, ...) and algorithms (sort, reduce, ...); see the snippet after this list
- libraries: FFT, BLAS, ...
- other language bindings: PyCUDA, Copperhead, Fortran, ...
- OpenCL
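As a small taste of Thrust (a sketch with dummy data), a device-side reduction takes three lines of host code:

    // Minimal Thrust usage sketch: sum an array on the GPU.
    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <cstdio>

    int main()
    {
      thrust::device_vector<int> e(1 << 20, 1);         // dummy per-site data
      int sum = thrust::reduce(e.begin(), e.end(), 0);  // runs on the device
      std::printf("sum = %d\n", sum);
      return 0;
    }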

Summary and outlook

This lecture: we have covered some more advanced simulation techniques. As expected, highly (data-)non-local algorithms give significantly smaller speed-ups. Still, these are sizeable.

All lectures: scientific computing with graphics processing units:
- tailoring to the hardware is needed for really good performance
- special, but general enough to provide significant speed-ups for most problems
- GPU computing might be a fashion, but CPU computing is going the same way

Reading

For this lecture:
- M. Weigel, Phys. Rev. E 84, 036709 (2011) [arXiv:1105.5804].
- M. Weigel and T. Yavors'kii, Physics Procedia 15, 92 (2011) [arXiv:1107.5463].
- http://www.cond-mat.physik.uni-mainz.de/~weigel/gpu

Some advertisement: EPJ ST issue
http://epjst.epj.org/articles/epjst/abs/212/1/contents/contents.html