Accelerating Quantum Chromodynamics Calculations with GPUs

Guochun Shi, Steven Gottlieb, Aaron Torok, Volodymyr Kindratenko
NCSA & Indiana University
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Outline
Background
New staggered code
Implementation on GPUs
Benchmarks
Production use experience
Where to get the code
Future work

Background
NCSA's Innovative Systems Laboratory worked on a Cell B.E. port of MILC (arXiv:0910.0262).
Boston University developed QUDA for Wilson-type quarks (arXiv:0810.5365, arXiv:0911.3191).
8/2009: Prof. Steven Gottlieb's sabbatical at NCSA.
We are extending QUDA to staggered quarks.

Introduction: Lattice QCD
Diagram courtesy of K. Z. Ibrahim, IRISA/INRIA, France.
Quantum Chromodynamics (QCD) studies the strong (color) force between quarks and gluons. QCD is a highly nonlinear theory, so the lattice QCD approach is used: the computation is performed on a 4-D space-time lattice.

Lattice QCD computation on CPU
There are two phases in a lattice QCD computation:
Configuration generation: compute a snapshot of the state of the strong force field. The dominant kernels are the Conjugate Gradient solver (CG), the fermion force (FF), the gauge force (GF) and link fattening (FAT); the table on the slide shows the time distribution in this phase.
Analysis: the observables of interest are computed over the configurations. CG is even more dominant in this phase.
In short, CG is the dominant part, but the other kernels are also important. A minimal CG sketch is given below.
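
For orientation, here is a minimal sketch of the kind of Conjugate Gradient recurrence the solver implements. apply_A() is a trivial placeholder standing in for the staggered normal operator; it is not the actual MILC/QUDA interface.

// Minimal CG sketch (host-side, single precision). apply_A() is a trivial
// placeholder for the staggered normal operator, NOT the MILC/QUDA interface.
#include <vector>
#include <cstddef>

using Vec = std::vector<float>;

static float dot(const Vec &a, const Vec &b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (double)a[i] * b[i];
    return (float)s;
}

static void apply_A(const Vec &in, Vec &out) { out = in; }  // identity placeholder

void cg_solve(const Vec &b, Vec &x, int max_iter, float tol) {
    Vec r = b, p = b, Ap(b.size());          // assumes the initial guess x = 0
    float rr = dot(r, r);
    for (int k = 0; k < max_iter && rr > tol * tol; ++k) {
        apply_A(p, Ap);
        float alpha = rr / dot(p, Ap);
        for (std::size_t i = 0; i < b.size(); ++i) {
            x[i] += alpha * p[i];            // update solution
            r[i] -= alpha * Ap[i];           // update residual
        }
        float rr_new = dot(r, r);
        float beta = rr_new / rr;
        for (std::size_t i = 0; i < b.size(); ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
}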

Development status
MILC applications call the QUDA GPU library through interface functions. Status of the modules:
CG: updates spinors using the neighboring links and spinors through the Conjugate Gradient process. Fully implemented (18-, 12- and 8-reconstruct; SP, DP and half precision; mixed precision).
FAT: updates fat links using the neighboring links. Implemented for the 12-reconstruct, single precision case.
GF: updates the momentum using the neighboring links. Implemented for 12-reconstruct, single precision.
FF: updates the momentum using the neighboring links and spinors. Implemented for 12-reconstruct, single precision.

New Staggered Code
Starting code: QUDA 0.2 from Boston University.
The first effort was the staggered Dslash for a single GPU, extended to CG and then to multi-mass CG, followed by the fat link computation, gauge force and fermion force.
Wrappers allow the GPU code to be called from MILC.
The code was merged back into the BU QUDA code, which had evolved from the initial release.
Multi-GPU code partitioned over the time dimension works; multi-GPU support over multiple dimensions is in development.
For this talk, I will focus on the staggered Dslash, CG / multi-mass CG, and multi-GPU CG.

Lattice QCD: Conjugate Gradient

Staggered Dslash: CPU data layout
Each site contains 1 spinor (1x3 complex), 4 fat links (3x3 complex) and 4 long links (3x3 complex). A long link can be reconstructed from 12 or 8 real numbers if it is unitary.
Sites are divided into even and odd sites: for site (x,y,z,t), (x+y+z+t)%2 == 0 is an even site and (x+y+z+t)%2 == 1 is an odd site. The total number of sites is V = dimx * dimy * dimz * dimt, and half of the total is Vh = V/2.
Each spinor contains 6 floats (3x2), so the spinor field is 6*V floats and is stored with the even sites first, followed by the odd sites. Each link contains 18 floats (3x3x2); the fat links are stored as an array of pointers, one per direction (+X, +Y, +Z, +T), and the long links are stored the same way.
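
As a concrete illustration of the even/odd (checkerboard) indexing described above, here is a small sketch; the lattice dimensions and helper names are made up for the example and are not MILC's or QUDA's actual conventions.

// Sketch of even/odd (checkerboard) site indexing for a 4-D lattice.
#include <cstdio>

const int dimx = 24, dimy = 24, dimz = 24, dimt = 32;
const int V  = dimx * dimy * dimz * dimt;   // total number of sites
const int Vh = V / 2;                       // sites of one parity

// Parity of a site: 0 = even, 1 = odd.
int parity(int x, int y, int z, int t) { return (x + y + z + t) & 1; }

// Index of a site within its parity block (0 .. Vh-1): lexicographic
// index divided by two, since even and odd sites alternate along x.
int cb_index(int x, int y, int z, int t) {
    int full = ((t * dimz + z) * dimy + y) * dimx + x;
    return full / 2;
}

int main() {
    // A spinor field stores the Vh even sites first, then the Vh odd sites.
    printf("V = %d, Vh = %d, parity(1,0,0,0) = %d\n",
           V, Vh, parity(1, 0, 0, 0));
    return 0;
}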

Staggered Dslash reference implementation

Spinor CPU -> GPU mapping
In host memory the spinor field is 6*V floats, with the even sites first and the odd sites following; it is split into parity spinor fields of 6*Vh floats each. In device memory a parity spinor is stored as float2 elements with stride Vh (three float2 per site), so that consecutive threads read consecutive float2 words; the GPU kernel reads the spinor through this layout (kernel code shown on the slide).
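
The sketch below illustrates the kind of reordering this implies, going from the host's site-contiguous layout to a strided float2 layout that gives coalesced reads on the GPU. The names and the exact stride convention are assumptions for illustration, not QUDA's actual code.

// Sketch: repack a parity spinor from host layout (6 consecutive floats per
// site) into a device layout of float2 elements with stride Vh, so that
// thread i reads spinor[k*Vh + i] with coalesced accesses.
#include <cuda_runtime.h>

void repack_parity_spinor(const float *host_spinor,  // 6*Vh floats, site-major
                          float2 *dev_spinor,        // 3*Vh float2, component-major
                          int Vh) {
    float2 *tmp = new float2[3 * Vh];
    for (int i = 0; i < Vh; ++i) {
        for (int k = 0; k < 3; ++k) {                // 3 complex colors per site
            tmp[k * Vh + i] = make_float2(host_spinor[6 * i + 2 * k],
                                          host_spinor[6 * i + 2 * k + 1]);
        }
    }
    cudaMemcpy(dev_spinor, tmp, 3 * Vh * sizeof(float2),
               cudaMemcpyHostToDevice);
    delete[] tmp;
}

// In the kernel, the thread handling site i then reads its spinor as:
//   float2 c0 = dev_spinor[0 * Vh + i];
//   float2 c1 = dev_spinor[1 * Vh + i];
//   float2 c2 = dev_spinor[2 * Vh + i];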

Link CPU -> GPU mapping
In host memory the links are stored per direction (+X, +Y, +Z, +T), 18 floats per link. Through an intermediate format, each link is converted to the 12-reconstruct representation and stored in device memory as float4 elements with stride Vh (three float4 per link), which the GPU kernel reads directly (kernel code shown on the slide).
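
For the 12-reconstruct mentioned above, the third row of a unitary link with unit determinant can be rebuilt on the fly as the complex conjugate of the cross product of the first two rows. The sketch below shows that arithmetic with illustrative names; it is not the QUDA device code itself.

// Sketch: rebuild the third row of an SU(3) link from its first two rows
// (12-reconstruct). Valid for special unitary matrices (det = 1); the long
// links in the production QED runs are not unitary, so they need the full
// 18-float storage, as noted later in the talk.
#include <cuComplex.h>

__host__ __device__ inline
void reconstruct_third_row(const cuFloatComplex r0[3],
                           const cuFloatComplex r1[3],
                           cuFloatComplex r2[3]) {
    // r2 = conj(r0 x r1), component by component.
    r2[0] = cuConjf(cuCsubf(cuCmulf(r0[1], r1[2]), cuCmulf(r0[2], r1[1])));
    r2[1] = cuConjf(cuCsubf(cuCmulf(r0[2], r1[0]), cuCmulf(r0[0], r1[2])));
    r2[2] = cuConjf(cuCsubf(cuCmulf(r0[0], r1[1]), cuCmulf(r0[1], r1[0])));
}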

Dslash GPU kernel (the slide shows only the computation in the +x direction)

Dslash optimization technique summary
Arrange data in device memory to enable coalesced reads and writes.
Reading data through textures improves performance, and we find that mixing texture and direct memory reads performs better still; the best combination differs between GT200 and Fermi cards (more on this later).
Pad memory to avoid partition camping.
Unroll the loop over the eight directions.
Use constant memory for commonly used constants (X1, X2, X3, X4, X1m1, X2X1mX1, ...); a small sketch of this follows.
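
As a small illustration of the constant-memory point, a kernel can keep the lattice dimensions in __constant__ memory so every thread reads them through the constant cache. The names and the kernel body are made up for the example; this is not the QUDA Dslash kernel.

// Sketch: lattice dimensions kept in constant memory, read by every thread
// through the constant cache.
#include <cuda_runtime.h>

__constant__ int c_dims[4];     // X1, X2, X3, X4 (lattice extents)

__global__ void example_kernel(const float2 *spinor_in, float2 *spinor_out,
                               int Vh) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= Vh) return;
    // All threads read the same constants; the constant cache broadcasts them.
    int X1 = c_dims[0];
    (void)X1;
    // ... decompose i into (x, y, z, t) using the extents and gather neighbors ...
    spinor_out[i] = spinor_in[i];   // placeholder body
}

void setup_dims(int X1, int X2, int X3, int X4) {
    int dims[4] = {X1, X2, X3, X4};
    cudaMemcpyToSymbol(c_dims, dims, sizeof(dims));
}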

CG in MILC
The Dslash is the most time-consuming part of CG. Besides the Dslash, the BLAS routines are also implemented on the GPU.
Mixed precision is implemented: the bulk of the work is done in lower precision, while the higher precision is used to compute the defect correction, so the final solution is as accurate as a pure higher-precision solve (see the sketch below).
A multi-mass (multi-shift) solver is also implemented: it computes solutions for many equations with different masses but the same Dslash operator.
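
The sketch below shows the general shape of the mixed-precision defect-correction (iterative refinement) loop described above. The operator and inner solver are passed in by the caller; they are placeholders, not actual MILC/QUDA entry points.

// Sketch of mixed-precision defect correction: the bulk of the work is a
// single-precision inner solve, while the residual is recomputed and the
// solution accumulated in double precision.
#include <vector>
#include <functional>

using VecD = std::vector<double>;
using VecF = std::vector<float>;

void solve_mixed(const VecD &b, VecD &x, double tol,
                 const std::function<void(const VecD &, VecD &)> &apply_A_double,
                 const std::function<void(const VecF &, VecF &)> &cg_solve_single) {
    std::size_t n = b.size();
    VecD r = b, Ax(n);                          // start from x = 0
    double bnorm = 0;
    for (double v : b) bnorm += v * v;
    double rnorm = bnorm;
    while (rnorm > tol * tol * bnorm) {
        // Inner solve A * dx = r in single precision (the cheap part).
        VecF r_lo(r.begin(), r.end()), dx_lo(n, 0.0f);
        cg_solve_single(r_lo, dx_lo);
        // Accumulate the correction and recompute the true residual in double.
        for (std::size_t i = 0; i < n; ++i) x[i] += (double)dx_lo[i];
        apply_A_double(x, Ax);
        rnorm = 0;
        for (std::size_t i = 0; i < n; ++i) { r[i] = b[i] - Ax[i]; rnorm += r[i] * r[i]; }
    }
}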

Hardware comparison

Type            Cores   BW (GB/s)   SP (GFLOPS)   DP (GFLOPS)   RAM (GB)
GTX 280           240         142          1051            87        1.0
Tesla C1060       240         102           933            78        4.0
Fermi GTX 480     480         177          1345           168        1.5
Fermi C2050       448         144          1030           515        3.0

NERSC: Fermi C2050; NCSA: all other GPUs.

Performance for different loading methods: GTX 480, L1/shared = 16/48 KB, single precision (GFLOPS)

Reading method (Fatlink Longlink Spinor)   12-reconstruct   8-reconstruct   18-reconstruct
D D D                                                 104             120               96
D D T                                                 107             123               99
D T D                                                 111             115              100
D T T                                                 110             123              102
T D D                                                 113             126              101
T D T                                                 113             126              102
T T D (winner)                                        115             125              102
T T T                                                 104             120               96

D = direct read, T = texture read.

Performance for different loading methods: GTX 480, L1/shared = 48/16 KB, double precision (GFLOPS)

Reading method (Fatlink Longlink Spinor)   12-reconstruct   8-reconstruct   18-reconstruct
D D D                                                  24              18               37
D D T                                                  24              16               44
D T D                                                n/a*              18               42
D T T                                                  28              16               46
T D D                                                  25              16               43
T D T (winner)                                         28              16               47
T T D                                                  31              16               50
T T T                                                  31              15               50

D = direct read, T = texture read.
* The code does not converge; something is wrong.

Dslash performance on GTX 280 (GFLOPS)

Precision   Reconstruct   Dslash
DP          12                32
DP          8                 16
DP          18                35
SP          12                98
SP          8                110
SP          18                85
HP          12               123
HP          8                128
HP          18               109

Lattice size: 24^3 x 32.

CG / multi-mass CG performance on GTX 280 (GFLOPS)

Precision   Reconstruct   CG    Multi-mass CG
DP          12             31               31
DP          8              15               16
DP          18             33               34
SP          12             98               92
SP          8             108               96
SP          18             83               80
HP          12            123              106
HP          8             128              113
HP          18            108               98

24^3 x 32 lattice; four masses for the last column.

GTX 280 performance with data movement overhead (GFLOPS)

                 Kernel only   Target assuming 100 GB/s transfer limit   Including PCIe overhead
CG                        98                                        100                        71
Multi-mass CG             92                                        100                        71

Single precision, 12-reconstruct, measured over 500 iterations on a 24^3 x 32 lattice.

Fermi results (GFLOPS)

Precision   Reconstruct   GTX 280   GTX 480   C2050 (ECC)   C2050 (no ECC)
DP          12                 29        31            20               24
DP          8                  15        16            11               13
DP          18                 32        50            30               41
SP          12                 92       116            66               96
SP          8                  99       126            72              100
SP          18                 79       104            57               86
HP          12                 77       154            97              122
HP          8                  74       157           101              123
HP          18                 76       131            84              104

Lattice size 24^3 x 32. These numbers were obtained with the multi-GPU version of the code running on a single GPU.

Multi-GPU benchmarks
Two kernels are used for the multi-GPU Dslash:
Interior kernel: sums the contributions from all directions for t = 3, 4, ..., T-3, plus the spatial contributions for the remaining boundary t values.
Exterior kernel: adds in the time-direction terms that need off-node spinors.
For 24^3 spatial size with single precision on a GTX 280, the boundary exchange costs pack (D2H, 0.29 ms, 3.3 GB/s), MPI (0.16 ms, 6.14 GB/s) and unpack (H2D, 0.20 ms, 4.8 GB/s): only 1.5 GB/s aggregate message rate! A sketch of this exchange pattern follows.
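
The boundary exchange described above has the general shape sketched below: pack the time-boundary spinors to the host, exchange them with MPI, and copy them back to the device before the exterior kernel runs. Buffer names, the packing step and the communication pattern are illustrative assumptions, not the actual multi-GPU QUDA code.

// Sketch of the multi-GPU time-direction exchange: boundary spinors go
// device-to-host, are exchanged with the neighbor over MPI, then host-to-device.
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_time_boundary(const float2 *d_boundary_send,  // packed on device
                            float2 *d_boundary_recv,
                            size_t nelem,                    // float2 elements
                            int fwd_rank, int back_rank,
                            MPI_Comm comm) {
    size_t bytes = nelem * sizeof(float2);
    float2 *h_send = nullptr, *h_recv = nullptr;
    cudaMallocHost((void **)&h_send, bytes);                 // pinned host buffers
    cudaMallocHost((void **)&h_recv, bytes);

    // Pack: device -> host (D2H).
    cudaMemcpy(h_send, d_boundary_send, bytes, cudaMemcpyDeviceToHost);

    // Send to the forward neighbor, receive from the backward one.
    MPI_Sendrecv(h_send, (int)bytes, MPI_BYTE, fwd_rank, 0,
                 h_recv, (int)bytes, MPI_BYTE, back_rank, 0,
                 comm, MPI_STATUS_IGNORE);

    // Unpack: host -> device (H2D); the exterior kernel can then run.
    cudaMemcpy(d_boundary_recv, h_recv, bytes, cudaMemcpyHostToDevice);

    cudaFreeHost(h_send);
    cudaFreeHost(h_recv);
}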

CG performance (GFLOPS) per GPU on AC's compute nodes (S1070): see the plot on the slide.

CG performance (GFLOPS) per GPU on NERSC's compute nodes (C2050, no ECC): see the plot on the slide.

Production experience
We have been using GPUs for electromagnetic effects, i.e., SU(3) x U(1). So far we only use ensembles that fit in a single GPU's memory.
About 4,000 configurations from 20^3 x 64 to 28^3 x 96 have been analyzed (Aaron Torok) on AC (NCSA), FNAL, Dirac (NERSC), and JLab.
A few constraints of Aaron Torok's QED application:
Double precision is required.
GPU memory requirement > 1 GB, so it only fits on the Tesla C1060 (lower clock frequency, lower performance).
The long link is not unitary, so only the 18-reconstruct is applicable (higher bandwidth requirement).
CPU: 6.04 node-hr = 48.2 core-hr. GPU: 1.49 node-hr = 11.9 core-hr (only 1 core used).

Electromagnetic Splitting of Charged and Neutral Mesons (Lattice 2010, Aaron Torok et al.)

Where to get the code
The staggered code is integrated with the Wilson code and will soon be released as QUDA 0.3 by BU.
It can be found at http://lattice.bu.edu/quda and requires CUDA 3.0 or higher.
The multi-GPU code will be released later.

Future
The study of heavy-light mesons might use GPUs; this requires combining the clover and staggered inverters.
Although we have asqtad code modules for gauge configuration generation, we are now generating HISQ configurations, so new code must be added.
Investigate strong scaling, as supercomputers are now reaching petaflop/s performance.
It is essential to decide which other parts of production running can be profitably shifted to GPUs.

Acknowledgements
This work is sponsored by the Institute for Advanced Computing Applications and Technologies (IACAT) at the University of Illinois at Urbana-Champaign (UIUC).
Other lattice QCD developers: Ron Babich (Boston University), Mike Clark (Harvard University), Balint Joo (Jefferson National Lab).
We are thankful for helpful discussions with Ray Sung and Wen-mei Hwu of the ECE department at UIUC, Greg Bauer of NCSA, and John Stone of the TCB group at UIUC.