Efficient implementation of the overlap operator on multi-GPUs
1 Efficient implementation of the overlap operator on multi-GPUs. Andrei Alexandru, Mike Lujan, Craig Pelissier, Ben Gamari, Frank Lee. SAAHPC, University of Tennessee.
2 Outline: Motivation; Overlap operator; Multi-GPU Wilson-Dirac kernel; Eigensolver and inverter; Conclusions.
3 Building blocks of matter. Quarks are the constituents of matter; they interact strongly by exchanging gluons. Peculiar properties: confinement and asymptotic freedom (Nobel Prize 2004). The theory of strong interactions is Quantum Chromodynamics (QCD).
4 Lattice QCD. Replace space-time with a four-dimensional lattice; differential operators are replaced with finite-difference operators. Typical lattices have a few tens of sites per spatial dimension, with the 4th (time) dimension about twice as long; for example, 24³×48 = 663,552 sites. Typical project size ~ 1 Petaflop.
5 Why overlap fermions on multi-GPUs? We want to study QCD dynamics in the chiral regime, and overlap fermions preserve chiral symmetry at finite lattice spacing. Overlap fermions are computationally demanding, so we use GPUs, which have good memory bandwidth. The memory requirements of the overlap operator force us to use multiple GPUs.
6 Lattice QCD. QCD is a field theory; lattice QCD is defined on a 4D grid. Quarks Ψ live on the sites, gluons U live on the links. [Figure: a 2D slice of the lattice, with spinor fields Ψ at the sites and link variables U on the links between them.] The links are generated randomly according to the QCD dynamics.
7 Wilson-Dirac operator. Wilson fermions are one of the simplest discretizations of the continuum operator $m + \slashed{D}$. The operator is numerically fast and very sparse, but it breaks chiral symmetry:
$$D_w = (ma + 4)\,\mathbf{1} - \frac{1}{2}\sum_{\mu} T_\mu,$$
$$\mu > 0:\ (T_\mu\psi)_n = U_\mu(n)\,\psi_{n+\hat\mu}\,(1-\gamma_\mu), \qquad \mu < 0:\ (T_\mu\psi)_n = U_{-\mu}(n-\hat\mu)^\dagger\,\psi_{n-\hat\mu}\,(1+\gamma_\mu).$$
It serves as the kernel for the overlap operator. The quark propagator is $\langle 0|\,\psi(x)\,\bar\psi(y)\,|0\rangle = (D_w^{-1})_{x,y}$.
8 Wilson-Dirac operator.
$$Y(n) = (MX)(n) = X(n) - \kappa \sum_{\mu}\left[ V_\mu(n)\,X(n+\hat\mu) + V_\mu^\dagger(n-\hat\mu)\,X(n-\hat\mu) \right]$$
The Wilson operator multiplies Wilson fields, 4×3 matrices living at every site of the lattice. The value of Y at a site depends on the value of X at the same site and at the 8 neighboring sites. Each of the fields at the neighboring sites has to be transported to the final site; this involves a multiplication with a color matrix (3×3) and a spinor matrix (4×4). The color matrices differ from link to link, whereas the spinor matrices depend only on the direction. The matrices and the vectors are all complex. [Figure: the transport $T_\mu\Psi$ written out as the product of a 3×3 color matrix $U_{n,\mu}$, the 3×4 spinor field $\Psi_n$, and a 4×4 spinor matrix.]
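To make the per-site work concrete, here is a minimal sketch of the color part of one transport (the 3×3 link matrix acting on the 3×4 spinor at one site). The data layout and names are illustrative, not the talk's actual kernel; the spin projection is left as a separate step.

#include <cuComplex.h>

// One transport's color multiplication for a single site:
// out[c][s] = sum_c' U[c][c'] * psi[c'][s]  (3x3 color times 3x4 spinor).
__device__ void transport_site(const cuDoubleComplex U[3][3],
                               const cuDoubleComplex psi[3][4],
                               cuDoubleComplex out[3][4])
{
    for (int c = 0; c < 3; ++c)
        for (int s = 0; s < 4; ++s) {
            cuDoubleComplex acc = make_cuDoubleComplex(0.0, 0.0);
            for (int cp = 0; cp < 3; ++cp)
                acc = cuCadd(acc, cuCmul(U[c][cp], psi[cp][s]));
            out[c][s] = acc;
        }
    // The 4x4 spin projection (1 -/+ gamma_mu) depends only on the
    // direction mu and is applied to the spin index s afterwards.
}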
9 Overlap operator. The overlap operator is dense and about 100 times more expensive to apply than the Wilson kernel. The cost is proportional to the condition number of $(H_w)^2$ and to $\log\delta$:
$$D = 1 + \gamma_5\,\mathrm{sign}(H_w), \qquad H_w = \gamma_5 D_w,$$
where the sign function is approximated as $\mathrm{sign}(H_w) \approx Q\,P(Q^2)$ with $Q = H_w/\lVert H_w\rVert$ and approximation error $\delta = \max_{x\in[\epsilon,1]} \lvert 1 - \sqrt{x}\,P(x)\rvert$.
10 Requirements. The overlap operator requires the Wilson kernel + vector routines and an Hwilson eigensolver; the propagator calculation requires the overlap inverter and the overlap eigensolver.
11 System architecture. [Diagram: each GPU has 1-6 GB of memory at ~140 GB/s; the GPU talks to the CPU over PCI at ~5 GB/s; each node has 12-48 GB of CPU memory; nodes are connected by Infiniband at ~2 x 2.5 GB/s.]
12 Computational strategy. We use one process per GPU and MPI for communication. All data resides in GPU memory. Lattice sites are split evenly between the nodes, and all data belonging to a particular site resides on the node that owns that site. Communication is mainly implemented via shifts and is overlapped, when possible, with computation.
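As a sketch of the process-per-GPU setup (not the talk's code; the round-robin device choice assumes ranks on a node are numbered consecutively):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, ndev;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);   // bind this MPI rank to one GPU
    // ... allocate the local sub-lattice entirely in GPU memory here ...
    MPI_Finalize();
    return 0;
}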
13 Vector routines. Expression templates + the THRUST library auto-generate optimized kernels for expressions such as $\phi \leftarrow \alpha\psi_1 + \beta\psi_2 + \gamma\psi_3$. Non-reduction kernels scale perfectly; max bandwidth on an M2070 with ECC on is about 85 GB/s. Reduction kernels have poor scaling, since their computational fraction is small; most of the poor scaling is due to poor single-node kernel performance on small vectors. [Figure: bandwidth per GPU (GB/s) vs GPU count for vector addition and scalar product.]
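A minimal Thrust sketch of the kind of fused kernel such expression templates generate for $\phi = \alpha\psi_1 + \beta\psi_2 + \gamma\psi_3$; the functor and function names here are illustrative. The point is that the whole expression compiles to a single kernel, so each input is read once and the output written once.

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/complex.h>

using cplx = thrust::complex<double>;

struct axpbypgz {
    cplx a, b, g;
    axpbypgz(cplx a_, cplx b_, cplx g_) : a(a_), b(b_), g(g_) {}
    template <typename Tuple>
    __host__ __device__ cplx operator()(const Tuple& t) const {
        // One fused pass over all three inputs.
        return a * thrust::get<0>(t) + b * thrust::get<1>(t) + g * thrust::get<2>(t);
    }
};

void combine(const thrust::device_vector<cplx>& p1,
             const thrust::device_vector<cplx>& p2,
             const thrust::device_vector<cplx>& p3,
             thrust::device_vector<cplx>& phi,
             cplx alpha, cplx beta, cplx gamma)
{
    auto first = thrust::make_zip_iterator(
        thrust::make_tuple(p1.begin(), p2.begin(), p3.begin()));
    auto last = thrust::make_zip_iterator(
        thrust::make_tuple(p1.end(), p2.end(), p3.end()));
    thrust::transform(first, last, phi.begin(), axpbypgz(alpha, beta, gamma));
}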
14 Wilson-Dirac kernel
15 Wilson-Dirac kernel. The cost of the Wilson-Dirac operator is 1368 flops/site: 600 multiplications (44%) and 768 additions (56%), a balanced load. The data for one site's computation is, in: 8 spinors + 8 links (neighbors) + 1 spinor; out: 1 spinor. In double precision this is 3072 bytes/site, so the computational density is 1368 flops / 3072 bytes = 0.45 flop/byte (double that in single precision). For 85 GB/s max bandwidth the maximum kernel performance is 38.25 GFlops (double) and 76.5 GFlops (single). The kernel has a fair amount of parallelism: the 8 transports can be implemented in parallel, and each transport can be split into 2 parallel tasks.
16 Calculation steps. The communication time is overlapped with the computation time to hide latency: 1. Gather: compute compressed fields and fill the communication buffers. 2. Comm: initiate non-blocking communication. 3. Bulk: compute the dslash for the interior points. 4. Scatter: finish the communication and add the boundary results.
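A control-flow sketch of these four steps with two CUDA streams and non-blocking MPI. The kernel names, buffers, and neighbor ranks are hypothetical placeholders (the host buffers are assumed pinned for async copies); a full implementation does this once per cut direction.

#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical kernels standing in for the real ones:
__global__ void gather_kernel(double2* send_buf, const double2* in);
__global__ void dslash_interior(double2* out, const double2* in);
__global__ void scatter_kernel(double2* out, const double2* recv_buf);

void dslash(double2* d_out, const double2* d_in,
            double2* d_send, double2* d_recv,
            double2* h_send, double2* h_recv, size_t buf_bytes,
            int nbr_fwd, int nbr_bwd, MPI_Comm comm,
            cudaStream_t s_bulk, cudaStream_t s_bnd,
            dim3 grid_bnd, dim3 grid_int, dim3 block)
{
    // 1. Gather: project and pack boundary spinors, copy them to the host.
    gather_kernel<<<grid_bnd, block, 0, s_bnd>>>(d_send, d_in);
    cudaMemcpyAsync(h_send, d_send, buf_bytes, cudaMemcpyDeviceToHost, s_bnd);
    cudaStreamSynchronize(s_bnd);

    // 2. Comm: start the non-blocking exchange with the neighboring ranks.
    MPI_Request reqs[2];
    MPI_Isend(h_send, (int)buf_bytes, MPI_BYTE, nbr_fwd, 0, comm, &reqs[0]);
    MPI_Irecv(h_recv, (int)buf_bytes, MPI_BYTE, nbr_bwd, 0, comm, &reqs[1]);

    // 3. Bulk: interior sites need no remote data, so they overlap the comm.
    dslash_interior<<<grid_int, block, 0, s_bulk>>>(d_out, d_in);

    // 4. Scatter: finish the comm and accumulate the boundary contributions.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    cudaMemcpyAsync(d_recv, h_recv, buf_bytes, cudaMemcpyHostToDevice, s_bnd);
    scatter_kernel<<<grid_bnd, block, 0, s_bnd>>>(d_out, d_recv);
    cudaDeviceSynchronize();
}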
17 Minimal surface. Cut the lattice into hypercubes with the same dimensions. The longest dimension is always cut first, and an already-cut dimension is preferred. As the lattice is cut, the boundary-to-interior ratio increases. [Table: for each GPU count, the interior and boundary site counts (N_int, N_boun) and the local sub-lattice dimensions, shrinking from 24×24×24×64 on one GPU down to 12×12×12-sized slices on 32 GPUs.]
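An illustrative host-side version of this cutting rule (not the authors' code): halve one dimension per factor of two in the GPU count, picking the longest dimension and, on ties, preferring one that has already been cut.

#include <vector>
#include <cstdio>

std::vector<int> partition(std::vector<int> dims, int ngpus) {
    std::vector<bool> cut(dims.size(), false);
    for (int n = 1; n < ngpus; n *= 2) {         // one halving per factor of 2
        int best = -1;
        for (size_t d = 0; d < dims.size(); ++d) {
            if (dims[d] % 2 != 0) continue;      // dimension must stay divisible
            if (best < 0 || dims[d] > dims[best] ||
                (dims[d] == dims[best] && cut[d] && !cut[best]))
                best = (int)d;
        }
        if (best < 0) break;                     // nothing left to cut
        dims[best] /= 2;
        cut[best] = true;
    }
    return dims;                                 // local sub-lattice dimensions
}

int main() {
    auto local = partition({24, 24, 24, 64}, 32);
    for (int d : local) std::printf("%d ", d);   // local dims for 32 GPUs
    std::printf("\n");
}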
18 Dslash anatomy. [Figure: timeline of one dslash application across two CUDA streams: stream 1 runs the bulk dslash; stream 2 runs 1: gather, 2: GPU→CPU copy over PCI, 3: CPU→CPU communication over Infiniband, 4: CPU→GPU copy over PCI, 5: scatter.]
19 Dslash timing. [Table: timing breakdown vs GPU count: gather, scatter, GPU→CPU, CPU→CPU, and CPU→GPU copies, total communication, and bulk dslash.]
20 Strong scaling for 24³×64. [Figure: performance per GPU (GFLOPS) vs GPU count, comparing double precision, single precision, and the performance model.]
21 Comparison with other codes. [Table: performance of our code (double and single precision) vs QUDA (single precision) on 32³ lattices, at 16 and 32 GPUs.]
22 Overlap operator
23 Sign approximation. Polynomial approximation: $P(Q^2)\,\psi = \sum_{i=1}^{n} c_i\,T_i(Q^2)\,\psi$ (Chebyshev polynomials $T_i$). Rational approximation: $P(Q^2)\,\psi = \sum_{i=1}^{n} \frac{b_i}{Q^2 + c_i}\,\psi$. [Figure: time (s) vs GPU count for the double-pass (rational) and polynomial methods.]
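A sketch of how the polynomial approximation can be applied via the Clenshaw recurrence for the Chebyshev sum; apply_Q2 and the fused vector helper are hypothetical placeholders for the kernels described earlier.

#include <thrust/device_vector.h>
#include <thrust/fill.h>
#include <thrust/complex.h>
#include <vector>

using vec = thrust::device_vector< thrust::complex<double> >;

// Hypothetical placeholders:
void apply_Q2(vec& out, const vec& in);                    // out = Q^2 * in
void axpbypcz(double a, const vec& x, double b, const vec& y,
              double c, const vec& z, vec& out);           // out = ax + by + cz (elementwise; out may alias an input)

// y = sum_{k=0}^{n} c[k] T_k(Q^2) psi via Clenshaw:
// b_k = 2 Q^2 b_{k+1} - b_{k+2} + c_k psi,  y = Q^2 b_1 - b_2 + c_0 psi.
void cheby_apply(const std::vector<double>& c, const vec& psi, vec& y)
{
    const int n = (int)c.size() - 1;
    vec b1(psi.size()), b2(psi.size()), t(psi.size());
    thrust::fill(b1.begin(), b1.end(), thrust::complex<double>(0.0));
    thrust::fill(b2.begin(), b2.end(), thrust::complex<double>(0.0));
    for (int k = n; k >= 1; --k) {
        apply_Q2(t, b1);                            // t  = Q^2 b_{k+1}
        axpbypcz(2.0, t, -1.0, b2, c[k], psi, b2);  // b2 = b_k
        b1.swap(b2);                                // b1 = b_k, b2 = b_{k+1}
    }
    apply_Q2(t, b1);
    axpbypcz(1.0, t, -1.0, b2, c[0], psi, y);
}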
24 Wilson-Dirac kernel performance comparison. [Figure: performance (GFLOPS) vs GPU / equivalent CPU-core count, comparing the CPU and GPU codes.]
25 Performance comparison. The GPU cluster uses 1 GPU per node and QDR Infiniband interconnects. The CPU machine is a Cray XT5 with dual hex-core AMD processors. We compare the performance of 32 GPUs (the target cluster dimension) vs 256 CPU cores (the optimal performance point for the CPU).
26 Overlap performance. For a 24³×64 lattice, one overlap matrix-vector multiplication takes 1.1 s on 32 GPUs; on 256 cores of the Cray XT5 it takes 3.3 s. This translates into a ratio of 1 GPU = 24 CPU cores.
27 Hwilson eigensolver
28 Small eigenspace dimension. [Figure: two panels vs the number of deflated eigenvectors: the normalized eigenvalue $\bar\lambda = \lambda/\lambda_{\max}$ and the required polynomial order; the error decays as $\delta = A e^{-bn}$.]
29 Eigensolvers. We use an implicitly restarted Arnoldi factorization,
$$A V_k = V_k H_k + f_k e_k^\dagger \quad\text{with}\quad (e_k)_n = \delta_{k,n},$$
restarted with shifted QR steps: $H_k - \mu = QR$, $V_k \leftarrow V_k Q$, $H_k \leftarrow RQ + \mu$ (the basis update $V_k Q$ is why we also need an efficient matrix-matrix multiplication routine). The method requires storage for temporary vectors: for optimal convergence we need 2.5 times more vectors than requested, $k = 2.5\,\ell$. Each iteration requires $k$ matrix-vector multiplications and $k^2$ vector orthogonalizations. We use locking of the converged eigenvectors to accelerate convergence.
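For concreteness, a host-side sketch of the k-step Arnoldi factorization that the restarts build on; matvec stands in for the accelerated Hwilson application, the vector helpers are placeholders, and real arithmetic is used for brevity.

#include <cmath>
#include <vector>

using Vec = std::vector<double>;                  // stand-in for a GPU vector

// Hypothetical placeholders for accelerated routines:
void   matvec(Vec& out, const Vec& in);           // out = A * in
double dot(const Vec& x, const Vec& y);
void   axpy(double a, const Vec& x, Vec& y);      // y += a * x
void   scale(double a, Vec& x);                   // x *= a

// Build A V_k = V_k H_k + f_k e_k^T: V holds k+1 basis vectors (V[0] given,
// normalized); H is a (k+1) x k upper-Hessenberg matrix stored row-major.
void arnoldi(std::vector<Vec>& V, std::vector<double>& H, int k)
{
    for (int j = 0; j < k; ++j) {
        Vec w(V[j].size());
        matvec(w, V[j]);                          // w = A v_j
        for (int i = 0; i <= j; ++i) {            // Gram-Schmidt: the k^2 dots
            H[i*k + j] = dot(V[i], w);
            axpy(-H[i*k + j], V[i], w);
        }
        double beta = std::sqrt(dot(w, w));
        H[(j+1)*k + j] = beta;                    // subdiagonal of H
        scale(1.0/beta, w);
        V[j+1] = w;                               // next basis vector
    }
}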
30 Hwilson eigensolver. We use Chebyshev acceleration of order 100; the Arnoldi eigensolver then converges in one iteration. We compute 200 eigenvectors, which requires storage for 500 vectors = 85 GB. Total time: 0.27 hours on the GPU cluster vs 0.60 hours on the Cray XT5, which corresponds to 1 GPU = 18 CPU cores. In situations with reduced GPU memory we use a mixed mode where the eigensystem is stored in CPU memory; this is feasible due to the Chebyshev acceleration. In this mode the GPU code takes 0.43 hours.
31 Overlap eigensolver
32 Overlap eigensystem. Deflation speeds up inversions considerably. One propagator = 12 inversions. At $m_\pi = 200$ MeV, without deflation: 12 x 2,000 = 24,000 matrix-vector multiplications; with deflation: 6,600 + 12 x 200 = 9,000. For one propagator per configuration this is a 2.5 times speed-up. We compute the eigenvectors of the hermitian overlap operator and then rebuild the overlap eigenvectors.
33 Overlap eigensystem. We compute 100 eigenvector pairs to the target precision. On the GPU cluster this takes 2.7 hours; on the Cray machine it takes 10.6 hours, which translates into 1 GPU = 26 CPU cores. When memory is limited we can use a mixed mode, storing the overlap Krylov space in CPU memory; the code then takes 4 hours to converge.
34 Overlap inverter
35 Overlap inverter. We use $m_\pi = 200$ MeV and a precision of $10^{-8}$. We use an adaptive CG method, which is 60% faster than regular CG, and a multi-shifted inverter. We store the overlap eigensystem in CPU memory, and the Hwilson eigensystem and the solutions in GPU memory. The GPU cluster takes 0.52 hours vs 2.3 hours for the Cray machine; the performance translates to 1 GPU = 35 CPU cores.
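An illustrative deflated-CG skeleton (plain CG with an eigenvector-deflated starting guess; the adaptive, multi-shifted solver used in the talk adds shift and precision bookkeeping on top of this structure). All names are placeholders, and real arithmetic is used for brevity.

#include <cmath>
#include <vector>

using Vec = std::vector<double>;
void   matvec(Vec& out, const Vec& in);           // overlap operator (placeholder)
double dot(const Vec& x, const Vec& y);
void   axpy(double a, const Vec& x, Vec& y);      // y += a * x

// evecs/evals: precomputed low modes of the Hermitian operator, kept for deflation.
void solve(const Vec& b, Vec& x,
           const std::vector<Vec>& evecs, const std::vector<double>& evals,
           double tol)
{
    // Deflation: x0 = sum_i (v_i . b / lambda_i) v_i is the exact low-mode
    // part of the solution, removing the modes that dominate the condition number.
    x.assign(b.size(), 0.0);
    for (size_t i = 0; i < evecs.size(); ++i)
        axpy(dot(evecs[i], b) / evals[i], evecs[i], x);

    Vec r(b), Ap(b.size()), p;
    matvec(Ap, x);
    axpy(-1.0, Ap, r);                            // r = b - A x0
    p = r;
    double rr = dot(r, r), bb = dot(b, b);
    while (rr > tol * tol * bb) {                 // standard CG on the rest
        matvec(Ap, p);
        double alpha = rr / dot(p, Ap);
        axpy(alpha, p, x);
        axpy(-alpha, Ap, r);
        double rr_new = dot(r, r);
        double beta = rr_new / rr;
        for (size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
}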
36 Summary. [Table: CPU vs GPU resources per component.] Hwilson eigensolver: expensive orthogonalization; Chebyshev acceleration; 2.5 x 200 vectors. Overlap eigensolver: 200 Hwilson vectors; 2.5 x 100 eigenpairs. Overlap inverter: 100 overlap eigenpairs; 200 Hwilson vectors + solutions (100). Totals: pure GPU (32): 3.5 hours; Cray XT5 (256): 13.5 hours.
37 Conclusions. We showed how to efficiently implement the overlap operator on GPUs. For efficiency we need to store the data in GPU memory, which forces us to use GPUs in parallel. For the 24³×64 lattices of interest, the Wilson kernel scaling efficiency is 50% on 32 GPUs; the scaling efficiency is better than that of CPU codes of equivalent performance. For the sign function needed by the overlap operator, the polynomial approximation is better both in terms of memory use and performance. Most of the time is spent in the eigensolvers; we use implicitly restarted Arnoldi eigensolvers. On systems with reduced memory a mixed strategy can be used, with only a 50-60% performance penalty. Overall, the GPU/CPU performance ratio for our codes is compatible with the ratio measured for the dslash routine. This is not surprising, since the most time-consuming part of these codes is the dslash routine, but it takes careful planning to work around all possible bottlenecks.
38 Outlook. Most of the time is spent in the overlap eigensolver. Chebyshev acceleration: preliminary tests show a 20-30% boost. Mixed precision -- a different eigensolver method. Use a different inversion/deflation strategy.