Parallelization and benchmarks

Content
- Scalable implementation of the DFT/LCAO module
- Plane wave DFT code
- Parallel performance of the spin-free CC codes
- Scalability of the Tensor Contraction Engine based CC implementations
  - iterative methods
  - non-iterative approaches

Gaussian DFT computational kernel: evaluation of the XC potential matrix elements

    ρ(x_q) = Σ_{µν} D_{µν} χ_µ(x_q) χ_ν(x_q)
    F_{λσ} += Σ_q w_q χ_λ(x_q) V_xc[ρ(x_q)] χ_σ(x_q)

Dynamically load-balanced loop over work units (shared-counter pattern):

    my_next_task = SharedCounter()
    do i = 1, max_i
       if (i .eq. my_next_task) then
          call ga_get()    ! fetch the needed block of D_µν
          (do work)        ! evaluate ρ(x_q) and the local contribution to F_λσ
          call ga_acc()    ! accumulate the result into F_λσ
          my_next_task = SharedCounter()
       endif
    enddo
    barrier()

Both GA operations are strongly dependent on the communication latency.
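
The same shared-counter idea can be illustrated outside of Global Arrays. Below is a minimal Python sketch (not NWChem code; worker counts, task counts, and names are made up for illustration) of the dynamic load-balancing pattern above, with the ga_get/ga_acc steps reduced to comments:

    # A minimal sketch of the shared-counter work distribution shown above,
    # with Python multiprocessing standing in for Global Arrays. Each worker
    # atomically claims the next task index, so faster workers simply pick up
    # more tasks (dynamic load balancing).
    import multiprocessing as mp

    def next_task(counter):
        # Atomic fetch-and-increment of the shared task counter.
        with counter.get_lock():
            task = counter.value
            counter.value += 1
        return task

    def worker(rank, counter, n_tasks):
        done = []
        task = next_task(counter)
        while task < n_tasks:
            # Here the real code would ga_get a block of D, evaluate the local
            # XC contribution, and ga_acc the result into F.
            done.append(task)
            task = next_task(counter)
        print(f"worker {rank} processed tasks {done}")

    if __name__ == "__main__":
        n_workers, n_tasks = 4, 20
        counter = mp.Value("i", 0)   # plays the role of SharedCounter()
        procs = [mp.Process(target=worker, args=(r, counter, n_tasks))
                 for r in range(n_workers)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()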

Parallel scaling of the DFT code (Cray XK6): Si28O148H66, 3554 basis functions, LDA wavefunction and XC build timings.

Parallel scaling of the DFT code (Cray XK6): Si159O264H110, 7108 basis functions, LDA wavefunction and XC build timings.

Parallel timings for an AIMD simulation of UO2^2+ + 122 H2O
- Development of algorithms for AIMD has progressed in recent years.
- 0.1-10 seconds per step can be obtained on many of today's supercomputers for most AIMD simulations.
- However, large numbers of CPUs are often required: 5000 CPUs for 10 days is 5000 × 240 h ≈ 1.2 million CPU hours.
- It is very easy to use up 1-2 million CPU hours in a single simulation.

Plane wave DFT exact exchange timings: 576-atom cell of water (cutoff energy = 100 Ry). These calculations were performed on the Hopper Cray XE6 computer system at NERSC.

CCSD(T) run on Cray XT5: (H2O)24 water cluster, November 2009
- Floating-point performance at 223K cores: 1.39 PetaFLOP/s
- (H2O)24: 72 atoms, 1224 basis functions, cc-pVTZ(-f) basis

What is the Tensor Contraction Engine (TCE)?
- A symbolic manipulation tool and program generator.
- Automates the derivation of complex working equations from well-defined second-quantized many-electron theories.
- Synthesizes efficient parallel computer programs on the basis of these equations.
- Granularity of the parallel TCE CC codes is provided by so-called tiles, which define the partitioning of the whole spinorbital domain.

What is the Tensor Contraction Engine (TCE): tile structure
- The occupied and unoccupied spinorbitals (α and β) are each partitioned into tiles S1, S2, ...
- The tiles induce a block structure of the CC tensors: amplitude blocks such as T[h_n][p_m] (occupied tile h_n, unoccupied tile p_m) are distributed across CPU1, CPU2, CPU3, ..., CPUn.
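
A small illustrative sketch of this idea (tile sizes, orbital counts, and the round-robin CPU mapping below are assumptions for illustration, not the actual TCE data layout): tiling the occupied and unoccupied spinorbital ranges induces a block structure, and each block can be assigned to a processor.

    # Illustrative sketch of tile-induced blocking (assumed sizes, simple
    # round-robin mapping; not the actual TCE layout). Each (occupied tile,
    # unoccupied tile) pair labels one block of the singles amplitudes.

    def make_tiles(n_orbitals, tilesize):
        """Partition the index range 0..n_orbitals-1 into contiguous tiles."""
        return [range(start, min(start + tilesize, n_orbitals))
                for start in range(0, n_orbitals, tilesize)]

    n_occ, n_virt, tilesize, n_cpus = 10, 30, 6, 4
    occ_tiles = make_tiles(n_occ, tilesize)     # tiles h_1, h_2, ...
    virt_tiles = make_tiles(n_virt, tilesize)   # tiles p_1, p_2, ...

    # Enumerate the T1 blocks T[h_n][p_m] and assign them round-robin to CPUs.
    blocks = [(hi, pi) for hi in range(len(occ_tiles))
                       for pi in range(len(virt_tiles))]
    for task_id, (hi, pi) in enumerate(blocks):
        cpu = task_id % n_cpus
        shape = (len(occ_tiles[hi]), len(virt_tiles[pi]))
        print(f"T1 block (h{hi+1}, p{pi+1}) shape {shape} -> CPU {cpu}")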

New elements of parallel design for the iterative CCSD/EOMCCSD methods
- Use of Global Arrays (GA) to implement a distributed-memory model.
- Basic challenges of the iterative CCSD/EOMCCSD methods:
  - global memory requirements
  - complex load balancing
  - complicated communication pattern: use of one-sided ga_get, ga_put, ga_acc
- Implementation improvements:
  - new way of representing antisymmetric 2-electron integrals for the restricted (RHF) and restricted open-shell (ROHF) references
  - replication of low-rank tensors
  - new task scheduling for the CCSD/EOMCCSD methods

Scalability of the iterative EOMCC methods
- Alternative task schedulers:
  - use a global task pool
  - improve load balancing
  - reduce the number of synchronization steps to the absolute minimum
  - larger tiles can be used effectively

New elements of parallel design for the non-iterative CR-EOMCCSD(T) method
- Use of Global Arrays (GA) to implement a distributed-memory model.
- Basic challenges for the non-iterative CR-EOMCCSD(T) method:
  - local memory requirements: (tilesize)^4 (EOMCCSD) vs. M*(tilesize)^6 (CR-EOMCCSD(T))
- Implementation improvements:
  - two-fold reduction of local memory use: 2*(tilesize)^6
  - new algorithms which enable the decomposition of six-dimensional tensors
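
A quick back-of-the-envelope check (my own illustration, assuming double precision at 8 bytes per element and arbitrary example tile sizes) shows why the (tilesize)^6 work buffers, not the (tilesize)^4 ones, dominate local memory and why reducing their count and decomposing six-dimensional tensors matters:

    # Local-memory estimate for tile-sized work buffers (double precision,
    # 8 bytes/element; tile sizes below are only illustrative).
    BYTES_PER_DOUBLE = 8

    def buffer_gib(tilesize, order, n_buffers=1):
        """Memory (GiB) of n_buffers dense blocks with tilesize**order elements."""
        return n_buffers * tilesize**order * BYTES_PER_DOUBLE / 2**30

    for tilesize in (20, 30, 40):
        four = buffer_gib(tilesize, 4)      # EOMCCSD-type (tilesize)^4 block
        six2 = buffer_gib(tilesize, 6, 2)   # 2*(tilesize)^6 after the reduction
        print(f"tilesize={tilesize:3d}: (tilesize)^4 = {four:8.4f} GiB, "
              f"2*(tilesize)^6 = {six2:8.2f} GiB")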

Scalability of the non-iterative EOMCC code
- 94% parallel efficiency using 210,000 cores.
- Scalability of the triples part of the CR-EOMCCSD(T) approach for the FBP-f-coronene system in the AVTZ basis set. Timings were determined from calculations on the Jaguar Cray XT5 computer system at NCCS.

Multireference CC methods
- Multireference CC methods in NWChem (next release):
  - strongly correlated excited states
- Implemented MRCC approaches:
  - Brillouin-Wigner MRCCSD
  - Mukherjee Mk-MRCCSD approach
  - state-universal MRCCSD (under testing)
  - perturbative triples corrections: MRCCSD(T)
- Novel parallelization strategies based on processor groups.
- Demonstrated scalability of MRCCSD across 24,000 cores.

Questions?