Software implementation of correlated quantum chemistry methods. Exploiting advanced programming tools and new computer architectures


Software implementation of correlated quantum chemistry methods. Exploiting advanced programming tools and new computer architectures
Evgeny Epifanovsky, Q-Chem
September 29, 2015

Acknowledgments
Many thanks to my collaborators:
- Michael Wormit (Heidelberg)
- Ilya Kaliman and Anna Krylov (USC)
- Edgar Solomonik (ETH)
- Khaled Ibrahim and Samuel Williams (LBL)

Anatomy of a QC computation
- Single point energy
- Iterative solver
- Programmable tensor expressions
- Tensor contractions
- BLAS and its extensions

Programming technologies

Coupled cluster methods in Q-Chem

2010-2012
    Ground state:  MP2, QCISD, CCD, CCSD, CCSD(T), (dT), (fT)
    Excited state: CISD, EOM-CCSD (EA, EE, IP, SF, DIP, DSF), IP-CISD, EA-CISD
    Properties:    OPDM, TPDM, properties (all methods), gradient (CC, EOM)
2013
    Ground state:  RI-CCSD
    Excited state: RI-EOM-CCSD (EA, EE, IP, SF)
    Properties:    RI-OPDM, RI-TPDM, RI properties
2013-2015
    Ground state:  CS/CX-MP2, CS/CX-CCSD
    Excited state: CS/CX-CISD, CS/CX-EOM-CCSD (EA, EE, IP, SF)
    Properties:    CS/CX-OPDM, real and complex Dyson orbitals, two-photon absorption, spin-orbit coupling

- Over 1000 programmable expressions implemented.
- Work by a single academic research group (Krylov @ USC).
- 14 contributors.
- 4-5 persons working on method development at a given time.

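Coupled-cluster doubles (CCD) equations

$$D_{ij}^{ab} = \varepsilon_i + \varepsilon_j - \varepsilon_a - \varepsilon_b$$

$$
\begin{aligned}
T_{ij}^{ab} D_{ij}^{ab} = {} & \langle ij \| ab \rangle
 + P(ab) \Big( \sum_{c} f_{bc}\, t_{ij}^{ac} - \frac{1}{2} \sum_{klcd} \langle kl \| cd \rangle\, t_{kl}^{bd}\, t_{ij}^{ac} \Big) \\
& - P(ij) \Big( \sum_{k} f_{jk}\, t_{ik}^{ab} + \frac{1}{2} \sum_{klcd} \langle kl \| cd \rangle\, t_{jl}^{cd}\, t_{ik}^{ab} \Big) \\
& + \frac{1}{2} \sum_{kl} \langle ij \| kl \rangle\, t_{kl}^{ab}
 + \frac{1}{4} \sum_{klcd} \langle kl \| cd \rangle\, t_{ij}^{cd}\, t_{kl}^{ab}
 + \frac{1}{2} \sum_{cd} \langle ab \| cd \rangle\, t_{ij}^{cd} \\
& - P(ij) P(ab) \Big( \sum_{kc} \langle kb \| jc \rangle\, t_{ik}^{ac} - \frac{1}{2} \sum_{klcd} \langle kl \| cd \rangle\, t_{lj}^{db}\, t_{ik}^{ac} \Big)
\end{aligned}
$$

$$P(ij) A_{ij} = A_{ij} - A_{ji}$$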
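One way to organize these equations for implementation is to fold every term that is quadratic in the amplitudes into a one- or two-particle intermediate, so that the update becomes linear in $t$ given the intermediates. The grouping below is a sketch read off from the intermediates `f1_oo`, `f1_vv`, `ii_oooo` and `ii_ovov` in the tensor expressions on the next slide; the symbols $F$ and $W$ are our own notation, not the slide's.

```latex
\begin{aligned}
F_{jk}   &= f_{jk} + \tfrac{1}{2} \sum_{lcd} \langle kl \| cd \rangle\, t_{jl}^{cd}, &
F_{bc}   &= f_{bc} - \tfrac{1}{2} \sum_{kld} \langle kl \| cd \rangle\, t_{kl}^{bd}, \\
W_{ijkl} &= \langle ij \| kl \rangle + \tfrac{1}{2} \sum_{cd} \langle kl \| cd \rangle\, t_{ij}^{cd}, &
W_{kbjc} &= \langle kb \| jc \rangle - \tfrac{1}{2} \sum_{ld} \langle kl \| cd \rangle\, t_{lj}^{db}, \\[4pt]
T_{ij}^{ab} D_{ij}^{ab} &= \langle ij \| ab \rangle
  + P(ab) \sum_{c} F_{bc}\, t_{ij}^{ac}
  - P(ij) \sum_{k} F_{jk}\, t_{ik}^{ab} \\
 &\quad + \tfrac{1}{2} \sum_{kl} W_{ijkl}\, t_{kl}^{ab}
  + \tfrac{1}{2} \sum_{cd} \langle ab \| cd \rangle\, t_{ij}^{cd}
  - P(ij) P(ab) \sum_{kc} W_{kbjc}\, t_{ik}^{ac}
\end{aligned}
```

Expanding $\tfrac{1}{2} \sum_{kl} W_{ijkl}\, t_{kl}^{ab}$ recovers both the $\tfrac{1}{2}\langle ij \| kl \rangle$ term and the $\tfrac{1}{4}$ quadratic term, which is why the factor $\tfrac{1}{4}$ does not appear explicitly in the factorized form.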

Tensor expressions for CCD

    void ccd_t2_update(...) {
        letter i, j, k, l, a, b, c, d;
        btensor<2> f1_oo(oo), f1_vv(vv);
        btensor<4> ii_oooo(oooo), ii_ovov(ovov);

        // Compute intermediates
        f1_oo(i|j) = f_oo(i|j)
            + 0.5 * contract(k|a|b, i_oovv(j|k|a|b), t2(i|k|a|b));
        f1_vv(b|c) = f_vv(b|c)
            - 0.5 * contract(k|l|d, i_oovv(k|l|c|d), t2(k|l|b|d));
        ii_oooo(i|j|k|l) = i_oooo(i|j|k|l)
            + 0.5 * contract(a|b, i_oovv(k|l|a|b), t2(i|j|a|b));
        ii_ovov(i|a|j|b) = i_ovov(i|a|j|b)
            - 0.5 * contract(k|c, i_oovv(i|k|b|c), t2(k|j|c|a));

        // Compute updated T2
        t2new(i|j|a|b) = i_oovv(i|j|a|b)
            + asymm(a, b, contract(c, t2(i|j|a|c), f1_vv(b|c)))
            - asymm(i, j, contract(k, t2(i|k|a|b), f1_oo(j|k)))
            + 0.5 * contract(k|l, ii_oooo(i|j|k|l), t2(k|l|a|b))
            + 0.5 * contract(c|d, i_vvvv(a|b|c|d), t2(i|j|c|d))
            - asymm(a, b, asymm(i, j,
                  contract(k|c, ii_ovov(k|b|j|c), t2(i|k|a|c))));
    }
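To make the notation concrete, here is a minimal dense-loop sketch of what a single term of the T2 update, the `asymm(a, b, contract(c, ...))` term with `f1_vv`, evaluates to. It is not libtensor code: `Tensor4`, `contract_over_c` and `asymm_ab` are hypothetical helpers written for this illustration, and they ignore blocking and symmetry entirely.

```cpp
#include <cstddef>
#include <vector>

// Dense tensor x(i,j,a,b): i,j run over `no` occupied and a,b over `nv`
// virtual indices. Hypothetical helper type, not part of libtensor.
struct Tensor4 {
    std::size_t no, nv;
    std::vector<double> data;

    Tensor4(std::size_t no_, std::size_t nv_)
        : no(no_), nv(nv_), data(no_ * no_ * nv_ * nv_, 0.0) {}

    double& at(std::size_t i, std::size_t j, std::size_t a, std::size_t b) {
        return data[((i * no + j) * nv + a) * nv + b];
    }
    double at(std::size_t i, std::size_t j, std::size_t a, std::size_t b) const {
        return data[((i * no + j) * nv + a) * nv + b];
    }
};

// Contraction over c: r(i,j,a,b) = sum_c t2(i,j,a,c) * f(b,c).
// f is a row-major nv x nv matrix.
Tensor4 contract_over_c(const Tensor4& t2, const std::vector<double>& f) {
    Tensor4 r(t2.no, t2.nv);
    for (std::size_t i = 0; i < t2.no; ++i)
        for (std::size_t j = 0; j < t2.no; ++j)
            for (std::size_t a = 0; a < t2.nv; ++a)
                for (std::size_t b = 0; b < t2.nv; ++b)
                    for (std::size_t c = 0; c < t2.nv; ++c)
                        r.at(i, j, a, b) += t2.at(i, j, a, c) * f[b * t2.nv + c];
    return r;
}

// Antisymmetrization in (a,b): r(i,j,a,b) = x(i,j,a,b) - x(i,j,b,a).
Tensor4 asymm_ab(const Tensor4& x) {
    Tensor4 r(x.no, x.nv);
    for (std::size_t i = 0; i < x.no; ++i)
        for (std::size_t j = 0; j < x.no; ++j)
            for (std::size_t a = 0; a < x.nv; ++a)
                for (std::size_t b = 0; b < x.nv; ++b)
                    r.at(i, j, a, b) = x.at(i, j, a, b) - x.at(i, j, b, a);
    return r;
}
```

Calling `asymm_ab(contract_over_c(t2, f1_vv))` gives the dense equivalent of that term. The point of the expression interface is that the same one-line statement can be evaluated over blocked, symmetry-packed storage without the programmer writing any of these loops.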

Block tensors in libtensor
Three components:
- Block tensor space: dimensions + tiling pattern.
- Symmetry relations between blocks.
- Non-zero canonical data blocks.

Symmetry: $S: S B_i \to (B_j, U_{ij})$
Types of symmetry: permutational, point group, spin.
[Figure: block diagrams illustrating permutational, point group, and spin symmetry.]
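The three components can be illustrated with a small sketch, assuming the simplest possible case: an antisymmetric matrix, A(p,q) = -A(q,p), tiled into square blocks. Only non-zero canonical blocks (I <= J) are stored; any other block is recovered from its canonical partner by a transformation (here a transpose and a sign change), which plays the role of the mapping from a block to a canonical block plus a transformation. The class `AntisymBlockMatrix` and its methods are hypothetical and are not libtensor's interface.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

typedef std::pair<std::size_t, std::size_t> BlockIndex;

// Antisymmetric matrix stored as non-zero canonical blocks only.
// Hypothetical illustration; not the libtensor data structure.
class AntisymBlockMatrix {
public:
    explicit AntisymBlockMatrix(std::size_t block_size) : bs_(block_size) {}

    // Store a canonical block as a row-major bs x bs array.
    // The caller must pass I <= J; a diagonal block (I == J) must itself
    // be antisymmetric for the matrix as a whole to be antisymmetric.
    void set_canonical_block(std::size_t I, std::size_t J,
                             const std::vector<double>& block) {
        canonical_[BlockIndex(I, J)] = block;
    }

    // Element access through the symmetry relation.
    double get(std::size_t row, std::size_t col) const {
        std::size_t I = row / bs_, r = row % bs_;
        std::size_t J = col / bs_, c = col % bs_;
        double sign = 1.0;
        if (I > J) {              // non-canonical block: map to (J, I),
            std::swap(I, J);      // transpose within the block,
            std::swap(r, c);
            sign = -1.0;          // and flip the sign
        }
        std::map<BlockIndex, std::vector<double> >::const_iterator it =
            canonical_.find(BlockIndex(I, J));
        if (it == canonical_.end()) return 0.0;  // zero blocks are not stored
        return sign * it->second[r * bs_ + c];
    }

    std::size_t stored_blocks() const { return canonical_.size(); }

private:
    std::size_t bs_;                                       // tile size
    std::map<BlockIndex, std::vector<double> > canonical_; // canonical data
};
```

For a 4 x 4 matrix with 2 x 2 tiles and a single non-zero off-diagonal block, one stored block answers queries for both the (0,1) and (1,0) tiles, and the absent diagonal tiles read as zero.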

Front end: architecture-independent programming interface
Middleware: preparation of platform-specific tasks
Back end: platform-specific optimized kernels

TCE in NWChem
Front end: equation specification via GUI
Middleware: equation factorization and code generation
Back end: autogenerated Fortran code

libtensor in Q-Chem
Front end: tensor expressions
Middleware: runtime expression AST optimization
Back end: one of the back ends (native, XM, CTF)

Algorithms
1. Virtual memory (RAM + disk) based block tensors (native). Targets large-memory machines with fast disk. Most efficient in-core; loses efficiency when spillover to disk is significant.
2. Disk-based tensor contraction algorithm (XM by Ilya Kaliman). Targets machines with fast disk; loses efficiency when the job fits in RAM.
3. Distributed parallel in-core memory tensor library (CTF by Edgar Solomonik). Targets highly parallel machines with low memory per node and no disk.
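The trade-off between in-core and disk-based back ends comes down to how many tiles must be moved between slow and fast storage. The sketch below is a generic tiled matrix multiplication, not the XM or native libtensor algorithm: `TileStore` is a hypothetical stand-in for disk storage that simply counts how many tiles are fetched, and the tile size is assumed to divide the matrix dimension.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for on-disk storage of an n x n row-major matrix,
// read one tile x tile block at a time.
struct TileStore {
    std::size_t n, tile;
    std::vector<double> values;
    std::size_t fetches;  // number of tiles read so far

    TileStore(std::size_t n_, std::size_t tile_, const std::vector<double>& v)
        : n(n_), tile(tile_), values(v), fetches(0) {}

    // Copy tile (bi, bj) into a contiguous buffer, as a disk read would.
    std::vector<double> fetch(std::size_t bi, std::size_t bj) {
        ++fetches;
        std::vector<double> buf(tile * tile);
        for (std::size_t r = 0; r < tile; ++r)
            for (std::size_t c = 0; c < tile; ++c)
                buf[r * tile + c] = values[(bi * tile + r) * n + (bj * tile + c)];
        return buf;
    }
};

// C = A * B computed tile by tile. Only one tile of A, one tile of B and one
// accumulator tile are held in memory at any time.
std::vector<double> tiled_multiply(TileStore& a, TileStore& b) {
    const std::size_t n = a.n, t = a.tile, nb = n / t;  // assumes t divides n
    std::vector<double> c(n * n, 0.0);
    for (std::size_t bi = 0; bi < nb; ++bi) {
        for (std::size_t bj = 0; bj < nb; ++bj) {
            std::vector<double> acc(t * t, 0.0);
            for (std::size_t bk = 0; bk < nb; ++bk) {
                const std::vector<double> ta = a.fetch(bi, bk);
                const std::vector<double> tb = b.fetch(bk, bj);
                for (std::size_t r = 0; r < t; ++r)
                    for (std::size_t k = 0; k < t; ++k)
                        for (std::size_t col = 0; col < t; ++col)
                            acc[r * t + col] += ta[r * t + k] * tb[k * t + col];
            }
            for (std::size_t r = 0; r < t; ++r)
                for (std::size_t col = 0; col < t; ++col)
                    c[(bi * t + r) * n + (bj * t + col)] = acc[r * t + col];
        }
    }
    return c;
}
```

With `nb` tiles per dimension, this loop order fetches `nb^3` tiles from each operand. When everything fits in RAM those fetches are pure overhead, and when it does not, their cost is set by disk bandwidth, which is the regime the three back ends above divide between them.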

AST Optimizations

$$I^{(1)}_{iajb} = \sum_{c} \langle ia \| bc \rangle\, t_{j}^{c}$$

$$t_{ij}^{ab} = P(ij) P(ab) \sum_{kc} I^{(1)}_{kbic}\, t_{jk}^{ac} + P(ij) \sum_{c} \langle jc \| ba \rangle\, t_{i}^{c}$$

    { = i1(i,a,j,b) { * ovvv(i,a,b,c) t1(j,c) } }
    { = t2(i,j,a,b) { + { asym(i,j; a,b) { * i1(k,b,i,c) t2(j,k,a,c) } } { asym(i,j) { * ovvv(j,c,b,a) t1(i,c) } } } }

For disk-based block tensors:

    { = i1(i,a,j,b) { * ovvv(i,a,b,c) t1(j,c) } }
    { = x(i,j,a,b) { * ovvv(j,c,b,a) t1(i,c) } }
    { += x(i,j,a,b) { asym(a,b) { * i1(k,b,i,c) t2(j,k,a,c) } } }
    { = t2(i,j,a,b) { asym(i,j) x(i,j,a,b) } }
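The disk-based rewrite relies on antisymmetrization being linear: assuming `asym(i,j; a,b)` means `asym(i,j)` applied to `asym(a,b)`, the shared `asym(i,j)` can be factored out, so both terms are accumulated into one temporary `x` and antisymmetrized in (i,j) once. The sketch below checks that identity numerically on small dense arrays; `Tensor`, `asym_ij`, `asym_ab`, `direct_form` and `accumulated_form` are hypothetical names for this illustration only.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<double> Tensor;  // x(i,j,a,b), every dimension equal to N
const std::size_t N = 2;

inline std::size_t idx(std::size_t i, std::size_t j, std::size_t a, std::size_t b) {
    return ((i * N + j) * N + a) * N + b;
}

// r(i,j,a,b) = x(i,j,a,b) - x(j,i,a,b)
Tensor asym_ij(const Tensor& x) {
    Tensor r(x.size());
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t a = 0; a < N; ++a)
                for (std::size_t b = 0; b < N; ++b)
                    r[idx(i, j, a, b)] = x[idx(i, j, a, b)] - x[idx(j, i, a, b)];
    return r;
}

// r(i,j,a,b) = x(i,j,a,b) - x(i,j,b,a)
Tensor asym_ab(const Tensor& x) {
    Tensor r(x.size());
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t a = 0; a < N; ++a)
                for (std::size_t b = 0; b < N; ++b)
                    r[idx(i, j, a, b)] = x[idx(i, j, a, b)] - x[idx(i, j, b, a)];
    return r;
}

Tensor add(const Tensor& x, const Tensor& y) {
    Tensor r(x.size());
    for (std::size_t k = 0; k < x.size(); ++k) r[k] = x[k] + y[k];
    return r;
}

// Original tree: asym(i,j; a,b) A + asym(i,j) B, two antisymmetrized
// results formed separately and then summed.
Tensor direct_form(const Tensor& A, const Tensor& B) {
    return add(asym_ij(asym_ab(A)), asym_ij(B));
}

// Rewritten tree: x = B; x += asym(a,b) A; result = asym(i,j) x.
Tensor accumulated_form(const Tensor& A, const Tensor& B) {
    Tensor x = B;
    x = add(x, asym_ab(A));
    return asym_ij(x);
}
```

Here `A` stands for the contraction of `i1` with `t2` and `B` for the contraction of `ovvv` with `t1`. Both forms give the same tensor, but the accumulated form applies the (i,j) antisymmetrization once to a single four-index temporary, which matters when each pass over such an object goes through disk.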

For CTF:

    { = i1(i,a,j,b) { * ovvv(i,a,b,c) t1(j,c) } }
    { = t2(i,j,a,b) { asym(i,j; a,b) { * i1(k,b,i,c) t2(j,k,a,c) } } }
    { += t2(i,j,a,b) { asym(i,j) { * ovvv(j,c,b,a) t1(i,c) } } }

Benchmarks

Benchmarks
Tests performed on 2 × 8-core Sandy Bridge, 384 GB. Time to solve equations.

    System                             Method   Steps   BT      XM       CTF
    Uracil/cc-pVDZ (21 O, 103 V, Cs)   CCSD     10      15 s    66 s     169 s
                                       EOM-EE   63      46 s    661 s    869 s
    Uracil/cc-pVTZ (21 O, 267 V, Cs)   CCSD     10      273 s   1174 s   1248 s
                                       EOM-EE   74      537 s   6074 s   3047 s

AATT/cc-pVDZ (98 O, 506 V, C1): CCSD, 12 steps: 160 h, 92 h; EOM-IP, 32 steps: 64 m, 196 m. Only two of the three timings are given for this system.

[Figure: structures of uracil and AATT.]

Benchmarks
Tests performed on NERSC Hopper system: 2 × 12-core AMD Magny Cours, 32 GB (64 GB*). Time to solve equations.

    System           Method   Steps   BT      CTF-1    CTF-4
    Uracil/cc-pVDZ   CCSD     10      64 s    179 s    139 s
                     EOM-EE   64      144 s   809 s    696 s

    System           Method   Steps   BT*     CTF-16   CTF-64
    Uracil/cc-pVTZ   CCSD     10      14 m    9 m      4.6 m
                     EOM-EE   64      39 m    39 m     21.8 m

    System           Method   Steps   CTF-256*
    AATT/cc-pVDZ     CCSD     12      2.9 h
                     EOM-IP   32      235 s

Benchmarks
Tests performed on NERSC Babbage system: 2 × 8-core Sandy Bridge, 64 GB, 2 Knights Corner cards. Time to solve equations on Sandy Bridge and Intel KNC.

    System           Method   Steps   BT      XM      XM (AO)
    Uracil/cc-pVDZ   CCSD     10      15 s    74 s    83 s
                     EOM-EE   63      44 s    462 s   468 s
    Uracil/cc-pVTZ   CCSD     (no timings given)
                     EOM-EE   (no timings given)

Conclusions
- The changing landscape in computer technology forces us to make choices about developing and supporting scientific software.
- Following appropriate software design and development methodologies enables efficient use of computer and human resources.