ab initio Electronic Structure Calculations


ab initio Electronic Structure Calculations: New scalability frontiers using the BG/L Supercomputer
C. Bekas, A. Curioni and W. Andreoni
IBM Zurich Research Laboratory, Rueschlikon 8803, Switzerland

ab initio Electronic Structure Calculations: Why?
Applications range from semiconductors to drug-protein interaction.
ab initio electronic structure calculations in Molecular Dynamics (MD): simulate the behaviour of materials at the atomic level by applying the fundamental laws of Quantum Mechanics.
Why is it different from classical MD? The interaction between electrons is NOT modelled by empirical force fields but described quantum-mechanically.
The result is greatly improved accuracy (vs. classical MD) at the nano level, though it comes at a cost.

In this talk:
Introduction to the mathematical formulation of ab initio calculations, in particular the Density Functional Theory (DFT) formulation.
The parallel plane-wave implementation of ab initio calculations, in which we identify the computationally intensive spots: 3D parallel FFT transforms and the Task Groups strategy for parallel 3D FFTs.
Showcase code: CPV, included in www.quantum-espresso.org
Computational platform: the Blue Gene/L Supercomputer, a massively parallel system with key features particularly suitable for ab initio calculations:
Dual-core processors, delivering close to peak performance, 2.8 Gflop/s (5.6 per node), for key linear algebra kernels (i.e. DGEMM)
A very fast, scalable interconnection network, providing both a 3D Torus and a Tree network, ideal for parallel 3D FFT kernels
Demonstrated performance for ab initio calculations: 110.4 Tflop/s for the CPMD code (www.cpmd.org)

Formulation: The Wave Function
We seek the steady state of the electron distribution.
All (N) particles are described by a complicated wave function ψ.
ψ is a function on a 3N-dimensional space; in particular it is determined by the positions r_k of all (N) particles (including nuclei and electrons).
It is normalized so that the total probability of finding the particles anywhere in space equals one.
Max Born's probabilistic interpretation: considering a region D of this space, the integral of |ψ|² over D describes the probability of all particles being in region D.
Thus the distribution of the electrons e_i in space is defined by the wave function ψ.
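The formulas on this slide did not survive the transcription; the standard statements of the normalization condition and of Born's rule, which the text describes, read:

```latex
% Normalization of the many-body wave function over the 3N-dimensional configuration space
\int |\psi(\mathbf{r}_1,\dots,\mathbf{r}_N)|^{2}\, d\mathbf{r}_1 \cdots d\mathbf{r}_N = 1

% Born's rule: probability that all particles lie in a region D of configuration space
P(D) = \int_{D} |\psi(\mathbf{r}_1,\dots,\mathbf{r}_N)|^{2}\, d\mathbf{r}_1 \cdots d\mathbf{r}_N
```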

Mathematical Modeling: The Hamiltonian
Steady state of the electron distribution: the one that minimizes the total energy of the molecular system (the energy due to the dynamic interaction of all the particles involved, because of the forces that act upon them).
Hamiltonian H of the molecular system: the operator that governs the interaction of the involved particles.
Considering all forces between nuclei and electrons we have H = H_nucl + H_e + U_nucl + V_ext + U_ee, where
H_nucl: kinetic energy of the nuclei
H_e: kinetic energy of the electrons
U_nucl: interaction energy of the nuclei (Coulombic repulsion)
V_ext: electrostatic potential of the nuclei with which the electrons interact
U_ee: electrostatic repulsion between electrons
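For reference, a sketch of the explicit operators behind these five terms in their textbook form (atomic units assumed; the slide itself does not spell them out):

```latex
H = \underbrace{-\sum_{I}\frac{\nabla_{I}^{2}}{2M_{I}}}_{H_{\mathrm{nucl}}}
    \;\underbrace{-\sum_{i}\frac{\nabla_{i}^{2}}{2}}_{H_{e}}
    \;+\underbrace{\sum_{I<J}\frac{Z_{I}Z_{J}}{|\mathbf{R}_{I}-\mathbf{R}_{J}|}}_{U_{\mathrm{nucl}}}
    \;\underbrace{-\sum_{i,I}\frac{Z_{I}}{|\mathbf{r}_{i}-\mathbf{R}_{I}|}}_{V_{\mathrm{ext}}}
    \;+\underbrace{\sum_{i<j}\frac{1}{|\mathbf{r}_{i}-\mathbf{r}_{j}|}}_{U_{ee}}
```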

Density Functional Theory (DFT)
The high complexity is mainly due to the many-electron formulation of ab initio calculations: is there a way to come up with a one-electron formulation?
Key theory, DFT: Density Functional Theory (Hohenberg, Kohn, Sham).
The total ground-state energy of a molecular system is a functional of the ground-state electronic density (number of electrons per unit volume).
The energy of a system of electrons is at a minimum if it is evaluated at the exact density of the ground state!
This is an existence theorem: the density functional always exists, but the theorem does not prescribe a way to compute it.
This energy functional is highly complicated, thus approximations are considered concerning the kinetic energy and exchange-correlation energies of the system of electrons.

One-electron Schrödinger equation
Let the columns of Ψ hold the one-electron wave functions ψ_i corresponding to the electrons; each satisfies H ψ_i = E_i ψ_i.
This is an eigenvalue problem that becomes a usual algebraic eigenvalue problem once we discretize the ψ_i in a suitable basis.
It is a difficult and nonlinear problem, since the Hamiltonian and the wave functions depend upon each other through the unknown density functional.
Variational principle in the DFT framework (in simple terms!): finding the minimal energy and the corresponding electron distribution amounts to calculating the M smallest eigenvalues/eigenvectors of the Schrödinger equation, where M is the number of electrons of the system.
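A toy illustration of that last point (plain finite differences and SciPy, not the talk's plane-wave machinery; grid size, spacing and potential are all hypothetical): once the one-electron Hamiltonian is discretized, the task reduces to computing the M smallest eigenpairs of a matrix.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# 1D finite-difference Hamiltonian: -1/2 d^2/dx^2 plus a harmonic potential.
n, h = 2000, 0.01                          # hypothetical grid size and spacing
x = (np.arange(n) - n / 2) * h
kinetic = sp.diags([1, -2, 1], [-1, 0, 1], shape=(n, n)) / (-2 * h**2)
potential = sp.diags(0.5 * x**2)
H = (kinetic + potential).tocsc()

M = 8                                      # number of lowest states wanted
energies, states = eigsh(H, k=M, which='SA')   # M smallest algebraic eigenvalues
print(energies)                            # for this toy problem: ~0.5, 1.5, 2.5, ...
```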

Density Functional Theory: Formulation (1/2)
Equivalent eigenproblem (r denotes position in space, atomic units):
( −½ ∇² + V_tot(r) ) ψ_i(r) = E_i ψ_i(r),   with charge density ρ(r) = Σ_i |ψ_i(r)|²
Here ψ_i is the one-electron wave function, E_i the energy of the i-th state of the system, −½ ∇² the kinetic energy of electron e_i, V_tot(r) the total potential that acts on e_i at position r, and ρ(r) the charge density at position r.

Density Functional Theory: Formulation (2/2)
Furthermore, the total potential is the sum of: the potential due to the nuclei and core electrons, the Coulomb potential from the valence electrons, and the exchange-correlation potential, a function of the charge density ρ.
Non-linearity: the Hamiltonian depends upon the charge density ρ, while ρ itself depends upon the wave functions (eigenvectors) ψ_i.
Some sort of iteration is required until convergence is achieved!
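The iteration alluded to here is the self-consistent field (SCF) cycle. A schematic sketch of its structure follows; all helper names are hypothetical placeholders, not CPV's actual routines:

```python
import numpy as np

# Schematic SCF loop: rebuild the Hamiltonian from the current density, recompute
# the lowest M eigenstates, form a new density, and repeat until it stops changing.
def scf_loop(build_hamiltonian, solve_lowest_states, density_from_states,
             rho0, n_states, tol=1e-6, max_iter=100, mixing=0.3):
    rho = rho0
    for it in range(max_iter):
        H = build_hamiltonian(rho)                    # H depends on the density...
        energies, states = solve_lowest_states(H, n_states)
        rho_new = density_from_states(states)         # ...and the density on the states
        if np.linalg.norm(rho_new - rho) < tol:       # self-consistency reached
            return energies, states, rho_new
        rho = (1 - mixing) * rho + mixing * rho_new   # simple linear mixing for stability
    raise RuntimeError("SCF did not converge")
```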

Fourier Space Discretization: Plane Waves (PW)
Using a plane wave basis to express the wave functions has a number of advantages:
It is the natural basis for periodic systems like crystals.
Isolated systems can be readily represented using periodic cells.
Most parts of the Hamiltonian become diagonal matrices in the PW basis.
The maximum cut-off frequency is determined by the kinetic energy operator (Laplacian), allowing easy, systematic control of accuracy.
Quantities are calculated either in real or Fourier space, depending on where they are expressed locally (i.e. where the corresponding operators are diagonal). Efficient parallel 3D FFT transforms therefore become crucial in PW ab initio calculations.
Example, the charge density: it requires a double summation over plane waves, but only a short loop in real space. It is a major cost kernel and an opportunity for hierarchical parallelism.
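A minimal numpy sketch of why the charge density favours real space plus FFTs. Array shapes are hypothetical and the coefficients are assumed to be already laid out on a full 3D Fourier mesh, a simplification of CPV's G-vector lists:

```python
import numpy as np

# Sketch: transform each occupied state from G-space to real space with one 3D FFT,
# then accumulate f_i * |psi_i(r)|^2.  This replaces the double sum over plane waves
# with one FFT per state plus a cheap real-space loop.
def charge_density(coeffs, occupations):
    # coeffs: complex array of shape (n_states, nx, ny, nz); occupations: length n_states
    rho = np.zeros(coeffs.shape[1:])
    for c, f in zip(coeffs, occupations):
        psi_r = np.fft.ifftn(c)              # one 3D FFT per state: G-space -> real space
        rho += f * np.abs(psi_r) ** 2        # short loop in real space
    return rho

# The loop over states is independent state by state, which is exactly the
# hierarchical parallelism the Task Groups strategy exploits later in the talk.
```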

The BG/L Architecture
(Figure: the Blue Gene/L system is built up hierarchically.)
Chip: 2 processors, 2.8/5.6 GF/s, 4 MB
Compute card: 2 chips (1x2x1), 5.6/11.2 GF/s, 1.0 GB
Node card: 32 chips (4x4x2), 16 compute and 0-2 I/O cards, 90/180 GF/s, 16 GB
Rack: 32 node cards, 2.8/5.6 TF/s, 512 GB
System: 64 racks (64x32x32), 180/360 TF/s, 32 TB

BG/L Interconnection Network
3-Dimensional Torus: interconnects all compute nodes (65,536); virtual cut-through hardware routing; 1.4 Gb/s on all 12 node links (2.1 GB/s per node); the communications backbone for computations; 0.7/1.4 TB/s bisection bandwidth, 67 TB/s total bandwidth.
Global Collective Network: one-to-all broadcast functionality; reduction operations functionality; 2.8 Gb/s of bandwidth per link; latency of tree traversal 2.5 µs; ~23 TB/s total binary tree bandwidth (64k machine); interconnects all compute and I/O nodes (1024).
Low-Latency Global Barrier and Interrupt: round-trip latency 1.3 µs.
Control Network: boot, monitoring and diagnostics.
Ethernet: incorporated into every node ASIC; active in the I/O nodes (1:64); handles all external communication (file I/O, control, user interaction, etc.).

Distributed Memory Implementation in CPV
Distribute the PWs and the real-space mesh such that:
Each processor has the same number of plane waves.
A processor hosts full planes of real-space grid points.
All plane waves with common y and z components are on the same processor.
The number of different (y, z) pairs of plane-wave components is the same on each processor.
The number of real-space planes is the same on each processor.
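A small sketch of the flavour of such a layout (hypothetical and much simpler than CPV's actual distribution): columns of plane waves sharing a (y, z) pair are dealt out so that every processor ends up with about the same number of plane waves, and the real-space planes are split into near-equal blocks.

```python
import heapq

def distribute_columns(column_sizes, n_procs):
    """column_sizes: {(y, z): number of plane waves in that column}."""
    heap = [(0, p) for p in range(n_procs)]        # (plane waves so far, processor)
    heapq.heapify(heap)
    owner = {}
    # Greedy balancing: hand the largest remaining column to the least-loaded processor.
    for col, size in sorted(column_sizes.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)
        owner[col] = p
        heapq.heappush(heap, (load + size, p))
    return owner

def distribute_planes(n_planes, n_procs):
    """Contiguous, near-equal blocks of real-space planes per processor."""
    base, extra = divmod(n_planes, n_procs)
    return [base + (1 if p < extra else 0) for p in range(n_procs)]
```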

Distributed Memory 3D FFT
For each wave function: distribute its coefficients over the G-vectors across the z-direction, thus forming pencils.
(Figure: example with 2 processors; the pencils of the x-y-z grid are split between Processor 1 and Processor 2.)

Distributed Memory Implementation in CPV
The 3D FFT can be computed in 3 steps: 1D FFT across Z, 1D FFT across Y, 1D FFT across X.
Or the 3D FFT can be computed in two steps: 1D FFT across Z, then 2D FFT across the X-Y planes.
(Figure: 1D FFTs across Z, an ALLTOALL redistribution, then 2D FFTs across X-Y.)
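A minimal serial check (numpy only; the parallel version interposes the ALLTOALL between the two steps) that the two-step factorization reproduces the full 3D FFT:

```python
import numpy as np

# The 3D FFT is separable: 1D FFTs along z followed by 2D FFTs over the x-y planes
# give the same result as a direct 3D FFT.  In the distributed scheme each processor
# does the z-FFTs on its pencils, an ALLTOALL regroups the data into whole x-y planes,
# and each processor then does the 2D FFTs on its planes.
nx, ny, nz = 32, 32, 128                    # hypothetical mesh dimensions
psi = np.random.rand(nx, ny, nz) + 1j * np.random.rand(nx, ny, nz)

step1 = np.fft.fft(psi, axis=2)             # 1D FFTs across Z (on z-pencils)
step2 = np.fft.fft2(step1, axes=(0, 1))     # 2D FFTs across the X-Y planes

assert np.allclose(step2, np.fft.fftn(psi))  # identical to a direct 3D FFT
```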

Limited Scalability of the Standard 3D FFT
Each processor takes a number of whole planes.
This is a very good scheme for small to medium sized computational platforms, but observe that scalability is limited by the number of planes across the Z-direction, which is on the order of a few hundred, O(100).
Thus it is not appropriate for the BG/L system.

3D FFT Using Task Groups
The density calculation loops across the number of electrons, and each iteration requires one 3D FFT.
Hierarchical parallelism*: assign to each Task Group a number of iterations.
(Figure: baseline scheme with ALLTOALL communication; eigenstate 1 on processors 1-2, eigenstate 2 on processors 1-2.)
* J. Hutter and A. Curioni, Parallel Computing (31) 1, 2005

3D FFT Using Task Groups
The Task Groups of processors work on different eigenstates concurrently.
Number of processors per group: ideally the one that achieves the best scalability for the original parallel 3D FFT scheme.
(Figure: eigenstate 1 handled only by processor 1, eigenstate 2 only by processor 2.)
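A minimal mpi4py sketch of the task-group idea (the group count, state count and FFT helper are hypothetical placeholders, not CPV's actual implementation): split the global communicator into groups, give each group a subset of the electronic states, and let each group run its own distributed 3D FFTs.

```python
from mpi4py import MPI

world = MPI.COMM_WORLD
n_task_groups = 4                           # assumed: chosen so each group sits at the
group_size = world.size // n_task_groups    # sweet spot of the plane-wise 3D FFT

color = world.rank // group_size            # which task group this rank belongs to
tg_comm = world.Split(color=color, key=world.rank)

n_states = 320                              # occupied states (loop length for the density)
# Round-robin the states over task groups; each group FFTs only its own states.
my_states = [i for i in range(n_states) if i % n_task_groups == color]

for state in my_states:
    # distributed_fft(state, tg_comm)       # placeholder for the group's parallel 3D FFT
    pass

print(f"world rank {world.rank}: task group {color}, "
      f"group rank {tg_comm.rank}, first states {my_states[:4]}")
```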

Numerical Example
80 H2O molecules (A. Pasquarello): 240 atoms, 320 occupied states (i.e. loop length for the density), FFT mesh z length: 128.
(Figure: total time in seconds vs. number of compute nodes, 32 to 512, for the Density, TG Density, Forces and TG Forces kernels.)
The original parallel 3D FFT stops scaling at 128 nodes; with Task Groups the parallel 3D FFTs continue to scale: 256 nodes with 2 TGs, 512 nodes with 4 TGs.

Comparison of Kernels
80 H2O molecules (A. Pasquarello): 240 atoms, 320 occupied states (i.e. loop length for the density).
CPV code: uses soft pseudopotentials for the representation of the core electrons, which require a much smaller cutoff for the PWs.
The FFT kernels are no longer dominant for large numbers of CPUs; instead, the dense eigenproblems for orthogonalization stand out!
(Figure: percentage of total time vs. number of compute nodes, 16 to 512, for the TG Density, Ortho and TG Forces kernels.)
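To make the "Ortho" kernel concrete, here is a minimal numpy sketch of one standard way to orthonormalize the wave functions (Löwdin orthogonalization via a dense symmetric eigendecomposition); it is shown only as an illustration of why dense linear algebra eventually dominates, not as CPV's specific algorithm, and the matrix sizes are hypothetical.

```python
import numpy as np

# Orthonormalizing M states expanded in n_pw plane waves needs the M x M overlap
# matrix S = C^H C and a dense factorization of it.  For M = 320 occupied states
# this O(M^3) dense work, unlike the FFTs, does not shrink as nodes are added.
def lowdin_orthonormalize(C):
    # C: (n_pw, M) matrix of plane-wave coefficients, one column per state
    S = C.conj().T @ C                       # dense M x M overlap matrix
    evals, evecs = np.linalg.eigh(S)         # dense symmetric eigenproblem
    S_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.conj().T
    return C @ S_inv_sqrt                    # columns are now orthonormal

C = np.random.rand(4096, 320) + 1j * np.random.rand(4096, 320)  # hypothetical sizes
Q = lowdin_orthonormalize(C)
assert np.allclose(Q.conj().T @ Q, np.eye(320))
```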

Comparison vs. Rival Architectures
80 H2O molecules (A. Pasquarello): 240 atoms, 320 occupied states (i.e. loop length for the density).
Comparison against systems hosted at EPFL and the new Cray XT3 at CSCS.
The runs employed the TG strategy and the dual-core characteristics of the BG/L nodes: DGEMM* with double-FPU directives/code on each CPU.
* J. Gunnels et al., Supercomputing, 2005 (Gordon Bell prize)

Conclusions
ab initio electronic structure calculations have much to offer in the coming years with respect to a better understanding of physical processes at the nano scale.
Electronic structure codes are typical examples of large-scale parallel computing applications.
The advent of the pioneering, scalable BG/L system has triggered significant effort in scaling these codes to tens of thousands of processors; the largest Tflop/s count is achieved by such a code: CPMD.
FFT kernels are very important, and demonstrated scalability is here!
Linear algebra kernels once more appear very important.
The synergy of new numerical algorithms and parallel computing practices will be crucial in crossing into the Petaflop era.