CRYSTAL in parallel: replicated and distributed (MPP) data


CRYSTAL in parallel: replicated and distributed (MPP) data
Roberto Orlando
Dipartimento di Chimica, Università di Torino
Via Pietro Giuria 5, Torino (Italy)

Why parallel?
- Faster time to solution;
- More available memory;
- High Performance Computing (HPC) resources are available, and not many software packages can run efficiently on thousands of processors.

The programmer's concerns:
- Load imbalance: the time taken will be that of the longest job;
- Handling communications: the processors need to talk to each other, and communication is slow;
- Handling Input/Output (I/O): in most cases I/O is slow and should be avoided.

The user's concerns:
- Choose an appropriate number of processors for a job, depending on the problem size (mostly determined by the number of basis functions in the unit cell).

Amdahl's law

    S(n) = (S + P) / (S + P/n),   with S + P = 1

where n is the number of processors, S is the fraction of serial instructions, and P is the fraction of parallelized instructions. This is a somewhat frightening equation: however many processors are used, the speedup can never exceed 1/S.

Gustafson's law
BUT the relative values of S and P are a function of system size:
- Parallelize the more expensive parts (first);
- These typically become rapidly more expensive as the system size is increased;
- Parallelism is good for large systems!
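As a worked illustration of Amdahl's law (a minimal sketch, not part of CRYSTAL): with only 5% serial code the speedup saturates near its limit of 1/S = 20 long before the processor count matters.

```python
def amdahl_speedup(n, serial_fraction):
    """Amdahl's law: speedup on n processors when a fraction S of the
    work is serial and P = 1 - S is parallelized."""
    s = serial_fraction
    p = 1.0 - s
    return (s + p) / (s + p / n)

# With 5% serial code, speedup saturates well below the processor count:
for n in (4, 16, 64, 1024):
    print(n, round(amdahl_speedup(n, 0.05), 1))
# 4 -> 3.5, 16 -> 9.1, 64 -> 15.4, 1024 -> 19.6; the limit is 1/S = 20
```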

Load imbalance
Say we have twenty totally independent tasks and twenty processors. This is easy to parallelize: give each task to one of the processors. But what if the tasks don't all take the same time? The time taken will be that of the longest job. Because of load imbalance our speedup is less than perfect: we have too few tasks for too many processors. Don't use too many processors for too small a job.

Communications and I/O
But what if the tasks are not independent? The processors will need to talk to each other; this is known as communication. Communication is SLOW, but usually the computation requirement scales more rapidly than the communication. Depending on how the machine is set up, I/O on parallel machines can be VERY slow, so in general it is best to run "direct" (recomputing quantities rather than storing them on disk). This may not be true for medium-sized jobs on machines where each processor has a fast local disk.
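A toy illustration of why the longest task sets the wall time (hypothetical task durations, not CRYSTAL data):

```python
# Toy model: one task per processor; the wall time is the longest task.
task_times = [1.0] * 19 + [3.0]   # 19 uniform tasks plus one slow outlier

serial_time = sum(task_times)           # 22.0 time units on one processor
parallel_time = max(task_times)         # 3.0 time units on 20 processors
speedup = serial_time / parallel_time   # 7.3x instead of the ideal 20x

print(f"speedup on 20 processors: {speedup:.1f}x (ideal: 20x)")
```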

Towards large unit cell systems
A CRYSTAL job can be run:
- serially: crystal
- in parallel: Pcrystal, MPPcrystal

Both parallel versions use the Message Passing Interface (MPI) for communications. Pcrystal uses replicated data storage in memory. MPPcrystal targets large unit cell systems on high-performance computers: it uses parallel linear algebra library routines (ScaLAPACK) for diagonalization, matrix products and Cholesky decomposition, with enhanced distribution of data in memory among processors.

Running CRYSTAL in parallel

Pcrystal:
- full parallelism in the calculation of the interactions (one- and two-electron integrals);
- distribution of tasks in reciprocal space: one k point per processor;
- no calls to external libraries;
- few inter-process communications.

MPPcrystal:
- full parallelism in the calculation of the interactions (one- and two-electron integrals);
- double-level distribution of tasks in reciprocal space: one k point to a subset of processors;
- parallel linear algebra library routines (ScaLAPACK) for diagonalization, matrix products and Cholesky decomposition;
- enhanced distribution of data in memory among processors;
- many inter-process communications.
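A back-of-envelope sketch of the replicated vs distributed distinction (illustrative only; real CRYSTAL memory use involves many more arrays than one matrix):

```python
def matrix_memory_gib(n_basis, n_procs, distributed):
    """Memory per process (GiB) to hold one N x N double-precision
    matrix: a full copy if replicated, a 1/n_procs share if distributed."""
    bytes_total = n_basis**2 * 8
    per_proc = bytes_total / n_procs if distributed else bytes_total
    return per_proc / 1024**3

n = 77_560   # X10 supercell of MCM-41 (from a later slide)
print(f"replicated : {matrix_memory_gib(n, 256, False):6.1f} GiB/process")
print(f"distributed: {matrix_memory_gib(n, 256, True):6.1f} GiB/process")
# replicated: ~44.8 GiB per process; distributed over 256 procs: ~0.2 GiB
```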

Pcrystal and MPPcrystal in action
Example of a CRYSTAL calculation: 10,000 basis functions; 16 processors available; 4 k points sampled in reciprocal space.

[Diagram: the tasks in real space (integrals) keep all 16 processors active in both versions. For the tasks in reciprocal space, Pcrystal activates only 4 processors (one per k point, k1 to k4) and leaves 12 idle, whereas MPPcrystal keeps all 16 active by assigning each k point to a subset of processors.]

Pcrystal - Implementation
Standard compliant: Fortran 90; MPI for message passing.
Replicated data:
- Each k point is independent: each processor performs the linear algebra (FC = EC) for a subset of the k points that the job requires;
- Very few communications (potentially good scaling), but potential load imbalance;
- Each processor has a complete copy of all the matrices used in the linear algebra;
- The limit on the job size is given by the memory required to store the linear algebra matrices for one k point;
- The number of k points limits the number of processors that can be exploited: in general Pcrystal scales very well, provided the number of processors ≤ the number of k points.
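A minimal sketch of the replicated-data assignment above (hypothetical helper, not CRYSTAL code): k points are dealt out round-robin, so any processor beyond the k-point count sits idle during the reciprocal-space step.

```python
def assign_k_points(n_kpoints, n_procs):
    """Round-robin assignment of k points to processors,
    mimicking the replicated-data (Pcrystal) scheme."""
    assignment = {rank: [] for rank in range(n_procs)}
    for k in range(n_kpoints):
        assignment[k % n_procs].append(k)
    return assignment

# 4 k points on 16 processors: ranks 0-3 each get one k point,
# ranks 4-15 are idle while the linear algebra runs.
work = assign_k_points(4, 16)
idle = [rank for rank, ks in work.items() if not ks]
print(f"{len(idle)} of 16 processors idle: {idle}")
```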

MPPcrystal - Implementation I
Standard compliant: Fortran 90; MPI for message passing; ScaLAPACK 1.7 (Dongarra et al.) for linear algebra on distributed matrices.
Distributed data:
- Each processor holds only a part of each of the matrices used in the linear algebra (FC = EC);
- The number of processors that can be exploited is NOT limited by the number of k points (great for large Γ-point-only calculations);
- ScaLAPACK is used for, e.g., Cholesky decomposition, matrix-matrix multiplies and linear equation solves;
- As the data are distributed, communications are required to perform the linear algebra;
- However, there are O(N^3) operations but only O(N^2) data to communicate.

MPPcrystal - Implementation II
Scaling:
- Scaling gets better for larger systems;
- Very rough rule of thumb: a job with N basis functions can exploit up to around N/20 processors (optimal ratio: N/50);
- One further method that MPPcrystal uses is multilevel parallelism: with 4 real k points and 32 processors, each diagonalization is done by 8 processors, so each diagonalization only has to scale to a smaller processor count. This is complicated by complex k points. It is very useful for medium-to-large systems (for a big enough problem it can scale very well).

Non-implemented features in MPPcrystal:
MPPcrystal will fail quickly and cleanly if a requested feature is not implemented, such as:
- symmetry adaptation of the crystalline orbitals (for large high-symmetry systems Pcrystal may be more effective);
- CPHF;
- Raman intensities.
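The rule of thumb above is easy to encode; a hedged sketch (the N/20 and N/50 ratios are the rough guidance from these slides, not hard limits):

```python
def suggested_procs(n_basis_functions):
    """Rough guidance from the slides: up to ~N/20 processors can be
    exploited, with ~N/50 as the more efficient ratio."""
    return {"max_useful": n_basis_functions // 20,
            "optimal": n_basis_functions // 50}

# For the 10,000 basis-function example above:
print(suggested_procs(10_000))   # {'max_useful': 500, 'optimal': 200}
```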

MCM-41 mesoporous material model
P. Ugliengo, M. Sodupe, F. Musso, I. J. Bush, R. Orlando, R. Dovesi, Advanced Materials 20, (2008).
- B3LYP approximation
- Hexagonal lattice with P1 symmetry
- 580 atoms per cell (7,800 basis functions)

[Figure: IR spectrum recorded on a micelle-templated silica (MTS) calcined at 823 K, water outgassed at 423 K, compared with the B3LYP simulation. Simulated powder spectrum: no relevant reflections at higher 2θ because of short-range disorder.]

MCM-41: increasing the unit cell
R. Orlando, M. Delle Piane, I. J. Bush, P. Ugliengo, M. Ferrabone, R. Dovesi, J. Comput. Chem. 33, 2276 (2012).
Supercells of MCM-41 have been grown along the c crystallographic axis: Xn (side along c is n times that in X1). X10 contains 77,560 AOs in the unit cell. Calculations were run on the IBM SP6 at Cineca: Power6 processors (4.7 GHz) with a peak performance of 101 Tflop/s and an InfiniBand X4 DDR internal network.

[Figure: speedup vs number of cores (NC) for SCF + total energy gradient calculations.]

MCM-41: scaling of the main steps in MPPcrystal

[Figure: parallelization efficiency of the main steps for the X4 supercell (SCF + total energy gradient): two-electron integrals, one-electron integrals, Fock matrix diagonalization, exchange-correlation functional integration, and preliminary steps. Percentages measure parallelization efficiency; data in parentheses give the amount of time spent in each task.]

Running MCM-41 on different HPC architectures

[Figure: scaling of the X1 cell on the IBM Blue Gene/P at Cineca (Bologna), the Cray XE6 HECToR (Edinburgh), and the IBM SP6 at Cineca (Bologna).]

Memory storage optimization
Error messages such as "TOO MANY K POINTS IN THE PACK-MONKHORST NET: INCREASE LIM001" belong to the past: most of the static allocations have been made dynamic, so array sizes now fit the exact memory requirement and there is no need to recompile the code for large calculations. A few remaining fixed limits can be extended from input:
- CLUSTSIZE (maximum number of atoms in a generated cluster; default setting: the number of atoms in the unit cell);
- LATVEC (maximum number of lattice vectors to be classified; default value: 3500).

Arrays of size n_atoms^2 are distributed among the cores, and data are removed from memory as soon as they are no longer in use.

LOWMEM option
The LOWMEM keyword avoids the allocation of large arrays, generally with a slight increase in CPU time (it is the default in MPPcrystal):
- atomic orbital pair elements in matrices are located in real time, without storing a large lookup table in memory;
- Fock and density matrices are stored only in their irreducible forms; symmetry-related elements are computed in real time;
- the expansion of the AO pair electron density into multipole moments, for the bipolar approximation of two-electron integrals, is performed in real time instead of storing large buffers in memory;
- information about the grid of points used in the DFT exchange-correlation functional integration (point Cartesian coordinates, multiplicity, Becke's weights) is distributed among processors.

Dynamically allocated memory can be monitored by means of the MEMOPRT and MEMOPRT2 keywords.
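As an illustration of the last point (a generic sketch, not CRYSTAL's actual data layout): block-distributing DFT grid-point records across MPI ranks so each processor stores only its own share.

```python
def local_grid_points(n_points, n_procs, rank):
    """Return the slice of DFT grid-point indices owned by `rank`
    when points are block-distributed across processors."""
    base, extra = divmod(n_points, n_procs)
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return range(start, start + size)

# 1,000,003 grid points over 64 ranks: each rank stores ~1/64 of the
# coordinates, multiplicities and Becke weights instead of all of them.
print(local_grid_points(1_000_003, 64, rank=0))   # range(0, 15626)
print(local_grid_points(1_000_003, 64, rank=63))  # range(984378, 1000003)
```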

Speeding up two-electron integrals

The two-electron contribution to the Fock matrix has the form

    F_12^g ← Σ_{3,4} Σ_{h,l} P_34^l [ (1_0 2_g | 3_h 4_{h+l}) − ½ (1_0 3_h | 2_g 4_{h+l}) ]

where electron 1 is carried by the AO pair (1_0, 2_g) and electron 2 by the pair (3_h, 4_{h+l}).

Integrals are screened on the basis of the overlap between atomic orbitals: in large unit cells a lot of (3, 4) pairs do not overlap with (1, 2_g). Moreover, integrals that differ only by a permutation of the atomic orbitals within each pair, or by an exchange of the two pairs (with the corresponding translation of the cell indices 0, g, h, h+l), are equivalent, so only one representative per class needs to be computed. Implemented for P1 symmetry.

[Figure: wall-clock time T (sec) vs supercell size Xn for the linearization and permutation-symmetry implementations.]

Improved memory storage in Pcrystal

    F^g → F^k → V^k† F^k V^k

The transformation of the Fock and the density matrix into the basis set of the Symmetry-Adapted Crystalline Orbitals (SACO) is operated directly from the irreducible F^g to each block of V^k† F^k V^k (one block per irreducible representation), without forming the full blocks of F^k:
- the maximum size of the matrices to be diagonalized is that of the largest block;
- parallelization goes from k points down to the irreducible representations (many more than the number of k points in highly symmetric cases).
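A minimal molecular-case sketch of the permutation symmetry (ignoring the cell indices g, h, l that the periodic case adds): map each index quadruplet to a canonical representative, so each unique integral is computed once.

```python
def canonical_quartet(i, j, k, l):
    """Canonical representative of a two-electron integral (ij|kl)
    under the 8 index permutations that leave its value unchanged
    (molecular case; the periodic case also carries cell indices)."""
    ij = (max(i, j), min(i, j))          # (ij| = (ji|
    kl = (max(k, l), min(k, l))          # |kl) = |lk)
    return max(ij, kl) + min(ij, kl)     # (ij|kl) = (kl|ij)

# Count unique integrals among 10 basis functions:
n = 10
unique = {canonical_quartet(i, j, k, l)
          for i in range(n) for j in range(n)
          for k in range(n) for l in range(n)}
print(len(unique), "unique vs", n**4, "total")   # 1540 unique vs 10000 total
```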

Memory storage for fullerenes of increasing size

[Table: (n,n)-fullerenes up to n = 7, with columns n, N_AO, S_irr, S_red; the numerical entries are not reproduced here.]

N_AO: number of basis functions
S_irr: size of the irreducible part of the overlap matrix represented in real space (number of matrix elements)
S_red: size of the full overlap matrix represented in real space (number of matrix elements)

Fullerenes: matrix block size in the SACO basis

[Table: (n,n)-fullerenes from (1,1) to (10,10), with the block sizes for the irreducible representations A_g, A_u, F_1g, F_1u, F_2g, F_2u, G_g, G_u, H_g, H_u, together with N_AO and t_SCF.]

t_SCF: wall-clock time (in seconds) for running 20 SCF cycles on a single core
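Why the largest SACO block sets the cost: diagonalizing a block-diagonal matrix block by block scales with the cube of the largest block, not of the full N_AO. A generic numpy sketch (illustrative block sizes, not the fullerene data):

```python
import numpy as np

def eigvals_blockwise(blocks):
    """Eigenvalues of a block-diagonal symmetric matrix, one block
    (irreducible representation) at a time: O(sum b_i^3) work instead
    of O((sum b_i)^3) for the full matrix."""
    return np.concatenate([np.linalg.eigvalsh(b) for b in blocks])

rng = np.random.default_rng(0)
sizes = [60, 60, 240, 240, 300]        # illustrative SACO block sizes
blocks = []
for s in sizes:
    a = rng.standard_normal((s, s))
    blocks.append(a + a.T)             # make each block symmetric

evals = eigvals_blockwise(blocks)
print(evals.shape)                     # (900,) = sum of the block sizes
```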

Conclusions
CRYSTAL:
- can be run in parallel on a large number of processors efficiently, with very good scalability;
- is portable to different HPC platforms;
- allowed the calculation of the total energy and wavefunction of MCM-41 X14, containing more than 100,000 basis functions (8,000 atoms), on 2,048 processors;
- has been improved as concerns data storage in memory;
- has been made more efficient in the calculation of the Coulomb and exchange series.

Memory storage for highly symmetric cases has been drastically reduced by extending the use of SACOs to all steps in reciprocal space. Task farming in Pcrystal will soon be moved from the k-point level to that of the irreducible representations.
