CRYSTAL in parallel: replicated and distributed (MPP) data
Roberto Orlando
Dipartimento di Chimica, Università di Torino
Via Pietro Giuria 5, Torino (Italy)

Why parallel?

- Faster time to solution;
- more memory available;
- High Performance Computing (HPC) resources are available, and not many software packages can run efficiently on thousands of processors.

The programmer's concerns:

- Load imbalance: the time taken will be that of the longest job;
- handling communications: the processors will need to talk to each other, and communication is slow;
- handling Input/Output (I/O): in most cases I/O is slow and should be avoided.

The user's concerns:

- Choose an appropriate number of processors to run a job, depending on the problem size (mostly determined by the number of basis functions in the unit cell).
Amdahl's law

S(n) = (S + P) / (S + P/n),   with S + P = 1

- n: number of processors;
- S: fraction of serial instructions;
- P: fraction of parallelized instructions.

This is a somewhat frightening equation!

Gustafson's law

BUT the relative values of S and P are a function of system size:

- parallelize the more expensive parts (first);
- these typically become rapidly more expensive as the system size is increased;
- parallelism is good for large systems!
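Amdahl's law above can be evaluated directly. A minimal sketch (the function name `amdahl_speedup` is mine, not from the slides):

```python
def amdahl_speedup(serial_fraction, n):
    """Amdahl's law: speedup on n processors when a fraction
    `serial_fraction` of the work cannot be parallelized (S + P = 1)."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n)

# Even a small serial fraction caps the achievable speedup:
# with S = 0.05 the limit as n grows is 1/S = 20.
print(amdahl_speedup(0.05, 16))    # about 9.1 on 16 processors
print(amdahl_speedup(0.05, 1024))  # still below 20 on 1024 processors
```

This is exactly why Gustafson's observation matters: for larger systems the expensive parts dominate, so the effective S shrinks and the cap rises.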
Load imbalance

- Say we have twenty totally independent tasks and twenty processors.
- Easy to parallelize: give each task to one of the processors. But what if the tasks don't all take the same time?
- The time taken will be that of the longest job.
- Because of load imbalance our speedup is less than perfect: we have too few tasks for too many processors.
- Don't use too many processors for too small a job.

Communications and I/O

- But what if the tasks are not independent? The processors will need to talk to each other: this is known as communication.
- Communication is SLOW, but usually the computation requirement scales more rapidly than the communication.
- Depending on how the machine is set up, I/O on parallel machines can be VERY slow, so in general it is best to run direct. This may not be true for medium-sized jobs on machines where each processor has a fast local disk.
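The load-imbalance point can be made concrete with a toy scheduler; this is an illustrative sketch, not how any real batch system or CRYSTAL assigns work:

```python
def parallel_time(task_times, n_procs):
    """Greedy static assignment of independent tasks to processors;
    the wall time is that of the most loaded processor."""
    loads = [0.0] * n_procs
    for t in sorted(task_times, reverse=True):
        loads[loads.index(min(loads))] += t  # give task to least-loaded proc
    return max(loads)

tasks = [1.0] * 19 + [5.0]        # twenty tasks, one five times longer
serial = sum(tasks)               # 24.0 time units on one processor
wall = parallel_time(tasks, 20)   # 5.0: set entirely by the longest task
print(serial / wall)              # speedup 4.8, far below the ideal 20
```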
Towards large unit cell systems

A CRYSTAL job can be run:

- serially: crystal;
- in parallel: Pcrystal, MPPcrystal.

Both parallel versions use the Message Passing Interface (MPI) for communications:

- Pcrystal uses replicated data storage in memory;
- MPPcrystal, for large unit cell systems on high-performance computers, uses parallel linear algebra library routines (ScaLAPACK) for diagonalization, matrix products and Cholesky decomposition, with enhanced distribution of data in memory among the processors.

Running CRYSTAL in parallel

Pcrystal:

- full parallelism in the calculation of the interactions (one- and two-electron integrals);
- distribution of tasks in reciprocal space: one k point per processor;
- no calls to external libraries;
- few inter-process communications.

MPPcrystal:

- full parallelism in the calculation of the interactions (one- and two-electron integrals);
- double-level distribution of tasks in reciprocal space: one k point to a subset of processors;
- parallel linear algebra library routines (ScaLAPACK) for diagonalization, matrix products and Cholesky decomposition;
- enhanced distribution of data in memory among the processors;
- many inter-process communications.
Pcrystal and MPPcrystal in action

Example of a CRYSTAL calculation: 10,000 basis functions; 16 processors available; 4 k points sampled in reciprocal space.

- Pcrystal: all processors are active for the tasks in real space (integrals); for the tasks in reciprocal space only four are active (one per k point: k1, k2, k3, k4) and the remaining twelve are idle.
- MPPcrystal: all processors are active both for the tasks in real space (integrals) and for those in reciprocal space.

Pcrystal - Implementation

Standard compliant: Fortran 90; MPI for message passing.

Replicated data:

- Each k point is independent: each processor performs the linear algebra (FC = EC) for a subset of the k points that the job requires;
- very few communications (potentially good scaling), but potential load imbalance;
- each processor has a complete copy of all the matrices used in the linear algebra;
- the limit on the size of a job is given by the memory required to store the linear algebra matrices for one k point;
- the number of k points limits the number of processors that can be exploited: in general Pcrystal scales very well provided the number of processors does not exceed the number of k points.
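The replicated-data distribution of k points over processors can be sketched as a simple round-robin assignment (an illustration of the idea only; `assign_k_points` is a hypothetical helper, not CRYSTAL code):

```python
def assign_k_points(n_k_points, n_procs):
    """Round-robin assignment of k points to processors, as in a
    replicated-data scheme: each processor diagonalizes FC = EC
    for its own subset of k points."""
    assignment = {p: [] for p in range(n_procs)}
    for k in range(n_k_points):
        assignment[k % n_procs].append(k)
    return assignment

# 4 k points on 16 processors: 12 processors get no reciprocal-space work,
# which is exactly the idle-processor situation in the example above.
a = assign_k_points(4, 16)
print(sum(1 for ks in a.values() if not ks))  # 12 idle processors
```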
MPPcrystal - Implementation I

Standard compliant: Fortran 90; MPI for message passing; ScaLAPACK 1.7 (Dongarra et al.) for linear algebra on distributed matrices.

Distributed data:

- Each processor holds only a part of each of the matrices used in the linear algebra (FC = EC);
- the number of processors that can be exploited is NOT limited by the number of k points (great for large Γ-point-only calculations);
- ScaLAPACK is used for, e.g., Cholesky decomposition, matrix-matrix multiplies and linear equation solves;
- as the data are distributed, communications are required to perform the linear algebra; however, there are N^3 operations but only N^2 data to communicate.

MPPcrystal - Implementation II

Scaling:

- scaling gets better for larger systems;
- very rough rule of thumb: N basis functions can exploit up to around N/20 processors (optimal ratio: N/50);
- MPPcrystal also uses multilevel parallelism: with 4 real k points and 32 processors, each diagonalization is done by 8 processors, so each diagonalization only has to scale to fewer processors (complicated by complex k points);
- very useful for medium-to-large sized systems (for a big enough problem it can scale very well).

Non-implemented features in MPPcrystal: the code will fail quickly and cleanly if a requested feature is not implemented, such as:

- symmetry adaption of the crystalline orbitals (for large, high-symmetry systems Pcrystal may be more effective);
- CPHF;
- Raman intensities.
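The N/20 rule of thumb and the multilevel splitting of processors over k points can be sketched as follows (function names are mine; this is not CRYSTAL's actual logic):

```python
def suggested_procs(n_basis, ratio=20):
    """Rough rule of thumb from the slides: N basis functions can
    exploit up to about N/20 processors (N/50 is closer to optimal)."""
    return n_basis // ratio

def procs_per_k_point(n_procs, n_k_points):
    """Multilevel parallelism: the processors are split into one
    subset per k point; each subset performs one diagonalization."""
    return n_procs // n_k_points

print(suggested_procs(10000))     # 500 processors for 10,000 basis functions
print(procs_per_k_point(32, 4))   # 8 processors per diagonalization
```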
MCM-41 mesoporous material model

P. Ugliengo, M. Sodupe, F. Musso, I. J. Bush, R. Orlando, R. Dovesi, Advanced Materials 20, (2008).

- B3LYP approximation;
- hexagonal lattice with P1 symmetry;
- 580 atoms per cell (7800 basis functions).

MTS/423 K: IR spectrum recorded on a micelle-templated silica calcined at 823 K, with water outgassed at 423 K, compared with the B3LYP simulation. Simulated powder spectrum: no relevant reflections at higher 2θ because of short-range disorder.

MCM-41: increasing the unit cell

R. Orlando, M. Delle Piane, I. J. Bush, P. Ugliengo, M. Ferrabone, R. Dovesi, J. Comput. Chem. 33, 2276 (2012).

- Supercells of MCM-41 have been grown along the c crystallographic axis: in Xn, the side along c is n times that in X1. X10 contains 77,560 AOs in the unit cell.
- Calculations were run on the IBM SP6 at Cineca: Power6 processors (4.7 GHz) with peak performance of 101 Tflops/s; Infiniband X4 DDR internal network.
- Figure: speedup vs number of cores (NC) for SCF + total energy gradient calculations.
MCM-41: scaling of the main steps in MPPcrystal

Figure: scaling of the main steps for X4 (two-electron integrals, one-electron integrals, Fock matrix diagonalization, exchange-correlation functional integration, preliminary steps, total energy + gradient). Percentage data measure parallelization efficiency; data in parentheses give the amount of time spent in each task.

Running MCM-41 on different HPC architectures

Figure: X1 timings on the IBM Blue Gene/P at Cineca (Bologna), the Cray XE6 HECToR (Edinburgh) and the IBM SP6 at Cineca (Bologna).
Memory storage optimization

No more errors like "TOO MANY K POINTS IN THE PACK-MONKHORST NET: INCREASE LIM001":

- most of the static allocations have been made dynamic: array sizes now fit the exact memory requirement, with no need to recompile the code for large calculations;
- a few remaining fixed limits can be extended from input:
  - CLUSTSIZE (maximum number of atoms in a generated cluster; default setting: the number of atoms in the unit cell);
  - LATVEC (maximum number of lattice vectors to be classified; default value: 3500);
- arrays of size N_atoms^2 are distributed among the cores;
- data are removed from memory as soon as they are no longer in use.

LOWMEM option

The LOWMEM keyword avoids the allocation of large arrays, generally with a slight increase in CPU time (it is the default in MPPcrystal):

- atomic orbital pair elements in matrices are located in real time, without storing a large lookup table in memory;
- Fock and density matrices are stored only in their irreducible forms; symmetry-related elements are computed in real time;
- the expansion of the AO pair electron density into multipole moments, for the bipolar approximation of the 2-electron integrals, is performed in real time instead of storing large buffers in memory;
- information about the grid of points used in the DFT exchange-correlation functional integration (Cartesian coordinates of the points, multiplicity, Becke's weights) is distributed among the processors.

Dynamically allocated memory can be monitored by means of the MEMOPRT and MEMOPRT2 keywords.
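The first LOWMEM item, locating AO-pair elements "in real time" instead of storing a large table, is the classic space-for-time trade-off. A toy sketch in Python (illustrative only; CRYSTAL's actual indexing is certainly different):

```python
def pair_index(i, j):
    """Offset of the AO pair (i, j), with i >= j, in a packed
    lower-triangular matrix, computed on demand."""
    return i * (i + 1) // 2 + j

def build_pair_table(n_ao):
    """The kind of table LOWMEM avoids: O(n_ao^2) entries in memory."""
    return {(i, j): pair_index(i, j)
            for i in range(n_ao) for j in range(i + 1)}

table = build_pair_table(100)
assert table[(7, 3)] == pair_index(7, 3)  # same answer, no stored table
print(len(table))  # 5050 entries that LOWMEM would not allocate
```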
Speeding up two-electron integrals

F^g_12 <- sum_{3,4} sum_{h,l} P^l_34 (1^0 2^g | 3^h 4^{h+l})

where electron 1 is associated with the AO pair (1^0, 2^g) and electron 2 with (3^h, 4^{h+l}).

- Integrals are screened on the basis of the overlap between atomic orbitals: in large unit cells, a lot of (3, 4) pairs do not overlap with (1, 2^g).
- Integrals related by atomic orbital permutation (and translational invariance) are equivalent, e.g. (1^0 2^g | 3^h 4^{h+l}) = (1^0 2^g | 4^{h+l} 3^h) = (3^0 4^l | 1^{-h} 2^{g-h}), so each class needs to be computed only once.
- Implemented for P1 symmetry.

Figure: timings T [sec] as a function of supercell size Xn, showing the gains from linearization and from permutation symmetry.

Improved memory storage in Pcrystal

The transformation of the Fock and density matrices into the basis of the Symmetry-Adapted Crystalline Orbitals (SACO) is operated from the irreducible F^g to each block of V_k^† F_k V_k (one per irreducible representation) straightforwardly, without forming the full blocks of F_k:

- the maximum size of the matrices to be diagonalized is that of the largest block;
- parallelization goes from k points down to the irreducible representations (many more than the number of k points in highly symmetric cases).
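Why diagonalizing only the SACO blocks pays off can be seen from a cube-law operation count; a toy sketch (assuming a dense O(m^3) diagonalization cost per block; not CRYSTAL code):

```python
def diag_cost(block_sizes):
    """Compare O(m^3) cost summed over symmetry blocks against one
    full O(N^3) diagonalization of the same total dimension."""
    blocked = sum(m ** 3 for m in block_sizes)
    full = sum(block_sizes) ** 3
    return blocked, full

# A 100x100 matrix split by symmetry into blocks of 50, 30 and 20:
blocked, full = diag_cost([50, 30, 20])
print(full // blocked)  # several times fewer operations than the full matrix
```

The memory argument is analogous: only the largest block, not the full F_k, ever needs to be held and diagonalized at once.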
Memory storage for fullerenes of increasing size

Table: N_AO, S_irr and S_red for (n,n)-fullerenes up to n = 7, where:

- N_AO: number of basis functions;
- S_irr: size of the irreducible part of the overlap matrix represented in real space (number of matrix elements);
- S_red: size of the full overlap matrix represented in real space (number of matrix elements).

Fullerenes: matrix block size in the SACO basis

Table: block sizes of the irreducible representations (A_g, A_u, F_1g, F_1u, F_2g, F_2u, G_g, G_u, H_g, H_u) together with N_AO and t_SCF for (n,n)-fullerenes from (1,1) to (10,10), where t_SCF is the wall-clock time (in seconds) for running 20 SCF cycles on a single core.
Conclusions

CRYSTAL:

- can be run in parallel on a large number of processors efficiently, with very good scalability;
- is portable to different HPC platforms;
- allowed the calculation of the total energy and wavefunction of MCM-41 X14, containing more than 100,000 basis functions (8000 atoms), on 2048 processors;
- has been improved with regard to data storage in memory;
- has been made more efficient in the calculation of the Coulomb and exchange series.

Memory storage for highly symmetric cases has been drastically reduced by extending the use of SACOs to all steps in reciprocal space. Task farming in Pcrystal will soon be moved from the k-point level to that of the irreducible representations.
SPARSE SOLVERS FOR THE POISSON EQUATION Margreet Nool CWI, Multiscale Dynamics November 9, 2015 OUTLINE OF THIS TALK 1 FISHPACK, LAPACK, PARDISO 2 SYSTEM OVERVIEW OF CARTESIUS 3 POISSON EQUATION 4 SOLVERS
More informationHybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC
Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,
More informationVector Lane Threading
Vector Lane Threading S. Rivoire, R. Schultz, T. Okuda, C. Kozyrakis Computer Systems Laboratory Stanford University Motivation Vector processors excel at data-level parallelism (DLP) What happens to program
More informationVASP: running on HPC resources. University of Vienna, Faculty of Physics and Center for Computational Materials Science, Vienna, Austria
VASP: running on HPC resources University of Vienna, Faculty of Physics and Center for Computational Materials Science, Vienna, Austria The Many-Body Schrödinger equation 0 @ 1 2 X i i + X i Ĥ (r 1,...,r
More informationUsing Web-Based Computations in Organic Chemistry
10/30/2017 1 Using Web-Based Computations in Organic Chemistry John Keller UAF Department of Chemistry & Biochemistry The UAF WebMO site Practical aspects of computational chemistry theory and nomenclature
More informationCMP 338: Third Class
CMP 338: Third Class HW 2 solution Conversion between bases The TINY processor Abstraction and separation of concerns Circuit design big picture Moore s law and chip fabrication cost Performance What does
More informationLightweight Superscalar Task Execution in Distributed Memory
Lightweight Superscalar Task Execution in Distributed Memory Asim YarKhan 1 and Jack Dongarra 1,2,3 1 Innovative Computing Lab, University of Tennessee, Knoxville, TN 2 Oak Ridge National Lab, Oak Ridge,
More informationReview: From problem to parallel algorithm
Review: From problem to parallel algorithm Mathematical formulations of interesting problems abound Poisson s equation Sources: Electrostatics, gravity, fluid flow, image processing (!) Numerical solution:
More informationFast and accurate Coulomb calculation with Gaussian functions
Fast and accurate Coulomb calculation with Gaussian functions László Füsti-Molnár and Jing Kong Q-CHEM Inc., Pittsburgh, Pennysylvania 15213 THE JOURNAL OF CHEMICAL PHYSICS 122, 074108 2005 Received 8
More informationMODULE 2: QUANTUM MECHANICS. Practice: Quantum ESPRESSO
MODULE 2: QUANTUM MECHANICS Practice: Quantum ESPRESSO I. What is Quantum ESPRESSO? 2 DFT software PW-DFT, PP, US-PP, PAW http://www.quantum-espresso.org FREE PW-DFT, PP, PAW http://www.abinit.org FREE
More informationPerformance Evaluation of Scientific Applications on POWER8
Performance Evaluation of Scientific Applications on POWER8 2014 Nov 16 Andrew V. Adinetz 1, Paul F. Baumeister 1, Hans Böttiger 3, Thorsten Hater 1, Thilo Maurer 3, Dirk Pleiter 1, Wolfram Schenck 4,
More informationRoundoff Error. Monday, August 29, 11
Roundoff Error A round-off error (rounding error), is the difference between the calculated approximation of a number and its exact mathematical value. Numerical analysis specifically tries to estimate
More informationLeigh Orf 1 Robert Wilhelmson 2,3 Roberto Sisneros 3 Brian Jewett 2 George Bryan 4 Mark Straka 3 Paul Woodward 5
Simulation and Visualization of Tornadic Supercells on Blue Waters PRAC: Understanding Tornadoes and Their Parent Supercells Through Ultra-High Resolution Simulation/Analysis Leigh Orf 1 Robert Wilhelmson
More informationBasic introduction of NWChem software
Basic introduction of NWChem software Background NWChem is part of the Molecular Science Software Suite Designed and developed to be a highly efficient and portable Massively Parallel computational chemistry
More informationParallelization of the Dirac operator. Pushan Majumdar. Indian Association for the Cultivation of Sciences, Jadavpur, Kolkata
Parallelization of the Dirac operator Pushan Majumdar Indian Association for the Cultivation of Sciences, Jadavpur, Kolkata Outline Introduction Algorithms Parallelization Comparison of performances Conclusions
More informationSoftware optimization for petaflops/s scale Quantum Monte Carlo simulations
Software optimization for petaflops/s scale Quantum Monte Carlo simulations A. Scemama 1, M. Caffarel 1, E. Oseret 2, W. Jalby 2 1 Laboratoire de Chimie et Physique Quantiques / IRSAMC, Toulouse, France
More informationWe will use C1 symmetry and assume an RHF reference throughout. Index classes to be used include,
1 Common Notation We will use C1 symmetry and assume an RHF reference throughout. Index classes to be used include, µ, ν, λ, σ: Primary basis functions, i.e., atomic orbitals (AOs) {φ µ ( r)}. The size
More informationLecture 3, Performance
Lecture 3, Performance Repeating some definitions: CPI Clocks Per Instruction MHz megahertz, millions of cycles per second MIPS Millions of Instructions Per Second = MHz / CPI MOPS Millions of Operations
More informationPredicting physical properties of crystalline solids
redicting physical properties of crystalline solids CSC Spring School in Computational Chemistry 2015 2015-03-13 Antti Karttunen Department of Chemistry Aalto University Ab initio materials modelling Methods
More informationLinear Algebra Section 2.6 : LU Decomposition Section 2.7 : Permutations and transposes Wednesday, February 13th Math 301 Week #4
Linear Algebra Section. : LU Decomposition Section. : Permutations and transposes Wednesday, February 1th Math 01 Week # 1 The LU Decomposition We learned last time that we can factor a invertible matrix
More information