Making electronic structure methods scale: Large systems and (massively) parallel computing


Ville Havu, Department of Applied Physics, Helsinki University of Technology - TKK (Ville.Havu@tkk.fi)

Outline
Part I: Scaling to large systems of many atoms
- Localisation as a key to linear scaling
- Implications for the different tasks in FHI-aims
Part II: Scaling to large computers of many cores
- Basics of parallel computing
- Parallel solutions to the tasks in FHI-aims

Different Scalings: From Small to Large Systems
[Plot: time T versus system size N for a T = N and a T = N^3 method on a linear scale; the curves intersect at a crossover point below which the cubic-scaling method is faster.]

Different Scalings: Logarithmic Scale
[The same comparison on a logarithmic scale, where the crossover between the T = N and T = N^3 curves is visible over a wide range of N.]
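The crossover can be made concrete with two hypothetical timing models. The prefactors a and b below are assumptions chosen only to produce a visible crossover, not measured FHI-aims numbers; a minimal numpy sketch:

```python
import numpy as np

# Purely illustrative prefactors (assumptions): the O(N) method does more
# work per atom than the small-prefactor O(N^3) method does at small N.
a, b = 200.0, 0.01

N = np.arange(1, 501)
t_linear = a * N            # T = a N
t_cubic = b * N**3          # T = b N^3

# Crossover where a N = b N^3, i.e. N = sqrt(a / b)
print("analytic crossover: N =", np.sqrt(a / b))
print("first N where the linear method wins:", N[t_linear < t_cubic][0])
```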

What is Linear Scaling?
(N = N_atoms, N_electrons, or N_points)
Assume: 1. there are O(N) items of input, and 2. we want fewer than O(N) items of output.
If the complexity (time) to get the output from the input is O(N), the method is linear scaling.
A method can be linear scaling only if it uses O(1) operations per item of input.
Example: cubature in R^3.
Input: N points r_i, N weights w_i, N function values f_i = f(r_i).
Output: Σ_i w_i f_i ≈ ∫ f(r) dr. Trivially O(N).

Approaches Towards O(N) in Electronic Structure Theory
Key requirements: localisation, localisation, and localisation (in R^3 or in Fourier space).
Popular approaches:
1. Minimise the total energy directly using the density matrix (and Wannier functions), i.e. skip / localise the Kohn-Sham orbitals (SIESTA, CONQUEST, OpenMX, ONETEP)
2. Accelerate the calculation / use of the entries of h_ij and s_ij (Gaussian basis functions (GAUSSIAN, Q-Chem, TURBOMOLE), regular Cartesian grids, FFT methods (Quickstep), wavelets (BigDFT))
3. Use fast solvers for the Hartree potential (multigrid, fast multipole, wavelets)
4. Employ a divide & conquer framework (LS3DF)
S. Goedecker, Rev. Mod. Phys. 71, 1085 (1999)
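A minimal numpy sketch of the O(N) cubature example: one weighted function evaluation per grid point. The uniform grid, weights and Gaussian integrand are illustrative assumptions, not FHI-aims' atom-centred integration grid.

```python
import numpy as np

def integrate(points, weights, f):
    """O(N) cubature: sum_i w_i f(r_i), constant work per grid point."""
    return np.sum(weights * f(points))

# Example: integrate exp(-|r|^2) over a box with a uniform grid.
n = 20
x = np.linspace(-3.0, 3.0, n)
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
points = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)   # N = n**3 points r_i
weights = np.full(len(points), (x[1] - x[0]) ** 3)     # w_i = cell volume

f = lambda r: np.exp(-np.sum(r**2, axis=-1))
print(integrate(points, weights, f))   # close to pi**1.5 = 5.568...
```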

Main Tasks for Scalability in FHI-aims
Key enabling technology: localisation from the basis functions.
1. Integration of h_ij and s_ij on partitioned atom-centred grids
2. Update of the electron density n(r) on the same grid
3. Solution of the Hartree potential using an atom-wise multipole decomposition
4. Solution of the eigenproblem h c_k = ε_k s c_k with a small-prefactor O(N^3) method (recall the crossover!)
Warning: naïve approaches lead to O(N^3) - O(N^4) scaling.

First Thing: Integration of h_ij and s_ij
1. Partition the grid into spatially localised batches.
2. Due to localisation, at each grid point eventually #{basis_fn ≠ 0} ~ O(1).
3. For each batch, consider only the non-zero basis functions.
Constant work per integration point → an O(N) algorithm.
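A schematic sketch of the batch idea in Python, not the FHI-aims implementation: the hypothetical `basis`, `v_eff` and `nonzero` arguments stand in for the real quantities, and only the potential term of h_ij is formed. The point is that each batch touches only a constant-size block of h.

```python
import numpy as np

def integrate_batch(h, points, weights, basis, v_eff, nonzero):
    """Add one batch's contribution to the Hamiltonian matrix h.

    `nonzero` lists the O(1) basis functions that do not vanish on this
    batch; only their block of h is touched (hypothetical data layout)."""
    # phi[a, p] = value of basis function nonzero[a] at point p
    phi = np.array([basis[i](points) for i in nonzero])
    # Integrand for <phi_i | v_eff | phi_j> on this batch (kinetic term omitted)
    integrand = phi * (weights * v_eff(points))   # broadcast over points
    block = integrand @ phi.T                     # constant-size matmul
    h[np.ix_(nonzero, nonzero)] += block
    return h

# Minimal usage with three 1D Gaussian "basis functions" and a flat potential
basis = [lambda r, c=c: np.exp(-(r - c) ** 2) for c in (0.0, 0.5, 5.0)]
v_eff = lambda r: np.ones_like(r)
points = np.linspace(-1.0, 1.5, 50)               # one spatially compact batch
weights = np.full(points.shape, points[1] - points[0])
h = np.zeros((3, 3))
integrate_batch(h, points, weights, basis, v_eff, nonzero=[0, 1])
print(h)   # only the 2x2 block of the two nearby functions is filled
```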

Second Thing: Update of the Electron Density
1. Stick with the same grid.
2. Construct the density matrix D_lk = Σ_j f_j c_lj c_kj.
3. Compute the new electron density as either
   a) n(r_i) = Σ_lk D_lk φ_l(r_i) φ_k(r_i), or
   b) n(r_i) = Σ_j f_j |ψ_j(r_i)|^2.
Again, there are only O(1) non-zero basis functions per point r_i → a) is an O(N) algorithm.
NB! The construction of D_lk in step 2 is formally an O(N^2) operation, but with a small prefactor.
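A minimal sketch of variant a), again with hypothetical `basis` and `nonzero` arguments and dense numpy arrays rather than the FHI-aims data structures: the density matrix is built from occupations and eigenvector coefficients, and the density on a batch uses only the locally non-zero basis functions.

```python
import numpy as np

def density_matrix(C, occ):
    """D_lk = sum_j f_j c_lj c_kj, with C[l, j] = c_lj and occ[j] = f_j."""
    return (C * occ) @ C.T

def density_on_batch(points, D, basis, nonzero):
    """n(r_i) = sum_lk D_lk phi_l(r_i) phi_k(r_i), restricted to the O(1)
    basis functions that are non-zero on this batch (illustrative layout)."""
    phi = np.array([basis[i](points) for i in nonzero])   # (m, Npts)
    D_block = D[np.ix_(nonzero, nonzero)]                  # (m, m)
    return np.einsum("lk,lp,kp->p", D_block, phi, phi)     # constant work/point

# Tiny usage with random coefficients
rng = np.random.default_rng(0)
C = rng.standard_normal((4, 2))          # 4 basis functions, 2 occupied states
D = density_matrix(C, occ=np.array([2.0, 2.0]))
basis = [lambda r, c=c: np.exp(-(r - c) ** 2) for c in (0.0, 0.4, 0.8, 5.0)]
pts = np.linspace(-0.5, 1.0, 10)
print(density_on_batch(pts, D, basis, nonzero=[0, 1, 2]))
```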

More Use for the Density Matrix
Instead of solving the eigenproblem it is possible to minimise the total energy E = 2 Tr[D h] subject to the idempotency condition D^2 = D.
This leads to linear scaling if
1. the minimisation of E converges in O(1) steps, and
2. D and h are sparse enough so that their product is O(N).
D. R. Bowler, T. Miyazaki and M. J. Gillan, Journal of Physics: Condensed Matter 14 (11), 2781 (2002)
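The reference above describes a specific linear-scaling minimiser; purely as an illustration of the idempotency constraint, the sketch below applies McWeeny purification, D → 3D² − 2D³, a standard way to push an approximate density matrix towards D² = D. Dense numpy arrays are used for clarity only; linear scaling would require sparse D and h and sparsity-preserving products.

```python
import numpy as np

def mcweeny_purify(D, steps=5):
    """Drive an approximate density matrix towards idempotency (D @ D = D)
    with McWeeny purification, D <- 3 D^2 - 2 D^3."""
    for _ in range(steps):
        D2 = D @ D
        D = 3.0 * D2 - 2.0 * D2 @ D
    return D

# Tiny check: start from a slightly perturbed projector onto 3 occupied states
rng = np.random.default_rng(1)
C = np.linalg.qr(rng.standard_normal((6, 3)))[0]   # orthonormal occupied space
D0 = C @ C.T + 1e-2 * rng.standard_normal((6, 6))
D0 = 0.5 * (D0 + D0.T)                             # keep it symmetric
D = mcweeny_purify(D0)
print(np.linalg.norm(D @ D - D))                   # ~ 0 after a few steps
```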

Third Thing: Calculate the Hartree Potential
Run the whole show on differences: δn(r) = n(r) − Σ_at n_free,at(|r − R_at|).
1. Compute the atom-wise multipole density components: δn_at,lm(r) = ∫ p_at(r') δn(r') Y_lm(Ω_at) dΩ, evaluated at r = |r' − R_at|.
2. Solve for the corresponding potential components v_at,lm(r).
3. Build the potential: v_es(r_i) = Σ_at Σ_lm v_at,lm(|r_i − R_at|) Y_lm(Ω_at), naïvely N_atoms × N_points work.
Potentially resource consuming: the higher multipoles must be cut off at shorter distances.

Fourth Thing: Solve the Eigenproblem
The problem: there are N_basis × N_states coefficients c_li to solve for → at least an O(N^2) algorithm.
Conventional direct solvers lead to an O(N^3) method.
An iterative solver could be O(N^2), but a full matrix-vector product is already O(N^2) → O(N^3) again.
What would be needed: a sparse matrix and an iterative algorithm that converges in a constant number of steps for all eigenpairs; severe problems with initialisation.
FHI-aims: the matrices are not very sparse → LAPACK / modified ScaLAPACK; an iterative LOBPCG solver is under investigation.

Conventional Direct Solvers for the Eigenproblem: O(N^3)
Implementation in LAPACK / ScaLAPACK:
1. Factorise s = L L^T and let A = L^-1 h L^-T to get the standard problem A x_k = ε_k x_k.
2. Use Householder transformations to reduce A to tridiagonal form: T = H_1 H_2 H_3 ... H_n A H_n ... H_3 H_2 H_1.
3. Solve the eigenproblem for T with one of
   a. bisection & inverse iteration
   b. QR iteration
   c. divide & conquer
   d. the MRRR method (Multiple Relatively Robust Representations)
4. Substitute back (twice) to get the eigenvectors c_k.
This is the method of choice in FHI-aims due to its parallel scalability.
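A minimal dense sketch of steps 1 and 4 in Python/SciPy (not the ScaLAPACK code path used by FHI-aims): the generalized problem is reduced to standard form via the Cholesky factor of s, solved with scipy.linalg.eigh (which performs the tridiagonalisation internally), and the eigenvectors are transformed back.

```python
import numpy as np
from scipy.linalg import cholesky, eigh, solve_triangular

def solve_ks_eigenproblem(h, s):
    """Solve h c_k = eps_k s c_k via s = L L^T, A = L^-1 h L^-T,
    A x_k = eps_k x_k, and the back-substitution c_k = L^-T x_k."""
    L = cholesky(s, lower=True)
    A = solve_triangular(L, h, lower=True)          # L^-1 h
    A = solve_triangular(L, A.T, lower=True).T      # (L^-1 h) L^-T
    eps, X = eigh(A)                                # standard symmetric problem
    C = solve_triangular(L, X, trans="T", lower=True)   # c_k = L^-T x_k
    return eps, C

# Check against SciPy's direct generalized solver on random test matrices
rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
h = M + M.T                      # symmetric "Hamiltonian"
s = M @ M.T + 5 * np.eye(5)      # symmetric positive definite "overlap"
eps, C = solve_ks_eigenproblem(h, s)
print(np.allclose(eps, eigh(h, s, eigvals_only=True)))   # True
```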

The Scalability Test Setting
1. Physical system: fully extended polyalanine chains.
2. Hardware: IBM Power6 575 at RZG; 205 nodes, 32 cores / node, 6560 cores total; 18 TB main memory (64 / 128 / 256 GB per node); InfiniBand interconnect.

Current FHI-aims Scaling: Light Settings, 32 cores
[Plot: time per s.c.f. iteration (s) versus the number of atoms in the polyalanine chain, with curves for the total time, integration, density update (via the density matrix, dm, and via the orbitals, orb), the Hartree potential, and the eigenvalue (EV) solution. The integration scales as about O(N^1.1), the other individual terms with exponents up to about O(N^1.8), and a crossover point separates the dm and orb density updates. In the large-system region the EV solution scales as O(N^2.7) and the total time as O(N^1.9).]

Current FHI-aims Scaling: Tight Settings, 32 cores
[Plot: time per s.c.f. iteration (s) versus the number of atoms in the polyalanine chain, with curves for the total time, density update (dm), Hartree potential, integration and EV solution; the fitted exponents lie between roughly O(N^1.2) and O(N^2.3). In the large-system region the EV solution scales as O(N^2.9) and the total time as O(N^2.0).]

Current FHI-aims Scaling: GFlops, Light Settings
[Plot: GFlops per s.c.f. iteration versus the number of atoms in the polyalanine chain for the density-matrix (dm) and orbital-based (orb) density updates, with fitted exponents between roughly O(N^1.3) and O(N^2.3) and a crossover point between the two variants.]

Current FHI-aims Scaling: Light Settings, 32 cores
[Plot: time per s.c.f. iteration (s) versus the number of atoms in the polyalanine chain for the total time, density update (dm), Hartree potential, integration and EV solution, compared against the absolute minimum (total).]

Part II: Scaling to large computers of many cores

Parallel Computing: Your Desktop
Why you need to care:
[IBM POWER processor roadmap (2008 IBM Corporation): clock frequencies level off in the few-GHz range (1+ GHz, 1.5+ GHz, 1.9 GHz, 3.5 GHz, 4+ GHz) while successive generations add more cores and threads, larger caches, AltiVec units, distributed switches and other advanced system features, with binary compatibility maintained across generations.]

Parallel Computing: Supercomputers (June 2009)
[Image courtesy of the National Center for Computational Sciences, Oak Ridge National Laboratory.]
With supercomputers it's getting even more complicated:
1. IBM's Roadrunner: 129,600 cores
2. Cray XT5 Jaguar: 150,152 cores
3. & 5. IBM's Blue Gene/P & /L: 294,912 & 212,992 cores
4. SGI's Altix Pleiades: 51,200 cores
See the TOP500 list for more info.

Theory of Parallel Computing
Recall "The Computer": a core executes a stream of instructions on a stream of data.
Four-field classification of parallelism (Flynn's taxonomy), by the number of instruction streams and data streams:
- SISD (single instruction, single data): serial computing
- SIMD (single instruction, multiple data): vector machines
- MISD (multiple instruction, single data): no such thing
- MIMD (multiple instruction, multiple data): cluster machines
In practice: SPMD, Single Program Multiple Data.
M. Flynn, IEEE Trans. Comput. C-21, 948 (1972)

Theory of Parallel Computing
Standard programming layer: the Message Passing Interface (MPI).
[Diagram: several cores connected by a communication network.]
Two main modes of communication:
1. Collective communication
2. Point-to-point communication
W. Gropp, E. Lusk, A. Skjellum: Using MPI, MIT Press (1999)

MPI Communication
Task #0 ... sends 42 to ... task #1:
Task #0: 1. post MPI_Send, 2. copy the data to a buffer, 3. send the data.
Task #1: 1. post MPI_Recv, 2. receive the data, 3. copy it into the destination, which now holds 42.
The compiler sees only the posted MPI_Send / MPI_Recv calls; the buffering and the actual transfer happen inside the MPI library.
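A minimal sketch of the same two-task exchange, assuming mpi4py and an MPI runtime are available (the file name in the comment is illustrative):

```python
# Run with:  mpiexec -n 2 python send42.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send(42, dest=1, tag=0)          # task #0 posts the send
    print("task 0 sent 42")
elif rank == 1:
    value = comm.recv(source=0, tag=0)    # task #1 posts the receive
    print("task 1 received", value)       # -> 42
```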

MPI Today
Pros:
+ Uniform layer for programmers
+ Can run on a large variety of platforms
+ Tested and bug-free
Cons:
- Opaque to the compiler, so no optimisation is possible
- Fragmented global address space
- Implementation is vendor dependent
- Lots of platform-dependent parameters (e.g. buffer size)
Future replacements: Co-Array Fortran / Unified Parallel C (already here); X10, Chapel, and Fortress (still in the future). All with a partitioned global address space.

Example: Parallel Integration of h_ij
Recall the grid and the batches:
1. Each task integrates its own set of batches → h_ij^(id)
2. Collective communication to obtain h_ij = Σ_id h_ij^(id)
Same code on different sets of points: SPMD.
Similarly: 1. update of the electron density, 2. update of the Hartree potential.
N.B.: the grid batches and their distribution define a parallel iterator over the grid.
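A minimal sketch of step 2, assuming mpi4py: each task produces a partial matrix for its batches (here just a rank-dependent dummy contribution), and one collective Allreduce yields the full h_ij on every task.

```python
# Run with:  mpiexec -n 4 python integrate_parallel.py   (illustrative name)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
n_basis = 8

# Stand-in for "integrate my set of batches": a rank-dependent symmetric
# dummy contribution h_ij^(id), not a real integration.
rng = np.random.default_rng(comm.Get_rank())
m = rng.standard_normal((n_basis, n_basis))
h_local = m + m.T

h_total = np.empty_like(h_local)
comm.Allreduce(h_local, h_total, op=MPI.SUM)   # h_ij = sum_id h_ij^(id)
print(comm.Get_rank(), np.trace(h_total))      # same value on every task
```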

Parallel Dense Linear Algebra
Standard solution: ScaLAPACK / PBLAS / BLACS
- Provides the same functionality as LAPACK / BLAS
- Uses a block-cyclic distribution of the matrix elements to optimise load balancing and cache utilisation
[Diagram: the matrix elements h_11, h_12, h_13, h_21, h_22, h_31, ... mapped onto the task numbers of a grid of 4 tasks.]
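A small sketch of the mapping behind a 2D block-cyclic distribution (the layout ScaLAPACK uses); the process-grid shape, block size and row-major task numbering are illustrative choices, not a particular BLACS configuration.

```python
import numpy as np

def owner(i, j, block=2, prow=2, pcol=2):
    """Task (0-based) of a prow x pcol process grid that owns matrix
    element (i, j) under a 2D block-cyclic distribution with `block`."""
    return ((i // block) % prow) * pcol + (j // block) % pcol

# Ownership map of an 8x8 matrix on a 2x2 process grid, block size 2
n = 8
print(np.array([[owner(i, j) for j in range(n)] for i in range(n)]))
```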

Parallel Dense Linear Algebra
Example: PxGEMM
( A0 A1 ; A2 A3 ) ( B0 B1 ; B2 B3 ) = ( C0 C1 ; C2 C3 ), with one block per task (task #0, task #1, task #2, task #3).
Computed in two stages:
C0 = A0 B0 + A1 B2
C1 = A0 B1 + A1 B3
C2 = A2 B0 + A3 B2
C3 = A2 B1 + A3 B3
Each stage: only one send per task, point-to-point communication.
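A minimal numpy check of the block algebra the four tasks evaluate; the communication is omitted (everything is local here), and the block size is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
b = 3                                     # illustrative block size
A = rng.standard_normal((2 * b, 2 * b))
B = rng.standard_normal((2 * b, 2 * b))

blk = lambda M, r, c: M[r * b:(r + 1) * b, c * b:(c + 1) * b]
A0, A1, A2, A3 = blk(A, 0, 0), blk(A, 0, 1), blk(A, 1, 0), blk(A, 1, 1)
B0, B1, B2, B3 = blk(B, 0, 0), blk(B, 0, 1), blk(B, 1, 0), blk(B, 1, 1)

# Stage 1 + stage 2 contributions, one block per task
C0 = A0 @ B0 + A1 @ B2
C1 = A0 @ B1 + A1 @ B3
C2 = A2 @ B0 + A3 @ B2
C3 = A2 @ B1 + A3 @ B3

C = np.block([[C0, C1], [C2, C3]])
print(np.allclose(C, A @ B))              # True
```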

Measures of Parallel Performance
1. Speedup: S_p = T_1 / T_p (optimally S_p = p)
2. Efficiency: E_p = S_p / p (optimally E_p = 1)
Amdahl's law, i.e. The Unfortunate Law of Diminishing Returns:
Assume that a proportion α of the program can be run in parallel; then (1 − α) is the serial part of the code, and
S_p ≤ ( (1 − α) + α/p )^-1 → (1 − α)^-1 as p → ∞.
[Plot: S_p versus p for α = 0.50, 0.75 and 0.95; even for α = 0.95 the speedup saturates at 20.]
G. Amdahl, AFIPS Conference Proceedings 30 (1967)
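A minimal sketch of Amdahl's bound for the parallel fractions shown in the plot; the core counts are arbitrary example values.

```python
import numpy as np

def amdahl_speedup(alpha, p):
    """S_p = 1 / ((1 - alpha) + alpha / p) for parallel fraction alpha."""
    return 1.0 / ((1.0 - alpha) + alpha / np.asarray(p, dtype=float))

p = np.array([1, 8, 64, 512, 4096])
for alpha in (0.50, 0.75, 0.95):
    print(alpha, np.round(amdahl_speedup(alpha, p), 1))
# alpha = 0.95 saturates at 1 / (1 - 0.95) = 20, no matter how many cores
```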

FHI-aims: Speedup on IBM's BlueGene: Ala100, α-helical
[Plot: speedup of one s.c.f. iteration (total time) versus the number of cores, compared with the ideal Speedup = p line.] With R. Johanni (RZG).

FHI-aims on IBM's BlueGene: Scaling to Many Cores
[Plot: time for one s.c.f. iteration (s) versus the number of cores for the total time, density update, Hartree potential, integration and EV solution, compared with a linear reference line; at the largest core counts only a 300 x 300 matrix remains per task.] With R. Johanni (RZG).

FHI-aims: Parallel Scalability
Overall current status:
1. Optimised communication: up to thousands of processors on BlueGene.
2. Non-optimised communication: up to hundreds of processors on Power6, Cray XT5, and an Opteron cluster from HP (lots of parameters are needed for optimal MPI performance: buffer size, message size, transfer method, ...).

Parallel Data Storage
In addition to the distributed instructions, the data must also be distributed over the MPI tasks:
1. Grid-based quantities (electron density, potential): distribute the grid batches to different tasks; the parallel iterator is local to each task.
2. Splines describing the Hartree potential v_at,lm(r): store different splines on different tasks; the splines must be communicated to compute v_es(r).
3. Matrices and Kohn-Sham eigenvectors: as dictated by ScaLAPACK.
Conversely, each MPI task holds
1. grid batches and the associated quantities,
2. splines for the Hartree potential v_at,lm(r), and
3. pieces of the matrices and Kohn-Sham eigenvectors.

Conclusions
Part I:
- The integration, the density update and the calculation of the Hartree potential can be made to scale as O(N) (or O(N log N)).
- This requires localisation of the basis functions.
- The hard part that remains: the solution of the eigenproblem.
Part II:
- Electronic structure theory codes need to scale to large parallel systems today.
- This is achieved by minimising the serial part of the program.
