Making electronic structure methods scale: Large systems and (massively) parallel computing
Ville Havu, Department of Applied Physics, Helsinki University of Technology (TKK), Ville.Havu@tkk.fi

Outline
Part I: Scaling to large systems of many atoms
- Localisation as the key to linear scaling
- Implications for the different tasks in FHI-aims
Part II: Scaling to large computers of many cores
- Basics of parallel computing
- Parallel solutions to the tasks in FHI-aims
Different Scalings: From Small to Large Systems
[Plot: computational time T versus system size N on a linear scale for linear (T = N) and cubic (T = N^3) scaling, marking the crossover points between the curves.]

Different Scalings: Logarithmic Scale
[The same T = N and T = N^3 curves and crossover points, shown on a logarithmic scale.]
What is Linear Scaling?
N ~ N_atoms ~ N_electrons ~ N_points
Assume:
1. There are O(N) items of input.
2. We want no more than O(N) items of output.
If the complexity (time) to obtain the output from the input is O(N), the method is linear scaling. A method can be linear scaling only if it uses O(1) operations per item of input.
Example: cubature in R^3
Input: N points r_i, N weights w_i, N function values f_i = f(r_i)
Output: Σ_i w_i f_i ≈ ∫ f(r) dr — trivially O(N)

Approaches Towards O(N) in Electronic Structure Theory
Key requirements: localisation, localisation, and localisation (in R^3 or in Fourier space)
Popular approaches:
1. Minimise the total energy directly using the density matrix (and Wannier functions), i.e. skip / localise the Kohn-Sham orbitals (SIESTA, CONQUEST, OpenMX, ONETEP)
2. Accelerate the calculation / use of the entries of h_ij and s_ij (Gaussian basis functions (GAUSSIAN, Q-Chem, TURBOMOLE), regular Cartesian grids, FFT methods (Quickstep), wavelets (BigDFT))
3. Use fast solvers for the Hartree potential (multigrid, fast multipole, wavelets)
4. Employ a divide & conquer framework (LS3DF)
S. Goedecker, Rev. Mod. Phys. 71, 1085 (1999)
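To make the cubature example above concrete, here is a minimal sketch in Python/NumPy (not part of the original slides): the quadrature sum costs one multiply-add per input point, i.e. O(1) work per item and O(N) overall. The grid and weights below are illustrative stand-ins.

```python
import numpy as np

def cubature(points, weights, f):
    """O(N) cubature: one function evaluation and one multiply-add per point."""
    values = f(points)              # N function values f_i = f(r_i)
    return np.dot(weights, values)  # sum_i w_i f_i  ~  integral of f

# Illustrative stand-in grid: uniform random points in the box [-3, 3]^3
# with equal weights (volume / N), integrating exp(-|r|^2).
rng = np.random.default_rng(0)
pts = rng.uniform(-3.0, 3.0, size=(100_000, 3))
wts = np.full(len(pts), 6.0**3 / len(pts))
approx = cubature(pts, wts, lambda r: np.exp(-np.sum(r**2, axis=1)))
print(approx, "vs", np.pi**1.5)   # ~5.57 = integral of exp(-|r|^2) over R^3
```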
Main Tasks for Scalability in FHI-aims
Key enabling technology: localisation of the basis functions
1. Integration of h_ij and s_ij on partitioned atom-centred grids
2. Update of the electron density n(r) on the same grid
3. Solution of the Hartree potential using an atom-wise multipole decomposition
4. Solution of the eigenproblem h c_k = ε_k s c_k with a small-prefactor O(N^3) method (recall the crossover!)
Warning: naïve approaches lead to O(N^3)–O(N^4) scaling in these tasks.
First Thing: Integration of h_ij and s_ij
1. Partition the grid into spatially localised batches.
2. Due to localisation, at each grid point eventually #{basis functions ≠ 0} ~ O(1).
3. For each batch, consider only the non-zero basis functions.
Constant work per integration point ⇒ O(N) algorithm (see the sketch below).
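A minimal sketch of the batch-wise integration idea, in Python/NumPy. The `batches` and `basis.nonzero_on` names are hypothetical placeholders, not the FHI-aims API; the point is that each batch touches only its O(1) non-zero basis functions, so the global matrix is filled with constant work per grid point.

```python
import numpy as np

def integrate_hamiltonian(batches, basis, v_eff, n_basis):
    """Batch-wise integration of the potential part of
    h_ij = sum_r w_r phi_i(r) v_eff(r) phi_j(r).

    batches : list of (points, weights) pairs, spatially localised
    basis   : hypothetical object whose nonzero_on(points) returns the
              indices and values (n_nonzero x n_points) of only those
              basis functions that do not vanish on this batch
    """
    h = np.zeros((n_basis, n_basis))
    for points, weights in batches:
        idx, phi = basis.nonzero_on(points)   # O(1) functions per batch
        wv = weights * v_eff(points)          # weights times potential on the batch
        h_block = (phi * wv) @ phi.T          # small dense block, constant cost
        h[np.ix_(idx, idx)] += h_block        # scatter into the global matrix
    return h
```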
Second Thing: Update of the Electron Density
1. Stick with the same grid.
2. Construct the density matrix D_lk = Σ_j f_j c_lj c_kj.
3. Compute the new electron density either as
   a) n(r_i) = Σ_lk D_lk φ_l(r_i) φ_k(r_i), or
   b) n(r_i) = Σ_j f_j |ψ_j(r_i)|^2.
Again, there are only O(1) non-zero basis functions per point r_i ⇒ (a) is an O(N) algorithm.
NB: construction of D_lk in step 2 is formally an O(N^2) operation, but with a small prefactor.
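Steps 2 and 3a can be written down directly; a minimal NumPy sketch (assuming `c` holds the Kohn-Sham coefficients, `f` the occupation numbers, and `phi` the values of the basis functions that are non-zero on one batch):

```python
import numpy as np

def density_matrix(c, f):
    """D_lk = sum_j f_j c_lj c_kj, with c of shape (n_basis, n_states)
    and f the occupation numbers (formally O(N^2), small prefactor)."""
    return (c * f) @ c.T

def density_on_batch(D_local, phi):
    """n(r_i) = sum_lk D_lk phi_l(r_i) phi_k(r_i), where D_local and phi
    (shape n_nonzero x n_points) are restricted to the O(1) basis
    functions that are non-zero on this batch."""
    return np.einsum('lk,li,ki->i', D_local, phi, phi)
```

Because only O(1) functions survive on each batch, every call to the batch update costs constant work, and the full density update stays O(N).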
More Use for the Density Matrix
Instead of solving the eigenproblem it is possible to minimise the total energy E = 2 Tr[D h] subject to the idempotency condition D^2 = D.
This leads to linear scaling if
1. the minimisation of E converges in O(1) steps, and
2. D and h are sparse enough that their product is an O(N) operation.
D. R. Bowler, T. Miyazaki and M. J. Gillan, Journal of Physics: Condensed Matter 14 (11), 2781 (2002)

Third Thing: Calculate the Hartree Potential
Run the whole show on differences: δn(r) = n(r) − Σ_at n_free,at(|r − R_at|)
1. Compute the atom-wise multipole density components:
   δn_at,lm(r)|_{r = |r − R_at|} = ∫ p_at(r) δn(r) Y_lm(Ω_at) dΩ
2. Solve for the corresponding potential components δv_at,lm(r).
3. Build the potential: v_es(r_i) = Σ_at,lm δv_at,lm(|r_i − R_at|) Y_lm(Ω_at) — a sum over N_atoms (and all lm) at each of the N_points grid points.
The sum over atoms and multipole components at every grid point is potentially resource consuming: the higher multipoles must be cut off at shorter distances.

Fourth Thing: Solve the Eigenproblem
The problem: there are N_basis × N_states coefficients c_li to solve for ⇒ at least an O(N^2) algorithm.
- Conventional direct solvers lead to an O(N^3) method.
- An iterative solver could be O(N^2), but each full matrix–vector product is already O(N^2) ⇒ O(N^3) again.
- What is left: a sparse matrix and an iterative algorithm that converges in a constant number of steps for all eigenpairs — a severe problem with initialisation.
FHI-aims: the matrices are not very sparse ⇒ LAPACK / modified ScaLAPACK; iterative LOBPCG under investigation.
Conventional Direct Solvers for the Eigenproblem: O(N^3)
Implementation in LAPACK / ScaLAPACK:
1. Factorise s = L L^T and let A = L^{-1} h L^{-T} to get the standard problem A x_k = ε_k x_k.
2. Use Householder transformations to reduce A to tridiagonal form: T = H_1 H_2 H_3 ... H_n A H_n ... H_3 H_2 H_1.
3. Solve the eigenproblem for T with some method:
   a. bisection & inverse iteration
   b. QR iteration
   c. divide & conquer
   d. the MRRR method (Multiple Relatively Robust Representations)
4. Substitute back (twice) to get the eigenvectors c_k.
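A minimal serial illustration of the same four steps with NumPy/SciPy, under the assumption that h is symmetric and s is symmetric positive definite; the real LAPACK/ScaLAPACK drivers do this internally, blocked and in parallel.

```python
import numpy as np
from scipy.linalg import cholesky, eigh, solve_triangular

def generalized_eig(h, s):
    """Solve h c_k = eps_k s c_k following steps 1-4 above (O(N^3));
    h is assumed symmetric, s symmetric positive definite."""
    L = cholesky(s, lower=True)                         # 1. s = L L^T
    Linv_h = solve_triangular(L, h, lower=True)         #    L^{-1} h
    A = solve_triangular(L, Linv_h.T, lower=True).T     #    A = L^{-1} h L^{-T}
    eps, x = eigh(A)                                    # 2.+3. reduce and solve A x_k = eps_k x_k
    c = solve_triangular(L, x, lower=True, trans='T')   # 4. back-substitute: c_k = L^{-T} x_k
    return eps, c
```

In practice one would simply call `scipy.linalg.eigh(h, s)` (or the corresponding LAPACK/ScaLAPACK driver), which performs the same Cholesky-based reduction internally.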
This is the method of choice in FHI-aims because of its parallel scalability.

The Scalability Test Setting
1. Physical system: fully extended polyalanine chains of increasing numbers of atoms
2. Hardware: IBM Power6 575 at RZG
   - 205 nodes, 32 cores per node, 6560 cores in total
   - 18 TB main memory: 64 / 128 / 256 GB per node
   - Infiniband interconnect
Current FHI-aims Scaling: Light Settings, 32 cores
[Plot: time per s.c.f. iteration (s) versus the number of atoms in the polyalanine chain, built up curve by curve: integration, density update via the density matrix (dm), density update via the orbitals (orb), Hartree potential, EV solution, and total time, with the crossover point between the two density-update variants marked. Integration scales as roughly O(N^1.1); the other individual curves show fitted exponents between O(N^1.1) and O(N^1.8). In the large-system region the EV solution scales as O(N^2.7) and the total time as O(N^1.9).]
Current FHI-aims Scaling: Tight Settings, 32 cores
[Plot: time per s.c.f. iteration (s) versus the number of atoms in the polyalanine chain for total time, density update (dm), Hartree potential, integration, and EV solution; fitted exponents for the individual curves range from O(N^1.2) to O(N^2.3). In the large-system region the EV solution scales as O(N^2.9) and the total time as O(N^2.0).]

Current FHI-aims Scaling: GFlops, Light Settings
[Plot: GFlops per s.c.f. iteration versus the number of atoms in the polyalanine chain, comparing the density update using the density matrix (dm) and using the orbitals (orb), with their crossover point marked; fitted exponents lie between O(N^1.3) and O(N^2.3).]
Current FHI-aims Scaling: Light Settings, 32 cores
[Plot: time per s.c.f. iteration (s) versus the number of atoms in the polyalanine chain for total time, density update (dm), Hartree potential, integration, EV solution, and the absolute minimum (total).]

Part II: Scaling to large computers of many cores
Parallel Computing: Your Desktop
Why you need to care:
[Slide: IBM's POWER processor roadmap, 2007 and beyond. Clock frequencies stay in the range of roughly 1 to 4+ GHz across generations, while each generation adds cores, simultaneous multithreading, workload accelerators, AltiVec units, larger caches, distributed switches, enhanced virtualisation, improved memory bandwidth and reduced memory latencies — all with binary compatibility. Performance growth now comes from parallelism rather than clock speed.]
Parallel Computing: Supercomputers (June 2009)
[Image courtesy of the National Center for Computational Sciences, Oak Ridge National Laboratory]
With supercomputers it's getting even more complicated:
1. IBM's Roadrunner: 129,600 cores
2. Cray XT5 "Jaguar": 150,152 cores
3. & 5. IBM's Blue Gene /P & /L: 294,912 & 212,992 cores
4. SGI's Altix "Pleiades": 51,200 cores
See the Top500 list for more info.

Theory of Parallel Computing
Recall "The Computer": a core executing a stream of instructions on a stream of data.
Four-field classification of parallelism (Flynn's taxonomy):

                            # of data streams
  # of instruction streams  single                   multiple
  single                    SISD: serial computing   SIMD: vector machines
  multiple                  MISD: no such thing      MIMD: cluster machines

M. Flynn, IEEE Trans. Comput., C-21, 948 (1972)
SPMD: Single Program, Multiple Data — the usual way MIMD machines are programmed in practice.

Theory of Parallel Computing
Standard programming layer: the Message Passing Interface (MPI).
[Diagram: several cores connected by a communication network, each running its own MPI task.]
Two main modes of communication:
1. Collective communication
2. Point-to-point communication
W. Gropp, E. Lusk, A. Skjellum: Using MPI, MIT Press (1999)
MPI Communication
Task #0 ... sends 42 to ... task #1:
  Task #0: 1. post MPI_Send, 2. copy the data to a buffer, 3. send the data.
  Task #1: 1. post MPI_Recv, 2. receive the data, 3. copy it into the destination variable — 42 arrives.
The compiler sees only the MPI_Send / MPI_Recv calls; the buffering and transfer in between are handled by the MPI library.
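The "42" exchange above would look roughly like this with mpi4py (a sketch, not code from the slides); run with two tasks, e.g. `mpiexec -n 2 python send42.py`.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send(42, dest=1, tag=0)           # task #0 posts the send
elif rank == 1:
    value = comm.recv(source=0, tag=0)     # task #1 posts the receive
    print("task #1 received", value)       # buffering and transfer happen inside MPI
```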
MPI Today
Pros:
+ Uniform layer for programmers
+ Runs on a large variety of platforms
+ Tested and bug-free
Cons:
- Opaque to the compiler — no optimisation possible
- Fragmented global address space
- Implementation is vendor dependent
- Lots of platform-dependent parameters (e.g. buffer sizes)
Future replacements: Co-Array Fortran / Unified Parallel C (already here); X10, Chapel, and Fortress (still in the future) — all with a partitioned global address space.

Example: Parallel Integration of h_ij
Recall the grid and the batches:
1. Each task integrates its own set of batches, giving a partial h_ij^(id).
2. Collective communication yields h_ij = Σ_id h_ij^(id).
Same code on different sets of points: SPMD (see the sketch below).
Similarly:
1. Update of the electron density
2. Update of the Hartree potential
N.B.: The grid batches and their distribution define a parallel iterator over the grid.
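A sketch of the SPMD pattern for h_ij with mpi4py: every task runs the same code on its own batches, and a collective sum (MPI_Allreduce) assembles the full matrix. The local "integration" here is a trivial stand-in for the batch loop sketched earlier.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Stand-in for this task's contribution h_ij^(id) from its own grid batches.
n_basis = 4
h_local = np.full((n_basis, n_basis), float(rank))

# Collective communication: h_ij = sum over tasks of h_ij^(id).
h_full = np.empty_like(h_local)
comm.Allreduce(h_local, h_full, op=MPI.SUM)

if rank == 0:
    print(h_full)   # every entry equals 0 + 1 + ... + (ntasks - 1)
```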
Parallel Dense Linear Algebra
Standard solution: ScaLAPACK / PBLAS / BLACS
- Provides the same functionality as LAPACK / BLAS
- Uses a block-cyclic distribution of the matrix elements to optimise load balancing and cache utilisation
[Diagram: the matrix elements h_11, h_12, h_13, h_21, h_22, h_31, ... are grouped into blocks that are assigned cyclically to the task numbers of a process grid (4 tasks).]
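The block-cyclic mapping itself is only a couple of lines; a sketch (not the BLACS/ScaLAPACK API) of which task in a p_rows × p_cols process grid owns global element (i, j) for block size b:

```python
def owner(i, j, b, p_rows, p_cols):
    """Task coordinates (row, col) of the process owning global matrix
    element (i, j) in a 2D block-cyclic distribution with square blocks
    of size b on a p_rows x p_cols process grid (0-based indices)."""
    return (i // b) % p_rows, (j // b) % p_cols

# 2 x 2 process grid with 2 x 2 blocks: blocks of rows cycle over the tasks.
for i in (0, 2, 4, 6):
    print((i, 0), "->", owner(i, 0, b=2, p_rows=2, p_cols=2))
```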
Parallel Dense Linear Algebra
Example: PxGEMM, the parallel matrix-matrix multiplication. With the blocks distributed over tasks #0-#3,
  ( A0 A1 ; A2 A3 ) ( B0 B1 ; B2 B3 ) = ( C0 C1 ; C2 C3 ),
the product is assembled in two stages:
  C0 = A0 B0 + A1 B2
  C1 = A1 B3 + A0 B1
  C2 = A2 B0 + A3 B2
  C3 = A2 B1 + A3 B3
In each stage there is only one send per task, using point-to-point communication.
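A plain NumPy check of the four block equations above (the communication pattern itself is not shown); in PxGEMM each task holds one block of A, B, and C and obtains the operand it is missing in each stage with a single point-to-point transfer.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((4, 4))
B = rng.random((4, 4))
A0, A1, A2, A3 = A[:2, :2], A[:2, 2:], A[2:, :2], A[2:, 2:]
B0, B1, B2, B3 = B[:2, :2], B[:2, 2:], B[2:, :2], B[2:, 2:]

# The four block equations from the slide; in PxGEMM the two terms of each
# sum are formed in two stages, with one point-to-point send per task per stage.
C0 = A0 @ B0 + A1 @ B2
C1 = A1 @ B3 + A0 @ B1
C2 = A2 @ B0 + A3 @ B2
C3 = A2 @ B1 + A3 @ B3

C = np.block([[C0, C1], [C2, C3]])
print(np.allclose(C, A @ B))   # True: the block result equals the full product
```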
Measures of Parallel Performance
1. Speedup: S_p = T_1 / T_p (optimally S_p = p)
2. Efficiency: E_p = S_p / p (optimally E_p = 1)
Amdahl's law, i.e. The Unfortunate Law of Diminishing Returns:
Assume that a proportion α of the program can be run in parallel; then (1 − α) is the serial part of the code, and
  S_p ≤ ( (1 − α) + α/p )^{-1} → (1 − α)^{-1} as p → ∞.
[Plot: speedup S_p versus the number of processes p for α = 0.50, 0.75, and 0.95, each curve saturating at (1 − α)^{-1}.]
G. Amdahl, AFIPS Conference Proceedings, 30, 483-485 (1967)
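The bound is easy to evaluate directly; a short sketch reproducing the saturation at (1 − α)^{-1}:

```python
def amdahl_speedup(alpha, p):
    """Upper bound on the speedup with parallel fraction alpha on p processes."""
    return 1.0 / ((1.0 - alpha) + alpha / p)

for alpha in (0.50, 0.75, 0.95):
    values = [round(amdahl_speedup(alpha, p), 2) for p in (4, 16, 256)]
    print(f"alpha = {alpha}: S_p at p = 4, 16, 256 -> {values}, "
          f"limit {1.0 / (1.0 - alpha):.0f}")
```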
FHI-aims: Speedup on IBM's BlueGene: Ala100, α-helical
[Plot: speedup for one s.c.f. iteration (total time) versus the number of cores, together with the ideal line "speedup = p".]
with R. Johanni (RZG)

FHI-aims on IBM's BlueGene: Scaling to Many Cores
[Plot: time for one s.c.f. iteration (s) versus the number of cores for total time, density update, Hartree potential, integration, and EV solution, compared against a linear reference line; at the largest core counts only a 300 × 300 matrix remains per task.]
with R. Johanni (RZG)
FHI-aims: Parallel Scalability
Overall current status:
1. Optimised communication: up to thousands of processors on BlueGene
2. Non-optimised communication: up to hundreds of processors on Power6, Cray XT/5, and an Opteron cluster from HP
   (lots of parameters are needed for optimal MPI performance: buffer size, message size, transfer method, ...)

Parallel Data Storage
In addition to distributing the instructions, the data must also be distributed over the MPI tasks:
1. Grid-based quantities (electron density, potential): distribute the grid batches to different tasks — the parallel iterator is local to each task
2. Splines describing the Hartree potential v_at,lm(r): store different splines on different tasks — splines must be communicated to compute v_es(r)
3. Matrices and Kohn-Sham eigenvectors: distributed as dictated by ScaLAPACK
Conversely, each MPI task holds:
1. Grid batches and the associated quantities
2. Splines for the Hartree potential v_at,lm(r)
3. Pieces of the matrices and Kohn-Sham eigenvectors
Conclusions
Part I:
- Integration, the density update, and the calculation of the Hartree potential can be made to scale as O(N) (or O(N log N))
- This requires localisation of the basis functions
- The hard part that remains: the solution of the eigenproblem
Part II:
- Electronic structure theory codes need to scale to large parallel systems today
- This is achieved by minimising the serial part of the program