Mitglied der Helmholtz-Gemeinschaft. Linear algebra tasks in Materials Science: optimization and portability

Size: px
Start display at page:

Download "Mitglied der Helmholtz-Gemeinschaft. Linear algebra tasks in Materials Science: optimization and portability"

Transcription

1 Mitglied der Helmholtz-Gemeinschaft Linear algebra tasks in Materials Science: optimization and portability ADAC Workshop, July Edoardo Di Napoli

2 Outline Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration (ChASE) Hamiltonian and Overlap matrices initialization in FLAPW ADAC Workshop, July Edoardo Di Napoli Folie 2

3 Forschungszentrum Jülich ADAC Workshop, July Edoardo Di Napoli Folie 3

4 Jülich Supercomputing Centre Supercomputing operations for: Center: FZJ Region: JARA Germany: Gauss Centre for Supercomputing John von Neumann Institute for Computing Europe: PRACE, EU projects Application support Primary support & Simulation Labs Peer review support & coordination R&D work Methods and algorithms, computational science, performance analysis and tools Scientific Big Data Analytics with HPC Computer architectures, Co-Design Exascale Labs together with IBM, Intel, NVIDIA Education and training ADAC Workshop, July Edoardo Di Napoli Folie 4

5 Supercomputing systems: dual architecture strategy ADAC Workshop, July Edoardo Di Napoli Folie 5

6 Simulation Laboratory as HPC enabler ADAC Workshop, July Edoardo Di Napoli Folie 6

7 SimLab Quantum Materials (SLQM) Team Mission The Simulation Laboratory Quantum Materials (SLQM) provides expertise in the field of quantum-based simulations in Materials Science with a special focus on high-performance computing. SLQM acts as a high-level support structure in dedicated projects and hosts research projects dealing with fundamental aspects of code development, algorithmic optimization, and performance improvement. The Lab acts as an enabler of large scale simulations on current HPC platforms as well as on future architectures by targeting domain-specific co-design processes. ADAC Workshop, July Edoardo Di Napoli Folie 7

8 High-performance computations Observations: Linear algebra algorithms ubiquitous in Materials Science Numerical libraries are used as black boxes. Domain-specific knowledge does not influence algorithm choice. Rigid legacy codes are hard to modernize. Goal Design and optimize linear algebra algorithms in order to: exploit available knowledge. increase the parallelism of complex tasks. facilitate performance portability ADAC Workshop, July Edoardo Di Napoli Folie 8

9 SLQM active areas of research Adaptive integration for functional Renormalization Group (frg) HPC Tensor operations (contractions, generalized contractions, transpositions, etc.) with application in frg, Quantum Chemistry, etc. Algebraic eigensolvers (sequences of dense eigenproblems, sparse eigenproblems) with applications in Density Functional Theory (DFT), Numerical RG, Excitonic Hamiltonians, etc. Articulated linear algebra tasks (Matrix initialization, etc.) Structured Linear Systems (Dyson, Poisson, and Keldish equations) Self-consistent Field preconditioners Predicting material properties through machine learning ADAC Workshop, July Edoardo Di Napoli Folie 9

10 Two examples from FLAPW methods (DFT) An eigensolver for sequences of dense matrices Chebyshev Accelerated Subspace iteration Eigensolver (ChASE) Exterior iterative eigensolver tailored to sequences of dense Hermitian eigenvalue problems arising in DFT methods based on the LAPW basis set. A complex linear algebra task H and S Dense Linear Algebra (HSDLA) algorithm Combination of dense linear algebra kernels for the initialization of Hamiltonian and Overlap matrices in Density Functional Theory. ADAC Workshop, July Edoardo Di Napoli Folie 10

11 Density Functional Theory Self-Consistent Field General framework Initial guess for charge density n start (r) Initialize A (l) k and B(l) k matrices Solve a set of eigenproblems P (l) k 1...P (l) k N No OUTPUT Electronic structure,... Yes Converged? n (l) n (l 1) < η Compute new charge density n (l) (r) ADAC Workshop, July Edoardo Di Napoli Folie 11

12 The case of the FLAPW method Observations: 1 every P (l) k : A(l) k x = B(l) k λx is a generalized eigenvalue problem; 2 A and B are hermitian (B is positive definite); 3 required: lower 2 20 % of eigenpairs; 4 eigenvectors of problems of same k are seemingly uncorrelated across iterations i 5 k-vector index: k = 1 : ; 6 iteration cycle index: l = 1 : ADAC Workshop, July Edoardo Di Napoli Folie 12

13 Outline Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration (ChASE) Hamiltonian and Overlap matrices initialization in FLAPW ADAC Workshop, July Edoardo Di Napoli Folie 13

14 Generalized eigenvalue solver: subspace iteration Two distinct optmization opportunities: 1 Knowledge exploitation: solution of previous SCF cycle can be used to precondition the solver at the next iteration 2 Algorithm optimization: polynomial degree can be pre-computed in order to minimize Mat-Vec multiplications ADAC Workshop, July Edoardo Di Napoli Folie 14

15 Chebyshev Accelerated Subspace Eigensolver Chebyshev filter A generic vector v = n i=1 s ix i is very quickly aligned in the direction of the eigenvector corresponding to the extremal eigenvalue λ 1 v m n n = p m (A)v = s i p m (A)x i = s i p m (λ i )x i i=1 i=1 n C m ( = s 1 x 1 + λ i c e ) s i i=2 C m ( λ 1 c e ) x i s 1 x Chebyshev polynomial Chebyshev polynomial c λmin λlower λupper eigenspectrum Polynomial degree m = 6 c λmin λlower λupper eigenspectrum Polynomial degree m = 6 ADAC Workshop, July Edoardo Di Napoli Folie 15

16 The core of the algorithm: Chebyshev filter > 80% Computing time Three-terms recurrence relation C m+1 (t) = 2xC m (t) C m 1 (t); m N, C 0 (t) = 1, C 1 (t) = x Z m. = pm ( H) Z 0 with H = H ci n FOR: i = 1 DEG 1 Z i+1 2 σ i+1 e H Z i σ i+1 σ i Z i 1 xgemm END FOR. ADAC Workshop, July Edoardo Di Napoli Folie 16

17 Preconditioning ChASE Exploiting knowledge P (l) k 1 ITERATION (l) iterative solver (X (l) k 1,Λ (l) k 1 ) P (l+1) k 1 ITERATION (l + 1) iterative solver (X (l+1) k 1,Λ (l+1) k 1 ) P (l) k 2 iterative solver (X (l) k 2,Λ (l) k 2 ) P (l+1) k 2 iterative solver (X (l+1) k 2,Λ (l+1) k 2 ) Next cycle P (l) k N iterative solver (X (l) k N,Λ (l) k N ) P (l+1) k N iterative solver (X (l+1) k N,Λ (l+1) k N ) X {x 1,...,x n } Λ diag(λ 1,...,λ n ) ADAC Workshop, July Edoardo Di Napoli Folie 17

18 Speed-up Speed-up = CPU time (input random vectors) CPU time (input approximate eigenvectors) 4 Au 98 Ag 10 - n =13, cores. 3 Speed-up Iteration Index ADAC Workshop, July Edoardo Di Napoli Folie 18

19 ChASE with polynomial optimization Speed-up Speed-up = CPU time (m constant) CPU time (m optimized for each vector) Cores Multi threaded BLAS 24 Cores and 1 GPU CUBLAS Speed up Iteration cycle index ADAC Workshop, July Edoardo Di Napoli Folie 19

20 ChASE pseudocode INPUT: Hamiltonian H, TOL, DEG OPTIONAL: approximate eigenvectors Z 0, extreme eigenvalues {λ 1,λ NEV }. OUTPUT: NEV wanted eigenpairs (Λ, W). 1 Lanczos DoS step. Identify the bounds for the eigenspectrum interval corresponding to the wanted eigenspace. REPEAT UNTIL CONVERGENCE: 2 Chebyshev filter. Filter a block of vectors W Z 0. 3 Re-orthogonalize the vectors outputted by the filter; W = QR. 4 Compute the Rayleigh quotient G = Q HQ. 5 Compute the primitive Ritz pairs (Λ,Y) by solving for GY = YΛ. 6 Compute the approximate Ritz pairs (Λ,W QY). 7 Check which one among the Ritz vectors converged. 8 Deflate and lock the converged vectors. END REPEAT ADAC Workshop, July Edoardo Di Napoli Folie 20

21 Execution time for distinct platforms An example from Density Functional Theory x Intel Xeon E (16 cores) 1x NVIDIA K20m 2x NVIDIA K20m 1x XeonPhi optimized (compact) 200 Time [s] iteration ADAC Workshop, July Edoardo Di Napoli Folie 21

22 ChASE new implementation C++ Library Separation between algorithm and implementation Few kernels for low-level tasks (Lanczos, QR, Mat-Mat multiply, etc..) General interface Kernels adapted to the parallel computing platform Library templated for Real, Complex, SP and DP. Integration into applications FLEUR code Jena BSE code (based on VASP) early stage collaboration with Exciting developers early stage collaboration with developers from AICS (Kobe) ADAC Workshop, July Edoardo Di Napoli Folie 22

23 Outline Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration (ChASE) Hamiltonian and Overlap matrices initialization in FLAPW ADAC Workshop, July Edoardo Di Napoli Folie 23

24 Hamiltonian and Overlap matrices Matrix form N A H = A H a T a [AA] A a a=1 } {{ } H AA +AH [AB] a Ta B a + B H a T a [BA] A a + B H a T a [BB] } {{ } H AB+BA+BB B a N A S = A H N A a A a + B H a U a H U a B a a=1 } {{ } a=1 } {{ } S AA S BB A a and B a C N L N G while T a... C N L N L and Hermitian. N L = (l max + 1) 2 121, N G = 1,000 50,000, and N A = number of atoms. ADAC Workshop, July Edoardo Di Napoli Folie 24

25 Constructing S AA An example of memory layout re-structuring N A S AA = A H a A a. a=1 1: for a := 1 N A do 2: S AA = A H a A a (zherk: 4N L N 2 G Flops) 3: end for N G N L A 1 N G A 2. A N A N L A NA 1: S AA = A H A (zherk: 4N A N L N 2 G Flops) ADAC Workshop, July Edoardo Di Napoli Folie 25

26 Constructing H AB+BA+BB An example of algorithm re-structuring N A H AB+BA+BB = a=1 N A = a=1 B H a (T a [BA] A a ) + (A H a T a [AB] )B a BH a (T a [BB] B a ) (BH a T a [BB] )B a B H a (T a [BA] A a T[BB] a B a ) + (A H a T [AB] a BH a T a [BB] )B a 1: for a := 1 N A do 2: Z a = T [BA] a A a (zgemm: 8N 2 L N G Flops) 3: Z a = Z a T[BB] a B a (zhemm: 8N 2 L N G Flops) 4: Stack Z a to Z and B a to B 5: end for 6: H = Z H B + B H Z (zher2k: 8N A N L N 2 G Flops) ADAC Workshop, July Edoardo Di Napoli Folie 26

27 FLEUR early optimizations Minimization of FLOPS in FLEUR S AA = (4π)2 Ω a l sph f l,a,g f l,a,g exp[i(k G K G ) x a ] l=0 Wl,a 2 l Y ( ) ( ) l,m Ra ˆK G Yl,m Ra ˆK G. m= l } {{ } The sum over m can be removed by using the well known identity ( ) 2l + 1 P l ˆK G ˆK G 4π = Removing the sum over m yields a reduction in #ops by a factor of ( lsph + 1 ) 10 but destroys matrix form = premature optimization ADAC Workshop, July Edoardo Di Napoli Folie 27

28 Hamiltonian and Overlap matrices initialization Main features Boosted performance through highly optimized computational kernels. Performance portability through modular implementation. Compute bound implementation at the cost of increased complexity (more FLOPs). NaCl (Kmax =4.0): IvyBridge Haswell TiO2 (Kmax =3.6): IvyBridge Haswell 3 speedup #threads Figure 1: Speedup of our algorithm over FLEUR with kmax =4and increasing parallelism ADAC Workshop, July Edoardo Di Napoli Folie 28

29 Porting to heterogeneous architectures Back-of-the-envelope analysis 5 lines of the algorithm constitute 97% of flops Correspond to BLAS-3 operations (gemm, herk, her2k) High arithmetic intensity and should fit GPUs well First step: offload these routine calls All 5 are BLAS kernels. Can we use some library? cublas cublas-xt MAGMA BLASX ADAC Workshop, July Edoardo Di Napoli Folie 29

30 Porting to heterogeneous architectures Additional code? 3x wrappers (zgemm, zherk, zher2k) Init and cleanup of cuda runtime and devices Get #devices, initialize devices, create handlers,... Destroy handlers, free devices,... Allocate data in page-locked memory Avoid hidden copies Fast data transfer Only around 100 lines of additional code ADAC Workshop, July Edoardo Di Napoli Folie 30

31 Experimental results Test case 2: AuAg (N A = 108, N L = 121, N G = [ ]) Time ( seconds ) CPU CPU / 1 GPU CPU / 2 GPUs 3.03x 4.96x Kmax Sandy Bridge: CPU: E5-2650, 2 x 8core, 2.0GHz, 64GBs RAM, 2 Nvidia Tesla K20X Peak performance: 256 GFs/s + 2 x 1.3 TFs/s ADAC Workshop, July Edoardo Di Napoli Folie 31

32 Conclusions Conclusions Modernizing algorithm structure of legacy code is critical Tailoring algorithms to application-specific tasks Exploiting knowledge from domain-specific applications Layered design built on top of standardized libraries Increase in performance (Almost) free lunch performance portability ADAC Workshop, July Edoardo Di Napoli Folie 32

33 References ChASE: a Chebyshev Accelerated Subspace iteration Eigensolver for sequences of eigenvalue problems. J. Winkelmann, M. Berljafa, P. Springer and E. Di Napoli. In preparation. An Optimized and Scalable Eigensolver for Sequences of Eigenvalue Problems. M. Berl-jafa, D. Wortmann, and E. Di Napoli. Concurrency Computat.: Pract. Exper. 27 (2015), pp [arxiv: ] High-performance generation of the Hamiltonian and Overlap matrices in FLAPW methods. Edoardo Di Napoli, Elmar Peise, Markus Hrywniak and Paolo Bientinesi. Comp. Phys. Comm. 211 (2017) [arxiv: ] Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods. Diego Fabregat-Traver, Davor Davidović, Markus Höhnerbach, Edoardo Di Napoli. Lecture Notes in Computer Science, High-Performance Scientific Computing (2016), pp [arxiv: ] ADAC Workshop, July Edoardo Di Napoli Folie 33

34 Thank you QUANTUMATERIALS For more information ADAC Workshop, July Edoardo Di Napoli Folie 34

A knowledge-based approach to high-performance computing in ab initio simulations.

A knowledge-based approach to high-performance computing in ab initio simulations. Mitglied der Helmholtz-Gemeinschaft A knowledge-based approach to high-performance computing in ab initio simulations. AICES Advisory Board Meeting. July 14th 2014 Edoardo Di Napoli Academic background

More information

Improving the performance of applied science numerical simulations: an application to Density Functional Theory

Improving the performance of applied science numerical simulations: an application to Density Functional Theory Improving the performance of applied science numerical simulations: an application to Density Functional Theory Edoardo Di Napoli Jülich Supercomputing Center - Institute for Advanced Simulation Forschungszentrum

More information

Block Iterative Eigensolvers for Sequences of Dense Correlated Eigenvalue Problems

Block Iterative Eigensolvers for Sequences of Dense Correlated Eigenvalue Problems Mitglied der Helmholtz-Gemeinschaft Block Iterative Eigensolvers for Sequences of Dense Correlated Eigenvalue Problems Birkbeck University, London, June the 29th 2012 Edoardo Di Napoli Motivation and Goals

More information

arxiv: v1 [cs.ce] 31 Oct 2016

arxiv: v1 [cs.ce] 31 Oct 2016 Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods arxiv:1611.00606v1 [cs.ce] 31 Oct 2016 Diego Fabregat-Traver, Davor Davidović, Markus Höhnerbach, and Edoardo di Napoli

More information

Accelerating linear algebra computations with hybrid GPU-multicore systems.

Accelerating linear algebra computations with hybrid GPU-multicore systems. Accelerating linear algebra computations with hybrid GPU-multicore systems. Marc Baboulin INRIA/Université Paris-Sud joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory)

More information

An Optimized and Scalable Eigensolver for Sequences of Eigenvalue Problems

An Optimized and Scalable Eigensolver for Sequences of Eigenvalue Problems An Optimized and Scalable Eigensolver for Sequences of Eigenvalue Problems Mario Berljafa Daniel Wortmann Edoardo Di Napoli arxiv:1404.4161v2 [cs.ms] 6 Jul 2014 July 8, 2014 Abstract In many scientific

More information

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric

More information

Simulation Laboratories at JSC

Simulation Laboratories at JSC Mitglied der Helmholtz-Gemeinschaft Simulation Laboratories at JSC Paul Gibbon Jülich Supercomputing Centre Jülich Supercomputing Centre Supercomputer operation for Centre FZJ Regional JARA Helmholtz &

More information

Opportunities for ELPA to Accelerate the Solution of the Bethe-Salpeter Eigenvalue Problem

Opportunities for ELPA to Accelerate the Solution of the Bethe-Salpeter Eigenvalue Problem Opportunities for ELPA to Accelerate the Solution of the Bethe-Salpeter Eigenvalue Problem Peter Benner, Andreas Marek, Carolin Penke August 16, 2018 ELSI Workshop 2018 Partners: The Problem The Bethe-Salpeter

More information

GPU accelerated Arnoldi solver for small batched matrix

GPU accelerated Arnoldi solver for small batched matrix 15. 09. 22 GPU accelerated Arnoldi solver for small batched matrix Samsung Advanced Institute of Technology Hyung-Jin Kim Contents - Eigen value problems - Solution - Arnoldi Algorithm - Target - CUDA

More information

A hybrid Hermitian general eigenvalue solver

A hybrid Hermitian general eigenvalue solver Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe A hybrid Hermitian general eigenvalue solver Raffaele Solcà *, Thomas C. Schulthess Institute fortheoretical Physics ETHZ,

More information

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) Schodinger equation: Hψ = Eψ Choose a basis set of wave functions Two cases: Orthonormal

More information

MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs

MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs Lucien Ng The Chinese University of Hong Kong Kwai Wong The Joint Institute for Computational Sciences (JICS), UTK and ORNL Azzam Haidar,

More information

MARCH 24-27, 2014 SAN JOSE, CA

MARCH 24-27, 2014 SAN JOSE, CA MARCH 24-27, 2014 SAN JOSE, CA Sparse HPC on modern architectures Important scientific applications rely on sparse linear algebra HPCG a new benchmark proposal to complement Top500 (HPL) To solve A x =

More information

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is

More information

Recent implementations, applications, and extensions of the Locally Optimal Block Preconditioned Conjugate Gradient method LOBPCG

Recent implementations, applications, and extensions of the Locally Optimal Block Preconditioned Conjugate Gradient method LOBPCG MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Recent implementations, applications, and extensions of the Locally Optimal Block Preconditioned Conjugate Gradient method LOBPCG Knyazev,

More information

Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem

Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National

More information

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a

More information

Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA

Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA S7255: CUTT: A HIGH- PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA MOTIVATION Tensor contractions are the most computationally intensive part of quantum

More information

APPLIED NUMERICAL LINEAR ALGEBRA

APPLIED NUMERICAL LINEAR ALGEBRA APPLIED NUMERICAL LINEAR ALGEBRA James W. Demmel University of California Berkeley, California Society for Industrial and Applied Mathematics Philadelphia Contents Preface 1 Introduction 1 1.1 Basic Notation

More information

Computing least squares condition numbers on hybrid multicore/gpu systems

Computing least squares condition numbers on hybrid multicore/gpu systems Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning

More information

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National

More information

Arnoldi Methods in SLEPc

Arnoldi Methods in SLEPc Scalable Library for Eigenvalue Problem Computations SLEPc Technical Report STR-4 Available at http://slepc.upv.es Arnoldi Methods in SLEPc V. Hernández J. E. Román A. Tomás V. Vidal Last update: October,

More information

Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics)

Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics) Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics) Eftychios Sifakis CS758 Guest Lecture - 19 Sept 2012 Introduction Linear systems

More information

Level-3 BLAS on a GPU

Level-3 BLAS on a GPU Level-3 BLAS on a GPU Picking the Low Hanging Fruit Francisco Igual 1 Gregorio Quintana-Ortí 1 Robert A. van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores. University Jaume I. Castellón

More information

Introduction to numerical computations on the GPU

Introduction to numerical computations on the GPU Introduction to numerical computations on the GPU Lucian Covaci http://lucian.covaci.org/cuda.pdf Tuesday 1 November 11 1 2 Outline: NVIDIA Tesla and Geforce video cards: architecture CUDA - C: programming

More information

Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method

Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method Ilya B. Labutin A.A. Trofimuk Institute of Petroleum Geology and Geophysics SB RAS, 3, acad. Koptyug Ave., Novosibirsk

More information

Eigenvalue Problems CHAPTER 1 : PRELIMINARIES

Eigenvalue Problems CHAPTER 1 : PRELIMINARIES Eigenvalue Problems CHAPTER 1 : PRELIMINARIES Heinrich Voss voss@tu-harburg.de Hamburg University of Technology Institute of Mathematics TUHH Heinrich Voss Preliminaries Eigenvalue problems 2012 1 / 14

More information

Efficient implementation of the overlap operator on multi-gpus

Efficient implementation of the overlap operator on multi-gpus Efficient implementation of the overlap operator on multi-gpus Andrei Alexandru Mike Lujan, Craig Pelissier, Ben Gamari, Frank Lee SAAHPC 2011 - University of Tennessee Outline Motivation Overlap operator

More information

MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors

MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors J. Dongarra, M. Gates, A. Haidar, Y. Jia, K. Kabir, P. Luszczek, and S. Tomov University of Tennessee, Knoxville 05 / 03 / 2013 MAGMA:

More information

ELSI: A Unified Software Interface for Kohn-Sham Electronic Structure Solvers

ELSI: A Unified Software Interface for Kohn-Sham Electronic Structure Solvers ELSI: A Unified Software Interface for Kohn-Sham Electronic Structure Solvers Victor Yu and the ELSI team Department of Mechanical Engineering & Materials Science Duke University Kohn-Sham Density-Functional

More information

Jacobi-Davidson Eigensolver in Cusolver Library. Lung-Sheng Chien, NVIDIA

Jacobi-Davidson Eigensolver in Cusolver Library. Lung-Sheng Chien, NVIDIA Jacobi-Davidson Eigensolver in Cusolver Library Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline CuSolver library - cusolverdn: dense LAPACK - cusolversp: sparse LAPACK - cusolverrf: refactorization

More information

Performance Evaluation of Scientific Applications on POWER8

Performance Evaluation of Scientific Applications on POWER8 Performance Evaluation of Scientific Applications on POWER8 2014 Nov 16 Andrew V. Adinetz 1, Paul F. Baumeister 1, Hans Böttiger 3, Thorsten Hater 1, Thilo Maurer 3, Dirk Pleiter 1, Wolfram Schenck 4,

More information

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,

More information

The Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors

The Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors Aachen Institute for Advanced Study in Computational Engineering Science Preprint: AICES-2010/09-4 23/September/2010 The Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors

More information

EIGIFP: A MATLAB Program for Solving Large Symmetric Generalized Eigenvalue Problems

EIGIFP: A MATLAB Program for Solving Large Symmetric Generalized Eigenvalue Problems EIGIFP: A MATLAB Program for Solving Large Symmetric Generalized Eigenvalue Problems JAMES H. MONEY and QIANG YE UNIVERSITY OF KENTUCKY eigifp is a MATLAB program for computing a few extreme eigenvalues

More information

RWTH Aachen University

RWTH Aachen University IPCC @ RWTH Aachen University Optimization of multibody and long-range solvers in LAMMPS Rodrigo Canales William McDoniel Markus Höhnerbach Ahmed E. Ismail Paolo Bientinesi IPCC Showcase November 2016

More information

INITIAL INTEGRATION AND EVALUATION

INITIAL INTEGRATION AND EVALUATION INITIAL INTEGRATION AND EVALUATION OF SLATE PARALLEL BLAS IN LATTE Marc Cawkwell, Danny Perez, Arthur Voter Asim YarKhan, Gerald Ragghianti, Jack Dongarra, Introduction The aim of the joint milestone STMS10-52

More information

Contents. Preface... xi. Introduction...

Contents. Preface... xi. Introduction... Contents Preface... xi Introduction... xv Chapter 1. Computer Architectures... 1 1.1. Different types of parallelism... 1 1.1.1. Overlap, concurrency and parallelism... 1 1.1.2. Temporal and spatial parallelism

More information

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015 Tips Geared Towards R Departments of Statistics North Carolina State University Arpil 10, 2015 1 / 30 Advantages of R As an interpretive and interactive language, developing an algorithm in R can be done

More information

A tale of efficiency and productivity. From scalar to tensor computations.

A tale of efficiency and productivity. From scalar to tensor computations. A tale of efficiency and productivity. From scalar to tensor computations. Paolo Bientinesi Aachen nstitute for Computational Engineering Science RWTH Aachen University October 23, 2017 Umeå Universitet

More information

Performance Prediction for Tensor Contractions

Performance Prediction for Tensor Contractions 1 / 18 Performance Prediction for Tensor Contractions Paolo Bientinesi, Edoardo Di Napoli, Diego Fabregat, Elmar Peise AICES, RWTH Aachen pauldj@aices.rwth-aachen.de June 3rd, 214 PASCConference 14 Zürich,

More information

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry and Eugene DePrince Argonne National Laboratory (LCF and CNM) (Eugene moved to Georgia Tech last week)

More information

LARGE SPARSE EIGENVALUE PROBLEMS. General Tools for Solving Large Eigen-Problems

LARGE SPARSE EIGENVALUE PROBLEMS. General Tools for Solving Large Eigen-Problems LARGE SPARSE EIGENVALUE PROBLEMS Projection methods The subspace iteration Krylov subspace methods: Arnoldi and Lanczos Golub-Kahan-Lanczos bidiagonalization General Tools for Solving Large Eigen-Problems

More information

ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU

ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU STEVE RENNICH, SR. ENGINEER, NVIDIA DEVELOPER TECHNOLOGY DARKO STOSIC, PHD CANDIDATE, UNIV. FEDERAL DE PERNAMBUCO TIM DAVIS, PROFESSOR, CSE, TEXAS

More information

Verbundprojekt ELPA-AEO. Eigenwert-Löser für Petaflop-Anwendungen Algorithmische Erweiterungen und Optimierungen

Verbundprojekt ELPA-AEO. Eigenwert-Löser für Petaflop-Anwendungen Algorithmische Erweiterungen und Optimierungen Verbundprojekt ELPA-AEO http://elpa-aeo.mpcdf.mpg.de Eigenwert-Löser für Petaflop-Anwendungen Algorithmische Erweiterungen und Optimierungen BMBF Projekt 01IH15001 Feb 2016 - Jan 2019 7. HPC-Statustagung,

More information

LARGE SPARSE EIGENVALUE PROBLEMS

LARGE SPARSE EIGENVALUE PROBLEMS LARGE SPARSE EIGENVALUE PROBLEMS Projection methods The subspace iteration Krylov subspace methods: Arnoldi and Lanczos Golub-Kahan-Lanczos bidiagonalization 14-1 General Tools for Solving Large Eigen-Problems

More information

Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2

Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 1 / 23 Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 Maison de la Simulation Lille 1 University CNRS March 18, 2013

More information

On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code

On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code E Calore, S F Schifano, R Tripiccione Enrico Calore INFN Ferrara, Italy 7 th Workshop on UnConventional High Performance

More information

Lecture 11: CMSC 878R/AMSC698R. Iterative Methods An introduction. Outline. Inverse, LU decomposition, Cholesky, SVD, etc.

Lecture 11: CMSC 878R/AMSC698R. Iterative Methods An introduction. Outline. Inverse, LU decomposition, Cholesky, SVD, etc. Lecture 11: CMSC 878R/AMSC698R Iterative Methods An introduction Outline Direct Solution of Linear Systems Inverse, LU decomposition, Cholesky, SVD, etc. Iterative methods for linear systems Why? Matrix

More information

On the design of parallel linear solvers for large scale problems

On the design of parallel linear solvers for large scale problems On the design of parallel linear solvers for large scale problems Journée problème de Poisson, IHP, Paris M. Faverge, P. Ramet M. Faverge Assistant Professor Bordeaux INP LaBRI Inria Bordeaux - Sud-Ouest

More information

FEAST eigenvalue algorithm and solver: review and perspectives

FEAST eigenvalue algorithm and solver: review and perspectives FEAST eigenvalue algorithm and solver: review and perspectives Eric Polizzi Department of Electrical and Computer Engineering University of Masachusetts, Amherst, USA Sparse Days, CERFACS, June 25, 2012

More information

Strassen s Algorithm for Tensor Contraction

Strassen s Algorithm for Tensor Contraction Strassen s Algorithm for Tensor Contraction Jianyu Huang, Devin A. Matthews, Robert A. van de Geijn The University of Texas at Austin September 14-15, 2017 Tensor Computation Workshop Flatiron Institute,

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012

MAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012 MAGMA Matrix Algebra on GPU and Multicore Architectures Mark Gates February 2012 1 Hardware trends Scale # cores instead of clock speed Hardware issue became software issue Multicore Hybrid 1.E+07 1e7

More information

Solution of eigenvalue problems. Subspace iteration, The symmetric Lanczos algorithm. Harmonic Ritz values, Jacobi-Davidson s method

Solution of eigenvalue problems. Subspace iteration, The symmetric Lanczos algorithm. Harmonic Ritz values, Jacobi-Davidson s method Solution of eigenvalue problems Introduction motivation Projection methods for eigenvalue problems Subspace iteration, The symmetric Lanczos algorithm Nonsymmetric Lanczos procedure; Implicit restarts

More information

Dense Arithmetic over Finite Fields with CUMODP

Dense Arithmetic over Finite Fields with CUMODP Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,

More information

Solution of eigenvalue problems. Subspace iteration, The symmetric Lanczos algorithm. Harmonic Ritz values, Jacobi-Davidson s method

Solution of eigenvalue problems. Subspace iteration, The symmetric Lanczos algorithm. Harmonic Ritz values, Jacobi-Davidson s method Solution of eigenvalue problems Introduction motivation Projection methods for eigenvalue problems Subspace iteration, The symmetric Lanczos algorithm Nonsymmetric Lanczos procedure; Implicit restarts

More information

Parallel Eigensolver Performance on High Performance Computers 1

Parallel Eigensolver Performance on High Performance Computers 1 Parallel Eigensolver Performance on High Performance Computers 1 Andrew Sunderland STFC Daresbury Laboratory, Warrington, UK Abstract Eigenvalue and eigenvector computations arise in a wide range of scientific

More information

Domain Decomposition-based contour integration eigenvalue solvers

Domain Decomposition-based contour integration eigenvalue solvers Domain Decomposition-based contour integration eigenvalue solvers Vassilis Kalantzis joint work with Yousef Saad Computer Science and Engineering Department University of Minnesota - Twin Cities, USA SIAM

More information

Accelerating Model Reduction of Large Linear Systems with Graphics Processors

Accelerating Model Reduction of Large Linear Systems with Graphics Processors Accelerating Model Reduction of Large Linear Systems with Graphics Processors P. Benner 1, P. Ezzatti 2, D. Kressner 3, E.S. Quintana-Ortí 4, Alfredo Remón 4 1 Max-Plank-Institute for Dynamics of Complex

More information

Comparing the Efficiency of Iterative Eigenvalue Solvers: the Quantum ESPRESSO experience

Comparing the Efficiency of Iterative Eigenvalue Solvers: the Quantum ESPRESSO experience Comparing the Efficiency of Iterative Eigenvalue Solvers: the Quantum ESPRESSO experience Stefano de Gironcoli Scuola Internazionale Superiore di Studi Avanzati Trieste-Italy 0 Diagonalization of the Kohn-Sham

More information

ab initio Electronic Structure Calculations

ab initio Electronic Structure Calculations ab initio Electronic Structure Calculations New scalability frontiers using the BG/L Supercomputer C. Bekas, A. Curioni and W. Andreoni IBM, Zurich Research Laboratory Rueschlikon 8803, Switzerland ab

More information

Is there life after the Lanczos method? What is LOBPCG?

Is there life after the Lanczos method? What is LOBPCG? 1 Is there life after the Lanczos method? What is LOBPCG? Andrew V. Knyazev Department of Mathematics and Center for Computational Mathematics University of Colorado at Denver SIAM ALA Meeting, July 17,

More information

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS Jorge González-Domínguez*, Bertil Schmidt*, Jan C. Kässens**, Lars Wienbrandt** *Parallel and Distributed Architectures

More information

TENSOR LAYERS FOR COMPRESSION OF DEEP LEARNING NETWORKS. Cris Cecka Senior Research Scientist, NVIDIA GTC 2018

TENSOR LAYERS FOR COMPRESSION OF DEEP LEARNING NETWORKS. Cris Cecka Senior Research Scientist, NVIDIA GTC 2018 TENSOR LAYERS FOR COMPRESSION OF DEEP LEARNING NETWORKS Cris Cecka Senior Research Scientist, NVIDIA GTC 2018 Tensors Computations and the GPU AGENDA Tensor Networks and Decompositions Tensor Layers in

More information

Numerical Methods in Matrix Computations

Numerical Methods in Matrix Computations Ake Bjorck Numerical Methods in Matrix Computations Springer Contents 1 Direct Methods for Linear Systems 1 1.1 Elements of Matrix Theory 1 1.1.1 Matrix Algebra 2 1.1.2 Vector Spaces 6 1.1.3 Submatrices

More information

Case Study: Quantum Chromodynamics

Case Study: Quantum Chromodynamics Case Study: Quantum Chromodynamics Michael Clark Harvard University with R. Babich, K. Barros, R. Brower, J. Chen and C. Rebbi Outline Primer to QCD QCD on a GPU Mixed Precision Solvers Multigrid solver

More information

Parallelization Strategies for Density Matrix Renormalization Group algorithms on Shared-Memory Systems

Parallelization Strategies for Density Matrix Renormalization Group algorithms on Shared-Memory Systems Parallelization Strategies for Density Matrix Renormalization Group algorithms on Shared-Memory Systems G. Hager HPC Services, Computing Center Erlangen, Germany E. Jeckelmann Theoretical Physics, Univ.

More information

On the design of parallel linear solvers for large scale problems

On the design of parallel linear solvers for large scale problems On the design of parallel linear solvers for large scale problems ICIAM - August 2015 - Mini-Symposium on Recent advances in matrix computations for extreme-scale computers M. Faverge, X. Lacoste, G. Pichon,

More information

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago

More information

problem Au = u by constructing an orthonormal basis V k = [v 1 ; : : : ; v k ], at each k th iteration step, and then nding an approximation for the e

problem Au = u by constructing an orthonormal basis V k = [v 1 ; : : : ; v k ], at each k th iteration step, and then nding an approximation for the e A Parallel Solver for Extreme Eigenpairs 1 Leonardo Borges and Suely Oliveira 2 Computer Science Department, Texas A&M University, College Station, TX 77843-3112, USA. Abstract. In this paper a parallel

More information

Direct Self-Consistent Field Computations on GPU Clusters

Direct Self-Consistent Field Computations on GPU Clusters Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd

More information

SPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics

SPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics SPARSE SOLVERS FOR THE POISSON EQUATION Margreet Nool CWI, Multiscale Dynamics November 9, 2015 OUTLINE OF THIS TALK 1 FISHPACK, LAPACK, PARDISO 2 SYSTEM OVERVIEW OF CARTESIUS 3 POISSON EQUATION 4 SOLVERS

More information

A Jacobi Davidson Method with a Multigrid Solver for the Hermitian Wilson-Dirac Operator

A Jacobi Davidson Method with a Multigrid Solver for the Hermitian Wilson-Dirac Operator A Jacobi Davidson Method with a Multigrid Solver for the Hermitian Wilson-Dirac Operator Artur Strebel Bergische Universität Wuppertal August 3, 2016 Joint Work This project is joint work with: Gunnar

More information

A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures

A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,

More information

Deflation for inversion with multiple right-hand sides in QCD

Deflation for inversion with multiple right-hand sides in QCD Deflation for inversion with multiple right-hand sides in QCD A Stathopoulos 1, A M Abdel-Rehim 1 and K Orginos 2 1 Department of Computer Science, College of William and Mary, Williamsburg, VA 23187 2

More information

Towards high performance IRKA on hybrid CPU-GPU systems

Towards high performance IRKA on hybrid CPU-GPU systems Towards high performance IRKA on hybrid CPU-GPU systems Jens Saak in collaboration with Georg Pauer (OVGU/MPI Magdeburg) Kapil Ahuja, Ruchir Garg (IIT Indore) Hartwig Anzt, Jack Dongarra (ICL Uni Tennessee

More information

Block AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark

Block AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark Block AIR Methods For Multicore and GPU Per Christian Hansen Hans Henrik B. Sørensen Technical University of Denmark Model Problem and Notation Parallel-beam 3D tomography exact solution exact data noise

More information

A CUDA Solver for Helmholtz Equation

A CUDA Solver for Helmholtz Equation Journal of Computational Information Systems 11: 24 (2015) 7805 7812 Available at http://www.jofcis.com A CUDA Solver for Helmholtz Equation Mingming REN 1,2,, Xiaoguang LIU 1,2, Gang WANG 1,2 1 College

More information

Cucheb: A GPU implementation of the filtered Lanczos procedure. Department of Computer Science and Engineering University of Minnesota, Twin Cities

Cucheb: A GPU implementation of the filtered Lanczos procedure. Department of Computer Science and Engineering University of Minnesota, Twin Cities Cucheb: A GPU implementation of the filtered Lanczos procedure Jared L. Aurentz, Vassilis Kalantzis, and Yousef Saad May 2017 EPrint ID: 2017.1 Department of Computer Science and Engineering University

More information

A Compiler for Linear Algebra Operations

A Compiler for Linear Algebra Operations A Compiler for Linear Algebra Operations Paolo Bientinesi In collaboration with Diego Fabregat AICES, RWTH Aachen pauldj@aices.rwth-aachen.de CScADS Autotuning Workshop 2012 August 13-14, 2012 Snowbird,

More information

Numerical Methods I Eigenvalue Problems

Numerical Methods I Eigenvalue Problems Numerical Methods I Eigenvalue Problems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 October 2nd, 2014 A. Donev (Courant Institute) Lecture

More information

Preconditioned Eigenvalue Solvers for electronic structure calculations. Andrew V. Knyazev. Householder Symposium XVI May 26, 2005

Preconditioned Eigenvalue Solvers for electronic structure calculations. Andrew V. Knyazev. Householder Symposium XVI May 26, 2005 1 Preconditioned Eigenvalue Solvers for electronic structure calculations Andrew V. Knyazev Department of Mathematics and Center for Computational Mathematics University of Colorado at Denver Householder

More information

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Nano-scale Integrated Circuit and System (NICS) Laboratory Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Xiaoming Chen PhD Candidate Department of Electronic Engineering Tsinghua University,

More information

Robust Preconditioned Conjugate Gradient for the GPU and Parallel Implementations

Robust Preconditioned Conjugate Gradient for the GPU and Parallel Implementations Robust Preconditioned Conjugate Gradient for the GPU and Parallel Implementations Rohit Gupta, Martin van Gijzen, Kees Vuik GPU Technology Conference 2012, San Jose CA. GPU Technology Conference 2012,

More information

A Parallel Implementation of the Davidson Method for Generalized Eigenproblems

A Parallel Implementation of the Davidson Method for Generalized Eigenproblems A Parallel Implementation of the Davidson Method for Generalized Eigenproblems Eloy ROMERO 1 and Jose E. ROMAN Instituto ITACA, Universidad Politécnica de Valencia, Spain Abstract. We present a parallel

More information

S0214 : GPU Based Stacking Sequence Generation For Composite Skins Using GA

S0214 : GPU Based Stacking Sequence Generation For Composite Skins Using GA S0214 : GPU Based Stacking Sequence Generation For Composite Skins Using GA Date: 16th May 2012 Wed, 3pm to 3.25pm(Adv. Session) Sathyanarayana K., Manish Banga, and Ravi Kumar G. V. V. Engineering Services,

More information

Lecture 1: Numerical Issues from Inverse Problems (Parameter Estimation, Regularization Theory, and Parallel Algorithms)

Lecture 1: Numerical Issues from Inverse Problems (Parameter Estimation, Regularization Theory, and Parallel Algorithms) Lecture 1: Numerical Issues from Inverse Problems (Parameter Estimation, Regularization Theory, and Parallel Algorithms) Youzuo Lin 1 Joint work with: Rosemary A. Renaut 2 Brendt Wohlberg 1 Hongbin Guo

More information

Parallel Eigensolver Performance on High Performance Computers

Parallel Eigensolver Performance on High Performance Computers Parallel Eigensolver Performance on High Performance Computers Andrew Sunderland Advanced Research Computing Group STFC Daresbury Laboratory CUG 2008 Helsinki 1 Summary (Briefly) Introduce parallel diagonalization

More information

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters H. Köstler 2nd International Symposium Computer Simulations on GPU Freudenstadt, 29.05.2013 1 Contents Motivation walberla software concepts

More information

Index. for generalized eigenvalue problem, butterfly form, 211

Index. for generalized eigenvalue problem, butterfly form, 211 Index ad hoc shifts, 165 aggressive early deflation, 205 207 algebraic multiplicity, 35 algebraic Riccati equation, 100 Arnoldi process, 372 block, 418 Hamiltonian skew symmetric, 420 implicitly restarted,

More information

The new challenges to Krylov subspace methods Yousef Saad Department of Computer Science and Engineering University of Minnesota

The new challenges to Krylov subspace methods Yousef Saad Department of Computer Science and Engineering University of Minnesota The new challenges to Krylov subspace methods Yousef Saad Department of Computer Science and Engineering University of Minnesota SIAM Applied Linear Algebra Valencia, June 18-22, 2012 Introduction Krylov

More information

Practical Combustion Kinetics with CUDA

Practical Combustion Kinetics with CUDA Funded by: U.S. Department of Energy Vehicle Technologies Program Program Manager: Gurpreet Singh & Leo Breton Practical Combustion Kinetics with CUDA GPU Technology Conference March 20, 2015 Russell Whitesides

More information

PFEAST: A High Performance Sparse Eigenvalue Solver Using Distributed-Memory Linear Solvers

PFEAST: A High Performance Sparse Eigenvalue Solver Using Distributed-Memory Linear Solvers PFEAST: A High Performance Sparse Eigenvalue Solver Using Distributed-Memory Linear Solvers James Kestyn, Vasileios Kalantzis, Eric Polizzi, Yousef Saad Electrical and Computer Engineering Department,

More information

Journal of Computational Physics

Journal of Computational Physics Journal of Computational Physics 229 (2010) 9188 9200 Contents lists available at ScienceDirect Journal of Computational Physics journal homepage: www.elsevier.com/locate/jcp A block Chebyshev Davidson

More information

A Parallel Additive Schwarz Preconditioned Jacobi-Davidson Algorithm for Polynomial Eigenvalue Problems in Quantum Dot Simulation

A Parallel Additive Schwarz Preconditioned Jacobi-Davidson Algorithm for Polynomial Eigenvalue Problems in Quantum Dot Simulation A Parallel Additive Schwarz Preconditioned Jacobi-Davidson Algorithm for Polynomial Eigenvalue Problems in Quantum Dot Simulation Feng-Nan Hwang a, Zih-Hao Wei a, Tsung-Ming Huang b, Weichung Wang c, a

More information

3D Cartesian Transport Sweep for Massively Parallel Architectures on top of PaRSEC

3D Cartesian Transport Sweep for Massively Parallel Architectures on top of PaRSEC 3D Cartesian Transport Sweep for Massively Parallel Architectures on top of PaRSEC 9th Scheduling for Large Scale Systems Workshop, Lyon S. Moustafa, M. Faverge, L. Plagne, and P. Ramet S. Moustafa, M.

More information

GPU ACCELERATED POLYNOMIAL SPECTRAL TRANSFORMATION METHODS

GPU ACCELERATED POLYNOMIAL SPECTRAL TRANSFORMATION METHODS GPU ACCELERATED POLYNOMIAL SPECTRAL TRANSFORMATION METHODS By: JARED L. AURENTZ A dissertation submitted in fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY WASHINGTON STATE UNIVERSITY

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 16: Reduction to Hessenberg and Tridiagonal Forms; Rayleigh Quotient Iteration Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical

More information

Piz Daint & Piz Kesch : from general purpose supercomputing to an appliance for weather forecasting. Thomas C. Schulthess

Piz Daint & Piz Kesch : from general purpose supercomputing to an appliance for weather forecasting. Thomas C. Schulthess Piz Daint & Piz Kesch : from general purpose supercomputing to an appliance for weather forecasting Thomas C. Schulthess 1 Cray XC30 with 5272 hybrid, GPU accelerated compute nodes Piz Daint Compute node:

More information