Explore Computational Power of GPU in Electromagnetics and Micromagnetics


1 Explore Computational Power of GPU in Electromagnetics and Micromagnetics
Presenter: Sidi Fu, PhD candidate, UC San Diego
Advisor: Prof. Vitaliy Lomakin
Center of Magnetic Recording Research, Department of Electrical and Computer Engineering, University of California, San Diego
4/20/2014 COMPUTATIONAL ELECTROMAGNETICS AND MICROMAGNETICS GROUP, ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT, UCSD

2 Outline
Motivation
  Micromagnetics: FastMag solver
  Electromagnetics
GPU Acceleration Projects
  Non-uniform Fast Fourier Transform
  Sparse Matrix Vector Multiplication
  Finite Difference Method solver OOMMF
Simulation examples


4 Motivation
Typical applications of micromagnetic simulations: hard drives, magnetic materials, magnetic memory
Typical problem scale: 100K ~ 100M
CPU? Too slow.
MPI? Possible but expensive.
GPU: (relatively) low cost, high performance.

5 Motivation
Landau-Lifshitz-Gilbert equation for magnetization dynamics:
$$\frac{\partial \hat{m}}{\partial t} = -\frac{\gamma}{1+\alpha^2}\left[\hat{m}\times H_{eff} + \alpha\,\hat{m}\times\left(\hat{m}\times H_{eff}\right)\right]$$
This nonlinear differential equation is solved by marching on in time, e.g.
$$\frac{\hat{m}(t+\Delta t)-\hat{m}(t)}{\Delta t} = -\frac{\gamma}{1+\alpha^2}\left[\hat{m}(t)\times H_{eff}(t) + \alpha\,\hat{m}(t)\times\left(\hat{m}(t)\times H_{eff}(t)\right)\right]$$
Effective field contributions:
Near field (local field): exchange field, a differential operator
$$H_{ex} = \frac{2A}{M_s}\nabla^2\hat{m}$$
  Sparse matrix -> can become a bottleneck
Far field (long-range field): demagnetization field, an integral operator
$$H_{demag}(r) = \nabla\int\frac{\nabla'\cdot\left(M_s\hat{m}(r')\right)}{|r-r'|}\,dr'$$
  Dense matrix -> bottleneck: O(N^2)
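The time-marching idea on this slide can be illustrated for a single magnetization vector. The following is a minimal CPU sketch, not FastMag code: it assumes a forward Euler step followed by renormalization of the unit vector, a constant effective field, and illustrative values for `gamma`, `alpha` and `dt` (all of these are assumptions, as are the function names).

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def llg_rhs(m, h_eff, gamma, alpha):
    """dm/dt = -gamma/(1+alpha^2) * [m x H + alpha * m x (m x H)]."""
    mxh = cross(m, h_eff)
    mxmxh = cross(m, mxh)
    pref = -gamma / (1.0 + alpha * alpha)
    return tuple(pref * (a + alpha * b) for a, b in zip(mxh, mxmxh))

def euler_step(m, h_eff, gamma, alpha, dt):
    """One forward Euler step, then renormalize so |m| stays 1 (assumed scheme)."""
    dm = llg_rhs(m, h_eff, gamma, alpha)
    m_new = [mi + dt * di for mi, di in zip(m, dm)]
    norm = math.sqrt(sum(c * c for c in m_new))
    return tuple(c / norm for c in m_new)

def relax(m, h_eff, gamma=1.0, alpha=0.2, dt=0.01, steps=5000):
    """March in time with a constant effective field (illustrative parameters)."""
    for _ in range(steps):
        m = euler_step(m, h_eff, gamma, alpha, dt)
    return m

m_final = relax((1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
```

With a field along z, the damping term turns the magnetization toward the field while the first term makes it precess. In a real solver the effective field is recomputed from the magnetization at every step, which is where the exchange and demagnetization operators enter.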

6 Motivation
FastMag: a versatile GPU micromagnetic simulator
Framework (blocks of the diagram, in order):
  Input interface
  FastMag
  LLG simulators
  Temperature/optics
  Hybrid simulators
  Fast Demag: NUFFT
  Fast Exchange
  Fast SpMV
  Time integration
  Fast Jacobian
  Parallelization
  CPU-GPU hybrid
  Output interface

7 Motivation
Typical applications of electromagnetic simulations:
  Radar cross section: EM wave scattering from an airplane
  Biomedical EM
[Figure: RCS (dBsw) vs. angle (degree), Mie series vs. MOM]
Equations to solve:
$$\nabla\times E = -\frac{\partial B}{\partial t}, \qquad \nabla\times H = J + \frac{\partial D}{\partial t}$$
$$\nabla^2 A - \frac{1}{c^2}\frac{\partial^2 A}{\partial t^2} = -\mu_0 J, \qquad \nabla^2 V - \frac{1}{c^2}\frac{\partial^2 V}{\partial t^2} = -\frac{\rho}{\varepsilon_0}$$

8 Motivation
Electromagnetic problem example: field-based volume integral equation
[Equation: volume integral equation for the electric flux D with incident term D^i and free-space Green's function $e^{-jk_0|r-r'|}/(4\pi|r-r'|)$; the remaining terms are not recoverable from the transcription]
Goal: solve for the electric flux D
Step 1: Quadrature points represent the integral: $D(r) = \sum_{n=1}^{N} D_n f_n(r)$, $Q = PD$
Step 2: Quadrature sources to potentials: potentials = $ZQ$
Step 3: Quadrature observer points to testing functions: $P^T$ applied to the potentials, tested against $D^i$
P (sparse matrix): maps basis functions to quadrature source points
Z (dense matrix): summation of the products between sources and the Green's function
P^T (sparse matrix): maps quadrature potential points to testing functions


10 NUFFT
Traditional Fast Fourier Transform
Advantage:
  Computational complexity: O(N^2) -> O(N log N)
  Well-known libraries: e.g. FFTW, Intel MKL, NVIDIA cuFFT
Disadvantage:
  Cannot solve problems with non-uniform source distributions
  Non-periodic problems require zero padding
NUFFT: Non-uniform Fast Fourier Transform (or Adaptive Integral Method) *
  Uniform sampling of general structures
  Non-uniform problem -> uniform problem
Electromagnetic problems (sum over the Green's function):
$$u_j = \sum_{i=1,\, i\neq j}^{N} \frac{e^{-jk|r_i-r_j|}}{|r_i-r_j|}\, q(r_i)$$
Micromagnetic problems:
$$\int\frac{\nabla'\cdot\left(M_s\hat{m}(r')\right)}{|r-r'|}\,dr'$$
* Ref: Zhu, Zhenhai, Ben Song, and Jacob White. "pfft++ A general and extensible fast integral equation solver based on a pre-corrected FFT algorithm."
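The two FFT points on this slide, the O(N^2) -> O(N log N) reduction and the zero padding needed for non-periodic problems, can be checked on a 1D uniform grid. This is a hedged sketch under stated assumptions: a static kernel 1/|d| with a zero self term stands in for the Green's function, the radix-2 FFT is written out by hand so the block has no dependencies, and the names `fft`, `direct_sum` and `fft_convolve` are illustrative. A production code would call FFTW, MKL or cuFFT instead.

```python
import cmath

def fft(a, inverse=False):
    """Recursive radix-2 FFT; len(a) must be a power of two. No 1/n scaling."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def kernel(d):
    """Static 1/|d| kernel on integer grid offsets, zero self term (assumption)."""
    return 0.0 if d == 0 else 1.0 / abs(d)

def direct_sum(q, kern):
    """Reference O(N^2) evaluation of u[j] = sum_i kern(j - i) * q[i]."""
    n = len(q)
    return [sum(kern(j - i) * q[i] for i in range(n)) for j in range(n)]

def fft_convolve(q, kern):
    """Same sum via FFT. Zero padding to >= 2N keeps the circular
    convolution from wrapping sources around the non-periodic domain."""
    n = len(q)
    size = 1
    while size < 2 * n:
        size *= 2
    qp = list(q) + [0.0] * (size - n)
    g = [0.0] * size
    for d in range(n):
        g[d] = kern(d)            # non-negative offsets
    for d in range(1, n):
        g[size - d] = kern(-d)    # negative offsets stored wrapped
    spectrum = [a * b for a, b in zip(fft(qp), fft(g))]
    u = fft(spectrum, inverse=True)
    return [u[j].real / size for j in range(n)]

q = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]
u_direct = direct_sum(q, kernel)
u_fft = fft_convolve(q, kernel)
```

Both routines return the same values up to round-off; only the cost differs. The approach requires sources on a uniform grid, which is the limitation the NUFFT removes.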

11 NUFFT
Algorithm building blocks:
  1. Projection
  2. Fast Fourier Transform
  3. Back-projection
  4. Near-field correction
CUDA implementation:
  Coalesced memory access
  Shared memory
  Thread independency
  Workload balancing
  Heavy floating-point operations
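The four building blocks can be sketched in one dimension. This is a simplified CPU illustration under several assumptions, not the talk's implementation: a static kernel 1/|r| with zero self term, linear (two-node) projection weights, a grid convolution written as a direct sum where a real code would use the FFT, and an O(N^2) neighbor search in the correction step where a real code would use the domain structure. All names are hypothetical.

```python
import math

def kernel(r):
    """Static kernel 1/|r|; the self term (r = 0) is defined as zero."""
    return 0.0 if r == 0.0 else 1.0 / abs(r)

def stencil(x, h):
    """Two nearest grid nodes and linear weights for a point at x."""
    j = int(math.floor(x / h))
    t = x / h - j
    return ((j, 1.0 - t), (j + 1, t))

def direct(xs, qs):
    """Reference O(N^2) sum over all source/observer pairs."""
    return [sum(q * kernel(x - y) for y, q in zip(xs, qs)) for x in xs]

def grid_accelerated(xs, qs, h, n_grid, near_cells=4):
    # 1. Projection: spread non-uniform sources onto the uniform grid.
    grid_q = [0.0] * n_grid
    for x, q in zip(xs, qs):
        for j, w in stencil(x, h):
            grid_q[j] += w * q
    # 2. Grid-to-grid interaction (a convolution; done with FFTs in practice).
    grid_u = [sum(kernel(h * (m - n)) * grid_q[n] for n in range(n_grid))
              for m in range(n_grid)]
    # 3. Back-projection: interpolate the grid field to the observers.
    u = [sum(w * grid_u[j] for j, w in stencil(x, h)) for x in xs]
    # 4. Near-field correction: for close pairs, remove the inaccurate
    #    grid-mediated interaction and add the exact one.
    for i, xi in enumerate(xs):
        for k, xk in enumerate(xs):
            if abs(xi - xk) <= near_cells * h:
                via_grid = sum(wi * wk * kernel(h * (ji - jk))
                               for ji, wi in stencil(xi, h)
                               for jk, wk in stencil(xk, h))
                u[i] += qs[k] * (kernel(xi - xk) - via_grid)
    return u

xs = [0.3 + 1.1 * i + 0.4 * math.sin(i) for i in range(50)]  # non-uniform points
qs = [1.0] * 50
ref = direct(xs, qs)
fast = grid_accelerated(xs, qs, h=1.0, n_grid=60, near_cells=6)
max_rel_err = max(abs(f - r) / r for f, r in zip(fast, ref))
```

Steps 1, 3 and 4 act independently on each source or observer, which is why they map naturally to one GPU thread per point. The accuracy is set by the interpolation order and by how many cells count as near field; linear weights are used here only to keep the sketch short.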


15 NUFFT
NUFFT on SINGLE GPU
  CPU side: source coordinates, domain structure, source amplitudes
  GPU side: projection -> source amplitudes on grid -> FFT -> source amplitudes on grid in k-space -> TensorMul (k-space multiplication) -> field in k-space -> iFFT -> field on grid -> near-field correction -> observer field
NUFFT on MULTIPLE GPUs
  Every GPU receives the source coordinates and the domain structure
  Per-GPU projection -> parallel FFT in 3D (wait) -> per-GPU k-space multiplication -> parallel inverse FFT in 3D -> per-GPU near-field correction -> observer field

16 NUFFT
Single GPU results
INTEL 3.2GHz vs. NVIDIA GeForce GTX 690 (1 card): 100x ~ 300x CPU-GPU speed-up
Computational time in seconds (several entries are truncated in the transcription and are reproduced as found):
Problem Size | Direct CPU/s | Direct GPU/s | NUFFT CPU (cubic)/s | NUFFT GPU (cubic)/s | NUFFT GPU (linear)/s
16K  | 7.02E  | E      | E      | E     | E-3
64K  | 4.47E1 | 7.98E  | E0     | 1.23E | E-3
256K | 7.17E2 | 1.14E0 | 8.87E0 | 3.33E | E-2
1M   | N/A    | 1.79E1 | 3.99E1 | 1.26E | E-2
4M   | N/A    | N/A    | N/A    | 4.76E | E-1
Multiple GPU results
Multiple GPUs: 2 x NVIDIA GeForce GTX 690 (4 GPUs), problem size = 4M
Parallel efficiency: $E_p = \frac{S_p}{P} = \frac{T_1}{P\,T_P}$
$E_p$ = 77% across 4 GPUs
[Figure: simulation time (ms) vs. number of GPUs, with speed-up labels 1.8x, 2.6x, 3.1x]
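The parallel efficiency quoted on this slide follows from the speed-up labels by simple arithmetic. The sketch below assumes that the labels 1.8x, 2.6x and 3.1x belong to 2, 3 and 4 GPUs respectively (the transcription does not state the pairing) and uses the usual definition of efficiency as speed-up divided by device count; the function name is illustrative.

```python
def parallel_efficiency(speedup, n_devices):
    """E_p = S_p / P, with S_p = T_1 / T_P the speed-up over one device."""
    return speedup / n_devices

# Speed-up labels from the slide, paired with GPU counts by assumption.
assumed_speedups = {2: 1.8, 3: 2.6, 4: 3.1}
efficiencies = {p: parallel_efficiency(s, p) for p, s in assumed_speedups.items()}
```

Under that pairing, 3.1 / 4 = 0.775, which agrees with the 77% reported for four GPUs.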

17 SPMV
Sparse Matrix-Vector Multiplication (SpMV)
Application: differential operators, projections or interpolations, e.g. the exchange field $\frac{2A}{M_s}\nabla^2\hat{m}$
Feature: #non-zero elements << #zero elements
GPU memory: only non-zero elements are kept in memory
Computational complexity: only non-zero elements are computed
Example: compressed sparse row (CSR) format, which stores a matrix A in the arrays RowOffset, Ptr and Data (the example values are not preserved in the transcription)
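Since the slide's CSR example values are lost, here is a small stand-in. It is a sketch with assumed names: `row_offsets`, `col_indices` and `values` are the conventional three CSR arrays, and the 4x4 matrix is a hypothetical 1D Laplacian chosen only because it resembles a discretized differential operator.

```python
def dense_to_csr(dense):
    """Keep only the non-zero entries of a dense matrix (list of rows)."""
    row_offsets, col_indices, values = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                col_indices.append(j)
                values.append(v)
        row_offsets.append(len(values))  # where the next row starts
    return row_offsets, col_indices, values

def csr_spmv(row_offsets, col_indices, values, x):
    """y = A x, touching only the stored non-zero entries."""
    y = []
    for r in range(len(row_offsets) - 1):
        acc = 0.0
        for p in range(row_offsets[r], row_offsets[r + 1]):
            acc += values[p] * x[col_indices[p]]
        y.append(acc)
    return y

# Hypothetical example: 1D Laplacian stencil (1, -2, 1) on four points.
laplacian = [[-2, 1, 0, 0],
             [1, -2, 1, 0],
             [0, 1, -2, 1],
             [0, 0, 1, -2]]
row_offsets, col_indices, values = dense_to_csr(laplacian)
y = csr_spmv(row_offsets, col_indices, values, [1.0, 2.0, 3.0, 4.0])
```

Here 10 of the 16 entries are stored. For FEM matrices with millions of rows and a few tens of non-zeros per row, the saving in memory and work is what makes the operation feasible on a GPU.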

18 SPMV
Implementation: single GPU
  Bind the input vector to texture memory
  Parallel reduction with shuffle operations
Maximize the CPU-GPU memory transfer throughput:
  Important for CPU-GPU mixture solvers
  Pinned host memory -> increases memory transfer throughput by 100%
Ref: 1. CUDA C Best Practices Guide, NVIDIA; 2. Optimizing Parallel Reduction in CUDA, M. Harris; 3. How to Optimize Data Transfers in CUDA C/C++, M. Harris
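The shuffle-based reduction can be emulated on the CPU to show its data movement. This sketch makes two assumptions: it models a shuffle-down in which lane i reads from lane i + delta and out-of-range lanes keep their own value, and it assigns one 32-lane warp to each matrix row, which is a common SpMV layout but is not confirmed by the slide. It is plain Python, not CUDA, and the names are illustrative.

```python
WARP_SIZE = 32

def shfl_down(lanes, delta):
    """Emulated shuffle-down: lane i reads lane i + delta.
    Lanes whose source is out of range keep their own value."""
    n = len(lanes)
    return [lanes[i + delta] if i + delta < n else lanes[i] for i in range(n)]

def warp_reduce_sum(lanes):
    """Tree reduction in log2(32) = 5 shuffle steps; lane 0 ends with the total."""
    assert len(lanes) == WARP_SIZE
    vals = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        vals = [a + b for a, b in zip(vals, shifted)]
        offset //= 2
    return vals[0]

def warp_row_dot(row_values, row_cols, x):
    """One warp per row (assumed layout): lanes stride over the row's
    non-zeros, then the 32 partial sums are combined by shuffles."""
    partial = [0.0] * WARP_SIZE
    for p, (v, c) in enumerate(zip(row_values, row_cols)):
        partial[p % WARP_SIZE] += v * x[c]
    return warp_reduce_sum(partial)

total = warp_reduce_sum(list(range(32)))
row_result = warp_row_dot([1.0] * 40, list(range(40)), [2.0] * 40)
```

On the GPU the shuffle exchanges registers directly between threads of a warp, so the reduction needs no shared memory and no block-level synchronization.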

19 SPMV
Implementation: Sorting
[Figure: sparse matrix x input vector = output vector, shown before and after sorting]
Sorting methods compared: box-sorting vs. RCM (reverse Cuthill-McKee)
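To show what an RCM reordering does to a matrix, the following is a textbook reverse Cuthill-McKee sketch on an adjacency list, not the talk's implementation. The start-node rule (lowest degree first) and the five-node example graph are assumptions chosen for brevity.

```python
from collections import deque

def bandwidth(adj, order=None):
    """Largest |position(u) - position(v)| over all edges under an ordering."""
    pos = {v: i for i, v in enumerate(order if order is not None else range(len(adj)))}
    return max((abs(pos[u] - pos[v]) for u in range(len(adj)) for v in adj[u]),
               default=0)

def reverse_cuthill_mckee(adj):
    """Breadth-first numbering that visits low-degree neighbors first, reversed."""
    n = len(adj)
    degree = [len(a) for a in adj]
    visited = [False] * n
    order = []
    for start in sorted(range(n), key=lambda v: degree[v]):
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in sorted(adj[u], key=lambda w: degree[w]):
                if not visited[v]:
                    visited[v] = True
                    queue.append(v)
    return order[::-1]

# Hypothetical example: a chain 0-3-1-4-2 whose numbering scatters the non-zeros.
adj = [[3], [3, 4], [4], [0, 1], [1, 2]]
new_order = reverse_cuthill_mckee(adj)
before, after = bandwidth(adj), bandwidth(adj, new_order)
```

A smaller bandwidth means each row reads input-vector entries that lie close together, which improves cache and texture locality on one GPU and limits how much of the vector each GPU needs in a multi-GPU split.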

20 SPMV
Implementation: multiple GPUs
Only part of the matrix and of the input vector is assigned to each GPU
Workload balance: level the number of non-zero elements across the GPUs
Problem: memory scalability across GPUs
[Figure, before sorting: the input-vector segments V1 to V8 needed by GPU0 to GPU3 overlap heavily, so most segments must be stored on more than one GPU]

21 SPMV
Implementation: multiple GPUs
Only part of the matrix and of the input vector is assigned to each GPU
Workload balance: level the number of non-zero elements across the GPUs
Sorting helps to preserve the scalability of the multi-GPU implementation
[Figure, after sorting: the segments V1 to V8 are spread over GPU0 to GPU3 with almost no duplication]
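Balancing by non-zero count rather than by row count can be read directly off the CSR row-offset array. The sketch below is one possible scheme under stated assumptions: each GPU receives a contiguous block of rows, and each cut is placed at the row boundary whose cumulative non-zero count is closest to the ideal share. The function name and the example row lengths are hypothetical.

```python
import bisect

def partition_rows_by_nnz(row_offsets, n_parts):
    """Return bounds b[0..n_parts]; part p owns rows b[p] .. b[p+1]-1."""
    n_rows = len(row_offsets) - 1
    total = row_offsets[-1]
    bounds = [0]
    for p in range(1, n_parts):
        target = total * p / n_parts
        i = bisect.bisect_left(row_offsets, target)  # first cut with nnz >= target
        # Step back one row if the previous cut is at least as close to the target.
        if i > 0 and target - row_offsets[i - 1] <= row_offsets[i] - target:
            i -= 1
        bounds.append(max(i, bounds[-1]))            # keep cuts non-decreasing
    bounds.append(n_rows)
    return bounds

# Hypothetical matrix with non-zeros per row: 5,1,1,1,4,4,2,2,2,2
row_offsets = [0, 5, 6, 7, 8, 12, 16, 18, 20, 22, 24]
bounds = partition_rows_by_nnz(row_offsets, 4)
nnz_per_gpu = [row_offsets[bounds[p + 1]] - row_offsets[bounds[p]] for p in range(4)]
```

In this example the four parts own 2, 3, 2 and 3 rows but 6 non-zeros each, so every GPU performs the same number of multiply-adds.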

22 SPMV
Speed results
Two matrices generated from FEM meshes of a cube and a sphere, respectively; three matrices chosen from the University of Florida Sparse Matrix Collection
INTEL 3.2GHz w/ 1 core running vs. 2 x NVIDIA GeForce GTX 690
CPU-GPU memory transfer time is included
Table columns: matrix; nnz / (nnz/row); computational time (ms) for Serial CPU, MKL CPU, cuSPARSE GPU, and SpMV on 1, 2, 3 and 4 GPUs. The nnz/row and timing values are not preserved in the transcription.
Matrix | nnz
FEM Cube | 17.5M
FEM Sphere | 31.8M
dielFilterV3real | 89.3M
gsm_ (name truncated) | (value truncated)
Cube_Coup_dt6 | 124M

23 SPMV
Speed results: multiple GPUs
[Figure, three panels: simulation time (ms) vs. number of GPUs for FEM Sphere, Cube_Coup_dt6 and DielFilterV3Real, each bar split into memcpy and kernel time. Surviving speed-up labels, panel by panel: 1.6x, 2.2x, 3.1x; 1.9x, 2.6x; 1.0x, 1.8x, 2.7x, 2.8x]

24 OOMMF GPU
OOMMF (Object Oriented MicroMagnetic Framework) by NIST
  Open source, thousands of users worldwide
  Micromagnetic simulator
    Landau-Lifshitz-Gilbert equation
    Finite difference method
  Object-oriented coding framework
    Periodic and non-periodic boundary conditions
    6-point and 12-point exchange field
    Uniaxial and cubic anisotropy field
    Flexibility in changing material properties
Problem: the CPU is too slow for large problems
Solution: GPU parallel computation

25 OOMMF GPU
GPU Parallelism
Flow: m initialization -> field evaluation -> H_eff -> LLG update
Field terms:
  $H_{applied}$
  $H_{anisotropy} = H_k\,(\hat{k}\cdot m)\,\hat{k}$
  $H_{exchange} = M_s\, l_{ex}^2\, \nabla^2 m$
  $H_{stray}$
$$H_{eff} = H_{applied} + H_{exchange} + H_{anisotropy} + H_{stray}$$
$$\frac{\partial m}{\partial t} = -\frac{\gamma}{1+\alpha^2}\left[m\times H_{eff} + \alpha\, m\times\left(m\times H_{eff}\right)\right]$$
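The exchange term above, combined with the 6-point stencil mentioned for OOMMF, gives a per-cell formula that shows why the finite difference solver parallelizes well. This is a CPU sketch and not OOMMF code; it assumes a uniform cubic cell size, a flat x-fastest storage order, and that neighbors outside the grid are simply skipped (a free boundary). All names are illustrative.

```python
def exchange_field_6pt(m, shape, cell_size, ms, l_ex):
    """H_ex = Ms * l_ex^2 * laplacian(m), with the Laplacian approximated by
    the sum over the six nearest neighbors of (m_nb - m_c) / cell_size^2."""
    nx, ny, nz = shape
    coeff = ms * l_ex ** 2 / cell_size ** 2

    def idx(x, y, z):
        return (z * ny + y) * nx + x  # x-fastest flat index (assumed layout)

    offsets = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    h = []
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                c = m[idx(x, y, z)]
                acc = [0.0, 0.0, 0.0]
                for dx, dy, dz in offsets:
                    xn, yn, zn = x + dx, y + dy, z + dz
                    if 0 <= xn < nx and 0 <= yn < ny and 0 <= zn < nz:
                        nb = m[idx(xn, yn, zn)]
                        for k in range(3):
                            acc[k] += nb[k] - c[k]
                h.append(tuple(coeff * a for a in acc))
    return h

uniform = [(0.0, 0.0, 1.0)] * 8
h_uniform = exchange_field_6pt(uniform, (2, 2, 2), 1.0, 1.0, 1.0)
twisted = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
h_twisted = exchange_field_6pt(twisted, (3, 1, 1), 1.0, 1.0, 1.0)
```

Each cell reads only its own value and six neighbors and writes one output, so the triple loop can be replaced by one GPU thread per cell with no write conflicts. A uniform magnetization produces zero exchange field, which is a quick sanity check for any implementation.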

26 OOMMF GPU
Speed results
Test case: cubic geometry with various problem sizes
Hardware:
  CPU: Xeon w/ 1 core running
  GPU: NVIDIA GTX 690 @ 915MHz w/ 1536 cores running
Computational time and speed-up (the CPU/ms and GPU/ms values and all but the last speed-up are not preserved in the transcription):
Problem Size | CPU/ms | GPU/ms | Speed-up
16^3 = 4K    |        |        |
32^3 = 32K   |        |        |
64^3 = 256K  |        |        |
128^3 = 2M   |        |        | x24.5


28 Micromagnetic Simulation: Magnetic head
Challenges and features:
  Complex geometry: 5-10 micron size, ~1000 aspect ratio, complex shapes and coupled parts
  Hundreds of millions of elements may be needed
Parameters: M_s = (value lost) emu/cc, α = 0.2
Geometry: 5 x 3 x 5 micron size, 50 x 80 nm tip; coils surround the head
Adams and BDF time stepping
Hardware: Tesla S2070 GPU, i7 CPU
Tip resolution | Largest element | # of tetrahedral elements | Time per 1 ns
10 nm | 130 nm | 130K | 1.75 min
10 nm | 57 nm  | 1.2M | 17 min
10 nm | 33 nm  | 4.8M | 107 min
10 nm | 10 nm  | 126M | ~3 days

29 Micromagnetic Simulation: Granular media
Features:
  General Voronoi tessellation
  Distributions of particle size, shape, separation, material parameters, etc.
  Single and multiple layers with an option for sub-layer discretization
  Surface and bulk exchange

30 Micromagnetic Simulation: Magnetic memories
Spin-transfer-torque based magnetic RAM: spin valve structure
[Figure: in-plane MRAM and perpendicular MRAM stacks, each with a free layer and a fixed layer; an applied voltage V drives the electron flow]

31 Electromagnetic Simulation: Human body scattering
Human body simulation
  Method: Potential Integral Equation
  Key algorithm: Non-uniform Fast Fourier Transform
  Mesh: 8.4 million tetrahedrons, 2 mm resolution
  Total number of iterations: 109
  Simulation time: 48 min
  Incident wave: x polarization, λ = 1.25 m, ε_r = 41.4 - j18
[Figure: current distribution along x]

32 Summary
Completed work:
  A Finite Element Method based micromagnetic solver: FastMag
  Two GPU algorithms, Non-uniform Fast Fourier Transform and Sparse Matrix Vector multiplication, with 20x ~ 300x GPU-CPU speed-up
  Multi-GPU implementations of the two algorithms, reaching 65% - 85% parallel efficiency
  Electromagnetic and micromagnetic simulation examples
Future work:
  The entire FastMag solver is going to be implemented on the GPU
  With the release of CUDA 6.0, the multi-GPU implementation is expected to become more efficient
More information? Please see our group's website.
Acknowledgement: Shaojing Li, Ruinan Chang, Marko Lubarda, Marco Escobar, Majd Kuteifan, Marco Menarini, Simon Couture, Javier Espigares


Tight-Focusing of Short Intense Laser Pulses in Particle-in-Cell Simulations of Laser-Plasma Interaction 16/05/2017, CTU in Prague Tight-Focusing of Short Intense Laser Pulses in Particle-in-Cell Simulations of Laser-Plasma Interaction Bc. Petr Valenta (petr.valenta@eli-beams.eu) Supervisors: doc. Ing. Ondrej

More information

An Algorithmic Framework of Large-Scale Circuit Simulation Using Exponential Integrators

An Algorithmic Framework of Large-Scale Circuit Simulation Using Exponential Integrators An Algorithmic Framework of Large-Scale Circuit Simulation Using Exponential Integrators Hao Zhuang 1, Wenjian Yu 2, Ilgweon Kang 1, Xinan Wang 1, and Chung-Kuan Cheng 1 1. University of California, San

More information

On the design of parallel linear solvers for large scale problems

On the design of parallel linear solvers for large scale problems On the design of parallel linear solvers for large scale problems ICIAM - August 2015 - Mini-Symposium on Recent advances in matrix computations for extreme-scale computers M. Faverge, X. Lacoste, G. Pichon,

More information

Università degli studi di Udine

Università degli studi di Udine Università degli studi di Udine GPU Accelerated Time-Domain Discrete Geometric Approach Method for Maxwell's Equations on Tetrahedral Grids This is the peer reviewd version of the followng article: Original

More information

Hydra. A library for data analysis in massively parallel platforms. A. Augusto Alves Jr and Michael D. Sokoloff

Hydra. A library for data analysis in massively parallel platforms. A. Augusto Alves Jr and Michael D. Sokoloff Hydra A library for data analysis in massively parallel platforms A. Augusto Alves Jr and Michael D. Sokoloff University of Cincinnati aalvesju@cern.ch Presented at NVIDIA s GPU Technology Conference,

More information

Real-time signal detection for pulsars and radio transients using GPUs

Real-time signal detection for pulsars and radio transients using GPUs Real-time signal detection for pulsars and radio transients using GPUs W. Armour, M. Giles, A. Karastergiou and C. Williams. University of Oxford. 15 th July 2013 1 Background of GPUs Why use GPUs? Influence

More information

S0214 : GPU Based Stacking Sequence Generation For Composite Skins Using GA

S0214 : GPU Based Stacking Sequence Generation For Composite Skins Using GA S0214 : GPU Based Stacking Sequence Generation For Composite Skins Using GA Date: 16th May 2012 Wed, 3pm to 3.25pm(Adv. Session) Sathyanarayana K., Manish Banga, and Ravi Kumar G. V. V. Engineering Services,

More information

Efficient Molecular Dynamics on Heterogeneous Architectures in GROMACS

Efficient Molecular Dynamics on Heterogeneous Architectures in GROMACS Efficient Molecular Dynamics on Heterogeneous Architectures in GROMACS Berk Hess, Szilárd Páll KTH Royal Institute of Technology GTC 2012 GROMACS: fast, scalable, free Classical molecular dynamics package

More information

arxiv: v1 [physics.comp-ph] 30 Oct 2017

arxiv: v1 [physics.comp-ph] 30 Oct 2017 An efficient GPU algorithm for tetrahedron-based Brillouin-zone integration Daniel Guterding 1, and Harald O. Jeschke 1 Lucht Probst Associates, Große Gallusstraße 9, 011 Frankfurt am Main, Germany, European

More information

Simulating radiation from Laser-wakefield accelerators

Simulating radiation from Laser-wakefield accelerators TUSBC1 11 th International Computational Accelerator Physics Conference ICAP 2012 in Rostock Simulating radiation from Laser-wakefield accelerators Alexander Debus, Richard Pausch, René Widera, Michael

More information

Level-3 BLAS on a GPU

Level-3 BLAS on a GPU Level-3 BLAS on a GPU Picking the Low Hanging Fruit Francisco Igual 1 Gregorio Quintana-Ortí 1 Robert A. van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores. University Jaume I. Castellón

More information

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros

More information

Quantum Computer Simulation Using CUDA (Quantum Fourier Transform Algorithm)

Quantum Computer Simulation Using CUDA (Quantum Fourier Transform Algorithm) Quantum Computer Simulation Using CUDA (Quantum Fourier Transform Algorithm) Alexander Smith & Khashayar Khavari Department of Electrical and Computer Engineering University of Toronto April 15, 2009 Alexander

More information

Efficient Parallelization of Molecular Dynamics Simulations on Hybrid CPU/GPU Supercoputers

Efficient Parallelization of Molecular Dynamics Simulations on Hybrid CPU/GPU Supercoputers Efficient Parallelization of Molecular Dynamics Simulations on Hybrid CPU/GPU Supercoputers Jaewoon Jung (RIKEN, RIKEN AICS) Yuji Sugita (RIKEN, RIKEN AICS, RIKEN QBiC, RIKEN ithes) Molecular Dynamics

More information

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015 Tips Geared Towards R Departments of Statistics North Carolina State University Arpil 10, 2015 1 / 30 Advantages of R As an interpretive and interactive language, developing an algorithm in R can be done

More information

Universität Dortmund UCHPC. Performance. Computing for Finite Element Simulations

Universität Dortmund UCHPC. Performance. Computing for Finite Element Simulations technische universität dortmund Universität Dortmund fakultät für mathematik LS III (IAM) UCHPC UnConventional High Performance Computing for Finite Element Simulations S. Turek, Chr. Becker, S. Buijssen,

More information

COMPARISON OF CPU AND GPU IMPLEMENTATIONS OF THE LATTICE BOLTZMANN METHOD

COMPARISON OF CPU AND GPU IMPLEMENTATIONS OF THE LATTICE BOLTZMANN METHOD XVIII International Conference on Water Resources CMWR 2010 J. Carrera (Ed) c CIMNE, Barcelona, 2010 COMPARISON OF CPU AND GPU IMPLEMENTATIONS OF THE LATTICE BOLTZMANN METHOD James.E. McClure, Jan F. Prins

More information

Massively scalable computing method to tackle large eigenvalue problems for nanoelectronics modeling

Massively scalable computing method to tackle large eigenvalue problems for nanoelectronics modeling 2019 Intel extreme Performance Users Group (IXPUG) meeting Massively scalable computing method to tackle large eigenvalue problems for nanoelectronics modeling Hoon Ryu, Ph.D. (E: elec1020@kisti.re.kr)

More information

FEAST eigenvalue algorithm and solver: review and perspectives

FEAST eigenvalue algorithm and solver: review and perspectives FEAST eigenvalue algorithm and solver: review and perspectives Eric Polizzi Department of Electrical and Computer Engineering University of Masachusetts, Amherst, USA Sparse Days, CERFACS, June 25, 2012

More information

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) Schodinger equation: Hψ = Eψ Choose a basis set of wave functions Two cases: Orthonormal

More information

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters H. Köstler 2nd International Symposium Computer Simulations on GPU Freudenstadt, 29.05.2013 1 Contents Motivation walberla software concepts

More information

Acoustics Analysis of Speaker ANSYS, Inc. November 28, 2014

Acoustics Analysis of Speaker ANSYS, Inc. November 28, 2014 Acoustics Analysis of Speaker 1 Introduction ANSYS 14.0 offers many enhancements in the area of acoustics. In this presentation, an example speaker analysis will be shown to highlight some of the acoustics

More information

Proceedings of Meetings on Acoustics

Proceedings of Meetings on Acoustics Proceedings of Meetings on Acoustics Volume 19, 2013 http://acousticalsociety.org/ ICA 2013 Montreal Montreal, Canada 2-7 June 2013 Underwater Acoustics Session 5aUW: Using Graphic Processing Units for

More information

Electromagnetic Field Analysis

Electromagnetic Field Analysis Spectral Integral Method and Spectral Element Method Domain Decomposition Method for Electromagnetic Field Analysis by Yun Lin Department of Electrical and Computer Engineering Duke University Date: Approved:

More information

上海超级计算中心 Shanghai Supercomputer Center. Lei Xu Shanghai Supercomputer Center San Jose

上海超级计算中心 Shanghai Supercomputer Center. Lei Xu Shanghai Supercomputer Center San Jose 上海超级计算中心 Shanghai Supercomputer Center Lei Xu Shanghai Supercomputer Center 03/26/2014 @GTC, San Jose Overview Introduction Fundamentals of the FDTD method Implementation of 3D UPML-FDTD algorithm on GPU

More information

Utilisation de la compression low-rank pour réduire la complexité du solveur PaStiX

Utilisation de la compression low-rank pour réduire la complexité du solveur PaStiX Utilisation de la compression low-rank pour réduire la complexité du solveur PaStiX 26 Septembre 2018 - JCAD 2018 - Lyon Grégoire Pichon, Mathieu Faverge, Pierre Ramet, Jean Roman Outline 1. Context 2.

More information

On the Computational Complexity of the Discrete Pascal Transform

On the Computational Complexity of the Discrete Pascal Transform 6 th International Conference Logic and Applications LAP 207, September 8-22, 207, Dubrovnik, Croatia On the Computational Complexity of the Discrete Pascal Transform Dušan B. Gajić, Radomir S. Stanković

More information

Computing least squares condition numbers on hybrid multicore/gpu systems

Computing least squares condition numbers on hybrid multicore/gpu systems Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning

More information

Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning

Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology, USA SPPEXA Symposium TU München,

More information

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is

More information

Improvements for Implicit Linear Equation Solvers

Improvements for Implicit Linear Equation Solvers Improvements for Implicit Linear Equation Solvers Roger Grimes, Bob Lucas, Clement Weisbecker Livermore Software Technology Corporation Abstract Solving large sparse linear systems of equations is often

More information

Welcome to MCS 572. content and organization expectations of the course. definition and classification

Welcome to MCS 572. content and organization expectations of the course. definition and classification Welcome to MCS 572 1 About the Course content and organization expectations of the course 2 Supercomputing definition and classification 3 Measuring Performance speedup and efficiency Amdahl s Law Gustafson

More information

Introduction to Practical FFT and NFFT

Introduction to Practical FFT and NFFT Introduction to Practical FFT and NFFT Michael Pippig and Daniel Potts Department of Mathematics Chemnitz University of Technology September 14, 211 supported by BMBF grant 1IH81B Table of Contents 1 Serial

More information

GPU accelerated Arnoldi solver for small batched matrix

GPU accelerated Arnoldi solver for small batched matrix 15. 09. 22 GPU accelerated Arnoldi solver for small batched matrix Samsung Advanced Institute of Technology Hyung-Jin Kim Contents - Eigen value problems - Solution - Arnoldi Algorithm - Target - CUDA

More information

GPU Computing Activities in KISTI

GPU Computing Activities in KISTI International Advanced Research Workshop on High Performance Computing, Grids and Clouds 2010 June 21~June 25 2010, Cetraro, Italy HPC Infrastructure and GPU Computing Activities in KISTI Hongsuk Yi hsyi@kisti.re.kr

More information

A robust multilevel approximate inverse preconditioner for symmetric positive definite matrices

A robust multilevel approximate inverse preconditioner for symmetric positive definite matrices DICEA DEPARTMENT OF CIVIL, ENVIRONMENTAL AND ARCHITECTURAL ENGINEERING PhD SCHOOL CIVIL AND ENVIRONMENTAL ENGINEERING SCIENCES XXX CYCLE A robust multilevel approximate inverse preconditioner for symmetric

More information

A particle-in-cell method with adaptive phase-space remapping for kinetic plasmas

A particle-in-cell method with adaptive phase-space remapping for kinetic plasmas A particle-in-cell method with adaptive phase-space remapping for kinetic plasmas Bei Wang 1 Greg Miller 2 Phil Colella 3 1 Princeton Institute of Computational Science and Engineering Princeton University

More information

Coupling atomistic and continuum modelling of magnetism

Coupling atomistic and continuum modelling of magnetism Coupling atomistic and continuum modelling of magnetism M. Poluektov 1,2 G. Kreiss 2 O. Eriksson 3 1 University of Warwick WMG International Institute for Nanocomposites Manufacturing 2 Uppsala University

More information

Perpendicular MTJ stack development for STT MRAM on Endura PVD platform

Perpendicular MTJ stack development for STT MRAM on Endura PVD platform Perpendicular MTJ stack development for STT MRAM on Endura PVD platform Mahendra Pakala, Silicon Systems Group, AMAT Dec 16 th, 2014 AVS 2014 *All data in presentation is internal Applied generated data

More information

Julian Merten. GPU Computing and Alternative Architecture

Julian Merten. GPU Computing and Alternative Architecture Future Directions of Cosmological Simulations / Edinburgh 1 / 16 Julian Merten GPU Computing and Alternative Architecture Institut für Theoretische Astrophysik Zentrum für Astronomie Universität Heidelberg

More information

Reduced Vlasov-Maxwell modeling

Reduced Vlasov-Maxwell modeling Reduced Vlasov-Maxwell modeling Philippe Helluy, Michel Massaro, Laurent Navoret, Nhung Pham, Thomas Strub To cite this version: Philippe Helluy, Michel Massaro, Laurent Navoret, Nhung Pham, Thomas Strub.

More information

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS Jorge González-Domínguez*, Bertil Schmidt*, Jan C. Kässens**, Lars Wienbrandt** *Parallel and Distributed Architectures

More information

sri 2D Implicit Charge- and Energy- Conserving Particle-in-cell Application Using CUDA Christopher Leibs Karthik Murthy

sri 2D Implicit Charge- and Energy- Conserving Particle-in-cell Application Using CUDA Christopher Leibs Karthik Murthy 2D Implicit Charge- and Energy- Conserving sri Particle-in-cell Application Using CUDA Christopher Leibs Karthik Murthy Mentors Dana Knoll and Allen McPherson IS&T CoDesign Summer School 2012, Los Alamos

More information

Perm State University Research-Education Center Parallel and Distributed Computing

Perm State University Research-Education Center Parallel and Distributed Computing Perm State University Research-Education Center Parallel and Distributed Computing A 25-minute Talk (S4493) at the GPU Technology Conference (GTC) 2014 MARCH 24-27, 2014 SAN JOSE, CA GPU-accelerated modeling

More information

1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria

1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria 1 Overview Improving LSTC s Multifrontal Linear Solver Roger Grimes 3, Robert Lucas 3, Nick Meng 2, Francois-Henry Rouet 3, Clement Weisbecker 3, and Ting-Ting Zhu 1 1 Cray Incorporated 2 Intel Corporation

More information

A Two-Scale Adaptive Integral Method

A Two-Scale Adaptive Integral Method A Two-Scale Adaptive Integral Method Ali Yilmaz Department of Electrical & Computer Engineering University of Texas at Austin IEEE APS International Symposium USC/URSI ational Radio Science Meeting San

More information

RWTH Aachen University

RWTH Aachen University IPCC @ RWTH Aachen University Optimization of multibody and long-range solvers in LAMMPS Rodrigo Canales William McDoniel Markus Höhnerbach Ahmed E. Ismail Paolo Bientinesi IPCC Showcase November 2016

More information

3D Cartesian Transport Sweep for Massively Parallel Architectures on top of PaRSEC

3D Cartesian Transport Sweep for Massively Parallel Architectures on top of PaRSEC 3D Cartesian Transport Sweep for Massively Parallel Architectures on top of PaRSEC 9th Scheduling for Large Scale Systems Workshop, Lyon S. Moustafa, M. Faverge, L. Plagne, and P. Ramet S. Moustafa, M.

More information

Claude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique

Claude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique Claude Tadonki MINES ParisTech PSL Research University Centre de Recherche Informatique claude.tadonki@mines-paristech.fr Monthly CRI Seminar MINES ParisTech - CRI June 06, 2016, Fontainebleau (France)

More information