Massively parallel semi-Lagrangian solution of the 6d Vlasov-Poisson problem

Massively parallel semi-Lagrangian solution of the 6d Vlasov-Poisson problem. Katharina Kormann (1), Klaus Reuter (2), Markus Rampp (2), Eric Sonnendrücker (1). (1) Max-Planck-Institut für Plasmaphysik, (2) Max Planck Computing and Data Facility. October 20, 2016.

Outline: Introduction; Interpolation and Parallelization; Numerical comparison of interpolators; Code optimization; Overlap of computation and communication.

Vlasov-Poisson equation and characteristics. Vlasov-Poisson equation for electrons in a neutralizing background:

∂_t f(t,x,v) + v·∇_x f(t,x,v) − E(t,x)·∇_v f(t,x,v) = 0,
−Δφ(t,x) = 1 − ρ(t,x),  E(t,x) = −∇φ(t,x),  ρ(t,x) = ∫ f(t,x,v) dv.

The advection equation keeps values constant along the characteristics:

dX/dt = V,  dV/dt = −E(t,X).

Solution: f(t,x,v) = f_0(X(0; t,x,v), V(0; t,x,v)).

Split semi-Lagrangian scheme. Given f^(m) and E^(m) at time t_m, we compute f^(m+1) at time t_m + Δt for all grid points (x_i, v_j) as follows:
1. Solve ∂_t f − E^(m)·∇_v f = 0 on a half time step: f^(m,*)(x_i, v_j) = f^(m)(x_i, v_j + E_i^(m) Δt/2)
2. Solve ∂_t f + v·∇_x f = 0 on a full time step: f^(m,**)(x_i, v_j) = f^(m,*)(x_i − v_j Δt, v_j)
3. Compute ρ(x_i) and solve the Poisson equation for E^(m+1)
4. Solve ∂_t f − E^(m+1)·∇_v f = 0 on a half time step: f^(m+1)(x_i, v_j) = f^(m,**)(x_i, v_j + E_i^(m+1) Δt/2)
Use cascade interpolation for the x and v advection steps to reduce the interpolations to successive 1d interpolations on stripes of the domain. Main building block: a 1d interpolation on stripes of the domain of the form g(x_j) = f(x_j + α).
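
To make the step structure concrete, here is a minimal 1d1v sketch of one split time step, with linear interpolation (np.interp) standing in for the Lagrange/spline interpolators discussed later and a spectral Poisson solve; the function names and grid treatment (periodic v grid) are illustrative assumptions, not the authors' production code.

    import numpy as np

    def shift_interp_periodic(line, alpha, dz):
        # g(z_j) = f(z_j + alpha) on a uniform periodic grid; linear interpolation
        # stands in for the higher-order interpolators of the slides.
        n = line.size
        z = np.arange(n + 1) * dz
        return np.interp((np.arange(n) * dz + alpha) % (n * dz), z, np.append(line, line[0]))

    def split_step(f, E, v, dt, dx, dv):
        # One time step of the split scheme above; f has shape (nx, nv), and the
        # v grid is treated as periodic (f is assumed negligible at its edges).
        nx, nv = f.shape
        for i in range(nx):                                   # 1. v half step
            f[i, :] = shift_interp_periodic(f[i, :], E[i] * dt / 2, dv)
        for j in range(nv):                                   # 2. x full step
            f[:, j] = shift_interp_periodic(f[:, j], -v[j] * dt, dx)
        rho = f.sum(axis=1) * dv                              # 3. density and Poisson solve
        k = 2 * np.pi * np.fft.fftfreq(nx, d=dx)
        s_hat = np.fft.fft(1.0 - rho)                         # source of -phi'' = 1 - rho
        E_hat = np.zeros_like(s_hat)
        E_hat[1:] = -1j * s_hat[1:] / k[1:]                   # E = -phi'
        E = np.real(np.fft.ifft(E_hat))
        for i in range(nx):                                   # 4. v half step with new field
            f[i, :] = shift_interp_periodic(f[i, :], E[i] * dt / 2, dv)
        return f, E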

Outline: Introduction; Interpolation and Parallelization; Numerical comparison of interpolators; Code optimization; Overlap of computation and communication.

Interpolation schemes. Let z_j be a grid point and α = (β + γ)Δz the shift of the grid point to the origin of the characteristic (β ∈ [0,1], γ ∈ ℤ).
Fixed-interval Lagrange (odd number of points q): f(z_j + α) = Σ_{i=j−(q−1)/2}^{j+(q−1)/2} l_i(α) f(z_i), for |α| ≤ Δz.
Centered-interval Lagrange (even number of points q): f(z_j + α) = f(z_{j+γ} + βΔz) = Σ_{i=j+γ−q/2+1}^{j+γ+q/2} l_i(β) f(z_i).
Cubic splines: global spline interpolant obtained from the solution of a linear system, evaluated as f(z_j + α) = Σ_{i=j+γ−1}^{j+γ+2} c_i S_i(β).
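
A minimal sketch of the fixed-interval Lagrange variant on a periodic 1d stripe; the default stencil size q = 5 and the helper names are illustrative assumptions.

    import numpy as np

    def lagrange_weights(q, theta):
        # Weights l_i(theta) of the q-point Lagrange interpolant on the integer
        # stencil offsets -(q-1)/2, ..., (q-1)/2 (q odd), evaluated at offset theta.
        offsets = np.arange(q) - (q - 1) // 2
        w = np.ones(q)
        for m in range(q):
            for n in range(q):
                if n != m:
                    w[m] *= (theta - offsets[n]) / (offsets[m] - offsets[n])
        return w

    def interp_fixed_stencil(f_line, alpha, dz, q=5):
        # g(z_j) = f(z_j + alpha) for all j on a periodic stripe, fixed stencil
        # centered at z_j; valid for |alpha| <= dz as stated above.
        w = lagrange_weights(q, alpha / dz)
        offsets = np.arange(q) - (q - 1) // 2
        return sum(w[m] * np.roll(f_line, -offsets[m]) for m in range(q))

For a displacement of exactly zero the weights reduce to (0, ..., 1, ..., 0), so the stripe is returned unchanged.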

Parallelization strategy 1: Remapping scheme. Two domain partitionings: one keeping x sequential and one keeping v sequential. [Figure: the two partitionings across processes p0, p1, p2.]
Impact on interpolation: none, as long as the interpolation is 1d (or at least split into x and v parts).
MPI communication: all-to-all communication; the fraction of the data to be communicated for p MPI processes is (p−1)/p.
Memory requirements: two copies of the 6d array (+ MPI communication buffers).
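
A sketch of the remap step, reduced to one x and one v dimension so that a single MPI_Alltoall does the job; the block size, variable names and toy data are assumptions for illustration, not the 6d production layout.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    p, r = comm.Get_size(), comm.Get_rank()
    b = 4                                    # points per rank and direction (toy size)
    N = p * b

    # x-distributed layout: rank r owns rows r*b:(r+1)*b of the global (N, N) array.
    local = np.arange(r * b * N, (r + 1) * b * N, dtype=np.float64).reshape(b, N)

    # Pack the p column blocks so that block s goes to rank s, then all-to-all;
    # (p-1)/p of the local data leaves the process, as stated above.
    sendbuf = np.ascontiguousarray(local.reshape(b, p, b).transpose(1, 0, 2))
    recvbuf = np.empty_like(sendbuf)
    comm.Alltoall(sendbuf, recvbuf)

    # v-distributed layout: rank r now owns columns r*b:(r+1)*b, shape (N, b).
    local_v = recvbuf.reshape(N, b)

Run with, e.g., mpiexec -n 4 python remap_sketch.py; in 6d the three x directions and the three v directions play the roles of the rows and columns.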

Parallelization strategy 2: Domain decomposition. Patches of six-dimensional data blocks. [Figure: 2d sketch of the patch layout across processes p0 to p8.]
Impact on interpolation: a local interpolant is needed (Lagrange, or local splines glued together with Hermite-type boundary conditions); this imposes an artificial CFL number, and the communication increases with the order.
MPI communication: nearest-neighbor communication of halo cells around the local domain; the size depends on the required halo width w of the interpolator and on the maximal displacement: 2wn^5 points per 1d interpolation.
Memory requirements, two alternative implementations:
Connected buffers: (n + 2w)^6 (+ MPI communication buffers).
Dynamic halo buffers ("DD slim"): memory overhead of 2wn^5, exploiting the fact that only halos in one dimension at a time are necessary (+ MPI communication buffers, partly reused).
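
A back-of-the-envelope comparison of the two buffer layouts in double precision; the local block size n and halo width w below are illustrative values, not taken from the slides.

    def gib(num_doubles):
        # gibibytes occupied by a given number of 8-byte floating point values
        return 8.0 * num_doubles / 2**30

    n, w = 32, 4                                # e.g. a local 32^6 block with halo width 4
    core = gib(n**6)                            # the local 6d block itself
    connected = gib((n + 2 * w)**6)             # halos allocated in all directions at once
    slim = gib(n**6 + 2 * w * n**5)             # "DD slim": halos in one direction at a time

    print(f"local block {core:.1f} GiB, connected buffers {connected:.1f} GiB, dd slim {slim:.1f} GiB")

With these numbers the connected layout needs about 30 GiB per process, while the slim layout stays close to 10 GiB.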

Lagrange interpolation. Let x_j be a grid point and α = (β + γ)Δx the shift of the grid point to the origin of the characteristic (β ∈ [0,1], γ ∈ ℤ). Interpolate f at x_j + α.
q-point Lagrange interpolation, q odd, with fixed stencil around x_j: f(x_j + α) = Σ_{i=j−(q−1)/2}^{j+(q−1)/2} l_i(α) f(x_i)
q-point Lagrange interpolation, q even, centered around the interval [x_{j+γ}, x_{j+γ+1}]: f(x_j + α) = f(x_{j+γ} + βΔx) = Σ_{i=j+γ−q/2+1}^{j+γ+q/2} l_i(β) f(x_i)
Parallelization for distributed domains:
Fixed stencil: CFL-like condition |α| ≤ Δx, exchange of (q−1)/2 data points on each side.
Centered stencil: CFL-like condition |α| ≤ wΔx, exchange of w + q/2 points on each side.

Impact of domain decomposition. It imposes a CFL-like condition. For Vlasov-Poisson the CFL-like condition is dominated by the x-advections, but there α = −v_j Δt is constant over time.
Idea: use the knowledge of the sign of α to reduce the data transfer. Resulting data transfer for the CFL-like condition |α| ≤ (w + β)Δz with the centered stencil: max(q/2 − w, 0) points on the left side, q/2 + w points on the right side. Total data to be sent: q points if w ≤ q/2.
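
A small helper that turns the counts above into left/right halo widths once the sign of the displacement is known; the orientation convention (positive α needs data on the right) is an assumption of this sketch.

    def halo_widths(alpha, dz, q):
        # Left/right halo widths (in grid points) for an even-q centered Lagrange
        # stencil and a displacement alpha of known sign, following the counts above.
        w = int(abs(alpha) // dz)              # integer number of cells crossed
        small = max(q // 2 - w, 0)
        large = q // 2 + w
        # positive alpha shifts the stencil to the right, negative to the left
        return (small, large) if alpha >= 0 else (large, small)

    # example: a 6-point stencil with a displacement of 1.5 cells needs (2, 4) points
    print(halo_widths(1.5, 1.0, 6))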

Local cubic splines. Computation of the interpolant: use a local spline on each domain, with Hermite-type boundary conditions from the neighboring domains [1]. Use the fast algorithm introduced by Unser et al. [2]. Algorithm for x_1, ..., x_N processor-local and α = (β + γ)Δx:

d_0 = (1/a) ( f(x_γ) + Σ_{i=1}^{M} (b/a)^i f(x_{γ−i}) ),
d_i = (1/a) ( f(x_{i+γ}) − b d_{i−1} ),  i = 1, ..., N+1,
c_{N+2} = (√3/(a(2+√3))) ( f(x_{N+γ+2}) + Σ_{i=1}^{M} (b/a)^i ( f(x_{N+2+γ−i}) + f(x_{N+2+γ+i}) ) ),
c_i = (1/a) ( d_i − b c_{i+1} ),  i = N+1, ..., 0.

Here a = (2+√3)/6, b = 1/6, and M determines the accuracy (M = 27 for machine precision).
[1] Crouseilles et al., J. Comput. Phys. 228, 2009. [2] Unser et al., IEEE Trans. Pattern Anal. Mach. Intell. 13, 1991.

Local cubic splines (continued). The same recursion as above, and in addition: Data exchange: the remote parts of d_0 and c_{N+2}; max(−γ, 0) points on the left or max(γ + 1, 0) points on the right side.
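
For reference, a compact sketch of the recursive cubic-spline prefilter of Unser et al. cited above, which is the building block behind the forward/backward recursion on this slide; the truncated initialization and boundary handling are simplifying assumptions, not the patch-coupled Hermite version used in the code.

    import numpy as np

    Z1 = np.sqrt(3.0) - 2.0        # pole of the cubic B-spline filter, |Z1| = 2 - sqrt(3)

    def cubic_spline_coeffs(s, M=27):
        # Coefficients c with (c[k-1] + 4*c[k] + c[k+1]) / 6 = s[k] in the interior,
        # via the causal/anti-causal recursions of Unser et al. (1991).  Errors from
        # the two truncated boundary initializations decay like |Z1|**distance.
        n = s.size
        cp = np.empty(n)
        cp[0] = sum(Z1**k * s[k % n] for k in range(M + 1))       # truncated init
        for k in range(1, n):
            cp[k] = s[k] + Z1 * cp[k - 1]
        cm = np.empty(n)
        cm[-1] = (Z1 / (Z1 * Z1 - 1.0)) * (cp[-1] + Z1 * cp[-2])  # mirror-type init
        for k in range(n - 2, -1, -1):
            cm[k] = Z1 * (cm[k + 1] - cp[k])
        return 6.0 * cm

Because |Z1| = 2 − √3 ≈ 0.27, twenty-seven terms in the initialization already reach machine precision, which is where the M = 27 on the slide comes from.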

Outline: Introduction; Interpolation and Parallelization; Numerical comparison of interpolators; Code optimization; Overlap of computation and communication.

Weak Landau damping. Initial condition: f_0(x, v) = (2π)^{−3/2} exp(−|v|²/2) (1 + α Σ_{l=1}^{3} cos(k_l x_l)).
Parameters: α = 0.01, k_l = 0.5, periodic boundaries. The weak perturbation α = 0.01 yields a mostly linear phenomenon, no real 6d effects, and relatively good resolution on the studied grids.
Error measure: absolute error in the field energy. Reference: created from a 1d solution with a spectral method at very high resolution (Jakob Ameres).
Helios cluster: Sandy Bridge EP 2.7 GHz, 16 processors and 58 GB of usable memory per node, InfiniBand. Compiler: Intel 15, IMPI 5.0.3.
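
The initial condition is simple to evaluate pointwise; a short sketch (array shapes and the function name are illustrative):

    import numpy as np

    def f0_weak_landau(x, v, alpha=0.01, k=0.5):
        # f0(x, v) = (2*pi)**(-3/2) * exp(-|v|**2 / 2) * (1 + alpha * sum_l cos(k * x_l));
        # x and v are arrays of shape (..., 3).
        maxwellian = np.exp(-0.5 * np.sum(v**2, axis=-1)) / (2.0 * np.pi)**1.5
        return maxwellian * (1.0 + alpha * np.sum(np.cos(k * x), axis=-1))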

Interpolation error for various interpolators. N_x = 16, N_v = 64, weak Landau damping. [Plot: error vs. time step Δt for spline, lag55, lag77, lag65 and lag67; the CFL-like condition for the x-interpolation steps is marked on the Δt axis.]

Interpolation error for various interpolators. Data points: (N_x = 8, N_v = 32, Δt = 0.1, 1 MPI), (N_x = 16, N_v = 64, Δt = 0.05, 8 MPI), (N_x = 32, N_v = 128, Δt = 0.01, 2048 MPI). [Plot: error vs. total CPU time for dds65, dds67, dds77, dds77 (h32) and rmp spline.]

Bump-on-tail. Initial condition: f_0(x, v) = (2π)^{−3/2} ( 0.9 exp(−v_1²/2) + 0.2 exp(−2(v_1 − 4.5)²) ) exp(−(v_2² + v_3²)/2) (1 + α Σ_{l=1}^{3} cos(0.3 x_l)).
Instability and nonlinear effects; relatively coarse resolution on the studied grid.
Error measure: absolute error in the field energy (until time 15). Reference: solution with Lagrange interpolation of order 6,7 on a grid with 64^6 data points.

Bump-on-tail. Simulation with the memory-slim domain decomposition and Lagrange 6,7 (7,7 for N = 128). Number of processes (MPI × OMP): 1 × 1, 16 × 1, ... [Plot: field energy vs. time.]

Interpolation error (until t = 15). N_x = N_v = 32, 16 MPI processes. [Plot: error vs. time step Δt for rmp/lag77, rmp/spl33, dds/lag77, dds/lag67 and dds/spl33; the CFL-like condition for the x-interpolation steps is marked.]

Interpolation error (until t = 15). N_x = N_v = 32, 16 MPI processes. [Plot: error vs. wall time for rmp/lag77, rmp/spl33, dds/lag77, dds/lag67 and dds/spl33; the CFL-like condition for the x-interpolation steps is marked.]

Outline: Introduction; Interpolation and Parallelization; Numerical comparison of interpolators; Code optimization; Overlap of computation and communication.

Single core performance. A speedup of the total domain decomposition code (main loop) by at least a factor of 2 was obtained by:
Avoiding Fortran convenience idioms (:).
Forcing inlining of the interpolators into the advector modules.
Cache blocking: memory access to 1d stripes with a large stride in the 6d array is slow; instead, the stripes are extracted in blocks along the first dimension to exploit hardware prefetching.
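
A sketch of the cache-blocking idea for the x3 advection, transcribed into Python/NumPy with a Fortran-ordered array so that i1 is the fastest index, as in the production code; the block size and function names are illustrative, and the performance benefit is of course only realized in the compiled code.

    import numpy as np

    def advect_along_x3_blocked(f, interp_1d, block=16):
        # f: Fortran-ordered 6d array with indices (i1, ..., i6).  Instead of pulling
        # one strided i3-stripe at a time, copy `block` consecutive i1 indices worth
        # of stripes into a contiguous scratch buffer, interpolate, and copy back.
        n1, n2, n3, n4, n5, n6 = f.shape
        for i6 in range(n6):
            for i5 in range(n5):
                for i4 in range(n4):
                    for i2 in range(n2):
                        for i1 in range(0, n1, block):
                            b = min(block, n1 - i1)
                            scratch = np.array(f[i1:i1 + b, i2, :, i4, i5, i6])
                            for m in range(b):
                                scratch[m, :] = interp_1d(scratch[m, :])
                            f[i1:i1 + b, i2, :, i4, i5, i6] = scratch

    # usage with an identity "interpolator" on a tiny array
    f = np.zeros((4,) * 6, order='F')
    advect_along_x3_blocked(f, lambda stripe: stripe)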

Effect of cache blocking. Configuration: Lagrange 7,7 on a cube of 32^6 points, 5 time steps. Hardware: Sandy Bridge node (mick@mpcdf) with 16 cores. [Table: CPU time in s for the advections in each direction, with and without blocking, and the sums.]

Single node performance (on Sandy Bridge). [Plot: wall clock time vs. number of processors for OMP and MPI, together with the ideal linear scaling.]

Single node scalability. Configuration: Lagrange 7,7 on a cube of 32^6 points, 5 time steps. Hardware: Sandy Bridge node (mick@mpcdf) with 16 cores. [Table: speed-up compared to a single CPU for 1 MPI × 16 OMP, 2 MPI × 8 OMP and 16 MPI × 1 OMP, for dds with cache blocking, dds, dd with cache blocking, dd, rmp with cache blocking [6.2] and rmp [5.4].] Note: the remap send/receive buffer copying is not OMP-parallelized.

Single node performance. Configuration: Lagrange 7,7 on a cube of ... points. Hardware: Haswell node. [Table: time dd slim [s] and time dd [s] for various MPI × OMP configurations.]

Multi node performance. Configuration: Lagrange 7,7 on a cube of ... points. Hardware: 32 Haswell nodes.
MPI/OMP | time dd slim | time dd | time rmp
64/... | ... s | 103 s | [336 s]
1024/... | ... s | 117 s | 177 s

Memory consumption of the parallelization algorithms. Parameters: N_x = 16, N_v = 64, 20 time steps. Configuration: 8 MPI processes, 1 OMP thread, on a MAIK node of the RZG (Sandy Bridge), Intel 15.
Interpolator | Algorithm | main memory [GB]
Lagrange 7,7 | Remap | 6.3
Lagrange 7,7 | DD | 8.0
Lagrange 7,7 | DD slim | 1.6
Lagrange 6,7 | DD slim | 1.4
Splines | Remap | 6.3

Strong scaling. Configuration: N = 64^6, 50 time steps, 7-point Lagrange, 4 MPI tasks with 5 OMP threads per node. Hardware: Ivy Bridge (hydra@mpcdf), 64 GB per node, InfiniBand FDR14. [Plot: wall clock time [s] vs. number of cores for remap, domain decomposition with 64-bit halos, and domain decomposition with 32-bit halos.]

Is the code portable to the Intel Xeon Phi KNL?
KNL: ... GHz, 16 GB HBM + 96 GB DRAM. Xeon (Draco): ... GHz, 128 GB DRAM.

Results on KNL. Configuration: Lagrange 7,7 on a cube of ... points. Hardware: KNL in cache mode, Intel 17. [Table: time dd slim [s], time dd [s] and speed-up for various MPI × OMP configurations.]

Results on KNL. Configuration: 7-point Lagrange, OMP only, no hyperthreading. [Table: run times of dd and dd slim for different grids on KNL (HBM), KNL (DDR) and Haswell.]

Outline: Introduction; Interpolation and Parallelization; Numerical comparison of interpolators; Code optimization; Overlap of computation and communication.

Overlap of communication and computation. Algorithm: advection with fixed-interval Lagrange interpolation in the domain decomposition.

Copy data to send buffer;
MPI communication of halos;
for i6 do
  for i5 do
    for i4 do
      for i2 do
        for i1 do
          Copy 1d stripe over i3 into scratch buffer;
          Interpolation along x3;
          Copy 1d stripe back to 6d array;
        end
      end
    end
  end
end
Algorithm 1: Advection along x3.

for block do
  Copy data to send buffer for block;
  MPI communication of halos for block;
  for i6 in block do
    for i5 do
      for i4 do
        for i2 do
          for i1 do
            Copy 1d stripe over i3 into scratch buffer;
            Interpolation along x3;
            Copy 1d stripe back to 6d array;
          end
        end
      end
    end
  end
end
Algorithm 2: Advection along x3.

Overlap of communication and computation. Algorithm: advection with fixed-interval Lagrange interpolation in the domain decomposition.
Idea: split the advection into blocks with separate MPI communication, so that the communication for one block can be overlapped with the computation on another block. Blocking is done in x_6 = v_3 for the x-advections and in x_3 for the v-advections. Implemented with nested OMP parallelism and OMP locks for the memory-slim domain decomposition.
First result on 32 Haswell nodes (draco@mpcdf, 64 MPI processes, 8 OMP threads each), 64^6 grid, 5 time steps, Lagrange 7,7, 4 blocks per advection: dd slim overlap: ... s, dd slim plain: ... s.
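
A stripped-down sketch of the per-block overlap pattern using plain non-blocking MPI (the production code uses nested OpenMP threads instead); the block count, halo width and array sizes are toy values.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    left, right = (rank - 1) % size, (rank + 1) % size      # periodic 1d neighbors

    nblocks, nloc, w = 4, 32, 4            # blocks per advection, local points, halo width
    data = np.random.rand(nblocks, nloc)   # stand-in for the blocks of the 6d array
    halo_l = np.empty((nblocks, w))
    halo_r = np.empty((nblocks, w))

    # Post the halo exchange of every block up front ...
    reqs = []
    for b in range(nblocks):
        reqs += [comm.Isend(data[b, :w].copy(), dest=left, tag=2 * b),
                 comm.Isend(data[b, -w:].copy(), dest=right, tag=2 * b + 1),
                 comm.Irecv(halo_r[b], source=right, tag=2 * b),
                 comm.Irecv(halo_l[b], source=left, tag=2 * b + 1)]

    # ... and interpolate each block as soon as its own halos have arrived, so the
    # communication of later blocks overlaps with the computation on earlier ones.
    for b in range(nblocks):
        MPI.Request.Waitall(reqs[4 * b:4 * (b + 1)])
        stripe = np.concatenate([halo_l[b], data[b], halo_r[b]])
        # 1d interpolation of the extended stripe would go here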

Overlap: Preliminary results. [Timeline plot: activity of each thread over time, split into advection (adv), halo preparation (prep) and halo exchange (exch).] Note: advection (adv) and halo preparation (prep) use nested OpenMP threads to utilize all available CPU cores.

Overlap: Zoom into the first advection block. [Timeline plot, zoomed into the first block.] Note: advection (adv) and halo preparation (prep) use nested OpenMP threads to utilize all available CPU cores.

Conclusions.
Summary: Interpolation: Lagrange is better than splines at good resolution, splines are better at low resolution; Lagrange is better suited for distributed domains. The memory-slim implementation of the domain decomposition enables the solution of large-scale problems. The domain decomposition scales better than the remap on thousands of processors. The remap algorithm gives more flexibility in the time step size and allows global interpolants.
Outlook: Include the magnetic field and multidimensional interpolation. Further exploit the potential of overlapping communication and computation.
