Interpolation with Radial Basis Functions on GPGPUs using CUDA

1 Interpolation with Radial Basis Functions on GPGPUs using CUDA
Gundolf Haase, in cooperation with Dirk Martin [VRV Vienna] and Günter Offner [AVL Graz]
Institute for Mathematics and Scientific Computing, University of Graz, Austria
Jena, July 15, 2014

2 Outline
Moving mesh
- Motivation
- Radial basis function interpolation [Dirk Martin, AVL Graz]
- RBF Evaluation
- Numerical results
More results
- Some nice speedups
Conclusions
- Accelerator programming

3 Moving mesh: Motivation

4 Moving Mesh - Motivation [Dirk Martin, AVL Graz]
- Changing geometry caused by boundary/interface displacement.
- Preserve the topology of the overall mesh ⇒ map the boundary displacement onto all nodes in the mesh.
- One opportunity: RBF interpolation.
- Requirement: runs on GPU.
- We look for the harmonic solution of $\nabla^2 u = 0$ in $\Omega$, $u = f$ on a set of points ⇒ approximate by fundamental solutions in the points of interest in $\Omega$.

5 Moving mesh: Radial basis function interpolation [Dirk Martin, AVL Graz]

6 Radial basis functions
Definition: If a univariate (one-variable) real-valued function $\varphi : [0, \infty) \to \mathbb{R}$ is used as a symmetric multivariate function $\Phi : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ via
$$\Phi(x, y) = \varphi(\lVert x - y \rVert_2) \quad \text{for all } x, y \in \mathbb{R}^d,$$
then $\varphi$ is called a radial basis function (RBF) and $\Phi$ is called the associated kernel.
Definition: The support of a function $u$ defined on $\Omega \subset \mathbb{R}^d$ is defined as $\operatorname{supp} u := \{x \in \Omega : u(x) \neq 0\}$.
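The definition can be illustrated with a minimal C++ sketch. It assumes $d = 3$ and uses the multiquadric (the RBF named later in the slides) with an arbitrarily chosen shape parameter $c = 1$; the names `Point`, `euclidean_distance`, `kernel` and `multiquadric` are hypothetical and not taken from the talk.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Points in R^d with d = 3 (an assumption made for this sketch).
using Point = std::array<double, 3>;

// Euclidean distance ||x - y||_2.
double euclidean_distance(const Point& x, const Point& y)
{
    double s = 0.0;
    for (std::size_t k = 0; k < x.size(); ++k) {
        const double d = x[k] - y[k];
        s += d * d;
    }
    return std::sqrt(s);
}

// Kernel Phi(x, y) = phi(||x - y||_2), built from any radial function phi.
template <class Radial>
double kernel(Radial phi, const Point& x, const Point& y)
{
    return phi(euclidean_distance(x, y));
}

// Example radial function: multiquadric phi(r) = sqrt(r^2 + c^2).
double multiquadric(double r)
{
    assert(r >= 0.0);
    const double c = 1.0;   // shape parameter, chosen arbitrarily here
    return std::sqrt(r * r + c * c);
}
```

Because the kernel depends on its arguments only through the distance, `kernel(multiquadric, x, y)` and `kernel(multiquadric, y, x)` agree, which is the symmetry the definition asks for.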

7 RBF interpolation
General approximation:
- a set of points $X = \{x_i\}_{i=1}^{N}$ is given,
- function values $f_i = f(x_i)$ are given ($f$ unknown),
- search for an approximating function $s$ with $s|_X = f|_X$.
In the context of RBF interpolation we seek an interpolant of the form
$$s(x) = \sum_{i=1}^{N} \lambda_i \varphi(\lVert x_i - x \rVert) + p(x), \qquad \lambda_i \in \mathbb{R}, \; p \in \mathcal{P}_M. \tag{1}$$
The polynomial term $p$ is required for the existence and uniqueness of a solution.

8 RBF Setup: System of equations
Requiring the interpolation condition $s|_X = f|_X$ in all given points and the unisolvency of the set $X$ for $\mathcal{P}_M^d$ (thus $p|_X = 0 \Rightarrow p \equiv 0$), and demanding a side condition on the coefficients of the polynomial term, leads to a system of linear equations for the determination of the coefficients $\lambda$ and $\pi$:
$$\sum_{i=1}^{N} \lambda_i \varphi(\lVert x_i - x_k \rVert) + \sum_{j=1}^{M} \pi_j p_j(x_k) = f(x_k), \quad 1 \le k \le N,$$
$$\sum_{i=1}^{N} \lambda_i p_l(x_i) = 0, \quad 1 \le l \le M, \tag{2}$$
or, in short notation,
$$\begin{pmatrix} \Phi & \Pi \\ \Pi^T & 0 \end{pmatrix} \begin{pmatrix} \lambda \\ \pi \end{pmatrix} = \begin{pmatrix} f \\ 0 \end{pmatrix}. \tag{3}$$
Solving (3) provides all information to evaluate the RBF approximant $s(x)$.
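For a handful of points, system (3) can be assembled and solved densely, which makes the block structure concrete. The following is only a sketch under stated assumptions: multiquadric kernel, a single constant polynomial $p_1(x) = 1$ (so $M = 1$), and plain Gaussian elimination with partial pivoting instead of the iterative solver used in the talk. The names `phi`, `solve_rbf_system` and `interpolant` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Point = std::array<double, 3>;

// Multiquadric phi(||a - b||) = sqrt(||a - b||^2 + c^2).
double phi(const Point& a, const Point& b, double c)
{
    double r2 = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) r2 += (a[k] - b[k]) * (a[k] - b[k]);
    return std::sqrt(r2 + c * c);
}

// Assemble the (N+1) x (N+1) block system with p_1(x) = 1 and solve it by
// Gaussian elimination with partial pivoting.
// Returns (lambda_1, ..., lambda_N, pi_1).
std::vector<double> solve_rbf_system(const std::vector<Point>& X,
                                     const std::vector<double>& f, double c)
{
    assert(X.size() == f.size());
    const std::size_t N = X.size(), n = N + 1;
    // Augmented matrix: n rows, n coefficient columns plus the right-hand side.
    std::vector<std::vector<double>> A(n, std::vector<double>(n + 1, 0.0));
    for (std::size_t k = 0; k < N; ++k) {
        for (std::size_t i = 0; i < N; ++i) A[k][i] = phi(X[i], X[k], c);
        A[k][N] = 1.0;   // Pi block
        A[N][k] = 1.0;   // Pi^T block: side condition sum_i lambda_i = 0
        A[k][n] = f[k];  // right-hand side
    }
    // Forward elimination.
    for (std::size_t col = 0; col < n; ++col) {
        std::size_t piv = col;
        for (std::size_t r = col + 1; r < n; ++r)
            if (std::fabs(A[r][col]) > std::fabs(A[piv][col])) piv = r;
        std::swap(A[col], A[piv]);
        for (std::size_t r = col + 1; r < n; ++r) {
            const double m = A[r][col] / A[col][col];
            for (std::size_t j = col; j <= n; ++j) A[r][j] -= m * A[col][j];
        }
    }
    // Back substitution.
    std::vector<double> u(n);
    for (std::size_t r = n; r-- > 0;) {
        double s = A[r][n];
        for (std::size_t j = r + 1; j < n; ++j) s -= A[r][j] * u[j];
        u[r] = s / A[r][r];
    }
    return u;
}

// Evaluate s(x) = sum_i lambda_i phi(||x_i - x||) + pi_1.
double interpolant(const std::vector<Point>& X, const std::vector<double>& u,
                   double c, const Point& x)
{
    double s = u[X.size()];
    for (std::size_t i = 0; i < X.size(); ++i) s += u[i] * phi(X[i], x, c);
    return s;
}
```

Evaluating `interpolant` at any of the input points should reproduce the prescribed value up to rounding, and the first $N$ entries of the solution should sum to zero. A dense solve costs $O(N^3)$ and stores the full matrix, so it is only useful for checking small cases.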

9 Solving the RBF system of equations
RBF: $\varphi(r) = \sqrt{r^2 + c^2}$ (multiquadric biharmonics) and constant polynomial terms.
The system
$$\begin{pmatrix} \Phi & \Pi \\ \Pi^T & 0 \end{pmatrix} \begin{pmatrix} \lambda \\ \pi \end{pmatrix} = \begin{pmatrix} f \\ 0 \end{pmatrix}, \qquad \Phi_{ij} = \varphi(\lVert x_i - x_j \rVert), \quad \Pi_{ij} = p_j(x_i)$$
(rows $i = 1, \dots, N$; columns $j = 1, \dots, N$ of $\Phi$ and $j = 1, \dots, M$ of $\Pi$) is solved via the FGP algorithm, a special Krylov subspace algorithm for our RBF [Faul/Goodsell/Powell 05]:
- No matrix is stored; the operation Matrix × Vector is implemented directly.
- Brute force: direct implementation of $\Phi \cdot \lambda$: $O(N^2)$.
- Multipole: an approximation $\tilde{\Phi}$ of $\Phi$ is used: $O(N \log N)$.
- The preconditioning in FGP, appropriate for our RBF with constant polynomial, approximates $\Phi^{-1}$ by 51 entries per row. An octree is used for neighborhood relations: $O(N \log N)$.
- < 10 iterations to solve the system.
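The brute-force variant of the matrix-vector product can be sketched as follows. This is an illustration in plain C++ rather than the CUDA code of the talk; it covers only the $\Phi \cdot \lambda$ block for the multiquadric, and the name `apply_phi` is hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::array<double, 3>;

// Matrix-free product y = Phi * lambda for the multiquadric kernel:
// y_k = sum_i lambda_i * sqrt(||x_i - x_k||^2 + c^2).
// Every matrix entry is recomputed on the fly, so nothing of size N x N is stored.
std::vector<double> apply_phi(const std::vector<Point>& X,
                              const std::vector<double>& lambda, double c)
{
    assert(X.size() == lambda.size());
    const std::size_t N = X.size();
    std::vector<double> y(N, 0.0);
    // The rows are independent of each other, so this outer loop is the natural
    // place for parallelism (CPU threads, or one GPU thread per row).
    for (std::size_t k = 0; k < N; ++k) {
        double sum = 0.0;
        for (std::size_t i = 0; i < N; ++i) {
            double r2 = 0.0;
            for (std::size_t d = 0; d < 3; ++d) {
                const double diff = X[i][d] - X[k][d];
                r2 += diff * diff;
            }
            sum += lambda[i] * std::sqrt(r2 + c * c);
        }
        y[k] = sum;
    }
    return y;
}
```

The double loop makes the $O(N^2)$ cost visible: memory stays at $O(N)$, while the arithmetic grows quadratically with the number of boundary nodes, which is what the multipole approximation avoids.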

10 Moving mesh: RBF Evaluation

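11 Evaluating the sums
- $\lambda$ has been calculated in the setup.
- Evaluation of the sums
$$s(x) = \sum_{i=1}^{N} \lambda_i \varphi(\lVert x_i - x \rVert) = \sum_{i=1}^{N} \lambda_i \varphi_i(x)$$
  with kernel $\varphi(r) = \sqrt{r^2 + c^2}$.
- Uses a series expansion (Laurent) for the far field (octree distance 2):
$$\sum_{i=1}^{N} \lambda_i \varphi_i(x) = \sum_{j=1}^{N_{\text{boxes}}} \sum_{l=0}^{p+1} \frac{G_l^j(x)}{\lVert x \rVert^{2l-1}} + \sum_{i \in \text{near field}} \lambda_i \varphi_i(x),$$
  where $N_{\text{boxes}}$ is the number of evaluation boxes.
- The Laurent series coefficients $G_l(x)$ are precomputed.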
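The powers $\lVert x \rVert^{1-2l}$ in the far-field sum can be motivated with the simplest possible case: a single centre at the origin, for which the multiquadric has the binomial series $\sqrt{r^2 + c^2} = \sum_{l \ge 0} \binom{1/2}{l} c^{2l} r^{1-2l}$, valid for $r > c$. The sketch below truncates that series after $p + 1$ terms. It is not the multipole scheme of the talk, whose coefficients $G_l^j$ collect all sources of a box; the function name `multiquadric_far_field` is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Truncated far-field series of the multiquadric for one centre at the origin:
// sqrt(r^2 + c^2) ~ sum_{l=0}^{p} binom(1/2, l) * c^(2l) * r^(1-2l),  r > c.
double multiquadric_far_field(double r, double c, int p)
{
    assert(r > c && c >= 0.0 && p >= 0);
    double binom = 1.0;   // binom(1/2, 0)
    double c_pow = 1.0;   // c^(2l)
    double r_pow = r;     // r^(1-2l)
    double sum = 0.0;
    for (int l = 0; l <= p; ++l) {
        sum += binom * c_pow * r_pow;
        binom *= (0.5 - l) / (l + 1);   // recurrence to binom(1/2, l+1)
        c_pow *= c * c;
        r_pow /= r * r;
    }
    return sum;
}
```

Each additional term gains a factor of roughly $(c/r)^2$ in accuracy, which is why such an expansion is only used for well-separated (far-field) boxes while near-field contributions are summed directly.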

12 Moving mesh: Numerical results

13 Test examples [D. Martin, AVL]
- 3D geometry of a combustion chamber (AVL List GmbH) with boundary deformation, and spheres with various numbers of boundary nodes.
- Boundary nodes: 603, ...; equivalent to approx. 27 million nodes overall (similar to heart).
- Heart geometry already in the Fire™ pipeline.
[© D. Martin, AVL]

14 Test I
- The setup of the multipole method is measured (multipole setup).
- Solving the system contains the application of one Mat × Vector ($t_m$) per iteration ⇒ this dominates the iterations in the FGP algorithm.
- Overall setup approx. 60 $t_m$; interpolation 6 $t_m$.
- The sequential (+shm) CPU code was already accelerated by a factor of 2-4.
- Hardware: Intel Core i7 2600K with 4+4HT cores; Nvidia GTX 680.
Table (values not preserved): columns CPU [s], GPU [s], Speedup; rows for SP and DP: multipole setup, M × v brute force, M × v.
Figure: 89,702 boundary nodes

15 Test II
- Hardware (mephisto): 2 × Intel Xeon X5650 with 12+12HT cores; Nvidia Tesla C2070.
Table (values not preserved): columns CPU [s], GPU [s], Speedup, Sp(1/12 cores); rows for SP and DP: multipole setup, M × v brute force, M × v.
Figure: 89,702 boundary nodes
- GPU mesh moving/smoothing in consumer release, AMG on GPU in internal release [Max Emans, Manfred Liebmann].

16 Example: ellipsoid [July 6, 2014]
- Hardware (mephisto): 2 × Intel Xeon E with 2 × 8 cores; Nvidia K20m; Xeon Phi 60 cores.
- Compilers: gcc 4.4.7; nvcc V6.0.1; pgc++ v14.6.
Table (values not preserved): columns CPU-1, CPU-16, MIC, CUDA, OpenACC; rows for SP and DP: multipole setup, M × v brute force, M × v.
Figure: DP: 128,000 boundary nodes; SP: 32,000 boundary nodes; timings in sec.
- MIC: unmodified OpenMP code in native mode.

17 Many-core RBF: Goods and Odds
- OpenMP: times faster on 16 CPU cores.
- CUDA: up to 15 times faster than 16 CPU cores.
- MIC: very easy in native mode; up to 4 times faster than 16 CPU cores for simply structured code.
- MIC: needs more tuning.
- OpenACC: 7 times slower than CUDA, even for the brute-force algorithm. Hard to convince the compiler to use vector (thread) instead of gang (block).

18 More results: Some nice speedups

19 Results wrt. GPU acceleration I: Seminar work
- Andreas Windisch ['12] (QCD, Fortran, CUFFT): 95×. Analytic Structure of Scalar Glueball Operators
- Markus Hopfer ['12] (Fortran, CUDA, MPI): 55×. The Ghost-Gluon System of Yang-Mills Theory
- Michael Reisecker: Parallel computing in the Potts model
- Mario Schröck: Gauge fixing (maximization problem) in Quantum Electrodyn.
- Ydalia del Pilar Delgado: Random walk generation in Lattice QCD
- Martin Holler (Matlab: 3400×; C++: 54×): Total Variation based JPEG decompression model

20 Results wrt. GPU acceleration II: Project work
- Andreas Kucher: 65×. GPU accelerated optimization in a pill identification problem
- Kristian Bredies: 50×. TGV minimization for MRI
- Manfred Liebmann/Aurel Neic: 10×. AMG solver for unstructured sparse systems
- Manfred Liebmann: 70×. Mixed gas flow (Euler equations; explicit)

21 Conclusions: Accelerator programming

22 Accelerator programming
Available accelerators:
- NVIDIA GPUs: Tesla K cores; 1.43 TFLOPS-DP (1.43); 288 GB/s
- AMD GPU: FirePro S cores; 1.48 TFLOPS-DP; 480 GB/s
- Intel: Xeon Phi SE10P (new: 7120X); 1.07 TFLOPS-DP (3); 61 cores (x86); 352 GB/s (500 GB/s)
- Intel: Xeon E7-8870: 10 cores, 96 GFLOPS, 43 GB/s
Which language should be used for programming accelerators?
- CUDA: hardware specific but free compilers; additional code.
- OpenMP 4.0 pragma: new standard (?); support by the next gcc compiler.
- MIC pragma: special for Intel; no free compiler yet ⇒ OpenMP 4.0.
- Will OpenACC converge to OpenMP 4.0?
- hUMA (heterogeneous UMA): one address space.

23 Scalar product - OpenACC

    #ifdef _OPENACC
    #include <accel.h>   // OpenACC
    #endif

    double scalar(const unsigned int N, const double *const x, const double *const y)
    {
        double sum = 0.0;
        unsigned int i;
    #pragma omp parallel for private(i) shared(x, y) schedule(static) reduction(+:sum)
    #pragma acc kernels loop present(x[0:N], y[0:N]) independent reduction(+:sum)
        for (i = 0; i < N; ++i) {
            sum += x[i] * y[i];
        }
        return sum;
    }

    int main(int argc, char **argv)
    {
        // data allocation and init. on CPU
    #pragma acc data copy(x[0:N], y[0:N])   // copy data from CPU to GPU
        {                                    // The braces are important!!
            for (i = 0; i < 10000; ++i) {
                sk = scalar(N, x, y);        // x and y are already present on the GPU
            }
        }
    }

24 Matrix assembling - OpenACC [June 2014]

    #pragma acc data copyin(elem_color[0:nelems]) ... copyout(val[0:nnz])
    double ske[3][3], fe[3];   // !! private but outside of loop !!
    for (int k = 0; k < nnz; ++k)   val[k] = 0.0;   // set values in matrix to 0
    for (int k = 0; k < nrows; ++k) rhs[k] = 0.0;

    for (int c = 0; c < ncolors; ++c)   // loop on all colors
    {
        const int c0 = color_idx[c], c1 = color_idx[c+1];
    #pragma omp parallel for default(none) shared(elem_color, conn, offset, xc, rhs, val, id) private(ske, fe)
    #pragma acc kernels loop \
        pcopyin(elem_color[0:nelems], conn[0:nelems*3], offset[0:nelems*3*3], xc[0:nnodes*2], id[0:nrows+1]) \
        pcopy(rhs[0:nrows], val[0:nnz]) private(ske, fe) independent
        for (int jc = c0; jc < c1; ++jc)   // all elements of this color
        {
            const int je = elem_color[jc];   // one element of that color
            CalcElem_new_inline(conn+3*je, xc, ske, fe);   // !! pragma acc routine seq !! no atomic !!
            for (int i = 0; i < nsize; ++i)
            {
                const int ig   = conn[3*je+i];   // global row index in CRS matrix
                const int irow = id[ig];         // start of that row in the value vector
                for (int j = 0; j < nsize; ++j)
                {
                    val[irow + offset[3*3*je + 3*i + j]] += ske[i][j];
                }
                rhs[ig] += fe[i];
            }
        }
    }   // colors
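The assembly loop above can skip atomic updates only because elements of one color do not share a node. The slide does not show how the coloring is obtained; one common way is a greedy pass over the elements, sketched here for triangles. The function name `color_elements` and the data layout are assumptions made for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Greedy coloring of triangles: two elements that share a node never receive
// the same color, so all elements of one color can be assembled in parallel
// without atomic updates. conn holds the 3 node indices of each element.
std::vector<int> color_elements(const std::vector<std::array<int, 3>>& conn,
                                std::size_t nnodes)
{
    std::vector<int> color(conn.size(), -1);
    // used[n][k] == true: some element touching node n already has color k.
    std::vector<std::vector<bool>> used(nnodes);
    for (std::size_t e = 0; e < conn.size(); ++e) {
        int k = 0;
        // Increase k until no node of element e has seen color k before.
        for (bool clash = true; clash;) {
            clash = false;
            for (int n : conn[e]) {
                assert(n >= 0 && static_cast<std::size_t>(n) < nnodes);
                if (static_cast<std::size_t>(k) < used[n].size() && used[n][k]) {
                    clash = true;
                }
            }
            if (clash) ++k;
        }
        color[e] = k;
        for (int n : conn[e]) {
            if (used[n].size() <= static_cast<std::size_t>(k)) used[n].resize(k + 1, false);
            used[n][k] = true;
        }
    }
    return color;
}
```

Sorting the element indices by the returned color would give arrays in the spirit of `elem_color` and `color_idx` from the slide: a list of elements grouped by color plus the start offset of each group.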

25 OpenACC and C++
OpenACC in PGI [June 2014]:
1. support for C++
2. simple classes can be transferred to the GPU
3. problems with virtual methods
4. inlining of functions is now supported (gang vs. vector vs. seq)
5. mixing of CUDA and OpenACC is possible via deviceptr
6. good performance for simple problems, but quite tricky for advanced problems (private data, forcing vector parallelization)

26 How will we continue with accelerator programming?
- OpenMP 4.0 will be the new standard. OpenACC will probably merge into OpenMP 4.0.
- Intel icpc 14.0 and gcc-4.9 support OpenMP 4.0.
- Hope: one code that runs on several devices.
- Less flexible than CUDA regarding data transfer. C++ data are a problem.
- The success depends on the available frontends of the compilers.
- Hint: start with OpenMP (-openmp), then run on MIC (-mmic) and OpenACC (-acc -ta=nvidia).
Supported by the FWF project F32-N18 and by NAWI Graz.

27 Thank you!
Figure: Jan 3, 2014; near Graz


More information

A microsecond a day keeps the doctor away: Efficient GPU Molecular Dynamics with GROMACS

A microsecond a day keeps the doctor away: Efficient GPU Molecular Dynamics with GROMACS GTC 20130319 A microsecond a day keeps the doctor away: Efficient GPU Molecular Dynamics with GROMACS Erik Lindahl erik.lindahl@scilifelab.se Molecular Dynamics Understand biology We re comfortably on

More information

arxiv: v1 [hep-lat] 10 Jul 2012

arxiv: v1 [hep-lat] 10 Jul 2012 Hybrid Monte Carlo with Wilson Dirac operator on the Fermi GPU Abhijit Chakrabarty Electra Design Automation, SDF Building, SaltLake Sec-V, Kolkata - 700091. Pushan Majumdar Dept. of Theoretical Physics,

More information

The Lattice Boltzmann Method for Laminar and Turbulent Channel Flows

The Lattice Boltzmann Method for Laminar and Turbulent Channel Flows The Lattice Boltzmann Method for Laminar and Turbulent Channel Flows Vanja Zecevic, Michael Kirkpatrick and Steven Armfield Department of Aerospace Mechanical & Mechatronic Engineering The University of

More information

Parallel Sparse Tensor Decompositions using HiCOO Format

Parallel Sparse Tensor Decompositions using HiCOO Format Figure sources: A brief survey of tensors by Berton Earnshaw and NVIDIA Tensor Cores Parallel Sparse Tensor Decompositions using HiCOO Format Jiajia Li, Jee Choi, Richard Vuduc May 8, 8 @ SIAM ALA 8 Outline

More information

Accelerating Model Reduction of Large Linear Systems with Graphics Processors

Accelerating Model Reduction of Large Linear Systems with Graphics Processors Accelerating Model Reduction of Large Linear Systems with Graphics Processors P. Benner 1, P. Ezzatti 2, D. Kressner 3, E.S. Quintana-Ortí 4, Alfredo Remón 4 1 Max-Plank-Institute for Dynamics of Complex

More information

Parallelization Strategies for Density Matrix Renormalization Group algorithms on Shared-Memory Systems

Parallelization Strategies for Density Matrix Renormalization Group algorithms on Shared-Memory Systems Parallelization Strategies for Density Matrix Renormalization Group algorithms on Shared-Memory Systems G. Hager HPC Services, Computing Center Erlangen, Germany E. Jeckelmann Theoretical Physics, Univ.

More information

Shortest Lattice Vector Enumeration on Graphics Cards

Shortest Lattice Vector Enumeration on Graphics Cards Shortest Lattice Vector Enumeration on Graphics Cards Jens Hermans 1 Michael Schneider 2 Fréderik Vercauteren 1 Johannes Buchmann 2 Bart Preneel 1 1 K.U.Leuven 2 TU Darmstadt SHARCS - 10 September 2009

More information

Some notes on efficient computing and setting up high performance computing environments

Some notes on efficient computing and setting up high performance computing environments Some notes on efficient computing and setting up high performance computing environments Andrew O. Finley Department of Forestry, Michigan State University, Lansing, Michigan. April 17, 2017 1 Efficient

More information

Efficient implementation of the overlap operator on multi-gpus

Efficient implementation of the overlap operator on multi-gpus Efficient implementation of the overlap operator on multi-gpus Andrei Alexandru Mike Lujan, Craig Pelissier, Ben Gamari, Frank Lee SAAHPC 2011 - University of Tennessee Outline Motivation Overlap operator

More information

Multi-GPU Simulations of the Infinite Universe

Multi-GPU Simulations of the Infinite Universe () Multi-GPU of the Infinite with with G. Rácz, I. Szapudi & L. Dobos Physics of Complex Systems Department Eötvös Loránd University, Budapest June 22, 2018, Budapest, Hungary Outline 1 () 2 () Concordance

More information

Advancing Weather Prediction at NOAA. 18 November 2015 Tom Henderson NOAA / ESRL / GSD

Advancing Weather Prediction at NOAA. 18 November 2015 Tom Henderson NOAA / ESRL / GSD Advancing Weather Prediction at NOAA 18 November 2015 Tom Henderson NOAA / ESRL / GSD The U. S. Needs Better Global Numerical Weather Prediction Hurricane Sandy October 28, 2012 A European forecast that

More information

Beam dynamics calculation

Beam dynamics calculation September 6 Beam dynamics calculation S.B. Vorozhtsov, Е.Е. Perepelkin and V.L. Smirnov Dubna, JINR http://parallel-compute.com Outline Problem formulation Numerical methods OpenMP and CUDA realization

More information

APPLICATION OF CUDA TECHNOLOGY FOR CALCULATION OF GROUND STATES OF FEW-BODY NUCLEI BY FEYNMAN'S CONTINUAL INTEGRALS METHOD

APPLICATION OF CUDA TECHNOLOGY FOR CALCULATION OF GROUND STATES OF FEW-BODY NUCLEI BY FEYNMAN'S CONTINUAL INTEGRALS METHOD APPLICATION OF CUDA TECHNOLOGY FOR CALCULATION OF GROUND STATES OF FEW-BODY NUCLEI BY FEYNMAN'S CONTINUAL INTEGRALS METHOD M.A. Naumenko, V.V. Samarin Joint Institute for Nuclear Research, Dubna, Russia

More information

MATH 590: Meshfree Methods

MATH 590: Meshfree Methods MATH 590: Meshfree Methods Chapter 34: Improving the Condition Number of the Interpolation Matrix Greg Fasshauer Department of Applied Mathematics Illinois Institute of Technology Fall 2010 fasshauer@iit.edu

More information

Performance evaluation of scalable optoelectronics application on large-scale Knights Landing cluster

Performance evaluation of scalable optoelectronics application on large-scale Knights Landing cluster Performance evaluation of scalable optoelectronics application on large-scale Knights Landing cluster Yuta Hirokawa Graduate School of Systems and Information Engineering, University of Tsukuba hirokawa@hpcs.cs.tsukuba.ac.jp

More information

On the design of parallel linear solvers for large scale problems

On the design of parallel linear solvers for large scale problems On the design of parallel linear solvers for large scale problems ICIAM - August 2015 - Mini-Symposium on Recent advances in matrix computations for extreme-scale computers M. Faverge, X. Lacoste, G. Pichon,

More information

Scalable and Power-Efficient Data Mining Kernels

Scalable and Power-Efficient Data Mining Kernels Scalable and Power-Efficient Data Mining Kernels Alok Choudhary, John G. Searle Professor Dept. of Electrical Engineering and Computer Science and Professor, Kellogg School of Management Director of the

More information

Karhunen-Loève Approximation of Random Fields Using Hierarchical Matrix Techniques

Karhunen-Loève Approximation of Random Fields Using Hierarchical Matrix Techniques Institut für Numerische Mathematik und Optimierung Karhunen-Loève Approximation of Random Fields Using Hierarchical Matrix Techniques Oliver Ernst Computational Methods with Applications Harrachov, CR,

More information

Bachelor-thesis: GPU-Acceleration of Linear Algebra using OpenCL

Bachelor-thesis: GPU-Acceleration of Linear Algebra using OpenCL Bachelor-thesis: GPU-Acceleration of Linear Algebra using OpenCL Andreas Falkenstrøm Mieritz s093065 September 13, 2012 Supervisors: Allan Ensig-Peter Karup Bernd Dammann IMM-B.Sc.-2012-30 Contents 1 Problem

More information

Direct Self-Consistent Field Computations on GPU Clusters

Direct Self-Consistent Field Computations on GPU Clusters Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd

More information

An Implementation of SPELT(31, 4, 96, 96, (32, 16, 8))

An Implementation of SPELT(31, 4, 96, 96, (32, 16, 8)) An Implementation of SPELT(31, 4, 96, 96, (32, 16, 8)) Tung Chou January 5, 2012 QUAD Stream cipher. Security relies on MQ (Multivariate Quadratics). QUAD The Provably-secure QUAD(q, n, r) Stream Cipher

More information

Incomplete Cholesky preconditioners that exploit the low-rank property

Incomplete Cholesky preconditioners that exploit the low-rank property anapov@ulb.ac.be ; http://homepages.ulb.ac.be/ anapov/ 1 / 35 Incomplete Cholesky preconditioners that exploit the low-rank property (theory and practice) Artem Napov Service de Métrologie Nucléaire, Université

More information

Explore Computational Power of GPU in Electromagnetics and Micromagnetics

Explore Computational Power of GPU in Electromagnetics and Micromagnetics Explore Computational Power of GPU in Electromagnetics and Micromagnetics Presenter: Sidi Fu, PhD candidate, UC San Diego Advisor: Prof. Vitaliy Lomakin Center of Magnetic Recording Research, Department

More information

ERLANGEN REGIONAL COMPUTING CENTER

ERLANGEN REGIONAL COMPUTING CENTER ERLANGEN REGIONAL COMPUTING CENTER Making Sense of Performance Numbers Georg Hager Erlangen Regional Computing Center (RRZE) Friedrich-Alexander-Universität Erlangen-Nürnberg OpenMPCon 2018 Barcelona,

More information

HYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017

HYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017 HYCOM and Navy ESPC Future High Performance Computing Needs Alan J. Wallcraft COAPS Short Seminar November 6, 2017 Forecasting Architectural Trends 3 NAVY OPERATIONAL GLOBAL OCEAN PREDICTION Trend is higher

More information

S8241 VERSIONING GPU- ACCLERATED WRF TO Jeff Adie, 26 March, 2018 (Presented by Stan Posey, NVIDIA)

S8241 VERSIONING GPU- ACCLERATED WRF TO Jeff Adie, 26 March, 2018 (Presented by Stan Posey, NVIDIA) S8241 VERSIONING GPU- ACCLERATED WRF TO 3.7.1 Jeff Adie, 26 March, 2018 (Presented by Stan Posey, NVIDIA) 1 ACKNOWLEDGEMENT The work presented here today would not have been possible without the efforts

More information

Solving PDEs: the Poisson problem TMA4280 Introduction to Supercomputing

Solving PDEs: the Poisson problem TMA4280 Introduction to Supercomputing Solving PDEs: the Poisson problem TMA4280 Introduction to Supercomputing Based on 2016v slides by Eivind Fonn NTNU, IMF February 27. 2017 1 The Poisson problem The Poisson equation is an elliptic partial

More information

Real-time signal detection for pulsars and radio transients using GPUs

Real-time signal detection for pulsars and radio transients using GPUs Real-time signal detection for pulsars and radio transients using GPUs W. Armour, M. Giles, A. Karastergiou and C. Williams. University of Oxford. 15 th July 2013 1 Background of GPUs Why use GPUs? Influence

More information

An Algorithmic Framework of Large-Scale Circuit Simulation Using Exponential Integrators

An Algorithmic Framework of Large-Scale Circuit Simulation Using Exponential Integrators An Algorithmic Framework of Large-Scale Circuit Simulation Using Exponential Integrators Hao Zhuang 1, Wenjian Yu 2, Ilgweon Kang 1, Xinan Wang 1, and Chung-Kuan Cheng 1 1. University of California, San

More information

MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors

MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors J. Dongarra, M. Gates, A. Haidar, Y. Jia, K. Kabir, P. Luszczek, and S. Tomov University of Tennessee, Knoxville 05 / 03 / 2013 MAGMA:

More information

Improving many flavor QCD simulations using multiple GPUs

Improving many flavor QCD simulations using multiple GPUs Improving many flavor QCD simulations using multiple GPUs M. Hayakawa a, b, Y. Osaki b, S. Takeda c, S. Uno a, N. Yamada de a Department of Physics, Nagoya University, Nagoya 464-8602, Japan b Department

More information

A parameter tuning technique of a weighted Jacobi-type preconditioner and its application to supernova simulations

A parameter tuning technique of a weighted Jacobi-type preconditioner and its application to supernova simulations A parameter tuning technique of a weighted Jacobi-type preconditioner and its application to supernova simulations Akira IMAKURA Center for Computational Sciences, University of Tsukuba Joint work with

More information

Solution to Laplace Equation using Preconditioned Conjugate Gradient Method with Compressed Row Storage using MPI

Solution to Laplace Equation using Preconditioned Conjugate Gradient Method with Compressed Row Storage using MPI Solution to Laplace Equation using Preconditioned Conjugate Gradient Method with Compressed Row Storage using MPI Sagar Bhatt Person Number: 50170651 Department of Mechanical and Aerospace Engineering,

More information

Multiphase Flow Simulations in Inclined Tubes with Lattice Boltzmann Method on GPU

Multiphase Flow Simulations in Inclined Tubes with Lattice Boltzmann Method on GPU Multiphase Flow Simulations in Inclined Tubes with Lattice Boltzmann Method on GPU Khramtsov D.P., Nekrasov D.A., Pokusaev B.G. Department of Thermodynamics, Thermal Engineering and Energy Saving Technologies,

More information

Hybrid Analog-Digital Solution of Nonlinear Partial Differential Equations

Hybrid Analog-Digital Solution of Nonlinear Partial Differential Equations Hybrid Analog-Digital Solution of Nonlinear Partial Differential Equations Yipeng Huang, Ning Guo, Kyle Mandli, Mingoo Seok, Yannis Tsividis, Simha Sethumadhavan Columbia University Hybrid Analog-Digital

More information

A User Friendly Toolbox for Parallel PDE-Solvers

A User Friendly Toolbox for Parallel PDE-Solvers A User Friendly Toolbox for Parallel PDE-Solvers Gundolf Haase Institut for Mathematics and Scientific Computing Karl-Franzens University of Graz Manfred Liebmann Mathematics in Sciences Max-Planck-Institute

More information

Parallel Preconditioning Methods for Ill-conditioned Problems

Parallel Preconditioning Methods for Ill-conditioned Problems Parallel Preconditioning Methods for Ill-conditioned Problems Kengo Nakajima Information Technology Center, The University of Tokyo 2014 Conference on Advanced Topics and Auto Tuning in High Performance

More information

A CPU-GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method

A CPU-GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method A CPU-GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method Jee Choi 1, Aparna Chandramowlishwaran 3, Kamesh Madduri 4, and Richard Vuduc 2 1 ECE, Georgia Tech 2 CSE, Georgia

More information

Parallel Rabin-Karp Algorithm Implementation on GPU (preliminary version)

Parallel Rabin-Karp Algorithm Implementation on GPU (preliminary version) Bulletin of Networking, Computing, Systems, and Software www.bncss.org, ISSN 2186-5140 Volume 7, Number 1, pages 28 32, January 2018 Parallel Rabin-Karp Algorithm Implementation on GPU (preliminary version)

More information

Introduction to Benchmark Test for Multi-scale Computational Materials Software

Introduction to Benchmark Test for Multi-scale Computational Materials Software Introduction to Benchmark Test for Multi-scale Computational Materials Software Shun Xu*, Jian Zhang, Zhong Jin xushun@sccas.cn Computer Network Information Center Chinese Academy of Sciences (IPCC member)

More information