Solving RODEs on GPU clusters
Transcription
1 HIGH SCIENCE Solving RODEs on GPU clusters Christoph Riesinger Technische Universität München March 4, 2016
2 Motivation - Parallel Computing
7 Motivation - Multiple Levels of Parallelism
10 Building Blocks [Figure: GPU 0, GPU 1, ..., GPU N-1 each run the same four building blocks (pseudo random number generation, Ornstein-Uhlenbeck process, averaging, numerical solver); the GPUs are grouped under the label Monte Carlo]
11 Pseudo Random Number Generation - Ziggurat The area under the Gaussian function is approximated by strips R_i. These strips are further subdivided into central (green), tail (purple), and cap (red) regions and a base strip (blue). [Figure: Ziggurat with strips R_0, ..., R_7 = R_B, strip boundaries x_1, ..., x_7 = r, and heights y_0, ..., y_7] To do the transformation, a strip is selected at random. A uniform random number u ∈ [0, 1[ is stretched by a lookup table value based on the selected strip. If a central region is hit, the transformation is very cheap; otherwise it is much more expensive.
13 Pseudo Random Number Generation - Trade-off The more strips the Ziggurat uses, the larger the ratio of the total area of all central regions to the total area of all strips. The larger this ratio, the more likely a (cheap) central region is hit. In addition, on GPUs, this reduces the likelihood of warp divergence. So runtime can be reduced by using more strips, which results in larger lookup tables: a runtime/memory trade-off.
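The strip selection, the cheap central-region path, and the effect of the strip count can be sketched in a few lines of Python. This is a minimal CPU sketch in the style of the classic Ziggurat construction, not the GPU implementation from the talk; the function names (`build_tables`, `fast_path_probability`, `sample`), the bisection used to find r, and the choice to count the rectangular part of the base strip as a cheap hit are assumptions made for illustration.

```python
import math
import random


def f(x):
    """Unnormalized Gaussian density exp(-x^2 / 2)."""
    return math.exp(-0.5 * x * x)


def tail_area(r):
    """Integral of f from r to infinity."""
    return math.sqrt(math.pi / 2.0) * math.erfc(r / math.sqrt(2.0))


def top_height(r, n_strips):
    """Height reached after stacking n_strips equal-area strips above base edge r."""
    area = r * f(r) + tail_area(r)
    x, y = r, f(r)
    for _ in range(n_strips - 1):
        y += area / x
        if y >= 1.0:          # overshoot: r is too small for this strip count
            return y
        x = math.sqrt(-2.0 * math.log(y))
    return y


def build_tables(n_strips):
    """Lookup tables x[0..n], y[0..n]; x[1] = r, x[n] = 0, y[n] = f(0) = 1."""
    lo, hi = 0.1, 10.0        # top_height(lo) >= 1 > top_height(hi)
    for _ in range(100):      # bisection for the base-strip edge r
        mid = 0.5 * (lo + hi)
        if top_height(mid, n_strips) >= 1.0:
            lo = mid
        else:
            hi = mid
    r = hi
    area = r * f(r) + tail_area(r)
    x = [area / f(r), r]      # x[0] stretches the base strip to a rectangle
    y = [0.0, f(r)]
    for _ in range(n_strips - 2):
        y.append(y[-1] + area / x[-1])
        x.append(math.sqrt(-2.0 * math.log(y[-1])))
    x.append(0.0)
    y.append(1.0)
    return x, y


def fast_path_probability(x):
    """Probability that a draw takes the cheap path (no exp/log needed)."""
    n = len(x) - 1
    return sum(x[i + 1] / x[i] for i in range(n)) / n


def sample(x, y, rng):
    """Draw one standard normal number using the tables from build_tables."""
    n = len(x) - 1
    r = x[1]
    while True:
        i = rng.randrange(n)                  # select a strip at random
        cand = rng.random() * x[i]            # stretch u by the table value
        sign = -1.0 if rng.random() < 0.5 else 1.0
        if cand < x[i + 1]:                   # central region: cheap path
            return sign * cand
        if i == 0:                            # base strip: sample the tail
            while True:
                a = -math.log(1.0 - rng.random()) / r
                b = -math.log(1.0 - rng.random())
                if 2.0 * b > a * a:
                    return sign * (r + a)
        height = y[i] + rng.random() * (y[i + 1] - y[i])
        if height < f(cand):                  # cap region: test against the curve
            return sign * cand


if __name__ == "__main__":
    for n in (8, 32, 128):
        print(n, "strips, cheap-path probability:",
              fast_path_probability(build_tables(n)[0]))
```

Running the script prints the cheap-path probability for 8, 32, and 128 strips, which makes the runtime/memory trade-off visible: the tables grow linearly with the strip count while the share of expensive draws shrinks.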
14 Pseudo Random Number Generation - Results 1/2 Performance of the Ziggurat Method
GPU architecture | Fermi | Kepler | Maxwell
Model | M2090 | Tesla K40m | GTX 750 Ti
(Further table rows, whose values did not survive the transcription: #Processing elements, Peak performance SP (TFLOPS), Peak performance DP (TFLOPS), Peak memory bandwidth (GByte/s))
[Figure: giga pseudo random numbers per second over the number of strips for Tesla M2090 (Fermi), Tesla K40m (Kepler), and GTX 750 Ti (Maxwell); series labeled local and shared]
15 Pseudo Random Number Generation - Results 2/2 Comparison with other Normal PRNGs [Figure: giga pseudo random numbers per second over the grid configuration for Tesla M2090 (Fermi), Tesla K40m (Kepler), and GTX 750 Ti (Maxwell); legend: Ziggurat, Inverse CDF, Rational Polynomial, curand Wallace XORWOW, MKL on Xeon E v2]
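16 Ornstein-Uhlenbeck process - Link to Prefix Sum 1/2
\[
\begin{aligned}
O_{t+h} &= \mu O_t + \sigma_X n^{(1)} \\
O_{t+2h} &= \mu O_{t+h} + \sigma_X n^{(2)}
          = \mu \left( \mu O_t + \sigma_X n^{(1)} \right) + \sigma_X n^{(2)}
          = \mu^2 O_t + \sigma_X \left( \mu n^{(1)} + n^{(2)} \right) \\
O_{t+3h} &= \mu O_{t+2h} + \sigma_X n^{(3)}
          = \mu \left( \mu O_{t+h} + \sigma_X n^{(2)} \right) + \sigma_X n^{(3)} \\
         &= \mu \left( \mu \left( \mu O_t + \sigma_X n^{(1)} \right) + \sigma_X n^{(2)} \right) + \sigma_X n^{(3)}
          = \mu^3 O_t + \sigma_X \left( \mu^2 n^{(1)} + \mu n^{(2)} + n^{(3)} \right) \\
         &\;\;\vdots \\
O_{t+ih} &= \mu^i O_t + \sigma_X \sum_{k=1}^{i} \mu^{i-k} n^{(k)}
\end{aligned}
\]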
17 Ornstein-Uhlenbeck process - Link to Prefix Sum 2/2 This looks very similar to the prefix sum or scan operation:
\[
O_{t+ih} = \mu^i O_t + \sigma_X \sum_{k=1}^{i} \mu^{i-k} n^{(k)}
\qquad \text{compared with} \qquad
O_{t+ih} = \sum_{k=1}^{i} n^{(k)}
\]
[Figure: prefix sum over the elements x_1, ..., x_15, input row and output row]
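The link between the step-by-step recurrence and the weighted sum can be checked numerically. The following Python sketch evaluates both forms for the same normal random numbers n^(k); the function names and the parameter values for mu and sigma_x are hypothetical and only serve the illustration.

```python
import random


def ou_sequential(o_t, mu, sigma_x, normals):
    """Realize the process step by step: O_{t+h} = mu * O_t + sigma_x * n."""
    path = []
    o = o_t
    for n_k in normals:
        o = mu * o + sigma_x * n_k
        path.append(o)
    return path


def ou_closed_form(o_t, mu, sigma_x, normals):
    """Evaluate O_{t+ih} = mu^i O_t + sigma_x * sum_k mu^(i-k) n^(k) for each i."""
    path = []
    for i in range(1, len(normals) + 1):
        weighted = sum(mu ** (i - k) * normals[k - 1] for k in range(1, i + 1))
        path.append(mu ** i * o_t + sigma_x * weighted)
    return path


if __name__ == "__main__":
    rng = random.Random(42)
    normals = [rng.gauss(0.0, 1.0) for _ in range(8)]
    # mu and sigma_x are placeholder values, not taken from the talk.
    seq = ou_sequential(0.0, 0.9, 0.4, normals)
    closed = ou_closed_form(0.0, 0.9, 0.4, normals)
    print(max(abs(a - b) for a, b in zip(seq, closed)))
```

The sequential form has a loop-carried dependency, while every entry of the closed form is a weighted prefix sum of the inputs, which is what makes a parallel scan applicable.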
19 Ornstein-Uhlenbeck process - Parallel Prefix Sum Up-Sweep
Algorithm 1 Up-sweep phase
for d = 1; d <= log2(n); d++ do
    for i = 0; i < n/2^d; i++ do
        x[(i+1)*2^d - 1] <- x[(i+1)*2^d - 1] + x[(i+1/2)*2^d - 1]
    end for
end for
[Figure: up-sweep tree for the levels d = 0 to d = 4; the edges are weighted by powers of μ]
20 Ornstein-Uhlenbeck process - Parallel Prefix Sum Down-Sweep
Algorithm 2 Down-sweep phase
for d = log2(n) - 1; d > 0; d-- do
    for i = 0; i < n/2^d - 1; i++ do
        x[(i+3/2)*2^d - 1] <- x[(i+1)*2^d - 1] + x[(i+3/2)*2^d - 1]
    end for
end for
[Figure: down-sweep tree for the levels d = 3 to d = 0; the edges are weighted by powers of μ]
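The two phases can be combined into an in-place inclusive scan. The Python sketch below follows the index pattern of the up-sweep and down-sweep loops and, as an assumption drawn from the μ-weighted edges in the figures, multiplies each transferred value by μ raised to the index distance, so that `mu=1.0` gives the plain prefix sum and other values give the weighted sum of the Ornstein-Uhlenbeck process. The loops run sequentially here; on a GPU the iterations of each inner loop are independent and could be assigned to threads. The function names are hypothetical.

```python
def weighted_inclusive_scan(values, mu=1.0):
    """Return y with y[i] = sum over k <= i of mu^(i-k) * values[k]."""
    x = list(values)
    n = len(x)
    levels = n.bit_length() - 1
    if n == 0 or (1 << levels) != n:
        raise ValueError("length must be a power of two")

    # Up-sweep: build partial (weighted) sums at the right end of each block.
    for d in range(1, levels + 1):
        half = 1 << (d - 1)
        weight = mu ** half                  # weight = mu^(index distance)
        for i in range(n >> d):
            right = (i + 1) * (1 << d) - 1
            x[right] += weight * x[right - half]

    # Down-sweep: push finished prefixes into the middle of the next block.
    for d in range(levels - 1, 0, -1):
        half = 1 << (d - 1)
        weight = mu ** half
        for i in range((n >> d) - 1):
            src = (i + 1) * (1 << d) - 1
            x[src + half] += weight * x[src]
    return x


def sequential_scan(values, mu=1.0):
    """Reference implementation: y[i] = mu * y[i-1] + values[i]."""
    out, acc = [], 0.0
    for v in values:
        acc = mu * acc + v
        out.append(acc)
    return out


if __name__ == "__main__":
    print(weighted_inclusive_scan([1, 2, 3, 4, 5, 6, 7, 8]))
    print(weighted_inclusive_scan([1.0] * 8, mu=0.5))
    print(sequential_scan([1.0] * 8, mu=0.5))
```

Comparing `weighted_inclusive_scan` with `sequential_scan` on the same input is a quick way to check the index arithmetic of both phases.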
21 Ornstein-Uhlenbeck process - Results [Figure: giga realizations of OU process over elements per thread for Tesla M2090 (Fermi), Tesla K40m (Kepler), and GTX 750 Ti (Maxwell); series: float and double, each with 2^7, 2^8, 2^9, and 2^10 threads/block]
22 Averaging [Figure: averaging scheme that combines the elements x_i step by step]
23 Averaging - Results [Figure: ratio of maximum bandwidth over threads per block for Tesla M2090 (Fermi), Tesla K40m (Kepler), and GTX 750 Ti (Maxwell); series: float and double, each for single averaging, double averaging, 3-tridiagonal, and 4-tridiagonal]
architecture | Tesla M2090 | Tesla K40m | GTX 750 Ti
ratio peak memory bandwidth | 88.5% | 72.% | 8.%
configuration (threads/block, precision) | 28, double | 2 8, double | 2 6, double
24 Solving one instance of the RODE on a single GPU [Figure: per-kernel results for float and double on Tesla M2090 (Fermi), Tesla K40m (Kepler), and GTX 750 Ti (Maxwell). Kernels: initstatesnormalkernel(), getrandomnumbersnormalkernel(), scanexclusiveoukernel(), scanoufixkernel(), realizeouprocesskernel(), singleaveragekernel(), averagedeulerkernel(). Color-coded groups (purple, blue, green, red): numerical solver, averaging, Ornstein-Uhlenbeck process, pseudo random number generation]
25 Solving several instances of the RODE on multiple GPUs
cluster | JuDGE | Hydra | TSUBAME 2.5
location | FZJ | RZG | GSIC
Interconnect | QDR InfiniBand | FDR InfiniBand | QDR InfiniBand
(The rows GPUs per node and total # of GPUs are incomplete in the transcription; the surviving values for GPUs per node are 2 and 3.)
[Figure: efficiency over the number of GPUs for JuDGE, Hydra, and TSUBAME; series: float and double, each for GPU computations, MPI_Reduce(), and total]
26 Final slide
More informationParallel Simulations of Self-propelled Microorganisms
Parallel Simulations of Self-propelled Microorganisms K. Pickl a,b M. Hofmann c T. Preclik a H. Köstler a A.-S. Smith b,d U. Rüde a,b ParCo 2013, Munich a Lehrstuhl für Informatik 10 (Systemsimulation),
More informationClaude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique
Claude Tadonki MINES ParisTech PSL Research University Centre de Recherche Informatique claude.tadonki@mines-paristech.fr Monthly CRI Seminar MINES ParisTech - CRI June 06, 2016, Fontainebleau (France)
More informationA Stochastic-based Optimized Schwarz Method for the Gravimetry Equations on GPU Clusters
A Stochastic-based Optimized Schwarz Method for the Gravimetry Equations on GPU Clusters Abal-Kassim Cheik Ahamed and Frédéric Magoulès Introduction By giving another way to see beneath the Earth, gravimetry
More informationHybrid parallelization of a pseudo-spectral DNS code and its computational performance on RZG s idataplex system Hydra
Hybrid parallelization of a pseudo-spectral DNS code and its computational performance on RZG s idataplex system Hydra Markus Rampp 1, Liang Shi 2, Marc Avila 3,2, Björn Hof 2,4 1 Computing Center of the
More informationLevel-3 BLAS on a GPU
Level-3 BLAS on a GPU Picking the Low Hanging Fruit Francisco Igual 1 Gregorio Quintana-Ortí 1 Robert A. van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores. University Jaume I. Castellón
More informationExplore Computational Power of GPU in Electromagnetics and Micromagnetics
Explore Computational Power of GPU in Electromagnetics and Micromagnetics Presenter: Sidi Fu, PhD candidate, UC San Diego Advisor: Prof. Vitaliy Lomakin Center of Magnetic Recording Research, Department
More informationCalculation of ground states of few-body nuclei using NVIDIA CUDA technology
Calculation of ground states of few-body nuclei using NVIDIA CUDA technology M. A. Naumenko 1,a, V. V. Samarin 1, 1 Flerov Laboratory of Nuclear Reactions, Joint Institute for Nuclear Research, 6 Joliot-Curie
More informationChile / Dirección Meteorológica de Chile (Chilean Weather Service)
JOINT WMO TECHNICAL PROGRESS REPORT ON THE GLOBAL DATA PROCESSING AND FORECASTING SYSTEM AND NUMERICAL WEATHER PREDICTION RESEARCH ACTIVITIES FOR 2015 Chile / Dirección Meteorológica de Chile (Chilean
More informationMultivariate Gaussian Random Number Generator Targeting Specific Resource Utilization in an FPGA
Multivariate Gaussian Random Number Generator Targeting Specific Resource Utilization in an FPGA Chalermpol Saiprasert, Christos-Savvas Bouganis and George A. Constantinides Department of Electrical &
More informationAccelerating interior point methods with GPUs for smart grid systems
Downloaded from orbit.dtu.dk on: Dec 18, 2017 Accelerating interior point methods with GPUs for smart grid systems Gade-Nielsen, Nicolai Fog Publication date: 2011 Document Version Publisher's PDF, also
More informationEfficient algorithms for symmetric tensor contractions
Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to
More information