GPU Computing Activities in KISTI


International Advanced Research Workshop on High Performance Computing, Grids and Clouds 2010
June 21~25, 2010, Cetraro, Italy

HPC Infrastructure and GPU Computing Activities in KISTI
Hongsuk Yi (hsyi@kisti.re.kr)

Outline
- HPC infrastructure and activities in KISTI
- Heterogeneous computing with GPU
- What is the scalability?
- Heterogeneous computing with MPI+CUDA

Where is KISTI?
KISTI is responsible for the national cyber-infrastructure of Korea; its mission is to enable discovery through national projects:
- Grid Project (2002~2009): K*Grid (not covered today)
- HPC Infrastructure & e-Science Project (2005~): multi-GPU programming
- Scientific Cloud Computing (2009~)
- Research Network Project (2005~)
Roles of the center:
- Cyber-infrastructure environment: keep securing and providing world-class supercomputing systems (computing/network resources); help Korean research communities acquire proper knowledge of cyber-infrastructure
- Testbed center: validate newly emerging concepts, ideas, tools, and systems (K*Grid, Scientific Cloud)
- Value-adding center: make the best use of what the center has to create new value

History of Supercomputers in Korea
It is time for a new multi-petaflops system in Korea.
[Timeline figure: peak performance of Korean supercomputers, e.g. 2 GF (1988), 16 GF (1993), Cray T3E 130 GF (1997), Ham. 435 GF (2001), Nobel 4 TF (2002), NEC SX-5 (2002) and SX-6 (2003), IBM 30 TF, SUN 307 TF (2010); the axis spans 1970~2030, rising from mega- through giga-, tera-, and peta- toward exascale.]

HPC Act of Korea
The HPC Act effort started in 2004; the act is currently awaiting approval by the National Assembly.
Purpose: to provide for a well-coordinated national program that ensures Korea's continued role in HPC and its applications by
- improving the coordination of supercomputing resources, and
- maximizing the effectiveness of Korea's research networks (KREONET).
We can contribute more by expanding support for national-agenda research programs in computational science, development of the cyber-infrastructure environment, as well as applications of extreme-scale computation.

NCRC: National Core Research Center for Computational Science and Technology
Budget ~$10M, starting Sep. 2010 (not yet completely determined).
Application domains:
- Energy transformation by quantum simulation
- Migration of airborne pollution, including yellow sand
- New materials for energy

Supercomputing Resource (Tachyon)
Tachyon-II is ranked 15th in the Top500 (June 2010): Sun Blade x6048, ~26,232 Intel Nehalem cores, ~157 TB memory, peak ~307 TFlops (sustained ~274 TFlops).
- Provides about 30% of the whole computing capacity for public research in Korea
- Users from 200 institutes in Korea; utilization 70~80%
- Little room left for large-scale grand-challenge problems
[Chart: user distribution by field (Chemistry, Earth/Weather, Mechanics, Physics, Electronics, Others), 2003~2006.]

User Support and Applications
We optimize and parallelize users' codes; performance improvements range from 5x to more than 1,000x.

Year             2004  2005  2006  2007  2008  2009
Optimization        6    12    13    15    20    20
Parallelization     6     8    11    15    20    25

Users' applications through grand-challenge problems:
- Magnetic control of edge spins, K. S. Kim, Nature Nanotech. 3, 408 (2008)
- Hydrogen storage materials, J. Ihm, Phys. Rev. Lett. 97, 056104 (2006)

Supercomputing Resource (GAIA)
GAIA-II is an SMP cluster, ranked 393rd in the Top500 (Nov. 2009): IBM POWER6 5.0 GHz (Power 595), ~1,536 cores (64 cores/node), Rpeak ~30.7 TFlops (sustained ~23.3 TFlops), 8.7 TB memory.

Hybrid Programming: Multi-Zone NPB
- GAIA-2 (IBM POWER6 5 GHz), big memory: ~16 GB/core
- Bin-packing algorithm: the sizes of the zones vary by a factor of ~20
- BT-MZ with class F: memory required ~5 TB
- MPI+OpenMP programming: performance ~4.5 TFlops (15% of peak); see the sketch below
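To make the MPI+OpenMP model concrete, here is a minimal, hypothetical sketch in the spirit of BT-MZ (not the NPB source): coarse-grained MPI parallelism over zones, fine-grained OpenMP parallelism inside each zone. NZONES, ZONE_SIZE, and update_zone() are illustrative placeholders; the bin-packing assignment is not shown.

/* Hypothetical MPI+OpenMP hybrid sketch in the spirit of BT-MZ. */
#include <mpi.h>
#include <omp.h>

#define NZONES 64          /* illustrative zone count */
#define ZONE_SIZE 4096     /* illustrative points per zone */

static double zone_data[NZONES][ZONE_SIZE];

static void update_zone(int z)   /* placeholder for the real solver step */
{
    #pragma omp parallel for     /* fine-grained parallelism inside a zone */
    for (int i = 0; i < ZONE_SIZE; i++)
        zone_data[z][i] += 1.0;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* coarse-grained parallelism: each rank owns a subset of the zones
     * (a bin-packing assignment would balance the uneven zone sizes) */
    for (int z = rank; z < NZONES; z += size)
        update_zone(z);

    /* zone-boundary exchange via MPI point-to-point would go here */
    MPI_Finalize();
    return 0;
}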

GPU Computing for Visualization
All of KISTI's visualization systems have a direct connection to GLORIAD, whose bandwidth is 10 Gbps.

Visualization computer:
- Total number of nodes: ~150
- CPU cores: 800+
- Total memory: 3.5+ TB
- GPU: NVIDIA Quadro FX 5600, 96+ GPUs
- Network: interconnection 160+ Gbps, external network 20 Gbps

GPU Computing Activities
KSCSE (Korea Society for Computational Sciences and Engineering):
- Established as a new computing society in 2009
- Supports GPU computing forums and workshops (200 participants, two days, May 2010, Seoul)
- Open to international collaboration on extreme-scale computing

Heterogeneous Computing Testbed
A heterogeneous computing system is a system that uses a variety of different types of computational units; a computational unit could be a GPU, a co-processor, or an FPGA.
KISTI heterogeneous GPU testbed: NVIDIA 2x S1070 (8 GPUs), D870, GTX280; PCI-e 16x.

Performance Benefit of GPU Computing
The benefit depends on the ratio of operations to elements transferred:
- Matrix-matrix multiplication C = A x B: operations ~ N^3, transfer ~ 3N^2, so the ratio scales as O(N)
- Matrix-matrix addition: operations ~ N^2, transfer ~ 3N^2, so the ratio scales as O(1)
On an NVIDIA GTX280 (peak 933 GFlops), matrix-matrix multiplication reaches ~40% efficiency, a ~6x speedup.
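To illustrate why multiplication benefits from the GPU while addition does not, here is a naive CUDA matrix-multiply kernel (an illustration, not the tuned code behind the 40% figure); N and the launch configuration are placeholders.

/* Naive CUDA matrix multiply C = A x B for N x N row-major matrices.
 * Each of the ~3N^2 transferred elements takes part in O(N)
 * multiply-adds, which is what makes the transfer worthwhile. */
__global__ void matmul(const float *A, const float *B, float *C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++)      /* ~N flops per output element */
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

/* Illustrative launch for N = 1024:
 *   dim3 block(16, 16), grid((N + 15) / 16, (N + 15) / 16);
 *   matmul<<<grid, block>>>(dA, dB, dC, N);
 */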

Image Compression Using SVD
The SVD is an important matrix factorization with many applications in signal processing and statistics; here it is computed on the GPU with culaSgesvd() from CULA.
Test image: RGB full color, 2048x2048 (4,194,304 pixels), original size 12,288 KB.

Rank   1:    12 KB
Rank  10:   120 KB
Rank  50:   600 KB
Rank  80:   960 KB
Rank 100: 1,200 KB
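A minimal sketch of the GPU SVD call, assuming CULA's LAPACK-style culaSgesvd() interface; the buffer handling is simplified, and the rank-k storage comment is my own arithmetic rather than code from the slides.

/* Sketch: single-precision SVD of an m x n matrix (one image channel)
 * on the GPU via CULA. Error handling is omitted for brevity. */
#include <cula.h>      /* CULA Dense header (assumed) */
#include <stdlib.h>

void svd_channel(float *a, int m, int n)
{
    int mn = m < n ? m : n;
    float *s  = malloc(sizeof(float) * mn);      /* singular values */
    float *u  = malloc(sizeof(float) * m * m);
    float *vt = malloc(sizeof(float) * n * n);

    culaInitialize();
    /* 'A','A': compute all of U and V^T (column-major, LAPACK style) */
    culaSgesvd('A', 'A', m, n, a, m, s, u, m, vt, n);
    culaShutdown();

    /* A rank-k approximation keeps only the k largest singular triplets,
     * cutting storage from m*n to about k*(m+n) values per channel: for
     * a 2048x2048 image at rank 10 that is a ~100x reduction, matching
     * the 12,288 KB -> 120 KB entry in the table above. */
    free(s); free(u); free(vt);
}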

3D FDTD on GPU
Finite-Difference Time-Domain (by Q. Park and K. Kim, Korea Univ.): time-domain simulation uses FDTD, while frequency-domain simulation uses FEM, BEM, etc. Both space and time are divided into discrete grids, and the fields are advanced with Maxwell's curl equations:

\frac{\partial \mathbf{H}}{\partial t} = -\frac{1}{\mu}\,\nabla \times \mathbf{E},
\qquad
\frac{\partial \mathbf{E}}{\partial t} = \frac{1}{\varepsilon}\left(\nabla \times \mathbf{H} - \mathbf{J}\right)

[Figure: Yee cell with E components (Ex, Ey, Ez) on edges and H components (Hx, Hy, Hz) on faces between grid points (i,j,k) and (i+1,j+1,k+1).]
3D FDTD benchmark: grid ~300x300x240, memory ~988 MB.
[Chart: performance comparison across Intel (optimized), GeForce 9800, GTX280, GTX480, C1060.]
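As an illustration of one Yee-grid update step on the GPU, here is a minimal, hypothetical CUDA kernel for the Ez component (not the benchmark code by Q. Park and K. Kim); the grid dimensions and coefficients are placeholders.

/* Hypothetical 3D FDTD Ez update on a Yee grid (one of six such kernels).
 * NX, NY, NZ and the update coefficients are illustrative placeholders. */
#define NX 300
#define NY 300
#define NZ 240
#define IDX(i, j, k) (((i) * NY + (j)) * NZ + (k))

__global__ void update_ez(float *Ez, const float *Hx, const float *Hy,
                          float dt_over_eps, float inv_d)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < 1 || j < 1 || i >= NX || j >= NY || k >= NZ) return;

    /* discrete curl: (dHy/dx - dHx/dy) at the Ez location */
    float curl = (Hy[IDX(i, j, k)] - Hy[IDX(i - 1, j, k)]
                - Hx[IDX(i, j, k)] + Hx[IDX(i, j - 1, k)]) * inv_d;
    Ez[IDX(i, j, k)] += dt_over_eps * curl;   /* dE/dt = (1/eps) curl H */
}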

RNG and Monte Carlo Algorithm: Ising Model
Model and probability weight (L spins, s_i = ±1):

E = -J \sum_{\langle i,j \rangle} s_i s_j - H \sum_{i=1}^{L} s_i,
\qquad
p = \frac{1}{Z(T)} \exp\left[-E / k_B T\right]

1,000 MC steps with 512 threads per block, independent of the number of blocks (tx=512, ty=1, by=1; bx varies). The warp size is ~32, and the number of thread blocks is ~30.

Performance of Ising Model on a GPU
- Use the checkerboard decomposition to avoid read/write conflicts on the spin field and on the seed values of the RNGs
- Divide the spins on the lattice into blocks on the GPU (example: a 20x16 lattice); a minimal update kernel is sketched below
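To make the checkerboard idea concrete, here is a minimal, hypothetical CUDA sketch of one Metropolis sub-sweep over a single color (not the benchmark code): the lattice size L, the per-site LCG random numbers, and the couplings are illustrative placeholders.

/* Hypothetical checkerboard Metropolis sub-sweep for the 2D Ising model.
 * Sites of one color have neighbors only of the other color, so all
 * threads in this kernel can update concurrently without conflicts. */
#define L 1024                       /* illustrative lattice dimension */

__device__ float lcg_uniform(unsigned int *state)
{
    *state = *state * 1664525u + 1013904223u;   /* simple per-site LCG */
    return (*state >> 8) * (1.0f / 16777216.0f);
}

__global__ void sweep_color(int *spin, unsigned int *rng, float beta, int color)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= L || y >= L || ((x + y) & 1) != color) return;

    int idx = y * L + x;
    int left  = spin[y * L + (x + L - 1) % L];   /* periodic boundaries */
    int right = spin[y * L + (x + 1) % L];
    int up    = spin[((y + L - 1) % L) * L + x];
    int down  = spin[((y + 1) % L) * L + x];

    /* energy change of flipping s -> -s: dE = 2 J s (sum of neighbors) */
    int dE = 2 * spin[idx] * (left + right + up + down);
    unsigned int st = rng[idx];
    if (dE <= 0 || lcg_uniform(&st) < expf(-beta * dE))
        spin[idx] = -spin[idx];
    rng[idx] = st;
}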

MPI Virtual Topology + CUDA Model
- The virtual-topology benchmark is a weak-scaling problem: the problem size grows in direct proportion to the number of cores
- Periodic boundary conditions are handled with MPI_Sendrecv(); see the sketch below
- Hardware: Intel Nehalem (8 cores) + 8 NVIDIA Tesla C1060 GPUs over PCI-E cables
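A minimal sketch of the MPI virtual topology with periodic boundaries, as described above; the 2D decomposition and the halo payload size are illustrative.

/* Sketch: periodic 2D Cartesian topology with halo exchange via
 * MPI_Sendrecv(). The halo length N is an illustrative placeholder. */
#include <mpi.h>

#define N 1024   /* illustrative halo length */

int main(int argc, char **argv)
{
    int dims[2] = {0, 0}, periods[2] = {1, 1};  /* periodic boundaries */
    int size, left, right;
    double send[N], recv[N];
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Dims_create(size, 2, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    for (int i = 0; i < N; i++) send[i] = 0.0;  /* placeholder payload */

    /* neighbors in dimension 0; the PBC wraps the edges automatically */
    MPI_Cart_shift(cart, 0, 1, &left, &right);
    MPI_Sendrecv(send, N, MPI_DOUBLE, right, 0,
                 recv, N, MPI_DOUBLE, left,  0, cart, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}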

Heterogeneous Communication Pattern

         PCIe            InfiniBand           PCIe
  GPU0 --------> CPU1 ---------------> CPU2 --------> GPU2

There is an extra cudaMemcpy() involved in message passing.
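The staging pattern above as a minimal sketch, assuming one GPU per MPI rank and a CUDA-unaware MPI (as was typical in 2010); buffer sizes and the peer rank are illustrative.

/* Sketch: GPU-to-GPU message passing staged through host memory.
 * Each MPI rank stages its device buffer to the host, exchanges it
 * with a peer over InfiniBand, and uploads the result back. */
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_halo(float *d_buf, float *h_buf, int n, int peer, MPI_Comm comm)
{
    /* extra copy #1: device -> host over PCIe */
    cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);

    /* host <-> host over the interconnect */
    MPI_Sendrecv_replace(h_buf, n, MPI_FLOAT, peer, 0, peer, 0,
                         comm, MPI_STATUS_IGNORE);

    /* extra copy #2: host -> device over PCIe */
    cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
}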

GPU Scaling Issues
Achieving good scaling is more difficult with GPUs: the kernels are much faster, so the MPI communication becomes a larger fraction of the overall execution time. An Amdahl-style estimate of this effect is given below.
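The effect can be quantified with a simple Amdahl-style estimate (my notation, not from the slides). If a fraction f of the pure-MPI run time is communication and the GPU accelerates only the compute part by a factor s, the overall speedup is

S(s, f) = \frac{1}{(1 - f)/s + f}

For example, with f = 0.1 and a kernel speedup of s = 20, the overall speedup is only S ≈ 6.9, and communication now accounts for about 69% of the run time.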

Summary
- The HPC Act of Korea is in progress; the act is awaiting approval by the National Assembly
- It is time for a heterogeneous petaflops system in Korea, which requires weighing many factors: power, space, and users' ability to port their codes
- In the MPI+CUDA model, achieving good scaling is more difficult than with pure MPI, since the kernels are much faster on the GPU
- There is additional communication overhead between the CPU and the GPUs

Q&A. Thank you!