A CPU-GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method


A CPU-GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method. Jee Choi 1, Aparna Chandramowlishwaran 3, Kamesh Madduri 4, and Richard Vuduc 2. 1 ECE, Georgia Tech; 2 CSE, Georgia Tech; 3 CSAIL, MIT; 4 CSE, PSU. March 1, 2014. Presented at GPGPU7, Salt Lake City, Utah.

Why? Importance: one of the most important algorithms in scientific computing. Performance: the various phases of the Fast Multipole Method show different performance characteristics. Power and energy: everyone has a strong suit. Just because we can: CPU(s) come bundled with GPU(s) (or is it vice versa?).

Contributions: Optimized implementations of FMM for both CPUs and GPUs. An analytical performance model. A CPU-GPU hybrid implementation of FMM, which uses our analytical performance model to automatically control various FMM-specific tuning knobs and to map phases to platforms.

Contributions [Figure: the two test point distributions, uniform and elliptical, plotted on x, y, z axes over [-1.0, 1.0]]

Summary of Results [Figure: wall-clock time versus accuracy (3-7 digits) for the uniform and elliptical distributions; measured and modeled curves for CPU, GPU, and the best hybrid]

Limitations: The analytical performance model is limited to a uniform distribution of points; an elliptical distribution is more difficult to model. The model was driven by hand, and hybrid scheduling is done by hand; there is no scheduler implementation.

Overview Algorithmic characteristics GPU performance model Implementation Hybrid scheduling Exascale projections

The problem: Given a system of N source points with positions {y_1, ..., y_N} and N target points {x_1, ..., x_N}, we want to compute the N target sums

f(x_i) = \sum_{j=1}^{N} K(x_i, y_j) \, s(y_j), \quad i = 1, \ldots, N
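
As a point of reference, here is a minimal sketch of the direct O(N^2) evaluation that FMM approximates. The choice of the single-layer Laplace kernel K(x, y) = 1/(4π|x − y|) is an assumption made for illustration; the slides do not fix a kernel.

```python
import numpy as np

def direct_sum(targets, sources, densities):
    """Direct O(N^2) evaluation of f(x_i) = sum_j K(x_i, y_j) s(y_j).

    Assumes the single-layer Laplace kernel K(x, y) = 1 / (4*pi*|x - y|);
    coincident points (r == 0) are skipped.
    """
    f = np.zeros(len(targets))
    for i, x in enumerate(targets):
        r = np.linalg.norm(sources - x, axis=1)   # distances |x_i - y_j|
        mask = r > 0                              # skip self-interactions
        f[i] = np.sum(densities[mask] / (4.0 * np.pi * r[mask]))
    return f

# Example: N = 1000 random sources and targets in the unit cube
rng = np.random.default_rng(0)
y, x, s = rng.random((1000, 3)), rng.random((1000, 3)), rng.random(1000)
f = direct_sum(x, y, s)
```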

Direct vs. tree-based: Direct evaluation: O(N^2). Barnes-Hut: O(N log N). Fast Multipole Method (FMM): O(N).

Fast Multipole Method (FMM). Tree construction: recursively divide space until each box has at most q points. Evaluation (uniform): Upward, U-List, V-List, Downward. The phases vary in data parallelism and compute intensity.
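
A minimal sketch of the tree-construction step described above: recursively split a cubic box into eight children until no leaf holds more than q points. The dictionary layout and function name are illustrative, not the kifmm implementation.

```python
import numpy as np

def build_octree(points, center, half_width, q, depth=0, max_depth=20):
    """Recursively split a box into 8 children until each leaf has <= q points."""
    node = {"center": center, "half_width": half_width, "depth": depth}
    if len(points) <= q or depth == max_depth:
        node["points"] = points                      # leaf box
        return node
    # Assign each point to one of 8 octants by comparing against the box center.
    codes = ((points[:, 0] > center[0]).astype(int)
             | ((points[:, 1] > center[1]).astype(int) << 1)
             | ((points[:, 2] > center[2]).astype(int) << 2))
    children = []
    for octant in range(8):
        sel = codes == octant
        if not sel.any():
            continue                                 # adaptive: skip empty children
        sign = np.array([1 if (octant >> d) & 1 else -1 for d in range(3)])
        child_center = center + sign * (half_width / 2.0)
        children.append(build_octree(points[sel], child_center,
                                     half_width / 2.0, q, depth + 1, max_depth))
    node["children"] = children
    return node

# Example: 10,000 uniform points in [0, 1]^3 with at most q = 64 points per leaf
pts = np.random.default_rng(0).random((10000, 3))
tree = build_octree(pts, center=np.array([0.5, 0.5, 0.5]), half_width=0.5, q=64)
```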

U-List: direct evaluation between a leaf box B and the boxes in its U-list: O(q^2) flops, O(q) mops.

V-List: 3-D FFT, point-wise multiplication, 3-D IFFT.
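
A minimal sketch of the "3-D FFT, point-wise multiply, 3-D IFFT" structure of one V-list (multipole-to-local) translation, written as an FFT convolution of two cubic grids. The grids and the function name are illustrative stand-ins, not kifmm's actual data layout.

```python
import numpy as np

def vlist_translate(source_grid, translation_kernel_grid):
    """Apply one V-list translation as an FFT-based convolution.

    Both arguments are 3-D grids of equal shape; the kernel grid plays the role
    of one precomputed translation operator.
    """
    S = np.fft.fftn(source_grid)            # 3-D FFT of the source expansion
    K = np.fft.fftn(translation_kernel_grid)
    return np.real(np.fft.ifftn(S * K))     # point-wise multiply, then 3-D IFFT

# Example with an 8^3 grid
g = np.random.rand(8, 8, 8)
k = np.random.rand(8, 8, 8)
local = vlist_translate(g, k)
```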

Overview Algorithmic characteristics GPU performance model Implementation Hybrid scheduling Exascale projections

Machine Model

CPU Performance Model

U-List:
T_{comp,U} = C_u (3 b^{1/3} - 2)^3 q^2 / C_0
T_{mem,U} = C_1 n / β_{mem} + C_2 n L / (β_{mem} Z^{1/3} q^{2/3})

V-List:
T_{comp,V} = C_v k b p^{3/2} / C_0
T_{mem,V} = C_1 n p^{3/2} / (q β_{mem}) + C_2 n p^{3/2} L / (q β_{mem} Z^{1/2})

GPU Performance Model

U-List:
T_{U,gpu} = C_{u,gpu} (3 b^{1/3} - 2)^3 q^2 / C_{peak,gpu}

V-List (carrying the CPU cache-aware expression over with GPU constants):
T_{V,gpu} = C_{1,gpu} n p^{3/2} / (q β_{mem,gpu}) + C_{2,gpu} n p^{3/2} L / (q β_{mem,gpu} Z^{1/2})

Why doesn't this work for the V-list? The small LLC on GPUs can only fit ~50 translation vectors, so the V-list step streams its operands from memory and is modeled as bandwidth-bound instead:

T_{V,gpu} = C_{v,gpu} · 189 · 3 b p^{3/2} / β_{mem,gpu}
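
A minimal sketch of how the U-list terms of this model might drive a phase-placement decision. The CPU peak (73 GFlop/s) and the GPU values (C_{u,gpu} = 1.56, C_{peak,gpu} = 174.3 GFLOP/s) come from the slides below; every other number is an illustrative placeholder, and units are assumed to be folded into the fitted constants.

```python
# Evaluate the U-list model on CPU and GPU and pick the faster device.
# Placeholder constants throughout; not the paper's calibrated values.

def t_ulist_cpu(n, q, b, C_u=1.0, C_1=1.0, C_2=1.0,
                C0=73e9, beta_mem=25e9, Z=1.5e6, L=8):
    t_comp = C_u * (3.0 * b ** (1.0 / 3.0) - 2.0) ** 3 * q ** 2 / C0
    t_mem = (C_1 * n / beta_mem
             + C_2 * n * L / (beta_mem * Z ** (1.0 / 3.0) * q ** (2.0 / 3.0)))
    return max(t_comp, t_mem)     # assume compute and memory traffic overlap

def t_ulist_gpu(q, b, C_u_gpu=1.56, C_peak_gpu=174.3e9):
    return C_u_gpu * (3.0 * b ** (1.0 / 3.0) - 2.0) ** 3 * q ** 2 / C_peak_gpu

n, q = 4_000_000, 256
b = n // q                        # leaf boxes, assuming a uniform distribution
better = "GPU" if t_ulist_gpu(q, b) < t_ulist_cpu(n, q, b) else "CPU"
print(better)
```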

GPU Performance Model: Upward and Downward

T_{up,gpu} = C_{up,gpu} (4N + 2 b f_1(p) (f_2(p) + 1)) / β_{mem,gpu}
T_{down,gpu} = C_{down,gpu} (N + 2 b (f_1(p))^2 + 2 b f_1(p)) / β_{mem,gpu}

GPU Performance Model. Real peak memory throughput: measured with an optimized streaming μbenchmark; relatively close to the specification (80-90%). Real peak compute throughput: the specified peak is misleading, since it requires a fused multiply-add (FMA) to be issued by every scheduler at every cycle, and there is no hardware SFU for double precision (e.g., reciprocal, square root, etc.).

GPU Performance Model: U-list. The inner loop executes (in double precision): 3 subtracts, 1 add, 1 multiply, 2 multiply-adds, and 1 reciprocal square root. How expensive is it? A μbenchmarking study shows that a double-precision reciprocal square root has ~14-cycle latency, or equivalently costs ~14 independent instructions, so it takes 14 + 7 = 21 instructions to execute 11 FLOPs. The expected U-list computational throughput is therefore

C_{peak,gpu} = (11 FLOPs / 21 instructions) × freq × #proc
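
As a worked instance of this formula, the short calculation below reproduces the Tesla M2090 entry in the constant table that follows, assuming "#proc" means the number of double-precision units (on Fermi, half of the 512 CUDA cores); that interpretation is ours, but it matches the tabulated 174.3 GFLOP/s.

```python
# Effective U-list peak: 11 FLOPs delivered per 21 issued instructions.
flops_per_iter = 11
instrs_per_iter = 21           # 14 for the rsqrt + 7 arithmetic instructions
dp_units = 512 // 2            # assumption: DP throughput is half the CUDA-core count
freq_hz = 1.3e9                # Tesla M2090 clock from the platform slide

c_peak_gpu = flops_per_iter / instrs_per_iter * dp_units * freq_hz
print(f"{c_peak_gpu / 1e9:.1f} GFLOP/s")   # ~174.3, matching the table below
```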

Overview Algorithmic characteristics GPU performance model Implementation Hybrid scheduling Exascale projections

Platform 1: Jinx @ GT. CPU: Intel Xeon X5650 (Westmere), 2 CPUs/node, 6 cores each, running at 2.66 GHz (3.06 GHz Turbo Boost), 147 (SP) / 73 (DP) Gflop/s. GPU: Tesla M2090 (Fermi), 2 GPUs/node, 512 CUDA cores / 16 SMs, running at 1.3 GHz, 1331 (SP) / 665 (DP) Gflop/s.

Platform 2: Condesa @ HPC Garage. CPU: Intel Xeon E5-2603 (Sandy Bridge), 2 CPUs/node, 4 cores each, running at 1.8 GHz (no Turbo Boost), 58 (SP) / 29 (DP) Gflop/s. GPU: GTX Titan (Kepler), 1 GPU/node, 2688 CUDA cores / 14 SMXs, running at 837 MHz, 4500 (SP) / 1500 (DP) Gflop/s.

GPU Constant Derivation

                          Tesla M2090    GTX Titan
C_{peak,gpu} (GFLOP/s)    174.3          392.9
β_{mem,gpu} (GB/s)        129.4          237.2
C_{up,gpu}                2.99           4.16
C_{u,gpu}                 1.56           2.09
C_{v,gpu}                 0.95           1.4
C_{down,gpu}              7.61           6.83

We want constants that are close to 1 (a better implementation). More complicated kernels (upward, downward) are more difficult to model and consequently have higher constants. Constant values of less than 1 indicate better-than-modeled performance (e.g., due to better than expected caching).
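
One way such a constant could be derived is as the ratio of a measured phase time to the time the model predicts with the constant set to 1; a minimal sketch of that idea follows. The measured value below is a fabricated placeholder used only to show the arithmetic, not the paper's data.

```python
# Sketch: derive a model constant as measured_time / modeled_time (with C = 1).
# 'measured_u_gpu' is a hypothetical number, not a reported measurement.

def modeled_u_gpu(q, b, c_peak_gpu):
    return (3.0 * b ** (1.0 / 3.0) - 2.0) ** 3 * q ** 2 / c_peak_gpu  # C_u,gpu = 1

measured_u_gpu = 0.42          # hypothetical seconds for the U-list phase
b, q = 32 ** 3, 256
C_u_gpu = measured_u_gpu / modeled_u_gpu(q, b, 174.3e9)
print(f"C_u,gpu ≈ {C_u_gpu:.2f}")
```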

Overview Algorithmic characteristics GPU performance model Implementation Hybrid scheduling Exascale projections

FMM Directed Acyclic Graph [Diagram: the FMM phase DAG (Up at the leaves, Up at non-leaf levels, U-, V-, W-, and X-list translations, Down at non-leaf levels, Down at the leaves) and candidate schedules (GPU-only, CPU-only, Hybrid1, Hybrid2, plus a hybrid for the elliptical distribution) that map each phase to the CPU or the GPU, with a synchronize + memcpy at every device boundary]
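
A minimal sketch of model-driven phase-to-device assignment over a simplified version of this DAG: the upward pass runs first, the four list phases are treated as independent and may run concurrently on different devices, and the downward pass runs last, with a fixed synchronize + memcpy charge between stages. The per-phase times and the sync cost are placeholders, and this brute-force enumeration is an illustration rather than the paper's (hand-driven) scheduling procedure.

```python
from itertools import product

# Predicted per-phase times on each device (hypothetical placeholder seconds).
phase_times = {
    "up":     {"CPU": 0.8, "GPU": 0.3},
    "u_list": {"CPU": 2.5, "GPU": 0.9},
    "v_list": {"CPU": 1.1, "GPU": 1.8},
    "w_list": {"CPU": 0.6, "GPU": 0.5},
    "x_list": {"CPU": 0.6, "GPU": 0.5},
    "down":   {"CPU": 0.7, "GPU": 0.4},
}
LISTS = ["u_list", "v_list", "w_list", "x_list"]   # independent once 'up' is done
SYNC = 0.05   # hypothetical synchronize + memcpy cost between stages

def makespan(up_dev, list_assign, down_dev):
    """Stages: up -> (list phases in parallel across devices) -> down."""
    t_up = phase_times["up"][up_dev]
    cpu = sum(phase_times[p]["CPU"] for p, d in zip(LISTS, list_assign) if d == "CPU")
    gpu = sum(phase_times[p]["GPU"] for p, d in zip(LISTS, list_assign) if d == "GPU")
    t_down = phase_times["down"][down_dev]
    return t_up + SYNC + max(cpu, gpu) + SYNC + t_down

best = min(
    ((u, la, d) for u in ("CPU", "GPU")
                for la in product(("CPU", "GPU"), repeat=len(LISTS))
                for d in ("CPU", "GPU")),
    key=lambda s: makespan(*s),
)
print(dict(zip(LISTS, best[1])), best[0], best[2], makespan(*best))
```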

FMM Performance and Model Accuracy [Figure: wall-clock time versus accuracy (3-7 digits) for the uniform and elliptical distributions; measured and modeled curves for CPU, GPU, and the best hybrid]

Model Error

              Model median error
Tesla M2090   7.5 %
GTX Titan     6.9 %
X5650         2.2 %
E5-2603       2.0 %
Hybrid1       8.6 %
Hybrid2       7.1 %
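
The metric reported here can be computed as the median relative difference between measured and modeled times over a set of configurations; a small sketch follows, with made-up arrays standing in for the actual measurements.

```python
import numpy as np

# Hypothetical measured vs. modeled wall-clock times for several configurations.
measured = np.array([4.1, 5.6, 2.3, 7.9, 3.4])
modeled  = np.array([4.4, 5.2, 2.5, 7.5, 3.3])

median_error = np.median(np.abs(measured - modeled) / measured) * 100
print(f"model median error: {median_error:.1f} %")
```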

FMM Performance Breakdown [Figure: stacked time breakdown in seconds (Upward, U-list, V-list, W-list, X-list, Downward) on the GPU and the CPU, for the uniform and elliptical distributions]

Overview Algorithmic characteristics GPU performance model Implementation Hybrid scheduling Exascale projections

Exascale Projection. How will FMM scale in the future? [Figure: projected fraction of time spent in T_comp versus T_mem from 2010 to 2025, for three projected system configurations] FMM may become bandwidth-bound: no more scaling! Better system balance is required, with implications for power and energy allocation.
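
A minimal sketch of how such a projection could be produced: scale compute throughput and memory bandwidth forward with assumed annual growth rates and track what fraction of modeled time is compute versus memory. The 2010 bandwidth baseline, the growth rates, and the per-phase work figures below are illustrative assumptions, not the paper's projection inputs; only the 665 GFLOP/s baseline comes from the platform slide.

```python
# Project the compute-bound vs. bandwidth-bound split of a fixed FMM phase.
flops_2010 = 665e9        # DP peak of a 2010-era GPU (Tesla M2090 slide value)
bw_2010 = 177e9           # B/s, placeholder memory-bandwidth baseline
flops_growth = 1.5        # assumed compute growth per year
bw_growth = 1.2           # assumed (slower) bandwidth growth per year

work_flops = 5e13         # flops of a fixed FMM phase (placeholder)
work_bytes = 2e12         # bytes moved by the same phase (placeholder)

for year in range(2010, 2026, 5):
    k = year - 2010
    t_comp = work_flops / (flops_2010 * flops_growth ** k)
    t_mem = work_bytes / (bw_2010 * bw_growth ** k)
    total = t_comp + t_mem
    print(year, f"T_comp {100 * t_comp / total:4.1f}%",
          f"T_mem {100 * t_mem / total:4.1f}%")
```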

Conclusions: An optimized implementation of FMM on CPU and GPU. An analytical performance model that could be used to schedule FMM efficiently on hybrid systems. An exascale projection. There is a need for a similar model for an elliptical distribution of points.

Future Work: Analytical models for the W-list and X-list for the elliptical distribution. Power and energy modeling (a roofline model of energy). Support for the Xeon Phi accelerator. FMM for ARM?

Relevant Links. Source code: http://j.mp/kifmm--hybrid. Energy and power: "A roofline model of energy"; "Algorithmic time, energy, and power on candidate HPC compute building blocks": http://j.mp/energy-roofline--ubenchmarks