Using a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics

Using a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics Jorge González-Domínguez Parallel and Distributed Architectures Group Johannes Gutenberg University of Mainz, Germany j.gonzalez@uni-mainz.de GTC 2015

1 Overview of the Problem 2 Intra-GPU Parallelization with CUDA 3 Inter-GPU Parallelization with UPC++ 4 Experimental Evaluation 5 Conclusions

Overview of the Problem 1 Overview of the Problem 2 Intra-GPU Parallelization with CUDA 3 Inter-GPU Parallelization with UPC++ 4 Experimental Evaluation 5 Conclusions

Overview of the Problem Genome-Wide Association Studies (I)
Analyses of genetic influence on diseases
- M individuals: K cases, C controls
- N genetic markers, Single Nucleotide Polymorphisms (SNPs)
- 3 genotypes: Homozygous Wild (w, AA, 0), Heterozygous (h, Aa, 1), Homozygous Variant (v, aa, 2)

Overview of the Problem Genome-Wide Association Studies (II)

        Cases             Controls
SNP 1   0 1 2 0 1 2 0 1   2 0 1 2 0 1 2 1
SNP 2   0 1 1 0 2 0 0 0   1 2 2 1 0 1 1 2
SNP 3   0 0 0 0 0 0 0 0   1 2 1 1 1 2 1 1
SNP 4   0 1 0 1 0 1 0 1   2 2 2 2 1 1 1 1
SNP 5   0 2 2 2 0 1 1 1   1 0 0 1 1 0 2 2
SNP 6   1 0 1 0 1 0 1 0   1 2 1 2 1 2 2 1


Overview of the Problem Genome-Wide Association Studies (and III)
Definition: two SNPs present epistasis or interaction if:
- Their joint genotype frequencies show a statistically significant difference between cases and controls, which potentially explains the effect of the genetic variation leading to disease.
- The difference between cases and controls shown by the joint values is significantly higher than using only the individual SNP values.

Overview of the Problem BOOST
BOolean Operation-based Screening and Testing
- Binary traits, exhaustive search, statistical regression
- Good accuracy (used by biologists)
- Returns a list of SNP pairs with high interaction probability
- Fastest available tool (Intel Core i7 3.20GHz):
  40,000 SNPs and 3,200 individuals (about 800 million pairs): 51 minutes
  500,000 SNPs and 5,000 individuals (about 125 billion pairs, moderate size): estimated 7 days
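The pair counts quoted above follow directly from the number of unordered SNP pairs an exhaustive search must test, N(N-1)/2. A quick sanity check in plain C++ (function name is illustrative):

```cpp
#include <cstdint>

// An exhaustive epistasis search tests every unordered pair of SNPs.
std::uint64_t snp_pairs(std::uint64_t n) { return n * (n - 1) / 2; }

// snp_pairs(40000)  -> 799,980,000     (~800 million pairs)
// snp_pairs(500000) -> 124,999,750,000 (~125 billion pairs)
```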

Overview of the Problem GBOOST
CUDA version for GPUs, same accuracy as BOOST
- 40,000 SNPs and 6,400 individuals (about 800 million pairs): 28 seconds on a GTX Titan
- 500,000 SNPs and 5,000 individuals (about 125 billion pairs, moderate size): 1 hour on a GTX Titan
- High-throughput genotyping technologies collect a few million SNPs of an individual within a few minutes
- Expected datasets with 5M SNPs and 10,000 individuals

Intra-GPU Parallelization with CUDA 1 Overview of the Problem 2 Intra-GPU Parallelization with CUDA 3 Inter-GPU Parallelization with UPC++ 4 Experimental Evaluation 5 Conclusions

Intra-GPU Parallelization with CUDA Calculation of Contingency Tables (I)
For each SNP-pair: number of occurrences of each combination of genotypes

Cases      SNP2=0  SNP2=1  SNP2=2
SNP1=0     n_000   n_010   n_020
SNP1=1     n_100   n_110   n_120
SNP1=2     n_200   n_210   n_220

Controls   SNP2=0  SNP2=1  SNP2=2
SNP1=0     n_001   n_011   n_021
SNP1=1     n_101   n_111   n_121
SNP1=2     n_201   n_211   n_221

Intra-GPU Parallelization with CUDA Calculation of Contingency Tables (II)

SNP 4   0 1 0 1 0 1 0 1   2 2 2 2 1 1 1 1
SNP 6   1 0 1 0 1 0 1 0   1 2 1 2 1 2 2 1

Cases      SNP6=0  SNP6=1  SNP6=2
SNP4=0     0       4       0
SNP4=1     4       0       0
SNP4=2     0       0       0

Controls   SNP6=0  SNP6=1  SNP6=2
SNP4=0     0       0       0
SNP4=1     0       2       2
SNP4=2     0       2       2

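A contingency-table build like the one in these slides can be sketched on the CPU as follows (plain C++ rather than the actual CUDA kernel; function and variable names are illustrative):

```cpp
#include <array>
#include <cstddef>
#include <vector>

// counts[pheno][g1][g2]: pheno 0 = case, 1 = control,
// g1/g2 = genotype (0/1/2) of the first/second SNP for one individual.
using Table = std::array<std::array<std::array<int, 3>, 3>, 2>;

Table contingency(const std::vector<int>& snp_a,
                  const std::vector<int>& snp_b,
                  const std::vector<int>& pheno) {
    Table t{};  // value-initialized: all 18 counters start at zero
    for (std::size_t i = 0; i < pheno.size(); ++i)
        ++t[pheno[i]][snp_a[i]][snp_b[i]];
    return t;
}
```

Feeding it SNP 4 and SNP 6 from the example, with the first 8 individuals as cases, reproduces the two tables above.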

Intra-GPU Parallelization with CUDA Filtering Stage
- Epistatic interaction measured via log-linear models
- All SNP-pairs analyzed
- The measure is obtained with numerical calculations from the values of the contingency table
- Pairs with a measure higher than a threshold pass the filter and are included in the output file
- multiepistsearch uses a faster filter than GBOOST (out of the scope of this talk)

Intra-GPU Parallelization with CUDA CUDA Implementation
CUDA Kernel:
- Genotyping information loaded in device memory through pinned copies
- Each thread performs the whole calculation of independent SNP-pairs
- Only one kernel for the whole computation; each call to the kernel analyzes a batch of SNP-pairs
Optimization Techniques:
- Boolean representation of genotyping information
- Increased memory coalescing
- Exploitation of shared memory
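The "Boolean representation" optimization listed above is, in BOOST-style tools, typically a bit-per-individual encoding: each SNP keeps one bitmask per genotype, so a joint genotype count reduces to a bitwise AND plus a population count. A CPU sketch of the idea (the GPU kernel would use CUDA intrinsics such as __popc; __builtin_popcountll assumes GCC/Clang; names are illustrative):

```cpp
#include <cstdint>
#include <vector>

// One 64-bit word covers 64 individuals; bit i of mask[g] is set
// if individual i has genotype g (0, 1 or 2) for this SNP.
struct SnpBits { std::vector<std::uint64_t> mask[3]; };

SnpBits encode(const std::vector<int>& genotypes) {
    SnpBits s;
    std::size_t words = (genotypes.size() + 63) / 64;
    for (int g = 0; g < 3; ++g) s.mask[g].assign(words, 0);
    for (std::size_t i = 0; i < genotypes.size(); ++i)
        s.mask[genotypes[i]][i / 64] |= 1ULL << (i % 64);
    return s;
}

// Joint genotype count: AND the two masks and count the set bits,
// instead of looping over individuals one by one.
int joint_count(const SnpBits& a, int ga, const SnpBits& b, int gb) {
    int n = 0;
    for (std::size_t w = 0; w < a.mask[ga].size(); ++w)
        n += __builtin_popcountll(a.mask[ga][w] & b.mask[gb][w]);
    return n;
}
```

Besides the instruction count, this packing also shrinks the genotype matrix by roughly a factor of 16 versus one byte per genotype, which helps coalesced loads.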

Inter-GPU Parallelization with UPC++ 1 Overview of the Problem 2 Intra-GPU Parallelization with CUDA 3 Inter-GPU Parallelization with UPC++ 4 Experimental Evaluation 5 Conclusions

Inter-GPU Parallelization with UPC++ UPC++ (I)
Unified Parallel C++: a novel extension of ANSI C++
Y. Zheng, A. Kamil, M. Driscoll, H. Shan, and K. Yelick. UPC++: a PGAS Extension for C++. In Proc. 28th IEEE Intl. Parallel and Distributed Processing Symp. (IPDPS'14), Phoenix, AZ, USA, 2014.
- Follows the Partitioned Global Address Space (PGAS) programming model
- Single Program Multiple Data (SPMD) execution model
- Works on shared and distributed memory systems

Inter-GPU Parallelization with UPC++ UPC++ (and II)
- Global memory logically partitioned among processes
- Processes can directly access (read/write) any part of the global memory
- Memory with affinity is usually mapped in the same node (faster accesses)

Inter-GPU Parallelization with UPC++ Multi-GPU Approach (I)
- One UPC++ process per GPU
- SNP data distributed among the parts of the global memory; all the information of the same SNP in the same part
- Each GPU (UPC++ process) analyzes different SNP-pairs: creation of the contingency table, then filtering
- The data of the SNPs to analyze might be in remote memory
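Since all the information of one SNP must live in a single partition of the global memory, each process needs an ownership rule to know where a SNP's data resides. A minimal sketch of one possible rule, a contiguous block distribution (illustrative; not necessarily the exact layout used by multiepistsearch):

```cpp
#include <cstddef>

// Block distribution: SNPs [0, n) are split into nranks contiguous
// chunks, so every genotype of one SNP has affinity to one process.
std::size_t owner_of(std::size_t snp, std::size_t n, std::size_t nranks) {
    std::size_t block = (n + nranks - 1) / nranks;  // ceiling division
    return snp / block;
}
```

A local pair (both SNPs owned by the caller) needs no communication; otherwise the remote SNP column is fetched through the PGAS layer before the kernel launch.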

Inter-GPU Parallelization with UPC++ Multi-GPU Approach (II-V)
[Figure slides: step-by-step illustration of how the SNP-pair workload is split into metablocks across the GPUs; images not preserved in the transcription]

Inter-GPU Parallelization with UPC++ Multi-GPU Approach (VI)
Static distribution:
- Workload distributed at the beginning: the metablocks that will be analyzed by each GPU
- The distribution does not change during the execution
- Balanced number of metablocks per GPU: similar workload for each GPU
- Good distribution for systems with similar GPUs
- Minimization of remote copies
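The static scheme above can be sketched as a round-robin assignment of metablock indices to processes, fixed before execution starts (illustrative; it assumes metablocks are simply numbered 0..nblocks-1):

```cpp
#include <cstddef>
#include <vector>

// Rank r gets metablocks r, r + nranks, r + 2*nranks, ... so every
// rank receives (almost) the same number of metablocks, with no
// coordination needed during the run.
std::vector<std::size_t> my_metablocks(std::size_t rank, std::size_t nranks,
                                       std::size_t nblocks) {
    std::vector<std::size_t> mine;
    for (std::size_t b = rank; b < nblocks; b += nranks)
        mine.push_back(b);
    return mine;
}
```

Because the assignment is known up front, each process can also prefetch the remote SNP columns its metablocks need, which is how remote copies are kept to a minimum.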

Inter-GPU Parallelization with UPC++ Multi-GPU Approach (and VII)
On-demand distribution:
- The metablocks computed by each GPU are initially unknown
- Table with one binary value per metablock that indicates whether it has been computed
- When one GPU finishes with one metablock, it looks for the next one that has not been analyzed
- Locks or semaphores are necessary for the concurrent accesses to the table (easy with UPC++ support), but the synchronizations add performance overhead
- GPUs might compute different numbers of metablocks: faster GPUs analyze more SNP-pairs
- Good distribution for systems with different GPUs
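The on-demand table above can be sketched with one flag per metablock that workers flip atomically: whoever wins the flip computes that block. Here C++ threads and std::atomic stand in for the UPC++ processes and the lock-protected global table (a sketch of the scheduling idea, not the paper's implementation):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each worker scans the shared table and atomically claims unanalyzed
// metablocks; compare_exchange guarantees each block is won exactly once.
int claim_all(int nblocks, int nworkers) {
    std::vector<std::atomic<bool>> done(nblocks);
    for (auto& f : done) f.store(false);
    std::atomic<int> claimed{0};

    auto worker = [&] {
        for (int b = 0; b < nblocks; ++b) {
            bool expected = false;
            if (done[b].compare_exchange_strong(expected, true))
                ++claimed;  // ... analyze metablock b on this GPU ...
        }
    };
    std::vector<std::thread> pool;
    for (int w = 0; w < nworkers; ++w) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
    return claimed.load();  // every metablock claimed exactly once
}
```

Faster workers naturally win more flips, which is exactly the load-balancing behavior wanted on heterogeneous GPUs; the price is one synchronized table access per metablock.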

Experimental Evaluation 1 Overview of the Problem 2 Intra-GPU Parallelization with CUDA 3 Inter-GPU Parallelization with UPC++ 4 Experimental Evaluation 5 Conclusions

Experimental Evaluation Evaluation with Homogeneous GPUs (I)
Platform: Mogon cluster, Johannes Gutenberg Universität
- 8 nodes with 3 GTX Titan GPUs (one of the most powerful GPUs)
- InfiniBand network
Dataset: real-world data from the WTCCC database, moderately sized
- 500,568 SNPs
- 2,005 cases with bipolar disorder
- 3,004 controls

Experimental Evaluation Evaluation with Homogeneous GPUs (II)
[Figure: execution time (sec) of the WTCCC dataset on 2-24 GTX Titans, static vs. on-demand distribution; speedups over 1 GPU in parentheses: static 1.93, 3.91, 7.77, 15.68 and 22.82 for 2, 4, 8, 16 and 24 GPUs; on-demand 1.83, 3.21, 6.93, 12.20 and 16.51]
Static 1.38 times faster for 24 GPUs
Static always > 95% parallel efficiency

Experimental Evaluation Evaluation with Homogeneous GPUs (and III)

Design            Architecture   Runtime     Speed (10^6 pairs/s)
multiepistsearch  24 GTX Titan   1 m 11 s    1764.56
multiepistsearch  1 GTX Titan    27 m        77.34
GBOOST            1 GTX Titan    1 h 15 m    34.23
EpiGPU*           1 GTX 580      2 h 55 m    11.90
SHEsisEPI*        1 GTX 285      27 h        1.29
BOOST**           Intel Core i7  7 d         0.21

Speedups for one GPU: 2.77 over GBOOST, and > 373 over the estimation for BOOST on a 3GHz Intel Core i7
With 24 Titans: 54.93 and > 8,500 times faster than GBOOST and BOOST, respectively

Experimental Evaluation Evaluation with Heterogeneous GPUs (I)
Platform: Pluton cluster, Universidade da Coruña (Spain)
- 8 nodes with 1 Tesla K20m
- 4 nodes with 2 Tesla 2050 (fewer cores)
- Gigabit Ethernet network

Experimental Evaluation Evaluation with Heterogeneous GPUs (II)
[Figure: execution time (sec) of the WTCCC dataset on 1+1 to 8+8 heterogeneous GPUs, static vs. on-demand distribution; speedups in parentheses range from 0.90 and 1.31 at 1+1 GPUs to 7.98 (static) and 9.43 (on-demand) at 8+8]
On-demand 1.18 times faster for 16 GPUs

Experimental Evaluation Evaluation with Heterogeneous GPUs (and III)

Design            Architecture           Runtime     Speed (10^6 pairs/s)
multiepistsearch  8 Tesla K20m + 8 2050  4 m 20 s    481.86
multiepistsearch  8 Tesla K20m           5 m 40 s    348.01
multiepistsearch  8 Tesla 2050           10 m 12 s   204.71
multiepistsearch  1 Tesla K20m           41 m        50.93
multiepistsearch  1 Tesla 2050           1 h 1 m     34.23
GBOOST            1 Tesla K20m           1 h 26 m    24.28
GBOOST            1 Tesla 2050           2 h 17 m    15.22

With 1 GPU: 2.10 and 2.25 times faster than GBOOST
1.31 times faster using the whole cluster (on-demand) than only the 8 Tesla K20m

Experimental Evaluation Evaluation of a Large-Scale Dataset
Simulated dataset: 5M SNPs, 5,000 cases, 5,000 controls
- 2 hours and 45 minutes on Mogon (24 GTX Titan)
- Estimation of more than 2 days and 14 hours on 1 GPU
- GBOOST is not able to analyze it (out-of-bounds problems in its arrays)

Conclusions 1 Overview of the Problem 2 Intra-GPU Parallelization with CUDA 3 Inter-GPU Parallelization with UPC++ 4 Experimental Evaluation 5 Conclusions

Conclusions Conclusions
- multiepistsearch looks for epistatic interactions on GPU clusters: hybrid CUDA & UPC++ implementation
- On only one GPU, always speedups higher than 2 over GBOOST
- Two inter-GPU data distributions: static for homogeneous clusters, dynamic for heterogeneous clusters
- High scalability: 95% parallel efficiency with 24 GTX Titans and the WTCCC dataset
- 2 hours and 45 minutes for 5M SNPs and 10K samples on 24 GTX Titans

Conclusions Bibliography
First version of the GPU kernel:
J. González-Domínguez, B. Schmidt, J. C. Kässens, and L. Wienbrandt. Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS. In Proc. 20th Intl. European Conf. on Parallel and Distributed Computing (Euro-Par'14), Porto, Portugal.
multiepistsearch (minor revision):
J. González-Domínguez, J. C. Kässens, L. Wienbrandt, and B. Schmidt. Large-Scale Genome-Wide Association Studies on a GPU Cluster Using a CUDA-Accelerated PGAS Programming Model. Intl. Journal of High Performance Computing Applications (IJHPCA).

Conclusions Using a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics Jorge González-Domínguez Parallel and Distributed Architectures Group Johannes Gutenberg University of Mainz, Germany j.gonzalez@uni-mainz.de GTC 2015