Everyday Multithreading

Everyday Multithreading: Parallel Computing for Genomic Evaluations in R

C. Heuer, D. Hinrichs, G. Thaller
Institute of Animal Breeding and Husbandry, Kiel University
August 27, 2014

Introduction

High-dimensional livestock data sets pose new computational challenges:
A paradigm shift in breeding programs and computing
From sparse to dense mixed-model equations (MME), or mixtures of both
High storage, memory, and computing demands

Solution: make use of the available hardware resources through parallel computing.

Parallel Computing

What parallel computing is: splitting a big problem into chunks that are solved simultaneously by several processing units.

[Figure: element-wise addition of two matrices, with blocks of rows assigned to different processing units; a sketch of the idea in R follows below]
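A minimal R sketch of that picture, using forked worker processes (our illustration, not code from the talk; parallel::mclapply falls back to serial execution on Windows):

    library(parallel)

    A <- matrix(runif(6e6), nrow = 3000)
    B <- matrix(runif(6e6), nrow = 3000)

    # Split the rows into one chunk per core; each worker adds its block.
    ncores <- detectCores()
    chunks <- splitIndices(nrow(A), ncores)
    parts  <- mclapply(chunks, function(i) A[i, ] + B[i, ], mc.cores = ncores)
    C_par  <- do.call(rbind, parts)

    stopifnot(all.equal(C_par, A + B))  # identical to the serial result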

Parallel Computing Paradigms

1. Most importantly: single-core parallelism through vectorization (see the toy example below)
2. Shared-memory multi-threading: OpenMP
3. Distributed-memory multi-processing: MPI
4. Accelerator programming: CUDA, OpenCL, Intel Xeon Phi

Efficiency/scaling depends on the size of the problem, because of parallelization overhead. Note that the less efficient the single-threaded application, the better its apparent scaling. In general: optimize the single-threaded code first, then parallelize.
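As a small aside on point 1, a toy comparison of an interpreted loop against a vectorized operation in R (ours, not from the talk):

    n <- 1e7
    x <- runif(n); y <- runif(n)

    # Interpreted loop: one multiplication per iteration of the R interpreter
    f_loop <- function(x, y) {
      z <- numeric(length(x))
      for (i in seq_along(x)) z[i] <- x[i] * y[i]
      z
    }

    system.time(z1 <- f_loop(x, y))  # slow: per-element interpreter overhead
    system.time(z2 <- x * y)         # fast: a single vectorized call into C
    stopifnot(all.equal(z1, z2))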

R-Package cpgen

Advantages of R:
1. Very flexible, open-source, interpreted language
2. Easy to use, available on all platforms, and widely used

Drawbacks of R:
1. Not designed for big-data problems
2. Extending R takes considerable effort
3. Strictly single-threaded

But: R can be extended and multi-threaded through C/C++/Fortran shared libraries.

General Implementation

1. R as the basic environment for data preparation and supply
2. Linking C++ to R: Rcpp (Eddelbuettel and François, 2011)
3. Linear algebra, with vectorization: Eigen (Guennebaud et al., 2010)
4. Eigen + Rcpp + sparse-matrix support: RcppEigen (Bates and Eddelbuettel, 2013)
5. Parallelization: OpenMP

A minimal sketch of this stack is shown below.
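To make the stack concrete: an OpenMP-threaded crossproduct compiled from R via Rcpp and RcppEigen. This is our illustration, not cpgen's internal code; Eigen threads its dense matrix products with OpenMP when built with it, and the function name is ours.

    library(Rcpp)

    # Compile a small C++ function against RcppEigen with OpenMP enabled.
    cppFunction(depends = "RcppEigen", plugins = "openmp", code = '
      Eigen::MatrixXd par_crossprod(Eigen::Map<Eigen::MatrixXd> X, int threads) {
        Eigen::setNbThreads(threads);             // Eigen uses OpenMP for GEMM
        return Eigen::MatrixXd(X.adjoint() * X);  // returns t(X) %*% X
      }
    ')

    M <- matrix(rnorm(2000 * 500), 2000, 500)
    all.equal(par_crossprod(M, 4), crossprod(M))  # TRUE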


Functionality

1. Single-site Gibbs sampler for mixed models with an arbitrary number of random effects (sparse or dense design matrices)
2. Genomic prediction module: different methods, cross-validation
3. GWAS module: EMMAX, a highly efficient and very flexible single-marker GWAS
4. Tools: genomic additive and dominance relationship matrices, crossproducts, covariance matrices, ...

Gibbs Sampler

Runs models of the following form:

$$y = Xb + Z_1 u_1 + Z_2 u_2 + \dots + Z_n u_n + e$$

with $u_k \sim \mathrm{MVN}(0, I\sigma^2_{u_k})$ for all $k$. If $u_k$ is assumed to follow some $\mathrm{MVN}(0, G_k\sigma^2_k)$, the design matrix must be constructed as $Z_k^* = Z_k G_k^{1/2}$, yielding independent effects in $u_k$ (Waldmann et al., 2008). A sketch of this reparameterization follows below.
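In R, one valid square root $G^{1/2}$ is the lower Cholesky factor; a toy sketch (ours, with made-up data):

    set.seed(1)
    n <- 5; q <- 3
    Z <- matrix(rbinom(n * q, 1, 0.5), n, q)   # toy incidence matrix
    G <- crossprod(matrix(rnorm(q * q), q))    # toy SPD covariance structure

    L     <- t(chol(G))                        # G = L %*% t(L)
    Zstar <- Z %*% L                           # Z* = Z G^{1/2}

    # Z* u*, with u* ~ MVN(0, I sigma2), has the same covariance as
    # Z u, with u ~ MVN(0, G sigma2):
    all.equal(tcrossprod(Zstar), Z %*% G %*% t(Z))  # TRUE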

Genomic BLUP

GBLUP can be computed very efficiently (Kang et al., 2008):

$$y = Xb + a + e, \quad a \sim \mathrm{MVN}(0, G\sigma^2_a)$$

By finding the eigendecomposition $G = UDU'$ and premultiplying the model equation by $U'$, we get:

$$U'y = U'Xb + U'a + U'e$$

with $\mathrm{Var}(U'y) = U'GU\sigma^2_a + U'U\sigma^2_e = U'UDU'U\sigma^2_a + I\sigma^2_e = D\sigma^2_a + I\sigma^2_e$.
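The rotation in R, on made-up data (our illustration; the diagonal covariance of the rotated model is what makes variance-component estimation cheap):

    set.seed(2)
    n <- 100
    X <- cbind(1, rnorm(n))                            # intercept + one covariate
    G <- tcrossprod(matrix(rnorm(n * 200), n)) / 200   # toy genomic relationship matrix
    y <- rnorm(n)

    ev  <- eigen(G, symmetric = TRUE)   # G = U D U'
    U   <- ev$vectors
    D   <- ev$values
    y_r <- crossprod(U, y)              # U'y
    X_r <- crossprod(U, X)              # U'X
    # Var(y_r) = D * sigma2_a + sigma2_e * I is diagonal, so the
    # likelihood factors into n independent terms.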

GWAS

Single-marker regression, controlling for the polygenic background effect through an (assumed) covariance structure $V$ of $y$: EMMAX (Kang et al., 2010). The generalized least squares estimate of the marker effect is

$$\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}y$$

This is equivalent to

$$\hat{\beta} = (X^{*\prime}X^{*})^{-1}X^{*\prime}y^{*}, \quad \text{with } X^{*} = V^{-1/2}X,\; y^{*} = V^{-1/2}y$$

A sketch of this whitening follows below.
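A whitening sketch in R on made-up data (ours; eigen-based $V^{-1/2}$, then ordinary least squares per marker):

    set.seed(3)
    n <- 200; m <- 50
    y <- rnorm(n)
    V <- tcrossprod(matrix(rnorm(n * n), n)) / n + diag(n)  # toy SPD covariance
    M <- matrix(rbinom(n * m, 2, 0.3), n, m)                # marker genotypes 0/1/2

    ev    <- eigen(V, symmetric = TRUE)
    Vinvh <- ev$vectors %*% (t(ev$vectors) / sqrt(ev$values))  # V^{-1/2}
    ystar <- Vinvh %*% y

    beta <- apply(M, 2, function(snp) {
      Xstar <- Vinvh %*% cbind(1, snp)   # whitened intercept + marker
      qr.coef(qr(Xstar), ystar)[2]       # GLS estimate of the marker effect
    })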

Parallel Computing in cpgen

What is parallelized:
1. All crossproduct-like computations
2. Sampling of the random effects in the Gibbs sampler: dot products, vector-vector subtraction (Fernando et al., 2014)
3. Cross-validation for genomic prediction
4. GWAS

The number of threads being used can be controlled at runtime (see the note below).
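For OpenMP-backed R packages, thread count is conventionally steered through the OMP_NUM_THREADS environment variable or a package-level setter; the lines below follow that convention and are an assumption on our part, so check the cpgen documentation for the exact interface:

    Sys.setenv(OMP_NUM_THREADS = "4")  # set before the OpenMP runtime initializes
    library(cpgen)
    # cpgen also exposes runtime thread control; see the package help
    # (a setter along the lines of set_num_threads(4) -- hypothetical name).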

Parallel Scaling

[Figure: parallel scaling, 2,000 observations and 50,000 markers]

[Figure: parallel scaling of the Gibbs sampler (BRR), 10,000 markers]

Conclusions

Shared-memory multithreading can reduce computation time significantly. It bridges the gap between single-threaded applications and heavily parallelized standalone programs for HPC clusters, letting us use the computational power already present in workstation PCs: everyday multithreading.

But: the size of the solvable problem is limited by the available memory. With 1 TB of memory, the package could fit a BRR model with 3 million observations and 40,000 markers.
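That limit matches a back-of-the-envelope check: the dense double-precision marker matrix alone occupies $3\times10^6 \times 4\times10^4 \times 8\,\mathrm{B} = 9.6\times10^{11}\,\mathrm{B} \approx 0.96$ TB, essentially all of the 1 TB.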

Thank you for your attention.

GitHub: https://github.com/cheuerde/cpgen
R-Forge: https://r-forge.r-project.org/r/?group_id=1687

Absolute Timings (10 cores, 30,000 markers)

BRR, 1 million observations, 30,000 iterations: ~100 hours
GWAS, 10,000 observations: ~7 minutes
G-matrix, 10,000 observations: ~2 minutes