Some notes on efficient computing and setting up high performance computing environments

Similar documents
Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers

Computing least squares condition numbers on hybrid multicore/gpu systems

SPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015

INITIAL INTEGRATION AND EVALUATION

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems

Hierarchical Modeling for Univariate Spatial Data

Accelerating linear algebra computations with hybrid GPU-multicore systems.

MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors

Binding Performance and Power of Dense Linear Algebra Operations

Evaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries

ECS289: Scalable Machine Learning

spbayes: An R Package for Univariate and Multivariate Hierarchical Point-referenced Spatial Models

Bayesian Linear Models

Nearest Neighbor Gaussian Processes for Large Spatial Data

Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes

MAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012

Lightweight Superscalar Task Execution in Distributed Memory

Bayesian Linear Models

Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes

Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes

Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geo-statistical Datasets

Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)

Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano

On Gaussian Process Models for High-Dimensional Geostatistical Datasets

WRF performance tuning for the Intel Woodcrest Processor

Dense Arithmetic over Finite Fields with CUMODP

Hierarchical Modelling for Univariate Spatial Data

A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures

Parallel sparse direct solvers for Poisson s equation in streamer discharges

Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors

Intel Math Kernel Library (Intel MKL) LAPACK

Direct Self-Consistent Field Computations on GPU Clusters

The Ongoing Development of CSDP

Geostatistical Modeling for Large Data Sets: Low-rank methods

Accelerating Model Reduction of Large Linear Systems with Graphics Processors

Balanced Truncation Model Reduction of Large and Sparse Generalized Linear Systems

Tools for Power and Energy Analysis of Parallel Scientific Applications

Bayesian Linear Regression

Multicore Parallelization of Determinant Quantum Monte Carlo Simulations

Sparse BLAS-3 Reduction

ExaGeoStat: A High Performance Unified Software for Geostatistics on Manycore Systems

GPU accelerated Arnoldi solver for small batched matrix

Hierarchical Modelling for Univariate Spatial Data

Bayesian Modeling and Inference for High-Dimensional Spatiotemporal Datasets

Principles of Bayesian Inference

Algorithms and Methods for Fast Model Predictive Control

RRQR Factorization Linux and Windows MEX-Files for MATLAB

Hierarchical Modeling for Multivariate Spatial Data

Dynamic Scheduling within MAGMA

1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria

Journal of Statistical Software

Parallelization of the Molecular Orbital Program MOS-F

Towards Fast, Accurate and Reproducible LU Factorization

A hybrid Hermitian general eigenvalue solver

CS 542G: Conditioning, BLAS, LU Factorization

Level-3 BLAS on a GPU

Performance Analysis and Design of a Hessenberg Reduction using Stabilized Blocked Elementary Transformations for New Architectures

On aggressive early deflation in parallel variants of the QR algorithm

Adaptive Spike-Based Solver 1.0 User Guide

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry

Communication-avoiding LU and QR factorizations for multicore architectures

Porting a sphere optimization program from LAPACK to ScaLAPACK

Porting a Sphere Optimization Program from lapack to scalapack

Hierarchical Modeling for Spatio-temporal Data

MARCH 24-27, 2014 SAN JOSE, CA

Gaussian predictive process models for large spatial data sets.

Saving Energy in the LU Factorization with Partial Pivoting on Multi-Core Processors

Bayesian Linear Models

Matrix Eigensystem Tutorial For Parallel Computation

Lecture 13 Stability of LU Factorization; Cholesky Factorization. Songting Luo. Department of Mathematics Iowa State University

Principles of Bayesian Inference

Enhancing Scalability of Sparse Direct Methods

Metropolis-Hastings Algorithm

FEAST eigenvalue algorithm and solver: review and perspectives

The ELPA library: scalable parallel eigenvalue solutions for electronic structure theory and

Performance of the fusion code GYRO on three four generations of Crays. Mark Fahey University of Tennessee, Knoxville

High-Performance Scientific Computing

Efficient algorithms for symmetric tensor contractions

A parallel tiled solver for dense symmetric indefinite systems on multicore architectures

LAPACK-Style Codes for Pivoted Cholesky and QR Updating. Hammarling, Sven and Higham, Nicholas J. and Lucas, Craig. MIMS EPrint: 2006.

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic

Parallel Transposition of Sparse Data Structures

Parallel Polynomial Evaluation

On the design of parallel linear solvers for large scale problems

Coordinate Update Algorithm Short Course The Package TMAC

Numerical Linear Algebra

Tile QR Factorization with Parallel Panel Processing for Multicore Architectures

Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222

Information Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes and

Introduction, basic but important concepts

The ELPA Library Scalable Parallel Eigenvalue Solutions for Electronic Structure Theory and Computational Science

Parallel Circuit Simulation. Alexander Hayman

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

Lecture 8: Fast Linear Solvers (Part 7)

Linear Algebra. PHY 604: Computational Methods in Physics and Astrophysics II

SOLUTION of linear systems of equations of the form:

Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem

Principles of Bayesian Inference

Transcription:

Some notes on efficient computing and setting up high performance computing environments Andrew O. Finley Department of Forestry, Michigan State University, Lansing, Michigan. April 17, 2017 1

Efficient sampler computing Bayesian hierarchical linear mixed model p(θ) N(β µ β, Σ β ) N(α 0, K(θ)) N(y Xβ + Z(θ)α, D(θ)) y is an n 1 vector of possibly irregularly located observations, X is a known n p matrix of regressors (p < n), K(θ) and D(θ) are families of r r and n n covariance matrices, respectively, Z(θ) is n r with r n, all indexed by a set of unknown process parameters θ. α is the r 1 random vector and β is the p 1 slope vector. Space-varying intercept model is a special case where D(θ) = τ 2 I n, α = (w(s 1 ), w(s 2 ),..., w(s n )), Z(θ) = I n, and the n n K(θ) = σ 2 R(φ). 2 2017 UW-Madison Analysis of Spatial-Temporal Data

Efficient sampler computing For faster convergence, we integrate out β and α from the model and first sample from p(θ y) p(θ) N(y Xµ β, Σ y θ ), where Σ y θ = XΣ β X + Z(θ)K(θ)Z(θ) + D(θ). This involves evaluating log p(θ y) = const + log p(θ) 1 2 log Σ y θ 1 2 Q(θ), where Q(θ) = (y Xµ β ) Σ 1 y θ (y Xµ β). 1 L = chol(σ y θ ), lower-triangular Cholesky factor L of Σ y θ (O(n3 /3) flops) 2 u = trsolve(l, y Xµ β ), solves Lu = y Xµ β (O(n 2 ) flops) 3 Q(θ) = u u (2n flops) 4 log-determinant is 2 n i=1 log l ii, where l ii are the diagonal entries in L (n flops) 3 2017 UW-Madison Analysis of Spatial-Temporal Data

Efficient sampler computing Given marginal posterior samples θ from p(θ y), we can draw posterior samples of β and α using composition sampling. For more details see Finley, A.O., S. Banerjee, A.E. Gelfand. (2015) spbayes for large univariate and multivariate point-referenced spatio-temporal data models. Journal of Statistical Software, 63:1 28. 4 2017 UW-Madison Analysis of Spatial-Temporal Data

Code implementation Very useful libraries for efficient matrix computation: 1 Fortran BLAS (Basic Linear Algebra Subprograms, see Blackford et al. 2001). Started in late 70s at NASA JPL by Charles L. Lawson. 2 Fortran LAPACK (Linear Algebra Package, see Anderson et al. 1999). Started in mid 80s at Argonne and Oak Ridge National Laboratories. Modern math software has a heavy reliance on these libraries, e.g., Matlab and R. Routines are also accessible via C, C++, Python, etc. 5 2017 UW-Madison Analysis of Spatial-Temporal Data

Code implementation Many improvements on the standard BLAS and LAPACK functions, see, e.g., Intel Math Kernel Library (MKL) AMD Core Math Library (ACML) Automatically Tuned Linear Algebra Software (ATLAS) Matrix Algebra on GPU and Multicore Architecture (MAGMA) OpenBLAS http://www.openblas.net veclib (for Mac users only) 6 2017 UW-Madison Analysis of Spatial-Temporal Data

Code implementation Key BLAS and LAPACK functions used in our setting. Function dpotrf dtrsv dtrsm dgemv dgemm Description LAPACK routine to compute the Cholesky factorization of a real symmetric positive definite matrix. Level 2 BLAS routine to solve the systems of equations Ax = b, where x and b are vectors and A is a triangular matrix. Level 3 BLAS routine to solve the matrix equations AX = B, where X and B are matrices and A is a triangular matrix. Level 2 BLAS matrix-vector multiplication. Level 3 BLAS matrix-matrix multiplication. 7 2017 UW-Madison Analysis of Spatial-Temporal Data

Consider different environments: 1 A distributed system consists of multiple autonomous computers (nodes) that communicate through a network. A computer program that runs in a distributed system is called a distributed program. Message Passing Interface (MPI) is a specification for an Application Programming Interface (API) that allows many computers to communicate. 2 A shared memory multiprocessing system consists of a single computer with memory that may be simultaneously accessed by one or more programs running on multiple Central Processing Units (CPUs). OpenMP (Open Multi-Processing) is an API that supports shared memory multiprocessing programming. 3 A heterogeneous system uses more than one kind of processor, e.g., CPU & (Graphics Processing Unit) GPU or CPU & Intel s Xeon Phi Many Integrated Core (MIC). 8 2017 UW-Madison Analysis of Spatial-Temporal Data

Which environments are right for large n settings? MCMC necessitates iterative evaluation of the likelihood which requires operations on large matrices. A specific hurdle is factorization to computing determinant and inverse of large dense covariance matrices. We try to model our way out and use computing tools to overcome the complexity (e.g., covariance tapering, Kaufman et al. 2008; low-rank methods, Cressie and Johannesson 2008; Banerjee et al. 2008, etc.). Due to slow network communication and transport of submatrices among nodes distributed systems are not ideal for these types of iterative large matrix operations. 9 2017 UW-Madison Analysis of Spatial-Temporal Data

My lab currently favors shared memory multiprocessing and heterogeneous systems. Newest unit is a Dell Poweredge with 384 GB of RAM, 2 threaded 10-core Xeon CPUs, and 2 Intel Xeon Phi Coprocessor with 61-cores (244 threads) running a Linux operating systems. Software includes OpenMP coupled with Intel MKL. MKL is a library of highly optimized, extensively threaded math routines designed for Xeon CPUs and Phi coprocessors (e.g., BLAS, LAPACK, ScaLAPACK, Sparse Solvers, Fast Fourier Transforms, and vector RNGs). 10 2017 UW-Madison Analysis of Spatial-Temporal Data

So what kind of speed up to expect from threaded BLAS and LAPACK libraries. Cholesky factorization (dpotrf), n=20000 Execution time (seconds) 20 40 60 80 100 120 140 Cholesky factorization (dpotrf), n=40000 Execution time (seconds) 200 400 600 800 1000 5 10 15 20 Number of Xeon cores 5 10 15 20 Number of Xeon cores 11 2017 UW-Madison Analysis of Spatial-Temporal Data

R and threaded BLAS and LAPACK Many core and contributed packages (including spbayes) call BLAS)and LAPACK Fortran libraries. Compile R against threaded BLAS and LAPACK provides substantial computing gains: processor specific threaded BLAS/LAPACK implementation (e.g., MKL or ACML) processor specific compilers (e.g., Intel s icc/ifort) 12 2017 UW-Madison Analysis of Spatial-Temporal Data

For Linux/Unix: compiling R to call MKL s BLAS and LAPACK libraries (rather than stock serial versions). Change your config.site file and configure call in the R source code directory. My config.site file CC=icc CFLAGS="-g -O3 -wd188 -ip -mp" F77=ifort FLAGS="-g -O3 -mp -openmp" CXX=icpc CXXFLAGS="-g -O3 -mp -openmp" FC=ifort FCFLAGS="-g -O3 -mp -openmp" ICC_LIBS=/opt/intel/composerxe/lib/intel64 IFC_LIBS=/opt/intel/composerxe/lib/intel64 LDFLAGS="-L$ICC_LIBS -L$IFC_LIBS -L/usr/local/lib" SHLIB_CXXLD=icpc SHLIB_CXXLDFLAGS=-shared 13 2017 UW-Madison Analysis of Spatial-Temporal Data

Compiling R to call MKL s BLAS and LAPACK libraries (rather than stock serial versions). Change your config.site file and configure call in the R source code directory. My configure script MKL_LIB_PATH="/opt/intel/composerxe/mkl/lib/intel64" export LD_LIBRARY_PATH=$MKL_LIB_PATH MKL="-L${MKL_LIB_PATH} -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm"./configure --with-blas="$mkl" --with-lapack --prefix=/usr/local/r-mkl 14 2017 UW-Madison Analysis of Spatial-Temporal Data

For many BLAS and LAPACK functions calls from R, expect near linear speed up... 15 2017 UW-Madison Analysis of Spatial-Temporal Data

Mac users, see Doug s veclib linking hints at http://blue.for.msu.edu/gweda15/blashints.txt Linux/Unix users, compile R with MKL as described above, or compile OpenBLAS and just redirect R s default librblas.so to libopenblas.so, e.g., /usr/local/lib/r/lib/librblas.so -> /opt/openblas/lib/libopenblas.so See http://blue.for.msu.edu/comp-notes for some simple examples of C++ with MKL and Rmath libraries along with associated Makefile files. 16 2017 UW-Madison Analysis of Spatial-Temporal Data