Hydra: Generation and Tuning of parallel solutions for linear algebra equations. Alexandre X. Duchâteau, University of Illinois at Urbana-Champaign


Collaborators

Thesis advisors: Denis Barthou (LaBRI/INRIA Bordeaux) and David Padua (University of Illinois at Urbana-Champaign). Future collaboration: the StarPU team, INRIA Bordeaux.

Introduction: Objectives and Contributions

Objectives

Compile linear algebra equations: compute X for L*X - X*U = C [CTSY]; compute L and U for L*U = A [LU]. Generate efficient task-parallel code: identify tasks and generate the task dependence graph.

Motivation

The focus is on linear algebra. Start from a high-level description: no code or algorithm is given. Derivation proceeds through blocking of the operands, a data-centric approach, and the derivation itself exposes parallelism. The output is a parallel task graph.

Contributions

A specification language to express the computation, characterize the operands (shapes), and identify the wanted result. Derivation rules: validity/applicability patterns and symbolic-execution rules for the operators. A dependence-building engine.

System Description: a detailed view of the generator

Description Language - Operands

%% Operands
X: Unknown Matrix
L: Lower Triangular Square Matrix
U: Upper Triangular Square Matrix
C: Square Matrix
%% Equation
L*X-X*U=C
%% Parameters
@name ctsy

All operands are declared, with type inference: a status (known, unknown), a shape (triangular, diagonal), a type (matrix; also vector, scalar), and modifiers (transpose, sizes, density).

Description Language - Equation

The %% Equation section (here L*X-X*U=C) is the base for decomposition. Simple equations are either assignments (X = A*B) or solver invocations: LU, triangular solve (L*X=B), Sylvester, Cholesky.

Description Language - Parameters

The %% Parameters section drives code generation: name, codelet, order of operands, data type. Parameters exist for customization; default values are used when they are omitted.

Kernel Declaration

Generating a full solution would pay the cost of full recursion, which is usually not a good idea. Instead, reuse existing kernels and libraries, which are already optimized.

Kernel Performance

Measure kernel performance; it depends on the target size. The measurements guide the exploration toward optimal nodes, similarly to ATLAS. [Plot: kernel performance as a function of problem size, for sizes 2 through 128.]

StarPU Codelet

Codelet declaration:

void gemm(void *buffers[], void *cl_arg);

struct starpu_codelet gemm_cl = {
    .where     = STARPU_CPU,
    .cpu_funcs = { gemm, NULL },
    .nbuffers  = 3,
    .modes     = { STARPU_R, STARPU_R, STARPU_RW }
};

Kernel code:

void gemm(void *buffers[], void *cl_arg)
{
    struct params *params = cl_arg;
    int n = params->n;
    double *a = (double *) STARPU_MATRIX_GET_PTR(buffers[0]);
    double *b = (double *) STARPU_MATRIX_GET_PTR(buffers[1]);
    double *c = (double *) STARPU_MATRIX_GET_PTR(buffers[2]);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 1.0, c, n);
}
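For context, here is a hedged usage sketch of how such a codelet is typically driven (my own illustration, not from the slides; exact calls vary across StarPU versions). The matrices are registered as data handles and a task is inserted with the declared access modes:

/* Hypothetical driver for the gemm codelet above; a, b, c, n and
   struct params are taken from the slide, the rest is standard StarPU usage. */
starpu_data_handle_t ha, hb, hc;
struct params p = { .n = n };

starpu_matrix_data_register(&ha, STARPU_MAIN_RAM, (uintptr_t) a, n, n, n, sizeof(double));
starpu_matrix_data_register(&hb, STARPU_MAIN_RAM, (uintptr_t) b, n, n, n, sizeof(double));
starpu_matrix_data_register(&hc, STARPU_MAIN_RAM, (uintptr_t) c, n, n, n, sizeof(double));

starpu_task_insert(&gemm_cl,
                   STARPU_R,  ha,
                   STARPU_R,  hb,
                   STARPU_RW, hc,
                   STARPU_VALUE, &p, sizeof(p),
                   0);

starpu_task_wait_for_all();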

Defining the Blocking Space

Blocking defines the space of solutions, and only valid solutions must be generated. Look at the equation's operation tree: each node gets a set of dimensions, and constraints are generated depending on the operation.

Validity of Blocking

For a product A*B, A carries dimensions <x_A, y_A> and B carries <x_B, y_B>; the blocking is valid only if the inner dimensions agree, which adds the new constraint { y_A = x_B }.

Valid Blockings - DTSY

For A*X*B - X = C, with A blocked <x_A,y_A>, X blocked <x_X,y_X>, B blocked <x_B,y_B>, and C blocked <x_C,y_C>, the constraints are:

y_A = x_X, y_X = x_B, x_A = x_X, y_B = y_X, x_C = x_X, y_C = y_X

They collapse the dimension variables into two classes, { x_A, y_A, x_X, x_C } and { x_B, y_B, y_X, y_C }, so a valid blocking is determined by two independent block counts (a sketch of this computation follows below).
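To make the constraint solving concrete, here is a minimal self-contained sketch (my own illustration, not Hydra's implementation) that recovers the two classes by union-find over the dimension variables:

#include <stdio.h>

enum { xA, yA, xX, yX, xB, yB, xC, yC, NDIMS };
static int parent[NDIMS];

static int find(int d) { return parent[d] == d ? d : (parent[d] = find(parent[d])); }
static void unite(int a, int b) { parent[find(a)] = find(b); }

int main(void) {
    for (int d = 0; d < NDIMS; d++) parent[d] = d;
    /* The constraints listed on the slide for A*X*B - X = C: */
    unite(yA, xX); unite(yX, xB); unite(xA, xX);
    unite(yB, yX); unite(xC, xX); unite(yC, yX);
    /* Two classes remain, {xA,yA,xX,xC} and {xB,yB,yX,yC}, so a valid
       blocking is described by two independent block counts. */
    printf("xA and xB in same class? %s\n", find(xA) == find(xB) ? "yes" : "no");
    return 0;
}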

Valid Blockings - DTSY (2)

For A*X*B - X = C with A lower triangular and B upper triangular, valid blockings include:

x_A = y_A = x_X = 2 and x_B = y_B = y_X = 2 (both classes split 2x2)
x_A = y_A = x_X = 2 and x_B = y_B = y_X = 1 (B kept as a single block)
x_A = y_A = x_X = 3 and x_B = y_B = y_X = 3

[Figure: the corresponding block pictures of A, X, B and the temporaries T, with structurally zero blocks marked.]

Derivation Example

Start from A*X = T with A triangular. Blocking the operands 2x2 and symbolically executing the product gives:

T(0,0) = A(0,0)*X(0,0) + A(0,1)*X(1,0)
T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1)
T(1,0) = A(1,0)*X(0,0) + A(1,1)*X(1,0)
T(1,1) = A(1,0)*X(0,1) + A(1,1)*X(1,1)

Because A is triangular, the block A(1,0) is structurally zero; removal of 0-computation leaves:

T(0,0) = A(0,0)*X(0,0) + A(0,1)*X(1,0)
T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1)
T(1,0) = A(1,1)*X(1,0)
T(1,1) = A(1,1)*X(1,1)
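A minimal sketch of the removal step, under the shape lattice the slides imply (the names are mine, not Hydra's): during symbolic execution, a product term is dropped as soon as one factor is a structurally zero block.

#include <stdio.h>

enum shape { ZERO, LT, UT, MT, UNK };   /* block shapes tracked symbolically */

/* A product vanishes when either factor is a structurally zero block. */
static enum shape mul_shape(enum shape a, enum shape b) {
    return (a == ZERO || b == ZERO) ? ZERO : MT;
}

int main(void) {
    enum shape a10 = ZERO;   /* the zero off-diagonal block of triangular A */
    enum shape x01 = UNK;
    if (mul_shape(a10, x01) == ZERO)
        printf("drop A(1,0)*X(0,1), leaving T(1,1) = A(1,1)*X(1,1)\n");
    return 0;
}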

Operand Characterization

Blocks of X are outputs (unknown); A(0,0) and A(1,1) are lower triangular.

Equation                                  | Inputs                  | Outputs
T(1,1) = A(1,1)*X(1,1)                    | T(1,1), A(1,1)          | X(1,1)
T(1,0) = A(1,1)*X(1,0)                    | T(1,0), A(1,1)          | X(1,0)
T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1)    | T(0,1), A(0,0), A(0,1)  | X(0,1), X(1,1)
T(0,0) = A(0,0)*X(0,0) + A(0,1)*X(1,0)    | T(0,0), A(0,0), A(0,1)  | X(0,0), X(1,0)

Equation Signature

Equations must be identified, and identification goes through operand types and operators: A*X = T has signature LT * UNK = MT. A set of simplification rules reduces signatures, e.g. UNK + MT*MT -> UNK + MT (a product of known matrices is itself just a known matrix).

Task Identification

T(1,1) = A(1,1)*X(1,1) has signature LT * UNK = MT: it is an instance of the original problem and therefore solvable, so X(1,1) can now be considered known.

Building Dependences

Find the remaining occurrences of X(1,1): each shifts from the output set to the input set, and a dependence edge is added. For T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1), the inputs become T(0,1), A(0,0), A(0,1), X(1,1) and the output is X(0,1), giving the new dependence

{ T(1,1) = A(1,1)*X(1,1) } -> { T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1) }

(a sketch of this scan follows below).
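The scan can be pictured as follows (a sketch with hypothetical data structures, not the actual engine): compare the block each equation produces against the blocks every other equation reads, and emit an edge on a match.

#include <stdio.h>
#include <string.h>

#define MAXIN 4

/* Hypothetical representation of one block equation. */
struct eqn {
    const char *out;         /* the unknown block this equation solves */
    const char *in[MAXIN];   /* the blocks it reads (NULL-padded)      */
};

int main(void) {
    struct eqn eqs[] = {
        { "X11", { "T11", "A11" } },
        { "X01", { "T01", "A00", "A01", "X11" } },  /* X11 moved to inputs */
    };
    int n = (int)(sizeof eqs / sizeof eqs[0]);

    /* Edge i -> j whenever equation j reads the block equation i produces. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < MAXIN && eqs[j].in[k]; k++)
                if (strcmp(eqs[j].in[k], eqs[i].out) == 0)
                    printf("dependence: eq%d -> eq%d (via %s)\n", i, j, eqs[i].out);
    return 0;
}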

Signature Simplification

For T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1):

1. MT = LT*UNK + MT*MT
2. MT = LT*UNK + MT
3. MT - MT = LT*UNK
4. MT = LT*UNK

A match to the original problem!
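Purely as an illustration (assuming string signatures like those above; Hydra's internal representation is not shown on the slides), the two rewrites can be mechanized:

#include <stdio.h>
#include <string.h>

int main(void) {
    char sig[64] = "MT=LT*UNK+MT*MT";   /* signature of step 1 */

    /* Rule: a product of known matrices is a known matrix (step 1 -> 2). */
    char *p = strstr(sig, "MT*MT");
    if (p) memmove(p + 2, p + 5, strlen(p + 5) + 1);   /* "MT=LT*UNK+MT" */

    /* Moving the known term to the left gives MT-MT = LT*UNK, and the
       difference of known matrices is again just MT (steps 3 and 4). */
    p = strstr(sig, "+MT");
    if (p) *p = '\0';                                  /* "MT=LT*UNK"   */

    printf("%s\n", sig);   /* the original triangular-solve signature */
    return 0;
}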

Post-Identification Expansion

Each simplification step has a corresponding equation rewrite:

MT = LT*UNK + MT*MT  |  T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1)
MT = LT*UNK + MT     |  T1 = A(0,1)*X(1,1);  T(0,1) = A(0,0)*X(0,1) + T1
MT - MT = LT*UNK     |  T2 = T(0,1) - T1
MT = LT*UNK          |  T2 = A(0,0)*X(0,1)

Selection / Generation Pipeline

Equation -> Derivation -> Set of equations -> Identification -> Dependence graph -> Termination -> Final graph.

For L*X*U - X = C, derivation with a 2x2 blocking produces the set of equations:

1: L00*X00*U00 - X00 = C00
2: L10*X00*U00 + L11*X10*U00 - X10 = C10
3: L00*X00*U01 + L00*X01*U11 - X01 = C01
4: L10*X00*U01 + ... + L11*X11*U11 - X11 = C11

Identification recognizes each equation as an instance of the original problem, and the dependence graph links them: equation 1 produces X00, which equations 2, 3 and 4 consume; equation 4 additionally consumes X10 and X01. Each identified subproblem (e.g. L11*X11*U11 - X11 = T) is itself derived recursively over its sub-blocks until kernel sizes are reached, yielding the final graph.

Simple Graph Example

[Figure: a small generated task dependence graph.]

Graph Example

[Figure: a larger generated task dependence graph.]

Heterogeneous Graph Example

[Figure: a task graph mixing MM and DTSY tasks.]

Initial Results and Challenges

Challenges

Version generation time: 5 minutes in the generator, 20 minutes compiling. Generated code size: 500k lines of code in a single function (icc segfaults).

Speedup - Trisolve (L*X = B)

[Plot: speedup over the reference version for 2, 4, 8 and 10 blocks, on problem sizes from 1000 to 18000; the y-axis ranges from 0 to 7.]

Conclusion

Development cycles must keep up with architectures: there is no time to hand-tune everything anymore, hand-tuning cannot be repeated for every hardware iteration, and complexity keeps increasing (heterogeneous systems). This work described an automatic methodology that leverages existing small-scale optimized libraries.

Backup Slides (Just in Case)

Exploiting Heterogeneous Machines: the StarPU Runtime

Goal: scheduling (and offloading) tasks over heterogeneous machines (CPU + GPU + SPU = *PU), with auto-tuning of performance models and optimization of memory transfers. StarPU is a target for compilers (StarSs [UPC], HMPP [CAPS]), libraries (PLASMA/MAGMA [UTK]), and applications (fast multipole methods [CESTA]). It provides an open scheduling platform: scheduling algorithms are plugins. [Figure: a machine with several CPUs and GPUs, each with its own memory; plot of speed (GFlops) for greedy, task-model, and data-prefetch scheduling policies.]

Overview of StarPU

Maximize PU occupancy while minimizing data transfers. Principle: accept tasks that may have multiple implementations, together with potential interdependencies; this leads to a dynamic acyclic graph of tasks, i.e. a data-flow approach. StarPU provides a high-level data-management layer: the application only describes which data may be accessed by tasks and how the data may be divided. [Diagram: HPC applications, parallel compilers and parallel libraries sit on top of StarPU, whose drivers (CUDA, OpenCL) target CPUs and GPUs.]

Code Styles and Optimization

Naive triple loop:

for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];

Blocked and unrolled version (block size b):

for (int kk = 0; kk < N; kk += b)
    for (int ii = 0; ii < N; ii += b)
        for (int jj = 0; jj < N; jj += b)
            for (int i = ii; i < ii + b; i++)
                for (int k = kk; k < kk + b; k++) {
                    double *c_i  = c[i];
                    double  a_ik = a[i][k];
                    double *b_k  = b[k];
                    for (int j = jj; j < jj + b; j += 2) {
                        c_i[j]     += a_ik * b_k[j];
                        c_i[j + 1] += a_ik * b_k[j + 1];
                    }
                }

Empirical Search

Generate multiple solutions by algorithmic exploration. Theoretical evaluation is usually unreliable, so it serves only as a filtering heuristic; empirical evaluation (actual execution) is slow, and very slow when bad versions have to be run.

Generalized System Overview

Driver -> Description -> Generator -> Versions -> Prediction -> Execution -> Selection. The predictor filters which versions are actually executed; the driver guides the search. A sketch of this loop follows.
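A minimal sketch of that loop (all names and the stub models are hypothetical; the real system generates blocked solver versions rather than integers):

#include <float.h>
#include <stdio.h>

typedef struct { int blocking; } version_t;

/* Cheap, unreliable model estimate: stands in for the predictor. */
static double predict(const version_t *v) { return 1.0 / v->blocking; }

/* Slow empirical measurement: stands in for actually running the version. */
static double execute(const version_t *v) { return 1.0 / v->blocking + 0.01 * (v->blocking % 3); }

int main(void) {
    version_t best = { 0 };
    double best_t = DBL_MAX;
    for (int b = 2; b <= 10; b += 2) {        /* generator: candidate versions  */
        version_t v = { b };
        if (predict(&v) > 2.0 * best_t)       /* predictor filters clear losers */
            continue;
        double t = execute(&v);               /* empirical evaluation           */
        if (t < best_t) { best_t = t; best = v; }   /* selection                */
    }
    printf("selected blocking %d (%.3f s)\n", best.blocking, best_t);
    return 0;
}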