An Integrative Model for Parallelism


An Integrative Model for Parallelism. Victor Eijkhout, ICERM workshop, 2012/01/09

Outline: Introduction; Formal part; Examples; Extension to other memory models; Conclusion.

Introduction

Practical starting point. Data movement is very important: we need to model it. I'm starting with distributed memory: we understand parallelism through MPI (running at 300,000 cores); the problem is one of programmability. Goal: to find a high-level way of parallel programming that can be compiled down to what I would have written in MPI; high productivity because we work in global terms, portable performance since the MPI code is derived. Pie in the sky: apart from distributed memory, we would like to cover task parallelism, multicore, and accelerators.

My claim. It is possible to have a global model of parallelism where MPI-type parallelism can be derived through formal reasoning about global objects. The model is versatile in dealing with applications and memory models beyond linear algebra on clusters. There are prospects for a high-productivity, high-performance way of parallel programming. No silver bullet: the algorithm designer still needs to design the data and work distributions.

Claim in CS terms. Given distributions on input, output, and instructions, a communication pattern becomes a formally derived first-class object; communication is an instantiation of this object. Consequences: communication is properly aggregated; data can be sent (asynchronously) when available; an inspector-executor model where the pattern is derived once, then applied many times.

Working title: I/MP: Integrative Model for Parallelism

Formal part

Theoretical model. A computation is a bipartite graph from input to output; the edges are instructions: $A = \langle \mathrm{In}, \mathrm{Out}, E \rangle$, where an edge $(\alpha,\beta)\in E$ with $\alpha\in\mathrm{In}$, $\beta\in\mathrm{Out}$ is an elementary computation between data elements. This models a kernel (not a whole code), possibly even less: compare supersteps in BSP.

Parallel computation: $A = A_1 \cup \cdots \cup A_P$, where $A_p = \langle \mathrm{In}_p, \mathrm{Out}_p, E_p \rangle$, $\mathrm{In} = \bigcup_p \mathrm{In}_p$, $\mathrm{Out} = \bigcup_p \mathrm{Out}_p$, $E = \bigcup_p E_p$. The partitions are possibly non-disjoint (redundant work). Example: if all $\mathrm{In}_p$, $\mathrm{Out}_p$ are identical, this is shared memory. Note the explicit distribution of work, independently of the data.
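
A minimal sketch of these definitions in Python (my own naming, not part of the model as presented): a kernel is a set of edges over hashable data elements such as ("x", j), and each processor gets the triple of its local input, local output, and local work.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Kernel:
        In: frozenset    # global set of input data elements
        Out: frozenset   # global set of output data elements
        E: frozenset     # edges (alpha, beta) with alpha in In, beta in Out

    @dataclass(frozen=True)
    class LocalKernel:
        In_p: frozenset  # local part of the input  (data distribution)
        Out_p: frozenset # local part of the output (data distribution)
        E_p: frozenset   # local part of the work   (work distribution)

    def partition(kernel, in_dist, out_dist, work_dist, P):
        """Split a kernel over P processors; each *_dist maps a rank p
        to the set that p owns; the sets may overlap (redundant work)."""
        return [LocalKernel(frozenset(in_dist(p)),
                            frozenset(out_dist(p)),
                            frozenset(work_dist(p)) & kernel.E)
                for p in range(P)]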

Where does communication come from? Define the total input/output of a processor: $\mathrm{In}(E_p) = \{\alpha : (\alpha,\beta)\in E_p\}$, $\mathrm{Out}(E_p) = \{\beta : (\alpha,\beta)\in E_p\}$. These are not the same as $\mathrm{In}_p$, $\mathrm{Out}_p$: $\mathrm{In}_p$ is the local part of the input, $\mathrm{In}(E_p)$ the total needed input; $\mathrm{Out}_p$ is the local part of the output, $\mathrm{Out}(E_p)$ the total computed output. Communication: $\mathrm{In}(E_p)\setminus\mathrm{In}_p$ is communicated in before the computation; $\mathrm{Out}(E_p)\setminus\mathrm{Out}_p$ is communicated out after the computation. Example: owner computes means $\mathrm{Out}(E_p) = \mathrm{Out}_p$. Example: embarrassingly parallel means $\mathrm{In}(E_p) = \mathrm{In}_p$, $\mathrm{Out}(E_p) = \mathrm{Out}_p$, with everything disjoint. By defining the distributions, communication becomes formally derivable.
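
Continuing that sketch (again with hypothetical names), the derived sets $\mathrm{In}(E_p)$, $\mathrm{Out}(E_p)$ and the communicated differences fall out of the local edge set directly:

    def needed_input(E_p):
        """In(E_p): every input element referenced by the local edges."""
        return frozenset(alpha for (alpha, beta) in E_p)

    def computed_output(E_p):
        """Out(E_p): every output element produced by the local edges."""
        return frozenset(beta for (alpha, beta) in E_p)

    def communication(local):
        """What processor p must receive before, and send after, computing."""
        recv = needed_input(local.E_p) - local.In_p      # In(E_p) \ In_p
        send = computed_output(local.E_p) - local.Out_p  # Out(E_p) \ Out_p
        return recv, send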

Object view of communication. Analyze all computation edges in terms of source/target processors $p, q$. The sum over all $p, q$ of these is an object describing the total communication in a kernel. Compare a PETSc VecScatter object: it describes what goes where, not how. Compare active messages: communication and computation in one. Communication objects are derivable from the algorithm; they are not defined during the execution of the communication: the inspector/executor paradigm.
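
As a sketch of this object view (my own construction, loosely modeled on what a VecScatter-like inspector does; it reuses communication() from the previous sketch), the per processor-pair traffic can be aggregated once into a reusable object:

    from collections import defaultdict

    def build_comm_object(local_kernels, owner):
        """Inspector: derive the communication pattern of a kernel.
        local_kernels : list of LocalKernel, one per processor
        owner         : maps a data element to the processor storing it
        Returns {(q, p): elements that q must send to p}."""
        pattern = defaultdict(set)
        for p, loc in enumerate(local_kernels):
            recv, _ = communication(loc)
            for alpha in recv:
                pattern[(owner(alpha), p)].add(alpha)
        return dict(pattern)

An executor would then drive the actual (possibly asynchronous) messages from this pattern on every application of the kernel, without re-deriving it.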

Examples

What am I going to show you? The algorithm is mathematics. Data distributions are easy to design. The communication is derived through formal reasoning. Some distributions lead to efficient communication, some don't: we still need a smart programmer. Today I'm only showing you math, but I think it's implementable.

Example: matrix-vector product. $\forall i: y_i = \sum_j a_{ij} x_j$. Split the computation into temporaries and a reduction: $\forall i: y_i = \sum_j a_{ij} x_j = \sum_j t_{ij}$, where $t_{ij} = a_{ij} x_j$. Work distribution: $t_{ij}$ is computed where $a_{ij}$ lives. We will derive the communication for a row-distributed and a column-distributed matrix.
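
In the toy encoding introduced earlier, the two kernels of the product could look as follows; the element naming ("x", j), ("t", i, j), ("y", i) is an assumption of the sketch, not notation from the slides.

    def mvp_kernels(n):
        """Dense n-by-n matrix-vector product as two bipartite kernels:
        first t_ij <- a_ij * x_j, then y_i <- sum_j t_ij."""
        xs = frozenset(("x", j) for j in range(n))
        ts = frozenset(("t", i, j) for i in range(n) for j in range(n))
        ys = frozenset(("y", i) for i in range(n))
        product = Kernel(In=xs, Out=ts,
                         E=frozenset((("x", j), ("t", i, j))
                                     for i in range(n) for j in range(n)))
        reduction = Kernel(In=ts, Out=ys,
                           E=frozenset((("t", i, j), ("y", i))
                                       for i in range(n) for j in range(n)))
        return product, reduction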

MVP by rows, computing the temporaries. Data: $\mathrm{In}_p = x(I_p)$. Work: $E_p = \{\tau_{ij} : i \in I_p\}$ with $\tau_{ij} := [\,t_{ij} \leftarrow a_{ij} x_j\,]$. Derived fact: $\mathrm{In}(E_p) = x(\ast)$: an allgather.

MVP by rows, the reduction. Data: $\mathrm{In}_p = t(I_p, \ast)$. Work: $E_p = \{\eta_i : i \in I_p\}$ with $\eta_i := [\,y_i \leftarrow \sum_j t_{ij}\,]$. Derived fact: $\mathrm{In}(E_p) = t(I_p, \ast)$: a local operation.

MVP by columns, the $t_{ij}$ calculation: $\mathrm{In}(E_p) = x(I_p)$ and $\mathrm{In}_p = x(I_p)$, so this step is local; $\mathrm{Out}(E_p) = t(\ast, I_p)$.

MVP by columns, the reduction: $\mathrm{In}(E_p) = t(I_p, \ast)$, but now $\mathrm{In}_p = t(\ast, I_p)$; the derived communication is a global transpose, or reduce-scatter.
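
Running the derivation of the sketch on both distributions (a toy check, not the author's implementation; it uses Kernel, LocalKernel, communication and mvp_kernels from the earlier sketches) reproduces the two conclusions: by rows every processor needs all of x, by columns the reduction needs rows of t computed elsewhere.

    def block(n, P, p):
        """Index block I_p of a uniform block distribution."""
        return set(range(p * n // P, (p + 1) * n // P))

    n, P, p = 8, 4, 1
    product, reduction = mvp_kernels(n)
    I_p = block(n, P, p)

    # By rows: p owns x(I_p) and computes t(I_p, *).
    rows = LocalKernel(
        In_p=frozenset(("x", j) for j in I_p),
        Out_p=frozenset(("t", i, j) for i in I_p for j in range(n)),
        E_p=frozenset(e for e in product.E if e[1][1] in I_p))
    recv, _ = communication(rows)
    print(sorted(recv))   # every x_j with j outside I_p: an allgather

    # By columns: p holds t(*, I_p) but reduces y(I_p), which needs t(I_p, *).
    col_reduce = LocalKernel(
        In_p=frozenset(("t", i, j) for i in range(n) for j in I_p),
        Out_p=frozenset(("y", i) for i in I_p),
        E_p=frozenset(e for e in reduction.E if e[1][1] in I_p))
    recv, _ = communication(col_reduce)
    print(sorted(recv))   # t(I_p, j) with j outside I_p: transpose / reduce-scatter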

Sparse MVP. A little more notation, but no essential difference.

LU solution. The derivation of messages is identical, but the communication is no longer data parallel: a VecPipeline instead of a VecScatter; dataflow. Note: IMP kernels do not imply synchronization: dataflow, supply-driven execution. Example: one IMP kernel per level in an FFT; butterfly operations execute based on availability of data (a bit like DAG scheduling, but not entirely).

Complicated example: n-body problems. Tree computation framework following Katzenelson (1989). The field due to cell $i$ on level $l$ is given by $g(l,i) = \sum_{j\in C(i)} g(l+1,j)$, where $C(i)$ denotes the set of children of cell $i$. The field felt by cell $i$ on level $l$ is given by $f(l,i) = f(l-1,p(i)) + \sum_{j\in I_l(i)} g(l,j)$, where $p(i)$ is the parent cell of $i$ and $I_l(i)$ the interaction region of $i$.

Implementation. $E^{(g)} = E^\tau \cup E^\gamma$ with $E^\tau = \{\tau^l_{ij}\}$, $E^\gamma = \{\gamma^l_i\}$:
$\forall i\,\forall j\in C(i): \tau^l_{ij} = [\, t^l_{ij} \leftarrow g^{l+1}_j \,]$,
$\forall i: \gamma^l_i = [\, g^l_i \leftarrow \sum_{j\in C(i)} t^l_{ij} \,]$.
$E^{(f)} = E^\rho \cup E^\sigma \cup E^\phi$ with $E^\rho = \{\rho^l_i\}$, $E^\sigma = \{\sigma^l_{ij}\}$, $E^\phi = \{\phi^l_i\}$:
$\forall i: \rho^l_i = [\, r^l_i \leftarrow f^{l-1}_{p(i)} \,]$,
$\forall i\,\forall j\in I_l(i): \sigma^l_{ij} = [\, s^l_{ij} \leftarrow g^l_j \,]$,
$\forall i: \phi^l_i = [\, f^l_i \leftarrow r^l_i + \sum_{j\in I_l(i)} s^l_{ij} \,]$.
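
A sketch of the same edge sets in the toy encoding, for a 1-D binary tree where the children of cell (l, i) are (l+1, 2i) and (l+1, 2i+1); the interaction region I_l(i) is left as a user-supplied function, and the temporaries t and s are kept explicit as on the slide. This is my own illustration, not code from the talk.

    def children(l, i):
        return [(l + 1, 2 * i), (l + 1, 2 * i + 1)]

    def parent(l, i):
        return (l - 1, i // 2)

    def upward_edges(levels):
        """E^(g): t^l_ij <- g^{l+1}_j (tau), g^l_i <- sum_j t^l_ij (gamma)."""
        E = set()
        for l in range(levels - 1):
            for i in range(2 ** l):
                for (lc, j) in children(l, i):
                    E.add((("g", lc, j), ("t", l, i, j)))   # tau
                    E.add((("t", l, i, j), ("g", l, i)))    # gamma
        return frozenset(E)

    def downward_edges(levels, interaction):
        """E^(f): r^l_i <- f^{l-1}_{p(i)} (rho), s^l_ij <- g^l_j (sigma),
        f^l_i <- r^l_i + sum_j s^l_ij (phi)."""
        E = set()
        for l in range(1, levels):
            for i in range(2 ** l):
                E.add((("f", *parent(l, i)), ("r", l, i)))  # rho
                E.add((("r", l, i), ("f", l, i)))           # phi, parent term
                for j in interaction(l, i):
                    E.add((("g", l, j), ("s", l, i, j)))    # sigma
                    E.add((("s", l, i, j), ("f", l, i)))    # phi, interaction terms
        return frozenset(E)

Given a distribution of cells over processors (say, by subtree), the same communication() derivation as before would determine which g values have to cross processor boundaries.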

Extension to other memory models

Attached accelerators

Decompose GPU-GPU operations $e$ as $e = A^{-1} e' A$, where $A$ is the memory copy from CPU to GPU, so that $e' = A e A^{-1}$ acts between CPUs: a quotient graph. We have thus derived the MPI communication of an algorithm with distributed GPUs through formal means.

Shared memory multicore. Processors still operate on local stores (L1/L2 cache), so it's really distributed memory, except with the shared memory as the network. Again, factor out physical data movement from logical data movement (this still needs work).

Conclusion

Conclusion. Integrative Model for Parallelism (also: Interface without Message Passing). Based on explicit distribution of everything that needs distributing; communication objects are derived through formal reasoning; it trivially covers distributed memory and can be interpreted for other models. Theoretical questions remain: correctness, deadlock, relation to DAG models, interpretation as a functional model, relation to BSP, suitability for graph algorithms. Practical question: how do you implement this? (Inspiration from PETSc, ParPre, Charm++, Elemental.)