Overview: Synchronous Computations


- barriers: linear, tree-based and butterfly
- degrees of synchronization
- synchronous example 1: Jacobi iterations
  - serial and parallel code, performance analysis
- synchronous example 2: heat distribution
  - serial and parallel code
  - comparison of block and strip partitioning methods
- safety
- ghost points
- synchronous example 3: advection (Assignment 1)

Ref: Chapter 6, Wilkinson and Allen

Barriers

- barrier: a point at which all processes must wait until every other process has also reached that point

      MPI_Barrier(MPI_Comm comm);

- mutual exclusion: a mechanism that prevents a process from entering a code region while another process is already in that region; common in shared-memory parallel programs
- barriers are necessary for some MPI-2 operations
- both are possible sources of overhead
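Timing a barrier is a simple way to see its overhead. A minimal sketch using only standard MPI calls (the timing idea is illustrative, not from the slides):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* ... each process does some (possibly unequal) work here ... */

        double t0 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);      /* no process passes until all arrive */
        double waited = MPI_Wtime() - t0; /* rough per-process barrier cost */

        printf("rank %d waited %.6f s at the barrier\n", rank, waited);
        MPI_Finalize();
        return 0;
    }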

Barrier

[Figure: processes 0, 1, 2, ..., n-1 over time; each is active until it reaches the barrier, then waits until the last process arrives.]

Counter-based or Linear Barriers

[Figure: processes 0, 1, ..., n-1 each call Barrier(); a central counter C is incremented on each arrival and checked against n.]

- one process counts the arrival of the other processes
- when all processes have arrived, they are each sent a release message

Implementation

- arrival phase: each process sends a message to a central counter (the master)
- departure phase: each process receives a release message from the master

Master:

    /* arrival phase */
    for (i = 0; i < p-1; i++)
        recv(any);
    /* departure phase */
    for (i = 0; i < p-1; i++)
        send(i);

Slave processes:

    Barrier:
        send(master);
        recv(master);

- implementations must handle possible time delays, e.g. two barriers in quick succession
- cost is O(p)
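A minimal MPI rendering of this counter-based barrier, assuming rank 0 plays the master (the tag values and token variable are illustrative):

    #include <mpi.h>

    /* Linear barrier: rank 0 counts p-1 arrivals, then releases everyone.
       Distinct tags separate the arrival and departure phases. */
    void linear_barrier(MPI_Comm comm) {
        int rank, p, i, token = 0;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        if (rank == 0) {
            for (i = 1; i < p; i++)   /* arrival phase */
                MPI_Recv(&token, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                         comm, MPI_STATUS_IGNORE);
            for (i = 1; i < p; i++)   /* departure phase */
                MPI_Send(&token, 1, MPI_INT, i, 1, comm);
        } else {
            MPI_Send(&token, 1, MPI_INT, 0, 0, comm);  /* announce arrival */
            MPI_Recv(&token, 1, MPI_INT, 0, 1, comm,   /* wait for release */
                     MPI_STATUS_IGNORE);
        }
    }

The master serialises 2(p-1) messages, which is where the O(p) cost comes from.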

Tree-Based Barriers

[Figure: 8 processes (0-7); arrival messages combine pairwise up a tree to process 0, then departure messages fan back down the tree in reverse.]

- note: a broadcast alone does not ensure synchronization
- cost is 2 lg p, or O(lg p)
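A sketch of one way to realise this in MPI, combining arrivals up a binomial tree rooted at rank 0 and fanning releases back down (my own construction, not code from the slides; it works for any p):

    #include <mpi.h>

    /* Tree barrier: lg p arrival stages up the tree, lg p departure
       stages back down, matching the 2 lg p cost quoted above. */
    void tree_barrier(MPI_Comm comm) {
        int rank, p, stride, token = 0;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        for (stride = 1; stride < p; stride *= 2) {    /* arrival phase */
            if (rank % (2*stride) == stride)
                MPI_Send(&token, 1, MPI_INT, rank - stride, 0, comm);
            else if (rank % (2*stride) == 0 && rank + stride < p)
                MPI_Recv(&token, 1, MPI_INT, rank + stride, 0,
                         comm, MPI_STATUS_IGNORE);
        }
        for (stride /= 2; stride >= 1; stride /= 2) {  /* departure phase */
            if (rank % (2*stride) == 0 && rank + stride < p)
                MPI_Send(&token, 1, MPI_INT, rank + stride, 1, comm);
            else if (rank % (2*stride) == stride)
                MPI_Recv(&token, 1, MPI_INT, rank - stride, 1,
                         comm, MPI_STATUS_IGNORE);
        }
    }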

Butterfly Barrier (Butterfly/Omega Network)

[Figure: 8 processes (0-7); in the 1st stage each process pairs with the process at distance 1, in the 2nd stage at distance 2, in the 3rd stage at distance 4.]

- cost is 2 lg p, or O(lg p)
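In the butterfly pattern, at stage s each process exchanges a message with the partner whose rank differs in bit s. A minimal MPI sketch, assuming p is a power of two (MPI_Sendrecv is a standard call; the token variables are illustrative):

    #include <mpi.h>

    /* Butterfly barrier: after lg p pairwise exchange stages,
       every process has (transitively) heard from every other. */
    void butterfly_barrier(MPI_Comm comm) {
        int rank, p, stride;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        for (stride = 1; stride < p; stride *= 2) {
            int partner = rank ^ stride;   /* flip one bit of the rank */
            int out = 0, in;
            MPI_Sendrecv(&out, 1, MPI_INT, partner, 0,
                         &in,  1, MPI_INT, partner, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }

No separate departure phase is needed: the final stage of exchanges itself releases every process.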

Degrees of Synchronization

- computations range from fully to loosely synchronous
- the more synchronous your computation, the more potential overhead
- SIMD: synchronized at the instruction level
  - provides ease of programming (one program)
  - well suited to data decomposition
  - applicable to many numerical problems
- the forall statement was introduced to specify data-parallel operations:

    forall (i = 0; i < n; i++) {
        /* data parallel work */
    }
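C itself has no forall, but the same data-parallel intent can be sketched with an OpenMP pragma (a standard C extension; the analogy is mine, not the slides'):

    #include <omp.h>

    void axpy(double *y, const double *x, double a, int n) {
        /* every iteration is independent, so the loop may run in parallel */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = y[i] + a * x[i];
    }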

Synchronous Example: Jacobi Iterations

the Jacobi iteration solves a system of n linear equations in n unknowns $(x_0, x_1, x_2, \ldots, x_{n-1})$ iteratively:

$$\begin{aligned}
a_{0,0} x_0 + a_{0,1} x_1 + a_{0,2} x_2 + \cdots + a_{0,n-1} x_{n-1} &= b_0 \\
a_{1,0} x_0 + a_{1,1} x_1 + a_{1,2} x_2 + \cdots + a_{1,n-1} x_{n-1} &= b_1 \\
a_{2,0} x_0 + a_{2,1} x_1 + a_{2,2} x_2 + \cdots + a_{2,n-1} x_{n-1} &= b_2 \\
&\;\;\vdots \\
a_{n-1,0} x_0 + a_{n-1,1} x_1 + \cdots + a_{n-1,n-1} x_{n-1} &= b_{n-1}
\end{aligned}$$

Jacobi Iterations

- consider equation i:

$$a_{i,0} x_0 + a_{i,1} x_1 + a_{i,2} x_2 + \cdots + a_{i,n-1} x_{n-1} = b_i$$

- which we can re-cast as:

$$x_i = \frac{1}{a_{i,i}} \Big[ b_i - \sum_{j \ne i} a_{i,j} x_j \Big]$$

- strategy: guess x, then iterate and hope it converges!
- it converges if the matrix is diagonally dominant: $\sum_{j \ne i} |a_{i,j}| < |a_{i,i}|$ for each row i
- terminate when convergence is achieved: $\| x^{(t)} - x^{(t-1)} \| <$ error tolerance

Sequential Jacobi Code

ignoring convergence testing:

    for (i = 0; i < n; i++)
        x[i] = b[i];                    /* initial guess */
    for (iter = 0; iter < max_iter; iter++) {
        for (i = 0; i < n; i++) {
            sum = -a[i][i] * x[i];      /* cancel the diagonal term added below */
            for (j = 0; j < n; j++)
                sum = sum + a[i][j] * x[j];
            new_x[i] = (b[i] - sum) / a[i][i];
        }
        for (i = 0; i < n; i++)
            x[i] = new_x[i];
    }
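The slide omits the convergence test; one possible sketch uses the infinity norm of the change and a hypothetical tolerance tol (my addition, not the course code):

    #include <math.h>

    /* Returns 1 when max_i |new_x[i] - x[i]| < tol, i.e. when the
       iteration has effectively stopped changing. */
    int has_converged(const double *x, const double *new_x, int n, double tol) {
        double diff = 0.0;
        for (int i = 0; i < n; i++) {
            double d = fabs(new_x[i] - x[i]);
            if (d > diff)
                diff = d;
        }
        return diff < tol;
    }

Calling this after each sweep and breaking out of the iter loop when it returns 1 completes the algorithm.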

Parallel Jacobi Code

ignoring convergence testing, and assuming parallelisation over n processes (process i owns x[i]):

    x[i] = b[i];
    for (iter = 0; iter < max_iter; iter++) {
        sum = -a[i][i] * x[i];
        for (j = 0; j < n; j++)
            sum = sum + a[i][j] * x[j];
        new_x[i] = (b[i] - sum) / a[i][i];
        broadcast_gather(&new_x[i], new_x);
        global_barrier();
        for (i = 0; i < n; i++)
            x[i] = new_x[i];
    }

broadcast_gather() sends the local new_x[i] to all processes and collects their new values

Broadcast Gather

[Figure: process j places its x_j in a send buffer; after broadcast_gather(), every process's receive buffer holds x_0 ... x_{n-1}.]

can be (simplistically) implemented as:

    for (j = 0; j < n; j++)
        send(&new_x[i], /* to process */ j);
    for (j = 0; j < n; j++)
        recv(&new_x[j], /* from process */ j);

Question: do we really need the barrier as well as this?
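In MPI this operation is MPI_Allgather, whose library implementation is typically faster than the n point-to-point rounds above; a minimal sketch (buffer names follow the slide):

    #include <mpi.h>

    /* Each process contributes its one element; afterwards every
       process holds the complete vector new_x[0..n-1]. */
    void broadcast_gather(double *local_xi, double *new_x, MPI_Comm comm) {
        MPI_Allgather(local_xi, 1, MPI_DOUBLE,  /* one value sent per rank     */
                      new_x,    1, MPI_DOUBLE,  /* one value received per rank */
                      comm);
    }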

Partitioning

- normally the number of processes is much less than the number of data items
- block partitioning: allocate groups of consecutive unknowns to processes
- cyclic partitioning: allocate unknowns in a round-robin fashion
- analysis, with τ iterations and n/p unknowns per process:
  - computation decreases with p: $t_{comp} = \tau (2n+4)(n/p) t_f$
  - communication increases with p: $t_{comm} = \tau \, p \, (t_s + (n/p) t_w) = \tau (p t_s + n t_w)$
  - the total has an overall minimum (see the derivation below): $t_{tot} = \tau \left( (2n+4)(n/p) t_f + p t_s + n t_w \right)$
- question: can we do an all-gather faster than $p t_s + n t_w$?
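Treating p as continuous and differentiating locates that minimum (this step is my addition, not on the slide):

$$\frac{\partial t_{tot}}{\partial p} = \tau \left( -\frac{(2n+4)\, n\, t_f}{p^2} + t_s \right) = 0
\quad\Rightarrow\quad
p_{opt} = \sqrt{\frac{(2n+4)\, n\, t_f}{t_s}}$$

so the best process count grows with the problem size n and shrinks as the message startup cost $t_s$ grows.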

Parallel Jacobi Iteration Time

[Figure: execution time versus number of processors p = 4 ... 32, for parameters $t_s = 10^5 t_f$, $t_w = 50 t_f$, $n = 1000$: computation time falls with p, communication time rises with p, and the overall time has a minimum in between.]

Locally Synchronous Example: Heat Distribution Problem

- consider a metal sheet with a fixed temperature along the sides but unknown temperatures in the middle: find the temperatures in the middle
- finite difference approximation to the Laplace equation:

$$\frac{\partial^2 T(x,y)}{\partial x^2} + \frac{\partial^2 T(x,y)}{\partial y^2} = 0$$

$$\frac{T(x+\delta x, y) - 2T(x,y) + T(x-\delta x, y)}{\delta x^2} + \frac{T(x, y+\delta y) - 2T(x,y) + T(x, y-\delta y)}{\delta y^2} = 0$$

- assuming an even grid (i.e. $\delta x = \delta y$) of $n \times n$ points, denoted $h_{i,j}$, the temperature at any point is the average of the surrounding points:

$$h_{i,j} = \frac{h_{i-1,j} + h_{i+1,j} + h_{i,j-1} + h_{i,j+1}}{4}$$

- the problem is very similar to the Game of Life: what happens in a cell depends upon its neighbours

Array Ordering

[Figure: the grid stored as a 1-D array $x_1, x_2, \ldots, x_{k^2}$ in row-major order with row length k; point $x_i$ has neighbours $x_{i-1}$ and $x_{i+1}$ in its row, and $x_{i-k}$ and $x_{i+k}$ in the rows above and below.]

- we will solve iteratively:

$$x_i = \frac{x_{i-1} + x_{i+1} + x_{i-k} + x_{i+k}}{4}$$

- but the problem may also be written as a system of linear equations:

$$x_{i-k} + x_{i-1} - 4 x_i + x_{i+1} + x_{i+k} = 0$$

Heat Equation: Sequential Code

- assume a fixed number of iterations and a square mesh
- beware of what happens at the edges!

    for (iter = 0; iter < max_iter; iter++) {
        for (i = 1; i < n; i++)
            for (j = 1; j < n; j++)
                g[i][j] = 0.25 * (h[i-1][j] + h[i+1][j] +
                                  h[i][j-1] + h[i][j+1]);
        for (i = 1; i < n; i++)
            for (j = 1; j < n; j++)
                h[i][j] = g[i][j];
    }

Heat Equation: Parallel Code

one point per process, assuming non-blocking sends:

    for (iter = 0; iter < max_iter; iter++) {
        g = 0.25 * (w + x + y + z);
        send(&g, (i-1, j));
        send(&g, (i+1, j));
        send(&g, (i, j-1));
        send(&g, (i, j+1));
        recv(&w, (i-1, j));
        recv(&x, (i+1, j));
        recv(&y, (i, j-1));
        recv(&z, (i, j+1));
    }

- the sends and receives provide a local barrier: each process synchronizes with the 4 surrounding processes

Heat Equation: Partitioning

- normally there is more than one point per process
- option of either block or strip partitioning

[Figure: block partitioning divides the grid among processes 0 to p-1 as squares; strip partitioning divides it among processes 0 to p-1 as contiguous strips.]

Block/Strip Communication Comparison

- block partitioning: four edges exchanged ($n^2$ data points, p processes):

$$t_{comm} = 8 \left( t_s + \frac{n}{\sqrt{p}} t_w \right)$$

- strip partitioning: two edges exchanged:

$$t_{comm} = 4 (t_s + n t_w)$$

[Figure: a block exchanges four edges of $n/\sqrt{p}$ points each; a strip exchanges two edges of n points each.]

Block/Strip Optimum

- block communication time exceeds strip communication time when:

$$8 \left( t_s + \frac{n}{\sqrt{p}} t_w \right) > 4 (t_s + n t_w)$$

i.e. when

$$t_s > n \left( 1 - \frac{2}{\sqrt{p}} \right) t_w$$

[Figure: in the (p, $t_s$) plane, the curve $t_s = n(1 - 2/\sqrt{p}) t_w$ separates the region where the strip partition is best (large $t_s$) from the region where the block partition is best (small $t_s$).]

Safety and Deadlock

- with all processes sending and then receiving data, the code is unsafe: it relies on local buffering in the send() function
- potential for deadlock (as in Prac 1, Ex 3)!
- alternative #1: re-order the sends and receives, e.g. for strip partitioning:

    if ((myid % 2) == 0) {
        send(&g[1][1], n, i-1);
        recv(&h[1][0], n, i-1);
        send(&g[1][n], n, i+1);
        recv(&h[1][n+1], n, i+1);
    } else {
        recv(&h[1][0], n, i-1);
        send(&g[1][1], n, i-1);
        recv(&h[1][n+1], n, i+1);
        send(&g[1][n], n, i+1);
    }
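In MPI the same hazard can also be removed without the even/odd reordering by using MPI_Sendrecv, which pairs each send with a receive so neither side blocks the other; a minimal sketch for one edge (the wrapper function and argument names are mine):

    #include <mpi.h>

    /* Exchange an edge of n doubles with one neighbour; MPI_Sendrecv
       cannot deadlock the way back-to-back send()/recv() pairs can
       when the system runs out of buffering. */
    void exchange_edge(double *out_edge, double *in_edge, int n,
                       int neighbour, MPI_Comm comm) {
        MPI_Sendrecv(out_edge, n, MPI_DOUBLE, neighbour, 0,
                     in_edge,  n, MPI_DOUBLE, neighbour, 0,
                     comm, MPI_STATUS_IGNORE);
    }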

Alternative #2: Asynchronous Communication using Ghost Points

- assign extra receive buffers for edges where data is exchanged
- typically these are implemented as extra rows and columns in each process's local array (known as a halo)
- can then use asynchronous calls, e.g. MPI_Isend()

[Figure: process i and process i+1 each hold a row of ghost points containing a copy of the neighbouring process's edge row.]
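A minimal sketch of such a halo exchange with non-blocking calls, assuming a strip decomposition with one ghost row above and below (the layout and variable names are illustrative; MPI_PROC_NULL can be passed for neighbours at the physical boundary):

    #include <mpi.h>

    /* h points to a (rows+2) x cols array stored row-major; row 0 and
       row rows+1 are the ghost rows holding the neighbours' edge data. */
    void halo_exchange(double *h, int rows, int cols,
                       int up, int down, MPI_Comm comm) {
        MPI_Request req[4];
        /* post receives into the ghost rows first */
        MPI_Irecv(&h[0],             cols, MPI_DOUBLE, up,   0, comm, &req[0]);
        MPI_Irecv(&h[(rows+1)*cols], cols, MPI_DOUBLE, down, 1, comm, &req[1]);
        /* send our first and last interior rows to the neighbours */
        MPI_Isend(&h[1*cols],        cols, MPI_DOUBLE, up,   1, comm, &req[2]);
        MPI_Isend(&h[rows*cols],     cols, MPI_DOUBLE, down, 0, comm, &req[3]);
        /* updates of interior points could overlap communication here */
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    }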