Overview: Synchronous Computations
- barriers: linear, tree-based and butterfly
- degrees of synchronization
- synchronous example 1: Jacobi iterations - serial and parallel code, performance analysis
- synchronous example 2: heat distribution - serial and parallel code; comparison of block and strip partitioning methods; safety; ghost points
- (synchronous example 3) advection - Assignment 1
Ref: Chapter 6, Wilkinson and Allen
COMP4300/8300 L12: Synchronous Computations, 2017
Barriers
- barrier: a point at which all processes must wait until all other processes have reached that point, e.g. MPI_Barrier(MPI_Comm comm)
- mutual exclusion: prevents other processes from entering the following region if another process is already in that region
  - common in shared memory parallel programs
  - necessary for some MPI-2 operations
- both are possible sources of overhead
Barrier
[Figure: processes 0, 1, 2, ..., n-1 reach the barrier at different times; early arrivals wait (active vs. waiting time) until all processes have arrived]
Counter-based or Linear Barriers
[Figure: processes 0, 1, ..., n-1 each call Barrier(); a central counter C is incremented on each arrival and checked against n]
- one process counts the arrival of the other processes
- when all processes have arrived, they are each sent a release message
Implementation
- arrival phase: each process sends a message to a central counter (the master)
- departure phase: each process receives a release message from the master

Master:
  for (i = 0; i < p-1; i++) recv(any);   /* arrival phase   */
  for (i = 0; i < p-1; i++) send(i);     /* departure phase */

Slave processes, at the barrier:
  send(master);
  recv(master);

- implementations must handle possible time delays, e.g. two barriers in quick succession
- cost is O(p)
Tree-Based Barriers
[Figure: 8 processes (0-7); arrival messages combine up a tree, departure messages fan back down]
- note: a broadcast alone does not ensure synchronization
- cost is 2 lg p steps, i.e. O(lg p)
Butterfly Barrier (Butterfly/Omega Network)
[Figure: 8 processes over 3 stages; at stage s each process exchanges with the process whose rank differs in bit s-1]
- cost is lg p steps, i.e. O(lg p)
Degrees of Synchronization
- ranges from fully to loosely synchronous; the more synchronous your computation, the more potential overhead
- SIMD: synchronized at the instruction level
  - provides ease of programming (one program)
  - well suited to data decomposition
  - applicable to many numerical problems
- the forall statement was introduced to specify data-parallel operations:

forall (i = 0; i < n; i++) {
  /* data parallel work */
}
Synchronous Example: Jacobi Iterations
- the Jacobi iteration solves a system of linear equations iteratively:

a_{0,0} x_0   + a_{0,1} x_1   + a_{0,2} x_2   + ... + a_{0,n-1} x_{n-1}   = b_0
a_{1,0} x_0   + a_{1,1} x_1   + a_{1,2} x_2   + ... + a_{1,n-1} x_{n-1}   = b_1
a_{2,0} x_0   + a_{2,1} x_1   + a_{2,2} x_2   + ... + a_{2,n-1} x_{n-1}   = b_2
...
a_{n-1,0} x_0 + a_{n-1,1} x_1 + a_{n-1,2} x_2 + ... + a_{n-1,n-1} x_{n-1} = b_{n-1}

- where there are n equations and n unknowns (x_0, x_1, x_2, ..., x_{n-1})
Jacobi Iterations
- consider equation i:
  a_{i,0} x_0 + a_{i,1} x_1 + a_{i,2} x_2 + ... + a_{i,n-1} x_{n-1} = b_i
- which we can re-cast as:
  x_i = (1/a_{i,i}) [b_i - (a_{i,0} x_0 + a_{i,1} x_1 + ... + a_{i,i-1} x_{i-1} + a_{i,i+1} x_{i+1} + ... + a_{i,n-1} x_{n-1})]
  i.e. x_i = (1/a_{i,i}) [b_i - Σ_{j≠i} a_{i,j} x_j]
- strategy: guess x, then iterate and hope it converges!
- converges if the matrix is diagonally dominant: Σ_{j≠i} |a_{i,j}| < |a_{i,i}|
- terminate when convergence is achieved: ||x^t - x^{t-1}|| < error tolerance
Sequential Jacobi Code
ignoring convergence testing:

for (i = 0; i < n; i++)
  x[i] = b[i];
for (iter = 0; iter < max_iter; iter++) {
  for (i = 0; i < n; i++) {
    sum = -a[i][i] * x[i];
    for (j = 0; j < n; j++)
      sum = sum + a[i][j] * x[j];
    new_x[i] = (b[i] - sum) / a[i][i];
  }
  for (i = 0; i < n; i++)
    x[i] = new_x[i];
}
Parallel Jacobi Code
ignoring convergence testing and assuming parallelisation over n processes (process i computes x[i]):

x[i] = b[i];
for (iter = 0; iter < max_iter; iter++) {
  sum = -a[i][i] * x[i];
  for (j = 0; j < n; j++)
    sum = sum + a[i][j] * x[j];
  new_x[i] = (b[i] - sum) / a[i][i];
  broadcast_gather(&new_x[i], new_x);
  global_barrier();
  for (j = 0; j < n; j++)
    x[j] = new_x[j];
}

broadcast_gather() sends the local new_x[i] to all processes and collects their new values
Broadcast Gather
[Figure: processes 0, 1, ..., n-1 each call broadcast_gather(); each sends its new_x[i] from its send buffer into every other process's receive buffer]
can be (simplistically) implemented as:

for (j = 0; j < n; j++)
  send(&new_x[i], /* to process */ j);
for (j = 0; j < n; j++)
  recv(&new_x[j], /* from process */ j);

Question: do we really need the barrier as well as this?
Partitioning
- normally the number of processes is much less than the number of data items
- block partitioning: allocate groups of consecutive unknowns to processes
- cyclic partitioning: allocate unknowns in a round-robin fashion
- analysis: τ iterations, n/p unknowns per process
  - computation decreases with p: t_comp = τ (2n+4)(n/p) t_f
  - communication increases with p: t_comm = τ p(t_s + (n/p) t_w) = τ (p t_s + n t_w)
  - total has an overall minimum: t_tot = τ ((2n+4)(n/p) t_f + p t_s + n t_w)
- question: can we do an all-gather faster than p t_s + n t_w?
Parallel Jacobi Iteration Time
parameters: t_s = 10^5 t_f, t_w = 50 t_f, n = 1000
[Figure: execution time vs. number of processors p = 4 ... 32; computation time falls with p, communication time rises, and the overall time has a minimum]
Locally Synchronous Example: Heat Distribution Problem
- consider a metal sheet with a fixed temperature along the sides but unknown temperatures in the middle; find the temperatures in the middle
- finite difference approximation to the Laplace equation:
  ∂²T(x,y)/∂x² + ∂²T(x,y)/∂y² = 0
  [T(x+δx,y) - 2T(x,y) + T(x-δx,y)]/δx² + [T(x,y+δy) - 2T(x,y) + T(x,y-δy)]/δy² = 0
- assuming an even grid (i.e. δx = δy) of n x n points (denoted h_{i,j}), the temperature at any point is the average of the surrounding points:
  h_{i,j} = (h_{i-1,j} + h_{i+1,j} + h_{i,j-1} + h_{i,j+1}) / 4
- the problem is very similar to the Game of Life, i.e. what happens in a cell depends upon its neighbours
Array Ordering
[Figure: the mesh stored row-major as x_1, x_2, ..., x_k, x_{k+1}, ..., x_{k^2}, with row length k; point x_i has neighbours x_{i-1}, x_{i+1}, x_{i-k}, x_{i+k}]
- we will solve iteratively: x_i = (x_{i-1} + x_{i+1} + x_{i-k} + x_{i+k}) / 4
- but this problem may also be written as a system of linear equations:
  x_{i-k} + x_{i-1} - 4x_i + x_{i+1} + x_{i+k} = 0
Heat Equation: Sequential Code
- assume a fixed number of iterations and a square mesh
- beware of what happens at the edges!

for (iter = 0; iter < max_iter; iter++) {
  for (i = 1; i < n; i++)
    for (j = 1; j < n; j++)
      g[i][j] = 0.25 * (h[i-1][j] + h[i+1][j] + h[i][j-1] + h[i][j+1]);
  for (i = 1; i < n; i++)
    for (j = 1; j < n; j++)
      h[i][j] = g[i][j];
}
Heat Equation: Parallel Code
one point per process, assuming non-blocking sends:

for (iter = 0; iter < max_iter; iter++) {
  g = 0.25 * (w + x + y + z);
  send(&g, (i-1, j));
  send(&g, (i+1, j));
  send(&g, (i, j-1));
  send(&g, (i, j+1));
  recv(&w, (i-1, j));
  recv(&x, (i+1, j));
  recv(&y, (i, j-1));
  recv(&z, (i, j+1));
}

- the sends and receives provide a local barrier: each process synchronizes with the 4 surrounding processes
Heat Equation: Partitioning
- normally more than one point per process
- option of either block or strip partitioning
[Figure: block partitioning assigns each of processes 0 ... p-1 a square sub-grid; strip partitioning assigns each a group of whole columns]
Block/Strip Communication Comparison
- block partitioning: four edges exchanged (n² data points, p processes, blocks of side n/√p)
  t_comm = 8(t_s + (n/√p) t_w)
- strip partitioning: two edges exchanged, each of n points
  t_comm = 4(t_s + n t_w)
[Figure: a block of side n/√p communicates across four edges; a strip of width n/p communicates across two edges of length n]
Block/Strip Optimum
- block communication is larger than strip communication if:
  8(t_s + (n/√p) t_w) > 4(t_s + n t_w)
  i.e. if t_s > n(1 - 2/√p) t_w
[Figure: crossover curve of t_s against number of processes p; strip partitioning is best above the curve, block partitioning best below]
Safety and Deadlock
- with all processes sending and then receiving data, the code is unsafe: it relies on local buffering in the send() function
- potential for deadlock (as in Prac 1, Ex 3)!
- alternative #1: re-order sends and receives, e.g. for strip partitioning:

if ((myid % 2) == 0) {
  send(&g[1][1],   n, /* process */ i-1);
  recv(&h[1][0],   n, /* process */ i-1);
  send(&g[1][n],   n, /* process */ i+1);
  recv(&h[1][n+1], n, /* process */ i+1);
} else {
  recv(&h[1][0],   n, /* process */ i-1);
  send(&g[1][1],   n, /* process */ i-1);
  recv(&h[1][n+1], n, /* process */ i+1);
  send(&g[1][n],   n, /* process */ i+1);
}
Alternative #2: Asynchronous Communication using Ghost Points
- assign extra receive buffers for edges where data is exchanged
- typically these are implemented as extra rows and columns in each process's local array (known as a halo)
- can use asynchronous calls, e.g. MPI_Isend()
[Figure: processes i and i+1 each hold ghost points alongside their local data; edge data is copied into the neighbouring process's ghost points]