Overview: Synchronous Computations

Leslie Daniel

1 Overview: Synchronous Computations
- barriers: linear, tree-based and butterfly
- degrees of synchronization
- synchronous example 1: Jacobi Iterations
  - serial and parallel code, performance analysis
- synchronous example 2: Heat Distribution
  - serial and parallel code
  - comparison of block and strip partitioning methods
  - safety
  - ghost points
- (synchronous example 3) advection - Assignment 1
Ref: Chapter 6: Wilkinson and Allen
COMP4300/8300 L12: Synchronous Computations

2 Barriers
- barrier: a point at which all processes must wait until all other processes have reached that point
      MPI_Barrier(MPI_Comm comm);
- mutual exclusion: a barrier that prevents other processes from entering the following region if another process is already in that region
  - common in shared memory parallel programs
  - necessary for some MPI-2 operations
- both are possible sources of overhead

3 Barrier
[Figure: processes 0 to n-1 plotted against time, each shown as active and then waiting at the barrier.]

4 Counter-based or Linear Barriers
[Figure: processes 0, 1, ..., n-1 each call Barrier(); a counter C is incremented on each arrival and checked for n.]
- one process counts the arrival of the other processes
- when all processes have arrived, they are each sent a release message

5 Implementation
- arrival phase: process sends message to central counter
- departure phase: process receives message from central counter

Master:
    /* arrival phase */
    for (i = 0; i < p - 1; i++)
        recv(P_any);
    /* departure phase */
    for (i = 0; i < p - 1; i++)
        send(P_i);

Slave processes:
    Barrier:
        send(P_master);   /* arrival phase */
        recv(P_master);   /* departure phase */

- implementations must handle possible time delays, e.g. two barriers in quick succession
- cost is O(p)
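The master/slave scheme above can be tried out with threads standing in for processes and queues standing in for messages. This is only a sketch under those assumptions: the names `master_barrier`, `slave_barrier` and `run_demo` are illustrative and are not part of MPI or of the lecture code.

```python
import queue
import threading

def master_barrier(master_inbox, slave_inboxes):
    # Arrival phase: count one message from each of the p-1 slaves.
    for _ in range(len(slave_inboxes)):
        master_inbox.get()
    # Departure phase: send a release message to each slave.
    for inbox in slave_inboxes:
        inbox.put("release")

def slave_barrier(master_inbox, my_inbox, rank):
    master_inbox.put(rank)  # arrival: tell the master we are here
    my_inbox.get()          # departure: block until released

def run_demo(p=4, rounds=3):
    """Run p threads through `rounds` barriers and return an event log."""
    master_inbox = queue.Queue()
    slave_inboxes = [queue.Queue() for _ in range(p - 1)]
    log, log_lock = [], threading.Lock()

    def record(event, rnd, rank):
        with log_lock:
            log.append((event, rnd, rank))

    def master():
        for rnd in range(rounds):
            record("before", rnd, 0)
            master_barrier(master_inbox, slave_inboxes)
            record("after", rnd, 0)

    def slave(rank):
        for rnd in range(rounds):
            record("before", rnd, rank)
            slave_barrier(master_inbox, slave_inboxes[rank - 1], rank)
            record("after", rnd, rank)

    threads = [threading.Thread(target=master)]
    threads += [threading.Thread(target=slave, args=(r,)) for r in range(1, p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log

log = run_demo()
```

In the returned log, every "before" entry of a round precedes every "after" entry of that round. Back-to-back barriers are safe in this sketch because a slave cannot send its next arrival message until it has been released from the current barrier.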

6 Tree-Based Barriers
[Figure: tree of processes, showing arrival at the barrier and departure from the barrier.]
- note: broadcast does not ensure synchronization
- cost $2\lg p$ or $O(\lg p)$
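The $2\lg p$ cost can be checked by writing out one possible message schedule. The sketch below assumes $p$ is a power of two and that process 0 is the root of the tree; the pairing rule and the function names are illustrative choices, not taken from the slides.

```python
def tree_barrier_schedule(p):
    """Return (arrival_rounds, departure_rounds); each round is a list of (sender, receiver)."""
    assert p > 0 and p & (p - 1) == 0, "sketch assumes p is a power of two"
    arrival = []
    stride = 1
    while stride < p:
        # In each arrival round, process i + stride reports to process i.
        arrival.append([(i + stride, i) for i in range(0, p, 2 * stride)])
        stride *= 2
    # Departure reverses the arrival rounds and the direction of each message.
    departure = [[(dst, src) for (src, dst) in rnd] for rnd in reversed(arrival)]
    return arrival, departure

def simulate(p):
    """Return (root saw all arrivals, everyone released, number of rounds)."""
    arrival, departure = tree_barrier_schedule(p)
    arrived = [{i} for i in range(p)]  # arrivals each process knows about
    for rnd in arrival:
        for src, dst in rnd:
            arrived[dst] |= arrived[src]
    root_knows_all = arrived[0] == set(range(p))
    released = {0}  # the root releases itself once it has seen everyone
    for rnd in departure:
        for src, dst in rnd:
            assert src in released  # only a released process passes the release on
            released.add(dst)
    return root_knows_all, released == set(range(p)), len(arrival) + len(departure)
```

For `p = 8` the schedule has three arrival rounds and three departure rounds, i.e. $2\lg 8 = 6$ rounds in total.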

7 Butterfly Barrier (Butterfly/Omega Network)
[Figure: pairwise synchronizations between processes over time, in a 1st, 2nd and 3rd stage.]
- cost is $2\lg p$ or $O(\lg p)$
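A small simulation can show why a logarithmic number of pairwise stages is enough. It assumes $p$ is a power of two and that at stage $s$ process $i$ synchronizes with process $i \oplus 2^s$, which is one common way to lay out the butterfly; the function name is illustrative.

```python
def butterfly_barrier(p):
    """Simulate the stages; return (what each process knows, number of stages)."""
    assert p > 0 and p & (p - 1) == 0, "sketch assumes p is a power of two"
    known = [{i} for i in range(p)]  # arrivals each process has heard about
    stages = 0
    stride = 1
    while stride < p:
        # All pairs (i, i XOR stride) exchange what they know at the same time.
        known = [known[i] | known[i ^ stride] for i in range(p)]
        stride *= 2
        stages += 1
    return known, stages

known, stages = butterfly_barrier(8)
```

With 8 processes, after 3 stages every process has (indirectly) heard from all the others, so no separate departure phase is needed.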

8 Degrees of Synchronization
- from fully to loosely synchronous
  - the more synchronous your computation, the more potential overhead
- SIMD: synchronized at the instruction level
  - provides ease of programming (one program)
  - well suited for data decomposition
  - applicable to many numerical problems
  - the forall statement was introduced to specify data parallel operations

    forall (i = 0; i < n; i++) {
        data parallel work
    }

9 Synchronous Example: Jacobi Iterations
- the Jacobi iteration solves a system of linear equations iteratively

\[
\begin{aligned}
a_{n-1,0}x_0 + a_{n-1,1}x_1 + a_{n-1,2}x_2 + \cdots + a_{n-1,n-1}x_{n-1} &= b_{n-1} \\
&\;\;\vdots \\
a_{2,0}x_0 + a_{2,1}x_1 + a_{2,2}x_2 + \cdots + a_{2,n-1}x_{n-1} &= b_2 \\
a_{1,0}x_0 + a_{1,1}x_1 + a_{1,2}x_2 + \cdots + a_{1,n-1}x_{n-1} &= b_1 \\
a_{0,0}x_0 + a_{0,1}x_1 + a_{0,2}x_2 + \cdots + a_{0,n-1}x_{n-1} &= b_0
\end{aligned}
\]

- where there are n equations and n unknowns ($x_0, x_1, x_2, \ldots, x_{n-1}$)

10 Jacobi Iterations
- consider equation i as:

\[ a_{i,0}x_0 + a_{i,1}x_1 + a_{i,2}x_2 + \cdots + a_{i,n-1}x_{n-1} = b_i \]

- which we can re-cast as:

\[ x_i = \frac{1}{a_{i,i}}\left[ b_i - \left( a_{i,0}x_0 + a_{i,1}x_1 + a_{i,2}x_2 + \cdots + a_{i,i-1}x_{i-1} + a_{i,i+1}x_{i+1} + \cdots + a_{i,n-1}x_{n-1} \right) \right] \]

- i.e.

\[ x_i = \frac{1}{a_{i,i}}\left[ b_i - \sum_{j \neq i} a_{i,j}x_j \right] \]

- strategy: guess x, then iterate and hope it converges!
- converges if the matrix is diagonally dominant: $\sum_{j \neq i} |a_{i,j}| < |a_{i,i}|$
- terminate when convergence is achieved: $|x^{t} - x^{t-1}| <$ error tolerance

11 Sequential Jacobi Code
ignoring convergence testing:

    for (i = 0; i < n; i++)
        x[i] = b[i];                       /* initial guess */
    for (iter = 0; iter < max_iter; iter++) {
        for (i = 0; i < n; i++) {
            sum = -a[i][i] * x[i];         /* cancels the j == i term added below */
            for (j = 0; j < n; j++) {
                sum = sum + a[i][j] * x[j];
            }
            new_x[i] = (b[i] - sum) / a[i][i];
        }
        for (i = 0; i < n; i++)
            x[i] = new_x[i];
    }
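As a quick check of the update rule, here is a minimal runnable sketch on a small diagonally dominant system. Unlike the slide code it includes a simple convergence test; the matrix, the tolerance and the function name `jacobi` are made-up illustrations.

```python
def jacobi(a, b, max_iter=200, tol=1e-10):
    """Jacobi iteration for a x = b; stops early once the largest change is below tol."""
    n = len(b)
    x = list(b)  # initial guess, as in the slide code
    for _ in range(max_iter):
        new_x = []
        for i in range(n):
            # Sum of a[i][j] * x[j] over all j except the diagonal term.
            s = sum(a[i][j] * x[j] for j in range(n) if j != i)
            new_x.append((b[i] - s) / a[i][i])
        change = max(abs(new_x[i] - x[i]) for i in range(n))
        x = new_x  # all unknowns are updated together, from the old values
        if change < tol:
            break
    return x

# Diagonally dominant example chosen so that the exact solution is [1, 1, 1].
a = [[4.0, 1.0, 1.0],
     [1.0, 5.0, 2.0],
     [1.0, 2.0, 6.0]]
b = [6.0, 8.0, 9.0]
x = jacobi(a, b)
```

Building `new_x` from the old `x` only, and swapping afterwards, is what makes every unknown independent within an iteration and hence parallelisable.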

12 Parallel Jacobi Code
ignoring convergence testing and assuming parallelisation over n processes (process i holds unknown i):

    x[i] = b[i];
    for (iter = 0; iter < max_iter; iter++) {
        sum = -a[i][i] * x[i];
        for (j = 0; j < n; j++) {
            sum = sum + a[i][j] * x[j];
        }
        new_x[i] = (b[i] - sum) / a[i][i];
        broadcast_gather(&new_x[i], new_x);
        global_barrier();
        for (j = 0; j < n; j++)            /* j, so that the process index i is not overwritten */
            x[j] = new_x[j];
    }

broadcast_gather() sends the local new_x[i] to all processes and collects their new values

13 Broadcast Gather
[Figure: processes 0, 1, ..., n-1 each call broadcast_gather(); the data x_0, x_1, ..., x_{n-1} in their send buffers is collected into the receive buffer.]
Can be (simplistically) implemented as:

    for (j = 0; j < n; j++)
        send(&new_x[i], /* process */ j);
    for (j = 0; j < n; j++)
        recv(&new_x[j], /* process */ j);

Question: do we really need the barrier as well as this?

14 Partitioning
- normally the number of processes is much less than the number of data items
  - block partitioning: allocate groups of consecutive unknowns to processes
  - cyclic partitioning: allocate in a round-robin fashion
- analysis: $\tau$ iterations, $n/p$ unknowns per process
  - computation decreases with p: $t_{comp} = \tau (2n + 4)(n/p)\, t_f$
  - communication increases with p: $t_{comm} = p\,(t_s + (n/p)\, t_w)\,\tau = (p\, t_s + n\, t_w)\,\tau$
  - total has an overall minimum: $t_{tot} = \left((2n + 4)(n/p)\, t_f + p\, t_s + n\, t_w\right)\tau$
- question: can we do an all-gather faster than $p\, t_s + n\, t_w$?

15 Parallel Jacobi Iteration Time
Parameters: $t_s = 10^5\, t_f$, $t_w = 50\, t_f$, $n = 1000$
[Figure: execution time against number of processors p, with curves for overall, communication and computation time.]
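The location of the minimum can be found by evaluating the total-time model directly. The sketch below uses $t_{tot} = ((2n + 4)(n/p)\,t_f + p\,t_s + n\,t_w)\,\tau$ with the parameters read as $t_s = 10^5\,t_f$, $t_w = 50\,t_f$, $n = 1000$, and measures time in units of $t_f$; the function name and the search range for $p$ are arbitrary choices.

```python
def t_total(p, n, t_f, t_s, t_w, tau=1):
    """Modelled time for tau Jacobi iterations on p processes."""
    t_comp = tau * (2 * n + 4) * (n / p) * t_f  # shrinks as p grows
    t_comm = tau * (p * t_s + n * t_w)          # grows as p grows
    return t_comp + t_comm

n, t_f = 1000, 1.0                 # time measured in units of t_f
t_s, t_w = 1e5 * t_f, 50 * t_f
best_p = min(range(1, 65), key=lambda p: t_total(p, n, t_f, t_s, t_w))
```

Setting the derivative with respect to $p$ to zero gives $p = \sqrt{(2n+4)\,n\,t_f / t_s} \approx 4.5$ for these values, so under this model the best whole number of processes is small because the startup cost $t_s$ dominates.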

16 Locally Synchronous Example: Heat Distribution Problem
- consider a metal sheet with a fixed temperature along the sides but unknown temperatures in the middle; find the temperature in the middle
- finite difference approximation to the Laplace equation:

\[ \frac{\partial^2 T(x,y)}{\partial x^2} + \frac{\partial^2 T(x,y)}{\partial y^2} = 0 \]

\[ \frac{T(x + \delta x, y) - 2T(x,y) + T(x - \delta x, y)}{\delta x^2} + \frac{T(x, y + \delta y) - 2T(x,y) + T(x, y - \delta y)}{\delta y^2} = 0 \]

- assuming an even grid (i.e. $\delta x = \delta y$) of $n \times n$ points (denoted as $h_{i,j}$), the temperature at any point is an average of surrounding points:

\[ h_{i,j} = \frac{h_{i-1,j} + h_{i+1,j} + h_{i,j-1} + h_{i,j+1}}{4} \]

- problem is very similar to the Game of Life, i.e. what happens in a cell depends upon its neighbours

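17 Array Ordering
[Figure: grid points numbered row by row, $x_1, x_2, \ldots, x_{k-1}, x_k$ in the first row, $x_{k+1}, x_{k+2}, \ldots, x_{2k-1}, x_{2k}$ in the second, up to $x_{k^2}$; point $x_i$ has neighbours $x_{i-k}$, $x_{i-1}$, $x_{i+1}$ and $x_{i+k}$.]
- we will solve iteratively:

\[ x_i = \frac{x_{i-1} + x_{i+1} + x_{i-k} + x_{i+k}}{4} \]

- but this problem may also be written as a system of linear equations:

\[ x_{i-k} + x_{i-1} - 4x_i + x_{i+1} + x_{i+k} = 0 \]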

18 Heat Equation: Sequential Code
- assume a fixed number of iterations and a square mesh
- beware of what happens at the edges!

    for (iter = 0; iter < max_iter; iter++) {
        for (i = 1; i < n; i++)
            for (j = 1; j < n; j++)
                g[i][j] = 0.25 * (h[i-1][j] + h[i+1][j] + h[i][j-1] + h[i][j+1]);
        for (i = 1; i < n; i++)
            for (j = 1; j < n; j++)
                h[i][j] = g[i][j];
    }
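The same loop structure can be run on a tiny mesh to see the edge handling. This sketch assumes, as the loop bounds above suggest, that rows and columns 0 and n hold the fixed edge temperatures and only points 1 to n-1 are updated; `make_sheet` and `heat_iterate` are illustrative names and the temperatures are made up.

```python
def make_sheet(n, edge=100.0, interior=0.0):
    """(n+1) x (n+1) mesh with fixed edge temperature and an initial interior guess."""
    return [[edge if i in (0, n) or j in (0, n) else interior
             for j in range(n + 1)]
            for i in range(n + 1)]

def heat_iterate(h, max_iter):
    """Repeatedly replace each interior point by the average of its four neighbours."""
    n = len(h) - 1  # indices 0 and n are the fixed edges
    for _ in range(max_iter):
        g = [row[:] for row in h]  # new values are built from the old mesh only
        for i in range(1, n):
            for j in range(1, n):
                g[i][j] = 0.25 * (h[i - 1][j] + h[i + 1][j] + h[i][j - 1] + h[i][j + 1])
        h = g
    return h

h = heat_iterate(make_sheet(4), 200)
```

With every edge held at the same temperature, the interior values approach that temperature, which gives an easy sanity check.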

19 Heat Equation: Parallel Code
one point per process, assuming non-blocking sends:

    for (iter = 0; iter < max_iter; iter++) {
        g = 0.25 * (w + x + y + z);
        send(&g, P_{i-1,j});
        send(&g, P_{i+1,j});
        send(&g, P_{i,j-1});
        send(&g, P_{i,j+1});
        recv(&w, P_{i-1,j});
        recv(&x, P_{i+1,j});
        recv(&y, P_{i,j-1});
        recv(&z, P_{i,j+1});
    }

- sends and receives provide a local barrier
- each process synchronizes with its 4 surrounding processes

20 Heat Equation: Partitioning
- normally more than one point per process
- option of either block or strip partitioning
[Figure: the mesh divided among processes 0 to p-1, by block partitioning and by strip partitioning.]

21 Block/Strip Communication Comparison
- block partitioning: four edges exchanged ($n^2$ data points, p processes)

\[ t_{comm} = 8\left(t_s + \frac{n}{\sqrt{p}}\, t_w\right) \]

- strip partitioning: two edges exchanged

\[ t_{comm} = 4\,(t_s + n\, t_w) \]

[Figure: block communications, with edges of length $n/\sqrt{p}$, and strip communications, with edges of length n.]

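22 Block/Strip Optimum
- block communication is larger than strip if:

\[ 8\left(t_s + \frac{n}{\sqrt{p}}\, t_w\right) > 4\,(t_s + n\, t_w) \]

- i.e. if

\[ t_s > n\left(1 - \frac{2}{\sqrt{p}}\right) t_w \]

[Figure: $t_s$ against number of processes p, divided into a region where strip partitioning is best and a region where block partitioning is best.]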
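The crossover condition can be checked numerically against the two communication formulas, $t_{comm} = 8(t_s + (n/\sqrt{p})\,t_w)$ for blocks and $t_{comm} = 4(t_s + n\,t_w)$ for strips. The function names and the sample values below are illustrative only.

```python
import math

def t_comm_block(n, p, t_s, t_w):
    """Four edges of n/sqrt(p) points, each sent and received."""
    return 8 * (t_s + (n / math.sqrt(p)) * t_w)

def t_comm_strip(n, t_s, t_w):
    """Two edges of n points, each sent and received."""
    return 4 * (t_s + n * t_w)

def strip_is_better(n, p, t_s, t_w):
    """True when the startup time is large enough for strips to communicate less."""
    return t_s > n * (1 - 2 / math.sqrt(p)) * t_w

# With n = 1000, p = 16 and t_w = 1 the threshold is t_s = 500.
high_startup = strip_is_better(1000, 16, 600.0, 1.0)  # True: strips win
low_startup = strip_is_better(1000, 16, 400.0, 1.0)   # False: blocks win
```

Blocks send less data but twice as many messages, so they only pay off when the per-message startup time $t_s$ is small relative to the per-word cost.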

23 Safety and Deadlock
- with all processes sending and then receiving data, the code is unsafe:
  - it relies on local buffering in the send() function
  - potential for deadlock (as in Prac 1, Ex 3)!
- alternative #1: re-order sends and receives, e.g. for strip partitioning:

    if ((myid % 2) == 0) {
        send(&g[1][1], n, P_{i-1});
        recv(&h[1][0], n, P_{i-1});
        send(&g[1][n], n, P_{i+1});
        recv(&h[1][n+1], n, P_{i+1});
    } else {
        recv(&h[1][0], n, P_{i-1});
        send(&g[1][1], n, P_{i-1});
        recv(&h[1][n+1], n, P_{i+1});
        send(&g[1][n], n, P_{i+1});
    }

24 Alt #2: Asynchronous Communication using Ghost Points
- assign extra receive buffers for edges where data is exchanged
- typically these are implemented as extra rows and columns in each process's local array (known as a halo)
- can use asynchronous calls (e.g. MPI_Isend())
[Figure: processes i and i+1, with edge data copied into the neighbouring process's ghost points.]
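The ghost-point idea can be illustrated without any message passing by splitting a mesh into strips of rows, giving each strip one extra row above and below, and copying edge rows between neighbouring strips before every sweep. This is a serial sketch only: the copies stand in for the asynchronous sends and receives, the number of interior rows is assumed to divide evenly by the number of strips, and all function names are hypothetical.

```python
def jacobi_sweep(local):
    """One averaging sweep; the first/last rows and columns are ghost or boundary values."""
    new = [row[:] for row in local]
    for i in range(1, len(local) - 1):
        for j in range(1, len(local[i]) - 1):
            new[i][j] = 0.25 * (local[i - 1][j] + local[i + 1][j]
                                + local[i][j - 1] + local[i][j + 1])
    return new

def split_into_strips(h, p):
    """Give each strip its own rows plus one ghost row above and one below."""
    rows = len(h) - 2  # interior rows of the global mesh
    assert rows % p == 0, "sketch assumes the rows divide evenly"
    size = rows // p
    return [[row[:] for row in h[k * size: k * size + size + 2]] for k in range(p)]

def exchange_ghost_rows(strips):
    """Copy each strip's edge rows into its neighbours' ghost rows."""
    for k in range(len(strips) - 1):
        upper, lower = strips[k], strips[k + 1]
        lower[0] = upper[-2][:]  # upper's last owned row -> lower's top ghost row
        upper[-1] = lower[1][:]  # lower's first owned row -> upper's bottom ghost row

def run_strips(h, p, max_iter):
    strips = split_into_strips(h, p)
    for _ in range(max_iter):
        exchange_ghost_rows(strips)
        strips = [jacobi_sweep(s) for s in strips]
    return strips

# 8 x 8 mesh: top edge held at 100, everything else starts at 0.
mesh = [[100.0] * 8] + [[0.0] * 8 for _ in range(7)]
serial = mesh
for _ in range(20):
    serial = jacobi_sweep(serial)
strips = run_strips(mesh, 3, 20)
owned = [row for s in strips for row in s[1:-1]]  # drop the ghost rows
```

Here `owned` matches the interior rows of `serial`: each strip only ever reads its own rows and its ghost rows, so refreshing the ghost rows once per iteration is all the communication the update needs.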
