Barrier. Overview: Synchronous Computations. Barriers. Counter-based or Linear Barriers


Christopher Goodman
5 years ago

Overview: Synchronous Computations

- barriers: linear, tree-based and butterfly
- degrees of synchronization
- synchronous example 1: Jacobi Iterations
  - serial and parallel code, performance analysis
- synchronous example 2: Heat Distribution
  - serial and parallel code
  - comparison of block and strip partitioning methods
  - safety
  - ghost points
- (synchronous example 3) advection - Assignment

Ref: Chapter 6: Wilkinson and Allen

COMP4300/8300 L2: Synchronous Computations 2017

Barriers

- barrier: a point at which all processes must wait until all other processes have reached that point
  [code listing: not recoverable]
- mutual exclusion: a barrier that prevents other processes from entering the following region if another process is already in that region
  - common in shared memory parallel programs
  - necessary for some MPI-2 operations
- both are possible sources of overhead

Counter-based or Linear Barriers

- one process counts the arrival of the other processes
- when all processes have arrived, they are each sent a release message
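The counter-based barrier described above can be sketched with threads standing in for processes and one queue per process standing in for its message inbox. This is a minimal illustration under those assumptions, not the lecture's own code; the names `linear_barrier`, `run_demo` and `mailboxes` are hypothetical.

```python
import queue
import threading


def linear_barrier(rank, p, mailboxes):
    """Counter-based barrier: rank 0 counts arrivals, then releases everyone."""
    if rank == 0:
        # Arrival phase: count one message from each of the other p-1 processes.
        for _ in range(p - 1):
            mailboxes[0].get()
        # Departure phase: send one release message to each waiting process.
        for other in range(1, p):
            mailboxes[other].put("release")
    else:
        mailboxes[0].put(("arrived", rank))  # tell the counting process we are here
        mailboxes[rank].get()                # block until the release message arrives


def run_demo(p=4):
    """Run p threads through one barrier and return the order of log events."""
    mailboxes = [queue.Queue() for _ in range(p)]
    log = []
    lock = threading.Lock()

    def worker(rank):
        with lock:
            log.append(("before", rank))
        linear_barrier(rank, p, mailboxes)
        with lock:
            log.append(("after", rank))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Every "before" entry in the returned log precedes every "after" entry, which is the barrier property. The counting process handles p-1 arrivals and then p-1 releases one at a time, so the cost grows linearly with p.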

Implementation

- arrival phase: process sends message to central counter
- departure phase: process receives message from central counter
  [code listing: not recoverable]
- implementations must handle possible time delays, e.g. two barriers in quick succession
- cost is O(p)

Butterfly Barrier (Butterfly/Omega Network)

- cost is 2 lg p or O(lg p)

Tree-Based Barriers

- note: broadcast does not ensure synchronization
- cost 2 lg p or O(lg p)

Degrees of Synchronization

- from fully to loosely synchronous
  - the more synchronous your computation, the more potential overhead
- SIMD: synchronized at the instruction level
  - provides ease of programming (one program)
  - well suited for data decomposition
  - applicable to many numerical problems
  - the forall statement was introduced to specify data parallel operations
    [code listing: not recoverable]
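To make the logarithmic cost of the butterfly barrier concrete, here is a small simulation of its pairing pattern. It assumes the usual construction, in which process i exchanges with process i XOR 2^s at stage s and p is a power of two; the function names are hypothetical and no real messages are sent.

```python
def butterfly_partners(p):
    """Return, for each stage, the partner of every process (p a power of two)."""
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    stages = p.bit_length() - 1  # lg p
    return [[i ^ (1 << s) for i in range(p)] for s in range(stages)]


def simulate_butterfly(p):
    """Track which arrivals each process has (directly or indirectly) heard about."""
    known = [{i} for i in range(p)]
    for partners in butterfly_partners(p):
        # Each pair exchanges everything it knows so far; the right-hand side
        # is built entirely from the previous stage before 'known' is rebound.
        known = [known[i] | known[partners[i]] for i in range(p)]
    return known
```

After lg p stages every process has heard, directly or indirectly, from all p processes, so no central counter is needed.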

Synchronous Example: Jacobi Iterations

- the Jacobi iteration solves a system of linear equations iteratively

  $$\begin{aligned}
  a_{n-1,0}x_0 + a_{n-1,1}x_1 + a_{n-1,2}x_2 + \cdots + a_{n-1,n-1}x_{n-1} &= b_{n-1} \\
  &\;\,\vdots \\
  a_{2,0}x_0 + a_{2,1}x_1 + a_{2,2}x_2 + \cdots + a_{2,n-1}x_{n-1} &= b_2 \\
  a_{1,0}x_0 + a_{1,1}x_1 + a_{1,2}x_2 + \cdots + a_{1,n-1}x_{n-1} &= b_1 \\
  a_{0,0}x_0 + a_{0,1}x_1 + a_{0,2}x_2 + \cdots + a_{0,n-1}x_{n-1} &= b_0
  \end{aligned}$$

  where there are n equations and n unknowns ($x_0, x_1, x_2, \ldots, x_{n-1}$)

Sequential Jacobi Code

- ignoring convergence testing:
  [code listing: not recoverable]

Jacobi Iterations

- consider equation i as:

  $$a_{i,0}x_0 + a_{i,1}x_1 + a_{i,2}x_2 + \cdots + a_{i,n-1}x_{n-1} = b_i$$

  which we can re-cast as:

  $$x_i = \frac{1}{a_{i,i}}\left[b_i - \left(a_{i,0}x_0 + a_{i,1}x_1 + \cdots + a_{i,i-1}x_{i-1} + a_{i,i+1}x_{i+1} + \cdots + a_{i,n-1}x_{n-1}\right)\right]$$

  i.e.

  $$x_i = \frac{1}{a_{i,i}}\left[b_i - \sum_{j \neq i} a_{i,j}x_j\right]$$

- strategy: guess x, then iterate and hope it converges!
- converges if the matrix is diagonally dominant: $\sum_{j \neq i} |a_{i,j}| < |a_{i,i}|$
- terminate when convergence is achieved: $|x^{t} - x^{t-1}| <$ error tolerance

Parallel Jacobi Code

- ignoring convergence testing and assuming parallelisation over n processes:
  [code listing: not recoverable]
- the broadcast/gather call sends the local value to all processes and collects their new values
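The update rule above, in which each x_i is recomputed from the previous iterate as (b_i minus the off-diagonal sum) divided by a_{i,i}, can be sketched sequentially as follows. This is an illustrative Python version with a fixed iteration count and no convergence test, not the original listing; `jacobi` and `is_diagonally_dominant` are hypothetical names.

```python
def is_diagonally_dominant(a):
    """True if, in every row, |a[i][i]| exceeds the sum of the other |a[i][j]|."""
    return all(
        sum(abs(v) for j, v in enumerate(row) if j != i) < abs(row[i])
        for i, row in enumerate(a)
    )


def jacobi(a, b, iterations=50, x0=None):
    """Run a fixed number of Jacobi sweeps on a x = b and return the iterate."""
    n = len(b)
    x = list(x0) if x0 is not None else [0.0] * n  # initial guess
    for _ in range(iterations):
        new_x = [0.0] * n
        for i in range(n):
            # Sum over j != i uses only values from the previous iteration.
            s = sum(a[i][j] * x[j] for j in range(n) if j != i)
            new_x[i] = (b[i] - s) / a[i][i]
        x = new_x  # all unknowns are replaced together
    return x
```

Because every new value depends only on the previous iterate, the n updates inside one sweep are independent of each other, which is what allows one unknown (or one block of unknowns) per process in the parallel version.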

Broadcast Gather

- can be (simplistically) implemented as:
  [code listing: not recoverable]
- question: do we really need the barrier as well as this?

Parallel Jacobi Iteration Time

- parameters: $t_s = 10^5 t_f$, $t_w = 50 t_f$, $n = 1000$

Partitioning

- normally the number of processes is much less than the number of data items
  - block partitioning: allocate groups of consecutive unknowns to processes
  - cyclic partitioning: allocate in a round-robin fashion
- analysis: $\tau$ iterations, $n/p$ unknowns per process
  - computation decreases with p
  - communication increases with p
  - total has an overall minimum

  $$t_{comp} = \tau (2n + 4)(n/p)\, t_f$$

  $$t_{comm} = p\,(t_s + (n/p)\, t_w)\,\tau = (p\, t_s + n\, t_w)\,\tau$$

  $$t_{tot} = \left((2n + 4)(n/p)\, t_f + p\, t_s + n\, t_w\right)\tau$$

- question: can we do an all-gather faster than $p\, t_s + n\, t_w$?

Locally Synchronous Example: Heat Distribution Problem

- consider a metal sheet with a fixed temperature along the sides but unknown temperatures in the middle: find the temperature in the middle
- finite difference approximation to the Laplace equation:

  $$\frac{\partial^2 T(x,y)}{\partial x^2} + \frac{\partial^2 T(x,y)}{\partial y^2} = 0$$

  $$\frac{T(x+\delta x, y) - 2T(x,y) + T(x-\delta x, y)}{\delta x^2} + \frac{T(x, y+\delta y) - 2T(x,y) + T(x, y-\delta y)}{\delta y^2} = 0$$

- assuming an even grid (i.e. $\delta x = \delta y$) of $n \times n$ points (denoted as $h_{i,j}$), the temperature at any point is an average of surrounding points:

  $$h_{i,j} = \frac{h_{i-1,j} + h_{i+1,j} + h_{i,j-1} + h_{i,j+1}}{4}$$

- problem is very similar to the Game of Life, i.e. what happens in a cell depends upon its neighbours
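The trade-off in the analysis above, with computation falling as 1/p while communication grows with p, can be explored numerically with the total-time model t_tot = ((2n + 4)(n/p) t_f + p t_s + n t_w) τ. The sketch below measures time in units of t_f; the function names are hypothetical, and the example values t_s = 10^5 t_f, t_w = 50 t_f, n = 1000 are the slide's parameters as read from a damaged extraction, so treat them as an assumption.

```python
def jacobi_total_time(p, n, t_s, t_w, t_f=1.0, tau=1):
    """Modelled time for tau Jacobi iterations on p processes."""
    t_comp = tau * (2 * n + 4) * (n / p) * t_f   # shrinks as p grows
    t_comm = tau * (p * t_s + n * t_w)           # grows as p grows
    return t_comp + t_comm


def best_process_count(n, t_s, t_w, t_f=1.0, max_p=None):
    """Process count in 1..max_p (default n) that minimises the modelled time."""
    max_p = max_p or n
    return min(
        range(1, max_p + 1),
        key=lambda p: jacobi_total_time(p, n, t_s, t_w, t_f),
    )


if __name__ == "__main__":
    # Assumed parameters (in units of t_f): t_s = 1e5, t_w = 50, n = 1000.
    for p in (1, 2, 4, 5, 8, 16):
        print(p, jacobi_total_time(p, 1000, 1e5, 50))
    print("best p:", best_process_count(1000, 1e5, 50))
```

With these assumed values the startup term p t_s dominates quickly, so the modelled minimum falls at a small process count (p = 5).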

Array Ordering

- we will solve iteratively:

  $$x_i = \frac{x_{i-1} + x_{i+1} + x_{i-k} + x_{i+k}}{4}$$

- but this problem may also be written as a system of linear equations:

  $$x_{i-k} + x_{i-1} - 4x_i + x_{i+1} + x_{i+k} = 0$$

Heat Equation: Parallel Code

- one point per process
- assuming non-blocking sends:
  [code listing: not recoverable]
- sends and receives provide a local barrier
- each process synchronizes with its 4 surrounding processes

Heat Equation: Sequential Code

- assume a fixed number of iterations and a square mesh
- beware of what happens at the edges!
  [code listing: not recoverable]

Heat Equation: Partitioning

- normally more than one point per process
- option of either block or strip partitioning
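A sequential version of the heat-distribution update, under the same assumptions as the slide (a fixed number of iterations and a square mesh), might look like the sketch below. It is an illustration rather than the original listing: `heat_step` and `solve_heat` are hypothetical names, and the boundary is held at a single fixed temperature for simplicity.

```python
def heat_step(h):
    """One sweep: every interior point becomes the mean of its four neighbours."""
    n = len(h)
    g = [row[:] for row in h]  # write into a copy so the sweep reads only old values
    for i in range(1, n - 1):          # edges are skipped: they hold the fixed boundary
        for j in range(1, n - 1):
            g[i][j] = 0.25 * (h[i - 1][j] + h[i + 1][j] + h[i][j - 1] + h[i][j + 1])
    return g


def solve_heat(n, boundary, iterations):
    """n x n mesh, edges fixed at 'boundary', interior started at 0."""
    h = [
        [boundary if i in (0, n - 1) or j in (0, n - 1) else 0.0 for j in range(n)]
        for i in range(n)
    ]
    for _ in range(iterations):
        h = heat_step(h)
    return h
```

The loops run from 1 to n-2 so that boundary points are never overwritten, which is the "beware of what happens at the edges" caveat in concrete form.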

Block/Strip Communication Comparison

- block partitioning: four edges exchanged ($n^2$ data points, p processes)

  $$t_{comm} = 8\left(t_s + \frac{n}{\sqrt{p}}\, t_w\right)$$

- strip partitioning: two edges exchanged

  $$t_{comm} = 4\,(t_s + n\, t_w)$$

Safety and Deadlock

- with all processes sending and then receiving data, the code is unsafe: it relies on local buffering in the send function
- potential for deadlock (as in Prac, Ex 3)!
- alternative #1: re-order sends and receives, e.g. for strip partitioning:
  [code listing: not recoverable]

Block/Strip Optimum

- block communication is larger than strip if:

  $$8\left(t_s + \frac{n}{\sqrt{p}}\, t_w\right) > 4\,(t_s + n\, t_w)$$

  i.e. if

  $$t_s > n\left(1 - \frac{2}{\sqrt{p}}\right) t_w$$

Alt #2: Asynchronous Communication using Ghostpoints

- assign extra receive buffers for edges where data is exchanged
- typically these are implemented as extra rows and columns in each process's local array (known as a halo)
- can use asynchronous calls (e.g. a non-blocking send such as MPI_Isend)
  [code listing: not recoverable]
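The ghost-point idea can be illustrated without any message passing by simulating strip partitioning in a single process: each strip stores its own rows plus one extra row above and below, and an "exchange" copies neighbouring edge rows into those extra rows before the update. This is a sketch under that assumption; in a real MPI code the copies would be (possibly asynchronous) sends and receives. All function names are hypothetical, and p is assumed to divide the number of rows.

```python
def sequential_step(h):
    """Reference: one sweep over the interior of the full grid."""
    new = [row[:] for row in h]
    for i in range(1, len(h) - 1):
        for j in range(1, len(h[0]) - 1):
            new[i][j] = 0.25 * (h[i - 1][j] + h[i + 1][j] + h[i][j - 1] + h[i][j + 1])
    return new


def split_into_strips(h, p):
    """Split rows into p strips, each padded with a ghost row above and below."""
    rows = len(h) // p  # assumes p divides the number of rows
    ghost = [0.0] * len(h[0])
    return [
        [ghost[:]] + [row[:] for row in h[r * rows:(r + 1) * rows]] + [ghost[:]]
        for r in range(p)
    ]


def exchange_ghost_rows(strips):
    """Copy each neighbour's edge row into the matching ghost row."""
    for r in range(len(strips)):
        if r > 0:
            strips[r][0] = strips[r - 1][-2][:]   # last owned row of the strip above
        if r < len(strips) - 1:
            strips[r][-1] = strips[r + 1][1][:]   # first owned row of the strip below


def update_strip(strip, is_top, is_bottom):
    """Update owned rows only, skipping rows on the global boundary."""
    new = [row[:] for row in strip]
    first = 2 if is_top else 1
    last = len(strip) - 2 - (1 if is_bottom else 0)
    for i in range(first, last + 1):
        for j in range(1, len(strip[0]) - 1):
            new[i][j] = 0.25 * (
                strip[i - 1][j] + strip[i + 1][j] + strip[i][j - 1] + strip[i][j + 1]
            )
    return new


def strip_step(strips):
    """One iteration: halo exchange first, then independent local updates."""
    exchange_ghost_rows(strips)
    last = len(strips) - 1
    return [update_strip(s, r == 0, r == last) for r, s in enumerate(strips)]


def gather(strips):
    """Reassemble the full grid from the owned rows of every strip."""
    return [row for s in strips for row in s[1:-1]]
```

Because the exchange completes before any strip is updated, each strip reads only previous-iteration values, and the gathered result matches the sequential sweep.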
