CSE Introduction to Parallel Processing. Chapter 2. A Taste of Parallel Algorithms


Solomon Strickland
5 years ago

1 Dr. Izadi, CSE-0 Introduction to Parallel Processing
Chapter 2: A Taste of Parallel Algorithms
Consider five basic building-block parallel operations
Implement them on four simple parallel architectures
Learn about the nature of parallel computations and the interplay between algorithm and architecture

2 Introduction to Parallel Processing: Algorithms and Architectures, Instructor's Manual, Vol. 2 (/00)
Some Simple Computations
[Figure: Semigroup computation on a uniprocessor. Starting from the identity element at t = 0, the inputs x_0, x_1, x_2, ..., x_{n-1} are combined one per time step, yielding the result s at t = n.]
[Figure: Semigroup computation viewed as a tree or fan-in computation over the inputs x_0, x_1, ..., producing s.]
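The two views of semigroup computation can be sketched in Python as follows. This is only an illustration: the function names and the pairing rule for an odd number of values are assumptions, not taken from the slides, and the operator is assumed to be associative.

```python
def semigroup_sequential(values, op, identity):
    """Uniprocessor view: fold in one input per time step (n steps)."""
    s = identity
    for x in values:
        s = op(s, x)
    return s


def semigroup_tree(values, op):
    """Fan-in view: combine adjacent pairs level by level.

    Assumes a non-empty input. Returns (result, number_of_levels).
    An unpaired last element is carried up unchanged (an assumed rule),
    which preserves operand order for non-commutative operators.
    """
    level = list(values)
    steps = 0
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
        steps += 1
    return level[0], steps
```

With nine inputs, the tree version needs four levels instead of nine sequential steps.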

3 [Figure: Prefix computation on a uniprocessor. Starting from the identity element at t = 0, the inputs x_0, x_1, x_2, ..., x_{n-1} are combined one per time step, yielding the partial results s_0, s_1, s_2, ..., s_{n-1}.]
Packet routing: one processor sending a packet of data to another.
Broadcasting: one processor sending a packet of data to all others.
Sorting: processors cooperating in rearranging their data into the desired order.
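As a small illustration of prefix computation on a uniprocessor, the sketch below keeps every partial result instead of only the last one. The function name is hypothetical.

```python
def prefix(values, op, identity):
    """Return [s_0, s_1, ...], where s_i combines x_0 through x_i."""
    results = []
    s = identity
    for x in values:
        s = op(s, x)        # one combining step per input
        results.append(s)   # keep the partial result
    return results
```

For example, `prefix([1, 2, 3, 4], lambda a, b: a + b, 0)` yields the running sums of the inputs.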

4 Some Simple Architectures
[Figure: A linear array of nine processors and its ring variant.]
Diameter of linear array: $D = p - 1$
(Max) node degree: $d = 2$
[Figure: A balanced (but incomplete) binary tree of nine processors.]
Diameter of balanced binary tree: $D = 2\lfloor \log_2 p \rfloor$, or one less
(Max) node degree: $d = 3$
We almost always deal with complete binary trees: p is one less than a power of 2, and $D = 2\log_2(p + 1) - 2$

5 [Figure: A 2D mesh of nine processors and its torus variant.]
Diameter of $r \times (p/r)$ mesh: $D = r + p/r - 2$
(Max) node degree: $d = 4$
Square meshes preferred; they minimize D ($= 2\sqrt{p} - 2$)
[Figure: A shared-variable architecture modeled as a complete graph.]
Diameter of complete graph: $D = 1$
(Max) node degree: $d = p - 1$
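The diameter and node-degree figures for the four architectures can be checked by brute force. The sketch below builds each topology as an adjacency list and measures its diameter with breadth-first search. All function names are hypothetical, and the binary tree is assumed to be numbered heap-style (node i has children 2i + 1 and 2i + 2), which is only one possible placement of the last-level leaves.

```python
from collections import deque


def diameter(adj):
    """Longest shortest-path length over all node pairs (BFS from every node)."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


def max_degree(adj):
    return max(len(neighbors) for neighbors in adj.values())


def linear_array(p):
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < p] for i in range(p)}


def binary_tree(p):
    """Heap-numbered binary tree: the parent of node i > 0 is (i - 1) // 2."""
    adj = {i: [] for i in range(p)}
    for i in range(1, p):
        parent = (i - 1) // 2
        adj[i].append(parent)
        adj[parent].append(i)
    return adj


def mesh(r, c):
    adj = {}
    for i in range(r):
        for j in range(c):
            candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            adj[(i, j)] = [(a, b) for a, b in candidates if 0 <= a < r and 0 <= b < c]
    return adj


def complete_graph(p):
    return {i: [j for j in range(p) if j != i] for i in range(p)}
```

For nine processors this gives a diameter of 8 for the linear array (p - 1), 4 for the 3 x 3 mesh (r + p/r - 2), and 1 for the complete graph; a seven-node complete binary tree has diameter 4, matching 2 log2(p + 1) - 2.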

6 Algorithms for a Linear Array
[Figure: Maximum-finding on a linear array of nine processors, from the initial values to the maximum identified.]
[Figure: Computing prefix sums on a linear array of nine processors, from the initial values to the final results.]
Diminished prefix computation: the ith result excludes the ith element (e.g., sum of the first i elements)
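One plausible way to simulate maximum-finding on a linear array is shown below. It assumes that in every step each processor compares its value with those of its immediate neighbors and keeps the largest; the slides' figure may use a different exchange pattern, so treat this as a sketch only.

```python
def linear_array_max(values):
    """Simulate max-finding with one value per processor.

    Returns the value held by each processor after p - 1 steps.
    """
    state = list(values)
    p = len(state)
    for _ in range(p - 1):  # p - 1 steps cover the diameter of the array
        # Each processor sees its own value and those of its left/right neighbors.
        state = [max(state[max(i - 1, 0):i + 2]) for i in range(p)]
    return state
```

After p - 1 steps, which equals the diameter of the array, every processor holds the global maximum.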

7 [Figure: Computing prefix sums on a linear array with two items per processor. Initial values, then local prefixes, then linear-array diminished prefix sums, which are added to the local prefixes to give the final results.]
Packet routing or broadcasting: right- and left-moving packets have no conflict
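The three stages for several items per processor (local prefixes, diminished prefix sums across processors, final combination) might be sketched as follows. The sketch assumes each processor holds a contiguous, non-empty block of the input, which may differ from the data layout in the slide's figure.

```python
from itertools import accumulate


def block_prefix_sums(blocks):
    """Prefix sums when processor k holds the list blocks[k]."""
    # Stage 1: each processor computes prefix sums over its own items.
    local = [list(accumulate(block)) for block in blocks]
    # Stage 2: diminished prefix sums of the per-processor totals,
    # so processor k receives the sum of everything held to its left.
    totals = [lp[-1] for lp in local]
    offsets = [0] + list(accumulate(totals))[:-1]
    # Stage 3: each processor adds its offset to its local prefixes.
    return [[off + v for v in lp] for off, lp in zip(offsets, local)]
```

Only stage 2 requires communication along the array; stages 1 and 3 are purely local.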

8 [Figure: Sorting on a linear array with the keys input sequentially from the left.]
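A common way to sort keys that enter a linear array one at a time from the left is for each processor to keep the smaller of its stored key and the incoming key and pass the larger one to the right. The sketch below assumes that rule and processes the keys one after another, whereas a real array would pipeline them.

```python
def linear_array_sort(keys):
    """Sort keys fed in sequentially from the left end of the array."""
    cells = [None] * len(keys)  # one storage cell per processor
    for key in keys:
        moving = key
        for i in range(len(cells)):
            if cells[i] is None:       # empty processor: the key stops here
                cells[i] = moving
                break
            if moving < cells[i]:      # keep the smaller, forward the larger
                cells[i], moving = moving, cells[i]
    return cells
```

Under this rule the smallest key ends up in the leftmost processor.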

9 [Figure: Odd-even transposition sort on a linear array of nine processors.]
In odd steps (1, 3, 5, etc.), odd-numbered processors exchange values with their right neighbors.
For odd-even transposition sort:
Speed-up = $O(p \log p) / p = O(\log p)$
Efficiency = $O((\log p) / p)$
Redundancy = $O(p / \log p)$
Utilization = 1/2
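A minimal sequential simulation of odd-even transposition sort is given below. Steps are numbered from 1, so that in odd steps the odd-numbered processors are paired with their right neighbors and in even steps the even-numbered ones are; the function name is hypothetical.

```python
def odd_even_transposition_sort(values):
    """Sort one value per processor in p compare-exchange steps."""
    a = list(values)
    p = len(a)
    for step in range(1, p + 1):
        start = step % 2  # odd step: pairs (1,2), (3,4), ...; even step: (0,1), (2,3), ...
        for i in range(start, p - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

All compare-exchanges within one step touch disjoint pairs, so on the array they happen simultaneously; the inner loop only serializes them for simulation.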

10 Algorithms for a Binary Tree
[Figure: Parallel prefix computation on a binary tree of processors, showing the upward propagation phase, the downward propagation phase (with the identity entering at the root), and the results at the leaves.]
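The upward and downward propagation phases can be sketched on an implicit tree stored in arrays. This is an illustrative reconstruction rather than the slides' exact scheme: it assumes the inputs sit at the leaves, that their number is a power of 2, and that node k has children 2k and 2k + 1.

```python
def tree_prefix(values, op, identity):
    """Parallel-prefix on a complete binary tree with the inputs at the leaves."""
    n = len(values)  # assumed to be a power of 2
    # Upward propagation: up[k] combines all leaves in the subtree of node k.
    up = [identity] * (2 * n)
    up[n:] = values
    for node in range(n - 1, 0, -1):
        up[node] = op(up[2 * node], up[2 * node + 1])
    # Downward propagation: down[k] combines all leaves to the left of
    # node k's subtree; the root (node 1) receives the identity.
    down = [identity] * (2 * n)
    for node in range(1, n):
        down[2 * node] = down[node]
        down[2 * node + 1] = op(down[node], up[2 * node])
    # Each leaf combines what arrived from above with its own value.
    return [op(down[n + i], values[i]) for i in range(n)]
```

Operand order is preserved throughout, so the sketch also works for associative operators that are not commutative, such as string concatenation.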

11 Some applications of the parallel prefix computation
Finding the rank of each 1 in a list of 0s and 1s (rows shown: data; prefix sums; ranks of 1s)
Priority circuit (rows shown: data; diminished prefix ORs; complement; AND with data)
Carry computation in fast adders
Let g, p, and a denote the event that a particular digit position in the adder generates, propagates, or annihilates a carry. The input data for the carry circuit consists of a vector of three-valued elements such as:
p g a g g p p p g a c_in (c_in is g or a; an arrow marks the direction of indexing)
Parallel prefix computation using the carry operator ¢:
p ¢ x = x (x propagates over p, for all x ∈ {g, p, a})
a ¢ x = a (x is annihilated or absorbed by a)
g ¢ x = g (x is immaterial; a carry is generated)
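The three applications can be illustrated with ordinary sequential prefix scans, as in the sketch below. The function names are hypothetical, and the carry example assumes that `signals[0]` is the least significant digit position, so the scan starts at the carry-in end of the vector.

```python
from itertools import accumulate


def ranks_of_ones(bits):
    """Rank of each 1: the prefix sum at its position (None where the bit is 0)."""
    prefix_sums = list(accumulate(bits))
    return [s if b else None for b, s in zip(bits, prefix_sums)]


def priority_circuit(bits):
    """Keep only the first 1: data AND complement of the diminished prefix OR."""
    diminished_or = [0] + list(accumulate(bits, lambda a, b: a | b))[:-1]
    return [b & (1 - d) for b, d in zip(bits, diminished_or)]


def carry_op(left, right):
    """Carry operator: p passes the right operand through; g and a override it."""
    return right if left == "p" else left


def carries(signals, c_in):
    """Carry leaving each digit position, scanning from the least significant end."""
    out = []
    c = c_in  # "g" or "a"
    for s in signals:
        c = carry_op(s, c)
        out.append(c)
    return out
```

Each function is a prefix computation with a different associative operator (addition, OR, and the carry operator), so each could equally be run on a parallel prefix network.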

12 Packet routing on a tree
[Figure: A balanced binary tree with preorder node indices.]
maxl (maxr) = largest node number in the left (right) subtree
if dest = self
then remove the packet {done}
else if dest < self or dest > maxr
     then route upward
     else if dest ≤ maxl
          then route leftward
          else route rightward
          endif
     endif
endif
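The routing rule for preorder-indexed trees can be exercised with the sketch below. The tree construction is an assumption made for illustration: it splits the nodes of each subtree roughly in half, so its shape need not match the slide's figure. For a node without a right subtree, `maxr` is taken to be the largest index in that node's whole subtree.

```python
def build_preorder_tree(p):
    """Binary tree on nodes 0..p-1 numbered in preorder (hypothetical shape)."""
    nodes = {}

    def build(first, count, parent):
        if count == 0:
            return None
        left_count = count // 2               # assumed split rule
        right_count = count - 1 - left_count
        left = build(first + 1, left_count, first)
        right = build(first + 1 + left_count, right_count, first)
        nodes[first] = {
            "parent": parent, "left": left, "right": right,
            "maxl": first + left_count,       # largest index in the left subtree
            "maxr": first + count - 1,        # largest index in the whole subtree
        }
        return first

    build(0, p, None)
    return nodes


def route(nodes, src, dest):
    """Return the list of nodes visited; dest is assumed to be a valid node index."""
    path = [src]
    cur = src
    while cur != dest:                        # dest = self: remove the packet
        node = nodes[cur]
        if dest < cur or dest > node["maxr"]:
            cur = node["parent"]              # route upward
        elif dest <= node["maxl"]:
            cur = node["left"]                # route leftward
        else:
            cur = node["right"]               # route rightward
        path.append(cur)
    return path
```

In the nine-node tree built this way, a packet from node 3 to node 7 climbs to the root and then descends through nodes 5 and 6.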

13 Other indexing schemes might lead to simpler routing algorithms.
[Figure: A binary tree with node labels XXX, LXX, RXX, LLX, LRX, RLX, RRL, RRX, RRR.]
Broadcasting is done via the root node.

14 Sorting: let the root see all data in nondescending order.
[Figure: The first few steps of the sorting algorithm on a binary tree, panels (a) through (d).]
[Figure: The bisection width of a binary tree architecture. Bisection width = 1.]

15 Algorithms for a 2D Mesh
[Figure: Finding the max value on a 2D mesh: row maximums, then column maximums.]
[Figure: Computing prefix sums on a 2D mesh: row prefix sums, then diminished prefix sums in the last column, then broadcast in rows and combine.]
Row-major order is required if the operator is not commutative.
Routing and broadcasting are done via row/column operations.
[Figure: The shearsort algorithm on a mesh. Initial values; Phase 1: snake-like row sort, top-to-bottom column sort; Phase 2: snake-like row sort, top-to-bottom column sort; Phase 3: left-to-right row sort.]
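The shearsort phases can be simulated as follows. The sketch uses Python's built-in sort as a stand-in for the row and column sorts that the mesh would perform in parallel, and it assumes the usual phase count for an r-row mesh: ceil(log2 r) rounds of snake-like row sort plus column sort, followed by one left-to-right row sort.

```python
def shearsort(grid):
    """Sort an r x c grid (list of equal-length rows) into row-major order."""
    rows = [list(row) for row in grid]
    r = len(rows)
    rounds = (r - 1).bit_length() if r > 0 else 0  # ceil(log2 r) for r >= 1
    for _ in range(rounds):
        # Snake-like row sort: even rows ascending, odd rows descending.
        for i, row in enumerate(rows):
            row.sort(reverse=(i % 2 == 1))
        # Top-to-bottom column sort.
        cols = [sorted(col) for col in zip(*rows)]
        rows = [list(row) for row in zip(*cols)]
    # Final phase: left-to-right row sort.
    for row in rows:
        row.sort()
    return rows
```

On a 3 x 3 mesh this amounts to two rounds followed by the final row sort, matching the three phases named in the figure.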

16 Algorithms with Shared Variables
[Figure: A shared-variable architecture modeled as a complete graph.]
Semigroup computation: each processor reads all values in turn and combines them.
Parallel prefix: processor i reads and combines values 0 to i.
Both of the above are quite inefficient, given the high cost of the architecture.
Packet routing and broadcasting: one step, assuming all-port communication.
Sorting: rank each element by comparing it to all others, then permute according to ranks.
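Sorting by ranking can be sketched as below. Each iteration of the outer comprehension stands for one processor computing the rank of its own element; breaking ties by index is an added assumption that keeps the ranks distinct when values repeat.

```python
def rank_sort(values):
    """Rank each element against all others, then permute according to ranks."""
    n = len(values)
    # Rank of element i = number of elements that must precede it.
    ranks = [
        sum(1 for j in range(n) if (values[j], j) < (values[i], i))
        for i in range(n)
    ]
    out = [None] * n
    for i, r in enumerate(ranks):
        out[r] = values[i]  # element i moves to position ranks[i]
    return out
```

Each rank takes n comparisons, so the method performs on the order of n squared comparisons in total, far more than a good sequential sort.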
