Lecture 22: Multithreaded Algorithms CSCI 700 - Algorithms I Andrew Rosenberg
Last Time: Open Addressing Hashing
Today: Multithreading
Two Styles of Threading. Shared Memory: every thread can access the same memory (data). Distributed Memory: each thread has its own partition of the memory.
Types of Parallelism. Nested Parallelism: call a method, but don't wait for it to return.
Types of Parallelism. Parallel Loops: just like a for loop, but the loop iterations run concurrently.
Parallelization keywords. spawn: start a new thread running. sync: wait here until all spawned threads finish. parallel: indicates a parallel loop.
Fibonacci example for parallelism
Fib(n)
  if n ≤ 1 then
    return n
  else
    x = Fib(n-1)
    y = Fib(n-2)
    return x + y
  end if
Fibonacci example for parallelism
P-Fib(n)
  if n ≤ 1 then
    return n
  else
    x = spawn P-Fib(n-1)
    y = P-Fib(n-2)
    sync
    return x + y
  end if
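The spawn/sync pattern above can be sketched in Python with raw threads: starting a thread plays the role of spawn, and joining it plays the role of sync. The depth cutoff `max_depth` is my own addition (the pseudocode spawns at every level), and note that CPython's GIL means this illustrates the control flow rather than a real speedup.

```python
import threading

def p_fib(n, depth=0, max_depth=3):
    """Parallel Fibonacci sketch: spawn one recursive call, compute the other here."""
    if n <= 1:
        return n
    result = {}
    def worker():
        result["x"] = p_fib(n - 1, depth + 1, max_depth)
    if depth < max_depth:
        t = threading.Thread(target=worker)      # spawn P-Fib(n-1)
        t.start()
        y = p_fib(n - 2, depth + 1, max_depth)   # P-Fib(n-2) in the current strand
        t.join()                                 # sync
        x = result["x"]
    else:                                        # past the cutoff, fall back to serial
        x = p_fib(n - 1, depth + 1, max_depth)
        y = p_fib(n - 2, depth + 1, max_depth)
    return x + y

print(p_fib(10))  # 55
```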
Recursion of Fib(6)
Graphical example of P-Fib(4). While P-Fib(4) uses 17 commands, the critical path (the longest path from the initial strand to the final strand) is 8 units long. (It can be found using a BFS.)
Some Terminology. Work is the number of calls to a function that are made (T1). Span is the length of the critical path (T∞). The fastest a program can run on a multicore processor with P threads is bounded by T_P ≥ T1/P. The speedup is T1/T_P. If the speedup is P, we have perfect linear speedup. The parallelism of an algorithm is T1/T∞. Since T_P ≥ T∞, this is the maximum possible speedup that an algorithm can achieve by adding more processors.
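Plugging in the P-Fib(4) numbers from the previous slide (work T1 = 17, span T∞ = 8), the bounds above can be computed directly; the choice of P = 4 processors here is my own illustrative assumption:

```python
# Work T1 = 17 commands, span T_inf = 8, from the P-Fib(4) example.
T1, T_inf = 17, 8
P = 4                                # assumed number of processors

# T_P is bounded below by both T1/P (work law) and T_inf (span law).
lower_bound = max(T1 / P, T_inf)
parallelism = T1 / T_inf             # maximum possible speedup

print(lower_bound, parallelism)      # 8 2.125
```

So even with many processors, this tiny computation can never run more than about 2.1 times faster than the serial version.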
Analyzing multithreaded algorithms. Most of what we've done so far has been the analysis and optimization of work. Analyzing span is different. The total span of a parallel algorithm with two components A and B run in parallel is max(T∞(A), T∞(B)).
For P-Fib(n):
  T∞(n) = max(T∞(n-1), T∞(n-2)) + Θ(1)    (1)
        = T∞(n-1) + Θ(1)                  (2)
        = Θ(n)                            (3)
Huge Parallelism: T1(n)/T∞(n) = Θ(φⁿ/n)
Parallel Loops. If the iterations of a loop don't depend on each other, they can run in parallel. For example, multiply a matrix A by a vector x: y_i = Σ_{j=1}^{n} A_ij · x_j
y_i = Σ_{j=1}^{n} A_ij · x_j
Mat-Vec(A, x)
  n = A.rows
  y = new vector with n cells
  for i = 1 to n do
    y_i = 0
  end for
  for i = 1 to n do
    for j = 1 to n do
      y_i = y_i + A_ij · x_j
    end for
  end for
y_i = Σ_{j=1}^{n} A_ij · x_j
Mat-Vec(A, x)
  n = A.rows
  y = new vector with n cells
  parallel for i = 1 to n do
    y_i = 0
  end for
  parallel for i = 1 to n do
    for j = 1 to n do
      y_i = y_i + A_ij · x_j
    end for
  end for
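A minimal Python sketch of the parallel outer loop: each value of i is an independent task, so the iterations can be handed to a thread pool. The helper name `mat_vec` is mine; `ThreadPoolExecutor.map` preserves the order of results, which stands in for "parallel for i = 1 to n".

```python
from concurrent.futures import ThreadPoolExecutor

def mat_vec(A, x):
    """Compute y = A x, with one concurrent task per row of A."""
    n = len(A)
    def row(i):
        # One iteration of the parallel loop: y_i depends only on row i.
        return sum(A[i][j] * x[j] for j in range(n))
    with ThreadPoolExecutor() as pool:
        y = list(pool.map(row, range(n)))  # iterations run concurrently
    return y

A = [[1, 2], [3, 4]]
print(mat_vec(A, [1, 1]))  # [3, 7]
```

Note there is no race here: each task writes only its own y_i.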
Race Conditions
Race()
  x = 0
  parallel for i = 1 to 2 do
    x = x + 1
  end for
  print x
Because both iterations read and write x, the printed value can be 1 or 2. Need to make sure that the threads are independent: no writing to memory that other threads are reading from.
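One standard way to repair the race above (not shown on the slide) is to protect the read-modify-write with a lock, serializing the two updates; a minimal Python sketch:

```python
import threading

x = 0
lock = threading.Lock()

def increment():
    global x
    with lock:          # serializes the read-modify-write, removing the race
        x = x + 1

threads = [threading.Thread(target=increment) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()            # sync: wait for both increments

print(x)  # always 2
```

The cost is that the two iterations no longer run in parallel inside the critical section.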
Multithreaded Merge Sort
Merge-Sort(A, p, r)
  if p < r then
    q = ⌊(p + r)/2⌋
    spawn Merge-Sort(A, p, q)
    Merge-Sort(A, q + 1, r)
    sync
    Merge(A, p, q, r)
  end if
The two recursive calls to Merge-Sort run in parallel (one spawned, one in the current strand). No problem.
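A Python sketch of this version: one half is spawned as a thread, the other runs in the current strand, and the merge itself stays serial. The helper `sorted_merge` is my own stand-in for the serial Merge procedure.

```python
import threading

def merge_sort(A, p, r):
    """Multithreaded merge sort with a serial merge step."""
    if p < r:
        q = (p + r) // 2
        t = threading.Thread(target=merge_sort, args=(A, p, q))  # spawn
        t.start()
        merge_sort(A, q + 1, r)     # sort the right half in this strand
        t.join()                    # sync
        A[p:r + 1] = sorted_merge(A[p:q + 1], A[q + 1:r + 1])

def sorted_merge(left, right):
    """Serial Merge: combine two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

data = [5, 2, 9, 1, 7]
merge_sort(data, 0, len(data) - 1)
print(data)  # [1, 2, 5, 7, 9]
```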
Multithreaded Merge Sort
Work: MS1(n) = 2·MS1(n/2) + Θ(n) = Θ(n log n)
Span: MS∞(n) = MS∞(n/2) + Θ(n) = Θ(n)
Parallelism: MS1/MS∞ = Θ(n log n)/Θ(n) = Θ(log n)
Parallel Merge. Serial Merge is dominating the performance. How can we parallelize merge? Divide-and-conquer Merge: put the middle element, z, of the larger of the two lists in its correct position; merge the subarrays containing elements smaller than z; merge the subarrays containing elements greater than z.
Parallel Merge
P-Merge(T, p1, r1, p2, r2, A, p3)
  n1 = r1 - p1 + 1
  n2 = r2 - p2 + 1
  if n1 < n2 then
    swap the p's, r's, and n's
  end if
  if n1 == 0 then
    return
  else
    q1 = ⌊(p1 + r1)/2⌋
    q2 = Binary-Search(T[q1], T, p2, r2)
    q3 = p3 + (q1 - p1) + (q2 - p2)   // where to put T[q1]
    A[q3] = T[q1]
    spawn P-Merge(T, p1, q1 - 1, p2, q2 - 1, A, p3)
    P-Merge(T, q1 + 1, r1, q2, r2, A, q3 + 1)
    sync
  end if
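The pseudocode above translates fairly directly to Python; a sketch, assuming Binary-Search returns the first index in [p2, r2+1] whose element is ≥ T[q1] (the role played here by `bisect.bisect_left`), and modeling spawn/sync with a thread per recursive call:

```python
import bisect
import threading

def p_merge(T, p1, r1, p2, r2, A, p3):
    """Merge sorted T[p1..r1] and T[p2..r2] into A starting at p3 (0-indexed)."""
    n1, n2 = r1 - p1 + 1, r2 - p2 + 1
    if n1 < n2:                      # make the first range the larger one
        p1, r1, p2, r2 = p2, r2, p1, r1
        n1, n2 = n2, n1
    if n1 == 0:                      # both ranges empty
        return
    q1 = (p1 + r1) // 2              # median of the larger range
    q2 = bisect.bisect_left(T, T[q1], p2, r2 + 1)   # Binary-Search
    q3 = p3 + (q1 - p1) + (q2 - p2)  # where T[q1] belongs in A
    A[q3] = T[q1]
    t = threading.Thread(
        target=p_merge, args=(T, p1, q1 - 1, p2, q2 - 1, A, p3))
    t.start()                        # spawn: merge the elements below T[q1]
    p_merge(T, q1 + 1, r1, q2, r2, A, q3 + 1)  # merge the elements above
    t.join()                         # sync

T = [1, 3, 5, 2, 4, 6, 8]            # two sorted runs: T[0..2] and T[3..6]
A = [0] * 7
p_merge(T, 0, 2, 3, 6, A, 0)
print(A)  # [1, 2, 3, 4, 5, 6, 8]
```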
Parallel Merge Span. Identify the maximum number of elements in the largest call to P-Merge. The worst case merges n1/2 elements (from the larger subarray) with all n2 elements (from the smaller subarray):
  n1/2 + n2 = n1/2 + n2/2 + n2/2 = (n1 + n2)/2 + n2/2 ≤ n/2 + n/4 = 3n/4
PM∞(n) = PM∞(3n/4) + Θ(log n) = Θ(log² n)
Parallel Merge Work
PM1(n) = PM1(αn) + PM1((1-α)n) + O(log n), where 1/4 ≤ α ≤ 3/4.
PM1 is clearly Ω(n). Can show that PM1(n) ≤ c1·n - c2·log n for constants c1, c2, proving that PM1 = O(n). Thus, PM1 = Θ(n).
Parallel Merge Sort
P-Merge-Sort(A, p, r, B, s)
  n = r - p + 1
  if n == 1 then
    B[s] = A[p]
  else
    let T[1..n] be a new array
    q = ⌊(p + r)/2⌋
    q' = q - p + 1
    spawn P-Merge-Sort(A, p, q, T, 1)
    P-Merge-Sort(A, q + 1, r, T, q' + 1)
    sync
    P-Merge(T, 1, q', q' + 1, n, B, s)
  end if
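Putting the two pieces together, here is a self-contained Python sketch of P-Merge-Sort (0-indexed, so q' - 1 marks the end of the left run in T); the spawn/sync pairs are again modeled with threads, and `p_merge` repeats the parallel merge from the earlier slide:

```python
import bisect
import threading

def p_merge(T, p1, r1, p2, r2, A, p3):
    """Merge sorted T[p1..r1] and T[p2..r2] into A starting at p3."""
    n1, n2 = r1 - p1 + 1, r2 - p2 + 1
    if n1 < n2:
        p1, r1, p2, r2 = p2, r2, p1, r1
        n1, n2 = n2, n1
    if n1 == 0:
        return
    q1 = (p1 + r1) // 2
    q2 = bisect.bisect_left(T, T[q1], p2, r2 + 1)   # Binary-Search
    q3 = p3 + (q1 - p1) + (q2 - p2)
    A[q3] = T[q1]
    t = threading.Thread(
        target=p_merge, args=(T, p1, q1 - 1, p2, q2 - 1, A, p3))
    t.start()                                        # spawn
    p_merge(T, q1 + 1, r1, q2, r2, A, q3 + 1)
    t.join()                                         # sync

def p_merge_sort(A, p, r, B, s):
    """Sort A[p..r] into B starting at index s."""
    n = r - p + 1
    if n == 1:
        B[s] = A[p]
    else:
        T = [0] * n                                  # scratch array
        q = (p + r) // 2
        qp = q - p + 1                               # q' in the pseudocode
        t = threading.Thread(target=p_merge_sort, args=(A, p, q, T, 0))
        t.start()                                    # spawn: sort left half into T
        p_merge_sort(A, q + 1, r, T, qp)             # sort right half into T
        t.join()                                     # sync
        p_merge(T, 0, qp - 1, qp, n - 1, B, s)       # merge the halves into B

data = [5, 2, 9, 1, 7, 3]
out = [0] * len(data)
p_merge_sort(data, 0, len(data) - 1, out, 0)
print(out)  # [1, 2, 3, 5, 7, 9]
```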
Parallel Merge Sort Work
PMergeSort1(n) = 2·PMergeSort1(n/2) + PM1(n) = 2·PMergeSort1(n/2) + Θ(n) = Θ(n log n)
Parallel Merge Sort Span
PMergeSort∞(n) = PMergeSort∞(n/2) + PM∞(n) = PMergeSort∞(n/2) + Θ(log² n) = Θ(log³ n)
Parallelism: PMergeSort1(n)/PMergeSort∞(n) = Θ(n log n)/Θ(log³ n) = Θ(n/log² n)
Bye. Next time: NP-Completeness.