COL 730: Parallel Programming

PARALLEL SORTING

Bitonic Merge and Sort
Bitonic sequence {a_0, a_1, ..., a_{n-1}}: a sequence with a monotonically increasing part followed by a monotonically decreasing part.
For some i, a_0 <= ... <= a_i and a_{i+1} >= ... >= a_{n-1}.
Or, a cyclic shift of the indices puts the sequence in that form.

Split Bitonic Sequence
Say a_0 <= ... <= a_{n/2} >= a_{n/2+1} >= ... >= a_{n-1}.
Subsequence 1: {min(a_0, a_{n/2}), min(a_1, a_{n/2+1}), ..., min(a_{n/2-1}, a_{n-1})}
Subsequence 2: {max(a_0, a_{n/2}), max(a_1, a_{n/2+1}), ..., max(a_{n/2-1}, a_{n-1})}
Both subsequences are bitonic, and every element of Subsequence 1 <= every element of Subsequence 2.
Recursively sort each bitonic subsequence.

Merge Bitonic Sequence
To merge Sequence 1 and Sequence 2:
Sort the first sequence in increasing order.
Sort the second in decreasing order.
Their concatenation is bitonic, so it can be sorted by recursive splitting.

Bitonic Sort
Sort each pair, alternately in increasing and decreasing order.
Every sequence of length four is now bitonic.
Sort recursively, again alternating increasing and decreasing orders, forming bitonic sequences of length eight.
And so on.
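A minimal sequential sketch of the recursion described above, assuming the input length is a power of two. The names `bitonic_sort` and `bitonic_merge` are illustrative and not from the slides; on a parallel machine the compare-exchanges inside each loop would run concurrently.

```python
def bitonic_merge(a, lo, n, ascending):
    """Sort the bitonic run a[lo:lo+n] in place (n is a power of two)."""
    if n <= 1:
        return
    half = n // 2
    for i in range(lo, lo + half):
        # Compare-exchange a[i] with its partner half a run away.
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    bitonic_merge(a, lo, half, ascending)
    bitonic_merge(a, lo + half, half, ascending)


def bitonic_sort(a, lo=0, n=None, ascending=True):
    """Sort a[lo:lo+n] in place by building and merging bitonic runs."""
    if n is None:
        n = len(a)
        assert n & (n - 1) == 0, "sketch assumes a power-of-two length"
    if n <= 1:
        return
    half = n // 2
    bitonic_sort(a, lo, half, True)           # increasing half
    bitonic_sort(a, lo + half, half, False)   # decreasing half
    bitonic_merge(a, lo, n, ascending)        # the whole run is now bitonic
```

Calling `bitonic_sort(data)` on a list of 8 or 16 keys sorts it in place.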

Bitonic Network
log n stages, log^2 n time, n log^2 n work.
[Figure: bitonic sorting network of compare-exchange elements on 8 inputs, Stage 1 to Stage 3, ending in 1 2 3 4 5 6 7 8]
t(n) = t(n/2) + log n
No better than (n log^2 n)/p time on p processors.

Sorting if n > p
Simulate n/p virtual processors per processor, like the PRAM simulation earlier.
Or, divide the elements into p blocks and locally sort the n/p elements of each block.
Then perform a bitonic sort of the blocks: compare-exchange becomes compare-split.
Example compare-split: blocks (3 4 10 15 20) and (13 14 16 21 25) become (3 4 10 13 14) and (15 16 20 21 25).
O((n/p) log(n/p) + (n/p) log^2 p) time: local sorting plus log^2 p compare-splits.
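A small sketch of the compare-split step, assuming both blocks are already sorted. The function name `compare_split` is illustrative; in a real setting the two blocks would live on two different processors that exchange their data first.

```python
from heapq import merge


def compare_split(low_block, high_block):
    """Return (smaller half, larger half) of two sorted blocks."""
    merged = list(merge(low_block, high_block))  # linear merge of sorted inputs
    k = len(low_block)
    # The "low" processor keeps the k smallest keys, the "high" one the rest.
    return merged[:k], merged[k:]
```

Applied to the two blocks from the slide, it yields the same split shown there.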

Odd-Even Exchange
Even phase: 10 34 20 5 30 1 40 15 50 45 -> 10 34 5 20 1 30 15 40 45 50
Odd phase: 10 34 5 20 1 30 15 40 45 50 -> 10 5 34 1 20 15 30 40 45 50
Repeat n/2 times: O(n^2) work.
With p processors: divide into p blocks and sort each block locally.
Repeat p/2 times: an odd and an even compare-split phase, each done on p blocks by p processors.
Each processor sequentially splits a 2n/p-sized block.
O((n/p) log(n/p)) + O(n) time: local sort plus p compare-splits.
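The element-wise version can be sketched as below. The names `exchange_phase` and `odd_even_sort` are illustrative; the loop over pairs inside a phase is what would run in parallel, one pair per processor.

```python
def exchange_phase(a, start):
    """Compare-exchange the disjoint pairs (start, start+1), (start+2, start+3), ..."""
    for i in range(start, len(a) - 1, 2):
        if a[i] > a[i + 1]:
            a[i], a[i + 1] = a[i + 1], a[i]


def odd_even_sort(values):
    """Odd-even transposition sort: about n/2 rounds of an even and an odd phase."""
    a = list(values)
    for _ in range((len(a) + 1) // 2):
        exchange_phase(a, 0)  # even phase: pairs (0,1), (2,3), ...
        exchange_phase(a, 1)  # odd phase: pairs (1,2), (3,4), ...
    return a
```

The first two phases on the slide's input reproduce the two intermediate rows shown there.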

Batcher's Odd-Even Merge
Two-way merge. Given two sorted subsequences:
Merge only the even elements of each: e_0, e_1, e_2, e_3, ...
Merge only the odd elements of each: o_0, o_1, o_2, o_3, ...
Perform a pair-wise exchange: e_0, min(e_1, o_0), max(e_1, o_0), min(e_2, o_1), max(e_2, o_1), ..., o_{n-1}
Example inputs: 10 20 30 35 40 50 and 1 4 15 45 49 52
Even and odd merges, interleaved: 1 4 10 20 15 35 30 45 40 50 49 52
After the pair-wise exchange: 1 4 10 15 20 30 35 40 45 49 50 52
W(2n) = 2W(n) + O(n): n log n work
T(2n) = T(n) + O(1): log n time
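A recursive sketch of the merge, assuming two sorted lists of equal length. The name `odd_even_merge` is illustrative; the two recursive calls are independent and the final compare-exchanges touch disjoint pairs, which is where the parallelism comes from.

```python
def odd_even_merge(a, b):
    """Merge two sorted lists of equal length with Batcher's odd-even scheme."""
    if not a:
        return list(b)
    if not b:
        return list(a)
    if len(a) == 1 and len(b) == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    evens = odd_even_merge(a[0::2], b[0::2])  # e0, e1, e2, ...
    odds = odd_even_merge(a[1::2], b[1::2])   # o0, o1, o2, ...
    # Interleave: e0, o0, e1, o1, ...
    out = []
    for i, e in enumerate(evens):
        out.append(e)
        if i < len(odds):
            out.append(odds[i])
    # Pair-wise exchange of (o0, e1), (o1, e2), ...
    for i in range(1, len(out) - 1, 2):
        if out[i] > out[i + 1]:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out
```

On the slide's two inputs this returns the fully merged row shown there.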

Parallel Bucket Sort
Divide the range [a, b] of numbers into p equal sub-ranges, or buckets.
Divide the input into p blocks.
Processor P_i sends each element of its block to the bucket for that element's sub-range.
For uniformly distributed input, the expected bucket size is uniform.
Locally sort each bucket.

Parallel Bucket Sort
Decide buckets, e.g., ranges of key values.
Parallel for i: put element i in its bucket b.
May keep each bucket locally at each processor, and later merge the buckets from all processors.
Sort each bucket separately.
For uniformly distributed input the expected bucket size is uniform, but there is a real risk of load imbalance.
Sample sort:
Choose a sample of size s.
Sort the samples.
Choose B-1 evenly spaced elements from the sorted list.
These splitters provide the ranges for B buckets.
O((n/p) log(n/p) + p log p)?

Parallel Splitter Selection
Divide the n elements equally into B blocks.
(Quick)sort each block: O((n/B) log(n/B)).
For each sorted block, choose B-1 evenly spaced splitters.
Use the B*(B-1) elements as samples.
Sort the samples: O(B^2 log B).
Choose B-1 splitters.
Arrange elements by bucket in the output array:
Count the number of elements in each bucket.
Perform a prefix sum of the counts; reserve space per bucket.
In-place.
No bucket contains more than 2*n/B elements.
O(n/B + B log B)
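The steps above can be sketched sequentially as follows. The names `choose_splitters` and `sample_sort`, and the exact sample positions, are assumptions made for illustration; each loop over blocks or elements is the part that would be distributed over processors. For simplicity the sketch fills a separate output array rather than working in place.

```python
from bisect import bisect_right
from itertools import accumulate


def choose_splitters(data, B):
    """Pick B-1 splitters by regular sampling of B locally sorted blocks."""
    n = len(data)
    blocks = [sorted(data[k * n // B:(k + 1) * n // B]) for k in range(B)]
    samples = []
    for blk in blocks:
        m = len(blk)
        if m:
            # B-1 evenly spaced samples from each sorted block.
            samples.extend(blk[j * m // B] for j in range(1, B))
    samples.sort()
    s = len(samples)
    # B-1 evenly spaced splitters from the sorted samples.
    return [samples[j * s // B] for j in range(1, B)] if s else []


def sample_sort(data, B):
    splitters = choose_splitters(data, B)
    bucket_of = [bisect_right(splitters, x) for x in data]
    counts = [0] * (len(splitters) + 1)
    for b in bucket_of:
        counts[b] += 1
    # Exclusive prefix sum: starting offset of each bucket in the output.
    offsets = [0] + list(accumulate(counts))[:-1]
    out = [None] * len(data)
    cursor = list(offsets)
    for x, b in zip(data, bucket_of):
        out[cursor[b]] = x
        cursor[b] += 1
    # Sort each bucket inside its reserved slice.
    for b, start in enumerate(offsets):
        out[start:start + counts[b]] = sorted(out[start:start + counts[b]])
    return out
```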

Radix Sort
Least significant digit first:
b bits per round.
Entire list sorted per round (of bucket sort).
Subdivide into p equal subsets.
Most significant digit first:
Separate into buckets.
Sort only within a bucket per round.
Load balance? Enough buckets => local sort per processor.
Parallel round: local bucketing, then bucket merging.
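A sequential sketch of the least-significant-digit variant, assuming non-negative integer keys. The name `radix_sort_lsd` and the default of 4 bits per round are illustrative choices; in the parallel round each processor would bucket its own subset before the buckets are merged.

```python
def radix_sort_lsd(keys, bits_per_round=4):
    """LSD radix sort of non-negative integers, one stable bucket pass per digit."""
    keys = list(keys)
    if not keys:
        return keys
    mask = (1 << bits_per_round) - 1
    max_key = max(keys)
    shift = 0
    while (max_key >> shift) > 0:
        buckets = [[] for _ in range(1 << bits_per_round)]
        for k in keys:
            # Appending preserves the order from earlier rounds (stability).
            buckets[(k >> shift) & mask].append(k)
        keys = [k for bucket in buckets for k in bucket]
        shift += bits_per_round
    return keys
```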

Sort n/p elements, then Merge
P_0, P_1, ..., P_{p-1}: each processor sorts its own n/p elements (sort first n/p elements, sort second n/p elements, ...).
Pairs (P_0,P_1), (P_2,P_3), ...: merge n/p-element pairs.
Groups (P_0,P_1,P_2,P_3), ...: merge 2n/p-element pairs.
Root: p-processor merging.
HOW EFFICIENTLY CAN YOU MERGE?

Merge Sort
Divide into P groups.
Locally sort each group: (N/P) log(N/P) = O((N log N)/P).
Parallel merge of the P groups.
Binary tree: log P stages.
P/2 pairs using P processors.
At the leaves (level 0), 2 processors merge two N/P-sized lists each: time = O(N/P).
Level i: 2^(i+1) processors merge two 2^i N/P-sized lists each: time = O(N/P).
At the root: P processors merge two N/2-sized lists: time = O(N/P).
O((N/P) log P) in total, using the processors in parallel to perform each merge.

Merging Options
Optimal merge: O(n) work.
n-list merge:
Sort the smallest elements of each list: Ls.
Output the smallest element of Ls.
Advance the pointer of the corresponding list.
Parallel: insert the next element into Ls.
Extension of Batcher's algorithm: simpler to implement and schedule.

Optimal Multi-way Merge, N = P^2
1. Divide L1 and L2, respectively, into P sublists each: the ith sublist contains the elements at positions i, P+i, 2P+i, ..., N-P+i.
2. P_i merges the ith sublist from L1 and the ith sublist from L2. Results go back to the positions originally occupied by the two sublists. Each element is at most P off from its final position.
3. Divide the list into blocks of length P.
4. P_i merges the pair of blocks 2i and 2i+1. (Results are put in place.)
5. P_i merges blocks 2i+1 and 2i+2. (Results are put in place.)
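The five steps can be simulated sequentially as below, assuming L1 and L2 are sorted and each holds exactly N = P^2 elements. The name `multiway_merge` is illustrative; each iteration of a `for i` loop stands for the work of processor P_i, and the iterations within one loop are independent.

```python
from heapq import merge


def multiway_merge(l1, l2, p):
    """Merge sorted l1 and l2 (each of length p*p) following steps 1 to 5."""
    n = len(l1)
    assert len(l2) == n == p * p, "sketch assumes N = P^2 elements per list"
    arr = list(l1) + list(l2)
    # Steps 1-2: P_i merges the stride-p sublists and writes the result back
    # to the positions i, p+i, 2p+i, ... that the two sublists occupied.
    for i in range(p):
        arr[i::p] = list(merge(l1[i::p], l2[i::p]))
    # Steps 3-4: blocks of length p; P_i merges blocks 2i and 2i+1 in place.
    for i in range(p):
        lo = 2 * i * p
        arr[lo:lo + 2 * p] = list(merge(arr[lo:lo + p], arr[lo + p:lo + 2 * p]))
    # Step 5: P_i merges blocks 2i+1 and 2i+2 in place.
    for i in range(p - 1):
        lo = (2 * i + 1) * p
        arr[lo:lo + 2 * p] = list(merge(arr[lo:lo + p], arr[lo + p:lo + 2 * p]))
    return arr
```

Each processor handles O(P) = O(N/P) elements per step, matching the three O(N/P) merges counted in the proofs.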

Multi-way Merge: Proof A
Let L1_i and L2_i be the sublists assigned to P_i in step 1.
Consider X, the nth element in L2_i. There are three cases:
X lies between the mth element of L1_i, A, and the (m+1)st, B.
Or X < all elements of L1_i, or X > all elements of L1_i.
Case 1:
At least [mP+i+1] + [nP+i] = [(m+n+1)P+i] + (i-P+1) elements are < X: mP+i+1 elements are from L1 (<= A) and nP+i are from L2 (< X).
At most [(m+1)P+i] + [nP+i] = [(m+n+1)P+i] + i elements are < X: (m+1)P+i elements are from L1 (< B) and nP+i are from L2 (< X).
The rank of X among all elements is between [(m+n+1)P+i] + (i-P+1) and [(m+n+1)P+i] + i.
The array position of X after the merge is (m+n+1)P+i, so X is less than P away from its rank.
[Figure: A at position i+mP and B at i+(m+1)P in L1, X at i+nP in L2; after the merge X sits at i+(m+n+1)P]

Multi-way Merge: Proof B
Claim: Each block is sorted.
Proof: Consider the P elements of block i.
X, the jth element of the block (0 <= j < P), is stored by P_j; this jth element is the ith element for P_j.
Prove that the jth element (X) is smaller than the (j+1)st element (Y) for all j.
Every element assigned to P_{j+1} in Step 1 is > the element to its left, which is assigned to P_j.
Y is the ith element from P_{j+1} after Step 2.
Each element assigned to P_{j+1} is greater than an element assigned to P_j => Y is greater than at least i elements from P_j.
But X is the ith (sorted) element of P_j. Hence X < Y.

Multi-way Merge: Proof C
Claim: After Step 2, every element in block i is < all elements in block j, j > i+1.
I.e., the largest element in block i, X, is < the smallest, Y, in block i+2.
X is the last element in block i and Y is the first element in block i+2.
Note: X is assigned to P_{P-1} and Y is assigned to P_0 in Step 1.
All but two of the elements assigned to P_0 are > the respective elements on their left, which, in turn, are all assigned to P_{P-1}.
The two exceptions are the first elements of L1 and L2, respectively.
Since Y is the (i+2)nd element from P_0 after Step 2, Y is greater than at least i elements assigned to P_{P-1}.
But X is the ith element from P_{P-1}. Hence X < Y.
O(N/P) per merge, 3 merges.

Multi-way Merge, N > P^2
1. Divide L1 and L2, respectively, into P sublists each: the ith sublist contains the elements at positions i, P+i, 2P+i, ..., N-P+i.
2. P_i merges the ith sublist from L1 and the ith sublist from L2. Results go back to the positions originally occupied by the two sublists. Each element is at most P+1 off from its final position.
3. Divide the list into blocks of length P.
4. P_i merges the pair of blocks 2i and 2i+1, and likewise blocks P^2+2i and P^2+2i+1, etc. Results are put in place.
5. P_i merges blocks 2i+1 and 2i+2, and likewise the corresponding blocks offset by P^2, etc. Results are put in place.
Still O(N/P).

Multi-way Merge Sort
P lists, sort each (sort first N/P elements, sort second N/P elements, ...): sorting time = O((N/P) log(N/P)) = O((N/P) log N).
N/P-element pairs (P_0,P_1), (P_2,P_3), ...: P/2 pairs, 2 procs each. Time = (N/P)/2.
2N/P-element pairs (P_0,P_1,P_2,P_3), ...: P/4 pairs, 4 procs each.
Time at each level = O(N/P).
Total time = O((N/P) log P).

Practical Sorting
Multi-way merge suitable for CUDA? Too many elements per processor (shared memory size).
Slow network: sample sort.
Shared memory: bitonic, odd-even merge.
External sort: merge.

Example CUDA Sort
1. Divide the input into p equal parts of size t each.
2. Sort each part in a block: Batcher's odd-even merge sort.
3. Merge the p sorted lists.
Merge pairs of subsequences, one pair per block? The number of pairs decreases quickly.
Use multiple blocks per merge for larger lists: subdivide into smaller tasks.

Single Block Merge
Lists A and B with, say, t ~ 256 elements each: merge into C using a single t-thread block.
For an element a_i in A, Rank(a_i, C) = i + Rank(a_i, B).
Thread i computes Rank(a_i, B) using binary search, with B in shared memory.
Rank the elements of B similarly.
Thread i writes a_i to its final Rank position in shared memory.
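A sketch of merging by ranking, written in Python rather than CUDA for brevity. The name `rank_merge` is illustrative; each loop iteration plays the role of one thread. Ties are broken by letting equal elements of A precede those of B, an assumption added here so that every output slot is written exactly once.

```python
from bisect import bisect_left, bisect_right


def rank_merge(a, b):
    """Merge sorted lists a and b by computing each element's final rank."""
    c = [None] * (len(a) + len(b))
    for i, x in enumerate(a):
        # "Thread i": Rank(a_i, C) = i + Rank(a_i, B), found by binary search.
        c[i + bisect_left(b, x)] = x
    for j, y in enumerate(b):
        # Elements of B count the elements of A that are <= them.
        c[j + bisect_right(a, y)] = y
    return c
```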

Multiple Block Merge
Divide the large lists A and B into sequences of size t.
Select splitters S_A and S_B: every t-th element, which partitions the lists into chunks of size t.
Merge S_A and S_B into S (recursively select splitters if S_A or S_B are too large).
Compute Rank(s, B) for s in S_A:
Compute Rank(s, S_B) = Rank(s, S) - Rank(s, S_A).
This provides the t-element chunk of B within which s must lie.
Binary search within that chunk, one per thread.
Similarly compute Rank(s, A) for s in S_B.
Split A and B, each with S.
Merge the resulting pairs, one per block.

LOG N TIME SORTING

Examples of Fast Sort
Rank sort?
Given array A, for each i find rank[i] = Rank(A[i]:A), then place A[i] at position rank[i] of the output.
How fast can you find Rank(A:A)? If you had n^2 processors?
How fast can you merge sorted sublists? If you had n^2 processors?
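Rank sort can be sketched as below. The name `rank_sort` is illustrative; with n^2 processors every comparison (i, j) could be done at once, leaving only a sum per element. Breaking ties by index is an assumption added so that equal keys receive distinct ranks.

```python
def rank_sort(a):
    """Place each element at the position given by its rank within a."""
    n = len(a)
    out = [None] * n
    for i in range(n):
        # Count elements that must precede a[i]; ties are ordered by index.
        rank = sum(
            1
            for j in range(n)
            if a[j] < a[i] or (a[j] == a[i] and j < i)
        )
        out[rank] = a[i]
    return out
```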

c-cover Merging
Consider sequences A and B.
X is a c-cover of A and B if two consecutive elements of X have at most c elements of A between them, and at most c elements of B.
Given Rank(X:A) and Rank(X:B), compute Rank(A:B) and Rank(B:A) in O(1) time, with O(n) work.
If X is a c-cover of B, and we know Rank(A:X) and Rank(X:B), compute Rank(A:B) in O(1) time.

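c-cover Merging: Rank(X:A); Rank(X:B) => Rank(A:B); Rank(B:A)
X = x_1, x_2, ..., x_{i-1}, x_i, ..., x_s
A = a_1 .. a_{r_1}, a_{r_1+1} .. a_{r_2}, ..., a_{r_{i-1}+1} .. a_{r_i}, ...; A_i = a_{r_{i-1}+1} .. a_{r_i} lies between x_{i-1} and x_i.
B = b_1 .. b_{t_1}, b_{t_1+1} .. b_{t_2}, ..., b_{t_{i-1}+1} .. b_{t_i}, ...; B_i = b_{t_{i-1}+1} .. b_{t_i} lies between x_{i-1} and x_i.
For a in A_i: Rank(a:B) = Rank(x_{i-1}:B) + Rank(a:B_i)
But |B_i| <= c, as X is a c-cover of B.
Rank(a:B) can be found in O(1).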
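A sequential sketch of this ranking step. It assumes Rank(x:S) means the number of elements of S that are <= x, and that the ranks of X in A and B are passed in as the lists `rx_a` and `rx_b`; these names and `cover_rank` are illustrative. The inner scan visits at most |B_i| <= c elements, which is why each element of A needs only O(1) work.

```python
def cover_rank(a_list, b_list, rx_a, rx_b):
    """Return Rank(A:B) given Rank(X:A) and Rank(X:B) for a c-cover X."""
    # Segment boundaries: A_i = a_list[bounds_a[i]:bounds_a[i+1]], same for B.
    bounds_a = [0] + list(rx_a) + [len(a_list)]
    bounds_b = [0] + list(rx_b) + [len(b_list)]
    ranks = [0] * len(a_list)
    for seg in range(len(bounds_a) - 1):
        b_lo, b_hi = bounds_b[seg], bounds_b[seg + 1]
        for k in range(bounds_a[seg], bounds_a[seg + 1]):
            # Rank(a:B) = Rank(x_{i-1}:B) + Rank(a:B_i); scan the short B_i.
            r = b_lo
            while r < b_hi and b_list[r] <= a_list[k]:
                r += 1
            ranks[k] = r
    return ranks
```

Swapping the roles of A and B gives Rank(B:A) in the same way.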

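c-cover Merging: Rank(A:X); Rank(X:B) => Rank(A:B)
A = a_1, ..., a_i, ..., a, ...
X = x_1, x_2, ..., x_{i-1}, x_i, ..., x_s
B = b_1 .. b_{t_1}, b_{t_1+1} .. b_{t_2}, ..., b_{t_{i-1}+1} .. b_{t_i}, ...; B_i = b_{t_{i-1}+1} .. b_{t_i}
Rank(a:B) = Rank(x_{Rank(a:X)}:B) + Rank(a:B_i), where B_i is the part of B that follows x_{Rank(a:X)}.
But |B_i| <= c, as X is a c-cover of B.
Rank(a:B) can be found in O(1).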

Optimal O(log n)-time Merge Sort
The algorithm runs in stages on a binary tree.
Node i maintains a sorted list L[i]; the s-th stage generates list L_s[i].
Initially: L_0[i] = null for internal nodes; L_0[i] = the value at leaf node i.
The procedure works for any proper binary tree, not necessarily balanced.

Fast Optimal Merge Sort: Definitions
The algorithm proceeds up the tree, one stage at a time.
At stage s, node n is active if height(n) <= s <= 3*height(n).
height(n) = height(tree) - path length from root to n.
At each stage in which a node is active, it merges a sample of the lists of its children.

Algorithm
parallel for n = active nodes:
  L_{s+1}[n] = Merge(Sample_s(left), Sample_s(right))
Sample_s(n) =
  SUB_4(L_s[n]), if s <= 3*height(n)
  SUB_2(L_s[n]), if s = 3*height(n) + 1
  SUB_1(L_s[n]) = L_s[n], if s >= 3*height(n) + 2
where SUB_i(l_0, l_1, l_2, ...) = (l_{i-1}, l_{2i-1}, l_{3i-1}, ...)

Pipelined Merges
A node activates 1 step after its children, and stays active for 3 extra steps after they finish.
On activation, a node merges every 4th element of its children.
On the children's deactivation, a node takes the children's alternate elements, then every child element.
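The activation schedule and sampling rule can be simulated stage by stage as below, assuming a balanced tree over a power-of-two number of leaves. The names `sub`, `sample` and `pipelined_merge_sort` are illustrative, and each merge here is an ordinary sequential merge rather than the O(1) cover-based merge, so the sketch only shows how lists grow through the pipeline.

```python
from heapq import merge


def sub(i, lst):
    """SUB_i: every i-th element, starting with the element at index i-1."""
    return lst[i - 1::i]


def sample(lst, s, height):
    """Sample_s of a node of the given height whose current list is lst."""
    if s <= 3 * height:
        return sub(4, lst)
    if s == 3 * height + 1:
        return sub(2, lst)
    return list(lst)


def pipelined_merge_sort(values):
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two size"
    levels = [[[v] for v in values]]      # levels[h] = lists of height-h nodes
    while len(levels[-1]) > 1:
        levels.append([[] for _ in range(len(levels[-1]) // 2)])
    top = len(levels) - 1
    for s in range(3 * top):              # stage s builds L_{s+1} from L_s
        nxt = [levels[0]]
        for h in range(1, top + 1):
            if h <= s <= 3 * h:           # nodes at height h are active
                below = levels[h - 1]
                nxt.append([
                    list(merge(sample(below[2 * k], s, h - 1),
                               sample(below[2 * k + 1], s, h - 1)))
                    for k in range(len(levels[h]))
                ])
            else:
                nxt.append(levels[h])     # inactive nodes keep their list
        levels = nxt
    return levels[top][0]                 # the root is full after 3*height stages
```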

Brief Analysis
1. A node becomes full at stage = 3*height(node). (Induction on height.)
2. The number of elements at a node about doubles each stage: |L_{s+1}[n]| <= 2|L_s[n]| + 4. (Induction on height.)
3. The number of elements in active nodes = O(n). Levels active are s/3 to s. Level s/3 is full = O(n). Levels above = O(n/2), O(n/4), ... Total = O(n).
4. Sample_s(n) is a 4-cover for Sample_{s+1}(n), i.e., no more than 4 items of L_{s+1}[n] lie between two consecutive items of L_s[n].
5. For each stage s > height(n), L_s[n] is a 4-cover for Sample_s(left) and Sample_s(right).

Sample_s(n) is a 4-cover for Sample_{s+1}(n)
Def: Interval [a,b] k-intersects [-∞, Sample_s, ∞] if |{x : x in Sample_s and a <= x <= b}| = k.
Claim: [a,b] k-intersects Sample_{s-1} => [a,b] 2k-intersects Sample_s.
Induction: given that [a,b] k-intersects Sample_{s-1} and [a,b] 2k-intersects Sample_s, show that [a,b] 4k-intersects Sample_{s+1}.
[Figure: interval [a,b] drawn across the lists, with item counts. SUB_4(L_s): 2k; L_s: 8k-3; SUB_4(L_s) of the left and right children: p_1 + p_2 <= 8k-1, so 2p_1 + 2p_2 <= 2(8k-1); L_{s+1} = Merge(SUB_4(L_s), ...): 16k-2; SUB_4(L_{s+1}): 4k]

Algorithm Details
[Figure: L_s, L_{s+1} and the two Sample_s lists, related by covers]
For s >= 2, Merge(Sample_s(left), Sample_s(right)) => compute:
  Rank(Sample_s(left):Sample_s(right))
  Rank(Sample_s(right):Sample_s(left))
Use Rank(L_s:Sample_s(left)) and Rank(L_s:Sample_s(right)): O(1), since L_s is a 4-cover of each Sample_s.
Compute Rank(L_s:Sample_s(left)) from:
  Rank(L_s:Sample_{s-1}(left))
  Rank(Sample_{s-1}(left):Sample_s(left)) (4-cover)
Similarly compute Rank(L_s:Sample_s(right)).