COL 730: Parallel Programming


PARALLEL SORTING

Bitonic Merge and Sort
Bitonic sequence {a_0, a_1, ..., a_{n-1}}: a sequence with a monotonically increasing part followed by a monotonically decreasing part.
For some i: a_0 <= ... <= a_i and a_{i+1} >= ... >= a_{n-1}.
Or, a cyclic shift of the indices makes the sequence bitonic.

Split a Bitonic Sequence
Say a_0 <= ... <= a_{n/2-1} and a_{n/2} >= ... >= a_{n-1}. Pair each a_i with a_{i+n/2}:
Subsequence 1: {min(a_0, a_{n/2}), min(a_1, a_{n/2+1}), ...}
Subsequence 2: {max(a_0, a_{n/2}), max(a_1, a_{n/2+1}), ...}
Each subsequence is bitonic, and every element of Subsequence 1 <= every element of Subsequence 2.
Recursively sort each bitonic subsequence.
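The split step can be sketched in a few lines. This is a sequential sketch with a hypothetical function name (not from the slides), assuming n is a power of two:

```python
def bitonic_split(a):
    """Split a bitonic sequence into two bitonic halves such that
    every element of the first half <= every element of the second.
    Pairs a[i] with a[i + n/2], keeping the min in the low half."""
    n = len(a)
    lo = [min(a[i], a[i + n // 2]) for i in range(n // 2)]
    hi = [max(a[i], a[i + n // 2]) for i in range(n // 2)]
    return lo, hi

# Example: an increasing-then-decreasing (bitonic) input.
lo, hi = bitonic_split([1, 4, 7, 9, 8, 5, 3, 2])
```

In a sorting network each min/max pair is one compare-exchange unit, so all n/2 comparisons of a split run in parallel.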

Merge to Form a Bitonic Sequence
Given two sequences: sort the first in increasing order and the second in decreasing order; their concatenation is a bitonic sequence.

Bitonic Sort
Sort each pair alternately in increasing and decreasing order; every sequence of length four is now bitonic.
Sort recursively, again alternating increasing and decreasing order, forming bitonic sequences of length eight, and so on.
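The whole recursion above can be sketched sequentially. Function names are mine, not from the slides; the input length is assumed to be a power of two:

```python
def compare_exchange_split(a, up):
    """One bitonic split with a direction: pair a[i] with a[i + n/2]."""
    n = len(a)
    lo = [min(a[i], a[i + n // 2]) for i in range(n // 2)]
    hi = [max(a[i], a[i + n // 2]) for i in range(n // 2)]
    return (lo, hi) if up else (hi, lo)

def bitonic_merge(a, up=True):
    """Sort a bitonic sequence into increasing (up) or decreasing order."""
    if len(a) <= 1:
        return a
    first, second = compare_exchange_split(a, up)
    return bitonic_merge(first, up) + bitonic_merge(second, up)

def bitonic_sort(a, up=True):
    """Sort halves in opposite directions to form a bitonic sequence, then merge."""
    if len(a) <= 1:
        return a
    n = len(a)
    return bitonic_merge(bitonic_sort(a[:n // 2], True)
                         + bitonic_sort(a[n // 2:], False), up)
```

Every min/max inside one split is independent, which is exactly what the bitonic network exploits.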

Bitonic Network
log n stages of compare-exchange units: O(log^2 n) time, O(n log^2 n) work.
[Figure: 8-input bitonic sorting network, Stages 1-3.]
t(n) = t(n/2) + log n, which solves to O(log^2 n).
No better than (n log^2 n)/p time on p processors.

Sorting if n > p
Simulate n/p virtual processors per processor, like the PRAM simulation earlier.
Or, divide the elements into p blocks and locally sort the n/p elements of each block, then perform a bitonic sort of blocks: compare-exchange becomes compare-split.
Example: blocks {3 4 10 15 20} and {13 14 16 21 25} compare-split into {3 4 10 13 14} and {15 16 20 21 25}.
O(n/p log n/p + n/p log^2 p) time: local sorting plus log^2 p compare-splits.
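The compare-split primitive can be sketched as follows (hypothetical function name; in practice each processor merges the two sorted blocks and keeps only its half):

```python
def compare_split(block_lo, block_hi):
    """Two processors holding sorted blocks exchange copies;
    one keeps the smaller half, the other keeps the larger half."""
    merged = sorted(block_lo + block_hi)  # stand-in for an O(n/p) merge
    k = len(block_lo)
    return merged[:k], merged[k:]
```

With sorted inputs the `sorted` call would be a linear-time merge, so a compare-split on n/p-sized blocks costs O(n/p).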

Odd-Even Exchange
Example: 10 34 20 5 30 1 40 15 50 45 -> 10 34 5 20 1 30 15 40 45 50 -> 10 5 34 1 20 15 30 40 45 50 -> ...
Repeat the odd/even phase pair n/2 times: O(n^2) work.
Parallel version: divide into p blocks; sort each block locally; repeat the odd and even phases p/2 times, each compare-split done on p blocks by p processors.
O(n/p log n/p) + O(n) time: local sort plus p compare-splits, each processor sequentially splitting a 2n/p-sized block.
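The element-level version (one "processor" per element, simulated sequentially; function name is mine) alternates compare-exchanges on even and odd pairs:

```python
def odd_even_transposition_sort(a):
    """Odd-even transposition sort: alternate compare-exchange on
    pairs (0,1),(2,3),... and (1,2),(3,4),...
    n phases suffice for n elements; each phase is fully parallel."""
    a = list(a)
    n = len(a)
    for phase in range(n):
        start = phase % 2
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

The block version in the slide is the same schedule with compare-exchange replaced by compare-split on sorted blocks.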

Batcher's Odd-Even Merge
Two-way merge of sorted subsequences, e.g. {10 20 30 35 40 50} and {1 4 15 45 49 52}.
Merge the even-indexed elements of each: e0, e1, e2, e3, ...
Merge the odd-indexed elements of each: o0, o1, o2, o3, ...
Then perform a pair-wise compare-exchange: e0, min(e1, o0), max(e1, o0), min(e2, o1), max(e2, o1), ..., o(n-1).
Example: interleaving gives 1 4 10 20 15 35 30 45 40 50 49 52; the compare-exchange pass yields 1 4 10 15 20 30 35 40 45 49 50 52.
W(2n) = 2W(n) + O(n), T(2n) = T(n) + O(1): O(n log n) work, O(log n) time.
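The recursion above can be sketched sequentially (hypothetical function name; both inputs are assumed sorted, equal in length, and a power of two):

```python
def odd_even_merge(a, b):
    """Batcher's odd-even merge: recursively merge the even-indexed and
    odd-indexed subsequences, then one pass of pairwise compare-exchanges
    fixes the interleaving."""
    if len(a) == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    evens = odd_even_merge(a[::2], b[::2])
    odds = odd_even_merge(a[1::2], b[1::2])
    # e0, then compare-exchange (o_i, e_{i+1}) pairs, then the last odd.
    out = [evens[0]]
    for i in range(len(odds) - 1):
        out += [min(odds[i], evens[i + 1]), max(odds[i], evens[i + 1])]
    out.append(odds[-1])
    return out
```

Every compare-exchange in the final pass is independent, giving the T(2n) = T(n) + O(1) parallel time in the slide.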

Parallel Bucket Sort
Divide the range [a,b] of keys into p equal sub-ranges, or buckets.
Divide the input into p blocks; processor p_i sends each element in its block to the processor owning that element's bucket.
For uniformly distributed input, the expected bucket size is uniform.
Locally sort each bucket.

Parallel Bucket Sort
Decide the buckets, e.g., ranges of key values.
Parallel for i: put element i in its bucket b. Each bucket may be kept locally at a processor, with buckets from all processors merged later; sort each bucket separately.
For uniformly distributed input the expected bucket size is uniform, but there is a real risk of load imbalance.
Sample sort: choose a sample of size s; sort the samples; choose B-1 evenly spaced elements from the sorted list. These splitters provide the ranges for B buckets.
O(n/p log n/p + p log p)?

Parallel Splitter Selection
Divide the n elements equally into B blocks; (quick)sort each block: O(n/B log n/B).
From each sorted block choose B-1 evenly spaced elements; use these B(B-1) elements as samples.
Sort the samples: O(B^2 log B). Choose B-1 splitters from them.
Arrange elements by bucket in the output array: count the number of elements in each bucket, prefix-sum the counts, and reserve space per bucket (in place).
No bucket contains more than 2n/B elements: O(n/B + B log B).
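The splitter-selection step can be sketched sequentially. This is a simplified sketch with names of my own choosing, assuming n is divisible by B and each block has at least B elements:

```python
def choose_splitters(data, B):
    """Regular-sampling splitter selection: sort each of B blocks,
    take B-1 evenly spaced samples per block, sort the B*(B-1) samples,
    and pick B-1 evenly spaced splitters from them."""
    n = len(data)
    blk = n // B
    samples = []
    for i in range(B):
        block = sorted(data[i * blk:(i + 1) * blk])
        samples += [block[(j + 1) * blk // B] for j in range(B - 1)]
    samples.sort()
    # B-1 evenly spaced splitters out of the B*(B-1) sorted samples.
    return [samples[(j + 1) * (B - 1)] for j in range(B - 1)]
```

The splitters then define B bucket ranges; the 2n/B bucket-size bound in the slide is what makes the subsequent per-bucket sorts load-balanced.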

Radix Sort
Least significant digit first: b bits per round; the entire list is sorted per round (of bucket sort); subdivide into p equal subsets.
Most significant digit first: separate into buckets; sort only within a bucket per round. Load balance? With enough buckets, each processor finishes with a local sort.
Each parallel round: local bucketing, then bucket merging.
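The LSD variant can be sketched as follows (hypothetical function name; non-negative integers assumed). Each round is a stable bucket pass, which parallelizes with per-bucket counts plus a prefix sum:

```python
def lsd_radix_sort(a, bits=2):
    """LSD radix sort, `bits` bits per round. Each round stably
    distributes the whole list into 2**bits buckets and concatenates."""
    radix = 1 << bits
    maxval = max(a)
    shift = 0
    while (maxval >> shift) > 0:
        buckets = [[] for _ in range(radix)]
        for x in a:                       # parallelizable: count + prefix sum
            buckets[(x >> shift) & (radix - 1)].append(x)
        a = [x for b in buckets for x in b]
        shift += bits
    return a
```

Stability of each pass is what lets later (more significant) rounds preserve the order established by earlier ones.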

Sort n/p Elements, then Merge
Each processor P_0 ... P_{p-1} sorts n/p elements; merge n/p-element pairs (P_0,P_1), (P_2,P_3), ...; then merge 2n/p-element pairs; and so on, until all p processors cooperate on the final merge.
HOW EFFICIENTLY CAN YOU MERGE?

Merge Sort
Divide into P groups; locally sort each group: (N/P) log (N/P) = O((N log N)/P).
Parallel merge of the P groups via a binary tree: log P stages, P/2 pairs, using P processors.
At the leaves (level 1), pairs of processors merge two N/P-sized lists: time O(N/P).
At level i, 2^i processors merge two 2^(i-1)*N/P-sized lists: time O(N/P).
At the root, P processors merge two N/2-sized lists: time O(N/P).
Total merge time O(N/P) * log P, using P processors to perform each merge.

Merging Options
Optimal sequential merge: O(n) work.
n-list merge: keep the smallest element of each list in a structure Ls; repeatedly output the smallest element of Ls, advance the pointer of the corresponding list, and insert that list's next element into Ls.
Parallel: an extension of Batcher's algorithm is simpler to implement and schedule.


Optimal Multi-way Merge (N = P^2)
1. Divide L1 and L2, respectively, into P sublists each; the ith sublist contains the elements at positions i, P+i, 2P+i, ..., N-P+i.
2. P_i merges the ith sublist from L1 and the ith sublist from L2; the results go back to the positions originally occupied by the two sublists. Each element is now at most P off from its final position.
3. Divide the list into blocks of length P.
4. P_i merges the pair of blocks 2i and 2i+1 (results are put in place).
5. P_i merges blocks 2i+1 and 2i+2 (results in place).
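The five steps can be simulated sequentially to see why they work. This is a sketch under the N = P^2 assumption, with a hypothetical function name; each `sorted` call stands in for the linear-time 2-way merge a processor would perform:

```python
def multiway_merge(L1, L2, P):
    """Sequential simulation of the optimal multi-way merge for N = P*P:
    strided sublist merges (Steps 1-2), then two rounds of adjacent
    length-P block merges (Step 4: even pairs, Step 5: odd pairs)."""
    N = len(L1)
    out = [None] * (2 * N)
    # Steps 1-2: P_i merges the i-th strided sublists of L1 and L2 and
    # writes the result back into the slots those sublists occupied.
    for i in range(P):
        merged = sorted(L1[i::P] + L2[i::P])      # stand-in for a 2-way merge
        slots = list(range(i, N, P)) + list(range(N + i, 2 * N, P))
        for slot, v in zip(slots, merged):
            out[slot] = v
    # Step 4: P_i merges blocks 2i and 2i+1 in place.
    for b in range(0, 2 * N // P, 2):
        out[b * P:(b + 2) * P] = sorted(out[b * P:(b + 2) * P])
    # Step 5: P_i merges blocks 2i+1 and 2i+2 in place.
    for b in range(1, 2 * N // P - 1, 2):
        out[b * P:(b + 2) * P] = sorted(out[b * P:(b + 2) * P])
    return out
```

Proofs A-C below justify why the two block-merge rounds suffice: after Step 2 every block is sorted and no element is more than one block away from home.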

Multi-way Merge: Proof A
Let L1_i and L2_i be the sublists assigned to P_i in Step 1. Consider X, the nth element of L2_i. There are three cases: X lies between the mth element of L1_i, A, and the (m+1)st, B; or X < all elements of L1_i; or X > all elements of L1_i.
Case 1:
At least [mP+i+1] + [nP+i] = [(m+n+1)P+i] + (i-P+1) elements are < X: mP+i+1 elements from L1 (<= A) and nP+i from L2 (< X).
At most [(m+1)P+i] + [nP+i] = [(m+n+1)P+i] + i elements are < X: (m+1)P+i elements from L1 (< B) and nP+i from L2 (< X).
So the rank of X among all elements lies between [(m+n+1)P+i] + i-P+1 and [(m+n+1)P+i] + i, while the array position of X after the Step 2 merge is (m+n+1)P + i. Hence X is at most P off from its final position.

Multi-way Merge: Proof B
Claim: after Step 2, each block is sorted.
Proof: consider the P elements of block i. X, the jth element (0 <= j < P), is stored by P_j; this jth element is the ith element output by P_j. We show the jth element X is smaller than the (j+1)st element Y, for all j.
Every element assigned to P_{j+1} in Step 1 is greater than the element to its left, which is assigned to P_j. Y is the ith element from P_{j+1} after Step 2; since each element assigned to P_{j+1} is greater than an element assigned to P_j, Y is greater than at least i elements from P_j. But X is the ith (sorted) element of P_j. Hence X < Y.

Multi-way Merge: Proof C
Claim: after Step 2, every element in block i is < all elements in block j for j > i+1; i.e., the largest element of block i, X, is < the smallest element of block i+2, Y.
X is the last element of block i and Y is the first element of block i+2. Note that X is assigned to P_{P-1} and Y to P_0 in Step 1.
P-2 of the elements assigned to P_0 are greater than the respective elements on their left, which are all assigned to P_{P-1}; the two exceptions are the first elements from L1 and L2, respectively.
Since Y is the (i+2)nd element from P_0 after Step 2, Y is greater than at least i elements assigned to P_{P-1}. But X is the ith element from P_{P-1}. Hence X < Y.
O(N/P) per merge, 3 merges.

Multi-way Merge (N > P^2)
1. Divide L1 and L2, respectively, into P sublists each; the ith sublist contains the elements at positions i, P+i, 2P+i, ..., N-P+i.
2. P_i merges the ith sublist from L1 and the ith sublist from L2; the results go back to the positions originally occupied by the two sublists. Each element is at most P+1 off from its final position.
3. Divide the list into blocks of length P.
4. P_i merges the pair of blocks 2i and 2i+1, and P^2+2i and P^2+2i+1, etc. (results in place).
5. P_i merges blocks 2i+1 and 2i+2, and P^2+2i+1 and P^2+2i+2, etc. (results in place).
Still O(N/P).

Multi-way Merge Sort
Sort P lists of N/P elements each, one per processor: sorting time O(N/P log (N/P)) = O(N/P log N).
Merge N/P-element pairs (P_0,P_1), (P_2,P_3), ... as P/2 pairs with 2 processors each; then 2N/P-element pairs as P/4 pairs with 4 processors each; and so on.
Time at each of the log P levels is O(N/P), so total merge time = O(N/P) log P.

Practical Sorting
Is multi-way merge suitable for CUDA? Too many elements per processor (shared-memory size), and the network is slow.
Sample sort; in shared memory: bitonic or odd-even merge.
External sort: merge.

Example CUDA Sort
1. Divide the input into p equal parts of size t each.
2. Sort each part in a thread block, using Batcher's odd-even mergesort.
3. Merge the p sorted lists. Merge pairs of subsequences per block? The number of pairs decreases quickly; use multiple blocks per merge for larger lists, subdividing into smaller tasks.

Single Block Merge
For lists with, say, t ~ 256 elements each, merge using a single t-thread block.
For an element a_i in A: Rank(a_i, C) = i + Rank(a_i, B). Thread i computes Rank(a_i, B) by binary search, with B in shared memory.
Rank the elements of B similarly. Thread i then writes a_i to its final rank position in shared memory.
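The per-thread ranking can be sketched sequentially (hypothetical function name; each loop iteration corresponds to one thread's independent work):

```python
from bisect import bisect_left, bisect_right

def rank_merge(A, B):
    """Merge by ranking: element A[i] lands at index i + Rank(A[i], B),
    found by binary search. Using bisect_left for A's elements and
    bisect_right for B's breaks ties so equal keys get distinct slots."""
    C = [None] * (len(A) + len(B))
    for i, x in enumerate(A):          # "thread i" for list A
        C[i + bisect_left(B, x)] = x
    for j, y in enumerate(B):          # "thread j" for list B
        C[j + bisect_right(A, y)] = y
    return C
```

Every write goes to a distinct index, so all 2t "threads" can run concurrently with no synchronization beyond the final write.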

Multiple Block Merge
Divide large lists into sequences of size t. Select splitters S_A and S_B: every tth element, partitioning the lists into chunks of size t.
Merge S_A and S_B into S (recursively select splitters if S_A or S_B are too large).
Compute Rank(s, B) for each s in S_A: Rank(s, S_B) = Rank(s, S) - Rank(s, S_A) identifies the t-element chunk of B within which s must lie, then binary search, one per thread. Similarly compute Rank(s, A) for each s in S_B.
Split A and B, each with S; merge the resulting pairs, one per block.

LOG N TIME SORTING

Examples of Fast Sort
Rank sort: given array A, for each i find rank[i], the rank of A[i]; then write A[i] to output position rank[i].
How fast can you find Rank(A:A) if you had n^2 processors?
How fast can you merge sorted sublists if you had n^2 processors?
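Rank sort can be sketched as follows (hypothetical function name). With n^2 processors, all n^2 comparisons run in parallel and each rank is a parallel sum:

```python
def rank_sort(A):
    """Rank sort: rank[i] counts elements smaller than A[i]
    (index order breaks ties among equal keys), then each element
    writes itself to its rank position."""
    n = len(A)
    out = [None] * n
    for i in range(n):
        rank = sum(1 for j in range(n)
                   if A[j] < A[i] or (A[j] == A[i] and j < i))
        out[rank] = A[i]
    return out
```

The tie-break by index makes the ranks a permutation, so the writes never collide; summing n comparison bits in parallel takes O(log n) time, hinting at the question in the slide.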

c-cover Merging
Consider sequences A and B. X is a c-cover of A and B if two consecutive elements of X have at most c elements of A between them, and at most c elements of B.
Given Rank(X:A) and Rank(X:B), compute Rank(A:B) and Rank(B:A) in O(1) time with O(n) work.
If X is a c-cover of B, and we know Rank(A:X) and Rank(X:B), we can compute Rank(A:B) in O(1).

c-cover Merging: Rank(X:A), Rank(X:B) => Rank(A:B), Rank(B:A)
Let x_i have rank r_i in A and rank t_i in B, so A splits into segments A_i = a_{r_{i-1}+1} .. a_{r_i} and B into segments B_i = b_{t_{i-1}+1} .. b_{t_i}.
For a in A_i: Rank(a:B) = Rank(x_{i-1}:B) + Rank(a:B_i).
But |B_i| <= c, as X is a c-cover of B, so Rank(a:B) can be found in O(1).

c-cover Merging: Rank(A:X), Rank(X:B) => Rank(A:B)
For a in A, let i-1 = Rank(a:X), so a falls in the segment B_i after x_{i-1}.
Rank(a:B) = Rank(x_{Rank(a:X)}:B) + Rank(a:B_i).
But |B_i| <= c, as X is a c-cover of B, so Rank(a:B) can be found in O(1).

Optimal O(log n)-time Merge Sort
The algorithm runs in stages on a binary tree. Node i maintains a sorted list L[i]; the sth stage generates list L_s[i].
Initially: L_0[i] = null for internal nodes; L_0[i] = the value at leaf node i.
The procedure works for any proper binary tree, not necessarily balanced.

Fast Optimal Merge Sort: Definitions
The algorithm proceeds up the tree, one stage at a time.
At stage s, node n is active if height(n) <= s <= 3*height(n), where height(n) = height(tree) - path-length from root to n.
At each stage in which a node is active, it merges a sample of the lists of its children.

Algorithm
parallel for n in active nodes: L_{s+1}[n] = Merge(Sample_s(left), Sample_s(right))
Sample_s(n) = SUB_4(L_s[n]) if s <= 3*height(n)
            = SUB_2(L_s[n]) if s = 3*height(n) + 1
            = SUB_1(L_s[n]) = L_s[n] if s >= 3*height(n) + 2
where SUB_i(l_0, l_1, l_2, ...) = (l_{i-1}, l_{2i-1}, l_{3i-1}, ...).
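The sampling rule translates directly into slicing. A minimal sketch with function names of my own choosing, following the slide's SUB_i notation:

```python
def sub(L, i):
    """SUB_i from the slide: (l_{i-1}, l_{2i-1}, l_{3i-1}, ...)."""
    return L[i - 1::i]

def sample(L, s, h):
    """Sample_s(n) for a node of height h whose current list is L."""
    if s <= 3 * h:
        return sub(L, 4)      # every 4th element while the node is filling
    if s == 3 * h + 1:
        return sub(L, 2)      # every 2nd element one stage after full
    return sub(L, 1)          # the whole list from then on
```

The coarse-to-fine schedule (4th, 2nd, all) is what keeps each parent's sample a 4-cover of the next, so every merge in the pipeline costs O(1) time per element.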

Pipelined Merges
A node activates one stage after its children become active, and stays active for 3 extra stages after they finish.
On activation, a node merges every 4th element of its children's lists; after the children deactivate, it takes every 2nd element of theirs, then every element.

Brief Analysis
1. A node becomes full at stage 3*height(node) (induction on height).
2. The number of elements at a node roughly doubles per stage: |L_{s+1}[n]| <= 2|L_s[n]| + 4 (induction on height).
3. The number of elements in active nodes is O(n): the active levels run from s/3 to s; level s/3 is full and holds O(n); the levels above hold O(n/2), O(n/4), ...; total O(n).
4. Sample_s(n) is a 4-cover for Sample_{s+1}(n), i.e., no more than 4 items of L_{s+1}[n] lie between two consecutive items of L_s[n].
5. For each stage s > height(n), L_s[n] is a 4-cover for Sample_s(left) and Sample_s(right).

Sample_s(n) is a 4-cover for Sample_{s+1}(n)
Definition: interval [a,b] k-intersects a list if exactly k elements x of the list satisfy a <= x <= b.
Claim: if [a,b] k-intersects Sample_{s-1}, then [a,b] 2k-intersects Sample_s.
Induction: assume [a,b] k-intersects Sample_{s-1} implies [a,b] 2k-intersects Sample_s; show [a,b] 4k-intersects Sample_{s+1}.
Chain of bounds: 2k intersections with SUB_4(L_s) mean at most 8k-3 with L_s (at most 4 elements of L_s between consecutive sample points p_1, p_2, ...); summing the left and right contributions p_1 + p_2 <= 8k-1 over SUB_4(L_s) of both children gives at most 2(8k-1) = 16k-2 intersections with L_{s+1} = Merge(SUB_4(L_s(left)), SUB_4(L_s(right))), hence at most 4k with SUB_4(L_{s+1}).

Algorithm Details
For s >= 2, Merge(Sample_s(left), Sample_s(right)) requires computing Rank(Sample_s(left) : Sample_s(right)) and Rank(Sample_s(right) : Sample_s(left)).
Use Rank(L_s : Sample_s(left)) and Rank(L_s : Sample_s(right)): O(1), since L_s is a 4-cover of each Sample_s.
Compute Rank(L_s : Sample_s(left)) from Rank(L_s : Sample_{s-1}(left)) and Rank(Sample_{s-1}(left) : Sample_s(left)) (a 4-cover). Similarly compute Rank(L_s : Sample_s(right)).
