CSE 4502/5717: Big Data Analytics


Notes by Anthony Hershberger
Lecture 4 - January 31st, 2018

1 Problem of Sorting

A known lower bound for sorting is Ω((N/B) · log(N/M) / log(M/B)) I/O operations, where N is the input size, M is the core memory size, and B is the block size.

Merge Sort

Input: N elements.

1. In one pass through the data, form runs of length M each.
2. Following this, merge two runs at a time, as shown in the figure below. Each leaf in the merge tree corresponds to a run of length M.

[Figure: binary merge tree with log(N/M) levels; each internal node q merges two runs x1 and x2.]
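The run-formation pass can be sketched as follows; `form_runs` is a hypothetical helper (not from the notes) that sorts one memory-load of M keys at a time:

```python
def form_runs(data, M):
    """One pass through the data: take M keys at a time, sort them in
    core memory, and emit each sorted chunk as an initial run.
    Every run (except possibly the last) has length exactly M."""
    return [sorted(data[j:j + M]) for j in range(0, len(data), M)]
```

Each run costs M/B block reads and M/B block writes, so the whole pass takes 2N/B I/Os, i.e., O(N/B).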

Process: Consider an arbitrary node (i.e., a non-leaf) in this tree. Two runs are merged at this node; we now show how to merge these two runs A and C. Let |A| = |C| = l.

To begin with, bring M/(2B) blocks from each run and start merging them. We keep an output buffer of size B. When one block is ready in the output (i.e., when the output buffer is full), ship it to the disk. When we run out of keys from either A or C, bring one more block of that run from the disk. Repeat the above steps until all the keys from one of the runs have been used; at this time, output the remaining keys from the other run.

Number of I/O operations needed to merge A and C: the total number of elements in A and C together is 2l, so merging them takes 2l/B I/O operations. This implies that the total number of I/Os needed at any level of the tree is N/B.

Note: We spend N/B I/O operations per level of the tree, and the number of levels is log(N/M). This implies that the total number of I/Os taken by the entire algorithm is (N/B)[log(N/M) + 1], where the extra 1 accounts for the pass that forms the initial runs. This is not optimal; can we make it optimal?

Modification: We will use a modification where we merge k sorted sequences at a time (for some k > 2). First we form runs of length M each. Let q be any node in the merge tree; the node q merges k sorted sequences of length l each.

[Figure: k-ary merge tree with log_k(N/M) levels; node q merges runs x1, x2, ..., xk.]
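The block-wise merging described above (shown here for two runs) can be simulated to check the I/O count. This is my own sketch, not from the notes; for simplicity it reads one block of B keys at a time rather than M/(2B) blocks up front, which does not change the number of transfers:

```python
def merge_two_runs(A, C, B):
    """Merge two sorted runs block by block, counting block transfers.
    A read refills a run's buffer with its next B keys; a write flushes
    a full output buffer of B keys to 'disk'."""
    runs = [A, C]
    bufs = [[], []]          # in-memory buffers, one per run
    pos = [0, 0]             # how many keys of each run were read so far
    out, out_buf = [], []
    reads = writes = 0

    def refill(r):
        nonlocal reads
        if pos[r] < len(runs[r]):
            bufs[r] = list(runs[r][pos[r]:pos[r] + B])
            pos[r] += len(bufs[r])
            reads += 1

    refill(0)
    refill(1)
    while bufs[0] or bufs[1]:
        # take the smaller head key; fall back to the non-empty run
        if not bufs[0]:
            r = 1
        elif not bufs[1]:
            r = 0
        else:
            r = 0 if bufs[0][0] <= bufs[1][0] else 1
        out_buf.append(bufs[r].pop(0))
        if not bufs[r]:
            refill(r)
        if len(out_buf) == B:    # one block ready: ship it to disk
            out.extend(out_buf)
            out_buf = []
            writes += 1
    if out_buf:                  # flush the final partial block
        out.extend(out_buf)
        writes += 1
    return out, reads, writes
```

With |A| = |C| = l and B dividing 2l, the simulation performs 2l/B reads and 2l/B writes, matching the 2l/B count in the notes up to a constant factor.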

Process: We explain how any node q in the tree merges k runs. Let the length of each run be l. To begin with, bring M/(kB) blocks from each run. When B elements are ready in the output buffer, write this block to the disk. When we run out of keys in any run, read the next block of this run from the disk. When all the keys from some run have been used, this run will not play a role after that. Repeat the above steps until all the runs have been merged.

Total number of I/Os at any level of the tree = N/B. Number of levels = log_k(N/M) = log(N/M) / log(k).

Note: Logarithms are to the base 2 unless explicitly stated otherwise.

If we fix the value of k = M/B, then:

Total number of I/O operations = (N/B) [log(N/M) / log(M/B) + 1].

Question: Is this asymptotically optimal? Yes: it matches the lower bound stated at the beginning of these notes.

2 Selection Algorithm

Input: X = k1, k2, ..., kn.
Output: The i-th smallest element of X.

QuickSelect(X, i):
1. Pick a pivot k ∈ X.
2. Partition X into two parts: the elements less than k and the elements greater than k.

Without loss of generality, assume that the elements of X are distinct. Let X1 = {q ∈ X : q < k} and X2 = {q ∈ X : q > k}.

Case 1: If |X1| = i - 1, then output k and quit.
Case 2: If |X1| ≥ i, then output QuickSelect(X1, i).
Case 3: If |X1| + 1 < i, then output QuickSelect(X2, i - |X1| - 1).

QuickSelect has a worst-case run time of Ω(n²), but we can show that its average run time is O(n).
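The three cases above translate directly into code; a minimal Python sketch with a random pivot (elements assumed distinct, i is 1-indexed):

```python
import random

def quick_select(X, i):
    """Return the i-th smallest element of X.
    Expected O(n) time with a random pivot; worst case Omega(n^2)."""
    k = random.choice(X)                 # step 1: pick a pivot
    X1 = [q for q in X if q < k]         # step 2: partition
    X2 = [q for q in X if q > k]
    if len(X1) == i - 1:                 # Case 1: the pivot is the answer
        return k
    if len(X1) >= i:                     # Case 2: answer lies in X1
        return quick_select(X1, i)
    return quick_select(X2, i - len(X1) - 1)   # Case 3: answer lies in X2
```

The answer is deterministic; only the running time depends on the random pivot choices.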

Blum, Floyd, Pratt, Rivest, and Tarjan modified this algorithm to obtain an algorithm that takes O(n) time in the worst case.

BFPRT Algorithm (Blum, Floyd, Pratt, Rivest, Tarjan, 1973)

1. Pick a pivot as follows:
(a) Group the input into groups of size 5 each; let the groups be G1, G2, G3, ..., G_{n/5}.
(b) Find the median of each group; this can be done using 9n/5 comparisons in all. Let the group medians be denoted M1, M2, M3, ..., M_{n/5}.
(c) Find the median M of these medians RECURSIVELY.
2. Perform steps 2 and 3 of QuickSelect with M as the pivot.

Question: Why does the algorithm work? The median of the medians can be thought of as an approximate median of X. As a result, we can show that the sizes of X1 and X2 will be nearly the same.

Analysis of the BFPRT algorithm.

Claim: |X1| ≤ (7/10)n and |X2| ≤ (7/10)n.

Proof: Consider all the groups whose medians are ≥ M. There are at least n/10 such groups, and each contains at least three elements that are ≥ M, so at least 3n/10 elements of X are ≥ M. This means that the size of X1 cannot be more than (7/10)n. In a similar manner, using the groups whose medians are ≤ M, we can show that the size of X2 cannot exceed (7/10)n.

Run time of the BFPRT algorithm: Let T(n) be the run time of this algorithm on any input of size n and for any i. Step 1b takes 9n/5 comparisons, and step 1c takes T(n/5) time. Steps 2 and 3 of QuickSelect take n and T((7/10)n) comparisons, respectively. Thus we get:

T(n) = (9/5)n + n + T(n/5) + T((7/10)n).

Claim: T(n) ≤ Cn for some constant C. We'll prove this by induction.
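The BFPRT steps above can be sketched as follows (a sketch assuming distinct keys; each group's median is found here by sorting the group of five, which is enough for the O(n) bound):

```python
def bfprt_select(X, i):
    """Worst-case O(n) selection (Blum-Floyd-Pratt-Rivest-Tarjan).
    Returns the i-th smallest element of X (1-indexed), distinct keys."""
    if len(X) <= 5:
        return sorted(X)[i - 1]
    # Steps 1a-1b: group into fives and take each group's median.
    groups = [X[j:j + 5] for j in range(0, len(X), 5)]
    medians = [sorted(g)[len(g) // 2] for g in groups]
    # Step 1c: the pivot is the median of the medians, found recursively.
    M = bfprt_select(medians, (len(medians) + 1) // 2)
    # Step 2: partition around M exactly as in QuickSelect.
    X1 = [q for q in X if q < M]
    X2 = [q for q in X if q > M]
    if len(X1) == i - 1:
        return M
    if len(X1) >= i:
        return bfprt_select(X1, i)
    return bfprt_select(X2, i - len(X1) - 1)
```

By the claim above, each recursive call on X1 or X2 sees at most 7n/10 elements, which is what makes the recurrence work out to linear time.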

Assume the hypothesis for all inputs of size at most n - 1; we will prove it for n. From the recurrence,

T(n) ≤ T(n/5) + T((7/10)n) + dn,

where d is some constant (here d = 9/5 + 1 = 14/5). Applying the induction hypothesis, we get:

T(n) ≤ C(n/5) + C(7/10)n + dn = 0.9Cn + dn.

The RHS is at most Cn if 0.9Cn + dn ≤ Cn, i.e., if 0.1C ≥ d, i.e., if C ≥ 10d. Choosing C = 10d, we conclude T(n) ≤ 10dn. QED
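As a numeric sanity check on the bound T(n) ≤ 10dn, one can iterate the recurrence directly. This is my own sketch, not part of the notes; the base value T(n) = 1 for n ≤ 1 is an arbitrary choice:

```python
def T(n, d=14/5):
    """Iterate T(n) = T(n/5) + T(7n/10) + d*n from the analysis above,
    with d = 9/5 + 1 = 14/5, and T(n) = 1 for n <= 1."""
    if n <= 1:
        return 1.0
    return T(n / 5, d) + T(0.7 * n, d) + d * n
```

For moderate n the iterated value stays below 10dn = 28n, consistent with the induction.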