Finding the Median. Lecture notes, September 2. Lecturer: Michel Goemans

18.310 lecture notes, September 2, 2013
Finding the Median
Lecturer: Michel Goemans

1 The problem of finding the median

Suppose we have a list of $n$ keys that are completely unsorted. If we want to find the largest or the smallest key, it is very easy to do so with $n - 1$ comparisons. If we want to find the $m$th largest key, we can use a heap structure to do it in $cn + cm\log(n)$ comparisons for some constant $c$. Indeed, it takes linear time to build the heap, and then one can extract the maximum key $m$ times in order to get the desired key (at a cost of $\log(n)$ comparisons per extraction). If $m$ is larger than $n/2$, we can do slightly better by reversing the ordering (that is, building the heap based on minus the keys). In any case, if we are interested in the $m$th smallest key (the key of rank $m$) and $m$ is close to $n/2$ (or linear in $n$), we are not saving much time compared to sorting all the keys.

In these notes, we show how to find the median (the key of rank $(n+1)/2$ if $n$ is odd, or the key of rank $n/2$ or $n/2 + 1$ if $n$ is even) with only a linear number ($cn$) of comparisons in the worst case.

2 A good strategy?

The first approach that comes to mind when trying to find the rank $m$ key is to take an arbitrary key, say $p$, and partition the $n$ keys into two piles, one containing the keys smaller than $p$ and the other containing the keys greater than $p$. By comparing $m$ to the size of the piles, we either find out that we got extremely lucky and the rank $m$ key is $p$, or we can discard one of the piles and recursively search for the appropriate key in the other pile. More precisely, if the first pile has at least $m$ elements, we can recursively find the rank $m$ key in the first pile; if not, we can replace $m$ appropriately and search in the second pile.

The problem with this approach is that, in the worst case, one of the piles will (repeatedly) be empty, which would result in a number of comparisons equal to $(n-1) + (n-2) + \cdots$, which grows quadratically with $n$.
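The quadratic worst case of the arbitrary-pivot strategy can be seen in a small experiment. The following is a minimal sketch, not code from the notes: the function name `select_naive`, the choice of the first key as the "arbitrary" key, and the convention of counting one comparison per key compared to $p$ are all assumptions made for illustration.

```python
def select_naive(keys, m):
    """Return (key of rank m, comparisons used); rank 1 is the smallest key.

    Hypothetical helper: the pivot is simply the first key, and each
    comparison of a key against the pivot is counted once.
    """
    keys = list(keys)
    comparisons = 0
    while True:
        p = keys[0]                      # the arbitrary key
        rest = keys[1:]
        smaller = [k for k in rest if k < p]
        larger = [k for k in rest if k > p]
        comparisons += len(rest)         # every other key is compared to p
        equal = len(keys) - len(smaller) - len(larger)  # p and any ties
        if m <= len(smaller):
            keys = smaller               # rank m key is in the first pile
        elif m <= len(smaller) + equal:
            return p, comparisons        # we got lucky: p has rank m
        else:
            m -= len(smaller) + equal    # replace m and search second pile
            keys = larger


# Sorted input, asking for the largest key: the first pile is empty every time.
print(select_naive(range(1, 11), 10))    # (10, 45), i.e. 9 + 8 + ... + 1
```

On already sorted input the count is $(n-1) + (n-2) + \cdots + 1 = n(n-1)/2$, which matches the quadratic growth described above.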
We will now describe a technique for finding a key $p$ for which the two piles will be reasonably balanced, and this will result in an algorithm for finding the median, or any rank $m$ key, that takes a linear number of comparisons. This will not be a practical algorithm, as there are algorithms that run much faster most of the time (although in the worst case they take more than a linear number of comparisons). We show you this because it is a really neat algorithm, and because you will learn some useful techniques and facts about recursion from it.

The big question is how to find a good candidate for the key $p$. Ideally, we would like $p$ to be as close as possible to the median [1]. The following is a simple plan for doing just that.

Step 1. We arbitrarily split the keys into groups of size 5, assuming $n$ is divisible by 5; we ignore small differences otherwise. (Why did we choose 5? Actually, any small odd number larger than 3 works, but it turns out that 5 gives the fewest comparisons.)

[1] You might wonder why we try to find a good candidate for the median when our goal is to find the key of rank $m$. Again, we would like the two piles to be of roughly the same size so that we make progress faster.

Here is an example of a set of 35 keys; the seven groups of 5 correspond to the columns.

    23  32   6  22  76  15  40
    91  28  39  12  97  29  33
    75  23  53  71  80  55  68
    68  38  64  77  24   7  47
    82  10   1  43  37  25  16

We then sort each group of 5 keys, noting that each one can be sorted using a maximum of 7 comparisons (this is a slightly tricky exercise). After performing $7n/5$ comparisons, this leaves us with $n/5$ groups of 5 sorted keys. Here is the example with the columns (groups of 5) sorted from top to bottom.

    23  10   1  12  24   7  16
    68  23   6  22  37  15  33
    75  28  39  43  76  25  40
    82  32  53  71  80  29  47
    91  38  64  77  97  55  68

Step 2. Now we take each of the keys of rank 3 (one per group), the median of its group of 5, and find the median of this group of $n/5$ median keys. This median of medians is our good candidate $p$. If $f(k)$ is the number of comparisons it takes to find the median of $k$ keys, this step takes $f(n/5)$ comparisons. In our example, we find the median of

    75  28  39  43  76  25  40

which happens to be $p = 40$.

Step 3. Now we partition the keys into two piles: those smaller than $p$ are placed in $L_<$ and those larger than $p$ are placed in $L_>$. If we find other keys equal to $p$, we can place them in either $L_<$ or $L_>$. In order to make this partition, we first compare the rank 3 key of each group of five to $p$. This takes $n/5$ comparisons. If such a rank 3 key is larger than $p$, then not only do we know that it can be placed in $L_>$, but we also know that the rank 4 and rank 5 keys of that group can be placed in $L_>$. Similarly, if the rank 3 key happens to be smaller than $p$, then we can automatically place the rank 1 and rank 2 keys of that group in $L_<$. For example, we first compare 75 (the rank 3 key of the first group) to 40 and can immediately tell that the keys 82 and 91 are greater than 40 without doing additional comparisons. By comparing the rank 3 elements in the example to $p$, we can immediately decide in which pile to place the following keys (a dot marks a position not decided this way):

     .  10   1   .   .   7  16
     .  23   6   .   .  15  33
     .   .   .   .   .   .   .
    82   .   .  71  80   .  47
    91   .   .  77  97   .  68
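Steps 1 to 3 can be replayed on the 35 example keys. This is a rough sketch for checking the numbers only: it uses Python's built-in `sorted` instead of a 7-comparison sort and compares every key to $p$ directly, so it does not reproduce the comparison counts; the names `KEYS`, `groups`, `medians`, `smaller` and `larger` are illustrative.

```python
# The 35 example keys, read row by row from the 5 x 7 layout.
KEYS = [23, 32,  6, 22, 76, 15, 40,
        91, 28, 39, 12, 97, 29, 33,
        75, 23, 53, 71, 80, 55, 68,
        68, 38, 64, 77, 24,  7, 47,
        82, 10,  1, 43, 37, 25, 16]

# Step 1: each column of the layout is a group of 5; sort every group.
groups = [sorted(KEYS[col::7]) for col in range(7)]

# Step 2: take the rank 3 key of every group, then the median of those.
medians = [group[2] for group in groups]
p = sorted(medians)[len(medians) // 2]

# Step 3: split the keys into the two piles around p.
smaller = [k for k in KEYS if k < p]     # plays the role of L_<
larger = [k for k in KEYS if k > p]      # plays the role of L_>

print(medians)                           # [75, 28, 39, 43, 76, 25, 40]
print(p, len(smaller), len(larger))      # 40 19 15
```

With 19 keys in one pile and 15 in the other, the candidate 40 has rank 20 among the 35 keys, reasonably close to the median rank of 18.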

This has two implications. First, to construct both piles, we only need to do (at most [2]) 3 comparisons for each group of 5, for a total of $3n/5$ comparisons. Secondly, since half of the rank 3 keys will be smaller than or equal to $p$, at least $3 \cdot \frac{1}{2} \cdot \frac{n}{5} = \frac{3n}{10}$ keys will be in $L_<$, and similarly for $L_>$. This means that both

$$\frac{3n}{10} \le |L_<| \le \frac{7n}{10} \quad\text{and}\quad \frac{3n}{10} \le |L_>| \le \frac{7n}{10}.$$

By simply counting the number of keys we place in $L_<$ and in $L_>$, we can find out the exact rank of our candidate $p$.

Step 4. Knowing the rank of $p$ and comparing it to $m$, we can keep only one of the two piles, thereby eliminating at least $3n/10$ keys. If $m$ is smaller than the rank of $p$, we can throw away all keys in $L_>$, while if $m$ is larger than the rank of $p$, we can discard $L_<$ (and decrease our value of $m$ by the number $|L_<|$ of keys we have just discarded). In any case, we have at most $7n/10$ keys left, and need to find the rank $m$ key among them. This takes at most $f(7n/10)$ comparisons in the worst case.

It is important to realize that our procedure does not involve circular reasoning, even though it uses as a subroutine a procedure for finding the rank $m$ key. What we are doing is using the technique of recursion. To find the rank $m$ key out of $n$ keys, as intermediate steps we find the median of $n/5$ keys and the rank $m$ key out of at most $7n/10$ keys. During each of these intermediate steps, we again run the procedure on a smaller number of keys. Our algorithm does not run forever because we terminate the recursion whenever the number of keys is small (which might be when we reach fewer than five keys); in this case we must use a different procedure for finding the rank $m$ key (we could sort them, for example; this is more efficient for small $n$).
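Putting steps 1 to 4 together gives a short recursive procedure. The sketch below is one possible rendering under stated assumptions, not the notes' own implementation: it sorts each group with `sorted`, compares every key to $p$ rather than reusing the group order, stops recursing at 5 keys or fewer, and handles keys equal to $p$ by counting them separately. The name `select` is hypothetical.

```python
def select(keys, m):
    """Return the key of rank m among keys (rank 1 is the smallest)."""
    keys = list(keys)
    if len(keys) <= 5:
        # Small case: simply sort, as the notes suggest for small n.
        return sorted(keys)[m - 1]

    # Step 1: split into groups of (at most) 5 and sort each group.
    groups = [sorted(keys[i:i + 5]) for i in range(0, len(keys), 5)]

    # Step 2: recursively find the median of the group medians.
    medians = [g[(len(g) - 1) // 2] for g in groups]
    p = select(medians, (len(medians) + 1) // 2)

    # Step 3: partition around p and count to learn the rank of p.
    smaller = [k for k in keys if k < p]
    larger = [k for k in keys if k > p]
    equal = len(keys) - len(smaller) - len(larger)  # p itself plus ties

    # Step 4: keep only the pile that contains the rank m key.
    if m <= len(smaller):
        return select(smaller, m)
    if m <= len(smaller) + equal:
        return p
    return select(larger, m - len(smaller) - equal)


print(select([23, 91, 75, 68, 82], 3))   # 75
```

Both recursive calls receive strictly fewer keys than the current call (the medians list has about $n/5$ entries, and each pile excludes $p$), which is why the recursion terminates.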
3 Showing the Procedure is Linear

If $f(n)$ is the number of comparisons needed to find the rank $m$ key out of $n$ keys (irrespective of what $m$ is), we have the formula

$$f(n) \le \frac{7n}{5} + f\left(\frac{n}{5}\right) + \frac{3n}{5} + f\left(\frac{7n}{10}\right),$$

where $7n/5$ is the cost of sorting the groups of 5, $f(n/5)$ is the number of comparisons required to find the median of medians $p$, $3n/5$ is the cost of comparing the median of medians to all the keys, and $f(7n/10)$ is the number of comparisons used to find the rank $m$ element out of $7n/10$ elements.

How do we know that the number of comparisons $f(n)$ is linear in the number of keys? That is, how do we know that $f(n) \le cn$ for some constant $c$? We will show this by using induction.

[2] If the rank 3 key happens to be equal to $p$, we don't need any additional comparisons, since we can place the rank 1 and 2 keys in $L_<$ and the rank 4 and 5 keys in $L_>$.
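As an informal cross-check on the recurrence (a sketch that ignores rounding, not a replacement for the induction), one can unroll it level by level. The non-recursive cost is $7n/5 + 3n/5 = 2n$, and the two subproblems together contain a fraction $1/5 + 7/10 = 9/10$ of the keys, so the work per level shrinks geometrically:

```latex
\begin{align*}
\text{level } 0 &: \; 2n \\
\text{level } 1 &: \; 2\cdot\tfrac{n}{5} + 2\cdot\tfrac{7n}{10} = 2n\cdot\tfrac{9}{10} \\
\text{level } k &: \; \le 2n\left(\tfrac{9}{10}\right)^{k} \\
f(n) &\le \sum_{k \ge 0} 2n\left(\tfrac{9}{10}\right)^{k}
      = \frac{2n}{1 - \tfrac{9}{10}} = 20n .
\end{align*}
```

The essential point is that $1/5 + 7/10 < 1$; this is what makes the series converge and the total linear in $n$.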

Suppose that we have shown that the inequality $f(m) \le cm$ holds for all $m < n$ (we'll figure out the value of $c$ later). Remember we have

$$f(n) \le f\left(\frac{n}{5}\right) + f\left(\frac{7n}{10}\right) + 2n.$$

We can substitute $f(m) \le cm$ for both of these values of $f$ on the right-hand side, because for both of them we have $m < n$. This gives

$$f(n) \le \frac{cn}{5} + \frac{7cn}{10} + 2n.$$

Now, we'd like to choose $c$ so that this implies $f(n) \le cn$ for all $n$. To get the best value for $c$, we can impose that the right-hand side is equal to $cn$. This is a linear equation

$$cn = \frac{cn}{5} + \frac{7cn}{10} + 2n,$$

where the $n$'s cancel and the solution is $c = 20$. We also need to verify that when $n$ is small enough and we decide to simply sort the keys, the number of comparisons is indeed at most $20n$.

4 Improving the procedure

There are lots of clever tricks we can use to improve the constant in this procedure. All use the fact that we have been discarding lots of precious information that we gathered in previous steps, and that we could instead recycle. We will describe one extremely simple improvement.

In step 2, when finding the median $p$ of the $n/5$ rank 3 keys, we end up knowing for each of these $n/5$ keys whether it is greater or smaller than $p$. There is therefore no need to compare them again to $p$ in step 3. By combining step 2 and part of step 3, we can decrease the number of comparisons in step 3 from $3n/5$ to $2n/5$. The recurrence we thus get says that

$$f(n) \le \frac{7n}{5} + f\left(\frac{n}{5}\right) + \frac{2n}{5} + f\left(\frac{7n}{10}\right),$$

and this implies that $f(n) \le 18n$. Many further improvements can be made, but we won't discuss them here.
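The two constants can be sanity-checked numerically by evaluating the recurrences with equality. This is a minimal sketch under stated assumptions: subproblem sizes are rounded down, and the base cost of `7 * n` comparisons for fewer than five keys is a deliberately generous placeholder, not a figure from the notes. The helper name `make_bound` is hypothetical.

```python
from fractions import Fraction
from functools import lru_cache


def make_bound(linear_cost):
    """Build f with f(n) = linear_cost * n + f(n // 5) + f(7 * n // 10)."""
    @lru_cache(maxsize=None)
    def f(n):
        if n < 5:
            return Fraction(7 * n)       # assumed (generous) base-case cost
        return linear_cost * n + f(n // 5) + f(7 * n // 10)
    return f


f_basic = make_bound(Fraction(2))        # 7n/5 + 3n/5 = 2n per call
f_improved = make_bound(Fraction(9, 5))  # 7n/5 + 2n/5 = 9n/5 per call

# The induction above predicts these bounds for every n.
assert all(f_basic(n) <= 20 * n for n in range(1, 5000))
assert all(f_improved(n) <= 18 * n for n in range(1, 5000))
```

Solving $cn = cn/5 + 7cn/10 + 9n/5$ in the same way as before gives $c/10 = 9/5$, that is $c = 18$, which is the constant the second assertion checks.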

MIT OpenCourseWare http://ocw.mit.edu 18.310 Principles of Discrete Applied Mathematics Fall 2013 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.