Application: Bucket Sort


5.2.2. Application: Bucket Sort

Bucket sort breaks the Ω(n log n) lower bound for standard comparison-based sorting, under certain assumptions on the input. We want to sort a set of n = 2^m integers chosen independently and uniformly at random from the range [0, 2^k), where k ≥ m. Using Bucket sort, we can sort the numbers in expected O(n) time. The expectation is over the choice of the random input; Bucket sort itself is a deterministic algorithm.

Bucket sort works in two stages. First we place the elements into n buckets. The j-th bucket holds all elements whose first m binary digits correspond to the number j. E.g., if n = 2^10, bucket 3 contains all elements whose first 10 binary digits are 0000000011. When j < ℓ, the elements of the j-th bucket all come before those of the ℓ-th bucket in the sorted order. Assuming that each element can be placed in the appropriate bucket in O(1) time, this stage requires only O(n) time.

Because the elements to be sorted are chosen uniformly, the number of elements that land in a specific bucket follows a binomial distribution B(n, 1/n). Buckets can be implemented using linked lists. In the second stage, each bucket is sorted using any standard quadratic-time algorithm (e.g., Bubblesort or Insertion sort). Concatenating the sorted lists from each bucket in order gives the sorted order for the elements. It remains to show that the expected time spent in the second stage is only O(n).
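As a concrete illustration, here is a minimal Python sketch of the two-stage algorithm just described; the parameter names and the choice of insertion sort for the buckets are ours, not the slides'.

```python
import random

def bucket_sort(elems, m, k):
    """Sort n = 2**m integers drawn uniformly from [0, 2**k), with k >= m.

    Stage 1: place each element into one of n = 2**m buckets keyed by its
    first m binary digits. Stage 2: sort each bucket with a quadratic-time
    algorithm (insertion sort) and concatenate the buckets in order.
    """
    n = 2 ** m
    buckets = [[] for _ in range(n)]
    for x in elems:                      # stage 1: O(1) per element
        buckets[x >> (k - m)].append(x)  # bucket index = first m bits of x
    result = []
    for bucket in buckets:               # stage 2: quadratic sort per bucket
        for i in range(1, len(bucket)):  # plain insertion sort
            key, j = bucket[i], i - 1
            while j >= 0 and bucket[j] > key:
                bucket[j + 1] = bucket[j]
                j -= 1
            bucket[j + 1] = key
        result.extend(bucket)
    return result

m, k = 10, 32
data = [random.randrange(2 ** k) for _ in range(2 ** m)]
assert bucket_sort(data, m, k) == sorted(data)
```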

The result relies on our assumption regarding the input distribution. Under the uniform distribution, Bucket sort falls naturally into the balls-and-bins model: the elements are balls, buckets are bins, and each ball falls uniformly at random into a bin. Let X_j be the number of elements that land in the j-th bucket. The time to sort the j-th bucket is then at most c(X_j)^2 for some constant c.

The expected time spent sorting in the second stage is at most

  E[ Σ_{j=1}^{n} c(X_j)^2 ] = c Σ_{j=1}^{n} E[(X_j)^2] = c n E[(X_1)^2].

The second equality follows from symmetry: E[(X_j)^2] is the same for all buckets. Since X_1 ~ B(n, 1/n), using earlier results yields

  E[(X_1)^2] = n(n-1)(1/n)^2 + 1 = 2 - 1/n < 2.

Hence the total expected time spent in the second stage is at most 2cn, so Bucket sort runs in expected linear time.

5.3. The Poisson Distribution

We now consider the probability that a given bin is empty in the balls-and-bins model with m balls and n bins, as well as the expected number of empty bins. For the first bin to be empty, it must be missed by all m balls. Since each ball hits the first bin with probability 1/n, the probability that the first bin remains empty is

  (1 - 1/n)^m ≈ e^{-m/n}.

By symmetry, the probability is the same for all bins. If X_j is a random variable that is 1 when the j-th bin is empty and 0 otherwise, then E[X_j] = (1 - 1/n)^m. Let X represent the number of empty bins. Then, by the linearity of expectations,

  E[X] = n (1 - 1/n)^m ≈ n e^{-m/n}.

Thus, the expected fraction of empty bins is approximately e^{-m/n}. This approximation is very good even for moderately sized values of m and n.

We generalize this to find the expected fraction of bins with r balls for any constant r. The probability that a given bin has r balls is

  C(m, r) (1/n)^r (1 - 1/n)^{m-r} = (1/r!) · (m(m-1)···(m-r+1) / n^r) · (1 - 1/n)^{m-r}.

When m and n are large compared to r, the second factor on the right-hand side is approximately (m/n)^r, and the third factor is approximately e^{-m/n}.

Hence the probability p_r that a given bin has r balls is approximately

  p_r ≈ e^{-m/n} (m/n)^r / r!,

and the expected number of bins with exactly r balls is approximately n e^{-m/n} (m/n)^r / r!.

Definition 5.1: A discrete Poisson random variable X with parameter μ is given by the following probability distribution on j = 0, 1, 2, ...:

  Pr(X = j) = e^{-μ} μ^j / j!.

The expectation of this random variable is μ:

  E[X] = Σ_{j≥0} j Pr(X = j) = Σ_{j≥1} j e^{-μ} μ^j / j! = μ Σ_{j≥1} e^{-μ} μ^{j-1} / (j-1)! = μ,

because the probabilities sum to 1.
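A short simulation shows how well the empirical fraction of bins with exactly r balls tracks e^{-m/n} (m/n)^r / r!; the snippet and its parameters are illustrative, not from the slides.

```python
import math
import random

def load_fractions(m, n, r_max, trials=200):
    """Empirical fraction of bins holding exactly r balls, r = 0..r_max."""
    counts = [0] * (r_max + 1)
    for _ in range(trials):
        loads = [0] * n
        for _ in range(m):
            loads[random.randrange(n)] += 1
        for load in loads:
            if load <= r_max:
                counts[load] += 1
    return [c / (trials * n) for c in counts]

m = n = 1000
for r, frac in enumerate(load_fractions(m, n, 4)):
    poisson = math.exp(-m / n) * (m / n) ** r / math.factorial(r)
    print(r, frac, round(poisson, 4))   # e.g. r = 0 gives about e^{-1} ~ 0.3679
```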

In the context of throwing m balls into n bins, the distribution of the number of balls in a bin is approximately Poisson with μ = m/n, which is exactly the average number of balls per bin, as one might expect.

Lemma 5.2: The sum of a finite number of independent Poisson random variables is a Poisson random variable.

Lemma 5.3: The moment generating function of a Poisson random variable with parameter μ is

  M_X(t) = e^{μ(e^t - 1)}.

Proof: For any t,

  E[e^{tX}] = Σ_{k≥0} e^{tk} e^{-μ} μ^k / k! = e^{-μ} Σ_{k≥0} (μ e^t)^k / k! = e^{-μ} e^{μ e^t} = e^{μ(e^t - 1)}.

Differentiating yields

  M_X'(t) = μ e^t e^{μ(e^t - 1)},   M_X''(t) = (μ e^t)^2 e^{μ(e^t - 1)} + μ e^t e^{μ(e^t - 1)}.

Setting t = 0 gives E[X] = μ and E[X^2] = μ^2 + μ, so Var[X] = μ.

Given two independent Poisson random variables X and Y with means μ_1 and μ_2, apply Theorem 4.3 to prove

  M_{X+Y}(t) = M_X(t) M_Y(t) = e^{(μ_1 + μ_2)(e^t - 1)}.

This is the MGF of a Poisson random variable with mean μ_1 + μ_2. By Theorem 4.2, the MGF uniquely defines the distribution, and hence the sum X + Y is a Poisson random variable with mean μ_1 + μ_2.
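Lemma 5.2 is easy to check empirically; the sketch below (ours, not the slides') samples two independent Poisson variables by inverse-transform sampling and compares the distribution of their sum against the Poisson pmf with mean μ_1 + μ_2.

```python
import math
import random

def poisson_pmf(mu, j):
    """Pr(X = j) for X ~ Poisson(mu)."""
    return math.exp(-mu) * mu ** j / math.factorial(j)

def sample_poisson(mu):
    """Inverse-transform sampling from Poisson(mu)."""
    u, j, cdf = random.random(), 0, poisson_pmf(mu, 0)
    while u > cdf:
        j += 1
        cdf += poisson_pmf(mu, j)
    return j

mu1, mu2, trials = 1.5, 2.5, 100_000
counts = {}
for _ in range(trials):
    s = sample_poisson(mu1) + sample_poisson(mu2)
    counts[s] = counts.get(s, 0) + 1

for j in range(8):  # empirical frequency vs. Poisson(mu1 + mu2) probability
    print(j, counts.get(j, 0) / trials, round(poisson_pmf(mu1 + mu2, j), 4))
```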

Theorem 5.4: Let X be a Poisson random variable with parameter μ.
1. If x > μ, then Pr(X ≥ x) ≤ e^{-μ} (eμ)^x / x^x.
2. If x < μ, then Pr(X ≤ x) ≤ e^{-μ} (eμ)^x / x^x.

Proof: For any t > 0 and x > μ,

  Pr(X ≥ x) = Pr(e^{tX} ≥ e^{tx}) ≤ E[e^{tX}] / e^{tx}.

Plugging in the expression for the MGF of the Poisson distribution, we have

  Pr(X ≥ x) ≤ e^{μ(e^t - 1) - tx}.

Choosing t = ln(x/μ) > 0 gives

  Pr(X ≥ x) ≤ e^{x - μ - x ln(x/μ)} = e^{-μ} (eμ)^x / x^x.

The proof of part 2 is similar.
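A few lines of Python show how this Chernoff-style bound sits above the exact tail probability; this is an illustrative check, not part of the slides.

```python
import math

def poisson_tail(mu, x):
    """Exact Pr(X >= x) for X ~ Poisson(mu), via the complement."""
    return 1.0 - sum(math.exp(-mu) * mu ** j / math.factorial(j)
                     for j in range(x))

def chernoff_bound(mu, x):
    """The bound e^{-mu} (e mu)^x / x^x of Theorem 5.4, for x > mu."""
    return math.exp(-mu) * (math.e * mu / x) ** x

mu = 10
for x in (15, 20, 30):
    print(x, poisson_tail(mu, x), chernoff_bound(mu, x))
```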

5.3.1. Limit of the Binomial Distribution

The Poisson distribution is the limit distribution of the binomial distribution with parameters n and p, when n is large and p is small.

Theorem 5.5: Let X_n ~ B(n, p), where p is a function of n and lim_{n→∞} np = λ is a constant that is independent of n. Then, for any fixed k,

  lim_{n→∞} Pr(X_n = k) = e^{-λ} λ^k / k!.

This theorem applies directly to the balls-and-bins scenario. Consider the situation where there are m balls and n bins, where m is a function of n and lim_{n→∞} m/n = λ. Let X_m be the number of balls in a specific bin. Then X_m ~ B(m, 1/n). Theorem 5.5 thus applies and says that

  lim_{m→∞} Pr(X_m = r) = e^{-λ} λ^r / r!,

matching the earlier approximation.
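The convergence in Theorem 5.5 is easy to see numerically by fixing λ and k and letting n grow; the following sketch (illustrative, not from the slides) compares the B(n, λ/n) pmf with the Poisson(λ) pmf.

```python
import math

def binom_pmf(n, p, k):
    """Pr(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    """Pr(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam, k = 2.0, 3
for n in (10, 100, 1000, 10000):
    print(n, binom_pmf(n, lam / n, k))  # approaches the Poisson value below
print("Poisson:", poisson_pmf(lam, k))  # e^{-2} 2^3 / 3! ~ 0.1804
```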

Consider the number of spelling or grammatical mistakes in a book. We can model such mistakes by assuming that each word independently contains an error with some very small probability. The number of errors is then a binomial random variable with large n and small p, and it can be treated as a Poisson random variable. As another example, consider the number of chocolate chips inside a chocolate chip cookie. We can model this by splitting the volume of the cookie into a large number of small disjoint compartments, so that a chip lands in each with some small probability. The number of chips in a cookie then roughly follows a Poisson distribution.

5.4. The Poisson Approximation

The main difficulty in balls-and-bins problems is handling the dependencies that naturally arise. If, e.g., bin 1 is empty, then it is less likely that bin 2 is empty, because the m balls must then be distributed among the remaining n - 1 bins. More concretely: if we know the number of balls in the first n - 1 bins, then the number of balls in the last bin is completely determined. The loads of the bins are not independent.

The distribution of the number of balls in a given bin is approximately Poisson with mean m/n. We would like to say that the joint distribution of the number of balls in all the bins is well approximated by assuming the load at each bin is an independent Poisson random variable with mean m/n. This would allow us to treat bin loads as independent random variables. We show here that we can do this when we are concerned with sufficiently rare events.

Suppose that m balls are thrown into n bins independently and uniformly at random, and let X_i^(m) be the number of balls in the i-th bin, 1 ≤ i ≤ n. Let Y_1^(m), ..., Y_n^(m) be independent Poisson random variables with mean m/n. In the first case, there are m balls in total. In the second case we know only that m is the expected total number of balls in all of the bins. If, using the Poisson distribution, we end up with m balls, then we do indeed have that the distribution is the same as if we threw m balls into n bins randomly.

Theorem 5.6: The distribution of (Y_1^(m), ..., Y_n^(m)) conditioned on Σ_i Y_i^(m) = k is the same as the distribution of (X_1^(k), ..., X_n^(k)), regardless of the value of m.

Proof: When throwing k balls into n bins, the probability that (X_1^(k), ..., X_n^(k)) = (k_1, ..., k_n) for any k_1, ..., k_n satisfying Σ_i k_i = k is given by

  Pr((X_1^(k), ..., X_n^(k)) = (k_1, ..., k_n)) = k! / (k_1! k_2! ··· k_n! n^k).

Now, for any k_1, ..., k_n with Σ_i k_i = k, consider the probability that (Y_1^(m), ..., Y_n^(m)) = (k_1, ..., k_n) conditioned on Σ_i Y_i^(m) = k:

  Pr((Y_1^(m), ..., Y_n^(m)) = (k_1, ..., k_n) | Σ_i Y_i^(m) = k)
    = Pr((Y_1^(m) = k_1) ∩ ··· ∩ (Y_n^(m) = k_n)) / Pr(Σ_i Y_i^(m) = k).

The probability that Y_i^(m) = k_i is e^{-m/n} (m/n)^{k_i} / k_i!, since the Y_i^(m) are independent Poisson random variables with mean m/n. Also, by Lemma 5.2, the sum of the Y_i^(m) is itself a Poisson random variable with mean m. Hence we have

  ( Π_{i=1}^{n} e^{-m/n} (m/n)^{k_i} / k_i! ) / ( e^{-m} m^k / k! ) = k! / (k_1! k_2! ··· k_n! n^k),

proving the theorem.

With this we can prove strong results about any function on the loads of the bins.

Theorem 5.7: Let f(x_1, ..., x_n) be a nonnegative function. Then

  E[ f(X_1^(m), ..., X_n^(m)) ] ≤ e√m E[ f(Y_1^(m), ..., Y_n^(m)) ].

This holds for any nonnegative function on the number of balls in the bins. In particular, if f is an indicator that is 1 if some event occurs and 0 otherwise, then the theorem gives bounds on the probability of events.
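Theorem 5.6 can be sanity-checked by simulation: condition i.i.d. Poisson loads on their total and compare against direct throws of k balls. The rejection-sampling sketch below is ours, with illustrative parameters; note that the per-bin mean mu deliberately differs from k/n, since the theorem holds regardless.

```python
import math
import random

def sample_poisson(mu):
    """Inverse-transform sample from Poisson(mu)."""
    u, j, p = random.random(), 0, math.exp(-mu)
    cdf = p
    while u > cdf:
        j += 1
        p *= mu / j
        cdf += p
    return j

def throw_balls(k, n):
    """Exact case: throw k balls into n bins uniformly at random."""
    loads = [0] * n
    for _ in range(k):
        loads[random.randrange(n)] += 1
    return tuple(loads)

n, k, mu, trials = 3, 4, 2.0, 200_000
exact, cond = {}, {}
for _ in range(trials):
    v = throw_balls(k, n)
    exact[v] = exact.get(v, 0) + 1
    w = tuple(sample_poisson(mu) for _ in range(n))
    if sum(w) == k:                    # condition on the total being k
        cond[w] = cond.get(w, 0) + 1

total = sum(cond.values())
for v in sorted(exact):                # the two distributions should agree
    print(v, exact[v] / trials, cond.get(v, 0) / total)
```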

We call the scenario in which the numbers of balls in the bins are taken to be independent Poisson random variables with mean m/n the Poisson case; the scenario where m balls are thrown into n bins independently and uniformly at random is the exact case.

Corollary 5.9: Any event that takes place with probability p in the Poisson case takes place with probability at most p e√m in the exact case.

Proof: Let f be the indicator function of the event. In this case, E[f] is just the probability that the event occurs, and the result follows immediately from Theorem 5.7.

Thus any event that happens with small probability in the Poisson case also happens with small probability in the exact case. In the analysis of algorithms we often want to show that certain events happen with small probability. This result says that we can utilize an analysis of the Poisson approximation to obtain a bound for the exact case. The Poisson approximation is easier to analyze because the numbers of balls in each bin are independent random variables.

We can actually do even a little bit better in many natural cases.

Theorem 5.10: Let f(x_1, ..., x_n) be a nonnegative function such that E[ f(X_1^(m), ..., X_n^(m)) ] is either monotonically increasing or monotonically decreasing in m. Then

  E[ f(X_1^(m), ..., X_n^(m)) ] ≤ 2 E[ f(Y_1^(m), ..., Y_n^(m)) ].

The following corollary is immediate:

Corollary 5.11: Let E be an event whose probability is either monotonically increasing or monotonically decreasing in the number of balls. If E has probability p in the Poisson case, then E has probability at most 2p in the exact case.

Consider again the maximum load problem for the case m = n. A union bound argument shows that the maximum load is at most 3 ln n / ln ln n with high probability. Using the Poisson approximation, we prove below an almost-matching lower bound on the maximum load; a quick simulation first gives a feel for the scaling.
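This sketch (ours, not the slides') throws n balls into n bins and compares the observed maximum load with ln n / ln ln n.

```python
import math
import random

def avg_max_load(n, trials=50):
    """Average maximum bin load over trials of n balls into n bins u.a.r."""
    total = 0
    for _ in range(trials):
        loads = [0] * n
        for _ in range(n):
            loads[random.randrange(n)] += 1
        total += max(loads)
    return total / trials

for n in (10**3, 10**4, 10**5):
    print(n, avg_max_load(n), math.log(n) / math.log(math.log(n)))
```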

Lemma 5.12: When n balls are thrown independently and uniformly at random into n bins, the maximum load is at least M = ln n / ln ln n with probability at least 1 - 1/n for n sufficiently large.

Proof: In the Poisson case the load of each bin is Poisson with mean 1, since m = n. The probability that bin 1 has load at least M is at least 1/(e M!), which is the probability that it has load exactly M:

  Pr(Y_1 = M) = e^{-1} / M!.

In the Poisson case all bins are independent, so the probability that no bin has load at least M is at most

  (1 - 1/(e M!))^n ≤ e^{-n/(e M!)}.

We need to choose M so that e^{-n/(e M!)} ≤ n^{-2}, for then (by Theorem 5.7) the probability that the maximum load is not at least M in the exact case is at most e√n · n^{-2} < 1/n. This will give the lemma. Because the maximum load is clearly monotonically increasing in the number of balls, we could also apply the slightly better Theorem 5.10, but this would not affect the argument substantially. It therefore suffices to show that n/(e M!) ≥ 2 ln n, or equivalently that ln M! ≤ ln n - ln(2e ln n). From Lemma 5.8 (not shown), it follows that

  ln M! ≤ M ln M - M + ln(eM) ≤ M ln M + ln M

when n (and hence M = ln n / ln ln n) are suitably large. Hence, for suitably large n,

  ln M! ≤ M ln M + ln M
       ≤ (ln n / ln ln n)(ln ln n - ln ln ln n) + ln ln n
       ≤ ln n - (ln n)(ln ln ln n)/(ln ln n) + ln ln n
       ≤ ln n - 2 ln ln n
       ≤ ln n - ln(2e ln n).

The last two inequalities use the fact that ln ln n = o(ln n / ln ln n).

5.5. Application: Hashing

Consider a password checker, which prevents people from using easily cracked passwords by keeping a dictionary of unacceptable ones. The application checks whether a requested password is unacceptable. A checker could store the unacceptable passwords alphabetically and do a binary search on the dictionary to check a proposed password. A binary search would require Θ(log m) time for m words.

5.5.1. Chain Hashing

Another possibility is to place the words into bins and search the appropriate bin for the word. Words in a bin are represented by a linked list. The placement of words into bins is accomplished by using a hash function. A hash function h from a universe U into a range [0, n-1] can be thought of as a way of placing items from the universe into n bins.

Here the universe U consists of possible password strings. The collection of bins is called a hash table. This approach to hashing is called chain hashing. Using a hash table turns the dictionary problem into a balls-and-bins problem. If our dictionary of unacceptable passwords consists of m words and the range of the hash function is [0, n-1], then we can model the distribution of words in bins with the same distribution as m balls placed randomly in n bins.

It is a strong assumption to presume that a hash function maps words into bins in a fashion that appears random, so that the locations of the words are independent and identically distributed (i.i.d.). We assume that for each x, the probability that h(x) = j is 1/n (for 0 ≤ j ≤ n-1) and that the values of h(x) for different x are independent of each other. This does not mean that every evaluation of h(x) yields a different random answer: the value of h(x) is fixed for all time; it is just equally likely to take on any value in the range.

Consider the search time when there are n bins and m words. To search for an item, we first hash it to find the bin that it lies in and then search sequentially through the linked list for it. If we search for a word that is not in our dictionary, the expected number of words in the bin the word hashes to is m/n. If we search for a word that is in our dictionary, the expected number of other words in that word's bin is (m-1)/n, so the expected number of words in the bin is 1 + (m-1)/n.

If we choose n = m bins for our hash table, then the expected number of words we must search through in a bin is constant. If the hashing takes constant time, then the total expected time for the search is constant. The maximum time to search for a word, however, is proportional to the maximum number of words in a bin. We have shown that when n = m this maximum load is Θ(ln n / ln ln n) with probability close to 1, and hence with high probability this is the maximum search time in such a hash table.

While this is still faster than the Θ(log m) time required for standard binary search, it is much slower than the average, which can be a drawback for many applications. Another drawback of chain hashing can be wasted space. If we use n bins for n items, several of the bins will be empty, potentially leading to wasted space. The space wasted can be traded off against the search time by making the average number of words per bin larger than 1.
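A minimal chained hash table makes the gap between expected and worst-case search cost concrete. The sketch below is ours; Python's built-in hash stands in for the idealized fully random hash function assumed above.

```python
import random

class ChainHashTable:
    """Chain hashing: n bins, each a Python list standing in for a linked list."""

    def __init__(self, n):
        self.n = n
        self.bins = [[] for _ in range(n)]

    def insert(self, word):
        self.bins[hash(word) % self.n].append(word)

    def search(self, word):
        """Return (found, number of list entries examined)."""
        chain = self.bins[hash(word) % self.n]
        for i, w in enumerate(chain):
            if w == word:
                return True, i + 1
        return False, len(chain)

m = 10_000
table = ChainHashTable(m)                 # n = m bins
words = [str(random.random()) for _ in range(m)]
for w in words:
    table.insert(w)

costs = [table.search(w)[1] for w in words]
print(sum(costs) / m)                     # expected search cost: a small constant
print(max(len(b) for b in table.bins))    # max load: Theta(ln n / ln ln n) w.h.p.
```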

5.5.2. Hashing: Bit Strings

We now save space instead of time. Consider, again, the problem of keeping a dictionary of unsuitable passwords. Assume that a password is restricted to be eight ASCII characters, which requires 64 bits (8 bytes) to represent. Suppose we use a hash function to map each word into a 32-bit string. This string serves as a short fingerprint for the word.

We keep the fingerprints in a sorted list. To check if a proposed password is unacceptable, we calculate its fingerprint and look for it on the list, say by a binary search. If the fingerprint is on the list, we declare the password unacceptable. In this case, our password checker may not give the correct answer! It is possible that an acceptable password is rejected because its fingerprint matches the fingerprint of an unacceptable password.

Hence there is some chance that hashing will yield a false positive: it may falsely declare a match when there is not an actual match. The fingerprints do not uniquely identify the associated word. This is the only type of mistake this algorithm can make. Allowing false positives means our algorithm is overly conservative, which is probably acceptable; letting easily cracked passwords through, however, would probably not be acceptable.

To place the problem in a more general context, we describe it as an approximate set membership problem. Suppose we have a set S = {s_1, ..., s_m} of m elements from a large universe U. We want to be able to quickly answer queries of the form "Is x an element of S?" We would also like the representation to take as little space as possible. To save space, we are willing to allow occasional mistakes in the form of false positives. Here the unallowable passwords correspond to our set S.

How large should the range of the hash function used to create the fingerprints be? That is, how many bits should be in a fingerprint? Obviously, we want to choose the number of bits that gives an acceptable probability for a false positive match. If fingerprints are b bits long, the probability that an acceptable password has a fingerprint that differs from any specific unallowable password in S is 1 - 1/2^b. If the set S has size m, then the probability of a false positive for an acceptable password is

  1 - (1 - 1/2^b)^m ≥ 1 - e^{-m/2^b}.

If we want this probability of a false positive to be less than a constant c, we need

  e^{-m/2^b} ≥ 1 - c,

which implies that

  b ≥ log_2 (m / ln(1/(1 - c))).

That is, we need b = Ω(lg m) bits. If we, however, use b = 2 lg m bits, then the probability of a false positive falls to

  1 - (1 - 1/m^2)^m < 1/m.

If we have 2^16 = 65,536 words, then using 32 bits yields a false positive probability of just less than 1/65,536.
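The sizing rule is easy to evaluate numerically; the check below (ours, not the slides') reproduces the 2^16-word example and computes the fingerprint length needed for a target false positive rate.

```python
import math

def bits_needed(m, c):
    """Smallest b with e^{-m/2^b} >= 1 - c, i.e. false positive rate below c."""
    return math.ceil(math.log2(m / math.log(1 / (1 - c))))

m = 2 ** 16
b = 2 * int(math.log2(m))             # the b = 2 lg m rule: 32 bits here
fp = 1 - (1 - 2.0 ** -b) ** m         # just under 1/m = 1/65,536
print(b, fp, 1 / m)
print(bits_needed(m, 0.01))           # bits for a 1% false positive rate
```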

5.6. Random Graphs

There are many NP-hard computational problems defined on graphs: Hamiltonian cycle, independent set, vertex cover, and so on. Are these problems hard for most inputs or just for a relatively small fraction of all graphs? Random graph models provide a probabilistic setting for studying such questions. Most of the work on random graphs has focused on two closely related models, G_{n,p} and G_{n,N}.

5.6.1. Random Graph Models

In G_{n,p} we consider all undirected graphs on n distinct vertices v_1, ..., v_n. A graph with a given set of m edges has probability

  p^m (1 - p)^{C(n,2) - m}.

One way to generate a random graph in G_{n,p} is to consider each of the C(n,2) possible edges in some order and then independently add each edge to the graph with probability p.

The expected number of edges in the graph is therefore C(n,2) p, and each vertex has expected degree (n-1)p. In the G_{n,N} model, we consider all undirected graphs on n vertices with exactly N edges. There are C(C(n,2), N) possible graphs, each selected with equal probability. One way to generate a graph uniformly from the graphs in G_{n,N} is to start with a graph with no edges. Choose one of the C(n,2) possible edges uniformly at random and add it to the edges in the graph. Now choose one of the remaining C(n,2) - 1 possible edges uniformly at random and add it to the graph. Continue similarly until there are N edges.

The G_{n,p} and G_{n,N} models are related: when p = N / C(n,2), the number of edges in a random graph in G_{n,p} is concentrated around N, and conditioned on a graph from G_{n,p} having N edges, that graph is uniform over all the graphs from G_{n,N}.
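Both generation procedures translate directly into code; in this illustrative sketch (ours) a graph is represented as a set of vertex pairs.

```python
import itertools
import random

def gen_gnp(n, p):
    """G_{n,p}: include each of the C(n,2) possible edges with probability p."""
    return {e for e in itertools.combinations(range(n), 2)
            if random.random() < p}

def gen_gnN(n, N):
    """G_{n,N}: pick N distinct edges uniformly at random; shuffling all
    candidate edges and taking a prefix matches drawing them one by one."""
    candidates = list(itertools.combinations(range(n), 2))
    random.shuffle(candidates)
    return set(candidates[:N])

n, N = 100, 200
p = N / (n * (n - 1) / 2)      # matching p = N / C(n,2)
print(len(gen_gnp(n, p)))      # concentrated around N = 200
print(len(gen_gnN(n, N)))      # exactly N = 200
```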

There are many similarities between random graphs and the balls-and-bins models. Throwing edges into the graph, as in the G_{n,N} model, is like throwing balls into bins. However, since each edge has two endpoints, each edge is like throwing two balls at once into two different bins. The pairing adds a rich structure that does not exist in the balls-and-bins model. Yet we can often utilize the relation between the two models to simplify analysis in random graph models.