
CSE 203A: Advanced Algorithms
Prof. Daniel Kane
Lecture : Dictionary Data Structures and Load Balancing
Lecture Date: 10/27
P Chitimireddi

Recap

This lecture continues the discussion of dictionary data structures from last class. Recall that a dictionary data structure encoding a subset $W \subseteq U$ must handle every word in the universe $U$: by querying the data structure we should be able to verify whether or not the word belongs to our set $W$. This means the data structure must be able to encode any possible subset of $U$, which takes a fair amount of storage. The amount of space we need is of the order $n \log(N/n)$ bits, where $n = |W|$ and $N = |U|$. The question we pose is whether we can save space by relaxing the requirements on the data structure. Yes, we can do so using Bloom filters.

Bloom Filters

As mentioned, we will relax the conditions of our dictionary data structure. The relaxations we offer:

- We don't store any associated entries, since they take space anyway.
- We have a small chance of being wrong:
  - If $x \in W$, then our data structure always accepts $x$.
  - If $x \notin W$, then there is an $\epsilon$ chance of the data structure still accepting $x$.

Hence, if we are allowed to have a few false positives, Bloom filters are a good option.

Space/Size of Bloom Filters

Suppose we have a Bloom filter for which the number of accepted elements satisfies $\#\{x : x \text{ is accepted}\} \le \epsilon |U|$. What is the probability that a randomly chosen subset $W$ of size $|W| = n$ is consistent with the Bloom filter?

$$\Pr(W \text{ is consistent with the Bloom filter}) = \Pr(\text{all } n \text{ elements are accepted by the Bloom filter}) \le \epsilon^n$$

Hence, the total number of such Bloom filters required to deal with all possible subsets $W \subseteq U$ will be at least $\epsilon^{-n}$. The space requirement for the Bloom filter will then be at least $\log_2(\epsilon^{-n}) = n \log_2(1/\epsilon)$ bits. If we assume $\epsilon$ is constant, this is $O(n)$ bits, so compared with the $n \log(N/n)$ bits of the exact dictionary we save about a $\log(N/n)$ factor.

Implementation

We have a bit array $b[\,]$ of length $m$ and $k$ different hash functions $h_1, h_2, \ldots, h_k$ with $h_i : U \to [m]$. What we do is the following:

1. For each word $x$ in our subset $W$, we set the bits $b[h_1(x)], b[h_2(x)], \ldots, b[h_k(x)]$ to 1. In other words, each word sets $k$ conditions to 1. Any bit which isn't $b[h_i(x)]$ for some hash function $h_i$ and some word $x \in W$ remains 0.
2. For lookup, compute $b[h_1(x)] \wedge b[h_2(x)] \wedge \ldots \wedge b[h_k(x)]$ and check if it equals 1.

Clearly, if the word $x \in W$, then all those bits were set and the Bloom filter accepts $x$. If the word $x \notin W$, then there is a chance that the bits $b[h_1(x)], b[h_2(x)], \ldots, b[h_k(x)]$ have all been set to 1 by the words in $W$. Hence, there is a chance of a false positive.

False Positive Probability and Size

Let us calculate the false positive probability in relation to $n$, $k$, and $m$. Let $m' = |\{i : b[i] = 1\}|$ be the number of set bits, so that the occupancy rate of the bit array $b$ is $m'/m$. Then

$$\Pr(y \notin W \text{ accepted}) = \prod_{i=1}^{k} \Pr(b[h_i(y)] = 1) = \left(\frac{m'}{m}\right)^k$$

Since the number of bits set by each $x \in W$ is at most $k$, a trivial bound on $m'$ is $m' \le nk$. Hence, if we set $m = 2nk$ and $k = \log_2(1/\epsilon)$, we get $m'/m \le 1/2$ and

$$\text{False Positive Rate} = \left(\frac{m'}{m}\right)^k \le \left(\frac{1}{2}\right)^{\log_2(1/\epsilon)} = \epsilon$$

Hence, the space required for a Bloom filter with a false positive rate of at most $\epsilon$ is of the order $m$, the space needed to store the bit array:

$$\text{size} = m = 2nk = 2n \log_2(1/\epsilon)$$

For a stronger bound on the size of the Bloom filter, let us consider the probability of a given bit not being set to 1:

$$\Pr(\text{bit} = 0) = \left(1 - \frac{1}{m}\right)^{nk} \approx e^{-nk/m}$$

For $m'/m \approx 1/2$ we need $nk/m \approx \ln(2)$, i.e. $m \approx nk/\ln(2)$, which gives

$$\text{size} \approx \frac{n \log_2(1/\epsilon)}{\ln(2)}$$
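The insert and lookup steps and the sizing $k = \log_2(1/\epsilon)$, $m \approx nk/\ln(2)$ can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the class name `BloomFilter`, the helper `empirical_false_positive_rate`, and the use of index-salted SHA-256 as a stand-in for $k$ independent hash functions are all assumptions made for the example.

```python
import hashlib
import math


class BloomFilter:
    """Bit array of length m with k hash functions, sized from n and eps."""

    def __init__(self, n, eps):
        # k = log2(1/eps) hash functions (rounded up),
        # m = n*k/ln(2) bits (the stronger size bound, rounded up).
        self.k = max(1, math.ceil(math.log2(1 / eps)))
        self.m = max(1, math.ceil(n * self.k / math.log(2)))
        # One byte per bit for readability; a real filter would pack bits.
        self.bits = bytearray(self.m)

    def _positions(self, x):
        # Assumption: salting SHA-256 with the index i stands in for
        # k independent hash functions h_1, ..., h_k : U -> [m].
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, x):
        # Step 1: set b[h_1(x)], ..., b[h_k(x)] to 1.
        for pos in self._positions(x):
            self.bits[pos] = 1

    def __contains__(self, x):
        # Step 2: accept x iff all k bits are set.
        return all(self.bits[pos] for pos in self._positions(x))


def empirical_false_positive_rate(n=2000, eps=0.01, trials=20000):
    """Insert n words, then query `trials` words that were never inserted."""
    bf = BloomFilter(n, eps)
    for i in range(n):
        bf.add(f"member-{i}")
    # Members are always accepted (no false negatives).
    assert all(f"member-{i}" in bf for i in range(n))
    hits = sum(f"other-{j}" in bf for j in range(trials))
    return hits / trials
```

Calling `empirical_false_positive_rate()` should return a value in the neighborhood of the target $\epsilon$, though the exact figure depends on the hash stand-in and the rounding of $k$ and $m$.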

Independence assumptions

We have assumed perfectly independent hash functions so far, but that is not always easy to obtain. However, the analysis still holds under $(k+1)$-wise independent hash functions for our case where $m = 2nk$. We can then bound the false positive rate: for $y \notin W$ and $x_i \in W$,

$$\Pr(h_1(y) = h_{i_1}(x_1),\ h_2(y) = h_{i_2}(x_2),\ \ldots) = \prod_{j} \Pr(h_j(y) = h_{i_j}(x_j))$$

Number of hash functions required

Another optimization we can implement is, instead of using $k$ hash functions $h_1, h_2, \ldots, h_k$, to use just two hash functions $f$ and $g$, provided $g(x) \not\equiv 0 \pmod{m}$. The new hash functions will be

$$f(x) + g(x),\ f(x) + 2g(x),\ f(x) + 3g(x),\ \ldots,\ f(x) + k\,g(x) \pmod{m}$$

where $m$ is an appropriately chosen prime.

Load Balancing (Section 3.1, 3.6)

The problem of load balancing is to distribute $m$ jobs to $n$ servers such that no server has too much load (i.e. too many jobs on a single server). A simple way to do this is to keep track of how many jobs each server is currently running and assign each new job to the server with the fewest jobs running. However, it might be too expensive to maintain the list of the number of jobs running on each server or to communicate constantly with the servers. Hence, we use a randomized approach where we randomly assign jobs to servers and hope that no server gets overloaded with too many jobs.

For our discussion, let's start with the case where we have $n$ jobs and $n$ servers. Since this is a CS class, we will rephrase this problem as throwing $n$ balls into $n$ bins for convenience. The randomized approach we use is a hash function $h : [n] \to [n]$ which maps the balls into the bins. The main question we are going to ask is: what is the expected maximum load for a server? That is, what is the maximum number of balls in any given bin?

Expected Maximum Load

Let us consider the distribution on a single bin. It is essentially throwing $n$ balls at a bin, each with probability $1/n$ of landing in the bin. This is a binomial distribution $B(n, 1/n)$.
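As a rough sanity check on the balls-and-bins model just described, the experiment can be simulated directly. This is only a sketch under stated assumptions: Python's `random.Random` stands in for the hash function $h$, and the names `max_load` and `average_max_load` are illustrative, not from the lecture.

```python
import random
from collections import Counter


def max_load(n, rng):
    """Throw n balls into n bins uniformly at random; return the fullest bin's load."""
    # Assumption: rng.randrange(n) plays the role of h : [n] -> [n].
    loads = Counter(rng.randrange(n) for _ in range(n))
    return max(loads.values())


def average_max_load(n, trials=20, seed=0):
    """Average the maximum load over several independent experiments."""
    rng = random.Random(seed)
    return sum(max_load(n, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        print(n, average_max_load(n))
```

Running the script for growing $n$ shows the maximum load increasing very slowly with $n$, even though the average load per bin stays at exactly 1.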
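Hence, for the load of a given bin, which is distributed as $B(n, 1/n)$,

$$\Pr(k \text{ balls in a given bin}) = \binom{n}{k} n^{-k} \left(1 - \frac{1}{n}\right)^{n-k} = \binom{n}{k} n^{-k} \left(\left(1 - \frac{1}{n}\right)^{n}\right)^{(n-k)/n} \approx \binom{n}{k} n^{-k} \left(\frac{1}{e}\right)^{(n-k)/n} \le \frac{n^k}{k!}\, n^{-k} \left(\frac{1}{e}\right)^{(n-k)/n} \le \frac{1}{k!},$$

since $(1/e)^{(n-k)/n} \le 1$.

Let $k_0$ be the smallest $k$ such that $k! \ge n$. Observe that $k_0 \approx \log(n)/\log(\log(n))$ using Stirling's approximation. Hence, $\Pr(\text{load on bin } i \ge k_0 + 2) \ll 1/n$ and hence, by the union bound,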
the maximum load is probably less than $k_0 + O(1)$. It is easy to see that the probability of the maximum load exceeding this bound decreases by a factor of about $\log(n)$ each time we increase the constant term by one.

Lower bound

Heuristically, the load on bin $i$ and the load on bin $j$ are nearly independent, since for a very large number of bins, assigning the balls randomly to the bins is nearly equivalent to independently throwing balls into each bin with probability $1/n$. So if we took $k = k_0 - 2$, the expected number of bins at load $k$ is pretty large (of the order $\log^2(n)$). We show that this actually happens, i.e. a bin with this load $k$ exists with high probability.

If we don't care about having a very high probability of success, we can look at the indicator variables $X_i = \mathbb{I}[\text{bin } i \text{ has } k \text{ balls}]$. Then $X_i$ and $X_j$ might be correlated, but it is easy to see that the covariance between them is at most 0, since a ball being present in bin $i$ only decreases the probability that a ball is in bin $j$: more balls in $i$ means the expected number of balls in $j$ decreases. Hence, the covariance is negative and

$$\mathrm{Var}\left(\sum_i X_i\right) \le \sum_i \mathrm{Var}(X_i).$$

Since the $X_i$ are indicator random variables, $\mathrm{Var}(X_i) \le E(X_i)$, so the variance of the sum is at most the sum of the expectations of the $X_i$:

$$\mathrm{Var}\left(\sum_i X_i\right) \le E\left(\sum_i X_i\right).$$

Using the Chebyshev inequality,

$$\Pr\left(\sum_i X_i = 0\right) \le \frac{\mathrm{Var}\left(\sum_i X_i\right)}{E\left(\sum_i X_i\right)^2} \le \frac{1}{E\left(\sum_i X_i\right)}, \qquad \text{so} \qquad \Pr\left(\sum_i X_i > 0\right) \ge 1 - \frac{1}{\mathrm{Poly}(\log(n))}.$$

Poissonization

A useful trick for analyzing load balancing is to assume that the different bins are independent of each other. This is called Poissonization. The Poisson distribution is basically what we get when we take a large number of unlikely events whose total expected number is $\lambda$:

$$\mathrm{Poi}(\lambda) = \lim_{n \to \infty} B(n, \lambda/n)$$

The idea is that we have a very large number of unlikely events and we are just counting the number of events which actually occur. The interesting thing is that if we take $\mathrm{Poi}(\lambda)$ things and sort them into $n$ bins with probabilities $p_i$, then what we get is $n$ independent $\mathrm{Poi}(\lambda p_i)$ random variables, one in each bin.
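The splitting property just stated can be checked numerically. The sketch below draws $\mathrm{Poi}(\lambda)$ items, sorts them into bins with probabilities $p_i$, and estimates each bin's mean and variance (both should be close to $\lambda p_i$) together with the covariance between the first two bins (which should be close to 0). The function names, the parameter values, and the choice of Knuth's multiplication method as the Poisson sampler are assumptions made for this illustration.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for moderate lam."""
    threshold = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > threshold:
        count += 1
        prod *= rng.random()
    return count


def split_into_bins(lam, probs, rng):
    """Draw Poi(lam) items and put each into bin i with probability probs[i]."""
    total = sample_poisson(lam, rng)
    counts = [0] * len(probs)
    for i in rng.choices(range(len(probs)), weights=probs, k=total):
        counts[i] += 1
    return counts


def bin_statistics(lam=20.0, probs=(0.5, 0.3, 0.2), trials=20000, seed=0):
    """Empirical per-bin means and variances, and covariance of bins 0 and 1."""
    rng = random.Random(seed)
    samples = [split_into_bins(lam, probs, rng) for _ in range(trials)]
    bins = range(len(probs))
    means = [sum(s[i] for s in samples) / trials for i in bins]
    variances = [
        sum((s[i] - means[i]) ** 2 for s in samples) / trials for i in bins
    ]
    cov01 = sum(
        (s[0] - means[0]) * (s[1] - means[1]) for s in samples
    ) / trials
    return means, variances, cov01
```

For comparison, repeating the experiment with a fixed number of items instead of $\mathrm{Poi}(\lambda)$ would give a clearly negative covariance, which is exactly the dependence that Poissonization removes.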
This is a very useful assumption because the bins will be truly independent of each other. The only thing we should be careful about is that $\mathrm{Poi}(n)$ is not exactly equal to $n$: the actual number of balls will be about $n \pm \sqrt{n}$. Under this assumption, each bin has an i.i.d. $\mathrm{Poi}(1)$ number of balls in it. The probability of having $k$ balls in a bin is

$$\Pr(\mathrm{Poi}(1) = k) = \frac{1}{e} \cdot \frac{1}{k!}$$

This is the limit of our earlier analysis. The analysis is now simpler, since the number of bins with $k$ balls will be distributed as $B\left(n, \frac{1}{e} \cdot \frac{1}{k!}\right)$. If $\frac{n}{e \cdot k!} \gg 1$, then with very high probability we will find a bin with that load. However, if we want to prove a lower bound on the load, we want to look at $\mathrm{Poi}(n - \sqrt{n}\log(n))$ balls; by concentration bounds, this number of balls will be at most $n$ with high probability, which gives us a stronger guarantee on our bound.
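To get a feel for these quantities, the threshold $k_0$ (the smallest $k$ with $k! \ge n$) and the expected number of bins with load exactly $k$ under the Poisson model, $n/(e \cdot k!)$, can be computed directly. This is a small sketch; the function names `k0` and `expected_bins_with_load` are illustrative only.

```python
import math


def k0(n):
    """Smallest k such that k! >= n."""
    k, fact = 0, 1
    while fact < n:
        k += 1
        fact *= k
    return k


def expected_bins_with_load(n, k):
    """Expected number of bins holding exactly k balls: n * Pr(Poi(1) = k)."""
    return n / (math.e * math.factorial(k))


if __name__ == "__main__":
    n = 10**6
    k = k0(n)
    print("k0 =", k)
    print("log(n)/log(log(n)) =", math.log(n) / math.log(math.log(n)))
    # Two below and two above the threshold, as in the lower and upper bounds.
    for load in (k - 2, k, k + 2):
        print(load, expected_bins_with_load(n, load))
```

Below $k_0$ the expected count is well above 1, so such a bin is likely to exist; a couple of steps above $k_0$ it falls far below 1, matching the upper bound. Note that $k_0$ and $\log(n)/\log(\log(n))$ agree only asymptotically, so they can differ noticeably for moderate $n$.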
