Lecture Note 2

Material covered today is from Chapter 1 and Chapter 4.

1 Bonferroni Principle

1.1 Idea

Get an idea of how frequently events occur purely by chance when things are random.

- 1 billion people = 10^9
- Each person has a 1% chance to stay in a hotel on any given day
- Each hotel has 100 single rooms
- 1/100 * 10^9 = 10^7 people staying in hotels today
- #hotels = (1/100 * 10^9) / 100 = 10^5, or 100,000
- Sample a window of 1000 days

1.2 Want

Suspected terrorists meet and stay in the same hotel twice within the 1000-day period. How many such events might we expect, assuming everyone behaves randomly? In other words, we want to identify a pair of persons (A, B) who stay at the same hotel on two different days d1, d2.

1.2.1 Probability

Pr[A stays at a hotel on day d] = 1/100
Pr[B stays at a hotel on day d] = 1/100

The probability that they both visit a hotel on day d is 1/100 * 1/100 = 10^-4. The probability that they stay at the same hotel is 10^-4 * 10^-5 (since there are 10^5 hotels to choose from). It must happen on both d1 and d2:

p = (10^-4 * 10^-5)^2 = 10^-18

1.2.2 Number of events

The number of ways to select a pair of individuals is C(n, 2) = n(n-1)/2 ≈ n^2/2. Likewise there are about d^2/2 ways to pick the two days, so the number of choices for (A, B, d1, d2) is about (n^2/2)(d^2/2), with n = 10^9 people and d = 1000 days.

Final answer:

(n^2/2)(d^2/2) * p = (10^18/2)(10^6/2) * 10^-18 = 10^6/4 = 250,000

So even with no conspiracies at all, we would expect about a quarter of a million "suspicious" pairs purely by chance.
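This arithmetic is easy to get wrong by a power of ten, so here is a quick sketch reproducing it in Python (exact rational arithmetic; all variable names are my own):

```python
from fractions import Fraction

n_people = 10**9              # 1 billion people
n_hotels = 10**5              # 100,000 hotels
n_days   = 1000               # observation window

p_visit = Fraction(1, 100)                   # P[a person stays at some hotel on day d]
p_same_hotel = p_visit * p_visit / n_hotels  # both at a hotel, and it is the same hotel
p = p_same_hotel**2                          # ...on both days d1 and d2

pairs_people = Fraction(n_people**2, 2)      # ~ C(n, 2)
pairs_days   = Fraction(n_days**2, 2)        # ~ C(1000, 2)

expected = pairs_people * pairs_days * p
print(expected)                              # 250000
```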
1.2.3 Exercise

What if the number of days is raised to 2000?

2 How do you sample when you do not know the size of the population?

2.1 Reservoir Sampling

2.2 Algorithm

Go to the first house and pick it with probability 1; call it our sample. If there are no more houses, we report this house as our answer. If there are still more houses, we continue and pick the second house with probability 1/2; if picked, it becomes our sample, kicking out the first house. If there are no more houses, we report our sample; otherwise, we keep going. In general, we pick the k-th house with probability 1/k, and if picked it replaces the existing choice.

2.3 Analysis

We prove by induction that every element is chosen with equal probability 1/n (which is weaker than what Reservoir Sampling actually provides). When there is only one house, we select it with probability 100%, so the claim is true for n = 1. Suppose the claim is true for n houses; we now prove it also holds for n + 1. By the induction hypothesis, before seeing the (n+1)-th house, each house from 1 to n is our sample with probability 1/n. Upon seeing the new house, we select it with probability 1/(n+1), which means it becomes our sample with probability 1/(n+1). On the other hand, with probability n/(n+1), the original sample persists. So the probability that a particular house from 1 to n persists as our sample is (1/n) * (n/(n+1)) = 1/(n+1).

2.4 Reservoir Sampling, Multiple Items

Goal: pick p items randomly out of n items, with no knowledge of n. Accept a new item, the n-th seen so far, with probability p/n (for the first p items, p/n >= 1, so they are always kept). If we end up not adding the new item, we do nothing. If the item is chosen, one of our old samples is kicked out uniformly at random to make space for it.

2.5 Property of Reservoir Sampling

When selecting a single item, Reservoir Sampling guarantees that every item has the same chance of being chosen. However, when we pick more than one item, Reservoir Sampling provides more than that: it guarantees that every subset of size p is chosen with the same probability.

2.5.1 What does it mean?

We have items A, B, C, D, with p = 2 and n = 4, and our algorithm is as follows: toss a coin, return (A, B) if we see heads, and (C, D) otherwise. Each letter is sampled with probability 1/2, but we will never see (A, C). So the guarantee of Reservoir Sampling is strictly stronger than per-item uniformity.
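The single-item procedure of Section 2.2 can be sketched as follows (a minimal Python version; the empirical check at the end is my own addition):

```python
import random

def reservoir_sample(stream, rng=random):
    """Return one element chosen uniformly at random from a stream of unknown length."""
    sample = None
    for k, item in enumerate(stream, start=1):
        # Keep the k-th item with probability 1/k, replacing the current sample.
        if rng.random() < 1 / k:
            sample = item
    return sample

# Empirical check: each of 5 items should be chosen roughly 1/5 of the time.
rng = random.Random(0)
counts = [0] * 5
for _ in range(100_000):
    counts[reservoir_sample(range(5), rng)] += 1
```

The same skeleton extends to Section 2.4: keep a list of p samples, accept the n-th item with probability p/n, and evict a uniformly random current sample to make room.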
3 Sampling from a Stream

We want to store 1/10-th of the stream for analysis.

4 Hash Functions

4.1 A common hash function

h(x) = (ax + b) mod p, where p is a prime.

4.1.1 Bad example

h(x) = x mod B, where B = 10. Then hashing even numbers will always produce even bucket indices.

4.2 An obvious approach

Generate a random number in 0-9 for each query, and save the query if it is 0.

4.2.1 Bad example

This fails to answer the following question: among all the distinct queries, what fraction are duplicate queries? Suppose a user has issued s search queries once and d search queries twice, and no query more than twice. We have s + 2d queries in total, and the correct answer would be d/(s + d) duplicates.

If we use the previous algorithm, we would instead get d/(10s + 19d). The sample contains s/10 expected distinct singleton queries. Only queries that have duplicates will ever appear twice, and a duplicate query appears twice in the sample if and only if both of its occurrences got sampled; the probability of that is 1/10 * 1/10 = 1/100. With d duplicate queries, we would expect d/100 such appearances. Of the d duplicate queries, 18d/100 will appear exactly once: either only the first occurrence got sampled, or only the second, so the probability that a certain duplicate query appears exactly once is 1/10 * 9/10 + 9/10 * 1/10 = 18/100. In all, the estimated fraction is

(d/100) / (s/10 + 18d/100 + d/100) = d/(10s + 19d).

4.3 Use a hash to sample 1/10 of the users

Use h(user) with 10 buckets, and select the set {u : h(u) = 0}.

4.4 Algorithm

Use a hash function on the user id (so that if a user comes again, we will know whether they were previously sampled or not). Sample a user when they hash to 0. By this method, we can guarantee that each user is sampled with probability 1/10, and all of a sampled user's queries are kept together.
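A small simulation (my own construction) exhibits the bias derived in Section 4.2.1: sampling occurrences at rate 1/10 reports roughly d/(10s + 19d) rather than the true d/(s + d):

```python
import random
from collections import Counter

rng = random.Random(42)
s, d = 10_000, 10_000    # s queries issued once, d queries issued twice

# The stream: each singleton occurs once, each duplicate query occurs twice.
stream = [f"s{i}" for i in range(s)] + [f"d{i}" for i in range(d)] * 2

# "Obvious approach": keep each occurrence independently with probability 1/10.
sampled = Counter(q for q in stream if rng.random() < 0.1)

frac = sum(1 for c in sampled.values() if c == 2) / len(sampled)
# frac comes out near d/(10s + 19d) = 1/29, far below the truth d/(s + d) = 1/2
```

Sampling by user hash (Sections 4.3-4.4) avoids this, because either both occurrences of a duplicate query are kept or neither is.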
5 Bloom Filters

The goal is to answer membership queries for a set S:

- If x ∈ S, always say yes.
- If x ∉ S, we might still say yes, but only with a small probability of error.

5.1 Example

1 billion items; |S| = m. Keep a bit array a of size n, and suppose there are y different values to insert. When some value v comes, set a[h(v)] to 1. When asked whether a certain value v is in S, answer yes if and only if a[h(v)] is 1.

5.2 Analysis

What is the chance that a certain cell of a is 0? Think of the insertions as throwing y darts at x targets (the cells).

Probability a single dart does not hit a specific cell: (x-1)/x = 1 - 1/x.
Probability that all y darts miss this cell: (1 - 1/x)^y = ((1 - 1/x)^x)^(y/x) ≈ e^(-y/x) for large x.

If we have x = 8 * 10^9 (one gigabyte of bits) and y = 10^9, then the probability that a given cell is hit is 1 - e^(-1/8) ≈ 0.1175.

5.3 What if we have k hash functions?

Suppose |S| = m, the array has n cells, and we have k hash functions h_1, ..., h_k, so y = km darts and x = n targets. On seeing v, we set a[h_i(v)] to 1 for every hash function h_1, ..., h_k. When querying v, we answer yes if and only if a[h_i(v)] = 1 holds for all k hash functions.

Chance that a certain cell of a is 0: e^(-km/n).
Chance of a false positive, i.e. a value v is not in S but we thought it is: (1 - e^(-km/n))^k.

If k = 2 and n = 10m, the probability is (1 - e^(-1/5))^2 ≈ 0.0329.
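A minimal Bloom filter along the lines of Section 5.3 (using seeded SHA-256 digests as the k hash functions is my own choice; any family of roughly independent hash functions would do):

```python
import hashlib

class BloomFilter:
    def __init__(self, n_cells, k):
        self.bits = bytearray(n_cells)      # one byte per cell, for simplicity
        self.n, self.k = n_cells, k

    def _cells(self, item):
        # k pseudo-independent cell indices derived from seeded hashes.
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, item):
        for c in self._cells(item):
            self.bits[c] = 1

    def __contains__(self, item):
        # Yes iff every cell is set; members therefore always answer yes.
        return all(self.bits[c] for c in self._cells(item))
```

With m = 1000 members, n = 10m cells, and k = 2, the measured false-positive rate lands near the predicted (1 - e^(-1/5))^2 ≈ 3.3%.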
6 Count-Min Sketch

We want to count the frequency of each number in a stream.

6.1 Algorithm

Initialize an array a with 0s. When a number x comes, use 2 hash functions h_1, h_2 and increase a[h_1(x)] and a[h_2(x)] by 1. When asked the frequency of x, answer min{a[h_1(x)], a[h_2(x)]}.

Guarantee: truth ≤ estimate ≤ (1 + ɛ) * truth, with high probability. (Collisions can only inflate a counter, never decrease it, which is why the estimate never undercounts.)

7 Verifying Matrix Multiplication

7.1 Freivalds' Algorithm

To check whether A = BC, check whether Ar = B(Cr) for a random binary vector r. If A = BC this always holds; if A ≠ BC, a random r detects the difference with probability at least 1/2, so repeating the test drives the error probability down exponentially.

8 Closest Pair

Instead of O(N log N), we can get expected running time O(N) using randomization (the answer is always correct, but the algorithm might occasionally take longer) [Rabin; and independently Khuller & Matias].
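To close, Freivalds' check from Section 7 is short enough to spell out in full (plain Python lists; the repetition count is my own addition, since each round independently halves the chance of missing an error):

```python
import random

def freivalds_check(A, B, C, trials=20, rng=random):
    """Probabilistically test whether A == B C (all n x n), with O(n^2) work per trial."""
    n = len(A)
    for _ in range(trials):
        r = [rng.randrange(2) for _ in range(n)]                       # random 0/1 vector
        Cr  = [sum(C[i][j] * r[j]  for j in range(n)) for i in range(n)]
        BCr = [sum(B[i][j] * Cr[j] for j in range(n)) for i in range(n)]
        Ar  = [sum(A[i][j] * r[j]  for j in range(n)) for i in range(n)]
        if Ar != BCr:
            return False     # r is a witness: certainly A != B C
    return True              # A == B C with probability >= 1 - 2^(-trials)
```

Note that each trial multiplies matrices by vectors only, so the total cost is O(n^2) per trial rather than the O(n^3) of recomputing BC.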