CS 493: Algorithms for Massive Data Sets                    Feb 2, 2002
Local Models, Bloom Filter
Scribe: Qin Lv

1 Local Models

In global models, every inverted file entry is compressed with the same model. This works well when assuming a uniform distribution of gaps, i.e., the gap sizes are randomly distributed. However, the distribution of gaps depends on the listing. If we sort URLs into lexical order, URLs from the same site are likely to be close to each other, so the distribution of gaps tends to have a clustering property. To exploit this property, people use local models, where each inverted file entry uses its own model. Generally, local models outperform global models, but they are more complex to implement.

The Golomb code works well if most gaps are around b, but it is not good for large gaps. The γ-code works better with large gaps. The Skewed-Bernoulli code is a combination of the two; it uses the bucket vector

    V_T = (b, 2b, 4b, ..., 2^i b, ...)

1.1 Local Hyperbolic Model

This model assumes that the probability that a gap has size i is proportional to 1/i, i.e.,

    Pr(i) ∝ 1/i,   for i = 1, 2, ..., m

Let m be the truncation point of the gap size. Then

    Pr(i) = 1 / (i · Σ_{j=1}^{m} 1/j) ≈ 1 / (i · log_e m)

What should m be? We match the expected gap size to the average gap:

    E[gap size] = Σ_{i=1}^{m} i · Pr(i) ≈ Σ_{i=1}^{m} 1/log_e m = m/log_e m = N/f_t

where N is the number of documents and f_t is the frequency of term t. We can then use observed frequencies to build the code. But one problem with this scheme is that we have to do it on a term-by-term basis.
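As a concrete illustration of the γ-code mentioned above, here is a minimal sketch (my own formulation, not taken from the notes): a gap g is written as ⌊log₂ g⌋ zeros followed by g in binary, so small gaps get short codewords while large gaps still cost only about 2 log₂ g + 1 bits.

```python
# Sketch of the Elias gamma-code discussed above (function names are mine).

def gamma_encode(g: int) -> str:
    """Encode a positive gap g: unary length prefix, then g in binary."""
    assert g >= 1
    b = g.bit_length()                  # b = 1 + floor(log2 g)
    return "0" * (b - 1) + bin(g)[2:]   # (b-1) zeros, then g in b bits

def gamma_decode(bits: str) -> int:
    """Decode a single gamma-coded gap from the front of a bit string."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)

print(gamma_encode(1))   # "1"
print(gamma_encode(5))   # "00101"
```

A gap of size g costs 2⌊log₂ g⌋ + 1 bits, which is why the γ-code degrades gracefully on large gaps where a Golomb code with small b does not.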
To solve this problem, we can batch terms (with the same frequencies) together. An even better method is to use logarithmic batching, i.e., we group terms according to logarithmic intervals [2^i, 2^{i+1}). In practice, this method uses only about 20 codes.

All the models we have looked at so far are zero-order models, which means that each symbol is independent and no context information is used. We might be able to design better schemes if we can use information about the previous gap distribution. Such compression schemes are called context-sensitive compression schemes. For a parameterized global model such as the Golomb code, we can choose the parameter b to be the average of the previous K gap sizes. But there is a problem. Suppose we have many small gaps followed by a large gap: b could become 1 for the small gaps, which is the same as unary coding, and then, when the large gap arrives, a huge codeword is used for it. This problem can be alleviated by using the Skewed-Bernoulli code.

1.2 Interpolative Code

With this scheme, we assume that we know the complete inverted file entry, and we recursively calculate ranges and code them with the minimal number of bits. For example, assume we have 20 documents in total, and a given inverted file entry contains 7 documents:

    documents: <7; 3, 8, 9, 11, 12, 13, 17>
    gaps:      <7; 3, 5, 1, 2, 1, 1, 4>

Note that if we know the 4th number is 11 and the 6th number is 13, there is only one possible value (12) for the 5th number. We recursively halve the range and calculate the lowest and highest possible values for the number in the middle. So for the example above, we first look at the 4th number, 11; then 8, which is the middle number of the left 3 documents; then 3, 9; then 13 (the middle number of the right 3 documents); and finally 12, 17. We use (x, lo, hi) to denote that x lies in the interval [lo, hi]. Then for the example above, we get the following sequence:

    (11, 4, 17)

since 11 is the 4th number among the 7 documents, and there are 20 documents in total.
    (8, 2, 9); (3, 1, 7); (9, 9, 10); (13, 13, 19); (12, 12, 12); (17, 14, 20)

The interpolative code can achieve the best compression ratio, but it has a longer decoding time. Why is the (x, lo, hi) coding the best coding? Intuitively, for a sorted sequence of numbers between 1 and N, the median is likely to be near N/2, and the probability Pr(x is the median) is much higher if x is close to N/2. Then, using Huffman coding, we can get a very short code.
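The halving recursion above can be sketched as follows; the function name is mine, and this is only the range-computation half of the scheme (a full encoder would additionally write each x using ⌈log₂(hi − lo + 1)⌉ bits, which is 0 bits when lo = hi):

```python
# Sketch of the interpolative (x, lo, hi) recursion described above.
# A real encoder would also emit x in ceil(log2(hi - lo + 1)) bits.

def interpolative_ranges(docs, lo, hi, out):
    """Emit (x, lo, hi) triples: middle element first, then each half."""
    if not docs:
        return
    mid = len(docs) // 2
    x = docs[mid]
    # x has `mid` sorted elements below it and len(docs)-1-mid above it,
    # which pins it into a narrower interval than [lo, hi].
    out.append((x, lo + mid, hi - (len(docs) - 1 - mid)))
    interpolative_ranges(docs[:mid], lo, x - 1, out)
    interpolative_ranges(docs[mid + 1:], x + 1, hi, out)

out = []
interpolative_ranges([3, 8, 9, 11, 12, 13, 17], 1, 20, out)
print(out)
# [(11, 4, 17), (8, 2, 9), (3, 1, 7), (9, 9, 10),
#  (13, 13, 19), (12, 12, 12), (17, 14, 20)]
```

Note that (12, 12, 12) needs zero bits: knowing its neighbors 11 and 13 fully determines it, exactly as in the example.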
2 Performance of Index Compression

Since local models use more information (such as the previous distribution of gaps), they perform better than global models. The global observed-frequency model is only slightly better than the γ or δ codes. Both γ and δ codes have performance close to the local schemes, but are outperformed by them. The relative performance of the various local schemes is:

    Bernoulli < Hyperbolic < Skewed Bernoulli < batched frequencies < Interpolative code

3 Hashing Based Signature Schemes

For the purpose of indexing, a document is just a set of terms, and the database is some representation of such sets so that queries can be answered efficiently. While sometimes we want exact answers to our queries, there are many situations where some probability of mismatch can be allowed. We now discuss a scheme that is used for the latter case.

3.1 Bloom filters

A Bloom filter represents a set S of n elements by a bit vector V of size m, initially all set to 0. A Bloom filter uses k independent hash functions h_1, h_2, ..., h_k with range {1, 2, ..., m}. We assume these hash functions map each item in the universe to a random number uniform over the range {1, 2, ..., m}. For each element e in S, the bits h_i(e) are set to 1 for 1 ≤ i ≤ k. To check if an element x is in the set S, we check whether all the bits h_i(x) (1 ≤ i ≤ k) are set to 1. If not, element x is definitely NOT in the set S. Otherwise, we assume that x is in S. However, it is possible that x is actually not in the set, but all its k corresponding bits are set (because some other elements hashed to those bits). So a Bloom filter may yield a false positive, where it suggests that an element x is in S even though it is not. For many applications this is acceptable as long as the probability of a false positive is sufficiently small. Bloom filters are widely used in many applications such as distributed caches.

Now, given the scheme, what is the probability of a false positive? Since there are n elements and k hash functions, we perform nk bit-setting operations.
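The insert/query procedure just described can be sketched as follows. This is my own minimal illustration; in particular, deriving the k indices from two halves of one SHA-256 digest (the "double hashing" trick h_i = h1 + i·h2) is a common practical stand-in for the k independent hash functions the analysis assumes.

```python
# Minimal Bloom filter sketch following the description above.
# Class and method names are mine; double hashing substitutes for
# k truly independent hash functions.
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [0] * m          # bit vector V, initially all 0

    def _indices(self, item: str):
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str):
        for i in self._indices(item):
            self.bits[i] = 1

    def __contains__(self, item: str):
        # False -> definitely not in S; True -> probably in S.
        return all(self.bits[i] for i in self._indices(item))

bf = BloomFilter(m=1024, k=7)
bf.add("princeton.edu")
print("princeton.edu" in bf)   # True: a Bloom filter never gives
                               # a false negative
```

Queries for elements never added usually return False, but can return True with the false-positive probability analyzed below.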
Reference: Compressed Bloom Filters, by Michael Mitzenmacher.

Assuming purely random hash functions, each time we set a bit, the probability that a particular bit is not set is 1 - 1/m. So, after
all the elements of S are hashed into the Bloom filter, the probability that a particular bit is still not set is

    (1 - 1/m)^{nk} ≈ e^{-nk/m}

Let p = e^{-nk/m}. Then the probability of a false positive (this is when all the k corresponding bits are set) is

    (1 - (1 - 1/m)^{nk})^k ≈ (1 - e^{-nk/m})^k = (1 - p)^k

Let f = (1 - p)^k. We now use the asymptotic approximations p and f to represent, respectively, the probability of a bit not being set and the probability of a false positive.

Suppose the numbers n and m are fixed; how can we choose the number of hash functions k such that the probability of a false positive f is minimized?

    p = e^{-nk/m}
    ln p = -nk/m
    k = -(m/n) ln p

    f = (1 - p)^k = (1 - p)^{-(m/n) ln p} = e^{-(m/n) ln(1-p) ln p}

Since m and n are fixed, f is minimized when ln(1-p) ln p is maximized.

    d/dp [ln(1-p) ln p] = ln(1-p)/p - ln p/(1-p) = 0   ⟹   p = 1/2

so the optimal (i.e., minimal) f is achieved when p = 1/2:

    1/2 = p = e^{-nk/m}
    ln 2 = nk/m
    k = ln 2 · (m/n)
    f = (1 - p)^k = (1/2)^{ln 2 · (m/n)} ≈ (0.6185)^{m/n}
3.2 Compressed Bloom filter

Many applications using a Bloom filter may need to pass the Bloom filter as a message, and the transmission size Z (Z ≤ m) can become a limiting factor. If every bit has the same probability 1/2, the Bloom filter cannot be compressed (Z = m). Suppose, however, that we choose k such that p, the probability of a bit not being set, is not 1/2; then we may be able to compress the Bloom filter before sending it out, thus reducing the transmission size Z. The lower bound on Z is m · H(p, 1-p), where

    H(p, 1-p) = -p log_2 p - (1-p) log_2 (1-p)

is the entropy of the distribution {p, 1-p}. Suppose n and the transmission size Z are fixed: what is the best k that minimizes f?

[to be continued in next class...]
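The entropy bound above is easy to evaluate. A small sketch (mine): at p = 1/2 the bound equals the full m bits, i.e., no compression is possible, while skewing p away from 1/2 shrinks the compressible size m · H(p, 1-p).

```python
# Sketch of the entropy lower bound m * H(p, 1-p) discussed above.
# The values of m and p below are arbitrary illustrations.
import math

def H(p: float) -> float:
    """Binary entropy H(p, 1-p) in bits per bit of the filter."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

m = 1024
for p in (0.5, 0.7, 0.9):
    print(p, round(m * H(p)))   # p = 0.5 gives the full 1024 bits
```

This is why choosing a k that makes p ≠ 1/2 (deliberately suboptimal for an uncompressed filter) can still win once the filter is compressed before transmission.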