arxiv: v2 [cs.ds] 3 Nov 2017


HyperMinHash: Jaccard index sketching in LogLog space
Extended Abstract

YUN WILLIAM YU, Harvard Medical School DBMI
GRIFFIN M. WEBER, Harvard Medical School DBMI

In this extended abstract, we describe and analyse a streaming probabilistic sketch, HYPERMINHASH, to estimate the Jaccard index (or Jaccard similarity coefficient) over two sets $A$ and $B$. HyperMinHash can be thought of as a compression of standard MinHash by building off of a HyperLogLog count-distinct sketch. Given Jaccard index $\delta$, using $k$ buckets of size $O(\log(l) + \log\log(|A \cup B|))$ (in practice, typically 2 bytes) per set, HyperMinHash streams over $A$ and $B$ and generates an estimate of the Jaccard index $\delta$ with error $O(1/l + \sqrt{k/\delta})$. This improves on the best previously known sketch, MinHash, which requires the same number of storage units (buckets), but uses $O(\log(|A \cup B|))$ bits per bucket. For instance, our new algorithm allows estimating Jaccard indices of 0.01 for set cardinalities on the order of $10^{19}$ with relative error of around 5% using 64KiB of memory; the previous state-of-the-art MinHash can only estimate Jaccard indices for cardinalities of $10^{10}$ with the same memory consumption. Alternately, one can think of HyperMinHash as an augmentation of b-bit MinHash that enables streaming updates, unions, and cardinality estimation (and thus intersection cardinality by way of Jaccard), while using log log extra bits.

CCS Concepts: Theory of computation → Sketching and sampling;

Additional Key Words and Phrases: MinHash, sketching, streaming, loglog, Jaccard

ACM Reference Format: Yun William Yu and Griffin M. Weber. 2017. HyperMinHash: Jaccard index sketching in LogLog space: Extended Abstract. (November 2017), 16 pages.

Corresponding author

Authors' addresses: Yun William Yu, Harvard Medical School DBMI, 10 Shattuck St #3, Boston, Massachusetts, 02115, william_yu@hms.harvard.edu; Griffin M. Weber, Harvard Medical School DBMI, 10 Shattuck St #3, Boston, Massachusetts, 02115, griffin_weber@hms.harvard.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). © 2017 Copyright held by the owner/author(s). XXXX-XXXX/07/-ART $ Vol., No., Article. Publication date: November 2017.

1 INTRODUCTION

Many questions in data science can be rephrased in terms of the number of items in a database that satisfy some Boolean formula. For example, "how many participants in a political survey are independent and have a favorable view of the federal government?", or "how many of the source IPs used in a DDoS attack today were also used last month?" In this paper, we consider the design of approximate streaming sketches to answer questions phrased in conjunctive normal form (an AND of ORs); this is of course equivalent to estimating the cardinality of intersections of unions of a collection of sets. The literature already has near-optimal probabilistic data structures for approximating the count-distinct problem [1, 4, 7, 8], which is equivalent to finding the cardinality of unions of sets (i.e. ORs in CNF). Thus, we focus in particular on the problem of estimating the Jaccard index [9], a proxy for set similarity that, when coupled with union cardinality, allows estimation of intersection sizes.

1.1 Jaccard index

Given two sets $A$ and $B$, where $|A| = n$ and $|B| = m$, and $n > m$ without loss of generality, the Jaccard index is defined as
\[ \delta(A, B) = \frac{|A \cap B|}{|A \cup B|}. \tag{1} \]
Clearly, if paired with a good count-distinct union estimator for $|A \cup B|$, this allows us to estimate intersection sizes as well. Though Jaccard originally defined this index to measure ecological diversity in 1902 [9], in more modern times it has been used as a proxy for the document similarity problem. In 1997, Broder introduced min-wise hashing (colloquially known as "MinHash") [2], a technique for quickly estimating the resemblance of documents by looking at the Jaccard index of "shingles" (collections of phrases) contained within the documents.

1.2 MinHash

MinHash relies on a simple fact: if you apply a random permutation to the universe of elements, the chance that the smallest items under this permutation in sets $A$ and $B$ are the same is precisely the Jaccard index.
To see this, consider a random permutation of $A \cup B$. The minimum element will come from either $A \setminus B$, $B \setminus A$, or $A \cap B$, all disjoint sets. If the minimum element lies in $A \setminus B$, then $\min(A) \notin B$, so $\min(A) \neq \min(B)$; the same is of course true by symmetry for $B \setminus A$. Conversely, if $\min(A \cup B) \in A \cap B$, then clearly $\min(A) = \min(B)$. Because the permutation is random, every element has an equal probability of being the minimum, and thus
\[ P(\min(A) = \min(B)) = \frac{|A \cap B|}{|A \cup B|}. \tag{2} \]
While using a single random permutation produces an unbiased estimator of $\delta(A, B)$, it is a Bernoulli 0/1 random variable with high variance. So, instead of using a single permutation, one averages $k$ trials. The expected fraction of matches is also an unbiased estimator of the Jaccard index, but with variance decreased by a factor of $1/k$.

Though the theoretical justification is predicated on having a true random permutation, in practice we approximate that by using good random hash functions instead. A good hash function will specify a nearly total ordering on the universe of items, and provided we use $\Theta(\log(n))$ bits for the hash function output, the probability of accidental collision of min-hashes is exponentially small.

Though theoretically easy to analyze, this scheme has a number of drawbacks, chief amongst them the requirement of having $k$ random hash functions, which means that the computational complexity is $\Theta(nk)$ to generate the sketch. To address this, several variants of MinHash have been proposed [3]:

(1) k-hash functions. The scheme described above, which has the shortcoming of using $\Theta(nk)$ computation to generate the sketch.

(2) k-minimum values. A single hash function is used, but instead of storing the single minimum value, we store the smallest $k$ values for each set (also known as the KMV sketch [1]). Sketch generation time is reduced to $O(n \log k)$, but we also incur an $O(k \log k)$ sorting penalty when computing the Jaccard index.

(3) k-partition. Another one-permutation MinHash variant, k-partition stochastically averages by first deterministically partitioning a set into $k$ parts using the first couple of bits of the hash value, and then stores the minimum value in each partition [11]. k-partition has the advantage of $O(n)$ sketch generation time and $O(k)$ Jaccard index computation time, at the cost of some difficulty in the analysis.

It is important to remark here that for all of the above variants, MinHash sketches of $A$ and $B$ can be losslessly combined to form the MinHash sketch of $A \cup B$. Using order statistics, it is additionally possible to estimate the union cardinalities [1], so we can directly estimate intersection size in addition to Jaccard index. This also implies that streaming updates are permitted, so preprocessing incurs no additional space requirement.

1.3 log log space complexity

All of the variants of MinHash given in the last section require logarithmic bits per bucket in order to prevent accidental collisions (i.e. we want to ensure that when two hashes match, they came from identical elements), though in the case of k-partition, some of those bits can be stored implicitly in the bucket identity. However, in the similar problem of cardinality estimation of unique items (the count-distinct problem), the literature of the last several decades has produced several streaming sketches that require sub-logarithmic bits per bucket; indeed, the LogLog, SuperLogLog, and HyperLogLog family of sketches requires, as given in the name, only $O(\log\log(n))$ bits per bucket by storing only the position of the first 1 bit of a uniform hash function [4, 7, 8].
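As an illustration of the k-partition variant and of the lossless merge property noted above, the following is a minimal Python sketch. It is not the authors' code: the helper names (`hash64`, `kpartition_sketch`, `merge`, `jaccard`), the use of a 64-bit prefix of SHA-1, and the convention of ignoring buckets that are empty in both sketches are all assumptions made for the example.

```python
import hashlib

def hash64(item: str) -> int:
    """64-bit hash taken from SHA-1 (an illustrative choice of hash)."""
    return int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")

EMPTY = 1 << 64  # sentinel larger than any 64-bit hash: bucket has seen no item

def kpartition_sketch(items, p: int):
    """One-permutation MinHash with k = 2**p buckets (1 <= p < 64)."""
    buckets = [EMPTY] * (1 << p)
    for item in items:
        h = hash64(item)
        idx = h >> (64 - p)                # first p bits choose the partition
        rest = h & ((1 << (64 - p)) - 1)   # remaining bits are the stored value
        if rest < buckets[idx]:
            buckets[idx] = rest            # keep the minimum per partition
    return buckets

def merge(s1, s2):
    """Sketch of the union of two sets: bucket-wise minimum."""
    return [min(a, b) for a, b in zip(s1, s2)]

def jaccard(s1, s2):
    """Fraction of matching buckets among buckets used by either sketch."""
    used = [(a, b) for a, b in zip(s1, s2) if a != EMPTY or b != EMPTY]
    if not used:
        return 0.0
    return sum(a == b for a, b in used) / len(used)

if __name__ == "__main__":
    A = [str(i) for i in range(2000)]
    B = [str(i) for i in range(1000, 3000)]   # true Jaccard index is 1/3
    sa, sb = kpartition_sketch(A, 8), kpartition_sketch(B, 8)
    print("estimate:", jaccard(sa, sb))
    print("merge is lossless:", merge(sa, sb) == kpartition_sketch(set(A) | set(B), 8))
```

Each bucket here stores a full (64 - p)-bit value, which is exactly the logarithmic per-bucket cost that the rest of the paper sets out to reduce.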
We wanted to do the same thing for the Jaccard index problem. First note that HyperLogLog union cardinalities can be used to compute intersection cardinalities using the inclusion-exclusion principle, but that the relative error is then in the size of the union (as opposed to the size of the Jaccard index for MinHash) and compounds when taking the intersections of multiple sets; for small intersections, the error is often too great to be practically feasible. Notably, some newer cardinality estimation methods based on maximum-likelihood estimation are able to more directly access intersection sizes in HyperLogLog sketches, which can then be paired with union cardinality to estimate the Jaccard index [5, 6]. However, this approach, while more sophisticated, is restricted to the information available in the HyperLogLog sketch itself, and seems empirically to be a constant order (< 3x) improvement over conventional inclusion-exclusion.

Alternately, when unions and streaming updates are not necessary, the more recent advance of b-bit MinHash [10] solves exactly the Jaccard index problem while using only a constant number of bits per bucket. b-bit MinHash operates in the same way as standard MinHash, but after computing the minimum hash value, stores only the lowest order $b$ bits. Indeed, for very large Jaccard similarity $\delta(A, B) > 0.5$, Li et al. determined that even using 1 bit per bucket was asymptotically optimal. In general, for small Jaccard similarity, b-bit MinHash needs $\Omega(\log(1/\delta))$ bits, without any dependence on the sizes of the sets. For estimating the Jaccard similarity between exactly two sets, b-bit MinHash is nearly asymptotically optimal [12].

However, b-bit MinHash, while great for the Jaccard index, loses many of the benefits of standard MinHash. Because b-bit MinHash only takes the lowest order $b$ bits of the minimum hash value after finding the minimum, it also requires $\log(n)$ bits per bucket during the sketch generation phase, the same as standard MinHash.
This also implies that sketches cannot be merged together, so union cardinalities cannot be estimated. Some of these shortcomings can be overcome by pairing a b-bit MinHash sketch with a HyperLogLog count-distinct sketch. HyperLogLog uses $\log\log(n)$ bits per bucket to estimate union cardinalities, so combined, these two allow for accurate estimation of intersection sizes, not just the Jaccard index. However, this still does not permit the usage of unions or streaming updates, so more complex predicates (e.g. $(A \cup B) \cap C$) still cannot be evaluated, and the data structure still requires $O(\log(n) + \log\log(n))$ bits per bucket during the sketch generation phase.
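A small worked example may make the merge problem concrete. The two 8-bit "minimum hash" values below are made up for illustration, and `lowest_bits` is a hypothetical helper; the point is only that truncating to $b$ bits does not commute with taking the minimum, so the b-bit value of a union cannot be recovered from two stored b-bit values.

```python
def lowest_bits(value: int, b: int) -> int:
    """Keep only the lowest-order b bits, as b-bit MinHash does after finding the minimum."""
    return value & ((1 << b) - 1)

b = 4
min_hash_A = 0b10000001   # 129, minimum hash of a hypothetical set A
min_hash_B = 0b01111111   # 127, minimum hash of a hypothetical set B

stored_A = lowest_bits(min_hash_A, b)   # 0b0001
stored_B = lowest_bits(min_hash_B, b)   # 0b1111

# What a b-bit sketch of the union should contain: truncate the true minimum.
true_union = lowest_bits(min(min_hash_A, min_hash_B), b)

# What we get if we try to merge the already-truncated values.
naive_merge = min(stored_A, stored_B)

print(true_union, naive_merge)   # 15 1
```

The stored low-order bits carry no ordering information about the full hashes, which is why the full $\log(n)$-bit minimum has to be kept around during sketch generation.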

We resolved this issue by building a new sketch, HyperMinHash, as a hybrid between HyperLogLog and k-partition MinHash. Using the same amount of space as b-bit MinHash + HyperLogLog, we achieve better streaming performance (using only $O(\log\log(n) + \log(1/\delta))$ bits per bucket at all stages of the process), the ability to take unions of sketches, and count-distinct cardinality estimation. In Table 1, we summarize some of the properties of the various methods we've described above.

2 INTUITION

MinHash works under the premise that two sets will have identical minimum value with probability equal to the Jaccard index, because they can only share a minimum value if that minimum value corresponds to a member of the intersection of those two sets. If we have a total ordering on the union of both sets, the fraction of equal buckets is an unbiased estimator for the Jaccard index. However, with limited precision hash functions, there is some chance of accidental collision, when the value does not correspond to a member of the intersection. In order to get close to a true total ordering, the space of potential hashes must be on the order of the size of the union, and thus we must store $O(\log n)$ bits.

Note, however, that the minimum of a collection of uniform $[0, 1]$ random variables $X_1, \dots, X_n$ is much more likely to be a small number than a large one (the insight behind most count-distinct sketches [1]). HyperMinHash operates identically to MinHash, but instead of storing the minimum values with fixed precision, it effectively uses an adaptive precision that increases resolution when the values are smaller, by using initial loglog counters (from LogLog cardinality estimation) and then storing a fixed number of bits beyond that (similar in spirit to b-bit MinHash).
More precisely, after dividing up the items into $k$ partitions, we store the position of the leading 1 bit in the first $2^q$ bits (and store $2^q + 1$ if there is no 1 bit in the first $2^q$ bits), and the $r$ bits following that (Figure 1). We do not need a total ordering so long as the number of accidental collisions in the minimum values is low.

To analyze the performance of HyperMinHash compared to random-permutation MinHash (or equivalently 0-collision standard MinHash), it suffices to consider the expected number of accidental collisions. In this intuitive analysis, we will only analyze the simple case of collisions while using only a single bucket, but the same flavor of argument holds for multiple partitions. The HyperLogLog part of the sketch results in collisions whenever two items match in order of magnitude (Figure 2a). By pairing it with an additional $r$-bit hash, our collision space is narrowed by a factor of about $2^r$ within each bucket. An explicit exact formula for the expected number of collisions is
\[ \mathbb{E}C = \sum_{i=1}^{\infty} \sum_{j=0}^{2^r - 1} \left[ \left(1 - \frac{2^r + j}{2^{i+r}}\right)^n - \left(1 - \frac{2^r + j + 1}{2^{i+r}}\right)^n \right] \left[ \left(1 - \frac{2^r + j}{2^{i+r}}\right)^m - \left(1 - \frac{2^r + j + 1}{2^{i+r}}\right)^m \right], \]
though finding a closed formula is rather more difficult.

Intuitively, suppose that the stored hash value for some partition is $(\rho, \sigma)$. Together with the $p$ bits of the partition index, this fixes the leading $p + \rho + r$ bits of the original bitstring of the minimum hash. A uniform random hash in $[0, 1]$ then collides with this number with probability $2^{-(p + \rho + r)}$, so we expect to need cardinalities on the order of $2^{p + \rho + r}$ before having many collisions. But of course, as the cardinalities of $A$ and $B$ increase, so does the expected position of the leading 1 in the bitstring, as analyzed in the construction of HyperLogLog [8]. Thus, the collision probabilities remain roughly constant as cardinalities increase, at least until we reach the precision limit of the LogLog counters.

But of course, we store only a finite number of bits for the leading 1 indicator (often 6 bits). Because it's a LogLog counter, storing 6 bits is sufficient for set cardinalities up to $O(2^{2^6}) = O(2^{64})$.
This increases our collision surface though, as we might have collisions in the lower left region near the origin (Figure 2c). We can directly compute the collision probability (and similarly the variance) by summing together the probability mass in these boxes, replacing the infinite sum with a finite sum (Lemma 3.6). For more sensitive estimations, we can subtract

Table 1. Comparison of key features against other methods.

Method                         Bits per bucket             Unions  Jaccard index  Intersection size  Streaming updates
MinHash                        log(n)                      yes     yes            yes                yes
b-bit MinHash                  log(1/δ)                    -       yes            -                  -
HyperLogLog                    log log(n)                  yes     -              -                  yes
HyperLogLog + MinHash          log(n) + log log(n)         yes     yes            yes                yes
HyperLogLog + b-bit MinHash    log log(n) + log(1/δ)       -       yes            yes                -
HyperMinHash                   log log(n) + log(1/δ)       yes     yes            yes                yes

Notes to Table 1:
n is the cardinality of the sets and δ is the Jaccard index, where applicable.
Where applicable, $\Theta(1/\epsilon^2)$ buckets are required for union cardinality estimation with relative error $\epsilon$.
All of the MinHash based methods also require $\Theta(1/\delta)$ buckets to give accurate Jaccard indexes.
Jaccard index and intersection size can be directly computed from HyperLogLog through inclusion-exclusion or MLE [5], but errors are then dependent on union cardinality estimates, so relative error will be high for small intersections and complicated predicates.

[Figure 1: diagram showing objects in a set, their hashed values, the assignment of hashed values to four partitions (00, 01, 10, 11), the minimum of each partition, and the resulting (leading-1 position, r-bit value) tuple stored per partition.]

Fig. 1. HyperMinHash generates sketches in the same fashion as one-permutation MinHash. It begins by hashing each object in the set to a uniformly random number between 0 and 1, encoded in binary. Then, the hashed values are partitioned by the first $p$ bits, and the minimum value within each partition is taken. Each value is specified by a tuple; the first part is the position of the leftmost 1 in the first $2^q$ bits, and $2^q + 1$ otherwise, so exactly identical to a HyperLogLog sketch. The second part is the value of the next $r$ bits in the bitstring. Note that this is mathematically equivalent to applying three independent hash functions for each bucket (for the green, blue, and red bits), or, alternately, to using a single hash function but dividing the bitstring into fixed-length regions first.

the expected number of collisions to debias the estimation. In the next section we will prove bounds on the expectation and variance in the number of collisions.

3 PROOFS

The main result of this paper bounds the expectation and variance of accidental collisions, given two HyperMinHash sketches of disjoint sets. First, we rigorously define the full HyperMinHash sketch.

Definition 3.1. We will define $f_{p,q,r}(A) : S \to \{\{1, \dots, 2^q\} \times \{0, 1\}^r\}^{2^p}$ to be the HyperMinHash sketch constructed from Figure 1, where $A$ is a set of hashable objects and $p, q, r \in \mathbb{N}$, and let $f_{p,q,r}(A)_i : S \to \{1, \dots, 2^q\} \times \{0, 1\}^r$ be the value of the $i$th bucket in the sketch. More precisely, let $h(x) : S \to [0, 1]$ be a uniformly random hash function. Let $\rho_q(x) = \min(\lfloor -\log_2(x) \rfloor + 1, 2^q)$, $\sigma_r(x) = \lfloor x 2^r \rfloor$, and $\hat{h}_{q,r}(x) = (\rho_q(x), \sigma_r(x 2^{\rho_q(x)} - 1))$. Then, we will define
\[ f_{p,q,r}(A)_i = \hat{h}_{q,r}\left( \min_{a \in A,\; i/2^p < h(a) < (i+1)/2^p} \left( h(a) 2^p - i \right) \right). \]

Definition 3.2. Let $A, B$ be hashable sets with $|A| = n$, $|B| = m$, $n > m$, and $A \cap B = \emptyset$. Then define an indicator variable for collisions in bucket $i$ of their respective HyperMinHash sketches
\[ Z_{p,q,r}(A, B, i) = \mathbb{1}_{(f_{p,q,r}(A)_i = f_{p,q,r}(B)_i)}. \]

Our main theorems follow:

Theorem 3.3. $C = \sum_{i=0}^{2^p - 1} Z_{p,q,r}(A, B, i)$ is the number of collisions between the HyperMinHash sketches of two disjoint sets $A$ and $B$. Then the expectation
\[ \mathbb{E}C \le 2^p \left( \frac{5}{2^r} + \frac{n}{2^{p + 2^q + r}} \right). \]

Theorem 3.4. Given the same setup as in Theorem 3.3, $\mathrm{Var}(C) \le \mathbb{E}[C]^2 + \mathbb{E}[C]$.

Theorem 3.3 allows us to correct for the number of random collisions before computing the Jaccard index, and Theorem 3.4 tells us that the standard deviation in the number of collisions is approximately the expectation. Before proving these theorems, we will start by proving a simpler proposition.

Proposition 3.5. Consider a HyperMinHash sketch with only 1 bucket on two disjoint sets $A$ and $B$, i.e. $f_{0,q,r}(A)$ and $f_{0,q,r}(B)$. Let $\gamma(n, m) = Z_{0,q,r}(A, B, 0)$. Naturally, as a good hash function results in uniform random variables, $\gamma$ is only dependent on the cardinalities $n$ and $m$.
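We claim that
\[ \mathbb{E}\gamma(n, m) \le \frac{6}{2^r} + \frac{n}{2^{2^q + r}}. \]
Proving this will require a few technical lemmas, which we'll then use to prove the main theorems.

Lemma 3.6.
\[ \mathbb{E}\gamma(n, m) = P\left(f_{0,q,r}(A)_0 = f_{0,q,r}(B)_0\right) \]
\[ = \sum_{i=1}^{2^q - 1} \sum_{j=0}^{2^r - 1} \left[ \left(1 - \frac{2^r + j}{2^{r+i}}\right)^n - \left(1 - \frac{2^r + j + 1}{2^{r+i}}\right)^n \right] \left[ \left(1 - \frac{2^r + j}{2^{r+i}}\right)^m - \left(1 - \frac{2^r + j + 1}{2^{r+i}}\right)^m \right] \]
\[ + \sum_{i=2^q} \sum_{j=0}^{2^r - 1} \left[ \left(1 - \frac{j}{2^{r+i}}\right)^n - \left(1 - \frac{j + 1}{2^{r+i}}\right)^n \right] \left[ \left(1 - \frac{j}{2^{r+i}}\right)^m - \left(1 - \frac{j + 1}{2^{r+i}}\right)^m \right]. \]

Proof. Let $a_1, \dots, a_n$ be random variables corresponding to the hashed values of items in $A$. Then the $a_i \in [0, 1]$ are uniform r.v. Similarly, $b_1, \dots, b_m$, drawn from hashed values of $B$, are uniform $[0, 1]$ r.v. Let $x = \min\{a_1, \dots, a_n\}$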

and $y = \min\{b_1, \dots, b_m\}$. Then we have probability density functions
\[ \mathrm{pdf}(x) = n(1 - x)^{n-1} \text{ for } x \in [0, 1], \qquad \mathrm{pdf}(y) = m(1 - y)^{m-1} \text{ for } y \in [0, 1], \]
and cumulative distribution functions
\[ \mathrm{cdf}(x) = 1 - (1 - x)^n \text{ for } x \in [0, 1], \qquad \mathrm{cdf}(y) = 1 - (1 - y)^m \text{ for } y \in [0, 1]. \]
We are particularly interested in the joint probability density function
\[ \mathrm{pdf}(x, y) = n(1 - x)^{n-1} m(1 - y)^{m-1} \text{ for } (x, y) \in [0, 1]^2. \]
The probability mass enclosed in a square along the diagonal $S = [s_1, s_2]^2 \subset [0, 1]^2$ is then precisely
\[ \mu(S) = \int_{s_1}^{s_2} \int_{s_1}^{s_2} n(1 - x)^{n-1} m(1 - y)^{m-1} \, dy \, dx = \left[ (1 - s_1)^n - (1 - s_2)^n \right] \left[ (1 - s_1)^m - (1 - s_2)^m \right]. \tag{3} \]
Recall $f_{0,q,r}(A)_0 \in \{1, \dots, 2^q\} \times \{0, 1\}^r \cong \{1, \dots, 2^q\} \times \{0, \dots, 2^r - 1\}$, so given $f_{0,q,r}(A)_0 = (i, j)$, the binary expansion of $x$ consists of $i - 1$ zeros, then a 1, then the $r$ bits of $j$, unless $i = 2^q$, in which case the binary expansion of $x$ consists of $i$ zeros followed by the $r$ bits of $j$. That in turn gives $s_1 < x < s_2$, where $s_1 = \frac{2^r + j}{2^{r+i}}$, $s_2 = \frac{2^r + j + 1}{2^{r+i}}$ when $i < 2^q$, and $s_1 = \frac{j}{2^{r+i}}$, $s_2 = \frac{j + 1}{2^{r+i}}$ otherwise. Collisions happen precisely when $s_1 < x, y < s_2$. Finally, using the $s_1, s_2$ formulas above, it suffices to sum the probability of collision over the image of $f$, so
\[ \mathbb{E}\gamma(n, m) = \sum_{i=1}^{2^q} \sum_{j=0}^{2^r - 1} \mu([s_1, s_2]^2). \]
Substituting in for $s_1$, $s_2$, and $\mu$ completes the proof. Note also that this is precisely the sum of the probability mass in the red and purple squares along the diagonal in Figure 2c.

While Lemma 3.6 allows us to explicitly compute $\mathbb{E}\gamma(n, m)$, the lack of a closed form solution makes reasoning about it difficult. Here, we will upper bound the expectation by integrating over four regions of the unit square that cover all the collision boxes (Figure 2d). For ease of notation, let $\hat{r} = 2^r$ and $\hat{q} = 2^{2^q}$.

- The Top Right box $TR = \left[\frac{\hat{r}}{\hat{r} + 1}, 1\right]^2$ (in orange in Figure 2d).
- The magenta triangle from the origin bounded by the lines $y = \frac{\hat{r}}{\hat{r} + 1} x$ and $y = \frac{\hat{r} + 1}{\hat{r}} x$ with $0 < x < \frac{\hat{r}}{\hat{r} + 1}$, which we will denote $RAY$.
- The black strip near the origin covering all the purple boxes except the one on the origin, bounded by the lines $y = x - \frac{1}{\hat{r}\hat{q}}$, $y = x + \frac{1}{\hat{r}\hat{q}}$, and $\frac{1}{\hat{r}\hat{q}} < x < \frac{1}{\hat{q}}$, which we will denote $STRIP$.
- The Bottom Left purple box $BL = \left[0, \frac{1}{\hat{r}\hat{q}}\right]^2$.

Lemma 3.7.
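The probability mass contained in the top right square is $\mu(TR) \le \frac{1}{2^r}$.

Proof. By Equation 3,
\[ \mu(TR) = \int_{\frac{\hat{r}}{\hat{r}+1}}^{1} \int_{\frac{\hat{r}}{\hat{r}+1}}^{1} n(1 - x)^{n-1} m(1 - y)^{m-1} \, dy \, dx = \left[ -(1 - x)^n \right]_{\frac{\hat{r}}{\hat{r}+1}}^{1} \left[ -(1 - y)^m \right]_{\frac{\hat{r}}{\hat{r}+1}}^{1} = \frac{1}{(\hat{r} + 1)^{n+m}} \le \frac{1}{\hat{r}}. \]

Lemma 3.8. The probability mass contained in the bottom left square near the origin is $\mu(BL) \le \frac{n}{2^{2^q + r}}$.

Proof.
\[ \mu(BL) = \int_{0}^{\frac{1}{\hat{r}\hat{q}}} \int_{0}^{\frac{1}{\hat{r}\hat{q}}} n(1 - x)^{n-1} m(1 - y)^{m-1} \, dy \, dx = \left[ 1 - \left(1 - \frac{1}{\hat{r}\hat{q}}\right)^n \right] \left[ 1 - \left(1 - \frac{1}{\hat{r}\hat{q}}\right)^m \right]. \]
For $n, m < \hat{r}\hat{q}$, we note that the linear binomial approximation is actually a strict upper bound (as can be trivially verified through the Taylor expansion), so
\[ \mu(BL) \le \frac{nm}{(\hat{r}\hat{q})^2} \le \frac{n}{\hat{r}\hat{q}}. \]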
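The exact sum of Lemma 3.6 and the bound claimed in Proposition 3.5 can be compared numerically for small parameters. The sketch below is an illustration under stated assumptions, not the authors' code: the function names are made up, the summation limits follow one reading of Lemma 3.6 (levels $i = 1, \dots, 2^q - 1$ plus a final row of $2^r$ boxes of width $2^{-(r + 2^q)}$ at the precision limit), and plain floating point is used, which is only adequate for small $q$ and $r$.

```python
def box_mass(s1: float, s2: float, n: int, m: int) -> float:
    """Equation 3: mass of the square [s1, s2]^2 under the joint pdf of the two minima."""
    return ((1 - s1) ** n - (1 - s2) ** n) * ((1 - s1) ** m - (1 - s2) ** m)

def expected_collisions(n: int, m: int, q: int, r: int) -> float:
    """Single-bucket collision probability, summed box by box as in Lemma 3.6."""
    R, Q = 2 ** r, 2 ** q
    total = 0.0
    for i in range(1, Q):                      # levels with an explicit leading 1
        scale = 2 ** (r + i)
        for j in range(R):
            total += box_mass((R + j) / scale, (R + j + 1) / scale, n, m)
    scale = 2 ** (r + Q)                       # final level at the precision limit
    for j in range(R):
        total += box_mass(j / scale, (j + 1) / scale, n, m)
    return total

def proposition_bound(n: int, q: int, r: int) -> float:
    """Right-hand side of Proposition 3.5: 6/2^r + n/2^(2^q + r)."""
    return 6 / 2 ** r + n / 2 ** (2 ** q + r)

if __name__ == "__main__":
    n, m, q, r = 1000, 500, 3, 4
    print("exact sum:", expected_collisions(n, m, q, r))
    print("bound:    ", proposition_bound(n, q, r))
```

For realistic parameters such as large $r$ and very large $n$, the differences of powers lose precision in floating point, which is consistent with the appendix's remark that exact evaluation is slow and needs arbitrary-precision arithmetic.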

[Figure 2: four panels, each plotting the pdf of $\min(X_1, \dots, X_n)$ on the horizontal axis against the pdf of $\min(Y_1, \dots, Y_m)$ on the vertical axis, with the origin at (0,0) and the leftmost 1 indicator buckets marked along each axis.]

(a) HyperLogLog sections, used alone, result in collisions whenever the minhashes match in order of magnitude.
(b) HyperMinHash further subdivides HyperLogLog leading 1-indicator buckets into $2^r$ subbuckets, achieving a much smaller collision space, so long as we precisely store the position of the leading 1.
(c) In practice, HyperMinHash has a limited number of bits for the loglog counters, so there's a final lower left bucket at the precision limit.
(d) We'll upper bound the collision probability of HyperMinHash by dividing it into these four regions of integration: (a) the Top Right orange box, (b) the magenta ray covering intermediate boxes, (c) the black strip covering all but the final purple box, and (d) the final purple subbucket by the origin.

Fig. 2. Visualization of collision probabilities for HyperMinHash.

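Lemma 3.9. The probability mass of the ray from the origin can be bounded by $\mu(RAY) \le \frac{3}{2^r}$.

Proof. Unfortunately, the ray is not aligned to the axes, so we cannot integrate $x$ and $y$ separately.
\[ \mu(RAY) = \int_{0}^{\frac{\hat{r}}{\hat{r}+1}} \int_{\frac{\hat{r}}{\hat{r}+1} x}^{\frac{\hat{r}+1}{\hat{r}} x} n(1 - x)^{n-1} m(1 - y)^{m-1} \, dy \, dx = \int_{0}^{\frac{\hat{r}}{\hat{r}+1}} n(1 - x)^{n-1} \left[ \left(1 - \frac{\hat{r}}{\hat{r}+1} x\right)^m - \left(1 - \frac{\hat{r}+1}{\hat{r}} x\right)^m \right] dx. \]
Using the elementary difference of powers formula, note that for $0 \le \alpha \le \beta$,
\[ \beta^m - \alpha^m = (\beta - \alpha) \sum_{i=1}^{m} \alpha^{i-1} \beta^{m-i} \le (\beta - \alpha) m \beta^{m-1}. \]
With a bit of symbolic manipulation, we can conclude that
\[ \mu(RAY) \le \int_{0}^{\frac{\hat{r}}{\hat{r}+1}} n(1 - x)^{n-1} \left[ \frac{2\hat{r} + 1}{\hat{r}(\hat{r} + 1)} x m \left(1 - \frac{\hat{r}}{\hat{r}+1} x\right)^{m-1} \right] dx \le \frac{2\hat{r} + 1}{\hat{r}(\hat{r} + 1)} \int_{0}^{\frac{\hat{r}}{\hat{r}+1}} nm \left(1 - \frac{\hat{r}}{\hat{r}+1} x\right)^{n+m-2} x \, dx. \]
With a straightforward integration by parts, and then upper bounding negative terms by 0,
\[ \mu(RAY) \le \frac{2\hat{r} + 1}{\hat{r}(\hat{r} + 1)} \cdot \frac{(\hat{r} + 1)^2}{\hat{r}^2} \cdot \frac{nm}{(n + m)(n + m - 1)} = \frac{(2\hat{r} + 1)(\hat{r} + 1)}{\hat{r}^3} \cdot \frac{nm}{(n + m)(n + m - 1)} \le \frac{3}{\hat{r}}. \]

Lemma 3.10. The probability mass of the diagonal strip near the origin is $\mu(STRIP) \le \frac{2}{2^r}$.

Proof. Using the same integration procedure and difference of powers formula used in the proof of Lemma 3.9,
\[ \mu(STRIP) = \int_{\frac{1}{\hat{r}\hat{q}}}^{\frac{1}{\hat{q}}} \int_{x - \frac{1}{\hat{r}\hat{q}}}^{x + \frac{1}{\hat{r}\hat{q}}} n(1 - x)^{n-1} m(1 - y)^{m-1} \, dy \, dx \le \frac{2}{\hat{r}}. \]

Proof of Proposition 3.5. Summing bounds from Lemmas 3.7, 3.8, 3.9, and 3.10,
\[ \mathbb{E}\gamma(n, m) \le \frac{6}{\hat{r}} + \frac{n}{\hat{r}\hat{q}} = \frac{6}{2^r} + \frac{n}{2^{2^q + r}}. \]

Proof of Theorem 3.3. Let $A_i$, $B_i$ be the $i$th partitions of $A$ and $B$ respectively. For ease of notation, let's define $\hat{p} = 2^p$. Recall that $C = \sum_{j=0}^{\hat{p} - 1} Z_{p,q,r}(A, B, j)$. We will first bound $\mathbb{E}Z_{p,q,r}(A, B, j)$ using the same techniques used in Proposition 3.5. Notice first that $Z_{p,q,r}(A, B, j)$ effectively rescales the minimum hash values from $Z_{0,q,r}(A, B, j) = \gamma(n, m)$ down by a factor of $2^p$; i.e. we scale down both the axes in Figure 2d by substituting $\hat{q} \leftarrow \hat{q}\hat{p}$ in Lemmas 3.10 and 3.8. We do not need Lemma 3.7 because its box is already covered by the Magenta Ray from Lemma 3.9, which we do not scale. Summing these together, we readily conclude
\[ \mathbb{E}Z_{p,q,r}(A, B, j) \le \frac{5}{\hat{r}} + \frac{n}{\hat{r}\hat{q}\hat{p}} = \frac{5}{2^r} + \frac{n}{2^{p + 2^q + r}}. \]
Then by linearity of expectation,
\[ \mathbb{E}C \le 2^p \left[ \frac{5}{2^r} + \frac{n}{2^{p + 2^q + r}} \right]. \]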
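To get a feel for the size of the bound in Theorem 3.3, it can be evaluated directly. The function name `collision_bound` is made up for this illustration, and the parameter values $p = 15$, $q = 6$, $r = 10$ with $n = 10^{19}$ are taken to be the practical setting mentioned in the discussion section (a 64KiB sketch of 2-byte buckets); treat them as assumptions of the example.

```python
def collision_bound(n: int, p: int, q: int, r: int) -> float:
    """Theorem 3.3 upper bound on the expected number of colliding buckets
    between sketches of two disjoint sets, where n is the larger cardinality."""
    return 2 ** p * (5 / 2 ** r + n / 2 ** (p + 2 ** q + r))

# Assumed practical setting: 2**15 buckets, 6-bit LogLog counter, 10 extra bits.
bound = collision_bound(10 ** 19, p=15, q=6, r=10)
fraction = bound / 2 ** 15   # bound expressed as a fraction of all buckets

print(bound, fraction)
```

In this setting the $5/2^r$ term dominates and the bound stays near 160 of the 32768 buckets, i.e. under half a percent, until $n$ approaches $2^{p + 2^q + r}$.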

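Proof of Theorem 3.4. By conditioning on the multinomial distribution, we can decompose $C$ into
\[ C = \sum_{\substack{\alpha_1 + \dots + \alpha_{2^p} = n \\ \beta_1 + \dots + \beta_{2^p} = m}} \mathbb{1}_{(\forall i, |A_i| = \alpha_i)} \mathbb{1}_{(\forall i, |B_i| = \beta_i)} \sum_{j=0}^{2^p - 1} Z_{p,q,r}\left(A, B, j \mid \forall i, |A_i| = \alpha_i, |B_i| = \beta_i\right). \]
For ease of notation in the following, we will use $\alpha$, $\beta$ to denote the events $\forall i, |A_i| = \alpha_i$ and $\forall i, |B_i| = \beta_i$ respectively. Additionally, let $Z(j) = Z_{p,q,r}(A, B, j)$. So,
\[ C = \sum_{\alpha, \beta} \mathbb{1}_{\alpha, \beta} \sum_{j=0}^{2^p - 1} Z(j \mid \alpha, \beta). \]
Then
\[ \mathrm{Var}(C) = \sum_{\alpha, \beta} \sum_{\alpha', \beta'} \sum_{j, j' = 0}^{2^p - 1} \mathrm{Cov}\left( \mathbb{1}_{\alpha, \beta} Z(j \mid \alpha, \beta),\; \mathbb{1}_{\alpha', \beta'} Z(j' \mid \alpha', \beta') \right). \]
But note that for $(\alpha, \beta) \neq (\alpha', \beta')$, $\mathbb{1}_{\alpha, \beta} = 1 \implies \mathbb{1}_{\alpha', \beta'} = 0$ and vice versa, because they are disjoint indicator variables. As such, for $(\alpha, \beta) \neq (\alpha', \beta')$,
\[ \mathrm{Cov}\left( \mathbb{1}_{\alpha, \beta} Z(j \mid \alpha, \beta),\; \mathbb{1}_{\alpha', \beta'} Z(j' \mid \alpha', \beta') \right) \le 0, \]
implying that
\[ \mathrm{Var}(C) \le \sum_{\alpha, \beta} \sum_{j, j' = 0}^{2^p - 1} \mathrm{Cov}\left( \mathbb{1}_{\alpha, \beta} Z(j \mid \alpha, \beta),\; \mathbb{1}_{\alpha, \beta} Z(j' \mid \alpha, \beta) \right) \]
\[ = \sum_{\alpha, \beta} \sum_{j=0}^{2^p - 1} \mathrm{Var}\left( \mathbb{1}_{\alpha, \beta} Z(j \mid \alpha, \beta) \right) + \sum_{\alpha, \beta} \sum_{\substack{j \neq j' \\ 0 \le j, j' < 2^p}} \mathrm{Cov}\left( \mathbb{1}_{\alpha, \beta} Z(j \mid \alpha, \beta),\; \mathbb{1}_{\alpha, \beta} Z(j' \mid \alpha, \beta) \right). \]
Note that the first term can be simplified, recalling that $Z$ is a {0,1} Bernoulli r.v., so
\[ \sum_{\alpha, \beta} \sum_{j=0}^{2^p - 1} \mathrm{Var}\left( \mathbb{1}_{\alpha, \beta} Z(j \mid \alpha, \beta) \right) = \sum_{j=0}^{2^p - 1} \mathrm{Var}(Z(j)) \le \sum_{j=0}^{2^p - 1} \mathbb{E}[Z(j)] = \mathbb{E}C. \]
Moving on, from the covariance formula, for independent random variables $X_1, X_2, Y$,
\[ \mathrm{Cov}(X_1 Y, X_2 Y) = \mathbb{E}[X_1 X_2 Y^2] - \mathbb{E}[X_1 Y]\mathbb{E}[X_2 Y] = \mathbb{E}[X_1]\mathbb{E}[X_2]\left( \mathbb{E}[Y^2] - \mathbb{E}[Y]^2 \right) = \mathbb{E}[X_1]\mathbb{E}[X_2]\mathrm{Var}(Y). \]
Thus the second term of the summation can be bounded as follows:
\[ \sum_{\alpha, \beta} \sum_{\substack{j \neq j' \\ 0 \le j, j' < 2^p}} \mathrm{Cov}\left( \mathbb{1}_{\alpha, \beta} Z(j \mid \alpha, \beta),\; \mathbb{1}_{\alpha, \beta} Z(j' \mid \alpha, \beta) \right) = \sum_{\alpha, \beta} \sum_{\substack{j \neq j' \\ 0 \le j, j' < 2^p}} \mathbb{E}[Z(j \mid \alpha, \beta)] \, \mathbb{E}[Z(j' \mid \alpha, \beta)] \, \mathrm{Var}(\mathbb{1}_{\alpha, \beta}) \]
\[ \le \sum_{\alpha, \beta} \sum_{j=0}^{2^p - 1} \sum_{j'=0}^{2^p - 1} \mathbb{E}[Z(j \mid \alpha, \beta)] \, \mathbb{E}[Z(j' \mid \alpha, \beta)] \, \mathrm{Var}(\mathbb{1}_{\alpha, \beta}) = \sum_{\alpha, \beta} \mathbb{E}[C \mid \alpha, \beta]^2 \, \mathrm{Var}(\mathbb{1}_{\alpha, \beta}) \]
\[ = \sum_{\alpha, \beta} \mathbb{E}[C \mid \alpha, \beta]^2 \, P(\alpha, \beta)\left(1 - P(\alpha, \beta)\right) \le \mathbb{E}[C]^2. \]
We conclude that $\mathrm{Var}(C) \le \mathbb{E}[C]^2 + \mathbb{E}[C]$.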

Fig. 3. For a fixed size sketch, HyperMinHash has better accuracy and/or cardinality range than MinHash. We compare Jaccard index estimation for identically sized sets with Jaccard index of 1/3 (i.e. 50% overlap), so the maximum relative error is 2, and plot the mean relative errors without estimated collision correction. (green circle) A 256 byte HyperMinHash sketch, with 256 buckets of 8 bits each, 4 bits of which are allocated to the LogLog counter; Jaccard index estimation remains stable until cardinalities around $2^{32}$. (orange diamond) A 256 byte MinHash sketch with 256 buckets of 8 bits each achieves similar accuracy at low cardinalities, but fails once cardinalities approach $2^{14}$. (blue triangle) A 256 byte MinHash sketch with 128 buckets of 16 bits can access larger cardinalities of around $2^{20}$, but to do so trades off on low-cardinality accuracy.

4 EXPERIMENTAL VALIDATION

For completeness, we give experimental validation for the behavior of Jaccard index estimation on raw (Hyper)MinHash sketches with no expected-collision correction. In Figure 3, we allocate 256 bytes each for two standard MinHash sketches and a HyperMinHash sketch. For fixed sketch size and cardinality range, HyperMinHash is more accurate; or, for fixed sketch size and bucket number, HyperMinHash can access exponentially larger set cardinalities. Pseudocode and a Python implementation are available (details in Appendix).

5 DISCUSSION AND CONCLUSION

We have introduced HyperMinHash, a sketch for estimating the Jaccard index using log log space, and made available a prototype Python implementation. It can be thought of as a compression scheme for MinHash that reduces the number of bits per bucket to $\log\log(n)$ from $\log(n)$ by using insights from HyperLogLog and b-bit MinHash. As with the original MinHash, it retains variance on the order of $k/\delta$, where $k$ is the number of buckets and $\delta$ is the Jaccard index between two sets.
However, it also introduces $1/l^2$ variance, where $l = 2^r$, because of the increased number of collisions, which matches the requirements of b-bit MinHash. For practical parameters of $p = 15$, $q = 6$, $r = 10$, the HyperMinHash sketch will use up 64KiB memory per set, and allow for estimating Jaccard indices of 0.01 for set cardinalities on the order of $10^{19}$ with accuracy around 5%. HyperMinHash is to our knowledge the first streaming summary sketch capable of directly estimating union

cardinality, Jaccard index, and intersection cardinality in log log space, able to be applied to arbitrary Boolean formulas in conjunctive normal form with error rates bounded by the final result size.

ACKNOWLEDGMENTS

This study was supported by National Institutes of Health (NIH) Big Data to Knowledge (BD2K) awards U54HG from the National Human Genome Research Institute (NHGRI) and U0CA98934 from the National Cancer Institute (NCI). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. We additionally thank Daphne Ippolito and Adam Sealfon for useful comments and advice.

REFERENCES

[1] Ziv Bar-Yossef, TS Jayram, Ravi Kumar, D Sivakumar, and Luca Trevisan. 2002. Counting distinct elements in a data stream. In International Workshop on Randomization and Approximation Techniques in Computer Science. Springer.
[2] Andrei Z Broder. 1997. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings. IEEE.
[3] Edith Cohen. 2016. Min-Hash Sketches. (2016).
[4] Marianne Durand and Philippe Flajolet. 2003. Loglog counting of large cardinalities. In European Symposium on Algorithms. Springer.
[5] Otmar Ertl. 2017. New cardinality estimation algorithms for HyperLogLog sketches. arXiv preprint (2017).
[6] Otmar Ertl. 2017. New Cardinality Estimation Methods for HyperLogLog Sketches. arXiv preprint (2017).
[7] Philippe Flajolet. 2004. Counting by coin tossings. In Advances in Computer Science-ASIAN 2004. Higher-Level Decision Making. Springer.
[8] Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. 2007. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. In AofA: Analysis of Algorithms. Discrete Mathematics and Theoretical Computer Science.
[9] Paul Jaccard. 1902. Lois de distribution florale dans la zone alpine. Bull Soc Vaudoise Sci Nat 38 (1902).
[10] Ping Li and Christian König. 2010. b-bit minwise hashing. In Proceedings of the 19th international conference on World wide web. ACM.
[11] Ping Li, Art Owen, and Cun-Hui Zhang. 2012. One permutation hashing. In Advances in Neural Information Processing Systems.
[12] Rasmus Pagh, Morten Stöckel, and David P Woodruff. 2014. Is min-wise hashing optimal for summarizing set intersection?. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM.

A APPENDIX: HYPERMINHASH IN PRACTICE

Here, we present full algorithms to match a naive implementation of HyperMinHash as described in the previous Theory section. A Python implementation is available at

Algorithm 1 HyperMinHash sketch.

Let $h_1, h_2, h_3 : D \to [0, 1] \equiv \{0, 1\}^\infty$ be three independent hash functions hashing data from domain $D$ to the binary domain. (In practice, we generally use a single hash function, e.g. SHA-1, and use different sets of bits for each of the three hashes.)
Let $\rho(s)$, for $s \in \{0, 1\}^\infty$, be the position of the left-most 1-bit ($\rho(0001\ldots) = 4$).
Let $\sigma(s, n)$, for $s \in \{0, 1\}^\infty$, be the left-most $n$ bits of $s$ ($\sigma(01011\ldots, 5) = 01011$).

function HyperMinHash(A, p, q, r)
    Let $\hat{h}_1(x) = \sigma(h_1(x), p)$.
    Let $\hat{h}_2(x) = \min(\rho(h_2(x)), 2^q)$.
    Let $\hat{h}_3(x) = \sigma(h_3(x), r)$.
    Initialize $2^p$ tuples $B_1 = B_2 = \cdots = B_{2^p} = (0, 0)$.
    for $a \in A$ do
        if $B_{\hat{h}_1(a)}[0] < \hat{h}_2(a)$ then
            $B_{\hat{h}_1(a)} \leftarrow (\hat{h}_2(a), \hat{h}_3(a))$
        else
            if $B_{\hat{h}_1(a)}[0] = \hat{h}_2(a)$ and $B_{\hat{h}_1(a)}[1] > \hat{h}_3(a)$ then
                $B_{\hat{h}_1(a)} \leftarrow (\hat{h}_2(a), \hat{h}_3(a))$
            end if
        end if
    end for
    return $B_1, \ldots, B_{2^p}$ as $B$
end function

A.1 Implementation optimizations

We recommend several optimizations for practical implementations of HyperMinHash. First, it is mathematically equivalent to:

1) Pack the hashed tuple into a single word; this enables Jaccard index computation while using only one comparison per bucket instead of two.
2) Use the max instead of min of the subbuckets. This allows us to take the union of two sketches while using only one comparison per bucket.

These recommendations should be self-explanatory, and are simply minor engineering optimizations, which we do not use in our prototyping, as they do not affect accuracy.

However, while we can exactly compute the number of expected collisions through Lemma 3.6, this computation is slow and often results in floating point errors unless BigInts are used, because Algorithm 5 is exponential in $r$. In practice, two ready solutions present themselves:

1) We can ignore the bias and simply add it to the error. As the bias and standard deviation of the error are the same order of magnitude, this only doubles the absolute error in the estimation of Jaccard index. For large Jaccard indices, this does not matter.
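As a rough illustration of Algorithm 1, the sketch construction can be written in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `hyperminhash_sketch`, the default parameters, and the way a single SHA-1 digest is split into p bits, then 2^q bits, then r bits are all choices made here for illustration.

```python
import hashlib

HASH_BITS = 160  # SHA-1 digest length in bits


def hyperminhash_sketch(items, p=4, q=4, r=8):
    """Return 2**p buckets of (exponent, mantissa) tuples for an iterable of bytes.

    Assumption: one SHA-1 digest supplies all three hashes, read left to
    right as p bucket bits, 2**q exponent bits, and r mantissa bits.
    """
    window = 2 ** q  # number of bits inspected for the left-most 1-bit
    if p + window + r > HASH_BITS:
        raise ValueError("p + 2**q + r must fit in one SHA-1 digest")
    buckets = [(0, 0)] * (2 ** p)
    for item in items:
        h = int.from_bytes(hashlib.sha1(item).digest(), "big")
        # h1: the left-most p bits select the bucket.
        index = h >> (HASH_BITS - p)
        # h2: position of the left-most 1-bit in the next 2**q bits, capped at 2**q.
        bits = (h >> (HASH_BITS - p - window)) & ((1 << window) - 1)
        exponent = min(window - bits.bit_length() + 1, window)
        # h3: the following r bits form the mantissa.
        mantissa = (h >> (HASH_BITS - p - window - r)) & ((1 << r) - 1)
        current = buckets[index]
        # Keep the larger exponent; break ties with the smaller mantissa.
        if current[0] < exponent or (current[0] == exponent and current[1] > mantissa):
            buckets[index] = (exponent, mantissa)
    return buckets
```

Because each bucket keeps the largest exponent and, among ties, the smallest mantissa, the result does not depend on the order in which items arrive, and duplicate items leave the sketch unchanged.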

Algorithm 2 HyperMinHash Union

function Union(S, T)
    assert $|S| = |T|$
    Initialize $|S|$ tuples $B_1 = B_2 = \cdots = B_{|S|} = (0, 0)$.
    for $i \in \{1, \ldots, |S|\}$ do
        if $S_i[0] > T_i[0]$ then
            $B_i \leftarrow S_i$
        else if $S_i[0] < T_i[0]$ then
            $B_i \leftarrow T_i$
        else if $S_i[0] = T_i[0]$ then
            if $S_i[1] < T_i[1]$ then
                $B_i \leftarrow S_i$
            else
                $B_i \leftarrow T_i$
            end if
        end if
    end for
    return $B_1, \ldots, B_{|S|}$ as $B$
end function

Algorithm 3 Estimate Cardinality. Note that the left parts of the buckets can be passed directly into a HyperLogLog estimator. We can also use other k-minimum value count-distinct cardinality estimators, which we empirically found useful for large cardinalities.

function EstimateCardinality(S, p, q, r)
    Initialize $|S|$ integer registers $b_1 = b_2 = \cdots = b_{|S|} = 0$.
    for $i \in \{1, \ldots, |S|\}$ do
        $b_i \leftarrow S_i[0]$
    end for
    $R \leftarrow$ HyperLogLogCardinalityEstimator($\{b_i\}$, q)
    if $R < 1024\,|S|$ then
        return $R$
    else
        Initialize $|S|$ real registers $r_1, \ldots, r_{|S|}$.
        for $i \in \{1, \ldots, |S|\}$ do
            $r_i \leftarrow 2^{-S_i[0]} \left(1 + S_i[1] / 2^r\right)$
        end for
        if $\sum r_i = 0$ then
            return $\infty$
        else
            return $|S|^2 / \sum r_i$
        end if
    end if
end function
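The union rule of Algorithm 2 and the k-minimum-value branch of Algorithm 3 can be sketched in Python as follows. This is an illustrative sketch only: the names `union` and `kmv_cardinality` are hypothetical, sketches are assumed to be lists of (exponent, mantissa) tuples, and the HyperLogLog branch of Algorithm 3 is deliberately left out because the text does not specify which estimator is used.

```python
import math


def union(s, t):
    """Bucket-wise union: larger exponent wins, ties go to the smaller mantissa."""
    assert len(s) == len(t)
    merged = []
    for a, b in zip(s, t):
        if a[0] > b[0]:
            merged.append(a)
        elif a[0] < b[0]:
            merged.append(b)
        else:
            merged.append(a if a[1] < b[1] else b)
    return merged


def kmv_cardinality(s, r):
    """k-minimum-value style estimate from the second branch of Algorithm 3.

    Each bucket (e, m) is read as the hash value 2**-e * (1 + m / 2**r);
    the estimate is |S|**2 divided by the sum of those values.
    """
    values = [2.0 ** -e * (1 + m / 2 ** r) for e, m in s]
    total = sum(values)
    if total == 0:
        return math.inf
    return len(s) ** 2 / total
```

For example, two buckets that both hold (1, 0) represent the value 0.5 each, so the estimate is 2^2 / 1.0 = 4.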

Algorithm 4 Compute Jaccard Index. Note that the correction factor EC is generally not needed, except for very small Jaccard indices. Additionally, for most practical purposes, it is safe to substitute ApproximateExpectedCollisions for ExpectedCollisions.

function JaccardIndex(S, T, p, q, r)
    assert $|S| = |T|$
    $C \leftarrow 0$, $N \leftarrow 0$
    for $i \in \{1, \ldots, |S|\}$ do
        if $S_i = T_i$ then
            $C \leftarrow C + 1$
        end if
        if $S_i \neq (0, 0)$ and $T_i \neq (0, 0)$ then
            $N \leftarrow N + 1$
        end if
    end for
    $n \leftarrow$ EstimateCardinality(S, p, q, r)
    $m \leftarrow$ EstimateCardinality(T, p, q, r)
    $EC \leftarrow$ [Approximate]ExpectedCollisions(n, m, p, q, r)
    return $(C - EC)/N$
end function

Algorithm 5 Expected collisions. Note that because of floating point error, BigInts must be used for large n and m.

function ExpectedCollisions(n, m, p, q, r)
    $x \leftarrow 0$
    for $i \in \{1, \ldots, 2^q\}$ do
        for $j \in \{1, \ldots, 2^r\}$ do
            if $i \neq 2^q$ then
                $b_1 \leftarrow \frac{2^r + j}{2^{p+r+i}}$, $b_2 \leftarrow \frac{2^r + j + 1}{2^{p+r+i}}$
            else
                $b_1 \leftarrow \frac{j}{2^{p+r+i-1}}$, $b_2 \leftarrow \frac{j + 1}{2^{p+r+i-1}}$
            end if
            $Pr_x \leftarrow (1 - b_2)^n - (1 - b_1)^n$
            $Pr_y \leftarrow (1 - b_2)^m - (1 - b_1)^m$
            $x \leftarrow x + Pr_x Pr_y$
        end for
    end for
    return $x \cdot 2^p$
end function

2) We also present a fast, numerically stable algorithm to approximate the expected number of collisions, which is empirically asymptotically correct (Algorithm 6):

Algorithm 6 Fast numerically stable approximation to Algorithm 5. Generally underestimates collisions.

function ApproximateExpectedCollisions(n, m, p, q, r)
    if $n < m$ then
        SWAP(n, m)
    end if
    if $n > 2^{2^q + r}$ then
        return ERROR: cardinality too large for approximation.
    else if $n > 2^{p+5}$ then
        $\phi \leftarrow \frac{4n/m}{(1 + n/m)^2}$
        return $2^{p-r} \phi$
    else
        return ExpectedCollisions(n, m, p, q, 0) $\cdot\, 2^{-r}$
    end if
end function

1) For $n < 2^{p+5}$, we approximate by taking the number of expected HyperLogLog collisions and dividing it by $2^r$. In each HyperLogLog box, we are interested in collisions along $2^r$ boxes along the diagonal. For this approximation, we simply assume that the joint probability density function is almost uniform within the box; this is not completely accurate, but pretty close in practice.

2) For $2^{p+5} < n < 2^{2^q + p}$, we noted empirically that the expected number of collisions approached $2^{p-r}$ for $n = m$ as $n \to \infty$. Furthermore, the number of collisions is dependent on $n$ and $m$ by a factor of $\frac{4nm}{(n+m)(n+m-1)}$ from 3.9, which for large $n, m$ can be approximated by $\frac{4n/m}{(1+n/m)^2}$. This approximation is primarily needed because of floating point errors when $n$ is large.

3) Unfortunately, around $n > 2^{2^q + p}$, the number of collisions starts increasing and these approximations fail. However, note that for reasonable values of $q = 6$, $p = 5$, this problem only appears when $n$ exceeds this threshold.
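The Jaccard computation of Algorithm 4, the exact collision count of Algorithm 5, and the ratio factor discussed in note 2 can be sketched in Python as follows. This is a hedged illustration, not the authors' code: all function names are hypothetical, `fractions.Fraction` stands in for the BigInt arithmetic that Algorithm 5 calls for (so `n` and `m` must be integers), and `jaccard_index` skips jointly empty buckets when counting matches, which is an assumption added here to avoid counting (0, 0) = (0, 0) as a collision.

```python
from fractions import Fraction


def expected_collisions(n, m, p, q, r):
    """Algorithm 5 in exact rational arithmetic (slow; exponential in r)."""
    total = Fraction(0)
    for i in range(1, 2 ** q + 1):
        for j in range(1, 2 ** r + 1):
            if i != 2 ** q:
                den = 2 ** (p + r + i)
                b1 = Fraction(2 ** r + j, den)
                b2 = Fraction(2 ** r + j + 1, den)
            else:
                den = 2 ** (p + r + i - 1)
                b1 = Fraction(j, den)
                b2 = Fraction(j + 1, den)
            pr_x = (1 - b2) ** n - (1 - b1) ** n
            pr_y = (1 - b2) ** m - (1 - b1) ** m
            total += pr_x * pr_y
    return float(total * 2 ** p)


def jaccard_index(s, t, expected=0.0):
    """(C - EC) / N over two equal-length lists of (exponent, mantissa) tuples."""
    assert len(s) == len(t)
    empty = (0, 0)
    # Assumption: matches between two empty buckets are not counted.
    c = sum(1 for a, b in zip(s, t) if a == b and a != empty)
    n = sum(1 for a, b in zip(s, t) if a != empty and b != empty)
    return (c - expected) / n if n else 0.0


def collision_factor_exact(n, m):
    """The factor 4nm / ((n + m)(n + m - 1)) from note 2."""
    return 4 * n * m / ((n + m) * (n + m - 1))


def collision_factor_approx(n, m):
    """Its large-n, large-m approximation (4n/m) / (1 + n/m)**2."""
    ratio = n / m
    return 4 * ratio / (1 + ratio) ** 2
```

With small parameters such as p = q = r = 2 the exact sum has only 16 terms, which makes it convenient for checking a faster approximation against it; for realistic parameters the double loop grows as 2^(q+r) and the rational numbers become very large.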
