arxiv: v2 [cs.ds] 3 Nov 2017


HyperMinHash: Jaccard index sketching in LogLog space
Extended Abstract

YUN WILLIAM YU, Harvard Medical School DBMI
GRIFFIN M. WEBER, Harvard Medical School DBMI

In this extended abstract, we describe and analyse a streaming probabilistic sketch, HYPERMINHASH, to estimate the Jaccard index (or Jaccard similarity coefficient) over two sets $A$ and $B$. HyperMinHash can be thought of as a compression of standard MinHash by building off of a HyperLogLog count-distinct sketch. Given Jaccard index $\delta$, using $k$ buckets of size $O(\log(l) + \log\log(|A \cup B|))$ (in practice, typically 2 bytes) per set, HyperMinHash streams over $A$ and $B$ and generates an estimate of the Jaccard index $\delta$ with error $O(1/l + \sqrt{k/\delta})$. This improves on the best previously known sketch, MinHash, which requires the same number of storage units (buckets), but uses $O(\log(|A \cup B|))$ bits per bucket. For instance, our new algorithm allows estimating Jaccard indices of 0.01 for set cardinalities on the order of $10^{19}$ with relative error of around 5% using 64KiB of memory; the previous state-of-the-art MinHash can only estimate Jaccard indices for cardinalities of $10^{10}$ with the same memory consumption. Alternately, one can think of HyperMinHash as an augmentation of b-bit MinHash that enables streaming updates, unions, and cardinality estimation (and thus intersection cardinality by way of Jaccard), while using $\log\log$ extra bits.

CCS Concepts: Theory of computation → Sketching and sampling;

Additional Key Words and Phrases: MinHash, sketching, streaming, loglog, Jaccard

ACM Reference Format: Yun William Yu and Griffin M. Weber. 2017. HyperMinHash: Jaccard index sketching in LogLog space: Extended Abstract. (November 2017).

Corresponding author: Yun William Yu.
Authors' addresses: Yun William Yu, Harvard Medical School DBMI, Shattuck St, Boston, Massachusetts, william_yu@hms.harvard.edu; Griffin M. Weber, Harvard Medical School DBMI, Shattuck St, Boston, Massachusetts, griffin_weber@hms.harvard.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). © 2017 Copyright held by the owner/author(s). Publication date: November 2017.

1 INTRODUCTION

Many questions in data science can be rephrased in terms of the number of items in a database that satisfy some Boolean formula. For example, "how many participants in a political survey are independent and have a favorable view of the federal government?", or "how many of the source IPs used in a DDoS attack today were also used last month?" In this paper, we consider the design of approximate streaming sketches to answer questions phrased in conjunctive normal form (an AND of ORs); this is of course equivalent to estimating the cardinality of intersections of unions of a collection of sets. The literature already has near-optimal probabilistic data structures for approximating the count-distinct problem [1, 4, 7, 8], which is equivalent to finding the cardinality of unions of sets (i.e. the ORs in CNF). Thus, we focus in particular on the problem of estimating the Jaccard index [9], a proxy for set similarity that, when coupled with union cardinality, allows estimation of intersection sizes.

1.1 Jaccard index

Given two sets $A$ and $B$, where $|A| = n$ and $|B| = m$, and $n > m$ without loss of generality, the Jaccard index is defined as

$$\delta(A, B) = \frac{|A \cap B|}{|A \cup B|}. \qquad (1)$$

Clearly, if paired with a good count-distinct union estimator for $|A \cup B|$, this allows us to estimate intersection sizes as well. Though Jaccard originally defined this index to measure ecological diversity in 1901 [9], in more modern times it has been used as a proxy for document similarity problems. In 1997, Broder introduced min-wise hashing (colloquially known as "MinHash") [2], a technique for quickly estimating the resemblance of documents by looking at the Jaccard index of "shingles" (collections of phrases) contained within the documents.

1.2 MinHash

MinHash relies on a simple fact: if you apply a random permutation to the universe of elements, the chance that the smallest items under this permutation in sets $A$ and $B$ are the same is precisely the Jaccard index.
To see this, consider a random permutation of $A \cup B$. The minimum element will come from either $A \setminus B$, $B \setminus A$, or $A \cap B$, all disjoint sets. If the minimum element lies in $A \setminus B$, then $\min(A) \notin B$, so $\min(A) \neq \min(B)$; the same is of course true by symmetry for $B \setminus A$. Conversely, if $\min(A \cup B) \in A \cap B$, then clearly $\min(A) = \min(B)$. Because the permutation is random, every element has an equal probability of being the minimum, and thus

$$P(\min(A) = \min(B)) = \frac{|A \cap B|}{|A \cup B|}. \qquad (2)$$

While using a single random permutation produces an unbiased estimator of $\delta(A, B)$, it is a Bernoulli 0/1 random variable with high variance. So, instead of using a single permutation, one averages $k$ trials. The expected fraction of matches is also an unbiased estimator of the Jaccard index, but with variance scaled by a factor of $1/k$.

Though the theoretical justification is predicated on having a true random permutation, in practice we approximate that by using good random hash functions instead. A good hash function will specify a nearly total ordering on the universe of items, and provided we use $\Theta(\log(n))$ bits for the hash function output, the probability of accidental collision of min-hashes is exponentially small.

Though theoretically easy to analyze, this scheme has a number of drawbacks, chief amongst them the requirement of having $k$ random hash functions, which means that the computational complexity is $\Theta(nk)$ to generate the sketch. To address this, several variants of MinHash have been proposed [3]:

(1) k-hash functions. The scheme described above, which has the shortcoming of using $\Theta(nk)$ computation to generate the sketch.

(2) k-minimum values. A single hash function is used, but instead of storing the single minimum value, we store the smallest $k$ values for each set (also known as the KMV sketch [1]). Sketch generation time is reduced to $O(n \log k)$, but we also incur an $O(k \log k)$ sorting penalty when computing the Jaccard index.

(3) k-partition. Another one-permutation MinHash variant, k-partition stochastically averages by first deterministically partitioning a set into $k$ parts using the first few bits of the hash value, and then stores the minimum value in each partition [11]. k-partition has the advantage of $O(n)$ sketch generation time and $O(k)$ Jaccard index computation time, at the cost of some difficulty in the analysis.

It is important to remark here that for all of the above variants, MinHash sketches of $A$ and $B$ can be losslessly combined to form the MinHash sketch of $A \cup B$. Using order statistics, it is additionally possible to estimate the union cardinalities [], so we can directly estimate intersection size in addition to the Jaccard index. This also implies that streaming updates are permitted, so preprocessing incurs no additional space requirement.

1.3 log log space complexity

All of the variants of MinHash given in the last section require logarithmic bits per bucket in order to prevent accidental collisions (i.e. we want to ensure that when two hashes match, they came from identical elements), though in the case of k-partition, some of those bits can be stored implicitly in the bucket identity. However, in the similar problem of cardinality estimation of unique items (the count-distinct problem), literature over the last several decades has produced several streaming sketches that require sub-logarithmic bits per bucket; indeed, the LogLog, SuperLogLog, and HyperLogLog family of sketches requires, as given in the name, only $O(\log\log(n))$ bits per bucket by storing only the position of the first 1 bit of a uniform hash function [4, 7, 8].
We wanted to do the same thing for the Jaccard index problem. First note that HyperLogLog union cardinalities can be used to compute intersection cardinalities using the inclusion-exclusion principle, but the relative error is then in the size of the union (as opposed to the size of the Jaccard index for MinHash) and compounds when taking the intersections of multiple sets; for small intersections, the error is often too great to be practically feasible. Notably, some newer cardinality estimation methods based on maximum-likelihood estimation are able to more directly access intersection sizes in HyperLogLog sketches, which can then be paired with union cardinality to estimate the Jaccard index [5, 6]. However, this approach, while more sophisticated, is restricted to the information available in the HyperLogLog sketch itself, and seems empirically to be a constant-order (< 3x) improvement over conventional inclusion-exclusion.

Alternately, when unions and streaming updates are not necessary, the more recent advance of b-bit MinHash [10] solves exactly the Jaccard index problem while using only a constant number of bits per bucket. b-bit MinHash operates in the same way as standard MinHash, but after computing the minimum hash value, stores only the lowest order $b$ bits. Indeed, for very large Jaccard similarity $\delta(A, B) > 0.5$, Li et al. determined that even using 1 bit per bucket was asymptotically optimal. In general, for small Jaccard similarity, b-bit MinHash needs $\Omega(\log(1/\delta))$ bits, without any dependence on the sizes of the sets. For estimating the Jaccard similarity between exactly two sets, b-bit MinHash is nearly asymptotically optimal [12].

However, b-bit MinHash, while great for the Jaccard index, loses many of the benefits of standard MinHash. Because b-bit MinHash only takes the lowest order $b$ bits of the minimum hash value after finding the minimum, it also requires $\log(n)$ bits per bucket during the sketch generation phase, the same as standard MinHash.
This also implies that sketches cannot be merged together, so union cardinalities cannot be estimated. Some of these shortcomings can be overcome by pairing a b-bit MinHash sketch with a HyperLogLog count-distinct sketch. HyperLogLog uses $\log\log(n)$ bits per bucket to estimate union cardinalities, so combined, these two allow for accurate estimation of intersection sizes, not just the Jaccard index. However, this still does not permit the usage of unions or streaming updates, so more complex predicates (e.g. $(A \cup B) \cap C$) still cannot be evaluated, and the data structure still requires $O(\log(n) + \log\log(n))$ bits per bucket during the sketch generation phase.
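The contrast drawn above between mergeable MinHash sketches and non-mergeable b-bit sketches can be illustrated with a small sketch of k-partition MinHash. This is only an illustration under stated assumptions: the helper names (`h64`, `kpartition_minhash`, `merge`, `jaccard`, `truncate`) are hypothetical, SHA-1 truncated to 64 bits stands in for a uniform hash, and the estimator simply counts matching buckets among buckets occupied in either sketch.

```python
import hashlib

def h64(item):
    """64-bit hash of an item (SHA-1 prefix, used here as a stand-in for a uniform hash)."""
    return int.from_bytes(hashlib.sha1(repr(item).encode()).digest()[:8], "big")

def kpartition_minhash(items, p):
    """k-partition MinHash with k = 2**p buckets: keep the minimum hash per partition."""
    sketch = [None] * (1 << p)
    for item in items:
        h = h64(item)
        bucket = h >> (64 - p)                 # first p bits choose the partition
        value = h & ((1 << (64 - p)) - 1)      # remaining bits are the stored value
        if sketch[bucket] is None or value < sketch[bucket]:
            sketch[bucket] = value
    return sketch

def merge(s1, s2):
    """Sketch of the union: bucket-wise minimum, treating None as empty."""
    return [b if a is None else a if b is None else min(a, b) for a, b in zip(s1, s2)]

def jaccard(s1, s2):
    """Fraction of matching buckets among buckets occupied in either sketch."""
    occupied = matches = 0
    for a, b in zip(s1, s2):
        if a is None and b is None:
            continue
        occupied += 1
        matches += (a == b)
    return matches / occupied if occupied else 0.0

def truncate(sketch, b):
    """b-bit MinHash: keep only the lowest b bits of each stored minimum."""
    mask = (1 << b) - 1
    return [None if v is None else v & mask for v in sketch]

A, B = set(range(3000)), set(range(1000, 4000))   # true Jaccard index is 0.5
sa, sb = kpartition_minhash(A, 8), kpartition_minhash(B, 8)
print(jaccard(sa, sb))
```

Merging full sketches reproduces the sketch of the union exactly, whereas taking the bucket-wise minimum of truncated values generally does not match the truncation of the merged sketch, which is why the full hash must be kept during sketch generation.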

We resolved this issue by building a new sketch, HyperMinHash, as a hybrid between HyperLogLog and k-partition MinHash. Using the same amount of space as b-bit MinHash + HyperLogLog, we achieve better streaming performance (using only $O(\log\log(n) + \log(1/\delta))$ bits per bucket at all stages of the process), the ability to take unions of sketches, and count-distinct cardinality estimation. In Table 1, we summarize some of the properties of the various methods we have described above.

2 INTUITION

MinHash works under the premise that two sets will have identical minimum value with probability equal to the Jaccard index, because they can only share a minimum value if that minimum value corresponds to a member of the intersection of those two sets. If we have a total ordering on the union of both sets, the fraction of equal buckets is an unbiased estimator for the Jaccard index. However, with limited precision hash functions, there is some chance of accidental collision, when the value does not correspond to a member of the intersection. In order to get close to a true total ordering, the space of potential hashes must be on the order of the size of the union, and thus we must store $O(\log n)$ bits.

Note, however, that the minimum of a collection of uniform $[0, 1]$ random variables $X_1, \ldots, X_n$ is much more likely to be a small number than a large one (the insight behind most count-distinct sketches []). HyperMinHash operates identically to MinHash, but instead of storing the minimum values with fixed precision, it effectively uses an adaptive precision that increases resolution when the values are smaller, by using initial loglog counters (from LogLog cardinality estimation) and then storing a fixed number of bits beyond that (similar in spirit to b-bit MinHash).
More precisely, after dividing up the items into $k$ partitions, we store the position of the leading 1 bit in the first $2^q$ bits (and store $2^q + 1$ if there is no 1 bit in the first $2^q$ bits), and the $r$ bits following that (Figure 1). We do not need a total ordering so long as the number of accidental collisions in the minimum values is low.

To analyze the performance of HyperMinHash compared to random-permutation MinHash (or equivalently 0-collision standard MinHash) it suffices to consider the expected number of accidental collisions. In this intuitive analysis, we will only analyze the simple case of collisions while using only a single bucket, but the same flavor of argument holds for multiple partitions. The HyperLogLog part of the sketch results in collisions whenever two items match in order of magnitude (Figure 2a). By pairing it with an additional $r$-bit hash, our collision space is narrowed by a factor of about $2^r$ within each bucket. An explicit exact formula for the expected number of collisions is

$$E C = \sum_{i=1}^{\infty} \sum_{j=0}^{2^r - 1} \left[ \left(1 - \frac{2^r + j}{2^{i+r}}\right)^n - \left(1 - \frac{2^r + j + 1}{2^{i+r}}\right)^n \right] \left[ \left(1 - \frac{2^r + j}{2^{i+r}}\right)^m - \left(1 - \frac{2^r + j + 1}{2^{i+r}}\right)^m \right],$$

though finding a closed formula is rather more difficult.

Intuitively, suppose that the stored hash value for some partition is the tuple $(i, b)$, where $i$ is the position of the leading 1 and $b$ is an $r$-bit string. This implies that the original bitstring of the minimum hash began with the $p$ partition bits, followed by $i - 1$ zeros, a 1, and then the bits $b$. A uniform random hash in $[0, 1]$ then collides with this number with probability $2^{-(p + i + r)}$, so we expect to need cardinalities on the order of $2^{p + i + r}$ before having many collisions. But of course, as the cardinalities of $A$ and $B$ increase, so does the expected position of the leading 1 in the bitstring, as analyzed in the construction of HyperLogLog [8]. Thus, the collision probabilities remain roughly constant as cardinalities increase, at least until we reach the precision limit of the LogLog counters.

In practice, we store only a finite number of bits for the leading 1 indicator (often 6 bits). Because it is a LogLog counter, storing 6 bits is sufficient for set cardinalities up to $O(2^{2^6} = 2^{64})$.
This increases our collision surface though, as we might have collisions in the lower left region near the origin (Figure 2c). We can directly compute the collision probability (and similarly the variance) by summing together the probability mass in these boxes, replacing the infinite sum with a finite sum (Lemma 3.6). For more sensitive estimations, we can subtract

Table 1. Comparison of key features against other methods.

Method                         Bits per bucket                     Unions   Jaccard index   Intersection size   Streaming updates
MinHash                        $\log(n)$                           yes      yes             yes                 yes
b-bit MinHash                  $\log(1/\delta)$                    no       yes             no                  no
HyperLogLog                    $\log\log(n)$                       yes      no              no                  yes
HyperLogLog + MinHash          $\log(n) + \log\log(n)$             yes      yes             yes                 yes
HyperLogLog + b-bit MinHash    $\log\log(n) + \log(1/\delta)$      no       yes             yes                 no
HyperMinHash                   $\log\log(n) + \log(1/\delta)$      yes      yes             yes                 yes

Notes: $n$ is the cardinality of the sets and $\delta$ is the Jaccard index, where applicable. Where applicable, $\Theta(1/\epsilon^2)$ buckets are required for union cardinality estimation with relative error $\epsilon$. All of the MinHash-based methods also require $\Theta(1/\delta)$ buckets to give accurate Jaccard indexes. Jaccard index and intersection size can be directly computed from HyperLogLog through inclusion-exclusion or MLE [5], but errors are then dependent on union cardinality estimates, so relative error will be high for small intersections and complicated predicates.

Fig. 1. HyperMinHash generates sketches in the same fashion as one-permutation MinHash. It begins by hashing each object in the set to a uniformly random number between 0 and 1, encoded in binary. Then, the hashed values are partitioned by the first $p$ bits, and the minimum value within each partition is taken. Each value is specified by a tuple; the first part is the position of the leftmost 1 in the first $2^q$ bits, and $2^q + 1$ otherwise, so exactly identical to a HyperLogLog sketch. The second part is the value of the next $r$ bits in the bitstring. Note that this is mathematically equivalent to applying three independent hash functions for each bucket (for the green, blue, and red bits), or, alternately, to using a single hash function but dividing the bitstring into fixed-length regions first.
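The encoding described in the Figure 1 caption can be sketched on a hash written out as a binary string. The function name `encode` and the string representation are illustrative assumptions, not part of the paper; the sketch takes the $r$ bits immediately after the leading 1 (or after the first $2^q$ bits when no 1 is present).

```python
def encode(bits, p, q, r):
    """Split a hash bitstring into (partition, leading-one position, next r bits).

    bits: string of '0'/'1' characters, most significant bit first.
    The first p bits select the partition. Within the following 2**q bits we
    record the 1-based position of the leftmost 1, or 2**q + 1 if none exists.
    """
    window_len = 1 << q
    bucket = int(bits[:p], 2)
    rest = bits[p:]
    idx = rest[:window_len].find("1")
    if idx == -1:
        position = window_len + 1
        suffix = rest[window_len:window_len + r]
    else:
        position = idx + 1
        suffix = rest[idx + 1:idx + 1 + r]
    return bucket, position, suffix

# Partition bits "01", then "001" (leading 1 at position 3), then r = 4 bits "1011".
print(encode("01" + "001" + "1011" + "00000", p=2, q=2, r=4))   # (1, 3, '1011')
```

Smaller hash values have a later leading 1, so the same $r$ stored bits describe an exponentially narrower interval; this is the adaptive precision referred to in Section 2.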

the expected number of collisions to debias the estimation. In the next section we will prove bounds on the expectation and variance in the number of collisions.

3 PROOFS

The main result of this paper bounds the expectation and variance of accidental collisions, given two HyperMinHash sketches of disjoint sets. First, we rigorously define the full HyperMinHash sketch.

Definition 3.1. We will define $f_{p,q,r}(A) : S \to \{\{1, \ldots, 2^q\} \times \{0, 1\}^r\}^{2^p}$ to be the HyperMinHash sketch constructed from Figure 1, where $A$ is a set of hashable objects and $p, q, r \in \mathbb{N}$, and let $f_{p,q,r}(A)_i : S \to \{1, \ldots, 2^q\} \times \{0, 1\}^r$ be the value of the $i$th bucket in the sketch. More precisely, let $h(x) : S \to [0, 1]$ be a uniformly random hash function. Let $\rho_q(x) = \min(\lfloor -\log_2(x) \rfloor + 1, 2^q)$, $\sigma_r(x) = \lfloor x 2^r \rfloor$, and $\hat{h}_{q,r}(x) = \left(\rho_q(x), \sigma_r(x 2^{\rho_q(x)} - 1)\right)$. Then, we will define

$$f_{p,q,r}(A)_i = \hat{h}_{q,r}\left( \min_{a \in A,\ \frac{i}{2^p} < h(a) < \frac{i+1}{2^p}} h(a) 2^p - i \right).$$

Definition 3.2. Let $A$, $B$ be hashable sets with $|A| = n$, $|B| = m$, $n > m$, and $A \cap B = \emptyset$. Then define an indicator variable for collisions in bucket $i$ of their respective HyperMinHash sketches

$$Z_{p,q,r}(A, B, i) = \mathbb{1}_{\left(f_{p,q,r}(A)_i = f_{p,q,r}(B)_i\right)}.$$

Our main theorems follow:

Theorem 3.3. $C = \sum_{i=0}^{2^p - 1} Z_{p,q,r}(A, B, i)$ is the number of collisions between the HyperMinHash sketches of two disjoint sets $A$ and $B$. Then the expectation

$$E C \le 2^p \left( \frac{5}{2^r} + \frac{n}{2^{p + 2^q + r}} \right).$$

Theorem 3.4. Given the same setup as in Theorem 3.3, $\mathrm{Var}(C) \le E[C]^2 + E[C]$.

Theorem 3.3 allows us to correct for the number of random collisions before computing the Jaccard index, and Theorem 3.4 tells us that the standard deviation in the number of collisions is approximately the expectation. Before proving these theorems, we will start by proving a simpler proposition.

Proposition 3.5. Consider a HyperMinHash sketch with only 1 bucket on two disjoint sets $A$ and $B$, i.e. $f_{0,q,r}(A)$ and $f_{0,q,r}(B)$. Let $\gamma(n, m) \sim Z_{0,q,r}(A, B, 0)$. Naturally, as a good hash function results in uniform random variables, $\gamma$ is only dependent on the cardinalities $n$ and $m$.
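We claim that

$$E\gamma(n, m) \le \frac{6}{2^r} + \frac{n}{2^{2^q + r}}.$$

Proving this will require a few technical lemmas, which we will then use to prove the main theorems.

Lemma 3.6.

$$E\gamma(n, m) = P\left(f_{0,q,r}(A)_0 = f_{0,q,r}(B)_0\right)$$
$$= \sum_{i=1}^{2^q - 1} \sum_{j=0}^{2^r - 1} \left[ \left(1 - \frac{2^r + j}{2^{r+i}}\right)^n - \left(1 - \frac{2^r + j + 1}{2^{r+i}}\right)^n \right] \left[ \left(1 - \frac{2^r + j}{2^{r+i}}\right)^m - \left(1 - \frac{2^r + j + 1}{2^{r+i}}\right)^m \right]$$
$$+ \sum_{j=0}^{2^r - 1} \left[ \left(1 - \frac{j}{2^{r+2^q}}\right)^n - \left(1 - \frac{j + 1}{2^{r+2^q}}\right)^n \right] \left[ \left(1 - \frac{j}{2^{r+2^q}}\right)^m - \left(1 - \frac{j + 1}{2^{r+2^q}}\right)^m \right].$$

Proof. Let $a_1, \ldots, a_n$ be random variables corresponding to the hashed values of items in $A$. Then the $a_i \in [0, 1]$ are uniform r.v. Similarly, $b_1, \ldots, b_m$, drawn from hashed values of $B$, are uniform $[0, 1]$ r.v. Let $x = \min\{a_1, \ldots, a_n\}$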

and $y = \min\{b_1, \ldots, b_m\}$. Then we have probability density functions

$$\mathrm{pdf}(x) = n(1 - x)^{n-1} \text{ for } x \in [0, 1], \qquad \mathrm{pdf}(y) = m(1 - y)^{m-1} \text{ for } y \in [0, 1],$$

and cumulative distribution functions

$$\mathrm{cdf}(x) = 1 - (1 - x)^n \text{ for } x \in [0, 1], \qquad \mathrm{cdf}(y) = 1 - (1 - y)^m \text{ for } y \in [0, 1].$$

We are particularly interested in the joint probability density function

$$\mathrm{pdf}(x, y) = n(1 - x)^{n-1} m(1 - y)^{m-1} \text{ for } (x, y) \in [0, 1]^2.$$

The probability mass enclosed in a square along the diagonal $S = [s_1, s_2]^2 \subset [0, 1]^2$ is then precisely

$$\mu(S) = \int_{s_1}^{s_2} \int_{s_1}^{s_2} n(1 - x)^{n-1} m(1 - y)^{m-1} \, dy \, dx = \left[ (1 - s_1)^n - (1 - s_2)^n \right] \left[ (1 - s_1)^m - (1 - s_2)^m \right]. \qquad (3)$$

Recall $f_{0,q,r}(A)_0 \in \{1, \ldots, 2^q\} \times \{0, 1\}^r \cong \{1, \ldots, 2^q\} \times \{0, \ldots, 2^r - 1\}$, so given $f_{0,q,r}(A)_0 = (i, j)$, $x = 0.0^{i-1}1j\ldots$ in the binary expansion, unless $i = 2^q$, in which case the binary expansion is $x = 0.0^{i}j\ldots$. That in turn gives $s_1 < x < s_2$, where $s_1 = \frac{2^r + j}{2^{r+i}}$ and $s_2 = \frac{2^r + j + 1}{2^{r+i}}$ when $i < 2^q$, and $s_1 = \frac{j}{2^{r+i}}$, $s_2 = \frac{j + 1}{2^{r+i}}$ when $i = 2^q$. Collisions happen precisely when $s_1 < x, y < s_2$. Finally, using the $s_1$, $s_2$ formulas above, it suffices to sum the probability of collision over the image of $f$, so

$$E\gamma(n, m) = \sum_{i=1}^{2^q} \sum_{j=0}^{2^r - 1} \mu([s_1, s_2]^2).$$

Substituting in for $s_1$, $s_2$, and $\mu$ completes the proof. Note also that this is precisely the sum of the probability mass in the red and purple squares along the diagonal in Figure 2c. □

While Lemma 3.6 allows us to explicitly compute $E\gamma(n, m)$, the lack of a closed form solution makes reasoning about it difficult. Here, we will upper bound the expectation by integrating over four regions of the unit square that cover all the collision boxes (Figure 2d). For ease of notation, let $\hat{r} = 2^r$ and $\hat{q} = 2^{2^q}$.

- The Top Right box $TR = \left[\frac{\hat{r}}{\hat{r} + 1}, 1\right]^2$ (in orange in Figure 2d).
- The magenta triangle from the origin bounded by the lines $y = \frac{\hat{r}}{\hat{r} + 1} x$ and $y = \frac{\hat{r} + 1}{\hat{r}} x$ with $0 < x < \frac{\hat{r}}{\hat{r} + 1}$, which we will denote $RAY$.
- The black strip near the origin covering all the purple boxes except the one on the origin, bounded by the lines $y = x - \frac{1}{\hat{r}\hat{q}}$, $y = x + \frac{1}{\hat{r}\hat{q}}$, and $\frac{1}{\hat{r}\hat{q}} < x < \frac{1}{\hat{q}}$, which we will denote $STRIP$.
- The Bottom Left purple box $BL = \left[0, \frac{1}{\hat{r}\hat{q}}\right]^2$.

Lemma 3.7.
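The probability mass contained in the top right square is $\mu(TR) \le \frac{1}{2^r}$.

Proof. Write $\hat{r} = 2^r$ and $\hat{q} = 2^{2^q}$ as before. By Equation 3,

$$\mu(TR) = \int_{\frac{\hat{r}}{\hat{r}+1}}^{1} \int_{\frac{\hat{r}}{\hat{r}+1}}^{1} n(1 - x)^{n-1} m(1 - y)^{m-1} \, dy \, dx = \left[ -(1 - x)^n \right]_{\frac{\hat{r}}{\hat{r}+1}}^{1} \left[ -(1 - y)^m \right]_{\frac{\hat{r}}{\hat{r}+1}}^{1} = \frac{1}{(\hat{r} + 1)^{n+m}} \le \frac{1}{\hat{r}}. \qquad □$$

Lemma 3.8. The probability mass contained in the bottom left square near the origin is $\mu(BL) \le \frac{n}{2^{r + 2^q}}$.

Proof.

$$\mu(BL) = \int_{0}^{\frac{1}{\hat{r}\hat{q}}} \int_{0}^{\frac{1}{\hat{r}\hat{q}}} n(1 - x)^{n-1} m(1 - y)^{m-1} \, dy \, dx = \left[ -(1 - x)^n \right]_{0}^{\frac{1}{\hat{r}\hat{q}}} \left[ -(1 - y)^m \right]_{0}^{\frac{1}{\hat{r}\hat{q}}} = \left[ 1 - \left(1 - \frac{1}{\hat{r}\hat{q}}\right)^n \right] \left[ 1 - \left(1 - \frac{1}{\hat{r}\hat{q}}\right)^m \right].$$

For $n, m < \hat{r}\hat{q}$, we note that the linear binomial approximation is actually a strict upper bound (as can be trivially verified through the Taylor expansion), so

$$\mu(BL) \le \frac{nm}{\hat{r}^2 \hat{q}^2} \le \frac{n}{\hat{r}\hat{q}}. \qquad □$$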
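The box-summing argument behind Lemma 3.6 can be checked numerically for small parameters. The following is a sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, exact rational arithmetic is used to avoid floating point error, and the last bucket is assumed to span all of $[0, 2^{-(2^q - 1)})$ as suggested by the cap in $\rho_q$, which may differ from the paper's exact indexing of the final boxes.

```python
from fractions import Fraction

def boxes(q, r):
    """Yield (s1, s2) for every value a one-bucket sketch can take.

    Assumption: leading-one positions 1 .. 2**q - 1 each get 2**r sub-boxes,
    and the capped final bucket [0, 2**-(2**q - 1)) is also cut into 2**r pieces.
    """
    cap, sub = 1 << q, 1 << r
    for i in range(1, cap):
        denom = 1 << (r + i)
        for j in range(sub):
            yield Fraction(sub + j, denom), Fraction(sub + j + 1, denom)
    denom = 1 << (r + cap - 1)
    for j in range(sub):
        yield Fraction(j, denom), Fraction(j + 1, denom)

def expected_collision(n, m, q, r):
    """Sum Equation 3 over all boxes: P(min of n and min of m uniforms share a box)."""
    total = Fraction(0)
    for s1, s2 in boxes(q, r):
        total += ((1 - s1) ** n - (1 - s2) ** n) * ((1 - s1) ** m - (1 - s2) ** m)
    return total

def proposition_bound(n, q, r):
    """Right-hand side of Proposition 3.5: 6 / 2^r + n / 2^(2^q + r)."""
    return Fraction(6, 1 << r) + Fraction(n, 1 << ((1 << q) + r))

exact = expected_collision(20, 10, q=3, r=4)
print(float(exact), float(proposition_bound(20, q=3, r=4)))
```

The number of boxes is $2^q \cdot 2^r$, so this direct evaluation is only practical for small $q$ and $r$; the closed-form bound of Proposition 3.5 is what makes larger parameters tractable.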

Fig. 2. Visualization of collision probabilities for HyperMinHash. In each panel, the horizontal axis is the pdf of $\min(X_1, \ldots, X_n)$ and the vertical axis is the pdf of $\min(Y_1, \ldots, Y_m)$. (a) HyperLogLog sections, used alone, result in collisions whenever the minhashes match in order of magnitude. (b) HyperMinHash further subdivides HyperLogLog leading 1-indicator buckets into $2^r$ subbuckets, achieving a much smaller collision space, so long as we precisely store the position of the leading 1. (c) In practice, HyperMinHash has a limited number of bits for the loglog counters, so there is a final lower left bucket at the precision limit. (d) We will upper bound the collision probability of HyperMinHash by dividing it into these four regions of integration: the Top Right orange box, the magenta ray covering intermediate boxes, the black strip covering all but the final purple box, and the final purple subbucket by the origin.

Lemma 3.9. The probability mass of the ray from the origin can be bounded by $\mu(RAY) \le \frac{3}{2^r}$.

Proof. Write $\hat{r} = 2^r$ and $\hat{q} = 2^{2^q}$ as before. Unfortunately, the ray is not aligned to the axes, so we cannot integrate $x$ and $y$ separately.

$$\mu(RAY) = \int_{0}^{\frac{\hat{r}}{\hat{r}+1}} \int_{\frac{\hat{r}}{\hat{r}+1} x}^{\frac{\hat{r}+1}{\hat{r}} x} n(1 - x)^{n-1} m(1 - y)^{m-1} \, dy \, dx = \int_{0}^{\frac{\hat{r}}{\hat{r}+1}} n(1 - x)^{n-1} \left[ \left(1 - \frac{\hat{r}}{\hat{r}+1} x\right)^m - \left(1 - \frac{\hat{r}+1}{\hat{r}} x\right)^m \right] dx.$$

Using the elementary difference of powers formula, note that for $0 \le \alpha \le \beta$,

$$\beta^m - \alpha^m = (\beta - \alpha) \sum_{i=1}^{m} \alpha^{m-i} \beta^{i-1} \le (\beta - \alpha) m \beta^{m-1}.$$

With a bit of symbolic manipulation, we can conclude that

$$\mu(RAY) \le \int_{0}^{\frac{\hat{r}}{\hat{r}+1}} n(1 - x)^{n-1} \, \frac{2\hat{r} + 1}{\hat{r}(\hat{r} + 1)} \, x m \left(1 - \frac{\hat{r}}{\hat{r}+1} x\right)^{m-1} dx \le \frac{2\hat{r} + 1}{\hat{r}(\hat{r} + 1)} \int_{0}^{\frac{\hat{r}}{\hat{r}+1}} nm \left(1 - \frac{\hat{r}}{\hat{r}+1} x\right)^{n+m-2} x \, dx.$$

With a straightforward integration by parts, and then upper bounding negative terms by 0,

$$\mu(RAY) \le \frac{(2\hat{r} + 1)(\hat{r} + 1)}{\hat{r}^3} \cdot \frac{nm}{(n + m)(n + m - 1)} \le \frac{3}{\hat{r}}. \qquad □$$

Lemma 3.10. The probability mass of the diagonal strip near the origin is $\mu(STRIP) \le \frac{2}{2^r}$.

Proof. Using the same integration procedure and difference of powers formula used in the proof of Lemma 3.9,

$$\mu(STRIP) = \int_{\frac{1}{\hat{r}\hat{q}}}^{\frac{1}{\hat{q}}} \int_{x - \frac{1}{\hat{r}\hat{q}}}^{x + \frac{1}{\hat{r}\hat{q}}} n(1 - x)^{n-1} m(1 - y)^{m-1} \, dy \, dx \le \frac{2}{\hat{r}}. \qquad □$$

Proof of Proposition 3.5. Summing bounds from Lemmas 3.7, 3.8, 3.9, and 3.10,

$$E\gamma(n, m) \le \frac{6}{\hat{r}} + \frac{n}{\hat{r}\hat{q}} = \frac{6}{2^r} + \frac{n}{2^{2^q + r}}. \qquad □$$

Proof of Theorem 3.3. Let $A_i$, $B_i$ be the $i$th partitions of $A$ and $B$ respectively. For ease of notation, let us define $\hat{p} = 2^p$. Recall that $C = \sum_{j=0}^{\hat{p} - 1} Z_{p,q,r}(A, B, j)$. We will first bound $E Z_{p,q,r}(A, B, j)$ using the same techniques used in Proposition 3.5. Notice first that $Z_{p,q,r}(A, B, j)$ effectively rescales the minimum hash values from $Z_{0,q,r}(A, B, j) = \gamma(n, m)$ down by a factor of $2^p$; i.e. we scale down both the axes in Figure 2d by substituting $\hat{q} \leftarrow \hat{q}\hat{p}$ in Lemmas 3.10 and 3.8. We do not need Lemma 3.7 because its box is already covered by the Magenta Ray from Lemma 3.9, which we do not scale. Summing these together, we readily conclude

$$E Z_{p,q,r}(A, B, j) \le \frac{5}{\hat{r}} + \frac{n}{\hat{r}\hat{q}\hat{p}} = \frac{5}{2^r} + \frac{n}{2^{p + 2^q + r}}.$$

Then by linearity of expectation, $E C \le 2^p \left[ \frac{5}{2^r} + \frac{n}{2^{p + 2^q + r}} \right]$. □
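To get a feel for the magnitudes in Theorem 3.3, the bound can be evaluated directly. The snippet below is an illustrative sketch with hypothetical function names; the parameters $p = 15$, $q = 6$, $r = 10$ are the practical values quoted in Section 5, and `corrected_jaccard` shows one simple way of subtracting an expected collision count, assuming the raw estimate is matches divided by occupied buckets.

```python
def sketch_bytes(p, q, r):
    """Memory for 2**p buckets of q + r bits each (q-bit LogLog counter, r extra bits)."""
    return (1 << p) * (q + r) // 8

def collision_bound(n, p, q, r):
    """Theorem 3.3 upper bound on the expected number of accidental collisions."""
    return 2 ** p * (5 / 2 ** r + n / 2 ** (p + 2 ** q + r))

def corrected_jaccard(matches, occupied, expected_collisions):
    """Debias a raw bucket-match estimate by subtracting expected collisions (floored at 0)."""
    return max(matches - expected_collisions, 0) / occupied

print(sketch_bytes(15, 6, 10))              # 65536 bytes, i.e. 64 KiB
print(collision_bound(10 ** 19, 15, 6, 10))
```

For these parameters the first term $2^p \cdot 5 / 2^r = 160$ dominates, since $n / 2^{2^q + r}$ is negligible until $n$ approaches $2^{64}$. Because Theorem 3.3 is an upper bound rather than the exact expectation, subtracting it may overcorrect.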

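Proof of Theorem 3.4. By conditioning on the multinomial distribution, we can decompose $C$ into

$$C = \sum_{\substack{\alpha_1 + \cdots + \alpha_{2^p} = n \\ \beta_1 + \cdots + \beta_{2^p} = m}} \mathbb{1}_{(\forall i, |A_i| = \alpha_i)} \mathbb{1}_{(\forall i, |B_i| = \beta_i)} \sum_{j=0}^{2^p - 1} Z_{p,q,r}\left(A, B, j \mid \forall i, |A_i| = \alpha_i, |B_i| = \beta_i\right).$$

For ease of notation in the following, we will use $\vec{\alpha}$, $\vec{\beta}$ to denote the events $\forall i, |A_i| = \alpha_i$ and $\forall i, |B_i| = \beta_i$ respectively. Additionally, let $Z(j) = Z_{p,q,r}(A, B, j)$. So,

$$C = \sum_{\vec{\alpha}, \vec{\beta}} \mathbb{1}_{\vec{\alpha}, \vec{\beta}} \sum_{j=0}^{2^p - 1} Z(j \mid \vec{\alpha}, \vec{\beta}).$$

Then

$$\mathrm{Var}(C) = \sum_{\vec{\alpha}, \vec{\beta}} \sum_{\vec{\alpha}', \vec{\beta}'} \sum_{j, j' = 0}^{2^p - 1} \mathrm{Cov}\left( \mathbb{1}_{\vec{\alpha}, \vec{\beta}} Z(j \mid \vec{\alpha}, \vec{\beta}),\ \mathbb{1}_{\vec{\alpha}', \vec{\beta}'} Z(j' \mid \vec{\alpha}', \vec{\beta}') \right).$$

But note that for $(\vec{\alpha}, \vec{\beta}) \neq (\vec{\alpha}', \vec{\beta}')$, $\mathbb{1}_{\vec{\alpha}, \vec{\beta}} = 1 \implies \mathbb{1}_{\vec{\alpha}', \vec{\beta}'} = 0$ and vice versa, because they are disjoint indicator variables. As such, for $(\vec{\alpha}, \vec{\beta}) \neq (\vec{\alpha}', \vec{\beta}')$, the covariance term above is at most 0, implying that

$$\mathrm{Var}(C) \le \sum_{\vec{\alpha}, \vec{\beta}} \sum_{j, j' = 0}^{2^p - 1} \mathrm{Cov}\left( \mathbb{1}_{\vec{\alpha}, \vec{\beta}} Z(j \mid \vec{\alpha}, \vec{\beta}),\ \mathbb{1}_{\vec{\alpha}, \vec{\beta}} Z(j' \mid \vec{\alpha}, \vec{\beta}) \right)$$
$$= \sum_{\vec{\alpha}, \vec{\beta}} \sum_{j=0}^{2^p - 1} \mathrm{Var}\left( \mathbb{1}_{\vec{\alpha}, \vec{\beta}} Z(j \mid \vec{\alpha}, \vec{\beta}) \right) + \sum_{\vec{\alpha}, \vec{\beta}} \sum_{\substack{j \neq j' \\ 0 \le j, j' \le 2^p - 1}} \mathrm{Cov}\left( \mathbb{1}_{\vec{\alpha}, \vec{\beta}} Z(j \mid \vec{\alpha}, \vec{\beta}),\ \mathbb{1}_{\vec{\alpha}, \vec{\beta}} Z(j' \mid \vec{\alpha}, \vec{\beta}) \right).$$

Note that the first term can be simplified, recalling that $Z$ is a $\{0, 1\}$ Bernoulli r.v., so

$$\sum_{\vec{\alpha}, \vec{\beta}} \sum_{j=0}^{2^p - 1} \mathrm{Var}\left( \mathbb{1}_{\vec{\alpha}, \vec{\beta}} Z(j \mid \vec{\alpha}, \vec{\beta}) \right) = \sum_{j=0}^{2^p - 1} \mathrm{Var}(Z(j)) \le \sum_{j=0}^{2^p - 1} E[Z(j)] = E C.$$

Moving on, from the covariance formula, for independent random variables $X_1$, $X_2$, $Y$,

$$\mathrm{Cov}(X_1 Y, X_2 Y) = E[X_1 X_2 Y^2] - E[X_1 Y] E[X_2 Y] = E[X_1] E[X_2] \left( E[Y^2] - E[Y]^2 \right) = E[X_1] E[X_2] \mathrm{Var}(Y).$$

Thus the second term of the summation can be bounded as follows:

$$\sum_{\vec{\alpha}, \vec{\beta}} \sum_{\substack{j \neq j' \\ 0 \le j, j' \le 2^p - 1}} \mathrm{Cov}\left( \mathbb{1}_{\vec{\alpha}, \vec{\beta}} Z(j \mid \vec{\alpha}, \vec{\beta}),\ \mathbb{1}_{\vec{\alpha}, \vec{\beta}} Z(j' \mid \vec{\alpha}, \vec{\beta}) \right) = \sum_{\vec{\alpha}, \vec{\beta}} \sum_{\substack{j \neq j' \\ 0 \le j, j' \le 2^p - 1}} E\left[ Z(j \mid \vec{\alpha}, \vec{\beta}) \right] E\left[ Z(j' \mid \vec{\alpha}, \vec{\beta}) \right] \mathrm{Var}\left( \mathbb{1}_{\vec{\alpha}, \vec{\beta}} \right)$$
$$\le \sum_{\vec{\alpha}, \vec{\beta}} \sum_{j=0}^{2^p - 1} \sum_{j'=0}^{2^p - 1} E\left[ Z(j \mid \vec{\alpha}, \vec{\beta}) \right] E\left[ Z(j' \mid \vec{\alpha}, \vec{\beta}) \right] \mathrm{Var}\left( \mathbb{1}_{\vec{\alpha}, \vec{\beta}} \right) = \sum_{\vec{\alpha}, \vec{\beta}} E\left[ C \mid \vec{\alpha}, \vec{\beta} \right]^2 \mathrm{Var}\left( \mathbb{1}_{\vec{\alpha}, \vec{\beta}} \right)$$
$$= \sum_{\vec{\alpha}, \vec{\beta}} E\left[ C \mid \vec{\alpha}, \vec{\beta} \right]^2 P(\vec{\alpha}, \vec{\beta}) \left( 1 - P(\vec{\alpha}, \vec{\beta}) \right) \le E[C]^2.$$

We conclude that $\mathrm{Var}(C) \le E[C]^2 + E[C]$. □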

Fig. 3. For a fixed size sketch, HyperMinHash has better accuracy and/or cardinality range than MinHash. We compare Jaccard index estimation for identically sized sets with Jaccard index of 1/3 (i.e. 50% overlap), so the maximum relative error is 2, and plot the mean relative errors without estimated collision correction. (green circle) A 256 byte HyperMinHash sketch, with 256 buckets of 8 bits each, 4 bits of which are allocated to the LogLog counter; Jaccard index estimation remains stable over the widest range of cardinalities. (orange diamond) A 256 byte MinHash sketch with 256 buckets of 8 bits each achieves similar accuracy at low cardinalities, but fails at much smaller cardinalities. (blue triangle) A 256 byte MinHash sketch with 128 buckets of 16 bits can access larger cardinalities than the 8-bit MinHash sketch, but to do so trades off on low-cardinality accuracy.

4 EXPERIMENTAL VALIDATION

For completeness, we give experimental validation for the behavior of Jaccard index estimation on raw (Hyper)MinHash sketches with no expected collision correction. In Figure 3, we allocate 256 bytes each for two standard MinHash sketches and a HyperMinHash sketch. For fixed sketch size and cardinality range, HyperMinHash is more accurate; or, for fixed sketch size and bucket number, HyperMinHash can access exponentially larger set cardinalities. Pseudocode and a Python implementation are available (details in Appendix).

5 DISCUSSION AND CONCLUSION

We have introduced HyperMinHash, a sketch for estimating the Jaccard index using log log space, and made available a prototype Python implementation. It can be thought of as a compression scheme for MinHash that reduces the number of bits per bucket to $\log\log(n)$ from $\log(n)$ by using insights from HyperLogLog and b-bit MinHash. As with the original MinHash, it retains variance on the order of $k/\delta$, where $k$ is the number of buckets and $\delta$ is the Jaccard index between two sets.
However, it also introduces $1/l$ variance, where $l = 2^r$, because of the increased number of collisions, which matches the requirements of b-bit MinHash. For practical parameters of $p = 15$, $q = 6$, $r = 10$, the HyperMinHash sketch will use up 64KiB of memory per set, and allow for estimating Jaccard indices of 0.01 for set cardinalities on the order of $10^{19}$ with accuracy around 5%. HyperMinHash is to our knowledge the first streaming summary sketch capable of directly estimating union

cardinality, Jaccard index, and intersection cardinality in log log space, able to be applied to arbitrary Boolean formulas in conjunctive normal form with error rates bounded by the final result size.

ACKNOWLEDGMENTS

This study was supported by National Institutes of Health (NIH) Big Data to Knowledge (BD2K) awards U54HG from the National Human Genome Research Institute (NHGRI) and U0CA98934 from the National Cancer Institute (NCI). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. We additionally thank Daphne Ippolito and Adam Sealfon for useful comments and advice.

REFERENCES

[1] Ziv Bar-Yossef, TS Jayram, Ravi Kumar, D Sivakumar, and Luca Trevisan. 2002. Counting distinct elements in a data stream. In International Workshop on Randomization and Approximation Techniques in Computer Science. Springer.
[2] Andrei Z Broder. 1997. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings. IEEE.
[3] Edith Cohen. 2016. Min-Hash Sketches. (2016).
[4] Marianne Durand and Philippe Flajolet. 2003. Loglog counting of large cardinalities. In European Symposium on Algorithms. Springer.
[5] Otmar Ertl. 2017. New cardinality estimation algorithms for HyperLogLog sketches. arXiv preprint (2017).
[6] Otmar Ertl. 2017. New Cardinality Estimation Methods for HyperLogLog Sketches. arXiv preprint (2017).
[7] Philippe Flajolet. 2004. Counting by coin tossings. In Advances in Computer Science-ASIAN 2004. Higher-Level Decision Making. Springer.
[8] Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. 2007. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. In AofA: Analysis of Algorithms. Discrete Mathematics and Theoretical Computer Science.
[9] Paul Jaccard. 1901. Lois de distribution florale dans la zone alpine. Bull Soc Vaudoise Sci Nat 38 (1901).
[10] Ping Li and Christian König. 2010. b-bit minwise hashing. In Proceedings of the 19th international conference on World wide web. ACM.
[11] Ping Li, Art Owen, and Cun-Hui Zhang. 2012. One permutation hashing. In Advances in Neural Information Processing Systems.
[12] Rasmus Pagh, Morten Stöckel, and David P Woodruff. 2014. Is min-wise hashing optimal for summarizing set intersection?. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM.

A APPENDIX: HYPERMINHASH IN PRACTICE

Here, we present full algorithms to match a naive implementation of HyperMinHash as described in the previous Theory section. A Python implementation is also available.

Algorithm 1 HyperMinHash sketch.

  Let $h_1, h_2, h_3 : D \to [0, 1] \equiv \{0, 1\}^\infty$ be three independent hash functions hashing data from domain $D$ to the binary domain. (In practice, we generally use a single hash function, e.g. SHA-1, and use different sets of bits for each of the three hashes.)
  Let $\rho(s)$, for $s \in \{0, 1\}^\infty$, be the position of the left-most 1-bit ($\rho(0001\ldots) = 4$).
  Let $\sigma(s, n)$, for $s \in \{0, 1\}^\infty$, be the left-most $n$ bits of $s$ ($\sigma(01011\ldots, 5) = 01011$).

  function HyperMinHash($A, p, q, r$)
    Let $\hat{h}_1(x) = \sigma(h_1(x), p)$.
    Let $\hat{h}_2(x) = \min(\rho(h_2(x)), 2^q)$.
    Let $\hat{h}_3(x) = \sigma(h_3(x), r)$.
    Initialize $2^p$ tuples $B_1 = B_2 = \dots = B_{2^p} = (0, 0)$.
    for $a \in A$ do
      if $B_{\hat{h}_1(a)}[0] < \hat{h}_2(a)$ then
        $B_{\hat{h}_1(a)} \leftarrow (\hat{h}_2(a), \hat{h}_3(a))$
      else
        if $B_{\hat{h}_1(a)}[0] = \hat{h}_2(a)$ and $B_{\hat{h}_1(a)}[1] > \hat{h}_3(a)$ then
          $B_{\hat{h}_1(a)} \leftarrow (\hat{h}_2(a), \hat{h}_3(a))$
        end if
      end if
    end for
    return $B_1, \dots, B_{2^p}$ as $B$
  end function

A.1 Implementation optimizations

We recommend several optimizations for practical implementations of HyperMinHash. First, it is mathematically equivalent to:

(1) Pack the hashed tuple into a single word; this enables Jaccard index computation while using only one comparison per bucket instead of two.
(2) Use the max instead of min of the subbuckets. This allows us to take the union of two sketches while using only one comparison per bucket.

These recommendations should be self-explanatory, and are simply minor engineering optimizations, which we do not use in our prototyping, as they do not affect accuracy.

However, while we can exactly compute the number of expected collisions through Lemma 3.6, this computation is slow and often results in floating point errors unless BigInts are used, because Algorithm 5 is exponential in $r$. In practice, two ready solutions present themselves:

(1) We can ignore the bias and simply add it to the error. As the bias and standard deviation of the error are the same order of magnitude, this only doubles the absolute error in the estimation of Jaccard index. For large Jaccard indexes, this does not matter.

Algorithm 2 HyperMinHash Union

  function Union($S, T$)
    assert $|S| = |T|$
    Initialize $|S|$ tuples $B_1 = B_2 = \dots = B_{|S|} = (0, 0)$.
    for $i \in \{1, \dots, |S|\}$ do
      if $S_i[0] > T_i[0]$ then
        $B_i \leftarrow S_i$
      else if $S_i[0] < T_i[0]$ then
        $B_i \leftarrow T_i$
      else if $S_i[0] = T_i[0]$ then
        if $S_i[1] < T_i[1]$ then
          $B_i \leftarrow S_i$
        else
          $B_i \leftarrow T_i$
        end if
      end if
    end for
    return $B_1, \dots, B_{|S|}$ as $B$
  end function

Algorithm 3 Estimate Cardinality. Note that the left parts of the buckets can be passed directly into a HyperLogLog estimator. We can also use other k-minimum value count-distinct cardinality estimators, which we empirically found useful for large cardinalities.

  function EstimateCardinality($S, p, q, r$)
    Initialize $|S|$ integer registers $b_1 = b_2 = \dots = b_{|S|} = 0$.
    for $i \in \{1, \dots, |S|\}$ do
      $b_i \leftarrow S_i[0]$
    end for
    $R \leftarrow$ HyperLogLogCardinalityEstimator($\{b_i\}, q$)
    if $R < 1024\,|S|$ then
      return $R$
    else
      Initialize $|S|$ real registers $r_1, \dots, r_{|S|}$.
      for $i \in \{1, \dots, |S|\}$ do
        $r_i \leftarrow 2^{-S_i[0]} \left(1 + \frac{S_i[1]}{2^r}\right)$
      end for
      if $\sum_i r_i = 0$ then
        return $\infty$
      else
        return $|S|^2 / \sum_i r_i$
      end if
    end if
  end function
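As a rough illustration of Algorithms 1 and 2, of the single-word packing suggested in A.1, and of the fallback branch of Algorithm 3, the following Python sketch may help. It is not the authors' implementation: the helper names (`_fields`, `hyperminhash`, `union`, `pack`, `kmv_cardinality`) are hypothetical, one SHA-1 digest is split into three disjoint bit ranges to stand in for $h_1, h_2, h_3$, and the HyperLogLog estimator of Algorithm 3 is deliberately left out.

```python
import hashlib

HASH_BITS = 160  # width of a SHA-1 digest


def _fields(item: bytes, p: int, q: int, r: int):
    """Split one SHA-1 digest into (bucket index, capped rho, r-bit mantissa)."""
    cap = 2 ** q
    if p + cap + r > HASH_BITS:
        raise ValueError("p + 2**q + r must fit in a single SHA-1 digest")
    h = int.from_bytes(hashlib.sha1(item).digest(), "big")
    bucket = h >> (HASH_BITS - p)  # left-most p bits
    seg = (h >> (HASH_BITS - p - cap)) & ((1 << cap) - 1)  # next 2**q bits
    # Position of the left-most 1-bit in seg, capped at 2**q.
    rho = cap if seg == 0 else cap - seg.bit_length() + 1
    mant = (h >> (HASH_BITS - p - cap - r)) & ((1 << r) - 1)  # next r bits
    return bucket, rho, mant


def hyperminhash(items, p, q, r):
    """Algorithm 1: keep the largest rho per bucket, breaking ties by the smallest mantissa."""
    sketch = [(0, 0)] * (2 ** p)
    for item in items:
        b, rho, mant = _fields(item, p, q, r)
        cur = sketch[b]
        if cur[0] < rho or (cur[0] == rho and cur[1] > mant):
            sketch[b] = (rho, mant)
    return sketch


def union(s, t):
    """Algorithm 2: bucket-wise merge of two sketches of equal size."""
    assert len(s) == len(t)
    out = []
    for a, b in zip(s, t):
        if a[0] != b[0]:
            out.append(a if a[0] > b[0] else b)
        else:
            out.append(a if a[1] < b[1] else b)
    return out


def pack(bucket, r):
    """Pack (rho, mantissa) into one integer so that max() alone reproduces the merge rule."""
    rho, mant = bucket
    return (rho << r) | ((1 << r) - 1 - mant)


def kmv_cardinality(sketch, r):
    """Fallback branch of Algorithm 3 (the HyperLogLog branch is omitted here)."""
    total = sum((2.0 ** (-rho)) * (1 + mant / 2 ** r) for rho, mant in sketch)
    return float("inf") if total == 0 else len(sketch) ** 2 / total
```

Under these assumptions, `union(hyperminhash(A, ...), hyperminhash(B, ...))` gives the same sketch as hashing the concatenated stream directly, and comparing `pack`ed buckets with a single `max` selects the same bucket as the two-step comparison in Algorithm 2.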

Algorithm 4 Compute Jaccard Index. Note that the correction factor $EC$ is generally not needed, except for really small Jaccard index. Additionally, for most practical purposes, it is safe to substitute ApproximateExpectedCollisions for ExpectedCollisions.

  function JaccardIndex($S, T, p, q, r$)
    assert $|S| = |T|$
    $C \leftarrow 0$, $N \leftarrow 0$
    for $i \in \{1, \dots, |S|\}$ do
      if $S_i = T_i$ then
        $C \leftarrow C + 1$
      end if
      if $S_i \neq (0, 0)$ and $T_i \neq (0, 0)$ then
        $N \leftarrow N + 1$
      end if
    end for
    $n \leftarrow$ EstimateCardinality($S, p, q, r$)
    $m \leftarrow$ EstimateCardinality($T, p, q, r$)
    $EC \leftarrow$ [Approximate]ExpectedCollisions($n, m, p, q, r$)
    return $(C - EC)/N$
  end function

Algorithm 5 Expected collisions. Note that because of floating point error, BigInts must be used for large $n$ and $m$.

  function ExpectedCollisions($n, m, p, q, r$)
    $x \leftarrow 0$
    for $i \in \{1, \dots, 2^q\}$ do
      for $j \in \{1, \dots, 2^r\}$ do
        if $i \neq 2^q$ then
          $b_1 \leftarrow \frac{2^r + j}{2^{p+r+i}}$, $b_2 \leftarrow \frac{2^r + j + 1}{2^{p+r+i}}$
        else
          $b_1 \leftarrow \frac{j}{2^{p+r+i-1}}$, $b_2 \leftarrow \frac{j + 1}{2^{p+r+i-1}}$
        end if
        $\Pr_x \leftarrow (1 - b_1)^n - (1 - b_2)^n$
        $\Pr_y \leftarrow (1 - b_1)^m - (1 - b_2)^m$
        $x \leftarrow x + \Pr_x \Pr_y$
      end for
    end for
    return $x \cdot 2^p$
  end function

(2) We also present a fast, numerically stable algorithm to approximate the expected number of collisions (Algorithm 6). It uses the following procedure, which is empirically asymptotically correct:

Algorithm 6 Fast numerically stable approximation to Algorithm 5. Generally underestimates collisions.

  function ApproximateExpectedCollisions($n, m, p, q, r$)
    if $n < m$ then
      SWAP($n, m$)
    end if
    if $n > 2^{2^q + r}$ then
      return ERROR: cardinality too large for approximation.
    else if $n > 2^{p+5}$ then
      $\phi \leftarrow \frac{4n/m}{(1 + n/m)^2}$
      return $2^{p-r} \phi$
    else
      return ExpectedCollisions($n, m, p, q, 0$) $/\, 2^r$
    end if
  end function

(1) For $n < 2^{p+5}$, we approximate by taking the number of expected HyperLogLog collisions and dividing it by $2^r$. In each HyperLogLog box, we are interested in collisions along $2^r$ boxes along the diagonal. For this approximation, we simply assume that the joint probability density function is almost uniform within the box; this is not completely accurate, but pretty close in practice.

(2) For $2^{p+5} < n < 2^{2^q + p}$, we noted empirically that the expected number of collisions approached $2^{p-r}$ for $n = m$ as $n \to \infty$. Furthermore, the number of collisions is dependent on $n$ and $m$ by a factor of $\frac{4nm}{(n+m)(n+m-1)}$ from 3.9, which for $n, m \gg 1$ can be approximated by $\frac{4n/m}{(1 + n/m)^2}$. This approximation is primarily needed because of floating point errors when $n \to \infty$.

(3) Unfortunately, around $n > 2^{2^q + p}$, the number of collisions starts increasing and these approximations fail. However, note that for reasonable values of $q = 6, p = 5$, this problem only appears when $n > 2^{64 + p}$.
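Algorithms 4 and 5 can be sketched in Python as follows. This is an illustrative reading of the pseudocode rather than the authors' code: the function names are hypothetical, `fractions.Fraction` stands in for the BigInt arithmetic the text calls for (so only small parameters are practical), the collision correction is passed to `jaccard_index` as a plain argument `ec`, and `ratio_exact` / `ratio_approx` merely compare the two forms of the $n, m$ dependence quoted in item (2).

```python
from fractions import Fraction


def expected_collisions(n, m, p, q, r):
    """Algorithm 5 with exact rational arithmetic; n and m must be integers."""
    x = Fraction(0)
    for i in range(1, 2 ** q + 1):
        for j in range(1, 2 ** r + 1):
            if i != 2 ** q:
                d = 2 ** (p + r + i)
                b1, b2 = Fraction(2 ** r + j, d), Fraction(2 ** r + j + 1, d)
            else:
                d = 2 ** (p + r + i - 1)
                b1, b2 = Fraction(j, d), Fraction(j + 1, d)
            # Probability that the minimum of n (resp. m) hashes lands in [b1, b2).
            pr_x = (1 - b1) ** n - (1 - b2) ** n
            pr_y = (1 - b1) ** m - (1 - b2) ** m
            x += pr_x * pr_y
    return x * 2 ** p


def jaccard_index(s, t, ec=0):
    """Algorithm 4: matching buckets, minus expected collisions, over jointly filled buckets."""
    assert len(s) == len(t)
    c = sum(1 for a, b in zip(s, t) if a == b)
    filled = sum(1 for a, b in zip(s, t) if a != (0, 0) and b != (0, 0))
    if filled == 0:
        raise ValueError("no bucket is filled in both sketches")
    return (c - ec) / filled


def ratio_exact(n, m):
    """The factor 4nm / ((n + m)(n + m - 1)) from item (2)."""
    return Fraction(4 * n * m, (n + m) * (n + m - 1))


def ratio_approx(n, m):
    """Its large-n, large-m approximation (4n/m) / (1 + n/m)^2."""
    ratio = n / m
    return 4 * ratio / (1 + ratio) ** 2
```

For example, with two four-bucket sketches that agree in two buckets and are filled everywhere, `jaccard_index` returns 0.5 before any collision correction is applied.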


Notes on Random Variables, Expectations, Probability Densities, and Martingales

Notes on Random Variables, Expectations, Probability Densities, and Martingales Eco 315.2 Spring 2006 C.Sims Notes on Random Variables, Expectations, Probability Densities, and Martingales Includes Exercise Due Tuesday, April 4. For many or most of you, parts of these notes will be

More information

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment Anshumali Shrivastava and Ping Li Cornell University and Rutgers University WWW 25 Florence, Italy May 2st 25 Will Join

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Lecture 10. Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1

Lecture 10. Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1 Lecture 10 Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1 Recap Sublinear time algorithms Deterministic + exact: binary search Deterministic + inexact: estimating diameter in a

More information

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 41 Pulse Code Modulation (PCM) So, if you remember we have been talking

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

Connectedness. Proposition 2.2. The following are equivalent for a topological space (X, T ).

Connectedness. Proposition 2.2. The following are equivalent for a topological space (X, T ). Connectedness 1 Motivation Connectedness is the sort of topological property that students love. Its definition is intuitive and easy to understand, and it is a powerful tool in proofs of well-known results.

More information

Discrete Random Variable

Discrete Random Variable Discrete Random Variable Outcome of a random experiment need not to be a number. We are generally interested in some measurement or numerical attribute of the outcome, rather than the outcome itself. n

More information

Integer Linear Programs

Integer Linear Programs Lecture 2: Review, Linear Programming Relaxations Today we will talk about expressing combinatorial problems as mathematical programs, specifically Integer Linear Programs (ILPs). We then see what happens

More information

Finite Time Bounds for Temporal Difference Learning with Function Approximation: Problems with some state-of-the-art results

Finite Time Bounds for Temporal Difference Learning with Function Approximation: Problems with some state-of-the-art results Finite Time Bounds for Temporal Difference Learning with Function Approximation: Problems with some state-of-the-art results Chandrashekar Lakshmi Narayanan Csaba Szepesvári Abstract In all branches of

More information

Information Storage Capacity of Crossbar Switching Networks

Information Storage Capacity of Crossbar Switching Networks Information Storage Capacity of Crossbar Switching etworks ABSTRACT In this work we ask the fundamental uestion: How many bits of information can be stored in a crossbar switching network? The answer is

More information

Lecture 5: Computational Complexity

Lecture 5: Computational Complexity Lecture 5: Computational Complexity (3 units) Outline Computational complexity Decision problem, Classes N P and P. Polynomial reduction and Class N PC P = N P or P = N P? 1 / 22 The Goal of Computational

More information

Lecture 4 Noisy Channel Coding

Lecture 4 Noisy Channel Coding Lecture 4 Noisy Channel Coding I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw October 9, 2015 1 / 56 I-Hsiang Wang IT Lecture 4 The Channel Coding Problem

More information

Lecture 5. 1 Review (Pairwise Independence and Derandomization)

Lecture 5. 1 Review (Pairwise Independence and Derandomization) 6.842 Randomness and Computation September 20, 2017 Lecture 5 Lecturer: Ronitt Rubinfeld Scribe: Tom Kolokotrones 1 Review (Pairwise Independence and Derandomization) As we discussed last time, we can

More information

Essential facts about NP-completeness:

Essential facts about NP-completeness: CMPSCI611: NP Completeness Lecture 17 Essential facts about NP-completeness: Any NP-complete problem can be solved by a simple, but exponentially slow algorithm. We don t have polynomial-time solutions

More information

Homework 4 Solutions

Homework 4 Solutions CS 174: Combinatorics and Discrete Probability Fall 01 Homework 4 Solutions Problem 1. (Exercise 3.4 from MU 5 points) Recall the randomized algorithm discussed in class for finding the median of a set

More information

Counting. 1 Sum Rule. Example 1. Lecture Notes #1 Sept 24, Chris Piech CS 109

Counting. 1 Sum Rule. Example 1. Lecture Notes #1 Sept 24, Chris Piech CS 109 1 Chris Piech CS 109 Counting Lecture Notes #1 Sept 24, 2018 Based on a handout by Mehran Sahami with examples by Peter Norvig Although you may have thought you had a pretty good grasp on the notion of

More information