arxiv: v2 [cs.ds] 3 Nov 2017
HyperMinHash: Jaccard index sketching in LogLog space
Extended Abstract

YUN WILLIAM YU, Harvard Medical School DBMI
GRIFFIN M. WEBER, Harvard Medical School DBMI

In this extended abstract, we describe and analyse a streaming probabilistic sketch, HYPERMINHASH, to estimate the Jaccard index (or Jaccard similarity coefficient) over two sets $A$ and $B$. HyperMinHash can be thought of as a compression of standard MinHash by building off of a HyperLogLog count-distinct sketch. Given Jaccard index $\delta$, using $k$ buckets of size $O(\log(l) + \log\log(|A \cup B|))$ (in practice, typically 2 bytes) per set, HyperMinHash streams over $A$ and $B$ and generates an estimate of the Jaccard index $\delta$ with error $O(1/l + \sqrt{k/\delta})$. This improves on the best previously known sketch, MinHash, which requires the same number of storage units (buckets), but using $O(\log(|A \cup B|))$ bits per bucket. For instance, our new algorithm allows estimating Jaccard indices of 0.01 for set cardinalities on the order of $10^{19}$ with relative error of around 5% using 64KiB of memory; the previous state-of-the-art MinHash can only estimate Jaccard indices for cardinalities of $10^{10}$ with the same memory consumption. Alternately, one can think of HyperMinHash as an augmentation of b-bit MinHash that enables streaming updates, unions, and cardinality estimation (and thus intersection cardinality by way of Jaccard), while using log log extra bits.

CCS Concepts: • Theory of computation → Sketching and sampling;

Additional Key Words and Phrases: MinHash, sketching, streaming, loglog, Jaccard

ACM Reference Format:
Yun William Yu and Griffin M. Weber. 2017. HyperMinHash: Jaccard index sketching in LogLog space: Extended Abstract. 1, 1 (November 2017), 16 pages.

Corresponding author.
Authors' addresses: Yun William Yu, Harvard Medical School DBMI, 0 Shattuck St #3, Boston, Massachusetts, 05, william_yu@hms.harvard.edu; Griffin M. Weber, Harvard Medical School DBMI, 0 Shattuck St #3, Boston, Massachusetts, 05, griffin_weber@hms.harvard.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). © 2017 Copyright held by the owner/author(s). Publication date: November 2017.
1 INTRODUCTION

Many questions in data science can be rephrased in terms of the number of items in a database that satisfy some Boolean formula. For example, "how many participants in a political survey are independent and have a favorable view of the federal government?", or "how many of the source IPs used in a DDoS attack today were also used last month?" In this paper, we consider the design of approximate streaming sketches to answer questions phrased in conjunctive normal form (an AND of ORs); this is of course equivalent to estimating the cardinality of intersections of unions of a collection of sets. The literature already has near-optimal probabilistic data structures for approximating the count-distinct problem [1, 4, 7, 8], which is equivalent to finding the cardinality of unions of sets (i.e. ORs in CNF). Thus, we focus in particular on the problem of estimating the Jaccard index [9], a proxy for set similarity that, when coupled with union cardinality, allows estimation of intersection sizes.

1.1 Jaccard index

Given two sets $A$ and $B$, where $|A| = n$ and $|B| = m$, and $n > m$ without loss of generality, the Jaccard index is defined as

$$\delta(A, B) = \frac{|A \cap B|}{|A \cup B|}. \qquad (1)$$

Clearly, if paired with a good count-distinct union estimator for $|A \cup B|$, this allows us to estimate intersection sizes as well. Though Jaccard originally defined this index to measure ecological diversity in 1902 [9], in more modern times it has been used as a proxy for the document similarity problem. In 1997, Broder introduced min-wise hashing (colloquially known as "MinHash") [2], a technique for quickly estimating the resemblance of documents by looking at the Jaccard index of "shingles" (collections of phrases) contained within the documents.

1.2 MinHash

MinHash relies on a simple fact: if you apply a random permutation to the universe of elements, the chance that the smallest items under this permutation in sets $A$ and $B$ are the same is precisely the Jaccard index.
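As a concrete illustration of Equation (1), and of how a Jaccard index combines with a union cardinality to give an intersection size, the following minimal Python sketch computes both exactly on small sets. The function names `jaccard_index` and `intersection_size` are hypothetical helpers introduced here for illustration only; in the streaming setting both inputs would be estimates rather than exact values.

```python
def jaccard_index(a: set, b: set) -> float:
    """Exact Jaccard index |A ∩ B| / |A ∪ B| (defined as 0.0 for two empty sets)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def intersection_size(jaccard: float, union_cardinality: float) -> float:
    """Intersection size implied by a Jaccard index and a union cardinality."""
    return jaccard * union_cardinality


A = set(range(0, 60))      # 60 items
B = set(range(40, 100))    # 60 items, 20 of them shared with A
delta = jaccard_index(A, B)                       # 20 shared / 100 distinct
estimate = intersection_size(delta, len(A | B))   # recovers the 20 shared items
```

With exact inputs the intersection is recovered exactly; with sketched inputs, the error of the product is driven by the errors of both factors.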
To see why the minima coincide with probability equal to the Jaccard index, consider a random permutation of $A \cup B$. The minimum element will come from either $A \setminus B$, $B \setminus A$, or $A \cap B$, all disjoint sets. If the minimum element lies in $A \setminus B$, then $\min(A) \notin B$, so $\min(A) \neq \min(B)$; the same is of course true by symmetry for $B \setminus A$. Conversely, if $\min(A \cup B) \in A \cap B$, then clearly $\min(A) = \min(B)$. Because the permutation is random, every element has an equal probability of being the minimum, and thus

$$P(\min(A) = \min(B)) = \frac{|A \cap B|}{|A \cup B|}. \qquad (2)$$

While using a single random permutation produces an unbiased estimator of $\delta(A, B)$, it is a Bernoulli 0/1 random variable with high variance. So, instead of using a single permutation, one averages $k$ trials. The expected fraction of matches is also an unbiased estimator of the Jaccard index, but with variance decreased by a factor of $1/k$.

Though the theoretical justification is predicated on having a true random permutation, in practice we approximate that by using good random hash functions instead. A good hash function will specify a nearly total ordering on the universe of items, and provided we use $\Theta(\log(n))$ bits for the hash function output, the probability of accidental collision of min-hashes is exponentially small.

Though theoretically easy to analyze, this scheme has a number of drawbacks, chief amongst them the requirement of having $k$ random hash functions, which means that the computational complexity is $\Theta(nk)$ to generate the sketch. To address this, several variants of MinHash have been proposed [3]:

1) k-hash functions. The scheme described above, which has the shortcoming of using $\Theta(nk)$ computation to generate the sketch.
2) k-minimum values. A single hash function is used, but instead of storing the single minimum value, we store the smallest $k$ values for each set (also known as the KMV sketch [1]). Sketch generation time is reduced to $O(n \log k)$, but we also incur an $O(k \log k)$ sorting penalty when computing Jaccard index.

3) k-partition. Another one-permutation MinHash variant, k-partition stochastically averages by first deterministically partitioning a set into $k$ parts using the first couple bits of the hash value, and then stores the minimum value in each partition [11]. k-partition has the advantage of $O(n)$ sketch generation time and $O(k)$ Jaccard index computation time, at the cost of some difficulty in the analysis.

It is important here to remark that for all of the above variants, MinHash sketches of $A$ and $B$ can be losslessly combined to form the MinHash sketch of $A \cup B$. Using order statistics, it is additionally possible to estimate the union cardinalities [1], so we can directly estimate intersection size in addition to Jaccard index. This also implies that streaming updates are permitted, so preprocessing incurs no additional space requirement.

1.3 log log space complexity

All of the variants of MinHash given in the last section require logarithmic bits per bucket in order to prevent accidental collisions (i.e. we want to ensure that when two hashes match, they came from identical elements), though in the case of k-partition, some of those bits can be stored implicitly in the bucket identity. However, in the similar problem of cardinality estimation of unique items (the count-distinct problem), literature over the last several decades produced several streaming sketches that require sub-logarithmic bits per bucket; indeed, the LogLog, SuperLogLog, and HyperLogLog family of sketches requires, as given in the name, only $O(\log\log(n))$ bits per bucket by storing only the position of the first 1 bit of a uniform hash function [4, 7, 8].
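The k-partition variant and the lossless merging of MinHash sketches described in Section 1.2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`hash64`, `kpartition_sketch`, `merge`, `jaccard_estimate`) are hypothetical, the hash is SHA-1 truncated to 64 bits as an arbitrary choice, and empty buckets are represented by `None`.

```python
import hashlib


def hash64(item: str) -> int:
    """64-bit hash of an item (SHA-1 truncated; an arbitrary choice for this sketch)."""
    return int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")


def kpartition_sketch(items, p: int) -> list:
    """One-permutation MinHash: 2**p buckets selected by the top p hash bits,
    each holding the minimum of the remaining 64 - p bits (None if empty)."""
    sketch = [None] * (1 << p)
    for item in items:
        h = hash64(item)
        bucket = h >> (64 - p)
        value = h & ((1 << (64 - p)) - 1)
        if sketch[bucket] is None or value < sketch[bucket]:
            sketch[bucket] = value
    return sketch


def merge(s: list, t: list) -> list:
    """Sketch of the union: bucket-wise minimum, treating None as empty."""
    out = []
    for x, y in zip(s, t):
        if x is None:
            out.append(y)
        elif y is None:
            out.append(x)
        else:
            out.append(min(x, y))
    return out


def jaccard_estimate(s: list, t: list) -> float:
    """Fraction of non-empty buckets whose minima agree."""
    used = [(x, y) for x, y in zip(s, t) if x is not None or y is not None]
    if not used:
        return 0.0
    return sum(x == y for x, y in used) / len(used)


A = ["item%d" % i for i in range(0, 2000)]
B = ["item%d" % i for i in range(1000, 3000)]    # true Jaccard index is 1/3
sa, sb = kpartition_sketch(A, 8), kpartition_sketch(B, 8)
est = jaccard_estimate(sa, sb)
```

Each item is hashed once, so sketch generation is linear in the set size, and `merge(sa, sb)` equals the sketch built directly from the union, which is what makes streaming updates possible.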
We wanted to do the same thing for the Jaccard index problem. First note that HyperLogLog union cardinalities can be used to compute intersection cardinalities using the inclusion-exclusion principle, but that the relative error is then in the size of the union (as opposed to the size of the Jaccard index for MinHash) and compounds when taking the intersections of multiple sets; for small intersections, the error is often too great to be practically feasible. Notably, some newer cardinality estimation methods based on maximum-likelihood estimation are able to more directly access intersection sizes in HyperLogLog sketches, which can then be paired with union cardinality to estimate Jaccard index [5, 6]. However, this approach, while more sophisticated, is restricted to the information available in the HyperLogLog sketch itself, and seems empirically to be a constant order (< 3x) improvement over conventional inclusion-exclusion.

Alternately, when unions and streaming updates are not necessary, the more recent advance of b-bit MinHash [10] solves exactly the Jaccard index problem while using only a constant number of bits per bucket. b-bit MinHash operates in the same way as standard MinHash, but after computing the minimum hash value, stores only the lowest order $b$ bits. Indeed, for very large Jaccard similarity $\delta(A, B) > 0.5$, Li, et al. determined that even using 1 bit per bucket was asymptotically optimal. In general, for small Jaccard similarity, b-bit MinHash needs $\Omega(\log(1/\delta))$ bits, without any dependence on the sizes of the sets. For estimating the Jaccard similarity between exactly two sets, b-bit MinHash is nearly asymptotically optimal [12].

However, b-bit MinHash, while great for Jaccard index, loses many of the benefits of standard MinHash. Because b-bit MinHash only takes the lowest order $b$ bits of the minimum hash value after finding the minimum, it also requires $\log(n)$ bits per bucket during the sketch generation phase, the same as standard MinHash.
This also implies that sketches cannot be merged together, so union cardinalities cannot be estimated. Some of these shortcomings can be overcome by pairing a b-bit MinHash sketch with a HyperLogLog count-distinct sketch. HyperLogLog uses $\log\log(n)$ bits per bucket to estimate union cardinalities, so combined, these two allow for accurate estimation of intersection sizes, not just Jaccard index. However, this still does not permit the usage of unions or streaming updates, so more complex predicates (e.g. $(A \cup B) \cap C$) still cannot be evaluated, and the data structure still requires $O(\log(n) + \log\log(n))$ bits per bucket during the sketch generation phase.
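To illustrate why truncating minima to their lowest $b$ bits rules out merging, the following small Python sketch shows two pairs of full minimum hash values that share the same b-bit truncations yet lead to different truncated minima for the union. The helper `low_bits` and the specific values are hypothetical and chosen only for illustration.

```python
def low_bits(value: int, b: int) -> int:
    """Keep only the lowest-order b bits, as b-bit MinHash does after finding the minimum."""
    return value & ((1 << b) - 1)


b = 2

# Case 1: full minima 45 (0b101101) and 30 (0b011110); the union's minimum is 30.
min_a, min_b = 0b101101, 0b011110
case1_inputs = (low_bits(min_a, b), low_bits(min_b, b))   # stored b-bit values
case1_union = low_bits(min(min_a, min_b), b)              # what the union sketch should hold

# Case 2: full minima 1 (0b000001) and 30 (0b011110); the union's minimum is 1.
min_c, min_d = 0b000001, 0b011110
case2_inputs = (low_bits(min_c, b), low_bits(min_d, b))
case2_union = low_bits(min(min_c, min_d), b)
```

Both cases store the same pair of truncated values, but the correct union bucket differs, so no function of the truncated values alone can compute the union sketch. Full-precision minima do not have this problem, since the minimum over a union is the minimum of the two minima.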
We resolved this issue by building a new sketch, HyperMinHash, as a hybrid between HyperLogLog and k-partition MinHash. Using the same amount of space as b-bit MinHash + HyperLogLog, we achieve better streaming performance (using only $O(\log\log(n) + \log(1/\delta))$ bits per bucket at all stages of the process), the ability to take unions of sketches, and count-distinct cardinality estimation. In Table 1, we summarize some of the properties of the various methods we've described above.

2 INTUITION

MinHash works under the premise that two sets will have identical minimum value with probability equal to the Jaccard index, because they can only share a minimum value if that minimum value corresponds to a member of the intersection of those two sets. If we have a total ordering on the union of both sets, the fraction of equal buckets is an unbiased estimator for the Jaccard index. However, with limited precision hash functions, there is some chance of accidental collision, when the value does not correspond to a member of the intersection. In order to get close to a true total ordering, the space of potential hashes must be on the order of the size of the union, and thus we must store $O(\log n)$ bits.

Note, however, that the minimum of a collection of uniform $[0, 1]$ random variables $X_1, \ldots, X_n$ is much more likely to be a small number than a large one (the insight behind most count-distinct sketches [1]). HyperMinHash operates identically to MinHash, but instead of storing the minimum values with fixed precision, it effectively uses an adaptive precision that increases resolution when the values are smaller, by using initial loglog counters (from LogLog cardinality estimation) and then storing a fixed number of bits beyond that (similar in spirit to b-bit MinHash).
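As a rough illustration of this adaptive-precision idea, the following Python sketch splits a fixed-width hash into a partition index, a capped leading-one position, and the $r$ bits that follow. The function name `hyperminhash_encode`, the `width` parameter, and the example value are assumptions made for illustration; the cap of $2^q$ follows the $\min(\cdot, 2^q)$ convention of Definition 3.1 and is one possible reading of the capping rule.

```python
def hyperminhash_encode(h: int, p: int, q: int, r: int, width: int = 32):
    """Split a `width`-bit hash into (partition, leading-one position, r-bit tail).

    The first p bits select the partition; the leading-one position is counted
    within the remaining bits and capped at 2**q; the tail is the r bits that
    follow that position.
    """
    cap = 1 << q
    assert 0 <= h < (1 << width) and p + cap + r <= width
    bits = format(h, "0{}b".format(width))
    partition = int(bits[:p], 2) if p else 0
    rest = bits[p:]
    first_one = rest.find("1")                    # 0-based index, -1 if no 1 bit
    rho = cap if first_one == -1 else min(first_one + 1, cap)
    tail = rest[rho:rho + r]
    return partition, rho, (int(tail, 2) if r else 0)


# Example value: partition bits 01, then eleven zeros, a one, and the tail 11111000.
example = int(("01" + "0" * 11 + "1" + "11111000").ljust(32, "0"), 2)
encoded = hyperminhash_encode(example, p=2, q=4, r=8)

# A uniform hash lands in the same (partition, position, tail) cell with
# probability 2 ** -(p + rho + r) when the position is below the cap.
collision_probability = 2.0 ** -(2 + encoded[1] + 8)
```

Only about $q$ bits are needed to store the position and $r$ bits for the tail, yet the cell pinned down by the tuple shrinks as the minimum gets smaller, which is the adaptive resolution described above.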
More precisely, after dividing up the items into $k$ partitions, we store the position of the leading 1 bit in the first $2^q$ bits (and store $2^q + 1$ if there is no such 1 bit in the first $2^q$ bits), and $r$ bits following that (Figure 1). We do not need a total ordering so long as the number of accidental collisions in the minimum values is low.

To analyze the performance of HyperMinHash compared to random-permutation MinHash (or equivalently 0-collision standard MinHash) it suffices to consider the expected number of accidental collisions. In this intuitive analysis, we will only analyze the simple case of collisions while using only a single bucket, but the same flavor of argument holds for multiple partitions. The HyperLogLog part of the sketch results in collisions whenever two items match in order of magnitude (Figure 2a). By pairing it with an additional $r$-bit hash, our collision space is narrowed by a factor of about $2^r$ within each bucket. An explicit exact formula for the expected number of collisions is

$$\mathbb{E}C = \sum_{i=1}^{\infty} \sum_{j=0}^{2^r - 1} \left[\left(1 - \frac{2^r + j}{2^{i+r}}\right)^{n} - \left(1 - \frac{2^r + j + 1}{2^{i+r}}\right)^{n}\right] \left[\left(1 - \frac{2^r + j}{2^{i+r}}\right)^{m} - \left(1 - \frac{2^r + j + 1}{2^{i+r}}\right)^{m}\right],$$

though finding a closed formula is rather more difficult.

Intuitively, suppose that our hash value is $(12, 1111\,1000)$ for partition 01. This implies that the original bitstring of the minimum hash was $0.01\ 000000000001\ 11111000\cdots$. Then a uniform random hash in $[0, 1]$ collides with this number with probability $2^{-(2+12+8)} = 2^{-22}$. So we expect to need cardinalities on the order of $2^{22}$ before having many collisions. But of course, as the cardinalities of $A$ and $B$ increase, so does the expected value of the leading 1 in the bitstring, as analyzed in the construction of HyperLogLog [8]. Thus, the collision probabilities remain roughly constant as cardinalities increase, at least until we reach the precision limit of the LogLog counters.

But of course, we store only a finite number of bits for the leading 1 indicator (often 6 bits). Because it's a LogLog counter, storing 6 bits is sufficient for set cardinalities up to $O(2^{2^6}) = O(2^{64})$.
This increases our collision surface though, as we might have collisions in the lower left region near the origin (Figure 2c). We can directly compute the collision probability (and similarly the variance) by summing together the probability mass in these boxes, replacing the infinite sum with a finite sum (Lemma 3.6). For more sensitive estimations, we can subtract the expected number of collisions to debias the estimation. In the next section we will prove bounds on the expectation and variance in the number of collisions.

Table 1. Comparison of key features against other methods.

| Method | Bits per bucket | Unions | Jaccard index | Intersection size | Streaming updates |
|---|---|---|---|---|---|
| MinHash | $\log(n)$ | yes | yes | yes | yes |
| b-bit MinHash | $\log(1/\delta)$ | | yes | | |
| HyperLogLog | $\log\log(n)$ | yes | see note d | see note d | yes |
| HyperLogLog + MinHash | $\log(n) + \log\log(n)$ | yes | yes | yes | yes |
| HyperLogLog + b-bit MinHash | $\log\log(n) + \log(1/\delta)$ | | yes | yes | |
| HyperMinHash | $\log\log(n) + \log(1/\delta)$ | yes | yes | yes | yes |

Notes:
a) $n$ is the cardinality of the sets and $\delta$ is the Jaccard index, where applicable.
b) Where applicable, $\Theta(1/\epsilon^2)$ buckets are required for union cardinality estimation with relative error $\epsilon$.
c) All of the MinHash-based methods also require $\Theta(1/\delta)$ buckets to give accurate Jaccard indexes.
d) Jaccard index and intersection size can be directly computed from HyperLogLog through inclusion-exclusion or MLE [5], but errors are then dependent on union cardinality estimates, so relative error will be high for small intersections and complicated predicates.

[Figure 1: the objects in a set are hashed to binary values, the hashed values are split into four partitions, and the minimum of each partition is stored as a (leading-one position, next $r$ bits) tuple.]

Fig. 1. HyperMinHash generates sketches in the same fashion as one-permutation MinHash. It begins by hashing each object in the set to a uniformly random number between 0 and 1, encoded in binary. Then, the hashed values are partitioned by the first $p$ bits, and the minimum value within each partition is taken. Each value is specified by a tuple; the first part is the position of the leftmost 1 in the first $2^q$ bits, and $2^q + 1$ otherwise, so exactly identical to a HyperLogLog sketch. The second part is the value of the next $r$ bits in the bitstring. Note that this is mathematically equivalent to applying three independent hash functions for each bucket (for the green, blue, and red bits), or, alternately, to using a single hash function but dividing the bitstring into fixed-length regions first.

3 PROOFS

The main result of this paper bounds the expectation and variance of accidental collision, given two HyperMinHash sketches of disjoint sets. First, we rigorously define the full HyperMinHash sketch.

Definition 3.1. We will define $f_{p,q,r}(A) : S \to \{\{1, \ldots, 2^q\} \times \{0, 1\}^r\}^{2^p}$ to be the HyperMinHash sketch constructed from Figure 1, where $A$ is a set of hashable objects and $p, q, r \in \mathbb{N}$, and let $f_{p,q,r}(A)_i : S \to \{1, \ldots, 2^q\} \times \{0, 1\}^r$ be the value of the $i$th bucket in the sketch. More precisely, let $h(x) : S \to [0, 1]$ be a uniformly random hash function. Let $\rho_q(x) = \min(\lfloor -\log_2(x) \rfloor + 1, 2^q)$, $\sigma_r(x) = \lfloor x 2^r \rfloor$, and $\hat{h}_{q,r}(x) = \left(\rho_q(x), \sigma_r\left(x 2^{\rho_q(x)} - 1\right)\right)$. Then, we will define

$$f_{p,q,r}(A)_i = \hat{h}_{q,r}\left(\min_{a \in A,\ \frac{i}{2^p} < h(a) < \frac{i+1}{2^p}} h(a) 2^p - i\right).$$

Definition 3.2. Let $A, B$ be hashable sets with $|A| = n$, $|B| = m$, $n > m$, and $A \cap B = \emptyset$. Then define an indicator variable for collisions in bucket $i$ of their respective HyperMinHash sketches

$$Z_{p,q,r}(A, B, i) = \mathbb{1}_{\left(f_{p,q,r}(A)_i = f_{p,q,r}(B)_i\right)}.$$

Our main theorems follow:

Theorem 3.3. $C = \sum_{i=0}^{2^p - 1} Z_{p,q,r}(A, B, i)$ is the number of collisions between the HyperMinHash sketches of two disjoint sets $A$ and $B$. Then the expectation

$$\mathbb{E}C \leq 2^p \left(\frac{5}{2^r} + \frac{n}{2^{p + 2^q + r}}\right).$$

Theorem 3.4. Given the same setup as in Theorem 3.3, $\mathrm{Var}(C) \leq \mathbb{E}[C]^2 + \mathbb{E}[C]$.

Theorem 3.3 allows us to correct for the number of random collisions before computing the Jaccard index, and Theorem 3.4 tells us that the standard deviation in the number of collisions is approximately the expectation.

Before proving these theorems, we will start by proving a simpler proposition.

Proposition 3.5. Consider a HyperMinHash sketch with only 1 bucket on two disjoint sets $A$ and $B$, i.e. $f_{0,q,r}(A)$ and $f_{0,q,r}(B)$. Let $\gamma(n, m) \sim Z_{0,q,r}(A, B, 0)$. Naturally, as a good hash function results in uniform random variables, $\gamma$ is only dependent on the cardinalities $n$ and $m$.
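We claim that

$$\mathbb{E}\gamma(n, m) \leq \frac{6}{2^r} + \frac{n}{2^{2^q + r}}.$$

Proving this will require a few technical lemmas, which we'll then use to prove the main theorems.

Lemma 3.6.

$$\mathbb{E}\gamma(n, m) = P\left(f_{0,q,r}(A)_0 = f_{0,q,r}(B)_0\right) = \sum_{i=1}^{2^q - 1} \sum_{j=0}^{2^r - 1} \left[\left(1 - \frac{2^r + j}{2^{r+i}}\right)^{n} - \left(1 - \frac{2^r + j + 1}{2^{r+i}}\right)^{n}\right] \left[\left(1 - \frac{2^r + j}{2^{r+i}}\right)^{m} - \left(1 - \frac{2^r + j + 1}{2^{r+i}}\right)^{m}\right] + \sum_{i = 2^q} \sum_{j=0}^{2^r - 1} \left[\left(1 - \frac{j}{2^{r+i}}\right)^{n} - \left(1 - \frac{j + 1}{2^{r+i}}\right)^{n}\right] \left[\left(1 - \frac{j}{2^{r+i}}\right)^{m} - \left(1 - \frac{j + 1}{2^{r+i}}\right)^{m}\right].$$

Proof. Let $a_1, \ldots, a_n$ be random variables corresponding to the hashed values of items in $A$. Then the $a_i \in [0, 1]$ are uniform r.v. Similarly, $b_1, \ldots, b_m$, drawn from hashed values of $B$, are uniform $[0, 1]$ r.v. Let $x = \min\{a_1, \ldots, a_n\}$ and $y = \min\{b_1, \ldots, b_m\}$. Then we have probability density functions

$$\mathrm{pdf}(x) = n(1 - x)^{n-1}, \text{ for } x \in [0, 1], \qquad \mathrm{pdf}(y) = m(1 - y)^{m-1}, \text{ for } y \in [0, 1],$$

and cumulative distribution functions

$$\mathrm{cdf}(x) = 1 - (1 - x)^{n}, \text{ for } x \in [0, 1], \qquad \mathrm{cdf}(y) = 1 - (1 - y)^{m}, \text{ for } y \in [0, 1].$$

We are particularly interested in the joint probability density function

$$\mathrm{pdf}(x, y) = n(1 - x)^{n-1} m(1 - y)^{m-1}, \text{ for } (x, y) \in [0, 1]^2.$$

The probability mass enclosed in a square along the diagonal $S = [s_1, s_2]^2 \subset [0, 1]^2$ is then precisely

$$\mu(S) = \int_{s_1}^{s_2} \int_{s_1}^{s_2} n(1 - x)^{n-1} m(1 - y)^{m-1} \, dy \, dx = \left[(1 - s_1)^n - (1 - s_2)^n\right] \left[(1 - s_1)^m - (1 - s_2)^m\right]. \qquad (3)$$

Recall $f_{0,q,r}(A)_0 \in \{1, \ldots, 2^q\} \times \{0, 1\}^r \cong \{1, \ldots, 2^q\} \times \{0, \ldots, 2^r - 1\}$, so given $f_{0,q,r}(A)_0 = (i, j)$, the binary expansion of $x$ consists of $i - 1$ zeros, a 1, and then the $r$ bits of $j$, unless $i = 2^q$, in which case the binary expansion consists of $i$ zeros followed by the $r$ bits of $j$. That in turn gives $s_1 < x < s_2$, where $s_1 = \frac{2^r + j}{2^{r+i}}$, $s_2 = \frac{2^r + j + 1}{2^{r+i}}$ when $i < 2^q$, and $s_1 = \frac{j}{2^{r+i}}$, $s_2 = \frac{j + 1}{2^{r+i}}$ when $i = 2^q$. Collisions happen precisely when $s_1 < x, y < s_2$. Finally, using the $s_1, s_2$ formulas above, it suffices to sum the probability of collision over the image of $f$, so

$$\mathbb{E}\gamma(n, m) = \sum_{i=1}^{2^q} \sum_{j=0}^{2^r - 1} \mu\left([s_1, s_2]^2\right).$$

Substituting in for $s_1$, $s_2$, and $\mu$ completes the proof. Note also that this is precisely the sum of the probability mass in the red and purple squares along the diagonal in Figure 2c. □

While Lemma 3.6 allows us to explicitly compute $\mathbb{E}\gamma(n, m)$, the lack of a closed form solution makes reasoning about it difficult. Here, we will upper bound the expectation by integrating over four regions of the unit square that cover all the collision boxes (Figure 2d). For ease of notation, let $\hat{r} = 2^r$ and $\hat{q} = 2^{2^q}$.

- The Top Right box $TR = \left[\frac{\hat{r}}{\hat{r} + 1}, 1\right]^2$ (in orange in Figure 2d).
- The magenta triangle from the origin bounded by the lines $y = \frac{\hat{r}}{\hat{r} + 1} x$ and $y = \frac{\hat{r} + 1}{\hat{r}} x$ with $0 < x < \frac{\hat{r}}{\hat{r} + 1}$, which we will denote $RAY$.
- The black strip near the origin covering all the purple boxes except the one on the origin, bounded by the lines $y = x - \frac{1}{\hat{r}\hat{q}}$, $y = x + \frac{1}{\hat{r}\hat{q}}$, and $\frac{1}{\hat{r}\hat{q}} < x < \frac{1}{\hat{q}}$, which we will denote $STRIP$.
- The Bottom Left purple box $BL = \left[0, \frac{1}{\hat{r}\hat{q}}\right]^2$.

Lemma 3.7. The probability mass contained in the top right square is $\mu(TR) \leq \frac{1}{\hat{r}^2}$.

Proof. By Equation 3,

$$\mu(TR) = \int_{\frac{\hat{r}}{\hat{r}+1}}^{1} \int_{\frac{\hat{r}}{\hat{r}+1}}^{1} n(1 - x)^{n-1} m(1 - y)^{m-1} \, dy \, dx = \left[-(1 - x)^n\right]_{\frac{\hat{r}}{\hat{r}+1}}^{1} \left[-(1 - y)^m\right]_{\frac{\hat{r}}{\hat{r}+1}}^{1} = \frac{1}{(\hat{r} + 1)^{n+m}} \leq \frac{1}{\hat{r}^2}. \qquad □$$

Lemma 3.8. The probability mass contained in the bottom left square near the origin is $\mu(BL) \leq \frac{n}{\hat{r}\hat{q}}$.

Proof.

$$\mu(BL) = \int_{0}^{\frac{1}{\hat{r}\hat{q}}} \int_{0}^{\frac{1}{\hat{r}\hat{q}}} n(1 - x)^{n-1} m(1 - y)^{m-1} \, dy \, dx = \left[1 - \left(1 - \frac{1}{\hat{r}\hat{q}}\right)^{n}\right] \left[1 - \left(1 - \frac{1}{\hat{r}\hat{q}}\right)^{m}\right].$$

For $n, m < \hat{r}\hat{q}$, we note that the linear binomial approximation is actually a strict upper bound (as can be trivially verified through the Taylor expansion), so

$$\mu(BL) \leq \frac{nm}{(\hat{r}\hat{q})^2} \leq \frac{n}{\hat{r}\hat{q}}. \qquad □$$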
[Figure 2: four panels over the unit square, with the pdf of $\min(X_1, \ldots, X_n)$ on the horizontal axis and the pdf of $\min(Y_1, \ldots, Y_m)$ on the vertical axis; the panels mark the leftmost 1 indicator buckets, the $2^r$ subbuckets, the precision limit $2^q$, and the Top Right, Magenta Ray, Black Strip, and Bottom Left regions.]

Fig. 2. Visualization of collision probabilities for HyperMinHash.
(a) HyperLogLog sections, used alone, result in collisions whenever the minhashes match in order of magnitude.
(b) HyperMinHash further subdivides HyperLogLog leading 1-indicator buckets, achieving a much smaller collision space, so long as we precisely store the position of the leading 1.
(c) In practice, HyperMinHash has a limited number of bits for the loglog counters, so there is a final lower left bucket at the precision limit.
(d) We upper bound the collision probability of HyperMinHash by dividing it into four regions of integration: the Top Right orange box, the magenta ray covering intermediate boxes, the black strip covering all but the final purple box, and the final purple subbucket by the origin.
Lemma 3.9. The probability mass of the ray from the origin can be bounded $\mu(RAY) \leq \frac{3}{\hat{r}}$.

Proof. Unfortunately, the ray is not aligned to the axes, so we cannot integrate $x$ and $y$ separately.

$$\mu(RAY) = \int_{0}^{\frac{\hat{r}}{\hat{r}+1}} \int_{\frac{\hat{r}}{\hat{r}+1} x}^{\frac{\hat{r}+1}{\hat{r}} x} n(1 - x)^{n-1} m(1 - y)^{m-1} \, dy \, dx = \int_{0}^{\frac{\hat{r}}{\hat{r}+1}} n(1 - x)^{n-1} \left[\left(1 - \frac{\hat{r}}{\hat{r}+1} x\right)^{m} - \left(1 - \frac{\hat{r}+1}{\hat{r}} x\right)^{m}\right] dx.$$

Using the elementary difference of powers formula, note that for $0 \leq \alpha \leq \beta$,

$$\beta^m - \alpha^m = (\beta - \alpha) \sum_{i=1}^{m} \alpha^{m-i} \beta^{i-1} \leq (\beta - \alpha) m \beta^{m-1}.$$

With a bit of symbolic manipulation, we can conclude that

$$\mu(RAY) \leq \int_{0}^{\frac{\hat{r}}{\hat{r}+1}} n(1 - x)^{n-1} \left[\frac{2\hat{r} + 1}{\hat{r}(\hat{r} + 1)} x m \left(1 - \frac{\hat{r}}{\hat{r}+1} x\right)^{m-1}\right] dx \leq \frac{2\hat{r} + 1}{\hat{r}(\hat{r} + 1)} \int_{0}^{\frac{\hat{r}}{\hat{r}+1}} nm \left(1 - \frac{\hat{r}}{\hat{r}+1} x\right)^{n+m-2} x \, dx.$$

With a straightforward integration by parts, and then upper bounding negative terms by 0,

$$\mu(RAY) \leq \frac{(2\hat{r} + 1)(\hat{r} + 1)}{\hat{r}^3} \cdot \frac{nm}{(n + m)(n + m - 1)} \leq \frac{3}{\hat{r}}. \qquad □$$

Lemma 3.10. The probability mass of the diagonal strip near the origin is $\mu(STRIP) \leq \frac{2}{\hat{r}}$.

Proof. The mass of the strip is

$$\mu(STRIP) = \int_{\frac{1}{\hat{r}\hat{q}}}^{\frac{1}{\hat{q}}} \int_{x - \frac{1}{\hat{r}\hat{q}}}^{x + \frac{1}{\hat{r}\hat{q}}} n(1 - x)^{n-1} m(1 - y)^{m-1} \, dy \, dx,$$

and the same integration procedure and difference of powers formula used in the proof of Lemma 3.9 bound it by $\frac{2}{\hat{r}}$. □

Proof of Proposition 3.5. Summing bounds from Lemmas 3.7, 3.8, 3.9, and 3.10,

$$\mathbb{E}\gamma(n, m) \leq \frac{6}{\hat{r}} + \frac{n}{\hat{r}\hat{q}} = \frac{6}{2^r} + \frac{n}{2^{2^q + r}}. \qquad □$$

Proof of Theorem 3.3. Let $A_i, B_i$ be the $i$th partitions of $A$ and $B$ respectively. For ease of notation, let's define $\hat{p} = 2^p$. Recall that $C = \sum_{j=0}^{2^p - 1} Z_{p,q,r}(A, B, j)$. We will first bound $\mathbb{E}Z_{p,q,r}(A, B, j)$ using the same techniques used in Proposition 3.5. Notice first that $Z_{p,q,r}(A, B, j)$ effectively rescales the minimum hash values from $Z_{0,q,r}(A, B, j) = \gamma(n, m)$ down by a factor of $2^p$; i.e. we scale down both the axes in Figure 2d by substituting $\hat{q} \leftarrow \hat{q}\hat{p}$ in Lemmas 3.10 and 3.8. We do not need Lemma 3.7 because its box is already covered by the Magenta Ray from Lemma 3.9, which we do not scale. Summing these together, we readily conclude

$$\mathbb{E}Z_{p,q,r}(A, B, j) \leq \frac{5}{\hat{r}} + \frac{n}{\hat{r}\hat{q}\hat{p}} = \frac{5}{2^r} + \frac{n}{2^{p + 2^q + r}}.$$

Then by linearity of expectation, $\mathbb{E}C \leq 2^p \left[\frac{5}{2^r} + \frac{n}{2^{p + 2^q + r}}\right]$. □
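The single-bucket bound of Proposition 3.5 can be checked numerically against the sum of Lemma 3.6 for small parameters. The following Python sketch does this in exact rational arithmetic, which sidesteps floating point cancellation. It assumes the collision boxes are $\left[\frac{2^r + j}{2^{r+i}}, \frac{2^r + j + 1}{2^{r+i}}\right]^2$ for $i < 2^q$ and $\left[\frac{j}{2^{r+i}}, \frac{j+1}{2^{r+i}}\right]^2$ for $i = 2^q$, with $j$ running from 0 to $2^r - 1$; the function names are hypothetical and the parameter values are arbitrary.

```python
from fractions import Fraction


def box_mass(s1, s2, n, m):
    """Equation (3): mass of the square [s1, s2]^2 under the joint pdf of the two minima."""
    return ((1 - s1) ** n - (1 - s2) ** n) * ((1 - s1) ** m - (1 - s2) ** m)


def expected_collisions_single_bucket(n, m, q, r):
    """Sum of box masses for a one-bucket sketch (p = 0), in exact arithmetic."""
    cap = 2 ** q
    total = Fraction(0)
    for i in range(1, cap + 1):
        scale = 2 ** (r + i)
        offset = 0 if i == cap else 2 ** r     # the last level has no leading 1
        for j in range(2 ** r):
            s1 = Fraction(offset + j, scale)
            s2 = Fraction(offset + j + 1, scale)
            total += box_mass(s1, s2, n, m)
    return total


def proposition_bound(n, q, r):
    """Right-hand side of Proposition 3.5: 6 / 2^r + n / 2^(2^q + r)."""
    return Fraction(6, 2 ** r) + Fraction(n, 2 ** (2 ** q + r))


n, m, q, r = 50, 20, 2, 3
exact = expected_collisions_single_bucket(n, m, q, r)
bound = proposition_bound(n, q, r)
```

The loop has $2^q \cdot 2^r$ terms, so this direct evaluation is only practical for small $q$ and $r$; for these parameters the exact value sits well below the bound, as expected of an upper bound assembled from four overlapping regions.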
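Proof of Theorem 3.4. By conditioning on the multinomial distribution, we can decompose $C$ into

$$C = \sum_{\substack{\alpha_1 + \cdots + \alpha_{2^p} = n \\ \beta_1 + \cdots + \beta_{2^p} = m}} \mathbb{1}_{(\forall i, |A_i| = \alpha_i)} \mathbb{1}_{(\forall i, |B_i| = \beta_i)} \sum_{j=0}^{2^p - 1} Z_{p,q,r}\left(A, B, j \mid \forall i, |A_i| = \alpha_i, |B_i| = \beta_i\right).$$

For ease of notation in the following, we will use $\vec{\alpha}, \vec{\beta}$ to denote the events $\forall i, |A_i| = \alpha_i$ and $\forall i, |B_i| = \beta_i$ respectively. Additionally, let $Z(j) = Z_{p,q,r}(A, B, j)$. So,

$$C = \sum_{\vec{\alpha}, \vec{\beta}} \mathbb{1}_{\vec{\alpha}, \vec{\beta}} \sum_{j=0}^{2^p - 1} Z(j \mid \vec{\alpha}, \vec{\beta}).$$

Then

$$\mathrm{Var}(C) = \sum_{j, j' = 0}^{2^p - 1} \sum_{\vec{\alpha}, \vec{\beta}} \sum_{\vec{\alpha}', \vec{\beta}'} \mathrm{Cov}\left(\mathbb{1}_{\vec{\alpha}, \vec{\beta}} Z(j \mid \vec{\alpha}, \vec{\beta}),\ \mathbb{1}_{\vec{\alpha}', \vec{\beta}'} Z(j' \mid \vec{\alpha}', \vec{\beta}')\right).$$

But note that for $(\vec{\alpha}, \vec{\beta}) \neq (\vec{\alpha}', \vec{\beta}')$, $\mathbb{1}_{\vec{\alpha}, \vec{\beta}} = 1 \implies \mathbb{1}_{\vec{\alpha}', \vec{\beta}'} = 0$ and vice versa, because they are disjoint indicator variables. As such, for $(\vec{\alpha}, \vec{\beta}) \neq (\vec{\alpha}', \vec{\beta}')$, the covariance above is at most 0, implying that

$$\mathrm{Var}(C) \leq \sum_{\vec{\alpha}, \vec{\beta}} \sum_{j=0}^{2^p - 1} \mathrm{Var}\left(\mathbb{1}_{\vec{\alpha}, \vec{\beta}} Z(j \mid \vec{\alpha}, \vec{\beta})\right) + \sum_{\vec{\alpha}, \vec{\beta}} \sum_{\substack{j \neq j' \\ 0 \leq j, j' < 2^p}} \mathrm{Cov}\left(\mathbb{1}_{\vec{\alpha}, \vec{\beta}} Z(j \mid \vec{\alpha}, \vec{\beta}),\ \mathbb{1}_{\vec{\alpha}, \vec{\beta}} Z(j' \mid \vec{\alpha}, \vec{\beta})\right).$$

Note that the first term can be simplified, recalling that $Z$ is a $\{0, 1\}$ Bernoulli r.v., so

$$\sum_{\vec{\alpha}, \vec{\beta}} \sum_{j=0}^{2^p - 1} \mathrm{Var}\left(\mathbb{1}_{\vec{\alpha}, \vec{\beta}} Z(j \mid \vec{\alpha}, \vec{\beta})\right) \leq \sum_{\vec{\alpha}, \vec{\beta}} \sum_{j=0}^{2^p - 1} \mathbb{E}\left[\mathbb{1}_{\vec{\alpha}, \vec{\beta}} Z(j \mid \vec{\alpha}, \vec{\beta})\right] = \sum_{j=0}^{2^p - 1} \mathbb{E}[Z(j)] = \mathbb{E}C.$$

Moving on, from the covariance formula, for independent random variables $X_1, X_2, Y$,

$$\mathrm{Cov}(X_1 Y, X_2 Y) = \mathbb{E}[X_1 X_2 Y^2] - \mathbb{E}[X_1 Y]\mathbb{E}[X_2 Y] = \mathbb{E}[X_1]\mathbb{E}[X_2]\left(\mathbb{E}[Y^2] - \mathbb{E}[Y]^2\right) = \mathbb{E}[X_1]\mathbb{E}[X_2]\mathrm{Var}(Y).$$

Thus the second term of the summation can be bounded as follows:

$$\sum_{\vec{\alpha}, \vec{\beta}} \sum_{\substack{j \neq j' \\ 0 \leq j, j' < 2^p}} \mathbb{E}\left[Z(j \mid \vec{\alpha}, \vec{\beta})\right] \mathbb{E}\left[Z(j' \mid \vec{\alpha}, \vec{\beta})\right] \mathrm{Var}\left(\mathbb{1}_{\vec{\alpha}, \vec{\beta}}\right) \leq \sum_{\vec{\alpha}, \vec{\beta}} \mathbb{E}\left[C \mid \vec{\alpha}, \vec{\beta}\right]^2 \mathrm{Var}\left(\mathbb{1}_{\vec{\alpha}, \vec{\beta}}\right) = \sum_{\vec{\alpha}, \vec{\beta}} \mathbb{E}\left[C \mid \vec{\alpha}, \vec{\beta}\right]^2 P(\vec{\alpha}, \vec{\beta})\left(1 - P(\vec{\alpha}, \vec{\beta})\right) \leq \mathbb{E}[C]^2.$$

We conclude that $\mathrm{Var}(C) \leq \mathbb{E}[C]^2 + \mathbb{E}[C]$. □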
Fig. 3. For a fixed size sketch, HyperMinHash has better accuracy and/or cardinality range than MinHash. We compare Jaccard index estimation for identically sized sets with Jaccard index of 1/3 (i.e. 50% overlap), so the maximum relative error is 2, and plot the mean relative errors without estimated collision correction. (green circle) For a 256 byte HyperMinHash sketch, with 256 buckets of 8 bits each, 4 bits of which are allocated to the LogLog counter, Jaccard index estimation remains stable until cardinalities around $2^{23}$. (orange diamond) A 256 byte MinHash sketch with 256 buckets of 8 bits each achieves similar accuracy at low cardinalities, but fails once cardinalities approach $2^{14}$. (blue triangle) A 256 byte MinHash sketch with 128 buckets of 16 bits can access larger cardinalities of around $2^{20}$, but to do so trades off on low-cardinality accuracy.

4 EXPERIMENTAL VALIDATION

For completeness, we give experimental validation for the behavior of Jaccard index estimation on raw (Hyper)MinHash sketches with no expected collision correction. In Figure 3, we allocate 256 bytes each for two standard MinHash sketches and a HyperMinHash sketch. For fixed sketch size and cardinality range, HyperMinHash is more accurate; or, for fixed sketch size and bucket number, HyperMinHash can access exponentially larger set cardinalities. Pseudocode and a Python implementation are available (details in Appendix).

5 DISCUSSION AND CONCLUSION

We have introduced HyperMinHash, a sketch for estimating the Jaccard index using log log space, and made available a prototype Python implementation. It can be thought of as a compression scheme for MinHash that reduces the number of bits per bucket to $\log\log(n)$ from $\log(n)$ by using insights from HyperLogLog and b-bit MinHash. As with the original MinHash, it retains variance on the order of $k/\delta$, where $k$ is the number of buckets and $\delta$ is the Jaccard index between two sets.
However, it also introduces $1/l$ variance, where $l = 2^r$, because of the increased number of collisions, which matches the requirements of b-bit MinHash. For practical parameters of $p = 15$, $q = 6$, $r = 10$, the HyperMinHash sketch will use up 64KiB memory per set, and allow for estimating Jaccard indices of 0.01 for set cardinalities on the order of $10^{19}$ with accuracy around 5%. HyperMinHash is to our knowledge the first streaming summary sketch capable of directly estimating union cardinality, Jaccard index, and intersection cardinality in log log space, able to be applied to arbitrary Boolean formulas in conjunctive normal form with error rates bounded by the final result size.

ACKNOWLEDGMENTS

This study was supported by National Institutes of Health (NIH) Big Data to Knowledge (BD2K) awards U54HG from the National Human Genome Research Institute (NHGRI) and U0CA98934 from the National Cancer Institute (NCI). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. We additionally thank Daphne Ippolito and Adam Sealfon for useful comments and advice.

REFERENCES

[1] Ziv Bar-Yossef, TS Jayram, Ravi Kumar, D Sivakumar, and Luca Trevisan. 2002. Counting distinct elements in a data stream. In International Workshop on Randomization and Approximation Techniques in Computer Science. Springer.
[2] Andrei Z Broder. 1997. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings. IEEE.
[3] Edith Cohen. 2016. Min-Hash Sketches. (2016).
[4] Marianne Durand and Philippe Flajolet. 2003. Loglog counting of large cardinalities. In European Symposium on Algorithms. Springer.
[5] Otmar Ertl. 2017. New cardinality estimation algorithms for HyperLogLog sketches. arXiv preprint.
[6] Otmar Ertl. 2017. New Cardinality Estimation Methods for HyperLogLog Sketches. arXiv preprint.
[7] Philippe Flajolet. 2004. Counting by coin tossings. In Advances in Computer Science-ASIAN 2004. Higher-Level Decision Making. Springer.
[8] Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. 2007. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. In AofA: Analysis of Algorithms. Discrete Mathematics and Theoretical Computer Science.
[9] Paul Jaccard. 1902. Lois de distribution florale dans la zone alpine. Bull Soc Vaudoise Sci Nat 38 (1902).
[10] Ping Li and Christian König. 2010. b-bit minwise hashing. In Proceedings of the 19th international conference on World wide web. ACM.
[11] Ping Li, Art Owen, and Cun-Hui Zhang. 2012. One permutation hashing. In Advances in Neural Information Processing Systems.
[12] Rasmus Pagh, Morten Stöckel, and David P Woodruff. 2014. Is min-wise hashing optimal for summarizing set intersection?. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM.
13 A APPENDIX: HYPERMINHASH IN PRACTICE HyperMinHash: Jaccard index sketching in LogLog space :3 Here, we present full algorithms to match a naive implementation of HyperMinHash as described in the previous Theory section. A Python implementation is available at Algorithm HyperMinHash sketch. : Let h,h,h 3 : D [0, ] {0, } be three independent hash functions hashing data from domain D to the binary domain. In practice, we generally use a single Hash function, e.g. SHA-, and use different sets of bits for each of the three hashes). : Let ρs), for s {0, } be the position of the left-most -bit ρ000 ) = 4). 3: Let σs, n) for s {0, } be the left-most n bits of s σ00, 5) = 00). 4: function HyperMinHashA, p, q, r) 5: Let ĥx) = σh x),p). 6: Let ĥx) = minρh x)), q ). 7: Let ĥ3x) = σh 3 x), r). 8: Initialize p tuples B = B = = B p = 0, 0). 9: for a A do 0: if B h a)[0] < ĥa) then : Bĥ a) ĥa),ĥ3a)) : else 3: if Bĥ a) [0] = ĥa) and Bĥ a) [] > ĥ3a) then 4: Bĥ a) ĥa),ĥ3a)) 5: 6: 7: end for 8: return B,..., B p as B 9: end function A. Implementation optimizations We recommend several optimizations for practical implementations of HyperMinHash. First, it is mathematically equivalent to: ) Pack the hashed tuple into a single word; this enables Jaccard index computation while using only one comparison per bucket instead of two. ) Use the max instead of min of the subbuckets. This allows us take the union of two sketches while using only one comparison per bucket. These recommendations should be self-explanatory, and are simply minor engineering optimizations, which we do not use in our prototyping, as they do not affect accuracy. However, while we can exactly compute the number of expected collisions through Lemma 3.6, this computation is slow and often results in floating point errors unless BigInts are used because Algorithm 5 is exponential in r. In practice, two ready solutions present themselves: ) We can ignore the bias and simply add it to the error. 
As the bias and standard deviation of the error are the same order of magnitude, this only doubles the absolute error in the estimation of Jaccard index. For large Jaccard indexes, this does not matter.
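As an illustration of Algorithm 1, the following is a minimal Python sketch, not the authors' implementation. It assumes a single SHA-1 digest is split into three disjoint bit ranges to play the roles of $h_1$, $h_2$, $h_3$; the function name `hyperminhash_sketch`, the use of `str(item)` as the serialization, and the 0-based bucket indexing are all choices made here for illustration.

```python
import hashlib


def hyperminhash_sketch(items, p, q, r):
    """Return a list of 2**p (rho, sub) tuples built as in Algorithm 1."""
    rho_bits = 2 ** q      # bits inspected when locating the left-most 1-bit
    total_bits = 160       # length of a SHA-1 digest in bits
    assert p + rho_bits + r <= total_bits, "not enough hash bits"
    buckets = [(0, 0)] * (2 ** p)
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        bits = int.from_bytes(digest, "big")
        # h1-hat: the left-most p bits select the bucket.
        idx = bits >> (total_bits - p)
        # h2-hat: position of the left-most 1-bit in the next 2**q bits,
        # capped at 2**q when that window is all zeros.
        window = (bits >> (total_bits - p - rho_bits)) & ((1 << rho_bits) - 1)
        rho = min(rho_bits - window.bit_length() + 1, rho_bits)
        # h3-hat: the following r bits.
        sub = (bits >> (total_bits - p - rho_bits - r)) & ((1 << r) - 1)
        cur = buckets[idx]
        # Keep the largest rho; on ties keep the smallest sub-bucket value.
        if cur[0] < rho or (cur[0] == rho and cur[1] > sub):
            buckets[idx] = (rho, sub)
    return buckets
```

Because each bucket keeps an extremum, the result does not depend on the order in which items are streamed, which is what makes the sketch mergeable.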
Algorithm 2 HyperMinHash Union
function Union(S, T)
    assert $|S| = |T|$
    Initialize $|S|$ tuples $B_1 = B_2 = \dots = B_{|S|} = (0, 0)$.
    for $i \in \{1, \dots, |S|\}$ do
        if $S_i[0] > T_i[0]$ then
            $B_i \leftarrow S_i$
        else if $S_i[0] < T_i[0]$ then
            $B_i \leftarrow T_i$
        else if $S_i[0] = T_i[0]$ then
            if $S_i[1] < T_i[1]$ then
                $B_i \leftarrow S_i$
            else
                $B_i \leftarrow T_i$
            end if
        end if
    end for
    return $B_1, \dots, B_{|S|}$ as B
end function

Algorithm 3 Estimate Cardinality. Note that the left parts of the buckets can be passed directly into a HyperLogLog estimator. We can also use other k-minimum value count-distinct cardinality estimators, which we empirically found useful for large cardinalities.
function EstimateCardinality(S, p, q, r)
    Initialize $|S|$ integer registers $b_1 = b_2 = \dots = b_{|S|} = 0$.
    for $i \in \{1, \dots, |S|\}$ do
        $b_i \leftarrow S_i[0]$
    end for
    $R \leftarrow$ HyperLogLogCardinalityEstimator($\{b_i\}$, q)
    if $R < 1024\,|S|$ then
        return R
    else
        Initialize $|S|$ real registers $r_1, \dots, r_{|S|}$.
        for $i \in \{1, \dots, |S|\}$ do
            $r_i \leftarrow 2^{-S_i[0]} \left(1 + S_i[1] / 2^r\right)$
        end for
        if $\sum_i r_i = 0$ then
            return $\infty$
        else
            return $|S|^2 / \sum_i r_i$
        end if
    end if
end function
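Algorithm 2 and the second branch of Algorithm 3 can be sketched in Python as follows. This is a hedged illustration only: sketches are assumed to be lists of `(rho, sub)` tuples, the function names `union` and `minhash_style_cardinality` are hypothetical, and the register formula $r_i = 2^{-S_i[0]}(1 + S_i[1]/2^r)$ is taken from the pseudocode above. The HyperLogLog branch is deliberately left out.

```python
import math


def union(s, t):
    """Bucket-wise union of two sketches of equal length."""
    assert len(s) == len(t)
    merged = []
    for a, b in zip(s, t):
        if a[0] != b[0]:
            merged.append(a if a[0] > b[0] else b)  # larger rho wins
        else:
            merged.append(a if a[1] < b[1] else b)  # tie: smaller sub-bucket wins
    return merged


def minhash_style_cardinality(s, r):
    """Estimate |S|^2 / sum(r_i), with r_i = 2^-rho * (1 + sub / 2^r)."""
    total = sum(2.0 ** -rho * (1 + sub / 2 ** r) for rho, sub in s)
    return math.inf if total == 0 else len(s) ** 2 / total
```

For instance, `union([(3, 1), (2, 5)], [(2, 9), (2, 3)])` keeps `(3, 1)` in the first bucket because its left part is larger, and `(2, 3)` in the second because the left parts tie and 3 < 5.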
Algorithm 4 Compute Jaccard Index. Note that the correction factor EC is generally not needed, except for very small Jaccard indices. Additionally, for most practical purposes, it is safe to substitute ApproximateExpectedCollisions for ExpectedCollisions.
function JaccardIndex(S, T, p, q, r)
    assert $|S| = |T|$
    $C \leftarrow 0$, $N \leftarrow 0$
    for $i \in \{1, \dots, |S|\}$ do
        if $S_i = T_i$ then
            $C \leftarrow C + 1$
        end if
        if $S_i \neq (0, 0)$ and $T_i \neq (0, 0)$ then
            $N \leftarrow N + 1$
        end if
    end for
    $n \leftarrow$ EstimateCardinality(S, p, q, r)
    $m \leftarrow$ EstimateCardinality(T, p, q, r)
    $EC \leftarrow$ [Approximate]ExpectedCollisions(n, m, p, q, r)
    return $(C - EC) / N$
end function

Algorithm 5 Expected collisions. Note that because of floating point error, BigInts must be used for large n and m.
function ExpectedCollisions(n, m, p, q, r)
    $x \leftarrow 0$
    for $i \in \{1, \dots, 2^q\}$ do
        for $j \in \{1, \dots, 2^r\}$ do
            if $i \neq 2^q$ then
                $b_1 \leftarrow \frac{2^r + j}{2^{p+r+i}}$, $b_2 \leftarrow \frac{2^r + j + 1}{2^{p+r+i}}$
            else
                $b_1 \leftarrow \frac{j}{2^{p+r+i-1}}$, $b_2 \leftarrow \frac{j + 1}{2^{p+r+i-1}}$
            end if
            $\Pr_x \leftarrow (1 - b_1)^n - (1 - b_2)^n$
            $\Pr_y \leftarrow (1 - b_1)^m - (1 - b_2)^m$
            $x \leftarrow x + \Pr_x \Pr_y$
        end for
    end for
    return $x \cdot 2^p$
end function

(2) We also present a fast, numerically stable algorithm to approximate the expected number of collisions (Algorithm 6), which is empirically asymptotically correct.
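The bucket-counting part of Algorithm 4 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: sketches are lists of `(rho, sub)` tuples, the expected-collision correction is passed in as a plain number (defaulting to 0, since the text notes it is generally not needed), and, as an assumption added here, buckets that are empty in both sketches are not counted as matches and a sketch pair with no filled buckets returns 0.

```python
def jaccard_index(s, t, expected_collisions=0.0):
    """Estimate the Jaccard index from two sketches of equal length."""
    assert len(s) == len(t)
    empty = (0, 0)
    matches = 0  # C in Algorithm 4
    filled = 0   # N in Algorithm 4
    for a, b in zip(s, t):
        # Assumption: two empty buckets are not treated as a collision.
        if a == b and a != empty:
            matches += 1
        if a != empty and b != empty:
            filled += 1
    if filled == 0:
        return 0.0  # assumption: no evidence of overlap in empty sketches
    return (matches - expected_collisions) / filled
```

With `s = [(3, 1), (2, 5), (0, 0), (4, 7)]` and `t = [(3, 1), (2, 3), (1, 2), (4, 7)]`, two of the three buckets filled in both sketches match, so the uncorrected estimate is 2/3.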
Algorithm 6 Fast numerically stable approximation to Algorithm 5. Generally underestimates collisions.
function ApproximateExpectedCollisions(n, m, p, q, r)
    if $n < m$ then
        SWAP(n, m)
    end if
    if $n > 2^{2^q + r}$ then
        return ERROR: cardinality too large for approximation.
    else if $n > 2^{p+5}$ then
        $\phi \leftarrow \frac{4\,n/m}{(1 + n/m)^2}$
        return $2^{p-r}\,\phi$
    else
        return ExpectedCollisions(n, m, p, q, 0) $/\, 2^r$
    end if
end function

(1) For $n < 2^{p+5}$, we approximate by taking the number of expected HyperLogLog collisions and dividing it by $2^r$. In each HyperLogLog box, we are interested in collisions along the $2^r$ boxes along the diagonal. For this approximation, we simply assume that the joint probability density function is almost uniform within the box; this is not completely accurate, but pretty close in practice.
(2) For $2^{p+5} < n < 2^{2^q + p}$, we noted empirically that the expected number of collisions approached $2^{p-r}$ for $n = m$ as $n \to \infty$. Furthermore, the number of collisions is dependent on n and m by a factor of $\frac{4nm}{(n+m)(n+m-1)}$ from 3.9, which for $n, m \gg 1$ can be approximated by $\frac{4\,n/m}{(1 + n/m)^2}$. This approximation is primarily needed because of floating point errors when $n \to \infty$.
(3) Unfortunately, around $n > 2^{2^q + p}$, the number of collisions starts increasing and these approximations fail. However, note that for reasonable values of $q = 6$, $p = 15$, this problem only appears when $n > 2^{2^q + p} = 2^{79}$.
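The two forms of the collision factor in item (2) can be compared numerically. The short Python sketch below uses hypothetical function names and only evaluates the two expressions quoted above: it shows that they differ noticeably for tiny cardinalities but agree closely once n and m are large, since $\frac{4nm}{(n+m)^2} = \frac{4\,n/m}{(1+n/m)^2}$ and the remaining $-1$ in the exact denominator becomes negligible.

```python
def collision_factor_exact(n, m):
    """Factor 4nm / ((n + m)(n + m - 1)) from the text."""
    return 4 * n * m / ((n + m) * (n + m - 1))


def collision_factor_approx(n, m):
    """Large-cardinality approximation 4(n/m) / (1 + n/m)^2."""
    ratio = n / m
    return 4 * ratio / (1 + ratio) ** 2


if __name__ == "__main__":
    for n, m in [(2, 1), (1000, 500), (10**6, 10**6)]:
        exact = collision_factor_exact(n, m)
        approx = collision_factor_approx(n, m)
        print(n, m, exact, approx, abs(exact - approx))
```

The approximation depends only on the ratio n/m, which is why it stays numerically stable for very large cardinalities.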