Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence. Mikkel Thorup University of Copenhagen

Similar documents
Tabulation-Based Hashing

Priority sampling for estimation of arbitrary subset sums

Nearest Neighbor Classification Using Bottom-k Sketches

Some notes on streaming algorithms continued

Leveraging Big Data: Lecture 13

Basics of hashing: k-independence and applications

Lecture 4 Thursday Sep 11, 2014

A Tight Lower Bound for Dynamic Membership in the External Memory Model

Stream sampling for variance-optimal estimation of subset sums

Estimating arbitrary subset sums with few probes

Computing the Entropy of a Stream

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

arxiv: v3 [cs.ds] 11 May 2017

arxiv: v1 [cs.ds] 3 Feb 2018

Tight Bounds for Sliding Bloom Filters

Lecture Lecture 3 Tuesday Sep 09, 2014

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32

Problem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15)

The space complexity of approximating the frequency moments

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

arxiv: v5 [cs.ds] 3 Jan 2017

The Magic of Random Sampling: From Surveys to Big Data

CS369N: Beyond Worst-Case Analysis Lecture #6: Pseudorandom Data and Universal Hashing

Hash tables. Hash tables

Cuckoo Hashing with a Stash: Alternative Analysis, Simple Hash Functions

CS 591, Lecture 7 Data Analytics: Theory and Applications Boston University

Balanced Allocation Through Random Walk

Solutions to COMP9334 Week 8 Sample Problems

The Power of Simple Tabulation Hashing

14.1 Finding frequent elements in stream

Entropy. Probability and Computing. Presentation 22. Probability and Computing Presentation 22 Entropy 1/39

Lecture 2. Frequency problems

Lecture 2: Streaming Algorithms

arxiv: v2 [cs.ds] 4 Feb 2015

As mentioned, we will relax the conditions of our dictionary data structure. The relaxations we

1 Estimating Frequency Moments in Streams

Bloom Filters, general theory and variants

Space-Efficient Estimation of Statistics over Sub-Sampled Streams

2 How many distinct elements are in a stream?

A Model for Learned Bloom Filters, and Optimizing by Sandwiching

Introduction to Hash Tables

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment

Finding similar items

Cryptographic Engineering

Lecture 5: Two-point Sampling

Cache-Oblivious Hashing

Big Data. Big data arises in many forms: Common themes:

Hashing. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing. Philip Bille

Hashing. Hashing. Dictionaries. Dictionaries. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing

Methods of evaluating estimators and best unbiased estimators Hamid R. Rabiee

CS6931 Database Seminar. Lecture 6: Set Operations on Massive Data

Lecture 10. Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1

12 Hash Tables Introduction Chaining. Lecture 12: Hash Tables [Fa 10]

Balls and Bins: Smaller Hash Families and Faster Evaluation

Minwise hashing for large-scale regression and classification with sparse data

Bloom Filters, Minhashes, and Other Random Stuff

Power of d Choices with Simple Tabulation

6.842 Randomness and Computation Lecture 5

Optimal Combination of Sampled Network Measurements

Frequency Estimators

Higher Cell Probe Lower Bounds for Evaluating Polynomials

Piazza Recitation session: Review of linear algebra Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here)

Advanced Algorithm Design: Hashing and Applications to Compact Data Representation

Lecture Note 2. 1 Bonferroni Principle. 1.1 Idea. 1.2 Want. Material covered today is from Chapter 1 and chapter 4

Ping Li Compressed Counting, Data Streams March 2009 DIMACS Compressed Counting. Ping Li

Tight Bounds for Distributed Functional Monitoring

Worst Case Analysis : Our Strength is Our Weakness. Michael Mitzenmacher Harvard University

Lecture 4: Two-point Sampling, Coupon Collector s problem

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

arxiv: v2 [cs.ds] 15 Nov 2010

Set Similarity Search Beyond MinHash

Discrete Mathematics and Probability Theory Spring 2014 Anant Sahai Note 18

Math 24 Spring 2012 Questions (mostly) from the Textbook

Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream

7 Algorithms for Massive Data Problems

V.4 MapReduce. 1. System Architecture 2. Programming Model 3. Hadoop. Based on MRS Chapter 4 and RU Chapter 2 IR&DM 13/ 14 !74

COS597D: Information Theory in Computer Science October 5, Lecture 6

Lecture 12: Hash Tables [Sp 15]

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing

Charging from Sampled Network Usage

Goldreich-Levin Hardcore Predicate. Lecture 28: List Decoding Hadamard Code and Goldreich-L

Lecture 13: Seed-Dependent Key Derivation

Testing k-wise Independence over Streaming Data

Hash Tables. Given a set of possible keys U, such that U = u and a table of m entries, a Hash function h is a

2WB05 Simulation Lecture 7: Output analysis

CS 591, Lecture 6 Data Analytics: Theory and Applications Boston University

Dynamic Call Center Routing Policies Using Call Waiting and Agent Idle Times Online Supplement

Cell-Probe Lower Bounds for Prefix Sums and Matching Brackets

Discrete Mathematics and Probability Theory Fall 2015 Note 20. A Brief Introduction to Continuous Probability

6 Filtering and Streaming

Lecture 5: Hashing. David Woodruff Carnegie Mellon University

Lecture 24: Bloom Filters. Wednesday, June 2, 2010

Fast Moment Estimation in Data Streams in Optimal Space

Transcription:

Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence Mikkel Thorup University of Copenhagen

Min-wise hashing [Broder '98, AltaVista]

The Jaccard similarity of sets A and B is $f(A,B) = |A \cap B| / |A \cup B|$. With a truly random hash function h,
$$f(A,B) = \Pr_h[\min h(A) = \min h(B)].$$

For concentration, repeat k times: k-mins uses k independent hash functions $h_1, \dots, h_k$. For a set A, store the signature $M_k(A) = (\min h_1(A), \dots, \min h_k(A))$.

To estimate the Jaccard similarity f(A,B) of A and B, use
$$|M_k(A) \cap M_k(B)| / k = \sum_{i=1}^{k} [\min h_i(A) = \min h_i(B)] \, / \, k.$$

The expected relative error is below $1/\sqrt{f(A,B)\, k}$.
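
To make the k-mins scheme concrete, here is a minimal C++ sketch (ours, not from the slides). The helper names kmins_signature, jaccard_estimate and make_hashes are hypothetical, and each hash function is a multiply-shift hash seeded from one RNG, standing in for the independent hash functions assumed above.

    #include <algorithm>
    #include <cstdint>
    #include <limits>
    #include <random>
    #include <vector>

    // Multiply-shift hash: the upper 32 bits of a*x + b (see the later "Practice" slide).
    struct MultShiftHash {
        uint64_t a, b;
        uint32_t operator()(uint32_t x) const { return uint32_t((a * x + b) >> 32); }
    };

    // k hash functions, seeded from one RNG.
    std::vector<MultShiftHash> make_hashes(std::size_t k, uint64_t seed) {
        std::mt19937_64 rng(seed);
        std::vector<MultShiftHash> h(k);
        for (auto& f : h) { f.a = rng(); f.b = rng(); }
        return h;
    }

    // k-mins signature M_k(A) = (min h_1(A), ..., min h_k(A)).
    std::vector<uint32_t> kmins_signature(const std::vector<uint32_t>& A,
                                          const std::vector<MultShiftHash>& h) {
        std::vector<uint32_t> sig(h.size(), std::numeric_limits<uint32_t>::max());
        for (uint32_t x : A)
            for (std::size_t i = 0; i < h.size(); ++i)
                sig[i] = std::min(sig[i], h[i](x));
        return sig;
    }

    // Jaccard estimate: the fraction of coordinates where the two minima agree.
    double jaccard_estimate(const std::vector<uint32_t>& sigA,
                            const std::vector<uint32_t>& sigB) {
        std::size_t agree = 0;
        for (std::size_t i = 0; i < sigA.size(); ++i) agree += (sigA[i] == sigB[i]);
        return double(agree) / double(sigA.size());
    }

By the error bound above, with k coordinates the estimate has expected relative error roughly $1/\sqrt{f(A,B)\,k}$ when the hash functions are close to truly random; the bias discussion on the following slide is exactly about what happens when they are not.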

Bias issues

We do not have space for truly random hash functions. We say h is ε-minwise independent if for every set S and every x ∈ S,
$$\Pr_{h \in H}[h(x) = \min h(S)] = \frac{1 \pm \varepsilon}{|S|}.$$

Getting bias ε requires independence $\Theta(\lg \tfrac{1}{\varepsilon})$ [Indyk '99, Patrascu-Thorup '10].

The bias ε is not reduced by k-mins, no matter the number of repetitions k.

Practice

Mitzenmacher and Vadhan [SODA '08]: with enough entropy in the input, 2-independent hashing works as well as truly random hashing, but the real world is full of low-entropy data, e.g., consecutive numbers.

[Plot: 90% fractile, single experiment, 10% fractile, and real value; x-axis 0-100,000.]

Bottom-k

With a single hash function h, the bottom-k signature of a set X is
$$S_k(X) = \{\text{the } k \text{ elements of } X \text{ with smallest hash values}\}.$$

For a subset Y ⊆ X, estimate the frequency $f(Y,X) = |Y|/|X|$ as $|S_k(X) \cap Y| / k$.

For sets A and B, the signature of the union A ∪ B is computed as
$$S_k(A \cup B) = S_k(S_k(A) \cup S_k(B)),$$
and the Jaccard similarity f(A,B) is estimated as
$$|S_k(A \cup B) \cap S_k(A) \cap S_k(B)| / k.$$

Our result:

Theorem. If h is 2-independent then, even including bias, the expected relative error is $O(1/\sqrt{f k})$.

Porat proved this for 8-independent hashing and k ≥ 1.
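
A minimal sketch of the bottom-k signature and the two estimators above, using a single 32-bit hash function (ours, for illustration; bottom_k, frequency_estimate and jaccard_estimate are hypothetical names, and elements are assumed to have distinct hash values):

    #include <algorithm>
    #include <cstdint>
    #include <set>
    #include <vector>

    using Hash = uint32_t (*)(uint32_t);

    // S_k(X): the k elements of X with smallest hash values.
    std::vector<uint32_t> bottom_k(std::vector<uint32_t> X, Hash h, std::size_t k) {
        std::sort(X.begin(), X.end(),
                  [&](uint32_t x, uint32_t y) { return h(x) < h(y); });
        if (X.size() > k) X.resize(k);
        return X;
    }

    // Frequency estimate: f(Y, X) = |Y|/|X| is estimated as |S_k(X) ∩ Y| / k.
    double frequency_estimate(const std::vector<uint32_t>& SX,
                              const std::set<uint32_t>& Y, std::size_t k) {
        std::size_t hits = 0;
        for (uint32_t x : SX) hits += Y.count(x);
        return double(hits) / double(k);
    }

    // Jaccard estimate: form S_k(A ∪ B) = S_k(S_k(A) ∪ S_k(B)), then
    // count how many of its elements lie in both S_k(A) and S_k(B).
    double jaccard_estimate(const std::vector<uint32_t>& SA,
                            const std::vector<uint32_t>& SB, Hash h, std::size_t k) {
        std::vector<uint32_t> uni(SA);
        uni.insert(uni.end(), SB.begin(), SB.end());
        std::sort(uni.begin(), uni.end());
        uni.erase(std::unique(uni.begin(), uni.end()), uni.end());
        std::vector<uint32_t> SU = bottom_k(uni, h, k);
        std::set<uint32_t> inA(SA.begin(), SA.end()), inB(SB.begin(), SB.end());
        std::size_t hits = 0;
        for (uint32_t x : SU) hits += (inA.count(x) && inB.count(x));
        return double(hits) / double(k);
    }

Unlike k-mins, only one hash function is evaluated per element, and the theorem above says that a 2-independent choice of h (such as the one on the next slide) already gives expected relative error $O(1/\sqrt{fk})$.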

Practice

We can use Dietzfelbinger's super fast 2-independent hashing (a self-contained version follows below):

    random int64 a, b;
    int32 h(int32 x) { return (a*x + b) >> 32; }

[Plots: bottom-k (left) and k-mins (right); each shows the 90% fractile, a single experiment, the 10% fractile, and the real value; x-axis 0-100,000.]
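
A self-contained C++ version of the two-line hash above (a sketch; the seeding via std::mt19937_64 and the exact integer types are our additions):

    #include <cstdint>
    #include <random>

    // Dietzfelbinger-style multiply-shift hashing: with a, b random 64-bit values,
    // x -> upper 32 bits of a*x + b is a 2-independent hash of 32-bit keys.
    static uint64_t a, b;

    void seed_hash(uint64_t seed) {
        std::mt19937_64 rng(seed);
        a = rng();
        b = rng();
    }

    uint32_t h(uint32_t x) {
        return uint32_t((a * uint64_t(x) + b) >> 32);
    }

When hash values in (0,1) are needed, as in the analysis below, one can map the 32-bit output to (h(x) + 0.5) / 2^32.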

A Simple Union Bound

With n = |X|, S = S_k(X), and Y ⊆ X, estimate f = |Y|/|X| as |Y ∩ S| / k.

Overestimates: for parameters a < 1 and b, we bound the probability of
$$(\ast)\qquad |Y \cap S| > \frac{1+b}{1-a}\, f k.$$

Define a threshold probability (independent of the random choices) $p = \frac{k}{n(1-a)}$. Hash values are uniform in (0,1), so $\Pr[h(x) < p] = p$.

Proposition. The overestimate (∗) implies at least one of
(A) $|\{x \in X \mid h(x) < p\}| < k$ (the count has expectation $k/(1-a)$)
(B) $|\{x \in Y \mid h(x) < p\}| > (1+b)\, p\, |Y|$ (the count has expectation $p\,|Y|$),
so $\Pr[(\ast)] \le \Pr[(A)] + \Pr[(B)]$.
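
For completeness, a short worked check of the two expectations quoted in the proposition; it uses only the choice $p = \frac{k}{n(1-a)}$ together with $|X| = n$ and $|Y| = f n$:
$$E\bigl[\,|\{x \in X \mid h(x) < p\}|\,\bigr] = n\,p = \frac{k}{1-a}, \qquad
E\bigl[\,|\{x \in Y \mid h(x) < p\}|\,\bigr] = |Y|\,p = \frac{f n \cdot k}{n(1-a)} = \frac{f k}{1-a}.$$
In particular, the bound in (B) is $(1+b)\,p\,|Y| = \frac{(1+b)\,f k}{1-a}$, exactly the quantity appearing in the overestimate (∗).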

Proof: ¬(A) ∧ ¬(B) ⟹ ¬(∗)

Suppose (A) is false, that is, $|\{x \in X \mid h(x) < p\}| \ge k$. Then, since S contains the k smallest hash values, every x ∈ S has h(x) < p, so
$$|Y \cap S| \le |\{x \in Y \mid h(x) < p\}|.$$

Suppose (B) is also false, that is, $|\{x \in Y \mid h(x) < p\}| < (1+b)\, p\, |Y|$. With $p = k/(n(1-a))$ and $|Y| = f n$, we get
$$|Y \cap S| \le |\{x \in Y \mid h(x) < p\}| < \frac{1+b}{1-a}\, f k,$$
so the overestimate (∗) did not happen.

2-independence

Proposition. With $p = \frac{k}{n(1-a)}$, the overestimate
$$(\ast)\qquad |Y \cap S| > \frac{1+b}{1-a}\, f k$$
implies at least one of
(A) $|\{x \in X \mid h(x) < p\}| < k$ (the count has expectation $k/(1-a)$)
(B) $|\{x \in Y \mid h(x) < p\}| > (1+b)\, p\, |Y|$ (the count has expectation $p\,|Y|$),
so $\Pr[(\ast)] \le \Pr[(A)] + \Pr[(B)]$.

Application. If h is 2-independent then, by Chebyshev's inequality,
$$\Pr[(A)] \le \frac{1}{a^2 k}, \qquad \Pr[(B)] \le \frac{1}{b^2 f k}.$$

For any ε < 1/3, an appropriate choice of a and b yields
$$\Pr\bigl[\, |Y \cap S| > (1+\varepsilon)\, f k \,\bigr] \le \frac{4}{\varepsilon^2 f k}.$$

With more calculations, $E\bigl[\, \bigl|\, |Y \cap S| - f k \,\bigr| \,\bigr] = O(\sqrt{f k})$.
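
A sketch of the Chebyshev step behind these two bounds; the only fact used is that, for 2-independent h, the indicators $[h(x) < p]$ are pairwise independent, so the variance of their sum is at most its expectation. With $C_X = |\{x \in X \mid h(x) < p\}|$ we have $E[C_X] = \frac{k}{1-a}$ and $\mathrm{Var}[C_X] \le E[C_X]$, and (A) requires $C_X$ to fall at least $\frac{k}{1-a} - k = \frac{a k}{1-a}$ below its mean, so
$$\Pr[(A)] \le \frac{\mathrm{Var}[C_X]}{(a k/(1-a))^2} \le \frac{1-a}{a^2 k} \le \frac{1}{a^2 k}.$$
Likewise, with $C_Y = |\{x \in Y \mid h(x) < p\}|$, $E[C_Y] = p\,|Y| = \frac{f k}{1-a}$, and (B) requires $C_Y$ to exceed its mean by at least $b\,p\,|Y|$, so
$$\Pr[(B)] \le \frac{\mathrm{Var}[C_Y]}{(b\, p\,|Y|)^2} \le \frac{1}{b^2\, p\,|Y|} = \frac{1-a}{b^2 f k} \le \frac{1}{b^2 f k}.$$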

Conclusion

Both k-mins and bottom-k generalize the idea of storing the smallest hash value. With limited independence, k-mins has major problems with bias, whereas bottom-k works perfectly even with 2-independence.

[Plots: bottom-k (left) and k-mins (right); each shows the 90% fractile, a single experiment, the 10% fractile, and the real value; x-axis 0-100,000.]

Technically, the proof is very simple. Similar results hold for weighted sampling, but the proofs are much, much harder.

Priority sampling of weighted items

A bottom/top-k sampling scheme for weighted items [Duffield, Lund, Thorup SIGMETRICS '04, JACM '07].

Stream of items i, each with a weight $w_i$:
a uniformly random hash $h(i) \in (0,1)$ gives each item a priority $q_i = w_i / h(i)$ (priorities assumed distinct).

The priority sample S of size k (maintained in a priority queue) contains the k items of highest priority.
The threshold τ is the (k+1)-th highest priority; then $q_i > \tau$ for every i ∈ S.
The weight estimate is $\hat{w}_i = \max\{w_i, \tau\}$ if i ∈ S, and 0 otherwise.

[Figure: priority sampling of 3 out of 10 weighted items, showing the original weights, priorities, weight estimates, the threshold, and the sample.]

Theorem [DLT '04]. Priority sampling is unbiased: $E[\hat{w}_i] = w_i$.

Theorem [Szegedy STOC '06]. For any set of input weights, with one extra sample, priority sampling has smaller total variance $\sum_i \mathrm{Var}[\hat{w}_i]$ than any other unbiased sampling scheme.
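
A minimal streaming sketch of priority sampling as defined above (ours, not the paper's code; the stream contents, the choice k = 100 and the RNG seed are placeholder assumptions). A min-heap keeps the k+1 highest priorities seen so far, so that at the end its top is the threshold τ and the other k entries are the sample S.

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <queue>
    #include <random>
    #include <vector>

    struct Item { uint64_t id; double weight; double priority; };

    // Comparator that turns std::priority_queue into a min-heap on priority.
    struct LowestPriorityOnTop {
        bool operator()(const Item& x, const Item& y) const { return x.priority > y.priority; }
    };

    int main() {
        std::mt19937_64 rng(42);
        const std::size_t k = 100;
        std::priority_queue<Item, std::vector<Item>, LowestPriorityOnTop> heap;

        // Example stream (placeholder data): items 1..10000 with weight w_i = i.
        for (uint64_t i = 1; i <= 10000; ++i) {
            double w = double(i);
            double u = (double(rng()) + 0.5) / 18446744073709551616.0;  // h(i) uniform in (0,1)
            double q = w / u;                                           // priority q_i = w_i / h(i)
            heap.push(Item{i, w, q});
            if (heap.size() > k + 1) heap.pop();    // keep only the k+1 highest priorities
        }

        double tau = heap.top().priority;           // threshold: the (k+1)-th highest priority
        heap.pop();

        // Weight estimates: w_hat_i = max(w_i, tau) for i in S, 0 for everything else.
        double total_weight_estimate = 0.0;
        while (!heap.empty()) {
            Item it = heap.top();
            heap.pop();
            total_weight_estimate += std::max(it.weight, tau);
        }
        std::printf("estimated total weight: %.1f (threshold tau = %.1f)\n",
                    total_weight_estimate, tau);
        return 0;
    }

Since $E[\hat{w}_i] = w_i$, summing the estimates over any subset of items (here, over the whole sample) gives an unbiased estimate of that subset's total weight, which is the subset-sum estimation the scheme was designed for.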

Sampled Internet traffic: estimating traffic classes with priority sampling (PRI), uniform sampling without replacement (U-R), and probability-proportional-to-size sampling with replacement (P+R).

[Plot, application: sessions. Bytes (log scale) versus number of samples (1 to 10,000) for PRI, U-R, and P+R.]

[Plot, application: mail. Bytes (log scale) versus number of samples (1 to 10,000) for PRI, U-R, and P+R.]

Priority sampling works with minimal independence

The theorems are a bit hard to state. The essential result is that priority sampling gives good concentration even with 2-independence. Thus priority sampling is trivial to implement, including the hash function.

With his mathematical analysis of linear probing, Knuth [1963] started the field of trying to understand simple magical algorithms: whether they really work, or whether they are just waiting to blow up. This is an area full of surprises, and an area where mathematics can really help the outside world.

[Plots: bottom-k (left) and k-mins (right), as above.]