Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence
Mikkel Thorup, University of Copenhagen


Min-wise hashing [Broder 98, AltaVista]

The Jaccard similarity of sets $A$ and $B$ is $f(A,B) = |A \cap B|/|A \cup B|$.

With a truly random hash function $h$,
$$f(A,B) = \Pr_h[\min h(A) = \min h(B)].$$

For concentration, repeat $k$ times: k-mins uses $k$ independent hash functions $h_1, \ldots, h_k$. For a set $A$, store the signature $M_k(A) = (\min h_1(A), \ldots, \min h_k(A))$.

To estimate the Jaccard similarity $f(A,B)$ of $A$ and $B$, use
$$|M_k(A) \cap M_k(B)|/k = \sum_{i=1}^{k} [\min h_i(A) = \min h_i(B)]/k.$$

The expected relative error is below $1/\sqrt{f(A,B)\,k}$.
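
As a concrete illustration (a sketch, not code from the talk), k-mins can be implemented as below; the multiply-shift functions standing in for the $k$ independent hash functions, and the seed arrays A and B, are assumptions. Note the talk's point: with only 2-independent $h_i$, this estimator keeps its bias no matter how large K is.

    /* k-mins sketch: one signature coordinate per hash function. */
    #include <stdint.h>
    #include <stddef.h>

    #define K 64                     /* number of repetitions (assumed) */

    /* seeds; must be filled with independent random 64-bit values */
    static uint64_t A[K], B[K];

    static uint32_t hash_i(int i, uint32_t x) {
        return (uint32_t)((A[i] * x + B[i]) >> 32);
    }

    /* signature M_k(S) = (min h_1(S), ..., min h_K(S)) */
    void kmins_signature(const uint32_t *S, size_t n, uint32_t sig[K]) {
        for (int i = 0; i < K; i++) {
            uint32_t m = UINT32_MAX;
            for (size_t j = 0; j < n; j++) {
                uint32_t v = hash_i(i, S[j]);
                if (v < m) m = v;
            }
            sig[i] = m;
        }
    }

    /* estimate f(A,B) as the fraction of agreeing coordinates */
    double kmins_jaccard(const uint32_t sigA[K], const uint32_t sigB[K]) {
        int agree = 0;
        for (int i = 0; i < K; i++)
            agree += (sigA[i] == sigB[i]);
        return (double)agree / K;
    }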

Bias issues

We do not have space for truly random hash functions.

We say $h$ is $\varepsilon$-minwise independent if for every set $S$ and every $x \in S$:
$$\Pr_{h \in H}[h(x) = \min h(S)] = \frac{1 \pm \varepsilon}{|S|}.$$

Achieving bias $\varepsilon$ requires independence $\Theta(\log\frac{1}{\varepsilon})$ [Indyk 99, Patrascu Thorup 10].

The bias $\varepsilon$ is not improved by k-mins, no matter the number of repetitions $k$.

Practice

Mitzenmacher and Vadhan [SODA 08]: with enough entropy in the data, 2-independent hashing works as well as truly random hashing. But the real world is full of low-entropy data, e.g., consecutive numbers.

[Plot: estimates on consecutive numbers, showing the 90% fractile, a single experiment, the 10% fractile, and the real value.]

Bottom-k

With a single hash function $h$, the bottom-k signature of a set $X$ is
$$S_k(X) = \{\text{the } k \text{ elements of } X \text{ with the smallest hash values}\}.$$

For a subset $Y \subseteq X$, estimate the frequency $f(Y,X) = |Y|/|X|$ as $|S_k(X) \cap Y|/k$.

For sets $A$ and $B$, the signature of the union $A \cup B$ is computed as
$$S_k(A \cup B) = S_k(S_k(A) \cup S_k(B)),$$
and the Jaccard similarity $f(A,B)$ is estimated as
$$|S_k(A \cup B) \cap S_k(A) \cap S_k(B)|/k.$$

Our result:

Theorem. If $h$ is 2-independent then, even including bias, the expected relative error is $O(1/\sqrt{f k})$.

Porat proved this for 8-independent hashing and $k \gg 1$.
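
To make the single-hash scheme concrete, here is a minimal bottom-k sketch in C (an illustration under stated assumptions, not the paper's code); the 2-independent hash h and the membership test is_in_Y are assumed to be supplied by the caller.

    /* Bottom-k sketch: keep the k keys of X with smallest hash values.
       Sorting is used for clarity; a streaming implementation would
       maintain a max-heap of size k instead. */
    #include <stdint.h>
    #include <stdlib.h>

    extern uint32_t h(uint32_t x);   /* single 2-independent hash (assumed) */

    static int cmp_by_hash(const void *p, const void *q) {
        uint32_t a = h(*(const uint32_t *)p);
        uint32_t b = h(*(const uint32_t *)q);
        return (a > b) - (a < b);
    }

    /* writes S_k(X) into sig; reorders X; assumes n >= k */
    void bottom_k(uint32_t *X, size_t n, size_t k, uint32_t *sig) {
        qsort(X, n, sizeof *X, cmp_by_hash);
        for (size_t i = 0; i < k; i++)
            sig[i] = X[i];
    }

    /* estimate f(Y,X) = |Y|/|X| as |intersection of S_k(X) and Y| / k */
    double freq_estimate(const uint32_t *sig, size_t k,
                         int (*is_in_Y)(uint32_t)) {
        size_t hits = 0;
        for (size_t i = 0; i < k; i++)
            if (is_in_Y(sig[i])) hits++;
        return (double)hits / k;
    }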

Practice

We can use Dietzfelbinger's super fast 2-independent hashing:

    #include <stdint.h>

    uint64_t a, b;   /* chosen uniformly at random */

    uint32_t h(uint32_t x) { return (uint32_t)((a * x + b) >> 32); }

[Plots: bottom-k versus k-mins with this hash function; each panel shows the 90% fractile, a single experiment, the 10% fractile, and the real value. Bottom-k tracks the real value, while k-mins shows clear bias.]
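
As a self-contained illustration (assumptions, not from the talk: the rand()-based seeding is only a placeholder for a proper 64-bit random source), the hash above can be seeded and applied to consecutive keys, the low-entropy case from the experiments:

    /* Demo sketch: seed the 2-independent multiply-shift hash and apply it
       to consecutive keys, the low-entropy input used in the experiments. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static uint64_t a, b;

    static uint32_t h(uint32_t x) { return (uint32_t)((a * x + b) >> 32); }

    static uint64_t rand64(void) {           /* placeholder seed source */
        uint64_t r = 0;
        for (int i = 0; i < 4; i++)
            r = (r << 16) ^ (uint64_t)(rand() & 0xffff);
        return r;
    }

    int main(void) {
        a = rand64();
        b = rand64();
        for (uint32_t x = 0; x < 8; x++)     /* consecutive keys 0..7 */
            printf("h(%u) = %u\n", x, h(x));
        return 0;
    }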

A Simple Union Bound

With $n = |X|$, $S = S_k(X)$, and $Y \subseteq X$, estimate $f = |Y|/|X|$ as $|Y \cap S|/k$.

Overestimate: for parameters $a < 1$ and $b$, bound the probability of
$$(*)\qquad |Y \cap S| > \frac{1+b}{1-a}\,f k.$$

Define the threshold probability (independent of the random choices) $p = \frac{k}{n(1-a)}$. Hash values are uniform in $(0,1)$, so $\Pr[h(x) < p] = p$.

Proposition. The overestimate (*) implies one of
(A) $|\{x \in X \mid h(x) < p\}| < k$, a count with expectation $k/(1-a)$;
(B) $|\{x \in Y \mid h(x) < p\}| > (1+b)\,p\,|Y|$, a count with expectation $p\,|Y|$.
Hence $\Pr[(*)] \le \Pr[(A)] + \Pr[(B)]$.

Proof that ¬(A) and ¬(B) imply ¬(*)

Suppose (A) is false, that is, $|\{x \in X \mid h(x) < p\}| \ge k$. Then, since $S$ contains the $k$ smallest hash values, every $x \in S$ has $h(x) < p$, so
$$Y \cap S \subseteq \{x \in Y \mid h(x) < p\}.$$

Suppose (B) is also false, that is, $|\{x \in Y \mid h(x) < p\}| < (1+b)\,p\,|Y|$. With $p = k/(n(1-a))$ and $|Y| = fn$, we get
$$|Y \cap S| \le |\{x \in Y \mid h(x) < p\}| < \frac{1+b}{1-a}\,f k,$$
so the overestimate (*) did not happen.

2-independence

Proposition. With $p = \frac{k}{n(1-a)}$, the overestimate $(*)\; |Y \cap S| > \frac{1+b}{1-a}\,f k$ implies one of
(A) $|\{x \in X \mid h(x) < p\}| < k$, a count with expectation $k/(1-a)$;
(B) $|\{x \in Y \mid h(x) < p\}| > (1+b)\,p\,|Y|$, a count with expectation $p\,|Y|$;
so $\Pr[(*)] \le \Pr[(A)] + \Pr[(B)]$.

Application. If $h$ is 2-independent then, by Chebyshev's inequality,
$$\Pr[(A)] \le 1/(a^2 k), \qquad \Pr[(B)] \le 1/(b^2 f k).$$

For any $\varepsilon < 1/3$, an appropriate choice of $a$ and $b$ yields
$$\Pr[\,|Y \cap S| > (1+\varepsilon)\,f k\,] \le \frac{4}{\varepsilon^2 f k}.$$

With more calculation, $E[\,\bigl|\,|Y \cap S| - f k\,\bigr|\,] = O(\sqrt{f k})$.
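
For completeness, here is a short reconstruction of the Chebyshev step (the slides only state the bounds; the variance argument is the assumed standard filling-in). Write $C_X = |\{x \in X \mid h(x) < p\}|$ and $C_Y = |\{x \in Y \mid h(x) < p\}|$, so $E[C_X] = np = k/(1-a)$ and $E[C_Y] = p|Y| = fk/(1-a)$. With 2-independence the indicators $[h(x) < p]$ are pairwise independent, so $\mathrm{Var}[C] \le E[C]$ for both counts, and Chebyshev's inequality gives
$$\Pr[(A)] = \Pr[C_X < k] \le \frac{\mathrm{Var}[C_X]}{(E[C_X]-k)^2} \le \frac{k/(1-a)}{(ak/(1-a))^2} = \frac{1-a}{a^2 k} \le \frac{1}{a^2 k},$$
$$\Pr[(B)] = \Pr[C_Y > (1+b)\,E[C_Y]] \le \frac{\mathrm{Var}[C_Y]}{(b\,E[C_Y])^2} \le \frac{1}{b^2\,E[C_Y]} = \frac{1-a}{b^2 f k} \le \frac{1}{b^2 f k}.$$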

Conclusion

Both k-mins and bottom-k generalize the idea of storing the smallest hash value.

With limited independence, k-mins has major problems with bias, whereas bottom-k works perfectly even with 2-independence.

[Plots: bottom-k versus k-mins; each panel shows the 90% fractile, a single experiment, the 10% fractile, and the real value.]

Technically, the proof is very simple.

Similar results hold for weighted sampling, but the proofs are much harder.

Priority sampling of weighted items

A bottom/top-k sampling scheme for weighted items [Duffield, Lund, Thorup SIGMETRICS 04, JACM 07].

Stream of items $i$, each with a weight $w_i$:
- uniformly random hash $h(i) \in (0,1)$;
- priority $q_i = w_i/h(i)$ (priorities assumed distinct).

The priority sample $S$ of size $k$ (maintained in a priority queue):
- contains the $k$ items of highest priority;
- the threshold $\tau$ is the $(k+1)$-st priority, so $q_i > \tau$ for all $i \in S$;
- weight estimate $\hat w_i = \max\{w_i, \tau\}$ if $i \in S$, and $\hat w_i = 0$ otherwise.

[Figure: priority sampling 3 out of 10 weighted items, showing original weights, priorities, the threshold, the sample, and the weight estimates.]

Theorem [DLT 04]. Priority sampling is unbiased: $E[\hat w_i] = w_i$.

Theorem [Szegedy STOC 06]. For any set of input weights, with one extra sample, priority sampling has smaller total variance $\sum_i \mathrm{Var}[\hat w_i]$ than any other unbiased sampling scheme.
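
A minimal priority-sampling sketch in C (an illustration, not the production code from DLT): it sorts by priority for clarity, whereas a streaming implementation would keep a min-heap of the k+1 highest priorities; the rand()-based uniform draw is a placeholder for the hash h(i).

    /* Priority sampling sketch: priorities q_i = w_i/h(i), sample the
       k highest, threshold tau = (k+1)-st priority, estimate
       max{w_i, tau} for sampled items and 0 otherwise. */
    #include <stdlib.h>

    typedef struct { double w, q, est; } Item;

    static int by_q_desc(const void *p, const void *r) {
        double a = ((const Item *)p)->q, b = ((const Item *)r)->q;
        return (a < b) - (a > b);
    }

    void priority_sample(Item *it, size_t n, size_t k) {
        for (size_t i = 0; i < n; i++) {
            double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0); /* h(i) stand-in */
            it[i].q = it[i].w / u;                /* priority */
            it[i].est = 0.0;
        }
        qsort(it, n, sizeof *it, by_q_desc);      /* reorders the items */
        double tau = (k < n) ? it[k].q : 0.0;     /* (k+1)-st priority */
        for (size_t i = 0; i < k && i < n; i++)
            it[i].est = it[i].w > tau ? it[i].w : tau;   /* max{w_i, tau} */
    }

A subset sum over any $Y$ is then estimated by summing est over the items of $Y$; by the unbiasedness theorem above, this estimate has the correct expectation.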

Sampled Internet traffic: estimating traffic classes

Schemes compared:
- priority sampling (PRI);
- uniform sampling without replacement (U-R);
- probability proportional to size, with replacement (P+R).

[Plots: estimated bytes versus number of samples for the 'sessions' and 'mail' traffic classes under the three schemes.]

Priority sampling works with minimal independence

The theorems are a bit hard to state; the essential result is that priority sampling gives good concentration even with 2-independence. Thus priority sampling is trivial to implement, including the hash function.

With his mathematical analysis of linear probing, Knuth [1963] started the field of trying to understand simple magical algorithms: whether they really work, or whether they are just waiting to blow up. This is an area full of surprises, and an area where mathematics can really help the outside world.
