Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence
Mikkel Thorup, University of Copenhagen
Min-wise hashing [Broder 98, AltaVista]

The Jaccard similarity of sets A and B is

    f(A, B) = |A ∩ B| / |A ∪ B|.

With a truly random hash function h,

    f(A, B) = Pr_h[ min h(A) = min h(B) ].

For concentration, repeat k times: k-mins uses k independent hash functions h_1, ..., h_k. For a set A, store the signature

    M_k(A) = (min h_1(A), ..., min h_k(A)).

To estimate the Jaccard similarity f(A, B) of A and B, use

    |M_k(A) ∩ M_k(B)| / k = Σ_{i=1..k} [min h_i(A) = min h_i(B)] / k.

The expected relative error is below 1/√(f(A, B) · k).
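The k-mins scheme above can be sketched in Python as follows (the helper names `make_hash`, `kmins_signature`, and `kmins_jaccard` are illustrative, and truly random hash functions are simulated with seeded pseudo-random generators):

```python
import random

def make_hash(seed):
    """Simulate one random hash function mapping items into (0, 1)."""
    salt = random.Random(seed).getrandbits(64)
    return lambda x: random.Random(hash(x) ^ salt).random()

def kmins_signature(items, hashes):
    """M_k(A) = (min h_1(A), ..., min h_k(A))."""
    return [min(h(x) for x in items) for h in hashes]

def kmins_jaccard(sig_a, sig_b):
    """Estimate f(A, B) as the fraction of signature coordinates that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

hashes = [make_hash(i) for i in range(256)]       # k = 256 hash functions
A, B = set(range(0, 150)), set(range(50, 200))    # true Jaccard = 100/200 = 0.5
est = kmins_jaccard(kmins_signature(A, hashes), kmins_signature(B, hashes))
```

With k = 256 and f = 0.5, the expected relative error is below 1/√(0.5 · 256) ≈ 0.09, so `est` should land close to 0.5.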
Bias issues

We do not have space for truly random hash functions.

We say h is ε-minwise independent if for any S and x ∈ S:

    Pr_{h ∈ H}[ h(x) = min h(S) ] = (1 ± ε) / |S|.

Achieving bias ε requires independence Θ(lg(1/ε)) [Indyk 99, Pătrașcu Thorup 10].

The bias ε is not improved by k-mins, no matter the number of repetitions k.
Practice

Mitzenmacher and Vadhan [SODA 08]: with enough entropy, 2-independent hashing works as well as truly random, but the real world is full of low-entropy data, e.g., consecutive numbers.

[Figure: k-mins estimates on consecutive numbers; legend: 90% fractile, single experiment, 10% fractile, real value]
Bottom-k

With a single hash function h, the bottom-k signature of a set X is

    S_k(X) = { the k elements of X with smallest hash values }.

For a subset Y ⊆ X, estimate the frequency f(Y, X) = |Y| / |X| as |S_k(X) ∩ Y| / k.

For sets A and B, the signature of the union A ∪ B is computed as

    S_k(A ∪ B) = S_k(S_k(A) ∪ S_k(B)),

and the Jaccard similarity f(A, B) is estimated as

    |S_k(A ∪ B) ∩ S_k(A) ∩ S_k(B)| / k.

Our result:

Theorem. If h is 2-independent then, even including bias, the expected relative error is O(1/√(f k)).

Porat proved this for 8-independent hashing and k ≥ 1.
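A minimal Python sketch of the bottom-k estimators above (the names `bottom_k`, `frequency_estimate`, and `jaccard_estimate` are illustrative; a single seeded pseudo-random function stands in for the hash h):

```python
import random

SALT = random.Random(1).getrandbits(64)
h = lambda x: random.Random(hash(x) ^ SALT).random()  # stand-in hash into (0, 1)

def bottom_k(items, h, k):
    """S_k(X): the k elements of X with smallest hash values."""
    return set(sorted(items, key=h)[:k])

def frequency_estimate(Y, S, k):
    """Estimate f(Y, X) = |Y| / |X| as |S_k(X) ∩ Y| / k."""
    return len(S & Y) / k

def jaccard_estimate(A, B, h, k):
    """Estimate f(A, B) using S_k(A ∪ B) = S_k(S_k(A) ∪ S_k(B))."""
    SA, SB = bottom_k(A, h, k), bottom_k(B, h, k)
    S_union = bottom_k(SA | SB, h, k)
    return len(S_union & SA & SB) / k

X, Y = set(range(1000)), set(range(250))          # true frequency 0.25
freq = frequency_estimate(Y, bottom_k(X, h, 256), 256)
A, B = set(range(0, 600)), set(range(300, 900))   # true Jaccard = 300/900 = 1/3
jac = jaccard_estimate(A, B, h, 256)
```

Note that the union signature is computed from the two stored signatures alone, so two parties can estimate similarity without exchanging their full sets.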
Practice

We can use Dietzfelbinger's super fast 2-independent hashing:

    random int64 a, b;
    int32 h(int32 x) { return (a*x + b) >> 32; }

[Figures: bottom-k vs k-mins estimates; legend: 90% fractile, single experiment, 10% fractile, real value]
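The multiply-shift scheme on the slide can be transliterated into Python as follows (the class name is illustrative; masking to 64 bits emulates int64 overflow):

```python
import random

class MultiplyShift:
    """Dietzfelbinger's 2-independent multiply-shift hashing of 32-bit keys:
    h(x) = ((a*x + b) mod 2^64) >> 32, with a, b random 64-bit numbers."""
    MASK64 = (1 << 64) - 1

    def __init__(self, rng):
        self.a = rng.getrandbits(64)
        self.b = rng.getrandbits(64)

    def __call__(self, x):
        # Keep only the low 64 bits of a*x + b, then take the high 32 of those.
        return ((self.a * x + self.b) & self.MASK64) >> 32

h = MultiplyShift(random.Random(42))
values = [h(x) for x in range(8)]   # 32-bit hash values
```

To obtain hash values playing the role of uniform numbers in (0, 1) in the analysis, divide the 32-bit output by 2^32.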
A Simple Union Bound

With n = |X|, S = S_k(X), and Y ⊆ X, estimate f = |Y| / |X| as |Y ∩ S| / k.

Overestimate: for parameters a < 1 and b, bound the probability of

    (∗)  |Y ∩ S| > ((1 + b)/(1 − a)) · f k.

Define a threshold probability (independent of the random choices) p = k / (n (1 − a)). Hash values are uniform in (0, 1), so Pr[h(x) < p] = p.

Proposition. The overestimate (∗) implies one of

    (A) |{x ∈ X | h(x) < p}| < k                 (E[·] = k/(1 − a))
    (B) |{x ∈ Y | h(x) < p}| > (1 + b) p |Y|     (E[·] = p |Y|)

so Pr[(∗)] ≤ Pr[(A)] + Pr[(B)].
Proof: ¬(A) ∧ ¬(B) ⟹ ¬(∗)

Suppose (A) is false, that is, |{x ∈ X | h(x) < p}| ≥ k. Then, since S contains the k smallest hash values, ∀x ∈ S: h(x) < p, so

    Y ∩ S ⊆ {x ∈ Y | h(x) < p}.

Suppose (B) is also false, that is, |{x ∈ Y | h(x) < p}| < (1 + b) p |Y|. With p = k/(n (1 − a)) and |Y| = f n, we get

    |Y ∩ S| ≤ |{x ∈ Y | h(x) < p}| < ((1 + b)/(1 − a)) · f k,

so the overestimate (∗) did not happen.
2-independence

Proposition. With p = k / (n (1 − a)), the overestimate

    (∗)  |Y ∩ S| > ((1 + b)/(1 − a)) · f k

implies one of

    (A) |{x ∈ X | h(x) < p}| < k                 (E[·] = k/(1 − a))
    (B) |{x ∈ Y | h(x) < p}| > (1 + b) p |Y|     (E[·] = p |Y|)

so Pr[(∗)] ≤ Pr[(A)] + Pr[(B)].

Application. If h is 2-independent then, by Chebyshev,

    Pr[(A)] ≤ 1/(a² k)
    Pr[(B)] ≤ 1/(b² f k).

For any ε < 1/3, an appropriate choice of a and b yields

    Pr[ |Y ∩ S| > (1 + ε) f k ] ≤ 4/(ε² f k).

With more calculations, E[ | |Y ∩ S| − f k | ] = O(√(f k)).
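The Chebyshev step for (A) can be spelled out as follows (the bound for (B) is analogous). Write X_p = |{x ∈ X | h(x) < p}| as a sum of 2-independent indicator variables; then the variance of the sum is the sum of the individual variances, which is at most the expectation:

```latex
X_p = \sum_{x \in X} [h(x) < p], \qquad
\mathrm{E}[X_p] = np = \frac{k}{1-a}, \qquad
\mathrm{Var}[X_p] \le \mathrm{E}[X_p] = \frac{k}{1-a}.
```

Since X_p < k is equivalent to E[X_p] − X_p > E[X_p] − k = ak/(1 − a), Chebyshev's inequality gives

```latex
\Pr[(A)] = \Pr\Big[\,\mathrm{E}[X_p] - X_p > \frac{ak}{1-a}\Big]
\le \frac{\mathrm{Var}[X_p]}{\big(ak/(1-a)\big)^2}
\le \frac{k}{1-a}\cdot\frac{(1-a)^2}{a^2 k^2}
= \frac{1-a}{a^2 k} \le \frac{1}{a^2 k}.
```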
Conclusion

Both k-mins and bottom-k generalize the idea of storing the smallest hash value. With limited independence, k-mins has major problems with bias, whereas bottom-k works perfectly even with 2-independence.

[Figures: bottom-k vs k-mins estimates; legend: 90% fractile, single experiment, 10% fractile, real value]

Technically, the proof is very simple. Similar results hold for weighted sampling, but the proofs are much, much harder.
Priority sampling of weighted items

A bottom/top-k sampling scheme for weighted items [Duffield, Lund, Thorup SIGMETRICS 04, JACM 07].

Stream of items i, each with a weight w_i:
- uniformly random hash h(i) ∈ (0, 1)
- priority q_i = w_i / h(i) (priorities assumed distinct).

The priority sample S of size k (maintained in a priority queue):
- contains the k items of highest priority
- the threshold τ is the (k + 1)-th highest priority; then ∀i ∈ S: q_i > τ
- weight estimate ŵ_i = max{w_i, τ} if i ∈ S; otherwise ŵ_i = 0.
[Figure: priority sampling 3 out of 10 weighted items, showing original weights, priorities, weight estimates, the threshold, and the sample]
Theorem [DLT 04]. Priority sampling is unbiased: E[ŵ_i] = w_i.

Theorem [Szegedy STOC 06]. For any set of input weights, with one extra sample, priority sampling has smaller total variance Σ_i Var[ŵ_i] than any other unbiased sampling scheme.
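A minimal, non-streaming sketch of priority sampling in Python (the function name is illustrative; a real streaming implementation would keep only the top k + 1 priorities in a heap; this sketch assumes more than k items):

```python
import random

def priority_sample(weights, k, rng):
    """Priority sample of size k: returns {item index: weight estimate}."""
    # priority q_i = w_i / h(i), with h(i) uniform in (0, 1]
    q = {i: w / (1.0 - rng.random()) for i, w in enumerate(weights)}
    order = sorted(q, key=q.get, reverse=True)
    S, tau = order[:k], q[order[k]]     # threshold = (k + 1)-th highest priority
    return {i: max(weights[i], tau) for i in S}

# Unbiasedness in action: E[w-hat_i] = w_i, so the estimated total
# should concentrate around the true total sum of weights (here 110).
rng = random.Random(0)
weights = [1, 2, 3, 4, 100]
trials = 2000
avg_total = sum(sum(priority_sample(weights, 3, rng).values())
                for _ in range(trials)) / trials
```

Averaged over many repetitions, `avg_total` approaches Σ_i w_i = 110, reflecting the unbiasedness theorem above.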
Sampled Internet traffic: estimate traffic classes with
- priority sampling (PRI)
- uniform sampling without replacement (U−R)
- probability proportional to size with replacement (P+R).

[Figures: bytes vs number of samples for the "sessions" and "mail" applications]
Priority sampling works with minimal independence

The theorems are a bit hard to state. The essential result is that priority sampling gives good concentration even with 2-independence. Thus priority sampling is trivial to implement, including the hash function.

With his mathematical analysis of linear probing, Knuth [1963] started the field of trying to understand simple magical algorithms: whether they really work, or whether they are just waiting to blow up. This is an area full of surprises, and an area where mathematics can really help the outside world.
More information1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:
CS 24 Section #8 Hashing, Skip Lists 3/20/7 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look at
More informationPower of d Choices with Simple Tabulation
Power of d Choices with Simple Tabulation Anders Aamand, Mathias Bæk Tejs Knudsen, and Mikkel Thorup Abstract arxiv:1804.09684v1 [cs.ds] 25 Apr 2018 Suppose that we are to place m balls into n bins sequentially
More informationarxiv: v2 [cs.ds] 15 Nov 2010
Stream sampling for variance-optimal estimation of subset sums Edith Cohen Nick Duffield Haim Kaplan Carsten Lund Mikkel Thorup arxiv:0803.0473v2 [cs.ds] 15 Nov 2010 Abstract From a high volume stream
More informationSet Similarity Search Beyond MinHash
Set Similarity Search Beyond MinHash ABSTRACT Tobias Christiani IT University of Copenhagen Copenhagen, Denmark tobc@itu.dk We consider the problem of approximate set similarity search under Braun-Blanquet
More informationDiscrete Mathematics and Probability Theory Spring 2014 Anant Sahai Note 18
EECS 7 Discrete Mathematics and Probability Theory Spring 214 Anant Sahai Note 18 A Brief Introduction to Continuous Probability Up to now we have focused exclusively on discrete probability spaces Ω,
More informationMath 24 Spring 2012 Questions (mostly) from the Textbook
Math 24 Spring 2012 Questions (mostly) from the Textbook 1. TRUE OR FALSE? (a) The zero vector space has no basis. (F) (b) Every vector space that is generated by a finite set has a basis. (c) Every vector
More informationWhy Simple Hash Functions Work: Exploiting the Entropy in a Data Stream
Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream ichael itzenmacher Salil Vadhan School of Engineering & Applied Sciences Harvard University Cambridge, A 02138 {michaelm,salil}@eecs.harvard.edu
More informationHash tables. Hash tables
Dictionary Definition A dictionary is a data-structure that stores a set of elements where each element has a unique key, and supports the following operations: Search(S, k) Return the element whose key
More information7 Algorithms for Massive Data Problems
7 Algorithms for Massive Data Problems Massive Data, Sampling This chapter deals with massive data problems where the input data (a graph, a matrix or some other object) is too large to be stored in random
More informationV.4 MapReduce. 1. System Architecture 2. Programming Model 3. Hadoop. Based on MRS Chapter 4 and RU Chapter 2 IR&DM 13/ 14 !74
V.4 MapReduce. System Architecture 2. Programming Model 3. Hadoop Based on MRS Chapter 4 and RU Chapter 2!74 Why MapReduce? Large clusters of commodity computers (as opposed to few supercomputers) Challenges:
More informationCOS597D: Information Theory in Computer Science October 5, Lecture 6
COS597D: Information Theory in Computer Science October 5, 2011 Lecture 6 Lecturer: Mark Braverman Scribe: Yonatan Naamad 1 A lower bound for perfect hash families. In the previous lecture, we saw that
More informationLecture 12: Hash Tables [Sp 15]
Insanity is repeating the same mistaes and expecting different results. Narcotics Anonymous (1981) Calvin: There! I finished our secret code! Hobbes: Let s see. Calvin: I assigned each letter a totally
More informationDATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing
DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining
More informationCharging from Sampled Network Usage
Charging from Sampled Network Usage Nick Duffield Carsten Lund Mikkel Thorup AT&T Labs-Research, Florham Park, NJ 1 Do Charging and Sampling Mix? Usage sensitive charging charge based on sampled network
More informationHash tables. Hash tables
Dictionary Definition A dictionary is a data-structure that stores a set of elements where each element has a unique key, and supports the following operations: Search(S, k) Return the element whose key
More informationGoldreich-Levin Hardcore Predicate. Lecture 28: List Decoding Hadamard Code and Goldreich-L
Lecture 28: List Decoding Hadamard Code and Goldreich-Levin Hardcore Predicate Recall Let H : {0, 1} n {+1, 1} Let: L ε = {S : χ S agrees with H at (1/2 + ε) fraction of points} Given oracle access to
More informationLecture 13: Seed-Dependent Key Derivation
Randomness in Cryptography April 11, 2013 Lecture 13: Seed-Dependent Key Derivation Lecturer: Yevgeniy Dodis Scribe: Eric Miles In today s lecture, we study seeded key-derivation functions (KDFs) in the
More informationTesting k-wise Independence over Streaming Data
Testing k-wise Independence over Streaming Data Kai-Min Chung Zhenming Liu Michael Mitzenmacher Abstract Following on the work of Indyk and McGregor [5], we consider the problem of identifying correlations
More informationHash Tables. Given a set of possible keys U, such that U = u and a table of m entries, a Hash function h is a
Hash Tables Given a set of possible keys U, such that U = u and a table of m entries, a Hash function h is a mapping from U to M = {1,..., m}. A collision occurs when two hashed elements have h(x) =h(y).
More information2WB05 Simulation Lecture 7: Output analysis
2WB05 Simulation Lecture 7: Output analysis Marko Boon http://www.win.tue.nl/courses/2wb05 December 17, 2012 Outline 2/33 Output analysis of a simulation Confidence intervals Warm-up interval Common random
More informationCS 591, Lecture 6 Data Analytics: Theory and Applications Boston University
CS 591, Lecture 6 Data Analytics: Theory and Applications Boston University Babis Tsourakakis February 8th, 2017 Universal hash family Notation: Universe U = {0,..., u 1}, index space M = {0,..., m 1},
More informationDynamic Call Center Routing Policies Using Call Waiting and Agent Idle Times Online Supplement
Submitted to imanufacturing & Service Operations Management manuscript MSOM-11-370.R3 Dynamic Call Center Routing Policies Using Call Waiting and Agent Idle Times Online Supplement (Authors names blinded
More informationCell-Probe Lower Bounds for Prefix Sums and Matching Brackets
Cell-Probe Lower Bounds for Prefix Sums and Matching Brackets Emanuele Viola July 6, 2009 Abstract We prove that to store strings x {0, 1} n so that each prefix sum a.k.a. rank query Sumi := k i x k can
More informationDiscrete Mathematics and Probability Theory Fall 2015 Note 20. A Brief Introduction to Continuous Probability
CS 7 Discrete Mathematics and Probability Theory Fall 215 Note 2 A Brief Introduction to Continuous Probability Up to now we have focused exclusively on discrete probability spaces Ω, where the number
More informationWhy Simple Hash Functions Work: Exploiting the Entropy in a Data Stream
Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream Michael Mitzenmacher Salil Vadhan Abstract Hashing is fundamental to many algorithms and data structures widely used in practice.
More information6 Filtering and Streaming
Casus ubique valet; semper tibi pendeat hamus: Quo minime credas gurgite, piscis erit. [Luck affects everything. Let your hook always be cast. Where you least expect it, there will be a fish.] Publius
More informationLecture 5: Hashing. David Woodruff Carnegie Mellon University
Lecture 5: Hashing David Woodruff Carnegie Mellon University Hashing Universal hashing Perfect hashing Maintaining a Dictionary Let U be a universe of keys U could be all strings of ASCII characters of
More informationLecture 24: Bloom Filters. Wednesday, June 2, 2010
Lecture 24: Bloom Filters Wednesday, June 2, 2010 1 Topics for the Final SQL Conceptual Design (BCNF) Transactions Indexes Query execution and optimization Cardinality Estimation Parallel Databases 2 Lecture
More informationFast Moment Estimation in Data Streams in Optimal Space
Fast Moment Estimation in Data Streams in Optimal Space Daniel M. Kane Jelani Nelson Ely Porat David P. Woodruff Abstract We give a space-optimal algorithm with update time O(log 2 (1/ε) log log(1/ε))
More information