Lecture 2. Frequency problems

1 / 43 Lecture 2. Frequency problems Ricard Gavaldà MIRI Seminar on Data Streams, Spring 2015

Contents 2 / 43
1 Frequency problems in data streams
2 Approximating inner product
3 Computing frequency moments
4 Counting distinct elements

Frequency problems in data streams 3 / 43

4 / 43 The data stream model. Frequency problems
Input is a sequence of items a_1, a_2, a_3, ...
Each a_i is an element of a universe U of size n; n is large or infinite
At time t, a query returns something about a_1 ... a_t

5 / 43 Frequency problems, alternate view
At any time t, for any i ∈ U, f_i = (def.) number of appearances of i so far
Frequency problems: the result depends on the f_i's only
In particular, it is independent of the order

Frequency problems, alternate view 6 / 43
The stream at time t defines an implicit array F[1..n] with F[i] = f_i
A new occurrence of i is equivalent to F[i]++
Model extensions:
F[i]++, F[i]-- (additions and deletions)
F[i] += x, with x ≥ 0 (multiple additions)
F[i] += x, with any x (multiple additions and deletions)
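
For reference, the exact (non-streaming) baseline is just this implicit array kept explicitly; a minimal Python sketch, with illustrative names not from the lecture:

    from collections import defaultdict

    class ExactFrequencies:
        """Exact baseline: one counter per distinct item, i.e. memory ~ n counters."""
        def __init__(self):
            self.F = defaultdict(int)      # the implicit array F[1..n]

        def add(self, i, x=1):
            # covers F[i]++ (x = 1), deletions (x = -1), and general F[i] += x
            self.F[i] += x

        def frequency(self, i):
            return self.F[i]

The streaming algorithms below answer queries about F approximately, using far less than n counters.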

Approximating inner product 7 / 43

Approximating inner product 8 / 43
Implicit vectors u[1..n], v[1..n]
Stream of instructions add(u_i, x), add(v_j, y), with i, j = 1...n
At every time, we want to output an approximation of Σ_{i=1..n} u_i·v_i
I'll suppose the above is always > 0, for relative approximation to make sense

Basic algorithm 9 / 43
Init: pick a very good hash function f : [n] → [n]
For i ∈ [n], define (do not compute and store) b_i = (−1)^(f(i) mod 2) ∈ {−1, 1}
S ← 0; T ← 0
Update: when reading add(u_i, x), do S += x·b_i; when reading add(v_j, y), do T += y·b_j
Query: return S·T
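
A minimal Python sketch of the basic algorithm. For clarity it draws and stores one random sign per coordinate, which is exactly the "too much memory" option discussed on slide 20; the later slides replace this by a 4-wise independent hash f and b_i = (−1)^(f(i) mod 2). Names are illustrative.

    import random

    class BasicSketch:
        def __init__(self, n, seed=0):
            rng = random.Random(seed)
            # stand-in for b_i = (-1)^(f(i) mod 2); here drawn and stored explicitly
            self.b = [rng.choice((-1, 1)) for _ in range(n)]
            self.S = 0
            self.T = 0

        def add_u(self, i, x):     # instruction add(u_i, x)
            self.S += x * self.b[i]

        def add_v(self, j, y):     # instruction add(v_j, y)
            self.T += y * self.b[j]

        def query(self):           # single-copy estimator of IP(u, v)
            return self.S * self.T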

10 / 43 Final algorithm
Run in parallel c_1·c_2 copies of the basic algorithm, grouped into c_2 groups of c_1 copies each
When queried, compute the average of the results within each group of c_1 copies, then return the median of the averages of the c_2 groups
Theorem: For c_1 = O(ε^{−2}) and c_2 = O(ln δ^{−1}), the algorithm above (ε, δ)-approximates u·v
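
A hedged sketch of the final algorithm on top of the BasicSketch above: c_1·c_2 independent copies, averaged within each group, median across groups.

    from statistics import mean, median

    class MedianOfMeansSketch:
        def __init__(self, n, c1, c2):
            # c1 = O(1/eps^2) copies per group, c2 = O(ln(1/delta)) groups
            self.groups = [[BasicSketch(n, seed=g * c1 + k) for k in range(c1)]
                           for g in range(c2)]

        def add_u(self, i, x):
            for group in self.groups:
                for sk in group:
                    sk.add_u(i, x)

        def add_v(self, j, y):
            for group in self.groups:
                for sk in group:
                    sk.add_v(j, y)

        def query(self):
            # average within each group, then median of the group averages
            return median(mean(sk.query() for sk in group) for group in self.groups)

For ε = 0.1 and δ = 0.01, the bounds proved below suggest roughly c_1 ≈ 16/ε² = 1600 and c_2 ≈ (32/9)·ln(2/δ) ≈ 19.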

Why does this work? 11 / 43
Claim 1: S = Σ_{i=1..n} u_i·b_i and T = Σ_{i=1..n} v_i·b_i
Claim 2: E[S·T] = IP(u,v)
Claim 3: Var[S·T] ≤ 2·E[S·T]²
Claim 4: The median-of-averages as described (ε,δ)-approximates IP(u,v)

Claim 1 12 / 43
Claim 1: S = Σ_{i=1..n} u[i]·b_i and T = Σ_{i=1..n} v[i]·b_i
The Update is: when reading add(u_i, x), do S += x·b_i; when reading add(v_j, y), do T += y·b_j

Claim 2 13 / 43
Claim 2: E[S·T] = IP(u,v)
Really? But S = Σ_i u_i·b_i and T = Σ_i v_i·b_i, yet
(Σ_i u_i)·(Σ_i v_i) = Σ_{i,j} u_i·v_j ≠ Σ_i u_i·v_i
So the trick has to be in the b_i, b_j

Claim 2 (II) 14 / 43
If i = j, E[b_i·b_j] = E[1] = 1
If i ≠ j and f is good, b_i and b_j are independent, so E[b_i·b_j] = (1/2)·1 + (1/2)·(−1) = 0
Then Claim 2 follows by linearity of expectation:
E[S·T] = E[(Σ_{i=1..n} u[i]·b_i)·(Σ_{i=1..n} v[i]·b_i)]
= E[Σ_{i,j} u[i]·v[j]·b_i·b_j]
= Σ_i u[i]·v[i]·E[b_i·b_i] + Σ_{i≠j} u[i]·v[j]·E[b_i·b_j]
= Σ_i u[i]·v[i]

Claim 3 15 / 43
Claim 3: Var[S·T] ≤ 2·E[S·T]²
Var[S·T] = E[(S·T)²] − E[S·T]²
= E[(Σ_{i,j} ···b_i·b_j···)·(Σ_{k,l} ···b_k·b_l···)] − E[S·T]²
= E[Σ_{i,j,k,l} ···b_i·b_j·b_k·b_l···] − E[S·T]²
≤ 2·(Σ_i u[i]·v[i])·(Σ_j u[j]·v[j]) = 2·E[S·T]² (you work it out)
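
A quick way to sanity-check Claims 2 and 3 numerically (a simulation, not part of the lecture): draw fresh independent signs many times for fixed small vectors and compare the empirical mean and variance of S·T with IP(u,v) and 2·IP(u,v)².

    import random

    def simulate(u, v, trials=200_000, seed=1):
        rng = random.Random(seed)
        n = len(u)
        samples = []
        for _ in range(trials):
            b = [rng.choice((-1, 1)) for _ in range(n)]
            S = sum(u[i] * b[i] for i in range(n))
            T = sum(v[i] * b[i] for i in range(n))
            samples.append(S * T)
        m = sum(samples) / trials
        var = sum((s - m) ** 2 for s in samples) / trials
        return m, var

    u, v = [3, 1, 2], [2, 2, 1]
    ip = sum(x * y for x, y in zip(u, v))          # IP(u, v) = 10
    mean_st, var_st = simulate(u, v)
    # mean_st should be close to 10 and var_st well below 2 * ip**2 = 200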

16 / 43 Claim 4
Claim 4: average c_1 copies of S·T
Let X be the output of the basic algorithm
E[X] = IP(u,v), Var(X) ≤ 2·E[X]²
Equivalently, σ(X) = sqrt(Var(X)) ≤ √2·E[X]
We want to bound Pr[ |X − E[X]| > ε·E[X] ]
But applying Chebyshev directly only gives Pr[ |X − E[X]| > ε·E[X] ] ≤ Var(X)/(ε·E[X])² ≤ 2/ε², which is not interesting for small ε
We need to reduce the variance first: averaging

Claim 4 (cont.) 17 / 43
Let X_i be the output of the i-th copy of the basic algorithm
E[X_i] = IP(u,v), Var(X_i) ≤ 2·E[X_i]²
Let Y be the average of X_1, ..., X_{c_1}
See that E[Y] = IP(u,v) and Var(Y) ≤ 2·E[Y]²/c_1
By Chebyshev's inequality, if c_1 ≥ 16/ε²:
Pr[ |Y − E[Y]| > ε·E[Y] ] ≤ Var(Y)/(ε·E[Y])² ≤ 2·E[Y]²/(c_1·ε²·E[Y]²) ≤ 1/8
We could throw δ into this bound, but we would get a 1/δ dependence
At this point, use Hoeffding to get ln(1/δ)

Claim 4 (cont.) 18 / 43
We have E[Y] = IP(u,v) and Pr[ (1 − ε)·E[Y] ≤ Y ≤ (1 + ε)·E[Y] ] ≥ 7/8
Now take the median Z of c_2 copies of Y: Y_1, ..., Y_{c_2}
As in the exercise on computing medians (Hoeffding bound), if c_2 ≥ (32/9)·ln(2/δ) then Pr[ |Z − E[Y]| ≥ ε·E[Y] ] ≤ δ
We get an (ε, δ)-approximation with c_1·c_2 = O((1/ε²)·ln(2/δ)) copies of the basic algorithm

19 / 43 Memory use & update time
c = O((1/ε²)·ln(1/δ)) copies of the basic algorithm
Each copy: 4·log n bits to store its hash function
At most log(Σ_i u_i) + log(Σ_i v_i) bits to store S, T; say O(log t) if the u_i, v_i are bounded
Total memory proportional to (1/ε²)·ln(1/δ)·(log n + log t)
Update time: O(c) word operations

20 / 43 How do we get the good hash functions?
Solution 1: generate b_1, ..., b_n at random once, store them: n bits, too much
Solution 2: e.g., the linear congruential method f(x) = a·x + b
OK if a, b ≤ n, so O(log n) bits to store
But: f is far from random: given f(x), f(y), one recovers a, b by solving
f(x) = a·x + b
f(y) = a·y + b

Reducing Randomness 21 / 43

Reducing Randomness 22 / 43
Where did we use independence of the b_i's, really?
For example, here: E[b_i·b_j] = E[b_i]·E[b_j] = 0
For this, it is enough to have pairwise independence: for every i, j, Pr[A_i | A_j] = Pr[A_i]
Much weaker than full independence: for every i, Pr[A_i | A_1, ..., A_{i−1}, A_{i+1}, ..., A_m] = Pr[A_i]

23 / 43 Generating Pairwise Independent Bits
Choose f at random from a small family of pairwise independent functions
f(x), f(y) are guaranteed to be pairwise independent
Each f in the family can be stored with O(log n) bits

24 / 43 Generating Pairwise Independent Bits (details)
Work over a finite field of size q ≥ n (say q prime, or q = 2^r)
Idea: choose a, b ∈ [q] at random; let f(x) = a·x + b
2·log q bits to store f
Study the system of equations ax + b = α, ay + b = β
Given x, y (with x ≠ y!), α, β, there is exactly one solution for a, b
Therefore, Pr_f[f(x) = α | f(y) = β] = Pr_f[f(x) = α] = 1/q
Likewise: there are families of k-wise independent hash functions that can be stored in k·log q ≈ k·log n bits
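
A hedged Python sketch of such a family: over a prime field GF(P), a random polynomial of degree k − 1 is k-wise independent; k = 2 gives the pairwise family f(x) = a·x + b above, and k = 4 is what Claim 3 needs. The sign b_i is then derived from the low bit of f(i). The prime P and the helper names are illustrative choices, not from the lecture.

    import random

    P = 2_147_483_647                   # a prime >= the universe size n (here 2^31 - 1)

    def random_kwise_hash(k, seed=None):
        """f(x) = (c_0 + c_1*x + ... + c_{k-1}*x^{k-1}) mod P, with random coefficients.
        Storing f takes k*log2(P) bits, matching the k*log q bound on the slide."""
        rng = random.Random(seed)
        coeffs = [rng.randrange(P) for _ in range(k)]
        def f(x):
            acc = 0
            for c in reversed(coeffs):  # Horner's rule
                acc = (acc * x + c) % P
            return acc
        return f

    f = random_kwise_hash(4, seed=7)    # 4-wise independent
    def sign(i):
        return -1 if f(i) % 2 else 1    # b_i = (-1)^(f(i) mod 2)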

Completing the proof 25 / 43
The proof of Claim 3 (the bound on Var(S·T)) needs 4-wise independence
The algorithm initially chooses a random hash function f from a 4-wise independent family
It remembers f using 4·log n bits
Each time it needs b_i, it computes (−1)^(f(i) mod 2)

About Pairwise Independence 26 / 43
Exercise 1: Verify that for pairwise independent variables X_i with Var(X_i) = σ² we have Var((1/k)·Σ_{i=1..k} X_i) = σ²/k
So: to reduce the variance at a Chebyshev rate 1/k by averaging k copies, pairwise independence suffices
To get a Hoeffding-like rate exp(−c·k) we need full independence

Applications 27 / 43
Computing the (squared) L_2-distance: L_2(u,v)² = Σ_{i=1..n} (u[i] − v[i])² = IP(u − v, u − v)
Computing the second frequency moment: F_2 = Σ_{i=1..n} f_i² = IP(f, f)

Computing frequency moments 28 / 43

Frequency Moments 29 / 43
k-th frequency moment of the sequence: F_k = Σ_{i=1..n} f_i^k
F_0 = number of distinct symbols occurring in S
F_1 = length of the sequence
F_2 = inner product of f with itself
Define F_∞ = lim_{k→∞} (F_k)^{1/k} = max_{i=1..n} f_i

30 / 43 Computing moments [AMS]
Noga Alon, Yossi Matias, Mario Szegedy (1996): The space complexity of approximating the frequency moments
Considered to have initiated data stream algorithmics
Studied the complexity of computing the moments F_k
Proposed approximation algorithms, proved upper and lower bounds
Starting point for a large part of later work

Frequency Moments 31 / 43
F_k = Σ_{i=1..n} f_i^k
Obvious algorithm: one counter per symbol. Memory n·log t
[AMS] and many other papers, culminating in [Indyk-Woodruff 05]:
For k > 2, F_k can be approximated with Õ(n^{1−2/k}) memory. This is optimal. In particular, F_∞ requires Ω(n) memory
For k ≤ 2, F_k can be approximated with O(log n + log t) memory
The dependence on ε, δ is Θ(ε^{−2}·ln(1/δ)) for relative approximation
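
For reference, the obvious algorithm mentioned above as a short Python sketch: one counter per symbol, then sum the k-th powers of the counts.

    from collections import Counter

    def frequency_moment(stream, k):
        """Exact F_k with one counter per distinct symbol (memory ~ n log t)."""
        freq = Counter(stream)              # f_i for every symbol i in the stream
        return sum(f ** k for f in freq.values())

    # frequency_moment("abracadabra", 0) == 5   (F_0: distinct symbols)
    # frequency_moment("abracadabra", 1) == 11  (F_1: stream length)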

Counting distinct elements 32 / 43

Counting distinct elements 33 / 43
Given a stream of elements from [n], approximate how many distinct ones, d, we have seen at any time t
There are linear-memory and logarithmic-memory solutions (in d_max ≤ n, if known a priori)
[Metwally+08] is a good overview

34 / 43 Linear counting [Whang+90] Bloom filters
Init: choose a hash function h : [n] → [s]; choose a load factor 0 < ρ ≤ 12; build a bit vector B of size s = d_max/ρ
Update(x): B[h(x)] ← 1
Query: w ← the fraction of 0's in B; return s·ln(1/w)
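
A minimal Python sketch of linear counting as described above; the hash is a stand-in (Python's built-in hash, salted), and ρ = 8 is an arbitrary choice within the stated range.

    import math, random

    class LinearCounter:
        def __init__(self, d_max, rho=8.0, seed=0):
            self.s = max(1, int(d_max / rho))        # size of the bit vector B
            self.B = [0] * self.s
            self._salt = random.Random(seed).randrange(1 << 61)

        def _h(self, x):                             # stand-in for h : [n] -> [s]
            return hash((self._salt, x)) % self.s

        def update(self, x):
            self.B[self._h(x)] = 1

        def query(self):
            w = self.B.count(0) / self.s             # fraction of 0's in B
            if w == 0:
                return float("inf")                  # B saturated; estimate unusable
            return self.s * math.log(1.0 / w)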

Linear counting [Whang+90] Bloom filters 35 / 43
w = Pr[a fixed bucket is empty after inserting d distinct elements] = (1 − 1/s)^d ≈ exp(−d/s)
So d ≈ s·ln(1/w), which is exactly what Query returns
E[Query] ≈ d, σ(Query) small!

Cohen's algorithm [Cohen97] 36 / 43
E[gap between two consecutive 1's in B] = (s − d)/(d + 1) ≈ s/d
Query: return s / (size of the first gap in B)

Cohen's algorithm [Cohen97] 37 / 43
Trick: don't store B; remember the smallest key inserted in B
Init: posmin ← s; choose a hash function h : [n] → [s]
Update(x): if (h(x) < posmin) posmin ← h(x)
Query: return s/posmin
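
The same algorithm as a hedged Python sketch (stand-in hash as before):

    import random

    class CohenCounter:
        def __init__(self, s, seed=0):
            self.s = s                      # hash range; should be comfortably >= d_max
            self.posmin = s
            self._salt = random.Random(seed).randrange(1 << 61)

        def _h(self, x):                    # stand-in for h : [n] -> [s]
            return hash((self._salt, x)) % self.s

        def update(self, x):
            hx = self._h(x)
            if hx < self.posmin:
                self.posmin = hx

        def query(self):
            return self.s / max(self.posmin, 1)    # s / posmin, guarding against posmin = 0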

Cohen's algorithm [Cohen97] 38 / 43
E[posmin] ≈ s/d, σ(posmin) ≈ s/d
Space is log s plus the space to store h, i.e. O(log n)

Probabilistic Counting 39 / 43
Flajolet-Martin counter [Flajolet+85] + LogLog + SuperLogLog + HyperLogLog
Observe the values h(x) of the inserted elements, written in binary
Idea: to see a value h(x) = 0^{k−1}1... (that is, with k − 1 leading zeros), about 2^k distinct values must have been inserted
And we don't need to store B; remembering the smallest key up to a power of two, i.e. the position of its leading 1 bit, is enough

40 / 43 Flajolet-Martin probabilistic counter
Init: p ← log n
Update(x): let b be the position of the leftmost 1 bit of h(x), numbered from the low-order end (so h(x) ∈ [2^{b−1}, 2^b)); if (b < p) p ← b
Query: return n/2^p
E[Query] ≈ number of distinct elements, up to a small constant factor
Space: log p = loglog n bits
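
A hedged Python sketch of this counter, reading it as "Cohen's algorithm, but remembering the minimum hash value only up to a power of two", which is what makes loglog n bits of state suffice. The hash is again a stand-in.

    import random

    class FMCounter:
        def __init__(self, nbits=32, seed=0):
            self.nbits = nbits                       # hash range is [1, 2^nbits)
            self.p = nbits                           # Init: p = log n
            self._salt = random.Random(seed).randrange(1 << 61)

        def _h(self, x):                             # stand-in hash into [1, 2^nbits)
            return (hash((self._salt, x)) % ((1 << self.nbits) - 1)) + 1

        def update(self, x):
            b = self._h(x).bit_length()              # leftmost 1 bit: h(x) in [2^(b-1), 2^b)
            if b < self.p:
                self.p = b

        def query(self):
            # n / 2^p estimates the number of distinct elements, up to a small constant factor
            return (1 << self.nbits) / (2 ** self.p)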

41 / 43 Flajolet-Martin: reducing the variance
Solution 1: use k independent copies and average
Problem: runtime multiplied by k
Problem: now pairwise independent hash functions do not seem to suffice
We don't know how to generate several fully independent hash functions; in fact, we don't know how to generate even one fully independent hash function
But good-quality crypto hash functions work in this setting, and even weaker ones ("2-universal hash functions") do, given a minimum of entropy. And they use O(log n) bits

42 / 43 Flajolet-Martin: reducing the variance
Solution 2: divide the stream into m = O(ε^{−2}) substreams
Use the first bits of h(x) to decide the substream for x
Track p separately for each substream
Now a single h can be used for all copies; only one sketch is updated per item
Query: drop the top and bottom 20% of the estimates, average the rest
Space: O(m·loglog n + log n) = O(ε^{−2}·loglog n + log n)
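
Solution 2 (stochastic averaging) as a hedged sketch: a single stand-in hash is split into "which substream" bits and "value" bits, one p is kept per substream, and the combining rule follows the slide (drop the extreme 20% on each side, average the rest).

    import random

    class StochasticAveragingFM:
        def __init__(self, m=64, nbits=32, seed=0):
            self.m = m                               # m = O(1/eps^2) substreams
            self.nbits = nbits
            self.p = [nbits] * m                     # one p per substream
            self._salt = random.Random(seed).randrange(1 << 61)

        def update(self, x):                         # only one sketch updated per item
            hv = hash((self._salt, x)) & ((1 << (self.nbits + 16)) - 1)
            sub = (hv >> self.nbits) % self.m        # first bits choose the substream
            val = (hv & ((1 << self.nbits) - 1)) | 1 # remaining bits, forced nonzero
            b = val.bit_length()
            if b < self.p[sub]:
                self.p[sub] = b

        def query(self):
            # each substream sees ~ d/m distinct items, so scale its estimate by m
            ests = sorted(self.m * (1 << self.nbits) / (2 ** p) for p in self.p)
            cut = len(ests) // 5                     # drop top and bottom 20%
            kept = ests[cut:len(ests) - cut] or ests
            return sum(kept) / len(kept)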

43 / 43 Improving the leading constants
SuperLogLog [Durand+03]: take the geometric mean of the substream estimates
HyperLogLog [Flajolet+07]: take the harmonic mean; cardinalities up to 10^9 can be approximated within, say, 2% using 1.5 KB of memory
[Kane+10]: optimal O(ε^{−2} + log n) space, O(1) update time