An Optimal Algorithm for ℓ1-Heavy Hitters in Insertion Streams and Related Problems
1 An Optimal Algorithm for ℓ1-Heavy Hitters in Insertion Streams and Related Problems. Arnab Bhattacharyya, Palash Dey, and David P. Woodruff. Indian Institute of Science, Bangalore {arnabb,palash}@csa.iisc.ernet.in; IBM Research, Almaden dpwoodru@us.ibm.com. April 23, 2016. To appear: ACM Symposium on Principles of Database Systems (PODS 2016).
3 Talk Overview: Motivation, Problem Definition, State-of-the-art, Our Results
4 Data Stream. Frequency of item i = f_i; frequency vector = (f_1, ..., f_n).
5 Data Stream. A suitable model for many large sources of data: streams of network packets, sensor networks. It is impractical and undesirable to store and process the entire data exactly; instead, design algorithms that find approximate solutions, quickly building a summary in one pass over the data. An active area of research for the last 15 years, with a history going back 35 years.
6 Example: Heavy-hitters. Also called frequent items, elephants, or icebergs. One of the oldest streaming problems, with a history of more than 30 years.
7 Why Heavy-hitters? Monitoring Internet traffic: track bandwidth hogs, popular destinations. Subject of much streaming research: plenty of papers on heavy-hitters and its variants. A core streaming problem: many streaming problems (itemset mining etc.) are connected to heavy-hitters. Many practical applications: network data analysis, database optimization.
8 Problem Definition
9 ℓ1-Heavy-hitters, exact version: given a stream of items from [n] of length m, find an item occurring more than ϕm times. Here m is the ℓ1 norm of the frequency vector (f_1, ..., f_n). Unfortunately, the exact version is costly: it requires Ω(min{m, n}) space, even for ϕ = 1/2.
12 (ε, ϕ)-Heavy-hitters. Let 0 < ε < ϕ < 1 and let f_i be the frequency of item i. Problem definition: find a set S of items such that: (i) S contains every item i with f_i > ϕm (output all frequent items); (ii) S contains no item j with f_j < (ϕ − ε)m (don't output any rare item); (iii) moreover, for every item i ∈ S, output an estimate f̂_i such that |f̂_i − f_i| ≤ εm (estimate the frequencies of the frequent items).
14 State-of-the-art
15 Prior work on Heavy-hitters: Upper bounds. An algorithm for ε > 1/2 with space complexity O(log n + log m) bits was given by [?]. For general ε, the current best space complexity bound, from [?], is O((1/ε)(log n + log m)) bits. It was rediscovered by [?] and [?], who also provided worst-case O(1) time for updates and answers.
16 Boyer-Moore algorithm: works for ε > 1/2. Initialize: table = empty. Insert(i): if the table is empty, put i in the table and increment its counter; else if i is in the table, increment the counter; else decrement the counter (clearing the table when it reaches 0). Frequent item: the item left in the table. Approximate frequency: its counter value. Space: O(log n + log m) bits.
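The one-counter scheme above can be sketched in a few lines. This is a hedged illustration of the classic Boyer-Moore majority vote, not the authors' exact code; the function name is ours.

```python
def boyer_moore_majority(stream):
    """One counter: if some item occurs in more than half the stream,
    it is the surviving candidate."""
    candidate, count = None, 0
    for item in stream:
        if count == 0:            # table empty: adopt the new item
            candidate, count = item, 1
        elif item == candidate:   # match: increment
            count += 1
        else:                     # mismatch: decrement (pairs off two distinct items)
            count -= 1
    return candidate, count
```

Note that without a second verification pass, the candidate is only guaranteed correct when a strict majority item actually exists.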
24 Misra-Gries algorithm [?]: generalizes the Boyer-Moore algorithm using k − 1 counters.
Initialize: A ← empty associative array.
Process j:
  if j ∈ Keys(A): A[j] ← A[j] + 1
  else if |Keys(A)| < k − 1: A[j] ← 1
  else: for each l ∈ Keys(A): A[l] ← A[l] − 1; if A[l] = 0, remove l from A.
Output: on query a, if a ∈ Keys(A), report f̂_a = A[a]; else report f̂_a = 0.
25 Misra-Gries algorithm: Analysis. Crucial observation: each decrement is witnessed by k distinct tokens (including itself). Hence f_a − m/k ≤ f̂_a ≤ f_a for every a. Space complexity: O(k(log m + log n)) bits. Putting k = 1/ε solves (ε, ε)-heavy hitters with space O((1/ε)(log m + log n)) bits.
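The Misra-Gries summary and its guarantee f_a − m/k ≤ f̂_a ≤ f_a can be checked with a minimal runnable sketch; function and variable names are our own.

```python
def misra_gries(stream, k):
    """Maintain at most k-1 counters; every estimate is within m/k of the
    true frequency, where m = len(stream)."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement everything; each such round "spends" k distinct tokens
            # (the k-1 stored ones plus the incoming item), so there are at
            # most m/k rounds, bounding the undercount by m/k.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Query: the estimated frequency of a is counters.get(a, 0).
```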
26 Prior work on Heavy-hitters: Upper bounds. The same bound as [?], or worse, is also achieved by many others: Count-Sketch [?], CountMin-Sketch [?], sticky sampling and lossy counting [?], space saving [?], sample-and-hold [?], multi-stage Bloom filters [?], sketch-guided sampling [?], ...
27 Prior work on Heavy-hitters: Lower bounds. There can be at most 1/ϕ heavy hitters, giving a space bound of Ω(log (n choose 1/ϕ)) = Ω((1/ϕ) log(ϕn)). An easy reduction from the INDEXING problem provides a lower bound of Ω(1/ε).
28 Gap in Known Bounds. In summary, the best upper bound is O((1/ε)(log n + log m)) and the best lower bound is Ω((1/ϕ) log(ϕn) + 1/ε). For constant ϕ and ε = 1/log n, this is a quadratic gap.
29 Our Results
30 Our Result. We show that the space complexity of (ε, ϕ)-heavy hitters is Θ((1/ε) log(1/ϕ) + (1/ϕ) log n + log log m) bits (for n sufficiently large relative to 1/ε), with O(1) worst-case update and query response times. Our algorithm is randomized.
31 ℓ∞-approximation. Our algorithm also solves ε-Maximum: given an insertion-only stream of length m over a universe of size n, output an item with frequency at most εm less than the maximum; i.e., approximate the ℓ∞ norm of (f_1, ..., f_n) within ε · ||(f_1, ..., f_n)||_1. Our Result: the space complexity of ε-Maximum is Θ((1/ε) log(1/ε) + log n + log log m) bits (for n sufficiently large relative to 1/ε), with O(1) worst-case update and query response times. Our algorithm is randomized whereas Misra-Gries is deterministic!
34 Comparison with prior work. The best previous bound is again due to [?]: O((1/ε)(log n + log m)). Improving this result was listed as Open Problem 3 in the IITK Workshop on Data Streams (2006).
35 A Voting Perspective. The ε-Maximum problem finds an approximate winner of a plurality election when the votes are streamed. A natural question is to ask for winners of elections conducted according to other voting rules.
36 Veto voting (figure: example vetoes over candidates Modi, Rahul, Palash; Palash wins). ε-Minimum: given an insertion-only stream of length m over a universe of size n, find the minimum frequency up to additive error εm.
37 Borda voting. ε-Borda: given an insertion-only stream of length m over a universe of size n, find the maximum Borda score up to additive error εmn.
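For concreteness, the Borda score can be computed exactly offline as follows (the streaming problem asks to approximate the maximum score within εmn); the helper name and the tiny election are illustrative.

```python
def borda_scores(votes, candidates):
    """Each vote ranks all n candidates; the candidate in position p
    (0-based) of a ranking earns n - 1 - p points."""
    n = len(candidates)
    scores = {c: 0 for c in candidates}
    for ranking in votes:
        for pos, c in enumerate(ranking):
            scores[c] += n - 1 - pos
    return scores
```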
38 Maximin voting. Example: 4 votes A ≻ B ≻ C; 3 votes C ≻ B ≻ A; 2 votes B ≻ A ≻ C; 2 votes C ≻ A ≻ B. Maximin scores: A: 6, B: 5, C: 5. ε-Maximin: given an insertion-only stream of length m over a universe of size n, find the maximum maximin score up to additive error εm.
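The maximin scores in the example above can be reproduced with a short exact computation (offline, for intuition only; the function name is ours).

```python
from itertools import combinations

def maximin_scores(votes):
    """Maximin score of c = min over opponents d of the number of votes
    ranking c above d."""
    candidates = votes[0]
    # pairwise[c][d] = number of votes ranking c above d
    pairwise = {c: {d: 0 for d in candidates if d != c} for c in candidates}
    for ranking in votes:
        pos = {c: i for i, c in enumerate(ranking)}
        for c, d in combinations(candidates, 2):
            if pos[c] < pos[d]:
                pairwise[c][d] += 1
            else:
                pairwise[d][c] += 1
    return {c: min(pairwise[c].values()) for c in candidates}
```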
39 Other Results.
ε-Minimum: O((1/ε) log log(1/ε) + log log m), Ω(1/ε + log log m) — optimal up to O(log log(1/ε)).
ε-Borda: Θ(n(log(1/ε) + log n) + log log m) — optimal up to O(1).
ε-Maximin: O((n/ε²) log² n + log log m), Ω(n/ε² + log log m) — optimal up to O(log² n).
41 Algorithm
42 A simpler, almost optimal heavy-hitters algorithm: Hash, Sample, Count.
43 Algorithm, in more detail. Choose a hash function from a universal family that hashes each id to a universe of size poly(1/ε). Sample l = poly(1/ε) items from the stream. Feed the hashed samples into a Misra-Gries data structure with 1/ε counters, while storing the actual ids of the top 1/ϕ items according to the Misra-Gries data structure.
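The three steps above might be sketched as follows. This is a simplified illustration, not the paper's exact construction: the constants, the hash family, sampling with replacement, and all names are our own assumptions.

```python
import random

def simple_heavy_hitters(stream, eps, phi, seed=0):
    """Sketch: sample items, hash ids into a small universe, run
    Misra-Gries on the hashes, and remember a real id per surviving hash."""
    rng = random.Random(seed)
    m = len(stream)
    ell = int(4 / eps**2)                     # sample size, poly(1/eps)
    sample = [stream[rng.randrange(m)] for _ in range(ell)]

    universe = int(10 / eps**3)               # hashed id space, poly(1/eps)
    p = 2**31 - 1
    a, b = rng.randrange(1, p), rng.randrange(p)
    h = lambda x: ((a * hash(x) + b) % p) % universe  # simplified universal hash

    counters, ids = {}, {}                    # Misra-Gries table keyed by hash
    k = int(2 / eps)
    for item in sample:
        hx = h(item)
        ids[hx] = item                        # remember an actual id for this hash
        if hx in counters:
            counters[hx] += 1
        elif len(counters) < k:
            counters[hx] = 1
        else:
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
                    ids.pop(key, None)
    # report the ids behind the top 1/phi hashed counters
    top = sorted(counters, key=counters.get, reverse=True)[: int(1 / phi)]
    return {ids[hx] for hx in top}
```

With the parameters in the test below, the unique heavy item (60% of the stream) is reported with high probability; the assertions only check the always-true structural guarantees.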
44 Algorithm: Correctness. With high probability, there is no collision among the hashed ids in the sample. If S is a random subset of size O(1/ε²) chosen from the stream, and f_i^S is the frequency of item i in S, then [?]: with constant probability, |f_i^S / |S| − f_i / m| ≤ ε holds for every i ∈ [n] simultaneously.
45 Algorithm analysis: Space complexity. A random item can be chosen from the stream using O(log log m) bits of space. The Misra-Gries data structure uses O((1/ε) log(1/ε)) bits: the id space is poly(1/ε), the length of the subsampled stream is poly(1/ε), and we get εl additive approximation. Storing the ids of the top 1/ϕ items of the Misra-Gries table takes (1/ϕ) log n bits. Total space: O((1/ε) log(1/ε) + (1/ϕ) log n + log log m) bits — an extra log(1/ε) factor over the optimal bound.
47 Algorithm analysis: Time complexity. Goal: O(1) update and query response time. Once an item is sampled, updating the Misra-Gries table takes O(1/ε) time. However, with very high probability, no item is sampled among the next O(1/ε) stream items, so we distribute the O(1/ε) update work over those next O(1/ε) items. The worst-case update time is O(1); note that this guarantee is not merely amortized!
49 Optimal heavy-hitters algorithm: Θ((1/ε) log(1/ϕ) + (1/ϕ) log n + log log m) bits.
50 Optimal algorithm: overview. Sample l = O(1/ε²) items from the stream. Run the Misra-Gries algorithm for (ϕ/2, ϕ/2)-heavy hitters on the sample; it returns a set C of O(1/ϕ) items. For each item i, with probability 1 − O(ϕ), estimate the frequency of i in the sampled stream to within additive error O(εl) = O(1/ε).
51 Optimal algorithm (cont.): approximate counting with additive error O(1/ε) — the accelerated counter.
Insert(i): with probability p_i, increment ĉ_i.
Output(i): return f̂_i = ĉ_i / p_i.
Claim 1: if p_i = Θ(ε² f_i), then Var[f̂_i] = O(1/ε²).
Claim 2: f̂_i is an unbiased estimator of f_i, with additive error O(1/ε) with constant probability.
The error probability can be reduced from constant to O(ϕ) by repeating O(log(1/ϕ)) times and taking the median.
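The probabilistic counter above is easy to simulate. This sketch fixes a single sampling probability p, whereas the actual algorithm adapts p_i to f_i across epochs; the class name is ours.

```python
import random

class AcceleratedCounter:
    """Increment with probability p; estimate = count / p.
    E[estimate] = f (unbiased) and Var[estimate] = f(1-p)/p for f inserts."""
    def __init__(self, p, seed=0):
        self.p = p
        self.c = 0
        self.rng = random.Random(seed)

    def insert(self):
        if self.rng.random() < self.p:   # flip a p-biased coin per arrival
            self.c += 1

    def estimate(self):
        return self.c / self.p
```

With p = Θ(ε² f), the standard deviation is O(1/ε), which is where the additive error O(1/ε) in Claim 2 comes from.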
54 Optimal algorithm (cont.): tackling two issues.
1. We need to keep Ω(l) = Ω(1/ε²) counters. Solution: hash into a space of size Θ(1/ε).
2. We do not know f_i. Solution: divide the stream into epochs, maintain a 4-approximation of the frequencies in each epoch, and change the p_i's dynamically.
55 Lower Bound
56 Lower bound: dependence on m. GREATER-THAN_{log m}: Alice has x ∈ [log m], Bob has y ∈ [log m]; Alice sends one message to Bob, and Bob must output whether or not x > y. Known result: Alice must send Ω(log log m) bits [?, ?].
57 Lower bound: reduction from GREATER-THAN_{log m} to (ε, ϕ)-heavy hitters. Alice inserts 2^x copies of item A and sends the algorithm's memory contents to Bob; Bob inserts 2^y copies of item B and queries. The answer is A iff x > y.
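The reduction can be sanity-checked with any exact heavy-hitter routine: A is the majority of the combined stream exactly when 2^x > 2^y, i.e., when x > y. Below, a one-counter majority subroutine stands in for the streaming algorithm (an illustration under our own names; it assumes x ≠ y, since the one-counter stand-in cannot break ties).

```python
def majority(stream):
    """Boyer-Moore one-counter majority candidate."""
    cand, cnt = None, 0
    for it in stream:
        if cnt == 0:
            cand, cnt = it, 1
        elif it == cand:
            cnt += 1
        else:
            cnt -= 1
    return cand

def greater_than_via_heavy_hitters(x, y):
    # Alice streams 2^x copies of "A", Bob appends 2^y copies of "B";
    # the majority item is "A" iff 2^x > 2^y iff x > y (assumes x != y).
    stream = ["A"] * (2 ** x) + ["B"] * (2 ** y)
    return majority(stream) == "A"
```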
62 Lower bound: dependence on ε and ϕ. INDEXING_{n,m}: Alice has a string x ∈ [n]^m, Bob has an index i ∈ [m]; Alice sends one message to Bob, and Bob must output x_i. Known result: the one-way communication complexity of INDEXING_{n,m} is Ω(m log n).
63 Lower bound: reduction from INDEXING_{1/(2(ϕ−ε)), 1/(2ε)} to (ε, ϕ)-heavy hitters. Alice has x ∈ [1/(2(ϕ−ε))]^{1/(2ε)} and inserts εm copies of each pair (x_j, j) for j ∈ [1/(2ε)], then sends the algorithm's memory contents to Bob. Bob has i ∈ [1/(2ε)] and inserts (ϕ − ε)m copies of each (j, i) for j ∈ [1/(2(ϕ−ε))]. The pair (x_i, i) is the unique heavy hitter, so Bob recovers x_i.
68 Notes. For the turnstile model of streaming (insertions as well as deletions): the CountMin sketch gives an O((1/ε) log² n) bits upper bound [?]; an Ω((1/ϕ) log² n) lower bound is known due to [?]. [?] show a stronger error bound for Misra-Gries in terms of the tail of the frequency distribution; can we simultaneously achieve optimal space complexity? Possible other applications to rank aggregation.
More informationB669 Sublinear Algorithms for Big Data
B669 Sublinear Algorithms for Big Data Qin Zhang 1-1 2-1 Part 1: Sublinear in Space The model and challenge The data stream model (Alon, Matias and Szegedy 1996) a n a 2 a 1 RAM CPU Why hard? Cannot store
More informationLecture 6 September 13, 2016
CS 395T: Sublinear Algorithms Fall 206 Prof. Eric Price Lecture 6 September 3, 206 Scribe: Shanshan Wu, Yitao Chen Overview Recap of last lecture. We talked about Johnson-Lindenstrauss (JL) lemma [JL84]
More informationFrequency Estimators
Frequency Estimators Outline for Today Randomized Data Structures Our next approach to improving performance. Count-Min Sketches A simple and powerful data structure for estimating frequencies. Count Sketches
More informationLecture 1: Introduction to Sublinear Algorithms
CSE 522: Sublinear (and Streaming) Algorithms Spring 2014 Lecture 1: Introduction to Sublinear Algorithms March 31, 2014 Lecturer: Paul Beame Scribe: Paul Beame Too much data, too little time, space for
More informationAd Placement Strategies
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January
More informationEffective computation of biased quantiles over data streams
Effective computation of biased quantiles over data streams Graham Cormode cormode@bell-labs.com S. Muthukrishnan muthu@cs.rutgers.edu Flip Korn flip@research.att.com Divesh Srivastava divesh@research.att.com
More informationAMORTIZED ANALYSIS. binary counter multipop stack dynamic table. Lecture slides by Kevin Wayne. Last updated on 1/24/17 11:31 AM
AMORTIZED ANALYSIS binary counter multipop stack dynamic table Lecture slides by Kevin Wayne http://www.cs.princeton.edu/~wayne/kleinberg-tardos Last updated on 1/24/17 11:31 AM Amortized analysis Worst-case
More informationCS 5321: Advanced Algorithms Amortized Analysis of Data Structures. Motivations. Motivation cont d
CS 5321: Advanced Algorithms Amortized Analysis of Data Structures Ali Ebnenasir Department of Computer Science Michigan Technological University Motivations Why amortized analysis and when? Suppose you
More informationCS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32
CS 473: Algorithms Ruta Mehta University of Illinois, Urbana-Champaign Spring 2018 Ruta (UIUC) CS473 1 Spring 2018 1 / 32 CS 473: Algorithms, Spring 2018 Universal Hashing Lecture 10 Feb 15, 2018 Most
More informationApproximate Counts and Quantiles over Sliding Windows
Approximate Counts and Quantiles over Sliding Windows Arvind Arasu Stanford University arvinda@cs.stanford.edu Gurmeet Singh Manku Stanford University manku@cs.stanford.edu Abstract We consider the problem
More informationCS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS341 info session is on Thu 3/1 5pm in Gates415 CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/28/18 Jure Leskovec, Stanford CS246: Mining Massive Datasets,
More information1 Estimating Frequency Moments in Streams
CS 598CSC: Algorithms for Big Data Lecture date: August 28, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri 1 Estimating Frequency Moments in Streams A significant fraction of streaming literature
More informationLecture Notes for Chapter 17: Amortized Analysis
Lecture Notes for Chapter 17: Amortized Analysis Chapter 17 overview Amortized analysis Analyze a sequence of operations on a data structure. Goal: Show that although some individual operations may be
More informationUniversity of New Mexico Department of Computer Science. Final Examination. CS 561 Data Structures and Algorithms Fall, 2013
University of New Mexico Department of Computer Science Final Examination CS 561 Data Structures and Algorithms Fall, 2013 Name: Email: This exam lasts 2 hours. It is closed book and closed notes wing
More informationLecture 5, CPA Secure Encryption from PRFs
CS 4501-6501 Topics in Cryptography 16 Feb 2018 Lecture 5, CPA Secure Encryption from PRFs Lecturer: Mohammad Mahmoody Scribe: J. Fu, D. Anderson, W. Chao, and Y. Yu 1 Review Ralling: CPA Security and
More informationCS5112: Algorithms and Data Structures for Applications
CS5112: Algorithms and Data Structures for Applications Lecture 14: Exponential decay; convolution Ramin Zabih Some content from: Piotr Indyk; Wikipedia/Google image search; J. Leskovec, A. Rajaraman,
More informationImproved Concentration Bounds for Count-Sketch
Improved Concentration Bounds for Count-Sketch Gregory T. Minton 1 Eric Price 2 1 MIT MSR New England 2 MIT IBM Almaden UT Austin 2014-01-06 Gregory T. Minton, Eric Price (IBM) Improved Concentration Bounds
More informationHow Philippe Flipped Coins to Count Data
1/18 How Philippe Flipped Coins to Count Data Jérémie Lumbroso LIP6 / INRIA Rocquencourt December 16th, 2011 0. DATA STREAMING ALGORITHMS Stream: a (very large) sequence S over (also very large) domain
More informationSketching in Adversarial Environments
Sketching in Adversarial Environments Ilya Mironov Moni Naor Gil Segev Abstract We formalize a realistic model for computations over massive data sets. The model, referred to as the adversarial sketch
More informationImproved Algorithms for Distributed Entropy Monitoring
Improved Algorithms for Distributed Entropy Monitoring Jiecao Chen Indiana University Bloomington jiecchen@umail.iu.edu Qin Zhang Indiana University Bloomington qzhangcs@indiana.edu Abstract Modern data
More informationFoundations of Data Mining
Foundations of Data Mining http://www.cohenwang.com/edith/dataminingclass2017 Instructors: Haim Kaplan Amos Fiat Lecture 1 1 Course logistics Tuesdays 16:00-19:00, Sherman 002 Slides for (most or all)
More informationCSE 190, Great ideas in algorithms: Pairwise independent hash functions
CSE 190, Great ideas in algorithms: Pairwise independent hash functions 1 Hash functions The goal of hash functions is to map elements from a large domain to a small one. Typically, to obtain the required
More information7 Algorithms for Massive Data Problems
7 Algorithms for Massive Data Problems Massive Data, Sampling This chapter deals with massive data problems where the input data (a graph, a matrix or some other object) is too large to be stored in random
More informationLecture 7: Amortized analysis
Lecture 7: Amortized analysis In many applications we want to minimize the time for a sequence of operations. To sum worst-case times for single operations can be overly pessimistic, since it ignores correlation
More information6 Filtering and Streaming
Casus ubique valet; semper tibi pendeat hamus: Quo minime credas gurgite, piscis erit. [Luck affects everything. Let your hook always be cast. Where you least expect it, there will be a fish.] Publius
More informationCR-precis: A deterministic summary structure for update data streams
CR-precis: A deterministic summary structure for update data streams Sumit Ganguly 1 and Anirban Majumder 2 1 Indian Institute of Technology, Kanpur 2 Lucent Technologies, Bangalore Abstract. We present
More informationLecture 24: Bloom Filters. Wednesday, June 2, 2010
Lecture 24: Bloom Filters Wednesday, June 2, 2010 1 Topics for the Final SQL Conceptual Design (BCNF) Transactions Indexes Query execution and optimization Cardinality Estimation Parallel Databases 2 Lecture
More informationWavelet decomposition of data streams. by Dragana Veljkovic
Wavelet decomposition of data streams by Dragana Veljkovic Motivation Continuous data streams arise naturally in: telecommunication and internet traffic retail and banking transactions web server log records
More informationBig Data. Big data arises in many forms: Common themes:
Big Data Big data arises in many forms: Physical Measurements: from science (physics, astronomy) Medical data: genetic sequences, detailed time series Activity data: GPS location, social network activity
More informationLecture 7: Fingerprinting. David Woodruff Carnegie Mellon University
Lecture 7: Fingerprinting David Woodruff Carnegie Mellon University How to Pick a Random Prime How to pick a random prime in the range {1, 2,, M}? How to pick a random integer X? Pick a uniformly random
More informationBenes and Butterfly schemes revisited
Benes and Butterfly schemes revisited Jacques Patarin, Audrey Montreuil Université de Versailles 45 avenue des Etats-Unis 78035 Versailles Cedex - France Abstract In [1], W. Aiello and R. Venkatesan have
More information1-Pass Relative-Error L p -Sampling with Applications
-Pass Relative-Error L p -Sampling with Applications Morteza Monemizadeh David P Woodruff Abstract For any p [0, 2], we give a -pass poly(ε log n)-space algorithm which, given a data stream of length m
More informationLecture Lecture 25 November 25, 2014
CS 224: Advanced Algorithms Fall 2014 Lecture Lecture 25 November 25, 2014 Prof. Jelani Nelson Scribe: Keno Fischer 1 Today Finish faster exponential time algorithms (Inclusion-Exclusion/Zeta Transform,
More informationBlock Heavy Hitters Alexandr Andoni, Khanh Do Ba, and Piotr Indyk
Computer Science and Artificial Intelligence Laboratory Technical Report -CSAIL-TR-2008-024 May 2, 2008 Block Heavy Hitters Alexandr Andoni, Khanh Do Ba, and Piotr Indyk massachusetts institute of technology,
More informationPart 1: Hashing and Its Many Applications
1 Part 1: Hashing and Its Many Applications Sid C-K Chau Chi-Kin.Chau@cl.cam.ac.u http://www.cl.cam.ac.u/~cc25/teaching Why Randomized Algorithms? 2 Randomized Algorithms are algorithms that mae random
More information