An Optimal Algorithm for ℓ1-Heavy Hitters in Insertion Streams and Related Problems
1 An Optimal Algorithm for ℓ1-Heavy Hitters in Insertion Streams and Related Problems. Arnab Bhattacharyya, Palash Dey, and David P. Woodruff. Indian Institute of Science, Bangalore {arnabb,palash}@csa.iisc.ernet.in; IBM Research, Almaden dpwoodru@us.ibm.com. April 23, 2016. To appear: ACM Symposium on Principles of Database Systems (PODS 2016).
3 Talk Overview: Motivation, Problem Definition, State-of-the-art, Our Results
4 Data Stream. Frequency of item i = f_i; frequency vector = (f_1, ..., f_n).
5 Data Stream. A suitable model for many large sources of data: streams of network packets, sensor networks. It is impractical and undesirable to store and process the entire data exactly; instead, design algorithms that find approximate solutions, quickly building a summary in one pass over the data. An active area of research for the last 15 years, with a history going back 35 years.
6 Example: Heavy-hitters. Also called frequent items, elephants, or icebergs. One of the oldest streaming problems, with a history of more than 30 years.
7 Why Heavy-hitters? Monitoring Internet traffic: track bandwidth hogs, popular destinations. Subject of much streaming research: plenty of papers on heavy-hitters and its variants. A core streaming problem: many streaming problems (itemset mining etc.) are connected to heavy-hitters. Many practical applications: network data analysis, database optimization.
8 Problem Definition
9 ℓ1-Heavy-hitters, exact version: given a stream of items from [n] of length m, find an item occurring more than ϕm times. Here m is the ℓ1 norm of the frequency vector (f_1, ..., f_n). Unfortunately, the exact version is costly: it requires Ω(min{m, n}) space, even for ϕ = 1/2.
12 (ε, ϕ)-Heavy-hitters. Let 0 < ε < ϕ < 1 and let f_i be the frequency of item i. Problem definition: find a set S of items such that: (i) S contains every item i with f_i > ϕm (output all frequent items); (ii) S contains no item j with f_j < (ϕ − ε)m (don't output any rare item); (iii) moreover, for every item i ∈ S, output an estimate f̂_i such that |f̂_i − f_i| ≤ εm (estimate the frequencies of the frequent items).
14 State-of-the-art
15 Prior work on Heavy-hitters: Upper bounds. An algorithm for ε > 1/2 with space complexity O(log n + log m) bits was given by [?]. For general ε, the current best space complexity bound, from [?], is O((1/ε)(log n + log m)) bits. It was rediscovered by [?] and [?], who also provided worst-case O(1) time for updates and answers.
16 Boyer-Moore algorithm: works for ε > 1/2. Initialize: table = empty. Insert(i): if the table is empty, put i in the table and increment its counter; else if i is in the table, increment the counter; else decrement the counter (clearing the table when it reaches 0). Frequent item: the item left in the table. Approximate frequency: its counter value. Space: O(log n + log m) bits.
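The one-counter scheme above can be sketched in a few lines. This is a hedged illustration of the classic Boyer-Moore majority vote, not the authors' exact code; the function name is ours.

```python
def boyer_moore_majority(stream):
    """One counter: if some item occurs in more than half the stream,
    it is the surviving candidate."""
    candidate, count = None, 0
    for item in stream:
        if count == 0:            # table empty: adopt the new item
            candidate, count = item, 1
        elif item == candidate:   # match: increment
            count += 1
        else:                     # mismatch: decrement (pairs off two distinct items)
            count -= 1
    return candidate, count
```

Note that without a second verification pass, the candidate is only guaranteed correct when a strict majority item actually exists.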
24 Misra-Gries algorithm [?]: generalizes the Boyer-Moore algorithm using k − 1 counters.
Initialize: A ← empty associative array.
Process j:
  if j ∈ Keys(A): A[j] ← A[j] + 1
  else if |Keys(A)| < k − 1: A[j] ← 1
  else: for each l ∈ Keys(A): A[l] ← A[l] − 1; if A[l] = 0, remove l from A.
Output: on query a, if a ∈ Keys(A), report f̂_a = A[a]; else report f̂_a = 0.
25 Misra-Gries algorithm: Analysis. Crucial observation: each decrement is witnessed by k distinct tokens (including itself). Hence f_a − m/k ≤ f̂_a ≤ f_a for every a. Space complexity: O(k(log m + log n)) bits. Putting k = 1/ε solves (ε, ε)-heavy hitters with space O((1/ε)(log m + log n)) bits.
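The Misra-Gries summary and its guarantee f_a − m/k ≤ f̂_a ≤ f_a can be checked with a minimal runnable sketch; function and variable names are our own.

```python
def misra_gries(stream, k):
    """Maintain at most k-1 counters; every estimate is within m/k of the
    true frequency, where m = len(stream)."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement everything; each such round "spends" k distinct tokens
            # (the k-1 stored ones plus the incoming item), so there are at
            # most m/k rounds, bounding the undercount by m/k.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Query: the estimated frequency of a is counters.get(a, 0).
```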
26 Prior work on Heavy-hitters: Upper bounds. The same bound as [?], or worse, is also achieved by many others: Count-Sketch [?], CountMin-Sketch [?], sticky sampling and lossy counting [?], space saving [?], sample-and-hold [?], multi-stage Bloom filters [?], sketch-guided sampling [?], ...
27 Prior work on Heavy-hitters: Lower bounds. There can be at most 1/ϕ heavy hitters, giving a space bound of Ω(log (n choose 1/ϕ)) = Ω((1/ϕ) log(ϕn)). An easy reduction from the INDEXING problem provides a lower bound of Ω(1/ε).
28 Gap in Known Bounds. In summary, the best upper bound is O((1/ε)(log n + log m)) and the best lower bound is Ω((1/ϕ) log(ϕn) + 1/ε). For constant ϕ and ε = 1/log n, this is a quadratic gap.
29 Our Results
30 Our Result. We show that the space complexity of (ε, ϕ)-heavy hitters is Θ((1/ε) log(1/ϕ) + (1/ϕ) log n + log log m) bits (for n sufficiently large relative to 1/ε), with O(1) worst-case update and query response times. Our algorithm is randomized.
31 ℓ∞-approximation. Our algorithm also solves ε-Maximum: given an insertion-only stream of length m over a universe of size n, output an item with frequency at most εm less than the maximum; i.e., approximate the ℓ∞ norm of (f_1, ..., f_n) within ε · ||(f_1, ..., f_n)||_1. Our Result: the space complexity of ε-Maximum is Θ((1/ε) log(1/ε) + log n + log log m) bits (for n sufficiently large relative to 1/ε), with O(1) worst-case update and query response times. Our algorithm is randomized whereas Misra-Gries is deterministic!
34 Comparison with prior work. The best previous bound is again due to [?]: O((1/ε)(log n + log m)). Improving this result was listed as Open Problem 3 in the IITK Workshop on Data Streams (2006).
35 A Voting Perspective. The ε-Maximum problem finds an approximate winner of a plurality election when the votes are streamed. A natural question is to ask for winners of elections conducted according to other voting rules.
36 Veto voting (figure: example vetoes over candidates Modi, Rahul, Palash; Palash wins). ε-Minimum: given an insertion-only stream of length m over a universe of size n, find the minimum frequency up to additive error εm.
37 Borda voting. ε-Borda: given an insertion-only stream of length m over a universe of size n, find the maximum Borda score up to additive error εmn.
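For concreteness, the Borda score can be computed exactly offline as follows (the streaming problem asks to approximate the maximum score within εmn); the helper name and the tiny election are illustrative.

```python
def borda_scores(votes, candidates):
    """Each vote ranks all n candidates; the candidate in position p
    (0-based) of a ranking earns n - 1 - p points."""
    n = len(candidates)
    scores = {c: 0 for c in candidates}
    for ranking in votes:
        for pos, c in enumerate(ranking):
            scores[c] += n - 1 - pos
    return scores
```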
38 Maximin voting. Example: 4 votes A ≻ B ≻ C; 3 votes C ≻ B ≻ A; 2 votes B ≻ A ≻ C; 2 votes C ≻ A ≻ B. Maximin scores: A: 6, B: 5, C: 5. ε-Maximin: given an insertion-only stream of length m over a universe of size n, find the maximum maximin score up to additive error εm.
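The maximin scores in the example above can be reproduced with a short exact computation (offline, for intuition only; the function name is ours).

```python
from itertools import combinations

def maximin_scores(votes):
    """Maximin score of c = min over opponents d of the number of votes
    ranking c above d."""
    candidates = votes[0]
    # pairwise[c][d] = number of votes ranking c above d
    pairwise = {c: {d: 0 for d in candidates if d != c} for c in candidates}
    for ranking in votes:
        pos = {c: i for i, c in enumerate(ranking)}
        for c, d in combinations(candidates, 2):
            if pos[c] < pos[d]:
                pairwise[c][d] += 1
            else:
                pairwise[d][c] += 1
    return {c: min(pairwise[c].values()) for c in candidates}
```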
39 Other Results.
ε-Minimum: O((1/ε) log log(1/ε) + log log m), Ω(1/ε + log log m) — optimal up to O(log log(1/ε)).
ε-Borda: Θ(n(log(1/ε) + log n) + log log m) — optimal up to O(1).
ε-Maximin: O((n/ε²) log² n + log log m), Ω(n/ε² + log log m) — optimal up to O(log² n).
41 Algorithm
42 A simpler, almost optimal heavy-hitters algorithm: Hash, Sample, Count.
43 Algorithm, in more detail. Choose a hash function from a universal family that hashes each id to a universe of size poly(1/ε). Sample l = poly(1/ε) items from the stream. Feed the hashed samples into a Misra-Gries data structure with 1/ε counters, while storing the actual ids of the top 1/ϕ items according to the Misra-Gries data structure.
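The three steps above might be sketched as follows. This is a simplified illustration, not the paper's exact construction: the constants, the hash family, sampling with replacement, and all names are our own assumptions.

```python
import random

def simple_heavy_hitters(stream, eps, phi, seed=0):
    """Sketch: sample items, hash ids into a small universe, run
    Misra-Gries on the hashes, and remember a real id per surviving hash."""
    rng = random.Random(seed)
    m = len(stream)
    ell = int(4 / eps**2)                     # sample size, poly(1/eps)
    sample = [stream[rng.randrange(m)] for _ in range(ell)]

    universe = int(10 / eps**3)               # hashed id space, poly(1/eps)
    p = 2**31 - 1
    a, b = rng.randrange(1, p), rng.randrange(p)
    h = lambda x: ((a * hash(x) + b) % p) % universe  # simplified universal hash

    counters, ids = {}, {}                    # Misra-Gries table keyed by hash
    k = int(2 / eps)
    for item in sample:
        hx = h(item)
        ids[hx] = item                        # remember an actual id for this hash
        if hx in counters:
            counters[hx] += 1
        elif len(counters) < k:
            counters[hx] = 1
        else:
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
                    ids.pop(key, None)
    # report the ids behind the top 1/phi hashed counters
    top = sorted(counters, key=counters.get, reverse=True)[: int(1 / phi)]
    return {ids[hx] for hx in top}
```

With the parameters in the test below, the unique heavy item (60% of the stream) is reported with high probability; the assertions only check the always-true structural guarantees.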
44 Algorithm: Correctness. With high probability, there is no collision among the hashed ids in the sample. If S is a random subset of size O(1/ε²) chosen from the stream, and f_i^S is the frequency of item i in S, then [?]: with constant probability, |f_i^S / |S| − f_i / m| ≤ ε holds for every i ∈ [n] simultaneously.
45 Algorithm analysis: Space complexity. A random item can be chosen from the stream using O(log log m) bits of space. The Misra-Gries data structure uses O((1/ε) log(1/ε)) bits: the id space is poly(1/ε), the length of the subsampled stream is poly(1/ε), and we get εl additive approximation. Storing the ids of the top 1/ϕ items of the Misra-Gries table takes (1/ϕ) log n bits. Total space: O((1/ε) log(1/ε) + (1/ϕ) log n + log log m) bits — an extra log(1/ε) factor over the optimal bound.
47 Algorithm analysis: Time complexity. Goal: O(1) update and query response time. Once an item is sampled, updating the Misra-Gries table takes O(1/ε) time. However, with very high probability, no item is sampled among the next O(1/ε) stream items, so we distribute the O(1/ε) update work over those next O(1/ε) items. The worst-case update time is O(1); note that this guarantee is not merely amortized!
49 Optimal heavy-hitters algorithm: Θ((1/ε) log(1/ϕ) + (1/ϕ) log n + log log m) bits.
50 Optimal algorithm: overview. Sample l = O(1/ε²) items from the stream. Run the Misra-Gries algorithm for (ϕ/2, ϕ/2)-heavy hitters on the sample; it returns a set C of O(1/ϕ) items. For each item i, with probability 1 − O(ϕ), estimate the frequency of i in the sampled stream to within additive error O(εl) = O(1/ε).
51 Optimal algorithm (cont.): approximate counting with additive error O(1/ε) — the accelerated counter.
Insert(i): with probability p_i, increment ĉ_i.
Output(i): return f̂_i = ĉ_i / p_i.
Claim 1: if p_i = Θ(ε² f_i), then Var[f̂_i] = O(1/ε²).
Claim 2: f̂_i is an unbiased estimator of f_i, with additive error O(1/ε) with constant probability.
The error probability can be reduced from constant to O(ϕ) by repeating O(log(1/ϕ)) times and taking the median.
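The probabilistic counter above is easy to simulate. This sketch fixes a single sampling probability p, whereas the actual algorithm adapts p_i to f_i across epochs; the class name is ours.

```python
import random

class AcceleratedCounter:
    """Increment with probability p; estimate = count / p.
    E[estimate] = f (unbiased) and Var[estimate] = f(1-p)/p for f inserts."""
    def __init__(self, p, seed=0):
        self.p = p
        self.c = 0
        self.rng = random.Random(seed)

    def insert(self):
        if self.rng.random() < self.p:   # flip a p-biased coin per arrival
            self.c += 1

    def estimate(self):
        return self.c / self.p
```

With p = Θ(ε² f), the standard deviation is O(1/ε), which is where the additive error O(1/ε) in Claim 2 comes from.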
54 Optimal algorithm (cont.): tackling two issues.
1. We need to keep Ω(l) = Ω(1/ε²) counters. Solution: hash into a space of size Θ(1/ε).
2. We do not know f_i. Solution: divide the stream into epochs, maintain a 4-approximation of the frequencies in each epoch, and change the p_i's dynamically.
55 Lower Bound
56 Lower bound: dependence on m. GREATER-THAN_{log m}: Alice has x ∈ [log m], Bob has y ∈ [log m]; Alice sends one message to Bob, and Bob must output whether or not x > y. Known result: Alice must send Ω(log log m) bits [?, ?].
57 Lower bound: reduction from GREATER-THAN_{log m} to (ε, ϕ)-heavy hitters. Alice inserts 2^x copies of item A and sends the algorithm's memory contents to Bob; Bob inserts 2^y copies of item B and queries. The answer is A iff x > y.
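The reduction can be sanity-checked with any exact heavy-hitter routine: A is the majority of the combined stream exactly when 2^x > 2^y, i.e., when x > y. Below, a one-counter majority subroutine stands in for the streaming algorithm (an illustration under our own names; it assumes x ≠ y, since the one-counter stand-in cannot break ties).

```python
def majority(stream):
    """Boyer-Moore one-counter majority candidate."""
    cand, cnt = None, 0
    for it in stream:
        if cnt == 0:
            cand, cnt = it, 1
        elif it == cand:
            cnt += 1
        else:
            cnt -= 1
    return cand

def greater_than_via_heavy_hitters(x, y):
    # Alice streams 2^x copies of "A", Bob appends 2^y copies of "B";
    # the majority item is "A" iff 2^x > 2^y iff x > y (assumes x != y).
    stream = ["A"] * (2 ** x) + ["B"] * (2 ** y)
    return majority(stream) == "A"
```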
62 Lower bound: dependence on ε and ϕ. INDEXING_{n,m}: Alice has a string x ∈ [n]^m, Bob has an index i ∈ [m]; Alice sends one message to Bob, and Bob must output x_i. Known result: the one-way communication complexity of INDEXING_{n,m} is Ω(m log n).
63 Lower bound: reduction from INDEXING_{1/(2(ϕ−ε)), 1/(2ε)} to (ε, ϕ)-heavy hitters. Alice has x ∈ [1/(2(ϕ−ε))]^{1/(2ε)} and inserts εm copies of each pair (x_j, j) for j ∈ [1/(2ε)], then sends the algorithm's memory contents to Bob. Bob has i ∈ [1/(2ε)] and inserts (ϕ − ε)m copies of each (j, i) for j ∈ [1/(2(ϕ−ε))]. The pair (x_i, i) is the unique heavy hitter, so Bob recovers x_i.
68 Notes. For the turnstile model of streaming (insertions as well as deletions): the CountMin sketch gives an O((1/ε) log² n) bits upper bound [?]; an Ω((1/ϕ) log² n) lower bound is known due to [?]. [?] show a stronger error bound for Misra-Gries in terms of the tail of the frequency distribution; can we simultaneously achieve optimal space complexity? Possible other applications to rank aggregation.
More informationB669 Sublinear Algorithms for Big Data
B669 Sublinear Algorithms for Big Data Qin Zhang 1-1 2-1 Part 1: Sublinear in Space The model and challenge The data stream model (Alon, Matias and Szegedy 1996) a n a 2 a 1 RAM CPU Why hard? Cannot store
More informationLecture 6 September 13, 2016
CS 395T: Sublinear Algorithms Fall 206 Prof. Eric Price Lecture 6 September 3, 206 Scribe: Shanshan Wu, Yitao Chen Overview Recap of last lecture. We talked about Johnson-Lindenstrauss (JL) lemma [JL84]
More informationFrequency Estimators
Frequency Estimators Outline for Today Randomized Data Structures Our next approach to improving performance. Count-Min Sketches A simple and powerful data structure for estimating frequencies. Count Sketches
More informationLecture 1: Introduction to Sublinear Algorithms
CSE 522: Sublinear (and Streaming) Algorithms Spring 2014 Lecture 1: Introduction to Sublinear Algorithms March 31, 2014 Lecturer: Paul Beame Scribe: Paul Beame Too much data, too little time, space for
More informationAd Placement Strategies
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January
More informationEffective computation of biased quantiles over data streams
Effective computation of biased quantiles over data streams Graham Cormode cormode@bell-labs.com S. Muthukrishnan muthu@cs.rutgers.edu Flip Korn flip@research.att.com Divesh Srivastava divesh@research.att.com
More informationAMORTIZED ANALYSIS. binary counter multipop stack dynamic table. Lecture slides by Kevin Wayne. Last updated on 1/24/17 11:31 AM
AMORTIZED ANALYSIS binary counter multipop stack dynamic table Lecture slides by Kevin Wayne http://www.cs.princeton.edu/~wayne/kleinberg-tardos Last updated on 1/24/17 11:31 AM Amortized analysis Worst-case
More informationCS 5321: Advanced Algorithms Amortized Analysis of Data Structures. Motivations. Motivation cont d
CS 5321: Advanced Algorithms Amortized Analysis of Data Structures Ali Ebnenasir Department of Computer Science Michigan Technological University Motivations Why amortized analysis and when? Suppose you
More informationCS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32
CS 473: Algorithms Ruta Mehta University of Illinois, Urbana-Champaign Spring 2018 Ruta (UIUC) CS473 1 Spring 2018 1 / 32 CS 473: Algorithms, Spring 2018 Universal Hashing Lecture 10 Feb 15, 2018 Most
More informationApproximate Counts and Quantiles over Sliding Windows
Approximate Counts and Quantiles over Sliding Windows Arvind Arasu Stanford University arvinda@cs.stanford.edu Gurmeet Singh Manku Stanford University manku@cs.stanford.edu Abstract We consider the problem
More informationCS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS341 info session is on Thu 3/1 5pm in Gates415 CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/28/18 Jure Leskovec, Stanford CS246: Mining Massive Datasets,
More information1 Estimating Frequency Moments in Streams
CS 598CSC: Algorithms for Big Data Lecture date: August 28, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri 1 Estimating Frequency Moments in Streams A significant fraction of streaming literature
More informationLecture Notes for Chapter 17: Amortized Analysis
Lecture Notes for Chapter 17: Amortized Analysis Chapter 17 overview Amortized analysis Analyze a sequence of operations on a data structure. Goal: Show that although some individual operations may be
More informationUniversity of New Mexico Department of Computer Science. Final Examination. CS 561 Data Structures and Algorithms Fall, 2013
University of New Mexico Department of Computer Science Final Examination CS 561 Data Structures and Algorithms Fall, 2013 Name: Email: This exam lasts 2 hours. It is closed book and closed notes wing
More informationLecture 5, CPA Secure Encryption from PRFs
CS 4501-6501 Topics in Cryptography 16 Feb 2018 Lecture 5, CPA Secure Encryption from PRFs Lecturer: Mohammad Mahmoody Scribe: J. Fu, D. Anderson, W. Chao, and Y. Yu 1 Review Ralling: CPA Security and
More informationCS5112: Algorithms and Data Structures for Applications
CS5112: Algorithms and Data Structures for Applications Lecture 14: Exponential decay; convolution Ramin Zabih Some content from: Piotr Indyk; Wikipedia/Google image search; J. Leskovec, A. Rajaraman,
More informationImproved Concentration Bounds for Count-Sketch
Improved Concentration Bounds for Count-Sketch Gregory T. Minton 1 Eric Price 2 1 MIT MSR New England 2 MIT IBM Almaden UT Austin 2014-01-06 Gregory T. Minton, Eric Price (IBM) Improved Concentration Bounds
More informationHow Philippe Flipped Coins to Count Data
1/18 How Philippe Flipped Coins to Count Data Jérémie Lumbroso LIP6 / INRIA Rocquencourt December 16th, 2011 0. DATA STREAMING ALGORITHMS Stream: a (very large) sequence S over (also very large) domain
More informationSketching in Adversarial Environments
Sketching in Adversarial Environments Ilya Mironov Moni Naor Gil Segev Abstract We formalize a realistic model for computations over massive data sets. The model, referred to as the adversarial sketch
More informationImproved Algorithms for Distributed Entropy Monitoring
Improved Algorithms for Distributed Entropy Monitoring Jiecao Chen Indiana University Bloomington jiecchen@umail.iu.edu Qin Zhang Indiana University Bloomington qzhangcs@indiana.edu Abstract Modern data
More informationFoundations of Data Mining
Foundations of Data Mining http://www.cohenwang.com/edith/dataminingclass2017 Instructors: Haim Kaplan Amos Fiat Lecture 1 1 Course logistics Tuesdays 16:00-19:00, Sherman 002 Slides for (most or all)
More informationCSE 190, Great ideas in algorithms: Pairwise independent hash functions
CSE 190, Great ideas in algorithms: Pairwise independent hash functions 1 Hash functions The goal of hash functions is to map elements from a large domain to a small one. Typically, to obtain the required
More information7 Algorithms for Massive Data Problems
7 Algorithms for Massive Data Problems Massive Data, Sampling This chapter deals with massive data problems where the input data (a graph, a matrix or some other object) is too large to be stored in random
More informationLecture 7: Amortized analysis
Lecture 7: Amortized analysis In many applications we want to minimize the time for a sequence of operations. To sum worst-case times for single operations can be overly pessimistic, since it ignores correlation
More information6 Filtering and Streaming
Casus ubique valet; semper tibi pendeat hamus: Quo minime credas gurgite, piscis erit. [Luck affects everything. Let your hook always be cast. Where you least expect it, there will be a fish.] Publius
More informationCR-precis: A deterministic summary structure for update data streams
CR-precis: A deterministic summary structure for update data streams Sumit Ganguly 1 and Anirban Majumder 2 1 Indian Institute of Technology, Kanpur 2 Lucent Technologies, Bangalore Abstract. We present
More informationLecture 24: Bloom Filters. Wednesday, June 2, 2010
Lecture 24: Bloom Filters Wednesday, June 2, 2010 1 Topics for the Final SQL Conceptual Design (BCNF) Transactions Indexes Query execution and optimization Cardinality Estimation Parallel Databases 2 Lecture
More informationWavelet decomposition of data streams. by Dragana Veljkovic
Wavelet decomposition of data streams by Dragana Veljkovic Motivation Continuous data streams arise naturally in: telecommunication and internet traffic retail and banking transactions web server log records
More informationBig Data. Big data arises in many forms: Common themes:
Big Data Big data arises in many forms: Physical Measurements: from science (physics, astronomy) Medical data: genetic sequences, detailed time series Activity data: GPS location, social network activity
More informationLecture 7: Fingerprinting. David Woodruff Carnegie Mellon University
Lecture 7: Fingerprinting David Woodruff Carnegie Mellon University How to Pick a Random Prime How to pick a random prime in the range {1, 2,, M}? How to pick a random integer X? Pick a uniformly random
More informationBenes and Butterfly schemes revisited
Benes and Butterfly schemes revisited Jacques Patarin, Audrey Montreuil Université de Versailles 45 avenue des Etats-Unis 78035 Versailles Cedex - France Abstract In [1], W. Aiello and R. Venkatesan have
More information1-Pass Relative-Error L p -Sampling with Applications
-Pass Relative-Error L p -Sampling with Applications Morteza Monemizadeh David P Woodruff Abstract For any p [0, 2], we give a -pass poly(ε log n)-space algorithm which, given a data stream of length m
More informationLecture Lecture 25 November 25, 2014
CS 224: Advanced Algorithms Fall 2014 Lecture Lecture 25 November 25, 2014 Prof. Jelani Nelson Scribe: Keno Fischer 1 Today Finish faster exponential time algorithms (Inclusion-Exclusion/Zeta Transform,
More informationBlock Heavy Hitters Alexandr Andoni, Khanh Do Ba, and Piotr Indyk
Computer Science and Artificial Intelligence Laboratory Technical Report -CSAIL-TR-2008-024 May 2, 2008 Block Heavy Hitters Alexandr Andoni, Khanh Do Ba, and Piotr Indyk massachusetts institute of technology,
More informationPart 1: Hashing and Its Many Applications
1 Part 1: Hashing and Its Many Applications Sid C-K Chau Chi-Kin.Chau@cl.cam.ac.u http://www.cl.cam.ac.u/~cc25/teaching Why Randomized Algorithms? 2 Randomized Algorithms are algorithms that mae random
More information