An Optimal Algorithm for l_1-Heavy Hitters in Insertion Streams and Related Problems


1 An Optimal Algorithm for l_1-Heavy Hitters in Insertion Streams and Related Problems
Arnab Bhattacharyya, Palash Dey, and David P. Woodruff
Indian Institute of Science, Bangalore {arnabb,palash}@csa.iisc.ernet.in
IBM Research, Almaden dpwoodru@us.ibm.com
April 23, 2016
To appear: ACM Symposium on Principles of Database Systems (PODS 2016)


3 Talk Overview: Motivation, Problem Definition, State-of-the-art, Our Results

4 Data Stream
Frequency of item i = f_i
Frequency vector = (f_1, ..., f_n)

5 Data Stream
A suitable model for many large sources of data: streams of network packets, sensor networks
Impractical and undesirable to store and process the entire data exactly; instead, design algorithms that find approximate solutions
Quickly build a summary with one pass over the data
An active area of research for the last 15 years; the history goes back 35 years

6 Example: Heavy-hitters
Also called frequent items, elephants, or icebergs
One of the oldest streaming problems; its history spans more than 30 years

7 Why Heavy-hitters?
Monitoring Internet traffic: tracking bandwidth hogs and popular destinations
The subject of much streaming research: plenty of papers on Heavy-hitters and its variants
A core streaming problem: many streaming problems (item-set mining etc.) are connected to Heavy-hitters
Many practical applications: network data analysis, database optimization

8 Motivation, Problem Definition, State-of-the-art, Our Results

11 l_1-Heavy-hitters
Exact version: given a stream of items from [n] of length m, find an item occurring more than ϕm times
(m is the l_1 norm of the frequency vector (f_1, ..., f_n))
The exact version is costly: unfortunately, it requires Ω(min{m, n}) space, even for ϕ = 1/2

13 (ε, ϕ)-Heavy-hitters
Let 0 < ε < ϕ < 1 and let f_i be the frequency of item i
Problem Definition: find a set S of items with the following properties:
S contains every item i with f_i > ϕm (output all frequent items)
S contains no item j with f_j < (ϕ - ε)m (don't output any rare item)
Moreover, for every item i in S, output an estimate f̂_i such that |f̂_i - f_i| <= εm (estimate the frequencies of the frequent items)

14 Motivation, Problem Definition, State-of-the-art, Our Results

15 Prior work on Heavy-hitters: Upper bounds
An algorithm for ϕ > 1/2 with space complexity O(log n + log m) bits was given by [?]
For general ε, the current best space bound, O((1/ε)(log n + log m)) bits, is from [?]
Rediscovered by [?] and [?], who additionally provide worst-case O(1) time for updates and queries

16 Boyer-Moore Algorithm: works for ϕ > 1/2
Initialize: table = empty
Insert(i):
  if the table is empty, then put i in it and set its counter to 1
  else if i is in the table, then increment the counter
  else decrement the counter (emptying the table if it reaches 0)
Frequent item: the item left in the table
Approximate frequency: the counter value
Space: O(log n + log m) bits

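The table scheme above is the classic majority-vote algorithm. A minimal Python sketch (the function name is mine):

```python
def majority_candidate(stream):
    """Boyer-Moore majority vote: one candidate slot plus one counter.

    If some item occurs more than m/2 times, it is returned; otherwise
    the output is arbitrary, so a second pass is needed to verify."""
    candidate, count = None, 0
    for item in stream:
        if count == 0:            # table empty: install the item
            candidate, count = item, 1
        elif item == candidate:   # match: increment
            count += 1
        else:                     # mismatch: decrement
            count -= 1
    return candidate
```

For example, `majority_candidate(list("aabacaa"))` returns 'a', which occurs 5 times out of 7.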

24 Misra-Gries Algorithm [?]: generalizes the Boyer-Moore algorithm using k - 1 counters
Initialize: A = empty associative array
Process j:
 1: if j in Keys(A) then
 2:   A[j] = A[j] + 1
 3: else if |Keys(A)| < k - 1 then
 4:   A[j] = 1
 5: else
 6:   for l in Keys(A) do
 7:     A[l] = A[l] - 1
 8:     if A[l] = 0 then
 9:       remove l from A
10:     end if
11:   end for
12: end if
Output: on query a, if a in Keys(A), then report f̂_a = A[a]; else report f̂_a = 0
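The pseudocode above translates almost line-for-line into Python (a sketch; a plain dict plays the role of the associative array A):

```python
def misra_gries(stream, k):
    """Misra-Gries summary with at most k - 1 counters.

    For every item a, the reported estimate underestimates the true
    frequency f_a by at most m/k, where m = len(stream)."""
    A = {}
    for j in stream:
        if j in A:
            A[j] += 1
        elif len(A) < k - 1:
            A[j] = 1
        else:
            # decrement every counter; evict keys that reach zero
            for l in list(A):
                A[l] -= 1
                if A[l] == 0:
                    del A[l]
    return A

def query(A, a):
    """Report f_hat_a = A[a] if a is in the table, else 0."""
    return A.get(a, 0)
```

With k = ceil(1/ε), this solves (ε, ε)-heavy hitters, as the analysis on the next slide shows.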

25 Misra-Gries Algorithm: Analysis
Crucial Observation: each decrement is witnessed by k distinct tokens (including itself), hence
f_a - m/k <= f̂_a <= f_a for every a
Space Complexity: O(k (log m + log n)) bits of space
Putting k = 1/ε solves (ε, ε)-heavy hitters in space O((1/ε)(log m + log n)) bits

26 Prior work on Heavy-hitters: Upper bounds
The same bound as [?], or worse, is also achieved by others, such as:
Count-Sketch [?], CountMin-Sketch [?], sticky sampling and lossy counting [?], space saving [?], sample-and-hold [?], multi-stage Bloom filters [?], sketch-guided sampling [?], ...

27 Prior work on Heavy-hitters: Lower bounds
There can be as many as 1/ϕ heavy hitters, giving a space bound of Ω(log (n choose 1/ϕ)) = Ω((1/ϕ) log(ϕn))
An easy reduction from the INDEXING problem provides a lower bound of Ω(1/ε)

28 Gap in Known Bounds
In summary, the best upper bound is O((1/ε)(log n + log m)) and the best lower bound is Ω((1/ϕ) log(ϕn) + 1/ε)
For constant ϕ and ε = 1/log n, this is a quadratic gap

29 Motivation, Problem Definition, State-of-the-art, Our Results

30 Our Result
We show that the space complexity of (ε, ϕ)-heavy hitters is Θ((1/ε) log(1/ϕ) + (1/ϕ) log n + log log m) bits (for n sufficiently large in terms of 1/ε), with O(1) worst-case update and query response times
Our algorithm is randomized

33 L∞-approximation
Our algorithm also solves ε-Maximum, i.e., approximating the l_∞ norm of (f_1, ..., f_n) within ε ||(f_1, ..., f_n)||_1:
given an insertion-only stream of length m over a universe of size n, output an item whose frequency is at most εm less than the maximum
Our Result: we show that the space complexity of ε-Maximum is Θ((1/ε) log(1/ε) + log n + log log m) bits (for n sufficiently large in terms of 1/ε), with O(1) worst-case update and query response times
Our algorithm is randomized, whereas Misra-Gries is deterministic!

34 Comparison with prior work
The best previous bound is again due to [?]: O((1/ε)(log n + log m))
Improving this result was listed as Open Problem 3 at the IITK Workshop on Data Streams (2006)

35 A Voting Perspective The maximum problem finds an approximate winner of a plurality election when votes are streamed A natural question is to ask for winners of elections conducted according to other voting rules

36 Veto voting
(Example from the slide: with candidates Modi, Rahul, and Palash, Palash wins)
ε-Minimum: given an insertion-only stream of length m over a universe of size n, find the minimum frequency up to additive error εm

37 Borda voting
ε-Borda: given an insertion-only stream of length m over a universe of size n, find the maximum Borda score up to additive error εmn

38 Maximin voting
Example profile: 4 voters: A > B > C; 3 voters: C > B > A; 2 voters: B > A > C; 2 voters: C > A > B
Maximin scores: A: 6, B: 5, C: 5
ε-Maximin: given an insertion-only stream of length m over a universe of size n, find the maximum maximin frequency up to additive error εm
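The maximin score of a candidate is its worst pairwise tally. The numbers in the example above can be checked with a short script (the data layout is mine):

```python
def maximin_scores(ballots):
    """ballots: list of (count, ranking) pairs, where ranking lists the
    candidates from most to least preferred.  The maximin score of c is
    the minimum, over all opponents d, of the number of voters
    preferring c to d."""
    candidates = set(ballots[0][1])

    def prefer(c, d):
        # number of voters ranking c above d
        return sum(n for n, r in ballots if r.index(c) < r.index(d))

    return {c: min(prefer(c, d) for d in candidates if d != c)
            for c in candidates}

# the profile from the slide: 4 x A>B>C, 3 x C>B>A, 2 x B>A>C, 2 x C>A>B
ballots = [(4, "ABC"), (3, "CBA"), (2, "BAC"), (2, "CAB")]
scores = maximin_scores(ballots)  # A: 6, B: 5, C: 5
```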

40 Other Results
ε-Minimum (optimal up to O(log log(1/ε))): upper bound O((1/ε) log log(1/ε) + log log m), lower bound Ω(1/ε + log log m)
ε-Borda (optimal up to O(1)): Θ(n(log(1/ε) + log n) + log log m)
ε-Maximin (optimal up to O(log² n)): upper bound O((n/ε²) log² n + log log m), lower bound Ω(n/ε² + log log m)

41 Algorithm

42 A simpler, almost-optimal Heavy-hitters algorithm: Hash, Sample, Count

43 Algorithm: in more detail...
Choose a hash function from a universal family that hashes each id to a universe of size poly(1/ε)
Sample l = poly(1/ε) items from the stream
Feed the hashed samples into a Misra-Gries data structure with 1/ε counters, while storing the actual ids of the top 1/ϕ items according to the Misra-Gries data structure
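A toy end-to-end version of the hash-sample-count pipeline, purely illustrative: Python's built-in `hash` with random multipliers stands in for the universal family, and Bernoulli sampling replaces the space-efficient sampler, so this sketch does not meet the stated space bounds.

```python
import random

def heavy_hitters_sketch(stream, eps, phi, seed=0):
    """Hash -> sample -> count (illustrative only).

    Returns {actual_id: estimated frequency fraction} for the
    reported top-(1/phi) items."""
    rng = random.Random(seed)
    m = len(stream)
    U = int(1 / eps) ** 3                  # hashed universe: poly(1/eps)
    a, b = rng.randrange(1, 2**31 - 1), rng.randrange(2**31 - 1)
    h = lambda x: ((a * hash(x) + b) % (2**31 - 1)) % U

    ell = int(1 / eps) ** 2                # sample size: poly(1/eps)
    sample = [x for x in stream if rng.random() < ell / m]

    k = int(1 / eps) + 1                   # ~1/eps Misra-Gries counters
    A, ids = {}, {}
    for x in sample:
        hx = h(x)
        ids.setdefault(hx, x)              # remember an actual id per bucket
        if hx in A:
            A[hx] += 1
        elif len(A) < k - 1:
            A[hx] = 1
        else:
            for l in list(A):
                A[l] -= 1
                if A[l] == 0:
                    del A[l]

    top = sorted(A, key=A.get, reverse=True)[:int(1 / phi)]
    n_samp = max(1, len(sample))
    return {ids[t]: A[t] / n_samp for t in top}
```

On a stream where one item makes up 90% of the input, the item is reported with an estimate close to 0.9.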

44 Algorithm: Correctness
No collision among hashed ids in the sample (with high probability)
If S is a random subset of size O(1/ε²) chosen from the stream, and f_i^S is the frequency of item i in S, then with high probability [?]: for every i in [n], |f_i^S / |S| - f_i / m| <= ε

46 Algorithm analysis: Space Complexity
A random item can be chosen from the stream in O(log log m) bits of space
The Misra-Gries data structure uses O((1/ε) log(1/ε)) bits of space: the id space is poly(1/ε), the length of the subsampled stream is poly(1/ε), and we get εl additive approximation (this is where the extra log(1/ε) factor comes from)
(1/ϕ) log n bits to store the ids of the top 1/ϕ items of the Misra-Gries table
Space complexity: O((1/ε) log(1/ε) + (1/ϕ) log n + log log m)

48 Algorithm analysis: Time Complexity
O(1) update and query response time:
Once an item is sampled, updating the Misra-Gries table takes O(1/ε) time
However, with very high probability, no item is sampled among the next O(1/ε) items of the stream
So distribute the O(1/ε) update time over the next O(1/ε) items
The worst-case update time is O(1); note that this guarantee is not merely amortized!

49 Optimal Heavy-hitters algorithm: Θ((1/ε) log(1/ϕ) + (1/ϕ) log n + log log m)

50 Optimal algorithm: overview
Sample l = O(ε⁻²) items from the stream
Run the Misra-Gries algorithm for (ϕ/2, ϕ/2)-heavy hitters; it returns a set C of O(ϕ⁻¹) items
For any item i, with probability 1 - O(ϕ), estimate the frequency of i in the sampled stream to within additive error O(εl) = O(ε⁻¹)

53 Optimal algorithm cont.: Approximate counting with additive error O(ε⁻¹)
Insert(i): with probability p_i, increment ĉ_i
Output(i): return f̂_i = ĉ_i / p_i
Accelerated counter:
Claim 1: if p_i = Θ(ε² f_i), then Var[f̂_i] = O(ε⁻²)
Claim 2: f̂_i is an unbiased estimator of f_i, with additive error O(ε⁻¹) with constant probability
The error probability can be reduced from constant to O(ϕ) by repeating O(log ϕ⁻¹) times and taking the median
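A sketch of such a counter with a fixed increment probability p (in the real algorithm, p_i changes dynamically as Θ(ε² f_i) across epochs; the fixed p here is a simplifying assumption):

```python
import random

class SampledCounter:
    """Approximate counter: increment with probability p, estimate by
    count / p.  E[estimate] = f (unbiased) and Var[estimate] =
    f (1 - p) / p, so choosing p = Theta(eps^2 f) gives variance
    O(eps^-2), i.e. additive error O(1/eps) with constant probability."""

    def __init__(self, p, seed=None):
        self.p = p
        self.c = 0                     # stores ~ p * f, much smaller than f
        self.rng = random.Random(seed)

    def insert(self):
        if self.rng.random() < self.p:
            self.c += 1

    def estimate(self):
        return self.c / self.p
```

Since only ĉ ≈ p·f is stored, the counter needs about log(p·f) bits rather than log f.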

54 Optimal algorithm cont.: Tackling two issues
Issue 1: we need to keep Ω(l) = Ω(ε⁻²) counters. Solution: hash into a space of size Θ(ε⁻¹)
Issue 2: we do not know f_i. Solution: divide the stream into epochs, maintain a 4-approximation of the frequencies in each epoch, and change the p_i's dynamically

55 Lower Bound

56 Lower bound: Dependence on m
GREATER-THAN_{log m}: Alice has x in [log m], Bob has y in [log m]; Alice sends a message to Bob; Bob has to output whether or not x > y
Known Result: Alice has to send Ω(log log m) bits [?, ?]

57 Lower bound: Reduction from GREATER-THAN_{log m} to (ε, ϕ)-heavy hitters
Alice, holding x in [log m], inserts 2^x copies of item A and sends the algorithm's memory contents to Bob
Bob, holding y in [log m], inserts 2^y copies of item B
The heavy hitter is A iff x > y

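The reduction can be simulated end to end. As a toy illustration (an exact majority counter plays the role of the heavy-hitters algorithm, and ties x = y are excluded, since handling them needs the ε slack of the real argument):

```python
def majority(stream):
    # Boyer-Moore majority vote with a single counter
    cand, cnt = None, 0
    for it in stream:
        if cnt == 0:
            cand, cnt = it, 1
        elif it == cand:
            cnt += 1
        else:
            cnt -= 1
    return cand

def greater_than(x, y):
    """Decide x > y (for x != y) by streaming 2^x copies of 'A' followed
    by 2^y copies of 'B'.  The two counts differ by at least a factor of
    2, so the majority element is 'A' exactly when x > y."""
    stream = ["A"] * (2 ** x) + ["B"] * (2 ** y)
    return majority(stream) == "A"
```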

62 Lower bound: Dependence on ε and ϕ
INDEXING_{n,m}: Alice has a string x in [n]^m, Bob has an index i in [m]; Alice sends a message to Bob; Bob has to output x_i
Known Result: the one-way communication complexity of INDEXING_{n,m} is Ω(m log n)

63 Lower bound: Reduction from INDEXING_{1/(2(ϕ-ε)), 1/(2ε)} to (ε, ϕ)-heavy hitters
Alice, holding x in [1/(2(ϕ-ε))]^{1/(2ε)}, inserts εm copies of each pair (x_j, j) for j in [1/(2ε)] and sends the algorithm's memory contents to Bob
Bob, holding i in [1/(2ε)], inserts (ϕ-ε)m copies of each pair (j, i) for j in [1/(2(ϕ-ε))]
The heavy hitter is (x_i, i)


68 Notes
For the turnstile model of streaming (insertions as well as deletions): the CountMin sketch gives an O((1/ε) log² n)-bit upper bound [?], and an Ω((1/ϕ) log² n) lower bound is known due to [?]
[?] show a stronger error bound for Misra-Gries in terms of the tail of the frequency distribution; can we simultaneously achieve optimal space complexity?
Possible other applications to rank aggregation


More information

Classic Network Measurements meets Software Defined Networking

Classic Network Measurements meets Software Defined Networking Classic Network s meets Software Defined Networking Ran Ben Basat, Technion, Israel Joint work with Gil Einziger and Erez Waisbard (Nokia Bell Labs) Roy Friedman (Technion) and Marcello Luzieli (UFGRS)

More information

Tracking Distributed Aggregates over Time-based Sliding Windows

Tracking Distributed Aggregates over Time-based Sliding Windows Tracking Distributed Aggregates over Time-based Sliding Windows Graham Cormode and Ke Yi Abstract. The area of distributed monitoring requires tracking the value of a function of distributed data as new

More information

Optimal Random Sampling from Distributed Streams Revisited

Optimal Random Sampling from Distributed Streams Revisited Optimal Random Sampling from Distributed Streams Revisited Srikanta Tirthapura 1 and David P. Woodruff 2 1 Dept. of ECE, Iowa State University, Ames, IA, 50011, USA. snt@iastate.edu 2 IBM Almaden Research

More information

Lecture 5: Hashing. David Woodruff Carnegie Mellon University

Lecture 5: Hashing. David Woodruff Carnegie Mellon University Lecture 5: Hashing David Woodruff Carnegie Mellon University Hashing Universal hashing Perfect hashing Maintaining a Dictionary Let U be a universe of keys U could be all strings of ASCII characters of

More information

to be more efficient on enormous scale, in a stream, or in distributed settings.

to be more efficient on enormous scale, in a stream, or in distributed settings. 16 Matrix Sketching The singular value decomposition (SVD) can be interpreted as finding the most dominant directions in an (n d) matrix A (or n points in R d ). Typically n > d. It is typically easy to

More information

Data Structures and Algorithm. Xiaoqing Zheng

Data Structures and Algorithm. Xiaoqing Zheng Data Structures and Algorithm Xiaoqing Zheng zhengxq@fudan.edu.cn MULTIPOP top[s] = 6 top[s] = 2 3 2 8 5 6 5 S MULTIPOP(S, x). while not STACK-EMPTY(S) and k 0 2. do POP(S) 3. k k MULTIPOP(S, 4) Analysis

More information

Motivation. Dictionaries. Direct Addressing. CSE 680 Prof. Roger Crawfis

Motivation. Dictionaries. Direct Addressing. CSE 680 Prof. Roger Crawfis Motivation Introduction to Algorithms Hash Tables CSE 680 Prof. Roger Crawfis Arrays provide an indirect way to access a set. Many times we need an association between two sets, or a set of keys and associated

More information

Efficient Sketches for Earth-Mover Distance, with Applications

Efficient Sketches for Earth-Mover Distance, with Applications Efficient Sketches for Earth-Mover Distance, with Applications Alexandr Andoni MIT andoni@mit.edu Khanh Do Ba MIT doba@mit.edu Piotr Indyk MIT indyk@mit.edu David Woodruff IBM Almaden dpwoodru@us.ibm.com

More information

25.2 Last Time: Matrix Multiplication in Streaming Model

25.2 Last Time: Matrix Multiplication in Streaming Model EE 381V: Large Scale Learning Fall 01 Lecture 5 April 18 Lecturer: Caramanis & Sanghavi Scribe: Kai-Yang Chiang 5.1 Review of Streaming Model Streaming model is a new model for presenting massive data.

More information

Maintaining Significant Stream Statistics over Sliding Windows

Maintaining Significant Stream Statistics over Sliding Windows Maintaining Significant Stream Statistics over Sliding Windows L.K. Lee H.F. Ting Abstract In this paper, we introduce the Significant One Counting problem. Let ε and θ be respectively some user-specified

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/26/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 More algorithms

More information

Estimating Dominance Norms of Multiple Data Streams Graham Cormode Joint work with S. Muthukrishnan

Estimating Dominance Norms of Multiple Data Streams Graham Cormode Joint work with S. Muthukrishnan Estimating Dominance Norms of Multiple Data Streams Graham Cormode graham@dimacs.rutgers.edu Joint work with S. Muthukrishnan Data Stream Phenomenon Data is being produced faster than our ability to process

More information

B669 Sublinear Algorithms for Big Data

B669 Sublinear Algorithms for Big Data B669 Sublinear Algorithms for Big Data Qin Zhang 1-1 2-1 Part 1: Sublinear in Space The model and challenge The data stream model (Alon, Matias and Szegedy 1996) a n a 2 a 1 RAM CPU Why hard? Cannot store

More information

Lecture 6 September 13, 2016

Lecture 6 September 13, 2016 CS 395T: Sublinear Algorithms Fall 206 Prof. Eric Price Lecture 6 September 3, 206 Scribe: Shanshan Wu, Yitao Chen Overview Recap of last lecture. We talked about Johnson-Lindenstrauss (JL) lemma [JL84]

More information

Frequency Estimators

Frequency Estimators Frequency Estimators Outline for Today Randomized Data Structures Our next approach to improving performance. Count-Min Sketches A simple and powerful data structure for estimating frequencies. Count Sketches

More information

Lecture 1: Introduction to Sublinear Algorithms

Lecture 1: Introduction to Sublinear Algorithms CSE 522: Sublinear (and Streaming) Algorithms Spring 2014 Lecture 1: Introduction to Sublinear Algorithms March 31, 2014 Lecturer: Paul Beame Scribe: Paul Beame Too much data, too little time, space for

More information

Ad Placement Strategies

Ad Placement Strategies Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January

More information

Effective computation of biased quantiles over data streams

Effective computation of biased quantiles over data streams Effective computation of biased quantiles over data streams Graham Cormode cormode@bell-labs.com S. Muthukrishnan muthu@cs.rutgers.edu Flip Korn flip@research.att.com Divesh Srivastava divesh@research.att.com

More information

AMORTIZED ANALYSIS. binary counter multipop stack dynamic table. Lecture slides by Kevin Wayne. Last updated on 1/24/17 11:31 AM

AMORTIZED ANALYSIS. binary counter multipop stack dynamic table. Lecture slides by Kevin Wayne. Last updated on 1/24/17 11:31 AM AMORTIZED ANALYSIS binary counter multipop stack dynamic table Lecture slides by Kevin Wayne http://www.cs.princeton.edu/~wayne/kleinberg-tardos Last updated on 1/24/17 11:31 AM Amortized analysis Worst-case

More information

CS 5321: Advanced Algorithms Amortized Analysis of Data Structures. Motivations. Motivation cont d

CS 5321: Advanced Algorithms Amortized Analysis of Data Structures. Motivations. Motivation cont d CS 5321: Advanced Algorithms Amortized Analysis of Data Structures Ali Ebnenasir Department of Computer Science Michigan Technological University Motivations Why amortized analysis and when? Suppose you

More information

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32 CS 473: Algorithms Ruta Mehta University of Illinois, Urbana-Champaign Spring 2018 Ruta (UIUC) CS473 1 Spring 2018 1 / 32 CS 473: Algorithms, Spring 2018 Universal Hashing Lecture 10 Feb 15, 2018 Most

More information

Approximate Counts and Quantiles over Sliding Windows

Approximate Counts and Quantiles over Sliding Windows Approximate Counts and Quantiles over Sliding Windows Arvind Arasu Stanford University arvinda@cs.stanford.edu Gurmeet Singh Manku Stanford University manku@cs.stanford.edu Abstract We consider the problem

More information

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS341 info session is on Thu 3/1 5pm in Gates415 CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/28/18 Jure Leskovec, Stanford CS246: Mining Massive Datasets,

More information

1 Estimating Frequency Moments in Streams

1 Estimating Frequency Moments in Streams CS 598CSC: Algorithms for Big Data Lecture date: August 28, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri 1 Estimating Frequency Moments in Streams A significant fraction of streaming literature

More information

Lecture Notes for Chapter 17: Amortized Analysis

Lecture Notes for Chapter 17: Amortized Analysis Lecture Notes for Chapter 17: Amortized Analysis Chapter 17 overview Amortized analysis Analyze a sequence of operations on a data structure. Goal: Show that although some individual operations may be

More information

University of New Mexico Department of Computer Science. Final Examination. CS 561 Data Structures and Algorithms Fall, 2013

University of New Mexico Department of Computer Science. Final Examination. CS 561 Data Structures and Algorithms Fall, 2013 University of New Mexico Department of Computer Science Final Examination CS 561 Data Structures and Algorithms Fall, 2013 Name: Email: This exam lasts 2 hours. It is closed book and closed notes wing

More information

Lecture 5, CPA Secure Encryption from PRFs

Lecture 5, CPA Secure Encryption from PRFs CS 4501-6501 Topics in Cryptography 16 Feb 2018 Lecture 5, CPA Secure Encryption from PRFs Lecturer: Mohammad Mahmoody Scribe: J. Fu, D. Anderson, W. Chao, and Y. Yu 1 Review Ralling: CPA Security and

More information

CS5112: Algorithms and Data Structures for Applications

CS5112: Algorithms and Data Structures for Applications CS5112: Algorithms and Data Structures for Applications Lecture 14: Exponential decay; convolution Ramin Zabih Some content from: Piotr Indyk; Wikipedia/Google image search; J. Leskovec, A. Rajaraman,

More information

Improved Concentration Bounds for Count-Sketch

Improved Concentration Bounds for Count-Sketch Improved Concentration Bounds for Count-Sketch Gregory T. Minton 1 Eric Price 2 1 MIT MSR New England 2 MIT IBM Almaden UT Austin 2014-01-06 Gregory T. Minton, Eric Price (IBM) Improved Concentration Bounds

More information

How Philippe Flipped Coins to Count Data

How Philippe Flipped Coins to Count Data 1/18 How Philippe Flipped Coins to Count Data Jérémie Lumbroso LIP6 / INRIA Rocquencourt December 16th, 2011 0. DATA STREAMING ALGORITHMS Stream: a (very large) sequence S over (also very large) domain

More information

Sketching in Adversarial Environments

Sketching in Adversarial Environments Sketching in Adversarial Environments Ilya Mironov Moni Naor Gil Segev Abstract We formalize a realistic model for computations over massive data sets. The model, referred to as the adversarial sketch

More information

Improved Algorithms for Distributed Entropy Monitoring

Improved Algorithms for Distributed Entropy Monitoring Improved Algorithms for Distributed Entropy Monitoring Jiecao Chen Indiana University Bloomington jiecchen@umail.iu.edu Qin Zhang Indiana University Bloomington qzhangcs@indiana.edu Abstract Modern data

More information

Foundations of Data Mining

Foundations of Data Mining Foundations of Data Mining http://www.cohenwang.com/edith/dataminingclass2017 Instructors: Haim Kaplan Amos Fiat Lecture 1 1 Course logistics Tuesdays 16:00-19:00, Sherman 002 Slides for (most or all)

More information

CSE 190, Great ideas in algorithms: Pairwise independent hash functions

CSE 190, Great ideas in algorithms: Pairwise independent hash functions CSE 190, Great ideas in algorithms: Pairwise independent hash functions 1 Hash functions The goal of hash functions is to map elements from a large domain to a small one. Typically, to obtain the required

More information

7 Algorithms for Massive Data Problems

7 Algorithms for Massive Data Problems 7 Algorithms for Massive Data Problems Massive Data, Sampling This chapter deals with massive data problems where the input data (a graph, a matrix or some other object) is too large to be stored in random

More information

Lecture 7: Amortized analysis

Lecture 7: Amortized analysis Lecture 7: Amortized analysis In many applications we want to minimize the time for a sequence of operations. To sum worst-case times for single operations can be overly pessimistic, since it ignores correlation

More information

6 Filtering and Streaming

6 Filtering and Streaming Casus ubique valet; semper tibi pendeat hamus: Quo minime credas gurgite, piscis erit. [Luck affects everything. Let your hook always be cast. Where you least expect it, there will be a fish.] Publius

More information

CR-precis: A deterministic summary structure for update data streams

CR-precis: A deterministic summary structure for update data streams CR-precis: A deterministic summary structure for update data streams Sumit Ganguly 1 and Anirban Majumder 2 1 Indian Institute of Technology, Kanpur 2 Lucent Technologies, Bangalore Abstract. We present

More information

Lecture 24: Bloom Filters. Wednesday, June 2, 2010

Lecture 24: Bloom Filters. Wednesday, June 2, 2010 Lecture 24: Bloom Filters Wednesday, June 2, 2010 1 Topics for the Final SQL Conceptual Design (BCNF) Transactions Indexes Query execution and optimization Cardinality Estimation Parallel Databases 2 Lecture

More information

Wavelet decomposition of data streams. by Dragana Veljkovic

Wavelet decomposition of data streams. by Dragana Veljkovic Wavelet decomposition of data streams by Dragana Veljkovic Motivation Continuous data streams arise naturally in: telecommunication and internet traffic retail and banking transactions web server log records

More information

Big Data. Big data arises in many forms: Common themes:

Big Data. Big data arises in many forms: Common themes: Big Data Big data arises in many forms: Physical Measurements: from science (physics, astronomy) Medical data: genetic sequences, detailed time series Activity data: GPS location, social network activity

More information

Lecture 7: Fingerprinting. David Woodruff Carnegie Mellon University

Lecture 7: Fingerprinting. David Woodruff Carnegie Mellon University Lecture 7: Fingerprinting David Woodruff Carnegie Mellon University How to Pick a Random Prime How to pick a random prime in the range {1, 2,, M}? How to pick a random integer X? Pick a uniformly random

More information

Benes and Butterfly schemes revisited

Benes and Butterfly schemes revisited Benes and Butterfly schemes revisited Jacques Patarin, Audrey Montreuil Université de Versailles 45 avenue des Etats-Unis 78035 Versailles Cedex - France Abstract In [1], W. Aiello and R. Venkatesan have

More information

1-Pass Relative-Error L p -Sampling with Applications

1-Pass Relative-Error L p -Sampling with Applications -Pass Relative-Error L p -Sampling with Applications Morteza Monemizadeh David P Woodruff Abstract For any p [0, 2], we give a -pass poly(ε log n)-space algorithm which, given a data stream of length m

More information

Lecture Lecture 25 November 25, 2014

Lecture Lecture 25 November 25, 2014 CS 224: Advanced Algorithms Fall 2014 Lecture Lecture 25 November 25, 2014 Prof. Jelani Nelson Scribe: Keno Fischer 1 Today Finish faster exponential time algorithms (Inclusion-Exclusion/Zeta Transform,

More information

Block Heavy Hitters Alexandr Andoni, Khanh Do Ba, and Piotr Indyk

Block Heavy Hitters Alexandr Andoni, Khanh Do Ba, and Piotr Indyk Computer Science and Artificial Intelligence Laboratory Technical Report -CSAIL-TR-2008-024 May 2, 2008 Block Heavy Hitters Alexandr Andoni, Khanh Do Ba, and Piotr Indyk massachusetts institute of technology,

More information

Part 1: Hashing and Its Many Applications

Part 1: Hashing and Its Many Applications 1 Part 1: Hashing and Its Many Applications Sid C-K Chau Chi-Kin.Chau@cl.cam.ac.u http://www.cl.cam.ac.u/~cc25/teaching Why Randomized Algorithms? 2 Randomized Algorithms are algorithms that mae random

More information