Lecture 2 Sept. 8, 2015


CS 229r: Algorithms for Big Data, Fall 2015
Prof. Jelani Nelson
Lecture 2, Sept. 8, 2015
Scribe: Jeffrey Ling

1 Probability Recap

Chebyshev: P(|X - EX| > λ) < Var[X]/λ^2.

Chernoff: for X_1, ..., X_n independent random variables in [0, 1], 0 < ε < 1, and μ = E ∑_i X_i,

  P(|∑_i X_i - μ| > εμ) < 2e^{-ε^2 μ/3}.

2 Today

- Distinct elements
- Norm estimation (if there's time)

3 Distinct elements (F_0)

Problem: given a stream of integers i_1, ..., i_m ∈ [n], where [n] := {1, 2, ..., n}, we want to output t, the number of distinct elements seen.

3.1 Straightforward algorithms

1. Keep a bit array of length n and flip a bit whenever its number is seen. This takes n bits.
2. Store the whole stream. This takes m lg n bits.

So we can solve the problem exactly with O(min(n, m lg n)) bits.

3.2 Randomized approximation

We can instead settle for outputting an estimate t̃ such that P(|t̃ - t| > εt) < δ. The original solution was given by Flajolet and Martin [2].
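To make the baseline concrete, the first straightforward algorithm can be written in a few lines of Python (a minimal sketch; the function name is mine, not from the notes):

```python
def distinct_exact(stream, n):
    # Exact F0 via a length-n bit array: O(n) bits of state,
    # matching straightforward algorithm 1 above.
    seen = [False] * (n + 1)
    count = 0
    for i in stream:            # stream elements are integers in [n]
        if not seen[i]:
            seen[i] = True
            count += 1
    return count

print(distinct_exact([3, 1, 4, 1, 5, 9, 2, 6, 5, 3], n=9))  # 7 distinct values
```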

3.3 Idealized algorithm (FM)

1. Pick a random function h : [n] -> [0, 1] (idealized, since we can't actually store such a function in small space).
2. Maintain the counter X = min_{i in stream} h(i).
3. Output 1/X.

Intuition: X is a random variable distributed as the minimum of t i.i.d. Unif(0, 1) random variables.

Claim 1. EX = 1/(t+1).

Proof.

  EX = ∫_0^∞ P(X > λ) dλ
     = ∫_0^1 P(∀ i in stream: h(i) > λ) dλ
     = ∫_0^1 ∏_{r=1}^t P(h(i_r) > λ) dλ    (product over the t distinct elements)
     = ∫_0^1 (1 - λ)^t dλ
     = 1/(t+1).

Claim 2. EX^2 = 2/((t+1)(t+2)).

Proof.

  EX^2 = ∫_0^∞ P(X^2 > λ) dλ
       = ∫_0^1 P(X > √λ) dλ
       = ∫_0^1 (1 - √λ)^t dλ
       = ∫_0^1 2u^t (1 - u) du    (substituting u = 1 - √λ)
       = 2/((t+1)(t+2)).

This gives

  Var[X] = EX^2 - (EX)^2 = 2/((t+1)(t+2)) - 1/(t+1)^2 = t/((t+1)^2 (t+2)),

and furthermore Var[X] < (EX)^2 = 1/(t+1)^2.
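The idealized algorithm is easy to simulate by memoizing h in a dictionary; this defeats the whole point space-wise, but it lets one watch the estimator behave (function and variable names are mine):

```python
import random

def fm_idealized(stream, seed=0):
    # Simulate idealized FM: h(i) ~ Unif(0,1), memoized so repeated
    # elements reuse the same hash value; maintain X = min h(i); output 1/X.
    rng = random.Random(seed)
    h = {}
    X = 1.0
    for i in stream:
        if i not in h:
            h[i] = rng.random()
        X = min(X, h[i])
    return 1.0 / X  # E[X] = 1/(t+1), so 1/X concentrates around t+1

# Duplicates do not change the estimate: only distinct elements matter.
print(fm_idealized(list(range(1000)) * 3, seed=1))
```

Since Var[X] is on the order of (EX)^2, a single run can easily be off by a constant factor, which is exactly what the averaging and median steps of the next two sections fix.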

4 FM+

We average together multiple estimates from the idealized algorithm FM.

1. Instantiate q = 1/(ε^2 η) copies of FM independently.
2. Let X_i come from FM_i.
3. Output 1/Z, where Z = (1/q) ∑_{i=1}^q X_i.

We have E[Z] = 1/(t+1) and Var[Z] = (1/q) · t/((t+1)^2 (t+2)) < 1/(q (t+1)^2).

Claim 3. P(|Z - 1/(t+1)| > ε/(t+1)) < η.

Proof. By Chebyshev,

  P(|Z - 1/(t+1)| > ε/(t+1)) < Var[Z] · (t+1)^2/ε^2 < 1/(q ε^2) = η.

Claim 4. P(|1/Z - t| > O(ε) t) < η.

Proof. By the previous claim, with probability 1 - η,

  1/Z = (1 ± ε)^{-1} (t+1) = (1 ± O(ε))(t+1) = (1 ± O(ε)) t ± O(1),

which is (1 ± O(ε)) t since t ≥ 1.

5 FM++

We take the median of multiple estimates from FM+.

1. Instantiate s = 36 ln(2/δ) independent copies of FM+ with η = 1/3.
2. Output the median t̃ of {1/Z_j}_{j=1}^s, where Z_j comes from the j-th copy of FM+.

Claim 5. P(|t̃ - t| > εt) < δ.

Proof. Let

  Y_j = 1 if |1/Z_j - t| > εt, and Y_j = 0 otherwise.

We have EY_j = P(Y_j = 1) < 1/3 by the choice of η. The probability we seek to bound is the probability that the median fails, i.e. that at least half of the FM+ estimates are bad; in other words, that

  ∑_{j=1}^s Y_j > s/2.

We then get

  P(∑_j Y_j > s/2) ≤ P(|∑_j Y_j - s/3| > s/6).    (1)

Make the simplifying assumption that EY_j = 1/3 (this is harder to handle than EY_j < 1/3, so the bound only improves). Then (1) becomes, by Chernoff,

  P(|∑_j Y_j - E ∑_j Y_j| > (1/2) E ∑_j Y_j) < 2e^{-(1/2)^2 (s/3)/3} = 2e^{-s/36} ≤ δ

by the choice of s, as desired.

The final space required, ignoring the storage of h, is O(lg(1/δ)/ε^2): we keep O(lg(1/δ)) copies of FM+, each requiring O(1/ε^2) space.

6 k-wise independent functions

Definition 6. A family H of functions mapping [a] to [b] is k-wise independent if for all j_1, ..., j_k ∈ [b] and all distinct i_1, ..., i_k ∈ [a],

  P_{h ∈ H}(h(i_1) = j_1 ∧ ... ∧ h(i_k) = j_k) = 1/b^k.

Example. The set H of all functions [a] -> [b] is k-wise independent for every k. |H| = b^a, so storing h takes a lg b bits.

Example. Let a = b = q for q = p^r a prime power, and let H_poly be the set of polynomials of degree at most k - 1 with coefficients in F_q, the finite field of order q. |H_poly| = q^k, so storing h takes only k lg q bits.

Claim 7. H_poly is k-wise independent.

Proof. Interpolation: for any k distinct evaluation points and any k target values there is exactly one polynomial of degree at most k - 1 through them.

7 Non-idealized FM

First we get an O(1)-approximation using O(lg n) bits, i.e. an estimate t̃ satisfying t/C ≤ t̃ ≤ Ct for some constant C.

1. Pick h from a 2-wise independent family of functions [n] -> [n], for n a power of 2 (round up if necessary).
2. Maintain X = max_{i in stream} lsb(h(i)), where lsb(x) is the index of the least significant set bit of x.
3. Output 2^X.
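A sketch of this constant-factor algorithm in Python, using a random degree-1 polynomial modulo a Mersenne prime (truncated to lg n bits) as an illustrative stand-in for the 2-wise family; the names and the particular prime are my choices, not the notes':

```python
import random

def lsb(x):
    # Index of the least significant set bit: lsb(1) = 0, lsb(12) = 2.
    return (x & -x).bit_length() - 1

def fm_rough(stream, lg_n=32, seed=0):
    # Maintain X = max lsb(h(i)) over the stream; output 2^X,
    # a constant-factor approximation to the number of distinct elements.
    p = (1 << 61) - 1                    # Mersenne prime for the hash field
    rng = random.Random(seed)
    a, b = rng.randrange(1, p), rng.randrange(p)
    X = 0
    for i in stream:
        hv = ((a * i + b) % p) & ((1 << lg_n) - 1)  # pairwise-ish h(i)
        if hv:                           # lsb is undefined for 0; skip
            X = max(X, lsb(hv))
    return 1 << X

print(fm_rough(range(10000)))
```

The only stream state is X ≤ lg n, so apart from the two hash coefficients the algorithm stores O(lg lg n) bits.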

For fixed j, let Z_j be the number of i in the stream with lsb(h(i)) = j, and let Z_{>j} be the number of i with lsb(h(i)) > j. Let

  Y_i = 1 if lsb(h(i)) = j, and Y_i = 0 otherwise.

Then Z_j = ∑_{i in stream} Y_i. We can compute

  EZ_j = t/2^{j+1}

and similarly

  EZ_{>j} = t (1/2^{j+2} + 1/2^{j+3} + ...) < t/2^{j+1},

and also

  Var[Z_j] = Var[∑_i Y_i] = E(∑_i Y_i)^2 - (E ∑_i Y_i)^2 = ∑_{i ≠ i'} (E(Y_i Y_{i'}) - EY_i EY_{i'}) + ∑_i Var[Y_i].

Since h comes from a 2-wise independent family, the Y_i are pairwise independent, so E(Y_i Y_{i'}) = EY_i EY_{i'} and the cross terms vanish. We can then show Var[Z_j] < t/2^{j+1}.

Now for j = lg t - 5 we have 16 ≤ EZ_j ≤ 32, so

  P(Z_j = 0) ≤ P(|Z_j - EZ_j| ≥ 16) < 1/5

by Chebyshev. For j = lg t + 5,

  EZ_{>j} ≤ 1/16, so P(Z_{>j} ≥ 1) ≤ 1/16

by Markov. So with good probability the maximum lsb lies above lg t - 5 but not above lg t + 5, a window of constant width. This gives a 32-approximation, i.e. a constant-factor approximation.

8 Refine to 1 + ε

Trivial solution. Algorithm TS stores the first C/ε^2 distinct elements seen. This is exact as long as t ≤ C/ε^2.

Algorithm.

1. Instantiate TS_0, ..., TS_{lg n}.
2. Pick g : [n] -> [n] from a 2-wise independent family.
3. Feed element i to TS_{lsb(g(i))}.
4. Output 2^{j+1} · t̃_j, where t̃_j is the output of TS_j and j is chosen (using the constant-factor estimate of the previous section) so that t/2^{j+1} ≈ 1/ε^2.
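The steps above can be sketched as follows. Here g is again an affine hash standing in for the 2-wise family, and j is chosen as the smallest level whose buffer never filled, a simplification of the selection rule in step 4 (all names are mine):

```python
import random

def f0_refined(stream, eps=0.25, lg_n=32, seed=0, C=4):
    # Level j receives the elements i with lsb(g(i)) = j, i.e. roughly a
    # 1/2^(j+1) fraction of the distinct elements; each TS_j remembers at
    # most C/eps^2 of them.
    rng = random.Random(seed)
    p = (1 << 61) - 1
    a, b = rng.randrange(1, p), rng.randrange(p)
    cap = int(C / eps**2)
    levels = [set() for _ in range(lg_n + 1)]
    for i in stream:
        g = ((a * i + b) % p) & ((1 << lg_n) - 1)
        j = (g & -g).bit_length() - 1 if g else lg_n
        if len(levels[j]) < cap:          # TS_j keeps at most C/eps^2 items
            levels[j].add(i)
    # Read off the first level that never filled and rescale by 2^(j+1).
    for j in range(lg_n + 1):
        if len(levels[j]) < cap:
            return (1 << (j + 1)) * len(levels[j])
    return 0

print(f0_refined(range(10000), eps=0.25))
```

The first non-full level holds a count close to Q_j = t/2^{j+1}, which is on the order of 1/ε^2, so rescaling by 2^{j+1} recovers t up to relative error O(1/√(Q_j)) = O(ε).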

Let B_j be the number of distinct elements hashed by g to TS_j. Then EB_j = t/2^{j+1} =: Q_j. By Chebyshev, B_j = Q_j ± O(√Q_j) with good probability, which is (1 ± O(ε)) Q_j when Q_j ≥ 1/ε^2.

Final space: lg n copies of TS, each storing C/ε^2 elements of lg n bits, i.e. O((1/ε^2) lg^2 n) bits.

It is known that space O(1/ε^2 + log n) is achievable [4], and furthermore that this is optimal [1, 5] (see also [3]).

References

[1] Noga Alon, Yossi Matias, Mario Szegedy. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1):137-147, 1999.

[2] Philippe Flajolet, G. Nigel Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci. 31(2):182-209, 1985.

[3] T. S. Jayram, Ravi Kumar, D. Sivakumar. The one-way communication complexity of Hamming distance. Theory of Computing 4(1):129-135, 2008.

[4] Daniel M. Kane, Jelani Nelson, David P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 41-52, 2010.

[5] David P. Woodruff. Optimal space lower bounds for all frequency moments. In SODA, pages 167-175, 2004.