CS 395T: Sublinear Algorithms (Fall 2014)
Prof. Eric Price
Lecture 3, Sept. 4, 2014
Scribe: Zhao Song

In today's lecture, we will discuss the following problems:

1. Distinct elements
2. Turnstile model
3. AMS sketch

1 Distinct elements

We are given a stream $1, 5, 4, 4, 19, \ldots$ of elements of $[n]$.

Goal: Estimate $k$ (the number of distinct elements) up to a factor of $(1 \pm \epsilon)$ with probability $1 - \delta$.

In order to solve this problem, let us first look at the following basic question: given $t$, decide whether $k \le t$ or $k \ge 2t$.

Choose a subset $S \subseteq [n]$ by including each $i \in [n]$ independently with probability $\frac{1}{t}$. Record whether the stream intersects $S$, and let $x$ denote this event (stream $\cap\, S \neq \emptyset$). (Note that $S$ is chosen before you see the stream of integers.) Then
$$\Pr[x] = 1 - \left(1 - \frac{1}{t}\right)^k,$$
which is monotonically increasing in $k$ for fixed $t$. For $k \le t$ we have
$$\Pr[x \mid k \le t] \le 1 - \left(1 - \frac{1}{t}\right)^t \approx 1 - \frac{1}{e} \approx 0.63,$$
and for $k \ge 2t$ we have
$$\Pr[x \mid k \ge 2t] \ge 1 - \left(1 - \frac{1}{t}\right)^{2t} \ge 1 - \frac{1}{e^2} > 0.8.$$

Repeat to get independent samples $x_1, x_2, \ldots, x_m$, and report "$k \ge 2t$" if $\sum_i x_i \ge 0.7m$. Since each $x_i \in \{0, 1\}$ is subgaussian with $\sigma = \frac{1}{2}$, the sum $\sum_i x_i$ is subgaussian with $\sigma = \frac{\sqrt{m}}{2}$, so
$$\Pr\Big[\sum_i x_i \ge \mu + s\Big] \le e^{-s^2/(2\sigma^2)} = e^{-2s^2/m},$$
and in particular
$$\Pr\Big[\sum_i x_i \ge 0.63m + 0.07m\Big] \le e^{-2 \cdot 0.07^2 \cdot m}.$$
Therefore with $m = \Theta(\log(1/\delta))$ we can distinguish the two cases with probability $1 - \delta$.
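To make the idealized test concrete, here is a minimal Python sketch (the function name threshold_test and the repetition count m = 100 are our choices, not from the lecture; note that it stores S explicitly, which is exactly the space issue addressed next):

import random

def threshold_test(stream, n, t, m=100):
    """Report True if the test believes k >= 2t, False if it believes k <= t.

    Idealized version: each repetition draws an explicit random set S
    (every i in [n] kept with probability 1/t) and records whether the
    stream hits S; the hit count is compared to the 0.7m threshold."""
    hits = 0
    for _ in range(m):
        S = {i for i in range(1, n + 1) if random.random() < 1.0 / t}
        if any(item in S for item in stream):
            hits += 1
    return hits >= 0.7 * m

# Example: roughly 400 distinct elements, tested against t = 50 (so k >= 2t).
stream = [random.randrange(1, 1001) for _ in range(500)]
print(threshold_test(stream, n=1000, t=50))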

Question: But how do we store a concise description of $S$? There can be $2^n$ such sets, and roughly $\binom{n}{k}$ likely sets, so storing $S$ would dominate the space complexity.

Answer 1: Use a cryptographic hash $h$, e.g. SHA-256 [SHA256] or any other crypto hash, and choose $S = \{i : h(i) < \frac{2^{256}}{t}\}$. Then there would exist streams that break the algorithm, but it is (hopefully) computationally intractable to find them.

Answer 2: Use a pairwise independent hash function $h$ and take $S = \{i : h(i) < \frac{1}{t}\}$.

Let us first look at the general definition of such hash families.

Definition 1.1. A family $H$ of functions from $[n]$ to $[m]$ is pairwise independent if for all $x \neq y \in [n]$ and all $c, d \in [m]$,
$$\Pr_{h \in H}[h(x) = c \text{ and } h(y) = d] = \frac{1}{m^2}.$$

Example 1.2. The canonical example is $h(x) = ax + b \pmod{m}$, where $(a, b) \in [m]^2$ is chosen uniformly at random; this family is pairwise independent if $m$ is a prime with $m \ge n$.

Now consider an algorithm that uses a pairwise independent hash function.

Algorithm: Let $H$ be a pairwise independent hash family with $h : [n] \to [B]$, where $B = \Theta(t)$ (the constant will be decided later). Choose $h \in H$ at random and let $S = \{i : h(i) = 0\}$. For the probability that any stream element lands in $S$, the union bound gives the upper bound
$$\Pr[\text{any } x \in S] \le \sum_i \Pr[i \in S] = \frac{k}{B},$$
and inclusion-exclusion (see http://en.wikipedia.org/wiki/Inclusion-exclusion_principle) gives the lower bound
$$\Pr[\text{any } x \in S] \ge \sum_i \Pr[i \in S] - \sum_{i < j} \Pr[i \in S \text{ and } j \in S] = \frac{k}{B} - \frac{k(k-1)}{2B^2} = \frac{k}{B}\left(1 - \frac{k-1}{2B}\right),$$
where the sums range over the $k$ distinct elements of the stream. Set $B = 4t$. For $k \le t$ we have
$$\Pr[\text{any } x \in S] \le \frac{t}{B} = \frac{1}{4},$$
and for $k \ge 2t$ (using $k = 2t$ in the lower bound, which is monotone in $k$, together with $\frac{2t-1}{2B} \le \frac{1}{4}$) we have
$$\Pr[\text{any } x \in S] \ge \frac{1}{2}\left(1 - \frac{1}{4}\right) = \frac{3}{8}.$$

For any $t$, take $\log(\frac{1}{\delta})$ independent repetitions, each using $O(\log n)$ bits of space. Since there are $O(\log n)$ different values of $t$, a total of $O(\frac{1}{\epsilon^2} \log \frac{\log n}{\delta})$ space is used to solve distinct elements.
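A minimal Python sketch of this hashing-based test follows (the prime 2^31 - 1, the helper names, and the 5/16 decision threshold are our choices; reducing mod B after mod P is only approximately pairwise independent, which we gloss over here):

import random

P = 2_147_483_647  # a prime >= n (here 2^31 - 1); our choice, not from the lecture

def make_hash(B):
    """h(i) = ((a*i + b) mod P) mod B from an (approximately) pairwise
    independent family; the final mod-B reduction introduces a small bias."""
    a = random.randrange(1, P)
    b = random.randrange(0, P)
    return lambda i: ((a * i + b) % P) % B

def hits_bucket_zero(stream, t):
    """One repetition of the test with B = 4t buckets and S = {i : h(i) = 0}."""
    h = make_hash(4 * t)
    return any(h(i) == 0 for i in stream)

# Repeat O(log(1/delta)) times and compare the empirical hit frequency to the
# 1/4 vs 3/8 bounds above (e.g. threshold at 5/16, our choice).
stream = [random.randrange(1, 1001) for _ in range(500)]
freq = sum(hits_bucket_zero(stream, t=50) for _ in range(200)) / 200
print("looks like k >= 2t" if freq > 5 / 16 else "looks like k <= t")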

Idealized streaming algorithm

We now explain the LogLog algorithm of [DF03], which improves the space complexity from roughly $O(\frac{1}{\epsilon^2} \log n)$ to $O(\frac{1}{\epsilon^2} \log \log n)$.

One idealized algorithm for distinct elements is the following:

1. Pick a random hash function $h : [n] \to [0, 1]$.
2. Define $z = \min_{i \in \text{stream}} h(i)$; then $\frac{1}{z} - 1 \approx k$.

The observation is that you do not need to store $z$ exactly; you only need to remember which of $\log n$ different scales $z$ lies in. (The details of this idealized streaming algorithm can be found in Lecture 2 of the Harvard course Algorithms for Big Data: http://people.seas.harvard.edu/~minilek/cs229r/lec/lec2.pdf.)

LogLog algorithm

1. Pick a random hash function $h : [n] \to \{0, 1\}^\infty$. (Note that $h$ converts each stream element to a binary string.)
2. For a string $x \in \{0, 1\}^\infty$, define $\rho(x)$ to be the number of leading zeros. (In [DF03], $\rho(x)$ is defined in a similar way as the position of the first 1-bit, e.g. $\rho(1\cdots) = 1$ and $\rho(001\cdots) = 3$.)
3. Separate the elements into $m$ buckets. (The analysis in [DF03] shows that $\epsilon \approx \frac{1.3}{\sqrt{m}}$ here; for HyperLogLog, $\epsilon \approx \frac{1.05}{\sqrt{m}}$.)
4. Let $m = 2^t$; the first $t$ bits of the hashed value give the index of one of the $m$ buckets.
5. Let $M$ denote the multiset of hashed values, and define $z(M) = \max_{x \in M} \rho(x)$.
6. For each bucket $j$, ignore the first $t$ bits and compute $z_j$.
7. Output $\alpha_m \cdot m \cdot 2^{\frac{1}{m} \sum_j z_j}$ to approximate the number of distinct elements, where $\alpha_m$ is a constant defined in [DF03].

Total space: $O(\frac{1}{\epsilon^2} \log \log n + \log n)$, where the first term comes from the $m$ buckets (each $z_j$ is at most about $\log n$, so it fits in $O(\log \log n)$ bits) and the second term comes from storing the hash function.

There exists a better algorithm:

Theorem 1.3 ([KNW10]). For a stream of indices in $\{1, 2, \ldots, n\}$, there is an algorithm that computes a $(1 \pm \epsilon)$-approximation to the number of distinct elements using an optimal $O(\frac{1}{\epsilon^2} + \log n)$ bits of space, with success probability $\frac{2}{3}$, for any $0 < \epsilon < 1$.
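A toy Python version of the LogLog estimator above is sketched below (SHA-256 stands in for the ideal random hash, and the value 0.397 for $\alpha_m$ is only the approximate large-$m$ constant we assume from [DF03]; all helper names are ours):

import hashlib
import random

def h64(item, seed):
    """Deterministic 64-bit hash of item, salted by seed (standing in for
    the random hash function h : [n] -> {0,1}^64)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def rho(w, bits):
    """Position of the first 1-bit of the `bits`-bit string w (1-indexed),
    i.e. the number of leading zeros plus one, as in [DF03]."""
    return bits - w.bit_length() + 1 if w else bits + 1

def loglog_estimate(stream, t=6, alpha_m=0.397, seed=None):
    """Toy LogLog estimator with m = 2^t buckets; the bias-correction
    constant alpha_m is treated as an assumption here (exact value in [DF03])."""
    if seed is None:
        seed = random.getrandbits(64)
    m = 1 << t
    z = [0] * m
    for item in stream:
        w = h64(item, seed)
        j = w >> (64 - t)                 # first t bits: bucket index
        rest = w & ((1 << (64 - t)) - 1)  # remaining bits feed rho()
        z[j] = max(z[j], rho(rest, 64 - t))
    return alpha_m * m * 2 ** (sum(z) / m)

# Example: roughly 10,000 distinct elements.
print(loglog_estimate([random.randrange(10_000) for _ in range(50_000)]))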

2 Turnstile model

1. Maintain a vector $x \in \mathbb{R}^n$, starting at $x = 0$.
2. Read a stream of updates $(\ldots, (i, \alpha_i), \ldots)$, where $i \in [n]$ and $\alpha_i$ is the number of copies of item $i$ to be added or deleted.
3. For each update $(i, \alpha_i)$, set $x_i \leftarrow x_i + \alpha_i$.
4. Compute $f(x)$.

A further restriction is the strict turnstile model, where we require $x_i \ge 0$ at all times; that is, the count of any item can never be negative.

What are examples of $f$ that you might want to compute? Distinct elements corresponds to
$$f(x) = \#\{i : x_i \neq 0\} = \|x\|_0 \qquad (1)$$
(also called the sparsity of $x$). One may also ask about other norms, e.g. $\|x\|_1$ and $\|x\|_2$, or about finding a spanning tree, or finding the largest entries.

Estimating $\|x\|_2$ in the turnstile model

Let $A \in \mathbb{R}^{m \times n}$ be a Johnson-Lindenstrauss matrix with $A_{ij} \sim \mathcal{N}(0, \frac{1}{m})$ and $m = O(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$; then $\|Ax\|_2^2 = (1 \pm \epsilon)\|x\|_2^2$ with probability $1 - \delta$. Given an update $(i, \alpha)$, we have
$$x \leftarrow x + \alpha e_i,$$
$$Ax \leftarrow Ax + \alpha A e_i,$$
$$y \leftarrow y + \alpha \cdot (\text{column } i \text{ of } A),$$
where $e_i$ is the elementary unit vector of length $n$ with a $1$ in position $i$ and zeros elsewhere. This means we can maintain the linear sketch $y = Ax$ under streaming updates to $x$, which lets us estimate $\|x\|_2$ from the small-space sketch $Ax$. The problem is that doing so requires us to remember $A$, which takes more than $mn$ bits. So how do we solve this? The same way we solved not being able to store $S$ for distinct elements: with hashing and limited independence.
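A minimal numpy sketch of maintaining $y = Ax$ under turnstile updates (the constant 8 in the choice of $m$ and the helper names are our choices; $A$ is stored explicitly here, which is precisely the memory issue the AMS sketch in the next section avoids):

import numpy as np

def make_sketch(n, eps=0.25, delta=0.01, rng=None):
    """Dense Gaussian JL sketch: A has i.i.d. N(0, 1/m) entries with m on
    the order of eps^-2 log(1/delta). Storing A explicitly is exactly the
    space problem pointed out above; this is only an illustration."""
    rng = rng or np.random.default_rng()
    m = int(np.ceil(8 / eps**2 * np.log(1 / delta)))
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
    return A, np.zeros(m)

def update(A, y, i, alpha):
    """Turnstile update (i, alpha): x_i += alpha, so y += alpha * (column i of A)."""
    y += alpha * A[:, i]

# Example: after these updates x_3 = 4 and x_10 = -2, so ||x||_2 = sqrt(20).
A, y = make_sketch(n=1000)
for i, alpha in [(3, 5), (10, -2), (3, -1)]:
    update(A, y, i, alpha)
print(np.linalg.norm(y))  # should be close to sqrt(20) ~ 4.47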

3 AMS sketch [AMS99]

Definition 3.1. A family $H$ is a $k$-wise independent hash family if for all distinct $i_1, i_2, \ldots, i_k \in [n]$ and all $j_1, j_2, \ldots, j_k \in [m]$,
$$\Pr_{h \in H}[h(i_1) = j_1 \wedge \cdots \wedge h(i_k) = j_k] = \frac{1}{m^k}.$$

AMS algorithm (more details can be found in Lecture 2 of the Harvard course Algorithms for Big Data, http://people.seas.harvard.edu/~minilek/cs229r/lec/lec2.pdf, and in Lecture 2 of the course Sublinear Algorithms for Big Datasets at the University of Buenos Aires, http://grigory.github.io/files/teaching/sublinear-big-data-2.pdf):

1. Pick a random hash function $h : [n] \to \{-1, +1\}$ from a four-wise independent family.
2. Let $v_i = h(i)$.
3. Let $y = \langle v, x \rangle$ and output $y^2$.
4. By Lemmas 3.2 and 3.3 below, $y^2$ is an unbiased estimator of $\|x\|_2^2$ with variance at most a constant times the square of its expectation.
5. Sample $y^2$ independently $m_1 = O(\frac{1}{\epsilon^2})$ times: $y_1^2, y_2^2, \ldots, y_{m_1}^2$. By Chebyshev's inequality, their average is a $(1 \pm \epsilon)$-approximation with probability $\frac{2}{3}$.
6. Let $\bar{y} = \frac{1}{m_1} \sum_{i=1}^{m_1} y_i^2$.
7. Sample $\bar{y}$ independently $m_2 = O(\log \frac{1}{\delta})$ times: $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_{m_2}$. Take the median to get a $(1 \pm \epsilon)$-approximation with probability $1 - \delta$.

Space analysis: Each hash function takes $O(\log n)$ bits to store, and there are $O(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$ hash functions in total.

Lemma 3.2. $\mathbb{E}[y^2] = \|x\|_2^2$.

Proof.
$$\mathbb{E}[y^2] = \mathbb{E}[\langle v, x \rangle^2] = \mathbb{E}\Big[\sum_{i=1}^n v_i^2 x_i^2 + \sum_{i \neq j} v_i v_j x_i x_j\Big] = \sum_{i=1}^n x_i^2 + 0 = \|x\|_2^2,$$
where $\mathbb{E}[v_i v_j] = \mathbb{E}[v_i]\,\mathbb{E}[v_j] = 0$ for $i \neq j$ by pairwise independence (and $v_i^2 = 1$).
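A small Python sketch of the estimator above (the prime modulus, the parity-based sign extraction, and the constants in m1 and m2 are our choices; a random degree-3 polynomial over a prime field is a standard way to get four-wise independence, up to the small parity bias we ignore here):

import math
import random

P = 2_147_483_647  # prime modulus 2^31 - 1 (our choice)

def make_sign_hash():
    """Sign hash v_i = h(i) in {-1, +1} from a random degree-3 polynomial
    over F_P, evaluated by Horner's rule and reduced to a parity bit."""
    coeffs = [random.randrange(P) for _ in range(4)]
    def h(i):
        v = 0
        for c in coeffs:
            v = (v * i + c) % P
        return 1 if v % 2 == 0 else -1
    return h

def ams_estimate(updates, eps=0.2, delta=0.05):
    """Median-of-means AMS estimate of ||x||_2^2 from turnstile updates (i, alpha)."""
    m1 = max(1, int(6 / eps**2))               # copies averaged (Chebyshev)
    m2 = max(1, int(4 * math.log(1 / delta)))  # averages medianed
    hashes = [[make_sign_hash() for _ in range(m1)] for _ in range(m2)]
    y = [[0.0] * m1 for _ in range(m2)]
    for i, alpha in updates:                   # maintain y[r][s] = <v_{r,s}, x>
        for r in range(m2):
            for s in range(m1):
                y[r][s] += alpha * hashes[r][s](i)
    means = sorted(sum(v * v for v in row) / m1 for row in y)
    return means[len(means) // 2]

# Example: x_7 = 3 and x_42 = -4, so ||x||_2^2 = 25.
print(ams_estimate([(7, 5), (42, -4), (7, -2)]))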

Lemma 3.3. $\mathbb{E}[(y^2 - \mathbb{E}[y^2])^2] \le 2\|x\|_2^4$.

Proof.
$$\mathbb{E}[(y^2 - \mathbb{E}[y^2])^2] = \mathbb{E}\Big[\Big(\sum_{i \neq j} v_i v_j x_i x_j\Big)^2\Big]$$
$$= \mathbb{E}\Big[4 \sum_{i < j} v_i^2 v_j^2 x_i^2 x_j^2 + 4 \sum_{i \neq j \neq k} v_i^2 v_j v_k x_i^2 x_j x_k + 24 \sum_{i < j < k < l} v_i v_j v_k v_l x_i x_j x_k x_l\Big]$$
$$= 4 \sum_{i < j} x_i^2 x_j^2 + 4 \sum_{i \neq j \neq k} \mathbb{E}[v_i^2 v_j v_k]\, x_i^2 x_j x_k + 24 \sum_{i < j < k < l} \mathbb{E}[v_i v_j v_k v_l]\, x_i x_j x_k x_l$$
$$= 4 \sum_{i < j} x_i^2 x_j^2 + 0 + 0 \le 2\|x\|_2^4,$$
where $\mathbb{E}[v_i^2 v_j v_k] = \mathbb{E}[v_j]\,\mathbb{E}[v_k] = 0$ by pairwise independence, and $\mathbb{E}[v_i v_j v_k v_l] = \mathbb{E}[v_i]\,\mathbb{E}[v_j]\,\mathbb{E}[v_k]\,\mathbb{E}[v_l] = 0$ by four-wise independence.

References

[AMS99] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci., 58(1):137-147, 1999.

[DF03] Marianne Durand and Philippe Flajolet. Loglog counting of large cardinalities. In ESA 2003, LNCS 2832, pages 605-617, 2003.

[KNW10] Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), 2010.

[SHA256] Descriptions of SHA-256, SHA-384, and SHA-512. NIST, 2014-09-07. http://csrc.nist.gov/groups/STM/cavp/documents/shs/sha256-384-512.pdf