Data Streams & Communication Complexity


Lecture 1: Simple Stream Statistics in Small Space
Andrew McGregor, UMass Amherst

Data Stream Model

Stream: $m$ elements from a universe of size $n$, e.g.,
$$x_1, x_2, \ldots, x_m = 3, 5, 3, 7, 5, 4, \ldots$$

Goal: Compute some function of the stream, e.g., number of distinct elements, frequent items, longest increasing subsequence, a clustering, graph connectivity properties, ...

Catch:
1. Limited working memory, sublinear in $n$ and $m$
2. Access data sequentially
3. Process each element quickly

The model has its origins in the seventies but has become popular in the last ten years...

Why has it become popular?

Practical Appeal: Faster networks, cheaper data storage, and ubiquitous data-logging result in massive amounts of data to be processed. Applications include network monitoring, query planning, I/O efficiency for massive data, sensor network aggregation...

Theoretical Appeal: Problems are easy to state but hard to solve. Links to communication complexity, compressed sensing, metric embeddings, pseudo-random generators, approximation...

This Lecture: Basic Numerical Statistics

Given a stream of $m$ elements from universe $[n] = \{1, 2, \ldots, n\}$, e.g.,
$$x_1, x_2, \ldots, x_m = 3, 5, 3, 7, 5, 4, \ldots$$
let $f \in \mathbb{N}^n$ be the frequency vector, where $f_i$ is the frequency of $i$.

Problems: What can we approximate in sublinear space?
- Frequency moments: $F_k = \sum_i f_i^k$
- Max frequency: $F_\infty = \max_i f_i$
- Number of distinct elements: $F_0 = \sum_i f_i^0$
- Median: $j$ such that $f_1 + f_2 + \cdots + f_j \geq m/2$

Algorithms are often randomized and guarantees will be probabilistic.

Keep things simple: We could allow the $f_i$ to be both increased and decreased, but for this talk we'll focus on unit increments. We will also assume algorithms have an unlimited store of random bits.
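
To make these definitions concrete, here is a small illustrative Python snippet (not from the lecture) that computes the statistics exactly; the streaming algorithms below approximate them in far less space.

```python
from collections import Counter

def exact_stats(stream, k=2):
    """Compute the statistics above exactly (for reference/testing)."""
    f = Counter(stream)                    # (sparse) frequency vector
    m = len(stream)
    F_k = sum(c ** k for c in f.values()) # frequency moment F_k
    F_inf = max(f.values())               # max frequency F_inf
    F_0 = len(f)                          # number of distinct elements
    total, median = 0, None
    for j in sorted(f):                   # median: least j with prefix >= m/2
        total += f[j]
        if total >= m / 2:
            median = j
            break
    return F_k, F_inf, F_0, median

print(exact_stats([3, 5, 3, 7, 5, 4]))    # stream from the running example
```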

Outline
- Sampling
- Sketching: The Basics
- Count-Min and Applications
- Count-Sketch: Count-Min with a Twist
- $\ell_p$ Sampling and Frequency Moments

Sampling and Statistics

Sampling is a general technique for tackling massive amounts of data.

Example: To find an $\epsilon$-approximate median, i.e., $j$ such that $f_1 + f_2 + \cdots + f_j = m/2 \pm \epsilon m$, sampling $O(\epsilon^{-2})$ stream elements and returning the sample median works with good probability.

Beyond basic sampling: There are more powerful forms of sampling and other techniques that make better use of the limited space.
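
A minimal sketch of this sampling approach, assuming random access to the stream for brevity (a true one-pass implementation would maintain the sample via reservoir sampling):

```python
import random

def approx_median(stream, eps, c=3):
    """Sample ~c/eps^2 elements uniformly; return the sample median."""
    t = int(c / eps ** 2)
    sample = sorted(random.choice(stream) for _ in range(t))
    return sample[t // 2]

print(approx_median([3, 5, 3, 7, 5, 4] * 100, eps=0.1))
```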

AMS Sampling

Problem: Estimate $\sum_i g(f_i)$ for some function $g$ with $g(0) = 0$.

Basic Estimator: Sample $x_J$ where $J \in_R [m]$, compute $r = |\{j \geq J : x_j = x_J\}|$, and output $X = m(g(r) - g(r-1))$.

Expectation:
$$E[X] = \sum_i P[x_J = i]\, E[X \mid x_J = i] = \sum_i \frac{f_i}{m} \left(\sum_{r=1}^{f_i} \frac{m(g(r) - g(r-1))}{f_i}\right) = \sum_i g(f_i)$$

For high confidence: Compute $t$ estimators in parallel and average.
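
A minimal simulation of the basic estimator (illustrative Python; in a genuine one-pass implementation $J$ is chosen by reservoir sampling and $r$ is maintained as a running counter):

```python
import random

def ams_estimate(stream, g, t=100):
    """Average t independent basic AMS estimators for sum_i g(f_i)."""
    m = len(stream)
    total = 0.0
    for _ in range(t):
        J = random.randrange(m)            # uniform random position J
        # r = |{j >= J : x_j = x_J}|, a counter in a one-pass implementation
        r = sum(1 for j in range(J, m) if stream[j] == stream[J])
        total += m * (g(r) - g(r - 1))     # X = m(g(r) - g(r-1))
    return total / t
```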

Example: Frequency Moments

Frequency Moments: Define $F_k = \sum_i f_i^k$ for $k \in \{1, 2, 3, \ldots\}$.

Use the AMS estimator with $X = m(r^k - (r-1)^k)$.

Expectation: $E[X] = F_k$

Range: $0 \leq X \leq k m F_\infty^{k-1} \leq k n^{1-1/k} F_k$

Repeat $t$ times and let $\tilde{F}_k$ be the average value. By Chernoff,
$$P\big[|\tilde{F}_k - F_k| \geq \epsilon F_k\big] \leq 2\exp\left(-\frac{t F_k \epsilon^2}{3 k n^{1-1/k} F_k}\right) = 2\exp\left(-\frac{t \epsilon^2}{3 k n^{1-1/k}}\right)$$

If $t = 3\epsilon^{-2} k n^{1-1/k} \log(2\delta^{-1})$ then $P\big[|\tilde{F}_k - F_k| \geq \epsilon F_k\big] \leq \delta$.

Thm: In $\tilde{O}(\epsilon^{-2} n^{1-1/k})$ space we can find a $(1 \pm \epsilon)$ approximation for $F_k$ with probability at least $1 - \delta$.
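
For example, a hedged usage of the ams_estimate sketch above with $g(x) = x^k$:

```python
stream = [3, 5, 3, 7, 5, 4] * 50
k = 3
print(ams_estimate(stream, lambda x: x ** k, t=5000))  # approximates F_3
```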


Random Projections

Many stream algorithms use a random projection $Z \in \mathbb{R}^{w \times n}$, $w \ll n$:
$$s = Zf = \begin{pmatrix} z_{1,1} & \cdots & z_{1,n} \\ \vdots & & \vdots \\ z_{w,1} & \cdots & z_{w,n} \end{pmatrix} \begin{pmatrix} f_1 \\ \vdots \\ f_n \end{pmatrix} = \begin{pmatrix} s_1 \\ \vdots \\ s_w \end{pmatrix}$$

Updatable: We can maintain the sketch $s$ in $\tilde{O}(w)$ space since incrementing $f_i$ corresponds to $s \leftarrow s + (z_{1,i}, \ldots, z_{w,i})^T$.

Useful: Choose a distribution for the $z_{i,j}$ such that the relevant function of $f$ can be estimated from $s$ with high probability for sufficiently large $w$.

Examples

- If $z_{i,j} \in_R \{-1, +1\}$, we can estimate $F_2$ with $w = O(\epsilon^{-2} \log \delta^{-1})$.
- If $z_{i,j} \sim D$ where $D$ is $p$-stable, $p \in (0, 2]$, we can estimate $F_p$ with $w = O(\epsilon^{-2} \log \delta^{-1})$. For example, the 1-stable and 2-stable distributions are
$$\mathrm{Cauchy}(x) = \frac{1}{\pi(1 + x^2)} \qquad \mathrm{Gaussian}(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$$
- Note that $F_0 = (1 \pm \epsilon) F_p$ if $p = \log(1 + \epsilon)/\log m$.

For the rest of the lecture we'll focus on hash-based sketches: given a random hash function $h : [n] \to [w]$, the only non-zero entries of $Z$ are $z_{h(i),i}$, i.e., $Z$ is a 0/1 matrix with exactly one 1 per column, in row $h(i)$ for column $i$.
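
A minimal simulation of the first example, the $\pm 1$ "tug-of-war" sketch for $F_2$ (illustrative; each row gives one estimator $(z \cdot f)^2$ and we average over $w$ rows):

```python
import random

def f2_estimate(stream, w=200):
    """Average w independent (z . f)^2 estimators with z in {-1,+1}^n."""
    universe = sorted(set(stream))
    # fully random signs for simplicity; real implementations use
    # 4-wise independent hash functions to generate them in small space
    sign = [{i: random.choice((-1, 1)) for i in universe} for _ in range(w)]
    s = [0] * w
    for x in stream:                     # streaming update: s <- s + column
        for row in range(w):
            s[row] += sign[row][x]
    return sum(v * v for v in s) / w     # E[(z . f)^2] = F_2
```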


Count-Min Sketch

Maintain a vector $s \in \mathbb{N}^w$ via a random hash function $h : [n] \to [w]$. (Diagram: each coordinate $f[i]$ is hashed to one of the buckets $s[1], \ldots, s[w]$.)

Update: For each increment of $f_i$, increment $s_{h(i)}$. Hence,
$$s_k = \sum_{j : h(j) = k} f_j \qquad \text{e.g., } s_3 = f_6 + f_7 + f_{13}$$

Query: Use $\tilde{f}_i = s_{h(i)}$ to estimate $f_i$.

Lemma: $\tilde{f}_i \geq f_i$ and $P\big[\tilde{f}_i \geq f_i + 2m/w\big] \leq 1/2$.

Thm: Let $w = 2/\epsilon$. Repeat the hashing $\lg(\delta^{-1})$ times in parallel and take the minimum estimate for $f_i$; then
$$P\big[f_i \leq \tilde{f}_i \leq f_i + \epsilon m\big] \geq 1 - \delta$$
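
A minimal Count-Min implementation in Python (illustrative; Python's built-in hash is a stand-in for the pairwise-independent hash families a real implementation would use):

```python
import math, random

class CountMin:
    """Count-Min: f_i <= estimate(i) <= f_i + eps*m with prob. >= 1-delta."""
    def __init__(self, eps, delta):
        self.w = math.ceil(2 / eps)                       # width w = 2/eps
        self.d = max(1, math.ceil(math.log2(1 / delta)))  # lg(1/delta) rows
        self.s = [[0] * self.w for _ in range(self.d)]
        self.seed = [random.random() for _ in range(self.d)]

    def _h(self, row, i):              # stand-in for a pairwise-indep. hash
        return hash((self.seed[row], i)) % self.w

    def update(self, i):
        for row in range(self.d):
            self.s[row][self._h(row, i)] += 1

    def estimate(self, i):
        return min(self.s[row][self._h(row, i)] for row in range(self.d))
```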

Proof of Lemma

Define $E$ by $\tilde{f}_i = f_i + E$, so $E = \sum_{j \neq i : h(j) = h(i)} f_j$.

Since all $f_j \geq 0$, we have $E \geq 0$.

Since $P[h(i) = h(j)] = 1/w$,
$$E[E] = \sum_{j \neq i} f_j\, P[h(i) = h(j)] \leq m/w$$

By an application of the Markov bound, $P[E \geq 2m/w] \leq 1/2$.

Range Queries

Range Query: For $i, j \in [n]$, estimate $f_{[i,j]} = f_i + f_{i+1} + \cdots + f_j$.

Dyadic Intervals: Restrict attention to intervals of the form $[1 + (i-1)2^j,\, i \cdot 2^j]$ where $j \in \{0, 1, \ldots, \lg n\}$ and $i \in \{1, 2, \ldots, n/2^j\}$, since any range can be partitioned into $O(\log n)$ such intervals. E.g.,
$$[48, 106] = [48, 48] \cup [49, 64] \cup [65, 96] \cup [97, 104] \cup [105, 106]$$

To support dyadic intervals, construct Count-Min sketches corresponding to intervals of width 1, 2, 4, 8, ... E.g., for intervals of width 2 we sketch the vector $g$ with $g_i = f_{2i-1} + f_{2i}$, and the update rule is now: for an increment of $f_{2i-1}$ or $f_{2i}$, increment $s_{h(i)}$.
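
A small helper (illustrative) that computes the dyadic decomposition of a range; each piece would then be looked up in the Count-Min sketch of the matching width:

```python
def dyadic_decompose(lo, hi):
    """Split [lo, hi] (1-indexed, inclusive) into maximal dyadic intervals."""
    pieces = []
    while lo <= hi:
        size = 1
        # grow while the interval stays dyadically aligned and inside [lo, hi]
        while (lo - 1) % (2 * size) == 0 and lo + 2 * size - 1 <= hi:
            size *= 2
        pieces.append((lo, lo + size - 1))
        lo += size
    return pieces

print(dyadic_decompose(48, 106))
# [(48, 48), (49, 64), (65, 96), (97, 104), (105, 106)]
```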

Quantiles and Heavy Hitters

Quantiles: Find $j$ such that $f_1 + \cdots + f_j \approx m/2$. We can approximate the median via binary search over range queries.

Heavy Hitters Problem: Find a set $S \subseteq [n]$ where
$$\{i : f_i \geq \phi m\} \subseteq S \subseteq \{i : f_i \geq (\phi - \epsilon) m\}$$

Rather than checking each $f_i$ individually, we can save time by exploiting the fact that if $f_{[i,k]} < \phi m$ then $f_j < \phi m$ for all $j \in [i, k]$; see the search sketch below.
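
A hedged sketch of this hierarchical search; estimate_range is a hypothetical stand-in for a query to the Count-Min sketch of the appropriate width:

```python
def heavy_hitters(estimate_range, n, m, phi):
    """Descend the dyadic tree, pruning intervals with estimated mass < phi*m.

    estimate_range(lo, hi) returns an (over)estimate of f_[lo,hi].
    """
    out, stack = [], [(1, n)]
    while stack:
        lo, hi = stack.pop()
        if estimate_range(lo, hi) < phi * m:
            continue                        # no heavy item can hide in here
        if lo == hi:
            out.append(lo)                  # candidate heavy hitter
        else:
            mid = (lo + hi) // 2
            stack += [(lo, mid), (mid + 1, hi)]
    return out
```

For a quick sanity check one can pass exact counts, e.g., estimate_range = lambda lo, hi: sum(f[lo-1:hi]) for a frequency list f.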


Count-Sketch: Count-Min with a Twist

Maintain $s \in \mathbb{Z}^w$ via hash functions $h : [n] \to [w]$ and $r : [n] \to \{-1, 1\}$. (Diagram: each $f[i]$ is multiplied by the random sign $r_i$ and hashed to one of the buckets $s[1], \ldots, s[w]$.)

Update: For each increment of $f_i$, $s_{h(i)} \leftarrow s_{h(i)} + r_i$. Hence,
$$s_k = \sum_{j : h(j) = k} f_j r_j \qquad \text{e.g., } s_3 = f_6 - f_7 - f_{13}$$

Query: Use $\tilde{f}_i = s_{h(i)} r_i$ to estimate $f_i$.

Lemma: $E[\tilde{f}_i] = f_i$ and $V[\tilde{f}_i] \leq F_2/w$.

Thm: Let $w = O(1/\epsilon^2)$. Repeating $O(\lg \delta^{-1})$ times in parallel and taking the median estimate ensures
$$P\big[f_i - \epsilon\sqrt{F_2} \leq \tilde{f}_i \leq f_i + \epsilon\sqrt{F_2}\big] \geq 1 - \delta$$
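
A minimal Count-Sketch implementation (illustrative, with the same stand-in hashing caveat as before; note the rows are combined with a median rather than a minimum):

```python
import random

class CountSketch:
    """Count-Sketch: median of d unbiased single-row estimates of f_i."""
    def __init__(self, w, d):
        self.w, self.d = w, d
        self.s = [[0] * w for _ in range(d)]
        self.seed = [random.random() for _ in range(d)]

    def _h(self, row, i):                 # stand-in bucket hash h: [n]->[w]
        return hash((self.seed[row], 'h', i)) % self.w

    def _r(self, row, i):                 # stand-in sign hash r: [n]->{-1,1}
        return 1 if hash((self.seed[row], 'r', i)) % 2 else -1

    def update(self, i):
        for row in range(self.d):
            self.s[row][self._h(row, i)] += self._r(row, i)

    def estimate(self, i):
        vals = sorted(self.s[row][self._h(row, i)] * self._r(row, i)
                      for row in range(self.d))
        return vals[self.d // 2]          # median over the d rows
```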

Proof of Lemma

Define $E$ by $\tilde{f}_i = f_i + E r_i$, so $E = \sum_{j \neq i : h(j) = h(i)} f_j r_j$.

Expectation: Since $E[r_j] = 0$,
$$E[E] = \sum_{j \neq i : h(j) = h(i)} f_j\, E[r_j] = 0$$

Variance: Similarly,
$$V[E] \leq E\Big[\Big(\sum_{j \neq i : h(j) = h(i)} f_j r_j\Big)^2\Big] = \sum_{j, k \neq i} f_j f_k\, E[r_j r_k]\, P[h(i) = h(j) = h(k)] = \sum_{j \neq i} f_j^2\, P[h(i) = h(j)] \leq F_2/w$$


$\ell_p$ Sampling

$\ell_p$ Sampling: Return random values $I \in [n]$ and $R \in \mathbb{R}$ where
$$P[I = i] = (1 \pm \epsilon)\frac{|f_i|^p}{F_p} \qquad \text{and} \qquad R = (1 \pm \epsilon) f_I$$

Applications:
- We will use $\ell_2$ sampling to get an optimal algorithm for $F_k$, $k > 2$.
- We will use $\ell_0$ sampling for processing graph streams.
- Many other stream problems can be solved via $\ell_p$ sampling, e.g., duplicate finding, triangle counting, entropy estimation.

Let's see an algorithm for $p = 2$.

$\ell_2$ Sampling Algorithm

Weight $f_i$ by $\gamma_i = 1/\sqrt{u_i}$ where $u_i \in_R [0, 1]$ to form the vector $g$:
$$f = (f_1, f_2, \ldots, f_n) \quad \longrightarrow \quad g = (g_1, g_2, \ldots, g_n) \text{ where } g_i = \gamma_i f_i$$

Return $(i, f_i)$ if $g_i^2 \geq t := F_2(f)/\epsilon$.

Probability $(i, f_i)$ is returned:
$$P\big[g_i^2 \geq t\big] = P\big[u_i \leq f_i^2/t\big] = f_i^2/t$$

The probability that some value is returned is $\sum_i f_i^2/t = \epsilon$, so repeating $O(\epsilon^{-1} \log \delta^{-1})$ times ensures a value is returned with probability $\geq 1 - \delta$.

Lemma: Using a Count-Sketch of size $O(\epsilon^{-1} \log^2 n)$ ensures a $(1 \pm \epsilon)$ approximation of any $g_i$ that passes the threshold.
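
A non-streaming simulation of one round of the sampler on an explicit frequency vector (illustrative; the real algorithm only accesses $g$ through the Count-Sketch, as the lemma describes):

```python
import math, random

def l2_sample_round(f, eps):
    """One round of the sampler on an explicit frequency vector f."""
    F2 = sum(v * v for v in f)
    t = F2 / eps                              # threshold t = F_2(f)/eps
    for i, fi in enumerate(f):
        u = random.random() or 1e-12          # u_i in (0,1]; guard u == 0
        g = fi / math.sqrt(u)                 # g_i = f_i / sqrt(u_i)
        if g * g >= t:
            return (i, fi)                    # success: output (i, f_i)
    return None                               # failed round: repeat

# repeat O(eps^-1 log delta^-1) rounds until some round succeeds
```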

Proof of Lemma

Exercise: $P[F_2(g)/F_2(f) \leq c \log n] \geq 99/100$ for some large $c > 0$, so we'll condition on this event.

Set $w = 9c\epsilon^{-1} \log n$. A Count-Sketch in $O(w \log^2 n)$ space ensures
$$\tilde{g}_i = g_i \pm \sqrt{F_2(g)/w}$$

Then $g_i^2 \geq F_2(f)/\epsilon$ implies
$$\sqrt{F_2(g)/w} \leq \sqrt{F_2(f)/(9\epsilon^{-1})} \leq \sqrt{\epsilon\, g_i^2/(9\epsilon^{-1})} = \epsilon |g_i|/3$$

and hence $\tilde{g}_i^2 = (1 \pm \epsilon/3)^2 g_i^2 = (1 \pm \epsilon) g_i^2$ as required.

Under-the-rug: Need to ensure that the conditioning doesn't affect the sampling probability too much.

$F_k$ Revisited

Earlier we used $\tilde{O}(n^{1-1/k})$ space to approximate $F_k = \sum_i f_i^k$.

Algorithm: Let $(I, R)$ be a $(1 + \gamma)$-approximate $\ell_2$ sample. Return $T = \tilde{F}_2 R^{k-2}$ where $\tilde{F}_2$ is a $(1 \pm \gamma)$ approximation for $F_2$.

Expectation: Setting $\gamma = \epsilon/(4k)$,
$$E[T] = \tilde{F}_2 \sum_i P[I = i]\, \big((1 \pm \gamma) f_i\big)^{k-2} = (1 \pm \gamma)^{k-2}\, \tilde{F}_2 \sum_i \frac{f_i^2}{F_2} f_i^{k-2} = (1 \pm \tfrac{\epsilon}{2}) F_k$$

Range: $0 \leq T \leq (1 + \gamma) F_2 F_\infty^{k-2} \leq (1 + \gamma) n^{1-2/k} F_k$.

Averaging over $t = O(\epsilon^{-2} n^{1-2/k} \log \delta^{-1})$ parallel repetitions gives
$$P\big[|\tilde{F}_k - F_k| \geq \epsilon F_k\big] \leq \delta$$

Thm: In $\tilde{O}(\epsilon^{-2} n^{1-2/k})$ space we can find a $(1 \pm \epsilon)$ approximation for $F_k$ with probability at least $1 - \delta$.
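
A simulation of the estimator (illustrative; exact $F_2$ and an exact $\ell_2$ sample stand in for their sketch-based approximations):

```python
import random

def fk_from_l2(f, k, t=20000):
    """Average t copies of T = F_2 * R^(k-2) with (I, R) an exact l2 sample."""
    F2 = sum(v * v for v in f)
    weights = [v * v / F2 for v in f]          # P[I = i] = f_i^2 / F_2
    total = 0.0
    for _ in range(t):
        i = random.choices(range(len(f)), weights)[0]
        total += F2 * f[i] ** (k - 2)          # T = F_2 * R^(k-2)
    return total / t

f = [10, 7, 7, 3, 1, 1, 1]
print(fk_from_l2(f, 3), sum(v ** 3 for v in f))  # estimate vs exact F_3
```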

Summary

- Basic Sampling: We can sample $i$ with probability proportional to $f_i$, but we can be much smarter via sketches.
- Count-Min: $f_i \leq \tilde{f}_i \leq f_i + \epsilon F_1$ in $O(\epsilon^{-1})$ space. (Sketch matrix $Z$ has 0/1 entries, one 1 per column.)
- Count-Sketch: $f_i - \epsilon\sqrt{F_2} \leq \tilde{f}_i \leq f_i + \epsilon\sqrt{F_2}$ in $O(\epsilon^{-2})$ space. ($Z$ has entries in $\{0, -1, +1\}$.)
- The above sketches solve range queries, quantiles, heavy hitters, ...
- $\ell_p$-Sampling: Select $i$ with probability (roughly) proportional to $f_i^p$ in $O(\epsilon^{-2})$ space. ($Z$ has entries 0 and $\pm\gamma_i$.)
