Data Streams & Communication Complexity
Data Streams & Communication Complexity
Lecture 1: Simple Stream Statistics in Small Space
Andrew McGregor, UMass Amherst
Data Stream Model

Stream: m elements from a universe of size n, e.g.,
$x_1, x_2, \ldots, x_m = 3, 5, 3, 7, 5, 4, \ldots$

Goal: Compute some function of the stream, e.g., the number of distinct elements, frequent items, longest increasing sequence, a clustering, graph connectivity properties, ...

Catch:
1. Limited working memory, sublinear in n and m
2. Access data sequentially
3. Process each element quickly

The model has its origins in the seventies but has become popular in the last ten years.
Why has it become popular?

Practical Appeal: Faster networks, cheaper data storage, and ubiquitous data-logging result in massive amounts of data to be processed. Applications to network monitoring, query planning, I/O efficiency for massive data, sensor network aggregation, ...

Theoretical Appeal: Problems are easy to state but hard to solve. Links to communication complexity, compressed sensing, metric embeddings, pseudo-random generators, approximation, ...
This Lecture: Basic Numerical Statistics

Given a stream of m elements from the universe $[n] = \{1, 2, \ldots, n\}$, e.g.,
$x_1, x_2, \ldots, x_m = 3, 5, 3, 7, 5, 4, \ldots$
let $f \in \mathbb{N}^n$ be the frequency vector, where $f_i$ is the frequency of i.

Problems: What can we approximate in sublinear space?
- Frequency moments: $F_k = \sum_i f_i^k$
- Max frequency: $F_\infty = \max_i f_i$
- Number of distinct elements: $F_0 = \sum_i f_i^0$ (with the convention $0^0 = 0$)
- Median: j such that $f_1 + f_2 + \cdots + f_j \geq m/2$

Algorithms are often randomized, and guarantees will be probabilistic.

Keeping things simple: One could allow the $f_i$ to be both increased and decreased, but for this talk we focus on unit increments. We also assume algorithms have an unlimited store of random bits.
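As a concrete reference point, the frequency vector and the moments above can be computed exactly (with linear space) as follows; the function names are illustrative, not from the lecture:

```python
from collections import Counter

def frequency_vector(stream, n):
    """f[i-1] = frequency of item i, for the universe [n] = {1, ..., n}."""
    counts = Counter(stream)
    return [counts.get(i, 0) for i in range(1, n + 1)]

def F(k, f):
    """k-th frequency moment F_k = sum_i f_i^k, with the convention 0^0 = 0
    so that F_0 counts the distinct elements."""
    if k == 0:
        return sum(1 for fi in f if fi > 0)
    return sum(fi ** k for fi in f)
```

On the example stream 3, 5, 3, 7, 5, 4 with n = 7 this gives f = (0, 0, 2, 1, 2, 0, 1), so $F_0 = 4$, $F_1 = 6$, and $F_2 = 10$. The streaming algorithms below approximate these quantities without storing f.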
Outline
- Sampling
- Sketching: The Basics
- Count-Min and Applications
- Count-Sketch: Count-Min with a Twist
- $\ell_p$ Sampling and Frequency Moments
Sampling and Statistics

Sampling is a general technique for tackling massive amounts of data.

Example: To find an $\epsilon$-approximate median, i.e., j such that
$f_1 + f_2 + \cdots + f_j = m/2 \pm \epsilon m$,
sampling $O(\epsilon^{-2})$ stream elements and returning the sample median works with good probability.

Beyond basic sampling: There are more powerful forms of sampling, and other techniques that make better use of the limited space.
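A minimal sketch of this approach, assuming we can draw uniform samples from the stream (a one-pass version would use reservoir sampling; the function name and the constant in the sample size are illustrative):

```python
import math
import random

def approx_median(stream, eps, delta=0.01):
    """eps-approximate median via sampling: draw O(eps^-2 * log(1/delta))
    elements uniformly at random (with replacement) and return the
    median of the sample."""
    t = math.ceil(eps ** -2 * math.log(2 / delta))
    sample = sorted(random.choice(stream) for _ in range(t))
    return sample[len(sample) // 2]
```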
AMS Sampling

Problem: Estimate $\sum_i g(f_i)$ for some function g with $g(0) = 0$.

Basic Estimator: Sample $x_J$ where $J \in_R [m]$, compute $r = |\{j \geq J : x_j = x_J\}|$, and output $X = m(g(r) - g(r-1))$.

Expectation:
$E[X] = \sum_i P[x_J = i] \, E[X \mid x_J = i] = \sum_i \frac{f_i}{m} \sum_{r=1}^{f_i} \frac{m(g(r) - g(r-1))}{f_i} = \sum_i g(f_i)$

For high confidence: Compute t estimators in parallel and average.
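The estimator can be sketched directly. For clarity the suffix count r is recomputed from a stored stream; a true one-pass version would choose J online and count occurrences of $x_J$ from position J onward:

```python
import random

def ams_estimate(stream, g, t):
    """Average of t independent AMS estimators for sum_i g(f_i), g(0) = 0.

    Each estimator: pick J uniformly, set r = |{j >= J : x_j = x_J}|,
    and output m * (g(r) - g(r - 1)); this is unbiased, as shown above.
    """
    m = len(stream)
    total = 0.0
    for _ in range(t):
        J = random.randrange(m)
        r = sum(1 for j in range(J, m) if stream[j] == stream[J])
        total += m * (g(r) - g(r - 1))
    return total / t
```

For example, with $g(r) = r^2$ this estimates $F_2$.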
Example: Frequency Moments

Frequency Moments: Define $F_k = \sum_i f_i^k$ for $k \in \{1, 2, 3, \ldots\}$.

Use the AMS estimator with $X = m(r^k - (r-1)^k)$.

Expectation: $E[X] = F_k$

Range: $0 \leq X \leq k m F_\infty^{k-1} \leq k n^{1-1/k} F_k$

Repeat t times and let $\tilde{F}_k$ be the average value. By Chernoff,
$P[|\tilde{F}_k - F_k| \geq \epsilon F_k] \leq 2\exp\left(-\frac{t F_k \epsilon^2}{3 k n^{1-1/k} F_k}\right) = 2\exp\left(-\frac{t \epsilon^2}{3 k n^{1-1/k}}\right)$

If $t = 3\epsilon^{-2} k n^{1-1/k} \log(2\delta^{-1})$ then $P[|\tilde{F}_k - F_k| \geq \epsilon F_k] \leq \delta$.

Thm: In $\tilde{O}(\epsilon^{-2} n^{1-1/k})$ space we can find a $(1 \pm \epsilon)$ approximation for $F_k$ with probability at least $1 - \delta$.
Random Projections

Many stream algorithms use a random projection $Z \in \mathbb{R}^{w \times n}$ with $w \ll n$: the sketch is $s = Zf \in \mathbb{R}^w$.

Updatable: We can maintain the sketch s in $\tilde{O}(w)$ space, since incrementing $f_i$ corresponds to adding column i of Z to s:
$s \leftarrow s + (z_{1,i}, \ldots, z_{w,i})^T$

Useful: Choose a distribution for the $z_{i,j}$ such that the relevant function of f can be estimated from s with high probability for sufficiently large w.
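The update rule is just "add column i of Z to s". A generic sketch maintainer might look like the following (illustrative, with an explicit dense Z, which is only sensible for small n; real sketches store or regenerate the rows of Z compactly):

```python
import numpy as np

class LinearSketch:
    """Maintains s = Z f under stream updates, using O(w) words of state
    for the sketch itself."""
    def __init__(self, Z):
        self.Z = Z
        self.s = np.zeros(Z.shape[0])

    def update(self, i, c=1):
        # incrementing f_i by c adds c times column i of Z to the sketch
        self.s += c * self.Z[:, i]
```

Linearity is the point: sketches of two streams can be added, and a deletion is just an update with c = -1.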
Examples

- If $z_{i,j} \in_R \{-1, +1\}$, we can estimate $F_2$ with $w = O(\epsilon^{-2} \log \delta^{-1})$.
- If $z_{i,j} \sim D$ where D is p-stable for $p \in (0, 2]$, we can estimate $F_p$ with $w = O(\epsilon^{-2} \log \delta^{-1})$. For example, the 1-stable and 2-stable distributions are
  $\mathrm{Cauchy}(x) = \frac{1}{\pi(1 + x^2)}$ and $\mathrm{Gaussian}(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$.
- Note that $F_0 = (1 \pm \epsilon) F_p$ if $p = \log(1+\epsilon)/\log m$.

For the rest of the lecture we focus on hash-based sketches: given a random hash function $h : [n] \to [w]$, the only non-zero entry in column i of Z is $z_{h(i),i}$.
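The first example, the AMS "tug-of-war" sketch, can be sketched as follows. For simplicity the signs here are fully random rather than 4-wise independent, and we average $s_k^2$ over the rows instead of taking a median of means, so this illustrates the estimator rather than the tight space bound:

```python
import numpy as np

def estimate_f2(stream, n, w, seed=0):
    """Estimate F_2 via a random sign projection Z in {-1,+1}^{w x n}.

    Each sketch coordinate s_k = sum_i z_{k,i} f_i satisfies
    E[s_k^2] = F_2, so the mean of s_k^2 over the w rows
    concentrates around F_2.
    """
    rng = np.random.default_rng(seed)
    Z = rng.choice([-1, 1], size=(w, n + 1))  # columns indexed 1..n
    s = np.zeros(w)
    for x in stream:
        s += Z[:, x]
    return float(np.mean(s ** 2))
```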
Count-Min Sketch

Maintain a vector $s \in \mathbb{N}^w$ via a random hash function $h : [n] \to [w]$.

Update: For each increment of $f_i$, increment $s_{h(i)}$. Hence
$s_k = \sum_{j : h(j) = k} f_j$, e.g., $s_3 = f_6 + f_7 + f_{13}$.

Query: Use $\hat{f}_i = s_{h(i)}$ to estimate $f_i$.

Lemma: $\hat{f}_i \geq f_i$ and $P[\hat{f}_i \geq f_i + 2m/w] \leq 1/2$.

Thm: Let $w = 2/\epsilon$. Repeat the hashing $\lg(\delta^{-1})$ times in parallel and take the minimum estimate $\hat{f}_i$. Then
$P[f_i \leq \hat{f}_i \leq f_i + \epsilon m] \geq 1 - \delta$
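A compact implementation along these lines, with d independent rows and the minimum taken over the row estimates (the salted use of Python's built-in hash stands in for truly random hash functions):

```python
import random

class CountMin:
    """Count-Min sketch: w = ceil(2/eps) columns, d = ceil(lg(1/delta)) rows.
    Guarantees f_i <= query(i), and query(i) <= f_i + eps*m
    with probability >= 1 - delta."""
    def __init__(self, w, d, seed=0):
        rnd = random.Random(seed)
        self.w, self.d = w, d
        self.salts = [rnd.random() for _ in range(d)]
        self.table = [[0] * w for _ in range(d)]

    def _h(self, row, i):
        return hash((self.salts[row], i)) % self.w

    def update(self, i, c=1):
        for row in range(self.d):
            self.table[row][self._h(row, i)] += c

    def query(self, i):
        # every row over-counts, so the minimum is the best estimate
        return min(self.table[row][self._h(row, i)] for row in range(self.d))
```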
Proof of Lemma

Define E by $\hat{f}_i = f_i + E$, so $E = \sum_{j \neq i : h(j) = h(i)} f_j$.

Since all $f_j \geq 0$, we have $E \geq 0$.

Since $P[h(i) = h(j)] = 1/w$,
$E[E] = \sum_{j \neq i} f_j \, P[h(i) = h(j)] \leq m/w$

By an application of Markov's inequality, $P[E \geq 2m/w] \leq 1/2$.
Range Queries

Range Query: For $i, j \in [n]$, estimate $f_{[i,j]} = f_i + f_{i+1} + \cdots + f_j$.

Dyadic Intervals: Restrict attention to intervals of the form $[1 + (i-1)2^j, \; i2^j]$ where $j \in \{0, 1, \ldots, \lg n\}$ and $i \in \{1, 2, \ldots, n/2^j\}$, since any range can be partitioned into $O(\log n)$ such intervals. E.g.,
$[48, 106] = [48, 48] \cup [49, 64] \cup [65, 96] \cup [97, 104] \cup [105, 106]$

To support dyadic intervals, construct Count-Min sketches corresponding to intervals of width 1, 2, 4, 8, ... E.g., for intervals of width 2, we sketch the vector g with $g_i = f_{2i-1} + f_{2i}$: the update rule is now, for an increment of $f_{2i-1}$ or $f_{2i}$, increment $s_{h(i)}$.
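The decomposition of an arbitrary range into dyadic intervals can be computed greedily, always taking the largest aligned interval that starts at the current left endpoint (the function name is illustrative):

```python
def dyadic_decompose(lo, hi):
    """Partition [lo, hi] (inclusive, 1-based) into O(log n) dyadic
    intervals of the form [1 + (i-1)*2**j, i*2**j], greedily taking the
    largest aligned interval starting at lo."""
    out = []
    while lo <= hi:
        j = 0
        # grow while the interval stays aligned and inside [lo, hi]
        while (lo - 1) % (2 ** (j + 1)) == 0 and lo + 2 ** (j + 1) - 1 <= hi:
            j += 1
        out.append((lo, lo + 2 ** j - 1))
        lo += 2 ** j
    return out
```

This reproduces the example above: [48, 106] splits into [48,48], [49,64], [65,96], [97,104], [105,106]. A range query then sums the Count-Min estimates at the corresponding levels.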
Quantiles and Heavy Hitters

Quantiles: Find j such that $f_1 + \cdots + f_j \approx m/2$. We can approximate the median via binary search over range queries.

Heavy Hitter Problem: Find a set $S \subseteq [n]$ where
$\{i : f_i \geq \phi m\} \subseteq S \subseteq \{i : f_i \geq (\phi - \epsilon) m\}$

Rather than checking each $f_i$ individually, we can save time by exploiting the fact that if $f_{[i,k]} < \phi m$ then $f_j < \phi m$ for all $j \in [i, k]$.
Count-Sketch: Count-Min with a Twist

Maintain $s \in \mathbb{Z}^w$ via hash functions $h : [n] \to [w]$ and $r : [n] \to \{-1, +1\}$.

Update: For each increment of $f_i$, $s_{h(i)} \leftarrow s_{h(i)} + r_i$. Hence
$s_k = \sum_{j : h(j) = k} f_j r_j$, e.g., $s_3 = f_6 - f_7 - f_{13}$.

Query: Use $\hat{f}_i = s_{h(i)} r_i$ to estimate $f_i$.

Lemma: $E[\hat{f}_i] = f_i$ and $V[\hat{f}_i] \leq F_2/w$.

Thm: Let $w = O(1/\epsilon^2)$. Repeating $O(\lg \delta^{-1})$ times in parallel and taking the median estimate ensures
$P[f_i - \epsilon\sqrt{F_2} \leq \hat{f}_i \leq f_i + \epsilon\sqrt{F_2}] \geq 1 - \delta$
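The same skeleton as Count-Min, with signs added and the median taken over rows (again, salted built-in hashing stands in for independent random hash functions):

```python
import random
import statistics

class CountSketch:
    """Count-Sketch: each row has h: [n] -> [w] and signs r: [n] -> {-1,+1}.
    The row estimate s[h(i)] * r(i) is unbiased with variance <= F_2 / w;
    the final estimate is the median over the d rows."""
    def __init__(self, w, d, seed=0):
        rnd = random.Random(seed)
        self.w, self.d = w, d
        self.salts = [rnd.random() for _ in range(d)]
        self.table = [[0] * w for _ in range(d)]

    def _h(self, row, i):
        return hash((self.salts[row], 0, i)) % self.w

    def _r(self, row, i):
        return 1 if hash((self.salts[row], 1, i)) % 2 == 0 else -1

    def update(self, i, c=1):
        for row in range(self.d):
            self.table[row][self._h(row, i)] += c * self._r(row, i)

    def query(self, i):
        return statistics.median(
            self.table[row][self._h(row, i)] * self._r(row, i)
            for row in range(self.d))
```

Unlike Count-Min, the estimate can err in either direction, but the error scales with $\sqrt{F_2}$ rather than $F_1 = m$.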
Proof of Lemma

Define E by $\hat{f}_i = f_i + E r_i$, so $E = \sum_{j \neq i : h(i) = h(j)} f_j r_j$.

Expectation: Since $E[r_j] = 0$,
$E[E] = \sum_{j \neq i : h(i) = h(j)} f_j \, E[r_j] = 0$

Variance: Similarly,
$V[E] \leq E\Big[\Big(\sum_{j \neq i : h(i)=h(j)} f_j r_j\Big)^2\Big] = \sum_{j, k \neq i} f_j f_k \, E[r_j r_k] \, P[h(i) = h(j) = h(k)] = \sum_{j \neq i} f_j^2 \, P[h(i) = h(j)] \leq F_2/w$
$\ell_p$ Sampling

$\ell_p$ Sampling: Return random values $I \in [n]$ and $R \in \mathbb{R}$ where
$P[I = i] = (1 \pm \epsilon) \frac{|f_i|^p}{F_p}$ and $R = (1 \pm \epsilon) f_I$

Applications:
- We will use $\ell_2$ sampling to get an optimal algorithm for $F_k$, $k > 2$.
- We will use $\ell_0$ sampling for processing graph streams.
- Many other stream problems can be solved via $\ell_p$ sampling, e.g., duplicate finding, triangle counting, entropy estimation.

Let's see an algorithm for p = 2.
$\ell_2$ Sampling Algorithm

Weight $f_i$ by $\gamma_i = 1/\sqrt{u_i}$ where $u_i \in_R [0, 1]$ to form the vector g:
$f = (f_1, f_2, \ldots, f_n) \;\to\; g = (g_1, g_2, \ldots, g_n)$ where $g_i = \gamma_i f_i$

Return $(i, f_i)$ if $g_i^2 \geq t := F_2(f)/\epsilon$.

Probability $(i, f_i)$ is returned:
$P[g_i^2 \geq t] = P[u_i \leq f_i^2/t] = f_i^2/t$

The probability some value is returned is $\sum_i f_i^2/t = \epsilon$, so repeating $O(\epsilon^{-1} \log \delta^{-1})$ times ensures a value is returned with probability $\geq 1 - \delta$.

Lemma: Using a Count-Sketch of size $O(\epsilon^{-1} \log^2 n)$ ensures a $(1 \pm \epsilon)$ approximation of any $g_i$ that passes the threshold.
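The rejection step can be illustrated with exact frequencies standing in for the Count-Sketch estimates, so this shows the sampling distribution rather than the small-space implementation (names are illustrative):

```python
import random

def l2_sample_round(f, eps, rnd):
    """One round of l_2 sampling: weight g_i = f_i / sqrt(u_i) with
    u_i uniform in (0, 1], and return (i, f_i) if g_i^2 >= t = F_2 / eps.
    Each round succeeds with probability ~ eps; with high probability
    at most one index passes the threshold."""
    F2 = sum(fi * fi for fi in f)
    t = F2 / eps
    for i, fi in enumerate(f):
        u = max(rnd.random(), 1e-12)   # avoid u = 0
        if fi * fi / u >= t:           # equivalently u <= f_i^2 / t
            return i, fi
    return None
```

Since $P[u_i \leq f_i^2/t] = f_i^2/t$, index i is returned with probability proportional to $f_i^2$, matching the $\ell_2$ distribution.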
Proof of Lemma

Exercise: $P[F_2(g)/F_2(f) \leq c \log n] \geq 99/100$ for some large $c > 0$, so we condition on this event.

Set $w = 9c\epsilon^{-1} \log n$. A Count-Sketch in $O(w \log^2 n)$ space ensures
$\hat{g}_i = g_i \pm \sqrt{F_2(g)/w}$

Then $g_i^2 \geq F_2(f)/\epsilon$ implies
$\sqrt{F_2(g)/w} \leq \sqrt{F_2(f)/(9\epsilon^{-1})} \leq \sqrt{\epsilon g_i^2/(9\epsilon^{-1})} = \epsilon g_i/3$

and hence $\hat{g}_i^2 = (1 \pm \epsilon/3)^2 g_i^2 = (1 \pm \epsilon) g_i^2$ as required.

Under the rug: We need to ensure that the conditioning doesn't affect the sampling probability too much.
$F_k$ Revisited

Earlier we used $\tilde{O}(n^{1-1/k})$ space to approximate $F_k = \sum_i f_i^k$.

Algorithm: Let $(I, R)$ be a $(1 \pm \gamma)$-approximate $\ell_2$ sample. Return $T = \tilde{F}_2 R^{k-2}$ where $\tilde{F}_2$ is a $(1 \pm \gamma)$ approximation of $F_2$.

Expectation: Setting $\gamma = \epsilon/(4k)$,
$E[T] = \tilde{F}_2 \sum_i P[I = i] \, ((1 \pm \gamma) f_i)^{k-2} = (1 \pm \gamma)^{k-2} \, \tilde{F}_2 \sum_i \frac{f_i^2}{F_2} f_i^{k-2} = (1 \pm \tfrac{\epsilon}{2}) F_k$

Range: $0 \leq T \leq (1 + \gamma) F_2 F_\infty^{k-2} \leq (1 + \gamma) n^{1-2/k} F_k$

Averaging over $t = O(\epsilon^{-2} n^{1-2/k} \log \delta^{-1})$ parallel repetitions gives
$P[|\tilde{F}_k - F_k| \geq \epsilon F_k] \leq \delta$

Thm: In $\tilde{O}(\epsilon^{-2} n^{1-2/k})$ space we can find a $(1 \pm \epsilon)$ approximation for $F_k$ with probability at least $1 - \delta$.
Summary

- Basic Sampling: Can select i with probability proportional to $f_i$, but we can be much smarter via sketches.
- Count-Min: $f_i \leq \hat{f}_i \leq f_i + \epsilon F_1$ in $O(\epsilon^{-1})$ space.
- Count-Sketch: $f_i - \epsilon\sqrt{F_2} \leq \hat{f}_i \leq f_i + \epsilon\sqrt{F_2}$ in $O(\epsilon^{-2})$ space.
- The above sketches also solve range queries, quantiles, heavy hitters, ...
- $\ell_p$ Sampling: Select i with probability proportional to $|f_i|^p$ in $O(\epsilon^{-2})$ space.
More informationCommunication-Efficient Computation on Distributed Noisy Datasets. Qin Zhang Indiana University Bloomington
Communication-Efficient Computation on Distributed Noisy Datasets Qin Zhang Indiana University Bloomington SPAA 15 June 15, 2015 1-1 Model of computation The coordinator model: k sites and 1 coordinator.
More informationDistributed Summaries
Distributed Summaries Graham Cormode graham@research.att.com Pankaj Agarwal(Duke) Zengfeng Huang (HKUST) Jeff Philips (Utah) Zheiwei Wei (HKUST) Ke Yi (HKUST) Summaries Summaries allow approximate computations:
More informationNotes on MapReduce Algorithms
Notes on MapReduce Algorithms Barna Saha 1 Finding Minimum Spanning Tree of a Dense Graph in MapReduce We are given a graph G = (V, E) on V = N vertices and E = m N 1+c edges for some constant c > 0. Our
More informationCSC2420 Fall 2012: Algorithm Design, Analysis and Theory Lecture 9
CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Lecture 9 Allan Borodin March 13, 2016 1 / 33 Lecture 9 Announcements 1 I have the assignments graded by Lalla. 2 I have now posted five questions
More informationStream Computation and Arthur- Merlin Communication
Stream Computation and Arthur- Merlin Communication Justin Thaler, Yahoo! Labs Joint Work with: Amit Chakrabarti, Dartmouth Graham Cormode, University of Warwick Andrew McGregor, Umass Amherst Suresh Venkatasubramanian,
More informationLecture 10. Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1
Lecture 10 Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1 Recap Sublinear time algorithms Deterministic + exact: binary search Deterministic + inexact: estimating diameter in a
More informationCMSC 858F: Algorithmic Lower Bounds: Fun with Hardness Proofs Fall 2014 Introduction to Streaming Algorithms
CMSC 858F: Algorithmic Lower Bounds: Fun with Hardness Proofs Fall 2014 Introduction to Streaming Algorithms Instructor: Mohammad T. Hajiaghayi Scribe: Huijing Gong November 11, 2014 1 Overview In the
More informationTopics in Probabilistic Combinatorics and Algorithms Winter, Basic Derandomization Techniques
Topics in Probabilistic Combinatorics and Algorithms Winter, 016 3. Basic Derandomization Techniques Definition. DTIME(t(n)) : {L : L can be decided deterministically in time O(t(n)).} EXP = { L: L can
More informationFrequency Estimators
Frequency Estimators Outline for Today Randomized Data Structures Our next approach to improving performance. Count-Min Sketches A simple and powerful data structure for estimating frequencies. Count Sketches
More informationThe space complexity of approximating the frequency moments
The space complexity of approximating the frequency moments Felix Biermeier November 24, 2015 1 Overview Introduction Approximations of frequency moments lower bounds 2 Frequency moments Problem Estimate
More informationCS 591, Lecture 9 Data Analytics: Theory and Applications Boston University
CS 591, Lecture 9 Data Analytics: Theory and Applications Boston University Babis Tsourakakis February 22nd, 2017 Announcement We will cover the Monday s 2/20 lecture (President s day) by appending half
More informationLecture Lecture 9 October 1, 2015
CS 229r: Algorithms for Big Data Fall 2015 Lecture Lecture 9 October 1, 2015 Prof. Jelani Nelson Scribe: Rachit Singh 1 Overview In the last lecture we covered the distance to monotonicity (DTM) and longest
More informationLecture 4: Hashing and Streaming Algorithms
CSE 521: Design and Analysis of Algorithms I Winter 2017 Lecture 4: Hashing and Streaming Algorithms Lecturer: Shayan Oveis Gharan 01/18/2017 Scribe: Yuqing Ai Disclaimer: These notes have not been subjected
More informationSublinear Algorithms for Big Data
Sublinear Algorithms for Big Data Qin Zhang 1-1 2-1 Part 2: Sublinear in Communication Sublinear in communication The model x 1 = 010011 x 2 = 111011 x 3 = 111111 x k = 100011 Applicaitons They want to
More informationLecture 2. Frequency problems
1 / 43 Lecture 2. Frequency problems Ricard Gavaldà MIRI Seminar on Data Streams, Spring 2015 Contents 2 / 43 1 Frequency problems in data streams 2 Approximating inner product 3 Computing frequency moments
More informationLecture 1: Introduction to Sublinear Algorithms
CSE 522: Sublinear (and Streaming) Algorithms Spring 2014 Lecture 1: Introduction to Sublinear Algorithms March 31, 2014 Lecturer: Paul Beame Scribe: Paul Beame Too much data, too little time, space for
More informationSparse Fourier Transform (lecture 1)
1 / 73 Sparse Fourier Transform (lecture 1) Michael Kapralov 1 1 IBM Watson EPFL St. Petersburg CS Club November 2015 2 / 73 Given x C n, compute the Discrete Fourier Transform (DFT) of x: x i = 1 x n
More informationLeveraging Big Data: Lecture 13
Leveraging Big Data: Lecture 13 http://www.cohenwang.com/edith/bigdataclass2013 Instructors: Edith Cohen Amos Fiat Haim Kaplan Tova Milo What are Linear Sketches? Linear Transformations of the input vector
More informationLecture Lecture 25 November 25, 2014
CS 224: Advanced Algorithms Fall 2014 Lecture Lecture 25 November 25, 2014 Prof. Jelani Nelson Scribe: Keno Fischer 1 Today Finish faster exponential time algorithms (Inclusion-Exclusion/Zeta Transform,
More informationLecture 2 Sept. 8, 2015
CS 9r: Algorithms for Big Data Fall 5 Prof. Jelani Nelson Lecture Sept. 8, 5 Scribe: Jeffrey Ling Probability Recap Chebyshev: P ( X EX > λ) < V ar[x] λ Chernoff: For X,..., X n independent in [, ],
More informationEstimating Frequencies and Finding Heavy Hitters Jonas Nicolai Hovmand, Morten Houmøller Nygaard,
Estimating Frequencies and Finding Heavy Hitters Jonas Nicolai Hovmand, 2011 3884 Morten Houmøller Nygaard, 2011 4582 Master s Thesis, Computer Science June 2016 Main Supervisor: Gerth Stølting Brodal
More informationLecture 13 October 6, Covering Numbers and Maurey s Empirical Method
CS 395T: Sublinear Algorithms Fall 2016 Prof. Eric Price Lecture 13 October 6, 2016 Scribe: Kiyeon Jeon and Loc Hoang 1 Overview In the last lecture we covered the lower bound for p th moment (p > 2) and
More informationComplexity Theory of Polynomial-Time Problems
Complexity Theory of Polynomial-Time Problems Lecture 5: Subcubic Equivalences Karl Bringmann Reminder: Relations = Reductions transfer hardness of one problem to another one by reductions problem P instance
More informationLecture 2: Streaming Algorithms
CS369G: Algorithmic Techniques for Big Data Spring 2015-2016 Lecture 2: Streaming Algorithms Prof. Moses Chariar Scribes: Stephen Mussmann 1 Overview In this lecture, we first derive a concentration inequality
More informationMining Data Streams. The Stream Model. The Stream Model Sliding Windows Counting 1 s
Mining Data Streams The Stream Model Sliding Windows Counting 1 s 1 The Stream Model Data enters at a rapid rate from one or more input ports. The system cannot store the entire stream. How do you make
More informationSparse analysis Lecture VII: Combining geometry and combinatorics, sparse matrices for sparse signal recovery
Sparse analysis Lecture VII: Combining geometry and combinatorics, sparse matrices for sparse signal recovery Anna C. Gilbert Department of Mathematics University of Michigan Sparse signal recovery measurements:
More informationLecture 3 Sept. 4, 2014
CS 395T: Sublinear Algorithms Fall 2014 Prof. Eric Price Lecture 3 Sept. 4, 2014 Scribe: Zhao Song In today s lecture, we will discuss the following problems: 1. Distinct elements 2. Turnstile model 3.
More informationWavelet decomposition of data streams. by Dragana Veljkovic
Wavelet decomposition of data streams by Dragana Veljkovic Motivation Continuous data streams arise naturally in: telecommunication and internet traffic retail and banking transactions web server log records
More informationMultiple Choice Tries and Distributed Hash Tables
Multiple Choice Tries and Distributed Hash Tables Luc Devroye and Gabor Lugosi and Gahyun Park and W. Szpankowski January 3, 2007 McGill University, Montreal, Canada U. Pompeu Fabra, Barcelona, Spain U.
More informationA Pass-Efficient Algorithm for Clustering Census Data
A Pass-Efficient Algorithm for Clustering Census Data Kevin Chang Yale University Ravi Kannan Yale University Abstract We present a number of streaming algorithms for a basic clustering problem for massive
More informationImproved Concentration Bounds for Count-Sketch
Improved Concentration Bounds for Count-Sketch Gregory T. Minton 1 Eric Price 2 1 MIT MSR New England 2 MIT IBM Almaden UT Austin 2014-01-06 Gregory T. Minton, Eric Price (IBM) Improved Concentration Bounds
More informationCombining geometry and combinatorics
Combining geometry and combinatorics A unified approach to sparse signal recovery Anna C. Gilbert University of Michigan joint work with R. Berinde (MIT), P. Indyk (MIT), H. Karloff (AT&T), M. Strauss
More informationEstimating Dominance Norms of Multiple Data Streams Graham Cormode Joint work with S. Muthukrishnan
Estimating Dominance Norms of Multiple Data Streams Graham Cormode graham@dimacs.rutgers.edu Joint work with S. Muthukrishnan Data Stream Phenomenon Data is being produced faster than our ability to process
More information11 Heavy Hitters Streaming Majority
11 Heavy Hitters A core mining problem is to find items that occur more than one would expect. These may be called outliers, anomalies, or other terms. Statistical models can be layered on top of or underneath
More informationTitle: Count-Min Sketch Name: Graham Cormode 1 Affil./Addr. Department of Computer Science, University of Warwick,
Title: Count-Min Sketch Name: Graham Cormode 1 Affil./Addr. Department of Computer Science, University of Warwick, Coventry, UK Keywords: streaming algorithms; frequent items; approximate counting, sketch
More information1 Estimating Frequency Moments in Streams
CS 598CSC: Algorithms for Big Data Lecture date: August 28, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri 1 Estimating Frequency Moments in Streams A significant fraction of streaming literature
More informationSketching Probabilistic Data Streams
Sketching Probabilistic Data Streams Graham Cormode AT&T Labs Research 18 Park Avenue, Florham Park NJ graham@research.att.com Minos Garofalakis Yahoo! Research and UC Berkeley 2821 Mission College Blvd,
More information4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S
Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit
More informationTrusting the Cloud with Practical Interactive Proofs
Trusting the Cloud with Practical Interactive Proofs Graham Cormode G.Cormode@warwick.ac.uk Amit Chakrabarti (Dartmouth) Andrew McGregor (U Mass Amherst) Justin Thaler (Harvard/Yahoo!) Suresh Venkatasubramanian
More informationFinding Frequent Items in Probabilistic Data
Finding Frequent Items in Probabilistic Data Qin Zhang, Hong Kong University of Science & Technology Feifei Li, Florida State University Ke Yi, Hong Kong University of Science & Technology SIGMOD 2008
More informationImproved Algorithms for Distributed Entropy Monitoring
Improved Algorithms for Distributed Entropy Monitoring Jiecao Chen Indiana University Bloomington jiecchen@umail.iu.edu Qin Zhang Indiana University Bloomington qzhangcs@indiana.edu Abstract Modern data
More informationProblem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15)
Problem 1: Chernoff Bounds via Negative Dependence - from MU Ex 5.15) While deriving lower bounds on the load of the maximum loaded bin when n balls are thrown in n bins, we saw the use of negative dependence.
More informationLecture 3. 1 Polynomial-time algorithms for the maximum flow problem
ORIE 633 Network Flows August 30, 2007 Lecturer: David P. Williamson Lecture 3 Scribe: Gema Plaza-Martínez 1 Polynomial-time algorithms for the maximum flow problem 1.1 Introduction Let s turn now to considering
More informationArthur-Merlin Streaming Complexity
Weizmann Institute of Science Joint work with Ran Raz Data Streams The data stream model is an abstraction commonly used for algorithms that process network traffic using sublinear space. A data stream
More informationTrace Reconstruction Revisited
Trace Reconstruction Revisited Andrew McGregor 1, Eric Price 2, and Sofya Vorotnikova 1 1 University of Massachusetts Amherst {mcgregor,svorotni}@cs.umass.edu 2 IBM Almaden Research Center ecprice@mit.edu
More information? 11.5 Perfect hashing. Exercises
11.5 Perfect hashing 77 Exercises 11.4-1 Consider inserting the keys 10; ; 31; 4; 15; 8; 17; 88; 59 into a hash table of length m 11 using open addressing with the auxiliary hash function h 0.k/ k. Illustrate
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/26/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 More algorithms
More informationSublinear-Time Algorithms
Lecture 20 Sublinear-Time Algorithms Supplemental reading in CLRS: None If we settle for approximations, we can sometimes get much more efficient algorithms In Lecture 8, we saw polynomial-time approximations
More informationVery Sparse Random Projections
Very Sparse Random Projections Ping Li, Trevor Hastie and Kenneth Church [KDD 06] Presented by: Aditya Menon UCSD March 4, 2009 Presented by: Aditya Menon (UCSD) Very Sparse Random Projections March 4,
More informationStreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory
StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory S.V. N. (vishy) Vishwanathan Purdue University and Microsoft vishy@purdue.edu October 9, 2012 S.V. N. Vishwanathan (Purdue,
More informationRandomized Algorithms
Randomized Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours
More informationProbability Revealing Samples
Krzysztof Onak IBM Research Xiaorui Sun Microsoft Research Abstract In the most popular distribution testing and parameter estimation model, one can obtain information about an underlying distribution
More informationSparser Johnson-Lindenstrauss Transforms
Sparser Johnson-Lindenstrauss Transforms Jelani Nelson Princeton February 16, 212 joint work with Daniel Kane (Stanford) Random Projections x R d, d huge store y = Sx, where S is a k d matrix (compression)
More informationMining Data Streams. The Stream Model Sliding Windows Counting 1 s
Mining Data Streams The Stream Model Sliding Windows Counting 1 s 1 The Stream Model Data enters at a rapid rate from one or more input ports. The system cannot store the entire stream. How do you make
More informationLower Bounds for Quantile Estimation in Random-Order and Multi-Pass Streaming
Lower Bounds for Quantile Estimation in Random-Order and Multi-Pass Streaming Sudipto Guha 1 and Andrew McGregor 2 1 University of Pennsylvania sudipto@cis.upenn.edu 2 University of California, San Diego
More informationCS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing)
CS5314 Randomized Algorithms Lecture 15: Balls, Bins, Random Graphs (Hashing) 1 Objectives Study various hashing schemes Apply balls-and-bins model to analyze their performances 2 Chain Hashing Suppose
More information6 Filtering and Streaming
Casus ubique valet; semper tibi pendeat hamus: Quo minime credas gurgite, piscis erit. [Luck affects everything. Let your hook always be cast. Where you least expect it, there will be a fish.] Publius
More informationCS 229r Notes. Sam Elder. November 26, 2013
CS 9r Notes Sam Elder November 6, 013 Tuesday, September 3, 013 Standard information: Instructor: Jelani Nelson Grades (see course website for details) will be based on scribing (10%), psets (40%) 1, and
More informationAd Placement Strategies
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January
More informationVerifying Computations in the Cloud (and Elsewhere) Michael Mitzenmacher, Harvard University Work offloaded to Justin Thaler, Harvard University
Verifying Computations in the Cloud (and Elsewhere) Michael Mitzenmacher, Harvard University Work offloaded to Justin Thaler, Harvard University Goals of Verifiable Computation Provide user with correctness
More informationTesting random variables for independence and identity
Testing random variables for independence and identity Tuğkan Batu Eldar Fischer Lance Fortnow Ravi Kumar Ronitt Rubinfeld Patrick White January 10, 2003 Abstract Given access to independent samples of
More informationCOMS E F15. Lecture 22: Linearity Testing Sparse Fourier Transform
COMS E6998-9 F15 Lecture 22: Linearity Testing Sparse Fourier Transform 1 Administrivia, Plan Thu: no class. Happy Thanksgiving! Tue, Dec 1 st : Sergei Vassilvitskii (Google Research) on MapReduce model
More informationLecture 2 September 4, 2014
CS 224: Advanced Algorithms Fall 2014 Prof. Jelani Nelson Lecture 2 September 4, 2014 Scribe: David Liu 1 Overview In the last lecture we introduced the word RAM model and covered veb trees to solve the
More informationEfficient Sketches for Earth-Mover Distance, with Applications
Efficient Sketches for Earth-Mover Distance, with Applications Alexandr Andoni MIT andoni@mit.edu Khanh Do Ba MIT doba@mit.edu Piotr Indyk MIT indyk@mit.edu David Woodruff IBM Almaden dpwoodru@us.ibm.com
More informationCMPT 710/407 - Complexity Theory Lecture 4: Complexity Classes, Completeness, Linear Speedup, and Hierarchy Theorems
CMPT 710/407 - Complexity Theory Lecture 4: Complexity Classes, Completeness, Linear Speedup, and Hierarchy Theorems Valentine Kabanets September 13, 2007 1 Complexity Classes Unless explicitly stated,
More informationTopics in Approximation Algorithms Solution for Homework 3
Topics in Approximation Algorithms Solution for Homework 3 Problem 1 We show that any solution {U t } can be modified to satisfy U τ L τ as follows. Suppose U τ L τ, so there is a vertex v U τ but v L
More informationData Stream Algorithms. S. Muthu Muthukrishnan
Data Stream Algorithms Notes from a series of lectures by S. Muthu Muthukrishnan Guest Lecturer: Andrew McGregor The 2009 Barbados Workshop on Computational Complexity March 1st March 8th, 2009 Organizer:
More informationApproximate counting: count-min data structure. Problem definition
Approximate counting: count-min data structure G. Cormode and S. Muthukrishhan: An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55 (2005) 58-75. Problem
More informationAn Optimized Interestingness Hotspot Discovery Framework for Large Gridded Spatio-temporal Datasets
IEEE Big Data 2015 Big Data in Geosciences Workshop An Optimized Interestingness Hotspot Discovery Framework for Large Gridded Spatio-temporal Datasets Fatih Akdag and Christoph F. Eick Department of Computer
More informationRandomness and Computation March 13, Lecture 3
0368.4163 Randomness and Computation March 13, 2009 Lecture 3 Lecturer: Ronitt Rubinfeld Scribe: Roza Pogalnikova and Yaron Orenstein Announcements Homework 1 is released, due 25/03. Lecture Plan 1. Do
More informationSuccinct Approximate Counting of Skewed Data
Succinct Approximate Counting of Skewed Data David Talbot Google Inc., Mountain View, CA, USA talbot@google.com Abstract Practical data analysis relies on the ability to count observations of objects succinctly
More informationStreaming Graph Computations with a Helpful Advisor. Justin Thaler Graham Cormode and Michael Mitzenmacher
Streaming Graph Computations with a Helpful Advisor Justin Thaler Graham Cormode and Michael Mitzenmacher Thanks to Andrew McGregor A few slides borrowed from IITK Workshop on Algorithms for Processing
More informationCS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS341 info session is on Thu 3/1 5pm in Gates415 CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/28/18 Jure Leskovec, Stanford CS246: Mining Massive Datasets,
More informationLecture 16. Error-free variable length schemes (contd.): Shannon-Fano-Elias code, Huffman code
Lecture 16 Agenda for the lecture Error-free variable length schemes (contd.): Shannon-Fano-Elias code, Huffman code Variable-length source codes with error 16.1 Error-free coding schemes 16.1.1 The Shannon-Fano-Elias
More informationProcessing Aggregate Queries over Continuous Data Streams
Processing Aggregate Queries over Continuous Data Streams Alin Dobra Computer Science Department Cornell University April 15, 2003 Relational Database Systems did dname 15 Legal 17 Marketing 3 Development
More information