Sketching and Streaming for Distributions


Piotr Indyk (Massachusetts Institute of Technology)
Andrew McGregor (University of California, San Diego)

Main Material:
Stable distributions, pseudo-random generators, embeddings, and data stream computation. Piotr Indyk (FOCS 2000)
Sketching information divergences. Sudipto Guha, Piotr Indyk, Andrew McGregor (COLT 2007)
Declaring independence via the sketching of sketches. Piotr Indyk, Andrew McGregor (SODA 2008)

The Problem

A list of m red values and m green values in [n]:
3,5,3,7,5,4,8,5,3,7,5,4,8,6,3,2,6,4,7,3,4,...
Define distributions p = (p1,..., pn) and q = (q1,..., qn) from the empirical frequencies of the red and green values.
How different are p and q?

Variational: Σ_i |p_i − q_i|
Euclidean: (Σ_i (p_i − q_i)²)^{1/2}
Kullback-Leibler: Σ_i p_i log(p_i/q_i)
Hellinger: Σ_i (√p_i − √q_i)²

More generally, f-divergences and Bregman divergences:
D_f(p, q) = Σ_i p_i f(q_i/p_i)
B_F(p, q) = Σ_i [F(p_i) − F(q_i) − (p_i − q_i) F′(q_i)]
where f and F are convex and f(1) = 0.

The Catch...

What if m and n are huge and you can't store the list?
Applications: monitoring internet traffic, I/O-efficient external memory, processing huge log files, database query planning, sensor networks, ...

Data Stream Model:
No control over the order of the stream.
Limited working memory, e.g., polylog(n, m) space.
Limited time to process each element.

Previous work: quantiles, frequency moments, histograms, clustering, entropy, graph problems, ... See, e.g., Muthukrishnan, "Data Streams: Algorithms and Applications".

Today's Talk

Sketching Lp distances (0 < p ≤ 2): a (1+ε)-approximation with probability 1−δ in Õ(ε⁻² ln δ⁻¹) space, via stable distributions and pseudo-random generators. [Indyk, FOCS 2000]

Impossibility of extending to other divergences: can we sketch other divergences such as Hellinger? Lower bounds via communication complexity. [Guha, Indyk, McGregor, COLT 2007]

Using sketches to test independence: testing independence between data streams. [Indyk, McGregor, SODA 2008]

1. Sketching Lp distances: p-stable distributions, pseudo-random generators
2. The Unsketchables: information divergences, communication complexity
3. Sketching Sketches: identifying correlations in data streams

Stable Distributions

A p-stable distribution μ has the following property: if X, Y, Z ~ μ (independent) and a, b ∈ R, then
aX + bY ~ (|a|^p + |b|^p)^{1/p} Z.

Examples:
Normal(0,1) is 2-stable, with density (1/√(2π)) e^{−x²/2}.
Cauchy is 1-stable, with density 1/(π(1 + x²)).
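The stability property above is easy to check numerically. The following sketch (not part of the talk; NumPy is assumed) samples from the standard Cauchy distribution and compares aX + bY against (|a| + |b|)Z, using the median of absolute values as a scale statistic since the Cauchy has no moments:

```python
import numpy as np

# Numerical check of 1-stability for the Cauchy distribution:
# if X, Y ~ Cauchy (independent) and a, b are scalars, then aX + bY has
# the same distribution as (|a| + |b|) Z for Z ~ Cauchy.
rng = np.random.default_rng(0)
a, b, N = 2.0, -3.0, 1_000_000

X, Y, Z = rng.standard_cauchy((3, N))
combo = a * X + b * Y                # aX + bY
scaled = (abs(a) + abs(b)) * Z       # (|a|^1 + |b|^1)^(1/1) Z

# Compare medians of absolute values -- a scale statistic that is robust
# to the Cauchy's heavy tails (its moments do not exist).
m1 = np.median(np.abs(combo))
m2 = np.median(np.abs(scaled))
print(m1, m2)                        # both close to |a| + |b| = 5
```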

Approximating L1 and L2

Let μ be a p-stable distribution (0 < p ≤ 2).

Ideal Algorithm:
For i = 1 to k:
  Let x be a length-n vector with x_j ~ μ
  Compute t_i = x·(p − q)
Return median(|t_1|, |t_2|, ..., |t_k|) / median(|μ|)

It is easy to compute x·(p − q) incrementally: for the stream 3,5,3,7,5,... compute x_3 − x_5 + x_3 − x_7 − x_5 − ... (adding x_j for red values, subtracting for green) and scale.

Lemma: This returns (1 ± ε) L_p(p − q) with probability 1 − δ, if k = Õ(ε⁻² ln δ⁻¹).
Proof: Each t_i is distributed as L_p(p − q) · μ by the p-stability property. Apply Chernoff bounds.
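A minimal sketch of the ideal algorithm for p = 1, with NumPy assumed, is below. The Cauchy distribution is 1-stable and median(|Cauchy|) = 1, so the median of the |t_i| needs no extra rescaling. This version stores the full k × n matrix of sketch vectors, so it is the "ideal" (not yet small-space) algorithm:

```python
import numpy as np

# Indyk-style L1 sketch estimator using Cauchy (1-stable) sketch vectors.
def l1_sketch_estimate(p, q, k=2000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_cauchy((k, len(p)))       # k independent sketch vectors
    t = x @ (np.asarray(p) - np.asarray(q))    # t_i = x^i . (p - q)
    return np.median(np.abs(t))                # estimates L1(p - q)

p = [0.5, 0.3, 0.2, 0.0]
q = [0.1, 0.3, 0.1, 0.5]
est = l1_sketch_estimate(p, q)
true_l1 = sum(abs(a - b) for a, b in zip(p, q))   # = 1.0
print(est, true_l1)
```

With k = 2000 repetitions the estimate lands within a few percent of the true L1 distance.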

Sketches and Space

Sketch/embedding into a small dimension:
Let x¹, x², ..., x^k be length-n vectors with x^i_j ~ μ.
Let C(y) = (x¹·y, ..., x^k·y).
Approximate L_p(p − q) from C(p) and C(q).
CAUTION: This is not an embedding into a normed space.

Can we also construct the sketch in small space?
Storing all the x^i requires Ω(nk) space.
Instead, generate the x^i with Nisan's pseudo-random generator.
The seed can be stored in O(polylog n) space.

Thm: Can (1+ε)-approximate L_p(p − q) in Õ(ε⁻² ln δ⁻¹) space.
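The small-space trick can be illustrated as follows: never store the length-n vector x; instead, re-derive x_j from a short seed whenever item j arrives. In this sketch a seeded library PRG stands in for Nisan's generator (an illustrative shortcut, not the generator the space bound actually requires), and a Cauchy entry is derived via the inverse CDF:

```python
import math
import random

def cauchy_entry(seed, j):
    """Deterministically re-derive x_j ~ Cauchy via the inverse CDF of a
    seeded uniform draw; only the integer seed is ever stored."""
    u = random.Random(seed * 1_000_003 + j).random()
    return math.tan(math.pi * (u - 0.5))

# Streaming update of t = x . (p - q): add x_j for red items, subtract
# x_j for green items, regenerating each entry on demand.
seed, t = 7, 0.0
for item, sign in [(3, +1), (5, -1), (3, +1), (7, -1)]:
    t += sign * cauchy_entry(seed, item)
print(t)
```

Because `cauchy_entry` is deterministic in (seed, j), every occurrence of an item sees the same sketch entry, which is exactly what the incremental update needs.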

2. The Unsketchables: information divergences, communication complexity

Results

Thm (Shift Invariance): A t-approximation of Σ_i φ(p_i, q_i) needs Ω(n) space if, for some a, b, c and m = an/4 + bn + cn/2,
φ(a/m, (a+c)/m) > (t² n / 4) · [ φ((b+c)/m, b/m) + φ(b/m, (b+c)/m) ].

Corollary: A poly(n)-approximation of D_f requires Ω(n) space if f is twice differentiable and strictly convex.

Corollary: A poly(n)-approximation of B_F requires Ω(n) space if there exist ρ, z₀ with
F″(z₁)/F″(z₂) ≥ (z₁/z₂)^ρ for all 0 ≤ z₂ ≤ z₁ ≤ z₀, or
F″(z₁)/F″(z₂) ≥ (z₂/z₁)^ρ for all 0 ≤ z₂ ≤ z₁ ≤ z₀.

The only exceptions are L1 and L2!

BREAKING NEWS: Many of these lower bounds also apply to randomly ordered streams [Chakrabarti, Cormode, McGregor 2007].

The Lower Bound via Set-Disjointness

Alice holds x ∈ {0,1}ⁿ of weight n/4; Bob holds y ∈ {0,1}ⁿ of weight n/4.
Question: Are x and y disjoint, i.e., is x·y = 0?
Thm (Razborov '92): This needs Ω(n) communication.

Alice's part of the stream: a·x_i + b(1 − x_i) copies of i in each of the red and green streams (i ∈ [n]); b copies of i+n in each stream (i ∈ [n/4]).
Bob's part: c·y_i copies of i in the green stream (i ∈ [n]); c copies of i+n in the green stream (i ∈ [n/4]).

If x·y = 0 then the divergence is (n/4) · [ φ(b/m, (b+c)/m) + φ((b+c)/m, b/m) ].
If x·y = 1 then the divergence is at least φ(a/m, (a+c)/m).
By assumption, these differ by a factor of more than t².

Let A be a t-approximation algorithm:
1. Alice runs A on the first half of the stream.
2. She transmits the memory state.
3. Bob instantiates A with that state.
4. He continues A on the second half.

Thm: Any t-approximation algorithm for the divergence requires Ω(n) memory.

Corollary for D_f

Corollary: Any poly(n)-approximation of D_f(p, q) = Σ_i p_i f(q_i/p_i) requires Ω(n) space if f″ exists and is strictly positive.

Proof: Set a = c = 1 and b = t² n (f″(1) + 1) / (8 f(2)).
Take the Taylor expansion of f around 1 (using f(1) = 0 and, WLOG, f′(1) = 0):
φ(b/m, (b+c)/m) = (b/m) [ f(1) + f′(1)/b + f″(1+γ)/(2b²) ]  for some 0 < γ < 1/b
  ≤ (f″(1) + 1) / (2mb)
  ≤ 8 φ(a/m, (a+c)/m) / (t² n).
Similarly, φ((b+c)/m, b/m) ≤ 8 φ(a/m, (a+c)/m) / (t² n).
The result follows from the Shift Invariance Theorem.

Additive Error Algorithms

f-divergences:
Thm: If D_f is bounded, O_ε(ln δ⁻¹) space suffices for a ±ε approximation with probability 1 − δ.
Thm: If D_f is unbounded, Ω(n) space is needed.

Bregman divergences:
Thm: If F(0), F′(0), and F″(·) exist, O_ε(ln δ⁻¹) space suffices for a ±ε approximation with probability 1 − δ.
Thm: If F(0) or F′(0) is infinite, Ω(n) space is needed.

3. Sketching Sketches: identifying correlations in data streams

New Problem

A list of m pairs in [n] × [n]:
(3,5), (5,3), (2,7), (3,4), (7,1), (1,2), (3,9), (6,6), ...
The stream defines random variables:
X_p with distribution (p_1, ..., p_n)  (the first coordinates)
X_q with distribution (q_1, ..., q_n)  (the second coordinates)
(X_p, X_q) with joint distribution (r_11, r_12, ..., r_nn)

How independent are X_p and X_q? That is, how far is the joint distribution from the product distribution (s_11, s_12, ..., s_nn), where s_ij = p_i q_j?

Consider L1(s − r), L2(s − r), or the mutual information:
I(X_p; X_q) = H(X_p) + H(X_q) − H(X_p, X_q) = Σ_{i,j} r_ij lg( r_ij / (p_i q_j) )
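For concreteness, the quantities above can be computed directly (offline, with the whole list in memory) for the sample stream of pairs; the function and variable names below are for illustration only:

```python
import math
from collections import Counter

# Empirical joint (r), marginals (p, q), product (s), and the three
# measures of dependence, for a small stream of pairs.
pairs = [(3, 5), (5, 3), (2, 7), (3, 4), (7, 1), (1, 2), (3, 9), (6, 6)]
m = len(pairs)

r = {ij: c / m for ij, c in Counter(pairs).items()}
p = {i: c / m for i, c in Counter(i for i, _ in pairs).items()}
q = {j: c / m for j, c in Counter(j for _, j in pairs).items()}
s = {(i, j): p[i] * q[j] for i in p for j in q}   # product distribution

# support(r) is contained in support(s), so summing over s covers both.
l1 = sum(abs(s[ij] - r.get(ij, 0.0)) for ij in s)
l2 = math.sqrt(sum((s[ij] - r.get(ij, 0.0)) ** 2 for ij in s))
mi = sum(rij * math.log2(rij / s[ij]) for ij, rij in r.items())
print(l1, l2, mi)
```

All three measures are zero exactly when the joint distribution equals the product distribution; for this small dependent-looking sample they are strictly positive.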

Results

Estimating L2(s − r):
Thm: (1+ε)-factor approximation (w.p. 1 − δ) in Õ(ε⁻² ln δ⁻¹) space.

Estimating L1(s − r):
Thm: O(ln n)-factor approximation (w.p. 1 − δ) in Õ(ln δ⁻¹) space.
Thm: ±ε approximation (w.p. 1 − δ) in Õ(ε⁻⁴ ln δ⁻¹) space (2-pass).

Estimating I(X_p; X_q):
Thm: No 5/4-factor approximation (w.p. 4/5) in o(n) space.
Thm: ±ε approximation (w.p. 1 − δ) in Õ(ε⁻² ln δ⁻¹) space.

L2 Sketching Revisited

Let x ∈ {−1, 1}ⁿ where the x_i are unbiased and 4-wise independent. Compute x·(p − q)...

E[(x·(p−q))²] = Σ_{i,j} E[x_i x_j] (p_i − q_i)(p_j − q_j) = (L2(p − q))²
Var[(x·(p−q))²] ≤ E[(x·(p−q))⁴] = Σ_{i,j,k,l} E[x_i x_j x_k x_l] (p_i − q_i)(p_j − q_j)(p_k − q_k)(p_l − q_l) ≤ 3 (L2(p − q))⁴

Thm: By Chebyshev bounds, the average of O(ε⁻² ln δ⁻¹) repetitions yields (1 ± ε) L2(p − q) with probability 1 − δ. [Alon, Matias, Szegedy 1996]
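A sketch of this AMS-style estimator follows, with the 4-wise independent ±1 entries generated from a random degree-3 polynomial modulo a prime (one standard construction; the parity step makes it only essentially unbiased, with bias of order 1/P):

```python
import random

P = 2**61 - 1  # prime modulus for the hash polynomial

def fourwise_sign(seed, n):
    """One unbiased +/-1 vector over [n] that is (essentially) 4-wise
    independent: evaluate a random degree-3 polynomial mod a prime and
    take the parity of the result."""
    rnd = random.Random(seed)
    a, b, c, d = [rnd.randrange(P) for _ in range(4)]
    return [1 if ((a*j**3 + b*j**2 + c*j + d) % P) % 2 == 0 else -1
            for j in range(n)]

def l2_estimate(p, q, k=1000, seed=0):
    n = len(p)
    acc = 0.0
    for t in range(k):
        x = fourwise_sign(seed * 100_003 + t, n)
        dot = sum(x[j] * (p[j] - q[j]) for j in range(n))
        acc += dot * dot                  # E[dot^2] = (L2(p - q))^2
    return (acc / k) ** 0.5

p = [0.5, 0.3, 0.2, 0.0]
q = [0.1, 0.3, 0.1, 0.5]
est = l2_estimate(p, q)
true_l2 = sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
print(est, true_l2)
```

Only the four polynomial coefficients per repetition need to be stored, which is the source of the small space bound.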

Testing L2 Independence

Idea: Estimate L2(r − s) using z ∈ {−1, 1}^{n×n} where the z_ij are unbiased and 4-wise independent.

Problem: We can't compute the sketch of the product distribution!

Solution: Let x, y ∈ {−1, 1}ⁿ be 4-wise independent and set z_ij = x_i y_j. Then the sketch of the product distribution factors:
z·s = Σ_{ij} z_ij s_ij = (x·p)(y·q)

The entries z_ij are no longer 4-wise independent, but that's okay. Let a_ij = r_ij − s_ij and consider T = Σ_{ij} z_ij a_ij:
Var[T²] ≤ E[T⁴] = Σ_{i1,j1,i2,j2,i3,j3,i4,j4} E[z_{i1 j1} z_{i2 j2} z_{i3 j3} z_{i4 j4}] a_{i1 j1} a_{i2 j2} a_{i3 j3} a_{i4 j4} ≤ 3 (L2(r − s))⁴

Repeat O(ε⁻² ln δ⁻¹) times to deduce (1 ± ε) L2(r − s).
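The key factoring identity, and the resulting estimator, can be checked numerically. The sketch below (NumPy assumed) uses fully random signs for brevity; the slide's 4-wise independent construction would replace the sign generation:

```python
import numpy as np

# Check of the "sketching of sketches" identity: with z_ij = x_i * y_j,
# the sketch of the product distribution factors as
#     z . s = (x . p)(y . q),
# so it can be computed from the two marginal sketches alone.
rng = np.random.default_rng(1)
n = 6
r = rng.random((n, n)); r /= r.sum()       # an arbitrary joint distribution
p, q = r.sum(axis=1), r.sum(axis=0)        # its marginals
s = np.outer(p, q)                         # the product distribution

k = 10_000
acc = 0.0
for _ in range(k):
    x = rng.choice([-1.0, 1.0], size=n)
    y = rng.choice([-1.0, 1.0], size=n)
    z = np.outer(x, y)
    assert np.isclose((z * s).sum(), (x @ p) * (y @ q))   # the identity
    acc += ((z * (r - s)).sum()) ** 2 / k                  # average of T^2

print(acc ** 0.5, np.linalg.norm(r - s))   # both approximate L2(r - s)
```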

Summary

Small-space sketches of L1 and L2 exist, using p-stable distributions.
No small-space sketches exist for other information divergences.
Sketching ideas can be used to estimate independence.

Main Material:
Stable distributions, pseudo-random generators, embeddings, and data stream computation. Piotr Indyk (FOCS 2000)
Sketching information divergences. Sudipto Guha, Piotr Indyk, Andrew McGregor (COLT 2007)
Declaring independence via the sketching of sketches. Piotr Indyk, Andrew McGregor (SODA 2008)


More information

Streaming and communication complexity of Hamming distance

Streaming and communication complexity of Hamming distance Streaming and communication complexity of Hamming distance Tatiana Starikovskaya IRIF, Université Paris-Diderot (Joint work with Raphaël Clifford, ICALP 16) Approximate pattern matching Problem Pattern

More information

Testing k-wise Independence over Streaming Data

Testing k-wise Independence over Streaming Data Testing k-wise Independence over Streaming Data Kai-Min Chung Zhenming Liu Michael Mitzenmacher Abstract Following on the work of Indyk and McGregor [5], we consider the problem of identifying correlations

More information

Lecture 11: Quantum Information III - Source Coding

Lecture 11: Quantum Information III - Source Coding CSCI5370 Quantum Computing November 25, 203 Lecture : Quantum Information III - Source Coding Lecturer: Shengyu Zhang Scribe: Hing Yin Tsang. Holevo s bound Suppose Alice has an information source X that

More information

Application of Information Theory, Lecture 7. Relative Entropy. Handout Mode. Iftach Haitner. Tel Aviv University.

Application of Information Theory, Lecture 7. Relative Entropy. Handout Mode. Iftach Haitner. Tel Aviv University. Application of Information Theory, Lecture 7 Relative Entropy Handout Mode Iftach Haitner Tel Aviv University. December 1, 2015 Iftach Haitner (TAU) Application of Information Theory, Lecture 7 December

More information

Graph & Geometry Problems in Data Streams

Graph & Geometry Problems in Data Streams Graph & Geometry Problems in Data Streams 2009 Barbados Workshop on Computational Complexity Andrew McGregor Introduction Models: Graph Streams: Stream of edges E = {e 1, e 2,..., e m } describe a graph

More information

arxiv: v1 [cs.ds] 8 Aug 2014

arxiv: v1 [cs.ds] 8 Aug 2014 Asymptotically exact streaming algorithms Marc Heinrich, Alexander Munteanu, and Christian Sohler Département d Informatique, École Normale Supérieure, Paris, France marc.heinrich@ens.fr Department of

More information

Grothendieck s Inequality

Grothendieck s Inequality Grothendieck s Inequality Leqi Zhu 1 Introduction Let A = (A ij ) R m n be an m n matrix. Then A defines a linear operator between normed spaces (R m, p ) and (R n, q ), for 1 p, q. The (p q)-norm of A

More information

Lecture 6 September 13, 2016

Lecture 6 September 13, 2016 CS 395T: Sublinear Algorithms Fall 206 Prof. Eric Price Lecture 6 September 3, 206 Scribe: Shanshan Wu, Yitao Chen Overview Recap of last lecture. We talked about Johnson-Lindenstrauss (JL) lemma [JL84]

More information

CSC Linear Programming and Combinatorial Optimization Lecture 10: Semidefinite Programming

CSC Linear Programming and Combinatorial Optimization Lecture 10: Semidefinite Programming CSC2411 - Linear Programming and Combinatorial Optimization Lecture 10: Semidefinite Programming Notes taken by Mike Jamieson March 28, 2005 Summary: In this lecture, we introduce semidefinite programming

More information

The query register and working memory together form the accessible memory, denoted H A. Thus the state of the algorithm is described by a vector

The query register and working memory together form the accessible memory, denoted H A. Thus the state of the algorithm is described by a vector 1 Query model In the quantum query model we wish to compute some function f and we access the input through queries. The complexity of f is the number of queries needed to compute f on a worst-case input

More information

On the Optimality of the Dimensionality Reduction Method

On the Optimality of the Dimensionality Reduction Method On the Optimality of the Dimensionality Reduction Method Alexandr Andoni MIT andoni@mit.edu Piotr Indyk MIT indyk@mit.edu Mihai Pǎtraşcu MIT mip@mit.edu Abstract We investigate the optimality of (1+)-approximation

More information

Information Complexity and Applications. Mark Braverman Princeton University and IAS FoCM 17 July 17, 2017

Information Complexity and Applications. Mark Braverman Princeton University and IAS FoCM 17 July 17, 2017 Information Complexity and Applications Mark Braverman Princeton University and IAS FoCM 17 July 17, 2017 Coding vs complexity: a tale of two theories Coding Goal: data transmission Different channels

More information

14.1 Finding frequent elements in stream

14.1 Finding frequent elements in stream Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours

More information

Range-efficient computation of F 0 over massive data streams

Range-efficient computation of F 0 over massive data streams Range-efficient computation of F 0 over massive data streams A. Pavan Dept. of Computer Science Iowa State University pavan@cs.iastate.edu Srikanta Tirthapura Dept. of Elec. and Computer Engg. Iowa State

More information

Pseudorandom Generators

Pseudorandom Generators 8 Pseudorandom Generators Great Ideas in Theoretical Computer Science Saarland University, Summer 2014 andomness is one of the fundamental computational resources and appears everywhere. In computer science,

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 59 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d

More information

Streaming Algorithms for Set Cover. Piotr Indyk With: Sepideh Mahabadi, Ali Vakilian

Streaming Algorithms for Set Cover. Piotr Indyk With: Sepideh Mahabadi, Ali Vakilian Streaming Algorithms for Set Cover Piotr Indyk With: Sepideh Mahabadi, Ali Vakilian Set Cover Input: a collection S of sets S 1...S m that covers U={1...n} I.e., S 1 S 2. S m = U Output: a subset I of

More information

Lecture 2 Sept. 8, 2015

Lecture 2 Sept. 8, 2015 CS 9r: Algorithms for Big Data Fall 5 Prof. Jelani Nelson Lecture Sept. 8, 5 Scribe: Jeffrey Ling Probability Recap Chebyshev: P ( X EX > λ) < V ar[x] λ Chernoff: For X,..., X n independent in [, ],

More information

EECS 750. Hypothesis Testing with Communication Constraints

EECS 750. Hypothesis Testing with Communication Constraints EECS 750 Hypothesis Testing with Communication Constraints Name: Dinesh Krithivasan Abstract In this report, we study a modification of the classical statistical problem of bivariate hypothesis testing.

More information

Lecture 4: Hashing and Streaming Algorithms

Lecture 4: Hashing and Streaming Algorithms CSE 521: Design and Analysis of Algorithms I Winter 2017 Lecture 4: Hashing and Streaming Algorithms Lecturer: Shayan Oveis Gharan 01/18/2017 Scribe: Yuqing Ai Disclaimer: These notes have not been subjected

More information

Graph Sparsification I : Effective Resistance Sampling

Graph Sparsification I : Effective Resistance Sampling Graph Sparsification I : Effective Resistance Sampling Nikhil Srivastava Microsoft Research India Simons Institute, August 26 2014 Graphs G G=(V,E,w) undirected V = n w: E R + Sparsification Approximate

More information

CSE 190, Great ideas in algorithms: Pairwise independent hash functions

CSE 190, Great ideas in algorithms: Pairwise independent hash functions CSE 190, Great ideas in algorithms: Pairwise independent hash functions 1 Hash functions The goal of hash functions is to map elements from a large domain to a small one. Typically, to obtain the required

More information

ECON Fundamentals of Probability

ECON Fundamentals of Probability ECON 351 - Fundamentals of Probability Maggie Jones 1 / 32 Random Variables A random variable is one that takes on numerical values, i.e. numerical summary of a random outcome e.g., prices, total GDP,

More information

LECTURE 3. Last time:

LECTURE 3. Last time: LECTURE 3 Last time: Mutual Information. Convexity and concavity Jensen s inequality Information Inequality Data processing theorem Fano s Inequality Lecture outline Stochastic processes, Entropy rate

More information

10. Show that the conclusion of the. 11. Prove the above Theorem. [Th 6.4.7, p 148] 4. Prove the above Theorem. [Th 6.5.3, p152]

10. Show that the conclusion of the. 11. Prove the above Theorem. [Th 6.4.7, p 148] 4. Prove the above Theorem. [Th 6.5.3, p152] foot of the altitude of ABM from M and let A M 1 B. Prove that then MA > MB if and only if M 1 A > M 1 B. 8. If M is the midpoint of BC then AM is called a median of ABC. Consider ABC such that AB < AC.

More information

18.409: Topics in TCS: Embeddings of Finite Metric Spaces. Lecture 6

18.409: Topics in TCS: Embeddings of Finite Metric Spaces. Lecture 6 Massachusetts Institute of Technology Lecturer: Michel X. Goemans 8.409: Topics in TCS: Embeddings of Finite Metric Spaces September 7, 006 Scribe: Benjamin Rossman Lecture 6 Today we loo at dimension

More information

Sublinear Algorithms for Big Data

Sublinear Algorithms for Big Data Sublinear Algorithms for Big Data Qin Zhang 1-1 2-1 Part 2: Sublinear in Communication Sublinear in communication The model x 1 = 010011 x 2 = 111011 x 3 = 111111 x k = 100011 Applicaitons They want to

More information

Space- Efficient Es.ma.on of Sta.s.cs over Sub- Sampled Streams

Space- Efficient Es.ma.on of Sta.s.cs over Sub- Sampled Streams Space- Efficient Es.ma.on of Sta.s.cs over Sub- Sampled Streams Andrew McGregor University of Massachuse7s mcgregor@cs.umass.edu A. Pavan Iowa State University pavan@cs.iastate.edu Srikanta Tirthapura

More information

Algorithms for Streaming Data

Algorithms for Streaming Data Algorithms for Piotr Indyk Problems defined over elements P={p 1,,p n } The algorithm sees p 1, then p 2, then p 3, Key fact: it has limited storage Can store only s

More information

CMPSCI 711: More Advanced Algorithms

CMPSCI 711: More Advanced Algorithms CMPSCI 711: More Advanced Algorithms Section 1-1: Sampling Andrew McGregor Last Compiled: April 29, 2012 1/14 Concentration Bounds Theorem (Markov) Let X be a non-negative random variable with expectation

More information

Capacity of AWGN channels

Capacity of AWGN channels Chapter 3 Capacity of AWGN channels In this chapter we prove that the capacity of an AWGN channel with bandwidth W and signal-tonoise ratio SNR is W log 2 (1+SNR) bits per second (b/s). The proof that

More information

CS Foundations of Communication Complexity

CS Foundations of Communication Complexity CS 2429 - Foundations of Communication Complexity Lecturer: Sergey Gorbunov 1 Introduction In this lecture we will see how to use methods of (conditional) information complexity to prove lower bounds for

More information

Heavy Hitters. Piotr Indyk MIT. Lecture 4

Heavy Hitters. Piotr Indyk MIT. Lecture 4 Heavy Hitters Piotr Indyk MIT Last Few Lectures Recap (last few lectures) Update a vector x Maintain a linear sketch Can compute L p norm of x (in zillion different ways) Questions: Can we do anything

More information

Information Theory. Lecture 5 Entropy rate and Markov sources STEFAN HÖST

Information Theory. Lecture 5 Entropy rate and Markov sources STEFAN HÖST Information Theory Lecture 5 Entropy rate and Markov sources STEFAN HÖST Universal Source Coding Huffman coding is optimal, what is the problem? In the previous coding schemes (Huffman and Shannon-Fano)it

More information

The Central Limit Theorem: More of the Story

The Central Limit Theorem: More of the Story The Central Limit Theorem: More of the Story Steven Janke November 2015 Steven Janke (Seminar) The Central Limit Theorem:More of the Story November 2015 1 / 33 Central Limit Theorem Theorem (Central Limit

More information

Challenges in Privacy-Preserving Analysis of Structured Data Kamalika Chaudhuri

Challenges in Privacy-Preserving Analysis of Structured Data Kamalika Chaudhuri Challenges in Privacy-Preserving Analysis of Structured Data Kamalika Chaudhuri Computer Science and Engineering University of California, San Diego Sensitive Structured Data Medical Records Search Logs

More information

Fast Dimension Reduction

Fast Dimension Reduction Fast Dimension Reduction MMDS 2008 Nir Ailon Google Research NY Fast Dimension Reduction Using Rademacher Series on Dual BCH Codes (with Edo Liberty) The Fast Johnson Lindenstrauss Transform (with Bernard

More information

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 Today we ll talk about a topic that is both very old (as far as computer science

More information

List coloring hypergraphs

List coloring hypergraphs List coloring hypergraphs Penny Haxell Jacques Verstraete Department of Combinatorics and Optimization University of Waterloo Waterloo, Ontario, Canada pehaxell@uwaterloo.ca Department of Mathematics University

More information

A Holevo-type bound for a Hilbert Schmidt distance measure

A Holevo-type bound for a Hilbert Schmidt distance measure Journal of Quantum Information Science, 205, *,** Published Online **** 204 in SciRes. http://www.scirp.org/journal/**** http://dx.doi.org/0.4236/****.204.***** A Holevo-type bound for a Hilbert Schmidt

More information

Lecture 1: Introduction to Sublinear Algorithms

Lecture 1: Introduction to Sublinear Algorithms CSE 522: Sublinear (and Streaming) Algorithms Spring 2014 Lecture 1: Introduction to Sublinear Algorithms March 31, 2014 Lecturer: Paul Beame Scribe: Paul Beame Too much data, too little time, space for

More information

Information Theory + Polyhedral Combinatorics

Information Theory + Polyhedral Combinatorics Information Theory + Polyhedral Combinatorics Sebastian Pokutta Georgia Institute of Technology ISyE, ARC Joint work Gábor Braun Information Theory in Complexity Theory and Combinatorics Simons Institute

More information

Communication Complexity

Communication Complexity Communication Complexity Jaikumar Radhakrishnan School of Technology and Computer Science Tata Institute of Fundamental Research Mumbai 31 May 2012 J 31 May 2012 1 / 13 Plan 1 Examples, the model, the

More information

18.5 Crossings and incidences

18.5 Crossings and incidences 18.5 Crossings and incidences 257 The celebrated theorem due to P. Turán (1941) states: if a graph G has n vertices and has no k-clique then it has at most (1 1/(k 1)) n 2 /2 edges (see Theorem 4.8). Its

More information

1 Quantum states and von Neumann entropy

1 Quantum states and von Neumann entropy Lecture 9: Quantum entropy maximization CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: February 15, 2016 1 Quantum states and von Neumann entropy Recall that S sym n n

More information

Explicit bounds on the entangled value of multiplayer XOR games. Joint work with Thomas Vidick (MIT)

Explicit bounds on the entangled value of multiplayer XOR games. Joint work with Thomas Vidick (MIT) Explicit bounds on the entangled value of multiplayer XOR games Jop Briët Joint work with Thomas Vidick (MIT) Waterloo, 2012 Entanglement and nonlocal correlations [Bell64] Measurements on entangled quantum

More information

Big Data. Big data arises in many forms: Common themes:

Big Data. Big data arises in many forms: Common themes: Big Data Big data arises in many forms: Physical Measurements: from science (physics, astronomy) Medical data: genetic sequences, detailed time series Activity data: GPS location, social network activity

More information

The Count-Min-Sketch and its Applications

The Count-Min-Sketch and its Applications The Count-Min-Sketch and its Applications Jannik Sundermeier Abstract In this thesis, we want to reveal how to get rid of a huge amount of data which is at least dicult or even impossible to store in local

More information

An exponential separation between quantum and classical one-way communication complexity

An exponential separation between quantum and classical one-way communication complexity An exponential separation between quantum and classical one-way communication complexity Ashley Montanaro Centre for Quantum Information and Foundations, Department of Applied Mathematics and Theoretical

More information

Direct product theorem for discrepancy

Direct product theorem for discrepancy Direct product theorem for discrepancy Troy Lee Rutgers University Robert Špalek Google Direct product theorems: Why is Google interested? Direct product theorems: Why should Google be interested? Direct

More information

Phenomena in high dimensions in geometric analysis, random matrices, and computational geometry Roscoff, France, June 25-29, 2012

Phenomena in high dimensions in geometric analysis, random matrices, and computational geometry Roscoff, France, June 25-29, 2012 Phenomena in high dimensions in geometric analysis, random matrices, and computational geometry Roscoff, France, June 25-29, 202 BOUNDS AND ASYMPTOTICS FOR FISHER INFORMATION IN THE CENTRAL LIMIT THEOREM

More information

Combining geometry and combinatorics

Combining geometry and combinatorics Combining geometry and combinatorics A unified approach to sparse signal recovery Anna C. Gilbert University of Michigan joint work with R. Berinde (MIT), P. Indyk (MIT), H. Karloff (AT&T), M. Strauss

More information

Lecture 6: Expander Codes

Lecture 6: Expander Codes CS369E: Expanders May 2 & 9, 2005 Lecturer: Prahladh Harsha Lecture 6: Expander Codes Scribe: Hovav Shacham In today s lecture, we will discuss the application of expander graphs to error-correcting codes.

More information

Random Variables. Random variables. A numerically valued map X of an outcome ω from a sample space Ω to the real line R

Random Variables. Random variables. A numerically valued map X of an outcome ω from a sample space Ω to the real line R In probabilistic models, a random variable is a variable whose possible values are numerical outcomes of a random phenomenon. As a function or a map, it maps from an element (or an outcome) of a sample

More information

Title: Count-Min Sketch Name: Graham Cormode 1 Affil./Addr. Department of Computer Science, University of Warwick,

Title: Count-Min Sketch Name: Graham Cormode 1 Affil./Addr. Department of Computer Science, University of Warwick, Title: Count-Min Sketch Name: Graham Cormode 1 Affil./Addr. Department of Computer Science, University of Warwick, Coventry, UK Keywords: streaming algorithms; frequent items; approximate counting, sketch

More information

Lecture 01 August 31, 2017

Lecture 01 August 31, 2017 Sketching Algorithms for Big Data Fall 2017 Prof. Jelani Nelson Lecture 01 August 31, 2017 Scribe: Vinh-Kha Le 1 Overview In this lecture, we overviewed the six main topics covered in the course, reviewed

More information

Spectral Sparsification in Dynamic Graph Streams

Spectral Sparsification in Dynamic Graph Streams Spectral Sparsification in Dynamic Graph Streams Kook Jin Ahn 1, Sudipto Guha 1, and Andrew McGregor 2 1 University of Pennsylvania {kookjin,sudipto}@seas.upenn.edu 2 University of Massachusetts Amherst

More information

STAT 512 sp 2018 Summary Sheet

STAT 512 sp 2018 Summary Sheet STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}

More information

Effective computation of biased quantiles over data streams

Effective computation of biased quantiles over data streams Effective computation of biased quantiles over data streams Graham Cormode cormode@bell-labs.com S. Muthukrishnan muthu@cs.rutgers.edu Flip Korn flip@research.att.com Divesh Srivastava divesh@research.att.com

More information

DATA STREAMS. Elena Ikonomovska Jožef Stefan Institute Department of Knowledge Technologies

DATA STREAMS. Elena Ikonomovska Jožef Stefan Institute Department of Knowledge Technologies QUERYING AND MINING DATA STREAMS Elena Ikonomovska Jožef Stefan Institute Department of Knowledge Technologies Outline Definitions Datastream models Similarity measures Historical background Foundations

More information

Finite Metric Spaces & Their Embeddings: Introduction and Basic Tools

Finite Metric Spaces & Their Embeddings: Introduction and Basic Tools Finite Metric Spaces & Their Embeddings: Introduction and Basic Tools Manor Mendel, CMI, Caltech 1 Finite Metric Spaces Definition of (semi) metric. (M, ρ): M a (finite) set of points. ρ a distance function

More information

Algorithms for Querying Noisy Distributed/Streaming Datasets

Algorithms for Querying Noisy Distributed/Streaming Datasets Algorithms for Querying Noisy Distributed/Streaming Datasets Qin Zhang Indiana University Bloomington Sublinear Algo Workshop @ JHU Jan 9, 2016 1-1 The big data models The streaming model (Alon, Matias

More information

Random Feature Maps for Dot Product Kernels Supplementary Material

Random Feature Maps for Dot Product Kernels Supplementary Material Random Feature Maps for Dot Product Kernels Supplementary Material Purushottam Kar and Harish Karnick Indian Institute of Technology Kanpur, INDIA {purushot,hk}@cse.iitk.ac.in Abstract This document contains

More information

Zero-One Laws for Sliding Windows and Universal Sketches

Zero-One Laws for Sliding Windows and Universal Sketches Zero-One Laws for Sliding Windows and Universal Sketches Vladimir Braverman 1, Rafail Ostrovsky 2, and Alan Roytman 3 1 Department of Computer Science, Johns Hopkins University, USA vova@cs.jhu.edu 2 Department

More information

Supremum of simple stochastic processes

Supremum of simple stochastic processes Subspace embeddings Daniel Hsu COMS 4772 1 Supremum of simple stochastic processes 2 Recap: JL lemma JL lemma. For any ε (0, 1/2), point set S R d of cardinality 16 ln n S = n, and k N such that k, there

More information

Chapter 4: More Applications of Differentiation

Chapter 4: More Applications of Differentiation Chapter 4: More Applications of Differentiation Autumn 2017 Department of Mathematics Hong Kong Baptist University 1 / 68 In the fall of 1972, President Nixon announced that, the rate of increase of inflation

More information

Probabilistic verification and approximation schemes

Probabilistic verification and approximation schemes Probabilistic verification and approximation schemes Richard Lassaigne Equipe de Logique mathématique, CNRS-Université Paris 7 Joint work with Sylvain Peyronnet (LRDE/EPITA & Equipe de Logique) Plan 1

More information

StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory

StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory S.V. N. (vishy) Vishwanathan Purdue University and Microsoft vishy@purdue.edu October 9, 2012 S.V. N. Vishwanathan (Purdue,

More information