Declaring Independence via the Sketching of Sketches
Declaring Independence via the Sketching of Sketches
Piotr Indyk (Massachusetts Institute of Technology)
Andrew McGregor (University of California, San Diego, until August. Hire me!)
The Problem
The Center for Disease Control (CDC) has massive amounts of data on disease occurrences and their locations. How correlated is your zip code with the diseases you'll catch this year?
Sample (sub-linear time): How many samples are required to distinguish independence from being ε-far from independence? [Batu et al. 01], [Alon et al. 07], [Valiant 08]
Stream (sub-linear space): Access the pairs sequentially or online, with limited memory.
Formulation
Stream of m pairs in [n] x [n]: (3,5), (5,3), (2,7), (3,4), (7,1), (1,2), (3,9), (6,6), ...
Define empirical distributions:
Marginals: (p1, ..., pn) and (q1, ..., qn)
Joint: (r11, r12, ..., rnn)
Product: (s11, s12, ..., snn) where sij = pi qj
Question: How correlated are the first and second terms? E.g.,
L1(s-r) = Σ_{i,j} |sij - rij|
L2(s-r) = (Σ_{i,j} (sij - rij)^2)^{1/2}
I(s, r) = H(p) - H(p | q)
Previous work: can estimate L1 and L2 between marginals. [Alon, Matias, Szegedy 96], [Feigenbaum et al. 99], [Indyk 00], [Guha, Indyk, McGregor 07], [Ganguly, Cormode 07]
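For reference, the distributions and the L1 distance just defined can be computed exactly in linear space (the streaming algorithms below approximate them in sublinear space). This is only a sketch; the function and variable names are illustrative:

```python
from collections import Counter

def empirical_dists(stream, n):
    """Exact (linear-space) empirical distributions from a stream of
    pairs in [n] x [n], 1-indexed: marginals p, q, joint r, product s."""
    m = len(stream)
    p = [0.0] * (n + 1)
    q = [0.0] * (n + 1)
    r = Counter()  # missing entries read as 0
    for (i, j) in stream:
        p[i] += 1.0 / m
        q[j] += 1.0 / m
        r[(i, j)] += 1.0 / m
    s = {(i, j): p[i] * q[j] for i in range(1, n + 1) for j in range(1, n + 1)}
    return p, q, r, s

def l1_dist(s, r, n):
    """L1(s - r): zero iff the empirical joint equals the product of the marginals."""
    return sum(abs(s[(i, j)] - r[(i, j)])
               for i in range(1, n + 1) for j in range(1, n + 1))
```

For example, the perfectly correlated stream [(1,1), (2,2)] gives L1(s-r) = 1, while a stream hitting all four pairs of [2] x [2] equally gives L1(s-r) = 0.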
Our Results
Estimating L2(s-r): (1+ε)-factor approx. in Õ(ε^-2 ln δ^-1) space. Neat result extending AMS sketches.
Estimating L1(s-r): O(ln n)-factor approx. in Õ(ln δ^-1) space. Sketches of sketches and sketches/embeddings.
Other results:
L1(s-r): Additive approximations.
Mutual information: Additive but not (1+ε)-factor approx.
Distributed model: Pairs are observed by different parties.
a) Neat Result for L2 b) Sketching Sketches c) Other Results
First Attempt
Random projection: Let z ∈ {-1, +1}^(n x n) where the zij are unbiased 4-wise independent signs. [Alon, Matias, Szegedy 96]
Estimator: Suppose we can compute T = (z.r - z.s)^2.
This is correct in expectation and has small variance (writing aij = rij - sij):
E[T] = Σ_{i1,j1,i2,j2} E[z_{i1 j1} z_{i2 j2}] a_{i1 j1} a_{i2 j2} = (L2(r-s))^2
Var[T] = E[T^2] - E[T]^2 = Σ_{i1,j1,i2,j2,i3,j3,i4,j4} E[z_{i1 j1} z_{i2 j2} z_{i3 j3} z_{i4 j4}] a_{i1 j1} a_{i2 j2} a_{i3 j3} a_{i4 j4} - E[T]^2
Repeat O(ε^-2 ln δ^-1) times and take the mean.
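The "correct in expectation" claim can be sanity-checked exhaustively on a tiny instance: averaging T over all 16 sign matrices on a 2x2 example (full independence implies 4-wise independence) recovers (L2(r-s))^2 exactly, since the cross terms cancel. The values of a below are illustrative:

```python
from itertools import product

def estimator_T(z, a):
    """One copy of the estimator: T = (z.a)^2 with a_ij = r_ij - s_ij,
    since z.r - z.s = z.(r - s) by linearity."""
    return sum(z[c] * a[c] for c in a) ** 2

# toy 2x2 difference matrix a = r - s (illustrative values)
a = {(0, 0): 0.3, (0, 1): -0.1, (1, 0): -0.1, (1, 1): -0.1}
l2_squared = sum(v * v for v in a.values())

# exact E[T]: average over all 16 sign matrices z in {-1,+1}^{2x2}
cells = sorted(a)
samples = [estimator_T(dict(zip(cells, signs)), a)
           for signs in product([-1, 1], repeat=len(cells))]
mean_T = sum(samples) / len(samples)
# mean_T equals l2_squared: the cross terms cancel in expectation
```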
Computing the Estimator
Need to compute z.r and z.s.
Good news: The first term is easy:
1) Let A = 0.
2) For each stream element:
2.1) If the stream element is (i,j), then A ← A + zij/m.
Bad news: Can't compute the second term!
Good news: Use a bilinear sketch. If zij = xi yj, then
z.s = Σ_{i,j} zij sij = (x.p)(y.q) for x, y ∈ {-1, +1}^n,
i.e., the product of sketches is the sketch of the product.
Bad news: z is no longer 4-wise independent even if x and y are fully random, e.g.,
z11 z12 z21 z22 = (x1)^2 (x2)^2 (y1)^2 (y2)^2 = 1.
Still Get Low Variance
Lemma: The variance has at most tripled.
Proof: z is the outer product of x and y:

z = [ x1 y1  x2 y1  ...  xn y1
      x1 y2  x2 y2  ...  xn y2
      ...
      x1 yn  x2 yn  ...  xn yn ]

The product of four entries is biased iff the entries lie in a rectangle. Hence,

Var[T] ≤ Σ_{(i1,j1),(i2,j2),(i3,j3),(i4,j4) in a rectangle} |a_{i1 j1} a_{i2 j2} a_{i3 j3} a_{i4 j4}| ≤ 3 E[T]^2

since a rectangle is uniquely specified by a diagonal and
2 a_{i1 j1} a_{i2 j2} a_{i3 j3} a_{i4 j4} ≤ (a_{i1 j1} a_{i2 j2})^2 + (a_{i3 j3} a_{i4 j4})^2.

Less independence is also useful for range-sums. [Rusu, Dobra 06]
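The lemma can be verified exhaustively on a small instance: enumerating all sign vectors x, y gives the exact mean and variance of T under z_ij = x_i y_j. The toy matrix a below is illustrative; the bound Var[T] ≤ 3 E[T]^2 holds, and is tight for the all-ones matrix:

```python
from itertools import product

def exact_moments(a):
    """Exact mean and variance of T = (sum_ij x_i y_j a[i][j])^2 over
    all sign vectors x, y in {-1,+1}^n, i.e. under z_ij = x_i * y_j."""
    n = len(a)
    ts = []
    for x in product([-1, 1], repeat=n):
        for y in product([-1, 1], repeat=n):
            dot = sum(x[i] * y[j] * a[i][j] for i in range(n) for j in range(n))
            ts.append(dot ** 2)
    mean = sum(ts) / len(ts)
    var = sum(t * t for t in ts) / len(ts) - mean ** 2
    return mean, var

a = [[0.2, -0.1], [-0.05, 0.15]]   # toy a = r - s
mean, var = exact_moments(a)
# mean equals L2(a)^2 exactly, and var <= 3 * mean^2 per the lemma
```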
Summary of L2 Result
Thm: (1+ε)-factor approx. (w/p 1-δ) in Õ(ε^-2 ln δ^-1) space.
Proof ideas:
1) First attempt: Use the AMS technique.
2) Road block: Can't sketch the product distribution.
3) Bilinear sketch: Product of sketches was sketch of product!
4) PANIC: No longer 4-wise independent.
5) Relax: We didn't need full 4-wise independence.
a) Neat Result for L2 b) Sketching Sketches c) Other Results
L1 Result
Thm: O(ln n)-factor approx. of L1(s-r) in Õ(ln δ^-1) space.
Why not a (1+ε)-factor using Indyk's p-stable technique? [Indyk 00]
Review of L1 sketching:
Let the entries of z be Cauchy(0,1).
Compute the estimator z.a.
Repeat k = O(ε^-2 ln δ^-1) times with a different z.
Take the median and appeal to concentration lemmas.
N.B. If the median were a mean, we'd have an L1 dimensionality-reduction result that doesn't exist. [Brinkman, Charikar 03]
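The review above rests on 1-stability: z.a with i.i.d. Cauchy(0,1) entries is distributed as Cauchy(0, L1(a)), and the median of |Cauchy(0,1)| is tan(pi/4) = 1, so the median of |z.a| over repetitions estimates L1(a). A minimal sketch, with illustrative parameters:

```python
import math
import random

def cauchy_l1_estimate(a, k, rng):
    """Estimate L1(a) = sum_i |a_i|: each repetition computes z.a with
    i.i.d. Cauchy(0,1) entries z; by 1-stability z.a ~ Cauchy(0, L1(a)),
    and the median of |z.a| over k repetitions concentrates at L1(a)."""
    ests = []
    for _ in range(k):
        # inverse-CDF sampling of Cauchy(0,1)
        z = [math.tan(math.pi * (rng.random() - 0.5)) for _ in a]
        ests.append(abs(sum(zi * ai for zi, ai in zip(z, a))))
    ests.sort()
    return ests[len(ests) // 2]

rng = random.Random(0)
est = cauchy_l1_estimate([1.0, -2.0, 3.0], 5001, rng)   # true L1 = 6
```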
Sketching Sketches
To sketch the product distribution we need z = y M_x, where M_x is the n x n^2 block-diagonal matrix with n copies of x:

M_x = [ x 0 ... 0
        0 x ... 0
        ...
        0 0 ... x ]

Sketch in two stages:
Inner sketch: R^{n^2} → R^n, a ↦ M_x a
Outer sketch: R^n → R, M_x a ↦ y M_x a
The problem: We need to take the median of multiple inner sketches before taking the outer sketch, and the size of the inner sketch is large.
L1 Result
Thm: O(ln n)-factor approx. of L1(s-r) in Õ(ln δ^-1) space.
Proof:
Outer sketch: entries of y are Cauchy(0,1).
Inner sketch: entries of x are truncated Cauchy(0,1).
Then

Pr[ Ω(1) ≤ ||M_x a||_1 / ||a||_1 ≤ O(log n) ] ≥ 9/10

Repeat Õ(ln δ^-1) times and take the median.
a) Neat Result for L2 b) Sketching Sketches c) Other Results
Other Results
Mutual information:
Can't (1+ε)-factor approximate in o(n) space.
Can approximate to ±ε using algorithms for approximating entropy. [Chakrabarti, Cormode, McGregor 07]
Distributed model:
Player 1 sees (3,·), (5,·), (2,·), (3,·), (7,·), (1,·), (3,·), (6,·), ...
Player 2 sees (·,5), (·,3), (·,7), (·,4), (·,1), (·,2), (·,9), (·,6), ...
Very hard in general, e.g., can't even check whether L1(s-r) = 0.
Additive approximation for L1(s-r): uses the identity
L1(s-r) = Σ_i pi L1(q - q^(i))
where q^(i) is q conditioned on the first term equaling i. [Guha, McGregor, Venkatasubramanian 06]
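The identity behind the additive approximation follows from sij - rij = pi (qj - q^(i)_j), since rij = pi q^(i)_j. A direct numerical check on a toy stream (0-indexed; names are illustrative):

```python
from collections import Counter

def decomposition_sides(stream, n):
    """Return both sides of L1(s - r) = sum_i p_i * L1(q - q^(i)),
    where q^(i) is the empirical distribution of the second coordinate
    conditioned on the first coordinate equaling i."""
    m = len(stream)
    p = Counter(i for i, _ in stream)   # unnormalized marginal counts
    q = Counter(j for _, j in stream)
    r = Counter(stream)                 # unnormalized joint counts
    lhs = sum(abs((p[i] / m) * (q[j] / m) - r[(i, j)] / m)
              for i in range(n) for j in range(n))
    rhs = sum((p[i] / m) *
              sum(abs(q[j] / m - r[(i, j)] / p[i]) for j in range(n))
              for i in range(n) if p[i] > 0)
    return lhs, rhs

lhs, rhs = decomposition_sides([(0, 0), (0, 1), (1, 1), (1, 1)], 2)
# lhs == rhs == 0.5 on this stream
```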
Main Results
Can estimate L2(r-s) well using a neat extension of the AMS sketch.
Can estimate L1(r-s) up to an O(log n) factor using p-stable distributions.
Can estimate mutual information additively using entropy algorithms.
Questions?
More informationGrothendieck s Inequality
Grothendieck s Inequality Leqi Zhu 1 Introduction Let A = (A ij ) R m n be an m n matrix. Then A defines a linear operator between normed spaces (R m, p ) and (R n, q ), for 1 p, q. The (p q)-norm of A
More informationCS Foundations of Communication Complexity
CS 49 - Foundations of Communication Complexity Lecturer: Toniann Pitassi 1 The Discrepancy Method Cont d In the previous lecture we ve outlined the discrepancy method, which is a method for getting lower
More informationDensity estimation in linear time (+approximating L 1 -distances) Satyaki Mahalanabis Daniel Štefankovič. University of Rochester
Density estimation in linear time (+approximating L 1 -distances) Satyaki Mahalanabis Daniel Štefankovič University of Rochester Density estimation f 1 f 2 f 6 + DATA f 4 f f f 4 3 f 5 F = a family of
More informationRobust exponential convergence of hp-fem for singularly perturbed systems of reaction-diffusion equations
Robust exponential convergence of hp-fem for singularly perturbed systems of reaction-diffusion equations Christos Xenophontos Department of Mathematics and Statistics University of Cyprus joint work with
More information2. Matrix Algebra and Random Vectors
2. Matrix Algebra and Random Vectors 2.1 Introduction Multivariate data can be conveniently display as array of numbers. In general, a rectangular array of numbers with, for instance, n rows and p columns
More informationRandom Feature Maps for Dot Product Kernels Supplementary Material
Random Feature Maps for Dot Product Kernels Supplementary Material Purushottam Kar and Harish Karnick Indian Institute of Technology Kanpur, INDIA {purushot,hk}@cse.iitk.ac.in Abstract This document contains
More informationApproximate Counting of Cycles in Streams
Approximate Counting of Cycles in Streams Madhusudan Manjunath 1, Kurt Mehlhorn 1, Konstantinos Panagiotou 1, and He Sun 1,2 1 Max Planck Institute for Informatics, Saarbrücken, Germany 2 Fudan University,
More informationHeavy Hitters. Piotr Indyk MIT. Lecture 4
Heavy Hitters Piotr Indyk MIT Last Few Lectures Recap (last few lectures) Update a vector x Maintain a linear sketch Can compute L p norm of x (in zillion different ways) Questions: Can we do anything
More informationImproved Range-Summable Random Variable Construction Algorithms
Improved Range-Summable Random Variable Construction Algorithms A. R. Calderbank A. Gilbert K. Levchenko S. Muthukrishnan M. Strauss January 19, 2005 Abstract Range-summable universal hash functions, also
More informationNorms of Random Matrices & Low-Rank via Sampling
CS369M: Algorithms for Modern Massive Data Set Analysis Lecture 4-10/05/2009 Norms of Random Matrices & Low-Rank via Sampling Lecturer: Michael Mahoney Scribes: Jacob Bien and Noah Youngs *Unedited Notes
More informationTHE LIGHT BULB PROBLEM 1
THE LIGHT BULB PROBLEM 1 Ramamohan Paturi 2 Sanguthevar Rajasekaran 3 John Reif 3 Univ. of California, San Diego Univ. of Pennsylvania Duke University 1 Running Title: The Light Bulb Problem Corresponding
More informationCR-precis: A deterministic summary structure for update data streams
CR-precis: A deterministic summary structure for update data streams Sumit Ganguly 1 and Anirban Majumder 2 1 Indian Institute of Technology, Kanpur 2 Lucent Technologies, Bangalore Abstract. We present
More informationAd Placement Strategies
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January
More informationOrthogonal tensor decomposition
Orthogonal tensor decomposition Daniel Hsu Columbia University Largely based on 2012 arxiv report Tensor decompositions for learning latent variable models, with Anandkumar, Ge, Kakade, and Telgarsky.
More informationAn Improved Frequent Items Algorithm with Applications to Web Caching
An Improved Frequent Items Algorithm with Applications to Web Caching Kevin Chen and Satish Rao UC Berkeley Abstract. We present a simple, intuitive algorithm for the problem of finding an approximate
More informationSpring 2012 Math 541B Exam 1
Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote
More informationTurnstile Streaming Algorithms Might as Well Be Linear Sketches
Turnstile Streaming Algorithms Might as Well Be Linear Sketches Yi Li Max-Planck Institute for Informatics yli@mpi-inf.mpg.de David P. Woodruff IBM Research, Almaden dpwoodru@us.ibm.com Huy L. Nguy ên
More informationReview. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda
Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/26/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 More algorithms
More information4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S
Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit
More informationStream-Order and Order-Statistics
Stream-Order and Order-Statistics Sudipto Guha Andrew McGregor Abstract When processing a data-stream in small space, how important is the order in which the data arrives? Are there problems that are unsolvable
More informationThe main results about probability measures are the following two facts:
Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a
More informationAlgorithms for Querying Noisy Distributed/Streaming Datasets
Algorithms for Querying Noisy Distributed/Streaming Datasets Qin Zhang Indiana University Bloomington Sublinear Algo Workshop @ JHU Jan 9, 2016 1-1 The big data models The streaming model (Alon, Matias
More informationUniformly X Intersecting Families. Noga Alon and Eyal Lubetzky
Uniformly X Intersecting Families Noga Alon and Eyal Lubetzky April 2007 X Intersecting Families Let A, B denote two families of subsets of [n]. The pair (A,B) is called iff `-cross-intersecting A B =`
More information14.1 Finding frequent elements in stream
Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours
More informationExponential Metrics. Lecture by Sarah Cannon and Emma Cohen. November 18th, 2014
Exponential Metrics Lecture by Sarah Cannon and Emma Cohen November 18th, 2014 1 Background The coupling theorem we proved in class: Theorem 1.1. Let φ t be a random variable [distance metric] satisfying:
More informationLecture 5. Max-cut, Expansion and Grothendieck s Inequality
CS369H: Hierarchies of Integer Programming Relaxations Spring 2016-2017 Lecture 5. Max-cut, Expansion and Grothendieck s Inequality Professor Moses Charikar Scribes: Kiran Shiragur Overview Here we derive
More informationOnline (and Distributed) Learning with Information Constraints. Ohad Shamir
Online (and Distributed) Learning with Information Constraints Ohad Shamir Weizmann Institute of Science Online Algorithms and Learning Workshop Leiden, November 2014 Ohad Shamir Learning with Information
More informationSuccinct Approximate Counting of Skewed Data
Succinct Approximate Counting of Skewed Data David Talbot Google Inc., Mountain View, CA, USA talbot@google.com Abstract Practical data analysis relies on the ability to count observations of objects succinctly
More informationNear-Optimal Algorithms for Maximum Constraint Satisfaction Problems
Near-Optimal Algorithms for Maximum Constraint Satisfaction Problems Moses Charikar Konstantin Makarychev Yury Makarychev Princeton University Abstract In this paper we present approximation algorithms
More informationConditional Independence
H. Nooitgedagt Conditional Independence Bachelorscriptie, 18 augustus 2008 Scriptiebegeleider: prof.dr. R. Gill Mathematisch Instituut, Universiteit Leiden Preface In this thesis I ll discuss Conditional
More information