Sparser Johnson-Lindenstrauss Transforms
1 Sparser Johnson-Lindenstrauss Transforms. Jelani Nelson, Princeton, February 16, 2012. Joint work with Daniel Kane (Stanford).
5 Random Projections
x ∈ R^d, d huge; store y = Sx, where S is a k × d matrix (compression).
- compressed sensing (recover x from y when x is (near-)sparse)
- group testing (as above, but Sx is Boolean matrix-vector multiplication)
- recover properties of x (entropy, heavy hitters, ...)
- approximate norm preservation (want ‖y‖ ≈ ‖x‖)
- motif discovery (slightly different; randomly project discrete x onto a subset of its coordinates) [Buhler-Tompa]
In many of these applications, a random S is either required or obtains better parameters than deterministic constructions.
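As an aside (not part of the slides), the simplest instantiation of such an S — a dense matrix with i.i.d. Gaussian entries — can be sketched in a few lines; the dimensions and seed below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 10_000, 400               # huge ambient dimension, small target dimension
x = rng.standard_normal(d)
x /= np.linalg.norm(x)           # unit-norm input vector

# Dense JL matrix: i.i.d. N(0, 1/k) entries, so E[||Sx||^2] = ||x||^2 = 1.
S = rng.standard_normal((k, d)) / np.sqrt(k)

y = S @ x                        # compressed representation: k numbers instead of d
distortion = abs(np.linalg.norm(y) ** 2 - 1.0)
print(distortion)                # typically on the order of sqrt(2/k)
```

The point of the rest of the talk is that this S is dense, so applying it is expensive; the constructions below make S structured or sparse.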
7 Metric Johnson-Lindenstrauss lemma
Metric JL (MJL) Lemma, 1984: Every set of n points in Euclidean space can be embedded into O(ε⁻² log n)-dimensional Euclidean space so that all pairwise distances are preserved up to a 1 ± ε factor.
Uses:
- Speed up geometric algorithms by first reducing the dimension of the input [Indyk-Motwani, 1998], [Indyk, 2001]
- Low-memory streaming algorithms for linear algebra problems [Sarlós, 2006], [LWMRT, 2007], [Clarkson-Woodruff, 2009]
- Essentially equivalent to RIP matrices from compressed sensing [Baraniuk et al., 2008], [Krahmer-Ward, 2011] (used for recovery of sparse signals)
10 How to prove the JL lemma
Distributional JL (DJL) lemma: For any 0 < ε, δ < 1/2 there exists a distribution D_{ε,δ} on k × d matrices with k = O(ε⁻² log(1/δ)) so that for any x of unit norm,
Pr_{S ~ D_{ε,δ}}[ |‖Sx‖₂² − 1| > ε ] < δ.
Proof of MJL: Set δ = 1/n² in DJL and let x be the (normalized) difference vector of some pair of points. Union bound over the (n choose 2) pairs.
Theorem (Alon, 2003): For every n, there exists a set of n points requiring target dimension k = Ω((ε⁻² / log(1/ε)) log n).
Theorem (Jayram-Woodruff, 2011; Kane-Meka-N., 2011): For DJL, k = Θ(ε⁻² log(1/δ)) is optimal.
12 Proving the JL lemma
Older proofs:
- [Johnson-Lindenstrauss, 1984], [Frankl-Maehara, 1988]: Random rotation, then projection onto the first k coordinates.
- [Indyk-Motwani, 1998], [Dasgupta-Gupta, 2003]: Random matrix with independent Gaussian entries.
- [Achlioptas, 2001]: Independent ±1 entries.
- [Clarkson-Woodruff, 2009]: O(log(1/δ))-wise independent ±1 entries.
- [Arriaga-Vempala, 1999], [Matousek, 2008]: Independent entries having mean 0, variance 1/k, and subgaussian tails.
Downside: Performing the embedding is dense matrix-vector multiplication, taking O(k · ‖x‖₀) time.
14 Fast JL Transforms
- [Ailon-Chazelle, 2006]: x ↦ PHDx, O(d log d + k³) time. P is a random sparse matrix, H is Hadamard, D has random ±1 entries on the diagonal.
- [Ailon-Liberty, 2008]: O(d log k + k²) time, also based on the fast Hadamard transform.
- [Ailon-Liberty, 2011] and [Krahmer-Ward, 2011]: O(d log d) for MJL, but with suboptimal k = O(ε⁻² log n log⁴ d).
Downside: Slow to embed sparse vectors: running time is Ω(min{k · ‖x‖₀, d log d}).
15 Where Do Sparse Vectors Show Up?
- Document as bag of words: x_i = number of occurrences of word i. Compare documents using cosine similarity. d = lexicon size; most documents aren't dictionaries.
- Network traffic: x_{i,j} = #bytes sent from i to j. d = 2^64 (2^256 in IPv6); most servers don't talk to each other.
- User ratings: x_i is a user's score for movie i on Netflix. d = #movies; most people haven't rated all movies.
- Streaming: x receives a stream of updates of the form "add v to x_i". Maintaining Sx requires calculating v · Se_i.
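The streaming point can be made concrete: if S has s nonzeros per column, processing an update "add v to x_i" touches only s sketch counters. A minimal sketch of this (my illustrative code, with h and σ stored as arrays rather than generated from hash functions, and arbitrary parameters):

```python
import numpy as np

rng = np.random.default_rng(5)
d, s, blocks = 1_000, 8, 32
k = s * blocks                       # target dimension

# Per-coordinate hash locations and signs for each of the s blocks.
h = rng.integers(0, blocks, size=(s, d))
sigma = rng.choice([-1.0, 1.0], size=(s, d))

y = np.zeros(k)                      # the maintained sketch Sx
x = np.zeros(d)                      # the implicit vector, kept only for checking

def update(i, v):
    """Apply the stream update 'add v to x_i': touches exactly s entries of y."""
    x[i] += v
    for r in range(s):
        y[r * blocks + h[r, i]] += v * sigma[r, i] / np.sqrt(s)

for i, v in [(3, 1.0), (17, -2.5), (3, 0.5), (999, 4.0)]:
    update(i, v)
# y now equals Sx without ever materializing S or x densely per update.
```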
17 Sparse JL transforms
One way to embed sparse vectors faster: use sparse matrices.
Let s = #non-zero entries per column of the embedding matrix (so embedding time is s · ‖x‖₀).

reference                    | value of s              | type
[JL84], [FM88], [IM98], ... | k ≈ 4ε⁻² log(1/δ)       | dense
[Achlioptas01]               | k/3                     | sparse Bernoulli
[WDALS09]                    | no proof                | hashing
[DKS10]                      | Õ(ε⁻¹ log³(1/δ))        | hashing
[KN10a], [BOR10]             | Õ(ε⁻¹ log²(1/δ))        |
[KN12]                       | O(ε⁻¹ log(1/δ))         | hashing (random codes)
19 Other related work
- CountSketch of [Charikar-Chen-Farach-Colton] gives s = O(log(1/δ)) (see [Thorup-Zhang]).
- Can recover (1 ± ε)‖x‖₂² from Sx, but not as ‖Sx‖₂² (not an embedding into ℓ₂).
- Not applicable in certain situations, e.g. in some nearest neighbor data structures, and when learning classifiers over projected vectors via stochastic gradient descent.
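The distinction matters: a CountSketch-style estimator recovers the norm through a median over repetitions, not as the norm of the sketch itself. A minimal numerical sketch of that estimator (my code, not from the talk; all parameters illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, buckets, reps = 5_000, 256, 11

x = np.zeros(d)
x[rng.choice(d, size=50, replace=False)] = rng.standard_normal(50)  # sparse input

estimates = []
for _ in range(reps):
    h = rng.integers(0, buckets, size=d)       # bucket for each coordinate
    sigma = rng.choice([-1.0, 1.0], size=d)    # random sign for each coordinate
    counters = np.zeros(buckets)
    np.add.at(counters, h, sigma * x)          # one repetition's sketch of x
    estimates.append(np.sum(counters**2))      # unbiased estimate of ||x||_2^2

estimate = np.median(estimates)                # median tames the heavy tail
true = np.sum(x**2)
print(abs(estimate - true) / true)             # small relative error
```

Because the guarantee lives in the median, the map x ↦ sketch is not itself a (1 ± ε) embedding into ℓ₂, which is exactly the limitation the slide notes.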
22 Sparse JL Constructions
- [DKS, 2010]: s = Θ(ε⁻¹ log²(1/δ))
- [this work]: s = Θ(ε⁻¹ log(1/δ)) (graph construction)
- [this work]: s = Θ(ε⁻¹ log(1/δ)) (block construction, blocks of height k/s)
23 Sparse JL Constructions (in matrix form)
[Figure: two k × d matrices — one with s nonzeros scattered in each column, one with exactly one nonzero per contiguous block of k/s rows in each column. Each black cell is ±1/√s at random.]

24 Sparse JL Constructions (nicknames)
[Figure: the same two pictures.] Left: the graph construction. Right: the block construction (blocks of height k/s).
27 Sparse JL notation (block construction)
Let h(j, r) and σ(j, r) be the random hash location and random sign for the copy of x_j in the rth block.
(Sx)_i = (1/√s) · Σ_{h(j,r)=i} x_j σ(j, r)
‖Sx‖₂² = ‖x‖₂² + (1/s) Σ_{r=1}^{s} Σ_{i≠j} x_i x_j σ(i, r) σ(j, r) 1_{h(i,r)=h(j,r)}
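A minimal sketch of the block construction in this notation (illustrative code with assumed parameters; h and σ are drawn fully at random here rather than from hash functions or a code):

```python
import numpy as np

rng = np.random.default_rng(2)
d, s, blocks = 2_000, 16, 64          # s blocks, each with k/s = 64 rows
k = s * blocks                        # target dimension k = 1024

x = rng.standard_normal(d)
x /= np.linalg.norm(x)                # unit-norm input

h = rng.integers(0, blocks, size=(s, d))        # h(j, r): row within block r
sigma = rng.choice([-1.0, 1.0], size=(s, d))    # sigma(j, r): random sign

# (Sx)_i = (1/sqrt(s)) * sum over {j : h(j,r) = i} of x_j * sigma(j,r)
y = np.zeros(k)
for r in range(s):
    np.add.at(y, r * blocks + h[r], sigma[r] * x)
y /= np.sqrt(s)

print(abs(np.linalg.norm(y) ** 2 - 1.0))  # small distortion, as the slide's identity predicts
```

Embedding time is s operations per nonzero of x; it is written densely above only for clarity.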
29 Sparse JL via Codes
Graph construction: a constant-weight binary code of weight s (each column's support pattern is a codeword).
Block construction: a code over a q-ary alphabet, q = k/s (the rth symbol of codeword j is h(j, r)).
Thm: It suffices that the code have distance s − O(s²/k) (block construction; 2s − O(s²/k) for the graph construction).
31 Analysis (block construction)
Let η_{i,j,r} indicate whether i and j collide in the rth block.
‖Sx‖₂² = ‖x‖₂² + Z, where Z = (1/s) Σ_r Z_r and Z_r = Σ_{i≠j} x_i x_j σ(i, r) σ(j, r) η_{i,j,r}.
Z is a quadratic form in σ, so apply known moment bounds for quadratic forms.
32 Analysis
Theorem (Hanson-Wright, 1971): Let σ_1, ..., σ_n be independent ±1 and let B ∈ R^{n×n} be symmetric. For λ > 0,
Pr[ |σᵀBσ − trace(B)| > λ ] < e^(−C · min{λ²/‖B‖_F², λ/‖B‖₂}).
Reminder: ‖B‖_F = sqrt(Σ_{i,j} B_{i,j}²); ‖B‖₂ is the largest magnitude of an eigenvalue of B.
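A quick empirical illustration of this concentration (my sketch; the choice of B and the trial count are arbitrary): for a fixed symmetric B, sampled values of σᵀBσ cluster around trace(B) at scale ‖B‖_F.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 200, 2_000

A = rng.standard_normal((n, n))
B = (A + A.T) / (2 * np.sqrt(n))     # a fixed symmetric matrix

# Sample the quadratic form sigma^T B sigma over random sign vectors sigma.
sigmas = rng.choice([-1.0, 1.0], size=(trials, n))
values = np.sum((sigmas @ B) * sigmas, axis=1)

deviation = values - np.trace(B)
frob = np.linalg.norm(B, 'fro')
# Hanson-Wright predicts tails decaying at scale ||B||_F (and ||B||_2 further out),
# so deviations beyond a few multiples of ||B||_F should be rare.
print(np.mean(np.abs(deviation) > 3 * frob))   # small fraction of trials
```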
34 Analysis
Z = (1/s) Σ_{r=1}^{s} Σ_{i≠j} x_i x_j σ(i, r) σ(j, r) η_{i,j,r} = σᵀBσ,
where σ is the concatenation of the s sign vectors and B = (1/s) · diag(B_1, ..., B_s) is block diagonal with (B_r)_{i,j} = x_i x_j η_{i,j,r}.
37 Frobenius norm bound
B = (1/s) · diag(B_1, ..., B_s), with (B_r)_{i,j} = x_i x_j η_{i,j,r}.
‖B‖_F² = (1/s²) Σ_{i≠j} x_i² x_j² · (#times i, j collide) < O(1/k) · ‖x‖₂⁴ = O(1/k) (good code!)
42 Operator norm bound
Write B_r = (1/s)(S_r − D), where S_r is the matrix with x_i² on the diagonal and x_i x_j η_{i,j,r} off the diagonal, and D = diag(x_1², ..., x_d²).
- ‖D‖₂ = ‖x‖_∞² ≤ 1
- S_r = Σ_{i=1}^{k/s} u_{r,i} u_{r,i}ᵀ, and these u_{r,i} give the eigenvectors of S_r (u_{r,i} is the projection of x onto the coordinates hashing to i in the rth block)
- ‖S_r‖₂ = max_i ‖u_{r,i}‖₂² ≤ ‖x‖₂² = 1
- ‖B_r‖₂ ≤ (1/s) max{‖S_r‖₂, ‖D‖₂} ≤ 1/s (both S_r and D are positive semidefinite)
44 Wrapping up the analysis
‖B‖_F² = O(1/k), ‖B‖₂ ≤ 1/s. Apply Hanson-Wright:
Pr[ |Z| > ε ] < e^(−C · min{ε²k, εs}),
so k = Ω(log(1/δ)/ε²) and s = Ω(log(1/δ)/ε) suffice. QED
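Spelling out the substitution as a worked step (constants absorbed into C and C′):

```latex
\Pr\bigl[\,|Z| > \varepsilon\,\bigr]
  \;<\; \exp\!\left(-C\,\min\!\left\{\frac{\varepsilon^2}{\|B\|_F^2},\;
        \frac{\varepsilon}{\|B\|_2}\right\}\right)
  \;\le\; \exp\!\left(-C'\,\min\{\varepsilon^2 k,\; \varepsilon s\}\right),
```

which is at most δ once ε²k and εs are both Ω(log(1/δ)), i.e. exactly when k = Ω(ε⁻² log(1/δ)) and s = Ω(ε⁻¹ log(1/δ)).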
48 Code-based Construction: Caveat
Need a sufficiently good code: each pair of codewords should agree on O(s²/k) coordinates.
Can get this with a random code by Chernoff + union bound over pairs, but then need s²/k ≳ log(d/δ), i.e.
s ≳ √(k log(d/δ)) = Ω(ε⁻¹ √(log(d/δ) log(1/δ))).
Slightly better: can assume d = O(ε⁻²/δ) by first embedding into this dimension with s = 1 (analysis: Chebyshev's inequality). Then can get away with s = O(ε⁻¹ √(log(1/(εδ)) log(1/δ))).
Can we avoid the loss incurred by this union bound?
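The s = 1 first-level embedding mentioned above is a single signed hashing. A small illustrative check (my code; constants in the intermediate dimension are assumed) that it preserves the norm to within ε with constant probability, as Chebyshev guarantees:

```python
import numpy as np

rng = np.random.default_rng(4)
d, eps, delta = 50_000, 0.25, 0.05
m = int(4 / (eps**2 * delta))        # intermediate dimension, O(1/(eps^2 * delta))

x = rng.standard_normal(d)
x /= np.linalg.norm(x)               # unit-norm input

# s = 1: every coordinate is hashed exactly once, with a random sign.
h = rng.integers(0, m, size=d)
sigma = rng.choice([-1.0, 1.0], size=d)
y = np.zeros(m)
np.add.at(y, h, sigma * x)

# Var(||y||^2) <= 2/m, so Chebyshev gives | ||y||^2 - 1 | <= eps w.p. >= 1 - delta.
print(abs(np.sum(y**2) - 1.0))
```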
52 Improving the Construction
Pick h at random. Analysis: directly bound the ℓ = log(1/δ)th moment of the error term Z, then apply Markov's inequality:
Pr[ |Z| > ε ] < ε^(−ℓ) · E[Z^ℓ]
(conditioning on the event "h gives a code" is too demanding).
Z = (1/s) Σ_{r=1}^{s} Σ_{i≠j} x_i x_j σ(i, r) σ(j, r) η_{i,j,r}, i.e. Z = (1/s) Σ_{r=1}^{s} Z_r.
E_{h,σ}[Z^ℓ] = (1/s^ℓ) Σ_{r_1 < ... < r_n} Σ_{t_1, ..., t_n > 1, Σ_i t_i = ℓ} ( ℓ choose t_1, ..., t_n ) Π_{i=1}^{n} E_{h,σ}[Z_{r_i}^{t_i}]
Bound the tth moment of any Z_r, then get the ℓth moment bound for Z by plugging into the above.
55 Bounding E[Z_r^t]
Z_r = Σ_{i≠j} x_i x_j σ(i, r) σ(j, r) η_{i,j,r}
Monomials appearing in the expansion of Z_r^t are in correspondence with directed multigraphs, e.g. (x_1 x_2)(x_3 x_4)(x_3 x_8)(x_4 x_8)(x_2 x_1) corresponds to a multigraph on vertices {1, 2, 3, 4, 8} with one edge per factor.
A monomial contributes to the expectation iff all degrees are even (otherwise the signs average to zero).
Analysis: Group the monomials appearing in Z_r^t according to graph isomorphism class, then do some combinatorics. Our analysis is tight up to a constant factor.
60 Bounding E[Z_r^t]
Let m = #connected components, v = #vertices, and d_u = degree of vertex u in the multigraph G.
E_{h,σ}[Z_r^t] = Σ_{G ∈ G_t} Σ_{i_1≠j_1, ..., i_t≠j_t : f((i_u, j_u)_{u=1}^t) = G} ( Π_{u=1}^{t} x_{i_u} x_{j_u} ) · E[ Π_{u=1}^{t} η_{i_u, j_u, r} ]
= Σ_{G ∈ G_t} (s/k)^{v−m} Σ_{i_1≠j_1, ..., i_t≠j_t : f((i_u, j_u)_{u=1}^t) = G} Π_{u=1}^{t} x_{i_u} x_{j_u}
(the collision events force the vertices of each connected component into a single hash location, so E[Π η] = (s/k)^{v−m}).
The inner sum over labelings is then bounded using the multinomial coefficient ( t choose d_1/2, ..., d_v/2 ) together with ‖x‖₂ = 1, and grouping graphs by (v, m) gives a bound of the form 2^{O(t)} Σ_{v,m} (s/k)^{v−m}, with factors of t^t, v^v, and Π_u d_u^{d_u} arising from the combinatorics.
61 Bounding E[Z_r^t]
In the end, can show
E[Z_r^t] ≤ 2^{O(t)} · { s/k if t < log(k/s); (t / log(k/s))^t otherwise }.
Plug this into the expression for E[Z^ℓ]. QED
63 Open Problems
- OPEN: Devise a distribution which can be sampled using few random bits. Current record: O(log d + log(1/ε) log(1/δ) + log(1/δ) log log(1/δ)) [Kane-Meka-N., 2011]. Existential: O(log d + log(1/δ)).
- OPEN: Prove a tight lower bound on the achievable sparsity in a JL distribution.
- OPEN: Can we have a JL matrix such that we can multiply by any k × k submatrix in k · polylog(d) time? (ultimate goal)
- OPEN: Embed any vector in Õ(d) time into the optimal k.
More informationNonlinear Estimators and Tail Bounds for Dimension Reduction in l 1 Using Cauchy Random Projections
Journal of Machine Learning Research 007 Submitted 0/06; Published Nonlinear Estimators and Tail Bounds for Dimension Reduction in l Using Cauchy Random Projections Ping Li Department of Statistical Science
More informationSimilarity searching, or how to find your neighbors efficiently
Similarity searching, or how to find your neighbors efficiently Robert Krauthgamer Weizmann Institute of Science CS Research Day for Prospective Students May 1, 2009 Background Geometric spaces and techniques
More informationAcceleration of Randomized Kaczmarz Method
Acceleration of Randomized Kaczmarz Method Deanna Needell [Joint work with Y. Eldar] Stanford University BIRS Banff, March 2011 Problem Background Setup Setup Let Ax = b be an overdetermined consistent
More informationCSCI8980 Algorithmic Techniques for Big Data September 12, Lecture 2
CSCI8980 Algorithmic Techniques for Big Data September, 03 Dr. Barna Saha Lecture Scribe: Matt Nohelty Overview We continue our discussion on data streaming models where streams of elements are coming
More informationYale University Department of Computer Science
Yale University Department of Computer Science Fast Dimension Reduction Using Rademacher Series on Dual BCH Codes Nir Ailon 1 Edo Liberty 2 YALEU/DCS/TR-1385 July 2007 1 Institute for Advanced Study, Princeton
More informationFast Moment Estimation in Data Streams in Optimal Space
Fast Moment Estimation in Data Streams in Optimal Space Daniel M. Kane Jelani Nelson Ely Porat David P. Woodruff Abstract We give a space-optimal algorithm with update time O(log 2 (1/ε) log log(1/ε))
More informationarxiv: v4 [cs.ds] 5 Apr 2013
Low Rank Approximation and Regression in Input Sparsity Time Kenneth L. Clarkson IBM Almaden David P. Woodruff IBM Almaden April 8, 2013 arxiv:1207.6365v4 [cs.ds] 5 Apr 2013 Abstract We design a new distribution
More informationA Unified Theorem on SDP Rank Reduction. yyye
SDP Rank Reduction Yinyu Ye, EURO XXII 1 A Unified Theorem on SDP Rank Reduction Yinyu Ye Department of Management Science and Engineering and Institute of Computational and Mathematical Engineering Stanford
More information8.1 Concentration inequality for Gaussian random matrix (cont d)
MGMT 69: Topics in High-dimensional Data Analysis Falll 26 Lecture 8: Spectral clustering and Laplacian matrices Lecturer: Jiaming Xu Scribe: Hyun-Ju Oh and Taotao He, October 4, 26 Outline Concentration
More informationNoisy Streaming PCA. Noting g t = x t x t, rearranging and dividing both sides by 2η we get
Supplementary Material A. Auxillary Lemmas Lemma A. Lemma. Shalev-Shwartz & Ben-David,. Any update of the form P t+ = Π C P t ηg t, 3 for an arbitrary sequence of matrices g, g,..., g, projection Π C onto
More informationNew ways of dimension reduction? Cutting data sets into small pieces
New ways of dimension reduction? Cutting data sets into small pieces Roman Vershynin University of Michigan, Department of Mathematics Statistical Machine Learning Ann Arbor, June 5, 2012 Joint work with
More informationFully Understanding the Hashing Trick
Fully Understanding the Hashing Trick Lior Kamma, Aarhus University Joint work with Casper Freksen and Kasper Green Larsen. Recommendation and Classification PG-13 Comic Book Super Hero Sci Fi Adventure
More informationFast Random Projections using Lean Walsh Transforms Yale University Technical report #1390
Fast Random Projections using Lean Walsh Transforms Yale University Technical report #1390 Edo Liberty Nir Ailon Amit Singer Abstract We present a k d random projection matrix that is applicable to vectors
More informationGraph Sparsification I : Effective Resistance Sampling
Graph Sparsification I : Effective Resistance Sampling Nikhil Srivastava Microsoft Research India Simons Institute, August 26 2014 Graphs G G=(V,E,w) undirected V = n w: E R + Sparsification Approximate
More informationApproximate Spectral Clustering via Randomized Sketching
Approximate Spectral Clustering via Randomized Sketching Christos Boutsidis Yahoo! Labs, New York Joint work with Alex Gittens (Ebay), Anju Kambadur (IBM) The big picture: sketch and solve Tradeoff: Speed
More informationSome notes on streaming algorithms continued
U.C. Berkeley CS170: Algorithms Handout LN-11-9 Christos Papadimitriou & Luca Trevisan November 9, 016 Some notes on streaming algorithms continued Today we complete our quick review of streaming algorithms.
More informationCompressive Sensing with Random Matrices
Compressive Sensing with Random Matrices Lucas Connell University of Georgia 9 November 017 Lucas Connell (University of Georgia) Compressive Sensing with Random Matrices 9 November 017 1 / 18 Overview
More informationSparse Fourier Transform (lecture 1)
1 / 73 Sparse Fourier Transform (lecture 1) Michael Kapralov 1 1 IBM Watson EPFL St. Petersburg CS Club November 2015 2 / 73 Given x C n, compute the Discrete Fourier Transform (DFT) of x: x i = 1 x n
More informationRandomized Numerical Linear Algebra: Review and Progresses
ized ized SVD ized : Review and Progresses Zhihua Department of Computer Science and Engineering Shanghai Jiao Tong University The 12th China Workshop on Machine Learning and Applications Xi an, November
More informationA fast randomized algorithm for approximating an SVD of a matrix
A fast randomized algorithm for approximating an SVD of a matrix Joint work with Franco Woolfe, Edo Liberty, and Vladimir Rokhlin Mark Tygert Program in Applied Mathematics Yale University Place July 17,
More informationOptimality of the Johnson-Lindenstrauss lemma
58th Annual IEEE Symposium on Foundations of Computer Science Optimality of the Johnson-Lindenstrauss lemma Kasper Green Larsen Computer Science Department Aarhus University Aarhus, Denmark larsen@cs.au.dk
More informationsublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU)
sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU) 0 overview Our Contributions: 1 overview Our Contributions: A near optimal low-rank
More informationFast Dimension Reduction Using Rademacher Series on Dual BCH Codes
DOI 10.1007/s005-008-9110-x Fast Dimension Reduction Using Rademacher Series on Dual BCH Codes Nir Ailon Edo Liberty Received: 8 November 2007 / Revised: 9 June 2008 / Accepted: September 2008 Springer
More informationSupremum of simple stochastic processes
Subspace embeddings Daniel Hsu COMS 4772 1 Supremum of simple stochastic processes 2 Recap: JL lemma JL lemma. For any ε (0, 1/2), point set S R d of cardinality 16 ln n S = n, and k N such that k, there
More informationNew constructions of RIP matrices with fast multiplication and fewer rows
New constructions of RIP matrices with fast multiplication and fewer rows Jelani Nelson Eric Price Mary Wootters July 8, 203 Abstract In this paper, we present novel constructions of matrices with the
More informationCOMS E F15. Lecture 22: Linearity Testing Sparse Fourier Transform
COMS E6998-9 F15 Lecture 22: Linearity Testing Sparse Fourier Transform 1 Administrivia, Plan Thu: no class. Happy Thanksgiving! Tue, Dec 1 st : Sergei Vassilvitskii (Google Research) on MapReduce model
More informationCS 229r: Algorithms for Big Data Fall Lecture 19 Nov 5
CS 229r: Algorithms for Big Data Fall 215 Prof. Jelani Nelson Lecture 19 Nov 5 Scribe: Abdul Wasay 1 Overview In the last lecture, we started discussing the problem of compressed sensing where we are given
More informationLecture 2 Sept. 8, 2015
CS 9r: Algorithms for Big Data Fall 5 Prof. Jelani Nelson Lecture Sept. 8, 5 Scribe: Jeffrey Ling Probability Recap Chebyshev: P ( X EX > λ) < V ar[x] λ Chernoff: For X,..., X n independent in [, ],
More informationLecture 4 February 2nd, 2017
CS 224: Advanced Algorithms Spring 2017 Prof. Jelani Nelson Lecture 4 February 2nd, 2017 Scribe: Rohil Prasad 1 Overview In the last lecture we covered topics in hashing, including load balancing, k-wise
More informationNearest Neighbor Preserving Embeddings
Nearest Neighbor Preserving Embeddings Piotr Indyk MIT Assaf Naor Microsoft Research Abstract In this paper we introduce the notion of nearest neighbor preserving embeddings. These are randomized embeddings
More informationImproved Bounds on the Dot Product under Random Projection and Random Sign Projection
Improved Bounds on the Dot Product under Random Projection and Random Sign Projection Ata Kabán School of Computer Science The University of Birmingham Birmingham B15 2TT, UK http://www.cs.bham.ac.uk/
More informationDeclaring Independence via the Sketching of Sketches. Until August Hire Me!
Declaring Independence via the Sketching of Sketches Piotr Indyk Andrew McGregor Massachusetts Institute of Technology University of California, San Diego Until August 08 -- Hire Me! The Problem The Problem
More informationTutorial: Sparse Recovery Using Sparse Matrices. Piotr Indyk MIT
Tutorial: Sparse Recovery Using Sparse Matrices Piotr Indyk MIT Problem Formulation (approximation theory, learning Fourier coeffs, linear sketching, finite rate of innovation, compressed sensing...) Setup:
More informationPolynomial Representations of Threshold Functions and Algorithmic Applications. Joint with Josh Alman (Stanford) and Timothy M.
Polynomial Representations of Threshold Functions and Algorithmic Applications Ryan Williams Stanford Joint with Josh Alman (Stanford) and Timothy M. Chan (Waterloo) Outline The Context: Polynomial Representations,
More informationLecture 18 Nov 3rd, 2015
CS 229r: Algorithms for Big Data Fall 2015 Prof. Jelani Nelson Lecture 18 Nov 3rd, 2015 Scribe: Jefferson Lee 1 Overview Low-rank approximation, Compression Sensing 2 Last Time We looked at three different
More information