Faster Johnson-Lindenstrauss style reductions
1 Faster Johnson-Lindenstrauss style reductions. Aditya Menon, August 23, 2007.
2 Outline
1. Introduction: Dimensionality reduction; The Johnson-Lindenstrauss Lemma; Speeding up computation
2. The Fast Johnson-Lindenstrauss Transform: Sparser projections; Trouble with sparse vectors?; Summary
3. Bounding the mapping: The Walsh-Hadamard transform; Error-correcting codes; Putting it together
4. References
3 Introduction / Dimensionality reduction: Distances
For high-dimensional vector data, it is of interest to have a notion of distance between two vectors. Recall that the $\ell_p$ norm of a vector $x$ is $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$. The $\ell_2$ norm corresponds to the standard Euclidean norm of a vector. The $\ell_\infty$ norm is the maximal absolute value of any component: $\|x\|_\infty = \max_i |x_i|$.
4 Introduction / Dimensionality reduction: Dimensionality reduction
Problem: suppose we're given an input vector $x \in \mathbb{R}^d$. We want to reduce the dimensionality of $x$ to some $k < d$, while preserving the $\ell_p$ norm. We can think of this as a metric embedding problem: can we embed $\ell_p^d$ into $\ell_p^k$? Formally, we have the following problem: given an $x \in \mathbb{R}^d$ and some parameters $p, \epsilon$, can we find a $y \in \mathbb{R}^k$, for some $k = f(\epsilon)$, so that $(1-\epsilon)\|x\|_p \le \|y\|_p \le (1+\epsilon)\|x\|_p$?
5 Introduction / The Johnson-Lindenstrauss Lemma: The Johnson-Lindenstrauss Lemma
The Johnson-Lindenstrauss Lemma [5] is the archetypal result for $\ell_2$ dimensionality reduction. It tells us that for $n$ points, there is an $\epsilon$-embedding of $\ell_2^d$ into $\ell_2^{O(\log n/\epsilon^2)}$.
Theorem. Suppose $u_1, \ldots, u_n \in \mathbb{R}^d$. Then, for $\epsilon > 0$ and $k = O(\log n/\epsilon^2)$, there is a mapping $f : \mathbb{R}^d \to \mathbb{R}^k$ so that
$(\forall i, j)\quad (1-\epsilon)\|u_i - u_j\|_2 \le \|f(u_i) - f(u_j)\|_2 \le (1+\epsilon)\|u_i - u_j\|_2.$
6 Introduction / The Johnson-Lindenstrauss Lemma: Johnson-Lindenstrauss in practice
The proof of the Johnson-Lindenstrauss lemma is non-constructive (unfortunately!). In practice, we use the probabilistic method to do a Johnson-Lindenstrauss style reduction: we insert randomness at the cost of an exact guarantee, so the guarantee becomes probabilistic.
7 Introduction / The Johnson-Lindenstrauss Lemma: Johnson-Lindenstrauss in practice
Theorem (standard version). Suppose $u_1, \ldots, u_n \in \mathbb{R}^d$. Then, for $\epsilon > 0$ and $k = O(\beta \log n/\epsilon^2)$, the mapping $f(u_i) = \frac{1}{\sqrt{k}} u_i R$, where $R$ is a $d \times k$ matrix of i.i.d. Gaussian variables, satisfies, with probability at least $1 - \frac{1}{n^\beta}$,
$(\forall i, j)\quad (1-\epsilon)\|u_i - u_j\|_2 \le \|f(u_i) - f(u_j)\|_2 \le (1+\epsilon)\|u_i - u_j\|_2.$
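A minimal numpy sketch of this mapping; the function name jl_gaussian and the leading constant 4 in the choice of k are illustrative assumptions, not part of the stated theorem.

```python
import numpy as np

def jl_gaussian(X, eps=0.1, beta=1.0, seed=None):
    """Project the rows of X (n x d) via a scaled i.i.d. Gaussian matrix."""
    n, d = X.shape
    # k = O(beta * log n / eps^2); the constant 4 is an arbitrary illustrative choice
    k = int(np.ceil(4 * beta * np.log(n) / eps ** 2))
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((d, k))   # d x k matrix of i.i.d. N(0, 1) entries
    return (X @ R) / np.sqrt(k)       # row i is f(u_i) = u_i R / sqrt(k)
```

With high probability, all pairwise distances between the output rows are within a $(1 \pm \epsilon)$ factor of the originals.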
8 Introduction / Speeding up computation: Achlioptas' improvement
Achlioptas [1] gave an even simpler matrix construction:
$R_{ij} = \sqrt{3} \times \begin{cases} +1 & \text{with probability } 1/6 \\ 0 & \text{with probability } 2/3 \\ -1 & \text{with probability } 1/6 \end{cases}$
This is two-thirds sparse, and simpler to construct than a Gaussian matrix, with no loss in accuracy!
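A sketch of the same pipeline with Achlioptas' sparse matrix; the $\sqrt{3}$ rescaling keeps the entries unit-variance, and the $1/\sqrt{k}$ output scaling matches the Gaussian version above.

```python
import numpy as np

def jl_achlioptas(X, k, seed=None):
    """JL projection with the database-friendly {+1, 0, -1} matrix."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Entries: sqrt(3) * (+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6)
    R = np.sqrt(3) * rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1/6, 2/3, 1/6])
    return (X @ R) / np.sqrt(k)
```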
9 Introduction / Speeding up computation: A question
Two-thirds sparsity is a good speedup in practice, but the density is still O(dk), so computing the mapping is still an O(dk) operation asymptotically. Let
$\mathcal{A} = \{A : \text{for all unit } x \in \mathbb{R}^d, \text{ with v.h.p., } (1-\epsilon) \le \|Ax\|_2 \le (1+\epsilon)\}$
Question: for which $A \in \mathcal{A}$ can $Ax$ be computed quicker than O(dk)?
10 Introduction / Speeding up computation: The answer?
We look at two approaches that allow for quicker computation. First is the Fast Johnson-Lindenstrauss transform, based on a Fourier transform. Next is the Ailon-Liberty transform, based on a Fourier transform and error-correcting codes!
11 The Fast Johnson-Lindenstrauss Transform
Ailon and Chazelle [2] proposed the Fast Johnson-Lindenstrauss transform. It can speed up $\ell_2$ reduction from O(dk) to (roughly) O(d log d). How? Make the projection matrix even sparser; we need some tricks to solve the problems associated with this. Let's reverse engineer the construction...
12 The Fast Johnson-Lindenstrauss Transform / Sparser projections: Sparser projection matrix
Use the projection matrix
$P_{ij} = \begin{cases} N\left(0, \frac{1}{q}\right) & \text{with probability } q \\ 0 & \text{with probability } 1-q \end{cases}$
where $q = \min\left\{\Theta\left(\frac{\log^2 n}{d}\right), 1\right\}$.
The density of the matrix is $O\left(\frac{1}{\epsilon^2} \min\{\log^3 n, d \log n\}\right)$. In practice, this is typically significantly sparser than Achlioptas' matrix.
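A sketch of drawing this matrix, assuming (for illustration) a hidden constant of 1 inside the $\Theta(\cdot)$ defining q:

```python
import numpy as np

def sparse_gaussian_projection(d, k, n, seed=None):
    """Draw the k x d matrix P: N(0, 1/q) entries w.p. q, zero otherwise."""
    rng = np.random.default_rng(seed)
    q = min(np.log(n) ** 2 / d, 1.0)      # q = min(Theta(log^2 n / d), 1)
    support = rng.random((k, d)) < q      # Bernoulli(q) mask of nonzero entries
    values = rng.normal(0.0, np.sqrt(1.0 / q), size=(k, d))
    return np.where(support, values, 0.0)
```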
13 The Fast Johnson-Lindenstrauss Transform / Trouble with sparse vectors?: What do we lose?
We can follow standard concentration-proof methods, but we end up needing to assume that x is bounded, namely that its information is spread out. We fail on vectors like $x = (1, 0, \ldots, 0)$; i.e. sparse data and a sparse projection don't mix well. So are we forced to choose between generality and usefulness? Not if we try to insert randomness...
15 The Fast Johnson-Lindenstrauss Transform / Trouble with sparse vectors?: A clever idea
Can we randomly transform x so that $\|\Phi(x)\|_2 = \|x\|_2$, and $\|\Phi(x)\|_\infty$ is bounded with v.h.p.?
Answer: yes! Use a Fourier transform, $\Phi = F$. It is distance preserving, and it has an uncertainty principle: a signal and its Fourier transform cannot both be concentrated. Use the FFT to give an O(d log d) random mapping. Details on the specifics in the next section...
17 The Fast Johnson-Lindenstrauss Transform / Trouble with sparse vectors?: Applying a Fourier transform
The Fourier transform will guarantee that a concentrated input is spread out: if $x$ is supported on only a few components, then $\|Fx\|_\infty = o(1)$. But now we will be in trouble if the input is uniformly distributed! To deal with this, do a random sign change: $x' = Dx$, where $D$ is a random diagonal $\pm 1$ matrix. Now we get a guarantee of spread with high probability, so the random Fourier transform gives us back generality.
18 The Fast Johnson-Lindenstrauss Transform / Trouble with sparse vectors?: Random sign change
The sign change mapping $Dx$ will give us
$x' = \begin{bmatrix} d_1 x_1 \\ d_2 x_2 \\ \vdots \\ d_d x_d \end{bmatrix} = \begin{bmatrix} \pm x_1 \\ \pm x_2 \\ \vdots \\ \pm x_d \end{bmatrix}$
where each $\pm$ is attained with equal probability. This is clearly norm preserving.
19 The Fast Johnson-Lindenstrauss Transform / Trouble with sparse vectors?: Putting it together
So, we compute the mapping $f : x \mapsto PF(Dx)$. The runtime will be
$O\left(d \log d + \min\left\{\frac{d \log n}{\epsilon^2}, \frac{\log^3 n}{\epsilon^2}\right\}\right)$
Under some loose conditions, the runtime is $O(\max\{d \log d, k^3\})$. If $k \in [\Omega(\log d), O(\sqrt{d})]$, this is quicker than the O(dk) simple mapping. In practice, the upper bound is reasonable; the lower bound might not be, though.
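A sketch of the whole mapping, using the Walsh-Hadamard matrix as the Fourier step F (applied here as a dense multiply for brevity; the fast O(d log d) version appears later) and scaled so that $\mathbb{E}\|f(x)\|^2 = \|x\|^2$. The constants are illustrative, and d must be a power of 2.

```python
import numpy as np
from scipy.linalg import hadamard

def fjlt(x, k, n, seed=None):
    """Sketch of f : x -> P F (D x) for a single vector x of length d."""
    d = x.shape[0]
    rng = np.random.default_rng(seed)
    Dx = rng.choice([-1.0, 1.0], size=d) * x    # random sign change D
    HDx = hadamard(d) @ Dx / np.sqrt(d)         # norm-preserving Fourier step F
    q = min(np.log(n) ** 2 / d, 1.0)            # sparsity parameter
    support = rng.random((k, d)) < q
    P = np.where(support, rng.normal(0.0, np.sqrt(1.0 / q), (k, d)), 0.0)
    return P @ HDx / np.sqrt(k)                 # scaling gives E||f(x)||^2 = ||x||^2
```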
20 The Fast Johnson-Lindenstrauss Transform / Summary: Summary
We tried increasing sparsity with disregard for generality, then used randomization to get back generality (probabilistically). The key ingredient was a Fourier transform, with a randomization step first.
21 Ailon and Liberty [3] improved the runtime from O(d log d) to O(d log k), for $k = O(d^{1/2-\delta})$, $\delta > 0$.
Idea: sparsity isn't the only way to speed up computation time. We can also speed up the runtime when the projection matrix has a special structure. So find a matrix with a convenient structure which will satisfy the JL property.
22 Operator norm
We need something called the operator norm in our analysis. The operator norm of a transformation matrix A is
$\|A\|_{p \to q} = \sup_{\|x\|_p = 1} \|Ax\|_q$
i.e. the maximal q-norm of the transformation of unit $\ell_p$-norm points. A fact we will need to employ:
$\|A\|_{p_1 \to p_2} = \|A^T\|_{q_2 \to q_1}, \quad \text{where } \frac{1}{p_1} + \frac{1}{q_1} = 1 \text{ and } \frac{1}{p_2} + \frac{1}{q_2} = 1.$
23 Reverse engineering
Let's say the mapping is a matrix multiplication. In particular, say we have a mapping of the form $f : x \mapsto BDx$, where B is some k × d matrix with unit columns, and D is a diagonal matrix whose entries are randomly ±1 (doing a random sign change again). Now we just need to see what properties B must satisfy in order for $\|BDx\|_2 \approx \|x\|_2$.
24 Bounding the mapping: Bounding the mapping
It is easy to see that
$BDx = \begin{bmatrix} B_{11} d_1 x_1 + \cdots + B_{1d} d_d x_d \\ \vdots \\ B_{k1} d_1 x_1 + \cdots + B_{kd} d_d x_d \end{bmatrix}$
Write this as $BDx = Mz$, where $M^{(i)} = x_i B^{(i)}$ (the i-th column of M is $x_i$ times the i-th column of B) and $z = [d_1 \, \ldots \, d_d]^T$. There is a special name for a vector like Mz...
25 Bounding the mapping: Rademacher series
Definition. If M is an arbitrary k × d real matrix, and $z \in \mathbb{R}^d$ is such that
$z_i = \begin{cases} +1 & \text{with probability } 1/2 \\ -1 & \text{with probability } 1/2 \end{cases}$
then Mz is called a Rademacher random variable. This is a vector whose entries are arbitrary sums/differences of the entries in the rows of M. Such a variable is interesting because of a powerful theorem...
26 Bounding the mapping: Talagrand's theorem
Theorem. Suppose M, z are as above. Let $Z = \|Mz\|_p$, and let
$\sigma = \|M\|_{2 \to p}, \qquad \mu = \mathrm{median}(Z).$
Then $\Pr[|Z - \mu| > t] \le 4 e^{-t^2/8\sigma^2}$ (see [6]).
Here σ (the "deviation") is the maximal p-norm of the image of the Euclidean unit sphere. The theorem says that the norm of a Rademacher variable is sharply concentrated about its median.
27 Bounding the mapping: Implications for us
Our mapping, BDx, has given us a Rademacher random variable. We know that we can apply Talagrand's theorem to get a concentration result. So, all we need to do is find out what the median and deviation are...
28 Bounding the mapping: Deviation
Let $Y = \|BDx\|_2 = \|Mz\|_2$. The deviation is
$\sigma = \sup_{\|y\|_2 = 1} \|y^T M\|_2 = \sup_{\|y\|_2 = 1} \left(\sum_{i=1}^d x_i^2 (y^T B^{(i)})^2\right)^{1/2} \le \|x\|_4 \sup_{\|y\|_2 = 1} \left(\sum_{i=1}^d (y^T B^{(i)})^4\right)^{1/4}$
by Cauchy-Schwarz, and the last supremum is exactly $\|B^T\|_{2 \to 4}$, so $\sigma \le \|x\|_4 \|B^T\|_{2 \to 4}$.
29 Bounding the mapping: What do we need?
So, $\sigma \le \|x\|_4 \|B^T\|_{2 \to 4}$. Fact: $|1 - \mu| \le 32\sigma$. We can combine these to get
$\Pr[|Y - 1| > t] \le c_0 e^{-c_1 t^2 / (\|x\|_4^2 \|B^T\|_{2 \to 4}^2)}$
Result: we need to control both $\|x\|_4$ and $\|B^T\|_{2 \to 4}$, i.e. we want them both to be small. If we manage this, we've got our concentration bound.
31 Bounding the mapping: The two ingredients
To get the concentration bound, we need to ensure that $\|x\|_4$ and $\|B^T\|_{2 \to 4}$ are sufficiently small.
How to control $\|x\|_4$? Use repeated Fourier/Walsh-Hadamard transforms.
How to control $\|B^T\|_{2 \to 4}$? Use error-correcting codes.
34 The Walsh-Hadamard transform: Controlling $\|x\|_4$
Problem: the input x is adversarial, so how do we make $\|x\|_4$ small?
Solution: use an isometric mapping Φ, with a guarantee that $\|\Phi x\|_4$ is small with very high probability.
Problem: what is such a Φ?
(Final!) Solution: back to the Fourier transform!
38 The Walsh-Hadamard transform: The Discrete Fourier transform
The discrete Fourier transform of $\{a_0, a_1, \ldots, a_{N-1}\}$ is
$\hat{a}_k = \sum_{n=0}^{N-1} a_n e^{2\pi i k n/N} = \sum_{n=0}^{N-1} a_n \left(e^{2\pi i k/N}\right)^n$
We can think of it as a polynomial evaluation: if $P(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{N-1} x^{N-1}$, then $\hat{a}_k = P\left(e^{2\pi i k/N}\right)$.
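A quick numerical check of this polynomial-evaluation view. Note that numpy's FFT uses the exponent $-2\pi i k n/N$, so we evaluate at the conjugate roots of unity:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])            # coefficients a_0, ..., a_{N-1}
N = len(a)
w = np.exp(-2j * np.pi * np.arange(N) / N)    # one evaluation point per index k
# np.polyval expects the highest-degree coefficient first, hence a[::-1]
poly_eval = np.array([np.polyval(a[::-1], wk) for wk in w])
assert np.allclose(poly_eval, np.fft.fft(a))  # P(w^k) matches the FFT output
```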
39 The Walsh-Hadamard transform: The finite-field Fourier transform
Notice that $\omega = e^{2\pi i/N}$ satisfies $\omega^N = 1$: it is a primitive N-th root of 1. The transform $\hat{a}_k = P(\omega^k)$ makes sense for any primitive root ω.
40 The Walsh-Hadamard transform: The multi-dimensional Fourier transform
We can also consider the transform of multi-dimensional data.
1-D case: $\hat{a}_k = \sum_{n=0}^{N-1} a_n \omega^{kn}$
$\upsilon$-D case: if $n = (n_1, \ldots, n_\upsilon)$ and $k = (k_1, \ldots, k_\upsilon)$, then $\hat{a}_k = \sum_{n_1, \ldots, n_\upsilon = 0}^{N-1} a_n \omega^{k \cdot n}$
41 The Walsh-Hadamard transform: The Walsh-Hadamard transform
Consider the case N = 2, ω = -1 [7]:
$\hat{a}_{k_1, k_2} = \sum_{n_1, n_2 = 0}^{1} a_{n_1, n_2} (-1)^{k_1 n_1 + k_2 n_2}$
This is called the Walsh-Hadamard transform. Intuition: instead of using sinusoidal basis functions, use square-wave functions; the square waves are called Walsh functions. Why not the standard discrete FT? We use a technical property of the Walsh-Hadamard transform matrix...
42 The Walsh-Hadamard transform: Fourier transform on the binary hyper-cube
Suppose we work with $\mathbb{F}_2 = \{0, 1\}$. We can encode the Fourier transform with the Walsh-Hadamard matrix $H_d$,
$H_d(i, j) = \frac{1}{\sqrt{d}} (-1)^{\langle i-1, j-1 \rangle}$
where $\langle i, j \rangle$ is the dot-product of i and j as expressed in binary.
Fact: $H_d = \frac{1}{\sqrt{2}} \begin{bmatrix} H_{d/2} & H_{d/2} \\ H_{d/2} & -H_{d/2} \end{bmatrix}$
Corollary: we can compute $H_d x$ in O(d log d) time.
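A sketch of the resulting O(d log d) algorithm, mirroring the recursive block structure above in iterative form (d must be a power of 2):

```python
import numpy as np

def fwht(x):
    """Normalized fast Walsh-Hadamard transform: computes H_d x in O(d log d)."""
    y = np.asarray(x, dtype=float).copy()
    d = len(y)
    h = 1
    while h < d:
        # Butterfly step for one level of the [[H, H], [H, -H]] recursion
        for i in range(0, d, 2 * h):
            a = y[i:i + h].copy()
            b = y[i + h:i + 2 * h].copy()
            y[i:i + h] = a + b
            y[i + h:i + 2 * h] = a - b
        h *= 2
    return y / np.sqrt(d)  # one 1/sqrt(2) per level makes the transform orthonormal
```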
43 The Walsh-Hadamard transform: Example of a Hadamard matrix
When d = 4, we get
$H_4 = \frac{1}{2} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}$
Note that the entries (up to normalization) are always ±1.
44 The Walsh-Hadamard transform: Fourier again?
Let $\Phi : x \mapsto H_d D_0 x$, with $D_0$ as before a random diagonal ±1 matrix. We already know that it will preserve the $\ell_2$ norm. But is $\|\Phi(x)\|_4$ small?
Answer: yes, by another application of Talagrand's theorem!
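Reusing the fwht sketch from above, a quick check that Φ spreads out the worst-case sparse input while preserving its length:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024
x = np.zeros(d)
x[0] = 1.0                            # the problematic input (1, 0, ..., 0)
D0 = rng.choice([-1.0, 1.0], size=d)  # random diagonal sign change
y = fwht(D0 * x)                      # Phi(x) = H_d D_0 x
print(np.linalg.norm(y))              # l2 norm preserved: 1.0
print(np.linalg.norm(y, ord=4))       # l4 norm drops to d**(-1/4) ~ 0.177
```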
46 The Walsh-Hadamard transform: Towards Talagrand
We need σ, µ for Talagrand's theorem. Write $\Phi(x) = Mz$ as before, where $M^{(i)} = x_i H^{(i)}$. Estimate the deviation:
$\sigma = \|M\|_{2 \to 4} = \|M^T\|_{4/3 \to 2} \quad \text{(from the earlier fact)}$
$= \sup_{\|y\|_{4/3} = 1} \left(\sum_i x_i^2 (y^T H^{(i)})^2\right)^{1/2} \le \|x\|_4 \sup_{\|y\|_{4/3} = 1} \left(\sum_i (y^T H^{(i)})^4\right)^{1/4} = \|x\|_4 \|H\|_{4/3 \to 4}$
by Cauchy-Schwarz.
47 The Walsh-Hadamard transform: Some magic
We now employ the following theorem [4].
Theorem (Hausdorff-Young). For any $p \in [1, 2]$, if H is the (normalized) Hadamard matrix and $\frac{1}{p} + \frac{1}{q} = 1$, then
$\|H\|_{p \to q} \le d^{\frac{1}{2} - \frac{1}{p}}$
As a result, for $p = \frac{4}{3}$,
$\sigma \le \|x\|_4 \cdot d^{-1/4}$
Further, we have the following fact (see [3] for the proof!): $\mu = O\left(\frac{1}{d^{1/4}}\right)$.
48 The Walsh-Hadamard transform: Getting the desired result
With the above σ, µ, an application of Talagrand, along with the assumption $k = O(d^{1/2-\delta})$, reveals
$\|HD_0 x\|_4 \le c_0 d^{-1/4} + c_1 d^{-\delta/2} \|x\|_4$
If we compose the mapping,
$\|HD_1(HD_0 x)\|_4 \le c_0 d^{-1/4} + c_0 c_1 d^{-1/4 - \delta/2} + c_1^2 d^{-\delta} \|x\|_4$
If we repeat this $r = \frac{1}{2\delta}$ times,
$\|HD_{r-1} HD_{r-2} \cdots HD_0 x\|_4 = O\left(d^{-1/4}\right)$
49 The Walsh-Hadamard transform: Our resultant transform
To control $\|x\|_4$, use the composed transform
$\Phi^{(r)} : x \mapsto HD_{r-1} HD_{r-2} \cdots HD_0 x$
We manage to preserve $\|x\|_2$, and contract $\|\Phi^{(r)} x\|_4$. The runtime is $O\left(\frac{d \log d}{\delta}\right)$.
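A sketch of the composed transform, again reusing the fwht helper; each round draws a fresh diagonal sign matrix:

```python
import numpy as np

def phi_r(x, r, seed=None):
    """Composed transform Phi^(r) : x -> H D_{r-1} ... H D_0 x."""
    rng = np.random.default_rng(seed)
    y = np.asarray(x, dtype=float)
    for _ in range(r):                 # r ~ 1/(2*delta) rounds
        signs = rng.choice([-1.0, 1.0], size=len(y))
        y = fwht(signs * y)            # one H D_i application, O(d log d)
    return y
```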
50 Error-correcting codes: Error-correcting codes
The Hadamard matrix also has a connection to error-correcting codes. Such codes look to represent one's message in such a way that it can be decoded correctly even if there are some errors during transmission. Suppose we want to send out a message to a decoder in a way that allows for at most d errors, i.e. we can recover from d or fewer errors in the transmission. Fact: by choosing our code-words from the matrix $H_{2d}$, we can correct up to d errors.
51 Error-correcting codes: Code matrix
An m × d matrix A is called a code matrix if
$A = \sqrt{\frac{d}{m}} \begin{bmatrix} H_d(i_1, :) \\ H_d(i_2, :) \\ \vdots \\ H_d(i_m, :) \end{bmatrix}$
i.e. we pick out only m of the d rows of the Hadamard matrix.
52 Error-correcting codes: Independence in codes
A code matrix is called a-wise independent if, for any a rows and any pattern of signs, exactly $d/2^a$ of the columns agree with the pattern in those a places. Independence is very useful for us:
Theorem. Suppose B is a k × d, 4-wise independent code matrix. Then
$\|B^T\|_{2 \to 4} = O\left(\frac{d^{1/4}}{\sqrt{k}}\right)$
53 Error-correcting codes: Proof of theorem
Recall that we need to bound $\|B^T\|_{2 \to 4} = \sup_{\|y\|_2 = 1} \|y^T B\|_4$. Consider:
$\|y^T B\|_4^4 = d \, \mathbb{E}\left[(y^T B_{(j)})^4\right] = \frac{d}{k^2} \sum_{i_1, i_2, i_3, i_4} \mathbb{E}\left[y_{i_1} y_{i_2} y_{i_3} y_{i_4} b_{i_1} b_{i_2} b_{i_3} b_{i_4}\right] = \frac{d}{k^2} \left(3\|y\|_2^4 - 2\|y\|_4^4\right) \le \frac{3d}{k^2}$
Consequently,
$\|B^T\|_{2 \to 4} \le \frac{(3d)^{1/4}}{\sqrt{k}}$
54 Error-correcting codes: Making our matrix
We're set if we can get a k × d, 4-wise independent code matrix. Problem: how do we make such a matrix?
Fact: there exists a 4-wise independent code matrix of size k × BCH(k), where BCH(k) = Θ(k²). It is called the BCH code matrix. Which is good, because...
Fact: by padding and copy-pasting, we retain independence. In particular, we can construct a k × d matrix from a k × BCH(k) matrix (see the sketch below):
$B = [\underbrace{B_{BCH} \; B_{BCH} \; \ldots \; B_{BCH}}_{d/BCH(k) \text{ copies}}]$
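A sketch of just the padding/copy-paste step; constructing the 4-wise independent base matrix from a dual BCH code is not reproduced here, so B_bch is assumed given:

```python
import numpy as np

def pad_and_copy(B_bch, d):
    """Tile a k x BCH(k) code matrix into a k x d matrix by concatenation."""
    k, m = B_bch.shape
    assert d % m == 0, "assumes the block width BCH(k) divides d"
    return np.tile(B_bch, d // m)      # B = [B_bch B_bch ... B_bch]
```

Tiling copies whole columns, so the columns stay unit-norm and the independence property of the base matrix is retained.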
57 Error-correcting codes: Time to apply the matrix
What is the time to compute the mapping $x \mapsto Bx$? We have to do $\frac{d}{BCH(k)}$ mappings $B_{BCH} x_{BCH}$, one per block of x. Each such mapping can be done via a Walsh-Hadamard transform, by the construction of BCH codes, and so takes time $O(BCH(k) \log BCH(k))$. The total runtime is therefore O(d log k).
58 Putting it together: Merging results
Use the randomized Fourier transform to keep $\|x\|_4$ small: O(d log d) time. Use the error-correcting code matrix to keep $\|B^T\|_{2 \to 4}$ small: O(d log k) time.
Result: we get the concentration bound!
59 Putting it together: Runtime
The runtime is still going to be O(d log d). Question: can we speed up the computation of $\Phi^{(r)}$?
Answer: yes - use the same block idea as with the error-correcting codes. Some rather technical calculation reveals that this will still work.
61 Putting it together: Blocked transform
Choose $\beta = BCH(k) \cdot k^\delta = \Theta(k^{2+\delta})$. Let
$H' = \begin{bmatrix} H_1 & & & \\ & H_2 & & \\ & & \ddots & \\ & & & H_{d/\beta} \end{bmatrix}$
where each $H_i$ is a Walsh-Hadamard matrix of size β × β.
Fact: the above mapping can replace $\Phi^{(r)}$. The mapping $H'Dx$ can be computed in time O(d log k), so our total runtime is O(d log k).
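A sketch of applying the block-diagonal transform, one β-point FWHT per block (assumes β is a power of 2 dividing len(x), and reuses the fwht helper from above):

```python
import numpy as np

def blocked_transform(x, beta, seed=None):
    """Apply an independent sign flip and a beta-point FWHT to each block."""
    rng = np.random.default_rng(seed)
    y = np.asarray(x, dtype=float).copy()
    for i in range(0, len(y), beta):   # d/beta blocks
        signs = rng.choice([-1.0, 1.0], size=beta)
        y[i:i + beta] = fwht(signs * y[i:i + beta])
    return y                           # O(d log beta) = O(d log k) total
```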
62 Putting it together: A tabular comparison
Runtimes of the three approaches (standard JL, Fast JLT, and Ailon-Liberty), ordered fastest to slowest in each regime (from [3]):
- $k = o(\log d)$: AL, JL, FJLT
- $k \in [\omega(\log d), o(\mathrm{poly}(d))]$: AL, FJLT, JL
- $k \in [\Omega(\mathrm{poly}(d)), o((d \log d)^{1/3})]$: AL and FJLT (tied), JL
- $k \in [\omega((d \log d)^{1/3}), O(d^{1/2-\delta})]$: AL, FJLT, JL
63 Putting it together: Conclusion
$\ell_2$ dimensionality reduction is based on the Johnson-Lindenstrauss lemma. The standard approach takes O(dk) time to perform the reduction. By sparsifying, and compensating with a randomized Fourier transform, we can reduce the runtime to roughly O(d log d) via the Fast Johnson-Lindenstrauss transform [2]. By using error-correcting codes and a randomized Fourier transform, we can reduce the runtime to roughly O(d log k) via Ailon and Liberty's transform [3].
Open questions: can one extend this to $k = O(d^{1-\delta})$? To $k = \Omega(d)$?
64 References
[1] Achlioptas, D. Database-friendly random projections. In PODS '01: Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (New York, NY, USA, 2001), ACM Press.
[2] Ailon, N., and Chazelle, B. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In STOC '06: Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing (New York, NY, USA, 2006), ACM Press.
[3] Ailon, N., and Liberty, E. Fast dimension reduction using Rademacher series on dual BCH codes. Tech. Rep. TR07-070, Electronic Colloquium on Computational Complexity, 2007.
[4] Bergh, J., and Löfström, J. Interpolation Spaces. Springer-Verlag, 1976.
[5] Johnson, W., and Lindenstrauss, J. Extensions of Lipschitz mappings into a Hilbert space. In Conference in Modern Analysis and Probability (Providence, RI, USA, 1984), American Mathematical Society.
[6] Ledoux, M., and Talagrand, M. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.
[7] Massey, J. L. Design and analysis of block ciphers. dabcmay2000f3.pdf, May 2000. Presentation.