Dimensionality reduction: Johnson-Lindenstrauss lemma for structured random matrices


1 Dimensionality reduction: Johnson-Lindenstrauss lemma for structured random matrices
Jan Vybíral, Austrian Academy of Sciences, RICAM, Linz, Austria
January 2011, MPI Leipzig, Germany

2 Joint work with Aicke Hinrichs (University of Jena, Germany), Massimo Fornasier (RICAM, Linz) and Jan Haškovec (RICAM, Linz).
FWF-START project: Sparse Approximation and Optimization in High Dimensions, led by Massimo Fornasier.

3 Outline
- Johnson-Lindenstrauss lemma
- Classical proof
- Variants and improvements
- Circulant matrices
- Decoupling vs. Fourier transform
- Applications
  - Approximate nearest neighbours
  - Dynamical systems

4 Johnson-Lindenstrauss lemma
Let $\varepsilon \in (0, \tfrac12)$, let $x_1,\dots,x_n \in \mathbb{R}^d$ be arbitrary points, and let $k = O(\varepsilon^{-2}\log n)$, i.e. $k \ge C\varepsilon^{-2}\log n$. There exists a (linear) mapping $f : \mathbb{R}^d \to \mathbb{R}^k$ such that
$$(1-\varepsilon)\|x_i - x_j\|_2^2 \le \|f(x_i) - f(x_j)\|_2^2 \le (1+\varepsilon)\|x_i - x_j\|_2^2$$
for all $i,j \in \{1,\dots,n\}$. Here $\|\cdot\|_2$ stands for the Euclidean norm in $\mathbb{R}^d$ or $\mathbb{R}^k$, respectively.
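
The following minimal numerical sketch (Python/NumPy; the sizes, the seed and the constant in $k \approx 4\varepsilon^{-2}\log n$ are illustrative choices, not the lemma's constants) checks the statement empirically with a Gaussian random matrix playing the role of the linear map $f$:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n, d, eps = 100, 1000, 0.5
k = int(np.ceil(4 * np.log(n) / eps**2))      # k = O(eps^-2 log n); the constant 4 is a guess

X = rng.standard_normal((n, d))               # n arbitrary points in R^d
A = rng.standard_normal((k, d)) / np.sqrt(k)  # Gaussian candidate for the map f(x) = Ax
Y = X @ A.T

# distortion of all pairwise squared Euclidean distances
ratios = pdist(Y, 'sqeuclidean') / pdist(X, 'sqeuclidean')
print(ratios.min(), ratios.max())             # compare with 1 - eps and 1 + eps
print(np.mean(np.abs(ratios - 1) <= eps))     # fraction of pairs preserved up to eps
```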

5 Typical proof
$\mathcal{A} \subset \mathbb{R}^{k\times d}$ - $k\times d$ matrices, $\mathbb{P}$ - probability measure on $\mathcal{A}$.
For each $y$ with $\|y\|_2 = 1$: concentration of measure
$$(\ast)\quad \mathbb{P}(A \in \mathcal{A} : \|Ay\|_2^2 > 1+\varepsilon) \le \exp(-ck\varepsilon^2), \qquad \mathbb{P}(A \in \mathcal{A} : \|Ay\|_2^2 < 1-\varepsilon) \le \exp(-ck\varepsilon^2).$$
Choosing $\exp(-ck\varepsilon^2) \le \tfrac{1}{n^2}$, the probability of failure (union bound) is smaller than
$$\binom{n}{2}\cdot\frac{2}{n^2} = 1 - \frac{1}{n} < 1.$$
Hence, the probability of success is positive!
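
A quick Monte Carlo check of the concentration inequality $(\ast)$ for Gaussian matrices can look as follows (a rough sketch; the sizes, seed and number of trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, eps, trials = 500, 200, 0.25, 500
y = rng.standard_normal(d)
y /= np.linalg.norm(y)                         # fixed unit vector

fails = 0
for _ in range(trials):
    A = rng.standard_normal((k, d)) / np.sqrt(k)
    q = np.linalg.norm(A @ y) ** 2
    fails += (q > 1 + eps) or (q < 1 - eps)

print(fails / trials)   # empirical failure probability, to be compared with exp(-c k eps^2)
```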

6 The condition $\exp(-ck\varepsilon^2) \le \tfrac{1}{n^2}$ leads to
$$ck\varepsilon^2 \ge 2\log n \quad\text{and}\quad k \ge \frac{2}{c}\,\varepsilon^{-2}\log n,$$
i.e. $C = 2/c$. By increasing $C$ (to $3/c$, say), we may achieve that such a mapping becomes typical, i.e. occurs with probability at least $1 - 1/n$.

7 Proof of $(\ast)$
$\mathbb{P}$ is rotationally invariant: for $U \in \mathbb{R}^{d\times d}$ unitary and $X, Y \subset \mathcal{A}$ with
$$X := \{A : \|Ay\|_2^2 > 1+\varepsilon\}, \qquad Y := \{B : \|Be_1\|_2^2 > 1+\varepsilon\}, \qquad Ue_1 = y,$$
we have $\mathbb{P}(XU) = \mathbb{P}(Y)$.
$$A = \frac{1}{\sqrt{k}}\begin{pmatrix} a_{1,1} & \dots & a_{1,d}\\ a_{2,1} & \dots & a_{2,d}\\ \vdots & & \vdots\\ a_{k,1} & \dots & a_{k,d}\end{pmatrix},$$
$a_{i,j}$ - independent Gaussian variables ($\mathbb{P}$ is a tensor product), and
$$\|Be_1\|_2^2 = \frac{1}{k}\sum_{i=1}^k b_{i,1}^2.$$

8 $$\mathbb{P}\Big(\sum_{i=1}^k \sigma_i^2 > k(1+\varepsilon)\Big) \le \exp(-ck\varepsilon^2)\,?$$
$$\mathbb{P}\Big(\sum_{i=1}^k \sigma_i^2 > k(1+\varepsilon)\Big) = \mathbb{P}\Big(\lambda\sum_{i=1}^k \sigma_i^2 - k\lambda(1+\varepsilon) > 0\Big) = \mathbb{P}\Big(\exp\Big(\lambda\sum_{i=1}^k \sigma_i^2 - k\lambda(1+\varepsilon)\Big) > 1\Big)$$
$$\le \mathbb{E}\exp\Big(\lambda\sum_{i=1}^k \sigma_i^2 - k\lambda(1+\varepsilon)\Big) = e^{-k\lambda(1+\varepsilon)}\Big(\frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} e^{\lambda t^2} e^{-t^2/2}\,dt\Big)^k = e^{-k\lambda(1+\varepsilon)}(1-2\lambda)^{-k/2}.$$

9 Optimization over $\lambda$: the choice $0 < \lambda = \tfrac12\big(1 - \tfrac{1}{1+\varepsilon}\big) < \tfrac12$ leads to
$$e^{-k\varepsilon/2}(1+\varepsilon)^{k/2} \le \exp\big(-\tfrac{k}{2}(\varepsilon^2/2 - \varepsilon^3/3)\big) \quad\text{(for } 0 < \varepsilon < \tfrac12\text{)} \quad \le \exp(-ck\varepsilon^2).$$

10 Classical proof
W. B. Johnson and J. Lindenstrauss, Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math., 26, 1984.
Projection onto a random $k$-dimensional subspace satisfies the desired property with positive probability.
Advantages: geometrical proof.
Disadvantages: measure on the set of all $k$-dimensional subspaces; evaluating $f(x)$ involves orthonormalisation; time consuming.

11 Variants and improvements
Elementary proof: S. Dasgupta and A. Gupta, An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22:60-65, 2003.
Improvements motivated by applications:
- good running time of $f(x)$
- small amount of randomness used
- small memory space used
- ...others...

12 D. Achlioptas, Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4), 2003.
$f$ realized by a $k\times d$ matrix, where each entry is generated independently at random: Gaussian or Bernoulli (or similar) variables.
Running time: $k\cdot d$. Randomness: $k\cdot d$. Memory space: $k\cdot d$.
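
A minimal sketch of such a dense projection with random-sign ("binary coin") entries; the sizes are arbitrary illustrative choices:

```python
import numpy as np

def bernoulli_jl(X, k, rng):
    # dense random-sign projection; the 1/sqrt(k) scaling gives E||Ax||^2 = ||x||^2
    d = X.shape[1]
    A = rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)
    return X @ A.T

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 2000))     # 50 points in R^2000
Y = bernoulli_jl(X, k=300, rng=rng)     # their images in R^300
```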

13 N. Ailon and B. Chazelle, Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proc. 38th Annual ACM Symposium on Theory of Computing, 2006.
$f(x) = PHDx$, where
- $P$ is a $k\times d$ matrix, where each component is generated independently at random:
$$P_{i,j} \sim N(0,1) \text{ with probability } q, \qquad P_{i,j} = 0 \text{ with probability } 1-q, \qquad q = \min\Big\{\Theta\Big(\frac{\log^2 n}{d}\Big),\,1\Big\},$$
- $H$ is the $d\times d$ normalized Hadamard matrix,
- $D$ is a random $d\times d$ diagonal matrix, with each $D_{i,i}$ drawn independently from $\{-1,1\}$ with probability $1/2$.
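
The structure of the transform can be sketched as follows (Python with SciPy; $d$ must be a power of two for scipy.linalg.hadamard, the scaling constants from the paper are omitted, and the dense product with $H$ only illustrates the algebra, a fast Walsh-Hadamard transform would be needed for the $O(d\log d)$ running time):

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(3)
d, k, n = 1024, 64, 500
q = min(np.log(n) ** 2 / d, 1.0)              # sparsity level q = Theta(log^2 n / d)

H = hadamard(d) / np.sqrt(d)                  # normalized Hadamard matrix
D = rng.choice([-1.0, 1.0], size=d)           # random signs, the diagonal of D
mask = rng.random((k, d)) < q
P = np.where(mask, rng.standard_normal((k, d)), 0.0)   # sparse Gaussian matrix P

def fjlt(x):
    # f(x) = P H D x
    return P @ (H @ (D * x))

x = rng.standard_normal(d)
print(fjlt(x).shape)                          # (k,)
```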

14 Running time: with high probability, $f(x)$ may be calculated in time $O(d\log d + qdk)$.
Randomness: $O(d + k\log^2 n)$.
Memory space: with high probability $O(d + kdq) = O(d + k\log^2 n)$.
Not easy to implement.
Other variants and improvements...

15 Connection to compressed sensing - RIP
We say that $A \in \mathbb{R}^{k\times d}$ satisfies the Restricted Isometry Property of order $s$ if there exists $\delta_s \in (0,1)$ such that
$$(1-\delta_s)\|x\|_2^2 \le \|Ax\|_2^2 \le (1+\delta_s)\|x\|_2^2$$
holds for all $x \in \mathbb{R}^d$ with $\|x\|_0 := \#\{j = 1,\dots,d : x_j \ne 0\} \le s$.
The aim is to find matrices with small $\delta_s$ for large $s$.
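
A small empirical check in the same spirit (a sketch with a Gaussian matrix and random $s$-sparse unit vectors; this only gives a lower bound on $\delta_s$, since certifying RIP would require going through all supports):

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, s = 400, 120, 8
A = rng.standard_normal((k, d)) / np.sqrt(k)

worst = 0.0
for _ in range(1000):
    x = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)
    x[support] = rng.standard_normal(s)
    x /= np.linalg.norm(x)                          # random s-sparse unit vector
    worst = max(worst, abs(np.linalg.norm(A @ x) ** 2 - 1.0))

print("empirical lower bound on delta_s:", worst)
```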

16 R. Baraniuk, M. Davenport, R. DeVore and M. Wakin, A simple proof of the Restricted Isometry Property for random matrices. Constructive Approximation, 2008.
If
$$\mathbb{P}_\omega\big(\|A_\omega x\|_2^2 \ge (1+\varepsilon)\|x\|_2^2\big) \le \exp(-nc(\varepsilon))$$
(and the same for $\le (1-\varepsilon)\|x\|_2^2$), then $A_\omega$ satisfies RIP of order $s$ and $\delta_s$ for $s$ small enough, with probability depending on $s$ and $\delta_s$.
Every distribution that yields J-L transforms also yields RIP matrices.

17 Reverse direction provided recently by F. Krahmer and R. Ward, New and improved Johnson-Lindenstrauss embeddings via the Restricted Isometry Property.
If $\Phi \in \mathbb{R}^{k\times d}$ satisfies RIP of order $s$ (large enough), then $\Phi D_\xi$ satisfies the J-L lemma.
Every distribution that yields RIP matrices also yields J-L transforms.
Here, $D_\xi \in \mathbb{R}^{d\times d}$ is a diagonal matrix with independent Bernoulli variables on the diagonal.

18 Circulant matrices
Let $a = (a_0,\dots,a_{d-1})$ be i.i.d. random variables and
$$M_{a,k} = \begin{pmatrix} a_0 & a_1 & a_2 & \dots & a_{d-1}\\ a_{d-1} & a_0 & a_1 & \dots & a_{d-2}\\ a_{d-2} & a_{d-1} & a_0 & \dots & a_{d-3}\\ \vdots & & & & \vdots\\ a_{d-k+1} & a_{d-k+2} & a_{d-k+3} & \dots & a_{d-k}\end{pmatrix} \in \mathbb{R}^{k\times d}.$$
Is it possible to take $f(x) = \frac{1}{\sqrt{k}}M_{a,k}x$? Or $f(x) = \frac{1}{\sqrt{k}}M_{a,k}D_\kappa x$?
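
In NumPy/SciPy the partial circulant matrix of the slide can be built, for instance, like this (a sketch with arbitrary small sizes; scipy.linalg.circulant builds the full $d\times d$ circulant, of which only the first $k$ rows are kept):

```python
import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(5)
d, k = 12, 5
a = rng.standard_normal(d)

M_full = circulant(a).T        # rows: a, then a shifted by one position, etc.
M_ak = M_full[:k, :]           # the k x d matrix M_{a,k} from the slide

kappa = rng.choice([-1.0, 1.0], size=d)        # diagonal of D_kappa
x = rng.standard_normal(d)
fx = (M_ak @ (kappa * x)) / np.sqrt(k)         # candidate map f(x) = M_{a,k} D_kappa x / sqrt(k)
```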

19 Decoupling vs. Fourier transform
Yes! With $k = O(\varepsilon^{-2}\log^3 n)$ - decoupling techniques.
Yes! With $k = O(\varepsilon^{-2}\log^2 n)$ - Fourier-analytic methods.
The improvement to $O(\varepsilon^{-2}\log n)$ is still open... promising numerical experiments.
Advantages:
- running time $O(d\log d)$ - using the FFT (see the sketch below)
- randomness used: $2d$ instead of $(k+1)d$
- easy to implement: the FFT is a part of every software package
Disadvantage: up to now - bigger $k$.
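
The $O(d\log d)$ running time comes from computing the circulant product via the FFT. A minimal sketch (the formula with the complex conjugate matches the row-shift convention of $M_{a,k}$ used above and is verified against the explicit matrix):

```python
import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(6)
d, k = 256, 32
a = rng.standard_normal(d)
x = rng.standard_normal(d)
kappa = rng.choice([-1.0, 1.0], size=d)

z = kappa * x                                   # D_kappa x
# product with the full circulant, computed in O(d log d):
via_fft = np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(z)).real
explicit = circulant(a).T @ z
assert np.allclose(via_fft, explicit)

f_x = via_fft[:k] / np.sqrt(k)                  # keep the first k coordinates only
```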

20 $\log^3 n$: Let $x_1,\dots,x_n$ be arbitrary points in $\mathbb{R}^d$, $\varepsilon \in (0,\tfrac12)$, $k = O(\varepsilon^{-2}\log^3 n)$, let $a = (a_0,\dots,a_{d-1})$ be independent Bernoulli variables or independent normally distributed variables, let $M_{a,k}$ and $D_\kappa$ be as above and $f(x) = \frac{1}{\sqrt{k}}M_{a,k}D_\kappa x$. Then with probability at least $2/3$ the following holds:
$$(1-\varepsilon)\|x_i - x_j\|_2^2 \le \|f(x_i) - f(x_j)\|_2^2 \le (1+\varepsilon)\|x_i - x_j\|_2^2, \qquad i,j = 1,\dots,n.$$

21 Strategy of the proof of $\log^3 n$ - decoupling the dependence
Concentration inequalities for every fixed $x$ with $\|x\|_2 = 1$:
$$\mathbb{P}_{a,\kappa}\big(\|M_{a,k}D_\kappa x\|_2^2 \ge (1+\varepsilon)k\big) \le \exp(-c(k\varepsilon^2)^{1/3})$$
and
$$\mathbb{P}_{a,\kappa}\big(\|M_{a,k}D_\kappa x\|_2^2 \le (1-\varepsilon)k\big) \le \exp(-c(k\varepsilon^2)^{1/3}).$$
Then a union bound over all $n(n-1)/2$ pairs of points. The bound on $k$ is given by
$$2\cdot\frac{n(n-1)}{2}\cdot\exp(-c(k\varepsilon^2)^{1/3}) < 1.$$

22 Separation of the diagonal and the off-diagonal term
$$\|M_{a,k}D_\kappa x\|_2^2 = \sum_{j=0}^{k-1}\Big(\sum_{i=0}^{d-1} a_i\kappa_{j+i}x_{j+i}\Big)^2 = I + II,$$
$$I = \sum_{i=0}^{d-1} a_i^2\sum_{j=0}^{k-1} x_{j+i}^2 \quad\text{(diagonal)}, \qquad II = \sum_{j=0}^{k-1}\sum_{i\ne i'} a_i a_{i'}\kappa_{j+i}\kappa_{j+i'}x_{j+i}x_{j+i'} \quad\text{(off-diagonal)}.$$
...summation in the index is modulo $d$...
$$\mathbb{P}_{a,\kappa}\big(\|M_{a,k}D_\kappa x\|_2^2 \ge (1+\varepsilon)k\big) \le \mathbb{P}_a\big(I \ge (1+\varepsilon/2)k\big) + \mathbb{P}_{a,\kappa}\big(II \ge \varepsilon k/2\big)$$

23 Estimates of $I$: $\mathbb{P}_a(I \ge (1+\varepsilon/2)k)$
Lemma of B. Laurent and P. Massart ... or any other variant of Bernstein's inequality: exponential concentration of
$$Z = \sum_{i=1}^{D}\alpha_i(a_i^2 - 1),$$
where $a_i$ are i.i.d. normal variables and $\alpha_i$ are nonnegative real numbers. Then for any $t > 0$
$$\mathbb{P}(Z \ge 2\|\alpha\|_2\sqrt{t} + 2\|\alpha\|_\infty t) \le \exp(-t), \qquad \mathbb{P}(Z \le -2\|\alpha\|_2\sqrt{t}) \le \exp(-t).$$
Here $\alpha_i := \sum_{j=0}^{k-1}x_{j+i}^2$, so $\|\alpha\|_1 = k$, $\|\alpha\|_\infty \le 1$ and $\|\alpha\|_2 \le \sqrt{k}$.

24 Estimates of $II$: Decoupling lemma of Bourgain and Tzafriri
Let $\xi_0,\dots,\xi_{d-1}$ be independent random variables with $\mathbb{E}\xi_0 = \dots = \mathbb{E}\xi_{d-1} = 0$ and let $\{x_{i,j}\}_{i,j=0}^{d-1}$ be a double sequence of real numbers. Then for $1 \le p < \infty$
$$\mathbb{E}\Big|\sum_{i\ne j} x_{i,j}\xi_i\xi_j\Big|^p \le 4^p\,\mathbb{E}\Big|\sum_{i\ne j} x_{i,j}\xi_i\xi_j'\Big|^p,$$
where $(\xi_0',\dots,\xi_{d-1}')$ denotes an independent copy of $(\xi_0,\dots,\xi_{d-1})$.

25 Further tools: Khintchine's inequalities (applied twice) and
$$\Big(\mathbb{E}_{a,a'}\Big|\sum_{j=0}^{k-1} a_j a_j'\Big|^p\Big)^{1/p} \le \sqrt{p(k+p)},$$
for $a$ either Bernoulli or Gaussian variables.

26 The role of $D_\kappa$
$k \le d$, $a_0,\dots,a_{d-1}$ independent normal variables; 2-stability: for
$$x = \frac{1}{\sqrt{d}}(1,\dots,1), \qquad b := \sum_{j=0}^{d-1}\frac{a_j}{\sqrt{d}} \sim N(0,1),$$
we have
$$\|M_{a,k}x\|_2^2 = k\Big(\sum_{j=0}^{d-1}\frac{a_j}{\sqrt{d}}\Big)^2, \qquad\text{so}\qquad \mathbb{P}_a\big(\|M_{a,k}x\|_2^2 > (1+\varepsilon)k\big) = \mathbb{P}_b\big(b^2 > 1+\varepsilon\big),$$
which depends neither on $k$ nor on $d$.
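
A short simulation of exactly this effect: without $D_\kappa$, on the vector $x = \tfrac{1}{\sqrt{d}}(1,\dots,1)$ every row of $M_{a,k}$ produces the same Gaussian $b$, so the failure probability does not shrink as $k$ grows (a sketch with arbitrary sizes):

```python
import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(7)
d, eps, trials = 128, 0.5, 1000
x = np.ones(d) / np.sqrt(d)                    # the bad vector from the slide

for k in (8, 32, 128):
    fails = 0
    for _ in range(trials):
        a = rng.standard_normal(d)
        Mx = circulant(a).T[:k, :] @ x         # M_{a,k} x, without D_kappa
        fails += np.linalg.norm(Mx) ** 2 > (1 + eps) * k
    print(k, fails / trials)                   # roughly P(b^2 > 1 + eps) for every k
```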

27 $\log^2 n$: Let $x_1,\dots,x_n$ be arbitrary points in $\mathbb{R}^d$, $\varepsilon \in (0,\tfrac12)$, $k = O(\varepsilon^{-2}\log^2 n)$, let $a = (a_0,\dots,a_{d-1})$ be independent normally distributed variables, let $M_{a,k}$ and $D_\kappa$ be as above and $f(x) = \frac{1}{\sqrt{k}}M_{a,k}D_\kappa x$. Then with probability at least $2/3$ the following holds:
$$(1-\varepsilon)\|x_i - x_j\|_2^2 \le \|f(x_i) - f(x_j)\|_2^2 \le (1+\varepsilon)\|x_i - x_j\|_2^2, \qquad i,j = 1,\dots,n.$$

28 Fourier methods
$\mathcal{F}$ - unitary discrete Fourier transform, $\mathcal{F} : \mathbb{C}^d \to \mathbb{C}^d$. Every circulant matrix may be diagonalised by $\mathcal{F}$ and $\mathcal{F}^{-1}$:
$$M_{a,d}\,x = \mathcal{F}\,\mathrm{diag}(\sqrt{d}\,\mathcal{F}a)\,\mathcal{F}^{-1}x.$$
The singular values are the square roots of the eigenvalues of
$$M_{a,d}M_{a,d}^* = \mathcal{F}\,\mathrm{diag}(\sqrt{d}\,\mathcal{F}a)\,\mathrm{diag}(\sqrt{d}\,\overline{\mathcal{F}a})\,\mathcal{F}^{-1} = \mathcal{F}\,\mathrm{diag}(d\,|\mathcal{F}a|^2)\,\mathcal{F}^{-1},$$
i.e. they are $\sqrt{d}\,|\mathcal{F}a|$.
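
This diagonalisation is easy to check numerically: the moduli of the (unnormalised) DFT of $a$ coincide with the singular values of the full circulant matrix (a sketch; note $|\mathrm{fft}(a)| = \sqrt{d}\,|\mathcal{F}a|$ for the unitary DFT $\mathcal{F}$):

```python
import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(8)
d = 64
a = rng.standard_normal(d)
M = circulant(a).T                              # the full circulant matrix M_{a,d}

sv_exact = np.linalg.svd(M, compute_uv=False)
sv_fft = np.abs(np.fft.fft(a))                  # = sqrt(d) |F a| for the unitary DFT F
assert np.allclose(np.sort(sv_exact), np.sort(sv_fft))
```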

29 Strategy of the proof of $\log^2 n$
Concentration inequalities for all $x = \dfrac{x_i - x_j}{\|x_i - x_j\|_2}$:
$$\mathbb{P}_a\big(\|M_{a,k}D_\kappa x\|_2^2 \ge (1+\varepsilon)k\big) \le \exp\Big(-\frac{ck\varepsilon^2}{\log n}\Big), \qquad \mathbb{P}_a\big(\|M_{a,k}D_\kappa x\|_2^2 \le (1-\varepsilon)k\big) \le \exp\Big(-\frac{ck\varepsilon^2}{\log n}\Big).$$
From this, the result follows again by a union bound.

30 Let $\|x\|_2 = 1$ and set $y_j := S^j(D_\kappa x) \in \mathbb{C}^d$, $j = 0,\dots,k-1$, where $S$ is the shift operator $S : \mathbb{C}^d \to \mathbb{C}^d$, $S(z_0,\dots,z_{d-1}) = (z_1,\dots,z_{d-1},z_0)$.
Let $Y$ be the $k\times d$ matrix with rows $y_0,\dots,y_{k-1}$. Note that $\|M_{a,k}D_\kappa x\|_2^2 = \|Ya\|_2^2$. Hence
$$\mathbb{P}\big(\|M_{a,k}D_\kappa x\|_2^2 \ge (1+\varepsilon)k\big) = \mathbb{P}\big(\|Ya\|_2^2 \ge (1+\varepsilon)k\big).$$

31 Let $Y = U\Sigma V$ be the singular value decomposition of $Y$. Then
$$\|Ya\|_2^2 = \|U\Sigma Va\|_2^2 = \|\Sigma Va\|_2^2 = \|\Sigma b\|_2^2,$$
where $b := Va$ is a $k$-dimensional vector of independent normal variables. Hence,
$$\mathbb{P}\big(\|M_{a,k}D_\kappa x\|_2^2 \ge (1+\varepsilon)k\big) = \mathbb{P}\Big(\sum_{j=0}^{k-1}\lambda_j^2 b_j^2 \ge (1+\varepsilon)k\Big),$$
where $\lambda_j$ are the singular values of $Y$. Lemma of B. Laurent and P. Massart: estimate $\|\lambda\|_4$ and $\|\lambda\|_\infty$!
$\|\lambda\|_2^2 = \|Y\|_F^2 = k$ and $\|\lambda\|_\infty^2 \le c\log n$ implies $\|\lambda\|_4^4 \le c\,k\log n$.
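
The identities $\|M_{a,k}D_\kappa x\|_2^2 = \|Ya\|_2^2$ and $\|Y\|_F^2 = k$ used above are easy to verify numerically (a sketch; np.roll(z, -j) realises the $j$-fold left shift $S^j$):

```python
import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(9)
d, k = 64, 16
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
kappa = rng.choice([-1.0, 1.0], size=d)
a = rng.standard_normal(d)

z = kappa * x                                        # D_kappa x
Y = np.stack([np.roll(z, -j) for j in range(k)])     # rows y_j = S^j(D_kappa x)

lhs = np.linalg.norm(circulant(a).T[:k, :] @ z) ** 2 # ||M_{a,k} D_kappa x||^2
rhs = np.linalg.norm(Y @ a) ** 2                     # ||Y a||^2
assert np.isclose(lhs, rhs)
print(np.linalg.norm(Y, 'fro') ** 2)                 # = k, since each row has norm ||x||_2 = 1
```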

32 Related results:
N. Ailon and E. Liberty, Almost optimal unrestricted fast Johnson-Lindenstrauss transform. Random partial Fourier transform, $k = O(\varepsilon^{-4}\log n\log^4 d)$.
H. Rauhut, J. Romberg and J. Tropp, Restricted isometries for partial random circulant matrices. Random circulant matrices,
$$k \gtrsim \max\big\{\varepsilon^{-1}\log^{3/2} n\,\log^{3/2} d,\ \varepsilon^{-2}\log n\,\log^4 d\big\}.$$

33 Approximate nearest neighbors
The nearest neighbor problem: given $P = \{x_1,\dots,x_n\}$ in a metric space $X$, preprocess $P$ so as to efficiently find, for a query $q \in X$, the minimizer of
$$\min_{i=1,\dots,n} d(x_i, q).$$
Donald Knuth, in vol. 3 of The Art of Computer Programming, called it the post-office problem.
Naive algorithm: compare all the distances - no preprocessing.
The approximate nearest neighbor problem: given $P = \{x_1,\dots,x_n\}$ in a metric space $X$ and $\varepsilon > 0$, preprocess $P$ so as to efficiently find $p \in P$ such that
$$d(p, q) \le (1+\varepsilon)\, d(p', q) \qquad\text{for all } p' \in P.$$

34 $P = \{x_1,\dots,x_n\} \subset X = \mathbb{R}^d$, hashing functions:
- Choose $v \in \mathbb{R}^d$ at random and pre-compute $h_v(i) := \langle v, x_i\rangle$, $i = 1,\dots,n$.
- Find $\operatorname{argmin}_{i=1,\dots,n} |\langle v, x_i\rangle - \langle v, q\rangle|$.
- Iterate over different $v_1, v_2, \dots$ (a small sketch follows below).
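
A toy version of this hashing scheme (a sketch; the number of random directions n_proj and all sizes are arbitrary, and the final refinement over the short candidate list is one possible way to combine the iterations over $v_1, v_2, \dots$):

```python
import numpy as np

rng = np.random.default_rng(10)
n, d, n_proj = 2000, 100, 10
P = rng.standard_normal((n, d))                 # the point set x_1, ..., x_n
q = rng.standard_normal(d)                      # a query point

V = rng.standard_normal((n_proj, d))            # random directions v_1, ..., v_{n_proj}
H = P @ V.T                                     # pre-computed hashes h_v(i) = <v, x_i>
hq = V @ q                                      # <v, q> for each direction

# each direction proposes the point whose hash is closest to <v, q>;
# the true distances are then compared only on this short candidate list
candidates = np.unique(np.argmin(np.abs(H - hq), axis=0))
best = candidates[np.argmin(np.linalg.norm(P[candidates] - q, axis=1))]
print(best)
```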

35 Dynamical systems
J-L transforms are linear and (almost) preserve distances. We consider non-linear dynamical systems, where the non-linearity depends only on the distances of the agents:
$$\dot{x}_i(t) = f_i(Dx(t)) + \sum_{j=1}^N f_{ij}(Dx(t))\,x_j(t),$$
where $N$ is the number of agents, $x(t) = (x_1(t),\dots,x_N(t)) \in \mathbb{R}^{d\times N}$ with $x_i : [0,T] \to \mathbb{R}^d$, $i = 1,\dots,N$,
$f_i : \mathbb{R}^{N\times N} \to \mathbb{R}^d$, $i = 1,\dots,N$, $f_{ij} : \mathbb{R}^{N\times N} \to \mathbb{R}$, $i,j = 1,\dots,N$, and
$$D : \mathbb{R}^{d\times N} \to \mathbb{R}^{N\times N}, \qquad Dx := \big(\|x_i - x_j\|_{\ell_2^d}\big)_{i,j=1,\dots,N}$$
is the adjacency matrix.

36 Euler's method: $x_i(0) = x_i^0$, $i = 1,\dots,N$, and
$$x_i^{n+1} := x_i^n + h\Big(f_i(Dx^n) + \sum_{j=1}^N f_{ij}(Dx^n)\,x_j^n\Big), \qquad n = 0,\dots,n_0-1.$$
We project the system using a J-L transform $M \in \mathbb{R}^{k\times d}$: $y_i(0) = Mx_i^0$, $i = 1,\dots,N$, and
$$y_i^{n+1} := y_i^n + h\Big(Mf_i(D'y^n) + \sum_{j=1}^N f_{ij}(D'y^n)\,y_j^n\Big), \qquad n = 0,\dots,n_0-1,$$
where $D'y^n$ denotes the adjacency matrix of the projected agents in $\mathbb{R}^k$.
If the $f_i$ and $f_{ij}$ are Lipschitz: error estimates for $y^n - Mx^n$.
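
A toy instance of this projected Euler scheme (a sketch; the interaction kernel g and all sizes are hypothetical choices, picked only so that the interaction coefficients depend on the agents solely through their mutual distances, as required above):

```python
import numpy as np

def g(r):
    # hypothetical interaction kernel; any Lipschitz function of the distance would do
    return 1.0 / (1.0 + r ** 2)

def adjacency(X):
    # D x = matrix of pairwise distances between the agents (columns of X)
    diff = X[:, :, None] - X[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

def euler_step(X, h):
    # one step of x_i^{n+1} = x_i^n + h * sum_j g(||x_i - x_j||) (x_j^n - x_i^n)
    W = g(adjacency(X))
    np.fill_diagonal(W, 0.0)
    return X + h * (X @ W.T - X * W.sum(axis=1))

rng = np.random.default_rng(11)
d, k, N, h, steps = 200, 20, 30, 0.05, 100
X = rng.standard_normal((d, N))                    # agents as columns in R^d
M = rng.standard_normal((k, d)) / np.sqrt(k)       # J-L transform M
Y = M @ X                                          # projected initial data

for _ in range(steps):
    X = euler_step(X, h)       # full system: distances measured in R^d
    Y = euler_step(Y, h)       # projected system: distances measured in R^k only

print(np.linalg.norm(Y - M @ X))                   # error ||y^n - M x^n||, expected to stay small
```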

37 References:
W. B. Johnson and J. Lindenstrauss, Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math., 26, 1984.
S. Dasgupta and A. Gupta, An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22:60-65, 2003.
N. Ailon and B. Chazelle, Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proc. 38th Annual ACM Symposium on Theory of Computing, 2006.
A. Hinrichs and J. Vybíral, Johnson-Lindenstrauss lemma for circulant matrices, to appear in Random Struct. Algorithms.
J. Vybíral, A variant of the Johnson-Lindenstrauss lemma for circulant matrices, to appear in J. Funct. Anal.
