Optimal Bounds for Johnson-Lindenstrauss Transformations


Journal of Machine Learning Research (2018). Submitted 5/18; Revised 9/18; Published 10/18.

Optimal Bounds for Johnson-Lindenstrauss Transformations

Michael Burr (burr@clemson.edu) and Shuhong Gao (sgao@clemson.edu), Department of Mathematical Sciences, Clemson University, Clemson, SC 29634, USA

Fiona Knoll (fknoll@g.clemson.edu), Department of Mathematical Sciences, University of Cincinnati, Cincinnati, OH, USA

Editor: Kenji Fukumizu

(c) 2018 Michael Burr, Shuhong Gao, and Fiona Knoll. License: CC-BY 4.0.

Abstract

In 1984, Johnson and Lindenstrauss proved that any finite set of data in a high-dimensional space can be projected to a lower-dimensional space while preserving the pairwise Euclidean distances between points up to a bounded relative error. If the desired dimension of the image is too small, however, Kane, Meka, and Nelson (2011) and Jayram and Woodruff (2013) proved that such a projection does not exist. In this paper, we provide a precise asymptotic threshold for the dimension of the image: above the threshold, there exists a projection preserving the Euclidean distances, but below it, no such projection exists.

Keywords: Johnson-Lindenstrauss transformation, dimension reduction, phase transition, asymptotic threshold

1. Introduction

In 1984, Johnson and Lindenstrauss (1984), in establishing a bound on the Lipschitz constant for the Lipschitz extension problem, proved that any finite set of data in a high-dimensional space can be projected into a lower-dimensional space while preserving the pairwise Euclidean distances within any desired relative error. In particular, for any finite set of vectors $x_1,\dots,x_N\in\mathbb{R}^d$ and for any error factor $0<\varepsilon<1$, there exists an absolute constant $c$ such that for all $k\geq c\varepsilon^{-2}\log N$, there exists a linear map $A:\mathbb{R}^d\to\mathbb{R}^k$ such that for all pairs $1\leq i,j\leq N$,

$$(1-\varepsilon)\|x_i-x_j\|^2 \;\leq\; \|Ax_i-Ax_j\|^2 \;\leq\; (1+\varepsilon)\|x_i-x_j\|^2, \qquad (1)$$

where $\|\cdot\|$ denotes the Euclidean norm. These inequalities are implied by the following theorem by setting $\delta=1/N^2$ and using the union bound:

Theorem 1 (Johnson and Lindenstrauss, 1984) For any real numbers $0<\varepsilon,\delta<1$, there exists an absolute constant $c>0$ such that for any integer $k\geq c\varepsilon^{-2}\log(1/\delta)$, there exists a probability distribution $\mathcal{D}$ on $k\times d$ real matrices such that for any fixed $x\in\mathbb{R}^d$,

$$\mathrm{Prob}_{A\sim\mathcal{D}}\big[(1-\varepsilon)\|x\|^2 \leq \|Ax\|^2 \leq (1+\varepsilon)\|x\|^2\big] \;\geq\; 1-\delta, \qquad (2)$$

where $A\sim\mathcal{D}$ indicates that the matrix $A$ is a random matrix with distribution $\mathcal{D}$.
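The guarantee of Theorem 1 is easy to check numerically. The following sketch is an added illustration (not part of the original paper): it assumes a Gaussian JL distribution, one of many valid choices, and uses the fact that for a matrix $A$ with i.i.d. $N(0,1/k)$ entries and any fixed unit vector $x$, the image $Ax$ has i.i.d. $N(0,1/k)$ coordinates, so $Ax$ can be sampled directly.

```python
import numpy as np

def gaussian_jl_failure_rate(k=200, eps=0.2, trials=5000, seed=0):
    """Estimate Prob[ | ||Ax||^2 - ||x||^2 | > eps ||x||^2 ] for A with i.i.d.
    N(0, 1/k) entries and a fixed unit vector x.  For such A, the vector Ax has
    i.i.d. N(0, 1/k) coordinates, so we sample Ax directly instead of A."""
    rng = np.random.default_rng(seed)
    failures = 0
    for _ in range(trials):
        y = rng.normal(0.0, 1.0 / np.sqrt(k), size=k)  # distribution of Ax
        if abs(y @ y - 1.0) > eps:
            failures += 1
    return failures / trials

if __name__ == "__main__":
    # The failure rate drops rapidly once k is well above eps^-2 log(1/delta).
    print(gaussian_jl_failure_rate())
```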

Note that, in order to project a large number of vectors, $\delta$ must be sufficiently small. For instance, suppose we wish to project a set of $N$ vectors to a lower-dimensional space. To apply the union bound to Inequality (2), we choose any $\delta<2/(N(N-1))$. In this case, Inequality (2) implies that the probability of preserving all pairwise distances among the $N$ points up to a relative error of $\varepsilon$ is at least $1-\delta N(N-1)/2>0$. Since the probability is nonzero, such a projection exists.

A probability distribution $\mathcal{D}$ satisfying Inequality (2) is called an $(\varepsilon,\delta)$-JL distribution, or simply a JL distribution. Since these transformations are linear, without loss of generality, we assume for the rest of the paper that $\|x\|=1$. When a JL distribution is specified via an explicit construction, we call a random projection $x\mapsto Ax$ generated in this way a JL transformation.

Since the introduction of JL distributions, there has been considerable work on explicit constructions of JL distributions; see, e.g., Johnson and Lindenstrauss (1984); Frankl and Maehara (1988); Indyk and Motwani (1998); Achlioptas (2003); Ailon and Chazelle (2006); Matousek (2008); Dasgupta et al. (2010); Kane and Nelson (2014); and the references therein. A simple and easily described JL distribution is that of Achlioptas (2003). In this construction, the entries of $A$ are distributed as follows:

$$a_{ij} \;=\; \begin{cases} +k^{-1/2} & \text{with probability } 1/2,\\ -k^{-1/2} & \text{with probability } 1/2.\end{cases}$$

The recent constructions in Ailon and Chazelle (2006); Matousek (2008); Dasgupta et al. (2010); Kane and Nelson (2014) have focused on the complexity of computing a projection for the purpose of applications. We note that the ability to project a vector to a lower-dimensional space, independent of the original dimension, while preserving the Euclidean norm up to a prescribed relative error, is highly desirable. In particular, dimension reduction has applications in many fields, including machine learning (Arriaga and Vempala, 2006; Weinberger et al., 2009), low-rank approximation (Clarkson and Woodruff, 2013; Nguyen et al., 2009; Ubaru et al., 2017), approximate nearest neighbors (Ailon and Chazelle, 2006; Indyk and Motwani, 1998), data storage (Candès, 2008; Cormode and Indyk, 2016), and document similarity (Bingham and Mannila, 2001; Lin and Gunopulos, 2003).

For both practical and theoretical purposes, it is important to know the smallest possible dimension $k$ of a potential image space for any given $\varepsilon$ and $\delta$. Note that, for any $d'<d$, each $(\varepsilon,\delta)$-JL distribution $\mathcal{D}$ on $\mathbb{R}^{k\times d}$ induces an $(\varepsilon,\delta)$-JL distribution $\mathcal{D}'$ on $\mathbb{R}^{k\times d'}$ in a natural way: the matrices of $\mathcal{D}'$ are obtained from those of $\mathcal{D}$ by deleting the last $d-d'$ columns of a random matrix, together with the induced probability distribution. This construction is a JL distribution since $\mathbb{R}^{d'}$ can be naturally embedded into $\mathbb{R}^{d}$ by extending a vector in $\mathbb{R}^{d'}$ by $d-d'$ zeros. Hence, if there exists an $(\varepsilon,\delta)$-JL distribution on $\mathbb{R}^{k\times d}$, then there is an $(\varepsilon,\delta)$-JL distribution on $\mathbb{R}^{k\times d'}$ for all $d'\leq d$. Similarly, if an $(\varepsilon,\delta)$-JL distribution does not exist on $\mathbb{R}^{k\times d}$, then, for any $k'\leq k$, there cannot be an $(\varepsilon,\delta)$-JL distribution on $\mathbb{R}^{k'\times d}$. In particular, since $\mathbb{R}^{k'}$ can be naturally embedded into $\mathbb{R}^{k}$ by extending a vector in $\mathbb{R}^{k'}$ by $k-k'$ zeros, if an $(\varepsilon,\delta)$-JL distribution existed for $\mathbb{R}^{k'\times d}$, it could be extended to an $(\varepsilon,\delta)$-JL distribution for $\mathbb{R}^{k\times d}$.
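As a concrete illustration of the Achlioptas construction recalled above, the following sketch (added here for illustration; it is not part of the original paper and assumes the sign entries are drawn i.i.d.) draws such a matrix and estimates how often the squared norm of a projected unit vector deviates by more than $\varepsilon$.

```python
import numpy as np

def achlioptas_matrix(k, d, rng):
    """k x d matrix with i.i.d. entries +/- 1/sqrt(k), each with probability 1/2."""
    signs = rng.choice([-1.0, 1.0], size=(k, d))
    return signs / np.sqrt(k)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d, k, eps, trials = 500, 200, 0.2, 200
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)              # work with a unit vector, as in the paper
    errs = [abs(np.linalg.norm(achlioptas_matrix(k, d, rng) @ x) ** 2 - 1.0)
            for _ in range(trials)]
    print("fraction of trials with relative error > eps:",
          np.mean(np.array(errs) > eps))
```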

Figure 1: For fixed $\varepsilon$ and $\delta$, a JL distribution exists for every $k\geq 4\varepsilon^{-2}\log(1/\delta)\,[1+o(1)]$, while no JL distribution exists for $k\leq c\varepsilon^{-2}\log(1/\delta)$ for some absolute constant $c>0$; the interval between these two bounds is the gap closed in this paper.

In this paper, we close this gap in the limit. For any $\varepsilon$ and $\delta$, we define

$$k_0(\varepsilon,\delta) \;=\; \min\{k : \text{there exists an } (\varepsilon,\delta)\text{-JL distribution on } \mathbb{R}^{k\times d} \text{ for every } d\geq 1\}.$$

By our definition, $k_0=k_0(\varepsilon,\delta)$ is independent of $d$, and, by Theorem 1, we have that $k_0\leq c\varepsilon^{-2}\log(1/\delta)$ for some absolute constant $c>0$. Frankl and Maehara (1988) showed that one may take $c$ to be approximately $9$. Achlioptas (2003) further improved this bound by providing a JL distribution for every

$$k \;\geq\; \frac{2\log(1/\delta)}{\varepsilon^2/2-\varepsilon^3/3},$$

resulting in the following upper bound:

$$k_0 \;\leq\; \frac{2\log(1/\delta)}{\varepsilon^2/2-\varepsilon^3/3} \;=\; \frac{4\log(1/\delta)}{\varepsilon^2}\,[1+o(1)], \qquad (3)$$

where $o(1)$ approaches zero as both $\varepsilon$ and $\delta$ approach zero. A lower bound on $k_0$ was not given until 2003, when Alon (2003) proved that

$$k_0 \;\geq\; c\,\frac{\varepsilon^{-2}\log(1/\delta)}{\log(1/\varepsilon)}$$

for some absolute constant $c>0$. Improving Alon's work, Jayram and Woodruff (2013) and Kane et al. (2011) showed, through different methods, that, for some absolute constant $c>0$, there is no $(\varepsilon,\delta)$-JL distribution for $k\leq c\varepsilon^{-2}\log(1/\delta)$. Hence, there is a lower bound of the form $k_0\geq c\varepsilon^{-2}\log(1/\delta)$. This situation is summarized in Figure 1.

The goal of the current paper is to close the gap between the upper and lower bounds in the limit. In particular, we prove an optimal lower bound that asymptotically matches the known upper bound as $\varepsilon$ and $\delta$ approach $0$, see Theorem 2. This means that $4\varepsilon^{-2}\log(1/\delta)$ is an asymptotic threshold for $k_0$ where a phase-change phenomenon occurs. The main result of this paper is captured in the following theorem:

Theorem 2 For $\varepsilon$ and $\delta$ sufficiently small, $k_0\approx 4\varepsilon^{-2}\log(1/\delta)$. More precisely,

$$\lim_{\varepsilon,\delta\to 0}\;\frac{k_0(\varepsilon,\delta)}{4\varepsilon^{-2}\log(1/\delta)} \;=\; 1.$$
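To give a sense of the scale of the threshold in Theorem 2, the following lines (an illustration added here, not part of the paper) evaluate $4\varepsilon^{-2}\log(1/\delta)$ together with the upper bound as reconstructed in (3) for a few sample values of $\varepsilon$ and $\delta$; the two quantities approach each other as $\varepsilon$ shrinks.

```python
import math

def asymptotic_threshold(eps, delta):
    """The threshold 4 * eps^-2 * log(1/delta) from Theorem 2."""
    return 4.0 * math.log(1.0 / delta) / eps ** 2

def upper_bound(eps, delta):
    """The upper bound 2 log(1/delta) / (eps^2/2 - eps^3/3), as in (3)."""
    return 2.0 * math.log(1.0 / delta) / (eps ** 2 / 2.0 - eps ** 3 / 3.0)

for eps, delta in [(0.5, 1e-2), (0.2, 1e-3), (0.1, 1e-4), (0.05, 1e-6)]:
    print(eps, delta,
          round(asymptotic_threshold(eps, delta)),
          round(upper_bound(eps, delta)))
```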

The rest of the paper is organized as follows: To prove Theorem 2, we follow the approach of Kane et al. (2011). To make their constant $c$ explicit, however, we must use a more careful argument. In Section 2, we provide explicit conditions under which we prove the main result, Theorem 2. We delay the proofs of these conditions until Section 3 in order to make the main result more accessible, since only the statements of these results are needed and not their more technical proofs. In Section 3, we prove probabilistic bounds on $s=x_1^2+\dots+x_k^2$, where $x=(x_1,\dots,x_k,\dots,x_d)$ is a random variable uniformly distributed on $S^{d-1}$. These bounds explicitly determine the hidden constants in the bounds on $\mathrm{Prob}[s<s_0(1-\varepsilon)]$ and $\mathrm{Prob}[s>s_0(1+\varepsilon)]$ from Kane et al. (2011, Theorem 10). These bounds can be viewed as explicit versions of concentration theorems or laws of large numbers from probability theory. In the Appendix, we provide an alternate proof of a result in Kane et al. (2011), showing that if $d\omega_{d-1}$ is the surface area measure for the $(d-1)$-dimensional sphere $S^{d-1}$, then for any $k<d$,

$$d\omega_{d-1} \;=\; \tfrac{1}{2}\,f(s)\,ds\;d\omega_{k-1}\,d\omega_{d-k-1},$$

where $s\in[0,1]$ and $f(s)=s^{\frac{k}{2}-1}(1-s)^{\frac{d-k}{2}-1}$.

2. Asymptotic Threshold Bound

In this section, we prove the asymptotic threshold bound for JL transformations. In particular, we provide specific conditions that result in the asymptotic threshold bound of $4\varepsilon^{-2}\log(1/\delta)$. In Section 3, we prove that these conditions hold, but the details of these proofs are more technical and are unnecessary to understand the main results of this paper.

2.1 The Uniform Distribution on $S^{d-1}$

There is a unique probability distribution, called the uniform distribution, on the $(d-1)$-dimensional sphere $S^{d-1}$ that is invariant under the orthogonal group. We express the uniform distribution on $S^{d-1}$ in terms of the surface area differential form $d\omega_{d-1}$, which means that, for any measurable subset $V\subseteq S^{d-1}$, the $(d-1)$-dimensional surface area of $V$ equals the integral with respect to $d\omega_{d-1}$, i.e., $\mathrm{Vol}_{d-1}(V)=\int_V d\omega_{d-1}$. Thus, the uniform distribution on $S^{d-1}$ corresponds to the measure $d\omega_{d-1}/\mathrm{Vol}_{d-1}(S^{d-1})$.

As we are interested in reducing a $d$-dimensional vector to a $k$-dimensional vector for $k<d$, we derive a relationship between the uniform distribution on $S^{d-1}$ and the uniform distributions on $S^{k-1}$ and $S^{d-k-1}$. Following the approach of Kane et al. (2011), for $k<d$, we define an injective map $\Psi:S^{d-1}\to[0,1]\times S^{k-1}\times S^{d-k-1}$ as follows: For any $x=(x_1,x_2,\dots,x_d)^t\in S^{d-1}$, we define $s$ in $\Psi(x)=(s,u,v)$ as $s=x_1^2+\dots+x_k^2$. In the case where $0<s<1$, we define $u=(x_1,\dots,x_k)^t/\sqrt{s}$ and $v=(x_{k+1},\dots,x_d)^t/\sqrt{1-s}$. (In this paper, we suppress the pullback maps on equalities for differential forms since there is a unique almost bijective map under consideration in each case. We leave the details to the interested reader.)

When $s=0$, i.e., $x_1=\dots=x_k=0$, we define $u=(1,0,\dots,0)^t$ (or any point in $S^{k-1}$) and $v=(x_{k+1},\dots,x_d)^t$. Similarly, for $s=1$, we define $u=(x_1,\dots,x_k)^t$ and $v=(1,0,\dots,0)^t$ (or any point in $S^{d-k-1}$). It is straightforward to check that $\Psi$ is injective. In addition, the complement of the image of $\Psi$ is a subset of $\{0,1\}\times S^{k-1}\times S^{d-k-1}$, which has, in turn, $(d-1)$-dimensional surface area equal to $0$. Therefore, when convenient, we assume that $s\in(0,1)$.

For $s\in[0,1]$, we define

$$f(s) \;=\; s^{\frac{k}{2}-1}(1-s)^{\frac{d-k}{2}-1}.$$

In the Appendix, we provide an alternative proof of the computation in Kane et al. (2011) which shows that, via the map $\Psi$,

$$d\omega_{d-1} \;=\; \tfrac{1}{2}\,f(s)\,ds\;d\omega_{k-1}\,d\omega_{d-k-1}.$$

Equivalently, in terms of probability distributions,

$$\frac{d\omega_{d-1}}{\mathrm{Vol}_{d-1}(S^{d-1})} \;=\; Bf(s)\,ds\;\frac{d\omega_{k-1}}{\mathrm{Vol}_{k-1}(S^{k-1})}\;\frac{d\omega_{d-k-1}}{\mathrm{Vol}_{d-k-1}(S^{d-k-1})},$$

where

$$B \;=\; \frac{\mathrm{Vol}_{k-1}(S^{k-1})\,\mathrm{Vol}_{d-k-1}(S^{d-k-1})}{2\,\mathrm{Vol}_{d-1}(S^{d-1})} \;=\; \frac{\Gamma(d/2)}{\Gamma(k/2)\,\Gamma((d-k)/2)}$$

is an appropriate scaling constant, depending on the gamma function; for more details, see the Appendix. Moreover, in this situation, we observe that $Bf(s)$ is a probability density on $[0,1]$. This implies that the uniform distribution on $S^{d-1}$ is a direct product of the distributions on the factors. In other words, a uniformly distributed random variable $X_{d-1}$ on $S^{d-1}$ can be decomposed into three random variables $\Psi(X_{d-1})=(S,X_{k-1},X_{d-k-1})$ with the following properties:

(i) $S$ is a random variable on $[0,1]$ with density function $Bf(s)$,
(ii) $X_{k-1}$ and $X_{d-k-1}$ are uniformly distributed on $S^{k-1}$ and $S^{d-k-1}$, and
(iii) the random variables $S$, $X_{k-1}$, and $X_{d-k-1}$ are independent.

The independence of these three random variables is a key property in our proof as it allows us to study the three spaces separately.
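The decomposition above can be checked numerically: up to normalization, $Bf(s)$ is the density of a $\mathrm{Beta}(k/2,(d-k)/2)$ random variable, so $s=x_1^2+\dots+x_k^2$ for uniform $x$ on $S^{d-1}$ should have the corresponding mean and variance. The following sketch is an added illustration (not part of the paper) comparing empirical moments of $s$ with those of the Beta distribution.

```python
import numpy as np

def sample_s(d, k, n, rng):
    """s = x_1^2 + ... + x_k^2 for x uniform on the unit sphere S^{d-1}."""
    x = rng.normal(size=(n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # uniform points on S^{d-1}
    return np.sum(x[:, :k] ** 2, axis=1)

if __name__ == "__main__":
    d, k, n = 200, 20, 50_000
    rng = np.random.default_rng(2)
    s = sample_s(d, k, n, rng)
    a, b = k / 2.0, (d - k) / 2.0                   # Beta(k/2, (d-k)/2) parameters
    beta_mean = a / (a + b)                         # equals s_0 = k/d
    beta_var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    print("empirical mean/var:", s.mean(), s.var())
    print("Beta mean/var:     ", beta_mean, beta_var)
```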

2.2 Upper Bound: Explicit JL Distribution

We recall that Achlioptas (2003) proved that

$$k_0(\varepsilon,\delta) \;\leq\; \frac{2\log(1/\delta)}{\varepsilon^2/2-\varepsilon^3/3} \;=\; \frac{4\log(1/\delta)}{\varepsilon^2}\,[1+o(1)].$$

In this section, we give an alternate proof of this result using the approach and bounds from this paper. We recall the following construction by Dasgupta and Gupta (2003): A distribution $\mathcal{D}$ on $k\times d$ matrices is formed by picking a $d\times d$ orthogonal matrix $V=(v_1,\dots,v_d)^t$ uniformly at random with respect to the Haar measure on orthogonal matrices and then letting $A=s_0^{-1/2}(v_1,\dots,v_k)^t$, where $s_0=k/d$.

The following proposition shows that $k_0(\varepsilon,\delta)\leq 4\varepsilon^{-2}\log(1/\delta)\,[1+o(1)]$, which, in turn, implies that the limit appearing in Theorem 2 (if it exists) is at most $1$:

Proposition 3 Let $0<\varepsilon,\delta<1/2$ and $s_0=k/d$. Suppose that there is some constant $C$ so that

$$\max\big\{\mathrm{Prob}_{x\in S^{d-1}}[s<s_0(1-\varepsilon)],\;\mathrm{Prob}_{x\in S^{d-1}}[s>s_0(1+\varepsilon)]\big\} \;\leq\; C\,e^{-\frac{k}{4}(\varepsilon^2-\varepsilon^3)},$$

where $s=x_1^2+\dots+x_k^2$ is defined as in $\Psi(x)=(s,u,v)$. Then, there exists an $o(1)$ function, which approaches zero as both $\varepsilon$ and $\delta$ approach zero, so that if $k>4\varepsilon^{-2}\log(1/\delta)\,[1+o(1)]$, then the distribution on $k\times d$ random matrices defined as above is an $(\varepsilon,\delta)$-JL distribution, that is, for any $w\in S^{d-1}$,

$$\mathrm{Prob}_{A\sim\mathcal{D}}\big[\,\big|\|Aw\|^2-1\big|>\varepsilon\,\big] \;\leq\; \delta.$$

Proof Let $V$ be the random orthogonal matrix as defined above, and let $x=(x_1,\dots,x_d)^t=Vw$. Then $Aw=s_0^{-1/2}(x_1,\dots,x_k)^t$, and $\|Aw\|^2=s_0^{-1}(x_1^2+\dots+x_k^2)$. Since $V$ is orthogonal and $\|w\|=1$, we have that $\|x\|=1$, and, hence, $x\in S^{d-1}$. We observe that since $V$ is a random orthogonal matrix, for fixed $w\in S^{d-1}$, $x=Vw$ is a random variable, uniformly distributed on $S^{d-1}$. Hence,

$$\mathrm{Prob}_{A\sim\mathcal{D}}\Big[\Big|\|Aw\|^2-1\Big|>\varepsilon\Big] \;=\; \mathrm{Prob}_{x\in S^{d-1}}\Big[\Big|\frac{1}{s_0}\sum_{i=1}^{k}x_i^2-1\Big|>\varepsilon\Big],$$

where $x\in S^{d-1}$ means that $x$ is a random variable uniformly distributed on $S^{d-1}$. Let $s=\sum_{i=1}^{k}x_i^2$. Then $s\in[0,1]$ and the probability above becomes

$$\mathrm{Prob}_{x\in S^{d-1}}[s<s_0(1-\varepsilon)]+\mathrm{Prob}_{x\in S^{d-1}}[s>s_0(1+\varepsilon)] \;\leq\; 2C\,e^{-\frac{k}{4}(\varepsilon^2-\varepsilon^3)}, \qquad (3)$$

by assumption. We observe that when

$$k \;>\; \frac{4}{\varepsilon^2-\varepsilon^3}\Big[\log\frac{1}{\delta}+\log(2C)\Big] \;=\; \frac{4\log(1/\delta)}{\varepsilon^2}\,[1+o(1)], \qquad (4)$$

the right-hand side of Inequality (3) is less than $\delta$. In this case, the $o(1)$ term needed in the theorem statement appears in Inequality (4). Therefore, when $k>4\varepsilon^{-2}\log(1/\delta)\,[1+o(1)]$, the distribution $\mathcal{D}$ is an $(\varepsilon,\delta)$-JL distribution.

(We observe that this result also follows from work in Dasgupta and Gupta (2003). In particular, the main lemma of Dasgupta and Gupta (2003) can be used to derive bounds similar to the assumed bounds in this statement.)
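The construction recalled in Section 2.2 is simple to simulate. The sketch below is an added illustration (not from the paper): it draws a Haar-random orthogonal matrix via the QR decomposition of a Gaussian matrix (a standard recipe, assumed here), forms $A=\sqrt{d/k}\,(v_1,\dots,v_k)^t$, and estimates the failure probability for a fixed unit vector $w$.

```python
import numpy as np

def haar_projection(d, k, rng):
    """A = sqrt(d/k) * (first k rows of a Haar-random d x d orthogonal matrix),
    i.e. the construction of Section 2.2 with s_0 = k/d."""
    g = rng.normal(size=(d, d))
    q, r = np.linalg.qr(g)
    q *= np.sign(np.diag(r))          # sign fix so that q is Haar-distributed
    return np.sqrt(d / k) * q[:k, :]

if __name__ == "__main__":
    d, k, eps, trials = 200, 50, 0.25, 500
    rng = np.random.default_rng(3)
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    bad = sum(abs(np.linalg.norm(haar_projection(d, k, rng) @ w) ** 2 - 1.0) > eps
              for _ in range(trials))
    print("empirical failure probability:", bad / trials)
```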

2.3 Lower Bound for Arbitrary Distributions

In this section, we prove an optimal lower bound on the limit in Theorem 2 that matches the upper bound from the previous section. The proof of this lower bound is the main challenge in this paper. We begin with the following key lemma:

Lemma 4 Let $x=(x_1,\dots,x_d)^t$ be a random variable, uniformly distributed on $S^{d-1}$, $\Psi(x)=(s,u,v)$, and $s_0=k/d$. Suppose that for a fixed $\varepsilon>0$,

$$\min\big\{\mathrm{Prob}[s>s_0(1+\varepsilon)],\;\mathrm{Prob}[s<s_0(1-\varepsilon)]\big\} \;\geq\; L,$$

where $s$ is a random variable with probability distribution $Bf(s)$ on $[0,1]$. For any function $c(u,v)>0$ depending only on $u\in S^{k-1}$ and $v\in S^{d-k-1}$ (i.e., independent of $s$), we have

$$\mathrm{Prob}_{x\in S^{d-1}}\big[\,|sc-1|>\varepsilon\,\big] \;\geq\; L.$$

Proof By the equality of differential forms from Section 2.1,

$$\mathrm{Prob}\big[\,|sc-1|>\varepsilon\,\big] \;=\; \int_{S^{k-1}}\int_{S^{d-k-1}}\left(\int_{|sc-1|>\varepsilon}Bf(s)\,ds\right)\frac{d\omega_{k-1}}{\mathrm{Vol}_{k-1}(S^{k-1})}\,\frac{d\omega_{d-k-1}}{\mathrm{Vol}_{d-k-1}(S^{d-k-1})}.$$

Our goal is to find a lower bound on the inner integral $\int_{|sc-1|>\varepsilon}Bf(s)\,ds$. Due to the independence of $u$, $v$, and $s$, $c(u,v)$ is a fixed positive constant within this integral. We observe that the region $\{|sc-1|>\varepsilon\}$ consists of two intervals, $s<(1-\varepsilon)/c$ and $s>(1+\varepsilon)/c$, and we consider two cases depending on the value of $c$. We begin by recalling that

$$\mathrm{Prob}[s>s_0(1+\varepsilon)]=\int_{s>s_0(1+\varepsilon)}Bf(s)\,ds \quad\text{and}\quad \mathrm{Prob}[s<s_0(1-\varepsilon)]=\int_{s<s_0(1-\varepsilon)}Bf(s)\,ds.$$

If $c\geq 1/s_0$, then $(1+\varepsilon)/c\leq s_0(1+\varepsilon)$, and, hence,

$$\int_{|sc-1|>\varepsilon}Bf(s)\,ds \;\geq\; \int_{s>(1+\varepsilon)/c}Bf(s)\,ds \;\geq\; \int_{s>s_0(1+\varepsilon)}Bf(s)\,ds \;\geq\; L.$$

On the other hand, if $c<1/s_0$, then $s_0(1-\varepsilon)<(1-\varepsilon)/c$, and, hence,

$$\int_{|sc-1|>\varepsilon}Bf(s)\,ds \;\geq\; \int_{s<(1-\varepsilon)/c}Bf(s)\,ds \;\geq\; \int_{s<s_0(1-\varepsilon)}Bf(s)\,ds \;\geq\; L.$$

Therefore, the integral $\int_{|sc-1|>\varepsilon}Bf(s)\,ds$ is bounded from below by $L$, and

$$\mathrm{Prob}\big[\,|sc-1|>\varepsilon\,\big] \;\geq\; \int_{S^{k-1}}\int_{S^{d-k-1}}L\;\frac{d\omega_{k-1}}{\mathrm{Vol}_{k-1}(S^{k-1})}\,\frac{d\omega_{d-k-1}}{\mathrm{Vol}_{d-k-1}(S^{d-k-1})} \;=\; L.$$

We now show that when $k\leq\eta\varepsilon^{-2}\log(1/\delta)$ with $\eta<4$, and $\varepsilon$ and $\delta$ are sufficiently small, there does not exist an $(\varepsilon,\delta)$-JL distribution on $\mathbb{R}^{k\times d}$. This fact, combined with the results in Section 2.2, shows that the limit appearing in Theorem 2 exists and equals $1$. In order to show this, we consider the following related problem: By definition, for a probability distribution $\mathcal{D}$ on $\mathbb{R}^{k\times d}$ to be an $(\varepsilon,\delta)$-JL distribution, the following inequality must hold for every $w\in S^{d-1}$:

$$\mathrm{Prob}_{A\sim\mathcal{D}}\big[\,\big|\|Aw\|^2-1\big|>\varepsilon\,\big] \;<\; \delta.$$

Hence,

$$\mathrm{Prob}_{A\sim\mathcal{D},\,w\in S^{d-1}}\big[\,\big|\|Aw\|^2-1\big|>\varepsilon\,\big] \;<\; \delta, \qquad (5)$$

where $w\in S^{d-1}$ is a random variable distributed uniformly on $S^{d-1}$. Following the approach of Kane et al. (2011), our goal is to prove that, for every $A\in\mathbb{R}^{k\times d}$,

$$\mathrm{Prob}_{w\in S^{d-1}}\big[\,\big|\|Aw\|^2-1\big|>\varepsilon\,\big] \;>\; \delta. \qquad (6)$$

When Inequality (6) holds for all $A$, then Inequality (5) cannot hold for any distribution $\mathcal{D}$ on $\mathbb{R}^{k\times d}$. Therefore, an $(\varepsilon,\delta)$-JL distribution does not exist. We make this precise in the following theorem:

Theorem 5 Suppose that $\eta<4$ and let $k(\varepsilon,\delta)=\eta\varepsilon^{-2}\log(1/\delta)$. Let $s_0=k/d$, and suppose that, for every $\varepsilon$, $\delta$, and $s_0$ sufficiently small (to make $s_0$ sufficiently small, $d$ must be sufficiently large),

$$\min\big\{\mathrm{Prob}[s>s_0(1+\varepsilon)],\;\mathrm{Prob}[s<s_0(1-\varepsilon)]\big\} \;\geq\; C\,\delta^{\frac{\eta}{4}\gamma},$$

where $C>0$ is an absolute constant, and $\gamma$ approaches $1$ as $\varepsilon$, $\delta$, and $s_0$ approach $0$. Then, by decreasing $\varepsilon$, $\delta$, and $s_0$ as needed, for every matrix $A\in\mathbb{R}^{k(\varepsilon,\delta)\times d}$,

$$\mathrm{Prob}_{w\in S^{d-1}}\big[\,\big|\|Aw\|^2-1\big|>\varepsilon\,\big] \;>\; \delta.$$

Proof We assume that $A$ has rank $k=k(\varepsilon,\delta)$ since, if not, we may reduce $k$ (and decrease $\eta$ correspondingly) to the rank of $A$. Let $A=U\Sigma V^t$ be the singular value decomposition of $A$, where $U$ is a $k\times k$ orthogonal matrix, $V=(v_1,\dots,v_d)$ is a $d\times d$ orthogonal matrix, and $\Sigma$ is a $k\times d$ diagonal matrix with $\lambda_i>0$ its entry at $\Sigma_{i,i}$ for $1\leq i\leq k$. Let $x=(x_1,\dots,x_d)^t=V^tw$. Since $V$ is orthogonal, we have $x\in S^{d-1}$. We observe that since $w$ is a uniformly distributed random variable on $S^{d-1}$, $V^tw$ is also a uniformly distributed random variable on $S^{d-1}$. Therefore, since $U$ is orthogonal, we have

$$\|Aw\|^2 \;=\; \|U\Sigma x\|^2 \;=\; \|\Sigma x\|^2 \;=\; \sum_{i=1}^{k}\lambda_i^2x_i^2.$$

We recall the definition of the function $\Psi(x)=(s,u,v)$ where $s=x_1^2+\dots+x_k^2$. Moreover, we restrict our attention to the case where $s\in(0,1)$, since the complement has zero measure. Let

$$c \;=\; \sum_{i=1}^{k}\lambda_i^2x_i^2\Big/s \;=\; \|\Sigma_{k\times k}u\|^2,$$

where $\Sigma_{k\times k}$ denotes the $k\times k$ principal submatrix of $\Sigma$. Then

$$\mathrm{Prob}_{w\in S^{d-1}}\big[\,\big|\|Aw\|^2-1\big|>\varepsilon\,\big] \;=\; \mathrm{Prob}_{x\in S^{d-1}}\big[\,|sc-1|>\varepsilon\,\big].$$

Due to the independence of $u$, $v$, and $s$, it follows that $c$ depends only on $u$. Therefore, by Lemma 4, it follows that

$$\mathrm{Prob}_{w\in S^{d-1}}\big[\,\big|\|Aw\|^2-1\big|>\varepsilon\,\big] \;\geq\; C\,\delta^{\frac{\eta}{4}\gamma}.$$

It follows that for $\varepsilon$, $\delta$, and $s_0$ sufficiently small, $C\,\delta^{\frac{\eta}{4}\gamma}>\delta$.

Therefore, by selecting $\varepsilon$, $\delta$, and $s_0$ sufficiently small, it follows from Theorem 5 that there is no $(\varepsilon,\delta)$-JL distribution when $k\leq\eta\varepsilon^{-2}\log(1/\delta)$ for $\eta<4$. Therefore, $k_0(\varepsilon,\delta)\geq\eta\varepsilon^{-2}\log(1/\delta)$. We collect the results of Proposition 3 and Theorem 5 in the following corollary:

Corollary 6 Assume the hypotheses on $\mathrm{Prob}[s>s_0(1+\varepsilon)]$ and $\mathrm{Prob}[s<s_0(1-\varepsilon)]$ from Proposition 3 and Theorem 5 hold.

(a) There exists an $o(1)$ function that approaches $0$ as $\varepsilon$ and $\delta$ approach zero such that if $k>4\varepsilon^{-2}\log(1/\delta)\,[1+o(1)]$, then there exists a JL distribution.

(b) If $k(\varepsilon,\delta)=\eta\varepsilon^{-2}\log(1/\delta)$ with $\eta<4$, then, by decreasing $\varepsilon$ and $\delta$, and increasing $d$, there is no $(\varepsilon,\delta)$-JL distribution for any $k\leq k(\varepsilon,\delta)$.

This proves the main result of the paper. In the following section, we provide the more technical results that verify the assumptions in Proposition 3 and Theorem 5.
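The phase transition described by Corollary 6 can be observed numerically. The sketch below is an added illustration (not part of the paper): it takes the partial isometry $A=\sqrt{d/k}\,[I_k\mid 0]$, for which $\|Aw\|^2=s/s_0$, and uses the fact from Section 2.1 that $s$ follows a $\mathrm{Beta}(k/2,(d-k)/2)$ distribution to estimate the failure probability over a uniformly random $w$; the estimate crosses $\delta$ near the threshold $4\varepsilon^{-2}\log(1/\delta)$.

```python
import numpy as np

def failure_probability(d, k, eps, trials=1_000_000, seed=4):
    """Estimate Prob_{w uniform on S^{d-1}}[ | ||Aw||^2 - 1 | > eps ] for the
    partial isometry A = sqrt(d/k) [ I_k | 0 ].  For this A, ||Aw||^2 = s/s_0
    with s = w_1^2 + ... + w_k^2, and by Section 2.1, s ~ Beta(k/2, (d-k)/2)."""
    rng = np.random.default_rng(seed)
    s = rng.beta(k / 2.0, (d - k) / 2.0, size=trials)
    return np.mean(np.abs(s * (d / k) - 1.0) > eps)

if __name__ == "__main__":
    d, eps, delta = 4000, 0.25, 1e-3
    print("threshold 4 eps^-2 log(1/delta):", 4 * np.log(1 / delta) / eps ** 2)
    for k in [200, 450, 700]:
        print(k, failure_probability(d, k, eps))
```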

3. Explicit Concentration Bounds

Throughout this section, we assume that $s$ is a random variable with probability density $Bf(s)$. We define $s_0=k/d$, and we further assume that $0<\varepsilon\leq 1/2$, $0<\delta\leq 1/2$, $k\geq 4(1+\varepsilon)^2$, and $s_0<0.4$. We derive lower and upper bounds for the probabilities

$$\mathrm{Prob}[s>s_0(1+\varepsilon)] \quad\text{and}\quad \mathrm{Prob}[s<s_0(1-\varepsilon)],$$

using the probability density $Bf(s)$ for $s$ with $f(s)=s^{\frac{k}{2}-1}(1-s)^{\frac{d-k}{2}-1}$. These probabilities appear in Kane et al. (2011, Theorem 10) without the explicit constants that are derived in this paper. These bounds are instances of explicit concentration theorems, or explicit laws of large numbers, from probability theory. Our goal is to formulate these bounds as precisely as possible so that the lower and upper bounds are asymptotically the same when $\varepsilon$ and $\delta$ approach $0$. (Throughout this section, it is possible to derive tighter bounds on the constants in the following inequalities, but the ones appearing here are sufficient for our proofs. We leave the details of the tighter bounds to the interested reader.)

3.1 Bounds for B

We recall that $\Gamma(1/2)=\sqrt{\pi}$, $\Gamma(1)=1$, and $\Gamma(1+z)=z\Gamma(z)$. Hence $\Gamma(d/2)=\left(\frac{d}{2}-1\right)!$ if $d$ is even, and $\Gamma(d/2)=\frac{1}{2}\cdot\frac{3}{2}\cdots\frac{d-2}{2}\,\sqrt{\pi}$ if $d$ is odd.

In this section, we derive lower and upper bounds for $B$ (see Equation (38) in the Appendix) by using the following form of Stirling's approximation of $n!$ due to Robbins (1955):

$$\sqrt{2\pi}\,n^{n+1/2}e^{-n}e^{\frac{1}{12n+1}} \;<\; \Gamma(n+1)=n! \;<\; \sqrt{2\pi}\,n^{n+1/2}e^{-n}e^{\frac{1}{12n}}.$$

Since we are interested in the asymptotic behavior, we focus on the case where $d$ is even. This choice does not affect the asymptotic results of our paper, but the calculations are more straightforward in this case. We leave the details for the case where $d$ is odd to the interested reader.

Lemma 7 Suppose $k$ and $d$ are both even. Then we have the following inequality:

$$\frac{e^{-1/6}}{2\sqrt{\pi}}\,\sqrt{\frac{k(d-k)}{d}}\;\frac{d^{d/2}}{k^{k/2}(d-k)^{(d-k)/2}} \;\leq\; B \;\leq\; \frac{e^{1/6}}{2\sqrt{\pi}}\,\sqrt{\frac{k(d-k)}{d}}\;\frac{d^{d/2}}{k^{k/2}(d-k)^{(d-k)/2}}.$$

Proof Using the bounds on $n!$ from Robbins (1955) applied to $(d/2)!$, $(k/2)!$, and $((d-k)/2)!$, together with $\Gamma(m/2)=(m/2)!\,/(m/2)$ for even $m$, we obtain

$$C_0\,\sqrt{\frac{k(d-k)}{d}}\;\frac{d^{d/2}}{k^{k/2}(d-k)^{(d-k)/2}} \;\leq\; B \;\leq\; C_1\,\sqrt{\frac{k(d-k)}{d}}\;\frac{d^{d/2}}{k^{k/2}(d-k)^{(d-k)/2}},$$

where

$$C_0=\frac{1}{2\sqrt{\pi}}\,e^{\frac{1}{6d+1}-\frac{1}{6k}-\frac{1}{6(d-k)}}\geq\frac{e^{-1/6}}{2\sqrt{\pi}} \quad\text{and}\quad C_1=\frac{1}{2\sqrt{\pi}}\,e^{\frac{1}{6d}-\frac{1}{6k+1}-\frac{1}{6(d-k)+1}}\leq\frac{e^{1/6}}{2\sqrt{\pi}},$$

since $k\geq 4$ and $d-k\geq 6$.

Corollary 8 With $s_0=k/d$, we have

$$\frac{e^{-1/6}}{2\sqrt{\pi}}\,\sqrt{k} \;\leq\; B\,s_0f(s_0) \;\leq\; \frac{9\,e^{1/6}}{2\sqrt{\pi}}\,\sqrt{k}.$$

Proof Evaluating $f$ at $s_0$ and combining with the bounds on $B$ from Lemma 7 gives

$$\frac{e^{-1/6}}{2\sqrt{\pi}}\,\sqrt{\frac{k}{1-s_0}} \;\leq\; B\,s_0f(s_0) \;\leq\; \frac{e^{1/6}}{2\sqrt{\pi}}\,\sqrt{\frac{k}{1-s_0}}.$$

Since $0<s_0<0.4$, we have $1\leq(1-s_0)^{-1/2}\leq\sqrt{5/3}<9$, which yields the stated bounds.
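Robbins' refinement of Stirling's formula used above is easy to sanity-check numerically; the following snippet (an illustration added here, not part of the paper) verifies the two-sided bound for several values of $n$.

```python
import math

def robbins_bounds(n):
    """Lower and upper bounds on n! from Robbins (1955)."""
    base = math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)
    return base * math.exp(1.0 / (12 * n + 1)), base * math.exp(1.0 / (12 * n))

for n in [2, 4, 10, 50, 100]:
    lo, hi = robbins_bounds(n)
    assert lo < math.factorial(n) < hi
    print(n, lo / math.factorial(n), hi / math.factorial(n))
```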

3.2 Bounds on Prob[s > s₀(1+ε)]

We begin by mentioning the following inequalities, which are used in our arguments below:

$$\log(1+x) \;\geq\; x-\frac{x^2}{2} \quad\text{for } 0<x<1,\qquad(7)$$
$$\log(1-x) \;\geq\; -x-x^2 \quad\text{for } 0<x\leq 0.68,\qquad(8)$$
$$\log(1+x) \;\leq\; x \quad\text{for } x>-1,\;\text{and}\qquad(9)$$
$$\log(1+x) \;\leq\; x-\frac{x^2}{2}+\frac{x^3}{3} \quad\text{for } x>0.\qquad(10)$$

These bounds can be verified by employing basic calculus techniques (e.g., derivatives and Taylor expansions) as well as sufficiently accurate approximations. Using these inequalities, we derive the following bounds:

Lemma 9

$$\mathrm{Prob}[s>s_0(1+\varepsilon)] \;\geq\; \frac{e^{-1/6}}{4\sqrt{\pi}}\,e^{-\frac{k}{4}\left(\varepsilon+k^{-1/2}\right)^2\frac{1+s_0}{1-s_0}}.$$

Moreover, when $k\leq\eta\varepsilon^{-2}\log(1/\delta)$,

$$\mathrm{Prob}[s>s_0(1+\varepsilon)] \;\geq\; \frac{e^{-1/6}}{4\sqrt{\pi}}\,\delta^{\frac{\eta}{4}\gamma_1}, \qquad\text{where } \gamma_1=\left(1+\frac{1}{\sqrt{\eta\log(1/\delta)}}\right)^2\frac{1+s_0}{1-s_0}.$$

Additionally, $\gamma_1$ approaches $1$ as $\varepsilon$, $\delta$, and $s_0$ approach $0$.

Proof Note that

$$\mathrm{Prob}[s>s_0(1+\varepsilon)] \;=\; B\int_{s>s_0(1+\varepsilon)}f(s)\,ds \;=\; Bs_0\int_{x>\varepsilon}f(s_0(1+x))\,dx, \qquad(11)$$

via the substitution $s=s_0(1+x)$. Let $g(s)=s^{k/2}(1-s)^{(d-k)/2}$; then $f(s_0(1+x))$ can be expressed in terms of $g(s)$, namely,

$$f(s_0(1+x)) \;=\; \frac{g(s_0(1+x))}{s_0(1+x)\,\big(1-s_0(1+x)\big)}. \qquad(12)$$

To find a lower bound on $\mathrm{Prob}[s>s_0(1+\varepsilon)]$, we bound $g(s_0(1+x))$ from below. Taking the logarithm of $g(s_0(1+x))$, we find

$$\log g(s_0(1+x)) \;=\; \log g(s_0)+\frac{d}{2}\Big[s_0\log(1+x)+(1-s_0)\log\Big(1-\frac{s_0x}{1-s_0}\Big)\Big]. \qquad(13)$$

Restricting $x$ to the interval $0\leq x<1$, it then follows that $0<\frac{s_0x}{1-s_0}<0.68$ from the assumption that $s_0<0.4$. We now bound the second term in Equation (13) using Inequalities (7) and (8), as follows:

$$s_0\log(1+x)+(1-s_0)\log\Big(1-\frac{s_0x}{1-s_0}\Big) \;\geq\; -\frac{s_0(1+s_0)}{2(1-s_0)}\,x^2. \qquad(14)$$

Hence, by substituting Inequality (14) into Equation (13) and exponentiating, we obtain the following lower bound for $g(s_0(1+x))$:

$$g(s_0(1+x)) \;\geq\; g(s_0)\,e^{-\frac{k}{4}x^2\frac{1+s_0}{1-s_0}}. \qquad(15)$$

We also observe that since $s_0<0.4$ and $0\leq x<1$, the denominator of Equation (12) is bounded from above as follows:

$$s_0(1+x)\big(1-s_0(1+x)\big) \;\leq\; 2\,s_0(1-s_0). \qquad(16)$$

Therefore, by substituting Inequalities (15) and (16) into Equation (12), when $0\leq x<1$, we have

$$f(s_0(1+x)) \;\geq\; \frac{g(s_0)}{2s_0(1-s_0)}\,e^{-\frac{k}{4}x^2\frac{1+s_0}{1-s_0}} \;=\; \frac{f(s_0)}{2}\,e^{-\frac{k}{4}x^2\frac{1+s_0}{1-s_0}}. \qquad(17)$$

Since $s_0<0.4$, we observe that $\frac{1-s_0}{s_0}=\frac{d-k}{k}>\frac{3}{2}$. Since we assumed that $\varepsilon\leq 1/2$ and $k\geq 4(1+\varepsilon)^2>4$, it follows that $\varepsilon+k^{-1/2}<1$, and so $\varepsilon+k^{-1/2}<1<\frac{1-s_0}{s_0}$. Therefore, we further restrict $x$ to the interval $(\varepsilon,\varepsilon+k^{-1/2})$ and observe that Inequality (17) applies in this range. Therefore, since $f$ is a positive function,

$$Bs_0\int_{x>\varepsilon}f(s_0(1+x))\,dx \;\geq\; Bs_0\int_{\varepsilon}^{\varepsilon+k^{-1/2}}f(s_0(1+x))\,dx \;\geq\; \frac{Bs_0f(s_0)}{2}\int_{\varepsilon}^{\varepsilon+k^{-1/2}}e^{-\frac{k}{4}x^2\frac{1+s_0}{1-s_0}}\,dx. \qquad(18)$$

Replacing $Bs_0f(s_0)$ with its lower bound given in Corollary 8 and observing that the integrand is decreasing over an interval of width $k^{-1/2}$, Inequality (18) is bounded from below by

$$\frac{Bs_0f(s_0)}{2}\,k^{-1/2}\,e^{-\frac{k}{4}\left(\varepsilon+k^{-1/2}\right)^2\frac{1+s_0}{1-s_0}} \;\geq\; \frac{e^{-1/6}}{4\sqrt{\pi}}\,e^{-\frac{k}{4}\left(\varepsilon+k^{-1/2}\right)^2\frac{1+s_0}{1-s_0}},$$

which completes the first inequality. The second inequality follows by replacing $k$ by the given upper bound and simplifying.

Lemma 10

$$\mathrm{Prob}[s>s_0(1+\varepsilon)] \;\leq\; \frac{18\,e^{2/3}}{\sqrt{\pi}\,\varepsilon\sqrt{k}}\,e^{-\frac{k}{4}\left(\varepsilon^2-\frac{2\varepsilon^3}{3}\right)} \;\leq\; \frac{18\,e^{2/3}}{\sqrt{\pi}}\,e^{-\frac{k}{4}(\varepsilon^2-\varepsilon^3)},$$

where the second inequality holds whenever $\varepsilon\sqrt{k}\geq 1$.

Proof To derive an upper bound, we start with the expression for $\mathrm{Prob}[s>s_0(1+\varepsilon)]$ from Equation (11). We first find an upper bound on $f(s_0(1+x))$. Using the inequality $1-y\leq e^{-y}$, which holds for all $y$, we have

$$f(s_0(1+x)) \;=\; f(s_0)\,(1+x)^{\frac{k}{2}-1}\Big(1-\frac{s_0x}{1-s_0}\Big)^{\frac{d-k}{2}-1} \;\leq\; f(s_0)\,(1+x)^{\frac{k}{2}-1}e^{-\frac{d-k-2}{2}\cdot\frac{s_0x}{1-s_0}}. \qquad(19)$$

Moreover, since $s_0<0.4$,

$$\frac{(d-k-2)\,s_0}{1-s_0} \;=\; k-\frac{2s_0}{1-s_0} \;\geq\; k-2. \qquad(20)$$

By applying Inequality (20) to Inequality (19), we derive the upper bound

$$f(s_0(1+x)) \;\leq\; f(s_0)\,(1+x)^{\frac{k}{2}-1}e^{-\frac{k-2}{2}x}. \qquad(21)$$

Therefore, by extending the region of integration in Equation (11), we find the following upper bound on the probability:

$$Bs_0\int_{x>\varepsilon}f(s_0(1+x))\,dx \;\leq\; Bs_0f(s_0)\int_{\varepsilon}^{\infty}(1+x)^{\frac{k}{2}-1}e^{-\frac{k-2}{2}x}\,dx.$$

By integrating by parts, we observe that for any $l$ and $m$ with $0\leq l\leq m$,

$$\int_{\varepsilon}^{\infty}(1+x)^{l}e^{-mx}\,dx \;=\; \frac{(1+\varepsilon)^{l}}{m}\,e^{-m\varepsilon}+\frac{l}{m}\int_{\varepsilon}^{\infty}(1+x)^{l-1}e^{-mx}\,dx. \qquad(22)$$

Applying Inequality (22) repeatedly with $l=\frac{k}{2}-1$ and $m=\frac{k-2}{2}$, and bounding the resulting geometric series from above, gives

$$\int_{\varepsilon}^{\infty}(1+x)^{\frac{k}{2}-1}e^{-\frac{k-2}{2}x}\,dx \;\leq\; \frac{2\,(1+\varepsilon)^{k/2}}{\varepsilon\,(k-2)}\,e^{-\frac{(k-2)\varepsilon}{2}}. \qquad(23)$$

By applying Inequality (10) to $(1+\varepsilon)^{k/2}=e^{\frac{k}{2}\log(1+\varepsilon)}$, we obtain the upper bound

$$(1+\varepsilon)^{k/2}\,e^{-\frac{(k-2)\varepsilon}{2}} \;\leq\; e^{\varepsilon}\,e^{-\frac{k}{4}\left(\varepsilon^2-\frac{2\varepsilon^3}{3}\right)} \;\leq\; e^{1/2}\,e^{-\frac{k}{4}\left(\varepsilon^2-\frac{2\varepsilon^3}{3}\right)}.$$

Since $k\geq 4(1+\varepsilon)^2$, it follows that $k-2\geq k/2$. Combining the last three displays with the upper bound on $Bs_0f(s_0)$ from Corollary 8, we find

$$\mathrm{Prob}[s>s_0(1+\varepsilon)] \;\leq\; \frac{9\,e^{1/6}}{2\sqrt{\pi}}\,\sqrt{k}\cdot\frac{4\,e^{1/2}}{\varepsilon\,k}\,e^{-\frac{k}{4}\left(\varepsilon^2-\frac{2\varepsilon^3}{3}\right)} \;=\; \frac{18\,e^{2/3}}{\sqrt{\pi}\,\varepsilon\sqrt{k}}\,e^{-\frac{k}{4}\left(\varepsilon^2-\frac{2\varepsilon^3}{3}\right)},$$

which completes the first inequality. The second inequality follows since, when $\varepsilon\sqrt{k}\geq 1$, we have $\frac{1}{\varepsilon\sqrt{k}}\leq 1\leq e^{\frac{k\varepsilon^3}{12}}$.

3.3 Bounds on Prob[s < s₀(1-ε)]

We begin this section by including the following two additional inequalities on $\log(1-x)$:

$$\log(1-x) \;\geq\; -x-\frac{x^2}{2}-\frac{3x^3}{2} \quad\text{for } 0<x\leq 0.85,\qquad(24)$$
$$\log(1-x) \;\leq\; -x-\frac{x^2}{2} \quad\text{for } 0\leq x<1.\qquad(25)$$

These bounds can be justified using a similar approach as for Inequalities (7)-(10). Using these inequalities, we derive the following bounds:

Lemma 11

$$\mathrm{Prob}[s<s_0(1-\varepsilon)] \;\geq\; \frac{e^{-1/6}}{2\sqrt{\pi}}\,e^{-\frac{k}{4}\left(\varepsilon+k^{-1/2}\right)^2\left(\frac{1}{1-s_0}+3\left(\varepsilon+k^{-1/2}\right)\right)}.$$

Moreover, when $k\leq\eta\varepsilon^{-2}\log(1/\delta)$,

$$\mathrm{Prob}[s<s_0(1-\varepsilon)] \;\geq\; \frac{e^{-1/6}}{2\sqrt{\pi}}\,\delta^{\frac{\eta}{4}\gamma_2}, \qquad\text{where } \gamma_2=\left(1+\frac{1}{\sqrt{\eta\log(1/\delta)}}\right)^2\left(\frac{1}{1-s_0}+3\big(\varepsilon+k^{-1/2}\big)\right).$$

Additionally, $\gamma_2$ approaches $1$ as $\varepsilon$, $\delta$, and $s_0$ approach $0$.

Proof The proof of this lemma is very similar to the proof of Lemma 9, so we focus on the new details. The probability can be rewritten, using the substitution $s=s_0(1-x)$, as

$$\mathrm{Prob}[s<s_0(1-\varepsilon)] \;=\; B\int_{s<s_0(1-\varepsilon)}f(s)\,ds \;=\; Bs_0\int_{\varepsilon}^{1}f(s_0(1-x))\,dx. \qquad(26)$$

Using $g(s)$ as in Lemma 9, it follows that

$$f(s_0(1-x)) \;=\; \frac{g(s_0(1-x))}{s_0(1-x)\,\big(1-s_0(1-x)\big)} \qquad(27)$$

and

$$\log g(s_0(1-x)) \;=\; \log g(s_0)+\frac{d}{2}\Big[s_0\log(1-x)+(1-s_0)\log\Big(1+\frac{s_0x}{1-s_0}\Big)\Big]. \qquad(28)$$

For $0<x\leq\varepsilon+k^{-1/2}$ (the range used below), we have $x<0.85$ and $0<\frac{s_0x}{1-s_0}<0.68$ from the assumptions that $\varepsilon\leq 1/2$, $k\geq 4(1+\varepsilon)^2$, and $s_0<0.4$. Therefore, we can bound the second term in Equation (28) using Inequalities (7) and (24), as follows:

$$s_0\log(1-x)+(1-s_0)\log\Big(1+\frac{s_0x}{1-s_0}\Big) \;\geq\; -\frac{s_0x^2}{2}\Big(\frac{1}{1-s_0}+3x\Big). \qquad(29)$$

Substituting Inequality (29) into Equation (28) and exponentiating, we get

$$g(s_0(1-x)) \;\geq\; g(s_0)\,e^{-\frac{k}{4}x^2\left(\frac{1}{1-s_0}+3x\right)}. \qquad(30)$$

Since $s_0<0.4$, it follows that $\frac{s_0}{1-s_0}<1$, and, hence, that $(1-x)\big(1+\frac{s_0x}{1-s_0}\big)\leq 1$. Therefore, the denominator in Equation (27) can be bounded from above by

$$s_0(1-x)\big(1-s_0(1-x)\big) \;=\; s_0(1-s_0)(1-x)\Big(1+\frac{s_0x}{1-s_0}\Big) \;\leq\; s_0(1-s_0). \qquad(31)$$

Therefore, by substituting Inequalities (30) and (31) into Equation (27), when $0<x\leq\varepsilon+k^{-1/2}$, we have

$$f(s_0(1-x)) \;\geq\; f(s_0)\,e^{-\frac{k}{4}x^2\left(\frac{1}{1-s_0}+3x\right)}. \qquad(32)$$

Since $\varepsilon\leq 1/2$ and $k\geq 4(1+\varepsilon)^2>4$, it follows that $\varepsilon+k^{-1/2}\leq 5/6<1$. Therefore, we restrict $x$ to the interval $(\varepsilon,\varepsilon+k^{-1/2})$ and observe that Inequality (32) applies in this range. Therefore,

$$Bs_0\int_{\varepsilon}^{1}f(s_0(1-x))\,dx \;\geq\; Bs_0\int_{\varepsilon}^{\varepsilon+k^{-1/2}}f(s_0(1-x))\,dx \;\geq\; Bs_0f(s_0)\int_{\varepsilon}^{\varepsilon+k^{-1/2}}e^{-\frac{k}{4}x^2\left(\frac{1}{1-s_0}+3x\right)}dx. \qquad(33)$$

Replacing $Bs_0f(s_0)$ with its lower bound given in Corollary 8 and observing that the integrand is decreasing on an interval of width $k^{-1/2}$, Inequality (33) is bounded from below by

$$Bs_0f(s_0)\,k^{-1/2}\,e^{-\frac{k}{4}\left(\varepsilon+k^{-1/2}\right)^2\left(\frac{1}{1-s_0}+3\left(\varepsilon+k^{-1/2}\right)\right)} \;\geq\; \frac{e^{-1/6}}{2\sqrt{\pi}}\,e^{-\frac{k}{4}\left(\varepsilon+k^{-1/2}\right)^2\left(\frac{1}{1-s_0}+3\left(\varepsilon+k^{-1/2}\right)\right)},$$

which completes the first inequality. The second inequality follows by replacing $k$ by the given upper bound and simplifying.

Lemma 12

$$\mathrm{Prob}[s<s_0(1-\varepsilon)] \;\leq\; \frac{9\,e^{5/3}}{\sqrt{\pi}\,\varepsilon\sqrt{k}}\,e^{-\frac{k\varepsilon^2}{4}} \;\leq\; \frac{9\,e^{5/3}}{\sqrt{\pi}}\,e^{-\frac{k}{4}(\varepsilon^2-\varepsilon^3)},$$

where the second inequality holds whenever $\varepsilon\sqrt{k}\geq 1$.

Proof The proof of this lemma is very similar to the proof of Lemma 10, so we focus on the new details. To prove an upper bound, we start with the expression for $\mathrm{Prob}[s<s_0(1-\varepsilon)]$ from Equation (26). We first observe that

$$f(s_0(1-x)) \;=\; f(s_0)\,(1-x)^{\frac{k}{2}-1}\Big(1+\frac{s_0x}{1-s_0}\Big)^{\frac{d-k}{2}-1}. \qquad(34)$$

We now bound the logarithm of the second and third factors in Equation (34) using Inequalities (9) and (25), as follows:

$$\Big(\frac{k}{2}-1\Big)\log(1-x)+\Big(\frac{d-k}{2}-1\Big)\log\Big(1+\frac{s_0x}{1-s_0}\Big) \;\leq\; \Big(\frac{k}{2}-1\Big)\Big(-x-\frac{x^2}{2}\Big)+\Big(\frac{d-k}{2}-1\Big)\frac{s_0x}{1-s_0}. \qquad(35)$$

Since $\frac{(d-k)\,s_0}{1-s_0}=k$ and $0<x<1$, Inequality (35) further simplifies to

$$\Big(\frac{k}{2}-1\Big)\log(1-x)+\Big(\frac{d-k}{2}-1\Big)\log\Big(1+\frac{s_0x}{1-s_0}\Big) \;\leq\; -\frac{k}{4}x^2+x+\frac{x^2}{2} \;\leq\; -\frac{k}{4}x^2+\frac{3}{2}.$$

Hence, for $0<x<1$, we have

$$f(s_0(1-x)) \;\leq\; f(s_0)\,e^{3/2}\,e^{-\frac{k}{4}x^2}. \qquad(36)$$

Substituting Inequality (36) into the integral of Equation (26), we find

$$Bs_0\int_{\varepsilon}^{1}f(s_0(1-x))\,dx \;\leq\; Bs_0f(s_0)\,e^{3/2}\int_{\varepsilon}^{\infty}e^{-\frac{k}{4}x^2}\,dx \;\leq\; Bs_0f(s_0)\,\frac{2\,e^{3/2}}{\varepsilon\,k}\,e^{-\frac{k\varepsilon^2}{4}}, \qquad(37)$$

where the last step uses the standard Gaussian tail bound $\int_{\varepsilon}^{\infty}e^{-ax^2}\,dx\leq\frac{1}{2a\varepsilon}e^{-a\varepsilon^2}$ with $a=k/4$. Applying the upper bound on $Bs_0f(s_0)$ from Corollary 8 completes the first inequality of the lemma. The final inequality follows since, when $\varepsilon\sqrt{k}\geq 1$, we have $\frac{1}{\varepsilon\sqrt{k}}\leq 1\leq e^{\frac{k\varepsilon^3}{4}}$.

This completes the proof of all of the conditions in Section 2, and, therefore, completes the proof of our main theorem, Theorem 2.

Acknowledgments

This work was supported by a grant from the Simons Foundation (#8399 to Michael Burr) and National Science Foundation grants CCF-5793, CCF-40763, and DMS-40306.

Appendix A. Uniform Distributions on Unit Spheres in High Dimensions

In this section, we study uniform distributions and surface areas (or hypervolumes) on high-dimensional spheres. More precisely, for any $d\geq 2$, let $S^{d-1}$ denote the unit sphere of dimension $d-1$ and $d\omega_{d-1}$ denote the surface area measure for $S^{d-1}$. We show that, for any $k<d$,

$$d\omega_{d-1} \;=\; \tfrac{1}{2}\,f(s)\,ds\;d\omega_{k-1}\,d\omega_{d-k-1},$$

where $s\in[0,1]$ and $f(s)=s^{\frac{k}{2}-1}(1-s)^{\frac{d-k}{2}-1}$. This is a more precise version of a result in Kane et al. (2011), replacing an unspecified constant by $1/2$. We include the proof of this result here in order to make it more accessible to a larger population of researchers. This formula is of independent interest since it shows that the uniform distribution on $S^{d-1}$ is a product of uniform distributions on $S^{k-1}$ and $S^{d-k-1}$ with a distribution on $[0,1]$.

Theorem 13 Under the almost bijective map $\Psi:S^{d-1}\to[0,1]\times S^{k-1}\times S^{d-k-1}$, we have equality of the surface area differential forms on $S^{d-1}$, $S^{k-1}$, and $S^{d-k-1}$, i.e.,

$$d\omega_{d-1} \;=\; \tfrac{1}{2}\,f(s)\,ds\;d\omega_{k-1}\,d\omega_{d-k-1},$$

where $f(s)=s^{\frac{k}{2}-1}(1-s)^{\frac{d-k}{2}-1}$. Equivalently, in terms of probability distribution measures,

$$\frac{d\omega_{d-1}}{\mathrm{Vol}_{d-1}(S^{d-1})} \;=\; Bf(s)\,ds\;\frac{d\omega_{k-1}}{\mathrm{Vol}_{k-1}(S^{k-1})}\;\frac{d\omega_{d-k-1}}{\mathrm{Vol}_{d-k-1}(S^{d-k-1})}.$$

Hence, the uniform distribution on $S^{d-1}$ can be identified with the product distribution on $[0,1]\times S^{k-1}\times S^{d-k-1}$, where the distribution of $s$ on $[0,1]$ has density function $Bf(s)$ and

$$B \;=\; \frac{\mathrm{Vol}_{k-1}(S^{k-1})\,\mathrm{Vol}_{d-k-1}(S^{d-k-1})}{2\,\mathrm{Vol}_{d-1}(S^{d-1})} \;=\; \frac{\Gamma(d/2)}{\Gamma(k/2)\,\Gamma((d-k)/2)}. \qquad(38)$$

This theorem is based on the following lemma, which is well known to experts, but is included here for completeness.

Lemma 14 Let $x=(x_1,\dots,x_d)$ with $x_d>0$ be a point on the upper hemisphere of $S^{d-1}$. Then the surface area measure of the unit sphere $S^{d-1}$ at $x$ is

$$d\omega_{d-1} \;=\; \frac{1}{x_d}\,dx_1\cdots dx_{d-1}.$$

Before we begin the proof, we recall the approach for $S^2$ in $3$-dimensional space. We consider the upper hemisphere of $S^2$ as the graph of a function over $D^2$, where $D^{d-1}$ denotes the $(d-1)$-dimensional disk, namely $D^{d-1}=\{\hat{x}\in\mathbb{R}^{d-1}:\sum_{i=1}^{d-1}\hat{x}_i^2\leq 1\}$. We then integrate over the disk $D^2$ to calculate the surface area of $S^2$. In particular, the integrand is the limit of the ratios of the area of a square in $D^2$ to the area of the corresponding parallelogram above the square in the tangent space of $S^2$, as the square shrinks to a point. In the case of the sphere $S^2$, the parallelogram's area is calculated using the cross product, but we must replace the use of the cross product in higher dimensions.

Proof In $d$-dimensional space, we consider the upper hemisphere of $S^{d-1}$ as the graph of a function over the $(d-1)$-dimensional disk $D^{d-1}$. We construct a pair of $(d-1)$-dimensional parallelepipeds as follows: $P_{d-1}$ is in the tangent space of $D^{d-1}$ and $Q_{d-1}$ is in the tangent space of $S^{d-1}$. Then, we take the limit of their $(d-1)$-dimensional volumes as $P_{d-1}$ approaches a point. Due to complications in taking the $(d-1)$-dimensional volume in $d$-dimensional space, we extend both $P_{d-1}$ and $Q_{d-1}$ to associated, full-dimensional parallelepipeds.

Let $(\hat{x}_1,\dots,\hat{x}_{d-1})\in D^{d-1}$ and define $\varphi:D^{d-1}\to\mathbb{R}_{\geq 0}$ as

$$\varphi(\hat{x}_1,\dots,\hat{x}_{d-1}) \;=\; \sqrt{1-\sum_{i=1}^{d-1}\hat{x}_i^2}.$$

We observe that the graph of this function is the upper hemisphere of $S^{d-1}$. We now extend this map to $D^{d-1}\times\mathbb{R}$ as $\Phi_d:D^{d-1}\times\mathbb{R}\to\mathbb{R}^d$ defined by

$$(\hat{x}_1,\dots,\hat{x}_{d-1},\hat{x}_d) \;\mapsto\; \big((1+\hat{x}_d)\hat{x}_1,\,(1+\hat{x}_d)\hat{x}_2,\,\dots,\,(1+\hat{x}_d)\hat{x}_{d-1},\,(1+\hat{x}_d)\varphi(\hat{x}_1,\dots,\hat{x}_{d-1})\big).$$

We observe that $\Phi_d|_{D^{d-1}\times\{0\}}$ maps the disk $D^{d-1}\times\{0\}$ surjectively onto the graph of $\varphi$, i.e., the upper hemisphere of $S^{d-1}$, see Figure 2.

Figure 2: The map $\Phi_d$ sends the disk $D^{d-1}\times\{0\}$ surjectively onto the upper hemisphere of the $(d-1)$-dimensional unit sphere $S^{d-1}$; the projection $\pi$ onto the first $d-1$ coordinates inverts it.

We observe that, on the upper hemisphere, the inverse of $\Phi_d|_{D^{d-1}\times\{0\}}$ is the projection map $\pi$ onto the first $d-1$ coordinates. We recall that, at $(\hat{x}_1,\dots,\hat{x}_{d-1})\in D^{d-1}$, the tangent space is $\mathbb{R}^{d-1}$, and we define the parallelepiped $P_{d-1}$ in the tangent space by the vectors $\Delta\hat{x}_i\,e_i$ of length $\Delta\hat{x}_i$ in the direction of the $i$-th standard basis vector $e_i$ of $\mathbb{R}^{d-1}$. Similarly, the tangent space at $(\hat{x}_1,\dots,\hat{x}_{d-1},0)\in D^{d-1}\times\mathbb{R}$ is $\mathbb{R}^d$, and we define the parallelepiped $P_d$ in this tangent space by the vectors $\Delta\hat{x}_i\,e_i$ for $1\leq i\leq d-1$ and $h\,e_d$ for the final direction. Then, as the vector $h\,e_d$ is perpendicular to the tangent vectors of the disk $D^{d-1}$, the $d$-dimensional volume of $P_d$ can be computed in terms of the $(d-1)$-dimensional volume of $P_{d-1}$ and the height $h$, i.e., $\mathrm{Vol}_d(P_d)=h\,\mathrm{Vol}_{d-1}(P_{d-1})$.

Next, we let $Q_{d-1}$ and $Q_d$ be the images of $P_{d-1}$ and $P_d$ under the Jacobian of $\Phi_d$, i.e., $\mathrm{Jac}\,\Phi_d$, respectively. Note that the Jacobian of $\Phi_d$, when restricted to $D^{d-1}\times\{0\}$, is

$$\mathrm{Jac}\,\Phi_d\big|_{D^{d-1}\times\{0\}} \;=\; \begin{pmatrix} I_{d-1} & \hat{x} \\[2pt] -\dfrac{\hat{x}^{\,t}}{\varphi(\hat{x})} & \varphi(\hat{x}) \end{pmatrix}, \qquad \hat{x}=(\hat{x}_1,\dots,\hat{x}_{d-1})^t.$$

Since the Jacobian acts on tangent vectors, $Q_{d-1}$ is defined by the vectors

$$\Delta\hat{x}_i\Big(f_i-\frac{\hat{x}_i}{\varphi(\hat{x}_1,\dots,\hat{x}_{d-1})}\,f_d\Big), \qquad 1\leq i\leq d-1,$$

where $f_i$ is the $i$-th standard basis vector of $\mathbb{R}^d$. Moreover, $Q_d$ is defined by these vectors as well as the image of $h\,e_d$, i.e.,

$$h\,\big(\hat{x}_1,\dots,\hat{x}_{d-1},\varphi(\hat{x}_1,\dots,\hat{x}_{d-1})\big).$$

We observe that the vectors $\Delta\hat{x}_i\big(f_i-\frac{\hat{x}_i}{\varphi(\hat{x})}f_d\big)$ for $1\leq i\leq d-1$ are tangent vectors to the sphere $S^{d-1}$, and that $h\,(\hat{x}_1,\dots,\hat{x}_{d-1},\varphi(\hat{x}))$ is the outward-pointing surface normal with length $h$. Since the tangent vectors are perpendicular to the outward-pointing normal, the volumes of $Q_{d-1}$ and $Q_d$ have a similar relationship as the volumes of $P_{d-1}$ and $P_d$, i.e., $\mathrm{Vol}_d(Q_d)=h\,\mathrm{Vol}_{d-1}(Q_{d-1})$. Therefore, the ratio between the $d$-dimensional volumes of $Q_d$ and $P_d$ is the same as the ratio of the $(d-1)$-dimensional volumes of $Q_{d-1}$ and $P_{d-1}$. Since $Q_d$ is the image of $P_d$ under the linear map $\mathrm{Jac}\,\Phi_d|_{D^{d-1}\times\{0\}}$, it follows that

$$\mathrm{Vol}_d(Q_d) \;=\; \big|\det\mathrm{Jac}\,\Phi_d\big|_{(\hat{x}_1,\dots,\hat{x}_{d-1},0)}\big|\;\mathrm{Vol}_d(P_d).$$

Therefore, the ratio of the volumes of $Q_{d-1}$ and $P_{d-1}$ is $\big|\det\mathrm{Jac}\,\Phi_d\big|_{(\hat{x}_1,\dots,\hat{x}_{d-1},0)}\big|$. It is straightforward to compute the determinant of $\mathrm{Jac}\,\Phi_d$ at the point $(\hat{x}_1,\dots,\hat{x}_{d-1},0)$ via a few row reductions to eliminate the first $d-1$ entries in the last row and turn the matrix into an upper triangular matrix whose lower right corner is $\frac{1}{\varphi(\hat{x}_1,\dots,\hat{x}_{d-1})}$. Hence, the determinant of $\mathrm{Jac}\,\Phi_d$ is $\frac{1}{\varphi(\hat{x}_1,\dots,\hat{x}_{d-1})}$, which is the desired scaling factor. Therefore, $\frac{1}{\varphi(\hat{x}_1,\dots,\hat{x}_{d-1})}$ is the local factor in the stretching of the surface area in the map from $D^{d-1}$ to $S^{d-1}$. We recall that the coordinates $x_1,\dots,x_d$ are the coordinates on the upper hemisphere and $\hat{x}_1,\dots,\hat{x}_{d-1}$ are the coordinates on $D^{d-1}$. Since, under the map $\Phi_d|_{D^{d-1}\times\{0\}}$, $x_i=\hat{x}_i$ for $1\leq i\leq d-1$, it follows that $d\hat{x}_1\cdots d\hat{x}_{d-1}=dx_1\cdots dx_{d-1}$ and $\varphi(\hat{x}_1,\dots,\hat{x}_{d-1})=x_d$. From here, the result follows directly.

Proof of Theorem 13 Let $w_1,\dots,w_d$, $x_1,\dots,x_k$, and $y_1,\dots,y_{d-k}$ be coordinates of points on the $(d-1)$-dimensional unit sphere, the $(k-1)$-dimensional unit sphere, and the $(d-k-1)$-dimensional unit sphere, respectively. Let $\hat{w}_1,\dots,\hat{w}_{d-1}$, $\hat{x}_1,\dots,\hat{x}_{k-1}$, and $\hat{y}_1,\dots,\hat{y}_{d-k-1}$ be the coordinates of points on the disks $D^{d-1}$, $D^{k-1}$, and $D^{d-k-1}$, respectively. Let $\phi:[0,1]\times D^{k-1}\times D^{d-k-1}\to D^{d-1}$ be defined by

$$\big(s,(\hat{x}_1,\dots,\hat{x}_{k-1}),(\hat{y}_1,\dots,\hat{y}_{d-k-1})\big) \;\mapsto\; \Big(\sqrt{s}\,\hat{x}_1,\dots,\sqrt{s}\,\hat{x}_{k-1},\;\sqrt{s}\,\sqrt{1-\sum_{i=1}^{k-1}\hat{x}_i^2},\;\sqrt{1-s}\,\hat{y}_1,\dots,\sqrt{1-s}\,\hat{y}_{d-k-1}\Big).$$

We observe that $\phi$ maps $[0,1]\times D^{k-1}\times D^{d-k-1}$ onto the half of the disk $D^{d-1}$ whose $k$-th coordinate is nonnegative. As the measure of the image is the measure of the preimage scaled by the determinant of the Jacobian of $\phi$, the surface area measure of the disk $D^{d-1}$ is

$$d\hat{w}_1\cdots d\hat{w}_{d-1} \;=\; \big|\det\mathrm{Jac}(\phi)\big|\;ds\,d\hat{x}_1\cdots d\hat{x}_{k-1}\,d\hat{y}_1\cdots d\hat{y}_{d-k-1}. \qquad(39)$$

The Jacobian of $\phi$ for $s\in(0,1)$ can be written in block form as

$$\mathrm{Jac}\,\phi \;=\; \begin{pmatrix} \dfrac{\hat{x}}{2\sqrt{s}} & \sqrt{s}\,I_{k-1} & 0 \\[4pt] \dfrac{\hat{x}_k}{2\sqrt{s}} & -\dfrac{\sqrt{s}}{\hat{x}_k}\,\hat{x}^{\,t} & 0 \\[4pt] -\dfrac{\hat{y}}{2\sqrt{1-s}} & 0 & \sqrt{1-s}\,I_{d-k-1} \end{pmatrix}, \qquad \hat{x}_k:=\sqrt{1-\sum_{i=1}^{k-1}\hat{x}_i^2},$$

where the first column contains the partial derivatives with respect to $s$, $\hat{x}=(\hat{x}_1,\dots,\hat{x}_{k-1})^t$, and $\hat{y}=(\hat{y}_1,\dots,\hat{y}_{d-k-1})^t$. Eliminating all but the $k$-th entry of the first column by adding multiples of the other columns to the first column, we obtain

$$\big|\det\mathrm{Jac}(\phi)\big| \;=\; \frac{1}{2\,\hat{x}_k}\,s^{\frac{k}{2}-1}(1-s)^{\frac{d-k-1}{2}}.$$

Substituting this value into Equation (39), we have the surface area measure of $D^{d-1}$ in terms of the disks $D^{k-1}$ and $D^{d-k-1}$. That is,

$$d\hat{w}_1\cdots d\hat{w}_{d-1} \;=\; \frac{1}{2\,\hat{x}_k}\,s^{\frac{k}{2}-1}(1-s)^{\frac{d-k-1}{2}}\,ds\,d\hat{x}_1\cdots d\hat{x}_{k-1}\,d\hat{y}_1\cdots d\hat{y}_{d-k-1}. \qquad(40)$$

We observe that the coordinates of the disk $D^{t-1}$ correspond to the first $t-1$ entries of the coordinates of the unit sphere $S^{t-1}$. Therefore, we may relate $\phi$ to the map $\Psi$, as defined above, via $\Psi^{-1}=\Phi_d\circ\phi\circ\big(\mathrm{id}_s\times\Phi_k^{-1}\times\Phi_{d-k}^{-1}\big)$, where each $\Phi$ is restricted to the corresponding disk. Employing the results of Lemma 14 in the various dimensions, we rewrite the surface measure of each unit sphere in terms of the surface measure of the corresponding disk:

$$d\hat{x}_1\cdots d\hat{x}_{k-1} = dx_1\cdots dx_{k-1} = x_k\,d\omega_{k-1},$$
$$d\hat{y}_1\cdots d\hat{y}_{d-k-1} = dy_1\cdots dy_{d-k-1} = y_{d-k}\,d\omega_{d-k-1},$$
$$d\hat{w}_1\cdots d\hat{w}_{d-1} = dw_1\cdots dw_{d-1} = w_d\,d\omega_{d-1}.$$

By applying the map $\Psi$, we can substitute these three equalities into Equation (40) to obtain

$$d\omega_{d-1} \;=\; \frac{1}{w_d}\,dw_1\cdots dw_{d-1} \;=\; \frac{x_k\,y_{d-k}}{2\,\hat{x}_k\,w_d}\,s^{\frac{k}{2}-1}(1-s)^{\frac{d-k-1}{2}}\,ds\,d\omega_{k-1}\,d\omega_{d-k-1} \;=\; \frac{1}{2}\,s^{\frac{k}{2}-1}(1-s)^{\frac{d-k}{2}-1}\,ds\,d\omega_{k-1}\,d\omega_{d-k-1} \;=\; \frac{1}{2}\,f(s)\,ds\,d\omega_{k-1}\,d\omega_{d-k-1},$$

where the third equality follows from the facts that $x_k=\hat{x}_k$ and $w_d=\sqrt{1-s}\,y_{d-k}$ under the map $\Psi$. Since the cases where $s=0$ or $s=1$ have measure $0$, the result follows.
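The normalization in Theorem 13 is straightforward to verify numerically. The snippet below is an added illustration (not part of the paper): it computes $B$ both from the gamma-function formula (38) and from the sphere surface areas, and checks that $\int_0^1 Bf(s)\,ds\approx 1$ by a simple midpoint rule.

```python
import math

def sphere_surface(n):
    """Surface area Vol_{n-1}(S^{n-1}) of the unit sphere in R^n."""
    return 2 * math.pi ** (n / 2) / math.gamma(n / 2)

def B_gamma(d, k):
    """B from the gamma-function formula in Equation (38)."""
    return math.gamma(d / 2) / (math.gamma(k / 2) * math.gamma((d - k) / 2))

def B_volumes(d, k):
    """B from the sphere-volume formula in Equation (38)."""
    return sphere_surface(k) * sphere_surface(d - k) / (2 * sphere_surface(d))

def integral_Bf(d, k, n=200_000):
    """Midpoint-rule approximation of the integral of B f(s) over [0, 1]."""
    B, h, total = B_gamma(d, k), 1.0 / n, 0.0
    for i in range(n):
        s = (i + 0.5) * h
        total += B * s ** (k / 2 - 1) * (1 - s) ** ((d - k) / 2 - 1) * h
    return total

for d, k in [(10, 4), (50, 10), (200, 20)]:
    print(d, k, B_gamma(d, k), B_volumes(d, k), integral_Bf(d, k))
```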

References

Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687, 2003.

Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proceedings of the 38th ACM Symposium on Theory of Computing (STOC), 2006.

Noga Alon. Problems and results in extremal combinatorics. Discrete Mathematics, 273:31-53, 2003.

Rosa I. Arriaga and Santosh Vempala. An algorithmic theory of learning: Robust concepts and random projection. Machine Learning, 63(2):161-182, 2006.

Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: applications to image and text data. In Knowledge Discovery and Data Mining (KDD), pages 245-250, 2001.

Emmanuel J. Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathématique, Académie des Sciences, Paris, 346:589-592, 2008.

Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th ACM Symposium on Theory of Computing (STOC), pages 81-90, 2013.

Graham Cormode and Piotr Indyk. Stable distributions in streaming computation. In M. Garofalakis, J. Gehrke, and R. Rastogi, editors, Data Stream Management. Springer, Berlin, Heidelberg, 2016.

Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. A sparse Johnson-Lindenstrauss transform. In Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC), 2010.

Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22(1):60-65, 2003.

Peter Frankl and Hiroshi Maehara. The Johnson-Lindenstrauss lemma and the sphericity of some graphs. Journal of Combinatorial Theory, Series B, 44(3):355-362, 1988.

Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the 30th ACM Symposium on Theory of Computing (STOC), 1998.

T. S. Jayram and David P. Woodruff. Optimal bounds for Johnson-Lindenstrauss transforms and streaming problems with subconstant error. ACM Transactions on Algorithms, 9(3), 2013.

William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189-206, 1984.

Daniel M. Kane and Jelani Nelson. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM, 61(1), 2014.

Daniel M. Kane, Raghu Meka, and Jelani Nelson. Almost optimal explicit Johnson-Lindenstrauss families. In Proceedings of the 15th International Workshop on Randomization and Computation (RANDOM), 2011.

Jessica Lin and Dimitrios Gunopulos. Dimensionality reduction by random projection and latent semantic indexing. In Proceedings of the Text Mining Workshop, at the 3rd SIAM International Conference on Data Mining, May 2003.

Jiri Matousek. On variants of the Johnson-Lindenstrauss lemma. Random Structures and Algorithms, 33(2):142-156, 2008.

Nam H. Nguyen, Thong T. Do, and Trac D. Tran. A fast and efficient algorithm for low-rank approximation of a matrix. In Proceedings of the 41st ACM Symposium on Theory of Computing (STOC), 2009.

Herbert Robbins. A remark on Stirling's formula. American Mathematical Monthly, 62:26-29, 1955.

Shashanka Ubaru, Arya Mazumdar, and Yousef Saad. Low rank approximation and decomposition of large matrices using error correcting codes. IEEE Transactions on Information Theory, 63(9), 2017.

Kilian Q. Weinberger, Anirban Dasgupta, John Langford, Alexander J. Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), 2009.


More information

Sharp Generalization Error Bounds for Randomly-projected Classifiers

Sharp Generalization Error Bounds for Randomly-projected Classifiers Sharp Generalization Error Bounds for Randomly-projected Classifiers R.J. Durrant and A. Kabán School of Computer Science The University of Birmingham Birmingham B15 2TT, UK http://www.cs.bham.ac.uk/ axk

More information

QUALITATIVE CONTROLLABILITY AND UNCONTROLLABILITY BY A SINGLE ENTRY

QUALITATIVE CONTROLLABILITY AND UNCONTROLLABILITY BY A SINGLE ENTRY QUALITATIVE CONTROLLABILITY AND UNCONTROLLABILITY BY A SINGLE ENTRY D.D. Olesky 1 Department of Computer Science University of Victoria Victoria, B.C. V8W 3P6 Michael Tsatsomeros Department of Mathematics

More information

Fast Dimension Reduction Using Rademacher Series on Dual BCH Codes

Fast Dimension Reduction Using Rademacher Series on Dual BCH Codes Fast Dimension Reduction Using Rademacher Series on Dual BCH Codes Nir Ailon Edo Liberty Abstract The Fast Johnson-Lindenstrauss Transform (FJLT) was recently discovered by Ailon and Chazelle as a novel

More information

A fast randomized algorithm for approximating an SVD of a matrix

A fast randomized algorithm for approximating an SVD of a matrix A fast randomized algorithm for approximating an SVD of a matrix Joint work with Franco Woolfe, Edo Liberty, and Vladimir Rokhlin Mark Tygert Program in Applied Mathematics Yale University Place July 17,

More information

Random Methods for Linear Algebra

Random Methods for Linear Algebra Gittens gittens@acm.caltech.edu Applied and Computational Mathematics California Institue of Technology October 2, 2009 Outline The Johnson-Lindenstrauss Transform 1 The Johnson-Lindenstrauss Transform

More information

MATH 167: APPLIED LINEAR ALGEBRA Least-Squares

MATH 167: APPLIED LINEAR ALGEBRA Least-Squares MATH 167: APPLIED LINEAR ALGEBRA Least-Squares October 30, 2014 Least Squares We do a series of experiments, collecting data. We wish to see patterns!! We expect the output b to be a linear function of

More information

1 9/5 Matrices, vectors, and their applications

1 9/5 Matrices, vectors, and their applications 1 9/5 Matrices, vectors, and their applications Algebra: study of objects and operations on them. Linear algebra: object: matrices and vectors. operations: addition, multiplication etc. Algorithms/Geometric

More information

Notes taken by Costis Georgiou revised by Hamed Hatami

Notes taken by Costis Georgiou revised by Hamed Hatami CSC414 - Metric Embeddings Lecture 6: Reductions that preserve volumes and distance to affine spaces & Lower bound techniques for distortion when embedding into l Notes taken by Costis Georgiou revised

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Math 240 Calculus III

Math 240 Calculus III The Calculus III Summer 2015, Session II Wednesday, July 8, 2015 Agenda 1. of the determinant 2. determinants 3. of determinants What is the determinant? Yesterday: Ax = b has a unique solution when A

More information

25.2 Last Time: Matrix Multiplication in Streaming Model

25.2 Last Time: Matrix Multiplication in Streaming Model EE 381V: Large Scale Learning Fall 01 Lecture 5 April 18 Lecturer: Caramanis & Sanghavi Scribe: Kai-Yang Chiang 5.1 Review of Streaming Model Streaming model is a new model for presenting massive data.

More information

arxiv: v1 [cs.ds] 30 May 2010

arxiv: v1 [cs.ds] 30 May 2010 ALMOST OPTIMAL UNRESTRICTED FAST JOHNSON-LINDENSTRAUSS TRANSFORM arxiv:005.553v [cs.ds] 30 May 200 NIR AILON AND EDO LIBERTY Abstract. The problems of random projections and sparse reconstruction have

More information

(x, y) = d(x, y) = x y.

(x, y) = d(x, y) = x y. 1 Euclidean geometry 1.1 Euclidean space Our story begins with a geometry which will be familiar to all readers, namely the geometry of Euclidean space. In this first chapter we study the Euclidean distance

More information

Optimality of the Johnson-Lindenstrauss lemma

Optimality of the Johnson-Lindenstrauss lemma 58th Annual IEEE Symposium on Foundations of Computer Science Optimality of the Johnson-Lindenstrauss lemma Kasper Green Larsen Computer Science Department Aarhus University Aarhus, Denmark larsen@cs.au.dk

More information

Functional Analysis Review

Functional Analysis Review Outline 9.520: Statistical Learning Theory and Applications February 8, 2010 Outline 1 2 3 4 Vector Space Outline A vector space is a set V with binary operations +: V V V and : R V V such that for all

More information

On Riesz-Fischer sequences and lower frame bounds

On Riesz-Fischer sequences and lower frame bounds On Riesz-Fischer sequences and lower frame bounds P. Casazza, O. Christensen, S. Li, A. Lindner Abstract We investigate the consequences of the lower frame condition and the lower Riesz basis condition

More information

Conceptual Questions for Review

Conceptual Questions for Review Conceptual Questions for Review Chapter 1 1.1 Which vectors are linear combinations of v = (3, 1) and w = (4, 3)? 1.2 Compare the dot product of v = (3, 1) and w = (4, 3) to the product of their lengths.

More information

(1) for all (2) for all and all

(1) for all (2) for all and all 8. Linear mappings and matrices A mapping f from IR n to IR m is called linear if it fulfills the following two properties: (1) for all (2) for all and all Mappings of this sort appear frequently in the

More information

18.06 Professor Johnson Quiz 1 October 3, 2007

18.06 Professor Johnson Quiz 1 October 3, 2007 18.6 Professor Johnson Quiz 1 October 3, 7 SOLUTIONS 1 3 pts.) A given circuit network directed graph) which has an m n incidence matrix A rows = edges, columns = nodes) and a conductance matrix C [diagonal

More information

225A DIFFERENTIAL TOPOLOGY FINAL

225A DIFFERENTIAL TOPOLOGY FINAL 225A DIFFERENTIAL TOPOLOGY FINAL KIM, SUNGJIN Problem 1. From hitney s Embedding Theorem, we can assume that N is an embedded submanifold of R K for some K > 0. Then it is possible to define distance function.

More information

Compressed Sensing and Robust Recovery of Low Rank Matrices

Compressed Sensing and Robust Recovery of Low Rank Matrices Compressed Sensing and Robust Recovery of Low Rank Matrices M. Fazel, E. Candès, B. Recht, P. Parrilo Electrical Engineering, University of Washington Applied and Computational Mathematics Dept., Caltech

More information

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2. APPENDIX A Background Mathematics A. Linear Algebra A.. Vector algebra Let x denote the n-dimensional column vector with components 0 x x 2 B C @. A x n Definition 6 (scalar product). The scalar product

More information

Math 102, Winter Final Exam Review. Chapter 1. Matrices and Gaussian Elimination

Math 102, Winter Final Exam Review. Chapter 1. Matrices and Gaussian Elimination Math 0, Winter 07 Final Exam Review Chapter. Matrices and Gaussian Elimination { x + x =,. Different forms of a system of linear equations. Example: The x + 4x = 4. [ ] [ ] [ ] vector form (or the column

More information

Upper triangular matrices and Billiard Arrays

Upper triangular matrices and Billiard Arrays Linear Algebra and its Applications 493 (2016) 508 536 Contents lists available at ScienceDirect Linear Algebra and its Applications www.elsevier.com/locate/laa Upper triangular matrices and Billiard Arrays

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms 南京大学 尹一通 Martingales Definition: A sequence of random variables X 0, X 1,... is a martingale if for all i > 0, E[X i X 0,...,X i1 ] = X i1 x 0, x 1,...,x i1, E[X i X 0 = x 0, X 1

More information

Algebraic Methods in Combinatorics

Algebraic Methods in Combinatorics Algebraic Methods in Combinatorics Po-Shen Loh 27 June 2008 1 Warm-up 1. (A result of Bourbaki on finite geometries, from Răzvan) Let X be a finite set, and let F be a family of distinct proper subsets

More information

A Randomized Algorithm for the Approximation of Matrices

A Randomized Algorithm for the Approximation of Matrices A Randomized Algorithm for the Approximation of Matrices Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert Technical Report YALEU/DCS/TR-36 June 29, 2006 Abstract Given an m n matrix A and a positive

More information

LINEAR ALGEBRA SUMMARY SHEET.

LINEAR ALGEBRA SUMMARY SHEET. LINEAR ALGEBRA SUMMARY SHEET RADON ROSBOROUGH https://intuitiveexplanationscom/linear-algebra-summary-sheet/ This document is a concise collection of many of the important theorems of linear algebra, organized

More information

Packing triangles in regular tournaments

Packing triangles in regular tournaments Packing triangles in regular tournaments Raphael Yuster Abstract We prove that a regular tournament with n vertices has more than n2 11.5 (1 o(1)) pairwise arc-disjoint directed triangles. On the other

More information

Approximating sparse binary matrices in the cut-norm

Approximating sparse binary matrices in the cut-norm Approximating sparse binary matrices in the cut-norm Noga Alon Abstract The cut-norm A C of a real matrix A = (a ij ) i R,j S is the maximum, over all I R, J S of the quantity i I,j J a ij. We show that

More information

Introduction to Mobile Robotics Compact Course on Linear Algebra. Wolfram Burgard, Cyrill Stachniss, Kai Arras, Maren Bennewitz

Introduction to Mobile Robotics Compact Course on Linear Algebra. Wolfram Burgard, Cyrill Stachniss, Kai Arras, Maren Bennewitz Introduction to Mobile Robotics Compact Course on Linear Algebra Wolfram Burgard, Cyrill Stachniss, Kai Arras, Maren Bennewitz Vectors Arrays of numbers Vectors represent a point in a n dimensional space

More information

Department of Mathematics, University of California, Berkeley. GRADUATE PRELIMINARY EXAMINATION, Part A Fall Semester 2016

Department of Mathematics, University of California, Berkeley. GRADUATE PRELIMINARY EXAMINATION, Part A Fall Semester 2016 Department of Mathematics, University of California, Berkeley YOUR 1 OR 2 DIGIT EXAM NUMBER GRADUATE PRELIMINARY EXAMINATION, Part A Fall Semester 2016 1. Please write your 1- or 2-digit exam number on

More information

A BRIEF INTRODUCTION TO HILBERT SPACE FRAME THEORY AND ITS APPLICATIONS AMS SHORT COURSE: JOINT MATHEMATICS MEETINGS SAN ANTONIO, 2015 PETER G. CASAZZA Abstract. This is a short introduction to Hilbert

More information

Math 200 University of Connecticut

Math 200 University of Connecticut RELATIVISTIC ADDITION AND REAL ADDITION KEITH CONRAD Math 200 University of Connecticut Date: Aug. 31, 2005. RELATIVISTIC ADDITION AND REAL ADDITION 1 1. Introduction For three particles P, Q, R travelling

More information

Multivariate Statistics Random Projections and Johnson-Lindenstrauss Lemma

Multivariate Statistics Random Projections and Johnson-Lindenstrauss Lemma Multivariate Statistics Random Projections and Johnson-Lindenstrauss Lemma Suppose again we have n sample points x,..., x n R p. The data-point x i R p can be thought of as the i-th row X i of an n p-dimensional

More information

Robust Principal Component Analysis

Robust Principal Component Analysis ELE 538B: Mathematics of High-Dimensional Data Robust Principal Component Analysis Yuxin Chen Princeton University, Fall 2018 Disentangling sparse and low-rank matrices Suppose we are given a matrix M

More information

Strengthened Sobolev inequalities for a random subspace of functions

Strengthened Sobolev inequalities for a random subspace of functions Strengthened Sobolev inequalities for a random subspace of functions Rachel Ward University of Texas at Austin April 2013 2 Discrete Sobolev inequalities Proposition (Sobolev inequality for discrete images)

More information

ADVANCED CALCULUS - MTH433 LECTURE 4 - FINITE AND INFINITE SETS

ADVANCED CALCULUS - MTH433 LECTURE 4 - FINITE AND INFINITE SETS ADVANCED CALCULUS - MTH433 LECTURE 4 - FINITE AND INFINITE SETS 1. Cardinal number of a set The cardinal number (or simply cardinal) of a set is a generalization of the concept of the number of elements

More information

Dimensionality Reduced l 0 -Sparse Subspace Clustering

Dimensionality Reduced l 0 -Sparse Subspace Clustering Yingzhen Yang Independent Researcher Abstract Subspace clustering partitions the data that lie on a union of subspaces. l 0 -Sparse Subspace Clustering (l 0 -SSC), which belongs to the subspace clustering

More information

MATH 22A: LINEAR ALGEBRA Chapter 4

MATH 22A: LINEAR ALGEBRA Chapter 4 MATH 22A: LINEAR ALGEBRA Chapter 4 Jesús De Loera, UC Davis November 30, 2012 Orthogonality and Least Squares Approximation QUESTION: Suppose Ax = b has no solution!! Then what to do? Can we find an Approximate

More information