Database-friendly Random Projections

Dimitris Achlioptas
Microsoft

Address: Microsoft Corporation, One Microsoft Way, Redmond WA 98052, U.S.A. Email: optas@microsoft.com

ABSTRACT

A classic result of Johnson and Lindenstrauss asserts that any set of n points in d-dimensional Euclidean space can be embedded into k-dimensional Euclidean space, where k is logarithmic in n and independent of d, so that all pairwise distances are maintained within an arbitrarily small factor. All known constructions of such embeddings involve projecting the n points onto a random k-dimensional hyperplane. We give a novel construction of the embedding, suitable for database applications, which amounts to computing a simple aggregate over k random attribute partitions.

1. INTRODUCTION

Consider projecting the points of your favorite sculpture first onto a plane and then onto a single line. The result amply demonstrates the power of dimensionality. Conversely, given a high-dimensional pointset it is natural to ask whether it exploits its full allotment of dimensionality or whether, rather, it could be embedded into a lower-dimensional space without suffering great distortion. In general, such questions involve a, perhaps infinite, collection of points endowed with some distance function (metric). In this paper we will only deal with finite sets of points in Euclidean space, so the Euclidean distance is the metric. In particular, it will be convenient to think of n points in R^d as an n × d table (matrix) A, with each point represented as a row vector with d attributes (coordinates).

Given such a matrix A, one of the most common embeddings is the one suggested by its Singular Value Decomposition. In particular, to embed the n points into R^k we project them onto the k-dimensional space spanned by the singular vectors corresponding to the k largest singular values of A. If one rewrites the result of this projection as a rank-k, n × d matrix A_k, we are guaranteed that for every rank-k matrix D,

‖A − A_k‖ ≤ ‖A − D‖,

for any unitarily invariant norm, such as the Frobenius or the L2 norm. Thus, distortion here amounts to a certain distance (norm) between the set of projected points, A_k, and the original set of points A. If we associate with each row (point) a vector corresponding to the difference between its original and its new position then, for example, under the Frobenius norm the distortion equals the sum of the squared lengths of these vectors. It is clear that such a notion of distortion captures a significant global property. At the same time, though, it does not offer any local guarantees. For example, the distance between a pair of points can become arbitrarily smaller than it was in the original space, if that is advantageous to minimizing the total distortion.

The study of embeddings that respect local properties is a rich area of mathematics with deep and beautiful results. Such embeddings can guarantee, for example, that all distances between pairs of points are approximately maintained or, more generally, that for a given q ≥ 2 a certain notion of volume is maintained for all collections of up to q points (thus capturing higher-order local structure). The algorithmic uses of such embeddings were first considered in the seminal paper of Linial, London and Rabinovich [9] and have by now become an important part of modern algorithmic design. A real gem in this area has been the following result of Johnson and Lindenstrauss [7].

Lemma 1 ([7]). Given ε > 0 and an integer n, let k be a positive integer such that k ≥ k_0 = O(ε^{-2} log n). For every set P of n points in R^d there exists f : R^d → R^k such that for all u, v ∈ P,

(1 − ε) ‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε) ‖u − v‖².
We will refer to embeddings providing a guarantee akin to that of Lemma 1 as JL-embeddings. In the last few years, JL-embeddings have been useful in solving a variety of problems. The rough idea is the following. By providing a low-dimensional representation of the data, JL-embeddings speed up certain algorithms dramatically, in particular algorithms whose run-time depends exponentially on the dimension of the working space. (There are a number of practical problems for which the best known algorithms have such behaviour.) At the same time, the provided guarantee regarding pairwise distances is often enough to establish that the solution found by working in the low-dimensional space is a good approximation to the optimal solution in the original space. We give a few examples below.

Papadimitriou, Raghavan, Tamaki and Vempala [10] proved that embedding the points of A in a low-dimensional space can significantly speed up the computation of a low-rank approximation to A, without significantly affecting its quality. In [6], Indyk and Motwani showed that JL-embeddings are useful in solving the ε-approximate nearest neighbor problem, where after some preprocessing of the pointset P one is to answer queries of the following type: given an arbitrary point x, find a point y ∈ P such that for every point z ∈ P, ‖x − z‖ ≥ (1 − ε) ‖x − y‖. In a different vein, Schulman [11] used JL-embeddings as part of an approximation algorithm for the version of clustering where we seek to minimize the sum of the squares of intracluster distances. Recently, Indyk [5] showed that JL-embeddings can also be used in the context of data-stream computation, where one has limited memory and is allowed only a single pass over the data (stream).

1.1 Our contribution

Over the years, the probabilistic method has allowed the original proof of Johnson and Lindenstrauss to be greatly simplified and sharpened [4, 6, 3], while at the same time giving conceptually simple randomized algorithms for constructing the embedding. Roughly speaking, all such algorithms project the input points onto a spherically random hyperplane through the origin. Performing such a projection, while conceptually simple, is non-trivial, especially in a database environment. Moreover, its computational cost can be prohibitive for certain applications.

At the same time, JL-embeddings have become an important algorithmic design tool and in certain domains they are a desirable standard data-processing step. With this in mind, it is natural to ask whether we can compute such embeddings in a manner that is simpler and more efficient than the one suggested by the current methods. Our main result, below, is a first step in this direction, asserting that one can replace projections onto random hyperplanes with much simpler and faster operations, requiring extremely simple probability distributions. In particular, these operations can be implemented readily using standard SQL primitives without any additional functionality. Moreover, somewhat surprisingly, this comes without any sacrifice in the quality of the embedding. In fact, we will see that for every fixed value of d we can get slightly better bounds than all current methods.

We describe the main result below in standard mathematical terminology. Following that, we give an example of how to compute the embedding using database operations. As in Lemma 1, the parameter ε controls the accuracy in distance preservation, while now β controls the probability of success.

Theorem 2. Let P be an arbitrary set of n points in R^d, represented as an n × d matrix A. Given ε, β > 0, let

k_0 = (4 + 2β) (ε²/2 − ε³/3)^{-1} log n.

For integer k ≥ k_0, let R be a d × k random matrix with R(i, j) = r_ij, where the {r_ij} are independent random variables drawn from either one of the following two probability distributions:

r_ij = +1 with probability 1/2, and −1 with probability 1/2; or
r_ij = √3 × (+1 with probability 1/6, 0 with probability 2/3, −1 with probability 1/6).

Let E = (1/√k) A R, and let f : R^d → R^k map the i-th row of A to the i-th row of E. With probability at least 1 − n^{-β}, for all u, v ∈ P,

(1 − ε) ‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε) ‖u − v‖².

In a database system, all operations needed to compute A·R are very efficient and easy to implement.
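To make the construction of Theorem 2 concrete, here is a minimal sketch in Python; NumPy, the function names and the parameter handling are illustrative assumptions on top of the theorem statement, not part of the paper.

```python
import numpy as np

def projection_matrix(d, k, sparse=True, rng=None):
    """Draw the d x k matrix R of Theorem 2: i.i.d. entries with mean 0 and variance 1."""
    rng = np.random.default_rng() if rng is None else rng
    if sparse:
        # sqrt(3) * (+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6)
        vals = np.array([np.sqrt(3.0), 0.0, -np.sqrt(3.0)])
        return rng.choice(vals, size=(d, k), p=[1 / 6, 2 / 3, 1 / 6])
    # +1 or -1, each with probability 1/2
    return rng.choice(np.array([1.0, -1.0]), size=(d, k))

def jl_embed(A, eps, beta, sparse=True, rng=None):
    """Map the rows of the n x d matrix A into R^k, with k as in Theorem 2."""
    n, d = A.shape
    k = int(np.ceil((4 + 2 * beta) / (eps ** 2 / 2 - eps ** 3 / 3) * np.log(n)))
    R = projection_matrix(d, k, sparse=sparse, rng=rng)
    return A @ R / np.sqrt(k)  # E = (1 / sqrt(k)) * A * R
```

For instance, jl_embed(A, eps=0.2, beta=1.0) maps each row of A to a k-dimensional row whose pairwise squared distances are, with probability at least 1 − n^{-β}, within a factor 1 ± ε of the originals.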
For example, with the second distribution above, the embedding amounts to generating k new attributes, each one formed by applying the same process: throw away 2/3 of all attributes at random; partition the remaining attributes randomly into two equal parts; for each part, produce a new attribute equal to the sum of all its attributes; take the difference of the two sum-attributes.

All in all, using Theorem 2 one needs very simple probability distributions, no floating-point arithmetic, and all computation amounts to highly optimized database operations (aggregation). By using the second probability distribution, where r_ij = 0 with probability 2/3, we also get a threefold speedup, as we only need to process a third of all attributes for each of the k coordinates. On the other hand, when r_ij ∈ {−1, +1}, conceptually the construction seems to be about as simple as one could hope for.

Looking a bit more closely into the matrix E, we see that each row vector of A is projected onto k random vectors whose coordinates {r_ij} are independent random variables with mean 0 and variance 1. If the {r_ij} were independent Normal random variables with mean 0 and variance 1, it is well known that the resulting vectors would point in uniformly random directions in space. Projections onto such random lines through the origin have been considered in a number of settings, including the work of Kleinberg on approximate nearest neighbors [8] and of Vempala on learning intersections of halfspaces [12]. More recently, such projections have also been used in learning mixtures-of-Gaussians models, starting with the work of Dasgupta [2] and later with the work of Arora and Kannan [1].

Our proof will suggest that for any fixed vector α, the behavior of its projection onto a random vector c is mandated by the even moments of α · c. In fact, our result follows by showing that for every vector α, under our distributions for {r_ij}, these moments are dominated by the corresponding moments for the case where c is spherically symmetric. As a result, projecting onto vectors whose entries are distributed like the columns of matrix R could replace projection onto random lines; it is computationally simpler and results in projections that are at least as nicely behaved.
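Returning to the database-style recipe at the start of this subsection (drop two thirds of the attributes, split the survivors into two halves, sum each half, subtract), the sketch below spells out one such aggregate for a single row; it is a hypothetical illustration in Python, with the √3 and 1/√k scalings postponed as the text suggests.

```python
import numpy as np

def one_aggregate_attribute(row, rng):
    """One new attribute for a single row (point): keep a random third of the
    original attributes, split them into two random halves, and return the
    difference of the two sums.  This equals the inner product of the row with
    a column whose entries lie in {-1, 0, +1}."""
    d = len(row)
    kept = rng.permutation(d)[: d // 3]          # throw away 2/3 of the attributes
    half = len(kept) // 2
    plus, minus = kept[:half], kept[half: 2 * half]
    return row[plus].sum() - row[minus].sum()
```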

Finally, we note that Theorem 2 allows one to use significantly fewer random bits than all previous methods for constructing JL-embeddings. While the amount of randomness needed is still quite large, such attempts at randomness reduction are of independent interest, and our result can be viewed as a first step in that direction.

2. PREVIOUS WORK

Let us write X =_D Y to denote that X is distributed as Y, and recall that N(0, 1) denotes the standard Normal random variable having mean 0 and variance 1. As we will see, in all methods for producing JL-embeddings, including ours, the heart of the matter is showing that, for any vector, the squared length of its projection is sharply concentrated around its expected value. Armed with a sufficiently strong such concentration bound, one then proves the assertion of Lemma 1 for a collection of n points in R^d by applying the union bound to the (n choose 2) events corresponding to each distance vector being distorted by more than ±ε.

The original proof of Johnson and Lindenstrauss [7] uses quite heavy geometric approximation machinery to yield such a concentration bound when the projection is onto a uniformly random hyperplane through the origin. That proof was greatly simplified and sharpened by Frankl and Maehara [4], who considered a direct projection onto k random orthonormal vectors, yielding the following result.

Theorem 3 ([4]). For any ε ∈ (0, 1/2), any sufficiently large set P ⊂ R^d, and k ≥ k_0 = ⌈9 (ε² − 2ε³/3)^{-1} log |P|⌉ + 1, there exists a map f : P → R^k such that for all u, v ∈ P,

(1 − ε) ‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε) ‖u − v‖².

The next great simplification of the proof of Lemma 1 was given, independently, by Indyk and Motwani [6] and by Dasgupta and Gupta [3], the latter also giving a slight sharpening of the bound for k_0. Below we state our rendition of how this simplification was achieved.

Assume that we try to implement the scheme of Frankl and Maehara [4] but are lazy about enforcing either normality (unit length) or orthogonality among our k vectors. Instead, we just pick our k vectors independently, in a spherically symmetric manner. As we saw earlier, we can achieve this by taking as the coordinates of each vector d independent N(0, 1) random variables. We then merely scale each vector by 1/√d so that its expected length is 1. An immediate gain of this approach is that now, for any fixed vector α, the length of its projection onto each of our vectors is also a Normal random variable. This is due to a powerful and deep fact, namely the 2-stability of the Gaussian distribution: for any real numbers α_1, α_2, ..., α_d, if {Z_i}_{i=1}^d is a family of independent Normal random variables and X = Σ_i α_i Z_i, then X =_D c N(0, 1), where c = (α_1² + ··· + α_d²)^{1/2}. As a result, if we interpret each of the k projection lengths as a coordinate in R^k, then the squared length of the resulting vector follows the Gamma distribution, for which strong concentration bounds are readily available.

And what have we lost? Surprisingly little. While we did not insist upon either orthogonality or normality, with high probability the resulting k vectors come very close to having both these properties. In particular, the length of each of the k vectors is sharply concentrated around 1, as the sum of d independent random variables. Moreover, since the k vectors point in uniformly random directions in R^d, they get rapidly closer to being orthogonal as d grows.

Unlike Indyk and Motwani [6], Dasgupta and Gupta [3] exploited spherical symmetry without appealing directly to the 2-stability of the Gaussian distribution.
Instead, they observed that, by symmetry, the projection of any unit vector α on a random hyperplane through the origin is distributed exactly like the projection of a random point from the surface of the d-dimensional sphere onto a fixed subspace of dimension k. Such a projection can be studied readily, though, as now each coordinate is a scaled Normal random variable. Their analysis gave the strongest known bound, namely k ≥ k_0 = 4 (ε²/2 − ε³/3)^{-1} log n. Note that this is exactly the same as our bound in Theorem 2 as β tends to 0.

3. SOME INTUITION

By combining the analysis of [3] with the viewpoint of [6] it is in fact not hard to show that Theorem 2 holds if, for all i, j, r_ij =_D N(0, 1). Thus, our contribution essentially begins with the realization that spherical symmetry, while making life extremely comfortable, is not essential. What is essential is concentration. So, at least in principle, one is free to consider other candidate distributions for the {r_ij}, if perhaps at the expense of comfort.

As we saw earlier, each column of our matrix R will give us a coordinate of the projection in R^k. Moreover, the squared length of the projection is merely the sum of the squares of these coordinates. So, effectively, each column acts as an estimator of the original vector's length, by taking its inner product with it, while in the end we take the consensus (sum) over our k estimators. From this point of view, requiring our k vectors to be orthonormal has the pleasant statistical interpretation of greatest efficiency. In any case, though, as long as each column is an unbiased, bounded-variance estimator, the Central Limit Theorem asserts that by taking enough columns we can get an arbitrarily good estimate of the original length. Naturally, how many estimators are enough depends solely on the variance of the estimators.

So, already we see that the key issue is the concentration of the projection of an arbitrary fixed vector α onto a single random vector. The main technical difficulty that results from giving up spherical symmetry is that this concentration can now depend on α. Our main technical contribution lies in determining probability distributions for {r_ij} for which this concentration, for all vectors, is as good as when r_ij =_D N(0, 1). In fact, it will turn out that for every fixed value of d we can get a minuscule improvement over the concentration for that case.

Thus, for every fixed d, we can actually get a strictly better bound for k, albeit marginally, than by taking spherically random vectors. The reader might be wondering: how can it be that perfect spherical symmetry does not buy us anything (and is in fact slightly worse for each fixed d)? At a high level, an answer to this question might go as follows. Given that we do not have spherical symmetry anymore, an adversary could try to pick a vector α so that the length of its projection is as variable as possible. It is clear that not all vectors α are equal with respect to this variability. What, then, does a worst-case vector w look like? How much are we exposing to the adversary by committing to pick our column vectors among lattice points rather than arbitrary points in R^d?

As we will see, the worst-case vector is w = (1/√d, ..., 1/√d), together with all 2^d vectors resulting from sign-flipping w's coordinates. So, the worst-case vector turns out to be a more or less typical vector, at least in terms of the fluctuations in its coordinates, unlike, say, (1, 0, ..., 0). As a result, it is not hard to believe that the adversary would not fare much worse by picking a random vector. But in that case the adversary does not benefit at all from our commitment.

To get a more satisfactory answer, it seems one has to delve into the proof. In particular, both for the spherically random case and for our distributions, the bound on k is mandated by the probability of overestimating the projected length. Thus, the bad events amount to the spanning vectors being too well aligned with α. As a result, for any fixed d one has to consider the tradeoff between the probability and the extent of alignment. For example, let us consider the projection onto a single random vector when d = 2 and r_ij ∈ {−1, +1}. As we said above, the worst-case vector is w = (1/√2, 1/√2). So, it is easy to see that with probability 1/2 we have perfect alignment (when our random vector is ±w) and with probability 1/2 we have orthogonality. On the other hand, for the spherically symmetric case, we have to consider the integral over all points on the plane, weighted by their probability under the two-dimensional Gaussian distribution. By a convexity argument it turns out that for every fixed d, the even moments of the projected length are marginally greater in the spherically symmetric case. This leads to a marginally weaker probability bound for that case. As one might guess, the two bounds coincide as d tends to infinity.

4. PRELIMINARIES

Let x · y denote the inner product of vectors x, y. To simplify notation in the calculations, we will work with the matrix R scaled by 1/√d. Thus, R is a random d × k matrix with R(i, j) = r_ij/√d, where the {r_ij} are distributed as in Theorem 2. As a result, to get E we need to scale A·R by √(d/k) rather than 1/√k. Therefore, if c_j denotes the j-th column of R, then {c_j}_{j=1}^k is a family of k i.i.d. random unit vectors in R^d and, for all α ∈ R^d,

f(α) = √(d/k) (α · c_1, ..., α · c_k).

In practice, of course, such scaling can be postponed until after the matrix multiplication (projection) has been performed, so that we maintain the advantage of only having {−1, 0, +1} in the projection matrix.

Let us start by computing E‖f(α)‖² for an arbitrary vector α ∈ R^d. Let {Q_j}_{j=1}^k be defined as Q_j = α · c_j. Then

E[Q_j] = E[ (1/√d) Σ_i α_i r_ij ] = (1/√d) Σ_i α_i E[r_ij] = 0,   (1)

and

E[Q_j²] = E[ ( (1/√d) Σ_i α_i r_ij )² ] = (1/d) ( Σ_i α_i² E[r_ij²] + 2 Σ_{l<m} α_l α_m E[r_lj] E[r_mj] ) = ‖α‖²/d.   (2)

Note that to get (1) and (2) we only used that the {r_ij} are independent, that E[r_ij] = 0, and that Var[r_ij] = 1. Using (2) we get

E‖f(α)‖² = (d/k) Σ_{j=1}^k E[Q_j²] = ‖α‖².

That is, E‖f(α)‖² = ‖α‖² for any independent family {r_ij} with E[r_ij] = 0 and Var[r_ij] = 1.
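As a quick numerical sanity check of this identity (a sketch assuming NumPy; the sizes are arbitrary), one can average ‖f(α)‖² over many independent draws of the ±1 projection matrix and compare the result with ‖α‖²:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, trials = 50, 200, 2000
alpha = rng.standard_normal(d)                 # an arbitrary fixed vector

estimates = []
for _ in range(trials):
    R = rng.choice(np.array([-1.0, 1.0]), size=(d, k))  # E[r_ij] = 0, Var[r_ij] = 1
    f_alpha = alpha @ R / np.sqrt(k)                     # Theorem 2 scaling
    estimates.append(f_alpha @ f_alpha)

print(np.mean(estimates), alpha @ alpha)       # the two numbers should nearly agree
```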
From the above we see that any distribution with E[r_ij] = 0 and Var[r_ij] = 1 is, in principle, a candidate for the entries of R. In fact, in [13], Arriaga and Vempala independently suggested the possibility of getting JL-embeddings by projecting onto a matrix with r_ij ∈ {−1, +1}, but did not give any bounds on the necessary value of k.

As we mentioned earlier, having a JL-embedding amounts to the following: for each of the (n choose 2) pairs u, v ∈ P, the squared norm of the vector u − v is maintained within a factor of 1 ± ε. Therefore, if we can prove that for some β > 0 and every vector α ∈ R^d,

Pr[ (1 − ε) ‖α‖² ≤ ‖f(α)‖² ≤ (1 + ε) ‖α‖² ] ≥ 1 − 2/n^{2+β},   (3)

then the probability that our projection does not yield a JL-embedding is bounded by (n choose 2) × 2/n^{2+β} < 1/n^β. Let us note that since, for a fixed projection matrix, ‖f(α)‖² is proportional to ‖α‖², it suffices to consider probability bounds for arbitrary unit vectors. Moreover, note that when E‖f(α)‖² = ‖α‖², inequality (3) merely asserts that the random variable ‖f(α)‖² is concentrated around its expectation. Before considering this point for our distributions for {r_ij}, let us first wrap up the spherically random case.

Getting a concentration inequality for ‖f(α)‖² when r_ij =_D N(0, 1) is straightforward. Due to the 2-stability of the Normal distribution, ‖f(α)‖² follows the Gamma distribution (with parameters depending only on k and d), for every unit vector α. The fact that we get the same distribution for every vector α corresponds to the intuition that all vectors are the same with respect to projection onto a spherically random vector. Standard tail bounds for the Gamma distribution readily yield the following.

Lemma 4. Let r_ij =_D N(0, 1) for all i, j. Then, for any ε > 0 and any unit vector α ∈ R^d,

Pr[ ‖f(α)‖² > 1 + ε ] < exp( −(k/2)(ε²/2 − ε³/3) ),
Pr[ ‖f(α)‖² < 1 − ε ] < exp( −(k/2)(ε²/2 − ε³/3) ).

Thus, to get a JL-embedding we need only require

2 exp( −(k/2)(ε²/2 − ε³/3) ) ≤ 2/n^{2+β},

which holds for k ≥ (4 + 2β)(ε²/2 − ε³/3)^{-1} log n.

Let us note that the bound on the upper tail of ‖f(α)‖² above is tight up to lower-order terms. As a result, as long as the union bound is used, one cannot hope for a better bound on k while using spherically random vectors. To prove our result we use the exact same approach, arguing that for every unit vector α ∈ R^d the random variable (α · c)² is sharply concentrated around its expectation, where c is a column of our projection matrix R. In the next section we state a lemma analogous to Lemma 4 above and show how it follows from bounds on certain moments of (α · c)². We prove those bounds in Section 6.

5. PROBABILITY BOUNDS

To simplify notation, let us define for an arbitrary vector α

S = S(α) = Σ_{j=1}^k (α · c_j)² = Σ_{j=1}^k Q_j²(α),

where c_j is the j-th column of R, so that ‖f(α)‖² = (d/k) S.

Lemma 5. Let the r_ij have either of the two distributions in Theorem 2. Then, for any ε > 0 and any unit vector α ∈ R^d,

Pr[ S > (1 + ε) k/d ] < exp( −(k/2)(ε²/2 − ε³/3) ),
Pr[ S < (1 − ε) k/d ] < exp( −(k/2)(ε²/2 − ε³/3) ).

In proving Lemma 5 we will generally omit the dependence of probabilities on α, making it explicit only when it affects our calculations.

We will use the standard technique of applying Markov's inequality to the moment generating function of S. In particular, for arbitrary h > 0 we write

Pr[ S > (1 + ε) k/d ] = Pr[ exp(hS) > exp( h(1 + ε)k/d ) ] < E[exp(hS)] / exp( h(1 + ε)k/d ).

Since the {Q_j}_{j=1}^k are i.i.d. we get

E[exp(hS)] = E[ Π_{j=1}^k exp(hQ_j²) ]   (4)
 = Π_{j=1}^k E[ exp(hQ_j²) ]   (5)
 = ( E[exp(hQ_1²)] )^k,   (6)

where passing from (4) to (5) uses that the {Q_j}_{j=1}^k are independent, while passing from (5) to (6) uses that they are identically distributed. Thus, for any ε > 0,

Pr[ S > (1 + ε) k/d ] < ( E[exp(hQ_1²)] )^k / exp( h(1 + ε)k/d ).   (7)

We will get a tight bound on E[exp(hQ_1²)] from Lemma 6 below. Similarly, but this time considering exp(−hS) for arbitrary h > 0, we get that for any ε > 0

Pr[ S < (1 − ε) k/d ] < ( E[exp(−hQ_1²)] )^k / exp( −h(1 − ε)k/d ).   (8)

Rather than bounding E[exp(−hQ_1²)] directly, this time we will expand exp(−hQ_1²) to get

Pr[ S < (1 − ε) k/d ]   (9)
 < ( E[ 1 − hQ_1² + (hQ_1²)²/2! ] )^k / exp( −h(1 − ε)k/d )
 ≤ ( 1 − h/d + (h²/2) E[Q_1⁴] )^k / exp( −h(1 − ε)k/d ),   (10)

where E[Q_1²] was given by (2). We will get a tight bound on E[Q_1⁴] from Lemma 6 below.

Lemma 6. For all h ∈ [0, d/2) and all d,

E[exp(hQ_1²)] ≤ 1/√(1 − 2h/d),   (11)
E[Q_1⁴] ≤ 3/d².   (12)

The proof of Lemma 6 will comprise Section 6. Below we show how it implies Lemma 5 and thus Theorem 2.
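To spell out the step from the requirement stated after Lemma 4 to the bound on k (a worked restatement, with illustrative numbers that are not from the paper): taking logarithms, 2 exp(−(k/2)(ε²/2 − ε³/3)) ≤ 2/n^{2+β} is equivalent to (k/2)(ε²/2 − ε³/3) ≥ (2 + β) log n, i.e., to k ≥ (4 + 2β)(ε²/2 − ε³/3)^{-1} log n. For instance, n = 10^6, ε = 0.1 and β = 1 give k_0 = 6/(0.005 − 0.000333...) × log(10^6) ≈ 1.8 × 10^4 (with log the natural logarithm), independently of d.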

Proof of Lemma 5. Substituting (11) in (7) we get (13). To optimize the bound we set the derivative of (13) with respect to h to 0. This gives h = εd/(2(1 + ε)) < d/2. Substituting this value of h we get (14), and series expansion yields (15).

Pr[ S > (1 + ε) k/d ] ≤ ( 1/√(1 − 2h/d) )^k / exp( h(1 + ε)k/d )   (13)
 = ( (1 + ε) exp(−ε) )^{k/2}   (14)
 < exp( −(k/2)(ε²/2 − ε³/3) ).   (15)

Similarly, substituting (12) in (8) we get (16). This time taking h = εd/(2(1 + ε)) is not optimal, but it is good enough, giving (17). Again, series expansion yields (18).

Pr[ S < (1 − ε) k/d ] ≤ ( 1 − h/d + (3/2)(h/d)² )^k / exp( −h(1 − ε)k/d )   (16)
 = ( ( 1 − ε/(2(1 + ε)) + 3ε²/(8(1 + ε)²) ) exp( ε(1 − ε)/(2(1 + ε)) ) )^k   (17)
 < exp( −(k/2)(ε²/2 − ε³/3) ).   (18)

6. MOMENT BOUNDS

Here we prove bounds on certain moments of Q_1. To simplify notation, we drop the subscript, writing it simply as Q. It should be clear that the distribution of Q depends on α, i.e., Q = Q(α). This is precisely what we give up by not projecting onto spherically symmetric vectors. Our strategy for giving bounds on the moments of Q will be to determine a worst-case unit vector w and consider Q(w). Our precise claim is the following.

Lemma 7. Let w = (1/√d, ..., 1/√d). For every unit vector α ∈ R^d and for all k = 0, 1, ...,

E[ Q(α)^{2k} ] ≤ E[ Q(w)^{2k} ].   (19)

Moreover, we will prove that the even moments of Q(w) are dominated by the even moments of an appropriately scaled Normal random variable, i.e., by the corresponding moments from the spherically symmetric case.

Lemma 8. Let T =_D N(0, 1/d). For all d and all k = 0, 1, ...,

E[ Q(w)^{2k} ] ≤ E[ T^{2k} ].   (20)

Postponing the proofs of Lemmata 7 and 8 for a moment, let us use them to prove Lemma 6.

Proof of Lemma 6. We start by observing that

E[T⁴] = (1/√(2π)) ∫ exp(−λ²/2) (λ⁴/d²) dλ = 3/d².   (21)

Along with (19) and (20) this readily implies (12). For any real-valued random variable U, the Monotone Convergence Theorem (MCT) implies

E[exp(hU²)] = E[ Σ_{k≥0} (hU²)^k / k! ] = Σ_{k≥0} (h^k / k!) E[U^{2k}]

for all h such that E[exp(hU²)] is bounded. For E[exp(hT²)], below, taking h ∈ [0, d/2) makes the integral converge; thus, for such h we can apply the MCT to get (22). Now, applying (19) and (20) to (22) gives (23). Applying the MCT once more gives (24).

E[exp(hT²)] = (1/√(2π)) ∫ exp(−λ²/2) exp(hλ²/d) dλ = 1/√(1 − 2h/d)
 = Σ_{k≥0} (h^k / k!) E[T^{2k}]   (22)
 ≥ Σ_{k≥0} (h^k / k!) E[Q^{2k}]   (23)
 = E[exp(hQ²)].   (24)

Thus, E[exp(hQ²)] ≤ 1/√(1 − 2h/d) for h ∈ [0, d/2), which is precisely inequality (11).

To prove Lemma 7 we need the following lemma. Its proof appears in the Appendix.

Lemma 9. Let r_1, r_2 be i.i.d. random variables having one of the following two probability distributions: r_i ∈ {−1, +1}, each value having probability 1/2; or r_i ∈ {−√3, 0, +√3}, with 0 having probability 2/3 and ±√3 being equiprobable. For real numbers a, b let c = ((a² + b²)/2)^{1/2}. Then, for all T and all k = 0, 1, ...,

E[ (T + a r_1 + b r_2)^{2k} ] ≤ E[ (T + c r_1 + c r_2)^{2k} ].

Proof of Lemma 7. Recall that for any vector α, Q(α) = Q_1(α) = α · c_1, where c_1 = (r_1, ..., r_d)/√d. If α = (α_1, ..., α_d) is such that α_i² = α_j² for all i, j, then by symmetry Q(α) and Q(w) are identically distributed and the lemma holds trivially. Otherwise, we can assume, without loss of generality, that α_1² > α_2², and consider the more balanced unit vector θ = (c, c, α_3, ..., α_d), where c = ((α_1² + α_2²)/2)^{1/2}.

We will prove that

E[ Q(α)^{2k} ] ≤ E[ Q(θ)^{2k} ].   (25)

Applying this argument repeatedly yields the lemma, as θ eventually becomes w. To prove (25), below we first express E[Q(α)^{2k}] as a sum of averages over r_1, r_2. We then apply Lemma 9 to get that each term (average) in the sum is bounded by the corresponding average for the vector θ. More precisely,

E[ Q(α)^{2k} ] = (1/d^k) Σ_R E[ (R + α_1 r_1 + α_2 r_2)^{2k} ] Pr[ Σ_{i=3}^d α_i r_i = R ]
 ≤ (1/d^k) Σ_R E[ (R + c r_1 + c r_2)^{2k} ] Pr[ Σ_{i=3}^d α_i r_i = R ]
 = E[ Q(θ)^{2k} ].

Proof of Lemma 8. Recall that T =_D N(0, 1/d). We will first express T as the scaled sum of independent standard Normal random variables. This will allow for a direct comparison of the terms in each of the two expectations. Specifically, let {T_i}_{i=1}^d be a family of i.i.d. standard Normal random variables. Then Σ_i T_i is a Normal random variable with variance d. Therefore,

T =_D (1/d) Σ_{i=1}^d T_i.

Recall also that Q(w) = Q_1(w) = w · c_1, where c_1 = (r_1, ..., r_d)/√d. To simplify notation let us write r_i = Y_i, and let us also drop the dependence of Q on w. Thus,

Q = (1/d) Σ_{i=1}^d Y_i,

where the {Y_i}_{i=1}^d are i.i.d. random variables having one of the following two distributions: Y_i ∈ {−1, +1}, each value having probability 1/2; or Y_i ∈ {−√3, 0, +√3}, with 0 having probability 2/3 and ±√3 being equiprobable.

We are now ready to compare E[Q^{2k}] with E[T^{2k}]. We first observe that for every k = 0, 1, ...,

E[T^{2k}] = (1/d^{2k}) Σ_{i_1=1}^d ··· Σ_{i_2k=1}^d E[ T_{i_1} ··· T_{i_2k} ], and
E[Q^{2k}] = (1/d^{2k}) Σ_{i_1=1}^d ··· Σ_{i_2k=1}^d E[ Y_{i_1} ··· Y_{i_2k} ].

To prove the lemma we will show that for every value assignment to the indices i_1, ..., i_2k,

E[ Y_{i_1} ··· Y_{i_2k} ] ≤ E[ T_{i_1} ··· T_{i_2k} ].   (26)

Let V = v_1, v_2, ..., v_2k be the value assignment considered. For i ∈ {1, ..., d}, let c_V(i) be the number of times that i appears in V. Observe that if c_V(i) is odd for some i, then both expectations appearing in (26) are 0, since both {Y_i}_i and {T_i}_i are independent families and E[Y_i] = E[T_i] = 0 for all i. Thus, we can assume that there exists a set {j_1, j_2, ..., j_p} of indices and corresponding values l_1, l_2, ..., l_p such that

E[ Y_{i_1} ··· Y_{i_2k} ] = E[ Y_{j_1}^{2l_1} Y_{j_2}^{2l_2} ··· Y_{j_p}^{2l_p} ], and
E[ T_{i_1} ··· T_{i_2k} ] = E[ T_{j_1}^{2l_1} T_{j_2}^{2l_2} ··· T_{j_p}^{2l_p} ].

Note now that since the indices j_1, j_2, ..., j_p are distinct, {Y_{j_t}}_{t=1}^p and {T_{j_t}}_{t=1}^p are families of i.i.d. random variables. Therefore,

E[ Y_{i_1} ··· Y_{i_2k} ] = E[ Y_{j_1}^{2l_1} ] ··· E[ Y_{j_p}^{2l_p} ], and
E[ T_{i_1} ··· T_{i_2k} ] = E[ T_{j_1}^{2l_1} ] ··· E[ T_{j_p}^{2l_p} ].

So, without loss of generality, in order to prove (26) it suffices to prove that for every l = 0, 1, ...,

E[ Y_1^{2l} ] ≤ E[ T_1^{2l} ].   (27)

This, though, is completely trivial. Moreover, along with Lemma 9, it is the only point where we need to use properties of the distribution of the r_ij (here called Y_i). Let us first recall the well-known fact that the 2l-th moment of N(0, 1) is (2l − 1)!! = (2l)! / (l! 2^l). Furthermore: if Y ∈ {−1, +1} then E[Y^{2l}] = 1; if Y ∈ {−√3, 0, +√3} then E[Y^{2l}] = 3^{l−1} ≤ (2l)! / (l! 2^l), where the last inequality follows by an easy induction.

Let us note that since E[Y^{2l}] < E[T_1^{2l}] for certain l, one can show that for each fixed d both inequalities in Lemma 6 are actually strict, yielding slightly better tail bounds for S and a correspondingly better bound for k_0. As a last remark, we note that by using Jensen's inequality one can get a direct bound for E[Q^{2k}] when Y_i ∈ {−1, +1}, i.e., without comparing it to E[T^{2k}]. That simplifies the proof for that case and shows that taking Y_i ∈ {−1, +1} is the minimizer of E[exp(hQ²)] for all h.

Acknowledgments

I am grateful to Marek Biskup for his help with the proof of Lemma 8 and to Jeong Han Kim for suggesting the approach of equation (10). Many thanks also to Paul Bradley, Anna Karlin, Elias Koutsoupias and Piotr Indyk for comments on earlier versions of the paper and useful discussions.

7. REFERENCES

[1] S. Arora and R. Kannan. Learning mixtures of arbitrary Gaussians. Submitted, 2000.
[2] S. Dasgupta. Learning mixtures of Gaussians. In 40th Annual Symposium on Foundations of Computer Science (New York, NY, 1999), pages 634–644. IEEE Comput. Soc. Press, Los Alamitos, CA, 1999.
[3] S. Dasgupta and A. Gupta. An elementary proof of the Johnson-Lindenstrauss lemma. Technical Report 99-006, UC Berkeley, March 1999.
[4] P. Frankl and H. Maehara. The Johnson-Lindenstrauss lemma and the sphericity of some graphs. J. Combin. Theory Ser. B, 44(3):355–362, 1988.
[5] P. Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In 41st Annual Symposium on Foundations of Computer Science (Redondo Beach, CA, 2000), pages 189–197. IEEE Comput. Soc. Press, Los Alamitos, CA, 2000.
[6] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In 30th Annual ACM Symposium on Theory of Computing (Dallas, TX, 1998), pages 604–613. ACM, New York, 1998.
[7] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in Modern Analysis and Probability (New Haven, Conn., 1982), pages 189–206. Amer. Math. Soc., Providence, RI, 1984.
[8] J. Kleinberg. Two algorithms for nearest-neighbor search in high dimensions. In 29th Annual ACM Symposium on Theory of Computing (El Paso, TX, 1997), pages 599–608. ACM, New York, 1997.
[9] N. Linial, E. London, and Y. Rabinovich. The geometry of graphs and some of its algorithmic applications. Combinatorica, 15(2):215–245, 1995.
[10] C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala. Latent semantic indexing: a probabilistic analysis. In 17th Annual Symposium on Principles of Database Systems (Seattle, WA, 1998), pages 159–168, 1998.
[11] L. J. Schulman. Clustering for edge-cost minimization. In 32nd Annual ACM Symposium on Theory of Computing (Portland, OR, 2000), pages 547–555. ACM, New York, 2000.
[12] S. Vempala. A random sampling based algorithm for learning the intersection of half-spaces. In 38th Annual Symposium on Foundations of Computer Science (Miami, FL, 1997), pages 508–513. IEEE Comput. Soc. Press, Los Alamitos, CA, 1997.
[13] S. Vempala and R. I. Arriaga. An algorithmic theory of learning: robust concepts and random projection. In 40th Annual Symposium on Foundations of Computer Science (New York, NY, 1999), pages 616–623. IEEE Comput. Soc. Press, Los Alamitos, CA, 1999.

APPENDIX

Proof of Lemma 9. Let us first consider the case where r_i ∈ {−1, +1}, each value having probability 1/2. If a² = b² then a = c and the lemma holds with equality. Otherwise, let us write

E[ (T + c r_1 + c r_2)^{2k} ] − E[ (T + a r_1 + b r_2)^{2k} ] = S_k / 4,

where

S_k = (T + 2c)^{2k} + 2 T^{2k} + (T − 2c)^{2k} − (T + a + b)^{2k} − (T + a − b)^{2k} − (T − a + b)^{2k} − (T − a − b)^{2k}.

We will show that S_k ≥ 0 for all k ≥ 0. Since a² ≠ b², we can use the binomial theorem to expand every term other than 2T^{2k} in S_k and get (writing C(2k, i) for the binomial coefficient)

S_k = 2 T^{2k} + Σ_{i=0}^{2k} C(2k, i) T^{2k−i} D_i,

where

D_i = (2c)^i + (−2c)^i − (a + b)^i − (a − b)^i − (−a + b)^i − (−a − b)^i.

Observe now that for odd i, D_i = 0. Moreover, we claim that D_{2j} ≥ 0 for all j ≥ 1. To see this claim, observe that 2a² + 2b² = (a + b)² + (a − b)², and that for all j ≥ 1 and x, y ≥ 0, (x + y)^j ≥ x^j + y^j. Thus,

S_k = 2 T^{2k} + Σ_{j=0}^{k} C(2k, 2j) T^{2(k−j)} D_{2j} = Σ_{j=1}^{k} C(2k, 2j) T^{2(k−j)} D_{2j} ≥ 0,

since D_0 = −2 cancels the leading 2T^{2k}.

The proof for the case where r_i ∈ {−√3, 0, +√3} is merely a more cumbersome version of the proof above, so we omit it. That proof, though, brings forward an interesting point. If one tries to take r_i = 0 with probability greater than 2/3, while maintaining a support of size 3 and variance 1, the lemma fails. In other words, 2/3 is tight in terms of how much probability mass we can put on r_i = 0 and still have the all-ones vector be the worst-case one.