Lecture 6 : Dimensionality Reduction


CPS290: Algorithmic Foundations of Data Science
February 3, 2017
Lecturer: Kamesh Munagala
Scribe: Kamesh Munagala

In this lecture, we will consider the problem of mapping $n$ points in a metric space with large dimension into a metric space with much smaller dimension, while preserving pairwise distances approximately.

Unbiased Estimators of Distance

The main idea comes from hashing. In standard hashing, the hash functions we used were oblivious to the relationship between different items. Can we make hash functions behave in such a way that they map objects that are nearby, according to some similarity measure, into nearby buckets? Surprisingly, it turns out this can be done for many similarity (distance) measures.

Warm-up: Hamming Cube

Consider $n$ points on the Hamming cube $\{0, 1\}^d$ in $d$ dimensions. Suppose the distance function is the $\ell_1$-distance:

$$D(x, y) = \sum_{k=1}^{d} |x_k - y_k|,$$

which simply counts the number of coordinates where the points differ. Consider now the following simple hash family:

$$H = \{h_k \mid h_k(x) = k\text{-th bit of } x\}.$$

Such a hash function maps each item to one of two buckets. Let $Z_{ij}$ denote the random variable that is 1 if items $x_i$ and $x_j$ map to different buckets and 0 otherwise, where the randomness is over the hash function chosen uniformly from $H$. Then it is easy to check that

$$E[Z_{ij}] = \frac{1}{d} \sum_{k=1}^{d} \mathbb{1}[h_k(x_i) \neq h_k(x_j)] = \frac{1}{d} \sum_{k=1}^{d} |x_{ik} - x_{jk}| = \frac{D(x_i, x_j)}{d}.$$

Therefore, a hash function from this family, with its output bit scaled by $d$, can be viewed as mapping the input points to $\{0, d\}$ so that the expected distance between the points is exactly preserved.

Of course, the expected distance does not mean much: if you choose any one hash function $h_k$, the $n$ points map to one of two points, so the distance in the embedding is either 0 or $d$. Therefore, it is very likely that none of the distances are preserved in this mapping; the distances are either compressed to 0 or expanded to $d$.
We can now use the standard trick: choose $d'$ hash functions $h_{k_1}, h_{k_2}, \ldots, h_{k_{d'}}$ at random from $H$, and map $x$ to the $d'$-dimensional point

$$x \mapsto \frac{d}{d'} \left( h_{k_1}(x), \ldots, h_{k_{d'}}(x) \right),$$

that is, coordinate $m$ of the image is $d/d'$ times the value of the $k_m$-th bit of $x$, for $k_m$ being a random dimension for each $m = 1, 2, \ldots, d'$. By linearity of expectation applied to the previous argument, if $Z_{ij}$ is the random variable denoting the $\ell_1$ distance between the images of $x_i$ and $x_j$, then $E[Z_{ij}] = D(x_i, x_j)$. Since we are choosing the hash functions at random, it is unlikely that two input items $x_i$ and $x_j$ will map to the same bucket in all dimensions.
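The sampling trick above can be sketched in a few lines of Python. This is only an illustrative simulation under stated assumptions: the names `sample_coords`, `embed`, and `d_prime` (the number of sampled coordinates), as well as the particular values of `d`, `d_prime`, and the trial count, are hypothetical choices and not part of the notes. It estimates the distance between two points that differ in a single coordinate, and records how often the estimate collapses to 0.

```python
import random


def sample_coords(d, d_prime, rng):
    """Pick d_prime coordinates uniformly at random (with replacement) from range(d)."""
    return [rng.randrange(d) for _ in range(d_prime)]


def embed(x, coords, d):
    """Map x to (d / d_prime) * (x[k_1], ..., x[k_{d_prime}])."""
    scale = d / len(coords)
    return [scale * x[k] for k in coords]


def l1(u, v):
    """l1 distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


rng = random.Random(0)          # fixed seed so the run is repeatable
d, d_prime, trials = 64, 8, 20000

x = [0] * d
y = [0] * d
y[0] = 1                        # x and y differ in one coordinate, so D(x, y) = 1

estimates = []
for _ in range(trials):
    coords = sample_coords(d, d_prime, rng)   # same hash functions for both points
    estimates.append(l1(embed(x, coords, d), embed(y, coords, d)))

mean_est = sum(estimates) / trials                     # close to D(x, y) = 1
frac_zero = sum(e == 0 for e in estimates) / trials    # how often the distance collapses to 0

print("average estimate:", mean_est)
print("fraction of trials with estimate 0:", frac_zero)
```

The average is close to the true distance, yet most individual trials report a distance of 0 and the rest report at least $d/d' = 8$. The estimator is unbiased but far too noisy to be useful on its own.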

The question then becomes: how large should the number of hash functions $d'$ be so that with large probability, all $Z_{ij}$ for $i, j \in \{1, 2, \ldots, n\}$ are close to their expected value? It is easy to check that this has to be as large as $d$. For instance, with $d' < d$, the smallest nonzero distance between two hashed vectors is $d/d' > 1$, but it could be that these vectors were originally only distance 1 apart.

There is a moral to the above story: it is not sufficient merely to construct a hash function that preserves distance in expectation. It is also necessary to ensure that the variance of this estimator is much smaller than the mean. Otherwise the trick of taking many independent copies of this hash function will end up requiring more dimensions than the initial space!

However, there is a subtle point with the above hash function. It ensures that

$$\Pr[h_k(x_i) \neq h_k(x_j)] = \frac{D(x_i, x_j)}{d},$$

or equivalently, that the two points land in the same bucket with probability $1 - D(x_i, x_j)/d$. Though labeling the buckets as 0 and 1 and using the resulting distance introduces too much variance, surprisingly, we can use the fact that the probability of mapping to the same bucket depends on distance in order to beat brute force for similarity search. This will be the topic of the next lecture on LSH.

Digression: Central Limit Theorem

Before proceeding further, we will present a rough statement of the Central Limit Theorem. Let $X_1, X_2, \ldots, X_{d'}$ be independent random variables, each with mean $\mu$ and standard deviation $\tau$. Then under mild restrictions on higher moments of these variables, in the limit of large $d'$, the distribution of

$$Z = \frac{1}{d'} \sum_{k=1}^{d'} X_k$$

converges to $N(\mu, \sigma^2)$, where $\sigma = \tau / \sqrt{d'}$, and $N(\mu, \sigma^2)$ is the Normal distribution with mean $\mu$ and standard deviation $\sigma$.

What is a Normal distribution? The distribution $N(\mu, \sigma^2)$ has the density function

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}.$$

For a Normal random variable $Y \sim N(\mu, \sigma^2)$, the deviation around the mean falls off exponentially, so that roughly speaking, assuming $k \geq 1$,

$$\Pr[\,|Y - \mu| \geq k\sigma\,] \leq \frac{2}{\sqrt{2\pi}} \cdot \frac{e^{-k^2/2}}{k} \leq e^{-k^2/2}.$$

Therefore, the probability that the random variable deviates significantly from the mean is very small, provided we take deviations relative to the standard deviation.
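As a quick numeric check of the Normal tail bound, the sketch below compares the exact two-sided tail probability with the bound $\frac{2}{\sqrt{2\pi}} e^{-k^2/2}/k$ and the cruder $e^{-k^2/2}$. The function names `two_sided_tail` and `tail_bound` are hypothetical. The exact tail uses the standard identity $\Pr[|Y - \mu| \geq k\sigma] = \mathrm{erfc}(k/\sqrt{2})$.

```python
import math


def two_sided_tail(k):
    """Exact Pr[|Y - mu| >= k * sigma] for a Normal Y, via the complementary error function."""
    return math.erfc(k / math.sqrt(2.0))


def tail_bound(k):
    """The bound (2 / sqrt(2 * pi)) * exp(-k^2 / 2) / k, valid for k > 0."""
    return 2.0 / math.sqrt(2.0 * math.pi) * math.exp(-k * k / 2.0) / k


def crude_bound(k):
    """The simpler bound exp(-k^2 / 2), valid for k >= 1."""
    return math.exp(-k * k / 2.0)


for k in (1, 2, 3, 6):
    # Each row should be non-decreasing from left to right after the value of k.
    print(k, two_sided_tail(k), tail_bound(k), crude_bound(k))
```

For every $k \geq 1$ in the loop, the three printed values are ordered: exact tail, then the sharper bound, then the crude bound.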
For instance, "six sigma" means six standard deviations from the mean, whose probability looks like $1/e^{18}$, which is on the order of $10^{-8}$ (keeping the $1/k$ factor in the tail bound gives about $2 \times 10^{-9}$), a really small number!

Euclidean Spaces

The question that arises from the above discussion is: is there some other similarity measure for which we can construct an unbiased estimator with low variance? The answer, surprisingly, is yes for Euclidean space. The hash family is

$$H = \{h_r \mid h_r(x) = r \cdot x \text{ and } \|r\|_2 = 1\}.$$

In other words, choose a unit vector $r$ at random from the unit sphere; the hash value is the (signed) length of the projection of $x$ in the direction of $r$.
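A minimal sketch of one hash function from this family follows. It assumes the usual way of drawing a random direction, namely normalizing a vector of independent $N(0, 1)$ coordinates. The helper names `random_unit_vector` and `h` and the example point are illustrative only.

```python
import math
import random


def random_unit_vector(d, rng):
    """Draw d independent N(0, 1) coordinates and rescale the vector to unit length."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def h(r, x):
    """Hash value h_r(x) = r . x, the signed length of the projection of x onto r."""
    return sum(rk * xk for rk, xk in zip(r, x))


rng = random.Random(1)           # fixed seed so the run is repeatable
x = [3.0, 4.0, 0.0]              # example point with Euclidean norm 5
r = random_unit_vector(len(x), rng)
value = h(r, x)

print("unit vector:", r)
print("hash value:", value)      # always lies in [-5, 5] for this x
```

Because $r$ has unit norm, the hash value can never exceed $\|x\|_2$ in absolute value.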

Random Vectors. In order to understand what the above procedure achieves, we should first understand how to generate a random unit vector. We will show that the following process obeys a distribution that looks the same in all directions: for each coordinate $r_k$, $k = 1, 2, \ldots, d$, generate $r_k$ independently from a Normal distribution with mean 0 and variance 1, that is, from $N(0, 1)$. This vector will not have unit norm, but the vector $\hat{r} = r / \|r\|_2$ has unit norm, and points in the same direction as $r$.

Let us write the closed form for the density of $r$. Let $f(r_1, r_2, \ldots, r_d)$ denote the density function. Since $r_1 \sim N(0, 1)$, and independently $r_2 \sim N(0, 1)$, and so on, the joint density is simply the product of the densities of $r_1, r_2, \ldots, r_d$. This means

$$f(r_1, r_2, \ldots, r_d) = \prod_{k=1}^{d} \frac{1}{\sqrt{2\pi}} \, e^{-r_k^2/2} = \frac{1}{(2\pi)^{d/2}} \, e^{-\sum_{k=1}^{d} r_k^2 / 2} = \frac{1}{(2\pi)^{d/2}} \, e^{-\|r\|_2^2 / 2}.$$

The density depends only on the length of $r$ and not on its direction! This means that for a given length, all directions have the same density. In other words, this process generates a vector whose direction is random.

Properties of Normal Distributions. In applying the Central Limit Theorem, we used the fact that the Normal distribution's probability of deviation from the mean falls off exponentially: the probability of being $k$ standard deviations away drops off roughly as $e^{-k^2}$. In understanding the above hashing scheme, we need a different property of Normal distributions. The property is the following:

Claim 1. If $X \sim N(\mu, \sigma^2)$ and $a \neq 0$ is a constant, then $aX \sim N(a\mu, a^2\sigma^2)$. Furthermore, if $Y \sim N(\mu_1, \sigma_1^2)$ and $Z \sim N(\mu_2, \sigma_2^2)$ are independent random variables, then $Y + Z \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.

The proof follows by writing out the respective density functions and checking. The key point is that taking linear combinations of independent Normal random variables yields a Normal random variable.

Unbiased Estimator. Why is the above fact relevant? Consider our hash function

$$h_r(x) = r \cdot x = \sum_{k=1}^{d} r_k x_k.$$

Since $r_k \sim N(0, 1)$, the first part of the above claim implies $r_k x_k \sim N(0, x_k^2)$.
Since the $r_k$'s in different dimensions are independent, the above claim also implies

$$h_r(x) = \sum_{k=1}^{d} r_k x_k \sim N\!\left(0, \sum_{k=1}^{d} x_k^2\right) = N(0, \|x\|_2^2).$$

This shows that the hash value is Normally distributed with zero mean and variance equal to the squared norm of $x$. For a random variable with mean zero, the expectation of the square equals the variance. (Show this!) This means

$$E\left[h_r(x)^2\right] = \|x\|_2^2,$$

where the expectation is over the choice of $r$. Thus we have an unbiased estimator of the squared norm of $x$: generate a random vector $r$ by choosing each coordinate from an $N(0, 1)$ distribution independently; take the square of the length of the projection of $x$ onto $r$.
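The unbiasedness claim is easy to check by simulation. The sketch below is only an illustration: the helper names `gaussian_vector` and `h`, the example vector `x`, and the trial count are hypothetical choices. It averages $h_r(x)^2$ over many independent Gaussian vectors $r$ and compares the result with $\|x\|_2^2$.

```python
import random


def gaussian_vector(d, rng):
    """Vector with d independent N(0, 1) coordinates (not normalized)."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]


def h(r, x):
    """Hash value h_r(x) = r . x."""
    return sum(rk * xk for rk, xk in zip(r, x))


rng = random.Random(2)                      # fixed seed so the run is repeatable
x = [1.0, -2.0, 0.5, 3.0]
sq_norm = sum(v * v for v in x)             # squared Euclidean norm of x

trials = 50000
samples = [h(gaussian_vector(len(x), rng), x) ** 2 for _ in range(trials)]
mean_sq = sum(samples) / trials             # empirical estimate of E[h_r(x)^2]

print("squared norm of x:", sq_norm)
print("average of h_r(x)^2:", mean_sq)
```

A single sample of $h_r(x)^2$ is still very noisy; only the average over many independent directions is close to the squared norm.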

The properties of Normal distributions imply that the above holds even for the difference of two vectors, so that

$$E\left[(h_r(x) - h_r(y))^2\right] = E\left[\left(\sum_{k=1}^{d} r_k (x_k - y_k)\right)^2\right] = \|x - y\|_2^2.$$

Therefore, if we are given $n$ points $D = \{x_1, x_2, \ldots, x_n\}$ and we project these points onto a random vector, the squared distance between any two points is the same as the expectation of the squared difference in their projections.

So far so good. But we had something similar even for Hamming spaces; the problem was that the variance of the estimator there was too large for it to be useful. What about in this case?

Bounding the Variance. We perform the same trick as before: in order to reduce variance in our estimator, we choose $d'$ hash functions at random from $H$ (each of these is a random vector $r_1, r_2, \ldots, r_{d'}$ generated independently of the others), and map $x$ to the point $\Pi(x)$ in $d'$-dimensional space:

$$\Pi(x) = \frac{1}{\sqrt{d'}} \left( x \cdot r_1, \; x \cdot r_2, \; \ldots, \; x \cdot r_{d'} \right).$$

Each dimension of $\Pi(x)$ is distributed as $N\!\left(0, \|x\|_2^2 / d'\right)$, where the variance is divided by $d'$ because we scale down each dimension by a factor of $\sqrt{d'}$. Similarly, $\Pi(x) - \Pi(y)$ has each of its coordinates distributed as $N\!\left(0, \|x - y\|_2^2 / d'\right)$. By the same argument as before,

$$E\left[\|\Pi(x) - \Pi(y)\|_2^2\right] = \sum_{s=1}^{d'} \frac{\|x - y\|_2^2}{d'} = \|x - y\|_2^2.$$

All we need to show is that the random variable $\|\Pi(x) - \Pi(y)\|_2^2$, which is the sum of the squares of $d'$ independent random variables each distributed as $N\!\left(0, \|x - y\|_2^2 / d'\right)$, is tightly concentrated around its expectation, for reasonable values of $d'$.

Sums of Squares of Normals. We will need to know what an Exponential distribution is. This distribution has a parameter $\lambda$, and has the density function

$$f(x) = \frac{1}{\lambda} \, e^{-x/\lambda} \quad \text{for } x \geq 0.$$

This distribution has mean $\lambda$ and standard deviation $\lambda$. Intuitively, think about tossing a coin with bias $p$ until a heads is obtained. Let the random variable $X$ denote the number of failed coin tosses before success. This is a Geometric distribution that satisfies $\Pr[X = k] = (1 - p)^k \, p$.
Intuitively, the Exponential distribution is the continuous version of the Geometric distribution, where you make the success probability smaller and smaller, and squish time so there are a large number of coin tosses in a unit time interval.

What is the connection between Exponential and Normal distributions? Here's a math fact whose proof is tedious algebra:

Claim 2. Suppose $X \sim N(0, \sigma^2)$ and $Y \sim N(0, \sigma^2)$ are independent random variables. Then $X^2 + Y^2 \sim \text{Exponential}(2\sigma^2)$.
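Claim 2 can be sanity-checked numerically rather than proved. The sketch below is a simulation with arbitrarily chosen `sigma` and sample size, not a proof. It draws many values of $X^2 + Y^2$ and compares their mean, standard deviation, and one tail probability with those of an Exponential distribution whose mean is $2\sigma^2$, for which $\Pr[\text{value} > t] = e^{-t/(2\sigma^2)}$.

```python
import math
import random

rng = random.Random(3)                   # fixed seed so the run is repeatable
sigma = 1.5
lam = 2.0 * sigma ** 2                   # claimed mean (and standard deviation) of X^2 + Y^2

trials = 100000
samples = [rng.gauss(0.0, sigma) ** 2 + rng.gauss(0.0, sigma) ** 2
           for _ in range(trials)]

mean = sum(samples) / trials
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / trials)

t = lam                                  # threshold at which to compare tail probabilities
empirical_tail = sum(s > t for s in samples) / trials
exponential_tail = math.exp(-t / lam)    # Pr[Exponential with mean lam > t]

print("claimed mean and std:", lam)
print("empirical mean:", mean, "empirical std:", std)
print("empirical tail:", empirical_tail, "exponential tail:", exponential_tail)
```

The empirical mean and standard deviation both come out close to $2\sigma^2$, which is the property the variance argument relies on.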

Consider some two dimensions of $\Pi(x) - \Pi(y)$. Let the values here be $X$ and $Y$. Then we know that $X, Y \sim N\!\left(0, \|x - y\|_2^2 / d'\right)$, where $d'$ is the number of dimensions of the projection. This means

$$X^2 + Y^2 \sim \text{Exponential}\!\left(\frac{2 \|x - y\|_2^2}{d'}\right).$$

This distribution has mean $2\|x - y\|_2^2 / d'$, and the same value as standard deviation. This is the crux: the standard deviation of our estimator of squared length is comparable to the mean. This was not true for Hamming spaces, where it could have been roughly $\sqrt{d}$ times larger (for two points differing in a single coordinate)!

The quantity $\Pi(x) - \Pi(y)$ has squared length equal to the sum of $d'/2$ independent copies of this random variable. This means its mean is $d'/2$ times larger, and its standard deviation is $\sqrt{d'/2}$ times larger. Furthermore, by the CLT, the distribution becomes approximately Normal. This means the squared length of $\Pi(x) - \Pi(y)$ is approximately Normal with mean $\|x - y\|_2^2$ and standard deviation $\|x - y\|_2^2 \sqrt{2/d'}$. Note that the standard deviation is now much smaller than the mean!

Suppose we choose $d' = 6 \log n$ (natural logarithm), where $n$ is the number of points in our database. Then the probability that we deviate by more than $k = \sqrt{d'/2}$ times the standard deviation is at most

$$e^{-k^2} = e^{-d'/2} = \frac{1}{n^3}.$$

(This uses the rough tail estimate $e^{-k^2}$; with the $e^{-k^2/2}$ form of the Normal tail bound, the same calculation needs $d' = 12 \log n$.) But $k$ times the standard deviation is at most $\|x - y\|_2^2$, which is at most the mean. This means that with very high probability, the squared length is at most twice the original squared length. By a union bound over all the $n^2$ pairs of points, all the distances are preserved to within this factor with probability at least $1 - n^2/n^3 = 1 - 1/n$.

The argument is somewhat rough, but it is reasonably complete and can be extended to show that by choosing slightly more hash functions, the distances are in fact very close to the true value with very high probability. We finally have the following theorem, which shows that we can reduce the dimension to roughly $\log n$ while preserving distances among $n$ input points.

Theorem 1 (Johnson-Lindenstrauss). Given any $n$ points in Euclidean space, for any $\epsilon > 0$, there is a mapping to $d' = O\!\left(\frac{\log n}{\epsilon^2}\right)$ dimensions so that with probability at least $1 - \frac{1}{n}$, the distance between any pair of points is preserved to within a factor of $1 \pm \epsilon$.
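To make the construction concrete, here is a small sketch of the random projection $\Pi$ using only the Python standard library. Several things in it are assumptions made for illustration: the function names, the random test points, the sizes `n` and `d`, and the choice `d_prime = ceil(6 ln n)` taken from the rough calculation above. Since that calculation is itself rough, the code reports the observed range of distortion instead of asserting any particular guarantee.

```python
import math
import random


def random_directions(d, d_prime, rng):
    """d_prime independent vectors, each with d independent N(0, 1) coordinates."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d_prime)]


def project(x, directions):
    """Pi(x) = (x . r_1, ..., x . r_{d'}) / sqrt(d')."""
    scale = 1.0 / math.sqrt(len(directions))
    return [scale * sum(rk * xk for rk, xk in zip(r, x)) for r in directions]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


rng = random.Random(4)                        # fixed seed so the run is repeatable
n, d = 20, 200
d_prime = math.ceil(6 * math.log(n))          # 6 ln n, rounded up

points = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
directions = random_directions(d, d_prime, rng)   # chosen without looking at the points
images = [project(p, directions) for p in points]

# Ratio of projected to original squared distance, for every pair of points.
ratios = [sq_dist(images[i], images[j]) / sq_dist(points[i], points[j])
          for i in range(n) for j in range(i + 1, n)]

print("projected dimension:", d_prime)
print("smallest ratio:", min(ratios))
print("largest ratio:", max(ratios))
print("average ratio:", sum(ratios) / len(ratios))
```

The same `directions` are reused for every point, which is what makes the scheme oblivious to the input. Increasing `d_prime` narrows the spread of the ratios around 1.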
Note that the quantification in the above theorem is crucial: if we randomly project onto $O\!\left(\frac{\log n}{\epsilon^2}\right)$ dimensions, then with very high probability, all pairwise distances between the $n$ input points are preserved. This is what makes the result algorithmically useful. Furthermore, the neat feature is that the scheme is oblivious to the input: for any set $S$ of $n$ input points, the random directions are chosen from the same distribution. This has applications in settings where data is scanned one input at a time. We will see another such method, the Fourier Transform, a bit later in the course.

Of course, it is conceivable that a dimension reduction scheme that depends on the input points needs fewer dimensions to preserve salient properties of the data. Maybe all the data lies on a 2-dimensional subspace to start with. In such a case, an oblivious scheme such as the above still requires $\log n$ dimensions, but a scheme that depends on the data could identify the subspace and project onto it. We will consider this when we discuss algebraic methods such as PCA.
