Dimensionality Reduction Notes 1

Jelani Nelson
minilek@seas.harvard.edu

August 10, 2015

1 Preliminaries

Here we collect some notation and basic lemmas used throughout this note. Throughout, for a random variable $X$, $\|X\|_p$ denotes $(\mathbb{E}|X|^p)^{1/p}$. It is known that $\|\cdot\|_p$ is a norm for any $p \ge 1$ (Minkowski's inequality). It is also known that $\|X\|_p \le \|X\|_q$ whenever $p \le q$. Henceforth, whenever we discuss $\|\cdot\|_p$, we will assume $p \ge 1$.

Lemma 1 (Khintchine inequality). For any $p \ge 1$, $x \in \mathbb{R}^n$, and $(\sigma_i)$ independent Rademachers,
\[ \Big\| \sum_i \sigma_i x_i \Big\|_p \lesssim \sqrt{p}\, \|x\|_2 . \]

Proof. Without loss of generality we can assume $p$ is an even integer. Consider $(g_i)$ independent gaussians of mean zero and variance 1. Expand $\mathbb{E}(\sum_i \sigma_i x_i)^p$ into a sum of monomials. Any monomial with odd exponents vanishes, as in the gaussian case. Meanwhile the other monomials are nonnegative with all Rademacher moments being 1, while in the gaussian case the corresponding moments are at least 1. Thus the Rademacher case is term-by-term dominated by the gaussian case, and $\|\sum_i \sigma_i x_i\|_p \le \|\sum_i g_i x_i\|_p$. But $\sum_i g_i x_i$ is a gaussian with mean zero and variance $\|x\|_2^2$, and hence its $p$-norm is $\|x\|_2 \cdot (p!/(2^{p/2}(p/2)!))^{1/p} \lesssim \sqrt{p}\,\|x\|_2$. $\square$

We often use Jensen's inequality below, especially for $F(x) = |x|^p$ ($p \ge 1$).

Lemma 2 (Jensen's inequality). For $F$ convex, $F(\mathbb{E} X) \le \mathbb{E} F(X)$.
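The following is a small numerical sanity check of Lemma 1, added here purely for illustration (it is not part of the notes' argument): we estimate $\|\sum_i \sigma_i x_i\|_p$ by Monte Carlo and compare it against $\sqrt{p}\,\|x\|_2$. The dimension $n$, the vector $x$, and the number of samples are arbitrary choices.

```python
# Monte Carlo sanity check of the Khintchine inequality (Lemma 1).
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 200_000
x = rng.standard_normal(n)                         # any fixed vector x in R^n
sigma = rng.choice([-1.0, 1.0], size=(trials, n))  # independent Rademachers
S = sigma @ x                                      # samples of sum_i sigma_i x_i

for p in [1, 2, 4, 8]:
    lhs = (np.abs(S) ** p).mean() ** (1.0 / p)     # empirical p-norm of the sum
    rhs = np.sqrt(p) * np.linalg.norm(x)           # sqrt(p) * ||x||_2
    print(f"p={p}: ||sum sigma_i x_i||_p ~ {lhs:.3f}  vs  sqrt(p)*||x||_2 = {rhs:.3f}")
```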
Before proving a couple of concentration inequalities, we prove a lemma which lets us freely obtain tail bounds from moment bounds and vice versa (often we prove a moment bound and later invoke a tail bound, or vice versa, without even mentioning any justification).

Lemma 3. Let $Z$ be a scalar random variable. Consider the following statements:

(1a) There exists $\sigma > 0$ s.t. for all $p \ge 1$, $\|Z\|_p \le C_1 \sigma \sqrt{p}$.

(1b) There exists $\sigma > 0$ s.t. for all $\lambda > 0$, $\mathbb{P}(|Z| > \lambda) \le C_2 e^{-C_2' \lambda^2/\sigma^2}$.

(2a) There exists $K > 0$ s.t. for all $p \ge 1$, $\|Z\|_p \le C_3 K p$.

(2b) There exists $K > 0$ s.t. for all $\lambda > 0$, $\mathbb{P}(|Z| > \lambda) \le C_4 e^{-C_4' \lambda/K}$.

(3a) There exist $\sigma, K > 0$ s.t. for all $p \ge 1$, $\|Z\|_p \le C_5 (\sigma\sqrt{p} + K p)$.

(3b) There exist $\sigma, K > 0$ s.t. for all $\lambda > 0$, $\mathbb{P}(|Z| > \lambda) \le C_6 (e^{-C_6' \lambda^2/\sigma^2} + e^{-C_6' \lambda/K})$.

Then 1a is equivalent to 1b, 2a is equivalent to 2b, and 3a is equivalent to 3b, where the constants $C_i, C_i'$ in each case change by at most some absolute constant factor.

Proof. We will show only that 1a is equivalent to 1b; the other cases are argued identically.

To show that 1a implies 1b, by Markov's inequality
\[ \mathbb{P}(|Z| > \lambda) \le \lambda^{-p}\, \mathbb{E}|Z|^p \le \left( \frac{C_1^2 \sigma^2 p}{\lambda^2} \right)^{p/2} . \]
Statement 1b follows by choosing $p = \max\{1, \lambda^2/(2 C_1^2 \sigma^2)\}$.

To show that 1b implies 1a, by integration by parts we have
\[ \mathbb{E}|Z|^p = \int_0^\infty p \lambda^{p-1}\, \mathbb{P}(|Z| > \lambda)\, d\lambda \le C_2 \int_0^\infty p \lambda^{p-1} e^{-C_2' \lambda^2/\sigma^2}\, d\lambda . \]
The integral on the right hand side is, up to a constant factor, the $p$th moment of a gaussian random variable with mean zero and variance $\sigma'^2 = \sigma^2/(2 C_2')$. Statement 1a then follows since such a gaussian has $p$-norm $\Theta(\sigma' \sqrt{p})$. $\square$
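As a concrete illustration of how Lemma 3 gets used (an aside added here for clarity): Khintchine (Lemma 1) says that $Z = \sum_i \sigma_i x_i$ satisfies $\|Z\|_p \lesssim \sqrt{p}\,\|x\|_2$ for all $p \ge 1$, which is statement 1a with $\sigma = \|x\|_2$. Lemma 3 then immediately yields the subgaussian tail bound
\[ \mathbb{P}\Big( \Big|\sum_i \sigma_i x_i\Big| > \lambda \Big) \lesssim e^{-c\lambda^2/\|x\|_2^2} \]
for all $\lambda > 0$, with no further computation.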
Now, the following is a bread-and-butter trick for bounding $p$th moments of sums of independent random variables. A more general version of this lemma can be found as Lemma 6.3 in [LT91].

Lemma 4 (Symmetrization / Desymmetrization). Let $Z_1, \ldots, Z_n$ be independent random variables. Let $r_1, \ldots, r_n$ be independent Rademachers. Then
\[ \Big\| \sum_i Z_i - \mathbb{E} \sum_i Z_i \Big\|_p \le 2 \Big\| \sum_i r_i Z_i \Big\|_p \qquad \text{(symmetrization inequality)} \]
and
\[ \frac{1}{2} \Big\| \sum_i r_i (Z_i - \mathbb{E} Z_i) \Big\|_p \le \Big\| \sum_i Z_i \Big\|_p \qquad \text{(desymmetrization inequality)}. \]

Proof. For the first inequality, let $Y_1, \ldots, Y_n$ be independent of the $Z_i$ but identically distributed to them. Then
\[ \Big\| \sum_i Z_i - \mathbb{E}\sum_i Z_i \Big\|_p = \Big\| \mathbb{E}_Y \sum_i (Z_i - Y_i) \Big\|_p \le \Big\| \sum_i (Z_i - Y_i) \Big\|_p \qquad \text{(Jensen)} \]
\[ = \Big\| \sum_i r_i (Z_i - Y_i) \Big\|_p \qquad (1) \]
\[ \le 2 \Big\| \sum_i r_i Z_i \Big\|_p \qquad \text{(triangle inequality)} \]
where (1) follows since the $Z_i - Y_i$ are independent across $i$ and symmetric.

For the second inequality, let the $Y_i$ be as before. Then
\[ \Big\| \sum_i r_i (Z_i - \mathbb{E} Z_i) \Big\|_p = \Big\| \mathbb{E}_Y \sum_i r_i (Z_i - Y_i) \Big\|_p \le \Big\| \sum_i r_i (Z_i - Y_i) \Big\|_p \qquad \text{(Jensen)} \]
\[ = \Big\| \sum_i (Z_i - Y_i) \Big\|_p \le 2 \Big\| \sum_i Z_i \Big\|_p \qquad \text{(triangle inequality)} \qquad \square \]
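A quick Monte Carlo illustration of the symmetrization inequality (Lemma 4), added as a sketch and not part of the notes' argument: we compare $\|\sum_i Z_i - \mathbb{E}\sum_i Z_i\|_p$ against $2\|\sum_i r_i Z_i\|_p$ empirically. The choice of distribution for the $Z_i$ (uniform on $[0,1]$), the moment $p$, and the sample size are arbitrary.

```python
# Monte Carlo check of the symmetrization inequality (Lemma 4).
import numpy as np

rng = np.random.default_rng(1)
n, trials, p = 30, 100_000, 4
Z = rng.uniform(0.0, 1.0, size=(trials, n))     # independent Z_i, each with mean 1/2
r = rng.choice([-1.0, 1.0], size=(trials, n))   # independent Rademachers

centered = Z.sum(axis=1) - n * 0.5              # sum_i Z_i - E sum_i Z_i
symmetrized = (r * Z).sum(axis=1)               # sum_i r_i Z_i

lhs = (np.abs(centered) ** p).mean() ** (1.0 / p)
rhs = 2.0 * (np.abs(symmetrized) ** p).mean() ** (1.0 / p)
print(f"||sum Z_i - E sum Z_i||_{p} ~ {lhs:.3f}  <=  {rhs:.3f} ~ 2*||sum r_i Z_i||_{p}")
```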
Lemma 5 (Decoupling [dlPG99]). Let $x_1, \ldots, x_n$ be independent and mean zero, and $x_1', \ldots, x_n'$ identically distributed as the $x_i$ and independent of them. Then for any $(a_{i,j})$ and for all $p \ge 1$
\[ \Big\| \sum_{i \ne j} a_{i,j}\, x_i x_j \Big\|_p \le 4 \Big\| \sum_{i,j} a_{i,j}\, x_i x_j' \Big\|_p . \]

Proof. Let $\eta_1, \ldots, \eta_n$ be independent Bernoulli random variables each of expectation $1/2$. Then
\[ \Big\| \sum_{i \ne j} a_{i,j}\, x_i x_j \Big\|_p = 4 \Big\| \mathbb{E}_\eta \sum_{i \ne j} a_{i,j}\, x_i x_j\, \eta_i (1 - \eta_j) \Big\|_p \le 4\, \mathbb{E}_\eta \Big\| \sum_{i \ne j} a_{i,j}\, x_i x_j\, \eta_i (1 - \eta_j) \Big\|_p \quad \text{(Jensen)} \quad (2) \]
Hence there must be some fixed vector $\eta \in \{0,1\}^n$ which achieves
\[ \mathbb{E}_\eta \Big\| \sum_{i \ne j} a_{i,j}\, x_i x_j\, \eta_i (1 - \eta_j) \Big\|_p \le \Big\| \sum_{i \in S} \sum_{j \notin S} a_{i,j}\, x_i x_j \Big\|_p , \qquad \text{where } S = \{i : \eta_i = 1\} . \]
Let $x_S$ denote the $|S|$-dimensional vector corresponding to the $x_i$ for $i \in S$. Then
\[ \Big\| \sum_{i \in S} \sum_{j \notin S} a_{i,j}\, x_i x_j \Big\|_p = \Big\| \sum_{i \in S} \sum_{j \notin S} a_{i,j}\, x_i x_j' \Big\|_p \]
\[ = \Big\| \mathbb{E} \Big[ \sum_{i,j} a_{i,j}\, x_i x_j' \,\Big|\, x_S,\, x'_{\bar S} \Big] \Big\|_p \qquad (\text{using } \mathbb{E}\, x_i = \mathbb{E}\, x_j' = 0) \]
\[ \le \Big\| \sum_{i,j} a_{i,j}\, x_i x_j' \Big\|_p \qquad \text{(Jensen)} \qquad \square \]

The following proof of the Hanson-Wright inequality was shared with me by Sjoerd Dirksen (personal communication).

Theorem 1 (Hanson-Wright inequality [HW71]). For $\sigma_1, \ldots, \sigma_n$ independent Rademachers and $A \in \mathbb{R}^{n \times n}$ real and symmetric, for all $p \ge 1$
\[ \big\| \sigma^T A \sigma - \mathbb{E}\, \sigma^T A \sigma \big\|_p \lesssim \sqrt{p}\, \|A\|_F + p\, \|A\| . \]
Proof. Without loss of generality we assume in this proof that $p \ge 2$ (so that $p/2 \ge 1$). Let $x = (x_1, \ldots, x_n)$ be an independent copy of $\sigma$, and $x'$ an independent copy of $x$. Then
\[ \big\| \sigma^T A \sigma - \mathbb{E}\, \sigma^T A \sigma \big\|_p \lesssim \big\| \sigma^T A x \big\|_p \qquad \text{(Lemma 5)} \qquad (3) \]
\[ \lesssim \sqrt{p}\, \big\| \|Ax\|_2 \big\|_p \qquad \text{(Khintchine)} \qquad (4) \]
\[ = \sqrt{p}\, \big\| \|Ax\|_2^2 \big\|_{p/2}^{1/2} \qquad (5) \]
\[ \le \sqrt{p}\, \big\| \|Ax\|_2^2 \big\|_p^{1/2} \]
\[ \le \sqrt{p}\, \Big( \|A\|_F^2 + \big\| \|Ax\|_2^2 - \mathbb{E}\|Ax\|_2^2 \big\|_p \Big)^{1/2} \qquad \text{(triangle inequality)} \]
\[ \le \sqrt{p}\, \|A\|_F + \sqrt{p}\, \big\| \|Ax\|_2^2 - \mathbb{E}\|Ax\|_2^2 \big\|_p^{1/2} \]
\[ \lesssim \sqrt{p}\, \|A\|_F + \sqrt{p}\, \big\| x^T A^T A x' \big\|_p^{1/2} \qquad \text{(Lemma 5)} \]
\[ \lesssim \sqrt{p}\, \|A\|_F + p^{3/4}\, \big\| \|A^T A x\|_2 \big\|_p^{1/2} \qquad \text{(Khintchine)} \]
\[ \le \sqrt{p}\, \|A\|_F + p^{3/4}\, \|A\|^{1/2}\, \big\| \|Ax\|_2 \big\|_p^{1/2} \qquad (6) \]
Writing $E = \big\| \|Ax\|_2 \big\|_p^{1/2}$ and comparing (4) and (6), we see that for some constant $C > 0$,
\[ E^2 - C p^{1/4} \|A\|^{1/2} E - C \|A\|_F \le 0 . \]
Thus $E$ must be smaller than the larger root of the above quadratic equation, so that $E \lesssim p^{1/4} \|A\|^{1/2} + \|A\|_F^{1/2}$ and hence $E^2 \lesssim \sqrt{p}\, \|A\| + \|A\|_F$. Plugging this back into (4) gives $\| \sigma^T A \sigma - \mathbb{E}\, \sigma^T A \sigma \|_p \lesssim \sqrt{p}\, E^2 \lesssim p \|A\| + \sqrt{p}\, \|A\|_F$, as desired. $\square$

Remark 1. The square root trick in the proof of the Hanson-Wright inequality above is quite handy and can be used to prove several moment inequalities (for example, you will see how to prove the Bernstein inequality with it in tomorrow's lecture). As far as I am aware, the trick was first used in a work of Rudelson [Rud99].

Remark 2. We could have upper bounded Eq. (5) by $\sqrt{p}\, \|A\|_F + \sqrt{p}\, \big\| \|Ax\|_2^2 - \mathbb{E}\|Ax\|_2^2 \big\|_{p/2}^{1/2}$ by the triangle inequality. Notice that we would then have bounded the $p$th central moment of a symmetric quadratic form (3) by the $(p/2)$th central moment, also of a symmetric quadratic form. Writing $p = 2^k$, this observation leads to a proof by induction on $k$, which was the approach used in [DKN10].
2 Johnson-Lindenstrauss (JL) lemma

First we prove the Distributional JL (DJL) lemma.

Lemma 6 (DJL lemma). For any integer $n > 1$ and $\varepsilon, \delta \in (0, 1/2)$, there exists a distribution $\mathcal{D}_{\varepsilon,\delta}$ over $\mathbb{R}^{m \times n}$ for $m \lesssim \varepsilon^{-2} \log(1/\delta)$ such that for any $x \in \mathbb{R}^n$ of unit Euclidean norm,
\[ \mathbb{P}_{\Pi \sim \mathcal{D}_{\varepsilon,\delta}} \big( \big| \|\Pi x\|_2^2 - 1 \big| > \varepsilon \big) < \delta . \]

Proof. Write $\Pi_{i,j} = \sigma_{i,j}/\sqrt{m}$, where the $\sigma_{i,j}$ are independent Rademachers. Also overload $\sigma$ to mean these Rademachers arranged as a vector of length $mn$, by concatenating rows of $\Pi$. Note then $\|\Pi x\|_2^2 = \|A_x \sigma\|_2^2$, where
\[ A_x = \frac{1}{\sqrt{m}} \begin{pmatrix} x^T & 0 & \cdots & 0 \\ 0 & x^T & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & x^T \end{pmatrix} \in \mathbb{R}^{m \times mn} . \qquad (7) \]
Thus, since $\mathbb{E}\|A_x\sigma\|_2^2 = \|x\|_2^2 = 1$,
\[ \mathbb{P}\big( \big| \|\Pi x\|_2^2 - 1 \big| > \varepsilon \big) = \mathbb{P}\big( \big| \|A_x \sigma\|_2^2 - \mathbb{E}\|A_x \sigma\|_2^2 \big| > \varepsilon \big) , \]
where the right-hand side is readily handled by the Hanson-Wright inequality (Theorem 1) with $A = A_x^T A_x$. Now observe that $A$ is a block-diagonal matrix with each block equaling $(1/m) x x^T$, and thus $\|A\| = \|x\|_2^2/m = 1/m$. We also have $\|A\|_F^2 = 1/m$. Thus Hanson-Wright (together with Lemma 3) yields
\[ \mathbb{P}\big( \big| \|\Pi x\|_2^2 - 1 \big| > \varepsilon \big) \lesssim e^{-C\varepsilon^2 m} + e^{-C\varepsilon m} , \]
which for $\varepsilon < 1$ is at most $\delta$ for $m \gtrsim \varepsilon^{-2} \log(1/\delta)$. $\square$

The following is what is usually referred to as the Johnson-Lindenstrauss (JL) lemma [JL84]. In this note we typically refer to it as the Metric JL (MJL) lemma to distinguish it from DJL above. At some points we simply say JL instead of DJL or MJL, but the version meant will be understood from context.
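As an illustration of the distribution constructed in the proof of Lemma 6, the sketch below (added here; not part of the original notes) draws random sign matrices $\Pi$ with entries $\pm 1/\sqrt{m}$ and empirically estimates the failure probability $\mathbb{P}(|\|\Pi x\|_2^2 - 1| > \varepsilon)$ for a fixed unit vector $x$. The constant 4 in the choice of $m$, and all other parameters, are arbitrary illustrative choices.

```python
# Empirical check of the distributional JL lemma with a random sign matrix.
import numpy as np

rng = np.random.default_rng(2)
n, eps, delta, trials = 300, 0.25, 0.01, 500
m = int(np.ceil(4 * np.log(1 / delta) / eps**2))   # m = O(eps^{-2} log(1/delta)); constant 4 is arbitrary

x = rng.standard_normal(n)
x /= np.linalg.norm(x)                             # fixed unit-norm vector

failures = 0
for _ in range(trials):
    Pi = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)
    if abs(np.linalg.norm(Pi @ x) ** 2 - 1.0) > eps:
        failures += 1
print(f"m = {m}, empirical failure probability ~ {failures / trials:.4f} (target delta = {delta})")
```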
Corollary 1 (Metric JL lemma (MJL)). For any $X = \{x_1, \ldots, x_N\} \subset \mathbb{R}^n$ and $0 < \varepsilon < 1/2$, there exists $f : X \to \mathbb{R}^m$ for $m = O(\varepsilon^{-2} \log N)$ such that for all $1 \le i < j \le N$,
\[ (1-\varepsilon)\, \|x_i - x_j\|_2 \le \|f(x_i) - f(x_j)\|_2 \le (1+\varepsilon)\, \|x_i - x_j\|_2 . \qquad (8) \]

Proof. Let $\mathcal{D}_{\varepsilon,\delta}$ be as in DJL with $\delta < 1/\binom{N}{2}$. Consider a random $f$, where $f(x) = \Pi x$ for $\Pi$ drawn from $\mathcal{D}_{\varepsilon,\delta}$. By DJL, each vector of the form $(x_i - x_j)/\|x_i - x_j\|_2$ has its norm preserved up to $1 \pm \varepsilon$ with probability strictly larger than $1 - 1/\binom{N}{2}$. Thus by a union bound over all $i, j$, all such vectors are preserved with positive probability, showing existence of the desired $f$. $\square$

2.1 Example application: k-means clustering

In the k-means clustering problem the input consists of $x_1, \ldots, x_N \in \mathbb{R}^n$ and a positive integer $k$, and the goal is to output some partition $P$ of $[N]$ into $k$ disjoint subsets $P_1, \ldots, P_k$, as well as some $y = (y_1, \ldots, y_k) \in (\mathbb{R}^n)^k$ (the $y_j$ need not be equal to any of the $x_i$ and can be chosen arbitrarily), so as to minimize the cost function
\[ \mathrm{cost}_{P,y}(x_1, \ldots, x_N) = \sum_{j=1}^k \sum_{i \in P_j} \|x_i - y_j\|_2^2 . \]
That is, the $x_i$ are clustered into $k$ clusters according to $P$, and the cost of a given clustering is the sum of squared Euclidean distances of the points to their cluster centers (the $y_j$'s). Unfortunately, finding the optimal clustering for k-means is NP-hard; however, efficient approximation algorithms do exist which find clusterings that are close to optimal.

It is easy to show, e.g. by taking the gradient of the cost function, that for a fixed partition $P$ of $[N]$, the optimal choice of cluster centers $y$ for that given $P$ is the one where, for the $P_j$ of positive size, $y_j = (1/|P_j|) \sum_{i \in P_j} x_i$. Thus we can restrict our attention to just optimizing over $P$. For a set of input points $X$, we let $\mathrm{cost}_P(X)$ denote $\inf_y \mathrm{cost}_{P,y}(X)$.

Lemma 7. Let the input points to k-means be $X = \{x_1, \ldots, x_N\}$. Then for any $0 < \varepsilon < 1/2$, if $f : X \to \mathbb{R}^m$ is such that for all $i, j$
\[ (1-\varepsilon)\, \|x_i - x_j\|_2^2 \le \|f(x_i) - f(x_j)\|_2^2 \le (1+\varepsilon)\, \|x_i - x_j\|_2^2 , \]
then for $\hat{P}$ a $\gamma$-approximate optimal clustering for $f(X)$ and $P^*$ an optimal clustering for $X$, it holds that
\[ \mathrm{cost}_{\hat{P}}(X) \le \gamma \left( \frac{1+\varepsilon}{1-\varepsilon} \right) \mathrm{cost}_{P^*}(X) . \]

Proof. Fix a partition $P$ of $[N]$ and write $P = (P_1, \ldots, P_k)$. Then
\[ \mathrm{cost}_P(X) = \sum_{j \in [k]} \sum_{i \in P_j} \Big\| x_i - \frac{1}{|P_j|} \sum_{i' \in P_j} x_{i'} \Big\|_2^2 \]
\[ = \sum_{j \in [k]} \Big( \sum_{i \in P_j} \|x_i\|_2^2 - \frac{1}{|P_j|} \sum_{i \in P_j} \sum_{i' \in P_j} \langle x_i, x_{i'} \rangle \Big) \]
\[ = \sum_{j \in [k]} \frac{1}{2|P_j|} \sum_{i \in P_j} \sum_{i' \in P_j} \big( \|x_i\|_2^2 + \|x_{i'}\|_2^2 - 2 \langle x_i, x_{i'} \rangle \big) \]
\[ = \sum_{j \in [k]} \frac{1}{2|P_j|} \sum_{i \in P_j} \sum_{i' \in P_j} \|x_i - x_{i'}\|_2^2 . \]
That is, the k-means cost of a partition depends only on the pairwise squared distances within each cluster. Thus if $f$ satisfies the condition of the lemma, then
\[ (1-\varepsilon)\, \mathrm{cost}_P(X) \le \mathrm{cost}_P(f(X)) \le (1+\varepsilon)\, \mathrm{cost}_P(X) \]
for all partitions $P$ simultaneously. Thus we have
\[ (1-\varepsilon)\, \mathrm{cost}_{\hat{P}}(X) \le \mathrm{cost}_{\hat{P}}(f(X)) \le \gamma\, \mathrm{cost}_{P^*}(f(X)) \le \gamma (1+\varepsilon)\, \mathrm{cost}_{P^*}(X) . \]
The lemma follows by comparing the right hand side with the left. $\square$
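To make Lemma 7 concrete, here is a rough experimental sketch (added for illustration; it is not part of the original notes): cluster both the original points and their JL projections, then evaluate both clusterings on the original points. It uses scikit-learn's KMeans purely as a convenient approximate k-means solver; the synthetic data, $k$, and the target dimension $m$ are arbitrary choices.

```python
# Compare the k-means cost of clusterings found in R^n vs. in the projected space R^m,
# both costs evaluated on the original points (as in Lemma 7).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
N, n, k, m = 500, 200, 5, 40
# synthetic data: k Gaussian blobs in R^n
centers = rng.standard_normal((k, n)) * 5
X = np.vstack([centers[j] + rng.standard_normal((N // k, n)) for j in range(k)])

def kmeans_cost(X, labels):
    """Sum of squared distances of points to the centroid of their cluster."""
    return sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
               for j in np.unique(labels))

Pi = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)   # random sign JL matrix
labels_orig = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).labels_
labels_proj = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X @ Pi.T).labels_

print("cost of clustering found in R^n:", kmeans_cost(X, labels_orig))
print("cost of clustering found in R^m:", kmeans_cost(X, labels_proj))
```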
References

[DKN10] Ilias Diakonikolas, Daniel M. Kane, and Jelani Nelson. Bounded independence fools degree-2 threshold functions. In 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 11-20, 2010.

[dlPG99] Victor de la Peña and Evarist Giné. Decoupling: From Dependence to Independence. Probability and its Applications. Springer-Verlag, New York, 1999.

[HW71] David Lee Hanson and Farroll Tim Wright. A bound on tail probabilities for quadratic forms in independent random variables. Ann. Math. Statist., 42:1079-1083, 1971.

[JL84] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189-206, 1984.

[LT91] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces. Springer-Verlag, Berlin, 1991.

[Rud99] Mark Rudelson. Random vectors in the isotropic position. J. Functional Analysis, 164(1):60-72, 1999.