Dimensionality Reduction Notes 2
Jelani Nelson
minilek@seas.harvard.edu

August 11, 2015

1 Optimality theorems for JL

Yesterday we saw for MJL that we could achieve target dimension $m = O(\varepsilon^{-2}\log N)$, and for DJL we could achieve $m = O(\varepsilon^{-2}\log(1/\delta))$. The following theorems tell us that not much improvement is possible for MJL, and for DJL we have the optimal bound.

Theorem 1 ([Alo03]). For any $N > 1$ and $\varepsilon < 1/2$, there exist $N + 1$ points in $\mathbb{R}^N$ such that achieving the MJL guarantee with distortion $1 + \varepsilon$ requires $m \gtrsim \min\{n, \varepsilon^{-2}(\log N)/\log(1/\varepsilon)\}$.

The $\log(1/\varepsilon)$ loss in the lower bound can be removed if the map must be linear.

Theorem 2 ([LN14]). For any $N > 1$ and $\varepsilon < 1/2$, there exist $N^{O(1)}$ points in $\mathbb{R}^N$ such that achieving the MJL guarantee with distortion $1 + \varepsilon$ using a linear map requires $m \gtrsim \min\{n, \varepsilon^{-2}\log N\}$.

For DJL, the upper bound is optimal.

Theorem 3 ([JW13, KMN11]). For any $\varepsilon, \delta < 1/2$, any DJL distribution must have $m \gtrsim \min\{n, \varepsilon^{-2}\log(1/\delta)\}$.
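To make the MJL guarantee concrete, here is a small pure-Python experiment (not from the original notes) with a dense Rademacher map, one standard construction achieving $m = O(\varepsilon^{-2}\log N)$. The constant 8 and the instance sizes are arbitrary illustrative choices, not constants from any theorem:

```python
import math
import random

def jl_map(m, n, rng):
    # Dense Rademacher map: i.i.d. entries +/- 1/sqrt(m).
    return [[rng.choice((-1.0, 1.0)) / math.sqrt(m) for _ in range(n)]
            for _ in range(m)]

def apply_map(Pi, x):
    return [sum(r * t for r, t in zip(row, x)) for row in Pi]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

rng = random.Random(0)
n, N, eps = 200, 30, 0.5
m = int(8 * eps ** -2 * math.log(N))  # m = O(eps^-2 log N); constant 8 chosen loosely
points = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(N)]
Pi = jl_map(m, n, rng)
images = [apply_map(Pi, p) for p in points]

# Worst relative distortion over all pairwise distances.
worst = max(abs(dist(images[i], images[j]) / dist(points[i], points[j]) - 1)
            for i in range(N) for j in range(i + 1, N))
print(m < n, worst < eps)  # with high probability all distances are preserved to 1 +/- eps
```

With these (loose) settings the embedding dimension is well below $n$ and the observed distortion sits comfortably inside $1 \pm \varepsilon$, in line with the upper bounds the theorems above show are nearly tight.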
2 Example application: deterministic $\ell_1$ point query and heavy hitters

Yesterday's notes give an example application of JL to $k$-means clustering. Today we give another application.

In the $\ell_1$ point query problem, a vector $x \in \mathbb{R}^n$ is updated in the turnstile streaming model. A query is an index $i \in [n]$, and the response to the query should be a value $\tilde{x}_i$ such that $|\tilde{x}_i - x_i| \le \varepsilon\|x\|_1$. We show an argument of [NNW14] that the JL lemma implies the existence of a fixed deterministic $\Pi \in \mathbb{R}^{m\times n}$ with $m \lesssim \varepsilon^{-2}\log n$ such that such an $\tilde{x}$ can be recovered from $\Pi x$.

Definition 1. We say that a matrix $\Pi$ with columns $\Pi_1, \ldots, \Pi_n$ is $\varepsilon$-incoherent if (1) $\|\Pi_i\|_2 = 1$ for all $i$, and (2) for all $i \neq j$, $|\langle \Pi_i, \Pi_j\rangle| \le \varepsilon$.

Theorem 4. If $\Pi \in \mathbb{R}^{m\times n}$ is $\varepsilon$-incoherent, then there is a polynomial time recovery algorithm $\mathcal{A}_\Pi$ such that given any $y = \Pi x$, if we define $\tilde{x} = \mathcal{A}_\Pi(y)$ then $\|\tilde{x} - x\|_\infty \le \varepsilon\|x\|_1$.

Proof. The recovery algorithm will be $\mathcal{A}_\Pi(y) = \Pi^T y = \Pi^T \Pi x$. Thus

$$\tilde{x}_i = e_i^T \Pi^T \Pi x = \sum_{j=1}^n \langle \Pi_i, \Pi_j\rangle x_j = x_i + \sum_{j\neq i} \langle \Pi_i, \Pi_j\rangle x_j = x_i \pm \varepsilon\|x\|_1. \qquad \Box$$

Now we show the existence of such $\Pi$ with small $m$.

Lemma 1. For all $\varepsilon \in (0, 1/2)$, there is an $\varepsilon$-incoherent $\Pi$ with $m \lesssim \varepsilon^{-2}\log n$.

Proof. Consider the set of vectors $\{0, e_1, \ldots, e_n\}$. By the JL lemma, there exists $\Pi'$ with $O(\varepsilon^{-2}\log n)$ rows, having columns $\Pi'_i$, such that (1) $\|\Pi'_i\|_2 = \|\Pi' e_i\|_2 = 1 \pm \varepsilon/3$, and (2) $\|\Pi'_i - \Pi'_j\|_2 = \|\Pi' e_i - \Pi' e_j\|_2 = \sqrt{2}(1 \pm \varepsilon/3)$ for all $i \neq j$. Let $\Pi$ be the matrix whose $i$th column is $\Pi'_i/\|\Pi'_i\|_2$. Then $\|\Pi_i\|_2 = 1$ for all $i$, as desired. Furthermore,

$$2(1 \pm \varepsilon/3)^2 = \|\Pi'_i - \Pi'_j\|_2^2 = \|\Pi'_i\|_2^2 + \|\Pi'_j\|_2^2 - 2\langle \Pi'_i, \Pi'_j\rangle.$$

Note $\|\Pi'_i\|_2^2$ and $\|\Pi'_j\|_2^2$ are both $1 \pm O(\varepsilon)$, implying $\langle \Pi_i, \Pi_j\rangle = O(\varepsilon)$. The lemma follows by applying this argument with $\varepsilon$ scaled down by a constant. $\Box$
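The recovery algorithm of Theorem 4 is just the single matrix-vector product $\tilde{x} = \Pi^T y$. The pure-Python sketch below (not from the original notes) illustrates it, using a random-sign matrix as a stand-in for the deterministic $\Pi$ whose existence Lemma 1 guarantees: columns with entries $\pm 1/\sqrt{m}$ have unit $\ell_2$ norm exactly and are pairwise $\varepsilon$-incoherent with high probability at this $m$. The constant 8 and the test vector are arbitrary choices:

```python
import math
import random

rng = random.Random(1)
n, eps = 1000, 0.4
m = int(8 * eps ** -2 * math.log(n))  # m = O(eps^-2 log n); constant 8 chosen loosely

# Random-sign columns scaled by 1/sqrt(m): unit norm, incoherent w.h.p.
Pi = [[rng.choice((-1.0, 1.0)) / math.sqrt(m) for _ in range(n)]
      for _ in range(m)]

# A sparse x with one heavy hitter, as in the point query setting.
x = [0.0] * n
x[7], x[42] = 1.0, -0.5
l1 = sum(abs(t) for t in x)

y = [sum(Pi[r][j] * x[j] for j in range(n)) for r in range(m)]   # sketch y = Pi x
xt = [sum(Pi[r][i] * y[r] for r in range(m)) for i in range(n)]  # recovery: Pi^T y

err = max(abs(xt[i] - x[i]) for i in range(n))   # should be <= eps * ||x||_1
hh = max(range(n), key=lambda i: abs(xt[i]))     # largest recovered coordinate
print(err <= eps * l1, hh)
```

Note that the argmax of $|\tilde{x}_i|$ recovers the index of the heavy hitter, which is how point query yields heavy hitters.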
3 Faster JL

Typically we have some high-dimensional computational geometry problem, and we use JL to speed up our algorithm in two steps: (1) apply a JL map $\Pi$ to reduce the problem to low dimension $m$, then (2) solve the lower-dimensional problem. As $m$ is made smaller, typically (2) becomes faster. However, ideally we would also like step (1) to be as fast as possible. In this section, we investigate two approaches to speed up the computation of $\Pi x$. One of the analyses will make use of the following Bernstein bound.

Theorem 5 (Bernstein's inequality). Let $X_1, \ldots, X_n$ be independent random variables that are each at most $K$ almost surely, and where $\sum_{i=1}^n \mathbb{E}(X_i - \mathbb{E}X_i)^2 = \sigma^2$. Then for all $p \ge 1$

$$\Big\|\sum_{i=1}^n X_i - \mathbb{E}\sum_{i=1}^n X_i\Big\|_p \lesssim \sigma\sqrt{p} + Kp.$$

Proof. We may assume each $\mathbb{E}X_i = 0$ (replace $X_i$ by $X_i - \mathbb{E}X_i$, which changes $K$ by at most a constant factor), so that $\mathbb{E}\sum_i X_i^2 = \sigma^2$. Let $r_1, \ldots, r_n$ be independent Rademachers. Then

$$\Big\|\sum_i X_i\Big\|_p \lesssim \Big\|\sum_i r_i X_i\Big\|_p \qquad\text{(symmetrization)}$$

$$\lesssim \sqrt{p}\cdot\Big\|\Big(\sum_i X_i^2\Big)^{1/2}\Big\|_p \qquad\text{(Khintchine)} \qquad (1)$$

$$= \sqrt{p}\cdot\Big\|\sum_i X_i^2\Big\|_{p/2}^{1/2} \le \sqrt{p}\cdot\Big\|\sum_i X_i^2\Big\|_p^{1/2}$$

$$\le \sigma\sqrt{p} + \sqrt{p}\cdot\Big\|\sum_i X_i^2 - \mathbb{E}\sum_i X_i^2\Big\|_p^{1/2} \qquad\text{(triangle inequality)}$$

$$\lesssim \sigma\sqrt{p} + \sqrt{p}\cdot\Big\|\sum_i r_i X_i^2\Big\|_p^{1/2} \qquad\text{(symmetrization)}$$

$$\lesssim \sigma\sqrt{p} + p^{3/4}\cdot\Big\|\Big(\sum_i X_i^4\Big)^{1/2}\Big\|_p^{1/2} \qquad\text{(Khintchine)}$$

$$\le \sigma\sqrt{p} + p^{3/4}\sqrt{K}\cdot\Big\|\Big(\sum_i X_i^2\Big)^{1/2}\Big\|_p^{1/2} \qquad (2)$$

Defining $E = \|(\sum_i X_i^2)^{1/2}\|_p^{1/2}$ and comparing (1) with (2), for some constant $C > 0$

$$E^2 - Cp^{1/4}\sqrt{K}\,E - C\sigma \le 0.$$

Thus $E$ must be smaller than the larger root of the above quadratic equation, implying $E^2 \lesssim \sigma + \sqrt{p}K$; plugging this into (1) gives the desired bound. $\Box$

3.1 Sparse JL

One natural way to speed up JL is to make $\Pi$ sparse. If $\Pi$ has $s$ non-zero entries per column, then $\Pi x$ can be computed in time $O(s\cdot\|x\|_0)$, where $\|x\|_0 = |\{i : x_i \neq 0\}|$. The goal is then to make $s, m$ as small as possible.

The following matrix $\Pi$ was introduced in [CCF04], and it was analyzed for DJL in [TZ12]. In this construction, one picks a hash function $h : [n] \to [m]$ from a pairwise independent family, and a function $\sigma : [n] \to \{-1, 1\}$ from a 4-wise independent family. Then for each $i \in [n]$, $\Pi_{h(i),i} = \sigma(i)$, and the rest of the $i$th column is 0. It was shown in [TZ12] that this distribution provides DJL for $m \gtrsim 1/(\varepsilon^2\delta)$. Note that $s = 1$ as described here. The analysis is simply via Chebyshev's inequality, after doing an expectation and variance calculation.

The reason for the poor dependence of $m$ on the failure probability $\delta$ is that we use Chebyshev's inequality. This was avoided yesterday by using Hanson-Wright, i.e. a bound on the $p$-norms of quadratic forms. Recall that a bound on $p$-norms gives tail bounds via Markov's inequality, and if one unrolls yesterday's proof fully, one finds that yesterday's lecture obtained failure probability $\delta$ by using the Hanson-Wright $p$-norm bound for $p = \Theta(\log(1/\delta))$. That is to say, the improvement yesterday came from bounding a higher moment than $p = 2$ (i.e. Chebyshev).

To improve the dependence of $m$ on $1/\delta$, we allow ourselves to increase $s$. Here we analyze the Sparse JL Transform (SJLT) [KN14]. This is a JL distribution over $\Pi$ having exactly $s$ non-zero entries per column. As previously, we assume $x \in \mathbb{R}^n$ has $\|x\|_2 = 1$. Our random $\Pi \in \mathbb{R}^{m\times n}$ satisfies $\Pi_{r,i} = \eta_{r,i}\sigma_{r,i}/\sqrt{s}$ for some integer $1 \le s \le m$. The $\sigma_{r,i}$ are independent Rademachers. The $\eta_{r,i}$ are Bernoulli random variables satisfying:

- For all $r, i$: $\mathbb{E}\,\eta_{r,i} = s/m$.
- For any $i$: $\sum_{r=1}^m \eta_{r,i} = s$. That is, each column of $\Pi$ has exactly $s$ non-zero entries.
- The $\eta_{r,i}$ are negatively correlated. That is, for any subset $S$ of $[m]\times[n]$, we have $\mathbb{E}\prod_{(r,i)\in S}\eta_{r,i} \le \prod_{(r,i)\in S}\mathbb{E}\,\eta_{r,i} = (s/m)^{|S|}$.

We would like to show the following, which is the main theorem of [KN14].

Theorem 6. As long as $m \gtrsim \varepsilon^{-2}\log(1/\delta)$ and $s \gtrsim \varepsilon m$,

$$\forall x : \|x\|_2 = 1,\qquad \Pr_\Pi\big(\big|\,\|\Pi x\|_2^2 - 1\,\big| > \varepsilon\big) < \delta. \qquad (3)$$

Proof. Abusing notation and treating $\sigma$ as an $mn$-dimensional vector,

$$Z = \|\Pi x\|_2^2 - 1 = \frac{1}{s}\sum_{r=1}^m\sum_{i\neq j}\eta_{r,i}\eta_{r,j}\sigma_{r,i}\sigma_{r,j}x_i x_j \overset{\text{def}}{=} \sigma^T A_{x,\eta}\,\sigma.$$

Thus by Hanson-Wright (applied conditionally on $\eta$, then taking $p$-norms over $\eta$ as well),

$$\|Z\|_p \lesssim \sqrt{p}\cdot\big\|\,\|A_{x,\eta}\|_F\big\|_p + p\cdot\big\|\,\|A_{x,\eta}\|\,\big\|_p.$$

$A_{x,\eta}$ is a block diagonal matrix with $m$ blocks, where the $r$th block is $(1/s)x^{(r)}(x^{(r)})^T$ but with the diagonal zeroed out. Here $x^{(r)}$ is the vector with $(x^{(r)})_i = \eta_{r,i}x_i$. Now we just need to bound $\|\,\|A_{x,\eta}\|_F\|_p$ and $\|\,\|A_{x,\eta}\|\,\|_p$.

Since $A_{x,\eta}$ is block-diagonal, its operator norm is the largest operator norm of any block. The operator norm of the $r$th block is at most $(1/s)\max\{\|x^{(r)}\|_2^2,\ \|x^{(r)}\|_\infty^2\} \le 1/s$, and thus $\|A_{x,\eta}\| \le 1/s$ with probability 1.

Next, define $Q_{i,j} = \sum_{r=1}^m \eta_{r,i}\eta_{r,j}$, so that

$$\|A_{x,\eta}\|_F^2 = \frac{1}{s^2}\sum_{i\neq j}x_i^2 x_j^2 Q_{i,j}.$$

We will show for $p \lesssim s^2/m$ that for all $i\neq j$, $\|Q_{i,j}\|_p \lesssim s^2/m$, where we take the $p$-norm over $\eta$. Therefore for this $p$,

$$\big\|\,\|A_{x,\eta}\|_F\big\|_p = \big\|\,\|A_{x,\eta}\|_F^2\big\|_{p/2}^{1/2} \le \frac{1}{s}\Big(\sum_{i\neq j}x_i^2x_j^2\|Q_{i,j}\|_p\Big)^{1/2} \qquad\text{(triangle inequality)}$$

$$\lesssim \frac{1}{s}\cdot\Big(\frac{s^2}{m}\Big)^{1/2} = \frac{1}{\sqrt{m}},$$

using $\sum_{i\neq j}x_i^2x_j^2 \le \|x\|_2^4 = 1$. Then by Markov's inequality and the settings of $p, s, m$ (take $p = \Theta(\log(1/\delta))$, and note $p \lesssim \varepsilon^2 m \le s^2/m$ so that the bound on $\|Q_{i,j}\|_p$ applies),

$$\Pr\big(\big|\|\Pi x\|_2^2 - 1\big| > \varepsilon\big) = \Pr\big(|\sigma^T A_{x,\eta}\sigma| > \varepsilon\big) < \varepsilon^{-p}\cdot C^p\Big(\Big(\frac{p}{m}\Big)^{p/2} + \Big(\frac{p}{s}\Big)^p\Big) < \delta.$$

We now show $\|Q_{i,j}\|_p \lesssim s^2/m$, for which we use Bernstein's inequality. Suppose $\eta_{a_1,i}, \ldots, \eta_{a_s,i}$ are all 1, where $a_1 < a_2 < \ldots < a_s$. Now, note $Q_{i,j}$ can be written as $\sum_{t=1}^s Y_t$, where $Y_t$ is an indicator random variable for the event that $\eta_{a_t,j} = 1$. The $Y_t$ are not independent, but for any integer $p \ge 1$ their $p$th moment is upper bounded by the case that the $Y_t$ are independent Bernoulli, each of expectation $s/m$ (this can be seen by simply expanding $(\sum_t Y_t)^p$ and then comparing with the independent Bernoulli case monomial by monomial in the expansion). Thus Bernstein applies with $\sigma^2 \le s^2/m$ and $K = 1$, and as desired we have

$$\|Q_{i,j}\|_p \le \mathbb{E}\,Q_{i,j} + \Big\|\sum_t Y_t - \mathbb{E}\sum_t Y_t\Big\|_p \lesssim \frac{s^2}{m} + \sqrt{\frac{s^2}{m}\,p} + p \lesssim \frac{s^2}{m}$$

for $p \lesssim s^2/m$. $\Box$

There are two natural distributions where $\eta$ satisfies the conditions for the SJLT. In the first, the columns are independent, and for each column $i$, $(\eta_{1,i},\ldots,\eta_{m,i})$ is chosen uniformly at random from the $\binom{m}{s}$ vectors in $\{0,1\}^m$ having weight exactly $s$. A second distribution is the CountSketch of [CCF04]. In this distribution, we assume $s$ divides $m$, and the rows are partitioned arbitrarily into $s$ blocks, each of equal size $m/s$ (e.g. the first $m/s$ rows, then the next $m/s$ rows, etc.). For each column $i$ and each block $b \in \{0,\ldots,s-1\}$, with corresponding $\eta(b,i) = (\eta_{bm/s+1,i},\ldots,\eta_{(b+1)m/s,i})$, we set $\eta(b,i) = e_j \in \mathbb{R}^{m/s}$ for a uniformly random $j \in [m/s]$. This is done independently across all $(b,i)$ pairs.

3.2 FFT-based approach

Another approach for obtaining fast JL was investigated by Ailon and Chazelle [AC09]. This approach gives a running time to compute $\Pi x$ of roughly $O(n\log n)$, which is faster than the sparse JL approach when $x$ is sufficiently dense. Although we did not cover this approach in lecture today, I am including a description here. They called their transformation the Fast
Johnson-Lindenstrauss Transform (FJLT). A construction similar to theirs, which we will analyze here, is the $m\times n$ matrix $\Pi$ defined as

$$\Pi = \frac{1}{\sqrt{m}}\,SHD \qquad (4)$$

where $S$ is an $m\times n$ sampling matrix with replacement (each row has a 1 in a uniformly random location and zeroes elsewhere, and the rows are independent), $H$ is an unnormalized bounded orthonormal system, and $D = \mathrm{diag}(\alpha)$ for a vector $\alpha$ of $n$ independent Rademachers. An unnormalized bounded orthonormal system is a matrix $H \in \mathbb{R}^{n\times n}$ such that $H^T H = nI$ and $\max_{i,j}|H_{i,j}| \le 1$. For example, $H$ can be the unnormalized Fourier matrix or Hadamard matrix. The original FJLT replaced $S$ with a random sparse matrix $P$, which has certain advantages; see Remark 1.

The motivation for the construction (4) is speed: $D$ can be applied in $O(n)$ time, $H$ in $O(n\log n)$ time (e.g. using the Fast Fourier Transform), and $S$ in $O(m)$ time. Thus, overall, applying $\Pi$ to any fixed vector $x$ takes $O(n\log n)$ time. Compare this with using a dense matrix of Rademachers, which takes $O(mn)$ time to apply.

We will show that for $m \gtrsim \varepsilon^{-2}\log(1/\delta)\log(1/(\varepsilon\delta))$, the random $\Pi$ described in (4) provides DJL. In fact we will analyze a slightly different construction in which $S$ is replaced by an $n\times n$ diagonal matrix $S_\eta = \mathrm{diag}(\eta)$, where the entries of $\eta \in \{0,1\}^n$ are independent with $\mathbb{E}\,\eta_i = m/n$ (so $\Pi$ has $m$ rows in expectation). The proof to analyze the $\Pi$ from (4) is essentially identical. The proof we provide here is an adaptation of the proof of a more general theorem [CNW15, Theorem 9] to the current scenario.

Theorem 7. Let $x \in \mathbb{R}^n$ be an arbitrary unit norm vector, and suppose $0 < \varepsilon, \delta < 1/2$. Also let $\Pi = \frac{1}{\sqrt{m}}S_\eta HD$ as described above, with a number of rows equal to $m \gtrsim \varepsilon^{-2}\log(1/\delta)\log(1/(\varepsilon\delta))$. Then

$$\Pr_\Pi\big(\big|\,\|\Pi x\|_2^2 - 1\,\big| > \varepsilon\big) < \delta.$$

Proof. We use the moment method. Let $\eta'$ be an independent copy of $\eta$, and let $\sigma \in \{-1,1\}^n$ be uniformly random. Write $z = HDx$, so that $\|\Pi x\|_2^2 = \frac{1}{m}\sum_i \eta_i z_i^2$. Note that $\|z\|_2^2 = n$ for any fixed $\alpha$, so $\mathbb{E}_{\eta'}\sum_i \eta'_i z_i^2 = \frac{m}{n}\|z\|_2^2 = m$. Then

$$\Big\|\frac{1}{m}\sum_{i=1}^n \eta_i z_i^2 - 1\Big\|_p = \frac{1}{m}\Big\|\sum_i \eta_i z_i^2 - \mathbb{E}_{\eta'}\sum_i \eta'_i z_i^2\Big\|_{L^p(\eta),L^p(\alpha)} \qquad (5)$$

$$\le \frac{2}{m}\Big\|\sum_i \sigma_i\eta_i z_i^2\Big\|_{L^p(\eta),L^p(\alpha),L^p(\sigma)} \qquad\text{(symmetrization)}$$

$$\lesssim \frac{\sqrt{p}}{m}\Big\|\Big(\sum_i \eta_i z_i^4\Big)^{1/2}\Big\|_p \qquad\text{(Khintchine)}$$

$$\le \frac{\sqrt{p}}{m}\Big\|\Big(\max_i \eta_i z_i^2\Big)^{1/2}\Big(\sum_i \eta_i z_i^2\Big)^{1/2}\Big\|_p$$

$$\le \frac{\sqrt{p}}{m}\Big\|\max_i \eta_i z_i^2\Big\|_p^{1/2}\cdot\Big\|\sum_i \eta_i z_i^2\Big\|_p^{1/2} \qquad\text{(Cauchy-Schwarz)}$$

$$\le \frac{\sqrt{p}}{\sqrt{m}}\Big\|\max_i \eta_i z_i^2\Big\|_p^{1/2}\cdot\Big(\Big\|\frac{1}{m}\sum_i \eta_i z_i^2 - 1\Big\|_p^{1/2} + 1\Big) \qquad\text{(triangle inequality)}$$

We will now bound $\|\max_i \eta_i z_i^2\|_p$. Define $q = \max\{p, \log m\}$ and note $p \le q$. Then

$$\Big\|\max_i \eta_i z_i^2\Big\|_p \le \Big\|\max_i \eta_i z_i^2\Big\|_q = \Big(\mathbb{E}_{\alpha,\eta}\max_i \eta_i |z_i|^{2q}\Big)^{1/q} \le \Big(\mathbb{E}_{\alpha,\eta}\sum_i \eta_i |z_i|^{2q}\Big)^{1/q}$$

$$\le \Big(n\cdot\max_i(\mathbb{E}\,\eta_i)(\mathbb{E}_\alpha |z_i|^{2q})\Big)^{1/q} \qquad\text{($\alpha, \eta$ independent)} \qquad (6)$$

$$= \Big(m\cdot\max_i \mathbb{E}_\alpha |z_i|^{2q}\Big)^{1/q} \le 2\max_i \|z_i\|_{2q}^2 \qquad\text{($m^{1/q} \le 2$ by choice of $q$)}$$

$$\lesssim q \qquad\text{(Khintchine)} \qquad (7)$$
Eq. (7) uses that $H$ is an unnormalized bounded orthonormal system. Defining $E = \|\frac{1}{m}\sum_i \eta_i z_i^2 - 1\|_p^{1/2}$ and combining (5), (6), (7), we find that for some constant $C > 0$

$$E^2 - C\sqrt{\frac{pq}{m}}\,E - C\sqrt{\frac{pq}{m}} \le 0,$$

implying $E^2 \lesssim \max\{\sqrt{pq/m},\ pq/m\}$. By the Markov inequality,

$$\Pr\big(\big|\,\|\Pi x\|_2^2 - 1\,\big| > \varepsilon\big) \le \varepsilon^{-p}\cdot E^{2p},$$

and thus to achieve the theorem statement it suffices to set $p = \log(1/\delta)$ then choose $m \gtrsim \varepsilon^{-2}\log(1/\delta)\log(m/\delta)$ (for such $m$ we have $\log m \lesssim \log(1/(\varepsilon\delta))$, giving the bound in the theorem statement). $\Box$

Remark 1. Note that the FJLT as analyzed above provides suboptimal $m$. If one desires optimal $m$, one can instead use the embedding matrix $\Pi'\Pi$, where $\Pi$ is the FJLT and $\Pi'$ is, say, a dense matrix with Rademacher entries having the optimal $m' = O(\varepsilon^{-2}\log(1/\delta))$ rows. The downside is that the runtime to apply our embedding worsens by an additive $m'\cdot m$. [AC09] slightly improved this additive term (by an $\varepsilon^2$ multiplicative factor) by replacing the matrix $S$ with a random sparse matrix $P$.

Remark 2. The usual analysis for the FJLT, such as the approach in [AC09], would achieve a bound on $m$ of $O(\varepsilon^{-2}\log(1/\delta)\log(n/\delta))$. Such analyses operate by, using the notation of the proof of Theorem 7, first conditioning on $\|z\|_\infty \lesssim \sqrt{\log(n/\delta)}$ (which happens with probability at least $1 - \delta/2$ by the Khintchine inequality), then finishing the proof using Bernstein's inequality. In our proof above, we improved this dependence on $n$ to a dependence on the smaller quantity $m$ by avoiding any such conditioning.

References

[AC09] Nir Ailon and Bernard Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., 39(1):302-322, 2009.

[Alo03] Noga Alon. Problems and results in extremal combinatorics I. Discrete Mathematics, 273(1-3):31-53, 2003.

[CCF04] Moses Charikar, Kevin C. Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theor. Comput. Sci., 312(1):3-15, 2004.

[CNW15] Michael B. Cohen, Jelani Nelson, and David P. Woodruff. Optimal approximate matrix product in terms of stable rank. CoRR, abs/1507.02268, 2015.

[JW13] T. S. Jayram and David P. Woodruff. Optimal bounds for Johnson-Lindenstrauss transforms and streaming problems with subconstant error. ACM Transactions on Algorithms, 9(3):26, 2013.

[KMN11] Daniel M. Kane, Raghu Meka, and Jelani Nelson. Almost optimal explicit Johnson-Lindenstrauss families. In RANDOM, 2011.

[KN14] Daniel M. Kane and Jelani Nelson. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM, 61(1):4, 2014.

[LN14] Kasper Green Larsen and Jelani Nelson. The Johnson-Lindenstrauss lemma is optimal for linear dimensionality reduction. CoRR, abs/1411.2404, 2014.

[NNW14] Jelani Nelson, Huy L. Nguyễn, and David P. Woodruff. On deterministic sketching and streaming for sparse recovery and norm estimation. Linear Algebra and its Applications, Special Issue on Sparse Approximate Solution of Linear Systems, 441:152-167, 2014.

[TZ12] Mikkel Thorup and Yin Zhang. Tabulation-based 5-independent hashing with applications to linear probing and second moment estimation. SIAM J. Comput., 41(2):293-331, 2012.
More informationReport on Image warping
Report on Image warpng Xuan Ne, Dec. 20, 2004 Ths document summarzed the algorthms of our mage warpng soluton for further study, and there s a detaled descrpton about the mplementaton of these algorthms.
More informationDifference Equations
Dfference Equatons c Jan Vrbk 1 Bascs Suppose a sequence of numbers, say a 0,a 1,a,a 3,... s defned by a certan general relatonshp between, say, three consecutve values of the sequence, e.g. a + +3a +1
More informationThe Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction
ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also
More informationMATH 241B FUNCTIONAL ANALYSIS - NOTES EXAMPLES OF C ALGEBRAS
MATH 241B FUNCTIONAL ANALYSIS - NOTES EXAMPLES OF C ALGEBRAS These are nformal notes whch cover some of the materal whch s not n the course book. The man purpose s to gve a number of nontrval examples
More informationPoisson brackets and canonical transformations
rof O B Wrght Mechancs Notes osson brackets and canoncal transformatons osson Brackets Consder an arbtrary functon f f ( qp t) df f f f q p q p t But q p p where ( qp ) pq q df f f f p q q p t In order
More informationCOS 511: Theoretical Machine Learning
COS 5: Theoretcal Machne Learnng Lecturer: Rob Schapre Lecture #0 Scrbe: José Sões Ferrera March 06, 203 In the last lecture the concept of Radeacher coplexty was ntroduced, wth the goal of showng that
More informationLecture Notes on Linear Regression
Lecture Notes on Lnear Regresson Feng L fl@sdueducn Shandong Unversty, Chna Lnear Regresson Problem In regresson problem, we am at predct a contnuous target value gven an nput feature vector We assume
More informationExample: (13320, 22140) =? Solution #1: The divisors of are 1, 2, 3, 4, 5, 6, 9, 10, 12, 15, 18, 20, 27, 30, 36, 41,
The greatest common dvsor of two ntegers a and b (not both zero) s the largest nteger whch s a common factor of both a and b. We denote ths number by gcd(a, b), or smply (a, b) when there s no confuson
More informationNorms, Condition Numbers, Eigenvalues and Eigenvectors
Norms, Condton Numbers, Egenvalues and Egenvectors 1 Norms A norm s a measure of the sze of a matrx or a vector For vectors the common norms are: N a 2 = ( x 2 1/2 the Eucldean Norm (1a b 1 = =1 N x (1b
More informationFirst day August 1, Problems and Solutions
FOURTH INTERNATIONAL COMPETITION FOR UNIVERSITY STUDENTS IN MATHEMATICS July 30 August 4, 997, Plovdv, BULGARIA Frst day August, 997 Problems and Solutons Problem. Let {ε n } n= be a sequence of postve
More informationHashing. Alexandra Stefan
Hashng Alexandra Stefan 1 Hash tables Tables Drect access table (or key-ndex table): key => ndex Hash table: key => hash value => ndex Man components Hash functon Collson resoluton Dfferent keys mapped
More information1 The Mistake Bound Model
5-850: Advanced Algorthms CMU, Sprng 07 Lecture #: Onlne Learnng and Multplcatve Weghts February 7, 07 Lecturer: Anupam Gupta Scrbe: Bryan Lee,Albert Gu, Eugene Cho he Mstake Bound Model Suppose there
More informationComplete subgraphs in multipartite graphs
Complete subgraphs n multpartte graphs FLORIAN PFENDER Unverstät Rostock, Insttut für Mathematk D-18057 Rostock, Germany Floran.Pfender@un-rostock.de Abstract Turán s Theorem states that every graph G
More information11 Tail Inequalities Markov s Inequality. Lecture 11: Tail Inequalities [Fa 13]
Algorthms Lecture 11: Tal Inequaltes [Fa 13] If you hold a cat by the tal you learn thngs you cannot learn any other way. Mark Twan 11 Tal Inequaltes The smple recursve structure of skp lsts made t relatvely
More informationGeneralized Linear Methods
Generalzed Lnear Methods 1 Introducton In the Ensemble Methods the general dea s that usng a combnaton of several weak learner one could make a better learner. More formally, assume that we have a set
More informationDr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur
Analyss of Varance and Desgn of Exerments-I MODULE III LECTURE - 2 EXPERIMENTAL DESIGN MODELS Dr. Shalabh Deartment of Mathematcs and Statstcs Indan Insttute of Technology Kanur 2 We consder the models
More informationOutline and Reading. Dynamic Programming. Dynamic Programming revealed. Computing Fibonacci. The General Dynamic Programming Technique
Outlne and Readng Dynamc Programmng The General Technque ( 5.3.2) -1 Knapsac Problem ( 5.3.3) Matrx Chan-Product ( 5.3.1) Dynamc Programmng verson 1.4 1 Dynamc Programmng verson 1.4 2 Dynamc Programmng
More informationModule 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur
Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications
MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.65/15.070J Fall 013 Lecture 1 10/1/013 Martngale Concentraton Inequaltes and Applcatons Content. 1. Exponental concentraton for martngales wth bounded ncrements.
More informationExercises of Chapter 2
Exercses of Chapter Chuang-Cheh Ln Department of Computer Scence and Informaton Engneerng, Natonal Chung Cheng Unversty, Mng-Hsung, Chay 61, Tawan. Exercse.6. Suppose that we ndependently roll two standard
More informationDeriving the X-Z Identity from Auxiliary Space Method
Dervng the X-Z Identty from Auxlary Space Method Long Chen Department of Mathematcs, Unversty of Calforna at Irvne, Irvne, CA 92697 chenlong@math.uc.edu 1 Iteratve Methods In ths paper we dscuss teratve
More informationa b a In case b 0, a being divisible by b is the same as to say that
Secton 6.2 Dvsblty among the ntegers An nteger a ε s dvsble by b ε f there s an nteger c ε such that a = bc. Note that s dvsble by any nteger b, snce = b. On the other hand, a s dvsble by only f a = :
More informationFACTORIZATION IN KRULL MONOIDS WITH INFINITE CLASS GROUP
C O L L O Q U I U M M A T H E M A T I C U M VOL. 80 1999 NO. 1 FACTORIZATION IN KRULL MONOIDS WITH INFINITE CLASS GROUP BY FLORIAN K A I N R A T H (GRAZ) Abstract. Let H be a Krull monod wth nfnte class
More informationCOMPLEX NUMBERS AND QUADRATIC EQUATIONS
COMPLEX NUMBERS AND QUADRATIC EQUATIONS INTRODUCTION We know that x 0 for all x R e the square of a real number (whether postve, negatve or ero) s non-negatve Hence the equatons x, x, x + 7 0 etc are not
More informationNegative Binomial Regression
STATGRAPHICS Rev. 9/16/2013 Negatve Bnomal Regresson Summary... 1 Data Input... 3 Statstcal Model... 3 Analyss Summary... 4 Analyss Optons... 7 Plot of Ftted Model... 8 Observed Versus Predcted... 10 Predctons...
More informationISSN: ISO 9001:2008 Certified International Journal of Engineering and Innovative Technology (IJEIT) Volume 3, Issue 1, July 2013
ISSN: 2277-375 Constructon of Trend Free Run Orders for Orthogonal rrays Usng Codes bstract: Sometmes when the expermental runs are carred out n a tme order sequence, the response can depend on the run
More information