Spectral Algorithms and Representations                              Feb. 17, Mar. 3 and 8, 2005

Lecture 4: Constant Time SVD Approximation

Lecturer: Santosh Vempala                                            Scribe: Jiangzhuo Chen

This topic consists of three lectures (02/17, 03/03, 03/08), based on [KV04]. We are interested in the following problem.

Problem: Given $A \in \mathbb{R}^{m \times n}$, find $D \in \mathbb{R}^{m \times n}$ with $\mathrm{rank}(D) \le k$ that approximates $A$. Formally,

$$\min_{D:\,\mathrm{rank}(D) \le k} \|A - D\|_F \qquad (1)$$

Notation. Let $\lambda_t, u^{(t)}, v^{(t)}$ denote the $t$-th singular value, left singular vector, and right singular vector of $A$, respectively. Let $A_i$ denote the $i$-th column of $A$, and $A_{(i)}$ the $i$-th row of $A$. Let

$$A_k = \sum_{t=1}^{k} \lambda_t u^{(t)} v^{(t)T} = \sum_{t=1}^{k} A v^{(t)} v^{(t)T}.$$

By a theorem of Eckart and Young, $A_k$ is the optimal solution to (1), and $\|A - A_k\|_F^2 = \sum_{t=k+1}^{r} \lambda_t^2$, where $r$ is the rank of $A$. So one way to solve (1) is to find the top $k$ right singular vectors of $A$, $\{v^{(t)}\}_{t=1}^{k}$.

1  Computing the SVD

Given an $m \times n$ matrix, it takes $\Theta(mn)$ time just to read the input. We want to find its top $k$ right singular vectors. Notice that we may only get approximations, since the singular values can be irrational. So for the top right singular vector, we want to find $\tilde{v}$ such that $\|A\tilde{v}\| \ge (1-\varepsilon)\lambda_1$ and $\|\tilde{v} - v^{(1)}\| \le \varepsilon$, for some given accuracy parameter $\varepsilon$.

The following power method finds the top right singular vector.

Power method:
1. Let $v_0$ be a random unit vector in $\mathbb{R}^n$.
2. Repeat: $v_{t+1} = \dfrac{(A^T A)v_t}{\|(A^T A)v_t\|}$.

Remark 1.1. There are several questions concerning the power method.

- How do we generate a random unit vector $v_0 \in \mathbb{R}^n$? All we need is a uniform distribution on the surface of the unit sphere. We can use any spherical distribution (e.g., a standard Gaussian) to generate a random vector and scale it to unit length.
- Does the iteration converge to the top right singular vector $v^{(1)}$? How fast? If $v_t = v^{(1)}$, then
$$(A^T A)v_t = A^T(\lambda_1 u^{(1)}) = \lambda_1 (A^T u^{(1)}) = \lambda_1^2 v^{(1)}$$
and
$$v_{t+1} = \frac{(A^T A)v_t}{\|(A^T A)v_t\|} = v^{(1)} = v_t,$$
so $v^{(1)}$ is a fixed point of the iteration. It can be shown that $v_t \approx \tilde{v}$ after $O(\frac{1}{\varepsilon}\log n)$ iterations. Since each round takes $O(mn)$ time, the time complexity of the power method is $O(\frac{mn}{\varepsilon}\log n)$.

Exercise: Prove bounds on the convergence of the power method. (Hint: use the SVD. See Lemma 1 and Theorem 1 in [CKVW] for details.)

- How do we find the top $k$ right singular vectors? Find the top one, compute $A - Avv^T$, and repeat. This takes $O(\frac{kmn}{\varepsilon}\log n)$ time.

- Can we get a smaller time complexity when the matrix is sparse? Suppose $A$ has only $M$ nonzero entries. Each iteration of the power method takes $O(M)$ time, so finding $v^{(1)}$ takes $O(\frac{M}{\varepsilon}\log n)$ time. For a sparse matrix, can we achieve $O(\frac{kM}{\varepsilon}\log n)$ time complexity for $k$ singular vectors? In the search for the top singular vector, each iteration takes only $O(M)$ time; after computing $A - Avv^T$, however, the matrix is no longer sparse. Fortunately, we can extend the power method as follows.

1. Randomly choose an orthonormal matrix $V_0 = (v_0^1, \ldots, v_0^k) \in \mathbb{R}^{n \times k}$.
2. Repeat:
$$V'_{t+1} = (A^T A)V_t,$$
$$V_{t+1} = V'_{t+1}\,\mathrm{diag}\!\left(\|(V'_{t+1})_1\|^{-1}, \ldots, \|(V'_{t+1})_k\|^{-1}\right) \quad \text{(normalize each column)}.$$

Note that if $V_t = V = (v^{(1)}, \ldots, v^{(k)})$, then
$$(A^T A)V_t = (A^T A v^{(1)}, \ldots, A^T A v^{(k)}) = (A^T \lambda_1 u^{(1)}, \ldots, A^T \lambda_k u^{(k)}) = (\lambda_1^2 v^{(1)}, \ldots, \lambda_k^2 v^{(k)})$$
and
$$V_{t+1} = (\lambda_1^2 v^{(1)}, \ldots, \lambda_k^2 v^{(k)})\,\mathrm{diag}(\lambda_1^{-2}, \ldots, \lambda_k^{-2}) = (v^{(1)}, \ldots, v^{(k)}) = V_t.$$

2  Approximate SVD

Can we find a faster approximate solution to (1)? That is, we want to find $D$ with $\mathrm{rank}(D) \le k$ such that $\|A - D\|_F$ is small.
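The two iterations above can be sketched numerically as follows (a minimal sketch with our own function names and NumPy; in the block version we re-orthonormalize with a QR factorization each round, a standard robust variant, since rescaling each column independently would let all columns drift toward $v^{(1)}$):

```python
import numpy as np

def power_method(A, eps=1e-8, max_iter=1000, seed=0):
    """Top right singular vector via v <- (A^T A)v / ||(A^T A)v||."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])   # spherical (Gaussian) sample ...
    v /= np.linalg.norm(v)                # ... scaled to the unit sphere
    for _ in range(max_iter):
        w = A.T @ (A @ v)                 # one O(mn) (or O(M) sparse) round
        w /= np.linalg.norm(w)
        if min(np.linalg.norm(w - v), np.linalg.norm(w + v)) < eps:
            return w
        v = w
    return v

def block_power_method(A, k, n_iter=200, seed=0):
    """Top-k right singular vectors simultaneously (orthonormal iteration)."""
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.standard_normal((A.shape[1], k)))
    for _ in range(n_iter):
        V, _ = np.linalg.qr(A.T @ (A @ V))  # re-orthonormalize each round
    return V
```

For instance, on $A = \mathrm{diag}(3, 2, 1)$ both routines recover the leading coordinate directions up to sign.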
2.1  Existence of a Good Constant-Dimensional Subspace

It is shown in [KV04] that we only need to look at $O(\frac{k}{\varepsilon})$ rows of $A$.

Theorem 1. ([KV04] Theorem 2) Given $A \in \mathbb{R}^{m\times n}$, an integer $k$, and $\varepsilon > 0$, there exists a subset $S$ of $\frac{k}{\varepsilon}$ rows of $A$ such that in its span lies a matrix $\tilde{A}$ (i.e., every row of $\tilde{A}$ is in $\mathrm{span}\{S\}$) with the following property:

$$\|A - \tilde{A}\|_F^2 \le \min_{D:\,\mathrm{rank}(D)\le k} \|A - D\|_F^2 + \varepsilon \|A\|_F^2. \qquad (2)$$

Proof: Pick $\frac{k}{\varepsilon}$ rows from the following distribution (with multiplicity):
$$P_i = \Pr\{\text{row } i \text{ is picked}\} = \frac{\|A_{(i)}\|^2}{\|A\|_F^2}, \quad i = 1, \ldots, m.$$
We need to show there is nonzero probability that this subset has the desired property. (It would be a bad idea to just pick the $k$ rows with largest $\|A_{(i)}\|$; see Figure 1 for such an example.)

[Figure 1: rows $A_{(3)}, \ldots, A_{(m)}$ drawn along one direction through the origin; $A_{(1)}$ and $A_{(2)}$ along other directions.]
Figure 1: Suppose $\|A_{(1)}\| = \|A_{(2)}\| < \|A_{(3)}\| = \cdots = \|A_{(m)}\|$, with the rows $A_{(3)}, \ldots, A_{(m)}$ all pointing in the same direction. The optimal subspace is $\mathrm{span}\{A_{(1)}, A_{(2)}, \ldots, A_{(k)}\}$, while the subspace spanned by the top $k$ rows, $\mathrm{span}\{A_{(3)}, \ldots, A_{(k+2)}\}$, is a bad subspace.

Let $S$ be the chosen subset. We will identify vectors $\hat{y}^{(1)}, \ldots, \hat{y}^{(k)}$ in $\mathrm{span}\{S\}$ such that $\hat{y}^{(t)}$ is close to $v^{(t)}$, $t = 1, \ldots, k$. Notice that $\lambda_t v^{(t)}$ is a linear combination of the rows of $A$:
$$\lambda_t v^{(t)T} = u^{(t)T} A = \sum_{i=1}^{m} u_i^{(t)} A_{(i)}.$$
How do we approximate this linear combination? The random vector $w^{(t)}$ defined as follows has mean $\lambda_t v^{(t)T}$ with bounded variance:
$$w^{(t)} = \frac{1}{|S|} \sum_{i \in S} \frac{u_i^{(t)}}{P_i} A_{(i)}.$$
Let $s = |S|$. Write $w^{(t)} = \frac{1}{s}\sum_{j=1}^{s} X_j$, where the $X_j$ are i.i.d. with $X = \frac{u_i^{(t)}}{P_i} A_{(i)}$ with probability $P_i$.
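As a quick numerical sanity check of this estimator (the sizes, sample counts, and variable names below are arbitrary choices of ours, not from [KV04]), one can average many independent copies of $w^{(t)}$ for $t = 1$ and watch the empirical mean approach $\lambda_1 v^{(1)T}$, matching the mean computation that follows:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s = 200, 50, 10                      # s plays the role of |S| = k/eps
A = rng.standard_normal((m, n))

row_norms2 = (A ** 2).sum(axis=1)
P = row_norms2 / row_norms2.sum()          # P_i = ||A_(i)||^2 / ||A||_F^2

U, lam, Vt = np.linalg.svd(A, full_matrices=False)
u1, lam1, v1 = U[:, 0], lam[0], Vt[0]      # top singular triple (t = 1)

def sample_w():
    """One copy of w = (1/s) * sum over s sampled rows i of (u_i / P_i) A_(i)."""
    idx = rng.choice(m, size=s, p=P)
    return (u1[idx] / P[idx]) @ A[idx] / s

# E[w] = lam1 * v1^T, so the average of many copies should be close to it.
w_bar = np.mean([sample_w() for _ in range(10000)], axis=0)
rel_err = np.linalg.norm(w_bar - lam1 * v1) / lam1
```

With these sizes the relative error of the averaged estimate comes out small, consistent with the estimator being unbiased with variance at most $\|A\|_F^2/s$ per copy.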
The mean of $w^{(t)}$ is
$$\mathbb{E}\big[w^{(t)}\big] = \mathbb{E}[X_j] = \sum_{i=1}^{m} P_i \frac{u_i^{(t)}}{P_i} A_{(i)} = \sum_{i=1}^{m} u_i^{(t)} A_{(i)} = \lambda_t v^{(t)T}.$$

The variance of $w^{(t)}$ is
$$\mathbb{E}\left[\big\|w^{(t)} - \mathbb{E}[w^{(t)}]\big\|^2\right] = \frac{1}{s}\,\mathbb{E}\left[\big(X - \lambda_t v^{(t)T}\big)\big(X^T - \lambda_t v^{(t)}\big)\right] = \frac{1}{s}\left(\mathbb{E}\big[XX^T\big] - \lambda_t^2\right)$$
$$= \frac{1}{s}\left(\sum_{i=1}^{m} P_i \frac{(u_i^{(t)})^2 \|A_{(i)}\|^2}{P_i^2} - \lambda_t^2\right) = \frac{1}{s}\left(\sum_{i=1}^{m} \frac{(u_i^{(t)})^2 \|A_{(i)}\|^2}{P_i} - \lambda_t^2\right)$$
$$= \frac{1}{s}\left(\|A\|_F^2 \sum_{i=1}^{m} (u_i^{(t)})^2 - \lambda_t^2\right) = \frac{1}{s}\left(\|A\|_F^2 - \lambda_t^2\right) \le \frac{\|A\|_F^2}{s}.$$

Let $\hat{y}^{(t)} = w^{(t)T}/\lambda_t$ and $V_1 = \mathrm{span}\{\hat{y}^{(1)}, \ldots, \hat{y}^{(k)}\}$. We show that $\mathrm{Proj}_{V_1} A$ approximates $A$ by giving an upper bound on $\mathbb{E}\big[\|A - \mathrm{Proj}_{V_1} A\|_F^2\big]$. Let $\hat{A} = \sum_{t=1}^{k} A v^{(t)} \hat{y}^{(t)T}$. The rows of $\hat{A}$ lie in $V_1$, and projection gives the best approximation with rows in $V_1$, so
$$\|A - \mathrm{Proj}_{V_1} A\|_F^2 \le \|A - \hat{A}\|_F^2 = \sum_{i=1}^{n} \big\|u^{(i)T}(A - \hat{A})\big\|^2 = \sum_{i=1}^{k} \big\|u^{(i)T}(A - \hat{A})\big\|^2 + \sum_{i=k+1}^{n} \big\|u^{(i)T}(A - \hat{A})\big\|^2.$$
For $i \le k$, $u^{(i)T}\hat{A} = \sum_{t=1}^{k} (u^{(i)T} A v^{(t)})\,\hat{y}^{(t)T} = \lambda_i \hat{y}^{(i)T} = w^{(i)}$, so $u^{(i)T}(A - \hat{A}) = \lambda_i v^{(i)T} - w^{(i)}$; for $i > k$, $u^{(i)T}\hat{A} = 0$, so $u^{(i)T}(A - \hat{A}) = \lambda_i v^{(i)T}$. Hence
$$\|A - \hat{A}\|_F^2 = \underbrace{\sum_{i=1}^{k} \big\|\lambda_i v^{(i)T} - w^{(i)}\big\|^2}_{\text{sampling error}} + \underbrace{\sum_{i=k+1}^{n} \lambda_i^2}_{\text{error of the optimal rank-}k\text{ approximation}}.$$
Therefore,
$$\mathbb{E}\left[\|A - \mathrm{Proj}_{V_1} A\|_F^2\right] \le \|A - A_k\|_F^2 + \sum_{i=1}^{k} \mathbb{E}\left[\big\|w^{(i)} - \lambda_i v^{(i)T}\big\|^2\right]$$
$$\le \|A - A_k\|_F^2 + \frac{k}{s}\|A\|_F^2 \le \|A - A_k\|_F^2 + \varepsilon\|A\|_F^2 \qquad \left(\text{when } s = \frac{k}{\varepsilon}\right).$$

The existence of a subset $S$ satisfying the properties in the theorem follows from this inequality on the expectation. $\Box$

Remark 2.1. Theorem 1 is an existential theorem: we do not know $u^{(t)}$ in the definition of $w^{(t)}$. The corresponding algorithmic result is given in [DFK+04], presented on 02/24 and 03/01.

2.2  Constant Time SVD Approximation Algorithm

Lemma 1. ([KV04] Lemma 2) Let $M \in \mathbb{R}^{a\times b}$. Let $Q = (Q_1, \ldots, Q_a)$ be a probability distribution on $[a]$ such that
$$Q_i \ge \alpha \frac{\|M_{(i)}\|^2}{\|M\|_F^2}, \quad i = 1, \ldots, a,$$
for some $\alpha \in (0,1]$ (so when $\alpha = 1$ we have equalities). Let $\sigma = (i_1, \ldots, i_p)$ be $p$ independent samples from $[a]$, each following distribution $Q$. Let $N \in \mathbb{R}^{p\times b}$ with
$$N_{(t)} = \frac{M_{(i_t)}}{\sqrt{pQ_{i_t}}}, \quad t = 1, \ldots, p.$$
Then
$$\mathbb{E}\left[\|M^T M - N^T N\|_F^2\right] \le \frac{1}{\alpha p}\|M\|_F^4. \qquad (3)$$

Proof: We first show $\mathbb{E}[N^T N] = M^T M$:
$$\mathbb{E}\left[(N^T N)_{r,s}\right] = \sum_{t=1}^{p} \mathbb{E}[N_{t,r} N_{t,s}] = \sum_{t=1}^{p} \sum_{i=1}^{a} Q_i \frac{M_{i,r} M_{i,s}}{pQ_i} = (M^T M)_{r,s}.$$

Next we bound $\mathbb{E}\left[\left((N^T N)_{r,s} - (M^T M)_{r,s}\right)^2\right]$. Since the rows of $N$ are independent,
$$\mathbb{E}\left[\left((N^T N)_{r,s} - (M^T M)_{r,s}\right)^2\right] = \sum_{t=1}^{p}\left(\mathbb{E}\left[(N_{t,r}N_{t,s})^2\right] - \left(\mathbb{E}[N_{t,r}N_{t,s}]\right)^2\right)$$
$$\le \sum_{t=1}^{p} \sum_{i=1}^{a} Q_i \frac{(M_{i,r}M_{i,s})^2}{(pQ_i)^2} = \frac{1}{p}\sum_{i=1}^{a} \frac{(M_{i,r}M_{i,s})^2}{Q_i} \le \frac{\|M\|_F^2}{\alpha p}\sum_{i=1}^{a} \frac{(M_{i,r}M_{i,s})^2}{\|M_{(i)}\|^2}.$$

Thus,
$$\mathbb{E}\left[\|M^T M - N^T N\|_F^2\right]$$
$$= \sum_{r,s=1}^{b} \mathbb{E}\left[\left((N^T N)_{r,s} - (M^T M)_{r,s}\right)^2\right] \le \frac{\|M\|_F^2}{\alpha p}\sum_{i=1}^{a} \frac{1}{\|M_{(i)}\|^2}\sum_{r,s=1}^{b}(M_{i,r}M_{i,s})^2 = \frac{\|M\|_F^2}{\alpha p}\sum_{i=1}^{a} \frac{\|M_{(i)}\|^4}{\|M_{(i)}\|^2} = \frac{1}{\alpha p}\|M\|_F^4. \qquad \Box$$

Remark 2.2. Lemma 1 suggests that we can approximate the eigenvectors of $M^T M$ (i.e., the right singular vectors of $M$, and the subspace they span) by the eigenvectors of $N^T N$ (the right singular vectors of $N$, and the subspace they span). In our problem, if we sample the rows of $A$ to get a $p\times n$ matrix $S$, and sample the columns of $S$ to get a $p\times p$ matrix $W$, then we may use the subspace spanned by the left singular vectors of $W$ to approximate the subspace spanned by the left singular vectors of $S$, and the subspace spanned by the right singular vectors of $S$ to approximate the subspace spanned by the right singular vectors of $A$. But can we use the subspace spanned by the left singular vectors of $W$ to approximate the subspace spanned by the right singular vectors of $A$? They are not even of the same dimension. A key observation of [KV04] is that we can make use of the subspace spanned by the left singular vectors of $S$ to get an approximation of the subspace spanned by the right singular vectors of $S$.

Remark 2.3. By Markov's inequality, Lemma 1 implies that with probability at least $1 - \frac{1}{\theta^2 \alpha p}$ we can assume $\|M^T M - N^T N\|_F \le \theta\|M\|_F^2$.

Algorithm:
1. Input: $A \in \mathbb{R}^{m\times n}$, $k$, $\varepsilon$.
2. $p = f(k,\varepsilon) = \max\left(\frac{k^4}{\varepsilon^3}, \frac{k^3}{\varepsilon^4}\right)$.
3. (Row sampling) Let $P = (P_1, \ldots, P_m)$ be a probability distribution on $[m]$ such that $P_i \ge c\frac{\|A_{(i)}\|^2}{\|A\|_F^2}$, $i = 1, \ldots, m$, for some $c \in (0,1]$. Let $i_1, \ldots, i_p$ be $p$ independent samples from $[m]$, each following distribution $P$. Let $S \in \mathbb{R}^{p\times n}$ with $S_{(t)} = \frac{A_{(i_t)}}{\sqrt{pP_{i_t}}}$, $t = 1, \ldots, p$.
4. (Column sampling) Let $P' = (P'_1, \ldots, P'_n)$ be a probability distribution on $[n]$ such that $P'_j \ge c\frac{\|S_j\|^2}{\|S\|_F^2}$, $j = 1, \ldots, n$. Let $j_1, \ldots, j_p$ be $p$ independent samples from $[n]$, each following distribution $P'$. Let $W \in \mathbb{R}^{p\times p}$ with ($t$-th column) $W_t = \frac{S_{j_t}}{\sqrt{pP'_{j_t}}}$, $t = 1, \ldots, p$.
5. Compute the top $k$ left singular vectors of $W$: $u^{(1)}(W), \ldots, u^{(k)}(W)$.
6. (Filter) Let $T = \{t \le k : \|W^T u^{(t)}(W)\|^2 \ge \gamma \|W\|_F^2\}$, where $\gamma = \frac{c\varepsilon}{8}$. For $t \in T$, let
$$\hat{v}^{(t)} = \frac{S^T u^{(t)}(W)}{\|W^T u^{(t)}(W)\|}.$$
7. Output $\hat{v}^{(t)}$ for $t \in T$. (The rank-$k$ approximation to $A$ can be reconstructed as $\tilde{A} = A\sum_{t\in T} \hat{v}^{(t)}\hat{v}^{(t)T}$.)

Remark 2.4. Some comments about the algorithm.

- An important observation on which the algorithm is based: there exists a submatrix $W$ of $A$, whose size is only $p\times p$ with $p = f(k,\varepsilon)$, such that $W$ contains an implicit approximation to $A$ satisfying (2). The algorithm thus lets us answer in constant time the question: does there exist a good rank-$k$ approximation to $A$?

- Sampling: the algorithm relies on the following two sampling assumptions.
  1. We can pick row $i$ of $A$ with probability $Q_i \ge c\frac{\|A_{(i)}\|^2}{\|A\|_F^2}$, $c \in (0,1]$.
  2. For any row $i$, we can pick the $j$-th entry with probability $Q_{i,j} \ge c\frac{A_{i,j}^2}{\|A_{(i)}\|^2}$.
  Note that if no entry of $A$ is much larger than the average, then sampling according to the uniform distribution is enough. To implement the column sampling step in the algorithm, we can pick a row uniformly from $S$ and apply the second sampling assumption.

- Suppose the entries of $A$ come as a stream and we only have $O(p \times p)$ memory. How do we achieve the sampling assumptions? Consider a simpler question: how do we pick one number from a stream $a_1, a_2, \ldots$ such that $\Pr\{a_i \text{ is picked}\} = \frac{a_i}{\sum_j a_j}$, keeping only one number at any time? The answer: upon seeing $a_i$, replace the currently kept number by $a_i$ with probability $\frac{a_i}{a_1 + \cdots + a_i}$.

Proof Sketch: Define the difference between $M$ and the projection of $M$ onto the subspace $\mathrm{span}\{x^{(i)}, i \in I\}$ as
$$F(M; x^{(i)}, I) = \left\|M - M\sum_{i\in I} x^{(i)} x^{(i)T}\right\|_F^2. \qquad (4)$$
If $\{x^{(i)}, i \in I\}$ is an orthonormal set, then
$$F(M; x^{(i)}, I) = \|M\|_F^2 - \sum_{i\in I} x^{(i)T} M^T M x^{(i)}.$$
Next we state the following lemmas, whose proofs can be found in [KV04].
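As a brief aside before the lemmas, the one-number streaming pick described in Remark 2.4 can be sketched and checked empirically (a minimal sketch with our own naming; the induction behind it: if the kept number is distributed proportionally to $a_1, \ldots, a_{i-1}$, then replacing it by $a_i$ with probability $a_i/(a_1+\cdots+a_i)$ makes it proportional over $a_1, \ldots, a_i$):

```python
import random
from collections import Counter

def weighted_stream_pick(stream, rand=random.random):
    """Keep one number; on seeing a_i, replace the kept number by a_i
    with probability a_i / (a_1 + ... + a_i).  At the end of the stream,
    Pr[a_i is the kept number] = a_i / sum_j a_j."""
    total = 0.0
    kept = None
    for a in stream:
        total += a
        if rand() < a / total:   # the first element is always kept (prob 1)
            kept = a
    return kept

# Empirical check: on the stream (1, 2, 3), the final pick should be
# 3 about 1/2 of the time, 2 about 1/3, and 1 about 1/6.
random.seed(0)
counts = Counter(weighted_stream_pick([1, 2, 3]) for _ in range(60000))
freqs = {a: counts[a] / 60000 for a in (1, 2, 3)}
```

Running $p$ independent copies of this pass over the stream of squared row norms would give the $p$ independent row samples needed in step 3 of the algorithm.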
Lemma. [KV04] Lemma 3) Suppose A T A S T S θ A. Then a) or any par of unt vectors z, z n the row space of A, z T A T Az z T S T Sz θ A. b) or any set of vectors z 1),..., z l), l n the row space of A, A; z ), [l]) S; z ), [l]) θ A. ollowng from Lemma 1, the samplng n the algorthm enables us to mae use of Lemma for several tmes. Lemma 3. [KV04] Clam 1) S; ˆv t), t T ) S T ; W ), t T ) ε 8 A Now we are ready to show that Lemma 1, b), and 3. Ã n the algorthm does satsfy ). The proof uses Theorem 1, 1. rom Lemma 1, wth some probablty, we can assume for some θ 40 cp. A T A S T S θ A and SS T W W T θ S. rom Theorem 1, there exst vectors x 1),..., x ) such that A; x t), t []) A A A ε 8 A A ε 8 A. 3. rom Lemma b), by pcng θ ε 8, S; x t), t []) A; x t), t []) θ A A ε 4 A. 4. Snce S and S T have the same sngular values, there exst vectors y t), t [] n the column space of S such that S T ; y t), t []) A ε 4 A. 5. rom Theorem 1, there exst vectors z t), t [] such that W T ; z t), t []) S T ; z t), t []) θ S A ε A, and specfcally, W ), t [] wll have ths property: W T ; W ), t []) A ε A. 8
6. From Lemma 2 b) (again with $S^T$ and $W^T$),
$$F(S^T; u^{(t)}(W), t \in T) \le F(W^T; u^{(t)}(W), t \in T) + k\theta\|S\|_F^2 \le \|A - A_k\|_F^2 + \frac{3\varepsilon}{4}\|A\|_F^2.$$

7. Apply Lemma 3 and Lemma 2 b):
$$F(A; \hat{v}^{(t)}, t \in T) \le F(S; \hat{v}^{(t)}, t \in T) + k\theta\|A\|_F^2 \le \|A - A_k\|_F^2 + \varepsilon\|A\|_F^2.$$
This implies (2). $\Box$

References

[CKVW] Cheng, D., Kannan, R., Vempala, S., and Wang, G. A divide-and-merge methodology for clustering. To appear in Proceedings of the ACM Symposium on Principles of Database Systems, 2005.

[DFK+04] Drineas, P., Frieze, A., Kannan, R., Vempala, S., and Vinay, V. Clustering large graphs via the singular value decomposition. Machine Learning, 56:9-33, 2004. Preliminary version in Proceedings of the 10th ACM-SIAM Symposium on Discrete Algorithms (SODA), Baltimore, 1999.

[KV04] Frieze, A., Kannan, R., and Vempala, S. Fast Monte-Carlo algorithms for finding low-rank approximations. Journal of the ACM, 51(6):1025-1041, 2004. Preliminary version in Proceedings of the 39th IEEE Symposium on Foundations of Computer Science (FOCS), Palo Alto, 1998.