MGMT 69000: Topics in High-dimensional Data Analysis                    Fall 2016

Lecture 5: Spectral Clustering: Overview (contd) and Analysis

Lecturer: Jiaming Xu        Scribes: Adarsh Barik, Taotao He        September 3, 2016

Outline:
- Review of Singular Value Decomposition
- Spectral Clustering under Gaussian Mixture Model (continued from previous lecture)
- Analysis of spectral clustering

5.1 Review of Singular Value Decomposition (SVD)

Recall from the previous lecture that the singular value decomposition of a matrix A is

    A = Σ_{i=1}^r σ_i u_i v_i^T,

where σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0 are the singular values and u_1, ..., u_r and v_1, ..., v_r are the corresponding left and right singular vectors, respectively. Below we present a summary of the results on the geometric interpretation of the SVD discussed in the previous lecture:

1. The leading right singular vector v_1 of A can be interpreted as the best-fit vector for the rows of A. It is also the leading eigenvector of A^T A.
2. The leading singular value σ_1 measures the total length of the projections of the rows of A onto the linear subspace spanned by v_1: ‖A v_1‖ = σ_1, and ‖A v_1‖^2 is the sum of the squared lengths of these projections.
3. The previous result extends to higher dimensions: the best-fit k-dimensional subspace for the rows of A is given by span{v_1, v_2, ..., v_k}, where

    v_1 = arg max_{‖v‖=1} ‖A v‖,    v_k = arg max_{‖v‖=1, v ⊥ v_1, ..., v_{k-1}} ‖A v‖.

   The collection v_1, v_2, ..., v_k can be chosen as the top-k eigenvectors of A^T A.
4. The left singular vectors are defined as u_i = A v_i / σ_i. Combined with the previous property, this implies that u_i ⊥ u_j if i ≠ j and ‖u_i‖ = 1.
5. A = Σ_i σ_i u_i v_i^T = Σ_{i=1}^r σ_i u_i v_i^T for some r, assuming σ_{r+1} = ... = 0. In this case, let row(A) denote the row space of A. Then row(A) = span{v_1, v_2, ..., v_r}.
6. The Frobenius norm ‖A‖_F satisfies

    ‖A‖_F^2 = Σ_{j=1}^r σ_j^2 = Σ_{j=1}^r ‖A v_j‖^2 = Σ_{i=1}^m ‖A_i‖^2 = Σ_{i,j} A_{ij}^2,

   where A_i is the i-th row of A. The second equality holds because σ_j = ‖A v_j‖, and the third equality holds because of the previous property.
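The facts above can be checked numerically. Below is a minimal pure-Python sketch (the matrix A and the helper computations are illustrative, not from the lecture): power iteration on A^T A recovers the leading right singular vector v_1 (property 1), ‖A v_1‖ equals σ_1 (property 2), and ‖A‖_F^2 equals the sum of the squared singular values, i.e. the trace of A^T A (property 6).

```python
# Sanity check of SVD properties on an arbitrary 3x2 example matrix.
import math

A = [[3.0, 1.0], [1.0, 2.0], [0.0, 1.0]]

# Gram matrix G = A^T A (2x2 here).
G = [[sum(r[a] * r[b] for r in A) for b in range(2)] for a in range(2)]

# Power iteration: v <- G v / ||G v|| converges to the top eigenvector
# of A^T A, which is the leading right singular vector v1.
v = [1.0, 0.0]
for _ in range(200):
    w = [G[0][0] * v[0] + G[0][1] * v[1], G[1][0] * v[0] + G[1][1] * v[1]]
    s = math.hypot(w[0], w[1])
    v = [w[0] / s, w[1] / s]

Av = [r[0] * v[0] + r[1] * v[1] for r in A]
sigma1 = math.sqrt(sum(t * t for t in Av))       # ||A v1|| = sigma1

# Closed-form top eigenvalue of the symmetric 2x2 G, for comparison.
lam1 = (G[0][0] + G[1][1]) / 2 + math.hypot((G[0][0] - G[1][1]) / 2, G[0][1])

frob_sq = sum(x * x for row in A for x in row)   # ||A||_F^2 = sum of A_ij^2
trace_G = G[0][0] + G[1][1]                      # sigma1^2 + sigma2^2

print(abs(sigma1 - math.sqrt(lam1)) < 1e-9, abs(frob_sq - trace_G) < 1e-12)
# True True
```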
5.2 Spectral Clustering under Gaussian Mixture Model

Recall that if we know the cluster mean μ a priori, then we can project our sample points onto span{μ}. This can help us reduce the dimension of the problem (in the simple example of 2 clusters we can reduce d dimensions to 1 dimension). However, in general we do not have prior knowledge of μ. As discussed in the previous lecture, we could try a random projection, but we showed that it does not help. Here, we discuss another projection scheme called spectral projection.

5.2.1 Spectral Projection

We start with our basic example of 2 clusters centered at μ and −μ with variance σ^2. Then we extend the model to the more general case.

Idea. We have been given the following information:

    X = [X_1^T; X_2^T; ...; X_n^T] ∈ R^{n×d},    X_i ~ (1/2) N(μ, σ^2 I_{d×d}) + (1/2) N(−μ, σ^2 I_{d×d})  i.i.d.

Based on this, we can say that

    E[X] = [ μ^T; ...; μ^T; −μ^T; ...; −μ^T ] = z μ^T,

where z ∈ {+1, −1}^n is the vector of cluster labels. Observe that points from the first cluster contribute μ^T rows to E[X] and points from the second cluster contribute −μ^T rows. This matrix can be further decomposed into a vector of {+1, −1} entries and μ^T, which can be treated as the left singular vector and right singular vector (upon normalization) of E[X], respectively. This gives us the intuition that if X is close to E[X], then we would expect the leading right singular vector of X to be close to μ/‖μ‖. Note that E[X] is a rank-1 matrix; however, X may not be rank 1. Now suppose

    X = Σ_{i=1}^r σ_i u_i v_i^T,    so that    X v_1 = σ_1 u_1.
If X is close to E[X], then u_1 will be close to z/√n, the normalized label vector. We can treat the problem of clustering the X_i as the problem of clustering the entries of u_1, and this gives us an algorithm.

5.2.2 Spectral clustering algorithm for k = 2, μ_1 = μ, μ_2 = −μ

1. Compute the leading left singular vector of X; say it is given by u_1.
2. (a) If u_{1,i} < 0, assign X_i to the first cluster.
   (b) If u_{1,i} > 0, assign X_i to the second cluster.
   (c) If u_{1,i} = 0, assign X_i to an arbitrarily chosen cluster.

We can easily generalize our results to a general clustering problem following a Gaussian mixture model with k clusters centered at μ_1, μ_2, ..., μ_k, respectively. We can again check that E[X] factors as

    E[X] = [ 1_{S_1}  1_{S_2}  ...  1_{S_k} ] [ μ_1^T; μ_2^T; ...; μ_k^T ],

where the rows of E[X] belonging to the a-th cluster equal μ_a^T. Column a of the left factor (up to normalization) acts as an indicator vector 1_{S_a} for cluster a; the left factor plays the role of the left singular vectors and the rows μ_a^T the role of the right singular vectors. The matrix of indicators itself is known as the membership matrix. It is easy to extend our previous algorithm to deal with the general case.

5.2.3 Spectral clustering algorithm in the general case

1. Compute the SVD of X, i.e. X = Σ_{i=1}^r σ_i u_i v_i^T.
2. Form U = [u_1, u_2, ..., u_k] ∈ R^{n×k}.
3. Run k-means on the rows of U.

Clustering U is easy if U is close to the membership matrix.

Note: We are treating [μ_1, μ_2, ..., μ_k]^T as right singular vectors up to normalization. However, they may not be orthogonal to each other. Our argument still works because we are only interested in the space spanned by them.
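The k = 2 algorithm above can be sketched end-to-end on synthetic data. In this illustrative example (all parameters, the random seed, and the power-iteration details are assumptions for illustration, not from the notes), we draw rows from the two-cluster mixture in d = 2, check that the leading right singular vector of X is nearly aligned with μ/‖μ‖, and then cluster each point by the sign of the corresponding entry of X v_1 (which is proportional to u_1).

```python
# End-to-end sketch of spectral clustering for k = 2 on synthetic data.
import math
import random

random.seed(1)
n, s, mu = 200, 0.4, [1.0, -0.8]
z = [1.0 if i < n // 2 else -1.0 for i in range(n)]          # true labels
X = [[z[i] * m + random.gauss(0.0, s) for m in mu] for i in range(n)]

# Leading right singular vector v1 via power iteration on X^T X.
G = [[sum(r[a] * r[b] for r in X) for b in range(2)] for a in range(2)]
v = [1.0, 0.0]
for _ in range(100):
    w = [G[0][0] * v[0] + G[0][1] * v[1], G[1][0] * v[0] + G[1][1] * v[1]]
    nw = math.hypot(w[0], w[1])
    v = [w[0] / nw, w[1] / nw]

# v1 should be nearly aligned with mu / ||mu||, as argued above.
nmu = math.hypot(mu[0], mu[1])
align = abs(v[0] * mu[0] + v[1] * mu[1]) / nmu

# u1 is proportional to X v1, so only the signs of the entries matter.
Xv = [r[0] * v[0] + r[1] * v[1] for r in X]
pred = [1.0 if t > 0 else -1.0 for t in Xv]

# u1 is determined only up to a global sign flip, so score both labelings.
agree = sum(p == l for p, l in zip(pred, z))
accuracy = max(agree, n - agree) / n
print(round(align, 3), accuracy)
```

On this instance the alignment is close to 1 and nearly all points are clustered correctly, consistent with the intuition that u_1 ≈ z/√n when X is close to E[X].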
Recall our spectral relaxation of the k-means problem:

    Ŷ = arg max_Y ⟨X X^T, Y⟩    such that    Y = Σ_{i=1}^k w_i w_i^T,  ‖w_i‖ = 1,  w_i ⊥ w_j  ∀ i ≠ j.

The optimal solution of the above is Y* = Σ_{i=1}^k u_i u_i^T = U U^T, where u_1, ..., u_k are the top-k left singular vectors of X and U = [u_1, u_2, ..., u_k]. Notice how U appears in the spectral clustering algorithm as well.

5.3 Analysis of Spectral Clustering

The algorithms mentioned in the previous section depend on our assumption that X is close to E[X]. In this section, we try to quantify this closeness. We will use Davis-Kahan's sin Θ theorem to analyze spectral clustering, but before we move there, we define some notation. Suppose we have two matrices A and B such that B = A + Δ, where Δ is called the perturbation. Suppose A and B have decompositions similar to the SVD, given by

    A = [E_0  E_1] [A_0  0; 0  A_1] [G_0  G_1]^T,
    B = [F_0  F_1] [B_0  0; 0  B_1] [H_0  H_1]^T,

where

    A ∈ R^{m×n},  E = [E_0 E_1] ∈ R^{m×m},  G = [G_0 G_1] ∈ R^{n×n},  E_0 ∈ R^{m×k},  E_1 ∈ R^{m×(m−k)},  G_0 ∈ R^{n×k},  G_1 ∈ R^{n×(n−k)},
    B ∈ R^{m×n},  F = [F_0 F_1] ∈ R^{m×m},  H = [H_0 H_1] ∈ R^{n×n},  F_0 ∈ R^{m×k},  F_1 ∈ R^{m×(m−k)},  H_0 ∈ R^{n×k},  H_1 ∈ R^{n×(n−k)},
    A_0 ∈ R^{k×k},  A_1 ∈ R^{(m−k)×(n−k)},  B_0 ∈ R^{k×k},  B_1 ∈ R^{(m−k)×(n−k)},

and assume

    E E^T = E^T E = I_{m×m},  G G^T = G^T G = I_{n×n},  F^T F = F F^T = I_{m×m},  H^T H = H H^T = I_{n×n}.

Clearly,

    A = E_0 A_0 G_0^T + E_1 A_1 G_1^T,    B = F_0 B_0 H_0^T + F_1 B_1 H_1^T.

In our case, we can view A = E[X] and B = X. Our goal is to define a distance d(E_0, F_0) between E_0 and F_0 and upper bound it as a function of ‖Δ‖. Davis-Kahan's sin Θ theorem helps us do that. But before we move to the actual theorem, we define some specific distances and look into their properties.
5.3.1 Projection distance

Definition 5.1. d_p(E_0, F_0) ≜ ‖E_0 E_0^T − F_0 F_0^T‖.

Lemma 5.1. d_p(E_0, F_0) = ‖F_1^T E_0‖ = ‖E_1^T F_0‖.

Proof. Left for homework.

To get intuition behind the lemma, take a simple example where E_0 and F_0 are one-dimensional unit vectors and F_1 ⊥ F_0. With θ the angle between E_0 and F_0, it is easy to see that

    ‖E_0 E_0^T − F_0 F_0^T‖ = ‖F_1^T E_0‖ = sin θ.

Notice how we can express the projection distance in terms of sin θ. We now generalize this notion and present a way to view the projection distance in terms of principal angles. Write the SVD

    E_0^T F_0 = U cos Θ V^T,    where    Θ = diag(θ_1, ..., θ_k),    cos Θ = diag(cos θ_1, ..., cos θ_k),

with 0 ≤ θ_1 ≤ θ_2 ≤ ... ≤ θ_k ≤ π/2. We can do this because E_0 and F_0 are orthonormal bases, so the singular values of E_0^T F_0 are at most 1. Also note that U, V ∈ O(k), where O(k) is the set of k × k orthonormal matrices. In our one-dimensional example above, E_0^T F_0 = cos θ.
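The one-dimensional picture can be verified numerically. Below is a small pure-Python check (the angle t and the 2x2 helper are illustrative assumptions): for unit vectors E_0 = (1, 0) and F_0 = (cos t, sin t), the spectral norm of E_0 E_0^T − F_0 F_0^T equals sin t, as Lemma 5.1 predicts.

```python
# Check d_p(E0, F0) = sin(t) for one-dimensional subspaces at angle t.
import math

def spec_norm_sym2(a, b, c):
    # Spectral norm of the symmetric 2x2 matrix [[a, b], [b, c]]:
    # max absolute eigenvalue, via the closed-form eigenvalues.
    mid, rad = (a + c) / 2.0, math.hypot((a - c) / 2.0, b)
    return max(abs(mid + rad), abs(mid - rad))

t = 0.7  # an arbitrary principal angle in [0, pi/2]
E0 = (1.0, 0.0)
F0 = (math.cos(t), math.sin(t))

# Entries of D = E0 E0^T - F0 F0^T.
a = E0[0] ** 2 - F0[0] ** 2
b = E0[0] * E0[1] - F0[0] * F0[1]
c = E0[1] ** 2 - F0[1] ** 2

d_p = spec_norm_sym2(a, b, c)
print(abs(d_p - math.sin(t)) < 1e-9)
# True
```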
Lemma 5.2. ‖F_1^T E_0‖ = ‖sin Θ‖ = sin θ_k.

Proof.

    ‖F_1^T E_0‖^2 = ‖E_0^T F_1 F_1^T E_0‖
                  = ‖E_0^T (I − F_0 F_0^T) E_0‖
                  = ‖E_0^T E_0 − E_0^T F_0 F_0^T E_0‖
                  = ‖I_{k×k} − U cos Θ V^T V cos Θ U^T‖
                  = ‖I_{k×k} − U cos^2 Θ U^T‖
                  = ‖I_{k×k} − cos^2 Θ‖
                  = ‖sin^2 Θ‖ = sin^2 θ_k.

The second equality holds because F F^T = I_{m×m}, so F_1 F_1^T = I − F_0 F_0^T; the third and fourth equalities use the SVD of E_0^T F_0; and the fifth and sixth equalities hold because left and right multiplication by U and U^T, respectively, only causes a rotation, which does not affect the spectral norm. ∎

5.3.2 Spectral distance

Definition 5.2 (Spectral distance). d_s(E_0, F_0) ≜ min_{Q,R ∈ O(k)} ‖E_0 Q − F_0 R‖ = min_{R ∈ O(k)} ‖E_0 − F_0 R‖.

The equality holds because we can interpret Q and R as rotation matrices. For example, if E_0 and F_0 are unit vectors in R^n (the case k = 1), we only need to multiply one of the two vectors by ±1 to get the quantity to be minimized.

Lemma 5.3. d_s(E_0, F_0) = 2 ‖sin(Θ/2)‖ = 2 sin(θ_k/2).
Proof.

    d_s^2(E_0, F_0) = min_{R ∈ O(k)} ‖E_0 − F_0 R‖^2
                    = min_{R ∈ O(k)} ‖(E_0 − F_0 R)^T (E_0 − F_0 R)‖
                    = min_{R ∈ O(k)} ‖E_0^T E_0 − R^T F_0^T E_0 − E_0^T F_0 R + R^T F_0^T F_0 R‖
                    = min_{R ∈ O(k)} ‖2 I − R^T V cos Θ U^T − U cos Θ V^T R‖
                    = min_{R ∈ O(k)} ‖U^T (2 I − R^T V cos Θ U^T − U cos Θ V^T R) U‖
                    = min_{R ∈ O(k)} ‖2 I − U^T R^T V cos Θ − cos Θ V^T R U‖.

Let R' ≜ V^T R U. Since the product of two orthogonal matrices is also an orthogonal matrix, we have R' ∈ O(k), and minimizing over R is the same as minimizing over R'. Next, we bound the quantity d_s^2(E_0, F_0) on both sides. On the one hand, we have

    d_s^2(E_0, F_0) = min_{R' ∈ O(k)} ‖2 I − (R')^T cos Θ − cos Θ R'‖ ≤ ‖2 I − 2 cos Θ‖ = 2 (1 − cos θ_k) = 4 sin^2(θ_k/2).

The inequality holds by letting R' be a feasible solution, i.e. I_{k×k}. On the other hand, we have

    d_s^2(E_0, F_0) ≥ min_{R' ∈ O(k)} max_{‖x‖=1} x^T (2 I − (R')^T cos Θ − cos Θ R') x
                    ≥ min_{R' ∈ O(k)} e_k^T (2 I − (R')^T cos Θ − cos Θ R') e_k
                    = min_{R' ∈ O(k)} (2 − 2 R'_{kk} cos θ_k)
                    ≥ 2 − 2 cos θ_k = 4 sin^2(θ_k/2).

The second inequality is true by letting x = e_k, and the last holds because R'_{kk} ≤ 1. ∎

Corollary 5.1. d_p(E_0, F_0) ≤ d_s(E_0, F_0) ≤ √2 d_p(E_0, F_0).

Proof. By Lemmas 5.1 and 5.2, we have d_p(E_0, F_0) = sin θ_k = 2 sin(θ_k/2) cos(θ_k/2). From Lemma 5.3, we have d_s(E_0, F_0) = 2 sin(θ_k/2). Since 0 ≤ θ_k ≤ π/2, we have 1/√2 ≤ cos(θ_k/2) ≤ 1. Therefore d_p(E_0, F_0) ≤ d_s(E_0, F_0) ≤ √2 d_p(E_0, F_0). ∎
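Lemma 5.3 and Corollary 5.1 can be checked numerically in the one-dimensional case (a sketch; the angle t is an arbitrary assumption): for unit vectors at principal angle t, the spectral distance min over the sign r of ‖E_0 − r F_0‖ equals 2 sin(t/2), and d_p = sin t is sandwiched as d_p ≤ d_s ≤ √2 d_p.

```python
# Check d_s = 2 sin(t/2) and d_p <= d_s <= sqrt(2) d_p for k = 1.
import math

t = 1.1  # an arbitrary angle in [0, pi/2]
E0 = (1.0, 0.0)
F0 = (math.cos(t), math.sin(t))

def dist(u, v, r):
    # Euclidean distance ||u - r v|| for a sign r in {+1, -1}.
    return math.hypot(u[0] - r * v[0], u[1] - r * v[1])

d_s = min(dist(E0, F0, 1.0), dist(E0, F0, -1.0))
d_p = math.sin(t)

print(abs(d_s - 2 * math.sin(t / 2)) < 1e-9,
      d_p <= d_s <= math.sqrt(2) * d_p)
# True True
```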
5.3.3 Davis-Kahan sin-Θ Theorem

Theorem 5.1 (Davis-Kahan sin-Θ theorem). Let Sval(A_0) and Sval(B_1) be the sets of singular values of A_0 and B_1, respectively. If Sval(A_0) ⊂ [0, α] and Sval(B_1) ⊂ [α + δ, ∞) for some α ≥ 0 and δ > 0, then we have

    d_p(E_0, F_0) ≤ ‖Δ‖ / δ.    (5.1)

In the theorem, δ is called the spectral gap. Before going on to prove the theorem, we discuss an application of the Davis-Kahan sin-Θ theorem to spectral clustering.

Example 5.1 (Application of the D-K sin-Θ theorem to spectral clustering). Recall the two-cluster setting: cluster one is centered at μ ∈ R^d, cluster two at −μ, and X ∈ R^{n×d} is the data matrix. Let A ≜ X, B ≜ E[X], and Δ ≜ X − E[X]. Then we have the SVD of A,

    A = σ_1 u_1 v_1^T + Σ_{i=2}^r σ_i u_i v_i^T,

and

    B = z μ^T = (√n ‖μ‖) · (z/√n) (μ/‖μ‖)^T ≜ (√n ‖μ‖) β (μ/‖μ‖)^T,    β ≜ z/√n,

where z ∈ {+1, −1}^n is the vector of cluster labels. The goal is to derive an upper bound on the distance between β and the singular vector u_1 in terms of ‖X − E[X]‖. We can apply the Davis-Kahan sin-Θ theorem (Theorem 5.1) with E_0 = β, F_0 = u_1, A_1 = diag(σ_2, σ_3, ..., σ_r) and B_0 = √n ‖μ‖, and obtain

    d_p(β, u_1) ≤ ‖X − E[X]‖ / δ,
where a lower bound on δ is given as below. We need to obtain an upper bound on the singular value set {σ_2, ..., σ_r}. From Weyl's theorem, we know

    |σ_i(A) − σ_i(B)| ≤ ‖A − B‖.

Since σ_i(E[X]) = 0 for i ≥ 2, the singular value set {σ_2, ..., σ_r} is bounded by ‖X − E[X]‖. Hence δ ≥ √n ‖μ‖ − ‖X − E[X]‖, and we have

    d_p(β, u_1) ≤ ‖X − E[X]‖ / δ ≤ ‖X − E[X]‖ / (√n ‖μ‖ − ‖X − E[X]‖).

We need one more lemma to prove the Davis-Kahan sin-Θ theorem.

Lemma 5.4. Let P ∈ R^{n×n}, Q ∈ R^{m×m}, X ∈ R^{n×m} and Y ∈ R^{n×m}. Assume ‖P‖ ≤ α and σ_min(Q) ≥ α + δ for some α ∈ R_+ and δ ∈ R_+. Let C ≜ X Q − P Y. Then we have

    ‖C‖ ≥ (α + δ) ‖X‖ − α ‖Y‖.

Proof. First, we have

    ‖C‖ = ‖X Q − P Y‖ ≥ ‖X Q‖ − ‖P Y‖

by the subadditivity of a norm. Then, we derive a lower bound on ‖X Q‖:

    ‖X‖ = ‖X Q Q^{-1}‖ ≤ ‖X Q‖ ‖Q^{-1}‖ ≤ ‖X Q‖ / (α + δ),

where the second inequality holds because for any two matrices A, B we have ‖A B‖ ≤ ‖A‖ ‖B‖, and ‖Q^{-1}‖ = 1/σ_min(Q) ≤ 1/(α + δ). Thus ‖X Q‖ ≥ (α + δ) ‖X‖. We also have an upper bound on ‖P Y‖:

    ‖P Y‖ ≤ α ‖Y‖.

Hence, ‖C‖ ≥ (α + δ) ‖X‖ − α ‖Y‖. ∎

Proof of the Davis-Kahan sin-Θ theorem. Recall

    A = [E_0  E_1] [A_0  0; 0  A_1] [G_0  G_1]^T,    B = [F_0  F_1] [B_0  0; 0  B_1] [H_0  H_1]^T,    Δ = B − A.

Then, since E, F ∈ O(m) and G, H ∈ O(n),

    E_0^T Δ H_1 = E_0^T (B − A) H_1 = E_0^T B H_1 − E_0^T A H_1 = E_0^T F_1 B_1 − A_0 G_0^T H_1.

Letting E_0^T F_1 be X, B_1 be Q, A_0 be P, and G_0^T H_1 be Y, by Lemma 5.4 we have

    ‖E_0^T Δ H_1‖ ≥ (α + δ) ‖E_0^T F_1‖ − α ‖G_0^T H_1‖.

Similarly, we have

    ‖F_1^T Δ G_0‖ ≥ (α + δ) ‖G_0^T H_1‖ − α ‖E_0^T F_1‖.
Let t_1 = ‖G_0^T H_1‖ and t_2 = ‖E_0^T F_1‖. Since ‖E_0^T Δ H_1‖ ≤ ‖Δ‖ and ‖F_1^T Δ G_0‖ ≤ ‖Δ‖, the two inequalities above give

    t_2 ≤ (α t_1 + ‖Δ‖) / (α + δ)    and    t_1 ≤ (α t_2 + ‖Δ‖) / (α + δ).

Applying these to t = max{t_1, t_2} yields t (α + δ) ≤ α t + ‖Δ‖, i.e. t δ ≤ ‖Δ‖. Therefore

    max{t_1, t_2} ≤ ‖Δ‖ / δ.

By Lemma 5.1, d_p(E_0, F_0) = ‖F_1^T E_0‖ = t_2 ≤ ‖Δ‖ / δ. ∎
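The bound in Example 5.1 can be illustrated on one random instance of the two-cluster model. Below is a pure-Python sketch (the parameters, seed, and helper routines are assumptions for illustration): it computes d_p(β, u_1) = sin of the angle between the normalized label vector β = z/√n and the leading left singular vector u_1 of X, and compares it with ‖Δ‖/δ, where Δ = X − E[X] and δ = √n‖μ‖ − ‖Δ‖.

```python
# Numerical illustration of the Davis-Kahan bound for spectral clustering.
import math
import random

random.seed(2)
n, s, mu = 300, 0.25, [0.9, 0.6]
z = [1.0 if i < n // 2 else -1.0 for i in range(n)]
X = [[z[i] * m + random.gauss(0.0, s) for m in mu] for i in range(n)]
D = [[X[i][j] - z[i] * mu[j] for j in range(2)] for i in range(n)]  # Delta

def gram(M):
    # 2x2 Gram matrix M^T M.
    return [[sum(r[a] * r[b] for r in M) for b in range(2)] for a in range(2)]

def top_sval(M):
    # Largest singular value via the top eigenvalue of M^T M (closed form).
    g = gram(M)
    mid = (g[0][0] + g[1][1]) / 2.0
    rad = math.hypot((g[0][0] - g[1][1]) / 2.0, g[0][1])
    return math.sqrt(mid + rad)

def leading_left_vec(M):
    # u1 = M v1 / ||M v1||, with v1 from power iteration on M^T M.
    g = gram(M)
    v = [1.0, 0.0]
    for _ in range(200):
        w = [g[0][0] * v[0] + g[0][1] * v[1], g[1][0] * v[0] + g[1][1] * v[1]]
        nw = math.hypot(w[0], w[1])
        v = [w[0] / nw, w[1] / nw]
    u = [r[0] * v[0] + r[1] * v[1] for r in M]
    nu = math.sqrt(sum(t * t for t in u))
    return [t / nu for t in u]

u1 = leading_left_vec(X)
beta = [zi / math.sqrt(n) for zi in z]
cos_a = abs(sum(b * u for b, u in zip(beta, u1)))
d_p = math.sqrt(max(0.0, 1.0 - cos_a ** 2))     # sin of the angle

nmu = math.hypot(mu[0], mu[1])
delta = math.sqrt(n) * nmu - top_sval(D)        # spectral-gap lower bound
print(d_p <= top_sval(D) / delta)               # the bound holds here
```

In this instance the observed d_p is far below the bound, which is typical: the Davis-Kahan guarantee is a worst-case statement, while a random perturbation usually tilts u_1 away from β much less than ‖Δ‖/δ allows.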
More informationPrinciple Of Superposition
ecture 5: PREIMINRY CONCEP O RUCUR NYI Priciple Of uperpositio Mathematically, the priciple of superpositio is stated as ( a ) G( a ) G( ) G a a or for a liear structural system, the respose at a give
More informationDistributional Similarity Models (cont.)
Distributioal Similarity Models (cot.) Regia Barzilay EECS Departmet MIT October 19, 2004 Sematic Similarity Vector Space Model Similarity Measures cosie Euclidea distace... Clusterig k-meas hierarchical
More informationThe Basic Space Model
The Basic Space Model Let x i be the ith idividual s (i=,, ) reported positio o the th issue ( =,, m) ad let X 0 be the by m matrix of observed data here the 0 subscript idicates that elemets are missig
More informationLecture 12: February 28
10-716: Advaced Machie Learig Sprig 2019 Lecture 12: February 28 Lecturer: Pradeep Ravikumar Scribes: Jacob Tyo, Rishub Jai, Ojash Neopae Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:
More informationSolutions to home assignments (sketches)
Matematiska Istitutioe Peter Kumli 26th May 2004 TMA401 Fuctioal Aalysis MAN670 Applied Fuctioal Aalysis 4th quarter 2003/2004 All documet cocerig the course ca be foud o the course home page: http://www.math.chalmers.se/math/grudutb/cth/tma401/
More informationGoodness-of-Fit Tests and Categorical Data Analysis (Devore Chapter Fourteen)
Goodess-of-Fit Tests ad Categorical Data Aalysis (Devore Chapter Fourtee) MATH-252-01: Probability ad Statistics II Sprig 2019 Cotets 1 Chi-Squared Tests with Kow Probabilities 1 1.1 Chi-Squared Testig................
More informationAxis Aligned Ellipsoid
Machie Learig for Data Sciece CS 4786) Lecture 6,7 & 8: Ellipsoidal Clusterig, Gaussia Mixture Models ad Geeral Mixture Models The text i black outlies high level ideas. The text i blue provides simple
More informationLecture 7: Properties of Random Samples
Lecture 7: Properties of Radom Samples 1 Cotiued From Last Class Theorem 1.1. Let X 1, X,...X be a radom sample from a populatio with mea µ ad variace σ
More information18.657: Mathematics of Machine Learning
8.657: Mathematics of Machie Learig Lecturer: Philippe Rigollet Lecture 0 Scribe: Ade Forrow Oct. 3, 05 Recall the followig defiitios from last time: Defiitio: A fuctio K : X X R is called a positive symmetric
More informationECE 901 Lecture 13: Maximum Likelihood Estimation
ECE 90 Lecture 3: Maximum Likelihood Estimatio R. Nowak 5/7/009 The focus of this lecture is to cosider aother approach to learig based o maximum likelihood estimatio. Ulike earlier approaches cosidered
More informationSection 1.1. Calculus: Areas And Tangents. Difference Equations to Differential Equations
Differece Equatios to Differetial Equatios Sectio. Calculus: Areas Ad Tagets The study of calculus begis with questios about chage. What happes to the velocity of a swigig pedulum as its positio chages?
More informationRademacher Complexity
EECS 598: Statistical Learig Theory, Witer 204 Topic 0 Rademacher Complexity Lecturer: Clayto Scott Scribe: Ya Deg, Kevi Moo Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved for
More information6.895 Essential Coding Theory October 20, Lecture 11. This lecture is focused in comparisons of the following properties/parameters of a code:
6.895 Essetial Codig Theory October 0, 004 Lecture 11 Lecturer: Madhu Suda Scribe: Aastasios Sidiropoulos 1 Overview This lecture is focused i comparisos of the followig properties/parameters of a code:
More informationREGRESSION WITH QUADRATIC LOSS
REGRESSION WITH QUADRATIC LOSS MAXIM RAGINSKY Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X, Y ), where, as before, X is a R d
More informationBertrand s Postulate
Bertrad s Postulate Lola Thompso Ross Program July 3, 2009 Lola Thompso (Ross Program Bertrad s Postulate July 3, 2009 1 / 33 Bertrad s Postulate I ve said it oce ad I ll say it agai: There s always a
More informationTechnical Proofs for Homogeneity Pursuit
Techical Proofs for Homogeeity Pursuit bstract This is the supplemetal material for the article Homogeeity Pursuit, submitted for publicatio i Joural of the merica Statistical ssociatio. B Proofs B. Proof
More informationAgnostic Learning and Concentration Inequalities
ECE901 Sprig 2004 Statistical Regularizatio ad Learig Theory Lecture: 7 Agostic Learig ad Cocetratio Iequalities Lecturer: Rob Nowak Scribe: Aravid Kailas 1 Itroductio 1.1 Motivatio I the last lecture
More information