Versatility of Singular Value Decomposition (SVD) January 7, 2015


1 Versatility of Singular Value Decomposition (SVD) January 7, 2015

8 Assumption: Data = Real Data + Noise

Each data point is a column of the n × d data matrix A.

A = B + C, where B is the Real Data and C is the Noise.

rank(B) ≤ k. ‖C‖ (= max_{‖u‖=1} ‖Cu‖).

k << n, d. ‖C‖ small.

Caution: ‖C‖_F (= (Σ_ij C_ij²)^{1/2}) need not be smaller than, for example, ‖B‖_F. In words, the overall noise can be larger than the overall real data.

Given any A, Singular Value Decomposition (SVD) finds the B of rank k (or less) for which ‖A − B‖ is minimum. The space spanned by the columns of B is the best-fit subspace for A, in the sense of the least sum, over all data points, of squared distances to the subspace.

A very powerful tool. Decades of theory, algorithms. Here: example applications.
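The best-fit claim can be checked directly with NumPy. This is a minimal sketch with illustrative sizes (not from the talk): build a rank-k "real data" matrix plus noise, truncate the SVD to k terms, and verify the Eckart–Young guarantee that the truncation is at least as close to A as B is.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 50, 3                       # illustrative sizes

# Low-rank real data B plus dense noise C
B = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
C = 0.1 * rng.standard_normal((n, d))
A = B + C

# Truncated SVD: keep only the top-k singular triples
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

# A_k is the best rank-k approximation: ||A - A_k|| = sigma_{k+1}(A),
# which by Weyl's inequality is at most ||C|| since rank(B) = k
print(np.linalg.matrix_rank(A_k))                                   # 3
print(np.linalg.norm(A - A_k, 2) <= np.linalg.norm(C, 2) + 1e-9)    # True
```

So even without knowing the decomposition A = B + C, the truncated SVD recovers a rank-k matrix at least as close to A (in spectral norm) as the hidden real data B.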

13 Example I: Mixture of Spherical Gaussians

F(x) = w_1 N(µ_1, σ_1²) + w_2 N(µ_2, σ_2²) + ··· + w_k N(µ_k, σ_k²), in d dimensions.

Learning problem: given i.i.d. samples from F(·), find the components (µ_i, σ_i, w_i). Really a clustering problem.

In 1 dimension, we can solve the learning problem if the means of the component densities are Ω(1) standard deviations apart.

But in d dimensions: approximate k-means fails. A pair of samples from different clusters may be closer than a pair from the same cluster!

17 SVD to the Rescue

For a mixture of k spherical Gaussians (with different variances), the best-fit k-dimensional subspace (found by SVD) passes through all k centers. [Vempala, Wang]

Beautiful proof: for one spherical Gaussian with non-zero mean, the best-fit 1-dimensional subspace passes through the mean. And any k-dimensional subspace containing the mean is a best-fit k-dimensional subspace.

So now, if a k-dimensional subspace contains all k means, it is individually the best for each component Gaussian!

Simple observation to finish: given the k-dimensional space containing the means, we need only solve a k-dimensional problem. This can be done in time exponential only in k.
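The Vempala–Wang fact can be seen numerically. The sketch below uses illustrative dimensions and separation (not the paper's algorithm): draw samples from a two-component spherical mixture, compute the top-2 right singular subspace of the sample matrix, and check that the true means lie almost entirely inside it.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, k = 200, 500, 2                  # dimension, samples per component
mu1 = np.zeros(d); mu1[0] = 5.0        # two unit-variance components,
mu2 = np.zeros(d); mu2[1] = 5.0        # means 5*sqrt(2) apart

X = np.vstack([mu1 + rng.standard_normal((m, d)),
               mu2 + rng.standard_normal((m, d))])

# Best-fit k-dim subspace for the rows of X = span of top-k right singular vectors
_, _, Vt = np.linalg.svd(X, full_matrices=False)

# Both true centers lie (almost) inside that subspace: projecting a center
# onto it preserves nearly all of its norm
for mu in (mu1, mu2):
    frac = np.linalg.norm(Vt[:k] @ mu) / np.linalg.norm(mu)
    print(round(frac, 3))              # close to 1
```

Projecting the samples into this subspace reduces the clustering problem from d dimensions to k, after which distance-based clustering becomes viable again.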

20 Planted Clique Problem

Given G = G(n, 1/2) + a clique on S × S (S unknown, |S| = s), find S in polynomial time. Best known: s ∈ Ω(√n).

[The slide shows the n × n ±1 matrix A: random ±1 entries, with an all-ones s × s block on S × S.]

‖Planted Clique block‖ = s. Random matrix theory: a random ±1 matrix has norm at most 2√n. So SVD finds S when s ≳ √n. [Alon; Boppana 1985]

Feldman, Grigorescu, Reyzin, Vempala, Xia (2014): cannot be beaten by statistical learning algorithms.
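A minimal numerical sketch of the spectral argument (sizes illustrative, with s taken well above √n): plant an all-ones block in a random symmetric ±1 matrix and read S off the top eigenvector.

```python
import numpy as np

rng = np.random.default_rng(2)
n, s = 500, 100                        # s = 100 vs 2*sqrt(n) ~ 45
S = rng.choice(n, size=s, replace=False)

A = rng.choice([-1.0, 1.0], size=(n, n))
A = np.triu(A, 1); A = A + A.T         # random symmetric ±1, zero diagonal
A[np.ix_(S, S)] = 1.0                  # planted all-ones block

# The planted block contributes an eigenvalue ~ s, which beats the random
# part's norm ~ 2*sqrt(n), so the top eigenvector concentrates on S
vals, vecs = np.linalg.eigh(A)
v = vecs[:, -1]                        # eigenvector of the largest eigenvalue
guess = np.argsort(-np.abs(v))[:s]     # take the s largest coordinates
print(len(set(guess) & set(S)) / s)    # close to 1.0
```

In practice one cleans up the candidate set with a local step (e.g. keep vertices adjacent to most of the guess), but the spectral step already recovers almost all of S once s clears the √n barrier.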

25 Planted Gaussians: Signal and Noise

A is an n × n matrix and S ⊆ [n], |S| = k. The A_ij are all independent random variables.

For i, j ∈ S, Pr(A_ij ≥ µ) ≥ 1/2 (e.g. N(µ, σ²)). Signal = µ.

For other i, j, A_ij is N(0, σ²). Noise = σ.

Given A, µ, σ, find S. [Recall Planted Clique.]

[The slide shows A with a mean-µ block on S × S and N(0, σ²) entries elsewhere.]

32 Exponential Advantage in SNR by Thresholding

Brave new step: threshold the entries of A at µ to get a 0-1 matrix B.

[The slide shows E(B): entries ≥ 1/2 on the S × S block, ≈ exp(−µ²/2σ²) elsewhere.]

Subtract the baseline exp(−µ²/2σ²): the remaining block has norm ≥ k/4, while the rest is a random matrix of norm about √(n exp(−cµ²/σ²)).

So SVD finds S provided k/4 > √(n exp(−cµ²/σ²)), i.e. exp(c(µ/σ)²) ≳ n/k².

Cf.: ordinary SVD succeeds only if µ/σ > √n / k.
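A numeric sketch of the thresholding step (sizes and constants are illustrative, and the baseline is estimated empirically from B rather than via exp(−µ²/2σ²)):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, sigma, mu = 400, 40, 1.0, 2.5
S = np.arange(k)                          # planted set (known here only to check)

A = sigma * rng.standard_normal((n, n))
A[np.ix_(S, S)] += mu                     # Pr(A_ij >= mu) = 1/2 on the block

B = (A >= mu).astype(float)               # threshold at mu -> 0-1 matrix
print(round(B[np.ix_(S, S)].mean(), 2))   # about 0.5 inside the block
print(round(B[k:, k:].mean(), 3))         # exponentially small outside

# Subtract the (empirical) baseline; the top singular vector of the
# centered 0-1 matrix then concentrates on the rows of S
u = np.linalg.svd(B - B.mean())[0][:, 0]
guess = np.argsort(-np.abs(u))[:k]
print(len(set(guess) & set(S)) / k)       # close to 1.0
```

After thresholding, the block stands at height ≈ 1/2 over an exponentially sparse background, which is why the SNR requirement improves from polynomial to logarithmic in n/k.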

39 Thresholding: Second Plus

Data points {A_1, A_2, ..., A_j, ...} in R^d; d features.

Data points are in 2 SOFT clusters: data point j belongs with weight w_j to cluster 1 and 1 − w_j to cluster 2. (More generally, k clusters.)

Each cluster has some dominant features and each data point has a dominant cluster.

A_ij ≈ µ if feature i is a dominant feature of the dominant topic of data point j. A_ij ≈ σ otherwise.

If the variance above µ is larger than the gap between µ and σ, a 2-clustering criterion (like 2-means) may split the high-weight cluster instead of separating it from the others.

Two differences from mixtures: soft clusters, and high variance in dominant features.

45 Topic Modeling: The Problem

Joint work with T. Bansal and C. Bhattacharyya.

d features: the words in the dictionary. A document is a d-dimensional (column) vector.

k topics. Topic l is a d-vector (probabilities of words in the topic).

To generate document j, generate a random convex combination of topic vectors. Then generate the words of document j in i.i.d. trials, each from the multinomial with probabilities given by that convex combination.

***DRAW PICTURE ON BOARD WITH SPORTS, POLITICS, WEATHER***

The Topic Modeling Problem: given only A, find an approximation to all topic vectors so that the l_1 error in each topic vector is at most ε. The l_1 error is crucial (l_2 misses small words).

Generally NP-hard.
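The generative process just described can be sketched in a few lines. All sizes and Dirichlet parameters below are illustrative assumptions, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, n_docs, doc_len = 1000, 3, 200, 400

# k topic vectors, each a probability distribution over the d words
topics = rng.dirichlet(np.full(d, 0.05), size=k)        # shape (k, d)

# Document j: a random convex combination of topics, then i.i.d. word draws
weights = rng.dirichlet(np.full(k, 0.3), size=n_docs)   # shape (n_docs, k)
word_probs = weights @ topics                           # each row sums to 1
A = np.array([rng.multinomial(doc_len, p / p.sum())     # renormalize vs float drift
              for p in word_probs]).T / doc_len
print(A.shape)            # (1000, 200): column j = word frequencies of doc j
```

Each column of A is a noisy (multinomial) observation whose expectation is the document's convex combination of topic vectors; the problem is to invert this process from A alone.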

48 Topic Modeling is Soft Clustering

Topic vectors ↔ cluster centers.

Each data point (document) belongs to a weighted combination of clusters. It is generated from a distribution (which happens to be multinomial) with expectation equal to the weighted combination.

Even if we manage to solve the clustering problem somehow, it is not true that cluster centers are averages of documents. Big distinction from learning mixtures, which is hard clustering.

49 Geometry: Topic Modeling = Soft Clustering

Given the documents (noisy versions of their means), find the µ_l. It helps to find nearly pure documents (X near a corner).

[Figure: a simplex with corners µ_1, µ_2, µ_3, where µ_l is the l-th topic vector; each X is a weighted combination of the µ_l, and the words of a document are i.i.d. choices with mean X.]

56 Prior Results and Assumptions

Under the Pure Topics and Primary Words (1 − ε of the words are primary) assumptions, SVD solves it. [Papadimitriou, Raghavan, Tamaki, Vempala]

Belief: SVD cannot do the non-pure-topic case.

LDA: the most used model. [Blei, Ng, Jordan] Multiple topics per document.

Anandkumar, Foster, Hsu, Kakade, Liu do topic modeling under LDA, to l_2 error, using clever tensor methods. Parameters.

Arora, Ge, Moitra assume Anchor Words + other parameters: each topic has one word (a) occurring only in that topic and (b) with high frequency. First provable algorithm.

Our goals: intuitive, empirically verified assumptions; a natural, provable algorithm.

61 Our Assumptions

Intuitive to topic modeling, not numerical parameters like condition number.

Catchwords: each topic has a set of words such that (a) each occurs more frequently in that topic than in the others and (b) together, they have high frequency.

Dominant Topics: each document has a dominant topic, which has weight (in that document) of at least some α, whereas non-dominant topics have weight at most some β.

Nearly Pure Documents: each topic has a (small) fraction of documents that are 1 − δ pure for that topic.

No Local Min: for every word, the plot of the number of documents versus the number of occurrences of the word (conditioned on the dominant topic) has no local minimum. [Zipf's law, or unimodal.]

67 The Algorithm: Threshold SVD (TSVD)

s = number of documents. For this talk, the probability that each topic is dominant is 1/k.

Threshold: compute a threshold for each word i. First gap: the largest ζ such that A_ij ≥ ζ for at least s/(2k) of the documents j, and A_ij = ζ for at most εs of them.

SVD: use SVD on the thresholded matrix to get starting centers for the k-means algorithm.

k-means: run k-means. Will show: this identifies the dominant topic of each document.

Identify Catchwords: find the set of high-frequency words in each cluster. Will show: this is the set of catchwords for the topic.

Identify Pure Documents: find the set of documents with the highest total number of occurrences of the set of catchwords. Will show: these are nearly pure documents. Their average gives the topic vector.
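The first three steps (threshold → SVD → k-means) can be sketched as follows. This is an assumption-laden toy, not the paper's algorithm: the first-gap rule is replaced by a caller-supplied per-word threshold `zeta`, and the catchword and pure-document steps are omitted.

```python
import numpy as np

def tsvd_cluster(A, k, iters=20, *, zeta):
    """A: d x s word-document matrix. Returns a dominant-topic label per doc."""
    B = (A >= zeta[:, None]).astype(float)        # threshold each word (row)
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    P = (s[:k, None] * Vt[:k]).T                  # docs in top-k SVD coordinates
    centers = [P[0]]                              # farthest-first initialization
    for _ in range(k - 1):
        dists = np.min([((P - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(P[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(iters):                        # plain Lloyd's k-means
        labels = np.argmin(((P[:, None, :] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([P[labels == c].mean(0) if np.any(labels == c)
                            else centers[c] for c in range(k)])
    return labels

# Toy corpus: two topics with disjoint high-count word blocks
A = np.ones((20, 60))
A[:10, :30] = 6.0                                 # topic-0 docs use words 0..9
A[10:, 30:] = 6.0                                 # topic-1 docs use words 10..19
labels = tsvd_cluster(A, k=2, zeta=np.full(20, 3.0))
print(len(set(labels[:30])), len(set(labels[30:])), labels[0] != labels[30])
```

On this toy input the thresholded matrix is exactly block-structured, so the SVD projection collapses each topic's documents to a single point and k-means separates them cleanly.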

68 The Advantage of Thresholding

[Figure: in the thresholded matrix, the diagonal blue blocks are the catchwords for each topic; black entries are non-catchwords.]

69 [Figure: documents X near the simplex corners µ_1, µ_2, µ_3; Threshold + SVD + k-means recovers the dominant topics.]

75 Properties of Thresholding

Using No Local Min, show: no threshold splits any dominant topic in the middle. So the thresholded matrix is a block matrix for catchwords. But for non-catchwords, it can be high on several topics.

PICTURE ON THE BOARD OF A BLOCK MATRIX.

Done? No. We need inter-cluster separation ≥ intra-cluster spread (the variance inside a cluster).

Catchwords provide sufficient inter-cluster separation.

The inside-cluster variance is bounded with machinery from random matrix theory. Beware: only the columns are independent; the rows are not.

Appeal to a result on k-means [Kumar, K.]: if the inter-cluster separation exceeds the inside-cluster directional standard deviation, then SVD followed by k-means clusters correctly.

76 Getting Topic Vectors

PICTURE OF SIMPLEX with the columns of M as extreme points and a cluster of documents for each dominant topic. Taking the average of the documents in T_l is no good.

77 Empirical Results: Datasets

NIPS: 1,500 NIPS full papers.
NYT: random subset of 30,000 documents from the New York Times dataset.
Pubmed: random subset of 30,000 documents from the Pubmed abstracts dataset.
20NG: 13,389 documents from the 20NewsGroup dataset.

78 Empirical Results: Assumptions

[Table: fraction of documents satisfying the dominant topic assumption, per corpus (NIPS, NYT, Pubmed, 20NG) with document count and K, at α = 0.4, 0.8, 0.9.]

[Table: Catchwords (CW) assumption with ρ = 1.1, ε = 0.25: mean per-topic frequency of CW and % of topics with CW, per corpus.]

79 Empirical Results: Semi-synthetic Data

Generate semi-synthetic corpora from an LDA model trained by MCMC, to ensure that the synthetic corpora retain the characteristics of real data.

Gibbs sampling is run for 1000 iterations on all four datasets, and the final word-topic distribution is used to generate varying numbers (s) of synthetic documents, with the document-topic distribution drawn from a symmetric Dirichlet with hyper-parameter 0.01.

Note that the synthetic data is not guaranteed to satisfy the dominant topic assumption for every document; on average about 80% of the documents satisfy it.

80 Empirical Results: L1 Reconstruction Error

And percent improvement over Recover-KL. The total average improvement over R-KL is 20%.

[Table: L1 reconstruction error of Tensor, Recover-L2, Recover-KL, and TSVD, with % improvement, on semi-synthetic corpora from NIPS, Pubmed, 20NG, and NYT with 40,000–200,000 documents.]

81 Empirical Results: L1 Reconstruction Error

Histogram of L1 error across topics for 40k synthetic documents. On the majority of topics (> 90%), the recovery error for TSVD is significantly smaller than for Recover-KL.

[Figure: histograms of L1 reconstruction error per topic for algorithms R-KL and TSVD, on NIPS, NYT, Pubmed, and 20NG.]

82 Empirical Results: Perplexity & Topic Coherence

[Figure: perplexity and topic coherence of TSVD, Recover-L2, and Recover-KL on 20NG, NIPS, NYT, and Pubmed.]

83 Top 5 words of some topics on the real NYT dataset. Catchwords and anchor words highlighted. (zzz: identifier placed by the NYT dataset.)

TSVD: cup minutes add tablespoon oil | Recover-KL: cup minutes tablespoon add oil | Gibbs: cup minutes add tablespoon oil
TSVD: team season coach zzz_ram game | Recover-KL: game team season play zzz_ram | Gibbs: team season game coach zzz_nfl
TSVD: patient doctor drug cancer study | Recover-KL: patient drug doctor percent fund | Gibbs: patient doctor drug medical cancer
TSVD: zzz_john_mccain zzz_mccain zzz_bush zzz_george_bush campaign | Recover-KL: zzz_john_mccain zzz_george_bush campaign republican voter | Gibbs: zzz_john_mccain zzz_george_bush campaign zzz_bush zzz_mccain
TSVD: house room building wall floor | Recover-KL: room show look home house | Gibbs: room look water house hand
TSVD: film movie actor character zzz_oscar | Recover-KL: zzz_god christian religious zzz_jesus church | Gibbs: film show movie music book
TSVD: pope church book jewish religious | Recover-KL: film movie character play director | Gibbs: religious church jewish jew zzz_god


More information

AP Statistics Notes Unit Two: The Normal Distributions

AP Statistics Notes Unit Two: The Normal Distributions AP Statistics Ntes Unit Tw: The Nrmal Distributins Syllabus Objectives: 1.5 The student will summarize distributins f data measuring the psitin using quartiles, percentiles, and standardized scres (z-scres).

More information

x 1 Outline IAML: Logistic Regression Decision Boundaries Example Data

x 1 Outline IAML: Logistic Regression Decision Boundaries Example Data Outline IAML: Lgistic Regressin Charles Suttn and Victr Lavrenk Schl f Infrmatics Semester Lgistic functin Lgistic regressin Learning lgistic regressin Optimizatin The pwer f nn-linear basis functins Least-squares

More information

Admissibility Conditions and Asymptotic Behavior of Strongly Regular Graphs

Admissibility Conditions and Asymptotic Behavior of Strongly Regular Graphs Admissibility Cnditins and Asympttic Behavir f Strngly Regular Graphs VASCO MOÇO MANO Department f Mathematics University f Prt Oprt PORTUGAL vascmcman@gmailcm LUÍS ANTÓNIO DE ALMEIDA VIEIRA Department

More information

The blessing of dimensionality for kernel methods

The blessing of dimensionality for kernel methods fr kernel methds Building classifiers in high dimensinal space Pierre Dupnt Pierre.Dupnt@ucluvain.be Classifiers define decisin surfaces in sme feature space where the data is either initially represented

More information

Chemistry 20 Lesson 11 Electronegativity, Polarity and Shapes

Chemistry 20 Lesson 11 Electronegativity, Polarity and Shapes Chemistry 20 Lessn 11 Electrnegativity, Plarity and Shapes In ur previus wrk we learned why atms frm cvalent bnds and hw t draw the resulting rganizatin f atms. In this lessn we will learn (a) hw the cmbinatin

More information

What is Statistical Learning?

What is Statistical Learning? What is Statistical Learning? Sales 5 10 15 20 25 Sales 5 10 15 20 25 Sales 5 10 15 20 25 0 50 100 200 300 TV 0 10 20 30 40 50 Radi 0 20 40 60 80 100 Newspaper Shwn are Sales vs TV, Radi and Newspaper,

More information

k-nearest Neighbor How to choose k Average of k points more reliable when: Large k: noise in attributes +o o noise in class labels

k-nearest Neighbor How to choose k Average of k points more reliable when: Large k: noise in attributes +o o noise in class labels Mtivating Example Memry-Based Learning Instance-Based Learning K-earest eighbr Inductive Assumptin Similar inputs map t similar utputs If nt true => learning is impssible If true => learning reduces t

More information

MATHEMATICS SYLLABUS SECONDARY 5th YEAR

MATHEMATICS SYLLABUS SECONDARY 5th YEAR Eurpean Schls Office f the Secretary-General Pedaggical Develpment Unit Ref. : 011-01-D-8-en- Orig. : EN MATHEMATICS SYLLABUS SECONDARY 5th YEAR 6 perid/week curse APPROVED BY THE JOINT TEACHING COMMITTEE

More information

COMP 551 Applied Machine Learning Lecture 11: Support Vector Machines

COMP 551 Applied Machine Learning Lecture 11: Support Vector Machines COMP 551 Applied Machine Learning Lecture 11: Supprt Vectr Machines Instructr: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/cmp551 Unless therwise nted, all material psted fr this curse

More information

Figure 1a. A planar mechanism.

Figure 1a. A planar mechanism. ME 5 - Machine Design I Fall Semester 0 Name f Student Lab Sectin Number EXAM. OPEN BOOK AND CLOSED NOTES. Mnday, September rd, 0 Write n ne side nly f the paper prvided fr yur slutins. Where necessary,

More information

Sections 15.1 to 15.12, 16.1 and 16.2 of the textbook (Robbins-Miller) cover the materials required for this topic.

Sections 15.1 to 15.12, 16.1 and 16.2 of the textbook (Robbins-Miller) cover the materials required for this topic. Tpic : AC Fundamentals, Sinusidal Wavefrm, and Phasrs Sectins 5. t 5., 6. and 6. f the textbk (Rbbins-Miller) cver the materials required fr this tpic.. Wavefrms in electrical systems are current r vltage

More information

Slide04 (supplemental) Haykin Chapter 4 (both 2nd and 3rd ed): Multi-Layer Perceptrons

Slide04 (supplemental) Haykin Chapter 4 (both 2nd and 3rd ed): Multi-Layer Perceptrons Slide04 supplemental) Haykin Chapter 4 bth 2nd and 3rd ed): Multi-Layer Perceptrns CPSC 636-600 Instructr: Ynsuck Che Heuristic fr Making Backprp Perfrm Better 1. Sequential vs. batch update: fr large

More information

Chapter 8: The Binomial and Geometric Distributions

Chapter 8: The Binomial and Geometric Distributions Sectin 8.1: The Binmial Distributins Chapter 8: The Binmial and Gemetric Distributins A randm variable X is called a BINOMIAL RANDOM VARIABLE if it meets ALL the fllwing cnditins: 1) 2) 3) 4) The MOST

More information

Determining the Accuracy of Modal Parameter Estimation Methods

Determining the Accuracy of Modal Parameter Estimation Methods Determining the Accuracy f Mdal Parameter Estimatin Methds by Michael Lee Ph.D., P.E. & Mar Richardsn Ph.D. Structural Measurement Systems Milpitas, CA Abstract The mst cmmn type f mdal testing system

More information

5 th grade Common Core Standards

5 th grade Common Core Standards 5 th grade Cmmn Cre Standards In Grade 5, instructinal time shuld fcus n three critical areas: (1) develping fluency with additin and subtractin f fractins, and develping understanding f the multiplicatin

More information

Homology groups of disks with holes

Homology groups of disks with holes Hmlgy grups f disks with hles THEOREM. Let p 1,, p k } be a sequence f distinct pints in the interir unit disk D n where n 2, and suppse that fr all j the sets E j Int D n are clsed, pairwise disjint subdisks.

More information

IAML: Support Vector Machines

IAML: Support Vector Machines 1 / 22 IAML: Supprt Vectr Machines Charles Suttn and Victr Lavrenk Schl f Infrmatics Semester 1 2 / 22 Outline Separating hyperplane with maimum margin Nn-separable training data Epanding the input int

More information

A Matrix Representation of Panel Data

A Matrix Representation of Panel Data web Extensin 6 Appendix 6.A A Matrix Representatin f Panel Data Panel data mdels cme in tw brad varieties, distinct intercept DGPs and errr cmpnent DGPs. his appendix presents matrix algebra representatins

More information

1 The limitations of Hartree Fock approximation

1 The limitations of Hartree Fock approximation Chapter: Pst-Hartree Fck Methds - I The limitatins f Hartree Fck apprximatin The n electrn single determinant Hartree Fck wave functin is the variatinal best amng all pssible n electrn single determinants

More information

Support-Vector Machines

Support-Vector Machines Supprt-Vectr Machines Intrductin Supprt vectr machine is a linear machine with sme very nice prperties. Haykin chapter 6. See Alpaydin chapter 13 fr similar cntent. Nte: Part f this lecture drew material

More information

Equilibrium of Stress

Equilibrium of Stress Equilibrium f Stress Cnsider tw perpendicular planes passing thrugh a pint p. The stress cmpnents acting n these planes are as shwn in ig. 3.4.1a. These stresses are usuall shwn tgether acting n a small

More information

Multiple Source Multiple. using Network Coding

Multiple Source Multiple. using Network Coding Multiple Surce Multiple Destinatin Tplgy Inference using Netwrk Cding Pegah Sattari EECS, UC Irvine Jint wrk with Athina Markpulu, at UCI, Christina Fraguli, at EPFL, Lausanne Outline Netwrk Tmgraphy Gal,

More information

1996 Engineering Systems Design and Analysis Conference, Montpellier, France, July 1-4, 1996, Vol. 7, pp

1996 Engineering Systems Design and Analysis Conference, Montpellier, France, July 1-4, 1996, Vol. 7, pp THE POWER AND LIMIT OF NEURAL NETWORKS T. Y. Lin Department f Mathematics and Cmputer Science San Jse State University San Jse, Califrnia 959-003 tylin@cs.ssu.edu and Bereley Initiative in Sft Cmputing*

More information

The general linear model and Statistical Parametric Mapping I: Introduction to the GLM

The general linear model and Statistical Parametric Mapping I: Introduction to the GLM The general linear mdel and Statistical Parametric Mapping I: Intrductin t the GLM Alexa Mrcm and Stefan Kiebel, Rik Hensn, Andrew Hlmes & J-B J Pline Overview Intrductin Essential cncepts Mdelling Design

More information

Source Coding and Compression

Source Coding and Compression Surce Cding and Cmpressin Heik Schwarz Cntact: Dr.-Ing. Heik Schwarz heik.schwarz@hhi.fraunhfer.de Heik Schwarz Surce Cding and Cmpressin September 22, 2013 1 / 60 PartI: Surce Cding Fundamentals Heik

More information

Physical Layer: Outline

Physical Layer: Outline 18-: Intrductin t Telecmmunicatin Netwrks Lectures : Physical Layer Peter Steenkiste Spring 01 www.cs.cmu.edu/~prs/nets-ece Physical Layer: Outline Digital Representatin f Infrmatin Characterizatin f Cmmunicatin

More information

4th Indian Institute of Astrophysics - PennState Astrostatistics School July, 2013 Vainu Bappu Observatory, Kavalur. Correlation and Regression

4th Indian Institute of Astrophysics - PennState Astrostatistics School July, 2013 Vainu Bappu Observatory, Kavalur. Correlation and Regression 4th Indian Institute f Astrphysics - PennState Astrstatistics Schl July, 2013 Vainu Bappu Observatry, Kavalur Crrelatin and Regressin Rahul Ry Indian Statistical Institute, Delhi. Crrelatin Cnsider a tw

More information

Chapter 2 GAUSS LAW Recommended Problems:

Chapter 2 GAUSS LAW Recommended Problems: Chapter GAUSS LAW Recmmended Prblems: 1,4,5,6,7,9,11,13,15,18,19,1,7,9,31,35,37,39,41,43,45,47,49,51,55,57,61,6,69. LCTRIC FLUX lectric flux is a measure f the number f electric filed lines penetrating

More information

COMP 551 Applied Machine Learning Lecture 4: Linear classification

COMP 551 Applied Machine Learning Lecture 4: Linear classification COMP 551 Applied Machine Learning Lecture 4: Linear classificatin Instructr: Jelle Pineau (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/cmp551 Unless therwise nted, all material psted

More information

More Tutorial at

More Tutorial at Answer each questin in the space prvided; use back f page if extra space is needed. Answer questins s the grader can READILY understand yur wrk; nly wrk n the exam sheet will be cnsidered. Write answers,

More information

Comparing Several Means: ANOVA. Group Means and Grand Mean

Comparing Several Means: ANOVA. Group Means and Grand Mean STAT 511 ANOVA and Regressin 1 Cmparing Several Means: ANOVA Slide 1 Blue Lake snap beans were grwn in 12 pen-tp chambers which are subject t 4 treatments 3 each with O 3 and SO 2 present/absent. The ttal

More information

Pattern Recognition 2014 Support Vector Machines

Pattern Recognition 2014 Support Vector Machines Pattern Recgnitin 2014 Supprt Vectr Machines Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Pattern Recgnitin 1 / 55 Overview 1 Separable Case 2 Kernel Functins 3 Allwing Errrs (Sft

More information

SPH3U1 Lesson 06 Kinematics

SPH3U1 Lesson 06 Kinematics PROJECTILE MOTION LEARNING GOALS Students will: Describe the mtin f an bject thrwn at arbitrary angles thrugh the air. Describe the hrizntal and vertical mtins f a prjectile. Slve prjectile mtin prblems.

More information

CHM112 Lab Graphing with Excel Grading Rubric

CHM112 Lab Graphing with Excel Grading Rubric Name CHM112 Lab Graphing with Excel Grading Rubric Criteria Pints pssible Pints earned Graphs crrectly pltted and adhere t all guidelines (including descriptive title, prperly frmatted axes, trendline

More information

Experiment #3. Graphing with Excel

Experiment #3. Graphing with Excel Experiment #3. Graphing with Excel Study the "Graphing with Excel" instructins that have been prvided. Additinal help with learning t use Excel can be fund n several web sites, including http://www.ncsu.edu/labwrite/res/gt/gt-

More information

Interference is when two (or more) sets of waves meet and combine to produce a new pattern.

Interference is when two (or more) sets of waves meet and combine to produce a new pattern. Interference Interference is when tw (r mre) sets f waves meet and cmbine t prduce a new pattern. This pattern can vary depending n the riginal wave directin, wavelength, amplitude, etc. The tw mst extreme

More information

Physics 2010 Motion with Constant Acceleration Experiment 1

Physics 2010 Motion with Constant Acceleration Experiment 1 . Physics 00 Mtin with Cnstant Acceleratin Experiment In this lab, we will study the mtin f a glider as it accelerates dwnhill n a tilted air track. The glider is supprted ver the air track by a cushin

More information

BLAST / HIDDEN MARKOV MODELS

BLAST / HIDDEN MARKOV MODELS CS262 (Winter 2015) Lecture 5 (January 20) Scribe: Kat Gregry BLAST / HIDDEN MARKOV MODELS BLAST CONTINUED HEURISTIC LOCAL ALIGNMENT Use Cmmnly used t search vast bilgical databases (n the rder f terabases/tetrabases)

More information

Fall 2013 Physics 172 Recitation 3 Momentum and Springs

Fall 2013 Physics 172 Recitation 3 Momentum and Springs Fall 03 Physics 7 Recitatin 3 Mmentum and Springs Purpse: The purpse f this recitatin is t give yu experience wrking with mmentum and the mmentum update frmula. Readings: Chapter.3-.5 Learning Objectives:.3.

More information

Hiding in plain sight

Hiding in plain sight Hiding in plain sight Principles f stegangraphy CS349 Cryptgraphy Department f Cmputer Science Wellesley Cllege The prisners prblem Stegangraphy 1-2 1 Secret writing Lemn juice is very nearly clear s it

More information

The Kullback-Leibler Kernel as a Framework for Discriminant and Localized Representations for Visual Recognition

The Kullback-Leibler Kernel as a Framework for Discriminant and Localized Representations for Visual Recognition The Kullback-Leibler Kernel as a Framewrk fr Discriminant and Lcalized Representatins fr Visual Recgnitin Nun Vascncels Purdy H Pedr Mren ECE Department University f Califrnia, San Dieg HP Labs Cambridge

More information

COMP 551 Applied Machine Learning Lecture 9: Support Vector Machines (cont d)

COMP 551 Applied Machine Learning Lecture 9: Support Vector Machines (cont d) COMP 551 Applied Machine Learning Lecture 9: Supprt Vectr Machines (cnt d) Instructr: Herke van Hf (herke.vanhf@mail.mcgill.ca) Slides mstly by: Class web page: www.cs.mcgill.ca/~hvanh2/cmp551 Unless therwise

More information

the results to larger systems due to prop'erties of the projection algorithm. First, the number of hidden nodes must

the results to larger systems due to prop'erties of the projection algorithm. First, the number of hidden nodes must M.E. Aggune, M.J. Dambrg, M.A. El-Sharkawi, R.J. Marks II and L.E. Atlas, "Dynamic and static security assessment f pwer systems using artificial neural netwrks", Prceedings f the NSF Wrkshp n Applicatins

More information

Probability, Random Variables, and Processes. Probability

Probability, Random Variables, and Processes. Probability Prbability, Randm Variables, and Prcesses Prbability Prbability Prbability thery: branch f mathematics fr descriptin and mdelling f randm events Mdern prbability thery - the aximatic definitin f prbability

More information

Collocation Map for Overcoming Data Sparseness

Collocation Map for Overcoming Data Sparseness Cllcatin Map fr Overcming Data Sparseness Mnj Kim, Yung S. Han, and Key-Sun Chi Department f Cmputer Science Krea Advanced Institute f Science and Technlgy Taejn, 305-701, Krea mj0712~eve.kaist.ac.kr,

More information

Statistics, Numerical Models and Ensembles

Statistics, Numerical Models and Ensembles Statistics, Numerical Mdels and Ensembles Duglas Nychka, Reinhard Furrer,, Dan Cley Claudia Tebaldi, Linda Mearns, Jerry Meehl and Richard Smith (UNC). Spatial predictin and data assimilatin Precipitatin

More information

Name AP CHEM / / Chapter 1 Chemical Foundations

Name AP CHEM / / Chapter 1 Chemical Foundations Name AP CHEM / / Chapter 1 Chemical Fundatins Metric Cnversins All measurements in chemistry are made using the metric system. In using the metric system yu must be able t cnvert between ne value and anther.

More information

Kinetic Model Completeness

Kinetic Model Completeness 5.68J/10.652J Spring 2003 Lecture Ntes Tuesday April 15, 2003 Kinetic Mdel Cmpleteness We say a chemical kinetic mdel is cmplete fr a particular reactin cnditin when it cntains all the species and reactins

More information

Module 3: Gaussian Process Parameter Estimation, Prediction Uncertainty, and Diagnostics

Module 3: Gaussian Process Parameter Estimation, Prediction Uncertainty, and Diagnostics Mdule 3: Gaussian Prcess Parameter Estimatin, Predictin Uncertainty, and Diagnstics Jerme Sacks and William J Welch Natinal Institute f Statistical Sciences and University f British Clumbia Adapted frm

More information

[COLLEGE ALGEBRA EXAM I REVIEW TOPICS] ( u s e t h i s t o m a k e s u r e y o u a r e r e a d y )

[COLLEGE ALGEBRA EXAM I REVIEW TOPICS] ( u s e t h i s t o m a k e s u r e y o u a r e r e a d y ) (Abut the final) [COLLEGE ALGEBRA EXAM I REVIEW TOPICS] ( u s e t h i s t m a k e s u r e y u a r e r e a d y ) The department writes the final exam s I dn't really knw what's n it and I can't very well

More information

Lecture 17: Free Energy of Multi-phase Solutions at Equilibrium

Lecture 17: Free Energy of Multi-phase Solutions at Equilibrium Lecture 17: 11.07.05 Free Energy f Multi-phase Slutins at Equilibrium Tday: LAST TIME...2 FREE ENERGY DIAGRAMS OF MULTI-PHASE SOLUTIONS 1...3 The cmmn tangent cnstructin and the lever rule...3 Practical

More information

and the Doppler frequency rate f R , can be related to the coefficients of this polynomial. The relationships are:

and the Doppler frequency rate f R , can be related to the coefficients of this polynomial. The relationships are: Algrithm fr Estimating R and R - (David Sandwell, SIO, August 4, 2006) Azimith cmpressin invlves the alignment f successive eches t be fcused n a pint target Let s be the slw time alng the satellite track

More information

You need to be able to define the following terms and answer basic questions about them:

You need to be able to define the following terms and answer basic questions about them: CS440/ECE448 Sectin Q Fall 2017 Midterm Review Yu need t be able t define the fllwing terms and answer basic questins abut them: Intr t AI, agents and envirnments Pssible definitins f AI, prs and cns f

More information

Lecture 5: Equilibrium and Oscillations

Lecture 5: Equilibrium and Oscillations Lecture 5: Equilibrium and Oscillatins Energy and Mtin Last time, we fund that fr a system with energy cnserved, v = ± E U m ( ) ( ) One result we see immediately is that there is n slutin fr velcity if

More information

NAME: Prof. Ruiz. 1. [5 points] What is the difference between simple random sampling and stratified random sampling?

NAME: Prof. Ruiz. 1. [5 points] What is the difference between simple random sampling and stratified random sampling? CS4445 ata Mining and Kwledge iscery in atabases. B Term 2014 Exam 1 Nember 24, 2014 Prf. Carlina Ruiz epartment f Cmputer Science Wrcester Plytechnic Institute NAME: Prf. Ruiz Prblem I: Prblem II: Prblem

More information

Midwest Big Data Summer School: Machine Learning I: Introduction. Kris De Brabanter

Midwest Big Data Summer School: Machine Learning I: Introduction. Kris De Brabanter Midwest Big Data Summer Schl: Machine Learning I: Intrductin Kris De Brabanter kbrabant@iastate.edu Iwa State University Department f Statistics Department f Cmputer Science June 24, 2016 1/24 Outline

More information

MATCHING TECHNIQUES. Technical Track Session VI. Emanuela Galasso. The World Bank

MATCHING TECHNIQUES. Technical Track Session VI. Emanuela Galasso. The World Bank MATCHING TECHNIQUES Technical Track Sessin VI Emanuela Galass The Wrld Bank These slides were develped by Christel Vermeersch and mdified by Emanuela Galass fr the purpse f this wrkshp When can we use

More information

3.4 Shrinkage Methods Prostate Cancer Data Example (Continued) Ridge Regression

3.4 Shrinkage Methods Prostate Cancer Data Example (Continued) Ridge Regression 3.3.4 Prstate Cancer Data Example (Cntinued) 3.4 Shrinkage Methds 61 Table 3.3 shws the cefficients frm a number f different selectin and shrinkage methds. They are best-subset selectin using an all-subsets

More information

Resampling Methods. Chapter 5. Chapter 5 1 / 52

Resampling Methods. Chapter 5. Chapter 5 1 / 52 Resampling Methds Chapter 5 Chapter 5 1 / 52 1 51 Validatin set apprach 2 52 Crss validatin 3 53 Btstrap Chapter 5 2 / 52 Abut Resampling An imprtant statistical tl Pretending the data as ppulatin and

More information

Chapter 8 Predicting Molecular Geometries

Chapter 8 Predicting Molecular Geometries Chapter 8 Predicting Mlecular Gemetries 8-1 Mlecular shape The Lewis diagram we learned t make in the last chapter are a way t find bnds between atms and lne pais f electrns n atms, but are nt intended

More information

Department of Economics, University of California, Davis Ecn 200C Micro Theory Professor Giacomo Bonanno. Insurance Markets

Department of Economics, University of California, Davis Ecn 200C Micro Theory Professor Giacomo Bonanno. Insurance Markets Department f Ecnmics, University f alifrnia, Davis Ecn 200 Micr Thery Prfessr Giacm Bnann Insurance Markets nsider an individual wh has an initial wealth f. ith sme prbability p he faces a lss f x (0

More information

initially lcated away frm the data set never win the cmpetitin, resulting in a nnptimal nal cdebk, [2] [3] [4] and [5]. Khnen's Self Organizing Featur

initially lcated away frm the data set never win the cmpetitin, resulting in a nnptimal nal cdebk, [2] [3] [4] and [5]. Khnen's Self Organizing Featur Cdewrd Distributin fr Frequency Sensitive Cmpetitive Learning with One Dimensinal Input Data Aristides S. Galanpuls and Stanley C. Ahalt Department f Electrical Engineering The Ohi State University Abstract

More information

SIZE BIAS IN LINE TRANSECT SAMPLING: A FIELD TEST. Mark C. Otto Statistics Research Division, Bureau of the Census Washington, D.C , U.S.A.

SIZE BIAS IN LINE TRANSECT SAMPLING: A FIELD TEST. Mark C. Otto Statistics Research Division, Bureau of the Census Washington, D.C , U.S.A. SIZE BIAS IN LINE TRANSECT SAMPLING: A FIELD TEST Mark C. Ott Statistics Research Divisin, Bureau f the Census Washingtn, D.C. 20233, U.S.A. and Kenneth H. Pllck Department f Statistics, Nrth Carlina State

More information

SOME CONSTRUCTIONS OF OPTIMAL BINARY LINEAR UNEQUAL ERROR PROTECTION CODES

SOME CONSTRUCTIONS OF OPTIMAL BINARY LINEAR UNEQUAL ERROR PROTECTION CODES Philips J. Res. 39, 293-304,1984 R 1097 SOME CONSTRUCTIONS OF OPTIMAL BINARY LINEAR UNEQUAL ERROR PROTECTION CODES by W. J. VAN OILS Philips Research Labratries, 5600 JA Eindhven, The Netherlands Abstract

More information

Module 4: General Formulation of Electric Circuit Theory

Module 4: General Formulation of Electric Circuit Theory Mdule 4: General Frmulatin f Electric Circuit Thery 4. General Frmulatin f Electric Circuit Thery All electrmagnetic phenmena are described at a fundamental level by Maxwell's equatins and the assciated

More information

Biplots in Practice MICHAEL GREENACRE. Professor of Statistics at the Pompeu Fabra University. Chapter 13 Offprint

Biplots in Practice MICHAEL GREENACRE. Professor of Statistics at the Pompeu Fabra University. Chapter 13 Offprint Biplts in Practice MICHAEL GREENACRE Prfessr f Statistics at the Pmpeu Fabra University Chapter 13 Offprint CASE STUDY BIOMEDICINE Cmparing Cancer Types Accrding t Gene Epressin Arrays First published:

More information

CS 477/677 Analysis of Algorithms Fall 2007 Dr. George Bebis Course Project Due Date: 11/29/2007

CS 477/677 Analysis of Algorithms Fall 2007 Dr. George Bebis Course Project Due Date: 11/29/2007 CS 477/677 Analysis f Algrithms Fall 2007 Dr. Gerge Bebis Curse Prject Due Date: 11/29/2007 Part1: Cmparisn f Srting Algrithms (70% f the prject grade) The bjective f the first part f the assignment is

More information

NUROP CONGRESS PAPER CHINESE PINYIN TO CHINESE CHARACTER CONVERSION

NUROP CONGRESS PAPER CHINESE PINYIN TO CHINESE CHARACTER CONVERSION NUROP Chinese Pinyin T Chinese Character Cnversin NUROP CONGRESS PAPER CHINESE PINYIN TO CHINESE CHARACTER CONVERSION CHIA LI SHI 1 AND LUA KIM TENG 2 Schl f Cmputing, Natinal University f Singapre 3 Science

More information

MATCHING TECHNIQUES Technical Track Session VI Céline Ferré The World Bank

MATCHING TECHNIQUES Technical Track Session VI Céline Ferré The World Bank MATCHING TECHNIQUES Technical Track Sessin VI Céline Ferré The Wrld Bank When can we use matching? What if the assignment t the treatment is nt dne randmly r based n an eligibility index, but n the basis

More information

READING STATECHART DIAGRAMS

READING STATECHART DIAGRAMS READING STATECHART DIAGRAMS Figure 4.48 A Statechart diagram with events The diagram in Figure 4.48 shws all states that the bject plane can be in during the curse f its life. Furthermre, it shws the pssible

More information

Cambridge Assessment International Education Cambridge Ordinary Level. Published

Cambridge Assessment International Education Cambridge Ordinary Level. Published Cambridge Assessment Internatinal Educatin Cambridge Ordinary Level ADDITIONAL MATHEMATICS 4037/1 Paper 1 Octber/Nvember 017 MARK SCHEME Maximum Mark: 80 Published This mark scheme is published as an aid

More information

In SMV I. IAML: Support Vector Machines II. This Time. The SVM optimization problem. We saw:

In SMV I. IAML: Support Vector Machines II. This Time. The SVM optimization problem. We saw: In SMV I IAML: Supprt Vectr Machines II Nigel Gddard Schl f Infrmatics Semester 1 We sa: Ma margin trick Gemetry f the margin and h t cmpute it Finding the ma margin hyperplane using a cnstrained ptimizatin

More information

Bayesian nonparametric modeling approaches for quantile regression

Bayesian nonparametric modeling approaches for quantile regression Bayesian nnparametric mdeling appraches fr quantile regressin Athanasis Kttas Department f Applied Mathematics and Statistics University f Califrnia, Santa Cruz Department f Statistics Athens University

More information

SUPPLEMENTARY MATERIAL GaGa: a simple and flexible hierarchical model for microarray data analysis

SUPPLEMENTARY MATERIAL GaGa: a simple and flexible hierarchical model for microarray data analysis SUPPLEMENTARY MATERIAL GaGa: a simple and flexible hierarchical mdel fr micrarray data analysis David Rssell Department f Bistatistics M.D. Andersn Cancer Center, Hustn, TX 77030, USA rsselldavid@gmail.cm

More information

Lecture 3: Principal Components Analysis (PCA)

Lecture 3: Principal Components Analysis (PCA) Lecture 3: Principal Cmpnents Analysis (PCA) Reading: Sectins 6.3.1, 10.1, 10.2, 10.4 STATS 202: Data mining and analysis Jnathan Taylr, 9/28 Slide credits: Sergi Bacallad 1 / 24 The bias variance decmpsitin

More information

Reinforcement Learning" CMPSCI 383 Nov 29, 2011!

Reinforcement Learning CMPSCI 383 Nov 29, 2011! Reinfrcement Learning" CMPSCI 383 Nv 29, 2011! 1 Tdayʼs lecture" Review f Chapter 17: Making Cmple Decisins! Sequential decisin prblems! The mtivatin and advantages f reinfrcement learning.! Passive learning!

More information

Why Don t They Get It??

Why Don t They Get It?? Why Dn t They Get It?? A 60-minute Webinar NEURO LINGUISTIC PROGRAMMING NLP is the way we stre and prcess infrmatin in ur brains, and then frm the wrds we use t cmmunicate. By learning abut NLP, yu can

More information