Fast Spectral Clustering via the Nyström Method


Fast Spectral Clustering via the Nyström Method

Anna Choromanska, Tony Jebara, Hyungtae Kim, Mahesh Mohan³, and Claire Monteleoni³

Department of Electrical Engineering, Columbia University, NY, USA
Department of Computer Science, Columbia University, NY, USA
³ Department of Computer Science, George Washington University, DC, USA
{aec63, tj8, hk56}@columbia.edu, {mahesh mohan, cmonte}@gwu.edu

Abstract. We propose and analyze a fast spectral clustering algorithm with computational complexity linear in the number of data points that is directly applicable to large-scale datasets. The algorithm combines two powerful techniques in machine learning: spectral clustering algorithms and Nyström methods, commonly used to obtain good-quality low-rank approximations of large matrices. The proposed algorithm applies the Nyström approximation to the graph Laplacian to perform clustering. We provide a theoretical analysis of the performance of the algorithm, show the error bound it achieves, and discuss the conditions under which the algorithm's performance is comparable to spectral clustering with the original graph Laplacian. We also present empirical results.

Keywords: spectral clustering, Nyström method, large-scale clustering, sampling, sparsity, performance guarantees, error bounds, unsupervised learning

1 Introduction

Clustering is one of the fundamental problems in machine learning. The recent widespread development of sensors, data-storage and data-acquisition devices has helped make large datasets commonplace. This, however, poses a serious computational challenge for the existing clustering techniques. Spectral clustering techniques (Luxburg, 2007) are widely used, due to their simplicity and empirical performance advantages compared to other clustering methods, such as k-means or single-linkage algorithms. However, a significant obstacle to scaling up spectral clustering to large datasets is that it requires building an affinity matrix between pairs of data points, which becomes computationally prohibitive for large datasets.
There have been several attempts to address this problem and make spectral clustering algorithms more applicable to large-scale problems. Here we study an

Main contact author: Anna Choromanska, e-mail: aec63@columbia.edu, mailing address: Department of Electrical Engineering, Columbia University, CEPSR 64, 4 Amsterdam Avenue, 7, NY, USA.

approach that extends the spectral clustering algorithm, described in Ng et al. (2001), via Nyström approximation techniques. Our work is most related to Williams and Seeger (2001); Fowlkes et al. (2004); Li et al. (2011), which use the Nyström method to sample the columns of the affinity matrix and further approximate the full matrix by using correlations between the sampled columns and the remaining columns (Fowlkes et al., 2004). However, these works did not provide performance guarantees; that is our primary contribution. Other approaches to scaling up spectral clustering include work by Yan et al. (2009), which used the k-means clustering algorithm (Lloyd, 1982) as a preprocessing step to spectral clustering, to reduce its computational complexity. The analysis assumes the data are generated by a mixture model (the same assumption is made in the work by Lashkari and Golland (2007)). Related work by Drineas and Mahoney (2005) performs non-uniform sampling of the Gram matrix and provides a bound on the approximation error; however, in order to achieve good performance one may need to sample a large number of columns (in special cases even O(n)) and, furthermore, the practicality of this technique for massive datasets may be limited (Yan et al., 2009). Several other works on constructing approximations that are tighter than the Nyström method's, for sparse graph Laplacians, have also emerged (Fung et al., 2011; Spielman and Teng, 2011). However, the computational complexity of these methods depends highly on the number of edges in the graph. In contrast, the Nyström method has a fixed complexity that depends only on the number of vertices n and the number of sampled columns l. This is potentially more useful in cases where a large number of edges exist but only a few have significantly large weights, as is the case in many sparse datasets that arise in applications such as collaborative filtering.
This paper combines the spectral clustering algorithm (Ng et al., 2001) with the Nyström approximation method by applying a Nyström approximation to the graph Laplacian. Our analysis differs from the approach of Belkin and Niyogi (2007) in that we focus on finite-sample analysis, whereas Belkin and Niyogi (2007) emphasized asymptotic results. In particular, they show that if points are sampled uniformly at random from an unknown sub-manifold M ⊂ R^N, then the eigenvectors of a suitably constructed graph Laplacian converge to the eigenfunctions of the Laplace-Beltrami operator on M. Our approach leads to a practical algorithm with complexity linear in the number of data points n. We provide performance guarantees for this algorithm by combining Nyström approximation analysis, using a uniform random sampling without replacement scheme due to Kumar et al. (2009), with perturbation theory analysis (Ng et al., 2001). We discuss conditions under which the algorithm's performance is comparable to spectral clustering with the original graph Laplacian.

2 Approach

2.1 Spectral Clustering Algorithm

Algorithm 1 Spectral clustering

Input: dataset S = {s_1, s_2, ..., s_n} ⊂ R^d, number of clusters k, kernel function κ : R^d × R^d → R
Output: k-clustering of S
A ∈ R^{n×n} s.t. A_ij = δ[i ≠ j] κ(s_i, s_j)
D ∈ R^{n×n} s.t. D_ij = δ[i = j] Σ_{m=1}^n A_im
L = I − D^{−1/2} A D^{−1/2}
X ∈ R^{n×k} s.t. X = SmallestEigenVectors(L, k)
Y ∈ R^{n×k} s.t. Y_im = X_im / √(Σ_m X_im²)
K = ClusterRows(Y) (via any k-clustering algorithm minimizing distortion, e.g. k-means)

In general, spectral clustering methods can be interpreted as graph partitioning algorithms, and the above algorithm (Algorithm 1) can be seen as graph partitioning with a normalized-cut cost function. Algorithm 1 shows the widely used normalized spectral clustering algorithm presented in Ng et al. (2001). Given the set of n points S = {s_1, s_2, ..., s_n}, the algorithm first builds an n × n affinity matrix A, i.e.: A_ij = κ(s_i, s_j) if i ≠ j and 0 otherwise. Here A_ij corresponds to the i-th row and j-th column of the affinity matrix, and κ is any kernel function accepting two input data points and returning a scalar output. Once the affinity matrix is computed, the normalized graph Laplacian L can be constructed. The first k eigenvectors of L are then normalized and clustered. It was shown in Ng et al. (2001) that one can perform spectral k-clustering using a perturbed version Ã of the ideal affinity matrix A. Under certain assumptions, the clusterings obtained using A and Ã will be similar. Our goal is to extend these assumptions and show that using the close to ideal graph Laplacian L and its Nyström rank-r approximation L̃ will also give similar clustering results. Based on the analysis in Ng et al. (2001) we know that if the four assumptions listed below are satisfied, then using either Ã or A to perform spectral clustering will give similar partitionings of the dataset (and also similar to the true clustering of the dataset):

Assumption A1: There exists γ > 0 such that for all i ∈ {1, 2, ..., k}: λ_2^i ≤ 1 − γ, where λ_2^i is the second largest eigenvalue of L^i, and L^i is the subblock of L corresponding to cluster i.
à Assumption A: ϵ> j i,i ={,,...,k},i i j S i S i ϵ d, where = m S i à jm and d = m S i à m and S i is the set of points beonging to the i th custer. : S à j Assumption A3: ϵ > i i={,,...,k},j Si ϵ d ( à m j ) d. dm Assumption A4: C> d i={,,...,k},j={,,...,ni} j ( n i d = )/(Cn i ). Assumption A guarantees each custer to be tight. Assumption A and A3 require data points within a custer to be more connected to each other than

they are with data points from any other cluster. Finally, the last assumption requires that the points in any cluster can never be much less connected than other points in the same cluster. The similarity of the clusterings obtained using A and Ã is then assured via Theorem 1. Let y_j^i be the j-th row of Y^i from Algorithm 1, where Y^i is the subblock of Y corresponding to cluster i. Then the following theorem holds.

Theorem 1 (Ng et al. (2001)). Let assumptions A1, A2, A3 and A4 hold. Set ε = √(k(k−1)ε₁ + kε₂²). If γ > (2 + √2)ε, then there exist k orthonormal vectors r_1, r_2, ..., r_k such that Y in Algorithm 1 satisfies

(1/n) Σ_{i=1}^k Σ_{j=1}^{n_i} ‖y_j^i − r_i‖² ≤ 4C(4 + 2√k)² ε²/(γ − √2 ε)².

2.2 Nyström Method for Matrix Approximation

Algorithm 2 Nyström method for matrix approximation

1: Input: matrix G, l - number of columns sampled, r - rank of the approximation (r ≤ l << n)
2: Output: Σ̃ and Ũ such that G̃ = Ũ Σ̃ Ũᵀ
3: L ← indices of the l sampled columns
4: C ← G(:, L)
5: W ← C(L, :)
6: W_r ← best rank-r approximation to W
7: Σ̃ = (n/l) Σ_{W_r} and Ũ = √(l/n) C U_{W_r} Σ_{W_r}^{−1}, where W_r = U_{W_r} Σ_{W_r} U_{W_r}ᵀ

We now explicate the Nyström rank-r approximation for any symmetric positive semidefinite (SPSD) matrix L ∈ R^{n×n}. After performing sampling (we will only be using uniform sampling without replacement schemes), create the matrix C ∈ R^{n×l} from the l sampled columns. Then form the matrix W ∈ R^{l×l} consisting of the intersection of these l columns with the corresponding l rows of L. Let W = U Σ Uᵀ, where U is orthogonal and Σ = diag(σ_1, σ_2, ..., σ_l) is a real diagonal matrix with the diagonal sorted in decreasing order. Let W_r^+ be the pseudo-inverse of the best rank-r approximation to W (W_r^+ = Σ_{t=1}^r σ_t^{−1} U^{(t)} (U^{(t)})ᵀ, where U^{(t)} is the t-th column of U). Then the Nyström approximation L̃ of L can be obtained as follows: L̃ = C W_r^+ Cᵀ. Furthermore, if we represent L̃ as L̃ = Ũ Σ̃ Ũᵀ, then Σ̃ = (n/l) Σ_{W_r} and Ũ = √(l/n) C U_{W_r} Σ_{W_r}^{−1}, where W_r = U_{W_r} Σ_{W_r} U_{W_r}ᵀ. Theorem 2, due to Kumar et al. (2009), shows the performance bounds for the Nyström method when used with uniform sampling without replacement.
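As a concrete illustration, Algorithm 2 can be sketched in a few lines of NumPy. This is our own illustrative sketch, not the authors' code: the function name is hypothetical, and routing the rank-r truncation through `numpy.linalg.eigh` is an implementation choice.

```python
import numpy as np

def nystrom_approximation(G, l, r, seed=0):
    """Rank-r Nystrom approximation of an SPSD matrix G, built from l
    columns sampled uniformly at random without replacement."""
    rng = np.random.default_rng(seed)
    n = G.shape[0]
    idx = rng.choice(n, size=l, replace=False)  # sampled column indices
    C = G[:, idx]                               # n x l matrix of sampled columns
    W = C[idx, :]                               # l x l intersection block
    # Eigendecomposition of the symmetric block W; keep the top r eigenpairs.
    vals, vecs = np.linalg.eigh(W)
    order = np.argsort(vals)[::-1][:r]
    sigma, U = vals[order], vecs[:, order]
    # Pseudo-inverse of the best rank-r approximation W_r.
    W_r_pinv = (U / sigma) @ U.T
    return C @ W_r_pinv @ C.T                   # G~ = C W_r^+ C^T
```

When G itself has rank at most r and the sampled columns span its range, the approximation is exact; for l << n it trades accuracy for a cost far below that of a full n × n eigendecomposition.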
In Kumar et al. (2009) the authors also

compare the quality of the obtained Nyström approximations, in experiments with large-scale datasets, when using uniform and non-uniform sampling strategies (they consider both sampling with and without replacement). They consider the two most popular non-uniform sampling techniques: column-norm sampling and diagonal sampling. They show that uniform sampling without replacement is not only more efficient in both time and space, but also improves the accuracy of the Nyström method.

Theorem 2 (Kumar et al. (2009)). Let G ∈ R^{n×n} be an SPSD matrix. Assume that l columns of G are sampled uniformly at random without replacement, let G̃_r be the rank-r Nyström approximation to G, and let G_r be the best rank-r approximation to G. Let ε > 0, l ≥ 64r/ε⁴, and η = (1/l)√(log(2/δ) ξ(l, n−l)), where ξ(m, u) = mu/(m + u − 1/2) · 1/(1 − 1/(2 max{m, u})). Then with probability at least 1 − δ,

‖G − G̃_r‖_F ≤ ‖G − G_r‖_F + ε [ √((n/l) Σ_{i∈D(l)} G_ii) + η √(max_i (n G_ii)) ],

where ‖·‖_F is the Frobenius norm and Σ_{i∈D(l)} G_ii is the sum of the l largest diagonal entries of G.

3 Fast Spectral Clustering Algorithm

Algorithm 3 Fast spectral clustering

Input: dataset S = {s_1, s_2, ..., s_n} ⊂ R^d, k - number of clusters, l - number of columns sampled, r - rank of the approximation (k ≤ r << n)
Output: k-clustering of S
L ← indices of the l sampled columns (sampled uniformly without replacement)
Â ← A(:, L)
D ∈ R^{n×n} s.t. D_ij = δ[i = j]/√((n/l) Σ_{j∈L} Â_ij)
D̂ ∈ R^{l×l} s.t. D̂_ij = δ[i = j]/√(Σ_{i=1}^n Â_ij)
C ← Î − D Â D̂, where Î is the matrix of columns of I indexed by L
W ← C(L, :)
W_r ← best rank-r approximation to W
Σ̃ = (n/l) Σ_{W_r} and Ũ = √(l/n) C U_{W_r} Σ_{W_r}^{−1}, where W_r = U_{W_r} Σ_{W_r} U_{W_r}ᵀ
X ← SmallestEigenVectors(Ũ, k)
Y ← NormalizeRows(X): Y_im = X_im/√(Σ_m X_im²)
K ← ClusterRows(Y) (use any k-clustering algorithm minimizing distortion, e.g. k-means)

For large-scale fast spectral clustering, we propose Algorithm 3. The algorithm chooses l columns sampled uniformly at random from the affinity matrix. It therefore never builds the entire n × n affinity matrix, which would be computationally prohibitive. It then computes two sparse diagonal degree matrices, D and D̂. Subsequently, the matrix C, which is n × l, is recovered. Matrix C plays the role of the sampled graph Laplacian. We then follow the steps of Algorithm 2 to obtain the approximate eigensystem of the graph Laplacian, and finally the first k eigenvectors are normalized and clustered. Clearly, Algorithm 3 performs sampling of the affinity matrix. This is in contrast to the more computationally expensive approach of computing the complete n × n affinity matrix and then obtaining matrix C by sampling directly from the graph Laplacian. We provide Theorem 3 to show that for appropriate values of l, both of these algorithms will give similar clustering results. First, let us introduce some additional notation. We consider two scenarios: sampling the graph Laplacian and sampling the affinity matrix. Let L be the set of indices of the l sampled columns. Let Î be the matrix of columns of I that are indexed by L. Notice that for i ∈ {1, 2, ..., n} and j ∈ L, any entry in the sampled graph Laplacian has the following form:

C_ij = Î_ij − A_ij / √((Σ_{a=1}^n A_aj)(Σ_{b=1}^n A_ib)).

On the other hand, matrix C in Algorithm 3 has the following form:

C'_ij = Î_ij − A_ij / √((Σ_{a=1}^n A_aj)((n/l) Σ_{b∈L} A_ib)).

The difference lies in the second term in the denominator and the n/l scaling factor. Consider Theorem 3.

Theorem 3. Let the A_ij's be i.i.d. scalar random variables (bounded in [0, 1]) whose expectation is µ.¹ With probability at least 1 − δ the following holds:

lim_{n→∞} C_ij √(µ/(µ + δ_l)) ≤ lim_{n→∞} C'_ij ≤ lim_{n→∞} C_ij max(1, √(µ/(µ − δ_l))),

where δ_l = √(log(2/δ)/(2l)).

Proof. By the law of large numbers we have that lim_{n→∞} (1/n) Σ_{b=1}^n A_ib = µ. By Hoeffding's inequality we have that with probability at least 1 − δ,

|(1/l) Σ_{b∈L} A_ib − µ| ≤ √(log(2/δ)/(2l)).
Therefore,

lim_{n→∞} C'_ij = lim_{n→∞} [ Î_ij − (Î_ij − C_ij) √( ((1/n) Σ_{b=1}^n A_ib) / ((1/l) Σ_{b∈L} A_ib) ) ].

¹ The i.i.d. assumption is made only for the purpose of this section.

Finally, if (1/l) Σ_{b∈L} A_ib ≥ µ:

lim_{n→∞} C'_ij = lim_{n→∞} [ Î_ij − (Î_ij − C_ij) √( ((1/n) Σ_{b=1}^n A_ib) / ((1/l) Σ_{b∈L} A_ib) ) ] ≥ lim_{n→∞} C_ij √(µ/(µ + δ_l)),

and if (1/l) Σ_{b∈L} A_ib < µ:

lim_{n→∞} C'_ij = lim_{n→∞} [ Î_ij − (Î_ij − C_ij) √( ((1/n) Σ_{b=1}^n A_ib) / ((1/l) Σ_{b∈L} A_ib) ) ] ≤ lim_{n→∞} C_ij √(µ/(µ − δ_l)),

where δ_l = √(log(2/δ)/(2l)). Combining both cases gives the theorem.

Theorem 3 shows that, for sufficiently large l, the two algorithms under consideration (Algorithm 3, sampling the affinity matrix, and the slower alternative of sampling the graph Laplacian) should produce similar C matrices and thus yield similar clustering results. Furthermore, in batch settings with finite n, Algorithm 3 is still applicable; consider the example presented in Figure 1, showing the partitionings of two simple datasets obtained by the spectral clustering algorithm of Ng et al. (2001) using the full affinity matrix and, for comparison, our Algorithm 3. The size of both datasets is very small (n = 5), but the performance of both algorithms is very similar. Finally, in our theoretical analysis we will focus on the scenario where the graph Laplacian is being sampled. This analysis is easier than considering sampling the affinity matrix which, for instance, need not be PSD. We end this section with Theorem 4, showing the computational complexity of the proposed Algorithm 3.

Fig. 1. Result of spectral clustering on two datasets (a and b), n = 5, l = % of n, r = . Top row: the dataset (left) and the partitioning obtained by spectral clustering using the full affinity matrix (right). Bottom row: the dataset with sampled data points (green) (left) and the partitioning obtained by Algorithm 3 (using the sampled affinity matrix) (right).
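To make the pipeline concrete, a minimal NumPy sketch of the flow of Algorithm 3 is given below. This is our illustrative code, not the authors': the Gaussian kernel, the bandwidth `sigma`, the tiny farthest-first k-means at the end, and working with the top eigenvectors of the normalized affinity D^{-1/2} Â D̂^{-1/2} (which correspond to the bottom eigenvectors of the sampled Laplacian C, and are numerically safer to extend) are all assumptions and simplifications.

```python
import numpy as np

def fast_spectral_clustering(S, k, l, r, sigma=1.0, seed=0):
    """Sketch of Algorithm 3: sample l affinity-matrix columns, use a
    Nystrom extension for the spectral embedding, then cluster the rows."""
    rng = np.random.default_rng(seed)
    n = len(S)
    idx = rng.choice(n, size=l, replace=False)        # uniform, no replacement
    # n x l block of the Gaussian affinity matrix, with zero diagonal.
    d2 = ((S[:, None, :] - S[None, idx, :]) ** 2).sum(-1)
    A = np.exp(-d2 / (2 * sigma ** 2))
    A[idx, np.arange(l)] = 0.0
    # Degrees: column degrees are exact; row degrees are rescaled by n/l,
    # mirroring the scaling in the matrix C of Algorithm 3.
    d_col = A.sum(axis=0)
    d_row = (n / l) * A.sum(axis=1)
    # Sampled columns of M = D^{-1/2} A D^{-1/2}; top eigenvectors of M
    # match the bottom eigenvectors of the Laplacian L = I - M.
    M = A / np.sqrt(np.outer(d_row, d_col))
    W = (M[idx, :] + M[idx, :].T) / 2                 # symmetrized l x l block
    vals, vecs = np.linalg.eigh(W)
    top = np.argsort(vals)[::-1][:r]                  # top-r eigenpairs of W
    X = np.sqrt(l / n) * M @ (vecs[:, top] / vals[top])  # Nystrom extension
    X = X[:, :k]                                      # first k approximate eigenvectors
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)  # row-normalize
    return kmeans_rows(Y, k)

def kmeans_rows(Y, k, iters=50):
    """Tiny Lloyd-style k-means with farthest-first initialization."""
    centers = [Y[0]]
    for _ in range(1, k):
        d = ((Y[:, None, :] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(Y[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = Y[labels == j].mean(0)
    return labels
```

Note that the full n × n affinity matrix is never formed: every array above is n × l or smaller, in line with the linear-in-n complexity claimed by Theorem 4.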

Theorem 4. The computational complexity of Algorithm 3 is O(nl max(r, c)) + Γ, where c is the cost of evaluating a single kernel function between two data points and Γ is the cost of the clustering algorithm minimizing distortion used to obtain the final clustering.

4 Performance Guarantees

As was mentioned before, the theoretical analysis considers the case where we sample the graph Laplacian built from the n × n affinity matrix, and thus C is an n × l matrix of sampled columns. Furthermore, matrix W is an l × l matrix consisting of the intersection of these l columns with the corresponding l rows of the graph Laplacian. Our theoretical analysis will consider the performance of the proposed algorithm in the case where the affinity matrix and the corresponding graph Laplacian are close to block diagonal matrices. In particular, we will require that ‖L̄ − L̄_r‖_F ≤ ‖L − L_r‖_F ≤ ε√n, where L̄ is the ideal graph Laplacian, L is the true, close to block diagonal, graph Laplacian that is sampled, and L_r is its best rank-r approximation. Two more conditions that we assume hold will be introduced later. This section is organized as follows: we first show the main result (Theorem 5) and then the technical lemmas and proofs that lead to this result.

4.1 Main Result

Let Ā be the ideal affinity matrix that gave rise to the ideal graph Laplacian L̄. Let L̃ be the Nyström rank-r approximation to L, and let Ã be the affinity matrix that would give rise to L̃ in the case when no Nyström approximation was used. We will now present our main result, Theorem 5. Let y_j^i be the j-th row of Y^i from Algorithm 3, where Y^i is the subblock of Y corresponding to cluster i. Then the following theorem holds.

Theorem 5. Let ε > 0, l ≥ 64r/ε⁴, and η = (1/l)√(log(2/δ) ξ(l, n−l)), where ξ(m, u) = mu/(m + u − 1/2) · 1/(1 − 1/(2 max{m, u})). Let γ̃, C̃, ε̃₁ and ε̃₂ be defined as in Lemmas 1, 2, 3 and 4, respectively. Set ε̃ = √(k(k−1)ε̃₁ + kε̃₂²).
If γ̃ > (2 + √2)ε̃, then with probability at least 1 − δ, there exist k orthogonal vectors r_1, r_2, ..., r_k (r_iᵀ r_j = 1 if i = j, 0 otherwise) such that Y in Algorithm 3 satisfies:

(1/n) Σ_{i=1}^k Σ_{j=1}^{n_i} ‖y_j^i − r_i‖² ≤ 4C̃(4 + 2√k)² ε̃²/(γ̃ − √2 ε̃)².

Theorem 5 is a generalization of Theorem 1. It differs from Theorem 1 in that it extends the four assumptions used in Theorem 1; this results from the fact that Ã is a very special version of the perturbed ideal Ā: in particular, it is the affinity matrix that gave rise to the Nyström rank-r approximation to the graph

Laplacian. The assumption on each λ_2^i ensures that each cluster is tight enough that after sampling the clusters will still remain tight (γ̃ can be interpreted as a measure of the tightness of each cluster after sampling). This assumption also shows that when we decrease the number of sampled columns, we expect the original clusters to be tighter in order for the clusters obtained after sampling to also be tight enough that the dataset is still k-clusterable.

4.2 Theoretical Analysis

We first present Theorem 6, which is a version of Theorem 2 for the case when the sampled matrix is a graph Laplacian L. Theorem 6 relies on the fact that L is an SPSD matrix that is close to block diagonal.

Theorem 6. Let L̄ ∈ R^{n×n} be an ideal graph Laplacian and L the close to block diagonal graph Laplacian defined before. Assume that l columns of L are sampled uniformly at random without replacement and let L̃ be the rank-r Nyström approximation to L. Let ε > 0, l ≥ 64r/ε⁴, and η = (1/l)√(log(2/δ) ξ(l, n−l)), where ξ(m, u) = mu/(m + u − 1/2) · 1/(1 − 1/(2 max{m, u})). Then with probability at least 1 − δ,

‖L − L̃‖_F ≤ ε√n (2 + η).

Recall a useful theorem (Theorem 7) that we will need later. It can be found, e.g., in Kannan and Vempala (2009). Intuitively, Theorem 7 implies that if two matrices are close (in terms of the squared Frobenius norm of their difference), then their singular values should also be close.

Theorem 7. For any two n × n symmetric matrices A and B,

Σ_{t=1}^n (σ_t(A) − σ_t(B))² ≤ ‖A − B‖²_F.

We now proceed with the theoretical analysis that will lead to Theorem 5. We aim to make use of Theorem 1 and then Theorem 6 to provide theoretical guarantees on the performance of spectral clustering when using the Nyström approximation to the close to ideal (close to block diagonal) graph Laplacian. We will focus on extending assumptions A1, A2, A3 and A4 used in Theorem 1. We will present Lemmas 1, 2, 3 and 4. Applying them to Theorem 1 yields our main result, captured in Theorem 5. We will also make three additional assumptions that we will assume hold throughout the entire analysis.
First of all, we will assume that (λ̃_2^i − λ_2^i)² ≤ (1/n) Σ_{t=1}^r (λ̃_t^i − λ_t^i)². Secondly, we will assume that 1/h ≤ n_i/ñ_i ≤ h, where n_i is the number of data points assigned to cluster i when using L, ñ_i is the number of data points assigned to cluster i when using L̃ rather than L, and h is some positive constant. Finally, since the original Laplacian is

assumed to be close to block diagonal, we would like its Nyström approximation to be close to block diagonal as well, which we capture by assuming that the following two conditions hold:

Σ_{j∈S_{i₁}} Σ_{ℓ∈S_{i₂}} |L_{jℓ} − L̃_{jℓ}| ≤ n_{i₁} Σ_{ℓ∈S_{i₂}} |L_{j*ℓ} − L̃_{j*ℓ}| ≤ (f/√n) ‖L − L̃‖_F, where i₁ ≠ i₂, j* = arg max_{j∈S_{i₁}} Σ_{ℓ∈S_{i₂}} |L_{jℓ} − L̃_{jℓ}|, and f > 0 is some constant, and

Σ_{j,ℓ∈S_i} |L_{jℓ} − L̃_{jℓ}| ≤ (g/√n) ‖L − L̃‖_F, where g > 0 is some constant.

Lemma 1. Let λ_2^i be the second largest eigenvalue of L^i, where L^i is the subblock of L corresponding to cluster i, and let λ̃_2^i be the second largest eigenvalue of L̃^i, where L̃^i is the subblock of L̃ corresponding to cluster i. If λ_2^i ≤ 1 − cε√r(2 + η) (c > 1 is some constant), then with probability at least 1 − δ, there exists γ̃ > 0 such that λ̃_2^i ≤ 1 − γ̃.

Proof. We know that

(λ̃_2^i − λ_2^i)² ≤ (1/n) Σ_{t=1}^r (λ̃_t^i − λ_t^i)² ≤ (1/n) ‖L̃^i − L^i‖²_F ≤ (1/n) ‖L̃ − L‖²_F,   (1)

where the first inequality is our first additional assumption and the second follows from Theorem 7. By applying Jensen's inequality to the left-hand side of Equation 1, we obtain

(1/r) Σ_{t=1}^r |λ̃_t^i − λ_t^i| ≤ √(1/(rn)) ‖L̃ − L‖_F.   (2)

Then, in particular, the following holds:

|λ̃_2^i − λ_2^i| ≤ √(r/n) ‖L̃ − L‖_F.   (3)

By assumption, we know that λ_2^i ≤ 1 − cε√r(2 + η). Now, if λ̃_2^i ≤ λ_2^i, then the lemma holds. If λ̃_2^i > λ_2^i, then we can rewrite Equation 3 as:

λ̃_2^i ≤ λ_2^i + √(r/n) ‖L̃ − L‖_F.   (4)

Since λ_2^i ≤ 1 − cε√r(2 + η) and, by Theorem 6, with probability at least 1 − δ it holds that √(r/n) ‖L̃ − L‖_F ≤ ε√r(2 + η), we can write that with probability at least 1 − δ there exists γ̃ > 0 such that λ̃_2^i ≤ 1 − γ̃, where γ̃ = (c − 1)ε√r(2 + η).

Lemma 1 extends assumption A1 from Ng et al. (2001). Before we proceed to the next lemma, let us first introduce some more notation. We know that Ã is defined as the affinity matrix that would give rise to the graph Laplacian L̃ in the case when no Nyström approximation was used, and thus L̃ = I − D̃^{−1/2} Ã D̃^{−1/2} (in this case D̃ is the diagonal matrix whose (i, i)-element is the sum of Ã's i-th row). Let j ∈ S_i, where i ∈ {1, 2, ..., k}. Define the following: d(j) = Σ_{m=1}^n A_{jm}, d̃(j) = Σ_{m=1}^n Ã_{jm}, d_j = Σ_{m∈S_i} A_{jm}, d̃_j = Σ_{m∈S_i} Ã_{jm}. Notice that d̃(j) ≥ d̃_j. Let d^(i) = min_{j∈S_i} d̃_j and d = min_{i∈{1,...,k}} d^(i). Also, let D^(i) = max_{j∈S_i} d̃(j) and D = max_{i∈{1,...,k}} D^(i). At

this point we will make the reasonable assumption that D/d is a bounded positive constant. Assuming the dataset has balanced clusters (i.e., no cluster is significantly bigger or smaller than any other) and, in particular, that the datasets have no outliers, this assumption will be naturally satisfied. Furthermore, let α_{S_{i₁} S_{i₂}} = min_{j∈S_{i₁}, ℓ∈S_{i₂}} (d̃_j d̃_ℓ)/(d̃(j) d̃(ℓ)) and let α = min_{i₁,i₂∈{1,...,k}} α_{S_{i₁} S_{i₂}}. Note that α ∈ (0, 1] and in the ideal case α = 1. We are now ready to state and prove Lemmas 2 and 3.

Lemma 2. There exists C̃ > 0 such that for all i ∈ {1, ..., k} and j ∈ {1, ..., n_i}: d̃_j ≥ (Σ_{ℓ=1}^{n_i} d̃_ℓ)/(C̃ n_i).

Proof. Consider any i ∈ {1, 2, ..., k} and any j, ℓ ∈ S_i. It is true that

d̃_j ≥ d = (d/D) D ≥ (d/D) (1/n_i) Σ_{ℓ=1}^{n_i} d̃_ℓ = (Σ_{ℓ=1}^{n_i} d̃_ℓ)/(C̃ n_i),   (5)

where C̃ = D/d is a bounded positive constant, as was already discussed above.

Lemma 2 extends assumption A4 from Ng et al. (2001).

Lemma 3. With probability at least 1 − δ, for all i₁, i₂ ∈ {1, ..., k} with i₁ ≠ i₂: Σ_{j∈S_{i₁}} Σ_{ℓ∈S_{i₂}} Ã_{jℓ}/√(d̃_j d̃_ℓ) ≤ ε̃₁, where ε̃₁ = ε C̃² f (2 + η) (f > 0 is some constant).

Proof. Let j* = arg max_{j∈S_{i₁}} Σ_{ℓ∈S_{i₂}} |L_{jℓ} − L̃_{jℓ}|. We know that:

Σ_{j∈S_{i₁}} Σ_{ℓ∈S_{i₂}} |L_{jℓ} − L̃_{jℓ}| ≤ n_{i₁} Σ_{ℓ∈S_{i₂}} |L_{j*ℓ} − L̃_{j*ℓ}| ≤ (f/√n) ‖L − L̃‖_F.   (6)

Since in the ideal case A_{jℓ} = 0 for j and ℓ in different clusters, the left-hand side of Equation 6 can be further expressed as

Σ_{j∈S_{i₁}} Σ_{ℓ∈S_{i₂}} |L_{jℓ} − L̃_{jℓ}| = Σ_{j∈S_{i₁}} Σ_{ℓ∈S_{i₂}} Ã_{jℓ}/√(d̃(j) d̃(ℓ)).   (7)

Combining this result with Equation 6 we have

Σ_{j∈S_{i₁}} Σ_{ℓ∈S_{i₂}} Ã_{jℓ}/√(d̃(j) d̃(ℓ)) ≤ (f/√n) ‖L − L̃‖_F.   (8)

Rewrite Equation 8 as:

Σ_{j∈S_{i₁}} Σ_{ℓ∈S_{i₂}} (Ã_{jℓ}/√(d̃_j d̃_ℓ)) √((d̃_j d̃_ℓ)/(d̃(j) d̃(ℓ))) ≤ (f/√n) ‖L − L̃‖_F.   (9)

The left-hand side of Equation 9 is lower-bounded by α Σ_{j∈S_{i₁}} Σ_{ℓ∈S_{i₂}} Ã_{jℓ}/√(d̃_j d̃_ℓ) (since √x ≥ x for x ∈ (0, 1]), and thus

Σ_{j∈S_{i₁}} Σ_{ℓ∈S_{i₂}} Ã_{jℓ}/√(d̃_j d̃_ℓ) ≤ (f/(α√n)) ‖L − L̃‖_F.   (10)

Again, by Theorem 6, we can write that with probability at least 1 − δ the following holds:

Σ_{j∈S_{i₁}} Σ_{ℓ∈S_{i₂}} Ã_{jℓ}/√(d̃_j d̃_ℓ) ≤ (f/(α√n)) ε√n(2 + η) = (f/α) ε (2 + η) ≤ ε C̃² f (2 + η) = ε̃₁,   (11)

where the last inequality comes from the fact that α ≥ (d/D)².

Lemma 3 extends assumption A2 from Ng et al. (2001). Notice that exactly the same proof technique could be used to show that Σ_{ℓ∈S_{i₂}} Ã²_{jℓ}/(d̃_j d̃_ℓ) ≤ ε̃₁. This result will be used in the next lemma. Define β_{S_i} = max_{j∈S_i, ℓ∉S_i} d̃_ℓ/d̃_j and β = max_{i∈{1,...,k}} β_{S_i}. We can now proceed to the next lemma.

Lemma 4. With probability at least 1 − δ, for all i ∈ {1, ..., k} and j ∈ S_i: Σ_{ℓ: ℓ∉S_i} Ã_{jℓ} ≤ ε̃₂ (d̃_j Σ_{m∈S_i} Ã²_{jm}/d̃_m)^{1/2}, where ε̃₂² = C̃⁵ f k ε (2 + η)(ε g (2 + η) + h²).

Proof. Consider any i ∈ {1, ..., k} and j ∈ S_i, and let j* = arg min_{j∈S_i} Σ_{ℓ∉S_i} |L_{jℓ} − L̃_{jℓ}|. We will consider the expression:

[Σ_{ℓ: ℓ∉S_i} Ã_{jℓ}]² / (d̃_j Σ_{m∈S_i} Ã²_{jm}/d̃_m).   (12)

We first upper-bound the numerator. Writing Ã_{jℓ} = (Ã_{jℓ}/√(d̃_j d̃_ℓ)) √(d̃_j d̃_ℓ) and applying Jensen's inequality, the definition of β, and the remark after Lemma 3, we obtain with probability at least 1 − δ:

[Σ_{ℓ: ℓ∉S_i} Ã_{jℓ}]² ≤ k β ε̃₁ d̃_j².   (13)–(15)

Now we focus on lower-bounding the second term in Expression 12. Recall that

Σ_{ℓ,m∈S_i} |L_{ℓm} − L̃_{ℓm}| ≤ (g/√n) ‖L − L̃‖_F.   (16)

Similarly as in the previous paragraph, we can write

Σ_{ℓ,m∈S_i} |L_{ℓm} − L̃_{ℓm}| = Σ_{ℓ,m∈S_i} |A_{ℓm}/√(d(ℓ) d(m)) − Ã_{ℓm}/√(d̃(ℓ) d̃(m))| = Σ_{ℓ,m∈S_i} |1/n_i − Ã_{ℓm}/√(d̃(ℓ) d̃(m))|,   (17)

where the last equality uses the fact that, in the ideal case, A_{ℓm} = 1 for all ℓ, m ∈ S_i and d(ℓ) = d(m) = d_ℓ = d_m = n_i. Expanding the right-hand side of Equation 17 with the identity Ã_{ℓm}/√(d̃(ℓ) d̃(m)) = (Ã_{ℓm}/√(d̃_ℓ d̃_m)) √((d̃_ℓ d̃_m)/(d̃(ℓ) d̃(m))) (Equations 18–19), lower-bounding the correction factor by α, and using the assumption 1/h ≤ n_i/ñ_i ≤ h (Equation 20), we obtain after applying Theorem 6 that with probability at least 1 − δ:

Σ_{m∈S_i} Ã²_{jm}/d̃_m ≥ α d̃_j / (ε g (2 + η) + h²).   (21)

Combining Equations 15 and 21 we get the following:

[Σ_{ℓ: ℓ∉S_i} Ã_{jℓ}]² / (d̃_j Σ_{m∈S_i} Ã²_{jm}/d̃_m) ≤ k β ε̃₁ (1/α)(ε g (2 + η) + h²) ≤ C̃⁵ f k ε (2 + η)(ε g (2 + η) + h²) = ε̃₂²,   (22)

where the last inequality uses the facts that α ≥ (d/D)² and β ≤ D/d.

5 Experiments

To evaluate the proposed algorithms empirically, we consider the four datasets described in Ng et al. (2001). We used a Gaussian kernel to build the affinity matrix (κ(s_i, s_j) = exp(−‖s_i − s_j‖²/(2σ²))). The parameters σ and r were manually tuned to obtain the best performance. Figure 2 shows the datasets together with plots of the error versus the percentage of columns sampled (l/n). We used uniform sampling without replacement throughout. Note that both the choice of columns as well as the initialization of the k-means clustering algorithm slightly affect the performance. Thus, we show two types of results: the curves in the second row of Figure 2, obtained by averaging over a large number of runs, and the curves underneath, showing the most frequently obtained performance (i.e., the median case). Also, we performed two sets of experiments: one where r was held constant and one where r was tuned for each value of l. In the first case we set r = τ (the value of τ for each dataset is provided under Figure 2). In the second case, we observed that tuning r for each value of l (when l increases, r should decrease) can improve the performance, but the improvement is relatively small and not worth presenting here.

[Figure 2: four datasets and, for each, two rows of error vs. % of columns sampled (l/n) curves; only the caption below is recoverable from the plots.]

Fig. 2. Top row: the datasets with color-coded clusters. Second row: error curves vs. % of columns sampled, with the error averaged over the runs. Third row: error curves vs. % of columns sampled, with the most frequent result being displayed. The parameters of interest for each experiment (from left to right) were: a) n = , σ = , τ = 5; b) n = 5, σ = , τ = ; c) n = , σ = , τ = 5; d) n = , σ = , τ = 5.
There was no significant difference among the choices of the distortion-minimizing algorithm used in the last step of our spectral clustering algorithm, be it Lloyd's algorithm, k-means++ or k-means#.
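To reproduce error curves of this kind, one also needs the clustering-error metric itself. The paper does not spell out its error definition, so the snippet below assumes the common convention (the fraction of misclustered points, minimized over all relabelings of the clusters); the function name and implementation are ours.

```python
from itertools import permutations

import numpy as np

def clustering_error(labels, truth, k):
    """Fraction of misclustered points, minimized over all k! label
    permutations (only practical for small k, as in these experiments)."""
    labels = np.asarray(labels)
    truth = np.asarray(truth)
    best = len(labels)
    for perm in permutations(range(k)):
        mapped = np.array([perm[c] for c in labels])
        best = min(best, int((mapped != truth).sum()))
    return best / len(labels)
```

For example, `clustering_error([0, 0, 1, 1], [1, 1, 0, 0], 2)` is `0.0`, since the two labelings agree up to swapping cluster names.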

Acknowledgments. The authors thank Sanjiv Kumar for helpful suggestions.

References

Belkin, M. and Niyogi, P. (2007). Convergence of Laplacian eigenmaps. In NIPS 2006. MIT Press.
Drineas, P. and Mahoney, M. W. (2005). On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153–2175.
Fowlkes, C., Belongie, S., Chung, F., and Malik, J. (2004). Spectral grouping using the Nyström method. IEEE Trans. Pattern Anal. Mach. Intell., 26(2):214–225.
Fung, W. S., Hariharan, R., Harvey, N. J., and Panigrahi, D. (2011). A general framework for graph sparsification. In STOC.
Kannan, R. and Vempala, S. (2009). Spectral algorithms. Foundations and Trends in Theoretical Computer Science, 4(3-4):157–288.
Kumar, S., Mohri, M., and Talwalkar, A. (2009). Sampling techniques for the Nyström method. Journal of Machine Learning Research (AISTATS), 5:304–311.
Lashkari, D. and Golland, P. (2007). Convex clustering with exemplar-based models. In NIPS 2007.
Li, M., Lian, X.-C., Kwok, J. T., and Lu, B.-L. (2011). Time and space efficient spectral clustering via column sampling. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR. IEEE.
Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137.
Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416.
Ng, A. Y., Jordan, M. I., and Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In NIPS, pages 849–856. MIT Press.
Spielman, D. A. and Teng, S.-H. (2011). Spectral sparsification of graphs. SIAM Journal on Computing, 40(4):981–1025.
Williams, C. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In NIPS, pages 682–688. MIT Press.
Yan, D., Huang, L., and Jordan, M. I. (2009). Fast approximate spectral clustering. In ACM SIGKDD, pages 907–916. ACM.


More information

MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES

MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES Separation of variabes is a method to sove certain PDEs which have a warped product structure. First, on R n, a inear PDE of order m is

More information

Preconditioned Locally Harmonic Residual Method for Computing Interior Eigenpairs of Certain Classes of Hermitian Matrices

Preconditioned Locally Harmonic Residual Method for Computing Interior Eigenpairs of Certain Classes of Hermitian Matrices MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.mer.com Preconditioned Locay Harmonic Residua Method for Computing Interior Eigenpairs of Certain Casses of Hermitian Matrices Vecharynski, E.; Knyazev,

More information

Appendix of the Paper The Role of No-Arbitrage on Forecasting: Lessons from a Parametric Term Structure Model

Appendix of the Paper The Role of No-Arbitrage on Forecasting: Lessons from a Parametric Term Structure Model Appendix of the Paper The Roe of No-Arbitrage on Forecasting: Lessons from a Parametric Term Structure Mode Caio Ameida cameida@fgv.br José Vicente jose.vaentim@bcb.gov.br June 008 1 Introduction In this

More information

Absolute Value Preconditioning for Symmetric Indefinite Linear Systems

Absolute Value Preconditioning for Symmetric Indefinite Linear Systems MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.mer.com Absoute Vaue Preconditioning for Symmetric Indefinite Linear Systems Vecharynski, E.; Knyazev, A.V. TR2013-016 March 2013 Abstract We introduce

More information

A. Distribution of the test statistic

A. Distribution of the test statistic A. Distribution of the test statistic In the sequentia test, we first compute the test statistic from a mini-batch of size m. If a decision cannot be made with this statistic, we keep increasing the mini-batch

More information

arxiv: v1 [cs.ds] 12 Nov 2018

arxiv: v1 [cs.ds] 12 Nov 2018 Quantum-inspired ow-rank stochastic regression with ogarithmic dependence on the dimension András Giyén 1, Seth Loyd Ewin Tang 3 November 13, 018 arxiv:181104909v1 [csds] 1 Nov 018 Abstract We construct

More information

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract Stochastic Compement Anaysis of Muti-Server Threshod Queues with Hysteresis John C.S. Lui The Dept. of Computer Science & Engineering The Chinese University of Hong Kong Leana Goubchik Dept. of Computer

More information

C. Fourier Sine Series Overview

C. Fourier Sine Series Overview 12 PHILIP D. LOEWEN C. Fourier Sine Series Overview Let some constant > be given. The symboic form of the FSS Eigenvaue probem combines an ordinary differentia equation (ODE) on the interva (, ) with a

More information

(This is a sample cover image for this issue. The actual cover is not yet available at this time.)

(This is a sample cover image for this issue. The actual cover is not yet available at this time.) (This is a sampe cover image for this issue The actua cover is not yet avaiabe at this time) This artice appeared in a journa pubished by Esevier The attached copy is furnished to the author for interna

More information

Two view learning: SVM-2K, Theory and Practice

Two view learning: SVM-2K, Theory and Practice Two view earning: SVM-2K, Theory and Practice Jason D.R. Farquhar jdrf99r@ecs.soton.ac.uk Hongying Meng hongying@cs.york.ac.uk David R. Hardoon drh@ecs.soton.ac.uk John Shawe-Tayor jst@ecs.soton.ac.uk

More information

Expectation-Maximization for Estimating Parameters for a Mixture of Poissons

Expectation-Maximization for Estimating Parameters for a Mixture of Poissons Expectation-Maximization for Estimating Parameters for a Mixture of Poissons Brandon Maone Department of Computer Science University of Hesini February 18, 2014 Abstract This document derives, in excrutiating

More information

Math 124B January 31, 2012

Math 124B January 31, 2012 Math 124B January 31, 212 Viktor Grigoryan 7 Inhomogeneous boundary vaue probems Having studied the theory of Fourier series, with which we successfuy soved boundary vaue probems for the homogeneous heat

More information

Statistical Learning Theory: A Primer

Statistical Learning Theory: A Primer Internationa Journa of Computer Vision 38(), 9 3, 2000 c 2000 uwer Academic Pubishers. Manufactured in The Netherands. Statistica Learning Theory: A Primer THEODOROS EVGENIOU, MASSIMILIANO PONTIL AND TOMASO

More information

XSAT of linear CNF formulas

XSAT of linear CNF formulas XSAT of inear CN formuas Bernd R. Schuh Dr. Bernd Schuh, D-50968 Kön, Germany; bernd.schuh@netcoogne.de eywords: compexity, XSAT, exact inear formua, -reguarity, -uniformity, NPcompeteness Abstract. Open

More information

Maximizing Sum Rate and Minimizing MSE on Multiuser Downlink: Optimality, Fast Algorithms and Equivalence via Max-min SIR

Maximizing Sum Rate and Minimizing MSE on Multiuser Downlink: Optimality, Fast Algorithms and Equivalence via Max-min SIR 1 Maximizing Sum Rate and Minimizing MSE on Mutiuser Downink: Optimaity, Fast Agorithms and Equivaence via Max-min SIR Chee Wei Tan 1,2, Mung Chiang 2 and R. Srikant 3 1 Caifornia Institute of Technoogy,

More information

Active Learning & Experimental Design

Active Learning & Experimental Design Active Learning & Experimenta Design Danie Ting Heaviy modified, of course, by Lye Ungar Origina Sides by Barbara Engehardt and Aex Shyr Lye Ungar, University of Pennsyvania Motivation u Data coection

More information

Smoothness equivalence properties of univariate subdivision schemes and their projection analogues

Smoothness equivalence properties of univariate subdivision schemes and their projection analogues Numerische Mathematik manuscript No. (wi be inserted by the editor) Smoothness equivaence properties of univariate subdivision schemes and their projection anaogues Phiipp Grohs TU Graz Institute of Geometry

More information

Source and Relay Matrices Optimization for Multiuser Multi-Hop MIMO Relay Systems

Source and Relay Matrices Optimization for Multiuser Multi-Hop MIMO Relay Systems Source and Reay Matrices Optimization for Mutiuser Muti-Hop MIMO Reay Systems Yue Rong Department of Eectrica and Computer Engineering, Curtin University, Bentey, WA 6102, Austraia Abstract In this paper,

More information

Mat 1501 lecture notes, penultimate installment

Mat 1501 lecture notes, penultimate installment Mat 1501 ecture notes, penutimate instament 1. bounded variation: functions of a singe variabe optiona) I beieve that we wi not actuay use the materia in this section the point is mainy to motivate the

More information

ORTHOGONAL MULTI-WAVELETS FROM MATRIX FACTORIZATION

ORTHOGONAL MULTI-WAVELETS FROM MATRIX FACTORIZATION J. Korean Math. Soc. 46 2009, No. 2, pp. 281 294 ORHOGONAL MLI-WAVELES FROM MARIX FACORIZAION Hongying Xiao Abstract. Accuracy of the scaing function is very crucia in waveet theory, or correspondingy,

More information

Stochastic Variational Inference with Gradient Linearization

Stochastic Variational Inference with Gradient Linearization Stochastic Variationa Inference with Gradient Linearization Suppementa Materia Tobias Pötz * Anne S Wannenwetsch Stefan Roth Department of Computer Science, TU Darmstadt Preface In this suppementa materia,

More information

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel Sequentia Decoding of Poar Codes with Arbitrary Binary Kerne Vera Miosavskaya, Peter Trifonov Saint-Petersburg State Poytechnic University Emai: veram,petert}@dcn.icc.spbstu.ru Abstract The probem of efficient

More information

Cryptanalysis of PKP: A New Approach

Cryptanalysis of PKP: A New Approach Cryptanaysis of PKP: A New Approach Éiane Jaumes and Antoine Joux DCSSI 18, rue du Dr. Zamenhoff F-92131 Issy-es-Mx Cedex France eiane.jaumes@wanadoo.fr Antoine.Joux@ens.fr Abstract. Quite recenty, in

More information

Partial permutation decoding for MacDonald codes

Partial permutation decoding for MacDonald codes Partia permutation decoding for MacDonad codes J.D. Key Department of Mathematics and Appied Mathematics University of the Western Cape 7535 Bevie, South Africa P. Seneviratne Department of Mathematics

More information

Akaike Information Criterion for ANOVA Model with a Simple Order Restriction

Akaike Information Criterion for ANOVA Model with a Simple Order Restriction Akaike Information Criterion for ANOVA Mode with a Simpe Order Restriction Yu Inatsu * Department of Mathematics, Graduate Schoo of Science, Hiroshima University ABSTRACT In this paper, we consider Akaike

More information

Moreau-Yosida Regularization for Grouped Tree Structure Learning

Moreau-Yosida Regularization for Grouped Tree Structure Learning Moreau-Yosida Reguarization for Grouped Tree Structure Learning Jun Liu Computer Science and Engineering Arizona State University J.Liu@asu.edu Jieping Ye Computer Science and Engineering Arizona State

More information

THE REACHABILITY CONES OF ESSENTIALLY NONNEGATIVE MATRICES

THE REACHABILITY CONES OF ESSENTIALLY NONNEGATIVE MATRICES THE REACHABILITY CONES OF ESSENTIALLY NONNEGATIVE MATRICES by Michae Neumann Department of Mathematics, University of Connecticut, Storrs, CT 06269 3009 and Ronad J. Stern Department of Mathematics, Concordia

More information

4 Separation of Variables

4 Separation of Variables 4 Separation of Variabes In this chapter we describe a cassica technique for constructing forma soutions to inear boundary vaue probems. The soution of three cassica (paraboic, hyperboic and eiptic) PDE

More information

First-Order Corrections to Gutzwiller s Trace Formula for Systems with Discrete Symmetries

First-Order Corrections to Gutzwiller s Trace Formula for Systems with Discrete Symmetries c 26 Noninear Phenomena in Compex Systems First-Order Corrections to Gutzwier s Trace Formua for Systems with Discrete Symmetries Hoger Cartarius, Jörg Main, and Günter Wunner Institut für Theoretische

More information

A proposed nonparametric mixture density estimation using B-spline functions

A proposed nonparametric mixture density estimation using B-spline functions A proposed nonparametric mixture density estimation using B-spine functions Atizez Hadrich a,b, Mourad Zribi a, Afif Masmoudi b a Laboratoire d Informatique Signa et Image de a Côte d Opae (LISIC-EA 4491),

More information

LECTURE NOTES 8 THE TRACELESS SYMMETRIC TENSOR EXPANSION AND STANDARD SPHERICAL HARMONICS

LECTURE NOTES 8 THE TRACELESS SYMMETRIC TENSOR EXPANSION AND STANDARD SPHERICAL HARMONICS MASSACHUSETTS INSTITUTE OF TECHNOLOGY Physics Department Physics 8.07: Eectromagnetism II October, 202 Prof. Aan Guth LECTURE NOTES 8 THE TRACELESS SYMMETRIC TENSOR EXPANSION AND STANDARD SPHERICAL HARMONICS

More information

ASummaryofGaussianProcesses Coryn A.L. Bailer-Jones

ASummaryofGaussianProcesses Coryn A.L. Bailer-Jones ASummaryofGaussianProcesses Coryn A.L. Baier-Jones Cavendish Laboratory University of Cambridge caj@mrao.cam.ac.uk Introduction A genera prediction probem can be posed as foows. We consider that the variabe

More information

Separation of Variables and a Spherical Shell with Surface Charge

Separation of Variables and a Spherical Shell with Surface Charge Separation of Variabes and a Spherica She with Surface Charge In cass we worked out the eectrostatic potentia due to a spherica she of radius R with a surface charge density σθ = σ cos θ. This cacuation

More information

Soft Clustering on Graphs

Soft Clustering on Graphs Soft Custering on Graphs Kai Yu 1, Shipeng Yu 2, Voker Tresp 1 1 Siemens AG, Corporate Technoogy 2 Institute for Computer Science, University of Munich kai.yu@siemens.com, voker.tresp@siemens.com spyu@dbs.informatik.uni-muenchen.de

More information

Week 6 Lectures, Math 6451, Tanveer

Week 6 Lectures, Math 6451, Tanveer Fourier Series Week 6 Lectures, Math 645, Tanveer In the context of separation of variabe to find soutions of PDEs, we encountered or and in other cases f(x = f(x = a 0 + f(x = a 0 + b n sin nπx { a n

More information

Efficiently Generating Random Bits from Finite State Markov Chains

Efficiently Generating Random Bits from Finite State Markov Chains 1 Efficienty Generating Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE Abstract The probem of random number generation from an uncorreated random source (of unknown

More information

arxiv: v1 [cs.db] 1 Aug 2012

arxiv: v1 [cs.db] 1 Aug 2012 Functiona Mechanism: Regression Anaysis under Differentia Privacy arxiv:208.029v [cs.db] Aug 202 Jun Zhang Zhenjie Zhang 2 Xiaokui Xiao Yin Yang 2 Marianne Winsett 2,3 ABSTRACT Schoo of Computer Engineering

More information

Restricted weak type on maximal linear and multilinear integral maps.

Restricted weak type on maximal linear and multilinear integral maps. Restricted weak type on maxima inear and mutiinear integra maps. Oscar Basco Abstract It is shown that mutiinear operators of the form T (f 1,..., f k )(x) = R K(x, y n 1,..., y k )f 1 (y 1 )...f k (y

More information

Statistics for Applications. Chapter 7: Regression 1/43

Statistics for Applications. Chapter 7: Regression 1/43 Statistics for Appications Chapter 7: Regression 1/43 Heuristics of the inear regression (1) Consider a coud of i.i.d. random points (X i,y i ),i =1,...,n : 2/43 Heuristics of the inear regression (2)

More information

Multilayer Kerceptron

Multilayer Kerceptron Mutiayer Kerceptron Zotán Szabó, András Lőrincz Department of Information Systems, Facuty of Informatics Eötvös Loránd University Pázmány Péter sétány 1/C H-1117, Budapest, Hungary e-mai: szzoi@csetehu,

More information

FFTs in Graphics and Vision. Spherical Convolution and Axial Symmetry Detection

FFTs in Graphics and Vision. Spherical Convolution and Axial Symmetry Detection FFTs in Graphics and Vision Spherica Convoution and Axia Symmetry Detection Outine Math Review Symmetry Genera Convoution Spherica Convoution Axia Symmetry Detection Math Review Symmetry: Given a unitary

More information

Mode in Output Participation Factors for Linear Systems

Mode in Output Participation Factors for Linear Systems 2010 American ontro onference Marriott Waterfront, Batimore, MD, USA June 30-Juy 02, 2010 WeB05.5 Mode in Output Participation Factors for Linear Systems Li Sheng, yad H. Abed, Munther A. Hassouneh, Huizhong

More information

LECTURE NOTES 9 TRACELESS SYMMETRIC TENSOR APPROACH TO LEGENDRE POLYNOMIALS AND SPHERICAL HARMONICS

LECTURE NOTES 9 TRACELESS SYMMETRIC TENSOR APPROACH TO LEGENDRE POLYNOMIALS AND SPHERICAL HARMONICS MASSACHUSETTS INSTITUTE OF TECHNOLOGY Physics Department Physics 8.07: Eectromagnetism II October 7, 202 Prof. Aan Guth LECTURE NOTES 9 TRACELESS SYMMETRIC TENSOR APPROACH TO LEGENDRE POLYNOMIALS AND SPHERICAL

More information

arxiv: v1 [math.co] 17 Dec 2018

arxiv: v1 [math.co] 17 Dec 2018 On the Extrema Maximum Agreement Subtree Probem arxiv:1812.06951v1 [math.o] 17 Dec 2018 Aexey Markin Department of omputer Science, Iowa State University, USA amarkin@iastate.edu Abstract Given two phyogenetic

More information

STA 216 Project: Spline Approach to Discrete Survival Analysis

STA 216 Project: Spline Approach to Discrete Survival Analysis : Spine Approach to Discrete Surviva Anaysis November 4, 005 1 Introduction Athough continuous surviva anaysis differs much from the discrete surviva anaysis, there is certain ink between the two modeing

More information

II. PROBLEM. A. Description. For the space of audio signals

II. PROBLEM. A. Description. For the space of audio signals CS229 - Fina Report Speech Recording based Language Recognition (Natura Language) Leopod Cambier - cambier; Matan Leibovich - matane; Cindy Orozco Bohorquez - orozcocc ABSTRACT We construct a rea time

More information

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete Uniprocessor Feasibiity of Sporadic Tasks with Constrained Deadines is Strongy conp-compete Pontus Ekberg and Wang Yi Uppsaa University, Sweden Emai: {pontus.ekberg yi}@it.uu.se Abstract Deciding the feasibiity

More information

New Efficiency Results for Makespan Cost Sharing

New Efficiency Results for Makespan Cost Sharing New Efficiency Resuts for Makespan Cost Sharing Yvonne Beischwitz a, Forian Schoppmann a, a University of Paderborn, Department of Computer Science Fürstenaee, 3302 Paderborn, Germany Abstract In the context

More information

Optimality of Inference in Hierarchical Coding for Distributed Object-Based Representations

Optimality of Inference in Hierarchical Coding for Distributed Object-Based Representations Optimaity of Inference in Hierarchica Coding for Distributed Object-Based Representations Simon Brodeur, Jean Rouat NECOTIS, Département génie éectrique et génie informatique, Université de Sherbrooke,

More information

6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17. Solution 7

6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17. Solution 7 6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17 Soution 7 Probem 1: Generating Random Variabes Each part of this probem requires impementation in MATLAB. For the

More information

B. Brown, M. Griebel, F.Y. Kuo and I.H. Sloan

B. Brown, M. Griebel, F.Y. Kuo and I.H. Sloan Wegeerstraße 6 53115 Bonn Germany phone +49 8 73-347 fax +49 8 73-757 www.ins.uni-bonn.de B. Brown, M. Griebe, F.Y. Kuo and I.H. Soan On the expected uniform error of geometric Brownian motion approximated

More information

arxiv: v1 [math.fa] 23 Aug 2018

arxiv: v1 [math.fa] 23 Aug 2018 An Exact Upper Bound on the L p Lebesgue Constant and The -Rényi Entropy Power Inequaity for Integer Vaued Random Variabes arxiv:808.0773v [math.fa] 3 Aug 08 Peng Xu, Mokshay Madiman, James Mebourne Abstract

More information

Higher dimensional PDEs and multidimensional eigenvalue problems

Higher dimensional PDEs and multidimensional eigenvalue problems Higher dimensiona PEs and mutidimensiona eigenvaue probems 1 Probems with three independent variabes Consider the prototypica equations u t = u (iffusion) u tt = u (W ave) u zz = u (Lapace) where u = u

More information

Alberto Maydeu Olivares Instituto de Empresa Marketing Dept. C/Maria de Molina Madrid Spain

Alberto Maydeu Olivares Instituto de Empresa Marketing Dept. C/Maria de Molina Madrid Spain CORRECTIONS TO CLASSICAL PROCEDURES FOR ESTIMATING THURSTONE S CASE V MODEL FOR RANKING DATA Aberto Maydeu Oivares Instituto de Empresa Marketing Dept. C/Maria de Moina -5 28006 Madrid Spain Aberto.Maydeu@ie.edu

More information

A SIMPLIFIED DESIGN OF MULTIDIMENSIONAL TRANSFER FUNCTION MODELS

A SIMPLIFIED DESIGN OF MULTIDIMENSIONAL TRANSFER FUNCTION MODELS A SIPLIFIED DESIGN OF ULTIDIENSIONAL TRANSFER FUNCTION ODELS Stefan Petrausch, Rudof Rabenstein utimedia Communications and Signa Procesg, University of Erangen-Nuremberg, Cauerstr. 7, 958 Erangen, GERANY

More information

On Non-Optimally Expanding Sets in Grassmann Graphs

On Non-Optimally Expanding Sets in Grassmann Graphs ectronic Cooquium on Computationa Compexity, Report No. 94 (07) On Non-Optimay xpanding Sets in Grassmann Graphs Irit Dinur Subhash Khot Guy Kinder Dor Minzer Mui Safra Abstract The paper investigates

More information

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM MIKAEL NILSSON, MATTIAS DAHL AND INGVAR CLAESSON Bekinge Institute of Technoogy Department of Teecommunications and Signa Processing

More information

Math 124B January 17, 2012

Math 124B January 17, 2012 Math 124B January 17, 212 Viktor Grigoryan 3 Fu Fourier series We saw in previous ectures how the Dirichet and Neumann boundary conditions ead to respectivey sine and cosine Fourier series of the initia

More information

6 Wave Equation on an Interval: Separation of Variables

6 Wave Equation on an Interval: Separation of Variables 6 Wave Equation on an Interva: Separation of Variabes 6.1 Dirichet Boundary Conditions Ref: Strauss, Chapter 4 We now use the separation of variabes technique to study the wave equation on a finite interva.

More information

Limits on Support Recovery with Probabilistic Models: An Information-Theoretic Framework

Limits on Support Recovery with Probabilistic Models: An Information-Theoretic Framework Limits on Support Recovery with Probabiistic Modes: An Information-Theoretic Framewor Jonathan Scarett and Voan Cevher arxiv:5.744v3 cs.it 3 Aug 6 Abstract The support recovery probem consists of determining

More information

Substructuring Preconditioners for the Bidomain Extracellular Potential Problem

Substructuring Preconditioners for the Bidomain Extracellular Potential Problem Substructuring Preconditioners for the Bidomain Extraceuar Potentia Probem Mico Pennacchio 1 and Vaeria Simoncini 2,1 1 IMATI - CNR, via Ferrata, 1, 27100 Pavia, Itay mico@imaticnrit 2 Dipartimento di

More information

A Simple and Efficient Algorithm of 3-D Single-Source Localization with Uniform Cross Array Bing Xue 1 2 a) * Guangyou Fang 1 2 b and Yicai Ji 1 2 c)

A Simple and Efficient Algorithm of 3-D Single-Source Localization with Uniform Cross Array Bing Xue 1 2 a) * Guangyou Fang 1 2 b and Yicai Ji 1 2 c) A Simpe Efficient Agorithm of 3-D Singe-Source Locaization with Uniform Cross Array Bing Xue a * Guangyou Fang b Yicai Ji c Key Laboratory of Eectromagnetic Radiation Sensing Technoogy, Institute of Eectronics,

More information

Fitting Algorithms for MMPP ATM Traffic Models

Fitting Algorithms for MMPP ATM Traffic Models Fitting Agorithms for PP AT Traffic odes A. Nogueira, P. Savador, R. Vaadas University of Aveiro / Institute of Teecommunications, 38-93 Aveiro, Portuga; e-mai: (nogueira, savador, rv)@av.it.pt ABSTRACT

More information

Appendix for Stochastic Gradient Monomial Gamma Sampler

Appendix for Stochastic Gradient Monomial Gamma Sampler 3 4 5 6 7 8 9 3 4 5 6 7 8 9 3 4 5 6 7 8 9 3 3 3 33 34 35 36 37 38 39 4 4 4 43 44 45 46 47 48 49 5 5 5 53 54 Appendix for Stochastic Gradient Monomia Gamma Samper A The Main Theorem We provide the foowing

More information

2M2. Fourier Series Prof Bill Lionheart

2M2. Fourier Series Prof Bill Lionheart M. Fourier Series Prof Bi Lionheart 1. The Fourier series of the periodic function f(x) with period has the form f(x) = a 0 + ( a n cos πnx + b n sin πnx ). Here the rea numbers a n, b n are caed the Fourier

More information

On the Goal Value of a Boolean Function

On the Goal Value of a Boolean Function On the Goa Vaue of a Booean Function Eric Bach Dept. of CS University of Wisconsin 1210 W. Dayton St. Madison, WI 53706 Lisa Heerstein Dept of CSE NYU Schoo of Engineering 2 Metrotech Center, 10th Foor

More information

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 1654 March23, 1999

More information

17 Lecture 17: Recombination and Dark Matter Production

17 Lecture 17: Recombination and Dark Matter Production PYS 652: Astrophysics 88 17 Lecture 17: Recombination and Dark Matter Production New ideas pass through three periods: It can t be done. It probaby can be done, but it s not worth doing. I knew it was

More information

8 Digifl'.11 Cth:uits and devices

8 Digifl'.11 Cth:uits and devices 8 Digif'. Cth:uits and devices 8. Introduction In anaog eectronics, votage is a continuous variabe. This is usefu because most physica quantities we encounter are continuous: sound eves, ight intensity,

More information

A Fundamental Storage-Communication Tradeoff in Distributed Computing with Straggling Nodes

A Fundamental Storage-Communication Tradeoff in Distributed Computing with Straggling Nodes A Fundamenta Storage-Communication Tradeoff in Distributed Computing with Stragging odes ifa Yan, Michèe Wigger LTCI, Téécom ParisTech 75013 Paris, France Emai: {qifa.yan, michee.wigger} @teecom-paristech.fr

More information

An implicit Jacobi-like method for computing generalized hyperbolic SVD

An implicit Jacobi-like method for computing generalized hyperbolic SVD Linear Agebra and its Appications 358 (2003) 293 307 wwweseviercom/ocate/aa An impicit Jacobi-ike method for computing generaized hyperboic SVD Adam W Bojanczyk Schoo of Eectrica and Computer Engineering

More information

Haar Decomposition and Reconstruction Algorithms

Haar Decomposition and Reconstruction Algorithms Jim Lambers MAT 773 Fa Semester 018-19 Lecture 15 and 16 Notes These notes correspond to Sections 4.3 and 4.4 in the text. Haar Decomposition and Reconstruction Agorithms Decomposition Suppose we approximate

More information

Asynchronous Control for Coupled Markov Decision Systems

Asynchronous Control for Coupled Markov Decision Systems INFORMATION THEORY WORKSHOP (ITW) 22 Asynchronous Contro for Couped Marov Decision Systems Michae J. Neey University of Southern Caifornia Abstract This paper considers optima contro for a coection of

More information

(f) is called a nearly holomorphic modular form of weight k + 2r as in [5].

(f) is called a nearly holomorphic modular form of weight k + 2r as in [5]. PRODUCTS OF NEARLY HOLOMORPHIC EIGENFORMS JEFFREY BEYERL, KEVIN JAMES, CATHERINE TRENTACOSTE, AND HUI XUE Abstract. We prove that the product of two neary hoomorphic Hece eigenforms is again a Hece eigenform

More information

Emmanuel Abbe Colin Sandon

Emmanuel Abbe Colin Sandon Detection in the stochastic bock mode with mutipe custers: proof of the achievabiity conjectures, acycic BP, and the information-computation gap Emmanue Abbe Coin Sandon Abstract In a paper that initiated

More information

Approximated MLC shape matrix decomposition with interleaf collision constraint

Approximated MLC shape matrix decomposition with interleaf collision constraint Approximated MLC shape matrix decomposition with intereaf coision constraint Thomas Kainowski Antje Kiese Abstract Shape matrix decomposition is a subprobem in radiation therapy panning. A given fuence

More information

Fitting affine and orthogonal transformations between two sets of points

Fitting affine and orthogonal transformations between two sets of points Mathematica Communications 9(2004), 27-34 27 Fitting affine and orthogona transformations between two sets of points Hemuth Späth Abstract. Let two point sets P and Q be given in R n. We determine a transation

More information

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization Scaabe Spectrum ocation for Large Networks ased on Sparse Optimization innan Zhuang Modem R&D Lab Samsung Semiconductor, Inc. San Diego, C Dongning Guo, Ermin Wei, and Michae L. Honig Department of Eectrica

More information

MAT 167: Advanced Linear Algebra

MAT 167: Advanced Linear Algebra < Proem 1 (15 pts) MAT 167: Advanced Linear Agera Fina Exam Soutions (a) (5 pts) State the definition of a unitary matrix and expain the difference etween an orthogona matrix and an unitary matrix. Soution:

More information

Introduction. Figure 1 W8LC Line Array, box and horn element. Highlighted section modelled.

Introduction. Figure 1 W8LC Line Array, box and horn element. Highlighted section modelled. imuation of the acoustic fied produced by cavities using the Boundary Eement Rayeigh Integra Method () and its appication to a horn oudspeaer. tephen Kirup East Lancashire Institute, Due treet, Bacburn,

More information

arxiv: v1 [math.pr] 30 Nov 2017

arxiv: v1 [math.pr] 30 Nov 2017 REGULARIZATIO OF O-ORMAL MATRICES BY GAUSSIA OISE - THE BADED TOEPLITZ AD TWISTED TOEPLITZ CASES AIRBA BASAK, ELLIOT PAQUETTE, AD OFER ZEITOUI arxiv:7242v [mathpr] 3 ov 27 Abstract We consider the spectrum

More information

Testing for the Existence of Clusters

Testing for the Existence of Clusters Testing for the Existence of Custers Caudio Fuentes and George Casea University of Forida November 13, 2008 Abstract The detection and determination of custers has been of specia interest, among researchers

More information

FOURIER SERIES ON ANY INTERVAL

FOURIER SERIES ON ANY INTERVAL FOURIER SERIES ON ANY INTERVAL Overview We have spent considerabe time earning how to compute Fourier series for functions that have a period of 2p on the interva (-p,p). We have aso seen how Fourier series

More information

$, (2.1) n="# #. (2.2)

$, (2.1) n=# #. (2.2) Chapter. Eectrostatic II Notes: Most of the materia presented in this chapter is taken from Jackson, Chap.,, and 4, and Di Bartoo, Chap... Mathematica Considerations.. The Fourier series and the Fourier

More information

Appendix for Stochastic Gradient Monomial Gamma Sampler

Appendix for Stochastic Gradient Monomial Gamma Sampler Appendix for Stochastic Gradient Monomia Gamma Samper A The Main Theorem We provide the foowing theorem to characterize the stationary distribution of the stochastic process with SDEs in (3) Theorem 3

More information