Multi-View Clustering via Canonical Correlation Analysis


Keywords: multi-view learning, clustering, canonical correlation analysis

Abstract

Clustering data in high dimensions is believed to be a hard problem in general. A number of efficient clustering algorithms developed in recent years address this problem by projecting the data into a lower-dimensional subspace, e.g. via Principal Components Analysis (PCA) or random projections, before clustering. Here, we consider constructing such projections using multiple views of the data, via Canonical Correlation Analysis (CCA). Under the assumption that, conditioned on the cluster label, the views are uncorrelated, we show that the separation conditions required for the algorithm to be successful are rather mild (significantly weaker than prior results in the literature). We provide results for mixtures of Gaussians and mixtures of log-concave distributions. We also provide empirical support from audio-visual speaker clustering (where we desire the clusters to correspond to speaker ID) and from hierarchical Wikipedia document clustering (where one view is the words in the document and the other is the link structure).

Preliminary work. Under review by the International Conference on Machine Learning (ICML). Do not distribute.

1. Introduction

The multi-view approach to learning is one in which we have multiple views of the data (sometimes in a rather abstract sense), and the goal is to use the relationship between these views to alleviate the difficulty of a learning problem of interest [BM98, KF07, AZ07]. In this work, we explore how having two views of the data makes the clustering problem significantly more tractable.

Much recent work has been done on understanding under what conditions we can learn a mixture model. The basic problem is as follows: we are given independent samples from a mixture of k distributions, and our task is to either 1) infer properties of the underlying mixture model (e.g. the mixing weights, means, etc.) or 2) classify a random sample according to which distribution it was generated from. Under no restrictions on the underlying distribution, this problem is considered hard. However, in many applications, we are only interested in clustering the data when the component distributions are well separated. In fact, the focus of recent clustering algorithms [Das99, VW02, AM05, BV08] is on efficiently learning with as little separation as possible. Typically, these separation conditions are such that, given a random sample from the mixture model, the Bayes-optimal classifier can reliably (with high probability) recover which cluster generated that point.

This work assumes a rather natural multi-view assumption: the views are (conditionally) uncorrelated, if we condition on which mixture distribution generated the views. There are many natural applications for which this assumption applies. For example, we can consider multi-modal views, with one view being a video stream and the other an audio stream of a speaker; here, conditioned on the speaker identity and perhaps the phoneme (both of which could label the generating cluster), the views may be uncorrelated. A second example is the words and link structure in a document from a corpus such as Wikipedia; here, conditioned on the category of each document, the words in it and its link structure may be uncorrelated. In this paper, we provide experiments for both settings.

Under this multi-view assumption, we provide a simple and efficient subspace learning method, based on Canonical Correlation Analysis (CCA). This algorithm is affine invariant and is able to learn with some of the weakest separation conditions to date.
The intuitive reason for this is that, under our multi-view assumption, we are able to (approximately) find the low-dimensional subspace spanned by the means of the component distributions. This subspace is important, because, when projected onto it, the means of the distributions are well separated, yet the typical distance between points from the same distribution is smaller than in the original space. The number of samples we require to cluster correctly scales as O(d), where d is the ambient dimension. Finally, we show through experiments that CCA-based algorithms consistently provide better performance than PCA-based clustering methods when applied to datasets in two different domains: audio-visual speaker clustering, and hierarchical Wikipedia document clustering by category. Our work shows how the multi-view framework can provide substantial improvements to the clustering problem, adding to the growing body of results which show how the multi-view framework can alleviate the difficulty of learning problems.

Related Work. Most provably efficient clustering algorithms first project the data down to some low-dimensional space and then cluster the data in this lower-dimensional space (typically, an algorithm such as single linkage suffices here). Typically, these algorithms also work under a separation requirement, which is measured by the minimum distance between the means of any two mixture components. One of the first provably efficient algorithms for learning mixture models is due to [Das99], who learns a mixture of spherical Gaussians by randomly projecting the mixture onto a low-dimensional subspace. [VW02] provide an algorithm with an improved separation requirement that learns a mixture of k spherical Gaussians, by projecting the mixture down to the k-dimensional subspace of highest variance. [KSV05, AM05] extend this result to mixtures of general Gaussians; however, they require a separation proportional to the maximum directional standard deviation of any mixture component. [CR08] use a canonical-correlations-based algorithm to learn mixtures of axis-aligned Gaussians with a separation proportional to σ*, the maximum directional standard deviation in the subspace containing the centers of the distributions. Their algorithm requires a coordinate-independence property and an additional spreading condition. None of these algorithms are affine invariant. Finally, [BV08] provide an affine-invariant algorithm for learning mixtures of general Gaussians, so long as the mixture has a suitably low Fisher coefficient when in isotropic position. However, their separation involves a rather large polynomial dependence on 1/w_min.

The two results most closely related to ours are the work of [VW02] and [CR08]. [VW02] show that it is sufficient to find the subspace spanned by the means of the distributions in the mixture for effective clustering. Like our algorithm, [CR08] use a projection onto the top k - 1 singular value decomposition subspace of the canonical correlations matrix. They also require a spreading condition, which is related to our requirement on the rank. We borrow techniques from both of these papers. [?] propose a similar algorithm for multi-view clustering, in which data is projected onto the top few directions obtained by a kernel CCA of the multi-view data. They show empirically that, for clustering images using associated text (where the two views are an image and the text associated with it, and the target clustering is a human-defined category), CCA-based methods outperform PCA-based clustering algorithms.

In this paper, we study the problem of multi-view clustering. We have data on a fixed set of objects from two sources, which we call the two views, and our goal is to use this fact to cluster more effectively than with data from a single source.
Following prior theoretical work, our goal is to show that our algorithm recovers the correct clustering when the input obeys certain conditions.

This Work. Our input is data on a fixed set of objects from two views, where View j is generated by a mixture of k Gaussians (D_1^j, ..., D_k^j), for j = 1, 2. To generate a sample, a source i is picked with probability w_i, and x^(1) and x^(2) in Views 1 and 2 are drawn from distributions D_i^1 and D_i^2. We impose two requirements on these mixtures. First, we require that, conditioned on the source distribution in the mixture, the two views are uncorrelated. Notice that this is a weaker restriction than the condition that, given source i, the samples from D_i^1 and D_i^2 are drawn independently. Moreover, this condition allows the distributions in the mixture within each view to be completely general, so long as they are uncorrelated across views. Although we do not show it theoretically, we suspect that our algorithm is robust to small deviations from this assumption. Second, we require that the rank of the CCA matrix across the views be at least k - 1 when each view is in isotropic position, and that the (k-1)-th singular value of this matrix be at least λ_min. This condition ensures that there is sufficient correlation between the views. If these two conditions hold, then we can recover the subspace containing the means of the distributions in both views. In addition, for mixtures of Gaussians, if in at least one view, say View 1, we have that for every pair of distributions i and j in the mixture,

||μ_i^(1) - μ_j^(1)|| > C σ* k^{1/4} √(log(n/δ))

for some constant C, where μ_i^(1) is the mean of the i-th component distribution in View 1 and σ* is the maximum directional standard deviation in the subspace containing the means of the distributions in View 1, then our algorithm can also determine which distribution each sample came from. Moreover, the number of samples we require to learn this mixture grows (almost) linearly with d. This separation condition is considerably weaker than previous results, in that σ* only depends on the directional variance in the subspace spanned by the means, which can be considerably lower than the maximum directional variance over all directions. The only other algorithm which provides affine-invariant guarantees is due to [BV08]; while this result does not explicitly state results in terms of separation between the means (it uses a Fisher coefficient concept), the implied separation is rather large and grows with decreasing w_min, the minimum mixing weight. To get our improved sample complexity bounds, we use a result due to [RV07], which may be of independent interest. We stress that our improved results are really due to the multi-view condition. Had we simply combined the data from both views and applied previous algorithms to the combined data, we could not have obtained our guarantees. We also emphasize that for our algorithm to cluster successfully, it is sufficient for the distributions in the mixture to obey the separation condition in one view, so long as the multi-view and rank conditions are obeyed.

Finally, we study through experiments the performance of CCA-based algorithms on datasets from two different domains. First, we experiment with audio-visual speaker clustering, in which the two views are audio and face images of a speaker, and the target cluster variable is the speaker. Our experiments show that CCA-based algorithms perform better than PCA-based algorithms on audio data, just as well on image data, and are more robust to occlusions and translations of the images. For our second experiment, we cluster documents in Wikipedia. The two views are the words and the link structure of a document, and the target cluster is the category. Our experiments show that a CCA-based hierarchical clustering algorithm provides much higher performance than PCA-based hierarchical clustering for this data.

2. The Setting

We assume that our data is generated by a mixture of k distributions. In particular, we assume we obtain samples x = (x^(1), x^(2)), where x^(1) and x^(2) are the two views of the data, which live in the vector spaces V_1 of dimension d_1 and V_2 of dimension d_2, respectively. We let d = d_1 + d_2. Let μ_i^(j), for i = 1, ..., k and j = 1, 2, be the center of distribution i in view j, and let w_i be the mixing weight for distribution i. For simplicity, we assume the data have mean 0. We denote the covariance matrices of the data as

Σ = E[x x^T],  Σ_11 = E[x^(1) (x^(1))^T],  Σ_22 = E[x^(2) (x^(2))^T],  Σ_12 = E[x^(1) (x^(2))^T].

Hence, we have

Σ = [ Σ_11  Σ_12 ; Σ_21  Σ_22 ].        (1)

The multi-view assumption we work with is as follows:

Assumption 1 (Multi-View Condition) We assume that, conditioned on the source distribution l in the mixture (where l = i is picked with probability w_i), the two views are uncorrelated. More precisely, we assume that for all i ∈ [k],

E[x^(1) (x^(2))^T | l = i] = E[x^(1) | l = i] E[(x^(2))^T | l = i].

This assumption implies that

Σ_12 = Σ_i w_i μ_i^(1) (μ_i^(2))^T.

To see this, observe that

E[x^(1) (x^(2))^T] = Σ_i E_{D_i}[x^(1) (x^(2))^T] Pr[D_i] = Σ_i w_i E_{D_i}[x^(1)] E_{D_i}[(x^(2))^T] = Σ_i w_i μ_i^(1) (μ_i^(2))^T.        (2)

As the distributions are in isotropic position, we observe that Σ_i w_i μ_i^(1) = Σ_i w_i μ_i^(2) = 0. Therefore, the above equation shows that the rank of Σ_12 is at most k - 1. We now assume that it has rank precisely k - 1.

Assumption 2 (Non-Degeneracy Condition) We assume that Σ_12 has rank k - 1 and that the minimal non-zero singular value of Σ_12 is λ_min > 0 (where we are working in a coordinate system in which Σ_11 and Σ_22 are identity matrices).
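The short numerical check below (not from the paper; a minimal numpy sketch with arbitrary illustrative dimensions, weights, and means) illustrates this consequence of Assumption 1: once the weighted means are centered, the cross-view matrix Σ_i w_i μ_i^(1) (μ_i^(2))^T has rank at most k - 1.

# Sketch: the cross-view covariance implied by Assumption 1 has rank at most k - 1.
# All dimensions, weights, and means are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
k, d1, d2 = 4, 12, 9
w = rng.dirichlet(np.ones(k))          # mixing weights
mu1 = rng.normal(size=(k, d1))         # view-1 means
mu2 = rng.normal(size=(k, d2))         # view-2 means
mu1 -= w @ mu1                         # enforce sum_i w_i mu_i^(1) = 0
mu2 -= w @ mu2                         # enforce sum_i w_i mu_i^(2) = 0

sigma12 = sum(w[i] * np.outer(mu1[i], mu2[i]) for i in range(k))
print(np.linalg.matrix_rank(sigma12))  # prints k - 1 = 3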

For clarity of exposition, we also work in an isotropic coordinate system in each view. Specifically, the expected covariance matrix of the data, in each view, is the identity matrix, i.e. Σ_11 = I_{d_1}, Σ_22 = I_{d_2}. As our analysis shows, our algorithm is robust to errors, so we assume that the data is whitened as a preprocessing step.

One way to view the Non-Degeneracy Assumption is in terms of correlation coefficients. Recall that for two directions u ∈ V_1 and v ∈ V_2, the correlation coefficient is defined as

ρ(u, v) = E[(u · x^(1))(v · x^(2))] / √( E[(u · x^(1))²] E[(v · x^(2))²] ).

An alternative definition of λ_min is just the minimal non-zero correlation coefficient, i.e. λ_min = min_{u,v : ρ(u,v) ≠ 0} ρ(u, v). Note that 1 ≥ λ_min > 0.

We use Σ̂_11 and Σ̂_22 to denote the sample covariance matrices in views 1 and 2 respectively, and Σ̂_12 to denote the sample covariance matrix between views 1 and 2. We assume these are obtained through empirical averages from i.i.d. samples from the underlying distribution.

3. The Clustering Algorithm

The following lemma provides the intuition for our algorithm.

Lemma 1 Under Assumption 2, if U D V^T is the thin SVD of Σ_12 (where the thin SVD removes all zero entries from the diagonal), then the subspace spanned by the means in View 1 is precisely the column span of U (and we have the analogous statement for View 2).

The lemma is a consequence of Equation (2) and the rank assumption. Since samples from a mixture are well separated in the space containing the means of the distributions, the lemma suggests the following strategy: use CCA to (approximately) project the data down to the subspace spanned by the means to get an easier clustering problem, and then apply standard clustering algorithms in this space. Our clustering algorithm, based on the above idea, is stated below. We can show that this algorithm clusters correctly with high probability when the data in at least one of the views obeys a separation condition, in addition to our assumptions. The input to the algorithm is a set of samples S and a number k, and the output is a clustering of these samples into k clusters. For this algorithm, we assume that the data obeys the separation condition in View 1; an analogous algorithm can be applied when the data obeys the separation condition in View 2 as well.

Algorithm 1
1. Randomly partition S into two subsets A and B of equal size.
2. Let Σ̂_12(A) (respectively Σ̂_12(B)) denote the empirical covariance matrix between views 1 and 2, computed from the sample set A (respectively B). Compute the top k - 1 left singular vectors of Σ̂_12(A) (respectively Σ̂_12(B)), and project the samples in B (respectively A) onto the subspace spanned by these vectors.
3. Apply single-linkage clustering [?] (for mixtures of log-concave distributions), or the algorithm in Section 3.5 of [AK05] (for mixtures of Gaussians), to the projected examples in View 1.

We note that in Step 3 we apply either single linkage or the algorithm of [AK05]; this allows us to show theoretically that, given that the distributions in the mixture are of a certain type and given the right separation conditions, the clusters can be recovered correctly. In practice, however, these algorithms do not perform as well due to lack of robustness, and one would use an algorithm such as k-means or EM to cluster in this low-dimensional subspace. In particular, a variant of the EM algorithm has been shown [DS00] to correctly cluster mixtures of Gaussians under certain conditions.
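The following Python sketch shows one way the steps above could be implemented; it is an illustration under our assumptions, not the authors' code. It whitens each view, estimates the cross-covariance on one half of the data, projects the other half onto the top k - 1 left singular vectors, and clusters with k-means (the practical substitute for Step 3 discussed above). All function names and parameters are illustrative; the symmetric step (clustering A with the subspace learned from B) follows the same pattern.

# Sketch of Algorithm 1 (whiten, cross-covariance SVD, project, cluster).
import numpy as np
from sklearn.cluster import KMeans

def whiten(x, eps=1e-8):
    """Map x to (approximate) isotropic position: zero mean, identity covariance."""
    x = x - x.mean(axis=0)
    cov = (x.T @ x) / len(x)
    evals, evecs = np.linalg.eigh(cov)
    return x @ (evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T)

def cca_cluster(x1, x2, k, seed=0):
    """Cluster the view-1 samples of half B in the CCA subspace estimated from half A."""
    x1, x2 = whiten(x1), whiten(x2)
    n = len(x1)
    a, b = np.arange(n)[: n // 2], np.arange(n)[n // 2 :]   # Step 1 (random split in practice)
    sigma12_a = (x1[a].T @ x2[a]) / len(a)                   # empirical cross-covariance from A
    u, _, _ = np.linalg.svd(sigma12_a)
    proj = u[:, : k - 1]                                     # top k-1 left singular vectors
    z = x1[b] @ proj                                         # project the B samples (Step 2)
    labels_b = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(z)
    return b, labels_b                                       # indices and cluster labels for B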
Moreover, in Step 1 we divide the data set into two halves to ensure independence between Steps 2 and 3 for our analysis; in practice, however, these steps can be executed on the same sample set.

The Main Result. Our main theorem can be stated as follows.

Theorem 1 (Gaussians) Suppose the source distribution is a mixture of Gaussians, and suppose Assumptions 1 and 2 hold. Let σ* be the maximum directional standard deviation of any distribution in the subspace spanned by {μ_i^(1)}_{i=1}^k. If, for each pair i and j and for a fixed constant C,

||μ_i^(1) - μ_j^(1)|| ≥ C σ* k^{1/4} √(log(kn/δ)),

then, with probability 1 - δ, Algorithm 1 correctly classifies the examples if the number of examples used is

c · d / ((σ*)² λ_min² w_min²) · log²( d / (σ* λ_min w_min) ) · log²(1/δ)

for some constant c.

Here we assume that the separation condition holds in View 1, but a similar theorem also applies to View 2. An analogous theorem can also be shown for mixtures of log-concave distributions.

Theorem 2 (Log-concave Distributions) Suppose the source distribution is a mixture of log-concave distributions, and suppose Assumptions 1 and 2 hold. Let σ* be the maximum directional standard deviation of any distribution in the subspace spanned by {μ_i^(1)}_{i=1}^k. If, for each pair i and j and for a fixed constant C,

||μ_i^(1) - μ_j^(1)|| ≥ C σ* √k log(kn/δ),

then, with probability 1 - δ, Algorithm 1 correctly classifies the examples if the number of examples used is

c · d / ((σ*)² λ_min² w_min²) · log³( d / (σ* λ_min w_min) ) · log²(1/δ)

for some constant c.

4. Analyzing Our Algorithm

In this section, we prove our main theorems. First, we define some notation.

Notation. In the sequel, we assume that we are given samples from a mixture which obeys Assumptions 1 and 2. We use the notation S_1 (resp. S_2) to denote the subspace containing the centers of the distributions in the mixture in View 1 (resp. View 2), and S_1^⊥ (resp. S_2^⊥) to denote the orthogonal complement of this subspace in View 1 (resp. View 2). For any matrix A, we use ||A|| to denote the L_2 norm, or maximum singular value, of A.

Proofs. Now we are ready to prove our main theorem. First, we show the following two lemmas, which demonstrate properties of the expected cross-correlation matrix across the views. Their proofs are immediate from Assumptions 1 and 2.

Lemma 2 Let v^(1) and v^(2) be any unit vectors in S_1 and S_2 respectively. Then (v^(1))^T Σ_12 v^(2) > λ_min.

Lemma 3 Let v^(1) (resp. v^(2)) be any vector in S_1^⊥ (resp. S_2^⊥). Then, for any u^(1) ∈ V_1 and u^(2) ∈ V_2, (v^(1))^T Σ_12 u^(2) = (u^(1))^T Σ_12 v^(2) = 0.

Next, we show that, given sufficiently many samples, the subspace spanned by the top k - 1 singular vectors of Σ̂_12 still approximates the subspace containing the means of the distributions comprising the mixture. Finally, we use this fact, along with some results in [AK05], to prove Theorem 1. Our main lemma of this section is the following.

Lemma 4 (Projection Subspace Lemma) Let v^(1) (resp. v^(2)) be any vector in S_1 (resp. S_2). If the number of samples n > c · d/(τ² λ_min² w_min) · log²( d/(τ λ_min w_min) ) · log²(1/δ) for some constant c, then, with probability 1 - δ, the length of the projection of v^(1) (resp. v^(2)) onto the subspace spanned by the top k - 1 left (resp. right) singular vectors of Σ̂_12 is at least √(1 - τ²) ||v^(1)|| (resp. √(1 - τ²) ||v^(2)||).

The main tool in the proof of Lemma 4 is the following lemma, which uses a result due to [RV07].

Lemma 5 (Sample Complexity Lemma) If the number of samples n > c · d/(ε² w_min) · log²( d/(ε w_min) ) · log²(1/δ) for some constant c, then, with probability at least 1 - δ, ||Σ̂_12 - Σ_12|| ≤ ε, where ||·|| denotes the L_2 norm of a matrix.

A consequence of Lemma 5 and Lemmas 2 and 3 is the following lemma.

Lemma 6 Let n > C · d/(ε² w_min) · log²( d/(ε w_min) ) · log²(1/δ), for some constant C. Then, with probability 1 - δ, the top k - 1 singular values of Σ̂_12 are at least λ_min - ε, and the remaining min(d_1, d_2) - k + 1 singular values of Σ̂_12 are at most ε.

The proof follows by a combination of Lemmas 2, 3, and 5 and a triangle inequality.

Proof (of Lemma 5): To prove this lemma, we apply Lemma 7 (stated below). Observe the block representation of Σ in Equation (1). Moreover, with Σ_11 and Σ_22 in isotropic position, the L_2 norm of Σ_12 is at most 1.
Using the triangle inequality, we can write

||Σ̂_12 - Σ_12|| ≤ ||Σ̂ - Σ|| + ( ||Σ̂_11 - Σ_11|| + ||Σ̂_22 - Σ_22|| ),

where we have applied the triangle inequality to the 2 × 2 block matrix with off-diagonal entries Σ̂_12 - Σ_12 and with 0 diagonal entries. We now apply Lemma 7 three times: on Σ̂_11 - Σ_11, on Σ̂_22 - Σ_22, and on a scaled version of Σ̂ - Σ. The first two applications follow directly.

For the third application, we observe that Lemma 7 is rotation invariant, and that scaling each covariance value by some factor s scales the norm of the matrix by at most s. We claim that we can apply Lemma 7 to Σ̂ - Σ with s = 4. Since the covariance of any two random variables is at most the product of their standard deviations, and since Σ_11 and Σ_22 are I_{d_1} and I_{d_2} respectively, the maximum singular value of Σ_12 is at most 1; the maximum singular value of Σ is therefore at most 4. Our claim follows. The lemma now follows by plugging in n as a function of ε, d, and w_min.

Lemma 7 Let X be a set of n points generated by a mixture of k Gaussians over R^d, scaled such that E[x x^T] = I. If M is the sample covariance matrix of X, then, for n large enough, with probability at least 1 - δ,

||M - E[M]|| ≤ C √( d log n · log(2n/δ) · log(1/δ) / (w_min n) ),

where C is a fixed constant and w_min is the minimum mixing weight of any Gaussian in the mixture.

Proof: To prove this lemma, we use a concentration result on the L_2 norms of matrices due to [RV07]. We observe that each vector x_i in the scaled space is generated by a Gaussian with some mean μ and maximum directional variance σ². As the total variance of the mixture along any direction is at most 1, w_min(||μ||² + σ²) ≤ 1. Therefore, for each sample x_i, with probability at least 1 - δ/2, ||x_i|| ≤ ||μ|| + σ √(d log(2n/δ)). We condition on the event that ||x_i|| ≤ ||μ|| + σ √(d log(2n/δ)) for all i = 1, ..., n; the probability of this event is at least 1 - δ/2. Conditioned on this event, the distributions of the vectors x_i are independent. Therefore, we can apply Theorem 3.1 in [RV07] to these conditional distributions to conclude that

Pr[ ||M - E[M]|| > t ] ≤ 2 e^{-c n t² / (Λ² log n)},

where c is a constant and Λ is an upper bound on the norm of any vector x_i. The lemma follows by plugging in t = √( Λ² log(4/δ) log n / (c n) ) and Λ ≤ 2 √( d log(2n/δ) / w_min ).

Now we are ready to prove Lemma 4.

Proof (of Lemma 4): For the sake of contradiction, suppose there exists a unit vector v^(1) ∈ S_1 such that the projection of v^(1) onto the top k - 1 left singular vectors of Σ̂_12 has length √(1 - τ'²), where τ' > τ. Then there exists some unit vector u^(1) in V_1, in the orthogonal complement of the space spanned by the top k - 1 left singular vectors of Σ̂_12, such that the projection of v^(1) onto u^(1) has length τ'. This vector u^(1) can be written as u^(1) = τ' v^(1) + (1 - τ'²)^{1/2} y^(1), where y^(1) is in the orthogonal complement of S_1. From Lemma 2, there exists some unit vector u^(2) in S_2 such that (v^(1))^T Σ_12 u^(2) ≥ λ_min; from Lemma 3, for this vector u^(2), (u^(1))^T Σ_12 u^(2) ≥ τ' λ_min ≥ τ λ_min. If n > c · d/(τ² λ_min² w_min) · log²( d/(τ λ_min w_min) ) · log²(1/δ), then, from Lemma 6, (u^(1))^T Σ̂_12 u^(2) ≥ τ λ_min / 2. Now, since u^(1) is in the orthogonal complement of the subspace spanned by the top k - 1 left singular vectors of Σ̂_12, for any vector y^(2) in the subspace spanned by the top k - 1 right singular vectors of Σ̂_12, (u^(1))^T Σ̂_12 y^(2) = 0. This, in turn, means that there exists a vector z^(2) in V_2, in the orthogonal complement of the subspace spanned by the top k - 1 right singular vectors of Σ̂_12, such that (u^(1))^T Σ̂_12 z^(2) ≥ τ λ_min / 2. This implies that the k-th singular value of Σ̂_12 is at least τ λ_min / 2. However, from Lemma 6, all except the top k - 1 singular values of Σ̂_12 are at most τ λ_min / 3, which is a contradiction.

Finally, we are ready to prove our main theorem.

Proof (of Theorem 1): From Lemma 4, if n > C · d/(τ² λ_min² w_min) · log²( d/(τ λ_min w_min) ) · log²(1/δ), then, with probability at least 1 - δ, the projection of any vector v in S_1 or S_2 onto the subspace returned by Step 2 of Algorithm 1 has length at least √(1 - τ²) ||v||.
Therefore, the maximum directional variance of any D_i in this subspace is at most (1 - τ²)(σ*)² + τ² σ², where σ² is the maximum directional variance of any D_i. When τ ≤ σ*/σ, this is at most 2(σ*)². From the isotropy condition, σ ≤ 1/√w_min. Therefore, when n > C · d/((σ*)² λ_min² w_min²) · log²( d/(σ* λ_min w_min) ) · log²(1/δ), the maximum directional variance of any D_i in the mixture, in the space output by Step 2 of the algorithm, is at most 2(σ*)². Since A and B are random partitions of the sample set S, the subspace produced by the action of Step 2 of Algorithm 1 on the set A is independent of B, and vice versa. Therefore, when projected onto the top k - 1 SVD subspace of Σ̂_12(A), the samples from B are distributed as a mixture of (k-1)-dimensional Gaussians. The theorem follows from the bounds in the previous paragraph and Theorem 1 of [AK05].
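As a sanity check of the flavour of Lemma 4, the following toy simulation (not from the paper; all parameters are arbitrary illustrative choices) verifies that, with enough samples, a unit vector in the true mean subspace of View 1 is almost entirely captured by the top k - 1 left singular vectors of the empirical cross-covariance.

# Sketch: directions in the mean subspace are captured by the empirical CCA subspace.
import numpy as np

rng = np.random.default_rng(2)
k, d1, d2, n = 3, 30, 30, 200_000
w = np.full(k, 1.0 / k)
mu1, mu2 = 3 * rng.normal(size=(k, d1)), 3 * rng.normal(size=(k, d2))
mu1 -= w @ mu1                                        # center the weighted means
mu2 -= w @ mu2

labels = rng.choice(k, size=n, p=w)
x1 = mu1[labels] + rng.normal(size=(n, d1))           # views independent given the label
x2 = mu2[labels] + rng.normal(size=(n, d2))

u, _, _ = np.linalg.svd((x1.T @ x2) / n)
proj = u[:, : k - 1]                                  # estimated mean subspace of view 1

v = mu1[0] / np.linalg.norm(mu1[0])                   # a unit vector in the true mean subspace
print(np.linalg.norm(proj.T @ v))                     # close to 1 for large n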

5. Experiments

5.1. Audio-visual speaker clustering

In the first set of experiments, we consider clustering either audio or face images of speakers. We use a subset of the VidTIMIT database [?] consisting of 41 speakers, speaking 10 sentences each, recorded at 25 frames per second in a studio environment with no significant lighting or pose variation. The audio features are standard 12-dimensional mel cepstra [?] computed every 10 ms and concatenated over a window of frames before and after the current frame, for a total of 1584 dimensions. The video features are the pixels of the face region extracted from each image (2394 dimensions). We consider the target cluster variable to be the speaker.

We use either CCA or PCA to project the data to some lower dimensionality N. In the case of CCA, we initially project the data to an intermediate dimensionality M using PCA, to reduce the effects of spurious correlations. For the results reported here, M is typically 100 for images and 1000 for audio; the parameters were selected using a held-out set. For CCA, we randomize the vectors of one view in each sentence, to reduce correlations between the views due to certain other latent variables such as the current phoneme. We then cluster either view using k-means into 82 clusters (2 per speaker). To alleviate the problem of local minima found by k-means, each clustering consists of 5 runs of k-means, and the one with the lowest k-means score is taken as the final cluster assignment.

Similarly to [?], we measure clustering performance using the conditional entropy of the speaker s given the cluster c, H(s|c). We report the results in terms of conditional perplexity, 2^{H(s|c)}, which is the mean number of speakers corresponding to each cluster. Table 1 shows results on the raw data, as well as with synthetic occlusions and translations of the image data. Considering the extremely clean visual environment, we expect PCA to do very well on the image data. Indeed, PCA provides an almost perfect clustering of the raw images, and CCA does not improve on it. However, CCA far outperforms PCA when clustering the more challenging audio view. When synthetic occlusions or translations are applied to the images, the performance of PCA-based clustering is greatly degraded. CCA is unaffected in the case of occlusion; in the case of translation, CCA-based image clustering is degraded similarly to PCA, but audio clustering is almost unaffected. In other words, even when the image data are degraded, CCA is able to recover a good clustering in at least one of the views.

Table 1. Conditional perplexities of the speaker given the cluster, using PCA or CCA bases. "+ occlusion" and "+ translation" indicate that the images are corrupted with occlusion/translation; the audio is unchanged, however. (Rows: Images; Audio; Images + occlusion; Audio + occlusion; Images + translation; Audio + translation. Columns: PCA; CCA.)

For a more detailed look at the clustering behavior, Figures 1(a)-(d) show the distributions of clusters for each speaker in several conditions.
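A minimal sketch of this evaluation pipeline is given below, assuming integer-coded speaker labels and in-memory feature matrices; scikit-learn's CCA and KMeans stand in for the projection and clustering described above, and all names and parameter values are placeholders rather than the exact setup used in the experiments.

# Sketch: CCA projection of one view, k-means, and conditional perplexity 2^H(s|c).
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.cluster import KMeans

def conditional_perplexity(speakers, clusters):
    """2**H(speaker | cluster) from empirical counts (speakers are integer labels)."""
    speakers, clusters = np.asarray(speakers), np.asarray(clusters)
    h = 0.0
    for c in np.unique(clusters):
        mask = clusters == c
        p_c = mask.mean()
        counts = np.bincount(speakers[mask])
        p_s_given_c = counts[counts > 0] / mask.sum()
        h += p_c * -(p_s_given_c * np.log2(p_s_given_c)).sum()
    return 2.0 ** h

def cca_kmeans(view_a, view_b, n_dims, n_clusters, seed=0):
    """Project view_a onto its top CCA directions (against view_b) and cluster it."""
    cca = CCA(n_components=n_dims).fit(view_a, view_b)
    projected, _ = cca.transform(view_a, view_b)
    best = min(
        (KMeans(n_clusters=n_clusters, n_init=1, random_state=seed + r).fit(projected)
         for r in range(5)),                     # 5 runs, keep the lowest k-means score
        key=lambda km: km.inertia_,
    )
    return best.labels_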
5.2. Clustering Wikipedia articles

Next we consider the task of clustering Wikipedia articles, based on either their text or their incoming and outgoing links. The link structure L is represented as a concatenation of "to" and "from" link incidence vectors, where each element L(i) is the number of times the current article links to/from article i. The article text is represented as a bag-of-words feature vector, i.e. the raw count of each word in the article. A lexicon of about 8 million words and a list of about 12 million articles were used to construct the two feature vectors. Since the dimensionality of the feature vectors is very high (in the millions for the link view), we use random projection to reduce the dimensionality to a computationally manageable level.

We present clustering experiments on a subset of Wikipedia consisting of 128,327 articles. We use either PCA or CCA to reduce the feature vectors to the final desired dimensionality, followed by clustering. In these experiments, we use a hierarchical clustering procedure, as a flat clustering is rather poor with either PCA or CCA (CCA still usually outperforms PCA, however). In the hierarchical procedure, all points are initially considered to be in a single cluster. Next, we iteratively pick the largest cluster, reduce the dimensionality using PCA or CCA on the points in this cluster, and use k-means to break the cluster into smaller sub-clusters (for some fixed k), until we reach the total desired number of clusters. The intuition for this is that different clusters may have different natural subspaces; a sketch of this procedure is given below.
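The sketch below illustrates the hierarchical procedure with PCA as the per-cluster dimensionality reduction; substituting a CCA projection of the two views follows the same pattern. All names and parameter values are illustrative assumptions, not the settings used in the experiments.

# Sketch: repeatedly split the largest cluster with (per-cluster PCA -> k-means).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def hierarchical_cluster(x, total_clusters, split_k=2, n_dims=10, seed=0):
    """Return a list of index arrays, one per cluster."""
    clusters = [np.arange(len(x))]                   # start with one cluster of all points
    while len(clusters) < total_clusters:
        clusters.sort(key=len, reverse=True)
        largest = clusters.pop(0)                    # pick the largest cluster
        if len(largest) <= split_k:                  # too small to split further
            clusters.append(largest)
            break
        dims = min(n_dims, len(largest) - 1, x.shape[1])
        z = PCA(n_components=dims).fit_transform(x[largest])
        labels = KMeans(n_clusters=split_k, n_init=10, random_state=seed).fit_predict(z)
        clusters.extend(largest[labels == c] for c in range(split_k) if np.any(labels == c))
    return clusters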

As before, we evaluate the clustering using the conditional perplexity of the article category a (as given by Wikipedia) given the cluster c, 2^{H(a|c)}. For each article we use the first category listed in the article. The 128,327 articles include roughly 15,000 categories, of which we use the 500 most frequent ones, which cover 73,145 articles. While the clustering is performed on all 128,327 articles, the reported entropies are for the 73,145 articles covered by the 500 categories. Each sub-clustering consists of 10 runs of k-means, and the one with the lowest k-means score is taken as the final cluster assignment.

Figure 1(e) shows the conditional perplexity versus the number of clusters for PCA- and CCA-based hierarchical clustering. For any number of clusters, CCA produces better clusterings, i.e. ones with lower perplexity. (Note that the reduction in entropy with a larger number of clusters is expected for any approach; the relevant point is the difference for any given number of clusters.) In addition, the tree structures of the PCA- and CCA-based clusterings are qualitatively different. With PCA-based clustering, most points are assigned to a few large clusters, with the remaining clusters consisting of only a few points. CCA-based hierarchical clustering produces more balanced clusters. To see this, in Figure 1(f) we show the perplexity of the cluster distribution, 2^{H(c)}, versus the number of clusters. For about 25 or more clusters, the CCA-based clusterings have higher perplexity, indicating a more uniform distribution of clusters than in PCA-based clustering.

Figure 1. (a)-(d) Distributions of cluster assignments per speaker in the audio-visual experiments ((a) audio, PCA basis; (b) audio, CCA basis; (c) images + occlusion, PCA basis; (d) images + occlusion, CCA basis). The color of each cell (s, c) corresponds to the empirical probability p(c|s) (darker = higher). (e)-(f) Wikipedia experiments: (e) conditional perplexity of article category given cluster, 2^{H(a|c)}, as a function of the number of clusters; (f) perplexity of the cluster distribution, 2^{H(c)}, as a function of the number of clusters, for hierarchical CCA, hierarchical PCA, and a balanced clustering.

References

[AK05] S. Arora and R. Kannan. Learning mixtures of separated nonspherical Gaussians. Annals of Applied Probability, 15(1A):69-92, 2005.

[AM05] D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In Proceedings of the 18th Annual Conference on Learning Theory, 2005.

[AZ07] Rie Kubota Ando and Tong Zhang. Two-view feature generation model for semi-supervised learning. In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pages 25-32, New York, NY, USA, 2007. ACM.

[BM98] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In COLT: Proceedings of the Workshop on Computational Learning Theory, Morgan Kaufmann Publishers, 1998.

[BV08] S. C. Brubaker and S. Vempala. Isotropic PCA and affine-invariant clustering. In Proceedings of the Foundations of Computer Science, 2008.

[CR08] K. Chaudhuri and S. Rao. Learning mixtures of distributions using correlations and independence. In Proceedings of the Conference on Learning Theory, 2008.

[Das99] S. Dasgupta. Learning mixtures of Gaussians. In Proceedings of the 40th IEEE Symposium on Foundations of Computer Science, 1999.

[DS00] S. Dasgupta and L. Schulman. A two-round variant of EM for Gaussian mixtures. In Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI), 2000.

[KF07] Sham M. Kakade and Dean P. Foster.
Multi-view regression via canonical correlation analysis. In Nader H. Bshouty and Claudio Gentile, editors, COLT, volume 4539 of Lecture Notes in Computer Science. Springer, 2007.

[KSV05] R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture models. In Proceedings of the 18th Annual Conference on Learning Theory, 2005.

[RV07] M. Rudelson and R. Vershynin. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM, 2007.

[VW02] S. Vempala and G. Wang. A spectral algorithm for learning mixtures of distributions. In Proceedings of the 43rd IEEE Symposium on Foundations of Computer Science, 2002.


arxiv: v4 [math.pr] 27 Jul 2016 The Asymptotic Distribution of the Determinant of a Ranom Correlation Matrix arxiv:309768v4 mathpr] 7 Jul 06 AM Hanea a, & GF Nane b a Centre of xcellence for Biosecurity Risk Analysis, University of Melbourne,

More information

PLAL: Cluster-based Active Learning

PLAL: Cluster-based Active Learning JMLR: Workshop an Conference Proceeings vol 3 (13) 1 22 PLAL: Cluster-base Active Learning Ruth Urner rurner@cs.uwaterloo.ca School of Computer Science, University of Waterloo, Canaa, ON, N2L 3G1 Sharon

More information

A Sketch of Menshikov s Theorem

A Sketch of Menshikov s Theorem A Sketch of Menshikov s Theorem Thomas Bao March 14, 2010 Abstract Let Λ be an infinite, locally finite oriente multi-graph with C Λ finite an strongly connecte, an let p

More information

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS Yannick DEVILLE Université Paul Sabatier Laboratoire Acoustique, Métrologie, Instrumentation Bât. 3RB2, 8 Route e Narbonne,

More information

THE GENUINE OMEGA-REGULAR UNITARY DUAL OF THE METAPLECTIC GROUP

THE GENUINE OMEGA-REGULAR UNITARY DUAL OF THE METAPLECTIC GROUP THE GENUINE OMEGA-REGULAR UNITARY DUAL OF THE METAPLECTIC GROUP ALESSANDRA PANTANO, ANNEGRET PAUL, AND SUSANA A. SALAMANCA-RIBA Abstract. We classify all genuine unitary representations of the metaplectic

More information

Homework 2 Solutions EM, Mixture Models, PCA, Dualitys

Homework 2 Solutions EM, Mixture Models, PCA, Dualitys Homewor Solutions EM, Mixture Moels, PCA, Dualitys CMU 0-75: Machine Learning Fall 05 http://www.cs.cmu.eu/~bapoczos/classes/ml075_05fall/ OUT: Oct 5, 05 DUE: Oct 9, 05, 0:0 AM An EM algorithm for a Mixture

More information

arxiv: v4 [cs.ds] 7 Mar 2014

arxiv: v4 [cs.ds] 7 Mar 2014 Analysis of Agglomerative Clustering Marcel R. Ackermann Johannes Blömer Daniel Kuntze Christian Sohler arxiv:101.697v [cs.ds] 7 Mar 01 Abstract The iameter k-clustering problem is the problem of partitioning

More information

Image Denoising Using Spatial Adaptive Thresholding

Image Denoising Using Spatial Adaptive Thresholding International Journal of Engineering Technology, Management an Applie Sciences Image Denoising Using Spatial Aaptive Thresholing Raneesh Mishra M. Tech Stuent, Department of Electronics & Communication,

More information

Approximate Constraint Satisfaction Requires Large LP Relaxations

Approximate Constraint Satisfaction Requires Large LP Relaxations Approximate Constraint Satisfaction Requires Large LP Relaxations oah Fleming April 19, 2018 Linear programming is a very powerful tool for attacking optimization problems. Techniques such as the ellipsoi

More information

6 General properties of an autonomous system of two first order ODE

6 General properties of an autonomous system of two first order ODE 6 General properties of an autonomous system of two first orer ODE Here we embark on stuying the autonomous system of two first orer ifferential equations of the form ẋ 1 = f 1 (, x 2 ), ẋ 2 = f 2 (, x

More information

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes Leaving Ranomness to Nature: -Dimensional Prouct Coes through the lens of Generalize-LDPC coes Tavor Baharav, Kannan Ramchanran Dept. of Electrical Engineering an Computer Sciences, U.C. Berkeley {tavorb,

More information

The Role of Models in Model-Assisted and Model- Dependent Estimation for Domains and Small Areas

The Role of Models in Model-Assisted and Model- Dependent Estimation for Domains and Small Areas The Role of Moels in Moel-Assiste an Moel- Depenent Estimation for Domains an Small Areas Risto Lehtonen University of Helsini Mio Myrsylä University of Pennsylvania Carl-Eri Särnal University of Montreal

More information

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Ashish Goel Michael Kapralov Sanjeev Khanna Abstract We consier the well-stuie problem of fining a perfect matching in -regular bipartite

More information

Modeling of Dependence Structures in Risk Management and Solvency

Modeling of Dependence Structures in Risk Management and Solvency Moeling of Depenence Structures in Risk Management an Solvency University of California, Santa Barbara 0. August 007 Doreen Straßburger Structure. Risk Measurement uner Solvency II. Copulas 3. Depenent

More information

. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences.

. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences. S 63 Lecture 8 2/2/26 Lecturer Lillian Lee Scribes Peter Babinski, Davi Lin Basic Language Moeling Approach I. Special ase of LM-base Approach a. Recap of Formulas an Terms b. Fixing θ? c. About that Multinomial

More information

Binary Discrimination Methods for High Dimensional Data with a. Geometric Representation

Binary Discrimination Methods for High Dimensional Data with a. Geometric Representation Binary Discrimination Methos for High Dimensional Data with a Geometric Representation Ay Bolivar-Cime, Luis Miguel Corova-Roriguez Universia Juárez Autónoma e Tabasco, División Acaémica e Ciencias Básicas

More information

arxiv: v2 [cs.ds] 11 May 2016

arxiv: v2 [cs.ds] 11 May 2016 Optimizing Star-Convex Functions Jasper C.H. Lee Paul Valiant arxiv:5.04466v2 [cs.ds] May 206 Department of Computer Science Brown University {jasperchlee,paul_valiant}@brown.eu May 3, 206 Abstract We

More information

A simple tranformation of copulas

A simple tranformation of copulas A simple tranformation of copulas V. Durrleman, A. Nikeghbali & T. Roncalli Groupe e Recherche Opérationnelle Créit Lyonnais France July 31, 2000 Abstract We stuy how copulas properties are moifie after

More information

Linear Regression with Limited Observation

Linear Regression with Limited Observation Ela Hazan Tomer Koren Technion Israel Institute of Technology, Technion City 32000, Haifa, Israel ehazan@ie.technion.ac.il tomerk@cs.technion.ac.il Abstract We consier the most common variants of linear

More information

Practical Analysis of Key Recovery Attack against Search-LWE Problem

Practical Analysis of Key Recovery Attack against Search-LWE Problem Practical Analysis of Key Recovery Attack against Search-LWE Problem Royal Holloway an Kyushu University Workshop on Lattice-base cryptography 7 th September, 2016 Momonari Kuo Grauate School of Mathematics,

More information

Hybrid Fusion for Biometrics: Combining Score-level and Decision-level Fusion

Hybrid Fusion for Biometrics: Combining Score-level and Decision-level Fusion Hybri Fusion for Biometrics: Combining Score-level an Decision-level Fusion Qian Tao Raymon Velhuis Signals an Systems Group, University of Twente Postbus 217, 7500AE Enschee, the Netherlans {q.tao,r.n.j.velhuis}@ewi.utwente.nl

More information

Ramsey numbers of some bipartite graphs versus complete graphs

Ramsey numbers of some bipartite graphs versus complete graphs Ramsey numbers of some bipartite graphs versus complete graphs Tao Jiang, Michael Salerno Miami University, Oxfor, OH 45056, USA Abstract. The Ramsey number r(h, K n ) is the smallest positive integer

More information

New Statistical Test for Quality Control in High Dimension Data Set

New Statistical Test for Quality Control in High Dimension Data Set International Journal of Applie Engineering Research ISSN 973-456 Volume, Number 6 (7) pp. 64-649 New Statistical Test for Quality Control in High Dimension Data Set Shamshuritawati Sharif, Suzilah Ismail

More information

Math 1271 Solutions for Fall 2005 Final Exam

Math 1271 Solutions for Fall 2005 Final Exam Math 7 Solutions for Fall 5 Final Eam ) Since the equation + y = e y cannot be rearrange algebraically in orer to write y as an eplicit function of, we must instea ifferentiate this relation implicitly

More information