Greedy Subspace Clustering


Technical Report 111
Greedy Subspace Clustering
Research Supervisor: Constantine Caramanis
Wireless Networking and Communications Group
September 2016

Data-Supported Transportation Operations & Planning Center (D-STOP)
A Tier 1 USDOT University Transportation Center at The University of Texas at Austin

D-STOP is a collaborative initiative by researchers at the Center for Transportation Research and the Wireless Networking and Communications Group at The University of Texas at Austin.

Technical Report Documentation Page

1. Report No.: D-STOP/2016/
2. Government Accession No.:
3. Recipient's Catalog No.:
4. Title and Subtitle: Greedy Subspace Clustering
5. Report Date: September 2016
6. Performing Organization Code:
7. Author(s): Dohyung Park, Constantine Caramanis, Sujay Sanghavi
8. Performing Organization Report No.: Report 111
9. Performing Organization Name and Address: Data-Supported Transportation Operations & Planning Center (D-STOP), The University of Texas at Austin, 1616 Guadalupe Street, Suite 4.202, Austin, Texas
10. Work Unit No. (TRAIS):
11. Contract or Grant No.: DTRT13-G-UTC58
12. Sponsoring Agency Name and Address: Data-Supported Transportation Operations & Planning Center (D-STOP), The University of Texas at Austin, 1616 Guadalupe Street, Suite 4.202, Austin, Texas
13. Type of Report and Period Covered:
14. Sponsoring Agency Code:
15. Supplementary Notes: Supported by a grant from the U.S. Department of Transportation, University Transportation Centers Program.
16. Abstract: We consider the problem of subspace clustering: given points that lie on or near the union of many low-dimensional linear subspaces, recover the subspaces. To this end, one first identifies sets of points close to the same subspace and uses the sets to estimate the subspaces. As the geometric structure of the clusters (linear subspaces) forbids proper performance of general distance-based approaches such as K-means, many model-specific methods have been proposed. In this paper, we provide new simple and efficient algorithms for this problem. Our statistical analysis shows that the algorithms are guaranteed exact (perfect) clustering performance under certain conditions on the number of points and the affinity between subspaces. These conditions are weaker than those considered in the standard statistical literature. Experimental results on synthetic data generated from the standard unions of subspaces model demonstrate our theory. We also show that our algorithm performs competitively against state-of-the-art algorithms on real-world applications such as motion segmentation and face clustering, but with much simpler implementation and lower computational cost.
17. Key Words: Subspace clustering; algorithms
18. Distribution Statement: No restrictions. This document is available to the public through NTIS (National Technical Information Service), 5285 Port Royal Road, Springfield, Virginia 22161.
19. Security Classif. (of this report): Unclassified
20. Security Classif. (of this page): Unclassified
21. No. of Pages:
22. Price:

Form DOT F 1700.7    Reproduction of completed page authorized

Disclaimer

The contents of this report reflect the views of the authors, who are responsible for the facts and the accuracy of the information presented herein. This document is disseminated under the sponsorship of the U.S. Department of Transportation's University Transportation Centers Program, in the interest of information exchange. The U.S. Government assumes no liability for the contents or use thereof.

The contents of this report reflect the views of the authors, who are responsible for the facts and the accuracy of the information presented herein. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.

Acknowledgements

The authors recognize that support for this research was provided by a grant from the U.S. Department of Transportation, University Transportation Centers.

Project Update: Greedy Subspace Clustering
PI: Constantine Caramanis

Subspace clustering is the problem of fitting a collection of high-dimensional data points to a union of subspaces. This problem can be regarded as a mixture of PCA, because the points are to be partitioned into subsets, each of which is modeled by a low-dimensional subspace. This model naturally arises in settings where data from multiple latent phenomena are mixed together and need to be separated.

Recent algorithms for subspace clustering are based on two steps. In the first step, a number of neighbor points are collected for each data point. Then the points are segmented using spectral clustering. The state-of-the-art algorithms solve a convex program whose size is as large as the square of the number of data points. As the amount of data increases, this computational cost becomes critical. To reduce the cost, we propose greedy algorithms. Our algorithms are provable in the sense that exactly correct clustering is guaranteed under certain conditions in the standard models. For example, when d-dimensional subspaces are drawn uniformly at random in R^p, and random data points are iid uniformly generated on each subspace, exact clustering is guaranteed with high probability provided the number of points per subspace is large enough relative to d and the subspace dimension d is small enough relative to the ambient dimension p (the precise conditions are stated in Theorem 1 of the report below). This condition is no worse than the existing statistical guarantees. In practice, our greedy algorithm, which is in general significantly faster than solving a convex program, performs competitively against those algorithms on real-world benchmark datasets. We observe that our algorithm outperforms the existing greedy algorithms. This was the result reported in the following report.

Since the publication of those results in the Proceedings of the Neural Information Processing Systems conference (NIPS) in December 2014, we continued to work on this project in collaboration with Prof. Sujay Sanghavi (UT Austin) and Dr. Dohyung Park (now a researcher at Facebook) through August 2016. Our empirical results continue to represent the state of the art among all currently available algorithms. Therefore we believe it important to seek deeper theoretical insight into precisely why our algorithm seems to perform better than others. While we have been able to understand much about how our algorithm operates and why it succeeds, we were unable to obtain new theoretical guarantees beyond those already reported. As a consequence, no further publications were a direct consequence of this line of work, though in general this topic then led us to consider mixed regression, which has produced further publications as reported elsewhere.

Greedy Subspace Clustering

Dohyung Park
Dept. of Electrical and Computer Engineering
The University of Texas at Austin

Constantine Caramanis
Dept. of Electrical and Computer Engineering
The University of Texas at Austin

Sujay Sanghavi
Dept. of Electrical and Computer Engineering
The University of Texas at Austin

Abstract

We consider the problem of subspace clustering: given points that lie on or near the union of many low-dimensional linear subspaces, recover the subspaces. To this end, one first identifies sets of points close to the same subspace and uses the sets to estimate the subspaces. As the geometric structure of the clusters (linear subspaces) forbids proper performance of general distance-based approaches such as K-means, many model-specific methods have been proposed. In this paper, we provide new simple and efficient algorithms for this problem. Our statistical analysis shows that the algorithms are guaranteed exact (perfect) clustering performance under certain conditions on the number of points and the affinity between subspaces. These conditions are weaker than those considered in the standard statistical literature. Experimental results on synthetic data generated from the standard unions of subspaces model demonstrate our theory. We also show that our algorithm performs competitively against state-of-the-art algorithms on real-world applications such as motion segmentation and face clustering, but with much simpler implementation and lower computational cost.

1 Introduction

Subspace clustering is a classic problem where one is given points in a high-dimensional ambient space and would like to approximate them by a union of lower-dimensional linear subspaces. In particular, each subspace contains a subset of the points. This problem is hard because one needs to jointly find the subspaces and the points corresponding to each; the data we are given are unlabeled. The unions of subspaces model naturally arises in settings where data from multiple latent phenomena are mixed together and need to be separated. Applications of subspace clustering include motion segmentation [23], face clustering [8], gene expression analysis [10], and system identification [22]. In these applications, data points with the same label (e.g., face images of a person under varying illumination conditions, feature points of a moving rigid object in a video sequence) lie on a low-dimensional subspace, and the mixed dataset can be modeled by a union of subspaces. For a detailed description of the applications, we refer the readers to the reviews [10, 20] and references therein.

There is now a sizable literature on empirical methods for this particular problem and some statistical analysis as well. Many recently proposed methods, which perform remarkably well and have theoretical guarantees on their performance, can be characterized as involving two steps: (a) finding a neighborhood for each data point, and (b) finding the subspaces and/or clustering the points given these neighborhoods. Here, neighbors of a point are other points that the algorithm estimates to lie on the same subspace as the point (and not necessarily just the closest in Euclidean distance).
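To make the two-step structure concrete, the sketch below shows step (b) in its most common form: given a 0/1 neighborhood matrix W produced by any step-(a) method, symmetrize it and run spectral clustering. This is an illustrative sketch using scikit-learn, not code from the paper; the paper's own step-(b) alternative, GSR, is described in Section 2.

```python
# Generic step (b): spectral clustering on a neighborhood matrix W (illustrative sketch).
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_neighborhoods(W, L):
    """W: (N, N) 0/1 neighborhood matrix from any step-(a) method; L: number of clusters."""
    Z = W + W.T                                    # symmetrize so Z can serve as an affinity
    model = SpectralClustering(n_clusters=L, affinity="precomputed", random_state=0)
    return model.fit_predict(Z)                    # cluster label for each of the N points
```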

Algorithm      | What is guaranteed    | Subspace condition | Fully random model          | Semi-random model
SSC [4, 16]    | Correct neighborhoods | None               | d/p = O(log(n/d)/log(nL))   | max aff = O(sqrt(log(n/d))/log(nL))
LRR [14]       | Exact clustering      | No intersection    | -                           | -
SSC-OMP [3]    | Correct neighborhoods | No intersection    | -                           | -
TSC [6, 7]     | Exact clustering      | None               | d/p = O(1/log(nL))          | max aff = O(1/log(nL))
LRSSC [24]     | Correct neighborhoods | None               | d/p = O(1/log(nL))          | -
NSN+GSR        | Exact clustering      | None               | d/p = O(log n/log(nL))      | max aff = O((log n)/(sqrt(d log L) log(nL)))
NSN+Spectral   | Exact clustering      | None               | d/p = O(log n/log(nL))      | -

Table 1: Subspace clustering algorithms with theoretical guarantees. LRR and SSC-OMP have only deterministic guarantees, not statistical ones. In the two standard statistical models, there are n data points on each of L d-dimensional subspaces in R^p. For the definition of max aff, we refer the readers to Section 3.1.

Our contributions: In this paper we devise new algorithms for each of the two steps above; (a) we develop a new method, Nearest Subspace Neighbor (NSN), to determine a neighborhood set for each point, and (b) a new method, Greedy Subspace Recovery (GSR), to recover subspaces from given neighborhoods. Each of these two methods can be used in conjunction with other methods for the corresponding other step; however, in this paper we focus on two algorithms that use NSN followed by GSR and Spectral clustering, respectively. Our main result is establishing statistical guarantees for exact clustering with general subspace conditions, in the standard models considered in recent analytical literature on subspace clustering. Our condition for exact recovery is weaker than the conditions of other existing algorithms that only guarantee correct neighborhoods(1), which do not always lead to correct clustering. We provide numerical results which demonstrate our theory. We also show that for real-world applications our algorithm performs competitively against state-of-the-art algorithms, while its computational cost is much lower. Moreover, our algorithms are much simpler to implement.

1.1 Related work

The problem was first formulated in the data mining community [10]. Most of the related work in this field assumes that an underlying subspace is parallel to some canonical axes. Subspace clustering for unions of arbitrary subspaces is considered mostly in the machine learning and the computer vision communities [20]. Most of the results from those communities are based on empirical justification. They provide algorithms derived from theoretical intuition and show that they perform empirically well on practical datasets. To name a few, GPCA [21], Spectral curvature clustering (SCC) [2], and many iterative methods [1, 19, 26] show good empirical performance for subspace clustering. However, they lack theoretical analysis that guarantees exact clustering.

As described above, several algorithms with a common structure have recently been proposed with both theoretical guarantees and remarkable empirical performance. Elhamifar and Vidal [4] proposed an algorithm called Sparse Subspace Clustering (SSC), which uses l1-minimization for neighborhood construction. They proved that if the subspaces have no intersection, SSC always finds a correct neighborhood matrix. Later, Soltanolkotabi and Candès [16] provided a statistical guarantee for the algorithm for subspaces with intersection. Dyer et al. [3] proposed another algorithm called SSC-OMP, which uses Orthogonal Matching Pursuit (OMP) instead of l1-minimization in SSC. Another algorithm called Low-Rank Representation (LRR), which uses nuclear norm minimization, was proposed by Liu et al. [14]. Wang et al.
[24] proposed a hybrid algorithm, Low-Rank and Sparse Subspace Clustering (LRSSC), which involves both the l1-norm and the nuclear norm. Heckel and Bölcsei [6] presented Thresholding-based Subspace Clustering (TSC), which constructs neighborhoods based on the inner products between data points. All of these algorithms use spectral clustering for the clustering step.

The analysis in those papers focuses on neither exact recovery of the subspaces nor exact clustering in general subspace conditions. SSC, SSC-OMP, and LRSSC only guarantee correct neighborhoods, which do not always lead to exact clustering. LRR guarantees exact clustering only when

(1) By correct neighborhood, we mean that for each point every neighbor point lies on the same subspace. By no intersection between subspaces, we mean that they share only the null point.

the subspaces have no intersections. In this paper, we provide novel algorithms that guarantee exact clustering in general subspace conditions. While we were preparing this manuscript, it was proved that TSC guarantees exact clustering under certain conditions [7], but the conditions are stricter than ours. (See Table 1.)

1.2 Notation

There is a set of N data points in R^p, denoted by Y = {y_1, ..., y_N}. The data points are lying on or near a union of L subspaces D = ∪_{i=1}^{L} D_i. Each subspace D_i is of dimension d_i, which is smaller than p. For each point y_j, w_j denotes the index of the nearest subspace. Let N_i denote the number of points whose nearest subspace is D_i, i.e., N_i = Σ_{j=1}^{N} I{w_j = i}. Throughout this paper, sets and subspaces are denoted by calligraphic letters. Matrices and key parameters are denoted by letters in upper case, and vectors and scalars are denoted by letters in lower case. We frequently denote the set of n indices by [n] = {1, 2, ..., n}. As usual, span{ } denotes the subspace spanned by a set of vectors; for example, span{v_1, ..., v_n} = {v : v = Σ_{i=1}^{n} α_i v_i, α_1, ..., α_n ∈ R}. Proj_U y is defined as the projection of y onto the subspace U, that is, Proj_U y = arg min_{u ∈ U} ||y - u||_2. I{ } denotes the indicator function, which is one if the statement is true and zero otherwise. Finally, ⊕ denotes the direct sum.

2 Algorithms

We propose two algorithms for subspace clustering as follows.

NSN+GSR: Run Nearest Subspace Neighbor (NSN) to construct a neighborhood matrix W ∈ {0,1}^{N×N}, and then run Greedy Subspace Recovery (GSR) for W.
NSN+Spectral: Run Nearest Subspace Neighbor (NSN) to construct a neighborhood matrix W ∈ {0,1}^{N×N}, and then run spectral clustering for Z = W + W^T.

2.1 Nearest Subspace Neighbor (NSN)

NSN approaches the problem of finding the neighbor points most likely to be on the same subspace in a greedy fashion. At first, given a point y without any other knowledge, the single point that is most likely to be a neighbor of y is the point nearest to the line span{y}. In the following steps, if we have found a few correct neighbor points lying on the same true subspace and have no other knowledge about the true subspace and the rest of the points, then the most potentially correct point is the one closest to the subspace spanned by the correct neighbors we have. This motivates us to propose NSN, described in the following.

Algorithm 1 Nearest Subspace Neighbor (NSN)
Input: A set of N samples Y = {y_1, ..., y_N}, the number of required neighbors K, maximum subspace dimension k_max
Output: A neighborhood matrix W ∈ {0,1}^{N×N}
  y_i ← y_i / ||y_i||_2, for all i ∈ [N]                      (Normalize magnitudes)
  for i = 1, ..., N do                                        (Run NSN for each data point)
    I_i ← {i}
    for k = 1, ..., K do                                      (Iteratively add the closest point to the current subspace)
      if k ≤ k_max then
        U ← span{y_j : j ∈ I_i}
      end if
      j* ← arg max_{j ∈ [N] \ I_i} ||Proj_U y_j||_2
      I_i ← I_i ∪ {j*}
    end for
    W_ij ← I{j ∈ I_i or y_j ∈ U}, for all j ∈ [N]             (Construct the neighborhood matrix)
  end for

NSN collects K neighbors sequentially for each point. At each step k, a k-dimensional subspace U spanned by the point and its k - 1 neighbors is constructed, and the point closest to the subspace is

newly collected. After step k_max, the subspace U constructed at the k_max-th step is used for collecting neighbors. At the last step, if there are more points lying exactly on U, they are also counted as neighbors. The subspace U can be stored in the form of a matrix U ∈ R^{p×dim(U)} whose columns form an orthonormal basis of U. Then ||Proj_U y_j|| can be computed easily because it is equal to ||U^T y_j||_2. While a naive implementation requires O(K^2 pN) computational cost, this can be reduced to O(KpN); the faster implementation is described in Section A.1. We note that this computational cost is much lower than that of the convex optimization based methods (e.g., SSC [4] and LRR [14]), which solve a convex program with N^2 variables and pN constraints.

NSN for subspace clustering shares the same philosophy as Orthogonal Matching Pursuit (OMP) for sparse recovery, in the sense that it incrementally picks the point (dictionary element) that is most likely to be correct, assuming that the algorithm has found the correct ones so far. In subspace clustering, that point is the one closest to the subspace spanned by the currently selected points, while in sparse recovery it is the one closest to the residual of linear regression by the selected points. In the sparse recovery literature, the performance of OMP is shown to be comparable to that of Basis Pursuit (l1-minimization) both theoretically and empirically [18, 11]. One of the contributions of this work is to show that this high-level intuition is indeed borne out, provably, as we show that NSN also performs well in collecting neighbors lying on the same subspace.

2.2 Greedy Subspace Recovery (GSR)

Suppose that NSN has found correct neighbors for a data point. How can we check if they are indeed correct, that is, lying on the same true subspace? One natural way is to count the number of points close to the subspace spanned by the neighbors. If they span one of the true subspaces, then many other points will be lying on the span. If they do not span any true subspace, few points will be close to it. This fact motivates us to use a greedy algorithm to recover the subspaces. Using the neighborhood constructed by NSN (or some other algorithm), we recover the L subspaces. If there is, for each subspace, a neighborhood set containing only the points on the same subspace, the algorithm successfully recovers the union of the true subspaces exactly.

Algorithm 2 Greedy Subspace Recovery (GSR)
Input: N points Y = {y_1, ..., y_N}, a neighborhood matrix W ∈ {0,1}^{N×N}, error bound ε
Output: Estimated subspaces D̂ = ∪_{l=1}^{L} D̂_l; estimated labels ŵ_1, ..., ŵ_N
  y_i ← y_i / ||y_i||_2, for all i ∈ [N]                                   (Normalize magnitudes)
  Ŵ_i ← Top-d{y_j : W_ij = 1}, for all i ∈ [N]                             (Estimate a subspace using the neighbors of each point)
  I ← [N]
  while I ≠ ∅ do                                                           (Iteratively pick the best subspace estimates)
    i* ← arg max_{i ∈ I} Σ_{j=1}^{N} I{ ||Proj_{Ŵ_i} y_j|| ≥ 1 - ε }
    D̂_l ← Ŵ_{i*}
    I ← I \ {j : ||Proj_{Ŵ_{i*}} y_j|| ≥ 1 - ε}
  end while
  ŵ_i ← arg max_{l ∈ [L]} ||Proj_{D̂_l} y_i||, for all i ∈ [N]              (Label the points using the subspace estimates)

Recall that the matrix W contains the labelings of the points, so that W_ij = 1 if point i is assigned to subspace j. Top-d{y_j : W_ij = 1} denotes the d-dimensional principal subspace of the set of vectors {y_j : W_ij = 1}. This can be obtained by taking the first d left singular vectors of the matrix whose columns are the vectors in the set. If there are only d vectors in the set, Gram-Schmidt orthogonalization will give us the subspace. As in NSN, it is efficient to store a subspace Ŵ_i in the form of its orthogonal basis because we can easily compute the norm of a projection onto the subspace.
Testing a candidate subspace by counting the number of nearby points has already been considered in the subspace clustering literature. In [25], the authors proposed to run RANdom SAmple Consensus (RANSAC) iteratively. RANSAC randomly selects a few points and checks if there are many other points near the subspace spanned by the collected points. Instead of randomly choosing sample points, GSR receives some candidate subspaces in the form of sets of points from NSN (or possibly some other algorithm) and selects subspaces in a greedy way, as specified in the algorithm above.
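Putting the two pieces together, the following is a minimal, illustrative sketch of the NSN+GSR pipeline in Python/NumPy. It is our own rendering under simplifying assumptions, not the code used for the experiments: it uses the naive O(K^2 pN) projection update rather than the faster scheme of Section A.1, and the numerical tolerance for "lying on U" is an arbitrary choice. For NSN+Spectral, one would instead symmetrize the output of nsn, Z = W + W^T, and hand Z to a spectral clustering routine.

```python
# Illustrative sketch of the NSN+GSR pipeline (Algorithms 1 and 2).
import numpy as np

def nsn(Y, K, k_max):
    """Nearest Subspace Neighbor. Y: (p, N) matrix whose columns are the data points."""
    p, N = Y.shape
    Y = Y / np.linalg.norm(Y, axis=0, keepdims=True)            # normalize magnitudes
    W = np.zeros((N, N), dtype=bool)
    for i in range(N):
        idx = [i]
        U = Y[:, [i]]                                            # current orthonormal basis
        for k in range(1, K + 1):
            if k <= k_max:
                U, _ = np.linalg.qr(Y[:, idx])                   # basis of span{y_j : j in I_i}
            proj = np.linalg.norm(U.T @ Y, axis=0)               # ||Proj_U y_j|| = ||U^T y_j||
            proj[idx] = -np.inf                                  # never re-pick a chosen point
            idx.append(int(np.argmax(proj)))
        W[i, idx] = True
        W[i, np.linalg.norm(U.T @ Y, axis=0) >= 1 - 1e-10] = True  # points (numerically) on U
    return W

def top_d_subspace(cols, d):
    """d-dimensional principal subspace (top-d left singular vectors) of the columns."""
    U, _, _ = np.linalg.svd(cols, full_matrices=False)
    return U[:, :d]

def gsr(Y, W, d, eps=1e-3):
    """Greedy Subspace Recovery. Returns the list of estimated bases and point labels."""
    Y = Y / np.linalg.norm(Y, axis=0, keepdims=True)
    N = Y.shape[1]
    cand = [top_d_subspace(Y[:, W[i]], d) for i in range(N)]     # one candidate per point
    remaining = set(range(N))
    picked = []
    while remaining:
        rem = sorted(remaining)
        counts = [np.sum(np.linalg.norm(cand[i].T @ Y, axis=0) >= 1 - eps) for i in rem]
        i_star = rem[int(np.argmax(counts))]                     # best-covering candidate
        picked.append(cand[i_star])
        covered = np.linalg.norm(cand[i_star].T @ Y, axis=0) >= 1 - eps
        remaining -= set(np.nonzero(covered)[0].tolist())
        remaining.discard(i_star)                                # ensure progress
    scores = np.stack([np.linalg.norm(B.T @ Y, axis=0) for B in picked])
    return picked, np.argmax(scores, axis=0)                     # label = nearest estimated subspace
```

For example, on noiseless data with K = k_max = d and a small eps, calling gsr(Y, nsn(Y, d, d), d) returns the estimated bases together with a label for every point.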

3 Theoretical results

We analyze our algorithms in two standard noiseless models. The main theorems present sufficient conditions under which the algorithms cluster the points exactly with high probability. For simplicity of analysis, we assume that every subspace is of the same dimension, and that the number of data points on each subspace is the same, i.e.,

    d ≜ d_1 = ... = d_L,    n ≜ N_1 = ... = N_L.

We assume that d is known to the algorithm. Nonetheless, our analysis can extend to the general case.

3.1 Statistical models

We consider two models which have been used in the subspace clustering literature:

Fully random model: The subspaces are drawn iid uniformly at random, and the points are also iid randomly generated.
Semi-random model: The subspaces are arbitrarily determined, but the points are iid randomly generated.

Let D_i ∈ R^{p×d}, i ∈ [L], be a matrix whose columns form an orthonormal basis of D_i. An important measure that we use in the analysis is the affinity between two subspaces, defined as

    aff(i, j) ≜ ||D_i^T D_j||_F / sqrt(d) = sqrt( Σ_{k=1}^{d} cos^2 θ_k^{i,j} ) / sqrt(d) ∈ [0, 1],

where θ_k^{i,j} is the k-th principal angle between D_i and D_j. Two subspaces D_i and D_j are identical if and only if aff(i, j) = 1. If aff(i, j) = 0, every vector on D_i is orthogonal to every vector on D_j. We also define the maximum affinity as

    max aff ≜ max_{i, j ∈ [L], i ≠ j} aff(i, j) ∈ [0, 1].

There are N = nL points, and there are n points exactly lying on each subspace. We assume that each data point y_i is drawn iid uniformly at random from S^{p-1} ∩ D_{w_i}, where S^{p-1} is the unit sphere in R^p. Equivalently,

    y_i = D_{w_i} x_i,    x_i ~ Unif(S^{d-1}),    for all i ∈ [N].

As the points are generated randomly on their corresponding subspaces, there are no points lying on an intersection of two subspaces, almost surely. This implies that with probability one the points are clustered correctly provided that the true subspaces are recovered exactly.

3.2 Main theorems

The first theorem gives a statistical guarantee for the fully random model.

Theorem 1. Suppose L d-dimensional subspaces and n points on each subspace are generated in the fully random model with n polynomial in d. There are constants C_1, C_2 > 0 such that if

    n > C_1 d log(ne/dδ),    d/p < C_2 (log n) / log(nLδ^{-1}),        (1)

then with probability at least 1 - 3Lδ, NSN+GSR(3) clusters the points exactly. Also, there are other constants C'_1, C'_2 > 0 such that if (1) holds with C_1 and C_2 replaced by C'_1 and C'_2, then NSN+Spectral(4) clusters the points exactly with probability at least 1 - 3Lδ.

2: e is the exponential constant.
3: NSN with K = k_max = d, followed by GSR with arbitrarily small ε.
4: NSN with K = k_max = d.
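As a concrete illustration of the fully random model and the affinity just defined, the short sketch below (ours, not from the paper) draws L random d-dimensional subspaces in R^p, samples n unit-norm points on each, and evaluates max aff; the semi-random model differs only in that the bases would be fixed rather than drawn at random.

```python
# Sketch: data from the fully random model of Section 3.1, plus the affinity measure.
import numpy as np

def random_basis(p, d, rng):
    """Orthonormal basis of a d-dimensional subspace drawn uniformly at random in R^p."""
    Q, _ = np.linalg.qr(rng.standard_normal((p, d)))
    return Q

def fully_random_model(p, d, L, n, seed=0):
    rng = np.random.default_rng(seed)
    bases = [random_basis(p, d, rng) for _ in range(L)]
    X = rng.standard_normal((d, L * n))
    X /= np.linalg.norm(X, axis=0)                       # x_i ~ Unif(S^{d-1})
    Y = np.hstack([D @ X[:, i * n:(i + 1) * n] for i, D in enumerate(bases)])
    labels = np.repeat(np.arange(L), n)                  # w_i: index of the true subspace
    return Y, labels, bases

def affinity(Di, Dj):
    """aff(i, j) = ||Di^T Dj||_F / sqrt(d), a value in [0, 1]."""
    return np.linalg.norm(Di.T @ Dj, "fro") / np.sqrt(Di.shape[1])

def max_affinity(bases):
    L = len(bases)
    return max(affinity(bases[i], bases[j]) for i in range(L) for j in range(L) if i != j)
```

Setting d/p = 3/5, as in the synthetic experiments of Section 4.1 below, forces every pair of these random subspaces to intersect, so max aff is bounded away from zero.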

Our sufficient conditions for exact clustering explain when subspace clustering becomes easy or difficult, and they are consistent with our intuition. For NSN to find correct neighbors, the points on the same subspace should be numerous enough that they look like they lie on a subspace. This condition is spelled out in the first inequality of (1). We note that the condition holds even when n/d is a constant, i.e., n is linear in d. The second inequality implies that the dimension of the subspaces should not be too high for the subspaces to be distinguishable. If d is high, the random subspaces are more likely to be close to each other, and hence they become more difficult to distinguish. However, as n increases, the points become dense on the subspaces, and hence it becomes easier to identify different subspaces.

Let us compare our result with the conditions required for success in the fully random model in the existing literature. In [16], it is required for SSC to have correct neighborhoods that n should be superlinear in d when d/p is fixed. In [6, 24], the conditions on d/p become worse as we have more points. On the other hand, our algorithms guarantee exact clustering of the points, and the sufficient condition is order-wise at least as good as the conditions for correct neighborhoods of the existing algorithms (see Table 1). Moreover, exact clustering is guaranteed even when n is linear in d and d/p is fixed.

For the semi-random model, we have the following general theorem.

Theorem 2. Suppose L d-dimensional subspaces are arbitrarily chosen, and n points on each subspace are generated in the semi-random model with n polynomial in d. There are constants C_1, C_2 > 0 such that if

    n > C_1 d log(ne/dδ),    max aff < C_2 (log n) / ( sqrt(d log(Lδ^{-1})) · log(nLδ^{-1}) ),        (2)

then with probability at least 1 - 3Lδ, NSN+GSR(5) clusters the points exactly.

In the semi-random model, the sufficient condition does not depend on the ambient dimension p. When the affinities between subspaces are fixed, and the points are exactly lying on the subspaces, the difficulty of the problem does not depend on the ambient dimension. It rather depends on max aff, which measures how close the subspaces are. As they become closer to each other, it becomes more difficult to distinguish the subspaces. The second inequality of (2) explains this intuition. The inequality also shows that if we have more data points, it becomes easier to identify different subspaces.

Compared with other algorithms, NSN+GSR is guaranteed exact clustering, and more importantly, the condition on max aff improves as n grows. This remark is consistent with the practical performance of the algorithm, which improves as the number of data points increases, while the existing guarantees of other algorithms do not. In [16], correct neighborhoods in SSC are guaranteed if max aff = O(sqrt(log(n/d)) / log(nL)). In [6], exact clustering of TSC is guaranteed if max aff = O(1 / log(nL)). However, these algorithms do perform empirically better as the number of data points increases.

4 Experimental results

In this section, we empirically compare our algorithms with the existing algorithms in terms of clustering performance and computational time (on a single desktop). For NSN, we use the fast implementation described in Section A.1. The compared algorithms are K-means, K-flats(6), SSC, LRR, SCC, TSC(7), and SSC-OMP(8). The numbers of replicates in K-means, K-flats, and the K-means used in the spectral clustering are all fixed to 10.

5: NSN with K = d - 1 and k_max = 2 log d, followed by GSR with arbitrarily small ε.
6: K-flats is similar to K-means.
At each iteration, it computes the top-d principal subspaces of the points with the same label, and then labels every point based on its distances to those subspaces.
7: The MATLAB codes for SSC, LRR, SCC, and TSC are obtained from jhu.edu/~ehsan/code.htm, glchen/scc.html, and commth/research/downloads/sc.html, respectively.
8: For each data point, SSC-OMP constructs a neighborhood by regressing the point on the other points up to 10^-4 accuracy.
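Since footnote 6 describes the K-flats baseline only in words, here is a minimal sketch of that baseline (our own illustration; the function name and defaults are ours, and this is not the implementation used in the experiments).

```python
# Minimal K-flats sketch: alternate between fitting a d-dimensional principal subspace
# (a "flat" through the origin) to each cluster and re-assigning points to the nearest flat.
import numpy as np

def k_flats(Y, L, d, n_iter=50, seed=0):
    """Y: (p, N) data matrix; L: number of flats; d: flat dimension."""
    rng = np.random.default_rng(seed)
    p, N = Y.shape
    labels = rng.integers(L, size=N)                    # random initial labeling
    for _ in range(n_iter):
        bases = []
        for l in range(L):
            Yl = Y[:, labels == l]
            if Yl.shape[1] < d:                         # degenerate cluster: re-seed it
                Yl = Y[:, rng.choice(N, size=d, replace=False)]
            U, _, _ = np.linalg.svd(Yl, full_matrices=False)
            bases.append(U[:, :d])                      # top-d principal subspace
        # distance to a flat = norm of the residual after projection onto it
        resid = np.stack([np.linalg.norm(Y - B @ (B.T @ Y), axis=0) for B in bases])
        new_labels = np.argmin(resid, axis=0)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```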

[Figure 1: CE of algorithms on random d-dimensional subspaces with n random points on each subspace, shown for different values of n/d and ambient dimension p; d/p is fixed to be 3/5. Algorithms: SSC, SSC-OMP, LRR, TSC, NSN+Spectral, NSN+GSR. Brighter cells represent that fewer data points are clustered incorrectly.]

[Figure 2: NSE for the same model parameters as those in Figure 1. Neighborhood selection methods: l1-minimization (SSC), OMP (SSC-OMP), nuclear norm minimization (LRR), nearest neighbor (TSC), NSN. Brighter cells represent that more data points have all correct neighbors.]

[Figure 3: Average computational time of the neighborhood selection algorithms: l1-minimization (SSC), OMP (SSC-OMP), nuclear norm minimization (LRR), thresholding (TSC), NSN. Left: five 10-dimensional subspaces, as a function of the number of data points per subspace n. Right: 100-dimensional ambient space, 10-dimensional subspaces, as a function of the number of subspaces L.]

The algorithms are compared in terms of Clustering error (CE) and Neighborhood selection error (NSE), defined as

    CE = min_{π ∈ Π_L} (1/N) Σ_{i=1}^{N} I{w_i ≠ π(ŵ_i)},        NSE = (1/N) Σ_{i=1}^{N} I{∃ j : W_ij ≠ 0, w_i ≠ w_j},

where Π_L is the permutation space of [L]. CE is the proportion of incorrectly labeled data points. Since clustering is invariant up to permutation of the label indices, the error is equal to the minimum disagreement over the permutations of label indices. NSE measures the proportion of the points which do not have all correct neighbors.(9)

9: For the neighborhood matrices from SSC, LRR, and SSC-OMP, the d points with the maximum weights are regarded as neighbors for each point. For TSC, the d nearest neighbors are collected for each point.

4.1 Synthetic data

We compare the performances on synthetic data generated from the fully random model. In R^p, five d-dimensional subspaces are generated uniformly at random. Then for each subspace, n unit-norm points are generated iid uniformly at random on the subspace. To see the agreement with the theoretical results, we ran the algorithms under fixed d/p and varied n and d. We set d/p = 3/5 so that each pair of subspaces has intersection. Figures 1 and 2 show CE and NSE, respectively. Each error value is averaged over 100 trials. Figure 1 indicates that our algorithm clusters the data points better than the other algorithms. As predicted in the theorems, the clustering performance improves

as the number of points increases. However, it also improves as the dimension of the subspaces grows, in contrast to the theoretical analysis. We believe that this is because our analysis of GSR is not tight. In Figure 2, we can see that more data points obtain correct neighbors as n increases or d decreases, which conforms to the theoretical analysis. We also compare the computational time of the neighborhood selection algorithms for different numbers of subspaces and data points. As shown in Figure 3, the greedy algorithms (OMP, thresholding, and NSN) are significantly more scalable than the convex optimization based algorithms (l1-minimization and nuclear norm minimization).

[Table 2: CE and computational time (mean CE %, median CE %, average time in seconds) of algorithms on the Hopkins155 dataset, for each number of clusters (motions) L. Algorithms: K-means, K-flats, SSC, LRR, SCC, SSC-OMP(8), TSC(10), NSN+Spectral. The numbers in the parentheses represent the number of neighbors for each point collected in the corresponding algorithms.]

[Table 3: CE and computational time (mean CE %, median CE %, average time in seconds) of algorithms on the Extended Yale B dataset. Algorithms: K-means, K-flats, SSC, SSC-OMP, TSC, NSN+Spectral. For each number of clusters (faces) L, the algorithms were run over 100 random subsets drawn from the overall 38 clusters.]

4.2 Real-world data: motion segmentation and face clustering

We compare our algorithm with the existing ones in the applications of motion segmentation and face clustering. For the motion segmentation, we use the Hopkins155 dataset [17], which contains 155 video sequences of 2 or 3 motions. For the face clustering, we use the Extended Yale B dataset with cropped images from [5, 13]. The dataset contains 64 images for each of 38 individuals in frontal view and different illumination conditions. To compare with the existing algorithms, we use the set of 48 x 42 resized raw images provided by the authors of [4]. The parameters of the existing algorithms were set as provided in their source codes.(10) Tables 2 and 3 show CE and average computational time.(11) We can see that NSN+Spectral performs competitively with the methods with the lowest errors, but much faster. Compared to the other greedy neighborhood construction based algorithms, SSC-OMP and TSC, our algorithm performs significantly better.

Acknowledgments

The authors would like to acknowledge NSF grants 1302435, 0954059, 1017525, and DTRA grant HDTRA for supporting this research. This research was also partially supported by the U.S. Department of Transportation through the Data-Supported Transportation Operations and Planning (D-STOP) Tier 1 University Transportation Center.

10: As SSC-OMP and TSC do not have proposed numbers of parameters for motion segmentation, we found the numbers minimizing the mean CE. The numbers are given in the table.
11: The LRR code provided by the author did not perform properly with the face clustering dataset that we use. We did not run NSN+GSR since the data points are not well distributed in their corresponding subspaces.

References

[1] P. S. Bradley and O. L. Mangasarian. k-plane clustering. Journal of Global Optimization, 16(1):23-32, 2000.
[2] G. Chen and G. Lerman. Spectral curvature clustering (SCC). International Journal of Computer Vision, 81(3), 2009.
[3] E. L. Dyer, A. C. Sankaranarayanan, and R. G. Baraniuk. Greedy feature selection for subspace clustering. The Journal of Machine Learning Research (JMLR), 14(1):2487-2517, 2013.
[4] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(11):2765-2781, 2013.
[5] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intelligence, 23(6), 2001.
[6] R. Heckel and H. Bölcsei. Subspace clustering via thresholding and spectral clustering. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2013.
[7] R. Heckel and H. Bölcsei. Robust subspace clustering via thresholding. arXiv preprint, 2014.
[8] J. Ho, M.-H. Yang, J. Lim, K.-C. Lee, and D. Kriegman. Clustering appearances of objects under varying illumination conditions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2003.
[9] T. Inglot. Inequalities for quantiles of the chi-square distribution. Probability and Mathematical Statistics, 30:339-351, 2010.
[10] H.-P. Kriegel, P. Kröger, and A. Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(1):1, 2009.
[11] S. Kunis and H. Rauhut. Random sampling of sparse trigonometric polynomials, II. Orthogonal matching pursuit versus basis pursuit. Foundations of Computational Mathematics, 8(6), 2008.
[12] M. Ledoux. The concentration of measure phenomenon, volume 89. AMS Bookstore, 2005.
[13] K. C. Lee, J. Ho, and D. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. IEEE Trans. Pattern Anal. Mach. Intelligence, 27(5), 2005.
[14] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust recovery of subspace structures by low-rank representation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(1), 2013.
[15] V. D. Milman and G. Schechtman. Asymptotic Theory of Finite Dimensional Normed Spaces: Isoperimetric Inequalities in Riemannian Manifolds. Lecture Notes in Mathematics. Springer.
[16] M. Soltanolkotabi and E. J. Candès. A geometric analysis of subspace clustering with outliers. The Annals of Statistics, 40(4):2195-2238, 2012.
[17] R. Tron and R. Vidal. A benchmark for the comparison of 3-D motion segmentation algorithms. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
[18] J. A. Tropp and A. C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. Information Theory, IEEE Transactions on, 53(12), 2007.
[19] P. Tseng. Nearest q-flat to m points. Journal of Optimization Theory and Applications, 105(1):249-252, 2000.
[20] R. Vidal. Subspace clustering. Signal Processing Magazine, IEEE, 28(2):52-68, 2011.
[21] R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis (GPCA). In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2003.
[22] R. Vidal, S. Soatto, Y. Ma, and S. Sastry. An algebraic geometric approach to the identification of a class of linear hybrid systems. In Decision and Control, 2003. Proceedings of the 42nd IEEE Conference on, volume 1. IEEE, 2003.
[23] R. Vidal, R. Tron, and R. Hartley. Multiframe motion segmentation with missing data using power factorization and GPCA. International Journal of Computer Vision, 79(1):85-105, 2008.
[24] Y.-X. Wang, H. Xu, and C. Leng. Provable subspace clustering: When LRR meets SSC. In Advances in Neural Information Processing Systems (NIPS), December 2013.
[25] A. Y. Yang, S. R. Rao, and Y. Ma. Robust statistical estimation and segmentation of multiple subspaces. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
[26] T. Zhang, A. Szlam, Y. Wang, and G. Lerman. Hybrid linear modeling via local best-fit flats. International Journal of Computer Vision, 100(3):217-240, 2012.

A Discussion on implementation issues

A.1 A faster implementation for NSN

At each step of NSN, the algorithm computes the projections of all points onto a subspace and finds the one with the largest norm. A naive implementation of the algorithm requires O(pK^2 N) time complexity. In fact, we can reduce the complexity to O(pKN). Instead of finding the maximum norm of the projections, we can find the maximum squared norm of the projections. Let U_k be the subspace U at step k. For any data point y, we have

    ||Proj_{U_k} y||^2 = ||Proj_{U_{k-1}} y||^2 + (u_k^T y)^2,

where u_k is the new orthogonal axis added to U_{k-1} to make U_k. That is, u_k ⟂ U_{k-1} and U_k = U_{k-1} ⊕ span{u_k}. As ||Proj_{U_{k-1}} y||^2 is already computed at the (k-1)-th step, we do not need to compute it again at step k. Based on this fact, we have the faster implementation described in the following. Note that P_j at the k-th step is equal to ||Proj_{U_k} y_j||^2 in the original NSN algorithm.

Algorithm 3 Fast Nearest Subspace Neighbor (F-NSN)
Input: A set of N samples Y = {y_1, ..., y_N}, the number of required neighbors K, maximum subspace dimension k_max
Output: A neighborhood matrix W ∈ {0,1}^{N×N}
  y_i ← y_i / ||y_i||_2, for all i ∈ [N]
  for i = 1, ..., N do
    I_i ← {i}, u_1 ← y_i
    P_j ← 0, for all j ∈ [N]
    for k = 1, ..., K do
      if k ≤ k_max then
        P_j ← P_j + (u_k^T y_j)^2, for all j ∈ [N]
      end if
      j* ← arg max_{j ∈ [N], j ∉ I_i} P_j
      I_i ← I_i ∪ {j*}
      if k < k_max then
        u_{k+1} ← ( y_{j*} - Σ_{l=1}^{k} (u_l^T y_{j*}) u_l ) / || y_{j*} - Σ_{l=1}^{k} (u_l^T y_{j*}) u_l ||_2
      end if
    end for
    W_ij ← I{j ∈ I_i or P_j = 1}, for all j ∈ [N]
  end for

A.2 Estimation of the number of clusters

When L is unknown, it can be estimated at the clustering step. For spectral clustering, a well-known approach to estimating L is to find a knee point in the singular values of the neighborhood matrix, i.e., the point where the difference between two consecutive singular values is the largest. For GSR, we do not need to estimate the number of clusters a priori. Once the algorithm finishes, the number of resulting groups is our estimate of L.

A.3 Parameter setting

The choices of K and k_max depend on the dimension of the subspaces. If the data points are lying exactly on the model subspaces, K = k_max = d is enough for GSR to recover a subspace. In practical situations where the points are only near the subspaces, it is better to set K to be larger than d. k_max can also be larger than d, because the (k_max - d) additional dimensions, which may be induced from the noise, do not intersect with the other subspaces in practice. For the Extended Yale B dataset and the Hopkins155 dataset, we found that NSN+Spectral performs well if K is set to be around 2d.
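To make the O(pKN) bookkeeping of Section A.1 concrete, the following sketch (ours, not the authors' code; the function name is our own) maintains the squared projection norms P_j incrementally, adding only the contribution of the newly added orthonormal direction u_k at each step.

```python
# Sketch of the incremental projection update behind F-NSN (Algorithm 3).
import numpy as np

def f_nsn_neighbors(Y, i, K, k_max):
    """Collect K neighbor indices for column i of the (p, N) matrix Y (unit-norm columns)."""
    p, N = Y.shape
    idx = [i]
    u = Y[:, i].copy()                      # u_1: first orthonormal direction
    P = np.zeros(N)                         # P_j = ||Proj_{U_k} y_j||^2 accumulated so far
    U = []                                  # orthonormal directions collected so far
    for k in range(1, K + 1):
        if k <= k_max:
            U.append(u)
            P = P + (u @ Y) ** 2            # add contribution of the new direction only
        scores = P.copy()
        scores[idx] = -np.inf               # never re-pick a chosen point
        j_star = int(np.argmax(scores))
        idx.append(j_star)
        if k < k_max:
            # Gram-Schmidt: orthogonalize y_{j*} against the current directions
            r = Y[:, j_star] - sum((ul @ Y[:, j_star]) * ul for ul in U)
            u = r / np.linalg.norm(r)
    return idx[1:]                          # neighbors of point i (excluding i itself)
```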

B Proofs

B.1 Proof outline

We describe the first few high-level steps in the proofs of our main theorems. Exact clustering of our algorithms depends on whether NSN can find all correct neighbors for the data points, so that the subsequent algorithm (GSR or spectral clustering) can cluster the points exactly. For NSN+GSR, exact clustering is guaranteed when there is a point on each subspace that has all correct neighbors, of which there are at least d - 1. For NSN+Spectral, exact clustering is guaranteed when each data point has only the n - 1 other points on the same subspace as neighbors. In the following, we explain why these are true.

Step 1-1: Exact clustering condition for GSR. The two statistical models have the property that for any d-dimensional subspace in R^p other than the true subspaces D_1, ..., D_L, the probability of any points lying on that subspace is zero. Hence, we claim the following.

Fact 3 (Best d-dimensional fit): With probability one, the true subspaces D_1, ..., D_L are the L subspaces containing the most points among the set of all possible d-dimensional subspaces.

Then it suffices for each subspace to have one point whose d - 1 neighbors are all correct points on the same subspace. This is because the subspace spanned by those points is almost surely identical to the true subspace they are lying on, and that subspace will be picked by GSR.

Fact 4: If NSN with K ≥ d - 1 finds all correct neighbors for at least one point on each subspace, GSR recovers all the true subspaces and clusters the data points exactly with probability one.

In the following steps, we consider one data point for each subspace. We show that NSN with K = k_max = d finds all correct neighbors for the point with probability at least 1 - 3δ. Then the union bound and Fact 4 establish exact clustering with probability at least 1 - 3Lδ.

Step 1-2: Exact clustering condition for spectral clustering. It is difficult to analyze spectral clustering for the resulting neighborhood matrix of NSN. A trivial case in which a neighborhood matrix results in exact clustering is when the points on the same subspace form a single fully connected component. If NSN with K = k_max = d finds all correct neighbors for every data point, the subspace U at the last step k = K is almost surely identical to the true subspace that the points lie on. Hence, the resulting neighborhood matrix W forms L fully connected components, each of which contains all of the points on the same subspace.

In the rest of the proof, we show that if (1) holds, NSN finds all correct neighbors for a fixed point with probability at least 1 - 3δ. Let us assume that this is true. If (1) holds with C_1 and C_2 replaced by larger constants C'_1 and C'_2, we have

    n > C_1 d log(ne/(d · δ/n)),    d/p < C_2 (log n) / log(nL · (δ/n)^{-1}),

i.e., the conditions hold with δ replaced by δ/n. Then it follows from the union bound that NSN finds all correct neighbors for all of the n points on each subspace with probability at least 1 - 3Lδ, and hence we obtain W_ij = I{w_i = w_j} for every i, j ∈ [N]. Exact clustering is guaranteed.

Step 2: Success condition for NSN. Now the only proposition that we need to prove is that for each subspace D_i, NSN finds all correct neighbors for a data point (a uniformly random unit vector on the subspace) with probability at least 1 - 3δ. As our analysis is independent of the subspaces, we only consider D_1. Without loss of generality, we assume that y_1 lies on D_1 (w_1 = 1) and focus on this data point.

When NSN finds neighbors of y_1, the algorithm constructs k_max subspaces incrementally. At each step k = 1, ..., K, if the largest projection onto U_k among the uncollected points on the same true subspace D_1 is greater than the largest projection among the points on different subspaces, then NSN collects a correct neighbor. In a mathematical expression, we want to satisfy, for each step k = 1, ..., K,

    max_{j : w_j = 1, j ∉ I_1} ||Proj_{U_k} y_j|| > max_{j : w_j ≠ 1, j ∉ I_1} ||Proj_{U_k} y_j||.        (3)

The rest of the proof is to show that (1) and (2) lead to (3) with probability at least 1 - 3δ in their corresponding models. It is difficult to prove (3) directly because the subspaces, the data points, and the index set I_1 are all dependent on each other. Instead, we introduce an Oracle algorithm whose success is equivalent to the success of NSN, but whose analysis is easier. The Oracle algorithm is then analyzed using stochastic ordering, bounds on order statistics of random projections, and measure concentration inequalities for random subspaces. The rest of the proof is provided in Sections B.3 and B.4.

B.2 Preliminary lemmas

Before we step into the technical parts of the proof, we introduce the main ingredients which will be used. The following lemma gives upper and lower bounds on the order statistics of the projections of iid uniformly random unit vectors.

Lemma 5: Let x_1, ..., x_n be drawn iid uniformly at random from the unit sphere S^{d-1}. Let z_{(n-m+1)} denote the m-th largest value of {z_i ≜ ||A x_i||_2, 1 ≤ i ≤ n}, where A ∈ R^{k×d}.

a. Suppose that the rows of A are orthonormal to each other. For any α ∈ (0, 1), there exists a constant C > 0 such that for n, m, k, d ∈ N satisfying

    n - m + 1 ≥ C m log(ne/mδ),        (4)

we have

    z_{(n-m+1)}^2 > ( k + min{ log( (n - m + 1) / (C m log(ne/mδ)) ), α k } ) / d        (5)

with probability at least 1 - δ^m.

b. For any matrix A,

    z_{(n-m+1)} < ( ||A||_F + ||A||_2 ( sqrt(2π) + sqrt( 2 log(ne/mδ) ) ) ) / sqrt(d)        (6)

with probability at least 1 - δ^m.

Lemma 5(b) can be proved by using measure concentration inequalities [12]. Not only do these provide inequalities for random unit vectors, they also give us inequalities for random subspaces.

Lemma 6: Let the columns of X ∈ R^{d×k} be an orthonormal basis of a k-dimensional random subspace drawn uniformly at random in d-dimensional space.

a. For any matrix A ∈ R^{p×d},

    E[ ||AX||_F^2 ] = (k/d) ||A||_F^2.

b. [15, 12] If ||A||_2 ≤ 1, then we have

    Pr{ ||AX||_F > sqrt(k/d) ||A||_F + ||A||_2 ( sqrt(8π/(d-1)) + t ) } ≤ e^{-(d-1) t^2 / 8}.

B.3 Proof of Theorem 2

Following Section B.1, we show in this section that if (2) holds, then NSN finds all correct neighbors for y_1 (which is assumed to lie on D_1) with probability at least 1 - 3δ.

Step 3: NSN Oracle algorithm. Consider the Oracle algorithm described in the following. Unlike NSN, this algorithm knows the true label of each data point. It picks the point closest to the current subspace among the points with the same label. Since we assume w_1 = 1, the Oracle algorithm for y_1 picks a point in {y_j : w_j = 1} at every step.

Algorithm 4 NSN Oracle algorithm for y_1 (assuming w_1 = 1)
Input: A set of N samples Y = {y_1, ..., y_N}, the number of required neighbors K = d - 1, maximum subspace dimension k_max = 2 log d
  I_1^(1) ← {1}
  for k = 1, ..., K do
    if k ≤ k_max then
      V_k ← span{y_j : j ∈ I_1^(k)}
      j_k* ← arg max_{j ∈ [N] : w_j = 1, j ∉ I_1^(k)} ||Proj_{V_k} y_j||
    else
      j_k* ← arg max_{j ∈ [N] : w_j = 1, j ∉ I_1^(k)} ||Proj_{V_{k_max}} y_j||
    end if
    if max_{j ∈ [N] : w_j = 1, j ∉ I_1^(k)} ||Proj_{V_{min(k, k_max)}} y_j|| ≤ max_{j ∈ [N] : w_j ≠ 1} ||Proj_{V_{min(k, k_max)}} y_j|| then
      Return FAILURE
    end if
    I_1^(k+1) ← I_1^(k) ∪ {j_k*}
  end for
  Return SUCCESS

Note that the Oracle algorithm returns failure if and only if the original algorithm picks an incorrect neighbor for y_1. The reason is as follows. Suppose that NSN for y_1 picks its first incorrect point at step k. Up to step k - 1, correct points have been chosen because they are the nearest points to the subspaces in the corresponding steps. The Oracle algorithm will also pick those points because they are the nearest points among the correct points. Hence U_k = V_k. At step k, NSN picks an incorrect point as it is the closest to U_k. The Oracle algorithm will declare failure because that incorrect point is closer than the closest point among the correct points. In the same manner, we see that NSN fails if the Oracle NSN fails. Therefore, we can instead analyze the success of the Oracle algorithm. The success condition is written as

    ||Proj_{V_k} y_{j_k*}|| > max_{j ∈ [N] : w_j ≠ 1} ||Proj_{V_k} y_j||,    k = 1, ..., k_max,
    ||Proj_{V_{k_max}} y_{j_k*}|| > max_{j ∈ [N] : w_j ≠ 1} ||Proj_{V_{k_max}} y_j||,    k = k_max + 1, ..., K.        (7)

Note that the V_k's are independent of the points {y_j : j ∈ [N], w_j ≠ 1}. We will use this fact in the following steps.

Step 4: Lower bounds on the projection of correct points (the LHS of (7)). Let V_k ∈ R^{d×k} be such that the columns of D_1 V_k form an orthonormal basis of V_k. Such a V_k exists because V_k is a k-dimensional subspace of D_1. Then we have

    ||Proj_{V_k} y_j|| = ||V_k^T D_1^T D_1 x_j|| = ||V_k^T x_j||.

In this step, we obtain lower bounds on ||V_k^T x_j|| for k ≤ k_max and on ||V_{k_max}^T x_j|| for k > k_max.

It is difficult to analyze ||V_k^T x_j|| directly because V_k and x_j are dependent. We instead analyze another random variable that is stochastically dominated by ||V_k^T x_j||. We then use a high-probability lower bound on that variable, which also lower bounds ||V_k^T x_j|| with high probability.

Define P_{k,m} as the m-th largest norm of the projections of n - 1 iid uniformly random unit vectors on S^{d-1} onto a k-dimensional subspace. Since the distribution of a random unit vector is isotropic, the distribution of P_{k,m} is identical for any k-dimensional subspace independent of the random unit vectors. We have the following key lemma.

Lemma 7: ||V_k^T x_{j_k*}|| stochastically dominates P_{k,k}, i.e., Pr{ ||V_k^T x_{j_k*}|| ≤ t } ≤ Pr{ P_{k,k} ≤ t } for any t ≥ 0. Moreover, P_{k,k} stochastically dominates P_{k̂,k} for any k̂ ≤ k.

The proof of the lemma is provided in Appendix B.5. Now we can use the lower bound on P_{k,k} given in Lemma 5(a) to bound ||V_k^T x_{j_k*}||. Let us pick α and C for which the lemma holds. The first inequality of (2) with C_1 = C + 1 leads to n > C d log(ne/dδ), and also

    n - k + 1 > C k log(ne/kδ),    k = 1, ..., d - 1.        (8)

Hence, it follows from Lemma 5(a) that for each k = 1, ..., k_max, we have

    ||V_k^T x_{j_k*}||^2 > ( k + min{ log( (n - k + 1) / (C k log(ne/kδ)) ), α k } ) / d
                        ≥ ( k + min{ log( n / (C d log(ne/dδ)) ), α k } ) / d        (9)

with probability at least 1 - δ/d. For k > k_max, we want to bound ||Proj_{V_{k_max}} y_{j_k*}||. We again use Lemma 7 to obtain the bound. Since the condition for the lemma holds, as shown in (8), we have

    ||V_{k_max}^T x_{j_k*}||^2 > ( 2 log d + min{ log( (n - k + 1) / (C k log(ne/kδ)) ), 2α log d } ) / d
                              ≥ ( 2 log d + min{ log( n / (C d log(ne/dδ)) ), 2α log d } ) / d        (10)

with probability at least 1 - δ/d, for every k = k_max + 1, ..., d - 1. The union bound gives that (9) and (10) hold for all k = 1, ..., d - 1 simultaneously with probability at least 1 - δ.

Step 5: Upper bounds on the projection of incorrect points (the RHS of (7)). Since we have ||Proj_{V_k} y_j|| = ||V_k^T D_1^T D_{w_j} x_j||, the RHS of (7) can be written as

    max_{j ∈ [N], w_j ≠ 1} ||V_k^T D_1^T D_{w_j} x_j||.        (11)

In this step, we want to bound (11) for every k = 1, ..., d - 1 by using the concentration inequalities for V_k and x_j. Since V_k and x_j are independent, the inequality for x_j holds for any fixed V_k.


More information

Computing Exact Confidence Coefficients of Simultaneous Confidence Intervals for Multinomial Proportions and their Functions

Computing Exact Confidence Coefficients of Simultaneous Confidence Intervals for Multinomial Proportions and their Functions Working Paper 2013:5 Department of Statistics Computing Exact Confience Coefficients of Simultaneous Confience Intervals for Multinomial Proportions an their Functions Shaobo Jin Working Paper 2013:5

More information

Convergence of Random Walks

Convergence of Random Walks Chapter 16 Convergence of Ranom Walks This lecture examines the convergence of ranom walks to the Wiener process. This is very important both physically an statistically, an illustrates the utility of

More information

Optimization of Geometries by Energy Minimization

Optimization of Geometries by Energy Minimization Optimization of Geometries by Energy Minimization by Tracy P. Hamilton Department of Chemistry University of Alabama at Birmingham Birmingham, AL 3594-140 hamilton@uab.eu Copyright Tracy P. Hamilton, 1997.

More information

arxiv: v1 [cs.lg] 22 Mar 2014

arxiv: v1 [cs.lg] 22 Mar 2014 CUR lgorithm with Incomplete Matrix Observation Rong Jin an Shenghuo Zhu Dept. of Computer Science an Engineering, Michigan State University, rongjin@msu.eu NEC Laboratories merica, Inc., zsh@nec-labs.com

More information

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors Math 18.02 Notes on ifferentials, the Chain Rule, graients, irectional erivative, an normal vectors Tangent plane an linear approximation We efine the partial erivatives of f( xy, ) as follows: f f( x+

More information

Collapsed Gibbs and Variational Methods for LDA. Example Collapsed MoG Sampling

Collapsed Gibbs and Variational Methods for LDA. Example Collapsed MoG Sampling Case Stuy : Document Retrieval Collapse Gibbs an Variational Methos for LDA Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 7 th, 0 Example

More information

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes Leaving Ranomness to Nature: -Dimensional Prouct Coes through the lens of Generalize-LDPC coes Tavor Baharav, Kannan Ramchanran Dept. of Electrical Engineering an Computer Sciences, U.C. Berkeley {tavorb,

More information

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION The Annals of Statistics 1997, Vol. 25, No. 6, 2313 2327 LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION By Eva Riccomagno, 1 Rainer Schwabe 2 an Henry P. Wynn 1 University of Warwick, Technische

More information

arxiv: v4 [cs.ds] 7 Mar 2014

arxiv: v4 [cs.ds] 7 Mar 2014 Analysis of Agglomerative Clustering Marcel R. Ackermann Johannes Blömer Daniel Kuntze Christian Sohler arxiv:101.697v [cs.ds] 7 Mar 01 Abstract The iameter k-clustering problem is the problem of partitioning

More information

Bohr Model of the Hydrogen Atom

Bohr Model of the Hydrogen Atom Class 2 page 1 Bohr Moel of the Hyrogen Atom The Bohr Moel of the hyrogen atom assumes that the atom consists of one electron orbiting a positively charge nucleus. Although it oes NOT o a goo job of escribing

More information

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y Ph195a lecture notes, 1/3/01 Density operators for spin- 1 ensembles So far in our iscussion of spin- 1 systems, we have restricte our attention to the case of pure states an Hamiltonian evolution. Toay

More information

Math 1B, lecture 8: Integration by parts

Math 1B, lecture 8: Integration by parts Math B, lecture 8: Integration by parts Nathan Pflueger 23 September 2 Introuction Integration by parts, similarly to integration by substitution, reverses a well-known technique of ifferentiation an explores

More information

Subspace Estimation from Incomplete Observations: A High-Dimensional Analysis

Subspace Estimation from Incomplete Observations: A High-Dimensional Analysis Subspace Estimation from Incomplete Observations: A High-Dimensional Analysis Chuang Wang, Yonina C. Elar, Fellow, IEEE an Yue M. Lu, Senior Member, IEEE Abstract We present a high-imensional analysis

More information

arxiv: v2 [cs.ds] 11 May 2016

arxiv: v2 [cs.ds] 11 May 2016 Optimizing Star-Convex Functions Jasper C.H. Lee Paul Valiant arxiv:5.04466v2 [cs.ds] May 206 Department of Computer Science Brown University {jasperchlee,paul_valiant}@brown.eu May 3, 206 Abstract We

More information

Integration Review. May 11, 2013

Integration Review. May 11, 2013 Integration Review May 11, 2013 Goals: Review the funamental theorem of calculus. Review u-substitution. Review integration by parts. Do lots of integration eamples. 1 Funamental Theorem of Calculus In

More information

Homework 2 Solutions EM, Mixture Models, PCA, Dualitys

Homework 2 Solutions EM, Mixture Models, PCA, Dualitys Homewor Solutions EM, Mixture Moels, PCA, Dualitys CMU 0-75: Machine Learning Fall 05 http://www.cs.cmu.eu/~bapoczos/classes/ml075_05fall/ OUT: Oct 5, 05 DUE: Oct 9, 05, 0:0 AM An EM algorithm for a Mixture

More information

On combinatorial approaches to compressed sensing

On combinatorial approaches to compressed sensing On combinatorial approaches to compresse sensing Abolreza Abolhosseini Moghaam an Hayer Raha Department of Electrical an Computer Engineering, Michigan State University, East Lansing, MI, U.S. Emails:{abolhos,raha}@msu.eu

More information

Topic 7: Convergence of Random Variables

Topic 7: Convergence of Random Variables Topic 7: Convergence of Ranom Variables Course 003, 2016 Page 0 The Inference Problem So far, our starting point has been a given probability space (S, F, P). We now look at how to generate information

More information

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS ALINA BUCUR, CHANTAL DAVID, BROOKE FEIGON, MATILDE LALÍN 1 Introuction In this note, we stuy the fluctuations in the number

More information

Entanglement is not very useful for estimating multiple phases

Entanglement is not very useful for estimating multiple phases PHYSICAL REVIEW A 70, 032310 (2004) Entanglement is not very useful for estimating multiple phases Manuel A. Ballester* Department of Mathematics, University of Utrecht, Box 80010, 3508 TA Utrecht, The

More information

A Sketch of Menshikov s Theorem

A Sketch of Menshikov s Theorem A Sketch of Menshikov s Theorem Thomas Bao March 14, 2010 Abstract Let Λ be an infinite, locally finite oriente multi-graph with C Λ finite an strongly connecte, an let p

More information

19 Eigenvalues, Eigenvectors, Ordinary Differential Equations, and Control

19 Eigenvalues, Eigenvectors, Ordinary Differential Equations, and Control 19 Eigenvalues, Eigenvectors, Orinary Differential Equations, an Control This section introuces eigenvalues an eigenvectors of a matrix, an iscusses the role of the eigenvalues in etermining the behavior

More information

Expected Value of Partial Perfect Information

Expected Value of Partial Perfect Information Expecte Value of Partial Perfect Information Mike Giles 1, Takashi Goa 2, Howar Thom 3 Wei Fang 1, Zhenru Wang 1 1 Mathematical Institute, University of Oxfor 2 School of Engineering, University of Tokyo

More information

Chapter 6: Energy-Momentum Tensors

Chapter 6: Energy-Momentum Tensors 49 Chapter 6: Energy-Momentum Tensors This chapter outlines the general theory of energy an momentum conservation in terms of energy-momentum tensors, then applies these ieas to the case of Bohm's moel.

More information

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS Yannick DEVILLE Université Paul Sabatier Laboratoire Acoustique, Métrologie, Instrumentation Bât. 3RB2, 8 Route e Narbonne,

More information

Algorithms and matching lower bounds for approximately-convex optimization

Algorithms and matching lower bounds for approximately-convex optimization Algorithms an matching lower bouns for approximately-convex optimization Yuanzhi Li Department of Computer Science Princeton University Princeton, NJ, 08450 yuanzhil@cs.princeton.eu Anrej Risteski Department

More information

Diophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations

Diophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations Diophantine Approximations: Examining the Farey Process an its Metho on Proucing Best Approximations Kelly Bowen Introuction When a person hears the phrase irrational number, one oes not think of anything

More information

Acute sets in Euclidean spaces

Acute sets in Euclidean spaces Acute sets in Eucliean spaces Viktor Harangi April, 011 Abstract A finite set H in R is calle an acute set if any angle etermine by three points of H is acute. We examine the maximal carinality α() of

More information

Influence of weight initialization on multilayer perceptron performance

Influence of weight initialization on multilayer perceptron performance Influence of weight initialization on multilayer perceptron performance M. Karouia (1,2) T. Denœux (1) R. Lengellé (1) (1) Université e Compiègne U.R.A. CNRS 817 Heuiasyc BP 649 - F-66 Compiègne ceex -

More information

Parameter estimation: A new approach to weighting a priori information

Parameter estimation: A new approach to weighting a priori information Parameter estimation: A new approach to weighting a priori information J.L. Mea Department of Mathematics, Boise State University, Boise, ID 83725-555 E-mail: jmea@boisestate.eu Abstract. We propose a

More information

On the minimum distance of elliptic curve codes

On the minimum distance of elliptic curve codes On the minimum istance of elliptic curve coes Jiyou Li Department of Mathematics Shanghai Jiao Tong University Shanghai PRChina Email: lijiyou@sjtueucn Daqing Wan Department of Mathematics University of

More information

Concentration of Measure Inequalities for Compressive Toeplitz Matrices with Applications to Detection and System Identification

Concentration of Measure Inequalities for Compressive Toeplitz Matrices with Applications to Detection and System Identification Concentration of Measure Inequalities for Compressive Toeplitz Matrices with Applications to Detection an System Ientification Borhan M Sananaji, Tyrone L Vincent, an Michael B Wakin Abstract In this paper,

More information

Qubit channels that achieve capacity with two states

Qubit channels that achieve capacity with two states Qubit channels that achieve capacity with two states Dominic W. Berry Department of Physics, The University of Queenslan, Brisbane, Queenslan 4072, Australia Receive 22 December 2004; publishe 22 March

More information

Online ICA: Understanding Global Dynamics of Nonconvex Optimization via Diffusion Processes

Online ICA: Understanding Global Dynamics of Nonconvex Optimization via Diffusion Processes Online ICA: Unerstaning Global Dynamics of Nonconvex Optimization via Diffusion Processes Chris Junchi Li Zhaoran Wang Han Liu Department of Operations Research an Financial Engineering, Princeton University

More information

CHAPTER 1 : DIFFERENTIABLE MANIFOLDS. 1.1 The definition of a differentiable manifold

CHAPTER 1 : DIFFERENTIABLE MANIFOLDS. 1.1 The definition of a differentiable manifold CHAPTER 1 : DIFFERENTIABLE MANIFOLDS 1.1 The efinition of a ifferentiable manifol Let M be a topological space. This means that we have a family Ω of open sets efine on M. These satisfy (1), M Ω (2) the

More information

On Characterizing the Delay-Performance of Wireless Scheduling Algorithms

On Characterizing the Delay-Performance of Wireless Scheduling Algorithms On Characterizing the Delay-Performance of Wireless Scheuling Algorithms Xiaojun Lin Center for Wireless Systems an Applications School of Electrical an Computer Engineering, Purue University West Lafayette,

More information

Sparse Reconstruction of Systems of Ordinary Differential Equations

Sparse Reconstruction of Systems of Ordinary Differential Equations Sparse Reconstruction of Systems of Orinary Differential Equations Manuel Mai a, Mark D. Shattuck b,c, Corey S. O Hern c,a,,e, a Department of Physics, Yale University, New Haven, Connecticut 06520, USA

More information

DIFFERENTIAL GEOMETRY, LECTURE 15, JULY 10

DIFFERENTIAL GEOMETRY, LECTURE 15, JULY 10 DIFFERENTIAL GEOMETRY, LECTURE 15, JULY 10 5. Levi-Civita connection From now on we are intereste in connections on the tangent bunle T X of a Riemanninam manifol (X, g). Out main result will be a construction

More information

Lower Bounds for Local Monotonicity Reconstruction from Transitive-Closure Spanners

Lower Bounds for Local Monotonicity Reconstruction from Transitive-Closure Spanners Lower Bouns for Local Monotonicity Reconstruction from Transitive-Closure Spanners Arnab Bhattacharyya Elena Grigorescu Mahav Jha Kyomin Jung Sofya Raskhonikova Davi P. Wooruff Abstract Given a irecte

More information

Research Article When Inflation Causes No Increase in Claim Amounts

Research Article When Inflation Causes No Increase in Claim Amounts Probability an Statistics Volume 2009, Article ID 943926, 10 pages oi:10.1155/2009/943926 Research Article When Inflation Causes No Increase in Claim Amounts Vytaras Brazauskas, 1 Bruce L. Jones, 2 an

More information

Sharp Thresholds. Zachary Hamaker. March 15, 2010

Sharp Thresholds. Zachary Hamaker. March 15, 2010 Sharp Threshols Zachary Hamaker March 15, 2010 Abstract The Kolmogorov Zero-One law states that for tail events on infinite-imensional probability spaces, the probability must be either zero or one. Behavior

More information

On the number of isolated eigenvalues of a pair of particles in a quantum wire

On the number of isolated eigenvalues of a pair of particles in a quantum wire On the number of isolate eigenvalues of a pair of particles in a quantum wire arxiv:1812.11804v1 [math-ph] 31 Dec 2018 Joachim Kerner 1 Department of Mathematics an Computer Science FernUniversität in

More information

Lower Bounds for the Smoothed Number of Pareto optimal Solutions

Lower Bounds for the Smoothed Number of Pareto optimal Solutions Lower Bouns for the Smoothe Number of Pareto optimal Solutions Tobias Brunsch an Heiko Röglin Department of Computer Science, University of Bonn, Germany brunsch@cs.uni-bonn.e, heiko@roeglin.org Abstract.

More information

Database-friendly Random Projections

Database-friendly Random Projections Database-frienly Ranom Projections Dimitris Achlioptas Microsoft ABSTRACT A classic result of Johnson an Linenstrauss asserts that any set of n points in -imensional Eucliean space can be embee into k-imensional

More information

WESD - Weighted Spectral Distance for Measuring Shape Dissimilarity

WESD - Weighted Spectral Distance for Measuring Shape Dissimilarity 1 WESD - Weighte Spectral Distance for Measuring Shape Dissimilarity Ener Konukoglu, Ben Glocker, Antonio Criminisi an Kilian M. Pohl Abstract This article presents a new istance for measuring shape issimilarity

More information

Estimation of the Maximum Domination Value in Multi-Dimensional Data Sets

Estimation of the Maximum Domination Value in Multi-Dimensional Data Sets Proceeings of the 4th East-European Conference on Avances in Databases an Information Systems ADBIS) 200 Estimation of the Maximum Domination Value in Multi-Dimensional Data Sets Eleftherios Tiakas, Apostolos.

More information

arxiv: v4 [stat.ml] 21 Dec 2016

arxiv: v4 [stat.ml] 21 Dec 2016 Online Active Linear Regression via Thresholing Carlos Riquelme Stanfor University riel@stanfor.eu Ramesh Johari Stanfor University rjohari@stanfor.eu Baosen Zhang University of Washington zhangbao@uw.eu

More information

The total derivative. Chapter Lagrangian and Eulerian approaches

The total derivative. Chapter Lagrangian and Eulerian approaches Chapter 5 The total erivative 51 Lagrangian an Eulerian approaches The representation of a flui through scalar or vector fiels means that each physical quantity uner consieration is escribe as a function

More information

Lecture 6 : Dimensionality Reduction

Lecture 6 : Dimensionality Reduction CPS290: Algorithmic Founations of Data Science February 3, 207 Lecture 6 : Dimensionality Reuction Lecturer: Kamesh Munagala Scribe: Kamesh Munagala In this lecture, we will consier the roblem of maing

More information

Construction of Equiangular Signatures for Synchronous CDMA Systems

Construction of Equiangular Signatures for Synchronous CDMA Systems Construction of Equiangular Signatures for Synchronous CDMA Systems Robert W Heath Jr Dept of Elect an Comp Engr 1 University Station C0803 Austin, TX 78712-1084 USA rheath@eceutexaseu Joel A Tropp Inst

More information

A note on asymptotic formulae for one-dimensional network flow problems Carlos F. Daganzo and Karen R. Smilowitz

A note on asymptotic formulae for one-dimensional network flow problems Carlos F. Daganzo and Karen R. Smilowitz A note on asymptotic formulae for one-imensional network flow problems Carlos F. Daganzo an Karen R. Smilowitz (to appear in Annals of Operations Research) Abstract This note evelops asymptotic formulae

More information

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs Lectures - Week 10 Introuction to Orinary Differential Equations (ODES) First Orer Linear ODEs When stuying ODEs we are consiering functions of one inepenent variable, e.g., f(x), where x is the inepenent

More information

Problem Sheet 2: Eigenvalues and eigenvectors and their use in solving linear ODEs

Problem Sheet 2: Eigenvalues and eigenvectors and their use in solving linear ODEs Problem Sheet 2: Eigenvalues an eigenvectors an their use in solving linear ODEs If you fin any typos/errors in this problem sheet please email jk28@icacuk The material in this problem sheet is not examinable

More information

II. First variation of functionals

II. First variation of functionals II. First variation of functionals The erivative of a function being zero is a necessary conition for the etremum of that function in orinary calculus. Let us now tackle the question of the equivalent

More information

Generalizing Kronecker Graphs in order to Model Searchable Networks

Generalizing Kronecker Graphs in order to Model Searchable Networks Generalizing Kronecker Graphs in orer to Moel Searchable Networks Elizabeth Boine, Babak Hassibi, Aam Wierman California Institute of Technology Pasaena, CA 925 Email: {eaboine, hassibi, aamw}@caltecheu

More information

Linear and quadratic approximation

Linear and quadratic approximation Linear an quaratic approximation November 11, 2013 Definition: Suppose f is a function that is ifferentiable on an interval I containing the point a. The linear approximation to f at a is the linear function

More information

2Algebraic ONLINE PAGE PROOFS. foundations

2Algebraic ONLINE PAGE PROOFS. foundations Algebraic founations. Kick off with CAS. Algebraic skills.3 Pascal s triangle an binomial expansions.4 The binomial theorem.5 Sets of real numbers.6 Surs.7 Review . Kick off with CAS Playing lotto Using

More information

CUSTOMER REVIEW FEATURE EXTRACTION Heng Ren, Jingye Wang, and Tony Wu

CUSTOMER REVIEW FEATURE EXTRACTION Heng Ren, Jingye Wang, and Tony Wu CUSTOMER REVIEW FEATURE EXTRACTION Heng Ren, Jingye Wang, an Tony Wu Abstract Popular proucts often have thousans of reviews that contain far too much information for customers to igest. Our goal for the

More information

A Novel Decoupled Iterative Method for Deep-Submicron MOSFET RF Circuit Simulation

A Novel Decoupled Iterative Method for Deep-Submicron MOSFET RF Circuit Simulation A Novel ecouple Iterative Metho for eep-submicron MOSFET RF Circuit Simulation CHUAN-SHENG WANG an YIMING LI epartment of Mathematics, National Tsing Hua University, National Nano evice Laboratories, an

More information

Homework 2 EM, Mixture Models, PCA, Dualitys

Homework 2 EM, Mixture Models, PCA, Dualitys Homework 2 EM, Mixture Moels, PCA, Dualitys CMU 10-715: Machine Learning (Fall 2015) http://www.cs.cmu.eu/~bapoczos/classes/ml10715_2015fall/ OUT: Oct 5, 2015 DUE: Oct 19, 2015, 10:20 AM Guielines The

More information

Logarithmic spurious regressions

Logarithmic spurious regressions Logarithmic spurious regressions Robert M. e Jong Michigan State University February 5, 22 Abstract Spurious regressions, i.e. regressions in which an integrate process is regresse on another integrate

More information

Spurious Significance of Treatment Effects in Overfitted Fixed Effect Models Albrecht Ritschl 1 LSE and CEPR. March 2009

Spurious Significance of Treatment Effects in Overfitted Fixed Effect Models Albrecht Ritschl 1 LSE and CEPR. March 2009 Spurious Significance of reatment Effects in Overfitte Fixe Effect Moels Albrecht Ritschl LSE an CEPR March 2009 Introuction Evaluating subsample means across groups an time perios is common in panel stuies

More information

Robust Low Rank Kernel Embeddings of Multivariate Distributions

Robust Low Rank Kernel Embeddings of Multivariate Distributions Robust Low Rank Kernel Embeings of Multivariate Distributions Le Song, Bo Dai College of Computing, Georgia Institute of Technology lsong@cc.gatech.eu, boai@gatech.eu Abstract Kernel embeing of istributions

More information

Equations of lines in

Equations of lines in Roberto s Notes on Linear Algebra Chapter 6: Lines, planes an other straight objects Section 1 Equations of lines in What ou nee to know alrea: The ot prouct. The corresponence between equations an graphs.

More information

Calculus of Variations

Calculus of Variations 16.323 Lecture 5 Calculus of Variations Calculus of Variations Most books cover this material well, but Kirk Chapter 4 oes a particularly nice job. x(t) x* x*+ αδx (1) x*- αδx (1) αδx (1) αδx (1) t f t

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 12 EFFICIENT LEARNING So far, our focus has been on moels of learning an basic algorithms for those moels. We have not place much emphasis on how to learn quickly.

More information