Necessary and Sufficient Conditions for Sketched Subspace Clustering


Necessary and Sufficient Conditions for Sketched Subspace Clustering

Daniel Pimentel-Alarcón (1), Laura Balzano (2), Robert Nowak (1)
(1) University of Wisconsin-Madison, (2) University of Michigan-Ann Arbor

Abstract—This paper is about an interesting phenomenon: two r-dimensional subspaces, even if they are orthogonal to one another, can appear identical if they are only observed on a subset of coordinates. Understanding this phenomenon is of particular importance for many modern applications of subspace clustering where one would like to subsample in order to improve computational efficiency. Examples include real-time video surveillance and datasets so large that they cannot even be stored in memory. In this paper we introduce a new metric between subspaces, which we call partial coordinate discrepancy. This metric captures a notion of similarity between subsampled subspaces that is not captured by other distance measures between subspaces. With this, we are able to show that subspace clustering is theoretically possible without coherence assumptions, using only r + 1 rows of the dataset at hand. This gives precise information-theoretic necessary and sufficient conditions for sketched subspace clustering, and it can greatly improve computational efficiency without compromising performance. We complement our theoretical analysis with synthetic and real data experiments.

I. INTRODUCTION

In subspace clustering (SC), one is given a data matrix X whose columns lie in the union of several (unknown) r-dimensional subspaces, and aims to infer these subspaces and cluster the columns in X accordingly [1]. The union of subspaces model is a powerful and flexible model that applies to a wide variety of practical applications, ranging from computer vision [2] to network inference [3], [4], compression [5], recommender systems and collaborative filtering [6], [7]. Hence there is growing attention to this problem. As a result, existing theory and methods can handle outliers [8]-[13], noisy measurements [14], privacy concerns [15], data constraints [16], and missing data [17]-[21], among other difficulties.

Yet in many relevant applications, such as real-time video surveillance, or cases where X is too large to even store in memory, SC remains infeasible due to computational constraints. In applications like these, it is essential to handle big datasets in a computationally efficient manner, both in terms of storage and processing time. Fortunately, studies regarding missing data show that under this model, very large datasets can be accurately represented using only a small fraction of their entries [17]-[21]. With this in mind, recent studies (e.g., [22]) explore the idea of projecting the data (e.g., subsampling or sketching) as an alternative to reduce computational costs (time and storage). On this matter, it was recently shown that if the subspaces are sufficiently incoherent and separated, and the columns are well-spread over the subspaces, then the popular sparse subspace clustering (SSC) algorithm [23] will find the correct clustering using certain sketches of the data (e.g., Gaussian projection, row subsampling, and the fast Johnson-Lindenstrauss transform) [24]. However, in general, these conditions are unverifiable.

Fig. 1: Left: The columns in X (represented by points) lie in the union of two 1-dimensional subspaces in R^3. We want to cluster these points using only a few coordinates (to reduce computational costs). This can be done if we use coordinates (y, z), as in the center. The main difficulty is that the subspaces may be equal in certain coordinates. In this example, the subspaces are equal on the (x, y) coordinates. So if we use coordinates (x, y), as in the right, then all columns will appear to lie in the same subspace, and clustering would be impossible. We do not know beforehand the coordinates in which the subspaces are different. Searching for such coordinates could result in combinatorial complexity, defeating the purpose of subsampling.

In this paper we show that almost every X can be theoretically clustered using as few as r + 1 rows (the minimum required) of a generic rotation of X. The subtlety of this result is that the underlying subspaces may be equal in certain coordinates. This means that if we sample a column of X on a set of coordinates where the underlying subspaces are equal, one would be unable to determine (based on these observations) to which subspace it truly belongs. See Figure 1 to build some intuition. To give a concrete example, consider images as in Figure 2. It has been shown that the face images of the same individual under varying illumination lie near a low-dimensional subspace [25]. Hence SC can be used to classify faces. However, some coordinates (e.g., the corner pixels) are equal across many individuals. If we only sample those coordinates, we would be unable to cluster. Moreover, those coordinates would only obstruct clustering while consuming computational resources.

To the best of our knowledge, none of the existing distance measures between subspaces captures this notion of partial coordinate similarity. For instance, Example 1 in Section II shows that orthogonal subspaces (maximally apart with respect to the principal angle distance, the affinity distance, and the subspace incoherence distance [10]) can be identical in certain coordinates.

Fig. 2: Images from the Extended Yale B dataset [26]. Each row shows images of the same individual under varying illumination. The vectorized images of each individual lie near a 9-dimensional subspace [25], so the whole dataset lies near a union of subspaces. Some coordinates (e.g., the corner pixels) are equal across many individuals. If we only sample those coordinates, we would be unable to subspace cluster.

In this paper we study this phenomenon to derive precise information-theoretic necessary and sufficient conditions for sketched subspace clustering. To this end we first introduce a new distance measure between subspaces, which we call partial coordinate discrepancy, that captures this relationship between subspaces. This allows us to show that if we generically rotate X, its columns will lie in subspaces that are different on all subsets of more than r coordinates with probability 1. In other words, generic rotations maximize partial coordinate discrepancy. This in turn implies that X can be clustered using only a sketch, that is, a few rows of a generic rotation of X. We complement our theoretical analysis with experiments on synthetic and real data, showing the performance and advantages of sketching.

Organization of the paper: In Section II we formally state the problem, introduce our new distance measure between subspaces, and give our main results. In Section III we make several remarks about our distance measure. In Section IV we present experiments to support our results. We leave all proofs to Section V.

II. MODEL AND MAIN RESULTS

Let U = {S_k}_{k=1}^K be a set of r-dimensional subspaces of R^d, and let X be a d x n data matrix whose columns lie in the union of the subspaces in U. Let X_k denote the matrix with all the columns of X corresponding to S_k. Assume:

A1 The columns of X_k are drawn independently according to an absolutely continuous distribution with respect to the Lebesgue measure on S_k.

A2 X_k has at least r + 1 columns.

Fig. 3: Typical SC assumptions require (i) that the subspaces are sufficiently separated; this would discard subspaces that are too close, as in the top-left; (ii) that the subspaces are sufficiently incoherent; this would discard subspaces that are too aligned with the canonical axes, as in the top-left; and (iii) that the columns of X_k are well-spread over S_k, as in the top-right; this would discard cases where the distribution of columns over S_k is skewed, as in the bottom (left and right) [10]. In contrast, assumption A1 allows any collection of subspaces, including nearby and coherent subspaces, as in the top-left. A1 only requires that the columns of X_k are drawn generically, as in the top-right and bottom-left. A1 excludes ill-conditioned samples with Lebesgue measure zero, as in the bottom-right, where all columns lie in a line (when S_k is a plane).

A1 essentially requires that the columns in X_k are drawn generically from S_k. This allows nearby and coherent subspaces, and skewed distributions of the columns. In contrast, typical SC assumptions require that the subspaces are sufficiently separated, that S_k is incoherent (not too aligned with the canonical axes), and that the columns are well-spread over S_k. See Figure 3 to build some intuition. A2 is a fundamental requirement for subspace clustering, as any K sets of r columns can be clustered into K arbitrary r-dimensional subspaces. Recall that we want to cluster X using only a few of its rows.
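To make assumptions A1-A2 concrete, the following is a minimal Python sketch (an illustration, not the authors' code; the values of d, r, K and n_k are arbitrary choices) that samples a union-of-subspaces data matrix: each S_k is the span of a random d x r basis, and each column of X_k is a generic combination of that basis, so A1 holds almost surely and each X_k has more than r + 1 columns.

import numpy as np

def sample_union_of_subspaces(d=50, r=3, K=4, n_k=20, seed=0):
    """Sample X whose columns lie in a union of K r-dimensional subspaces of R^d.

    Each S_k is spanned by a random d x r basis (A1: columns drawn from an
    absolutely continuous distribution on S_k; A2: n_k >= r + 1 columns).
    Returns the d x (K*n_k) matrix X and the ground-truth labels.
    """
    rng = np.random.default_rng(seed)
    blocks, labels = [], []
    for k in range(K):
        U_k = rng.standard_normal((d, r))          # basis of S_k
        coeffs = rng.standard_normal((r, n_k))     # generic coefficients
        blocks.append(U_k @ coeffs)                # columns of X_k lie in S_k
        labels += [k] * n_k
    return np.hstack(blocks), np.array(labels)

X, labels = sample_union_of_subspaces()
print(X.shape)  # (50, 80)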
The restriction of an r-dimensional subspace in general position to l <= r coordinates is simply R^l. So if X is sampled on r or fewer rows, any subspace in general position would agree with all the subsampled columns, making clustering impossible. It follows that X must be sampled on at least l = r + 1 rows in order to be clustered. In other words, l = r + 1 rows are necessary for sketched subspace clustering. We will now show that X can be clustered using only this bare minimum of rows, i.e., that l = r + 1 is also theoretically sufficient. To this end, we first introduce our new notion of distance between subspaces, which we call partial coordinate discrepancy.

Let [d]_l denote the collection of all subsets of {1, ..., d} with exactly l distinct elements. Let Gr(r, R^d) denote the Grassmann manifold of r-dimensional subspaces of R^d, and let 1{.} denote the indicator function. For any subspace, matrix or vector that is compatible with a set ω ∈ [d]_l, we will use the subscript ω to denote its restriction to the coordinates/rows in ω. For example, X_ω ∈ R^{l x n} denotes the restriction of X to the rows in ω, and (S_k)_ω ⊂ R^l denotes the restriction of S_k to the coordinates in ω.

Definition 1. Given S, S' ∈ Gr(r, R^d), define the partial coordinate discrepancy between S and S' as

    δ(S, S') = (1 / C(d, r+1)) Σ_{ω ∈ [d]_{r+1}} 1{S_ω ≠ S'_ω},

where C(d, r+1) denotes the binomial coefficient "d choose r + 1".

Example 1. Consider the following 1-dimensional subspaces of R^4: S = span[1 1 1 1]^T and S' = span[1 1 -1 -1]^T. Then δ(S, S') = 4/6, because if ω = {1, 2} or ω = {3, 4}, then S_ω = S'_ω = span[1 1]^T, but for any of the other 4 choices of ω, S_ω ≠ S'_ω. In other words, S and S' would appear to be the same if they were only observed on the first two or the last two coordinates/rows. Notice that S and S' are orthogonal (maximally apart with respect to the principal angle distance, the affinity distance, and the subspace incoherence distance [10]), yet they are identical when restricted to certain coordinates.

Remark 1. Notice that δ takes values in [0, 1]. One can interpret δ as the probability that two subspaces are different on r + 1 coordinates chosen at random. For instance, if two subspaces are drawn independently according to the uniform measure over Gr(r, R^d), then with probability 1 they will have δ = 1.

Example 1 shows that even orthogonal subspaces can appear identical if they are only sampled on a subset of coordinates. Existing measures of distance between subspaces fail to capture this notion of partial coordinate similarity. In contrast, δ is a distance measure (metric) that quantifies the partial coordinate similarity of two subspaces when restricted to subsets of coordinates. We formalize this in the next lemma. The proof is given in Section V.

Lemma 1. Partial coordinate discrepancy is a metric over Gr(r, R^d).

Lemma 1 implies that two different subspaces must be different on at least one set ω with r + 1 coordinates. If subspaces S, S' ∈ U are different on ω, then the columns corresponding to S and S' can be subspace clustered using only X_ω by iteratively trying combinations of r + 1 columns in X_ω. This is because under A1, a set of r + 1 columns in X_ω will be linearly dependent if and only if they correspond to the same subspace in U. This implies that we can cluster X using only r + 1 rows. The challenge is to determine which rows to use. If the subspaces in U have δ = 1 (i.e., they are different on all subsets of r + 1 coordinates), then we can cluster X using any set of r + 1 rows. But if δ is small, we would need to use the right rows, which could be hard to find. This matches the intuition that subspaces that are very similar are harder to cluster.
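Since δ involves only finitely many subsets ω, it can be computed by brute force for small d. The following is a minimal sketch (an illustration under the definitions above; the rank tolerance is an arbitrary choice): it tests S_ω = S'_ω by checking whether stacking the two restricted bases raises the rank, and it reproduces δ = 4/6 for Example 1.

import numpy as np
from itertools import combinations

def restricted_equal(U, V, omega, tol=1e-9):
    """True if span(U[omega, :]) == span(V[omega, :])."""
    Uo, Vo = U[list(omega), :], V[list(omega), :]
    r = np.linalg.matrix_rank(Uo, tol=tol)
    # Equal restrictions iff concatenating the bases does not raise the rank.
    return np.linalg.matrix_rank(np.hstack([Uo, Vo]), tol=tol) == r

def partial_coordinate_discrepancy(U, V, tol=1e-9):
    """delta(S, S') = fraction of (r+1)-subsets omega with S_omega != S'_omega."""
    d, r = U.shape
    subsets = list(combinations(range(d), r + 1))
    diff = sum(not restricted_equal(U, V, omega, tol) for omega in subsets)
    return diff / len(subsets)

# Example 1: two orthogonal lines in R^4 that agree on {1,2} and on {3,4}.
U = np.array([[1.0, 1.0, 1.0, 1.0]]).T
V = np.array([[1.0, 1.0, -1.0, -1.0]]).T
print(partial_coordinate_discrepancy(U, V))  # 0.666... = 4/6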
Fortunately, we will show that generic rotations yield maximal partial coordinate discrepancy. In other words, we will see that if we generically rotate the subspaces in U, then the rotated subspaces will be different on all subsets of r + 1 coordinates. This will imply that we can cluster X using any r + 1 rows of a generic rotation of X. To formalize these ideas, let Γ : R^d → R^d denote a rotation operator, and assume:

A3 The rotation angles of Γ are drawn independently according to an absolutely continuous distribution with respect to the Lebesgue measure on (0, 2π).

Essentially, A3 requires that Γ is a generic rotation. Equivalently, Γ can be considered a generic orthonormal matrix. Rotating X amounts to left-multiplying it by Γ. Similarly, the rotation of a subspace S by Γ (which we denote ΓS) is given by span{ΓU}, where U is a basis of S. The next lemma states that rotating subspaces by a generic rotation yields subspaces with maximal partial coordinate discrepancy. The proof is given in Section V.

Lemma 2. Let Γ denote a rotation operator drawn according to A3. Let S, S' be different subspaces in Gr(r, R^d). Then δ(ΓS, ΓS') = 1 with probability 1.

Lemma 2 states that regardless of δ(S, S'), we can rotate S and S' to obtain new subspaces with maximal partial coordinate discrepancy (i.e., subspaces that are different on all subsets of r + 1 coordinates). See Figure 4 for some insight. Intuitively, a generic rotation distributes the local differences of S and S' across all coordinates. So as long as S ≠ S', then (ΓS)_ω will differ (at least by a little bit) from (ΓS')_ω for every ω ∈ [d]_l with l > r. This implies that ΓX can be perfectly clustered using any subset of l > r rows of ΓX (and clustering ΓX is as good as clustering X). This is summarized in our main result, stated in the next theorem. The proof is given in Section V.

Theorem 1. Let A1-A3 hold, and let ω ∈ [d]_l, with l > r. Let X' be a subset of the columns in X. Transform and row-subsample X' to obtain (ΓX')_ω. Then with probability 1, the columns in X' lie in an r-dimensional subspace of R^d if and only if the columns in (ΓX')_ω lie in an r-dimensional subspace of R^l.

Theorem 1 states that, theoretically, X can be clustered using any r + 1 rows of a generic rotation ΓX. Under A1-A3, perfectly clustering (ΓX)_ω is theoretically possible with probability 1 by iteratively trying combinations of r + 1 columns in (ΓX)_ω and verifying whether they are rank r. This is because under A1 and A3, a set of r + 1 columns in (ΓX)_ω will be linearly dependent if and only if they correspond to the same subspace.
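A generic orthonormal matrix satisfying the role of Γ can be drawn, for instance, as the Q factor of a square Gaussian matrix; genericity (absolute continuity) is all that matters here. The following is a minimal sketch (hypothetical dimensions, not the authors' code) that rotates X, keeps an arbitrary set of l = r + 1 rows, and checks the rank test behind Theorem 1: r + 1 sketched columns are rank deficient exactly when they come from the same subspace.

import numpy as np

rng = np.random.default_rng(1)
d, r, K, n_k = 50, 3, 4, 20
bases = [rng.standard_normal((d, r)) for _ in range(K)]
X = np.hstack([B @ rng.standard_normal((r, n_k)) for B in bases])
labels = np.repeat(np.arange(K), n_k)

# Generic rotation: Q factor of a Gaussian matrix.
Gamma, _ = np.linalg.qr(rng.standard_normal((d, d)))

omega = np.arange(r + 1)                 # any l = r + 1 rows work (Theorem 1)
X_sketch = (Gamma @ X)[omega, :]         # (r+1) x n sketch of the data

def same_subspace(cols):
    """r+1 sketched columns are linearly dependent iff they share a subspace."""
    return np.linalg.matrix_rank(X_sketch[:, cols], tol=1e-8) <= r

same = [0, 1, 2, 3]                      # four columns from the first subspace
mixed = [0, 1, 2, n_k]                   # three from the first, one from the second
print(same_subspace(same), same_subspace(mixed))  # True False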

Fig. 4: Left: Two different subspaces (even orthogonal ones) can appear identical if they are only observed on a subset of coordinates. In this figure, S and S' are identical if they are only observed on the (x, y) coordinates (top view). Right: Lemma 2 shows that if we rotate S and S' generically, the rotated subspaces ΓS and ΓS' will be different on all subsets of more than r coordinates. In this figure, the rotated subspaces ΓS and ΓS' are different on all sets of r + 1 = 2 coordinates, including the (x, y) plane.

Nonetheless, this combinatorial SC algorithm can be computationally prohibitive, especially for large n. In practice, we can use an algorithm such as sparse subspace clustering (SSC) [23]. This algorithm enjoys state-of-the-art performance, works well in practice, and has theoretical guarantees. The main idea behind SSC is that a column x in X lying in subspace S can be written as a linear combination of a few other columns in S (in fact, r or fewer). In contrast, it would require more columns from other subspaces to express x as their linear combination (as many as d). So SSC aims to find a sparse vector c ∈ R^{n-1} such that x = (X/x)c. Here X/x denotes the d x (n-1) matrix formed with all the columns in X except x. The nonzero entries in c index columns from the same subspace as x. SSC aims to find such a vector c by solving

    arg min_{c ∈ R^{n-1}} ||c||_1   s.t.   x = (X/x)c,   (1)

where ||.||_1 denotes the ℓ1-norm, given by the sum of absolute values. SSC then uses spectral clustering on these coefficients to recover the clusters.

Unfortunately, the solution to (1) is not exact. In fact, a typical solution to (1) will have most entries close to zero, and only a few (yet more than r) relevant entries. If we only use l = r + 1 rows, the locations of the relevant entries in c will be somewhat meaningless, in the sense that they could correspond to columns from different subspaces, as it takes at most r + 1 linearly independent columns to represent a column in R^{r+1}. As the number of rows l grows, the relevant entries in c are more likely to correspond to columns from the same subspace as x. On the other hand, as l grows, so does the computational complexity of (1). Without subsampling the rows, the computational complexity of SSC is O(dn^3). In contrast, using l > r rows, the computational complexity of SSC is only O(ln^3). Depending on d, n and r, this can bring substantial computational improvements.
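For intuition, problem (1) can be handed to an off-the-shelf convex solver. The following is a minimal sketch (not the authors' SSC implementation, which additionally handles noise and follows the ℓ1 step with spectral clustering over all columns); it assumes cvxpy is installed and that X_sketch is an l x n sketched data matrix such as the one built above.

import numpy as np
import cvxpy as cp

def ssc_coefficients(X_sketch, j):
    """Solve (1): min ||c||_1  s.t.  x_j = (X/x_j) c, on the l x n sketch."""
    l, n = X_sketch.shape
    x = X_sketch[:, j]
    others = np.delete(X_sketch, j, axis=1)   # the l x (n-1) matrix X / x_j
    c = cp.Variable(n - 1)
    prob = cp.Problem(cp.Minimize(cp.norm(c, 1)), [others @ c == x])
    prob.solve()
    return c.value

# Example usage (with X_sketch from the previous sketch):
# c = ssc_coefficients(X_sketch, j=0)
# Large |c_i| should point to columns from the same subspace as column 0,
# provided l is large enough, as discussed in the text.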
We thus want l to be large enough that the relevant entries in c reveal the clusters of X, but not so large that (1) becomes too computationally expensive. In fact, we know from Wang et al. [24] that SSC will find the correct clustering using only l = O(r log(rK^2) + log n) rows if the following conditions hold (see Figure 3 to build some intuition):

(i) The angles between subspaces are sufficiently large.
(ii) The subspaces are sufficiently incoherent with the canonical basis, or the data is transformed by a Gaussian projection or by the fast Johnson-Lindenstrauss transform [27].
(iii) The columns of X_k are well-spread over S_k.

On the other hand, Theorem 1 states that theoretically it is possible to cluster X using only l = r + 1 rows, without these conditions. This reveals a gap between theory and practice that we further study in our experiments. We have shown that, theoretically, conditions (i)-(iii) are sufficient but not necessary. It remains an open question whether there exists a polynomial-time algorithm that can provably cluster without these requirements.

III. ABOUT δ AND OTHER DISTANCES

In this section we make several remarks about partial coordinate discrepancy and its relation to other distances between subspaces. First recall the definition of the principal angle distance between two subspaces [28].

Definition 2 (Principal angle distance). Let S, S' be subspaces in Gr(r, R^d). The principal angle distance between S and S' is defined as θ(S, S') = 1 - ||U^T U'||_2^2, where U and U' are orthonormal bases of S and S', and ||.||_2 denotes the spectral norm.

It is intuitive that when data are generated from subspaces that are close to one another, it is difficult to cluster these data correctly. Typically, other results use the principal angle distance to measure how close subspaces are. For example, in the previous section we discussed that if conditions (i)-(iii) hold, then O(r log(rK^2) + log n) rows are sufficient for clustering [24]. Condition (i) essentially requires that θ is sufficiently large. The partial coordinate discrepancy δ is another useful metric. Here we use it to show that, theoretically, r + 1 rows are necessary and sufficient for clustering without these assumptions.

We now wish to compare δ and θ. We will see that subspaces close in one metric can in general be far in the other. We believe this is an important observation for bridging the gap between the sufficient oversampling of rows required when using θ and the necessary and sufficient condition of Theorem 1.

In our study we will analyze δ using bases of subspaces, so let us first show that δ shares the important property of being basis independent. To see this, let U, U' ∈ R^{d x r} denote bases of S, S'. Notice that S_ω = S'_ω if and only if there exists a matrix B ∈ R^{r x r} such that U'_ω = U_ω B.

Now suppose that instead of U we choose another basis V of S. Since U and V are both bases of S, there must exist a full-rank matrix Θ ∈ R^{r x r} such that U = VΘ. As before, S_ω = S'_ω if and only if there exists a matrix B' ∈ R^{r x r} such that U'_ω = V_ω B'. Now observe that if there exists B such that U'_ω = U_ω B, then there exists B' (namely B' = ΘB) such that U'_ω = V_ω B'. Similarly, if there exists B' such that U'_ω = V_ω B', then there exists B (namely B = Θ^{-1} B') such that U'_ω = U_ω B.

With this, we can now study the relationship between partial coordinate discrepancy and principal angle distance. The next example shows that two subspaces may be close with respect to θ, but far with respect to δ.

Example 2 (Small θ may coincide with large δ). Consider a subspace S spanned by a generic U ∈ R^{d x r}. Let ε > 0 be given, and let U' = U + ε1, where 1 denotes the d x r all-ones matrix. It is easy to see that θ(S, S') → 0 as ε → 0. In contrast, δ(S, S') = 1 for every ε > 0.

Conversely, the next example shows that two subspaces may be close with respect to δ, but far with respect to θ.

Example 3 (Small δ may coincide with large θ). Consider two subspaces S, S' ∈ Gr(r, R^d) spanned by (stacking blocks vertically)

    U = [I; I; 0]   and   U' = [I; -I; 0],

where I denotes the r x r identity matrix and 0 denotes the (d - 2r) x r zero matrix. For d much larger than r, δ(S, S') will be close to zero, because the two subspaces only differ on subsets ω that contain a matching pair of coordinates i and i + r from the first 2r coordinates. However, the subspaces are orthogonal, and so the principal angle distance is maximal: θ(S, S') = 1.

Examples 2 and 3 show that, in general, subspaces close in one metric can be far in the other. However, for subspaces that are incoherent with the canonical axes, there is an interesting relation between δ and θ. Recall that coherence is a parameter indicating how aligned a subspace is with the canonical axes [29]. More precisely:

Definition 3 (Coherence). Let S ∈ Gr(r, R^d). Let P_S denote the projection operator onto S, and e_i the i-th canonical vector in R^d. The standard coherence parameter μ ∈ [1, d/r] of S is defined as μ = (d/r) max_i ||P_S e_i||_2^2.

Intuitively, an incoherent subspace (small μ) will be well-spread over all the canonical directions. Equivalently, the magnitudes of the rows of its bases will not vary too much. In this case, if δ is small, we can also expect θ to be small. The following example demonstrates one such scenario.

Example 4 (An example where small δ and μ imply small θ). Suppose that S and S' are spanned by orthonormal bases U and U', respectively. Suppose they have d̄ coordinates on which they span the same subspace; for d̄ close to d, this will result in a small δ. Suppose the coherence of each subspace is bounded by μ_0, i.e.,

    (d/r) max_i ||P_S e_i||_2^2 = (d/r) max_i ||U_i||_2^2 ≤ μ_0,

where U_i is the i-th row of U. Further suppose that if we restrict the bases to the d̄ coordinates the two subspaces have in common, we can lower-bound their inner product:

    || Σ_{i=1}^{d̄} U_i^T U'_i ||_2 ≥ c_0.

This is essentially another incoherence condition, which will hold with c_0 close to d̄/d when the subspaces are highly incoherent with the canonical basis. Then

    θ(S, S') ≤ 1 - (c_0 - (d - d̄) μ_0 r / d)^2   when   c_0 - (d - d̄) μ_0 r / d > 0.

From this example our intuition is confirmed: if d̄ is very close to d, c_0 is close to 1, and μ_0 is constant, then the term in the parentheses is near 1 and the angle is small. To see how we get the bound on θ(S, S'), first note that θ(S, S') = 1 - ||U^T U'||_2^2, so it suffices to bound ||U^T U'||_2 from below:

    ||U^T U'||_2 = || Σ_{i=1}^{d} U_i^T U'_i ||_2
                 = || Σ_{i=1}^{d̄} U_i^T U'_i + Σ_{i=d̄+1}^{d} U_i^T U'_i ||_2
                 ≥ || Σ_{i=1}^{d̄} U_i^T U'_i ||_2 - || Σ_{i=d̄+1}^{d} U_i^T U'_i ||_2
                 ≥ c_0 - Σ_{i=d̄+1}^{d} ||U_i^T U'_i||_2            (2)
                 ≥ c_0 - Σ_{i=d̄+1}^{d} ||U_i||_2 ||U'_i||_2
                 ≥ c_0 - (d - d̄) μ_0 r / d,

where we used the triangle inequality and the matrix norm inequality ||U_i^T U'_i||_2 ≤ ||U_i||_2 ||U'_i||_2, and step (2) follows by assumption. This illustrates a case where, if the subspaces in U have low coherence and their partial coordinate discrepancy is small, the angle between them will also be small.
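The quantities in Definitions 2 and 3 are easy to compute from orthonormal bases. The following is a minimal sketch (an illustration only; it uses the form θ = 1 - ||U^T U'||_2^2 adopted in Definition 2 and a brute-force δ, with small arbitrary dimensions) that reproduces the flavor of Example 3: the block construction gives maximal θ but a small δ, while the coherence stays well below its maximum d/r.

import numpy as np
from itertools import combinations

def orth(A):
    """Orthonormal basis of the column span of A (thin QR)."""
    Q, _ = np.linalg.qr(A)
    return Q

def coherence(U):
    """mu = (d/r) * max_i ||P_S e_i||^2, with U an orthonormal basis of S."""
    d, r = U.shape
    return d / r * np.max(np.sum(U**2, axis=1))

def principal_angle_distance(U, V):
    """theta = 1 - ||U^T V||_2^2 for orthonormal bases U, V (Definition 2)."""
    return 1.0 - np.linalg.norm(U.T @ V, 2) ** 2

def delta(U, V, tol=1e-9):
    """Brute-force partial coordinate discrepancy (Definition 1)."""
    d, r = U.shape
    def eq(om):
        Uo, Vo = U[list(om), :], V[list(om), :]
        return (np.linalg.matrix_rank(np.hstack([Uo, Vo]), tol=tol)
                == np.linalg.matrix_rank(Uo, tol=tol))
    subs = list(combinations(range(d), r + 1))
    return sum(not eq(om) for om in subs) / len(subs)

d, r = 8, 2
I, Z = np.eye(r), np.zeros((d - 2 * r, r))
U = orth(np.vstack([I, I, Z]))        # Example 3 blocks
V = orth(np.vstack([I, -I, Z]))
print(principal_angle_distance(U, V), delta(U, V))  # 1.0 (maximal), ~0.21 (small)
print(coherence(U))                                  # 2.0 <= d/r = 4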
Existing analyses show that practical SC algorithms tend to fail if θ is small [23]. It follows that for incoherent subspaces, if δ is small, SC can be very hard in practice. This is illustrated in Figure 5, which shows that the clustering performance of practical algorithms declines as δ decreases.

IV. EXPERIMENTS

Theorem 1 shows that one can cluster X using only r + 1 rows of ΓX. As discussed in Section II, practical algorithms like SSC may require more than this bare minimum number of rows. In this section we present experiments to study the gap between what is theoretically possible and what is practically achievable with state-of-the-art algorithms. In Section III we also explained that for incoherent subspaces the partial coordinate discrepancy δ and the principal angle distance θ have a tight relation: if δ is small, then θ is small too. Existing analyses show that practical SC algorithms tend to fail if θ is small [23]. It follows that for incoherent subspaces, if δ is small, SC can be very hard in practice. The experiments of this section support these results.

In our experiments we compare the following approaches to subspace clustering:

(a) Cluster X directly (full-data).
(b) Cluster l > r rows of ΓX.

To compare the two on equal footing, we study both cases using the sparse subspace clustering (SSC) algorithm [23]. We chose SSC because it enjoys state-of-the-art performance, works well in practice, and has theoretical guarantees. In all our experiments we use the SSC code provided by its authors [23].

A. Simulations

We first use simulations to study the cases above as a function of the ambient dimension d, the partial coordinate discrepancy δ of the subspaces in U, and the number of rows used l. To obtain subspaces with a specific δ, we first generate a d x r matrix V with entries drawn i.i.d. from the standard Gaussian distribution. Subspaces generated this way have low coherence. Then, for k = 1, ..., K, we selected the k-th set of δ' rows in V (i.e., rows (k-1)δ'+1, ..., kδ') and replaced them with fresh entries, also drawn i.i.d. from the standard Gaussian distribution. This yields K bases, which span the subspaces in U. This way, the bases of any S and S' in U differ on exactly 2δ' rows. It follows that δ(S, S') is equal to the probability of selecting any of these 2δ' rows in r + 1 draws (without replacement). That is,

    δ(S, S') = 1 - C(d - 2δ', r+1) / C(d, r+1)   for every S, S' ∈ U.   (3)

Unfortunately, (3) gives little intuition about how small or large δ is. We will thus upper-bound δ by a small number that is easily interpretable. To do this, we use the next simple bound, which gives a clear idea of how small δ is in our experiments. A derivation is given in Section V:

    δ(S, S') ≤ 2δ'(r + 1) / (d - r) = O(rδ'/d).   (4)

In each trial of our experiments we generate a set U of K = 5 subspaces, each of dimension r = 5, using the procedure described above. Next we generate a matrix X with n_k = 100 columns from each subspace. The coefficients of each column in X are drawn i.i.d. from the standard Gaussian distribution. Matrices generated this way satisfy A1 and A2. To measure accuracy, we find the best matching between the identified clusters and the original sets.

In our first simulation we study the dependency on δ' (which gives a proxy for δ through (4)) and l, with d = 10^5 fixed. The results are summarized in Figure 5 (top-left). This figure shows the gap between theory and practice. Theorem 1 shows that, theoretically, all these trials can be perfectly clustered. The figure shows, as predicted in Section III, that for incoherent subspaces clustering becomes harder in practice as δ' (and hence δ) shrinks. Observe that as δ' grows, fewer rows suffice for accurate clustering. For example, in this experiment SSC consistently succeeds with l = δ'.

Next we study the cases above as a function of d and δ', with l = δ'. The results are summarized in Figure 5 (top-right). This also shows a gap between theory and practice. Figure 5 shows, as predicted in Section III, that for incoherent subspaces, if δ' (and hence δ) is too small, the angle between the subspaces in U will be small, whence clustering can be hard in practice.
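The construction of subspaces with a prescribed discrepancy is easy to reproduce. The following is a minimal sketch (not the authors' experiment code; d, r, K and δ' are small arbitrary values): it builds the K bases by overwriting disjoint blocks of δ' rows of a common Gaussian matrix V, and evaluates the exact value (3) and the bound (4).

import numpy as np
from math import comb

def bases_with_prescribed_delta(d, r, K, delta_prime, seed=0):
    """K bases that pairwise differ on exactly 2*delta_prime rows (Sec. IV-A)."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((d, r))
    bases = []
    for k in range(K):
        U_k = V.copy()
        rows = slice(k * delta_prime, (k + 1) * delta_prime)
        U_k[rows, :] = rng.standard_normal((delta_prime, r))  # overwrite block k
        bases.append(U_k)
    return bases

def delta_exact(d, r, delta_prime):
    """Equation (3): probability that omega hits one of the 2*delta_prime rows."""
    return 1.0 - comb(d - 2 * delta_prime, r + 1) / comb(d, r + 1)

def delta_bound(d, r, delta_prime):
    """Equation (4): 2*delta_prime*(r+1)/(d-r) = O(r*delta_prime/d)."""
    return 2 * delta_prime * (r + 1) / (d - r)

d, r, K, dp = 1000, 5, 5, 20
bases = bases_with_prescribed_delta(d, r, K, dp)
print(delta_exact(d, r, dp), delta_bound(d, r, dp))  # exact value <= bound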
Fig. 5: Proportion of correctly classified points by SSC, using only l > r rows of ΓX, with K = 5 subspaces, each of dimension r = 5, and n_k = 100 columns per subspace. The color of each pixel indicates the average over 100 trials (the lighter the better). White represents 100% accuracy, and black represents 20%, which amounts to random guessing. Theorem 1 states that, theoretically, all these trials can be perfectly clustered. This shows a gap between theory and practice. Top-Left: Transition diagram as a function of δ' (which gives a proxy for the partial coordinate discrepancy δ through (4)) and the number of used rows l, with fixed ambient dimension d = 10^5. As discussed in Section III, for incoherent subspaces clustering becomes harder in practice as δ' shrinks. Observe that as δ' grows, fewer rows suffice for accurate clustering. Top-Right: Transition diagram as a function of d and δ', using only l = δ' rows. All pixels above the black point in each column have at least 95% accuracy. These points represent the minimum δ' and l required for a clustering accuracy of at least 95%. As discussed in Section III, for incoherent subspaces, if δ' (and hence δ) is too small, the angle between the subspaces in U will be too small, whence clustering can be hard in practice. Bottom-Left: Partial coordinate discrepancy δ (upper-bounded by O(rδ'/d)) and fraction of rows l/d required by SSC for a clustering accuracy of at least 95%. The curve is the best exponential fit to these points, and it represents the boundary between at least 95% accuracy (above the curve) and less than 95% accuracy (below the curve). This shows that for incoherent subspaces, as d grows, one only requires a vanishing partial coordinate discrepancy δ and a vanishing fraction of rows l/d to succeed. Bottom-Right: Time required to cluster X directly (full-data) and to cluster l = 20 rows of ΓX, as a function of the ambient dimension d (average over 100 trials). In all of these trials, both options achieve 100% accuracy.

In this experiment we also record the minimum δ' and l required for a clustering accuracy of at least 95%. Figure 5 (bottom-left) shows that for incoherent subspaces, as d grows, one only requires a vanishing partial coordinate discrepancy δ and a vanishing fraction of rows l/d to succeed.

In our last simulation we study the computation time required to cluster X directly (full-data) and to cluster l = 20 rows of ΓX, as a function of d. In this experiment we fix l = δ' = 20, known from our previous experiment to produce 100% accuracy for a wide range of d.

Unsurprisingly, Figure 5 (bottom-right) shows that if we only use a constant number of rows, the computation time is virtually unaffected by the ambient dimension, unlike standard (full-data) algorithms. Sketching can thus bring the computational complexity orders of magnitude lower (depending on d and n) than standard (full-data) techniques.

B. Real Data

We now evaluate the performance of sketching on a real-life problem where the phenomenon of partial coordinate similarity arises naturally: classifying faces. To this end we use the Extended Yale B dataset [26], which consists of face images of 38 individuals with a fixed pose under varying illumination (see Figure 6). As discussed in [23], shadows and specularities in these images can be modeled as sparse errors. So, as a preprocessing step, we first apply the augmented Lagrange multiplier method [30] for robust principal component analysis to the images of each individual (using code provided by the authors). This removes the sparse errors, so that the vectorized images of each individual lie near a 9-dimensional subspace [25]. Hence the matrix X containing all the vectorized images lies near a union of 38 nine-dimensional subspaces.

Observe that these images are very similar in several regions. For example, the lower corners are mostly dark. Distinct subspaces can thus appear to be the same if they are only observed on the coordinates corresponding to these pixels. If we only use a few rows of X (without rotating), there is a positive probability of selecting these coordinates, in which case we would be unable to determine the right clustering. Fortunately, Lemma 2 shows that the columns of a generic rotation of X will lie near a union of subspaces that are different on all subsets of l > r coordinates (maximal partial coordinate discrepancy). This implies, as shown in Theorem 1, that the clusters of the original X are the same as the clusters of any l > r rows of the rotated X. This means that we can cluster X using any l > r coordinates of a rotation of X. This is verified by the following experiment.

In this experiment we study classification accuracy as a function of the number of individuals, or equivalently the number of subspaces K, and as a function of the number of rows l used for clustering. We replicate the experiment in [23]: we first divide all individuals into four groups, corresponding to individuals {1, ..., 10}, {11, ..., 20}, {21, ..., 30} and {31, ..., 38}. Next we cluster all possible choices of K ∈ {2, 3, 5, 8, 10} individuals for the first three groups, and K ∈ {2, 3, 5, 8} individuals for the last group. We repeat this experiment for different choices of l and record the classification accuracy. The results are summarized in Figure 6. They show that one can achieve the same performance as standard (full-data) methods using only a small fraction of the data. This results in computational advantages (time and memory).

Fig. 6: Left: Proportion of correctly classified images from the Extended Yale B dataset [26] (see Figure 2), as a function of the number of individuals, or equivalently the number of subspaces K, and as a function of the number of rows l used for clustering. In particular, l = d = 2016 corresponds to standard (full-data) SSC. Right: Computation time as a function of the number of individuals K, with l = 65 fixed (known from the accuracy results to achieve the same accuracy as standard SSC). Recall that the computational complexities of SSC and sketching are O(dn^3) and O(ln^3), respectively. Here d = 2016 and n = 38K. This shows that sketching achieves the same accuracy as standard SSC in only a fraction of the time. This gap becomes more evident as d and n grow, as shown in Figure 5.

V. PROOFS

In this section we give the proofs of all our statements.
Proof of Lemma 1. We need to show that δ satisfies the three properties of a metric. Let S, S', S'' ∈ Gr(r, R^d).

(i) It is easy to see that if S = S', then δ(S, S') = 0. For the converse, suppose δ(S, S') = 0. Let υ = {1, ..., r}, and let ω_i = υ ∪ {i}, with i = r+1, ..., d. Take bases U, U' of S, S' such that U_{ω_{r+1}} = U'_{ω_{r+1}}. We can do this because δ(S, S') = 0, which implies S_ω = S'_ω for every ω ∈ [d]_{r+1}, including ω_{r+1}. Next observe that for i = r+2, ..., d, since S_{ω_i} = S'_{ω_i} and U_υ = U'_υ, it must be that U = U' on the i-th row (otherwise S_{ω_i} ≠ S'_{ω_i}). We thus conclude that U = U', which implies S = S'.

(ii) That δ(S, S') = δ(S', S) follows immediately from the definition.

(iii) To see that δ satisfies the triangle inequality, write:

    δ(S, S'') + δ(S'', S') = (1/C(d, r+1)) Σ_{ω ∈ [d]_{r+1}} ( 1{S_ω ≠ S''_ω} + 1{S''_ω ≠ S'_ω} )
                           ≥ (1/C(d, r+1)) Σ_{ω ∈ [d]_{r+1}} 1{S_ω ≠ S''_ω or S''_ω ≠ S'_ω}
                           ≥ (1/C(d, r+1)) Σ_{ω ∈ [d]_{r+1}} 1{S_ω ≠ S'_ω} = δ(S, S'),

where the last inequality follows because {S_ω = S''_ω and S''_ω = S'_ω} implies {S_ω = S'_ω}, whence 1{S_ω ≠ S''_ω or S''_ω ≠ S'_ω} = 1{S_ω ≠ S'_ω} = 0, and in any other case 1{S_ω ≠ S''_ω or S''_ω ≠ S'_ω} = 1 ≥ 1{S_ω ≠ S'_ω}.

Proof of Lemma 2. We need to show that if S ≠ S', then (ΓS)_ω ≠ (ΓS')_ω for every ω ∈ [d]_{r+1}. Let U and U' denote bases of S and S'. Observe that (ΓS)_ω = (ΓS')_ω if and only if there exists a matrix B ∈ R^{r x r} such that (ΓU')_ω = (ΓU)_ω B, or equivalently, if and only if Γ_ω U' = Γ_ω U B, which we can rewrite as

    Γ_ω (U' - UB) = 0.   (5)

Let υ denote the subset with the first r elements of ω, and let i denote the last element of ω. Then we can rewrite (5) as

    [Γ_υ; Γ_i] (U' - UB) = 0.   (6)

Since Γ is drawn according to A3, the rows in Γ_υ are linearly independent with probability 1. Since U is a basis of an r-dimensional subspace, its r columns are also linearly independent. It follows that Γ_υ U is a full-rank r x r matrix. So we can use the top block in (6) to obtain B = (Γ_υ U)^{-1} Γ_υ U'. We can plug this into the bottom part of (6) to obtain

    Γ_i (U' - U (Γ_υ U)^{-1} Γ_υ U') = 0.   (7)

Recall that (Γ_υ U)^{-1} = adj(Γ_υ U) / det(Γ_υ U), where adj(Γ_υ U) and det(Γ_υ U) denote the adjugate and the determinant of Γ_υ U. Therefore, we may rewrite (7) as the following system of r polynomial equations:

    Γ_i ( det(Γ_υ U) U' - U adj(Γ_υ U) Γ_υ U' ) = 0.   (8)

Observe that the left-hand side of (8) is just another way to write Γ_i (U' - UB), where B is expressed in terms of U, U' and Γ_υ. Since S ≠ S', there exists no B ∈ R^{r x r} such that U' = UB; equivalently, (U' - UB) ≠ 0. Since Γ is drawn according to A3, we conclude that the left-hand side of (8) is a nonzero set of polynomials in the entries of Γ, and so (8) holds with probability zero. Since (ΓS)_ω = (ΓS')_ω if and only if (8) holds, we conclude that with probability 1, (ΓS)_ω ≠ (ΓS')_ω. Since ω was arbitrary (and there are only finitely many choices), this holds for every ω ∈ [d]_{r+1}, as desired.

Proof of Theorem 1. Recall that X_k denotes the matrix formed with all the columns in X corresponding to the k-th subspace in U. Under A1-A2, with probability 1 the partition {X_k}_{k=1}^K is the only way to cluster the columns in X into K r-dimensional subspaces. This is because under A1, the columns in X lie on intersections of the subspaces in U with probability zero, so any combination of more than r columns from different subspaces in U will lie in a subspace of dimension greater than r with probability 1.

Recall that [d]_l denotes the set of all subsets of {1, ..., d} with exactly l distinct elements, and that Γ denotes a generic rotation drawn according to A3. Let ω ∈ [d]_l, and define (ΓU)_ω as the set of rotated subspaces in U restricted to the coordinates in ω, i.e., (ΓU)_ω = {(ΓS_k)_ω}_{k=1}^K. Lemma 2 implies that all the subspaces in (ΓU)_ω are different. It is easy to see that the columns in (ΓX_k)_ω lie in (ΓS_k)_ω. By A1 and A3, the columns in (ΓX)_ω lie on intersections of the subspaces in (ΓU)_ω with probability zero, so any combination of more than r columns from different subspaces in (ΓU)_ω will lie in a subspace of dimension greater than r with probability 1. It follows that the columns of a subset X' of X lie in an r-dimensional subspace of R^d (i.e., they all come from the same S_k) if and only if the columns in (ΓX')_ω lie in an r-dimensional subspace of R^l, as claimed.

Derivation of (4). We want to show that δ(S, S') ≤ 2δ'(r + 1)/(d - r). Recall that δ(S, S') is the probability that S and S' are different on a set of r + 1 coordinates selected uniformly at random (without replacement). In the setup of Section IV, the bases U, U' of S, S' differ on exactly 2δ' rows. Then

    δ(S, S') = P(one of the 2δ' rows is selected in r + 1 draws)
             = P( ∪_{τ=1}^{r+1} {one of the 2δ' rows is selected in the τ-th draw} )
         (a) ≤ Σ_{τ=1}^{r+1} P(one of the 2δ' rows is selected in the τ-th draw)
         (b) = Σ_{τ=1}^{r+1} Σ_{ρ=0}^{τ-1} P(one of the 2δ' rows is selected in the τ-th draw | ρ of the 2δ' rows were selected in the first τ-1 draws) P(ρ of the 2δ' rows were selected in the first τ-1 draws)
         (c) ≤ Σ_{τ=1}^{r+1} Σ_{ρ=0}^{τ-1} (2δ'/(d - r)) P(ρ of the 2δ' rows were selected in the first τ-1 draws),

where (a) follows by the union bound, (b) follows by the law of total probability, and (c) follows because in the τ-th draw (without replacement) at most 2δ' of the remaining rows are among the 2δ' differing rows, while at least d - r rows remain in the pool (since τ ≤ r + 1), so the conditional probability is at most 2δ'/(d - r). Continuing from the last expression, and using that the inner probabilities sum to 1 for each τ, we have

    δ(S, S') ≤ (2δ'/(d - r)) Σ_{τ=1}^{r+1} Σ_{ρ=0}^{τ-1} P(ρ of the 2δ' rows were selected in the first τ-1 draws) = 2δ'(r + 1)/(d - r),

as desired.
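As a numerical sanity check on this derivation, the probability in question can also be estimated by direct Monte Carlo sampling of ω. The sketch below (an illustration with arbitrary parameters; it only verifies the inequality numerically and is not part of the proof) compares the estimate against the exact value (3) and the bound (4).

import numpy as np
from math import comb

def mc_delta(d, r, delta_prime, trials=100000, seed=0):
    """Monte Carlo estimate of P(omega hits one of the 2*delta_prime rows)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        omega = rng.choice(d, size=r + 1, replace=False)
        hits += np.any(omega < 2 * delta_prime)  # differing rows = first 2*delta_prime
    return hits / trials

d, r, dp = 200, 5, 10
exact = 1 - comb(d - 2 * dp, r + 1) / comb(d, r + 1)
bound = 2 * dp * (r + 1) / (d - r)
print(mc_delta(d, r, dp), exact, bound)  # estimate ~ exact <= bound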
VI. ACKNOWLEDGEMENTS

Work by L. Balzano was supported by ARO Grant W911NF.

REFERENCES

[1] R. Vidal, Subspace clustering, IEEE Signal Processing Magazine, 2011.
[2] K. Kanatani, Motion segmentation by subspace separation and model selection, IEEE International Conference on Computer Vision, 2001.
[3] B. Eriksson, P. Barford, J. Sommers and R. Nowak, DomainImpute: Inferring unseen components in the Internet, IEEE INFOCOM Mini-Conference, 2011.
[4] G. Mateos and K. Rajawat, Dynamic network cartography: Advances in network health monitoring, IEEE Signal Processing Magazine, 2013.
[5] W. Hong, J. Wright, K. Huang and Y. Ma, Multi-scale hybrid linear models for lossy image representation, IEEE Transactions on Image Processing, 2006.
[6] J. Rennie and N. Srebro, Fast maximum margin matrix factorization for collaborative prediction, International Conference on Machine Learning, 2005.
[7] A. Zhang, N. Fawaz, S. Ioannidis and A. Montanari, Guess who rated this movie: Identifying users through subspace clustering, Conference on Uncertainty in Artificial Intelligence, 2012.

[8] G. Liu, Z. Lin and Y. Yu, Robust subspace segmentation by low-rank representation, International Conference on Machine Learning, 2010.
[9] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu and Y. Ma, Robust recovery of subspace structures by low-rank representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
[10] M. Soltanolkotabi and E. Candès, A geometric analysis of subspace clustering with outliers, Annals of Statistics, 2012.
[11] M. Soltanolkotabi, E. Elhamifar and E. Candès, Robust subspace clustering, Annals of Statistics, 2014.
[12] C. Qu and H. Xu, Subspace clustering with irrelevant features via robust Dantzig selector, Advances in Neural Information Processing Systems, 2015.
[13] X. Peng, Z. Yi and H. Tang, Robust subspace clustering via thresholding ridge regression, AAAI Conference on Artificial Intelligence, 2015.
[14] Y. Wang and H. Xu, Noisy sparse subspace clustering, International Conference on Machine Learning, 2013.
[15] Y. Wang, Y.-X. Wang and A. Singh, Differentially private subspace clustering, Advances in Neural Information Processing Systems, 2015.
[16] H. Hu, J. Feng and J. Zhou, Exploiting unsupervised and supervised constraints for subspace clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
[17] L. Balzano, B. Recht and R. Nowak, High-dimensional matched subspace detection when data are missing, IEEE International Symposium on Information Theory, 2010.
[18] B. Eriksson, L. Balzano and R. Nowak, High-rank matrix completion and subspace clustering with missing data, Artificial Intelligence and Statistics, 2012.
[19] D. Pimentel-Alarcón, L. Balzano and R. Nowak, On the sample complexity of subspace clustering with missing data, IEEE Statistical Signal Processing Workshop, 2014.
[20] D. Pimentel-Alarcón and R. Nowak, The information-theoretic requirements of subspace clustering with missing data, International Conference on Machine Learning, 2016.
[21] C. Yang, D. Robinson and R. Vidal, Sparse subspace clustering with missing entries, International Conference on Machine Learning, 2015.
[22] J. He, L. Balzano and A. Szlam, Incremental gradient on the Grassmannian for online foreground and background separation in subsampled video, Conference on Computer Vision and Pattern Recognition, 2012.
[23] E. Elhamifar and R. Vidal, Sparse subspace clustering: Algorithm, theory, and applications, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
[24] Y. Wang, Y.-X. Wang and A. Singh, A deterministic analysis of noisy sparse subspace clustering for dimensionality-reduced data, International Conference on Machine Learning, 2015.
[25] R. Basri and D. Jacobs, Lambertian reflectance and linear subspaces, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003.
[26] K. Lee, J. Ho and D. Kriegman, Acquiring linear subspaces for face recognition under variable lighting, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005.
[27] N. Ailon and B. Chazelle, The fast Johnson-Lindenstrauss transform and approximate nearest neighbors, SIAM Journal on Computing, 2009.
[28] G. Golub and C. Van Loan, Matrix Computations, The Johns Hopkins University Press, 3rd edition, 1996.
[29] B. Recht, A simpler approach to matrix completion, Journal of Machine Learning Research, 2011.
[30] Z. Lin, R. Liu and Z. Su, Linearized alternating direction method with adaptive penalty for low-rank representation, Advances in Neural Information Processing Systems, 2011.


More information

Capacity Analysis of MIMO Systems with Unknown Channel State Information

Capacity Analysis of MIMO Systems with Unknown Channel State Information Capacity Analysis of MIMO Systems with Unknown Channel State Information Jun Zheng an Bhaskar D. Rao Dept. of Electrical an Computer Engineering University of California at San Diego e-mail: juzheng@ucs.eu,

More information

The Press-Schechter mass function

The Press-Schechter mass function The Press-Schechter mass function To state the obvious: It is important to relate our theories to what we can observe. We have looke at linear perturbation theory, an we have consiere a simple moel for

More information

Acute sets in Euclidean spaces

Acute sets in Euclidean spaces Acute sets in Eucliean spaces Viktor Harangi April, 011 Abstract A finite set H in R is calle an acute set if any angle etermine by three points of H is acute. We examine the maximal carinality α() of

More information

SYNCHRONOUS SEQUENTIAL CIRCUITS

SYNCHRONOUS SEQUENTIAL CIRCUITS CHAPTER SYNCHRONOUS SEUENTIAL CIRCUITS Registers an counters, two very common synchronous sequential circuits, are introuce in this chapter. Register is a igital circuit for storing information. Contents

More information

θ x = f ( x,t) could be written as

θ x = f ( x,t) could be written as 9. Higher orer PDEs as systems of first-orer PDEs. Hyperbolic systems. For PDEs, as for ODEs, we may reuce the orer by efining new epenent variables. For example, in the case of the wave equation, (1)

More information

Sparse Reconstruction of Systems of Ordinary Differential Equations

Sparse Reconstruction of Systems of Ordinary Differential Equations Sparse Reconstruction of Systems of Orinary Differential Equations Manuel Mai a, Mark D. Shattuck b,c, Corey S. O Hern c,a,,e, a Department of Physics, Yale University, New Haven, Connecticut 06520, USA

More information

Sturm-Liouville Theory

Sturm-Liouville Theory LECTURE 5 Sturm-Liouville Theory In the three preceing lectures I emonstrate the utility of Fourier series in solving PDE/BVPs. As we ll now see, Fourier series are just the tip of the iceberg of the theory

More information

Topic 7: Convergence of Random Variables

Topic 7: Convergence of Random Variables Topic 7: Convergence of Ranom Variables Course 003, 2016 Page 0 The Inference Problem So far, our starting point has been a given probability space (S, F, P). We now look at how to generate information

More information

3.2 Shot peening - modeling 3 PROCEEDINGS

3.2 Shot peening - modeling 3 PROCEEDINGS 3.2 Shot peening - moeling 3 PROCEEDINGS Computer assiste coverage simulation François-Xavier Abaie a, b a FROHN, Germany, fx.abaie@frohn.com. b PEENING ACCESSORIES, Switzerlan, info@peening.ch Keywors:

More information

Table of Common Derivatives By David Abraham

Table of Common Derivatives By David Abraham Prouct an Quotient Rules: Table of Common Derivatives By Davi Abraham [ f ( g( ] = [ f ( ] g( + f ( [ g( ] f ( = g( [ f ( ] g( g( f ( [ g( ] Trigonometric Functions: sin( = cos( cos( = sin( tan( = sec

More information

LECTURE NOTES ON DVORETZKY S THEOREM

LECTURE NOTES ON DVORETZKY S THEOREM LECTURE NOTES ON DVORETZKY S THEOREM STEVEN HEILMAN Abstract. We present the first half of the paper [S]. In particular, the results below, unless otherwise state, shoul be attribute to G. Schechtman.

More information

Similarity Measures for Categorical Data A Comparative Study. Technical Report

Similarity Measures for Categorical Data A Comparative Study. Technical Report Similarity Measures for Categorical Data A Comparative Stuy Technical Report Department of Computer Science an Engineering University of Minnesota 4-92 EECS Builing 200 Union Street SE Minneapolis, MN

More information

Iterated Point-Line Configurations Grow Doubly-Exponentially

Iterated Point-Line Configurations Grow Doubly-Exponentially Iterate Point-Line Configurations Grow Doubly-Exponentially Joshua Cooper an Mark Walters July 9, 008 Abstract Begin with a set of four points in the real plane in general position. A to this collection

More information

Breaking the Limits of Subspace Inference

Breaking the Limits of Subspace Inference Breaking the Limits of Subspace Inference Claudia R. Solís-Lemus, Daniel L. Pimentel-Alarcón Emory University, Georgia State University Abstract Inferring low-dimensional subspaces that describe high-dimensional,

More information

Qubit channels that achieve capacity with two states

Qubit channels that achieve capacity with two states Qubit channels that achieve capacity with two states Dominic W. Berry Department of Physics, The University of Queenslan, Brisbane, Queenslan 4072, Australia Receive 22 December 2004; publishe 22 March

More information

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback Journal of Machine Learning Research 8 07) - Submitte /6; Publishe 5/7 An Optimal Algorithm for Banit an Zero-Orer Convex Optimization with wo-point Feeback Oha Shamir Department of Computer Science an

More information

The Exact Form and General Integrating Factors

The Exact Form and General Integrating Factors 7 The Exact Form an General Integrating Factors In the previous chapters, we ve seen how separable an linear ifferential equations can be solve using methos for converting them to forms that can be easily

More information

arxiv: v1 [math.mg] 10 Apr 2018

arxiv: v1 [math.mg] 10 Apr 2018 ON THE VOLUME BOUND IN THE DVORETZKY ROGERS LEMMA FERENC FODOR, MÁRTON NASZÓDI, AND TAMÁS ZARNÓCZ arxiv:1804.03444v1 [math.mg] 10 Apr 2018 Abstract. The classical Dvoretzky Rogers lemma provies a eterministic

More information

Optimal CDMA Signatures: A Finite-Step Approach

Optimal CDMA Signatures: A Finite-Step Approach Optimal CDMA Signatures: A Finite-Step Approach Joel A. Tropp Inst. for Comp. Engr. an Sci. (ICES) 1 University Station C000 Austin, TX 7871 jtropp@ices.utexas.eu Inerjit. S. Dhillon Dept. of Comp. Sci.

More information

Subspace Estimation from Incomplete Observations: A High-Dimensional Analysis

Subspace Estimation from Incomplete Observations: A High-Dimensional Analysis Subspace Estimation from Incomplete Observations: A High-Dimensional Analysis Chuang Wang, Yonina C. Elar, Fellow, IEEE an Yue M. Lu, Senior Member, IEEE Abstract We present a high-imensional analysis

More information

On the Surprising Behavior of Distance Metrics in High Dimensional Space

On the Surprising Behavior of Distance Metrics in High Dimensional Space On the Surprising Behavior of Distance Metrics in High Dimensional Space Charu C. Aggarwal, Alexaner Hinneburg 2, an Daniel A. Keim 2 IBM T. J. Watson Research Center Yortown Heights, NY 0598, USA. charu@watson.ibm.com

More information

Diophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations

Diophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations Diophantine Approximations: Examining the Farey Process an its Metho on Proucing Best Approximations Kelly Bowen Introuction When a person hears the phrase irrational number, one oes not think of anything

More information

Integration Review. May 11, 2013

Integration Review. May 11, 2013 Integration Review May 11, 2013 Goals: Review the funamental theorem of calculus. Review u-substitution. Review integration by parts. Do lots of integration eamples. 1 Funamental Theorem of Calculus In

More information

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Ashish Goel Michael Kapralov Sanjeev Khanna Abstract We consier the well-stuie problem of fining a perfect matching in -regular bipartite

More information

arxiv: v3 [cs.lg] 3 Dec 2017

arxiv: v3 [cs.lg] 3 Dec 2017 Context-Aware Generative Aversarial Privacy Chong Huang, Peter Kairouz, Xiao Chen, Lalitha Sankar, an Ram Rajagopal arxiv:1710.09549v3 [cs.lg] 3 Dec 2017 Abstract Preserving the utility of publishe atasets

More information

A New Minimum Description Length

A New Minimum Description Length A New Minimum Description Length Soosan Beheshti, Munther A. Dahleh Laboratory for Information an Decision Systems Massachusetts Institute of Technology soosan@mit.eu,ahleh@lis.mit.eu Abstract The minimum

More information

2886 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 61, NO. 5, MAY 2015

2886 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 61, NO. 5, MAY 2015 886 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 61, NO 5, MAY 015 Simultaneously Structure Moels With Application to Sparse an Low-Rank Matrices Samet Oymak, Stuent Member, IEEE, Amin Jalali, Stuent Member,

More information

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes Leaving Ranomness to Nature: -Dimensional Prouct Coes through the lens of Generalize-LDPC coes Tavor Baharav, Kannan Ramchanran Dept. of Electrical Engineering an Computer Sciences, U.C. Berkeley {tavorb,

More information

A Second Time Dimension, Hidden in Plain Sight

A Second Time Dimension, Hidden in Plain Sight A Secon Time Dimension, Hien in Plain Sight Brett A Collins. In this paper I postulate the existence of a secon time imension, making five imensions, three space imensions an two time imensions. I will

More information

arxiv: v2 [cs.ds] 11 May 2016

arxiv: v2 [cs.ds] 11 May 2016 Optimizing Star-Convex Functions Jasper C.H. Lee Paul Valiant arxiv:5.04466v2 [cs.ds] May 206 Department of Computer Science Brown University {jasperchlee,paul_valiant}@brown.eu May 3, 206 Abstract We

More information

Analytic Scaling Formulas for Crossed Laser Acceleration in Vacuum

Analytic Scaling Formulas for Crossed Laser Acceleration in Vacuum October 6, 4 ARDB Note Analytic Scaling Formulas for Crosse Laser Acceleration in Vacuum Robert J. Noble Stanfor Linear Accelerator Center, Stanfor University 575 San Hill Roa, Menlo Park, California 945

More information

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs Lectures - Week 10 Introuction to Orinary Differential Equations (ODES) First Orer Linear ODEs When stuying ODEs we are consiering functions of one inepenent variable, e.g., f(x), where x is the inepenent

More information

TOEPLITZ AND POSITIVE SEMIDEFINITE COMPLETION PROBLEM FOR CYCLE GRAPH

TOEPLITZ AND POSITIVE SEMIDEFINITE COMPLETION PROBLEM FOR CYCLE GRAPH English NUMERICAL MATHEMATICS Vol14, No1 Series A Journal of Chinese Universities Feb 2005 TOEPLITZ AND POSITIVE SEMIDEFINITE COMPLETION PROBLEM FOR CYCLE GRAPH He Ming( Λ) Michael K Ng(Ξ ) Abstract We

More information

Diagonalization of Matrices Dr. E. Jacobs

Diagonalization of Matrices Dr. E. Jacobs Diagonalization of Matrices Dr. E. Jacobs One of the very interesting lessons in this course is how certain algebraic techniques can be use to solve ifferential equations. The purpose of these notes is

More information

Local Linear ICA for Mutual Information Estimation in Feature Selection

Local Linear ICA for Mutual Information Estimation in Feature Selection Local Linear ICA for Mutual Information Estimation in Feature Selection Tian Lan, Deniz Erogmus Department of Biomeical Engineering, OGI, Oregon Health & Science University, Portlan, Oregon, USA E-mail:

More information

All s Well That Ends Well: Supplementary Proofs

All s Well That Ends Well: Supplementary Proofs All s Well That Ens Well: Guarantee Resolution of Simultaneous Rigi Boy Impact 1:1 All s Well That Ens Well: Supplementary Proofs This ocument complements the paper All s Well That Ens Well: Guarantee

More information

Scalable Subspace Clustering

Scalable Subspace Clustering Scalable Subspace Clustering René Vidal Center for Imaging Science, Laboratory for Computational Sensing and Robotics, Institute for Computational Medicine, Department of Biomedical Engineering, Johns

More information

Expected Value of Partial Perfect Information

Expected Value of Partial Perfect Information Expecte Value of Partial Perfect Information Mike Giles 1, Takashi Goa 2, Howar Thom 3 Wei Fang 1, Zhenru Wang 1 1 Mathematical Institute, University of Oxfor 2 School of Engineering, University of Tokyo

More information

Database-friendly Random Projections

Database-friendly Random Projections Database-frienly Ranom Projections Dimitris Achlioptas Microsoft ABSTRACT A classic result of Johnson an Linenstrauss asserts that any set of n points in -imensional Eucliean space can be embee into k-imensional

More information

Equilibrium in Queues Under Unknown Service Times and Service Value

Equilibrium in Queues Under Unknown Service Times and Service Value University of Pennsylvania ScholarlyCommons Finance Papers Wharton Faculty Research 1-2014 Equilibrium in Queues Uner Unknown Service Times an Service Value Laurens Debo Senthil K. Veeraraghavan University

More information

Relative Entropy and Score Function: New Information Estimation Relationships through Arbitrary Additive Perturbation

Relative Entropy and Score Function: New Information Estimation Relationships through Arbitrary Additive Perturbation Relative Entropy an Score Function: New Information Estimation Relationships through Arbitrary Aitive Perturbation Dongning Guo Department of Electrical Engineering & Computer Science Northwestern University

More information

On Characterizing the Delay-Performance of Wireless Scheduling Algorithms

On Characterizing the Delay-Performance of Wireless Scheduling Algorithms On Characterizing the Delay-Performance of Wireless Scheuling Algorithms Xiaojun Lin Center for Wireless Systems an Applications School of Electrical an Computer Engineering, Purue University West Lafayette,

More information

Optimization of Geometries by Energy Minimization

Optimization of Geometries by Energy Minimization Optimization of Geometries by Energy Minimization by Tracy P. Hamilton Department of Chemistry University of Alabama at Birmingham Birmingham, AL 3594-140 hamilton@uab.eu Copyright Tracy P. Hamilton, 1997.

More information

A Sketch of Menshikov s Theorem

A Sketch of Menshikov s Theorem A Sketch of Menshikov s Theorem Thomas Bao March 14, 2010 Abstract Let Λ be an infinite, locally finite oriente multi-graph with C Λ finite an strongly connecte, an let p

More information

arxiv: v1 [cs.it] 21 Aug 2017

arxiv: v1 [cs.it] 21 Aug 2017 Performance Gains of Optimal Antenna Deployment for Massive MIMO ystems Erem Koyuncu Department of Electrical an Computer Engineering, University of Illinois at Chicago arxiv:708.06400v [cs.it] 2 Aug 207

More information

Introduction to the Vlasov-Poisson system

Introduction to the Vlasov-Poisson system Introuction to the Vlasov-Poisson system Simone Calogero 1 The Vlasov equation Consier a particle with mass m > 0. Let x(t) R 3 enote the position of the particle at time t R an v(t) = ẋ(t) = x(t)/t its

More information

Transmission Line Matrix (TLM) network analogues of reversible trapping processes Part B: scaling and consistency

Transmission Line Matrix (TLM) network analogues of reversible trapping processes Part B: scaling and consistency Transmission Line Matrix (TLM network analogues of reversible trapping processes Part B: scaling an consistency Donar e Cogan * ANC Eucation, 308-310.A. De Mel Mawatha, Colombo 3, Sri Lanka * onarecogan@gmail.com

More information

A note on asymptotic formulae for one-dimensional network flow problems Carlos F. Daganzo and Karen R. Smilowitz

A note on asymptotic formulae for one-dimensional network flow problems Carlos F. Daganzo and Karen R. Smilowitz A note on asymptotic formulae for one-imensional network flow problems Carlos F. Daganzo an Karen R. Smilowitz (to appear in Annals of Operations Research) Abstract This note evelops asymptotic formulae

More information