Non-negative Laplacian Embedding

2009 Ninth IEEE International Conference on Data Mining

Non-negative Laplacian Embedding

Dijun Luo, Chris Ding, Heng Huang
Computer Science and Engineering Department, University of Texas at Arlington, Arlington, Texas 76019

Tao Li
School of Computer Science, Florida International University, Miami, FL 33199

Abstract

Laplacian embedding provides a low-dimensional representation for a matrix of pairwise similarity data using the eigenvectors of the Laplacian matrix. The true power of Laplacian embedding is that it provides an approximation of the clustering. However, clustering requires the solution to be nonnegative. In this paper, we propose a new approach, nonnegative Laplacian embedding, which approximates clustering in a more direct way than traditional approaches. From the solution of our approach, clustering structures can be read off directly. We also propose an efficient algorithm to optimize the objective function used in our approach. Empirical studies on many real-world datasets show that our approach leads to more accurate solutions and improves clustering accuracy at the same time.

Keywords: Laplacian Embedding; Non-negative Matrix Factorization.

I. INTRODUCTION

In many real-world tasks in data mining, information retrieval, and machine learning, data are represented in a high-dimensional space, although they may intrinsically lie in a very low-dimensional one. In addition, many data come in as a matrix of pairwise similarities, such as network data and protein interaction data. Meanwhile, unlabeled data are much easier to obtain than labeled data. Thus, it is challenging and useful to develop unsupervised approaches to embed high-dimensional data into a low-dimensional space.

From the data embedding point of view, there are two categories of embedding approaches. Approaches in the first category embed data into a linear space with a linear transformation, such as principal component analysis (PCA). These approaches produce robust representations of data in a low dimension; however, they do not properly embed data that lie on a non-linear manifold. Approaches in the second category embed data in a nonlinear manner. They include IsoMAP [5], Local Linear Embedding (LLE) [], Local Tangent Space Alignment [7], etc. These embeddings have different purposes and objectives, but they can detect the nonlinear manifold on which the data lie.

The above approaches all assume that data points are represented by feature vectors (attributes). In this paper, our emphasis is on graph embedding, i.e., the relationships among data points are represented by a matrix of pairwise similarities (which are viewed as edge weights of the graph). Laplacian embedding is one of the most popular graph embedding methods. Laplacian embedding and the related usage of eigenvectors of the graph Laplace matrix were first developed in the 1970s. It was called quadratic placement [6] of graph nodes in a space. The eigenvectors of the graph Laplace matrix are used for graph partitioning and connectivity analysis [4]. This approach became popular in the 1990s for circuit layout in the VLSI community (see a review []) and for graph partitioning [] for domain decomposition, a key problem in distributed-memory computing. Laplacian embedding is now very popularly used [3], mainly due to its relation to graph clustering [5], [5], [4], [9].
In fact, the eigenvectors of the Laplace matrix provide an approximate solution of the clustering [5], and the generalized eigenvectors of the Laplace matrix provide approximate solutions of Normalized Cut clustering [4] and min-max cut clustering [9].

A Difficulty with Eigenvector Embedding

A main difficulty of using eigenvectors of the Laplace matrix to solve the multi-way clustering problem is that the eigenvectors have mixed-sign entries, while the cluster indicator vectors (that these eigenvectors approximate) are nonnegative. For two-way clustering this is not a problem, because a linear Ψ-transformation [7] of the second eigenvector (the Fiedler vector) and the first eigenvector leads to two genuine indicator vectors (vectors with positive and/or zero entries, where each row has only one nonzero entry). Because of this main difficulty, most applications resort to a two-step procedure []: (1) embed the graph into the eigenvector space (Laplacian embedding), and (2) cluster these embedded points using K-means clustering. This procedure provides an approximate solution to the clustering problem.

Nonnegative Embedding Provides a Solution

In this paper, we propose a new approach. We propose to perform the Laplacian embedding with nonnegative vectors, which can be directly interpreted as cluster membership indicator vectors. As a consequence, the nonnegative embedding also provides a more accurate solution to the clustering problem, because the solution indicators more closely resemble the desired cluster indicators.

We call this new approach the nonnegative Laplacian Embedding (NLE). NLE has the following properties. First, it optimizes the Ratio Cut function while enforcing the nonnegativity requirement rigorously. With the nonnegative representation of the cluster indicator, the embedding results can be interpreted as posterior clustering probabilities. As a result, the cluster membership can be read off from the embedding coordinates immediately (see Section IV). Second, our NLE method has soft clustering capability (see Section IX-A), where a data point can be fractionally assigned to several clusters. This capability is especially important for many real-life data that come with much noise. For such data, not every data point clearly and uniquely belongs to one cluster (pattern). This soft clustering capability is lacking in standard spectral clustering and K-means clustering.

Our approach requires a solver that optimizes a quadratic function under both orthonormality and nonnegativity constraints. The feasible domain of such an optimization problem is highly non-linear and non-convex. In this paper, we also propose an efficient algorithm to address this problem.

In the remainder of the paper, we first transform the minimization problems of the embedding (Sections II and III) into the maximization of a well-behaved positive definite function (Section IV). To generalize the problem definition, in Section V we prove that a similarity matrix (graph matrix) with mixed signs can also be used for Laplacian embedding. After that, we present the NLE algorithm (Section VI) and rigorously prove the correctness and convergence of our algorithm (Section VII) using the theory of constrained optimization. We illustrate the NLE algorithm and its capability using an example of face images in Section IX. In Section X, we perform extensive experiments on five UCI datasets [] and the AT&T face image dataset [3] to compare our NLE algorithm to the standard spectral approach. We show that our NLE algorithm consistently gives better objective function values for Laplacian embedding and the clustering objective. Meanwhile, our NLE method also improves clustering accuracy over the standard spectral approach.

Brief Summary of Major Clustering Frameworks

In essence, our line of clustering framework is to show that the clustering objective can be written as the optimization of a quadratic function with nonnegativity constraints and orthogonality constraints. If we retain orthogonality while ignoring the nonnegativity, the solution is the standard Laplacian embedding using eigenvectors; this has been the way spectral clustering has been developed so far. However, if we retain nonnegativity rigorously and enforce the orthogonality approximately, the solution is the NLE proposed in this paper. We note that this clustering framework is similar to the K-means clustering / PCA / Nonnegative Matrix Factorization (NMF) [7], [8] framework [8]. It has been shown [6], [7], [8] that the K-means clustering objective can be written as the maximization of a quadratic function with nonnegativity and orthogonality constraints; if we retain orthogonality while ignoring the nonnegativity, the solution is PCA [6], [7]. However, if we retain nonnegativity rigorously and enforce orthogonality approximately, the solution is NMF [7]. Several further developments using NMF for clustering are convex NMF [], orthogonal NMF [3], and the equivalence between NMF and probabilistic latent semantic indexing []. For recent surveys of NMF see [4], [9].
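To make the mixed-sign difficulty discussed above concrete before the formal development, here is a tiny numerical illustration (my own sketch, not from the paper): for a toy two-cluster graph, the Fiedler vector of the graph Laplacian has mixed signs, whereas the cluster indicator vectors it approximates are nonnegative.

import numpy as np

# Toy graph: nodes {0,1} form one cluster, nodes {2,3} the other,
# with strong within-cluster and weak between-cluster similarity.
W = np.array([[0.0, 1.0, 0.1, 0.1],
              [1.0, 0.0, 0.1, 0.1],
              [0.1, 0.1, 0.0, 1.0],
              [0.1, 0.1, 1.0, 0.0]])
D = np.diag(W.sum(axis=1))
L = D - W                                   # graph Laplacian

vals, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
print(vecs[:, 1])                           # Fiedler vector: mixed signs, e.g. [-.5 -.5 .5 .5]

# Normalized cluster indicators (what clustering needs): nonnegative,
# one nonzero entry per row.
H = np.column_stack([np.array([1, 1, 0, 0]) / np.sqrt(2),
                     np.array([0, 0, 1, 1]) / np.sqrt(2)])
print(H)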
II. LAPLACIAN EMBEDDING

We start with a brief introduction to Laplacian embedding. The input data is a matrix W of pairwise similarities among n objects. We view W as the edge weights on a graph with n nodes. The task is to embed the nodes of the graph into 1-D space with coordinates (x_1, ..., x_n). The objective is that if i, j are similar (i.e., w_{ij} is large), they should be adjacent in the embedded space, i.e., (x_i - x_j)^2 should be small. This can be achieved by minimizing [6]

  min_x J(x) = \sum_{ij} (x_i - x_j)^2 w_{ij} = 2 \sum_{ij} x_i (D - W)_{ij} x_j = 2 x^T (D - W) x,   (1)

where D = diag(d_1, ..., d_n) and d_i = \sum_j W_{ij}. The minimization of \sum_{ij} (x_i - x_j)^2 w_{ij} would give x_i = 0 if there were no constraint on the magnitude of the vector x. Therefore, we impose the normalization \sum_i x_i^2 = 1. The original objective function is invariant if we replace x_i by x_i + constant, so the solution is not unique. To fix this uncertainty, we can adjust the constant such that \sum_i x_i = 0 (x_i is centered around 0). Thus the x_i have mixed signs. With these two constraints,

  \sum_i x_i = 0,   \sum_i x_i^2 = 1,

the solution minimizing the embedding objective is given by the eigenvectors of

  (D - W) f = \lambda f.   (2)

The matrix L = D - W is called the graph Laplacian. This is because L is a discrete form of the Laplace operator

  \nabla^2 f(x, y, z) = ( \partial^2/\partial x^2 + \partial^2/\partial y^2 + \partial^2/\partial z^2 ) f(x, y, z).

In mathematical physics, a partial differential operator is not defined unless the boundary conditions are specified; different boundary conditions lead to different solutions. The graph Laplacian here is the discretized form of the Laplacian operator with the von Neumann boundary condition, i.e., the derivatives along the boundary are zero. (The discretized form of the Laplacian operator with the Dirichlet boundary condition has a slightly different form.)

Because of the von Neumann boundary condition, the solution is invariant w.r.t. an additive constant. As a consequence, the solution contains the constant eigenvector, the first eigenvector with eigenvalue zero (see [] for details).

Multi-dimensional Embedding

This embedding can be generalized to embedding in k-dimensional space, with coordinates r_i ∈ R^k. Let ||r_i - r_j|| be the Euclidean distance between nodes i, j. The embedding is obtained by optimizing

  min_R J(R) = \sum_{i,j=1}^n ||r_i - r_j||^2 w_{ij} = 2 \sum_{i,j=1}^n r_i^T (D - W)_{ij} r_j = 2 Tr[ R (D - W) R^T ],   (3)

where R = (r_1, ..., r_n). In order to prevent the degenerate solution R = 0, we impose the normalization constraint R R^T = I. To fix the uncertainty due to the shift invariance, we further impose the constraint \sum_i r_i = 0 (r_i is centered around 0). The solution is given by eigenvectors: R = (f_1, ..., f_k)^T. This is called spectral Laplacian embedding (spectral means using eigenvectors). Let Q = R^T ∈ R^{n×k}; the spectral Laplacian embedding can then be formally cast as an optimization problem:

  min_Q Tr[ Q^T (D - W) Q ],   s.t. Q^T Q = I.   (4)

III. RATIO CUT SPECTRAL CLUSTERING

The true power of Laplacian embedding is its clustering capability. Here we briefly outline the often neglected, but fundamentally important, relationship between spectral clustering [5] and Laplacian embedding. In fact, these two things are identical!

In clustering/partitioning a graph, the most popular objective is min-cut, which cuts the graph G into A, B such that the cross-cut similarity (weight) s(A, B) = \sum_{i ∈ A} \sum_{j ∈ B} w_{ij} is minimized. Without size balancing, the min-cut will often cut out a very small subgraph, leading to two highly unbalanced subgraphs. The first solution to this problem was developed in the circuit placement field by Cheng and Wei [6], who proposed to minimize the following ratio cut objective function:

  min_{A,B} s(A, B) / (|A| |B|) = (1/|G|) [ s(A, B)/|A| + s(A, B)/|B| ].

Note |G| is a constant and drops out. Hagen and Kahng [5] later showed that the Fiedler vector (2nd eigenvector of the graph Laplacian) provides an effective solution. Chan et al. [5] generalized this two-way clustering to multi-way Ratio Cut clustering: divide the nodes of G into K disjoint clusters {C_p} by minimizing the objective function

  J_rc = \sum_{1 ≤ p < q ≤ K} [ s(C_p, C_q)/|C_p| + s(C_p, C_q)/|C_q| ],   (5)

where s(C_k, C_l) = \sum_{i ∈ C_k} \sum_{j ∈ C_l} w_{ij} and d_i = \sum_j w_{ij}. Let h_k ∈ {0, 1}^n be an indicator vector for cluster C_k, i.e., h_k(i) = 1 if x_i belongs to cluster C_k and h_k(i) = 0 otherwise. They show that

Theorem 1: The objective can be written as

  J_rc = \sum_{1 ≤ p < q ≤ K} [ s(C_p, C_q)/|C_p| + s(C_p, C_q)/|C_q| ] = \sum_{l=1}^K h_l^T (D - W) h_l / (h_l^T h_l) = Tr( H^T (D - W) H ),   (6)

where H = (h_1/||h_1||, ..., h_K/||h_K||). The ratio cut problem becomes

  min_H Tr[ H^T (D - W) H ],   s.t. H^T H = I.   (7)

Chan et al. also discussed the embedding of this function, which is identical to the Laplacian embedding of Eq. (4) with the same orthogonality constraints. Shi and Malik [4] further developed this into normalized cut clustering. Ding et al. [9] further developed this into min-max cut clustering.

A simple and widely adopted algorithm for solving spectral clustering has two steps: (1) compute the eigenvectors of L = D - W for the Laplacian embedding; (2) do K-means clustering in the eigenspace to obtain the clusters. The second step is necessary because the eigenvector solution Q has mixed signs and the clusters cannot be identified directly. This is a generic difficulty of multi-way spectral clustering.
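For reference, the following is a minimal sketch of this two-step procedure (my own illustration, assuming a dense symmetric similarity matrix W and that scikit-learn is available; it is not code from the paper): Laplacian embedding by the K smallest eigenvectors of L = D - W, followed by K-means in the embedded space.

import numpy as np
from sklearn.cluster import KMeans

def spectral_ratio_cut_clustering(W, K, random_state=0):
    """Two-step spectral clustering: Laplacian embedding, then K-means."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    vals, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    Q = vecs[:, :K]                         # embedding by the K smallest eigenvectors (mixed signs)
    labels = KMeans(n_clusters=K, n_init=10,
                    random_state=random_state).fit_predict(Q)
    return Q, labels

The K-means step is exactly the post-processing that NLE removes: because the eigenvector embedding Q has mixed signs, cluster labels cannot be read off from it directly.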
IV. NONNEGATIVE LAPLACIAN EMBEDDING

In all previous work on spectral clustering, the nonnegativity of the cluster indicator H is ignored. On the other hand, a nonnegative solution obtained by enforcing the constraint H ≥ 0 has two direct benefits: (1) we can obtain cluster assignments directly; (2) we obtain a more accurate solution because the nonnegative solution resembles the desired cluster indicators. In this paper, we propose the Nonnegative Laplacian Embedding (NLE) approach. In NLE, we rigorously enforce the nonnegativity constraint. The most important benefit of nonnegative embedding is that the cluster membership can be read off from the solution Q immediately: x_i belongs to the cluster C_k, where k corresponds to the largest component in the i-th row of Q,

  k = \arg\max_{1 ≤ j ≤ K} Q_{ij}.   (8)

In fact, we may view the i-th row of Q as the posterior probability that object i belongs to the different clusters.
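A minimal sketch of this read-off (my own illustration, assuming Q is the nonnegative n x K embedding produced by NLE): hard assignments come from the row-wise argmax of Eq. (8), and row-normalizing Q gives the posterior-like cluster weights mentioned above.

import numpy as np

def read_off_clusters(Q):
    """Hard cluster assignments from a nonnegative embedding Q (n x K), as in Eq. (8)."""
    return np.argmax(Q, axis=1)

def posterior_like_weights(Q, eps=1e-12):
    """Row-normalize Q so each row sums to 1, interpretable as soft cluster memberships."""
    return Q / (Q.sum(axis=1, keepdims=True) + eps)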

Formally, the optimization of Eq. (4) is identical to

  max_Q Tr[ Q^T (W - D + \sigma I) Q ],   s.t. Q^T Q = I,  Q ≥ 0,   (9)

because the \sigma term, Tr(Q^T \sigma I Q) = \sigma Tr(Q^T Q), is a constant under the constraint Q^T Q = I. We set \sigma = \lambda_max, the largest eigenvalue of L = D - W. Then W - D + \sigma I is positive definite, because W - D + \sigma I = \sum_{k=1}^n (\sigma - \lambda_k) v_k v_k^T, where (\lambda_k, v_k) are the eigenvalue/eigenvector pairs of L. This two-step transformation (changing min to max and making the objective positive definite) turns the optimization into a well-behaved problem. The algorithm to solve Eq. (9) is provided in Section VI.

V. LAPLACIAN EMBEDDING WITH A MIXED-SIGN SIMILARITY MATRIX

In traditional Laplacian embedding, the graph matrix (i.e., the similarity matrix) is required to be non-negative. Here we show that a similarity matrix with mixed signs can also be used for Laplacian embedding, as well as NLE. Let W^+ and W^- be the positive and negative parts of W, respectively: W = W^+ - W^-. For the positive part, we want to minimize the embedding distance so that similar instances are embedded close to each other,

  min_x \sum_{i,j} w^+_{ij} (x_i - x_j)^2.

But for the negative part, we maximize the embedding distance so that dissimilar instances are embedded far apart,

  max_x \sum_{i,j} w^-_{ij} (x_i - x_j)^2.

We can combine them by minimizing the difference,

  min_x \sum_{i,j} (w^+_{ij} - w^-_{ij}) (x_i - x_j)^2 = \sum_{i,j} w_{ij} (x_i - x_j)^2.

Here we show that the similarity matrix can be shifted by any constant.

Theorem 2: If q is a non-trivial eigenvector of the graph Laplacian on similarity W, then q is also an eigenvector of the graph Laplacian on similarity W + \sigma E, where \sigma is any constant and E is the all-ones matrix of proper size.

Obviously e (a single column with all ones) is an eigenvector of any graph Laplacian, with corresponding eigenvalue 0. Here, by non-trivial eigenvector, we mean those eigenvectors which are not e.

Proof. Since q is a non-trivial eigenvector of the graph Laplacian on similarity W, (D - W) q = \lambda q. If the similarity matrix is shifted by a constant, W' = W + \sigma E, then the corresponding graph Laplacian becomes L' = D' - W' = (D + n\sigma I) - (W + \sigma E). Notice that all non-trivial eigenvectors are orthogonal to the trivial eigenvector e, so E q = 0 and

  L' q = [ (D + n\sigma I) - (W + \sigma E) ] q = (D - W) q + n\sigma q - \sigma E q = (\lambda + n\sigma) q,   (10)

which indicates that q is also an eigenvector of L'. Theorem 2 suggests that for any mixed-sign similarity matrix, we can add a constant such that the similarity matrix becomes nonnegative, without changing the eigenvectors (i.e., the embedding results remain the same).

VI. SOLVING NLE PROBLEMS

Inspired by NMF algorithms, we solve the NLE problem of Eq. (9) using similar techniques; see Section VIII for a discussion of the relationship with NMF.

A. NLE algorithm

The algorithm starts with an initial guess Q (Q ≥ 0). It then iteratively updates Q until convergence using the updating rule

  Q_{ik} <- Q_{ik} sqrt( [ (W + \sigma I) Q + Q \Lambda^- ]_{ik} / [ D Q + Q \Lambda^+ ]_{ik} ),   (11)

where

  \Lambda = Q^T (W + \sigma I - D) Q,   (12)

and \Lambda^+ is the positive part of \Lambda, and similarly for \Lambda^-. (A small code sketch of this update is given at the end of this section.) Notice that the feasible domain of Eq. (9) is non-convex, indicating that our algorithm can only reach local solutions. However, we show in the empirical study, with a statistical analysis over a large number of random trials, that our algorithm yields a much better Ratio Cut objective than standard spectral clustering.

B. Computational complexity analysis

In the typical implementation of the NLE algorithm, the computational complexity is O(n^2 K) per iteration [the bottleneck is the computation of \Lambda in Eq. (12)], which is not suitable for large-scale problems. However, one can easily incorporate approximate decompositions, such as the Nyström decomposition, to reduce the cost to O(nK^2).
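As referenced in Section VI-A, here is a small sketch of the NLE solver (my own illustration based on Eqs. (9), (11), and (12); the variable names, the convergence test, and the tolerance are my own choices, not from the paper). It assumes a dense symmetric nonnegative similarity matrix W and a nonnegative starting matrix Q0, e.g. a perturbed spectral clustering indicator as used in the experiments.

import numpy as np

def nle(W, Q0, max_iter=500, tol=1e-6, eps=1e-12):
    """Nonnegative Laplacian Embedding via the multiplicative update of Eq. (11)."""
    D = np.diag(W.sum(axis=1))
    sigma = np.linalg.eigvalsh(D - W)[-1]      # largest eigenvalue of L = D - W
    A = W + sigma * np.eye(W.shape[0])         # W + sigma*I, so that A - D is positive (semi)definite
    Q = Q0.astype(float).copy()
    prev_obj = -np.inf
    for _ in range(max_iter):
        Lam = Q.T @ (A - D) @ Q                # Eq. (12): Lambda = Q^T (W + sigma*I - D) Q
        Lam_pos = np.maximum(Lam, 0.0)         # positive part Lambda^+
        Lam_neg = np.maximum(-Lam, 0.0)        # negative part Lambda^-
        numer = A @ Q + Q @ Lam_neg
        denom = D @ Q + Q @ Lam_pos + eps
        Q = Q * np.sqrt(numer / denom)         # multiplicative update, Eq. (11)
        obj = np.trace(Q.T @ (A - D) @ Q)      # objective of Eq. (9); should not decrease
        if obj - prev_obj < tol * max(abs(obj), 1.0):
            break
        prev_obj = obj
    return Q

# Cluster labels can then be read off as in Eq. (8): labels = Q.argmax(axis=1).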
VII. ANALYSIS OF NLE ALGORITHM

In this section, we show the correctness and convergence of our algorithm. By correctness, we mean that the update yields a correct solution at convergence. The correctness of our algorithm is assured by the following theorem.

Theorem 3: Fixed points of Eq. (11) satisfy the KKT condition of the optimization problem of Eq. (7).

Proof. We begin with the Lagrangian

  L = Tr[ Q^T (W + \sigma I - D) Q - \Lambda (Q^T Q - I) - \Sigma Q ],   (13)

where the Lagrange multiplier \Lambda enforces the orthogonality condition Q^T Q = I and the Lagrange multiplier \Sigma enforces the nonnegativity of Q.

The KKT complementary slackness condition (\partial L / \partial Q_{ik}) Q_{ik} = 0 becomes

  [ (W + \sigma I - D) Q - Q \Lambda ]_{ik} Q_{ik} = 0.   (14)

Clearly, a fixed point of the update rule Eq. (11) satisfies [ (W + \sigma I - D) Q - Q \Lambda ]_{ik} Q_{ik} = 0, which is mathematically identical to Eq. (14). From Eq. (14), summing over i, we obtain \Lambda_{kk} = [ Q^T (W + \sigma I - D) Q ]_{kk}. To find the off-diagonal elements of \Lambda, we ignore the nonnegativity requirement and set \partial L / \partial Q = 0, which leads to \Lambda_{kl} = [ Q^T (W + \sigma I - D) Q ]_{kl}. Combining these two results gives \Lambda = Q^T (W + \sigma I - D) Q as in Eq. (12), so a fixed point of the update satisfies the KKT conditions of Eq. (9).

The convergence of our algorithm is assured by the following theorem.

Theorem 4: Under the update rule of Eq. (11), the Lagrangian function

  L = Tr[ Q^T (W + \sigma I - D) Q - \Lambda (Q^T Q - I) ],   (15)

increases monotonically.

Proof of Theorem 4. We use the auxiliary function approach [8]. An auxiliary function Z(H, \tilde{H}) of a function L(H) satisfies Z(H, H) = L(H) and Z(H, \tilde{H}) ≤ L(H). We define

  H^{(t+1)} = \arg\max_H Z(H, H^{(t)}).   (16)

Then by construction we have

  L(H^{(t)}) = Z(H^{(t)}, H^{(t)}) ≤ Z(H^{(t+1)}, H^{(t)}) ≤ L(H^{(t+1)}).   (17)

This proves that L(H^{(t)}) is monotonically increasing. The key steps in the remainder of the proof are: (1) find an appropriate auxiliary function; (2) find the global maximum of the auxiliary function.

Writing H for Q and dropping the constant Tr \Lambda, we write Eq. (15) as L = Tr[ H^T (W + \sigma I) H + \Lambda^- H^T H - H^T D H - \Lambda^+ H^T H ]. We can show that one auxiliary function of L is

  Z(H, \tilde{H}) = \sum_{ijk} (W + \sigma I)_{ij} \tilde{H}_{ik} \tilde{H}_{jk} ( 1 + \log [ H_{ik} H_{jk} / ( \tilde{H}_{ik} \tilde{H}_{jk} ) ] )
                  + \sum_{ikl} (\Lambda^-)_{kl} \tilde{H}_{ik} \tilde{H}_{il} ( 1 + \log [ H_{ik} H_{il} / ( \tilde{H}_{ik} \tilde{H}_{il} ) ] )
                  - \sum_{ik} (D \tilde{H})_{ik} H_{ik}^2 / \tilde{H}_{ik}
                  - \sum_{ik} (\tilde{H} \Lambda^+)_{ik} H_{ik}^2 / \tilde{H}_{ik},   (18)

using the inequality z ≥ 1 + \log z, with z = H_{ik} H_{jk} / ( \tilde{H}_{ik} \tilde{H}_{jk} ), and the generic inequality

  \sum_{i=1}^n \sum_{p=1}^k (A \tilde{S} B)_{ip} S_{ip}^2 / \tilde{S}_{ip} ≥ Tr( S^T A S B ),   (19)

where A, B, S, \tilde{S} > 0, A = A^T, B = B^T.

We now find the global maximum of Z(H, \tilde{H}) as a function of H. The gradient is

  \partial Z(H, \tilde{H}) / \partial H_{ik} = 2 [ (W + \sigma I) \tilde{H} ]_{ik} \tilde{H}_{ik} / H_{ik} + 2 ( \tilde{H} \Lambda^- )_{ik} \tilde{H}_{ik} / H_{ik} - 2 ( D \tilde{H} )_{ik} H_{ik} / \tilde{H}_{ik} - 2 ( \tilde{H} \Lambda^+ )_{ik} H_{ik} / \tilde{H}_{ik}.   (20)

The second derivative,

  \partial^2 Z(H, \tilde{H}) / ( \partial H_{ik} \partial H_{jl} ) = - \tilde{W}_{ik} \delta_{ij} \delta_{kl},   (21)

with

  \tilde{W}_{ik} = 2 ( [ (W + \sigma I) \tilde{H} ]_{ik} + ( \tilde{H} \Lambda^- )_{ik} ) \tilde{H}_{ik} / H_{ik}^2 + 2 ( ( D \tilde{H} )_{ik} + ( \tilde{H} \Lambda^+ )_{ik} ) / \tilde{H}_{ik},

is negative definite. Thus Z(H) is a concave function in H and has a unique global maximum. This maximum is obtained by setting the first derivative to zero, yielding

  H_{ik}^2 = \tilde{H}_{ik}^2 [ ( (W + \sigma I) \tilde{H} )_{ik} + ( \tilde{H} \Lambda^- )_{ik} ] / [ ( D \tilde{H} )_{ik} + ( \tilde{H} \Lambda^+ )_{ik} ].   (22)

According to Eq. (16), with H^{(t+1)} = H and H^{(t)} = \tilde{H}, we see that Eq. (22) is the update rule of Eq. (11). Thus Eq. (17) always holds.

VIII. RELATIONSHIP WITH NMF

Nonnegative Laplacian Embedding is inspired by the idea of NMF. Here we show that these two methods are connected.

Theorem 5: Eq. (9) is equivalent to the following:

  min_Q || (W - D + \sigma I) - Q Q^T ||^2,   s.t. Q^T Q = I,  Q ≥ 0.   (23)

Proof.

  || (W - D + \sigma I) - Q Q^T ||^2 = || W - D + \sigma I ||^2 - 2 Tr[ (W - D + \sigma I) Q Q^T ] + || Q Q^T ||^2.

Since || W - D + \sigma I ||^2 and || Q Q^T ||^2 are constant (under the constraint Q^T Q = I), Eq. (23) is equivalent to min_Q [ - Tr (W - D + \sigma I) Q Q^T ], or max_Q Tr Q^T (W - D + \sigma I) Q, with the same constraints, which is identical to Eq. (9).
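As a quick numerical sanity check of Theorem 5 (my own sketch, not from the paper), the following verifies that for any orthonormal Q the Frobenius objective of Eq. (23) differs from the trace objective of Eq. (9) only by a constant, so the two problems share the same optima.

import numpy as np

rng = np.random.default_rng(0)
n, K = 8, 3
M = rng.standard_normal((n, n))
M = (M + M.T) / 2                                 # stands in for W - D + sigma*I (symmetric)
Q, _ = np.linalg.qr(rng.standard_normal((n, K)))  # any Q with orthonormal columns, Q^T Q = I

frob = np.linalg.norm(M - Q @ Q.T, 'fro') ** 2    # objective of Eq. (23)
trace = np.trace(Q.T @ M @ Q)                     # objective of Eq. (9)
const = np.linalg.norm(M, 'fro') ** 2 + K         # ||M||_F^2 + ||Q Q^T||_F^2
assert np.isclose(frob, const - 2.0 * trace)      # minimizing (23) <=> maximizing (9)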

Figure 1. Face images selected from the AT&T face database. In the top three rows (one person per row), each person has ten images with different expressions. In the fourth row, the ten images come from ten different people.

IX. ILLUSTRATION EXAMPLE

We illustrate the nonnegative Laplacian embedding using a simple dataset of 30 images from the AT&T face database [3] (see the first three rows of Fig. 1). Each person has 10 images with different expressions. In the standard way, we reshape each image into a single vector to represent the image. For this experiment, since the pixel values of the images are non-negative, we use the inner product (w_{ij} = x_i^T x_j) of two images to compute the similarity; an advantage of inner-product similarity is that there is no adjustable parameter.

We start the NLE algorithm with a random matrix Q, Q ≥ 0. We show the NLE embedding results at the 1st iteration and at three later checkpoints (see Fig. 2). The objective function value is also shown on the y-axis. For each checkpoint, we use a 3D plot to show all 30 images (each image as a point) with the first, second, and third row of Q as the x-, y-, and z-axis. Because we impose both non-negativity and near-orthogonality constraints on Q, all the data points lie near the positive parts of the axes. From Fig. 2, we notice that the clustering structure becomes more and more clear as the objective function value increases.

A. Soft clustering capability of NLE

In traditional spectral clustering, a data point must belong to exactly one of the clusters; this is hard clustering. However, such hard clustering sometimes prevents us from detecting delicate cluster structure details in complex data. For example, in Fig. 1, we may add 10 images from 10 other persons (shown as the bottom row) to the 30 images on the top. Traditional spectral clustering will assign these images to one of the 3 clusters. However, these images do not belong to the three existing clusters. Ideally, the clustering solution would exhibit this fact. We now demonstrate that this fact is revealed in our NLE approach.

Our NLE has the soft clustering capability, i.e., the solution Q can be viewed as the posterior probability of each object being assigned to each cluster. The NLE solution Q = [q_1, q_2, q_3] is shown as a Hinton diagram (see Fig. 3). In the figure, the face image index i is sorted as follows: i = 1–10 for the images shown in the 1st row of Fig. 1, i = 11–20 for the images in the 2nd row, i = 21–30 for the images in the 3rd row, and i = 31–40 for the images in the 4th row. We plot the elements of the solution Q as rectangles whose sizes denote the values of the corresponding elements. We see from Fig. 3 that for the first 30 images, one of the q_k is very pronounced and the other components are negligible: the cluster distribution/assignment is very clear. For the last 10 images, none of them is clearly clustered into any cluster, indicating the soft clustering nature for these images. These images are outliers in this dataset, and our NLE algorithm can correctly detect them.
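A small sketch of this setup (my own illustration; the flattening step, variable names, and the outlier threshold are my own choices, not from the paper): build the parameter-free inner-product similarity from vectorized images, run NLE, and inspect the row-normalized solution, where rows without a pronounced component indicate softly assigned outliers.

import numpy as np

def inner_product_similarity(images):
    """images: array of shape (n, height, width) with nonnegative pixel values."""
    X = images.reshape(len(images), -1).astype(float)   # one row vector per image
    return X @ X.T                                       # w_ij = x_i^T x_j, no parameter to tune

# Usage outline, assuming the `nle` sketch from Section VI and 3 clusters:
#   W = inner_product_similarity(images)
#   Q0 = np.abs(np.random.default_rng(0).standard_normal((W.shape[0], 3)))
#   Q = nle(W, Q0)
#   P = Q / Q.sum(axis=1, keepdims=True)           # posterior-like soft memberships
#   outliers = np.where(P.max(axis=1) < 0.5)[0]    # rows with no pronounced component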

Figure 3. Soft clustering of NLE. q_1(i), q_2(i), q_3(i) are shown as 3 rows of a Hinton diagram over the image index i = 1–40 (x-axis) for the 40 images in Fig. 1, where i = 31–40 correspond to the images in the 4th row of Fig. 1.

Figure 2. NLE results on the top 30 face images of Figure 1 at different iterations. The objective function values of Tr Q^T (W + \sigma I - D) Q are shown on the y-axis. For each checkpoint, we use a 3D plot to show all 30 images (each image as a point) with the first, second, and third row of Q as the x-, y-, and z-axis.

X. EXPERIMENTS ON UCI DATASETS

We evaluate the performance of our NLE algorithm on 4 UCI datasets []: Dermatology, Soybean, Vehicle, and Zoo. In the experiments, our goal is to compare with the standard spectral approach (as explained in the last paragraph of Section III). Therefore, we initialize Q using the clustering solution of standard spectral clustering: H is set to the cluster indicator and Q = H + c, for a small positive constant c, is used as the starting point.

In the evaluation, we use clustering accuracy. Suppose we have N = n_1 + n_2 + ... + n_K data objects (n_k are known/observed to belong to class F_k, etc.). They are clustered into K clusters C_1, ..., C_K, with m_k = |C_k|. This forms a contingency table T = (T_{kl}), where T_{kl} denotes the number of objects from class F_k that have been clustered into cluster C_l. Clearly, \sum_l T_{kl} = n_k and \sum_k T_{kl} = m_l. The clustering accuracy is the percentage of objects correctly clustered: \rho = \sum_k T_{kk} / N. In practice, the matching of F_k to C_l is obtained by running the Hungarian algorithm for the optimal bipartite matching (a short sketch of this computation is given below).

A. Evolution of NLE algorithm

In Figs. 4 and 5, we show the NLE evolution of two typical runs on two UCI datasets (dermatology and zoo). The initial Q is set to the spectral clustering result, as explained above. We observe that the NLE objective function value increases steadily as the iterations proceed. The clustering accuracy also improves with more iterations. These facts indicate that the clustering quality improves as the objective function value increases.

B. Comparison with spectral clustering

Table I. Experimental setup details on the UCI and AT&T datasets (columns: Dataset, #samples, #features, #classes; rows: Dermatology, Glass, Soybean, Vehicle, Zoo, AT&T).

We perform an extensive evaluation of both NLE and spectral clustering on the 5 UCI datasets and the AT&T dataset (see Table I for experimental setup details). We note that the standard spectral clustering results on a dataset are not deterministic, because the results of K-means on the eigenspace (the spectral Laplacian embedding) depend sensitively on the initialization.
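As referenced above, here is a minimal sketch of the accuracy computation (my own illustration, not code from the paper; it assumes integer-coded class and cluster labels and uses SciPy's linear_sum_assignment for the Hungarian matching).

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, cluster_labels):
    """Best-match clustering accuracy from the contingency table, via the Hungarian algorithm."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    classes = np.unique(true_labels)
    clusters = np.unique(cluster_labels)
    # Contingency table T[k, l]: number of objects of class k assigned to cluster l.
    T = np.zeros((len(classes), len(clusters)), dtype=int)
    for k, f in enumerate(classes):
        for l, c in enumerate(clusters):
            T[k, l] = np.sum((true_labels == f) & (cluster_labels == c))
    rows, cols = linear_sum_assignment(-T)     # maximize the total matched count
    return T[rows, cols].sum() / len(true_labels)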

Figure 4. NLE objective function value and clustering accuracy versus the number of iterations on the dermatology dataset. The accuracy starts from the spectral clustering value and improves with more NLE iterations.

Figure 5. NLE objective function value and clustering accuracy versus the number of iterations on the zoo dataset. The accuracy starts from the spectral clustering value and improves with more NLE iterations.

For this reason, we perform 4 runs of K-means clustering on the eigenspace for each dataset. We also perform 4 NLE computations, each of them initialized from the spectral solution. We evaluate the performance as follows. Define Best(N) to be the lowest objective among N random trials, for both approaches (spectral clustering and NLE). Clearly, Best(N) improves (decreases) as we increase N. The results of the experiments for different N are shown in Figure 6. [At smaller N, the results are averaged over multiple N-interval runs.] The objectives are shown on the right of Figures 6 (a)-(f). We compare the clustering accuracy using the same strategy (shown on the left of Figures 6 (a)-(f)). For the objective, the best (minimum) value is subtracted from the original objective. On all 6 datasets, the NLE results are consistently better than spectral clustering on average, in terms of both the Ratio Cut objective and the clustering accuracy.

In Table II, we show the objective function value and the corresponding clustering accuracy, picking the best result of the 4 runs (here, the best means the lowest objective function value, because this is unsupervised learning). For all 4 datasets, NLE consistently gives a lower (better) objective function value and a higher clustering accuracy.

XI. CONCLUSION

In this paper, we propose a Nonnegative Laplacian Embedding (NLE) algorithm and prove the correctness and convergence of the algorithm. NLE gives nonnegative embedding results from which the clustering structures of the data can be read off immediately. A computationally efficient algorithm is developed to solve the proposed NLE problems. Moreover, we prove that a similarity matrix (i.e., graph matrix) with mixed signs can also be used for Laplacian embedding. We demonstrate the cluster assignment advantage and the soft-clustering capability of the NLE algorithm through illustrations on face expression data and extensive experiments on five UCI datasets and one image dataset. Our approach consistently outperforms spectral clustering in terms of both the Ratio Cut objective and clustering accuracy.

Acknowledgment. This work is supported partially by NSF DMS and NSF CCF-8378 at UTA, and NSF DMS and IIS-5468 at FIU.

REFERENCES

[1] C.J. Alpert and A.B. Kahng. Recent directions in netlist partitioning: a survey. Integration, the VLSI Journal, 19:1–81, 1995.
[2] A. Asuncion and D. Newman. UCI machine learning repository, 2007.
[3] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. NIPS, 2001.
[4] M. Berry, M. Browne, A. Langville, P. Pauca, and R. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. To appear in Computational Statistics and Data Analysis, 2006.
[5] P.K. Chan, M. Schlag, and J.Y. Zien. Spectral k-way ratio-cut partitioning and clustering. IEEE Trans. CAD-Integrated Circuits and Systems, 13:1088–1096, 1994.
[6] C.-K. Cheng and Y.A. Wei. An improved two-way partitioning algorithm with stable performance. IEEE Trans. on Computer-Aided Design, 10:1502–1511, 1991.
[7] C. Ding and X. He. K-means clustering and principal component analysis. Int'l Conf. Machine Learning (ICML), 2004.

Table II. Average (Ave) and best Ratio Cut objective function value and clustering accuracy of standard spectral clustering (SpecClus) and NLE over 4 random trials, on the Dermatology, Glass, Soybean, Vehicle, Zoo, and AT&T datasets (columns: Objective and Clustering accuracy, each with SpecClus/NLE sub-columns and Ave/Best entries). For the objective, lower is better; for the clustering accuracy, higher is better.

[8] C. Ding, X. He, and H.D. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. Proc. SIAM Data Mining Conf., 2005.
[9] C. Ding, X. He, H. Zha, M. Gu, and H. Simon. A min-max cut algorithm for graph partitioning and data clustering. Proc. IEEE Int'l Conf. Data Mining (ICDM), pages 107–114, 2001.
[10] C. Ding, T. Li, and W. Peng. Nonnegative matrix factorization and probabilistic latent semantic indexing: Equivalence, chi-square statistic, and a hybrid method. Proc. National Conf. Artificial Intelligence (AAAI), 2006.
[11] Chris Ding, Rong Jin, Tao Li, and Horst D. Simon. A learning framework using Green's function and kernel regularization with application to recommender system. In KDD, 2007.
[12] Chris Ding, Tao Li, and Michael I. Jordan. Convex and semi-nonnegative matrix factorizations. IEEE Trans. Pattern Analysis and Machine Intelligence, 2009.
[13] Chris Ding, Tao Li, Wei Peng, and Haesun Park. Orthogonal nonnegative matrix tri-factorizations for clustering. Proc. Int'l Conf. on Knowledge Discovery and Data Mining (KDD 2006).
[14] M. Fiedler. Algebraic connectivity of graphs. Czech. Math. J., 23:298–305, 1973.
[15] L. Hagen and A.B. Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE Trans. on Computer-Aided Design, 11:1074–1085, 1992.
[16] K.M. Hall. An r-dimensional quadratic placement algorithm. Management Science, 17:219–229, 1970.
[17] D.D. Lee and H.S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.
[18] D.D. Lee and H.S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13. MIT Press, Cambridge, MA, 2001.
[19] Tao Li and Chris Ding. The relationships among various nonnegative matrix factorization methods for clustering. In ICDM, pages 362–371, 2006.
[20] A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Proc. Neural Information Processing Systems (NIPS 2001).
[21] A. Pothen, H.D. Simon, and K.P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis and Applications, 11:430–452, 1990.
[22] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
[23] Ferdinando Samaria and Andy Harter. Parameterisation of a stochastic model for human face identification, 1994.
[24] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22:888–905, 2000.
[25] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
[26] H. Zha, C. Ding, M. Gu, X. He, and H.D. Simon. Spectral relaxation for K-means clustering. Advances in Neural Information Processing Systems 14 (NIPS 2001), pages 1057–1064.
[27] Z. Zhang and Z. Zha. Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM J. Scientific Computing, 26:313–338, 2004.

Figure 6. Clustering accuracy (left) and objective (right) on six datasets, (A) Dermatology, (B) Glass, (C) Soybean, (D) Vehicle, (E) Zoo, and (F) AT&T, for spectral clustering (SpecClus) and our method (NLE). Each panel plots the highest clustering accuracy and the lowest objective of N trials against log N. For the clustering accuracy, the higher the better; for the objective, the lower the better.


Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 02-01-2018 Biomedical data are usually high-dimensional Number of samples (n) is relatively small whereas number of features (p) can be large Sometimes p>>n Problems

More information

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations.

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations. Previously Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations y = Ax Or A simply represents data Notion of eigenvectors,

More information

Nonlinear Methods. Data often lies on or near a nonlinear low-dimensional curve aka manifold.

Nonlinear Methods. Data often lies on or near a nonlinear low-dimensional curve aka manifold. Nonlinear Methods Data often lies on or near a nonlinear low-dimensional curve aka manifold. 27 Laplacian Eigenmaps Linear methods Lower-dimensional linear projection that preserves distances between all

More information

A Local Non-Negative Pursuit Method for Intrinsic Manifold Structure Preservation

A Local Non-Negative Pursuit Method for Intrinsic Manifold Structure Preservation Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence A Local Non-Negative Pursuit Method for Intrinsic Manifold Structure Preservation Dongdong Chen and Jian Cheng Lv and Zhang Yi

More information

Finding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October

Finding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October Finding normalized and modularity cuts by spectral clustering Marianna Bolla Institute of Mathematics Budapest University of Technology and Economics marib@math.bme.hu Ljubjana 2010, October Outline Find

More information

Connection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis

Connection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis Connection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis Alvina Goh Vision Reading Group 13 October 2005 Connection of Local Linear Embedding, ISOMAP, and Kernel Principal

More information

Spectral Graph Theory and its Applications. Daniel A. Spielman Dept. of Computer Science Program in Applied Mathematics Yale Unviersity

Spectral Graph Theory and its Applications. Daniel A. Spielman Dept. of Computer Science Program in Applied Mathematics Yale Unviersity Spectral Graph Theory and its Applications Daniel A. Spielman Dept. of Computer Science Program in Applied Mathematics Yale Unviersity Outline Adjacency matrix and Laplacian Intuition, spectral graph drawing

More information

A Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier

A Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier A Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier Seiichi Ozawa 1, Shaoning Pang 2, and Nikola Kasabov 2 1 Graduate School of Science and Technology,

More information

Multiscale Manifold Learning

Multiscale Manifold Learning Multiscale Manifold Learning Chang Wang IBM T J Watson Research Lab Kitchawan Rd Yorktown Heights, New York 598 wangchan@usibmcom Sridhar Mahadevan Computer Science Department University of Massachusetts

More information

AN ALTERNATING MINIMIZATION ALGORITHM FOR NON-NEGATIVE MATRIX APPROXIMATION

AN ALTERNATING MINIMIZATION ALGORITHM FOR NON-NEGATIVE MATRIX APPROXIMATION AN ALTERNATING MINIMIZATION ALGORITHM FOR NON-NEGATIVE MATRIX APPROXIMATION JOEL A. TROPP Abstract. Matrix approximation problems with non-negativity constraints arise during the analysis of high-dimensional

More information

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 42 Outline 1 Introduction 2 Feature selection

More information

Non-negative matrix factorization with fixed row and column sums

Non-negative matrix factorization with fixed row and column sums Available online at www.sciencedirect.com Linear Algebra and its Applications 9 (8) 5 www.elsevier.com/locate/laa Non-negative matrix factorization with fixed row and column sums Ngoc-Diep Ho, Paul Van

More information

THE HIDDEN CONVEXITY OF SPECTRAL CLUSTERING

THE HIDDEN CONVEXITY OF SPECTRAL CLUSTERING THE HIDDEN CONVEXITY OF SPECTRAL CLUSTERING Luis Rademacher, Ohio State University, Computer Science and Engineering. Joint work with Mikhail Belkin and James Voss This talk A new approach to multi-way

More information

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH Hoang Trang 1, Tran Hoang Loc 1 1 Ho Chi Minh City University of Technology-VNU HCM, Ho Chi

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

Data dependent operators for the spatial-spectral fusion problem

Data dependent operators for the spatial-spectral fusion problem Data dependent operators for the spatial-spectral fusion problem Wien, December 3, 2012 Joint work with: University of Maryland: J. J. Benedetto, J. A. Dobrosotskaya, T. Doster, K. W. Duke, M. Ehler, A.

More information

Data Analysis and Manifold Learning Lecture 9: Diffusion on Manifolds and on Graphs

Data Analysis and Manifold Learning Lecture 9: Diffusion on Manifolds and on Graphs Data Analysis and Manifold Learning Lecture 9: Diffusion on Manifolds and on Graphs Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inrialpes.fr http://perception.inrialpes.fr/ Outline of Lecture

More information

Local Learning Regularized Nonnegative Matrix Factorization

Local Learning Regularized Nonnegative Matrix Factorization Local Learning Regularized Nonnegative Matrix Factorization Quanquan Gu Jie Zhou State Key Laboratory on Intelligent Technology and Systems Tsinghua National Laboratory for Information Science and Technology

More information

A Unifying Approach to Hard and Probabilistic Clustering

A Unifying Approach to Hard and Probabilistic Clustering A Unifying Approach to Hard and Probabilistic Clustering Ron Zass and Amnon Shashua School of Engineering and Computer Science, The Hebrew University, Jerusalem 91904, Israel Abstract We derive the clustering

More information

Nonnegative Matrix Tri-Factorization Based High-Order Co-Clustering and Its Fast Implementation

Nonnegative Matrix Tri-Factorization Based High-Order Co-Clustering and Its Fast Implementation 2011 11th IEEE International Conference on Data Mining Nonnegative Matrix Tri-Factorization Based High-Order Co-Clustering and Its Fast Implementation Hua Wang, Feiping Nie, Heng Huang, Chris Ding Department

More information

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin 1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)

More information

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu Dimension Reduction Techniques Presented by Jie (Jerry) Yu Outline Problem Modeling Review of PCA and MDS Isomap Local Linear Embedding (LLE) Charting Background Advances in data collection and storage

More information

From graph to manifold Laplacian: The convergence rate

From graph to manifold Laplacian: The convergence rate Appl. Comput. Harmon. Anal. 2 (2006) 28 34 www.elsevier.com/locate/acha Letter to the Editor From graph to manifold Laplacian: The convergence rate A. Singer Department of athematics, Yale University,

More information

Graphs, Geometry and Semi-supervised Learning

Graphs, Geometry and Semi-supervised Learning Graphs, Geometry and Semi-supervised Learning Mikhail Belkin The Ohio State University, Dept of Computer Science and Engineering and Dept of Statistics Collaborators: Partha Niyogi, Vikas Sindhwani In

More information

Bi-stochastic kernels via asymmetric affinity functions

Bi-stochastic kernels via asymmetric affinity functions Bi-stochastic kernels via asymmetric affinity functions Ronald R. Coifman, Matthew J. Hirn Yale University Department of Mathematics P.O. Box 208283 New Haven, Connecticut 06520-8283 USA ariv:1209.0237v4

More information

Dimensionality Reduc1on

Dimensionality Reduc1on Dimensionality Reduc1on contd Aarti Singh Machine Learning 10-601 Nov 10, 2011 Slides Courtesy: Tom Mitchell, Eric Xing, Lawrence Saul 1 Principal Component Analysis (PCA) Principal Components are the

More information

Space-Variant Computer Vision: A Graph Theoretic Approach

Space-Variant Computer Vision: A Graph Theoretic Approach p.1/65 Space-Variant Computer Vision: A Graph Theoretic Approach Leo Grady Cognitive and Neural Systems Boston University p.2/65 Outline of talk Space-variant vision - Why and how of graph theory Anisotropic

More information