Multiple Similarities Based Kernel Subspace Learning for Image Classification

Wang Yan, Qingshan Liu, Hanqing Lu, and Songde Ma
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, P.O. Box 2728, Beijing, China
{wyan, qsliu, luhq,

Abstract. In this paper, we propose a new method for image classification, in which matrix based kernel features are designed to capture the multiple similarities between images in different low-level visual cues. Based on the property that a dot product kernel can be regarded as a similarity measure, we apply kernel functions to different low-level visual features respectively to measure the similarities between two images, and obtain a kernel feature matrix for each image. In order to deal with the problems of overfitting and numerical computation, a revised version of the Two-Dimensional PCA algorithm is developed to learn the intrinsic subspace of the matrix features for classification. Extensive experiments on the Corel database show the advantage of the proposed method.

1 Introduction

With image data growing rapidly, how to efficiently manage and browse images is an urgent and challenging problem. Image classification is an important technique for image browsing and retrieval in a large database [1], [2]. As a typical pattern recognition problem, image classification has two key issues, i.e., feature extraction and classifier selection based on the extracted features. Most previous studies concentrate on designing the classifier and directly take low-level visual cues as features, such as color, shape and texture [1], [2], [3]. The different feature vectors are concatenated end to end to form the feature vector of the image, and the similarity between two images is usually measured in the Euclidean space with these feature vectors. Zheng et al. propose using locality preserving projection (LPP) [4], [5] to capture the local manifold of the images, but LPP is a linear approach in nature. In practice, complex image data, and natural images in particular, exhibit complex nonlinear variations that degrade the performance of such classification methods.

Kernel Principal Component Analysis (KPCA) is a good nonlinear analysis method, which is actually a nonlinear version of Principal Component Analysis (PCA). Its idea is to first map the input data into an implicit feature space by a nonlinear mapping, and then to analyze the data in that implicit feature space [6]. KPCA has been widely used in practical data analysis [6], [7], [8]. The new method proposed in this paper is inspired by KPCA.

We first give a new perception of KPCA: it can be regarded as having two independent steps, kernel feature extraction and PCA on the kernel features [9]. The kernel feature vector of an image can be calculated as follows: first construct a vector of kernel dot products between the image and all the training images, and then center the vector by subtracting its mean value. Based on this perception, a scheme of matrix based kernel features for image classification is proposed. In image classification and retrieval, images are often described by multiple visual cues, such as color, shape and texture. If the kernel dot products between two images are computed on each visual cue respectively, we obtain a dot product vector between the two images, and the kernel feature vector of an image becomes a kernel feature matrix. From the view that the kernel dot product is a similarity measure [10], [11], the kernel feature matrix provides a strategy to measure the multiple similarities between the image and the training images, which should be more precise for image classification. In order to deal with the problem of the evaluation of eigenvectors, a revised version of Two-Dimensional PCA (2DPCA) [12] is developed to learn the intrinsic subspace of the image feature matrices. Extensive experiments on the Corel database show that the proposed method has an encouraging performance.

The rest of the paper is organized as follows: a new perception of KPCA is given in Section 2, and the proposed image classification scheme is presented in Section 3. The experiments are reported in Section 4, followed by the conclusions in Section 5.

2 Kernel Principal Component Analysis

The idea of KPCA is first to map the input data $\{x_i\}_{i=1}^{N}$ into an implicit feature space $F$ by a nonlinear mapping $\phi$, and then to perform PCA in $F$ to get the nonlinear principal components of the input data [6]. It is unnecessary to know the mapping $\phi$ explicitly; we only need to calculate the dot products between the implicit feature vectors $\{\phi(x_i)\}_{i=1}^{N}$ with a kernel function that satisfies Mercer's theorem [6]. The Gaussian kernel is used in this paper for its popularity in image classification and retrieval [13], [14], and its definition is as follows:

$$k(x_1, x_2) = (\phi(x_1) \cdot \phi(x_2)) = \exp\bigl(-\gamma \|x_1 - x_2\|^2\bigr). \qquad (1)$$

For the following analysis, we define some symbols first. $X = [x_1, x_2, \dots, x_N]$ is the training set. The matrix $\Phi(X) = [\phi(x_1), \phi(x_2), \dots, \phi(x_N)]$ is the mapping of the training set in the implicit feature space $F$. The Gram matrix is $K = [K_1, K_2, \dots, K_N]$, where the column vector $K_i$ is composed of the dot products between $\phi(x_i)$ and all the training samples in $F$, i.e., $K_i = (k(x_i, x_1), k(x_i, x_2), \dots, k(x_i, x_N))$. It can be seen that $K$ is symmetric. KPCA is equivalent to solving the eigenvector and eigenvalue problem of the covariance matrix $C$ of $\{\phi(x_i)\}_{i=1}^{N}$ [6]:

$$C = \frac{1}{N-1}\bigl(\Phi(X) - \Phi(X)1_N\bigr)\bigl(\Phi(X) - \Phi(X)1_N\bigr)^T, \qquad (2)$$
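As a concrete illustration of the quantities defined above, the following sketch (our own illustrative code, not from the paper; the names gaussian_gram and center_gram are hypothetical) builds the Gaussian Gram matrix $K$ of Eq. (1) and the column-centered Gram matrix $\tilde K = K - 1_N K$ for a training set stored as rows of an array:

```python
# Illustrative sketch (assumed, not from the paper): the Gaussian Gram matrix
# K of Eq. (1) and the column-centered Gram matrix K_tilde = K - 1_N K, where
# 1_N is the N x N matrix whose entries all equal 1/N.
import numpy as np

def gaussian_gram(X, gamma):
    """X: (N, d) training features stored as rows. Returns the (N, N) Gram matrix."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def center_gram(K):
    """Subtract each column's mean: K_tilde = K - 1_N K."""
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)
    return K - one_N @ K
```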

where $1_N$ is an $N \times N$ matrix with each entry equal to $1/N$. Let $W = [w_1, w_2, \dots, w_E]$ denote the unitary eigenvector matrix of $C$, where $w_1, w_2, \dots, w_E$ are the unitary eigenvectors corresponding to the positive eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_E$. We get

$$\frac{1}{N-1}\, W^T \bigl(\Phi(X) - \Phi(X)1_N\bigr)\bigl(\Phi(X) - \Phi(X)1_N\bigr)^T W = \Lambda, \qquad (3)$$

where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_E)$. Since any eigenvector $w_i$ with eigenvalue $\lambda_i$ must lie in the span of $\{\phi(x_i)\}_{i=1}^{N}$ [6], we have

$$W = \bigl(\Phi(X) - \Phi(X)1_N\bigr)A, \qquad (4)$$

where $A = [\alpha_1, \alpha_2, \dots, \alpha_E]$ is an $N \times E$ matrix called the eigenvector expansion coefficient matrix, and $w_i = (\Phi(X) - \Phi(X)1_N)\alpha_i$. Combining (3) and (4), and because $\Phi^T(X)\Phi(X) = K$, we get

$$\frac{1}{N-1}\, A^T \bigl(\tilde K - \tilde K 1_N\bigr)\bigl(\tilde K - \tilde K 1_N\bigr)^T A = \Lambda, \qquad (5)$$

where $\tilde K = K - 1_N K$. In fact, each column of $\tilde K$ equals the corresponding column of $K$ minus its mean value, so we call $\tilde K$ the centered Gram matrix. Thus, solving equation (5) is equivalent to solving the eigenvectors and eigenvalues of the covariance of $\tilde K$. Finally, we normalize $\alpha_1, \alpha_2, \dots, \alpha_E$ in order to make $(w_i \cdot w_i) = 1$. Note that $\tilde K - \tilde K 1_N$ is symmetric and all of its eigenvalues are nonnegative; it can be proved that any eigenvector $\alpha_i$ of $(\tilde K - \tilde K 1_N)(\tilde K - \tilde K 1_N)^T/(N-1)$ with eigenvalue $\lambda_i$ is an eigenvector of $(\tilde K - \tilde K 1_N)$ with eigenvalue $\sqrt{(N-1)\lambda_i}$. So the normalization condition is as follows:

$$1 = (w_i \cdot w_i) = \bigl((\Phi(X) - \Phi(X)1_N)\alpha_i \cdot (\Phi(X) - \Phi(X)1_N)\alpha_i\bigr) = \alpha_i^T\bigl(\tilde K - \tilde K 1_N\bigr)\alpha_i = \sqrt{(N-1)\lambda_i}\,(\alpha_i \cdot \alpha_i) = \sqrt{\tilde\lambda_i}\,(\alpha_i \cdot \alpha_i), \qquad (6)$$

where $\tilde\lambda_i$ is the eigenvalue of $(\tilde K - \tilde K 1_N)(\tilde K - \tilde K 1_N)^T$.

As for the test samples $T = [t_1, t_2, \dots, t_L]$, their projections in the KPCA subspace are

$$Y = W^T\bigl(\Phi(T) - \Phi(X)1'_N\bigr) = W^T\Phi(T) - W^T\Phi(X)1'_N, \qquad (7)$$

where $1'_N$ is the $N \times L$ matrix with each entry equal to $1/N$. Since all columns of $W^T\Phi(X)1'_N$ are identical, this term is irrelevant for the classification problem when using the Euclidean distance. Combining (4) and (7) we get

$$Y = W^T\Phi(T) = A^T\bigl(\Phi(X) - \Phi(X)1_N\bigr)^T\Phi(T). \qquad (8)$$

Define the matrix $K_{test} = \bigl(k(x_i, t_j)\bigr)_{N \times L}$ for the test points, and then we get

$$Y = A^T\bigl(K_{test} - 1_N K_{test}\bigr) = A^T\tilde K_{test}, \qquad (9)$$

where $\tilde K_{test} = K_{test} - 1_N K_{test}$.
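The two-step view of KPCA summarized by Eqs. (5), (6) and (9) can be sketched as follows (illustrative code under our own assumptions, not the authors' implementation; kpca_fit and kpca_project are hypothetical names):

```python
# Illustrative sketch (not the authors' code) of the two-step view of KPCA:
# eigendecompose the covariance of the centered Gram matrix, rescale the
# expansion coefficients A so that (w_i . w_i) = 1 as in Eq. (6), and project
# test points with Y = A^T K_tilde_test as in Eq. (9).
import numpy as np

def kpca_fit(K, n_components):
    """K: (N, N) Gram matrix of the training set. Returns (A, one_N)."""
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)
    K_tilde = K - one_N @ K                          # centered Gram matrix
    D = K_tilde - K_tilde @ one_N                    # doubly centered, symmetric PSD
    C = D @ D.T / (N - 1)                            # covariance of K_tilde, Eq. (5)
    evals, evecs = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1][:n_components]
    lam = np.clip(evals[order], 1e-12, None)
    A = evecs[:, order]
    # Eq. (6): scale alpha_i by ((N-1)*lam_i)^(-1/4) so that (w_i . w_i) = 1
    A = A / ((N - 1) * lam) ** 0.25
    return A, one_N

def kpca_project(A, one_N, K_test):
    """K_test: (N, L) kernel values between training and test points."""
    K_test_tilde = K_test - one_N @ K_test           # centering of Eq. (9)
    return A.T @ K_test_tilde                        # (n_components, L) projections
```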

From the description above, we can see that KPCA is equivalent to solving the eigenvectors and eigenvalues of the covariance of the centered Gram matrix $\tilde K$. So KPCA can be regarded as having two independent steps: kernel feature extraction and PCA on the kernel features. The dot product vector of an image is composed of the kernel dot products between the image and all the training images, and the kernel feature vector of the image is then obtained from the dot product vector by subtracting its mean value. PCA is performed on the kernel feature vectors to get the eigenvector expansion coefficient matrix $A$; the only difference is that we need to normalize the coefficients according to (6). The projections of the test data in the KPCA subspace are $Y = A^T\tilde K_{test}$.

3 Matrix Based Kernel Feature for Image Classification

Based on this new perception of KPCA, we extend the kernel feature vectors of images to kernel feature matrices to measure multiple similarities between two images.

3.1 Matrix Based Kernel Feature

In KPCA, the kernel dot product is used to capture the similarity between two images. But this description is not sufficient when the feature vectors of the images contain several kinds of low-level features, because it only gives one overall similarity rather than the individual similarities in each kind of low-level visual cue. Since in image classification and retrieval images are often represented by multiple visual cues, such as color, texture and shape, we compute the kernel dot products on the different visual cues respectively and get a dot product vector between two images. This vector describes the similarities in the different visual cues rather than one general similarity between the images, so the kernel feature vector of an image in KPCA becomes a kernel feature matrix. Assuming that there are $p$ visual cues to represent the images, the kernel feature matrix $M_i$ of image $x_i$ is defined as follows:

$$M_i = \begin{pmatrix} k_1(x_i^1, x_1^1) & k_2(x_i^2, x_1^2) & \cdots & k_p(x_i^p, x_1^p) \\ k_1(x_i^1, x_2^1) & k_2(x_i^2, x_2^2) & \cdots & k_p(x_i^p, x_2^p) \\ \vdots & \vdots & \ddots & \vdots \\ k_1(x_i^1, x_N^1) & k_2(x_i^2, x_N^2) & \cdots & k_p(x_i^p, x_N^p) \end{pmatrix}, \qquad (10)$$

where $k_j$ is the dot product kernel function for the $j$-th visual cue and $x_i^j$ is the $j$-th visual cue of the $i$-th image. We call $M_i$ the matrix based kernel feature. From the view that the dot product kernel is a similarity measure function, the matrix based kernel feature provides a multi-similarity representation, i.e., we get $p$ kinds of similarity between the image $x_i$ and all the training images, which should be more precise than the kernel feature vector in traditional KPCA. Corresponding to the centering of the vector $K_i$ in KPCA, we center each column of $M_i$ in the same way and get the centered kernel feature matrix $\tilde M_i$.
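A minimal sketch of Eq. (10), under our own conventions rather than the paper's code: each of the $p$ visual cues is stored as a separate array, and each cue $j$ is given its own Gaussian kernel width $\gamma_j$ (the function name kernel_feature_matrix and the per-cue widths are assumptions for illustration):

```python
# Minimal sketch of Eq. (10) under assumed data layout: column j of M_i holds
# the Gaussian kernel values of cue j between image i and every training image;
# each column is then centered by subtracting its mean.
import numpy as np

def kernel_feature_matrix(cues_i, cues_train, gammas):
    """cues_i: list of p cue vectors of one image (shapes (d_1,), ..., (d_p,));
    cues_train: list of p arrays, the j-th of shape (N, d_j);
    gammas: list of p Gaussian kernel widths. Returns the centered (N, p) matrix."""
    cols = []
    for x_j, X_j, g in zip(cues_i, cues_train, gammas):
        col = np.exp(-g * np.sum((X_j - x_j) ** 2, axis=1))   # k_j(x_i^j, x_n^j), n = 1..N
        cols.append(col)
    M = np.stack(cols, axis=1)                                 # kernel feature matrix M_i
    return M - M.mean(axis=0, keepdims=True)                   # column-wise centering
```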

3.2 Revised Two-Dimensional PCA

Following traditional KPCA, we would have to reshape $\tilde M_i$ into a vector with $N \times p$ elements first and then perform PCA. However, this reshaping leads to expensive computation, since the dimension increases by a factor of $p$. For example, if there are 1000 training samples and 10 similarities between two images, the dimension becomes 10000. Fortunately, the eigenvectors can be calculated efficiently using SVD techniques [15], [16], and the process of generating the covariance matrix is actually avoided. But it is difficult to evaluate the covariance matrix accurately due to its large size and the relatively small number of training samples, and the eigenvectors cannot be obtained accurately since they are determined by the covariance matrix. We revise the 2DPCA algorithm proposed in [12] to deal with this problem.

As opposed to conventional PCA, 2DPCA is based on 2D matrices rather than 1D vectors. That is, the centered kernel feature matrix does not need to be transformed into a vector beforehand; a covariance matrix can be constructed using the kernel feature matrices directly. Let $\{B_i\}_{i=1}^{N}$ denote the training data. 2DPCA projects $B_i$ by a transform matrix $X$:

$$Z_i = B_i X. \qquad (11)$$

Since the total scatter of the projected samples can be characterized by the trace of their covariance matrix, the following criterion is adopted to maximize the discriminating power of the projection $X$:

$$J(X) = \mathrm{trace}\Bigl(\frac{1}{N}\sum_{i=1}^{N}(Z_i - \mu)^T(Z_i - \mu)\Bigr) = X^T G X, \qquad (12)$$

where $\mu$ is the mean matrix of all the $Z_i$ and $G = \frac{1}{N}\sum_{i=1}^{N}(B_i - \bar B)^T(B_i - \bar B)$ is the covariance matrix constructed directly from the training matrices, with $\bar B$ their mean. The optimal projection $X_{opt} = [X_1, X_2, \dots, X_E]$ is a matrix whose columns are the unitary vectors that maximize $J(X)$, i.e., the eigenvectors of $G$ corresponding to the first several largest eigenvalues [17].

We may take the centered kernel feature matrices $\{\tilde M_i\}_{i=1}^{N}$ as the training set $\{B_i\}_{i=1}^{N}$ and perform 2DPCA directly, or first transpose them to get $\{\tilde M_i^T\}_{i=1}^{N}$ and then perform 2DPCA. The latter scheme is adopted in this paper for its similarity to KPCA: each column of $\tilde M_i$ corresponds to the centered kernel feature vector $\tilde K_i$ in KPCA, and the projections of the samples in the KPCA subspace are the inner products between $\{\tilde K_i\}_{i=1}^{N}$ and the eigenvectors of the covariance matrix $C$, so it seems more reasonable to calculate the inner products between the columns of $\tilde M_i$ and the eigenvectors of $G$. Finally, the eigenvectors in $X_{opt}$ are normalized according to

$$\lambda_i\,(X_i \cdot X_i) = 1, \qquad (13)$$

where $\lambda_i$ is the eigenvalue corresponding to eigenvector $X_i$. For the test points $T = [t_1, t_2, \dots, t_L]$, the kernel feature matrix $M_{test}$ and its centered version $\tilde M_{test}$ can be calculated, and the projections of the test points are

$$Z = \bigl(\tilde M_{test}\bigr)^T X. \qquad (14)$$
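The revised 2DPCA step can be sketched as follows (illustrative code under our reading of Eqs. (11)-(14), not the authors' implementation; revised_2dpca_fit and revised_2dpca_project are hypothetical names, and the training matrices are taken to be the transposed centered kernel feature matrices):

```python
# Illustrative sketch of the revised 2DPCA step: the training matrices are
# B_i = M_i_tilde^T (shape p x N), G is their image covariance matrix, the
# leading eigenvectors are rescaled as in Eq. (13), and a test image is
# projected with Z = (M_test_tilde)^T X as in Eq. (14).
import numpy as np

def revised_2dpca_fit(B_list, n_components):
    """B_list: list of training matrices B_i, each of shape (p, N)."""
    B = np.stack(B_list)                                   # (num_images, p, N)
    D = B - B.mean(axis=0)
    # G = (1/num_images) * sum_i (B_i - B_mean)^T (B_i - B_mean), an N x N matrix
    G = np.einsum('ipk,ipl->kl', D, D) / B.shape[0]
    evals, evecs = np.linalg.eigh(G)
    order = np.argsort(evals)[::-1][:n_components]
    lam = np.clip(evals[order], 1e-12, None)
    X = evecs[:, order] / np.sqrt(lam)                     # Eq. (13): lam_i (X_i . X_i) = 1
    return X

def revised_2dpca_project(M_tilde, X):
    """M_tilde: centered kernel feature matrix of one image, shape (N, p)."""
    return M_tilde.T @ X                                   # Eq. (14): Z of shape (p, E)
```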

4 Experiments

4.1 Experimental Data

We test the proposed method on the Corel image database. Our dataset contains 6000 images in 60 categories randomly selected from the Corel database; each category consists of 100 manually labeled images used as ground truth. Four kinds of visual features are used in this paper to represent the images: color histogram, color moments, wavelet based texture, and orientation histogram. The color histogram is taken in HSV space with a quantization of 8 x 4 = 32 bins on the H and S channels. The first three moments from each of the three color channels are used as the color moments. A 24-dimensional PWT based wavelet texture feature and an 8-dimensional orientation histogram are also included, giving a 73-dimensional feature vector for each image. Each feature component is normalized so that its variance equals 1.

4.2 Experimental Results

In our experiments, five subspace learning algorithms, including the proposed method, are compared.

PCA: PCA is performed directly on the original 73-dimensional visual features.

KPCA: The Gaussian kernel is used; we investigated the kernel parameter in [0.001, 1] and found that gamma = 0.08 gives the best performance.

The proposed method: We denote it MSPCA for simplicity. Since each image is represented by four kinds of visual cues, we apply four kernel functions to compute the kernel dot products between two images on each visual cue respectively, so a matrix based kernel feature is obtained for each image, and the revised 2DPCA is then applied to these features. For simplicity, we adopt four Gaussian kernels with the same parameter. Investigating the kernel parameter in [0.001, 1], the performance of MSPCA is maximized when gamma = 0.5.

LKPCA: As mentioned in Section 3.2, after the matrix based kernel features are calculated, we can reshape them into vectors first and then perform PCA on them. The kernel functions used here are the same as in MSPCA.

LPP: The code of LPP is downloaded from xiaofei/. The number of nearest neighbors is set to 10 as in [4].

For the output of the above algorithms, the nearest neighbor classifier is used for classification. We test the algorithms on several different subsets of the database. Each subset is a mixture of k categories, where k varies between 2 and 10. For each category number k, 200 subsets are randomly selected from the database.
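A small sketch of this evaluation step (our own illustrative code, not the authors'), assuming every method's output has been flattened into one vector per image before nearest neighbor matching:

```python
# Sketch of the evaluation protocol under stated assumptions: classify each
# probe image by the label of its nearest gallery image under Euclidean
# distance; matrix-shaped projections are assumed flattened to vectors first.
import numpy as np

def nn_classify(gallery, gallery_labels, probe):
    """gallery: (G, d) projections, gallery_labels: (G,) labels, probe: (P, d)."""
    d2 = (np.sum(probe ** 2, axis=1)[:, None]
          - 2.0 * probe @ gallery.T
          + np.sum(gallery ** 2, axis=1)[None, :])
    return np.asarray(gallery_labels)[np.argmin(d2, axis=1)]

def accuracy(predicted, truth):
    return float(np.mean(np.asarray(predicted) == np.asarray(truth)))
```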

Two groups of experiments are designed. The first group is designed to compare the performance of the five algorithms. For each subset, all the images are used as training data, 75 images of each category are used as gallery, and the remaining 25% of the images are used as probe.

Fig. 1. Classification results comparison among PCA, KPCA, MSPCA, LKPCA and LPP

Fig. 1 shows the experimental results, where the classification accuracies are averaged over the 200 subsets. From Fig. 1, we can see that the proposed algorithm, i.e., MSPCA, has the best performance, followed by LKPCA, which also uses the information of multiple similarities. Since it is difficult to evaluate the covariance matrix accurately with a relatively small training set in LKPCA, its accuracies are always lower than those of MSPCA. PCA outperforms KPCA when the category number k varies from 2 to 8. Because the training set becomes more complicated as k increases, the performance of PCA is limited by its linear nature; that is why the nonlinear KPCA has better results when k equals 9 or 10. LPP fails in this experiment.

In order to further evaluate the performance of the proposed method, we conduct statistical tests between the proposed method and the other four methods. Since it is hard to know the distributions of the accuracies, the non-parametric one-sided Wilcoxon signed rank test for two related samples is adopted. We conducted tests between the results of the algorithms for each category number k respectively. The null hypothesis H0 is that the result of the proposed method has the same distribution as the result of algorithm A, where A is PCA, KPCA, LKPCA or LPP. The p-value of each test is shown in Table 1. Except for three tests (cells with italics), all p-values are less than 0.05, which means that most of the tests show significant differences between the accuracies of the proposed algorithm and the other four algorithms. Because the mean accuracies of MSPCA are always higher, MSPCA can be considered better than the other four algorithms most of the time. The exceptions in the tests between PCA and MSPCA when k equals 2 or 3 are probably due to the simplicity of the training set, where adopting a kernel method such as MSPCA does not bring much benefit; still, the average accuracies of MSPCA are higher than those of PCA.

Table 1. P-values of hypothesis tests between the proposed method and the four other algorithms, based on the first group of experiments. "<2.2e-16" means less than 2.2e-16.
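For reference, the one-sided Wilcoxon signed rank test over paired per-subset accuracies can be run as sketched below; scipy.stats.wilcoxon is used here as an assumed stand-in, since the paper does not say which implementation was used:

```python
# Sketch (assumed tooling) of the one-sided Wilcoxon signed rank test between
# the per-subset accuracies of MSPCA and a competing algorithm.
from scipy.stats import wilcoxon

def compare_accuracies(acc_mspca, acc_other, alpha=0.05):
    """acc_mspca, acc_other: paired accuracy lists over the 200 random subsets."""
    _, p_value = wilcoxon(acc_mspca, acc_other, alternative='greater')
    return p_value, p_value < alpha   # reject H0 when p_value < alpha
```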

The second group of experiments is designed to test the generalization capability of the proposed method. 50 images of each category are used as training data and as gallery, and the remaining 50% are used as probe. As in the first group of experiments, the five algorithms are run on 200 subsets and their average classification accuracies are calculated. The comparison of the results is shown in Fig. 2.

Fig. 2. Generalization capability comparison among PCA, KPCA, MSPCA, LKPCA and LPP

These experimental results are similar to those of the first group. MSPCA is the best, followed by LKPCA, which shows the good generalization capability of MSPCA. When the number of categories is small (no more than 8), linear PCA outperforms KPCA; for more complicated data sets, KPCA is preferred to PCA. LPP fails again. We also conducted Wilcoxon signed rank tests between the results of the algorithms for each k respectively. The p-value of each test is shown in Table 2. All p-values but one (the cell with italics) are small enough to show that the performance of MSPCA is better than that of the other four methods on out-of-sample data, which demonstrates its good generalization capability again. When the number of categories k equals 2, the test between PCA and MSPCA fails to reject the null hypothesis, probably due to the simplicity of the image set; still, the mean accuracy of MSPCA is higher than that of PCA.

Table 2. P-values of hypothesis tests between the proposed method and the four other algorithms, based on the second group of experiments.

5 Conclusions

In this paper, we conceive a new perception of KPCA, i.e., it can be regarded as having two separate steps: kernel feature extraction and PCA based feature analysis. The dot product vector of an image is composed of the kernel dot products between the image and all the training images, and the kernel feature vector of the image is then obtained from the dot product vector by subtracting its mean value. Based on this perception, we propose a new scheme of matrix based kernel features for image classification. With four kinds of visual cues, i.e., color histogram, color moments, wavelet based texture, and orientation histogram, we apply a dot product kernel on each cue to compute the similarity between two images respectively, and then obtain the matrix based kernel feature of an image with multiple similarities. In order to efficiently deal with the problem of the evaluation of eigenvectors, a matrix based KPCA algorithm is developed to learn the subspace of matrix features for classification. Extensive experiments on the Corel database are conducted to show the advantage of the proposed method.

Acknowledgement

This work is supported by the Natural Sciences Foundation of China.

References

1. Chen, Y., Wang, J., Krovetz, R.: Content-based image retrieval by clustering. In: Proc. of ACM SIGMM Int. Workshop on Multimedia Information Retrieval (2003)
2. Gordon, S., Greenspan, H., Goldberger, J.: Applying the information bottleneck principle to unsupervised clustering of discrete and continuous image representations. In: Proc. of Int. Conf. Computer Vision (2003)
3. Barreno, M.: Spectral methods for image clustering. Tech-Report CS 218B, U.C. Berkeley (2004)

4. Zheng, X., Cai, D., He, X., Ma, W., Lin, X.: Locality preserving clustering for image database. In: Proc. of ACM Multimedia (2004)
5. He, X., Niyogi, P.: Locality preserving projections. In: Advances in Neural Information Processing Systems, Volume 16. MIT Press, Cambridge, MA (2004)
6. Schölkopf, B., Smola, A., Müller, K.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10 (1998)
7. Yang, M., Ahuja, N., Kriegman, D.: Face recognition using kernel eigenfaces. In: Proc. of Int. Conf. Image Processing (2000)
8. Mika, S., Schölkopf, B., Smola, A., Müller, K., Scholz, M., Rätsch, G.: Kernel PCA and de-noising in feature spaces. In: Advances in Neural Information Processing Systems, Volume 11. MIT Press, Cambridge, MA (1999)
9. Liu, Q., Jin, H., Tang, X., Lu, H., Ma, S.: A new perception of kernel features. Tech-Report, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences (2005)
10. Liu, Q., Liu, H., Ma, S.: Improving kernel Fisher discriminant analysis for face recognition. IEEE Trans. on Circuits and Systems for Video Technology, Special Issue on Image and Video Based Biometrics 14 (2004)
11. Liu, Q., Huang, R., Liu, H., Ma, S.: Face recognition using kernel-based Fisher discriminant analysis. In: Proc. of Int. Conf. Automatic Face and Gesture Recognition (2002)
12. Yang, J., Zhang, D., Frangi, A., Yang, J.: Two-Dimensional PCA: A new approach to appearance-based face representation and recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence 26 (2004)
13. Tong, S., Chang, E.: Support vector machine active learning for image retrieval. In: Proc. of ACM Multimedia (2001)
14. Zhang, L., Lin, F., Zhang, B.: Support vector machine for image retrieval. In: Proc. of Int. Conf. Image Processing (2001)
15. Sirovich, L., Kirby, M.: Low-dimensional procedure for the characterization of human faces. J. Optical Soc. Am. 4 (1987)
16. Kirby, M., Sirovich, L.: Application of the KL procedure for the characterization of human faces. IEEE Trans. on Pattern Analysis and Machine Intelligence 12 (1990)
17. Yang, J., Yang, J.: From image vector to matrix: A straightforward image projection technique, IMPCA vs. PCA. Pattern Recognition 35 (2002)
