Class Mean Vector Component and Discriminant Analysis for Kernel Subspace Learning


Alexandros Iosifidis
Department of Engineering, ECE, Aarhus University, Denmark

Abstract — The kernel matrix used in kernel methods encodes all the information required for solving complex nonlinear problems defined on data representations in the input space using simple, but implicitly defined, solutions. Spectral analysis on the kernel matrix defines an explicit nonlinear mapping of the input data representations to a subspace of the kernel space, which can be used for directly applying linear methods. However, the selection of the kernel subspace is crucial for the performance of the subsequent processing steps. In this paper, we propose a component analysis method for kernel-based dimensionality reduction that optimally preserves the pair-wise distances of the class means in the feature space. We provide extensive analysis on the connection of the proposed criterion to those used in kernel principal component analysis and kernel discriminant analysis, leading to a discriminant analysis version of the proposed method. We illustrate the properties of the proposed approach on real-world data.

Index Terms — Kernel subspace learning, Principal Component Analysis, Kernel Discriminant Analysis, Approximate kernel subspace learning

I. INTRODUCTION

Kernel methods are very effective in numerous machine learning problems, including nonlinear regression, classification, and retrieval. The main idea in kernel-based learning is to nonlinearly map the original data representations to a feature space of (usually) increased dimensionality and solve an equivalent (but simpler) problem using a simple (linear) method on the transformed data. That is, all the variability and richness required for solving a complex problem defined on the original data representations is encoded by the adopted nonlinear mapping. Since for commonly used nonlinear mappings in kernel methods the dimensionality of the feature space is arbitrary (virtually infinite), the data representations in the feature space are implicitly obtained by expressing their pair-wise products stored in the so-called kernel matrix $K \in \mathbb{R}^{N \times N}$, where $N$ is the number of samples forming the problem at hand.

The feature space determined by spectral decomposition of $K$ has been shown to encode several properties of interest: it has been used to define low-dimensional features suitable for linear class discrimination [1], to train linear classifiers capturing nonlinear patterns of the input data [2], to reveal nonlinear data structures in spectral clustering [3], and it has been shown to encode information related to the entropy of the input data distribution [4]. The expressive power of $K$ and its resulting basis motivates us to study its discriminative power as well.

In this paper we first propose a kernel matrix component analysis method for kernel-based dimensionality reduction that optimally preserves the pair-wise distances of the class means in the kernel space. We show that the proposed criterion also preserves the distances of the class means with respect to the total mean of the data in the kernel space, as well as the Euclidean divergence between the class distributions in the input space. We analyze the connection of the proposed criterion with those used in (uncentered) kernel principal component analysis and kernel discriminant analysis, providing new insights related to the dimensionality selection process of these two methods.
Extensions using approximate kernel matrix definitions are subsequently proposed. Finally, exploiting the connection of the proposed method to kernel discriminant analysis, we propose a discriminant analysis method that is able to produce kernel subspaces whose dimensionality is not bounded by the number of classes forming the problem at hand. Experiments on real-world data illustrate our findings.

II. PRELIMINARIES

Let us denote by $S = \{S_c\}_{c=1}^{C}$ a set of $D$-dimensional vectors, where $S_c = \{x_1^c, \ldots, x_{N_c}^c\}$ is the set of vectors belonging to class $c$. In kernel-based learning [2], the samples in $S$ are mapped to the kernel space $F$ by using a nonlinear function $\phi(\cdot)$, such that $x_i \in \mathbb{R}^D \rightarrow \phi(x_i) \in F$, $i = 1, \ldots, N$, where $N = \sum_{c=1}^{C} N_c$. Since the dimensionality of $F$ is arbitrary (virtually infinite), the data representations in $F$ are not calculated. Instead, the nonlinear mapping is implicitly performed using the kernel function expressing dot products between the data representations in $F$, i.e. $\kappa(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$. By applying the kernel function on all training data pairs, the so-called kernel matrix $K \in \mathbb{R}^{N \times N}$ is calculated. One of the most important properties of the kernel function $\kappa(\cdot,\cdot)$ is that it leads to a positive semi-definite (psd) kernel matrix $K$. While indefinite matrices [5], [6] and general similarity matrices [7] have also been researched, in this paper we will consider only positive semi-definite kernel functions.
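As a minimal illustration of the above definitions (not part of the original paper; all function and variable names are our own), the following sketch builds a Gaussian kernel matrix for a toy training set and checks its positive semi-definiteness numerically.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)) for the rows of X."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

# Toy data: N samples in R^D.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))

K = gaussian_kernel_matrix(X, sigma=1.0)

# A valid kernel matrix is symmetric psd: all eigenvalues should be >= 0
# (up to numerical precision).
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)  # expected: True
```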

The importance of kernel methods in machine learning comes from the fact that, in the case when a linear method can be expressed based on dot products of the input data, they can be readily used to devise nonlinear extensions. This is achieved by exploiting the Representer theorem [2], stating that the solution of a linear model in $F$, e.g. $W_\phi \in \mathbb{R}^{|F| \times C}$, can be expressed as a linear combination of the training data, i.e. $W_\phi = \Phi A$, where $\Phi = [\phi(x_1), \ldots, \phi(x_N)]$ and $A \in \mathbb{R}^{N \times C}$ is a matrix containing the combination weights. Then, the output of a linear model in $F$ can be calculated by $o_i = W_\phi^\top \phi(x_i) = A^\top k_i$, where $k_i \in \mathbb{R}^N$ is a vector having its $j$-th element equal to $[k_i]_j = \kappa(x_j, x_i)$, $j = 1, \ldots, N$. That is, instead of optimizing with respect to the arbitrary-dimensional $W_\phi$, the solution involves the optimization of the combination weights $A$.

Another important aspect of using kernel methods is that they allow us to train models of increased discrimination power [2], [8]. Considering the Vapnik-Chervonenkis (VC) dimension of a linear classifier defined on the data representations in the original feature space $\mathbb{R}^D$, the number of samples that can be shattered (i.e., correctly classified irrespectively of their arrangement) is equal to $D + 1$. On the other hand, the VC dimension of a linear classifier defined on the data representations in $F$ is higher. For the most widely used kernel function, i.e. the Gaussian kernel function $\kappa(x_i, x_j) = \exp\left(-\frac{1}{2\sigma^2}\|x_i - x_j\|_2^2\right)$, it is virtually infinite. In practice this means that, under mild assumptions, a linear classifier applied on data representations in $F$ can classify all training data.

Using the definition of $K = \Phi^\top \Phi$ and its psd property, its spectral decomposition leads to $\tilde{\Phi} = [\tilde{\phi}_1, \ldots, \tilde{\phi}_N] = \Lambda^{\frac{1}{2}} U^\top$, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_N)$ and $U \in \mathbb{R}^{N \times N}$ contain the eigenvalues and the corresponding eigenvectors of $K$. Thus, an explicit nonlinear mapping from $x_i \in \mathbb{R}^D$ to $\tilde{\phi}_i \in \mathbb{R}^N$ is defined, such that the $d$-th dimension of the training data is:

$[\tilde{\Phi}]_d = \sqrt{\lambda_d}\, u_d^\top,$   (1)

where $\lambda_d$ is the $d$-th largest eigenvalue of $K$ and $u_d$ is the corresponding eigenvector. In the case where $K$ is centered in $F$, $\mathbb{R}^N$ is the space defined by kernel Principal Component Analysis (kPCA) [2]. Moreover, as has been shown in [9], [10], the kernel matrix need not be centered. In the latter case, $N$ is called the effective dimensionality of $F$ and $\mathbb{R}^N$ is the corresponding effective subspace of $F$; this case is also essentially the same as uncentered kernel PCA. Recently, kernel Entropy Component Analysis (kECA) was proposed [4], following the uncentered kernel approach and sorting the eigenvectors based on the size of the entropy values defined as $(\sqrt{\lambda_d}\, u_d^\top \mathbf{1})^2$, where $\mathbf{1} \in \mathbb{R}^N$ is the vector of ones. Kernel ECA has also been shown to be the projection that optimally preserves the length of the data mean vector in $F$ [11]. After sorting the eigenvectors based on the size of either the eigenvalues or the entropy values, the $l$-th dimension of a sample $x_j$ in the kernel subspace is obtained by:

$[y_j]_l = \lambda_l^{-\frac{1}{2}}\, u_l^\top k_j,$   (2)

where $k_j \in \mathbb{R}^N$ is a vector having elements $[k_j]_i = \kappa(x_i, x_j)$, $i = 1, \ldots, N$. Note that the use of such an explicit mapping preserves the discriminative power of the kernel space, since a linear classifier on the data representations in $\mathbb{R}^N$ can successfully classify all $N$ training samples. When a lower-dimensional subspace $\mathbb{R}^M$, $M < N$, of $F$ is sought, the criterion for selecting an eigen-pair $(u_d, \lambda_d)$ is defined in a generative manner: minimizing the quantity $\|K - U_M \Lambda_M U_M^\top\|_2^2$, leading to the selection of the eigen-pairs corresponding to the $M$ maximal eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_M$ in the case of kernel PCA, or maximizing the entropy of the data distribution, leading to the selection of the eigen-pairs corresponding to the $M$ maximal entropy values in the case of kernel ECA.
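To make the explicit mapping of Eqs. (1)-(2) concrete, here is a small sketch (our own illustration under the above definitions; function and variable names are assumptions) that eigendecomposes an uncentered kernel matrix and maps training or new samples to the effective kernel subspace.

```python
import numpy as np

def kernel_subspace_mapping(K, M=None, eps=1e-10):
    """Eigendecompose the (uncentered) kernel matrix K and return a function
    mapping kernel vectors k_j to the M-dimensional kernel subspace (Eq. 2)."""
    eigvals, eigvecs = np.linalg.eigh(K)            # ascending order
    order = np.argsort(eigvals)[::-1]               # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > eps                            # drop numerically zero directions
    eigvals, eigvecs = eigvals[keep], eigvecs[:, keep]
    if M is not None:
        eigvals, eigvecs = eigvals[:M], eigvecs[:, :M]

    def transform(K_new):
        # Rows of K_new hold kernel values between the new samples and the N training samples.
        return K_new @ eigvecs / np.sqrt(eigvals)   # [y_j]_l = lambda_l^{-1/2} u_l^T k_j
    return transform, eigvals, eigvecs

# Usage with the Gaussian kernel matrix from the previous sketch:
# transform, lam, U = kernel_subspace_mapping(K, M=10)
# Y_train = transform(K)        # training data in the kernel subspace
# Y_test  = transform(K_test)   # K_test: kernel values between test and training samples
```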
III. CLASS MEAN VECTOR COMPONENT ANALYSIS

Since the data representations in the kernel space $F$ form classes which are linearly separable, we make the assumption that the classes in these spaces are unimodal. We express the distance between classes $k$ and $m$ by:

$d(c_k, c_m) = \| m_k - m_m \|_2^2,$   (3)

where $m_c$ is the mean vector of class $c$ in $F$. Since $d(c_k, c_m)$ is calculated by using elements of the kernel matrix $K$, i.e. $d(c_k, c_m) = e_k^\top K e_k - 2\, e_k^\top K e_m + e_m^\top K e_m$ (with the class indicator vectors $e_c$ defined below), we exploit the spectral decomposition of $K$ and express the mean vectors in the effective kernel subspace, i.e., $m_c \in \mathbb{R}^N$ with their $d$-th dimension equal to:

$[m_c]_d = \frac{1}{N_c} \sum_{\phi_i \in S_c} [\tilde{\phi}_i]_d = \sqrt{\lambda_d}\, u_d^\top e_c,$   (4)

where $e_c \in \mathbb{R}^N$ is the indicator vector for class $c$ having elements equal to $[e_c]_i = 1/N_c$ for $\phi_i \in S_c$, and $[e_c]_i = 0$ otherwise. Then, $d(c_k, c_m)$ takes the form:

$d(c_k, c_m) = \sum_{d=1}^{N} \lambda_d \left( u_d^\top e_k - u_d^\top e_m \right)^2 = \frac{N_k + N_m}{N_k N_m} \sum_{d=1}^{N} \lambda_d \cos^2(u_d, e_k - e_m).$   (5)

From the above, it can be seen that the eigen-pairs of $K$ maximally contributing to the distance between the two class means are those with a high eigenvalue $\lambda_d$ and an eigenvector angularly aligned to the vector $e_k - e_m$. We express the weighted pair-wise distance between all $C$ classes in $S$ by:

$D = \sum_{k=1}^{C} \sum_{m=1}^{C} p_k p_m\, d(c_k, c_m) = \sum_{d=1}^{N} \lambda_d \sum_{k=1}^{C} \sum_{m=1}^{C} p_k p_m \left( u_d^\top e_k - u_d^\top e_m \right)^2 = \sum_{d=1}^{N} \lambda_d D_d,$   (6)

where each class contributes proportionally to its cardinality, $p_c = N_c / N$, $c = 1, \ldots, C$, and

$D_d = \frac{1}{N^2} \sum_{k=1}^{C} \sum_{m=1}^{C} (N_k + N_m) \cos^2(u_d, e_k - e_m)$   (7)

expresses the weighted alignment of the eigenvector $u_d$ to all possible differences of class indicator vector pairs. To define the subspace $\mathbb{R}^M$ of the kernel space $F$ that maximally preserves the pair-wise distances between the class means in the kernel space, we keep the eigen-pairs minimizing:

$\tilde{D} = \left( D - D_{1:M} \right)^2,$   (8)

where $D_{1:M}$ denotes the value of (6) computed using only the $M$ retained eigen-pairs; equivalently, we keep the $M$ eigen-pairs with the largest contributions $\lambda_d D_d$.
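The selection rule in (6)-(8) can be implemented directly from the eigen-decomposition of $K$ and the class indicator vectors. The sketch below (our own; not the author's reference implementation, and all names are assumptions) scores each eigen-pair by its contribution $\lambda_d D_d$ and would keep the $M$ highest-scoring ones.

```python
import numpy as np

def cmvca_scores(K, labels):
    """Score each eigen-pair of K by its contribution lambda_d * D_d to the
    weighted pair-wise distance between class means (Eqs. 6-7)."""
    classes = np.unique(labels)
    # Class indicator vectors: [e_c]_i = 1/N_c for samples of class c, 0 otherwise.
    E = np.stack([(labels == c) / np.sum(labels == c) for c in classes], axis=1)  # (N, C)
    p = np.array([np.mean(labels == c) for c in classes])                          # priors N_c / N

    lam, U = np.linalg.eigh(K)
    lam, U = lam[::-1], U[:, ::-1]                  # decreasing eigenvalues

    proj = U.T @ E                                   # [proj]_{d,c} = u_d^T e_c
    diffs = proj[:, :, None] - proj[:, None, :]      # (N, C, C): u_d^T e_k - u_d^T e_m
    # D_d = sum_{k,m} p_k p_m (u_d^T e_k - u_d^T e_m)^2, weighted by lambda_d:
    scores = lam * np.einsum('k,m,dkm->d', p, p, diffs**2)
    return scores

# Keeping the M eigen-pairs with the largest scores minimizes (8):
# scores = cmvca_scores(K, y); selected = np.argsort(scores)[::-1][:M]
```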

Thus, in contrast to (uncentered) kernel PCA and kernel ECA, which select the eigen-pairs corresponding to the maximal eigenvalues or entropy values, for the selection of an eigen-pair in CMVCA both the eigenvalue $\lambda_d$ needs to be high and the corresponding eigenvector needs to be angularly aligned to the difference of a pair of class indicator vectors.

A. CMVCA preserves the class-means-to-total-mean distances

In the above we defined CMVCA as the method preserving the pair-wise distances between the class means in $F$. Considering the weighted distance value of dimension $d$, we have:

$D_d = \sum_{k=1}^{C}\sum_{m=1}^{C} p_k p_m \left( (u_d^\top e_k)^2 - 2\, u_d^\top e_k\, u_d^\top e_m + (u_d^\top e_m)^2 \right)$
$\quad = 2 \sum_{k=1}^{C} p_k (u_d^\top e_k)^2 - 2 \left( \sum_{k=1}^{C} p_k\, u_d^\top e_k \right) \left( \sum_{m=1}^{C} p_m\, u_d^\top e_m \right)$
$\quad = 2 \sum_{k=1}^{C} p_k \left( (u_d^\top e_k)^2 - 2\, u_d^\top e_k\, u_d^\top e + (u_d^\top e)^2 \right)$
$\quad = 2 \sum_{k=1}^{C} p_k \left( u_d^\top e_k - u_d^\top e \right)^2,$

and therefore

$D = \sum_{d=1}^{N} \lambda_d D_d = 2 \sum_{k=1}^{C} p_k\, \| m_k - m \|_2^2,$   (9)

where $e \in \mathbb{R}^N$ is a vector having all elements equal to $1/N$ and $m$ is the total mean vector. In the above we used that $\sum_{m=1}^{C} p_m = 1$ and that $e = \sum_{c=1}^{C} p_c e_c$. Thus, the eigen-pairs selected by minimizing the criterion in (8) are those preserving the distances between the class means and the total mean in $F$.

B. CMVCA as the Euclidean divergence between the class probability density functions

Let us assume that the data forming $S_k$ and $S_m$ are drawn from the probability density functions $p_k(x)$ and $p_m(x)$, respectively. The Euclidean divergence between these two probability density functions is given by:

$D(p_k, p_m) = \int p_k^2(x)\,dx - 2 \int p_k(x)\, p_m(x)\,dx + \int p_m^2(x)\,dx.$   (10)

Given the observations of these two probability density functions in $S_k$ and $S_m$, $D(p_k, p_m)$ can be estimated using the Parzen window method [12], [13]. Let $\kappa_\sigma(x_i, \cdot)$ be the Gaussian kernel centered at $x_i$ with width $\sigma$. Then, we have:

$\hat{D}(p_k, p_m) = \frac{1}{N_k^2} \sum_{x_i \in S_k} \sum_{x_j \in S_k} \kappa_\sigma(x_i, x_j) - \frac{2}{N_k N_m} \sum_{x_i \in S_k} \sum_{x_j \in S_m} \kappa_\sigma(x_i, x_j) + \frac{1}{N_m^2} \sum_{x_i \in S_m} \sum_{x_j \in S_m} \kappa_\sigma(x_i, x_j) = e_k^\top K e_k - 2\, e_k^\top K e_m + e_m^\top K e_m,$   (11)

or, expressing it using the eigen-decomposition of $K$:

$\hat{D}(p_k, p_m) = \sum_{d=1}^{N} \lambda_d \left( u_d^\top e_k - u_d^\top e_m \right)^2.$   (12)

Note here that the estimated Euclidean divergence between $p_k(x)$ and $p_m(x)$ takes the same form as the distance between the class mean vectors of classes $c_k$ and $c_m$ in (5). Thus, $D$ in (6) can be expressed as:

$D = \sum_{k=1}^{C}\sum_{m=1}^{C} p_k p_m\, \hat{D}(p_k, p_m).$   (13)

From the above, it can be seen that the dimensions minimizing the criterion in (8) are those optimally preserving the weighted pair-wise Euclidean divergence between the probability density functions of the classes in the input space. Interestingly, exploiting the psd property of the kernel matrix, the analysis in [14], based on the expected value of the kernel convolution operator, shows that the Parzen window method can be replaced by any psd kernel function.
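Equation (11) shows that the Parzen-window estimate of the Euclidean divergence reduces to quadratic forms of $K$ with the class indicator vectors; a short sketch of this computation follows (our own illustration, with assumed names).

```python
import numpy as np

def euclidean_divergence(K, labels, k, m):
    """Parzen-window estimate of the Euclidean divergence between classes k and m,
    computed from the kernel matrix K (Eq. 11):
    e_k^T K e_k - 2 e_k^T K e_m + e_m^T K e_m."""
    e_k = (labels == k) / np.sum(labels == k)
    e_m = (labels == m) / np.sum(labels == m)
    return float(e_k @ K @ e_k - 2.0 * e_k @ K @ e_m + e_m @ K @ e_m)

# The same quantity equals the squared distance between the class means in F (Eq. 5),
# e.g. euclidean_divergence(K, y, 0, 1) for classes 0 and 1.
```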
C. CMVCA in terms of uncentered PCA projections

Let us denote by $v_d$ the $d$-th eigenvector of the scatter matrix $S^{(\phi)} = \Phi \Phi^\top$. $v_d$ is in essence a projection vector defined by applying uncentered kernel PCA on the input vectors $x_i$, $i = 1, \ldots, N$. By using the connection between the eigenvectors of $S^{(\phi)}$ and $K$, i.e. $u_d = \frac{1}{\sqrt{\lambda_d}} \Phi^\top v_d$ [15], we have:

$D = 2 \sum_{d=1}^{N} \sum_{k=1}^{C} p_k \left( v_d^\top \Phi e_k - v_d^\top \Phi e \right)^2 = 2 \sum_{d=1}^{N} \sum_{k=1}^{C} p_k \left( v_d^\top (m_k - m) \right)^2 = 2 \sum_{d=1}^{N} \hat{D}_d,$   (14)

where $\hat{D}_d = \sum_{k=1}^{C} p_k \left( v_d^\top (m_k - m) \right)^2$. Using $\|v_d\|_2^2 = 1$, we get:

$D = 2 \sum_{k=1}^{C} p_k\, \| m_k - m \|_2^2 \sum_{d=1}^{N} \cos^2(v_d, m_k - m).$   (15)

Since $\sum_{d=1}^{N} \cos^2(v_d, m_k - m) = 1$, the contribution of the uncentered kernel PCA axis $v_d$ to $\| m_k - m \|_2^2$ is determined by the cosine of the angle between $v_d$ and $m_k - m$, in the sense that the axes which are most angularly aligned with $m_k - m$ contribute the most. This result adds to the insight provided in [16], [11] and defines CMVCA in terms of the projections obtained by applying uncentered kernel PCA on the input data.

D. Connection between CMVCA and KDA

To analyze the connection between CMVCA and Kernel Discriminant Analysis (KDA), we define the within-class scatter matrix:

$S_w^{(\phi)} = \sum_{k=1}^{C} \sum_{\phi_i \in S_k} (\phi_i - m_k)(\phi_i - m_k)^\top$   (16)

and the between-class scatter matrix:

$S_b^{(\phi)} = \sum_{k=1}^{C} N_k\, (m_k - m)(m_k - m)^\top.$   (17)

The total scatter matrix is then given by $S_t^{(\phi)} = S_w^{(\phi)} + S_b^{(\phi)}$, i.e.:

$S_t^{(\phi)} = \sum_{i=1}^{N} (\phi_i - m)(\phi_i - m)^\top.$   (18)

Using the above scatter matrices, KDA and its variants [17], [18] determine the eigenvectors $v_d$ maximizing the Rayleigh quotient:

$V^* = \underset{V^\top V = I}{\arg\max}\ \frac{\mathrm{tr}\left( V^\top S_b^{(\phi)} V \right)}{\mathrm{tr}\left( V^\top S_t^{(\phi)} V \right)},$   (19)

leading to at most $C - 1$ axes, which are the eigenvectors corresponding to the positive eigenvalues of the generalized eigen-problem $S_b^{(\phi)} v = \lambda S_t^{(\phi)} v$. Here we are interested in the discrimination power, in terms of the KDA criterion, of the axes $v_d$, $d = 1, \ldots, N$, defined from the spectral decomposition of $K$. Expressing the above projections based on the eigenvectors of $S^{(\phi)}$ and assuming the data to be centered, i.e., $m = 0$, the Rayleigh quotient for axis $d$ is equal to:

$\frac{v_d^\top S_b^{(\phi)} v_d}{v_d^\top S_t^{(\phi)} v_d} = \frac{v_d^\top \left( \sum_{k=1}^{C} N_k\, m_k m_k^\top \right) v_d}{v_d^\top \Phi \Phi^\top v_d} = \frac{N \hat{D}_d}{\lambda_d} = N \sum_{k=1}^{C} p_k \left( \frac{1}{\sqrt{\lambda_d}}\, v_d^\top \Phi e_k \right)^2 = N \sum_{k=1}^{C} p_k \left( u_d^\top e_k \right)^2 = \sum_{k=1}^{C} \cos^2(u_d, e_k).$   (20)

The criterion of CMVCA from (6) and (9) for axis $d$ becomes (for centered data, $u_d^\top e = 0$):

$\lambda_d D_d = 2 \sum_{k=1}^{C} p_k\, \lambda_d\, (u_d^\top e_k)^2 = \frac{2}{N} \sum_{k=1}^{C} \lambda_d \cos^2(u_d, e_k).$   (21)

Thus, while in CMVCA an eigen-pair contributes to the criterion based on both the size of $\lambda_d$ and the angular alignment between $u_d$ and the class indicator vectors $e_k$, the criterion of KDA selects dimensions based only on the angular alignment between $u_d$ and the class indicator vectors $e_k$. Note that (20) also gives new insight on why the KDA criterion restricts the dimensionality of the produced subspace by the number of classes. That is, since by definition the $u_d$ form an orthogonal basis, the number of eigenvectors that can be angularly aligned to the class indicator vectors is restricted by the number of classes $C$, which is equal to the rank of the between-class scatter matrix for uncentered data. We will exploit this observation in Section IV to define a discriminative version of CMVCA.

E. CMVCA on approximate kernel subspaces

In cases where the cardinality of $S$ is prohibitive for applying kernel methods, approximate kernel matrix spectral analysis can be used. Probably the most widely used approach is based on the Nyström method, which first chooses a set of $n \ll N$ reference vectors to calculate the kernel matrix between the reference vectors, $K_{nn} \in \mathbb{R}^{n \times n}$, and the matrix $K_{Nn} \in \mathbb{R}^{N \times n}$ containing the kernel function values between the training and reference vectors. In order to determine the reference vectors, two approaches have been proposed. The first is based on selecting $n$ columns of $K$ using random or probabilistic sampling [19], [20], while the second one determines the reference vectors by applying clustering on the training vectors [21], [22]. The Nyström-based approximation of $K$ is given by:

$K \simeq K_{Nn} K_{nn}^{-1} K_{Nn}^\top = \left( K_{nn}^{-\frac{1}{2}} K_{Nn}^\top \right)^\top \left( K_{nn}^{-\frac{1}{2}} K_{Nn}^\top \right) = \tilde{\Phi}_n^\top \tilde{\Phi}_n,$   (22)

where $\tilde{\Phi}_n \in \mathbb{R}^{n \times N}$. When the ranks of $K$ and $K_{nn}$ match, (22) gives an exact calculation of $K$ and $\tilde{\Phi}_n$ is the same as $\tilde{\Phi}$ defined in Section II. Eigen-decomposition of $K_{nn} = U_n \Lambda_n U_n^\top$ leads to $K_{nn}^{-\frac{1}{2}} = U_n \Lambda_n^{-\frac{1}{2}} U_n^\top$. When $K_{nn}$ is a rank-$n$ matrix, the matrices $\tilde{\Phi}_n^\top \tilde{\Phi}_n$ and $\tilde{\Phi}_n \tilde{\Phi}_n^\top$ have the same $n$ leading eigenvalues [15]. The matrix $\tilde{\Phi}_n \tilde{\Phi}_n^\top$ is an $n \times n$ matrix and, thus, applying eigen-analysis to it can highly reduce the computational complexity required for the calculation of the eigenvalues $\Lambda_{(n)}$ of the approximate kernel matrix in (22). Finally, $x_i$ is represented in the approximate kernel subspace by:

$\tilde{\phi}_i = K_{nn}^{-\frac{1}{2}}\, k_i^{(n)} = U_n \Lambda_n^{-\frac{1}{2}} U_n^\top k_i^{(n)},$   (23)

where $k_i^{(n)} \in \mathbb{R}^n$ contains the kernel function values between $x_i$ and the reference vectors. After the calculation of $\tilde{\Phi}_n = [\tilde{\phi}_1, \ldots, \tilde{\phi}_N]$, the leading eigenvectors $u_d$ and the corresponding eigenvalues $\lambda_d$, $d = 1, \ldots, n$, of the approximate kernel matrix correspond to the right singular vectors and the (squared) singular values of $\tilde{\Phi}_n$.
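A compact sketch of the Nyström route in (22)-(23) is given below. It is our own illustration under the definitions above (reference vectors are chosen here by plain random sampling, and a small jitter term is added for numerical invertibility); all names are assumptions.

```python
import numpy as np

def nystrom_features(X, n_ref, sigma, rng=None, jitter=1e-10):
    """Map data to the approximate kernel subspace of Eqs. (22)-(23):
    phi_i = K_nn^{-1/2} k_i^{(n)}, with reference vectors sampled at random."""
    rng = np.random.default_rng(rng)
    ref = X[rng.choice(len(X), size=n_ref, replace=False)]

    def gauss(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-np.maximum(d2, 0) / (2 * sigma**2))

    K_nn = gauss(ref, ref) + jitter * np.eye(n_ref)   # regularize for invertibility
    K_Nn = gauss(X, ref)

    lam_n, U_n = np.linalg.eigh(K_nn)
    K_nn_inv_sqrt = U_n @ np.diag(1.0 / np.sqrt(lam_n)) @ U_n.T
    Phi_n = K_nn_inv_sqrt @ K_Nn.T                    # columns are the phi_i of Eq. (23)

    # The right singular vectors / squared singular values of Phi_n give the
    # eigen-pairs of the approximate kernel matrix K ~= Phi_n^T Phi_n.
    return Phi_n, ref
```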
F. CMVCA on randomized kernel subspaces

Another approach proposed for making the use of kernel methods in big data feasible is based on random nonlinear mappings [23]. A nonlinear mapping $x_i \in \mathbb{R}^D \rightarrow z_i \in \mathbb{R}^n$ is defined such that $z_i^\top z_j \simeq \kappa(x_i, x_j)$ or, using the entire dataset, $Z^\top Z \simeq K$. Singular value decomposition of $Z$ can lead to a basis in a subspace of $\mathbb{R}^n$. That is, the right singular vectors of $Z$ corresponding to the non-zero singular values define the axes minimizing $\tilde{D}^{(z)} = \left( D^{(z)} - D^{(z)}_{1:M} \right)^2$, where $D^{(z)}$ is the CMVCA criterion (6) calculated using $Z$.
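For the Gaussian kernel, such a random mapping can be realized with random Fourier features [23]. The sketch below (our own, with assumed names) builds $Z$ so that $Z^\top Z$ approximates $K$; its singular value decomposition then provides the quantities needed to evaluate the CMVCA criterion (6) on the randomized subspace.

```python
import numpy as np

def random_fourier_features(X, n, sigma, rng=None):
    """Random Fourier features for the Gaussian kernel [23]: the columns z_i of the
    returned matrix Z (n x N) satisfy z_i^T z_j ~= exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(rng)
    N, D = X.shape
    W = rng.normal(scale=1.0 / sigma, size=(n, D))    # spectral samples of the Gaussian kernel
    b = rng.uniform(0.0, 2.0 * np.pi, size=(n, 1))
    Z = np.sqrt(2.0 / n) * np.cos(W @ X.T + b)        # (n, N)
    return Z

# Z.T @ Z approximates the kernel matrix K, so the SVD of Z replaces the
# eigen-decomposition of K when evaluating the CMVCA criterion.
```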

IV. CLASS MEAN VECTOR DISCRIMINANT ANALYSIS

An interesting extension of CMVCA is motivated by the connection of CMVCA with KDA obtained by following the analysis in Subsection III-D. By comparing (20) and (21) we see that, in the case where $\lambda_d = 1$, $d = 1, \ldots, N$, the scores calculated for the kernel subspace dimensions by CMVCA and KDA are the same. This situation arises when the data $\Phi$ is whitened, i.e. when $S^{(\phi)} = \Phi \Phi^\top = I$, where $I$ is the identity matrix. Interestingly, the information needed for whitening $\Phi$ can be directly obtained by the eigen-analysis of $K = \Phi^\top \Phi$, since there is a direct connection between the eigenvalues and eigenvectors of $S^{(\phi)}$ and $K$.

Given a kernel matrix $K = \Phi^\top \Phi$, where $\Phi$ is whitened, application of CMVCA requires the eigen-decomposition of $K$ for calculating the eigenvectors $u_d$, $d = 1, \ldots, N$, and the corresponding eigenvalues $\lambda_d$ to be used for weighting the dimensions of the kernel subspace based on (6). $K$ has all its eigenvalues equal to $\lambda_d = 1$, $d = 1, \ldots, N$, and, thus, its eigenvectors form the axes of an arbitrary orthonormal basis, i.e.:

$\{u_d\}_{d=1}^{N}, \quad u_i^\top u_i = 1, \quad u_i^\top u_j = 0,\ i \neq j.$   (24)

Such a basis can be efficiently calculated by applying an orthogonalization process (e.g. Cholesky decomposition) starting from a vector belonging to the span of $\Phi$. The vector $e$ is such a vector and, thus, can be used for generating the basis. Moreover, the scaled class indicator vectors $\sqrt{N_c}\, e_c$, $c = 1, \ldots, C$, belong to the span of $\Phi$ and also satisfy the two properties, i.e., $(\sqrt{N_k}\, e_k)^\top (\sqrt{N_k}\, e_k) = 1$ and $(\sqrt{N_k}\, e_k)^\top (\sqrt{N_m}\, e_m) = 0$, $k \neq m$. Thus, they can be selected to form the first $C$ eigenvectors of $K$. Note that from (21) it can be seen that these vectors contribute the most to the Rayleigh quotient criterion. To form the remaining $N - C$ basis vectors, we can apply an orthogonalization process on the subspaces determined by each class indicator vector $e_c$, each generating a basis in $\mathbb{R}^{N_c}$ appended by zeros for the remaining dimensions, leading to $\sum_{c=1}^{C} N_c = N$ eigenvectors in total. In order to apply CMVDA in the approximate and randomized kernel cases described in Subsections III-E and III-F, the sample representations in $\mathbb{R}^n$ are whitened and we follow the above-described process to determine the eigen-pairs used for the CMVDA-based projection.

V. EXPERIMENTS

In our experiments we used four datasets, namely the MNIST [24], AR [25], 15 scene [26] and MIT indoor [27] datasets. For the MNIST dataset, we used the first 100 training samples per class to form the training set and we report performance on the entire test set. For the remaining datasets, we perform ten experiments by randomly keeping half of the samples per class for training and the rest for evaluation, and we report the average performance over the ten experiments. We used the vectorized pixel intensity values for representing the images in the MNIST and AR datasets. For the 15 scene and MIT indoor datasets we used deep features generated by average pooling over the spatial dimensions of the last convolutional layer of the VGG network [28] trained on the ILSVRC2012 database, and we follow the approximate kernel and randomized kernel approaches of Subsections III-E and III-F (with a fixed value of $n$). Details of these datasets are shown in Table I.

TABLE I — DATASETS USED IN THE EXPERIMENTS. Columns: Dataset, D, C, N; rows: MNIST, AR, 15 scene, MIT-indoor.

In all experiments we used the Gaussian kernel function and set the width value equal to the mean pair-wise distance between all training samples.
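The kernel width heuristic and the classifier used in the evaluation protocol are straightforward to reproduce; the sketch below is our own approximation of that setup (names are assumptions, and the paper's exact pipeline may differ in details).

```python
import numpy as np

def mean_pairwise_distance(X):
    """Gaussian kernel width heuristic: mean pair-wise Euclidean distance
    between all training samples."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    d = np.sqrt(np.maximum(d2, 0))
    return d[np.triu_indices(len(X), k=1)].mean()

def nearest_class_centroid(Y_train, y_train, Y_test):
    """Classify subspace representations by the nearest class centroid."""
    classes = np.unique(y_train)
    centroids = np.stack([Y_train[y_train == c].mean(axis=0) for c in classes])
    dists = ((Y_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(dists, axis=1)]

# sigma = mean_pairwise_distance(X_train); predictions on any M-dimensional subspace
# representation can then be compared across kPCA, kECA, CMVCA, KDA and CMVDA.
```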

In order to illustrate the effect of using different subspace dimensionalities, we used the nearest class centroid classifier on all possible subspaces produced by each of the methods.

Fig. 1. Classification rate vs. subspace dimensionality. From top-left to bottom-right: MNIST-100, AR, 15 scene using the Nyström approximation, 15 scene using random features, MIT indoor using the Nyström approximation, and MIT indoor using random features.

Figure 1 illustrates the performance obtained by applying kernel PCA, kernel ECA, KDA and the proposed CMVCA, CMVDA, and the CMVDA-R variant using the random basis of the kernel matrix produced by the whitened kernel effective space (Eq. (24)), as a function of the subspace dimensionality. CMVCA performs on par with kPCA and kECA on the MNIST-100 and AR datasets, while it outperforms them for small subspace dimensionalities on 15 scene and MIT Indoor. On the 15 scene dataset, CMVCA clearly outperforms kPCA and kECA, probably due to the unimodal structure of the classes obtained by using CNN features. CMVDA and KDA outperform kPCA, kECA and CMVCA by using a small subspace dimensionality (equal to $C$ and $C - 1$, respectively), while the performance obtained by applying the CMVDA-R variant gradually increases to match the performance of CMVDA when all the dimensions are used. For completeness, we provide the maximum performance obtained by each method in Table II.

TABLE II — CLASSIFICATION RATES (%) OVER THE VARIOUS DATASETS. Columns: Dataset, kPCA, kECA, CMVCA, KDA, CMVDA; rows: MNIST, AR, 15 scene (N), MIT Indoor (N), 15 scene (R), MIT Indoor (R), where (N) and (R) denote the Nyström and random-feature approaches, respectively.

Fig. 2. Rayleigh quotient vs. subspace dimensionality on the AR and MIT indoor datasets.

Figure 2 illustrates the Rayleigh quotient values as a function of the dimensionality of the subspace produced by all methods for the AR and MIT indoor datasets. As can be seen, the values of the Rayleigh quotient of the subspaces obtained by applying the unsupervised methods are, as expected, low. The subspaces obtained by KDA lead to a high value, which gradually decreases as more dimensions are added. CMVDA also leads to subspaces with a high Rayleigh quotient value which is gradually reduced, similarly to KDA. Similar behaviors were observed for the rest of the datasets.

VI. CONCLUSION

In this paper, we proposed a component analysis method for kernel-based dimensionality reduction preserving the distances between the class means in the kernel space. Analysis of the proposed criterion shows that it also preserves the distances between the class means and the total mean in the kernel space, as well as the Euclidean divergence between the class probability density functions in the input space. Moreover, we showed that the proposed criterion, while expressing different properties, has relations to the criteria used in kernel principal component analysis and kernel discriminant analysis. The latter connection leads to a discriminant analysis version of the proposed method. The properties of the proposed approach were illustrated through experiments on real-world data.

REFERENCES

[1] J. Yang, A. Frangi, J. Yang, D. Zhang, and Z. Jin, KPCA plus LDA: a complete kernel Fisher discriminant framework for feature extraction and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 2.
[2] K. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, An introduction to kernel-based learning algorithms, IEEE Transactions on Neural Networks, vol. 12, no. 2.
[3] J. Shi and J. Malik, Normalized cuts and image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8.
[4] R. Jenssen, Kernel entropy component analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 5.
[5] F. Schleif and P. Tiño, Indefinite proximity learning: A review, Neural Computation, vol. 27, no. 10.
[6] A. Gisbrecht and F. Schleif, Metric and non-metric proximity transformations at linear costs, Neurocomputing, vol. 167.
[7] M. Balcan, A. Blum, and N. Srebro, A theory of learning with similarity functions, Machine Learning, vol. 72, no. 1-2.
[8] V. Vapnik, Statistical Learning Theory. New York: Wiley.
[9] N. Kwak, Nonlinear projection trick in kernel methods: an alternative to the kernel trick, IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 12.
[10] N. Kwak, Implementing kernel methods incrementally by incremental nonlinear projection trick, IEEE Transactions on Cybernetics, vol. 47, no. 11.
[11] R. Jenssen, Mean vector component analysis for visualization and clustering of nonnegative data, IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 10.
[12] J. Principe, Information Theoretic Learning: Renyi Entropy and Kernel Perspectives. Springer.
[13] C. Williams and M. Seeger, The effect of the input density distribution on kernel-based classifiers, International Conference on Machine Learning.
[14] R. Jenssen, Kernel entropy component analysis: new theory and semi-supervised learning, IEEE International Workshop on Machine Learning for Signal Processing.
[15] N. Wermuth and H. Rüssmann, Eigenanalysis of symmetrizable matrix products: a result with statistical applications, Scandinavian Journal of Statistics, vol. 20.
[16] J. Cadima and I. Jolliffe, On relationships between uncentered and column-centered principal component analysis, Pakistan Journal of Statistics, vol. 25, no. 4.
[17] A. Iosifidis, A. Tefas, and I. Pitas, On the optimal class representation in Linear Discriminant Analysis, IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 9.
[18] A. Iosifidis, A. Tefas, and I. Pitas, Kernel reference discriminant analysis, Pattern Recognition Letters, vol. 49.
[19] C. Williams and M. Seeger, Using the Nyström method to speed up kernel machines, Advances in Neural Information Processing Systems.
[20] S. Kumar, M. Mohri, and A. Talwalkar, Sampling techniques for the Nyström method, Journal of Machine Learning Research, vol. 13, no. 1.
[21] K. Zhang and J. Kwok, Clustered Nyström method for large scale manifold learning and dimensionality reduction, IEEE Transactions on Neural Networks, vol. 21, no. 10.
[22] A. Iosifidis and M. Gabbouj, Nyström-based approximate kernel subspace learning, Pattern Recognition, vol. 57.
[23] A. Rahimi and B. Recht, Random features for large-scale kernel machines, Advances in Neural Information Processing Systems.
[24] Y. LeCun, L. Bottou, and Y. Bengio, Gradient-based learning applied to document recognition, Proceedings of the IEEE, vol. 86.
[25] A. Martinez and A. Kak, PCA versus LDA, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2.
[26] S. Lazebnik, C. Schmid, and J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, IEEE Conference on Computer Vision and Pattern Recognition.
[27] A. Quattoni and A. Torralba, Recognizing indoor scenes, IEEE Conference on Computer Vision and Pattern Recognition.
[28] K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint, 2014.


More information

PCA and LDA. Man-Wai MAK

PCA and LDA. Man-Wai MAK PCA and LDA Man-Wai MAK Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University enmwmak@polyu.edu.hk http://www.eie.polyu.edu.hk/ mwmak References: S.J.D. Prince,Computer

More information

Approximate Kernel PCA with Random Features

Approximate Kernel PCA with Random Features Approximate Kernel PCA with Random Features (Computational vs. Statistical Tradeoff) Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Journées de Statistique Paris May 28,

More information

Statistical learning theory, Support vector machines, and Bioinformatics

Statistical learning theory, Support vector machines, and Bioinformatics 1 Statistical learning theory, Support vector machines, and Bioinformatics Jean-Philippe.Vert@mines.org Ecole des Mines de Paris Computational Biology group ENS Paris, november 25, 2003. 2 Overview 1.

More information

Spectral Algorithms I. Slides based on Spectral Mesh Processing Siggraph 2010 course

Spectral Algorithms I. Slides based on Spectral Mesh Processing Siggraph 2010 course Spectral Algorithms I Slides based on Spectral Mesh Processing Siggraph 2010 course Why Spectral? A different way to look at functions on a domain Why Spectral? Better representations lead to simpler solutions

More information

arxiv: v1 [cs.cv] 8 Oct 2018

arxiv: v1 [cs.cv] 8 Oct 2018 Deep LDA Hashing Di Hu School of Computer Science Northwestern Polytechnical University Feiping Nie School of Computer Science Northwestern Polytechnical University arxiv:1810.03402v1 [cs.cv] 8 Oct 2018

More information

Spectral Clustering of Polarimetric SAR Data With Wishart-Derived Distance Measures

Spectral Clustering of Polarimetric SAR Data With Wishart-Derived Distance Measures Spectral Clustering of Polarimetric SAR Data With Wishart-Derived Distance Measures STIAN NORMANN ANFINSEN ROBERT JENSSEN TORBJØRN ELTOFT COMPUTATIONAL EARTH OBSERVATION AND MACHINE LEARNING LABORATORY

More information

Less is More: Computational Regularization by Subsampling

Less is More: Computational Regularization by Subsampling Less is More: Computational Regularization by Subsampling Lorenzo Rosasco University of Genova - Istituto Italiano di Tecnologia Massachusetts Institute of Technology lcsl.mit.edu joint work with Alessandro

More information

Non-negative Matrix Factorization on Kernels

Non-negative Matrix Factorization on Kernels Non-negative Matrix Factorization on Kernels Daoqiang Zhang, 2, Zhi-Hua Zhou 2, and Songcan Chen Department of Computer Science and Engineering Nanjing University of Aeronautics and Astronautics, Nanjing

More information

Lecture 24: Principal Component Analysis. Aykut Erdem May 2016 Hacettepe University

Lecture 24: Principal Component Analysis. Aykut Erdem May 2016 Hacettepe University Lecture 4: Principal Component Analysis Aykut Erdem May 016 Hacettepe University This week Motivation PCA algorithms Applications PCA shortcomings Autoencoders Kernel PCA PCA Applications Data Visualization

More information

Classification on Pairwise Proximity Data

Classification on Pairwise Proximity Data Classification on Pairwise Proximity Data Thore Graepel, Ralf Herbrich, Peter Bollmann-Sdorra y and Klaus Obermayer yy This paper is a submission to the Neural Information Processing Systems 1998. Technical

More information

PCA, Kernel PCA, ICA

PCA, Kernel PCA, ICA PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per

More information

Clustering VS Classification

Clustering VS Classification MCQ Clustering VS Classification 1. What is the relation between the distance between clusters and the corresponding class discriminability? a. proportional b. inversely-proportional c. no-relation Ans:

More information

Discriminant Uncorrelated Neighborhood Preserving Projections

Discriminant Uncorrelated Neighborhood Preserving Projections Journal of Information & Computational Science 8: 14 (2011) 3019 3026 Available at http://www.joics.com Discriminant Uncorrelated Neighborhood Preserving Projections Guoqiang WANG a,, Weijuan ZHANG a,

More information