Kernel PCA and De-Noising in Feature Spaces

Sebastian Mika, Bernhard Schölkopf, Alex Smola, Klaus-Robert Müller, Matthias Scholz, Gunnar Rätsch
GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany
{mika, bs, smola, klaus, scholz,

Abstract

Kernel PCA as a nonlinear feature extractor has proven powerful as a preprocessing step for classification algorithms. But it can also be considered as a natural generalization of linear principal component analysis. This gives rise to the question how to use nonlinear features for data compression, reconstruction, and de-noising, applications common in linear PCA. This is a nontrivial task, as the results provided by kernel PCA live in some high dimensional feature space and need not have pre-images in input space. This work presents ideas for finding approximate pre-images, focusing on Gaussian kernels, and shows experimental results using these pre-images in data reconstruction and de-noising on toy examples as well as on real world data.

1 PCA and Feature Spaces

Principal Component Analysis (PCA) (e.g. [3]) is an orthogonal basis transformation. The new basis is found by diagonalizing the centered covariance matrix of a data set $\{x_k \in \mathbb{R}^N \mid k = 1, \ldots, \ell\}$, defined by $C = \langle (x_i - \langle x_k \rangle)(x_i - \langle x_k \rangle)^T \rangle$. The coordinates in the Eigenvector basis are called principal components. The size of an Eigenvalue $\lambda$ corresponding to an Eigenvector $v$ of $C$ equals the amount of variance in the direction of $v$. Furthermore, the directions of the first $n$ Eigenvectors corresponding to the biggest $n$ Eigenvalues cover as much variance as possible by $n$ orthogonal directions. In many applications they contain the most interesting information: for instance, in data compression, where we project onto the directions with biggest variance to retain as much information as possible, or in de-noising, where we deliberately drop directions with small variance.

Clearly, one cannot assert that linear PCA will always detect all structure in a given data set. By the use of suitable nonlinear features, one can extract more information.
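As a point of reference for what follows, linear PCA de-noising as described above (keep the directions of largest variance, drop the rest) might be sketched as below. This is only a minimal illustration, not code from the paper: the function name `pca_denoise` and the synthetic data (noisy points along a line in the plane) are assumptions made for the example.

```python
import numpy as np

def pca_denoise(X_train, X_test, n_components):
    """Project X_test onto the n leading principal directions of X_train."""
    mean = X_train.mean(axis=0)
    centered = X_train - mean
    # Centered covariance matrix C of the training data.
    C = centered.T @ centered / len(X_train)
    # eigh returns Eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1][:n_components]
    V = eigvecs[:, order]
    # Drop the low-variance directions and map back to input space.
    return mean + (X_test - mean) @ V @ V.T

# Hypothetical toy data: points on a line in the plane plus Gaussian noise.
rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, size=200)
clean = np.column_stack([t, 2.0 * t])
noisy = clean + 0.05 * rng.normal(size=clean.shape)

denoised = pca_denoise(noisy, noisy, n_components=1)
err_noisy = np.mean(np.sum((noisy - clean) ** 2, axis=1))
err_denoised = np.mean(np.sum((denoised - clean) ** 2, axis=1))
```

Keeping both components in this two-dimensional example returns the input unchanged, which is the sense in which a full basis transformation cannot de-noise.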
Kernel PCA is very well suited to extract interesting nonlinear structures in the data [9]. The purpose of this work is therefore (i) to consider nonlinear de-noising based on kernel PCA and (ii) to clarify the connection between feature space expansions and meaningful patterns in input space. Kernel PCA first maps the data into some feature space $F$ via a (usually nonlinear) function $\Phi$ and then performs linear PCA on the mapped data. As the feature space $F$ might be very high dimensional (e.g. when mapping into the space of all possible $d$-th order monomials of input space), kernel PCA employs Mercer kernels instead of carrying out the mapping $\Phi$ explicitly. A Mercer kernel is a function $k(x, y)$ which for all data sets $\{x_i\}$ gives rise to a positive matrix $K_{ij} = k(x_i, x_j)$ [6]. One can show that using $k$ instead of a dot product in input space corresponds to mapping the data with some $\Phi$ to a feature space $F$ [1], i.e. $k(x, y) = (\Phi(x) \cdot \Phi(y))$. Kernels that have proven useful include Gaussian kernels $k(x, y) = \exp(-\|x - y\|^2 / c)$ and polynomial kernels $k(x, y) = (x \cdot y)^d$. Clearly, all algorithms that can be formulated in terms of dot products, e.g. Support Vector Machines [1], can be carried out in some feature space $F$ without mapping the data explicitly. All these algorithms construct their solutions as expansions in the potentially infinite-dimensional feature space.

The paper is organized as follows: in the next section, we briefly describe the kernel PCA algorithm. In section 3, we present an algorithm for finding approximate pre-images of expansions in feature space. Experimental results on toy and real world data are given in section 4, followed by a discussion of our findings (section 5).

2 Kernel PCA and Reconstruction

To perform PCA in feature space, we need to find Eigenvalues $\lambda > 0$ and Eigenvectors $V \in F \setminus \{0\}$ satisfying $\lambda V = \bar{C} V$ with $\bar{C} = \langle \Phi(x_k) \Phi(x_k)^T \rangle$.¹ Substituting $\bar{C}$ into the Eigenvector equation, we note that all solutions $V$ must lie in the span of $\Phi$-images of the training data. This implies that we can consider the equivalent system

$$\lambda (\Phi(x_k) \cdot V) = (\Phi(x_k) \cdot \bar{C} V) \quad \text{for all } k = 1, \ldots, \ell \tag{1}$$

and that there exist coefficients $\alpha_1, \ldots, \alpha_\ell$ such that

$$V = \sum_{i=1}^{\ell} \alpha_i \Phi(x_i) \tag{2}$$

Substituting $\bar{C}$ and (2) into (1), and defining an $\ell \times \ell$ matrix $K$ by $K_{ij} := (\Phi(x_i) \cdot \Phi(x_j)) = k(x_i, x_j)$, we arrive at a problem which is cast in terms of dot products: solve

$$\ell \lambda \alpha = K \alpha \tag{3}$$

where $\alpha = (\alpha_1, \ldots, \alpha_\ell)^T$ (for details see [7]). Normalizing the solutions $V^k$, i.e. $(V^k \cdot V^k) = 1$, translates into $\lambda_k (\alpha^k \cdot \alpha^k) = 1$ (here $\lambda_k$ denotes the $k$-th Eigenvalue of $K$, i.e. $\ell \lambda$ in (3), since $(V^k \cdot V^k) = (\alpha^k \cdot K \alpha^k)$).
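The Eigenvalue problem (3) and the normalization of the expansion coefficients might be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the helper names `gaussian_kernel` and `kernel_pca_fit` are hypothetical, the data are random, and the mapped data are treated as centered in $F$, as the text assumes for simplicity. The variable `mu` holds Eigenvalues of $K$, which correspond to $\ell \lambda$ in (3).

```python
import numpy as np

def gaussian_kernel(A, B, c):
    """k(x, y) = exp(-||x - y||^2 / c) for all rows of A against all rows of B."""
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / c)

def kernel_pca_fit(X, c, n_components):
    """Solve K alpha = mu alpha and scale alpha so that (V . V) = 1."""
    K = gaussian_kernel(X, X, c)
    mu, A = np.linalg.eigh(K)            # ascending Eigenvalues of K
    mu = mu[::-1][:n_components]         # keep the largest ones
    A = A[:, ::-1][:, :n_components]
    # Unit-norm Eigenvectors a of K become alpha = a / sqrt(mu),
    # which gives alpha^T K alpha = 1, i.e. unit length of V in F.
    alphas = A / np.sqrt(mu)
    return alphas, mu, K

# Hypothetical data: 30 random points in the plane.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
alphas, mu, K = kernel_pca_fit(X, c=1.0, n_components=5)

# Gram matrix of the Eigenvectors V^k in feature space (should be the identity).
gram_V = alphas.T @ K @ alphas
```

Only the leading Eigenvalues are used here, so dividing by `np.sqrt(mu)` is well behaved; a full decomposition of an ill-conditioned kernel matrix would need a cutoff for Eigenvalues close to zero.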
To extract nonlinear principal components for the $\Phi$-image of a test point $x$ we compute the projection onto the $k$-th component by $\beta_k := (V^k \cdot \Phi(x)) = \sum_{i=1}^{\ell} \alpha_i^k k(x, x_i)$. For feature extraction, we thus have to evaluate $\ell$ kernel functions instead of a dot product in $F$, which is expensive if $F$ is high-dimensional (or, as for Gaussian kernels, infinite-dimensional). To reconstruct the $\Phi$-image of a vector $x$ from its projections $\beta_k$ onto the first $n$ principal components in $F$ (assuming that the Eigenvectors are ordered by decreasing Eigenvalue size), we define a projection operator $P_n$ by

$$P_n \Phi(x) = \sum_{k=1}^{n} \beta_k V^k \tag{4}$$

If $n$ is large enough to take into account all directions belonging to Eigenvectors with nonzero Eigenvalue, we have $P_n \Phi(x_i) = \Phi(x_i)$. Otherwise (kernel) PCA still satisfies (i) that the overall squared reconstruction error $\sum_i \| P_n \Phi(x_i) - \Phi(x_i) \|^2$ is minimal and (ii) the retained variance is maximal among all projections onto orthogonal directions in $F$.

¹ For simplicity, we assume that the mapped data are centered in $F$. Otherwise, we have to go through the same algebra using $\tilde{\Phi}(x) := \Phi(x) - \langle \Phi(x_i) \rangle$.

In common applications, however, we are interested in a reconstruction in input space rather than in $F$. The present work attempts to achieve this by computing a vector $z$ satisfying $\Phi(z) = P_n \Phi(x)$. The hope is that for the kernel used, such a $z$ will be a good approximation of $x$ in input space. However, (i) such a $z$ will not always exist and (ii) if it exists, it need not be unique.² As an example for (i), we consider a possible representation of $F$. One can show [7] that $\Phi$ can be thought of as a map $\Phi(x) = k(x, \cdot)$ into a Hilbert space $\mathcal{H}_k$ of functions $\sum_i \alpha_i k(x_i, \cdot)$ with a dot product satisfying $(k(x, \cdot) \cdot k(y, \cdot)) = k(x, y)$. Then $\mathcal{H}_k$ is called reproducing kernel Hilbert space (e.g. [6]). Now, for a Gaussian kernel, $\mathcal{H}_k$ contains all linear superpositions of Gaussian bumps on $\mathbb{R}^N$ (plus limit points), whereas by definition of $\Phi$ only single bumps $k(x, \cdot)$ have pre-images under $\Phi$. When the vector $P_n \Phi(x)$ has no pre-image $z$ we try to approximate it by minimizing

$$\rho(z) = \| \Phi(z) - P_n \Phi(x) \|^2 \tag{5}$$

This is a special case of the reduced set method [2]. Replacing terms independent of $z$ by $\Omega$, we obtain

$$\rho(z) = \| \Phi(z) \|^2 - 2 (\Phi(z) \cdot P_n \Phi(x)) + \Omega \tag{6}$$

Substituting (4) and (2) into (6), we arrive at an expression which is written in terms of dot products. Consequently, we can introduce a kernel to obtain a formula for $\rho$ (and thus $\nabla_z \rho$) which does not rely on carrying out $\Phi$ explicitly:

$$\rho(z) = k(z, z) - 2 \sum_{k=1}^{n} \beta_k \sum_{i=1}^{\ell} \alpha_i^k k(z, x_i) + \Omega \tag{7}$$

3 Pre-Images for Gaussian Kernels

To optimize (7) we employed standard gradient descent methods. If we restrict our attention to kernels of the form $k(x, y) = k(\|x - y\|^2)$ (and thus satisfying $k(x, x) \equiv \text{const.}$ for all $x$), an optimal $z$ can be determined as follows (cf. [8]): we deduce from (6) that we have to maximize

$$\rho(z) = (\Phi(z) \cdot P_n \Phi(x)) + \Omega' = \sum_{i=1}^{\ell} \gamma_i k(z, x_i) + \Omega' \tag{8}$$

where we set $\gamma_i = \sum_{k=1}^{n} \beta_k \alpha_i^k$ (for some $\Omega'$ independent of $z$). For an extremum, the gradient with respect to $z$ has to vanish: $\nabla_z \rho(z) = \sum_{i=1}^{\ell} \gamma_i k'(\|z - x_i\|^2)(z - x_i) = 0$. This leads to a necessary condition for the extremum: $z = \sum_i \delta_i x_i / \sum_j \delta_j$, with $\delta_i = \gamma_i k'(\|z - x_i\|^2)$. For a Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2 / c)$ we get

$$z = \frac{\sum_{i=1}^{\ell} \gamma_i \exp(-\|z - x_i\|^2 / c) \, x_i}{\sum_{i=1}^{\ell} \gamma_i \exp(-\|z - x_i\|^2 / c)} \tag{9}$$

We note that the denominator equals $(\Phi(z) \cdot P_n \Phi(x))$ (cf. (8)). Making the assumption that $P_n \Phi(x) \neq 0$, we have $(\Phi(x) \cdot P_n \Phi(x)) = (P_n \Phi(x) \cdot P_n \Phi(x)) > 0$.
As $k$ is smooth, we conclude that there exists a neighborhood of the extremum of (8) in which the denominator of (9) is $\neq 0$. Thus we can devise an iteration scheme for $z$ by

$$z_{t+1} = \frac{\sum_{i=1}^{\ell} \gamma_i \exp(-\|z_t - x_i\|^2 / c) \, x_i}{\sum_{i=1}^{\ell} \gamma_i \exp(-\|z_t - x_i\|^2 / c)} \tag{10}$$

Numerical instabilities related to $(\Phi(z) \cdot P_n \Phi(x))$ being small can be dealt with by restarting the iteration with a different starting value. Furthermore we note that any fixed-point of (10) will be a linear combination of the kernel PCA training data $x_i$. If we regard (10) in the context of clustering we see that it resembles an iteration step for the estimation of the center of a single Gaussian cluster. The weights or 'probabilities' $\gamma_i$ reflect the (anti-) correlation between the amount of $\Phi(x)$ in Eigenvector direction $V^k$ and the contribution of $\Phi(x_i)$ to this Eigenvector. So the 'cluster center' $z$ is drawn towards training patterns with positive $\gamma_i$ and pushed away from those with negative $\gamma_i$, i.e. for a fixed-point $z_\infty$ the influence of training patterns with smaller distance to $x$ will tend to be bigger.

² If the kernel allows reconstruction of the dot-product in input space, and under the assumption that a pre-image exists, it is possible to construct it explicitly (cf. [7]). But clearly, these conditions do not hold true in general.

4 Experiments

To test the feasibility of the proposed algorithm, we ran several toy and real world experiments. They were performed using (10) and Gaussian kernels of the form $k(x, y) = \exp(-\|x - y\|^2 / (nc))$ where $n$ equals the dimension of input space. We mainly focused on the application of de-noising, which differs from reconstruction by the fact that we are allowed to make use of the original test data as starting points in the iteration.

Toy examples: In the first experiment (table 1), we generated a data set from eleven Gaussians in $\mathbb{R}^{10}$ with zero mean and variance $\sigma^2$ in each component, by selecting from each source 100 points as a training set and 33 points for a test set (centers of the Gaussians randomly chosen in $[-1, 1]^{10}$). Then we applied kernel PCA to the training set and computed the projections $\beta_k$ of the points in the test set. With these, we carried out de-noising, yielding an approximate pre-image in $\mathbb{R}^{10}$ for each test point. This procedure was repeated for different numbers of components in reconstruction, and for different values of $\sigma$. For the kernel, we used $c = 2\sigma^2$. We compared the results provided by our algorithm to those of linear PCA via the mean squared distance of all de-noised test points to their corresponding center. Table 1 shows the ratio of these values; here and below, ratios larger than one indicate that kernel PCA performed better than linear PCA. For almost every choice of $n$ and $\sigma$, kernel PCA did better.
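The de-noising procedure used in these experiments, i.e. computing the projections $\beta_k$, forming the weights $\gamma_i$, and running the fixed-point iteration (10) from the noisy point itself, might be sketched as follows. This is a hedged sketch, not the authors' code. The function names, the half-circle toy data and the stopping rule are assumptions made for illustration; the kernel is $\exp(-\|x - y\|^2 / c)$ as in (10) (the experiments additionally scale $c$ by the input dimension); the mapped data are treated as centered in $F$; and a vanishing denominator simply stops the iteration here, whereas the text suggests restarting from a different starting value.

```python
import numpy as np

def gaussian_kernel(A, B, c):
    """k(x, y) = exp(-||x - y||^2 / c) for all rows of A against all rows of B."""
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / c)

def fit_kernel_pca(X, c, n_components):
    """Expansion coefficients alpha^k of the leading Eigenvectors, unit norm in F."""
    mu, A = np.linalg.eigh(gaussian_kernel(X, X, c))
    mu = mu[::-1][:n_components]
    A = A[:, ::-1][:, :n_components]
    return A / np.sqrt(mu)

def preimage_fixed_point(x, X, alphas, c, n_iter=100, tol=1e-8):
    """Approximate pre-image of P_n Phi(x) via the fixed-point iteration (10)."""
    x = np.asarray(x, dtype=float)
    beta = gaussian_kernel(x[None, :], X, c)[0] @ alphas   # projections beta_k
    gamma = alphas @ beta                                  # gamma_i = sum_k beta_k alpha_i^k
    z = x.copy()                                           # de-noising: start at x itself
    for _ in range(n_iter):
        w = gamma * gaussian_kernel(z[None, :], X, c)[0]
        denom = w.sum()
        if abs(denom) < 1e-12:
            break  # sketch only: stop instead of restarting elsewhere
        z_new = w @ X / denom
        converged = np.linalg.norm(z_new - z) < tol
        z = z_new
        if converged:
            break
    return z

# Hypothetical toy data: a noisy half circle in the plane.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, np.pi, size=100)
X_train = (np.column_stack([np.cos(angles), np.sin(angles)])
           + 0.05 * rng.normal(size=(100, 2)))
alphas = fit_kernel_pca(X_train, c=0.5, n_components=4)

x_noisy = np.array([0.0, 1.15])
z = preimage_fixed_point(x_noisy, X_train, alphas, c=0.5)
```

If all components of a well-conditioned kernel matrix are retained, the weights for a training point $x_j$ reduce to $\gamma_i = \delta_{ij}$ and the iteration returns $x_j$ itself, in line with $P_n \Phi(x_i) = \Phi(x_i)$ for large enough $n$.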
Note that using all components, linear PCA is just a basis transformation and hence cannot de-noise. The extreme superiority of kernel PCA for small $\sigma$ is due to the fact that all test points are in this case located close to the eleven spots in input space, and linear PCA has to cover them with fewer than ten directions. Kernel PCA moves each point to the correct source even when using only a small number of components.

Table 1: De-noising Gaussians in $\mathbb{R}^{10}$ (see text). Performance ratios larger than one indicate how much better kernel PCA did, compared to linear PCA, for different choices of the Gaussians' std. dev. $\sigma$, and different numbers of components used in reconstruction.

To get some intuitive understanding in a low-dimensional case, figure 1 depicts the results of de-noising a half circle and a square in the plane, using kernel PCA, a nonlinear autoencoder, principal curves, and linear PCA. The principal curves algorithm [4] iteratively estimates a curve capturing the structure of the data. The data are projected to the closest point on a curve which the algorithm tries to construct such that each point is the average of all data points projecting onto it. It can be shown that the only straight lines satisfying the latter are principal components, so principal curves are a generalization of the latter. The algorithm uses a smoothing parameter which is annealed during the iteration. In the nonlinear autoencoder algorithm, a 'bottleneck' 5-layer network is trained to reproduce the input values as outputs (i.e. it is used in autoassociative mode). The hidden unit activations in the third layer form a lower-dimensional representation of the data, closely related to PCA (see for instance [3]). Training is done by conjugate gradient descent. In all algorithms, parameter values were selected such that the best possible de-noising result was obtained. The figure shows that on the closed square problem, kernel PCA does (subjectively) best, followed by principal curves and the nonlinear autoencoder; linear PCA fails completely. However, note that all algorithms except for kernel PCA actually provide an explicit one-dimensional parameterization of the data, whereas kernel PCA only provides us with a means of mapping points to their de-noised versions (in this case, we used four kernel PCA features, and hence obtain a four-dimensional parameterization).

Figure 1: De-noising in 2-d (see text). Panels: kernel PCA, nonlinear autoencoder, principal curves, linear PCA. Depicted are the data set (small points) and its de-noised version (big points, joining up to solid lines). For linear PCA, we used one component for reconstruction, as using two components, reconstruction is perfect and thus does not de-noise. Note that all algorithms except for our approach have problems in capturing the circular structure in the bottom example.

USPS example: To test our approach on real-world data, we also applied the algorithm to the USPS database of 256-dimensional handwritten digits. For each of the ten digits, we randomly chose 300 examples from the training set and 50 examples from the test set. We used (10) and Gaussian kernels with $c = 0.50$, equaling twice the average of the data's variance in each dimension. In figure 2, we give two possible depictions of the Eigenvectors found by kernel PCA, compared to those found by linear PCA for the USPS set.

Figure 2: Visualization of Eigenvectors (see text). Depicted are the $2^0, \ldots, 2^8$-th Eigenvector (from left to right). First row: linear PCA, second and third row: different visualizations for kernel PCA.
The second row shows the approximate pre-images of the Eigenvectors $V^k$, $k = 2^0, \ldots, 2^8$, found by our algorithm. In the third row each image is computed as follows: Pixel $i$ is the projection of the $\Phi$-image of the $i$-th canonical basis vector in input space onto the corresponding Eigenvector in feature space (upper left $\Phi(e_1) \cdot V^k$, lower right $\Phi(e_{256}) \cdot V^k$). In the linear case, both methods would simply yield the Eigenvectors of linear PCA depicted in the first row; in this sense, they may be considered as generalized Eigenvectors in input space. We see that the first Eigenvectors are almost identical (except for signs). But we also see that Eigenvectors in linear PCA start to concentrate on high-frequency structures already at smaller Eigenvalue size. To understand this, note that in linear PCA we only have a maximum number of 256 Eigenvectors, contrary to kernel PCA which gives us the number of training examples (here 3000) possible Eigenvectors. This also explains some of the results we found when working with the USPS set.

Figure 3: Reconstruction of USPS data. Depicted are the reconstructions of the first digit in the test set (original in last column) from the first $n = 1, \ldots, 20$ components for linear PCA (first row) and kernel PCA (second row). The numbers in between denote the fraction of squared distance measured towards the original example. For a small number of components both algorithms do nearly the same. For more components, we see that linear PCA yields a result resembling the original digit, whereas kernel PCA gives a result resembling a more prototypical 'three'.

In these experiments, linear and kernel PCA were trained with the original data. Then we added (i) additive Gaussian noise with zero mean and standard deviation $\sigma = 0.5$ or (ii) 'speckle' noise with probability $p = 0.4$ (i.e. each pixel flips to black or white with probability $p/2$) to the test set. For the noisy test sets we computed the projections onto the first $n$ linear and nonlinear components, and carried out reconstruction for each case. The results were compared by taking the mean squared distance of each reconstructed digit from the noisy test set to its original counterpart. As a third experiment we did the same for the original test set (hence doing reconstruction, not de-noising). In the latter case, where the task is to reconstruct a given example as exactly as possible, linear PCA did better, at least when using more than about 10 components (figure 3). This is due to the fact that linear PCA starts earlier to account for fine structures, but at the same time it starts to reconstruct noise, as we will see in figure 4. Kernel PCA, on the other hand, yields recognizable results even for a small number of components, representing a prototype of the desired example. This is one reason why our approach did better than linear PCA for the de-noising example (figure 4).
Taking the mean squared distance measured over the whole test set for the optimal number of components in linear and kernel PCA, our approach did better by a factor of 1.6 for the Gaussian noise, and 1.2 times better for the 'speckle' noise (the optimal number of components were 32 in linear PCA, and 512 and 256 in kernel PCA, respectively). Taking identical numbers of components in both algorithms, kernel PCA becomes up to 8 (!) times better than linear PCA. However, note that kernel PCA comes with a higher computational complexity.

Figure 4: De-Noising of USPS data (see text). The left half shows: top: the first occurrence of each digit in the test set, second row: the upper digit with additive Gaussian noise ($\sigma = 0.5$), following five rows: the reconstruction for linear PCA using $n = 1, 4, 16, 64, 256$ components, and, last five rows: the results of our approach using the same number of components. In the right half we show the same but for 'speckle' noise with probability $p = 0.4$.

5 Discussion

We have studied the problem of finding approximate pre-images of vectors in feature space, and proposed an algorithm to solve it. The algorithm can be applied to both reconstruction and de-noising. In the former case, results were comparable to linear PCA, while in the latter case, we obtained significantly better results. Our interpretation of this finding is as follows. Linear PCA can extract at most $N$ components, where $N$ is the dimensionality of the data. Being a basis transform, all $N$ components together fully describe the data. If the data are noisy, this implies that a certain fraction of the components will be devoted to the extraction of noise. Kernel PCA, on the other hand, allows the extraction of up to $\ell$ features, where $\ell$ is the number of training examples. Accordingly, kernel PCA can provide a larger number of features carrying information about the structure in the data (in our experiments, we had $\ell > N$). In addition, if the structure to be extracted is nonlinear, then linear PCA must necessarily fail, as we have illustrated with toy examples.

These methods, along with depictions of pre-images of vectors in feature space, provide some understanding of kernel methods which have recently attracted increasing attention. Open questions include (i) what kind of results kernels other than Gaussians will provide, (ii) whether there is a more efficient way to solve either (6) or (8), and (iii) the comparison (and connection) to alternative nonlinear de-noising methods (cf. [5]).

References

[1] B. Boser, I. Guyon, and V.N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proc. COLT, pages , Pittsburgh, ACM Press.
[2] C.J.C. Burges. Simplified support vector decision rules. In L. Saitta, editor, Proceedings, 13th ICML, pages 71-77, San Mateo, CA,
[3] K.I. Diamantaras and S.Y. Kung. Principal Component Neural Networks. Wiley, New York, 1996.
[4] T. Hastie and W. Stuetzle. Principal curves. JASA, 84: , 1989.
[5] S. Mallat and Z. Zhang. Matching Pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12): , December
[6] S. Saitoh. Theory of Reproducing Kernels and its Applications. Longman Scientific & Technical, Harlow, England,
[7] B. Schölkopf. Support vector learning. Oldenbourg Verlag, Munich,
[8] B. Schölkopf, P. Knirsch, A. Smola, and C. Burges. Fast approximation of support vector kernel expansions, and an interpretation of clustering as approximation in feature spaces. In P. Levi et al., editor, DAGM'98, pages , Berlin, Springer.
[9] B. Schölkopf, A.J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10: , 1998.



