Kernel Trick Embedded Gaussian Mixture Model
Jingdong Wang, Jianguo Lee, and Changshui Zhang
State Key Laboratory of Intelligent Technology and Systems, Department of Automation, Tsinghua University, Beijing, P. R. China

Abstract. In this paper, we present a kernel trick embedded Gaussian Mixture Model (GMM), called kernel GMM. The basic idea is to embed the kernel trick into the EM algorithm and deduce a parameter estimation algorithm for GMM in feature space. Kernel GMM can be viewed as a Bayesian Kernel Method. Compared with most classical kernel methods, the proposed method can solve problems in a probabilistic framework. Moreover, it can tackle nonlinear problems better than the traditional GMM. To avoid the great computational cost that most kernel methods incur on large scale data sets, we also employ a Monte Carlo sampling technique to speed up kernel GMM, making it more practical and efficient. Experimental results on synthetic and real-world data sets demonstrate that the proposed approach performs satisfactorily.

1 Introduction

The kernel trick is an efficient method for nonlinear data analysis, used early on by the Support Vector Machine (SVM) [18]. It has been pointed out that the kernel trick can be used to develop a nonlinear generalization of any algorithm that can be cast in terms of dot products. In recent years, the kernel trick has been successfully introduced into various machine learning algorithms, such as Kernel Principal Component Analysis (Kernel PCA) [14,15], Kernel Fisher Discriminant (KFD) [11], Kernel Independent Component Analysis (Kernel ICA) [7] and so on. However, in many cases we need to obtain risk minimization results and to incorporate prior knowledge, both of which are easily provided within a Bayesian probabilistic framework. This motivates the combination of the kernel trick and Bayesian methods, which is called the Bayesian Kernel Method [16].
Since the Bayesian Kernel Method works in a probabilistic framework, it can realize Bayesian optimal decisions and estimate confidence or reliability easily with probabilistic criteria such as Maximum-A-Posteriori [5]. Recently some research has been done in this field. Kwok combined the evidence framework with SVM [10]; Gestel et al. [8] incorporated the Bayesian framework with SVM and KFD. Both of these works apply the Bayesian framework to known kernel methods. On the other hand, some researchers have proposed

R. Gavaldà et al. (Eds.): ALT 2003, LNAI 2842. © Springer-Verlag Berlin Heidelberg 2003
new Bayesian methods with the kernel trick embedded, among which one of the most influential works is the Relevance Vector Machine (RVM) proposed by Tipping [17].

This paper also addresses the problem of the Bayesian Kernel Method. The proposed method embeds the kernel trick into the Expectation-Maximization (EM) algorithm [3] and deduces a new parameter estimation algorithm for the Gaussian Mixture Model (GMM) in feature space. The entire model is called the kernel Gaussian Mixture Model (kGMM).

The rest of this paper is organized as follows. Section 2 reviews some background knowledge, and Section 3 describes the kernel Gaussian Mixture Model and the corresponding parameter estimation algorithm. Experiments and results are presented in Section 4. Conclusions are drawn in the final section.

2 Preliminaries

In this section, we review some background knowledge including the kernel trick, GMM based on the EM algorithm, and the Bayesian Kernel Method.

2.1 Kernel Trick

The Mercer kernel trick was first applied in SVM. The idea is that we can implicitly map input data into a high dimensional feature space via a nonlinear function:

$$\Phi : X \to H,\quad x \mapsto \varphi(x)\tag{1}$$

A similarity measure is then defined from the dot product in the space $H$ as follows:

$$k(x, x') = \left(\varphi(x)\cdot\varphi(x')\right)\tag{2}$$

where the kernel function $k$ should satisfy Mercer's condition [18]. This allows us to handle learning algorithms using linear algebra and analytic geometry. Generally speaking, on the one hand the kernel trick lets us deal with data in the high-dimensional dot product space $H$, named the feature space, through the map associated with $k$. On the other hand, it avoids the expensive computation in feature space by employing the kernel function $k$ instead of directly computing dot products in $H$. Being an elegant way of performing nonlinear analysis, the kernel trick has been used in many other algorithms such as Kernel Fisher Discriminant [11], Kernel PCA [14,15], Kernel ICA [7] and so on.
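As a concrete illustration (our own sketch, not from the paper), the kernel trick replaces an explicit feature map by kernel evaluations; below are Gram matrices for the two kernels used later in the experiments, the degree-2 polynomial kernel and the RBF kernel:

```python
import numpy as np

def polynomial_kernel(X, Y, degree=2):
    # k(x, x') = (x . x')^degree: implicit map into monomial features
    return (X @ Y.T) ** degree

def rbf_kernel(X, Y, gamma=0.0015):
    # k(x, x') = exp(-gamma * ||x - x'||^2)
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * sq)

X = np.random.default_rng(0).standard_normal((5, 3))
K_poly = polynomial_kernel(X, X)
K_rbf = rbf_kernel(X, X)
```

Both matrices are symmetric and positive semi-definite, which is exactly what Mercer's condition guarantees for valid kernels.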
2.2 GMM Based on the EM Algorithm

GMM is a kind of mixture density model which assumes that each component of the probabilistic model is a Gaussian density. That is to say:
$$p(x\mid\Theta)=\sum_{i=1}^{M}\alpha_i\,G_i(x\mid\theta_i)\tag{3}$$

where $x\in\mathbb{R}^d$ is a random variable, the parameters $\Theta=(\alpha_1,\dots,\alpha_M;\theta_1,\dots,\theta_M)$ satisfy $\sum_{i=1}^{M}\alpha_i=1$, $\alpha_i\ge 0$, and $G_i(x\mid\theta_i)$ is a Gaussian probability density function:

$$G_l(x\mid\theta_l)=\frac{1}{(2\pi)^{d/2}|\Sigma_l|^{1/2}}\exp\Big\{-\frac12(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l)\Big\}\tag{4}$$

where $\theta_l=(\mu_l,\Sigma_l)$. GMM can be viewed as a generative model [12] or a latent variable model [6] that assumes the data set $X=\{x_i\}_{i=1}^{N}$ is generated by $M$ Gaussian components, and introduces latent variables $Z=\{z_i\}_{i=1}^{N}$ whose values indicate which component generates each datum: if sample $x_i$ is generated by the $l$-th component, then $z_i=l$. The parameters of GMM can then be estimated by the EM algorithm [2]. The EM algorithm for GMM is an iterative procedure which estimates the new parameters in terms of the old parameters via the following update formulas:

$$\alpha_l^{(t)}=\frac{1}{N}\sum_{i=1}^{N}p(l\mid x_i,\Theta^{(t-1)}),\qquad
\mu_l^{(t)}=\frac{\sum_{i=1}^{N}x_i\,p(l\mid x_i,\Theta^{(t-1)})}{\sum_{i=1}^{N}p(l\mid x_i,\Theta^{(t-1)})},$$
$$\Sigma_l^{(t)}=\frac{\sum_{i=1}^{N}(x_i-\mu_l^{(t)})(x_i-\mu_l^{(t)})^T\,p(l\mid x_i,\Theta^{(t-1)})}{\sum_{i=1}^{N}p(l\mid x_i,\Theta^{(t-1)})}\tag{5}$$

where $l$ indexes the $l$-th Gaussian component, and

$$p(l\mid x_i,\Theta^{(t-1)})=\frac{\alpha_l^{(t-1)}G(x_i\mid\theta_l^{(t-1)})}{\sum_{j=1}^{M}\alpha_j^{(t-1)}G(x_i\mid\theta_j^{(t-1)})},\quad l=1,\dots,M\tag{6}$$

Here $\Theta^{(t-1)}=(\alpha_1^{(t-1)},\dots,\alpha_M^{(t-1)};\theta_1^{(t-1)},\dots,\theta_M^{(t-1)})$ are the parameters of the $(t-1)$-th iteration and $\Theta^{(t)}=(\alpha_1^{(t)},\dots,\alpha_M^{(t)};\theta_1^{(t)},\dots,\theta_M^{(t)})$ are the parameters of the $t$-th iteration.

GMM has been successfully applied in many fields, such as parametric clustering and density estimation. However, it cannot give a simple yet satisfactory clustering result on data sets with complex structure [13], as shown in Figure 1. One alternative is to perform GMM based clustering in another space instead of the original data space.
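The update formulas (5) and (6) amount to a short NumPy loop; the following is a minimal sketch of the standard EM iteration (our own illustration, not the paper's code, with a small diagonal regularizer added for numerical stability):

```python
import numpy as np

def gauss_pdf(X, mu, Sigma):
    """Multivariate Gaussian density, equation (4), evaluated row-wise."""
    d = X.shape[1]
    L = np.linalg.cholesky(Sigma)
    z = np.linalg.solve(L, (X - mu).T)              # L z = (x - mu)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return np.exp(-0.5 * np.sum(z**2, axis=0)
                  - 0.5 * (d * np.log(2 * np.pi) + logdet))

def em_gmm(X, M, iters=50, seed=0):
    """Plain EM for an M-component GMM, following (5) and (6)."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    alpha = np.full(M, 1.0 / M)
    mu = X[rng.choice(N, size=M, replace=False)].copy()
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(M)])
    for _ in range(iters):
        # E-step: posteriors p(l | x_i, Theta), equation (6)
        dens = np.stack([alpha[l] * gauss_pdf(X, mu[l], Sigma[l])
                         for l in range(M)], axis=1)        # (N, M)
        p = dens / dens.sum(axis=1, keepdims=True)
        # M-step: updates of alpha, mu, Sigma, equation (5)
        Nl = p.sum(axis=0)
        alpha = Nl / N
        mu = (p.T @ X) / Nl[:, None]
        for l in range(M):
            Xc = X - mu[l]
            Sigma[l] = (p[:, l, None] * Xc).T @ Xc / Nl[l] + 1e-6 * np.eye(d)
    return alpha, mu, Sigma, p

# toy usage: two well-separated 2-D blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3.0, 1.0, (100, 2)),
               rng.normal(3.0, 1.0, (100, 2))])
alpha, mu, Sigma, p = em_gmm(X, M=2)
```

It is exactly this procedure that the next sections lift into feature space via the kernel trick.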
Fig. 1. Data set of two concentric circles with 1,000 samples; points with different markers belong to the two clusters. (a) is the partition result of traditional GMM; (b) is the result achieved by kGMM using a polynomial kernel of degree 2; (c) shows the probability that each point belongs to the outer circle. The whiter the point, the higher the probability.

2.3 Bayesian Kernel Method

The Bayesian Kernel Method can be viewed as a combination of Bayesian methods and kernel methods, and inherits merits from both: it can tackle problems nonlinearly like a kernel method, and it obtains estimation results within a probabilistic framework like classical Bayesian methods. Much work has been done in this field, along three typical lines:

- interpreting known kernel methods such as SVM in a Bayesian framework, as in [10,8];
- employing kernel methods within traditional Bayesian methods, such as Gaussian Processes and Laplacian Processes [16];
- proposing new methods in a Bayesian framework with the kernel trick embedded, such as the Relevance Vector Machine (RVM) [17] and the Bayes Point Machine [9].

We intend to embed the kernel trick into the Gaussian Mixture Model; this work belongs to the second category of Bayesian Kernel Methods.

3 Kernel Trick Embedded GMM

As mentioned before, the Gaussian Mixture Model cannot obtain simple yet satisfactory results on data sets with complex structure, so we consider employing the kernel trick to realize a Bayesian kernel version of GMM. Our basic idea is to embed the kernel trick into the parameter estimation procedure of GMM. In this section, we first describe GMM in feature space, then present some properties of the feature space, then formulate the kernel Gaussian Mixture Model and the corresponding parameter estimation algorithm, and finally discuss the algorithm.
3.1 GMM in Feature Space

GMM in the feature space induced by a map $\varphi(\cdot)$ associated with the kernel function $k$ can easily be rewritten as

$$p(\varphi(x)\mid\Theta)=\sum_{i=1}^{M}\alpha_i\,G(\varphi(x)\mid\theta_i)\tag{7}$$

and the EM update formulas (5) and (6) can be replaced by the following:

$$\alpha_l^{(t)}=\frac{1}{N}\sum_{i=1}^{N}p(l\mid\varphi(x_i),\Theta^{(t-1)}),\qquad
\mu_l^{(t)}=\frac{\sum_{i=1}^{N}\varphi(x_i)\,p(l\mid\varphi(x_i),\Theta^{(t-1)})}{\sum_{i=1}^{N}p(l\mid\varphi(x_i),\Theta^{(t-1)})},$$
$$\Sigma_l^{(t)}=\frac{\sum_{i=1}^{N}(\varphi(x_i)-\mu_l^{(t)})(\varphi(x_i)-\mu_l^{(t)})^T\,p(l\mid\varphi(x_i),\Theta^{(t-1)})}{\sum_{i=1}^{N}p(l\mid\varphi(x_i),\Theta^{(t-1)})}\tag{8}$$

where $l$ indexes the $l$-th Gaussian component, and

$$p(l\mid\varphi(x_i),\Theta^{(t-1)})=\frac{\alpha_l^{(t-1)}G(\varphi(x_i)\mid\theta_l^{(t-1)})}{\sum_{j=1}^{M}\alpha_j^{(t-1)}G(\varphi(x_i)\mid\theta_j^{(t-1)})},\quad l=1,\dots,M\tag{9}$$

However, computing GMM directly with formulas (8) and (9) in a high dimensional feature space is computationally expensive and thus impractical. We employ the kernel trick to overcome this difficulty. In the following section, we give some properties based on the Mercer kernel trick for estimating the GMM parameters in feature space.

3.2 Properties in Feature Space

For convenience, we first give notation for the feature space and then present three properties.

Notation. In all formulas, bold capital letters denote matrices, bold italic letters denote vectors, and italic lower case letters denote scalars. The subscript $l$ indexes the $l$-th Gaussian component; the superscript $t$ indexes the $t$-th iteration of the EM procedure. $A^T$ denotes the transpose of matrix $A$. Other notation is shown in Table 1.
Table 1. Notation List

- $p_{li}^{(t)} = p(l\mid\varphi(x_i),\Theta^{(t)})$: posterior probability that $\varphi(x_i)$ belongs to the $l$-th component.
- $w_{li}^{(t)}$: weight of sample $i$ in component $l$; $(w_{li}^{(t)})^2 = p_{li}^{(t)}\big/\sum_{j=1}^{N}p_{lj}^{(t)}$ represents the ratio by which $\varphi(x_i)$ is occupied by the $l$-th Gaussian component.
- $\mu_l^{(t)} = \sum_{i=1}^{N}\varphi(x_i)(w_{li}^{(t)})^2$: mean vector of the $l$-th Gaussian component.
- $\tilde\varphi_l^{(t)}(x_i) = \varphi(x_i)-\mu_l^{(t)}$: centered image of $\varphi(x_i)$.
- $\Sigma_l^{(t)} = \sum_{i=1}^{N}\tilde\varphi_l(x_i)\tilde\varphi_l(x_i)^T (w_{li}^{(t)})^2$: covariance matrix of the $l$-th Gaussian component.
- $(K_l^{(t)})_{ij} = \big(w_{li}\varphi(x_i)\cdot w_{lj}\varphi(x_j)\big)$: kernel matrix.
- $(\tilde K_l^{(t)})_{ij} = \big(w_{li}\tilde\varphi_l(x_i)\cdot w_{lj}\tilde\varphi_l(x_j)\big)$: centered kernel matrix.
- $\big((K_l^{(t)})'\big)_{ij} = \big(\varphi(x_i)\cdot w_{lj}\varphi(x_j)\big)$: projecting kernel matrix.
- $\big((\tilde K_l^{(t)})'\big)_{ij} = \big(\tilde\varphi_l(x_i)\cdot w_{lj}\tilde\varphi_l(x_j)\big)$: centered projecting kernel matrix.
- $\lambda_{le}^{(t)},\,V_{le}^{(t)}$: eigenvalue and eigenvector of $\Sigma_l^{(t)}$.
- $\lambda_{le}^{(t)},\,\beta_{le}^{(t)}$: eigenvalue and eigenvector of $\tilde K_l^{(t)}$.

Properties in Feature Space. According to Lemma 1 in the Appendix, we obtain the first property.

[Property 1] The centered kernel matrix $\tilde K_l^{(t)}$ and the centered projecting kernel matrix $(\tilde K_l^{(t)})'$ are computed as

$$\tilde K_l^{(t)}=K_l^{(t)}-W_l^{(t)}K_l^{(t)}-K_l^{(t)}W_l^{(t)}+W_l^{(t)}K_l^{(t)}W_l^{(t)}$$
$$(\tilde K_l^{(t)})'=(K_l^{(t)})'-(W_l^{(t)})'K_l^{(t)}-(K_l^{(t)})'W_l^{(t)}+(W_l^{(t)})'K_l^{(t)}W_l^{(t)}\tag{10}$$

where $W_l^{(t)} = w_l^{(t)}(w_l^{(t)})^T$, $(W_l^{(t)})' = \mathbf{1}_N (w_l^{(t)})^T$, $w_l^{(t)} = [w_{l1}^{(t)},\dots,w_{lN}^{(t)}]^T$, and $\mathbf{1}_N$ is the $N$-dimensional column vector with all entries equal to 1. This property gives the way to center the kernel matrix and the projecting kernel matrix.

According to Lemma 2 in the Appendix and Property 1, we obtain the second property.

[Property 2] The feature space covariance matrix $\Sigma_l^{(t)}$ and the centered kernel matrix $\tilde K_l^{(t)}$ have the same nonzero eigenvalues, and the following equivalence holds:

$$\Sigma_l^{(t)}V_{le}^{(t)}=\lambda_e V_{le}^{(t)}\iff \tilde K_l^{(t)}\beta_{le}^{(t)}=\lambda_e\beta_{le}^{(t)}\tag{11}$$

where $V_{le}^{(t)}$ and $\beta_{le}^{(t)}$ are eigenvectors of $\Sigma_l^{(t)}$ and $\tilde K_l^{(t)}$ respectively. This property enables us to compute eigenvectors from the centered kernel matrix instead of from the feature space covariance matrix in Equation (8).
With the first two properties, we do not need to compute the means and covariance matrices in feature space; we perform the eigendecomposition of the centered kernel matrix instead.
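The two properties can be checked numerically with a linear kernel, where the feature map is the identity and explicit centering is possible. The following sketch (ours, not from the paper) verifies both the weighted centering formula (10) and the eigenvalue equivalence (11):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
N = X.shape[0]

# weights w_i normalized so that sum_i w_i^2 = 1 (posterior-derived in the paper)
w = rng.random(N)
w /= np.linalg.norm(w)

# linear kernel: K_ij = (w_i x_i) . (w_j x_j)
K = (w[:, None] * X) @ (w[None, :] * X.T)
W = np.outer(w, w)

# Property 1: centered kernel matrix
K_tilde = K - W @ K - K @ W + W @ K @ W

# explicit feature-space check: mu = sum_i w_i^2 x_i, phi_tilde_i = x_i - mu
mu = (w**2) @ X
Xc = X - mu
K_explicit = (w[:, None] * Xc) @ (w[None, :] * Xc.T)
assert np.allclose(K_tilde, K_explicit)

# Property 2: Sigma = sum_i w_i^2 phi_tilde_i phi_tilde_i^T shares its
# nonzero eigenvalues with K_tilde (both equal A A^T / A^T A for A_i = w_i phi_tilde_i)
Sigma = (w[:, None] * Xc).T @ (w[:, None] * Xc)
ev_S = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
ev_K = np.sort(np.linalg.eigvalsh(K_tilde))[::-1]
assert np.allclose(ev_S, ev_K[:3])
```

The equivalence holds because $\tilde K = AA^T$ and $\Sigma = A^TA$ for the matrix $A$ whose rows are $w_i\tilde\varphi(x_i)$, and $AA^T$ and $A^TA$ always share their nonzero spectrum.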
Feature space has a very high dimensionality, which makes it intractable to compute the complete Gaussian probability density $G(\varphi(x_j)\mid\theta_l)$ directly. However, Moghaddam and Pentland's work [22] offers a way out: it divides the original high dimensional space into two parts, the principal subspace and its orthogonal complement, where the principal subspace has dimension $d_\varphi$. The complete Gaussian density can then be approximated by

$$\hat G(\varphi(x_j)\mid\theta_l)=\left[\frac{\exp\Big(-\frac12\sum_{e=1}^{d_\varphi}\frac{y_e^2}{\lambda_e}\Big)}{(2\pi)^{d_\varphi/2}\prod_{e=1}^{d_\varphi}\lambda_e^{1/2}}\right]\cdot\left[\frac{\exp\Big(-\frac{\varepsilon^2(x_j)}{2\rho}\Big)}{(2\pi\rho)^{(N-d_\varphi)/2}}\right]\tag{12}$$

where $y_e=\tilde\varphi(x_j)^T V_e$ (with $V_e$ the $e$-th eigenvector of $\Sigma_l$), $\rho$ is the weight ratio, and $\varepsilon^2(x_j)$ is the residual reconstruction error. On the right side of (12), the first factor comes from the principal subspace and the second from the orthogonal complement subspace.

The optimal value of $\rho$ is determined by minimizing a cost function. From an information-theoretic point of view, the cost function should be the Kullback-Leibler divergence between the true density $G(\varphi(x_j)\mid\theta_l)$ and its approximation $\hat G(\varphi(x_j)\mid\theta_l)$:

$$J(\rho)=\int G(\varphi(x)\mid\theta_l)\log\frac{G(\varphi(x)\mid\theta_l)}{\hat G(\varphi(x)\mid\theta_l)}\,d\varphi(x)\tag{13}$$

Plugging (12) into the above, it is easily shown that

$$J(\rho)=\frac12\sum_{e=d_\varphi+1}^{N}\Big[\frac{\lambda_e}{\rho}-1+\log\frac{\rho}{\lambda_e}\Big]$$

Solving $\partial J/\partial\rho=0$ yields the optimal value

$$\rho=\frac{1}{N-d_\varphi}\sum_{e=d_\varphi+1}^{N}\lambda_e$$

According to Property 2, $\Sigma_l$ and $\tilde K_l$ have the same nonzero eigenvalues, and since the sum of the eigenvalues of a symmetric matrix equals its trace, we obtain

$$\rho=\frac{1}{N-d_\varphi}\Big[\operatorname{tr}(\Sigma_l)-\sum_{e=1}^{d_\varphi}\lambda_e\Big]=\frac{1}{N-d_\varphi}\Big[\operatorname{tr}(\tilde K_l)-\sum_{e=1}^{d_\varphi}\lambda_e\Big]\tag{14}$$
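A small sketch of this step (our own, with hypothetical function names): keep the top $d_\varphi$ eigen-components of a centered kernel matrix and set $\rho$ to the average of the discarded eigenvalues, computed from the trace as in (14):

```python
import numpy as np

def subspace_density_params(K_tilde, d_phi):
    """Largest d_phi eigenpairs of a centered kernel matrix, plus the
    optimal complement weight rho from equation (14)."""
    N = K_tilde.shape[0]
    evals, evecs = np.linalg.eigh(K_tilde)        # ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]    # descending order
    lam = evals[:d_phi]
    # sum of the discarded eigenvalues = trace minus the kept ones
    rho = (np.trace(K_tilde) - lam.sum()) / (N - d_phi)
    return lam, evecs[:, :d_phi], rho

# toy usage on a random PSD matrix standing in for a centered kernel matrix
A = np.random.default_rng(0).standard_normal((8, 8))
lam, V, rho = subspace_density_params(A @ A.T, 3)
```

For a positive semi-definite matrix the discarded eigenvalues are nonnegative, so $\rho \ge 0$ always holds here.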
The residual reconstruction error

$$\varepsilon^2(x_j)=\|\tilde\varphi(x_j)\|^2-\sum_{e=1}^{d_\varphi}y_e^2\tag{15}$$

can easily be obtained by employing the kernel trick, since $\|\tilde\varphi(x_j)\|^2$ is a centered kernel evaluation. According to Lemma 3 in the Appendix,

$$y_e=\tilde\varphi(x_j)^T V_e=\big(V_e\cdot\tilde\varphi(x_j)\big)=\beta_e^T\Gamma_j\tag{16}$$

where $\Gamma_j$ is the $j$-th column of the centered projecting kernel matrix $\tilde K'$. Note that the centered kernel matrix $\tilde K$ is used to obtain the eigenvalues $\lambda$ and eigenvectors $\beta$, whereas the centered projecting kernel matrix $\tilde K'$ is used to compute $y_e$ as in (16); neither matrix can be omitted in the training procedure.

Employing all these results, we obtain the third property.

[Property 3] The Gaussian probability density function $G(\varphi(x_j)\mid\theta_l)$ can be approximated by $\hat G(\varphi(x_j)\mid\theta_l)$ as shown in (12), where $\rho$, $\varepsilon^2(x_j)$ and $y_e$ are given by (14), (15) and (16) respectively.

We stress that the approximation of $G(\varphi(x_j)\mid\theta_l)$ by $\hat G(\varphi(x_j)\mid\theta_l)$ is complete, since it represents not only the principal subspace but also the orthogonal complement subspace.

From these three properties, we can draw the following conclusions:

- The Mercer kernel trick is introduced to indirectly compute dot products in the high dimensional feature space.
- The probability that a sample belongs to the $l$-th component can be computed through the centered kernel matrix and the centered projecting kernel matrix, instead of through the mean vector and covariance matrix as in traditional GMM.
- Since a full eigendecomposition of the centered kernel matrix is not needed, we can approximate the Gaussian density with only the largest $d_\varphi$ principal components of the centered kernel matrix, and the approximation does not depend strongly on $d_\varphi$, since it is complete and optimal.

With these three properties, we can formulate our kernel Gaussian Mixture Model.

3.3 Kernel GMM and the Parameter Estimation Algorithm

In feature space, kernel matrices replace the mean and covariance matrices of the input space in representing each Gaussian component.
So the parameters of each component are not mean vectors and covariance matrices, but kernel matrices. In fact, it is also intractable to compute mean vectors and covariance matrices in feature space, because the feature space has a very high or even infinite
dimension. Fortunately, with the kernel trick embedded, computing with the kernel matrix is quite feasible, since the dimension of the principal subspace is bounded by the data size $N$. Consequently, the parameters that kernel GMM needs to estimate are the prior probability $\alpha_l^{(t)}$, the centered kernel matrix $\tilde K_l^{(t)}$, the centered projecting kernel matrix $(\tilde K_l^{(t)})'$, and $w_l^{(t)}$ (see Table 1). That is to say, the $M$-component kernel GMM is determined by the parameters $\theta_l=\big(\alpha_l^{(t)},w_l^{(t)},\tilde K_l^{(t)},(\tilde K_l^{(t)})'\big)$, $l=1,\dots,M$. According to the properties in the previous sections, the EM algorithm for parameter estimation of kGMM can be summarized as in Table 2. Assuming the number of Gaussian components is $M$, we initialize the posterior probabilities $p_{li}$ that each sample belongs to each Gaussian component. The algorithm terminates when it converges or when the preset maximum number of iterations is reached.

Table 2. Parameter Estimation Algorithm for kGMM

Step 0. Initialize all $p_{li}^{(0)}$ ($l=1,\dots,M$; $i=1,\dots,N$), set $t=0$, set $t_{\max}$ and the stopping condition to false.
Step 1. While the stopping condition is false, set $t=t+1$ and do Steps 2-7.
Step 2. Compute $\alpha_l^{(t)}$, $w_{li}^{(t)}$, $W_l^{(t)}$, $(W_l^{(t)})'$, $K_l^{(t)}$ and $(K_l^{(t)})'$ according to the notation in Table 1.
Step 3. Compute the matrices $\tilde K_l^{(t)}$, $(\tilde K_l^{(t)})'$ via Property 1.
Step 4. Compute the largest $d_\varphi$ eigenvalues and eigenvectors of the centered kernel matrices $\tilde K_l^{(t)}$.
Step 5. Compute $\hat G(\varphi(x_j)\mid\theta_l^{(t)})$ via Property 3.
Step 6. Compute all posterior probabilities $p_{li}^{(t)}$ via (9).
Step 7. Test the stopping condition: if $t>t_{\max}$ or $\sum_{l=1}^{M}\sum_{i=1}^{N}\big(p_{li}^{(t)}-p_{li}^{(t-1)}\big)^2<\varepsilon$, set the stopping condition to true; otherwise loop back to Step 1.

3.4 Discussion of the Algorithm

Computational Cost and Speedup Techniques for Large Scale Problems. With the kernel trick, the computational cost of kernel eigendecomposition based methods is dominated by the eigendecomposition step. The cost therefore depends mainly on the size of the kernel matrix, i.e. the size of the data set. If the size N is not very large (e.g.
N ≤ 1,000), it is not a problem to obtain the full eigendecomposition. If N is large, however, the computation becomes prohibitive: for N > 5,000 it is impossible to finish a full eigendecomposition within hours even on the fastest current PCs. Yet N is usually very large in problems such as data mining. Fortunately, as pointed out in Section 3.2, we need not obtain the full eigendecomposition for the components of kGMM; we only need to estimate
the largest $d_\varphi$ nonzero eigenvalues and corresponding eigenvectors. For large scale problems we may assume $d_\varphi\ll N$. Several techniques can estimate the largest $d_\varphi$ eigen-components for kernel methods. The first is based on traditional Orthogonal Iteration or Lanczos Iteration. The second makes the kernel matrix sparse via sampling techniques [1]. The third applies the Nyström method to speed up kernel machines [19]. However, these three techniques are somewhat complicated. In this paper we adopt a much simpler but practical technique proposed by Shawe-Taylor et al. [20]. It assumes that all samples forming the kernel matrix of each component are drawn from an underlying density $p(x)$, so that the problem can be written as a continuous eigenproblem:

$$\int k(x,y)\,p(x)\,\beta_i(x)\,dx=\lambda_i\beta_i(y)\tag{17}$$

where $\lambda_i$, $\beta_i(y)$ are an eigenvalue and eigenfunction, and $k$ is a given kernel function. The integral can be approximated by the Monte Carlo method using a subset of samples $\{x_i\}_{i=1}^{m}$ ($m\ll N$, $m\ge d_\varphi$) drawn according to $p(x)$:

$$\int k(x,y)\,p(x)\,V_i(x)\,dx\approx\frac{1}{m}\sum_{j=1}^{m}k(x_j,y)\,V_i(x_j)\tag{18}$$

Plugging in $y=x_k$ for $k=1,\dots,m$, we obtain a matrix eigenproblem:

$$\frac{1}{m}\sum_{j=1}^{m}k(x_j,x_k)\,V_i(x_j)=\hat\lambda_i V_i(x_k)\tag{19}$$

where $\hat\lambda_i$ is the approximation of the eigenvalue $\lambda_i$. This approximation has been proved feasible and has bounded error. We apply it in our parameter estimation algorithm for large scale problems ($N>1{,}000$). In our algorithm, the underlying density of the $l$-th component is approximated by $\hat G(\varphi(x)\mid\theta_l^{(t)})$. We sample a subset of size $m_l^{(t)}$ according to $\hat G(\varphi(x)\mid\theta_l)$ and perform a full eigendecomposition on this subset to obtain the largest $d_\varphi$ eigen-components. With this Monte Carlo sampling technique, the computational cost on large scale problems is reduced greatly, as is the memory needed by the parameter estimation algorithm. This makes the proposed kGMM efficient and practical.

Comparison with Related Works.
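The speedup in (17)-(19) amounts to eigendecomposing the kernel matrix of a small subsample and rescaling by $1/m$. A rough sketch under the assumption of an RBF kernel (function names and the uniform subsample are ours; the paper samples according to the component density):

```python
import numpy as np

def rbf(X, Y, gamma=0.5):
    # k(x, y) = exp(-gamma * ||x - y||^2)
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * sq)

def approx_top_eigs(X, m, d_phi, seed=0):
    """Monte Carlo approximation of (17)-(19): eigendecompose the
    kernel matrix of an m-sample subset and rescale by 1/m."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    Km = rbf(X[idx], X[idx])
    evals = np.sort(np.linalg.eigvalsh(Km))[::-1]
    return evals[:d_phi] / m        # hat-lambda_i of equation (19)

# usage: the subset estimate tracks the full-matrix eigenvalues / N
X = np.random.default_rng(1).standard_normal((200, 2))
approx = approx_top_eigs(X, m=50, d_phi=3)
full = np.sort(np.linalg.eigvalsh(rbf(X, X)))[::-1][:3] / 200.0
```

The eigendecomposition cost drops from $O(N^3)$ to $O(m^3)$, which is the source of the claimed speedup.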
There is still other work related to ours. One major example is the spectral clustering algorithm [21]. Spectral clustering can be regarded as using an RBF based kernel method to extract features and then performing clustering with K-means. Compared with spectral clustering, the proposed kGMM has at least two advantages: (1) kGMM provides results in a probabilistic framework and can incorporate prior information easily; (2) kGMM can be used in supervised learning problems as a density estimation method. These advantages encourage us to apply the proposed kGMM.
Misunderstanding. We address a possible misunderstanding of the proposed model: that one could simply run GMM in the reduced dimension space obtained by Kernel PCA and achieve the same result as kGMM, i.e. project the data onto the first $d_\varphi$ dimensions of the feature space by Kernel PCA and then perform GMM parameter estimation in that principal subspace. However, the choice of a proper $d_\varphi$ is critical there, and performance would depend entirely on it. If $d_\varphi$ is too large, estimation is not feasible, because probability estimation demands that the number of samples be large relative to the dimension $d_\varphi$, and the computational cost also increases greatly. On the contrary, a small $d_\varphi$ leaves the estimated parameters unable to represent the data well. The proposed kGMM does not have this problem, since the approximated density function is complete and optimal under the minimum Kullback-Leibler divergence criterion. Moreover, kGMM allows different components to use different $d_\varphi$. All this improves the flexibility and broadens the applicability of the proposed kGMM.

4 Experiments

In this section, two experiments validate the proposed kGMM against traditional GMM. First, kGMM is employed as an unsupervised learning (clustering) method on synthetic data sets. Second, kGMM is employed as a supervised density estimation method for real-world handwritten digit recognition.

4.1 Synthetic Data Clustering Using Kernel GMM

To provide an intuitive comparison between the proposed kGMM and traditional GMM, we first conduct experiments on synthetic 2-D data sets. The data sets, each with 1,000 samples, are depicted in Figures 1 and 2. For traditional GMM, all samples are used to estimate the parameters of a two-component mixture of Gaussians. When the algorithm stops, each sample is assigned to one of the components (clusters) according to its posterior.
The clustering results of traditional GMM are shown in Figure 1(a) and Figure 2(a); the results are obviously not satisfactory. However, using kGMM with a polynomial kernel of degree 2, $d_\varphi=4$ for each Gaussian component, and the same clustering scheme as traditional GMM, we achieve the promising results shown in Figure 1(b) and Figure 2(b). Besides, kGMM provides probabilistic information, as in Figure 1(c) and Figure 2(c), which most classical kernel methods cannot provide.

4.2 USPS Data Set Recognition Using Kernel GMM

Kernel GMM is also applied to a real-world problem, US Postal Service (USPS) handwritten digit recognition. The data set consists of 9,226 grayscale
Fig. 2. Data set of 1,000 samples; points with different markers belong to the two clusters. (a) is the partition result of traditional GMM; (b) is the result achieved by kGMM; (c) shows the probability that each point belongs to the left-right cluster. The whiter the point, the higher the probability.

images of size 16×16, divided into a training set of 7,219 images and a test set of 2,007 images. The original input data is just the vector form of the digit image, i.e., the input feature space has dimensionality 256. Optionally, a linear discriminant analysis (LDA) can be performed to reduce the dimensionality of the feature space; if LDA is performed, the feature dimensionality becomes 39. For each category $\omega$, a density $p(x\mid\omega)$ is estimated using a 4-component GMM on the training set. To classify a test sample $x$, we use the Bayesian decision rule

$$\omega^{*}=\arg\max_{\omega}\big\{p(x\mid\omega)P(\omega)\big\},\quad \omega=1,\dots,10\tag{20}$$

where $P(\omega)$ is the prior probability of category $\omega$. In this experiment we set $P(\omega)=1/10$, i.e. all categories have equal prior probability. For comparison, kGMM adopts the same experimental scheme as traditional GMM, except that it uses an RBF kernel $k(x,x')=\exp(-\gamma\|x-x'\|^2)$ with $\gamma=0.0015$, 2 Gaussian mixture components, and $d_\varphi=40$ for each component. The experimental results of GMM in the original input space, GMM in the LDA space (from [4]), and kGMM are shown in Table 3. We can see that kGMM, with fewer components, performs markedly better than traditional GMM with or without LDA. Although the kGMM result is not the state-of-the-art result on USPS, it could be improved further by incorporating invariance prior knowledge using tangent distance as in [4].
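The decision rule (20) is just an arg-max over per-class weighted densities. A minimal sketch (ours, with a hypothetical `class_log_density` interface and log densities for numerical stability):

```python
import numpy as np

def classify(x, class_log_density, priors):
    """Bayes decision rule (20): pick the class maximizing p(x|w) P(w)."""
    scores = np.array([class_log_density[w](x) + np.log(priors[w])
                       for w in range(len(priors))])
    return int(np.argmax(scores))

# toy usage: two 1-D Gaussian class log-densities (up to a shared constant)
densities = [
    lambda x: -0.5 * (x - (-2.0)) ** 2,   # class 0 centered at -2
    lambda x: -0.5 * (x - 2.0) ** 2,      # class 1 centered at +2
]
assert classify(-1.5, densities, [0.5, 0.5]) == 0
assert classify(2.5, densities, [0.5, 0.5]) == 1
```

In the paper's setting, each `class_log_density[w]` would be the log of a per-digit GMM or kGMM density.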
Table 3. Comparison results on the USPS data set

Method      | Best error rate
GMM         | 8.0%
LDA+GMM     | 6.7%
Kernel GMM  | 4.3%

5 Conclusion

In this paper, we present a kernel Gaussian Mixture Model and deduce a parameter estimation algorithm by embedding the kernel trick into the EM algorithm. Furthermore, we adopt a Monte Carlo sampling technique to speed up kGMM on large scale problems, making it more practical and efficient. Compared with most classical kernel methods, kGMM can solve problems in a probabilistic framework. Moreover, it can tackle nonlinear problems better than traditional GMM. Experimental results on synthetic and real-world data sets show that the proposed approach performs satisfactorily. Our future work will focus on incorporating prior knowledge, such as invariance, into kGMM, and on enriching its applications.

Acknowledgements. The authors would like to thank the anonymous reviewers for their helpful comments, and Jason Xu for helpful conversations about this work.

References

1. Achlioptas, D., McSherry, F. and Schölkopf, B.: Sampling techniques for kernel methods. In Advances in Neural Information Processing Systems (NIPS) 14, MIT Press, Cambridge, MA (2002)
2. Bilmes, J. A.: A Gentle Tutorial on the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical Report, UC Berkeley, ICSI-TR (1997)
3. Bishop, C. M.: Neural Networks for Pattern Recognition. Oxford University Press (1995)
4. Dahmen, J., Keysers, D., Ney, H. and Güld, M.O.: Statistical Image Object Recognition using Mixture Densities. Journal of Mathematical Imaging and Vision, 14(3) (2001)
5. Duda, R. O., Hart, P. E. and Stork, D. G.: Pattern Classification. New York: John Wiley & Sons, 2nd Edition (2001)
6. Everitt, B. S.: An Introduction to Latent Variable Models. London: Chapman and Hall (1984)
7. Bach, F. R. and Jordan, M. I.: Kernel Independent Component Analysis. Journal of Machine Learning Research, 3 (2002)
8. Gestel, T.
V., Suykens, J.A.K., Lanckriet, G., Lambrechts, A., De Moor, B. and Vandewalle, J.: Bayesian framework for least squares support vector machine classifiers, Gaussian processes and kernel Fisher discriminant analysis. Neural Computation, 15(5) (2002)
9. Herbrich, R., Graepel, T. and Campbell, C.: Bayes Point Machines: Estimating the Bayes Point in Kernel Space. In Proceedings of the International Joint Conference on Artificial Intelligence Workshop on Support Vector Machines (1999)
10. Kwok, J. T.: The Evidence Framework Applied to Support Vector Machines. IEEE Trans. on Neural Networks, Vol. 11 (2000)
11. Mika, S., Rätsch, G., Weston, J., Schölkopf, B. and Müller, K.R.: Fisher discriminant analysis with kernels. IEEE Workshop on Neural Networks for Signal Processing IX (1999)
12. Mjolsness, E. and DeCoste, D.: Machine Learning for Science: State of the Art and Future Prospects. Science, Vol. 293 (2001)
13. Roberts, S. J.: Parametric and Non-Parametric Unsupervised Cluster Analysis. Pattern Recognition, Vol. 30, No. 2 (1997)
14. Schölkopf, B., Smola, A.J. and Müller, K.R.: Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation, 10(5) (1998)
15. Schölkopf, B., Mika, S., Burges, C. J. C., Knirsch, P., Müller, K. R., Rätsch, G. and Smola, A.: Input Space vs. Feature Space in Kernel-Based Methods. IEEE Trans. on Neural Networks, Vol. 10, No. 5 (1999)
16. Schölkopf, B. and Smola, A. J.: Learning with Kernels: Support Vector Machines, Regularization and Beyond. MIT Press, Cambridge, MA (2002)
17. Tipping, M. E.: Sparse Bayesian Learning and the Relevance Vector Machine. Journal of Machine Learning Research (2001)
18. Vapnik, V.: The Nature of Statistical Learning Theory. 2nd Edition, Springer-Verlag, New York (1997)
19. Williams, C. and Seeger, M.: Using the Nyström Method to Speed Up Kernel Machines. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems (NIPS) 13. MIT Press, Cambridge, MA (2001)
20. Shawe-Taylor, J., Williams, C., Cristianini, N. and Kandola, J.: On the Eigenspectrum of the Gram Matrix and Its Relationship to the Operator Eigenspectrum. In N. Cesa-Bianchi et al. (Eds.): ALT 2002, LNAI 2533, Springer-Verlag, Berlin Heidelberg (2002)
21. Ng, A. Y., Jordan, M. I.
and Weiss, Y.: On Spectral Clustering: Analysis and an Algorithm. Advances in Neural Information Processing Systems (NIPS) 14, MIT Press, Cambridge, MA (2002)
22. Moghaddam, B. and Pentland, A.: Probabilistic visual learning for object representation. IEEE Trans. on PAMI, Vol. 19, No. 7 (1997)

Appendix

For convenience, the subscripts representing Gaussian components and the superscripts representing iterations are omitted in this part.

[Lemma 1] Suppose $\varphi(\cdot)$ is a mapping function satisfying Mercer's conditions as in Equation (1) (Section 2), with $N$ training samples $X=\{x_i\}_{i=1}^{N}$. Let $w=[w_1,\dots,w_N]^T\in\mathbb{R}^N$ be an $N$-dimensional column vector. Define $\mu=\sum_{i=1}^{N}\varphi(x_i)w_i^2$ and $\tilde\varphi(x_i)=\varphi(x_i)-\mu$. Then the following hold:

(1) If $K$ is the $N\times N$ kernel matrix with $K_{ij}=\big(w_i\varphi(x_i)\cdot w_j\varphi(x_j)\big)$, and $\tilde K$ is the $N\times N$ matrix centered in feature space such that
$\tilde K_{ij}=\big(w_i\tilde\varphi(x_i)\cdot w_j\tilde\varphi(x_j)\big)$, then:

$$\tilde K = K - WK - KW + WKW\tag{a1}$$

where $W=ww^T$.

(2) If $K'$ is the $N\times N$ projecting kernel matrix with $K'_{ij}=\big(\varphi(x_i)\cdot w_j\varphi(x_j)\big)$, and $\tilde K'$ is the $N\times N$ matrix centered in feature space with $\tilde K'_{ij}=\big(\tilde\varphi(x_i)\cdot w_j\tilde\varphi(x_j)\big)$, then

$$\tilde K' = K' - W'K - K'W + W'KW\tag{a2}$$

where $W'=\mathbf{1}_N w^T$ and $\mathbf{1}_N$ is the $N$-dimensional column vector with all entries equal to 1.

Proof: (1)

$$\tilde K_{ij}=\big(w_i\tilde\varphi(x_i)\cdot w_j\tilde\varphi(x_j)\big)=w_i\tilde\varphi(x_i)^T w_j\tilde\varphi(x_j)$$
$$=\Big(w_i\big(\varphi(x_i)-\textstyle\sum_{k=1}^{N}\varphi(x_k)w_k^2\big)\Big)^T\Big(w_j\big(\varphi(x_j)-\textstyle\sum_{k=1}^{N}\varphi(x_k)w_k^2\big)\Big)$$
$$=\big(w_i\varphi(x_i)^T w_j\varphi(x_j)\big)-\sum_{k=1}^{N}w_i w_k\big(w_k\varphi(x_k)^T w_j\varphi(x_j)\big)-\sum_{k=1}^{N}w_k\big(w_i\varphi(x_i)^T w_k\varphi(x_k)\big)w_j$$
$$\qquad+\,w_i\sum_{k=1}^{N}\sum_{n=1}^{N}w_k w_n\big(w_k\varphi(x_k)^T w_n\varphi(x_n)\big)w_j$$
$$=K_{ij}-w_i\sum_{k=1}^{N}w_k K_{kj}-\sum_{k=1}^{N}w_k K_{ik}\,w_j+w_i\sum_{k=1}^{N}\sum_{n=1}^{N}w_k w_n K_{kn}\,w_j$$

from which we obtain the compact expression (a1). (2) is proved similarly.

[Lemma 2] Suppose $\Sigma$ is a covariance matrix such that

$$\Sigma=\sum_{i=1}^{N}\tilde\varphi(x_i)\tilde\varphi(x_i)^T w_i^2\tag{a3}$$

Then the following hold:

(1) $\Sigma V=\lambda V \iff \tilde K\beta=\lambda\beta$.
(2) $V=\sum_{i=1}^{N}\beta_i\tilde\varphi(x_i)w_i$.

Proof: (1) First we prove $\Rightarrow$. If $\Sigma V=\lambda V$, then the solution $V$ lies in the space spanned by $w_1\tilde\varphi(x_1),\dots,w_N\tilde\varphi(x_N)$, and we have two useful consequences. First, we can consider the equivalent equations

$$\lambda\big(w_k\tilde\varphi(x_k)\cdot V\big)=\big(w_k\tilde\varphi(x_k)\cdot\Sigma V\big),\quad\text{for all } k=1,\dots,N\tag{a4}$$
and second, there exist coefficients $\beta_i$ ($i=1,\dots,N$) such that

$$V=\sum_{i=1}^{N}\beta_i w_i\tilde\varphi(x_i)\tag{a5}$$

Combining (a4) and (a5), we get

$$\lambda\sum_{i=1}^{N}\beta_i\big(w_k\tilde\varphi(x_k)\cdot w_i\tilde\varphi(x_i)\big)=\sum_{i=1}^{N}\beta_i\sum_{j=1}^{N}\big(w_k\tilde\varphi(x_k)\cdot w_j\tilde\varphi(x_j)\big)\big(w_j\tilde\varphi(x_j)\cdot w_i\tilde\varphi(x_i)\big)$$

which reads $\lambda\tilde K\beta=\tilde K^2\beta$, where $\beta=[\beta_1,\dots,\beta_N]^T$. Since $\tilde K$ is a symmetric matrix, it has a set of eigenvectors spanning the whole space, and thus

$$\lambda\beta=\tilde K\beta\tag{a6}$$

The direction $\Leftarrow$ is proved similarly. (2) is easy to prove, so the proof is omitted.

[Lemma 3] If $x\in\mathbb{R}^d$ is a sample, with $\tilde\varphi(x)=\varphi(x)-\mu$, then

$$\big(V\cdot\tilde\varphi(x_j)\big)=\sum_{i=1}^{N}\beta_i\big(w_i\tilde\varphi(x_i)\cdot\tilde\varphi(x_j)\big)=\beta^T\Gamma_j\tag{a7}$$

where $\Gamma_j$ is the $j$-th column of the centered projecting matrix $\tilde K'$.

Proof: This lemma follows from (a5).