EIGEN-ANALYSIS OF KERNEL OPERATORS FOR NONLINEAR DIMENSION REDUCTION AND DISCRIMINATION


EIGEN-ANALYSIS OF KERNEL OPERATORS FOR NONLINEAR DIMENSION REDUCTION AND DISCRIMINATION

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Zhiyu Liang, Graduate Program in Statistics

The Ohio State University, 2014

Dissertation Committee: Yoonkyung Lee, Advisor; Tao Shi; Vincent Vu

© Copyright by Zhiyu Liang, 2014

ABSTRACT

There has been growing interest in kernel methods for classification, clustering and dimension reduction. For example, kernel linear discriminant analysis, spectral clustering and kernel principal component analysis are widely used in statistical learning and data mining applications. The empirical success of the kernel method is generally attributed to the nonlinear feature mapping induced by the kernel, which in turn determines a low dimensional data embedding. It is important to understand the effect of a kernel and its associated kernel parameter(s) on the embedding in relation to data distributions. In this dissertation, we examine the geometry of the nonlinear embeddings for kernel PCA and kernel LDA through spectral analysis of the corresponding kernel operators. In particular, we carry out eigen-analysis of the polynomial kernel operator associated with data distributions and investigate the effect of the degree of polynomial on the data embedding. We also investigate the effect of centering kernels on the spectral property of both polynomial and Gaussian kernel operators. In addition, we extend the framework of the eigen-analysis of kernel PCA to kernel LDA by considering between-class and within-class variation operators for polynomial kernels. The results provide both insights into the geometry of nonlinear data embeddings given by kernel methods and practical guidelines for choosing an appropriate degree for dimension reduction and discrimination with polynomial kernels.

This is dedicated to my parents, Yigui Liang and Yunmei Hou; my sister, Zhiyan Liang; and my husband, Sungmin Kim.

ACKNOWLEDGMENTS

I would like to express my sincere appreciation and gratitude first and foremost to my advisor, Dr. Yoonkyung Lee, for her continuous advice and encouragement throughout my doctoral studies and dissertation work. Without her excellent guidance, patience and constructive comments, I could have never finished my dissertation successfully. My gratitude also goes to the members of my committee, Dr. Tao Shi, Dr. Vincent Vu, and Dr. Prem Goel, for their guidance in my research and valuable comments. Special thanks also go to Dr. Randolph Moses for his willingness to participate in my final defense committee and for giving constructive comments afterwards. I also thank Dr. Elizabeth Stasny for her help throughout my Ph.D. study, and Dr. Rebecca Sela and Dr. Nader Gemayel for their valuable advice that made my internship fruitful and exciting. I also thank my parents and sister for always supporting me and encouraging me with their best wishes. Last but not least, I would like to express my love, appreciation, and gratitude to my husband Sungmin Kim for his continuous help, support, love and much valuable academic advice.

VITA

B.Sc. in Applied Mathematics, Shanghai University of Finance and Economics, Shanghai, China

Present: Graduate Teaching/Research Associate, The Ohio State University, Columbus, OH

PUBLICATIONS

Liang, Z. and Lee, Y. (2013). "Eigen-analysis of Nonlinear PCA with Polynomial Kernels." Statistical Analysis and Data Mining, Vol. 6, Issue 6.

FIELDS OF STUDY

Major Field: Statistics

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Figures

1 Introduction

2 Kernel Methods
  2.1 Kernel
  2.2 Kernel method
    2.2.1 Examples of kernel methods
  2.3 Kernel Operator
    2.3.1 Definition
    2.3.2 Eigen-analysis of the Gaussian kernel operator

3 Eigen-analysis of Kernel Operators for Nonlinear Dimension Reduction
  3.1 Eigen-analysis of the Polynomial Kernel Operator
    3.1.1 Two-dimensional setting
    3.1.2 Multi-dimensional setting
    3.1.3 Polynomial kernel with constant
  3.2 Simulation Studies
    3.2.1 Uniform example
    3.2.2 Mixture normal example
    3.2.3 Effect of degree
    3.2.4 Restriction in embeddings
  3.3 Analysis of Handwritten Digit Data

4 On the Effect of Centering Kernels in Kernel PCA
  4.1 Centering in the feature space
    4.1.1 Simple illustration
    4.1.2 Centered data with centered kernel
    4.1.3 Uncentered data with uncentered kernel
    4.1.4 Uncentered data with centered kernel
    4.1.5 Extension
  4.2 General Eigen-analysis for Centered Kernel Operator
    4.2.1 Derivation of eigenfunction-eigenvalue pair
    4.2.2 Orthogonality of eigenfunctions
    4.2.3 Connection with the centered polynomial kernel operator result
  4.3 Analysis of Centered Gaussian Kernel Operator
    4.3.1 Centered Gaussian kernel operator
    4.3.2 One-component normal
    4.3.3 Mixture normal distribution

5 Eigen-analysis of Kernel Operators for Nonlinear Discrimination
  5.1 The Population Version of Kernel LDA
  5.2 Eigen-analysis of the Polynomial Kernel Operator
    5.2.1 Two-dimensional setting
    5.2.2 Multi-dimensional setting
  5.3 Simulation Studies
  5.4 Effect of Degree

6 Conclusion and Discussion
  6.1 Conclusion
  6.2 Discussion

APPENDICES

A Proof for the form of leading eigenfunctions in a simple centered data example
B Example for the centered data with centered kernel
C Remarks on K_p being a valid mapping

Bibliography

LIST OF FIGURES

3.1 Comparison of the contours of the nonlinear embeddings given by three leading eigenvectors and the theoretical eigenfunctions for the uniform data. The upper three panels are for the embeddings induced by the eigenvectors for three nonzero eigenvalues, and the lower three panels are for the corresponding eigenfunctions.

3.2 Comparison of the contours of the nonlinear embeddings given by three leading eigenvectors (top panels) and the theoretical eigenfunctions (bottom panels) for the mixture normal data when degree is 2.

3.3 Comparison of the contours of the nonlinear embeddings given by four leading eigenvectors (top panels) and the theoretical eigenfunctions (bottom panels) for the mixture normal data when degree is 3.

3.4 The mixture normal data and their projections through principal components with polynomial kernel of varying degrees. The colors distinguish the two normal components.

3.5 Wheel data and their projections through principal components with polynomial kernel of varying degrees. The colors distinguish the two clusters.

3.6 Restricted projection space for kernel PCA with quadratic kernel when the leading eigenfunctions are φ₁(x) = 0.1x₁² − 0.0x₂² and φ₂(x) = 0.5x₁x₂.

3.7 Projections of handwritten digits by kernel PCA with polynomial kernels of increasing degree.

3.8 Images corresponding to a 5 × 5 grid over the first two principal components for kernel PCA of the handwritten digits.

3.9 Projections of handwritten digits by kernel PCA with polynomial kernels of increasing degree.

3.10 Projections of digits given by approximate eigenfunctions of kernel PCA that are based on the sample moment matrices.

4.1 Comparison of the contours of the leading eigenfunction for the uncentered kernel operator (left) and the centered kernel operator (right) in the bivariate normal example.

4.2 Contour plots of the leading eigenfunctions of the uncentered kernel operator for the distribution setting in (4.6) when the center of the data distribution gradually moves along the x₁ axis away from the origin.

4.3 The trichotomy of the leading eigenfunction form for the uncentered data distribution case in (4.6) with centered kernel.

4.4 Contours of the leading eigenfunction as m increases when k = 1.5 for the uncentered data distribution in (4.6) with the centered kernel operator.

4.5 Contours of the leading eigenfunction when the center of the data distribution moves from the origin along the 45 degree line.

4.6 The first row shows the first five eigenvectors of an uncentered Gaussian kernel matrix with bandwidth w = 1.5 for data sampled from a normal distribution N(·, 1), and the second row shows the first five eigenvectors of the centered kernel matrix. The third row shows the linear combinations of eigenvectors with the coefficients derived from our analysis; the fourth row shows the first five theoretical eigenfunctions for the centered kernel operator; the fifth row shows the five eigenfunctions for the uncentered kernel operator.

4.7 The inner product ⟨φ₁, φ̃⟩_p versus the value of w.

4.8 The top row shows five leading eigenvectors of a Gaussian kernel matrix of data sampled from a mixture normal distribution 0.6 N(·, 1) + 0.4 N(·, 1); the second row shows the five leading eigenvectors of the uncentered kernel matrix. The third row shows the theoretical eigenfunctions of the centered kernel operator we obtained.

5.1 The contours of the probability density function of a mixture normal example with two classes. The red circles and blue crosses show the data points generated from the distribution.

5.2 The left panel shows the contours of the empirical discriminant function from the kernel LDA algorithm with linear kernel; the right panel shows the contours of the theoretical discriminant function.

5.3 The left panel shows the contours of the empirical discriminant function from the kernel LDA algorithm; the right panel shows the contours of the theoretical discriminant function (d = 2).

5.4 The left panel shows the embeddings of the kernel LDA algorithm; the right panel shows the contours of the discriminant function (d = 2).

5.5 Wheel data and contours of the theoretical discriminant functions of kernel LDA with polynomial kernel of varying degrees based on the sample moments.

5.6 Contours of the empirical discriminant functions for wheel data.

5.7 The scatterplot of two explicit features x₁² vs x₁x₂ (polynomial kernel with d = 2) for the bivariate normal example in (5.4). Red circles and blue crosses represent two classes.

5.8 The scatterplot of two features x₁² vs x₂² (polynomial kernel with d = 2) for the wheel data. Black circles and red crosses represent the outer circle and inner cluster in the original data.

5.9 Comparison between the first principal component of kernel PCA and the discriminant function of kernel LDA for the polynomial kernel with d = 2 over wheel data.

CHAPTER 1
INTRODUCTION

Kernel methods have drawn great attention in machine learning and data mining in recent years (Schölkopf and Smola 2002; Hofmann et al. 2008). They are given as nonlinear generalizations of linear methods by mapping data into a high dimensional feature space and applying the linear methods in the so-called feature space (Aizerman et al. 1964). Kernels are the functions that define the inner product of the feature vectors and play an important role in capturing the nonlinear mapping desired for data analysis. Historically, they are closely related to reproducing kernels used in statistics for nonparametric function estimation; see Wahba (1990) for spline models. The explicit form of the feature mapping is not required. Instead, specification of a kernel is sufficient for kernel methods. Application of the nonlinear generalization through kernels has led to various methods for classification, clustering and dimension reduction. Examples include support vector machines (SVMs) (Schölkopf et al. 1998; Vapnik 1995), kernel linear discriminant analysis (kernel LDA) (Mika et al. 1999), spectral clustering (Scott and Longuet-Higgins 1990; von Luxburg 2007), and kernel principal component analysis (kernel PCA) (Schölkopf et al. 1998).

There have been many studies examining the effect of a kernel function and its associated parameters on the performance of kernel methods. For example, Brown et al. (2000), Ahn (2010) and Baudat and Anouar (2000) investigated how to select the bandwidth of the Gaussian kernel for SVM and kernel LDA.

In spectral clustering and kernel PCA, the kernel determines the projections or data embeddings to be used for uncovering clusters or for representing data effectively in a low dimensional space, which are given as the leading eigenvectors of the kernel matrix. As kernel PCA regards the spectral analysis of a finite-dimensional kernel matrix, we can consider the eigen-analysis of the kernel operator as an infinite dimensional analogue, where eigenfunctions are viewed as a continuous version of the eigenvectors of the kernel matrix. Such eigen-analysis can provide a viewpoint of the method at the population level. In general, it is important to understand the effect of a kernel on the nonlinear data embedding in relation to data distributions. In this dissertation, we examine the geometry of the data embedding for kernel PCA. Zhu et al. (1998), Williams and Seeger (2000) and Shi et al. (2009) studied the relation between Gaussian kernels and the eigenfunctions of the corresponding kernel operator under normal distributions. Zhu et al. (1998) computed the eigenvalues and eigenfunctions of the Gaussian kernel operator explicitly when data follow a univariate normal distribution. Williams and Seeger (2000) investigated how eigenvalues and eigenfunctions change depending on the input density function, and stated that the eigenfunctions with relatively large eigenvalues are useful in classification, in the context of approximating the kernel matrix using a low rank eigen-expansion. Shi et al. (2009) extended the discussion to spectral clustering, explaining which eigenvectors to use for clustering when the distribution is a mixture of multiple components.

Among the kernel functions, the Gaussian kernel and polynomial kernels are commonly used. Although the Gaussian kernel is generally more flexible as a universal approximator, the two kernels have different merits, and the polynomial kernel with an appropriate degree can often be as effective as the Gaussian kernel. For example, Kaufmann (1999) discussed the application of polynomial kernels to handwritten digit recognition and the checkerboard problem in the context of classification using support vector machines, which produced decent results. Extending the current studies of the Gaussian kernel operator, we carry out eigen-analysis of the polynomial kernel operator under various data distributions. In addition, we investigate the effect of the degree on the geometry of the nonlinear embedding with polynomial kernels. In standard PCA, eigen-decomposition is performed on the covariance matrix to obtain the principal components. Analogous to this standard practice, data are centered in the feature space and the corresponding centered version of the kernel matrix is commonly used in kernel PCA. We explore the effect of centering kernels on the spectral property of both polynomial and Gaussian kernel operators, using the explicit form of the centered kernel operator. In particular, we characterize the change in the spectrum from the uncentered counterpart. As another popular kernel method, kernel LDA has been used successfully in many applications. For example, Mika et al. (1999) conducted an experimental study showing that kernel LDA is competitive in comparison to other classification methods. We extend the eigen-analysis of the kernel operator for kernel PCA to the generalized eigen-problem associated with kernel LDA, which leads to a better understanding of the kernel LDA projections in relation to the underlying data distribution on the nonlinear embedding for discrimination. We mainly investigate the eigen-analysis of the polynomial kernel operator for kernel LDA and comment on the effect of the degree.

Chapter 2 gives an introduction to the technical details of the kernel, kernel operator and kernel methods. It also provides a review of the eigen-analysis of the Gaussian kernel operator. Chapter 3 presents the eigen-analysis of nonlinear PCA with polynomial kernels. Section 3.1 includes the general results of the eigen-analysis of the polynomial kernel operator defined through data distributions, and we show that the matrix of moments determines the eigenvalues and eigenfunctions. In Section 3.2, numerical examples are given to illustrate the relationship between the eigenvectors of a sample kernel matrix and the eigenfunctions from the theoretical analysis. We comment on the effect of degrees (especially even or odd) on data projections given by the leading eigenvectors, in relation to some features of the data distribution in the original input space. We also discuss how the eigenfunctions can explain some geometric patterns observed in data projections. In Section 3.3, we present kernel principal component analysis of the handwritten digit data from Le Cun et al. (1990) for some pairs of digits and explain the geometry of the embeddings of digit pairs through analysis of the sample moment matrices. Chapter 4 mainly focuses on the effect of centering kernels. Section 4.1 regards how centering the kernel affects the spectral property of the polynomial kernel operator. We show examples using both centered and uncentered polynomial kernels to illustrate the difference. In Section 4.2, we use Mercer's theorem to express the kernel function for a general analysis of the centered kernel operator, which encompasses the result for the polynomial kernel operator. Section 4.3 examines the effect of centering kernels on the spectral property of the Gaussian kernel operator. We investigate both one-component normal and multi-component normal examples and describe the change in the spectrum after centering.

Chapter 5 extends the current framework for analysis of kernel PCA to kernel LDA. By solving the generalized eigen-problem associated with the population version of kernel LDA, we characterize the theoretical discriminant function that maximizes the between-class variation relative to the within-class variation. The polynomial kernel function is used in this derivation. Numerical examples are given in Section 5.3 and Section 5.4 to compare the empirical discriminant function and the theoretical discriminant function. Chapter 6 concludes the dissertation with discussions.

CHAPTER 2
KERNEL METHODS

2.1 Kernel

Suppose that data D = {x_1, ..., x_n} consist of an iid sample from a probability distribution P and the input domain for the data is X, e.g. X = R^p. Then a kernel function is defined as a non-negative definite mapping from X × X to R, i.e.

    K : X × X → R,  (x_i, x_j) ↦ K(x_i, x_j).

The kernel function is symmetric, which means K(x, y) = K(y, x). Besides, there are some properties of a kernel that are worth noting:

    K(x, x) ≥ 0  and  K(u, v)² ≤ K(u, u) K(v, v).

For a pair of data points (x_i, x_j), the kernel function can be expressed in terms of an inner product as follows:

    K(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩,

where Φ is typically a nonlinear map from the input space to an inner product space H, Φ : X → H. The reason we introduce the inner product space is that being able to compute the inner product allows us to perform related geometrical constructions with the information of angles, distances or lengths. Given such a formulation, we call the similarity measure function K a kernel, Φ its feature map and H the corresponding feature space. We say that the kernel K corresponds to inner products in the feature space H via the feature mapping Φ. Schölkopf and Smola (2002) showed that kernels which correspond to inner products in the feature space coincide with the class of non-negative definite kernels. Some non-negative definite kernels can be evaluated efficiently even though they correspond to inner products in an infinite dimensional inner product space. The correspondence is thus critical.

Historically, kernels are closely related to reproducing kernels typically used for nonparametric function estimation. The following summary of the construction of a reproducing kernel Hilbert space and its reproducing kernel gives an example of a well-defined kernel space and the corresponding kernel. To define reproducing kernels, consider a Hilbert space H_K of real valued functions on an input domain X. Note that a Hilbert space H_K is a complete inner product linear space, which is different from the feature space H. In Wahba (1990), a reproducing kernel Hilbert space is defined as a Hilbert space of real valued functions, where for each x ∈ X, the evaluation functional L_x(f) = f(x) is bounded in H_K.

By the Riesz representation theorem, if H_K is a reproducing kernel Hilbert space, then there exists an element K_x ∈ H_K, the representer of evaluation at x, such that

    L_x(f) = ⟨K_x, f⟩ = f(x)  for all f ∈ H_K;

see Aronszajn (1950) for details. The symmetric bivariate function K(x, y) (note that K(x, y) = K_x(y) = ⟨K_x, K_y⟩ = ⟨K_y, K_x⟩) is called the reproducing kernel, and it has the reproducing property

    ⟨K(x, ·), f(·)⟩ = f(x).

It can be shown that any reproducing kernel is non-negative definite. There exists a one-to-one correspondence between reproducing kernel Hilbert spaces and non-negative definite functions. The Moore-Aronszajn theorem states that for every reproducing kernel Hilbert space H_K of functions, there corresponds a unique reproducing kernel K(x, y), which is non-negative definite. Conversely, given a non-negative definite function K(s, t) on X, we can construct a unique reproducing kernel Hilbert space H_K that has K(s, t) as its reproducing kernel.

Given the kernel function, we define the kernel matrix in the following way. Let x_1, ..., x_n ∈ X be an iid sample and K be the kernel function. The kernel matrix is given as the n × n matrix K_n = [K(x_i, x_j)]. We say a kernel matrix is non-negative definite if it satisfies the condition

    Σ_{i,j} c_i c_j K(x_i, x_j) ≥ 0  for all c_1, ..., c_n ∈ R.

A kernel function which generates a non-negative definite kernel matrix K_n is called a non-negative definite kernel.
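To make the non-negative definiteness condition concrete, the following minimal sketch (an illustration added here; the Gaussian bandwidth, sample size and seed are arbitrary choices) forms a kernel matrix K_n and checks both the eigenvalue criterion and the quadratic form Σ_{i,j} c_i c_j K(x_i, x_j) ≥ 0:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                 # n = 50 points in R^2

# Gaussian kernel K(x, y) = exp(-||x - y||^2 / sigma^2), as in Section 2.2
sigma = 1.0
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K_n = np.exp(-sq_dists / sigma ** 2)

# K_n is symmetric, so it has real eigenvalues; non-negative definiteness
# of the kernel is reflected in non-negative eigenvalues of K_n.
print(np.linalg.eigvalsh(K_n).min() >= -1e-10)   # True up to rounding

# Equivalently, c^t K_n c >= 0 for any coefficient vector c.
c = rng.normal(size=50)
print(c @ K_n @ c >= 0.0)                        # True
```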

2.2 Kernel method

Kernel methods are given as nonlinear generalizations of linear methods by mapping data into a high dimensional feature space and applying the linear methods in the feature space. In most kernel methods, the key step is to replace the inner product with the kernel so that the explicit form of the feature mapping is not required. This substitution is called the kernel trick in machine learning. The trick allows us to handle problems which are difficult to solve in the high or even infinite dimensional feature space directly. We introduce three popular kernel methods in the following section where the kernel trick is applied. Some examples of positive definite kernels in those kernel methods include the Gaussian kernel K(x, x′) = exp(−‖x − x′‖²/σ²), the polynomial kernel of degree d, K(x, x′) = (1 + ⟨x, x′⟩)^d, the sigmoid kernel K(x, x′) = tanh(κ(x^t x′) + Θ), and so on.

2.2.1 Examples of kernel methods

(a) Support Vector Machines (SVM)

Several authors considered the class of hyperplanes based on the data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), with input domain x_i ∈ X and y_i ∈ {−1, 1},

    x^t β + β_0 = 0,  where β ∈ R^p,

corresponding to the following classification rule:

    f(x) = sign(x^t β + β_0),

and proposed a learning algorithm for problems which are linearly separable by finding the hyperplane that creates the largest margin between the points of different classes through an optimization problem (Vapnik and Lerner 1963; Vapnik and Chervonenkis 1964). We call the above classifier, which finds linear boundaries in the input space, the support vector classifier (Hastie et al. 2009). In the non-separable case, where the classes overlap, the optimization problem can be generalized by allowing some points on the wrong side of the margin, which leads to the objective function

    L_D = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{i′=1}^n α_i α_{i′} y_i y_{i′} x_i^t x_{i′}.

While the support vector classifier finds a linear boundary, the procedure can be made more flexible by mapping the data into the feature space. We call this extension the Support Vector Machine. It produces nonlinear boundaries in the input space by constructing linear boundaries in the feature space to achieve better separation.

Through the feature mapping, the objective function has the form

    L_D = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{i′=1}^n α_i α_{i′} y_i y_{i′} ⟨Φ(x_i), Φ(x_{i′})⟩.

By applying the kernel trick, we replace K(x_i, x_{i′}) = ⟨Φ(x_i), Φ(x_{i′})⟩; then the support vector machine for two-class classification problems has the following form:

    f(x) = Σ_{i=1}^n α_i K(x, x_i) + β_0.

The support vector machine is generally used to solve two-class problems. It can also be extended to multiclass problems by solving many two-class problems. SVMs have applications in many supervised and unsupervised learning problems.
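As a small illustration of the kernelized decision rule f(x) = Σ_i α_i K(x, x_i) + β_0 (a sketch added here: in practice the coefficients α_i and intercept β_0 come from solving the dual optimization problem, while below they are placeholders chosen only to show the evaluation step):

```python
import numpy as np

def poly_kernel(x, y, d=2):
    """Polynomial kernel K(x, y) = (1 + <x, y>)^d."""
    return (1.0 + x @ y) ** d

def svm_decision(x, X_train, alpha, beta0, kernel=poly_kernel):
    """Evaluate f(x) = sum_i alpha_i K(x, x_i) + beta_0."""
    return sum(a * kernel(x, xi) for a, xi in zip(alpha, X_train)) + beta0

rng = np.random.default_rng(1)
X_train = rng.normal(size=(10, 2))
alpha = rng.normal(size=10)     # placeholder; normally solved from the dual
beta0 = 0.5                     # placeholder intercept
x_new = np.array([0.3, -0.2])
print(np.sign(svm_decision(x_new, X_train, alpha, beta0)))  # predicted class
```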

(b) Kernel PCA

Standard PCA is a powerful tool for extracting a linear structure in the data. The principal components are obtained by eigen-decomposition of the covariance matrix. Schölkopf et al. (1998) proposed kernel PCA by computing the inner product in the feature space using kernel functions in the input space. In this kernel method, one can compute the principal components in a high-dimensional feature space, which is related to the input space by some nonlinear feature mapping. Similar to SVM, this kernel method enables the construction of a nonlinear version of principal component analysis.

Suppose the covariance matrix for the centered observations x_i, i = 1, ..., n, is C = (1/n) Σ_{i=1}^n x_i x_i^t. PCA computes the principal components by solving the equation

    C v = λ v.

By mapping the data into the feature space H, the covariance matrix in H can be written in the form C̄ = (1/n) Σ_{i=1}^n Φ(x_i) Φ(x_i)^t. The problem is thus turned into finding the eigen-decomposition of C̄,

    C̄ u = λ u.

Notice that the computation involved is prohibitive when the feature space is very high dimensional. Replacing u = Σ_{i=1}^n α_i Φ(x_i) in the above equation, we have the eigenvalue problem of the kernel matrix

    K_n α = nλ α,

where the kernel matrix is given by K_n = [K(x_i, x_j)] = [⟨Φ(x_i), Φ(x_j)⟩]. We thus have the matrix α with its columns as the eigenvectors of the kernel matrix; let α^k indicate the eigenvector with respect to the eigenvalue λ_k. Correspondingly, the projection of any point x onto the normalized eigenvector u_k of the covariance matrix in the feature space can be derived as Σ_{i=1}^n α_i^k K(x_i, x). We are thus able to obtain embeddings in the feature space for any data point in the kernel PCA setting.
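The steps above translate directly into a short procedure. The sketch below (ours; uncentered for simplicity, with centering in the feature space deferred to Chapter 4) solves K_n α = nλα and returns both λ_n/n and the projection map x ↦ Σ_i α_i^k K(x_i, x):

```python
import numpy as np

def kernel_pca(X, kernel, n_components=2):
    """Uncentered kernel PCA via the eigen-decomposition of K_n."""
    n = X.shape[0]
    K_n = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    eigval, eigvec = np.linalg.eigh(K_n)            # ascending order
    idx = np.argsort(eigval)[::-1][:n_components]   # leading components
    lam, alpha = eigval[idx], eigvec[:, idx]
    # Scale alpha_k so that the feature-space eigenvector u_k has unit norm:
    # ||u_k||^2 = alpha_k^t K_n alpha_k = lam_k for a unit-norm alpha_k.
    alpha = alpha / np.sqrt(lam)
    def project(x):
        k_x = np.array([kernel(xi, x) for xi in X])
        return alpha.T @ k_x                        # sum_i alpha_i^k K(x_i, x)
    return lam / n, project       # lam / n approximates operator eigenvalues

poly2 = lambda x, y: (x @ y) ** 2
X = np.random.default_rng(2).uniform(size=(200, 2))
lam, project = kernel_pca(X, poly2)
print(lam, project(np.array([0.5, 0.5])))
```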

(c) Kernel LDA

In kernel PCA, we aim to find the principal components explaining as much variance of the data as possible, which serve to describe the data. When it comes to classification, we look for the features which discriminate between the two classes given the label information. The classical classification algorithms include linear and quadratic discriminant analysis, which assume a Gaussian distribution for each class. Fisher's linear discriminant direction is obtained by maximizing the between-class variance relative to the within-class variance. Mika et al. (1999) proposed a nonlinear classification technique based on Fisher's linear discriminant analysis, which could be useful when the classification boundary is not clear. By mapping the data into the feature space, the kernel trick allows us to find Fisher's linear discriminant in the feature space H, leading to a nonlinear discriminant direction in the input space.

Assume we have two classes for this classification problem and let D_1 = {x_1^1, ..., x_{n_1}^1} and D_0 = {x_1^0, ..., x_{n_0}^0} be samples from the two different classes. Then the sample size is n = n_1 + n_0. Let Φ be the feature mapping into the feature space H. To find the linear discriminant in H, we need to find the direction w which maximizes the between-class variation relative to the within-class variation in the feature space, i.e.

    J(w) = (w^t S_B w) / (w^t S_W w).   (2.1)

Here w ∈ H, and S_B and S_W are the matrices in the feature space,

    S_B = (m_1^Φ − m_0^Φ)(m_1^Φ − m_0^Φ)^t  and  S_W = Σ_{l=1,0} Σ_{x ∈ D_l} (Φ(x) − m_l^Φ)(Φ(x) − m_l^Φ)^t,

where m_l^Φ = (1/n_l) Σ_{j=1}^{n_l} Φ(x_j^l), l = 1, 0, is the mean of the feature vectors in class l.

When w ∈ H is in the span of all training samples in the feature space, w can be written as w = Σ_{i=1}^n α_i Φ(x_i). Plugging w into the equation (2.1) and expanding both the numerator and denominator in terms of α using the kernel trick, we have

    w^t S_B w = α^t B α  and  w^t S_W w = α^t W α,

where B and W are defined based on the kernel matrix; see Section 5.1 for the details of B and W. Therefore, Fisher's linear discriminant can be found through the generalized eigen-problem:

    B α = λ W α.

Similar to kernel PCA, the projection of a new pattern on the direction w in the feature space is given by a linear combination of the coefficients α_i and the kernel functions evaluated at the new point and the original data, K(x_i, x), which gives the empirical discriminant function

    f̂(x) = Σ_{i=1}^n α_i K(x_i, x)

in kernel LDA.
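A compact numerical sketch of this procedure is given below. It is our illustration: the matrices B and W are built in the standard form of Mika et al. (1999) (the dissertation's exact definitions appear in Section 5.1), and a small ridge term regularizes W:

```python
import numpy as np

def kernel_lda(X1, X0, kernel, reg=1e-6):
    """Two-class kernel LDA: solve B alpha = lambda W alpha."""
    X = np.vstack([X1, X0])
    n, n1 = len(X), len(X1)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    M1 = K[:, :n1].mean(axis=1)          # (1/n1) sum_j K(x_i, x_j^1)
    M0 = K[:, n1:].mean(axis=1)
    B = np.outer(M1 - M0, M1 - M0)       # between-class matrix
    W = np.zeros((n, n))
    for K_l in (K[:, :n1], K[:, n1:]):   # within-class matrix
        n_l = K_l.shape[1]
        C = np.eye(n_l) - np.ones((n_l, n_l)) / n_l   # within-class centering
        W += K_l @ C @ K_l.T
    evals, evecs = np.linalg.eig(np.linalg.solve(W + reg * np.eye(n), B))
    alpha = np.real(evecs[:, np.argmax(np.real(evals))])
    return lambda x: alpha @ np.array([kernel(xi, x) for xi in X])

rng = np.random.default_rng(3)
X1, X0 = rng.normal(1, 1, (30, 2)), rng.normal(-1, 1, (30, 2))
f = kernel_lda(X1, X0, lambda x, y: (1 + x @ y) ** 2)
s1, s0 = np.mean([f(x) for x in X1]), np.mean([f(x) for x in X0])
print(abs(s1 - s0) > 0)   # the two classes separate along the discriminant
```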

2.3 Kernel Operator

2.3.1 Definition

As we mentioned before, a kernel function K is defined as a non-negative definite mapping from X × X to R, and there is a unique function space H_K (called a reproducing kernel Hilbert space) corresponding to the kernel. Given a probability distribution P with density function p(x) and a kernel function K, the distribution-dependent kernel operator is defined as

    K_p f(y) = ∫_X K(x, y) f(x) p(x) dx   (2.2)

as a mapping from H_K to H_K. Then an eigenfunction φ ∈ H_K and the corresponding eigenvalue λ for the operator K_p are defined through the equation K_p φ = λφ, or

    ∫_X K(x, y) φ(x) p(x) dx = λ φ(y).   (2.3)

Note that the eigenvalue and eigenfunction depend on both the kernel and the probability distribution.

To see the connection between the kernel operator and the kernel matrix as its sample version, consider the n × n kernel matrix K_n = [K(x_i, x_j)]. From the discussion about kernel PCA in Section 2.2.1, we know that kernel PCA finds nonlinear data embeddings for dimension reduction through eigen-analysis of the kernel matrix. Suppose that λ_n and v = (v_1, ..., v_n)^t are a pair of eigenvalue and eigenvector of K_n such that K_n v = λ_n v. Then for each i = 1, 2, ..., n, we have

    (1/n) Σ_{j=1}^n K(x_i, x_j) v_j = (λ_n/n) v_i.

When x_1, ..., x_n are sampled from the distribution with density p(x) and v is considered as a discrete version of φ(·) at the data points, (φ(x_1), ..., φ(x_n))^t, we can see that the left-hand side of the above equation is an approximation to its integral counterpart:

    (1/n) Σ_{j=1}^n K(x_i, x_j) φ(x_j) ≈ ∫_X K(x, x_i) φ(x) p(x) dx.

As a result, λ_n/n can be viewed as an approximation to the eigenvalue λ of the kernel operator with eigenfunction φ. The pair of λ_n and v yield a nonlinear principal component or nonlinear embedding from X to R given by

    φ̂(x) = (1/λ_n) Σ_{i=1}^n v_i K(x_i, x).
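The following sketch (ours; the sample size, bandwidth and seed are arbitrary) carries out exactly this approximation for a Gaussian kernel with data from N(0, 1): it extracts the leading eigenpair of K_n, reports λ_n/n, and extends the eigenvector to the function φ̂:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 1))                  # sample from p(x) = N(0, 1)
w = 1.0
K_n = np.exp(-(X - X.T) ** 2 / (2 * w ** 2))   # Gaussian kernel matrix

lam_all, V = np.linalg.eigh(K_n)
lam_n, v = lam_all[-1], V[:, -1]               # leading eigenpair of K_n

print("approximate operator eigenvalue:", lam_n / len(X))   # lam_n / n

def phi_hat(x):
    """Nonlinear embedding phi_hat(x) = (1/lam_n) sum_i v_i K(x_i, x),
    a continuous extension of the eigenvector v."""
    return np.sum(v * np.exp(-(X.ravel() - x) ** 2 / (2 * w ** 2))) / lam_n

print(phi_hat(0.0), phi_hat(1.0))
```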

Hence, eigen-analysis of the kernel operator amounts to an infinite-dimensional analogue of kernel PCA. Baker (1977) gives the theory of the numerical solution of eigenvalue problems, showing that the eigenvalues of K_n converge to the eigenvalues of the kernel operator as n → ∞. The eigen-analysis of kernel PCA is useful for understanding the kernel method at the population level.

2.3.2 Eigen-analysis of the Gaussian kernel operator

As we discussed in the introduction, Zhu et al. (1998), Williams and Seeger (2000) and Shi et al. (2009) studied the relation between Gaussian kernels and the eigenfunctions of the corresponding kernel operator under normal distributions. Shi et al. (2009) obtained a refined version of the analytic results in Zhu et al. (1998) for the spectrum of the Gaussian kernel operator in the univariate Gaussian case. When the probability density function is normal with P = N(µ, σ²) and the kernel function is K(x, y) = exp(−(x − y)²/(2w²)), the eigenvalues and eigenfunctions are given explicitly by

    λ_i = √(2/(1 + β + √(1 + 2β))) · (β/(1 + β + √(1 + 2β)))^{i−1},

    φ_i(x) = ((1 + 2β)^{1/8} / √(2^{i−1} (i−1)!)) · exp(−((x − µ)²/(2σ²)) · (√(1 + 2β) − 1)/2) · H_{i−1}((1/4 + β/2)^{1/4} (x − µ)/σ),

for i = 1, 2, ..., where β = 2σ²/w² and H_i is the ith order Hermite polynomial; see Koekoek and Swarttouw (1998) for more details about Hermite polynomials. Williams and Seeger (2000) investigated the dependence of the eigenfunctions on the Gaussian input density and discussed how this dependence determines the basis functions for classification problems. Shi et al. (2009) explored this connection in an attempt to understand the spectral clustering method from a population point of view.
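The displayed spectrum can be verified against a sample kernel matrix, using the convergence of the eigenvalues of K_n/n noted above. In the sketch below (ours; µ, σ and w are arbitrary choices), the analytic λ_i come out to 0.75, 0.1875, 0.0469, ... and the sample values agree closely:

```python
import numpy as np

mu, sigma, w = 0.0, 1.0, 1.5
beta = 2 * sigma ** 2 / w ** 2                 # beta = 2 sigma^2 / w^2
s = np.sqrt(1 + 2 * beta)

# Analytic eigenvalues of the Gaussian kernel operator under N(mu, sigma^2)
lam = np.array([np.sqrt(2 / (1 + beta + s)) * (beta / (1 + beta + s)) ** i
                for i in range(5)])            # i = 0 corresponds to lambda_1

# Empirical counterpart: leading eigenvalues of K_n / n for a large sample
rng = np.random.default_rng(5)
X = rng.normal(mu, sigma, size=2000)
K_n = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * w ** 2))
lam_hat = np.sort(np.linalg.eigvalsh(K_n))[::-1][:5] / len(X)

print(np.round(lam, 4))       # theory:  0.75, 0.1875, 0.0469, ...
print(np.round(lam_hat, 4))   # sample approximation
```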

Many clustering algorithms use the top eigenvectors of the kernel matrix or its normalized version (Scott and Longuet-Higgins 1990; Perona and Freeman 1998; Shi and Malik 2000). Despite their empirical success, some limitations of the above standard spectral clustering methods are noted in Nadler and Galun (2007). For example, they pointed out that clustering algorithms based on a kernel matrix with a single parameter (e.g., the Gaussian kernel) would fail when dealing with clusters of different scales. Shi et al. (2009) investigated spectral clustering at the population level when the distribution P includes several separate high-density components. They found that when there is enough separation among the components, each of the top eigenfunctions of the kernel operator corresponds to one of the separate components, with the order of the eigenfunctions determined by the mixture proportions and the eigenvalues. They also showed that the top eigenfunction of the kernel operator for a separate component is the only eigenfunction with no sign change. Hence, when each mixture component has enough separation from the other components, the number of eigenfunctions of the kernel operator K_P with no sign change suggests the number of components of the distribution. Using the relationship between the kernel matrix and the kernel operator, we can estimate the number of clusters by the number of eigenvectors that have no sign change up to some precision.

CHAPTER 3
EIGEN-ANALYSIS OF KERNEL OPERATORS FOR NONLINEAR DIMENSION REDUCTION

3.1 Eigen-analysis of the Polynomial Kernel Operator

In this section, we study the dependence of the eigenfunctions and eigenvalues of the kernel operator on the data distribution when polynomial kernels are used. We examine the eigen-expansion of the polynomial kernel operator based on the equation (2.3) when X = R^p, and establish the dependence of the eigen-expansion on the data distribution. There are two types of polynomial kernels of degree d: i) K(x, y) = (x^t y)^d and ii) K̃(x, y) = (1 + x^t y)^d. We begin with the eigen-analysis for the first type in the two-dimensional setting in Section 3.1.1 and generalize it to the p-dimensional setting in Section 3.1.2. Then we extend the analysis further to the second type with an additional constant in Section 3.1.3.

3.1.1 Two-dimensional setting

Suppose that data arise from a two-dimensional setting, X = R² with probability density p(x). For the polynomial kernel of degree d, K(x, y) = (x^t y)^d, we derive λ and φ(·) satisfying

    ∫_{R²} (x^t y)^d φ(x) p(x) dx = λ φ(y)   (3.1)

in this setting. More explicitly,

    K(x, y) = (x₁y₁ + x₂y₂)^d = Σ_{j=0}^d \binom{d}{j} (x₁y₁)^{d−j} (x₂y₂)^j = Σ_{j=0}^d \binom{d}{j} (x₁^{d−j} x₂^j)(y₁^{d−j} y₂^j).

Note that the polynomial kernel can also be expressed as the inner product of the so-called feature vectors, Φ(x)^t Φ(y), through the feature map

    Φ(x) = ( \binom{d}{0}^{1/2} x₁^d, \binom{d}{1}^{1/2} x₁^{d−1} x₂, ..., \binom{d}{d}^{1/2} x₂^d )^t.

Appendix C comments on the fact that the mapping K_p is valid with the polynomial kernel. With the explicit expression of K, the equation (3.1) becomes

    ∫ [ Σ_{j=0}^d \binom{d}{j} (x₁^{d−j} x₂^j)(y₁^{d−j} y₂^j) ] φ(x) p(x) dx = λ φ(y),

which is re-expressed as

    Σ_{j=0}^d \binom{d}{j}^{1/2} y₁^{d−j} y₂^j [ \binom{d}{j}^{1/2} ∫ x₁^{d−j} x₂^j φ(x) p(x) dx ] = λ φ(y).

Let C_j = \binom{d}{j}^{1/2} ∫ x₁^{d−j} x₂^j φ(x) p(x) dx be a distribution-dependent constant for j = 0, ..., d. Then for λ ≠ 0, the corresponding eigenfunction φ(·) should be of the form

    φ(y) = (1/λ) Σ_{k=0}^d \binom{d}{k}^{1/2} C_k y₁^{d−k} y₂^k.   (3.2)

By substituting (3.2) for φ(x) in the defining equation for C_j, we get the following equations for the constants (j = 0, ..., d):

    C_j = (1/λ) \binom{d}{j}^{1/2} ∫ x₁^{d−j} x₂^j [ Σ_{k=0}^d \binom{d}{k}^{1/2} C_k x₁^{d−k} x₂^k ] p(x) dx,

which leads to

    λ C_j = Σ_{k=0}^d \binom{d}{j}^{1/2} \binom{d}{k}^{1/2} C_k ∫ x₁^{2d−j−k} x₂^{j+k} p(x) dx.

Note that ∫ x₁^{2d−j−k} x₂^{j+k} p(x) dx is E(X₁^{2d−j−k} X₂^{j+k}), a moment of the random vector X = (X₁, X₂)^t distributed with p(x). Let µ_{2d−j−k, j+k} denote this moment. Then the set of the equations can be written as

    Σ_{k=0}^d \binom{d}{j}^{1/2} \binom{d}{k}^{1/2} µ_{2d−j−k, j+k} C_k = λ C_j  for j = 0, ..., d.   (3.3)

Defining the (d+1) × (d+1) matrix with entries given by moments of total degree 2d,

    M_2^d = [ \binom{d}{j}^{1/2} \binom{d}{k}^{1/2} µ_{2d−j−k, j+k} ]_{j,k=0}^d,   (3.4)

whose (0, 0) entry is \binom{d}{0} µ_{2d,0} and whose (d, d) entry is \binom{d}{d} µ_{0,2d}, we can succinctly express the set of equations as

    M_2^d (C_0, C_1, ..., C_d)^t = λ (C_0, C_1, ..., C_d)^t.   (3.5)
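In code, constructing M_2^d and solving (3.5) takes only a few lines. The sketch below (ours) takes a moment function µ_{a,b} and a degree d; the Uniform(0,1) moments used here anticipate the example of Section 3.2.1:

```python
import numpy as np
from math import comb

def moment_matrix(moment, d):
    """Moment matrix M_2^d with entries
    sqrt(C(d,j) C(d,k)) * mu_{2d-j-k, j+k}, for j, k = 0, ..., d."""
    M = np.empty((d + 1, d + 1))
    for j in range(d + 1):
        for k in range(d + 1):
            M[j, k] = (np.sqrt(comb(d, j) * comb(d, k))
                       * moment(2 * d - j - k, j + k))
    return M

# X1, X2 iid Uniform(0, 1):  mu_{a,b} = E[X1^a X2^b] = 1 / ((a+1)(b+1))
unif_moment = lambda a, b: 1.0 / ((a + 1) * (b + 1))
M = moment_matrix(unif_moment, d=2)
lam, C = np.linalg.eigh(M)        # columns of C are the eigenvectors
print(np.sort(lam)[::-1])         # at most d + 1 = 3 nonzero eigenvalues:
                                  # approximately 0.5206, 0.0889, 0.0127
```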

For the moment matrix M_2^d, the subscript indicates the input dimension, and the superscript refers to the degree of the polynomial kernel. From the equation (3.5), we can see that the pairs of eigenvalue and eigenfunction for the polynomial kernel operator are determined by the spectral decomposition of the moment matrix M_2^d. Note that the eigenvectors of M_2^d need to be scaled so that ∫ φ²(x) p(x) dx = 1. Obviously, the determinant of M_2^d − λI is a polynomial in λ of degree (d+1). Therefore, there are at most (d+1) nonzero eigenvalues of the polynomial kernel operator. The statements so far lead to the following theorem.

Theorem 1. Suppose that the probability distribution p(x₁, x₂) defined on R² has finite 2d-th moments µ_{j,2d−j} = E(X₁^j X₂^{2d−j}), j = 0, ..., 2d. For the polynomial kernel of degree d, K(x, y) = (x^t y)^d,

(i) The eigenvalues of the polynomial kernel operator are given by the eigenvalues of the moment matrix M_2^d = [ \binom{d}{j}^{1/2} \binom{d}{k}^{1/2} µ_{2d−j−k, j+k} ]_{j,k=0}^d.

(ii) There are at most d + 1 nonzero eigenvalues.

(iii) The eigenfunctions are polynomials of total degree d of the form in (3.2) with coefficients determined by the eigenvectors of M_2^d.

(iv) The eigenfunctions φ_i are orthogonal in the sense that ⟨φ_i, φ_j⟩_p = ∫_{R²} φ_i(x) φ_j(x) p(x) dx = 0 for i ≠ j.

Proof. We prove the statement (iv). Let C_i and C_j be the eigenvectors of M_2^d corresponding to a pair of eigenfunctions φ_i and φ_j ∈ H_K with eigenvalues λ_i and λ_j. Then for i ≠ j,

    ∫_{R²} φ_i(x) φ_j(x) p(x) dx ∝ C_i^t M_2^d C_j = C_i^t (λ_j C_j) = λ_j C_i^t C_j = 0,

since eigenvectors of the symmetric matrix M_2^d corresponding to distinct eigenvalues are orthogonal.

3.1.2 Multi-dimensional setting

In general, consider the p-dimensional input space (X = R^p) for the data. For x, y ∈ R^p, the kernel function can be expanded as

    (x^t y)^d = ( Σ_{k=1}^p x_k y_k )^d = Σ_{j₁+⋯+j_p=d} \binom{d}{j₁,...,j_p} Π_{k=1}^p (x_k y_k)^{j_k},

and the equation (2.3) becomes

    ∫ Σ_{j₁+⋯+j_p=d} \binom{d}{j₁,...,j_p} Π_{k=1}^p (x_k y_k)^{j_k} φ(x) p(x) dx = λ φ(y),

i.e.

    Σ_{j₁+⋯+j_p=d} \binom{d}{j₁,...,j_p}^{1/2} Π_{k=1}^p y_k^{j_k} [ \binom{d}{j₁,...,j_p}^{1/2} ∫ Π_{k=1}^p x_k^{j_k} φ(x) p(x) dx ] = λ φ(y).

Letting C_{j₁,...,j_p} = \binom{d}{j₁,...,j_p}^{1/2} ∫ Π_{k=1}^p x_k^{j_k} φ(x) p(x) dx, we can write the eigenfunction φ(·) as

    φ(x) = (1/λ) Σ_{j₁+⋯+j_p=d} \binom{d}{j₁,...,j_p}^{1/2} C_{j₁,...,j_p} Π_{k=1}^p x_k^{j_k}.   (3.6)

Again, by plugging this expansion of φ(x) in the equation that defines C_{j₁,...,j_p}, we get a set of equations for the constants:

    C_{j₁,...,j_p} = (1/λ) \binom{d}{j₁,...,j_p}^{1/2} ∫ Π_k x_k^{j_k} [ Σ_{i₁+⋯+i_p=d} \binom{d}{i₁,...,i_p}^{1/2} C_{i₁,...,i_p} Π_k x_k^{i_k} ] p(x) dx,

which is rewritten as

    λ C_{j₁,...,j_p} = Σ_{i₁+⋯+i_p=d} \binom{d}{i₁,...,i_p}^{1/2} \binom{d}{j₁,...,j_p}^{1/2} C_{i₁,...,i_p} ∫ Π_{k=1}^p x_k^{i_k+j_k} p(x) dx.

Let µ_{j₁+i₁,...,j_p+i_p} denote the moment E(Π_{k=1}^p X_k^{j_k+i_k}) = ∫ Π_{k=1}^p x_k^{i_k+j_k} p(x) dx for (i₁,...,i_p) with i₁+⋯+i_p = d and (j₁,...,j_p) with j₁+⋯+j_p = d. Then we have

    Σ_{i₁+⋯+i_p=d} \binom{d}{i₁,...,i_p}^{1/2} \binom{d}{j₁,...,j_p}^{1/2} µ_{j₁+i₁,...,j_p+i_p} C_{i₁,...,i_p} = λ C_{j₁,...,j_p}.   (3.7)

To express the above equation in matrix form, we generalize the moment matrix M_2^d to M_p^d with entries given by \binom{d}{i₁,...,i_p}^{1/2} \binom{d}{j₁,...,j_p}^{1/2} µ_{j₁+i₁,...,j_p+i_p}. The dimension of M_p^d is the number of combinations of non-negative integers j_k's satisfying j₁+⋯+j_p = d, which is p_d = \binom{d+p−1}{d}. Then the equation (3.7) is written as M_p^d C = λ C, where C is a p_d-vector with entries C_{j₁,...,j_p} for j₁+⋯+j_p = d. Applying the argument used for the two-dimensional setting, we conclude that there are at most p_d = \binom{d+p−1}{d} nonzero eigenvalues of the polynomial kernel operator, and p_d depends on both the input dimension and the degree of the polynomial kernel. Thus we arrive at the following theorem.

35 Theorem. Suppose that the probability istribution p(x 1, x,, x p ) efine on p R p has finite th moments, µ i1 +j 1,,i p+j p = E( X i k+j k k ) for j j p =, i i p =. For the polynomial kernel of egree, K(x, y) = (x t y), k=1 (i) The eigenvalues of the polynomial kernel operator are given by the eigenvalues of the moment matrix Mp. ( ) + p 1 (ii) There are at most p = nonzero eigenvalues. (iii) The eigenfunctions are polynomials of total egree of the form in (.6) with coefficients given by the eigenvectors of M p. (iv) The eigenfunctions are orthogonal with respect to the inner prouct, φ i, φ j p = R p φ i (x)φ j (x)p(x)x..1. Polynomial kernel with constant The kernel operator for the secon type of polynomial kernel with constant can be treate as a special case of what we have iscusse in the previous section. For example, K (x, y) = (1 + x 1 y 1 + x y ) in the two-imensional setting can be viewe as K(x, y) = (x 1 y 1 + x y + x y ) in the three-imensional setting with x = y = 1. Using the connection between K an K, we know that the number of nonzero eigenvalues for the kernel operator with K is at most ( ) ( ) = = 6 from Theorem. The eigenfunctions in this case are of the following form: φ(x) = 1 ( ) 1 Cj1,j λ j 1, j, j,j x j 1 1 x j. (.) j 1 +j +j = 4

There are six combinations of non-negative integers j_k's such that j₁ + j₂ + j₃ = 2. The matrix M̃ in general is given as follows:

    M̃ = [ µ_{4,0,0}     √2 µ_{3,1,0}  √2 µ_{3,0,1}  µ_{2,2,0}     √2 µ_{2,1,1}  µ_{2,0,2}
          √2 µ_{3,1,0}  2 µ_{2,2,0}   2 µ_{2,1,1}   √2 µ_{1,3,0}  2 µ_{1,2,1}   √2 µ_{1,1,2}
          √2 µ_{3,0,1}  2 µ_{2,1,1}   2 µ_{2,0,2}   √2 µ_{1,2,1}  2 µ_{1,1,2}   √2 µ_{1,0,3}
          µ_{2,2,0}     √2 µ_{1,3,0}  √2 µ_{1,2,1}  µ_{0,4,0}     √2 µ_{0,3,1}  µ_{0,2,2}
          √2 µ_{2,1,1}  2 µ_{1,2,1}   2 µ_{1,1,2}   √2 µ_{0,3,1}  2 µ_{0,2,2}   √2 µ_{0,1,3}
          µ_{2,0,2}     √2 µ_{1,1,2}  √2 µ_{1,0,3}  µ_{0,2,2}     √2 µ_{0,1,3}  µ_{0,0,4} ],

and the vector C of constants C_{j₁,j₂,j₃} satisfies the following equation:

    M̃ (C_{2,0,0}, C_{1,1,0}, C_{1,0,1}, C_{0,2,0}, C_{0,1,1}, C_{0,0,2})^t = λ (C_{2,0,0}, C_{1,1,0}, C_{1,0,1}, C_{0,2,0}, C_{0,1,1}, C_{0,0,2})^t.

Since X₃ = 1, the moments µ_{i₁+j₁, i₂+j₂, i₃+j₃} = E(Π_{k=1}^3 X_k^{i_k+j_k}) are simplified to µ_{i₁+j₁, i₂+j₂} = E(X₁^{i₁+j₁} X₂^{i₂+j₂}).

In summary, we conclude that for a data distribution in R^p and the polynomial kernel of degree d with a constant term, the resulting eigenvalues and eigenfunctions of the kernel operator can be obtained on the basis of Theorem 2. The extension is accomplished by application of the result with the polynomial kernel of degree d for a data distribution in R^{p+1} with X_{p+1} fixed at 1, where the moments µ_{i₁+j₁,...,i_{p+1}+j_{p+1}} = E(Π_{k=1}^{p+1} X_k^{i_k+j_k}) reduce to µ_{i₁+j₁,...,i_p+j_p} = E(Π_{k=1}^p X_k^{i_k+j_k}).
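The reduction to Theorem 2 rests on the identity (1 + x^t y)^d = (x̃^t ỹ)^d for the augmented vectors x̃ = (x₁, x₂, 1)^t, which is easy to confirm numerically (a sketch with arbitrary test points):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(5, 2))
d = 2

# K~(x, y) = (1 + x^t y)^d on R^2 ...
K_tilde = (1 + X @ X.T) ** d

# ... equals K(x, y) = (x^t y)^d on R^3 after appending the coordinate 1
X_aug = np.hstack([X, np.ones((5, 1))])
K = (X_aug @ X_aug.T) ** d

print(np.allclose(K_tilde, K))   # True: the two kernels coincide
```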

3.2 Simulation Studies

We present simulation studies to illustrate the relationship between the theoretical eigenfunctions and the sample eigenvectors for kernel PCA. First we consider two simulation settings in R² and examine the explicit forms of the eigenfunctions using Theorem 1. With an additional example, we investigate the effect of degree (the parameter for polynomial kernels) on the nonlinear data embeddings induced by the kernel, which can be used for uncovering data clusters or discriminating different classes. Furthermore, we explore how eigenfunctions can be used to understand certain geometric patterns observed in data projections.

3.2.1 Uniform example

For X = (X₁, X₂)^t, let X₁ and X₂ be iid with the uniform distribution on (0, 1). Suppose that we use the second-order polynomial kernel, K(x, y) = (x₁y₁ + x₂y₂)². Since all the fourth moments µ_{j,4−j} = E(X₁^j X₂^{4−j}), j = 0, ..., 4, are finite in this case, we can compute the theoretical moment matrix M_2^2 explicitly, and it is given by

    M_2^2 = [ µ_{4,0}     √2 µ_{3,1}  µ_{2,2}
              √2 µ_{3,1}  2 µ_{2,2}   √2 µ_{1,3}
              µ_{2,2}     √2 µ_{1,3}  µ_{0,4} ]
          = [ 0.2     0.1768  0.1111
              0.1768  0.2222  0.1768
              0.1111  0.1768  0.2 ].

Notice that there is symmetry in the moments due to the exchangeability of X₁ and X₂ (e.g. µ_{1,3} = µ_{3,1}). The eigenvalues of the kernel operator are the same as those of M_2^2. We can get the eigenvalues of the matrix numerically, which are given by λ₁ = 0.5206, λ₂ = 0.0889, and λ₃ = 0.0127.

According to Theorem 1, given each eigenvalue λ, the corresponding eigenfunction can be written explicitly in the form

    φ(x) = (1/λ)(C₀ x₁² + √2 C₁ x₁x₂ + C₂ x₂²),

where (C₀, C₁, C₂)^t is a scaled version of the eigenvector of M_2^2 corresponding to λ. For simplicity of exposition, we choose not to scale the eigenfunctions to the unit norm but to go with the scale given by the eigenvectors throughout our numerical studies. With the unit-normed eigenvectors, we have the following eigenfunctions for the uniform distribution:

    φ₁(x) = −(0.542x₁² + 0.908x₁x₂ + 0.542x₂²),
    φ₂(x) = 0.707x₁² − 0.707x₂²,
    φ₃(x) = 0.454x₁² − 1.084x₁x₂ + 0.454x₂².

To make numerical comparisons, we took a sample of size 400 from the distribution and computed the sample kernel matrix for the second-order polynomial kernel. Then we obtained its eigenvalues and corresponding eigenvectors. There are three non-zero eigenvalues, and they are λ̂₁ = 0.540, λ̂₂ = 0.044, and λ̂₃ = 0.01 after being scaled by the sample size n as discussed in Section 2.3.1. The sample eigenvalues are quite close to the theoretical ones. Figure 3.1 compares the contour plots of the nonlinear embeddings given by the sample eigenvectors and the theoretical eigenfunctions. The top panels are for the embeddings induced by the leading eigenvectors of the kernel matrix, while the bottom panels are for the theoretical eigenfunctions obtained from the moment matrix. The change in color from blue to yellow in each panel indicates an increase in values. There is great similarity between the contours of the true eigenfunction and its sample version through the eigenvector in terms of the shape and the gradient indicated by the color change.

We also observe in Figure 3.1 that the nonlinear embeddings given by the first two leading eigenvectors and eigenfunctions of the second-order polynomial kernel are roughly along the two diagonal lines of the unit square (0,1)², which correspond to the directions of the largest variation in the uniform distribution.

Figure 3.1: Comparison of the contours of the nonlinear embeddings given by three leading eigenvectors and the theoretical eigenfunctions for the uniform data. The upper three panels are for the embeddings induced by the eigenvectors for three nonzero eigenvalues, and the lower three panels are for the corresponding eigenfunctions.
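The comparison in this example is easy to reproduce (a sketch with our own seed; sample values vary slightly from run to run). The nonzero eigenvalues of K_n/n here coincide with those of the sample version of M_2^2, so they approach the theoretical values 0.5206, 0.0889 and 0.0127 as n grows:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.uniform(size=(400, 2))
K_n = (X @ X.T) ** 2                      # second-order polynomial kernel

lam_hat = np.sort(np.linalg.eigvalsh(K_n))[::-1][:3] / 400
print(np.round(lam_hat, 4))               # compare with 0.5206, 0.0889, 0.0127
```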

3.2.2 Mixture normal example

We turn to a mixture of normal distributions for (X₁, X₂)^t. Suppose that X₁ and X₂ are two independent variables, where X₁ follows a two-component mixture of normal distributions and X₂ follows a normal distribution. For this example, we consider the polynomial kernels of degrees 2 and 3.

When degree is 2

The moment matrix for the mixture distribution can be obtained as follows:

    M_2^2 = [ µ_{4,0}     √2 µ_{3,1}  µ_{2,2}
              √2 µ_{3,1}  2 µ_{2,2}   √2 µ_{1,3}
              µ_{2,2}     √2 µ_{1,3}  µ_{0,4} ].

Three nonzero eigenvalues of the matrix are λ₁ = 110.5, λ₂ = 1.415, and λ₃ = .4. With the corresponding eigenvectors of the moment matrix, we get the following three eigenfunctions:

    φ₁(x) = 0.6x₁² − 0.5x₁x₂ + ⋯,
    φ₂(x) = −0.45x₁² + ⋯ + 0.4x₂²,
    φ₃(x) = 0.064x₁² + ⋯ + 0.5x₂².

Contours of these eigenfunctions are displayed in the bottom panels of Figure 3.2. For their sample counterparts, we generated a random sample of size 400 from the mixture of two normals. A scatter plot of the sample is displayed in the top left panel of Figure 3.4.

Three nonzero eigenvalues from the kernel matrix are found to be λ̂₁ = ⋯, λ̂₂ = 1.6, and λ̂₃ = ⋯. The top panels of Figure 3.2 show the contours of the data embeddings given by the corresponding eigenvectors. The data embeddings and eigenfunctions for this mixture normal example also exhibit strong similarity. The contours of the leading embedding and eigenfunction are ellipses centered at the origin. It appears that the minor axis of the ellipses for the leading eigenfunction corresponds to the line connecting the two mean vectors of the mixture distribution, capturing the largest data variation, and the major axis is perpendicular to the mean difference. The contours of the second leading eigenfunction are hyperbolas centered at the origin. The asymptotes of the hyperbolas for the eigenfunction are the same as the major and minor axes for the leading eigenfunction. Although the approximate symmetry around the origin that the data embeddings and eigenfunctions exhibit reflects that of the underlying distribution, information about the two normal components is lost after projection. If dimension reduction is to be used primarily for identifying the clusters later, then the quadratic kernel would not be useful in this case.

When degree is 3

The moment matrix for d = 3 involves the moments up to order 6, and for the mixture distribution, it is explicitly given by

    M_2^3 = [ µ_{6,0}     √3 µ_{5,1}  √3 µ_{4,2}  µ_{3,3}
              √3 µ_{5,1}  3 µ_{4,2}   3 µ_{3,3}   √3 µ_{2,4}
              √3 µ_{4,2}  3 µ_{3,3}   3 µ_{2,4}   √3 µ_{1,5}
              µ_{3,3}     √3 µ_{2,4}  √3 µ_{1,5}  µ_{0,6} ].

Figure 3.2: Comparison of the contours of the nonlinear embeddings given by three leading eigenvectors (top panels) and the theoretical eigenfunctions (bottom panels) for the mixture normal data when degree is 2.

The matrix has four nonzero eigenvalues, λ₁ = ⋯, λ₂ = 4.4, λ₃ = 5.66, and λ₄ = ⋯, and the corresponding eigenfunctions are

    φ₁(x) = 0.4x₁³ − 0.1x₁²x₂ + 0.4x₁x₂² − 0.0x₂³,
    φ₂(x) = −0.51x₁³ − 1.0x₁²x₂ + 0.6x₁x₂² − 0.1x₂³,
    φ₃(x) = −0.11x₁³ − 1.0x₁²x₂ + ⋯,
    φ₄(x) = −0.06x₁³ + ⋯ + 1.0x₁x₂² + ⋯.

We obtained the kernel matrix with the polynomial kernel of degree 3 for the same data as in the d = 2 case. Four leading eigenvalues for this kernel matrix are λ̂₁ = 1.5, λ̂₂ = 46.10, λ̂₃ = 5.54, and λ̂₄ = .65.
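When the moments of the data distribution are tedious to derive in closed form, the moment matrix of Theorem 1 can be estimated from sample moments. The sketch below (ours; the mixture weights and means are illustrative choices, not necessarily those of the example above) does this for d = 3 and confirms that the eigenvalues of the estimated M_2^3 match the leading eigenvalues of K_n/n:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(8)
n, d = 2000, 3
# Illustrative mixture: X1 ~ 0.5 N(-2, 1) + 0.5 N(2, 1), X2 ~ N(0, 1)
x1 = np.where(rng.random(n) < 0.5, rng.normal(-2, 1, n), rng.normal(2, 1, n))
x2 = rng.normal(0, 1, n)
X = np.column_stack([x1, x2])

# Sample moment matrix: mu_{a,b} estimated by the mean of X1^a X2^b
mu = lambda a, b: np.mean(x1 ** a * x2 ** b)
M = np.array([[np.sqrt(comb(d, j) * comb(d, k)) * mu(2 * d - j - k, j + k)
               for k in range(d + 1)] for j in range(d + 1)])

lam_M = np.sort(np.linalg.eigvalsh(M))[::-1]
lam_K = np.sort(np.linalg.eigvalsh((X @ X.T) ** d))[::-1][:d + 1] / n
print(np.round(lam_M, 3))   # eigenvalues of the sample moment matrix
print(np.round(lam_K, 3))   # kernel matrix eigenvalues / n; same values
```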


More information

Math 1B, lecture 8: Integration by parts

Math 1B, lecture 8: Integration by parts Math B, lecture 8: Integration by parts Nathan Pflueger 23 September 2 Introuction Integration by parts, similarly to integration by substitution, reverses a well-known technique of ifferentiation an explores

More information

Table of Common Derivatives By David Abraham

Table of Common Derivatives By David Abraham Prouct an Quotient Rules: Table of Common Derivatives By Davi Abraham [ f ( g( ] = [ f ( ] g( + f ( [ g( ] f ( = g( [ f ( ] g( g( f ( [ g( ] Trigonometric Functions: sin( = cos( cos( = sin( tan( = sec

More information

Linear First-Order Equations

Linear First-Order Equations 5 Linear First-Orer Equations Linear first-orer ifferential equations make up another important class of ifferential equations that commonly arise in applications an are relatively easy to solve (in theory)

More information

Delocalization of boundary states in disordered topological insulators

Delocalization of boundary states in disordered topological insulators Journal of Physics A: Mathematical an Theoretical J. Phys. A: Math. Theor. 48 (05) FT0 (pp) oi:0.088/75-83/48//ft0 Fast Track Communication Delocalization of bounary states in isorere topological insulators

More information

Situation awareness of power system based on static voltage security region

Situation awareness of power system based on static voltage security region The 6th International Conference on Renewable Power Generation (RPG) 19 20 October 2017 Situation awareness of power system base on static voltage security region Fei Xiao, Zi-Qing Jiang, Qian Ai, Ran

More information

Construction of the Electronic Radial Wave Functions and Probability Distributions of Hydrogen-like Systems

Construction of the Electronic Radial Wave Functions and Probability Distributions of Hydrogen-like Systems Construction of the Electronic Raial Wave Functions an Probability Distributions of Hyrogen-like Systems Thomas S. Kuntzleman, Department of Chemistry Spring Arbor University, Spring Arbor MI 498 tkuntzle@arbor.eu

More information

6 General properties of an autonomous system of two first order ODE

6 General properties of an autonomous system of two first order ODE 6 General properties of an autonomous system of two first orer ODE Here we embark on stuying the autonomous system of two first orer ifferential equations of the form ẋ 1 = f 1 (, x 2 ), ẋ 2 = f 2 (, x

More information

ELEC3114 Control Systems 1

ELEC3114 Control Systems 1 ELEC34 Control Systems Linear Systems - Moelling - Some Issues Session 2, 2007 Introuction Linear systems may be represente in a number of ifferent ways. Figure shows the relationship between various representations.

More information

A Modification of the Jarque-Bera Test. for Normality

A Modification of the Jarque-Bera Test. for Normality Int. J. Contemp. Math. Sciences, Vol. 8, 01, no. 17, 84-85 HIKARI Lt, www.m-hikari.com http://x.oi.org/10.1988/ijcms.01.9106 A Moification of the Jarque-Bera Test for Normality Moawa El-Fallah Ab El-Salam

More information

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21 Large amping in a structural material may be either esirable or unesirable, epening on the engineering application at han. For example, amping is a esirable property to the esigner concerne with limiting

More information

A Hybrid Approach for Modeling High Dimensional Medical Data

A Hybrid Approach for Modeling High Dimensional Medical Data A Hybri Approach for Moeling High Dimensional Meical Data Alok Sharma 1, Gofrey C. Onwubolu 1 1 University of the South Pacific, Fii sharma_al@usp.ac.f, onwubolu_g@usp.ac.f Abstract. his work presents

More information

Implicit Differentiation

Implicit Differentiation Implicit Differentiation Thus far, the functions we have been concerne with have been efine explicitly. A function is efine explicitly if the output is given irectly in terms of the input. For instance,

More information

arxiv: v4 [math.pr] 27 Jul 2016

arxiv: v4 [math.pr] 27 Jul 2016 The Asymptotic Distribution of the Determinant of a Ranom Correlation Matrix arxiv:309768v4 mathpr] 7 Jul 06 AM Hanea a, & GF Nane b a Centre of xcellence for Biosecurity Risk Analysis, University of Melbourne,

More information

Qubit channels that achieve capacity with two states

Qubit channels that achieve capacity with two states Qubit channels that achieve capacity with two states Dominic W. Berry Department of Physics, The University of Queenslan, Brisbane, Queenslan 4072, Australia Receive 22 December 2004; publishe 22 March

More information

ANALYSIS OF A GENERAL FAMILY OF REGULARIZED NAVIER-STOKES AND MHD MODELS

ANALYSIS OF A GENERAL FAMILY OF REGULARIZED NAVIER-STOKES AND MHD MODELS ANALYSIS OF A GENERAL FAMILY OF REGULARIZED NAVIER-STOKES AND MHD MODELS MICHAEL HOLST, EVELYN LUNASIN, AND GANTUMUR TSOGTGEREL ABSTRACT. We consier a general family of regularize Navier-Stokes an Magnetohyroynamics

More information

Chapter 4. Electrostatics of Macroscopic Media

Chapter 4. Electrostatics of Macroscopic Media Chapter 4. Electrostatics of Macroscopic Meia 4.1 Multipole Expansion Approximate potentials at large istances 3 x' x' (x') x x' x x Fig 4.1 We consier the potential in the far-fiel region (see Fig. 4.1

More information

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs Lectures - Week 10 Introuction to Orinary Differential Equations (ODES) First Orer Linear ODEs When stuying ODEs we are consiering functions of one inepenent variable, e.g., f(x), where x is the inepenent

More information

Schrödinger s equation.

Schrödinger s equation. Physics 342 Lecture 5 Schröinger s Equation Lecture 5 Physics 342 Quantum Mechanics I Wenesay, February 3r, 2010 Toay we iscuss Schröinger s equation an show that it supports the basic interpretation of

More information

arxiv:hep-th/ v1 3 Feb 1993

arxiv:hep-th/ v1 3 Feb 1993 NBI-HE-9-89 PAR LPTHE 9-49 FTUAM 9-44 November 99 Matrix moel calculations beyon the spherical limit arxiv:hep-th/93004v 3 Feb 993 J. Ambjørn The Niels Bohr Institute Blegamsvej 7, DK-00 Copenhagen Ø,

More information

Proof of SPNs as Mixture of Trees

Proof of SPNs as Mixture of Trees A Proof of SPNs as Mixture of Trees Theorem 1. If T is an inuce SPN from a complete an ecomposable SPN S, then T is a tree that is complete an ecomposable. Proof. Argue by contraiction that T is not a

More information

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments 2 Conference on Information Sciences an Systems, The Johns Hopkins University, March 2, 2 Time-of-Arrival Estimation in Non-Line-Of-Sight Environments Sinan Gezici, Hisashi Kobayashi an H. Vincent Poor

More information

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k A Proof of Lemma 2 B Proof of Lemma 3 Proof: Since the support of LL istributions is R, two such istributions are equivalent absolutely continuous with respect to each other an the ivergence is well-efine

More information

Final Exam Study Guide and Practice Problems Solutions

Final Exam Study Guide and Practice Problems Solutions Final Exam Stuy Guie an Practice Problems Solutions Note: These problems are just some of the types of problems that might appear on the exam. However, to fully prepare for the exam, in aition to making

More information

Separation of Variables

Separation of Variables Physics 342 Lecture 1 Separation of Variables Lecture 1 Physics 342 Quantum Mechanics I Monay, January 25th, 2010 There are three basic mathematical tools we nee, an then we can begin working on the physical

More information

u!i = a T u = 0. Then S satisfies

u!i = a T u = 0. Then S satisfies Deterministic Conitions for Subspace Ientifiability from Incomplete Sampling Daniel L Pimentel-Alarcón, Nigel Boston, Robert D Nowak University of Wisconsin-Maison Abstract Consier an r-imensional subspace

More information

Optimization of Geometries by Energy Minimization

Optimization of Geometries by Energy Minimization Optimization of Geometries by Energy Minimization by Tracy P. Hamilton Department of Chemistry University of Alabama at Birmingham Birmingham, AL 3594-140 hamilton@uab.eu Copyright Tracy P. Hamilton, 1997.

More information

Diagonalization of Matrices Dr. E. Jacobs

Diagonalization of Matrices Dr. E. Jacobs Diagonalization of Matrices Dr. E. Jacobs One of the very interesting lessons in this course is how certain algebraic techniques can be use to solve ifferential equations. The purpose of these notes is

More information

θ x = f ( x,t) could be written as

θ x = f ( x,t) could be written as 9. Higher orer PDEs as systems of first-orer PDEs. Hyperbolic systems. For PDEs, as for ODEs, we may reuce the orer by efining new epenent variables. For example, in the case of the wave equation, (1)

More information

APPROXIMATE SOLUTION FOR TRANSIENT HEAT TRANSFER IN STATIC TURBULENT HE II. B. Baudouy. CEA/Saclay, DSM/DAPNIA/STCM Gif-sur-Yvette Cedex, France

APPROXIMATE SOLUTION FOR TRANSIENT HEAT TRANSFER IN STATIC TURBULENT HE II. B. Baudouy. CEA/Saclay, DSM/DAPNIA/STCM Gif-sur-Yvette Cedex, France APPROXIMAE SOLUION FOR RANSIEN HEA RANSFER IN SAIC URBULEN HE II B. Bauouy CEA/Saclay, DSM/DAPNIA/SCM 91191 Gif-sur-Yvette Ceex, France ABSRAC Analytical solution in one imension of the heat iffusion equation

More information

Hyperbolic Systems of Equations Posed on Erroneous Curved Domains

Hyperbolic Systems of Equations Posed on Erroneous Curved Domains Hyperbolic Systems of Equations Pose on Erroneous Curve Domains Jan Norström a, Samira Nikkar b a Department of Mathematics, Computational Mathematics, Linköping University, SE-58 83 Linköping, Sween (

More information

Conservation Laws. Chapter Conservation of Energy

Conservation Laws. Chapter Conservation of Energy 20 Chapter 3 Conservation Laws In orer to check the physical consistency of the above set of equations governing Maxwell-Lorentz electroynamics [(2.10) an (2.12) or (1.65) an (1.68)], we examine the action

More information

On conditional moments of high-dimensional random vectors given lower-dimensional projections

On conditional moments of high-dimensional random vectors given lower-dimensional projections Submitte to the Bernoulli arxiv:1405.2183v2 [math.st] 6 Sep 2016 On conitional moments of high-imensional ranom vectors given lower-imensional projections LUKAS STEINBERGER an HANNES LEEB Department of

More information

Make graph of g by adding c to the y-values. on the graph of f by c. multiplying the y-values. even-degree polynomial. graph goes up on both sides

Make graph of g by adding c to the y-values. on the graph of f by c. multiplying the y-values. even-degree polynomial. graph goes up on both sides Reference 1: Transformations of Graphs an En Behavior of Polynomial Graphs Transformations of graphs aitive constant constant on the outsie g(x) = + c Make graph of g by aing c to the y-values on the graph

More information

THE VAN KAMPEN EXPANSION FOR LINKED DUFFING LINEAR OSCILLATORS EXCITED BY COLORED NOISE

THE VAN KAMPEN EXPANSION FOR LINKED DUFFING LINEAR OSCILLATORS EXCITED BY COLORED NOISE Journal of Soun an Vibration (1996) 191(3), 397 414 THE VAN KAMPEN EXPANSION FOR LINKED DUFFING LINEAR OSCILLATORS EXCITED BY COLORED NOISE E. M. WEINSTEIN Galaxy Scientific Corporation, 2500 English Creek

More information

New Statistical Test for Quality Control in High Dimension Data Set

New Statistical Test for Quality Control in High Dimension Data Set International Journal of Applie Engineering Research ISSN 973-456 Volume, Number 6 (7) pp. 64-649 New Statistical Test for Quality Control in High Dimension Data Set Shamshuritawati Sharif, Suzilah Ismail

More information

Lecture 2 Lagrangian formulation of classical mechanics Mechanics

Lecture 2 Lagrangian formulation of classical mechanics Mechanics Lecture Lagrangian formulation of classical mechanics 70.00 Mechanics Principle of stationary action MATH-GA To specify a motion uniquely in classical mechanics, it suffices to give, at some time t 0,

More information

A Weak First Digit Law for a Class of Sequences

A Weak First Digit Law for a Class of Sequences International Mathematical Forum, Vol. 11, 2016, no. 15, 67-702 HIKARI Lt, www.m-hikari.com http://x.oi.org/10.1288/imf.2016.6562 A Weak First Digit Law for a Class of Sequences M. A. Nyblom School of

More information

Permanent vs. Determinant

Permanent vs. Determinant Permanent vs. Determinant Frank Ban Introuction A major problem in theoretical computer science is the Permanent vs. Determinant problem. It asks: given an n by n matrix of ineterminates A = (a i,j ) an

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Keywors: multi-view learning, clustering, canonical correlation analysis Abstract Clustering ata in high-imensions is believe to be a har problem in general. A number of efficient clustering algorithms

More information

Acute sets in Euclidean spaces

Acute sets in Euclidean spaces Acute sets in Eucliean spaces Viktor Harangi April, 011 Abstract A finite set H in R is calle an acute set if any angle etermine by three points of H is acute. We examine the maximal carinality α() of

More information

SINGULAR PERTURBATION AND STATIONARY SOLUTIONS OF PARABOLIC EQUATIONS IN GAUSS-SOBOLEV SPACES

SINGULAR PERTURBATION AND STATIONARY SOLUTIONS OF PARABOLIC EQUATIONS IN GAUSS-SOBOLEV SPACES Communications on Stochastic Analysis Vol. 2, No. 2 (28) 289-36 Serials Publications www.serialspublications.com SINGULAR PERTURBATION AND STATIONARY SOLUTIONS OF PARABOLIC EQUATIONS IN GAUSS-SOBOLEV SPACES

More information

Introduction to Machine Learning

Introduction to Machine Learning How o you estimate p(y x)? Outline Contents Introuction to Machine Learning Logistic Regression Varun Chanola April 9, 207 Generative vs. Discriminative Classifiers 2 Logistic Regression 2 3 Logistic Regression

More information

A variance decomposition and a Central Limit Theorem for empirical losses associated with resampling designs

A variance decomposition and a Central Limit Theorem for empirical losses associated with resampling designs Mathias Fuchs, Norbert Krautenbacher A variance ecomposition an a Central Limit Theorem for empirical losses associate with resampling esigns Technical Report Number 173, 2014 Department of Statistics

More information

Homework 2 EM, Mixture Models, PCA, Dualitys

Homework 2 EM, Mixture Models, PCA, Dualitys Homework 2 EM, Mixture Moels, PCA, Dualitys CMU 10-715: Machine Learning (Fall 2015) http://www.cs.cmu.eu/~bapoczos/classes/ml10715_2015fall/ OUT: Oct 5, 2015 DUE: Oct 19, 2015, 10:20 AM Guielines The

More information

JUST THE MATHS UNIT NUMBER DIFFERENTIATION 2 (Rates of change) A.J.Hobson

JUST THE MATHS UNIT NUMBER DIFFERENTIATION 2 (Rates of change) A.J.Hobson JUST THE MATHS UNIT NUMBER 10.2 DIFFERENTIATION 2 (Rates of change) by A.J.Hobson 10.2.1 Introuction 10.2.2 Average rates of change 10.2.3 Instantaneous rates of change 10.2.4 Derivatives 10.2.5 Exercises

More information

Topic 7: Convergence of Random Variables

Topic 7: Convergence of Random Variables Topic 7: Convergence of Ranom Variables Course 003, 2016 Page 0 The Inference Problem So far, our starting point has been a given probability space (S, F, P). We now look at how to generate information

More information

Some vector algebra and the generalized chain rule Ross Bannister Data Assimilation Research Centre, University of Reading, UK Last updated 10/06/10

Some vector algebra and the generalized chain rule Ross Bannister Data Assimilation Research Centre, University of Reading, UK Last updated 10/06/10 Some vector algebra an the generalize chain rule Ross Bannister Data Assimilation Research Centre University of Reaing UK Last upate 10/06/10 1. Introuction an notation As we shall see in these notes the

More information

Generalizing Kronecker Graphs in order to Model Searchable Networks

Generalizing Kronecker Graphs in order to Model Searchable Networks Generalizing Kronecker Graphs in orer to Moel Searchable Networks Elizabeth Boine, Babak Hassibi, Aam Wierman California Institute of Technology Pasaena, CA 925 Email: {eaboine, hassibi, aamw}@caltecheu

More information

Some properties of random staircase tableaux

Some properties of random staircase tableaux Some properties of ranom staircase tableaux Sanrine Dasse Hartaut Pawe l Hitczenko Downloae /4/7 to 744940 Reistribution subject to SIAM license or copyright; see http://wwwsiamorg/journals/ojsaphp Abstract

More information

The Three-dimensional Schödinger Equation

The Three-dimensional Schödinger Equation The Three-imensional Schöinger Equation R. L. Herman November 7, 016 Schröinger Equation in Spherical Coorinates We seek to solve the Schröinger equation with spherical symmetry using the metho of separation

More information

Tutorial on Maximum Likelyhood Estimation: Parametric Density Estimation

Tutorial on Maximum Likelyhood Estimation: Parametric Density Estimation Tutorial on Maximum Likelyhoo Estimation: Parametric Density Estimation Suhir B Kylasa 03/13/2014 1 Motivation Suppose one wishes to etermine just how biase an unfair coin is. Call the probability of tossing

More information

LeChatelier Dynamics

LeChatelier Dynamics LeChatelier Dynamics Robert Gilmore Physics Department, Drexel University, Philaelphia, Pennsylvania 1914, USA (Date: June 12, 28, Levine Birthay Party: To be submitte.) Dynamics of the relaxation of a

More information

Switching Time Optimization in Discretized Hybrid Dynamical Systems

Switching Time Optimization in Discretized Hybrid Dynamical Systems Switching Time Optimization in Discretize Hybri Dynamical Systems Kathrin Flaßkamp, To Murphey, an Sina Ober-Blöbaum Abstract Switching time optimization (STO) arises in systems that have a finite set

More information

Concentration of Measure Inequalities for Compressive Toeplitz Matrices with Applications to Detection and System Identification

Concentration of Measure Inequalities for Compressive Toeplitz Matrices with Applications to Detection and System Identification Concentration of Measure Inequalities for Compressive Toeplitz Matrices with Applications to Detection an System Ientification Borhan M Sananaji, Tyrone L Vincent, an Michael B Wakin Abstract In this paper,

More information

ON THE OPTIMALITY SYSTEM FOR A 1 D EULER FLOW PROBLEM

ON THE OPTIMALITY SYSTEM FOR A 1 D EULER FLOW PROBLEM ON THE OPTIMALITY SYSTEM FOR A D EULER FLOW PROBLEM Eugene M. Cliff Matthias Heinkenschloss y Ajit R. Shenoy z Interisciplinary Center for Applie Mathematics Virginia Tech Blacksburg, Virginia 46 Abstract

More information

3 The variational formulation of elliptic PDEs

3 The variational formulation of elliptic PDEs Chapter 3 The variational formulation of elliptic PDEs We now begin the theoretical stuy of elliptic partial ifferential equations an bounary value problems. We will focus on one approach, which is calle

More information

Hybrid Fusion for Biometrics: Combining Score-level and Decision-level Fusion

Hybrid Fusion for Biometrics: Combining Score-level and Decision-level Fusion Hybri Fusion for Biometrics: Combining Score-level an Decision-level Fusion Qian Tao Raymon Velhuis Signals an Systems Group, University of Twente Postbus 217, 7500AE Enschee, the Netherlans {q.tao,r.n.j.velhuis}@ewi.utwente.nl

More information

One-dimensional I test and direction vector I test with array references by induction variable

One-dimensional I test and direction vector I test with array references by induction variable Int. J. High Performance Computing an Networking, Vol. 3, No. 4, 2005 219 One-imensional I test an irection vector I test with array references by inuction variable Minyi Guo School of Computer Science

More information

Lagrangian and Hamiltonian Mechanics

Lagrangian and Hamiltonian Mechanics Lagrangian an Hamiltonian Mechanics.G. Simpson, Ph.. epartment of Physical Sciences an Engineering Prince George s Community College ecember 5, 007 Introuction In this course we have been stuying classical

More information

Differentiation ( , 9.5)

Differentiation ( , 9.5) Chapter 2 Differentiation (8.1 8.3, 9.5) 2.1 Rate of Change (8.2.1 5) Recall that the equation of a straight line can be written as y = mx + c, where m is the slope or graient of the line, an c is the

More information

A Note on Exact Solutions to Linear Differential Equations by the Matrix Exponential

A Note on Exact Solutions to Linear Differential Equations by the Matrix Exponential Avances in Applie Mathematics an Mechanics Av. Appl. Math. Mech. Vol. 1 No. 4 pp. 573-580 DOI: 10.4208/aamm.09-m0946 August 2009 A Note on Exact Solutions to Linear Differential Equations by the Matrix

More information

arxiv: v1 [physics.flu-dyn] 8 May 2014

arxiv: v1 [physics.flu-dyn] 8 May 2014 Energetics of a flui uner the Boussinesq approximation arxiv:1405.1921v1 [physics.flu-yn] 8 May 2014 Kiyoshi Maruyama Department of Earth an Ocean Sciences, National Defense Acaemy, Yokosuka, Kanagawa

More information

On Characterizing the Delay-Performance of Wireless Scheduling Algorithms

On Characterizing the Delay-Performance of Wireless Scheduling Algorithms On Characterizing the Delay-Performance of Wireless Scheuling Algorithms Xiaojun Lin Center for Wireless Systems an Applications School of Electrical an Computer Engineering, Purue University West Lafayette,

More information

Applications of the Wronskian to ordinary linear differential equations

Applications of the Wronskian to ordinary linear differential equations Physics 116C Fall 2011 Applications of the Wronskian to orinary linear ifferential equations Consier a of n continuous functions y i (x) [i = 1,2,3,...,n], each of which is ifferentiable at least n times.

More information

Free rotation of a rigid body 1 D. E. Soper 2 University of Oregon Physics 611, Theoretical Mechanics 5 November 2012

Free rotation of a rigid body 1 D. E. Soper 2 University of Oregon Physics 611, Theoretical Mechanics 5 November 2012 Free rotation of a rigi boy 1 D. E. Soper 2 University of Oregon Physics 611, Theoretical Mechanics 5 November 2012 1 Introuction In this section, we escribe the motion of a rigi boy that is free to rotate

More information

Solution to the exam in TFY4230 STATISTICAL PHYSICS Wednesday december 1, 2010

Solution to the exam in TFY4230 STATISTICAL PHYSICS Wednesday december 1, 2010 NTNU Page of 6 Institutt for fysikk Fakultet for fysikk, informatikk og matematikk This solution consists of 6 pages. Solution to the exam in TFY423 STATISTICAL PHYSICS Wenesay ecember, 2 Problem. Particles

More information