Nonlinear Projection Trick in kernel methods: An alternative to the Kernel Trick

Transcription:

Nonlinear Projection Trick in kernel methods: An alternative to the Kernel Trick

Nojun Kwak, Member, IEEE

Abstract: In kernel methods such as kernel PCA and support vector machines, the so-called kernel trick is used to avoid direct calculations in a high (virtually infinite) dimensional kernel space. In this paper, based on the fact that the effective dimensionality of a kernel space is less than the number of training samples, we propose an alternative to the kernel trick that explicitly maps the input data into a reduced dimensional kernel space. This is easily obtained by the eigenvalue decomposition of the kernel matrix. The proposed method is named the nonlinear projection trick in contrast to the kernel trick. With this technique, the applicability of kernel methods is widened to arbitrary algorithms that do not utilize the dot product. The equivalence between the kernel trick and the nonlinear projection trick is shown for several conventional kernel methods. In addition, we extend PCA-L1, which utilizes the L1-norm instead of the L2-norm (or dot product), into a kernel version and show the effectiveness of the proposed approach.

Index Terms: Kernel methods, nonlinear projection trick, KPCA, support vector machines, KPCA-L1, dimensionality reduction.

(N. Kwak is an associate professor at the Graduate School of Convergence Science and Technology, Seoul National University, Korea. This work was supported by the Korea Research Foundation Grant funded by the Korean Government (KRF-203006599).)

I. INTRODUCTION

In the pattern recognition and machine learning communities, kernel methods have gained attention since the mid-1990s and have been successfully applied to pattern classification, dimensionality reduction, regression problems and so on [1]–[10]. In kernel methods, the kernel trick is used to simplify the optimization in a high dimensional feature space (the terms feature space and kernel space are used interchangeably to denote the high dimensional space defined by a kernel function) without an explicit mapping of the input data into the feature space. This is done by introducing a kernel function of two elements in the input space and regarding it as a dot product of the corresponding two elements in the kernel space. If the optimization problem contains the elements of the feature space only in the form of the dot product, then the problem can be solved without knowing the explicit mapping. The dimension of the feature space, which is unmeasurable and can be infinite, depends on the specific kernel function used.

Because the kernel trick takes care of the dot product in the kernel space rather than the mapping itself, little attention has been paid to the mapping of the input data to the feature space. However, there are many optimization problems that cannot be formulated with the dot product only and thus could not be extended to a kernel space by the kernel trick. In addition, without knowledge of the mapped image in the feature space, it is difficult to solve data reconstruction problems by kernel methods. Although this pre-image problem in kernel PCA has been tackled in [11] and [12], a direct mapping of the input data to a vector in the feature space by kernel methods has not been studied yet. It is well known that the mapped images of a finite number of training samples constitute a subspace of the possibly infinite dimensional feature space [1].
In addition, most operations in kernel methods are performed in the training stage and are done on this finite dimensional subspace of the possibly infinite dimensional feature space. Based on this idea, in this paper we raise and answer the question: Can we find a direct mapping of the input data to a vector in the feature space by kernel methods?

This paper proposes an alternative to the kernel trick which directly maps input samples into a reduced dimensional kernel space. The proposed approach is named the nonlinear projection trick in contrast to the kernel trick. The proposed nonlinear projection trick widens the applicability of kernel methods to any algorithm that is applicable in the input space, even if its formulation does not rely only on the dot product. We also show the equivalence between the proposed nonlinear projection trick and the kernel trick by some examples. As an application of the nonlinear projection trick, PCA-L1 [13], whose objective function is in the L1-norm rather than the L2-norm, is extended to a kernel version, which would not have been possible by the kernel trick.

The paper is organized as follows. In the next section, the structure of a kernel space, such as the bases and the coordinates, is studied. In Section III, the proposed nonlinear projection trick is formulated. The proposed approach is applied to conventional kernel methods such as KPCA [1], KSVM (kernel support vector machine) [2] and KFD (kernel Fisher discriminant) [4] in Section IV to show the equivalence between the kernel trick and the nonlinear projection trick. In addition, a kernel version of PCA-L1 (KPCA-L1), which could not have been derived from the conventional kernel trick, is derived by the nonlinear projection trick in this section. Finally, conclusions follow in Section V.

II. THE STRUCTURE OF A KERNEL SPACE

In this section, the structure of a kernel space with a finite number of training samples, such as its bases and the corresponding coordinates, is studied. Before going on, in the following subsection we briefly define the notations and variables that will be used in this paper.

Fig. 1. Projections in the feature space.

A. Notations

X ≜ [x_1, ..., x_n] ∈ R^{d×n}: the training data matrix, where d is the dimension of the input space and n is the number of training samples.

Φ(X) ≜ [ϕ(x_1), ..., ϕ(x_n)] ∈ R^{f×n}: the mapped training data in the f-dimensional feature space after the mapping ϕ(·). We consider ϕ(x_i) as a vector with respect to a countable orthonormal basis of the respective RKHS (reproducing kernel Hilbert space). Φ(X) is assumed to have zero mean, i.e., Σ_{i=1}^n ϕ(x_i) = 0. Note that f can be infinite.

k(x, y) ≜ ⟨ϕ(x), ϕ(y)⟩ = ϕ(x)^T ϕ(y) ∈ R: a kernel function of any two inputs x, y ∈ R^d.

K ≜ Φ(X)^T Φ(X) = [k(x_i, x_j)] ∈ R^{n×n}: the kernel matrix of the training data. The rank of K is r; it satisfies r ≤ n − 1 because Φ(X) is assumed to be centered.

k(x) ≜ Φ(X)^T ϕ(x) ∈ R^n: the kernel vector of any x ∈ R^d.

P: the r-dimensional subspace of the feature space spanned by the mapped training samples Φ(X).

ϕ_P(x): the projection of ϕ(x) onto P. If ϕ(x) lies in P (e.g., when x is one of the training samples), ϕ_P(x) = ϕ(x).

ϕ_w(x): the projection of ϕ(x) onto the one-dimensional subspace spanned by a vector w ∈ R^f. In most cases, the vector w is restricted to reside in the subspace P, i.e., w = Φ(X)α for some α ∈ R^n.

The relationship among ϕ(x), ϕ_P(x) and ϕ_w(x) is illustrated in Fig. 1. Note that if w resides in P, as in the figure, then w^T ϕ(x) = w^T ϕ_P(x).

B. The bases and the coordinates of a kernel space

It is well known that the effective dimensionality of the feature space defined by the kernel function k(·,·), given a finite number of training samples X, is finite [1]. (The term intrinsic dimensionality is defined and widely used in the machine learning community [14]. Because projection methods based on eigenvalue decomposition tend to overestimate the intrinsic dimensionality [15], the effective dimensionality can be interpreted as an upper bound on the intrinsic dimensionality of a kernel space.) In this subsection, the bases of the subspace spanned by the mapped training data Φ(X), as well as the coordinate of the mapping ϕ(x) of an unseen datum x ∈ R^d, are specified.

Lemma 1 (Bases of P): Let r be the rank of K and let K = UΛU^T be the eigenvalue decomposition of K composed of only the nonzero eigenvalues of K, where U = [u_1, ..., u_r] ∈ R^{n×r} has orthonormal columns and Λ = diag(λ_1, ..., λ_r) ∈ R^{r×r}. Then the columns of Π ≜ Φ(X)UΛ^{-1/2} = [π_1, ..., π_r] ∈ R^{f×r} constitute an orthonormal basis of P.

Proof: Because K = Φ(X)^T Φ(X), it is obvious that r, the rank of K, is the dimension of the subspace spanned by Φ(X). By utilizing the orthonormality of U (U^T U = I_r) and the definition of K (K = Φ(X)^T Φ(X)), it is easy to check that Π^T Π = I_r, so Π forms an orthonormal basis of P.

From the above lemma, π_i = Φ(X) u_i λ_i^{-1/2} (1 ≤ i ≤ r) becomes the i-th basis vector of P, where u_i and λ_i are the i-th unit eigenvector and eigenvalue of K, respectively.

Theorem 2 (Coordinate of ϕ_P(x)): For any x ∈ R^d, the projection ϕ_P(x) of ϕ(x) onto P is obtained by ϕ_P(x) = Πy, where y = Λ^{-1/2} U^T k(x), which can be regarded as the coordinate of ϕ_P(x) in the Cartesian coordinate system whose orthonormal bases are Π.

Proof: Because ϕ_P(x) is in P, ϕ_P(x) = Πy for some y ∈ R^r. If ϕ_P(x) is the projection of ϕ(x) onto P, whose bases are Π, it should minimize ||ϕ(x) − ϕ_P(x)||_2^2. Therefore, the optimal y should satisfy

y = argmin_y ||ϕ(x) − Πy||_2^2 = argmin_y (||ϕ(x)||_2^2 − 2ϕ(x)^T Πy + y^T Π^T Πy) = argmin_y (y^T y − 2ϕ(x)^T Πy).  (1)

The solution of this quadratic problem is y = Π^T ϕ(x) = Λ^{-1/2} U^T k(x).
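As a concrete illustration of Lemma 1 and Theorem 2, the following NumPy sketch computes the coordinates Y = Λ^{1/2}U^T of the mapped training data and the coordinate y = Λ^{-1/2}U^T k(x) of a test point from the eigenvalue decomposition of the centered kernel matrix. It is only a sketch under stated assumptions: K and k_x are taken to be precomputed centered quantities, and the function names and the rank tolerance are illustrative, not from the paper.

import numpy as np

def npt_coordinates(K, tol=1e-10):
    # Eigenvalue decomposition of the centered kernel matrix K (Lemma 1).
    lam, U = np.linalg.eigh(K)                      # ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]                  # reorder to descending
    keep = lam > tol                                # keep the r (numerically) nonzero eigenvalues
    lam, U = lam[keep], U[:, keep]
    Y = np.sqrt(lam)[:, None] * U.T                 # Y = Lambda^{1/2} U^T, coordinates of Phi(X)
    return U, lam, Y

def npt_project(U, lam, k_x):
    # Coordinate of phi_P(x): y = Lambda^{-1/2} U^T k(x) (Theorem 2).
    return (U.T @ k_x) / np.sqrt(lam)

A quick sanity check is that Y.T @ Y reproduces K up to numerical error, in line with the decomposition K = UΛU^T.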
Note that ϕ_P(x) can be regarded as a nonlinear mapping of x onto the r-dimensional subspace spanned by Π. From the above theorem, this mapping can further be thought of as a direct mapping of x to y = Λ^{-1/2} U^T k(x).

Corollary 3 (Coordinates of the mapped training data): The coordinates of Φ(X) are Y = Λ^{1/2} U^T.

Proof: It is easy to show that the training data X are mapped to Y = Λ^{-1/2} U^T K = Λ^{-1/2} U^T UΛU^T = Λ^{1/2} U^T. Note that K = UΛU^T = Y^T Y. (As a side note, K can also be decomposed directly into K = Ỹ^T Ỹ by the Cholesky decomposition, in which case Ỹ is a unique upper triangular matrix. The singular value decomposition of Ỹ is Ỹ = V Λ^{1/2} U^T = V Y for some unitary matrix V; therefore Ỹ can be interpreted as a rotation of Y.)

Lemma 4 (Mean of the mapped training data): The mapped training data Y = Λ^{1/2} U^T are centered, i.e., Σ_{i=1}^n y_i = 0.

Proof: From Theorem 2, y_i = Λ^{-1/2} U^T k(x_i), and it follows that Σ_{i=1}^n y_i = Λ^{-1/2} U^T Σ_{i=1}^n k(x_i) = Λ^{-1/2} U^T Φ(X)^T Σ_{i=1}^n ϕ(x_i) = 0.

Lemma 5 (Coordinate of the residual): For any x ∈ R^d, let X' ≜ [X, x] ∈ R^{d×(n+1)} and Φ(X') ≜ [Φ(X), ϕ(x)] ∈ R^{f×(n+1)}. Let δϕ_P(x) ≜ ϕ(x) − ϕ_P(x) be the residual vector of ϕ(x) with respect to P. If δϕ_P(x) ≠ 0, Φ(X') lies in an (r+1)-dimensional subspace containing P. Furthermore, the coordinates of the mapped training data Φ(X) become [Y^T, 0]^T ∈ R^{(r+1)×n}, while the coordinate of ϕ(x) becomes [y^T, y_{r+1}]^T ∈ R^{r+1}, where y_{r+1} = √(k(x, x) − y^T y).

Proof: It is obvious that δϕ_P(x) is orthogonal to P because ϕ_P(x) is the projection of ϕ(x) onto P. Furthermore,

if the residual vector δϕ_P(x) ≠ 0, then δϕ_P(x) adds one dimension to ϕ(x); thus Φ(X') (= [Φ(X), ϕ(x)]) lies in an (r+1)-dimensional subspace P' of the f-dimensional feature space which contains P. If we retain the bases Π (= [π_1, ..., π_r]) of P as the first r bases of P', then the new bases Π' of P' become

Π' = [Π, π_{r+1}], where π_{r+1} = δϕ_P(x) / ||δϕ_P(x)||_2.  (2)

In the coordinate system of Π', the first r coordinates of the mapped training data Φ(X) are unchanged, with the (r+1)-st coordinate being identically 0. Likewise, the first r coordinates of ϕ(x) are identical to the coordinate of ϕ_P(x), i.e., y, while the (r+1)-th coordinate becomes the length (norm) of the residual vector δϕ_P(x). Finally, because ϕ(x) = ϕ_P(x) + δϕ_P(x) and ϕ_P(x) ⊥ δϕ_P(x), it follows that

||δϕ_P(x)||_2^2 = ||ϕ(x)||_2^2 − ||ϕ_P(x)||_2^2 = k(x, x) − y^T y  (3)

by the Pythagorean identity.

Lemma 6 (Coordinate of a vector w): If a vector w in P can be written in the form w = Φ(X)α, then it can also be written as w = Πβ, where β = Yα.

Proof: Because Π is an orthonormal basis of P, any vector w in P can be written as w = ΠΠ^T w. Therefore, w = ΠΛ^{-1/2} U^T Φ(X)^T Φ(X)α = ΠΛ^{-1/2} U^T Kα = ΠYα = Πβ. Note that β = Yα is the coordinate of w in P.

Corollary 7 (Coordinate of ϕ_w(x)): The projection ϕ_w(x) of ϕ(x) onto w can be obtained by ϕ_w(x) = Πγ, where γ = (ββ^T / β^T β) y.

Proof: Let ŵ = w / ||w||_2 be a unit vector. Then

ϕ_w(x) = ŵ (ŵ^T ϕ(x)) = ŵ ŵ^T (ϕ_P(x) + δϕ_P(x)) = ŵ ŵ^T ϕ_P(x) = ŵ ŵ^T ΠΛ^{-1/2} U^T k(x) = (1/(β^T β)) Π ββ^T Π^T ΠΛ^{-1/2} U^T k(x) = Π (ββ^T / (β^T β)) Λ^{-1/2} U^T k(x) = Π (ββ^T / (β^T β)) y = Πγ.  (4)

In the second equality, δϕ_P(x) is orthogonal to P and thus also orthogonal to w, which lies in P. Note that if ||w||_2 = 1, then ||β||_2 = 1 and γ = ββ^T y.

III. THE PROPOSED METHOD

In this section, using the results of the previous section, we give a new interpretation of kernel methods: any kernel method is equivalent to applying the corresponding conventional method, which is applicable in the input space, to the data with the new coordinates. Before doing that, we deal with the problem of centerization in the feature space.

A. Centerization

In the previous section, the training data Φ(X) were assumed to be centered in the feature space, i.e., Σ_{i=1}^n ϕ(x_i) = 0. If this condition is not satisfied, P, which is made up of {p : p = Σ_{i=1}^n α_i ϕ(x_i), α_i ∈ R}, is not a subspace but a manifold in the feature space, i.e., 0 ∉ P. Therefore, centerization is a necessary step in practical applications of the proposed idea to kernel methods.

Let Ψ(X) ≜ [ψ(x_1), ..., ψ(x_n)] ∈ R^{f×n} be the uncentered data in the feature space and ψ̄ ≜ (1/n) Σ_{i=1}^n ψ(x_i) = (1/n) Ψ(X) 1_n ∈ R^f be the mean of Ψ(X). Here, 1_n ≜ [1, ..., 1]^T ∈ R^n is a vector whose components are identically 1. By subtracting the mean ψ̄ from the uncentered data Ψ(X), the centered data Φ(X) can be obtained as

Φ(X) = Ψ(X) − ψ̄ 1_n^T = Ψ(X)(I_n − (1/n) 1_n 1_n^T) = Ψ(X)(I_n − E_n),  (5)

where I_n is the n×n identity matrix and E_n ≜ (1/n) 1_n 1_n^T is an n×n matrix whose components are identically 1/n.

Consider an uncentered kernel function κ(a, b) ≜ ψ(a)^T ψ(b). Given this kernel function and the training data X, the uncentered kernel matrix K̄ is defined as K̄ ≜ [κ(x_i, x_j)] ∈ R^{n×n}. Writing K̄ in matrix form, it becomes K̄ = Ψ(X)^T Ψ(X). Then, from (5), the centered positive semidefinite kernel matrix K can be obtained as

K = Φ(X)^T Φ(X) = (I_n − E_n) K̄ (I_n − E_n).  (6)
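A minimal sketch of the centering step (6), assuming K_bar holds a precomputed uncentered kernel matrix (the variable and function names are illustrative):

import numpy as np

def center_kernel_matrix(K_bar):
    # K = (I_n - E_n) K_bar (I_n - E_n), where E_n = (1/n) 1_n 1_n^T, cf. (6).
    n = K_bar.shape[0]
    C = np.eye(n) - np.full((n, n), 1.0 / n)        # I_n - E_n
    return C @ K_bar @ C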
Because κ(x i, x) = ψ(x i ) T ψ(x), if e subtract the mean ψ from both ψ(x i ) and ψ(x), e can get the centered kernel vector k(x) from the uncentered kernel vector κ(x) as follos: k(x) = Φ(X) T ϕ(x) = [Ψ(X)(I n E n )] T (ψ(x) ψ) = (I n E n )Ψ(X) T (ψ(x) n Ψ(X) n) = (I n E n )[κ(x) n K n]. B. Kernel methods: nonlinear projection trick No, a ne frameork of kernel methods referred to as nonlinear projection trick is presented in this part. Belo, a general kernel method is denoted as KM hile its corresponding method in the input space is denoted as M. As inputs to KM, the training and test data X and x respectively, a specific (uncentered) kernel function κ(, ) and a method M are given. Here, M can be any feature extraction method such as PCA, LDA and ICA as ell as any classifier/regressor such as SVM/SVR. Algorithm: Kernel method KM (Input: X, x, κ(, ), method M) Training phase (7)

Algorithm: Kernel method KM (Input: X, x, κ(·,·), method M)

Training phase
1) Compute the uncentered kernel matrix K̄ such that K̄_ij = κ(x_i, x_j).
2) Compute the centered kernel matrix K by K = (I_n − E_n) K̄ (I_n − E_n).
3) Obtain the eigenvalue decomposition of K such that K = UΛU^T, where Λ is composed of only the nonzero eigenvalues of K and the columns of U are the corresponding unit eigenvectors of K.
4) Compute Y, the coordinates of Φ(X), by Y = Λ^{1/2} U^T.
5) Apply the method M to Y; this is equivalent to applying the kernel method KM to X, i.e., M(Y) ≡ KM(X).

Test phase
1) Compute the uncentered kernel vector κ(x).
2) Compute the centered kernel vector k(x) by k(x) = (I_n − E_n)[κ(x) − (1/n) K̄ 1_n].
3) Obtain y, the coordinate of ϕ(x), by y = Λ^{-1/2} U^T k(x).
4) Apply the method M to y; this is equivalent to applying the kernel method KM to x, i.e., M(y) ≡ KM(x).

IV. APPLICATIONS

In this section, the nonlinear projection trick is applied to several conventional kernel methods such as KPCA, KFD, kernel SVM and so on. In doing so, we show that different kernel methods can be interpreted in the unified framework of applying conventional linear methods to the reduced dimensional feature space. As a new contribution, we extend PCA-L1 [13] to a kernel version.

A. Kernel PCA (KPCA)

1) Conventional derivation: In conventional KPCA, the objective is to find a projection vector w in the feature space that minimizes the sum of squared errors or, equivalently, maximizes the variance along the projection direction. The problem becomes

w* = argmax_w ||w^T Φ(X)||_2^2 = argmax_w w^T Φ(X)Φ(X)^T w such that ||w||_2 = 1.  (8)

The solution of (8) can be found by the eigenvalue decomposition of S_f ≜ Φ(X)Φ(X)^T, the scatter matrix in the feature space, i.e.,

S_f w_i = λ_i w_i, λ_1 ≥ λ_2 ≥ ... ≥ λ_m.  (9)

Using the fact that w lies in P, w can be written as w = Φ(X)α, and (9) becomes

Φ(X) K α_i = λ_i Φ(X) α_i, λ_1 ≥ λ_2 ≥ ... ≥ λ_m.  (10)

Left multiplying both sides by Φ(X)^T, the problem becomes finding {α_i}_{i=1}^m that satisfy K^2 α_i = λ_i K α_i, which further reduces to finding {α_i}_{i=1}^m such that

K α_i = λ_i α_i, λ_1 ≥ λ_2 ≥ ... ≥ λ_m,  (11)

with the constraint α_i^T K α_i = 1. The solution {α_i}_{i=1}^m is given by the eigenvectors of K corresponding to the m largest eigenvalues. Now the i-th solution that satisfies α_i^T K α_i = 1 becomes

α_i = u_i λ_i^{-1/2}.  (12)

Let W ≜ [w_1, ..., w_m] ∈ R^{f×m}; then the projection of ϕ(x) onto W gives an m-dimensional vector z = W^T ϕ(x). Because w_i = Φ(X)α_i, it further reduces to z = A^T k(x), where A ≜ [α_1, ..., α_m] ∈ R^{n×m}. From (12), it becomes

A = [u_1 λ_1^{-1/2}, ..., u_m λ_m^{-1/2}] = U_m Λ_m^{-1/2},  (13)

where U_m ≜ [u_1, ..., u_m] and Λ_m ≜ diag(λ_1, ..., λ_m), and finally the nonlinear projection of x becomes

z = Λ_m^{-1/2} U_m^T k(x).  (14)

2) New derivation: In the new interpretation of kernel methods presented in the previous section, the coordinates of the mapped training data Φ(X) are first computed as Y = Λ^{1/2} U^T ∈ R^{r×n} from Corollary 3. Then the problem becomes finding the maximum variance direction in R^r. It is easy to show that the scatter matrix of Y becomes YY^T = Λ because the mapped training data Y are centered (Lemma 4), and e_i becomes the i-th eigenvector of the scatter matrix and therefore the i-th principal vector. Here, e_i ∈ R^r is a column vector whose elements are identically zero except the i-th element, which is one. Once a new datum x is presented, by Theorem 2 it is projected to y = Λ^{-1/2} U^T k(x), and the i-th projection of y becomes

z_i = e_i^T y = e_i^T Λ^{-1/2} U^T k(x) = λ_i^{-1/2} u_i^T k(x).  (15)

Now z ≜ [z_1, ..., z_m]^T, the m-dimensional nonlinear projection of x, becomes

z = Λ_m^{-1/2} U_m^T k(x).  (16)

Note that (16) is identical to (14), so the new interpretation of KPCA is equivalent to the conventional derivation.
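To illustrate this equivalence numerically, the following sketch compares the kernel-trick projection (14) with the nonlinear-projection-trick route (plain PCA applied to the coordinates Y, then projection of y) on random data with an RBF kernel. The data, the kernel width gamma, and the number m of components are illustrative choices, not from the paper, and the comparison is made up to the usual sign ambiguity of principal directions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))                         # 40 training samples in R^5 (illustrative)
x = rng.normal(size=5)                               # one test sample
gamma, m = 0.5, 3

def rbf(a, b):                                       # an illustrative kernel choice
    return np.exp(-gamma * np.sum((a - b) ** 2))

K_bar = np.array([[rbf(a, b) for b in X] for a in X])   # uncentered kernel matrix
kappa_x = np.array([rbf(a, x) for a in X])              # uncentered kernel vector

n = len(X)
C = np.eye(n) - np.full((n, n), 1.0 / n)             # I_n - E_n
K = C @ K_bar @ C                                    # centered kernel matrix, (6)
k_x = C @ (kappa_x - K_bar.mean(axis=1))             # centered kernel vector, (7)

lam, U = np.linalg.eigh(K)
lam, U = lam[::-1], U[:, ::-1]                       # descending eigenvalues
keep = lam > 1e-10
lam, U = lam[keep], U[:, keep]                       # rank-r part: K = U diag(lam) U^T

z_kt = (U[:, :m].T @ k_x) / np.sqrt(lam[:m])         # kernel trick, eq. (14)

Y = np.sqrt(lam)[:, None] * U.T                      # coordinates of the training data
y = (U.T @ k_x) / np.sqrt(lam)                       # coordinate of the test sample (Theorem 2)
evals, evecs = np.linalg.eigh(Y @ Y.T)               # plain PCA on Y (Y is centered by Lemma 4)
W = evecs[:, ::-1][:, :m]                            # top-m principal directions, close to +/- e_i
z_npt = W.T @ y                                      # nonlinear projection trick, eq. (16)

print(np.allclose(np.abs(z_kt), np.abs(z_npt)))      # True, illustrating that (14) and (16) agree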
However, by the new interpretation we can obtain the exact coordinates of Φ(X) and the corresponding principal vectors, while the conventional derivation cannot find those coordinates.

B. Kernel SVM (KSVM)

1) Conventional derivation: Given training data {(x_i, c_i)}_{i=1}^n, where x_i ∈ R^d and c_i ∈ {−1, 1}, the kernel SVM tries to solve the optimization problem

(w*, b*) = argmin_{(w,b)} (1/2) ||w||_2^2 subject to c_i (w^T ϕ(x_i) + b) ≥ 1, i = 1, ..., n.  (17)

By introducing Lagrange multipliers {α_i}_{i=1}^n, the above constrained problem can be expressed in the dual form

α* = argmax_α Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j c_i c_j k(x_i, x_j) subject to Σ_{i=1}^n α_i c_i = 0 and α_i ≥ 0, i = 1, ..., n.  (18)

Once we solve the above dual problem, w* can be computed from the optimal α_i's as w* = Σ_{i=1}^n α_i c_i ϕ(x_i), and b can be computed so that it meets the KKT condition α_i [c_i (w^T ϕ(x_i) + b) − 1] = 0 for all i = 1, ..., n. Once a test sample x is presented, it can be classified as sgn(w^T ϕ(x) + b). Note that w^T ϕ(x) = Σ_{i=1}^n α_i c_i k(x_i, x), and we do not have to find an explicit expression for w in either the training or the test phase.

2) New derivation: In the new derivation, given the training data {(x_i, c_i)}_{i=1}^n, X = [x_1, ..., x_n] is first mapped to Y = Λ^{1/2} U^T = [y_1, ..., y_n], and then a linear SVM is solved for {(y_i, c_i)}_{i=1}^n. The primal problem becomes

(v*, d*) = argmin_{(v,d)} (1/2) ||v||_2^2 subject to c_i (v^T y_i + d) ≥ 1, i = 1, ..., n,  (19)

and the corresponding dual problem is

β* = argmax_β Σ_{i=1}^n β_i − (1/2) Σ_{i,j=1}^n β_i β_j c_i c_j y_i^T y_j subject to Σ_{i=1}^n β_i c_i = 0 and β_i ≥ 0, i = 1, ..., n.  (20)

Note that because k(x_i, x_j) = y_i^T y_j, (18) and (20) are exactly the same problem, and it follows that β* = α* and d* = b*. In this case, v* can be explicitly found as v* = Σ_{i=1}^n β_i* c_i y_i. Once a test sample x is presented, it is first mapped to y = Λ^{-1/2} U^T k(x) and then classified as sgn(v*^T y + d*).

C. Kernel Fisher discriminant (KFD)

1) Conventional derivation: In conventional KFD, which is a kernel extension of LDA (linear discriminant analysis), the objective is to find a projection vector w in the feature space such that the ratio of the between-class variance to the within-class variance after the projection is maximized. Given C-class training data {(x_i, c_i)}_{i=1}^n, where x_i ∈ R^d and c_i ∈ {1, ..., C}, the problem becomes

w* = argmax_w (w^T S_B^Φ w) / (w^T S_W^Φ w).  (21)

Here,

S_B^Φ ≜ Σ_{i=1}^n (ϕ(x_i) − ϕ̄)(ϕ(x_i) − ϕ̄)^T ∈ R^{f×f}  (22)

and

S_W^Φ ≜ Σ_{i=1}^n (ϕ(x_i) − ϕ̄_{c_i})(ϕ(x_i) − ϕ̄_{c_i})^T ∈ R^{f×f}  (23)

are the between scatter and the within scatter matrices in the feature space, respectively, where ϕ̄ = (1/n) Σ_{i=1}^n ϕ(x_i) and ϕ̄_c = (1/n_c) Σ_{{j : c_j = c}} ϕ(x_j) are the total mean and the class mean in the feature space. The term n_c denotes the number of training samples whose class identity is c. As in the previous section, without loss of generality we assume that Φ(X) is centered, i.e., ϕ̄ = 0, and the between scatter matrix becomes S_B^Φ = Φ(X)Φ(X)^T. Note that S_B^Φ in (22) can also be written as S_B^Φ = Σ_{c=1}^C n_c (ϕ̄_c − ϕ̄)(ϕ̄_c − ϕ̄)^T.

Restricting w to P, w can be written as w = Φ(X)α, and it becomes w^T ϕ(x_i) = α^T k(x_i). Using this, (21) becomes

α* = argmax_α (α^T S_B^K α) / (α^T S_W^K α),  (24)

where

S_B^K = K^2 ∈ R^{n×n}  (25)

and

S_W^K = Σ_{i=1}^n (k(x_i) − k̄_{c_i})(k(x_i) − k̄_{c_i})^T ∈ R^{n×n}.  (26)

Here, k̄_c ≜ (1/n_c) Σ_{{j : c_j = c}} k(x_j). The solution of (24) can be obtained by solving the following generalized eigenvalue decomposition problem:

S_B^K α_i = λ_i S_W^K α_i, λ_1 ≥ λ_2 ≥ ... ≥ λ_m.  (27)

Once {α_i}_{i=1}^m are obtained, for a given sample x an m-dimensional nonlinear feature vector z ∈ R^m is obtained by the dot product of W ≜ [w_1, ..., w_m] = Φ(X)[α_1, ..., α_m] and ϕ(x), i.e., z = W^T ϕ(x) = A^T k(x), where A ≜ [α_1, ..., α_m] ∈ R^{n×m}.

2) New derivation: In the new derivation of KFD, for given training data X, the (centered) kernel matrix K is computed, and its eigenvalue decomposition K = UΛU^T is utilized to get the nonlinear mapping Y = Λ^{1/2} U^T. Then its within scatter and between scatter matrices are computed as

S_W^Y = Σ_{i=1}^n (y_i − ȳ_{c_i})(y_i − ȳ_{c_i})^T ∈ R^{r×r}  (28)

and

S_B^Y = Σ_{c=1}^C n_c (ȳ_c − ȳ)(ȳ_c − ȳ)^T = Σ_{i=1}^n y_i y_i^T = YY^T ∈ R^{r×r}  (29)

respectively, where ȳ and ȳ_c are the total mean and the class mean of Y. Recall that ȳ = 0 from Lemma 4. Applying LDA directly to Y, the objective becomes finding β ∈ R^r such that

β* = argmax_β (β^T S_B^Y β) / (β^T S_W^Y β).  (30)
Noting that y_i = Λ^{-1/2} U^T k(x_i) (from Theorem 2), the numerator becomes

β^T S_B^Y β = β^T Λ^{-1/2} U^T K K U Λ^{-1/2} β = α^T K^2 α,  (31)

where α = UΛ^{-1/2} β. Likewise, the denominator becomes

β^T S_W^Y β = α^T ( Σ_{i=1}^n (k(x_i) − k̄_{c_i})(k(x_i) − k̄_{c_i})^T ) α.  (32)

Substituting (31) and (32) into (30), we can see that the problem becomes identical to (24). Furthermore, because α = UΛ^{-1/2} β, multiplying both sides by Λ^{1/2} U^T on the left gives β = Λ^{1/2} U^T α = Yα, as anticipated in Lemma 6. After obtaining {β_i}_{i=1}^m, the nonlinear projection z of x can be obtained by the dot product of B ≜ [β_1, ..., β_m] and y = Λ^{-1/2} U^T k(x). Then z = B^T Λ^{-1/2} U^T k(x) = A^T k(x).
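As a sketch of this new derivation, ordinary Fisher discriminant analysis can be applied directly to the coordinates Y following (28)-(30). In the illustrative implementation below, Y and the integer label array are assumed to come from the training phase of the algorithm in Section III, and a small ridge is added to S_W for numerical invertibility (an implementation convenience, not part of the paper).

import numpy as np

def kfd_npt(Y, labels, m, ridge=1e-8):
    # Fisher/LDA on the NPT coordinates Y (columns of Y are the samples y_i).
    r, n = Y.shape
    y_bar = Y.mean(axis=1)                           # equals 0 by Lemma 4
    S_W = np.zeros((r, r))
    S_B = np.zeros((r, r))
    for c in np.unique(labels):
        Yc = Y[:, labels == c]                       # samples of class c
        mc = Yc.mean(axis=1)                         # class mean
        D = Yc - mc[:, None]
        S_W += D @ D.T                               # within scatter, (28)
        S_B += Yc.shape[1] * np.outer(mc - y_bar, mc - y_bar)   # between scatter, (29)
    # Generalized eigenproblem S_B beta = lambda S_W beta, cf. (30).
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W + ridge * np.eye(r), S_B))
    order = np.argsort(-evals.real)
    return evecs[:, order[:m]].real                  # B = [beta_1, ..., beta_m]

A test sample x is then projected as z = B.T @ y with y = Λ^{-1/2} U^T k(x), which corresponds (up to the scaling of the generalized eigenvectors) to z = A^T k(x) of the conventional derivation.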

D. Kernel PCA-L1 (KPCA-L1)

Until now, we have presented examples of applying the new kernel interpretation to existing kernel methods such as KPCA, KSVM and KFD. In this subsection, we show that this interpretation of kernel methods is also applicable to algorithms which do not rely on the dot product, and to which the kernel trick is therefore not applicable. In [13], PCA-L1, a robust version of PCA which maximizes the L1 dispersion of the projections, was introduced. Given training data X, the objective of PCA-L1 is to find projection vectors that solve the following optimization problem:

w* = argmax_w ||w^T X||_1 = argmax_w Σ_{i=1}^n |w^T x_i| subject to ||w||_2 = 1.  (33)

For reference, the PCA-L1 algorithm is presented below.

Algorithm: PCA-L1 (Input: X)
1) Initialization: Pick any w(0). Set w(0) ← w(0)/||w(0)||_2 and t = 0.
2) Polarity check: For all i ∈ {1, ..., n}, if w^T(t) x_i < 0, set p_i(t) = −1; otherwise, set p_i(t) = 1.
3) Flipping and maximization: Set t ← t + 1 and w(t) = Σ_{i=1}^n p_i(t−1) x_i. Set w(t) ← w(t)/||w(t)||_2.
4) Convergence check:
a. If w(t) ≠ w(t−1), go to Step 2.
b. Else if there exists i such that w^T(t) x_i = 0, set w(t) ← (w(t) + Δw)/||w(t) + Δw||_2 and go to Step 2. Here, Δw is a small nonzero random vector.
c. Otherwise, set w* = w(t) and stop.

This problem formulation can easily be extended to a kernel version by replacing X with Φ(X), and the problem becomes

w* = argmax_w ||w^T Φ(X)||_1 = argmax_w Σ_{i=1}^n |w^T ϕ(x_i)| subject to ||w||_2 = 1.  (34)

Note that the kernel trick is not applicable to the above problem because the problem is not formulated in terms of the L2-norm. By Lemma 1, w can be expressed using the bases Π as w = Πβ, and from Theorem 2, w^T Φ(X) can be expressed as β^T Y. Furthermore, the constraint ||w||_2 = 1 becomes ||β||_2 = 1 because the columns of Π are orthonormal. Therefore, (34) becomes

β* = argmax_β ||β^T Y||_1 subject to ||β||_2 = 1.  (35)

Because the form of (35) is exactly the same as (33), the conventional PCA-L1 algorithm can be applied directly to Y to obtain the KPCA-L1 algorithm. Note that this is a simple example of making a kernel version of an algorithm by using the nonlinear projection trick. The point is that this trick can be applied to an arbitrary method, as shown in Section III.
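Because (35) has exactly the form of (33), KPCA-L1 amounts to running the PCA-L1 iteration above on the coordinates Y. The sketch below is a direct transcription of that iteration; Y is assumed to come from the training phase of the algorithm in Section III, and the initialization, perturbation size and iteration cap are illustrative choices.

import numpy as np

def kpca_l1_direction(Y, max_iter=1000, seed=0):
    # One KPCA-L1 projection vector: PCA-L1 applied to the columns of Y, cf. (35).
    rng = np.random.default_rng(seed)
    r, n = Y.shape
    b = rng.normal(size=r)
    b /= np.linalg.norm(b)                           # step 1: normalize beta(0)
    for _ in range(max_iter):
        p = np.where(Y.T @ b < 0, -1.0, 1.0)         # step 2: polarity check
        b_new = Y @ p                                # step 3: flipping and maximization
        b_new /= np.linalg.norm(b_new)
        if np.allclose(b_new, b):                    # step 4: convergence check
            if np.any(np.isclose(Y.T @ b_new, 0.0)):
                b_new = b_new + 1e-6 * rng.normal(size=r)    # step 4b: small random perturbation
                b_new /= np.linalg.norm(b_new)
            else:
                return b_new                         # step 4c: converged
        b = b_new
    return b

The resulting β* plays the role of w = Πβ in (34); a test sample x is projected as β*^T y with y = Λ^{-1/2} U^T k(x).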
V. CONCLUSIONS

This paper proposed a kernel method that directly maps the input space to the feature space by deriving the exact coordinates of the mapped input data. In conventional kernel methods, the kernel trick is used to avoid a direct nonlinear mapping of the input data onto a high (possibly infinite) dimensional feature space. Instead, the dot products of the mapped data are converted to kernel functions in the problems to be solved, making the problems solvable without ever having to explicitly map the original input data into a kernel space. However, the applicability of the kernel trick is restricted to problems where the mapped data appear only in the form of dot products. Therefore, arbitrary methods that do not utilize the dot product (or L2-norm) could not be extended to kernel methods.

In this paper, based on the fact that the effective dimensionality of a kernel space is less than the number of training samples, an explicit mapping of the input data into a reduced dimensional kernel space was proposed. This is obtained by the eigenvalue decomposition of the kernel matrix. With the proposed nonlinear projection trick, the applicability of kernel methods is widened to arbitrary methods that can be applied in the input space. In other words, not only the conventional kernel methods based on L2-norm optimization, such as KPCA and kernel SVM, but also methods utilizing other objective functions can be extended to kernel versions. The equivalence between the kernel trick and the nonlinear projection trick was shown for several conventional kernel methods such as KPCA, KSVM and KFD. In addition, we extended PCA-L1, which utilizes the L1-norm, into a kernel version and showed the effectiveness of the proposed approach.

REFERENCES

[1] B. Schölkopf, A. J. Smola, and K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation, vol. 10, no. 5, pp. 1299-1319, 1998.
[2] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, The MIT Press, 2002.
[3] R. Rosipal, M. Girolami, and L. J. Trejo, Kernel PCA for feature extraction and de-noising in non-linear regression, Neural Computing & Applications, vol. 10, pp. 231-243, 2001.
[4] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, Fisher discriminant analysis with kernels, in Proceedings of the 1999 IEEE Signal Processing Society Workshop on Neural Networks for Signal Processing IX, 1999, pp. 41-48.
[5] J. Yang, A. F. Frangi, J.-Y. Yang, D. Zhang, and Z. Jin, KPCA plus LDA: A complete kernel Fisher discriminant framework for feature extraction and recognition, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 27, no. 2, pp. 230-244, Feb. 2005.
[6] G. Baudat and F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Computation, vol. 12, no. 10, pp. 2385-2404, 2000.
[7] M. H. Nguyen and F. De la Torre, Robust kernel principal component analysis, Tech. Rep. Paper 44, Robotics Institute, 2008, http://repository.cmu.edu/robitics/44.
[8] N. Kwak, Kernel discriminant analysis for regression problems, Pattern Recognition, vol. 45, no. 5, pp. 2019-2031, May 2012.
[9] H. Hoffmann, Kernel PCA for novelty detection, Pattern Recognition, vol. 40, no. 3, pp. 863-874, March 2007.
[10] T. Hofmann, B. Schölkopf, and A. J. Smola, Kernel methods in machine learning, The Annals of Statistics, vol. 36, no. 3, pp. 1171-1220, 2008.
[11] S. Mika, B. Schölkopf, A. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, Kernel PCA and de-noising in feature spaces, in Advances in Neural Information Processing Systems, 1999, pp. 536-542, MIT Press.

[12] T.-Y. Kwok and I. W.-H. Tsang, The pre-image problem in kernel methods, IEEE Trans. on Neural Networks, vol. 15, no. 6, pp. 1517-1525, Nov. 2004.
[13] N. Kwak, Principal component analysis based on L1-norm maximization, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 30, no. 9, pp. 1672-1680, Sep. 2008.
[14] K. Fukunaga, Intrinsic dimensionality extraction, Handbook of Statistics, vol. 2, pp. 347-360, 1982.
[15] F. Camastra, Data dimensionality estimation methods: a survey, Pattern Recognition, vol. 36, pp. 2945-2954, 2003.