Nonlinear Projection Trick in kernel methods: An alternative to the Kernel Trick

Nojun Kwak, Member, IEEE

N. Kwak is an associate professor at the Graduate School of Convergence Science and Technology, Seoul National University, Korea. This work was supported by the Korea Research Foundation Grant funded by the Korean Government (KRF).

Abstract — In kernel methods such as kernel PCA and support vector machines, the so-called kernel trick is used to avoid direct calculations in a high (virtually infinite) dimensional kernel space. In this paper, based on the fact that the effective dimensionality of a kernel space is less than the number of training samples, we propose an alternative to the kernel trick that explicitly maps the input data into a reduced dimensional kernel space. This mapping is easily obtained by the eigenvalue decomposition of the kernel matrix. The proposed method is named the nonlinear projection trick, in contrast to the kernel trick. With this technique, the applicability of kernel methods is widened to arbitrary algorithms that do not rely on the dot product. The equivalence between the kernel trick and the nonlinear projection trick is shown for several conventional kernel methods. In addition, we extend PCA-L1, which utilizes the L1-norm instead of the L2-norm (or dot product), into a kernel version and show the effectiveness of the proposed approach.

Index Terms — Kernel methods, nonlinear projection trick, KPCA, support vector machines, KPCA-L1, dimensionality reduction.

I. INTRODUCTION

In the pattern recognition and machine learning communities, kernel methods have gained attention since the mid-1990s and have been successfully applied to pattern classification, dimensionality reduction, regression problems and so on [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]. In kernel methods, the kernel trick is used to simplify the optimization in a high dimensional feature space (the terms feature space and kernel space are used interchangeably in this paper to denote the high dimensional space defined by a kernel function) without an explicit mapping of the input data into that space. This is done by introducing a kernel function of two elements in the input space and regarding it as a dot product of the corresponding two elements in the kernel space. If the optimization problem contains the elements of the feature space only in the form of dot products, then the problem can be solved without knowing the explicit mapping. The dimension of the feature space, which is unmeasurable and can be infinite, depends on the specific kernel function used.

Because the kernel trick takes care of the dot product in the kernel space rather than the mapping itself, little attention has been paid to the mapping of the input data to the feature space. However, there are many optimization problems that cannot be formulated using the dot product only and thus could not be extended to a kernel space by the kernel trick. In addition, without knowledge of the mapped image in the feature space, it is difficult to solve data reconstruction problems by kernel methods. Although this pre-image problem in kernel PCA has been tackled in [11] and [12], a direct mapping of the input data to a vector in the feature space by kernel methods has not been studied yet. It is well known that the mapped images of a finite number of training samples constitute a subspace of the possibly infinite dimensional feature space [1].
In addition, most operations in kernel methods are performed in the training stage and are carried out on this finite dimensional subspace of the possibly infinite dimensional feature space. Based on this idea, in this paper we raise and answer the question "Can we find a direct mapping of the input data to a vector in the feature space by kernel methods?"

This paper proposes an alternative to the kernel trick which directly maps input samples into a reduced dimensional kernel space. The proposed approach is named the nonlinear projection trick, in contrast to the kernel trick. The nonlinear projection trick widens the applicability of kernel methods to any algorithm that is applicable in the input space, even if its formulation does not rely only on the dot product. We also show the equivalence between the proposed nonlinear projection trick and the kernel trick by several examples. As an application of the nonlinear projection trick, PCA-L1 [13], whose objective function is based on the L1-norm rather than the L2-norm, is extended to a kernel version, which would not have been possible with the kernel trick.

The paper is organized as follows. In the next section, the structure of a kernel space, such as its bases and coordinates, is studied. In Section III, the proposed nonlinear projection trick is formulated. The proposed approach is applied to conventional kernel methods such as KPCA [1], KSVM (kernel support vector machine) [2] and KFD (kernel Fisher discriminant) [4] in Section IV to show the equivalence between the kernel trick and the nonlinear projection trick. In addition, a kernel version of PCA-L1 (KPCA-L1), which could not have been derived from the conventional kernel trick, is derived by the nonlinear projection trick in this section. Finally, conclusions follow in Section V.

II. THE STRUCTURE OF A KERNEL SPACE

In this section, the structure, such as the bases and the corresponding coordinates, of a kernel space with a finite number of training samples is studied. Before going on, in the following subsection, we briefly define the notations and variables that will be used in this paper.

Fig. 1. Projections in the feature space.

A. Notations

X ≡ [x_1, ..., x_n] ∈ R^{d×n}: the training data matrix, where d is the dimension of the input space and n is the number of training samples.

Φ(X) ≡ [ϕ(x_1), ..., ϕ(x_n)] ∈ R^{f×n}: the mapped training data in the f-dimensional feature space after the mapping ϕ(·). We consider ϕ(x_i) as a vector with respect to a countable orthonormal basis of the respective RKHS (reproducing kernel Hilbert space). Φ(X) is assumed to have zero mean, i.e., Σ_{i=1}^n ϕ(x_i) = 0. Note that f can be infinite.

k(x, y) ≡ ⟨ϕ(x), ϕ(y)⟩ = ϕ(x)^T ϕ(y) ∈ R: a kernel function of any two inputs x, y ∈ R^d.

K ≡ Φ(X)^T Φ(X) = [k(x_i, x_j)] ∈ R^{n×n}: the kernel matrix of the training data. The rank of K is r, and r ≤ n − 1 because Φ(X) is assumed to be centered.

k(x) ≡ Φ(X)^T ϕ(x) ∈ R^n: the kernel vector of any x ∈ R^d.

P: the r-dimensional subspace of the feature space spanned by the mapped training samples Φ(X).

ϕ_P(x): the projection of ϕ(x) onto P. If x lies in P (e.g., x is one of the training samples), ϕ_P(x) = ϕ(x).

ϕ_w(x): the projection of ϕ(x) onto the one-dimensional vector space formed by a vector w ∈ R^f. In most cases, w is restricted to reside in the subspace P, i.e., w = Φ(X)α for some α ∈ R^n.

The relationship among ϕ(x), ϕ_P(x) and ϕ_w(x) is illustrated in Fig. 1. Note that if w resides in P, as in the figure, then w^T ϕ(x) = w^T ϕ_P(x).

B. The bases and the coordinates of a kernel space

It is well known that the effective dimensionality of the feature space defined by the kernel function k(·,·) given a finite number of training samples X is finite [1]. (The term intrinsic dimensionality is defined and widely used in the machine learning community [14]. Because projection methods based on eigenvalue decomposition tend to overestimate the intrinsic dimensionality [15], the effective dimensionality can be interpreted as an upper bound on the intrinsic dimensionality of a kernel space.) In this subsection, the bases of the subspace spanned by the mapped training data Φ(X), as well as the coordinates of the mapping ϕ(x) of an unseen datum x ∈ R^d, are specified.

Lemma 1 (Bases of P): Let r be the rank of K and let K = UΛU^T be an eigenvalue decomposition of K composed of only the nonzero eigenvalues of K, where U = [u_1, ..., u_r] ∈ R^{n×r} has orthonormal columns and Λ = diag(λ_1, ..., λ_r) ∈ R^{r×r}. Then the columns of Π ≡ Φ(X)UΛ^{-1/2} = [π_1, ..., π_r] ∈ R^{f×r} constitute an orthonormal basis of P.

Proof: Because K = Φ(X)^T Φ(X), it is obvious that r, the rank of K, is the dimension of the subspace spanned by Φ(X). By utilizing the orthonormality of U (U^T U = I_r) and the definition of K, it is easy to check that Π^T Π = I_r, so the columns of Π form an orthonormal basis of P.

From the above lemma, π_i = Φ(X) u_i λ_i^{-1/2} (1 ≤ i ≤ r) is the i-th basis vector of P, where u_i and λ_i are the i-th unit eigenvector and eigenvalue of K, respectively.

Theorem 2 (Coordinate of ϕ_P(x)): For any x ∈ R^d, the projection ϕ_P(x) of ϕ(x) onto P is obtained by ϕ_P(x) = Πy, where y = Λ^{-1/2} U^T k(x) can be regarded as the coordinate vector of ϕ_P(x) in the Cartesian coordinate system whose orthonormal bases are the columns of Π.

Proof: Because ϕ_P(x) is in P, ϕ_P(x) = Πy for some y ∈ R^r. If ϕ_P(x) is the projection of ϕ(x) onto P, whose bases are Π, it must minimize ||ϕ(x) − ϕ_P(x)||_2^2. Therefore, the optimal y satisfies

y = argmin_y ||ϕ(x) − Πy||_2^2 = argmin_y ( ||ϕ(x)||_2^2 − 2ϕ(x)^T Πy + y^T Π^T Πy ) = argmin_y ( y^T y − 2ϕ(x)^T Πy ).      (1)

The solution of this quadratic problem is y = Π^T ϕ(x) = Λ^{-1/2} U^T k(x).

Note that ϕ_P(x) can be regarded as a nonlinear mapping of x onto the r-dimensional subspace spanned by Π. From the above theorem, this mapping can further be thought of as a direct mapping of x to y = Λ^{-1/2} U^T k(x).
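As a quick numerical illustration of Theorem 2 (not part of the original paper), the following NumPy sketch maps each sample to its coordinate vector y = Λ^{-1/2} U^T k(x) and checks that dot products in the new coordinates reproduce the kernel values, i.e., y_i^T y_j ≈ k(x_i, x_j). The Gaussian kernel, the toy data and the function names are illustrative assumptions; centering is ignored here and is treated in Section III-A.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the columns of A (d x m) and B (d x n)."""
    sq = (A**2).sum(0)[:, None] + (B**2).sum(0)[None, :] - 2.0 * A.T @ B
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 10))            # toy training data: d = 2, n = 10

K = rbf_kernel(X, X)                    # kernel matrix K = Phi(X)^T Phi(X)
lam, U = np.linalg.eigh(K)              # eigenvalue decomposition K = U Lam U^T
keep = lam > 1e-10                      # keep only the (numerically) nonzero eigenvalues
lam, U = lam[keep], U[:, keep]

def project(x):
    """Coordinates y = Lam^{-1/2} U^T k(x) of phi_P(x) in the basis Pi (Theorem 2)."""
    kx = rbf_kernel(X, x.reshape(-1, 1)).ravel()
    return (U / np.sqrt(lam)).T @ kx

Y = np.column_stack([project(X[:, i]) for i in range(X.shape[1])])
print(np.allclose(Y.T @ Y, K))          # True: dot products of the coordinates match K
```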
Corollary 3 (Coordinates of the mapped training data): The coordinates of Φ(X) are Y = Λ^{1/2} U^T.

Proof: It is easy to show that the training data X are mapped to Y = Λ^{-1/2} U^T K = Λ^{-1/2} U^T UΛU^T = Λ^{1/2} U^T.

Note that K = UΛU^T = Y^T Y. (K can also be decomposed directly into K = Ȳ^T Ȳ by the Cholesky decomposition, in which case Ȳ is a unique upper triangular matrix. The singular value decomposition of Ȳ is Ȳ = V Λ^{1/2} U^T = V Y for some unitary matrix V; therefore, Ȳ can be interpreted as a rotation of Y.)

Lemma 4 (Mean of the mapped training data): The mapped training data Y = Λ^{1/2} U^T are centered, i.e., Σ_{i=1}^n y_i = 0.

Proof: From Theorem 2, y_i = Λ^{-1/2} U^T k(x_i), and therefore Σ_{i=1}^n y_i = Λ^{-1/2} U^T Σ_{i=1}^n k(x_i) = Λ^{-1/2} U^T Φ(X)^T Σ_{i=1}^n ϕ(x_i) = 0.

Lemma 5 (Coordinate of the residual): For any x ∈ R^d, let X' ≡ [X, x] ∈ R^{d×(n+1)} and Φ(X') ≡ [Φ(X), ϕ(x)] ∈ R^{f×(n+1)}. Let δϕ_P(x) ≡ ϕ(x) − ϕ_P(x) be the residual vector of ϕ(x) with respect to P. If δϕ_P(x) ≠ 0, Φ(X') lies in an (r+1)-dimensional subspace containing P. Furthermore, the coordinates of the mapped training data Φ(X) become [Y^T, 0]^T ∈ R^{(r+1)×n}, while the coordinate of ϕ(x) becomes [y^T, y_{r+1}]^T ∈ R^{r+1}, where y_{r+1} = (k(x, x) − y^T y)^{1/2}.

Proof: It is obvious that δϕ_P(x) is orthogonal to P because ϕ_P(x) is the projection of ϕ(x) onto P.

Furthermore, if the residual vector δϕ_P(x) ≠ 0, then δϕ_P(x) adds one dimension to ϕ(x), so Φ(X') (= [Φ(X), ϕ(x)]) lies in an (r+1)-dimensional subspace P' of the f-dimensional feature space which contains P. If we retain the bases Π (= [π_1, ..., π_r]) of P as the first r bases of P', then the new bases Π' of P' become

Π' = [Π, π_{r+1}], where π_{r+1} = δϕ_P(x) / ||δϕ_P(x)||_2.      (2)

In the coordinate system of Π', the first r coordinates of the mapped training data Φ(X) are unchanged, with the (r+1)-st coordinate being identically 0. Likewise, the first r coordinates of ϕ(x) are identical to the coordinates of ϕ_P(x), i.e., y, while the (r+1)-st coordinate becomes the length (norm) of the residual vector δϕ_P(x). Finally, because ϕ(x) = ϕ_P(x) + δϕ_P(x) and ϕ_P(x) ⊥ δϕ_P(x), we obtain ||δϕ_P(x)||_2^2 = ||ϕ(x)||_2^2 − ||ϕ_P(x)||_2^2 = k(x, x) − y^T y by the Pythagorean identity.

Lemma 6 (Coordinate of a vector w): If a vector w in P can be written in the form w = Φ(X)α, then it can also be written as w = Πβ, where β = Yα.

Proof: Because the columns of Π form an orthonormal basis of P, any vector w in P can be written as w = ΠΠ^T w. Therefore,

w = ΠΛ^{-1/2} U^T Φ(X)^T Φ(X)α = ΠΛ^{-1/2} U^T Kα = ΠYα = Πβ.      (3)

Note that β = Yα is the coordinate vector of w in P.

Corollary 7 (Coordinate of ϕ_w(x)): The projection ϕ_w(x) of ϕ(x) onto w can be obtained by ϕ_w(x) = Πγ, where γ = (ββ^T / β^T β) y.

Proof: Let w̄ = w / ||w||_2 be a unit vector. Then

ϕ_w(x) = (w̄^T ϕ(x)) w̄ = (w̄^T (ϕ_P(x) + δϕ_P(x))) w̄ = (w̄^T ϕ_P(x)) w̄ = (w̄^T ΠΛ^{-1/2} U^T k(x)) w̄
       = (1 / β^T β) Πββ^T Π^T ΠΛ^{-1/2} U^T k(x) = Π (ββ^T / β^T β) Λ^{-1/2} U^T k(x) = Π (ββ^T / β^T β) y = Πγ.      (4)

In the second equality, δϕ_P(x) is orthogonal to P and thus also orthogonal to w, which lies in P. Note that if ||w||_2 = 1, then ||β||_2 = 1 and γ = ββ^T y.

III. THE PROPOSED METHOD

In this section, using the results of the previous section, we give a new interpretation of kernel methods: any kernel method is equivalent to applying the corresponding conventional method, applicable in the input space, to the data with the new coordinates. Before doing so, we deal with the problem of centering in the feature space.

A. Centering

In the previous section, the training data Φ(X) were assumed to be centered in the feature space, i.e., Σ_{i=1}^n ϕ(x_i) = 0. If this condition is not satisfied, P, which is made up of {p : p = Σ_{i=1}^n α_i ϕ(x_i), α_i ∈ R}, is not a subspace but a manifold in the feature space, i.e., 0 ∉ P. Therefore, centering is a necessary step in the practical application of the proposed idea to kernel methods.

Let Ψ(X) ≡ [ψ(x_1), ..., ψ(x_n)] ∈ R^{f×n} be the uncentered data in the feature space and let ψ̄ ≡ (1/n) Σ_{i=1}^n ψ(x_i) = (1/n) Ψ(X) 1_n ∈ R^f be the mean of Ψ(X). Here, 1_n ≡ [1, ..., 1]^T ∈ R^n is a vector whose components are identically 1. By subtracting the mean ψ̄ from the uncentered data Ψ(X), the centered data Φ(X) can be obtained as

Φ(X) = Ψ(X) − ψ̄ 1_n^T = Ψ(X)(I_n − (1/n) 1_n 1_n^T) = Ψ(X)(I_n − E_n),      (5)

where I_n is the n×n identity matrix and E_n ≡ (1/n) 1_n 1_n^T is an n×n matrix whose components are identically 1/n.

Consider an uncentered kernel function κ(a, b) ≡ ψ(a)^T ψ(b). Given this kernel function and the training data X, the uncentered kernel matrix is defined as K̄ ≡ [κ(x_i, x_j)] ∈ R^{n×n}, i.e., K̄ = Ψ(X)^T Ψ(X) in matrix form. Then, from (5), the centered kernel matrix K can be obtained as

K = Φ(X)^T Φ(X) = (I_n − E_n) K̄ (I_n − E_n).      (6)

Once a test datum x ∈ R^d is presented, an uncentered kernel vector κ(x) ≡ [κ(x_1, x), ..., κ(x_n, x)]^T ∈ R^n can be computed.
Because κ(x_i, x) = ψ(x_i)^T ψ(x), if we subtract the mean ψ̄ from both ψ(x_i) and ψ(x), we can obtain the centered kernel vector k(x) from the uncentered kernel vector κ(x) as follows:

k(x) = Φ(X)^T ϕ(x) = [Ψ(X)(I_n − E_n)]^T (ψ(x) − ψ̄) = (I_n − E_n) Ψ(X)^T (ψ(x) − (1/n) Ψ(X) 1_n) = (I_n − E_n) [κ(x) − (1/n) K̄ 1_n].      (7)
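The centering equations (6) and (7) translate directly into code. A minimal sketch (NumPy; the function names are hypothetical, and K_unc / kappa_x denote the uncentered kernel matrix K̄ and the uncentered kernel vector κ(x)):

```python
import numpy as np

def center_kernel_matrix(K_unc):
    """Eq. (6): K = (I_n - E_n) K_unc (I_n - E_n)."""
    n = K_unc.shape[0]
    C = np.eye(n) - np.full((n, n), 1.0 / n)    # I_n - E_n
    return C @ K_unc @ C

def center_kernel_vector(kappa_x, K_unc):
    """Eq. (7): k(x) = (I_n - E_n) [kappa(x) - (1/n) K_unc 1_n]."""
    n = K_unc.shape[0]
    C = np.eye(n) - np.full((n, n), 1.0 / n)
    return C @ (kappa_x - K_unc.mean(axis=1))   # (1/n) K_unc 1_n equals the row means
```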

B. Kernel methods: nonlinear projection trick

Now, a new framework for kernel methods, referred to as the nonlinear projection trick, is presented. Below, a general kernel method is denoted as KM, while its corresponding method in the input space is denoted as M. As inputs to KM, the training and test data X and x, a specific (uncentered) kernel function κ(·,·) and a method M are given. Here, M can be any feature extraction method, such as PCA, LDA and ICA, as well as any classifier/regressor, such as SVM/SVR.

Algorithm: Kernel method KM (Input: X, x, κ(·,·), method M)

Training phase
1) Compute the uncentered kernel matrix K̄ such that K̄_ij = κ(x_i, x_j).
2) Compute the centered kernel matrix K by K = (I_n − E_n) K̄ (I_n − E_n).
3) Obtain the eigenvalue decomposition of K such that K = UΛU^T, where Λ is composed of only the nonzero eigenvalues of K and the columns of U are the corresponding unit eigenvectors of K.
4) Compute Y, the coordinates of Φ(X), by Y = Λ^{1/2} U^T.
5) Apply the method M to Y; this is equivalent to applying the kernel method KM to X, i.e., M(Y) ≡ KM(X).

Test phase
1) Compute the uncentered kernel vector κ(x).
2) Compute the centered kernel vector k(x) by k(x) = (I_n − E_n)[κ(x) − (1/n) K̄ 1_n].
3) Obtain y, the coordinate of ϕ(x), by y = Λ^{-1/2} U^T k(x).
4) Apply the method M to y; this is equivalent to applying the kernel method KM to x, i.e., M(y) ≡ KM(x).
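The KM procedure above can be packaged as a small fit/transform object so that an arbitrary input-space method M can be trained on Y and evaluated on y. The following sketch is an illustration only: the class name, the default Gaussian kernel and the eigenvalue threshold are assumptions, not part of the paper.

```python
import numpy as np

class NonlinearProjectionTrick:
    """Nonlinear projection trick: explicit mapping into the reduced kernel space.
    fit(X) returns Y = Lam^{1/2} U^T; transform(x) returns y = Lam^{-1/2} U^T k(x)."""

    def __init__(self, kernel=None, eps=1e-10):
        # Default kernel: a Gaussian kernel on column-sample matrices (illustrative choice).
        self.kernel = kernel or (lambda A, B: np.exp(-0.5 * (
            (A**2).sum(0)[:, None] + (B**2).sum(0)[None, :] - 2.0 * A.T @ B)))
        self.eps = eps

    def fit(self, X):                          # X: d x n, columns are training samples
        self.X = X
        n = X.shape[1]
        Kbar = self.kernel(X, X)               # step 1: uncentered kernel matrix
        self.C = np.eye(n) - np.full((n, n), 1.0 / n)
        K = self.C @ Kbar @ self.C             # step 2: centered kernel matrix
        lam, U = np.linalg.eigh(K)             # step 3: eigenvalue decomposition
        keep = lam > self.eps                  #         keep nonzero eigenvalues only
        self.lam, self.U, self.Kbar = lam[keep], U[:, keep], Kbar
        self.Y = np.sqrt(self.lam)[:, None] * self.U.T   # step 4: Y = Lam^{1/2} U^T
        return self.Y                          # step 5: feed Y to any method M

    def transform(self, x):                    # x: a single test sample of dimension d
        kappa = self.kernel(self.X, x.reshape(-1, 1)).ravel()   # test step 1
        k = self.C @ (kappa - self.Kbar.mean(axis=1))           # test step 2
        return (self.U / np.sqrt(self.lam)).T @ k               # test step 3
```

A typical use is Y = npt.fit(X); train M on the columns of Y; then evaluate M on npt.transform(x) for each test sample, which corresponds to test step 4.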
IV. APPLICATIONS

In this section, the nonlinear projection trick is applied to several conventional kernel methods such as KPCA, KFD and kernel SVM. In doing so, we show that different kernel methods can be interpreted in the unified framework of applying conventional linear methods in the reduced dimensional feature space. As a new contribution, we extend PCA-L1 [13] to a kernel version.

A. Kernel PCA (KPCA)

1) Conventional derivation: In the conventional KPCA, the objective is to find a projection vector w in the feature space that minimizes the sum of squared errors or, equivalently, maximizes the variance along the projection direction. The problem becomes

w* = argmax_w ||w^T Φ(X)||_2^2 = argmax_w w^T Φ(X)Φ(X)^T w  such that ||w||_2 = 1.      (8)

The solution of (8) can be found by the eigenvalue decomposition of S^f ≡ Φ(X)Φ(X)^T, the scatter matrix in the feature space, i.e.,

S^f w_i = λ_i w_i,  λ_1 ≥ λ_2 ≥ ... ≥ λ_m.      (9)

Using the fact that w lies in P, w can be written as w = Φ(X)α, and (9) becomes

Φ(X) K α_i = λ_i Φ(X) α_i,  λ_1 ≥ λ_2 ≥ ... ≥ λ_m.      (10)

Left-multiplying both sides by Φ(X)^T, the problem becomes finding {α_i}_{i=1}^m that satisfy K^2 α_i = λ_i K α_i, which further reduces to finding {α_i}_{i=1}^m such that

K α_i = λ_i α_i,  λ_1 ≥ λ_2 ≥ ... ≥ λ_m,      (11)

with the constraint α_i^T K α_i = 1. The solutions {α_i}_{i=1}^m are the eigenvectors of K corresponding to the m largest eigenvalues. The i-th solution satisfying α_i^T K α_i = 1 becomes

α_i = u_i λ_i^{-1/2}.      (12)

Let W ≡ [w_1, ..., w_m] ∈ R^{f×m}; then the projection of ϕ(x) onto W gives an m-dimensional vector z = W^T ϕ(x). Because w_i = Φ(X)α_i, this further reduces to z = A^T k(x), where A ≡ [α_1, ..., α_m] ∈ R^{n×m}. From (12),

A = [u_1 λ_1^{-1/2}, ..., u_m λ_m^{-1/2}] = U_m Λ_m^{-1/2},      (13)

where U_m ≡ [u_1, ..., u_m] and Λ_m ≡ diag(λ_1, ..., λ_m), and finally the nonlinear projection of x becomes

z = Λ_m^{-1/2} U_m^T k(x).      (14)

2) New derivation: In the new interpretation of kernel methods presented in the previous section, the coordinates of the mapped training data Φ(X) are first computed as Y = Λ^{1/2} U^T ∈ R^{r×n} (Corollary 3). Then the problem becomes finding the maximum variance direction in R^r. It is easy to show that the scatter matrix of Y is Y Y^T = Λ because the mapped training data Y are centered (Lemma 4), so e_i is the i-th eigenvector of the scatter matrix and therefore the i-th principal vector. Here, e_i ∈ R^r is a column vector whose elements are identically zero except the i-th element, which is one. Once a new datum x is presented, by Theorem 2 it is projected to y = Λ^{-1/2} U^T k(x), and the i-th projection of y becomes

z_i = e_i^T y = e_i^T Λ^{-1/2} U^T k(x) = λ_i^{-1/2} u_i^T k(x).      (15)

Now z ≡ [z_1, ..., z_m]^T, the m-dimensional nonlinear projection of x, becomes

z = Λ_m^{-1/2} U_m^T k(x).      (16)

Note that (16) is identical to (14), so the new interpretation of KPCA is equivalent to the conventional derivation. However, with the new interpretation we can obtain the exact coordinates of Φ(X) and the corresponding principal vectors, while the conventional derivation cannot find those coordinates.
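Under the sketch of Section III-B, KPCA via the nonlinear projection trick reduces to coordinate selection: since the scatter matrix of Y is the diagonal matrix Λ, the principal directions are the standard basis vectors e_i, and (16) is obtained by simply reading off the coordinates of y with the largest eigenvalues. A hypothetical helper (np.linalg.eigh stores eigenvalues in ascending order, so the leading components are the last coordinates):

```python
def kpca_features(y, m):
    """m-dimensional KPCA feature of Eq. (16), given y = npt.transform(x) from the
    NonlinearProjectionTrick sketch (eigenvalues stored in ascending order)."""
    return y[-m:][::-1]   # coordinates along the m largest eigenvalues, in descending order
```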

B. Kernel SVM (KSVM)

1) Conventional derivation: Given training data {(x_i, c_i)}_{i=1}^n, where x_i ∈ R^d and c_i ∈ {−1, 1}, the kernel SVM solves the optimization problem

(w*, b*) = argmin_{(w,b)} (1/2)||w||_2^2  subject to c_i (w^T ϕ(x_i) + b) ≥ 1, i = 1, ..., n.      (17)

By introducing Lagrange multipliers {α_i}_{i=1}^n, the above constrained problem can be expressed in the dual form

α* = argmax_α Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j c_i c_j k(x_i, x_j)  subject to Σ_{i=1}^n α_i c_i = 0 and α_i ≥ 0, i = 1, ..., n.      (18)

Once we solve the above dual problem, w* can be computed from α* as w* = Σ_{i=1}^n α_i* c_i ϕ(x_i), and b* can be computed so that it meets the KKT conditions α_i*[c_i(w*^T ϕ(x_i) + b*) − 1] = 0 for all i = 1, ..., n. Once a test sample x is presented, it is classified as sgn(w*^T ϕ(x) + b*). Note that w*^T ϕ(x) = Σ_{i=1}^n α_i* c_i k(x_i, x), so we do not have to find an explicit expression for w* in either the training or the test phase.

2) New derivation: In the new derivation, given the training data {(x_i, c_i)}_{i=1}^n, X = [x_1, ..., x_n] is first mapped to Y = Λ^{1/2} U^T = [y_1, ..., y_n] and then a linear SVM is solved for {(y_i, c_i)}_{i=1}^n. The primal problem becomes

(v*, d*) = argmin_{(v,d)} (1/2)||v||_2^2  subject to c_i (v^T y_i + d) ≥ 1, i = 1, ..., n,      (19)

and the corresponding dual problem is

β* = argmax_β Σ_{i=1}^n β_i − (1/2) Σ_{i,j=1}^n β_i β_j c_i c_j y_i^T y_j  subject to Σ_{i=1}^n β_i c_i = 0 and β_i ≥ 0, i = 1, ..., n.      (20)

Note that because k(x_i, x_j) = y_i^T y_j, (18) and (20) are exactly the same problem, so β* = α* and d* = b*. In this case, v* can be found explicitly as v* = Σ_{i=1}^n β_i* c_i y_i. Once a test sample x is presented, it is first mapped to y = Λ^{-1/2} U^T k(x) and then classified as sgn(v*^T y + d*).
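The new derivation can be run directly as a linear SVM on the coordinates Y. A sketch, assuming the hypothetical NonlinearProjectionTrick class from Section III-B and using scikit-learn's LinearSVC as a stand-in linear SVM solver (a soft-margin implementation, so it only approximates the hard-margin problems (17) and (19)):

```python
from sklearn.svm import LinearSVC

def train_kernel_svm(npt, X, c):
    """Kernel SVM via the nonlinear projection trick: a linear SVM on Y."""
    Y = npt.fit(X)                 # columns of Y are the coordinates y_i
    clf = LinearSVC(C=1e3)         # large C to approximate the hard margin
    clf.fit(Y.T, c)                # scikit-learn expects samples as rows
    return clf

def classify(npt, clf, x):
    """Map a test sample to y and apply the linear decision rule sgn(v^T y + d)."""
    y = npt.transform(x)
    return clf.predict(y.reshape(1, -1))[0]
```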
C. Kernel Fisher discriminant (KFD)

1) Conventional derivation: In the conventional KFD, which is a kernel extension of LDA (linear discriminant analysis), the objective is to find a projection vector w in the feature space such that the ratio of the between-class variance to the within-class variance after the projection is maximized. Given C-class training data {(x_i, c_i)}_{i=1}^n, where x_i ∈ R^d and c_i ∈ {1, ..., C}, the problem becomes

w* = argmax_w (w^T S_B^Φ w) / (w^T S_W^Φ w).      (21)

Here,

S_B^Φ = Σ_{i=1}^n (ϕ(x_i) − ϕ̄)(ϕ(x_i) − ϕ̄)^T ∈ R^{f×f}      (22)

and

S_W^Φ = Σ_{i=1}^n (ϕ(x_i) − ϕ̄_{c_i})(ϕ(x_i) − ϕ̄_{c_i})^T ∈ R^{f×f}      (23)

are the between-class and within-class scatter matrices in the feature space, respectively, where ϕ̄ = (1/n) Σ_{i=1}^n ϕ(x_i) and ϕ̄_c = (1/n_c) Σ_{j : c_j = c} ϕ(x_j) are the total mean and the class mean in the feature space, and n_c denotes the number of training samples whose class label is c.

As in the previous section, without loss of generality, we assume that Φ(X) is centered, i.e., ϕ̄ = 0, so the between-class scatter matrix becomes S_B^Φ = Φ(X)Φ(X)^T. Note that S_B^Φ in (22) can also be written as S_B^Φ = Σ_{c=1}^C n_c (ϕ̄_c − ϕ̄)(ϕ̄_c − ϕ̄)^T.

Restricting w to P, w can be written as w = Φ(X)α, and then w^T ϕ(x_i) = α^T k(x_i). Using this, (21) becomes

α* = argmax_α (α^T S_B^K α) / (α^T S_W^K α),      (24)

where

S_B^K = K^2 ∈ R^{n×n}      (25)

and

S_W^K = Σ_{i=1}^n (k(x_i) − k̄_{c_i})(k(x_i) − k̄_{c_i})^T ∈ R^{n×n}.      (26)

Here, k̄_c ≡ (1/n_c) Σ_{j : c_j = c} k(x_j). The solution of (24) can be obtained by solving the following generalized eigenvalue decomposition problem:

S_B^K α_i = λ_i S_W^K α_i,  λ_1 ≥ λ_2 ≥ ... ≥ λ_m.      (27)

Once {α_i}_{i=1}^m are obtained, for a given sample x an m-dimensional nonlinear feature vector z ∈ R^m is obtained by the dot products of W ≡ [w_1, ..., w_m] = Φ(X)[α_1, ..., α_m] and ϕ(x), i.e., z = W^T ϕ(x) = A^T k(x), where A ≡ [α_1, ..., α_m] ∈ R^{n×m}.

2) New derivation: In the new derivation of KFD, for the given training data X, the (centered) kernel matrix K is computed and its eigenvalue decomposition K = UΛU^T is used to obtain the nonlinear mapping Y = Λ^{1/2} U^T. Then the within-class and between-class scatter matrices of Y are computed as

S_W^Y = Σ_{i=1}^n (y_i − ȳ_{c_i})(y_i − ȳ_{c_i})^T ∈ R^{r×r}      (28)

and

S_B^Y = Σ_{c=1}^C n_c (ȳ_c − ȳ)(ȳ_c − ȳ)^T = Σ_{i=1}^n y_i y_i^T = Y Y^T ∈ R^{r×r},      (29)

respectively, where ȳ and ȳ_c are the total mean and the class mean of Y. Recall that ȳ = 0 from Lemma 4. Applying LDA directly to Y, the objective becomes finding β ∈ R^r such that

β* = argmax_β (β^T S_B^Y β) / (β^T S_W^Y β).      (30)

Noting that y_i = Λ^{-1/2} U^T k(x_i) (from Theorem 2), the numerator becomes

β^T S_B^Y β = β^T Λ^{-1/2} U^T K K U Λ^{-1/2} β = α^T K^2 α,      (31)

where α = UΛ^{-1/2} β. Likewise, the denominator becomes

β^T S_W^Y β = α^T ( Σ_{i=1}^n (k(x_i) − k̄_{c_i})(k(x_i) − k̄_{c_i})^T ) α.      (32)

Substituting (31) and (32) into (30), we can see that the problem becomes identical to (24). Furthermore, because α = UΛ^{-1/2} β, multiplying both sides on the left by Λ^{1/2} U^T gives β = Λ^{1/2} U^T α = Yα, which was anticipated in Lemma 6. After obtaining {β_i}_{i=1}^m, the nonlinear projection z of x can be obtained by the dot products of B ≡ [β_1, ..., β_m] and y = Λ^{-1/2} U^T k(x). Then z = B^T Λ^{-1/2} U^T k(x) = A^T k(x).
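For completeness, a sketch of the new KFD derivation on the coordinates Y: form the scatter matrices (28) and (29) and solve the generalized eigenvalue problem behind (30) by whitening S_W^Y. The whitening route and the ridge term are implementation choices of this sketch, not part of the paper.

```python
import numpy as np

def kfd_directions(Y, labels, m, ridge=1e-8):
    """Leading m solutions beta of S_B^Y beta = lambda S_W^Y beta on the NPT coordinates Y (r x n)."""
    r, _ = Y.shape
    Sw = np.zeros((r, r))
    for c in np.unique(labels):
        Yc = Y[:, labels == c]
        D = Yc - Yc.mean(axis=1, keepdims=True)
        Sw += D @ D.T                            # within-class scatter, Eq. (28)
    Sb = Y @ Y.T                                 # between-class scatter, Eq. (29)
    w_vals, V = np.linalg.eigh(Sw + ridge * np.eye(r))
    Wm = V / np.sqrt(w_vals)                     # whitening transform: Wm^T Sw Wm ~ I
    d_vals, E = np.linalg.eigh(Wm.T @ Sb @ Wm)
    B = Wm @ E[:, ::-1][:, :m]                   # leading m discriminant directions
    return B                                     # nonlinear features: z = B.T @ y
```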

D. Kernel PCA-L1 (KPCA-L1)

Until now, we have presented examples of applying the new kernel interpretation to existing kernel methods such as KPCA, KSVM and KFD. In this subsection, we show that this interpretation of kernel methods is also applicable to algorithms which do not rely on the dot product, and to which the kernel trick is therefore not applicable.

In [13], PCA-L1, a robust version of PCA which maximizes the L1-norm dispersion of the projection, was introduced. Given training data X, the objective of PCA-L1 is to find a projection vector w* that solves the following optimization problem:

w* = argmax_w ||w^T X||_1 = argmax_w Σ_{i=1}^n |w^T x_i|  subject to ||w||_2 = 1.      (33)

For reference, the PCA-L1 algorithm is presented below.

Algorithm: PCA-L1 (Input: X)
1) Initialization: pick any w(0). Set w(0) ← w(0)/||w(0)||_2 and t = 0.
2) Polarity check: for all i ∈ {1, ..., n}, if w^T(t) x_i < 0, set p_i(t) = −1; otherwise, set p_i(t) = 1.
3) Flipping and maximization: set t ← t + 1 and w(t) = Σ_{i=1}^n p_i(t−1) x_i. Set w(t) ← w(t)/||w(t)||_2.
4) Convergence check:
   a. If w(t) ≠ w(t−1), go to Step 2.
   b. Else if there exists i such that w^T(t) x_i = 0, set w(t) ← (w(t) + Δw)/||w(t) + Δw||_2 and go to Step 2. Here, Δw is a small nonzero random vector.
   c. Otherwise, set w* = w(t) and stop.

This problem formulation can easily be extended to a kernel version by replacing X with Φ(X), and the problem becomes

w* = argmax_w ||w^T Φ(X)||_1 = argmax_w Σ_{i=1}^n |w^T ϕ(x_i)|  subject to ||w||_2 = 1.      (34)

Note that the kernel trick is not applicable to the above problem because the problem is not formulated in terms of the L2-norm (dot product). By Lemma 1, w can be expressed using the bases Π as w = Πβ, and from Theorem 2, w^T Φ(X) can be expressed as β^T Y. Furthermore, the constraint ||w||_2 = 1 becomes ||β||_2 = 1 because the columns of Π are orthonormal. Therefore, (34) becomes

β* = argmax_β ||β^T Y||_1  subject to ||β||_2 = 1.      (35)

Because (35) has exactly the same form as (33), the conventional PCA-L1 algorithm can be applied directly to Y to obtain the KPCA-L1 algorithm. This is a simple example of obtaining a kernel version of an algorithm by the nonlinear projection trick; the point is that the trick can be applied to an arbitrary method, as shown in Section III.
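Since (35) has the same form as (33), KPCA-L1 is just the PCA-L1 iteration run on Y. A sketch of that iteration on the NPT coordinates (the tolerance, the perturbation size and the function name are assumptions of this sketch):

```python
import numpy as np

def kpca_l1_direction(Y, max_iter=1000, seed=0):
    """PCA-L1 iteration of [13] applied to the NPT coordinates Y (r x n), i.e.,
    KPCA-L1 via Eq. (35): seeks a unit vector beta maximizing ||beta^T Y||_1."""
    rng = np.random.default_rng(seed)
    r, _ = Y.shape
    beta = rng.normal(size=r)
    beta /= np.linalg.norm(beta)                   # step 1: initialization
    for _ in range(max_iter):
        p = np.where(Y.T @ beta < 0, -1.0, 1.0)    # step 2: polarity check
        new = Y @ p                                # step 3: flipping and maximization
        new /= np.linalg.norm(new)
        if np.allclose(new, beta):                 # step 4: convergence check
            if np.any(np.abs(Y.T @ new) < 1e-12):  # step 4b: perturb a degenerate solution
                new = new + 1e-6 * rng.normal(size=r)
                new /= np.linalg.norm(new)
            else:
                return new                         # step 4c: converged
        beta = new
    return beta
```

A test sample x then receives the one-dimensional nonlinear feature β^T y with y = Λ^{-1/2} U^T k(x); further projection vectors can be obtained as in [13].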
V. CONCLUSIONS

This paper proposed a kernel method that directly maps the input space to the feature space by deriving the exact coordinates of the mapped input data. In conventional kernel methods, the kernel trick is used to avoid a direct nonlinear mapping of the input data onto a high (possibly infinite) dimensional feature space. Instead, the dot products of the mapped data are converted into kernel functions in the problems to be solved, making the problems solvable without ever having to explicitly map the original input data into a kernel space. However, the applicability of the kernel trick is restricted to problems where the mapped data appear only in the form of dot products. Therefore, arbitrary methods that do not utilize the dot product (or the L2-norm) could not be extended to kernel methods.

In this paper, based on the fact that the effective dimensionality of a kernel space is less than the number of training samples, an explicit mapping of the input data into a reduced dimensional kernel space was proposed. The mapping is obtained by the eigenvalue decomposition of the kernel matrix. With the proposed nonlinear projection trick, the applicability of kernel methods is widened to arbitrary methods that can be applied in the input space. In other words, not only the conventional kernel methods based on L2-norm optimization, such as KPCA and kernel SVM, but also methods utilizing other objective functions can be extended to kernel versions. The equivalence between the kernel trick and the nonlinear projection trick was shown for several conventional kernel methods such as KPCA, KSVM and KFD. In addition, we extended PCA-L1, which utilizes the L1-norm, into a kernel version and showed the effectiveness of the proposed approach.

REFERENCES

[1] B. Schölkopf, A.J. Smola, and K.R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, 1998.
[2] B. Schölkopf and A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, The MIT Press, 2002.
[3] R. Rosipal, M. Girolami, and L.J. Trejo, "Kernel PCA for feature extraction and de-noising in non-linear regression," Neural Computing & Applications, vol. 10, 2001.
[4] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.R. Müller, "Fisher discriminant analysis with kernels," in Proceedings of the 1999 IEEE Signal Processing Society Workshop on Neural Networks for Signal Processing IX, 1999.
[5] J. Yang, A.F. Frangi, J.-Y. Yang, D. Zhang, and Z. Jin, "KPCA plus LDA: A complete kernel Fisher discriminant framework for feature extraction and recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 27, no. 2, Feb. 2005.
[6] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, no. 10, 2000.
[7] M.H. Nguyen and F. De la Torre, "Robust kernel principal component analysis," Tech. Rep. Paper 44, Robotics Institute, 2008.
[8] N. Kwak, "Kernel discriminant analysis for regression problems," Pattern Recognition, vol. 45, no. 5, May 2012.
[9] H. Hoffmann, "Kernel PCA for novelty detection," Pattern Recognition, vol. 40, no. 3, March 2007.
[10] T. Hofmann, B. Schölkopf, and A.J. Smola, "Kernel methods in machine learning," The Annals of Statistics, vol. 36, no. 3, 2008.
[11] S. Mika, B. Schölkopf, A. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, "Kernel PCA and de-noising in feature spaces," in Advances in Neural Information Processing Systems, MIT Press, 1999.

[12] T.-Y. Kwok and I.W.-H. Tsang, "The pre-image problem in kernel methods," IEEE Trans. on Neural Networks, vol. 15, no. 6, Nov. 2004.
[13] N. Kwak, "Principal component analysis based on L1-norm maximization," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 30, no. 9, Sep. 2008.
[14] K. Fukunaga, "Intrinsic dimensionality extraction," Handbook of Statistics, vol. 2, 1982.
[15] F. Camastra, "Data dimensionality estimation methods: a survey," Pattern Recognition, vol. 36, 2003.


More information

Manifold Learning: Theory and Applications to HRI

Manifold Learning: Theory and Applications to HRI Manifold Learning: Theory and Applications to HRI Seungjin Choi Department of Computer Science Pohang University of Science and Technology, Korea seungjin@postech.ac.kr August 19, 2008 1 / 46 Greek Philosopher

More information

Eigenvoice Speaker Adaptation via Composite Kernel PCA

Eigenvoice Speaker Adaptation via Composite Kernel PCA Eigenvoice Speaker Adaptation via Composite Kernel PCA James T. Kwok, Brian Mak and Simon Ho Department of Computer Science Hong Kong University of Science and Technology Clear Water Bay, Hong Kong [jamesk,mak,csho]@cs.ust.hk

More information

Kernel Methods. Barnabás Póczos

Kernel Methods. Barnabás Póczos Kernel Methods Barnabás Póczos Outline Quick Introduction Feature space Perceptron in the feature space Kernels Mercer s theorem Finite domain Arbitrary domain Kernel families Constructing new kernels

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Support Vector Machines Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique

More information

ML (cont.): SUPPORT VECTOR MACHINES

ML (cont.): SUPPORT VECTOR MACHINES ML (cont.): SUPPORT VECTOR MACHINES CS540 Bryan R Gibson University of Wisconsin-Madison Slides adapted from those used by Prof. Jerry Zhu, CS540-1 1 / 40 Support Vector Machines (SVMs) The No-Math Version

More information

Stat542 (F11) Statistical Learning. First consider the scenario where the two classes of points are separable.

Stat542 (F11) Statistical Learning. First consider the scenario where the two classes of points are separable. Linear SVM (separable case) First consider the scenario where the two classes of points are separable. It s desirable to have the width (called margin) between the two dashed lines to be large, i.e., have

More information

Classification with Kernel Mahalanobis Distance Classifiers

Classification with Kernel Mahalanobis Distance Classifiers Classification with Kernel Mahalanobis Distance Classifiers Bernard Haasdonk and Elżbieta P ekalska 2 Institute of Numerical and Applied Mathematics, University of Münster, Germany, haasdonk@math.uni-muenster.de

More information

Efficient Kernel Discriminant Analysis via Spectral Regression

Efficient Kernel Discriminant Analysis via Spectral Regression Efficient Kernel Discriminant Analysis via Spectral Regression Deng Cai Xiaofei He Jiawei Han Department of Computer Science, University of Illinois at Urbana-Champaign Yahoo! Research Labs Abstract Linear

More information

MTTS1 Dimensionality Reduction and Visualization Spring 2014 Jaakko Peltonen

MTTS1 Dimensionality Reduction and Visualization Spring 2014 Jaakko Peltonen MTTS1 Dimensionality Reduction and Visualization Spring 2014 Jaakko Peltonen Lecture 3: Linear feature extraction Feature extraction feature extraction: (more general) transform the original to (k < d).

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 5: Vector Data: Support Vector Machine Instructor: Yizhou Sun yzsun@cs.ucla.edu October 18, 2017 Homework 1 Announcements Due end of the day of this Thursday (11:59pm)

More information