Nonlinear Projection Trick in kernel methods: An alternative to the Kernel Trick
Nojun Kwak, Member, IEEE

Abstract—In kernel methods such as kernel PCA and support vector machines, the so-called kernel trick is used to avoid direct calculations in a high (virtually infinite) dimensional kernel space. In this paper, based on the fact that the effective dimensionality of a kernel space is less than the number of training samples, we propose an alternative to the kernel trick that explicitly maps the input data into a reduced dimensional kernel space. This mapping is easily obtained by the eigenvalue decomposition of the kernel matrix. The proposed method is named the nonlinear projection trick, in contrast to the kernel trick. With this technique, the applicability of kernel methods is widened to arbitrary algorithms that do not utilize the dot product. The equivalence between the kernel trick and the nonlinear projection trick is shown for several conventional kernel methods. In addition, we extend PCA-L1, which utilizes the L1-norm instead of the L2-norm (or dot product), into a kernel version and show the effectiveness of the proposed approach.

Index Terms—Kernel methods, nonlinear projection trick, KPCA, support vector machines, KPCA-L1, dimensionality reduction.

I. INTRODUCTION

In the pattern recognition and machine learning communities, kernel methods have gained attention since the mid-1990s and have been successfully applied to pattern classification, dimensionality reduction, regression problems and so on [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]. In kernel methods, the kernel trick is used to simplify the optimization in a high dimensional feature space without an explicit mapping of the input data into the feature space. This is done by introducing a kernel function of two elements in the input space and regarding it as a dot product of the corresponding two elements in the kernel space.
If the optimization problem contains the elements of the feature space only in the form of dot products, then the problem can be solved without knowing the explicit mapping. The dimension of the feature space, which is unmeasurable and can be infinite, depends on the specific kernel function used.¹ Because the kernel trick takes care of the dot product in the kernel space rather than the mapping itself, little attention has been paid to the mapping of the input data into the feature space. However, there are many optimization problems that cannot be formulated using the dot product only, and thus could not be extended to a kernel space by the kernel trick. In addition, without knowledge of the mapped image in the feature space, it is difficult to solve the data reconstruction problem by kernel methods. Although this pre-image problem in kernel PCA has been tackled in [11] and [12], a direct mapping of the input data to a vector in the feature space by kernel methods has not been studied yet. It is well known that the mapped images of a finite number of training samples constitute a subspace of the possibly infinite dimensional feature space [1]. In addition, most operations in kernel methods are performed in the training stage, and these are done on this finite dimensional subspace of the possibly infinite dimensional feature space. Based on this idea, in this paper, we pose and answer the question "Can we find a direct mapping of the input data to a vector in the feature space by kernel methods?"

N. Kwak is an associate professor at the Graduate School of Convergence Science and Technology, Seoul National University, Korea. This work was supported by the Korea Research Foundation Grant funded by the Korean Government (KRF ).
¹The terms feature space and kernel space are used interchangeably to denote the high dimensional space defined by a kernel function.
This paper proposes an alternative to the kernel trick which directly maps input samples into a reduced dimensional kernel space. The proposed approach is named the nonlinear projection trick, in contrast to the kernel trick. The nonlinear projection trick widens the applicability of kernel methods to any algorithm applicable in the input space, even if its formulation does not rely only on the dot product. We also show the equivalence between the proposed nonlinear projection trick and the kernel trick by some examples. As an application of the nonlinear projection trick, PCA-L1 [13], whose objective function is in the L1-norm rather than the L2-norm, is extended to a kernel version, which would not have been possible by the kernel trick. The paper is organized as follows. In the next section, the structure of a kernel space, such as the bases and the coordinates, is studied. In Section III, the proposed nonlinear projection trick is formulated. The proposed approach is applied to conventional kernel methods such as KPCA [1], KSVM (kernel support vector machine) [2] and KFD (kernel Fisher discriminant) [4] in Section IV to show the equivalence between the kernel trick and the nonlinear projection trick. In addition, a kernel version of PCA-L1 (KPCA-L1), which could not have been derived from the conventional kernel trick, is derived by the nonlinear projection trick in this section. Finally, conclusions follow in Section V.

II. THE STRUCTURE OF A KERNEL SPACE

In this section, the structure, such as the bases and the corresponding coordinates, of a kernel space with a finite number of training samples is studied. Before going on, in the following subsection, we briefly define the notations and variables that will be used in this paper.
A. Notations

Fig. 1. Projections in the feature space.

- X ≜ [x_1, ..., x_n] ∈ R^{d×n}: the training data matrix, where d is the dimension of the input space and n is the number of training samples.
- Φ(X) ≜ [φ(x_1), ..., φ(x_n)] ∈ R^{f×n}: the mapped training data in the f-dimensional feature space after the mapping φ(·). We consider φ(x_i) as a vector with respect to a countable orthonormal basis of the respective RKHS (reproducing kernel Hilbert space). Φ(X) is assumed to have zero mean, i.e., Σ_{i=1}^n φ(x_i) = 0. Note that f can be infinite.
- k(x, y) ≜ ⟨φ(x), φ(y)⟩ = φ(x)^T φ(y) ∈ R: a kernel function of any two inputs x, y ∈ R^d.
- K ≜ Φ(X)^T Φ(X) = [k(x_i, x_j)] ∈ R^{n×n}: the kernel matrix of the training data. The rank of K is r; it holds that r ≤ n − 1 because Φ(X) is assumed to be centered.
- k(x) ≜ Φ(X)^T φ(x) ∈ R^n: the kernel vector of any x ∈ R^d.
- P: the r-dimensional subspace of the feature space formed by the mapped training samples Φ(X).
- φ_P(x): the projection of φ(x) onto P. If φ(x) lies in P (e.g., x is one of the training samples), φ_P(x) = φ(x).
- φ_w(x): the projection of φ(x) onto the one-dimensional vector space formed by a vector w ∈ R^f. In most cases, the vector w is restricted to reside in the subspace P, i.e., w = Φ(X)α for some α ∈ R^n.

The relationship among φ(x), φ_P(x) and φ_w(x) is illustrated in Fig. 1. Note that if w resides in P, as in the figure, then w^T φ(x) = w^T φ_P(x).

B. The bases and the coordinates of a kernel space

It is well known that the effective dimensionality² of the feature space defined by the kernel function k(·,·), given a finite number of training samples X, is finite [1]. In this subsection, the bases of the subspace spanned by the mapped training data Φ(X), as well as the coordinate of the mapping φ(x) of an unseen data point x ∈ R^d, are specified.

²The term intrinsic dimensionality is defined and widely used in the machine learning society [14]. Because projection methods based on eigenvalue decomposition tend to overestimate the intrinsic dimensionality [15], the effective dimensionality can be interpreted as an upper bound on the intrinsic dimensionality of a kernel space.

Lemma 1 (Bases of P): Let r be the rank of K and let K = UΛU^T be an eigenvalue decomposition of K composed of only the nonzero eigenvalues of K, where U = [u_1, ..., u_r] ∈ R^{n×r}, whose columns are orthonormal, and Λ = diag(λ_1, ..., λ_r) ∈ R^{r×r}. Then, the columns of Π ≜ Φ(X)UΛ^{-1/2} = [π_1, ..., π_r] ∈ R^{f×r} constitute orthonormal bases of P.

Proof: Because K is the outer product of Φ(X) with itself, it is obvious that r, the rank of K, is the dimension of the subspace spanned by Φ(X). By utilizing the orthonormality of U (U^T U = I_r) and the definition of K (K = Φ(X)^T Φ(X)), it is easy to check that Π^T Π = I_r, so Π constitutes orthonormal bases of P.

From the above lemma, π_i = Φ(X) u_i λ_i^{-1/2} (1 ≤ i ≤ r) becomes the i-th basis vector of P, where u_i and λ_i are the i-th unit eigenvector and eigenvalue of K respectively.

Theorem 2 (Coordinate of φ_P(x)): For any x ∈ R^d, the projection φ_P(x) of φ(x) onto P is obtained by φ_P(x) = Πy, where y = Λ^{-1/2} U^T k(x), which can be regarded as the coordinate of φ_P(x) in the Cartesian coordinate system whose orthonormal bases are Π.

Proof: Because φ_P(x) is in P, φ_P(x) = Πy for some y ∈ R^r. If φ_P(x) is the projection of φ(x) onto P whose bases are Π, it should minimize ‖φ(x) − φ_P(x)‖². Therefore, the optimal y should satisfy the following:

y = argmin_y ‖φ(x) − Πy‖₂²
  = argmin_y ‖φ(x)‖₂² − 2φ(x)^T Πy + y^T Π^T Π y
  = argmin_y y^T y − 2φ(x)^T Πy.

The solution of this quadratic problem is y = Π^T φ(x) = Λ^{-1/2} U^T k(x).

Note that φ_P(x) can be regarded as a nonlinear mapping of x onto the r-dimensional subspace spanned by Π. From the above theorem, this mapping can further be thought of as a direct mapping of x to y = Λ^{-1/2} U^T k(x).

Corollary 3 (Coordinates of the mapped training data): The coordinates of Φ(X) are Y = Λ^{1/2} U^T.

Proof: It is easy to show that the training data X are mapped to Y = Λ^{-1/2} U^T K = Λ^{-1/2} U^T UΛU^T = Λ^{1/2} U^T. Note that K = UΛU^T = Y^T Y.³
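Lemma 1 and Corollary 3 are easy to check numerically. The sketch below (illustrative Python/NumPy; the RBF kernel, the random toy data, and the eigenvalue threshold are assumptions for the demo, and the kernel matrix is centered in advance as detailed in Section III-A) builds Y = Λ^{1/2} U^T and verifies that the Gram matrix of the coordinates reproduces the centered kernel matrix, K = Y^T Y:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: n = 20 samples in d = 3 dimensions (assumption for the demo).
X = rng.standard_normal((3, 20))
n = X.shape[1]

def kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel, used here as an example kernel function."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

# Uncentered kernel matrix, then double-centering (anticipating Section III-A).
Kbar = np.array([[kernel(X[:, i], X[:, j]) for j in range(n)] for i in range(n)])
C = np.eye(n) - np.ones((n, n)) / n          # centering matrix I_n - E_n
K = C @ Kbar @ C

# Eigenvalue decomposition keeping only the nonzero eigenvalues (Lemma 1).
lam, U = np.linalg.eigh(K)
keep = lam > 1e-10
lam, U = lam[keep], U[:, keep]

# Coordinates of the mapped training data (Corollary 3): Y = Lambda^{1/2} U^T.
Y = np.diag(np.sqrt(lam)) @ U.T

# The coordinates reproduce the centered kernel matrix: K = Y^T Y.
assert np.allclose(Y.T @ Y, K)
# Centering makes K singular, so the effective dimension r is at most n - 1.
assert Y.shape[0] <= n - 1
```

The last assertion illustrates the remark in the notation list: because Φ(X) is centered, the rank r of K is at most n − 1.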
Lemma 4 (Mean of the mapped training data): The mapped training data Y = Λ^{1/2} U^T are centered, i.e., Σ_{i=1}^n y_i = 0.

Proof: From Theorem 2, y_i = Λ^{-1/2} U^T k(x_i), and thus Σ_{i=1}^n y_i = Λ^{-1/2} U^T Σ_{i=1}^n k(x_i) = Λ^{-1/2} U^T Φ(X)^T Σ_{i=1}^n φ(x_i) = 0.

Lemma 5 (Coordinate of the residual): For any x ∈ R^d, let X′ ≜ [X, x] ∈ R^{d×(n+1)} and Φ(X′) ≜ [Φ(X), φ(x)] ∈ R^{f×(n+1)}. Let δφ_P(x) ≜ φ(x) − φ_P(x) be the residual vector of φ(x) with respect to P. If δφ_P(x) ≠ 0, Φ(X′) lies in an (r+1)-dimensional subspace containing P. Furthermore, the coordinates of the mapped training data Φ(X) become [Y^T, 0]^T ∈ R^{(r+1)×n}, while the coordinate of φ(x) becomes [y^T, y_{r+1}]^T ∈ R^{r+1}, where y_{r+1} = sqrt(k(x, x) − y^T y).

Proof: It is obvious that δφ_P(x) is orthogonal to P because φ_P(x) is the projection of φ(x) onto P. Furthermore, if the residual vector δφ_P(x) ≠ 0, δφ_P(x) adds one dimension to φ(x); thus Φ(X′) (= [Φ(X), φ(x)]) lies in an (r+1)-dimensional subspace P′ of the f-dimensional feature space which contains P. If we retain the bases Π (= [π_1, ..., π_r]) of P as the first r bases of P′, then the new bases Π′ of P′ become Π′ = [Π, π_{r+1}], where π_{r+1} = δφ_P(x)/‖δφ_P(x)‖₂. In the coordinate system of Π′, the first r coordinates of the mapped training data Φ(X) are unchanged, with the (r+1)-st coordinate being identically 0. Likewise, the first r coordinates of φ(x) are identical to the coordinate of φ_P(x), i.e., y, while the (r+1)-th coordinate becomes the length (norm) of the residual vector δφ_P(x). Finally, because φ(x) = φ_P(x) + δφ_P(x) and φ_P(x) ⊥ δφ_P(x), it becomes ‖δφ_P(x)‖₂² = ‖φ(x)‖₂² − ‖φ_P(x)‖₂² = k(x, x) − y^T y by the Pythagorean theorem.

³Note that K can be directly decomposed into K = Y′^T Y′ by Cholesky decomposition. In this case, Y′ is a unique upper triangular matrix. The singular value decomposition of Y′ is Y′ = V Λ^{1/2} U^T = V Y for some unitary matrix V. Therefore, Y′ can be interpreted as a rotation of Y.

Lemma 6 (Coordinate of a vector w): If a vector w in P can be written in the form w = Φ(X)α, then it can also be written as w = Πβ, where β = Y α.

Proof: Because Π is an orthonormal basis of P, any vector w in P can be written as w = ΠΠ^T w. Therefore, w = ΠΛ^{-1/2} U^T Φ(X)^T Φ(X)α = ΠΛ^{-1/2} U^T Kα = ΠY α = Πβ. Note that β = Y α is the coordinate of w in P.

Corollary 7 (Coordinate of φ_w(x)): The projection φ_w(x) of φ(x) onto w can be obtained by φ_w(x) = Πγ, where γ = (ββ^T / β^T β) y.

Proof: Let w̄ = w/‖w‖₂ be a unit vector. Then

φ_w(x) = (w̄^T φ(x)) w̄ = (w̄^T (φ_P(x) + δφ_P(x))) w̄
       = (w̄^T φ_P(x)) w̄ = (w̄^T ΠΛ^{-1/2} U^T k(x)) w̄
       = (1/β^T β) Π β β^T Π^T Π Λ^{-1/2} U^T k(x)
       = Π (ββ^T / β^T β) Λ^{-1/2} U^T k(x) = Π (ββ^T / β^T β) y = Πγ.

In the second equality, δφ_P(x) is orthogonal to P, and thus also orthogonal to w̄, which lies in P. Note that if ‖w‖₂ = 1, it becomes ‖β‖₂ = 1 and γ = ββ^T y.

III. THE PROPOSED METHOD

In this section, by using the results in the previous section, we give a new interpretation of kernel methods: any kernel method is equivalent to applying the corresponding conventional method, applicable in the input space, to the data with the new coordinates. Before doing that, we deal with the problem of centerization in the feature space.
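Before turning to centering, Lemma 5's residual coordinate can be checked numerically. The sketch below (illustrative NumPy; the RBF kernel and random data are assumptions, and the centering formulas stated in Section III-A are applied so that the zero-mean assumption holds) maps a point to [y; y_{r+1}] and confirms that a training sample, which already lies in P, has a vanishing residual coordinate:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 15))           # training data: d = 3, n = 15 (demo assumption)
n = X.shape[1]

def rbf(a, b, sigma=1.0):
    """Example RBF kernel (assumption for the demo)."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

Kbar = np.array([[rbf(X[:, i], X[:, j]) for j in range(n)] for i in range(n)])
C = np.eye(n) - np.ones((n, n)) / n        # centering matrix I_n - E_n
K = C @ Kbar @ C

lam, U = np.linalg.eigh(K)
keep = lam > 1e-10
lam, U = lam[keep], U[:, keep]

def coords(x):
    """Coordinate y of phi_P(x) and residual coordinate y_{r+1} of Lemma 5."""
    kap = np.array([rbf(X[:, i], x) for i in range(n)])   # uncentered kernel vector
    k = C @ (kap - Kbar.mean(axis=1))                     # centered kernel vector
    y = (U.T @ k) / np.sqrt(lam)                          # Theorem 2: y = Lam^{-1/2} U^T k(x)
    # centered k(x, x) = kappa(x, x) - 2 * mean(kappa(x)) + mean of all entries of Kbar
    kxx = rbf(x, x) - 2 * kap.mean() + Kbar.mean()
    y_next = np.sqrt(max(kxx - y @ y, 0.0))               # residual length (Lemma 5)
    return y, y_next

# A training sample lies in P, so its residual coordinate is (numerically) zero.
_, res_train = coords(X[:, 0])
assert res_train < 1e-4

# A generic new point has a well-defined, nonnegative residual coordinate.
_, res_test = coords(rng.standard_normal(3))
assert res_test >= 0.0
```

The `max(..., 0.0)` guard only absorbs floating-point cancellation; by the Pythagorean argument of Lemma 5, k(x, x) − y^T y is nonnegative in exact arithmetic.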
A. Centerization

In the previous section, the training data Φ(X) were assumed to be centered in the feature space, i.e., Σ_{i=1}^n φ(x_i) = 0. If this condition is not satisfied, P, which is made up of {p : p = Σ_{i=1}^n α_i φ(x_i), α_i ∈ R}, is not a subspace but a manifold in the feature space, i.e., 0 ∉ P. Therefore, centerization is a necessary step in practical applications of the proposed idea to kernel methods.

Let Ψ(X) ≜ [ψ(x_1), ..., ψ(x_n)] ∈ R^{f×n} be the uncentered data in the feature space and ψ̄ ≜ (1/n) Σ_{i=1}^n ψ(x_i) = (1/n) Ψ(X) 1_n ∈ R^f be the mean of Ψ(X). Here, 1_n ≜ [1, ..., 1]^T ∈ R^n is a vector whose components are identically 1. By subtracting the mean ψ̄ from the uncentered data Ψ(X), the centered data Φ(X) can be obtained as

Φ(X) = Ψ(X) − ψ̄ 1_n^T = Ψ(X)(I_n − (1/n) 1_n 1_n^T) = Ψ(X)(I_n − E_n),    (5)

where I_n is the n×n identity matrix and E_n ≜ (1/n) 1_n 1_n^T is an n×n matrix whose components are identically 1/n.

Consider an uncentered kernel function κ(a, b) ≜ ψ(a)^T ψ(b). Given this kernel function and the training data X, the uncentered kernel matrix K̄ is defined as K̄ ≜ [κ(x_i, x_j)] ∈ R^{n×n}. Writing K̄ in matrix form, it becomes K̄ = Ψ(X)^T Ψ(X). Then, from (5), the centered positive semidefinite kernel matrix K can be obtained as

K = Φ(X)^T Φ(X) = (I_n − E_n) K̄ (I_n − E_n).    (6)

Once a test datum x ∈ R^d is presented, an uncentered kernel vector κ(x) ≜ [κ(x_1, x), ..., κ(x_n, x)]^T ∈ R^n can be computed. Because κ(x_i, x) = ψ(x_i)^T ψ(x), if we subtract the mean ψ̄ from both ψ(x_i) and ψ(x), we can get the centered kernel vector k(x) from the uncentered kernel vector κ(x) as follows:

k(x) = Φ(X)^T φ(x) = [Ψ(X)(I_n − E_n)]^T (ψ(x) − ψ̄)
     = (I_n − E_n) Ψ(X)^T (ψ(x) − (1/n) Ψ(X) 1_n)
     = (I_n − E_n) [κ(x) − (1/n) K̄ 1_n].    (7)

B. Kernel methods: nonlinear projection trick

Now, a new framework of kernel methods, referred to as the nonlinear projection trick, is presented. Below, a general kernel method is denoted as KM, while its corresponding method in the input space is denoted as M.
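Putting the centering formulas (5)–(7) together with Theorem 2 and Corollary 3, the whole training/test pipeline can be collected into a short sketch (illustrative NumPy; the RBF kernel, the random toy data, and the eigenvalue threshold are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf_matrix(A, B, sigma=1.0):
    """Uncentered RBF kernel matrix between the columns of A and B (kernel choice is an assumption)."""
    d2 = (A * A).sum(0)[:, None] + (B * B).sum(0)[None, :] - 2 * A.T @ B
    return np.exp(-d2 / (2 * sigma ** 2))

def npt_train(X):
    """Training phase: centered-kernel eigensystem and coordinates Y = Lambda^{1/2} U^T."""
    n = X.shape[1]
    Kbar = rbf_matrix(X, X)                      # uncentered kernel matrix
    C = np.eye(n) - np.ones((n, n)) / n          # I_n - E_n
    K = C @ Kbar @ C                             # centering, eq. (6)
    lam, U = np.linalg.eigh(K)
    keep = lam > 1e-10                           # nonzero eigenvalues only
    lam, U = lam[keep], U[:, keep]
    Y = np.sqrt(lam)[:, None] * U.T              # Corollary 3
    return Kbar, C, lam, U, Y

def npt_test(X, Kbar, C, lam, U, x):
    """Test phase: coordinate y of a new point x (Theorem 2 with centering, eq. (7))."""
    kap = rbf_matrix(X, x[:, None]).ravel()      # uncentered kernel vector
    k = C @ (kap - Kbar.mean(axis=1))            # centered kernel vector
    return (U.T @ k) / np.sqrt(lam)              # y = Lambda^{-1/2} U^T k(x)

X = rng.standard_normal((4, 25))
Kbar, C, lam, U, Y = npt_train(X)

# Sanity checks: Y reproduces the centered K, and a training sample maps to its own column of Y.
K = C @ Kbar @ C
assert np.allclose(Y.T @ Y, K)
assert np.allclose(npt_test(X, Kbar, C, lam, U, X[:, 0]), Y[:, 0])
```

After this, any input-space method M (PCA, LDA, a linear SVM, ...) can be applied directly to `Y` in training and to the output of `npt_test` at test time, which is exactly the algorithm listed next.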
As inputs to KM, the training and test data X and x respectively, a specific (uncentered) kernel function κ(·,·) and a method M are given. Here, M can be any feature extraction method such as PCA, LDA and ICA, as well as any classifier/regressor such as SVM/SVR.

Algorithm: Kernel method KM (Input: X, x, κ(·,·), method M)

Training phase
1) Compute the uncentered kernel matrix K̄ such that K̄_ij = κ(x_i, x_j).
2) Compute the centered kernel matrix K by K = (I_n − E_n) K̄ (I_n − E_n).
3) Obtain the eigenvalue decomposition of K such that K = UΛU^T, where Λ is composed of only the nonzero eigenvalues of K and the columns of U are the corresponding unit eigenvectors of K.
4) Compute Y, the coordinates of Φ(X), by Y = Λ^{1/2} U^T.
5) Apply the method M to Y; this is equivalent to applying the kernel method KM to X, i.e., M(Y) ≡ KM(X).

Test phase
1) Compute the uncentered kernel vector κ(x).
2) Compute the centered kernel vector k(x) by k(x) = (I_n − E_n)[κ(x) − (1/n) K̄ 1_n].
3) Obtain y, the coordinate of φ(x), by y = Λ^{-1/2} U^T k(x).
4) Apply the method M to y; this is equivalent to applying the kernel method KM to x, i.e., M(y) ≡ KM(x).

IV. APPLICATIONS

In this section, the nonlinear projection trick is applied to several conventional kernel methods such as KPCA, KFD, kernel SVM and so on. In doing so, we show that different kernel methods can be interpreted in the unified framework of applying conventional linear methods to the reduced dimensional feature space. As a new contribution, we extend PCA-L1 [13] to a kernel version.

A. Kernel PCA (KPCA)

1) Conventional derivation: In the conventional KPCA, the objective is to find a projection vector w in the feature space that minimizes the sum of squared errors or, equivalently, maximizes the variance along the projection direction. The problem becomes

w = argmax_w ‖w^T Φ(X)‖₂² = argmax_w w^T Φ(X)Φ(X)^T w such that ‖w‖₂ = 1.    (8)

The solution of (8) can be found by the eigenvalue decomposition of S_f ≜ Φ(X)Φ(X)^T, the scatter matrix in the feature space, i.e.,

S_f w_i = λ_i w_i, λ_1 ≥ λ_2 ≥ ... ≥ λ_m.    (9)

Using the fact that w lies in P, w can be written as w = Φ(X)α, and (9) becomes

Φ(X)Kα_i = λ_i Φ(X)α_i, λ_1 ≥ λ_2 ≥ ... ≥ λ_m.    (10)

Left-multiplying both sides by Φ(X)^T, the problem becomes finding {α_i}_{i=1}^m that satisfy K²α_i = λ_i Kα_i, which further reduces to finding {α_i}_{i=1}^m such that

Kα_i = λ_i α_i, λ_1 ≥ λ_2 ≥ ... ≥ λ_m,    (11)
with the constraint α_i^T K α_i = 1. The solutions {α_i}_{i=1}^m are the eigenvectors of K corresponding to the m largest eigenvalues. Now the i-th solution that satisfies α_i^T K α_i = 1 becomes

α_i = u_i λ_i^{-1/2}.    (12)

Let W ≜ [w_1, ..., w_m] ∈ R^{f×m}; then the projection of φ(x) onto W gives an m-dimensional vector z = W^T φ(x). Because w_i = Φ(X)α_i, it further reduces to z = A^T k(x), where A ≜ [α_1, ..., α_m] ∈ R^{n×m}. From (12), it becomes

A = [u_1 λ_1^{-1/2}, ..., u_m λ_m^{-1/2}] = U_m Λ_m^{-1/2},    (13)

where U_m ≜ [u_1, ..., u_m] and Λ_m ≜ diag(λ_1, ..., λ_m), and finally the nonlinear projection of x becomes

z = Λ_m^{-1/2} U_m^T k(x).    (14)

2) New derivation: In the new interpretation of kernel methods presented in the previous section, the coordinates of the mapped training data Φ(X) are first computed as Y = Λ^{1/2} U^T ∈ R^{r×n} from Corollary 3. Then, the problem becomes finding the maximum variance direction in R^r. It is easy to show that the scatter matrix of Y becomes Y Y^T = Λ because the mapped training data Y are centered (Lemma 4), so e_i becomes the i-th eigenvector of the scatter matrix and therefore the i-th principal vector. Here, e_i ∈ R^r is a column vector whose elements are identically zero except the i-th element, which is one. Once a new datum x is presented, by Theorem 2, it is projected to y = Λ^{-1/2} U^T k(x), and the i-th projection of y becomes

z_i = e_i^T y = e_i^T Λ^{-1/2} U^T k(x) = λ_i^{-1/2} u_i^T k(x).    (15)

Now z ≜ [z_1, ..., z_m]^T, the m-dimensional nonlinear projection of x, becomes

z = Λ_m^{-1/2} U_m^T k(x).    (16)

Note that (16) is identical to (14), so the new interpretation of KPCA is equivalent to the conventional derivation. However, by the new interpretation, we can obtain the exact coordinates of Φ(X) and the corresponding principal vectors, while the conventional derivation cannot find those coordinates.

B. Kernel SVM (KSVM)

1) Conventional derivation: Given some training data {(x_i, c_i)}_{i=1}^n, where x_i ∈ R^d and c_i ∈ {−1, 1}, the kernel SVM tries to solve the optimization problem

(w*, b*) = argmin_{(w,b)} (1/2)‖w‖₂²
subject to c_i (w^T φ(x_i) + b) ≥ 1, i = 1, ..., n.    (17)

By introducing Lagrange multipliers {α_i}_{i=1}^n, the above constrained problem can be expressed in the dual form:

α* = argmax_α Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j c_i c_j k(x_i, x_j)
subject to Σ_{i=1}^n α_i c_i = 0 and α_i ≥ 0, i = 1, ..., n.    (18)

Once we solve the above dual problem, w* can be computed from the optimal α* as w* = Σ_{i=1}^n α_i* c_i φ(x_i), and b* can be computed so that it meets the KKT condition α_i*[c_i (w*^T φ(x_i) + b*) − 1] = 0 for all i = 1, ..., n. Once a test sample x is presented, it can be classified as sgn(w*^T φ(x) + b*). Note that w*^T φ(x) = Σ_{i=1}^n α_i* c_i k(x_i, x), so we do not have to find an explicit expression for w* in either the training or the test phase.

2) New derivation: In the new derivation, given the training data {(x_i, c_i)}_{i=1}^n, X = [x_1, ..., x_n] is first mapped to Y = Λ^{1/2} U^T = [y_1, ..., y_n], and then a linear SVM is solved for {(y_i, c_i)}_{i=1}^n. The primal problem becomes

(v*, d*) = argmin_{(v,d)} (1/2)‖v‖₂²
subject to c_i (v^T y_i + d) ≥ 1, i = 1, ..., n,    (19)

and the corresponding dual problem is

β* = argmax_β Σ_{i=1}^n β_i − (1/2) Σ_{i,j=1}^n β_i β_j c_i c_j y_i^T y_j
subject to Σ_{i=1}^n β_i c_i = 0 and β_i ≥ 0, i = 1, ..., n.    (20)

Note that because k(x_i, x_j) = y_i^T y_j, (18) and (20) are exactly the same problem, and it follows that β* = α* and d* = b*. In this case, v* can be explicitly found as v* = Σ_{i=1}^n β_i* c_i y_i. Once a test sample x is presented, it is first mapped to y = Λ^{-1/2} U^T k(x) and then classified as sgn(v*^T y + d*).

C. Kernel Fisher discriminant (KFD)

1) Conventional derivation: In the conventional KFD, which is a kernel extension of LDA (linear discriminant analysis), the objective is to find a projection vector w in the feature space such that the ratio of the between-class variance to the within-class variance after the projection is maximized. Given C-class training data {(x_i, c_i)}_{i=1}^n, where x_i ∈ R^d and c_i ∈ {1, ..., C}, the problem becomes

w = argmax_w (w^T S_B^Φ w) / (w^T S_W^Φ w).    (21)

Here,

S_B^Φ = Σ_{i=1}^n (φ(x_i) − φ̄)(φ(x_i) − φ̄)^T ∈ R^{f×f}    (22)

and

S_W^Φ = Σ_{i=1}^n (φ(x_i) − φ̄_{c_i})(φ(x_i) − φ̄_{c_i})^T ∈ R^{f×f}    (23)

are the between-class scatter and within-class scatter matrices in the feature space respectively, where φ̄ = (1/n) Σ_{i=1}^n φ(x_i) and φ̄_c = (1/n_c) Σ_{j : c_j = c} φ(x_j) are the total mean and the class mean in the feature space respectively. The term n_c denotes the number of training samples whose class identity is c.
As in the previous section, without loss of generality, we assume that Φ(X) is centerized, i.e., φ̄ = 0, and the between-class scatter matrix becomes S_B^Φ = Φ(X)Φ(X)^T. Note that S_B^Φ in (22) can also be written as S_B^Φ = Σ_{c=1}^C n_c (φ̄_c − φ̄)(φ̄_c − φ̄)^T. Restricting w to lie in P, w can be written as w = Φ(X)α, and it becomes w^T φ(x_i) = α^T k(x_i). Using this, (21) becomes

α = argmax_α (α^T S_B^K α) / (α^T S_W^K α),    (24)

where

S_B^K = K² ∈ R^{n×n}    (25)

and

S_W^K = Σ_{i=1}^n (k(x_i) − k̄_{c_i})(k(x_i) − k̄_{c_i})^T ∈ R^{n×n}.    (26)

Here, k̄_c ≜ (1/n_c) Σ_{j : c_j = c} k(x_j). The solution of (24) can be obtained by solving the following generalized eigenvalue decomposition problem:

S_B^K α_i = λ_i S_W^K α_i, λ_1 ≥ λ_2 ≥ ... ≥ λ_m.    (27)

Once {α_i}_{i=1}^m are obtained, for a given sample x, an m-dimensional nonlinear feature vector z ∈ R^m is obtained by the dot product of W ≜ [w_1, ..., w_m] = Φ(X)[α_1, ..., α_m] and φ(x), i.e., z = W^T φ(x) = A^T k(x), where A ≜ [α_1, ..., α_m] ∈ R^{n×m}.

2) New derivation: In the new derivation of KFD, for given training data X, a (centerized) kernel matrix K is computed, and then its eigenvalue decomposition K = UΛU^T is utilized to get the nonlinear mapping Y = Λ^{1/2} U^T. Then, the within-class and between-class scatter matrices of Y are computed as

S_W^Y = Σ_{i=1}^n (y_i − ȳ_{c_i})(y_i − ȳ_{c_i})^T ∈ R^{r×r}    (28)

and

S_B^Y = Σ_{c=1}^C n_c (ȳ_c − ȳ)(ȳ_c − ȳ)^T = Σ_{i=1}^n y_i y_i^T = Y Y^T ∈ R^{r×r}    (29)

respectively, where ȳ and ȳ_c are the total mean and the class mean of Y. Recall that ȳ = 0 from Lemma 4. Applying LDA directly to Y, the objective becomes finding β ∈ R^r such that

β = argmax_β (β^T S_B^Y β) / (β^T S_W^Y β).    (30)

Noting that y_i = Λ^{-1/2} U^T k(x_i) (from Theorem 2), the numerator becomes

β^T S_B^Y β = β^T Λ^{-1/2} U^T K K U Λ^{-1/2} β = α^T K² α,    (31)

where α = U Λ^{-1/2} β. Likewise, the denominator becomes

β^T S_W^Y β = α^T ( Σ_{i=1}^n (k(x_i) − k̄_{c_i})(k(x_i) − k̄_{c_i})^T ) α.    (32)

Substituting (31) and (32) into (30), we can see that the problem becomes identical to (24). Furthermore, because α = U Λ^{-1/2} β, if we left-multiply both sides by Λ^{1/2} U^T, β = Λ^{1/2} U^T α = Y α holds, which was anticipated in Lemma 6.

After obtaining {β_i}_{i=1}^m, z, the nonlinear projection of x, can be obtained by the dot product of B ≜ [β_1, ..., β_m] and y = Λ^{-1/2} U^T k(x). Then z = B^T y = B^T Λ^{-1/2} U^T k(x) = A^T k(x).
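The algebraic identities (31) and (32) behind this equivalence are easy to verify numerically. The sketch below (illustrative NumPy; the randomly generated centered PSD kernel matrix and the random two-class labels are assumptions for the demo) checks that, with α = U Λ^{-1/2} β, the numerator and denominator of the LDA Rayleigh quotient on Y coincide with those of the KFD quotient on K:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 12

# A centered (hence singular) PSD matrix playing the role of K (demo assumption).
A0 = rng.standard_normal((n, n))
C = np.eye(n) - np.ones((n, n)) / n
K = C @ (A0 @ A0.T) @ C

lam, U = np.linalg.eigh(K)
keep = lam > 1e-10
lam, U = lam[keep], U[:, keep]
r = lam.size
Y = np.sqrt(lam)[:, None] * U.T                  # coordinates of the mapped data

labels = rng.integers(0, 2, size=n)              # two random classes (demo assumption)

def within_scatter(M):
    """Within-class scatter of the columns of M, as in eqs. (26) and (28)."""
    return sum(np.outer(M[:, i] - M[:, labels == labels[i]].mean(1),
                        M[:, i] - M[:, labels == labels[i]].mean(1))
               for i in range(n))

SB_Y, SW_Y = Y @ Y.T, within_scatter(Y)          # scatter matrices on Y, eqs. (28)-(29)
SB_K, SW_K = K @ K, within_scatter(K)            # scatter matrices on K, eqs. (25)-(26)

# For any beta, alpha = U Lambda^{-1/2} beta makes the quotient terms (31)-(32) agree.
beta = rng.standard_normal(r)
alpha = U @ (beta / np.sqrt(lam))
assert np.isclose(beta @ SB_Y @ beta, alpha @ SB_K @ alpha)
assert np.isclose(beta @ SW_Y @ beta, alpha @ SW_K @ alpha)
```

Since the two Rayleigh quotients agree term by term for every β, the LDA problem on Y and the KFD problem on K share the same optima, which is the content of the derivation above.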
D. Kernel PCA-L1 (KPCA-L1)

Until now, we have presented examples of applying the new kernel interpretation to existing kernel methods such as KPCA, KSVM and KFD. In this subsection, we show that this interpretation of kernel methods is applicable to algorithms which do not rely on the dot product, and thus to which the kernel trick is not applicable.

In [13], PCA-L1, a robust version of PCA which maximizes the L1-norm dispersion of the projection, was introduced. Given training data X, the objective of PCA-L1 is to find projection vectors w that solve the following optimization problem:

w = argmax_w ‖w^T X‖₁ = argmax_w Σ_{i=1}^n |w^T x_i| subject to ‖w‖₂ = 1.    (33)

For reference, the PCA-L1 algorithm is presented below.

Algorithm: PCA-L1 (Input: X)
1) Initialization: Pick any w(0). Set w(0) ← w(0)/‖w(0)‖₂ and t = 0.
2) Polarity check: For all i ∈ {1, ..., n}, if w^T(t) x_i < 0, set p_i(t) = −1; otherwise, set p_i(t) = 1.
3) Flipping and maximization: Set t ← t + 1 and w(t) = Σ_{i=1}^n p_i(t−1) x_i. Set w(t) ← w(t)/‖w(t)‖₂.
4) Convergence check:
   a. If w(t) ≠ w(t−1), go to Step 2.
   b. Else if there exists i such that w^T(t) x_i = 0, set w(t) ← (w(t) + Δw)/‖w(t) + Δw‖₂ and go to Step 2. Here, Δw is a small nonzero random vector.
   c. Otherwise, set w* = w(t) and stop.

This problem formulation can easily be extended to a kernel version by replacing X with Φ(X), and the problem becomes

w = argmax_w ‖w^T Φ(X)‖₁ = argmax_w Σ_{i=1}^n |w^T φ(x_i)| subject to ‖w‖₂ = 1.    (34)

Note that the kernel trick is not applicable to the above problem because the problem is not formulated in terms of the L2-norm. By Lemma 1, w can be expressed using the bases Π as w = Πβ, and from Theorem 2, w^T Φ(X) can be expressed as β^T Y. Furthermore, the constraint ‖w‖₂ = 1 becomes ‖β‖₂ = 1 because the columns of Π are orthonormal. Therefore, (34) becomes

β = argmax_β ‖β^T Y‖₁ subject to ‖β‖₂ = 1.    (35)

Because the form of (35) is exactly the same as (33), the conventional PCA-L1 algorithm can be applied directly to Y to obtain the KPCA-L1 algorithm. Note that this is a simple example of making a kernel version of an algorithm by using the nonlinear projection trick.
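A minimal sketch of the resulting KPCA-L1 is shown below (illustrative NumPy; the RBF kernel, the random toy data, the deterministic initialization, and the simplified convergence handling are assumptions for the demo, and Step 4b's random perturbation is replaced by a plain iteration cap for brevity). It runs the PCA-L1 iteration of [13] directly on the coordinates Y:

```python
import numpy as np

rng = np.random.default_rng(4)

def pca_l1_direction(Z, max_iter=100):
    """PCA-L1 fixed-point iteration of [13] applied to the columns of Z.

    Step 4b (random perturbation when some projection is exactly zero) is
    omitted here for brevity; an iteration cap is used instead."""
    w = Z[:, 0] / np.linalg.norm(Z[:, 0])          # Step 1: initialize with any unit vector
    for _ in range(max_iter):
        p = np.where(w @ Z < 0, -1.0, 1.0)         # Step 2: polarity check
        w_new = Z @ p                              # Step 3: flipping and maximization
        w_new /= np.linalg.norm(w_new)
        if np.allclose(w_new, w):                  # Step 4a/4c: convergence
            break
        w = w_new
    return w_new

# Build the NPT coordinates Y from an RBF kernel on toy data (Section III).
X = rng.standard_normal((3, 30))
n = X.shape[1]
d2 = (X * X).sum(0)[:, None] + (X * X).sum(0)[None, :] - 2 * X.T @ X
C = np.eye(n) - np.ones((n, n)) / n
K = C @ np.exp(-d2 / 2.0) @ C
lam, U = np.linalg.eigh(K)
keep = lam > 1e-10
Y = np.sqrt(lam[keep])[:, None] * U[:, keep].T

# KPCA-L1: run PCA-L1 on Y. The result is a unit vector, and by the monotonicity
# of the iteration the L1 objective never decreases from its starting value.
beta = pca_l1_direction(Y)
w0 = Y[:, 0] / np.linalg.norm(Y[:, 0])
assert np.isclose(np.linalg.norm(beta), 1.0)
assert np.abs(beta @ Y).sum() >= np.abs(w0 @ Y).sum() - 1e-9
```

The second assertion reflects the convergence property proved in [13]: each flipping-and-maximization step is non-decreasing in the objective Σ_i |β^T y_i|, so the returned direction is at least as good as the initializer.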
The point is that this trick can be applied to an arbitrary method, as shown in Section III.

V. CONCLUSIONS

This paper proposed a kernel method that directly maps the input space to the feature space by deriving the exact coordinates of the mapped input data. In conventional kernel methods, the kernel trick is used to avoid direct nonlinear mapping of input data onto a high (possibly infinite) dimensional feature space. Instead, the dot products of the mapped data are replaced by kernel functions in the problems to be solved, making the problems solvable without ever having to explicitly map the original input data into a kernel space. However, the applicability of the kernel trick is restricted to problems where the mapped data appear only in the form of dot products. Therefore, arbitrary methods that do not utilize the dot product (or the L2-norm) could not be extended to kernel methods. In this paper, based on the fact that the effective dimensionality of a kernel space is less than the number of training samples, an explicit mapping of the input data into a reduced kernel space was proposed. This mapping is obtained by the eigenvalue decomposition of the kernel matrix. With the proposed nonlinear projection trick, the applicability of kernel methods is widened to arbitrary methods that can be applied in the input space. In other words, not only the conventional kernel methods based on L2-norm optimization, such as KPCA and kernel SVM, but also methods utilizing other objective functions can be extended to kernel versions. The equivalence between the kernel trick and the nonlinear projection trick was shown for several conventional kernel methods such as KPCA, KSVM and KFD. In addition, we extended PCA-L1, which utilizes the L1-norm, into a kernel version and showed the effectiveness of the proposed approach.

REFERENCES

[1] B. Schölkopf, A.J. Smola, and K.-R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, 1998.
[2] B. Schölkopf and A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, The MIT Press, 2002.
[3] R. Rosipal, M. Girolami, and L.J. Trejo, "Kernel PCA for feature extraction and de-noising in non-linear regression," Neural Computing & Applications, vol. 10, 2001.
[4] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, "Fisher discriminant analysis with kernels," in Proceedings of the 1999 IEEE Signal Processing Society Workshop on Neural Networks for Signal Processing IX, 1999.
[5] J. Yang, A.F. Frangi, J.-Y. Yang, D. Zhang, and Z. Jin, "KPCA plus LDA: A complete kernel Fisher discriminant framework for feature extraction and recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 27, no. 2, Feb. 2005.
[6] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, no. 10, 2000.
[7] M.H. Nguyen and F. De la Torre, "Robust kernel principal component analysis," Tech. Rep. Paper 44, Robotics Institute, 2008.
[8] N. Kwak, "Kernel discriminant analysis for regression problems," Pattern Recognition, vol. 45, no. 5, May 2012.
[9] H. Hoffmann, "Kernel PCA for novelty detection," Pattern Recognition, vol. 40, no. 3, March 2007.
[10] T. Hofmann, B. Schölkopf, and A.J. Smola, "Kernel methods in machine learning," The Annals of Statistics, vol. 36, no. 3, 2008.
[11] S. Mika, B. Schölkopf, A. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, "Kernel PCA and de-noising in feature spaces," in Advances in Neural Information Processing Systems, MIT Press, 1999.
[12] J.T.-Y. Kwok and I.W.-H. Tsang, "The pre-image problem in kernel methods," IEEE Trans. on Neural Networks, vol. 15, no. 6, Nov. 2004.
[13] N. Kwak, "Principal component analysis based on L1-norm maximization," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 30, no. 9, Sep. 2008.
[14] K. Fukunaga, "Intrinsic dimensionality extraction," Handbook of Statistics, vol. 2, 1982.
[15] F. Camastra, "Data dimensionality estimation methods: a survey," Pattern Recognition, vol. 36, 2003.
More informationCHAPTER 3 THE COMMON FACTOR MODEL IN THE POPULATION. From Exploratory Factor Analysis Ledyard R Tucker and Robert C. MacCallum
CHAPTER 3 THE COMMON FACTOR MODEL IN THE POPULATION From Exploratory Factor Analysis Ledyard R Tucker and Robert C. MacCallum 1997 19 CHAPTER 3 THE COMMON FACTOR MODEL IN THE POPULATION 3.0. Introduction
More informationData Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395
Data Mining Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 42 Outline 1 Introduction 2 Feature selection
More informationNovelty Detection. Cate Welch. May 14, 2015
Novelty Detection Cate Welch May 14, 2015 1 Contents 1 Introduction 2 11 The Four Fundamental Subspaces 2 12 The Spectral Theorem 4 1 The Singular Value Decomposition 5 2 The Principal Components Analysis
More informationLecture 10: A brief introduction to Support Vector Machine
Lecture 10: A brief introduction to Support Vector Machine Advanced Applied Multivariate Analysis STAT 2221, Fall 2013 Sungkyu Jung Department of Statistics, University of Pittsburgh Xingye Qiao Department
More informationStefanos Zafeiriou, Anastasios Tefas, and Ioannis Pitas
GENDER DETERMINATION USING A SUPPORT VECTOR MACHINE VARIANT Stefanos Zafeiriou, Anastasios Tefas, and Ioannis Pitas Artificial Intelligence and Information Analysis Lab/Department of Informatics, Aristotle
More informationMachine Learning. Support Vector Machines. Manfred Huber
Machine Learning Support Vector Machines Manfred Huber 2015 1 Support Vector Machines Both logistic regression and linear discriminant analysis learn a linear discriminant function to separate the data
More information1 Feature Vectors and Time Series
PCA, SVD, LSI, and Kernel PCA 1 Feature Vectors and Time Series We now consider a sample x 1,..., x of objects (not necessarily vectors) and a feature map Φ such that for any object x we have that Φ(x)
More informationSparse Support Vector Machines by Kernel Discriminant Analysis
Sparse Support Vector Machines by Kernel Discriminant Analysis Kazuki Iwamura and Shigeo Abe Kobe University - Graduate School of Engineering Kobe, Japan Abstract. We discuss sparse support vector machines
More informationPCA and LDA. Man-Wai MAK
PCA and LDA Man-Wai MAK Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University enmwmak@polyu.edu.hk http://www.eie.polyu.edu.hk/ mwmak References: S.J.D. Prince,Computer
More informationLecture 10: Support Vector Machine and Large Margin Classifier
Lecture 10: Support Vector Machine and Large Margin Classifier Applied Multivariate Analysis Math 570, Fall 2014 Xingye Qiao Department of Mathematical Sciences Binghamton University E-mail: qiao@math.binghamton.edu
More informationPrincipal Component Analysis
CSci 5525: Machine Learning Dec 3, 2008 The Main Idea Given a dataset X = {x 1,..., x N } The Main Idea Given a dataset X = {x 1,..., x N } Find a low-dimensional linear projection The Main Idea Given
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table
More informationApproximate Kernel Methods
Lecture 3 Approximate Kernel Methods Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Machine Learning Summer School Tübingen, 207 Outline Motivating example Ridge regression
More informationA GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong
A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES Wei Chu, S. Sathiya Keerthi, Chong Jin Ong Control Division, Department of Mechanical Engineering, National University of Singapore 0 Kent Ridge Crescent,
More informationMachine Learning. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395
Machine Learning Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1395 1 / 47 Table of contents 1 Introduction
More informationNonlinear Dimensionality Reduction
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Kernel PCA 2 Isomap 3 Locally Linear Embedding 4 Laplacian Eigenmap
More informationLinear and Non-Linear Dimensionality Reduction
Linear and Non-Linear Dimensionality Reduction Alexander Schulz aschulz(at)techfak.uni-bielefeld.de University of Pisa, Pisa 4.5.215 and 7.5.215 Overview Dimensionality Reduction Motivation Linear Projections
More informationSemi-Supervised Learning through Principal Directions Estimation
Semi-Supervised Learning through Principal Directions Estimation Olivier Chapelle, Bernhard Schölkopf, Jason Weston Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany {first.last}@tuebingen.mpg.de
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction
More informationSupport Vector Machines for Classification: A Statistical Portrait
Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,
More informationLecture 5 Supspace Tranformations Eigendecompositions, kernel PCA and CCA
Lecture 5 Supspace Tranformations Eigendecompositions, kernel PCA and CCA Pavel Laskov 1 Blaine Nelson 1 1 Cognitive Systems Group Wilhelm Schickard Institute for Computer Science Universität Tübingen,
More informationSupport Vector Machine (continued)
Support Vector Machine continued) Overlapping class distribution: In practice the class-conditional distributions may overlap, so that the training data points are no longer linearly separable. We need
More informationWhen Fisher meets Fukunaga-Koontz: A New Look at Linear Discriminants
When Fisher meets Fukunaga-Koontz: A New Look at Linear Discriminants Sheng Zhang erence Sim School of Computing, National University of Singapore 3 Science Drive 2, Singapore 7543 {zhangshe, tsim}@comp.nus.edu.sg
More informationSupport Vector Machine (SVM) and Kernel Methods
Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2014 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationMulti-Output Regularized Projection
Multi-Output Regularized Projection Kai Yu Shipeng Yu Volker Tresp Corporate Technology Institute for Computer Science Corporate Technology Siemens AG, Germany University of Munich, Germany Siemens AG,
More informationArtificial Neural Networks. Part 2
Artificial Neural Netorks Part Artificial Neuron Model Folloing simplified model of real neurons is also knon as a Threshold Logic Unit x McCullouch-Pitts neuron (943) x x n n Body of neuron f out Biological
More informationCOMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017
COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University PRINCIPAL COMPONENT ANALYSIS DIMENSIONALITY
More informationLecture 6 Sept Data Visualization STAT 442 / 890, CM 462
Lecture 6 Sept. 25-2006 Data Visualization STAT 442 / 890, CM 462 Lecture: Ali Ghodsi 1 Dual PCA It turns out that the singular value decomposition also allows us to formulate the principle components
More informationADAPTIVE QUASICONFORMAL KERNEL FISHER DISCRIMINANT ANALYSIS VIA WEIGHTED MAXIMUM MARGIN CRITERION. Received November 2011; revised March 2012
International Journal of Innovative Computing, Information and Control ICIC International c 2013 ISSN 1349-4198 Volume 9, Number 1, January 2013 pp 437 450 ADAPTIVE QUASICONFORMAL KERNEL FISHER DISCRIMINANT
More informationUnsupervised dimensionality reduction
Unsupervised dimensionality reduction Guillaume Obozinski Ecole des Ponts - ParisTech SOCN course 2014 Guillaume Obozinski Unsupervised dimensionality reduction 1/30 Outline 1 PCA 2 Kernel PCA 3 Multidimensional
More informationKernel methods, kernel SVM and ridge regression
Kernel methods, kernel SVM and ridge regression Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Collaborative Filtering 2 Collaborative Filtering R: rating matrix; U: user factor;
More informationSupport Vector Machines. Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar
Data Mining Support Vector Machines Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar 02/03/2018 Introduction to Data Mining 1 Support Vector Machines Find a linear hyperplane
More informationKernel Method: Data Analysis with Positive Definite Kernels
Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University
More informationCOS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture # 12 Scribe: Indraneel Mukherjee March 12, 2008
COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture # 12 Scribe: Indraneel Mukherjee March 12, 2008 In the previous lecture, e ere introduced to the SVM algorithm and its basic motivation
More informationPCA, Kernel PCA, ICA
PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per
More informationPreliminaries. Definition: The Euclidean dot product between two vectors is the expression. i=1
90 8 80 7 70 6 60 0 8/7/ Preliminaries Preliminaries Linear models and the perceptron algorithm Chapters, T x + b < 0 T x + b > 0 Definition: The Euclidean dot product beteen to vectors is the expression
More informationReproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto
Reproducing Kernel Hilbert Spaces 9.520 Class 03, 15 February 2006 Andrea Caponnetto About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationSupport Vector Machines
Two SVM tutorials linked in class website (please, read both): High-level presentation with applications (Hearst 1998) Detailed tutorial (Burges 1998) Support Vector Machines Machine Learning 10701/15781
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Support Vector Machine (SVM) Hamid R. Rabiee Hadi Asheri, Jafar Muhammadi, Nima Pourdamghani Spring 2013 http://ce.sharif.edu/courses/91-92/2/ce725-1/ Agenda Introduction
More informationSupport Vector Machines
Support Vector Machines Support vector machines (SVMs) are one of the central concepts in all of machine learning. They are simply a combination of two ideas: linear classification via maximum (or optimal
More informationSupport Vector Machine (SVM) and Kernel Methods
Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationMachine learning for pervasive systems Classification in high-dimensional spaces
Machine learning for pervasive systems Classification in high-dimensional spaces Department of Communications and Networking Aalto University, School of Electrical Engineering stephan.sigg@aalto.fi Version
More informationSupport Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Linear classifier Which classifier? x 2 x 1 2 Linear classifier Margin concept x 2
More informationSVMs, Duality and the Kernel Trick
SVMs, Duality and the Kernel Trick Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 26 th, 2007 2005-2007 Carlos Guestrin 1 SVMs reminder 2005-2007 Carlos Guestrin 2 Today
More informationComputation. For QDA we need to calculate: Lets first consider the case that
Computation For QDA we need to calculate: δ (x) = 1 2 log( Σ ) 1 2 (x µ ) Σ 1 (x µ ) + log(π ) Lets first consider the case that Σ = I,. This is the case where each distribution is spherical, around the
More information1 Principal Components Analysis
Lecture 3 and 4 Sept. 18 and Sept.20-2006 Data Visualization STAT 442 / 890, CM 462 Lecture: Ali Ghodsi 1 Principal Components Analysis Principal components analysis (PCA) is a very popular technique for
More informationLeast Squares SVM Regression
Least Squares SVM Regression Consider changing SVM to LS SVM by making following modifications: min (w,e) ½ w 2 + ½C Σ e(i) 2 subject to d(i) (w T Φ( x(i))+ b) = e(i), i, and C>0. Note that e(i) is error
More informationLinear Dimensionality Reduction
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Principal Component Analysis 3 Factor Analysis
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen PCA. Tobias Scheffer
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen PCA Tobias Scheffer Overview Principal Component Analysis (PCA) Kernel-PCA Fisher Linear Discriminant Analysis t-sne 2 PCA: Motivation
More informationA Statistical Analysis of Fukunaga Koontz Transform
1 A Statistical Analysis of Fukunaga Koontz Transform Xiaoming Huo Dr. Xiaoming Huo is an assistant professor at the School of Industrial and System Engineering of the Georgia Institute of Technology,
More informationAn Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM
An Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM Wei Chu Chong Jin Ong chuwei@gatsby.ucl.ac.uk mpeongcj@nus.edu.sg S. Sathiya Keerthi mpessk@nus.edu.sg Control Division, Department
More informationDimensionality Reduction AShortTutorial
Dimensionality Reduction AShortTutorial Ali Ghodsi Department of Statistics and Actuarial Science University of Waterloo Waterloo, Ontario, Canada, 2006 c Ali Ghodsi, 2006 Contents 1 An Introduction to
More informationMACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA
1 MACHINE LEARNING Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 2 Practicals Next Week Next Week, Practical Session on Computer Takes Place in Room GR
More informationCHARACTERIZATION OF ULTRASONIC IMMERSION TRANSDUCERS
CHARACTERIZATION OF ULTRASONIC IMMERSION TRANSDUCERS INTRODUCTION David D. Bennink, Center for NDE Anna L. Pate, Engineering Science and Mechanics Ioa State University Ames, Ioa 50011 In any ultrasonic
More informationGI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil
GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis Massimiliano Pontil 1 Today s plan SVD and principal component analysis (PCA) Connection
More informationData Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396
Data Mining Linear & nonlinear classifiers Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 31 Table of contents 1 Introduction
More informationConvergence of Eigenspaces in Kernel Principal Component Analysis
Convergence of Eigenspaces in Kernel Principal Component Analysis Shixin Wang Advanced machine learning April 19, 2016 Shixin Wang Convergence of Eigenspaces April 19, 2016 1 / 18 Outline 1 Motivation
More informationy(x) = x w + ε(x), (1)
Linear regression We are ready to consider our first machine-learning problem: linear regression. Suppose that e are interested in the values of a function y(x): R d R, here x is a d-dimensional vector-valued
More informationMachine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling
Machine Learning B. Unsupervised Learning B.2 Dimensionality Reduction Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University
More informationFisher Linear Discriminant Analysis
Fisher Linear Discriminant Analysis Cheng Li, Bingyu Wang August 31, 2014 1 What s LDA Fisher Linear Discriminant Analysis (also called Linear Discriminant Analysis(LDA)) are methods used in statistics,
More informationLocal Fisher Discriminant Analysis for Supervised Dimensionality Reduction
Local Fisher Discriminant Analysis for Supervised Dimensionality Reduction Masashi Sugiyama sugi@cs.titech.ac.jp Department of Computer Science, Tokyo Institute of Technology, ---W8-7, O-okayama, Meguro-ku,
More informationj=1 u 1jv 1j. 1/ 2 Lemma 1. An orthogonal set of vectors must be linearly independent.
Lecture Notes: Orthogonal and Symmetric Matrices Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong taoyf@cse.cuhk.edu.hk Orthogonal Matrix Definition. Let u = [u
More informationOnline Kernel PCA with Entropic Matrix Updates
Online Kernel PCA with Entropic Matrix Updates Dima Kuzmin Manfred K. Warmuth University of California - Santa Cruz ICML 2007, Corvallis, Oregon April 23, 2008 D. Kuzmin, M. Warmuth (UCSC) Online Kernel
More informationIntroduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur
Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Module - 5 Lecture - 22 SVM: The Dual Formulation Good morning.
More informationSPECTRAL CLUSTERING AND KERNEL PRINCIPAL COMPONENT ANALYSIS ARE PURSUING GOOD PROJECTIONS
SPECTRAL CLUSTERING AND KERNEL PRINCIPAL COMPONENT ANALYSIS ARE PURSUING GOOD PROJECTIONS VIKAS CHANDRAKANT RAYKAR DECEMBER 5, 24 Abstract. We interpret spectral clustering algorithms in the light of unsupervised
More informationLinear vs Non-linear classifier. CS789: Machine Learning and Neural Network. Introduction
Linear vs Non-linear classifier CS789: Machine Learning and Neural Network Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Linear classifier is in the
More informationPerceptron Revisited: Linear Separators. Support Vector Machines
Support Vector Machines Perceptron Revisited: Linear Separators Binary classification can be viewed as the task of separating classes in feature space: w T x + b > 0 w T x + b = 0 w T x + b < 0 Department
More informationSupport'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan
Support'Vector'Machines Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan kasthuri.kannan@nyumc.org Overview Support Vector Machines for Classification Linear Discrimination Nonlinear Discrimination
More informationKernel Methods. Outline
Kernel Methods Quang Nguyen University of Pittsburgh CS 3750, Fall 2011 Outline Motivation Examples Kernels Definitions Kernel trick Basic properties Mercer condition Constructing feature space Hilbert
More informationSupervised Learning Coursework
Supervised Learning Coursework John Shawe-Taylor Tom Diethe Dorota Glowacka November 30, 2009; submission date: noon December 18, 2009 Abstract Using a series of synthetic examples, in this exercise session
More informationCS4495/6495 Introduction to Computer Vision. 8B-L2 Principle Component Analysis (and its use in Computer Vision)
CS4495/6495 Introduction to Computer Vision 8B-L2 Principle Component Analysis (and its use in Computer Vision) Wavelength 2 Wavelength 2 Principal Components Principal components are all about the directions
More informationManifold Learning: Theory and Applications to HRI
Manifold Learning: Theory and Applications to HRI Seungjin Choi Department of Computer Science Pohang University of Science and Technology, Korea seungjin@postech.ac.kr August 19, 2008 1 / 46 Greek Philosopher
More informationEigenvoice Speaker Adaptation via Composite Kernel PCA
Eigenvoice Speaker Adaptation via Composite Kernel PCA James T. Kwok, Brian Mak and Simon Ho Department of Computer Science Hong Kong University of Science and Technology Clear Water Bay, Hong Kong [jamesk,mak,csho]@cs.ust.hk
More informationKernel Methods. Barnabás Póczos
Kernel Methods Barnabás Póczos Outline Quick Introduction Feature space Perceptron in the feature space Kernels Mercer s theorem Finite domain Arbitrary domain Kernel families Constructing new kernels
More informationStatistical Machine Learning from Data
Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Support Vector Machines Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique
More informationML (cont.): SUPPORT VECTOR MACHINES
ML (cont.): SUPPORT VECTOR MACHINES CS540 Bryan R Gibson University of Wisconsin-Madison Slides adapted from those used by Prof. Jerry Zhu, CS540-1 1 / 40 Support Vector Machines (SVMs) The No-Math Version
More informationStat542 (F11) Statistical Learning. First consider the scenario where the two classes of points are separable.
Linear SVM (separable case) First consider the scenario where the two classes of points are separable. It s desirable to have the width (called margin) between the two dashed lines to be large, i.e., have
More informationClassification with Kernel Mahalanobis Distance Classifiers
Classification with Kernel Mahalanobis Distance Classifiers Bernard Haasdonk and Elżbieta P ekalska 2 Institute of Numerical and Applied Mathematics, University of Münster, Germany, haasdonk@math.uni-muenster.de
More informationEfficient Kernel Discriminant Analysis via Spectral Regression
Efficient Kernel Discriminant Analysis via Spectral Regression Deng Cai Xiaofei He Jiawei Han Department of Computer Science, University of Illinois at Urbana-Champaign Yahoo! Research Labs Abstract Linear
More informationMTTS1 Dimensionality Reduction and Visualization Spring 2014 Jaakko Peltonen
MTTS1 Dimensionality Reduction and Visualization Spring 2014 Jaakko Peltonen Lecture 3: Linear feature extraction Feature extraction feature extraction: (more general) transform the original to (k < d).
More informationCS145: INTRODUCTION TO DATA MINING
CS145: INTRODUCTION TO DATA MINING 5: Vector Data: Support Vector Machine Instructor: Yizhou Sun yzsun@cs.ucla.edu October 18, 2017 Homework 1 Announcements Due end of the day of this Thursday (11:59pm)
More information