Kernel Principal Component Analysis

Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr
Outline

- Principal component analysis (PCA)
- Learning in feature space
- What is a kernel?
- PCA in feature space?
- Kernel PCA
Principal Component Analysis (PCA)

Given a data matrix $X = [x_1, \ldots, x_N] \in \mathbb{R}^{m \times N}$, PCA aims at finding a linear orthogonal transformation $W$ ($W^\top W = I$) such that $\mathrm{tr}\{Y Y^\top\}$ is maximized, where $Y = W^\top X$.

It turns out that $W$ corresponds to the first $n$ eigenvectors of the data covariance matrix
$$C = \frac{1}{N} (XH)(XH)^\top, \quad \text{where } H = I - \frac{1}{N} \mathbf{1}_N \mathbf{1}_N^\top,$$
i.e., $W = U \in \mathbb{R}^{m \times n}$, where $C = U D U^\top$ (eigen-decomposition).
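As a concrete reference point before moving to feature spaces, here is a minimal numpy sketch of this procedure; the function name and the use of np.linalg.eigh are my own choices, not from the slides:

```python
import numpy as np

def pca(X, n):
    """Linear PCA on X in R^{m x N} (one sample per column); returns Y = W^T X."""
    N = X.shape[1]
    H = np.eye(N) - np.ones((N, N)) / N   # centering matrix H = I - (1/N) 1 1^T
    C = (X @ H) @ (X @ H).T / N           # covariance C = (1/N)(XH)(XH)^T
    D, U = np.linalg.eigh(C)              # eigen-decomposition C = U diag(D) U^T
    W = U[:, np.argsort(D)[::-1][:n]]     # first n eigenvectors (largest eigenvalues)
    return W.T @ X
```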
PCA: An Example

[Figure: two-dimensional example data, both axes spanning roughly -4 to 4.]
Learning in Feature Space

It is important to choose a representation that matches the specific learning problem, so change the representation of the data:
$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix} \mapsto \phi(x) = \begin{bmatrix} \phi_1(x) \\ \vdots \\ \phi_r(x) \end{bmatrix}, \qquad \phi: \mathbb{R}^m \to \mathcal{F} \text{ (feature space)}.$$

The feature space is $\{\phi(x) \mid x \in \mathcal{X}\}$. It can be an infinite-dimensional space, i.e., $r = \infty$.
Why a Nonlinear Mapping?
A Simple Example

Consider a target function
$$f(m_1, m_2, r) = C\, \frac{m_1 m_2}{r^2},$$
where $f$ is the gravitational force between two bodies with masses $m_1$ and $m_2$, and $r$ is the separation distance between the two bodies.

A simple change of coordinates, $(m_1, m_2, r) \mapsto (x, y, z) = (\log m_1, \log m_2, \log r)$, leads to
$$g(x, y, z) = \log f(m_1, m_2, r) = \log C + \log m_1 + \log m_2 - 2 \log r = c + x + y - 2z,$$
with $c = \log C$: the nonlinear target becomes linear in the new representation.
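A quick numerical check of this change of coordinates (the specific values of $C$, $m_1$, $m_2$, $r$ below are arbitrary illustrations):

```python
import numpy as np

C, m1, m2, r = 6.674e-11, 5.0, 3.0, 2.0         # arbitrary illustrative values

f = C * m1 * m2 / r**2                          # nonlinear target
x, y, z = np.log(m1), np.log(m2), np.log(r)     # change of coordinates
g = np.log(C) + x + y - 2 * z                   # linear in (x, y, z)

assert np.isclose(np.log(f), g)                 # log f = c + x + y - 2z
```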
What is a Kernel?

Consider a nonlinear mapping $\phi: \mathbb{R}^m \to \mathcal{F}$ (feature space), where $\phi(x) = [\phi_1(x), \ldots, \phi_r(x)]^\top$ ($r$ could be infinite).

Definition (Kernel). A kernel is a function $k$ such that for all $x, y \in \mathcal{X}$,
$$k(x, y) = \langle \phi(x), \phi(y) \rangle,$$
where $\phi$ is a mapping from $\mathcal{X}$ to an (inner product) feature space $\mathcal{F}$ (dot product space).
Various Kernels

Polynomial kernel:
$$k(x, y) = \langle x, y \rangle^d$$

RBF kernel:
$$k(x, y) = \exp\left\{ -\frac{\|x - y\|^2}{2\sigma^2} \right\}$$

Sigmoid kernel:
$$k(x, y) = \tanh(\kappa \langle x, y \rangle + \theta),$$
for suitable values of gain $\kappa$ and threshold $\theta$.
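All three are one-liners in numpy; a sketch, with default parameter values of my own choosing:

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    return np.dot(x, y) ** d

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    # positive definite only for suitable values of kappa and theta
    return np.tanh(kappa * np.dot(x, y) + theta)
```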
Example: Polynomial Kernel

Polynomial kernel: $k(x, y) = \langle x, y \rangle^d$. If $d = 2$ and $x, y \in \mathbb{R}^2$, then
$$\langle x, y \rangle^2 = \left\langle \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \right\rangle^2 = (x_1 y_1 + x_2 y_2)^2 = \left\langle \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \end{bmatrix}, \begin{bmatrix} y_1^2 \\ y_2^2 \\ \sqrt{2}\, y_1 y_2 \end{bmatrix} \right\rangle = \langle \phi(x), \phi(y) \rangle.$$
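The identity is easy to verify numerically; the helper phi below is just the explicit feature map written out above, applied to two arbitrary test points:

```python
import numpy as np

def phi(x):
    # explicit feature map for the d = 2 polynomial kernel on R^2
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(np.dot(x, y) ** 2, np.dot(phi(x), phi(y)))
```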
Reproducing Kernels

Define a map $\phi: x \mapsto k(\cdot, x)$.

Reproducing kernels satisfy
$$\langle k(\cdot, x), f \rangle = f(x),$$
$$\langle k(\cdot, x), k(\cdot, y) \rangle = k(x, y),$$
so that $\langle \phi(x), \phi(y) \rangle = k(x, y)$.
RKHS and Kernels

Theorem (relating kernels and RKHSs).
a) For every RKHS there exists a unique, positive definite function called the reproducing kernel (RK).
b) Conversely, for every positive definite function $k$ on $\mathcal{X} \times \mathcal{X}$ there is a unique RKHS with $k$ as its RK.
Mercer's Theorem

Theorem (Mercer). If $k$ is a continuous symmetric kernel of a positive integral operator $T$, i.e.,
$$(Tf)(y) = \int_C k(x, y) f(x)\, dx$$
with
$$\int_{C \times C} k(x, y) f(x) f(y)\, dx\, dy \geq 0$$
for all $f \in L_2(C)$ ($C$ being a compact subset of $\mathbb{R}^m$), then it can be expanded in a uniformly convergent series (on $C \times C$) in terms of $T$'s eigenfunctions $\varphi_j$ and positive eigenvalues $\lambda_j$:
$$k(x, y) = \sum_{j=1}^{r} \lambda_j \varphi_j(x) \varphi_j(y),$$
where $r$ is the number of positive eigenvalues.
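Positive definiteness can be sanity-checked numerically: the Gram matrix of any finite point set under a Mercer kernel should have no negative eigenvalues (up to rounding). A small sketch with the RBF kernel and random points of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))                       # 10 arbitrary points in R^3

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)                        # RBF Gram matrix, sigma = 1

assert np.all(np.linalg.eigvalsh(K) >= -1e-10)     # all eigenvalues nonnegative
```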
PCA: Using Dot Products

Given a set of data (with zero mean), $x_k \in \mathbb{R}^m$, $k = 1, \ldots, N$, the sample covariance matrix $C$ is given by $C = \frac{1}{N} \sum_{j=1}^{N} x_j x_j^\top$.

For PCA, one has to solve the eigenvalue equation
$$C v = \lambda v. \tag{1}$$

Note that
$$C v = \left( \frac{1}{N} \sum_{j=1}^{N} x_j x_j^\top \right) v = \frac{1}{N} \sum_{j=1}^{N} \langle x_j, v \rangle\, x_j. \tag{2}$$

This implies that all solutions $v$ with $\lambda \neq 0$ must lie in the span of $x_1, \ldots, x_N$. Hence $Cv = \lambda v$ is equivalent to
$$\lambda \langle x_k, v \rangle = \langle x_k, C v \rangle, \quad k = 1, \ldots, N. \tag{3}$$
PCA in Feature Space

Consider a nonlinear mapping $\phi: \mathbb{R}^m \to \mathcal{F}$ (feature space). Assume
$$\sum_{k=1}^{N} \phi(x_k) = 0.$$

The covariance matrix $C$ in the feature space $\mathcal{F}$ is
$$C = \frac{1}{N} \sum_{j=1}^{N} \phi(x_j) \phi^\top(x_j).$$

As in linear PCA, one has to solve the eigenvalue problem $\lambda V = C V$. Again, all solutions $V$ with $\lambda \neq 0$ lie in the span of $\phi(x_1), \ldots, \phi(x_N)$, which leads to
$$\lambda \langle \phi(x_k), V \rangle = \langle \phi(x_k), C V \rangle, \quad k = 1, \ldots, N, \tag{4}$$
and there exist coefficients $\{\alpha_i\}$ such that
$$V = \sum_{i=1}^{N} \alpha_i \phi(x_i). \tag{5}$$
Define an $N \times N$ matrix $K$ by
$$[K]_{ij} = K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle. \tag{6}$$
Substituting (5) into (4) then gives
$$\lambda N K \alpha = K^2 \alpha. \tag{7}$$
It can be shown that Eq. (7) (see the proof in the paper) implies
$$N \lambda \alpha = K \alpha \tag{8}$$
for nonzero eigenvalues.
Normalization

Let $\lambda_1 \leq \cdots \leq \lambda_N$ denote the eigenvalues of $K$ and $\alpha^1, \ldots, \alpha^N$ their corresponding eigenvectors, with $\lambda_p$ being the first nonzero eigenvalue. We normalize $\alpha^p, \ldots, \alpha^N$ by requiring that the corresponding vectors in $\mathcal{F}$ be normalized, i.e.,
$$\langle V^k, V^k \rangle = 1, \quad k = p, \ldots, N. \tag{9}$$

Eq. (9) leads to
$$\left\langle \sum_{i=1}^{N} \alpha_i^k \phi(x_i),\; \sum_{j=1}^{N} \alpha_j^k \phi(x_j) \right\rangle = 1
\;\Longleftrightarrow\; \sum_{i,j} \alpha_i^k \alpha_j^k K_{ij} = 1
\;\Longleftrightarrow\; \langle \alpha^k, K \alpha^k \rangle = 1
\;\Longleftrightarrow\; \lambda_k \langle \alpha^k, \alpha^k \rangle = 1.$$
Compute Nonlinear Components

In linear PCA, principal components are extracted by projecting the data $x$ onto the eigenvectors $v^k$ of the covariance matrix $C$, i.e., $\langle v^k, x \rangle$.

In kernel PCA, we likewise project $\phi(x)$ onto the eigenvectors $V^k$ of $C$:
$$\langle V^k, \phi(x) \rangle = \sum_{i=1}^{N} \alpha_i^k \langle \phi(x_i), \phi(x) \rangle = \sum_{i=1}^{N} \alpha_i^k\, k(x_i, x).$$
Centering in Feature Space

Define
$$\tilde{\phi}(x_t) = \phi(x_t) - \frac{1}{N} \sum_{l=1}^{N} \phi(x_l).$$
Then we have
$$\tilde{K}_{ij} = \langle \tilde{\phi}(x_i), \tilde{\phi}(x_j) \rangle
= \left\langle \phi(x_i) - \frac{1}{N} \sum_{l=1}^{N} \phi(x_l),\; \phi(x_j) - \frac{1}{N} \sum_{k=1}^{N} \phi(x_k) \right\rangle
= K_{ij} - \frac{1}{N} \sum_{k} K_{ik} - \frac{1}{N} \sum_{l} K_{lj} + \frac{1}{N^2} \sum_{l,k} K_{lk}.$$
Therefore, the centered kernel matrix is given by
$$\tilde{K} = K - \mathbf{1}_N K - K \mathbf{1}_N + \mathbf{1}_N K \mathbf{1}_N,$$
where
$$\mathbf{1}_N = \frac{1}{N} \begin{bmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{bmatrix}.$$
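In code, the centering amounts to three matrix products; a minimal sketch (the function name is mine):

```python
import numpy as np

def center_kernel(K):
    """Apply K~ = K - 1_N K - K 1_N + 1_N K 1_N, with 1_N the all-(1/N) matrix."""
    N = K.shape[0]
    one_N = np.ones((N, N)) / N
    return K - one_N @ K - K @ one_N + one_N @ K @ one_N
```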
Kernel PCA Algorithm Outline

1. Given a set of $m$-dimensional training data $\{x_k\}$, $k = 1, \ldots, N$, compute the kernel matrix $K = [k(x_i, x_j)] \in \mathbb{R}^{N \times N}$.
2. Carry out centering in feature space, so that $\sum_{k=1}^{N} \tilde{\phi}(x_k) = 0$:
$$\tilde{K} = K - \mathbf{1}_N K - K \mathbf{1}_N + \mathbf{1}_N K \mathbf{1}_N,$$
where $\mathbf{1}_N \in \mathbb{R}^{N \times N}$ is the matrix whose entries are all $\frac{1}{N}$.
3. Solve the eigenvalue problem $N \lambda \alpha = \tilde{K} \alpha$ and normalize $\alpha^k$ such that $\langle \alpha^k, \alpha^k \rangle = \frac{1}{\lambda_k}$.
4. For a test pattern $x$, extract a nonlinear component via
$$\langle V^k, \phi(x) \rangle = \sum_{i=1}^{N} \alpha_i^k\, k(x_i, x).$$
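Putting the four steps together, a compact numpy sketch (the function name, signature, and the choice to return the components of the training points themselves are mine; a held-out test pattern would additionally need its cross-kernel against the training set centered in the same way):

```python
import numpy as np

def kernel_pca(X, kernel, n):
    """Kernel PCA on X (one sample per row), following steps 1-4 above."""
    N = X.shape[0]
    # Step 1: kernel matrix K_ij = k(x_i, x_j)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    # Step 2: centering in feature space
    one_N = np.ones((N, N)) / N
    Kc = K - one_N @ K - K @ one_N + one_N @ K @ one_N
    # Step 3: eigenvectors of Kc, scaled so that lambda_k <alpha^k, alpha^k> = 1
    lam, alpha = np.linalg.eigh(Kc)
    top = np.argsort(lam)[::-1][:n]
    alpha = alpha[:, top] / np.sqrt(lam[top])
    # Step 4: nonlinear components <V^k, phi(x_j)> = sum_i alpha_i^k k(x_i, x_j)
    return Kc @ alpha

# usage: project 2-D points onto the first two kernel principal components
rbf = lambda x, y: np.exp(-np.linalg.norm(x - y) ** 2 / 2.0)
X = np.random.default_rng(0).normal(size=(50, 2))
Y = kernel_pca(X, rbf, 2)
```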
Toy Example

[Figure: contour plots of the first eight kernel principal components on a toy data set, with eigenvalues 0.251, 0.233, 0.052, 0.044, 0.037, 0.033, 0.031, and 0.025.]
KPCA in a Nutshell

Consider the data matrix $X = [x_1, \ldots, x_N]$. Then the eigen-decomposition of the covariance matrix is given by
$$X X^\top U = U \Sigma.$$
Pre-multiplying both sides by $X^\top$ leads to
$$X^\top X X^\top U = X^\top U \Sigma.$$
Let $U = X W$. Then we have
$$X^\top X X^\top X W = X^\top X W \Sigma,$$
which is re-written as ($K = X^\top X$)
$$K^2 W = K W \Sigma,$$
which is further simplified as
$$K W = W \Sigma.$$
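This equivalence is easy to confirm numerically: the leading eigenvector of $XX^\top$ and the vector $XW$ built from the leading eigenvector of $K = X^\top X$ point in the same direction. A sketch with random centered data of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))                 # m = 5 dimensions, N = 100 samples
X -= X.mean(axis=1, keepdims=True)            # center each row (feature)

_, U = np.linalg.eigh(X @ X.T)                # eigenvectors of the m x m X X^T
_, W = np.linalg.eigh(X.T @ X)                # eigenvectors of the N x N K = X^T X
u, v = U[:, -1], X @ W[:, -1]                 # leading eigenvector vs. U = X W column

cos = abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
assert np.isclose(cos, 1.0)                   # same direction up to sign and scale
```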