Kernel Principal Component Analysis

Seungjin Choi
Department of Computer Science and Engineering, Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr

Outline

- Principal component analysis (PCA)
- Learning in feature space
- What is a kernel?
- PCA in feature space?
- Kernel PCA

Principal Component Analysis (PCA)

Given a data matrix $X = [x_1, \ldots, x_N] \in \mathbb{R}^{m \times N}$, PCA aims at finding a linear orthogonal transformation $W$ ($W^\top W = I$) such that $\mathrm{tr}\{Y Y^\top\}$ is maximized, where $Y = W^\top X$.

It turns out that $W$ corresponds to the first $n$ eigenvectors of the data covariance matrix
$$C = \frac{1}{N}(XH)(XH)^\top, \qquad H = I - \frac{1}{N}\mathbf{1}_N \mathbf{1}_N^\top,$$
i.e., $W = U \in \mathbb{R}^{m \times n}$, where $C = U D U^\top$ (eigen-decomposition).
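
A minimal NumPy sketch of the linear PCA described above (function and variable names are my own, not part of the slides):

```python
import numpy as np

def pca(X, n_components):
    """Linear PCA on an m x N data matrix X, following the slide above."""
    m, N = X.shape
    H = np.eye(N) - np.ones((N, N)) / N       # centering matrix H = I - (1/N) 1 1^T
    C = (X @ H) @ (X @ H).T / N               # data covariance C = (1/N)(XH)(XH)^T
    eigvals, U = np.linalg.eigh(C)            # eigen-decomposition C = U D U^T (ascending order)
    W = U[:, np.argsort(eigvals)[::-1][:n_components]]   # first n eigenvectors
    Y = W.T @ X                               # projections Y = W^T X
    return W, Y

W, Y = pca(np.random.randn(5, 200), n_components=2)      # toy usage on random data
```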

PCA: An Example

[Figure: 2-D scatter plot illustrating the principal directions of a point cloud; both axes range from -4 to 4.]

Learning in Feature Space

It is important to choose a representation that matches the specific learning problem, so change the representation of the data:
$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix} \;\longmapsto\; \phi(x) = \begin{bmatrix} \phi_1(x) \\ \vdots \\ \phi_r(x) \end{bmatrix}, \qquad \phi : \mathbb{R}^m \to \mathcal{F} \text{ (feature space)}.$$
The feature space is $\{\phi(x) \mid x \in \mathcal{X}\}$. It can be an infinite-dimensional space, i.e., $r = \infty$.
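
To get a feel for how quickly an explicit feature map grows, here is a small illustrative sketch (my own example, not from the slides) counting the monomial features of degree at most $d$ for an $m$-dimensional input:

```python
from itertools import combinations_with_replacement

def poly_feature_dim(m, d):
    """Number of monomial features of degree <= d for an m-dimensional input."""
    return sum(1 for k in range(d + 1)
                 for _ in combinations_with_replacement(range(m), k))

for m in (2, 10, 100):
    print(m, poly_feature_dim(m, d=3))   # the explicit dimension r grows rapidly with m
```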

Why a Nonlinear Mapping?

A Simple Example

Consider a target function
$$f(m_1, m_2, r) = C\,\frac{m_1 m_2}{r^2},$$
where $f$ is the gravitational force between two bodies with masses $m_1$ and $m_2$, and $r$ is the separation distance between the two bodies.

A simple change of coordinates $(m_1, m_2, r) \mapsto (x, y, z) = (\log m_1, \log m_2, \log r)$ leads to
$$g(x, y, z) = \log f(m_1, m_2, r) = \log C + \log m_1 + \log m_2 - 2\log r = c + x + y - 2z,$$
so the target becomes linear in the new representation.
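
A quick numerical check of this change of coordinates (a sketch only; the constant and inputs below are arbitrary illustrative values):

```python
import numpy as np

C, m1, m2, r = 2.0, 3.0, 5.0, 4.0                        # arbitrary illustrative values
f = C * m1 * m2 / r**2                                   # original nonlinear target
g = np.log(C) + np.log(m1) + np.log(m2) - 2*np.log(r)    # linear in (x, y, z) = (log m1, log m2, log r)
assert np.isclose(np.log(f), g)                          # log f equals the linear form
```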

What is a Kernel?

Consider a nonlinear mapping $\phi : \mathbb{R}^m \to \mathcal{F}$ (feature space), where $\phi(x) = [\phi_1(x), \ldots, \phi_r(x)]^\top$ ($r$ could be infinite).

Definition (Kernel). A kernel is a function $k$ such that for all $x, y \in \mathcal{X}$,
$$k(x, y) = \langle \phi(x), \phi(y) \rangle,$$
where $\phi$ is a mapping from $\mathcal{X}$ to an (inner product) feature space $\mathcal{F}$ (dot product space).

Various Kernels

Polynomial kernel: $k(x, y) = \langle x, y \rangle^d$

RBF kernel: $k(x, y) = \exp\left\{ -\dfrac{\|x - y\|^2}{2\sigma^2} \right\}$

Sigmoid kernel: $k(x, y) = \tanh(\kappa \langle x, y \rangle + \theta)$, for suitable values of gain $\kappa$ and threshold $\theta$.
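
A small NumPy sketch of these three kernels (function names and default parameter values are illustrative choices, not part of the slides):

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    """k(x, y) = <x, y>^d"""
    return np.dot(x, y) ** d

def rbf_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    """k(x, y) = tanh(kappa <x, y> + theta)"""
    return np.tanh(kappa * np.dot(x, y) + theta)
```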

Example: Polynomial Kernel

Polynomial kernel: $k(x, y) = \langle x, y \rangle^d$. If $d = 2$ and $x, y \in \mathbb{R}^2$, then
$$\langle x, y \rangle^2 = \left\langle \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \right\rangle^2 = (x_1 y_1 + x_2 y_2)^2 = \left\langle \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\,x_1 x_2 \end{bmatrix}, \begin{bmatrix} y_1^2 \\ y_2^2 \\ \sqrt{2}\,y_1 y_2 \end{bmatrix} \right\rangle = \langle \phi(x), \phi(y) \rangle.$$
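
A quick numerical verification of this identity (a sketch; the test vectors are arbitrary):

```python
import numpy as np

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])       # arbitrary vectors in R^2

def phi(v):
    # Explicit feature map for the degree-2 polynomial kernel on R^2
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

assert np.isclose(np.dot(x, y)**2, np.dot(phi(x), phi(y)))   # <x, y>^2 = <phi(x), phi(y)>
```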

Reproducing Kernels

Define a map $\phi : x \mapsto k(\cdot, x)$. Reproducing kernels satisfy
$$\langle k(\cdot, x), f \rangle = f(x), \qquad \langle k(\cdot, x), k(\cdot, y) \rangle = k(x, y),$$
hence $\langle \phi(x), \phi(y) \rangle = k(x, y)$.

RKHS and Kernels

Theorem (relating kernels and RKHSs):
(a) For every RKHS there exists a unique positive definite function, called the reproducing kernel (RK).
(b) Conversely, for every positive definite function $k$ on $\mathcal{X} \times \mathcal{X}$ there is a unique RKHS with $k$ as its RK.

Mercer's Theorem

Theorem (Mercer). If $k$ is a continuous symmetric kernel of a positive integral operator $T$, i.e.,
$$(Tf)(y) = \int_C k(x, y) f(x)\,dx \quad \text{with} \quad \int_C \int_C k(x, y) f(x) f(y)\,dx\,dy \ge 0$$
for all $f \in L_2(C)$ ($C$ being a compact subset of $\mathbb{R}^m$), then it can be expanded in a uniformly convergent series (on $C \times C$) in terms of $T$'s eigenfunctions $\varphi_j$ and positive eigenvalues $\lambda_j$,
$$k(x, y) = \sum_{j=1}^{r} \lambda_j \varphi_j(x) \varphi_j(y),$$
where $r$ is the number of positive eigenvalues.

PCA: Using Dot Products

Given a set of data (with zero mean), $x_k \in \mathbb{R}^m$, $k = 1, \ldots, N$, the sample covariance matrix $C$ is given by $C = \frac{1}{N}\sum_{j=1}^{N} x_j x_j^\top$.

For PCA, one has to solve the eigenvalue equation
$$C v = \lambda v. \tag{1}$$
Note that
$$C v = \left( \frac{1}{N}\sum_{j=1}^{N} x_j x_j^\top \right) v = \frac{1}{N}\sum_{j=1}^{N} \langle x_j, v \rangle\, x_j. \tag{2}$$
This implies that all solutions $v$ with $\lambda \ne 0$ must lie in the span of $x_1, \ldots, x_N$. Hence $C v = \lambda v$ is equivalent to
$$\lambda \langle x_k, v \rangle = \langle x_k, C v \rangle, \qquad k = 1, \ldots, N. \tag{3}$$

PCA in Feature Space

Consider a nonlinear mapping $\phi : \mathbb{R}^m \to \mathcal{F}$ (feature space) and assume $\sum_{k=1}^{N} \phi(x_k) = 0$.

The covariance matrix $C$ in the feature space $\mathcal{F}$ is
$$C = \frac{1}{N}\sum_{j=1}^{N} \phi(x_j)\,\phi^\top(x_j).$$
As in linear PCA, one has to solve the eigenvalue problem $\lambda V = C V$. Again, all solutions $V$ with $\lambda \ne 0$ lie in the span of $\phi(x_1), \ldots, \phi(x_N)$, which leads to
$$\lambda \langle \phi(x_k), V \rangle = \langle \phi(x_k), C V \rangle, \qquad k = 1, \ldots, N, \tag{4}$$
and there exist coefficients $\{\alpha_i\}$ such that
$$V = \sum_{i=1}^{N} \alpha_i \phi(x_i). \tag{5}$$

Define an $N \times N$ matrix $K$ by $[K]_{ij} = K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$. Substituting (5) into (4) then gives
$$N \lambda K \alpha = K^2 \alpha. \tag{7}$$
It can be shown (see the proof in the paper) that Eq. (7) implies
$$N \lambda \alpha = K \alpha \tag{8}$$
for nonzero eigenvalues.

Normalization

Let $\lambda_1 \le \cdots \le \lambda_N$ denote the eigenvalues of $K$ and $\alpha^1, \ldots, \alpha^N$ their corresponding eigenvectors, with $\lambda_p$ being the first nonzero eigenvalue. We normalize $\alpha^p, \ldots, \alpha^N$ by requiring that the corresponding vectors in $\mathcal{F}$ be normalized, i.e.,
$$\langle V^k, V^k \rangle = 1, \qquad k = p, \ldots, N. \tag{9}$$
Eq. (9) leads to
$$\left\langle \sum_{i=1}^{N} \alpha_i^k \phi(x_i),\; \sum_{j=1}^{N} \alpha_j^k \phi(x_j) \right\rangle = \sum_{i,j} \alpha_i^k \alpha_j^k K_{ij} = \langle \alpha^k, K \alpha^k \rangle = \lambda_k \langle \alpha^k, \alpha^k \rangle = 1.$$

Compute Nonlinear Components

In linear PCA, principal components are extracted by projecting the data $x$ onto the eigenvectors $v^k$ of the covariance matrix $C$, i.e., $\langle v^k, x \rangle$.

In kernel PCA, we likewise project $\phi(x)$ onto the eigenvectors $V^k$ of $C$, i.e.,
$$\langle V^k, \phi(x) \rangle = \left\langle \sum_{i=1}^{N} \alpha_i^k \phi(x_i),\; \phi(x) \right\rangle = \sum_{i=1}^{N} \alpha_i^k\, k(x_i, x).$$

Centering in Feature Space

Define $\tilde{\phi}(x_t) = \phi(x_t) - \frac{1}{N}\sum_{l=1}^{N} \phi(x_l)$. Then we have
$$\tilde{K}_{ij} = \langle \tilde{\phi}(x_i), \tilde{\phi}(x_j) \rangle = \left\langle \phi(x_i) - \frac{1}{N}\sum_{l=1}^{N}\phi(x_l),\; \phi(x_j) - \frac{1}{N}\sum_{k=1}^{N}\phi(x_k) \right\rangle = K_{ij} - \frac{1}{N}\sum_{k} K_{ik} - \frac{1}{N}\sum_{l} K_{lj} + \frac{1}{N^2}\sum_{l,k} K_{lk}.$$
Therefore, the centered kernel matrix is given by
$$\tilde{K} = K - \mathbf{1}_N K - K \mathbf{1}_N + \mathbf{1}_N K \mathbf{1}_N,$$
where $\mathbf{1}_N$ is the $N \times N$ matrix with all entries equal to $\frac{1}{N}$.
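
A minimal NumPy sketch of this centering step (the function name is my own):

```python
import numpy as np

def center_kernel(K):
    """Center a kernel matrix in feature space: K_tilde = K - 1_N K - K 1_N + 1_N K 1_N."""
    N = K.shape[0]
    one_N = np.ones((N, N)) / N               # the N x N matrix with all entries 1/N
    return K - one_N @ K - K @ one_N + one_N @ K @ one_N
```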

Kernel PCA Algorithm Outline

1. Given a set of $m$-dimensional training data $\{x_k\}$, $k = 1, \ldots, N$, compute the kernel matrix $K = [k(x_i, x_j)] \in \mathbb{R}^{N \times N}$.
2. Carry out centering in feature space (so that $\sum_{k=1}^{N} \tilde{\phi}(x_k) = 0$):
$$\tilde{K} = K - \mathbf{1}_N K - K \mathbf{1}_N + \mathbf{1}_N K \mathbf{1}_N,$$
where $\mathbf{1}_N \in \mathbb{R}^{N \times N}$ has all entries equal to $\frac{1}{N}$.
3. Solve the eigenvalue problem $N \lambda \alpha = \tilde{K} \alpha$ and normalize $\alpha^k$ such that $\langle \alpha^k, \alpha^k \rangle = \frac{1}{\lambda_k}$.
4. For a test pattern $x$, extract a nonlinear component via
$$\langle V^k, \phi(x) \rangle = \sum_{i=1}^{N} \alpha_i^k\, k(x_i, x).$$
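
A hedged NumPy sketch following this outline (names, the normalization convention for the top eigenvectors, and the example kernel parameters are my own choices, not prescribed by the slides):

```python
import numpy as np

def kernel_pca(X, kernel, n_components):
    """Kernel PCA sketch roughly following the outline above.

    X: (N, m) array of training inputs; kernel: a function k(x, y) returning a scalar.
    Returns the training projections and the rescaled dual eigenvectors alpha.
    """
    N = X.shape[0]
    # 1. Kernel matrix K[i, j] = k(x_i, x_j)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    # 2. Centering in feature space
    one_N = np.ones((N, N)) / N
    K_tilde = K - one_N @ K - K @ one_N + one_N @ K @ one_N
    # 3. Eigen-decomposition of the centered kernel matrix; keep the leading components and
    #    rescale so that <alpha^k, alpha^k> = 1 / lambda_k (lambda_k = eigenvalue of K_tilde)
    eigvals, eigvecs = np.linalg.eigh(K_tilde)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, alphas = eigvals[order], eigvecs[:, order]
    alphas = alphas / np.sqrt(eigvals)        # assumes the kept eigenvalues are positive
    # 4. Nonlinear components of the training data: <V^k, phi(x_j)> = sum_i alpha_i^k K_tilde[i, j]
    projections = K_tilde @ alphas
    return projections, alphas

# Example usage with an RBF kernel (sigma = 1 chosen arbitrarily for illustration)
rbf = lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2.0)
X = np.random.randn(50, 2)
Z, alphas = kernel_pca(X, rbf, n_components=2)
```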

Toy Example

[Figure: toy example showing the first eight kernel principal components; the corresponding eigenvalues are 0.251, 0.233, 0.052, 0.044, 0.037, 0.033, 0.031, and 0.025.]

KPCA in a Nutshell

Consider the data matrix $X = [x_1, \ldots, x_N]$. The eigen-decomposition of the covariance matrix gives
$$X X^\top U = U \Sigma.$$
Pre-multiplying both sides by $X^\top$ leads to
$$X^\top X X^\top U = X^\top U \Sigma.$$
Let $U = X W$. Then we have
$$X^\top X X^\top X W = X^\top X W \Sigma,$$
which is re-written as (with $K = X^\top X$)
$$K^2 W = K W \Sigma,$$
and further simplified to
$$K W = W \Sigma.$$
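
A small NumPy check of this dual relationship (a sketch; the data are random and $W$ is constructed as one valid choice satisfying $U = XW$, which I label as an assumption rather than part of the slides):

```python
import numpy as np

X = np.random.randn(5, 20)                    # data matrix X = [x_1, ..., x_N], m = 5, N = 20
K = X.T @ X                                   # linear-kernel Gram matrix K = X^T X

# Eigen-decomposition of X X^T (the covariance up to a 1/N factor): X X^T U = U Sigma
Sigma, U = np.linalg.eigh(X @ X.T)

# One valid W with U = X W; it satisfies K W = W Sigma for the nonzero eigenvalues
W = X.T @ U @ np.diag(1.0 / Sigma)
assert np.allclose(K @ W, W @ np.diag(Sigma))
```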