Manifold Learning: Theory and Applications to HRI

Manifold Learning: Theory and Applications to HRI. Seungjin Choi, Department of Computer Science, Pohang University of Science and Technology, Korea. seungjin@postech.ac.kr. August 19, 2008.

A Greek philosopher said... Heraclitus, then: "You can never step in the same river twice." Heraclitus, now: "You can never see the same face twice."

Manifold Ways of Perception: Seung and Lee, 2000

Manifold Learning: Example 1. Finger extension; wrist rotation.

Manifold Learning: Example 2

Manifold Learning: Example 3

Why Manifold?

Principal Component Analysis (PCA). Given a data matrix $X = [x_1, \ldots, x_N] \in \mathbb{R}^{m \times N}$, PCA aims at finding a linear orthogonal transformation $W$ (with $W^\top W = I$) such that $\mathrm{tr}\{YY^\top\}$ is maximized, where $Y = W^\top X$. It turns out that $W$ corresponds to the first $n$ eigenvectors of the data covariance matrix $C = \frac{1}{N}(XH)(XH)^\top$, where $H = I - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^\top$: that is, $W = U \in \mathbb{R}^{m \times n}$, where $U$ holds the top $n$ eigenvectors in the eigendecomposition $C = UDU^\top$.
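
As a concrete illustration of this recipe, here is a minimal NumPy sketch (not part of the original slides): the data occupy the columns of X, H centers them, and W is taken as the top-n eigenvectors of C. The function name pca and the random example data are illustrative only.

```python
import numpy as np

def pca(X, n):
    """X: (m, N) data matrix, one sample per column; returns W (m, n) and Y (n, N)."""
    m, N = X.shape
    H = np.eye(N) - np.ones((N, N)) / N       # centering matrix H = I - (1/N) 1 1^T
    Xc = X @ H                                # centered data XH
    C = (Xc @ Xc.T) / N                       # covariance C = (1/N)(XH)(XH)^T
    eigvals, U = np.linalg.eigh(C)            # eigendecomposition, ascending eigenvalues
    W = U[:, ::-1][:, :n]                     # top-n eigenvectors (principal directions)
    Y = W.T @ Xc                              # low-dimensional representation
    return W, Y

# Example: project 10-dimensional points onto a 2-dimensional principal subspace.
W, Y = pca(np.random.randn(10, 200), 2)
```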

PCA: An Example

Learning in Feature Space. It is important to choose a representation that matches the specific learning problem, so change the representation of the data: $x = [x_1, \ldots, x_m]^\top \mapsto \phi(x) = [\phi_1(x), \ldots, \phi_r(x)]^\top$, with $\phi : \mathbb{R}^m \to \mathcal{F}$ (feature space). The feature space is $\{\phi(x) \mid x \in \mathcal{X}\}$, which could be an infinite-dimensional space, i.e., $r = \infty$.

Why a Nonlinear Mapping?

What is a Kernel? Consider a nonlinear mapping $\phi : \mathbb{R}^m \to \mathcal{F}$ (feature space), where $\phi(x) = [\phi_1(x), \ldots, \phi_r(x)]^\top$ ($r$ could be infinite). Definition (Kernel): a kernel is a function $k$ such that for all $x, y \in \mathcal{X}$, $k(x, y) = \langle \phi(x), \phi(y) \rangle$, where $\phi$ is a mapping from $\mathcal{X}$ to an inner product (dot product) feature space $\mathcal{F}$.

Various Kernels. Polynomial kernel: $k(x, y) = \langle x, y \rangle^d$. RBF kernel: $k(x, y) = \exp\left\{-\frac{\|x - y\|^2}{2\sigma^2}\right\}$. Sigmoid kernel: $k(x, y) = \tanh(\kappa \langle x, y \rangle + \theta)$, for suitable values of gain $\kappa$ and threshold $\theta$.
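
A small sketch of these three kernels in NumPy, written for arrays whose rows are data points; the parameter defaults (d, sigma, kappa, theta) are illustrative choices, not values from the slides.

```python
import numpy as np

def polynomial_kernel(X, Y, d=3):
    """k(x, y) = <x, y>^d for all pairs of rows of X and Y."""
    return (X @ Y.T) ** d

def rbf_kernel(X, Y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def sigmoid_kernel(X, Y, kappa=1.0, theta=0.0):
    """k(x, y) = tanh(kappa <x, y> + theta); a valid kernel only for suitable kappa, theta."""
    return np.tanh(kappa * (X @ Y.T) + theta)
```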

Reproducing Kernels. Define a map $\phi : x \mapsto k(\cdot, x)$. Reproducing kernels satisfy $\langle k(\cdot, x), f \rangle = f(x)$ and $\langle k(\cdot, x), k(\cdot, y) \rangle = k(x, y)$, so that $\langle \phi(x), \phi(y) \rangle = k(x, y)$.

RKHS and Kernels. Theorem (relating kernels and RKHSs): (a) for every RKHS there exists a unique positive definite function called the reproducing kernel (RK); (b) conversely, for every positive definite function $k$ on $\mathcal{X} \times \mathcal{X}$ there is a unique RKHS with $k$ as its RK.

Mercer's Theorem. Theorem (Mercer): if $k$ is a continuous symmetric kernel of a positive integral operator $T$, i.e., $(Tf)(y) = \int_C k(x, y) f(x)\, dx$ with $\int_C \int_C k(x, y) f(x) f(y)\, dx\, dy \ge 0$ for all $f \in L_2(C)$ ($C$ being a compact subset of $\mathbb{R}^m$), then it can be expanded in a uniformly convergent series (on $C \times C$) in terms of $T$'s eigenfunctions $\varphi_j$ and positive eigenvalues $\lambda_j$: $k(x, y) = \sum_{j=1}^{r} \lambda_j \varphi_j(x) \varphi_j(y)$, where $r$ is the number of positive eigenvalues.

PCA: Using Dot Products. Given a set of data (with zero mean), $x_k \in \mathbb{R}^m$, $k = 1, \ldots, N$, the sample covariance matrix $C$ is given by $C = \frac{1}{N}\sum_{j=1}^{N} x_j x_j^\top$. For PCA, one has to solve the eigenvalue equation $Cv = \lambda v$. Note that $Cv = \frac{1}{N}\sum_{j=1}^{N} x_j x_j^\top v = \frac{1}{N}\sum_{j=1}^{N} \langle x_j, v \rangle x_j$. This implies that all solutions $v$ with $\lambda \neq 0$ must lie in the span of $x_1, \ldots, x_N$. Hence $Cv = \lambda v$ is equivalent to $\lambda \langle x_k, v \rangle = \langle x_k, Cv \rangle$ for $k = 1, \ldots, N$.

PCA in Feature Space. Consider a nonlinear mapping $\phi : \mathbb{R}^m \to \mathcal{F}$ (feature space) and assume $\sum_{k=1}^{N} \phi(x_k) = 0$. The covariance matrix $C$ in the feature space $\mathcal{F}$ is $C = \frac{1}{N}\sum_{j=1}^{N} \phi(x_j) \phi^\top(x_j)$. As in linear PCA, one has to solve the eigenvalue problem $\lambda V = CV$. Again, all solutions $V$ with $\lambda \neq 0$ lie in the span of $\phi(x_1), \ldots, \phi(x_N)$, which leads to $\lambda \langle \phi(x_k), V \rangle = \langle \phi(x_k), CV \rangle$, $k = 1, \ldots, N$.

PCA in Feature Space (Cont'd). There exist coefficients $\{\alpha_i\}$ such that $V = \sum_{i=1}^{N} \alpha_i \phi(x_i)$. Substituting this relation into $\lambda \langle \phi(x_k), V \rangle = \langle \phi(x_k), CV \rangle$ yields $\lambda \sum_{i=1}^{N} \alpha_i \langle \phi(x_k), \phi(x_i) \rangle = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \langle \phi(x_k), \phi(x_j) \rangle \langle \phi(x_j), \phi(x_i) \rangle$, for $k = 1, \ldots, N$.

PCA in Feature Space (Cont'd). Define an $N \times N$ matrix $K$ by $[K]_{ij} = K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$. Then we have $\lambda N K \alpha = K^2 \alpha$, which, for nonzero eigenvalues, is further simplified to $N \lambda \alpha = K \alpha$.

Normalization. Let $\lambda_1 \le \cdots \le \lambda_N$ denote the eigenvalues of $K$ and $\alpha^1, \ldots, \alpha^N$ their corresponding eigenvectors, with $\lambda_p$ being the first nonzero eigenvalue. We normalize $\alpha^p, \ldots, \alpha^N$ by requiring that the corresponding vectors in $\mathcal{F}$ be normalized, i.e., $\langle V^k, V^k \rangle = 1$ for $k = p, \ldots, N$, leading to $\sum_{i=1}^{N} \sum_{j=1}^{N} \langle \alpha_i^k \phi(x_i), \alpha_j^k \phi(x_j) \rangle = 1$, i.e., $\sum_{i,j} \alpha_i^k \alpha_j^k K_{ij} = \langle \alpha^k, K \alpha^k \rangle = \lambda_k \langle \alpha^k, \alpha^k \rangle = 1$.

Compute Nonlinear Components. In linear PCA, principal components are extracted by projecting the data $x$ onto the eigenvectors $v^k$ of the covariance matrix $C$, i.e., $\langle v^k, x \rangle$. In kernel PCA, we likewise project $\phi(x)$ onto the eigenvectors $V^k$ of $C$, i.e., $\langle V^k, \phi(x) \rangle = \sum_{i=1}^{N} \alpha_i^k \langle \phi(x_i), \phi(x) \rangle = \sum_{i=1}^{N} \alpha_i^k k(x_i, x)$.

Centering in Feature Space. Define $\tilde{\phi}(x_t) = \phi(x_t) - \frac{1}{N}\sum_{l=1}^{N} \phi(x_l)$. Then we have $\tilde{K}_{ij} = \langle \tilde{\phi}(x_i), \tilde{\phi}(x_j) \rangle = \langle \phi(x_i) - \frac{1}{N}\sum_{l=1}^{N} \phi(x_l),\ \phi(x_j) - \frac{1}{N}\sum_{k=1}^{N} \phi(x_k) \rangle = K_{ij} - \frac{1}{N}\sum_{k} K_{ik} - \frac{1}{N}\sum_{l} K_{lj} + \frac{1}{N^2}\sum_{l,k} K_{lk}$. Therefore, the centered kernel matrix is given by $\tilde{K} = K - \mathbf{1}_{N \times N} K - K \mathbf{1}_{N \times N} + \mathbf{1}_{N \times N} K \mathbf{1}_{N \times N}$.

Algorithm Outline: Kernel PCA
1. Given a set of $m$-dimensional training data $\{x_k\}$, $k = 1, \ldots, N$, compute the kernel matrix $K = [k(x_i, x_j)] \in \mathbb{R}^{N \times N}$.
2. Carry out centering in feature space (so that $\sum_{k=1}^{N} \tilde{\phi}(x_k) = 0$): $\tilde{K} = K - \mathbf{1}_{N \times N} K - K \mathbf{1}_{N \times N} + \mathbf{1}_{N \times N} K \mathbf{1}_{N \times N}$, where $\mathbf{1}_{N \times N} = \frac{1}{N} \begin{bmatrix} 1 & \cdots & 1 \\ \vdots & & \vdots \\ 1 & \cdots & 1 \end{bmatrix} \in \mathbb{R}^{N \times N}$.
3. Solve the eigenvalue problem $N\lambda\alpha = \tilde{K}\alpha$ and normalize $\alpha^k$ such that $\langle \alpha^k, \alpha^k \rangle = 1/\lambda_k$.
4. For a test pattern $x$, extract a nonlinear component via $\langle V^k, \phi(x) \rangle = \sum_{i=1}^{N} \alpha_i^k k(x_i, x)$.
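
The outline above translates almost line by line into NumPy. The sketch below (with an RBF kernel and illustrative names) follows steps 1-4; for brevity it normalizes with the eigenvalues of the centered kernel matrix, as on the normalization slide, and it omits the centering of test patterns.

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_pca(X, n, kernel=rbf):
    """X: (N, m) training data as rows; returns the alpha coefficients, shape (N, n)."""
    N = X.shape[0]
    K = kernel(X, X)                                  # step 1: kernel matrix
    one = np.ones((N, N)) / N                         # 1_{NxN}, all entries 1/N
    Kc = K - one @ K - K @ one + one @ K @ one        # step 2: centering in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)             # step 3: eigendecomposition (ascending)
    lam = eigvals[::-1][:n]                           # top-n eigenvalues (assumed positive)
    alphas = eigvecs[:, ::-1][:, :n] / np.sqrt(lam)   # normalize: lambda_k <alpha^k, alpha^k> = 1
    return alphas

def components(x_new, X, alphas, kernel=rbf):
    """Step 4 (test-point centering omitted for brevity): sum_i alpha_i^k k(x_i, x)."""
    return kernel(np.atleast_2d(x_new), X) @ alphas   # (M, n) nonlinear components
```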

Toy Example. [Figure: contour plots of the first eight kernel principal components on a two-dimensional toy data set; the corresponding eigenvalues are 0.251, 0.233, 0.052, 0.044, 0.037, 0.033, 0.031, and 0.025.]

Multidimensional Scaling (MDS). Let $\{x_t \in \mathbb{R}^m\}_{t=1}^N$ be given data points and $\{y_t \in \mathbb{R}^n\}_{t=1}^N$ be the lower-dimensional images of the $x_t$. Let $\delta_{ij}$ be the distance (dissimilarity) between $x_i$ and $x_j$, and $d_{ij}$ the distance between $y_i$ and $y_j$. The aim of MDS is to find a configuration of lower-dimensional images such that the distances $\{d_{ij}\}$ match the dissimilarities $\{\delta_{ij}\}$ as well as possible.

Algorithm Outline: Classical Scaling
1. Obtain dissimilarities $\delta_{ij} = \|x_i - x_j\|$.
2. Compute the matrix $A = \left[-\frac{1}{2}\delta_{ij}^2\right]$.
3. Construct the Gram matrix $B = HAH \in \mathbb{R}^{N \times N}$, where $H = I - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^\top$ is the centering matrix.
4. The rank-$n$ approximation of $B$ is given by $B = V_1 \Lambda_1 V_1^\top = Y^\top Y$, where $V_1 \in \mathbb{R}^{N \times n}$ and $\Lambda_1 \in \mathbb{R}^{n \times n}$.
5. The coordinates of the $N$ points in the $n$-dimensional Euclidean space are given by $Y = \Lambda_1^{1/2} V_1^\top = [y_1, \ldots, y_N] \in \mathbb{R}^{n \times N}$. That is, each column $y_i \in \mathbb{R}^n$ of $Y$ is an embedding of $x_i$.
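
A direct NumPy sketch of steps 1-5, assuming the input X holds one sample per column; the function name and example layout are illustrative.

```python
import numpy as np

def classical_mds(X, n):
    """X: (m, N) data matrix, one sample per column; returns Y (n, N)."""
    N = X.shape[1]
    D = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)   # step 1: delta_ij
    A = -0.5 * D**2                                              # step 2
    H = np.eye(N) - np.ones((N, N)) / N                          # centering matrix
    B = H @ A @ H                                                # step 3: Gram matrix
    eigvals, V = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:n]                          # step 4: rank-n approximation
    V1, L1 = V[:, idx], eigvals[idx]                             # top-n eigenpairs (assumed >= 0)
    Y = np.sqrt(L1)[:, None] * V1.T                              # step 5: Y = Lambda^{1/2} V1^T
    return Y
```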

Properties of the Centering Matrix
1. $H = I - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^\top$
2. $[HX]_{ij} = X_{ij} - \frac{1}{N}\sum_{k=1}^{N} X_{kj}$ (column-wise centering)
3. $[XH]_{ij} = X_{ij} - \frac{1}{N}\sum_{k=1}^{N} X_{ik}$ (row-wise centering)
4. $H^2 = HH = H$ (idempotent)

Core Idea: $B = \bar{X}^\top \bar{X} = HAH$, where $\bar{X} = XH$ is the centered data matrix. One can easily show that $B_{ij} = (x_i - \bar{x})^\top (x_j - \bar{x}) = x_i^\top x_j - x_i^\top \bar{x} - \bar{x}^\top x_j + \bar{x}^\top \bar{x} = -\frac{1}{2}\delta_{ij}^2 + \frac{1}{2N}\sum_i \delta_{ij}^2 + \frac{1}{2N}\sum_j \delta_{ij}^2 - \frac{1}{2N^2}\sum_{i,j}\delta_{ij}^2$. On the other hand, $[HAH]_{ij} = \left[\left(I - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^\top\right) A \left(I - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^\top\right)\right]_{ij} = \left[A - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^\top A - \frac{1}{N}A\mathbf{1}_N\mathbf{1}_N^\top + \frac{1}{N^2}\mathbf{1}_N\mathbf{1}_N^\top A \mathbf{1}_N\mathbf{1}_N^\top\right]_{ij} = [A]_{ij} - \frac{1}{N}\sum_i [A]_{ij} - \frac{1}{N}\sum_j [A]_{ij} + \frac{1}{N^2}\sum_{i,j} [A]_{ij}$, which coincides with the expression above when $A = \left[-\frac{1}{2}\delta_{ij}^2\right]$.
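
This identity is easy to verify numerically; the following quick check (the array shapes are assumptions for the example) confirms that the double-centered squared-distance matrix reproduces the Gram matrix of the centered data.

```python
import numpy as np

m, N = 5, 40
X = np.random.randn(m, N)                                   # data, one sample per column
H = np.eye(N) - np.ones((N, N)) / N
D2 = np.sum((X[:, :, None] - X[:, None, :])**2, axis=0)     # squared distances delta_ij^2
B_from_data = (X @ H).T @ (X @ H)                           # Gram matrix of centered data
B_from_dist = H @ (-0.5 * D2) @ H                           # H A H with A = [-1/2 delta_ij^2]
assert np.allclose(B_from_data, B_from_dist)
```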

Relation to PCA. One can easily prove that the classical scaling solution is nothing but the projection of the centered data onto the normalized principal directions: $\underbrace{\Lambda^{1/2} V^\top}_{\text{MDS}} = \Lambda^{-1/2} U^\top X H$, where $U = XHV$. Note that $V$ is the eigenvector matrix of $B$: $BV = V\Lambda$, i.e., $HX^\top XHV = V\Lambda$, so $(XH)HX^\top XHV = (XH)V\Lambda$, i.e., $NC\underbrace{XHV}_{U} = \underbrace{XHV}_{U}\Lambda$, where $C = \frac{1}{N}(XH)(XH)^\top$ is the covariance matrix.

Projection Property. Consider $U = XHV \in \mathbb{R}^{m \times n}$. Projecting the centered data $XH$ onto the principal directions $U$ yields $U^\top XH = V^\top HX^\top XH = V^\top B = \Lambda V^\top$. Therefore we have $\Lambda^{-1/2} U^\top XH = \Lambda^{1/2} V^\top$. PCA thus defines a mapping from the original space to the principal coordinates: given a new data point $x$, its projection onto the principal coordinates defined by the original $N$ data points can be computed as $y = \Lambda^{-1/2} U^\top x$.

Isomap: Tenenbaum et al., 2000
1. Construct a neighborhood graph ($k$-NN or $\epsilon$-ball).
2. Compute geodesic distances $D_{ij}$, with $D^2 = [D_{ij}^2] \in \mathbb{R}^{N \times N}$.
3. Construct a Gram matrix $K(D^2) = -\frac{1}{2} H D^2 H$.
4. Compute the top $n$ eigenvectors of $K$, i.e., $K = V_1 \Lambda_1 V_1^\top$, where $V_1 \in \mathbb{R}^{N \times n}$ and $\Lambda_1 \in \mathbb{R}^{n \times n}$.
5. The coordinates of the $N$ points in the $n$-dimensional Euclidean space are given by $Y = \Lambda_1^{1/2} V_1^\top = [y_1, \ldots, y_N] \in \mathbb{R}^{n \times N}$.
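
A sketch of this pipeline using SciPy's graph shortest-path routine for step 2; it assumes the $k$-NN graph is connected, and the value of k, the function name, and the data layout (one sample per row) are illustrative choices.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def isomap(X, n_components=2, k=10):
    """X: (N, m) data as rows; returns Y (N, n_components), one embedded point per row."""
    N = X.shape[0]
    dist = cdist(X, X)                                     # pairwise Euclidean distances
    graph = np.full((N, N), np.inf)                        # step 1: k-NN graph (inf = no edge)
    nn = np.argsort(dist, axis=1)[:, 1:k + 1]
    for i in range(N):
        graph[i, nn[i]] = dist[i, nn[i]]
    graph = np.minimum(graph, graph.T)                     # symmetrize the neighborhood graph
    D = shortest_path(graph, method="D", directed=False)   # step 2: geodesic distances
    H = np.eye(N) - np.ones((N, N)) / N
    K = -0.5 * H @ (D**2) @ H                              # step 3: Gram matrix
    eigvals, V = np.linalg.eigh(K)
    idx = np.argsort(eigvals)[::-1][:n_components]         # step 4: top-n eigenpairs
    Y = V[:, idx] * np.sqrt(eigvals[idx])                  # step 5: embedding coordinates
    return Y
```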

Kernel Isomap: Choi and Choi, 2005
1. Construct a neighborhood graph ($k$-NN or $\epsilon$-ball).
2. Compute geodesic distances $D_{ij}$, with $D^2 = [D_{ij}^2] \in \mathbb{R}^{N \times N}$.
3. Construct a Gram matrix $K(D^2) = -\frac{1}{2} H D^2 H$.
4. Compute the largest eigenvalue $c^*$ of the matrix $\begin{bmatrix} 0 & 2K(D^2) \\ -I & -4K(D) \end{bmatrix}$ and construct a Mercer kernel matrix $\tilde{K} = K(D^2) + 2cK(D) + \frac{1}{2}c^2 H$, where $\tilde{K}$ is guaranteed to be positive semidefinite for $c \ge c^*$.
5. Compute the top $n$ eigenvectors of $\tilde{K}$, i.e., $\tilde{K} = V_1 \Lambda_1 V_1^\top$, where $V_1 \in \mathbb{R}^{N \times n}$ and $\Lambda_1 \in \mathbb{R}^{n \times n}$.
6. The coordinates of the $N$ points in the $n$-dimensional Euclidean space are given by $Y = \Lambda_1^{1/2} V_1^\top = [y_1, \ldots, y_N] \in \mathbb{R}^{n \times N}$.

Kernel Isomap: An Example. Noisy Swiss roll: Isomap, kernel Isomap, and kernel Isomap with projection.

Locally Linear Embedding (LLE): Roweis and Saul, 2000

Algorithm Outline: LLE
1. Determine weights $\{W_{ij}\}$ via $\arg\min_{W_{ij}} \sum_i \|x_i - \sum_j W_{ij} x_j\|^2$, subject to $W_{ij} = 0$ if $x_j \notin \mathcal{N}_i$ and $\sum_j W_{ij} = 1$.
2. Fix $\{W_{ij}\}$ and optimize the coordinates $y_i$ by minimizing the embedding cost function $\mathcal{J}(y) = \sum_i \|y_i - \sum_j W_{ij} y_j\|^2$, subject to the two constraints $\sum_i y_i = 0$ and $\frac{1}{N}\sum_i y_i y_i^\top = I$.
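
A compact sketch of the two steps (assuming one sample per row); the small regularization term reg is a common practical addition to keep the local Gram matrix invertible and is not part of the slide.

```python
import numpy as np

def lle(X, n_components=2, k=10, reg=1e-3):
    """X: (N, m) data as rows; returns Y (N, n_components)."""
    N = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    nn = np.argsort(dist, axis=1)[:, 1:k + 1]               # k nearest neighbours of each point
    W = np.zeros((N, N))
    for i in range(N):                                      # step 1: weights summing to one
        Z = X[nn[i]] - X[i]                                 # neighbours shifted to the origin
        G = Z @ Z.T                                         # local Gram matrix
        G += reg * np.trace(G) * np.eye(k)                  # regularize in case G is singular
        w = np.linalg.solve(G, np.ones(k))
        W[i, nn[i]] = w / w.sum()
    # step 2: minimizing the embedding cost leads to the bottom eigenvectors of (I-W)^T (I-W)
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    eigvals, V = np.linalg.eigh(M)
    Y = V[:, 1:n_components + 1]                            # discard the constant eigenvector
    return Y
```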

Algorithm Outline: Laplacian Eigenmap
1. Construct a neighborhood graph.
2. Choose edge weights: $W_{ij} = e^{-\|x_i - x_j\|^2 / \sigma}$ if nodes $i$ and $j$ are connected, and $W_{ij} = 0$ otherwise.
3. Solve the generalized eigenvalue problem $L v_i = \lambda_i D v_i$, where $D$ is the degree matrix, a diagonal matrix with entries $D_{ii} = \sum_j W_{ij}$, and $L = D - W$ is the graph Laplacian. The eigenvalues satisfy $0 = \lambda_0 \le \lambda_1 \le \cdots \le \lambda_{N-1}$.
4. The low-dimensional embedding $Y \in \mathbb{R}^{n \times N}$ is given by $Y = [v_1, \ldots, v_n]^\top$.
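
A sketch of steps 1-4 with heat-kernel weights on a $k$-NN graph; the values of k and sigma are illustrative, and the embedding is returned with one point per row.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(X, n_components=2, k=10, sigma=1.0):
    """X: (N, m) data as rows; returns Y (N, n_components)."""
    N = X.shape[0]
    dist2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
    nn = np.argsort(dist2, axis=1)[:, 1:k + 1]         # step 1: k-NN neighborhood graph
    W = np.zeros((N, N))
    for i in range(N):                                 # step 2: heat-kernel edge weights
        W[i, nn[i]] = np.exp(-dist2[i, nn[i]] / sigma)
    W = np.maximum(W, W.T)                             # symmetrize the graph
    D = np.diag(W.sum(axis=1))                         # degree matrix
    L = D - W                                          # graph Laplacian
    eigvals, V = eigh(L, D)                            # step 3: L v = lambda D v (ascending)
    Y = V[:, 1:n_components + 1]                       # step 4: skip the trivial eigenvector
    return Y
```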

Optimal Embedding. Consider an embedding from $x_i \in \mathbb{R}^m$ to $y_i \in \mathbb{R}$. A reasonable criterion for choosing a good map is to minimize the objective function $\mathcal{J} = \frac{1}{2}\sum_{i,j} (y_i - y_j)^2 W_{ij} = y^\top L y$, where $L = D - W$. The minimization problem therefore reduces to finding $\arg\min_y y^\top L y$, subject to $y^\top D y = 1$.
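
The identity $\frac{1}{2}\sum_{i,j}(y_i - y_j)^2 W_{ij} = y^\top L y$ is easy to check numerically, for instance with a random symmetric weight matrix:

```python
import numpy as np

N = 30
W = np.random.rand(N, N); W = (W + W.T) / 2; np.fill_diagonal(W, 0)   # symmetric weights
L = np.diag(W.sum(axis=1)) - W                                        # graph Laplacian
y = np.random.randn(N)
lhs = 0.5 * np.sum(W * (y[:, None] - y[None, :])**2)
assert np.isclose(lhs, y @ L @ y)
```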

Locality Preserving Projection (LPP): He and Niyogi, 2003. Consider a linear mapping $y_i = \Psi^\top x_i$ in the framework of the Laplacian eigenmap, where $\Psi \in \mathbb{R}^{m \times n}$. Then the mapping $\Psi$ is determined by solving $\arg\min_{\Psi} \mathrm{tr}\left\{\Psi^\top X L X^\top \Psi\right\}$, subject to $\Psi^\top X D X^\top \Psi = I$.
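
A sketch of LPP as a generalized eigenvalue problem; the affinity matrix W is assumed to be built as in the Laplacian eigenmap above, and the small ridge term is an assumption added so that $X D X^\top$ is invertible.

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, W, n_components=2, ridge=1e-6):
    """X: (m, N) data as columns, W: (N, N) affinity matrix; returns Psi (m, n_components)."""
    D = np.diag(W.sum(axis=1))                 # degree matrix
    L = D - W                                  # graph Laplacian
    A = X @ L @ X.T
    B = X @ D @ X.T + ridge * np.eye(X.shape[0])   # ridge keeps B positive definite
    eigvals, V = eigh(A, B)                    # generalized problem, smallest eigenvalues first
    Psi = V[:, :n_components]                  # locality-preserving projection directions
    return Psi                                 # embed a new point x via y = Psi.T @ x
```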

Manifolds of Spatial Hearing: Choi and Choi, 2007

Laplacianfaces: He et al., 2005

Manifolds of Human Motion: Elgammal and Lee, 2004

Manifolds of Human Motion (Cont'd)

Human Emotion in HRI: Ho et al., 2008

Human Emotion in HRI (Cont'd)

Human Emotion in HRI (Cont'd)