Manifold Learning: Theory and Applications to HRI Seungjin Choi Department of Computer Science Pohang University of Science and Technology, Korea seungjin@postech.ac.kr August 19, 2008 1 / 46
A Greek Philosopher Said... Heraclitus (old days): You can never step in the same river twice. Heraclitus (now): You can never see the same face twice. 2 / 46
Manifold Ways of Perception: Seung and Lee 2000 3 / 46
Manifold Learning: Example 1 Fingers extension Wrist rotation 4 / 46
Manifold Learning: Example 2 5 / 46
Manifold Learning: Example 3 6 / 46
Why Manifold? 7 / 46
Principal Component Analysis (PCA) Given a data matrix $X = [x_1, \ldots, x_N] \in \mathbb{R}^{m \times N}$, PCA aims at finding a linear orthogonal transformation $W$ ($W^\top W = I$) such that $\mathrm{tr}\{YY^\top\}$ is maximized, where $Y = W^\top X$. It turns out that $W$ corresponds to the first $n$ eigenvectors of the data covariance matrix $C = \frac{1}{N}(XH)(XH)^\top$, where $H = I - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^\top$: $W = U \in \mathbb{R}^{m \times n}$, where $C \approx UDU^\top$ (eigen-decomposition). 8 / 46
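A minimal NumPy sketch of this procedure follows. The function name `pca` and the interface (data stored in the columns of an $m \times N$ array, as in the slide) are illustrative choices, not part of the original material.

```python
import numpy as np

def pca(X, n_components):
    """Return the top-n principal directions W and projections Y = W^T (X H)."""
    m, N = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)      # row-wise centering, i.e., XH
    C = Xc @ Xc.T / N                           # covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(C)        # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, idx]                         # first n eigenvectors of C
    return W, W.T @ Xc

# Example: project 3-D points onto their two leading principal directions.
X = np.random.randn(3, 100)
W, Y = pca(X, 2)
```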
PCA: An Example 9 / 46
Learning in Feature Space It is important to choose a representation that matches the specific learning problem. Change the representation of the data! $x = [x_1, \ldots, x_m]^\top \longmapsto \phi(x) = [\phi_1(x), \ldots, \phi_r(x)]^\top$, where $\phi: \mathbb{R}^m \to \mathcal{F}$ (feature space). The feature space is $\{\phi(x) \mid x \in \mathcal{X}\}$ (it could be an infinite-dimensional space, i.e., $r = \infty$). 10 / 46
Why a Nonlinear Mapping? 11 / 46
What is a Kernel? Consider a nonlinear mapping $\phi: \mathbb{R}^m \to \mathcal{F}$ (feature space), where $\phi(x) = [\phi_1(x), \ldots, \phi_r(x)]^\top$ ($r$ could be infinite). Definition (Kernel): A kernel is a function $k$ such that for all $x, y \in \mathcal{X}$, $k(x, y) = \langle \phi(x), \phi(y) \rangle$, where $\phi$ is a mapping from $\mathcal{X}$ to an (inner product) feature space $\mathcal{F}$ (dot product space). 12 / 46
Various Kernels Polynomial kernel: $k(x, y) = \langle x, y \rangle^d$. RBF kernel: $k(x, y) = \exp\left\{ -\frac{\|x - y\|^2}{2\sigma^2} \right\}$. Sigmoid kernel: $k(x, y) = \tanh(\kappa \langle x, y \rangle + \theta)$, for suitable values of gain $\kappa$ and threshold $\theta$. 13 / 46
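A short sketch of the three kernels listed above, for 1-D NumPy vectors; the default parameter values (d, sigma, kappa, theta) are arbitrary illustrative choices.

```python
import numpy as np

def poly_kernel(x, y, d=3):
    """Polynomial kernel <x, y>^d."""
    return (x @ y) ** d

def rbf_kernel(x, y, sigma=1.0):
    """RBF kernel exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    """Sigmoid kernel tanh(kappa <x, y> + theta)."""
    return np.tanh(kappa * (x @ y) + theta)
```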
Reproducing Kernels Define a map $\phi: x \mapsto k(\cdot, x)$. Reproducing kernels satisfy $\langle k(\cdot, x), f \rangle = f(x)$ and $\langle k(\cdot, x), k(\cdot, y) \rangle = k(x, y)$, so that $\langle \phi(x), \phi(y) \rangle = k(x, y)$. 14 / 46
RKHS and Kernels Theorem (relating kernels and RKHSs): a) For every RKHS there exists a unique positive definite function, called the reproducing kernel (RK). b) Conversely, for every positive definite function $k$ on $\mathcal{X} \times \mathcal{X}$ there is a unique RKHS with $k$ as its RK. 15 / 46
Mercer's Theorem Theorem (Mercer): If $k$ is a continuous symmetric kernel of a positive integral operator $T$, i.e., $(Tf)(y) = \int_C k(x, y) f(x)\, dx$ with $\int_C \int_C k(x, y) f(x) f(y)\, dx\, dy \geq 0$ for all $f \in L_2(C)$ ($C$ being a compact subset of $\mathbb{R}^m$), then it can be expanded in a uniformly convergent series (on $C \times C$) in terms of $T$'s eigenfunctions $\varphi_j$ and positive eigenvalues $\lambda_j$: $k(x, y) = \sum_{j=1}^{r} \lambda_j \varphi_j(x) \varphi_j(y)$, where $r$ is the number of positive eigenvalues. 16 / 46
PCA: Using Dot Products Given a set of data (with zero mean), $x_k \in \mathbb{R}^m$, $k = 1, \ldots, N$, the sample covariance matrix $C$ is given by $C = \frac{1}{N} \sum_{j=1}^{N} x_j x_j^\top$. For PCA, one has to solve the eigenvalue equation $Cv = \lambda v$. Note that $Cv = \frac{1}{N} \sum_{j=1}^{N} x_j x_j^\top v = \frac{1}{N} \sum_{j=1}^{N} \langle x_j, v \rangle x_j$. This implies that all solutions $v$ with $\lambda \neq 0$ must lie in the span of $x_1, \ldots, x_N$. Hence $Cv = \lambda v$ is equivalent to $\lambda \langle x_k, v \rangle = \langle x_k, Cv \rangle$, $k = 1, \ldots, N$. 17 / 46
PCA in Feature Space Consider a nonlinear mapping $\phi: \mathbb{R}^m \to \mathcal{F}$ (feature space). Assume $\sum_{k=1}^{N} \phi(x_k) = 0$. The covariance matrix $C$ in the feature space $\mathcal{F}$ is $C = \frac{1}{N} \sum_{j=1}^{N} \phi(x_j) \phi^\top(x_j)$. As in PCA, one has to solve the eigenvalue problem $\lambda V = CV$. Again, all solutions $V$ with $\lambda \neq 0$ lie in the span of $\phi(x_1), \ldots, \phi(x_N)$, which leads to $\lambda \langle \phi(x_k), V \rangle = \langle \phi(x_k), CV \rangle$, $k = 1, \ldots, N$. 18 / 46
PCA in Feature Space (Cont'd) There exist coefficients $\{\alpha_i\}$ such that $V = \sum_{i=1}^{N} \alpha_i \phi(x_i)$. Substituting this relation into $\lambda \langle \phi(x_k), V \rangle = \langle \phi(x_k), CV \rangle$ yields $\lambda \sum_{i=1}^{N} \alpha_i \langle \phi(x_k), \phi(x_i) \rangle = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \langle \phi(x_k), \phi(x_j) \rangle \langle \phi(x_j), \phi(x_i) \rangle$, for $k = 1, \ldots, N$. 19 / 46
PCA in Feature Space (Cont'd) Define an $N \times N$ matrix $K$ by $[K]_{ij} = K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$. Then we have $N\lambda K\alpha = K^2 \alpha$, which simplifies to $N\lambda\alpha = K\alpha$ for nonzero eigenvalues. 20 / 46
Normalization Let $\lambda_1 \leq \cdots \leq \lambda_N$ denote the eigenvalues of $K$ and $\alpha^1, \ldots, \alpha^N$ their corresponding eigenvectors, with $\lambda_p$ being the first nonzero eigenvalue. We normalize $\alpha^p, \ldots, \alpha^N$ by requiring that the corresponding vectors in $\mathcal{F}$ be normalized, i.e., $\langle V^k, V^k \rangle = 1$, $k = p, \ldots, N$, leading to $\left\langle \sum_{i=1}^{N} \alpha_i^k \phi(x_i), \sum_{j=1}^{N} \alpha_j^k \phi(x_j) \right\rangle = 1 \;\Rightarrow\; \sum_{i}\sum_{j} \alpha_i^k \alpha_j^k K_{ij} = 1 \;\Rightarrow\; \langle \alpha^k, K\alpha^k \rangle = 1 \;\Rightarrow\; \lambda_k \langle \alpha^k, \alpha^k \rangle = 1$. 21 / 46
Compute Nonlinear Components In linear PCA, principal components are extracted by projecting the data $x$ onto the eigenvectors $v^k$ of the covariance matrix $C$, i.e., $\langle v^k, x \rangle$. In kernel PCA, we likewise project $\phi(x)$ onto the eigenvectors $V^k$ of $C$, i.e., $\langle V^k, \phi(x) \rangle = \sum_{i=1}^{N} \alpha_i^k \langle \phi(x_i), \phi(x) \rangle = \sum_{i=1}^{N} \alpha_i^k k(x_i, x)$. 22 / 46
Centering in Feature Space Define $\tilde{\phi}(x_t) = \phi(x_t) - \frac{1}{N} \sum_{l=1}^{N} \phi(x_l)$. Then we have $\tilde{K}_{ij} = \langle \tilde{\phi}(x_i), \tilde{\phi}(x_j) \rangle = \left\langle \phi(x_i) - \frac{1}{N} \sum_{l=1}^{N} \phi(x_l), \; \phi(x_j) - \frac{1}{N} \sum_{k=1}^{N} \phi(x_k) \right\rangle = K_{ij} - \frac{1}{N} \sum_{k} K_{ik} - \frac{1}{N} \sum_{l} K_{lj} + \frac{1}{N^2} \sum_{l,k} K_{lk}$. Therefore, the centered kernel matrix is given by $\tilde{K} = K - \mathbf{1}_{N \times N} K - K \mathbf{1}_{N \times N} + \mathbf{1}_{N \times N} K \mathbf{1}_{N \times N}$ (with $[\mathbf{1}_{N \times N}]_{ij} = \frac{1}{N}$). 23 / 46
Algorithm Outline: Kernel PCA 1. Given a set of $m$-dimensional training data $\{x_k\}$, $k = 1, \ldots, N$, compute the kernel matrix $K = [k(x_i, x_j)] \in \mathbb{R}^{N \times N}$. 2. Carry out centering in feature space so that $\sum_{k=1}^{N} \phi(x_k) = 0$: $\tilde{K} = K - \mathbf{1}_{N \times N} K - K \mathbf{1}_{N \times N} + \mathbf{1}_{N \times N} K \mathbf{1}_{N \times N}$, where $\mathbf{1}_{N \times N} = \frac{1}{N} \begin{bmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{bmatrix} \in \mathbb{R}^{N \times N}$. 3. Solve the eigenvalue problem $N\lambda\alpha = \tilde{K}\alpha$ and normalize $\alpha^k$ such that $\langle \alpha^k, \alpha^k \rangle = \frac{1}{\lambda_k}$. 4. For a test pattern $x$, extract a nonlinear component via $\langle V^k, \phi(x) \rangle = \sum_{i=1}^{N} \alpha_i^k k(x_i, x)$. 24 / 46
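The following is a minimal NumPy sketch of this outline, reusing one of the kernel functions defined earlier; function and variable names are illustrative. To mirror step 4 exactly, centering of the test kernel vector is omitted.

```python
import numpy as np

def kernel_pca(X, kernel, n):
    """X: (m x N) training data in columns; kernel: a function k(x, y) -> scalar."""
    N = X.shape[1]
    K = np.array([[kernel(X[:, i], X[:, j]) for j in range(N)] for i in range(N)])
    one = np.ones((N, N)) / N                        # the matrix 1_{NxN} from step 2
    Kc = K - one @ K - K @ one + one @ K @ one       # centered kernel matrix
    lam, alpha = np.linalg.eigh(Kc)                  # eigenvalues in ascending order
    lam, alpha = lam[::-1][:n], alpha[:, ::-1][:, :n]  # keep the top n (assumed positive)
    alpha = alpha / np.sqrt(lam)                     # so that <alpha^k, alpha^k> = 1/lambda_k
    return alpha

def kpca_components(x, X, alpha, kernel):
    """Nonlinear components of a test pattern x (step 4)."""
    k_vec = np.array([kernel(X[:, i], x) for i in range(X.shape[1])])
    return alpha.T @ k_vec
```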
Toy Example: contour plots of the first eight kernel principal components of a toy data set (eigenvalues 0.251, 0.233, 0.052, 0.044, 0.037, 0.033, 0.031, 0.025). 25 / 46
Multidimensional Scaling (MDS) Let $\{x_t \in \mathbb{R}^m\}_{t=1}^{N}$ be given data points and $\{y_t \in \mathbb{R}^n\}_{t=1}^{N}$ be the lower-dimensional images of $x_t$. Let $\delta_{ij}$ be the distance (dissimilarity) between $x_i$ and $x_j$, and $d_{ij}$ be the distance between $y_i$ and $y_j$. The aim of MDS is to find a configuration of lower-dimensional images such that the distances $\{d_{ij}\}$ match the dissimilarities $\{\delta_{ij}\}$ as well as possible. 26 / 46
Algorithm Outline: Classical Scaling 1. Obtain dissimilarities $\{\delta_{ij} = \|x_i - x_j\|\}$. 2. Compute the matrix $A = [-\frac{1}{2}\delta_{ij}^2]$. 3. Construct the Gram matrix $B = HAH \in \mathbb{R}^{N \times N}$, where $H = I - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^\top$ is the centering matrix. 4. The rank-$n$ approximation of $B$ is given by $B = V_1 \Lambda_1 V_1^\top = Y^\top Y$, where $V_1 \in \mathbb{R}^{N \times n}$ and $\Lambda_1 \in \mathbb{R}^{n \times n}$. 5. The coordinates of the $N$ points in the $n$-dimensional Euclidean space are given by $Y = \Lambda_1^{\frac{1}{2}} V_1^\top \in \mathbb{R}^{n \times N} = [y_1, \ldots, y_N]$. That is, each column $y_i \in \mathbb{R}^n$ of $Y$ is an embedding of $x_i$. 27 / 46
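A compact sketch of this outline, assuming Euclidean dissimilarities computed directly from the data; the interface (data in columns, function name classical_scaling) is an illustrative choice.

```python
import numpy as np

def classical_scaling(X, n):
    """Classical MDS from pairwise Euclidean distances of the columns of X."""
    N = X.shape[1]
    D2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)   # squared dissimilarities
    H = np.eye(N) - np.ones((N, N)) / N                         # centering matrix H
    B = H @ (-0.5 * D2) @ H                                     # Gram matrix B = H A H
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:n]                         # top-n eigenpairs (nonnegative)
    return np.sqrt(eigvals[idx])[:, None] * eigvecs[:, idx].T   # Y = Lambda_1^{1/2} V_1^T, (n x N)
```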
Properties of Centering Matrix 1. $H = I - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^\top$. 2. $[HX]_{ij} = X_{ij} - \frac{1}{N}\sum_{k=1}^{N} X_{kj}$ (column-wise centering). 3. $[XH]_{ij} = X_{ij} - \frac{1}{N}\sum_{k=1}^{N} X_{ik}$ (row-wise centering). 4. $H^2 = HH = H$ (idempotent). 28 / 46
Core Idea: $B = (XH)^\top(XH) = HAH$. One can easily show that $B_{ij} = (x_i - \bar{x})^\top (x_j - \bar{x}) = x_i^\top x_j - x_i^\top \bar{x} - \bar{x}^\top x_j + \bar{x}^\top \bar{x} = -\frac{1}{2}\delta_{ij}^2 + \frac{1}{2N}\sum_{l}\delta_{il}^2 + \frac{1}{2N}\sum_{l}\delta_{lj}^2 - \frac{1}{2N^2}\sum_{l,k}\delta_{lk}^2$. On the other hand, $[HAH]_{ij} = \left[\left(I - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^\top\right) A \left(I - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^\top\right)\right]_{ij} = \left[A - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^\top A - \frac{1}{N}A\mathbf{1}_N\mathbf{1}_N^\top + \frac{1}{N^2}\mathbf{1}_N\mathbf{1}_N^\top A\mathbf{1}_N\mathbf{1}_N^\top\right]_{ij} = [A]_{ij} - \frac{1}{N}A_{\cdot j} - \frac{1}{N}A_{i\cdot} + \frac{1}{N^2}A_{\cdot\cdot}$, where $A_{i\cdot}$, $A_{\cdot j}$, and $A_{\cdot\cdot}$ denote a row sum, a column sum, and the total sum of $A = [-\frac{1}{2}\delta_{ij}^2]$. 29 / 46
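As a quick numerical sanity check of this identity, the snippet below verifies on a small random data set (purely illustrative) that $HAH$ equals the Gram matrix of the centered data.

```python
import numpy as np

N, m = 6, 3
X = np.random.randn(m, N)
H = np.eye(N) - np.ones((N, N)) / N
D2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)   # delta_ij^2
A = -0.5 * D2

B_direct = (X @ H).T @ (X @ H)   # Gram matrix of the centered data
B_mds = H @ A @ H                # classical-scaling construction
assert np.allclose(B_direct, B_mds)
```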
Relation to PCA One can easily prove that the classical scaling solution is nothing but the projection of the centered data onto normalized principal directions: $\underbrace{\Lambda^{\frac{1}{2}} V^\top}_{\text{MDS}} = \Lambda^{-\frac{1}{2}} U^\top XH$, where $U = XHV$. Note that $V$ is the eigenvector matrix of $B$: $BV = V\Lambda$, i.e., $HX^\top XHV = V\Lambda$, so $(XH)HX^\top XHV = (XH)V\Lambda$, i.e., $NC\underbrace{XHV}_{U} = \underbrace{XHV}_{U}\Lambda$, where $C = \frac{1}{N}(XH)(XH)^\top$ is the covariance matrix. 30 / 46
Projection Property Consider $U = XHV \in \mathbb{R}^{m \times n}$. Projecting the centered data $XH$ onto the principal directions $U$ yields $U^\top XH = V^\top HX^\top XH = V^\top B = \Lambda V^\top$. Therefore we have $\Lambda^{-\frac{1}{2}} U^\top XH = \Lambda^{\frac{1}{2}} V^\top$. PCA defines a mapping from the original space to the principal coordinates. Given a new data point $x$, its projection onto the principal coordinates defined by the original $N$ data points can be computed as $y = \Lambda^{-\frac{1}{2}} U^\top x$. 31 / 46
Isomap: Tenenbaum et al., 2000 1. Construct a neighborhood graph ($k$-NN or $\epsilon$-ball). 2. Compute geodesic distances $D_{ij}$, with $D^2 = [D_{ij}^2] \in \mathbb{R}^{N \times N}$. 3. Construct a Gram matrix $K(D^2) = -\frac{1}{2} H D^2 H$. 4. Compute the top $n$ eigenvectors of $K$, i.e., $K \approx V_1 \Lambda_1 V_1^\top$, where $V_1 \in \mathbb{R}^{N \times n}$ and $\Lambda_1 \in \mathbb{R}^{n \times n}$. 5. The coordinates of the $N$ points in the $n$-dimensional Euclidean space are given by $Y = \Lambda_1^{\frac{1}{2}} V_1^\top \in \mathbb{R}^{n \times N} = [y_1, \ldots, y_N]$. 32 / 46
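A minimal sketch of this outline using a $k$-NN graph and SciPy's Dijkstra shortest paths for the geodesic distances; it assumes the neighborhood graph is connected, and the names (isomap, n_neighbors) are illustrative.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors, n_components):
    """X: (m x N) data in columns; returns an (n x N) embedding."""
    N = X.shape[1]
    d = np.sqrt(np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0))
    # k-NN graph: keep the k nearest neighbors of each point, then symmetrize
    G = np.full((N, N), np.inf)
    for i in range(N):
        nn = np.argsort(d[i])[1:n_neighbors + 1]
        G[i, nn] = d[i, nn]
    G = np.minimum(G, G.T)
    D = shortest_path(G, method='D')            # geodesic distances (assumes connectivity)
    H = np.eye(N) - np.ones((N, N)) / N
    K = -0.5 * H @ (D ** 2) @ H                 # Gram matrix K(D^2)
    eigvals, eigvecs = np.linalg.eigh(K)
    idx = np.argsort(eigvals)[::-1][:n_components]
    return np.sqrt(eigvals[idx])[:, None] * eigvecs[:, idx].T   # Lambda_1^{1/2} V_1^T
```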
Kernel Isomap: Choi and Choi, 2005 1. Construct a neighborhood graph ($k$-NN or $\epsilon$-ball). 2. Compute geodesic distances $D_{ij}$, with $D^2 = [D_{ij}^2] \in \mathbb{R}^{N \times N}$. 3. Construct a Gram matrix $K(D^2) = -\frac{1}{2} H D^2 H$. 4. Compute the largest eigenvalue, $c^*$, of the matrix $\begin{bmatrix} 0 & 2K(D^2) \\ -I & -4K(D) \end{bmatrix}$, and construct a Mercer kernel matrix $\tilde{K} = K(D^2) + 2cK(D) + \frac{1}{2}c^2 H$, where $\tilde{K}$ is guaranteed to be positive semidefinite for $c \geq c^*$. 5. Compute the top $n$ eigenvectors of $\tilde{K}$, i.e., $\tilde{K} \approx V_1 \Lambda_1 V_1^\top$, where $V_1 \in \mathbb{R}^{N \times n}$ and $\Lambda_1 \in \mathbb{R}^{n \times n}$. 6. The coordinates of the $N$ points in the $n$-dimensional Euclidean space are given by $Y = \Lambda_1^{\frac{1}{2}} V_1^\top \in \mathbb{R}^{n \times N} = [y_1, \ldots, y_N]$. 33 / 46
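The step that differs from plain Isomap is the additive-constant construction in step 4. The sketch below implements only that step, taking the geodesic distance matrix D (e.g., from the Isomap sketch above) as input; the function name and the use of dense eigensolvers are illustrative assumptions.

```python
import numpy as np

def kernel_isomap_gram(D):
    """D: (N x N) geodesic distance matrix; returns the Mercer kernel matrix K~."""
    N = D.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    K2 = -0.5 * H @ (D ** 2) @ H        # K(D^2)
    K1 = -0.5 * H @ D @ H               # K(D)
    # largest eigenvalue c* of the 2N x 2N block matrix [[0, 2K(D^2)], [-I, -4K(D)]]
    B = np.block([[np.zeros((N, N)), 2 * K2],
                  [-np.eye(N),       -4 * K1]])
    c = np.max(np.real(np.linalg.eigvals(B)))
    return K2 + 2 * c * K1 + 0.5 * c ** 2 * H   # positive semidefinite for c >= c*
```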
Kernel Isomap: An Example Noisy Swiss Roll, Isomap, Kernel Isomap, Kernel Isomap with projection. 34 / 46
Locally Linear Embedding (LLE): Roweis and Saul, 2000 35 / 46
Algorithm Outline: LLE 1. Determine weights $\{W_{ij}\}$ by solving $\arg\min_{W_{ij}} \sum_i \| x_i - \sum_j W_{ij} x_j \|^2$, subject to $W_{ij} = 0$ if $x_j \notin \mathcal{N}_i$ and $\sum_j W_{ij} = 1$. 2. Fix $\{W_{ij}\}$ and optimize the coordinates $y_i$ by minimizing the embedding cost function $\mathcal{J}(y) = \sum_i \| y_i - \sum_j W_{ij} y_j \|^2$, with the two constraints $\sum_i y_i = 0$ and $\frac{1}{N}\sum_i y_i y_i^\top = I$. 36 / 46
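A compact sketch of the two LLE steps. The regularization term `reg` is not part of the slide's outline; it is a common practical addition that keeps the local Gram matrices invertible when the number of neighbors exceeds the data dimension.

```python
import numpy as np

def lle(X, n_neighbors, n_components, reg=1e-3):
    """X: (m x N) data in columns; returns an (n x N) embedding."""
    m, N = X.shape
    d = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    W = np.zeros((N, N))
    for i in range(N):                              # step 1: reconstruction weights
        nbrs = np.argsort(d[i])[1:n_neighbors + 1]
        Z = X[:, nbrs] - X[:, [i]]                  # neighbors relative to x_i
        G = Z.T @ Z
        G += reg * np.trace(G) * np.eye(n_neighbors)   # regularize (assumed addition)
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, nbrs] = w / w.sum()                    # enforce sum_j W_ij = 1
    M = (np.eye(N) - W).T @ (np.eye(N) - W)         # step 2: embedding cost matrix
    eigvals, eigvecs = np.linalg.eigh(M)
    # drop the bottom (constant) eigenvector; scale so that (1/N) Y Y^T = I
    return eigvecs[:, 1:n_components + 1].T * np.sqrt(N)
```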
Algorithm Outline: Laplacian Eigenmap 1. Construct a neighborhood graph. 2. Choose edge weights $W_{ij} = \begin{cases} e^{-\frac{\|x_i - x_j\|^2}{\sigma}} & \text{if nodes } i \text{ and } j \text{ are connected}, \\ 0 & \text{otherwise}. \end{cases}$ 3. Solve the generalized eigenvalue problem $Lv_i = \lambda_i D v_i$, where $D$ is the degree matrix, a diagonal matrix with diagonal entries $D_{ii} = \sum_j W_{ij}$, and $L = D - W$ is the graph Laplacian. The eigenvalues are $0 = \lambda_0 \leq \lambda_1 \leq \cdots \leq \lambda_{N-1}$. 4. The low-dimensional embedding $Y \in \mathbb{R}^{n \times N}$ is given by $Y = [v_1, \ldots, v_n]^\top$. 37 / 46
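A minimal sketch of this outline with a $k$-NN graph and heat-kernel weights; the graph construction details (symmetrization, parameter names) are illustrative choices, and it assumes every node ends up with at least one neighbor.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(X, n_neighbors, n_components, sigma=1.0):
    """X: (m x N) data in columns; returns an (n x N) embedding."""
    N = X.shape[1]
    d2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(d2[i])[1:n_neighbors + 1]
        W[i, nbrs] = np.exp(-d2[i, nbrs] / sigma)   # heat-kernel edge weights
    W = np.maximum(W, W.T)                          # symmetrize the graph
    D = np.diag(W.sum(axis=1))                      # degree matrix
    L = D - W                                       # graph Laplacian
    eigvals, eigvecs = eigh(L, D)                   # generalized eigenproblem L v = lambda D v
    return eigvecs[:, 1:n_components + 1].T         # skip the trivial eigenvector v_0
```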
Optimal Embedding Consider an embedding from $x_i \in \mathbb{R}^m$ to $y_i \in \mathbb{R}$. A reasonable criterion for choosing a good map is to minimize the following objective function: $\mathcal{J} = \frac{1}{2}\sum_{i,j}(y_i - y_j)^2 W_{ij} = y^\top L y$, where $L = D - W$. Therefore the minimization problem reduces to finding $\arg\min_y y^\top L y$, subject to $y^\top D y = 1$. 38 / 46
Locality Preserving Projection (LPP): He and Niyogi, 2003 Consider a linear mapping $y_i = \Psi^\top x_i$ in the framework of the Laplacian eigenmap, where $\Psi \in \mathbb{R}^{m \times n}$. Then the mapping $\Psi$ is determined by solving $\arg\min_{\Psi} \mathrm{tr}\left\{ \Psi^\top X L X^\top \Psi \right\}$, subject to $\Psi^\top X D X^\top \Psi = I$. 39 / 46
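A short sketch of the LPP objective, taking the affinity matrix W from the Laplacian-eigenmap construction as input; it assumes $X D X^\top$ is nonsingular (e.g., more samples than dimensions), and the interface is an illustrative choice.

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, W, n_components):
    """X: (m x N) data in columns, W: (N x N) graph affinity matrix."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    A = X @ L @ X.T
    B = X @ D @ X.T                      # assumed nonsingular
    eigvals, eigvecs = eigh(A, B)        # smallest generalized eigenvalues come first
    Psi = eigvecs[:, :n_components]      # columns of Psi are the projection directions
    return Psi, Psi.T @ X                # linear embedding y_i = Psi^T x_i
```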
Manifolds of Spatial Hearing: Choi and Choi, 2007 40 / 46
Laplacianfaces: He et al., 2005 41 / 46
Manifolds of Human Motion: Elgammal and Lee, 2004 42 / 46
Manifolds of Human Motion (Cont d) 43 / 46
Human Emotion in HRI: Ho et al., 2008 44 / 46
Human Emotion in HRI (Cont d) 45 / 46
Human Emotion in HRI (Cont d) 46 / 46