Manifold Learning: Theory and Applications to HRI

1 Manifold Learning: Theory and Applications to HRI. Seungjin Choi, Department of Computer Science, Pohang University of Science and Technology, Korea. August 19.

2 A Greek Philosopher Said... Heraclitus, then: You can never step in the same river twice. Heraclitus, now: You can never see the same face twice.

3 Manifold Ways of Perception: Seung and Lee, 2000

4 Manifold Learning: Example 1 (figure: hand images varying along two degrees of freedom, fingers extension and wrist rotation)

5 Manifold Learning: Example 2

6 Manifold Learning: Example 3

7 Why Manifold?

8 Principal Component Analysis (PCA). Given a data matrix $X = [x_1, \ldots, x_N] \in \mathbb{R}^{m \times N}$, PCA aims at finding a linear orthogonal transformation $W$ ($W^\top W = I$) such that $\mathrm{tr}\{YY^\top\}$ is maximized, where $Y = W^\top X$. It turns out that $W$ corresponds to the first $n$ eigenvectors of the data covariance matrix $C = \frac{1}{N}(XH)(XH)^\top$, where $H = I - \frac{1}{N}\mathbf{1}_N \mathbf{1}_N^\top$ is the centering matrix: $W = U \in \mathbb{R}^{m \times n}$, where $C = UDU^\top$ (eigendecomposition).
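
For concreteness, here is a minimal NumPy sketch of this eigendecomposition view of PCA; the function name and the interface (data stored one sample per column) are my choices, not from the slides:

```python
import numpy as np

def pca(X, n):
    """PCA via eigendecomposition of C = (1/N)(XH)(XH)^T.

    X : (m, N) data matrix, one sample per column.
    Returns W (m, n), the top-n principal directions, and Y = W^T X.
    """
    m, N = X.shape
    H = np.eye(N) - np.ones((N, N)) / N   # centering matrix
    C = (X @ H) @ (X @ H).T / N           # covariance of centered data
    eigvals, U = np.linalg.eigh(C)        # eigenvalues in ascending order
    W = U[:, ::-1][:, :n]                 # top-n eigenvectors
    return W, W.T @ X
```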

9 PCA: An Example

10 Learning in Feature Space. It is important to choose a representation that matches the specific learning problem, so change the representation of the data: $x = [x_1, \ldots, x_m]^\top \mapsto \phi(x) = [\phi_1(x), \ldots, \phi_r(x)]^\top$, where $\phi : \mathbb{R}^m \to \mathcal{F}$ (the feature space). The feature space is $\{\phi(x) \mid x \in \mathcal{X}\}$, which could be an infinite-dimensional space, i.e., $r = \infty$.

11 Why a Nonlinear Mapping?

12 What is a Kernel? Consider a nonlinear mapping $\phi : \mathbb{R}^m \to \mathcal{F}$ (feature space), where $\phi(x) = [\phi_1(x), \ldots, \phi_r(x)]^\top$ ($r$ could be infinite). Definition (Kernel): A kernel is a function $k$ such that for all $x, y \in \mathcal{X}$, $k(x, y) = \langle \phi(x), \phi(y) \rangle$, where $\phi$ is a mapping from $\mathcal{X}$ to an inner-product (dot-product) feature space $\mathcal{F}$.

13 Various Kernels. Polynomial kernel: $k(x, y) = \langle x, y \rangle^d$. RBF kernel: $k(x, y) = \exp\left\{-\frac{\|x - y\|^2}{2\sigma^2}\right\}$. Sigmoid kernel: $k(x, y) = \tanh(\kappa \langle x, y \rangle + \theta)$, for suitable values of the gain $\kappa$ and threshold $\theta$.
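
These three kernels are one-liners in code; a sketch, with my own parameter defaults:

```python
import numpy as np

def poly_kernel(x, y, d=2):
    return np.dot(x, y) ** d

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    # positive definite only for suitable values of kappa and theta
    return np.tanh(kappa * np.dot(x, y) + theta)
```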

14 Reproducing Kernels. Define a map $\phi : x \mapsto k(\cdot, x)$. Reproducing kernels satisfy $\langle k(\cdot, x), f \rangle = f(x)$ and $\langle k(\cdot, x), k(\cdot, y) \rangle = k(x, y)$, hence $\langle \phi(x), \phi(y) \rangle = k(x, y)$.

15 RKHS and Kernels. Theorem (relating kernels and RKHSs): (a) For every RKHS there exists a unique positive definite function, called the reproducing kernel (RK). (b) Conversely, for every positive definite function $k$ on $\mathcal{X} \times \mathcal{X}$ there is a unique RKHS with $k$ as its RK.

16 Mercer's Theorem. Theorem (Mercer): If $k$ is a continuous symmetric kernel of a positive integral operator $T$, i.e., $(Tf)(y) = \int_C k(x, y) f(x)\, dx$ with $\int_C \int_C k(x, y) f(x) f(y)\, dx\, dy \geq 0$ for all $f \in L_2(C)$ ($C$ being a compact subset of $\mathbb{R}^m$), then it can be expanded in a uniformly convergent series (on $C \times C$) in terms of $T$'s eigenfunctions $\varphi_j$ and positive eigenvalues $\lambda_j$: $k(x, y) = \sum_{j=1}^{r} \lambda_j \varphi_j(x) \varphi_j(y)$, where $r$ is the number of positive eigenvalues.

17 PCA: Using Dot Products. Given a set of data with zero mean, $x_k \in \mathbb{R}^m$, $k = 1, \ldots, N$, the sample covariance matrix $C$ is given by $C = \frac{1}{N} \sum_{j=1}^{N} x_j x_j^\top$. For PCA, one has to solve the eigenvalue equation $Cv = \lambda v$. Note that $Cv = \frac{1}{N} \sum_{j=1}^{N} x_j x_j^\top v = \frac{1}{N} \sum_{j=1}^{N} \langle x_j, v \rangle\, x_j$. This implies that all solutions $v$ with $\lambda \neq 0$ must lie in the span of $x_1, \ldots, x_N$. Hence $Cv = \lambda v$ is equivalent to $\lambda \langle x_k, v \rangle = \langle x_k, Cv \rangle$ for $k = 1, \ldots, N$.

18 PCA in Feature Space. Consider a nonlinear mapping $\phi : \mathbb{R}^m \to \mathcal{F}$ (feature space) and assume $\sum_{k=1}^{N} \phi(x_k) = 0$. The covariance matrix $\bar{C}$ in the feature space $\mathcal{F}$ is $\bar{C} = \frac{1}{N} \sum_{j=1}^{N} \phi(x_j) \phi^\top(x_j)$. As in linear PCA, one has to solve the eigenvalue problem $\lambda V = \bar{C} V$. Again, all solutions $V$ with $\lambda \neq 0$ lie in the span of $\phi(x_1), \ldots, \phi(x_N)$, which leads to $\lambda \langle \phi(x_k), V \rangle = \langle \phi(x_k), \bar{C} V \rangle$, $k = 1, \ldots, N$.

19 PCA in Feature Space (Cont'd). There exist coefficients $\{\alpha_i\}$ such that $V = \sum_{i=1}^{N} \alpha_i \phi(x_i)$. Substituting this relation into $\lambda \langle \phi(x_k), V \rangle = \langle \phi(x_k), \bar{C} V \rangle$ yields $\lambda \sum_{i=1}^{N} \alpha_i \langle \phi(x_k), \phi(x_i) \rangle = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \langle \phi(x_k), \phi(x_j) \rangle \langle \phi(x_j), \phi(x_i) \rangle$, for $k = 1, \ldots, N$.

20 PCA in Feature Space (Cont'd). Define an $N \times N$ matrix $K$ by $[K]_{ij} = K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$. Then we have $\lambda N K \alpha = K^2 \alpha$, which simplifies, for nonzero eigenvalues, to $N \lambda \alpha = K \alpha$.

21 Normalization. Let $\lambda_1 \leq \cdots \leq \lambda_N$ denote the eigenvalues of $K$ and $\alpha^1, \ldots, \alpha^N$ their corresponding eigenvectors, with $\lambda_p$ being the first nonzero eigenvalue. We normalize $\alpha^p, \ldots, \alpha^N$ by requiring that the corresponding vectors in $\mathcal{F}$ be normalized, i.e., $\langle V^k, V^k \rangle = 1$ for $k = p, \ldots, N$, which leads to $\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i^k \alpha_j^k \langle \phi(x_i), \phi(x_j) \rangle = \sum_{i,j} \alpha_i^k \alpha_j^k K_{ij} = \langle \alpha^k, K \alpha^k \rangle = \lambda_k \langle \alpha^k, \alpha^k \rangle = 1$.

22 Compute Nonlinear Components. In linear PCA, principal components are extracted by projecting the data $x$ onto the eigenvectors $v^k$ of the covariance matrix $C$, i.e., $\langle v^k, x \rangle$. In kernel PCA, we likewise project $\phi(x)$ onto the eigenvectors $V^k$ of $\bar{C}$: $\langle V^k, \phi(x) \rangle = \sum_{i=1}^{N} \alpha_i^k \langle \phi(x_i), \phi(x) \rangle = \sum_{i=1}^{N} \alpha_i^k k(x_i, x)$.

23 Centering in Feature Space. Define $\tilde{\phi}(x_t) = \phi(x_t) - \frac{1}{N} \sum_{l=1}^{N} \phi(x_l)$. Then we have $\tilde{K}_{ij} = \langle \tilde{\phi}(x_i), \tilde{\phi}(x_j) \rangle = \langle \phi(x_i) - \frac{1}{N} \sum_{l=1}^{N} \phi(x_l),\ \phi(x_j) - \frac{1}{N} \sum_{k=1}^{N} \phi(x_k) \rangle = K_{ij} - \frac{1}{N} \sum_{k} K_{ik} - \frac{1}{N} \sum_{l} K_{lj} + \frac{1}{N^2} \sum_{l,k} K_{lk}$. Therefore, the centered kernel matrix is given by $\tilde{K} = K - \mathbf{1}_{N \times N} K - K \mathbf{1}_{N \times N} + \mathbf{1}_{N \times N} K \mathbf{1}_{N \times N}$.

24 Algorithm Outline: Kernel PCA
1. Given a set of $m$-dimensional training data $\{x_k\}$, $k = 1, \ldots, N$, compute the kernel matrix $K = [k(x_i, x_j)] \in \mathbb{R}^{N \times N}$.
2. Carry out centering in feature space (so that $\sum_{k=1}^{N} \tilde{\phi}(x_k) = 0$): $\tilde{K} = K - \mathbf{1}_{N \times N} K - K \mathbf{1}_{N \times N} + \mathbf{1}_{N \times N} K \mathbf{1}_{N \times N}$, where $\mathbf{1}_{N \times N} \in \mathbb{R}^{N \times N}$ is the matrix whose entries are all $1/N$.
3. Solve the eigenvalue problem $N \lambda \alpha = \tilde{K} \alpha$ and normalize $\alpha^k$ such that $\langle \alpha^k, \alpha^k \rangle = 1/\lambda_k$.
4. For a test pattern $x$, extract a nonlinear component via $\langle V^k, \tilde{\phi}(x) \rangle = \sum_{i=1}^{N} \alpha_i^k \tilde{k}(x_i, x)$.
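
Putting the four steps together, a minimal NumPy sketch, assuming an RBF kernel, that the top $n$ eigenvalues are positive, and that the components are evaluated at the training points themselves:

```python
import numpy as np

def kernel_pca(X, n, sigma=1.0):
    """Kernel PCA with an RBF kernel. X : (m, N), one sample per column.
    Returns the (N, n) nonlinear components of the training data."""
    N = X.shape[1]
    sq = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)  # ||x_i - x_j||^2
    K = np.exp(-sq / (2.0 * sigma ** 2))                       # step 1: kernel matrix
    ones = np.ones((N, N)) / N                                 # the matrix 1_{NxN}
    Kc = K - ones @ K - K @ ones + ones @ K @ ones             # step 2: centering
    eigvals, alphas = np.linalg.eigh(Kc)                       # step 3: eigensolve
    eigvals, alphas = eigvals[::-1][:n], alphas[:, ::-1][:, :n]
    alphas = alphas / np.sqrt(eigvals)    # normalize so that <V^k, V^k> = 1
    return Kc @ alphas                    # step 4, applied to the training points
```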

25 Toy Example (figure: the first eight kernel principal components, each panel labeled with its eigenvalue)

26 Multidimensional Scaling (MDS). Let $\{x_t \in \mathbb{R}^m\}_{t=1}^{N}$ be given data points and $\{y_t \in \mathbb{R}^n\}_{t=1}^{N}$ be the lower-dimensional images of the $x_t$. Let $\delta_{ij}$ be the distance (dissimilarity) between $x_i$ and $x_j$, and $d_{ij}$ the distance between $y_i$ and $y_j$. The aim of MDS is to find a configuration of lower-dimensional images such that the distances $\{d_{ij}\}$ match the dissimilarities $\{\delta_{ij}\}$ as well as possible.

27 Algorithm Outline: Classical Scaling
1. Obtain dissimilarities $\{\delta_{ij} = \|x_i - x_j\|\}$.
2. Compute the matrix $A = [-\frac{1}{2} \delta_{ij}^2]$.
3. Construct the Gram matrix $B = HAH \in \mathbb{R}^{N \times N}$, where $H = I - \frac{1}{N} \mathbf{1}_N \mathbf{1}_N^\top$ is the centering matrix.
4. The rank-$n$ approximation of $B$ is given by $B \approx V_1 \Lambda_1 V_1^\top = Y^\top Y$, where $V_1 \in \mathbb{R}^{N \times n}$ and $\Lambda_1 \in \mathbb{R}^{n \times n}$.
5. The coordinates of the $N$ points in $n$-dimensional Euclidean space are given by $Y = \Lambda_1^{1/2} V_1^\top = [y_1, \ldots, y_N] \in \mathbb{R}^{n \times N}$. That is, each column $y_i \in \mathbb{R}^n$ of $Y$ is an embedding of $x_i$.
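
A direct NumPy transcription of these five steps; clipping small negative eigenvalues (which arise when the dissimilarities are not exactly Euclidean) is a practical safeguard I have added:

```python
import numpy as np

def classical_scaling(delta, n):
    """Classical MDS. delta : (N, N) matrix of pairwise dissimilarities.
    Returns the (n, N) embedding Y, one point per column."""
    N = delta.shape[0]
    A = -0.5 * delta ** 2                       # step 2
    H = np.eye(N) - np.ones((N, N)) / N         # centering matrix
    B = H @ A @ H                               # step 3: Gram matrix
    eigvals, V = np.linalg.eigh(B)
    eigvals, V = eigvals[::-1][:n], V[:, ::-1][:, :n]    # step 4: top-n eigenpairs
    return np.diag(np.sqrt(np.maximum(eigvals, 0))) @ V.T  # step 5
```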

28 Properties of the Centering Matrix
1. $H = I - \frac{1}{N} \mathbf{1}_N \mathbf{1}_N^\top$
2. $[HX]_{ij} = X_{ij} - \frac{1}{N} \sum_{k=1}^{N} X_{kj}$ (column-wise centering)
3. $[XH]_{ij} = X_{ij} - \frac{1}{N} \sum_{k=1}^{N} X_{ik}$ (row-wise centering)
4. $H^2 = HH = H$ (idempotent)
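
These properties are easy to verify numerically; a small sanity check with random data:

```python
import numpy as np

N = 5
H = np.eye(N) - np.ones((N, N)) / N
X = np.random.randn(3, N)
Z = np.random.randn(N, 3)

assert np.allclose(H @ H, H)                                     # idempotent
assert np.allclose(X @ H, X - X.mean(axis=1, keepdims=True))     # row-wise centering
assert np.allclose(H @ Z, Z - Z.mean(axis=0, keepdims=True))     # column-wise centering
```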

29 Core Idea: $B = \bar{X}^\top \bar{X} = HAH$. One can easily show that $B_{ij} = (x_i - \bar{x})^\top (x_j - \bar{x}) = x_i^\top x_j - x_i^\top \bar{x} - \bar{x}^\top x_j + \bar{x}^\top \bar{x} = -\frac{1}{2} \left( \delta_{ij}^2 - \frac{1}{N} \sum_i \delta_{ij}^2 - \frac{1}{N} \sum_j \delta_{ij}^2 + \frac{1}{N^2} \sum_i \sum_j \delta_{ij}^2 \right)$. On the other hand, $[HAH]_{ij} = \left[ \left( I - \frac{1}{N} \mathbf{1}_N \mathbf{1}_N^\top \right) A \left( I - \frac{1}{N} \mathbf{1}_N \mathbf{1}_N^\top \right) \right]_{ij} = \left[ A - \frac{1}{N} \mathbf{1}_N \mathbf{1}_N^\top A - \frac{1}{N} A \mathbf{1}_N \mathbf{1}_N^\top + \frac{1}{N^2} \mathbf{1}_N \mathbf{1}_N^\top A \mathbf{1}_N \mathbf{1}_N^\top \right]_{ij} = [A]_{ij} - \frac{1}{N} \sum_i A_{ij} - \frac{1}{N} \sum_j A_{ij} + \frac{1}{N^2} \sum_{i,j} A_{ij}$. With $A = [-\frac{1}{2} \delta_{ij}^2]$, the two expressions coincide, so $B = HAH$, where $\bar{X} = XH$ is the centered data matrix.

30 Relation to PCA. One can easily show that the classical scaling solution is nothing but the projection of the centered data onto the normalized principal directions: $\underbrace{\Lambda^{1/2} V^\top}_{\text{MDS}} = \Lambda^{-1/2} U^\top X H$, where $U = XHV$. Note that $V$ is the eigenvector matrix of $B$: $BV = V\Lambda$, i.e., $(XH)^\top XHV = V\Lambda$. Multiplying both sides by $XH$ gives $(XH)(XH)^\top \underbrace{XHV}_{U} = \underbrace{XHV}_{U} \Lambda$, i.e., $NC\,U = U\Lambda$, where $C = \frac{1}{N}(XH)(XH)^\top$ is the covariance matrix.

31 Projection Property. Consider $U = XHV \in \mathbb{R}^{m \times n}$. Projecting the centered data $XH$ onto the principal directions $U$ yields $U^\top XH = V^\top (XH)^\top XH = V^\top B = \Lambda V^\top$. Therefore we have $\Lambda^{-1/2} U^\top XH = \Lambda^{1/2} V^\top$. PCA defines a mapping from the original space to the principal coordinates. Given a new data point $x$, its projection onto the principal coordinates defined by the original $N$ data points can be computed as $y = \Lambda^{-1/2} U^\top x$.

32 Isomap: Tenenbaum et al., 2000
1. Construct a neighborhood graph ($k$-NN or $\epsilon$-ball).
2. Compute geodesic distances $D_{ij}$, with $D^2 = [D_{ij}^2] \in \mathbb{R}^{N \times N}$.
3. Construct a Gram matrix $K(D^2) = -\frac{1}{2} H D^2 H$.
4. Compute the top $n$ eigenvectors of $K$, i.e., $K \approx V_1 \Lambda_1 V_1^\top$, where $V_1 \in \mathbb{R}^{N \times n}$ and $\Lambda_1 \in \mathbb{R}^{n \times n}$.
5. The coordinates of the $N$ points in $n$-dimensional Euclidean space are given by $Y = \Lambda_1^{1/2} V_1^\top = [y_1, \ldots, y_N] \in \mathbb{R}^{n \times N}$.
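
A sketch of these five steps, assuming SciPy is available for the shortest-path computation and that the k-NN graph is connected; parameter names are mine:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def isomap(X, n, k=10):
    """Isomap sketch. X : (N, m), one sample per row. Returns (n, N) embedding."""
    N = X.shape[0]
    dist = cdist(X, X)                               # Euclidean distances
    # step 1: k-NN graph (np.inf marks non-edges for csgraph routines)
    G = np.full((N, N), np.inf)
    idx = np.argsort(dist, axis=1)[:, 1:k + 1]
    rows = np.repeat(np.arange(N), k)
    G[rows, idx.ravel()] = dist[rows, idx.ravel()]
    G = np.minimum(G, G.T)                           # symmetrize
    D = shortest_path(G, method='D', directed=False) # step 2: geodesic distances
    H = np.eye(N) - np.ones((N, N)) / N
    K = -0.5 * H @ (D ** 2) @ H                      # step 3: Gram matrix
    eigvals, V = np.linalg.eigh(K)                   # step 4: top-n eigenpairs
    eigvals, V = eigvals[::-1][:n], V[:, ::-1][:, :n]
    return np.diag(np.sqrt(np.maximum(eigvals, 0))) @ V.T  # step 5
```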

33 Kernel Isomap: Choi and Choi, 2004
1. Construct a neighborhood graph ($k$-NN or $\epsilon$-ball).
2. Compute geodesic distances $D_{ij}$, with $D^2 = [D_{ij}^2] \in \mathbb{R}^{N \times N}$.
3. Construct a Gram matrix $K(D^2) = -\frac{1}{2} H D^2 H$.
4. Compute the largest eigenvalue, $c^*$, of the matrix $\begin{bmatrix} 0 & 2K(D^2) \\ -I & -4K(D) \end{bmatrix}$ and construct a Mercer kernel matrix $\tilde{K} = K(D^2) + 2cK(D) + \frac{1}{2} c^2 H$, which is guaranteed to be positive semidefinite for $c \geq c^*$.
5. Compute the top $n$ eigenvectors of $\tilde{K}$, i.e., $\tilde{K} \approx V_1 \Lambda_1 V_1^\top$, where $V_1 \in \mathbb{R}^{N \times n}$ and $\Lambda_1 \in \mathbb{R}^{n \times n}$.
6. The coordinates of the $N$ points in $n$-dimensional Euclidean space are given by $Y = \Lambda_1^{1/2} V_1^\top = [y_1, \ldots, y_N] \in \mathbb{R}^{n \times N}$.
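
Step 4 is the only part that differs from Isomap. A hedged sketch of the additive-constant computation, following the construction on the slide (here $K(A) = -\frac{1}{2}HAH$ for a matrix $A$):

```python
import numpy as np

def mercer_kernel_matrix(D):
    """Given geodesic distances D (N, N), return a PSD kernel matrix via the
    additive-constant trick of kernel Isomap (sketch)."""
    N = D.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    K2 = -0.5 * H @ (D ** 2) @ H          # K(D^2)
    K1 = -0.5 * H @ D @ H                 # K(D)
    # largest eigenvalue of the 2N x 2N block matrix [[0, 2 K2], [-I, -4 K1]]
    M = np.block([[np.zeros((N, N)), 2 * K2],
                  [-np.eye(N), -4 * K1]])
    c_star = np.max(np.linalg.eigvals(M).real)  # largest eigenvalue is real
    c = c_star                                  # any c >= c_star works
    return K2 + 2 * c * K1 + 0.5 * c ** 2 * H
```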

34 Kernel Isomap: An Example (figure: noisy Swiss roll data; embeddings by Isomap, kernel Isomap, and kernel Isomap with projection)

35 Locally Linear Embedding (LLE): Roweis and Saul, 2000

36 Algorithm Outline: LLE
1. Determine weights $\{W_{ij}\}$ by solving $\arg\min_{W} \sum_i \| x_i - \sum_j W_{ij} x_j \|^2$, subject to $W_{ij} = 0$ if $x_j \notin \mathcal{N}_i$ and $\sum_j W_{ij} = 1$.
2. Fix $\{W_{ij}\}$ and optimize the coordinates $y_i$ by minimizing the embedding cost function $\mathcal{J}(y) = \sum_i \| y_i - \sum_j W_{ij} y_j \|^2$, with the two constraints $\sum_i y_i = 0$ and $\frac{1}{N} \sum_i y_i y_i^\top = I$.
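
A minimal sketch of both steps; the regularization of the local Gram matrix is a standard numerical safeguard I have added, and the constraints of step 2 are satisfied (up to scaling) by the orthonormal eigenvectors:

```python
import numpy as np
from scipy.spatial.distance import cdist

def lle(X, n, k=10, reg=1e-3):
    """LLE sketch. X : (N, m), one sample per row. Returns (N, n) embedding."""
    N = X.shape[0]
    idx = np.argsort(cdist(X, X), axis=1)[:, 1:k + 1]   # k nearest neighbors
    W = np.zeros((N, N))
    for i in range(N):                       # step 1: reconstruction weights
        Z = X[idx[i]] - X[i]                 # neighbors relative to x_i
        G = Z @ Z.T                          # local Gram matrix
        G += reg * np.trace(G) * np.eye(k)   # regularize for stability
        w = np.linalg.solve(G, np.ones(k))
        W[i, idx[i]] = w / w.sum()           # enforce sum_j W_ij = 1
    # step 2: minimize ||Y - WY||^2 -> bottom eigenvectors of (I-W)^T (I-W)
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    eigvals, V = np.linalg.eigh(M)
    return V[:, 1:n + 1]                     # skip the constant eigenvector
```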

37 Algorithm Outline: Laplacian Eigenmap
1. Construct a neighborhood graph.
2. Choose edge weights: $W_{ij} = e^{-\|x_i - x_j\|^2 / \sigma}$ if nodes $i$ and $j$ are connected, and $W_{ij} = 0$ otherwise.
3. Solve the generalized eigenvalue problem $L v_i = \lambda_i D v_i$, where $D$ is the degree matrix (diagonal, with entries $D_{ii} = \sum_j W_{ij}$) and $L = D - W$ is the graph Laplacian. The eigenvalues satisfy $0 = \lambda_0 \leq \lambda_1 \leq \cdots \leq \lambda_{N-1}$.
4. The low-dimensional embedding $Y \in \mathbb{R}^{n \times N}$ is given by $Y = [v_1, \ldots, v_n]^\top$.
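
A sketch of the four steps, assuming a k-NN graph with no isolated vertices (so the degree matrix $D$ is positive definite); parameter names are mine:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def laplacian_eigenmap(X, n, k=10, sigma=1.0):
    """Laplacian eigenmap sketch. X : (N, m). Returns (n, N) embedding."""
    N = X.shape[0]
    dist = cdist(X, X)
    idx = np.argsort(dist, axis=1)[:, 1:k + 1]          # step 1: k-NN graph
    W = np.zeros((N, N))
    rows = np.repeat(np.arange(N), k)
    W[rows, idx.ravel()] = np.exp(-dist[rows, idx.ravel()] ** 2 / sigma)  # step 2
    W = np.maximum(W, W.T)                              # symmetrize
    D = np.diag(W.sum(axis=1))                          # degree matrix
    L = D - W                                           # graph Laplacian
    eigvals, V = eigh(L, D)                             # step 3: L v = lambda D v
    return V[:, 1:n + 1].T                              # step 4: skip trivial v_0
```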

38 Optimal Embedding. Consider an embedding from $x_i \in \mathbb{R}^m$ to $y_i \in \mathbb{R}$. A reasonable criterion for choosing a good map is to minimize the objective function $\mathcal{J} = \frac{1}{2} \sum_i \sum_j (y_i - y_j)^2 W_{ij} = y^\top L y$, where $L = D - W$. Therefore the minimization problem reduces to finding $\arg\min_{y} y^\top L y$, subject to $y^\top D y = 1$.

39 Locality Preserving Projection (LPP): He and Niyogi, 2003. Consider a linear mapping $y_i = \Psi^\top x_i$ in the framework of the Laplacian eigenmap, where $\Psi \in \mathbb{R}^{m \times n}$. Then the mapping $\Psi$ is determined by solving $\arg\min_{\Psi} \mathrm{tr}\{\Psi^\top X L X^\top \Psi\}$, subject to $\Psi^\top X D X^\top \Psi = I$.
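
Since the map is linear, LPP reduces to an $m \times m$ generalized eigenproblem. A sketch, assuming $X D X^\top$ is nonsingular (e.g., $m < N$, possibly after a PCA preprocessing step):

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, W, n):
    """LPP sketch. X : (m, N) data, W : (N, N) graph weights (as in the
    Laplacian eigenmap). Returns Psi (m, n), the linear projection."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    A = X @ L @ X.T          # Psi^T X L X^T Psi is minimized
    B = X @ D @ X.T          # Psi^T X D X^T Psi = I is the constraint
    eigvals, V = eigh(A, B)  # generalized eigenproblem, ascending eigenvalues
    return V[:, :n]          # eigenvectors with smallest eigenvalues
```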

40 Manifolds of Spatial Hearing: Choi and Choi

41 Laplacianfaces: He et al., 2005

42 Manifolds of Human Motion: Elgammal and Lee, 2004

43 Manifolds of Human Motion (Cont'd)

44 Human Emotion in HRI: Ho et al.

45 Human Emotion in HRI (Cont'd)

46 Human Emotion in HRI (Cont'd)
