Distance Preservation - Part I

October 2, 2007

Outline
1. Introduction
2. Multidimensional scaling: scalar product, equivalence with PCA, Euclidean distance
3. Sammon's nonlinear mapping
4. Curvilinear component analysis
5. Summary

Spatial distances
Only the coordinates of the points affect the distances.
$L_p$ norm: $\|a\|_p = \sqrt[p]{\sum_{k=1}^{D} |a_k|^p}$
Minkowski distance: $d(a,b) = \|a - b\|_p$
Maximum ($p = \infty$): $\|a - b\|_\infty = \max_{1 \le k \le D} |a_k - b_k|$
City-block ($p = 1$): $\|a - b\|_1 = \sum_{k=1}^{D} |a_k - b_k|$
Euclidean ($p = 2$): $\|a - b\|_2 = \sqrt{\sum_{k=1}^{D} (a_k - b_k)^2}$
Mahalanobis norm: $\|a\|_{\text{Mahalanobis}} = \sqrt{a^T M^{-1} a}$; usually $M = C_{aa} = E\{a a^T\}$; $\|a\|_{\text{Mahalanobis}} = \|a\|_2$ when $M = I$
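As a concrete illustration of these definitions, here is a minimal NumPy sketch; the random data and the use of the sample covariance as $M$ are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))                 # 100 points in D = 5 dimensions
a, b = A[0], A[1]

# Minkowski / L_p distances between two points
d_city = np.sum(np.abs(a - b))                # city-block, p = 1
d_eucl = np.sqrt(np.sum((a - b) ** 2))        # Euclidean, p = 2
d_max  = np.max(np.abs(a - b))                # maximum, p = infinity

# Mahalanobis norm of a, with M taken as the sample covariance C_aa
M = np.cov(A, rowvar=False)                   # D x D covariance estimate
d_mahal = np.sqrt(a @ np.linalg.solve(M, a))

print(d_city, d_eucl, d_max, d_mahal)
```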

Classical metric multidimensional scaling (MDS)
Preserves pairwise scalar products instead of distances.
Assumes a simple generative model, as with PCA: $y = Wx$
observed variables $y$, uncorrelated latent variables $x$
variables are assumed to be centered
orthogonal $D$-by-$P$ matrix $W$: $W^T W = I_P$
$N$ points in matrix form: $Y = [y(1), \ldots, y(N)]$
Scalar products are known: $s_y(i,j) = \langle y(i), y(j) \rangle = y(i)^T y(j)$
Then $S = [s_y(i,j)]_{1 \le i,j \le N} = Y^T Y = X^T W^T W X = X^T X$
Usually $Y$ and $X$ are unknown.

MDS: Finding latent variables
The eigenvalue decomposition of the Gram matrix $S$: $S = U \Lambda U^T = (\Lambda^{1/2} U^T)^T (\Lambda^{1/2} U^T)$
Eigenvalues are sorted in descending order.
$P$-dimensional latent variables: $\hat{X} = I_{P \times N} \Lambda^{1/2} U^T$ (where $I_{P \times N}$ keeps only the first $P$ rows)

Equivalence of PCA and MDS
PCA and MDS give the same projection.
SVD: $Y = V \Sigma U^T$
Then $\hat{C}_{yy} \propto Y Y^T = V \Lambda_{\text{PCA}} V^T$ and $S = Y^T Y = U \Lambda_{\text{MDS}} U^T$, where $\Lambda_{\text{PCA}} = \Sigma \Sigma^T$ and $\Lambda_{\text{MDS}} = \Sigma^T \Sigma$.
We get: $\hat{X}_{\text{MDS}} = I_{P \times N} \Lambda_{\text{MDS}}^{1/2} U^T = I_{P \times N} \Sigma U^T = I_{P \times N} V^T V \Sigma U^T = I_{P \times N} V^T Y = \hat{X}_{\text{PCA}}$
Thus MDS minimizes the criterion: $E_{\text{MDS}} = \sum_{i,j=1}^{N} (s_y(i,j) - s_{\hat{x}}(i,j))^2$

Three ways to calculate PCA
The $D$-by-$N$ matrix $Y$ is known. The PCA projection can be computed:
1. From the covariance matrix: $\hat{C}_{yy} \propto Y Y^T = V \Lambda_{\text{PCA}} V^T$, then $\hat{X}_{\text{PCA}} = I_{P \times N} V^T Y$
2. From the Gram matrix: $S = Y^T Y = U \Lambda_{\text{MDS}} U^T$, then $\hat{X}_{\text{PCA}} = \hat{X}_{\text{MDS}} = I_{P \times N} \Lambda_{\text{MDS}}^{1/2} U^T$
3. From the SVD: $Y = V \Sigma U^T$, then $\hat{X}_{\text{PCA}} = I_{P \times N} V^T Y$
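The three routes can be compared numerically. The sketch below is a minimal check with made-up data and an illustrative target dimension P = 2 (not values from the slides); the three embeddings agree up to a sign flip per component.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, P = 5, 200, 2
Y = rng.normal(size=(D, N))
Y -= Y.mean(axis=1, keepdims=True)                 # center the data

# 1. EVD of Y Y^T (proportional to the covariance matrix)
lam_pca, V = np.linalg.eigh(Y @ Y.T)
V = V[:, np.argsort(lam_pca)[::-1]]                # eigenvectors, descending eigenvalues
X1 = V[:, :P].T @ Y

# 2. EVD of the Gram matrix Y^T Y (classical metric MDS)
lam_mds, U = np.linalg.eigh(Y.T @ Y)
order = np.argsort(lam_mds)[::-1]
lam_mds, U = lam_mds[order], U[:, order]
X2 = np.sqrt(lam_mds[:P])[:, None] * U[:, :P].T

# 3. SVD of Y
Vs, sigma, UsT = np.linalg.svd(Y, full_matrices=False)
X3 = Vs[:, :P].T @ Y

# All three embeddings agree up to a sign flip per component
for Xk in (X2, X3):
    assert np.allclose(np.abs(X1), np.abs(Xk), atol=1e-8)
```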

MDS with Euclidean distances
Instead of scalar products, the pairwise distances are known: $D = [d_y^2(i,j)]_{1 \le i,j \le N}$
Solution: transform the distances into scalar products.
$d_y^2(i,j) = \langle y(i) - y(j), y(i) - y(j) \rangle = s_y(i,i) - 2 s_y(i,j) + s_y(j,j)$
$s_y(i,j) = -\frac{1}{2} \left( d_y^2(i,j) - s_y(i,i) - s_y(j,j) \right)$

Double centering of D
Calculate the means (the cross terms vanish because the data are centered):
$\mu_j(d_y^2(i,j)) = \mu_j(\langle y(i) - y(j), y(i) - y(j) \rangle) = \langle y(i), y(i) \rangle - 2 \langle y(i), \mu_j(y(j)) \rangle + \mu_j(\langle y(j), y(j) \rangle) = s_y(i,i) + \mu_j(s_y(j,j))$
$\mu_i(d_y^2(i,j)) = \mu_i(s_y(i,i)) + s_y(j,j)$
$\mu_{i,j}(d_y^2(i,j)) = \mu_i(s_y(i,i)) + \mu_j(s_y(j,j))$
$s_y(i,j) = -\frac{1}{2} \left( d_y^2(i,j) - \mu_j(d_y^2(i,j)) - \mu_i(d_y^2(i,j)) + \mu_{i,j}(d_y^2(i,j)) \right)$
In matrix form:
$S = -\frac{1}{2} \left( D - \frac{1}{N} D \mathbf{1}_N \mathbf{1}_N^T - \frac{1}{N} \mathbf{1}_N \mathbf{1}_N^T D + \frac{1}{N^2} \mathbf{1}_N \mathbf{1}_N^T D \mathbf{1}_N \mathbf{1}_N^T \right)$

MDS: Algorithm
1. If the data $Y$ is given, center it, compute $S = Y^T Y$, and go to step 3
2. If the pairwise distances $D$ are given, transform them into scalar products $S$ by double centering
3. EVD: $S = U \Lambda U^T$
4. $\hat{X} = I_{P \times N} \Lambda^{1/2} U^T$
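A compact NumPy sketch of this algorithm for the distance-based input; the function and variable names are my own, not from the slides.

```python
import numpy as np

def classical_mds(D2, P):
    """Classical metric MDS from an N x N matrix of squared pairwise distances."""
    N = D2.shape[0]
    J = np.ones((N, N)) / N
    # Double centering: S = -1/2 (D2 - (1/N) D2 11^T - (1/N) 11^T D2 + (1/N^2) 11^T D2 11^T)
    S = -0.5 * (D2 - D2 @ J - J @ D2 + J @ D2 @ J)
    # EVD of the Gram matrix, eigenvalues sorted in descending order
    lam, U = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    # P-dimensional embedding X = Lambda^{1/2} U^T, keeping the top P components
    return np.sqrt(np.maximum(lam[:P], 0.0))[:, None] * U[:, :P].T

# Usage: recover a 2-D configuration (up to rotation/reflection) from pairwise distances
rng = np.random.default_rng(2)
Y = rng.normal(size=(2, 50))
D2 = np.sum((Y[:, :, None] - Y[:, None, :]) ** 2, axis=0)
X_hat = classical_mds(D2, P=2)                     # shape (2, 50)
```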

Embedding of test set
Test set given as coordinates: for a test point $y$, $\hat{x} = I_{P \times D} V^T y$
Test set given as scalar products: for a test point $s = Y^T y$, $\hat{x} = I_{P \times N} \Lambda^{-1/2} U^T s$
Test set given as distances: for a test point $d = [\langle y(i) - y, y(i) - y \rangle]_{1 \le i \le N}$,
$s = -\frac{1}{2} \left( d - \frac{1}{N} \mathbf{1}_N \mathbf{1}_N^T d - \frac{1}{N} D \mathbf{1}_N + \frac{1}{N^2} \mathbf{1}_N \mathbf{1}_N^T D \mathbf{1}_N \right)$
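For the distance-based case, here is a self-contained NumPy sketch of the out-of-sample formula (the data and names are illustrative assumptions); it checks the result against the coordinate-based formula $\hat{x} = I_{P \times D} V^T y$.

```python
import numpy as np

rng = np.random.default_rng(3)
D_dim, N, P = 3, 100, 2
Y = rng.normal(size=(D_dim, N))
Y -= Y.mean(axis=1, keepdims=True)                 # centered training data

# Training embedding from the Gram matrix S = Y^T Y
S = Y.T @ Y
lam, U = np.linalg.eigh(S)
order = np.argsort(lam)[::-1][:P]
lam, U = lam[order], U[:, order]                   # top P eigenpairs

# Test point: only its squared distances to the training points are known
y_new = rng.normal(size=D_dim)
d = np.sum((Y - y_new[:, None]) ** 2, axis=0)      # length-N vector
D2 = np.sum((Y[:, :, None] - Y[:, None, :]) ** 2, axis=0)

# s = -1/2 (d - (1/N) 11^T d - (1/N) D2 1 + (1/N^2) 11^T D2 1)
s = -0.5 * (d - d.mean() - D2.mean(axis=1) + D2.mean())
x_new = (U.T @ s) / np.sqrt(lam)                   # x = Lambda^{-1/2} U^T s

# Sanity check against the coordinate-based formula x = I_{PxD} V^T y
V, sigma, UT = np.linalg.svd(Y, full_matrices=False)
assert np.allclose(np.abs(x_new), np.abs(V[:, :P].T @ y_new), atol=1e-6)
```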

Embeddings with MDS (figure)

MDS variants
Classical metric MDS preserves only the pairwise scalar products.
Variants try to preserve the pairwise distances directly by minimizing the stress function of metric MDS:
$E_{\text{mMDS}} = \frac{1}{2} \sum_{i,j=1}^{N} w_{ij} \, (d_y(i,j) - d_x(i,j))^2$
The variants do not depend on any generative model.

Sammon's nonlinear mapping (NLM)
Sammon's stress function:
$E_{\text{NLM}} = \frac{1}{c} \sum_{i<j} \frac{(d_y(i,j) - d_x(i,j))^2}{d_y(i,j)}$, where $c = \sum_{i<j} d_y(i,j)$
It is minimized iteratively with a quasi-Newton optimization:
$x_k(i) \leftarrow x_k(i) - \alpha \, \frac{\partial E_{\text{NLM}} / \partial x_k(i)}{\partial^2 E_{\text{NLM}} / \partial x_k(i)^2}$

NLM: Derivation
Direct calculation gives:
$\frac{\partial E_{\text{NLM}}}{\partial x_k(i)} = \sum_{j \ne i} \frac{\partial E_{\text{NLM}}}{\partial d_x(i,j)} \frac{\partial d_x(i,j)}{\partial x_k(i)} = -\frac{2}{c} \sum_{j \ne i} \frac{d_y(i,j) - d_x(i,j)}{d_y(i,j) \, d_x(i,j)} \, (x_k(i) - x_k(j))$
$\frac{\partial^2 E_{\text{NLM}}}{\partial x_k(i)^2} = -\frac{2}{c} \sum_{j \ne i} \left( \frac{d_y(i,j) - d_x(i,j)}{d_y(i,j) \, d_x(i,j)} - \frac{(x_k(i) - x_k(j))^2}{d_x^3(i,j)} \right)$

NLM: Algorithm
1. Compute the pairwise distances $d_y(i,j)$
2. Initialize the points $x(i)$ randomly or by PCA
3. Calculate the quasi-Newton update for each point
4. Update the coordinates of all points $x(i)$
5. Return to step 3 until convergence
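A minimal NumPy sketch of this loop, using the derivatives from the previous slide. The step size, iteration count, and initialization scale are illustrative assumptions; following Sammon's original algorithm, the step divides by the magnitude of the second derivative.

```python
import numpy as np

def sammon(Dy, P=2, alpha=0.3, n_iter=200, seed=0):
    """Sammon's nonlinear mapping from an N x N matrix of pairwise distances Dy."""
    N = Dy.shape[0]
    c = Dy[np.triu_indices(N, k=1)].sum()          # normalizing constant c = sum_{i<j} d_y(i,j)
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=1e-2, size=(N, P))        # random initialization (PCA also works)
    eps = 1e-9                                     # guards against division by zero
    for _ in range(n_iter):
        X_new = X.copy()
        for i in range(N):                         # quasi-Newton update for each point
            diff = X[i] - X                        # row j holds x(i) - x(j)
            dx = np.sqrt(np.sum(diff ** 2, axis=1)) + eps
            dy = Dy[i] + eps
            mask = np.arange(N) != i
            w = ((dy - dx) / (dy * dx))[mask, None]
            g = -2.0 / c * np.sum(w * diff[mask], axis=0)                            # dE/dx_k(i)
            h = -2.0 / c * np.sum(w - diff[mask] ** 2 / dx[mask, None] ** 3, axis=0)  # d2E/dx_k(i)^2
            X_new[i] = X[i] - alpha * g / (np.abs(h) + eps)
        X = X_new                                  # update the coordinates of all points
    return X

# Usage: map 3-D points to 2-D
rng = np.random.default_rng(4)
Y = rng.normal(size=(60, 3))
Dy = np.sqrt(np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2))
X = sammon(Dy)
```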

Embedding of test set
There is no easy way to generalize the embedding to new points. Possible workarounds: updating only the new point with quasi-Newton steps, the interpolation procedure of curvilinear component analysis, or neural variants of NLM such as SAMANN.

Embeddings with NLM (figure)

Curvilinear component analysis (CCA)
Minimizes the stress function:
$E_{\text{CCA}} = \frac{1}{2} \sum_{i,j=1}^{N} (d_y(i,j) - d_x(i,j))^2 \, F_\lambda(d_x(i,j))$
Typically $F_\lambda$ is monotonically decreasing:
$F_\lambda(d_x) = \exp\!\left(-\frac{d_x}{\lambda}\right)$
$F_\lambda(d_x) = H(\lambda - d_x)$, where $H(u) = 0$ if $u \le 0$ and $H(u) = 1$ otherwise

CCA: Derivation
Minimization by gradient descent: $x(i) \leftarrow x(i) - \alpha \, \nabla_{x(i)} E_{\text{CCA}}$
Direct calculation gives:
$\nabla_{x(i)} E_{\text{CCA}} = \sum_{j=1}^{N} (d_y - d_x) \left( 2 F_\lambda(d_x) - (d_y - d_x) F'_\lambda(d_x) \right) \frac{x(j) - x(i)}{d_x}$,
where $d_y = d_y(i,j)$ and $d_x = d_x(i,j)$.

Condition for λ
The condition $2 F_\lambda(d_x) > (d_y - d_x) F'_\lambda(d_x)$ guarantees that the distances change reasonably.
For $F_\lambda(d_x) = \exp\!\left(-\frac{d_x}{\lambda}\right)$: $\lambda > \frac{1}{2}(d_x - d_y)$
For $F_\lambda(d_x) = H(\lambda - d_x)$: the condition is always fulfilled.
The parameters α and λ can be decreased during convergence.

CCA: Problem with traditional gradient descent
Gradient descent can get stuck in a local minimum.
A better solution: stochastic gradient descent.

CCA: Stochastic gradient descent
Decompose $E_{\text{CCA}}$:
$E_{\text{CCA}} = \sum_{i=1}^{N} E^i_{\text{CCA}}$, where $E^i_{\text{CCA}} = \frac{1}{2} \sum_{j=1}^{N} (d_y(i,j) - d_x(i,j))^2 \, F_\lambda(d_x(i,j))$
Separate optimization:
$x(j) \leftarrow x(j) - \alpha \, \nabla_{x(j)} E^i_{\text{CCA}}$, i.e. $x(j) \leftarrow x(j) - \alpha \, \beta(i,j) \, \frac{x(i) - x(j)}{d_x}$,
where $\beta(i,j) = (d_y - d_x) \left( 2 F_\lambda(d_x) - (d_y - d_x) F'_\lambda(d_x) \right)$

CCA: Algorithm
1. Perform vector quantization for size reduction
2. Compute the pairwise distances $d_y(i,j)$
3. Initialize the points $x(i)$ randomly or by PCA
4. Set the learning rate α and the neighborhood width λ
5. Select a point $x(i)$ and update all the others
6. Return to step 5 until all points $x(i)$ have been selected in this epoch
7. If not converged, return to step 4
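A NumPy sketch of this loop with the exponential weighting $F_\lambda(d_x) = \exp(-d_x/\lambda)$. The initial values and decay schedules for α and λ are illustrative assumptions, and the vector-quantization step is omitted.

```python
import numpy as np

def cca(Dy, P=2, n_epochs=50, alpha0=0.3, lam0=None, seed=0):
    """Curvilinear component analysis with F_lambda(d_x) = exp(-d_x / lambda)."""
    N = Dy.shape[0]
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=1e-2, size=(N, P))        # random initialization (PCA also works)
    lam0 = Dy.max() if lam0 is None else lam0
    eps = 1e-9
    for epoch in range(n_epochs):
        # Decrease the learning rate and the neighborhood width during convergence
        alpha = alpha0 * (1.0 - epoch / n_epochs)
        lam = lam0 * 0.05 ** (epoch / max(n_epochs - 1, 1))
        for i in rng.permutation(N):               # select a point x(i) ...
            diff = X[i] - X                        # row j holds x(i) - x(j)
            dx = np.sqrt(np.sum(diff ** 2, axis=1)) + eps
            dy = Dy[i]
            F = np.exp(-dx / lam)                  # F_lambda(d_x)
            dF = -F / lam                          # F'_lambda(d_x)
            beta = (dy - dx) * (2.0 * F - (dy - dx) * dF)
            step = alpha * (beta / dx)[:, None] * diff   # ... and update all the others
            step[i] = 0.0                          # x(i) itself stays fixed
            X -= step
    return X

# Usage: map 3-D points to 2-D
rng = np.random.default_rng(5)
Y = rng.normal(size=(80, 3))
Dy = np.sqrt(np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2))
X = cca(Dy)
```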

Embedding of test set
The original points are kept fixed. For each test point, the update rule is applied to move it to the right position.

Embeddings with CCA (figure)

Summary
Three dimensionality reduction methods based on distance preservation: multidimensional scaling, Sammon's nonlinear mapping, and curvilinear component analysis.
MDS is a generalization of PCA to pairwise scalar products and distances.
NLM and CCA preserve distances directly by minimizing a corresponding stress function.