Distance Preservation - Part I

October 2, 2007

Outline
1. Introduction
2. Multidimensional scaling: scalar product, equivalence with PCA, Euclidean distance
3. Sammon's nonlinear mapping
4. Curvilinear component analysis
5. Summary

Spatial distances
Only the coordinates of the points affect the distances.
$L_p$ norm: $\|a\|_p = \sqrt[p]{\sum_{k=1}^{D} |a_k|^p}$
Minkowski distance: $d(a,b) = \|a - b\|_p$
Maximum ($p = \infty$): $\|a - b\|_\infty = \max_{1 \le k \le D} |a_k - b_k|$
City-block ($p = 1$): $\|a - b\|_1 = \sum_{k=1}^{D} |a_k - b_k|$
Euclidean ($p = 2$): $\|a - b\|_2 = \sqrt{\sum_{k=1}^{D} (a_k - b_k)^2}$
Mahalanobis norm: $\|a\|_{\text{Mahalanobis}} = \sqrt{a^T M^{-1} a}$; usually $M = C_{aa} = E\{a a^T\}$; $\|a\|_{\text{Mahalanobis}} = \|a\|_2$ when $M = I$
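As a concrete illustration of these definitions, here is a minimal NumPy sketch; the random data and the use of the sample covariance as $M$ are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))                 # 100 points in D = 5 dimensions
a, b = A[0], A[1]

# Minkowski / L_p distances between two points
d_city = np.sum(np.abs(a - b))                # city-block, p = 1
d_eucl = np.sqrt(np.sum((a - b) ** 2))        # Euclidean, p = 2
d_max  = np.max(np.abs(a - b))                # maximum, p = infinity

# Mahalanobis norm of a, with M taken as the sample covariance C_aa
M = np.cov(A, rowvar=False)                   # D x D covariance estimate
d_mahal = np.sqrt(a @ np.linalg.solve(M, a))

print(d_city, d_eucl, d_max, d_mahal)
```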

Classical metric multidimensional scaling (MDS)
Preserves pairwise scalar products instead of distances.
Assumes a simple generative model, as with PCA: $y = Wx$
observed variables $y$, uncorrelated latent variables $x$
variables are assumed to be centered
orthogonal $D$-by-$P$ matrix $W$: $W^T W = I_P$
$N$ points in matrix form: $Y = [y(1), \ldots, y(N)]$
Scalar products are known: $s_y(i,j) = \langle y(i), y(j) \rangle = y(i)^T y(j)$
Then $S = [s_y(i,j)]_{1 \le i,j \le N} = Y^T Y = X^T W^T W X = X^T X$
Usually $Y$ and $X$ are unknown.

MDS: Finding latent variables
The eigenvalue decomposition of the Gram matrix $S$: $S = U \Lambda U^T = (\Lambda^{1/2} U^T)^T (\Lambda^{1/2} U^T)$
Eigenvalues are sorted in descending order.
$P$-dimensional latent variables: $\hat{X} = I_{P \times N} \Lambda^{1/2} U^T$ (where $I_{P \times N}$ keeps only the first $P$ rows)

Equivalence of PCA and MDS
PCA and MDS give the same projection.
SVD: $Y = V \Sigma U^T$
Then $\hat{C}_{yy} \propto Y Y^T = V \Lambda_{\text{PCA}} V^T$ and $S = Y^T Y = U \Lambda_{\text{MDS}} U^T$, where $\Lambda_{\text{PCA}} = \Sigma \Sigma^T$ and $\Lambda_{\text{MDS}} = \Sigma^T \Sigma$.
We get: $\hat{X}_{\text{MDS}} = I_{P \times N} \Lambda_{\text{MDS}}^{1/2} U^T = I_{P \times N} \Sigma U^T = I_{P \times N} V^T V \Sigma U^T = I_{P \times N} V^T Y = \hat{X}_{\text{PCA}}$
Thus MDS minimizes the criterion: $E_{\text{MDS}} = \sum_{i,j=1}^{N} (s_y(i,j) - s_{\hat{x}}(i,j))^2$

Three ways to calculate PCA
The $D$-by-$N$ matrix $Y$ is known. The PCA projection can be computed:
1. From the covariance matrix: $\hat{C}_{yy} \propto Y Y^T = V \Lambda_{\text{PCA}} V^T$, then $\hat{X}_{\text{PCA}} = I_{P \times N} V^T Y$
2. From the Gram matrix: $S = Y^T Y = U \Lambda_{\text{MDS}} U^T$, then $\hat{X}_{\text{PCA}} = \hat{X}_{\text{MDS}} = I_{P \times N} \Lambda_{\text{MDS}}^{1/2} U^T$
3. From the SVD: $Y = V \Sigma U^T$, then $\hat{X}_{\text{PCA}} = I_{P \times N} V^T Y$
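The three routes can be compared numerically. The sketch below is a minimal check with made-up data and an illustrative target dimension P = 2 (not values from the slides); the three embeddings agree up to a sign flip per component.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, P = 5, 200, 2
Y = rng.normal(size=(D, N))
Y -= Y.mean(axis=1, keepdims=True)                 # center the data

# 1. EVD of Y Y^T (proportional to the covariance matrix)
lam_pca, V = np.linalg.eigh(Y @ Y.T)
V = V[:, np.argsort(lam_pca)[::-1]]                # eigenvectors, descending eigenvalues
X1 = V[:, :P].T @ Y

# 2. EVD of the Gram matrix Y^T Y (classical metric MDS)
lam_mds, U = np.linalg.eigh(Y.T @ Y)
order = np.argsort(lam_mds)[::-1]
lam_mds, U = lam_mds[order], U[:, order]
X2 = np.sqrt(lam_mds[:P])[:, None] * U[:, :P].T

# 3. SVD of Y
Vs, sigma, UsT = np.linalg.svd(Y, full_matrices=False)
X3 = Vs[:, :P].T @ Y

# All three embeddings agree up to a sign flip per component
for Xk in (X2, X3):
    assert np.allclose(np.abs(X1), np.abs(Xk), atol=1e-8)
```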

MDS with Euclidean distances
Instead of scalar products, the pairwise distances are known: $D = [d_y^2(i,j)]_{1 \le i,j \le N}$
Solution: transform the distances into scalar products.
$d_y^2(i,j) = \langle y(i) - y(j), y(i) - y(j) \rangle = s_y(i,i) - 2 s_y(i,j) + s_y(j,j)$
$s_y(i,j) = -\frac{1}{2} \left( d_y^2(i,j) - s_y(i,i) - s_y(j,j) \right)$

Double centering of D
Calculate the means (the cross terms vanish because the data are centered):
$\mu_j(d_y^2(i,j)) = \mu_j(\langle y(i) - y(j), y(i) - y(j) \rangle) = \langle y(i), y(i) \rangle - 2 \langle y(i), \mu_j(y(j)) \rangle + \mu_j(\langle y(j), y(j) \rangle) = s_y(i,i) + \mu_j(s_y(j,j))$
$\mu_i(d_y^2(i,j)) = \mu_i(s_y(i,i)) + s_y(j,j)$
$\mu_{i,j}(d_y^2(i,j)) = \mu_i(s_y(i,i)) + \mu_j(s_y(j,j))$
$s_y(i,j) = -\frac{1}{2} \left( d_y^2(i,j) - \mu_j(d_y^2(i,j)) - \mu_i(d_y^2(i,j)) + \mu_{i,j}(d_y^2(i,j)) \right)$
In matrix form:
$S = -\frac{1}{2} \left( D - \frac{1}{N} D \mathbf{1}_N \mathbf{1}_N^T - \frac{1}{N} \mathbf{1}_N \mathbf{1}_N^T D + \frac{1}{N^2} \mathbf{1}_N \mathbf{1}_N^T D \mathbf{1}_N \mathbf{1}_N^T \right)$

MDS: Algorithm
1. If the data $Y$ is given, center it, compute $S = Y^T Y$, and go to step 3
2. If the pairwise distances $D$ are given, transform them into scalar products $S$ by double centering
3. EVD: $S = U \Lambda U^T$
4. $\hat{X} = I_{P \times N} \Lambda^{1/2} U^T$
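A compact NumPy sketch of this algorithm for the distance-based input; the function and variable names are my own, not from the slides.

```python
import numpy as np

def classical_mds(D2, P):
    """Classical metric MDS from an N x N matrix of squared pairwise distances."""
    N = D2.shape[0]
    J = np.ones((N, N)) / N
    # Double centering: S = -1/2 (D2 - (1/N) D2 11^T - (1/N) 11^T D2 + (1/N^2) 11^T D2 11^T)
    S = -0.5 * (D2 - D2 @ J - J @ D2 + J @ D2 @ J)
    # EVD of the Gram matrix, eigenvalues sorted in descending order
    lam, U = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    # P-dimensional embedding X = Lambda^{1/2} U^T, keeping the top P components
    return np.sqrt(np.maximum(lam[:P], 0.0))[:, None] * U[:, :P].T

# Usage: recover a 2-D configuration (up to rotation/reflection) from pairwise distances
rng = np.random.default_rng(2)
Y = rng.normal(size=(2, 50))
D2 = np.sum((Y[:, :, None] - Y[:, None, :]) ** 2, axis=0)
X_hat = classical_mds(D2, P=2)                     # shape (2, 50)
```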

Embedding of test set
Test set given as coordinates: for a test point $y$, $\hat{x} = I_{P \times D} V^T y$
Test set given as scalar products: for a test point $s = Y^T y$, $\hat{x} = I_{P \times N} \Lambda^{-1/2} U^T s$
Test set given as distances: for a test point $d = [\langle y(i) - y, y(i) - y \rangle]_{1 \le i \le N}$,
$s = -\frac{1}{2} \left( d - \frac{1}{N} \mathbf{1}_N \mathbf{1}_N^T d - \frac{1}{N} D \mathbf{1}_N + \frac{1}{N^2} \mathbf{1}_N \mathbf{1}_N^T D \mathbf{1}_N \right)$
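For the distance-based case, here is a self-contained NumPy sketch of the out-of-sample formula (the data and names are illustrative assumptions); it checks the result against the coordinate-based formula $\hat{x} = I_{P \times D} V^T y$.

```python
import numpy as np

rng = np.random.default_rng(3)
D_dim, N, P = 3, 100, 2
Y = rng.normal(size=(D_dim, N))
Y -= Y.mean(axis=1, keepdims=True)                 # centered training data

# Training embedding from the Gram matrix S = Y^T Y
S = Y.T @ Y
lam, U = np.linalg.eigh(S)
order = np.argsort(lam)[::-1][:P]
lam, U = lam[order], U[:, order]                   # top P eigenpairs

# Test point: only its squared distances to the training points are known
y_new = rng.normal(size=D_dim)
d = np.sum((Y - y_new[:, None]) ** 2, axis=0)      # length-N vector
D2 = np.sum((Y[:, :, None] - Y[:, None, :]) ** 2, axis=0)

# s = -1/2 (d - (1/N) 11^T d - (1/N) D2 1 + (1/N^2) 11^T D2 1)
s = -0.5 * (d - d.mean() - D2.mean(axis=1) + D2.mean())
x_new = (U.T @ s) / np.sqrt(lam)                   # x = Lambda^{-1/2} U^T s

# Sanity check against the coordinate-based formula x = I_{PxD} V^T y
V, sigma, UT = np.linalg.svd(Y, full_matrices=False)
assert np.allclose(np.abs(x_new), np.abs(V[:, :P].T @ y_new), atol=1e-6)
```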

Embeddings with MDS (figure)

MDS variants
Classical metric MDS preserves only the pairwise scalar products.
Variants try to preserve the pairwise distances directly by minimizing the stress function of metric MDS:
$E_{\text{mMDS}} = \frac{1}{2} \sum_{i,j=1}^{N} w_{ij} \, (d_y(i,j) - d_x(i,j))^2$
The variants do not depend on any generative model.

Sammon's nonlinear mapping (NLM)
Sammon's stress function:
$E_{\text{NLM}} = \frac{1}{c} \sum_{i<j} \frac{(d_y(i,j) - d_x(i,j))^2}{d_y(i,j)}$, where $c = \sum_{i<j} d_y(i,j)$
It is minimized iteratively with a quasi-Newton optimization:
$x_k(i) \leftarrow x_k(i) - \alpha \, \frac{\partial E_{\text{NLM}} / \partial x_k(i)}{\partial^2 E_{\text{NLM}} / \partial x_k(i)^2}$

NLM: Derivation
Direct calculation gives:
$\frac{\partial E_{\text{NLM}}}{\partial x_k(i)} = \sum_{j \ne i} \frac{\partial E_{\text{NLM}}}{\partial d_x(i,j)} \frac{\partial d_x(i,j)}{\partial x_k(i)} = -\frac{2}{c} \sum_{j \ne i} \frac{d_y(i,j) - d_x(i,j)}{d_y(i,j) \, d_x(i,j)} \, (x_k(i) - x_k(j))$
$\frac{\partial^2 E_{\text{NLM}}}{\partial x_k(i)^2} = -\frac{2}{c} \sum_{j \ne i} \left( \frac{d_y(i,j) - d_x(i,j)}{d_y(i,j) \, d_x(i,j)} - \frac{(x_k(i) - x_k(j))^2}{d_x^3(i,j)} \right)$

NLM: Algorithm
1. Compute the pairwise distances $d_y(i,j)$
2. Initialize the points $x(i)$ randomly or by PCA
3. Calculate the quasi-Newton update for each point
4. Update the coordinates of all points $x(i)$
5. Return to step 3 until convergence
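A minimal NumPy sketch of this loop, using the derivatives from the previous slide. The step size, iteration count, and initialization scale are illustrative assumptions; following Sammon's original algorithm, the step divides by the magnitude of the second derivative.

```python
import numpy as np

def sammon(Dy, P=2, alpha=0.3, n_iter=200, seed=0):
    """Sammon's nonlinear mapping from an N x N matrix of pairwise distances Dy."""
    N = Dy.shape[0]
    c = Dy[np.triu_indices(N, k=1)].sum()          # normalizing constant c = sum_{i<j} d_y(i,j)
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=1e-2, size=(N, P))        # random initialization (PCA also works)
    eps = 1e-9                                     # guards against division by zero
    for _ in range(n_iter):
        X_new = X.copy()
        for i in range(N):                         # quasi-Newton update for each point
            diff = X[i] - X                        # row j holds x(i) - x(j)
            dx = np.sqrt(np.sum(diff ** 2, axis=1)) + eps
            dy = Dy[i] + eps
            mask = np.arange(N) != i
            w = ((dy - dx) / (dy * dx))[mask, None]
            g = -2.0 / c * np.sum(w * diff[mask], axis=0)                            # dE/dx_k(i)
            h = -2.0 / c * np.sum(w - diff[mask] ** 2 / dx[mask, None] ** 3, axis=0)  # d2E/dx_k(i)^2
            X_new[i] = X[i] - alpha * g / (np.abs(h) + eps)
        X = X_new                                  # update the coordinates of all points
    return X

# Usage: map 3-D points to 2-D
rng = np.random.default_rng(4)
Y = rng.normal(size=(60, 3))
Dy = np.sqrt(np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2))
X = sammon(Dy)
```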

Embedding of test set
There is no easy way to generalize the embedding to new points. Possible workarounds: updating only the new point with quasi-Newton steps, the interpolation procedure of curvilinear component analysis, or neural variants of NLM such as SAMANN.

Embeddings with NLM (figure)

Curvilinear component analysis (CCA)
Minimizes the stress function:
$E_{\text{CCA}} = \frac{1}{2} \sum_{i,j=1}^{N} (d_y(i,j) - d_x(i,j))^2 \, F_\lambda(d_x(i,j))$
Typically $F_\lambda$ is monotonically decreasing:
$F_\lambda(d_x) = \exp\!\left(-\frac{d_x}{\lambda}\right)$
$F_\lambda(d_x) = H(\lambda - d_x)$, where $H(u) = 0$ if $u \le 0$ and $H(u) = 1$ otherwise

CCA: Derivation
Minimization by gradient descent: $x(i) \leftarrow x(i) - \alpha \, \nabla_{x(i)} E_{\text{CCA}}$
Direct calculation gives:
$\nabla_{x(i)} E_{\text{CCA}} = \sum_{j=1}^{N} (d_y - d_x) \left( 2 F_\lambda(d_x) - (d_y - d_x) F'_\lambda(d_x) \right) \frac{x(j) - x(i)}{d_x}$,
where $d_y = d_y(i,j)$ and $d_x = d_x(i,j)$.

Condition for λ
The condition $2 F_\lambda(d_x) > (d_y - d_x) F'_\lambda(d_x)$ guarantees that the distances change reasonably.
For $F_\lambda(d_x) = \exp\!\left(-\frac{d_x}{\lambda}\right)$: $\lambda > \frac{1}{2}(d_x - d_y)$
For $F_\lambda(d_x) = H(\lambda - d_x)$: the condition is always fulfilled.
The parameters α and λ can be decreased during convergence.

CCA: Problem with traditional gradient descent
Gradient descent can get stuck in a local minimum.
A better solution: stochastic gradient descent.

CCA: Stochastic gradient descent
Decompose $E_{\text{CCA}}$:
$E_{\text{CCA}} = \sum_{i=1}^{N} E^i_{\text{CCA}}$, where $E^i_{\text{CCA}} = \frac{1}{2} \sum_{j=1}^{N} (d_y(i,j) - d_x(i,j))^2 \, F_\lambda(d_x(i,j))$
Separate optimization:
$x(j) \leftarrow x(j) - \alpha \, \nabla_{x(j)} E^i_{\text{CCA}}$, i.e. $x(j) \leftarrow x(j) - \alpha \, \beta(i,j) \, \frac{x(i) - x(j)}{d_x}$,
where $\beta(i,j) = (d_y - d_x) \left( 2 F_\lambda(d_x) - (d_y - d_x) F'_\lambda(d_x) \right)$

CCA: Algorithm
1. Perform vector quantization for size reduction
2. Compute the pairwise distances $d_y(i,j)$
3. Initialize the points $x(i)$ randomly or by PCA
4. Set the learning rate α and the neighborhood width λ
5. Select a point $x(i)$ and update all the others
6. Return to step 5 until all points $x(i)$ have been selected in this epoch
7. If not converged, return to step 4
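A NumPy sketch of this loop with the exponential weighting $F_\lambda(d_x) = \exp(-d_x/\lambda)$. The initial values and decay schedules for α and λ are illustrative assumptions, and the vector-quantization step is omitted.

```python
import numpy as np

def cca(Dy, P=2, n_epochs=50, alpha0=0.3, lam0=None, seed=0):
    """Curvilinear component analysis with F_lambda(d_x) = exp(-d_x / lambda)."""
    N = Dy.shape[0]
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=1e-2, size=(N, P))        # random initialization (PCA also works)
    lam0 = Dy.max() if lam0 is None else lam0
    eps = 1e-9
    for epoch in range(n_epochs):
        # Decrease the learning rate and the neighborhood width during convergence
        alpha = alpha0 * (1.0 - epoch / n_epochs)
        lam = lam0 * 0.05 ** (epoch / max(n_epochs - 1, 1))
        for i in rng.permutation(N):               # select a point x(i) ...
            diff = X[i] - X                        # row j holds x(i) - x(j)
            dx = np.sqrt(np.sum(diff ** 2, axis=1)) + eps
            dy = Dy[i]
            F = np.exp(-dx / lam)                  # F_lambda(d_x)
            dF = -F / lam                          # F'_lambda(d_x)
            beta = (dy - dx) * (2.0 * F - (dy - dx) * dF)
            step = alpha * (beta / dx)[:, None] * diff   # ... and update all the others
            step[i] = 0.0                          # x(i) itself stays fixed
            X -= step
    return X

# Usage: map 3-D points to 2-D
rng = np.random.default_rng(5)
Y = rng.normal(size=(80, 3))
Dy = np.sqrt(np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2))
X = cca(Dy)
```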

Embedding of test set
The original points are kept fixed. For each test point, the update rule is applied to move it to the right position.

Embeddings with CCA (figure)

Summary
Three dimensionality reduction methods based on distance preservation: multidimensional scaling, Sammon's nonlinear mapping, and curvilinear component analysis.
MDS is a generalization of PCA to pairwise scalar products and distances.
NLM and CCA preserve distances directly by minimizing a corresponding stress function.