Dimensionality Reduction

Given N vectors in n dims, find the k most important axes to project them
  k is user defined (k < n)
Applications: information retrieval & indexing
  identify the k most important features, or
  reduce indexing dimensions for faster retrieval (low-dim indices are faster)

E.G.M. Petrakis, Dimensionality Reduction
Techniques

Eigenvalue analysis techniques [NR 9]:
  Karhunen-Loeve (K-L) transform
  Singular Value Decomposition (SVD)
  both need O(N^2) time
FastMap [Faloutsos & Lin 95]:
  dimensionality reduction and mapping of objects to vectors
  O(N) time
Mathematical Preliminaries

For an n x n square matrix S, a unit vector x, and a scalar value λ with
  S x = λ x
x: eigenvector of S, λ: eigenvalue of S
The eigenvectors of a symmetric matrix (S = S^T) are mutually orthogonal and its eigenvalues are real
rank r of a matrix: maximum number of independent columns (or rows)
Example 1

Intuition: S defines an affine transform y = S x that involves scaling and rotation
  eigenvectors: unit vectors along the new directions
  eigenvalues denote scaling

S = [2 1; 1 3],  λ1 = 3.62,  λ2 = 1.38
u1 = [0.52 0.85]^T: eigenvector of the major axis
u2 = [-0.85 0.52]^T
Example 2

If S is real and symmetric (S = S^T), then it can be written as S = U Λ U^T
  the columns of U are the eigenvectors of S
  U: column orthogonal (U U^T = I)
  Λ: diagonal with the eigenvalues of S

[2 1; 1 3] = [0.52 -0.85; 0.85 0.52] [3.62 0; 0 1.38] [0.52 0.85; -0.85 0.52]
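The decomposition above can be checked numerically; a minimal sketch using numpy (the variable names are mine):

```python
import numpy as np

# The 2x2 example: S is real and symmetric, so S = U diag(l) U^T
# with orthonormal eigenvector columns in U.
S = np.array([[2.0, 1.0],
              [1.0, 3.0]])

l, U = np.linalg.eigh(S)          # eigenvalues in ascending order: ~1.38, ~3.62
S_rebuilt = U @ np.diag(l) @ U.T  # reconstructs S

print(np.round(l, 2))             # [1.38 3.62]
print(np.allclose(S, S_rebuilt))  # True
```

Note that `eigh` (for symmetric matrices) returns the eigenvalues in ascending order, so the slide's λ1 = 3.62 appears last.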
Karhunen-Loeve (K-L)

Project onto a k-dimensional space (k < n) minimizing the error of the projections (sum of squared differences)
K-L gives a linear combination of axes sorted by importance
  keep the first k dims

Figure: 2-dim points and the K-L directions; for k = 1 keep x1'
Computation of K-L

Put the N vectors in the rows of A = [a_ij]
Compute B = [a_ij - a_p], where a_p = (1/N) Σ_{i=1..N} a_ip (the column average)
Covariance matrix: C = B^T B
Compute the eigenvectors of C
Sort them in decreasing eigenvalue order
Approximate each object by its projections on the directions of the first k eigenvectors
Intuition

B shifts the origin to the center of gravity of the vectors (by a_p) and has zero column mean
C represents attribute-to-attribute similarity
  C is square, real, and symmetric
Eigenvectors and eigenvalues are computed on C, not on A
C denotes the affine transform that minimizes the error
Approximate each vector with its projections along the first k eigenvectors
Example

Input vectors: [1 2], [1 1], [0 0]
Column averages: 2/3 and 1

A = [1 2; 1 1; 0 0],  B = [1/3 1; 1/3 0; -2/3 -1]
C = B^T B = [2/3 1; 1 2]
λ1 = 2.53, u1 = [0.47 0.88]^T
λ2 = 0.13, u2 = [-0.88 0.47]^T
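The K-L steps on this example can be sketched in a few lines of numpy (variable names are mine):

```python
import numpy as np

# K-L transform on the example: subtract column averages, form C = B^T B,
# eigendecompose, sort by decreasing eigenvalue, project on the first k axes.
A = np.array([[1.0, 2.0],
              [1.0, 1.0],
              [0.0, 0.0]])

B = A - A.mean(axis=0)        # column averages are 2/3 and 1
C = B.T @ B                   # here [[2/3, 1], [1, 2]]

l, U = np.linalg.eigh(C)      # eigenvalues ascending
order = np.argsort(l)[::-1]   # sort in decreasing eigenvalue order
l, U = l[order], U[:, order]  # l ~ [2.53, 0.13]

k = 1
projections = B @ U[:, :k]    # each object approximated by k coordinates
```

For k = 1 each 2-dim point is reduced to a single coordinate along u1, which is the best 1-dim approximation in the sum-of-squared-differences sense.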
SVD

For general rectangular matrices:
  an N x n matrix (N vectors, n dimensions)
  groups similar entities (documents) together
  groups similar terms together; each group of terms corresponds to a concept
Given an N x n matrix A, write it as A = U Λ V^T
  U: N x r, column orthogonal (r: rank of A)
  Λ: r x r diagonal matrix (non-negative values, in descending order)
  V: n x r, column orthogonal (so V^T is r x n)
SVD (cont'd)

A = λ1 u1 v1^T + λ2 u2 v2^T + ... + λr ur vr^T
  u_i, v_i are the column vectors of U, V
SVD identifies rectangular "blobs" of related values in A
The rank r of A: the number of blobs
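The rank-1 expansion can be verified directly; a small sketch (the toy matrix is mine, chosen to have two obvious blobs):

```python
import numpy as np

# Spectral form of SVD: A equals the sum of r rank-1 terms s_i * u_i * v_i^T.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 2.0]])

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))   # rank of A: here 2 (two blobs)

A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(r))
print(r, np.allclose(A, A_rebuilt))  # 2 True
```

Dropping the terms with the smallest singular values gives the best low-rank approximation of A, which is exactly the dimensionality reduction exploited below.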
Example

Term/Document  data  information  retrieval  brain  lung
CS-TR1           1        1           1        0      0
CS-TR2           2        2           2        0      0
CS-TR3           1        1           1        0      0
CS-TR4           5        5           5        0      0
MED-TR1          0        0           0        2      2
MED-TR2          0        0           0        3      3
MED-TR3          0        0           0        1      1

Two types of documents: CS and Medical
Two concepts (groups of terms):
  CS: data, information, retrieval
  Medical: brain, lung
Example (cont'd)

A = U Λ V^T, with r = 2:

U   = [0.18 0; 0.36 0; 0.18 0; 0.90 0; 0 0.53; 0 0.80; 0 0.27]
Λ   = [9.64 0; 0 5.29]
V^T = [0.58 0.58 0.58 0 0; 0 0 0 0.71 0.71]

U: document-to-concept similarity matrix
V: term-to-concept similarity matrix
  e.g., v_12 = 0: the term "data" has zero similarity with the 2nd (medical) concept
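The decomposition of the term-document matrix can be reproduced with numpy (sketch; note that SVD is only determined up to the signs of matching u_i, v_i columns, so numpy's signs may differ from the slide):

```python
import numpy as np

# SVD of the 7x5 term-document matrix from the example.
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10))        # rank 2: two concepts

print(np.round(s[:r], 2))         # [9.64 5.29]
print(np.round(np.abs(Vt[:r]), 2))
# rows ~ [0.58 0.58 0.58 0 0] (CS concept) and [0 0 0 0.71 0.71] (medical)
```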
SVD and LSI

SVD leads to Latent Semantic Indexing (http://lsi.research.telcordia.com/lsi/lsipapers.html)
Terms that occur together are grouped into concepts
When a user searches for a term, the system determines the relevant concepts to search
LSI maps concepts to vectors in the concept space instead of the n-dim. document space
  the concept space has lower dimensionality
Examples of Queries

Find documents containing the term "data":
  q = [1 0 0 0 0]^T
Translate the query vector q to concept space:
  q_c = V^T q = [0.58 0]^T
The query is related to the CS concept and unrelated to the medical concept
LSI also returns documents that contain the terms "retrieval" and "information", which are not specified by the query
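The query translation is a single matrix-vector product; a sketch with the V^T of the example (variable names mine):

```python
import numpy as np

# Map the query for the term "data" (first term) to concept space: q_c = V^T q.
Vt = np.array([[0.58, 0.58, 0.58, 0.00, 0.00],   # CS concept
               [0.00, 0.00, 0.00, 0.71, 0.71]])  # medical concept

q = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # query vector for "data"
q_c = Vt @ q
print(q_c)  # [0.58 0.  ] -> related to the CS concept, unrelated to medical
```

Any document whose concept-space vector is close to q_c is returned, even if it contains "information" or "retrieval" rather than "data".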
FastMap

Works with distances; has two roles:
1. Maps objects to vectors so that their distances are preserved (then apply SAMs for indexing)
2. Dimensionality reduction: given N vectors with n attributes each, find N vectors with k attributes such that distances are preserved as much as possible
Main Idea

Pretend that the objects are points in some unknown n-dimensional space
  project these points on k mutually orthogonal axes
  compute the projections using distances only
The heart of FastMap is the method that projects the objects on a line:
  take two objects that are far apart (pivots)
  project on the line that connects the pivots
Project Objects on a Line

Apply the cosine law:
  d_bi^2 = d_ai^2 + d_ab^2 - 2 x_i d_ab
  =>  x_i = (d_ai^2 + d_ab^2 - d_bi^2) / (2 d_ab)

O_a, O_b: pivots; O_i: any object
d_ij: shorthand for D(O_i, O_j)
x_i: first coordinate in the k-dimensional space
If O_i is close to O_a, x_i is small
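The projection formula translates directly into code; a sketch (the function name and the sanity-check points are mine):

```python
# Projection of object O_i on the line through pivots O_a, O_b, using only
# pairwise distances (cosine law):
#   x_i = (d_ai^2 + d_ab^2 - d_bi^2) / (2 * d_ab)
def project(d_ai: float, d_bi: float, d_ab: float) -> float:
    return (d_ai**2 + d_ab**2 - d_bi**2) / (2.0 * d_ab)

# Sanity check in a known 2-d setting: O_a=(0,0), O_b=(4,0), O_i=(1,2).
# Then d_ai=sqrt(5), d_bi=sqrt(13), d_ab=4, and x_i should be 1 (the
# x-coordinate of O_i along the pivot line).
print(project(5**0.5, 13**0.5, 4.0))  # ~1.0
```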
Choose Pivots

Heuristic, complexity O(N):
1. choose an arbitrary object O_b
2. let O_a be the object farthest from O_b
3. let O_b be the object farthest from O_a
Steps 2, 3 can be repeated 4-5 times to improve the accuracy of selection
The optimal algorithm would require O(N^2) time
Extension to Many Dimensions

Consider the (n-1)-dimensional hyperplane H that is perpendicular to the line O_a O_b
Project the objects on H and apply the previous step:
  choose two new pivots
  the new x_i is the next coordinate of each object
  repeat until k-dim. vectors are obtained
The distance on H is not D but D': the distance between the projected objects
Distance on the Hyperplane H

Pythagorean theorem:
  D'(O_i', O_j')^2 = D(O_i, O_j)^2 - (x_i - x_j)^2
D' on H can be computed from the Pythagorean theorem
The ability to compute D' allows for computing a second line on H, etc.
Algorithm

[figure: FastMap pseudocode]
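Putting the pieces together, a minimal FastMap sketch (function and variable names are mine; `dist` is any distance function over the objects):

```python
import numpy as np

def fastmap(objects, dist, k):
    """Map objects to k-dim vectors using only pairwise distances."""
    n = len(objects)
    X = np.zeros((n, k))              # output coordinates

    def d2(i, j, level):              # squared D' at the given recursion level
        s = dist(objects[i], objects[j]) ** 2
        for h in range(level):        # D'^2 = D^2 - (x_i - x_j)^2 per level
            s -= (X[i, h] - X[j, h]) ** 2
        return max(s, 0.0)            # turn negative distances to 0

    def choose_pivots(level):         # O(N) farthest-pair heuristic
        a = 0
        b = max(range(n), key=lambda j: d2(a, j, level))
        a = max(range(n), key=lambda j: d2(b, j, level))
        return a, b

    for h in range(k):
        a, b = choose_pivots(h)
        dab2 = d2(a, b, h)
        if dab2 == 0.0:               # all remaining distances are zero
            break
        for i in range(n):            # cosine-law projection on the pivot line
            X[i, h] = (d2(a, i, h) + dab2 - d2(b, i, h)) / (2.0 * dab2 ** 0.5)
    return X

# For points already in 2-d Euclidean space and k=2, the mapped
# distances should match the original ones.
pts = [(0.0, 0.0), (4.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
euclid = lambda p, q: ((p[0] - q[0])**2 + (p[1] - q[1])**2) ** 0.5
X = fastmap(pts, euclid, k=2)
```

In practice the pivots of each call would also be recorded, so that a query object can later be mapped with O(1) distance computations per dimension, as the next slide notes.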
Observations

Complexity: O(kN) distance calculations
  k: desired dimensionality
  k recursive calls, each taking O(N) time
The algorithm records the pivots of each call (dimension) to facilitate queries:
  a query is mapped to a k-dimensional vector by projecting it on the pivot lines of each dimension
  O(1) distance computations per step: no need to recompute pivots
Observations (cont'd)

The projected vectors can be indexed
  mapping to 2-3 dimensions allows for visualization of the data space
Assumes a Euclidean space (triangle rules)
  not always true (at least after the second step)
  approximation of pivots
  some (squared) distances become negative: turn negative distances to 0
Application: Document Vectors

distance(d1, d2) = sqrt(2 (1 - cos θ)) = 2 sin(θ/2) = sqrt(2 (1 - similarity(d1, d2)))
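This turns cosine similarity into a distance FastMap can consume; a sketch (the function name is mine):

```python
import numpy as np

# Distance between document vectors from their cosine similarity:
#   dist = sqrt(2 * (1 - cos(theta))) = 2 * sin(theta / 2)
def doc_distance(d1, d2):
    d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
    cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
    return np.sqrt(2.0 * (1.0 - cos))

# Orthogonal documents (theta = 90 deg): distance = sqrt(2) ~ 1.414
print(doc_distance([1, 0], [0, 1]))
```

This is exactly the Euclidean distance between the documents after normalizing them to unit length, which is why it pairs naturally with FastMap's Euclidean assumption.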
Figure: FastMap on document vectors for 2 & 3 dims: (a) k = 2 and (b) k = 3
References

C. Faloutsos. Searching Multimedia Databases by Content. Kluwer, 1996.
W. Press et al. Numerical Recipes in C. Cambridge Univ. Press, 1988.
LSI website: http://lsi.research.telcordia.com/lsi/lsipapers.html
C. Faloutsos, K.-I. Lin. FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. Proc. of SIGMOD, 1995.