Sta306b May 21, 2012 Dimension Reduction
Self-Organizing Maps
A SOM represents the data by a set of prototypes (like K-means). These prototypes are topologically organized on a lattice structure. In the example, each circle represents a prototype. The points associated with each prototype have been randomly jittered (for viewing).
[Figure: prototypes arranged on a 5 x 5 lattice.]
A SOM is a discrete version of a principal surface, where the surface is represented by a topologically constrained set of prototypes.
SOM Algorithm
Each prototype m_k has a pair of integer lattice coordinates, e.g. l_k = (2,4). SOM algorithms are typically online. Initialize by placing a lattice of prototypes uniformly on the principal component plane. Then loop over the data until stabilization: for each observation x_i, identify the closest prototype m_j, and move all prototypes toward x_i:
m_k <- m_k + alpha * h(||l_j - l_k||) * (x_i - m_k),
where the neighborhood function h dies down with increasing topological distance between prototypes. Batch versions of the SOM algorithm are similar to K-means and principal surfaces.
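The online update above can be sketched in a few lines of NumPy. This is a minimal illustration, not the WEBSOM implementation: the grid size, the decay schedules for alpha and the neighborhood width, the Gaussian form of h, and the random (rather than principal-component-plane) initialization are all illustrative choices.

```python
import numpy as np

def som_online(X, grid=(5, 5), n_iter=2000, alpha0=0.5, sigma0=2.0, seed=0):
    """Online SOM sketch: prototypes m_k with integer lattice coordinates l_k.

    alpha0 (learning rate) and sigma0 (neighborhood width) are illustrative.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # lattice coordinates l_k = (row, col)
    lattice = np.array([(r, c) for r in range(grid[0])
                        for c in range(grid[1])], dtype=float)
    # simple random initialization (the slides suggest the PC plane instead)
    M = X[rng.choice(n, lattice.shape[0], replace=True)].astype(float)
    for t in range(n_iter):
        frac = t / n_iter
        alpha = alpha0 * (1 - frac)          # learning rate decays to 0
        sigma = sigma0 * (1 - frac) + 0.5    # neighborhood shrinks
        x = X[rng.integers(n)]               # one observation x_i
        j = np.argmin(((M - x) ** 2).sum(axis=1))       # closest prototype m_j
        d2 = ((lattice - lattice[j]) ** 2).sum(axis=1)  # lattice distance to l_j
        h = np.exp(-d2 / (2 * sigma ** 2))   # h dies with topological distance
        # m_k <- m_k + alpha * h * (x_i - m_k), applied to all prototypes
        M += alpha * h[:, None] * (x - M)
    return M, lattice
```

Each update pulls the winning prototype and, more weakly, its lattice neighbors toward the current observation; this is what keeps the map topologically organized.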
SOM Example: Document Organization and Retrieval
Example taken (with permission from Kohonen et al.) from the WEBSOM homepage: http://websom.hut.fi/websom/. The data are 12,088 articles from the newsgroup comp.ai.neural-nets. Observations are represented by a term-document matrix: for each document, the features are the relative frequencies of each term in a dictionary (e.g. 50,000 words). The WEBSOM software uses a randomized version of the SVD to initially reduce the dimension of the data. WEBSOM has a zoom feature which allows one to interact with the map.
[Figure: WEBSOM heatmap.]
Multidimensional Scaling
Like principal surfaces and the SOM, MDS delivers a low-dimensional mapping of high-dimensional data. MDS requires only interpoint distances, while PS/SOM require coordinate data. MDS delivers low-dimensional coordinates for each observation; PS/SOM deliver a mapping function. XGvis is freely available (and wonderful) software that implements MDS (Buja, Swayne, Littman and Dean, 1998); http://www.research.att.com/areas/stat/xgobi
Example: in a consumer survey, respondents are asked to compare different products by giving pairwise similarity rankings. Based on the average dissimilarities, MDS will produce a two-dimensional map of the products.
MDS: Example
Political science students were asked to give similarities between countries, based on social and political environment (Kaufman and Rousseeuw, 1990). The resulting dissimilarity matrix:

      BEL   BRA   CHI   CUB   EGY   FRA   IND   ISR   USA   USS   YUG
BRA  5.58
CHI  7.00  6.50
CUB  7.08  7.00  3.83
EGY  4.83  5.08  8.17  5.83
FRA  2.17  5.75  6.67  6.92  4.92
IND  6.42  5.00  5.58  6.00  4.67  6.42
ISR  3.42  5.50  6.42  6.42  5.00  3.92  6.17
USA  2.50  4.92  6.25  7.33  4.50  2.25  6.33  2.75
USS  6.08  6.67  4.25  2.67  6.00  6.17  6.17  6.92  6.17
YUG  5.25  6.83  4.50  3.75  5.75  5.42  6.08  5.83  6.67  3.67
ZAI  4.75  3.00  6.08  6.67  5.00  5.58  4.83  6.17  5.67  6.50  6.92
MDS Solution
[Figure: two-dimensional MDS map of the countries (ZAI, BRA, EGY, USA, BEL, ISR, FRA, IND, YUG, USS, CHI, CUB); axes are the first and second MDS coordinates.]
Colored sets denote the partitioning found by K-medoids (discussed earlier).
Background
Given data x_1, x_2, ..., x_n in R^p, with pairwise distances d_{ii'} = ||x_i - x_{i'}||, we seek values z_1, z_2, ..., z_n in R^k with k < p so that ||z_i - z_{i'}|| approximates d_{ii'}. This is similar in flavor to principal curves, where points close together in the original space should map to nearby points in the reduced space. But with principal curves, points far apart could also map close together; not so in MDS, which tries to preserve all pairwise distances.
MDS Algorithms
Least Squares or Kruskal-Shephard Metric Scaling minimizes the "stress function"
S(z) = sum_{i != i'} (d_{ii'} - ||z_i - z_{i'}||)^2
with respect to the coordinates z_i, using gradient descent. Not a nice function!
The Sammon Mapping minimizes
sum_{i != i'} (d_{ii'} - ||z_i - z_{i'}||)^2 / d_{ii'},
which puts more emphasis on preserving smaller pairwise distances.
Shephard-Kruskal Nonmetric Scaling (using only ranks) minimizes
sum_{i,i'} [theta(||z_i - z_{i'}||) - d_{ii'}]^2 / sum_{i,i'} d_{ii'}^2
over the z_i and an arbitrary increasing function theta(.). Uses isotonic regression.
Classical MDS
A slightly different formulation, which leads to a simple eigenvalue problem. Define the similarities S_{ii'} = <x_i - xbar, x_{i'} - xbar> (centered inner products), and minimize
sum_{i,i'} [S_{ii'} - <z_i - zbar, z_{i'} - zbar>]^2.
Details of the solution are in Exercise 14.11 in the text. Classical scaling uses the identity
||x_i - x_{i'}||^2 = ||x_i||^2 + ||x_{i'}||^2 - 2 <x_i, x_{i'}>.
If the data are centered and lie on a sphere, then ||x_i||^2 is constant, and classical MDS is equivalent to LS scaling. In general it is a good approximation.
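The eigenvalue solution can be sketched as follows: double-center the squared distance matrix to recover the centered inner products S (using the identity above), then take the top-k eigenvectors scaled by the square roots of their eigenvalues. The function name and defaults are for illustration only.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical scaling sketch for an (n, n) Euclidean distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    # S_{ii'} = <x_i - xbar, x_{i'} - xbar>, recovered by double centering
    S = -0.5 * J @ (D ** 2) @ J
    w, V = np.linalg.eigh(S)              # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]         # keep the top-k
    # clip tiny negative eigenvalues from round-off before the square root
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```

When D comes from exact Euclidean distances in k dimensions, the returned coordinates reproduce all pairwise distances up to rotation and reflection.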
Applications of MDS
- social sciences: summarizing proximity data
- archeology: the similarity of two digging sites can be quantified by the frequency of shared features of artifacts found at each site
- classification problems: analysis of pairwise classification rates with many classes
- graph layout
[Figures: two examples comparing classical scaling, the Sammon map, and isomds embeddings of 20 numbered points.]
Classical vs Local MDS
[Figure: classical MDS (left panel) and local MDS (right panel) representations of parabola data.]
The orange points show data lying on a parabola, while the blue points show multidimensional scaling representations in one dimension. Classical multidimensional scaling (left panel) does not preserve the ordering of the points along the curve, because it judges points on opposite ends of the curve to be close together. In contrast, local multidimensional scaling (right panel) does a good job of preserving the ordering of the points along the curve.
Glossary

Machine learning/AI                        Statistics
neural network                             model
self-organizing map                        principal surface
weights                                    parameters
learning                                   fitting
generalization                             test set performance
supervised learning                        regression/classification
unsupervised learning                      density estimation
optimal brain damage                       model selection
large grant = $500,000                     large grant = $5,000
nice place to have a meeting:              nice place to have a meeting:
  Snowbird, Utah; French Alps                Las Vegas in August