Supplemental Materials for Local Multidimensional Scaling for Nonlinear Dimension Reduction, Graph Drawing and Proximity Analysis
Lisha Chen and Andreas Buja
Yale University and University of Pennsylvania
October 3, 2008

Abstract

In this supplemental report we include sections that describe (1) a new interpretation of nonlinear dimension reduction in terms of a population framework as opposed to manifold learning, (2) a simulation example based on the well-known Swiss roll to illustrate LMDS on data with known structure, and (3) robustness properties of the LC meta-criterion under an error model. Finally, we include color versions of the plots in the main paper.

1 A Population Framework

Nonlinear dimension reduction can be approached with statistical modeling if it is thought of as manifold learning. Zhang and Zha (2005), for example,
examine tangent space alignment with a theoretical error analysis whereby an assumed target manifold is reconstructed from noisy point observations.

With practical experience in mind, we introduce an alternative to manifold learning and its assumption that the data fall near a warped sheet. We propose instead that the data be modeled by a distribution in high dimensions that describes the data, warts and all, with variation caused by digitization, rounding, uneven sampling density, and so on. In this view the goal is to develop methodology that shows what can be shown in low dimensions and to provide diagnostics that pinpoint problems with the reduction. By avoiding the prejudices implied by assumed models, the data are allowed to show whatever they have to show, be it manifolds, patchworks of manifolds of different dimensions, clusters, sculpted shapes with protrusions or holes, general uneven density patterns, and so on. For an example where this unprejudiced EDA view is successful, see the Frey face image data (Section 4.2, main paper), which exhibit a noisy patchwork of 0-D, 1-D and 2-D submanifolds in the form of clusters, rods between clusters, and webfoot-like structures between rods. This data example also makes it clear that no single dimension reduction may be sufficient to show all that is of interest: very localized methods (K = 4) reveal the underlying video sequence and transitions between clusters, whereas global methods (PCA, MDS) show the extent of the major interpretable clusters and dimensions more realistically. In summary, it is often pragmatic to assume less (a distribution) rather than more (a manifold) and to assign the statistician the task of flattening the distribution to lower dimensions as faithfully and usefully as possible.

An immediate benefit of the distribution view is to recognize Isomap as non-robust due to its use of geodesic distances. For a distribution, a geodesic
distance is the length of a shortest path within the support of the distribution. If a distribution's support is convex, the notion of geodesic distance in the population reduces to chordal distance. For small samples, Isomap may find manifolds that approximate the high-density areas of the distribution, but with increasing sample size the geodesic paths will find shortcuts that cross the low-density areas and asymptotically approximate chordal distances, thus changing the qualitative message of the dimension reduction as N → ∞. By comparison, the LMDS criterion has safer behavior under statistical sampling, as meaningful limits for N → ∞ exist. Here is the population target:

$$
\mathrm{LMDS}_{P,\mathcal{N}}(x) \;=\; E\Big[\big(\|Y - Y'\| - \|x(Y) - x(Y')\|\big)^2\, 1_{[(Y,Y') \in \mathcal{N}]}\Big] \;-\; t\, E\Big[\|x(Y) - x(Y')\|\, 1_{[(Y,Y') \notin \mathcal{N}]}\Big], \tag{1}
$$

where Y, Y' are i.i.d. P(dy) on R^p, N is a symmetric neighborhood definition in R^p, and the configuration y → x(y) is interpreted as a map from the support of P(dy) in R^p to R^d whose quality is being judged by this criterion. When specialized to empirical measures, LMDS_{P,N}(x) becomes LMDS_{D,N}(x(y_1), ..., x(y_N)). A full theory would establish when local minima of (1) exist and when those of (equation 7, main paper) obtained from data converge to those of (1) for N → ∞. Further theory would describe rates with which N and t can be shrunk to obtain finer resolution while still achieving consistency.

The LC meta-criteria also have population versions for dimension reduction: For y in the support of P(dy) (which we assume continuous), let N^Y_α(y) be the Euclidean neighborhood of y that contains mass α: P[Y ∈ N^Y_α(y)] = α; similarly, in configuration space let N^X_α(x(y)) be the mass-α neighborhood of x(y): P[x(Y) ∈ N^X_α(x(y))] = α. Then the population version of the LC meta-criterion for parameter α is

$$
M_\alpha \;=\; \frac{1}{\alpha}\, P\big[\, Y' \in N^Y_\alpha(Y) \ \text{and} \ x(Y') \in N^X_\alpha(x(Y)) \,\big],
$$

where again Y, Y' are i.i.d. P(dy). This specializes to M_K (equation 9, main paper) for K/N = α when P(dy) is an empirical measure (modulo exclusion of the center points, which is asymptotically irrelevant). The quantity M_α is a continuity measure, unlike for example the criterion of Hessian Eigenmaps (Donoho and Grimes 2003), which is a smoothness measure.

A population point of view had proven successful once before in work by Buja, Logan, Reeds and Shepp (1994), which tackled the problem of MDS performance on completely uninformative or null data. It was shown that in the limit (N → ∞) null data {D_{i,j}} produce non-trivial spherical distributions as configurations, and that this effect is also a likely contributor to the horseshoe effect, the artifactual bending of MDS configurations in the presence of noise.

2 A Simulation Example: Swiss Roll Data

To explore the effectiveness of LMDS when the underlying structure is known, we rely on a simulation example widely used in nonlinear dimension reduction: the Swiss roll, which is a 2-D planar rectangle (Figure 9, left) rolled up and embedded in 3-D (Figure 9, right). The objective is to recover the intrinsic 2-D structure from the observed 3-D data. Our data consist of 100 points forming an equispaced 5 × 20 rectangular grid in parameter space, mapped to 3-D by warping the long side along a spiral. We color the data points red, orange, green and blue by dividing the rectangular grid into four chunks of 25 points along the long dimension. This scheme facilitates visual evaluation of a configuration generated with a particular method.
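The data construction just described can be sketched as follows. This is only a sketch under assumptions: the report does not give its exact spiral map, so the common (θ cos θ, θ sin θ) parameterization is used below, and the function name is ours.

```python
import numpy as np

def swiss_roll_grid(n_long=20, n_short=5):
    """Equispaced 5 x 20 grid in parameter space, rolled up into 3-D.

    The spiral map is an assumption (the report does not specify it):
    the long coordinate is sent to angle theta and embedded as
    (theta cos theta, theta sin theta), the short coordinate is kept flat.
    """
    s = np.arange(n_short, dtype=float)              # short side, kept flat
    t = np.arange(n_long, dtype=float)               # long side, to be rolled
    S, T = np.meshgrid(s, t)                         # (n_long, n_short) arrays
    grid2d = np.column_stack([T.ravel(), S.ravel()])  # intrinsic 2-D coordinates

    # roll angle: 1.5 pi to 4.5 pi across the long side (assumed range)
    theta = 1.5 * np.pi * (1.0 + 2.0 * T.ravel() / (n_long - 1))
    X3d = np.column_stack([theta * np.cos(theta),
                           theta * np.sin(theta),
                           S.ravel()])

    # four color chunks of 25 points each along the long dimension
    colors = np.repeat(["red", "orange", "green", "blue"],
                       n_long * n_short // 4)
    return grid2d, X3d, colors
```

The point ordering makes the first 25 rows the red chunk (the five shortest-side points for each of the first five long-side positions), matching the chunking described above.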
Given the data, we obtain a 2-D configuration by following these steps:

(1) Calculate the Euclidean pairwise distances between the 3-D data points.
(2) Define the neighborhood based on metric thresholding (ε-NN). More specifically, for any pair of points (i, j), (i, j) ∈ N iff D_{i,j} ≤ ε = 1. (We have also tried K-NN (K = 4) for defining the neighborhood, with results slightly inferior to metric thresholding.)
(3) Choose a value for the tuning parameter τ.
(4) Optimize the LMDS stress function to generate a configuration.

In Figure 10 we show the configurations for four different values of τ. We see that the configurations with relatively larger τ's (τ = 1, 0.1) tend to be stretched out on the long side, which indicates that the repulsion is too strong. The most desirable configuration is achieved at τ = 0.01, which is most similar to the intrinsic 2-D structure. With no repulsion (τ = 0) the configurations get trapped in local minima, which again indicates that repulsion is essential for achieving meaningful configurations. The quality of the configurations as measured by the LC meta-criterion is consistent with our visual impressions, with larger M_{ε′=1} corresponding to a better configuration. (As in K-NN thresholding, we use ε′ for the calculation of the meta-criterion, in distinction to the ε used in defining the local graph for the stress function.) This again supports the idea that the LC meta-criterion provides a good measure of local faithfulness and of the quality of a configuration.

We also compared the configurations generated by LMDS with those of other dimension reduction methods, including PCA, MDS, Isomap and LLE. The configurations are shown in Figure 11. PCA provides a linear projection of the data that captures the greatest variation of the data, which is determined by the spiral structure. The global structure of the MDS configuration is very similar to the PCA configuration due to the dominance of large distances in the stress function. For Isomap, we use both the original Isomap method and a slightly modified version. The latter uses metric distance scaling instead of classical scaling to generate the configuration from the shortest path distances. Interestingly, Isomap with classical scaling produces a double fan shape which systematically deviates from the underlying true structure. Isomap with metric distance scaling, on the other hand, yields a configuration with less distortion, which is another example of the generally lower desirability of classical scaling compared to metric scaling. The parabolic shape generated by LLE suggests that LLE is able to do the unrolling to some extent but not fully, at least not without further fine tuning (LLE is known to be sensitive to tuning). The last embedding, generated by LMDS, is clearly the winner, and it outperforms its closest competitor, the embedding from Isomap with metric scaling, in terms of faithfulness to local structure. While in Isomap with metric scaling the local texture is just a byproduct of the global embedding, due to the dominance of the large geodesic distances in the stress function, in LMDS one is able to tune τ to obtain a locally faithful configuration. We see this as an appealing feature of LMDS.

3 Moderate Robustness of the LC Meta-Criterion

A referee raised the question of the robustness of the LC meta-criterion to noise. In the presence of noise, the set of K nearest neighbors of each data point will be affected. By measuring the degree of local overlap between neighborhoods
of configurations and noise-contaminated data, can the LC meta-criterion still guide the search for desirable configurations? The referee's intuition was that if the noise level exceeds the diameter of the K nearest neighborhoods, the LC meta-criterion will become essentially random. We explored the question through simulation. The results show that the LC meta-criterion is moderately robust to noise; it certainly does not become completely random for moderate noise levels.

In our simulation, we corrupted the true interpoint distances D_{i,j} of the uncorrupted Swiss roll data with Gaussian noise e_D with mean 0 and standard deviation SD(e_D). In order to measure the ability of the LC meta-criterion to distinguish good from bad configurations, we corrupted the underlying true 2-D configuration (the 5 × 20 regular grid with unit-length grid steps) with Gaussian noise e_X with mean 0 and standard deviation SD(e_X). The idea is that the LC meta-criterion is supposed to show a decline for increasing SD(e_X), the steeper the better. We then calculated the LC meta-criterion M_{K′=4} by measuring the overlaps of K′ = 4 neighborhoods in terms of the corrupted distances D_{i,j} vis-à-vis the corrupted grids. Figure 12 shows M_{K′=4} as a function of the noise level SD(e_X) in each of the six plots, and as a function of the noise level SD(e_D) across the plots. The noise levels we considered for SD(e_X) and SD(e_D) both range from 0 to 1, which is commensurate with the fact that the true underlying grid steps have unit length. Each plot shows the averages and 90% coverage intervals for M_{K′=4} based on 100 simulations. As hoped for, M_{K′=4} generally decreases as SD(e_X) increases. When the distances are moderately contaminated with noise (SD(e_D) = 0.2, 0.4, 0.6), the LC meta-criterion provides a good measure of the quality of a configuration, indicated by the overall downward trend of M_{K′=4}. Only when the noise level is too great (SD(e_D) = 0.8, 1) does the LC meta-criterion lose its ability to distinguish good configurations (small SD(e_X)) from bad configurations (large SD(e_X)).

It is conceivable that a version of the above simulations could be developed into a general diagnostic methodology for data analysis. It would provide insight into the robustness of the structure found, in terms of a sensitivity analysis. We thank the referee for suggesting this simulation study, which may well lead to refining the methodology of non-linear dimension reduction in general.

4 Color Plots

We reproduce here the full series of plots of the main paper; those that were intended to be in color are rendered here in color. Figures 9 onward are new and illustrate the Swiss roll experiments in this supplemental report.

REFERENCES

Buja, A., Logan, B.F., Reeds, J.R., and Shepp, L.A. (1994), "Inequalities and Positive-Definite Functions Arising from a Problem in Multidimensional Scaling," The Annals of Statistics, 22.

Donoho, D.L., and Grimes, C. (2003), "Hessian Eigenmaps: Locally Linear Embedding Techniques for High-Dimensional Data," Proceedings of the National Academy of Sciences, 100 (10).

Zhang, Z., and Zha, H. (2005), "Principal Manifolds and Nonlinear Dimensionality Reduction via Tangent Space Alignment," SIAM Journal on Scientific Computing, 26.
Figure 1: Sculpture Face Data. Three-D LMDS configuration, K = 6, optimized τ. [Axis labels: Up-down Pose, Lighting Direction.]
Figure 2: Sculpture Face Data. Three-D configurations from five methods (PCA, MDS, LMDS, Isomap, LLE). The gray tones encode terciles of known parameters indicated in the three column labels (among them Lighting Direction and Up-down Pose), rotated to best separation.
Figure 3: Sculpture Face Data: Selection of τ and K in the stress function LMDS_{K,τ}. Top left: trace τ → M_{K′=6} for configurations that minimize LMDS_{K=6,τ}; a global maximum is attained near τ = .005. Top right: traces K → M_K and M^adj_K for τ-minimized configurations, constraining K′ = K in LMDS_K; the adjusted criterion points to a global maximum at K′ = K = 8. Bottom: assessing τ-optimized solutions of LMDS_K with traces K′ → M_{K′} and M^(adj)_{K′} for various values of K (legend: K = 4, 6, 8, 10, 25, 125, 650): K = 8 dominates over a range of K′ beginning at 4.
Figure 4: Sculpture Face Data, pointwise diagnostics for 2-D views of 3-D configurations from four methods: 3-D PCA (N_6 = 2.6), 3-D Isomap (N_6 = 4.5), 3-D LLE (N_6 = 2.8), 3-D LMDS (N_6 = 5.2). The colors code N_6(i) as follows: red, overlap 0 to 2; orange, overlap 3 to 4; green, overlap 5 to 6. A comparison of the density of red and orange points reveals the local dominance of Isomap and LMDS over PCA and LLE, as well as an edge of LMDS over Isomap on these data.
Figure 5: Frey Face Data. 2-D views of 3-D configurations from LMDS with four choices of K (K = 4, 8, 12, 20). [Panel labels report M_4; e.g. M_4 = .268 for K = 8.]
Figure 6: Frey Face Data. 2-D views of 3-D configurations, comparing PCA, MDS, Isomap (K = 12), LLE (K = 12), KPCA and LMDS (K = 12).
Figure 7: Frey Face Data. The six configurations (from Figure 6) with connecting lines that reflect the time order of the video footage.
Figure 8: Frey Face Data. Traces of the meta-criterion, K′ → M_{K′}, for various choices of the neighborhood size K in the local stress LMDS_K (legend: K = 4, 8, 12, 20, 50, 100, 150, 200). Left: unadjusted fractional overlap; right: adjusted with MDS as the baseline. MDS is LMDS with K = 1964, which by definition has the profile M^adj_{K′} ≡ 0 in the right-hand frame.
Figure 9: Swiss Roll Data: A 2-D plane (left) rolled up and embedded into 3-D space (right).
Figure 10: Swiss Roll Data. 2-D configurations from LMDS with four choices of τ (τ = 1, 0.1, 0.01, 0). [Panel labels report M_{ε′=1}; e.g. M_{ε′=1} = 0.44 for τ = 1.]
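For concreteness, the local stress behind configurations such as those in Figure 10 can be sketched as follows. Everything below is our reconstruction, not the report's code: plain gradient descent stands in for the unspecified optimizer, the repulsion weight `t` plays the role the text assigns to τ, and the function names are ours.

```python
import numpy as np

def lmds_stress(X, D, inN, t):
    """Local stress: squared-error attraction on neighbor pairs, minus
    t times the summed configuration distances of non-neighbor pairs."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    iu = np.triu_indices_from(D, k=1)          # each unordered pair once
    return ((D[iu] - d[iu]) ** 2 * inN[iu]).sum() - t * (d[iu] * ~inN[iu]).sum()

def lmds_fit(D, inN, t=0.01, d=2, steps=800, lr=0.01, seed=0):
    """Minimize the stress by plain gradient descent (a stand-in for the
    optimizer actually used, which this supplement does not describe).

    D   : (n, n) input distance matrix
    inN : (n, n) symmetric boolean neighborhood indicator, False on diagonal
    """
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.normal(scale=0.01, size=(n, d))    # small random start
    for _ in range(steps):
        diff = X[:, None, :] - X[None, :, :]               # (n, n, d)
        dist = np.sqrt((diff ** 2).sum(-1)) + np.eye(n)    # guard the diagonal
        # neighbor pairs are pulled toward distance D_ij; others pushed apart
        w = np.where(inN, -2.0 * (D - dist) / dist, -t / dist)
        np.fill_diagonal(w, 0.0)
        X -= lr * (w[:, :, None] * diff).sum(axis=1)
    return X
```

Applied to the Swiss roll, `inN` would be the ε-NN indicator `(D > 0) & (D <= 1)` from step (2), and varying `t` mimics the effect of varying τ in the panels above.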
Figure 11: Swiss Roll Data. 2-D configurations, comparing PCA, MDS, Isomap (with both classical and metric scaling), LLE, and LMDS.
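The two Isomap variants compared in Figure 11 differ only in how the shortest-path distances become coordinates. Classical (Torgerson) scaling does this in closed form by eigendecomposing the double-centered squared distances; a minimal sketch (function name ours):

```python
import numpy as np

def classical_scaling(D, d=2):
    """Classical (Torgerson) scaling of a distance matrix.

    Double-center the squared distances, then use the top-d eigenpairs
    as coordinates. Negative eigenvalues (non-Euclidean input) are
    clipped to zero.
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # inner-product matrix
    vals, vecs = np.linalg.eigh(B)               # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:d]             # take the top d
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))
```

Metric scaling, by contrast, minimizes a stress over the same distances iteratively, which is what allows it to trade off global against local fit.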
Figure 12: Swiss Roll Data. A simulation study of the robustness of the LC meta-criterion. e_D is the noise added to the original 3-D Swiss roll and e_X is the noise added to the intrinsic 2-D structure to generate artificial configurations. [Panels: SD(e_D) = 0, 0.2, 0.4, 0.6, 0.8, 1; horizontal axis: SD(e_X); vertical axis: M_{K′=4}.]
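A single replication of the simulation behind Figure 12 can be sketched as follows. This is our reconstruction under assumptions: the overlap measure below is one plausible reading of M_{K′=4}, the Gaussian noise on the distance matrix is symmetrized (the text does not say how symmetry is maintained), and Section 3's 100-replication averages and coverage intervals are not reproduced.

```python
import numpy as np

def knn_overlap(D_a, D_b, K=4):
    """Mean fractional overlap of the K-NN sets induced by two distance
    matrices (one plausible reading of the LC meta-criterion M_K)."""
    def nn_sets(D):
        Dm = D + np.diag(np.full(D.shape[0], np.inf))  # exclude self
        return [set(np.argsort(Dm[i])[:K]) for i in range(D.shape[0])]
    return np.mean([len(a & b) / K
                    for a, b in zip(nn_sets(D_a), nn_sets(D_b))])

def robustness_replication(sd_D, sd_X, K=4, seed=0):
    """One replication: corrupt the true 5 x 20 grid distances (e_D) and
    the grid itself (e_X), then score the corrupted grid against the
    corrupted distances."""
    rng = np.random.default_rng(seed)
    s, t = np.meshgrid(np.arange(5.0), np.arange(20.0))
    G = np.column_stack([t.ravel(), s.ravel()])        # unit-step grid
    D = np.sqrt(((G[:, None] - G[None]) ** 2).sum(-1))
    E = rng.normal(scale=sd_D, size=D.shape)
    D_noisy = D + (E + E.T) / 2                        # symmetrized e_D
    X_noisy = G + rng.normal(scale=sd_X, size=G.shape) # corrupted grid
    Dx = np.sqrt(((X_noisy[:, None] - X_noisy[None]) ** 2).sum(-1))
    return knn_overlap(D_noisy, Dx, K)
```

With no noise the score is 1 by construction; increasing SD(e_X) at fixed SD(e_D) drives it down, which is the downward trend the panels display.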
Liza Levina Permutation-invariant covariance regularization 1/42 Permutation-invariant regularization of large covariance matrices Liza Levina Department of Statistics University of Michigan Joint work
More informationData dependent operators for the spatial-spectral fusion problem
Data dependent operators for the spatial-spectral fusion problem Wien, December 3, 2012 Joint work with: University of Maryland: J. J. Benedetto, J. A. Dobrosotskaya, T. Doster, K. W. Duke, M. Ehler, A.
More informationLocally Linear Embedded Eigenspace Analysis
Locally Linear Embedded Eigenspace Analysis IFP.TR-LEA.YunFu-Jan.1,2005 Yun Fu and Thomas S. Huang Beckman Institute for Advanced Science and Technology University of Illinois at Urbana-Champaign 405 North
More informationManifold Learning: From Linear to nonlinear. Presenter: Wei-Lun (Harry) Chao Date: April 26 and May 3, 2012 At: AMMAI 2012
Manifold Learning: From Linear to nonlinear Presenter: Wei-Lun (Harry) Chao Date: April 26 and May 3, 2012 At: AMMAI 2012 1 Preview Goal: Dimensionality Classification reduction and clustering Main idea:
More informationOptical Flow-Based Transport on Image Manifolds
Optical Flow-Based Transport on Image Manifolds S. Nagaraj, C. Hegde, A. C. Sankaranarayanan, R. G. Baraniuk ECE Dept., Rice University CSAIL, Massachusetts Inst. Technology ECE Dept., Carnegie Mellon
More informationInformative Laplacian Projection
Informative Laplacian Projection Zhirong Yang and Jorma Laaksonen Department of Information and Computer Science Helsinki University of Technology P.O. Box 5400, FI-02015, TKK, Espoo, Finland {zhirong.yang,jorma.laaksonen}@tkk.fi
More informationMachine Learning. Chao Lan
Machine Learning Chao Lan Clustering and Dimensionality Reduction Clustering Kmeans DBSCAN Gaussian Mixture Model Dimensionality Reduction principal component analysis manifold learning Other Feature Processing
More information(Non-linear) dimensionality reduction. Department of Computer Science, Czech Technical University in Prague
(Non-linear) dimensionality reduction Jiří Kléma Department of Computer Science, Czech Technical University in Prague http://cw.felk.cvut.cz/wiki/courses/a4m33sad/start poutline motivation, task definition,
More informationStatistical Learning. Dong Liu. Dept. EEIS, USTC
Statistical Learning Dong Liu Dept. EEIS, USTC Chapter 6. Unsupervised and Semi-Supervised Learning 1. Unsupervised learning 2. k-means 3. Gaussian mixture model 4. Other approaches to clustering 5. Principle
More informationRegression on Manifolds Using Kernel Dimension Reduction
Jens Nilsson JENSN@MATHS.LTH.SE Centre for Mathematical Sciences, Lund University, Box 118, SE-221 00 Lund, Sweden Fei Sha FEISHA@CS.BERKELEY.EDU Computer Science Division, University of California, Berkeley,
More informationTANGENT SPACE INTRINSIC MANIFOLD REGULARIZATION FOR DATA REPRESENTATION. Shiliang Sun
TANGENT SPACE INTRINSIC MANIFOLD REGULARIZATION FOR DATA REPRESENTATION Shiliang Sun Department o Computer Science and Technology, East China Normal University 500 Dongchuan Road, Shanghai 200241, P. R.
More informationGaussian Process Latent Random Field
Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10) Gaussian Process Latent Random Field Guoqiang Zhong, Wu-Jun Li, Dit-Yan Yeung, Xinwen Hou, Cheng-Lin Liu National Laboratory
More informationMAP Examples. Sargur Srihari
MAP Examples Sargur srihari@cedar.buffalo.edu 1 Potts Model CRF for OCR Topics Image segmentation based on energy minimization 2 Examples of MAP Many interesting examples of MAP inference are instances
More informationSample Complexity of Learning Mahalanobis Distance Metrics. Nakul Verma Janelia, HHMI
Sample Complexity of Learning Mahalanobis Distance Metrics Nakul Verma Janelia, HHMI feature 2 Mahalanobis Metric Learning Comparing observations in feature space: x 1 [sq. Euclidean dist] x 2 (all features
More informationRobust Statistics, Revisited
Robust Statistics, Revisited Ankur Moitra (MIT) joint work with Ilias Diakonikolas, Jerry Li, Gautam Kamath, Daniel Kane and Alistair Stewart CLASSIC PARAMETER ESTIMATION Given samples from an unknown
More informationKernel-Based Contrast Functions for Sufficient Dimension Reduction
Kernel-Based Contrast Functions for Sufficient Dimension Reduction Michael I. Jordan Departments of Statistics and EECS University of California, Berkeley Joint work with Kenji Fukumizu and Francis Bach
More informationLocalized Sliced Inverse Regression
Localized Sliced Inverse Regression Qiang Wu, Sayan Mukherjee Department of Statistical Science Institute for Genome Sciences & Policy Department of Computer Science Duke University, Durham NC 2778-251,
More informationUnsupervised learning: beyond simple clustering and PCA
Unsupervised learning: beyond simple clustering and PCA Liza Rebrova Self organizing maps (SOM) Goal: approximate data points in R p by a low-dimensional manifold Unlike PCA, the manifold does not have
More informationSmart PCA. Yi Zhang Machine Learning Department Carnegie Mellon University
Smart PCA Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Abstract PCA can be smarter and makes more sensible projections. In this paper, we propose smart PCA, an extension
More informationPrincipal Components Analysis. Sargur Srihari University at Buffalo
Principal Components Analysis Sargur Srihari University at Buffalo 1 Topics Projection Pursuit Methods Principal Components Examples of using PCA Graphical use of PCA Multidimensional Scaling Srihari 2
More informationExploiting Sparse Non-Linear Structure in Astronomical Data
Exploiting Sparse Non-Linear Structure in Astronomical Data Ann B. Lee Department of Statistics and Department of Machine Learning, Carnegie Mellon University Joint work with P. Freeman, C. Schafer, and
More informationRiemannian Manifold Learning for Nonlinear Dimensionality Reduction
Riemannian Manifold Learning for Nonlinear Dimensionality Reduction Tony Lin 1,, Hongbin Zha 1, and Sang Uk Lee 2 1 National Laboratory on Machine Perception, Peking University, Beijing 100871, China {lintong,
More informationLearning gradients: prescriptive models
Department of Statistical Science Institute for Genome Sciences & Policy Department of Computer Science Duke University May 11, 2007 Relevant papers Learning Coordinate Covariances via Gradients. Sayan
More informationData Preprocessing Tasks
Data Tasks 1 2 3 Data Reduction 4 We re here. 1 Dimensionality Reduction Dimensionality reduction is a commonly used approach for generating fewer features. Typically used because too many features can
More informationEstimation of Intrinsic Dimensionality Using High-Rate Vector Quantization
Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization Maxim Raginsky and Svetlana Lazebnik Beckman Institute, University of Illinois 4 N Mathews Ave, Urbana, IL {maxim,slazebni}@uiuc.edu
More informationNotion of Distance. Metric Distance Binary Vector Distances Tangent Distance
Notion of Distance Metric Distance Binary Vector Distances Tangent Distance Distance Measures Many pattern recognition/data mining techniques are based on similarity measures between objects e.g., nearest-neighbor
More informationDIDELĖS APIMTIES DUOMENŲ VIZUALI ANALIZĖ
Vilniaus Universitetas Matematikos ir informatikos institutas L I E T U V A INFORMATIKA (09 P) DIDELĖS APIMTIES DUOMENŲ VIZUALI ANALIZĖ Jelena Liutvinavičienė 2017 m. spalis Mokslinė ataskaita MII-DS-09P-17-7
More informationBeyond Scalar Affinities for Network Analysis or Vector Diffusion Maps and the Connection Laplacian
Beyond Scalar Affinities for Network Analysis or Vector Diffusion Maps and the Connection Laplacian Amit Singer Princeton University Department of Mathematics and Program in Applied and Computational Mathematics
More informationInderjit Dhillon The University of Texas at Austin
Inderjit Dhillon The University of Texas at Austin ( Universidad Carlos III de Madrid; 15 th June, 2012) (Based on joint work with J. Brickell, S. Sra, J. Tropp) Introduction 2 / 29 Notion of distance
More informationISOMAP TRACKING WITH PARTICLE FILTER
Clemson University TigerPrints All Theses Theses 5-2007 ISOMAP TRACKING WITH PARTICLE FILTER Nikhil Rane Clemson University, nrane@clemson.edu Follow this and additional works at: https://tigerprints.clemson.edu/all_theses
More informationFreeman (2005) - Graphic Techniques for Exploring Social Network Data
Freeman (2005) - Graphic Techniques for Exploring Social Network Data The analysis of social network data has two main goals: 1. Identify cohesive groups 2. Identify social positions Moreno (1932) was
More informationAnalysis of Spectral Kernel Design based Semi-supervised Learning
Analysis of Spectral Kernel Design based Semi-supervised Learning Tong Zhang IBM T. J. Watson Research Center Yorktown Heights, NY 10598 Rie Kubota Ando IBM T. J. Watson Research Center Yorktown Heights,
More informationQUIZ ON CHAPTER 4 APPLICATIONS OF DERIVATIVES; MATH 150 FALL 2016 KUNIYUKI 105 POINTS TOTAL, BUT 100 POINTS
Math 150 Name: QUIZ ON CHAPTER 4 APPLICATIONS OF DERIVATIVES; MATH 150 FALL 2016 KUNIYUKI 105 POINTS TOTAL, BUT 100 POINTS = 100% Show all work, simplify as appropriate, and use good form and procedure
More informationDimensionality Reduction AShortTutorial
Dimensionality Reduction AShortTutorial Ali Ghodsi Department of Statistics and Actuarial Science University of Waterloo Waterloo, Ontario, Canada, 2006 c Ali Ghodsi, 2006 Contents 1 An Introduction to
More informationMachine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang.
Machine Learning CUNY Graduate Center, Spring 2013 Lectures 11-12: Unsupervised Learning 1 (Clustering: k-means, EM, mixture models) Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning
More informationUnsupervised Transformation Learning via Convex Relaxations
Unsupervised Transformation Learning via Convex Relaxations Tatsunori B. Hashimoto John C. Duchi Percy Liang Stanford University Stanford, CA 94305 {thashim,jduchi,pliang}@cs.stanford.edu Abstract Our
More informationThe Curse of Dimensionality for Local Kernel Machines
The Curse of Dimensionality for Local Kernel Machines Yoshua Bengio, Olivier Delalleau & Nicolas Le Roux April 7th 2005 Yoshua Bengio, Olivier Delalleau & Nicolas Le Roux Snowbird Learning Workshop Perspective
More information