Multidimensional scaling (MDS)

Alyson Scott
5 years ago

1 Multidimensional scaling (MDS)

Just like SOM and principal curves or surfaces, MDS aims to map data points in $\mathbb{R}^p$ to a lower-dimensional coordinate system. However, MDS approaches the problem somewhat differently. Let $x_1, \ldots, x_N \in \mathbb{R}^p$ be observations and let $d_{ii'}$ be the distance between observations $i$ and $i'$. MDS seeks values $z_1, z_2, \ldots, z_N \in \mathbb{R}^k$ that minimize the stress function

$$S_M(z_1, z_2, \ldots, z_N) = \sum_{i \neq i'} \left( d_{ii'} - \|z_i - z_{i'}\| \right)^2 .$$

This is known as least squares or Kruskal-Shepard scaling. Sammon mapping instead minimizes

$$S_{Sm}(z_1, z_2, \ldots, z_N) = \sum_{i \neq i'} \frac{\left( d_{ii'} - \|z_i - z_{i'}\| \right)^2}{d_{ii'}} ,$$

where more emphasis is put on preserving smaller pairwise distances.
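The two stress functions can be evaluated directly for any candidate configuration. The following is a minimal sketch, not an optimizer: the helper names (`pairwise_distances`, `least_squares_stress`, `sammon_stress`) and the random test data are assumptions made for illustration, and the distances are assumed to be Euclidean and strictly positive for distinct points.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance matrix between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def least_squares_stress(d, z):
    """Sum over pairs i != i' of (d_ii' - ||z_i - z_i'||)^2."""
    dz = pairwise_distances(z)
    off_diag = ~np.eye(len(z), dtype=bool)  # exclude the i == i' terms
    return ((d[off_diag] - dz[off_diag]) ** 2).sum()

def sammon_stress(d, z):
    """Same squared errors, each divided by d_ii' (assumes d_ii' > 0)."""
    dz = pairwise_distances(z)
    off_diag = ~np.eye(len(z), dtype=bool)
    return (((d[off_diag] - dz[off_diag]) ** 2) / d[off_diag]).sum()

# Illustrative data: 20 points in R^5 (hypothetical, for demonstration only).
rng = np.random.default_rng(0)
x = rng.normal(size=(20, 5))
d = pairwise_distances(x)

z_full = x.copy()    # k = p: distances are reproduced exactly
z_crude = x[:, :2]   # crude 2-D configuration: keep the first two coordinates

stress_full = least_squares_stress(d, z_full)
stress_crude = least_squares_stress(d, z_crude)
sammon_crude = sammon_stress(d, z_crude)
```

A real MDS fit would minimize one of these functions over `z`, for example by gradient descent; the sketch only shows what is being minimized.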

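2 Classical scaling:

$$S_C(z_1, z_2, \ldots, z_N) = \sum_{i, i'} \left( s_{ii'} - \langle z_i - \bar{z},\, z_{i'} - \bar{z} \rangle \right)^2 ,$$

where $s_{ii'}$ is the similarity between $x_i$ and $x_{i'}$ and is usually defined as the centered inner product $s_{ii'} = \langle x_i - \bar{x},\, x_{i'} - \bar{x} \rangle$.

Shepard-Kruskal nonmetric scaling seeks to minimize

$$S_{NM}(z_1, z_2, \ldots, z_N, \theta) = \frac{\sum_{i \neq i'} \left[ \|z_i - z_{i'}\| - \theta(d_{ii'}) \right]^2}{\sum_{i \neq i'} \|z_i - z_{i'}\|^2}$$

over the $z_i$ and an arbitrary increasing function $\theta$.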

4 Classical scaling with centered inner products is equivalent to principal components. It is not equivalent to least squares scaling, in which the mapping can be nonlinear. Nonmetric scaling effectively uses only the ranks of the distances, rather than the actual dissimilarities or similarities. MDS tries to preserve all pairwise distances, while principal surfaces and SOMs do not. MDS requires only the dissimilarities $d_{ii'}$, in contrast to the SOM and principal curves and surfaces, which need the data points $x_i$.
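The equivalence with principal components can be checked numerically. The sketch below assumes Euclidean distances, in which case double-centering the squared distance matrix recovers the centered inner products, so classical scaling needs only the distances. The function names and the random data are illustrative assumptions.

```python
import numpy as np

def classical_scaling(d, k):
    """Classical scaling from a Euclidean distance matrix `d` alone."""
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centering the squared distances gives the centered inner products.
    s = -0.5 * centering @ (d ** 2) @ centering
    evals, evecs = np.linalg.eigh(s)          # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:k]         # indices of the k largest
    return evecs[:, top] * np.sqrt(np.maximum(evals[top], 0.0))

def pca_scores(x, k):
    """Scores on the first k principal components of the rows of `x`."""
    xc = x - x.mean(axis=0)
    u, sing, _ = np.linalg.svd(xc, full_matrices=False)
    return u[:, :k] * sing[:k]

# Illustrative data with clearly separated variances (hypothetical).
rng = np.random.default_rng(1)
x = rng.normal(size=(30, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
diff = x[:, None, :] - x[None, :, :]
d = np.sqrt((diff ** 2).sum(axis=-1))

z = classical_scaling(d, 2)   # uses distances only
scores = pca_scores(x, 2)     # uses the data points

# Each coordinate is determined only up to sign, so compare both signs.
errors = [
    min(np.linalg.norm(z[:, c] - scores[:, c]),
        np.linalg.norm(z[:, c] + scores[:, c]))
    for c in range(2)
]
```

Under these assumptions the entries of `errors` should be at floating-point level, which illustrates that the two constructions agree up to the sign of each coordinate.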

5 Finding latent variables of multivariate data

Multivariate data are often viewed as multiple indirect measurements arising from an underlying source, which typically cannot be directly measured. Examples include the following: Educational and psychological tests use the answers to questionnaires to measure the underlying intelligence and other mental abilities of subjects. EEG brain scans measure the neuronal activity in various parts of the brain indirectly via electromagnetic signals recorded at sensors placed at various positions on the head. The trading prices of stocks change constantly over time, and reflect various unmeasured factors such as market confidence, external influences, and other driving forces that may be hard to identify or measure.

7 PCA has a latent variable representation

The correlated $X_j$ are each represented as a linear expansion in the uncorrelated, unit-variance variables $S_l$. The problem with PCA latent variables is that they are not unique: any orthogonal transformation of $S_1, \ldots, S_p$ is also uncorrelated with unit variance and satisfies the PCA expansion.
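The non-uniqueness can be seen numerically. The sketch below assumes the linear expansion has the matrix form X = A S, with a hypothetical mixing matrix `a` and latent variables `s` whitened so that their sample covariance is exactly the identity; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 3, 5000

a = rng.normal(size=(p, p))   # hypothetical mixing matrix A
s = rng.normal(size=(p, n))   # latent variables, one row per S_l

# Whiten s so its sample covariance is exactly the identity.
s = s - s.mean(axis=1, keepdims=True)
chol = np.linalg.cholesky(np.cov(s))
s = np.linalg.solve(chol, s)

x = a @ s                     # observed data under the assumed model X = A S

# Any orthogonal matrix R gives another valid representation.
r, _ = np.linalg.qr(rng.normal(size=(p, p)))
s_rot = r @ s                 # rotated latent variables
a_rot = a @ r.T               # compensating mixing matrix
x_rot = a_rot @ s_rot         # reproduces the same data

cov_rot = np.cov(s_rot)       # still the identity: uncorrelated, unit variance
```

Since `x_rot` equals `x` and `cov_rot` is the identity, the data alone cannot distinguish `s` from `s_rot` on the basis of second moments.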

8 Factor analysis

The idea is that the latent variables $S_l$ are common sources of variation amongst the $X_j$ and account for their correlation structure, while the uncorrelated $\varepsilon_j$ are unique to each $X_j$ and pick up the remaining unaccounted variation.

9 Factor analysis faces the same problem as PCA: any orthogonal transformation of $S_1, \ldots, S_p$ is also uncorrelated with unit variance and satisfies the factorization equation. This leaves a certain subjectivity in the use of factor analysis, since the user can search for rotated versions of the factors that are more easily interpretable. This aspect has left many analysts skeptical of factor analysis and may account for its lack of popularity in contemporary statistics.

10 Differences between PCA and factor analysis

Because of the separate disturbances $\varepsilon_j$ for each $X_j$, factor analysis can be seen to be modeling the correlation structure of the $X_j$ rather than the covariance structure, as PCA does.

Example (Exercise 14.15): Generate 200 observations of the three variates $X_1, X_2, X_3$ according to

$$X_1 = Z_1, \qquad X_2 = X_1 + 0.001 \cdot Z_2, \qquad X_3 = 10 \cdot Z_3,$$

where $Z_1, Z_2, Z_3$ are independent standard normal variates. It turns out that the leading principal component aligns itself with the maximal variance direction $X_3$, while the leading factor essentially ignores the uncorrelated component $X_3$ and picks up the correlated component $X_2 + X_1$.
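The PCA half of this example is easy to reproduce. The sketch below assumes the second variate is $X_2 = X_1 + 0.001 \cdot Z_2$ (the coefficient is an assumption where the text is unclear) and uses an arbitrary random seed. It covers only the principal component and the correlation matrix; no factor model is fitted.

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.normal(size=(200, 3))      # independent standard normal variates

x1 = z[:, 0]
x2 = x1 + 0.001 * z[:, 1]          # assumed coefficient 0.001
x3 = 10.0 * z[:, 2]
x = np.column_stack([x1, x2, x3])

# Leading principal component: top eigenvector of the sample covariance.
cov = np.cov(x, rowvar=False)
evals, evecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
leading_pc = evecs[:, -1]

# Correlation structure, which is what factor analysis models.
corr = np.corrcoef(x, rowvar=False)
```

Under these assumptions `leading_pc` should load almost entirely on the third variate, while `corr` shows that the only strong correlation is between the first two, which is the structure a common factor would pick up.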

11 Independent component analysis (ICA)

The ICA model has exactly the same form as the PCA model, except that the $S_l$ are assumed to be statistically independent rather than uncorrelated. Since the multivariate Gaussian distribution is determined by its covariance matrix, any Gaussian independent components can be determined only up to a rotation. ICA therefore seeks $S_l$ that are independent and non-Gaussian. ICA looks for a sequence of orthogonal projections such that the projected data look as far from Gaussian as possible.
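The search for non-Gaussian orthogonal projections can be illustrated in two dimensions, where an orthogonal matrix is just a rotation angle. The sketch below is a toy grid search that uses absolute excess kurtosis as a simple stand-in for departure from Gaussianity. This proxy, the uniform sources, the mixing angle and the function names are all assumptions for illustration, not the method described in the text.

```python
import numpy as np

def excess_kurtosis(y):
    """Sample excess kurtosis; close to zero for Gaussian data."""
    y = (y - y.mean()) / y.std()
    return (y ** 4).mean() - 3.0

def rotation(theta):
    """2-D rotation matrix, i.e. an orthogonal matrix A."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

rng = np.random.default_rng(4)
n = 20000

# Independent, unit-variance, non-Gaussian sources (uniform).
sources = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(2, n))

true_theta = 0.5                        # hypothetical mixing angle
x = rotation(true_theta) @ sources      # X = A S with A orthogonal

# Grid search over candidate rotations: score the components of A^T X.
thetas = np.linspace(0.0, np.pi / 2, 181)
scores = [
    sum(abs(excess_kurtosis(row)) for row in rotation(t).T @ x)
    for t in thetas
]
best_theta = thetas[int(np.argmax(scores))]
```

The angle is only identifiable up to multiples of a quarter turn, since that merely permutes the components and flips signs, which is why the grid covers only $[0, \pi/2]$.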

13 Finding ICA

ICA finds an orthogonal matrix $A$ such that the components of $A^T X$ are as independent as possible. Let $Y = A^T X$ and let $I(Y)$ be the Kullback-Leibler distance between the density $g(y)$ of $Y$ and its independence version $\prod_{j=1}^{p} g_j(y_j)$, where $g_j(y_j)$ is the marginal density of $Y_j$:

$$I(Y) = \sum_{j=1}^{p} H(Y_j) - H(Y),$$

where

$$H(Y) = -\int g(y) \log g(y) \, dy$$

is the entropy of the random variable $Y$ with density $g(y)$.

14 It turns out that

$$I(Y) = \sum_{j=1}^{p} H(Y_j) - H(X).$$

Since $H(X)$ does not depend on $A$, finding $A$ is equivalent to minimizing the sum of the entropies of the separate components of $Y$. A well-known result in information theory says that among all random variables with equal variance, Gaussian variables have the maximum entropy. Therefore, finding $A$ is equivalent to maximizing the departure of the components of $A^T X$ from Gaussianity separately.
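The maximum-entropy property can be illustrated with closed-form differential entropies (in nats) of three unit-variance distributions. The choice of the uniform and Laplace distributions as comparisons is an assumption made for this sketch.

```python
import math

# Gaussian with variance 1: H = 0.5 * log(2 * pi * e)
gaussian_entropy = 0.5 * math.log(2.0 * math.pi * math.e)

# Uniform with variance 1 has width sqrt(12): H = log(width)
uniform_entropy = math.log(math.sqrt(12.0))

# Laplace with variance 1 has scale b = 1 / sqrt(2): H = 1 + log(2 * b)
laplace_entropy = 1.0 + math.log(2.0 / math.sqrt(2.0))

entropies = {
    "gaussian": gaussian_entropy,
    "laplace": laplace_entropy,
    "uniform": uniform_entropy,
}
largest = max(entropies, key=entropies.get)
```

Both non-Gaussian distributions have strictly lower entropy than the Gaussian at the same variance, which is why lowering component entropies pushes the components away from Gaussianity.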

16 Subjects wear a cap embedded with a lattice of 100 EEG electrodes, which record brain activity at different locations on the scalp. Figure (top panel) shows 15 seconds of output from a subset of nine of these electrodes from a subject performing a standard two-back learning task over a 30-minute period. The subject is presented with a letter (B, H, J, C, F, or K) at roughly 1500-ms intervals, and responds by pressing one of two buttons to indicate whether the letter presented is the same as or different from the one presented two steps back. Depending on the answer, the subject earns or loses points, and occasionally earns bonus points or loses penalty points. The time-course data show spatial correlation in the EEG signals: the signals of nearby sensors look very similar.
