Dimensionality Reduction
1 Dimensionality Reduction. Neil D. Lawrence. Mathematics for Data Modelling, University of Sheffield, January 23rd 2008. Neil Lawrence () Dimensionality Reduction Data Modelling School 1 / 7
2 Outline. 1 Motivation 2 Background 3 Distance Matching 4 Distances along the Manifold 5 Model Selection 6 Conclusions
3 Online Resources. All source code and slides are available online. This talk is available from my home page (see the talks link on the left hand side). MATLAB examples are in the dimred toolbox (vrs.1). MATLAB commands used for examples are given in typewriter font.
4 Outline. 1 Motivation 2 Background 3 Distance Matching 4 Distances along the Manifold 5 Model Selection 6 Conclusions
5 High Dimensional Data. USPS Data Set: Handwritten Digit. 3648 dimensions (64 rows by 57 columns). The space contains more than just this digit. Even if we sample every nanosecond from now until the end of the universe, you won't see the original six!
9 Simple Model of Digit: Rotate a Prototype.
18 MATLAB Demo: demdigitsmanifold([1 2], 'all')
20 MATLAB Demo: demdigitsmanifold([1 2], 'sixnine'). (Figure: projection onto PC no 1 versus PC no 2.)
21 Low Dimensional Manifolds. Pure rotation is too simple. In practice the data may undergo several distortions, e.g. digits undergo thinning, translation and rotation. For data with structure: we expect fewer distortions than dimensions; we therefore expect the data to live on a lower dimensional manifold. Conclusion: deal with high dimensional data by looking for a lower dimensional non-linear embedding.
22 Outline. 1 Motivation 2 Background 3 Distance Matching 4 Distances along the Manifold 5 Model Selection 6 Conclusions
23 Notation. q: dimension of latent/embedded space. D: dimension of data space. N: number of data points. Data matrix: Y = [y_{1,:}, ..., y_{N,:}]ᵀ = [y_{:,1}, ..., y_{:,D}] ∈ ℝ^{N×D}. Latent variables: X = [x_{1,:}, ..., x_{N,:}]ᵀ = [x_{:,1}, ..., x_{:,q}] ∈ ℝ^{N×q}. Mapping matrix: W ∈ ℝ^{D×q}. Centering matrix: H = I_N − N⁻¹11ᵀ ∈ ℝ^{N×N}.
24 Reading Notation. a_{i,:} is a vector from the ith row of a given matrix A. a_{:,j} is a vector from the jth column of a given matrix A. X and Y are design matrices. Centred data matrix given by Ŷ = HY. Sample covariance given by S = N⁻¹ŶᵀŶ. Centred inner product matrix given by K = ŶŶᵀ.
25 Outline. 1 Motivation 2 Background 3 Distance Matching 4 Distances along the Manifold 5 Model Selection 6 Conclusions
26 Data Representation. Classical statistical approach: represent via proximities [Mardia, 1972]. Proximity data: similarities or dissimilarities. Example of a dissimilarity matrix: a distance matrix, d_{i,j} = ‖y_{i,:} − y_{j,:}‖₂ = ((y_{i,:} − y_{j,:})ᵀ(y_{i,:} − y_{j,:}))^{1/2}. For a data set we can display these distances as a matrix.
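The pairwise distance d_{i,j} can be vectorised via the expansion ‖y_i − y_j‖² = ‖y_i‖² + ‖y_j‖² − 2 y_iᵀy_j. The talk's own examples use the MATLAB dimred toolbox; the NumPy sketch below is my own translation of the idea:

```python
import numpy as np

def distance_matrix(Y):
    """Pairwise Euclidean distances d_ij = ||y_i,: - y_j,:||_2 between rows of Y."""
    sq = np.sum(Y ** 2, axis=1)                       # squared norm of each row
    D2 = sq[:, None] + sq[None, :] - 2.0 * (Y @ Y.T)  # squared distances
    return np.sqrt(np.maximum(D2, 0.0))               # clip round-off negatives

# three points on a line, so the distances are easy to check by eye
Y = np.array([[0.0], [3.0], [7.0]])
D = distance_matrix(Y)
```

The resulting matrix is symmetric with a zero diagonal, which is what the interpoint-distance figures in the following slides visualise.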
27 Interpoint Distances for Rotated Sixes. Figure: Interpoint distances for the rotated digits data.
28 Multidimensional Scaling. Find a configuration of points, X, such that each δ_{i,j} = ‖x_{i,:} − x_{j,:}‖₂ closely matches the corresponding d_{i,j} in the distance matrix. Need an objective function for matching Δ = (δ_{i,j})_{i,j} to D = (d_{i,j})_{i,j}.
29 Feature Selection. An entrywise L₁ norm on the difference between squared distances: E(X) = Σᵢ₌₁ᴺ Σⱼ₌₁ᴺ |d²_{ij} − δ²_{ij}|. Reduce dimension by selecting features from the data set: select for X, in turn, the column from Y that most reduces this error, until we have the desired q. To minimise E(X) we compose X by extracting the columns of Y which have the largest variance.
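The selection rule above (keep the q most-variable columns) can be sketched in a few lines. This is my own NumPy illustration, not the toolbox code, and the toy data below is invented for the example:

```python
import numpy as np

def select_features(Y, q):
    """Feature selection sketch: keep the q columns of Y with the largest
    sample variance, the columns that minimise the squared-distance
    mismatch E(X) described above."""
    variances = np.var(Y, axis=0)          # per-column sample variance
    idx = np.argsort(variances)[::-1][:q]  # indices of q most variable columns
    return Y[:, idx], idx

# toy data: column 1 has by far the largest variance, column 0 the smallest
rng = np.random.default_rng(0)
Y = np.column_stack([rng.normal(0.0, s, 200) for s in (0.1, 2.0, 0.5)])
X, idx = select_features(Y, 2)
```

With these scales, the selected columns are the high-variance ones (1 and 2), and the near-constant column 0 is discarded.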
30 Reconstruction from Latent Space. Left: distances reconstructed with two dimensions. Right: distances reconstructed with 1 dimensions.
31 Reconstruction from Latent Space. Left: distances reconstructed with 1 dimensions. Right: distances reconstructed with 1 dimensions.
32 Feature Selection. Figure: demrotationdist. Feature selection via distance preservation.
35 Feature Extraction. Figure: demrotationdist. Rotation preserves interpoint distances.
40 Feature Extraction. Figure: demrotationdist. Rotation preserves interpoint distances. Residuals are much reduced.
42 Which Rotation? We need the rotation that will minimise the residual error. We already derived an algorithm: keep the directions of maximum variance and discard the rest. The error is then given by the sum of the residual variances, E(X) = 2N² Σ_{k=q+1}^{D} σ²_k. Rotations of the data matrix do not affect this analysis.
43 Rotation Reconstruction from Latent Space. Left: distances reconstructed with two dimensions. Right: distances reconstructed with 1 dimensions.
44 Rotation Reconstruction from Latent Space. Left: distances reconstructed with 1 dimensions. Right: distances reconstructed with 36 dimensions.
45 Reminder: Principal Component Analysis. How do we find these directions? Find the directions in the data with maximal variance. That's what PCA does! PCA: rotate the data to extract these directions. PCA: work on the sample covariance matrix S = N⁻¹ŶᵀŶ.
46 Principal Component Analysis. Find a direction in the data, x_{:,1} = Ŷr₁, for which the variance is maximised: r₁ = argmax_{r₁} var(Ŷr₁) subject to r₁ᵀr₁ = 1. Can rewrite in terms of the sample covariance: var(x_{:,1}) = N⁻¹(Ŷr₁)ᵀ(Ŷr₁) = r₁ᵀ(N⁻¹ŶᵀŶ)r₁ = r₁ᵀSr₁, where S is the sample covariance.
47 Lagrangian. Solution via constrained optimisation: L(r₁, λ₁) = r₁ᵀSr₁ + λ₁(1 − r₁ᵀr₁). Gradient with respect to r₁: dL(r₁, λ₁)/dr₁ = 2Sr₁ − 2λ₁r₁. Setting this to zero and rearranging gives Sr₁ = λ₁r₁, which is recognised as an eigenvalue problem.
48 Lagrange Multiplier. Recall the gradient, dL(r₁, λ₁)/dr₁ = 2Sr₁ − 2λ₁r₁. (1) To find λ₁, premultiply (1) by r₁ᵀ, set to zero and rearrange, giving λ₁ = r₁ᵀSr₁. The maximum variance is therefore necessarily the maximum eigenvalue of S. This is the first principal component.
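This result (variance of the projection onto the top eigenvector equals the top eigenvalue of S) is easy to check numerically. A NumPy sketch of my own, on invented anisotropic toy data:

```python
import numpy as np

# anisotropic toy data: most of the variance lies along the first axis
rng = np.random.default_rng(1)
Y = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
Yhat = Y - Y.mean(axis=0)            # centred data, Yhat = H Y
S = Yhat.T @ Yhat / len(Y)           # sample covariance S = N^{-1} Yhat^T Yhat
eigval, eigvec = np.linalg.eigh(S)   # eigh: eigenvalues in ascending order
lam1 = eigval[-1]                    # largest eigenvalue
r1 = eigvec[:, -1]                   # first principal direction
proj_var = float(r1 @ S @ r1)        # variance of the projection, r1^T S r1
```

Here `proj_var` matches `lam1`, and the recovered direction is dominated by the first axis, as the scaling of the toy data dictates.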
49 Further Directions. Find directions, orthogonal to those already extracted, with maximal variance. Orthogonality constraints: for j < k we have r_jᵀr_k = 0 and r_kᵀr_k = 1. Lagrangian: L(r_k, λ_k, γ) = r_kᵀSr_k + λ_k(1 − r_kᵀr_k) + Σ_{j=1}^{k−1} γ_j r_jᵀr_k, with gradient dL(r_k, λ_k)/dr_k = 2Sr_k − 2λ_k r_k + Σ_{j=1}^{k−1} γ_j r_j.
50 Further Eigenvectors. Gradient of the Lagrangian: dL(r_k, λ_k)/dr_k = 2Sr_k − 2λ_k r_k + Σ_{j=1}^{k−1} γ_j r_j. (2) Premultiplying (2) by r_iᵀ with i < k (and setting to zero) implies γ_i = 0, which allows us to write Sr_k = λ_k r_k. Premultiplying (2) by r_kᵀ implies λ_k = r_kᵀSr_k. This is the kth principal component.
51 Principal Coordinates Analysis. The rotation which finds the directions of maximum variance is given by the eigenvectors of the covariance matrix; the variance in each direction is given by the eigenvalues. Problem: working directly with the sample covariance, S, may be impossible. For example: perhaps we are given distances between data points, but not absolute locations. With no access to absolute positions, we cannot compute the original sample covariance.
52 An Alternative Formalism. Matrix representation of the eigenvalue problem for the first q eigenvectors: ŶᵀŶR_q = R_qΛ_q, with R_q ∈ ℝ^{D×q}. (3) Premultiply by Ŷ: ŶŶᵀŶR_q = ŶR_qΛ_q. Postmultiply by Λ_q^{−1/2}: ŶŶᵀŶR_qΛ_q^{−1/2} = ŶR_qΛ_q^{−1/2}Λ_q, so that ŶŶᵀU_q = U_qΛ_q, where U_q = ŶR_qΛ_q^{−1/2}.
55 U_q Diagonalizes the Inner Product Matrix. Need to prove that U_q are eigenvectors of the inner product matrix. Substituting the definition of U_q: U_qᵀŶŶᵀU_q = Λ_q^{−1/2}R_qᵀ(ŶᵀŶ)²R_qΛ_q^{−1/2}. The full eigendecomposition of the sample covariance, ŶᵀŶ = RΛRᵀ, implies that (ŶᵀŶ)² = RΛRᵀRΛRᵀ = RΛ²Rᵀ. The product of the first q eigenvectors with the rest is RᵀR_q = [I_q; 0] ∈ ℝ^{D×q}, where I_q denotes a q×q identity matrix. Premultiplying by the eigenvalues gives ΛRᵀR_q = [Λ_q; 0], and multiplying this by its own transpose gives R_qᵀRΛ²RᵀR_q = Λ_q². Therefore U_qᵀŶŶᵀU_q = Λ_q^{−1/2}Λ_q²Λ_q^{−1/2} = Λ_q, i.e. ŶŶᵀU_q = U_qΛ_q.
68 Equivalent Eigenvalue Problems. The two eigenvalue problems are equivalent: one solves for the rotation, the other for the location of the rotated points. When D < N it is easier to solve for the rotation, R_q. But when D > N we solve for the embedding (principal coordinate analysis). In MDS we may not know Y, so we cannot compute ŶᵀŶ from the distance matrix. Can we compute ŶŶᵀ instead?
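The equivalence is easy to verify numerically: the D×D matrix ŶᵀŶ and the N×N matrix ŶŶᵀ share their non-zero eigenvalues. A small NumPy sketch of my own, on random toy data:

```python
import numpy as np

# the two eigenvalue problems share their non-zero eigenvalues:
# Yhat^T Yhat (D x D, rotation) versus Yhat Yhat^T (N x N, embedding)
rng = np.random.default_rng(2)
Y = rng.normal(size=(5, 3))          # N = 5 points in D = 3 dimensions
Yhat = Y - Y.mean(axis=0)            # centred data
ev_cov = np.sort(np.linalg.eigvalsh(Yhat.T @ Yhat))[::-1]    # D eigenvalues
ev_inner = np.sort(np.linalg.eigvalsh(Yhat @ Yhat.T))[::-1]  # N eigenvalues
```

The top D eigenvalues of the inner product matrix match those of the covariance-side matrix; the remaining N − D are (numerically) zero.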
69 The Covariance Interpretation. N⁻¹ŶᵀŶ is the data covariance. ŶŶᵀ is a centred inner product matrix; it also has an interpretation as a covariance matrix (Gaussian processes). It expresses correlation and anti-correlation between data points, where the standard covariance expresses correlation and anti-correlation between data dimensions.
70 Distance to Similarity: A Gaussian Covariance Interpretation. Translate between covariance and distance. Consider a vector sampled from a zero mean Gaussian distribution, z ∼ N(0, K). The expected square distance between two elements of this vector is d²_{i,j} = ⟨(z_i − z_j)²⟩ = ⟨z_i²⟩ + ⟨z_j²⟩ − 2⟨z_i z_j⟩; under a zero mean Gaussian with covariance given by K this is d²_{i,j} = k_{i,i} + k_{j,j} − 2k_{i,j}. Take the distance to be the square root of this, d_{i,j} = (k_{i,i} + k_{j,j} − 2k_{i,j})^{1/2}.
71 Standard Transformation. This transformation is known as the standard transformation between a similarity and a distance [Mardia et al., 1979, pg 42]. If the covariance is of the form K = ŶŶᵀ then k_{i,j} = y_{i,:}ᵀy_{j,:} and d_{i,j} = (y_{i,:}ᵀy_{i,:} + y_{j,:}ᵀy_{j,:} − 2y_{i,:}ᵀy_{j,:})^{1/2} = ‖y_{i,:} − y_{j,:}‖₂. For other distance matrices this gives us an approach to convert to a similarity matrix, or kernel matrix, so that we can perform classical MDS.
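Going from distances back to a similarity and then to an embedding is classical MDS. The sketch below is my own NumPy version, using the usual double-centring construction K = −(1/2)HD²H (the standard transformation run in reverse) followed by the eigendecomposition; the four-point configuration is invented for the example:

```python
import numpy as np

def classical_mds(D, q):
    """Classical MDS sketch: distances -> centred similarity -> embedding
    X = U_q Lambda_q^{1/2} from the top-q eigenvectors."""
    N = D.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N   # centering matrix
    K = -0.5 * H @ (D ** 2) @ H           # centred inner-product matrix
    eigval, eigvec = np.linalg.eigh(K)
    order = np.argsort(eigval)[::-1][:q]  # largest eigenvalues first
    lam = np.maximum(eigval[order], 0.0)  # beware negative eigenvalues
    return eigvec[:, order] * np.sqrt(lam)

# points on a plane; their Euclidean distance matrix is reproduced exactly
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
D = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
X = classical_mds(D, 2)
Dx = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
```

For genuinely Euclidean distances the embedding reproduces the distance matrix exactly (up to rotation and reflection of the configuration), which is the PCA equivalence discussed later.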
72 Example: Road Distances with Classical MDS. Classical example: redraw a map from road distances (see e.g. Mardia et al., 1979). Here we use distances across Europe: between each pair of cities we have the road distance. Enter these in a distance matrix, convert to a similarity matrix using the covariance interpretation, and perform the eigendecomposition. (See the talk's online resources for the data we used.)
73 Distance Matrix. Convert distances to similarities using the covariance interpretation. Figure: Left: road distances between European cities visualised as a matrix. Right: similarity matrix derived from these distances. If this matrix is a covariance matrix, then the expected distance between samples from this covariance is given on the left.
74 Example: Road Distances with Classical MDS. Figure: demcmdsroaddata. Reconstructed locations projected onto the true map using Procrustes rotations. (Map of European cities from Lisboa and Dublin across to Moskva and Istanbul.)
75 Beware Negative Eigenvalues. Figure: Eigenvalues of the similarity matrix are negative in this case.
76 European Cities Distance Matrices. Figure: Left: the original distance matrix. Right: the reconstructed distance matrix.
77 Other Distance Similarity Measures. Can use a similarity/distance of your choice. Beware though! The similarity must be positive semi-definite for the distance to be Euclidean. Why? We can immediately see that positive definiteness is sufficient from the covariance interpretation. For more details see [Mardia et al., 1979, Theorem ].
78 A Class of Similarities for Vector Data. All Mercer kernels are positive semi-definite. Example: the squared exponential (also known as RBF or Gaussian) kernel, k_{i,j} = exp(−‖y_{i,:} − y_{j,:}‖² / (2ℓ²)). This leads to a kernel eigenvalue problem. This is known as kernel PCA (Schölkopf et al.).
79 Implied Distance Matrix. What is the equivalent distance d_{i,j}? d_{i,j} = (k_{i,i} + k_{j,j} − 2k_{i,j})^{1/2}. If the point separation is large, k_{i,j} → 0. Since k_{i,i} = 1 and k_{j,j} = 1, d_{i,j} → √2. Kernel PCA with the RBF kernel projects along axes and can produce poor results: it uses many dimensions to keep dissimilar objects a constant amount apart.
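The saturation of the implied distance is easy to see numerically: since k_{i,i} = k_{j,j} = 1 for the RBF kernel and k_{i,j} → 0 for well-separated points, d_{i,j} → √2. A tiny NumPy sketch of my own:

```python
import numpy as np

def rbf(yi, yj, ell=1.0):
    """Squared exponential (RBF) kernel k_ij = exp(-||yi - yj||^2 / (2 ell^2))."""
    return np.exp(-np.sum((yi - yj) ** 2) / (2.0 * ell ** 2))

def implied_distance(yi, yj, ell=1.0):
    """d_ij = sqrt(k_ii + k_jj - 2 k_ij); note k_ii = k_jj = 1 for the RBF kernel."""
    return np.sqrt(rbf(yi, yi, ell) + rbf(yj, yj, ell) - 2.0 * rbf(yi, yj, ell))

near = implied_distance(np.array([0.0]), np.array([0.1]))   # similar points
far = implied_distance(np.array([0.0]), np.array([100.0]))  # well separated
```

Nearby points get a small implied distance; any pair of well-separated points sits at essentially √2 regardless of how far apart they actually are, which is the behaviour the following figures illustrate.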
80 Implied Distances on Rotated Sixes. Figure: Left: similarity matrix for the RBF kernel on the rotated sixes. Right: implied distance matrix for the kernel on the rotated sixes. Note that most of the distances are set to √2.
81 Kernel PCA on Rotated Sixes. Figure: demsixkpca. The fifth, sixth and seventh dimensions of the latent space for kernel PCA. Points spread out along axes so that dissimilar points are always √2 apart.
87 MDS Conclusions. Multidimensional scaling: preserve a distance matrix. Classical MDS: a particular objective function, for which distance matching is equivalent to maximum variance, i.e. spectral decomposition of the similarity matrix. For Euclidean distances in Y space, classical MDS is equivalent to PCA; this is known as principal coordinate analysis (PCO). We haven't discussed the choice of distance matrix.
88 Outline. 1 Motivation 2 Background 3 Distance Matching 4 Distances along the Manifold 5 Model Selection 6 Conclusions
89 Data. (a) plane (b) swissroll (c) trefoil. Figure: Illustrative data sets for the talk. Each data set is generated by calling generatemanifolddata(datatype). The datatype argument is given below each plot.
90 Isomap [Tenenbaum et al., 2000]. MDS finds a geometric configuration preserving distances. Isomap is MDS applied to the manifold distance. Geodesic distance = manifold distance. We cannot compute the geodesic distance without knowing the manifold.
91 Isomap. Isomap: define neighbours and compute distances between neighbours. The geodesic distance is approximated by the shortest path through the adjacency matrix. (Figure: panels reproduced from Tenenbaum et al., Science, December 2000: the residual variance of PCA, MDS and Isomap on four data sets, and the Swiss roll data set illustrating how Isomap exploits geodesic paths for nonlinear dimensionality reduction.)
92 Isomap Examples. demisomap. Data generation: Carl Henrik Ek.
95 Isomap: Summary. MDS on a shortest-path approximation of the manifold distance. + Simple. + Intrinsic dimension from the eigenspectrum. − Solves a very large eigenvalue problem. − Cannot handle holes or non-convex manifolds. − Sensitive to short circuits.
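The two Isomap steps (neighbourhood graph, then MDS on graph shortest paths) can be sketched compactly. This is my own NumPy illustration of the idea, not the demisomap code; it uses a Floyd-Warshall shortest-path loop and a circular-arc toy data set invented for the example:

```python
import numpy as np

def isomap(Y, k, q):
    """Isomap sketch: k-nearest-neighbour graph, Floyd-Warshall shortest
    paths as the geodesic approximation, then classical MDS."""
    N = Y.shape[0]
    D = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
    G = np.full((N, N), np.inf)           # graph distances; inf = no edge
    np.fill_diagonal(G, 0.0)
    for i in range(N):
        nbrs = np.argsort(D[i])[1:k + 1]  # k nearest neighbours of point i
        G[i, nbrs] = D[i, nbrs]
        G[nbrs, i] = D[i, nbrs]           # symmetrise the graph
    for m in range(N):                    # Floyd-Warshall shortest paths
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    H = np.eye(N) - np.ones((N, N)) / N   # classical MDS on geodesics
    K = -0.5 * H @ (G ** 2) @ H
    eigval, eigvec = np.linalg.eigh(K)
    order = np.argsort(eigval)[::-1][:q]
    return eigvec[:, order] * np.sqrt(np.maximum(eigval[order], 0.0))

# noiseless arc of a circle: the geodesic (arc length) unrolls to one dimension
t = np.linspace(0.0, np.pi, 60)
Y = np.column_stack([np.cos(t), np.sin(t)])
X = isomap(Y, k=3, q=1)
```

On this arc, the one-dimensional embedding tracks the arc-length parameter t (up to sign), whereas straight-line distances through the ambient space would not. The Floyd-Warshall loop is O(N³), which reflects the "very large eigenvalue/graph problem" caveat above.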
96 Inverse Covariance. From the covariance interpretation we think of the similarity matrix as a covariance. Each element of the covariance is a function of two data points. Another option is to specify the inverse covariance: if the inverse covariance between two points is zero, those points are independent given all other points. This is a conditional independence; it describes how points are connected. Laplacian Eigenmaps and LLE can both be seen as specifying the inverse covariance.
97 LLE Examples. demlle. Neighbours used; no playing with settings.
100 Generative. Observed data have been sampled from the manifold. Spectral methods start at the wrong end: it's a lot easier to make a mess than to clean it up! Things break or disappear. How do we model observation generation?
104 Outline 1 Motivation 2 Background 3 Distance Matching 4 Distances along the Manifold 5 Model Selection 6 Conclusions Neil Lawrence () Dimensionality Reduction Data Modelling School 56 / 70
105 Model Selection Observed data have been sampled from a low-dimensional manifold, y = f(x). Idea: model f and rank embeddings according to 1 the data fit of f 2 the complexity of f. How to model f? 1 Make as few assumptions about f as possible. 2 Allow f from as rich a class as possible. Neil Lawrence () Dimensionality Reduction Data Modelling School 57 / 70
106 Gaussian Processes Generalisation of the Gaussian distribution over infinite index sets. Can be used to specify distributions over functions. Regression: $y = f(x) + \epsilon$, $p(\mathbf{y}|\mathbf{X}, \Phi) = \int p(\mathbf{y}|\mathbf{f}, \mathbf{X}, \Phi)\, p(\mathbf{f}|\mathbf{X}, \Phi)\, \mathrm{d}\mathbf{f}$, $p(\mathbf{f}|\mathbf{X}, \Phi) = \mathcal{N}(\mathbf{0}, \mathbf{K})$, $\hat{\Phi} = \operatorname{argmax}_{\Phi}\, p(\mathbf{y}|\mathbf{X}, \Phi)$. Neil Lawrence () Dimensionality Reduction Data Modelling School 58 / 70
107 Gaussian Processes: log likelihood as a function of the length scale. $\log p(\mathbf{Y}|\mathbf{X}) = \underbrace{-\tfrac{1}{2}\mathbf{Y}^{\top}(\mathbf{K} + \beta^{-1}\mathbf{I})^{-1}\mathbf{Y}}_{\text{data fit}} \underbrace{-\tfrac{1}{2}\log\det(\mathbf{K} + \beta^{-1}\mathbf{I})}_{\text{complexity}} - \tfrac{N}{2}\log 2\pi$. [Plots omitted.] Images: N. D. Lawrence. Neil Lawrence () Dimensionality Reduction Data Modelling School 59 / 70
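The data-fit/complexity split in the log likelihood above can be computed directly. A Python/NumPy sketch, assuming an RBF kernel with a single output and the function name chosen here for illustration:

```python
import numpy as np

def gp_log_marginal(y, X, lengthscale=1.0, variance=1.0, beta=100.0):
    """GP log marginal likelihood, split into the slide's data-fit and
    complexity terms (RBF kernel assumed; beta is the noise precision)."""
    N = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = variance * np.exp(-0.5 * sq / lengthscale ** 2)
    C = K + np.eye(N) / beta
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(C, y)
    data_fit = -0.5 * float(y @ alpha)                       # -1/2 y^T C^{-1} y
    complexity = -np.sum(np.log(np.diag(L))) \
                 - 0.5 * N * np.log(2 * np.pi)               # -1/2 log det C - N/2 log 2pi
    return data_fit, complexity
```

Sweeping the length scale and plotting the two terms reproduces the trade-off in the slide's figures: a short length scale improves the data fit but is penalised by the complexity term.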
116 Gaussian Process Latent Variable Models The GP-LVM models the sampling process: $y = f(x) + \epsilon$, $p(\mathbf{Y}|\mathbf{X}, \Phi) = \int p(\mathbf{Y}|\mathbf{f}, \mathbf{X}, \Phi)\, p(\mathbf{f}|\mathbf{X}, \Phi)\, \mathrm{d}\mathbf{f}$, $p(\mathbf{f}|\mathbf{X}, \Phi) = \mathcal{N}(\mathbf{0}, \mathbf{K})$, $\{\hat{\mathbf{X}}, \hat{\Phi}\} = \operatorname{argmax}_{\mathbf{X}, \Phi}\, p(\mathbf{Y}|\mathbf{X}, \Phi)$. Linear kernel: closed-form solution. Non-linear kernel: gradient-based solution. Neil Lawrence () Dimensionality Reduction Data Modelling School 60 / 70
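The gradient-based solution for the non-linear case can be sketched by optimising the latent points against the GP marginal likelihood. A toy Python/NumPy version with an RBF kernel, unit hyperparameters, and numerical gradients; a real implementation would also optimise the kernel parameters, use analytic gradients, and (as the next slide suggests) initialise with a spectral embedding rather than at random:

```python
import numpy as np
from scipy.optimize import minimize

def gplvm_objective(xflat, Y, q, beta=50.0):
    """Negative GP-LVM log likelihood (RBF kernel, unit hyperparameters)."""
    N, D = Y.shape
    X = xflat.reshape(N, q)
    sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    C = np.exp(-0.5 * sq) + np.eye(N) / beta
    _, logdet = np.linalg.slogdet(C)
    Cinv_Y = np.linalg.solve(C, Y)
    ll = -0.5 * np.sum(Y * Cinv_Y) - 0.5 * D * logdet \
         - 0.5 * N * D * np.log(2 * np.pi)
    return -ll

def fit_gplvm(Y, q=2, iters=30, seed=0):
    """Maximise the marginal likelihood over X with L-BFGS."""
    N = Y.shape[0]
    X0 = np.random.default_rng(seed).normal(scale=0.1, size=(N, q))
    res = minimize(gplvm_objective, X0.ravel(), args=(Y, q),
                   method='L-BFGS-B', options={'maxiter': iters})
    return res.x.reshape(N, q), -res.fun
```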
117 Model Selection Lawrence (2003) suggested using spectral algorithms to initialise the latent space X. Harmeling (2007) evaluated the use of the GP-LVM objective for model selection. Comparisons between the Procrustes score against ground truth and the GP-LVM objective. Neil Lawrence () Dimensionality Reduction Data Modelling School 61 / 70
118-126 Model Selection: Results [Plots omitted: true embeddings and GP-LVM objective values alongside Isomap, MVU, LLE and Laplacian Eigenmaps embeddings on several data sets.] Model selection results kindly provided by Carl Henrik Ek. Neil Lawrence () Dimensionality Reduction Data Modelling School 62 / 70
127 Conclusion Spectral methods assume that local structure contains enough characteristics to unravel the global structure. + Intuitive - Hard to set parameters without knowing the manifold - Learn embeddings, not mappings, i.e. visualisations - Model the problem the wrong way around - Sensitive to noise + Currently the best strategy for initialising generative models Neil Lawrence () Dimensionality Reduction Data Modelling School 63 / 70
128 References I K. V. Mardia. Statistics of Directional Data. Academic Press, London, 1972. K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic Press, London, 1979. B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998. J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000. doi: 10.1126/science.290.5500.2319. Neil Lawrence () Dimensionality Reduction Data Modelling School 64 / 70
129 Material Acknowledgement: Carl Henrik Ek for the GP log likelihood examples. My examples given here and this talk are available online. Neil Lawrence () Dimensionality Reduction Data Modelling School 65 / 70
130 Outline: Distance Matching (appendix) Neil Lawrence () Dimensionality Reduction Data Modelling School 66 / 70
131 Centering Matrix If $\hat{\mathbf{Y}}$ is a version of $\mathbf{Y}$ with the mean removed, then $\hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}$: $\hat{\mathbf{Y}} = \left(\mathbf{I} - N^{-1}\mathbf{1}\mathbf{1}^{\top}\right)\mathbf{Y} = \mathbf{Y} - \mathbf{1}\left(N^{-1}\mathbf{1}^{\top}\mathbf{Y}\right) = \mathbf{Y} - \mathbf{1}\left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{y}_{i,:}\right)^{\top} = \mathbf{Y} - \begin{bmatrix} \bar{\mathbf{y}}_{:}^{\top} \\ \vdots \\ \bar{\mathbf{y}}_{:}^{\top} \end{bmatrix}$. Neil Lawrence () Dimensionality Reduction Data Modelling School 67 / 70
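The identity above is easy to verify numerically: multiplying by the centering matrix subtracts the column means. A short Python/NumPy check:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 3))
N = Y.shape[0]

H = np.eye(N) - np.ones((N, N)) / N   # centering matrix H = I - 11^T / N
Yhat = H @ Y

# HY subtracts the column means of Y, so the centred data has zero mean
assert np.allclose(Yhat, Y - Y.mean(axis=0))
assert np.allclose(Yhat.mean(axis=0), 0.0)
# H is also idempotent: centring twice changes nothing
assert np.allclose(H @ H, H)
```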
132 Feature Selection Derivation The squared distance can be re-expressed as $d_{ij}^{2} = \sum_{k=1}^{D}(y_{i,k} - y_{j,k})^{2}$. We can re-order the columns of $\mathbf{Y}$ without affecting the distances. Choose the ordering so that the first $q$ columns of $\mathbf{Y}$ are those that best represent the distance matrix. Substitute $\mathbf{x}_{:,k} = \mathbf{y}_{:,k}$ for $k = 1, \dots, q$. The distance in the latent space is then $\delta_{ij}^{2} = \sum_{k=1}^{q}(x_{i,k} - x_{j,k})^{2} = \sum_{k=1}^{q}(y_{i,k} - y_{j,k})^{2}$. Neil Lawrence () Dimensionality Reduction Data Modelling School 68 / 70
133 Feature Selection Derivation II The error can be rewritten as $E(\mathbf{X}) = \sum_{i=1}^{N}\sum_{j=1}^{N}\left(d_{ij}^{2} - \delta_{ij}^{2}\right) = \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{k=q+1}^{D}(y_{i,k} - y_{j,k})^{2}$. Introducing the mean of each dimension, $\bar{y}_{k} = \frac{1}{N}\sum_{i=1}^{N} y_{i,k}$, gives $E(\mathbf{X}) = \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{k=q+1}^{D}\left((y_{i,k} - \bar{y}_{k}) - (y_{j,k} - \bar{y}_{k})\right)^{2}$. Expanding the brackets: $E(\mathbf{X}) = \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{k=q+1}^{D}\left[(y_{i,k} - \bar{y}_{k})^{2} + (y_{j,k} - \bar{y}_{k})^{2} - 2(y_{j,k} - \bar{y}_{k})(y_{i,k} - \bar{y}_{k})\right]$. Neil Lawrence () Dimensionality Reduction Data Modelling School 69 / 70
134 Feature Selection Derivation III Bringing the sums in: $E(\mathbf{X}) = \sum_{k=q+1}^{D}\left[N\sum_{i=1}^{N}(y_{i,k} - \bar{y}_{k})^{2} + N\sum_{j=1}^{N}(y_{j,k} - \bar{y}_{k})^{2} - 2\sum_{j=1}^{N}(y_{j,k} - \bar{y}_{k})\sum_{i=1}^{N}(y_{i,k} - \bar{y}_{k})\right]$. The cross term vanishes (each of its factors sums to zero), and we recognise the sum of the variances of the discarded columns of $\mathbf{Y}$: $E(\mathbf{X}) = 2N^{2}\sum_{k=q+1}^{D}\sigma_{k}^{2}$. We should therefore compose $\mathbf{X}$ by extracting the columns of $\mathbf{Y}$ with the largest variance. Neil Lawrence () Dimensionality Reduction Data Modelling School 70 / 70
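The final identity, $E(\mathbf{X}) = 2N^{2}\sum_{k>q}\sigma_{k}^{2}$, can be verified directly on random data; a small Python/NumPy check (biased, $1/N$, variances, matching the derivation):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, q = 50, 5, 2
# columns with decreasing scale, so the first q carry most variance
Y = rng.normal(size=(N, D)) * np.array([3.0, 2.0, 1.5, 1.0, 0.5])

# pairwise squared distances in the full space and in the first q columns
d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
delta2 = np.sum((Y[:, None, :q] - Y[None, :, :q]) ** 2, axis=-1)
E = np.sum(d2 - delta2)

# biased (1/N) variances of the discarded columns
sigma2 = Y[:, q:].var(axis=0)
assert np.allclose(E, 2 * N**2 * sigma2.sum())
```

This confirms the conclusion of the derivation: the distance-matching error depends only on the total variance of the columns we throw away.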
Kernel Methods Machine Learning A 708.063 07W VO Outline 1. Dual representation 2. The kernel concept 3. Properties of kernels 4. Examples of kernel machines Kernel PCA Support vector regression (Relevance
More informationLinear Regression and Its Applications
Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start
More informationCS 143 Linear Algebra Review
CS 143 Linear Algebra Review Stefan Roth September 29, 2003 Introductory Remarks This review does not aim at mathematical rigor very much, but instead at ease of understanding and conciseness. Please see
More informationCS168: The Modern Algorithmic Toolbox Lecture #8: PCA and the Power Iteration Method
CS168: The Modern Algorithmic Toolbox Lecture #8: PCA and the Power Iteration Method Tim Roughgarden & Gregory Valiant April 15, 015 This lecture began with an extended recap of Lecture 7. Recall that
More informationMachine Learning - MT Clustering
Machine Learning - MT 2016 15. Clustering Varun Kanade University of Oxford November 28, 2016 Announcements No new practical this week All practicals must be signed off in sessions this week Firm Deadline:
More informationSPECTRAL CLUSTERING AND KERNEL PRINCIPAL COMPONENT ANALYSIS ARE PURSUING GOOD PROJECTIONS
SPECTRAL CLUSTERING AND KERNEL PRINCIPAL COMPONENT ANALYSIS ARE PURSUING GOOD PROJECTIONS VIKAS CHANDRAKANT RAYKAR DECEMBER 5, 24 Abstract. We interpret spectral clustering algorithms in the light of unsupervised
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen PCA. Tobias Scheffer
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen PCA Tobias Scheffer Overview Principal Component Analysis (PCA) Kernel-PCA Fisher Linear Discriminant Analysis t-sne 2 PCA: Motivation
More informationMachine Learning. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395
Machine Learning Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1395 1 / 47 Table of contents 1 Introduction
More informationNeuroscience Introduction
Neuroscience Introduction The brain As humans, we can identify galaxies light years away, we can study particles smaller than an atom. But we still haven t unlocked the mystery of the three pounds of matter
More informationGWAS V: Gaussian processes
GWAS V: Gaussian processes Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS V: Gaussian processes Summer 2011
More informationPRINCIPAL COMPONENTS ANALYSIS
121 CHAPTER 11 PRINCIPAL COMPONENTS ANALYSIS We now have the tools necessary to discuss one of the most important concepts in mathematical statistics: Principal Components Analysis (PCA). PCA involves
More informationStructure in Data. A major objective in data analysis is to identify interesting features or structure in the data.
Structure in Data A major objective in data analysis is to identify interesting features or structure in the data. The graphical methods are very useful in discovering structure. There are basically two
More information